Scraping HTML Tables with VBA

DeonM · Oct 22, 2020

I am scraping HTML tables with a GET request. The req.responseText is then passed to the module below to process the HTML tables.

My code works 100% but I'm trying to optimise it. It is a large daily crawl. I have 2 questions:

1. In the section where I'm processing the "className = "views-field views-field-field-date-published", I don't want the innerText of the TableCell, I want the innerText of the ...content="2020-10-22T00:00:00+02:00". At the moment I'm processing the innerHTML with a custom function, which works fine. But is there a way to access the content="2020-10-22T00:00:00+02:00" innerText itself, similar to what I'm doing for the href in the TableCell above?

2. I need to do different things with the content of each TableCell, therefore I've resorted to the multiple If statements based on TableCell.className. But it does require a lot of extra looping. Is there a more efficient way to get right to the applicable section?

Thanks in advance for any help.

VBA Code:

Option Explicit

Sub ProcessHTMLTables(HTMLString As String)

    Dim HTMLDoc As New MSHTML.HTMLDocument
    Dim HTMLTable As MSHTML.IHTMLElement
    Dim TableRow As MSHTML.IHTMLElement
    Dim TableCell As MSHTML.IHTMLElement
    Dim Category As String, URL As String, Description As String, PDate As String, ClDate As String, BDate As String
   
    HTMLDoc.body.innerHTML = HTMLString

    Set HTMLTable = HTMLDoc.getElementsByTagName("table")(0)
        'Debug.Print HTMLTable.innerHTML
        If HTMLTable Is Nothing Then
            Exit Sub
        End If

    For Each TableRow In HTMLTable.getElementsByTagName("tr")
        'Debug.Print TableRow.innerHTML

        For Each TableCell In TableRow.getElementsByTagName("td")
               
            If TableCell.className = "views-field views-field-field-category" Then
                Category = TableCell.innerText

            ElseIf TableCell.className = "views-field views-field-title" Then
                URL = "https://****.com" & Mid(TableCell.getElementsByTagName("a")(0).href, 7)
                Description = TableCell.innerText
             
            ElseIf TableCell.className = "views-field views-field-field-date-published" Then
                Debug.Print TableCell.innerHTML 'Returns: <span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="2020-10-22T00:00:00+02:00">22/10/2020</span>
                Debug.Print TableCell.innerText 'Returns: 22/10/2020
                PDate = ExtractDateValue(TableCell.innerHTML)
                
            ElseIf TableCell.className = "views-field views-field-field-date-closing" Then
                Debug.Print TableCell.innerHTML 'Returns: <span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="2020-10-23T00:00:00+02:00">22/10/2020</span>
                Debug.Print TableCell.innerText 'Returns: 23/10/2020
                ClDate = ExtractDateValue(TableCell.innerHTML)
                
            ElseIf TableCell.className = "views-field views-field-field-date-briefing" Then
                Debug.Print TableCell.innerHTML 'Returns: <span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="2020-10-24T00:00:00+02:00">22/10/2020</span>
                Debug.Print TableCell.innerText 'Returns: 24/10/2020
                BDate = ExtractDateValue(TableCell.innerHTML)
                                
            End If

        Next TableCell
        
    Next TableRow

End Sub

John_w · Oct 26, 2020

DeonM said:
1. In the section where I'm processing the "className = "views-field views-field-field-date-published", I don't want the innerText of the TableCell, I want the innerText of the ...content="2020-10-22T00:00:00+02:00". At the moment I'm processing the innerHTML with a custom function, which works fine. But is there a way to access the content="2020-10-22T00:00:00+02:00" innerText itself, similar to what I'm doing for the href in the TableCell above?

Try TableCell.getAttribute("content")

DeonM said:
2. I need to do different things with the content of each TableCell, therefore I've resorted to the multiple If statements based on TableCell.className. But it does require a lot of extra looping. Is there a more efficient way to get right to the applicable section?

It depends on what you want to do with the extracted data and your code doesn't show that.

There is no need to loop through the table cells (the inner loop) and look at the class name if you know that a particular column index always relates to the same 'field'. For example, if the first column is always the Category then you could have Category = TableRow.Cells(0).innerText

Other ways of getting all the table cell elements for a specific class name and looping through them are:

VBA Code:

    Dim Categories As IHTMLDOMChildrenCollection
    Dim Category As IHTMLDOMNode
    Set Categories = HTMLdoc.querySelectorAll("td.views-field views-field-field-category")
    'or
    'Set Categories = HTMLdoc.querySelectorAll("views-field-field-category")
    'or
    'Set Categories = HTMLdoc.querySelectorAll("td[class='views-field views-field-field-category']")
    For Each Category In Categories
        Debug.Print Category.innerText
    Next
   
    Dim Categories2 As IHTMLElementCollection
    Dim Category2 As HTMLTableCell
    Set Categories2 = HTMLdoc.getElementsByClassName("td.views-field views-field-field-category")
    'or
    'Set Categories2 = HTMLdoc.getElementsByClassName("views-field-field-category")
    For Each Category2 In Categories2
        Debug.Print Category2.innerText
    Next

DeonM · Oct 27, 2020

Hi John

Thanks for the reply.

1.

VBA Code:

TableCell.getAttribute("content")

If I use that I get a "Run time error 94, Invalid use of Null." If I look in the locals window I get a Null return for the variable. Any ideas?

2.

VBA Code:

Category = TableRow.Cells(0).innerText

That approach worked perfectly for me. As you say the columns are fixed 'fields', so I can read directly from each.
(I had to rewrite the code a bit to deal with the header row which has no usable data for me, but it's much preferable to the many if loops.).

Thanks for the help.

John_w · Oct 27, 2020

I didn't notice that the content attribute is in a child span element, therefore this should work:

VBA Code:

Debug.Print TableCell.Children(0).getAttribute("content")

DeonM · Oct 28, 2020

That works perfectly. I used your way of referencing TableRow.Cells so ended up with:

VBA Code:

If TableRow.Cells(3).innerHTML <> "" Then DatePublished = TableRow.Cells(3).Children(0).getAttribute("content") Else DatePublished = ""

I also had to deal with some instances where the table cells could be empty, which would cause an error.

Thanks again.

Scraping HTML Tables with VBA

DeonM

New Member

John_w

MrExcel MVP

DeonM

New Member

John_w

MrExcel MVP

DeonM

New Member

Similar threads

Forum statistics

Share this page

Scraping HTML Tables with VBA

DeonM

New Member

John_w

MrExcel MVP

DeonM

New Member

John_w

MrExcel MVP

DeonM

New Member

Similar threads

Forum statistics

Share this page

We've detected that you are using an adblocker.

Which adblocker are you using?

Disable AdBlock

Disable AdBlock Plus

Disable uBlock Origin

Disable uBlock