Picking HTML with VBA

cambone

New Member
Joined
Jun 30, 2017
Messages
15
Is there a way to get just English from web HTML?
I am looping through the children and it gets all the languages.

How can I get HTMLTable and HTMLSeason to loop at the same time?

Here is my code:

Code:
For Each HTMLTable In HTMLTables
    
    
            For Each HTMLSeason In HTMLSeasons
            
            
'        Debug.Print HTMLSeason.innerText
         Cells(RowNum, 1) = HTMLSeason.innerText
        RowNum = RowNum + 1
        If HTMLSeason.innerText = "Specials" Then PreSeason = "0x"
        If Left(HTMLSeason.innerText, 6) = "Season" Then PreSeason = _
        Mid(HTMLSeason.innerText, InStrRev(HTMLSeason.innerText, " ") + 1) & "x"
       
       
        For Each HTMLRow In HTMLTable.getElementsByTagName("tr")

                                
            ColNum = 1
            For Each HTMLCell In HTMLRow.Children
            
            If ColNum = 1 Then
'                Debug.Print PreSeason & HTMLCell.innerText
                Cells(RowNum, ColNum) = PreSeason & HTMLCell.innerText
                Else
'                Debug.Print HTMLCell.innerText
                Cells(RowNum, ColNum) = HTMLCell.innerText
                End If

                ColNum = ColNum + 1
            Next HTMLCell
            
            RowNum = RowNum + 1
            
          Next HTMLRow

        Next HTMLSeason

    Next HTMLTable


here is the ex. Website
https://www.thetvdb.com/series/quantico/seasons/all
 

Excel Facts

Lock one reference in a formula
Need 1 part of a formula to always point to the same range? use $ signs: $V$2:$Z$99 will always point to V2:Z99, even after copying
I dunno about the other parts of your question, but You could use something like this to filter out the English lines (only showing basic structure) I'm not totally sure how it would fit into your project:

(Change the stuff in brackets if that weren't obvious)

EDIT: Upon further review, you would need to be at the < span > level of element before applying this filter

Code:
For Each [obj] in [HTMLsomething].getElementsByTagName([Some HTML tag - "tr" would be fine])

If InStr(1,[obj].innerHTML, "data-language=""en""") > 0 Then

MORE CODE HERE

End If

Next

See this:

wGHL0CR.png
 
Last edited:
Upvote 0
Thank you very much!

The HTMLSeason is a list of seasons for the series, like specials and the Seasons
is there another way to collect that info?

What i'm trying get is:
Specials
0x1 Series name Episode name Date Played
0x2 Series name Episode name Date Played
and so on
Seasons (seasons can be 1-??? or the year like 2017-???)
1x1 Series name Episode name Date Played
1x2 Series name Episode name Date Played
and so on


Here is the whole thing:
Code:
Option Explicit


Sub TVDBEpisodeGetter()

    Dim XMLPage As New MSXML2.XMLHTTP60
    Dim HTMLDoc As New MSHTML.HTMLDocument
    
    XMLPage.Open "GET", "https://www.thetvdb.com/series/quantico/seasons/all", False
    XMLPage.send
    
    HTMLDoc.body.innerHTML = XMLPage.responseText
    
    ProcessHTMLPage HTMLDoc
    
       ActiveSheet.UsedRange
    
End Sub

Sub ProcessHTMLPage(HTMLPage As MSHTML.HTMLDocument)

    Dim HTMLTable As MSHTML.IHTMLElement
    Dim HTMLTables As MSHTML.IHTMLElementCollection
    Dim HTMLRow As MSHTML.IHTMLElement
    Dim HTMLCell As MSHTML.IHTMLElement
    Dim HTMLSeasons As MSHTML.IHTMLElementCollection
    Dim HTMLSeason As MSHTML.IHTMLElement
    Dim RowNum As Long, ColNum As Integer
    Dim PreSeason As String
    Dim SeasonCT As Integer
    Dim EE As Integer
    
    Set HTMLTables = HTMLPage.getElementsByTagName("table")
    Set HTMLSeasons = HTMLPage.getElementsByTagName("h3")
    
'    Debug.Print HTMLSeasons.Length
    
    Sheets("Sheet1").Select
    Cells.Select
    Selection.ClearContents
    Range("A1").Select

     RowNum = 1
     EE = 1
      
    For Each HTMLTable In HTMLTables
    
    
            For Each HTMLSeason In HTMLSeasons
            
            
'        Debug.Print HTMLSeason.innerText
         Cells(RowNum, 1) = HTMLSeason.innerText
        RowNum = RowNum + 1
        If HTMLSeason.innerText = "Specials" Then PreSeason = "0x"
        If Left(HTMLSeason.innerText, 6) = "Season" Then PreSeason = _
        Mid(HTMLSeason.innerText, InStrRev(HTMLSeason.innerText, " ") + 1) & "x"
       
       
        For Each HTMLRow In HTMLTable.getElementsByTagName("tr")

                                
            ColNum = 1
            For Each HTMLCell In HTMLRow.Children
            
            If ColNum = 1 Then
'                Debug.Print PreSeason & HTMLCell.innerText
                Cells(RowNum, ColNum) = PreSeason & HTMLCell.innerText
                Else
'                Debug.Print HTMLCell.innerText
                Cells(RowNum, ColNum) = HTMLCell.innerText
                End If

                ColNum = ColNum + 1
            Next HTMLCell
            
            RowNum = RowNum + 1
            
          Next HTMLRow

        Next HTMLSeason

    Next HTMLTable
    

  
End Sub
 
Upvote 0
Alright, I'm out of time to work on this today, but here is the code with the English issue taken care of.

Changes are noted in RED

Code:
Sub TVDBEpisodeGetter()

    [COLOR="#FF0000"]ActiveSheet.UsedRange.ClearContents[/COLOR]

    Dim XMLPage As New MSXML2.XMLHTTP60
    Dim HTMLDoc As New MSHTML.HTMLDocument
    
    XMLPage.Open "GET", "https://www.thetvdb.com/series/quantico/seasons/all", False
    XMLPage.send
    
    HTMLDoc.body.innerHTML = XMLPage.responseText
    
    ProcessHTMLPage HTMLDoc
    
       ActiveSheet.UsedRange
    
End Sub

Sub ProcessHTMLPage(HTMLPage As MSHTML.HTMLDocument)

    Dim HTMLTable As MSHTML.IHTMLElement
    Dim HTMLTables As MSHTML.IHTMLElementCollection
    Dim HTMLRow As MSHTML.IHTMLElement
    Dim HTMLCell As MSHTML.IHTMLElement
    Dim HTMLSeasons As MSHTML.IHTMLElementCollection
    Dim HTMLSeason As MSHTML.IHTMLElement
    Dim RowNum As Long, ColNum As Integer
    Dim PreSeason As String
    Dim SeasonCT As Integer
    Dim EE As Integer
    
[COLOR="#FF0000"]    Dim englishFindIndex As Integer
    Dim endOfEnglishIndex As Integer[/COLOR]
    
    Set HTMLTables = HTMLPage.getElementsByTagName("table")
    Set HTMLSeasons = HTMLPage.getElementsByTagName("h3")
    
'    Debug.Print HTMLSeasons.Length
    
    Sheets("Sheet1").Select
    Cells.Select
    Selection.ClearContents
    Range("A1").Select

     RowNum = 1
     EE = 1
      
    For Each HTMLTable In HTMLTables
    
    
            For Each HTMLSeason In HTMLSeasons
            
            
'        Debug.Print HTMLSeason.innerText
         Cells(RowNum, 1) = HTMLSeason.innerText
        RowNum = RowNum + 1
        If HTMLSeason.innerText = "Specials" Then PreSeason = "0x"
        If Left(HTMLSeason.innerText, 6) = "Season" Then PreSeason = _
        Mid(HTMLSeason.innerText, InStrRev(HTMLSeason.innerText, " ") + 1) & "x"
       
       
        For Each HTMLRow In HTMLTable.getElementsByTagName("tr")

                                
            ColNum = 1
            For Each HTMLCell In HTMLRow.Children
            
            If ColNum = 1 Then
'                Debug.Print PreSeason & HTMLCell.innerText
                Cells(RowNum, ColNum) = PreSeason & HTMLCell.innerText
                Else
                'Debug.Print HTMLCell.innerText

                [COLOR="#FF0000"]If InStr(1, HTMLCell.innerHTML, "data-language=""en""") > 0 Then

                    englishFindIndex = InStr(1, HTMLCell.innerHTML, "data-language=""en""") + 19
                    endOfEnglishIndex = InStr(englishFindIndex, HTMLCell.innerHTML, "<")
                    
                    Cells(RowNum, ColNum) = Mid(HTMLCell.innerHTML, englishFindIndex, endOfEnglishIndex - englishFindIndex)
                    
                Else
                
                    Cells(RowNum, ColNum) = HTMLCell.innerText
                    
                End If
[/COLOR]                
                
                
                
            End If
                
                ColNum = ColNum + 1
                
            Next HTMLCell
            
            RowNum = RowNum + 1
            
          Next HTMLRow

        Next HTMLSeason

    Next HTMLTable
    

  
End Sub
 
Upvote 0
Alright, here's what I came up with. It seems to work fine for this specific show. I dunno how it works with others (hopefully well)

The basic idea is to minimize the collections of html objects; in this code I mainly use the "rows" and just filter them / react to their contents.

To see what flags/hooks to use as filters, i'd recommend Dialog.Print 'ing the HTML.innerHTML more often, rather than text. it shows all the tags (which are useful for filtration)

I've modified it considerably, so I'd recommend saving a copy of your original code somewhere and then just copy/pasting mine.

Here you go; lemme know how it works

Code:
Sub TVDBEpisodeGetter()

    ActiveSheet.UsedRange.ClearContents
    Dim XMLPage As New MSXML2.XMLHTTP60
    Dim HTMLDoc As New MSHTML.HTMLDocument
    
    XMLPage.Open "GET", "https://www.thetvdb.com/series/quantico/seasons/all", False
    XMLPage.send
    
    HTMLDoc.body.innerHTML = XMLPage.responseText
    
    ProcessHTMLPage HTMLDoc
       
    'I think excel does this on its own, but I added these lines to close the instances
    Set XMLPage = Nothing
    Set HTMLDoc = Nothing
    
End Sub

Code:
Sub ProcessHTMLPage(HTMLPage As MSHTML.HTMLDocument)

    'Grabs every Html row (includes the episodes from every season)
    Dim HTMLRow As MSHTML.IHTMLElement
    
    'Accesses each element within the Table Row
    Dim HTMLCell As MSHTML.IHTMLElement
    
    'Stores the Elements tagged as H3 headers (usually the season titles)
    Dim HTMLSeasons As MSHTML.IHTMLElementCollection
    Dim HTMLSeason As MSHTML.IHTMLElement
    
    'Tracks active position on sheet (where next data is headed)
    Dim RowNum As Long, ColNum As Integer
    
    'Array to store all the season titles
    Dim seasonHeader() As String
    Dim currentSeason As Integer: currentSeason = 1
    
    'Iterator for building seasons array // initialize to 1
    Dim i As Integer: i = 1
    
    'Since we want to start at 0 IF there is a specials category,
    'I made this counter to subtract from the season number if necessary
    Dim specialsCount As Integer: specialsCount = 0
    
    'Finds the start and end positions of the HTML Code snippet that is english
    Dim englishFindIndex As Integer
    Dim endOfEnglishIndex As Integer
    'Read all the Season Titles
    Set HTMLSeasons = HTMLPage.getElementsByTagName("h3")
    
    'Prepare an array to hold the titles
    ReDim seasonHeader(1 To HTMLSeasons.Length)
        
    'Iterate through all season titles and store them to array
    For Each HTMLSeason In HTMLSeasons
        
        seasonHeader(i) = HTMLSeason.innerText
        i = i + 1
        
    Next HTMLSeason
    
     'Initialize active position on sheet
     RowNum = 1
 
        'Process one row at a time (episodes or season headers)
        For Each HTMLRow In HTMLPage.getElementsByTagName("tr")
                               
            'Start on Left
            ColNum = 1
            
            'In current row, access each entry 1-at-a-time
            For Each HTMLCell In HTMLRow.Children
                
                'Check to see if we are on a subheader
                If InStr(1, HTMLCell.innerText, "#") > 0 Then
                
                    'If we are, we are going to slide the Season title in before adding the subheader.
                    'This will generate the correct header
                    Select Case seasonHeader(currentSeason)
                    
                        'If the category is named "specials" use that name and increment specialsCount
                        Case "Specials"
                            Cells(RowNum, ColNum) = "Specials"
                            specialsCount = specialsCount + 1
                        
                        'This is the case for any title of the form [Season #]
                        'If something is getting labeled like this that shouldnt be... youll need to add another case
                        'with "specialsCount" to handle it
                        Case Else
                            Cells(RowNum, ColNum) = "Season " & (currentSeason - specialsCount)
                        
                    End Select
                
                'Since we just used the label, make sure to move to next row and increment the season number
                RowNum = RowNum + 1
                currentSeason = currentSeason + 1
                
                End If
            
                'Adds Season Prefix to Episode Number
                If ColNum = 1 And Not InStr(1, HTMLCell.innerText, "#") > 0 Then
                
                    Cells(RowNum, ColNum) = (currentSeason - 1 - specialsCount) & "x" & HTMLCell.innerText
    
                Else
                    'Debug.Print HTMLCell.innerText
    
                    If InStr(1, HTMLCell.innerHTML, "data-language=""en""") > 0 Then
    
                        'finding start and end position of string that will look like this:
                        '       data-language="en">Inside Look
                        englishFindIndex = InStr(1, HTMLCell.innerHTML, "data-language=""en""") + 19
                        endOfEnglishIndex = InStr(englishFindIndex, HTMLCell.innerHTML, "<")
                        'Write properly formatted (english only) title to cell
                        Cells(RowNum, ColNum) = Mid(HTMLCell.innerHTML, englishFindIndex, endOfEnglishIndex - englishFindIndex)
                        
                    Else
                    
                        'No language issue, just write to cell in original format
                        Cells(RowNum, ColNum) = HTMLCell.innerText
                        
                    End If
    
                End If
                
            'Done with current HTML-cell, move to next sheet-column
            ColNum = ColNum + 1
            
            'Move to next HTML-Cell
            Next HTMLCell
            
        'Done with current HTML-row, move to next sheet-row
        RowNum = RowNum + 1
            
        'Move to next HTML-row
        Next HTMLRow
          
      'Format Sheet (AutoFit) to display all text
      ActiveSheet.UsedRange.Columns.AutoFit
      ActiveSheet.UsedRange.Rows.AutoFit

End Sub
 
Upvote 0
Can it be read from the page to pick-up how its formatted?

Yeah, tweak this line as shown in red, and it should work regardless of format.

Code:
                'Check to see if we are on a subheader
                If InStr(1, HTMLCell.innerText, "#") > 0 Then
                
                    'If we are, we are going to slide the Season title in before adding the subheader.
                    'This will generate the correct header
                    Select Case seasonHeader(currentSeason)
                    
                        'If the category is named "specials" use that name and increment specialsCount
                        Case "Specials"
                            Cells(RowNum, ColNum) = "Specials"
                            specialsCount = specialsCount + 1
                        
                        'This is the case for any title of the form [Season #]
                        'If something is getting labeled like this that shouldnt be... youll need to add another case
                        'with "specialsCount" to handle it
                        Case Else
                            [COLOR=#ff0000]Cells(RowNum, ColNum) = seasonHeader(currentSeason)[/COLOR]
                        
                    End Select
                
                'Since we just used the label, make sure to move to next row and increment the season number
                RowNum = RowNum + 1
                currentSeason = currentSeason + 1
                
                End If
 
Upvote 0
It appears to get it from

Code:
                'Adds Season Prefix to Episode Number
                If ColNum = 1 And Not InStr(1, HTMLCell.innerText, "#") > 0 Then
                
                    [COLOR=#008080]Cells(RowNum, ColNum) = (currentSeason - 1 - specialsCount) & "x" & HTMLCell.innerText[/COLOR]
    
                Else
                    'Debug.Print HTMLCell.innerText

That would mean add another Case?

Thanks
 
Upvote 0

Forum statistics

Threads
1,224,822
Messages
6,181,164
Members
453,021
Latest member
Justyna P

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top