Picking HTML with VBA

cambone

New Member
Joined
Jun 30, 2017
Messages
15
Is there a way to get just English from web HTML?
I am looping through the children and it gets all the languages.

How can I get HTMLTable and HTMLSeason to loop at the same time?

Here is my code:

Code:
For Each HTMLTable In HTMLTables
    
    
            For Each HTMLSeason In HTMLSeasons
            
            
'        Debug.Print HTMLSeason.innerText
         Cells(RowNum, 1) = HTMLSeason.innerText
        RowNum = RowNum + 1
        If HTMLSeason.innerText = "Specials" Then PreSeason = "0x"
        If Left(HTMLSeason.innerText, 6) = "Season" Then PreSeason = _
        Mid(HTMLSeason.innerText, InStrRev(HTMLSeason.innerText, " ") + 1) & "x"
       
       
        For Each HTMLRow In HTMLTable.getElementsByTagName("tr")

                                
            ColNum = 1
            For Each HTMLCell In HTMLRow.Children
            
            If ColNum = 1 Then
'                Debug.Print PreSeason & HTMLCell.innerText
                Cells(RowNum, ColNum) = PreSeason & HTMLCell.innerText
                Else
'                Debug.Print HTMLCell.innerText
                Cells(RowNum, ColNum) = HTMLCell.innerText
                End If

                ColNum = ColNum + 1
            Next HTMLCell
            
            RowNum = RowNum + 1
            
          Next HTMLRow

        Next HTMLSeason

    Next HTMLTable


here is the ex. Website
https://www.thetvdb.com/series/quantico/seasons/all
 
You should only have to change what I marked in red earlier in post #9 (Just copy and paste it over the existing line)

The line you indicated is how the 1x1, 2x3, 0x1 stuff gets constructed. It is separate from the season's naming convention.
 
Last edited:
Upvote 0

Excel Facts

Do you hate GETPIVOTDATA?
Prevent GETPIVOTDATA. Select inside a PivotTable. In the Analyze tab of the ribbon, open the dropown next to Options and turn it off
I tried and this is what I got

Code:
Sub TVDBEpisodeGetter()

    ActiveSheet.UsedRange.ClearContents
    Dim XMLPage As New MSXML2.XMLHTTP60
    Dim HTMLDoc As New MSHTML.HTMLDocument
    
    XMLPage.Open "GET", "https://www.thetvdb.com/series/american-pickers/seasons/all", False
    XMLPage.send
    
    HTMLDoc.body.innerHTML = XMLPage.responseText
    
    ProcessHTMLPage HTMLDoc
       
    'I think excel does this on its own, but I added these lines to close the instances
    Set XMLPage = Nothing
    Set HTMLDoc = Nothing
    
End Sub


Sub ProcessHTMLPage(HTMLPage As MSHTML.HTMLDocument)

    'Grabs every Html row (includes the episodes from every season)
    Dim HTMLRow As MSHTML.IHTMLElement
    
    'Accesses each element within the Table Row
    Dim HTMLCell As MSHTML.IHTMLElement
    
    'Stores the Elements tagged as H3 headers (usually the season titles)
    Dim HTMLSeasons As MSHTML.IHTMLElementCollection
    Dim HTMLSeason As MSHTML.IHTMLElement
    
    'Tracks active position on sheet (where next data is headed)
    Dim RowNum As Long, ColNum As Integer
    
    'Array to store all the season titles
    Dim seasonHeader() As String
    Dim currentSeason As Integer: currentSeason = 1
    
    'Iterator for building seasons array // initialize to 1
    Dim i As Integer: i = 1
    
    'Since we want to start at 0 IF there is a specials category,
    'I made this counter to subtract from the season number if necessary
    Dim specialsCount As Integer: specialsCount = 0
    
    'Finds the start and end positions of the HTML Code snippet that is english
    Dim englishFindIndex As Integer
    Dim endOfEnglishIndex As Integer
    'Read all the Season Titles
    Set HTMLSeasons = HTMLPage.getElementsByTagName("h3")
    
    'Prepare an array to hold the titles
    ReDim seasonHeader(1 To HTMLSeasons.Length)
        
    'Iterate through all season titles and store them to array
    For Each HTMLSeason In HTMLSeasons
        
        seasonHeader(i) = HTMLSeason.innerText
        i = i + 1
        
    Next HTMLSeason
    
     'Initialize active position on sheet
     RowNum = 1
 
        'Process one row at a time (episodes or season headers)
        For Each HTMLRow In HTMLPage.getElementsByTagName("tr")
                               
            'Start on Left
            ColNum = 1
            
            'In current row, access each entry 1-at-a-time
            For Each HTMLCell In HTMLRow.Children
                
                'Check to see if we are on a subheader
                If InStr(1, HTMLCell.innerText, "#") > 0 Then
                
                    'If we are, we are going to slide the Season title in before adding the subheader.
                    'This will generate the correct header
                    Select Case seasonHeader(currentSeason)
                    
                        'If the category is named "specials" use that name and increment specialsCount
                        Case "Specials"
                            Cells(RowNum, ColNum) = "Specials"
                            specialsCount = specialsCount + 1
                        
                        'This is the case for any title of the form [Season #]
                        'If something is getting labeled like this that shouldnt be... youll need to add another case
                        'with "specialsCount" to handle it
                        Case Else
'                            Cells(RowNum, ColNum) = "Season " & (currentSeason - specialsCount)
                             Cells(RowNum, ColNum) = seasonHeader(currentSeason)
                        
                    End Select
                
                'Since we just used the label, make sure to move to next row and increment the season number
                RowNum = RowNum + 1
                currentSeason = currentSeason + 1
                
                End If
            
                'Adds Season Prefix to Episode Number
                If ColNum = 1 And Not InStr(1, HTMLCell.innerText, "#") > 0 Then
                
                    Cells(RowNum, ColNum) = (currentSeason - 1 - specialsCount) & "x" & HTMLCell.innerText
    
                Else
                    'Debug.Print HTMLCell.innerText
    
                    If InStr(1, HTMLCell.innerHTML, "data-language=""en""") > 0 Then
    
                        'finding start and end position of string that will look like this:
                        '       data-language="en">Inside Look
                        englishFindIndex = InStr(1, HTMLCell.innerHTML, "data-language=""en""") + 19
                        endOfEnglishIndex = InStr(englishFindIndex, HTMLCell.innerHTML, "<")
                        'Write properly formatted (english only) title to cell
                        Cells(RowNum, ColNum) = Mid(HTMLCell.innerHTML, englishFindIndex, endOfEnglishIndex - englishFindIndex)
                        
                    Else
                    
                        'No language issue, just write to cell in original format
                        Cells(RowNum, ColNum) = HTMLCell.innerText
                        
                    End If
    
                End If
                
            'Done with current HTML-cell, move to next sheet-column
            ColNum = ColNum + 1
            
            'Move to next HTML-Cell
            Next HTMLCell
            
        'Done with current HTML-row, move to next sheet-row
        RowNum = RowNum + 1
            
        'Move to next HTML-row
        Next HTMLRow
          
      'Format Sheet (AutoFit) to display all text
      ActiveSheet.UsedRange.Columns.AutoFit
      ActiveSheet.UsedRange.Rows.AutoFit

End Sub

Code:
[TABLE="width: 449"]
<colgroup><col><col><col></colgroup><tbody>[TR]
[TD]0x53 [/TD]
[TD]Man Cave Mania [/TD]
[TD="align: right"]5/16/2018[/TD]
[/TR]
[TR]
[TD]0x54 [/TD]
[TD]Father Picks Best [/TD]
[TD="align: right"]5/23/2018[/TD]
[/TR]
[TR]
[TD]Season 2010[/TD]
[TD][/TD]
[TD][/TD]
[/TR]
[TR]
[TD]#[/TD]
[TD]Name[/TD]
[TD]Originally Aired[/TD]
[/TR]
[TR]
[TD]1x1 
[/TD]
[TD]Big Bear 
[/TD]
[TD="align: right"]1/18/2010[/TD]
[/TR]
[TR]
[TD]1x2 
[/TD]
[TD]Super Scooter 
[/TD]
[TD="align: right"]1/25/2010
[/TD]
[/TR]
[TR]
[TD]1x3 [/TD]
[TD]White Castle on the Farm [/TD]
[TD="align: right"]2/1/2010[/TD]
[/TR]
[TR]
[TD]1x4 [/TD]
[TD]Invisible Pump [/TD]
[TD="align: right"]2/8/2010[/TD]
[/TR]
[TR]
[TD]1x5 [/TD]
[TD]Back Breaker [/TD]
[TD="align: right"]2/15/2010[/TD]
[/TR]
</tbody>[/TABLE]

seasonHeader was = to "Season 2011" from the ex. above

under 2010 it should be
2010x1 Big Bear
2010x2 Super Scooter

The code is great, love the neatness and the explanations, I can learn from this.
 
Upvote 0
Oh sorry, I assumed you wanted the 1x3, 2x4 format regardless.

This code should work in both cases:


EDIT: You'll also need to declare this with all the other declarations
Code:
    'Going to extract the number from Season header (2015, 3, etc...)
    Dim currentSeasonNumber() As String

Code:
                'Adds Season Prefix to Episode Number (ripped from header)
                If ColNum = 1 And Not InStr(1, HTMLCell.innerText, "#") > 0 Then
                    
[COLOR=#ff0000]                    'Looks for a space, and then grabs whatever is in the second part of the name
                    If InStr(1, seasonHeader(currentSeason - 1), " ") > 0 Then
                    
                        currentSeasonNumber = Split(seasonHeader(currentSeason - 1))
                        Cells(RowNum, ColNum) = currentSeasonNumber(1) & "x" & HTMLCell.innerText
                        
                    'Should just be the case for Specials
                    Else
                    
[COLOR=#0000ff]                         'Edit this to change the prefix for specials
                        Cells(RowNum, ColNum) = (currentSeason - 1 - specialsCount) & "x" & HTMLCell.innerText[/COLOR]
                    
                    End If[/COLOR]
    
                Else
    
                    If InStr(1, HTMLCell.innerHTML, "data-language=""en""") > 0 Then
    
                        'finding start and end position of string that will look like this:
                        '       data-language="en">Inside Look
                        englishFindIndex = InStr(1, HTMLCell.innerHTML, "data-language=""en""") + 19
                        endOfEnglishIndex = InStr(englishFindIndex, HTMLCell.innerHTML, "<")
                        'Write properly formatted (english only) title to cell
                        Cells(RowNum, ColNum) = Mid(HTMLCell.innerHTML, englishFindIndex, endOfEnglishIndex - englishFindIndex)
                        
                    Else
                    
                        'No language issue, just write to cell in original format
                        Cells(RowNum, ColNum) = HTMLCell.innerText
                        
                    End If
    
                End If

If you wanted to change the prefix for specials (to something other than zero), you could replace the blue code with something like this:

Code:
                [COLOR=#0000ff]Cells(RowNum, ColNum) = "Specials" & "x" & HTMLCell.innerText[/COLOR]
 
Last edited:
Upvote 0
This has been working great, but of found a little snag, the naming of an episode fault I think.
It has a "#" in the title, could that be whats killing it?

Try running with this web location:
Code:
Sub TVDBEpisodeGetter()

    ActiveSheet.UsedRange.ClearContents
    Dim XMLPage As New MSXML2.XMLHTTP60
    Dim HTMLDoc As New MSHTML.HTMLDocument
    
    XMLPage.Open "GET", "https://www.thetvdb.com/series/family-guy/seasons/all", False
    XMLPage.send
    
    HTMLDoc.body.innerHTML = XMLPage.responseText
    
    ProcessHTMLPage HTMLDoc
       
    'I think excel does this on its own, but I added these lines to close the instances
    Set XMLPage = Nothing
    Set HTMLDoc = Nothing
    
End Sub


Sub ProcessHTMLPage(HTMLPage As MSHTML.HTMLDocument)

    'Grabs every Html row (includes the episodes from every season)
    Dim HTMLRow As MSHTML.IHTMLElement
    
    'Accesses each element within the Table Row
    Dim HTMLCell As MSHTML.IHTMLElement
    
    'Stores the Elements tagged as H3 headers (usually the season titles)
    Dim HTMLSeasons As MSHTML.IHTMLElementCollection
    Dim HTMLSeason As MSHTML.IHTMLElement
    
    'Tracks active position on sheet (where next data is headed)
    Dim RowNum As Long, ColNum As Integer
    
    'Array to store all the season titles
    Dim seasonHeader() As String
    Dim currentSeason As Integer: currentSeason = 1
    
    'Iterator for building seasons array // initialize to 1
    Dim i As Integer: i = 1
    
    'Since we want to start at 0 IF there is a specials category,
    'I made this counter to subtract from the season number if necessary
    Dim specialsCount As Integer: specialsCount = 0
    
    'Finds the start and end positions of the HTML Code snippet that is english
    Dim englishFindIndex As Integer
    Dim endOfEnglishIndex As Integer
    
        'Going to extract the number from Season header (2015, 3, etc...)
    Dim currentSeasonNumber() As String
    
    'Read all the Season Titles
    Set HTMLSeasons = HTMLPage.getElementsByTagName("h3")
    
    'Prepare an array to hold the titles
    ReDim seasonHeader(1 To HTMLSeasons.Length)
        
    'Iterate through all season titles and store them to array
    For Each HTMLSeason In HTMLSeasons
        
        seasonHeader(i) = HTMLSeason.innerText
        i = i + 1
        
    Next HTMLSeason
    
     'Initialize active position on sheet
     RowNum = 1
 
        'Process one row at a time (episodes or season headers)
        For Each HTMLRow In HTMLPage.getElementsByTagName("tr")
                               
            'Start on Left
            ColNum = 1
            
            'In current row, access each entry 1-at-a-time
            For Each HTMLCell In HTMLRow.Children
                
                'Check to see if we are on a subheader
                If InStr(1, HTMLCell.innerText, "#") > 0 Then
                
                    'If we are, we are going to slide the Season title in before adding the subheader.
                    'This will generate the correct header
                    Select Case seasonHeader(currentSeason)
                    
                        'If the category is named "specials" use that name and increment specialsCount
                        Case "Specials"
                            Cells(RowNum, ColNum) = "Specials"
                            specialsCount = specialsCount + 1
                        
                        'This is the case for any title of the form [Season #]
                        'If something is getting labeled like this that shouldnt be... youll need to add another case
                        'with "specialsCount" to handle it
                        Case Else
'                            Cells(RowNum, ColNum) = "Season " & (currentSeason - specialsCount)
                             Cells(RowNum, ColNum) = seasonHeader(currentSeason)
                        
                    End Select
                
                'Since we just used the label, make sure to move to next row and increment the season number
                RowNum = RowNum + 1
                currentSeason = currentSeason + 1
                
                End If
            
                   'Adds Season Prefix to Episode Number (ripped from header)
                If ColNum = 1 And Not InStr(1, HTMLCell.innerText, "#") > 0 Then
                    
                    'Looks for a space, and then grabs whatever is in the second part of the name
                    If InStr(1, seasonHeader(currentSeason - 1), " ") > 0 Then
                    
                        currentSeasonNumber = Split(seasonHeader(currentSeason - 1))
                        Cells(RowNum, ColNum) = currentSeasonNumber(1) & "x" & HTMLCell.innerText
                        
                    'Should just be the case for Specials
                    Else
                    
                         'Edit this to change the prefix for specials
                        Cells(RowNum, ColNum) = (currentSeason - 1 - specialsCount) & "x" & HTMLCell.innerText
                    
                    End If
    
                Else
                    'Debug.Print HTMLCell.innerText
    
                    If InStr(1, HTMLCell.innerHTML, "data-language=""en""") > 0 Then
    
                        'finding start and end position of string that will look like this:
                        '       data-language="en">Inside Look
                        englishFindIndex = InStr(1, HTMLCell.innerHTML, "data-language=""en""") + 19
                        endOfEnglishIndex = InStr(englishFindIndex, HTMLCell.innerHTML, "<")
                        'Write properly formatted (english only) title to cell
                        Cells(RowNum, ColNum) = Mid(HTMLCell.innerHTML, englishFindIndex, endOfEnglishIndex - englishFindIndex)
                        
                    Else
                    
                        'No language issue, just write to cell in original format
                        Cells(RowNum, ColNum) = HTMLCell.innerText
                        
                    End If
    
                End If
                
            'Done with current HTML-cell, move to next sheet-column
            ColNum = ColNum + 1
            
            'Move to next HTML-Cell
            Next HTMLCell
            
        'Done with current HTML-row, move to next sheet-row
        RowNum = RowNum + 1
            
        'Move to next HTML-row
        Next HTMLRow
          
      'Format Sheet (AutoFit) to display all text
      ActiveSheet.UsedRange.Columns.AutoFit
      ActiveSheet.UsedRange.Rows.AutoFit

End Sub

Season 3 Episode 21
Family Guy Viewer Mail [URL=https://www.mrexcel.com/forum/usertag.php?do=list&action=hash&hash=1]#1 [/URL]
Thanks
 
Last edited:
Upvote 0
Add this condition (makes it only check for the pound sign in the first entry of a row)

Code:
 'In current row, access each entry 1-at-a-time
            For Each HTMLCell In HTMLRow.Children
                
                'Check to see if we are on a subheader
                If InStr(1, HTMLCell.innerText, "#") > 0 [COLOR=#ff0000]And ColNum = 1 [/COLOR]Then
 
Last edited:
Upvote 0
Hoping that you would have some time to help me with this VB code.
I have been trying, I get some of the info but can't complete the task.
This code cuts my work in less than half the time.

Thanks,
Cam
 
Upvote 0

Forum statistics

Threads
1,223,228
Messages
6,170,871
Members
452,363
Latest member
merico17

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top