Excel VBA To Scrape Data From Series Of Web Pages

gurs

I am looking for assistance with some VBA code intended to collect information from a web site that displays data in groups of 50 rows, requiring you to click the "Next" button to view each subsequent set of rows. I am trying to write code that will either grab all of the rows at once and dump them into Excel, or at least loop through the various pages and consolidate the results in Excel. I even tried setting up 20 different tabs in Excel, each using an Excel Web Query to grab 50 rows of data, but no matter how I adjust the target URL, every tab returns the first 50 rows. That is why I ended up working on the approach below, which unfortunately is producing errors.

The URL of the first page of data I am trying to scrape is:
Code:
http://games.espn.go.com/ffl/freeagency?leagueId=228988&teamId=10&seasonId=2014#&seasonId=2014&=undefined&avail=-1&context=freeagency&view=overview

The URL of the second page of data is:
Code:
http://games.espn.go.com/ffl/freeagency?leagueId=228988&teamId=10&seasonId=2014#&seasonId=2014&=undefined&avail=-1&context=freeagency&view=overview&startIndex=50

Subsequent pages increment the startIndex parameter by 50.
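To make the pattern explicit, here is a trivial sketch (it only prints URLs to the Immediate window and does not fetch anything; the stray fragment portion of the addresses above is dropped here on the assumption that it is not needed):
Code:
Sub ListPageURLs()
    'Illustrative only: the pages I want differ solely in startIndex, stepping by 50.
    Dim startIndex As Long
    For startIndex = 0 To 200 Step 50
        Debug.Print "http://games.espn.go.com/ffl/freeagency?leagueId=228988&teamId=10" & _
                    "&seasonId=2014&avail=-1&context=freeagency&view=overview" & _
                    "&startIndex=" & startIndex
    Next startIndex
End Sub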

Here is the VBA code I have so far (which is based on the work of John_W in this earlier thread):
Code:
Public Sub ScrapeData()

'based on solution in http://www.mrexcel.com/forum/excel-questions/677031-pull-entire-web-table-not-just-what-visible-webpage.html
    
    Dim baseURL As String, URL As String, params As String
    Dim XMLreq As Object
    Dim HTMLdoc As Object
    Dim playerTable As Object
    Dim tableRows As Object
    Dim tableCell As Object
    Dim dest As Range
    Dim playerData As Variant
    Dim HTMLrow As Integer, i As Integer, c As Integer
    Dim p1 As Long, p2 As Long
    
    Set dest = ActiveSheet.Range("A1")
    dest.Parent.Activate
    dest.Parent.Cells.Clear

    baseURL = "http://games.espn.go.com/ffl/freeagency"
    params = "leagueId=228988&teamId=10&seasonId=2014#&seasonId=2014&=undefined&avail=-1&context=freeagency&view=overview&startIndex="
       
    Set XMLreq = CreateObject("MSXML2.XMLhttp")

    'For first page of results start at HTML row index 1 to include column headings in extracted data
    
    HTMLrow = 1
    
    'Request all pages
    
    Do
        With XMLreq
            URL = baseURL & "?" & params & CLng(Rnd() * 99999999)
            Debug.Print Now, URL
            .Open "POST", URL, False
            .send
            Set HTMLdoc = CreateObject("HTMLFile")
            HTMLdoc.body.innerHTML = .responseText
        End With
        
        'Extract player table into array
        '<table id="playertable_0" class="playerTableTable tableBody" cellspacing="1" cellpadding="0" border="0">
        
        Set playerTable = HTMLdoc.getElementById("playertable_0")
        Set tableRows = playerTable.Rows
        ReDim playerData(1 To tableRows.Length - HTMLrow, 1 To tableRows(HTMLrow).Cells.Length)
        
        i = 1
        While HTMLrow < tableRows.Length
            c = 1
            For Each tableCell In tableRows(HTMLrow).Cells
                If tableCell.innerText <> "" Then
                    playerData(i, c) = tableCell.innerText
                    c = c + 1
                End If
            Next
            i = i + 1
            HTMLrow = HTMLrow + 1
        Wend
        
        'Copy array to sheet cells
        
        dest.Resize(UBound(playerData, 1), UBound(playerData, 2)).Value = playerData
        Set dest = dest.Offset(UBound(playerData, 1))
        dest.Select
        DoEvents
        
        'For subsequent pages start at HTML row index 2 to ignore column headings
    
        HTMLrow = 2
        
        'Find the "NEXT»" link and extract the parameters from its onclick attribute
        '<a href="#" onclick="players('leagueId=306149&teamId=2&seasonId=2013&=undefined&gamesInScoringPeriodId=66&scoringPeriodId=65&view=stats&context=freeagency&version=last7&startIndex=50'); return false;">NEXT»</span></a>
        
        params = ""
        i = 0
        While i < HTMLdoc.Links.Length And params = ""
            If HTMLdoc.Links(i).innerText = "NEXT»" Then
                p1 = InStr(HTMLdoc.Links(i).onclick, "'") + 1
                p2 = InStr(p1, HTMLdoc.Links(i).onclick, "'")
                params = Mid(HTMLdoc.Links(i).onclick, p1, p2 - p1)
            End If
            i = i + 1
        Wend
        
    Loop Until params = ""
    
    MsgBox "Finished"
    
End Sub


The code above is throwing the following error at the line “Set tableRows = playerTable.Rows”:
Run-time error ‘91’:
Object variable or With block variable not set

I have tried (1) changing how URL is defined (currently URL = baseURL & "?" & params & CLng(Rnd() * 99999999)), and (2) changing "Set playerTable = HTMLdoc.getElementById("playertable_0")" to "Set playerTable = HTMLdoc.getElementById("playertable_1")". Neither change affected the error.
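In case it helps with diagnosis, I could guard the lookup like this (just a sketch; it reports the problem rather than fixing it), which would confirm whether getElementById is returning Nothing — that is what a run-time error 91 at that point implies:
Code:
        'Diagnostic sketch: guard the lookup so a missing table is reported
        'instead of raising error 91 on the .Rows access that follows.
        Set playerTable = HTMLdoc.getElementById("playertable_0")
        If playerTable Is Nothing Then
            MsgBox "playertable_0 was not found in the response (" & _
                   Len(XMLreq.responseText) & " characters returned)"
            Exit Sub
        End If
        Set tableRows = playerTable.Rows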

Any help would be greatly appreciated!
 

I saw your reply in the earlier thread.

The Free Agents page (Free Agents - Free Fantasy Football - ESPN) loads its data by requesting http://games.espn.go.com/ffl/player...eeagency&view=overview&startIndex=0&r=3775685. The r parameter value appears to be a random number, which the code can generate itself.
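For example (a trivial sketch; presumably the exact range of the value does not matter), such a number can be produced with Rnd:
Code:
Sub RandomRParamSketch()
    'Sketch: generate a cache-busting value for the r parameter.
    'Randomize seeds Rnd from the system timer so repeated runs differ.
    Dim r As Long
    Randomize
    r = CLng(Rnd() * 99999999)
    Debug.Print "&r=" & r
End Sub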

Try the following code, which is based on the earlier code with a few changes.
Code:
Public Sub Get_Player_Data3()
    
    Dim baseURL As String, URL As String, params As String
    Dim XMLreq As Object
    Dim HTMLdoc As Object
    Dim playerTable As Object
    Dim tableRows As Object
    Dim tableCell As Object
    Dim dest As Range
    Dim playerData As Variant
    Dim HTMLrow As Integer, i As Integer, c As Integer
    Dim startIndex As Integer
    Dim p1 As Long, p2 As Long
    
    With ActiveSheet
        Set dest = .Range("A1")
        .Activate
        .Cells.Clear
    End With

    baseURL = "http://games.espn.go.com/ffl/playertable/prebuilt/freeagency"
    params = "leagueId=228988&teamId=10&seasonId=2014&=undefined&avail=-1&context=freeagency&view=overview&startIndex=0"
          
    Set XMLreq = CreateObject("MSXML2.XMLhttp")

    'For first page of results start at HTML row index 1 to include column headings in extracted data
    
    HTMLrow = 1
    
    'Request all pages
    
    Do
        With XMLreq
            URL = baseURL & "?" & params & "&r=" & CLng(Rnd() * 99999999)
            .Open "GET", URL, False
            .send
            Set HTMLdoc = CreateObject("HTMLFile")
            HTMLdoc.body.innerHTML = .responseText
        End With
        
        'Extract player table into array
        '<table id="playertable_0" class="playerTableTable tableBody" cellspacing="1" cellpadding="0" border="0">

        Set playerTable = HTMLdoc.getElementById("playertable_0")
        Set tableRows = playerTable.Rows
        ReDim playerData(1 To tableRows.Length - HTMLrow, 1 To tableRows(HTMLrow).cells.Length)
        
        i = 1
        While HTMLrow < tableRows.Length
            c = 1
            For Each tableCell In tableRows(HTMLrow).cells
                If tableCell.innerText <> "" And tableCell.cellIndex <> 4 Then
                    playerData(i, c) = tableCell.innerText
                    c = c + tableCell.colSpan
                End If
            Next
            i = i + 1
            HTMLrow = HTMLrow + 1
        Wend
        
        'Copy array to sheet cells
        
        dest.Resize(UBound(playerData, 1), UBound(playerData, 2)).Value = playerData
        Set dest = dest.Offset(UBound(playerData, 1))
        dest.Select
        DoEvents
        
        'For subsequent pages start at HTML row index 2 to ignore column headings
    
        HTMLrow = 2
        
        'Find "NEXT»" link and extract parameters from o_nclick
        '< a href="#" o_nclick="players('leagueId=228988&teamId=10&seasonId=2014&=undefined&avail=-1&context=freeagency&view=overview&startIndex=50'); return false;">NEXT< span style="font-size:12px;">»< /span>< /a>

        params = ""
        i = 0
        While i < HTMLdoc.Links.Length And params = ""
            If HTMLdoc.Links(i).innerText = "NEXT»" Then
                p1 = InStr(HTMLdoc.Links(i).onclick, "'") + 1
                p2 = InStr(p1, HTMLdoc.Links(i).onclick, "'")
                params = Mid(HTMLdoc.Links(i).onclick, p1, p2 - p1)
            End If
            i = i + 1
        Wend
        
    Loop Until params = ""
    
    MsgBox "Finished"
    
End Sub
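The InStr/Mid pair near the end simply pulls out whatever sits between the first pair of single quotes in the onclick text. If you prefer, that step could be wrapped in a small helper (purely illustrative, not part of the working code above):
Code:
'Illustrative helper: return the text between the first pair of single quotes,
'or an empty string if such a pair is not found.
Function ExtractQuotedParams(ByVal onclickText As String) As String
    Dim p1 As Long, p2 As Long
    p1 = InStr(onclickText, "'")
    If p1 > 0 Then
        p2 = InStr(p1 + 1, onclickText, "'")
        If p2 > p1 Then ExtractQuotedParams = Mid(onclickText, p1 + 1, p2 - p1 - 1)
    End If
End Function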
 
Upvote 0
Slight change - the column headings and data were misaligned. The cell loop now keeps the blank cell at cellIndex 4 (instead of skipping it) so the data columns stay aligned with the headings, and the empty header cell left in D1 is deleted at the end.
Code:
Public Sub Get_Player_Data3()
    
    Dim baseURL As String, URL As String, params As String
    Dim XMLreq As Object
    Dim HTMLdoc As Object
    Dim playerTable As Object
    Dim tableRows As Object
    Dim tableCell As Object
    Dim dest As Range
    Dim playerData As Variant
    Dim HTMLrow As Integer, i As Integer, c As Integer
    Dim startIndex As Integer
    Dim p1 As Long, p2 As Long
    
    With ActiveSheet
        Set dest = .Range("A1")
        .Activate
        .Cells.Clear
    End With

    baseURL = "http://games.espn.go.com/ffl/playertable/prebuilt/freeagency"
    params = "leagueId=228988&teamId=10&seasonId=2014&=undefined&avail=-1&context=freeagency&view=overview&startIndex=0"
          
    Set XMLreq = CreateObject("MSXML2.XMLhttp")

    'For first page of results start at HTML row index 1 to include column headings in extracted data
    
    HTMLrow = 1
    
    'Request all pages
    
    Do
        With XMLreq
            URL = baseURL & "?" & params & "&r=" & CLng(Rnd() * 99999999)
            .Open "GET", URL, False
            .send
            Set HTMLdoc = CreateObject("HTMLFile")
            HTMLdoc.body.innerHTML = .responseText
        End With
        
        'Extract player table into array
        '<table id="playertable_0" class="playerTableTable tableBody" cellspacing="1" cellpadding="0" border="0">
        
        Set playerTable = HTMLdoc.getElementById("playertable_0")
        Set tableRows = playerTable.Rows
        ReDim playerData(1 To tableRows.Length - HTMLrow, 1 To tableRows(HTMLrow).Cells.Length)
        
        i = 1
        While HTMLrow < tableRows.Length
            c = 1
            For Each tableCell In tableRows(HTMLrow).Cells
                If tableCell.innerText <> "" Or tableCell.cellIndex = 4 Then
                    playerData(i, c) = tableCell.innerText
                    c = c + tableCell.colSpan
                End If
            Next
            i = i + 1
            HTMLrow = HTMLrow + 1
        Wend
        
        'Copy array to sheet cells
        
        dest.Resize(UBound(playerData, 1), UBound(playerData, 2)).Value = playerData
        Set dest = dest.Offset(UBound(playerData, 1))
        dest.Select
        DoEvents
        
        'For subsequent pages start at HTML row index 2 to ignore column headings
    
        HTMLrow = 2
        
        'Find the "NEXT»" link and extract the parameters from its onclick attribute
        '<a href="#" onclick="players('leagueId=228988&teamId=10&seasonId=2014&=undefined&avail=-1&context=freeagency&view=overview&startIndex=50'); return false;">NEXT»</span></a>
        
        params = ""
        i = 0
        While i < HTMLdoc.Links.Length And params = ""
            If HTMLdoc.Links(i).innerText = "NEXT»" Then
                p1 = InStr(HTMLdoc.Links(i).onclick, "'") + 1
                p2 = InStr(p1, HTMLdoc.Links(i).onclick, "'")
                params = Mid(HTMLdoc.Links(i).onclick, p1, p2 - p1)
            End If
            i = i + 1
        Wend
        
    Loop Until params = ""
    
    'Delete empty D1
    
    dest.Parent.Range("D1").Delete Shift:=xlToLeft
    
    MsgBox "Finished"
    
End Sub
 
Upvote 0
HALLELUJAH! You're a genius! Thanks a million. I have one more question though (no good deed goes unpunished). I customized the code a bit to fit into my workbook. At the beginning I defined the target worksheet (because it won't be called from the active sheet) and selected the appropriate spot for the data. That works fine. Then at the end, in place of your MsgBox, I included some formatting lines intended to remove the empty header cell, autofit the column widths for the new data, fill down some formulas in columns A:D, and select cell A4. Of those four tasks, only the last one works: the sub finishes without errors, but the formulas are not filled down, the columns are not autofitted, and the empty header cell remains. Any ideas? Code below. Thanks!

Code:
Public Sub Get_ESPN_Data()
'from http://www.mrexcel.com/forum/excel-questions/809589-excel-visual-basic-applications-scrape-data-series-web-pages.html#post3956908
    
    Dim baseURL As String, URL As String, params As String
    Dim XMLreq As Object
    Dim HTMLdoc As Object
    Dim playerTable As Object
    Dim tableRows As Object
    Dim tableCell As Object
    Dim dest As Range
    Dim playerData As Variant
    Dim HTMLrow As Integer, i As Integer, c As Integer
    Dim startIndex As Integer
    Dim p1 As Long, p2 As Long
    Dim Target As String
    Dim sh As Worksheet
    Dim RowLast As Integer

    Target = "ESPN-W"
    Set sh = Worksheets(Target)
        
    'erase existing data
    With sh
        .Range("A3") = "Data Scraped On " & Now()
        Set dest = .Range("E6")
        .Activate
        If dest.Offset(1, 0) <> "" Then
            RowLast = .Cells(Rows.Count, "E").End(xlUp).Row
            .Range(Cells(dest.Row + 1, dest.Column), Cells(RowLast, 18)).ClearContents
            .Range(Cells(dest.Row + 2, 1), Cells(RowLast, 4)).Clear
        End If
    End With

    baseURL = "http://games.espn.go.com/ffl/playertable/prebuilt/freeagency"
    params = "leagueId=228988&teamId=10&seasonId=2014&=undefined&avail=-1&context=freeagency&view=overview&startIndex=0"
          
    Set XMLreq = CreateObject("MSXML2.XMLhttp")

    'For first page of results start at HTML row index 1 to include column headings in extracted data
    
    HTMLrow = 1
    
    'Request all pages
    
    Do
        With XMLreq
            URL = baseURL & "?" & params & "&r=" & CLng(Rnd() * 99999999)
            .Open "GET", URL, False
            .send
            Set HTMLdoc = CreateObject("HTMLFile")
            HTMLdoc.body.innerHTML = .responseText
        End With
        
        'Extract player table into array
        '< table id="playertable_0" class="playerTableTable tableBody" cellspacing="1" cellpadding="0" border="0">
        
        Set playerTable = HTMLdoc.getElementById("playertable_0")
        Set tableRows = playerTable.Rows
        ReDim playerData(1 To tableRows.Length - HTMLrow, 1 To tableRows(HTMLrow).Cells.Length)
        
        i = 1
        While HTMLrow < tableRows.Length
            c = 1
            For Each tableCell In tableRows(HTMLrow).Cells
                If tableCell.innerText <> "" Or tableCell.cellIndex = 4 Then
                    playerData(i, c) = tableCell.innerText
                    c = c + tableCell.colSpan
                End If
            Next
            i = i + 1
            HTMLrow = HTMLrow + 1
        Wend
        
        'Copy array to sheet cells
        
        dest.Resize(UBound(playerData, 1), UBound(playerData, 2)).Value = playerData
        Set dest = dest.Offset(UBound(playerData, 1))
        dest.Select
        DoEvents
        
        'For subsequent pages start at HTML row index 2 to ignore column headings
    
        HTMLrow = 2
        
        'Find the "NEXT»" link and extract the parameters from its onclick attribute
        '<a href="#" onclick="players('leagueId=228988&teamId=10&seasonId=2014&=undefined&avail=-1&context=freeagency&view=overview&startIndex=50'); return false;">NEXT»</span></a>
        
        params = ""
        i = 0
        While i < HTMLdoc.Links.Length And params = ""
            If HTMLdoc.Links(i).innerText = "NEXT»" Then
                p1 = InStr(HTMLdoc.Links(i).onclick, "'") + 1
                p2 = InStr(p1, HTMLdoc.Links(i).onclick, "'")
                params = Mid(HTMLdoc.Links(i).onclick, p1, p2 - p1)
            End If
            i = i + 1
        Wend
        
    Loop Until params = ""
    
   'Delete empty data header cell, format new data & fill formulas down in columns A:D starting in row 7
    With sh
        .Range(Cells(dest.Row, dest.Column + 4), Cells(dest.Row, dest.Column + 4)).Delete Shift:=xlToLeft
        RowLast = .Cells(Rows.Count, "E").End(xlUp).Row
        .Range(Cells(dest.Row, dest.Column), Cells(RowLast, 18)).Columns.AutoFit
        .Range(Cells(dest.Row + 1, 1), Cells(RowLast, 4)).FillDown
        .Range("A4").Select
    End With
    
End Sub
 
Upvote 0
remove the empty header cell, autofit column width for the new data, fill down some formulas in columns A:D and select cell A4.
Have you tried recording a macro of the manual steps? This should get you very close to what you need and then edit it as required.
 
Upvote 0
Have you tried recording a macro of the manual steps? This should get you very close to what you need and then edit it as required.

Good suggestion. But the thing is, that section of code works fine on its own. If I put the last section (starting with "With sh") into a standalone macro, it works. When I include it at the end of the macro above, the only line that works is selecting cell A4. So it seems to be something about how this section interacts with the rest of the code, rather than the specific content of the section.
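One way to narrow it down would be to drop a couple of Debug.Print lines in just before the With sh block (a diagnostic sketch using the sub's existing dest and sh variables) to see whether dest still points where the formatting code expects once the scrape loop has finished:
Code:
    'Diagnostic sketch: check what dest refers to after the scrape loop,
    'immediately before the formatting block runs.
    Debug.Print "dest after loop: " & dest.Address(External:=True)
    Debug.Print "last used row in column E: " & sh.Cells(sh.Rows.Count, "E").End(xlUp).Row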
 
Upvote 0
Thanks for the lead, l3g4to. I will give your suggestion a try on my next project. John_W's code for scraping the web data actually works perfectly, though, and is pretty fast. The problematic chunk is something I added at the end to fit the captured data into my pre-existing worksheet; that is where the problem lies. Unfortunately, I haven't solved it yet.
 
Upvote 0
Got it working. For some reason I had to repeat the definition for the variable "dest" in the last "with" clause, and then everything worked fine. Not sure why I had to do that, but don't really care now that it works!
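For completeness, a sketch of what that change probably amounts to (an assumption reconstructed from the post above, not the poster's exact code; it reuses the sub's existing sh, dest and RowLast variables): re-point dest at the top of the scraped block before formatting, and qualify the Cells calls with the With block's dot so they always refer to sh:
Code:
    'Sketch only: re-establish dest before formatting, since the scrape loop
    'leaves it offset to the row below the last block of data.
    With sh
        Set dest = .Range("E6")
        .Range(.Cells(dest.Row, dest.Column + 4), .Cells(dest.Row, dest.Column + 4)).Delete Shift:=xlToLeft
        RowLast = .Cells(.Rows.Count, "E").End(xlUp).Row
        .Range(.Cells(dest.Row, dest.Column), .Cells(RowLast, 18)).Columns.AutoFit
        .Range(.Cells(dest.Row + 1, 1), .Cells(RowLast, 4)).FillDown
        .Range("A4").Select
    End With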
 
Upvote 0
At the risk of incurring scorn for resurrecting a two-year-old thread, here goes. I am trying to do exactly the same thing I was two years ago (yes, I am the OP), but the code is no longer working. When I run this VBA, it throws the following error:
Run-time error ‘438’:
Object doesn’t support this property or method

The URL structure has changed a bit from 2014. The new URL for the first page I am trying to access is:
Code:
http://games.espn.com/ffl/freeagency?leagueId=228988&seasonId=2016&avail=-1&context=freeagency&view=overview

And the URL for subsequent pages is (incrementing by 50):
Code:
http://games.espn.com/ffl/freeagency?leagueId=228988&seasonId=2016&avail=-1&context=freeagency&view=overview&startIndex=50

Here is the code I am using, which has been modified to reflect the new URLs above:
Code:
Public Sub Get_Player_Data3()
    
    Dim baseURL As String, URL As String, params As String
    Dim XMLreq As Object
    Dim HTMLdoc As Object
    Dim playerTable As Object
    Dim tableRows As Object
    Dim tableCell As Object
    Dim dest As Range
    Dim playerData As Variant
    Dim HTMLrow As Integer, i As Integer, c As Integer
    Dim startIndex As Integer
    Dim p1 As Long, p2 As Long
    
    With ActiveSheet
        Set dest = .Range("A1")
        .Activate
        .Cells.Clear
    End With

    baseURL = "http://games.espn.com/ffl/freeagency?leagueId=228988&seasonId=2016&avail=-1&context=freeagency&view=overview"
    params = "&startIndex=0"
          
    Set XMLreq = CreateObject("MSXML2.XMLhttp")

'For first page of results start at HTML row index 1 to include column headings in extracted data
    HTMLrow = 1
    
'Request all pages
    Do
        With XMLreq
            URL = baseURL & "?" & params & "&r=" & CLng(Rnd() * 99999999)
            .Open "GET", URL, False
            .send
            Set HTMLdoc = CreateObject("HTMLFile")
            HTMLdoc.body.innerHTML = .responseText
        End With
        
'Extract player table into array
        Set playerTable = HTMLdoc.getElementById("playertable_0")
        Set tableRows = playerTable.Rows
        ReDim playerData(1 To tableRows.Length - HTMLrow, 1 To tableRows(HTMLrow).Cells.Length)
        
        i = 1
        While HTMLrow < tableRows.Length
            c = 1
            For Each tableCell In tableRows(HTMLrow).Cells
                If tableCell.innerText <> "" Or tableCell.cellIndex = 4 Then
                    playerData(i, c) = tableCell.innerText
                    c = c + tableCell.colSpan
                End If
            Next
            i = i + 1
            HTMLrow = HTMLrow + 1
        Wend
        
'Copy array to sheet cells
        dest.Resize(UBound(playerData, 1), UBound(playerData, 2)).Value = playerData
        Set dest = dest.Offset(UBound(playerData, 1))
        dest.Select
        DoEvents
        
'For subsequent pages start at HTML row index 2 to ignore column headings
        HTMLrow = 2
        
'Find the "NEXT»" link and extract the parameters from its onclick attribute
        params = ""
        i = 0
        While i < HTMLdoc.Links.Length And params = ""
            If HTMLdoc.Links(i).innerText = "NEXT»" Then
                p1 = InStr(HTMLdoc.Links(i).onclick, "'") + 1
                p2 = InStr(p1, HTMLdoc.Links(i).onclick, "'")
                params = Mid(HTMLdoc.Links(i).onclick, p1, p2 - p1)
            End If
            i = i + 1
        Wend
        
    Loop Until params = ""
    
'Delete empty D1
    dest.Parent.Range("D1").Delete Shift:=xlToLeft
    
    MsgBox "Finished"
    
End Sub

The error comes up at the line that reads:
Code:
p1 = InStr(HTMLdoc.Links(i).onclick, "'") + 1

Any idea what might be causing the error? Thanks for the help!
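Two hedged observations that might help whoever picks this up. First, the 2016 baseURL already contains a query string, so URL = baseURL & "?" & params & "&r=" & ... inserts a second "?" into the address; concatenating simply as baseURL & params & "&r=" & ... (with params beginning "&startIndex=") would avoid that. Second, error 438 means the object returned by HTMLdoc.Links(i) is not exposing an onclick property in this environment; one way to sidestep the onclick parsing altogether (a sketch under the assumption that the returned HTML mentions the next page's startIndex somewhere, such as in the NEXT link, not a confirmed fix) is to advance startIndex directly and stop when the response no longer references another page:
Code:
        'Sketch only: replace the NEXT»/onclick parsing at the bottom of the loop
        'with a simple startIndex counter. startIndex is already declared in the
        'sub (and starts at 0, matching the first request).
        If InStr(XMLreq.responseText, "startIndex=" & (startIndex + 50)) > 0 Then
            startIndex = startIndex + 50
            params = "&startIndex=" & startIndex
        Else
            params = ""
        End If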
 
Upvote 0
