Excel VBA To Scrape Data From Series Of Web Pages

gurs

Board Regular
Joined
Dec 22, 2010
Messages
52
I am looking for assistance with some VBA code intended to collect information from a web site that displays data in groups of 50 rows, requiring you to click the "Next" button each time you want to view the next set of rows. I am trying to craft some code that will grab all of the rows at once and dump them into Excel, or at least loop through the various pages and consolidate the results in Excel. I even tried just setting up 20 different tabs in Excel that would each use the Excel Web Query to grab 50 rows of data, but no matter how I play with the target URL every tab grabs the first 50 rows. That is why I ended up working on the approach below, which unfortunately is resulting in errors.

The URL of the first page of data I am trying to scrape is:
Code:
http://games.espn.go.com/ffl/freeagency?leagueId=228988&teamId=10&seasonId=2014#&seasonId=2014&=undefined&avail=-1&context=freeagency&view=overview

The URL of the second page of data is:
Code:
http://games.espn.go.com/ffl/freeagency?leagueId=228988&teamId=10&seasonId=2014#&seasonId=2014&=undefined&avail=-1&context=freeagency&view=overview&startIndex=50

Subsequent pages increment by 50.

Here is the VBA code I have so far (which is based on the work of John_W in this earlier thread):
Code:
Public Sub ScrapeData()

'based on solution in http://www.mrexcel.com/forum/excel-questions/677031-pull-entire-web-table-not-just-what-visible-webpage.html
    
    Dim baseURL As String, URL As String, params As String
    Dim XMLreq As Object
    Dim HTMLdoc As Object
    Dim playerTable As Object
    Dim tableRows As Object
    Dim tableCell As Object
    Dim dest As Range
    Dim playerData As Variant
    Dim HTMLrow As Integer, i As Integer, c As Integer
    Dim p1 As Long, p2 As Long
    
    Set dest = ActiveSheet.Range("A1")
    dest.Parent.Activate
    dest.Parent.Cells.Clear

    baseURL = "http://games.espn.go.com/ffl/freeagency"
    params = "leagueId=228988&teamId=10&seasonId=2014#&seasonId=2014&=undefined&avail=-1&context=freeagency&view=overview&startIndex="
       
    Set XMLreq = CreateObject("MSXML2.XMLhttp")

    'For first page of results start at HTML row index 1 to include column headings in extracted data
    
    HTMLrow = 1
    
    'Request all pages
    
    Do
        With XMLreq
            URL = baseURL & "?" & params & CLng(Rnd() * 99999999)
            Debug.Print Now, URL
            .Open "POST", URL, False
            .send
            Set HTMLdoc = CreateObject("HTMLFile")
            HTMLdoc.body.innerHTML = .responseText
        End With
        
        'Extract player table into array
        '< table id="playertable_0" class="playerTableTable
        
        Set playerTable = HTMLdoc.getElementById("playertable_0")
        Set tableRows = playerTable.Rows
        ReDim playerData(1 To tableRows.Length - HTMLrow, 1 To tableRows(HTMLrow).Cells.Length)
        
        i = 1
        While HTMLrow < tableRows.Length
            c = 1
            For Each tableCell In tableRows(HTMLrow).Cells
                If tableCell.innerText <> "" Then
                    playerData(i, c) = tableCell.innerText
                    c = c + 1
                End If
            Next
            i = i + 1
            HTMLrow = HTMLrow + 1
        Wend
        
        'Copy array to sheet cells
        
        dest.Resize(UBound(playerData, 1), UBound(playerData, 2)).Value = playerData
        Set dest = dest.Offset(UBound(playerData, 1))
        dest.Select
        DoEvents
        
        'For subsequent pages start at HTML row index 2 to ignore column headings
    
        HTMLrow = 2
        
        'Find NEXT» link and extract parameters from *******
        '< a href="#" *******="players('leagueId=306149&teamId=2&seasonId=2013&=undefined&gamesInScoringPeriodId=66&scoringPeriodId=65&view=stats&context=freeagency&version=last7&startIndex=50'); return false;">NEXT»< /span>< /a>
        
        params = ""
        i = 0
        While i < HTMLdoc.Links.Length And params = ""
            If HTMLdoc.Links(i).innerText = "NEXT»" Then
                p1 = InStr(HTMLdoc.Links(i).*******, "'") + 1
                p2 = InStr(p1, HTMLdoc.Links(i).*******, "'")
                params = Mid(HTMLdoc.Links(i).*******, p1, p2 - p1)
            End If
            i = i + 1
        Wend
        
    Loop Until params = ""
    
    MsgBox "Finished"
    
End Sub

Please note that this message board removes every instance of "o_nclick" without the underscore from the above code and inserts "*******", but rest assured that my VBA contains "o_nclick" without the underscore.

The code above is throwing the following error at the line “Set tableRows = playerTable.Rows”:
Run-time error ‘91’:
Object variable or With block variable not set

I have tried (1) changing the definition of URL = baseURL & "?" & params & CLng(Rnd() * 99999999), and (2) changing "Set playerTable = HTMLdoc.getElementById("playertable_0")" to "Set playerTable = HTMLdoc.getElementById("playertable_1")". Neither changed the resulting error.

Any help would be greatly appreciated!
 
This revised code doesn't seem to throw an error, but it just grabs the first page of data over and over again until I manually stop the macro. I have let it go until it filled 40k rows of data with the same 50 entries. It is an infinite loop.

Code:
Public Sub Get_ESPN_Data()
'from http://www.mrexcel.com/forum/excel-questions/809589-excel-visual-basic-applications-scrape-data-series-web-pages.html#post3956908
    
    Dim baseURL As String, URL As String, params As String
    Dim XMLreq As Object
    Dim HTMLdoc As Object
    Dim playerTable As Object
    Dim tableRows As Object
    Dim tableCell As Object
    Dim dest As Range
    Dim playerData As Variant
    Dim HTMLrow As Integer, i As Integer, c As Integer
    Dim startIndex As Integer
    Dim p1 As Long, p2 As Long
    Dim Target As String
    Dim sh As Worksheet
    Dim RowLast As Long

    'remember starting location so you can return there at the end of the macro
    Dim StartSheet As Worksheet
    Dim StartCell As String
    Set StartSheet = ActiveWorkbook.ActiveSheet
    StartCell = ActiveCell.Address
            
    Application.Calculation = xlCalculationManual
    Application.ScreenUpdating = False

    Target = "ESPN-W"
    Set sh = Worksheets(Target)
        
    'erase existing data
    With sh
        .Range("A3") = "Data Scraped On " & Now()
        Set dest = .Range("A6")
        .Activate
        If dest.Offset(1, 0) <> "" Then
            RowLast = .Cells(Rows.Count, "A").End(xlUp).Row
            .Range(Cells(dest.Row + 1, dest.Column), Cells(RowLast, 14)).ClearContents
            .Range(Cells(dest.Row + 2, 15), Cells(RowLast, 18)).Clear
        End If
    End With
    
    'insert extra columns to make room for fact that ESPN column headings end up offsetting to the right, don't want to overwrite data
    sh.Range("O:R").Insert
    
    'define base URL and add-on URL
    baseURL = "http://games.espn.com/ffl/freeagency?leagueId=228988&seasonId=2016&avail=-1&context=freeagency&view=overview&startIndex=0"
    params = ""
          
    Set XMLreq = CreateObject("MSXML2.XMLhttp")

    'For first page of results start at HTML row index 1 to include column headings in extracted data
    HTMLrow = 1
    
    'Request all pages
    Do
        With XMLreq
            URL = baseURL & "?" & params & "&r=" & CLng(Rnd() * 99999999)
            .Open "GET", URL, False
            .send
            Set HTMLdoc = CreateObject("HTMLFile")
            HTMLdoc.body.innerHTML = .responseText
        End With
        
        'Extract player table into array
        '< table id="playertable_0" class="playerTableTable tableBody" cellspacing="1" cellpadding="0" border="0">
        Set playerTable = HTMLdoc.getElementById("playertable_0")
        Set tableRows = playerTable.Rows
        ReDim playerData(1 To tableRows.Length - HTMLrow, 1 To tableRows(HTMLrow).Cells.Length)
        
        i = 1
        While HTMLrow < tableRows.Length
            c = 1
            For Each tableCell In tableRows(HTMLrow).Cells
                If tableCell.innerText <> "" Or tableCell.cellIndex = 4 Then
                    playerData(i, c) = tableCell.innerText
                    c = c + tableCell.colSpan
                End If
            Next
            i = i + 1
            HTMLrow = HTMLrow + 1
        Wend
        
        'Copy array to sheet cells
        dest.Resize(UBound(playerData, 1), UBound(playerData, 2)).Value = playerData
        Set dest = dest.Offset(UBound(playerData, 1))
        dest.Select
        DoEvents
        
        'For subsequent pages start at HTML row index 2 to ignore column headings
        HTMLrow = 2
        
        'Find NEXT» link and extract parameters from *******
        i = 0
        While i < HTMLdoc.Links.Length And params = ""
            If HTMLdoc.Links(i).innerText = "NEXT»" Then
                p1 = InStr(HTMLdoc.Links(i).*******, "'") + 1
                p2 = InStr(p1, HTMLdoc.Links(i).*******, "'")
                params = Mid(HTMLdoc.Links(i).*******, p1, p2 - p1)
            End If
            i = i + 1
        Wend
        
    Loop Until params = ""
    
   Application.Calculation = xlCalculationAutomatic
   
   'Delete empty data header cell, format new data & fill formulas down in columns A:D starting in row 7
    With sh
        '.Activate
        Set dest = .Range("A6")
        .Range(Cells(dest.Row, dest.Column + 3), Cells(dest.Row, dest.Column + 3)).Delete Shift:=xlToLeft
        .Range(Cells(dest.Row, dest.Column + 17), Cells(dest.Row, dest.Column + 17)).Insert Shift:=xlToRight
        .Columns("O:R").Delete
         RowLast = .Cells(Rows.Count, "A").End(xlUp).Row
        .Range(Cells(dest.Row + 1, 14), Cells(RowLast, 18)).FillDown
        .Range(Cells(dest.Row, dest.Column), Cells(RowLast, 18)).Columns.AutoFit
        .Range("A4").Select
    End With
    
    Application.ScreenUpdating = True
    
    StartSheet.Activate

End Sub
 
Last edited:
Upvote 0

Excel Facts

Wildcard in VLOOKUP
Use =VLOOKUP("Apple*" to find apple, Apple, or applesauce
I haven't checked, but it looks like you need an account and be logged in to access the 2nd and subsequent pages.
 
Upvote 0
Hi John! Thanks for re-engaging on this.

The answer to your question about being logged in is "sort of". If you visit the first page of results and click on the "NEXT" link at the bottom of the page, page 2 won't populate with results unless you have logged in first. But, if you access page 2 directly by using the correct URL (in this case, ht_tp://games.espn.com/ffl/freeagency?leagueId=228988&seasonId=2016&avail=-1&context=freeagency&view=overview&startIndex=50 (without the underscore, of course)), the results will display just fine. So I suppose I either need the VBA to log in somehow, or I need the VBA to directly call the URL for each page instead of relying on clicking the "NEXT" link.

FYI, right now this query returns about 925 results. You can tell that you have found the last page of results because there is no "NEXT" link, only a "PREVIOUS" link.
 
Upvote 0
Why not loop the startIndex parameter. As you pointed out it increases by 50 for every next page. Last page is: startIndex=900.
Code:
For x = 0 to 900 step 50

Code:
_http://games.espn.com/ffl/freeagency?leagueId=228988&seasonId=2016&avail=-1&context=freeagency&view=overview&startIndex=900
 
Upvote 0
That makes sense to me, although I'm not quite sure how to work it into the code. I have dimmed a new variable "Increment" as an integer. Do I then define "params" as being equal to Increment? And then how to I incorporate the "For Increment = 0 to 900 step 50" concept into the "Do" loop?
 
Upvote 0
I managed to get it working. Rather than using your suggestion of "For x = 0 to 900 step 50", I built a loop that checks whether there is a "NEXT" link on the page (since the last page of data doesn't have that link, only PREVIOUS). I did this by re-using your code that previously functioned to select the next page of data, before using that link required a login. So now my code starts from the Base URL, grabs the data, and if that URL has a NEXT link, it increments the URL by 50 and repeats. This way, if the number of players listed by ESPN goes below 901 or above 950 the code will still get all the data without errors. Code is below, for posterity. I'm sure it isn't the most artful, but it does work! Thanks again for the help John.

Code:
Public Sub Get_ESPN_Data()
'from http://www.mrexcel.com/forum/excel-questions/809589-excel-visual-basic-applications-scrape-data-series-web-pages.html
   
    Dim BaseURL As String, URL As String, params As String
    Dim XMLreq As Object
    Dim HTMLdoc As Object
    Dim playerTable As Object
    Dim tableRows As Object
    Dim tableCell As Object
    Dim dest As Range
    Dim playerData As Variant
    Dim HTMLrow As Integer, i As Integer, c As Integer
    Dim startIndex As Integer
    Dim p1 As Long, p2 As Long
    Dim Target As String
    Dim sh As Worksheet
    Dim RowLast As Long
    Dim Increment As Integer
    Dim IsNext As Integer
   
    'remember starting location so you can return there at the end of the macro
    Dim StartSheet As Worksheet
    Dim StartCell As String
    Set StartSheet = ActiveWorkbook.ActiveSheet
    StartCell = ActiveCell.Address
           
    Application.Calculation = xlCalculationManual  'TURN OFF FOR DEBUGGING
    Application.ScreenUpdating = False  'TURN OFF FOR DEBUGGING
 
    Target = "ESPN-W"
    Set sh = Worksheets(Target)
       
    'erase existing data
    With sh
        .Range("A3") = "Data Scraped On " & Now()
        Set dest = .Range("A6")
        .Activate
        If dest.Offset(1, 0) <> "" Then
            RowLast = .Cells(Rows.Count, "A").End(xlUp).Row
            .Range(Cells(dest.Row + 1, dest.Column), Cells(RowLast, 14)).ClearContents
            .Range(Cells(dest.Row + 2, 15), Cells(RowLast, 18)).Clear
        End If
    End With
   
    'insert extra columns to make room for fact that ESPN column headings end up offsetting to the right, don't want to overwrite data
    sh.Range("O:R").Insert
   
    'define base URL, add-on URL and increment
    BaseURL = Worksheets("ESPN URLs").Range("C6").Value '2016 value is http://games.espn.com/ffl/freeagency?leagueId=228988&seasonId=2016&avail=-1&context=freeagency&view=overview&startIndex=
    params = ""
    Increment = 0
    IsNext = 1
         
    'define XMLreq
    Set XMLreq = CreateObject("MSXML2.XMLhttp")
 
    'For first page of results start at HTML row index 1 to include column headings in extracted data
    HTMLrow = 1
   
    'Request first page
    With XMLreq
        URL = BaseURL & Increment
        .Open "GET", URL, False
        .send
        Set HTMLdoc = CreateObject("HTMLFile")
        HTMLdoc.body.innerHTML = .responseText
    End With
   
    'Extract player table into array
    Set playerTable = HTMLdoc.getElementById("playertable_0")
    Set tableRows = playerTable.Rows
    ReDim playerData(1 To tableRows.Length - HTMLrow, 1 To tableRows(HTMLrow).Cells.Length)
   
    i = 1
    While HTMLrow < tableRows.Length
        c = 1
        For Each tableCell In tableRows(HTMLrow).Cells
            If tableCell.innerText <> "" Or tableCell.cellIndex = 4 Then
                playerData(i, c) = tableCell.innerText
                c = c + tableCell.colSpan
            End If
        Next
        i = i + 1
        HTMLrow = HTMLrow + 1
    Wend
   
    'Copy array to sheet cells
    dest.Resize(UBound(playerData, 1), UBound(playerData, 2)).Value = playerData
    Set dest = dest.Offset(UBound(playerData, 1))
    dest.Select
    DoEvents
   
    'For subsequent pages start Increment 50
    Increment = Worksheets("ESPN URLs").Range("C4").Value
   
    'Request next page loop
    Do Until IsNext <> 1
        HTMLrow = 2 'For subsequent pages start at HTML row index 2 to ignore column headings
       
        With XMLreq
            URL = BaseURL & Increment
            .Open "GET", URL, False
            .send
            Set HTMLdoc = CreateObject("HTMLFile")
            HTMLdoc.body.innerHTML = .responseText
        End With
       
        'Extract player table into array
        Set playerTable = HTMLdoc.getElementById("playertable_0")
        Set tableRows = playerTable.Rows
       
        ReDim playerData(1 To tableRows.Length - HTMLrow, 1 To tableRows(HTMLrow).Cells.Length)
       
        i = 1
        While HTMLrow < tableRows.Length
            c = 1
            For Each tableCell In tableRows(HTMLrow).Cells
                If tableCell.innerText <> "" Or tableCell.cellIndex = 4 Then
                    playerData(i, c) = tableCell.innerText
                    c = c + tableCell.colSpan
                End If
            Next
            i = i + 1
            HTMLrow = HTMLrow + 1
        Wend
        
        'Copy array to sheet cells
        dest.Resize(UBound(playerData, 1), UBound(playerData, 2)).Value = playerData
        Set dest = dest.Offset(UBound(playerData, 1))
        dest.Select
        DoEvents
       
        'Check if page contains "NEXT" link
        params = ""
        i = 0
        IsNext = 0
        While i < HTMLdoc.Links.Length And params = ""
            If HTMLdoc.Links(i).innerText = "NEXT»" Then
                p1 = InStr(HTMLdoc.Links(i).*******, "'") + 1
                p2 = InStr(p1, HTMLdoc.Links(i).*******, "'")
                params = Mid(HTMLdoc.Links(i).*******, p1, p2 - p1)
                IsNext = 1
            End If
            i = i + 1
        Wend
       
        Increment = Increment + 50
       
    Loop
   
    'Delete empty data header cell, format new data & fill formulas down in columns O:R starting in row 7
    With sh
        Set dest = .Range("A6")
        .Range(Cells(dest.Row, dest.Column + 3), Cells(dest.Row, dest.Column + 3)).Delete Shift:=xlToLeft
        .Range(Cells(dest.Row, dest.Column + 17), Cells(dest.Row, dest.Column + 17)).Insert Shift:=xlToRight
        .Columns("O:R").Delete
         RowLast = .Cells(Rows.Count, "A").End(xlUp).Row
        .Range(Cells(dest.Row + 1, 14), Cells(RowLast, 18)).FillDown
        .Range(Cells(dest.Row, dest.Column), Cells(RowLast, 18)).Columns.AutoFit
        .Range("A4").Select
    End With
   
    'cleanup
    Application.Calculation = xlCalculationAutomatic
    sh.Cells.Font.Size = "8"
    Application.ScreenUpdating = True
    StartSheet.Activate
    Range(StartCell).Select
   
End Sub
 
Upvote 0
Yes, as simple as appending the next "&startIndex=" & startIndex number to the URL Though now the Next page check doesn't need the params stuff.
Code:
        i = 0
        IsNext = 0
        While i < HTMLdoc.Links.Length And IsNext = 0
            If HTMLdoc.Links(i).innerText = "NEXT»" Then IsNext = 1
            i = i + 1
        Wend
 
Upvote 0
Thanks John. I got my code cleaned up as you suggested. Looks like I'm good for another season!
 
Upvote 0
I am now trying to modify the code to scrape data from a different website. I am not sure how to reference the table with the player data on the new web page. In the code above, you used:
Code:
Set playerTable = HTMLdoc.getElementById("playertable_0")

The new target page is here:
Code:
http://fftoolbox.scout.com/football/2016/weeklycheatsheets.cfm?Player_Pos=RB&page=1&WeekNumber=2

I have checked the source code and it looks to me like the data table starts on row 466 with this code:
PHP:

But when I try using HTMLdoc.getElementById("proj") or HTMLdoc.getElementByClass("grid"), the code throws a Runtime Error 91 (object variable or with block variable not set) at the subsequent line in my VBA code, which is "Set tableRows = playerTable.Rows". I assume that the error is because playerTable isn't grabbing any data, and it is impossible to count the number of rows in nothingness.

Any suggestions?[TABLE="class: grid"]
[/TABLE]
 
Upvote 0
I have checked the source code and it looks to me like the data table starts on row 466 with this code

For some reason mrexcel doesn't want me to post the website code. It is
with greater than/less than symbols instead of the brackets.
 
Upvote 0

Forum statistics

Threads
1,223,214
Messages
6,170,771
Members
452,353
Latest member
strainu

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top