Scrape HTML Data

Hi,

I think this macro should work.

If you paste it into a standard Module it will write the table to Sheet1 starting in cell A1.

Code:
Sub Get_Batting_Gamelogs()

    Dim Doc As HTMLDocument
    Dim htmTable As HTMLTable
    Dim ws As Worksheet
    Dim i As Long, j As Long
    
    Set ws = ThisWorkbook.Worksheets("Sheet1")
    Set Doc = New HTMLDocument
    
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "http://www.baseball-reference.com/players/gl.cgi?id=harpebr03&t=b&year=2015"
        .send
        Do: DoEvents: Loop Until .readyState = 4
        Doc.body.innerHTML = .responseText
    End With
    
    Set htmTable = Doc.getElementById("batting_gamelogs")
    
    With htmTable
        For i = 1 To .Rows.Length
            For j = 1 To .Rows.Item(i - 1).Cells.Length
                ws.Cells(i, j).Value = .Rows(i - 1).Cells(j - 1).innerText
            Next
        Next
    End With

End Sub

How does it work?

The way the code has been written you need to add references to the Microsoft XML, v6.0 and Microsoft HTML Object Library.

The macro opens the url and waits for a response.
When a response is received it copies the web page text to something that handles HTML code.
That looks for the table called batting_gamelogs.
The data in the rows and columns of the table are written to the worksheet Sheet1.
 
Upvote 0
RickXL,

This works great! I built a solution myself after posting and it is no where close to as clean and efficient as this. Thanks so much for your help.

The next step in my project is to loop this code to scrape the "batting_gamelogs" table for a list of players in sheet "PlayerList". A sample PlayerList table is here:


[TABLE="class: grid, width: 500, align: left"]
<tbody>[TR]
[TD]1[/TD]
[TD]A[/TD]
[TD]B[/TD]
[TD]C[/TD]
[/TR]
[TR]
[TD]2[/TD]
[TD]ID[/TD]
[TD]Name[/TD]
[TD]Link[/TD]
[/TR]
[TR]
[TD]3[/TD]
[TD][TABLE="width: 80"]
<tbody>[TR]
[TD="width: 80"]jonesad01[/TD]
[/TR]
[TR]
[TD="width: 80"][/TD]
[/TR]
</tbody>[/TABLE]
[/TD]
[TD][TABLE="width: 152"]
<tbody>[TR]
[TD="class: xl65, width: 152"]ADAM JONES[/TD]
[/TR]
</tbody>[/TABLE]
[/TD]
[TD][TABLE="width: 540"]
<tbody>[TR]
[TD="width: 540"]http://www.baseball-reference.com/players/gl.cgi?id=jonesad01&t=b&year=2015[/TD]
[/TR]
</tbody>[/TABLE]
[/TD]
[/TR]
[TR]
[TD]4[/TD]
[TD][TABLE="width: 80"]
<tbody>[TR]
[TD="width: 80"]larocad01[/TD]
[/TR]
</tbody>[/TABLE]
[/TD]
[TD][TABLE="width: 152"]
<tbody>[TR]
[TD="class: xl65, width: 152"]ADAM LAROCHE[/TD]
[/TR]
</tbody>[/TABLE]
[/TD]
[TD][TABLE="width: 540"]
<tbody>[TR]
[TD="width: 540"]http://www.baseball-reference.com/players/gl.cgi?id=larocad01&t=b&year=2015[/TD]
[/TR]
</tbody>[/TABLE]
[/TD]
[/TR]
[TR]
[TD]5[/TD]
[TD][TABLE="width: 80"]
<tbody>[TR]
[TD="width: 80"]beltrad01[/TD]
[/TR]
</tbody>[/TABLE]
[/TD]
[TD][TABLE="width: 152"]
<tbody>[TR]
[TD="class: xl65, width: 152"]ADRIAN BELTRE[/TD]
[/TR]
</tbody>[/TABLE]
[/TD]
[TD][TABLE="width: 540"]
<tbody>[TR]
[TD="width: 540"]http://www.baseball-reference.com/players/gl.cgi?id=beltrad01&t=b&year=2015[/TD]
[/TR]
</tbody>[/TABLE]
[/TD]
[/TR]
[TR]
[TD]6[/TD]
[TD][TABLE="width: 80"]
<tbody>[TR]
[TD="width: 80"]gonzaad01[/TD]
[/TR]
</tbody>[/TABLE]
[/TD]
[TD][TABLE="width: 152"]
<tbody>[TR]
[TD="class: xl65, width: 152"]ADRIAN GONZALEZ[/TD]
[/TR]
</tbody>[/TABLE]
[/TD]
[TD][TABLE="width: 540"]
<tbody>[TR]
[TD="width: 540"]http://www.baseball-reference.com/players/gl.cgi?id=gonzaad01&t=b&year=2015[/TD]
[/TR]
</tbody>[/TABLE]
[/TD]
[/TR]
</tbody>[/TABLE]
So the code would loop through x times (x being number of links in column C) and paste the results into 1 table in sheet "DailyData"

The struggle I am having is trying to get the DailyData sheet to 1) have the players ID (from PlayerList Column A) as the identifier of who's stats we are looking at, and 2) not including the totals row from the HTML table (row 10 in RickXl's above code).

Again, the goal is to have all daily stats in one table of the players listed in sheet "PlayerList"

I'm a newbie here and don't know how to paste my worksheet in here, sorry for being wordy but I hope you can help!!

Thanks
 
Upvote 0
Hi again,

Try this instead:
Code:
' Note: Use Tools-->References to add:
'    Microsoft XML, v6.0
'    Microsoft HTML Object Library

Sub Get_Batting_Gamelogs()

    Dim Doc As HTMLDocument
    Dim htmTable As HTMLTable
    Dim ws2 As Worksheet, ws1 As Worksheet
    Dim i As Long, j As Long, lr As Long, idRow As Long, outRow As Long, outStrt As Long
    Dim ID As String
    
    Set ws1 = ThisWorkbook.Worksheets("PlayerList")
    Set ws2 = ThisWorkbook.Worksheets("Sheet1")
    
    Set Doc = New HTMLDocument
    
    With ws1
        lr = .Cells(.Rows.Count, "A").End(xlUp).Row
    End With
    
    ws2.Cells.Clear
    
    outRow = 1
    outStrt = 1
    For idRow = 2 To lr
        ID = ws1.Cells(idRow, 1).Value
        With CreateObject("MSXML2.XMLHTTP")
            .Open "GET", "http://www.baseball-reference.com/players/gl.cgi?id=" & ID & "&t=b&year=2015"
            .send
            Do: DoEvents: Loop Until .readyState = 4
            Doc.body.innerHTML = .responseText
        End With
        
        Set htmTable = Doc.getElementById("batting_gamelogs")
        
        With htmTable
            For i = outStrt To .Rows.Length - 1
                ws2.Cells(outRow, 1).Value = ID
                For j = 1 To .Rows.Item(i - 1).Cells.Length
                    ws2.Cells(outRow, j + 1).Value = .Rows(i - 1).Cells(j - 1).innerText
                Next
                outRow = outRow + 1
            Next
            outStrt = 2 ' ie Skip headings for subsequent IDs
        End With
        
    Next idRow
    
    ws2.Cells.Columns.AutoFit
    
End Sub

Sorry, it is a bit rushed (it is my bedtime!).

regards,
 
Upvote 0
Well done!

The output is exactly as I was describing and I'm learning so much from your code and seeing what you did.

I have seen 2 situations that cause errors and for the macro to stop, both of which are caused by the user populating the spreadsheet.

1) the ID field was entered incorrectly and no data is returned
2) the link works but that player has not played any games (cannot find "batting_gamelogs" table)

Both of these situations will cause the .Rows.Length to error out.

Any suggestions?
 
Upvote 0
Hi,

If no data is returned then htmTable will be "Nothing". So you could try this:

Rich (BB code):
' Note: Use Tools-->References to add:
'    Microsoft XML, v6.0
'    Microsoft HTML Object Library

Sub Get_Batting_Gamelogs()

    Dim Doc As HTMLDocument
    Dim htmTable As HTMLTable
    Dim ws2 As Worksheet, ws1 As Worksheet
    Dim i As Long, j As Long, lr As Long, idRow As Long, outRow As Long, outStrt As Long
    Dim ID As String
    
    Set ws1 = ThisWorkbook.Worksheets("PlayerList")
    Set ws2 = ThisWorkbook.Worksheets("Sheet1")
    
    Set Doc = New HTMLDocument
    
    With ws1
        lr = .Cells(.Rows.Count, "A").End(xlUp).Row
    End With
    
    ws2.Cells.Clear
    
    outRow = 1
    outStrt = 1
    For idRow = 2 To lr
        ID = ws1.Cells(idRow, 1).Value
        With CreateObject("MSXML2.XMLHTTP")
            .Open "GET", "http://www.baseball-reference.com/players/gl.cgi?id=" & ID & "&t=b&year=2015"
            .send
            Do: DoEvents: Loop Until .readyState = 4
            Doc.body.innerHTML = .responseText
        End With
        
        Set htmTable = Doc.getElementById("batting_gamelogs")
        
        If Not htmTable Is Nothing Then
            With htmTable
                For i = outStrt To .Rows.Length - 1
                    ws2.Cells(outRow, 1).Value = ID
                    For j = 1 To .Rows.Item(i - 1).Cells.Length
                        ws2.Cells(outRow, j + 1).Value = .Rows(i - 1).Cells(j - 1).innerText
                    Next
                    outRow = outRow + 1
                Next
                outStrt = 2 ' ie Skip headings for subsequent IDs
            End With
        End If
        
    Next idRow
    
    ws2.Cells.Columns.AutoFit
    
End Sub
 
Upvote 0

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top