How to clean scrape a website for financial data?

jonathanwang003

Board Regular
Joined
May 9, 2014
Messages
129
I used to scrape financial statements cleanly into spreadsheets by examining the HTML as a table and ensuring there is data by checking the number of columns and rows. This code used to work, but I'm redirecting to advfn.com and I'm having trouble properly grabbing the table.

Code:
'Download Info
Public oXML As MSXML2.XMLHTTP
Public oHTML As MSHTML.HTMLDocument
Public lngRow As Long
Public lngCol As Long
Public lngTable As Long
Public lngX As Long
Public strUrl(1 To 6) As String
Public rngPasteDest As Range

sub Download_Financials()

    Set oXML = New MSXML2.XMLHTTP
    Set oHTML = New MSHTML.HTMLDocument
    Set rngPasteDest = Range("A1")
    
    strUrl(1) = "https://www.advfn.com/stock-market/NASDAQ/SMCI/financials"
    
    With oXML
        .Open "GET", strUrl(1), False
        .send
        oHTML.body.innerHTML = .ResponseText
    End With
    lngX = 0
    
    With oHTML.getElementsByTagName("Table 1")
    
        If .Length <= 1 Then
            If AttemptCount >= 10 Then
                AttemptCount = 0
                sheets(1).Cells.Clear
                Exit Sub
            End If
        
            AttemptCount = AttemptCount + 1
            GoTo ParsePage
        End If
        
        For lngTable = 0 To .Length - 1
            For lngRow = 0 To .Item(lngTable).Rows.Length - 1
                For lngCol = 0 To .Item(lngTable).Rows(lngRow).Cells.Length - 2
                
                rngPasteDest.Offset(lngRow + lngX, lngCol).Value = .Item(lngTable).Rows(lngRow).Cells(lngCol).innerText
                
                Next lngCol
            Next lngRow
            lngX = lngRow + lngX
        Next lngTable
        
    End With
    
    Set rngPasteDest = Nothing

End Sub



This code is a recording of the macro where I can dump the table I want using the data import from web feature in Excel, but it's incredibly sloppy. Is this something that can be cleaned up?


VBA Code:
    ActiveWorkbook.Queries.Add Name:="Table 1 (2)", Formula:= _
        "let" & Chr(13) & "" & Chr(10) & "    Source = Web.BrowserContents(""https://www.advfn.com/stock-market/NASDAQ/SMCI/financials"")," & Chr(13) & "" & Chr(10) & "    #""Extracted Table From Html"" = Html.Table(Source, {{""Column1"", ""DIV[id='quarterIncomeStatement'] > DIV.table-responsive > TABLE.table.financial-table:nth-child(1) > * > TR > :nth-child(1)""}, {""Column2"", ""DIV[id='quarterIncomeStatement'] > DIV.table-res" & _
        "ponsive > TABLE.table.financial-table:nth-child(1) > * > TR > :nth-child(2)""}, {""Column3"", ""DIV[id='quarterIncomeStatement'] > DIV.table-responsive > TABLE.table.financial-table:nth-child(1) > * > TR > :nth-child(3)""}, {""Column4"", ""DIV[id='quarterIncomeStatement'] > DIV.table-responsive > TABLE.table.financial-table:nth-child(1) > * > TR > :nth-child(4)""}, " & _
        "{""Column5"", ""DIV[id='quarterIncomeStatement'] > DIV.table-responsive > TABLE.table.financial-table:nth-child(1) > * > TR > :nth-child(5)""}, {""Column6"", ""DIV[id='quarterIncomeStatement'] > DIV.table-responsive > TABLE.table.financial-table:nth-child(1) > * > TR > :nth-child(6)""}}, [RowSelector=""DIV[id='quarterIncomeStatement'] > DIV.table-responsive > TABLE." & _
        "table.financial-table:nth-child(1) > * > TR""])," & Chr(13) & "" & Chr(10) & "    #""Changed Type"" = Table.TransformColumnTypes(#""Extracted Table From Html"",{{""Column1"", type text}, {""Column2"", type text}, {""Column3"", type text}, {""Column4"", type text}, {""Column5"", type text}, {""Column6"", type text}})" & Chr(13) & "" & Chr(10) & "in" & Chr(13) & "" & Chr(10) & "    #""Changed Type"""
    With ActiveSheet.ListObjects.Add(SourceType:=0, Source:= _
        "OLEDB;Provider=Microsoft.Mashup.OleDb.1;Data Source=$Workbook$;Location=""Table 1 (2)"";Extended Properties=""""" _
        , Destination:=Range("$A$1")).QueryTable
        .CommandType = xlCmdSql
        .CommandText = Array("SELECT * FROM [Table 1 (2)]")
        .RowNumbers = False
        .FillAdjacentFormulas = False
        .PreserveFormatting = True
        .RefreshOnFileOpen = False
        .BackgroundQuery = True
        .RefreshStyle = xlInsertDeleteCells
        .SavePassword = False
        .SaveData = True
        .AdjustColumnWidth = True
        .RefreshPeriod = 0
        .PreserveColumnInfo = True
        .ListObject.DisplayName = "Table_1__2"
        .Refresh BackgroundQuery:=False
    End With
 

Excel Facts

How can you automate Excel?
Press Alt+F11 from Windows Excel to open the Visual Basic for Applications (VBA) editor.
Have you considered using Power Query aka Get and Transform Data. No coding required. Can do it all in the user interface.

Data-->Get and Transform Data-->From Web.
 
Upvote 0
I went into advfn.com and I didn't bother checking whether it is redirecting me to local data, but the following code works fine:

VBA Code:
Function GetResponse(url As String) As String
    With CreateObject("WinHttp.WinHttpRequest.5.1")
        .Open "GET", url, False
        .send
        GetResponse = .responseText
    End With
End Function

Sub test()
    Dim ws As Worksheet
    Set ws = ThisWorkbook.Sheets.Add

    With CreateObject("htmlfile")
        .body.innerHTML = GetResponse("https://advfn.com")
        Dim tableRows As Object, row As Object, cell As Object
        Dim r As Long, c As Long
        r = 1 ' start at first row
        For Each row In .getElementsByTagName("table")(2).Rows
            c = 1 ' start at first col
            For Each cell In row.Cells
                ws.Cells(r, c).Value = cell.innerText
                c = c + 1
            Next cell
            r = r + 1
        Next row
    End With
End Sub
 
Upvote 0
I went into advfn.com and I didn't bother checking whether it is redirecting me to local data, but the following code works fine:

VBA Code:
Function GetResponse(url As String) As String
    With CreateObject("WinHttp.WinHttpRequest.5.1")
        .Open "GET", url, False
        .send
        GetResponse = .responseText
    End With
End Function

Sub test()
    Dim ws As Worksheet
    Set ws = ThisWorkbook.Sheets.Add

    With CreateObject("htmlfile")
        .body.innerHTML = GetResponse("https://advfn.com")
        Dim tableRows As Object, row As Object, cell As Object
        Dim r As Long, c As Long
        r = 1 ' start at first row
        For Each row In .getElementsByTagName("table")(2).Rows
            c = 1 ' start at first col
            For Each cell In row.Cells
                ws.Cells(r, c).Value = cell.innerText
                c = c + 1
            Next cell
            r = r + 1
        Next row
    End With
End Sub
Edgar this worked perfectly. thank you
Have you considered using Power Query aka Get and Transform Data. No coding required. Can do it all in the user interface.

Data-->Get and Transform Data-->From Web.

I went into advfn.com and I didn't bother checking whether it is redirecting me to local data, but the following code works fine:

VBA Code:
Function GetResponse(url As String) As String
    With CreateObject("WinHttp.WinHttpRequest.5.1")
        .Open "GET", url, False
        .send
        GetResponse = .responseText
    End With
End Function

Sub test()
    Dim ws As Worksheet
    Set ws = ThisWorkbook.Sheets.Add

    With CreateObject("htmlfile")
        .body.innerHTML = GetResponse("https://advfn.com")
        Dim tableRows As Object, row As Object, cell As Object
        Dim r As Long, c As Long
        r = 1 ' start at first row
        For Each row In .getElementsByTagName("table")(2).Rows
            c = 1 ' start at first col
            For Each cell In row.Cells
                ws.Cells(r, c).Value = cell.innerText
                c = c + 1
            Next cell
            r = r + 1
        Next row
    End With
End Sub
 
Upvote 0
I went into advfn.com and I didn't bother checking whether it is redirecting me to local data, but the following code works fine:

VBA Code:
Function GetResponse(url As String) As String
    With CreateObject("WinHttp.WinHttpRequest.5.1")
        .Open "GET", url, False
        .send
        GetResponse = .responseText
    End With
End Function

Sub test()
    Dim ws As Worksheet
    Set ws = ThisWorkbook.Sheets.Add

    With CreateObject("htmlfile")
        .body.innerHTML = GetResponse("https://advfn.com")
        Dim tableRows As Object, row As Object, cell As Object
        Dim r As Long, c As Long
        r = 1 ' start at first row
        For Each row In .getElementsByTagName("table")(2).Rows
            c = 1 ' start at first col
            For Each cell In row.Cells
                ws.Cells(r, c).Value = cell.innerText
                c = c + 1
            Next cell
            r = r + 1
        Next row
    End With
End Sub
Hi Edgar, can you teach me how to find the TagName for that line that says "For Each row in .getElementsByTagName("table")(2).rows"
I'm trying to redirect to another source "Veeva Systems (VEEV) Cash Flow Statement - Stock Analysis" but this line is what errors out. I tried using "table 0" from what I could find recording the macro but that doesn't seem to be correct.
 
Upvote 0
Just open the dev tools of your browser and, in the console, write this:
document.getElementsByTagName("table")
Press Enter and see how many elements it returns. In that site, I got only one element, so you can use this to access the table:
document.getElementsByTagName("table")[0]
If you press Enter, chances are it will become selected by dev tools. That's an indication that you're targeting the right element.

Go back to VBA and modify the line, just note that VBA requires parentheses, not square brackets. The code looks like this:
VBA Code:
Function GetResponse(url As String) As String
    With CreateObject("WinHttp.WinHttpRequest.5.1")
        .Open "GET", url, False
        .send
        GetResponse = .responseText
    End With
End Function

Sub test()
    Dim ws As Worksheet
    Set ws = ThisWorkbook.Sheets.Add

    With CreateObject("htmlfile")
        .body.innerHTML = GetResponse("https://stockanalysis.com/stocks/veev/financials/cash-flow-statement/?p=quarterly")
        Dim tableRows As Object, row As Object, cell As Object
        Dim r As Long, c As Long
        r = 1 ' start at first row
        For Each row In .getElementsByTagName("table")(0).Rows
            c = 1 ' start at first col
            For Each cell In row.Cells
                ws.Cells(r, c).Value = cell.innerText
                c = c + 1
            Next cell
            r = r + 1
        Next row
    End With
End Sub

I could dive deeper, but this format limits how much I can explain. You’ll pick up a lot just by playing around with getElementsByTagName, getElementById, and other similar methods. For example, if you had 10 tables and needed to find one, you’d go trial and error through them from 0 to 9 until you find the right one and then use that as index.
 
Upvote 0
Just open the dev tools of your browser and, in the console, write this:
document.getElementsByTagName("table")
Press Enter and see how many elements it returns. In that site, I got only one element, so you can use this to access the table:
document.getElementsByTagName("table")[0]
If you press Enter, chances are it will become selected by dev tools. That's an indication that you're targeting the right element.

Go back to VBA and modify the line, just note that VBA requires parentheses, not square brackets. The code looks like this:
VBA Code:
Function GetResponse(url As String) As String
    With CreateObject("WinHttp.WinHttpRequest.5.1")
        .Open "GET", url, False
        .send
        GetResponse = .responseText
    End With
End Function

Sub test()
    Dim ws As Worksheet
    Set ws = ThisWorkbook.Sheets.Add

    With CreateObject("htmlfile")
        .body.innerHTML = GetResponse("https://stockanalysis.com/stocks/veev/financials/cash-flow-statement/?p=quarterly")
        Dim tableRows As Object, row As Object, cell As Object
        Dim r As Long, c As Long
        r = 1 ' start at first row
        For Each row In .getElementsByTagName("table")(0).Rows
            c = 1 ' start at first col
            For Each cell In row.Cells
                ws.Cells(r, c).Value = cell.innerText
                c = c + 1
            Next cell
            r = r + 1
        Next row
    End With
End Sub

I could dive deeper, but this format limits how much I can explain. You’ll pick up a lot just by playing around with getElementsByTagName, getElementById, and other similar methods. For example, if you had 10 tables and needed to find one, you’d go trial and error through them from 0 to 9 until you find the right one and then use that as index.

Thank you Edgar. This is really helpful. I appreciate the guidance on how to navigate through HTML and identifying tables. This is perfect.
 
Upvote 0
Thank you Edgar. This is really helpful. I appreciate the guidance on how to navigate through HTML and identifying tables. This is perfect.
Question. I was tinkering around with two methods of building the URL. The first one is to construct it by using a string variable called "Ticker", while the second one is to explicitly define the URL.

I commented out the first methods I used to define the strURL variable so I can toggle between them to test.

For some reason, the first method of constructing the strURL brings in the wrong data, it pulls the annual financials. Whereas when I explicitly define it, it pulls in the correct table, the quarterly financials.

I tried inspecting the HTML table properties and the way the table is tagged seems correct, but I can't seem to understand why the constructed URL doesn't work. This is the code, same as before with a few adds for strURL:

VBA Code:
Function GetResponse(url As String) As String
    With CreateObject("WinHttp.WinHttpRequest.5.1")
        .Open "GET", url, False
        .send
        GetResponse = .responseText
    End With
End Function

Sub test()
    
    Dim strURL As String
    Dim ws As Worksheet
    Set ws = ActiveSheet
    
    ws.Activate
    ws.Cells.Clear
    
    Set rngPasteDest = ws.Cells(1, 1)
    
    Ticker = "VEEV"
    'strURL = "https://stockanalysis.com/stocks/" & Ticker & "/financials/cash-flow-statement/?p=quarterly"
    strURL = "https://stockanalysis.com/stocks/veev/financials/cash-flow-statement/?p=quarterly"
    
    With CreateObject("htmlfile")
        .body.innerHTML = GetResponse(strURL)
        Dim tableRows As Object, row As Object, cell As Object
        Dim r As Long, c As Long
        r = 1 ' start at first row
        For Each row In .getElementsByTagName("table")(0).Rows
            c = 1 ' start at first col
            For Each cell In row.Cells
                With rngPasteDest.Offset(r, c)
                    .Value = cell.innerText
                    .WrapText = False
                End With
                c = c + 1
            Next cell
            r = r + 1
        Next row
    End With
    
End Sub
 
Upvote 0
1. When you're writing URLs, make sure you use capitals as the URL intends. You should not arbitrarily use VEEV instead of veev.
2. Make sure you use Option Explicit every time you use VBA because random things like this can happen, I know you are not using it because you have 2 undeclared variables. You could also be using public variables, but it's not clear for me with just the code you posted.
 
Upvote 0
1. When you're writing URLs, make sure you use capitals as the URL intends. You should not arbitrarily use VEEV instead of veev.
2. Make sure you use Option Explicit every time you use VBA because random things like this can happen, I know you are not using it because you have 2 undeclared variables. You could also be using public variables, but it's not clear for me with just the code you posted.
Thank you Edgar. I never knew that URLs would be case sensitive, that fixed it. Appreciate all the help.
 
Upvote 0

Forum statistics

Threads
1,221,310
Messages
6,159,176
Members
451,543
Latest member
cesymcox

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top