How to clean scrape a website for financial data?

jonathanwang003

Board Regular
Joined
May 9, 2014
Messages
133
I used to scrape financial statements cleanly into spreadsheets by examining the HTML as a table and ensuring there is data by checking the number of columns and rows. This code used to work, but I'm redirecting to advfn.com and I'm having trouble properly grabbing the table.

Code:
'Download Info
Public oXML As MSXML2.XMLHTTP
Public oHTML As MSHTML.HTMLDocument
Public lngRow As Long
Public lngCol As Long
Public lngTable As Long
Public lngX As Long
Public strUrl(1 To 6) As String
Public rngPasteDest As Range

sub Download_Financials()

    Set oXML = New MSXML2.XMLHTTP
    Set oHTML = New MSHTML.HTMLDocument
    Set rngPasteDest = Range("A1")
    
    strUrl(1) = "https://www.advfn.com/stock-market/NASDAQ/SMCI/financials"
    
    With oXML
        .Open "GET", strUrl(1), False
        .send
        oHTML.body.innerHTML = .ResponseText
    End With
    lngX = 0
    
    With oHTML.getElementsByTagName("Table 1")
    
        If .Length <= 1 Then
            If AttemptCount >= 10 Then
                AttemptCount = 0
                sheets(1).Cells.Clear
                Exit Sub
            End If
        
            AttemptCount = AttemptCount + 1
            GoTo ParsePage
        End If
        
        For lngTable = 0 To .Length - 1
            For lngRow = 0 To .Item(lngTable).Rows.Length - 1
                For lngCol = 0 To .Item(lngTable).Rows(lngRow).Cells.Length - 2
                
                rngPasteDest.Offset(lngRow + lngX, lngCol).Value = .Item(lngTable).Rows(lngRow).Cells(lngCol).innerText
                
                Next lngCol
            Next lngRow
            lngX = lngRow + lngX
        Next lngTable
        
    End With
    
    Set rngPasteDest = Nothing

End Sub



This code is a recording of the macro where I can dump the table I want using the data import from web feature in Excel, but it's incredibly sloppy. Is this something that can be cleaned up?


VBA Code:
    ActiveWorkbook.Queries.Add Name:="Table 1 (2)", Formula:= _
        "let" & Chr(13) & "" & Chr(10) & "    Source = Web.BrowserContents(""https://www.advfn.com/stock-market/NASDAQ/SMCI/financials"")," & Chr(13) & "" & Chr(10) & "    #""Extracted Table From Html"" = Html.Table(Source, {{""Column1"", ""DIV[id='quarterIncomeStatement'] > DIV.table-responsive > TABLE.table.financial-table:nth-child(1) > * > TR > :nth-child(1)""}, {""Column2"", ""DIV[id='quarterIncomeStatement'] > DIV.table-res" & _
        "ponsive > TABLE.table.financial-table:nth-child(1) > * > TR > :nth-child(2)""}, {""Column3"", ""DIV[id='quarterIncomeStatement'] > DIV.table-responsive > TABLE.table.financial-table:nth-child(1) > * > TR > :nth-child(3)""}, {""Column4"", ""DIV[id='quarterIncomeStatement'] > DIV.table-responsive > TABLE.table.financial-table:nth-child(1) > * > TR > :nth-child(4)""}, " & _
        "{""Column5"", ""DIV[id='quarterIncomeStatement'] > DIV.table-responsive > TABLE.table.financial-table:nth-child(1) > * > TR > :nth-child(5)""}, {""Column6"", ""DIV[id='quarterIncomeStatement'] > DIV.table-responsive > TABLE.table.financial-table:nth-child(1) > * > TR > :nth-child(6)""}}, [RowSelector=""DIV[id='quarterIncomeStatement'] > DIV.table-responsive > TABLE." & _
        "table.financial-table:nth-child(1) > * > TR""])," & Chr(13) & "" & Chr(10) & "    #""Changed Type"" = Table.TransformColumnTypes(#""Extracted Table From Html"",{{""Column1"", type text}, {""Column2"", type text}, {""Column3"", type text}, {""Column4"", type text}, {""Column5"", type text}, {""Column6"", type text}})" & Chr(13) & "" & Chr(10) & "in" & Chr(13) & "" & Chr(10) & "    #""Changed Type"""
    With ActiveSheet.ListObjects.Add(SourceType:=0, Source:= _
        "OLEDB;Provider=Microsoft.Mashup.OleDb.1;Data Source=$Workbook$;Location=""Table 1 (2)"";Extended Properties=""""" _
        , Destination:=Range("$A$1")).QueryTable
        .CommandType = xlCmdSql
        .CommandText = Array("SELECT * FROM [Table 1 (2)]")
        .RowNumbers = False
        .FillAdjacentFormulas = False
        .PreserveFormatting = True
        .RefreshOnFileOpen = False
        .BackgroundQuery = True
        .RefreshStyle = xlInsertDeleteCells
        .SavePassword = False
        .SaveData = True
        .AdjustColumnWidth = True
        .RefreshPeriod = 0
        .PreserveColumnInfo = True
        .ListObject.DisplayName = "Table_1__2"
        .Refresh BackgroundQuery:=False
    End With
 
Follow up question. I adapted this code from our conversation above and it had worked until today. I checked the URL to confirm that it is still correct, but I keep getting a debug error. Can you teach me?

VBA Code:
Function GetResponse(url As String) As String
    With CreateObject("WinHttp.WinHttpRequest.5.1")
        .Open "GET", url, False
        .send
        GetResponse = .responseText
    End With
End Function

Sub test()
    
    Dim rngPasteDest As Range
    Dim strURL As String
    Dim ws As Worksheet
    Set ws = ActiveSheet
    
    ws.Activate
    ws.Cells.Clear
    
    Set rngPasteDest = ws.Cells(1, 1)
    
    strURL = "https://finance.yahoo.com/quote/^DJI/history/?period1=1695686400&period2=1730246400"
    
    With CreateObject("htmlfile")
        .body.innerHTML = GetResponse(strURL)
        Dim tableRows As Object, row As Object, cell As Object
        Dim r As Long, c As Long
        r = 1 ' start at first row
        For Each row In .getElementsByTagName("table")(0).Rows
            c = 1 ' start at first col
            For Each cell In row.Cells
                With rngPasteDest.Offset(r, c)
                    .Value = cell.innerText
                    .WrapText = False
                End With
                c = c + 1
            Next cell
            r = r + 1
        Next row
    End With
    
End Sub
 
Upvote 0

Excel Facts

Formula for Yesterday
Name Manager, New Name. Yesterday =TODAY()-1. OK. Then, use =YESTERDAY in any cell. Tomorrow could be =TODAY()+1.
I also tried recording macro of extracting that link from web in excel and that failed... I suspect Yahoo changed their site to not allow this?
 
Upvote 0
I also tried recording macro of extracting that link from web in excel and that failed... I suspect Yahoo changed their site to not allow this?
Add the Accept header to the request:

VBA Code:
Function GetResponse(url As String) As String
    With CreateObject("WinHttp.WinHttpRequest.5.1")
        .Open "GET", url, False
        .setRequestHeader "Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8"
        .send
        GetResponse = .responseText
    End With
End Function
 
Upvote 0
Solution
Add the Accept header to the request:

VBA Code:
Function GetResponse(url As String) As String
    With CreateObject("WinHttp.WinHttpRequest.5.1")
        .Open "GET", url, False
        .setRequestHeader "Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8"
        .send
        GetResponse = .responseText
    End With
End Function
Thanks so much Edgar. that worked perfectly. Can you teach me wat that ".setRequestHeader" line does?
 
Upvote 0
Thanks so much Edgar. that worked perfectly. Can you teach me wat that ".setRequestHeader" line does?
I opened the link in your code using my browser while its Dev Tools were open to the Network tab.
Once the page was fully loaded, I checked the source of the table and noticed it was a HTML document.
I copied the link from the Network tab and tried to open it from the browser again, it worked.
I tried to bring the data into Excel through the HTTP code from before, it did not work.
I opened the link from Postman, another tool for this type of work and noticed the link didn't work there either.
Postman is easier to work with when you want to analyze HTTP requests, so that was good news.
I checked the Network tab again and re-sent the request to see what the request headers were.
I then copied all the headers and used them with Postman.
Upon adding the headers exactly as sent by the website, I noticed it did bring the data.
Then I just removed any header that I thought was unnecessary.
The culprit seemed to be the Accept header, so I added it to the VBA code and it worked as expected.
 
Upvote 0
I opened the link in your code using my browser while its Dev Tools were open to the Network tab.
Once the page was fully loaded, I checked the source of the table and noticed it was a HTML document.
I copied the link from the Network tab and tried to open it from the browser again, it worked.
I tried to bring the data into Excel through the HTTP code from before, it did not work.
I opened the link from Postman, another tool for this type of work and noticed the link didn't work there either.
Postman is easier to work with when you want to analyze HTTP requests, so that was good news.
I checked the Network tab again and re-sent the request to see what the request headers were.
I then copied all the headers and used them with Postman.
Upon adding the headers exactly as sent by the website, I noticed it did bring the data.
Then I just removed any header that I thought was unnecessary.
The culprit seemed to be the Accept header, so I added it to the VBA code and it worked as expected.
Thank you so much Edgar. It's incredibly helpful to follow your thought process on how you assess the problem and use the process of elimination to get you to the answer. Many thanks.
 
Upvote 0
Thank you so much Edgar. It's incredibly helpful to follow your thought process on how you assess the problem and use the process of elimination to get you to the answer. Many thanks.
I'm happy to have been of help, Jonathan, and I wish you the best of luck with your project. If you find my posts helpful in the future, I’d appreciate it if you could consider liking them to support my reputation on this platform, just in case it's important some time.

Best regards,
Edgar
 
Upvote 0

Forum statistics

Threads
1,225,733
Messages
6,186,705
Members
453,369
Latest member
positivemind

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top