Speed of multiple URL web scraping

johnnyL

Well-known Member
Joined
Nov 7, 2011
Messages
4,546
Office Version
  1. 2007
Platform
  1. Windows
Need some help with this code.

1) The current code leaves a bunch of IE windows open.
2) The current code leads to a Run-time error '-2147437259 (80004005)':
3) It takes forever to run, Hopefully someone can assist me in converting it to use MSXML2.XMLHTTP60 for example, I heard that works faster.


Code:
'
'-----------------------------------------------------
'   Run-time error '-2147437259 (80004005)':    ' This Error Occurs, eventually, in the 'Yahoo_One_Year_Estimates_Scrape_Error' section \/ \/ \/
'                                                   Also many internet explorer windows are left open that should have been closed
'
'   Automation Error
'   Unspecified Error
'-----------------------------------------------------
'
'
'   Global Variables That will be used
'
    Public Doc                                      As HTMLDocument
'
    Public StockMainPageURL                         As String       ' This will be the main portion of the URL that we send to Internet Explorer
    Public TotalURL                                 As String       ' This will be the complete URL that we send to Internet Explorer
'
    Public CellCounter                              As Integer      ' This will be used to adjust left to right on web site cells
    Public RowCounter                               As Integer      ' This adjusts the offset from the top of the spreadsheet to the start of the columns
    Public StockCount                               As Integer      ' This counts the actual stocks being analyzed currently
    Public TotalStocksToLoad                        As Integer      ' This counts the stocks that should be analyzed right now
'
    Public PageLoadAttempt                          As Long         ' This counts the number of times we have tried to load a page
'
'-------------------------------------------------------------------------------------------------------------------------------
'
Private Sub RefreshEntireDocument_Click()
'
'   This will Clear certain cell values in the spreadsheet when the $B$1 'Refresh' cell is clicked
'
    Range("$B$5:$K$254").Select                                 ' Select the range of $B$5 thru $J$254
    Selection.ClearContents                                     ' Delete the contents of this range
'
'
' -------------------------------------------------------------------------------------------------------------------------
'
'   Scrape stocks to consider looking into further from 1st URL page
'
    RowCounter = 5                                              ' Start loading stock values recieved into the 5th row of Excel
    MaxYahooDelay = 0                                           ' Initialize MaxYahooDelay = 0
'
    CellCounter = 0                                             ' Left to right cell counter
    PageLoadAttempt = 0                                         ' Initialize PageLoadAttempt = 0
    TotalStocksToLoad = 100                                     ' we will Scrape this amount of stocks from the 1st loaded page of stocks
'
    Call Scrape_BarChart_Stock_Page_1                           ' Scrape the amount of TotalStocksToLoad into excel
'
' -------------------------------------------------------------------------------------------------------------------------
'
'   Scrape stocks to consider looking into further from 2nd URL page
'
    CellCounter = 0                                             ' Left to right cell counter
    PageLoadAttempt = 0                                         ' Initialize PageLoadAttempt = 0
    TotalStocksToLoad = 100                                     ' we will Scrape this amount of stocks from the 2nd loaded page of stocks
'
    Call Scrape_BarChart_Stock_Page_2                           ' Scrape the amount of TotalStocksToLoad into excel
'
' -------------------------------------------------------------------------------------------------------------------------
'
'   Scrape stocks to consider looking into further from 3rd URL page
'
    CellCounter = 0                                             ' Left to right cell counter
    PageLoadAttempt = 0                                         ' Initialize PageLoadAttempt = 0
    TotalStocksToLoad = 50                                      ' we will Scrape this amount of stocks from the 3rd loaded page of stocks
'
    Call Scrape_BarChart_Stock_Page_3                           ' Scrape the amount of TotalStocksToLoad into excel
'
' -------------------------------------------------------------------------------------------------------------------------
' -------------------------------------------------------------------------------------------------------------------------
'
'   Scrape values from Yahoo to Update the one year estimates from previous pages of stocks scraped
'
    RowCounter = 5                                              ' Start loading stock values recieved into the 5th row of Excel
    PageLoadAttempt = 0                                         ' Initialize PageLoadAttempt = 0
    TotalYahooDelay = 0                                         ' Initialize TotalYahooDelay = 0
    TotalYahooPageAttempts = 0                                  ' Initialize TotalYahooPageAttempts = 0
    TotalStocksToLoad = 250                                     ' we will Scrape this amount of stocks from the 3rd loaded page of stocks

    Call Scrape_Yahoo_One_Year_Estimates                        ' Scrape the amount of TotalStocksToLoad into excel
'
' -------------------------------------------------------------------------------------------------------------------------
'
'   Display some final results in the status bar
    Application.StatusBar = "Spreadsheet Refreshing Complete :)" ' & "    Avg Yahoo Delay = " & AvgYahooDelay & "     Avg Yahoo Page Attempts = " & AvgYahooPageAttempts
'
End Sub
'-------------------------------------------------------------------------------------------------------------------------------
'-------------------------------------------------------------------------------------------------------------------------------
'
Private Sub Scrape_Yahoo_One_Year_Estimates()                       ' *** Good up to here ***
'
'
    For StockCount = 1 To TotalStocksToLoad                         ' Grab One Year stock price estimate
'
'
ReloadScrape_Yahoo_One_Year_Estimates:
'
'       Load all of the Update one year estimates
        DelaySeconds = 0                                            '   Initialize DelaySeconds to zero
        PageLoadAttempt = PageLoadAttempt + 1                       '   Add 1 to our PageLoadAttempt counter
''''        TotalYahooPageAttempts = TotalYahooPageAttempts + 1         '   This will be the total yahoo Page Attempts
'
        StockMainPageURL = "finance.yahoo.com/quote/"               '   This will be the main portion of the URL that we send to Internet Explorer
        CurrentStockSymbol = Trim(Range("B" & RowCounter).Value)    '   This is the stock symbol that we will be addressing
'
'       Setup and Load the Internet Explorer Page ...
''''        Dim IE As New SHDocVw.InternetExplorer  ' This works
        Dim IE As New InternetExplorer
''      Dim IE As MSXML2.XMLHTTP60
''      Set IE = New MSXML2.XMLHTTP60
'
        TotalURL = "https://" & StockMainPageURL & CurrentStockSymbol   ' This will be the complete URL that we send to Internet Explorer
'
        If CurrentStockSymbol = 0 Or CurrentStockSymbol = "" Or IsEmpty(CurrentStockSymbol) = True Then ' If no stock symbol found @ $B?  then ...
            PageLoadAttempt = 0                                                                         '   Reset PageLoadAttempt = 0
            StockCount = TotalStocksToLoad                                                              '   Indicate no more stocks to load
'
            IE.Quit                                                                                     '   Close Internet Explorer Window
            Set IE = Nothing                                                                            '   Clear Internet Explorer Memory
'
            Exit Sub                                                                                    '   Exit this sub
        Else
'
            On Error GoTo Yahoo_One_Year_Estimates_Scrape_Error                                         '   If Error occurs then goto Yahoo_One_Year_Estimates_Scrape_Error
'
            Set IE = New InternetExplorer                                                               '   Open Internet Explorer Browser
'
'           Browser address that we will be scraping values from
            IE.navigate TotalURL                                                                        '   Load the Internet Explorer URL
'
'           Make the Browser window, that we will be scraping values from, visible
            IE.Visible = True                                           '   Make Internet Explorer Windows Visible
'
'           Allow mouse clicks and such while browser window is loading ... Loop until browser window is fuilly loaded, ie. READYSTATE_COMPLETE
            Do While IE.readyState <> 4 And DelaySeconds <= 19                                          '   Loop while IE is still loading and <= 19 seconds delayed
''              Application.Wait DateAdd("s", 1, Now)
                Application.Wait (Now + TimeValue("00:00:01"))                                          '   Delay for 1 second
                DoEvents                                                                                '   Enable Mouse Clicks
'
'               Update status bar to inform the user of what is occurring
                Application.StatusBar = "Loading website … " & TotalURL & "    Stock # " & (RowCounter - 4) ''''& _
''''                                "   Delay Seconds =  " & DelaySeconds & "    Page Load Attempts = " & PageLoadAttempt & _
''''                                "   Avg Yahoo Delay = " & AvgYahooDelay & "     AvgYahooPageAttempts = " & AvgYahooPageAttempts
'
                DelaySeconds = DelaySeconds + 1                         '   Add 1 to our DelaySeconds Counter
'
''''                If DelaySeconds > MaxYahooDelay Then MaxYahooDelay = DelaySeconds   '   Save the MaxYahooDelay
''                  TotalYahooDelay = TotalYahooDelay + 1
'
            Loop                                                        ' Loop back
'
'           Allow mouse clicks and such while browser window is loading ... Loop until browser window is fuilly loaded, ie. READYSTATE_COMPLETE
            Do While IE.Busy And DelaySeconds <= 19 ' Or IE.readyState <> 4 And DelaySeconds <= 19  ' Loop while IE is still loading and <= 19 seconds delayed
''              Application.Wait DateAdd("s", 1, Now)
                Application.Wait (Now + TimeValue("00:00:01"))          '   Delay for 1 second
                DoEvents                                                '   Enable Mouse Clicks
'
'               Update status bar to inform the user of what is occurring
                Application.StatusBar = "Loading website … " & TotalURL & "    Stock # " & (RowCounter - 4) ''''& _
''''                                "   Delay Seconds =  " & DelaySeconds & "    Page Load Attempts = " & PageLoadAttempt & _
''''                                "   Avg Yahoo Delay = " & AvgYahooDelay & "     AvgYahooPageAttempts = " & AvgYahooPageAttempts
'
                DelaySeconds = DelaySeconds + 1                         '   Add 1 to our DelaySeconds Counter
'
''''                If DelaySeconds > MaxYahooDelay Then MaxYahooDelay = DelaySeconds   '   Save the MaxYahooDelay
            Loop                                                        ' Loop back
'
'
            If DelaySeconds > 19 Then                                   ' If we have delayed for > 19 seconds to allow the page to load then ...
                IE.Quit                                                 '   Close Internet Explorer Window
'
                If PageLoadAttempt <= 4 Then GoTo ReloadScrape_Yahoo_One_Year_Estimates '   If we have'nt tried 4 reloads of this page then reload page again
            End If                                                      ' End If
'
            If PageLoadAttempt > 4 Then                                 ' If we have tried 4 reloads of the URL page then Display a message box & Exit program
                MsgBox "We've reloaded the same web page  " & PageLoadAttempt & " times without success so we're going to pause the program" & _
                " so you can investigate.", , "Multiple errors detected"
'
                PageLoadAttempt = 0                                     '   Reset PageLoadAttempt = 0
'
                Stop                                                    '   Stop this Excel program!
            End If
'
            Set Doc = IE.document
'
        End If
'
'
''''        TotalYahooDelay = TotalYahooDelay + DelaySeconds
''''        AvgYahooDelay = TotalYahooDelay / (RowCounter - 4)
''''        AvgYahooPageAttempts = TotalYahooPageAttempts / (RowCounter - 4)
'
'       Update status bar to inform the user of what is occurring
        Application.StatusBar = "Gathering Data from website … " & TotalURL & "    Stock # " & (RowCounter - 4) ''''& _
''''                                "   Delay Seconds =  " & DelaySeconds & "    Page Load Attempts = " & PageLoadAttempt & _
''''                                "   Avg Yahoo Delay = " & AvgYahooDelay & "    AvgYahooPageAttempts = " & AvgYahooPageAttempts
'
        Range("J" & RowCounter).Value = Doc.getElementsByTagName("td")(11).innerText        '   Scrape the Yahoo 52 Week Price Range
        Range("K" & RowCounter).Value = Doc.getElementsByTagName("td")(31).innerText        '   Scrape the Yahoo One Year Price Estimate
'
        On Error GoTo 0                                                                     '   Clear Errors & Set Excel Error handling to Default
'
        RowCounter = RowCounter + 1                                                         '   Advance to next row in Excel sheet
'
        IE.Quit                                                                             '   Close Internet Explorer Window
        Set IE = Nothing                                                                    '   Clear Internet Explorer Memory
'
        PageLoadAttempt = 0                                                                 '   Reset PageLoadAttempt = 0
'
    Next                                                                                    '   Load next stock until all are loaded
'
    Exit Sub                                                                                ' Exit this Sub
'
Yahoo_One_Year_Estimates_Scrape_Error:
'
'   Tried this solution from google \/ \/ to solve errors, No luck :(                       ' Shut down all Internet Explorer windows
''    Dim wsh As Object
''    Dim windowStyle As Integer: windowStyle = 1
''    Dim waitOnReturn As Boolean: waitOnReturn = True
'
''    Set wsh = VBA.CreateObject("Wscript.Shell")
''    wsh.Run "taskkill /F /IM iexplore.exe", windowStyle, waitOnReturn
'
'
'
''    IE.Quit                                                                             '   Close Internet Explorer Window
    Set IE = Nothing                                                                    '   Clear Internet Explorer Memory
'
'   This works some what
    Set IE = New InternetExplorer                                                           ' Open Internet Explorer Browser
'
'
    Resume Next                                                                             ' Go back to the next line after the previous error occurred
'
End Sub
'________________________________________________________________________________________________________________________________________________________
 
I updated the code so I'll give a bit more of an explanation. I start off with a list of stocks like this.

Book1
A
1Stock
2AAPL
3GOOG
4TSLA
5AMZN
Sheet1


And add that as a table to PowerQuery.

Then, in Power Query, add a new blank source, get into the 'advanced editor' and paste the code below.

Power Query:
(sym as text) as record =>

let
    Source = Web.Page(Web.Contents("https://finance.yahoo.com/quote/" & sym)),
    record = Record.AddField([],"52 Week",Source{0}[Data]{5}[Column2]),
    result = Record.AddField(record, "Other", Source{1}[Data]{7}[Column2])
in
    result

That's the custom function. Then go back to the first table, and on the 'add column' tab, select 'invoke custom column' and select the custom function that was just created, giving it the 1 column you have available to the custom function as its parameter.

This gives us basically the same results as before, but probably faster.

Book1
CDE
1Stock52 WeekOther
2AAPL51.06 - 134.80113.71
3GOOG1,013.54 - 1,665.731,699.84
4TSLA43.67 - 502.49281.66
5AMZN1,626.03 - 3,513.873,636.42
Sheet1
 
Upvote 0

Excel Facts

Can a formula spear through sheets?
Use =SUM(January:December!E7) to sum E7 on all of the sheets from January through December
Thank you lrobbo314 for your responses!

Unfortunately, I have no experience with Power Query so that is my loss. :(

After much googling, I am making some headway on one of my original problems of waaaaaaay too many Explorer windows being left opened. I hope to resolve that issue of Explorer windows being left open with some more googling. :)
 
Upvote 0
I would just like to say that lrobbo314's Power Query solution is an elegant one, and I suspect will almost certainly be faster than VBA. I have to confess that I am not as proficient in PQ as I would like to be, so your further explanation above, lrobbo314, was greatly appreciated.

Returning to the VBA code, anything that avoids instantiating several hundered instances of Internet Explorer like the original code does was always going to be a winner! My initial comments on the paired down code you first provided should not be interpreted as anything other than obvservations as to where I suspect the code is running into difficulty. Given that you had originally asked "Hopefully someone can assist me in converting it to use MSXML2.XMLHTTP60 for example", I took this to mean that we can abandon internet explorer altogether - IE and MSXML2 are serving the same function here, so trying to make some hybrid code that includes both is wholly unnecessary.

A lot of the full code you helpfully provided appears to deal with managing Internet Explorer - if we are abandoning that, then much of the full code is no longer of use. Assuming we can get my code to work on your computer - which performs the key function of getting the data and depositing it in the worksheet - then the key task will be working through the full code to work out whatelse needs to be salvaged and worked into the updated code.

I will upload the clean workbook with the code in just a moment, and hopefully it will work for you too.
 
Upvote 0
Here is a link to the workbook, comprising one worksheet with a command button, and one standard module (modScrape). I set out that code in full below, on the off-chance it may be of use to someone else:

VBA Code:
'Remember to add reference to Microsoft HTML Object Library

Sub ScrapeSite()
    Dim Doc                          As HTMLDocument
    Dim StockMainPageURL             As String       ' This will be the main portion of the URL that we send to Internet Explorer
    Dim TotalURL                     As String       ' This will be the complete URL that we send to Internet Explorer
    Dim RowCounter                   As Integer      ' This adjusts the offset from the top of the spreadsheet to the start of the columns
    Dim StockCount                   As Integer      ' This counts the actual stocks being analyzed currently
    Dim strHTML                      As String
    
    Set Doc = New HTMLDocument
    StockMainPageURL = "https://finance.yahoo.com/quote/"               '   This will be the main portion of the URL
    RowCounter = 5
    TotalStocksToLoad = 250
    For StockCount = 1 To TotalStocksToLoad                                          ' Grab One Year stock price estimate
        currentstocksymbol = Trim(Range("B" & RowCounter).value)      '   This is the stock symbol that we will be addressing
        If currentstocksymbol = vbNullString Then Exit For
        TotalURL = StockMainPageURL & currentstocksymbol                 ' This will be the complete URL
        strHTML = GetHTML(TotalURL)
        
        Doc.body.innerhtml = strHTML
        
        Range("J" & RowCounter).value = Doc.getElementsByTagName("td")(11).innerText
        Range("K" & RowCounter).value = Doc.getElementsByTagName("td")(31).innerText
        
        RowCounter = RowCounter + 1
    Next
    
    Set Doc = Nothing
    
End Sub

Function GetHTML(strURL As String) As String
    Dim objHTTP As Object, strTemp As String
    Set objHTTP = CreateObject("MSXML2.ServerXMLHTTP")
    objHTTP.Open "GET", strURL, False
    objHTTP.send
    If objHTTP.Status = 200 Then
        strTemp = objHTTP.responseText
    Else
        'There has been an error
        strTemp = ""
    End If
    GetHTML = strTemp
End Function
 
Upvote 0
Thank You Dan_W.

Your workbook does work for me. :) Not sure why, perhaps because you put your code in the module section instead of worksheet?

Anyway, thank you so much for that. I can't seem to get the sections of my code that run prior to section that you provided a code example for. When I try to run the previous sections I immediately get an error at the "objHTTP.send" line. The error is "An error occurred in the secure channel support"

The previous sections work fine for InternetExplorer, but I can't seem to get it to work like the sample code you provided.

Any thoughts?
 

Attachments

  • Capture2.PNG
    Capture2.PNG
    7.8 KB · Views: 7
Upvote 0
The following code is what I thought should work based on the previous Dan_W code example posted above ...
VBA Code:
'
Private Sub Scrape_BarChart_Stock_Page_1()    ' Currently not working !!!
'
'   Scrape stocks to consider looking into further from 1st URL page
'
    Dim StockMainPageURL        As String          ' This will be the main portion of the URL that we send to Internet Explorer
    Dim RowCounter              As Integer         ' This adjusts the offset from the top of the spreadsheet to the start of the columns
    Dim CellCounter             As Integer         ' This will be used to adjust left to right on web site cells
    Dim TotalStocksToLoad       As Integer         ' This counts the stocks that should be analyzed right now
    Dim StockCount              As Integer         ' This counts the actual stocks being analyzed currently
    Dim TotalURL                As String          ' This will be the complete URL that we send to Internet Explorer
    Dim strHTML                 As String          '
    Dim Doc                     As HTMLDocument    '
'
    RowCounter = 5                                              ' Start loading stock values recieved into the 5th row of Excel
    CellCounter = 0                                             ' Left to right cell counter on Web page
    TotalStocksToLoad = 100                                     ' we will Scrape this amount of stocks from the 1st loaded page of stocks
'
'   URL of 1St page of stocks that we will scrape from
    StockMainPageURL = "https://www.barchart.com/stocks/performance/percent-change/declines?timeFrame=3m&viewName=main"
'
    TotalURL = StockMainPageURL
'
    strHTML = GetHTML(TotalURL)
'
    Doc.body.innerhtml = strHTML
'
'   Update status bar to inform the user of what is occurring
    Application.StatusBar = "Gathering Data of stocks to load website Page 1 … "
'
    For StockCount = 1 To TotalStocksToLoad                                                         ' Grab stock names & some values
'
        Range("A" & RowCounter).Value = (RowCounter - 4)                                            '   Stock number Counter
        Range("B" & RowCounter).Value = Doc.getElementsByTagName("td")(CellCounter).innerText       '   Stock Symbol
        Range("C" & RowCounter).Value = Doc.getElementsByTagName("td")(CellCounter + 1).innerText   '   Stock Name
        Range("D" & RowCounter).Value = Doc.getElementsByTagName("td")(CellCounter + 2).innerText   '   3 Month % Change
        Range("E" & RowCounter).Value = Doc.getElementsByTagName("td")(CellCounter + 3).innerText   '   Last Price
        Range("F" & RowCounter).Value = Doc.getElementsByTagName("td")(CellCounter + 4).innerText   '   Change
        Range("G" & RowCounter).Value = Doc.getElementsByTagName("td")(CellCounter + 5).innerText   '   % Change
        Range("H" & RowCounter).Value = Doc.getElementsByTagName("td")(CellCounter + 7).innerText   '   3 Month High
        Range("I" & RowCounter).Value = Doc.getElementsByTagName("td")(CellCounter + 8).innerText   '   3 Month Low
'
        CellCounter = CellCounter + 12                                                              '   Advance to next row on URL page
        RowCounter = RowCounter + 1                                                                 '   Advance to next row in Excel sheet
'
    Next
'
End Sub

I am not sure why it immediately errors upon execution. See most recent error above /\ /\ /\.

The only thing I see is that the Dan_W code that works above uses a URL that Does NOT have www in it. The code I am trying to get to work DOES have www in the URL.

I Dunno.

As I mentioned, if I use the Internet Explorer approach, the code works, just sloooooooow.
 
Upvote 0
Sorry - I will have a look at it later tonight.
 
Upvote 0
Hi - sorry I didn't get back to you sooner.

When I first read your original full code, I hadn't picked up the fact that you were in fact attempting to scrape two separate websites; I had thought you were only scraping Yahoo - my apologies. I looked at the original code to see how the BarCharts scrape worked, and tried your revised code above to see why it wasn't working. Basically, the Barcharts site is Javascript-rendered - this means that sourcing the HTML code with MSXML is not going to help on this ooccasion. Instead, you will need to automate a browser to first render and populate the site with the data, and then extract the code from which to scrape. This could be done by a API, a headless browser or, as is the case in your original code, Internet Explorer.

As such, in regards to the BarCharts scraping, I would suggest that your original code will do the job - each subroutine properly instantiates and terminates an instance of Internet Explorer as it goes, so there won't be multiple versions of it left running in the background. There are a few things that could be done to optimise the code and speed up the process, but let me know your thoughts.
 
Upvote 0
Thank you Dan_W for all your efforts. I have to admit that I have no experience with API's. I have no clue how to take the API approach. :( I have seen posts where API's give raw data that can be scraped, but that is the extent of my knowledge on API's. IE. next to nothing basically.

I will continue plugging away at this in the mean time, the speed, thus far has increased incredibly between your submitted code and my googling in the mean time so many thanks to you!
 
Upvote 0

Forum statistics

Threads
1,224,823
Messages
6,181,181
Members
453,022
Latest member
Mohamed Magdi Tawfiq Emam

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top