Speed of multiple URL web scraping

johnnyL

Well-known Member
Joined
Nov 7, 2011
Messages
4,546
Office Version
  1. 2007
Platform
  1. Windows
Need some help with this code.

1) The current code leaves a bunch of IE windows open.
2) The current code leads to a Run-time error '-2147437259 (80004005)':
3) It takes forever to run, Hopefully someone can assist me in converting it to use MSXML2.XMLHTTP60 for example, I heard that works faster.


Code:
'
'-----------------------------------------------------
'   Run-time error '-2147437259 (80004005)':    ' This Error Occurs, eventually, in the 'Yahoo_One_Year_Estimates_Scrape_Error' section \/ \/ \/
'                                                   Also many internet explorer windows are left open that should have been closed
'
'   Automation Error
'   Unspecified Error
'-----------------------------------------------------
'
'
'   Global Variables That will be used
'
    Public Doc                                      As HTMLDocument
'
    Public StockMainPageURL                         As String       ' This will be the main portion of the URL that we send to Internet Explorer
    Public TotalURL                                 As String       ' This will be the complete URL that we send to Internet Explorer
'
    Public CellCounter                              As Integer      ' This will be used to adjust left to right on web site cells
    Public RowCounter                               As Integer      ' This adjusts the offset from the top of the spreadsheet to the start of the columns
    Public StockCount                               As Integer      ' This counts the actual stocks being analyzed currently
    Public TotalStocksToLoad                        As Integer      ' This counts the stocks that should be analyzed right now
'
    Public PageLoadAttempt                          As Long         ' This counts the number of times we have tried to load a page
'
'-------------------------------------------------------------------------------------------------------------------------------
'
Private Sub RefreshEntireDocument_Click()
'
'   This will Clear certain cell values in the spreadsheet when the $B$1 'Refresh' cell is clicked
'
    Range("$B$5:$K$254").Select                                 ' Select the range of $B$5 thru $J$254
    Selection.ClearContents                                     ' Delete the contents of this range
'
'
' -------------------------------------------------------------------------------------------------------------------------
'
'   Scrape stocks to consider looking into further from 1st URL page
'
    RowCounter = 5                                              ' Start loading stock values recieved into the 5th row of Excel
    MaxYahooDelay = 0                                           ' Initialize MaxYahooDelay = 0
'
    CellCounter = 0                                             ' Left to right cell counter
    PageLoadAttempt = 0                                         ' Initialize PageLoadAttempt = 0
    TotalStocksToLoad = 100                                     ' we will Scrape this amount of stocks from the 1st loaded page of stocks
'
    Call Scrape_BarChart_Stock_Page_1                           ' Scrape the amount of TotalStocksToLoad into excel
'
' -------------------------------------------------------------------------------------------------------------------------
'
'   Scrape stocks to consider looking into further from 2nd URL page
'
    CellCounter = 0                                             ' Left to right cell counter
    PageLoadAttempt = 0                                         ' Initialize PageLoadAttempt = 0
    TotalStocksToLoad = 100                                     ' we will Scrape this amount of stocks from the 2nd loaded page of stocks
'
    Call Scrape_BarChart_Stock_Page_2                           ' Scrape the amount of TotalStocksToLoad into excel
'
' -------------------------------------------------------------------------------------------------------------------------
'
'   Scrape stocks to consider looking into further from 3rd URL page
'
    CellCounter = 0                                             ' Left to right cell counter
    PageLoadAttempt = 0                                         ' Initialize PageLoadAttempt = 0
    TotalStocksToLoad = 50                                      ' we will Scrape this amount of stocks from the 3rd loaded page of stocks
'
    Call Scrape_BarChart_Stock_Page_3                           ' Scrape the amount of TotalStocksToLoad into excel
'
' -------------------------------------------------------------------------------------------------------------------------
' -------------------------------------------------------------------------------------------------------------------------
'
'   Scrape values from Yahoo to Update the one year estimates from previous pages of stocks scraped
'
    RowCounter = 5                                              ' Start loading stock values recieved into the 5th row of Excel
    PageLoadAttempt = 0                                         ' Initialize PageLoadAttempt = 0
    TotalYahooDelay = 0                                         ' Initialize TotalYahooDelay = 0
    TotalYahooPageAttempts = 0                                  ' Initialize TotalYahooPageAttempts = 0
    TotalStocksToLoad = 250                                     ' we will Scrape this amount of stocks from the 3rd loaded page of stocks

    Call Scrape_Yahoo_One_Year_Estimates                        ' Scrape the amount of TotalStocksToLoad into excel
'
' -------------------------------------------------------------------------------------------------------------------------
'
'   Display some final results in the status bar
    Application.StatusBar = "Spreadsheet Refreshing Complete :)" ' & "    Avg Yahoo Delay = " & AvgYahooDelay & "     Avg Yahoo Page Attempts = " & AvgYahooPageAttempts
'
End Sub
'-------------------------------------------------------------------------------------------------------------------------------
'-------------------------------------------------------------------------------------------------------------------------------
'
Private Sub Scrape_Yahoo_One_Year_Estimates()                       ' *** Good up to here ***
'
'
    For StockCount = 1 To TotalStocksToLoad                         ' Grab One Year stock price estimate
'
'
ReloadScrape_Yahoo_One_Year_Estimates:
'
'       Load all of the Update one year estimates
        DelaySeconds = 0                                            '   Initialize DelaySeconds to zero
        PageLoadAttempt = PageLoadAttempt + 1                       '   Add 1 to our PageLoadAttempt counter
''''        TotalYahooPageAttempts = TotalYahooPageAttempts + 1         '   This will be the total yahoo Page Attempts
'
        StockMainPageURL = "finance.yahoo.com/quote/"               '   This will be the main portion of the URL that we send to Internet Explorer
        CurrentStockSymbol = Trim(Range("B" & RowCounter).Value)    '   This is the stock symbol that we will be addressing
'
'       Setup and Load the Internet Explorer Page ...
''''        Dim IE As New SHDocVw.InternetExplorer  ' This works
        Dim IE As New InternetExplorer
''      Dim IE As MSXML2.XMLHTTP60
''      Set IE = New MSXML2.XMLHTTP60
'
        TotalURL = "https://" & StockMainPageURL & CurrentStockSymbol   ' This will be the complete URL that we send to Internet Explorer
'
        If CurrentStockSymbol = 0 Or CurrentStockSymbol = "" Or IsEmpty(CurrentStockSymbol) = True Then ' If no stock symbol found @ $B?  then ...
            PageLoadAttempt = 0                                                                         '   Reset PageLoadAttempt = 0
            StockCount = TotalStocksToLoad                                                              '   Indicate no more stocks to load
'
            IE.Quit                                                                                     '   Close Internet Explorer Window
            Set IE = Nothing                                                                            '   Clear Internet Explorer Memory
'
            Exit Sub                                                                                    '   Exit this sub
        Else
'
            On Error GoTo Yahoo_One_Year_Estimates_Scrape_Error                                         '   If Error occurs then goto Yahoo_One_Year_Estimates_Scrape_Error
'
            Set IE = New InternetExplorer                                                               '   Open Internet Explorer Browser
'
'           Browser address that we will be scraping values from
            IE.navigate TotalURL                                                                        '   Load the Internet Explorer URL
'
'           Make the Browser window, that we will be scraping values from, visible
            IE.Visible = True                                           '   Make Internet Explorer Windows Visible
'
'           Allow mouse clicks and such while browser window is loading ... Loop until browser window is fuilly loaded, ie. READYSTATE_COMPLETE
            Do While IE.readyState <> 4 And DelaySeconds <= 19                                          '   Loop while IE is still loading and <= 19 seconds delayed
''              Application.Wait DateAdd("s", 1, Now)
                Application.Wait (Now + TimeValue("00:00:01"))                                          '   Delay for 1 second
                DoEvents                                                                                '   Enable Mouse Clicks
'
'               Update status bar to inform the user of what is occurring
                Application.StatusBar = "Loading website … " & TotalURL & "    Stock # " & (RowCounter - 4) ''''& _
''''                                "   Delay Seconds =  " & DelaySeconds & "    Page Load Attempts = " & PageLoadAttempt & _
''''                                "   Avg Yahoo Delay = " & AvgYahooDelay & "     AvgYahooPageAttempts = " & AvgYahooPageAttempts
'
                DelaySeconds = DelaySeconds + 1                         '   Add 1 to our DelaySeconds Counter
'
''''                If DelaySeconds > MaxYahooDelay Then MaxYahooDelay = DelaySeconds   '   Save the MaxYahooDelay
''                  TotalYahooDelay = TotalYahooDelay + 1
'
            Loop                                                        ' Loop back
'
'           Allow mouse clicks and such while browser window is loading ... Loop until browser window is fuilly loaded, ie. READYSTATE_COMPLETE
            Do While IE.Busy And DelaySeconds <= 19 ' Or IE.readyState <> 4 And DelaySeconds <= 19  ' Loop while IE is still loading and <= 19 seconds delayed
''              Application.Wait DateAdd("s", 1, Now)
                Application.Wait (Now + TimeValue("00:00:01"))          '   Delay for 1 second
                DoEvents                                                '   Enable Mouse Clicks
'
'               Update status bar to inform the user of what is occurring
                Application.StatusBar = "Loading website … " & TotalURL & "    Stock # " & (RowCounter - 4) ''''& _
''''                                "   Delay Seconds =  " & DelaySeconds & "    Page Load Attempts = " & PageLoadAttempt & _
''''                                "   Avg Yahoo Delay = " & AvgYahooDelay & "     AvgYahooPageAttempts = " & AvgYahooPageAttempts
'
                DelaySeconds = DelaySeconds + 1                         '   Add 1 to our DelaySeconds Counter
'
''''                If DelaySeconds > MaxYahooDelay Then MaxYahooDelay = DelaySeconds   '   Save the MaxYahooDelay
            Loop                                                        ' Loop back
'
'
            If DelaySeconds > 19 Then                                   ' If we have delayed for > 19 seconds to allow the page to load then ...
                IE.Quit                                                 '   Close Internet Explorer Window
'
                If PageLoadAttempt <= 4 Then GoTo ReloadScrape_Yahoo_One_Year_Estimates '   If we have'nt tried 4 reloads of this page then reload page again
            End If                                                      ' End If
'
            If PageLoadAttempt > 4 Then                                 ' If we have tried 4 reloads of the URL page then Display a message box & Exit program
                MsgBox "We've reloaded the same web page  " & PageLoadAttempt & " times without success so we're going to pause the program" & _
                " so you can investigate.", , "Multiple errors detected"
'
                PageLoadAttempt = 0                                     '   Reset PageLoadAttempt = 0
'
                Stop                                                    '   Stop this Excel program!
            End If
'
            Set Doc = IE.document
'
        End If
'
'
''''        TotalYahooDelay = TotalYahooDelay + DelaySeconds
''''        AvgYahooDelay = TotalYahooDelay / (RowCounter - 4)
''''        AvgYahooPageAttempts = TotalYahooPageAttempts / (RowCounter - 4)
'
'       Update status bar to inform the user of what is occurring
        Application.StatusBar = "Gathering Data from website … " & TotalURL & "    Stock # " & (RowCounter - 4) ''''& _
''''                                "   Delay Seconds =  " & DelaySeconds & "    Page Load Attempts = " & PageLoadAttempt & _
''''                                "   Avg Yahoo Delay = " & AvgYahooDelay & "    AvgYahooPageAttempts = " & AvgYahooPageAttempts
'
        Range("J" & RowCounter).Value = Doc.getElementsByTagName("td")(11).innerText        '   Scrape the Yahoo 52 Week Price Range
        Range("K" & RowCounter).Value = Doc.getElementsByTagName("td")(31).innerText        '   Scrape the Yahoo One Year Price Estimate
'
        On Error GoTo 0                                                                     '   Clear Errors & Set Excel Error handling to Default
'
        RowCounter = RowCounter + 1                                                         '   Advance to next row in Excel sheet
'
        IE.Quit                                                                             '   Close Internet Explorer Window
        Set IE = Nothing                                                                    '   Clear Internet Explorer Memory
'
        PageLoadAttempt = 0                                                                 '   Reset PageLoadAttempt = 0
'
    Next                                                                                    '   Load next stock until all are loaded
'
    Exit Sub                                                                                ' Exit this Sub
'
Yahoo_One_Year_Estimates_Scrape_Error:
'
'   Tried this solution from google \/ \/ to solve errors, No luck :(                       ' Shut down all Internet Explorer windows
''    Dim wsh As Object
''    Dim windowStyle As Integer: windowStyle = 1
''    Dim waitOnReturn As Boolean: waitOnReturn = True
'
''    Set wsh = VBA.CreateObject("Wscript.Shell")
''    wsh.Run "taskkill /F /IM iexplore.exe", windowStyle, waitOnReturn
'
'
'
''    IE.Quit                                                                             '   Close Internet Explorer Window
    Set IE = Nothing                                                                    '   Clear Internet Explorer Memory
'
'   This works some what
    Set IE = New InternetExplorer                                                           ' Open Internet Explorer Browser
'
'
    Resume Next                                                                             ' Go back to the next line after the previous error occurred
'
End Sub
'________________________________________________________________________________________________________________________________________________________
 
lol, no controversy. I'm sure Dan would agree, we're just adding to what each other say. I've liked his comments and he's liked mine. It's all good.
 
Upvote 0

Excel Facts

Highlight Duplicates
Home, Conditional Formatting, Highlight Cells, Duplicate records, OK to add pink formatting to any duplicates in selected range.
Bwahaha - I love causing controversy. My day just isn't complete unless I've the been involved in some minor international dispute.

It's completely fine, mate. I agree wholeheartedly with lrobbo314. As I said, I certainly learnt a whole lot researching that point, it's been beneficial for me and it is always helpful to get the thoughts of others who use/don't use these techniques and to learn the reasons why. For my projects, they tend not to large-scale so DoEvents saves my sanity by not forcing me to state at a blank excel window wondering what is going on.
 
Upvote 0
This is a good point. If I were doing this as a personal project, I'd probably steer clear of Excel entirely. I would use Python and the Beautiful Soup library.
I agree re: Python. I think the only Excel connection I would have would be the resulting spreadsheet! Beautiful Soup library is just such a effortless dream to work with - not helpful, though, with javascript-rendedered sites, though, no? With that we need to rely on selenium, no?
 
Upvote 0
I agree re: Python. I think the only Excel connection I would have would be the resulting spreadsheet! Beautiful Soup library is just such a effortless dream to work with - not helpful, though, with javascript-rendedered sites, though, no? With that we need to rely on selenium, no?

You may be right. It's been a long time since I did anything with Python for web-scraping. I would think that if Power Query has no problem parsing the site then BS would be able to as well, but I honestly don't know.

If either of you have a 'Final Product' I would be interested in seeing the workbook, share through dropbox or something, and try to do the same thing in Power Query just to see if there is any speed difference. Like you say about using IE, I would imagine that PQ has optimized a lot of the HTTP overhead nonsense.

Also, not sure what version of Excel you have, but I know that 365 has built in stock data fetching built in. Might be worth looking into as well.
 
Upvote 0
Just an Update! I added the 'Array approach' and saw minimal speed increase. :( I'm sure it is something that I have messed up.

I am going to proceed to try and condense my code and then submit that code in hopes that someone will find some more flaws in my techniques that are slowing the code down.

Dan_W asked what is slowing the code down, I am sure it is in the Internet Explorer opening/closings, but I just don't know any better at this point.
 

Attachments

  • 4.0 Capture.JPG
    4.0 Capture.JPG
    47 KB · Views: 8
Upvote 0
I'd really like to see the current version of the code, because you should be seeing a better time than that. Is it opening and closing a new instance of Internet Explorer each time it accesses a site? If so, that might explain it and can be easily fixed, but it would be helpful to apply a timer within this Phase 3 to see exaclty what's happening.
If either of you have a 'Final Product' I would be interested in seeing the workbook, share through dropbox or something, and try to do the same thing in Power Query just to see if there is any speed difference.
I don't have a copy of the code in its final state, but I also don't know which 250 sites are being accessed - Johnny - do you have a dummy set of companies/stocks that you can provide us with to test it?
Also, not sure what version of Excel you have, but I know that 365 has built in stock data fetching built in. Might be worth looking into as well.
THIS is an excellent point and I completely forgot all about it. Johnny - what version of Excel are you working on? Have you looked at this (see picture)? Although there are a dozen or so fields of data available per company, would this be helpful? I vaguely recall you being interested in the forecasting data.
1601249062362.png
 
Upvote 0
I'd really like to see the current version of the code, because you should be seeing a better time than that. Is it opening and closing a new instance of Internet Explorer each time it accesses a site? If so, that might explain it and can be easily fixed, but it would be helpful to apply a timer within this Phase 3 to see exaclty what's happening.

Hey Dan_W!

Timer has already been applied to each phase.

You both can can check out my most recent code here.

Hopefully you folks can spot what I have messed up.

BTW, I have Microsoft Office 2007 installed.
 
Upvote 0
Hi
I see that the timer has been applied to Phases 1, 2 and 3 - what I mean is that you should apply it within Phase 3, because it appears to be broken down in to a number of elements, and those elements are being looped over 250 times, so it would be helpful to know what part of that is consuming the most time/energy.

Thanks for uploading the code - I've had a quick look at it, and one or two things jump out at me - namely, I think that there is still remnant code in there opening and closing instances of IE and also I think I saw the first part of the IE process (where you extract details for the 250 companies over the 3 pages) that are being written direct to the sheet and not to an array. There is something else, but I can't put my finger on it - I need to wrap my head around the process of the code.

I just wanted to say, if you're making several hundred calls to the website everytime you run this, aren't you at risk of getting your IP blocked? I would be interested to see what lrobbo314's assessment is re: PQ. I had a quick try, but was getting bogged down in trying to transform the data into anything sensible.

Also, I wonder whether this might be a good candidate for the Stackexchange Code Review? I'm not sure what the requirements are, but just throwing it out there as one possibility.
 
Upvote 0
Dan, you were right about the javascript rendered website. PQ works fine for the yahoo site I did on page 2, but when I tried to use it on the barchart.com site, it was a no-go.

If you can find another website that has the same kind of information it, and it has the data in an html table, and isn't rendered using React.js, then PQ can do it. I've searched for a similar site but my Google-Fu is failing me.
 
Upvote 0
Dan_W said:
Thanks for uploading the code - I've had a quick look at it, and one or two things jump out at me - namely, I think that there is still remnant code in there opening and closing instances of IE and also I think I saw the first part of the IE process (where you extract details for the 250 companies over the 3 pages) that are being written direct to the sheet and not to an array. There is something else, but I can't put my finger on it - I need to wrap my head around the process of the code.

Hey Dan_W. I am sorry if my code is confusing. I always try to comment my code to relay my intent, I was taught that many years ago when I took some computer programming.

Dan_W said:
I just wanted to say, if you're making several hundred calls to the website everytime you run this, aren't you at risk of getting your IP blocked? I would be interested to see what lrobbo314's assessment is re: PQ. I had a quick try, but was getting bogged down in trying to transform the data into anything sensible.

I have ran this code several times in a night and haven't been blocked yet. :) If I ever was, I have ways around that. ;)

In regards to Power Query and my current version of Office 2007, to prep for a potential lrobbo314 Power Query response, I have discovered that 2007 Excel does not support Power query, so I am in the process of acquiring an updated version of Excel.


Dan_W said:
Also, I wonder whether this might be a good candidate for the Stackexchange Code Review? I'm not sure what the requirements are, but just throwing it out there as one possibility.

I have no idea what that means, sorry.
 
Upvote 0

Forum statistics

Threads
1,224,823
Messages
6,181,181
Members
453,022
Latest member
Mohamed Magdi Tawfiq Emam

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top