Speed of multiple URL web scraping

johnnyL

Well-known Member
Joined
Nov 7, 2011
Messages
4,546
Office Version
  1. 2007
Platform
  1. Windows
Need some help with this code.

1) The current code leaves a bunch of IE windows open.
2) The current code leads to a Run-time error '-2147437259 (80004005)':
3) It takes forever to run. Hopefully someone can assist me in converting it to use MSXML2.XMLHTTP60, for example; I heard that works faster.


Code:
'
'-----------------------------------------------------
'   Run-time error '-2147437259 (80004005)':    ' This Error Occurs, eventually, in the 'Yahoo_One_Year_Estimates_Scrape_Error' section \/ \/ \/
'                                                   Also many internet explorer windows are left open that should have been closed
'
'   Automation Error
'   Unspecified Error
'-----------------------------------------------------
'
'
'   Global Variables That will be used
'
    Public Doc                                      As HTMLDocument
'
    Public StockMainPageURL                         As String       ' This will be the main portion of the URL that we send to Internet Explorer
    Public TotalURL                                 As String       ' This will be the complete URL that we send to Internet Explorer
'
    Public CellCounter                              As Integer      ' This will be used to adjust left to right on web site cells
    Public RowCounter                               As Integer      ' This adjusts the offset from the top of the spreadsheet to the start of the columns
    Public StockCount                               As Integer      ' This counts the actual stocks being analyzed currently
    Public TotalStocksToLoad                        As Integer      ' This counts the stocks that should be analyzed right now
'
    Public PageLoadAttempt                          As Long         ' This counts the number of times we have tried to load a page
'
'-------------------------------------------------------------------------------------------------------------------------------
'
Private Sub RefreshEntireDocument_Click()
'
'   This will Clear certain cell values in the spreadsheet when the $B$1 'Refresh' cell is clicked
'
    Range("$B$5:$K$254").Select                                 ' Select the range of $B$5 thru $K$254
    Selection.ClearContents                                     ' Delete the contents of this range
'
'
' -------------------------------------------------------------------------------------------------------------------------
'
'   Scrape stocks to consider looking into further from 1st URL page
'
    RowCounter = 5                                              ' Start loading stock values received into the 5th row of Excel
    MaxYahooDelay = 0                                           ' Initialize MaxYahooDelay = 0
'
    CellCounter = 0                                             ' Left to right cell counter
    PageLoadAttempt = 0                                         ' Initialize PageLoadAttempt = 0
    TotalStocksToLoad = 100                                     ' we will Scrape this amount of stocks from the 1st loaded page of stocks
'
    Call Scrape_BarChart_Stock_Page_1                           ' Scrape the amount of TotalStocksToLoad into excel
'
' -------------------------------------------------------------------------------------------------------------------------
'
'   Scrape stocks to consider looking into further from 2nd URL page
'
    CellCounter = 0                                             ' Left to right cell counter
    PageLoadAttempt = 0                                         ' Initialize PageLoadAttempt = 0
    TotalStocksToLoad = 100                                     ' we will Scrape this amount of stocks from the 2nd loaded page of stocks
'
    Call Scrape_BarChart_Stock_Page_2                           ' Scrape the amount of TotalStocksToLoad into excel
'
' -------------------------------------------------------------------------------------------------------------------------
'
'   Scrape stocks to consider looking into further from 3rd URL page
'
    CellCounter = 0                                             ' Left to right cell counter
    PageLoadAttempt = 0                                         ' Initialize PageLoadAttempt = 0
    TotalStocksToLoad = 50                                      ' we will Scrape this amount of stocks from the 3rd loaded page of stocks
'
    Call Scrape_BarChart_Stock_Page_3                           ' Scrape the amount of TotalStocksToLoad into excel
'
' -------------------------------------------------------------------------------------------------------------------------
' -------------------------------------------------------------------------------------------------------------------------
'
'   Scrape values from Yahoo to Update the one year estimates from previous pages of stocks scraped
'
    RowCounter = 5                                              ' Start loading stock values received into the 5th row of Excel
    PageLoadAttempt = 0                                         ' Initialize PageLoadAttempt = 0
    TotalYahooDelay = 0                                         ' Initialize TotalYahooDelay = 0
    TotalYahooPageAttempts = 0                                  ' Initialize TotalYahooPageAttempts = 0
    TotalStocksToLoad = 250                                     ' we will Scrape Yahoo estimates for up to this many stocks

    Call Scrape_Yahoo_One_Year_Estimates                        ' Scrape the amount of TotalStocksToLoad into excel
'
' -------------------------------------------------------------------------------------------------------------------------
'
'   Display some final results in the status bar
    Application.StatusBar = "Spreadsheet Refreshing Complete :)" ' & "    Avg Yahoo Delay = " & AvgYahooDelay & "     Avg Yahoo Page Attempts = " & AvgYahooPageAttempts
'
End Sub
'-------------------------------------------------------------------------------------------------------------------------------
'-------------------------------------------------------------------------------------------------------------------------------
'
Private Sub Scrape_Yahoo_One_Year_Estimates()                       ' *** Good up to here ***
'
'
    For StockCount = 1 To TotalStocksToLoad                         ' Grab One Year stock price estimate
'
'
ReloadScrape_Yahoo_One_Year_Estimates:
'
'       Load all of the Update one year estimates
        DelaySeconds = 0                                            '   Initialize DelaySeconds to zero
        PageLoadAttempt = PageLoadAttempt + 1                       '   Add 1 to our PageLoadAttempt counter
''''        TotalYahooPageAttempts = TotalYahooPageAttempts + 1         '   This will be the total yahoo Page Attempts
'
        StockMainPageURL = "finance.yahoo.com/quote/"               '   This will be the main portion of the URL that we send to Internet Explorer
        CurrentStockSymbol = Trim(Range("B" & RowCounter).Value)    '   This is the stock symbol that we will be addressing
'
'       Setup and Load the Internet Explorer Page ...
''''        Dim IE As New SHDocVw.InternetExplorer  ' This works
        Dim IE As New InternetExplorer
''      Dim IE As MSXML2.XMLHTTP60
''      Set IE = New MSXML2.XMLHTTP60
'
        TotalURL = "https://" & StockMainPageURL & CurrentStockSymbol   ' This will be the complete URL that we send to Internet Explorer
'
        If CurrentStockSymbol = 0 Or CurrentStockSymbol = "" Or IsEmpty(CurrentStockSymbol) = True Then ' If no stock symbol found @ $B?  then ...
            PageLoadAttempt = 0                                                                         '   Reset PageLoadAttempt = 0
            StockCount = TotalStocksToLoad                                                              '   Indicate no more stocks to load
'
            IE.Quit                                                                                     '   Close Internet Explorer Window
            Set IE = Nothing                                                                            '   Clear Internet Explorer Memory
'
            Exit Sub                                                                                    '   Exit this sub
        Else
'
            On Error GoTo Yahoo_One_Year_Estimates_Scrape_Error                                         '   If Error occurs then goto Yahoo_One_Year_Estimates_Scrape_Error
'
            Set IE = New InternetExplorer                                                               '   Open Internet Explorer Browser
'
'           Browser address that we will be scraping values from
            IE.navigate TotalURL                                                                        '   Load the Internet Explorer URL
'
'           Make the Browser window, that we will be scraping values from, visible
            IE.Visible = True                                           '   Make Internet Explorer Windows Visible
'
'           Allow mouse clicks and such while browser window is loading ... Loop until browser window is fully loaded, i.e. READYSTATE_COMPLETE
            Do While IE.readyState <> 4 And DelaySeconds <= 19                                          '   Loop while IE is still loading and <= 19 seconds delayed
''              Application.Wait DateAdd("s", 1, Now)
                Application.Wait (Now + TimeValue("00:00:01"))                                          '   Delay for 1 second
                DoEvents                                                                                '   Enable Mouse Clicks
'
'               Update status bar to inform the user of what is occurring
                Application.StatusBar = "Loading website … " & TotalURL & "    Stock # " & (RowCounter - 4) ''''& _
''''                                "   Delay Seconds =  " & DelaySeconds & "    Page Load Attempts = " & PageLoadAttempt & _
''''                                "   Avg Yahoo Delay = " & AvgYahooDelay & "     AvgYahooPageAttempts = " & AvgYahooPageAttempts
'
                DelaySeconds = DelaySeconds + 1                         '   Add 1 to our DelaySeconds Counter
'
''''                If DelaySeconds > MaxYahooDelay Then MaxYahooDelay = DelaySeconds   '   Save the MaxYahooDelay
''                  TotalYahooDelay = TotalYahooDelay + 1
'
            Loop                                                        ' Loop back
'
'           Allow mouse clicks and such while browser window is still busy ... Loop until IE is no longer busy or the delay limit is reached
            Do While IE.Busy And DelaySeconds <= 19 ' Or IE.readyState <> 4 And DelaySeconds <= 19  ' Loop while IE is still loading and <= 19 seconds delayed
''              Application.Wait DateAdd("s", 1, Now)
                Application.Wait (Now + TimeValue("00:00:01"))          '   Delay for 1 second
                DoEvents                                                '   Enable Mouse Clicks
'
'               Update status bar to inform the user of what is occurring
                Application.StatusBar = "Loading website … " & TotalURL & "    Stock # " & (RowCounter - 4) ''''& _
''''                                "   Delay Seconds =  " & DelaySeconds & "    Page Load Attempts = " & PageLoadAttempt & _
''''                                "   Avg Yahoo Delay = " & AvgYahooDelay & "     AvgYahooPageAttempts = " & AvgYahooPageAttempts
'
                DelaySeconds = DelaySeconds + 1                         '   Add 1 to our DelaySeconds Counter
'
''''                If DelaySeconds > MaxYahooDelay Then MaxYahooDelay = DelaySeconds   '   Save the MaxYahooDelay
            Loop                                                        ' Loop back
'
'
            If DelaySeconds > 19 Then                                   ' If we have delayed for > 19 seconds to allow the page to load then ...
                IE.Quit                                                 '   Close Internet Explorer Window
'
                If PageLoadAttempt <= 4 Then GoTo ReloadScrape_Yahoo_One_Year_Estimates '   If we haven't tried 4 reloads of this page then reload page again
            End If                                                      ' End If
'
            If PageLoadAttempt > 4 Then                                 ' If we have tried 4 reloads of the URL page then Display a message box & Exit program
                MsgBox "We've reloaded the same web page  " & PageLoadAttempt & " times without success so we're going to pause the program" & _
                " so you can investigate.", , "Multiple errors detected"
'
                PageLoadAttempt = 0                                     '   Reset PageLoadAttempt = 0
'
                Stop                                                    '   Stop this Excel program!
            End If
'
            Set Doc = IE.document
'
        End If
'
'
''''        TotalYahooDelay = TotalYahooDelay + DelaySeconds
''''        AvgYahooDelay = TotalYahooDelay / (RowCounter - 4)
''''        AvgYahooPageAttempts = TotalYahooPageAttempts / (RowCounter - 4)
'
'       Update status bar to inform the user of what is occurring
        Application.StatusBar = "Gathering Data from website … " & TotalURL & "    Stock # " & (RowCounter - 4) ''''& _
''''                                "   Delay Seconds =  " & DelaySeconds & "    Page Load Attempts = " & PageLoadAttempt & _
''''                                "   Avg Yahoo Delay = " & AvgYahooDelay & "    AvgYahooPageAttempts = " & AvgYahooPageAttempts
'
        Range("J" & RowCounter).Value = Doc.getElementsByTagName("td")(11).innerText        '   Scrape the Yahoo 52 Week Price Range
        Range("K" & RowCounter).Value = Doc.getElementsByTagName("td")(31).innerText        '   Scrape the Yahoo One Year Price Estimate
'
        On Error GoTo 0                                                                     '   Clear Errors & Set Excel Error handling to Default
'
        RowCounter = RowCounter + 1                                                         '   Advance to next row in Excel sheet
'
        IE.Quit                                                                             '   Close Internet Explorer Window
        Set IE = Nothing                                                                    '   Clear Internet Explorer Memory
'
        PageLoadAttempt = 0                                                                 '   Reset PageLoadAttempt = 0
'
    Next                                                                                    '   Load next stock until all are loaded
'
    Exit Sub                                                                                ' Exit this Sub
'
Yahoo_One_Year_Estimates_Scrape_Error:
'
'   Tried this solution from google \/ \/ to solve errors, No luck :(                       ' Shut down all Internet Explorer windows
''    Dim wsh As Object
''    Dim windowStyle As Integer: windowStyle = 1
''    Dim waitOnReturn As Boolean: waitOnReturn = True
'
''    Set wsh = VBA.CreateObject("Wscript.Shell")
''    wsh.Run "taskkill /F /IM iexplore.exe", windowStyle, waitOnReturn
'
'
'
''    IE.Quit                                                                             '   Close Internet Explorer Window
    Set IE = Nothing                                                                    '   Clear Internet Explorer Memory
'
'   This works some what
    Set IE = New InternetExplorer                                                           ' Open Internet Explorer Browser
'
'
    Resume Next                                                                             ' Go back to the next line after the previous error occurred
'
End Sub
'________________________________________________________________________________________________________________________________________________________
 
johnnyL said:
I could eliminate the updating of the status bar in the 3rd phase, but if I do that, basically the screen will just sit there for 50 minutes or so and I will have no way of knowing if the program is still running or is just stuck in a loop somewhere. Basically would have to wait like an hour and then hope that everything was working properly and it completes properly.

Would code for a progress bar be more efficient?
 

Hi - sorry, I should've given you an update. It's been a manic few days. Let me finish some work and I will get back to you in 2-3 hrs.
 
Hi. So I'm reading back over your messages now:-

- I'm not sure where the concern re: the progress bar started. I'm not so worried about that. There are ways around that and it won't be what's slowing the process down significantly.
- there are certainly some adjustments to be made to the code - some minor changes (the NUM function is probably unnecessary) and some significant (e.g., closing down and starting up instances of IE - there only needs to be one instance), but I think it boils down to a more fundamental question as to whether Internet Explorer / VBA is the best solution.

Ultimately, IE is not the fastest browser, and this isn't helped by the fact that you're scraping 250 webpages. Let's assume it takes IE 10 seconds to load up, navigate to the website, and then scrape the website (remembering that your code enforces a 1 second WAIT on each turn of the loop): 250 pages x 10 seconds is 2,500 seconds, so it is still going to take you about 42 mins to work your way through your list. There are a few options I've been looking at over the weekend, but I think I got blocked by the site on Sunday night (?!?) - I'll need to check again.
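
To make that estimate easy to redo with your own timings, here is a minimal sketch. The function name EstimatedMinutes and the 2.5 second figure are made up for illustration; only the 250 pages and 10 seconds come from the post above.

```vba
' Back-of-envelope run-time estimate: pages x seconds per page, in minutes.
Function EstimatedMinutes(ByVal PageCount As Long, ByVal SecondsPerPage As Double) As Double
    EstimatedMinutes = PageCount * SecondsPerPage / 60
End Function

Sub ShowEstimates()
    Debug.Print Format(EstimatedMinutes(250, 10), "0.0")    ' prints 41.7 (the ~42 mins above)
    Debug.Print Format(EstimatedMinutes(250, 2.5), "0.0")   ' prints 10.4 (hypothetical faster fetch)
End Sub
```

The point is simply that the per-page time dominates: shaving a second off each page saves about four minutes over 250 pages.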
 
Anyway, I've recoded the Phase 3 subroutine from scratch, and basically all it does is:
- get a list of companies
- build out the corresponding URL
- instantiate IE once (and only once)
- send it to navigate to each site, pull the data into an array and then move onto the next

It doesn't WAIT, there are no DelaySeconds, etc. Even then, I was struggling to get it to do better than 13 seconds (I timed each pass of the loop).
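
The recoded routine itself was not posted, so the following is only a sketch of the steps listed above, not Dan_W's actual code. It assumes the layout from the original post (symbols in column B from row 5, results in J:K, td indexes 11 and 31) and uses late binding so no library reference is needed. The sub name and the 20 second timeout are assumptions.

```vba
Sub ScrapeWithSingleIE()
    Dim IE As Object, TdList As Object
    Dim Results() As Variant
    Dim LastRow As Long, i As Long
    Dim PassStart As Single, Deadline As Date

    LastRow = Cells(Rows.Count, "B").End(xlUp).Row
    If LastRow < 5 Then Exit Sub                        ' no symbols to process
    ReDim Results(1 To LastRow - 4, 1 To 2)

    Set IE = CreateObject("InternetExplorer.Application")   ' one instance for the whole run
    IE.Visible = False
    On Error GoTo CleanUp                               ' always close IE, even after an error

    For i = 1 To LastRow - 4
        PassStart = Timer
        IE.navigate "https://finance.yahoo.com/quote/" & Trim$(Cells(i + 4, "B").Value)

        ' Poll without a fixed WAIT; give up on this page after 20 seconds.
        Deadline = Now + TimeSerial(0, 0, 20)
        Do While (IE.Busy Or IE.readyState <> 4) And Now < Deadline
            DoEvents
        Loop

        If IE.readyState = 4 Then
            Set TdList = IE.document.getElementsByTagName("td")
            If TdList.Length > 31 Then                  ' guard against a changed page layout
                Results(i, 1) = TdList.Item(11).innerText   ' 52 week range
                Results(i, 2) = TdList.Item(31).innerText   ' one year estimate
            End If
        End If
        Debug.Print Cells(i + 4, "B").Value, Format(Timer - PassStart, "0.0") & " s"
    Next i

    ' Write everything to the sheet in one go instead of cell by cell.
    Range("J5").Resize(UBound(Results, 1), 2).Value = Results

CleanUp:
    If Not IE Is Nothing Then IE.Quit                   ' the single window is closed exactly once
    Set IE = Nothing
End Sub
```

Because IE is created once and quit once, this shape also avoids the pile of leftover IE windows described in the first post. The Debug.Print line is how each pass of the loop can be timed.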
 
So then I thought perhaps one solution would be this 'pseudo'(?) multithreading. The idea here is to divvy up the workload between a number of helpers, and get each of them to go off and scrape the websites. There is an example workbook at the bottom of the post here. I was trying it out on Monday, with mixed results, to be honest. At the end of the day, you're essentially loading up 4 instances of IE, and I don't know if that's especially helpful. The author of the blog references an attempt done by Daniel Ferry at Databison - I remember looking at it a while ago and remember it working out to be a lot quicker. The Databison website seems to be down - but I just checked it on Internet Archive, and not only do they have an archived copy of the webpage explaining the process, etc., but they also archived a copy of the workbook Daniel Ferry used. I've just downloaded it so will see if it works, and whether it is something you could conceivably build into yours.

If that doesn't work, then the other options will be to see if Selenium is quicker (it probably is) and whether you can 'outsource' the process (at least in part) through API calls.

Anyway, that's my update for now.
 
Dan_W said:
So then I thought perhaps one solution would be this 'pseudo'(?) multithreading. The idea here is to divvy up the workload between a number of helpers, and get each of them to go off and scrape the websites. There is an example workbook at the bottom of the post here. I was trying it out on Monday, with mixed results, to be honest. At the end of the day, you're essentially loading up 4 instances of IE, and I don't know if that's especially helpful. The author of the blog references an attempt done by Daniel Ferry at Databison - I remember looking at it a while ago and remember it working out to be a lot quicker. The Databison website seems to be down - but I just checked it on Internet Archive, and not only do they have an archived copy of the webpage explaining the process, etc., but they also archived a copy of the workbook Daniel Ferry used. I've just downloaded it so will see if it works, and whether it is something you could conceivably build into yours.

If that doesn't work, then the other options will be to see if Selenium is quicker (it probably is) and whether you can 'outsource' the process (at least in part) through API calls.

Anyway, that's my update for now.

Daniel Ferry appears to be a beast Excel programmer!!!

With that said, much of his coding is above my pay grade at this time, so I would need help in the translation to my project. :(

In the meantime, I have changed the delay from 1 second to a sleep of .2, but have not seen much of an improvement in time. I just don't get it: the more I mess with the script to try and speed it up, the more it appears to slow down. :sigh
 
Just a quick clarification to my last post, I switched to a sleep of 200 which equates to .2 seconds.
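
For anyone following along, a sleep of 200 presumably means the Windows Sleep API, which takes milliseconds. The sketch below shows one way to poll with it; the helper name WaitForPage and its timeout parameter are assumptions, not code from this thread. The Declare lines have to sit at the top of a module, and the non-PtrSafe branch is the one that applies to Office 2007.

```vba
#If VBA7 Then
    Private Declare PtrSafe Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)
#Else
    Private Declare Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)
#End If

' Waits until the browser reports READYSTATE_COMPLETE (4) or the timeout passes.
' Returns True if the page finished loading in time, False on timeout.
Function WaitForPage(ByVal IE As Object, ByVal TimeoutSeconds As Double) As Boolean
    Dim StartTime As Single
    StartTime = Timer
    Do While IE.Busy Or IE.readyState <> 4
        If Timer - StartTime > TimeoutSeconds Then Exit Function
        Sleep 200                                   ' 200 ms = 0.2 s between checks
        DoEvents
    Loop
    WaitForPage = True
End Function
```

A shorter poll interval only trims the overshoot after the page is ready (under a second per page here), so if each page needs ten or more seconds to load, the total run time will barely move. That may be why the change from 1 second to .2 made little visible difference.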
 
Dan_W said:
So then I thought perhaps one solution would be this 'pseudo'(?) multithreading. The idea here is to divvy up the workload between a number of helpers, and get each of them to go off and scrape the websites.

Just thinking out loud here ... Would it be possible/beneficial to load up say 5 or 10 webpages, either in separate windows or in one window with several tabs, wait for the webpages to fully load, scrape the data from those fully loaded webpages, then start the process over again for the next set of webpages? My thinking is that if you allow several webpages to load at one time, the total loading time would be decreased compared to loading them separately.

Wishful thinking?

Also Dan_W, I am not ignoring your several mentions of using an API for faster loading, but I don't think there is an API for phase 3. Is there an API for that phase? I would love it if there was a way to use that approach for phase 3, being that phase 3 takes the lion's share of the time to complete.

Again thanks to all for all of the support given thus far.
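
The batching idea above (start several pages, wait for all of them, scrape, repeat) can be tried without browser windows by sending asynchronous MSXML2 requests. This is an untested sketch under assumptions: the batch size of 5, the 20 second limit and the sub name are made up, it only handles the first batch (symbols in B5:B9), and it just reports what came back rather than parsing it.

```vba
Sub FetchBatchAsync()
    Const BatchSize As Long = 5                     ' hypothetical batch size
    Dim Requests(1 To BatchSize) As Object
    Dim i As Long, Pending As Boolean, Deadline As Date

    ' Start every request in the batch without waiting for the previous one.
    For i = 1 To BatchSize
        Set Requests(i) = CreateObject("MSXML2.XMLHTTP.6.0")
        Requests(i).Open "GET", "https://finance.yahoo.com/quote/" & _
                         Trim$(Range("B" & 4 + i).Value), True    ' True = asynchronous
        Requests(i).send
    Next i

    ' Poll until all requests have completed (readyState 4) or 20 seconds pass.
    Deadline = Now + TimeSerial(0, 0, 20)
    Do
        Pending = False
        For i = 1 To BatchSize
            If Requests(i).readyState <> 4 Then Pending = True
        Next i
        DoEvents
    Loop While Pending And Now < Deadline

    ' Report symbol, HTTP status and response size for each finished request.
    For i = 1 To BatchSize
        If Requests(i).readyState = 4 Then
            Debug.Print Range("B" & 4 + i).Value, Requests(i).Status, Len(Requests(i).responseText)
        End If
    Next i
End Sub
```

Whether this is actually faster depends on how the site responds to several requests at once; as noted earlier in the thread, the site may block a client that hits it too quickly.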
 
Dan_W said:
...
The Databison website seems to be down - but I just checked it on Internet Archive, and not only do they have an archived copy of the webpage explaining the process, etc., but they also archived a copy of the workbook Daniel Ferry used. I've just downloaded it so will see if it works, and whether it is something you could conceivably build into yours.
...

I was able to find the webpage that you mentioned there, but it appears to be only the first portion of it. :( That being said, it appears to compare the speed of loading IE windows one at a time vs. MSXML2 requests one at a time vs. a "swarm" method that does "multithreading", running several requests at a time. The original file no longer works because the website that it used is no longer available. :( I will have to see if I can get it to work, with my limited knowledge, for my phase 3 links.

The website states that MSXML2 was approximately 4x faster than the IE approach, and that the "swarm" method, while taking a few seconds to start yielding results, was way faster than the other two approaches.
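
For reference, the one-at-a-time MSXML2 approach might look roughly like this for the phase 3 links. It is a sketch under assumptions, not the Databison code: it reuses the layout and td indexes (11 and 31) from the first post, and it assumes the raw HTML returned by the server contains the same table cells that IE shows. If the page builds those cells with script, the indexes will not match and the guard below will simply skip the row.

```vba
Sub ScrapeYahooWithXmlHttp()
    Dim Http As Object, Doc As Object, TdList As Object
    Dim RowNum As Long, Symbol As String

    Set Http = CreateObject("MSXML2.XMLHTTP.6.0")      ' late bound, no reference needed
    RowNum = 5

    Do
        Symbol = Trim$(CStr(Range("B" & RowNum).Value))
        If Len(Symbol) = 0 Then Exit Do                ' first blank cell ends the list

        Http.Open "GET", "https://finance.yahoo.com/quote/" & Symbol, False
        Http.send                                      ' synchronous: returns once the response is in

        If Http.Status = 200 Then
            Set Doc = CreateObject("htmlfile")         ' parse the text without opening a browser
            Doc.body.innerHTML = Http.responseText
            Set TdList = Doc.getElementsByTagName("td")
            If TdList.Length > 31 Then                 ' skip pages that lack the expected cells
                Range("J" & RowNum).Value = TdList.Item(11).innerText
                Range("K" & RowNum).Value = TdList.Item(31).innerText
            End If
        End If

        RowNum = RowNum + 1
    Loop
End Sub
```

No browser window is ever created, so there is nothing to Quit and nothing left open if the run stops partway through.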

The site also mentioned that you should adjust the "swarm size" according to your version of windows ...
Each agent is programmed with specific range coordinates to deposit the data. The demo includes a control to allow the user to fine tune the number of agents in the swarm. Different computers and versions of Windows can handle different sizes of swarms. Windows Vista seems to have a more robust OLE subsystem, and quite handily tromps on WinXP. On my computer XP performs well with 50 to 60 agents in a swarm, while Vista on the same machine can smoothly handle a couple hundred agents.

One other thing I spotted with the "swarm" technique, an FYI basically ...
Since the demo will be creating a number of .vbs agent files, it is best to create a folder for the demo workbook.
 
