Speed of multiple URL web scraping

johnnyL

Office Version: 2007
Platform: Windows
Need some help with this code.

1) The current code leaves a bunch of IE windows open.
2) The current code eventually leads to a Run-time error '-2147437259 (80004005)'.
3) It takes forever to run. Hopefully someone can assist me in converting it to use MSXML2.XMLHTTP60, for example; I've heard that works faster. (See the sketch after the code block below.)


Code:
'
'-----------------------------------------------------
'   Run-time error '-2147437259 (80004005)':    ' This Error Occurs, eventually, in the 'Yahoo_One_Year_Estimates_Scrape_Error' section \/ \/ \/
'                                                   Also many internet explorer windows are left open that should have been closed
'
'   Automation Error
'   Unspecified Error
'-----------------------------------------------------
'
'
'   Global Variables That will be used
'
    Public Doc                                      As HTMLDocument
'
    Public StockMainPageURL                         As String       ' This will be the main portion of the URL that we send to Internet Explorer
    Public TotalURL                                 As String       ' This will be the complete URL that we send to Internet Explorer
'
    Public CellCounter                              As Integer      ' This will be used to adjust left to right on web site cells
    Public RowCounter                               As Integer      ' This adjusts the offset from the top of the spreadsheet to the start of the columns
    Public StockCount                               As Integer      ' This counts the actual stocks being analyzed currently
    Public TotalStocksToLoad                        As Integer      ' This counts the stocks that should be analyzed right now
'
    Public PageLoadAttempt                          As Long         ' This counts the number of times we have tried to load a page
'
'-------------------------------------------------------------------------------------------------------------------------------
'
Private Sub RefreshEntireDocument_Click()
'
'   This will Clear certain cell values in the spreadsheet when the $B$1 'Refresh' cell is clicked
'
    Range("$B$5:$K$254").Select                                 ' Select the range of $B$5 thru $J$254
    Selection.ClearContents                                     ' Delete the contents of this range
'
'
' -------------------------------------------------------------------------------------------------------------------------
'
'   Scrape stocks to consider looking into further from 1st URL page
'
    RowCounter = 5                                              ' Start loading stock values received into the 5th row of Excel
    MaxYahooDelay = 0                                           ' Initialize MaxYahooDelay = 0
'
    CellCounter = 0                                             ' Left to right cell counter
    PageLoadAttempt = 0                                         ' Initialize PageLoadAttempt = 0
    TotalStocksToLoad = 100                                     ' we will Scrape this amount of stocks from the 1st loaded page of stocks
'
    Call Scrape_BarChart_Stock_Page_1                           ' Scrape the amount of TotalStocksToLoad into excel
'
' -------------------------------------------------------------------------------------------------------------------------
'
'   Scrape stocks to consider looking into further from 2nd URL page
'
    CellCounter = 0                                             ' Left to right cell counter
    PageLoadAttempt = 0                                         ' Initialize PageLoadAttempt = 0
    TotalStocksToLoad = 100                                     ' we will Scrape this amount of stocks from the 2nd loaded page of stocks
'
    Call Scrape_BarChart_Stock_Page_2                           ' Scrape the amount of TotalStocksToLoad into excel
'
' -------------------------------------------------------------------------------------------------------------------------
'
'   Scrape stocks to consider looking into further from 3rd URL page
'
    CellCounter = 0                                             ' Left to right cell counter
    PageLoadAttempt = 0                                         ' Initialize PageLoadAttempt = 0
    TotalStocksToLoad = 50                                      ' we will Scrape this amount of stocks from the 3rd loaded page of stocks
'
    Call Scrape_BarChart_Stock_Page_3                           ' Scrape the amount of TotalStocksToLoad into excel
'
' -------------------------------------------------------------------------------------------------------------------------
' -------------------------------------------------------------------------------------------------------------------------
'
'   Scrape values from Yahoo to Update the one year estimates from previous pages of stocks scraped
'
    RowCounter = 5                                              ' Start loading stock values received into the 5th row of Excel
    PageLoadAttempt = 0                                         ' Initialize PageLoadAttempt = 0
    TotalYahooDelay = 0                                         ' Initialize TotalYahooDelay = 0
    TotalYahooPageAttempts = 0                                  ' Initialize TotalYahooPageAttempts = 0
    TotalStocksToLoad = 250                                     ' we will update this many of the previously scraped stocks with Yahoo estimates

    Call Scrape_Yahoo_One_Year_Estimates                        ' Scrape the amount of TotalStocksToLoad into excel
'
' -------------------------------------------------------------------------------------------------------------------------
'
'   Display some final results in the status bar
    Application.StatusBar = "Spreadsheet Refreshing Complete :)" ' & "    Avg Yahoo Delay = " & AvgYahooDelay & "     Avg Yahoo Page Attempts = " & AvgYahooPageAttempts
'
End Sub
'-------------------------------------------------------------------------------------------------------------------------------
'-------------------------------------------------------------------------------------------------------------------------------
'
Private Sub Scrape_Yahoo_One_Year_Estimates()                       ' *** Good up to here ***
'
'
    For StockCount = 1 To TotalStocksToLoad                         ' Grab One Year stock price estimate
'
'
ReloadScrape_Yahoo_One_Year_Estimates:
'
'       Load all of the Update one year estimates
        DelaySeconds = 0                                            '   Initialize DelaySeconds to zero
        PageLoadAttempt = PageLoadAttempt + 1                       '   Add 1 to our PageLoadAttempt counter
''''        TotalYahooPageAttempts = TotalYahooPageAttempts + 1         '   This will be the total yahoo Page Attempts
'
        StockMainPageURL = "finance.yahoo.com/quote/"               '   This will be the main portion of the URL that we send to Internet Explorer
        CurrentStockSymbol = Trim(Range("B" & RowCounter).Value)    '   This is the stock symbol that we will be addressing
'
'       Setup and Load the Internet Explorer Page ...
''''        Dim IE As New SHDocVw.InternetExplorer  ' This works
        Dim IE As New InternetExplorer
''      Dim IE As MSXML2.XMLHTTP60
''      Set IE = New MSXML2.XMLHTTP60
'
        TotalURL = "https://" & StockMainPageURL & CurrentStockSymbol   ' This will be the complete URL that we send to Internet Explorer
'
        If CurrentStockSymbol = 0 Or CurrentStockSymbol = "" Or IsEmpty(CurrentStockSymbol) = True Then ' If no stock symbol found @ $B?  then ...
            PageLoadAttempt = 0                                                                         '   Reset PageLoadAttempt = 0
            StockCount = TotalStocksToLoad                                                              '   Indicate no more stocks to load
'
            IE.Quit                                                                                     '   Close Internet Explorer Window
            Set IE = Nothing                                                                            '   Clear Internet Explorer Memory
'
            Exit Sub                                                                                    '   Exit this sub
        Else
'
            On Error GoTo Yahoo_One_Year_Estimates_Scrape_Error                                         '   If Error occurs then goto Yahoo_One_Year_Estimates_Scrape_Error
'
            Set IE = New InternetExplorer                                                               '   Open Internet Explorer Browser
'
'           Browser address that we will be scraping values from
            IE.navigate TotalURL                                                                        '   Load the Internet Explorer URL
'
'           Make the Browser window, that we will be scraping values from, visible
            IE.Visible = True                                           '   Make Internet Explorer Windows Visible
'
'           Allow mouse clicks and such while browser window is loading ... Loop until browser window is fully loaded, i.e. READYSTATE_COMPLETE
            Do While IE.readyState <> 4 And DelaySeconds <= 19                                          '   Loop while IE is still loading and <= 19 seconds delayed
''              Application.Wait DateAdd("s", 1, Now)
                Application.Wait (Now + TimeValue("00:00:01"))                                          '   Delay for 1 second
                DoEvents                                                                                '   Enable Mouse Clicks
'
'               Update status bar to inform the user of what is occurring
                Application.StatusBar = "Loading website … " & TotalURL & "    Stock # " & (RowCounter - 4) ''''& _
''''                                "   Delay Seconds =  " & DelaySeconds & "    Page Load Attempts = " & PageLoadAttempt & _
''''                                "   Avg Yahoo Delay = " & AvgYahooDelay & "     AvgYahooPageAttempts = " & AvgYahooPageAttempts
'
                DelaySeconds = DelaySeconds + 1                         '   Add 1 to our DelaySeconds Counter
'
''''                If DelaySeconds > MaxYahooDelay Then MaxYahooDelay = DelaySeconds   '   Save the MaxYahooDelay
''                  TotalYahooDelay = TotalYahooDelay + 1
'
            Loop                                                        ' Loop back
'
'           Allow mouse clicks and such while browser window is loading ... Loop until browser window is fully loaded, i.e. READYSTATE_COMPLETE
            Do While IE.Busy And DelaySeconds <= 19 ' Or IE.readyState <> 4 And DelaySeconds <= 19  ' Loop while IE is still loading and <= 19 seconds delayed
''              Application.Wait DateAdd("s", 1, Now)
                Application.Wait (Now + TimeValue("00:00:01"))          '   Delay for 1 second
                DoEvents                                                '   Enable Mouse Clicks
'
'               Update status bar to inform the user of what is occurring
                Application.StatusBar = "Loading website … " & TotalURL & "    Stock # " & (RowCounter - 4) ''''& _
''''                                "   Delay Seconds =  " & DelaySeconds & "    Page Load Attempts = " & PageLoadAttempt & _
''''                                "   Avg Yahoo Delay = " & AvgYahooDelay & "     AvgYahooPageAttempts = " & AvgYahooPageAttempts
'
                DelaySeconds = DelaySeconds + 1                         '   Add 1 to our DelaySeconds Counter
'
''''                If DelaySeconds > MaxYahooDelay Then MaxYahooDelay = DelaySeconds   '   Save the MaxYahooDelay
            Loop                                                        ' Loop back
'
'
            If DelaySeconds > 19 Then                                   ' If we have delayed for > 19 seconds to allow the page to load then ...
                IE.Quit                                                 '   Close Internet Explorer Window
'
                If PageLoadAttempt <= 4 Then GoTo ReloadScrape_Yahoo_One_Year_Estimates '   If we haven't tried 4 reloads of this page then reload page again
            End If                                                      ' End If
'
            If PageLoadAttempt > 4 Then                                 ' If we have tried 4 reloads of the URL page then Display a message box & Exit program
                MsgBox "We've reloaded the same web page  " & PageLoadAttempt & " times without success so we're going to pause the program" & _
                " so you can investigate.", , "Multiple errors detected"
'
                PageLoadAttempt = 0                                     '   Reset PageLoadAttempt = 0
'
                Stop                                                    '   Stop this Excel program!
            End If
'
            Set Doc = IE.document
'
        End If
'
'
''''        TotalYahooDelay = TotalYahooDelay + DelaySeconds
''''        AvgYahooDelay = TotalYahooDelay / (RowCounter - 4)
''''        AvgYahooPageAttempts = TotalYahooPageAttempts / (RowCounter - 4)
'
'       Update status bar to inform the user of what is occurring
        Application.StatusBar = "Gathering Data from website … " & TotalURL & "    Stock # " & (RowCounter - 4) ''''& _
''''                                "   Delay Seconds =  " & DelaySeconds & "    Page Load Attempts = " & PageLoadAttempt & _
''''                                "   Avg Yahoo Delay = " & AvgYahooDelay & "    AvgYahooPageAttempts = " & AvgYahooPageAttempts
'
        Range("J" & RowCounter).Value = Doc.getElementsByTagName("td")(11).innerText        '   Scrape the Yahoo 52 Week Price Range
        Range("K" & RowCounter).Value = Doc.getElementsByTagName("td")(31).innerText        '   Scrape the Yahoo One Year Price Estimate
'
        On Error GoTo 0                                                                     '   Clear Errors & Set Excel Error handling to Default
'
        RowCounter = RowCounter + 1                                                         '   Advance to next row in Excel sheet
'
        IE.Quit                                                                             '   Close Internet Explorer Window
        Set IE = Nothing                                                                    '   Clear Internet Explorer Memory
'
        PageLoadAttempt = 0                                                                 '   Reset PageLoadAttempt = 0
'
    Next                                                                                    '   Load next stock until all are loaded
'
    Exit Sub                                                                                ' Exit this Sub
'
Yahoo_One_Year_Estimates_Scrape_Error:
'
'   Tried this solution from google \/ \/ to solve errors, No luck :(                       ' Shut down all Internet Explorer windows
''    Dim wsh As Object
''    Dim windowStyle As Integer: windowStyle = 1
''    Dim waitOnReturn As Boolean: waitOnReturn = True
'
''    Set wsh = VBA.CreateObject("Wscript.Shell")
''    wsh.Run "taskkill /F /IM iexplore.exe", windowStyle, waitOnReturn
'
'
'
''    IE.Quit                                                                             '   Close Internet Explorer Window
    Set IE = Nothing                                                                    '   Clear Internet Explorer Memory
'
'   This works somewhat
    Set IE = New InternetExplorer                                                           ' Open Internet Explorer Browser
'
'
    Resume Next                                                                             ' Go back to the next line after the previous error occurred
'
End Sub
'________________________________________________________________________________________________________________________________________________________
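For reference, here is a minimal sketch of what one page fetch could look like with MSXML2.XMLHTTP60 instead of InternetExplorer. It assumes references to "Microsoft XML, v6.0" and "Microsoft HTML Object Library" are set, and it reuses the td indexes (11 and 31) from the code above; those indexes may need adjusting, because XMLHTTP only returns the raw HTML and will not contain anything Yahoo builds afterwards with JavaScript.

VBA Code:
Sub Get_Yahoo_Quote_Sketch()
'
'   Fetch one Yahoo quote page synchronously and parse the returned HTML with MSHTML
'
    Dim Http                                        As MSXML2.XMLHTTP60
    Dim Html                                        As MSHTML.HTMLDocument
    Dim TotalURL                                    As String
'
    TotalURL = "https://finance.yahoo.com/quote/AMZN"           ' Example symbol; build this from column B as in the code above
'
    Set Http = New MSXML2.XMLHTTP60
    Http.Open "GET", TotalURL, False                            ' False = synchronous, so no readyState / Busy wait loops are needed
    Http.send
'
    If Http.Status = 200 Then
        Set Html = New MSHTML.HTMLDocument
        Html.body.innerHTML = Http.responseText                 ' Load the response text into a parsable DOM
'
'       Assumed td indexes, copied from the IE version; verify them against the raw HTML
        Debug.Print Html.getElementsByTagName("td")(11).innerText   ' 52 week price range
        Debug.Print Html.getElementsByTagName("td")(31).innerText   ' One year price estimate
    Else
        Debug.Print "Request failed: " & Http.Status & " " & Http.statusText
    End If
'
    Set Html = Nothing
    Set Http = Nothing
End Sub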
 
Is this "swarm" approach a dead end for my phase 3?

s = s & "Set oXML = WScript.CreateObject(""MSXML2.XMLHTTP"")" & vbCrLf

Hmmm, I recall that approach not being viable for my phase 3.
 
Looks like I will need some help converting the swarm portion of the program.

Original Swarm code =
VBA Code:
    s = s & "    ' Last Sale" & vbCrLf
    s = s & "    sFM(1, 0) = ""<td class=""""subjectmenutblleft subjectmenutbltext bold"""">Last Sale: </td>"" & vbCrLf + vbTab + vbTab + vbTab + vbTab + vbTab + vbTab & ""<td class=""""subjectmenutblright subjectmenutbltext"""">""" & vbCrLf
    s = s & "    sFM(1, 1) = ""</td>""" & vbCrLf

I am trying to convert that to the Phase 2 Yahoo scraping of two values.

The HTML from the Yahoo website for the 52-week price range for Amazon (AMZN) is:
HTML:
<td class="Ta(end) Fw(600) Lh(14px)" data-test="FIFTY_TWO_WK_RANGE-value" data-reactid="121">1,626.03 - 3,552.25</td>

I have tried
VBA Code:
    s = s & "    ' Last Sale" & vbCrLf
    s = s & "    sFM(1, 0) = ""<td class=""""Ta(end) Fw(600) Lh(14px)"""">data-test= & ""FIFTY_TWO_WK_RANGE-value"">data-reactid=&""121"">"
    s = s & "    sFM(1, 1) = ""</td>""" & vbCrLf

Also:
VBA Code:
    s = s & "    ' Last Sale" & vbCrLf
    s = s & "    sFM(1, 0) = Doc.getElementsByTagName("td")(11).innerText   '   Past 1 year price range
    s = s & "    sFM(1, 1) = ""</td>""" & vbCrLf

Any Help?

It would have been easier for me to attempt the conversion if the original file still worked, but I have nothing to compare results to at this point. :(
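For what it's worth, here is a sketch of what those generator lines could look like with the marker copied verbatim from the HTML posted above. Every quote inside the generated VBScript string has to be doubled twice (once for the VBScript literal, once more for the VBA literal), and the marker has to match the raw HTML character for character, including the data-reactid value, which Yahoo may change between pages. The one-year estimate would follow the same pattern once its exact <td ...> open tag is copied from the page source.

VBA Code:
    ' 52 Week Range (marker text copied verbatim from the Yahoo HTML shown above)
    s = s & "    sFM(0, 0) = ""<td class=""""Ta(end) Fw(600) Lh(14px)"""" data-test=""""FIFTY_TWO_WK_RANGE-value"""" data-reactid=""""121"""">""" & vbCrLf
    s = s & "    sFM(0, 1) = ""</td>""" & vbCrLf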
 
It appears that the swarm approach could be way faster, up to 5x faster, I would guess.
 
I don't get it. This is the swarm file that is created ...
Code:
Dim oXML, oXL, curRow, outputCol, propAddress, sHTML, i
Dim vResults(7)



' Setup variables
curRow = 7
outputCol = "5"
propAddress = "AMZN"
Call GetFieldMarkers(sMarkers)
Set oXL = GetObject(, "Excel.Application")
Set oXML = WScript.CreateObject("MSXML2.XMLHTTP")



' Navigate to property page
oXML.Open "GET", "https://finance.yahoo.com/quote/" & propAddress, False
Wscript.Sleep 50
oXML.send
Wscript.Sleep 50



' Get html
sHTML = oXML.responseText



' Parse fields
For i = 0 To UBound(vResults)
    vResults(i) = ParseText(sHTML, sMarkers(i, 0), sMarkers(i, 1))
Next



' Write results to Excel sheet
RandomNumber = Int(Rnd * (800 + 1 - 350)) + 350
Wscript.Sleep RandomNumber
oXL.workbooks("scraping_demo_TestV1.0.xls").sheets("Demo").Range("$E$7:$L$7") = vResults

Function ParseText(s, sMarker1, sMarker2)

    Dim ptr
    Dim sTmp


    ParseText = "ERROR"
    If Len(s) Then
        If Len(sMarker1) Then
            If Len(sMarker2) Then
                ptr = InStr(s, sMarker1)
                If ptr Then
                    ptr = ptr + Len(sMarker1)
                    sTmp = Mid(s, ptr)
                    If Len(sTmp) Then
                        ptr = InStr(sTmp, sMarker2) - 1
                        If ptr >= 0 Then    ' sMarker2 found (InStr > 0); Left() below would fail if it were missing
                            ParseText = Left(sTmp, ptr)
                            Exit Function
                        End If
                    End If
                End If
            End If
        End If
    End If


End Function



Sub GetFieldMarkers(sFM)

    ReDim sFM(7, 1)

    ' Estimate

    sFM(0, 0) = "<td class=""Ta(end) Fw(600) Lh(14px)"" data-test=""FIFTY_TWO_WK_RANGE-value"" data-reactid=""119"">"
    sFM(0, 1) = "</td>"
    ' Last Sale
    sFM(1, 0) = "<span class=""Trsdu(0.3s) "" data-reactid=""172"">"
    sFM(1, 1) = "</span>"
    ' Last Price
    sFM(2, 0) = "<td class=""subjectmenutblleft subjectmenutbltext bold"">Sales Price: </td>" & vbCrLf + vbTab + vbTab + vbTab + vbTab + vbTab + vbTab & "<td class=""subjectmenutblright subjectmenutbltext"">"
    sFM(2, 1) = "</td>"
    ' Year Built
    sFM(3, 0) = "<td class=""subjectmenutblleft subjectmenutbltext bold"">Year Built: </td>" & vbCrLf + vbTab + vbTab + vbTab + vbTab + vbTab + vbTab & "<td class=""subjectmenutblright subjectmenutbltext"">"
    sFM(3, 1) = "</td>"
    ' Bedrooms
    sFM(4, 0) = "<td class=""subjectmenutblleft subjectmenutbltext bold"">BR: </td>" & vbCrLf + vbTab + vbTab + vbTab + vbTab + vbTab + vbTab & "<td class=""subjectmenutblright subjectmenutbltext"">"
    sFM(4, 1) = "</td>"
    ' Baths
    sFM(5, 0) = "<td class=""subjectmenutblleft subjectmenutbltext bold"">BA: </td>" & vbCrLf + vbTab + vbTab + vbTab + vbTab + vbTab + vbTab & "<td class=""subjectmenutblright subjectmenutbltext"">"
    sFM(5, 1) = "</td>"
    ' Square Feet
    sFM(6, 0) = "<td class=""subjectmenutblleft subjectmenutbltext bold"">BLDG SQ FTP: </td>" & vbCrLf + vbTab + vbTab + vbTab + vbTab + vbTab + vbTab & "<td class=""subjectmenutblright subjectmenutbltext"">"
    sFM(6, 1) = "</td>"
    ' Lot
    sFM(7, 0) = "<td class=""subjectmenutblleft subjectmenutbltext bold"">LOT SQ FT: </td>" & vbCrLf + vbTab + vbTab + vbTab + vbTab + vbTab + vbTab & "<td class=""subjectmenutblright subjectmenutbltext"">"
    sFM(7, 1) = "</td>"


End Sub

I would think that the first two, sFM(0,0) and sFM(1,0), would yield the correct responses, but all I get is "ERROR" in every column.
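One way to see why every column comes back "ERROR": ParseText only returns a value when sMarker1 is found verbatim in the response text. A small test (assuming the ParseText function from the file above is copied into a VBA module) against the exact <td ...> tag posted earlier in the thread shows that the data-reactid mismatch alone is enough to miss:

VBA Code:
Sub Test_ParseText_Markers()
'
'   sHTML is the 52 week range tag exactly as posted earlier in this thread (data-reactid="121")
    Dim sHTML As String
    sHTML = "<td class=""Ta(end) Fw(600) Lh(14px)"" data-test=""FIFTY_TWO_WK_RANGE-value"" data-reactid=""121"">1,626.03 - 3,552.25</td>"
'
'   Marker copied verbatim from that tag -> prints "1,626.03 - 3,552.25"
    Debug.Print ParseText(sHTML, "<td class=""Ta(end) Fw(600) Lh(14px)"" data-test=""FIFTY_TWO_WK_RANGE-value"" data-reactid=""121"">", "</td>")
'
'   Same marker built with data-reactid="119", as in sFM(0, 0) above -> not found -> prints "ERROR"
    Debug.Print ParseText(sHTML, "<td class=""Ta(end) Fw(600) Lh(14px)"" data-test=""FIFTY_TWO_WK_RANGE-value"" data-reactid=""119"">", "</td>")
End Sub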
 
Has anyone been able to get the swarm approach to work? Lord knows my limited knowledge hasn't been enough to get it to run successfully.
 
[QUOTE="Dan_W"]
...
The Databison website seems to be down - but I just checked it on the Internet Archive, and not only do they have an archived copy of the webpage explaining the process, etc., but they also archived a copy of the workbook Daniel Ferry used. I've just downloaded it, so I will see if it works, and whether it is something you could conceivably build into yours.
...
[/QUOTE]
I was able to find the webpage that you mentioned there, but it appears to be only the first portion of it. :( That being said, it appears to test the speed of loading IE windows one at a time VS loading MSXML2 one at a time VS a "swarm" method that runs the requests in parallel ("multithreading"). The original file no longer works because the website that it used is no longer available. :( I will have to see if I can get it to work, with my limited knowledge, for my phase 3 links.

The website states that the MSXML2 approach was approximately 4x faster than the IE approach, and the "swarm" method, while taking a few seconds to start yielding results, was way faster than the other two approaches.

The site also mentioned that you should adjust the "swarm size" according to your version of Windows ...


One other thing I spotted with the "swarm" technique, an FYI basically ...
Hi, yes I found the same thing re: the nonexistent webpage, which is fairly annoying, but you could still test the approach by running it against another website. In any event, it doesn't change the fact that MSXML wouldn't be an appropriate means by which to get data generated through JavaScript.

I did think about the possibility of replacing the MSXML method with Internet Explorer. So maybe just divvy up the 250 pages into a swarm of 5 workers (for example), and then have each of them run one of five instances of Internet Explorer, then consolidate the data each manages to collate into a single worksheet. I don't know, though, if it would be any quicker to run 5 instances of Internet Explorer at the same time.
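A rough sketch of that batching idea (the worker count and the Debug.Print stand-in are assumptions; in practice each batch would go to a generated .vbs worker or its own Internet Explorer instance):

VBA Code:
Sub Split_Symbols_Into_Batches()
'
'   Assign each stock symbol in column B (rows 5 down) to one of five workers, round robin
    Const WORKERS As Long = 5
    Dim LastRow As Long, r As Long, Batch As Long
'
    LastRow = Cells(Rows.Count, "B").End(xlUp).Row
    For r = 5 To LastRow
        Batch = ((r - 5) Mod WORKERS) + 1                       ' Worker numbers 1 to WORKERS
        Debug.Print "Worker " & Batch & " gets " & Trim(Range("B" & r).Value)   ' Placeholder for handing the symbol to that worker
    Next r
End Sub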
 
Hey Dan_W!

I have, since my last response, been able to get the InternetExplorer portion as well as the MSXML portion to work for the Yahoo page in that Databison/Daniel Ferry code. The problem I ran into is getting the Swarm portion to work because it, for the most part, uses VBScript code; sadly, I have not used that in probably 15 or more years, so I have forgotten it. :(

I took the working code I used for the MSXML portion of the script and put it into the Swarm portion, and it made the VBScript files, but VBScript does not operate exactly the same, syntax-wise. :(

I posted one of the VBScript files that it generated here. Sadly, it will not compile due to numerous errors, among them: Class not defined, Object required, invalid or unqualified reference, Type mismatch, and invalid Exit statement. VBScript also does not support On Error GoTo labels.
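For the On Error point: the usual workaround in VBScript is the Resume Next / Err.Number pattern, which behaves the same way in VBA. A minimal sketch (drop the As type declarations for the VBScript version):

VBA Code:
Sub ErrorHandling_Sketch()
'
    Dim Result As Variant
    Dim Denominator As Double                   ' Left at 0 so the division below fails
'
    On Error Resume Next                        ' Keep running after an error instead of jumping to a label
    Result = 1 / Denominator                    ' Raises run-time error 11 (division by zero)
    If Err.Number <> 0 Then
        Debug.Print "Failed: " & Err.Description
        Err.Clear                               ' Reset so later checks are not fooled by this error
    End If
    On Error GoTo 0                             ' Restore normal error handling (also valid in VBScript)
End Sub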

I would love to try the swarm method with multiple browser windows just to see how it compares, but the Swarm VBscripting required to do that, I fear, is again above my level.

Thank you Dan_W!
 
Well, it's not really the Yahoo part of the process that's the worry, is it? From memory, didn't we get that down to 4 mins? Also, from memory, it was only ever scraping three pages, so that would only require 3 bots. I think the more concerning area is the 250 pages that need to be accessed via Internet Explorer. On that, quick question - how critically important to you is it to get that information in the right-border-separator class? I ask because it appears that that is the only piece of information generated by the JavaScript. It may be possible to access the rest quicker if that part can be jettisoned.
 
Hey Dan_W!

Yes, the 3rd phase is the more concerning area. I will take what I can get. I have previously tried the other elements desired from that 3rd phase but haven't had any luck with the XML approach. From my googling, I believe there is an issue with the XML approach and getElementsByClassName. Have you had any luck scraping the other values besides the right-border-separator?
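On the getElementsByClassName point: a common workaround when it is not available on the MSHTML object you get back from the XML approach is to walk getElementsByTagName and compare className by hand. A minimal sketch, assuming Html already holds the parsed response (as in the earlier XMLHTTP example) and using a hypothetical class name:

VBA Code:
Function GetFirstByClassName(ByVal Html As Object, ByVal Tag As String, ByVal ClassName As String) As Object
'
'   Return the first element of the given tag whose class attribute matches ClassName, or Nothing
    Dim Element As Object
'
    For Each Element In Html.getElementsByTagName(Tag)
        If StrComp(Element.className, ClassName, vbTextCompare) = 0 Then
            Set GetFirstByClassName = Element
            Exit Function
        End If
    Next Element
'
    Set GetFirstByClassName = Nothing
End Function
'
'   Example (hypothetical class name):  Set Cell = GetFirstByClassName(Html, "td", "some-class-name")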

The error I encounter is "An error occurred in the secure channel support" in the GetHTML function at the "objHTTP.send" line.
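GetHTML is not shown in the thread, so the snippet below is only a hypothetical stand-in rather than a fix for your exact function. That "secure channel support" message is usually a TLS negotiation failure (the client and the site cannot agree on a protocol version, which is common on older Windows/Office setups where TLS 1.2 is not enabled by default). One thing that is sometimes suggested is fetching through WinHttp.WinHttpRequest.5.1, which lets you request TLS 1.2 explicitly via its SecureProtocols option:

VBA Code:
Function GetHTML_Sketch(ByVal URL As String) As String
'
'   Hypothetical stand-in for GetHTML: fetch a page while explicitly allowing TLS 1.2
    Dim objHTTP As Object
'
    Set objHTTP = CreateObject("WinHttp.WinHttpRequest.5.1")
    objHTTP.Option(9) = 2048                                ' Option 9 = SecureProtocols, 2048 = TLS 1.2
    objHTTP.Open "GET", URL, False
    objHTTP.send
'
    If objHTTP.Status = 200 Then
        GetHTML_Sketch = objHTTP.responseText
    Else
        GetHTML_Sketch = ""                                 ' Caller decides how to handle a failed request
    End If
End Function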
 