Looking for faster way to scrape ~250 pages than Excel's built-in WebQuery tool

d0rian

Active Member
Joined
May 30, 2015
Messages
313
Office Version
  1. 365
I want to get the data from ~250 different web pages (they're all formatted the same >> they're pages of data for different stock symbols) into Excel. I currently use Excel 2007's built-in WebQuery tool, but:

1) It takes about 3 minutes for Excel to grab the data from all the pages, and
2) Excel's frozen / unusable while it's running the 3-minute web query.

Yes, I know I might be able to use a 2nd instance of Excel to partially address (2) above, but it's really the 3-minute duration I want to improve. I can use a multiple-URL opener to open all 250 pages in Chrome or Firefox in under 10 seconds...I just don't know what to do next. Like is there some tool that will scrape the data from all open pages in seconds? Anything that doesn't make me rely on Excel's super-slow web-querying process...
 
Last edited:

Excel Facts

What is the fastest way to copy a formula?
If A2:A50000 contain data. Enter a formula in B2. Select B2. Double-click the Fill Handle and Excel will shoot the formula down to B50000.
Use XMLhttpRequest or WinHttpRequest to request each page, however that will still be quite slow with 250 pages because each page has to be requested in sequence.

For much faster performance, write VBA code which creates multiple VBScript processes, each of which requests 1 or more pages and writes the data retrieved to the main workbook. This technique is described at https://analystcave.com/excel-multithreading-vba-vs-vbscript-vs-c-net/, however the coding to achieve this is quite complicated.
 
Upvote 0
I made some progress on this -- I found a browser extension that batch-downloads a list of webpages as HTML files, and I have a small .bat file that will amalgamate them all into a single .txt file...so now I have a single (large / ~10MB) text file, but it's got all of the HTML formatting in it (e.g. all of the data that had been in tables on the webpages are prefaced with <td>) and I'm not sure any of Excel 2007's native data-import features can parse this stuff and import it cleanly into Excel.
i) Any ideas of where to go from here?
ii) Alternatively, once I've got a folder of ~250 individual HTML files (downloaded by the browser extension i found), is there any .bat command I can run to amalgamate them into a format that Excel perhaps CAN parse/import? I've just been converting them into a single .txt file, since that's all I really have experience with, but if there's some format that will make this all easy OTHER than .txt, that would be good...
 
Upvote 0
FWIW, what makes me think that I must be getting close to an answer is that Excel 2007 clearly has SOME functionality that lets it parse/import HTML...after all, the process I'm trying to improve upon is Excel's own WebQuery / import-from-web functionality...but as I wrote in the OP, the bottleneck with that is that it takes around 1 second for each URL I have it go fetch, so it's a 3-4 minute process. Theoretically, I'm asking Excel to run exactly the same functionality except essentially saying to it "Wait, you don't have to go query the web to go fetch these pages (presumably what's causing the speed issue), I've actually done all the hard work for you and have a version of every page to import saved locally!" ...I just don't know how to point Excel at the local HTML pages and tell it to run the same parse/import function as if I were using WebQuery...did I explain that clearly?
 
Upvote 0
I made some progress on this -- I found a browser extension that batch-downloads a list of webpages as HTML files, and I have a small .bat file that will amalgamate them all into a single .txt file...so now I have a single (large / ~10MB) text file, but it's got all of the HTML formatting in it (e.g. all of the data that had been in tables on the webpages are prefaced with ) and I'm not sure any of Excel 2007's native data-import features can parse this stuff and import it cleanly into Excel.
i) Any ideas of where to go from here?
This macro parses a local HTML file (you'll need to rename the .txt file to .html) containing multiple tables into the active sheet. You must set a reference to Microsoft HTML Object Library, via Tools -> References in the VBA editor.

Code:
Public Sub Parse_Local_HTML_File()

    Dim HTMLdoc As HTMLDocument, HTMLdoc2 As HTMLDocument
    Dim table As HTMLTable, tRow As HTMLTableRow, tCell As HTMLTableCell
    Dim destCell As Range, n As Long
    Dim localFile As String
    
    localFile = "C:\folder\path\page.html")  'Change this.  Note .html, not .txt
    
    With ActiveSheet
        .Cells.Clear
        Set destCell = .Range("A1")
        n = 0
    End With
    
    Set HTMLdoc2 = New HTMLDocument
    Set HTMLdoc = HTMLdoc2.createDocumentFromUrl(localFile, "")
    While HTMLdoc.readyState <> "complete": DoEvents: Wend
    
    For Each table In HTMLdoc.getElementsByTagName("TABLE")
        For Each tRow In table.Rows
            For Each tCell In tRow.Cells
                destCell.Offset(n + tRow.RowIndex, tCell.cellIndex).Value = tCell.innerText
            Next
        Next
        n = n + table.Rows.Length
    Next
    
    MsgBox "Done"
    
End Sub
 
Upvote 0
This is awesome, thanks John...except for 1 thing (that I didn't realize til looking at the output): there are indeed 200+ identical tables in the combined HTML file that I have saved, BUT there's no actual identifier *in the table itself* as to which stock it contains the data for(!) So your script does a great job of parsing/extracting 200+ tables of data from the HTML file, but now I can't tell which data came from which table. Here's a pic of the table from the HTML file; see the "CR - Crew Energy" line? That's what tells me what the table beneath it has the data for. But the 2nd pic shows the output...there's no way of telling what data came from which stock because the stock identifier isn't in the table itself, but rather a few lines above.

All I could think of was...is there any way of grabbing the ~8 lines of text ABOVE each table and including it in the output? That, I suppose, would at least include the stock identifier so I could tell each table apart...Thanks for any ideas!

Dbxhdwz.jpg


EqwhbBn.jpg
 
Upvote 0
Can you post the HTML for that image - everything from "CR - Crew Energy ..." to "Data delayed ...".
 
Upvote 0

Forum statistics

Threads
1,225,738
Messages
6,186,728
Members
453,368
Latest member
positivemind

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top