Convert InternetExplorer web scraping code to faster code

johnnyL

Well-known Member
Joined
Nov 7, 2011
Messages
4,546
Office Version
  1. 2007
Platform
  1. Windows
I recently submitted some code for CreateObject("InternetExplorer.Application") to interact with a site. It is fairly slow though do to the fact that the pages have to load before processing.

The code is fairly short & working ... Insert some text into a textbox on the first page of the website, click a button on the website page to process the inserted text, & then download a file that is generated from the next page that is loaded.

VBA Code:
Sub IE_Browser_Test()
'
    Dim Browser     As Object
    Dim CopiedData  As String, WebSite  As String
    Dim WB          As Workbook
'
    WebSite = "https://my.gstzen.in/p/gstin-validator/"                                     ' <--- Set this to the website that you want to go to
'
    Sheets("Sheet1").Range("A1:A" & Range("A" & Rows.Count).End(xlUp).Row).Copy             ' Copy data from sheet to clipboard
    CopiedData = CreateObject("htmlfile").parentWindow.clipboardData.GetData("Text")        ' Save contents from clipboard into CopiedData
'
    Set Browser = CreateObject("InternetExplorer.Application")
'
    With Browser
        .Visible = True                                                                     '   Set IE window status to visible
        .Navigate WebSite                                                                   '   Go to the website
'
        Do While .Busy Or .ReadyState <> 4                                                  '   Loop to wait for website to fully load
            DoEvents
        Loop                                                                                '   Loop back
'
        .Document.getElementsByTagName("textarea")(0).innerText = CopiedData                '   Copy data to text box on website
        .Document.querySelector("button[type=submit]").Click                                '   Click the 'Check GSTIN/UIN Numbers' button
'
        Do While .Busy Or .ReadyState <> 4                                                  '   Loop to wait for website to fully load
            DoEvents
        Loop                                                                                '   Loop back
'
        Set WB = Workbooks.Open(.Document.querySelector("[class='pull-right btn btn-sm btn-excel']"))   ' Open the download link into Excel workbook
'
        .Quit                                                                               '   Close the browser
        Set Browser = Nothing                                                               '   Clear browser from memory
    End With
End Sub


Book2
AB
109ABIPM5346A1ZS
206AAVPK7054P1ZT
329AAACD1390F1ZU
427AAFCK1511N1ZI
527AAFCK1511N1ZI
606AXFPA2657R1Z5
706AXFPA2657R1Z5
829AAACP0165G1ZL
929AAGCR3967G1ZX
1029AALCR3173P1ZJ
1129AIUPB4936C1Z4
1229AIUPB4936C1Z4
1329ADDFS6267R1Z1
1433AAKCS6703M1Z2
1527AEWPJ1063J1Z4
1629AAACS9735K1ZR
1729AHXPT9340A1ZQ
1809AAGFU9919G1ZM
1933AAMFV5931D1Z6
2027DREPK9227D1ZQ
2106AABCV3609C1ZR
2209AABCV3609C1ZL
2309AABCV3609C1ZL
2406AABCV3609C1ZR
2506AABCV3609C1ZR
2609AABCV3609C1ZL
27
Sheet1



I have previously converted for grabbing data from a website, but I have no idea how, or if it is even possible to do 'clicks'/download with, for example, MSXML.

Any help?
 
Last edited:

Excel Facts

Can you sort left to right?
To sort left-to-right, use the Sort dialog box. Click Options. Choose "Sort left to right"
I had no problem in downloadingin few seconds the attachment
Code:
Start                        58532,72 
Page ready                   58536,06 
GSTIN Filled                 58536,1 
Page ready                   58541,17 
Tbl ready                    27           
Ready to download            58541,21 
Downloaded                   58544,73
As you see it takes 12 seconds; most of the timing (5 secs) was requested for the initial page loading and docment physical downloading

The macro used is your with the addiction of several Debug.Print and reading the length of the imported table:
Code:
Sub IE_Browser_Test()
'
    Dim Browser     As Object
    Dim CopiedData  As String, WebSite  As String
    Dim WB          As Workbook
'
    WebSite = "https://my.gstzen.in/p/gstin-validator/"                                     ' <--- Set this to the website that you want to go to
'
    Sheets("Sheet1").Range("A1:A" & Range("A" & Rows.Count).End(xlUp).Row).Copy             ' Copy data from sheet to clipboard
    CopiedData = CreateObject("htmlfile").parentWindow.clipboardData.GetData("Text")        ' Save contents from clipboard into CopiedData
'
    Set Browser = CreateObject("InternetExplorer.Application")
'
    With Browser
        .Visible = True                                                                     '   Set IE window status to visible
Debug.Print "Start           ", Timer
        .Navigate WebSite                                                                   '   Go to the website
'
        Do While .Busy Or .ReadyState <> 4                                                  '   Loop to wait for website to fully load
            DoEvents
        Loop                                                                                '   Loop back
'
Debug.Print "Page ready      ", Timer
        .document.getElementsByTagName("textarea")(0).innerText = CopiedData                '   Copy data to text box on website
        .document.querySelector("button[type=submit]").Click                                '   Click the 'Check GSTIN/UIN Numbers' button
'
Debug.Print "GSTIN Filled     ", Timer
       Do While .Busy Or .ReadyState <> 4                                                  '   Loop to wait for website to fully load
          DoEvents
        Loop                                                                                '   Loop back
'
Dim tdCnt As Long
        On Error Resume Next
Debug.Print "Page ready       ", Timer
            tdCnt = .document.getElementsByTagName("table")(0).getElementsByTagName("TR").Length
        On Error GoTo 0
Debug.Print "Tbl ready        ", tdCnt, i

Debug.Print "Ready to download", Timer
        Set WB = Workbooks.Open(.document.querySelector("[class='pull-right btn btn-sm btn-excel']"))   ' Open the download link into Excel workbook
Debug.Print "Downloaded       ", Timer
'
        .Quit                                                                               '   Close the browser
        Set Browser = Nothing                                                               '   Clear browser from memory
    End With
End Sub

We can save few seconds reducing the page load wait but adding an additional check about the availability of the results; or example:
Code:
Sub IE_Browser_Test1()
'
    Dim Browser     As Object
    Dim CopiedData  As String, WebSite  As String
    Dim WB          As Workbook
    Dim tdCnt As Long, I As Long, J As Long, tBox As Object
'
    WebSite = "https://my.gstzen.in/p/gstin-validator/"                                     ' <--- Set this to the website that you want to go to
'
    Sheets("Sheet1").Range("A1:A" & Range("A" & Rows.Count).End(xlUp).Row).Copy             ' Copy data from sheet to clipboard
    CopiedData = CreateObject("htmlfile").parentWindow.clipboardData.GetData("Text")        ' Save contents from clipboard into CopiedData
'
    Set Browser = CreateObject("InternetExplorer.Application")
'
    With Browser
        .Visible = True                                                                     '   Set IE window status to visible
        Debug.Print "Start           ", Timer
        .Navigate WebSite                                                                   '   Go to the website
        Do While .Busy                     'Or .ReadyState <> 4                                                  '   Loop to wait for website to fully load
            DoEvents
        Loop                                                                                '   Loop back
'
        Debug.Print "Status = Ready  ", Timer, .readyState
        On Error Resume Next
        For I = 1 To 100
            Set tBox = .document.getElementsByClassName("cz-no-double-click")
            If Not tBox Is Nothing Then Exit For
        Next I
reLoad:
        For J = 1 To 30
            Debug.Print "Ready to load    ", Timer, "I=" & I
            .document.getElementsByTagName("textarea")(0).innerText = CopiedData                '   Copy data to text box on website
            .document.querySelector("button[type=submit]").Click                                '   Click the 'Check GSTIN/UIN Numbers' button
    '
            Debug.Print "GSTIN Filled     ", Timer, "J=" & J
            Do While .Busy Or .readyState <> 4                                                  '   Loop to wait for website to fully load
              DoEvents
            Loop                                                                                '   Loop back
    '
            On Error Resume Next
            Debug.Print "Page ready       ", Timer
                tdCnt = .document.getElementsByTagName("table")(0).getElementsByTagName("TR").Length
            On Error GoTo 0
            Debug.Print "Tbl ready        ", tdCnt, I
            If tdCnt > (Sheets("Sheet1").Range("A1:A" & Range("A" & Rows.Count).End(xlUp).Row).Rows.Count / 2 + 1) Then Exit For
        Next J
        Debug.Print "Ready to download", Timer
        Set WB = Workbooks.Open(.document.querySelector("[class='pull-right btn btn-sm btn-excel']"))   ' Open the download link into Excel workbook
        Debug.Print "Downloaded       ", Timer
'
        .Quit                                                                               '   Close the browser
        Set Browser = Nothing                                                               '   Clear browser from memory
    End With
    Debug.Print
End Sub

With this version the required time is reduced to 7 seconds
Code:
Start                        62420,94 
Status = Ready               62421,66      3 
Ready to load                62421,74     I=1
GSTIN Filled                 62421,74     J=1
Page ready                   62423,43 
Tbl ready                    0             1 
Ready to load                62423,5      I=1
GSTIN Filled                 62423,53     J=2
Page ready                   62423,54 
Tbl ready                    0             1 
Ready to load                62423,54     I=1
GSTIN Filled                 62423,56     J=3
Page ready                   62424,45 
Tbl ready                    27            1 
Ready to download            62424,66 
Downloaded                   62427,4

The results that I presented above was obtained with the 26 GSTINs you listed (19 unique GSTin)
Using this second Sub IE_Browser_Test1, what timing do you get? You can copy the content of the vba Immediate window (from Start xyzwz to Downloaded) and insert that in your next message.
To access the vba Immediate window, typing Contr-g (while in the vba window) should do the job; or Menu /View /Immediate window
 
Upvote 0
Thank you @Anthony47 for responding.

Honestly, on my end, there was only a slight increase in speed by using the version that you submitted.

Results from code I submitted:
Rich (BB code):
Start                        63552.18  \  7.05 seconds Initial Page Load Time        6.04 avg seconds Initial Page Load Time
Initial Page Ready           63559.23  /
Textbox Filled & Submitted   63559.32
Second Page Ready            63561.46
Tbl ready                    27
Ready to Open File           63561.82  \  6.76 seconds Open Time            7.135 avg seconds Open Time
File Opened                  63568.58  /

Results from code you submitted:
Rich (BB code):
Start                        65790.45             4.26 seconds Initial Page Load Time        3.995 avg
Status = Ready               65794.71      3
Ready to load                65795.2      I=1
GSTIN Filled                 65795.23     J=1
Page ready                   65797.11
Tbl ready                    27            1
Ready to download            65797.84             6.9 seconds Open Time                7.05 avg
Downloaded                   65804.74

That being said, in addition to IE pretty much being phased out for use, in my testing MSXML, for example, is probably 5 times faster, or more, than the slowest approach possible ... using IE. :(

That is why I was asking if there was an alternative approach to using IE.

The following is some different code I worked up for testing the IE approach to scrape a couple values ... Time elapsed = 22.8878354816379 seconds. :
VBA Code:
Option Explicit

#If VBA7 Then
    Private Declare PtrSafe Function getFrequency Lib "kernel32" Alias _
            "QueryPerformanceFrequency" (cyFrequency As Currency) As Long
    Private Declare PtrSafe Function getTickCount Lib "kernel32" Alias _
            "QueryPerformanceCounter" (cyTickCount As Currency) As Long
#Else
    Private Declare Function getFrequency Lib "kernel32" Alias _
            "QueryPerformanceFrequency" (cyFrequency As Currency) As Long
    Private Declare Function getTickCount Lib "kernel32" Alias _
            "QueryPerformanceCounter" (cyTickCount As Currency) As Long
#End If

Sub IE_BrowserTestV2()                                                                        ' 22.95 seconds average
'
    Dim StartTime   As Double
    StartTime = MicroTimer                                                                  ' Start the stopwatch
'
    Dim RowCounter  As Long
    Dim Browser     As Object
    Dim WebSite     As String
'
    Set Browser = CreateObject("InternetExplorer.Application")
'
    WebSite = "https://finance.yahoo.com/quote/AMZN"                                        ' URL to scrape data from
    RowCounter = 1                                                                          ' Results Row
'
    With Browser
        .Visible = True                                                                     '   Set IE window status to visible
        .navigate WebSite                                                                   '   Go to the website
'
        Do While .Busy Or .readyState <> 4                                                  '   Loop to wait for website to fully load
            DoEvents
        Loop                                                                                '   Loop back
'
        Range("A" & RowCounter).Value = .document.getElementsByTagName("td")(11).innerText  '   Past 1 year price range
        Range("B" & RowCounter).Value = .document.getElementsByTagName("td")(31).innerText  '   Yahoo 1 year price estimate
'
        .Quit                                                                               '   Close the browser
    End With
'
    Set Browser = Nothing                                                                   ' Clear browser from memory
'
    Debug.Print "Time elapsed = " & (MicroTimer - StartTime) & " seconds."                  ' Display Elapsed Time into Immediate Window (CTRL+G)
    MsgBox "Completed."                                                                     ' Notify user that the script has finished
End Sub


Public Function MicroTimer() As Double                                                      ' Precision depends on the frequency of the CPU in the computer
'
' Code by Charles Williams originally
' Uses Windows API calls to the high resolution timer
' Returns time in seconds
'
    Dim cyTicks1        As Currency
    Static cyFrequency  As Currency
'
    MicroTimer = 0                                                                          ' Initialize MicroTimer to zero
'
    If cyFrequency = 0 Then getFrequency cyFrequency                                        ' Get ticks/second aka frequency
'
    getTickCount cyTicks1                                                                   ' Get # of ticks
'
    If cyFrequency Then MicroTimer = cyTicks1 / cyFrequency                                 ' Calculate seconds ... seconds = Ticks/Frequency
End Function


When converted to use MSXML2 ... Time elapsed = 5.1709506996167 seconds. :
VBA Code:
Option Explicit

#If VBA7 Then
    Private Declare PtrSafe Function getFrequency Lib "kernel32" Alias _
            "QueryPerformanceFrequency" (cyFrequency As Currency) As Long
    Private Declare PtrSafe Function getTickCount Lib "kernel32" Alias _
            "QueryPerformanceCounter" (cyTickCount As Currency) As Long
#Else
    Private Declare Function getFrequency Lib "kernel32" Alias _
            "QueryPerformanceFrequency" (cyFrequency As Currency) As Long
    Private Declare Function getTickCount Lib "kernel32" Alias _
            "QueryPerformanceCounter" (cyTickCount As Currency) As Long
#End If

Sub XML_TestV2()                                                                      ' 3.8 seconds average
'
' Add reference to Microsoft HTML Object Library ;)
'
    Dim StartTime   As Double
    StartTime = MicroTimer                                                          ' Start the stopwatch
'
    Dim RowCounter  As Long
    Dim WebSite     As String
    Dim Doc         As New HTMLDocument
'
    WebSite = "https://finance.yahoo.com/quote/AMZN"                                ' URL to scrape data from
    RowCounter = 2                                                                  ' Results Row
'
    Doc.body.innerhtml = GetHTML(WebSite)
'
    Range("A" & RowCounter).Value = Doc.getElementsByTagName("td")(11).innerText    ' Past 1 year price range
    Range("B" & RowCounter).Value = Doc.getElementsByTagName("td")(31).innerText    ' Yahoo 1 year price estimate
'
    Set Doc = Nothing                                                               ' Clear Doc from memory
'
    Debug.Print "Time elapsed = " & (MicroTimer - StartTime) & " seconds."          ' Display Elapsed Time into Immediate Window (CTRL+G)
    MsgBox "Completed."                                                             ' Notify user that the script has finished
End Sub


Function GetHTML(strURL As String) As String
'
    Dim WebSiteRequest  As Object
    Dim WebSiteResponse As String
'
    Set WebSiteRequest = CreateObject("MSXML2.ServerXMLHTTP")
'
    With WebSiteRequest
        .Open "GET", strURL, False
        .send
'
        If .Status = 200 Then                                                       '   If Status = 200 then ...
            WebSiteResponse = .responseText                                         '       Save the response received into WebSiteResponse
        Else                                                                        '   Else ...
            WebSiteResponse = ""                                                    '       Error encountered
        End If
    End With
'
    GetHTML = WebSiteResponse                                                       ' Save WebSiteResponse for result of GetHTML function
End Function


Public Function MicroTimer() As Double                                              ' Precision depends on the frequency of the CPU in the computer
'
' Code by Charles Williams originally
' Uses Windows API calls to the high resolution timer
' Returns time in seconds
'
    Dim cyTicks1        As Currency
    Static cyFrequency  As Currency
'
    MicroTimer = 0                                                                  ' Initialize MicroTimer to zero
'
    If cyFrequency = 0 Then getFrequency cyFrequency                                ' Get ticks/second aka frequency
'
    getTickCount cyTicks1                                                           ' Get # of ticks
'
    If cyFrequency Then MicroTimer = cyTicks1 / cyFrequency                         ' Calculate seconds ... seconds = Ticks/Frequency
End Function

Quite a difference for the results of the two codes.
 
Last edited:
Upvote 0
That site processes the data when you submit the form, there is not a different method for enquiry. Filling the form requires the page to be loaded and then interacting with it.
Your need on average 16 secs to complete the process, and 13 are needed to get the page, download and open the excel file; my macro cut a couple of seconds on opening the page (basically not waiting for the complete page but stopping when the table is ready to be filled).
My cycle took, on my Pc and my network, some 7 seconds and I guess this is due to a faster internet connection.

In addition, while searching for a different method for quering the database, I went to www.gstzen.in/a/terms-and-conditions.html:
You agree not to access (or attempt to access) any of the Services by any means other than through the interface that is provided by GSTZen, unless you have been specifically allowed to do so in a separate agreement with GSTZen. You specifically agree not to access (or attempt to access) any of the Services through any automated means (including use of scripts or web crawlers) and shall ensure that you comply with the instructions set out in any robots.txt file present on the Services.

So I think that we are breaching the terms & conditions, and this is against the policy of this forum too
 
Upvote 0
That site processes the data when you submit the form, there is not a different method for enquiry. Filling the form requires the page to be loaded and then interacting with it.

So are you saying, there is no way to enter data to a site without using IE to load the page?
 
Upvote 0
It depends on the site, how it receive data and how its results are generated.
I didn't find a different method for querying THAT site (whereas I found the therms & conditions)
 
Upvote 0

Forum statistics

Threads
1,225,754
Messages
6,186,826
Members
453,377
Latest member
JoyousOne

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top