Scrape website with pagination

beijing0305

New Member
Joined
Feb 14, 2021
Messages
9
Office Version
  1. 2016
Platform
  1. Windows
I have tried a couple of approach to scrape a website with simple data layout, but was not successful with the following 3 issues:
1) tweaking the TagName "tabble", "tr", "tb", not resulting desirable results
2) when (1) was able to pick up the title row, unicode result display the Chinese Character as ???
3) Pagination - probably will need to add another loop to pick up page 3, etc.
Thanks in advance for your help.
Bei


VBA Code:
Sub CAACCertVendorPartListXML()

    Dim XMLPage As New MSXML2.XMLHTTP60
    Dim HTMLDoc As New MSHTML.HTMLDocument
    
    XMLPage.Open "GET", "http://fsop.caac.gov.cn/g145/CARS/WebSiteQueryServlet?method=loadAircraftConditionsResultPage&enterpriseName=&licenceCode=&partsNumber=&partsName=&ataChaptersection=", False
    XMLPage.send
    
    HTMLDoc.body.innerHTML = XMLPage.responseText
    
    ProcessHTMLPage HTMLDoc
    
End Sub

VBA Code:
Sub ProcessHTMLPage(HTMLPage As MSHTML.HTMLDocument)

    Dim HTMLTable As MSHTML.IHTMLElement
    Dim HTMLTables As MSHTML.IHTMLElementCollection
    Dim HTMLRow As MSHTML.IHTMLElement
    Dim HTMLCell As MSHTML.IHTMLElement
        
    Set HTMLTables = HTMLPage.getElementsByTagName("tbody")
    
    For Each HTMLTable In HTMLTables
        Debug.Print HTMLTable.tagName
                
        For Each HTMLRow In HTMLTable.getElementsByTagName("tr")
            'Debug.Print HTMLRow.Children
                                                
            For Each HTMLCell In HTMLRow.getElementsByTagName("td")
                Debug.Print vbTab & HTMLCell.innerText
            Next HTMLCell
                      
        Next HTMLRow
            
    Next HTMLTable
    
End Sub
 

Excel Facts

What is the shortcut key for Format Selection?
Ctrl+1 (the number one) will open the Format dialog for whatever is selected.
Welcome


I did it with Chrome, you will need to follow the instructions on post #4 of the thread below. Note that no looping is required to copy the table.

This example transfers the first two tables.

import data from Web

chinese.PNG


VBA Code:
Public driver As New ChromeDriver
Sub WChr()
Dim pt As WebElement
driver.get "http://fsop.caac.gov.cn/g145/CARS/WebSiteQueryServlet?method=loadAircraft" & _
"ConditionsResultPage&enterpriseName=&licenceCode=&partsNumber=&partsName=&ataChaptersection="
Application.Wait Now + TimeValue("0:00:10")
Set pt = driver.FindElementById("example")
pt.AsTable.ToExcel ActiveSheet.[a80]
Set pt = driver.FindElementByXPath("//*[@id=""example_paginate""]/span/a[2]")   ' page 2 button
pt.Click
Application.Wait Now + TimeValue("0:00:10")
Set pt = driver.FindElementById("example")
pt.AsTable.ToExcel ActiveSheet.Cells(Range("a" & Rows.Count).End(xlUp).row + 1, 1)
End Sub
 
Upvote 0
installed the files in the referenced post. Stop the same error when testing the script. Chrome webversion, Version 88.0.4324.182. Webdriver installed is Version 88.0.4324.96.

Don't know if this is the cause of the error.

On trying to understand the code, don't I need a loop somewhere to get the rest of the pages?

Thanks.
 

Attachments

  • excel vba error.JPG
    excel vba error.JPG
    17 KB · Views: 11
Upvote 0
  • I am running the same Chrome version as you, web driver 87.0.4280.20 and surprisingly it is working for me. Normally when the error is version related the message is more explicit.
  • What line gets highlighted when you click the debug button?
  • Are you trying my code without modifications?
  • Yes, we will need a loop but first let’s get this initial example right for you.
 
Upvote 0
  • I tried installing 87.0.4280.20 web driver like you, but don't know it is installed. Is there a way to uninstall the previous version of the web driver just to make sure I have the same version as yours? The cmd screen just flashes with no confirmation whether the ".20" version is installed. Earlier version has a confirmation in cmd indicating successful installation.​
  • When I click debug button. The "driver.get" link was highlight.... driver.get "http://fsop.caac.gov.cn/g145/CARS/WebSiteQueryServlet?method=loadAircraft" & "ConditionsResultPage&enterpriseName=&licenceCode=&partsNumber=&partsName=&ataChaptersection="
  • I applied your code verbatim, nothing added or deleted.
  • Makes sense to add a loop after the first 2 iterations working first. :-)
 
Upvote 0
Selenium will use the chromedriver.exe file located at the Selenium Basic folder.

When you run it manually the information below should appear. During macro execution everything concerning the driver happens behind the scenes.

Are you running Excel 2016 on Windows?

cd87.PNG
 
Upvote 0
On this laptop, it is running Excel 2013 on Windows 10 machine. Does the code need to adjusted because of the version of Excel?
 
Upvote 0
My version looks completely different! :-(
 

Attachments

  • Capture.JPG
    Capture.JPG
    19.5 KB · Views: 10
Upvote 0
You probably have an old chromedriver.exe in the Selenium folder; delete it and replace with the one compatible with your browser version.

The executable file should have a recent creation date.
 
Upvote 0
Thanks very much Worf! Deleted the driver and used your driver version...worked! Got the first 2 pages in Active Sheet started in cell a80.

Now, I adjusted the starting row at cell a10. (worked as expected) Then, I added a loop to get the first 5 pages with the following adjustment...worked! attached below. A couple of minor issues to overcome: (1) no need the title row. I can always manually paste a row in; doesn't seem there is an ID class to pick up; (2) the "pcount" scheme works only up to page 5, so if I include the loop to 10. It would stop at 5 because there is no "6" button to click. It may make sense to click "Next" button instead - trying it now.

Public driver As New ChromeDriver
Sub WChr()
Dim pt As WebElement
Dim pcount As Integer 'page counter
For pcount = 2 To 5
driver.get "http://fsop.caac.gov.cn/g145/CARS/WebSiteQueryServlet?method=loadAircraft" & _
"ConditionsResultPage&enterpriseName=&licenceCode=&partsNumber=&partsName=&ataChaptersection="
Application.Wait Now + TimeValue("0:00:10")
Set pt = driver.FindElementById("example")
pt.AsTable.ToExcel ActiveSheet.[a10]
Set pt = driver.FindElementByXPath("//*[@id=""example_paginate""]/span/a[" & pcount & "]") ' page 2 button
pt.Click
Application.Wait Now + TimeValue("0:00:10")
Set pt = driver.FindElementById("example")
pt.AsTable.ToExcel ActiveSheet.Cells(Range("a" & Rows.Count).End(xlUp).Row + 1, 1)
Next pcount
End Sub
 
Upvote 0

Forum statistics

Threads
1,224,823
Messages
6,181,171
Members
453,021
Latest member
Justyna P

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top