Scrape Webpage Contents with SeleniumBasic

Barklie

Hello,

I am new to SeleniumBasic and am having difficulty scraping webpage contents with it. I get a lot of raw HTML, but I cannot find the page's actual text in it. Below is my code, which requires a reference to the Selenium Type Library to run. I'm not sure what I am doing wrong.

VBA Code:
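Sub WikiScrape()

    ' Requires a reference to the Selenium Type Library
    Dim driver As New ChromeDriver
    driver.Get "https://en.wikipedia.org/wiki/Car"

    ' Print the page's full HTML source to the Immediate Window
    Debug.Print driver.PageSource

End Sub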
 

How is the code you provided not getting the contents from the webpage?

In this code, I added a 1-second Wait to make sure the Get had time to complete. Then, by printing characters from various text locations throughout the page, I was able to verify that PageSource had the data. The "Around the world" text is toward the end of the page, so after waiting 1 second, the code was able to get it.

VBA Code:
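Sub WikiScrape()
    Dim driver As New ChromeDriver
    Dim html As String
    Dim pos As Long

    driver.Get "https://en.wikipedia.org/wiki/Car"

    ' Pause for 1 second (1000 ms) so the page has time to finish loading
    driver.Wait 1000

    ' Read the source once, then look for a phrase from near the end of the page
    html = driver.PageSource
    pos = InStr(1, html, "Around the world there")

    If pos > 0 Then
        Debug.Print Mid(html, pos, 100)
    Else
        Debug.Print "Phrase not found in PageSource"
    End If

    driver.Quit
End Sub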
 
Thank you very much for your response. The code you sent also works for me. I think I may have a fundamental misunderstanding about how either PageSource or Debug.Print works, because when I ran my code (where I Debug.Print the entire PageSource) and pasted the results into Word, the words "Around the world" were nowhere to be found. Perhaps there is some sort of character limitation.

I may have to alter my normal way of coding web scrapes, which used to be to examine the HTML (IE.Document.body.innerHTML) and isolate the value I wanted by the text surrounding it.

Thanks again,
Barklie
 
You're welcome.

The Immediate Window (i.e., where Debug.Print writes its output) definitely has a size limitation.
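One way around the Immediate Window limit is to write PageSource to a text file and open that instead. The following is only a sketch: the helper name SaveTextToFile, the Sub name WikiScrapeToFile, and the output file name are made up for illustration, and it assumes the same Selenium Type Library reference as the code above.

```vba
' Hypothetical helper: writes a string to a text file so it can be read in full.
' Note: Print # writes ANSI text, so characters outside the ANSI code page
' may be saved as "?".
Sub SaveTextToFile(ByVal filePath As String, ByVal content As String)
    Dim fileNum As Integer
    fileNum = FreeFile
    Open filePath For Output As #fileNum
    Print #fileNum, content
    Close #fileNum
End Sub

Sub WikiScrapeToFile()
    Dim driver As New ChromeDriver
    Dim html As String

    driver.Get "https://en.wikipedia.org/wiki/Car"
    html = driver.PageSource
    driver.Quit

    ' The length shows how much text PageSource really returned
    Debug.Print "Characters in page source: " & Len(html)

    ' Assumed output location: the user's TEMP folder
    SaveTextToFile Environ$("TEMP") & "\car_page_source.txt", html
End Sub
```

Opening the saved file in a text editor and searching for "Around the world" should then show whether the text was in PageSource all along.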

I use Selenium to do some web scraping of data from the Google Patents pages. Check out examples of scraping data using Selenium. I use targeted element finding to help parse the data.

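For example:
VBA Code:
' ch is assumed to be the ChromeDriver instance, already on a Google Patents page
Dim Result As Selenium.WebElement
Dim appSerNo As String, inventor As String

' The application number is the part of the element's content after the colon
Set Result = ch.FindElementByName("citation_patent_application_number")
appSerNo = Split(Result.Attribute("content"), ":")(1)

' The inventor name is the element's content as-is
Set Result = ch.FindElementByName("DC.contributor")
inventor = Result.Attribute("content")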

I didn't study the Car page on Wikipedia enough to know whether it even uses such elements. That's why I went with just looking for the text. If the page you are scraping has elements you can target, as in my example above, it reduces manual parsing a lot.
 
Thanks for that information. I've been using FindElementByCss and it is a million times easier than parsing data based on nearby characters.

I appreciate the help!
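
For anyone who finds this thread later, here is a rough sketch of what the FindElementByCss approach could look like on the Wikipedia Car page. The two CSS selectors are assumptions about Wikipedia's current markup and were not discussed above, and the Sub name is made up, so treat this as a starting point rather than tested code.

VBA Code:
```vba
Sub WikiScrapeByCss()
    ' Requires a reference to the Selenium Type Library
    Dim driver As New ChromeDriver
    Dim heading As Selenium.WebElement
    Dim paragraphs As Selenium.WebElements
    Dim para As Selenium.WebElement
    Dim shown As Long

    driver.Get "https://en.wikipedia.org/wiki/Car"

    ' Assumed selector: the article title is the element with id "firstHeading"
    Set heading = driver.FindElementByCss("#firstHeading")
    Debug.Print "Title: " & heading.Text

    ' Assumed selector: body paragraphs are <p> tags inside #mw-content-text
    Set paragraphs = driver.FindElementsByCss("#mw-content-text p")
    Debug.Print "Paragraphs found: " & paragraphs.Count

    ' Print only the start of the first few non-empty paragraphs,
    ' which keeps the output well within the Immediate Window
    For Each para In paragraphs
        If Len(Trim$(para.Text)) > 0 Then
            Debug.Print Left$(para.Text, 80)
            shown = shown + 1
            If shown >= 3 Then Exit For
        End If
    Next para

    driver.Quit
End Sub
```

Because each element's Text is the visible text rather than raw HTML, there is no tag markup to strip out afterward.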
 