Excel VBA - Web Scraping - How to scrape "not table-ized" data

alpha2007

Office Version: 2016, Platform: Windows
Hi guys,

I am facing another problem in my web scraping VBA macro: I need to scrape data from HTML that is not laid out as a table.

The HTML is as follows:

HTML:
<div class="broker">
<a id="contactBrokerPhoto" href="/business-broker/robert-j-hough/sunbelt-business-brokers/511/" target="">
<div class="bImages">
<img id="ctl00_ctl00_Content_ContentPlaceHolder1_wideProfile_ctl03_imgPersonalPhoto" class="headshot" onerror="this.onerror=null;this.src=&#39;/xcommon/images/broker/nophoto.png&#39;;" src="https://images.bizbuysell.com/shared/brokerdirectory/images/1644/lg_prs_FormalHeadshot2.JPG" alt="Robert J. Hough" />
</div>
</a>
<h3>
Business
Listed By:<br />
<a id="ctl00_ctl00_Content_ContentPlaceHolder1_wideProfile_ctl03_ContactBrokerNameHyperLink" href="/business-broker/robert-j-hough/sunbelt-business-brokers/511/">Robert Hough</a>
<h4><span>Sunbelt Business Brokers</span></h4>
<div class="disclaimer" style="clear: both;">
<hr style="margin: 8px 0 16px;"/>
<p><b>Ad#:1842483</b></p>


In this example, I need to scrape the following 3 data elements:

strListedBy
"Robert Hough" from the HTML line:
<a id="ctl00_ctl00_Content_ContentPlaceHolder1_wideProfile_ctl03_ContactBrokerNameHyperLink" href="/business-broker/robert-j-hough/sunbelt-business-brokers/511/">Robert Hough</a>

strBroker
"Sunbelt Business Brokers" from the HTML line:
<h4><span>Sunbelt Business Brokers</span></h4>

strAdID
"1842483" from the HTML line:
<p><b>Ad#:1842483</b></p>

Thanks for your help and for a code sample of how to do it!

Best,
Tony
 

Keeping it simple, and assuming the layout is consistent between different pages, you could do this:
VBA Code:
    Dim HTMLdoc As HTMLDocument
    Dim div As HTMLDivElement
    Dim lines As Variant, i As Long
    
    Set HTMLdoc = IE.document  'IE is internetExplorer browser object with page loaded
    Set div = HTMLdoc.getElementsByClassName("broker")(0)
    lines = Split(div.innerText, vbCrLf)    
    Debug.Print Trim(lines(4))
    Debug.Print Trim(lines(6))
    Debug.Print Split(lines(10), ":")(1)
The code uses early binding so requires a reference to Microsoft HTML Object Library (set via Tools -> References in the VBA editor).
 
John_w, thank you!

The line

Debug.Print Split(lines(10), ":")(1)

raises run-time error 9 ("Subscript out of range").

What would be the code to get the actual data so that I can fill the data in cells?
 
The 4, 6, 10 indexes worked with the posted HTML. 10 is the last array index, which can be written as:
VBA Code:
    Debug.Print Split(lines(UBound(lines)), ":")(1)

Output the lines array and see if you need to adjust the indexes:
VBA Code:
    For i = 0 To UBound(lines)
        Debug.Print i, lines(i)
    Next
Which outputs:
HTML:
 0         
 1           
 2           
 3            Business Listed By:
 4            Robert Hough
 5         
 6            Sunbelt Business Brokers
 7         
 8         
 9         
 10           Ad#:1842483
As you can see, the only 'marker' text is "Business Listed By:", so you might need to look for that in the lines array and extract the array elements relative to it (i.e. the name is the 1st element after it, the business name is the 3rd element after it, etc.).
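
A minimal sketch of that marker-based lookup, not tested against the live pages. FindMarkerIndex and PrintBrokerByMarker are hypothetical names, and the offsets 1 and 3 come from the output above, so they assume the text after the marker keeps that layout. The "Ad#:" search assumes the ad line always contains that prefix.

```vba
' Hypothetical helper: index of the first element of lines that contains
' marker, or -1 if no element contains it.
Function FindMarkerIndex(lines As Variant, marker As String) As Long
    Dim i As Long
    FindMarkerIndex = -1
    For i = LBound(lines) To UBound(lines)
        If InStr(1, lines(i), marker, vbTextCompare) > 0 Then
            FindMarkerIndex = i
            Exit Function
        End If
    Next i
End Function

' Hypothetical usage: lines is the array returned by Split(div.innerText, vbCrLf)
Sub PrintBrokerByMarker(lines As Variant)
    Dim m As Long, a As Long

    m = FindMarkerIndex(lines, "Business Listed By:")
    ' Only read the relative elements if they are inside the array bounds
    If m >= 0 And m + 3 <= UBound(lines) Then
        Debug.Print Trim(lines(m + 1))   'name: 1st element after the marker
        Debug.Print Trim(lines(m + 3))   'business: 3rd element after the marker
    Else
        Debug.Print "Marker not found, or too few lines after it"
    End If

    a = FindMarkerIndex(lines, "Ad#:")
    If a >= 0 Then Debug.Print Trim(Split(lines(a), ":")(1))
End Sub
```

Because every index is checked against the array bounds first, a page with a different number of lines prints a message instead of raising error 9.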

What would be the code to get the actual data so that I can fill the data in cells?
Something like this:
VBA Code:
    Range("A1").Value = Trim(lines(4))
    Range("B1").Value = Trim(lines(6))
    Range("C1").Value = Split(lines(UBound(lines)), ":")(1)
 
OK, I understand.

But the risk is extremely high that the lines will be different on each of the web pages I want to scrape data from.
Is there no other option?

These web pages are all built dynamically, and the number of lines can change at any time depending on the content.
 
Yes, there is another way, but this relies on a consistent HTML element structure:
VBA Code:
    Set div = HTMLdoc.getElementsByClassName("broker")(0)
    Range("A1").Value = div.querySelector("h3 > a").innerText
    Range("B1").Value = div.querySelector("h4").innerText
    'Either
    Range("C1").Value = Split(div.querySelector("div > p").innerText, ":")(1)
    'Or
    Range("D1").Value = Split(div.querySelector("div.disclaimer > p").innerText, ":")(1)
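
If some pages lack one of these elements, querySelector should return Nothing and the .innerText call would then fail. A hedged sketch of a guard for that case: SelectorText and FillBrokerCells are hypothetical names, and the selectors are the ones used above. It still needs the Microsoft HTML Object Library reference.

```vba
' Hypothetical helper: text of the first element under container that
' matches selector, or "" when nothing matches.
Function SelectorText(container As HTMLDivElement, selector As String) As String
    Dim el As Object
    Set el = container.querySelector(selector)
    If Not el Is Nothing Then SelectorText = Trim(el.innerText)
End Function

' Hypothetical usage: div is the element with class "broker"
Sub FillBrokerCells(div As HTMLDivElement)
    Dim adText As String

    Range("A1").Value = SelectorText(div, "h3 > a")
    Range("B1").Value = SelectorText(div, "h4")

    adText = SelectorText(div, "div.disclaimer > p")
    ' Only split when the ":" separator is actually present
    If InStr(adText, ":") > 0 Then
        Range("C1").Value = Trim(Split(adText, ":")(1))
    End If
End Sub
```

A missing element then leaves the cell empty rather than stopping the macro, which makes it easier to spot pages whose structure differs.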
 
Solution
John_w - Thank you very much! I have learned a lot by using your code!
 
Upvote 0
Att. - John_w

When I run your last VBA code, an error message pops up.

For the code line

Set div = HTMLdoc.getElementsByClassName("broker")(0)

Run-time error 424
"Object required"

What am I missing here?
 
Have you got this line?
VBA Code:
    Set HTMLdoc = IE.document  'IE is internetExplorer browser object with page loaded
Are there any HTML elements with class name "broker"?
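
Both questions can be checked in code before the failing line runs. This is only a sketch: CheckBrokerDiv is a hypothetical name, and it assumes HTMLdoc is declared As HTMLDocument as in the first example.

```vba
' Hypothetical diagnostic: reports why the "broker" lookup might fail
Sub CheckBrokerDiv(HTMLdoc As HTMLDocument)
    Dim brokers As Object

    ' 1. Has HTMLdoc been assigned from the loaded page?
    If HTMLdoc Is Nothing Then
        Debug.Print "HTMLdoc is not set; assign it from the loaded page first."
        Exit Sub
    End If

    ' 2. Does the page contain any element with class "broker"?
    Set brokers = HTMLdoc.getElementsByClassName("broker")
    If brokers.Length = 0 Then
        Debug.Print "No element with class ""broker"" on this page."
        Exit Sub
    End If

    Debug.Print "Found " & brokers.Length & " element(s) with class ""broker""."
End Sub
```

Call it with the document object right after the page has loaded, then check the Immediate window.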
 