Excel VBA - Web Scraping - How to scrape "not table-ized" data

alpha2007

New Member
Joined
Jun 20, 2021
Messages
24
Office Version
  1. 2016
Platform
  1. Windows
Hi guys,

I am facing another problem in my web scraping VBA macro - I need to scrape data from HTML code

The HTML is as following:

HTML:
<div class="broker">
<a id="contactBrokerPhoto" href="/business-broker/robert-j-hough/sunbelt-business-brokers/511/" target="">
<div class="bImages">
<img id="ctl00_ctl00_Content_ContentPlaceHolder1_wideProfile_ctl03_imgPersonalPhoto" class="headshot" onerror="this.onerror=null;this.src=&#39;/xcommon/images/broker/nophoto.png&#39;;" src="https://images.bizbuysell.com/shared/brokerdirectory/images/1644/lg_prs_FormalHeadshot2.JPG" alt="Robert J. Hough" />
</div>
</a>
<h3>
Business
Listed By:<br />
<a id="ctl00_ctl00_Content_ContentPlaceHolder1_wideProfile_ctl03_ContactBrokerNameHyperLink" href="/business-broker/robert-j-hough/sunbelt-business-brokers/511/">Robert Hough</a>
<h4><span>Sunbelt Business Brokers</span></h4>
<div class="disclaimer" style="clear: both;">
<hr style="margin: 8px 0 16px;"/>
<p><b>Ad#:1842483</b></p>


In this example, I would need the following 3 data elements being scraped

strListedBy
"Robert Hough" from the HTML line:
<a id="ctl00_ctl00_Content_ContentPlaceHolder1_wideProfile_ctl03_ContactBrokerNameHyperLink" href="/business-broker/robert-j-hough/sunbelt-business-brokers/511/">Robert Hough</a>

strBroker
"Sunbelt Business Brokers" from the HTML line:
<h4><span>Sunbelt Business Brokers</span></h4>

strAdID
"1842483" from the HTML line:
<p><b>Ad#:1842483</b></p>

Thanks for your help and for a code sample of how to do it!

Best,
Tony
 
Hi John_w

Yes, there are HTML elements with the class name "broker"

I added the HTMLdoc....

Results:
The first two code lines worked fine and scraped the wanted data correctly

The third and the fourth lines of code did produce the error "424" - Object needed


VBA Code:
    Set div = HTMLdoc.getElementsByClassName("broker")(0)
    Range("A1").Value = div.querySelector("h3 > a").innerText
    Range("B1").Value = div.querySelector("h4").innerText
    'Either
    Range("C1").Value = Split(div.querySelector("div > p").innerText, ":")(1)
    'Or
    Range("D1").Value = Split(div.querySelector("div.disclaimer > p").innerText, ":")(1)


Obviously, there is something missing again?

Could you help, please
 

Attachments

  • 1624365861297.png
    1624365861297.png
    3.1 KB · Views: 40
Upvote 0

Excel Facts

Whats the difference between CONCAT and CONCATENATE?
The newer CONCAT function can reference a range of cells. =CONCATENATE(A1,A2,A3,A4,A5) becomes =CONCAT(A1:A5)
What is the URL? I can't help without it.
Unfortunately, the offer that I copied the HTML source code from is no longer on sale and therefore, I can't give you the URL
Another offer would be



But the HTML might be different?
 
Upvote 0
The code correctly extracts the person's name and business name from the HTML. The error occurs when trying to extract the Ad number because that is in a different part of the HTML to that shown in your first post. The Ad number isn't inside <div class="broker">, but inside <div class="disclaimer" style="clear: both;"> within <div class="span8" id="premiumListingDetails">.

Try this code:
VBA Code:
    Dim HTMLdoc As HTMLDocument
    Dim div As HTMLDivElement
    Dim adP As HTMLParaElement
    
    Set HTMLdoc = IE.document
    
    Set div = HTMLdoc.querySelector("div.broker")
    Range("A1").Value = div.querySelector("h3 > a").innerText           'person
    Range("B1").Value = div.querySelector("h4").innerText               'business name
    
    Set div = HTMLdoc.getElementById("premiumListingdetails")
    Set adP = div.querySelector("div.disclaimer > p")
    Range("C1").Value = Split(adP.innerText, ":")(1)                    'ad#
 
Upvote 0
Hi,

the code line

VBA Code:
Set div = HTMLdoc.getElementById("premiumListingdetails")

still triggers the same error message
 
Upvote 0
No changes - this code works for me.
VBA Code:
Public Sub IE_Test()

    Dim IE As InternetExplorer
    Dim URL As String
    Dim HTMLdoc As HTMLDocument
    Dim div As HTMLDivElement
    Dim adP As HTMLParaElement
    
    URL = "https://www.bizbuysell.com/Business-Opportunity/Commercial-and-Residential-Lawn-Maintenance-Company-Home-Based/1772428/"
    
    Set IE = New InternetExplorer
    With IE
        .Visible = True
        .navigate URL
        While .Busy Or .readyState <> READYSTATE_COMPLETE: DoEvents: Wend
        Set HTMLdoc = .document
    End With

    Range("A1:C1").Clear
    
    Set div = HTMLdoc.querySelector("div.broker")    
    Range("A1").Value = div.querySelector("h3 > a").innerText           'person
    Range("B1").Value = div.querySelector("h4").innerText               'business name
    
    Set div = HTMLdoc.getElementById("premiumListingdetails")
    Set adP = div.querySelector("div.disclaimer > p")
    Range("C1").Value = Split(adP.innerText, ":")(1)                    'ad#
            
End Sub
 
Upvote 0
OK, I copied your VBA code in a new Excel workbook and run it alone
YES - it works.

There must be some code in my VBA code that conflicts with your Dim definitions and triggers the respective error.

My solution: I just use your sub in a second step and get the needed information.
 
Upvote 0
My VBA code is as following:

VBA Code:
Sub dataBizBuySell() 's()
    
    Dim IE, element, ele, elediv, mtbl As Object
    Dim doc As HTMLDocument
    Dim HTMLdoc As HTMLDocument
    Dim ws As Worksheet
    Dim strURL, strTitle, strSubTitle, txt, askingPrice, cashFlow, grossRevenue, ebitda, ffe, inventory, established As String
    Dim strLocation, strEmployees, strFFE, strTraining, strReason As String
    Dim intRows, rowNo As Long
    
    
    Set ws = ThisWorkbook.Sheets("data")

    ws.Range("AZ1").Value = "=CountA(A:A)"
    intRows = ws.Range("AZ1").Value

    Set IE = CreateObject("InternetExplorer.application")
    IE.Visible = True

    For rowNo = 1 To intRows
        
        strURL = ws.Range("A" & rowNo).Text
        IE.navigate strURL
                
        Do While IE.Busy Or IE.readyState <> 4
        Application.Wait DateAdd("s", 1, Now)
        Loop
                
        Set doc = IE.document
                    
                  
        With doc.querySelectorAll("div.span8 > h1,h2")
            If .Length > 0 Then
                If Left(.Item(0).innerText, 36) <> "This listing is no longer available." Then
              
                    strTitle = doc.getElementsByClassName("bfsTitle")(0).innerText
                    ws.Range("C" & rowNo).Value = strTitle
                    
                    strSubTitle = doc.getElementsByClassName("span8")(0).innerText
                    ws.Range("D" & rowNo).Value = strSubTitle
                      
                    For Each ele In doc.getElementsByClassName("title")
                        txt = ele.parentElement.innerText
                        
                        If Left(txt, 12) = "Asking Price" Then
                            askingPrice = Trim(Mid(txt, InStrRev(txt, ":") + 1))
                        ElseIf Left(txt, 9) = "Cash Flow" Then
                            cashFlow = Trim(Mid(txt, InStrRev(txt, ":") + 1))
                        ElseIf Left(txt, 13) = "Gross Revenue" Then
                            grossRevenue = Trim(Mid(txt, InStrRev(txt, ":") + 1))
                        ElseIf Left(txt, 6) = "EBITDA" Then
                            ebitda = Trim(Mid(txt, InStrRev(txt, ":") + 1))
                        ElseIf Left(txt, 4) = "FF&E" Then
                            ffe = Trim(Mid(txt, InStrRev(txt, ":") + 1))
                        ElseIf Left(txt, 9) = "Inventory" Then
                            inventory = Trim(Mid(txt, InStrRev(txt, ":") + 1))
                        ElseIf Left(txt, 11) = "Established" Then
                            established = Trim(Mid(txt, InStrRev(txt, ":") + 1))
                        End If
                    Next ele
                            
                    strLocation = doc.getElementsByTagName("dd")(0).innerText
                    ws.Range("K" & rowNo).Value = strLocation
                            
                    strEmployees = doc.getElementsByTagName("dd")(1).innerText
                    ws.Range("L" & rowNo).Value = strEmployees
                            
                    strFFE = doc.getElementsByTagName("dd")(2).innerText
                    ws.Range("M" & rowNo).Value = strFFE
                            
                    strTraining = doc.getElementsByTagName("dd")(3).innerText
                    ws.Range("M" & rowNo).Value = strTraining
                            
                    strReason = doc.getElementsByTagName("dd")(4).innerText
                    ws.Range("O" & rowNo).Value = strReason
 
                    strBrokerFirm = doc.querySelector("div.broker > h4").innerText
                    ws.Range("Q" & rowNo).Value = strBrokerFirm
                            
                    ws.Range("E" & rowNo).Value = askingPrice
                    ws.Range("F" & rowNo).Value = cashFlow
                    ws.Range("G" & rowNo).Value = grossRevenue
                    ws.Range("H" & rowNo).Value = ffe
                    ws.Range("I" & rowNo).Value = inventory
                    ws.Range("J" & rowNo).Value = established
                    
                End If
                
                If ws.Range("C" & rowNo).Value = "" Then
                    ws.Range("B" & rowNo).Value = "Not available"
                End If
            
            End If
        End With
    
    Next
    
    IE.Quit
    Set IE = Nothing
    MsgBox "done"

End Sub


I have no idea what might conflict with your code?
Perhaps you can identify the problem?
 
Upvote 0

Forum statistics

Threads
1,224,823
Messages
6,181,175
Members
453,021
Latest member
Justyna P

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top