Extracting tables from web site using vba

viktor4e

New Member
Joined
Jun 30, 2014
Messages
16
Hi everyone,

I am interested in extracting the 2 "Longevity" and "Sillage" tables from the link below in excel using VBA:

Burberry Brit for Men Burberry cologne - a fragrance for men 2004


Currently, I am able to do so using the following code:

Longevity = doc.getElementsByTagName("tbody")(0).innerText
Sillage = doc.getElementsByTagName("tbody")(2).innerText

The issue is that these are extracted in a single cell along with the description words. Is it possible to get each value (numbers only) in a separate cell?

Much appreciated in advance!
 

Excel Facts

How to total the visible cells?
From the first blank cell below a filtered data set, press Alt+=. Instead of SUM, you will get SUBTOTAL(9,)
This is the full code, where in Column K I have multiple links to that web site and extract various information. The current question relates to just 'Extract3" and "Extract4" below:

Sub Macro()
Application.ScreenUpdating = True


Dim I As Integer
Dim ie As New InternetExplorer
Dim rating As Object
Dim doc As HTMLDocument
Dim pname As String


'IE.Visible = True


For I = 4 To 63




ie.Navigate Sheet1.Cells(I, 11).Value


Do
DoEvents
Loop Until ie.readyState = READYSTATE_COMPLETE


Set doc = ie.document
Dim Extract As String, Extract2 As String, Extract3 As String, Extract4 As String


With doc
Set rating = .querySelectorAll("[itemprop=aggregateRating]")(0)
If Not rating Is Nothing Then


Extract = doc.getElementsByClassName("effect6")(0).PreviousSibling.getElementsByTagName("span")(0).innerText
Extract2 = doc.getElementsByClassName("effect6")(0).PreviousSibling.getElementsByTagName("span")(2).innerText
Extract3 = doc.getElementsByTagName("tbody")(0).innerText
Extract4 = doc.getElementsByTagName("tbody")(2).innerText
pname = .getElementById("col1").getElementsByTagName("h1")(0).innerText


Cells(I, 1).Value = Extract
Cells(I, 2).Value = Extract2
Cells(I, 3).Value = Extract3
Cells(I, 4).Value = Extract4
Cells(I, 5).Value = pname


End If
End With
Next I
ie.Quit


Application.ScreenUpdating = True
End Sub
 
Upvote 0
viktor4e,

Thanks for posting the code. Rather than make changes to it - other than adding a single line - the approach below is to simply format the extraction result.

So the line to add to your code...

Code:
...
...
IE.Quit

[COLOR=#ff0000]Call FormatFragrantica[/COLOR]
Application.ScreenUpdating = True
End Sub

And the formatting...

Code:
Sub FormatFragrantica()
Dim rng1 As Range, rng2 As Range, rng3 As Range, rng4 As Range
Set rng1 = Range("C4:C63")
Set rng2 = Range("D4:D63")
With rng1
    .Replace Chr(10), ""
    .Replace Chr(13), ""
    .Replace "poor", ""
    .Replace "weak", ","
    .Replace "moderate", ","
    .Replace "very long lasting", ","
    .Replace "long lasting", ","
    .Replace " ", ""
End With
With rng2
    .Replace Chr(10), ""
    .Replace Chr(13), ""
    .Replace "soft", ""
    .Replace "moderate", ","
    .Replace "heavy", ","
    .Replace "enormous", ","
    .Replace " ", ""
End With
Columns("C:D").WrapText = False
rng2.Offset(0, 1).Resize(rng2.Rows.Count, 3).Insert
rng2.TextToColumns Destination:=rng2, DataType:=xlDelimited, Comma:=True
rng1.Offset(0, 1).Resize(rng1.Rows.Count, 4).Insert
rng1.TextToColumns Destination:=rng1, DataType:=xlDelimited, Comma:=True
Columns.AutoFit
End Sub

Essentially the code strips out the text in Columns C and D, then uses the TextToColumns method to place the numbers into distinct cells.

Cheers,

tonyyy
 
Last edited:
Upvote 0
Edit the extraction code...

Code:
        Set doc = IE.document
        Dim Extract As String, Extract2 As String, Extract3 As String, Extract4 As String[COLOR=#ff0000], Extract5 As String[/COLOR]

        With doc
            Set rating = .querySelectorAll("[itemprop=aggregateRating]")(0)
            If Not rating Is Nothing Then
                Extract = doc.getElementsByClassName("effect6")(0).PreviousSibling.getElementsByTagName("span")(0).innerText
                Extract2 = doc.getElementsByClassName("effect6")(0).PreviousSibling.getElementsByTagName("span")(2).innerText
                Extract3 = doc.getElementsByTagName("tbody")(0).innerText
                Extract4 = doc.getElementsByTagName("tbody")(2).innerText
               [COLOR=#ff0000] Extract5 = doc.getElementsByTagName("p")(5).innerText[/COLOR]
                pname = .getElementById("col1").getElementsByTagName("h1")(0).innerText

                Cells(I, 1).Value = Extract
                Cells(I, 2).Value = Extract2
                Cells(I, 3).Value = Extract3
                Cells(I, 4).Value = Extract4
                Cells(I, 5).Value = pname
                [COLOR=#ff0000]Cells(I, 6).Value = Extract5
[/COLOR]

And edit the formating code...

Code:
[COLOR=#ff0000]With rng3
    .Replace Chr(10), ""
    .Replace Chr(13), ""
    .Replace "I have it:", ""
    .Replace "I had it:", ","
    .Replace "I want it:", ","
    .Replace "My signature:", ","
    .Replace " ", ""
End With[/COLOR]
[COLOR=#ff0000]Columns("C:F").WrapText = False
rng3.Offset(0, 1).Resize(rng3.Rows.Count, 3).Insert
rng3.TextToColumns Destination:=rng3, DataType:=xlDelimited, Comma:=True[/COLOR]
rng2.Offset(0, 1).Resize(rng2.Rows.Count, 3).Insert
rng2.TextToColumns Destination:=rng2, DataType:=xlDelimited, Comma:=True
rng1.Offset(0, 1).Resize(rng1.Rows.Count, 4).Insert
rng1.TextToColumns Destination:=rng1, DataType:=xlDelimited, Comma:=True
Columns.AutoFit
End Sub
 
Upvote 0
In the formatting code, edit the following 3 lines...

Code:
.Replace "poor", "'"

.Replace "soft", "'"

.Replace "I have it:", "'"

It might be difficult to see. An apostrophe (') has been added between the double quotes. This will force Excel to treat the cell content as text, and applying the TextToColumns method will recognize the commas as delimiters.
 
Upvote 0
Tonyyy,

What I meant is that the following code (Extract5 = doc.getElementsByTagName("p")(5).innerText) results in something else for this link:
The One for Men Dolce&Gabbana cologne - a fragrance for men 2008

So, instead of extracting this data (I have it: 4222 I had it: 1503 I want it: 1823 My signature: 166) it extracts the text below (After launching more than successful fragrance for women, The One, Dolce&Gabbana house will, at the beginning of March 2008, launch a fragrance for men named The One for Men.)

Can this be fixed somehow?
 
Upvote 0
The only way I've been able to reproduce your "error" is to change the (5) to a (6) in the line...

Code:
Extract5 = doc.getElementsByTagName("p")([COLOR=#ff0000]6[/COLOR]).innerText

The number in parantheses represents the paragraph index/number, so the 5th paragraph is the "I have it.." paragraph; at least it is when I run it. You might change it to a 4 or some other number to see what you get.

And realize too that if the web page changes - the data you extract might not be what you expect.
 
Upvote 0

Forum statistics

Threads
1,223,164
Messages
6,170,444
Members
452,326
Latest member
johnshaji

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top