Extracting tables from web site using vba

viktor4e · Dec 17, 2016

Hi everyone,

I am interested in extracting the 2 "Longevity" and "Sillage" tables from the link below in excel using VBA:

Burberry Brit for Men Burberry cologne - a fragrance for men 2004

Currently, I am able to do so using the following code:

Longevity = doc.getElementsByTagName("tbody")(0).innerText
Sillage = doc.getElementsByTagName("tbody")(2).innerText

The issue is that these are extracted in a single cell along with the description words. Is it possible to get each value (numbers only) in a separate cell?

Much appreciated in advance!

tonyyy · Dec 17, 2016

Can you post the entire code please?

viktor4e · Dec 17, 2016

This is the full code, where in Column K I have multiple links to that web site and extract various information. The current question relates to just 'Extract3" and "Extract4" below:

Sub Macro()
Application.ScreenUpdating = True

Dim I As Integer
Dim ie As New InternetExplorer
Dim rating As Object
Dim doc As HTMLDocument
Dim pname As String

'IE.Visible = True

For I = 4 To 63

ie.Navigate Sheet1.Cells(I, 11).Value

Do
DoEvents
Loop Until ie.readyState = READYSTATE_COMPLETE

Set doc = ie.document
Dim Extract As String, Extract2 As String, Extract3 As String, Extract4 As String

With doc
Set rating = .querySelectorAll("[itemprop=aggregateRating]")(0)
If Not rating Is Nothing Then

Extract = doc.getElementsByClassName("effect6")(0).PreviousSibling.getElementsByTagName("span")(0).innerText
Extract2 = doc.getElementsByClassName("effect6")(0).PreviousSibling.getElementsByTagName("span")(2).innerText
Extract3 = doc.getElementsByTagName("tbody")(0).innerText
Extract4 = doc.getElementsByTagName("tbody")(2).innerText
pname = .getElementById("col1").getElementsByTagName("h1")(0).innerText

Cells(I, 1).Value = Extract
Cells(I, 2).Value = Extract2
Cells(I, 3).Value = Extract3
Cells(I, 4).Value = Extract4
Cells(I, 5).Value = pname

End If
End With
Next I
ie.Quit

Application.ScreenUpdating = True
End Sub

tonyyy · Dec 18, 2016

viktor4e,

Thanks for posting the code. Rather than make changes to it - other than adding a single line - the approach below is to simply format the extraction result.

So the line to add to your code...

Code:

...
...
IE.Quit

[COLOR=#ff0000]Call FormatFragrantica[/COLOR]
Application.ScreenUpdating = True
End Sub

And the formatting...

Code:

Sub FormatFragrantica()
Dim rng1 As Range, rng2 As Range, rng3 As Range, rng4 As Range
Set rng1 = Range("C4:C63")
Set rng2 = Range("D4:D63")
With rng1
    .Replace Chr(10), ""
    .Replace Chr(13), ""
    .Replace "poor", ""
    .Replace "weak", ","
    .Replace "moderate", ","
    .Replace "very long lasting", ","
    .Replace "long lasting", ","
    .Replace " ", ""
End With
With rng2
    .Replace Chr(10), ""
    .Replace Chr(13), ""
    .Replace "soft", ""
    .Replace "moderate", ","
    .Replace "heavy", ","
    .Replace "enormous", ","
    .Replace " ", ""
End With
Columns("C:D").WrapText = False
rng2.Offset(0, 1).Resize(rng2.Rows.Count, 3).Insert
rng2.TextToColumns Destination:=rng2, DataType:=xlDelimited, Comma:=True
rng1.Offset(0, 1).Resize(rng1.Rows.Count, 4).Insert
rng1.TextToColumns Destination:=rng1, DataType:=xlDelimited, Comma:=True
Columns.AutoFit
End Sub

Essentially the code strips out the text in Columns C and D, then uses the TextToColumns method to place the numbers into distinct cells.

Cheers,

tonyyy

viktor4e · Dec 18, 2016

Thank you tonyyy, this works great!

I was wondering, is there an easy way to extract the "I have it: 1520 I had it: 776 I want it: 478 My signature: 32

" line from the same link:

Burberry Brit for Men Burberry cologne - a fragrance for men 2004

Again, I would like to have numbers only all in separate cells. Thanks again!

tonyyy · Dec 18, 2016

Edit the extraction code...

Code:

        Set doc = IE.document
        Dim Extract As String, Extract2 As String, Extract3 As String, Extract4 As String[COLOR=#ff0000], Extract5 As String[/COLOR]

        With doc
            Set rating = .querySelectorAll("[itemprop=aggregateRating]")(0)
            If Not rating Is Nothing Then
                Extract = doc.getElementsByClassName("effect6")(0).PreviousSibling.getElementsByTagName("span")(0).innerText
                Extract2 = doc.getElementsByClassName("effect6")(0).PreviousSibling.getElementsByTagName("span")(2).innerText
                Extract3 = doc.getElementsByTagName("tbody")(0).innerText
                Extract4 = doc.getElementsByTagName("tbody")(2).innerText
               [COLOR=#ff0000] Extract5 = doc.getElementsByTagName("p")(5).innerText[/COLOR]
                pname = .getElementById("col1").getElementsByTagName("h1")(0).innerText

                Cells(I, 1).Value = Extract
                Cells(I, 2).Value = Extract2
                Cells(I, 3).Value = Extract3
                Cells(I, 4).Value = Extract4
                Cells(I, 5).Value = pname
                [COLOR=#ff0000]Cells(I, 6).Value = Extract5

[/COLOR]

And edit the formating code...

Code:

[COLOR=#ff0000]With rng3
    .Replace Chr(10), ""
    .Replace Chr(13), ""
    .Replace "I have it:", ""
    .Replace "I had it:", ","
    .Replace "I want it:", ","
    .Replace "My signature:", ","
    .Replace " ", ""
End With[/COLOR]
[COLOR=#ff0000]Columns("C:F").WrapText = False
rng3.Offset(0, 1).Resize(rng3.Rows.Count, 3).Insert
rng3.TextToColumns Destination:=rng3, DataType:=xlDelimited, Comma:=True[/COLOR]
rng2.Offset(0, 1).Resize(rng2.Rows.Count, 3).Insert
rng2.TextToColumns Destination:=rng2, DataType:=xlDelimited, Comma:=True
rng1.Offset(0, 1).Resize(rng1.Rows.Count, 4).Insert
rng1.TextToColumns Destination:=rng1, DataType:=xlDelimited, Comma:=True
Columns.AutoFit
End Sub

viktor4e · Dec 19, 2016

Tonyyy, This code works fine but not in all case such as with this link:

The One for Men Dolce&Gabbana cologne - a fragrance for men 2008

Any way to address this issue?

tonyyy · Dec 19, 2016

In the formatting code, edit the following 3 lines...

Code:

.Replace "poor", "'"

.Replace "soft", "'"

.Replace "I have it:", "'"

It might be difficult to see. An apostrophe (') has been added between the double quotes. This will force Excel to treat the cell content as text, and applying the TextToColumns method will recognize the commas as delimiters.

viktor4e · Dec 20, 2016

Tonyyy,

What I meant is that the following code (Extract5 = doc.getElementsByTagName("p")(5).innerText) results in something else for this link:
The One for Men Dolce&Gabbana cologne - a fragrance for men 2008

So, instead of extracting this data (I have it: 4222 I had it: 1503 I want it: 1823 My signature: 166) it extracts the text below (After launching more than successful fragrance for women, The One, Dolce&Gabbana house will, at the beginning of March 2008, launch a fragrance for men named The One for Men.)

Can this be fixed somehow?

tonyyy · Dec 20, 2016

The only way I've been able to reproduce your "error" is to change the (5) to a (6) in the line...

Code:

Extract5 = doc.getElementsByTagName("p")([COLOR=#ff0000]6[/COLOR]).innerText

The number in parantheses represents the paragraph index/number, so the 5th paragraph is the "I have it.." paragraph; at least it is when I run it. You might change it to a 4 or some other number to see what you get.

And realize too that if the web page changes - the data you extract might not be what you expect.

Extracting tables from web site using vba

viktor4e

New Member

tonyyy

Well-known Member

viktor4e

New Member

tonyyy

Well-known Member

viktor4e

New Member

tonyyy

Well-known Member

viktor4e

New Member

tonyyy

Well-known Member

viktor4e

New Member

tonyyy

Well-known Member

Similar threads

Forum statistics

Share this page

Extracting tables from web site using vba

New Member

Well-known Member

New Member

Well-known Member

New Member

Well-known Member

New Member

Well-known Member

New Member

Well-known Member

Similar threads

Forum statistics

Share this page

We've detected that you are using an adblocker.

Which adblocker are you using?

Disable AdBlock

Disable AdBlock Plus

Disable uBlock Origin

Disable uBlock