Parsing HTML into Excel

HamiTipene

Board Regular
Joined
Jun 28, 2006
Messages
67
Hi,

I'm looking to extract table data from a webpage into a document in Excel. I have looked around and seen only examples that open the webpage itself to extract the data. I would prefer that this is not done and that it extracts the relevant source data without opening any extra programs.

If you could be clear about any extra references I would need to get the script running that would be great, many example programs I have tried so far do not seem to work, presumably because Excel is unable to read them without the required references. Thanks.
 

Excel Facts

Do you hate GETPIVOTDATA?
Prevent GETPIVOTDATA. Select inside a PivotTable. In the Analyze tab of the ribbon, open the dropown next to Options and turn it off
It's difficult to help you without knowing the URL, but you could try one of these:

1. The simplest, if it works, is a web query or a workbook open to the URL.

2. An XMLhttp GET request. This requires a reference to Microsoft XML v6.0. You could parse the returned ResponseText using string functions like Instr and Mid, or assign it to HTMLDocument.Body and use the HTML library; or save it in a local file to be imported with a web query.

3. HTMLDocument.CreateDocumentFromURL. This requires a reference to Microsoft HTML Object Library. Parse the data with the HTML library.

If none of those work you could try InternetExplorer automation, but set the .Visible property to False so that IE doesn't appear. This requires a reference to Microsoft Internet Controls.

Note that the references are only needed if you want to use early binding, which gives intellisense help when typing code in the VB editor. All the techniques work equally well with late binding:

Dim xxx As Object
Set xxx = CreateObject("yyy.zzz")
 
Upvote 0
Thanks for that.

Here is a webpage example that I am looking to parse:

http://www.beb.de/internet/servlet/.../Capacities/Storageutilization/Harsefeld.html

I actually hit upon your #2 solution by accident by looking at some old code of a colleague of mine. I have been working through it but I have run into some problems.

I would like to extract the data from the table. The problem is that the source code has long spaces between the [td] and [/td] tags, so its a bit more involved than just using the Instr and Mid functions starting from the end of the [td] tag. Could you provide some tips as to how I might work around this? Thanks.

Also, if it makes the task a simpler process, perhaps you could guide me as to how I would assign the response text to HTMLDocument.Body and use the HTML library...
 
Last edited:
Upvote 0
The web page has a Download link. Copy that URL, then File - Open - the URL. The data is delimited by semicolons, so use Data - Text to Columns to separate it. Do these steps with the Macro Recorder to generate the VBA code for further automation if necessary. No need for XMLhttp or HTML parsing.
 
Upvote 0
Without downloading, you could do it like this:

Code:
Sub WebScraping()

    Dim x As Integer, p As Integer, MyTable As Object    'MSHTML.HTMLTable

    ActiveSheet.UsedRange.Clear

    On Error GoTo ErrH

    s = Timer

    With CreateObject("InternetExplorer.Application")

        .Navigate "http://www.beb.de/internet/servlet/content/5868/StartpageEnglish/Gasstorage/Capacities/Storageutilization/Harsefeld.html"

        Do While .Busy: Loop

        'Loop through all the elements in the document
        For Each MyTable In .Document.getElementsByTagName("table")
            With MyTable
                If .className = "inntertab" Then
                    ReDim q(.Rows.Length - 1, 5)
                    For x = 0 To .Rows.Length - 1
                        For p = 0 To .Rows(x).Cells.Length - 1
                            q(x, p) = .Rows(x).Cells(p).innerText
                        Next
                    Next
                End If
            End With

        Next

        [A1].Resize(x, p) = q

ErrH:         .Quit

    End With

    Set MyTable = Nothing: Erase q

    MsgBox Round(Timer - s, 2)

End Sub
 
Upvote 0
Hi again,

If you would go for the (interesting) suggeston of John, use:

Code:
Sub GetMyEnergyData()
    Workbooks.Open "http://www.beb.de/internet/servlet/storagestatistics/getCSV?storageCode=shhar&locale=en"
    Columns(1).TextToColumns Semicolon:=True
    Columns.AutoFit
End Sub

If the download option is left out on the page (websites change), then this method will break down. If on the other hand the tables change, my solution would have to be changed.
 
Upvote 0
Hi again,

If you would go for the (interesting) suggeston of John, use:

Code:
Sub GetMyEnergyData()
    Workbooks.Open "http://www.beb.de/internet/servlet/storagestatistics/getCSV?storageCode=shhar&locale=en"
    Columns(1).TextToColumns Semicolon:=True
    Columns.AutoFit
End Sub

If the download option is left out on the page (websites change), then this method will break down. If on the other hand the tables change, my solution would have to be changed.




I have the lines on the html page which are supposed to be in a table format but they are not and there is no separator between each column data except some undefined spaces as shown in below image, so how can i import it as a proper table .i.e. into separate columns in the excel file, also i have multiple such html files so i can import multiple files into one worksheet.


html page text image

gWSJUG

gWSJUG
 
Last edited:
Upvote 0

Forum statistics

Threads
1,224,603
Messages
6,179,850
Members
452,948
Latest member
UsmanAli786

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top