Scrape Specific Website Data into Excel with Auto Link Numbering

blackhat7 · New Member · Joined Dec 18, 2018 · Messages: 16
hi guys, not sure if the subject line is correct.

In short: I want to scrape the following data from this website:

"Company Name"
"Company Code"
"Labor Office"

Here's the link:
http://eservices.mohre.gov.ae/NewMolGateway/english/Services/EQuotaStatus.aspx?Code=715280

Note that I want to start the number from 1, e.g.

http://eservices.mohre.gov.ae/NewMolGateway/english/Services/EQuotaStatus.aspx?Code=1

and scrape all the way up to a specific number, e.g.

http://eservices.mohre.gov.ae/NewMolGateway/english/Services/EQuotaStatus.aspx?Code=999999

and have all the data in Excel, where

column A = Company Name
column B = Company Code
column C = Labor Office

and so on. How can I achieve this?
 

Hi Blackhat,
There are a couple of routes you can take: either scraping with VBA or with Power Query. Check out e.g. this tutorial: https://www.howtoexcel.org/power-query/how-to-extract-data-from-multiple-webpages/
Or with VBA: https://www.vitoshacademy.com/vba-data-scraping-from-internet-with-excel-vba/ or https://analystcave.com/web-scraping-tutorial/
Having said that: calling that website one million times (from 1 to 999,999) is probably not the best thing to do; you might want to make your script a bit smarter and narrow the numbers down.
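For example, one way to narrow things down is to stop once a long run of codes comes back empty. A minimal sketch (assuming the site returns some recognisable marker text for unused codes; the marker string and the cut-off below are placeholders you would have to adjust):

Code:
Sub ScrapeUntilGap()
    'Requires a reference to Microsoft XML, v6.0 (Tools | References)
    Dim http As New MSXML2.XMLHTTP60
    Dim lngCode As Long, lngMisses As Long
    Const MAX_MISSES As Long = 500              'stop after this many empty codes in a row
    Const MARKER As String = "not available"    'placeholder - adjust to the site's actual "no data" text
    
    lngCode = 1
    lngMisses = 0
    Do While lngMisses < MAX_MISSES
        http.Open "GET", "http://eservices.mohre.gov.ae/NewMolGateway/english/Services/EQuotaStatus.aspx?Code=" & lngCode, False
        http.send
        If InStr(1, http.responseText, MARKER) > 0 Then
            lngMisses = lngMisses + 1           'empty page
        Else
            lngMisses = 0                       'valid page - parse it here
            Debug.Print "Valid code: " & lngCode
        End If
        lngCode = lngCode + 1
    Loop
End Sub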
Hope that helps,
Koen
 
Having said that: calling that website one million times (from 1 to 999,999) is probably not the best thing to do; you might want to make your script a bit smarter and narrow the numbers down.

I couldn't agree more. When you cycle through one million pages, Excel will not respond. That said, try this:

Code:
Sub a1081120()
    '****************************************************************************
'Set references to:
'Microsoft Internet Controls
'Microsoft HTML Object Library
'Microsoft XML, v6.0
'Hit Alt+F11 to go to the Visual Basic Editor and then hit:
'Tools | References
'****************************************************************************
    Dim lngPageStart As Long, lngPageEnd As Long, x As Long, y As Long
    Dim XMLPage As New MSXML2.XMLHTTP60
    Dim htmlDoc As New MSHTML.HTMLDocument
    Dim URL As String
    Dim htmlTables As MSHTML.IHTMLElementCollection
    Dim htmlTable As MSHTML.IHTMLElement
    
    'Adjust lngPageStart AND lngPageEnd to your needs
    lngPageStart = 715270
    lngPageEnd = 715280
    
    'Cycle through your webpages
    For x = lngPageStart To lngPageEnd
    
        'Create URL and send request
        URL = "http://eservices.mohre.gov.ae/NewMolGateway/english/Services/EQuotaStatus.aspx?Code=" & x
        XMLPage.Open "GET", URL, False
        XMLPage.send
        
        'The last part of the URL
        Debug.Print "Webpage with Code=" & x
        
        'Get the source (code) of the webpage
        htmlDoc.body.innerHTML = XMLPage.responseText
        
        'Is this page valid?
        If InStr(1, htmlDoc.body.innerHTML, "Approved Quota information not available") = 0 Then
            
            'Set reference to all tables in htmlDoc.body.innerHTML
            Set htmlTables = htmlDoc.getElementsByTagName("table")
            
            'Reset counter to cycle through tables
            y = 1
            
            'Loop through all table-tags . . .
            For Each htmlTable In htmlTables
            
                'Tables 4, 5 and 6 on every page have the data you want
                Select Case y
                    Case 4
                        Debug.Print "Company Name: " & Trim(htmlTable.all(4).innerText)
                    Case 5
                        Debug.Print "Company Code: " & Trim(htmlTable.all(4).innerText)
                    Case 6
                        Debug.Print "Labour Office: " & Trim(htmlTable.all(4).innerText)
                End Select
                y = y + 1
            Next htmlTable
        End If
    Next x
End Sub

Precautions:
I cycle through only 10 pages.
The webpage you try to grab the info from is rather tricky because it has a lot of nested tables, which makes it confusing. It consists of 10 tables and is not easy to navigate. Schematically it looks like this:

All_Tables/
|-- Table_1/
|-- Table_2/
|-- Table_3/
|-- Table_4/
| |-- Tbody
| |-- TableRow < Company Name
| |-- TableRow < :
| |-- TableRow < The company name < < < This is the data you want
|-- Table_5/
| |-- Tbody
| |-- TableRow < Company Code
| |-- TableRow < :
| |-- TableRow < The company code < < < This is the data you want
|-- Table_6/
| |-- Tbody
| |-- TableRow < Labour Office
| |-- TableRow < :
| |-- TableRow < The labour office < < < This is the data you want
|-- Table_7/
|-- Table_8/
|-- Table_9/
|-- Table_10/
Tables 4, 5 and 6 are the tables of interest. Hope this helps.
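In case the numbering is confusing: getElementsByTagName returns a zero-based collection, so "table 4" is item 3. Just to make that concrete, here is a minimal sketch for a single page (the example code 715280 from the first post, with the same references set as above):

Code:
Sub GrabOneCode()
    'Same references as above: Microsoft HTML Object Library and Microsoft XML, v6.0
    Dim XMLPage As New MSXML2.XMLHTTP60
    Dim htmlDoc As New MSHTML.HTMLDocument
    Dim htmlTables As MSHTML.IHTMLElementCollection
    
    XMLPage.Open "GET", "http://eservices.mohre.gov.ae/NewMolGateway/english/Services/EQuotaStatus.aspx?Code=715280", False
    XMLPage.send
    htmlDoc.body.innerHTML = XMLPage.responseText
    
    'Zero-based: item 3 is the fourth table on the page, item 4 the fifth, item 5 the sixth
    Set htmlTables = htmlDoc.getElementsByTagName("table")
    Debug.Print "Company Name : " & Trim(htmlTables(3).all(4).innerText)
    Debug.Print "Company Code : " & Trim(htmlTables(4).all(4).innerText)
    Debug.Print "Labour Office: " & Trim(htmlTables(5).all(4).innerText)
End Sub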

I printed the results to the immediate window of the Visual Basic Editor.

[Screenshot: Immediate window output]
 
Guys, before I say anything:

THIS COMMUNITY IS AMAZING!

I didn't expect someone to spend so much of their precious time providing me with a ready-made copy/paste solution.

I'm thrilled.

I copied and pasted the code and ran it in Excel, but it gives the following error. Please note that I'm not very well versed in Excel, so could you give me a step-by-step guide? Thank you.

[Screenshot of the error message]
 
Did you set the appropriate references as pointed out in the green part of the code at the top?
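For what it's worth: if ticking those references is a hurdle, late binding is an alternative that needs no references at all. A minimal sketch of the same request with CreateObject (treat it as a starting point rather than a drop-in replacement):

Code:
Sub LateBoundExample()
    'No references needed: the objects are created at run time
    Dim http As Object, htmlDoc As Object
    Set http = CreateObject("MSXML2.XMLHTTP.6.0")
    Set htmlDoc = CreateObject("htmlfile")
    
    http.Open "GET", "http://eservices.mohre.gov.ae/NewMolGateway/english/Services/EQuotaStatus.aspx?Code=715280", False
    http.send
    htmlDoc.body.innerHTML = http.responseText
    
    'Same zero-based table indexing as in the early-bound code above
    Debug.Print Trim(htmlDoc.getElementsByTagName("table")(3).all(4).innerText)
End Sub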
 
strooman, you're a genius. Yes, I followed your instructions and I can see the details in the Immediate pane.

However, I have one question and one request.

Q: How can I get the data copied into Excel?

Request: I can see the process is very time-consuming; 10 links took around 1 minute. Is there a way to speed things up, maybe with other software or a different approach?
 
strooman, you're a genius

Well, that's too much praise for a humble person who is just interested in Excel. But I'll take the compliment. Thanks.

Q: How can I get the data copied into Excel?

No problem. Try this code:

Code:
Sub a1081120_Second_Version()
'****************************************************************************
'Set references to:
'Microsoft Internet Controls
'Microsoft HTML Object Library
'Microsoft XML, v6.0
'Hit Alt+F11 to go to the Visual Basic Editor and then hit:
'Tools | References
'****************************************************************************
    Dim lngRow As Long, lngColumn As Long, lngPageStart As Long, lngPageEnd As Long, x As Long, y As Long, z As Long
    Dim XMLPage As New MSXML2.XMLHTTP60
    Dim htmlDoc As New MSHTML.HTMLDocument
    Dim URL As String
    Dim htmlTables As MSHTML.IHTMLElementCollection
    Dim htmlTable As MSHTML.IHTMLElement
    
    
    'Adjust lngPageStart AND lngPageEnd to your needs
    lngPageStart = 715270
    lngPageEnd = 715280
    lngRow = 2
    
    'Cycle through your webpages
    For x = lngPageStart To lngPageEnd
        
        'Create URL and send request
        URL = "http://eservices.mohre.gov.ae/NewMolGateway/english/Services/EQuotaStatus.aspx?Code=" & x
        XMLPage.Open "GET", URL, False
        XMLPage.send
        
        'The last part of the URL
        Debug.Print "Webpage with Code=" & x
        
        'Get the source (code) of the webpage
        htmlDoc.body.innerHTML = XMLPage.responseText
        
        'Is this page valid?
        If InStr(1, htmlDoc.body.innerHTML, "Approved Quota information not available") = 0 Then
            
            'Set reference to all tables in htmlDoc.body.innerHTML
            Set htmlTables = htmlDoc.getElementsByTagName("table")
            
            lngColumn = 1
            
            'Loop through all table-tags (the collection is zero-based) . . .
            For z = 0 To htmlTables.Length - 1
            
                'Tables 4, 5 and 6 on every page (zero-based items 3, 4 and 5) have the data you want
                Select Case z
                    Case 3
                        Cells(lngRow, lngColumn).Value = Trim(htmlTables(z).all(4).innerText)
                        lngColumn = lngColumn + 1
                    Case 4
                        Cells(lngRow, lngColumn).Value = Trim(htmlTables(z).all(4).innerText)
                        lngColumn = lngColumn + 1
                    Case 5
                        Cells(lngRow, lngColumn).Value = Trim(htmlTables(z).all(4).innerText)
                        lngRow = lngRow + 1
                End Select
               
            Next z
        End If
    Next x
End Sub


It fills columns A, B and C.
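If you also want header labels in row 1, you could add something like this just before the For x loop (a small optional addition, assuming the active sheet is the one you want to fill):

Code:
    'Optional: write the header labels to row 1 of the active sheet
    Range("A1:C1").Value = Array("Company Name", "Company Code", "Labour Office")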

Request: I can see the process is very time-consuming; 10 links took around 1 minute. Is there a way to speed things up, maybe with other software or a different approach?

I noticed this too. When I cycled through 100 pages it took about 15 minutes, and it seems like Excel freezes, but that is not the case. Excel is working fine; it just takes time. I think it has to do with the loading time of the website. I don't see where I can speed things up in the code. Perhaps somebody else has suggestions here.
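One small thing that sometimes helps, mostly for the worksheet writes rather than the waiting on the server: switch off screen updating and automatic calculation while the loop runs. A sketch of how you could wrap the Sub above (assumption: the real bottleneck is the server round-trip, so don't expect miracles):

Code:
Sub RunScrapeFaster()
    'Suspend screen redraws and recalculation while the scraper runs
    Application.ScreenUpdating = False
    Application.Calculation = xlCalculationManual
    
    a1081120_Second_Version        'the Sub posted above
    
    Application.Calculation = xlCalculationAutomatic
    Application.ScreenUpdating = True
End Sub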
 
