Sharid
Well-known Member
- Joined
- Apr 22, 2007
- Messages
- 1,066
- Office Version
- 2016
- Platform
- Windows
I have a web scraper that works, it extracts URLs from Google, I need to update it so it can scrape emails as well as the urls.
I have 2 sheets Data and Keywords, a KEYWORD is e.g. "Taxi Base" is placed in sheet2 "Keywords" cell C3 and that is what is searched in Google, currently I have set it to search only first 5 pages with a 7 second delay, not 100% sure if the delay is 1 to 7 seconds or every 7 seconds, could someone please advise.
I NEED the emails to go into row 2 column B and Urls in Row 2 Column C. If NO EMAILS are found it leaves that cell in column B blank put places the urls in C
Thanks for having a look
I have 2 sheets Data and Keywords, a KEYWORD is e.g. "Taxi Base" is placed in sheet2 "Keywords" cell C3 and that is what is searched in Google, currently I have set it to search only first 5 pages with a 7 second delay, not 100% sure if the delay is 1 to 7 seconds or every 7 seconds, could someone please advise.
I NEED the emails to go into row 2 column B and Urls in Row 2 Column C. If NO EMAILS are found it leaves that cell in column B blank put places the urls in C
VBA Code:
Private Sub CommandButton1_Click()
[COLOR=rgb(26, 188, 156)][B]'URL SCRAPER[/B][/COLOR]
Dim ie As Object
Dim htmlDoc As Object
Dim nextPageElement As Object
Dim div As Object
Dim link As Object
Dim url As String
Dim pageNumber As Long
Dim i As Long
[COLOR=rgb(26, 188, 156)] ' Takes seach from C3 and places it into google[/COLOR]
url = "https://www.google.co.uk/search?q=" & Replace(Worksheets("Keywords").Range("C3").Value, " ", "+")
Set ie = CreateObject("InternetExplorer.Application")
With ie
.Visible = True
.navigate url
Do While .Busy Or .readyState <> 4
DoEvents
Loop
End With
Application.Wait Now + TimeSerial(0, 0, 5)
Set htmlDoc = ie.document
[COLOR=rgb(26, 188, 156)] ' Searches URLS and places them in ROW 2 Column C[/COLOR]
pageNumber = 1
i = 2[COLOR=rgb(26, 188, 156)] 'This is Row 2[/COLOR]
Do
For Each div In htmlDoc.getElementsByTagName("div")
If div.getAttribute("class") = "r" Then
Set link = div.getElementsByTagName("a")(0)
Cells(i, 3).Value = link.getAttribute("href") [COLOR=rgb(26, 188, 156)]'This bit seaches the href (URL), I need it to do urls and emails[/COLOR]
i = i + 1
End If
Next div
[COLOR=rgb(26, 188, 156)] 'Currently only set to search the first 5 pages on GOOGLE[/COLOR]
If pageNumber >=[B] 5[/B] Then Exit Do[COLOR=rgb(26, 188, 156)] [/COLOR]
Set nextPageElement = htmlDoc.getElementById("pnnext")
If nextPageElement Is Nothing Then Exit Do
[COLOR=rgb(26, 188, 156)]' Clicks web next page every 7 seconds [/COLOR]
nextPageElement.Click 'next web page
Do While ie.Busy Or ie.readyState <> 4
DoEvents
Loop
Application.Wait Now + TimeSerial(0, 0, 7)[COLOR=rgb(26, 188, 156)] 'NOT sure if this is between 1 to 7 seconds or every 7 seconds[/COLOR]
Set htmlDoc = ie.document
pageNumber = pageNumber + 1
Loop
ie.Quit
Set ie = Nothing
Set htmlDoc = Nothing
Set nextPageElement = Nothing
Set div = Nothing
Set link = Nothing
MsgBox "All Done"
End Sub
Thanks for having a look