Extract all the links of a website

Excel Facts

Which Excel functions can ignore hidden rows?
The SUBTOTAL and AGGREGATE functions ignore hidden rows. AGGREGATE can also exclude error cells and more.
The folder contains four files. Which file are you referring to?
The file to use is: links
In this file you need to insert the file code: Scraping_title
that will have to extract the name of the website and insert it in column "b" of sheet 2.
In sheet 1 of the link file it is necessary to insert the code of file 1 to extract all the links of each url.
Then I found another file which is: Regxp Emails2
This last file has the hidden code and it appeared to me when I debugged it following a macro error.
Which of the two macros (file 1 and Regxp Emails2 file) in your opinion extracts all the links more effectively and efficiently? You can use the better code of the two to do this.
The Regxp Emails2 file makes the example of extracting the link of the facebook page, instead I would like to extract in addition to this link to be inserted in sheet 2 of the links file (column d) also the other links that contain the word indicated in the column header of sheet 2 (from column d onwards).
For column c I will need the regex code that I already have and that I will have to use it the same way I use it to extract the email.
 
Upvote 0
The code below:

  • Lists all links for the sites mentioned at sheet2. The list appears at sheet1.
  • Writes the site title
  • Lists the results that match the words inserted at sheet2
VBA Code:
Private Sub CommandButton1_Click()
 'Columns for both tables
    Const colUrl As Long = 1      'Must always be the first column
    Const colmail As Long = 2     'Must always be the first column before Some platforms
    Const colFacebook As Long = 3 'Must always be the last column of Some platforms
    Const colError As Long = 4    'Must always be the last column

    Dim url As String, http As Object, htmlDoc As Object, nodeAllLinks As Object
    Dim nodeOneLink As Object, pageLoadSuccessful As Boolean
    Dim tbl_url_oal As String, tbl_all As String, currentRowTableUrls As Long, lastRowTableUrls&
    Dim currentRowsTableAll(colUrl To colFacebook) As Long
    Dim lastRowTableAll As Long, addressCounters(colmail To colFacebook) As Long
    Dim checkCounters As Long, cel As Range
   
    tbl_url_oal = "foglio2"             'Name of Sheet
    currentRowTableUrls = 2           'First row for content
    tbl_all = "Sheet1"     'Name of Sheet
    Sheets(tbl_url_oal).Activate
    With New XMLHTTP60
            On Error Resume Next
        For Each cel In Range("A4:A" & Cells(Rows.Count, 1).End(xlUp).Row)
            .Open "GET", cel.Value, False
            .send
            If .Status = 200 Then cel.Offset(, 1).Value = Split(Split(.responseText, "<title>")(1), "</")(0)
        Next
    End With
    For checkCounters = colUrl To colFacebook
        currentRowsTableAll(checkCounters) = 2   'First rows for content
    Next checkCounters
    Set htmlDoc = CreateObject("htmlfile")
    Set http = CreateObject("MSXML2.ServerXMLHTTP.6.0")
  
    'Delete all rows except headline in the sheet with all addresses
    lastRowTableAll = Sheets(tbl_all).Cells(Rows.Count, colUrl).End(xlUp).Row
    Sheets(tbl_all).Rows(currentRowsTableAll(colUrl) & ":" & lastRowTableAll).Delete Shift:=xlUp
 
    'Loop over all URLs in column A in the URL source sheet
    Do While Sheets(tbl_url_oal).Cells(currentRowTableUrls, 1).Value <> ""
        'Scroll for visual monitoring, if 'the sheet with the URLs are the
        If ActiveSheet.Name = tbl_url_oal Then
            If currentRowTableUrls > 14 Then
                ActiveWindow.SmallScroll down:=1
            End If
            Sheets(tbl_url_oal).Cells(currentRowTableUrls, 1).Select
        End If
   
        'Get next url from the URL source sheet
        url = Sheets(tbl_url_oal).Cells(currentRowTableUrls, colUrl).Value
        'Try to load page 'Temporarily disable error handling if 'there is a timeout or onother error
        On Error Resume Next
        http.Open "GET", url, False
        http.send
   
        'Check if page loading was successful
        If Err.Number = 0 Then
            pageLoadSuccessful = True
        End If
        On Error GoTo 0
   
        If pageLoadSuccessful Then
            'Build html document for DOM operations
            htmlDoc.body.innerHTML = http.responseText
            'Create node list from all links of the page
            Set nodeAllLinks = htmlDoc.getElementsByTagName("a")
            'Walk through all links of the node list
     
    For Each nodeOneLink In nodeAllLinks
             
DoEvents

            'Write mail address to both tables
            Sheets(tbl_url_oal).Cells(currentRowTableUrls, colmail).Value = Right(nodeOneLink.href, Len(nodeOneLink.href) - InStr(nodeOneLink.href, ":"))
            Sheets(tbl_all).Cells(currentRowsTableAll(colmail), colmail).Value = Right(nodeOneLink.href, Len(nodeOneLink.href) - InStr(nodeOneLink.href, ":"))
            'Check if it is a new line in the sheet with all addresses
    If currentRowsTableAll(colmail) >= currentRowsTableAll(colUrl) Then
            'Write URL in the new line of the sheet with all addresses
            Sheets(tbl_all).Cells(currentRowsTableAll(colUrl), colUrl).Value = url
            'Increment url counter
            currentRowsTableAll(colUrl) = currentRowsTableAll(colUrl) + 1
    End If
            'Increment mail counters
             currentRowsTableAll(colmail) = currentRowsTableAll(colmail) + 1
            addressCounters(colmail) = addressCounters(colmail) + 1
     
Next nodeOneLink

        'Check address counters
        For checkCounters = colmail To colFacebook
        'Set comment if more than 1 link were found
        If addressCounters(checkCounters) > 1 Then
        End If
Next checkCounters
        Else
  
        End If
   
        'Prepare for next page
        pageLoadSuccessful = False
        Erase addressCounters
        lastRowTableAll = Sheets(tbl_all).Cells(Rows.Count, colUrl).End(xlUp).Row
        For checkCounters = colUrl To colFacebook
            currentRowsTableAll(checkCounters) = lastRowTableAll + 1 'First rows for next page content
        Next checkCounters
        currentRowTableUrls = currentRowTableUrls + 1
    Loop
    'Clean up
Set http = Nothing: Set htmlDoc = Nothing
Set nodeAllLinks = Nothing
Set nodeOneLink = Nothing
Dim r As Range, c
Sheets("Sheet1").Activate
[b1] = "header"
[c1] = [b1]
For c = 4 To Sheets("Foglio2").[3:3].Find("*", SearchOrder:=xlByColumns, SearchDirection:=xlPrevious).Column
    [c2] = "*" & Sheets("Foglio2").Cells(3, c) & "*"
    Columns("B:B").AdvancedFilter Action:=xlFilterCopy, CriteriaRange:=Range("c1:c2"), CopyToRange:=[e1], Unique:=True
    Set r = [e1].CurrentRegion
    Set r = Range(Cells(2, 5), Cells(r.Rows.Count, 5))
    If Len([e2]) Then r.Copy Sheets("Foglio2").Cells(4, c)
    r.Delete
Next
End Sub
 
Upvote 0
The code below:

  • Lists all links for the sites mentioned at sheet2. The list appears at sheet1.
  • Writes the site title
  • Lists the results that match the words inserted at sheet2
VBA Code:
Private Sub CommandButton1_Click()
 'Columns for both tables
    Const colUrl As Long = 1      'Must always be the first column
    Const colmail As Long = 2     'Must always be the first column before Some platforms
    Const colFacebook As Long = 3 'Must always be the last column of Some platforms
    Const colError As Long = 4    'Must always be the last column

    Dim url As String, http As Object, htmlDoc As Object, nodeAllLinks As Object
    Dim nodeOneLink As Object, pageLoadSuccessful As Boolean
    Dim tbl_url_oal As String, tbl_all As String, currentRowTableUrls As Long, lastRowTableUrls&
    Dim currentRowsTableAll(colUrl To colFacebook) As Long
    Dim lastRowTableAll As Long, addressCounters(colmail To colFacebook) As Long
    Dim checkCounters As Long, cel As Range
  
    tbl_url_oal = "foglio2"             'Name of Sheet
    currentRowTableUrls = 2           'First row for content
    tbl_all = "Sheet1"     'Name of Sheet
    Sheets(tbl_url_oal).Activate
    With New XMLHTTP60
            On Error Resume Next
        For Each cel In Range("A4:A" & Cells(Rows.Count, 1).End(xlUp).Row)
            .Open "GET", cel.Value, False
            .send
            If .Status = 200 Then cel.Offset(, 1).Value = Split(Split(.responseText, "<title>")(1), "</")(0)
        Next
    End With
    For checkCounters = colUrl To colFacebook
        currentRowsTableAll(checkCounters) = 2   'First rows for content
    Next checkCounters
    Set htmlDoc = CreateObject("htmlfile")
    Set http = CreateObject("MSXML2.ServerXMLHTTP.6.0")
 
    'Delete all rows except headline in the sheet with all addresses
    lastRowTableAll = Sheets(tbl_all).Cells(Rows.Count, colUrl).End(xlUp).Row
    Sheets(tbl_all).Rows(currentRowsTableAll(colUrl) & ":" & lastRowTableAll).Delete Shift:=xlUp
 
    'Loop over all URLs in column A in the URL source sheet
    Do While Sheets(tbl_url_oal).Cells(currentRowTableUrls, 1).Value <> ""
        'Scroll for visual monitoring, if 'the sheet with the URLs are the
        If ActiveSheet.Name = tbl_url_oal Then
            If currentRowTableUrls > 14 Then
                ActiveWindow.SmallScroll down:=1
            End If
            Sheets(tbl_url_oal).Cells(currentRowTableUrls, 1).Select
        End If
  
        'Get next url from the URL source sheet
        url = Sheets(tbl_url_oal).Cells(currentRowTableUrls, colUrl).Value
        'Try to load page 'Temporarily disable error handling if 'there is a timeout or onother error
        On Error Resume Next
        http.Open "GET", url, False
        http.send
  
        'Check if page loading was successful
        If Err.Number = 0 Then
            pageLoadSuccessful = True
        End If
        On Error GoTo 0
  
        If pageLoadSuccessful Then
            'Build html document for DOM operations
            htmlDoc.body.innerHTML = http.responseText
            'Create node list from all links of the page
            Set nodeAllLinks = htmlDoc.getElementsByTagName("a")
            'Walk through all links of the node list
    
    For Each nodeOneLink In nodeAllLinks
            
DoEvents

            'Write mail address to both tables
            Sheets(tbl_url_oal).Cells(currentRowTableUrls, colmail).Value = Right(nodeOneLink.href, Len(nodeOneLink.href) - InStr(nodeOneLink.href, ":"))
            Sheets(tbl_all).Cells(currentRowsTableAll(colmail), colmail).Value = Right(nodeOneLink.href, Len(nodeOneLink.href) - InStr(nodeOneLink.href, ":"))
            'Check if it is a new line in the sheet with all addresses
    If currentRowsTableAll(colmail) >= currentRowsTableAll(colUrl) Then
            'Write URL in the new line of the sheet with all addresses
            Sheets(tbl_all).Cells(currentRowsTableAll(colUrl), colUrl).Value = url
            'Increment url counter
            currentRowsTableAll(colUrl) = currentRowsTableAll(colUrl) + 1
    End If
            'Increment mail counters
             currentRowsTableAll(colmail) = currentRowsTableAll(colmail) + 1
            addressCounters(colmail) = addressCounters(colmail) + 1
    
Next nodeOneLink

        'Check address counters
        For checkCounters = colmail To colFacebook
        'Set comment if more than 1 link were found
        If addressCounters(checkCounters) > 1 Then
        End If
Next checkCounters
        Else
 
        End If
  
        'Prepare for next page
        pageLoadSuccessful = False
        Erase addressCounters
        lastRowTableAll = Sheets(tbl_all).Cells(Rows.Count, colUrl).End(xlUp).Row
        For checkCounters = colUrl To colFacebook
            currentRowsTableAll(checkCounters) = lastRowTableAll + 1 'First rows for next page content
        Next checkCounters
        currentRowTableUrls = currentRowTableUrls + 1
    Loop
    'Clean up
Set http = Nothing: Set htmlDoc = Nothing
Set nodeAllLinks = Nothing
Set nodeOneLink = Nothing
Dim r As Range, c
Sheets("Sheet1").Activate
[b1] = "header"
[c1] = [b1]
For c = 4 To Sheets("Foglio2").[3:3].Find("*", SearchOrder:=xlByColumns, SearchDirection:=xlPrevious).Column
    [c2] = "*" & Sheets("Foglio2").Cells(3, c) & "*"
    Columns("B:B").AdvancedFilter Action:=xlFilterCopy, CriteriaRange:=Range("c1:c2"), CopyToRange:=[e1], Unique:=True
    Set r = [e1].CurrentRegion
    Set r = Range(Cells(2, 5), Cells(r.Rows.Count, 5))
    If Len([e2]) Then r.Copy Sheets("Foglio2").Cells(4, c)
    r.Delete
Next
End Sub
Thank you for your interview. I could ask you to insert it in the dropbox link file along with the other code with the creation of the buttons that activates each code. Then I noticed that activating the code there could be some problems with certificates that are asked that block the macro until I click on the "yes" of the mask that appears. Is it possible to automatically accept and save the data automatically and then restart the macro from the first raw line? Then I would like to learn how to insert the email extraction part included in that code.
 
Upvote 0

Forum statistics

Threads
1,225,821
Messages
6,187,209
Members
453,411
Latest member
healthcares

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top