Extract all the links of a website

Excel Facts

Using Function Arguments with nested formulas
If writing INDEX in Func. Arguments, type MATCH(. Use the mouse to click inside MATCH in the formula bar. Dialog switches to MATCH.
  • That code is heavy on the system, it froze on my home computer but completed on my work machine. You need good hardware and fast Internet connection. Try testing it with a simple website that has less pages.
  • We cannot take our discussions off the forum as per forum rules…
 
Upvote 0
Hi Worf. I had some problems.
As soon as I succeed in the weekend I will attach an excel file here with your macro to understand where vertical and horizontal header data should be placed.
See you soon
 
Upvote 0
Features of this version:


  • Column A = all site links
  • Column B = filter criteria
  • Column C = unique links
  • Column D = first result column
  • Column E = second result column

Place the level limit at the third row (D3, E3…) and the desired word at the fourth row (D4, E4…)

Code:
Dim r&, s$, dict As Object

Sub Main22()                             ' run me
Set dict = CreateObject("Scripting.Dictionary")
r = 0
s = "*salento*"                         ' to avoid external references
Request22 "https://www.unisalento.it"
Cells(1, 2) = Cells(1, 1)
Cells(2, 2) = "*"
[a:a].AdvancedFilter xlFilterCopy, Range("b1:b2"), Cells(1, 3), True
AdvFilt2
MsgBox dict.Count & " recursive calls.", , "done"
End Sub

Sub Request22(URL$, Optional Level% = 1)
Dim n&, ob, oreq As New XMLHTTP60, odoc As New HTMLDocument
oreq.Open "GET", URL, False
On Error Resume Next
oreq.Send
On Error GoTo 0
odoc.Body.innerHTML = oreq.responseText
r = r + 1
Cells(r, 1) = Level
Set ob = odoc.getElementsByTagName("a")
For n = 0 To ob.Length - 1
    r = r + 1
    Cells(r, 1) = ob.Item(n).href
Next
If Level Then
    For n = 0 To ob.Length - 1
        If ob.Length > 0 Then
            If ob.Item(n).href Like s And Not dict.Exists(ob.Item(n).href) Then
                Request22 ob.Item(n).href, Level - 1
                dict.Add ob.Item(n).href, ob.Item(n).href
            End If
        End If
    Next
End If
End Sub

Sub AdvFilt2()
Dim r As Range, a, c%
Set r = [d2]
For c = 4 To Cells(3, Columns.Count).End(xlToLeft).Column
    a = Split(r.Address, "$")
    r.Offset(-1) = ""
    r.Formula = "=AND(LEN(c2)-LEN(SUBSTITUTE(c2,""/"",""""))<$" & a(1) & _
    "$3,NOT(ISERR(FIND($" & a(1) & "$4,c2))))"
    [c:c].AdvancedFilter xlFilterCopy, Range(r.Offset(-1), r), r.Offset(4), True
    Set r = r.Offset(, 1)
Next
End Sub

Hello. I arranged the data as you indicated.
I didn't understand what should be put in column b.
As well as being prepared does not work.
could you insert all the links extracted in sheet 2 of the excel file and do the processing of each link individually passing to the next one when the previous one ends to avoid blocking?

https://www.dropbox.com/s/q4thl1jt8ky44r8/links .xlsm?dl=0
 
Upvote 0
  • I rewrote the main routine, see below. It works with the active sheet.
  • Columns A, B and C are populated by the code; you have only to inform the level (D3, E3) and the desired word (D4, E4). See example below, results are displayed on columns D and E.
  • What exactly do you call a level in this context?
  • The site address is currently hardcoded in a variable; would you like to input it via worksheet cells?
  • This example made only 10 recursive calls; I suggest you test with this site before proceeding to larger ones.
  • As written, this version processes only one address.

Code:
Dim r&, s$, dict As Object

Sub Main22()                             ' run me
Dim orig$, v
Set dict = CreateObject("Scripting.Dictionary")
r = 1
[a1] = "List"
orig = "https://poestories.com"         ' root site
v = Split(Replace(orig, "//", "/"), "/")
s = "*" & Split(v(1), ".")(0) & "*"
MsgBox "Safe word is " & s              ' to avoid external references
Request22 orig, 3
Cells(1, 2) = Cells(1, 1)
Cells(2, 2) = "*"
[a:a].AdvancedFilter xlFilterCopy, Range("b1:b2"), Cells(1, 3), True
AdvFilt2
MsgBox dict.Count & " recursive calls.", , "done"
End Sub

Sheet range on next post...
 
Last edited:
Upvote 0
bvfeQOF.jpg
 
Last edited:
Upvote 0
Hi

If it is a continuation of this subject, we can remain here.

If it would be a new topic, a fresh thread would be the best idea.

What changes do you need?
 
Upvote 0
I would like to insert in the file called "links" the code that was inserted I think in post 6 which is covered in file 1.
In file 1 the macro extrapolates all the links of each website and these are inserted horizontally.
After doing this in sheet 1 of the links file, I would like to insert in sheet 2 the links that within them fully include the words of row 3 from column d onwards.
finally in sheet 3 the macro must automatically create the columns as I set them by referring to the words to be searched that I will insert from time to time in sheet 2.
is it possible to insert a button in each sheet which performs the single operations per sheet as I described?
The "Regxp Emails2" file seems to have the code that does the same operation as sheet 1 and 2 in relation to the link of the facebook page, I think it also contains the code to extract the email and telephone number.
I am also attaching the code for etraplare the name of the site that I kindly ask you to insert in the code.

 
Upvote 0

Forum statistics

Threads
1,223,335
Messages
6,171,520
Members
452,409
Latest member
brychu

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top