Read HTML Source Code with VBA

KevinJ

New Member
Joined
Jun 13, 2011
Messages
9
Using VBA, I am trying to retrieve the contents of the Source of a web page (the same as would appear if you right-clicked on the page and chose "View Source") into a variable so I can work on it in VBA (using InStr, etc.).

The problem is I can use code such as
strHTMLText = ie.Document.body.innerText
or
strHTMLText = ie.Document.body.outerText
to retrieve the code, but in either case only part, not all, of the source code is captured. I need ALL the code. Is there some kind of code such as ie.Document.body.allText or similar that would perform this function?

Much obliged!
 

Excel Facts

Ambidextrous Undo
Undo last command with Ctrl+Z or Alt+Backspace. If you use the Undo icon in the QAT, open the drop-down arrow to undo up to 100 steps.
Try this code and see result in Immediate window:
Rich (BB code):

Sub Test()
  Const URL$ = "http://online.recoveryversion.org/bibleverses.asp?fvid=2901&lvid=2901"
  Const MASK$ = "href=FootNotes.asp?FNtsID="
  Dim txt As String, i As Long
  With CreateObject("MSXML2.XMLHTTP")
    .Open "GET", URL, False
    .Send
    txt = .ResponseText
  End With
  Do
    i = InStr(i + 1, txt, MASK)
    If i = 0 Then Exit Do
    Debug.Print Val(Mid$(txt, i + Len(MASK), 15))
  Loop
End Sub

Hi. I've been using the above code, which works 90% of the time. However, when I try to get the page source of some webpages, Excel freezes and ultimately crashes. I've investigated the cause of the problem, and it seems that the pages that are causing the trouble are pages that either (1) redirect to another page or (2) are temporarily unavailable. I've looked into msxml2.xmlhttp and it appears that a solution lies in the "timeout" and "ontimeout" properties and events. However, I cannot figure out how to get the "ontimeout" to work as I'd like.

Here is the function i've written to go through a list of 1000s of websites to get their page source:

Code:
Function Get_Page_Source(iURL$, iFileName As String) As String
  'Const URL$ = "http://online.recoveryversion.org/bibleverses.asp?fvid=2901&lvid=2901"
  'iURL = "http://www." & iURL
  'pause here
  Call Pause(2)
  On Error GoTo A:
  Dim txt As String, i As Long
  With CreateObject("MSXML2.XMLHTTP")
    .Open "GET", iURL, False
    .timeout = 10000 'added for testing purposes
    .ontimeout Goto A:
    .Send
    Get_Page_Source = .ResponseText
  End With
      
'    Debug.Print Get_Page_Source
      
  Call Save_Text_To_File(iFileName, Get_Page_Source, True)


A:


End Function

Can someone give me some more guidance or a working example on how "ontimeout" works? If there is a problem getting the page source, i'd like the function to just skip over the particular website and move on.

Please Help! Thanks.
 
Upvote 0
Hi all,

I tried to apply Vladimir's code:

Sub Test()
Const URL$ = "Google Trends"
Const MASK$ = "All Categories"
Dim txt As String, i As Long
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", URL, False
.Send
txt = .ResponseText
End With
'Do
' i = InStr(i + 1, txt, MASK)
' If i = 0 Then Exit Do
' Debug.Print Val(Mid$(txt, i + Len(MASK), 15))
'Loop
i = InStr(txt, MASK)
End Sub
The functions fails on this website, the string All Categories is occurs in the websites source code!
What is wrong here?

Thank you very much!
Lloyd
 
Upvote 0
Try this code and see result in Immediate window:
Rich (BB code):

Sub Test()
  Const URL$ = "http://online.recoveryversion.org/bibleverses.asp?fvid=2901&lvid=2901"
  Const MASK$ = "href=FootNotes.asp?FNtsID="
  Dim txt As String, i As Long
  With CreateObject("MSXML2.XMLHTTP")
    .Open "GET", URL, False
    .Send
    txt = .ResponseText
  End With
  Do
    i = InStr(i + 1, txt, MASK)
    If i = 0 Then Exit Do
    Debug.Print Val(Mid$(txt, i + Len(MASK), 15))
  Loop
End Sub


Is there a way to use this XMLHTTP method for a secure website (HTTPS) and have it sign in automatically? I used to do it using a very similar code by putting my username and password in the URL, but, unfortunately, Microsoft has begun to block this "username/password in URL" feature in their new applications for security reasons. I have tried adding a section that uses Application.InternetExplorer to sign in, but then the XMLHTTP object still needs a password since it was IE that actually signed in and not the Excel sheet. I have since then scrapped the code that had the InternetExplorer stuff. I have also even tried to edit the registry to remove the block on that "feature", but it did not work like it should have. I have racked my brain for 3 days on this issue. Any ideas would help. Thanks in advance
 
Last edited:
Upvote 0
Is there a way to use this XMLHTTP method for a secure website (HTTPS) and have it sign in automatically? I used to do it using a very similar code by putting my username and password in the URL, but, unfortunately, Microsoft has begun to block this "username/password in URL" feature in their new applications for security reasons. I have tried adding a section that uses Application.InternetExplorer to sign in, but then the XMLHTTP object still needs a password since it was IE that actually signed in and not the Excel sheet. I have since then scrapped the code that had the InternetExplorer stuff. I have also even tried to edit the registry to remove the block on that "feature", but it did not work like it should have. I have racked my brain for 3 days on this issue. Any ideas would help. Thanks in advance

I just figured this out. I have searched for 3 days for a solution and could not find one, then by trial and error, I got it. Just so it can be known for others to use I am going to post a little of the code to show what I changed for the "auto sign in" to work. Replace UNUNUN and PWPWPW with your username and password respectively after "False", and it will sign into the secure website with no pop up window.
Code:
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", sURL, False, "UNUNUN", "PWPWPW"
.send
 
Upvote 0
Been reading through this post and am very close to getting my script sorted out, just really need a push in the right direction.
Currently, I am trying to get the value (last 10 digits) after the MASK data-asin= and am unable to retrieve them (all I get right now are 0's). There are several of these values I would like to retrieve; in the actual HTML they are stacked up on top of each other in sequential lines.

My code so far:
Code:
Sub Test2()
    
  Dim bookName As String
  bookName = "american+history"
  webSite = "[URL="http://www.amazon.com/s/ref=sr_st_popularity-rank?keywords"]Amazon.com: Online Shopping for Electronics, Apparel, Computers, Books, DVDs & more[/URL]=" & bookName & "qid=1420565066&rh=n%3A283155%2Ck%3A" & bookName & "&sort=popularity-rank"
  URL$ = webSite
  Const MASK$ = "data-asin="
  '"id=result_"
  Dim txt As String, i As Long
  With CreateObject("MSXML2.XMLHTTP")
    .Open "GET", URL, False
    .Send
    txt = .ResponseText
  End With
  Do
    'I believe the error lies here
    i = InStr(i + 1, txt, MASK)
    If i = 0 Then Exit Do
    Debug.Print Val(Mid$(txt, i + Len(MASK), 10))
  Loop
End Sub

Any suggestions?

Note: removing CONST before URL seems to be allowing me to insert variables and is pinging the correct website, so that part should be fine.
 
Last edited:
Upvote 0
Been reading through this post and am very close to getting my script sorted out, just really need a push in the right direction.
Hi, try this:
Rich (BB code):
Sub Test4()
   
  Dim i As Long
  Dim BookName As String, Txt As String, Url As String
  Const MASK As String = "data-asin="

  BookName = "american+history"
  Url = "http://www.amazon.com/s/ref=sr_st_popularity-rank?keywords=" & BookName & "qid=1420565066&rh=n%3A283155%2Ck%3A" & BookName & "&sort=popularity-rank"
 
  With CreateObject("MSXML2.XMLHTTP")
    .Open "GET", Url, False
    .setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"
    .Send
    Txt = .ResponseText
  End With

  Do
    i = InStr(i + 1, Txt, MASK)
    If i = 0 Then Exit Do
    Debug.Print Val(Mid$(txt, i + Len(MASK) + 1, 10))
  Loop
 
End Sub
 
Last edited:
Upvote 0
That is extremely close. I've modified it a bit to assign the values to cells in my spreadsheet, but whether I'm printing the values or assigning them to cellsthere seems to be a problem where it cuts off any 0's in the front of the value. See modified code:
Code:
Sub Test6()
    
  
  Dim i As Long
  Dim BookName As String, Txt As String, Url As String
  Const MASK As String = "data-asin="
  BookName = "american+history"
  Url = "[url=http://www.amazon.com/gp/search/ref=sr_st?keywords]Amazon.com: Online Shopping for Electronics, Apparel, Computers, Books, DVDs & more[/url]=" & BookName & "&qid=1420631958&rh=n%3A283155%2Ck%3A" & BookName & "&sort=popularity-rank"
 
  With CreateObject("MSXML2.XMLHTTP")
    .Open "GET", Url, False
    .setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"
    .Send
    Txt = .ResponseText
  End With
  Do
    i = InStr(i + 1, Txt, MASK)
    If i = 0 Then Exit Do
    ActiveCell.Offset(1, 0).Select
    Selection.Value = Val(Mid$(Txt, i + Len(MASK) + 1, 10))
    If Len(Selection.Value) < 10 Then
        'thinking there is a way to add zero's to the front of the value here until the string is 10 characters in length
    End If
  Loop
 
End Sub
 
Upvote 0
That is extremely close. I've modified it a bit to assign the values to cells in my spreadsheet, but whether I'm printing the values or assigning them to cellsthere seems to be a problem where it cuts off any 0's in the front of the value. See modified code:
Code:
Sub Test6()
    
  
  Dim i As Long
  Dim BookName As String, Txt As String, Url As String
  Const MASK As String = "data-asin="
  BookName = "american+history"
  Url = "[URL="http://www.amazon.com/gp/search/ref=sr_st?keywords"]Amazon.com: Online Shopping for Electronics, Apparel, Computers, Books, DVDs & more[/URL]=" & BookName & "&qid=1420631958&rh=n%3A283155%2Ck%3A" & BookName & "&sort=popularity-rank"
 
  With CreateObject("MSXML2.XMLHTTP")
    .Open "GET", Url, False
    .setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"
    .Send
    Txt = .ResponseText
  End With
  Do
    i = InStr(i + 1, Txt, MASK)
    If i = 0 Then Exit Do
    ActiveCell.Offset(1, 0).Select
    Selection.Value = Val(Mid$(Txt, i + Len(MASK) + 1, 10))
    If Len(Selection.Value) < 10 Then
        'thinking there is a way to add zero's to the front of the value here until the string is 10 characters in length
    End If
  Loop
 
End Sub

The only solution I can think of here is to change the format of the cell you are working with to "Text". Unless you need that cell to remain in a number format, this will work nicely.


Code:
 ActiveCell.Offset(1, 0).Select
     Selection.Value = Val(Mid$(Txt, i + Len(Mask) + 1, 10))
     Selection.NumberFormat = "@"
     Do Until Len(Selection.Value) >= 10
     Selection.Value = "0" & Selection.Value
     Loop
 
Upvote 0

Forum statistics

Threads
1,224,830
Messages
6,181,227
Members
453,025
Latest member
Hannah_Pham93

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top