Can't find META tags in HTML when using WinHTTP Request

Julesdude · Apr 2, 2019

Hi all,

I am trying to locate Meta tags from the following URL:

https://england.shelter.org.uk/prof...ary_folder/2020_group_-_fiscal_stimulus_paper

However, what is passed from the result string to HTMLDoc.body.all is only a section of the actual HTML and the meta tags at the beginning of the script are absent and so cannot be found. How do I overcome this?

my code is as follows:

Code:

Dim Http2 As New WinHttpRequest
Dim i as long
dim result as string
dim elements
dim element

Set HTMLDoc = New MSHTML.HTMLDocument



    Http2.Open "GET", url, False
    ' send request
    Http2.send
    result = Http2.responseText



    'pass text of HTML document returned
    HTMLDoc.body.all = result


    Set Elements = HTMLDoc.all.tags("META")


        For Each singleElement In Elements 


        ActiveSheet.Cells(i, "A") = url
        ActiveSheet.Cells(i, "B") = "META " & singleElement.Name
            
        ActiveSheet.Cells(i, "D").NumberFormat = "@" ' text format for date
        ActiveSheet.Cells(i, "D") = singleElement.Content
        
        
        i = i + 1
        
        Next 
    
    
    Set Elements = Nothing

John_w · Apr 2, 2019

Julesdude said:
John, thank you. This works perfectly.
Will this capture all meta tags though. For example, ones like:

Code:

meta name = "...." content= "...."> meta property = "...." content = "....">

I would want to capture both elements/tags within.

Try:

Code:

singleElement.getAttribute("property")

Julesdude · Apr 3, 2019

Hi John,

Many thanks for this. I was wondering if this similar method would also work for the following tags a little further down in the script:

Code:

main tabindex="-1" id="maincontent">
   div class="bg--a">
      div class="container">
         div id="content_div_643588">
             style>
            article text.....

I would normally call in the following way to cycle through all child tags:

Code:

Set Elements = HTMLdoc.getElementById("maincontent").getElementsByTagName("*")
For Each Element In Elements
' etc

However this does not work - 'Object Variable or With block variable not set' error message appears.

I've also tried the following that doesn't work:

Code:

Set Elements = HTMLdoc.getElementsByClassName("container")(0)Set childelements = Elements.getElementsByTagName("*")
For Each Element In childelements
cells(i,1) = element.tagname
cells(i,2) = element.innertext

This doesn't work either. How do I set a marker for any one of those tags and explore their child tags and related innertext?

Julesdude · Apr 3, 2019

Here's another example of the child elements I'm trying to loop through on site https://scotland.shelter.org.uk/pro...edinburgh_community_hub_impact_report_2017-18

I'm splitting all the tags and innertext in rows in a worksheet from the container where the body of text for the article resides. I could call on any of the following from the top of that container:

main id="maincontent" tabindex=-1
div class="bg--a"
div class="container"
div class="row" data-sq="6152"
div class="s12 18 stickycontainer"
article
h1...etc etc

Ideally I'd like to find <article> and loop through all tags and innertext extraction one by one from that point to the end of the article which is marked by </article>
Does anyone know how I could reference that point in the right way so my loops work and cycle through?

Julesdude · Apr 3, 2019

Can anyone lend a hand? I'm not sure how to use HTMLDoc.getElementsByClassName to set Elements to it and then look through each element in Elements.

John_w · Apr 3, 2019

Julesdude said:
Ideally I'd like to find <article> and loop through all tags and innertext extraction one by one from that point to the end of the article which is marked by </article>
Does anyone know how I could reference that point in the right way so my loops work and cycle through?

To do that you need getElementsByTagName("article") and traverse the DOM tree from each article element, extracting the text nodes. For an unknown reason, the article element doesn't contain any child nodes, so the code below starts at its parent element.

Code:

    Dim articles As IHTMLElementCollection
    Dim article As HTMLUnknownElement
    Dim allText As String
    
    allText = vbNullString
    Set articles = HTMLdoc.getElementsByTagName("article")
    For Each article In articles
        allText = allText & ConcatTextNodes(article.parentElement) & " "
    Next
    
    Debug.Print allText

Code:

Private Function ConcatTextNodes(node As IHTMLDOMNode) As String

    Dim i As Long
    
    ConcatTextNodes = vbNullString
    If node.NodeType = NODE_TEXT Then
        If InStr(node.NodeValue, "<!--") = 0 Then ConcatTextNodes = Trim(node.NodeValue)
    End If
    
    For i = 0 To node.ChildNodes.Length - 1
        ConcatTextNodes = ConcatTextNodes & " " & ConcatTextNodes(node.ChildNodes(i))
    Next
    
End Function

Julesdude said:
Can anyone lend a hand? I'm not sure how to use HTMLDoc.getElementsByClassName to set Elements to it and then look through each element in Elements.

Like this:

Code:

    Dim HTMLdoc2 As HTMLDocument
    Dim classColl As IHTMLElementCollection
    Dim elem As HTMLGenericElement
    Set classColl = HTMLdoc2.getElementsByClassName("theClassName")
    For Each elem In classColl
        Debug.Print elem.tagName, elem.innerText
    Next

The above code needs references to Microsoft HTML Object Library and Microsoft XML v6.0.

Julesdude · Apr 4, 2019

Sorry I couldn't get back to you earlier today in response.
Thanks John for the examples, they are really helpful and provide insight. In your last example, I had something very similar to capture all tags and elements from a parent class tag. I tried to apply your example but it's not capturing all child tag names and text content under it. Do you know what I am doing wrong?

Code:

    Set Elements = html.getElementsByClassName("amp-mode-mouse")(0)
    Set childElement = Elements.getElementsByTagName("*")


    
        For Each Element In childElement
                    ActiveSheet.Cells(i, 1) = url
                    ActiveSheet.Cells(i, 2) = Element.tagName
                    ActiveSheet.Cells(i, 3) = Element.innertext
        next

This is from the page https://blog.shelter.org.uk/2018/08...ents-and-a-national-affordability-crisis/amp/
I just really want to capture the article in the centre of the page but the HTML is:

article class="amp-wp-article"
header class="amp-wp-article-header"
then h1, p, text for the article thereafter...

John_w · Apr 4, 2019

Did you try my 'articles' code snippet? Just change the Debug.Print statement to Range("A1").Value = allText and it puts all the concatenated article text in cell A1. This text is generated by scanning all the child elements (nodes) from the article element.

Your last code doesn't work because there isn't an element with the class "amp-mode-mouse" when the page is requested using WinHttp/XMLhttpRequest. This class only exists when the page is requested using a browser, probably generated by JavaScript. Requesting the page using WinHttp/XMLhttpRequest doesn't run any embedded or external JavaScript code.

Julesdude · Apr 4, 2019

Ahhh I see! I didn't realise winHTTP had limitations compared to the browser automation method, which I was trying to steer away from as it takes longer when I'm looping through 1000s of webpages for this task.
From what you're saying then, it looks like there's no way of capturing the article text using the nearest parent elements as they are JavaScript related tags. So I guess there's no way of looping through all child elements of the article column unless I use Internet Explorer automation?

Yes I will be using your a article cycle code for sure. In this exercise I need to split out the strands of text by tag name (one column) and related content in adjacent column so as to mark the formatting of the article. This is because I need to hand this over to a team that'll convert this dataframe of elements and text into JSON to upload to a new platform via the backend.

I just had a second question too John if you don't mind? I've really struggled to find a decent tutorial or documentation on the winHTTP and browser automation route in VBA. It seems very trial and error and I've only picked up things from trawling across boards like this and fishing out examples which is really time consuming. Where are people getting the knowledge from for VBA reference libraries? Only pages I can find are for C# or VB - examples which don't work in VBA. For example, I didn't realise getelementsbytagname could use a ("*") as a wildcard until I saw an example posted somewhere. Bit by bit I'm building an understanding but I still don't know when to use what and how for most things.

Julesdude · Apr 4, 2019

John_w said:
Your last code doesn't work because there isn't an element with the class "amp-mode-mouse" when the page is requested using WinHttp/XMLhttpRequest. This class only exists when the page is requested using a browser, probably generated by JavaScript. Requesting the page using WinHttp/XMLhttpRequest doesn't run any embedded or external JavaScript code.

Last post from me for today John. But I managed to get the content I needed extracting....in part .....for this and other pages is this wordpress driven blog section. Basically I used the winHTTP method still, but found a more solid element to reference a bit further down the DOM tree:

Code:

    Set Elements = html.getElementsByClassName("amp-wp-article-content")(0)       Set childElement = Elements.getElementsByTagName("*")
        For Each Element In childElement
' continuation of code extraction into cells using element.tagname and element.innertext

from https://blog.shelter.org.uk/2018/08...ents-and-a-national-affordability-crisis/amp/

Basically it manages to extract all elements/tags and content in the granular form I need splitting them per cell row. However, as part of my content migration project, I also need to capture the tag IMG and image reference url, and the author name at the very top left. For the H1 header title, I basically made a call to the page title, so :

ActiveSheet.Cells(i, "D") = Replace(HTMLdoc.Title, "| Shelter", "")

Bit sloppy but it works. I wondered if you had any ideas about how to capture the image and author name? They seem to be our of reach element reference wise. Or perhaps i can reference them specifically seeing as there is no loop required to go through them?

John_w · Apr 5, 2019

Julesdude said:
I just had a second question too John if you don't mind? I've really struggled to find a decent tutorial or documentation on the winHTTP and browser automation route in VBA. It seems very trial and error and I've only picked up things from trawling across boards like this and fishing out examples which is really time consuming. Where are people getting the knowledge from for VBA reference libraries? Only pages I can find are for C# or VB - examples which don't work in VBA. For example, I didn't realise getelementsbytagname could use a ("*") as a wildcard until I saw an example posted somewhere. Bit by bit I'm building an understanding but I still don't know when to use what and how for most things.

Most of it is trial and error to begin with because the code usually has to be specific to a website. Best to take an example and try to adapt it to your site. The VB examples should be directly convertible to VBA. Always use early binding (of MS HTML, IE and WinHttp/XMLhttp) whilst developing code to get the intellisense editing and see their properties and methods).

Here are some IE automation and HTML reference sites, why might be useful, though not all relevant to your task:

Internet Explorer Development - Hosting and Reuse
http://msdn.microsoft.com/en-us/library/aa752038(VS.85).aspx

WebBrowser Control
http://msdn.microsoft.com/en-us/library/aa752040(VS.85).aspx

InternetExplorer Object
http://msdn.microsoft.com/en-us/library/aa752084(VS.85).aspx

HTML/XHTML Reference
http://msdn.microsoft.com/en-us/library/hh772960(v=vs.85).aspx

About MSHTML
http://msdn.microsoft.com/en-us/library/bb508515(v=vs.85).aspx

Windows Internet Explorer API reference
http://msdn.microsoft.com/en-us/library/ie/hh828809(v=vs.85).aspx

Document Object Model (DOM)
http://msdn.microsoft.com/en-us/library/ie/hh772384(v=vs.85).aspx

XMLHttpRequest object
https://msdn.microsoft.com/en-us/library/ms535874(v=vs.85).aspx

Julesdude said:
Last post from me for today John. But I managed to get the content I needed extracting....in part .....for this and other pages is this wordpress driven blog section. Basically I used the winHTTP method still, but found a more solid element to reference a bit further down the DOM tree:

Code:

Set Elements = html.getElementsByClassName("amp-wp-article-content")(0) Set childElement = Elements.getElementsByTagName("*") For Each Element In childElement ' continuation of code extraction into cells using element.tagname and element.innertext

from https://blog.shelter.org.uk/2018/08...ents-and-a-national-affordability-crisis/amp/

Basically it manages to extract all elements/tags and content in the granular form I need splitting them per cell row. However, as part of my content migration project, I also need to capture the tag IMG and image reference url, and the author name at the very top left. For the H1 header title, I basically made a call to the page title, so :

ActiveSheet.Cells(i, "D") = Replace(HTMLdoc.Title, "| Shelter", "")

Bit sloppy but it works. I wondered if you had any ideas about how to capture the image and author name? They seem to be our of reach element reference wise. Or perhaps i can reference them specifically seeing as there is no loop required to go through them?

The name of the Elements variable in the above code is a bit confusing; the plural 's' suggests a collection/array, however getElementsByClassName("amp-wp-article-content") returns a collection (IHTMLElementCollection) and the (0) returns the first item in the collection. So the Elements variable is referring to a single element, not multiple elements as the name suggests. I've never used getElementsByTagName("*"), but I think it's the same as the HTMLDocument.all collection.

I tried that site with WinHttp, however for some reason the HTMLDocument doesn't contain the IMG tag, or more precisely Set elements = HTMLdoc.getElementsByTagName("IMG") returns a zero-length collection (elements.Length is zero), even though it is present in the HTML code. Therefore I think you'll need to use IE automation. This should get the author's name:

Code:

    '< li class="amp-wp-byline">Tom Weekes</li>
    Dim HTMLdoc As HTMLDocument
    Dim elements As IHTMLElementCollection
    Set elements = HTMLdoc.getElementsByClassName("amp-wp-byline")
    Debug.Print elements(0).innerText

And this the src attribute (URL) of the first IMG:

Code:

    Set elements = HTMLdoc.getElementsByTagName("IMG")
    Debug.Print elements(0).src

Can't find META tags in HTML when using WinHTTP Request

Julesdude

Board Regular

John_w

MrExcel MVP

Julesdude

Board Regular

Julesdude

Board Regular

Julesdude

Board Regular

John_w

MrExcel MVP

Julesdude

Board Regular

John_w

MrExcel MVP

Julesdude

Board Regular

Julesdude

Board Regular

John_w

MrExcel MVP

Similar threads

Share this page

Can't find META tags in HTML when using WinHTTP Request

Board Regular

MrExcel MVP

Board Regular

Board Regular

Board Regular

MrExcel MVP

Board Regular

MrExcel MVP

Board Regular

Board Regular

MrExcel MVP

Similar threads

Share this page

We've detected that you are using an adblocker.

Which adblocker are you using?

Disable AdBlock

Disable AdBlock Plus

Disable uBlock Origin

Disable uBlock