# Can't find META tags in HTML when using WinHTTP Request



## Julesdude (Apr 2, 2019)

Hi all,

I am trying to locate Meta tags from the following URL:

https://england.shelter.org.uk/prof...ary_folder/2020_group_-_fiscal_stimulus_paper

However, what is passed from the result string to HTMLDoc.body.all is only a section of the actual HTML and the meta tags at the beginning of the script are absent and so cannot be found. How do I overcome this?

my code is as follows:


```
Dim Http2 As New WinHttpRequest
Dim i as long
dim result as string
dim elements
dim element

Set HTMLDoc = New MSHTML.HTMLDocument



    Http2.Open "GET", url, False
    ' send request
    Http2.send
    result = Http2.responseText



    'pass text of HTML document returned
    HTMLDoc.body.all = result


    Set Elements = HTMLDoc.all.tags("META")


        For Each singleElement In Elements 


        ActiveSheet.Cells(i, "A") = url
        ActiveSheet.Cells(i, "B") = "META " & singleElement.Name
            
        ActiveSheet.Cells(i, "D").NumberFormat = "@" ' text format for date
        ActiveSheet.Cells(i, "D") = singleElement.Content
        
        
        i = i + 1
        
        Next 
    
    
    Set Elements = Nothing
```


----------



## ScottR (Apr 2, 2019)

Meta tags are in at HTML documents head, not it's body. Not in a position to check but I think 
	
	
	
	
	
	



```
HTMLDoc.all
```
 will include Meta tags

Edit: Sorry I only read your introduction and not the code. You should have what you need but obviously don't! I'll try and look later when at an appropriate machine instead of posting unhelpful suggestions!


----------



## Kyle123 (Apr 2, 2019)

Bit grotty, but gives you an idea:

```
Http2.Open "GET", Url, False
    ' send request
    Http2.send
    result = Http2.responseText

    'pass text of HTML document returned
    f = Split(result, "****** ")
    For x = LBound(f) + 1 To UBound(f) + 1
        On Error Resume Next
        Debug.Print Split(Split(f(x), "name=""")(1), """")(0), Split(Split(f(x), "content=""")(1), """")(0)
    Next x
```


----------



## Julesdude (Apr 2, 2019)

ScottR said:


> Meta tags are in at HTML documents head, not it's body. Not in a position to check but I think
> 
> 
> 
> ...



Hi Scott, thanks for helping out.

I actually did try that thinking it would work and I get a type mismatch error message when I try to pass the result string to it.

I'm declaring HTMLDoc as:


```
Dim HTMLDoc As MSHTML.HTMLDocument
Set HTMLDoc = New MSHTML.HTMLDocument
```


i also changed 
	
	
	
	
	
	



```
HTMLDoc.body.all

'to

HTMLDoc.body.innerHTML
```

This parses ok but cuts out the meta tags.

Not sure if that interrupts the passing? But anyway, it's still not working


----------



## Julesdude (Apr 2, 2019)

Thanks Kyle,
When I try this I get an error - the f is highlighted 'can't assign to array'
How are you declaring the 'f'?


----------



## John_w (Apr 2, 2019)

A trick I discovered is that, if instead of early binding the HTMLDocument object and loading it with the usual method like this:


```
Dim HTMLdoc As HTMLDocument
    Set HTMLdoc = New HTMLDocument
    HTMLdoc.body.innerHTML = WinHttpRequest.responseText
```

you late bind HTMLDocument and load the response with its Write method, then the HEAD tag, which includes META tags, is fully populated:


```
Dim HTMLdoc As Object
    Set HTMLdoc = CreateObject("HTMLfile")
    HTMLdoc.Open
    HTMLdoc.Write WinHttpRequest.responseText     
    HTMLdoc.Close
```
In your case the above Write line should be HTMLdoc.Write Http2.responseText


----------



## Kyle123 (Apr 2, 2019)

The only issue with that though John is that you lose the good stuff like getting elements by class name etc and the more modern stuff. Though in this case it's very unlikely to matter


----------



## Kyle123 (Apr 2, 2019)

Julesdude said:


> Thanks Kyle,
> When I try this I get an error - the f is highlighted 'can't assign to array'
> How are you declaring the 'f'?



As a variant


----------



## Julesdude (Apr 2, 2019)

John, thank you. This works perfectly.
Will this capture all meta tags though. For example, ones like:

```
meta name = "...." content= "....">
meta property = "...." content = "....">
```


I would want to capture both elements/tags within.

Here's my code in full with your additions:


```
Dim HTMLdoc As Object
   
 Set HTMLdoc = CreateObject("HTMLfile")
    HTMLdoc.Open
    HTMLdoc.Write Http2.responseText
  
    Set Elements = HTMLdoc.all.tags("META")
    
    
    ' add a new worksheet for the webpage and provide headers
    ActiveWorkbook.Sheets.Add
    Cells(1, 1) = "URL"
    Cells(1, 2) = "TAG"
    Cells(1, 3) = "LINK"
    Cells(1, 4) = "TEXT"
    
    i = ActiveSheet.Range("A" & Rows.Count).End(xlUp).Row + 1 ' find the last row


        For Each singleElement In Elements ' new
        ActiveSheet.Cells(i, "A") = url
        ActiveSheet.Cells(i, "B") = "META " & singleElement.Name
            
        ActiveSheet.Cells(i, "D").NumberFormat = "@" ' text format for date
        ActiveSheet.Cells(i, "D") = singleElement.Content
        
        
        i = i + 1
        
        Next ' new
 HTMLdoc.Close
```


----------



## John_w (Apr 2, 2019)

Correct Kyle, so declare two HTMLDocuments, one late bound to get the META elements and the other early bound for getElementsByClassName, etc!


----------



## Julesdude (Apr 2, 2019)

Hi all,

I am trying to locate Meta tags from the following URL:

https://england.shelter.org.uk/prof...ary_folder/2020_group_-_fiscal_stimulus_paper

However, what is passed from the result string to HTMLDoc.body.all is only a section of the actual HTML and the meta tags at the beginning of the script are absent and so cannot be found. How do I overcome this?

my code is as follows:


```
Dim Http2 As New WinHttpRequest
Dim i as long
dim result as string
dim elements
dim element

Set HTMLDoc = New MSHTML.HTMLDocument



    Http2.Open "GET", url, False
    ' send request
    Http2.send
    result = Http2.responseText



    'pass text of HTML document returned
    HTMLDoc.body.all = result


    Set Elements = HTMLDoc.all.tags("META")


        For Each singleElement In Elements 


        ActiveSheet.Cells(i, "A") = url
        ActiveSheet.Cells(i, "B") = "META " & singleElement.Name
            
        ActiveSheet.Cells(i, "D").NumberFormat = "@" ' text format for date
        ActiveSheet.Cells(i, "D") = singleElement.Content
        
        
        i = i + 1
        
        Next 
    
    
    Set Elements = Nothing
```


----------



## John_w (Apr 2, 2019)

Julesdude said:


> John, thank you. This works perfectly.
> Will this capture all meta tags though. For example, ones like:
> 
> ```
> ...


Try:

```
singleElement.getAttribute("property")
```


----------



## Julesdude (Apr 3, 2019)

Hi John,

Many thanks for this. I was wondering if this similar method would also work for the following tags a little further down in the script:


```
main tabindex="-1" id="maincontent">
   div class="bg--a">
      div class="container">
         div id="content_div_643588">
             style>
            article text.....
```

I would normally call in the following way to cycle through all child tags:


```
Set Elements = HTMLdoc.getElementById("maincontent").getElementsByTagName("*")
For Each Element In Elements
' etc
```

However this does not work - 'Object Variable or With block variable not set' error message appears.

I've also tried the following that doesn't work:

```
Set Elements = HTMLdoc.getElementsByClassName("container")(0)Set childelements = Elements.getElementsByTagName("*")
For Each Element In childelements
cells(i,1) = element.tagname
cells(i,2) = element.innertext
```

This doesn't work either. How do I set a marker for any one of those tags and explore their child tags and related innertext?


----------



## Julesdude (Apr 3, 2019)

Here's another example of the child elements I'm trying to loop through on site https://scotland.shelter.org.uk/pro...edinburgh_community_hub_impact_report_2017-18

I'm splitting all the tags and innertext in rows in a worksheet from the container where the body of text for the article resides. I could call on any of the following from the top of that container:

main id="maincontent" tabindex=-1
   div class="bg--a"
      div class="container"
         div class="row" data-sq="6152"
            div class="s12 18 stickycontainer"
               article
                  h1...etc etc

Ideally I'd like to find <article> and loop through all tags and innertext extraction one by one from that point to the end of the article which is marked by </article>
Does anyone know how I could reference that point in the right way so my loops work and cycle through?


----------



## Julesdude (Apr 3, 2019)

Can anyone lend a hand? I'm not sure how to use HTMLDoc.getElementsByClassName to set Elements to it and then look through each element in Elements.


----------



## John_w (Apr 3, 2019)

Julesdude said:


> Ideally I'd like to find <article> and loop through all tags and innertext extraction one by one from that point to the end of the article which is marked by </article>
> Does anyone know how I could reference that point in the right way so my loops work and cycle through?


To do that you need getElementsByTagName("article") and traverse the DOM tree from each article element, extracting the text nodes.  For an unknown reason, the article element doesn't contain any child nodes, so the code below starts at its parent element.


```
Dim articles As IHTMLElementCollection
    Dim article As HTMLUnknownElement
    Dim allText As String
    
    allText = vbNullString
    Set articles = HTMLdoc.getElementsByTagName("article")
    For Each article In articles
        allText = allText & ConcatTextNodes(article.parentElement) & " "
    Next
    
    Debug.Print allText
```


```
Private Function ConcatTextNodes(node As IHTMLDOMNode) As String

    Dim i As Long
    
    ConcatTextNodes = vbNullString
    If node.NodeType = NODE_TEXT Then
        If InStr(node.NodeValue, "<!--") = 0 Then ConcatTextNodes = Trim(node.NodeValue)
    End If
    
    For i = 0 To node.ChildNodes.Length - 1
        ConcatTextNodes = ConcatTextNodes & " " & ConcatTextNodes(node.ChildNodes(i))
    Next
    
End Function
```



Julesdude said:


> Can anyone lend a hand? I'm not sure how to use HTMLDoc.getElementsByClassName to set Elements to it and then look through each element in Elements.


Like this:

```
Dim HTMLdoc2 As HTMLDocument
    Dim classColl As IHTMLElementCollection
    Dim elem As HTMLGenericElement
    Set classColl = HTMLdoc2.getElementsByClassName("theClassName")
    For Each elem In classColl
        Debug.Print elem.tagName, elem.innerText
    Next
```
The above code needs references to Microsoft HTML Object Library and Microsoft XML v6.0.


----------



## Julesdude (Apr 4, 2019)

Sorry I couldn't get back to you earlier today in response.
Thanks John for the examples, they are really helpful and provide insight. In your last example, I had something very similar to capture all tags and elements from a parent class tag. I tried to apply your example but it's not capturing all child tag names and text content under it. Do you know what I am doing wrong?


```
Set Elements = html.getElementsByClassName("amp-mode-mouse")(0)
    Set childElement = Elements.getElementsByTagName("*")


    
        For Each Element In childElement
                    ActiveSheet.Cells(i, 1) = url
                    ActiveSheet.Cells(i, 2) = Element.tagName
                    ActiveSheet.Cells(i, 3) = Element.innertext
        next
```

This is from the page https://blog.shelter.org.uk/2018/08...ents-and-a-national-affordability-crisis/amp/
I just really want to capture the article in the centre of the page but the HTML is:

article class="amp-wp-article"
   header class="amp-wp-article-header"
        then h1, p, text for the article thereafter...


----------



## John_w (Apr 4, 2019)

Did you try my 'articles' code snippet?  Just change the Debug.Print statement to Range("A1").Value = allText and it puts all the concatenated article text in cell A1.  This text is generated by scanning all the child elements (nodes) from the article element.

Your last code doesn't work because there isn't an element with the class "amp-mode-mouse" when the page is requested using WinHttp/XMLhttpRequest.  This class only exists when the page is requested using a browser, probably generated by JavaScript.  Requesting the page using WinHttp/XMLhttpRequest doesn't run any embedded or external JavaScript code.


----------



## Julesdude (Apr 4, 2019)

Ahhh I see! I didn't realise winHTTP had limitations compared to the browser automation method, which I was trying to steer away from as it takes longer when I'm looping through 1000s of webpages for this task. 
From what you're saying then, it looks like there's no way of capturing the article text using the nearest parent elements as they are JavaScript related tags. So I guess there's no way of looping through all child elements of the article column unless I use Internet Explorer automation?

Yes I will be using your a article cycle code for sure. In this exercise I need to split out the strands of text by tag name (one column) and related content in adjacent column so as to mark the formatting of the article. This is because I need to hand this over to a team that'll convert this dataframe of elements and text into JSON to upload to a new platform via the backend.

I just had a second question too John if you don't mind? I've really struggled to find a decent tutorial or documentation on the winHTTP and browser automation route in VBA. It seems very trial and error and I've only picked up things from trawling across boards like this and fishing out examples which is really time consuming. Where are people getting the knowledge from for VBA reference libraries? Only pages I can find are for C# or VB - examples which don't work in VBA. For example, I didn't realise getelementsbytagname could use a ("*") as a wildcard until I saw an example posted somewhere. Bit by bit I'm building an understanding but I still don't know when to use what and how for most things.


----------



## Julesdude (Apr 4, 2019)

John_w said:


> Your last code doesn't work because there isn't an element with the class "amp-mode-mouse" when the page is requested using WinHttp/XMLhttpRequest.  This class only exists when the page is requested using a browser, probably generated by JavaScript.  Requesting the page using WinHttp/XMLhttpRequest doesn't run any embedded or external JavaScript code.



Last post from me for today John. But I managed to get the content I needed extracting....in part .....for this and other pages is this wordpress driven blog section. Basically I used the winHTTP method still, but found a more solid element to reference a bit further down the DOM tree:


```
Set Elements = html.getElementsByClassName("amp-wp-article-content")(0)       Set childElement = Elements.getElementsByTagName("*")
        For Each Element In childElement
' continuation of code extraction into cells using element.tagname and element.innertext
```

from https://blog.shelter.org.uk/2018/08...ents-and-a-national-affordability-crisis/amp/

Basically it manages to extract all elements/tags and content in the granular form I need splitting them per cell row. However, as part of my content migration project, I also need to capture the tag IMG and image reference url, and the author name at the very top left. For the H1 header title, I basically made a call to the page title, so :

    ActiveSheet.Cells(i, "D") = Replace(HTMLdoc.Title, "| Shelter", "")

Bit sloppy but it works. I wondered if you had any ideas about how to capture the image and author name? They seem to be our of reach element reference wise. Or perhaps i can reference them specifically seeing as there is no loop required to go through them?


----------



## John_w (Apr 5, 2019)

Julesdude said:


> I just had a second question too John if you don't mind? I've really struggled to find a decent tutorial or documentation on the winHTTP and browser automation route in VBA. It seems very trial and error and I've only picked up things from trawling across boards like this and fishing out examples which is really time consuming. Where are people getting the knowledge from for VBA reference libraries? Only pages I can find are for C# or VB - examples which don't work in VBA. For example, I didn't realise getelementsbytagname could use a ("*") as a wildcard until I saw an example posted somewhere. Bit by bit I'm building an understanding but I still don't know when to use what and how for most things.


Most of it is trial and error to begin with because the code usually has to be specific to a website.  Best to take an example and try to adapt it to your site.  The VB examples should be directly convertible to VBA.  Always use early binding (of MS HTML, IE and WinHttp/XMLhttp) whilst developing code to get the intellisense editing and see their properties and methods).

Here are some IE automation and HTML reference sites, why might be useful, though not all relevant to your task:

Internet Explorer Development - Hosting and Reuse
http://msdn.microsoft.com/en-us/library/aa752038(VS.85).aspx

WebBrowser Control
http://msdn.microsoft.com/en-us/library/aa752040(VS.85).aspx

InternetExplorer Object
http://msdn.microsoft.com/en-us/library/aa752084(VS.85).aspx

HTML/XHTML Reference
http://msdn.microsoft.com/en-us/library/hh772960(v=vs.85).aspx

About MSHTML
http://msdn.microsoft.com/en-us/library/bb508515(v=vs.85).aspx

Windows Internet Explorer API reference
http://msdn.microsoft.com/en-us/library/ie/hh828809(v=vs.85).aspx

Document Object Model (DOM)
http://msdn.microsoft.com/en-us/library/ie/hh772384(v=vs.85).aspx

XMLHttpRequest object
https://msdn.microsoft.com/en-us/library/ms535874(v=vs.85).aspx



Julesdude said:


> Last post from me for today John. But I managed to get the content I needed extracting....in part .....for this and other pages is this wordpress driven blog section. Basically I used the winHTTP method still, but found a more solid element to reference a bit further down the DOM tree:
> 
> 
> ```
> ...


The name of the Elements variable in the above code is a bit confusing; the plural 's' suggests a  collection/array, however  getElementsByClassName("amp-wp-article-content") returns a collection  (IHTMLElementCollection) and the (0) returns the first item in the  collection.  So the Elements variable is referring to a single element, not multiple  elements as the name suggests. I've never used  getElementsByTagName("*"), but I think it's the same as the  HTMLDocument.all collection.

I tried that site with WinHttp, however for some reason the HTMLDocument doesn't contain the IMG tag, or more precisely Set elements = HTMLdoc.getElementsByTagName("IMG") returns a zero-length collection (elements.Length is zero), even though it is present in the HTML code.  Therefore I think you'll need to use IE automation.  This should get the author's name:

```
'< li class="amp-wp-byline">Tom Weekes</li>
    Dim HTMLdoc As HTMLDocument
    Dim elements As IHTMLElementCollection
    Set elements = HTMLdoc.getElementsByClassName("amp-wp-byline")
    Debug.Print elements(0).innerText
```
And this the src attribute (URL) of the first IMG:

```
Set elements = HTMLdoc.getElementsByTagName("IMG")
    Debug.Print elements(0).src
```


----------



## Julesdude (Apr 2, 2019)

Hi all,

I am trying to locate Meta tags from the following URL:

https://england.shelter.org.uk/prof...ary_folder/2020_group_-_fiscal_stimulus_paper

However, what is passed from the result string to HTMLDoc.body.all is only a section of the actual HTML and the meta tags at the beginning of the script are absent and so cannot be found. How do I overcome this?

my code is as follows:


```
Dim Http2 As New WinHttpRequest
Dim i as long
dim result as string
dim elements
dim element

Set HTMLDoc = New MSHTML.HTMLDocument



    Http2.Open "GET", url, False
    ' send request
    Http2.send
    result = Http2.responseText



    'pass text of HTML document returned
    HTMLDoc.body.all = result


    Set Elements = HTMLDoc.all.tags("META")


        For Each singleElement In Elements 


        ActiveSheet.Cells(i, "A") = url
        ActiveSheet.Cells(i, "B") = "META " & singleElement.Name
            
        ActiveSheet.Cells(i, "D").NumberFormat = "@" ' text format for date
        ActiveSheet.Cells(i, "D") = singleElement.Content
        
        
        i = i + 1
        
        Next 
    
    
    Set Elements = Nothing
```


----------



## Julesdude (Apr 9, 2019)

Hi John, thanks for all your help again. Your first example in last post worked and thanks - I can now retrieve the author text in the page.
The second example, which I add just after the first example, didn't work for me however. I Dim HTMLdoc as a new HTMLDocument. I then pass to it - HTMLDoc.body.innerHTML = .responseText
The first example works and author is extracted, but I get an error for debug.pring elements(0).src - 'object variable or with block variable not set'.

Ultimately in the body of the article I'd want to find any image links and extract the references for them. There's usually only one at the beginning of the article, but it is possible there may be a second further down. How do I capture these?


----------



## John_w (Apr 9, 2019)

As I remember, and from what I said in my previous post, with the WinHttpRequest method, HTMLDocument.getElementsByTagName("IMG").Length is zero, for some reason, even though the IMG tag is present in the .responseText.  Therefore I think you'll need to request the page with IE to extract the image links.  To extract all image links:


```
Dim HTMLdoc As HTMLDocument
    Dim elements As IHTMLElementCollection
    Dim imgElement As HTMLImg
    Set HTMLdoc = IE.document  'IE is InternetExplorer object with page loaded and complete
    Set elements = HTMLdoc.getElementsByTagName("IMG")
    For Each imgElement In elements
        Debug.Print imgElement.src
    Next
```


----------



## Kyle123 (Apr 9, 2019)

They aren't image tags:

```
Dim req As Object: Set req = CreateObject("msxml2.xmlhttp")
    req.Open "GET", "https://blog.shelter.org.uk/2018/08/flatlining-wages-surging-rents-and-a-national-affordability-crisis/amp/", False
    req.send
    With CreateObject("htmlfile")
        .body.innerHtml = req.responsetext
        Debug.Print .getElementsByTagName("amp-img")(0).src
    End With
```


----------



## John_w (Apr 9, 2019)

Thanks Kyle, for looking more carefully than me!


----------

