Determine the XPath for any given HTML element within a HTMLDocument

AOB

Well-known Member
Joined
Dec 15, 2010
Messages
680
Office Version
  1. 365
  2. 2016
  3. 2013
Platform
  1. Windows
How might one programmatically retrieve (or construct?) the XPath for any given HTMLElement within a HTMLDocument?

So, the VBA equivalent of manually opening a web browser, navigating to a webpage, pressing F12 to open the developer window, inspecting an element, right-click > Copy > XPath ?

I'm iterating through a HTML document and recursively parsing it into a structured collection of specific HTML attributes (e.g. id, className, nodeName, innerHTML, innerText etc.)

VBA Code:
Dim docHTML As MSHTML.HTMLDocument
Dim objElement As Object
Set docHTML = New MSHTML.HTMLDocument
docHTML.Body.InnerHTML = strHTML
....
For Each objElement In docHTML.childNodes
    ....
Next objElement
....

(Grossly oversimplified snippet there, but just to give an idea of what I'm working with...)

I can get all the HTML attributes fine but would be really useful to get the XPath as well, but that property isn't exposed via MSHTML.HTMLDocument.

Can the XPath be derived, or is there some other (XML?) method by which the XPath for any given element might be retrieved?

Thanks!
 
Try this XPath function, which is based on the JavaScript code at Create XPath expression programmatically. It recursively traverses up the DOM tree from the target element, constructing the XPath selector string as it goes. It is a very basic parser and doesn't implement the following rules:
  • If the node has an ID, just use that and skip all the parents; they aren't added to the selector.
  • If the node has a tag name or a set of classes that is unique among its siblings, use that as the selector. Otherwise, use :nth-child.
VBA Code:
'References
'Microsoft HTML Object Library

'Based on https://stackoverflow.com/a/22642987

Public Function XPath(element As IHTMLDOMNode, suffix As String) As String

    Dim parent As IHTMLDOMNode
    Dim childIndex As Long

    Set parent = element.ParentNode
    If Not parent Is Nothing Then
        childIndex = nodeIndex(element, parent.ChildNodes) + 1
        XPath = XPath(parent, "/" & LCase(element.nodeName) & "[" & childIndex & "]" & suffix)
    Else
        XPath = suffix
    End If

End Function


Private Function nodeIndex(element As IHTMLDOMNode, children As IHTMLDOMChildrenCollection) As Long

    Dim i As Long
    Dim elementName As String
    Dim matched As IHTMLDOMNode
    Const NODE_ELEMENT = 1
   
    nodeIndex = -1
    elementName = LCase(element.nodeName)

    For i = 0 To children.Length - 1
        If children(i).NodeType = NODE_ELEMENT Then
            Set matched = children(i)
            If LCase(matched.nodeName) = elementName Then
                nodeIndex = nodeIndex + 1
                If matched Is element Then Exit For
            End If
        End If
    Next
       
End Function

Test procedure:
VBA Code:
'References
'Microsoft Internet Controls
'Microsoft HTML Object Library

Public Sub IE_XPaths()

    Dim IE As InternetExplorer
    Dim HTMLdoc As HTMLDocument
    Dim URL As String
   
    URL = "https://www.mrexcel.com/board/forums/excel-questions.10/"

    Set IE = New InternetExplorer 'CreateObject("InternetExplorer.Application")
    With IE
        .navigate URL
        .Visible = True
        Do While .Busy Or .readyState <> 4: DoEvents: Loop
        Set HTMLdoc = .document
    End With       
  
    Dim threadContainerDivs As IHTMLDOMChildrenCollection
    Dim threadContainerDiv As IHTMLDOMNode
    Dim thisDiv As HTMLDivElement
    Dim elem As HTMLGenericElement
   
    'Thread containers
    '<div class="structItem structItem--thread js-inlineModContainer js-threadListItem-1271116" data-author="irfman">
   
    Set threadContainerDivs = HTMLdoc.querySelectorAll("div.structItem.structItem--thread.js-inlineModContainer")
   
    For Each threadContainerDiv In threadContainerDivs
   
        Debug.Print "Thread <div>" & vbCrLf & XPath(threadContainerDiv, "")
       
        'Thread starter icon
        '<div class="structItem-iconContainer">
        Set thisDiv = threadContainerDiv.querySelector("div.structItem-iconContainer")
        Debug.Print "Starter icon <div>" & vbCrLf & XPath(thisDiv, "")
        Debug.Print "Starter icon <a>" & vbCrLf & XPath(thisDiv.getElementsByTagName("a")(0), "")
        If thisDiv.getElementsByTagName("span").Length > 0 Then
            Debug.Print "Starter icon <span>" & vbCrLf & XPath(thisDiv.getElementsByTagName("span")(0), "")
        End If
       
        'Thread starter name, posting date and time
        '<div class="structItem-minor">
        Set thisDiv = threadContainerDiv.querySelector("div.structItem-minor")
        Debug.Print "Starter name <a>" & vbCrLf & XPath(thisDiv.getElementsByTagName("a")(0), "")
        Debug.Print "Starter date and time <a>" & vbCrLf & XPath(thisDiv.getElementsByTagName("a")(1), "")

        'Answers and Views
        '<div class="structItem-cell structItem-cell--meta" title="First message reaction score: 0">
        Set thisDiv = threadContainerDiv.querySelector("div.structItem-cell.structItem-cell--meta")
        Set elem = thisDiv.getElementsByTagName("dl")(0).getElementsByTagName("dd")(0)
        Debug.Print "Answers <dd>" & vbCrLf & XPath(elem, "")
        Set elem = thisDiv.getElementsByTagName("dl")(1).getElementsByTagName("dd")(0)
        Debug.Print "Views <dd>" & vbCrLf & XPath(elem, "")
        Debug.Print
        Stop
       
    Next
   
End Sub
 
Upvote 1
Solution
This is wonderful, thanks @John_w for the detailed explanation and example code. Really appreciate the effort.

As chance would have it, I had already built a similar recursive function to traverse the node tree and construct the XPath based solely on the tags of the nth child element of each successive parent, up to the top (HTML document itself) - so effectively an absolute XPath starting with /html/...

Was still testing it this morning (it works!) so hadn't had a chance to update my own question yet - have added it here but your code is much more extensive and I really like the concept of stopping the recursion when you reach an element with a unique ID and "hanging" the XPath off that instead. I may stick to the absolute version for now but keep this in my pocket if/when I run into difficulties with it later...

(FYI I am using my own custom classes retaining certain specific attributes rather than the full DOM element object, purely for convenience on my part, but conceptually it is the same)

This function is custom to my own "Web Element" class so that I can quickly call it to retrieve the XPath of any such element within the main code; it begins the recursion using its own inherent properties and then traverses the ancestors to build the path backwards to the top of the document. dWebElementsByID is just an indexed dictionary of all the elements in the entire HTML document.

VBA Code:
Public Function GetXPath(dWebElementsByID As Object) As String
    On Error GoTo ErrorHandler
    Dim strResult As String
    strResult = ConstructXPath(dWebElementsByID, "")
Exit_GetXPath:
    GetXPath = strResult
    Exit Function
ErrorHandler:
    Err.Clear
    strResult = ""
    Resume Exit_GetXPath
End Function

And then this is the private recursive function for the construction (within the same class)

VBA Code:
Private Function ConstructXPath(dWebElementsByID As Object, strXPath As String, Optional ewe As clsExtendedWebPageElement = Nothing) As String

    On Error GoTo ErrorHandler
    
    Dim eweParent As clsExtendedWebPageElement
    Dim eweSibling As clsExtendedWebPageElement
    Dim dctSiblings As Object
    Dim strElementTag As String
    Dim lngElementID As Long
    Dim lngParentID As Long
    Dim lngSiblingCount As Long
    Dim varKey As Variant
    Dim varKeys As Variant
    Dim strResult As String
    
    If ewe Is Nothing Then
        lngElementID = lID
        strElementTag = sHTMLTagName
        lngParentID = lParentID
    Else
        lngElementID = ewe.ID
        strElementTag = ewe.HTMLTagName
        lngParentID = ewe.ParentID
    End If
    
    Select Case strElementTag
        Case "HTML"
            strResult = "/html[1]" & strXPath
            
        Case Else
            If lngParentID > 0 Then
                If dWebElementsByID.Exists(lngParentID) Then
                    Set eweParent = dWebElementsByID(lngParentID)
                    If Not eweParent Is Nothing Then
                        Set dctSiblings = eweParent.Children
                        If Not dctSiblings Is Nothing Then
                            If dctSiblings.Count > 0 Then
                                lngSiblingCount = 0
                                varKeys = dctSiblings.Keys()
                                For Each varKey In varKeys
                                    If dctSiblings.Exists(varKey) Then
                                        Set eweSibling = dctSiblings(varKey)
                                        If Not eweSibling Is Nothing Then
                                            If eweSibling.HTMLTagName = strElementTag Then lngSiblingCount = lngSiblingCount + 1
                                            If eweSibling.ID = lngElementID Then strResult = ConstructXPath(dWebElementsByID, "/" & strElementTag & "[" & lngSiblingCount & "]" & strXPath, eweParent)
                                        End If
                                    End If
                                Next varKey
                            End If
                        End If
                    End If
                End If
            End If
            
    End Select
    
Exit_ConstructXPath:
    On Error Resume Next
    ConstructXPath = strResult
    Set eweParent = Nothing
    Set eweSibling = Nothing
    Set dctSiblings = Nothing
    Exit Function
    
ErrorHandler:
    strResult = ""
    Call LogError(Err.Number, Err.Description, "ConstructXPath")
    Resume Exit_ConstructXPath
    
End Function
 
Upvote 0

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top