How to extract en.wikipedia.org first paragraph into excel with VBA

satish78

Board Regular
Joined
Aug 31, 2014
Messages
218
Hi Friends,

Been through some of the posts related to wikipedia.org to extract data into excel. But did not find exact match to my need.
I am looking to extract first paragraph of en.wikipedia.org article into excel sheet using VBA.

Here is sample links I am trying to.

I have huge list of urls in ColumnA and need first paragraph data into ColumnB.
[TABLE="width: 320"]
<tbody>[TR]
[TD="width: 320"]https://en.wikipedia.org/wiki/Epesi
https://en.wikipedia.org/wiki/PhpDocumentor[/TD]
[/TR]
</tbody>[/TABLE]


Thanks
 

Excel Facts

Highlight Duplicates
Home, Conditional Formatting, Highlight Cells, Duplicate records, OK to add pink formatting to any duplicates in selected range.
Hello satish78,

What you want to do requires two steps. First to open the URL and extract the paragraph and second to paste the HTML text to the worksheet. The first step is not to complex and only requires VBA. The second step is more complex and requires using the Windows API. If you are using a Mac then this code will not work. The API code used is compatible with 32 bit Office running on 64 bit Windows. If your computer is running 32 bit Office on 32 bit Windows then the API code will need to changed. Copy and paste each macro into its own module. You can execute the "Run" macro manually using Alt+F8 keys and selecting "Run" in the macro list and clicking "OK".

API Code to Copy HTML to the Clipboard
Code:
Option Explicit


' Written:  May 02, 2018
' Author:   Leith Ross
' Summary:  Copies a fragment of HTML code to the clipboard in HTML format.
'           This allows the system to render the HTML code as you would see it on the wewb page.


' // API calls for use with WoW64 systems. If this code fails to compile then you have a 32 bit OS.
Private Declare PtrSafe Function CloseClipboard Lib "user32.dll" () As Long
Private Declare PtrSafe Function OpenClipboard Lib "user32.dll" (ByVal hWnd As LongPtr) As Long
Private Declare PtrSafe Function GlobalAlloc Lib "kernel32.dll" (ByVal wFlags As Long, ByVal dwBytes As LongPtr) As LongPtr
Private Declare PtrSafe Function SetClipboardData Lib "user32.dll" (ByVal wFormat As Long, ByVal hMem As LongPtr) As LongPtr
Private Declare PtrSafe Function EmptyClipboard Lib "user32.dll" () As Long
Private Declare PtrSafe Function RegisterClipboardFormat Lib "user32.dll" Alias "RegisterClipboardFormatA" (ByVal lpString As String) As Long
Private Declare PtrSafe Function GlobalLock Lib "kernel32.dll" (ByVal hMem As LongPtr) As LongPtr
Private Declare PtrSafe Function GlobalUnlock Lib "kernel32.dll" (ByVal hMem As LongPtr) As Long
Private Declare PtrSafe Function GetClipboardData Lib "user32.dll" (ByVal wFormat As Long) As LongPtr
Private Declare PtrSafe Function strLen Lib "kernel32.dll" Alias "lstrlenA" (ByVal lpData As Long) As Long
Private Declare PtrSafe Sub CopyMemory Lib "kernel32.dll" Alias "RtlMoveMemory" (ByRef pDest As Any, ByRef pSource As Any, ByVal cbLength As Long)


Private Const htmlHeader = _
                  "Version:1.0" & vbCrLf & _
                  "StartHTML:aaaaaaaaaa" & vbCrLf & _
                  "EndHTML:bbbbbbbbbb" & vbCrLf & _
                  "StartFragment:cccccccccc" & vbCrLf & _
                  "EndFragment:dddddddddd" & vbCrLf
                  
' // HTML Clipboard Format
Private cfHTML As LongPtr


Function RegisterCF() As Long


    ' // Register the HTML clipboard format
    If cfHTML = 0 Then
      cfHTML = RegisterClipboardFormat("HTML Format")
    End If
    
    RegisterCF = cfHTML
   
End Function


Public Sub CopyHtmlToClipboard(ByVal strHtmlFragment As String)
   
    Dim htmlEndTags     As String
    Dim htmlStartTags   As String
    Dim strData         As String
    Dim hMem            As LongPtr
    Dim lpData          As LongPtr
    
        If RegisterCF = 0 Then Exit Sub
        
        htmlStartTags = "<HTML>******>"
        htmlEndTags = "</BODY></HTML>"
   
        ' // Add the starting and ending tags for the HTML fragment
        htmlStartTags = htmlStartTags & "<!--StartFragment -->"
        htmlEndTags = "<!--EndFragment -->" & htmlEndTags
   
        ' // Build the HTML given the description, the fragment and the context.
        ' // And, replace the offset place holders in the description with values
        ' // for the offsets of StartHMTL, EndHTML, StartFragment and EndFragment.
        strData = htmlHeader & htmlStartTags & strHtmlFragment & htmlEndTags
        strData = Replace(strData, "aaaaaaaaaa", Format(Len(htmlHeader), "0000000000"))
        strData = Replace(strData, "bbbbbbbbbb", Format(Len(strData), "0000000000"))
        strData = Replace(strData, "cccccccccc", Format(Len(htmlHeader & htmlStartTags), "0000000000"))
        strData = Replace(strData, "dddddddddd", Format(Len(htmlHeader & htmlStartTags & strHtmlFragment), "0000000000"))


        ' // Add the HTML code to the clipboard
        If OpenClipboard(0) <> 0 Then
            hMem = GlobalAlloc(0, Len(strData) + 10)
            If hMem <> 0 Then
                lpData = GlobalLock(hMem)
                If lpData <> 0 Then
                    CopyMemory ByVal lpData, ByVal strData, Len(strData)
                    GlobalUnlock hMem
                    EmptyClipboard
                    SetClipboardData cfHTML, hMem
                End If
            End If
            Call CloseClipboard
        End If


End Sub


Public Function GetHTMLClipboard() As String


    Dim hMemHandle  As Long
    Dim lpData      As Long
    Dim nClipSize   As Long
    Dim nStartFrag  As Long
    Dim nEndFrag    As Long
    Dim nIndex      As Long
    Dim sData       As String
   
        If RegisterCF = 0 Then Exit Function
   
        If OpenClipboard(0) <> 0 Then
            GlobalUnlock hMemHandle


            ' // Retrieve the data from the clipboard
            hMemHandle = GetClipboardData(cfHTML)
      
            If hMemHandle <> 0 Then
                lpData = GlobalLock(hMemHandle)
         
                If lpData <> 0 Then
                    nClipSize = strLen(lpData)
                    sData = String(nClipSize + 10, 0)
            
                    Call CopyMemory(ByVal sData, ByVal lpData, nClipSize)
            
                    ' // If StartFragment appears in the data's description,
                    ' // then retrieve the offset specified in the description
                    ' // for the start of the fragment. Likewise, if EndFragment
                    ' // appears in the description, then retrieve the
                    ' // corresponding offset.
                    nIndex = InStr(sData, "StartFragment:")
                        If nIndex Then
                            nStartFrag = CLng(Mid(sData, nIndex + Len("StartFragment:"), 10))
                        End If
            
                    nIndex = InStr(sData, "EndFragment:")
                        If nIndex Then
                            nEndFrag = CLng(Mid(sData, nIndex + Len("EndFragment:"), 10))
                        End If
            
                    ' // Return the fragment given the starting and ending
                    ' // offsets
                    If (nStartFrag > 0 And nEndFrag > 0) Then
                        GetHTMLClipboard = Mid(sData, nStartFrag + 1, (nEndFrag - nStartFrag))
                    End If
                        
                End If
      
            End If


            Call CloseClipboard
        End If


End Function

Macro Code to Extract URL Data
Code:
Option Explicit


Sub Run()


    Dim Cell    As Range
    Dim Elem    As Object
    Dim HTMLdoc As Object
    Dim LastRow As Long
    Dim PageSrc As String
    Dim Request As Object
    Dim Rng     As Range
    Dim Term    As Variant
    Dim Text    As String
    Dim URL     As String
    Dim Wks     As Worksheet
    
        Set Wks = ThisWorkbook.Worksheets("Sheet1")
        Set Rng = Wks.Range("A2")
        
        LastRow = Wks.Cells(Rows.Count, Rng.Column).End(xlUp).Row
        
        If LastRow < Rng.Row Then Exit Sub
        
        Set Rng = Rng.Resize(RowSize:=LastRow - Rng.Row + 1)
        
        Set Request = CreateObject("MSXML2.XMLHTTP")
        
        For Each Cell In Rng
            Text = ""
            
            Term = InStrRev(Cell, "/")
            Term = Right(Cell, Len(Cell) - Term)
            
            With Request
                .Open "GET", Cell.Value, False
                .Send
                If .Status <> "200" Then
                    Cell.Offset(0, 1).Value = "ERROR: " & .Status & " - " & .statusText
                    GoTo Continue
                End If
                PageSrc = .responseText
            End With
        
            Set HTMLdoc = CreateObject("htmlfile")
                HTMLdoc.Write PageSrc
                HTMLdoc.Close
        
            For Each Elem In HTMLdoc.body.GetElementsByTagName("p")
                Text = Elem.innerHTML
                If InStr(1, Text, Term, vbTextCompare) Then
                    Exit For
                End If
            Next Elem
        
            CopyHtmlToClipboard Text
            
            Cell.Offset(0, 1).PasteSpecial
Continue:
        Next Cell
        
End Sub
 
Upvote 0
Its working except for some of urls its extracting some other line or paragraph from article, instead of first paragraph.

here are examples
[TABLE="width: 322"]
[TR]
[TD]https://en.wikipedia.org/wiki/OW2_Orchestra
[TABLE="width: 322"]
[TR]
[TD]https://en.wikipedia.org/wiki/Sydle#Sydle_Seed[/TD]
[/TR]
[/TABLE]
[/TD]
[/TR]
[/TABLE]
 
Upvote 0
Hello Satish,

This version of the Run macro corrects the problems for the two URLs in your last post.

Revised Run Macro Code
Code:
Option Explicit


Sub Run()


    Dim Cell    As Range
    Dim Elem    As Object
    Dim HashTag As Long
    Dim HTMLdoc As Object
    Dim LastRow As Long
    Dim PageSrc As String
    Dim Request As Object
    Dim Rng     As Range
    Dim Term    As Variant
    Dim Text    As String
    Dim URL     As String
    Dim Wks     As Worksheet
    
        Set Wks = ThisWorkbook.Worksheets("Sheet1")
        Set Rng = Wks.Range("A2")
        
        LastRow = Wks.Cells(Rows.Count, Rng.Column).End(xlUp).Row
        
        If LastRow < Rng.Row Then Exit Sub
        
        Set Rng = Rng.Resize(RowSize:=LastRow - Rng.Row + 1)
        
        Set Request = CreateObject("MSXML2.XMLHTTP")
        
        For Each Cell In Rng
            Text = ""
            
            Term = InStrRev(Cell, "/")
            Term = Replace(Right(Cell, Len(Cell) - Term), "_", " ")
            
            HashTag = InStr(1, Term, "#")
            If HashTag Then Term = Left(Term, HashTag - 1)
            
            With Request
                .Open "GET", Cell.Value, False
                .Send
                If .Status <> "200" Then
                    Cell.Offset(0, 1).Value = "ERROR: " & .Status & " - " & .statusText
                    GoTo Continue
                End If
                PageSrc = .responseText
            End With
        
            Set HTMLdoc = CreateObject("htmlfile")
                HTMLdoc.Write PageSrc
                HTMLdoc.Close
        
            For Each Elem In HTMLdoc.body.GetElementsByTagName("p")
                Text = Elem.innerHTML
                If InStr(1, Text, Term, vbTextCompare) Then
                    Exit For
                End If
            Next Elem
        
            CopyHtmlToClipboard Text
            
            Cell.Offset(0, 1).PasteSpecial
Continue:
        Next Cell
        
End Sub
 
Last edited:
Upvote 0

Forum statistics

Threads
1,223,893
Messages
6,175,244
Members
452,622
Latest member
Laura_PinksBTHFT

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top