Extract String from PDF and Insert in Excel using VBA

Hamster1337

New Member
Joined
Nov 18, 2022
Messages
3
Office Version
  1. 365
Platform
  1. Windows
Hi all,

I'm trying to figure out how to find a certain word/string in a large number of PDFs and then insert the string following that keyword into Excel.

All PDFs are set up like this:

Date, letterhead etc.

Title: a various title, differing in words and lengths in each pdf

Number: a number (I think they're all the same lengtht)

Rest of the PDF


So what I would want the script to do is open the pdf, find the string after Title until Number:, insert in Excel Columns A1:A99
Then take the numbers after Number: and insert in Excel in Columns B1:B99

I've tried using that one: Extracting string from PDF and inserting into Excel using VBA
But it's not exactly the same thing and I can't figure out how to extract the strings after Title:.
Thanks a lot!
 

Excel Facts

Test for Multiple Conditions in IF?
Use AND(test, test, test, test) or OR(test, test, test, ...) as the logical_test argument of IF.
Try this macro, which you must edit to change the matchPDFs string to the folder path and (wildcard) file specification of the PDF files. It puts the matching PDF file names in column A, the Titles in B and the Numbers in C.

VBA Code:
Public Sub Search_PDFs_Extract_Strings()

    Dim matchPDFs As String
    Dim folder As String, file As String
    Dim destCell As Range, r As Long
    Dim startText As String, endText As String
    Dim i As Long
    Dim foundTitle As String, foundNumber As String
   
    matchPDFs = "C:\path\to\PDF files\*.pdf"   'CHANGE THIS
   
    With Worksheets("Extracted") 'ActiveSheet
        .Cells.Clear
        .Range("A1:C1").Value = Array("PDF File", "Title", "Number")
        Set destCell = .Range("A2")
        r = 0
    End With
   
    folder = Left(matchPDFs, InStrRev(matchPDFs, "\"))
    file = Dir(matchPDFs)
    While file <> vbNullString
        destCell.Offset(r, 0).Value = file
        Search_PDF_Extract_Title_and_Number folder & file, foundTitle, foundNumber
        destCell.Offset(r, 1).Resize(, 2).Value = Array(foundTitle, foundNumber)
        r = r + 1
        file = Dir
    Wend
   
End Sub


Private Sub Search_PDF_Extract_Title_and_Number(ByVal PDFfullName As String, ByRef title As String, ByRef number As String)
   
    Static AcroApp     As Object 'CAcroApp
    Dim AcroPDDoc      As Object 'CAcroPDDoc
    Dim AcroAVDoc      As Object 'CAcroAVDoc
    Dim AcroHiliteList As Object 'CAcroHiliteList
    Dim AcroPDPage     As Object 'CAcroPDPage
    Dim AcroTextSelect As Object 'CAcroPDTextSelect
    Dim p As Long, i As Long
    Dim foundTitle As Boolean, foundNumber As Boolean
   
    title = ""
    number = ""
   
    If AcroApp Is Nothing Then Set AcroApp = CreateObject("AcroExch.App") 'New AcroApp   
    AcroApp.Hide
   
    Set AcroPDDoc = CreateObject("AcroExch.PDDoc") 'New AcroPDDoc
   
    If Not AcroPDDoc.Open(PDFfullName) Then
        MsgBox "Unable to open " & PDFfullName
        Exit Sub
    End If
   
    Set AcroAVDoc = AcroPDDoc.OpenAVDoc("")
    'AcroAVDoc.BringToFront  'display the PDF document
   
    'Read pages until both "Title:" and "Number:" are found
   
    foundTitle = False
    foundNumber = False
    Set AcroHiliteList = CreateObject("AcroExch.HiliteList") 'New AcroHiliteList
    p = 0
    While p < AcroPDDoc.GetNumPages And Not foundTitle And Not foundNumber
        If AcroHiliteList.Add(0, 9999) Then
            Set AcroPDPage = AcroPDDoc.AcquirePage(p)
            Set AcroTextSelect = AcroPDPage.CreatePageHilite(AcroHiliteList)
            If Not AcroTextSelect Is Nothing Then
                i = 0
                While i < AcroTextSelect.GetNumText And (Not foundTitle Or Not foundNumber)
                    If Not foundTitle Then
                        If InStr(1, AcroTextSelect.GetText(i), "Title:", vbTextCompare) Then foundTitle = True
                        If InStr(1, AcroTextSelect.GetText(i), "Number:", vbTextCompare) Then foundNumber = True
                    Else
                        If InStr(1, AcroTextSelect.GetText(i), "Number:", vbTextCompare) Then
                            foundNumber = True
                        Else
                            title = title & AcroTextSelect.GetText(i)
                        End If
                    End If
                    i = i + 1
                Wend
                If i < AcroTextSelect.GetNumText And foundTitle And foundNumber Then
                    number = AcroTextSelect.GetText(i)
                End If
            End If
        End If
        p = p + 1
    Wend
   
    AcroPDDoc.Close
    AcroApp.Exit
    
    If foundTitle And foundNumber Then
        title = Clean(title)
        number = Clean(number)
    ElseIf foundTitle And Not foundNumber Then
        'Found 'Title:' but not 'Number:'
        title = ""
        number = ""
    ElseIf foundNumber And Not foundTitle Then
        'Found 'Number:' but not 'Title:'
        title = ""
        number = ""
    Else
        'Not found 'Title:' and 'Number:'
    End If
   
End Sub


Private Function Clean(text As String) As String
    Clean = Replace(text, vbCrLf, "")
    Clean = Trim(Clean)
End Function
 
Upvote 0
And THAT's why I don't like VBA.

Did you know that you can import a PDF using Power Query? Go to Data -> Get Data -> From File -> From PDF, browse to the file you need to work with and click [Open], navigate through the left column until you find the section you want and then click the [Transform Data] button. If you're new to Power Query it may take you some time to get what you want, but it's time well spent.

Once you have what you need, click Close & Load in the Home tab of the Power Query Editor ribbon, and the end result will be loaded to a table on a new tab (by default unless Load settings have been changed). If you need the data elsewhere, you can reference the data in the table, or use the Macro Recorder to record a (tiny) macro to copy the data and paste it as a Value where you need it.

Once you get the query set up to extract the data you want, if that PDF (same file name) is updated with new Information you only need to click Refresh All in the Data tab, and the table will be updated as will any locations referencing that data. If you don't need the Macro, you don't have to worry about saving it as a Macro Enable Excel file either.

Just giving you an alternative to the VBA solution. Hope it helps.
 
Upvote 0

Forum statistics

Threads
1,224,823
Messages
6,181,180
Members
453,021
Latest member
Justyna P

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top