Extracting string from PDF and inserting into Excel using VBA

Alfons1145

New Member
Joined
Mar 23, 2022
Messages
3
Office Version
  1. 365
Platform
  1. Windows
Hello all together!
I am trying to solve the problem described in the topic. I have thousands of PDF files that contain a value after a string, I would like to copy this value into an excel file.

OPTION 1
I found this very helpful little script that allows to check whether a sting is included in the PDF in general: Search For String In PDF Using VBA

I thought great, that is almost what I need, I will just extend the selection and copy the characters after the string I am looking for. But I did not manage to do that and I wonder if it is at all possible with AcroExch.AVDoc.

OPTION 1 QESTION: Is someone able to tell me how to modify the section

Excel Formula:
blnSearch = AVDocObj.findtext(szText:=searchString, _
bCaseSensitive:=False, _
bWholeWordsOnly:=True, _
bReset:=2)

To in fact select that string and the following characters and copy that to excel?

OPTION 2
The other solution I tried was what I found on this forum from people with similar problems: Extract data from .PDF

Here word opens the PDF and copies the whole content into excel. This seems to take much longer since the process includes way more work, so I would prefer option 1 even though I do not know if possible at all.

For option 2 to work I would need to simply change the following section (which I am not capable of, and require your help ?‍♂️) :

Excel Formula:
        wdRange.WholeStory
        wdRange.Copy
        Set ShtPdfData = ThisWorkbook.Worksheets.Add
        ShtPdfData.PasteSpecial Format:="HTML", Link:=False, DisplayAsIcon:=False, NoHTMLFormatting:=True
        Application.Goto ShtPdfData.Range("A1")


OPTION 2 QUESTION B: Can someone tell me how to change this section to instead of selecting the whole story - find a string within the whole story and select the following few characters and then copy that into a new line in excel and not a new sheet for each PDF that has been scanned.



I hope this is all the information needed - it is the first time I post on here. I have gotten lots of help and tips from this platform already and am so impressed by you guys :)

Thanks very much for any help!
 

Excel Facts

What does custom number format of ;;; mean?
Three semi-colons will hide the value in the cell. Although most people use white font instead.
The Acrobat FindText method only returns a True/False to say whether the text exists in the PDF or not. You need the GetText method instead to extract text from a PDF, for example in this macro:


Note that Acrobat Pro, not the free Reader, must be installed to use the Acrobat API.
 
Upvote 0
Hi John_w and thanks for the reply!

Yes, Acrobat pro is installed and I am running the Adobe Acrobat 10.0 library. You are correct the method above is what I need, however I need to know how to select and copy the X characters following the findText string and copy those into a variable that can be fed back into a cell on the spread sheet. That is the part I that I am unable to get together.

To make the example a bit clearer there many multi page PDF of varying lengths that always have one certain string "specialAmount" somewhere (but each time on a different page or position). So this is the string I am looking for since it indicates my start position but then I need to copy the numbers behind that string and extract them.

Thanks very much for any further ideas!
 
Upvote 0
See if this macro works for you. It searches all PDFs matching the (wildcard) file spec in the specified folder, matchPDFs, for the searchString and puts the PDF file name in column A and the string which occurs after the found string in column B, or left blank if searchString wasn't found. The search looks for only the first occurrence of searchString and is case-insensitive.

VBA Code:
Option Explicit


Public Sub Search_PDFs_Extract_String()

    Dim matchPDFs As String
    Dim folder As String, file As String
    Dim destCell As Range, r As Long
    Dim searchString As String, nextString As String
  
    searchString = "specialAmount"   '<---- CHANGE AS NEEDED
    matchPDFs = "C:\path\to\*.pdf"   '<---- CHANGE AS NEEDED
  
    With ActiveSheet
        .Cells.Clear
        .Range("A1:B1").Value = Array("PDF File", "Next String")
        Set destCell = .Range("A2:B2")
        r = 0
    End With
  
    folder = Left(matchPDFs, InStrRev(matchPDFs, "\"))
    file = Dir(matchPDFs)
    While file <> vbNullString
        nextString = Search_PDF_Extract_Next_String(folder & file, searchString)
        destCell.Offset(r).Value = Array(file, nextString)
        r = r + 1
        file = Dir
    Wend
  
End Sub


Private Function Search_PDF_Extract_Next_String(PDFfullName As String, FindText As String) As String
  
    Dim AcroPDDoc      As CAcroPDDoc
    Dim AcroAVDoc      As CAcroAVDoc
    Dim AcroHiliteList As CAcroHiliteList
    Dim AcroPDPage     As CAcroPDPage
    Dim AcroTextSelect As CAcroPDTextSelect
    Dim p As Long, i As Long
    Dim foundString As Boolean
  
    Search_PDF_Extract_Next_String = ""
  
    Set AcroPDDoc = New AcroPDDoc
    If Not AcroPDDoc.Open(PDFfullName) Then
        MsgBox "Unable to open " & PDFfullName
        Exit Function
    End If
  
    Set AcroAVDoc = AcroPDDoc.OpenAVDoc("")
    AcroAVDoc.BringToFront
  
    'Read pages until FindText is found
  
    foundString = False
    Set AcroHiliteList = New AcroHiliteList
    p = 0
    While p < AcroPDDoc.GetNumPages And Not foundString
        If AcroHiliteList.Add(0, 9999) Then
            Set AcroPDPage = AcroPDDoc.AcquirePage(p)
            Set AcroTextSelect = AcroPDPage.CreatePageHilite(AcroHiliteList)
            If Not AcroTextSelect Is Nothing Then
                i = 0
                While i < AcroTextSelect.GetNumText And Not foundString
                    If InStr(1, AcroTextSelect.GetText(i), FindText, vbTextCompare) Then foundString = True
                    i = i + 1
                Wend
                If foundString And i < AcroTextSelect.GetNumText Then
                    Search_PDF_Extract_Next_String = AcroTextSelect.GetText(i)
                End If
            End If
        End If
        p = p + 1
    Wend
  
    AcroPDDoc.Close
   
End Function
 
Upvote 0
Solution
Thank you so much @John_w! (y)(y)(y)

It does exactly what I need, you are genius! ?

I realized that I need to extract multiple values so I modified the code as seen below - maybe that can help others who are as confused as me when it comes to these things:


Excel Formula:
With ActiveSheet
        .Cells.Clear
        .Range("A1:E1").Value = Array("PDF File", "Value_0", "Value_1", "Value_2", "Value_3") ' Table Header"
        Set destCell = .Range("A2:E2")
        r = 0
End With

folder = Left(matchPDFs, InStrRev(matchPDFs, "\"))
    file = Dir(matchPDFs)
    While file <> vbNullString
        nextString = Search_PDF_Extract_Next_String(folder & file, searchString1)
        Results() = Split(nextString)
        destCell.Offset(r).Value = Array(file, Results(0), Results(1), Results(2), Results(3))
        r = r + 1
        file = Dir
   Wend

And in the function:

Excel Formula:
If foundString And i < AcroTextSelect.GetNumText Then
                    Search_PDF_Extract_Fuel = AcroTextSelect.GetText(i) & AcroTextSelect.GetText(i + 2) & AcroTextSelect.GetText(i + 3) & AcroTextSelect.GetText(i + 4)
End if
 
Upvote 0
That's OK if the search string is found and it isn't within the last 5 values in the PDF. However the code would crash if AcroTextSelect.GetText(i + 2) or AcroTextSelect.GetText(i + 3) or AcroTextSelect.GetText(i + 4) goes beyond the last value in the PDF.
 
Upvote 0
I also would like to thank you so much, this saved me so much work already.

Could you please let me know how I can not only extra one word after the search string but up until a "stop word". I.e. I'd need to have the titel of publications.
I'm using SearchString "Title": and like to get everything until "Date" (which is the next section in my documents).

Thanks for helping a toal beginner!
 
Upvote 0
Could you please let me know how I can not only extra one word after the search string but up until a "stop word". I.e. I'd need to have the titel of publications.
I'm using SearchString "Title": and like to get everything until "Date" (which is the next section in my documents).
Your request is slightly different to the OP's, with different inputs and outputs to and from the search procedure, so I think it's best if you start a new thread with a full description of your requirement and I will try to answer there.
 
Upvote 0

Forum statistics

Threads
1,223,885
Messages
6,175,184
Members
452,615
Latest member
bogeys2birdies

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top