Read a PDF file line by line as text from VBA

drom · Nov 6, 2022

Hi and thanks in advance!!
If I use the approach from the following link:
How To Count The Page Numbers Of Pdf Files In Excel?
I get the desired page count, but is there a way to search every PDF page for a keyword (e.g. Country) and get the different countries in a PDF file?

I do not know which version of Acrobat my workmates are using (and this is where my problems start).
When I open the PDF file in Word to get the countries:

- It works fine, but it takes too long because the PDF files are huge

When I use:

VBA Code:

Const Form_FileName As String = "C:\WordWideEmployees.pdf"   'Could contain over 350 pages

'Requires a reference to "Microsoft Scripting Runtime" (Tools > References)
Sub AAA()
  Dim fso As New FileSystemObject
  Dim wStream As TextStream:        Set wStream = fso.OpenTextFile(Form_FileName, ForReading, False)
  Dim wLine As String
  Dim wKey As String:               wKey = "Country"   'Country is located on every page, so once found I can get the countries like: Spain, France...
  Dim aCountry_List() As Variant
  Dim xArrayIndex As Long
  Dim xRow As Long

  Do While Not wStream.AtEndOfStream
    wLine = wStream.ReadLine
    If wLine <> "" Then
      'Debug.Print wLine
      If InStr(wLine, wKey) > 0 Then
        xArrayIndex = xArrayIndex + 1:        ReDim Preserve aCountry_List(1 To xArrayIndex)
                                              aCountry_List(xArrayIndex) = Mid(wLine, 7, 50)     'Not the real position, but fine for this example
      End If
    End If
  Loop
  wStream.Close

  If xArrayIndex = 0 Then Exit Sub   'Keyword never found, so the array was never dimensioned

  'Step -1 is needed to loop from the upper bound down to the lower bound
  For xRow = UBound(aCountry_List) To LBound(aCountry_List) Step -1
    Debug.Print xRow, aCountry_List(xRow)
  Next xRow

End Sub

For wLine I get very strange strings like:

/Length1 123180
/Type /Stream
>>
stream
xœì½ xEþ7þêžÉ\Éœ™#“¹2™É1™Ü áä’+"DA@ð>@]Eï\o
—†€‚ŠëÉzëzìŠ.¢«â‰¬É¼Ÿê™ uwÿÿû>Ïû<ïvQßª®ª®®þÞßêž@Œˆl ?????????????????????????????????????????????????????????????????????????p????????????????-????????????????????????????????????????????????????????????????????-??????????????????????O??????????????????????????????????????????????????????????????????????????????????????????????????°???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????Z??????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????‰????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????
šA¯èãl›C€7ŽR˜Ka)©‹{òàÝ3¨?ð¼³W~Wlgí®Rø

and I can never get the countries I want.
My PDFs are not protected, and they are not scanned photos.

Any help?

Dan_W · Nov 6, 2022

I've been working on a project that reads the metadata of PDF files (such as the page count, in much the same way as the code you've used) and automates text extraction from PDFs, but the problem is that it uses Word to do this. The difficulty is that, even assuming a PDF file has an accessible text layer (i.e., it hasn't simply been scanned, which would make it effectively a collection of pictures!), that text layer isn't readily accessible just by reading the file. As I understand it, the PDF file format compresses the text and other contents in the document, and that would explain the strings of data you're getting.
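If you want to check this for your own file, here is a minimal sketch (my own illustration, untested against your PDF) that reads the raw bytes and counts how often the filter name /FlateDecode appears. That name marks a stream compressed with the deflate algorithm, so a high count would suggest that most of the page content cannot be read as plain text. The Sub name is made up, and the path is simply the one from your post.

```vba
Option Explicit

' Sketch only: counts references to the /FlateDecode filter in a PDF's raw bytes.
Sub CountCompressedStreams()
    Const PDF_FILE As String = "C:\WordWideEmployees.pdf"   ' path taken from the question
    Dim f As Integer, bytes() As Byte, raw As String
    Dim pos As Long, n As Long

    If Dir(PDF_FILE) = "" Then
        Debug.Print "File not found: " & PDF_FILE
        Exit Sub
    End If

    ' Load the whole file into a byte array
    f = FreeFile
    Open PDF_FILE For Binary Access Read As #f
    If LOF(f) = 0 Then Close #f: Exit Sub
    ReDim bytes(0 To LOF(f) - 1)
    Get #f, , bytes
    Close #f

    ' Turn the bytes into a VBA string so InStr can search it
    raw = StrConv(bytes, vbUnicode)

    ' Count every occurrence of the filter name
    pos = InStr(1, raw, "/FlateDecode", vbBinaryCompare)
    Do While pos > 0
        n = n + 1
        pos = InStr(pos + 1, raw, "/FlateDecode", vbBinaryCompare)
    Loop

    Debug.Print n & " references to /FlateDecode found"
End Sub
```

This only confirms that the content is compressed; it does not decompress anything, which is why a separate extraction step is still needed.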

So you need to extract the text from the PDF file somehow, and that's usually done through Adobe, a third-party provider, an online API service, Word (the free and easy approach), or command line tools like the Xpdf tools or PDFtk. I've seen PDFtk mentioned on this forum (link) but haven't used it personally. I do use the Xpdf tools: a free collection of a handful of small executable files that each do one thing. pdftotext.exe, for example, extracts the text from a PDF file, and pdftopng.exe converts the pages to images. You can use VBA to automate the process with a command line tool. That said, if your files are 350 pages or so, I don't know that the process will necessarily be especially quick.
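As a rough sketch of the command line route (untested against your files, so treat it as a starting point), the code below runs the Xpdf text extractor from VBA, waits for it to finish, and then reads the resulting text file line by line. The executable is named pdftotext.exe in the Xpdf distributions I have seen. The install folder C:\Tools\xpdf, the temporary file name, and the Sub name are my assumptions, as is the idea that the country name follows the keyword "Country" on the same line; adjust all of those to your setup. It uses late binding, so no extra references are required.

```vba
Option Explicit

' Hypothetical location of the Xpdf executable; change to your install folder.
Const PDFTOTEXT_EXE As String = "C:\Tools\xpdf\pdftotext.exe"
Const PDF_FILE As String = "C:\WordWideEmployees.pdf"   ' path taken from the question

Sub ListCountriesViaPdfToText()
    Dim sh As Object, fso As Object, ts As Object, seen As Object
    Dim txtFile As String, cmd As String, exitCode As Long
    Dim lineText As String, country As String, pos As Long
    Dim k As Variant

    txtFile = Environ$("TEMP") & "\pdf_extract.txt"

    ' Build: "pdftotext.exe" -layout "in.pdf" "out.txt"
    ' -layout asks pdftotext to keep the physical layout of the page
    cmd = """" & PDFTOTEXT_EXE & """ -layout """ & PDF_FILE & """ """ & txtFile & """"

    ' Window style 0 = hidden; True = wait until the tool has finished
    Set sh = CreateObject("WScript.Shell")
    exitCode = sh.Run(cmd, 0, True)
    If exitCode <> 0 Then
        Debug.Print "pdftotext failed, exit code " & exitCode
        Exit Sub
    End If

    ' Read the extracted text line by line
    Set fso = CreateObject("Scripting.FileSystemObject")
    Set ts = fso.OpenTextFile(txtFile, 1, False)   ' 1 = ForReading
    Set seen = CreateObject("Scripting.Dictionary")

    Do While Not ts.AtEndOfStream
        lineText = ts.ReadLine
        pos = InStr(1, lineText, "Country", vbTextCompare)
        If pos > 0 Then
            ' Assumption: the value follows the keyword, optionally after a colon
            country = Trim$(Mid$(lineText, pos + Len("Country")))
            If Left$(country, 1) = ":" Then country = Trim$(Mid$(country, 2))
            If Len(country) > 0 And Not seen.Exists(country) Then seen.Add country, True
        End If
    Loop
    ts.Close

    ' Print each distinct country once
    For Each k In seen.Keys
        Debug.Print k
    Next k
End Sub
```

The Dictionary is there only to drop duplicates, since the keyword appears on every page. Because the heavy lifting is done by one external call on the whole file, this should avoid opening the document in Word at all, though I can't promise how long it takes on 350 pages.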

I should also add that Power Query can read PDF Files, but that tends to be more for PDFs of tables rather than text files of the size you're contemplating. I could be entirely wrong on that point, but something worth investigating?
