Extracting out Title in PDF -> Word conversion

chickyguy · Apr 3, 2024

Hi,

Using the solution from this, I was able to open the PDF document in Word.

While I was able to loop through the tables using wdApp.tables, I realised that some of the information I need is the title of the table (if it can be called a title?).

A sample looks like this:

Currently I do get the contact number and the email, which I need, but the tables don't include the personName for some reason. I thought it could be a table on its own, but after looping through all the tables, I couldn't find any that contains PersonName.

Is there a way to just extract all the text from the wdApp without using Tables? I don't want the images, which I think .Range includes. I need this to find certain keyword values in the text body.

Appreciate any advice.

NdNoviceHlp · Apr 3, 2024

Hi *chickyguy. Maybe this will help you determine if the person name is part of the table. It loops through each table and outputs each cell... may take some time. You will need to adjust your file path to suit. HTH. Dave

VBA Code:

Sub test()
Dim WordApp As Object, TableTot As Integer, TableStart As Integer
Dim irow As Integer, icol As Integer
'create Word app
On Error Resume Next
Set WordApp = GetObject(, "Word.Application")
If Err.Number <> 0 Then
Set WordApp = CreateObject("Word.Application")
End If
On Error GoTo 0
'show Word doc and leave open
WordApp.Visible = True
'********adjust file path to suit
WordApp.Documents.Open ("C:\Testfolder\testdoc.docx")
With WordApp.Activedocument
    TableTot = .tables.Count
'loop tables
    For TableStart = 1 To TableTot
        'output tbl by row contents
        With .tables(TableStart)
            For irow = 1 To .Rows.Count ' table rows
                For icol = 1 To .Columns.Count 'table columns
                MsgBox "Text in table: " & TableStart & " Row: " _
                      & irow & " Column: " & icol & " is: " _
                      & Application.WorksheetFunction.Clean(.cell(irow, icol).Range.Text)
                Next icol
            Next irow
        End With
    Next TableStart
End With
End Sub

chickyguy · Apr 4, 2024

Thanks for the code. I ran it and PERSON NAME really isn't showing up. One thing I noticed is that in the PDF, if the text preceding the table is in capital letters, it's not captured in .tables.
But if it is in Pascal case, strangely, it is.

PERSON X
....

vs

Person Y
....

So far I'm getting Person Y, but not Person X. Is there any possible workaround for this issue? Is it just due to the way Word reads the PDF file?

chickyguy · Apr 4, 2024

Ah, disregard the previous reply. It doesn't work for Person Y either; I misread the output.

The issue still persists. doc.tables has neither BuyerInformation nor PersonName in it, only the contact details.

chickyguy · Apr 4, 2024

To add some additional information if it helps:

I did the following to print out the text of the PDF:

VBA Code:

Set wdRange = wdPdfDoc.Range(1)
strText = wdRange.Text
MsgBox(strText)

I get the following in the MsgBox:

BUYER INFROMATION
/
Buyer Name companyName
/
CONTACT PERSON
personName
[] contact no
[] ...
[] email
[] ....

I'm assuming the [] symbol means that the information is present as a table. In this case, is crawling the text the only way I can extract personName?

I'm currently doing this:

Code:

For Each Table In doc.Tables
    For Each Row In Table.Rows
        ' ConvertToText turns the row into plain text inside the document
        ConvertedRow = Row.ConvertToText
        If InStr(1, ConvertedRow, Keyword) > 0 Then
            SplitText = Split(ConvertedRow, "-")
            ' code to insert data into cell
        End If
    Next Row
Next Table

This enables me to render the table data into something that I can pull from.

NdNoviceHlp · Apr 4, 2024

Hi again *chickyguy. The square thingees are pilcrows CHR(7). Not sure if this will help, but you can sort the paragraphs out from the tables and get their info with the following code. Again, it might take some time to test. Dave

VBA Code:

Sub test()
Dim WordApp As Object, Cnt As Long
Dim MyRange As Object, LastParaLoc As Long
'create Word app
On Error Resume Next
Set WordApp = GetObject(, "Word.Application")
If Err.Number <> 0 Then
    Set WordApp = CreateObject("Word.Application")
End If
On Error GoTo 0
'show Word doc and leave open
WordApp.Visible = True
'********adjust file path to suit
WordApp.Documents.Open ("C:\Testfolder\testdoc.docx")

'Long avoids overflow in documents with more than 32,767 paragraphs
LastParaLoc = WordApp.ActiveDocument.Paragraphs.Count
For Cnt = 1 To LastParaLoc
    Set MyRange = WordApp.ActiveDocument.Paragraphs(Cnt).Range
    'paragraph not in table and not blank
    If Right(MyRange.Text, 1) <> Chr(7) And MyRange.Text <> Chr(13) Then
        MsgBox MyRange.Text
    End If
Next Cnt
End Sub

NdNoviceHlp · Apr 4, 2024

Whoops... I think the pilcrow is actually CHR(13). Dave
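
For reference, the text of a Word table cell ends with Chr(13) followed by Chr(7) (a paragraph mark plus the end-of-cell marker), which is probably what shows up as squares in the MsgBox. Below is a minimal sketch of a helper that strips those trailing characters before the text is compared or written to a worksheet. The name StripCellMarkers is made up for illustration and is not part of Word or Excel.

```vba
' Hypothetical helper: removes the trailing end-of-cell marker from cell text.
Function StripCellMarkers(ByVal cellText As String) As String
    Dim result As String
    result = cellText
    ' Drop the end-of-cell marker Chr(7) if present
    If Right$(result, 1) = Chr$(7) Then result = Left$(result, Len(result) - 1)
    ' Drop the paragraph mark Chr(13) that precedes it if present
    If Right$(result, 1) = Chr$(13) Then result = Left$(result, Len(result) - 1)
    StripCellMarkers = result
End Function
```

It could be used as StripCellMarkers(.cell(irow, icol).Range.Text) in place of the WorksheetFunction.Clean call, assuming only the trailing markers need to go.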

chickyguy · Apr 4, 2024

Ah, I'm able to pull out the PersonName with Paragraphs!! Thank you.

Currently I'm looping through the paragraphs to get the "title"-like fields I need, and then through the tables to get the whole cell data. Somehow Paragraphs splits the contents of a table cell line by line.
It's not the most optimal method, since it's like looping through the entire content twice, but if I can finish this I can finally get it done before the weekend...

But thanks a big bunch NdNoviceHlp!
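
One possible way to avoid the double loop is to walk the tables once and, for each table, read whatever text lies between the end of the previous table and the start of the current one. This is only a sketch under assumptions: Word is already running with the converted PDF as the active document, and the "title" lines sit directly before their table. The Sub name and variable names are made up for illustration.

```vba
Sub TitlesBeforeTables()
    Dim wdApp As Object, doc As Object, tbl As Object
    Dim prevEnd As Long, titleText As String

    ' Assumption: Word is open and the converted PDF is the active document
    Set wdApp = GetObject(, "Word.Application")
    Set doc = wdApp.ActiveDocument

    prevEnd = 0
    For Each tbl In doc.Tables
        ' Text between the previous table (or document start) and this table
        titleText = doc.Range(prevEnd, tbl.Range.Start).Text
        Debug.Print "Text before table: " & Replace(titleText, Chr(13), " | ")

        ' ... read the cells of tbl here, as in the table loop earlier ...

        prevEnd = tbl.Range.End
    Next tbl
End Sub
```

Each table is then visited together with its preceding text, so the keyword search and the cell extraction can happen in the same pass.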

NdNoviceHlp · Apr 5, 2024

You are welcome. Thanks for posting your outcome. Have a nice weekend. Dave
