Extracting out Title in PDF -> Word conversion

chickyguy

New Member · Joined Mar 27, 2024 · Messages: 17
Office Version: 2019 · Platform: Windows
Hi,

Using the solution from this, I was able to open the PDF document in a Word application.

While I was able to loop through the tables using wdApp.tables, I realised that some of the information I need is in the title of the table (if it can be called a title).

A sample looks like this:
[image: 1712168728245.png]


Currently I do get the contact number and the email, which I need, but the tables don't contain the personName for some reason. I thought it could be a table on its own, but after looping through all the tables, I couldn't find any that has PersonName.

Is there a way to just extract all the text from the wdApp without using Tables? I don't want the image, which I think .Range includes. I need this because I have to find certain keyword values in the text body.

Appreciate any advice.
 

Hi *chickyguy. Maybe this will help you determine if the person name is part of the table. It loops through each table and outputs each cell... may take some time. You will need to adjust your file path to suit. HTH. Dave
VBA Code:
Sub test()
    Dim WordApp As Object, WordDoc As Object
    Dim TableTot As Long, TableStart As Long
    Dim irow As Long, icol As Long
    'create Word app (reuse a running instance if there is one)
    On Error Resume Next
    Set WordApp = GetObject(, "Word.Application")
    If Err.Number <> 0 Then
        Set WordApp = CreateObject("Word.Application")
    End If
    On Error GoTo 0
    'show Word doc and leave open
    WordApp.Visible = True
    '********adjust file path to suit
    Set WordDoc = WordApp.Documents.Open("C:\Testfolder\testdoc.docx")
    With WordDoc
        TableTot = .Tables.Count
        'loop tables
        For TableStart = 1 To TableTot
            'output tbl by row contents
            With .Tables(TableStart)
                For irow = 1 To .Rows.Count 'table rows
                    For icol = 1 To .Columns.Count 'table columns
                        'Clean strips the non-printing end-of-cell characters
                        MsgBox "Text in table: " & TableStart & " Row: " _
                            & irow & " Column: " & icol & " is: " _
                            & Application.WorksheetFunction.Clean(.Cell(irow, icol).Range.Text)
                    Next icol
                Next irow
            End With
        Next TableStart
    End With
End Sub
 
Thanks for the code. Ran it and PERSON NAME really isn't showing up. One thing I noticed is that in the PDF, if the text preceding the table is in capital letters, it's not captured in .tables.
But if it is in Pascal Casing, it strangely is.

PERSON X
....

vs

Person Y
....

So far I'm getting Person Y, but not Person X. Is there any possible workaround for this issue? Is it just due to the way Word reads the PDF file?
 
Ah, disregard my previous reply. It doesn't work for Person Y either; I misread the output.

The issue still persists. doc.tables contains neither BuyerInformation nor PersonName, only the contact details.
 
To add some additional information if it helps:

I did the following to print out the PDF:

VBA Code:
Set wdRange = wdPdfDoc.Range(1)
strText = wdRange.Text
MsgBox(strText)

I get the following in the MsgBox:

BUYER INFORMATION
/
Buyer Name companyName
/
CONTACT PERSON
personName
[] contact no
[] ...
[] email
[] ....

I'm assuming the [] symbol means that the information sits in a table. In that case, is crawling the text the only way I can extract personName?


I'm currently doing this:
Code:
For Each Table In doc.Tables
    For Each Row In Table.Rows
        ' note: ConvertToText turns the row into plain text in the document itself
        ConvertedRow = Row.ConvertToText
        If InStr(1, ConvertedRow, Keyword) > 0 Then
            SplitText = Split(ConvertedRow, "-")
            ' code to insert data into cell
        End If
    Next Row
Next Table

This enables me to render the table data into text that I can pull from.
 
Hi again *chickyguy. The square symbols are end-of-cell markers, Chr(7); Word ends the text of every table cell with one. Not sure if this will help, but you can sort the paragraphs out from the tables and get their info with the following code. Again, it might take some time to test. Dave
VBA Code:
Sub test()
    Dim WordApp As Object, WordDoc As Object
    Dim Cnt As Long, ParaText As String
    'create Word app (reuse a running instance if there is one)
    On Error Resume Next
    Set WordApp = GetObject(, "Word.Application")
    If Err.Number <> 0 Then
        Set WordApp = CreateObject("Word.Application")
    End If
    On Error GoTo 0
    'show Word doc and leave open
    WordApp.Visible = True
    '********adjust file path to suit
    Set WordDoc = WordApp.Documents.Open("C:\Testfolder\testdoc.docx")

    For Cnt = 1 To WordDoc.Paragraphs.Count
        ParaText = WordDoc.Paragraphs(Cnt).Range.Text
        'paragraph not in table (no end-of-cell marker) and not blank
        If Right(ParaText, 1) <> Chr(7) And ParaText <> Chr(13) Then
            MsgBox ParaText
        End If
    Next Cnt
End Sub
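
If the goal is a specific value such as the person name that follows the CONTACT PERSON heading in the MsgBox output earlier in the thread, the same paragraph loop could be wrapped in a small lookup function. This is only a sketch and not something posted in the thread: the function name TextAfterHeading and its parameters are made up for illustration, and it assumes the value sits in the first non-blank paragraph after the heading.

```vba
'Hypothetical helper: returns the first non-blank paragraph after the
'paragraph whose text equals Heading (case-insensitive), or "" if not found.
Function TextAfterHeading(WordDoc As Object, Heading As String) As String
    Dim Cnt As Long, ParaText As String, Found As Boolean
    For Cnt = 1 To WordDoc.Paragraphs.Count
        ParaText = WordDoc.Paragraphs(Cnt).Range.Text
        'drop paragraph marks and end-of-cell markers before comparing
        ParaText = Trim(Replace(Replace(ParaText, Chr(13), ""), Chr(7), ""))
        'once the heading has been seen, the next non-blank paragraph is the value
        If Found And Len(ParaText) > 0 Then
            TextAfterHeading = ParaText
            Exit Function
        End If
        If StrComp(ParaText, Heading, vbTextCompare) = 0 Then Found = True
    Next Cnt
End Function
```

It could be called as MsgBox TextAfterHeading(WordApp.ActiveDocument, "CONTACT PERSON"). If the converted PDF puts something else (for example an inline image placeholder) between the heading and the name, the function would return that instead, so the result is worth checking against the document.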
 
Solution
Ah, I'm able to pull out the PersonName with Paragraphs!! Thank you.

Currently I'm looping through the paragraphs to get the "title"-like data I need, and then through the tables to get the whole cell data. Somehow Paragraphs splits the content of the table cells line by line.
It's not the most optimal method, since it's like looping through the entire content twice, but if I can finish this I can finally get it done before the weekend...

But thanks a big bunch NdNoviceHlp!
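
For anyone who wants to avoid the second loop, one possible variation is to walk the paragraphs once and ask Word whether each one lies inside a table, using Range.Information with the wdWithInTable constant (value 12, declared locally here because the code is late bound). This is a sketch under assumptions, not what was used in the thread: it assumes Word is already running with the converted PDF as the active document, and the Sub name is hypothetical.

```vba
Sub SinglePassSketch()
    Const wdWithInTable As Long = 12   'WdInformation value, declared for late binding
    Dim WordApp As Object, WordDoc As Object, Para As Object
    Dim ParaText As String
    Set WordApp = GetObject(, "Word.Application")   'assumes Word is already running
    Set WordDoc = WordApp.ActiveDocument            'assumes the converted PDF is active
    For Each Para In WordDoc.Paragraphs
        'strip the paragraph mark and any end-of-cell marker
        ParaText = Replace(Replace(Para.Range.Text, Chr(13), ""), Chr(7), "")
        If Len(Trim(ParaText)) > 0 Then
            If Para.Range.Information(wdWithInTable) Then
                Debug.Print "In table: " & ParaText
            Else
                Debug.Print "Outside table: " & ParaText
            End If
        End If
    Next Para
End Sub
```

Each paragraph inside a cell is still reported separately, so multi-line cells would need to be joined again if the whole cell text is required; for that, the table loop may remain the simpler route.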
 