PQ: Searching for a phrase in unstructured data

pvr928

Well-known Member
Joined
Oct 21, 2002
Messages
790
Hi

This is my first post in the PQ section! I've been using Excel and VBA for a long time but Power Pivot/PQ for only a short time. From what I've seen, read and experienced - it is awesome!

I've got multiple Excel files containing unstructured data (converted from PDF files). Each of these files contain similar phrases.

I'm wondering if PQ can search for a particular phrase and then extract data to the right of it, ie:

"Number of days in quarter: 42"

Where the phrase is "Number of days in quarter" and what I want to extract is 42.

I can write some vba to search and extract this data into a new sheet and table, and then use PQ to then extract the data from this new table, but if I can cut this step out it would save a lot of time (I've got files in the multiple 100,000).

I'm not looking for a complete solution, just whether it is possible and then some tips on how to get there.

Thanks so much for your help.

Cheers

Pvr928
 

Excel Facts

Add Bullets to Range
Select range. Press Ctrl+1. On Number tab, choose Custom. Type Alt+7 then space then @ sign (using 7 on numeric keypad)
I don't think you can use PQ over unstructured data. It is designed to work with lists and tables.

There is some software that can convert from PDF to Excel, but I have never tried it Able2Extract PDF Converter & Creator: PDF to Excel, Word, PowerPoint & more

There is a free trial of the software, so I would be interested to hear if this helps. They have been trying to get me to blog about it for a while, but I have never got to it.
 
Last edited:
Upvote 0
If u already converted the data from PDF files, then its easy to deal with PQ.
the steps is:
1- get data.
2- if u have all in the same excel file u can use excel file to get it from one place else u can select from folder
3- add custom column
4- write "Excel.workbook(content)
5- expand your data
6- filter it as u need
and enjoy
 
Upvote 0
So if you have data in PQ in an unstructured manner, you can proceed like this:

1) merge all columns into one so that you have to operate on one column only
2) Filter column using Text.Contains
3) Text.PositionOf will return the starting position of your value
4) Text.Length will let you go to the end of your string
5) Split column by that position +1 (incl. blank space after your search string)
6) This will return a new column starting with your number value to extract (42)
7) Split that column on the first occurrence of blank

hth, otherwise please ask agein
 
Upvote 0
Imke

Given the data is unstructured, how to you hand the situation where the same word is spread all over the document in different positions?
 
Upvote 0
Hi Matt,
thanks, very good point - this didn't cross my mind.

Actually, there are some other aspects as well that could happen that haven't been taken into account on my first suggestion.
But this one should work:

Code:
let
    Source = Excel.CurrentWorkbook(){[Name="Tabelle1"]}[Content],
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Unstructured", type text}}),
    IdentificationPlaceholder = Table.ReplaceValue(#"Changed Type","Number of days in quarter:","Number of days in quarter:*****",Replacer.ReplaceText,{"Unstructured"}),
    SplitIntoRowsWithOneEach = Table.AddColumn(IdentificationPlaceholder, "Custom", each Text.Split([Unstructured], "Number of days in quarter:")),
    Expand = Table.ExpandListColumn(SplitIntoRowsWithOneEach, "Custom"),
    FilterOnlyMatches = Table.SelectRows(Expand, each Text.Contains([Custom], "*****")),
    DeletePlaceholder = Table.ReplaceValue(FilterOnlyMatches,"*****","",Replacer.ReplaceText,{"Custom"}),
    #"Trimmed Text" = Table.TransformColumns(DeletePlaceholder,{{"Custom", Text.Trim}}),
    IdentifyEndOfFirstNumber = Table.AddColumn(#"Trimmed Text", "SplitPosition", each Text.PositionOfAny([Custom], List.Transform({0..47, 58..300}, each Character.FromNumber(_)))),
    ExtractNumberRange = Table.AddColumn(IdentifyEndOfFirstNumber, "Result", each if [SplitPosition]<0 then [Custom] else Text.Range([Custom],0,[SplitPosition])),
    #"Removed Other Columns" = Table.SelectColumns(ExtractNumberRange,{"Result"}),
    #"Changed Type1" = Table.TransformColumnTypes(#"Removed Other Columns",{{"Result", type number}})
in
    #"Changed Type1"

LinkToFile
 
Upvote 0
Thanks so much Matt, ahmedismailfourtex and ImkeF.

ImkeF - that is an impressive piece of code, and thanks so much for providing the file. You guys rock!

I've managed to modify your code to suit my data source, but I don't quite understand how you then run the M code to get the result. I feel like I'm missing something.

Cheers

pvr928
 
Upvote 0
Not sure if I understand your request correctly.
So you managed to modify the code in Power Query already? Then you "run" it by loading the query to the worksheet: Home - Close&Load To... - Table.
If that option is greyed out already (as it will always be once you've closed the query for the first time), then you load it from your worksheet instead: In the workbook queries-pane choose the query, right-click with your mouse and then load to...
 
Upvote 0
Hi ImkeF

Thanks for your reply - sorry life got in the way since my last post!

I think I understand the process a bit better now, but I've not been able to get it to work as desired.

I think it will be easier to search and collate the data I'm looking for via vba into a table, and then have PQ extract the data from there.

Another step I know, but at least I know the vba will work and I'll get more familiar with PQ with an easier task.

BTW I've got all of my files saved as binary - but PQ seems to error when I try to connect with these, but not with .xlsx. Is this the case? If so, I'll have to modify all of my files to .xlsb.

Thanks for your help.

Cheers

pvr928
 
Upvote 0
No problem, it's a new POC in my M-toolbox now :-)

Binary should definitely work. How did you try to open it?
Try: Text.FromBinary(File.Contents(Filepath))
 
Upvote 0

Forum statistics

Threads
1,224,157
Messages
6,176,744
Members
452,741
Latest member
Muhammad Nasir Mahmood

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top