Help me Convert, Merge and Transpose a pdf table into a singular row, and then repeat that step ad nauseam to create a database

Clonk92

New Member
Joined
Jun 20, 2024
Messages
7
Office Version
  1. 365
Context
I have an ever growing collection of 200+ paged pdf documents that contain bank checks. The checks themselves will not scan into Power query, however their header's will.
My goal is to extract the header of each check statement and then transform that information into a row that would create an evergreen database since new check statements come in all the time.

My Problem
The pdfs contain roughly 200+ pages, So unless I split the PDFs up by page (which takes way too much time to be reasonable) I'm unable to use a sample query for each table that gets extracted from the data.
Below I've provided samples of what the raw data and the steps I'm stuck on within power query looks like.

Raw Data
I've created a sample via MS Paint to give an idea of what the tables look like from the bank. (Dummy data to keep things confidential). Fortunately the header of each PDF is always the same and formatted like this.
Unfortunately, the way the data is structured leads to some creative use of transpositions which... on a single basis is fine. But without the use of a sample query... not fine. I'll explain why below.
cheques.png



How my Data shows up within Power Query
After creating a custom column to extract pdf.Tables from the source folder and then filtering out all pages so I am left with only tables... my query looks like the screenshot below.
1718975734096.png

My original Approach

Normally I'd apply transformations to a sample query but since each PDF contains 200+ pages of these tables, that option isn't available. While it is possible to split the PDF's into singular pages, the sheer volume of them would make that process not feasible. I was adding a column and then using the record.field function to extract the information I needed from a column and move it into the next column, then once I've effectively converted the table to a row, the sample query would apply to each individual PDF... however with that no longer being possible... I'm at a complete loss for how to tackle this thing.


I'm currently working through some Power Query Coursees I've found through YouTube, and skimming many threads, but nothing quite answers my problem, I feel the addition of an index and groupby or unpivot might be the answer here but I am unsure.

Worth noting, is that the data always repeats every 4 rows, I'm not sure how to make use of this, but it feels like there'd be a function or tool that could make use of that.

Thank you so much in advance for taking the time to read and respond to this question. I look forward to getting this done and learning how to make use of the functions used to solve this in the future.
I'm really loving power query but I am still quite new to it.


Thanks again.

 

Excel Facts

How to fill five years of quarters?
Type 1Q-2023 in a cell. Grab the fill handle and drag down or right. After 4Q-2023, Excel will jump to 1Q-2024. Dash can be any character.
Is this what you are trying to do?

Power Query:
let
    Source = Excel.CurrentWorkbook(){[Name="Table5"]}[Content],
    tbl = Table.AddIndexColumn(Source, "Index"),
    tbl1 = Table.TransformColumns(tbl, {"Index", each Number.RoundDown(_/4)}),
    fieldnames = {"Name", "Request #", "Request Desc", "Transit #", "Account #", "Sequence #", "Amount", "Date"},
    tbl2 = Table.Group(tbl1, {"Index"}, {{"All", each 
                let 
                    c2 = [Column2],
                    c4 = [Column4],
                    data = {[Name]{0}, c2{0},  c4{0}, c2{1}, c4{1}, c2{2},  c4{2}, c2{3}}
                in 
                    Record.FromList(data, fieldnames)
        }}),
    tbl3 = Table.ExpandRecordColumn(tbl2, "All", fieldnames),
    tbl4 = Table.RemoveColumns(tbl3,{"Index"}),
    Result = Table.TransformColumnTypes(tbl4,{{"Name", type text}, {"Transit #", Int64.Type}, {"Account #", Int64.Type}, {"Sequence #", Int64.Type}, {"Amount", Int64.Type}, {"Date", type date}})
in
    Result

Book1
ABCDEFGHI
1Table5
2NameColumn1Column2Column3Column4
3Source NameRequest#Request Desc
4Source NameTransit #166350Account #5586506
5Source NameSequence#822221Amount684
6Source NameDate12/17/2023
7Source NameRequest#Request Desc
8Source NameTransit #334511Account #9278209
9Source NameSequence#679516Amount858
10Source NameDate5/23/2024
11Source NameRequest#Request Desc
12Source NameTransit #577619Account #8757999
13Source NameSequence#387208Amount629
14Source NameDate2/5/2024
15Source NameRequest#Request Desc
16Source NameTransit #978879Account #1867437
17Source NameSequence#972790Amount312
18Source NameDate3/5/2023
19
20Query Output
21NameRequest #Request DescTransit #Account #Sequence #AmountDate
22Source Name166350558650682222168412/17/2023
23Source Name33451192782096795168585/23/2024
24Source Name57761987579993872086292/5/2024
25Source Name97887918674379727903123/5/2023
26
Sheet6
 
Upvote 1
Solution
It worked!

I'm still a little rough around the edges when it comes to M Code and the program as a whole but im so stoked to learn more about it now, It was a little rough to get this working with my source data but I was able to figure it out!

I'll be studying this formula for a while to figure out how it's all coming together but this is astounding! Thanks so much!!
 
Upvote 0
Is this what you are trying to do?

Power Query:
let
    Source = Excel.CurrentWorkbook(){[Name="Table5"]}[Content],
    tbl = Table.AddIndexColumn(Source, "Index"),
    tbl1 = Table.TransformColumns(tbl, {"Index", each Number.RoundDown(_/4)}),
    fieldnames = {"Name", "Request #", "Request Desc", "Transit #", "Account #", "Sequence #", "Amount", "Date"},
    tbl2 = Table.Group(tbl1, {"Index"}, {{"All", each
                let
                    c2 = [Column2],
                    c4 = [Column4],
                    data = {[Name]{0}, c2{0},  c4{0}, c2{1}, c4{1}, c2{2},  c4{2}, c2{3}}
                in
                    Record.FromList(data, fieldnames)
        }}),
    tbl3 = Table.ExpandRecordColumn(tbl2, "All", fieldnames),
    tbl4 = Table.RemoveColumns(tbl3,{"Index"}),
    Result = Table.TransformColumnTypes(tbl4,{{"Name", type text}, {"Transit #", Int64.Type}, {"Account #", Int64.Type}, {"Sequence #", Int64.Type}, {"Amount", Int64.Type}, {"Date", type date}})
in
    Result

Book1
ABCDEFGHI
1Table5
2NameColumn1Column2Column3Column4
3Source NameRequest#Request Desc
4Source NameTransit #166350Account #5586506
5Source NameSequence#822221Amount684
6Source NameDate12/17/2023
7Source NameRequest#Request Desc
8Source NameTransit #334511Account #9278209
9Source NameSequence#679516Amount858
10Source NameDate5/23/2024
11Source NameRequest#Request Desc
12Source NameTransit #577619Account #8757999
13Source NameSequence#387208Amount629
14Source NameDate2/5/2024
15Source NameRequest#Request Desc
16Source NameTransit #978879Account #1867437
17Source NameSequence#972790Amount312
18Source NameDate3/5/2023
19
20Query Output
21NameRequest #Request DescTransit #Account #Sequence #AmountDate
22Source Name166350558650682222168412/17/2023
23Source Name33451192782096795168585/23/2024
24Source Name57761987579993872086292/5/2024
25Source Name97887918674379727903123/5/2023
26
Sheet6
I am noticing as I import data into my query that the repeating 4 rows is not sufficient.

is it possible to instead have the Index progress whenever Column 1 has a new instance of the word "Request #:" ?
 
Upvote 0
You can do something like this

Power Query:
let
    Source = Excel.CurrentWorkbook(){[Name="Table5"]}[Content],
    #"Added Index" = Table.AddIndexColumn(Source, "Index", 0, 1, Int64.Type),
    Custom1 = Table.AddColumn(#"Added Index", "Groupby", each if [Column1] = "Request#" then [Index] else null),
    #"Filled Down" = Table.FillDown(Custom1,{"Groupby"})
in
    #"Filled Down"

then do your grouping based on the Groupby column
 
Upvote 0
You can do something like this

Power Query:
let
    Source = Excel.CurrentWorkbook(){[Name="Table5"]}[Content],
    #"Added Index" = Table.AddIndexColumn(Source, "Index", 0, 1, Int64.Type),
    Custom1 = Table.AddColumn(#"Added Index", "Groupby", each if [Column1] = "Request#" then [Index] else null),
    #"Filled Down" = Table.FillDown(Custom1,{"Groupby"})
in
    #"Filled Down"

then do your grouping based on the Groupby column
User blog:M88youngling/It's Working!! | The LBP Union Wiki | Fandom


Thanks so much!
 
Upvote 0

Forum statistics

Threads
1,224,811
Messages
6,181,081
Members
453,021
Latest member
Justyna P

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top