Help me Convert, Merge and Transpose a pdf table into a singular row, and then repeat that step ad nauseam to create a database

Clonk92 · Jun 21, 2024

Context
I have an ever-growing collection of PDF documents, each 200+ pages long, that contain bank checks. The checks themselves will not scan into Power Query; however, their headers will.
My goal is to extract the header of each check statement and transform it into a single row, building an evergreen database, since new check statements come in all the time.

My Problem
Each PDF contains 200+ pages, so unless I split the PDFs up by page (which takes far too long to be reasonable), I'm unable to use a sample query for each table that gets extracted from the data.
Below I've provided samples of what the raw data looks like and of the steps I'm stuck on within Power Query.

Raw Data
I've created a sample via MS Paint to give an idea of what the tables look like from the bank. (Dummy data to keep things confidential). Fortunately the header of each PDF is always the same and formatted like this.
Unfortunately, the way the data is structured leads to some creative use of transpositions which... on a single basis is fine. But without the use of a sample query... not fine. I'll explain why below.

How my Data shows up within Power Query
After creating a custom column that calls Pdf.Tables on each file in the source folder, and then filtering out the page entries so I am left with only tables, my query looks like the screenshot below.
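For readers following along, the step described above can be sketched roughly as below. This is only an illustration, not the original poster's actual query: the folder path is hypothetical, and the names Column1 to Column4 are assumed to match what Pdf.Tables produces for these check headers.

```powerquery
let
    // Hypothetical folder path; replace with the real source folder.
    Source = Folder.Files("C:\Checks"),
    PdfFiles = Table.SelectRows(Source, each Text.Lower([Extension]) = ".pdf"),
    // Pdf.Tables returns a navigation table with Id, Name, Kind and Data columns.
    WithTables = Table.AddColumn(PdfFiles, "Tables", each Pdf.Tables([Content])),
    Kept = Table.SelectColumns(WithTables, {"Name", "Tables"}),
    Expanded = Table.ExpandTableColumn(Kept, "Tables", {"Kind", "Data"}),
    // Keep the detected tables and drop the whole-page entries.
    OnlyTables = Table.SelectRows(Expanded, each [Kind] = "Table"),
    // Assumption: every extracted table exposes Column1..Column4.
    Combined = Table.ExpandTableColumn(OnlyTables, "Data", {"Column1", "Column2", "Column3", "Column4"}),
    Result = Table.RemoveColumns(Combined, {"Kind"})
in
    Result
```

The result stacks every header table from every PDF into one long table with the file name in the Name column, so any later reshaping only has to be written once rather than per page.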

My original Approach
Normally I'd apply transformations to a sample query, but since each PDF contains 200+ pages of these tables, that option isn't available. While it is possible to split the PDFs into single pages, the sheer volume of them makes that process unfeasible. My approach was to add a column and use the Record.Field function to pull the information I needed from one column into the next; once I'd effectively converted the table to a row, the sample query would apply to each individual PDF. With that no longer being possible, I'm at a complete loss for how to tackle this.

I'm currently working through some Power Query courses I've found on YouTube and skimming many threads, but nothing quite answers my problem. I feel the addition of an index and a group-by or unpivot might be the answer here, but I am unsure.

Worth noting is that the data always repeats every 4 rows. I'm not sure how to make use of this, but it feels like there'd be a function or tool that could.

Thank you so much in advance for taking the time to read and respond to this question. I look forward to getting this done and learning how to make use of the functions used to solve this in the future.
I'm really loving Power Query but I am still quite new to it.

Thanks again.

JGordon11 · Jun 21, 2024

Is this what you are trying to do?

Power Query:

let
    Source = Excel.CurrentWorkbook(){[Name="Table5"]}[Content],
    // Number the rows 0, 1, 2, ... then turn each run of 4 rows into one group key
    tbl = Table.AddIndexColumn(Source, "Index"),
    tbl1 = Table.TransformColumns(tbl, {"Index", each Number.RoundDown(_/4)}),
    // Output column names, in the same order as the values collected below
    fieldnames = {"Name", "Request #", "Request Desc", "Transit #", "Account #", "Sequence #", "Amount", "Date"},
    // For each 4-row group, pick the values by position from Column2 and Column4
    tbl2 = Table.Group(tbl1, {"Index"}, {{"All", each 
                let 
                    c2 = [Column2],
                    c4 = [Column4],
                    data = {[Name]{0}, c2{0},  c4{0}, c2{1}, c4{1}, c2{2},  c4{2}, c2{3}}
                in 
                    Record.FromList(data, fieldnames)
        }}),
    // Expand the records into columns, drop the helper key, and set the data types
    tbl3 = Table.ExpandRecordColumn(tbl2, "All", fieldnames),
    tbl4 = Table.RemoveColumns(tbl3,{"Index"}),
    Result = Table.TransformColumnTypes(tbl4,{{"Name", type text}, {"Transit #", Int64.Type}, {"Account #", Int64.Type}, {"Sequence #", Int64.Type}, {"Amount", Int64.Type}, {"Date", type date}})
in
    Result
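As a small aside on how the Index step in the query above forms its groups, the sketch below applies the same Number.RoundDown(_/4) expression to row positions 0 through 7. The names Positions, GroupKeys and Demo are made up for this illustration.

```powerquery
let
    // Row positions 0..7 stand in for the Index column across two 4-row blocks.
    Positions = {0..7},
    // Same expression as in the query above: every run of 4 rows shares one key.
    GroupKeys = List.Transform(Positions, each Number.RoundDown(_ / 4)),
    Demo = Table.FromColumns({Positions, GroupKeys}, {"Index", "Group"})
in
    Demo
```

Rows 0 to 3 get key 0 and rows 4 to 7 get key 1, which is what lets Table.Group collect each check header into a single group.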

Book1, Sheet6

Table5

Name        | Column1   | Column2    | Column3      | Column4
Source Name | Request#  |            | Request Desc |
Source Name | Transit # | 166350     | Account #    | 5586506
Source Name | Sequence# | 822221     | Amount       | 684
Source Name | Date      | 12/17/2023 |              |
Source Name | Request#  |            | Request Desc |
Source Name | Transit # | 334511     | Account #    | 9278209
Source Name | Sequence# | 679516     | Amount       | 858
Source Name | Date      | 5/23/2024  |              |
Source Name | Request#  |            | Request Desc |
Source Name | Transit # | 577619     | Account #    | 8757999
Source Name | Sequence# | 387208     | Amount       | 629
Source Name | Date      | 2/5/2024   |              |
Source Name | Request#  |            | Request Desc |
Source Name | Transit # | 978879     | Account #    | 1867437
Source Name | Sequence# | 972790     | Amount       | 312
Source Name | Date      | 3/5/2023   |              |

Query Output

Name        | Request # | Request Desc | Transit # | Account # | Sequence # | Amount | Date
Source Name |           |              | 166350    | 5586506   | 822221     | 684    | 12/17/2023
Source Name |           |              | 334511    | 9278209   | 679516     | 858    | 5/23/2024
Source Name |           |              | 577619    | 8757999   | 387208     | 629    | 2/5/2024
Source Name |           |              | 978879    | 1867437   | 972790     | 312    | 3/5/2023
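A possible alternative for the fixed 4-row case, offered only as a sketch: Table.Split can cut the table into 4-row blocks directly, without an index column. It assumes the same Table5 layout as above and that the row count is an exact multiple of 4; the helper name ToRecord is made up for this example.

```powerquery
let
    Source = Excel.CurrentWorkbook(){[Name="Table5"]}[Content],
    fieldnames = {"Name", "Request #", "Request Desc", "Transit #", "Account #", "Sequence #", "Amount", "Date"},
    // Table.Split returns a list of tables, each holding up to 4 rows.
    Blocks = Table.Split(Source, 4),
    // Hypothetical helper: flatten one 4-row block into a single record by position.
    ToRecord = (t as table) as record =>
        let
            c2 = t[Column2],
            c4 = t[Column4]
        in
            Record.FromList({t[Name]{0}, c2{0}, c4{0}, c2{1}, c4{1}, c2{2}, c4{2}, c2{3}}, fieldnames),
    Result = Table.FromRecords(List.Transform(Blocks, ToRecord))
in
    Result
```

If the last block is short, the positional lookups such as c2{3} raise an error, so this variant only suits data that really does repeat every 4 rows.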

Clonk92 · Jun 22, 2024

It worked!

I'm still a little rough around the edges when it comes to M code and the program as a whole, but I'm so stoked to learn more about it now. It was a little rough to get this working with my source data, but I was able to figure it out!

I'll be studying this formula for a while to figure out how it's all coming together but this is astounding! Thanks so much!!

Clonk92 · Jun 24, 2024

As I import more data into my query, I am noticing that relying on the repeating 4 rows is not sufficient.

Is it possible to instead have the Index advance whenever Column1 has a new instance of the label "Request #:"?

JGordon11 · Jun 24, 2024

You can do something like this

Power Query:

let
    Source = Excel.CurrentWorkbook(){[Name="Table5"]}[Content],
    #"Added Index" = Table.AddIndexColumn(Source, "Index", 0, 1, Int64.Type),
    // Copy the index only on rows that start a new block; leave the rest null
    Custom1 = Table.AddColumn(#"Added Index", "Groupby", each if [Column1] = "Request#" then [Index] else null),
    // Fill the block-start index down so every row in a block shares one key
    #"Filled Down" = Table.FillDown(Custom1,{"Groupby"})
in
    #"Filled Down"

Then do your grouping based on the Groupby column.
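One way that grouping step might look is sketched below. It is not necessarily what was intended in the reply above: because blocks may no longer be exactly 4 rows, it matches values to their labels instead of to row positions. The helper name BlockToRecord is made up, the labels are assumed to be spelled as in the Table5 sample (adjust "Request#" if the real data reads "Request #:"), and each label is assumed to occur at most once per block.

```powerquery
let
    Source = Excel.CurrentWorkbook(){[Name="Table5"]}[Content],
    AddedIndex = Table.AddIndexColumn(Source, "Index", 0, 1, Int64.Type),
    // Mark the row where each check header starts, then fill the marker down.
    Marked = Table.AddColumn(AddedIndex, "Groupby", each if [Column1] = "Request#" then [Index] else null),
    FilledDown = Table.FillDown(Marked, {"Groupby"}),
    // Rows before the first "Request#" have no block; drop them.
    InBlocks = Table.SelectRows(FilledDown, each [Groupby] <> null),
    // Assumed label spellings, taken from the Table5 sample.
    fieldnames = {"Name", "Request#", "Request Desc", "Transit #", "Account #", "Sequence#", "Amount", "Date"},
    // Hypothetical helper: pair each label with the value to its right.
    BlockToRecord = (block as table) as record =>
        let
            Labels = block[Column1] & block[Column3],
            Values = block[Column2] & block[Column4],
            Pairs = List.Select(List.Zip({Labels, Values}), each _{0} <> null and _{0} <> ""),
            Fields = Record.FromList(List.Transform(Pairs, each _{1}), List.Transform(Pairs, each _{0}))
        in
            Record.AddField(Fields, "Name", block[Name]{0}),
    Grouped = Table.Group(InBlocks, {"Groupby"}, {{"All", BlockToRecord}}),
    Expanded = Table.ExpandRecordColumn(Grouped, "All", fieldnames),
    Result = Table.RemoveColumns(Expanded, {"Groupby"})
in
    Result
```

Looking values up by label means a block with a missing or extra row still lands in the right columns, at the cost of an error if the same label appears twice within one block. Column types can then be set with Table.TransformColumnTypes as in the earlier query.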

Clonk92 · Jun 25, 2024

Thanks so much!
