Context
I have an ever growing collection of 200+ paged pdf documents that contain bank checks. The checks themselves will not scan into Power query, however their header's will.
My goal is to extract the header of each check statement and then transform that information into a row that would create an evergreen database since new check statements come in all the time.
My Problem
The pdfs contain roughly 200+ pages, So unless I split the PDFs up by page (which takes way too much time to be reasonable) I'm unable to use a sample query for each table that gets extracted from the data.
Below I've provided samples of what the raw data and the steps I'm stuck on within power query looks like.
Raw Data
I've created a sample via MS Paint to give an idea of what the tables look like from the bank. (Dummy data to keep things confidential). Fortunately the header of each PDF is always the same and formatted like this.
Unfortunately, the way the data is structured leads to some creative use of transpositions which... on a single basis is fine. But without the use of a sample query... not fine. I'll explain why below.
How my Data shows up within Power Query
After creating a custom column to extract pdf.Tables from the source folder and then filtering out all pages so I am left with only tables... my query looks like the screenshot below.
My original Approach
Normally I'd apply transformations to a sample query but since each PDF contains 200+ pages of these tables, that option isn't available. While it is possible to split the PDF's into singular pages, the sheer volume of them would make that process not feasible. I was adding a column and then using the record.field function to extract the information I needed from a column and move it into the next column, then once I've effectively converted the table to a row, the sample query would apply to each individual PDF... however with that no longer being possible... I'm at a complete loss for how to tackle this thing.
I'm currently working through some Power Query Coursees I've found through YouTube, and skimming many threads, but nothing quite answers my problem, I feel the addition of an index and groupby or unpivot might be the answer here but I am unsure.
Worth noting, is that the data always repeats every 4 rows, I'm not sure how to make use of this, but it feels like there'd be a function or tool that could make use of that.
Thank you so much in advance for taking the time to read and respond to this question. I look forward to getting this done and learning how to make use of the functions used to solve this in the future.
I'm really loving power query but I am still quite new to it.
Thanks again.
I have an ever growing collection of 200+ paged pdf documents that contain bank checks. The checks themselves will not scan into Power query, however their header's will.
My goal is to extract the header of each check statement and then transform that information into a row that would create an evergreen database since new check statements come in all the time.
My Problem
The pdfs contain roughly 200+ pages, So unless I split the PDFs up by page (which takes way too much time to be reasonable) I'm unable to use a sample query for each table that gets extracted from the data.
Below I've provided samples of what the raw data and the steps I'm stuck on within power query looks like.
Raw Data
I've created a sample via MS Paint to give an idea of what the tables look like from the bank. (Dummy data to keep things confidential). Fortunately the header of each PDF is always the same and formatted like this.
Unfortunately, the way the data is structured leads to some creative use of transpositions which... on a single basis is fine. But without the use of a sample query... not fine. I'll explain why below.
How my Data shows up within Power Query
After creating a custom column to extract pdf.Tables from the source folder and then filtering out all pages so I am left with only tables... my query looks like the screenshot below.
My original Approach
Normally I'd apply transformations to a sample query but since each PDF contains 200+ pages of these tables, that option isn't available. While it is possible to split the PDF's into singular pages, the sheer volume of them would make that process not feasible. I was adding a column and then using the record.field function to extract the information I needed from a column and move it into the next column, then once I've effectively converted the table to a row, the sample query would apply to each individual PDF... however with that no longer being possible... I'm at a complete loss for how to tackle this thing.
I'm currently working through some Power Query Coursees I've found through YouTube, and skimming many threads, but nothing quite answers my problem, I feel the addition of an index and groupby or unpivot might be the answer here but I am unsure.
Worth noting, is that the data always repeats every 4 rows, I'm not sure how to make use of this, but it feels like there'd be a function or tool that could make use of that.
Thank you so much in advance for taking the time to read and respond to this question. I look forward to getting this done and learning how to make use of the functions used to solve this in the future.
I'm really loving power query but I am still quite new to it.
Thanks again.