I am new to Power Query and I have been running into memory issues when using large time series data sets. An I am wondering if it is a usage issue on my end based on me not fully understanding how Power Query handles large data sets. I was hoping someone could clarify a few items for me.
What I am Doing
I have been playing with large hourly data sets that span multiple years (80,000 rows) across 200 categories (200 rows). The data is pulled in from a very large Excel spreadsheet. I have been pulling that data into the Power Query as a Connection. I then have been building additional queries as a “reference” to the first query. For example I may have 2x additional queries (Connection only) as references to the first query. All 3x queries have tables that display the time series data. On that basis I am not really clear if that means that Power Query is holding 3 separate versions of that data set (i.e. 3x the memory) or if it is only holding the data from the first data set. The additional queries simply manipulate the data table from the first query and revise the result in a new table of that new query. I am using 3x different queries as I eventually will want to dump the data - as I step through it - from the 3x tables back into Excel in some form of Power Pivot and thus I want to preserve teach table along the way. A example might be:
- table1 has production data (imported form Excel),
- table2 has multiplies production data buy pricing data (references table1), and
- table3 has production x price data (references table1 and table2).
At this point I am not exporting anything back to Excel yet - the data all resides in Power Query.
What I am Seeing
When I start to run the third query I am prompted with memory errors and the query does not finish running. I can see at the bottom of the screen a status bar showing the MB of data that is begin processed. The underlying Excel spreadsheet is about 90 MB. But when I run the third query the status bar shows MB that go into the +300MB range. This is what has me wondering if I am being inefficient in the way in which I am handling the data.
Any feedback on this would be appreciated in terms of best practices or how “referencing” queries influences data usage etc.
Thanks in advance.
What I am Doing
I have been playing with large hourly data sets that span multiple years (80,000 rows) across 200 categories (200 rows). The data is pulled in from a very large Excel spreadsheet. I have been pulling that data into the Power Query as a Connection. I then have been building additional queries as a “reference” to the first query. For example I may have 2x additional queries (Connection only) as references to the first query. All 3x queries have tables that display the time series data. On that basis I am not really clear if that means that Power Query is holding 3 separate versions of that data set (i.e. 3x the memory) or if it is only holding the data from the first data set. The additional queries simply manipulate the data table from the first query and revise the result in a new table of that new query. I am using 3x different queries as I eventually will want to dump the data - as I step through it - from the 3x tables back into Excel in some form of Power Pivot and thus I want to preserve teach table along the way. A example might be:
- table1 has production data (imported form Excel),
- table2 has multiplies production data buy pricing data (references table1), and
- table3 has production x price data (references table1 and table2).
At this point I am not exporting anything back to Excel yet - the data all resides in Power Query.
What I am Seeing
When I start to run the third query I am prompted with memory errors and the query does not finish running. I can see at the bottom of the screen a status bar showing the MB of data that is begin processed. The underlying Excel spreadsheet is about 90 MB. But when I run the third query the status bar shows MB that go into the +300MB range. This is what has me wondering if I am being inefficient in the way in which I am handling the data.
Any feedback on this would be appreciated in terms of best practices or how “referencing” queries influences data usage etc.
Thanks in advance.