Hi everyone, I'm migrating data from a legacy source information system to a new information system and want to confirm a variety of counts of values from the source data against the counts of the same values from the migrated data to ensure the counts match as a way to verify the records migrated properly. The datasets have millions of records and multiple columns that I want to count and compare results to ensure they match.
As one example, I have a column for Organization and a column for Product in both datasets - the dataset columns are not the same i.e., the Organization column is column F in the source data and column C in the migrated data and Product is column M in the source data and column D in the migrated data. For each unique organization in the Organization column, I want to count each record for each unique Product value in the Product column in each data set and then compare the two counts and indicate where the test passes and where it fails for each unique combination. The comparison needs to work in both directions i.e., source data to migrated data and migrated data to source data, to ensure complete inclusion of any variances.
I've thought about using pivot tables or advanced filters, but I don't think they will be efficient given it's working with two datasets with different structures and needing to filter, compare, and return results on multiple criteria - the product count is just one example, we will look at other columns of data too e.g., count invoices column by organization, sum sales column by organization, etc. so needs to be able to handle a number of columns to test in a dynamic way, and given the number of records, it needs to perform reasonably fast. I was wondering about using dictionaries and if nesting dictionaries and performing counts/sums on the dictionary keys and items might be the fastest/most efficient, but I'm completely new to dictionaries and am struggling to really know if that's a good approach, or if there's a better way?
Here's a basic example of Source Data, Migrated Data, and the desired Result output. I've put in on one sheet for illustration, but for clarity, the source data and migrated data are separate sheets with different column structure.
Any help is greatly appreciated!
As one example, I have a column for Organization and a column for Product in both datasets - the dataset columns are not the same i.e., the Organization column is column F in the source data and column C in the migrated data and Product is column M in the source data and column D in the migrated data. For each unique organization in the Organization column, I want to count each record for each unique Product value in the Product column in each data set and then compare the two counts and indicate where the test passes and where it fails for each unique combination. The comparison needs to work in both directions i.e., source data to migrated data and migrated data to source data, to ensure complete inclusion of any variances.
I've thought about using pivot tables or advanced filters, but I don't think they will be efficient given it's working with two datasets with different structures and needing to filter, compare, and return results on multiple criteria - the product count is just one example, we will look at other columns of data too e.g., count invoices column by organization, sum sales column by organization, etc. so needs to be able to handle a number of columns to test in a dynamic way, and given the number of records, it needs to perform reasonably fast. I was wondering about using dictionaries and if nesting dictionaries and performing counts/sums on the dictionary keys and items might be the fastest/most efficient, but I'm completely new to dictionaries and am struggling to really know if that's a good approach, or if there's a better way?
Here's a basic example of Source Data, Migrated Data, and the desired Result output. I've put in on one sheet for illustration, but for clarity, the source data and migrated data are separate sheets with different column structure.
Any help is greatly appreciated!
Sample.xlsx | ||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
A | B | C | D | E | F | G | H | I | J | K | L | |||
1 | Source Data | Migrated Data | Result | |||||||||||
2 | Col F | Col M | Col C | Col D | ||||||||||
3 | ||||||||||||||
4 | Organization | Product | Organization | Product | Organization | Product | Legacy Count | Migrated Count | Result | Variance | ||||
5 | Org A | Product A | Org A | Product A | Org A | Product A | 3 | 3 | Pass | - | ||||
6 | Org A | Product A | Org A | Product A | Org A | Product B | 2 | 2 | Pass | - | ||||
7 | Org A | Product A | Org A | Product A | Org B | Product A | 2 | 2 | Pass | - | ||||
8 | Org A | Product B | Org A | Product B | Org B | Product B | 1 | 0 | Fail | -1 | ||||
9 | Org A | Product B | Org A | Product B | Org B | Product C | 0 | 1 | Fail | 1 | ||||
10 | Org B | Product A | Org B | Product A | Org C | Product A | 2 | 0 | Fail | -2 | ||||
11 | Org B | Product A | Org B | Product A | ||||||||||
12 | Org B | Product B | Org B | Product C | ||||||||||
13 | Org C | Product A | ||||||||||||
14 | Org C | Product A | ||||||||||||
Example |