I have been given a large data set of 8 million rows listing transaction lines for a group of shops. It's a single SQL query containing all the data, and I am wondering if I would be better off splitting it into two related tables. It depends on how PP deals with the data and which approach is more efficient.
The current single table has the following sort of columns. Some of the columns vary with each line of the transaction, and some are repeated for every line of that transaction. A single transaction ID (invoice) can have up to 20 lines.
[TABLE="width: 500"]
<tbody>[TR]
[TD]TransID[/TD]
[TD]ProdID[/TD]
[TD]Qty[/TD]
[TD]Price[/TD]
[TD]Discount[/TD]
[TD]Tax[/TD]
[TD]Location[/TD]
[TD]StaffID[/TD]
[TD]TillID[/TD]
[TD]Date[/TD]
[TD]Time[/TD]
[/TR]
[TR]
[TD]1234[/TD]
[TD]55[/TD]
[TD]1[/TD]
[TD]5.50[/TD]
[TD]0.50[/TD]
[TD]0.37[/TD]
[TD]London[/TD]
[TD]123[/TD]
[TD]456[/TD]
[TD]21/9/15[/TD]
[TD]13:45[/TD]
[/TR]
[TR]
[TD]1234[/TD]
[TD]79[/TD]
[TD]2[/TD]
[TD]3.50[/TD]
[TD]0.35[/TD]
[TD]0.25[/TD]
[TD]London[/TD]
[TD]123[/TD]
[TD]456[/TD]
[TD]21/9/15[/TD]
[TD]13:45[/TD]
[/TR]
</tbody>[/TABLE]
ProdID, StaffID, and TillID are all related to other tables.
This could be split into two tables related by TransID:
[TABLE="width: 500"]
<tbody>[TR]
[TD]TransID[/TD]
[TD]ProdID[/TD]
[TD]Qty[/TD]
[TD]Price[/TD]
[TD]Discount[/TD]
[TD]Tax[/TD]
[/TR]
[TR]
[TD]1234[/TD]
[TD]55[/TD]
[TD]1[/TD]
[TD]5.50[/TD]
[TD]0.50[/TD]
[TD]0.37[/TD]
[/TR]
[TR]
[TD]1234[/TD]
[TD]79[/TD]
[TD]2[/TD]
[TD]3.50[/TD]
[TD]0.35[/TD]
[TD]0.25[/TD]
[/TR]
</tbody>[/TABLE]
and
[TABLE="width: 500"]
<tbody>[TR]
[TD]TransID[/TD]
[TD]Location[/TD]
[TD]StaffID[/TD]
[TD]TillID[/TD]
[TD]Date[/TD]
[TD]Time[/TD]
[/TR]
[TR]
[TD]1234[/TD]
[TD]London[/TD]
[TD]123[/TD]
[TD]456[/TD]
[TD]21/9/15[/TD]
[TD]13:45[/TD]
[/TR]
</tbody>[/TABLE]
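In SQL terms, the split above could be sketched roughly like this (a minimal illustration using SQLite via Python; the table name `Sales` is assumed, and `Date`/`Time` are renamed `SaleDate`/`SaleTime` here just to avoid keyword clashes):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# The existing flat table: one row per transaction line, with the
# invoice-level columns repeated on every line.
cur.execute("""CREATE TABLE Sales (
    TransID INTEGER, ProdID INTEGER, Qty INTEGER,
    Price REAL, Discount REAL, Tax REAL,
    Location TEXT, StaffID INTEGER, TillID INTEGER,
    SaleDate TEXT, SaleTime TEXT)""")
cur.executemany("INSERT INTO Sales VALUES (?,?,?,?,?,?,?,?,?,?,?)", [
    (1234, 55, 1, 5.50, 0.50, 0.37, "London", 123, 456, "21/9/15", "13:45"),
    (1234, 79, 2, 3.50, 0.35, 0.25, "London", 123, 456, "21/9/15", "13:45"),
])

# Line-level table: only the columns that vary per line, keyed by TransID.
cur.execute("""CREATE TABLE SalesLines AS
    SELECT TransID, ProdID, Qty, Price, Discount, Tax FROM Sales""")

# Invoice-level table: one row per TransID, holding the repeated columns.
cur.execute("""CREATE TABLE Invoices AS
    SELECT DISTINCT TransID, Location, StaffID, TillID, SaleDate, SaleTime
    FROM Sales""")

print(cur.execute("SELECT COUNT(*) FROM SalesLines").fetchone()[0])  # 2
print(cur.execute("SELECT COUNT(*) FROM Invoices").fetchone()[0])    # 1
```

The `SELECT DISTINCT` collapses the repeated invoice-level values down to one row per transaction, which is where the storage saving comes from.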
Currently there are 8 million rows of, let's say, 12 columns: 6 variable per line and 6 fixed per invoice. So this would become 8 million transaction-line rows of 6 columns and about 1 million invoice rows of 6 columns.
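A rough back-of-envelope count of stored values under that split (counting each cell as one value; actual storage will differ, since a columnar engine compresses repeated values):

```python
rows, cols = 8_000_000, 12

# Flat table: every invoice-level value repeated on every line.
flat_cells = rows * cols                  # 96,000,000 values

# Split: 6 columns per line row plus 6 columns per invoice row.
line_cells = rows * 6                     # 48,000,000 values
invoice_cells = 1_000_000 * 6             #  6,000,000 values
split_cells = line_cells + invoice_cells  # 54,000,000 values

print(split_cells / flat_cells)  # 0.5625, i.e. roughly a 44% reduction
```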
So my question is whether this is the right thing to do, as it seems to create less data to store by avoiding all the duplication. However, I get the impression PP may store data in a way that makes this duplication not actually a problem. If anyone can advise how far it is advantageous to denormalise for clarity versus the inefficiency of duplication, I'd be grateful.
Thanks
Mike