I have play-by-play data from Project Retrosheet for the vast majority of major league baseball games played from 1952 through 2009. The data has been converted to comma-delimited format with text in quotes. This is a lot of data and I need to put it into an Access 2007 database. For example, the data from 2000 through 2006 is in excess of 1.355 million records, each with 97 fields. Because of the sheer amount of data, I separated the data by decades (50s and 60s, 70s, 80s, 90s, 00s) but the total size of those five text files is about 3.5 GB.
I'm running a fairly robust system (Core 2 Duo under Windows 7, 64-bit) but I only have 4 gb of RAM and I am concerned whether my proposed database is going to be too large to run smoothly. And I have no idea how to figure that out other than trial-and-error.
So, my first question is how to handle this.
Next, I need to know the best way to import the data. I know how to import text; I'm more concerned with the tactics of dealing with the field names and the formatting of the records. For example, I don't know when to use integer, double-precision, etc. My preference would be to make a table with the field names (imported) amd formatting to use as a template, then import the delimited files into the template and save it with a new name.
Is that a reasonable way to approach this? Or should I simply copy and paste the field names at the start of each text file?
When importing field names, do they need to be quotes-delimited?
Once I build the database, would I be better off splitting it?
Finally, I will need to run queries that span decades and I have no idea how one writes a query that pulls data from two different tables at once. For example, if I wanted to check on something in Barry Bonds's career (1986-2007), I would need to pull data from three tables. In the past, I would write different queries. then combine the data in Excel. But that is cumbersome. I might add that I don't write SQL but use Query Design in Access.
I should also mention that because of the nature of the data, this is not a relational database. Here is one record. A lot of what I do is based on filtering the data based on various flags (T/F and some numerical values, i.e., a 4 under Hit_Value for home runs.
"ATL198904070","LAN",1,0,0,0,0,"X",0,0,"randw001","R","randw001","R","glavt001","L","glavt001","L","davij001","perrg001","blauj001","gantr001","thoma001","smitl002","murpd001","berrg001","","","","53/G5","T","F",4,1,2,"T","T",0,"F","F",1,"F","F",0,"F","F",5,"G","F","F","5",0,0,"N",0,"N",0,"N",0,0,0,0,"53","","","","F","F","F","F","F","F","F","F","F","","","","T","F","F","F","F","","","","",0,3,0,0,5,0,0,0,0,1
Anyway, thanks for reading and if someone can help me, I would appreciate it.
Cliff
I'm running a fairly robust system (Core 2 Duo under Windows 7, 64-bit) but I only have 4 gb of RAM and I am concerned whether my proposed database is going to be too large to run smoothly. And I have no idea how to figure that out other than trial-and-error.
So, my first question is how to handle this.
Next, I need to know the best way to import the data. I know how to import text; I'm more concerned with the tactics of dealing with the field names and the formatting of the records. For example, I don't know when to use integer, double-precision, etc. My preference would be to make a table with the field names (imported) amd formatting to use as a template, then import the delimited files into the template and save it with a new name.
Is that a reasonable way to approach this? Or should I simply copy and paste the field names at the start of each text file?
When importing field names, do they need to be quotes-delimited?
Once I build the database, would I be better off splitting it?
Finally, I will need to run queries that span decades and I have no idea how one writes a query that pulls data from two different tables at once. For example, if I wanted to check on something in Barry Bonds's career (1986-2007), I would need to pull data from three tables. In the past, I would write different queries. then combine the data in Excel. But that is cumbersome. I might add that I don't write SQL but use Query Design in Access.
I should also mention that because of the nature of the data, this is not a relational database. Here is one record. A lot of what I do is based on filtering the data based on various flags (T/F and some numerical values, i.e., a 4 under Hit_Value for home runs.
"ATL198904070","LAN",1,0,0,0,0,"X",0,0,"randw001","R","randw001","R","glavt001","L","glavt001","L","davij001","perrg001","blauj001","gantr001","thoma001","smitl002","murpd001","berrg001","","","","53/G5","T","F",4,1,2,"T","T",0,"F","F",1,"F","F",0,"F","F",5,"G","F","F","5",0,0,"N",0,"N",0,"N",0,0,0,0,"53","","","","F","F","F","F","F","F","F","F","F","","","","T","F","F","F","F","","","","",0,3,0,0,5,0,0,0,0,1
Anyway, thanks for reading and if someone can help me, I would appreciate it.
Cliff