Newbie needs help with large db design

baseball

Board Regular
Joined
Apr 1, 2002
Messages
153
I need some help with constructing a large database. I have play-by-play data for major league baseball games from 1957 through 2006, currently in three flat files covering the American League from 1957 through 1996 (1.19 GB), the National League from 1957 through 1996 (1.11 GB) and the combined leagues (because of inter-league play) from 1997 through 2006 (784 MB).

Each record has 97 fields, some with Null data. I have a database with 2000-2006 play-by-play for all teams. That contains over 1.3 million records. Essentially, this is not a relational database but there are some things that could be done to make it such. For example, there are a number of fields that contain playerIDs (7-8 characters) for Runner_on_1st, Pitcher, First_assist, etc., etc. Is it possible to use one master table in relation to all those fields? Or, so I need a separate master table for each of that type field?

How big of a problem will it be to add this year's play-by-play date?

Can Access 2007 handle the front-end role to this much data?

I'm a complete neophyte when it comes to databases on this scale. I'm not even certain I'm asking the right questions. If anyone can start pushing me in teh right direstion, I would appreciate it.

Thanks you,
Cliff
Win XP running on a dual-core box with 2 GB RAM
 

Excel Facts

Can you AutoAverage in Excel?
There is a drop-down next to the AutoSum symbol. Open the drop-down to choose AVERAGE, COUNT, MAX, or MIN
If your question is in regards to how much data can Access handle, that's a hard one to answer. An Access database cannot exceed 2GB, so users with large amounts of data will often split the front-end (forms, reports, etc) and the back end (tables) into separate database files. Not only does this enable you to exceed the 2GB limit it also makes it easier to upsize to a SQL server back end in the future.

Normalizing will help to reduce your file size considerably through the elimination of redundant data. If you can provide some sample fields I can help to come up with a normalized table structure.

hth,
Giacomo
 
Upvote 0
data fields

You asked. :) Here are the 97 fields with descriptions. The data formats are immutable. I get them through something called Project Retrosheet, who distributes the Event Files freely, just asking that they are credited when used.

While I haven't included any actual data, I think you can tell from what I have included that only areas that lend themselves to relational tables are those that contain the alphanumeric identifiers for players (combination of letters from last and first names plus numerics to distinguish players with sme last and first names--ramin002 for Manny Ramirez as opposed to ramin001 for Mario Ramirez) and three-letter team identifiers. The rest of the fields are flags. I do have a master roster of players for the period covered (1957-2006) in the form of: retroID,LastName,FirstName. And I obviously have a table for team identifiers.

I do want to try and clarify one question about database size and Access. Suppose that I convert the three data tables (AL, NL, and ML) to backends in Access and they each are pushing 2 GB in size, will the Access frontend be able to handle a query that involves linking the three tables? Or, will Access frontend only care that the amount of data returned by the query is less than 2 GB?

If you still need actual data, we'll have to figure out a way but it's really too much to post here.

Thanks,
Cliff

gameID-----XXXyyyymmddG (where XXX is the 3-letter team identifier, date, then G=0,1,2 to sort out first and second games of doubleheaders)--not unique
vis------3-letter identifier for visiting team (such as BOS for Boston)
inn------integer representing inning (less than, greater than, equals, etc.)
bat------0,1 (flag for home or visiting team batting)
outs-----0,1,2 (outs at time of event)
balls----0,1,2,3
strikes--0,1,2
pitch_seq----alphanumeric of varying length
vis_score----visiting team's score at time of event
home_score---home team's score at time of event
batter------------alphanumeric identifier for batter
batter_hand-------L,R (for batted left-/right-handed)
res_batter--------alphanumeric identifier for batter (NOTE: the batter who starts the plate appearance may not be the one who finishes it--injury, for example)
res_batter_hand---L,R (for batted left-/right-handed)
pitcher-----------alphanumeric identifier for batter
pitcher_hand------L,R (for batted left-/right-handed)
res_pitcher-------alphanumeric identifier for batter (NOTE: ditto res-batter note)
res_pitcher_hand--L,R (for batted left-/right-handed)
catcher--------alphanumeric identifier for defensive player at position
first_base-----""
second_base----"" alphanumeric data of varying lengths
third_base-----""
shortstop------""
left_field-----""
center_field---""
right_field----""
runner_1b------alphanumeric identifier for baserunner on that base at start of event (may be NULL)
runner_2b------""
runner_3b------""
event_text-----alphanumeric data of varying lengths
leadoff_flag---T,F (to identify first batter of inning)
pinchhit_flag--T,F (to identify whether batter is counted as a pinch-hitter)
defensive_position--1-11 (which defensive position batter plays, includes DH and PH)
lineup_position-----1-9, which line-up position batter occupies
event_type----------0-24 (codes for various events)
batter_event_flag---T,F (flag for batter events--hits, etc. as opposed to non-batter events (stolen bases, etc.)
ab_flag-------------T,F (flag to indicate wheter batter is credited with official at bat)
hit_value-----------0,1,2,3,4 (0 for no hit, 4 for home run, etc.)
SH_flag-------------T,F (flag for batter credited with sacrifice hit)
SF_flag-------------T,F (flag for batter credited with sacrifice fly)
outs_on_play--------0,1,2,3 (number of outs recorded on play)
double_play_flag----T,F (flag for double-play recorded)
triple_play_flag----T,F (flag for triple-play recorded)
RBI_on_play---------0,1,2,3,4 (number of runs batted in on play)
wild_pitch_flag-----T,F (flag for pitcher charged with wild pitch)
passed_ball_flag----T,F (flag for catcher charged with passed ball)
fielded_by----------alphanumeric identifier for defensive player who fielded ball
batted_ball_type----F,L,P,G (flags for fly balls, line frives, Pop-uops, ground balls)
bunt_flag-----------T,F (flag indicating whether batter bunted ball)
foul_flag-----------T,F (flag indicating whether ball was in foul or fair territory when fielded)
hit_location--------alphanumeric (corresponds to zones on a map--not always recorded)
num_errors----------number of errors charged on a play
error_1st_player----1-9 position for defensive player who was charged with first error on play (May be NULL)
error_1st_type------T,F (flag for Throw or Fielding, may be NULL)
error_2nd_player----1-9 position for defensive player who was charged with first error on play (May be NULL)
error_2nd_type------T,F (flag for Throw or Fielding, may be NULL)
error_3rd_player----1-9 position for defensive player who was charged with first error on play (May be NULL)
error_3rd_type------T,F (flag for Throw or Fielding, may be NULL)
dest_batter---------0,1,2,3,4 (base batter reached at conclusion of play)
dest_runner_1st-----0,1,2,3,4 (base runner on given the base reached at conclusion of play, 0=out, 4=scored)
dest_runner_2nd-----""
dest_runner_3rd-----""
play_batter---------numeric shorthand describing play made on batter (may be NULL)
play_runner_1st-----numeric shorthand describing play made on runner from the given base (may be NULL)
play_runner_2nd-----""
play_runner_3rd-----""
SB_runner_1st_flag--T,F (flag indicating whether runner on the given base was credited with a stolen base, may be NULL)
SB_runner_2nd_flag--""
SB_runner_3rd_flag--""
CS_runner_1st_flag--T,F (flag indicating whether runner on the given base was charged with a caught stealing, may be NULL)
CS_runner_2nd_flag--""
CS_runner_3rd_flag--""
PO_runner_1st_flag--T,F (flag indicating whether runner on the given base was put out, may be NULL)
PO_runner_2nd_flag--""
PO_runner_3rd_flag--""
Res_pitcher_runner_1st---alphanumeric identifier for pitcher responsible for runner on given base, may be NULL)
Res_pitcher_runner_2nd---""
Res_pitcher_runner_3rd---""
New_Game_Flag---T,F (flag to new game beginning)
End_Game_Flag---T,F (flag to indicate end of game)
Pinch_runner_1st---T,F (flag to indicate whether a pinch runner was used for runner at given base)
Pinch_runner_2nd---T,F (flag to indicate whether a pinch runner was used for runner at given base)
Pinch_runner_3rd---T,F (flag to indicate whether a pinch runner was used for runner at given base)
Runner_removed_pinch-runner_1st---alphanumeric identifier for runner removed from game at given base)
Runner_removed_pinch-runner_2nd---"
Runner_removed_pinch-runner_3rd---"
Batter_removed_pinch-hitter------------alphanumeric identifier of batter removed for pinch-hitter, may be NULL)
Position_batter_removed_pinch-hitter---1-11 (fielding position of batter removed for pinch-hitter)
Fielder_First_Putout---1-9 (fielding position of fielder credited with given putout, may be NULL)
Fielder_Second_Putout--""
Fielder_Third_Putout---""
Fielder_First Assist---1-9 (fielding position of fielder credited with given assist, may be NULL)
Fielder_Second Assist--"
Fielder_Third Assist---"
Fielder_Fourth Assist--"
Fielder_Fifth Assist---"
event_num---sequentil numbering of events in game, reset to 1 when New_Game_Flag=1
 
Upvote 0
I'll post back with some normalization ideas later, but in regards to your file size question, you can have an Access front-end to a SQL-Server database that is larger than 2GB. It's just that the .mdb file itself cannot exceed 2GB.

Additionally, you have a lot of opportunity to reduce your file size by normalization. For instance instead of saving redundat game level information at the event level you can move that information to a game table where it will not be duplicated. By eliminating duplication you will reduce file size. There are many examples of this, I'll post back this weekend with some ideas.

Giacomo
 
Upvote 0
Thanks for clarifying the bit about file size.

One other area that I do no know about is adding data to normalized tables. Sometime towards the end of this year, the 2007 data will be publicly available, ~200,000 records in one flat file. How much trouble is it going to be to append that data to the normalized database? Do I run it from the front end and the software figures it out?

I'll be looking forward to seeing what you come up with. My hope would be to reduce the size of the back end (data table) enough so I have only one instead of having to link all three tables (AL,NL,ML).
 
Upvote 0
sorry for the delay in getting back to you. What you need to do is to break out the redundant data into separate tables, by eliminating redundancy you can save disk space.

You start by defining your entities, then you determine the entity attributes, and finally the entity relationships. Entities are the "nouns" of the database, using your example I see the following entities:

Events - events are your lowest point of granularity, it may be a single pitch, hit, stolen base etc.

Plays - A play is a collection of events, example a collection of pitches, and hits make up a single at bat.

Innings - A collection of plays

Games - A collection of innings

Seasons - a collection of games

Players - a unique person

Franchise - I'm using the term franchise here instead of Team because I am thinking that a franchise is a static thing like the Yankees, while a team is the collection of players in a particular season that make up the Yankees Team.

Your entities become the tables in your database, once you have your entities identified you then find those attributes that uniquely identify the entity.

Examples:

Season - SeasonID, Year
Game - GameID, SeasonID, HomeTeamID, VisitingTeamID
Inning - InningID, GameID, InningNumber, HomeTeamScore, VisitingTeamScore
etc.

Player - PlayerID, PlayerFirstName, PlayerLastName, BattingHand, PitchingHand, etc

I don't know nearly enough about baseball to give you the full structure but hopefully this will be enough to give you a start. Once you have created your structure you can save all the queries that you used to transform the flat file to the normalized database and reuse them when you get the next years worth of data.

hth,
Giacomo
 
Upvote 0

Forum statistics

Threads
1,221,829
Messages
6,162,229
Members
451,756
Latest member
tommyw

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top