How to fit a truncated normal distribution to existing data?

JenniferMurphy

Well-known Member
Joined
Jul 23, 2011
Messages
2,676
Office Version
  1. 365
Platform
  1. Windows
I have some data that I have been collecting from a game of solitaire. When I plot it opn a scatter chart, it looks like a normal distribution -- actually a truncated normal distribution. I have been able to more or less fit a standard normal distribution to the data after some trial and error (mostly error), but my attempts at fitting a truncarted normal distribution have failed.

xl2bb tells me that the data is too large to post here, so I have uploaded the workbook to this OneDrive folder.

Solitaire Data

The workbook has 2 sheets. One has my data and the equations I have tried to make fit. The other has detailed explanations. There is also a csv file with just the data.

Here's one of the charts:

1658509375606.png


These standard normal distributions almost fit, but I think a truncated normal would fit better, because the data is actually truncated on the left. This is all explained in the Overview sheet.
 

Excel Facts

Why are there 1,048,576 rows in Excel?
The Excel team increased the size of the grid in 2007. There are 2^20 rows and 2^14 columns for a total of 17 billion cells.
I understand that some may be reluctant to download a workbook from an unknown source, so I'll try to provide more information here.

The complete dataset is too large for xl2bb, but here are the first 20+ rows. (The colors match the plots below)

Cell Formulas
RangeFormula
C3C3=SUMPRODUCT(Moves,Wins)/SUM(Wins)
D3:E3,G3D3=WinsMean
C4C4=STDEV(Wins)
D4:E4,G4D4=WinsStdDev
D5:F5D5=1/StdDevs/SQRT(2*PI())
C6C6=MAX(Wins)
D6D6=MAX(NormDist)
E6E6=MAX(ScaledND)
F6F6=MAX(AdjustedSD)
G6G6=MAX(TruncND)
C7C7=INDEX(Moves,XMATCH(@MaxTbls,Wins))
D7D7=INDEX(Moves,XMATCH(@MaxTbls,NormDist))
E7E7=INDEX(Moves,XMATCH(@MaxTbls,ScaledND))
F7F7=INDEX(Moves,XMATCH(@MaxTbls,AdjustedSD))
D9:D30D9=NORM.DIST(Moves,Means,StdDevs,FALSE)
E9:F30E9=NORM.DIST(Moves,Means,StdDevs,FALSE)/MaxEqs*WinsMaxTbl
G9:G10G9=NORM.DIST(Wins,Means,StdDevs,FALSE)/(1-NORM.DIST(52,Means,StdDevs,TRUE))
G11:G30G11=NORM.DIST(Wins,Means,StdDevs,FALSE)/(NORM.DIST(192,Means,StdDevs,TRUE)-NORM.DIST(52,Means,StdDevs,TRUE))
Named Ranges
NameRefers ToCells
AdjustedSD=OFFSET(Sheet3!AdjustedSDHdr,1,0):OFFSET(Sheet3!AdjustedSDFtr,-1,0)F6:F7
AdjustedSDFtr=Sheet3!$F$150F6:F7
AdjustedSDHdr=Sheet3!$F$8F6:F7
AdjustedSDSD=Sheet3!$F$4F5, F9:F30
MaxTblND=Sheet3!$D$6C7:F7
MaxTbls=Sheet3!$6:$6C7:F7
MaxTblScND=Sheet3!$E$6C7:F7
Moves=OFFSET(Sheet3!MovesHdr,1,0):OFFSET(Sheet3!MovesFtr,-1,0)C7:F7, D9:F30, C3
MovesFtr=Sheet3!$B$150C7:F7, D9:F30, C3
MovesHdr=Sheet3!$B$8C7:F7, D9:F30, C3
NormDist=OFFSET(Sheet3!NormDistHdr,1,0):OFFSET(Sheet3!NormDistFtr,-1,0)D6:D7
NormDistFtr=Sheet3!$D$150D6:D7
NormDistHdr=Sheet3!$D$8D6:D7
ScaledND=OFFSET(Sheet3!ScaledNDHdr,1,0):OFFSET(Sheet3!ScaledNDFtr,-1,0)E6:E7
ScaledNDFtr=Sheet3!$E$150E6:E7
ScaledNDHdr=Sheet3!$E$8E6:E7
TruncND=OFFSET(Sheet3!TruncNDHdr,1,0):OFFSET(Sheet3!TruncNDFtr,-1,0)G6
TruncNDFtr=Sheet3!$G$150G6
TruncNDHdr=Sheet3!$G$8G6
Wins=OFFSET(Sheet3!WinsHdr,1,0):OFFSET(Sheet3!WinsFtr,-1,0)G9:G30, C3:C4, C6:C7
WinsFtr=Sheet3!$C$150G9:G30, C3:C4, C6:C7
WinsHdr=Sheet3!$C$8G9:G30, C3:C4, C6:C7
WinsMaxTbl=Sheet3!$C$6C7:F7, E9:F30
WinsMean=Sheet3!$C$3G3, D3:E3
WinsStdDev=Sheet3!$C$4G4, D4:E4


My data is half truncated on the left only [52, ∞). The actual data is on [52,192]. Here’s the scatter plot:

1658545905511.png


I calculated the mean and std dev as:

Mean = 85.36853 (=SUMPRODUCT(Moves,Wins)/SUM(Wins))
Std Dev = 7.5333 (=STDEV(Wins))

Using those parameters, I plotted a normal distribution using this formula:

=NORM.DIST(@Moves,@Means,@StdDevs,FALSE)

Here’s that plot:

1658546184442.png


This appears to have the basic shape of my data, but on a smaller scale, so I scaled it up by dividing by the maximum value here (0.05296) and multiplying by the maximum of my data (41).

=NORM.DIST(@Moves,@Means,@StdDevs,FALSE)/@MaxEqs*WinsMaxTbl

Here’s that plot superimposed on the plot of my data:

1658546168374.png


It looks roughly like a fit, but shifted to the right and too wide. I fiddled around with the mean and std dev and came up with this:

1658546228151.png


This looks pretty good, but a truncated normal should be better, but this is where I am stuck.

Since my data is only truncated on the left, I tried this formula:

=NORM.DIST(@Wins,@Means,@StdDevs,FALSE)/(1-NORM.DIST(52,@Means,@StdDevs,TRUE))

I get values from 6.89187E-30 to 1.55421E-09.

I then tried this one:

=NORM.DIST(@Wins,@Means,@StdDevs,FALSE)/(NORM.DIST(192,@Means,@StdDevs,TRUE)-NORM.DIST(52,@Means,@StdDevs,TRUE))

That gets the same values.

Can anyone tell me what I am doing wrong?
 
Upvote 0
Jennifer,
I'm sorry for the delay...I got sidetracked. I had intended to follow up with you about this. When I first saw your data, I assumed that if a normal distribution was reasonably representative of the data, it would be a truncated normal distribution because the lower left tail is cut off by very real constraints (i.e., you need a minimum number of moves to complete a game, so a win cannot occur below that threshold). I normally explore these types of data sets by scaling the area under a histogram of the data to 1...so other standard distributions whose cumulative distribution function has values on the interval [0,1] can be compared to the data. When this is done with your data set, the distribution of moves is very peaked...indicating a high kurtosis...and this suggests that a normal distribution may not be an appropriate descriptor for the data. One thing to be aware of: you can scale the histogram of actual data so that the peak is about the same as a standard normal curve (I believe that is what you did), and when you do this, the curves appear to be somewhat similar, but you can tell that the area under the curves (red vs. blue curves in the last post) are quite different.

I noticed in a recent post:
...that you are exploring distributions and tests of normality.

For the particular data set here, I followed a methodology described in this tutorial:

I created a workbook based on your data and the Tukey Lambda method shown in the video. I will need to review this file further to refresh my memory about some details, but wanted to share it with you so that you can explore it too.
On one worksheet, you will see the Lambda calculated suggests that a normal distribution is not appropriate. On another worksheet or two I investigated how to reconstitute your original raw data (a single column list indicating the number of moves) rather than the summary data presented (a two-column list indicating moves and # of wins). The reason is that built-in functions can be used directly on the raw data to compute skew and kurtosis that might offer hope for early screening to determine whether to consider a normal distribution further.
 
Upvote 0
Solution
Jennifer,
I'm sorry for the delay...I got sidetracked.

. . .

For the particular data set here, I followed a methodology described in this tutorial:

I created a workbook based on your data and the Tukey Lambda method shown in the video. I will need to review this file further to refresh my memory about some details, but wanted to share it with you so that you can explore it too.
On one worksheet, you will see the Lambda calculated suggests that a normal distribution is not appropriate. On another worksheet or two I investigated how to reconstitute your original raw data (a single column list indicating the number of moves) rather than the summary data presented (a two-column list indicating moves and # of wins). The reason is that built-in functions can be used directly on the raw data to compute skew and kurtosis that might offer hope for early screening to determine whether to consider a normal distribution further.

Thanks for the reference and the workbook. I fear that some of it may be over my head, but I'll take a look. I really appreciate the edxtra time you took on this. I'm going to mark this as a solution. If I run into problems trying to wrap my head around the details, I'll repost.
 
Upvote 0

Forum statistics

Threads
1,223,164
Messages
6,170,444
Members
452,326
Latest member
johnshaji

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top