How to fit a truncated normal distribution to existing data?

JenniferMurphy · Jul 22, 2022

I have some data that I have been collecting from a game of solitaire. When I plot it on a scatter chart, it looks like a normal distribution, or more precisely a truncated normal distribution. I have been able to more or less fit an ordinary (untruncated) normal distribution to the data after some trial and error (mostly error), but my attempts at fitting a truncated normal distribution have failed.

xl2bb tells me that the data is too large to post here, so I have uploaded the workbook to this OneDrive folder.

Solitaire Data

The workbook has 2 sheets. One has my data and the equations I have tried to make fit. The other has detailed explanations. There is also a csv file with just the data.

Here's one of the charts:

These normal distributions almost fit, but I think a truncated normal would fit better, because the data is actually truncated on the left. This is all explained in the Overview sheet.

JenniferMurphy · Jul 22, 2022

I understand that some may be reluctant to download a workbook from an unknown source, so I'll try to provide more information here.

The complete dataset is too large for xl2bb, but here are the first 20+ rows. (The colors match the plots below)

Solitaire Data.xlsx

	B	C	D	E	F	G
3	Mean	85.36853	85.36853	85.36853	82.00000	85.36853
4	Std Dev	7.5333	7.5333	7.5333	5.0000	7.53330
5	MaxEq	n/a	0.05296	0.05296	0.07979	
6	MaxTbl					
7	MaxTbNdx	83	85	85	82	
8	Moves	Wins	Norm Dist	Scaled ND	Adj SD	Trunc ND
9	52	0				
10	53	0				
11	54	0				
12	55	0				
13	56	0				
14	57	0				
15	58	0				
16	59	0				
17	60	0				
18	61	0				
19	62	0				
20	63	1				
21	64	0				
22	65	0				
23	66	0				
24	67	0				
25	68	0				
26	69	3				
27	70	7				
28	71	9				
29	72	9				
30	73	15				
Sheet3

Cell Formulas
Range	Cell	Formula
C3	C3	=SUMPRODUCT(Moves,Wins)/SUM(Wins)
D3:E3,G3	D3	=WinsMean
C4	C4	=STDEV(Wins)
D4:E4,G4	D4	=WinsStdDev
D5:F5	D5	=1/StdDevs/SQRT(2*PI())
C6	C6	=MAX(Wins)
D6	D6	=MAX(NormDist)
E6	E6	=MAX(ScaledND)
F6	F6	=MAX(AdjustedSD)
G6	G6	=MAX(TruncND)
C7	C7	=INDEX(Moves,XMATCH(@MaxTbls,Wins))
D7	D7	=INDEX(Moves,XMATCH(@MaxTbls,NormDist))
E7	E7	=INDEX(Moves,XMATCH(@MaxTbls,ScaledND))
F7	F7	=INDEX(Moves,XMATCH(@MaxTbls,AdjustedSD))
D9:D30	D9	=NORM.DIST(Moves,Means,StdDevs,FALSE)
E9:F30	E9	=NORM.DIST(Moves,Means,StdDevs,FALSE)/MaxEqs*WinsMaxTbl
G9:G10	G9	=NORM.DIST(Wins,Means,StdDevs,FALSE)/(1-NORM.DIST(52,Means,StdDevs,TRUE))
G11:G30	G11	=NORM.DIST(Wins,Means,StdDevs,FALSE)/(NORM.DIST(192,Means,StdDevs,TRUE)-NORM.DIST(52,Means,StdDevs,TRUE))

Named Ranges
Name	Refers To	Cells
AdjustedSD	=OFFSET(Sheet3!AdjustedSDHdr,1,0):OFFSET(Sheet3!AdjustedSDFtr,-1,0)	F6:F7
AdjustedSDFtr	=Sheet3!$F$150	F6:F7
AdjustedSDHdr	=Sheet3!$F$8	F6:F7
AdjustedSDSD	=Sheet3!$F$4	F5, F9:F30
MaxTblND	=Sheet3!$D$6	C7:F7
MaxTbls	=Sheet3!$6:$6	C7:F7
MaxTblScND	=Sheet3!$E$6	C7:F7
Moves	=OFFSET(Sheet3!MovesHdr,1,0):OFFSET(Sheet3!MovesFtr,-1,0)	C7:F7, D9:F30, C3
MovesFtr	=Sheet3!$B$150	C7:F7, D9:F30, C3
MovesHdr	=Sheet3!$B$8	C7:F7, D9:F30, C3
NormDist	=OFFSET(Sheet3!NormDistHdr,1,0):OFFSET(Sheet3!NormDistFtr,-1,0)	D6:D7
NormDistFtr	=Sheet3!$D$150	D6:D7
NormDistHdr	=Sheet3!$D$8	D6:D7
ScaledND	=OFFSET(Sheet3!ScaledNDHdr,1,0):OFFSET(Sheet3!ScaledNDFtr,-1,0)	E6:E7
ScaledNDFtr	=Sheet3!$E$150	E6:E7
ScaledNDHdr	=Sheet3!$E$8	E6:E7
TruncND	=OFFSET(Sheet3!TruncNDHdr,1,0):OFFSET(Sheet3!TruncNDFtr,-1,0)	G6
TruncNDFtr	=Sheet3!$G$150	G6
TruncNDHdr	=Sheet3!$G$8	G6
Wins	=OFFSET(Sheet3!WinsHdr,1,0):OFFSET(Sheet3!WinsFtr,-1,0)	G9:G30, C3:C4, C6:C7
WinsFtr	=Sheet3!$C$150	G9:G30, C3:C4, C6:C7
WinsHdr	=Sheet3!$C$8	G9:G30, C3:C4, C6:C7
WinsMaxTbl	=Sheet3!$C$6	C7:F7, E9:F30
WinsMean	=Sheet3!$C$3	G3, D3:E3
WinsStdDev	=Sheet3!$C$4	G4, D4:E4

My data is truncated on the left only, so in principle it lives on [52, ∞). The actual data is on [52,192]. Here’s the scatter plot:

I calculated the mean and std dev as:

Mean = 85.36853 (=SUMPRODUCT(Moves,Wins)/SUM(Wins))
Std Dev = 7.5333 (=STDEV(Wins))
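A side note on the two formulas above: the mean is weighted by Wins, but STDEV(Wins) measures the spread of the win counts themselves rather than the spread of Moves. If the intent is the standard deviation of Moves, a frequency-weighted version is one option. The sketch below is in Python rather than Excel, the function name `weighted_stats` is made up for illustration, and the data are only the rows shown in the excerpt above, not the full dataset.

```python
import math

def weighted_stats(moves, wins):
    """Return (mean, sample std dev) of moves, counting each value wins[i] times."""
    n = sum(wins)
    mean = sum(m * w for m, w in zip(moves, wins)) / n
    # Frequency-weighted sample variance (n - 1 denominator, like STDEV)
    var = sum(w * (m - mean) ** 2 for m, w in zip(moves, wins)) / (n - 1)
    return mean, math.sqrt(var)

# Illustrative only: rows 9-30 of the excerpt (moves 52..73), not the full table
moves = list(range(52, 74))
wins = [0] * 11 + [1] + [0] * 5 + [3, 7, 9, 9, 15]

mean, sd = weighted_stats(moves, wins)
print(mean, sd)
```

An Excel counterpart along the same lines would be something like =SQRT(SUMPRODUCT(Wins,(Moves-WinsMean)^2)/(SUM(Wins)-1)), assuming the named ranges from the post.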

Using those parameters, I plotted a normal distribution using this formula:

=NORM.DIST(@Moves,@Means,@StdDevs,FALSE)

Here’s that plot:

This appears to have the basic shape of my data, but on a smaller scale, so I scaled it up by dividing by the maximum value here (0.05296) and multiplying by the maximum of my data (41).

=NORM.DIST(@Moves,@Means,@StdDevs,FALSE)/@MaxEqs*WinsMaxTbl
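For readers following along outside Excel, the peak-matching step can be sketched in Python as below. The constants are the values quoted in the post; the function names are illustrative. Dividing by the curve's own maximum and multiplying by the largest win count forces the two peaks to the same height, which is a visual aid rather than a statistical fit.

```python
import math

def norm_pdf(x, mean, sd):
    """Normal density, the counterpart of NORM.DIST(x, mean, sd, FALSE)."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

MEAN, SD = 85.36853, 7.5333   # values quoted in the post
MAX_WINS = 41                 # largest win count, per the post

# Height of the density at its peak, same as =1/StdDevs/SQRT(2*PI())
max_eq = 1 / (SD * math.sqrt(2 * math.pi))

def scaled_nd(moves):
    """Density rescaled so that its peak equals the tallest data point."""
    return norm_pdf(moves, MEAN, SD) / max_eq * MAX_WINS

print(round(max_eq, 5))   # compare with the 0.05296 in the post
print(scaled_nd(MEAN))    # peak height, equal to MAX_WINS up to rounding
```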

Here’s that plot superimposed on the plot of my data:

It looks roughly like a fit, but shifted to the right and too wide. I fiddled around with the mean and std dev and came up with this:

This looks pretty good, but a truncated normal should fit better, and this is where I am stuck.

Since my data is only truncated on the left, I tried this formula:

=NORM.DIST(@Wins,@Means,@StdDevs,FALSE)/(1-NORM.DIST(52,@Means,@StdDevs,TRUE))

I get values from 6.89187E-30 to 1.55421E-09.

I then tried this one:

=NORM.DIST(@Wins,@Means,@StdDevs,FALSE)/(NORM.DIST(192,@Means,@StdDevs,TRUE)-NORM.DIST(52,@Means,@StdDevs,TRUE))

That gets the same values.

Can anyone tell me what I am doing wrong?
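One possible reading of the tiny values: both formulas pass @Wins (the counts, 0 to 41) as the x argument, where the earlier formulas passed @Moves. With a mean near 85, a density evaluated at x between 0 and 41 is vanishingly small. The Python sketch below reproduces that behaviour under this assumption and then evaluates a left-truncated density at a move count instead. The function names are illustrative, and the parameters are the ones quoted in the post.

```python
import math

def norm_pdf(x, mean, sd):
    """Counterpart of NORM.DIST(x, mean, sd, FALSE)."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def norm_cdf(x, mean, sd):
    """Counterpart of NORM.DIST(x, mean, sd, TRUE)."""
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))

def trunc_norm_pdf(x, mean, sd, lower, upper=math.inf):
    """Normal density restricted to [lower, upper] and renormalised to area 1."""
    if x < lower or x > upper:
        return 0.0
    upper_cdf = 1.0 if math.isinf(upper) else norm_cdf(upper, mean, sd)
    mass = upper_cdf - norm_cdf(lower, mean, sd)
    return norm_pdf(x, mean, sd) / mass

MEAN, SD, LOWER = 85.36853, 7.5333, 52   # values quoted in the post

# Probability mass kept after cutting off everything below 52
mass = 1 - norm_cdf(LOWER, MEAN, SD)

# x taken from the Wins column (0 and 41 are its extremes, per the post)
wrong_lo = norm_pdf(0, MEAN, SD) / mass    # roughly 6.9e-30
wrong_hi = norm_pdf(41, MEAN, SD) / mass   # roughly 1.55e-09

# x taken from the Moves column
right = trunc_norm_pdf(85, MEAN, SD, LOWER)

print(wrong_lo, wrong_hi, right, 1 - mass)
```

With these parameters the cut-off at 52 sits more than four standard deviations below the mean, so `1 - mass` is tiny and the truncated curve is almost indistinguishable from the untruncated one. Adding the upper limit of 192 changes the denominator even less, which is consistent with both formulas returning the same values.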

KRice · Aug 14, 2022

Jennifer,
I'm sorry for the delay...I got sidetracked. I had intended to follow up with you about this. When I first saw your data, I assumed that if a normal distribution was reasonably representative of the data, it would be a truncated normal distribution because the lower left tail is cut off by very real constraints (i.e., you need a minimum number of moves to complete a game, so a win cannot occur below that threshold).

I normally explore these types of data sets by scaling the area under a histogram of the data to 1, so that other standard distributions, whose cumulative distribution functions take values on the interval [0,1], can be compared to the data. When this is done with your data set, the distribution of moves is very peaked, indicating a high kurtosis, and this suggests that a normal distribution may not be an appropriate descriptor for the data.

One thing to be aware of: you can scale the histogram of actual data so that its peak is about the same height as that of a normal curve (I believe that is what you did). When you do this, the curves appear to be somewhat similar, but you can tell that the areas under the curves (red vs. blue curves in the last post) are quite different.
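The area scaling described here can be sketched as follows. This is a Python illustration, not the workbook's method. It assumes one bin per move count (bin width 1), uses made-up function names, and runs on the excerpt rows only, so the resulting numbers show the mechanics rather than the real densities.

```python
def to_density(wins, bin_width=1.0):
    """Rescale bin counts so that the histogram's total area is 1."""
    total = sum(wins)
    return [w / (total * bin_width) for w in wins]

def to_expected_counts(densities, total, bin_width=1.0):
    """Inverse step: turn density values back into counts for a given sample size."""
    return [d * total * bin_width for d in densities]

# Illustrative only: rows 9-30 of the excerpt (moves 52..73), not the full table
moves = list(range(52, 74))
wins = [0] * 11 + [1] + [0] * 5 + [3, 7, 9, 9, 15]

density = to_density(wins)
area = sum(density) * 1.0   # bin width 1, so area is the plain sum
print(area)
```

Once the data are on this scale, a density such as NORM.DIST(..., FALSE) can be overlaid without any peak matching, and a mismatch in area shows up directly as a mismatch in height.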

I noticed in a recent post, "Is there a good statistical package that is not too expensive?" (www.mrexcel.com), that you are exploring distributions and tests of normality.

For the particular data set here, I followed a methodology described in this tutorial:

I created a workbook based on your data and the Tukey Lambda method shown in the video. I will need to review this file further to refresh my memory about some details, but wanted to share it with you so that you can explore it too.

Dropbox

www.dropbox.com

On one worksheet, you will see the Lambda calculated suggests that a normal distribution is not appropriate. On another worksheet or two I investigated how to reconstitute your original raw data (a single column list indicating the number of moves) rather than the summary data presented (a two-column list indicating moves and # of wins). The reason is that built-in functions can be used directly on the raw data to compute skew and kurtosis that might offer hope for early screening to determine whether to consider a normal distribution further.
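A rough Python sketch of that reconstitution step, for anyone who wants to try it outside the workbook: expand the two-column summary into a single list of moves, then compute skewness and excess kurtosis from it. The function names are illustrative and the data are the excerpt rows only. The moment-based formulas below are the plain population versions. Excel's SKEW and KURT apply sample-size corrections, so their results will differ slightly.

```python
import math

def expand(moves, wins):
    """Rebuild the raw list: each move count repeated once per win."""
    return [m for m, w in zip(moves, wins) for _ in range(w)]

def skew_kurtosis(data):
    """Population skewness and excess kurtosis (both 0 for an exact normal)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n   # second central moment
    m3 = sum((x - mean) ** 3 for x in data) / n   # third central moment
    m4 = sum((x - mean) ** 4 for x in data) / n   # fourth central moment
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3

# Illustrative only: rows 9-30 of the excerpt (moves 52..73), not the full table
moves = list(range(52, 74))
wins = [0] * 11 + [1] + [0] * 5 + [3, 7, 9, 9, 15]

raw = expand(moves, wins)
skew, kurt = skew_kurtosis(raw)
print(len(raw), skew, kurt)
```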

JenniferMurphy · Aug 14, 2022

KRice said:
Jennifer,
I'm sorry for the delay...I got sidetracked.

. . .

For the particular data set here, I followed a methodology described in this tutorial:

I created a workbook based on your data and the Tukey Lambda method shown in the video. I will need to review this file further to refresh my memory about some details, but wanted to share it with you so that you can explore it too.

Dropbox

www.dropbox.com

On one worksheet, you will see the Lambda calculated suggests that a normal distribution is not appropriate. On another worksheet or two I investigated how to reconstitute your original raw data (a single column list indicating the number of moves) rather than the summary data presented (a two-column list indicating moves and # of wins). The reason is that built-in functions can be used directly on the raw data to compute skew and kurtosis that might offer hope for early screening to determine whether to consider a normal distribution further.

Thanks for the reference and the workbook. I fear that some of it may be over my head, but I'll take a look. I really appreciate the extra time you took on this. I'm going to mark this as a solution. If I run into problems trying to wrap my head around the details, I'll repost.
