Generate a normal distribution from mean, upper, and lower confidence intervals

divangou · Jun 18, 2014

I have a set of variables I would like to generate distributions for and draw random numbers from for a Monte Carlo Simulation. For each variable, I have the mean, Upper 95% CI, and lower 95% CI. The upper and lower CI's are almost never equal.

Any help on the most efficient way to do this in excel would be greatly appreciated.

Divangou

divangou · Jun 19, 2014

Thank you all for your thoughtful posts. This is a wonderful discussion board. I am afraid I am in deep over my statistical head here. You are all helping to prevent drowning.

To answer your questions:
--The application is environmental - carbon accounting of forest-based carbon emissions from a particular geography.
--Yes, negative values are not possible, since the means represent biomass per unit area at different locations in the forest.
--Yes, I'm really not sure WHAT kind of distribution is described by this data. I should not have said it was normal. However, forest biomass IS a natural phenomenon, which must approach a normal distribution in nature (right?), but the sample-base estimates obviously do not.
--I have 8031 of these sample estimates, and each was derived by a first-stage statistical process to estimate uncertainty (a Monte Carlo as well I think) - this is the process which gave me the confidence values. I believe that N=100 in this case refers to the fact that the process ran 100 iterations. I did not work on this first stage process, but was given the data by a colleague
--Yes, I DO have a standard deviation value for each of these as well, but to be honest, I'm not sure what stdev really means in this case, since as you all deduce, the distribution is skewed.

Perhaps I just need to spend a lot more time with a statistics textbook before I go any further with this....

shg · Jun 19, 2014

I think there are three steps:

1. Determine the type of distribution you're dealing with, either from first principles (carbon exchange mechanisms?) or analysis of your data.

2. Deduce the relevant parameters of the distribution for each of your data series

3. Generate random deviates based on the distribution and its parameters.

In the alternative, if you have all the raw data, you can numerically create the CDF for each dataset, and then use a random normal deviate to do an interpolated lookup to generate more deviates from the same distribution. Cheap, cheerful, and mathematically sound if you have a lot of samples, though you would never generate deviates outside the bounds of the original samples.

shg · Jun 19, 2014

... and then use a random normal deviate ...

Sorry,

... and then use a random uniform deviate ...

joeu2004 · Jun 19, 2014

divangou said:
I have a set of variables I would like to generate distributions for and draw random numbers from for a Monte Carlo Simulation. For each variable, I have the mean, Upper 95% CI, and lower 95% CI. The upper and lower CI's are almost never equal.

divangou said:
The problem is that the two Confidence Intervals are not equal, suggesting that the distribution is skewed. I'm no statistician either, so I may be getting my terminology wrong [....] The question is how to back-engineer the skewness based on the inputs I have mean and 2 CIs...I think. Here's a sub-sample of the data.

[TABLE="width: 245"]
<tbody>[TR]
[TD="align: right"]Mean
[/TD]
[TD="align: right"]Low 95% CI[/TD]
[TD="align: right"]High 95% CI[/TD]
[TD="align: right"]N[/TD]
[/TR]
[TR]
[TD="align: right"]115.65[/TD]
[TD="align: right"]0[/TD]
[TD="align: right"]252.02[/TD]
[TD="align: right"]100[/TD]
[/TR]
[TR]
[TD="align: right"]208.31[/TD]
[TD="align: right"]87.09[/TD]
[TD="align: right"]348.64[/TD]
[TD="align: right"]100[/TD]
[/TR]
[TR]
[TD="align: right"]384.91[/TD]
[TD="align: right"]262.09[/TD]
[TD="align: right"]603.61[/TD]
[TD="align: right"]100[/TD]
[/TR]
[TR]
[TD="align: right"]224.61[/TD]
[TD="align: right"]78.11[/TD]
[TD="align: right"]392.8[/TD]
[TD="align: right"]100[/TD]
[/TR]
[TR]
[TD="align: right"]72.09[/TD]
[TD="align: right"]0[/TD]
[TD="align: right"]202.77[/TD]
[TD="align: right"]100[/TD]
[/TR]
[TR]
[TD="align: right"]215.68[/TD]
[TD="align: right"]77.25[/TD]
[TD="align: right"]363.99[/TD]
[TD="align: right"]100[/TD]
[/TR]
[TR]
[TD="align: right"]145.65[/TD]
[TD="align: right"]14.62[/TD]
[TD="align: right"]277.76[/TD]
[TD="align: right"]100[/TD]
[/TR]
[TR]
[TD="align: right"]192.74[/TD]
[TD="align: right"]85.3[/TD]
[TD="align: right"]347.23[/TD]
[TD="align: right"]100[/TD]
[/TR]
[TR]
[TD="align: right"]188.14[/TD]
[TD="align: right"]59.29[/TD]
[TD="align: right"]315.74[/TD]
[TD="align: right"]100[/TD]
[/TR]
[TR]
[TD="align: right"]107.7[/TD]
[TD="align: right"]0[/TD]
[TD="align: right"]228.55[/TD]
[TD="align: right"]100[/TD]
[/TR]
[TR]
[TD="align: right"]110.78[/TD]
[TD="align: right"]0[/TD]
[TD="align: right"]242.27[/TD]
[TD="align: right"]100[/TD]
[/TR]
[TR]
[TD="align: right"]183.73[/TD]
[TD="align: right"]70.87[/TD]
[TD="align: right"]336.32[/TD]
[TD="align: right"]100[/TD]
[/TR]
[TR]
[TD="align: right"]127.85[/TD]
[TD="align: right"]1.55[/TD]
[TD="align: right"]286.09[/TD]
[TD="align: right"]100[/TD]
[/TR]
[TR]
[TD="align: right"]127.54[/TD]
[TD="align: right"]24.44[/TD]
[TD="align: right"]265.96[/TD]
[TD="align: right"]100[/TD]
[/TR]
[TR]
[TD="align: right"]56.3[/TD]
[TD="align: right"]0[/TD]
[TD="align: right"]169.58[/TD]
[TD="align: right"]100[/TD]
[/TR]
[TR]
[TD="align: right"]70.82[/TD]
[TD="align: right"]0[/TD]
[TD="align: right"]198.1[/TD]
[TD="align: right"]100[/TD]
[/TR]
</tbody>[/TABLE]

To summarize the following, the Monte Carlo simulation might generate random data "with 95% confidence" using NORMINV(R1,M1,S1), where:

R1: =RAND()
M1: =AVERAGE(A1:A16)
S1: =TINV(1-95%,COUNT(A1:A16)-1)*STDEV(A1:A16)*SQRT(100)

assuming the means of the samples (left-most column above) are in A1:A16.

Alternatively, you might create a range for each data point "with 95% confidence" using a low of NORMINV(R1,M2,S1) and a high of NORMINV(R1,M3,S2), where:

M2: =AVERAGE(A1:A16)-TINV(1-95%,COUNT(A1:A16)-1)*STDEV(A1:A16)
M3: =AVERAGE(A1:A16)+TINV(1-95%,COUNT(A1:A16)-1)*STDEV(A1:A16)

-----

I believe you are misusing terminology and misinterpreting the data that you have, and based on that, there seems to be a lot of misdirection in this thread.

First, a "confidence interval" is a measure of the estimate of a population parameter or sample statistic. It is not a measure of the distribution of the population or sample. For example, for a normal distribution, we might talk about the confidence interval of the sample or population mean.

Second, the confidence interval of the mean is always symmetrical. Specifically, it is u +/- se, where "u" is the population or sample mean, and "se" is the "standard error" of the mean. This is derived from the Central Limit Theorem, which states in effect that the means of equal-sized random samples from a normally-distributed population are normally distributed. The "standard error" of the mean is the standard deviation of the distribution of the means.

Note that the means of the samples are normally distributed. That does not imply the population is normally distributed. But for non-normally-distributed poplulations, the means of the samples tend to be normally distributed as the size of the samples increases. Thus, the average of the means of the samples is an estimator of the mean of the population.

So my interpretation of your data is: we have 16 samples of size 100 from a population of unknown distribution.

I suspect the so-called low and high "95% confidence intervals" [sic; read: confidence limits) probably mean that 95% of each sample data are between those limits. These are not confidence limits. Instead, they are the central 95%ile range of the sample data. As such, they are not particularly interesting or useful.

If the means of the samples are in A1:A16, AVERAGE(A1:A16), about 158.28, is the average of the means of the samples. And STDEV(A1:A16), about 81.00, is the sample standard error of the means of the samples.

If we want to state the average of the means with 95% confidence, the 95% confidence limits would be +/-NORMSINV(1-(1-95%)/2), about 1.96, times the standard error if we had a "large" number of samples. For a small number of samples, the 95% confidence limits are +/-TINV(1-95%,COUNT(A1:A16)-1), about 2.13, times the standard error.

So we are 95% confident that the average of the means is about 158.28 +/- 172.66. (172.66 =~ TINV(1-95%,COUNT(A1:A16)-1)*STDEV(A1:A16).)

Note that the relative standard error, about 81.00, is about 51.18% of the average of the means. Typically, that is too large an interval to be useful. For a tighter interval, we need the means of more samples -- probably many more. We can estimate the number of required samples.

Based on the CLT, we use AVERAGE(A1:A16) as the estimated mean of the population data from which we drew 16 samples.

If we assume the population data is normally distributed (note: we have no reason to make that assumption), we are 95% confident that the estimated standard deviation of the population is the 95% confidence limit of the means times SQRT(100), where 100 is the sample size.

That is because the CLT tells us: se = sd / Sqrt

, where "sd" is the sample standard deviation of a sample of size n.

Thus, the estimated standard deviation of the population data is TINV(1-95%,COUNT(A1:A16)-1)*STDEV(A1:A16)*SQRT(100), about 1726.57.

So your Monte Carlo simulation might generate random data using NORMINV(RAND(),158.28,1726.57), substituting the actual formulas for those constants.

Needless to say, that is likely to be a very poor simulation of the population data for several reasons:

1. As noted, the relative standard error is unreasonably high.

2. We have no reason to assume the population data is normally distributed. That is, you have not given us a reason.

3. Some people (not I) believe NORMINV(RAND(),...) results in a very poor random sample, largely because of claims that the Excel RAND has poor random properties, especially before Excel 2010 and even the new implementation starting in Excel 2010.

joeu2004 · Jun 19, 2014

divangou said:
Yes, I'm really not sure WHAT kind of distribution is described by this data. I should not have said it was normal. However, forest biomass IS a natural phenomenon, which must approach a normal distribution in nature (right?), but the sample-base estimates obviously do not.

"Must" approach a normal distribution? No.

But "might" approach a normal distribution? Yes. What does literature in your field say about that?

Absent information to the contrary, it is not unusual for natural scientists of many fields to leap to that conclusion.

divangou said:
I have 8031 of these sample estimates, and each was derived by a first-stage statistical process to estimate uncertainty (a Monte Carlo as well I think)

If those are 8031 real samples based on measurements of the actual phenomenon, based on the 16 sample means that you posted, I estimate you can know the average of the sample means, ergo the estimated mean of the population, within about +/-0.6%.

But if the 16 sample means that you posted are the results of a Monte Carlo simulation, they are useless for estimating (improving) the mean and standard deviation of subsequent Monte Carlo simulations using the methods I described. GIGO!

divangou said:
Yes, I DO have a standard deviation value for each of these as well, but to be honest, I'm not sure what stdev really means in this case, since as you all deduce, the distribution is skewed.

The standard deviation of the sample(s) is useless for the boot-strap procedure that I described previously.

But if you have only one real sample (ignore any simulated results), the mean and the standard deviation of that sample are the best data that you can use.

As you said, even if the sample distribution is "skewed" (it almost always is for small samples), it is not unreasonable to assume it is a sample from a normally-distributed population if the data represents a natural phenomenon, absent any information to the contrary.

joeu2004 · Jun 19, 2014

Errata....

joeu2004 said:
I believe you are misusing terminology and misinterpreting the data that you have, and based on that, there seems to be a lot of misdirection in this thread.

First, a "confidence interval" is a measure of the estimate of a population parameter or sample statistic. It is not a measure of the distribution of the population or sample. For example, for a normal distribution, we might talk about the confidence interval of the sample or population mean.

Second, the confidence interval of the mean is always symmetrical. Specifically, it is u +/- se, where "u" is the population or sample mean, and "se" is the "standard error" of the mean. This is derived from the Central Limit Theorem, which states in effect that the means of equal-sized random samples from a normally-distributed population are normally distributed. The "standard error" of the mean is the standard deviation of the distribution of the means.

Note that the means of the samples are normally distributed. That does not imply the population is normally distributed. But for non-normally-distributed poplulations, the means of the samples tend to be normally distributed as the size of the samples increases. Thus, the average of the means of the samples is an estimator of the mean of the population.

So my interpretation of your data is: we have 16 samples of size 100 from a population of unknown distribution.

I suspect the so-called low and high "95% confidence intervals" [sic; read: confidence limits) probably mean that 95% of each sample data are between those limits. These are not confidence limits. Instead, they are the central 95%ile range of the sample data. As such, they are not particularly interesting or useful.

I believe the above is correct, but the formulas I provided are incorrect due to some mistakes. I don't have time to fix everything. And it seems moot, given the evolving description of the problem.

I "withdraw" my comments.

Generate a normal distribution from mean, upper, and lower confidence intervals

divangou

New Member

divangou

New Member

Excel Facts

shg

MrExcel MVP

shg

MrExcel MVP

joeu2004

Banned user

joeu2004

Banned user

joeu2004

Banned user

Similar threads

Forum statistics

Share this page

Generate a normal distribution from mean, upper, and lower confidence intervals

divangou

New Member

divangou

New Member

Excel Facts

shg

MrExcel MVP

shg

MrExcel MVP

joeu2004

Banned user

joeu2004

Banned user

joeu2004

Banned user

Similar threads

Forum statistics

Share this page

We've detected that you are using an adblocker.

Which adblocker are you using?

Disable AdBlock

Disable AdBlock Plus

Disable uBlock Origin

Disable uBlock