Minimum sample size for 'n' allowed errors, given required accuracy and confidence (statistics)

SerenityNetworks

Board Regular
Joined
Aug 13, 2009
Messages
131
Office Version
  1. 365
Platform
  1. Windows
I use the workbook referenced below to provide me with the minimum sample sizes needed for a range of allowed errors, given a required accuracy and required confidence.

In a round-about way, I fill-down the BINOMDIST function in a column to provide me with the probability/confidence that my accuracy requirement will be met for a given sample size and given number of errors. This allows me to then look up the sample size where the probability/confidence I'm looking has been met (by the BINOMDIST formula).

The solution works perfectly. The results have been validated many times. However, it produces a huge (100+ mb) Excel file that is slow to calculate. In the example file provided I'm only giving 5 columns of formulas that are filled-down 200 rows. To accommodate the ranges I need to work with, I must fill-down 50k or more rows and across 75 or more columns. That makes for huge workbooks that are slow to calculate.

I'm wondering if anyone can show me a more elegant method to accomplish my goal. I've tried, but haven't been able to get away from using a massive number of columns and rows. It seems there should be a single formula that will do what I need, but I'm not mathematically inclined enough to be able to figure it out. Any help will be appreciated.

Thanks in advance,
Andrew

Dropbox link to workbook directory

or link to workbook directly if above doesn't work.
 
Last edited:

Excel Facts

Do you hate GETPIVOTDATA?
Prevent GETPIVOTDATA. Select inside a PivotTable. In the Analyze tab of the ribbon, open the dropown next to Options and turn it off
First, I'd like to say that this is not the right solution to your problem; there is surely a closed-form solution that would be much, much faster. If you go ask on math.stackexchange.com, someone will likely give you the answer in minutes, and if you link to it, someone here can help you implement it in Excel. (I suggest you compose the question very carefully and thoroughly; they are not likely to go look at your workbook.)

That said,

[Table="width:, class:grid"][tr][td="bgcolor:#C0C0C0"][/td][td="bgcolor:#C0C0C0"]
A​
[/td][td="bgcolor:#C0C0C0"]
B​
[/td][td="bgcolor:#C0C0C0"]
C​
[/td][/tr][tr][td="bgcolor:#C0C0C0"]
1​
[/td][td="bgcolor:#F3F3F3"]
Accuracy
[/td][td]
90%​
[/td][td][/td][/tr]
[tr][td="bgcolor:#C0C0C0"]
2​
[/td][td="bgcolor:#F3F3F3"]
Confidency
[/td][td]
80%​
[/td][td][/td][/tr]
[tr][td="bgcolor:#C0C0C0"]
3​
[/td][td][/td][td][/td][td][/td][/tr]
[tr][td="bgcolor:#C0C0C0"]
4​
[/td][td="bgcolor:#F3F3F3"]
Allowed Errors
[/td][td="bgcolor:#F3F3F3"]
Min Samples
[/td][td][/td][/tr]
[tr][td="bgcolor:#C0C0C0"]
5​
[/td][td]
0​
[/td][td="bgcolor:#E5E5E5"]
16​
[/td][td]B5: {=MATCH(TRUE, 1 - BINOMDIST($A5, RowVec(1, 50000), 1-$B$1, TRUE) >= $B$2, 0)}[/td][/tr]
[tr][td="bgcolor:#C0C0C0"]
6​
[/td][td]
1​
[/td][td="bgcolor:#E5E5E5"]
29​
[/td][td][/td][/tr]
[tr][td="bgcolor:#C0C0C0"]
7​
[/td][td]
2​
[/td][td="bgcolor:#E5E5E5"]
42​
[/td][td][/td][/tr]
[tr][td="bgcolor:#C0C0C0"]
8​
[/td][td]
3​
[/td][td="bgcolor:#E5E5E5"]
54​
[/td][td][/td][/tr]
[tr][td="bgcolor:#C0C0C0"]
9​
[/td][td]
4​
[/td][td="bgcolor:#E5E5E5"]
66​
[/td][td][/td][/tr]
[tr][td="bgcolor:#C0C0C0"]
10​
[/td][td]
5​
[/td][td="bgcolor:#E5E5E5"]
78​
[/td][td][/td][/tr]
[tr][td="bgcolor:#C0C0C0"]
11​
[/td][td]
6​
[/td][td="bgcolor:#E5E5E5"]
90​
[/td][td][/td][/tr]
[tr][td="bgcolor:#C0C0C0"]
12​
[/td][td]
7​
[/td][td="bgcolor:#E5E5E5"]
101​
[/td][td][/td][/tr]
[tr][td="bgcolor:#C0C0C0"]
13​
[/td][td]
8​
[/td][td="bgcolor:#E5E5E5"]
113​
[/td][td][/td][/tr]
[tr][td="bgcolor:#C0C0C0"]
14​
[/td][td]
9​
[/td][td="bgcolor:#E5E5E5"]
124​
[/td][td][/td][/tr]
[tr][td="bgcolor:#C0C0C0"]
15​
[/td][td]
10​
[/td][td="bgcolor:#E5E5E5"]
135​
[/td][td][/td][/tr]
[tr][td="bgcolor:#C0C0C0"]
465​
[/td][td]
460​
[/td][td="bgcolor:#E5E5E5"]
4,780​
[/td][td][/td][/tr]
[tr][td="bgcolor:#C0C0C0"]
466​
[/td][td]
461​
[/td][td="bgcolor:#E5E5E5"]
4,791​
[/td][td][/td][/tr]
[tr][td="bgcolor:#C0C0C0"]
467​
[/td][td]
462​
[/td][td="bgcolor:#E5E5E5"]
4,801​
[/td][td][/td][/tr]
[tr][td="bgcolor:#C0C0C0"]
468​
[/td][td]
4943​
[/td][td="bgcolor:#E5E5E5"]
50,000​
[/td][td][/td][/tr]
[/table]


The (brutal) formula in B5 does exactly what your workbook does without the intermediate data. It uses a UDF (RowVec) just to generate a constant array:

Code:
Function RowVec(iBeg As Long, iEnd As Long, _
                Optional ByVal iStep As Long = 1&) As Variant

  ' shg 2006
  ' Returns a 1-based, 1D array from iBeg to iEnd stepping iStep

  Dim nOut          As Long
  Dim aiOut()       As Long
  Dim iOut          As Long

  If iStep <> 0 Then
    iStep = Sgn(iEnd - iBeg) * Abs(iStep)

    nOut = (iEnd - iBeg) \ iStep + 1
    ReDim aiOut(1 To nOut)
    aiOut(1) = iBeg

    For iOut = 2 To nOut
      aiOut(iOut) = aiOut(iOut - 1) + iStep
    Next iOut

    RowVec = aiOut
  End If
End Function

The UDF avoids the ROW(INDIRECT(1:50000)) construct that would make the formula volatile, which you seriously do not want to do.
 
Last edited:
Upvote 0
Thank you very much. Your UDF helps a lot, even if no other solutions are provided. But I will definitely check out math.stackexchange.com. I'm aware that the brute force solution I created is not ideal. But it works and believe it or not has been a great improvement over previous tools I was given to use.

I'll post back what I find out.

Thanks again,
Andrew
 
Upvote 0
I haven't received a reply yet at stackoverflow, but I'm still trying to understand your UDF.

The array formula in the worksheet is pretty straightforward. But in the UDF I'm not seeing how iBeg is any value other than 1. Obviously it increments, but how? If iBeg increments then why not iEnd? How does iStep increment? I'm not following.

In the same vein, it would seem that I could increase the array from 1,50000 to 1,70000 if I wanted to increase the permissible range of allowed errors. That's not the case. Changing 50k to 70k actually decreases the range from 4943 to 428. I'm clueless as to why.

Would you mind helping me in understanding the UDF?

Thanks,
Andrew
 
Upvote 0
The UDF is brainless -- RowVec(3,7) returns the sequence {3,4,5,6,7}
 
Upvote 0
Post a link to your stackexchange question?
 
Upvote 0
The UDF is brainless -- RowVec(3,7) returns the sequence {3,4,5,6,7}
Aaaaah. Okay. Now I can follow - I think.

Still, I'm at a loss as to why it throws #N/A when I enter larger values. If RowVec(1,50000) returns {1,2,3,...49999,50000} then why do I get #N/A if I use RowVec(1,70000)?

For example, I enter an accuracy of 99.95% with a confidence of 80%. I use RowVec(1,65536). I start in row 6 with 23. It makes sense that I get #N/A at Allowed Errors of 28 as the minimum sample would be 66814 and that is higher than my defined sequence. But if I enter RowVec(1,66815) or even RowVec(1,70000) then I get #N/A. I am not understanding why.

My post at stackexchange is here.
 
Upvote 0
Plan B:

[Table="width:, class:grid"][tr][td="bgcolor:#C0C0C0"][/td][td="bgcolor:#C0C0C0"]
A​
[/td][td="bgcolor:#C0C0C0"]
B​
[/td][td="bgcolor:#C0C0C0"]
C​
[/td][/tr][tr][td="bgcolor:#C0C0C0"]
1​
[/td][td="bgcolor:#F3F3F3"]
Accuracy
[/td][td]
99.00%​
[/td][td][/td][/tr]
[tr][td="bgcolor:#C0C0C0"]
2​
[/td][td="bgcolor:#F3F3F3"]
Confidence
[/td][td]
80.00%​
[/td][td][/td][/tr]
[tr][td="bgcolor:#C0C0C0"]
3​
[/td][td][/td][td][/td][td][/td][/tr]
[tr][td="bgcolor:#C0C0C0"]
4​
[/td][td="bgcolor:#F3F3F3"]
Allowed Errors
[/td][td="bgcolor:#F3F3F3"]
Min Samples
[/td][td][/td][/tr]
[tr][td="bgcolor:#C0C0C0"]
5​
[/td][td]
0​
[/td][td="bgcolor:#E5E5E5"]
161​
[/td][td]B5: =NumTrials(A5, $B$2, 1-$B$1)[/td][/tr]
[tr][td="bgcolor:#C0C0C0"]
6​
[/td][td]
1​
[/td][td="bgcolor:#E5E5E5"]
299​
[/td][td][/td][/tr]
[tr][td="bgcolor:#C0C0C0"]
7​
[/td][td]
2​
[/td][td="bgcolor:#E5E5E5"]
427​
[/td][td][/td][/tr]
[tr][td="bgcolor:#C0C0C0"]
8​
[/td][td]
5​
[/td][td="bgcolor:#E5E5E5"]
790​
[/td][td][/td][/tr]
[tr][td="bgcolor:#C0C0C0"]
9​
[/td][td]
10​
[/td][td="bgcolor:#E5E5E5"]
1,364​
[/td][td][/td][/tr]
[tr][td="bgcolor:#C0C0C0"]
10​
[/td][td]
20​
[/td][td="bgcolor:#E5E5E5"]
2,471​
[/td][td][/td][/tr]
[tr][td="bgcolor:#C0C0C0"]
11​
[/td][td]
50​
[/td][td="bgcolor:#E5E5E5"]
5,686​
[/td][td][/td][/tr]
[tr][td="bgcolor:#C0C0C0"]
12​
[/td][td]
100​
[/td][td="bgcolor:#E5E5E5"]
10,930​
[/td][td][/td][/tr]
[tr][td="bgcolor:#C0C0C0"]
13​
[/td][td]
200​
[/td][td="bgcolor:#E5E5E5"]
21,277​
[/td][td][/td][/tr]
[tr][td="bgcolor:#C0C0C0"]
14​
[/td][td]
500​
[/td][td="bgcolor:#E5E5E5"]
51,964​
[/td][td][/td][/tr]
[tr][td="bgcolor:#C0C0C0"]
15​
[/td][td]
1,000​
[/td][td="bgcolor:#E5E5E5"]
102,739​
[/td][td][/td][/tr]
[tr][td="bgcolor:#C0C0C0"]
16​
[/td][td]
2,000​
[/td][td="bgcolor:#E5E5E5"]
203,836​
[/td][td][/td][/tr]
[tr][td="bgcolor:#C0C0C0"]
17​
[/td][td]
5,000​
[/td][td="bgcolor:#E5E5E5"]
506,012​
[/td][td][/td][/tr]
[tr][td="bgcolor:#C0C0C0"]
18​
[/td][td]
10,000​
[/td][td="bgcolor:#E5E5E5"]
1,008,465​
[/td][td][/td][/tr]
[/table]


Code:
Function NumTrials(numSucc As Long, Conf As Double, p As Double) As Long
  Dim cdf           As Double   ' cumulative distribution function
  Dim n             As Long     ' trials

  n = numSucc
  
  With WorksheetFunction
    Do
      cdf = .Binom_Dist(numSucc, n, p, True)
      n = n + 1
    Loop While 1 - cdf < Conf
  End With

  NumTrials = n - 1
End Function

The good news is, it only calculates as many as it needs. The bad news is, worksheet functions called from VBA are not as fast as called from formulas.

I still feel certain there's a mathematical simplification, like the one user Dap gave me at https://math.stackexchange.com/questions/2446752/erlang-c-for-large-numbers.
 
Last edited:
Upvote 0
Excellent! And it's even a UDF I can easily follow. Thank you very much. The speed is not an issue. With the ability to apply it to just the allowed errors I desire to use, I usually won't need to use it more than four times in a workbook. Eight uses in a workbook would be the maximum.

Did you see the update I made to my post on stackexchange? It contains the original function used by the online tool I employed in validating my Excel tool. I've just not been able to reduce it to where I could apply it successfully. Forty plus years ago it was within my ability, but not today. If you can apply it that would be cool. Ultimately, I'd like to (1) calculate minimum sample size given allowed errors, required accuracy, and required confidence, (2) calculate the confidence level given the count of errors, required accuracy, and sample size, and (3) calculate the number of errors allowed given the sample size, required accuracy, and required confidence. All of that should be able to be derived from the function I provided. I just don't know how.

Thanks again,
Andrew
 
Upvote 0
Glad it helps.

Yes, I did see the update. I posted a question of my own at https://math.stackexchange.com/q/2483017/ for a recurrence relation for the binomial CDF, and got one. I was excited because the UDF runs about 60 times faster than the last one:

Code:
Function NumTrials2(k As Long, Conf As Double, p As Double) As Long
  Dim cdf           As Double   ' cumulative distribution function
  Dim pmf           As Double
  Dim n             As Long     ' trials

  n = k
  cdf = 1#      ' F(k, k, p)
  pmf = p ^ k   ' f(k, k, p)

  Do
    ' F(k, n+1, p) = F(k, n, p) - p * f(k, n, p)
    cdf = cdf - p * pmf
    ' pmf(k, n+1, p) via recurrence relation
    pmf = (n + 1) / (n + 1 - k) * (1 - p) * pmf
    If pmf = 0 Then Stop
    n = n + 1
  Loop While 1 - cdf < Conf

  NumTrials2 = n
End Function

It gives the same answers to a point, but dies when the PMF falls below the level at which it can be subtracted from the CDF.
 
Upvote 0

Forum statistics

Threads
1,224,816
Messages
6,181,141
Members
453,021
Latest member
Justyna P

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top