Percentile of n most recent values

Fitzy22

New Member
Joined
Jan 19, 2022
Messages
17
Office Version
  1. 365
Platform
  1. Windows
Hi,

Please see simplified minisheet of my data below.

I would like to calculate the 80th percentile of the most recent (i.e. dates in Column A) 24 values (i.e. values in Column C) for each site (i.e. x and y in Column B). Where there is no value for a particular date it means there is no value recorded and therefore needs to be ignored and doesn't from part of the percentile calculation. So only the 24 most recent "values" should be used.

I hope this adequately explains my requirements, but if there are any questions please ask.

Thanks in advance...

Percentile_calculation.xlsx
ABC
1DateSiteValue
225/12/2006x
321/12/2007x1
421/12/2007x2
518/11/2008x
618/11/2008x
728/11/2008y
828/11/2008y1
928/11/2008y2
1028/11/2008y
1128/11/2008y
1228/11/2008y
1328/11/2008x
1412/01/2009x
1512/01/2009x1
1612/01/2009y1
1712/01/2009y1
1812/01/2009y
1912/01/2009x
2012/01/2009y
2112/01/2009x
2212/01/2009y
2312/01/2009x
2417/01/2009x
2517/01/2009x2
2617/01/2009y
2717/01/2009y
2817/01/2009y
2917/01/2009x
301/01/2010x
311/01/2010x10
3215/02/2010x
3315/02/2010x
3415/02/2010y3
3518/02/2010x
3618/02/2010x
3718/02/2010x
3818/02/2010x
3918/02/2010x
4017/01/2011x4
419/03/2011x8
429/05/2011x3
437/03/2012x1
4426/02/2013x2
457/03/2013x1
461/03/2014x2.1
474/12/2014x3.4
484/12/2014y2.4
497/12/2014y1.2
507/12/2014y1.5
5114/12/2014x2
5214/12/2014x1.8
5329/12/2014x2.7
5416/01/2015x2.3
557/11/2015x
566/02/2016x2.3
5713/05/2016x3.2
5813/05/2016x2.5
5927/06/2016y1.1
6024/09/2016y1.4
6124/09/2016y1
6224/02/2017x1.1
633/03/2018x4.1
645/03/2018x9.1
655/03/2018x9
665/03/2018x7.8
676/03/2018x2.8
688/03/2018x5.3
698/03/2018x5.4
708/03/2018x3.3
719/03/2018x7.8
7224/02/2019x1.6
7324/02/2019x3.2
7431/03/2019y1.6
7531/03/2019y3.7
7625/01/2020y1.9
7727/01/2020x6.7
7827/01/2020x2.7
7927/01/2020x1.3
805/02/2020x4.6
818/03/2020y4.4
828/03/2020x0.8
8326/09/2020x1.6
8426/09/2020x2.2
8525/12/2020x1.7
8610/03/2021x14
8723/03/2021x1.4
8824/06/2021x2.4
8929/10/2021x4.2
9021/01/2022x7.5
9124/01/2022x12
9226/01/2022x2.2
Sheet1
 
Hi @KRice,

I hope you're doing well and was hoping I might bother you for a little more advice please.

I have created an excel table for the first time with a dataset and I'm getting close to getting it to work with the following formula which was derived from your previous assistance. My dataset is located in columns 1 (Location), 2 (Date) and 24 (Sulphates).

=PERCENTILE.INC(INDEX(SORT(FILTER(FILTER(INDEX(Data[[Location]:[Sulphates]],SEQUENCE(COUNT(Data[Date])),{1,2,24}),(Data[Location]=$M$2)*(Data[Sulphates]<>"")),{1,0,1}),2,-1),SEQUENCE(MIN(24,SUM((Data[Location]=$M$2)*(Data[Sulphates]<>"")))),2),0.8)

The most recent 24 values are below but rather than getting 7.02 for the 80th percentile, I'm getting 8.88 (which is the highest value). I thought it may have something to do with the date formatting or the text matching from cell $M$2 but everything is correct.

Can you see anything obvious where I might be going wrong?

5.3
7.8
3.2
1.6
1.6
3.7
1.9
1.3
2.7
6.7
4.6000
4.4
0.8
1.6
2.2
1.7
14
1.4
2.4
4.2
7.5
12
2.2
8.8
 
Upvote 0

Excel Facts

Can you AutoAverage in Excel?
There is a drop-down next to the AutoSum symbol. Open the drop-down to choose AVERAGE, COUNT, MAX, or MIN
With structured references and your revised table structure, give this a try:
Excel Formula:
=PERCENTILE.INC(INDEX(SORT(FILTER(INDEX(Data[[Location]:[Sulphates]],SEQUENCE(COUNT(Data[Date])),{1,2,24}),(Data[Location]=$M$2)*(Data[Sulphates]<>"")),2,-1),SEQUENCE(MIN(24,SUM((Data[Location]=$M$2)*(Data[Sulphates]<>"")))),3),0.8)
 
Upvote 0
Perfect @KRice, you are a marvel!!

So...because there is a table...I only needed the one index?

And it was the third column that needed sorting, not the second?

Regards,
 
Upvote 0
I'm glad you have it working. To answer your last questions..."not quite" for both questions. I'll try to explain. In post #14, I mentioned...
If your table structure is such that the Date, Site, and Value columns are far apart and constructing the outer FILTER column "on/off" array is tedious (lots of 0's), there is another approach. Rather than specifying a very large contiguous range in the inner FILTER, you could feed that large contiguous range into an INDEX function and have it return only the three columns that you need.
With your table structure positioning the relevant columns far apart, it makes sense to return the columns using an array inside the INDEX function to return columns {1,2,24}. Doing so makes the outer FILTER (originally described) unnecessary. One could argue that this is the preferred way to do it anyway. In your post #21, you used the inner INDEX approach (inside the inner FILTER)--that is okay. But then you applied a 2nd (outer) FILTER and instructed the formula to return {1,0,1}, which would return only the 1st and 3rd columns of the 3 columns produced by the inner parts of the formula. This meant that you were getting only the Location and Sulphates columns for all subsequent operations, and the Date column was being excluded even though it is still needed by the SORT function (to position the most recent results at the top of the list). Then when you applied the SORT function to column "2" of the remaining two columns (that's what {1,0,1} left you with), you were sorting on the Sulphates columns...which is not correct. You do want to sort on the 2nd column returned by the FILTER(INDEX inner construction, but you need to keep that 2nd column in order to be able to sort on it. By eliminating the needed 2nd column with the {1,0,1} array, your 2nd column was actually the 3rd column returned by the FILTER(INDEX inner construction.

Then after sorting by date and determining how many data points are available for the percentile calculation, the outer INDEX function is still necessary to extract the 3rd column of the sorted data. You were extracting the 2nd column of your sorted data (meaning the values in Sulphates), which was the correct column, but the previous outer FILTER and SORT operations left you with a jumbled table.

As mentioned before, you could take advantage of Excel 365's LET function to reduce the redundancy in this formula by assigning the filtering criteria to a variable (I've named fcrit), and then referring to fcrit where necessary:
Excel Formula:
=LET(fcrit,(Data[Location]=$M$2)*(Data[Sulphates]<>""),PERCENTILE.INC(INDEX(SORT(FILTER(INDEX(Data[[Location]:[Sulphates]],SEQUENCE(COUNT(Data[Date])),{1,2,24}),fcrit),2,-1),SEQUENCE(MIN(24,SUM(fcrit))),3),0.8))
This construction is a little more difficult to debug, so I normally convert formulas over to this only after their longer versions are working correctly.
 
Upvote 0
Thanks @KRice,

I'm not there yet...still having a couple of issues and lacking understanding of this.

The following formula was replicated from your Post #14 and has the inner and outer filter and index. This still seems to be returning the correct value, even though it contains the two filters. The columns are Date (1), Location (4) and Data (7). Does this still seem correct as I have an entire spreadsheet which uses this formular across about 90 columns?

=PERCENTILE.INC(INDEX(SORT(FILTER(FILTER(INDEX($A$10:G274,SEQUENCE(COUNT($A$10:$A$274)),{1,4,7}),($D$10:$D$274=$F$2)*(G10:G274<>"")),1,0,1}),1,-1),SEQUENCE(MIN(24,SUM(($D$10:$D$274=$F$2)*(G10:G274<>"")))),2),0.8)

Your formula from Post #14

=PERCENTILE.INC(INDEX(SORT(FILTER(FILTER(INDEX($A$2:$G$92,SEQUENCE(COUNT($A$2:$A$92)),{1,3,7}),($C$2:$C$92=I$2)*($G$2:$G$92<>"")),1,0,1}),1,-1),SEQUENCE(MIN(24,SUM( ($C$2:$C$92=I$2)*($G$2:$G$92<>"") ))),2),0.8)

The next formula is the one I did, and simply tried to copy the above formula but with the excel table. The difference is that columns are Location (1), Date (2) and Data (24). If I had used {0,1,1} for the array and 3 at the end (see green text), would that have been correct with the extra filter?

=PERCENTILE.INC(INDEX(SORT(FILTER(FILTER(INDEX(Data[[Location]:[Sulphates]],SEQUENCE(COUNT(Data[Date])),{1,2,24}),(Data[Location]=$M$2)*(Data[Sulphates]<>"")),{1,0,1}),2,-1),SEQUENCE(MIN(24,SUM((Data[Location]=$M$2)*(Data[Sulphates]<>"")))),2),0.8)

The next formula was from your most recent post and it seems to work.

=PERCENTILE.INC(INDEX(SORT(FILTER(INDEX(Data[[Location]:[Sulphates]],SEQUENCE(COUNT(Data[Date])),{1,2,24}),(Data[Location]=$M$2)*(Data[Sulphates]<>"")),2,-1),SEQUENCE(MIN(24,SUM((Data[Location]=$M$2)*(Data[Sulphates]<>"")))),3),0.8)

I have amended this slightly so that I can fix some of the reference cells as per the formula below which returns 7.02 and I have checked the result with =PERCENTILE.INC(X254:X277,0.8) which is also 7.02.

=PERCENTILE.INC(INDEX(SORT(FILTER(INDEX(Data[[Location]:[Location]]:Data[Sulphates],SEQUENCE(COUNT(Data[[Date]:[Date]])),{1,2,24}),(Data[[Location]:[Location]]=$M$2)*(Data[Sulphates]<>"")),2,-1),SEQUENCE(MIN(24,SUM((Data[[Location]:[Location]]=$M$2)*(Data[Sulphates]<>"")))),3),0.8)

...but I am getting some strange results. In the previous column for suspended solids I am using the following formula which gives me 2480. When I check the result with =PERCENTILE.INC(W254:W277,0.8) if get 2280.

=PERCENTILE.INC(INDEX(SORT(FILTER(INDEX(Data[[Location]:[Location]]:Data[Suspended Solids],SEQUENCE(COUNT(Data[[Date]:[Date]])),{1,2,23}),(Data[[Location]:[Location]]=$M$2)*(Data[Suspended Solids]<>"")),2,-1),SEQUENCE(MIN(24,SUM((Data[[Location]:[Location]]=$M$2)*(Data[Suspended Solids]<>"")))),3),0.8)

Once again, sorry to bombard you....I'm getting really anxious about this 😬😬.
 
Upvote 0
The following formula was replicated from your Post #14 and has the inner and outer filter and index.
I see. I guess post #14 is a bit tricky to follow. In that case, we had Date, Site, and Value in columns 1, 3, and 7, respectively,...so the {1,0,1} associated with the outer FILTER was extracting only the Date and Values, and the resulting two-column array was then sorted by the 1st column (Date). And in the post #14 case, it was okay to drop the 2nd column (i.e., column 3 that is tied to the "0" in the {1,0,1} array because it had already been used by the inner FILTER to ensure that the correct Site (location) was considered.

I think it is probably better and easier to understand if you begin with an inner INDEX function to define the location of the relevant columns and the SEQUENCE(COUNT construction to capture all of the rows. Then wrap that inner INDEX inside a FILTER function to apply the row-filtering criteria (for desired location and to exclude blanks). Then wrap that inside a SORT function and sort by whichever column index is the Date in recent-to-old order (use -1). Then wrap all of that inside another INDEX function to extract only the column index that includes the values needed for the percentile computation and the rows that correspond to the number of data points you want to consider (the SEQUENCE(MIN(24,SUM construction).

I don't see any issues with the last formula you mentioned produces some odd results. Do you have a subset of data to look at?
 
Upvote 0
I sometimes avoid the absolute structured references to trim down the formulas. If you do that, your last one would be
Excel Formula:
=PERCENTILE.INC(INDEX(SORT(FILTER(INDEX(Data[[Location]:[Suspended Solids]],SEQUENCE(COUNT(Data[Date])),{1,2,23}),(Data[Location]=$M$2)*(Data[Suspended Solids]<>"")),2,-1),SEQUENCE(MIN(24,SUM((Data[Location]=$M$2)*(Data[Suspended Solids]<>"")))),3),0.8)
As a spot check, you can copy the inside array of the PERCENTILE.INC formula into a blank cell (with plenty of space below it for spilling results) to examine the value identified by the formula...and then re-sort them to pick out by eye where the 80th percentile would be:
Excel Formula:
SORT(INDEX(SORT(FILTER(INDEX(Data[[Location]:[Suspended Solids]],SEQUENCE(COUNT(Data[Date])),{1,2,23}),(Data[Location]=$M$2)*(Data[Suspended Solids]<>"")),2,-1),SEQUENCE(MIN(24,SUM((Data[Location]=$M$2)*(Data[Suspended Solids]<>"")))),3))
 
Upvote 0
Sorry...I glossed right over your questions:
This still seems to be returning the correct value, even though it contains the two filters. The columns are Date (1), Location (4) and Data (7). Does this still seem correct as I have an entire spreadsheet which uses this formular across about 90 columns?
There is no problem using the two-FILTER version. The version presented takes data from $A$10:G274, with columns A (Date), D (Location), and G (Values) being of interest. $F$2 has the Location lookup. The inner INDEX extracts the main block of data in the three columns. The inner FILTER trims down the rows to consider only those points meeting the Location and non-blank requirements. The outer FILTER then jettisons the middle column (Location)--that's okay now since it already served its purpose--keeping only Date and Values. This 2-column array is sorted by the 1st column (Date) in descending order. Then the outer INDEX performs a count of the number of data points that could be considered and takes the smaller of that amount or 24 and constructs a row-indexing array using that value to extract only that many values from the top of the 2nd column (the Values/Data column). The PERCENTILE.INC function operates on that subset of data. I don't see any issues with the formula.
The next formula is the one I did, and simply tried to copy the above formula but with the excel table. The difference is that columns are Location (1), Date (2) and Data (24). If I had used {0,1,1} for the array and 3 at the end (see green text), would that have been correct with the extra filter?
=PERCENTILE.INC(INDEX(SORT(FILTER(FILTER(INDEX(Data[[Location]:[Sulphates]],SEQUENCE(COUNT(Data[Date])),{1,2,24}),(Data[Location]=$M$2)*(Data[Sulphates]<>"")),{1,0,1}),2,-1),SEQUENCE(MIN(24,SUM((Data[Location]=$M$2)*(Data[Sulphates]<>"")))),2),0.8)
Almost...yes to both of your explanations: using {0,1,1} rather than {1,0,1} in the outer FILTER would return a 2-column array of Date and Data, which is what you want...so "yes" to {0,1,1}. But the next step involves sorting by Date, so the "2" that I highlighted in red should be a 1 since Data is now the 1st column of the 2-column array. And finally, yes to the green "2", since you want the outer INDEX function to return the 2nd column (Data) of the 2-column array.
I have amended this slightly so that I can fix some of the reference cells as per the formula below which returns 7.02 and I have checked the result with =PERCENTILE.INC(X254:X277,0.8) which is also 7.02.

=PERCENTILE.INC(INDEX(SORT(FILTER(INDEX(Data[[Location]:[Location]]:Data[Sulphates],SEQUENCE(COUNT(Data[[Date]:[Date]])),{1,2,24}),(Data[[Location]:[Location]]=$M$2)*(Data[Sulphates]<>"")),2,-1),SEQUENCE(MIN(24,SUM((Data[[Location]:[Location]]=$M$2)*(Data[Sulphates]<>"")))),3),0.8)
I don't see any issues with the formula, assuming you want to use a fixed reference for the [Location] column. If you want to investigate further, see my comment in post #27 about copying the formula inside the PERCENTILE.INC function and then sorting that result to see the list of values being used. That might help you determine whether something is wrong.
Excel Formula:
=SORT(INDEX(SORT(FILTER(INDEX(Data[[Location]:[Location]]:Data[Sulphates],SEQUENCE(COUNT(Data[[Date]:[Date]])),{1,2,24}),(Data[[Location]:[Location]]=$M$2)*(Data[Sulphates]<>"")),2,-1),SEQUENCE(MIN(24,SUM((Data[[Location]:[Location]]=$M$2)*(Data[Sulphates]<>"")))),3))
 
Upvote 0
Hi @KRice,

I just wanted to let you know that I have identified why I am getting the unusual results...its actually correct. The reason for the different values is that I have three rows of data for the same day, which is correct...but these three rows are split over the 24th datapoint in the percentile calculation. What excel does automatically (unbelievably) is to take the mean of those three values and then use that number as the 24th value.

I will go through your most recent few posts soon, but it's great to know that the formula is working.
 
Upvote 0
Thanks for the update. You may want to investigate using the formula I described at the end of my last post...the formula that stops just short of evaluating the percentile, and instead spills the array of values that would be used by the percentile function.
 
Upvote 0

Forum statistics

Threads
1,223,896
Messages
6,175,264
Members
452,627
Latest member
KitkatToby

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top