Percentile of n most recent values

Fitzy22 · Jun 28, 2022

Hi,

Please see simplified minisheet of my data below.

I would like to calculate the 80th percentile of the 24 most recent values (values in Column C, ordered by the dates in Column A) for each site (x and y in Column B). A blank in Column C means no value was recorded for that date, so that row needs to be ignored and doesn't form part of the percentile calculation. Only the 24 most recent recorded values should be used.

I hope this adequately explains my requirements, but if there are any questions please ask.

Thanks in advance...

Percentile_calculation.xlsx

Row  A           B     C
1    Date        Site  Value
2    25/12/2006  x
3    21/12/2007  x     1
4    21/12/2007  x     2
5    18/11/2008  x
6    18/11/2008  x
7    28/11/2008  y
8    28/11/2008  y     1
9    28/11/2008  y     2
10   28/11/2008  y
11   28/11/2008  y
12   28/11/2008  y
13   28/11/2008  x
14   12/01/2009  x
15   12/01/2009  x     1
16   12/01/2009  y     1
17   12/01/2009  y     1
18   12/01/2009  y
19   12/01/2009  x
20   12/01/2009  y
21   12/01/2009  x
22   12/01/2009  y
23   12/01/2009  x
24   17/01/2009  x
25   17/01/2009  x     2
26   17/01/2009  y
27   17/01/2009  y
28   17/01/2009  y
29   17/01/2009  x
30   1/01/2010   x
31   1/01/2010   x     10
32   15/02/2010  x
33   15/02/2010  x
34   15/02/2010  y     3
35   18/02/2010  x
36   18/02/2010  x
37   18/02/2010  x
38   18/02/2010  x
39   18/02/2010  x
40   17/01/2011  x     4
41   9/03/2011   x     8
42   9/05/2011   x     3
43   7/03/2012   x     1
44   26/02/2013  x     2
45   7/03/2013   x     1
46   1/03/2014   x     2.1
47   4/12/2014   x     3.4
48   4/12/2014   y     2.4
49   7/12/2014   y     1.2
50   7/12/2014   y     1.5
51   14/12/2014  x     2
52   14/12/2014  x     1.8
53   29/12/2014  x     2.7
54   16/01/2015  x     2.3
55   7/11/2015   x
56   6/02/2016   x     2.3
57   13/05/2016  x     3.2
58   13/05/2016  x     2.5
59   27/06/2016  y     1.1
60   24/09/2016  y     1.4
61   24/09/2016  y     1
62   24/02/2017  x     1.1
63   3/03/2018   x     4.1
64   5/03/2018   x     9.1
65   5/03/2018   x     9
66   5/03/2018   x     7.8
67   6/03/2018   x     2.8
68   8/03/2018   x     5.3
69   8/03/2018   x     5.4
70   8/03/2018   x     3.3
71   9/03/2018   x     7.8
72   24/02/2019  x     1.6
73   24/02/2019  x     3.2
74   31/03/2019  y     1.6
75   31/03/2019  y     3.7
76   25/01/2020  y     1.9
77   27/01/2020  x     6.7
78   27/01/2020  x     2.7
79   27/01/2020  x     1.3
80   5/02/2020   x     4.6
81   8/03/2020   y     4.4
82   8/03/2020   x     0.8
83   26/09/2020  x     1.6
84   26/09/2020  x     2.2
85   25/12/2020  x     1.7
86   10/03/2021  x     14
87   23/03/2021  x     1.4
88   24/06/2021  x     2.4
89   29/10/2021  x     4.2
90   21/01/2022  x     7.5
91   24/01/2022  x     12
92   26/01/2022  x     2.2

Sheet1

Fitzy22 · Jul 12, 2022

Hi @KRice,

I hope you're doing well and was hoping I might bother you for a little more advice please.

I have created an Excel table for the first time from my dataset, and I'm close to getting it to work with the following formula, which was derived from your previous assistance. The relevant data is in columns 1 (Location), 2 (Date) and 24 (Sulphates).

=PERCENTILE.INC(INDEX(SORT(FILTER(FILTER(INDEX(Data[[Location]:[Sulphates]],SEQUENCE(COUNT(Data[Date])),{1,2,24}),(Data[Location]=$M$2)*(Data[Sulphates]<>"")),{1,0,1}),2,-1),SEQUENCE(MIN(24,SUM((Data[Location]=$M$2)*(Data[Sulphates]<>"")))),2),0.8)

The most recent 24 values are below but rather than getting 7.02 for the 80th percentile, I'm getting 8.88 (which is the highest value). I thought it may have something to do with the date formatting or the text matching from cell $M$2 but everything is correct.

Can you see anything obvious where I might be going wrong?

5.3
7.8
3.2
1.6
1.6
3.7
1.9
1.3
2.7
6.7
4.6000
4.4
0.8
1.6
2.2
1.7
14
1.4
2.4
4.2
7.5
12
2.2
8.8
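
As a hand check of the expected 7.02: PERCENTILE.INC interpolates linearly between the sorted values on either side of rank p(n-1)+1. The sketch below works that arithmetic through for the 24 values listed above. The symbols r, p, n, and x_(k) (the k-th smallest value) are notation introduced here for illustration, not taken from the thread.

```latex
% The 24 values sorted ascending, x_(1) <= ... <= x_(24):
% 0.8, 1.3, 1.4, 1.6, 1.6, 1.6, 1.7, 1.9, 2.2, 2.2, 2.4, 2.7,
% 3.2, 3.7, 4.2, 4.4, 4.6, 5.3, 6.7, 7.5, 7.8, 8.8, 12, 14
\begin{aligned}
r      &= p\,(n-1) + 1 = 0.8 \times 23 + 1 = 19.4 \\
P_{80} &= x_{(19)} + 0.4\,\bigl(x_{(20)} - x_{(19)}\bigr) \\
       &= 6.7 + 0.4 \times (7.5 - 6.7) = 7.02
\end{aligned}
```

So 7.02 is the right target whenever exactly these 24 values reach PERCENTILE.INC. A different result suggests that a different set of values is being passed in.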

KRice · Jul 12, 2022

With structured references and your revised table structure, give this a try:

Excel Formula:

=PERCENTILE.INC(INDEX(SORT(FILTER(INDEX(Data[[Location]:[Sulphates]],SEQUENCE(COUNT(Data[Date])),{1,2,24}),(Data[Location]=$M$2)*(Data[Sulphates]<>"")),2,-1),SEQUENCE(MIN(24,SUM((Data[Location]=$M$2)*(Data[Sulphates]<>"")))),3),0.8)

Fitzy22 · Jul 12, 2022

Perfect @KRice, you are a marvel!!

So...because there is a table...I only needed the one index?

And it was the third column that needed sorting, not the second?

Regards,

KRice · Jul 12, 2022

I'm glad you have it working. To answer your last questions..."not quite" for both questions. I'll try to explain. In post #14, I mentioned...

KRice said:
If your table structure is such that the Date, Site, and Value columns are far apart and constructing the outer FILTER column "on/off" array is tedious (lots of 0's), there is another approach. Rather than specifying a very large contiguous range in the inner FILTER, you could feed that large contiguous range into an INDEX function and have it return only the three columns that you need.

With your table structure positioning the relevant columns far apart, it makes sense to use an array inside the INDEX function to return columns {1,2,24}. Doing so makes the outer FILTER (originally described) unnecessary. One could argue that this is the preferred way to do it anyway.

In your post #21, you used the inner INDEX approach (inside the inner FILTER)--that is okay. But then you applied a 2nd (outer) FILTER and instructed the formula to return {1,0,1}, which would return only the 1st and 3rd columns of the 3 columns produced by the inner parts of the formula. This meant that you were getting only the Location and Sulphates columns for all subsequent operations, and the Date column was being excluded even though it is still needed by the SORT function (to position the most recent results at the top of the list). Then when you applied the SORT function to column "2" of the remaining two columns (that's what {1,0,1} left you with), you were sorting on the Sulphates column...which is not correct. You do want to sort on the 2nd column returned by the FILTER(INDEX inner construction, but you need to keep that 2nd column in order to be able to sort on it. By eliminating the needed 2nd column with the {1,0,1} array, your 2nd column was actually the 3rd column returned by the FILTER(INDEX inner construction.

Then after sorting by date and determining how many data points are available for the percentile calculation, the outer INDEX function is still necessary to extract the 3rd column of the sorted data. You were extracting the 2nd column of your sorted data (meaning the values in Sulphates), which was the correct column, but the previous outer FILTER and SORT operations left you with a jumbled table.

As mentioned before, you could take advantage of Excel 365's LET function to reduce the redundancy in this formula by assigning the filtering criteria to a variable (I've named fcrit), and then referring to fcrit where necessary:

Excel Formula:

=LET(fcrit,(Data[Location]=$M$2)*(Data[Sulphates]<>""),PERCENTILE.INC(INDEX(SORT(FILTER(INDEX(Data[[Location]:[Sulphates]],SEQUENCE(COUNT(Data[Date])),{1,2,24}),fcrit),2,-1),SEQUENCE(MIN(24,SUM(fcrit))),3),0.8))

This construction is a little more difficult to debug, so I normally convert formulas over to this only after their longer versions are working correctly.

Fitzy22 · Jul 12, 2022

Thanks @KRice,

I'm not there yet...still having a couple of issues and lacking understanding of this.

The following formula was replicated from your Post #14 and has the inner and outer filter and index. This still seems to be returning the correct value, even though it contains the two filters. The columns are Date (1), Location (4) and Data (7). Does this still seem correct as I have an entire spreadsheet which uses this formula across about 90 columns?

=PERCENTILE.INC(INDEX(SORT(FILTER(FILTER(INDEX($A$10:G274,SEQUENCE(COUNT($A$10:$A$274)),{1,4,7}),($D$10:$D$274=$F$2)*(G10:G274<>"")),{1,0,1}),1,-1),SEQUENCE(MIN(24,SUM(($D$10:$D$274=$F$2)*(G10:G274<>"")))),2),0.8)

Your formula from Post #14

=PERCENTILE.INC(INDEX(SORT(FILTER(FILTER(INDEX($A$2:$G$92,SEQUENCE(COUNT($A$2:$A$92)),{1,3,7}),($C$2:$C$92=I$2)*($G$2:$G$92<>"")),{1,0,1}),1,-1),SEQUENCE(MIN(24,SUM(($C$2:$C$92=I$2)*($G$2:$G$92<>"")))),2),0.8)

The next formula is the one I did, and simply tried to copy the above formula but with the excel table. The difference is that columns are Location (1), Date (2) and Data (24). If I had used {0,1,1} for the array and 3 at the end (see green text), would that have been correct with the extra filter?

=PERCENTILE.INC(INDEX(SORT(FILTER(FILTER(INDEX(Data[[Location]:[Sulphates]],SEQUENCE(COUNT(Data[Date])),{1,2,24}),(Data[Location]=$M$2)*(Data[Sulphates]<>"")),{1,0,1}),2,-1),SEQUENCE(MIN(24,SUM((Data[Location]=$M$2)*(Data[Sulphates]<>"")))),2),0.8)

The next formula was from your most recent post and it seems to work.

=PERCENTILE.INC(INDEX(SORT(FILTER(INDEX(Data[[Location]:[Sulphates]],SEQUENCE(COUNT(Data[Date])),{1,2,24}),(Data[Location]=$M$2)*(Data[Sulphates]<>"")),2,-1),SEQUENCE(MIN(24,SUM((Data[Location]=$M$2)*(Data[Sulphates]<>"")))),3),0.8)

I have amended this slightly so that I can fix some of the reference cells as per the formula below which returns 7.02 and I have checked the result with =PERCENTILE.INC(X254:X277,0.8) which is also 7.02.

=PERCENTILE.INC(INDEX(SORT(FILTER(INDEX(Data[[Location]:[Location]]:Data[Sulphates],SEQUENCE(COUNT(Data[[Date]:[Date]])),{1,2,24}),(Data[[Location]:[Location]]=$M$2)*(Data[Sulphates]<>"")),2,-1),SEQUENCE(MIN(24,SUM((Data[[Location]:[Location]]=$M$2)*(Data[Sulphates]<>"")))),3),0.8)

...but I am getting some strange results. In the previous column, for suspended solids, I am using the following formula, which gives me 2480. When I check the result with =PERCENTILE.INC(W254:W277,0.8), I get 2280.

=PERCENTILE.INC(INDEX(SORT(FILTER(INDEX(Data[[Location]:[Location]]:Data[Suspended Solids],SEQUENCE(COUNT(Data[[Date]:[Date]])),{1,2,23}),(Data[[Location]:[Location]]=$M$2)*(Data[Suspended Solids]<>"")),2,-1),SEQUENCE(MIN(24,SUM((Data[[Location]:[Location]]=$M$2)*(Data[Suspended Solids]<>"")))),3),0.8)

Once again, sorry to bombard you... I'm getting really anxious about this.

KRice · Jul 12, 2022

Fitzy22 said:
The following formula was replicated from your Post #14 and has the inner and outer filter and index.

I see. I guess post #14 is a bit tricky to follow. In that case, we had Date, Site, and Value in columns 1, 3, and 7, respectively, so the {1,0,1} associated with the outer FILTER was extracting only the Date and Value columns, and the resulting two-column array was then sorted by the 1st column (Date). In the post #14 case, it was okay to drop the 2nd column (i.e., column 3, which is tied to the "0" in the {1,0,1} array) because it had already been used by the inner FILTER to ensure that the correct Site (location) was considered.

I think it is probably better and easier to understand if you begin with an inner INDEX function to define the location of the relevant columns and the SEQUENCE(COUNT construction to capture all of the rows. Then wrap that inner INDEX inside a FILTER function to apply the row-filtering criteria (for desired location and to exclude blanks). Then wrap that inside a SORT function and sort by whichever column index is the Date in recent-to-old order (use -1). Then wrap all of that inside another INDEX function to extract only the column index that includes the values needed for the percentile computation and the rows that correspond to the number of data points you want to consider (the SEQUENCE(MIN(24,SUM construction).

I don't see any issues with the last formula, the one you mentioned produces some odd results. Do you have a subset of data to look at?
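
The build order described above (inner INDEX, then FILTER, then SORT, then outer INDEX) can also be laid out one step per line with LET. This is only a sketch. It assumes the same Data table and $M$2 lookup cell used earlier in the thread, and an Excel version that supports LET. The step names cols, fcrit, kept, sorted, n, and vals are made up for illustration.

```excel
=LET(
    cols,   INDEX(Data[[Location]:[Sulphates]], SEQUENCE(COUNT(Data[Date])), {1,2,24}),
    fcrit,  (Data[Location]=$M$2)*(Data[Sulphates]<>""),
    kept,   FILTER(cols, fcrit),
    sorted, SORT(kept, 2, -1),
    n,      MIN(24, SUM(fcrit)),
    vals,   INDEX(sorted, SEQUENCE(n), 3),
    PERCENTILE.INC(vals, 0.8)
)
```

Here cols is the three-column block (Location, Date, Sulphates), kept holds the rows for the chosen location with non-blank values, sorted orders them newest first by column 2 (Date), and vals takes up to 24 entries from column 3. Swapping the last line for vals, or SORT(vals), spills the list of values instead of the percentile, which makes it easier to see what is being fed to PERCENTILE.INC.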

KRice · Jul 12, 2022

I sometimes avoid the absolute structured references to trim down the formulas. If you do that, your last one would be

Excel Formula:

=PERCENTILE.INC(INDEX(SORT(FILTER(INDEX(Data[[Location]:[Suspended Solids]],SEQUENCE(COUNT(Data[Date])),{1,2,23}),(Data[Location]=$M$2)*(Data[Suspended Solids]<>"")),2,-1),SEQUENCE(MIN(24,SUM((Data[Location]=$M$2)*(Data[Suspended Solids]<>"")))),3),0.8)

As a spot check, you can copy the inside array of the PERCENTILE.INC formula into a blank cell (with plenty of space below it for spilling results) to examine the values identified by the formula...and then re-sort them to pick out by eye where the 80th percentile would be:

Excel Formula:

=SORT(INDEX(SORT(FILTER(INDEX(Data[[Location]:[Suspended Solids]],SEQUENCE(COUNT(Data[Date])),{1,2,23}),(Data[Location]=$M$2)*(Data[Suspended Solids]<>"")),2,-1),SEQUENCE(MIN(24,SUM((Data[Location]=$M$2)*(Data[Suspended Solids]<>"")))),3))

KRice · Jul 12, 2022

Sorry...I glossed right over your questions:

Fitzy22 said:
This still seems to be returning the correct value, even though it contains the two filters. The columns are Date (1), Location (4) and Data (7). Does this still seem correct as I have an entire spreadsheet which uses this formula across about 90 columns?

There is no problem using the two-FILTER version. The version presented takes data from $A$10:G274, with columns A (Date), D (Location), and G (Values) being of interest. $F$2 has the Location lookup. The inner INDEX extracts the main block of data in the three columns. The inner FILTER trims down the rows to consider only those points meeting the Location and non-blank requirements. The outer FILTER then jettisons the middle column (Location)--that's okay now since it already served its purpose--keeping only Date and Values. This 2-column array is sorted by the 1st column (Date) in descending order. Then the outer INDEX performs a count of the number of data points that could be considered and takes the smaller of that amount or 24 and constructs a row-indexing array using that value to extract only that many values from the top of the 2nd column (the Values/Data column). The PERCENTILE.INC function operates on that subset of data. I don't see any issues with the formula.

Fitzy22 said:
The next formula is the one I did, and simply tried to copy the above formula but with the excel table. The difference is that columns are Location (1), Date (2) and Data (24). If I had used {0,1,1} for the array and 3 at the end (see green text), would that have been correct with the extra filter?
=PERCENTILE.INC(INDEX(SORT(FILTER(FILTER(INDEX(Data[[Location]:[Sulphates]],SEQUENCE(COUNT(Data[Date])),{1,2,24}),(Data[Location]=$M$2)*(Data[Sulphates]<>"")),{1,0,1}),2,-1),SEQUENCE(MIN(24,SUM((Data[Location]=$M$2)*(Data[Sulphates]<>"")))),2),0.8)

Almost...yes to both of your explanations: using {0,1,1} rather than {1,0,1} in the outer FILTER would return a 2-column array of Date and Data, which is what you want...so "yes" to {0,1,1}. But the next step involves sorting by Date, so the "2" that I highlighted in red should be a 1 since Date is now the 1st column of the 2-column array. And finally, yes to the green "2", since you want the outer INDEX function to return the 2nd column (Data) of the 2-column array.
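
Putting those three points together, the two-FILTER version for the table layout would look something like the sketch below. It assumes the same Data table and $M$2 lookup cell as the quoted formula. Only three things differ from that formula: the {0,1,1} mask, the sort column (1), and the column returned by the outer INDEX, which stays at 2.

```excel
=PERCENTILE.INC(INDEX(SORT(FILTER(FILTER(INDEX(Data[[Location]:[Sulphates]],SEQUENCE(COUNT(Data[Date])),{1,2,24}),(Data[Location]=$M$2)*(Data[Sulphates]<>"")),{0,1,1}),1,-1),SEQUENCE(MIN(24,SUM((Data[Location]=$M$2)*(Data[Sulphates]<>"")))),2),0.8)
```

After the outer FILTER, the array has two columns, Date (1) and Sulphates (2). So SORT works on column 1 and the outer INDEX pulls column 2.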

Fitzy22 said:
I have amended this slightly so that I can fix some of the reference cells as per the formula below which returns 7.02 and I have checked the result with =PERCENTILE.INC(X254:X277,0.8) which is also 7.02.

=PERCENTILE.INC(INDEX(SORT(FILTER(INDEX(Data[[Location]:[Location]]:Data[Sulphates],SEQUENCE(COUNT(Data[[Date]:[Date]])),{1,2,24}),(Data[[Location]:[Location]]=$M$2)*(Data[Sulphates]<>"")),2,-1),SEQUENCE(MIN(24,SUM((Data[[Location]:[Location]]=$M$2)*(Data[Sulphates]<>"")))),3),0.8)

I don't see any issues with the formula, assuming you want to use a fixed reference for the [Location] column. If you want to investigate further, see my comment in post #27 about copying the formula inside the PERCENTILE.INC function and then sorting that result to see the list of values being used. That might help you determine whether something is wrong.

Excel Formula:

=SORT(INDEX(SORT(FILTER(INDEX(Data[[Location]:[Location]]:Data[Sulphates],SEQUENCE(COUNT(Data[[Date]:[Date]])),{1,2,24}),(Data[[Location]:[Location]]=$M$2)*(Data[Sulphates]<>"")),2,-1),SEQUENCE(MIN(24,SUM((Data[[Location]:[Location]]=$M$2)*(Data[Sulphates]<>"")))),3))

Fitzy22 · Jul 12, 2022

Hi @KRice,

I just wanted to let you know that I have identified why I am getting the unusual results...it's actually correct. The reason for the different values is that I have three rows of data for the same day, which is correct...but these three rows are split over the 24th datapoint in the percentile calculation. What Excel does automatically (unbelievably) is to take the mean of those three values and then use that number as the 24th value.

I will go through your most recent few posts soon, but it's great to know that the formula is working.

KRice · Jul 12, 2022

Thanks for the update. You may want to investigate using the formula I described at the end of my last post...the formula that stops just short of evaluating the percentile, and instead spills the array of values that would be used by the percentile function.
