Using Non Linear Regression to predict missing values

danodco · Apr 5, 2022

I would like to use Excel to predict missing values in a table based on a best-fit curve. I suspect non-linear regression analysis is a good way to do this, but I am not a statistician and welcome solutions. I have a table of more than 300000 rows, but I am posting a small Excel sheet with 25 rows just to show what I want to do. The cells in yellow are the ones I want to fill in.

KRice · Apr 5, 2022

Welcome to the Board! Before you move forward with some type of interpolation to estimate the points, it would be important to think about what the data represent and whether there is some theory or mathematical model to which the data would be expected to conform. Then you could perform an analysis to determine the "best fit" parameters that come close to matching the data to the model, and use those newly found fitting parameters to estimate the missing points. Do you have any idea about what a suitable model might be? I ask because the points you've shown do not appear to be suitable for a 4th order polynomial. I would guess an inverse logarithmic function, but knowing more about the data would be crucial.

danodco · Apr 5, 2022

Hi KRice, thanks for the reply.
The purpose of this analysis is to analyze a Google search results data set and interpolate the monthly searches for a specific phrase.

E.g.
Search rank (1), "iphone charger", would have 6 million searches per month, whereas search rank (24), "audio cassette", would have only 12 per month.

The data is patchy, but the general pattern is that the higher the search rank, the more searches there are, with the number of searches increasing exponentially.

Hope it makes sense.

danodco · Apr 5, 2022

Correction: search volume increases exponentially as the search rank gets lower. The dataset I am working with has over a million search phrases.

KRice · Apr 5, 2022

That makes sense. I think you hinted at the general type of model to investigate---something with exponential behavior. Here is a basic exponential model produced by Excel's trendline. Not bad...but maybe there is room for improvement. It might make sense to perform this analysis in a different package (like R) where other models can be created and various goodness of fit measures could be determined to make an assessment of the model form that seems to come closest to representing the data. I'll give this some thought and post back if I make any progress.

Another idea is to use an interpolating function, but I'll have to review some material to see if that might be a more suitable approach.

MrExcel_20220401.xlsx

      A     B         C
1     Rank  Searches  Est. by Model
2     1     6000000   4616923
3     2     3854582   2917066
4     3               1843062
5     4     750000    1164484
6     5     549000    735745
7     6               464859
8     7               293707
9     8     99000     185570
10    9     87000     117247
11    10    67000     74079
12    11              46805
13    12    41000     29572
14    13    37000     18684
15    14    12000     11805
16    15              7459
17    16    6000      4713
18    17    4000      2977
19    18    2000      1881
20    19    999       1189
21    20              751
22    21              474
23    22    789       300
24    23    569       189
25    24    12        120

Sheet10

Cell Formulas
Range	Formula
C2:C25	C2	=$F$1*EXP($F$2*A2)

MrExcel_20220401.xlsx

      E              F
1     a              7307333.012
2     b              -0.45915
3     Rank, x
4     N Searches, y
5     function       y = a * exp(b*x)

Sheet10
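For anyone who wants to check the trendline parameters outside Excel, here is a minimal Python sketch. It assumes the exponential trendline is an ordinary least-squares line fitted to ln(y) against x, which is my understanding of how Excel produces it rather than something stated in this thread. The names `KNOWN` and `fit_exponential` are made up for illustration. On the 17 known points it should give "a" and "b" close to the values in the sheet above.

```python
import math

# Known (rank, searches) pairs from the sample sheet; the gaps are left out.
KNOWN = {1: 6000000, 2: 3854582, 4: 750000, 5: 549000, 8: 99000, 9: 87000,
         10: 67000, 12: 41000, 13: 37000, 14: 12000, 16: 6000, 17: 4000,
         18: 2000, 19: 999, 22: 789, 23: 569, 24: 12}


def fit_exponential(points):
    """Fit y = a * exp(b * x) by least squares on ln(y); return (a, b)."""
    xs = list(points)
    logs = [math.log(points[x]) for x in xs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_log = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (lg - mean_log) for x, lg in zip(xs, logs))
    b = sxy / sxx                          # slope of ln(y) versus x
    a = math.exp(mean_log - b * mean_x)    # intercept, transformed back
    return a, b


a, b = fit_exponential(KNOWN)
print(f"a = {a:.3f}, b = {b:.5f}")

# Full-set model estimates for the ranks with no search count.
for rank in range(1, 25):
    if rank not in KNOWN:
        print(rank, round(a * math.exp(b * rank)))
```

Because the fit is done on ln(y), it minimizes relative rather than absolute error, which is one reason the curve can sit far from the largest values at the top of the table.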

KRice · Apr 5, 2022

I'm back with a follow-up question. When you said...

danodco said:
I would like to use excel to predict missing values in a table based on a best fit curve.

Do you want/need a best fit curve for the entire set of data, or is the main objective to fill in missing points with reasonable estimates, perhaps without a "best fit curve"? I ask because the former involves looking at all of the data to establish that some function reasonably describes the known points, while the latter can be done by looking at a relatively small number of points around each gap and using a fitting-interpolating function to provide estimates for the missing points. The former involves development of a function that works over the entire range covered by the data, while the latter does not produce such a function, only discrete point estimates for the gaps.

danodco · Apr 5, 2022

Hi KRice,

The second option looks like what I need.

KRice · Apr 14, 2022

I wanted to follow up with you. Did you get this resolved? In your first post, where you mentioned a file with 300000 rows, does that mean you have search term rankings from 1 to approximately 300000, but you are missing the number of searches for some of them?

In my previous post, I was originally thinking about using a conventional cubic spline interpolating polynomial. With a very large data set, though, that method would almost necessarily involve splitting the data into manageable chunks, as the size of the matrix involved in determining the interpolating polynomials would exceed limits in most software packages.

I have a simpler approach if you are satisfied that the general form of your data follows the shape given by the exponential model in post #5. The simplified version does not examine the entire data set, only the nearest known points immediately above and below each gap. The model parameters "a" and "b" are determined for the curve that passes through those two known data points, and the missing "gap" point(s) are then estimated using the "a" and "b" for this local curve.
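The local "a" and "b" follow directly from requiring the exponential to pass through the two known points. In the sketch below, (x_1, y_1) and (x_3, y_3) are my labels for the nearest known points on either side of a gap, and x_2 is the rank being estimated:

```latex
\begin{aligned}
y_1 &= a\,e^{b x_1}, \qquad y_3 = a\,e^{b x_3} \\
\frac{y_3}{y_1} &= e^{b (x_3 - x_1)} \;\Rightarrow\; b = \frac{\ln(y_3 / y_1)}{x_3 - x_1} \\
a &= \frac{y_1}{e^{b x_1}} \\
y_2 &= a\,e^{b x_2} = y_1 \left(\frac{y_3}{y_1}\right)^{(x_2 - x_1)/(x_3 - x_1)}
\end{aligned}
```

For a single missing rank midway between two known ranks, the last line reduces to the geometric mean of the neighbours. For rank 3 in the sample data, that is sqrt(3854582 * 750000), or about 1700275.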

MrExcel_20220401.xlsx

Columns A-D (rows 2-3 are headers):

Row   Rank  Searches  Option 1: Est. by  Option 2: Est. by
            (Given)   Full Set Model     Local Data Model
4     1     6000000   4616923            6000000
5     2     3854582   2917066            3854582
6     3               1843062            1700275
7     4     750000    1164484            750000
8     5     549000    735745             549000
9     6               464859             310165
10    7               293707             175232
11    8     99000     185570             99000
12    9     87000     117247             87000
13    10    67000     74079              67000
14    11              46805              52412
15    12    41000     29572              41000
16    13    37000     18684              37000
17    14    12000     11805              12000
18    15              7459               8485
19    16    6000      4713               6000
20    17    4000      2977               4000
21    18    2000      1881               2000
22    19    999       1189               999
23    20              751                923
24    21              474                854
25    22    789       300                789
26    23    569       189                569
27    24    12        120                12

Notes beside the data:

Option 1: Guess a functional model for fitting entire set of points
Try function: y = a * exp(b*x)
where x is Rank, and y is number of Searches
Excel's Trendline tool gives fitting parameters:
a (row 9): 7307333.012
b (row 10): -0.45915

Option 2: Use the general form of a guessed function model and establish "a" and "b" fitting parameters locally around gaps.

Sheet10 (3)

Cell Formulas
Range	Formula
C4:C27	C4	=$L$9*EXP($L$10*A4)
D4:D27	D4	=IF(B4<>"",B4,LET(knownX1,INDEX(A$4:A4,XMATCH(1,--(B$4:B4<>""),0,-1)),knownY1,INDEX(B$4:B4,XMATCH(1,--(B$4:B4<>""),0,-1)),knownX3,INDEX(A4:A$300000,XMATCH(1,--(B4:B$300000<>""),0,1)),knownY3,INDEX(B4:B$300000,XMATCH(1,--(B4:B$300000<>""),0,1)),knownX2,A4,b,LN(knownY3/knownY1)/(knownX3-knownX1),a,knownY1/EXP(b*knownX1),a*EXP(b*knownX2)))
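If the full table ever needs to be processed outside Excel, the same local idea can be sketched in Python. This is only an illustration under my own assumptions: `fill_gaps`, `RANKS` and `SEARCHES` are hypothetical names, missing values are represented as `None`, and a gap with no known point on one side is left empty instead of being extrapolated.

```python
import bisect
import math

RANKS = list(range(1, 25))
SEARCHES = [6000000, 3854582, None, 750000, 549000, None, None, 99000,
            87000, 67000, None, 41000, 37000, 12000, None, 6000,
            4000, 2000, 999, None, None, 789, 569, 12]


def fill_gaps(ranks, searches):
    """Return a copy of searches with interior None entries estimated.

    Each gap uses y = a * exp(b * x) forced through the nearest known
    point below and the nearest known point above it.
    """
    filled = list(searches)
    known = [i for i, y in enumerate(searches) if y is not None]
    for i, y in enumerate(searches):
        if y is not None:
            continue
        pos = bisect.bisect_left(known, i)   # first known index after i
        if pos == 0 or pos == len(known):
            continue                          # assumption: no extrapolation
        lo, hi = known[pos - 1], known[pos]
        b = math.log(searches[hi] / searches[lo]) / (ranks[hi] - ranks[lo])
        a = searches[lo] / math.exp(b * ranks[lo])
        filled[i] = a * math.exp(b * ranks[i])
    return filled


filled = fill_gaps(RANKS, SEARCHES)
for rank, before, after in zip(RANKS, SEARCHES, filled):
    if before is None:
        print(rank, round(after))
```

The binary search over the list of known positions keeps the work per gap small, which matters more at several hundred thousand rows than it does for the 24-row sample.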

In the chart, the known points are blue circles and the points returned by the Option 2 formula are open red circles. The Option 1 full-data-set trendline is shown as a dotted blue curve. You'll see that the full-set curve (the blue trendline) can lead to substantial errors relative to known points. For that reason, it would be prudent to avoid that approach. However, the general character of the data produces a shape similar to that of the exponential model used for the trendline. That helps to establish some confidence that forcing this same function through the known data points immediately above and below the gaps (i.e., determining the local "a" and "b" parameters for this splicing curve) should offer reasonable estimates of the missing values.
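To put rough numbers on how far the full-set trendline sits from the known points, here is a small Python sketch using the "a" and "b" values quoted in the sheet. The names `A`, `B`, `KNOWN` and `relative_error` are illustrative only, and relative error is just one reasonable way to measure the mismatch.

```python
import math

A, B = 7307333.012, -0.45915  # trendline parameters quoted in the thread

# Known (rank, searches) pairs from the sample sheet.
KNOWN = {1: 6000000, 2: 3854582, 4: 750000, 5: 549000, 8: 99000, 9: 87000,
         10: 67000, 12: 41000, 13: 37000, 14: 12000, 16: 6000, 17: 4000,
         18: 2000, 19: 999, 22: 789, 23: 569, 24: 12}


def relative_error(rank, actual):
    """Signed error of the full-set model as a fraction of the known value."""
    estimate = A * math.exp(B * rank)
    return (estimate - actual) / actual


errors = {rank: relative_error(rank, y) for rank, y in KNOWN.items()}
worst = max(errors, key=lambda r: abs(errors[r]))

for rank, err in errors.items():
    print(f"rank {rank:2d}: {err:+.0%}")
print("largest relative error at rank", worst)
```

On the sample data, the model undershoots rank 13 by roughly half (about 18684 against a known 37000) and overshoots rank 24 many times over (about 120 against a known 12).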
