Using Access to find correlation between data columns

inigo2

New Member
Joined
Feb 28, 2005
Messages
4
Okay...I have experience with Access, and (more) experience with Excel, but I can't seem to figure out how to do what I need to. I've got a data table (I have it in a txt file) that I need to analyze. The relevant data columns are category (a, b, or c), volume and weight. I am trying to figure out what type of correlation there is between volume and weight, both overall and within each category; the overall goal is to see if, given a volume, I can then estimate within pretty good accuracy the weight. I know in Excel I can use the CORREL function, and have done that for testing purposes. The problem is, I have approximately 350,000 records -- too many for Excel. So I need to find a way to do this in Access (I think, anyways)...

Any thoughts? I appreciate any comments...

-Dan
 

Excel Facts

Can you AutoAverage in Excel?
There is a drop-down next to the AutoSum symbol. Open the drop-down to choose AVERAGE, COUNT, MAX, or MIN
Found a solution!

Alright...I found another place to look as well (somehow I couldn't find the following info before when I looked, but this time it popped right up..d'oh!) Here's the info I found (on comp.databases.ms-access):

It assumes that the two sets of data are in one table or query called MyData with columns called x and y. You will have to change things if you need to handle missing values and so on. If your data are in different tables you will need to join them first on the date or some other index field to match corresponding x and y values.
Code:
SELECT count(*) as n, sum(x) as sx, sum(y) as sy, sum(x*y) as sxy,
       sum(x)*sum(y) as sxsy, sum(x*x) as sxx, sum(y*y)as syy,  
       sx*sx as sxsx, sy*sy as sysy,
       (sxy-sxsy/n)/sqr((sxx-sxsx/n)*(syy-sysy/n)) as r
FROM MyData;
The correlation coefficient is the field "r".

Or, alternatively (and I haven't done this one, so I'm not positive it's 100% right):

Assuming you have a table named Table1, with two numeric fields, ColX and ColY, observe the following query:
Code:
SELECT
        Count(*) AS N,
        Sum(ColX) AS SumOfX,
        Sum(ColY) AS SumOfY,
        Sum([ColX]^2) AS SumOfX2,
        Sum([ColY]^2) AS SumOfY2,
        Sum([ColX]*[ColY]) AS SumOfXY,
        [SumOfX2]-([SumOfX])^2/[N] AS SXX,
        [SumOfY2]-[SumOfY]^2/[N] AS SYY,
        [SumOfXY]-[SumOfX]*[SumOfY]/[N] AS SXY,
        [SXY]/Sqr([SXX]*[SYY]) AS CORREL
FROM Table1

Hope maybe this helps someone out there...
 
Upvote 0
Hi Dan
The 2nd sample query you posted lays it out nicely but before you use it, have you imported the data into Access? If not, or if you are unsure, then post another question back here.

Once you have the data in Access (all 350k records), you can use the 2nd sample query as the basis for your analysis. I tested it to make sure it returned the correct correlation values and it is ok.

Create a new query, add your table with the 350k records, and then copy and paste the following SQL into the SQL view of the query (you can get to it by clicking View -> SQL) :
SELECT Count(*) AS N, Sum(volume) AS SumOfX, Sum(weight) AS SumOfY, Sum([volume]^2) AS SumOfX2, Sum([weight]^2) AS SumOfY2, Sum([volume]*[weight]) AS SumOfXY, [SumOfX2]-([SumOfX])^2/[N] AS SXX, [SumOfY2]-[SumOfY]^2/[N] AS SYY, [SumOfXY]-[SumOfX]*[SumOfY]/[N] AS SXY, [SXY]/Sqr([SXX]*[SYY]) AS CORREL
FROM T_inigo;
Please note that I used the table name "T_inigo" and the field names "volume" and "weight" (all without the quotes) - make sure you change these to your actual table and field names before you run the query. This query will give you the overall correlation.

To get the correlation for each category, use the following SQL instead (in a new query) :
SELECT T_inigo.category, Count(T_inigo.volume) AS N, Sum(T_inigo.volume) AS SumOfX, Sum(T_inigo.weight) AS SumOfY, Sum(([volume]^2)) AS SumOfX2, Sum(([weight]^2)) AS SumOfY2, Sum(([volume]*[weight])) AS SumOfXY, [SumOfX2]-([SumOfX])^2/[N] AS SXX, [SumOfY2]-[SumOfY]^2/[N] AS SYY, [SumOfXY]-[SumOfX]*[SumOfY]/[N] AS SXY, [SXY]/Sqr([SXX]*[SYY]) AS CORREL
FROM T_inigo
GROUP BY T_inigo.category;

I used the field name "category" in this query - again remember to use your actual table and field names.

HTH, Andrew. :)
 
Upvote 0
Andrew,
Just saw your response. I had gotten the first part working (finding the correlation on all the data) but was having trouble doing it for the separate categories, so your code at the end is a HUGE help. Thanks very much for your time, I really appreciate it! Now that I'm past this hurdle, I'm gonna try to start back towards the beginning with Access...

-dan
 
Upvote 0

Forum statistics

Threads
1,221,860
Messages
6,162,479
Members
451,769
Latest member
adyapratama

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top