How to Remove Outliers from a data set

secret292

New Member
Joined
Sep 26, 2013
Messages
12
I understand how to identify an outlier but how i do remove them?
I wish to average data with excluding the outliers.
The manual method of deleting each outlier individually will take too long for my data set of 10,000+ data points.

Thanks,

Just a Uni student completing an experimental project
 
the way i have set up my data is that i have broken it up into months.
So i have separate excel files for each month (may june july..etc)
within each excel file i have my raw data on sheet 1 called "april" and average data on sheet 2 called "average" where each cell contains average data per day.
So for example cell m2 contains averaged data for 1st of april from 0000hrs to 2359hrs and so for for cell m3 contains average data of 2nd of june from "april" raw data.

Below is the formula i use and the underlined parts are just modification factors for the equipment used (strain gauges)
=((averageifs(april!m:m,april!a:a,">01/04/2013 00:00",april!a:a,"<01/04/2013 23:59"))-674)*2/2.12

my idea is to eliminate the the outliers so that the averaging function does not include them.
If you recommend a better approach i am more than willing to learn.

Thank you for your help

How about invoking TRIMMEAN like below:
Rich (BB code):
=(TRIMMEAN(
    IF(april!A2:A200 > "01/04/2013 00:00"+0,
    IF(april!A2:A200 < "01/04/2013 23:59"+0,
      april!M2:M200)),0.1)-674)*2/2.12
where 0.1 is a percentage indicating how much to exclude from top and bottom.

Note that the formula must be confirmed with control+shift+enter, not just enter.
 
Upvote 0

Excel Facts

Select all contiguous cells
Pressing Ctrl+* (asterisk) will select the "current region" - all contiguous cells in all directions.
I would guess this to be biaxial strain gauge data (am I correct?). if so you want outliers identified from each data column, not the total data cloud.
Try the following:

Code:
Sub outliers2()
Dim dblAverage As Double, dblStdDev As Double
Dim rData As Range, rTest As Range, Rng As Range
Set rData = Range("M2:AJ" & Cells(Rows.Count, 13).End(xlUp).Row)
For Each rTest In rData.Columns
    dblAverage = WorksheetFunction.Average(rTest)
    dblStdDev = WorksheetFunction.StDev(rTest)
    For i = 1 To rTest.Cells.Count
        Set Rng = rTest.Cells(i, 1)
        Debug.Print Rng.Address
        If Rng > dblAverage + 3 * dblStdDev Or Rng < dblAverage - 3 * dblStdDev Then
            Rng.Value = "Outlier" 'or delete the data with Rng.clearcontents
        End If
    Next i
Next
End Sub

yes it is an biaxial strain gauge.
I used the coding given and waited 20 minutes but excel states it is still processing. :( I cant close excel, only by force close.
I found another way by using the code given earlier by you where i creat individual marcos to target each column which works so thank you very much for the help!
 
Upvote 0
How about invoking TRIMMEAN like below:
Rich (BB code):
=(TRIMMEAN(
    IF(april!A2:A200 > "01/04/2013 00:00"+0,
    IF(april!A2:A200 < "01/04/2013 23:59"+0,
      april!M2:M200)),0.1)-674)*2/2.12
where 0.1 is a percentage indicating how much to exclude from top and bottom.

Note that the formula must be confirmed with control+shift+enter, not just enter.

I will compare this method and the other method thanks for the help!
One thing is that trim mean does not remove my data statistically does it??
I need to show that I have managed to remove invalid data based on average and standard deviation away from the average which trimmean does not account for. (this is my understanding of trim mean, if i am wrong please let me know)

Thanks
 
Upvote 0
This code will replace the outlier (assumes data in Column A) with the text "Outlier". I've used a test to see if the data is outside a 3 sigma band to identify an outlier.
You can modify this to delete the data but most statistics functions have a way to ignore text.

I recommend you try it on a COPY of your data first.

Code:
Sub outliers()
Dim dblAverage As Double, dblStdDev As Double
Dim rTest As Range, Rng As Range
Set rTest = Range("A1:A" & Cells(Rows.Count, 1).End(xlUp).Row)
dblAverage = WorksheetFunction.Average(rTest)
dblStdDev = WorksheetFunction.StDev(rTest)
For Each Rng In rTest
    If Rng > dblAverage + 3 * dblStdDev Or Rng < dblAverage - 3 * dblStdDev Then
        Rng.Value = "Outlier" 'or delete the data with Rng.clearcontents
    End If
Next
End Sub

One last help, How could i alter this code if i want to reduce from 3 sigma band to a 2 sigma band or even 1 sigma band?

Thanks
 
Upvote 0
I will compare this method and the other method thanks for the help!
One thing is that trim mean does not remove my data statistically does it??
I need to show that I have managed to remove invalid data based on average and standard deviation away from the average which trimmean does not account for. (this is my understanding of trim mean, if i am wrong please let me know)

Thanks

Not sure what you mean? The formula ignores conditionally the outliers, specified by (adaptable) 0.1 while calculating the average.
 
Upvote 0
Hi Secret292,

I hadn't seen the size of your data set before my last post.

I've incorporated what you asked for (variable for no. of sigma band required) in outliers_mod but I suggest you try outlierers3 first. I've disabled ScreenUpdating to increase speed (approx 6X from memory of a test I ran I while back). I've also incorporated a lStartRow variable so you can set it to run from a specific row (at say row 42,000) and ignore data you've already processed (speeding up the subsequent runs).

Based upon similar tasks I've done on my work PC (2-3 yo laptop) I'd expect to leave it 45-60 minutes (guesstimate) to process; I know it's a long wait but that's one hell of a lot of data.

Code:
Sub outliers3()
Application.ScreenUpdating = False
Dim dblAverage As Double, dblStdDev As Double
Dim NoStdDevs As Integer
Dim lStartRow as Long
Dim rData As Range, rTest As Range, Rng As Range
NoStdDevs = 3 'adjust to your outlier preference of sigma
lStartRow = 1 'set at last row of last run
Set rData = Range("M2:AJ" & Cells(Rows.Count, 13).End(xlUp).Row)
For Each rTest In rData.Columns
    dblAverage = WorksheetFunction.Average(rTest)
    dblStdDev = WorksheetFunction.StDev(rTest)
    For i = lStartRow To rTest.Cells.Count
        Set Rng = rTest.Cells(i, 1)
        If Rng > dblAverage + NoStdDevs * dblStdDev Or Rng < dblAverage - NoStdDevs * dblStdDev Then
            Rng.Value = "Outlier" 'or delete the data with Rng.clearcontents
        End If
    Next i
Next
Application.ScreenUpdating = True
End Sub


Code:
Sub outliers_mod()
Dim dblAverage As Double, dblStdDev As Double
Dim NoStdDevs As Integer
Dim rTest As Range, Rng As Range
Application.ScreenUpdating = False
NoStdDevs = 3 'adjust to your outlier preference of sigma
Set rTest = Range("A1:A" & Cells(Rows.Count, 1).End(xlUp).Row)
dblAverage = WorksheetFunction.Average(rTest)
dblStdDev = WorksheetFunction.StDev(rTest)
For Each Rng In rTest
    If Rng > dblAverage + NoStdDevs  * dblStdDev Or Rng < dblAverage - NoStdDevs  * dblStdDev Then
        Rng.Value = "Outlier" 'or delete the data with Rng.clearcontents
    End If
Next
Application.ScreenUpdating = True
End Sub
 
Upvote 0
Sweet!

Thanks for the help you have provided so far really helps me out!
I will test both methods and post back here how long it took to process simply if your interested.

Thanks again! :)
 
Upvote 0
OK i tested both methods and it takes 4.5 - 5 hrs to complete one sheet !!!
Where as if I do it individually (i.e. each column at a time) it takes 1.5 hours or so to complete.
Ill just leave it running over night and do one sheet at a time lol thanks though! helps get rid of my outliers which is great !
 
Upvote 0
Hi Secret292,

A job I did recently at work (a large data reduction) made me think of this thread and I realised (DUH!) that you could get a large speed saving by processing the data in memory rather than constantly accessing the worksheet. I've checked on sample data (10,000 lines) and get a 10X saving. Let me know if you are still working on this project and I'll post the code.

Teeroy
 
Upvote 0
OK i tested both methods and it takes 4.5 - 5 hrs to complete one sheet !!!
Where as if I do it individually (i.e. each column at a time) it takes 1.5 hours or so to complete.
Ill just leave it running over night and do one sheet at a time lol thanks though! helps get rid of my outliers which is great !

Hi Secret292,

A job I did recently at work (a large data reduction) made me think of this thread and I realised (DUH!) that you could get a large speed saving by processing the data in memory rather than constantly accessing the worksheet. I've checked on sample data (10,000 lines) and get a 10X saving. Let me know if you are still working on this project and I'll post the code.

Teeroy

How does the TRIMMEAN formula perform?
 
Upvote 0

Forum statistics

Threads
1,221,498
Messages
6,160,161
Members
451,627
Latest member
WORBY10

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top