Fast Duplicate Check For Large Data Sets Using VBA?

CaliKidd · Aug 27, 2011

Hey Experts,

I am using the following logic to check for and highlight duplicate entries. The way the logic works, it starts off slow and speeds up as it goes because the number of cells it has to check for duplicates becomes less and less. It works OK as long as the data set is relatively small (hundreds to low thousands), but I decided to load-test it on 1,000,000 cells and it crawled! It took about 30 minutes to just get through the first 15 duplicate checks (and my PC has a quad-core processor overclocked to 4Ghz and 4Gb RAM). At this rate, it would probably take months to finish...

Here's the code I borrowed from another online site:

Code:

Sub DupsGreen()
[INDENT]Application.ScreenUpdating = False
Rng = Selection.Rows.Count
For i = Rng To 1 Step -1
[INDENT]myCheck = ActiveCell
ActiveCell.Offset(1, 0).Select
For j = 1 To i
[INDENT]If ActiveCell = myCheck Then
[INDENT]Selection.Font.Bold = True
Selection.Font.ColorIndex = 4
[/INDENT]End If
ActiveCell.Offset(1, 0).Select
[/INDENT]Next j
ActiveCell.Offset(-i, 0).Select
[/INDENT]Next i
Application.ScreenUpdating = True
[/INDENT]End Sub

Does anyone know of a slicker, faster way?

mirabeau · Aug 30, 2011

CaliKidd said:
How exactly would this be done? One caveat - I cannot alter/remove the data in any way. I can only analyze it and then annotate it in some fashion (hence, the highlight, cell comments, and/or msgbox) so the owner of the data can then fix it.

You could consider the following code.

Try it on some small test data in ColumnA first.

It doesn't highlight anything - doing this slows the code significantly, except for some Conditional Formatting type approaches which are specifically designed for highlighting. Nor does it do anything to the original data other than read it.

What it does is to list in ColD the values occurring more than once in your data, in ColE the address of the first such value, and in colF the address(es) of the other(s). If you wish this duplicate list can easily be put anywhere you like.

This code does use the scripting dictionary object, which actually takes about twice as long to run as the sorting approach I used earlier, if the data are all in one column. It also generates this dictionary through early binding, so if you initially get an error then, in the code window menu, go to Tools-> References and check microsoft scripting runtime.

Code:

Sub markdups2()
Dim t As Single
t = Timer
Dim d As New dictionary 'ref microsoft scripting runtime
Dim i&, j&, k&, e As Variant
Dim c(1 To 10 ^ 6, 1 To 3)
For Each e In Intersect(Columns("A"), ActiveSheet.UsedRange)
If e.Value <> vbNullString Then
    If Not d.exists(e.Value) Then
        k = k + 1
        d(e.Value) = k
        c(k, 1) = e.Value
        c(k, 2) = e.Address
    Else
        c(d(e.Value), 3) = c(d(e.Value), 3) & "," & e.Address
    End If
End If
Next e
For i = 1 To k
    If Len(c(i, 3)) > 0 Then
        j = j + 1
        c(j, 1) = c(i, 1): c(j, 2) = c(i, 2)
        c(j, 3) = Right(c(i, 3), Len(c(i, 3)) - 1)
    End If
Next i
If j > 0 Then Range("D1").Resize(j, 3) = c
MsgBox "Code took " & Format(Timer - t, "0.000 secs")
End Sub

CaliKidd · Aug 30, 2011

The results are in on a data set of 1,048,573 cells with one value duplicated two other times, one near the top and one near the end.

Dave, your code clocked in at 44.6 seconds using the late binding approach and 38.5 seconds using an early binding approach. Thanks for commenting your code as it made it easier for me to understand what the code was doing.

Mirabeau, your latest code clocked in at 43 seconds using a late binding approach and 39.5 seconds using an early binding approach.

In terms of speed, they both perform about the same. Since they both flush out the cell addresses of where the duplicates are, I should be able to take it from there in terms of adding highlights, comments, etc, if required.

Both of you are more advanced in your Excel/VBA skills than I am, so it will take me a bit to analyze your solutions since my knowledge and use of variant arrays and intersect/union methods is fairly basic. I really don't know enough right now to know if one approach is better than the other. I'm just glad they work!

Seeing what is possible at a higher skill level inspires me to keep learning and get to the same level as you.

I think we've kicked this horse enough for now. If I have any additional follow-up questions on the details, I hope you don't mind if I PM you.

Thanks for all your help.

brettdj · Aug 30, 2011

CaliKidd said:
I think we've kicked this horse enough for now. If I have any additional follow-up questions on the details, I hope you don't mind if I PM you.
Thanks for all your help.

No problem

I suggest it may be worth continuing the discussing in this thread though to benefit any others who may be interested

Note that you can adapt this approach easily to
- run case insensitive dupe matching [ in my code use LCase$(X(lngRow, lngCol)) rather than X(lngRow, lngCol)]
- apply the Worksheet CLEAN and/or TRIM functions
- apply regular expression matching etc

This is exactly what my duplicate master addin does

A couple of variant array links in articles that I have written that you may find useful
- Using Variant Arrays in Excel VBA for Large Scale Data Manipulation, http://www.experts-exchange.com/A_2684.html (which I used for this threads code)
- Creating and Writing to a CSV File Using Excel VBA A http://www.experts-exchange.com/A_3509.html

Cheers

Dave

Fazza · Aug 30, 2011

To highlight duplicates after the first one is just a minor change.

Code:

With Selection
  .FormatConditions.Delete
  .FormatConditions.Add Type:=xlExpression, Formula1:="=countif(" & .Cells(1).Address(ReferenceStyle:=xlR1C1) & ":RC,RC)>1"
  .FormatConditions(1).Interior.ColorIndex = 4
End With

However, along xenou's comments, this sort of highlighting is of limited value.

If suitable, I would suggest an adjacent column is populated to tag the records of interest. Perhaps via formula (same as I've used just above for the conditional format) applied via VBA, and then converted to values so that it is not a speed sapping formula. Then filtering or whatever can be done using that extra field. You might even want to use some UPDATE queries to work with the data.

regards

CaliKidd · Aug 31, 2011

Fazza said:
If suitable, I would suggest an adjacent column is populated to tag the records of interest. Perhaps via formula (same as I've used just above for the conditional format) applied via VBA, and then converted to values so that it is not a speed sapping formula.

I like this idea. Would it be possible to put a tag (say, the number "1") in the cell adjacent to the cell that is being evaluated? For example, if your code finds and highlights a duplicate in cells F100, is there a way to also put a "1" in cell G100?

brettdj · Aug 31, 2011

My code returns a range object of dupes

It you wanted a cell flag just change the address report to a cell flag, ie

Cheers

Dave

Code:

  MsgBox "Duplicate range is " & rng2.Address
'to
rng2.Offset(0, 1) = "Dupe"

CaliKidd · Aug 31, 2011

I cannot edit my last post, but I wanted to add one more sentence to the end:

If the cell is not highlighted as a result of the conditional format trigger, then it would leave the adjacent cell blank.

CaliKidd · Aug 31, 2011

On second thought, all I really want is for the number "1" to be placed next to any cell that has a duplicate value. The cell that has the duplicate value does not have to be highlighted. I just want to enter the number "1" in the cells adjacent to the duplicate cells. This would then allow me to sum up the number of duplicates found. Can this be done through a conditional formatting formula approach like Fazza's?

brettdj · Aug 31, 2011

No. Conditional formats change formatting not values

The dupe count exists in my code, just count the cells in the dupe range rng2. ie

Code:

  MsgBox "Duplicates number " & rng2.Cells.Count

CaliKidd · Aug 31, 2011

Ok, Dave, will do. I appreciate the help. My finding duplicates issue is now solved.

I am going to open a new thread to see if there is a cell formula that can be written that detects the formatting of another cell. Since it's a different topic, I'll start a new post.

Fast Duplicate Check For Large Data Sets Using VBA?

CaliKidd

Board Regular

mirabeau

Banned user

CaliKidd

Board Regular

brettdj

Active Member

Fazza

MrExcel MVP

CaliKidd

Board Regular

brettdj

Active Member

CaliKidd

Board Regular

CaliKidd

Board Regular

brettdj

Active Member

CaliKidd

Board Regular

Similar threads

Share this page

Fast Duplicate Check For Large Data Sets Using VBA?

Board Regular

Banned user

Board Regular

Active Member

MrExcel MVP

Board Regular

Active Member

Board Regular

Board Regular

Active Member

Board Regular

Similar threads

Share this page

We've detected that you are using an adblocker.

Which adblocker are you using?

Disable AdBlock

Disable AdBlock Plus

Disable uBlock Origin

Disable uBlock