Fast Duplicate Check For Large Data Sets Using VBA?

CaliKidd

Board Regular
Joined
Feb 16, 2011
Messages
173
Hey Experts,

I am using the following logic to check for and highlight duplicate entries. The way the logic works, it starts off slow and speeds up as it goes because the number of cells it has to check for duplicates becomes less and less. It works OK as long as the data set is relatively small (hundreds to low thousands), but I decided to load-test it on 1,000,000 cells and it crawled! It took about 30 minutes to just get through the first 15 duplicate checks (and my PC has a quad-core processor overclocked to 4Ghz and 4Gb RAM). At this rate, it would probably take months to finish... :eeek:

Here's the code I borrowed from another online site:
Code:
Sub DupsGreen()
[INDENT]Application.ScreenUpdating = False
Rng = Selection.Rows.Count
For i = Rng To 1 Step -1
[INDENT]myCheck = ActiveCell
ActiveCell.Offset(1, 0).Select
For j = 1 To i
[INDENT]If ActiveCell = myCheck Then
[INDENT]Selection.Font.Bold = True
Selection.Font.ColorIndex = 4
[/INDENT]End If
ActiveCell.Offset(1, 0).Select
[/INDENT]Next j
ActiveCell.Offset(-i, 0).Select
[/INDENT]Next i
Application.ScreenUpdating = True
[/INDENT]End Sub
Does anyone know of a slicker, faster way?
 
How exactly would this be done? One caveat - I cannot alter/remove the data in any way. I can only analyze it and then annotate it in some fashion (hence, the highlight, cell comments, and/or msgbox) so the owner of the data can then fix it.
You could consider the following code.

Try it on some small test data in ColumnA first.

It doesn't highlight anything - doing this slows the code significantly, except for some Conditional Formatting type approaches which are specifically designed for highlighting. Nor does it do anything to the original data other than read it.

What it does is to list in ColD the values occurring more than once in your data, in ColE the address of the first such value, and in colF the address(es) of the other(s). If you wish this duplicate list can easily be put anywhere you like.

This code does use the scripting dictionary object, which actually takes about twice as long to run as the sorting approach I used earlier, if the data are all in one column. It also generates this dictionary through early binding, so if you initially get an error then, in the code window menu, go to Tools-> References and check microsoft scripting runtime.
Code:
Sub markdups2()
Dim t As Single
t = Timer
Dim d As New dictionary 'ref microsoft scripting runtime
Dim i&, j&, k&, e As Variant
Dim c(1 To 10 ^ 6, 1 To 3)
For Each e In Intersect(Columns("A"), ActiveSheet.UsedRange)
If e.Value <> vbNullString Then
    If Not d.exists(e.Value) Then
        k = k + 1
        d(e.Value) = k
        c(k, 1) = e.Value
        c(k, 2) = e.Address
    Else
        c(d(e.Value), 3) = c(d(e.Value), 3) & "," & e.Address
    End If
End If
Next e
For i = 1 To k
    If Len(c(i, 3)) > 0 Then
        j = j + 1
        c(j, 1) = c(i, 1): c(j, 2) = c(i, 2)
        c(j, 3) = Right(c(i, 3), Len(c(i, 3)) - 1)
    End If
Next i
If j > 0 Then Range("D1").Resize(j, 3) = c
MsgBox "Code took " & Format(Timer - t, "0.000 secs")
End Sub
 
Upvote 0

Excel Facts

How to fill five years of quarters?
Type 1Q-2023 in a cell. Grab the fill handle and drag down or right. After 4Q-2023, Excel will jump to 1Q-2024. Dash can be any character.
The results are in on a data set of 1,048,573 cells with one value duplicated two other times, one near the top and one near the end.

Dave, your code clocked in at 44.6 seconds using the late binding approach and 38.5 seconds using an early binding approach. Thanks for commenting your code as it made it easier for me to understand what the code was doing.

Mirabeau, your latest code clocked in at 43 seconds using a late binding approach and 39.5 seconds using an early binding approach.

In terms of speed, they both perform about the same. Since they both flush out the cell addresses of where the duplicates are, I should be able to take it from there in terms of adding highlights, comments, etc, if required.

Both of you are more advanced in your Excel/VBA skills than I am, so it will take me a bit to analyze your solutions since my knowledge and use of variant arrays and intersect/union methods is fairly basic. I really don't know enough right now to know if one approach is better than the other. I'm just glad they work! :biggrin: Seeing what is possible at a higher skill level inspires me to keep learning and get to the same level as you.

I think we've kicked this horse enough for now. If I have any additional follow-up questions on the details, I hope you don't mind if I PM you.

Thanks for all your help.
 
Upvote 0
I think we've kicked this horse enough for now. If I have any additional follow-up questions on the details, I hope you don't mind if I PM you.
Thanks for all your help.
No problem :)

I suggest it may be worth continuing the discussing in this thread though to benefit any others who may be interested

Note that you can adapt this approach easily to
- run case insensitive dupe matching [ in my code use LCase$(X(lngRow, lngCol)) rather than X(lngRow, lngCol)]
- apply the Worksheet CLEAN and/or TRIM functions
- apply regular expression matching etc

This is exactly what my duplicate master addin does

A couple of variant array links in articles that I have written that you may find useful
- Using Variant Arrays in Excel VBA for Large Scale Data Manipulation, http://www.experts-exchange.com/A_2684.html (which I used for this threads code)
- Creating and Writing to a CSV File Using Excel VBA A http://www.experts-exchange.com/A_3509.html

Cheers

Dave
 
Upvote 0
To highlight duplicates after the first one is just a minor change.
Code:
With Selection
  .FormatConditions.Delete
  .FormatConditions.Add Type:=xlExpression, Formula1:="=countif(" & .Cells(1).Address(ReferenceStyle:=xlR1C1) & ":RC,RC)>1"
  .FormatConditions(1).Interior.ColorIndex = 4
End With

However, along xenou's comments, this sort of highlighting is of limited value.

If suitable, I would suggest an adjacent column is populated to tag the records of interest. Perhaps via formula (same as I've used just above for the conditional format) applied via VBA, and then converted to values so that it is not a speed sapping formula. Then filtering or whatever can be done using that extra field. You might even want to use some UPDATE queries to work with the data.

regards
 
Upvote 0
If suitable, I would suggest an adjacent column is populated to tag the records of interest. Perhaps via formula (same as I've used just above for the conditional format) applied via VBA, and then converted to values so that it is not a speed sapping formula.
I like this idea. Would it be possible to put a tag (say, the number "1") in the cell adjacent to the cell that is being evaluated? For example, if your code finds and highlights a duplicate in cells F100, is there a way to also put a "1" in cell G100?
 
Upvote 0
My code returns a range object of dupes

It you wanted a cell flag just change the address report to a cell flag, ie

Cheers

Dave

Code:
  MsgBox "Duplicate range is " & rng2.Address
'to
rng2.Offset(0, 1) = "Dupe"
 
Upvote 0
I cannot edit my last post, but I wanted to add one more sentence to the end:

If the cell is not highlighted as a result of the conditional format trigger, then it would leave the adjacent cell blank.
 
Upvote 0
On second thought, all I really want is for the number "1" to be placed next to any cell that has a duplicate value. The cell that has the duplicate value does not have to be highlighted. I just want to enter the number "1" in the cells adjacent to the duplicate cells. This would then allow me to sum up the number of duplicates found. Can this be done through a conditional formatting formula approach like Fazza's?
 
Upvote 0
No. Conditional formats change formatting not values

The dupe count exists in my code, just count the cells in the dupe range rng2. ie

Code:
  MsgBox "Duplicates number " & rng2.Cells.Count
 
Upvote 0
Ok, Dave, will do. I appreciate the help. My finding duplicates issue is now solved. :-)

I am going to open a new thread to see if there is a cell formula that can be written that detects the formatting of another cell. Since it's a different topic, I'll start a new post.
 
Upvote 0

Forum statistics

Threads
1,224,586
Messages
6,179,719
Members
452,939
Latest member
WCrawford

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top