Fast Duplicate Check For Large Data Sets Using VBA?

CaliKidd

Board Regular
Joined
Feb 16, 2011
Messages
173
Hey Experts,

I am using the following logic to check for and highlight duplicate entries. The way the logic works, it starts off slow and speeds up as it goes because the number of cells it has to check for duplicates becomes less and less. It works OK as long as the data set is relatively small (hundreds to low thousands), but I decided to load-test it on 1,000,000 cells and it crawled! It took about 30 minutes to just get through the first 15 duplicate checks (and my PC has a quad-core processor overclocked to 4Ghz and 4Gb RAM). At this rate, it would probably take months to finish... :eeek:

Here's the code I borrowed from another online site:
Code:
Sub DupsGreen()
[INDENT]Application.ScreenUpdating = False
Rng = Selection.Rows.Count
For i = Rng To 1 Step -1
[INDENT]myCheck = ActiveCell
ActiveCell.Offset(1, 0).Select
For j = 1 To i
[INDENT]If ActiveCell = myCheck Then
[INDENT]Selection.Font.Bold = True
Selection.Font.ColorIndex = 4
[/INDENT]End If
ActiveCell.Offset(1, 0).Select
[/INDENT]Next j
ActiveCell.Offset(-i, 0).Select
[/INDENT]Next i
Application.ScreenUpdating = True
[/INDENT]End Sub
Does anyone know of a slicker, faster way?
 
eliminate formulas totally to keep the spreadsheet fast

It is just a slight modification to the earlier conditional formatting code to instead populate the adjacent column. It is best to totally avoid formulas (and hence the conditional formatting approach) as they will slow the spreadsheet. That is why the code pastes special values & also why I earlier suggest UPDATE queries.

Code:
'it'd also be best to not use selection in the VBA
'I've just maintained it per the rest of the thread
 
With Selection.Offset(, 1)
  .ClearContents
  .FormulaR1C1 = "=if(countif(" & .Cells(1).Offset(, -1).Address(ReferenceStyle:=xlR1C1) & ":RC[-1],RC[-1])>1,1,"""")"
  .Value = .Value
End With
 
Upvote 0

Excel Facts

What is the fastest way to copy a formula?
If A2:A50000 contain data. Enter a formula in B2. Select B2. Double-click the Fill Handle and Excel will shoot the formula down to B50000.
Hi Mirabeau, Thanks for this gr8 code. Its working perfectly fine as per my needs. However, one change that i was trying but failed that I dont need results on the excel sheet but want the same details in a listbox on a form. I tried listbox1.additem but got only the value and address and not all the address as shown on the sheet.

Can please help and edit the code.

Thanks and Regards


You could consider the following code.

Try it on some small test data in ColumnA first.

It doesn't highlight anything - doing this slows the code significantly, except for some Conditional Formatting type approaches which are specifically designed for highlighting. Nor does it do anything to the original data other than read it.

What it does is to list in ColD the values occurring more than once in your data, in ColE the address of the first such value, and in colF the address(es) of the other(s). If you wish this duplicate list can easily be put anywhere you like.

This code does use the scripting dictionary object, which actually takes about twice as long to run as the sorting approach I used earlier, if the data are all in one column. It also generates this dictionary through early binding, so if you initially get an error then, in the code window menu, go to Tools-> References and check microsoft scripting runtime.
Code:
Sub markdups2()
Dim t As Single
t = Timer
Dim d As New dictionary 'ref microsoft scripting runtime
Dim i&, j&, k&, e As Variant
Dim c(1 To 10 ^ 6, 1 To 3)
For Each e In Intersect(Columns("A"), ActiveSheet.UsedRange)
If e.Value <> vbNullString Then
    If Not d.exists(e.Value) Then
        k = k + 1
        d(e.Value) = k
        c(k, 1) = e.Value
        c(k, 2) = e.Address
    Else
        c(d(e.Value), 3) = c(d(e.Value), 3) & "," & e.Address
    End If
End If
Next e
For i = 1 To k
    If Len(c(i, 3)) > 0 Then
        j = j + 1
        c(j, 1) = c(i, 1): c(j, 2) = c(i, 2)
        c(j, 3) = Right(c(i, 3), Len(c(i, 3)) - 1)
    End If
Next i
If j > 0 Then Range("D1").Resize(j, 3) = c
MsgBox "Code took " & Format(Timer - t, "0.000 secs")
End Sub
 
Upvote 0
saw this thread the other day and worked on an approach based off of brettdj, its not really a solution to your problem.

@brettdj:

was interested in your method as it returns the range object of the duplicated range...this is really quite tricky, especially when there is a high number of disjointed ranges. the "union" function is extremely useful, but really is quite slow when used any more than it needs to be.

its really not easy, as you have to deal with the case where you have a majority unique, and a majority duplicate.

when there are lots of duplicates you can use union a lot less by pre-finding adjacent areas in a column. when there are lots of unique values there are less duplicates to worry about...the worst i think is the middle range, where there is a high number of disjointed duplicate ranges.

i think here there is really not much you can do about it, and returning one range object will be slow.

this function gets almost there..with 2 arrays output...needs a way of merging the 2 before using union...would have to sort the frAdd array by col and row, then merge..i dont have the time to make this function, but it seems that that methodology would be the fastest. and by "the fastest" i mean the fastest simple way i can think of at the moment.


again its not polished or complete...and was tested only on 100,000 random longs from 1-10000

Code:
Option Explicit
Option Base 1

Sub tst1()
Dim tRng As Range
Set tRng = getDupRng1(Range("a1:a100000"))

End Sub

Public Function getDupRng1(inRng As Range) As Range

Dim tArea As Range
Dim tArr As Variant, TMP As Variant, pVal As Variant
Dim numRows As Long, i As Long, j As Long, totCnt As Long, frCnt As Long
Dim vDic As Object
Dim totAdd() As String, frAdd() As String
Dim stCell As Range
Dim consecRng As Boolean
Dim tStr As String, lstAdd As String

Dim tim As Double

tim = Timer
If inRng Is Nothing Then Exit Function
Set vDic = CreateObject("scripting.dictionary")

'set initial arrays to max possible size
ReDim totAdd(1 To inRng.Cells.Count)
ReDim frAdd(1 To inRng.Cells.Count)

'loop through all areas
For Each tArea In inRng.Areas
    'sets temp array, number of rows, and the first cell in range
    tArr = tArea.Value2
    numRows = UBound(tArr)
    Set stCell = tArea.Cells(1)
    
    'loops through each cell, column outside rows in loop -- usually more rows than cols
    With stCell
        For i = 1 To UBound(tArr, 2)
            For j = 1 To numRows
                'set a tmp var to the cell value (not much faster, easier to read)
                TMP = tArr(j, i)
                'check if empty, skips if is
                If TMP <> Empty Then
                    'this improves speed on sorted or semisorted data
                    If TMP <> pVal Then
                        pVal = TMP
                        'checks if exists, if does, adds, if not adds to "other" array
                        If Not vDic.exists(TMP) Then
                            vDic.Add TMP, .Offset(j - 1, i - 1).Address
                            'new unique so dup range is broken
                            consecRng = False
                        Else
                            If consecRng Then
                                'this just returns the current last address of the consecutive range
                                lstAdd = .Offset(j - 1, i - 1).Address
                            Else
                                'here consec is broken so check to add
                                If lstAdd <> vbNullString Then
                                    totAdd(totCnt) = totAdd(totCnt) & ":" & lstAdd
                                    lstAdd = vbNullString
                                End If
                                'start new range
                                totCnt = totCnt + 1
                                totAdd(totCnt) = .Offset(j - 1, i - 1).Address
                                consecRng = True
                            End If
                            
                            'if unique first then add to frAdd
                            If vDic.Item(TMP) <> Empty Then
                                frCnt = frCnt + 1
                                frAdd(frCnt) = vDic.Item(TMP)
                                vDic.Item(TMP) = Empty
                            End If
                        End If
                    Else
                        'same as above just skip dictionary bit
                        If consecRng Then
                            lstAdd = .Offset(j - 1, i - 1).Address
                        Else
                            If lstAdd <> vbNullString Then
                                totAdd(totCnt) = totAdd(totCnt) & ":" & lstAdd
                                lstAdd = vbNullString
                            End If
                            totCnt = totCnt + 1
                            totAdd(totCnt) = .Offset(j - 1, i - 1).Address
                            consecRng = True
                        End If
                        
                        If vDic.Item(TMP) <> Empty Then
                            frCnt = frCnt + 1
                            frAdd(frCnt) = vDic.Item(TMP)
                            vDic.Item(TMP) = Empty
                        End If
                    End If
                End If
            Next j
            'gets last one if possible
            If consecRng Then
                If lstAdd <> vbNullString Then totAdd(totCnt) = totAdd(totCnt) & ":" & lstAdd
            End If
            'havent really tested multiple columns this would restart consec though
            consecRng = False
        Next i
    End With
Next tArea

'redimensions lists
ReDim Preserve totAdd(1 To totCnt)
ReDim Preserve frAdd(1 To frCnt)

'this gets the total range...very slow (dont run if str array>10000)
If totCnt > 10000 Or frCnt > 10000 Then
    MsgBox "will take long to run...over a minute"
    Stop
End If

Set getDupRng1 = strUnion(frAdd)
getDupRng1.Interior.ColorIndex = 4
Set getDupRng1 = strUnion(totAdd)
getDupRng1.Interior.ColorIndex = 3

MsgBox Timer - tim
End Function


Public Function strUnion(strArr() As String, Optional stRng As Range = Nothing, Optional pWs As Worksheet) As Range

Dim i As Long
Dim tStr As String
Dim doUnion As Boolean

If pWs Is Nothing Then Set pWs = ActiveSheet

If Not stRng Is Nothing Then
    Set strUnion = stRng
    doUnion = True
End If

For i = LBound(strArr) To UBound(strArr)
    If strArr(i) <> vbNullString Then
        If Len(tStr) + Len(strArr(i)) > 254 Then
            If doUnion Then
                Set strUnion = Union(strUnion, pWs.Range(tStr))
            Else
                Set strUnion = pWs.Range(tStr)
                doUnion = True
            End If
            tStr = strArr(i)
        Else
            If tStr <> vbNullString Then
                tStr = tStr & "," & strArr(i)
            Else: tStr = strArr(i)
            End If
        End If
    End If
Next i

If doUnion Then
    Set strUnion = Union(strUnion, pWs.Range(tStr))
Else
    Set strUnion = pWs.Range(tStr)
End If

End Function
 
Upvote 0
ON SECOND THOUGHT!!!

if you wanted to return a range object i would think that the whole one pass, then merging lists still might be the fastest but i think this way is much simpler. just does 2 passes, and so you end up with an optimized (within columns at least) array with which to merge using union...this will return a single range containing all the duplicate values...tested over 100,000 cells containing random integers from 1-10000.... and some random uniques...still has same problem as other, when you try and join to many ranges very slow, not sure how to fix this.

Code:
Option Explicit
Option Base 1

Sub tst2()
Dim tRng As Range
Set tRng = getDup2(Range("a1:b100000"))

End Sub

Public Function getDup2(inRng As Range) As Range

Dim tArea As Range
Dim tArr As Variant, TMP As Variant
Dim numRows As Long, i As Long, j As Long, totCnt As Long, _
        k As Long, lstRow As Long, lstCol As Long
Dim vDic As Object
Dim totAdd() As String
Dim stCell As Range
Dim consecRng As Boolean

Dim tim As Double

tim = Timer
If inRng Is Nothing Then Exit Function
Set vDic = CreateObject("scripting.dictionary")

'set initial arrays to max possible size
ReDim totAdd(1 To inRng.Cells.Count)

'loop through all areas
For Each tArea In inRng.Areas
    'sets temp array, number of rows, and the first cell in range
    tArr = tArea.Value2
    numRows = UBound(tArr)
    Set stCell = tArea.Cells(1)
    
    For k = 1 To 2
        'loops through each cell, column outside rows in loop -- usually more rows than cols
        With stCell
            For i = 1 To UBound(tArr, 2)
                For j = 1 To numRows
                    'set a tmp var to the cell value (not much faster, easier to read)
                    TMP = tArr(j, i)
                    'check if empty, skips if is
                    If TMP <> Empty Then
                        If k < 2 Then
                            If Not vDic.exists(TMP) Then
                                vDic.Add TMP, Empty
                            Else
                                vDic.Item(TMP) = 2
                            End If
                        Else
                            If vDic.Item(TMP) > 1 Then
                                If consecRng Then
                                    lstRow = j
                                    lstCol = i
                                Else
                                    If lstRow > 0 Then
                                        totAdd(totCnt) = totAdd(totCnt) & ":" & .Offset(lstRow - 1, lstCol - 1).Address
                                        lstRow = 0
                                    End If
                                    totCnt = totCnt + 1
                                    consecRng = True
                                    totAdd(totCnt) = .Offset(j - 1, i - 1).Address
                                End If
                            Else
                                consecRng = False
                            End If
                        End If
                    End If
                Next j
                If lstRow > 0 Then
                    totAdd(totCnt) = totAdd(totCnt) & ":" & .Offset(lstRow - 1, lstCol - 1).Address
                    lstRow = 0
                End If
                consecRng = False
            Next i
        End With
    Next k
Next tArea

ReDim Preserve totAdd(1 To totCnt)
Set getDup2 = strUnion(totAdd)
MsgBox Timer - tim

End Function


Public Function strUnion(strArr() As String, Optional pWs As Worksheet) As Range

Dim i As Long
Dim tStr As String
Dim doUnion As Boolean

'assumes all individual inputs less than 256
If pWs Is Nothing Then Set pWs = ActiveSheet

For i = LBound(strArr) To UBound(strArr)
    If strArr(i) <> vbNullString Then
        If Len(tStr) + Len(strArr(i)) > 254 Then
            If doUnion Then
                Set strUnion = Union(strUnion, pWs.Range(tStr))
            Else
                Set strUnion = pWs.Range(tStr)
                doUnion = True
            End If
            tStr = strArr(i)
        ElseIf tStr <> vbNullString Then
            tStr = tStr & "," & strArr(i)
        Else
            tStr = strArr(i)
        End If
    End If
Next i

If tStr <> vbNullString Then
    If doUnion Then
        Set strUnion = Union(strUnion, pWs.Range(tStr))
    Else
        Set strUnion = pWs.Range(tStr)
    End If
End If

End Function
 
Upvote 0
Hi,

You're reference to "There's an advanced data filter which filters a column, removing any duplicate entries, pasting the results into a new column. Will this suffice?"

Would be beneficial to me?
 
Upvote 0
Hello to all from Italy.
After much research on the net I found interesting this macro of mirabeau:
http://www.mrexcel.com/forum/excel-...a-sets-using-visual-basic-applications-2.html

Code:
Sub colorduplicates()
Dim t As Single
t = Timer
Dim q&, x&, i&, a
Dim ash As Worksheet
Set ash = ActiveSheet
q = Range("A" & Rows.Count).End(3).Row - 3
a = Range("A4").Resize(q)
Application.ScreenUpdating = False
With Sheets.Add
    .Cells(1).Resize(q) = a
    .Cells(2) = 1: .Cells(2).Resize(q).DataSeries
    .Cells(1).Resize(q, 2).Sort .Cells(1), 1, Header:=xlNo
    a = .Cells(1).Resize(q + 1)
    For i = 1 To q
        If a(i, 1) <> a(i + 1, 1) Then
            If i > x + 1 Then _
                .Cells(x + 2, 1).Resize(i - x - 1).Font.ColorIndex = 4
            x = i
        End If
    Next i
    .Cells(1).Resize(q, 2).Sort .Cells(2), 1, Header:=xlNo
    .Cells(1).Resize(q).Copy ash.Range("A4")
Application.DisplayAlerts = False
    .Delete
Application.DisplayAlerts = True
End With
Application.ScreenUpdating = True
MsgBox "Code took " & Format(Timer - t, "0.00 secs")
End Sub

you can add a change to a new msgbox to display how many repeated numbers were found?
Thanks in advance.
max_max
 
Upvote 0
Hello to all.
Probably the thread that I have proposed is too old
and who created the macro perhaps no longer included in the forum.
Someone can correct the macro that I revived?
Thanks in advance.
max_max
 
Upvote 0

Forum statistics

Threads
1,224,586
Messages
6,179,719
Members
452,939
Latest member
WCrawford

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top