Excel VBA - Check For Duplicates and Return the Count on a Large Data Set

whyme6181

New Member
Joined
Oct 26, 2009
Messages
47
Hi. I have a macro that I use to go through each record's ID number and check to see if that number is in the list and if so, it will return how many times it is in the list in the adjacent column. This works great on many of the spreadsheets that we use it for, but there is a group of spreadsheets that have over 800,000 records where it takes a long time to process through (over 9 hours!) I have tried multiple methods in the past from applying the countif formula directly to each row and letting excel do the heavy lifting (resulted in not all formulas calculating and an incorrect result), using a scripting dictionary (took about 4 hours but didn't return the count of the occurrences and ignored the first instance of each), to conditional formatting (which highlighted every duplicate supposedly but I was never able to check because afterwards the sheet would become unusable and freeze). Here is my existing code. Any help would be greatly appreciated.

VBA Code:
Application.ScreenUpdating = False
StepNum = StepNum + 1
Application.StatusBar = "Validating Duplicate Case IDs (Step " & StepNum & " of 12)"

ActiveSheet.AutoFilterMode = False

DoEvents
    rApplication.Offset(0, 1).Select
    If Cells(rHdr.Row, ActiveCell.Column) = "Application Count" Then
    Else
        Trulcol = Trulcol + 1
        rCase.Offset(0, 1).EntireColumn.Insert
        rCase.Offset(0, 1).EntireColumn.NumberFormat = "General"
        rCase.Offset(0, 1).EntireColumn.Interior.Color = 65535
        Cells(rHdr.Row, rCase.Column + 1) = "Case Count"
    End If

rApplication.Offset(0, 1).Select

''''Goes one by one and takes forever (9 hours) 100% Accurate
Set ra1 = Cells(rHdr.Row, ActiveCell.Column)
    Do
        Set ra1 = ra1.Offset(1, 0)
        Application.StatusBar = "Validating Duplicate Case IDs (Step " & StepNum & " of 12) " & ra1.Row & " of " & Trulrow
        ra1.Value = Application.WorksheetFunction.CountIf(Range(rApplication.Address), ra1.Offset(0, -1))
    Loop Until ra1.Row = Trulrow
 

Excel Facts

What is =ROMAN(40) in Excel?
The Roman numeral for 40 is XL. Bill "MrExcel" Jelen's 40th book was called MrExcel XL.
Hi,

As you have a large dataset probably PowerQuery will offer a stunting quick way.
I expect just seconds, not hours.

I expect you are using at least Excel 2010 (in 2010 it was a downloadable add-in, since 2013 it is integrated in standard Excel installation. And in current versions it is just: Data->Get and Transform data)

So select your input data table and make it an excel table (Insert->Table).
Then make a new empty query in PQ and go to advanced editor and write such code:
Power Query:
let
    Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
    CounterColumn = Table.AddColumn(Source, "Count", each Table.RowCount(Table.SelectRows(Source,(R) => _[My Value] = R[My Value])), type number),
    RemovedColumnsOtherThanCounter = Table.SelectColumns(CounterColumn,{"Count"})
in
    RemovedColumnsOtherThanCounter
Change if needed Table1 to your just created table name, and My Value to header of your IDs column (twice in the code), then from main menu select Close and Load to and point to an empty cell in headers row in your worksheet.

If input data changes refresh the query by right clicking on the output and selecting refresh.

BTW, as you've asked about VBA code:
Just a part of code, without declarations, and a concept (screenshot) of data layout (here probably just one column) is not easy to improve/debug.
Anyway, if you would like to stick to VBA code, the quick way would be by
1) downloading whole IDs column to an array,
2) for each element make counting (if there are many repeating elements, may be also a directory with already calculated results), and
3) store output for given ID in the second array and return to 2) until all rows are processed
4) only after whole processing is done*, output whole array to the worksheet*

* Otherwise you are writing cell by cell (to cell ra1 = ra1.Offset(1, 0)) some 800 000 cells. And that is really key time-consuming process.
 

Attachments

  • Zrzut ekranu 2024-09-11 124948.png
    Zrzut ekranu 2024-09-11 124948.png
    84.3 KB · Views: 6
Upvote 0
Since you didn't provide the full code and images for me to visualize your data, I can only make an educated guess based on your code snippet.
I have taken the liberty to draft a code snippet using arrays and dictionaries, rather than coding directly on the sheet

PHP:
Dim rng, rng2, res()
Dim dic As Object
Set dic = CreateObject("Scripting.Dictionary")

'this is data range ( countif range) ???
rng = Range(rApplication.Address).Value ' save data range into an array

'this is lookup range , with 2nd column will be the counting???
rng2 = Range(Cells(rHdr.Row, ActiveCell.Column), Cells(Trulrow, ActiveCell.Column)).Value

ReDim res(1 To UBound(rng2), 1 To 1) ' result array
For i = 1 To UBound(rng2) 'loop from 1st lookup  value to last one
    For j = 1 To UBound(rng) 'do counting
        If rng2(i, 1) = rng(j, 1) Then
            If Not dic.exists(rng2(i, 1)) Then '1st time
                dic.Add rng2(i, 1), 1
            Else
                dic(rng2(i, 1)) = dic(rng2(i, 1)) + 1 ' count +1 for next times
            End If
        End If
    Next
Next

'check with dictionary
For i = 1 To UBound(rng2)
    If dic.exists(rng2(i, 1)) Then res(i, 1) = dic(rng2(i, 1))
Next

'write results from array onto sheet at once
Cells(rHdr.Row, ActiveCell.Column).Resize(UBound(res), 1).Value = res
 
Upvote 0
Hi,

As you have a large dataset probably PowerQuery will offer a stunting quick way.
I expect just seconds, not hours.

I expect you are using at least Excel 2010 (in 2010 it was a downloadable add-in, since 2013 it is integrated in standard Excel installation. And in current versions it is just: Data->Get and Transform data)

So select your input data table and make it an excel table (Insert->Table).
Then make a new empty query in PQ and go to advanced editor and write such code:
Power Query:
let
    Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
    CounterColumn = Table.AddColumn(Source, "Count", each Table.RowCount(Table.SelectRows(Source,(R) => _[My Value] = R[My Value])), type number),
    RemovedColumnsOtherThanCounter = Table.SelectColumns(CounterColumn,{"Count"})
in
    RemovedColumnsOtherThanCounter
Change if needed Table1 to your just created table name, and My Value to header of your IDs column (twice in the code), then from main menu select Close and Load to and point to an empty cell in headers row in your worksheet.

If input data changes refresh the query by right clicking on the output and selecting refresh.

BTW, as you've asked about VBA code:
Just a part of code, without declarations, and a concept (screenshot) of data layout (here probably just one column) is not easy to improve/debug.
Anyway, if you would like to stick to VBA code, the quick way would be by
1) downloading whole IDs column to an array,
2) for each element make counting (if there are many repeating elements, may be also a directory with already calculated results), and
3) store output for given ID in the second array and return to 2) until all rows are processed
4) only after whole processing is done*, output whole array to the worksheet*

* Otherwise you are writing cell by cell (to cell ra1 = ra1.Offset(1, 0)) some 800 000 cells. And that is really key time-consuming process.
Thank you for your input. Would you be able to give me an example of using the array method you mentioned?
 
Upvote 0
One of array - oriented approaches was presented by @bebo021999 above.
Data is loaded into array:
VBA Code:
rng = Range(rApplication.Address).Value ' save data range into an array
and resulting array is loaded into a sheet
VBA Code:
Cells(rHdr.Row, ActiveCell.Column).Resize(UBound(res), 1).Value = res
All work in between is done in computer's memory.


Anyway - think about Power Query. With huge datasets it could (shall?) be the quickest method.
 
Upvote 0

Forum statistics

Threads
1,223,702
Messages
6,173,932
Members
452,539
Latest member
delvey

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top