VBA - Sophisticated approach to find and highlight duplicates

excelstarter1

Board Regular
Joined
Jul 20, 2017
Messages
81
Hello VBA pros,

I have a tricky problem (well it is for me) regarding duplicates in a dataset. Regularly I receive manually generated customer lists from our finance/sales or another department including an overview of customers and their revenues in a specific period (customer name in column A and revenue in column B). Now the issues with those lists are duplicates, e.g. the customer Walmart appears several times as Walmart USA, Walmart (NY), Walmartt (with a typo), USA Walmart, NY Walmart and so on. I want to find all those duplicates and cluster the customers. The simple Excel function would not work. I already tried text functions like LEFT (for 3 characters, then run find duplicates). However, they are not sufficient.

Is there a way with VBA to identify duplicates by comparing the content of each individual cell to the full dataset (e.g. 1000 rows with customer names)? As the duplicate name could appear at the beginning (e.g. Walmart USA), the end (USA Walmart) or the middle (USA, Walmart, NY), it would make sense not to compare the full content of the cell but only 5 or 6 characters in order to take this issue into account.

Would be great if you could help me with a couples of ideas!

Thanks a lot!

Best regards
 

Excel Facts

Format cells as date
Select range and press Ctrl+Shift+3 to format cells as date. (Shift 3 is the # sign which sort of looks like a small calendar).
I don't regard this as sophisticated, but it may be helpful. This code will prompt the user to enter a customer name (not case sensitive, but can easily be changed to be so). Every instance of that name found in the used range of the active sheet will be highlighted in bold, red font.
Code:
Sub HighlightWordOrPhrase()
Dim wrd As Variant, ct As Long
wrd = InputBox("Enter customer name to highlight")
If wrd = "" Then Exit Sub
Dim c As Range, x As Variant, i As Long
Application.ScreenUpdating = False
For Each c In ActiveSheet.UsedRange
    If Not IsError(c) And Not IsEmpty(c) Then
        x = InStr(1, c.Value, wrd, vbTextCompare)
        If x > 0 Then
            ct = ct + 1
            With c.Characters(x, Len(wrd)).Font
                .Color = vbRed
                .Bold = True
            End With
        End If
    End If
Next c
If ct > 0 Then
    MsgBox ct & " instances of the customer name were found and highlighted"
Else
    MsgBox "No instances of the customer name were found in the activesheet"
End If
Application.ScreenUpdating = True
End Sub
If the customer name is say Ford you can enter it with a leading and trailing space (" Ford ") so you won't pickup something like "Fording the stream was a challenge" if that's an issue.
 
Last edited:
Upvote 0
Thats great, thank you Joe!
Do you have an approach on how to take e.g. the first 5 characters of each cell (for the whole e.g. 500 rows) and compare that string to the whole list?
 
Upvote 0
You are welcome.

Can you provide some sample data and expected result for your new question? I don't quite grasp exactly what you describe and want to achieve. If you want to highlight the first 5 characters of every cell in the used range in every cell in the used range those characters might appear, the whole sheet may become a highlighted mess.
 
Upvote 0
Sure. Sorry, maybe my request was a bit confusing.
Usually the customers are not ranked like this but mixed (here all the different Walmart customers follow each other; see below).
The idea is to take the first five characters of each cell (here e.g. row 2 = Walma) und use that string (Walma) to check the full list and identify duplicates.
Then repeat the process with the next customer in row 3 and so on.
Ideally I want a new row where all different Walmart customers are allocated to Walmart (or here the string Walma), Nike, Adidas and so on. That last task might be a bit more complex...

[TABLE="width: 500, align: left"]
<tbody>[TR]
[TD]CUSTOMER
[/TD]
[TD]Sales
[/TD]
[/TR]
[TR]
[TD]Walmart
[/TD]
[TD]10
[/TD]
[/TR]
[TR]
[TD]Walmart USA
[/TD]
[TD]20
[/TD]
[/TR]
[TR]
[TD]Walmart, NY
[/TD]
[TD]30
[/TD]
[/TR]
[TR]
[TD]USA Walmart
[/TD]
[TD]15
[/TD]
[/TR]
[TR]
[TD]NY, Walmart, USA
[/TD]
[TD]40
[/TD]
[/TR]
[TR]
[TD]USAWalmart
[/TD]
[TD]10
[/TD]
[/TR]
[TR]
[TD]Nike
[/TD]
[TD]90
[/TD]
[/TR]
[TR]
[TD]USA Nike
[/TD]
[TD]20
[/TD]
[/TR]
[TR]
[TD]Nike, CA
[/TD]
[TD]15
[/TD]
[/TR]
[TR]
[TD]Nike, USA, CA
[/TD]
[TD]10
[/TD]
[/TR]
[TR]
[TD]USA, Nike
[/TD]
[TD]50
[/TD]
[/TR]
[TR]
[TD]Nikee USA
[/TD]
[TD]20
[/TD]
[/TR]
[TR]
[TD]Adidas
[/TD]
[TD]10
[/TD]
[/TR]
[TR]
[TD]Canada Adidas
[/TD]
[TD]20
[/TD]
[/TR]
[TR]
[TD]Adidas USA
[/TD]
[TD]50
[/TD]
[/TR]
</tbody>[/TABLE]
 
Upvote 0
To prevent a highlighted mess, you could create a new column say before the Customer column then using the first 5 of each cell in the customer list assign a number in that new column and then when the list is complete use that column to sort the whole table so like customer names are grouped. Is that something you could work with? Also, if the search can be confined to just the customer column, that would speed things up a bit.

I have to go offline shortly, but if you reply I will look at this further, if no one else has helped you, when I return.
 
Upvote 0
No worries, thanks for your quick replies!! A big help!

Yes the search is only confined to just the customer column. The numbers/sales can be ignored for the moment.

Well the idea with the first 5 characters was a loop and to start with the first 5 (e.g. Walma), then skip one character and use the next 5 characters (e.g. almar) and so on. I have not figured out the best approach. In the end I want all the customers that have "Walmart" in their name or Nike, Adidas etc. to be in one group.
 
Upvote 0
After looking at the sample data you posted, I think using the first x characters will not work to achieve what you want. Are there few enough customers that you could provide a list of them using only the key part, not all of them would have to appear in any particular listing you receive? For example, for the customer list you posted the key list would be: Adidas, Nike, Walmart.

If such a list was available it would be possible to sort the data by company like the customer list in post #5 is arranged.

An alternative would be to have the user input a specific customer or customers and the macro would group all entries for that customer or customers starting at the top of the data range.
 
Upvote 0

Forum statistics

Threads
1,223,907
Messages
6,175,300
Members
452,633
Latest member
DougMo

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top