Find Partial Duplicate values in a column

leeandoona

New Member
Joined
Oct 13, 2016
Messages
45
I have an annoying problem I never seem to solve properly and today I've had another bash at it and I still can't figure out the best way forward. The problem is I have a very large list of data that contains columns that have partial duplicates that I need to identify, and then turn into 'unique' records. Unfortunately the duplicate values can occur randomly at the front, back of middle of the text string and can be either words or numbers or both! For example, here is a sample of what I've got to work with;

[TABLE="width: 406"]
<colgroup><col></colgroup><tbody>[TR]
[TD]SANTAS LITTLE HELPER (116cm)[/TD]
[/TR]
[TR]
[TD]SANTAS LITTLE HELPER (128cm)[/TD]
[/TR]
[TR]
[TD]SANTAS LITTLE HELPER (140cm)[/TD]
[/TR]
[TR]
[TD]PETTICOAT WHITE - ONE SIZE[/TD]
[/TR]
[TR]
[TD]PETTICOAT RED - ONE SIZE[/TD]
[/TR]
[TR]
[TD]PETTICOAT BLACK - ONE SIZE[/TD]
[/TR]
[TR]
[TD]DEATHLY GRIM REAPER (S)[/TD]
[/TR]
[TR]
[TD]DEATHLY GRIM REAPER (M)[/TD]
[/TR]
[TR]
[TD]DEATHLY GRIM REAPER (L)[/TD]
[/TR]
[TR]
[TD]DEATHLY GRIM REAPER (XL)[/TD]
[/TR]
[TR]
[TD]COLOUR CHANGING CRYSTAL BALL[/TD]
[/TR]
[TR]
[TD]DEATHLY GRIM REAPER (128cm)[/TD]
[/TR]
[TR]
[TD]DEATHLY GRIM REAPER (140cm)[/TD]
[/TR]
[TR]
[TD]DEATHLY GRIM REAPER (158cm)

and here's what it ought to look like when I've identified and removed the duplicates;

[TABLE="width: 406"]
<colgroup><col></colgroup><tbody>[TR]
[TD]Description[/TD]
[/TR]
[TR]
[TD]SANTAS LITTLE HELPER[/TD]
[/TR]
[TR]
[TD]SANTAS LITTLE HELPER[/TD]
[/TR]
[TR]
[TD]SANTAS LITTLE HELPER[/TD]
[/TR]
[TR]
[TD]PETTICOAT - ONE SIZE[/TD]
[/TR]
[TR]
[TD]PETTICOAT - ONE SIZE[/TD]
[/TR]
[TR]
[TD]PETTICOAT - ONE SIZE[/TD]
[/TR]
[TR]
[TD]DEATHLY GRIM REAPER[/TD]
[/TR]
[TR]
[TD]DEATHLY GRIM REAPER[/TD]
[/TR]
[TR]
[TD]DEATHLY GRIM REAPER[/TD]
[/TR]
[TR]
[TD]DEATHLY GRIM REAPER[/TD]
[/TR]
[TR]
[TD]COLOUR CHANGING CRYSTAL BALL[/TD]
[/TR]
[TR]
[TD]DEATHLY GRIM REAPER[/TD]
[/TR]
[TR]
[TD]DEATHLY GRIM REAPER[/TD]
[/TR]
[TR]
[TD]DEATHLY GRIM REAPER
[/TD]
[/TR]
</tbody>[/TABLE]
[/TD]
[/TR]
</tbody>[/TABLE]

As you can see, this makes a lookup unlikely to work well and fuzzylookup isn't really working properly either. I need to essentially identify the duplicates and remove the differentials. :eeek:Is there any VBA that will solve this, or even get close would help? Thanks v much
 
With huge thanks to ZVI this now works brilliantly and VERY quickly. Here is the code used. Note that it will sort A-Z and cannot identify normalized word placement if it is not in consistent position within the cell.

Option Explicit
Option Compare Text

Sub Main()
' ZVI:2019-02-27 https://www.mrexcel.com/forum/excel-questions/1088801-find-partial-duplicate-values-column-2.html

Dim a(), b(), s1(), s2(), v
Dim i As Long, j As Long, k As Long, n As Long, m As Long, p As Long, pp As Long
Dim Rng As Range
Dim IsEmpty As Boolean

' Set data range A:D
With ActiveSheet.UsedRange
i = .Columns(1).Columns(1).Find(What:="*", LookIn:=xlFormulas, SearchOrder:=xlByRows, _
SearchDirection:=xlPrevious, SearchFormat:=False).Row
Set Rng = .Offset(1).Resize(i - .Row).Columns("A:D")
End With

' Normalize Dataset
a() = Rng.Columns("B:C").Value
ReDim b(1 To UBound(a))
For i = 1 To UBound(b)
b(i) = TxtToArray(a(i, 1)) ' Array of normalized dataset
a(i, 1) = Join(b(i)) ' Normalized dataset
a(i, 2) = i ' Index
Next

' Disable screen updating
Application.ScreenUpdating = False

' Sort Rng by normalized Dataset
With Rng
.Columns("C:D").Value = a()
.Sort .Cells(1, "C"), xlAscending, Header:=xlNo
End With

'Main
a() = Rng.Columns("C:D").Value
p = a(1, 2)
For i = 1 To UBound(a) - 1
pp = a(i + 1, 2)
v = CompArrays(b(p), b(pp))
j = Abs(v)
Select Case v
Case Empty
' Equal
If j >= k Then
a(i, 1) = Join(b(p))
a(i, 2) = Empty
End If
a(i + 1, 1) = Join(b(pp))
a(i + 1, 2) = Empty
Case 0
' Not equal
If i = 1 Then
a(i, 1) = Join(b(p))
a(i, 2) = Empty
End If
a(i + 1, 2) = Empty
Case Else 'Is > 0, Is < 0
' j-word is not equal
If j >= k Then
m = 0
ReDim s(0 To UBound(b(p)) - 1)
For n = 0 To UBound(b(p))
If n <> j Then
s(m) = b(p)(n)
m = m + 1
End If
Next
a(i, 1) = Join(s)
a(i, 2) = b(p)(j)
If v > 0 Then
a(i + 1, 1) = a(i, 1)
a(i + 1, 2) = b(pp)(j)
Else
a(i + 1, 1) = Join(b(pp))
a(i + 1, 2) = Empty
j = 0
End If
Else
If v > 0 Then
a(i + 1, 1) = Join(b(pp))
a(i + 1, 2) = Empty
End If
End If
End Select
p = pp
k = j
Next


' Put result
Rng.Columns("C:D").Value = a()


' Sort Rng by UID
With Rng
' **** Uncomment the next line to sort by UID ****
'.Sort .Cells(1, "A"), xlAscending, Header:=xlNo
End With


' Enable screen updating
Application.ScreenUpdating = True


End Sub

Function TxtToArray(Txt) As Variant
Dim a, i As Long, j As Long, s As String
Static RegEx As Object
If RegEx Is Nothing Then
Set RegEx = CreateObject("VBScript.RegExp")
RegEx.Global = True
RegEx.Pattern = "[.,!?:;_]"
End If
s = RegEx.Replace(Txt, " ")
If s Like "*(* *)*" Then
While InStr(s, " ") > 0
s = Replace(s, " ", " ")
Wend
End If
a = Split(s)
j = 0
For i = 0 To UBound(a)
If Len(a(i)) > 0 Then
If a(i) <> "-" Then
a(j) = a(i)
j = j + 1
End If
End If
Next
If j - 1 < UBound(a) Then ReDim Preserve a(0 To j - 1)
TxtToArray = a
End Function


Function CompArrays(Arr1, Arr2) As Variant
' Compare arrays
' Returns:
' -n for index in Arr1 if SIZES found
' Empty for equal arrays
' 0 for different arrays
' n for index in Arr1 and Arr2 with different words
Const SIZES As String = "(XS)(S)(M)(L)(XL)(XXL)"
Dim i As Long, j As Long, n As Long
For i = 0 To UBound(Arr1)
If Arr1(i) Like ("(*)") Then
If InStr(1, SIZES, Arr1(i), vbTextCompare) > 0 Then
j = -i
If UBound(Arr2) >= i Then
If Arr2(i) Like ("(*)") Then
If InStr(1, SIZES, Arr2(i), vbTextCompare) > 0 Then
j = i
End If
End If
End If
Exit For
End If
End If
Next
If j = 0 Then
If UBound(Arr1) <> UBound(Arr2) Then
CompArrays = 0
Exit Function
End If
For i = 0 To UBound(Arr1)
If Arr1(i) <> Arr2(i) Then
j = i
n = n + 1
If n > 1 Then
CompArrays = 0
Exit Function
End If
End If
Next
End If
If j <> 0 Then CompArrays = j
End Function
 
Upvote 0

Excel Facts

Do you hate GETPIVOTDATA?
Prevent GETPIVOTDATA. Select inside a PivotTable. In the Analyze tab of the ribbon, open the dropown next to Options and turn it off

Forum statistics

Threads
1,224,754
Messages
6,180,749
Members
452,996
Latest member
nelsonsix66

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top