Fuzzy Matching - new version plus explanation

al_b_cnu

Well-known Member
Joined
Jul 18, 2003
Messages
4,546
It has been a while since I originally posted my Fuzzy matching UDF’s on the board, and several variants have appeared subsequently.

I thought it time to ‘put the record straight’ & post a definitive version which contains slightly more efficient code, and better matching algorithms, so here it is.

Firstly, I must state that the Fuzzy matching algorithms are very CPU hungry, and should be used sparingly. If for instance you require to lookup a match for a string which starts with, contains or ends with a specified value, this can be performed far more efficiently using the MATCH function:
Fuzzy Examples.xls
ABCDE
1Starts WithEndsContains
2BilljelenBill
3Mr Bill Jelen433
4Bill Jelen
5Joe Bloggs
6Fred Smith
MATCH Example


... Continued ...
 
Hi Alan! It worked! This macro is amazing and will save me and others a great deal of time and effort. Thank you so much! All the best, K.
 
Upvote 0

Excel Facts

Workdays for a market open Mon, Wed, Friday?
Yes! Use "0101011" for the weekend argument in NETWORKDAYS.INTL or WORKDAY.INTL. The 7 digits start on Monday. 1 means it is a weekend.
Hi Alan, I've been validating and playing with the matched data all night and day. Your macro has been busy at work on over 15K lines of data. Is there anyway to add code to the macro so that either:
- the first word should match in the look ups? or
- a high weighting put on the first words matching?

It seems that many of the incorrect codes are returned for other brands with the same words in their description. Much thanks! K.
 
Upvote 0
Is is possible the script retrieve defferent values, 1st value, 2nd value, 3rd value? multiple occurences if something mathces?

Thanks
Azumi
 
Upvote 0
Hi Azumi,

This is very easily done using the RANK parameter.
If you set RANK parameter to, say, 2, it will return the 2nd best match, and setting it to 3 will return the 3rd best match etc.

Also, if you set it to, say, "1, 10" (note the quotes), it will return the first, second, third etc best matches, in order, and separated by 'comma space' in a single cell.

Conversely if you set it to, e.g. "10: 1" ,it will return the tenth, ninth, eighth ... first best matches in a single cell separated by 'colon space'

Post back with an example of your input and desired output if you have problems.

Best wishes.

Alan
 
Upvote 0
hey alan, i found this thread so helping.
i wanna ask you questions,,
1. is it possible we do fuzzy-countif? > is this even exist? hahaha
2. how to adapt this formula to a program that will be used as duplicates detector? > the program will compare the data each row with the other data in the same column.

regards.
Ali
 
Upvote 0
Hi Ali,
Glad you find it useful :)
1) have you tried FuzzyCount function?
2) you can call the FuzzyPercent function directly into your own macro.
This function returns a percentage match between the two supplied strings.
For efficiency I suggest you store the tanle into an array variable and loop thru the entries. Also "normalise" the strings to be matched by converting them both to , say, lowercase, and removing leading, trailing and intermediate multiple spaces, e.g. xxxx = lcase$(worksheetfunction.trim(xxxx))
 
Upvote 0
Hi alan,
i have tried yours, and I use your code which calculate levenstein distance and get levenshtein distance percentage.
as we know, there are 3 function involved in calculating levenshtein distance.
deletion, subtitution and insertion.
and i just wanna use subtitution function only. > maybe it's not called levenshtein distance anymore. haha

i got stuck , i don't know how to implement the levenstein distance in data duplication listing. like you've explained in your previous reply.

Code:
  Dim duplicate(), i As Long
    Dim delrange As Range, cell As Long
    Dim shtIn As Worksheet, Shtout As Worksheet
    Dim numofrows1
    dim numofrows2
    dim j as long
  
     Set shtIn = ThisWorkbook.Sheets("process")
    Set Shtout = ThisWorkbook.Sheets("output")
     
    
    x = 2
    y = 1
    
    Set delrange = shtIn.Range("h1:h30000")  'set your range here
   
   ReDim duplicate(0)
'search duplicates in 2nd column
    For cell = 1 To delrange.Cells.Count
        If Application.CountIf(delrange, delrange(cell)) > 1 Then
            ReDim Preserve duplicate(i)
            duplicate(i) = delrange(cell).Address
            i = i + 1
        End If
 Next
    
   
        'print duplicates
    For i = UBound(duplicate) To LBound(duplicate) Step -1
    Shtout.Cells(x, 1).EntireRow.Value = shtIn.Range(duplicate(i)).EntireRow.Value
    x = x + 1
Next i
numofrows2 = Shtout.Cells(Shtout.Rows.Count, 1).End(xlUp).Row - 1
If Shtout.Cells(2, 1).Value = "" Then
MsgBox ("No Duplicates Found!")
Else
MsgBox (numofrows1 & " " & "Potential Duplicates Found")
    End If
End Sub

after read several references,, i think it's possible to do this.
i can change the code below

Code:
ReDim duplicate(0)
'search duplicates in 2nd column
    For cell = 1 To delrange.Cells.Count
        If Application.CountIf(delrange, delrange(cell)) > 1 Then
            ReDim Preserve duplicate(i)
            duplicate(i) = delrange(cell).Address
            i = i + 1
        End If
Next

with my own parameter.
but i don't know how to do it. haha
and the program going ridicculously slow without levenshtein istance added.
 
Upvote 0
Hi Ali,
I'm on holiday for 2 weeks atm, but this should address your slow response issues:
Replace

Code:
Set delrange = shtIn.Range("h1:h30000")  'set your range here

with

Code:
With shtin.UsedRange
Set delrange = shtin.Range("h1").Resize(.Column + .Columns.Count - 1) 'set your range here
End With
 
Upvote 0
Hi Ali,

Try the following (Untested) code:
Code:
Option Explicit

Sub xxx()

Dim duplicate As Variant, I As Long
Dim vaData As Variant
 
Dim delrange As Range, cell As Long
Dim shtIn As Worksheet, Shtout As Worksheet
Dim numofrows1
Dim numofrows2
Dim j As Long
  
    Set shtIn = ThisWorkbook.Sheets("process")
    Set Shtout = ThisWorkbook.Sheets("output")
     
    
    x = 2
    y = 1
    
    With shtIn.UsedRange
        Set delrange = shtIn.Range("h1").Resize(.Column + .Columns.Count - 1) 'set your range here
    End With
    
    vaData = delrange.Value
   ReDim duplicate(1 To 1, 1 To 1)
'search duplicates in 2nd column
    For cell = 1 To UBound(varange, 1)
        If FuzzyCount(LookupValue:=CStr(vaData(cell, 1)), TableArray:=delrange, NFPercent:=0.7) > 1 Then
        If Application.CountIf(delrange, delrange(cell)) > 1 Then
            I = I + 1
            ReDim Preserve duplicate(1 To 1, 1 To I)
            duplicate(1, I) = "H" & cell
        End If
 Next
    
    If I = 0 Then
        MsgBox ("No Duplicates Found!")
    Else
        'print duplicates
        MsgBox (numofrows1 & " " & "Potential Duplicates Found")
        Shtout.UsedRange.ClearContents
        Shtout.Range("A1").Resize(I).Value = WorksheetFunction.Transpose(duplicate)
    End If
    
'**** Not sure what this statement is for! ****
'numofrows2 = Shtout.Cells(Shtout.Rows.Count, 1).End(xlUp).Row - 1
End Sub
 
Upvote 0
Hi Ali,

Try the following (Untested) code:
Code:
Option Explicit

Sub xxx()

Dim duplicate As Variant, I As Long
Dim vaData As Variant
 
Dim delrange As Range, cell As Long
Dim shtIn As Worksheet, Shtout As Worksheet
Dim numofrows1
Dim numofrows2
Dim j As Long
  
    Set shtIn = ThisWorkbook.Sheets("process")
    Set Shtout = ThisWorkbook.Sheets("output")
     
    
    x = 2
    y = 1
    
    With shtIn.UsedRange
        Set delrange = shtIn.Range("h1").Resize(.Column + .Columns.Count - 1) 'set your range here
    End With
    
    vaData = delrange.Value
   ReDim duplicate(1 To 1, 1 To 1)
'search duplicates in 2nd column
    For cell = 1 To UBound(varange, 1)
        If FuzzyCount(LookupValue:=CStr(vaData(cell, 1)), TableArray:=delrange, NFPercent:=0.7) > 1 Then
        If Application.CountIf(delrange, delrange(cell)) > 1 Then
            I = I + 1
            ReDim Preserve duplicate(1 To 1, 1 To I)
            duplicate(1, I) = "H" & cell
        End If
 Next
    
    If I = 0 Then
        MsgBox ("No Duplicates Found!")
    Else
        'print duplicates
        MsgBox (numofrows1 & " " & "Potential Duplicates Found")
        Shtout.UsedRange.ClearContents
        Shtout.Range("A1").Resize(I).Value = WorksheetFunction.Transpose(duplicate)
    End If
    
'**** Not sure what this statement is for! ****
'numofrows2 = Shtout.Cells(Shtout.Rows.Count, 1).End(xlUp).Row - 1
End Sub

Wew, that quite inspiring.. :D
really. Thankyou!!
Sorry for disturb your holiday alan.
the last code is just for telling how many duplicates found. Hehe

ill tell you if the program works.
 
Upvote 0

Forum statistics

Threads
1,225,626
Messages
6,186,092
Members
453,337
Latest member
fiaz ahmad

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top