Fuzzy Matching - new version plus explanation

al_b_cnu

Well-known Member
Joined
Jul 18, 2003
Messages
4,546
It has been a while since I originally posted my Fuzzy matching UDF’s on the board, and several variants have appeared subsequently.

I thought it time to ‘put the record straight’ & post a definitive version which contains slightly more efficient code, and better matching algorithms, so here it is.

Firstly, I must state that the Fuzzy matching algorithms are very CPU hungry, and should be used sparingly. If for instance you require to lookup a match for a string which starts with, contains or ends with a specified value, this can be performed far more efficiently using the MATCH function:
Fuzzy Examples.xls
ABCDE
1Starts WithEndsContains
2BilljelenBill
3Mr Bill Jelen433
4Bill Jelen
5Joe Bloggs
6Fred Smith
MATCH Example


... Continued ...
 
I did change that variable as you indicated above... and I did see it update every 5... eventually, of course, Excel goes into not responding...

Here's the dilemma, 106000 rows in column a.... 14000ish rows in column B.

I killed all but a single record in column B to see how long it was taking to go through and compare the single record in column B to each record in column A... was not pretty to say the least.

As long as I'm getting ALL exact matches and then anything with say a 60% or higher confidence score, I should be good.

My addresses are simply the street address. Like:


Address1 Address2
304 9th Ave S 1985 E RIVER RD
 
Upvote 0

Excel Facts

How to total the visible cells?
From the first blank cell below a filtered data set, press Alt+=. Instead of SUM, you will get SUBTOTAL(9,)
Hi,

So if amend the parsing loop to match, say strings of 4 charts and upwards, it'll definately run faster. I'll PM you my email address & you could possibly send me a sample for me to play with - if it's not sensitive data that is
 
Upvote 0
I've noticed that FuzzyPercent, alg3, returns different values depending on which string is furnished as String1 & which as String2.

For example:

String1 = John Doe
String2 = Jane Doe
FuzzyPercent = 0.3913043

String1 = Jane Doe
String2 = John Doe
FuzzyPercent = 0.4347826

If I'm trying to determine programatically if John and Jane are the same name, but just spelled incorrectly in one (or both) cases by using a threshold value, will I get the best results by averaging the two results? Taking the highest one? The lowest?
 
Upvote 0
Hi JRV, Yes, Im aware of this inconsistency, pointed out by someone with a very similar name to yourself (page 17, post #161) ;)

As I suggested in that post, maybe algorithm 2 would give more consistent results, alternatively a chap named Levenshtein came up with an 'Edit distance' method (google 'Levenshtein Distance' for more info)
Perhaps you could use the function "GetLevenshteinPercentMatch" which envelopes his method:
Code:
Option Explicit

Public Function GetLevenshteinPercentMatch(ByVal String1 As String, _
                                            ByVal String2 As String, _
                                            Optional Normalised As Boolean = False) As Single
Dim iLen As Integer
If Normalised = False Then
    String1 = UCase$(WorksheetFunction.Trim(String1))
    String2 = UCase$(WorksheetFunction.Trim(String2))
End If
iLen = Max(String1, String2)
GetLevenshteinPercentMatch = (iLen - LevenshteinDistance(String1, String2)) / iLen
End Function

'*******************************
'*** Get minimum of three values
'*******************************

Private Function Minimum(ByVal a As Integer, _
                         ByVal b As Integer, _
                         ByVal c As Integer) As Integer
Dim mi As Integer
                          
  mi = a
  If b < mi Then
    mi = b
  End If
  If c < mi Then
    mi = c
  End If
  
  Minimum = mi
                          
End Function

'********************************
'*** Compute Levenshtein Distance
'********************************

Public Function LevenshteinDistance(ByVal s As String, ByVal t As String) As Integer
Dim d() As Integer ' matrix
Dim m As Integer ' length of t
Dim n As Integer ' length of s
Dim i As Integer ' iterates through s
Dim j As Integer ' iterates through t
Dim s_i As String ' ith character of s
Dim t_j As String ' jth character of t
Dim cost As Integer ' cost
  
  ' Step 1
  
  n = Len(s)
  m = Len(t)
  If n = 0 Then
    LevenshteinDistance = m
    Exit Function
  End If
  If m = 0 Then
    LevenshteinDistance = n
    Exit Function
  End If
  ReDim d(0 To n, 0 To m) As Integer
  
  ' Step 2
  
  For i = 0 To n
    d(i, 0) = i
  Next i
  
  For j = 0 To m
    d(0, j) = j
  Next j

  ' Step 3

  For i = 1 To n
    
    s_i = Mid$(s, i, 1)
    
    ' Step 4
    
    For j = 1 To m
      
      t_j = Mid$(t, j, 1)
      
      ' Step 5
      
      If s_i = t_j Then
        cost = 0
      Else
        cost = 1
      End If
      
      ' Step 6
      
      d(i, j) = Minimum(d(i - 1, j) + 1, d(i, j - 1) + 1, d(i - 1, j - 1) + cost)
    
    Next j
    
  Next i
  
  ' Step 7
  
  LevenshteinDistance = d(n, m)
  Erase d

End Function
 
Upvote 0
Thanks!

IIRC, I ended up on algorithm 3 after spending quite a while testing algorithm 1, 2, & 3, and concluded that 3 produced the most usable results, save for the discrepancy that occurs when the strings are reversed. Apparently I had since convinced myself that 3 was your recommendation, but history shows that was not true!:oops:

I'll try the Edit Distance method you posted; thanks for that. If it produces better results, I'll use it. Otherwise I'll keep running the data through algorithm 3 in both "directions" and average the results. It seems to get me pretty close; just wondered if there was any "science" behind that approach.
 
Upvote 0
Hi JRV, you may want to do some timings, I suspect that Levenshtein is more accurate (if that's the correct adjective) but slower.
 
Upvote 0
What I'm doing is evaluating edits people make to Contact information from Outlook and other databases, used in a Word document. The app has to figure out what to do with the edit depending on how much has changed. If it amounts to correcting a typo, it does one thing, if it's a major change, it does another.

It works pretty well; just looking for ways to make it better. But we're not dealing with batches of 20,000 records here; just one at a time. So I can trade some speed for accuracy.
 
Upvote 0
Regarding this line--

Code:
iLen = Max(String1, String2)

--the Max function does not exist in Word or XL VBA. There is a Max worksheet function, but it does not work with strings. Any idea what Max does, so I can create it as a VBA function?

From context, it looks like it's probably setting iLen to the length of the longer of String1 & String2, but if you know for sure, that would be a big help.
 
Last edited:
Upvote 0
And having tried it with this substitution--

Code:
iS1 = Len(String1)
iS2 = Len(String2)
If iS1 > iS2 Then
    iLen = iS1
Else
    iLen = iS2
End If
--it does seem to give me a number between 0 and 1. But confirmation would still give me more confidence that I guessed right!
 
Upvote 0

Forum statistics

Threads
1,223,277
Messages
6,171,148
Members
452,382
Latest member
RonChand

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top