Hi Alan and all who’ve contributed,
Thank you so much for this code, it really is some of the most impressive and useful VBA I’ve come across!
It’s proved very handy for a lot of projects I’ve done, but I’ve found that it doesn’t work quite as desired for names and some other types of data. I come across many cases where names are switched round, abbreviated, or have inconsistent use of middle names. For instance, ‘John Smith’ gets a very low match with ‘Smith John’ using the algorithms put forward so far. There is a Microsoft add-in called ‘Fuzzy Lookup’ (available from the official Microsoft Download Center) which deals with these things well and generally produces better matches, so I’ve used it whenever I’ve been dealing with names or needed greater accuracy. As anyone who’s used it will know, though, there is no way to drive it from VBA, so I’ve had to ask the user to run it manually, which is clumsy and just isn’t always feasible.
I’m no expert at all with VBA or with fuzzy matching algorithms, but I thought it would be worth trying to write something that takes these common variations into account. The code compares every individual word in one string against every individual word in the other, and then finds the optimum pairings. To account for initials and abbreviations, any single-letter word scores an 80% match if it is the same as the first letter of the word it’s being compared with, and 0% otherwise. Comparing every word against every word is obviously much more computationally expensive, but for most of my uses that isn’t an issue and is definitely worth the sacrifice.
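To give a feel for the word-wise approach: for ‘Smith John’ vs ‘John Smith’, the optimum pairing is ‘Smith’ with ‘Smith’ (1.0) and ‘John’ with ‘John’ (1.0), so it scores a full match where a whole-string Levenshtein comparison scores 0. For ‘N Smith’ vs ‘John Smith’, the initial ‘N’ scores 0 against ‘John’ because the first letters differ, while ‘J’ in ‘J Smith’ would score 0.8.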
Here are the output percentage matches produced by my version, compared with the MS Fuzzy Lookup add-in and the (excellent) Levenshtein distance algorithm put forward earlier in this thread. My version uses that same Levenshtein distance function, but compares the strings word by word, as described above.
Name 1                  | Name 2           | MS Add-in | Pure Levenshtein | Word-wise Levenshtein
------------------------|------------------|-----------|------------------|----------------------
John Smith              | John Smith       | 1.0000    | 1                | 1
Smith John              | John Smith       | 1.0000    | 0                | 1
J Smith                 | John Smith       | 0.9010    | 0.7              | 0.9
N Smith                 | John Smith       | 0.4400    | 0.7              | 0.5
Mr John Smith           | John Smith       | 0.8800    | 0.77             | 0.8
Mr J Smith              | John Smith       | 0.7974    | 0.6              | 0.72
John                    | John Smith       | 0.9000    | 0.4              | 0.67
Smith                   | John Smith       | 0.9000    | 0.5              | 0.67
Jon Smith               | John Smith       | 0.9143    | 0.9              | 0.88
Johnathan               | John Smith       | -         | 0.4              | 0.3
John B Smith            | John Smith       | 0.8800    | 0.83             | 0.8
Joe Bloggs              | John Smith       | -         | 0.2              | 0.25
J Bloggs                | John Smith       | -         | 0.1              | 0.4
Joe Smith               | John Smith       | 0.9309    | 0.8              | 0.75
John Bloggs             | John Smith       | 0.4500    | 0.45             | 0.5
Luke Clark              | John Smith       | 0.6857    | 0.1              | 0
John John               | John Smith       | 0.4667    | 0.5              | 0.5
Jonathan Smith          | John Smith       | 0.4400    | 0.71             | 0.75
Johnathan Smith         | John Smith       | 0.8902    | 0.67             | 0.72
Joe Bloggs Joe Bloggs   | John Smith       | -         | 0.14             | 0.31
John Smith Joe Bloggs   | John Smith       | 0.8800    | 0.48             | 0.89
JB Smith                | John Smith       | 0.4400    | 0.7              | 0.62
J S                     | John Smith       | -         | 0.3              | 0.8
J L S                   | John Smith       | -         | 0.3              | 0.64
John L Smith            | John Luke Smith  | 0.8835    | 0.8              | 0.93
John J Smith            | John Luke Smith  | 0.6000    | 0.73             | 0.67
Joe Bloggs              | John Luke Smith  | -         | 0.27             | 0.27
J L Smith               | John Luke Smith  | 0.8504    | 0.6              | 0.87
The code I have written feels a little haphazard and empirical, and it’s probably not very efficient, but it seems to give good results. It would be interesting to hear what other people think about it, and whether anyone has suggestions for improvements. Everything I’ve tested it on seems to work OK, but it would be useful to hear of any cases where it doesn’t work as desired.
The function I’ve written is called by the FuzzyPercent function (found earlier in this thread), in the same way as FuzzyAlg1, FuzzyAlg2, etc., and uses the Levenshtein distance method (also put forward earlier in this thread, and included at the bottom of my code without any changes). It stores all of the word-wise matches in the sngPercentMatch array. The next section then calculates the optimum pairings for one-, two- and three-word strings, and scales the score depending on whether the two strings have the same number of words. Very few of the strings I deal with are over three words, so for longer strings I just did a simpler approximation, but it’s definitely not as accurate. It would be great to have a generalised algorithm that calculates this properly (in the same way as for one, two and three words) for n-word strings, but I wasn’t sure how to do that.
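To illustrate the scaling: for ‘J Smith’ vs ‘John Smith’ the best pairing totals 0.8 + 1.0 = 1.8, and since both strings have two words the divisor is (1 + 3) / 2 = 2, giving the 0.9 in the table. For ‘John’ vs ‘John Smith’ the best pairing is 1.0, but the word counts differ, so the divisor is (0 + 3) / 2 = 1.5, giving roughly 0.67 and penalising the missing surname.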
Code:
Function WordwiseLevenshtein(ByVal String1 As String, _
                             ByVal String2 As String, _
                             Optional Normalised As Boolean = False) As Single
    Dim strString1Array() As String
    Dim strString2Array() As String
    Dim nWordsString1 As Integer
    Dim nWordsString2 As Integer
    Dim nWordsMax As Integer
    Dim nWord1 As Integer
    Dim nWord2 As Integer
    Dim iLen As Integer
    Dim sngPercentMatch() As Single
    Dim sngString1Percent() As Single
    Dim sngString2Percent() As Single
    Dim sngMeanString1Percent As Single
    Dim sngMeanString2Percent As Single
    Dim sngBestFit As Single

    'Should already be normalised if called by FuzzyPercent, but just in case
    If Normalised = False Then
        String1 = LCase$(Application.Trim(String1))
        String2 = LCase$(Application.Trim(String2))
    End If

    strString1Array = Split(String1, " ")
    strString2Array = Split(String2, " ")
    nWordsString1 = UBound(strString1Array)
    nWordsString2 = UBound(strString2Array)
    nWordsMax = WorksheetFunction.Max(nWordsString1, nWordsString2)
    ReDim sngPercentMatch(nWordsMax, nWordsMax) 'needs to be square for calculating sngBestFit
    ReDim sngString1Percent(nWordsString1)
    ReDim sngString2Percent(nWordsString2)

    'Calculate the similarity of all words in strString1Array against all words in
    'strString2Array and store the results in sngPercentMatch.
    'If an initial is used, give a value of 0.8 if it matches, or 0 if not.
    For nWord1 = 0 To nWordsString1
        For nWord2 = 0 To nWordsString2
            If strString1Array(nWord1) = strString2Array(nWord2) Then
                sngPercentMatch(nWord1, nWord2) = 1
            ElseIf Len(strString1Array(nWord1)) = 1 Or Len(strString2Array(nWord2)) = 1 Then
                If Left$(strString1Array(nWord1), 1) = Left$(strString2Array(nWord2), 1) Then
                    sngPercentMatch(nWord1, nWord2) = 0.8
                Else
                    sngPercentMatch(nWord1, nWord2) = 0
                End If
            Else
                'Use a fuzzy matching algorithm which returns a percentage match; in this
                'version I've used Levenshtein distance, but anything could be used here.
                iLen = WorksheetFunction.Max(Len(strString1Array(nWord1)), Len(strString2Array(nWord2)))
                sngPercentMatch(nWord1, nWord2) = (iLen - LevenshteinDistance(strString1Array(nWord1), strString2Array(nWord2))) / iLen
            End If
        Next nWord2
    Next nWord1

    'Approximate how well the two strings match based on the word-wise comparisons.
    sngBestFit = 0
    If nWordsMax = 0 Then 'longest string is 1 word
        sngBestFit = sngPercentMatch(0, 0)
    ElseIf nWordsMax = 1 Then 'longest string is 2 words
        sngBestFit = WorksheetFunction.Max(sngPercentMatch(0, 0) + sngPercentMatch(1, 1), _
                                           sngPercentMatch(1, 0) + sngPercentMatch(0, 1)) _
                     / ((WorksheetFunction.Min(nWordsString1, nWordsString2) + 3) / 2) 'denominator adjusts score to give a better match if strings have the same number of words
    ElseIf nWordsMax = 2 Then 'longest string is 3 words
        sngBestFit = WorksheetFunction.Max(sngPercentMatch(0, 0) + sngPercentMatch(1, 1) + sngPercentMatch(2, 2), _
                                           sngPercentMatch(0, 0) + sngPercentMatch(2, 1) + sngPercentMatch(1, 2), _
                                           sngPercentMatch(1, 0) + sngPercentMatch(0, 1) + sngPercentMatch(2, 2), _
                                           sngPercentMatch(1, 0) + sngPercentMatch(2, 1) + sngPercentMatch(0, 2), _
                                           sngPercentMatch(2, 0) + sngPercentMatch(0, 1) + sngPercentMatch(1, 2), _
                                           sngPercentMatch(2, 0) + sngPercentMatch(1, 1) + sngPercentMatch(0, 2)) _
                     / ((WorksheetFunction.Min(nWordsString1, nWordsString2) + 4) / 2) 'denominator adjusts score to give a better match if strings have the same number of words
    Else 'longest string is more than 3 words
        For nWord1 = 0 To nWordsString1
            For nWord2 = 0 To nWordsString2
                If sngPercentMatch(nWord1, nWord2) > sngString1Percent(nWord1) Then
                    sngString1Percent(nWord1) = sngPercentMatch(nWord1, nWord2)
                End If
                If sngPercentMatch(nWord1, nWord2) > sngString2Percent(nWord2) Then
                    sngString2Percent(nWord2) = sngPercentMatch(nWord1, nWord2)
                End If
            Next nWord2
        Next nWord1
        sngMeanString1Percent = WorksheetFunction.Average(sngString1Percent)
        sngMeanString2Percent = WorksheetFunction.Average(sngString2Percent)
        sngBestFit = WorksheetFunction.Max(sngMeanString1Percent, sngMeanString2Percent) * 0.67 + _
                     WorksheetFunction.Min(sngMeanString1Percent, sngMeanString2Percent) * 0.33 'seems to give an OK approximation, but not as good as the methods for fewer words
    End If

    WordwiseLevenshtein = sngBestFit
End Function
'The following code is not my own, but taken from earlier in the thread
'********************************
'*** Compute Levenshtein Distance
'********************************
Public Function LevenshteinDistance(ByVal s As String, ByVal t As String) As Integer
    Dim d() As Integer  ' matrix
    Dim m As Integer    ' length of t
    Dim N As Integer    ' length of s
    Dim i As Integer    ' iterates through s
    Dim j As Integer    ' iterates through t
    Dim s_i As String   ' ith character of s
    Dim t_j As String   ' jth character of t
    Dim cost As Integer ' cost

    ' Step 1
    N = Len(s)
    m = Len(t)
    If N = 0 Then
        LevenshteinDistance = m
        Exit Function
    End If
    If m = 0 Then
        LevenshteinDistance = N
        Exit Function
    End If
    ReDim d(0 To N, 0 To m) As Integer

    ' Step 2
    For i = 0 To N
        d(i, 0) = i
    Next i
    For j = 0 To m
        d(0, j) = j
    Next j

    ' Step 3
    For i = 1 To N
        s_i = Mid$(s, i, 1)
        ' Step 4
        For j = 1 To m
            t_j = Mid$(t, j, 1)
            ' Step 5
            If s_i = t_j Then
                cost = 0
            Else
                cost = 1
            End If
            ' Step 6
            d(i, j) = WorksheetFunction.Min(d(i - 1, j) + 1, d(i, j - 1) + 1, d(i - 1, j - 1) + cost)
        Next j
    Next i

    ' Step 7
    LevenshteinDistance = d(N, m)
End Function
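In case it’s useful for anyone wanting to try the function out on its own, here’s a minimal test routine (my own addition, not part of the matching code; it just assumes both functions above sit in the same standard module and prints results to the Immediate window):
Code:
Sub TestWordwiseLevenshtein()
    'Quick sanity check of WordwiseLevenshtein against a few of the examples in the table.
    'Inputs are already lower case and single-spaced here, so Normalised is passed as True.
    Dim vNames As Variant
    Dim i As Integer
    vNames = Array("smith john", "j smith", "mr john smith", "john luke smith")
    For i = LBound(vNames) To UBound(vNames)
        Debug.Print vNames(i) & " vs john smith: " & _
                    Format$(WordwiseLevenshtein(CStr(vNames(i)), "john smith", True), "0.00")
    Next i
End Sub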