Hi Alan and all who’ve contributed,
Thank you so much for this code, it really is some of the most impressive and useful VBA I’ve come across!
It’s proved very handy for a lot of projects I’ve done, but I’ve found that it doesn’t work quite as desired for names and some other types of data. I come across many cases where names are switched round, abbreviated, or have inconsistent use of middle names. For instance, ‘John Smith’ gets a very low match with ‘Smith John’ using the algorithms put forward so far. There is a Microsoft add-in called ‘Fuzzy Lookup’ (available from the official Microsoft Download Center) which deals with these things well and generally produces better matches, so I’ve used it whenever I’ve been dealing with names or needed greater accuracy. As anyone who’s used it will know, though, there is no way to drive it from VBA, so I’ve had to ask the user to run it manually, which is clumsy and just isn’t always feasible.
I’m no expert at all with VBA or with fuzzy matching algorithms, but I thought it would be worth trying to write something that takes these common variations into account. The code compares every individual word in one string against every individual word in the other, and then finds the optimum pairings. To account for initials and abbreviations, any single-letter word scores an 80% match if it is the same as the first letter of the word it’s being compared with, and 0% otherwise. Comparing every word against every word is obviously much more computationally expensive, but for most of my uses that isn’t an issue and is definitely worth the sacrifice.
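To give a feel for the word-wise approach: for ‘Smith John’ vs ‘John Smith’, the optimum pairing is ‘Smith’ with ‘Smith’ (1.0) and ‘John’ with ‘John’ (1.0), so it scores a full match where a whole-string Levenshtein comparison scores 0. For ‘N Smith’ vs ‘John Smith’, the initial ‘N’ scores 0 against ‘John’ because the first letters differ, while ‘J’ in ‘J Smith’ would score 0.8.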
Here are the output percentage matches produced by my version, compared with the MS Fuzzy Lookup add-in and the (excellent) Levenshtein distance algorithm put forward earlier in this thread. My version uses that same Levenshtein distance function, but compares the strings word by word, as described above.
Name 1                  | Name 2           | MS Add-in | Pure Levenshtein | Word-wise Levenshtein
------------------------|------------------|-----------|------------------|----------------------
John Smith              | John Smith       | 1.0000    | 1                | 1
Smith John              | John Smith       | 1.0000    | 0                | 1
J Smith                 | John Smith       | 0.9010    | 0.7              | 0.9
N Smith                 | John Smith       | 0.4400    | 0.7              | 0.5
Mr John Smith           | John Smith       | 0.8800    | 0.77             | 0.8
Mr J Smith              | John Smith       | 0.7974    | 0.6              | 0.72
John                    | John Smith       | 0.9000    | 0.4              | 0.67
Smith                   | John Smith       | 0.9000    | 0.5              | 0.67
Jon Smith               | John Smith       | 0.9143    | 0.9              | 0.88
Johnathan               | John Smith       | -         | 0.4              | 0.3
John B Smith            | John Smith       | 0.8800    | 0.83             | 0.8
Joe Bloggs              | John Smith       | -         | 0.2              | 0.25
J Bloggs                | John Smith       | -         | 0.1              | 0.4
Joe Smith               | John Smith       | 0.9309    | 0.8              | 0.75
John Bloggs             | John Smith       | 0.4500    | 0.45             | 0.5
Luke Clark              | John Smith       | 0.6857    | 0.1              | 0
John John               | John Smith       | 0.4667    | 0.5              | 0.5
Jonathan Smith          | John Smith       | 0.4400    | 0.71             | 0.75
Johnathan Smith         | John Smith       | 0.8902    | 0.67             | 0.72
Joe Bloggs Joe Bloggs   | John Smith       | -         | 0.14             | 0.31
John Smith Joe Bloggs   | John Smith       | 0.8800    | 0.48             | 0.89
JB Smith                | John Smith       | 0.4400    | 0.7              | 0.62
J S                     | John Smith       | -         | 0.3              | 0.8
J L S                   | John Smith       | -         | 0.3              | 0.64
John L Smith            | John Luke Smith  | 0.8835    | 0.8              | 0.93
John J Smith            | John Luke Smith  | 0.6000    | 0.73             | 0.67
Joe Bloggs              | John Luke Smith  | -         | 0.27             | 0.27
J L Smith               | John Luke Smith  | 0.8504    | 0.6              | 0.87
The code I have written feels a little haphazard and empirical, and it’s probably not very efficient, but it seems to give good results. It would be interesting to hear what other people think about it, and whether anyone has suggestions for improvements. Everything I’ve tested it on seems to work OK, but it would be useful to hear of any cases where it doesn’t work as desired.
The function I’ve written is called by the FuzzyPercent function (found earlier in this thread), in the same way as FuzzyAlg1, FuzzyAlg2, etc., and uses the Levenshtein distance method (also put forward earlier in this thread, and included at the bottom of my code without any changes). It stores all of the word-wise matches in the sngPercentMatch array. The next section then calculates the optimum pairings for one-, two- and three-word strings, and scales the score depending on whether the two strings have the same number of words. Very few of the strings I deal with are over three words, so for longer strings I just did a simpler approximation, but it’s definitely not as accurate. It would be great to have a generalised algorithm that calculates this properly (in the same way as for one, two and three words) for n-word strings, but I wasn’t sure how to do that.
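To illustrate the scaling: for ‘J Smith’ vs ‘John Smith’ the best pairing totals 0.8 + 1.0 = 1.8, and since both strings have two words the divisor is (1 + 3) / 2 = 2, giving the 0.9 in the table. For ‘John’ vs ‘John Smith’ the best pairing is 1.0, but the word counts differ, so the divisor is (0 + 3) / 2 = 1.5, giving roughly 0.67 and penalising the missing surname.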
Code:
Function WordwiseLevenshtein(ByVal String1 As String, _
                             ByVal String2 As String, _
                             Optional Normalised As Boolean = False) As Single
    Dim strString1Array() As String
    Dim strString2Array() As String
    Dim nWordsString1 As Integer
    Dim nWordsString2 As Integer
    Dim nWordsMax As Integer
    Dim nWord1 As Integer
    Dim nWord2 As Integer
    Dim iLen As Integer
    Dim sngPercentMatch() As Single
    Dim sngString1Percent() As Single
    Dim sngString2Percent() As Single
    Dim sngMeanString1Percent As Single
    Dim sngMeanString2Percent As Single
    Dim sngBestFit As Single

    'Should already be normalised if called by FuzzyPercent, but just in case
    If Normalised = False Then
        String1 = LCase$(Application.Trim(String1))
        String2 = LCase$(Application.Trim(String2))
    End If

    strString1Array = Split(String1, " ")
    strString2Array = Split(String2, " ")
    nWordsString1 = UBound(strString1Array)
    nWordsString2 = UBound(strString2Array)
    nWordsMax = WorksheetFunction.Max(nWordsString1, nWordsString2)
    ReDim sngPercentMatch(nWordsMax, nWordsMax) 'needs to be square for calculating sngBestFit
    ReDim sngString1Percent(nWordsString1)
    ReDim sngString2Percent(nWordsString2)

    'Calculate the similarity of all words in strString1Array against all words in
    'strString2Array and store the results in sngPercentMatch.
    'If an initial is used, give a value of 0.8 if it matches, or 0 if not.
    For nWord1 = 0 To nWordsString1
        For nWord2 = 0 To nWordsString2
            If strString1Array(nWord1) = strString2Array(nWord2) Then
                sngPercentMatch(nWord1, nWord2) = 1
            ElseIf Len(strString1Array(nWord1)) = 1 Or Len(strString2Array(nWord2)) = 1 Then
                If Left$(strString1Array(nWord1), 1) = Left$(strString2Array(nWord2), 1) Then
                    sngPercentMatch(nWord1, nWord2) = 0.8
                Else
                    sngPercentMatch(nWord1, nWord2) = 0
                End If
            Else
                'Use a fuzzy matching algorithm which returns a percentage match; in this
                'version I've used Levenshtein distance, but anything could be used here.
                iLen = WorksheetFunction.Max(Len(strString1Array(nWord1)), Len(strString2Array(nWord2)))
                sngPercentMatch(nWord1, nWord2) = (iLen - LevenshteinDistance(strString1Array(nWord1), strString2Array(nWord2))) / iLen
            End If
        Next nWord2
    Next nWord1

    'Approximate how well the two strings match based on the word-wise comparisons.
    sngBestFit = 0
    If nWordsMax = 0 Then 'longest string is 1 word
        sngBestFit = sngPercentMatch(0, 0)
    ElseIf nWordsMax = 1 Then 'longest string is 2 words
        sngBestFit = WorksheetFunction.Max(sngPercentMatch(0, 0) + sngPercentMatch(1, 1), _
                                           sngPercentMatch(1, 0) + sngPercentMatch(0, 1)) _
                     / ((WorksheetFunction.Min(nWordsString1, nWordsString2) + 3) / 2) 'denominator adjusts score to give a better match if strings have the same number of words
    ElseIf nWordsMax = 2 Then 'longest string is 3 words
        sngBestFit = WorksheetFunction.Max(sngPercentMatch(0, 0) + sngPercentMatch(1, 1) + sngPercentMatch(2, 2), _
                                           sngPercentMatch(0, 0) + sngPercentMatch(2, 1) + sngPercentMatch(1, 2), _
                                           sngPercentMatch(1, 0) + sngPercentMatch(0, 1) + sngPercentMatch(2, 2), _
                                           sngPercentMatch(1, 0) + sngPercentMatch(2, 1) + sngPercentMatch(0, 2), _
                                           sngPercentMatch(2, 0) + sngPercentMatch(0, 1) + sngPercentMatch(1, 2), _
                                           sngPercentMatch(2, 0) + sngPercentMatch(1, 1) + sngPercentMatch(0, 2)) _
                     / ((WorksheetFunction.Min(nWordsString1, nWordsString2) + 4) / 2) 'denominator adjusts score to give a better match if strings have the same number of words
    Else 'longest string is more than 3 words
        For nWord1 = 0 To nWordsString1
            For nWord2 = 0 To nWordsString2
                If sngPercentMatch(nWord1, nWord2) > sngString1Percent(nWord1) Then
                    sngString1Percent(nWord1) = sngPercentMatch(nWord1, nWord2)
                End If
                If sngPercentMatch(nWord1, nWord2) > sngString2Percent(nWord2) Then
                    sngString2Percent(nWord2) = sngPercentMatch(nWord1, nWord2)
                End If
            Next nWord2
        Next nWord1
        sngMeanString1Percent = WorksheetFunction.Average(sngString1Percent)
        sngMeanString2Percent = WorksheetFunction.Average(sngString2Percent)
        sngBestFit = WorksheetFunction.Max(sngMeanString1Percent, sngMeanString2Percent) * 0.67 + _
                     WorksheetFunction.Min(sngMeanString1Percent, sngMeanString2Percent) * 0.33 'seems to give an OK approximation, but not as good as the methods for fewer words
    End If

    WordwiseLevenshtein = sngBestFit
End Function
'The following code is not my own, but taken from earlier in the thread
'********************************
'*** Compute Levenshtein Distance
'********************************
Public Function LevenshteinDistance(ByVal s As String, ByVal t As String) As Integer
    Dim d() As Integer  ' matrix
    Dim m As Integer    ' length of t
    Dim N As Integer    ' length of s
    Dim i As Integer    ' iterates through s
    Dim j As Integer    ' iterates through t
    Dim s_i As String   ' ith character of s
    Dim t_j As String   ' jth character of t
    Dim cost As Integer ' cost

    ' Step 1
    N = Len(s)
    m = Len(t)
    If N = 0 Then
        LevenshteinDistance = m
        Exit Function
    End If
    If m = 0 Then
        LevenshteinDistance = N
        Exit Function
    End If
    ReDim d(0 To N, 0 To m) As Integer

    ' Step 2
    For i = 0 To N
        d(i, 0) = i
    Next i
    For j = 0 To m
        d(0, j) = j
    Next j

    ' Step 3
    For i = 1 To N
        s_i = Mid$(s, i, 1)
        ' Step 4
        For j = 1 To m
            t_j = Mid$(t, j, 1)
            ' Step 5
            If s_i = t_j Then
                cost = 0
            Else
                cost = 1
            End If
            ' Step 6
            d(i, j) = WorksheetFunction.Min(d(i - 1, j) + 1, d(i, j - 1) + 1, d(i - 1, j - 1) + cost)
        Next j
    Next i

    ' Step 7
    LevenshteinDistance = d(N, m)
End Function
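In case it’s useful for anyone wanting to try the function out on its own, here’s a minimal test routine (my own addition, not part of the matching code; it just assumes both functions above sit in the same standard module and prints results to the Immediate window):
Code:
Sub TestWordwiseLevenshtein()
    'Quick sanity check of WordwiseLevenshtein against a few of the examples in the table.
    'Inputs are already lower case and single-spaced here, so Normalised is passed as True.
    Dim vNames As Variant
    Dim i As Integer
    vNames = Array("smith john", "j smith", "mr john smith", "john luke smith")
    For i = LBound(vNames) To UBound(vNames)
        Debug.Print vNames(i) & " vs john smith: " & _
                    Format$(WordwiseLevenshtein(CStr(vNames(i)), "john smith", True), "0.00")
    Next i
End Sub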