Fuzzy Matching - new version plus explanation

al_b_cnu

Well-known Member
Joined
Jul 18, 2003
Messages
4,546
It has been a while since I originally posted my Fuzzy matching UDF’s on the board, and several variants have appeared subsequently.

I thought it time to ‘put the record straight’ & post a definitive version which contains slightly more efficient code, and better matching algorithms, so here it is.

Firstly, I must state that the Fuzzy matching algorithms are very CPU hungry, and should be used sparingly. If for instance you require to lookup a match for a string which starts with, contains or ends with a specified value, this can be performed far more efficiently using the MATCH function:
Fuzzy Examples.xls
ABCDE
1Starts WithEndsContains
2BilljelenBill
3Mr Bill Jelen433
4Bill Jelen
5Joe Bloggs
6Fred Smith
MATCH Example


... Continued ...
 
Hi Earthworm.
I'm not at all surprised that the code doesn't work using your data. To explain:
Regarding these constants declared at the start of the code:
VBA Code:
Const msSheetName As String = "Sheet1"                      '** Defines the worksheet
Dim msMatchCriteriaCell As String                           '** Defines the cell containing the match Keywords heading
Const msMatchCriteriaHeading As String = "Search Heading"   '** Heading for search criteria headings
Const msFileTitleHeading As String = "Excel File Title"     '** Heading for file column
Const msMatchHeading As String = "%Match"                   '** Heading for results
Const msngMinPercent As Single = 0.5                        '** Minimum % Match
msSheetName defines the worksheet name that the code will perform the code on
msMatchCriteriaHeading defines the heading of the column which contains the search criteria headings, The column immediately to the right of this column contains the search criteria
msFileTitleHeading defines the heading at the start of the data and, in Rishm's case, defines the column heading for the filenames
msMatchHeading defines the heading to contain the %age match results (written to by the code)
msngMinPercent defines the minimum %age match to deem the entry a match
The code performs the following actions:
1) Find the criteria columns as defined by the heading specified by msMatchCriteriaHeading
2) For each entry in this column perform the criteria matching as defined by column+1 against the specified heading
3) Return the %age match in the column as specified by msMatchHeading
So ....
In the example I posted for Rishm the code will search for and match against the data in headings "Keywords", Locations", "Company Type" and "Year Founded" and return the %age match in the column headed "%Match"
Hope this helps

Alan
I made changes as per my understanding but the code stuck at

VBA Code:
Option Explicit
Const msSheetName As String = "Sheet1"                      '** Defines the worksheet that the code will perform the code on
Dim msMatchCriteriaCell As String                           '** Defines the cell containing the match Keywords heading
Const msMatchCriteriaHeading As String = "Test 2"   '** Heading for search criteria headings
Const msFileTitleHeading As String = "Test 1"               '** Heading for file column
Const msMatchHeading As String = "%Match"                   '** Heading for results
Const msngMinPercent As Single = 0.5                        '** Minimum % Match

Upon run the code is stuk on sngTotalScore = sngTotalScore / sngMaxScore .

Alan Sir , Can you please share the excel file with this code with small dummy data as it will be easy for us to understand and make changes to our needs
 
Last edited:
Upvote 0

Excel Facts

What do {} around a formula in the formula bar mean?
{Formula} means the formula was entered using Ctrl+Shift+Enter signifying an old-style array formula.
I made changes as per my understanding but the code stuck at

VBA Code:
Option Explicit
Const msSheetName As String = "Sheet1"                      '** Defines the worksheet that the code will perform the code on
Dim msMatchCriteriaCell As String                           '** Defines the cell containing the match Keywords heading
Const msMatchCriteriaHeading As String = "Test 2"   '** Heading for search criteria headings
Const msFileTitleHeading As String = "Test 1"               '** Heading for file column
Const msMatchHeading As String = "%Match"                   '** Heading for results
Const msngMinPercent As Single = 0.5                        '** Minimum % Match

Upon run the code is stuk on sngTotalScore = sngTotalScore / sngMaxScore .

Alan Sir , Can you please share the excel file with this code with small dummy data as it will be easy for us to understand and make changes to our needs
Hi Earthworm,
The reason for the error you encountered is, I suspect, a division by zero error because you have not specified the headings criteria.
That immediate error could be cured fairly easily but I suspect you'd get further problems down the line, all because the worksheet isn't set up correctly.
If you go to post #447 in this thread, click on the "Copy" icon at the bottom left, you can paste that into your own Excel spreadsheet.

If you're still having problems, you may want to start a new post with your particular requirement. If so you're very welcome to message me with a link to the post and I'll have a look at it.
Best wishes
Alan
 
Upvote 0
Hi Earthworm,
The reason for the error you encountered is, I suspect, a division by zero error because you have not specified the headings criteria.
That immediate error could be cured fairly easily but I suspect you'd get further problems down the line, all because the worksheet isn't set up correctly.
If you go to post #447 in this thread, click on the "Copy" icon at the bottom left, you can paste that into your own Excel spreadsheet.

If you're still having problems, you may want to start a new post with your particular requirement. If so you're very welcome to message me with a link to the post and I'll have a look at it.
Best wishes
Alan
You are right . I did make it work by following your instruction . Its like another world .
I cannot make a new post since this thread itself relates to fuzzy lookup .
I would be obliged if you can make a macro that can work as per below requirement

My requirement

Column A Column B Match
Muhammad Mohammad 99%
Muhammad Muhammad 100%
Mohd M ohd 99%
Tommy Tomy 98%

I know Microsoft excel fuzzy lookup adon is present for this but in office installation of adon is blocked by administrators.
 
Upvote 0
You are right . I did make it work by following your instruction . Its like another world .
I cannot make a new post since this thread itself relates to fuzzy lookup .
I would be obliged if you can make a macro that can work as per below requirement

My requirement

Column A Column B Match
Muhammad Mohammad 99%
Muhammad Muhammad 100%
Mohd M ohd 99%
Tommy Tomy 98%

I know Microsoft excel fuzzy lookup adon is present for this but in office installation of adon is blocked by administrators.
Fuzzy Earthworm.xlsm
ABC
1Heading 1Heading 2Match
2MuhammadMohammad87.50%
3MuhammadMuhammad100.00%
4MohdM ohd80.00%
5TommyTomy80.00%
Sheet2
Cell Formulas
RangeFormula
C2:C5C2=GetLevenshteinPercentMatch(A2,B2)


VBA Code:
Option Explicit

Public Function GetLevenshteinPercentMatch(ByVal String1 As String, _
                                            ByVal String2 As String) As Single
Dim iLen As Integer

iLen = WorksheetFunction.Max(Len(String1), Len(String2))
GetLevenshteinPercentMatch = (iLen - LevenshteinDistance(String1, String2)) / iLen
End Function

Public Function LevenshteinDistance(ByVal s As String, ByVal t As String) As Integer
'********************************
'*** Compute Levenshtein Distance
'********************************

Dim d() As Integer ' matrix
Dim m As Integer ' length of t
Dim N As Integer ' length of s
Dim I As Integer ' iterates through s
Dim j As Integer ' iterates through t
Dim s_i As String ' ith character of s
Dim t_j As String ' jth character of t
Dim cost As Integer ' cost

  ' Step 1

  N = Len(s)
  m = Len(t)
  If N = 0 Then
    LevenshteinDistance = m
    Exit Function
  End If
  If m = 0 Then
    LevenshteinDistance = N
    Exit Function
  End If
  ReDim d(0 To N, 0 To m) As Integer

  ' Step 2

  For I = 0 To N
    d(I, 0) = I
  Next I

  For j = 0 To m
    d(0, j) = j
  Next j

  ' Step 3

  For I = 1 To N
    s_i = Mid$(s, I, 1)
    ' Step 4
    For j = 1 To m
      t_j = Mid$(t, j, 1)
      
      ' Step 5
      If s_i = t_j Then
        cost = 0
      Else
        cost = 1
      End If
      ' Step 6

      d(I, j) = WorksheetFunction.Min(d(I - 1, j) + 1, d(I, j - 1) + 1, d(I - 1, j - 1) + cost)
    Next j
  Next I

  ' Step 7
  LevenshteinDistance = d(N, m)
End Function
 
Upvote 0
Fuzzy Earthworm.xlsm
ABC
1Heading 1Heading 2Match
2MuhammadMohammad87.50%
3MuhammadMuhammad100.00%
4MohdM ohd80.00%
5TommyTomy80.00%
Sheet2
Cell Formulas
RangeFormula
C2:C5C2=GetLevenshteinPercentMatch(A2,B2)


VBA Code:
Option Explicit

Public Function GetLevenshteinPercentMatch(ByVal String1 As String, _
                                            ByVal String2 As String) As Single
Dim iLen As Integer

iLen = WorksheetFunction.Max(Len(String1), Len(String2))
GetLevenshteinPercentMatch = (iLen - LevenshteinDistance(String1, String2)) / iLen
End Function

Public Function LevenshteinDistance(ByVal s As String, ByVal t As String) As Integer
'********************************
'*** Compute Levenshtein Distance
'********************************

Dim d() As Integer ' matrix
Dim m As Integer ' length of t
Dim N As Integer ' length of s
Dim I As Integer ' iterates through s
Dim j As Integer ' iterates through t
Dim s_i As String ' ith character of s
Dim t_j As String ' jth character of t
Dim cost As Integer ' cost

  ' Step 1

  N = Len(s)
  m = Len(t)
  If N = 0 Then
    LevenshteinDistance = m
    Exit Function
  End If
  If m = 0 Then
    LevenshteinDistance = N
    Exit Function
  End If
  ReDim d(0 To N, 0 To m) As Integer

  ' Step 2

  For I = 0 To N
    d(I, 0) = I
  Next I

  For j = 0 To m
    d(0, j) = j
  Next j

  ' Step 3

  For I = 1 To N
    s_i = Mid$(s, I, 1)
    ' Step 4
    For j = 1 To m
      t_j = Mid$(t, j, 1)
     
      ' Step 5
      If s_i = t_j Then
        cost = 0
      Else
        cost = 1
      End If
      ' Step 6

      d(I, j) = WorksheetFunction.Min(d(I - 1, j) + 1, d(I, j - 1) + 1, d(I - 1, j - 1) + cost)
    Next j
  Next I

  ' Step 7
  LevenshteinDistance = d(N, m)
End Function
Thank you for your prompt support .

I have a general question , dont you think the matching between MUHAMMAD and MOHAMMAD is 99% ? what do you suggest ?
how can I change matching % . is it easy to change ?
 
Upvote 0
Thank you for your prompt support .

I have a general question , dont you think the matching between MUHAMMAD and MOHAMMAD is 99% ? what do you suggest ?
how can I change matching % . is it easy to change ?
Hi Earthworm
87.5% is, in fact, mathematically correct.
To put it another way, the two strings are 8 characters long and only 7 match which makes it a 7/8 match.
7/8 = 0.875 = 87.5%
 
Upvote 0
Hi Earthworm
87.5% is, in fact, mathematically correct.
To put it another way, the two strings are 8 characters long and only 7 match which makes it a 7/8 match.
7/8 = 0.875 = 87.5%
Hail to you. I was expecting same logical answer from you. You Rock !
 
Upvote 0
Hi Allen,
Hope you are doing good.

I am stuck with a problem and need your help. I need to fuzzy match the data of around 899*12 names (delimited, and not all filled) with another dataset of around 6000 names. My first dataset consists of names of cast member of movies and second one oscar nomination dataset. Problems that I am facing is:
1. Most of the names have leading or trailing spaces (not able to delete them after trimming through VBA)
2. Many of the names have special characters or accent alphabets, such as (Édgar Ramírez, Pen√©lope Cruz)
3. I tried using your vba code (algorithm 1) with 80% NFP; it works decently but fails in cases of leading spaces and extra special characters.

Please suggest solution. Thanks. (I have used macros for the first time ever. Hope you will understand.)
 
Upvote 0
Hi Allen,
Hope you are doing good.

I am stuck with a problem and need your help. I need to fuzzy match the data of around 899*12 names (delimited, and not all filled) with another dataset of around 6000 names. My first dataset consists of names of cast member of movies and second one oscar nomination dataset. Problems that I am facing is:
1. Most of the names have leading or trailing spaces (not able to delete them after trimming through VBA)
2. Many of the names have special characters or accent alphabets, such as (Édgar Ramírez, Pen√©lope Cruz)
3. I tried using your vba code (algorithm 1) with 80% NFP; it works decently but fails in cases of leading spaces and extra special characters.

Please suggest solution. Thanks. (I have used macros for the first time ever. Hope you will understand.)
Hi,
Maybe fuzzy match 'modified' strings using a UDF such as below which converts accent alphabets to equivalent 'normal' alphabetic characters and removes all non-alphanumerics.
This isn't a full solution but a good start:
Best wishes
Alan
VBA Code:
Function PrepareString(ByVal stringx As String) As String
'*****************************************************************************************************
'** Replace 'Special' characters to a - z, all uppercase to lowercase, remove all but alphanumerics **
'*****************************************************************************************************
Const Chars As String = "0123456789???????ABCDEFGHIJKLMNOPQRSTUVWXYZ??????abcdefghijklmnopqrstuvwxyz?????????????????????????????????????????????????????????????????????AAAAAA?CEEEEIIIIDNOOOOOxOUUUUY??aaaaaa?ceeeeiiiionooooo?ouuuuy?y"

Dim lPtr As Long
Dim lValue As Long

Dim sChar As String
Dim sResult As String

sResult = ""
For lPtr = 1 To Len(stringx)
    sChar = LCase$(Mid$(stringx, lPtr, 1))
    lValue = Asc(sChar)
    sChar = ""
    lValue = lValue - 47
    If lValue > 0 _
    And lValue <= Len(Chars) Then
        sChar = Mid$(Chars, lValue, 1)
        If sChar = "?" Then
            sChar = ""
        End If
    End If
    
    sResult = sResult & sChar
Next lPtr
ConvertChars = sResult
End Function
 
Upvote 0
Hi,
Maybe fuzzy match 'modified' strings using a UDF such as below which converts accent alphabets to equivalent 'normal' alphabetic characters and removes all non-alphanumerics.
This isn't a full solution but a good start:
Best wishes
Alan
VBA Code:
Function PrepareString(ByVal stringx As String) As String
'*****************************************************************************************************
'** Replace 'Special' characters to a - z, all uppercase to lowercase, remove all but alphanumerics **
'*****************************************************************************************************
Const Chars As String = "0123456789???????ABCDEFGHIJKLMNOPQRSTUVWXYZ??????abcdefghijklmnopqrstuvwxyz?????????????????????????????????????????????????????????????????????AAAAAA?CEEEEIIIIDNOOOOOxOUUUUY??aaaaaa?ceeeeiiiionooooo?ouuuuy?y"

Dim lPtr As Long
Dim lValue As Long

Dim sChar As String
Dim sResult As String

sResult = ""
For lPtr = 1 To Len(stringx)
    sChar = LCase$(Mid$(stringx, lPtr, 1))
    lValue = Asc(sChar)
    sChar = ""
    lValue = lValue - 47
    If lValue > 0 _
    And lValue <= Len(Chars) Then
        sChar = Mid$(Chars, lValue, 1)
        If sChar = "?" Then
            sChar = ""
        End If
    End If
   
    sResult = sResult & sChar
Next lPtr
ConvertChars = sResult
End Function
The corrected code:
VBA Code:
Option Explicit

Function PrepareString(ByVal stringx As String) As String
'*****************************************************************************************************
'** Replace 'Special' characters to a - z, all uppercase to lowercase, remove all but alphanumerics **
'*****************************************************************************************************
Const Chars As String = "0123456789???????ABCDEFGHIJKLMNOPQRSTUVWXYZ??????abcdefghijklmnopqrstuvwxyz?????????????????????????????????????????????????????????????????????AAAAAA?CEEEEIIIIDNOOOOOxOUUUUY??aaaaaa?ceeeeiiiionooooo?ouuuuy?y"

Dim lPtr As Long
Dim lValue As Long

Dim sChar As String
Dim sResult As String

sResult = ""
For lPtr = 1 To Len(stringx)
    sChar = LCase$(Mid$(stringx, lPtr, 1))
    lValue = Asc(sChar)
    sChar = ""
    lValue = lValue - 47
    If lValue > 0 _
    And lValue <= Len(Chars) Then
        sChar = Mid$(Chars, lValue, 1)
        If sChar = "?" Then
            sChar = ""
        End If
    End If
    
    sResult = sResult & sChar
Next lPtr
PrepareString = sResult
End Function
An example:
Book1
IJ
2150>a%BÙ0abu
Sheet2
Cell Formulas
RangeFormula
I215I215="0>a%B"&CHAR(217)
J215J215=PrepareString(I215)
 
Upvote 0

Forum statistics

Threads
1,224,836
Messages
6,181,250
Members
453,026
Latest member
cknader

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top