Fuzzy Matching - new version plus explanation

al_b_cnu

Well-known Member
Joined
Jul 18, 2003
Messages
4,546
It has been a while since I originally posted my Fuzzy matching UDF’s on the board, and several variants have appeared subsequently.

I thought it time to ‘put the record straight’ & post a definitive version which contains slightly more efficient code, and better matching algorithms, so here it is.

Firstly, I must state that the Fuzzy matching algorithms are very CPU hungry, and should be used sparingly. If for instance you require to lookup a match for a string which starts with, contains or ends with a specified value, this can be performed far more efficiently using the MATCH function:
Fuzzy Examples.xls
ABCDE
1Starts WithEndsContains
2BilljelenBill
3Mr Bill Jelen433
4Bill Jelen
5Joe Bloggs
6Fred Smith
MATCH Example


... Continued ...
 
Hi Dan,

It seems to work here, have you got calculation set to manual?
If so, try pressing [F9]

This thread is extremely long now, making it difficult to find the latest version of the code. A link can be found in my signature below.

*SOLVED*

my mistake, i had imported a mistake somehow.
 
Upvote 0

Excel Facts

Ambidextrous Undo
Undo last command with Ctrl+Z or Alt+Backspace. If you use the Undo icon in the QAT, open the drop-down arrow to undo up to 100 steps.
There seems to be a bug whereby if the lookup value isn't in the same sheet as the lookup table, incorrect results are returned.
Also, 'FuzzyHLookup' has long since got out of step with it's bigger brother 'FuzzyVLookup', so I've removed it
Here's the latest code:
Code:
Option Explicit

Type RankInfo
    Offset As Long
    Percentage As Single
End Type

Dim mudRankData() As RankInfo
Dim mlBestMatchPtr As Long

Dim TopMatch         As Long
Dim strCompare       As String


Function FuzzyPercent(ByVal String1 As String, _
                      ByVal String2 As String, _
                      Optional Algorithm As Integer = 3, _
                      Optional Normalised As Boolean = False) As Single
'*************************************
'** Return a % match on two strings **
'*************************************
Dim intLen1 As Integer, intLen2 As Integer
Dim intCurLen As Integer
Dim intTo As Integer
Dim intPos As Integer
Dim intPtr As Integer
Dim intScore As Integer
Dim intTotScore As Integer
Dim intStartPos As Integer
Dim sngScore As Single
Dim strWork As String

'-------------------------------------------------------
'-- If strings havent been normalised, normalise them --
'-------------------------------------------------------
If Normalised = False Then
    String1 = LCase$(Application.Trim(String1))
    String2 = LCase$(Application.Trim(String2))
End If

'----------------------------------------------
'-- Give 100% match if strings exactly equal --
'----------------------------------------------
If String1 = String2 Then
    FuzzyPercent = 1
    Exit Function
End If

intLen1 = Len(String1)
intLen2 = Len(String2)

If intLen1 = 0 Or intLen2 = 0 Then
    FuzzyPercent = 0
    Exit Function
End If

'----------------------------------------
'-- Give 0% match if string length < 2 --
'----------------------------------------
If intLen1 < 2 Then
    FuzzyPercent = 0
    Exit Function
End If

intTotScore = 0                   'initialise total possible score
intScore = 0                      'initialise current score

'--------------------------------------------------------
'-- If Algorithm = 1 or 3, Search for single characters --
'--------------------------------------------------------
If (Algorithm And 1) <> 0 Then
    If intLen1 < intLen2 Then
        FuzzyAlg1 String1, String2, intScore, intTotScore
    Else
        FuzzyAlg1 String2, String1, intScore, intTotScore
    End If
End If

'-----------------------------------------------------------
'-- If Algorithm = 2 or 3, Search for pairs, triplets etc. --
'-----------------------------------------------------------
If (Algorithm And 2) <> 0 Then
    If intLen1 < intLen2 Then
        FuzzyAlg2 String1, String2, intScore, intTotScore
    Else
        FuzzyAlg2 String2, String1, intScore, intTotScore
    End If
End If

'------------------------------------------------------
'-- If Algorithm = 4,5,6,7, use Dan Ostander's code. --
'------------------------------------------------------
If (Algorithm And 4) <> 0 Then
    If intLen1 < intLen2 Then
        sngScore = FuzzyAlg4(String1, String1)
    Else
        sngScore = FuzzyAlg4(String2, String1)
    End If
    intScore = intScore + (sngScore * 100)
    intTotScore = intTotScore + 100
End If

FuzzyPercent = intScore / intTotScore

End Function

Private Sub FuzzyAlg1(ByVal String1 As String, _
                      ByVal String2 As String, _
                      ByRef Score As Integer, _
                      ByRef TotScore As Integer)
Dim intLen1 As Integer, intPos As Integer, intPtr As Integer, intStartPos As Integer

intLen1 = Len(String1)
TotScore = TotScore + intLen1              'update total possible score
intPos = 0
For intPtr = 1 To intLen1
    intStartPos = intPos + 1
    intPos = InStr(intStartPos, String2, Mid$(String1, intPtr, 1))
    If intPos > 0 Then
        If intPos > intStartPos + 3 Then     'No match if char is > 3 bytes away
            intPos = intStartPos
        Else
            Score = Score + 1          'Update current score
        End If
    Else
        intPos = intStartPos
    End If
Next intPtr
End Sub
Private Sub FuzzyAlg2(ByVal String1 As String, _
                        ByVal String2 As String, _
                        ByRef Score As Integer, _
                        ByRef TotScore As Integer)
Dim intCurLen As Integer, intLen1 As Integer, intTo As Integer, intPtr As Integer, intPos As Integer
Dim strWork As String

intLen1 = Len(String1)
For intCurLen = 1 To intLen1
    strWork = String2                          'Get a copy of String2
    intTo = intLen1 - intCurLen + 1
    TotScore = TotScore + Int(intLen1 / intCurLen)  'Update total possible score
    For intPtr = 1 To intTo Step intCurLen
        intPos = InStr(strWork, Mid$(String1, intPtr, intCurLen))
        If intPos > 0 Then
            Mid$(strWork, intPos, intCurLen) = String$(intCurLen, &H0) 'corrupt found string
            Score = Score + 1     'Update current score
        End If
    Next intPtr
Next intCurLen

End Sub
Private Function FuzzyAlg4(strIn1 As String, strIn2 As String) As Single

Dim L1               As Integer
Dim In1Mask(1 To 24) As Long     'strIn1 is 24 characters max
Dim iCh              As Integer
Dim N                As Long
Dim strTry           As String
Dim strTest          As String

TopMatch = 0
L1 = Len(strIn1)
strTest = UCase(strIn1)
strCompare = UCase(strIn2)
For iCh = 1 To L1
    In1Mask(iCh) = 2 ^ iCh
Next iCh      'Loop thru all ordered combinations of characters in strIn1
For N = 2 ^ (L1 + 1) - 1 To 1 Step -1
    strTry = ""
    For iCh = 1 To L1
        If In1Mask(iCh) And N Then
            strTry = strTry & Mid(strTest, iCh, 1)
        End If
    Next iCh
    If Len(strTry) > TopMatch Then FuzzyAlg4Test strTry
Next N
FuzzyAlg4 = TopMatch / CSng(L1)
End Function
Sub FuzzyAlg4Test(strIn As String)

Dim l          As Integer
Dim strTry   As String
Dim iCh        As Integer

l = Len(strIn)
If l <= TopMatch Then Exit Sub
strTry = "*"
For iCh = 1 To l
    strTry = strTry & Mid(strIn, iCh, 1) & "*"
Next iCh
If strCompare Like strTry Then
    If l > TopMatch Then TopMatch = l
End If
End Sub

Function FuzzyVLookup(ByVal LookupValue As String, _
                      ByVal TableArray As Range, _
                      ByVal IndexNum As Integer, _
                      Optional NFPercent As Single = 0.05, _
                      Optional Rank As Integer = 1, _
                      Optional Algorithm As Integer = 3, _
                      Optional AdditionalCols As Integer = 0, _
                      Optional LookupColOffset As Integer = 0, _
                      Optional GroupColOffset As Integer = 0, _
                      Optional GroupValue As Variant = "") As Variant
'********************************************************************************
'** Function to Fuzzy match LookupValue with entries in                        **
'** column 1 of table specified by TableArray.                                 **
'** TableArray must specify the top left cell of the range to be searched      **
'** The function stops scanning the table when an empty cell in column 1       **
'** is found.                                                                  **
'** For each entry in column 1 of the table, FuzzyPercent is called to match   **
'** LookupValue with the Table entry.                                          **
'** 'Rank' is an optional parameter which may take any value > 0               **
'**        (default 1) and causes the function to return the 'nth' best        **
'**         match (where 'n' is defined by 'Rank' parameter)                   **
'** If the 'Rank' match percentage < NFPercent (Default 5%), #N/A is returned. **
'** IndexNum is the column number of the entry in TableArray required to be    **
'** returned, as follows:                                                      **
'** If IndexNum > 0 and the 'Rank' percentage match is >= NFPercent            **
'**                 (Default 5%) the column entry indicated by IndexNum is     **
'**                 returned.                                                  **
'** if IndexNum = 0 and the 'Rank' percentage match is >= NFPercent            **
'**                 (Default 5%) the offset row (starting at 1) is returned.   **
'**                 This value can be used directly in the 'Index' function.   **
'**                                                                            **
'** Algorithm can take one of the following values:                            **
'** Algorithm = 1:                                                             **
'**     This algorithm is best suited for matching mis-spellings.              **
'**     For each character in 'String1', a search is performed on 'String2'.   **
'**     The search is deemed successful if a character is found in 'String2'   **
'**     within 3 characters of the current position.                           **
'**     A score is kept of matching characters which is returned as a          **
'**     percentage of the total possible score.                                **
'** Algorithm = 2:                                                             **
'**     This algorithm is best suited for matching sentences, or               **
'**     'firstname lastname' compared with 'lastname firstname' combinations   **
'**     A count of matching pairs, triplets, quadruplets etc. in 'String1' and **
'**     'String2' is returned as a percentage of the total possible.           **
'** Algorithm = 3: Both Algorithms 1 and 2 are performed.                      **
'** Algorithm = 4: Dan Ostrander's algorithm                                   **
'**                                                                            **
'** The following parameters allow matching by group, and only those values    **
'** which are in the group specified will be considered for matching.          **                  **
'** GroupColOffset                                                             **
'** This parameter specifies the offset column which contains the group values **
'** To trigger group matching, this must be a non-zero integer                 **
'** GroupValue                                                                 **
'** This parameter specifies the Group to be considered for matching           **
'********************************************************************************
Dim bWanted As Boolean
Dim lEndRow As Long

Dim rCur As Range
Dim rSearchRange As Range

Dim sngMinPercent As Single


Dim vCurValue As Variant
Dim wsTable As Worksheet

'--------------------------------------------------------------
'--    Validation                                            --
'--------------------------------------------------------------

LookupValue = LCase$(Application.Trim(LookupValue))

If IsMissing(NFPercent) Then
    sngMinPercent = 0.05
Else
    If (NFPercent <= 0) Or (NFPercent > 1) Then
        FuzzyVLookup = "*** 'NFPercent' must be a percentage > zero ***"
        Exit Function
    End If
    sngMinPercent = NFPercent
End If

If Rank < 1 Then
    FuzzyVLookup = "*** 'Rank' must be an integer > 0 ***"
    Exit Function
End If

Set wsTable = Sheets(TableArray.Parent.Name)

ReDim mudRankData(1 To Rank)

lEndRow = TableArray.Rows.Count
If VarType(TableArray.Cells(lEndRow, 1).Value) = vbEmpty Then
    lEndRow = TableArray.Cells(lEndRow, 1).End(xlUp).Row
End If
Set rSearchRange = Intersect(TableArray.Resize(, 1), wsTable.UsedRange)
If rSearchRange Is Nothing Then

    FuzzyVLookup = CVErr(xlErrNA)
    
Else


    '---------------
    '-- Main loop --
    '---------------
    
    If Len(GroupValue) <> 0 Then
        With rSearchRange.Offset(, GroupColOffset)
            For Each rCur In rSearchRange.Offset(, GroupColOffset)
                vCurValue = rCur.Value
                If VarType(vCurValue) = vbString Then
                    bWanted = LCase$(CStr(vCurValue)) = LCase$(CStr(GroupValue))
                Else
                    bWanted = Val(vCurValue) = Val(GroupValue)
                End If
                If bWanted Then
                    FuzzyVlookupMain LookupValue:=LookupValue, _
                                    TableArray:=rCur.Offset(, GroupColOffset * -1), _
                                    IndexNum:=IndexNum, _
                                    NFPercent:=NFPercent, _
                                    Rank:=Rank, _
                                    Algorithm:=Algorithm, _
                                    AdditionalCols:=AdditionalCols, _
                                    LookupColOffset:=LookupColOffset
                End If
            Next rCur
        End With
    Else
        For Each rCur In rSearchRange
            FuzzyVlookupMain LookupValue:=LookupValue, _
                            TableArray:=rCur, _
                            IndexNum:=IndexNum, _
                            NFPercent:=NFPercent, _
                            Rank:=Rank, _
                            Algorithm:=Algorithm, _
                            AdditionalCols:=AdditionalCols, _
                            LookupColOffset:=LookupColOffset
        Next rCur
    End If
    
    If mudRankData(Rank).Percentage < sngMinPercent Then
        '--------------------------------------
        '-- Return '#N/A' if below NFPercent --
        '--------------------------------------
        FuzzyVLookup = CVErr(xlErrNA)
    Else
        mlBestMatchPtr = mudRankData(Rank).Offset - TableArray.Cells(1, 1).Row + 1
        If IndexNum > 0 Then
            '-----------------------------------
            '-- Return column entry specified --
            '-----------------------------------
            FuzzyVLookup = TableArray.Cells(mlBestMatchPtr, IndexNum)
        Else
            '-----------------------
            '-- Return offset row --
            '-----------------------
            FuzzyVLookup = mlBestMatchPtr
        End If
    End If
    
End If          'If rSearchRange Is Nothing Then
End Function
Private Sub FuzzyVlookupMain(ByVal LookupValue As String, _
                                    ByVal TableArray As Range, _
                                    ByVal IndexNum As Integer, _
                                    Optional NFPercent As Single = 0.05, _
                                    Optional Rank As Integer = 1, _
                                    Optional Algorithm As Integer = 3, _
                                    Optional AdditionalCols As Integer = 0, _
                                    Optional LookupColOffset As Integer = 0)
Dim I As Integer
Dim intRankPtr As Integer
Dim intRankPtr1 As Integer
Dim strListString As String
Dim sngCurPercent As Single
Dim sngMinPercent As Single
Dim vCurValue As Variant

vCurValue = ""
For I = 0 To AdditionalCols
    vCurValue = vCurValue & CStr(TableArray.Offset(0, I + LookupColOffset).Value)
Next I
If VarType(vCurValue) = vbString Then
    strListString = LCase$(Application.Trim(vCurValue))
    
    '------------------------------------------------
    '-- Fuzzy match strings & get percentage match --
    '------------------------------------------------
    sngCurPercent = FuzzyPercent(String1:=LookupValue, _
                                 String2:=strListString, _
                                 Algorithm:=Algorithm, _
                                 Normalised:=True)
    
    If sngCurPercent >= sngMinPercent Then
        '---------------------------
        '-- Store in ranked array --
        '---------------------------
        For intRankPtr = 1 To Rank
            If sngCurPercent > mudRankData(intRankPtr).Percentage Then
                For intRankPtr1 = Rank To intRankPtr + 1 Step -1
                    With mudRankData(intRankPtr1)
                        .Offset = mudRankData(intRankPtr1 - 1).Offset
                        .Percentage = mudRankData(intRankPtr1 - 1).Percentage
                    End With
                Next intRankPtr1
                With mudRankData(intRankPtr)
                    .Offset = TableArray.Row
                    .Percentage = sngCurPercent
                End With
                Exit Sub
            End If
        Next intRankPtr
    End If
    
End If
End Sub
 
Upvote 0
Got this latest version working using data across two worksheets and it is awesome! But I have 17,000 rows on the first sheet and 230,000 rows on the 2nd sheet...and after successfully matching data from about 30 rows as a test, I closed the spreadsheet and re-opened it and it is taking over 30 minutes just to open. (core 2 duo 2.33ghz machine)

The tool is perfect for what I am trying to do, which is match values from each worksheet with a % confidence level that I have a good match, but the time it takes to process makes it almost unusable unfortunately. Any ideas on how I could make it more efficient / get it to process faster? Maybe including both sets of data on the same sheet?

Thanks!
 
Upvote 0
Hi, have you tried setting calculation to manual (in Excel 2007 this is done from the menu via Formulas / Calculation Options)
 
Upvote 0
Hi, have you tried setting calculation to manual (in Excel 2007 this is done from the menu via Formulas / Calculation Options)

Thanks, that does help for calculating formulas, but the issue seems to be the initial loading of the script. By default Excel 2007 disables unknown macros, and when I enable them that's when it hangs. Actually I got rid of some unnecessary data and now it loads in about 10-15 minutes. So maybe not much more can be done.
 
Upvote 0
Hi, you're probably right, with that amount of data, it will take a long time to load.
You may be able to speed it up by seperating the data into 2 workbooks, but that would require rewriting FuzzyVLookup as a macro.
 
Upvote 0
Hi, you're probably right, with that amount of data, it will take a long time to load.
You may be able to speed it up by seperating the data into 2 workbooks, but that would require rewriting FuzzyVLookup as a macro.

Well, I got past the loading issue, but now I'm trying to calculate the full spreadsheet and it looks like it's going to take days to finish. I wonder if I could use Amazon EC2 or something to run it? Any ideas?

Thanks!
 
Upvote 0
Hi,
I suspect you'll have to use a bespoke macro which you call once rather than a function which you call many times.
What are your sheet layouts, and what parameters to FuzzyVLookup do you use?
 
Upvote 0
Hi,
I suspect you'll have to use a bespoke macro which you call once rather than a function which you call many times.
What are your sheet layouts, and what parameters to FuzzyVLookup do you use?

2 sheets with data on the 1st and a few columns matching to data on the 2nd sheet using FuzzyVLookup and FuzzyPercent. I could send it over if you would like to take a look. :)
 
Upvote 0
al_b_cnu,
This is awesome stuff! However, I am trying to compare 10k lines of addresses and the function is running forever! Do you have some magic or another method to easily compare large files?
 
Upvote 0

Forum statistics

Threads
1,223,282
Messages
6,171,174
Members
452,387
Latest member
Lorenzo_Barry

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top