Fuzzy Matching - new version plus explanation

al_b_cnu

Well-known Member
Joined
Jul 18, 2003
Messages
4,546
It has been a while since I originally posted my Fuzzy matching UDF’s on the board, and several variants have appeared subsequently.

I thought it time to ‘put the record straight’ & post a definitive version which contains slightly more efficient code, and better matching algorithms, so here it is.

Firstly, I must state that the Fuzzy matching algorithms are very CPU hungry, and should be used sparingly. If for instance you require to lookup a match for a string which starts with, contains or ends with a specified value, this can be performed far more efficiently using the MATCH function:
Fuzzy Examples.xls
ABCDE
1Starts WithEndsContains
2BilljelenBill
3Mr Bill Jelen433
4Bill Jelen
5Joe Bloggs
6Fred Smith
MATCH Example


... Continued ...
 
I've been playing with a version of this code and it works great! Is this the latest and greatest version of the fuzzy code, I can't find the FuzzyVlookup routine? What I need though is the index of the entry matched and not the matched string. I am really looking for FuzzyMatch. Any chance of this routine being developed?
 
Last edited:
Upvote 0

Excel Facts

Save Often
If you start asking yourself if now is a good time to save your Excel workbook, the answer is Yes
Hello all!

Mr. al_b_cnu your functions are suberb! Thank you very much!
I would like to ask something though:

1)Which one is the final version? Because your signature redirects to post 277 that dont have your work. Is it at post 307?
2)When i use the version from post 307 (or the one before the post in your signature) the cell that give the result (of fuzzyvlookup function) cannot be formatted. However when i use the version on the 1st page, result cell can be formatted. Is it a bug there or im too noob? :P

Thanks in advance
 
Upvote 0
Excellent UDF, and a link to this page with due credit is going in the book I'm in the final stages of writing.

Enhancement suggestion: Why not let users pass in the optional values as cell references? This would be particularly useful for the NFPercent value, because you could globally change the value, and see how all your matches performed, letting you tweak it globally in real time.

Currently this value is hard-coded.
 
Upvote 0
My bad...I had incorrectly assumed that any time the code was updated, it would also be updated at Post #2 (which would make it much easier for people to find)

My confusion was confounded by the fact that currently Alan's signature points to post #277 (which doesn't actually have a code listing in it). I *think* the latest version was #307.

It would be really cool if a page could be set up housing the most up-to-date code that is locked down for comments, and just point users at that one to avoid any doubt.
 
Upvote 0
I don't think the code's working correctly, given the following results.

=FuzzyPercent("Microsoft Excel","Excel Microsoft",2)
=0.688888907

=FuzzyPercent("Excel Microsoft","Microsoft Excel",2)
=0.733333349

So those two results aren't bad, but I would have thought those values would have been higher, given algorithm 2 is the one that best handles 'firstname lastname' compared with 'lastname firstname' combinations.

Now for the result that really puzzles me: watch what happens if I ditch the "Excel" from the "Microsoft Excel" string in the original =FuzzyPercent("Microsoft Excel","Excel Microsoft",2) formula:

=FuzzyPercent("Microsoft Excel","Excel",2)
=1


That doesn't make sense to me...surely that's a worse match, not a perfect match?
 
Upvote 0
Ah, found the bug. But first, to convince anyone reading this that it IS a bug, then consider this:

Which of these pairs should return the closest match under algorithm 2:

"Microsoft Excel" and "Microsoft Excel"
or
"Microsoft Excel" and "Excel"

Easy - the first pairing is an EXACT match. Whereas the 2nd pairing is not.

But look what gets returned by FuzzyPercent for both of these:
=FuzzyPercent("Microsoft Excel","Microsoft Excel",2)
=1

=FuzzyPercent("Excel","Microsoft Excel",2)
=1

...in other words, FuzzyPercent does not consider "Microsoft Excel" and "Microsoft Excel" a better match than "Microsoft Excel" and "Excel".

The culprit seems to be this:
Code:
'-----------------------------------------------------------
'-- If Algorithm = 2 or 3, Search for pairs, triplets etc. --
'-----------------------------------------------------------
If (lngAlgorithm And 2) <> 0 Then
    If intLen1 < intLen2 Then
        FuzzyAlg2 String1, String2, intScore, intTotScore
    Else
        FuzzyAlg2 String2, String1, intScore, intTotScore
    End If
End If

...specifically the If intLen Less Than inLen2 bit. I think that should be greater than. That explains why we get a value of 100% for Excel against "Microsoft Excel"...because 100% of "Excel" can be found in "Microsoft Excel". But I'm pretty sure that the intent is to test it the other way around. I.e. how much of "Microsoft Excel" is in "Excel"

If I change that less than to greater than, I get more sensible results:

=FuzzyPercent("Microsoft Excel","Microsoft Excel",2)
=1

=FuzzyPercent("Excel","Microsoft Excel",2)
=0.200000003


=FuzzyPercent("Excel Microsoft","Microsoft Excel",$E$77)
=0.733333349
 
Upvote 0
Thinking about this matching business some more, it strikes me that something along these lines could be added to the FuzzyPercent routine under the Give 100% match if strings exactly equal bit, to look for simple word transposes more efficiently:


Code:
'----------------------------------------------
'-- Give 100% match if word order transposed --
'----------------------------------------------
Dim varString1 As Variant
Dim String1_Test As String
Dim String2_Test As String
Dim strWordTest As String
Dim lngWords As Long
Dim i As Long
Dim score As Long


String1_Test = Replace(String1, ", ", " ")
String2_Test = Replace(String2, ", ", " ")
String2_Test = Replace(String2_Test, " ", "")
On Error Resume Next
varString1 = Split(String1_Test, " ")
lngWords = UBound(varString1)
On Error GoTo 0

If lngWords > 0 Then
    For i = 0 To lngWords
        strWordTest = varString1(i)
        If InStr(String2_Test, strWordTest) > 0 Then
            score = score + Len(strWordTest)
            String2_Test = Replace(String2_Test, strWordTest, "", 1, 1)
        End If
    Next
End If
score = score / (score + Len(String2_Test))

If score = 1 Then
    FuzzyPercent = 1
    Exit Function
End If

But the question is, when do you consider a transposition 'good enough' to exit the routine.

In the case when you've got just two words, such as "Weir, Jeff" and "Jeff Weir" then it's a nobrainer - the routine should give these should give a match of 1 and then exit.

But what about matches involving three words, such as "I am Sam"? In than case, we've got multiple transpositions that would give a match of 100% under the above approach:
"Sam I am"
"Am I Sam"
"Am Sam I"
"Am I Sam"
"I Sam am"
"Sam am I"

...and it's possible you could have more than one of those in the lookup list. Should we exit on the first match, or should we look for a closer match involving the order of the words. i.e. Should anything with "I am" or "am Sam" be given precedence over the other possibles?

And say you don't get an exact match, because of say an extra character in one of the terms e.g.:
"I am Sam" and "Sam I am." (Note the fullstop). How do you handle that? You get a very very close match, but how close should the match be before we say "hey that's close enough that we don't have to bother incurring overhead by checking everything else in the lookup table".

Love to hear anyone's musings on this.
 
Upvote 0
Sorry, that previous code listing should have been something like this:
Code:
'----------------------------------------------
'-- Give 100% match if word order transposed --
'----------------------------------------------
Dim varString1 As Variant
Dim String1_Test As String
Dim String2_Test As String
Dim strWordTest As String
Dim lngWords As Long
Dim i As Long
Dim score As Double
Dim String1_len As Long
Dim String2_len As Long

String1_Test = Replace(String1, ", ", " ")
String2_Test = Replace(String2, ", ", " ")

String1_len = Len(Replace(String1_Test, " ", ""))
String2_len = Len(Replace(String2_Test, " ", ""))

On Error Resume Next
varString1 = Split(String1_Test, " ")
lngWords = UBound(varString1)
On Error GoTo 0

If lngWords > 0 Then
    For i = 0 To lngWords
        strWordTest = varString1(i)
        If InStr(String2_Test, strWordTest) > 0 Then
            score = score + Len(strWordTest)
            String2_Test = Replace(String2_Test, strWordTest, "", 1, 1)
        Else: ' score = score - Len(strWordTest)
        End If
    Next
End If
score = score / Application.WorksheetFunction.Max(String1_len, String2_len)

If score = 1 Then
    FuzzyPercent = 1
    Exit Function
End If
 
Upvote 0
Alan,

thank you so much for developing such great macro which benefits lots of people around the world. may i know if that's possible to enhance the script to support multithreading? i've got 1000 records to match with a database more than 800000 records... it takes 60s for each record.... thanks a lot in advance!!

Thanks,
Derek

Hi,

Algorithm 4 (Levenstein distance) is the most accurate, but slower than algorithm 2 which matches pairs, triplets, quadfs etc..

Given the number of entries, I'd suggest algorithm 2.
 
Upvote 0
yam1314 - multithreading or not you're asking a lot of Excel. Can you post a sample of the data you're trying to match and some of the search terms that you would like them matched against? This might help point out an alternative approach, such as judicios use of wildcard matching via VLOOKUP, or creative use of wildcards in find and replace to make large amounts of your terms conform with your lookup table. Also, will there be any direct matches in the data? If so, approximately how many? If lots, then do direct matches with a traditional VLOOKUP, and then remove those from the remaining data.
 
Upvote 0

Forum statistics

Threads
1,223,948
Messages
6,175,575
Members
452,652
Latest member
eduedu

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top