Fuzzy Matching - new version plus explanation

al_b_cnu

Well-known Member
Joined
Jul 18, 2003
Messages
4,546
It has been a while since I originally posted my Fuzzy matching UDF’s on the board, and several variants have appeared subsequently.

I thought it time to ‘put the record straight’ & post a definitive version which contains slightly more efficient code, and better matching algorithms, so here it is.

Firstly, I must state that the Fuzzy matching algorithms are very CPU hungry, and should be used sparingly. If for instance you require to lookup a match for a string which starts with, contains or ends with a specified value, this can be performed far more efficiently using the MATCH function:
Fuzzy Examples.xls
ABCDE
1Starts WithEndsContains
2BilljelenBill
3Mr Bill Jelen433
4Bill Jelen
5Joe Bloggs
6Fred Smith
MATCH Example


... Continued ...
 
Hi,

Not sure what your requirements are here, but have you tried setting the 'Algorithm' parameter to 1 which is geared towards matching mispellings?

Can you give further details of your requirements and post an example of strings which should and should not match?

What would you want doing with a string such as 'George Harrs'
 
Upvote 0

Excel Facts

Which Excel functions can ignore hidden rows?
The SUBTOTAL and AGGREGATE functions ignore hidden rows. AGGREGATE can also exclude error cells and more.
This function is incredible. I ran into a problem, though. I copied the formula down and it worked unitl I reached the 20th or so cell with the formula in it and that one and all subsequent ones gave a #VALUE! error. So strange.
 
Upvote 0
For what it's worth, the first instance of the error emerged from trying to match

RON DAVIDSON WELLINGTON FL

with

RONALD DAVIDSON WELLINGTON FL

and then, like I said, all other instances of the formula thereafter resulted in errors.

Perplexed in Springfield
 
Upvote 0
Without seeing it, it sounds like it could be an anchor problem to me.
I've pm'd you with my email address so you can send me a sample s/sheet with the problem.
 
Upvote 0
Hi al_b_cnu,

over all is pretty good for my initial on the surface testing.
I have tried it and noticed some things (good and bad),

1 - it takes a lot of memory running about 200 datas with large records, takes very long to process, slows down computer dramatically. I only test on about 200 records.

2 - the good things is while I was testing on 200 records I noticed a accuracy about 95%, but this may be on less complex records. Do you know if there is anything out there that will produce more accuracy? I say this because if it's not close to 100% there are chance where there could be error and risky to use especially with large records, unless we also do a manual check which you have to kind of check the whole records which is almost like doing it manually.

Do you know if the match is more accurate with data that has less characters like 20 vs more characters like 100? Because I was doing a initial test and notice that it produce more accuracy with data with less characters or it may be because I was looking into a bigger file.

3 - if the data is looking for is not there, the match is completely off --- give random data from the look onto record which is inaccurate info.

4 - it usually match on the first record it see, even though on the list there may be a closer match down further.

Do you know if there are any thing that is more accurate than this out there, and could this be converted into script? Also the ones that are not there should not give a random record that is completely off instead it should say n/a.

over all it is good, but I still think it is risky depending on this b/c sometimes it could give you a match and it may not be correct, off course the chances are less but I think a check on the work after is needed but then is may seems a little manual.

Let me know what you think, Thanks.
 
Last edited:
Upvote 0
Hi John, thanks for your comments. Responses below, hope they are helpful.

1 - it takes a lot of memory running about 200 datas with large records, takes very long to process, slows down computer dramatically. I only test on about 200 records.

RESPONSE: Yes, agreed, which is why you’d only use this tool as a last resort.

2 - the good things is while I was testing on 200 records I noticed a accuracy about 95%, but this may be on less complex records. Do you know if there is anything out there that will produce more accuracy? I say this because if it's not close to 100% there are chance where there could be error and risky to use especially with large records, unless we also do a manual check which you have to kind of check the whole records which is almost like doing it manually.

RESPONSE: controlled by the ‘Algorithm’ and ‘NFPercent’ parameters. You may also want to use FuzzyPercent to return the actual %age match. (but that’s even more processing)

Do you know if the match is more accurate with data that has less characters like 20 vs more characters like 100? Because I was doing a initial test and notice that it produce more accuracy with data with less characters or it may be because I was looking into a bigger file.

RESPONSE: The code will always match shortest string against the longest.

3 - if the data is looking for is not there, the match is completely off --- give random data from the look onto record which is inaccurate info.

RESPONSE:Controlled by ‘NFPercent’ parameter.

4 - it usually match on the first record it see, even though on the list there may be a closer match down further.

Response: The code will return the first highest match entry in the list. If subsequent highest entries are required, use the RANK parameter.

Do you know if there are any thing that is more accurate than this out there, and could this be converted into script? Also the ones that are not there should not give a random record that is completely off instead it should say n/a.

RESPONSE: Controlled by ‘NFPercent’ parameter.

over all it is good, but I still think it is risky depending on this b/c sometimes it could give you a match and it may not be correct, off course the chances are less but I think a check on the work after is needed but then is may seems a little manual.

RESPONSE: As with anything which matches two non-identical entries there is a risk that the entry returned is not the one you wanted it to. You CAN tailor it using the parameters ‘Algorithm’, ‘NFPercent’ and Rank, and that risk can be demonstrably quantified using the FuzzyPercent UDF against the lookup string and the returned entry.
 
Upvote 0
I came across this board while searching a formula like the one that was created.

I think you did a terrific job but I was wondering if there is a limit to the number of columns/rows. I've been using fuzzyv and it works great in small samples but when use it to search for a specific name inside a box with more than hundred thousand names it gives me back an error (instead of ND it gives back Value).<o:p></o:p>
<o:p> </o:p>
Thanks<o:p></o:p>
 
Upvote 0
Hi vgasmo,

No limits as far as I'm aware, but it was written for Pre 2007 version, maybe 100,000 rows is a bit too much for the poor little thing - with that amount of rows, I'm surprised you get a response at all :eeek:

Can you post the layout of your worksheet with some sample data (including a #Value) & we can have a look.

Best wishes

Alan
 
Upvote 0
Hi.

Thanks for the reply.

I do get an answer...i was able to replace some of the names to see what happened and surprise some of #value transformed into the correct values (at least the first ones). I think it is a computacional problem... i'll try to run it on a different pc...

I'll be back with some info soon.
 
Upvote 0
Hello,


sorry to flog a dead horse, but using the same formula =FuzzyVLookup($B3,$A:$A,1,,,C$2) with the provided example data set, I am still receiving the same #value! error. I made sure the various tool pack plugins are installed and adjusted the calculation settings, but so far no luck.

I'm trying to perform a similar operation, with one column of 13500 records of company names (and their various permutations) checked up against a master set of 1000 records. Closest I have come is in using a formula like =FuzzyVLookup($a1,$b1:$b1000,$c1); I would have used html maker to provide an example, but the system is locked down to third party apps.

Any help would be greatly appreciated.
 
Upvote 0

Forum statistics

Threads
1,224,826
Messages
6,181,192
Members
453,021
Latest member
pingpong7117

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top