Fuzzy Matching - new version plus explanation

al_b_cnu · Mar 27, 2006

It has been a while since I originally posted my Fuzzy matching UDF’s on the board, and several variants have appeared subsequently.

I thought it time to ‘put the record straight’ & post a definitive version which contains slightly more efficient code, and better matching algorithms, so here it is.

Firstly, I must state that the Fuzzy matching algorithms are very CPU hungry, and should be used sparingly. If for instance you require to lookup a match for a string which starts with, contains or ends with a specified value, this can be performed far more efficiently using the MATCH function:

Fuzzy Examples.xls

A

B

C

D

E

1

Starts With

Ends

Contains

2

Bill

jelen

Bill

3

Mr Bill Jelen

4

3

4

Bill Jelen

5

Joe Bloggs

6

Fred Smith

MATCH Example

... Continued ...

al_b_cnu · Aug 24, 2009

Hi,

Not sure what your requirements are here, but have you tried setting the 'Algorithm' parameter to 1 which is geared towards matching mispellings?

Can you give further details of your requirements and post an example of strings which should and should not match?

What would you want doing with a string such as 'George Harrs'

Hoozits · Nov 5, 2009

This function is incredible. I ran into a problem, though. I copied the formula down and it worked unitl I reached the 20th or so cell with the formula in it and that one and all subsequent ones gave a #VALUE! error. So strange.

Hoozits · Nov 5, 2009

For what it's worth, the first instance of the error emerged from trying to match

RON DAVIDSON WELLINGTON FL

with

RONALD DAVIDSON WELLINGTON FL

and then, like I said, all other instances of the formula thereafter resulted in errors.

Perplexed in Springfield

al_b_cnu · Nov 6, 2009

Without seeing it, it sounds like it could be an anchor problem to me.
I've pm'd you with my email address so you can send me a sample s/sheet with the problem.

JohnSeito · Nov 7, 2009

Hi al_b_cnu,

over all is pretty good for my initial on the surface testing.
I have tried it and noticed some things (good and bad),

1 - it takes a lot of memory running about 200 datas with large records, takes very long to process, slows down computer dramatically. I only test on about 200 records.

2 - the good things is while I was testing on 200 records I noticed a accuracy about 95%, but this may be on less complex records. Do you know if there is anything out there that will produce more accuracy? I say this because if it's not close to 100% there are chance where there could be error and risky to use especially with large records, unless we also do a manual check which you have to kind of check the whole records which is almost like doing it manually.

Do you know if the match is more accurate with data that has less characters like 20 vs more characters like 100? Because I was doing a initial test and notice that it produce more accuracy with data with less characters or it may be because I was looking into a bigger file.

3 - if the data is looking for is not there, the match is completely off --- give random data from the look onto record which is inaccurate info.

4 - it usually match on the first record it see, even though on the list there may be a closer match down further.

Do you know if there are any thing that is more accurate than this out there, and could this be converted into script? Also the ones that are not there should not give a random record that is completely off instead it should say n/a.

over all it is good, but I still think it is risky depending on this b/c sometimes it could give you a match and it may not be correct, off course the chances are less but I think a check on the work after is needed but then is may seems a little manual.

Let me know what you think, Thanks.

al_b_cnu · Nov 9, 2009

Hi John, thanks for your comments. Responses below, hope they are helpful.

1 - it takes a lot of memory running about 200 datas with large records, takes very long to process, slows down computer dramatically. I only test on about 200 records.

RESPONSE: Yes, agreed, which is why you’d only use this tool as a last resort.

2 - the good things is while I was testing on 200 records I noticed a accuracy about 95%, but this may be on less complex records. Do you know if there is anything out there that will produce more accuracy? I say this because if it's not close to 100% there are chance where there could be error and risky to use especially with large records, unless we also do a manual check which you have to kind of check the whole records which is almost like doing it manually.

RESPONSE: controlled by the ‘Algorithm’ and ‘NFPercent’ parameters. You may also want to use FuzzyPercent to return the actual %age match. (but that’s even more processing)

Do you know if the match is more accurate with data that has less characters like 20 vs more characters like 100? Because I was doing a initial test and notice that it produce more accuracy with data with less characters or it may be because I was looking into a bigger file.

RESPONSE: The code will always match shortest string against the longest.

3 - if the data is looking for is not there, the match is completely off --- give random data from the look onto record which is inaccurate info.

RESPONSE:Controlled by ‘NFPercent’ parameter.

4 - it usually match on the first record it see, even though on the list there may be a closer match down further.

Response: The code will return the first highest match entry in the list. If subsequent highest entries are required, use the RANK parameter.

Do you know if there are any thing that is more accurate than this out there, and could this be converted into script? Also the ones that are not there should not give a random record that is completely off instead it should say n/a.

RESPONSE: Controlled by ‘NFPercent’ parameter.

over all it is good, but I still think it is risky depending on this b/c sometimes it could give you a match and it may not be correct, off course the chances are less but I think a check on the work after is needed but then is may seems a little manual.

RESPONSE: As with anything which matches two non-identical entries there is a risk that the entry returned is not the one you wanted it to. You CAN tailor it using the parameters ‘Algorithm’, ‘NFPercent’ and Rank, and that risk can be demonstrably quantified using the FuzzyPercent UDF against the lookup string and the returned entry.

vgasmo · Jan 12, 2010

I came across this board while searching a formula like the one that was created.

I think you did a terrific job but I was wondering if there is a limit to the number of columns/rows. I've been using fuzzyv and it works great in small samples but when use it to search for a specific name inside a box with more than hundred thousand names it gives me back an error (instead of ND it gives back Value).<o

></o

>
<o

> </o

>
Thanks<o

></o

>

al_b_cnu · Jan 12, 2010

Hi vgasmo,

No limits as far as I'm aware, but it was written for Pre 2007 version, maybe 100,000 rows is a bit too much for the poor little thing - with that amount of rows, I'm surprised you get a response at all

Can you post the layout of your worksheet with some sample data (including a #Value) & we can have a look.

Best wishes

Alan

vgasmo · Jan 12, 2010

Hi.

Thanks for the reply.

I do get an answer...i was able to replace some of the names to see what happened and surprise some of #value transformed into the correct values (at least the first ones). I think it is a computacional problem... i'll try to run it on a different pc...

I'll be back with some info soon.

sandaflo · Jan 29, 2010

Hello,

sorry to flog a dead horse, but using the same formula =FuzzyVLookup($B3,$A:$A,1,,,C$2) with the provided example data set, I am still receiving the same #value! error. I made sure the various tool pack plugins are installed and adjusted the calculation settings, but so far no luck.

I'm trying to perform a similar operation, with one column of 13500 records of company names (and their various permutations) checked up against a master set of 1000 records. Closest I have come is in using a formula like =FuzzyVLookup($a1,$b1:$b1000,$c1); I would have used html maker to provide an example, but the system is locked down to third party apps.

Any help would be greatly appreciated.

Fuzzy Matching - new version plus explanation

al_b_cnu

Well-known Member

al_b_cnu

Well-known Member

Hoozits

Active Member

Hoozits

Active Member

al_b_cnu

Well-known Member

JohnSeito

Active Member

al_b_cnu

Well-known Member

vgasmo

New Member

al_b_cnu

Well-known Member

vgasmo

New Member

sandaflo

New Member

Similar threads

Share this page

Fuzzy Matching - new version plus explanation

Well-known Member

Well-known Member

Active Member

Active Member

Well-known Member

Active Member

Well-known Member

New Member

Well-known Member

New Member

New Member

Similar threads

Share this page

We've detected that you are using an adblocker.

Which adblocker are you using?

Disable AdBlock

Disable AdBlock Plus

Disable uBlock Origin

Disable uBlock