Fuzzy Matching - new version plus explanation

al_b_cnu

Well-known Member
Joined
Jul 18, 2003
Messages
4,546
It has been a while since I originally posted my Fuzzy matching UDF’s on the board, and several variants have appeared subsequently.

I thought it time to ‘put the record straight’ & post a definitive version which contains slightly more efficient code, and better matching algorithms, so here it is.

Firstly, I must state that the Fuzzy matching algorithms are very CPU hungry, and should be used sparingly. If for instance you require to lookup a match for a string which starts with, contains or ends with a specified value, this can be performed far more efficiently using the MATCH function:
Fuzzy Examples.xls
ABCDE
1Starts WithEndsContains
2BilljelenBill
3Mr Bill Jelen433
4Bill Jelen
5Joe Bloggs
6Fred Smith
MATCH Example


... Continued ...
 

Excel Facts

Back into an answer in Excel
Use Data, What-If Analysis, Goal Seek to find the correct input cell value to reach a desired result
Superb!!
quick question. What would u recommend the best settings are for the following.

I am merging 2 databases and will be checking for duplicates in names, emails and addresses (street name and zip code)

one database has 11,000 records and the other 6,000, so don't want to spend all day figuring out what's best for each lookup. Have a quad core processor but even that takes a while to get results.

thanks in advance!!
 
Upvote 0
Superb!!
quick question. What would u recommend the best settings are for the following.

I am merging 2 databases and will be checking for duplicates in names, emails and addresses (street name and zip code)

one database has 11,000 records and the other 6,000, so don't want to spend all day figuring out what's best for each lookup. Have a quad core processor but even that takes a while to get results.

thanks in advance!!

Hi,

Algorithm 4 (Levenstein distance) is the most accurate, but slower than algorithm 2 which matches pairs, triplets, quadfs etc..

Given the number of entries, I'd suggest algorithm 2.
 
Upvote 0
Unusual reason for non-completion of fuzzy vlookup

Just thought I'd post this because it is quite funny.

Was running a fuzzy vlookup for a column of 2000 odd names and while in the midst of the calculation, my laptop was stolen! From the kitchen while we were in the room next door! :eeek: (I am in Africa doing a child sponsorship program - was comparing names to last years list with plenty of spelling errors).

I wonder if the bandito (as they call them here in Mozambique) allowed the calculation to complete...
 
Upvote 0
I've been using FuzzyVlookup to match account information between two databases. I'm matching based on the street address and then removing duplicates. This has saved me considerable time where Vlookup would have caused all sorts of issues in regards to slight deviations in spelling street names or in how certain streets were abbreviated when entered in. Sometimes "6th st" was entered "sixth st" for example and this has saved me time in matching.

However, I feel that my procoss of merging databases, given the criteria of matching by street address, would be more accurate by matching first on street number and then referencing the street name in another column. I'm not quite sure how to inform the macro that's run though to first match to street number and then to compare that against the street name. Basically I want to have a higher threshold for the street number with a lower threshhold for the street name. I imagine this would be best served with one column having the respective street numbers and the column over having the street names.

This is probably already possible to do, but could someone provide an example of how to do this? Perhaps it involves the group or column offset?

Thanks for your help!
 
Upvote 0
Hi irhumbled,

Can you provide a sample set of inputs?

you may want to start a new thread for this (although I'm sure at least one reply will say "try FuzzyVLookup") - by all means PM me with a link if you wish
 
Upvote 0
Hi irhumbled,

Can you provide a sample set of inputs?

you may want to start a new thread for this (although I'm sure at least one reply will say "try FuzzyVLookup") - by all means PM me with a link if you wish

I'll try to explain what I'm trying to accomplish hopefully more thoroughly.

You might imagine in Column A I have street numbers going down.

For example in A1: 122, A2: 156, A3: 994

And in Column B I'll have the street name of the street address

For example in B1: Thoroughway St, B2: Douglas Ct, B3: Clearwater Dr

And in Column C and/or D I'd have the street addresses (split or concated together into one column) from another database.

I'd like to have the next column over attempt to match the information in column C (or C and D if the info is split) to columns A and B.

The reason I mention this is in addresses I fee it makes the most sense to first priorotize matching street numbers, and then to see if the street name is correct. I'm using the FuzzyVlookup currently but I feel it's giving the same value to a street name as it would to a street number. Basically when there's not a perfect or near perfect match, I'd rather have it have a high priority of matching street numbers even if the street name is moderatly off. This was, something like 1113 6th st matches more likely to 1113 sixth street because of the shared street number.

If I can clarify any further, please let me know and I'll be happy to. It may not be necessary to split the contents of a street address into two columns but I thought it might be helpful in attempting what I want to do.

Thanks!
 
Upvote 0
ok, try this example:

Excel 2003
ABCDEF
1NumAddressLookup 1Lookup 2Fuzzy Match Row
2122Thoroughway St123 thoroughway Street5
3156Douglas Ct157 Douglas Ct#N/A
4994Clearwater Dr156duglas Ct3
5123Thoroway St156 duglas court3
Sheet2
 
Upvote 0
ok, try this example:
Excel 2003
ABCDEF
NumAddressLookup 1Lookup 2Fuzzy Match Row
Thoroughway St123 thoroughway Street
Douglas Ct157 Douglas Ct
Clearwater Drduglas Ct
Thoroway St156 duglas court

<colgroup><col><col><col><col><col><col><col></colgroup><thead>
</thead><tbody>
[TD="align: center"]1[/TD]

[TD="align: right"][/TD]

[TD="align: center"]2[/TD]
[TD="align: right"]122[/TD]

[TD="align: right"][/TD]
[TD="align: right"][/TD]
[TD="align: right"]5[/TD]

[TD="align: center"]3[/TD]
[TD="align: right"]156[/TD]

[TD="align: right"][/TD]
[TD="align: right"][/TD]
[TD="align: right"]#N/A[/TD]

[TD="align: center"]4[/TD]
[TD="align: right"]994[/TD]

[TD="align: right"]156[/TD]

[TD="align: right"][/TD]
[TD="align: right"]3[/TD]

[TD="align: center"]5[/TD]
[TD="align: right"]123[/TD]

[TD="align: right"][/TD]
[TD="align: right"][/TD]
[TD="align: right"]3[/TD]

</tbody>
Sheet2

So... I'm not quite sure what to make of it. Perhaps the formulas posted would help me understand how the matching is done so I can replicate it in my own workbook?

Thank so much for you help al_b_cnu, you're UDF has been a lifesaver!
 
Upvote 0

Forum statistics

Threads
1,225,619
Messages
6,186,045
Members
453,335
Latest member
sfd039

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top