Ranking similarity between multiple fields

Lurkily · Apr 27, 2022

I have a somewhat usual problem that I'd like to automate. In our inventories of equipment, different tracking systems sometimes provide different assigned users.

We have four systems - one tracking system that has IT-entered data, one system with user accounts entered instead of makes, the last recorded login, and a self-reporting system where users can enter what equipment they have.

I put together a sheet that grabs all this data, splits it into first and last name (Accounts are firstname.lastname@email.com), and stacks these side-by side so I can easily see if they match. But I'd like to use formatting to highlight matches in large lists, as an inventory can include thousands of items, and eliminating the easy items fast helps make these more manageable.

The problem is, sometimes they do match, mostly, but not quite. Jonathon Jones in the hand-entered tracking might be Jonathan instead, or have the account Jonathon.G.Jones or Jonathon.Jones1. User entered data might have a typo. Sometimes names change but aren't reflected in all the tracking.

I suspect asking for squishy results like 'similarity' in a deterministic program like excel is probably a ridiculous ask . . . but does anybody have a thought about how I might approach that?

Currently, I'm just assigning scores for every perfect match - a point for every match in the 'first name' range to the first entry, second, third, and fourth, and doing the same for last name. If the entry is blank or an error, I output a different unicode hypen (dash, endash, emdash, etc) to force no match to be scored. They usually get close, but can fail to highlight items where matching users don't actually have 100% matching names.

However, I only know enough about Excel to be dangerous; is there a better way? All I can think is to try and filter out common oddities like middle initials and numbered accounts before scoring.

shknbk2 · Apr 27, 2022

I found a good VBA example for ranking two strings.

A better similarity ranking algorithm for variable length strings

I'm looking for a string similarity algorithm that yields better results on variable length strings than the ones that are usually suggested (levenshtein distance, soundex, etc). For example, Given

stackoverflow.com

Search for the user "bchatham".

I copied his entire code example and ran the very top function stringSimilarity on the following comparisons in the Immediate Window. I don't know if you would use the return amount and assume that a value over X is considered a close enough match.

VBA Code:

?stringSimilarity("Jonathon.G.Jones","Jonathon Jones")
 0.846153846153846 

?stringSimilarity("Jonathon.G.Jones","Jonathon.Jones")
 0.928571428571429 

?stringSimilarity("Jonathon G Jones","Jonathon Jones")
 1 

?stringSimilarity("Jonathon.Jones1","Jonathon Jones")
 0.88 

?stringSimilarity("Jonathon Jones1","Jonathon Jones")
 0.956521739130435

It seems like replacing the periods with spaces in the first string and comparing with spaces in the second string gives the higher results. Anything over 0.9, then, is close enough?

Lurkily · May 6, 2022

Thank you for the time you took. Unfortunately, I can't replicate your results.

I resaved as a macro-enabled book, opened the VBA editor, hit insert class module, pasted the code found there, and closed out VBA. But even simple tests like =stringSimilarity(A1,A2) give me a #name error. The same with strings, like =stringSimilarity("Robert","Robbie")

I confess that I'm rather clumsy and have only attempted to use VBA a couple of times in excel. I attempted this with google's help as it seems like a copy/paste solution, but it's not working out for me.

I have to deal with Excel 2016 at work; perhaps the older implementation of VBA (Or newer, that's a 2014 discussion) means the solution isn't suited to this version?

shknbk2 · May 6, 2022

This won't work from the "ThisWorkbook" code section. Create a new Module and paste it there.

Those names don't have a lot of similarity.



Bob	Robert	0.285714
Robert	Robbie	0.4

Lurkily · Jun 13, 2022

shknbk2 said:
This won't work from the "ThisWorkbook" code section. Create a new Module and paste it there.

Those names don't have a lot of similarity.

Bob Robert 0.285714
Robert Robbie 0.4

Sorry I never got back to you previously.

I DID make this work -- though only stringSimilarity is working, it's not letting me, for example, use the functions that SEEM to be for comparing one values to multiple values. However, it's definitely enough to make it work.

Oddly, I get an error on a three-character string with a space in the middle. I think it doesn't count the space as a character, and requires more than two characters. I'm using triple-hyphens, triple-emdash, etc for names in different fields that are blank, to ensure that they remain a non-match to everything, but also don't confuse the function.

Ranking similarity between multiple fields

Lurkily

New Member

shknbk2

Well-known Member

A better similarity ranking algorithm for variable length strings

Lurkily

New Member

shknbk2

Well-known Member

Lurkily

New Member

Similar threads

Share this page

Ranking similarity between multiple fields

Lurkily

New Member

shknbk2

Well-known Member

A better similarity ranking algorithm for variable length strings

Lurkily

New Member

shknbk2

Well-known Member

Lurkily

New Member

Similar threads

Share this page

We've detected that you are using an adblocker.

Which adblocker are you using?

Disable AdBlock

Disable AdBlock Plus

Disable uBlock Origin

Disable uBlock