Help needed to test strings for defined "similarity"

patsnip

New Member
Joined
Feb 12, 2009
Messages
6
Hi, hoping someone can help with a string testing problem.

I want to test for similarity. I have two columns of data, actually filenames. In the second column, somewhere there is an *almost* identical filename to each of the filenames in the first column. The function needs to lookup the second column, and return the *most similar* filename.

I figure the answer will lie in comparing the characters? If it is too computationally intensive to do this over each 30-odd character filename, then perhaps rather a function that just tests for a match of at least 5 of the first 8 characters from the left or something like this? I can't figure out how to structure this "at least x of y characters" structure - if anyone can help I'd sure appreciate it !

thanks in advance,
patsnip
 

Excel Facts

Best way to learn Power Query?
Read M is for (Data) Monkey book by Ken Puls and Miguel Escobar. It is the complete guide to Power Query.
can you give some example of a test string and "similar" strings with which it should match and some "almost similar" strings that it's not supposed to match.
 
Upvote 0
Sure, for examples:

(1)

Battlesick / The Unclaimed Prize [2CD][VERY RARE][OOP] {in col. A}
AND
Battlesick__.jpg {in col. B}

(2)

Beethoven: Symphonies 2 & 5 [Germany][OOP] {in A}
AND
Beethoven__Symphonies_2___5__Germany__OOP__.jpg {in B}

any ideas?
 
Upvote 0
Sorry, I didn't directly address your questions.

To clarify, the function SHOULD return the *almost* similar string - there won't ever be an exact match, and for those odd cases where there may be two or more *almost* similar matches, it's ok it it just returns one of these. I can filter those manually, there will not be more than a couple.
 
Upvote 0
I've got a couple of ideas. Some of them word based, some character based.
Is the first word of the test string always the start of its match? (As your examples show?)
Is the first word of the test string always a sub-string of its match?
e.g. might "Beethoven: Symphonies 2 & 5 [Germany][OOP]" be matched with
"The Beethoven symphonies 2 & 5.jpg" or
"Beethoven's symphonies 2&5"
 
Upvote 0
"No", "no", and "yes" to your questions, it's not *always* the case that the first word will match (although it is *usually* the case, there are enough instances where this is not the case such as would render consequent manual filtering unwieldy. I think this suggests a character-based method may be more appropriate?
 
Upvote 0
By the way, it's "no" to your second question due to language differences such as Symphony vs. Symphonie, or Quartet vs. Kwartet...
 
Upvote 0
What I have for now is this function ClosenessFactor which returns a number depending on how close two strings are. If they are identical ClosenessFactor is 1. If they have no sub-strings in common, 0 is returned

ClosenessFactor("abcd","xabx") = . 5

"Symphony", "Symphonie" =0.823529412
"Quartet", "Kwartet" =0.714285714

This seems to be a decent aproach to test closness of words.
To test a sentence, I'd split it into words and average their closeness factors.

Code:
Function ClosenessFactor(aStr As String, bStr As String) As Double
    If CBool(Len(aStr) * Len(bStr)) Then
        ClosenessFactor = 2 * LongestCommonSubStringLength(aStr, bStr) / (Len(aStr) + Len(bStr))
    End If
End Function

Function LongestCommonSubStringLength(aStr As String, bStr As String) As Long
    Dim i As Long, j As Long
    For i = Len(aStr) To 1 Step -1
        For j = 1 To Len(aStr) - i + 1
            If CBool(InStr(bStr, Mid(aStr, j, i))) Then GoTo endFunction
        Next j
    Next i
endFunction:
    LongestCommonSubStringLength = i
End Function
 
Last edited:
Upvote 0

Forum statistics

Threads
1,223,903
Messages
6,175,289
Members
452,631
Latest member
a_potato

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top