Dr. Demento
Well-known Member
- Joined
- Nov 2, 2010
- Messages
- 618
- Office Version
- 2019
- 2016
- Platform
- Windows
In exploring the world of fuzzy matching, Patrick O'Beirne provides a UDF for Levenshtein distance using a byte array to evaluate strings; his rationale is that using a Byte array is significantly faster (Levenshtein Distance in Excel - Stack Overflow). I'm evaluting thousands of entries, looking for near matches, so speed is important.
I get most of the code, but there's one line that I'm struggling to understand and then apply to a different formula (q-Gram matching). Below is how Patrick "converts" the MID function to use the Byte array:
I'm struggling how to convert this formula (used in q-Gram matching) to use a Byte array (from Greg Holland thesis, Figure 6 (gregholland.com/qTR.pdf)):
The remainder of the q-Gram code is purely numeric, so the arrays probably are pertinent there (although, if I'm mistaken, please correct me).
Below is Patrick's full code (modified somewhat). I realize that other sections may need to be modified too, but I don't have any experience to guide me in what those changes should be (arr_LevenDist loops? Not sure).
Thanks y'all.
I get most of the code, but there's one line that I'm struggling to understand and then apply to a different formula (q-Gram matching). Below is how Patrick "converts" the MID function to use the Byte array:
Code:
'slow way: [B]If Mid$(string1, i, 1) = Mid$(string2, j, 1)[/B] Then
[COLOR=#0000ff][B]If bs1((i - 1) * 2) = bs2((j - 1) * 2) Then[/B][/COLOR] ' *2 because Unicode every 2nd byte is 0
distance(i, j) = distance(i - 1, j - 1)
I'm struggling how to convert this formula (used in q-Gram matching) to use a Byte array (from Greg Holland thesis, Figure 6 (gregholland.com/qTR.pdf)):
Code:
[COLOR=#ff0000][B]If InStr(rts2, Mid(rts1, i, j)) > 0 Then[/B][/COLOR]
Below is Patrick's full code (modified somewhat). I realize that other sections may need to be modified too, but I don't have any experience to guide me in what those changes should be (arr_LevenDist loops? Not sure).
Thanks y'all.
Code:
Function dist_Levenshtein(ByVal rts1 As String, _
ByVal rts2 As String, _
Optional min_percentage As Long = 0.7) As Long
' ~~ Levenshtein arr_LevenDist in Excel (24x faster d/t byte array)
' [URL="http://stackoverflow.com/a/11584381"]Levenshtein Distance in Excel - Stack Overflow[/URL] || [URL="http://stackoverflow.com/questions/4243036/levenshtein-arr_LevenDist-in-excel"]Levenshtein Distance in Excel - Stack Overflow[/URL]
' Explanation/Source document: [URL]https://sysmod.wordpress.com/2012/07/27/string-fuzzy-matching-in-vba-and-vb-net/[/URL]
'Option Base 0 assumed
'POB: fn with byte array is 17 times faster
Dim i As Long, _
j As Long
Dim str1_len As Long, _
str2_len As Long, _
max_len As Long
Dim min1 As Long, _
min2 As Long, _
min3 As Long
Dim arr_LevenDist() As Long
Dim arr_byte1() As Byte, _
arr_byte2() As Byte
str1_len = Len(rts1)
str2_len = Len(rts2)
' ~~ Added optimization; specify the MINIMUM match % to return (default is 70%) _
You enter percentags like "0.50" or "0.80", or "0" to run the formula regardless _
[URL="http://stackoverflow.com/a/6423088"]Levenshtein Distance in Excel - Stack Overflow[/URL]
' ~~ Check if not too long
If str1_len >= str2_len * (min_percentage) Then
' ~~ Check if not too short
If str1_len <= str2_len * (2 - min_percentage) Then
ReDim arr_LevenDist(str1_len, str2_len)
arr_byte1 = rts1
arr_byte2 = rts2
For i = 0 To str1_len
arr_LevenDist(i, 0) = i
Next i
For j = 0 To str2_len
arr_LevenDist(0, j) = j
Next j
For i = 1 To str1_len
For j = 1 To str2_len
'slow way: If Mid$(rts1, i, 1) = Mid$(rts2, j, 1) Then
If arr_byte1((i - 1) * 2) = arr_byte2((j - 1) * 2) Then ' *2 because Unicode every 2nd byte is 0
arr_LevenDist(i, j) = arr_LevenDist(i - 1, j - 1)
Else
'arr_LevenDist(i, j) = Application.WorksheetFunction.Min _
(arr_LevenDist(i - 1, j) + 1, _
arr_LevenDist(i, j - 1) + 1, _
arr_LevenDist(i - 1, j - 1) + 1)
' spell it out, 50 times faster than WorksheetFunction.MIN
min1 = arr_LevenDist(i - 1, j) + 1
min2 = arr_LevenDist(i, j - 1) + 1
min3 = arr_LevenDist(i - 1, j - 1) + 1
If min1 <= min2 And min1 <= min3 Then
arr_LevenDist(i, j) = min1
ElseIf min2 <= min1 And min2 <= min3 Then
arr_LevenDist(i, j) = min2 'Insertion or Substitution
Else
arr_LevenDist(i, j) = min3 'Deletion or Substitution
End If
End If
Next j
Next i
' ~~ Max length from [URL="http://stackoverflow.com/a/12494656"]Levenshtein Distance in Excel - Stack Overflow[/URL]
max_len = str1_len
If str2_len > max_len Then _
max_len = str2_len
dist_Levenshtein = 100 - CLng((arr_LevenDist(str1_len, str2_len) * 100) / max_len)
' dist_Levenshtein = arr_LevenDist(str1_len, str2_len) ' ~~ Original
End If 'too short
End If 'too long
End Function