Converting to byte array

Dr. Demento

Well-known Member
Joined
Nov 2, 2010
Messages
618
Office Version
  1. 2019
  2. 2016
Platform
  1. Windows
In exploring the world of fuzzy matching, Patrick O'Beirne provides a UDF for Levenshtein distance using a byte array to evaluate strings; his rationale is that using a Byte array is significantly faster (Levenshtein Distance in Excel - Stack Overflow). I'm evaluting thousands of entries, looking for near matches, so speed is important.

I get most of the code, but there's one line that I'm struggling to understand and then apply to a different formula (q-Gram matching). Below is how Patrick "converts" the MID function to use the Byte array:
Code:
'slow way: [B]If Mid$(string1, i, 1) = Mid$(string2, j, 1)[/B] Then
          [COLOR=#0000ff][B]If bs1((i - 1) * 2) = bs2((j - 1) * 2) Then[/B][/COLOR]   ' *2 because Unicode every 2nd byte is 0
              distance(i, j) = distance(i - 1, j - 1)

I'm struggling how to convert this formula (used in q-Gram matching) to use a Byte array (from Greg Holland thesis, Figure 6 (gregholland.com/qTR.pdf)):
Code:
[COLOR=#ff0000][B]If InStr(rts2, Mid(rts1, i, j)) > 0 Then[/B][/COLOR]
The remainder of the q-Gram code is purely numeric, so the arrays probably are pertinent there (although, if I'm mistaken, please correct me).

Below is Patrick's full code (modified somewhat). I realize that other sections may need to be modified too, but I don't have any experience to guide me in what those changes should be (arr_LevenDist loops? Not sure).

Thanks y'all.

Code:
Function dist_Levenshtein(ByVal rts1 As String, _
                          ByVal rts2 As String, _
                          Optional min_percentage As Long = 0.7) As Long
' ~~ Levenshtein arr_LevenDist in Excel (24x faster d/t byte array)
' [URL="http://stackoverflow.com/a/11584381"]Levenshtein Distance in Excel - Stack Overflow[/URL] || [URL="http://stackoverflow.com/questions/4243036/levenshtein-arr_LevenDist-in-excel"]Levenshtein Distance in Excel - Stack Overflow[/URL]
' Explanation/Source document: [URL]https://sysmod.wordpress.com/2012/07/27/string-fuzzy-matching-in-vba-and-vb-net/[/URL]
'Option Base 0 assumed
'POB: fn with byte array is 17 times faster
Dim i As Long, _
    j As Long
Dim str1_len As Long, _
    str2_len As Long, _
    max_len As Long
Dim min1 As Long, _
    min2 As Long, _
    min3 As Long
Dim arr_LevenDist() As Long
Dim arr_byte1() As Byte, _
    arr_byte2() As Byte
  str1_len = Len(rts1)
  str2_len = Len(rts2)
  
  ' ~~ Added optimization; specify the MINIMUM match % to return (default is 70%) _
        You enter percentags like "0.50" or "0.80", or "0" to run the formula regardless _
        [URL="http://stackoverflow.com/a/6423088"]Levenshtein Distance in Excel - Stack Overflow[/URL]
        
  ' ~~ Check if not too long
  If str1_len >= str2_len * (min_percentage) Then
    ' ~~ Check if not too short
    If str1_len <= str2_len * (2 - min_percentage) Then
        
      ReDim arr_LevenDist(str1_len, str2_len)
      
      arr_byte1 = rts1
      arr_byte2 = rts2
    
      For i = 0 To str1_len
        arr_LevenDist(i, 0) = i
      Next i
    
      For j = 0 To str2_len
        arr_LevenDist(0, j) = j
      Next j
    
      For i = 1 To str1_len
        For j = 1 To str2_len
        
          'slow way: If Mid$(rts1, i, 1) = Mid$(rts2, j, 1) Then
          If arr_byte1((i - 1) * 2) = arr_byte2((j - 1) * 2) Then   ' *2 because Unicode every 2nd byte is 0
            arr_LevenDist(i, j) = arr_LevenDist(i - 1, j - 1)
            
          Else
            'arr_LevenDist(i, j) = Application.WorksheetFunction.Min _
            (arr_LevenDist(i - 1, j) + 1, _
             arr_LevenDist(i, j - 1) + 1, _
             arr_LevenDist(i - 1, j - 1) + 1)
             
            ' spell it out, 50 times faster than WorksheetFunction.MIN
            min1 = arr_LevenDist(i - 1, j) + 1
            min2 = arr_LevenDist(i, j - 1) + 1
            min3 = arr_LevenDist(i - 1, j - 1) + 1
            
            If min1 <= min2 And min1 <= min3 Then
              arr_LevenDist(i, j) = min1
            ElseIf min2 <= min1 And min2 <= min3 Then
              arr_LevenDist(i, j) = min2  'Insertion or Substitution
            Else
              arr_LevenDist(i, j) = min3  'Deletion or Substitution
            End If
          
          End If
        Next j
      Next i
    
      ' ~~ Max length from [URL="http://stackoverflow.com/a/12494656"]Levenshtein Distance in Excel - Stack Overflow[/URL]
      max_len = str1_len
      If str2_len > max_len Then _
        max_len = str2_len
        
      dist_Levenshtein = 100 - CLng((arr_LevenDist(str1_len, str2_len) * 100) / max_len)
'      dist_Levenshtein = arr_LevenDist(str1_len, str2_len)  ' ~~ Original
      
    End If 'too short
  End If 'too long
End Function
 

Excel Facts

Format cells as time
Select range and press Ctrl+Shift+2 to format cells as time. (Shift 2 is the @ sign).
I rand these two versions:

Code:
Function qTRByte(s1 As String, s2 As String) As Double
  Dim as1() As Byte
  Dim as2() As Byte
  
  Dim i             As Long
  Dim j             As Long
  Dim Q             As Long
  Dim n1            As Long
  Dim n2            As Long
  Dim Tn1           As Double
  Dim Tn2           As Double

  as1 = s1
  as2 = s2
  
  i = 1
  n1 = Len(s1)
  n2 = Len(s2)
  
  Do While i <= n1
    j = 1
    Do While j <= n1 - i + 1
      If InStr(as2, Mid(as1, i, j)) > 0 Then Q = Q + j
      j = j + 1
    Loop
    i = i + 1
  Loop
  
  Tn1 = n1 * (n1 + 1) * (n1 + 2) / 6
  Tn2 = n2 * (n2 + 1) * (n2 + 2) / 6
  qTRByte = (n1 * Q / Tn1 + n2 * Q / Tn2) / (n1 + n2)
End Function

Function qTRStr(s1 As String, s2 As String) As Double
  Dim i             As Long
  Dim j             As Long
  Dim Q             As Long
  Dim n1            As Long
  Dim n2            As Long
  Dim Tn1           As Double
  Dim Tn2           As Double

  i = 1
  n1 = Len(s1)
  n2 = Len(s2)
  
  Do While i <= n1
    j = 1
    Do While j <= n1 - i + 1
      If InStr(s2, Mid(s1, i, j)) > 0 Then Q = Q + j
      j = j + 1
    Loop
    i = i + 1
  Loop
  
  Tn1 = n1 * (n1 + 1) * (n1 + 2) / 6
  Tn2 = n2 * (n2 + 1) * (n2 + 2) / 6
  qTRStr = (n1 * Q / Tn1 + n2 * Q / Tn2) / (n1 + n2)
End Function

Comparing 2500 words from 3 to 13 characters to one another, I got these results:

[td="bgcolor:#F3F3F3"]
Range
[/td][td="bgcolor:#F3F3F3"]
Formula
[/td][td="bgcolor:#F3F3F3"]
Time/Calc
[/td][td="bgcolor:#F3F3F3"]
RelSpeed
[/td]

[tr][td]C2:C2501[/td][td] =qTRByte($A2,$B2)[/td][td="bgcolor:#E5E5E5"]
0.000 025 049​
[/td][td="bgcolor:#E5E5E5"]
1.0​
[/td][/tr]
[tr][td]E2:E2501[/td][td] =qTRstr($A2,$B2)[/td][td="bgcolor:#E5E5E5"]
0.000 015 137​
[/td][td="bgcolor:#E5E5E5"]
1.7​
[/td][/tr]


As you can see, the string version was faster.
 
Upvote 0
Appreciate the code and feedback, shg.

I always forget to actually speed test the code. Guess I should get in the habit, especially if I say speed matters :-\
 
Upvote 0
Also, I compared the Stack Overflow Levenshtein code using byte arrays to a UDF I wrote several years ago. It was 1.7 times as fast as mine (same number as above, go figure, but not the 17 times faster that the comment claimed).

On the other hand, it ASSUMES that the strings are ANSI, so it only compares the first byte -- meaning it wouldn't work with Unicode strings. So it achieves its speed advantage by doing half the comparisons.
 
Last edited:
Upvote 0
Copy. I think I'm ok with that limitation though; my data is coming from the federal government via text file and I guarantee they're at the lowest common denominator. If I try using this technique with any other encoding, I will keep this in mind.

Thanks again.
 
Upvote 0

Forum statistics

Threads
1,225,757
Messages
6,186,848
Members
453,379
Latest member
gabriellegonzalez

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top