Fuzzy Matching - new version plus explanation

al_b_cnu

Well-known Member
Joined
Jul 18, 2003
Messages
4,546
It has been a while since I originally posted my Fuzzy matching UDF’s on the board, and several variants have appeared subsequently.

I thought it time to ‘put the record straight’ & post a definitive version which contains slightly more efficient code, and better matching algorithms, so here it is.

Firstly, I must state that the Fuzzy matching algorithms are very CPU hungry, and should be used sparingly. If for instance you require to lookup a match for a string which starts with, contains or ends with a specified value, this can be performed far more efficiently using the MATCH function:
Fuzzy Examples.xls
ABCDE
1Starts WithEndsContains
2BilljelenBill
3Mr Bill Jelen433
4Bill Jelen
5Joe Bloggs
6Fred Smith
MATCH Example


... Continued ...
 
Is it correct that the latest version is via Alan's signature, which takes you to page 28 of this thread?
Can someone explain the logic for the algorithm chosen. For example, how does the code below account for algorithm 4, 5, 6 and 7? I can't find reference to 5, 6 and 7, yet the formula works. And I don't understand the line "If (lngAlgorithm And 4) <> 0".


VBA Code:
'-------------------------------------------------------------
'-- If Algorithm = 4,5,6,7, use Levenstein Distance method  --
'-- (Algorithm 4 was Dan Ostrander's code)                  --
'-------------------------------------------------------------
If (lngAlgorithm And 4) <> 0 Then
    If intLen1 < intLen2 Then
'        sngScore = FuzzyAlg4(String1, String1)
        sngScore = GetLevenshteinPercentMatch(String1:=String1, _
                                              String2:=String2, _
                                              Normalised:=True)
    Else
'        sngScore = FuzzyAlg4(String2, String1)
        sngScore = GetLevenshteinPercentMatch(String1:=String2, _
                                              String2:=String1, _
                                              Normalised:=True)
    End If
    intScore = intScore + (sngScore * 100)
    intTotScore = intTotScore + 100
End If
 
Upvote 0

Excel Facts

Why are there 1,048,576 rows in Excel?
The Excel team increased the size of the grid in 2007. There are 2^20 rows and 2^14 columns for a total of 17 billion cells.
That said, there are times when you have to send for the ‘big boys’.
Code:
Option Explicit
Type RankInfo
    Offset As Integer
    Percentage As Single
End Type

Function FuzzyPercent(ByVal String1 As String, _
                      ByVal String2 As String, _
                      Optional Algorithm As Integer = 3, _
                      Optional Normalised As Boolean = False) As Single
'*************************************
'** Return a % match on two strings **
'*************************************
Dim intLen1 As Integer, intLen2 As Integer
Dim intCurLen As Integer
Dim intTo As Integer
Dim intPos As Integer
Dim intPtr As Integer
Dim intScore As Integer
Dim intTotScore As Integer
Dim intStartPos As Integer
Dim strWork As String

'-------------------------------------------------------
'-- If strings havent been normalised, normalise them --
'-------------------------------------------------------
If Normalised = False Then
    String1 = LCase$(Application.Trim(String1))
    String2 = LCase$(Application.Trim(String2))
End If

'----------------------------------------------
'-- Give 100% match if strings exactly equal --
'----------------------------------------------
If String1 = String2 Then
    FuzzyPercent = 1
    Exit Function
End If

intLen1 = Len(String1)
intLen2 = Len(String2)

'----------------------------------------
'-- Give 0% match if string length< 2 --
'----------------------------------------
If intLen1< 2 Then
    FuzzyPercent = 0
    Exit Function
End If

intTotScore = 0                   'initialise total possible score
intScore = 0                      'initialise current score

'--------------------------------------------------------
'-- If Algorithm = 1 or 3, Search for single characters --
'--------------------------------------------------------
If (Algorithm And 1)<> 0 Then
    FuzzyAlg1 String1, String2, intScore, intTotScore
    If intLen1< intLen2 Then FuzzyAlg1 String2, String1, intScore, intTotScore
End If

'-----------------------------------------------------------
'-- If Algorithm = 2 or 3, Search for pairs, triplets etc. --
'-----------------------------------------------------------
If (Algorithm And 2)<> 0 Then
    FuzzyAlg2 String1, String2, intScore, intTotScore
    If intLen1< intLen2 Then FuzzyAlg2 String2, String1, intScore, intTotScore
End If

FuzzyPercent = intScore / intTotScore

End Function
Private Sub FuzzyAlg1(ByVal String1 As String, _
                      ByVal String2 As String, _
                      ByRef Score As Integer, _
                      ByRef TotScore As Integer)
Dim intLen1 As Integer, intPos As Integer, intPtr As Integer, intStartPos As Integer

intLen1 = Len(String1)
TotScore = TotScore + intLen1              'update total possible score
intPos = 0
For intPtr = 1 To intLen1
    intStartPos = intPos + 1
    intPos = InStr(intStartPos, String2, Mid$(String1, intPtr, 1))
    If intPos > 0 Then
        If intPos > intStartPos + 3 Then     'No match if char is > 3 bytes away
            intPos = intStartPos
        Else
            Score = Score + 1          'Update current score
        End If
    Else
        intPos = intStartPos
    End If
Next intPtr
End Sub
Private Sub FuzzyAlg2(ByVal String1 As String, _
                        ByVal String2 As String, _
                        ByRef Score As Integer, _
                        ByRef TotScore As Integer)
Dim intCurLen As Integer, intLen1 As Integer, intTo As Integer, intPtr As Integer, intPos As Integer
Dim strWork As String

intLen1 = Len(String1)
For intCurLen = 2 To intLen1
    strWork = String2                          'Get a copy of String2
    intTo = intLen1 - intCurLen + 1
    TotScore = TotScore + Int(intLen1 / intCurLen)  'Update total possible score
    For intPtr = 1 To intTo Step intCurLen
        intPos = InStr(strWork, Mid$(String1, intPtr, intCurLen))
        If intPos > 0 Then
            Mid$(strWork, intPos, intCurLen) = String$(intCurLen, &H0) 'corrupt found string
            Score = Score + 1     'Update current score
        End If
    Next intPtr
Next intCurLen

End Sub

Function FuzzyVLookup(ByVal LookupValue As String, _
                      ByVal TableArray As Range, _
                      ByVal IndexNum As Integer, _
                      Optional NFPercent As Single = 0.05, _
                      Optional Rank As Integer = 1, _
                      Optional Algorithm As Integer = 3, _
                      Optional AdditionalCols As Integer = 0) As Variant
'********************************************************************************
'** Function to Fuzzy match LookupValue with entries in                        **
'** column 1 of table specified by TableArray.                                 **
'** TableArray must specify the top left cell of the range to be searched      **
'** The function stops scanning the table when an empty cell in column 1       **
'** is found.                                                                  **
'** For each entry in column 1 of the table, FuzzyPercent is called to match   **
'** LookupValue with the Table entry.                                          **
'** 'Rank' is an optional parameter which may take any value > 0               **
'**        (default 1) and causes the function to return the 'nth' best        **
'**         match (where 'n' is defined by 'Rank' parameter)                   **
'** If the 'Rank' match percentage< NFPercent (Default 5%), #N/A is returned. **
'** IndexNum is the column number of the entry in TableArray required to be    **
'** returned, as follows:                                                      **
'** If IndexNum > 0 and the 'Rank' percentage match is >= NFPercent            **
'**                 (Default 5%) the column entry indicated by IndexNum is     **
'**                 returned.                                                  **
'** if IndexNum = 0 and the 'Rank' percentage match is >= NFPercent            **
'**                 (Default 5%) the offset row (starting at 1) is returned.   **
'**                 This value can be used directly in the 'Index' function.   **
'**                                                                            **
'** Algorithm can take one of the following values:                            **
'** Algorithm = 1:                                                             **
'**     This algorithm is best suited for matching mis-spellings.              **
'**     For each character in 'String1', a search is performed on 'String2'.   **
'**     The search is deemed successful if a character is found in 'String2'   **
'**     within 3 characters of the current position.                           **
'**     A score is kept of matching characters which is returned as a          **
'**     percentage of the total possible score.                                **
'** Algorithm = 2:                                                             **
'**     This algorithm is best suited for matching sentences, or               **
'**     'firstname lastname' compared with 'lastname firstname' combinations   **
'**     A count of matching pairs, triplets, quadruplets etc. in 'String1' and **
'**     'String2' is returned as a percentage of the total possible.           **
'** Algorithm = 3: Both Algorithms 1 and 2 are performed.                      **
'********************************************************************************
Dim R As Range

Dim strListString As String
Dim strWork As String

Dim sngMinPercent As Single
Dim sngWork As Single
Dim sngCurPercent  As Single
Dim intBestMatchPtr As Integer
Dim intRankPtr As Integer
Dim intRankPtr1 As Integer
Dim I As Integer

Dim lEndRow As Long

Dim udRankData() As RankInfo

Dim vCurValue As Variant

'--------------------------------------------------------------
'--    Validation                                            --
'--------------------------------------------------------------

LookupValue = LCase$(Application.Trim(LookupValue))

If IsMissing(NFPercent) Then
    sngMinPercent = 0.05
Else
    If (NFPercent<= 0) Or (NFPercent > 1) Then
        FuzzyVLookup = "*** 'NFPercent' must be a percentage > zero ***"
        Exit Function
    End If
    sngMinPercent = NFPercent
End If

If Rank< 1 Then
    FuzzyVLookup = "*** 'Rank' must be an integer > 0 ***"
    Exit Function
End If

ReDim udRankData(1 To Rank)

lEndRow = TableArray.Rows.Count
If VarType(TableArray.Cells(lEndRow, 1).Value) = vbEmpty Then
    lEndRow = TableArray.Cells(lEndRow, 1).End(xlUp).Row
End If

'---------------
'-- Main loop --
'---------------
For Each R In Range(TableArray.Cells(1, 1), TableArray.Cells(lEndRow, 1))
    vCurValue = ""
    For I = 0 To AdditionalCols
        vCurValue = vCurValue & R.Offset(0, I).Text
    Next I
    If VarType(vCurValue) = vbString Then
        strListString = LCase$(Application.Trim(vCurValue))
       
        '------------------------------------------------
        '-- Fuzzy match strings & get percentage match --
        '------------------------------------------------
        sngCurPercent = FuzzyPercent(String1:=LookupValue, _
                                     String2:=strListString, _
                                     Algorithm:=Algorithm, _
                                     Normalised:=True)
       
        If sngCurPercent >= sngMinPercent Then
            '---------------------------
            '-- Store in ranked array --
            '---------------------------
            For intRankPtr = 1 To Rank
                If sngCurPercent > udRankData(intRankPtr).Percentage Then
                    For intRankPtr1 = Rank To intRankPtr + 1 Step -1
                        With udRankData(intRankPtr1)
                            .Offset = udRankData(intRankPtr1 - 1).Offset
                            .Percentage = udRankData(intRankPtr1 - 1).Percentage
                        End With
                    Next intRankPtr1
                    With udRankData(intRankPtr)
                        .Offset = R.Row
                        .Percentage = sngCurPercent
                    End With
                    Exit For
                End If
            Next intRankPtr
        End If
       
    End If
Next R

If udRankData(Rank).Percentage< sngMinPercent Then
    '--------------------------------------
    '-- Return '#N/A' if below NFPercent --
    '--------------------------------------
    FuzzyVLookup = CVErr(xlErrNA)
Else
    intBestMatchPtr = udRankData(Rank).Offset - TableArray.Cells(1, 1).Row + 1
    If IndexNum > 0 Then
        '-----------------------------------
        '-- Return column entry specified --
        '-----------------------------------
        FuzzyVLookup = TableArray.Cells(intBestMatchPtr, IndexNum)
    Else
        '-----------------------
        '-- Return offset row --
        '-----------------------
        FuzzyVLookup = intBestMatchPtr
    End If
End If
End Function
Function FuzzyHLookup(ByVal LookupValue As String, _
                      ByVal TableArray As Range, _
                      ByVal IndexNum As Integer, _
                      Optional NFPercent As Single = 0.05, _
                      Optional Rank As Integer = 1, _
                      Optional Algorithm As Integer = 3) As Variant
'********************************************************************************
'** Function to Fuzzy match LookupValue with entries in                        **
'** row 1 of table specified by TableArray.                                    **
'** TableArray must specify the top left cell of the range to be searched      **
'** The function stops scanning the table when an empty cell in row 1          **
'** is found.                                                                  **
'** For each entry in row 1 of the table, FuzzyPercent is called to match      **
'** LookupValue with the Table entry.                                          **
'** 'Rank' is an optional parameter which may take any value > 0               **
'**        (default 1) and causes the function to return the 'nth' best        **
'**         match (where 'n' is defined by 'Rank' parameter)                   **
'** If the 'Rank' match percentage< NFPercent (Default 5%), #N/A is returned. **
'** IndexNum is the row number of the entry in TableArray required to be       **
'** returned, as follows:                                                      **
'** If IndexNum > 0 and the 'Rank' percentage match is >= NFPercent            **
'**                 (Default 5%) the row entry indicated by IndexNum is        **
'**                 returned.                                                  **
'** if IndexNum = 0 and the 'Rank' percentage match is >= NFPercent            **
'**                 (Default 5%) the offset col (starting at 0) is returned.   **
'**                 This value can be used directly in the 'OffSet' function.  **
'**                                                                            **
'** Algorithm can take one of the following values:                            **
'** Algorithm = 1:                                                             **
'**     For each character in 'String1', a search is performed on 'String2'.   **
'**     The search is deemed successful if a character is found in 'String2'   **
'**     within 3 characters of the current position.                           **
'**     A score is kept of matching characters which is returned as a          **
'**     percentage of the total possible score.                                **
'** Algorithm = 2:                                                             **
'**     A count of matching pairs, triplets, quadruplets etc. in 'String1' and **
'**     'String2' is returned as a percentage of the total possible.           **
'** Algorithm = 3: Both Algorithms 1 and 2 are performed.                      **
'********************************************************************************
Dim R As Range

Dim strListString As String
Dim strWork As String

Dim sngMinPercent As Single
Dim sngWork As Single
Dim sngCurPercent  As Single

Dim intBestMatchPtr As Integer
Dim intPtr As Integer
Dim intRankPtr As Integer
Dim intRankPtr1 As Integer

Dim iEndCol As Integer

Dim udRankData() As RankInfo

Dim vCurValue As Variant
'--------------------------------------------------------------
'--    Validation                                            --
'--------------------------------------------------------------
LookupValue = LCase$(Application.Trim(LookupValue))

If IsMissing(NFPercent) Then
    sngMinPercent = 0.05
Else
    If (NFPercent<= 0) Or (NFPercent > 1) Then
        FuzzyHLookup = "*** 'NFPercent' must be a percentage > zero ***"
        Exit Function
    End If
    sngMinPercent = NFPercent
End If

If Rank< 1 Then
    FuzzyHLookup = "*** 'Rank' must be an integer > 0 ***"
    Exit Function
End If

ReDim udRankData(1 To Rank)
'**************************
iEndCol = TableArray.Columns.Count
If VarType(TableArray.Cells(1, iEndCol).Value) = vbEmpty Then
    iEndCol = TableArray.Cells(1, iEndCol).End(xlToLeft).Column
End If

'---------------
'-- Main loop --
'---------------
For Each R In Range(TableArray.Cells(1, 1), TableArray.Cells(1, iEndCol))
    vCurValue = R.Value
    If VarType(vCurValue) = vbString Then
        strListString = LCase$(Application.Trim(vCurValue))
       
        '------------------------------------------------
        '-- Fuzzy match strings & get percentage match --
        '------------------------------------------------
        sngCurPercent = FuzzyPercent(String1:=LookupValue, _
                                     String2:=strListString, _
                                     Algorithm:=Algorithm, _
                                     Normalised:=True)
       
        If sngCurPercent >= sngMinPercent Then
            '---------------------------
            '-- Store in ranked array --
            '---------------------------
            For intRankPtr = 1 To Rank
                If sngCurPercent > udRankData(intRankPtr).Percentage Then
                    For intRankPtr1 = Rank To intRankPtr + 1 Step -1
                        With udRankData(intRankPtr1)
                            .Offset = udRankData(intRankPtr1 - 1).Offset
                            .Percentage = udRankData(intRankPtr1 - 1).Percentage
                        End With
                    Next intRankPtr1
                    With udRankData(intRankPtr)
                        .Offset = R.Column
                        .Percentage = sngCurPercent
                    End With
                    Exit For
                End If
            Next intRankPtr
        End If
       
    End If
Next R

If udRankData(Rank).Percentage< sngMinPercent Then
    '--------------------------------------
    '-- Return '#N/A' if below NFPercent --
    '--------------------------------------
    FuzzyHLookup = CVErr(xlErrNA)
Else
    intBestMatchPtr = udRankData(Rank).Offset - TableArray.Cells(1, 1).Column + 1
    If IndexNum > 0 Then
        '-----------------------------------
        '-- Return row entry specified --
        '-----------------------------------
        FuzzyHLookup = TableArray.Cells(IndexNum, intBestMatchPtr)
    Else
        '-----------------------
        '-- Return offset col --
        '-----------------------
        FuzzyHLookup = intBestMatchPtr
    End If
End If
End Function

Three Fuzzy matching UDF’s are available, FUZZYVLOOKUP, FUZZYHLOOKUP and FUZZYPERCENT.

FUZZYVLOOKUP is the ‘fuzzy’ equivalent of ‘VLOOKUP and has the following parameters:

Lookupvalue
The value to search in the first column of the table array

Tablearray
One or more columns of data. Use a reference to a range or a range name. The values in the first column of table array are the values searched by lookup value.

Indexnum
The column number in the table array from which the matching value must be returned. An index num of 1 returns the value in the first column in the table array; an index num of 2 returns the value in the second column in the table array, and so on. If index num is zero, the relative row number in the table array is returned.

NFpercent
The Percentage value below which matching strings are deemed as not found. If no strings in the lookup table equal or exceed this matching percentage, #N/A is returned.
The higher the percentage specified, the higher the confidence level in the returned result.
Default: 5%

Rank
An optional parameter which may take any value > 0 and causes the function to return the specified ranking best match.
Default: 1

Algorithm
Defines the algorithm to be used for matching strings. Valid values are 1, 2 or 3:
Algorithm = 1
This algorithm is best suited for matching mis-spellings.
For each character in 'String1', a search is performed on 'String2'.
The search is deemed successful if a character is found in 'String2' within 3 characters of the current position.
A score is kept of matching characters which is returned as a percentage of the total possible score.
Algorithm = 2
This algorithm is best suited for matching sentences, or 'firstname lastname' compared with 'lastname firstname' combinations.
A count of matching pairs, triplets, quadruplets etc. in 'String1' and 'String2' is returned as a percentage of the total possible.
Algorithm = 3: Both Algorithms 1 and 2 are performed.
Default: 3

Additionalcols
Defines the number of subsequent columns after the first within the table array to be concatenated with the first column prior to matching.
Default: 0

FUZZYHLOOKUP is the ‘fuzzy’ equivalent of ‘HLOOKUP and has the following parameters, descriptions as for FuzzyVLookup:
LookupValue
TableArray
IndexNum
NFPercent
Rank
Algorithm

FUZZYPERCENT is at the heart of FuzzyVLookup and FuzzyHLookup, and returns the percentage match of the two strings supplied. This UDF can be called independently. It has the following parameters:

String1
The first string to be matched. FuzzyVLookup will pass the lookup value supplied (but normalised) as this parameter.

String2
The second string to be matched. FuzzyVLookup will pass each of the lookup table strings supplied (but normalised) as this parameter.

Algorithm
Algorithm to be used. See FuzzyVLookup for further details.
Default: 3

Normalised
Boolean value indicating whether the two supplied strings have been normalised. Normalised strings have had leading, trailing and multiple internal spaces removed, and have been converted to lowercase.
Default: False


EXAMPLES.

In the following example sheet,
Column A is the lookup table
Column B contain lookup values
Column C contains FuzzyVLookup formulae using algorithm 1
Column D contains FuzzyVLookup formulae using algorithm 2
Column E contains FuzzyVLookup formulae using algorithm 3.
Column F contains the FuzzyPercent result of column B versus column C using algorithm 1
Column G contains the FuzzyPercent result of column B versus column D using algorithm 2
Column H contains the FuzzyPercent result of column B versus column E using algorithm 3
Fuzzy Examples.xls
ABCDEFGH
1FuzzyVLookup - AlgorithmFuzzyPercent - Algorithm
2123123
3Paul McCartneyGeorge BushGeorge HarrisGeorge HarrisGeorge Harris66.67%42.86%51.52%
4George HarrisGorge hrsGeorge HarrisGeorge HarrisGeorge Harris54.55%21.05%33.33%
5Ringo StarrStarr,RingoChuck NorrisRingo StarrRingo Starr17.39%44.44%31.03%
6John LennonBush, GeorgeBruce LeeGeorge HarrisGeorge Harris25.00%31.91%26.39%
7Tony Blair
8George Clooney
9Chuck Norris
10Claude Van Damme
11Bruce Lee
Fuzzy Match Examples


formula in C3 and copied down and across to E6 is =fuzzyvlookup($B3,$A:$A,1,,,C$2)
Formula in F3 and copied down and across to H6 is =fuzzypercent($B3,C3,F$2)
Can I implement the Fuzzy Percent Function into a search engine in excel? I have 200 rows of data. It has the file name and the topics of each file. Someone would search a topic and find a file with the best match. It would let multiple topics from different files to be searched at the same time.
 
Upvote 0
Hi Rishm, I'm sure the answer is "Yes" but not sure what you mean :/
Could you post an example of inputs and expected results?
Best wishes
Alan
 
Upvote 0
Hi Rishm, I'm sure the answer is "Yes" but not sure what you mean :/
Could you post an example of inputs and expected results?
Best wishes
Alan
FuzzyPercent_ScreenShot.jpg

Hi, this is the screenshot. There is a keyword search box to the left of the data. If someone types Airlines, Vegetables, the first two results that should pop up are Vendor10.xlsx and Vendor1.xlsx. I make the keywords for each file in advance.

I am going to have 200 files and more. Someone might want to type three keywords. I want the results to return the correct files as the top results. Would fuzzy percent help me in doing this? I am not sure if would work for 200 files for each file might have like 10 keywords. Sometimes a file that might not have the keyword could receive a higher percentage match than a file that has the keyword.

But overall, I am thinking of doing a multi-field search/advanced engine. So there would be more rows of data and I would just add up the fuzzy percentage of each criteria.
 
Upvote 0
Fuzzy Search.xlsm
ABCDEF
3Excel File TitleKeywords
4KeywordSearchBoxType keywords hereSearch ResultsMatch PercentageVendor4.xlsxFruits, Vegetables, Health-based foods
5Desserts, fruitsVendor5.xlsx80.00%Vendor5.xlsxDesserts, Ice Cream, Cakes
6Vendor14.xlsxVendor6.xlsxVideo Games, PC Games, Virtual Equipment
7Vendor4.xlsxVendor7.xlsxCoding Solutions, Oranges, Apples
8Vendor8.xlsxPears, Peaches, Orange Cones
9Vendor9.xlsxSoda, Cakes, Burgers,
10Vendor10.xlsxMarines, Army, Strength Based Goods
11Vendor11.xlsxSmoke based products, Ciggaratees, Tobacco
12Vendor12.xlsxInsurance Coverages, Deductible, Limit of Insurance
13Vendor13.xlsxAirlinesTransportations Services, Taxis, Buses
14Vendor14.xlsxdeserts, forests
Sheet1
Cell Formulas
RangeFormula
E4:E14E4="Vendor"&ROW()&".xlsx"


The following range names must be defined:
Keywords_Required: Cell containing the keywords being sought
File_Title: defines the start of the filenames, column+1 contains the corresponding file keywords
Match_Percentage: Cell containing the minimum percentage match required for a keyword to be deemed a match
Search_Results: First cell for the results

This code assumes that the data is in 'Sheet1', change this line as appropriate: Set WS = Sheets("Sheet1")

VBA Code:
Option Explicit

Sub FuzzySearch()
'Defined Name           Comment
'Keywords_Required      Cell containing the keyword(s) being sought
'File_Title             First cell in the column containing the filenames
'                       Column+1 contains the corresponding file keywords
'Match_Percentage       Cell containing the min %age match required for a 'hit'
'Search_Results         First cell for the results to be stored

Dim lEndRow As Long
Dim lCurFileRow As Long
Dim lFileListPtr As Long
Dim lKeywordPtr As Long
Dim lResultsPtr As Long
Dim lCurFileListEntryPtr As Long

Dim sKeywordsRqd As String
Dim saKeywordsRqd() As String
Dim saFileList() As String
Dim sCurFile As String
Dim sCurFileKeywords As String
Dim saCurFileKeywords() As String
Dim sCurKeywordRqd As String

Dim sngMinPercentage As Single
Dim sngCurPercent As Single

Dim vaFileData As Variant
Dim vaResults As Variant

Dim WS As Worksheet

Set WS = Sheets("Sheet1")

'** Store list of filenames & file keywords **
lEndRow = WS.Cells(Rows.Count, WS.Range("File_Title").Column).End(xlUp).Row
vaFileData = WS.Range(WS.Range("File_title").Address).Resize(lEndRow, 2).Value
ReDim saFileList(1 To 2, 1 To 1)
lFileListPtr = 0
For lCurFileRow = 1 To UBound(vaFileData, 1)
    sCurFile = Trim$(vaFileData(lCurFileRow, 1))
    If sCurFile <> "" Then
        lFileListPtr = lFileListPtr + 1
        ReDim Preserve saFileList(1 To 2, 1 To lFileListPtr)
        saFileList(1, lFileListPtr) = sCurFile                                              '** Store filename
        saFileList(2, lFileListPtr) = NormaliseKeywords(CStr(vaFileData(lCurFileRow, 2)))   '** Store keywords
    End If
Next lCurFileRow

'** Get Min Percentage rqd **
sngMinPercentage = Val(WS.Range("Match_Percentage").Resize(1, 1).Value)

'** Get list of keywords required
sKeywordsRqd = NormaliseKeywords(CStr(WS.Range("Keywords_Required").Resize(1, 1).Value))
saKeywordsRqd = Split(sKeywordsRqd, ",")

lResultsPtr = 0
ReDim vaResults(1 To 1, 1 To 1)
For lKeywordPtr = 0 To UBound(saKeywordsRqd)
    If sngMinPercentage < 1 Then
        For lFileListPtr = 1 To UBound(saFileList, 2)
             saCurFileKeywords = Split(saFileList(2, lFileListPtr), ",")
             For lCurFileListEntryPtr = 0 To UBound(saCurFileKeywords)
                sngCurPercent = GetLevenshteinPercentMatch(saKeywordsRqd(lKeywordPtr), _
                                                            saCurFileKeywords(lCurFileListEntryPtr), _
                                                            True)
                If sngCurPercent >= sngMinPercentage Then
                lResultsPtr = lResultsPtr + 1
                ReDim Preserve vaResults(1 To 1, 1 To lResultsPtr)
                vaResults(1, lResultsPtr) = saFileList(1, lFileListPtr)
                End If
             Next lCurFileListEntryPtr
        Next lFileListPtr
    Else
        '** Here if 100% (or  more) match required **
        sCurKeywordRqd = "," & saKeywordsRqd(lKeywordPtr) & ","
        For lFileListPtr = 1 To UBound(saFileList, 2)
            If InStr("," & saFileList(2, lFileListPtr) & ",", sCurKeywordRqd) > 0 Then
                lResultsPtr = lResultsPtr + 1
                ReDim Preserve vaResults(1 To 1, 1 To lResultsPtr)
                vaResults(1, lResultsPtr) = saFileList(1, lFileListPtr)
            End If
        Next lFileListPtr
    End If
Next lKeywordPtr
With WS
    .Range("Search_Results").Resize(.UsedRange.Rows.Count, 1).ClearContents
    .Range("Search_Results").Resize(UBound(vaResults, 2), 1).Value = WorksheetFunction.Transpose(vaResults)
End With
End Sub

Private Function NormaliseKeywords(ByVal Keywords As String) As String
Dim lPtr As Long

Dim saKeywords() As String
Dim sResult As String

sResult = LCase$(Trim$(Keywords))
If sResult <> "" Then
    saKeywords = Split(sResult, ",")
    For lPtr = 0 To UBound(saKeywords)
        saKeywords(lPtr) = Trim$(saKeywords(lPtr))
    Next lPtr
    sResult = Join(saKeywords, ",")
End If
NormaliseKeywords = sResult
End Function

Public Function GetLevenshteinPercentMatch(ByVal String1 As String, _
                                            ByVal String2 As String, _
                                            Optional Normalised As Boolean = True) As Single
Dim iLen As Integer
If Normalised = False Then
    String1 = UCase$(WorksheetFunction.Trim(String1))
    String2 = UCase$(WorksheetFunction.Trim(String2))
End If
iLen = WorksheetFunction.Max(Len(String1), Len(String2))
GetLevenshteinPercentMatch = (iLen - LevenshteinDistance(String1, String2)) / iLen
End Function

Public Function LevenshteinDistance(ByVal s As String, ByVal t As String) As Integer
'********************************
'*** Compute Levenshtein Distance
'********************************

Dim d() As Integer ' matrix
Dim m As Integer ' length of t
Dim N As Integer ' length of s
Dim I As Integer ' iterates through s
Dim j As Integer ' iterates through t
Dim s_i As String ' ith character of s
Dim t_j As String ' jth character of t
Dim cost As Integer ' cost

  ' Step 1

  N = Len(s)
  m = Len(t)
  If N = 0 Then
    LevenshteinDistance = m
    Exit Function
  End If
  If m = 0 Then
    LevenshteinDistance = N
    Exit Function
  End If
  ReDim d(0 To N, 0 To m) As Integer

  ' Step 2

  For I = 0 To N
    d(I, 0) = I
  Next I

  For j = 0 To m
    d(0, j) = j
  Next j

  ' Step 3

  For I = 1 To N
    s_i = Mid$(s, I, 1)
    ' Step 4
    For j = 1 To m
      t_j = Mid$(t, j, 1)
      
      ' Step 5
      If s_i = t_j Then
        cost = 0
      Else
        cost = 1
      End If
      ' Step 6

      d(I, j) = WorksheetFunction.Min(d(I - 1, j) + 1, d(I, j - 1) + 1, d(I - 1, j - 1) + cost)
    Next j
  Next I

  ' Step 7
  LevenshteinDistance = d(N, m)
End Function

You may want to invoke the code via a worksheet_Change function:
VBA Code:
Option Explicit

Private Sub Worksheet_Change(ByVal Target As Range)
If Target.Resize(1, 1).Address = ActiveSheet.Range("Keywords_Required").Resize(1, 1).Address Then
    Call FuzzySearch
End If
End Sub
 
Upvote 0
Hi Alan, thank you so much for the code. I really appreciate it. This is what I was looking for!

Would it be possible for me to slightly edit the VBA code to make it work for filenames with inconsistent naming titles?

Is there a chance it would work for a multi-field search engine like the example below? I'm sorry for not asking the full question properly initially.

fuzzy.jpg
 
Upvote 0
Hi Rism, thank you for your reply.
In answer to your first question, Yes. I only used a formula in my reply to quickly generate some filenames. The code will return whatever is in the filename field.
Do you want it to match BOTH keyword and location OR either keyword or location?
Best wishes
Alan
 
Upvote 0
Hi Alan,

I was thinking of doing both keyword and location.

I was thinking of putting around six to seven search fields. So I would have additional rows of info about each excel file name in the dataset.
 
Upvote 0
Fuzzy Search.xlsm
ABCDEF
3Excel File TitleKeywords
4KeywordSearchBoxType keywords hereSearch ResultsMatch PercentageVendor4.xlsxFruits, Vegetables, Health-based foods
5Desserts, fruitsVendor5.xlsx80.00%Vendor5.xlsxDesserts, Ice Cream, Cakes
6Vendor14.xlsxVendor6.xlsxVideo Games, PC Games, Virtual Equipment
7Vendor4.xlsxVendor7.xlsxCoding Solutions, Oranges, Apples
8Vendor8.xlsxPears, Peaches, Orange Cones
9Vendor9.xlsxSoda, Cakes, Burgers,
10Vendor10.xlsxMarines, Army, Strength Based Goods
11Vendor11.xlsxSmoke based products, Ciggaratees, Tobacco
12Vendor12.xlsxInsurance Coverages, Deductible, Limit of Insurance
13Vendor13.xlsxAirlinesTransportations Services, Taxis, Buses
14Vendor14.xlsxdeserts, forests
Sheet1
Cell Formulas
RangeFormula
E4:E14E4="Vendor"&ROW()&".xlsx"


The following range names must be defined:
Keywords_Required: Cell containing the keywords being sought
File_Title: defines the start of the filenames, column+1 contains the corresponding file keywords
Match_Percentage: Cell containing the minimum percentage match required for a keyword to be deemed a match
Search_Results: First cell for the results

This code assumes that the data is in 'Sheet1', change this line as appropriate: Set WS = Sheets("Sheet1")

VBA Code:
Option Explicit

Sub FuzzySearch()
'Defined Name           Comment
'Keywords_Required      Cell containing the keyword(s) being sought
'File_Title             First cell in the column containing the filenames
'                       Column+1 contains the corresponding file keywords
'Match_Percentage       Cell containing the min %age match required for a 'hit'
'Search_Results         First cell for the results to be stored

Dim lEndRow As Long
Dim lCurFileRow As Long
Dim lFileListPtr As Long
Dim lKeywordPtr As Long
Dim lResultsPtr As Long
Dim lCurFileListEntryPtr As Long

Dim sKeywordsRqd As String
Dim saKeywordsRqd() As String
Dim saFileList() As String
Dim sCurFile As String
Dim sCurFileKeywords As String
Dim saCurFileKeywords() As String
Dim sCurKeywordRqd As String

Dim sngMinPercentage As Single
Dim sngCurPercent As Single

Dim vaFileData As Variant
Dim vaResults As Variant

Dim WS As Worksheet

Set WS = Sheets("Sheet1")

'** Store list of filenames & file keywords **
lEndRow = WS.Cells(Rows.Count, WS.Range("File_Title").Column).End(xlUp).Row
vaFileData = WS.Range(WS.Range("File_title").Address).Resize(lEndRow, 2).Value
ReDim saFileList(1 To 2, 1 To 1)
lFileListPtr = 0
For lCurFileRow = 1 To UBound(vaFileData, 1)
    sCurFile = Trim$(vaFileData(lCurFileRow, 1))
    If sCurFile <> "" Then
        lFileListPtr = lFileListPtr + 1
        ReDim Preserve saFileList(1 To 2, 1 To lFileListPtr)
        saFileList(1, lFileListPtr) = sCurFile                                              '** Store filename
        saFileList(2, lFileListPtr) = NormaliseKeywords(CStr(vaFileData(lCurFileRow, 2)))   '** Store keywords
    End If
Next lCurFileRow

'** Get Min Percentage rqd **
sngMinPercentage = Val(WS.Range("Match_Percentage").Resize(1, 1).Value)

'** Get list of keywords required
sKeywordsRqd = NormaliseKeywords(CStr(WS.Range("Keywords_Required").Resize(1, 1).Value))
saKeywordsRqd = Split(sKeywordsRqd, ",")

lResultsPtr = 0
ReDim vaResults(1 To 1, 1 To 1)
For lKeywordPtr = 0 To UBound(saKeywordsRqd)
    If sngMinPercentage < 1 Then
        For lFileListPtr = 1 To UBound(saFileList, 2)
             saCurFileKeywords = Split(saFileList(2, lFileListPtr), ",")
             For lCurFileListEntryPtr = 0 To UBound(saCurFileKeywords)
                sngCurPercent = GetLevenshteinPercentMatch(saKeywordsRqd(lKeywordPtr), _
                                                            saCurFileKeywords(lCurFileListEntryPtr), _
                                                            True)
                If sngCurPercent >= sngMinPercentage Then
                lResultsPtr = lResultsPtr + 1
                ReDim Preserve vaResults(1 To 1, 1 To lResultsPtr)
                vaResults(1, lResultsPtr) = saFileList(1, lFileListPtr)
                End If
             Next lCurFileListEntryPtr
        Next lFileListPtr
    Else
        '** Here if 100% (or  more) match required **
        sCurKeywordRqd = "," & saKeywordsRqd(lKeywordPtr) & ","
        For lFileListPtr = 1 To UBound(saFileList, 2)
            If InStr("," & saFileList(2, lFileListPtr) & ",", sCurKeywordRqd) > 0 Then
                lResultsPtr = lResultsPtr + 1
                ReDim Preserve vaResults(1 To 1, 1 To lResultsPtr)
                vaResults(1, lResultsPtr) = saFileList(1, lFileListPtr)
            End If
        Next lFileListPtr
    End If
Next lKeywordPtr
With WS
    .Range("Search_Results").Resize(.UsedRange.Rows.Count, 1).ClearContents
    .Range("Search_Results").Resize(UBound(vaResults, 2), 1).Value = WorksheetFunction.Transpose(vaResults)
End With
End Sub

Private Function NormaliseKeywords(ByVal Keywords As String) As String
Dim lPtr As Long

Dim saKeywords() As String
Dim sResult As String

sResult = LCase$(Trim$(Keywords))
If sResult <> "" Then
    saKeywords = Split(sResult, ",")
    For lPtr = 0 To UBound(saKeywords)
        saKeywords(lPtr) = Trim$(saKeywords(lPtr))
    Next lPtr
    sResult = Join(saKeywords, ",")
End If
NormaliseKeywords = sResult
End Function

Public Function GetLevenshteinPercentMatch(ByVal String1 As String, _
                                            ByVal String2 As String, _
                                            Optional Normalised As Boolean = True) As Single
Dim iLen As Integer
If Normalised = False Then
    String1 = UCase$(WorksheetFunction.Trim(String1))
    String2 = UCase$(WorksheetFunction.Trim(String2))
End If
iLen = WorksheetFunction.Max(Len(String1), Len(String2))
GetLevenshteinPercentMatch = (iLen - LevenshteinDistance(String1, String2)) / iLen
End Function

Public Function LevenshteinDistance(ByVal s As String, ByVal t As String) As Integer
'********************************
'*** Compute Levenshtein Distance
'********************************

Dim d() As Integer ' matrix
Dim m As Integer ' length of t
Dim N As Integer ' length of s
Dim I As Integer ' iterates through s
Dim j As Integer ' iterates through t
Dim s_i As String ' ith character of s
Dim t_j As String ' jth character of t
Dim cost As Integer ' cost

  ' Step 1

  N = Len(s)
  m = Len(t)
  If N = 0 Then
    LevenshteinDistance = m
    Exit Function
  End If
  If m = 0 Then
    LevenshteinDistance = N
    Exit Function
  End If
  ReDim d(0 To N, 0 To m) As Integer

  ' Step 2

  For I = 0 To N
    d(I, 0) = I
  Next I

  For j = 0 To m
    d(0, j) = j
  Next j

  ' Step 3

  For I = 1 To N
    s_i = Mid$(s, I, 1)
    ' Step 4
    For j = 1 To m
      t_j = Mid$(t, j, 1)
     
      ' Step 5
      If s_i = t_j Then
        cost = 0
      Else
        cost = 1
      End If
      ' Step 6

      d(I, j) = WorksheetFunction.Min(d(I - 1, j) + 1, d(I, j - 1) + 1, d(I - 1, j - 1) + cost)
    Next j
  Next I

  ' Step 7
  LevenshteinDistance = d(N, m)
End Function

You may want to invoke the code via a worksheet_Change function:
VBA Code:
Option Explicit

Private Sub Worksheet_Change(ByVal Target As Range)
If Target.Resize(1, 1).Address = ActiveSheet.Range("Keywords_Required").Resize(1, 1).Address Then
    Call FuzzySearch
End If
End Sub
Please share the user friendly version of this macro as we dont have the exact data for this macro to try.
 
Upvote 0

Forum statistics

Threads
1,223,630
Messages
6,173,454
Members
452,514
Latest member
cjkelly15

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top