Fuzzy Matching - new version plus explanation

al_b_cnu

Well-known Member
Joined
Jul 18, 2003
Messages
4,546
It has been a while since I originally posted my Fuzzy matching UDF’s on the board, and several variants have appeared subsequently.

I thought it time to ‘put the record straight’ & post a definitive version which contains slightly more efficient code, and better matching algorithms, so here it is.

Firstly, I must state that the Fuzzy matching algorithms are very CPU hungry, and should be used sparingly. If for instance you require to lookup a match for a string which starts with, contains or ends with a specified value, this can be performed far more efficiently using the MATCH function:
Fuzzy Examples.xls
ABCDE
1Starts WithEndsContains
2BilljelenBill
3Mr Bill Jelen433
4Bill Jelen
5Joe Bloggs
6Fred Smith
MATCH Example


... Continued ...
 
Hi,

I actually got the comparisons the wrong way round, try the attached:
Code:
Option Explicit

Sub MatchData()
Const sInputSheet As String = "Sheet1"
Const sMyDataCol As String = "A"
Const sCustCol As String = "B"
Const sResultCol As String = "C"

Dim iPtr As Integer
Dim lRow As Long, lRowEnd As Long, lResultRow As Long
Dim objCustDictionary As Object
Dim rCur As Range
Dim sCur As String, sCurSeries As String, sCurKey As String, sCurSplit() As String
Dim sRange As String
Dim wsInput As Worksheet

Set wsInput = Sheets(sInputSheet)

wsInput.Range(sResultCol & "2:" & sResultCol & wsInput.UsedRange.Rows.Count).ClearContents

Set objCustDictionary = Nothing
Set objCustDictionary = CreateObject("Scripting.Dictionary")
lRowEnd = wsInput.Cells(Rows.Count, sMyDataCol).End(xlUp).Row
For Each rCur In wsInput.Range(sMyDataCol & "2:" & Cells(lRowEnd, sMyDataCol).Address)
    sCur = CStr(rCur.Value)
    If sCur<> "" Then
        sCurSplit = Split("-" & sCur, "-")
    End If
    iPtr = UBound(sCurSplit)
    If LCase$(sCurSplit(iPtr)) = "series" Then
        ReDim Preserve sCurSplit(0 To iPtr - 1)
        sCurSeries = Mid$(Join(sCurSplit, "-"), 2)
        On Error Resume Next
        objCustDictionary.Add key:=sCurSeries, Item:="XXX"
        On Error GoTo 0
    End If
Next rCur

lRowEnd = wsInput.Cells(Rows.Count, sCustCol).End(xlUp).Row
For Each rCur In wsInput.Range(sCustCol & "2:" & Cells(lRowEnd, sCustCol).Address)
    sCur = CStr(rCur.Value)
    lRow = 0
    On Error Resume Next
    lRow = WorksheetFunction.Match(sCur, wsInput.Columns(sMyDataCol), 0)
    On Error GoTo 0
    If lRow = 0 Then
        sCurSplit = Split("-" & sCur, "-")
        iPtr = UBound(sCurSplit)
        ReDim Preserve sCurSplit(0 To iPtr - 1)
        sCurKey = Mid$(Join(sCurSplit, "-"), 2)
        If objCustDictionary.exists(sCurKey) Then wsInput.Range(sResultCol & rCur.Row).Value = sCurKey & "-SERIES"
    Else
        wsInput.Range(sResultCol & rCur.Row).Value = sCur
    End If
Next rCur

On Error Resume Next
objCustDictionary.RemoveAll
Set objCustDictionary = Nothing

End Sub

giving these results:
Excel Workbook
ABC
1My DataCust DataResults
21231-183744-3
3542-series83744-2
483736-1542-10542-SERIES
583736-11231-11231-1
6020-238-SERIES034100AA000PC0A-4034100AA000PC0A-SERIES
7034100AA000PC0A-SERIES034100AA000PC0A-3034100AA000PC0A-SERIES
8020-238-2020-238-SERIES
9020-238-1020-238-SERIES
Sheet1
Excel 2003

If you still get an error, can you indicate which statement it fails on?
 
Upvote 0

Excel Facts

Quick Sum
Select a range of cells. The total appears in bottom right of Excel screen. Right-click total to add Max, Min, Count, Average.
Hmm I'm trying to play around with this code, maybe someone can point me in the right direction.

This is an example of my spreadsheet:
fmutkw.jpg


Column A - Latitude
Column B - Longitude
Column C - Code Result needs to go here.
Column D - Latitude
Column E - Longitude
Column F - Name of Place

What I'm trying to do is take A1 and B1 and find its closest matching pair in all of D:E and output the place for me. I have found code that can match one set but what I need the is the closest matching pairs. Any ideas/suggestions?
 
Upvote 0
Hi,

I actually got the comparisons the wrong way round, try the attached:
Code:
Option Explicit

Sub MatchData()
Const sInputSheet As String = "Sheet1"
Const sMyDataCol As String = "A"
Const sCustCol As String = "B"
Const sResultCol As String = "C"

Dim iPtr As Integer
Dim lRow As Long, lRowEnd As Long, lResultRow As Long
Dim objCustDictionary As Object
Dim rCur As Range
Dim sCur As String, sCurSeries As String, sCurKey As String, sCurSplit() As String
Dim sRange As String
Dim wsInput As Worksheet

Set wsInput = Sheets(sInputSheet)

wsInput.Range(sResultCol & "2:" & sResultCol & wsInput.UsedRange.Rows.Count).ClearContents

Set objCustDictionary = Nothing
Set objCustDictionary = CreateObject("Scripting.Dictionary")
lRowEnd = wsInput.Cells(Rows.Count, sMyDataCol).End(xlUp).Row
For Each rCur In wsInput.Range(sMyDataCol & "2:" & Cells(lRowEnd, sMyDataCol).Address)
    sCur = CStr(rCur.Value)
    If sCur<> "" Then
        sCurSplit = Split("-" & sCur, "-")
    End If
    iPtr = UBound(sCurSplit)
    If LCase$(sCurSplit(iPtr)) = "series" Then
        ReDim Preserve sCurSplit(0 To iPtr - 1)
        sCurSeries = Mid$(Join(sCurSplit, "-"), 2)
        On Error Resume Next
        objCustDictionary.Add key:=sCurSeries, Item:="XXX"
        On Error GoTo 0
    End If
Next rCur

lRowEnd = wsInput.Cells(Rows.Count, sCustCol).End(xlUp).Row
For Each rCur In wsInput.Range(sCustCol & "2:" & Cells(lRowEnd, sCustCol).Address)
    sCur = CStr(rCur.Value)
    lRow = 0
    On Error Resume Next
    lRow = WorksheetFunction.Match(sCur, wsInput.Columns(sMyDataCol), 0)
    On Error GoTo 0
    If lRow = 0 Then
        sCurSplit = Split("-" & sCur, "-")
        iPtr = UBound(sCurSplit)
        ReDim Preserve sCurSplit(0 To iPtr - 1)
        sCurKey = Mid$(Join(sCurSplit, "-"), 2)
        If objCustDictionary.exists(sCurKey) Then wsInput.Range(sResultCol & rCur.Row).Value = sCurKey & "-SERIES"
    Else
        wsInput.Range(sResultCol & rCur.Row).Value = sCur
    End If
Next rCur

On Error Resume Next
objCustDictionary.RemoveAll
Set objCustDictionary = Nothing

End Sub
giving these results:
Excel Workbook
ABC
1My DataCust DataResults
21231-183744-3
3542-series83744-2
483736-1542-10542-SERIES
583736-11231-11231-1
6020-238-SERIES034100AA000PC0A-4034100AA000PC0A-SERIES
7034100AA000PC0A-SERIES034100AA000PC0A-3034100AA000PC0A-SERIES
8020-238-2020-238-SERIES
9020-238-1020-238-SERIES
Sheet1
Excel 2003



If you still get an error, can you indicate which statement it fails on?

OMG...you are genius....I think thats what I want!
I will try to run it many times and see it will happen....
Thank you so much><"
 
Upvote 0
sorry...I forgot one thing before doing the comparison. Is that possible to add another code which can do things below before running the code you wrote?

Before:
<table style="border-collapse: collapse; width: 214px; height: 105px;" border="0" cellpadding="0" cellspacing="0"> <col style="width: 116pt;" width="155"> <tbody><tr style="height: 15.75pt;" height="21"> <td style="height: 15.75pt; width: 116pt;" height="21" width="155">ColA</td> </tr> <tr style="height: 15.75pt;" height="21"> <td style="height: 15.75pt;" height="21">123-123,-2,-3,-4</td> </tr> <tr style="height: 15.75pt;" height="21"> <td style="height: 15.75pt;" height="21">324234,-6,-7</td> </tr> <tr style="height: 15.75pt;" height="21"> <td style="height: 15.75pt;" height="21">23hb23h412b-4,-5,-6</td> </tr> </tbody> </table>
After:
<table style="border-collapse: collapse; width: 116pt;" border="0" cellpadding="0" cellspacing="0" width="155"> <col style="width: 116pt;" width="155"> <tbody><tr style="height: 15.75pt;" height="21"> <td class="xl65" style="height: 15.75pt; width: 116pt;" height="21" width="155">ColA</td> </tr> <tr style="height: 15.75pt;" height="21"> <td class="xl65" style="height: 15.75pt; border-top: medium none;" height="21">123-123-2</td> </tr> <tr style="height: 15.75pt;" height="21"> <td class="xl65" style="height: 15.75pt; border-top: medium none;" height="21">123-123-3</td> </tr> <tr style="height: 15.75pt;" height="21"> <td class="xl65" style="height: 15.75pt; border-top: medium none;" height="21">123-123-4</td> </tr> <tr style="height: 15.75pt;" height="21"> <td class="xl65" style="height: 15.75pt; border-top: medium none;" height="21">324234-6</td> </tr> <tr style="height: 15.75pt;" height="21"> <td class="xl65" style="height: 15.75pt; border-top: medium none;" height="21">324234-7</td> </tr> <tr style="height: 15.75pt;" height="21"> <td class="xl65" style="height: 15.75pt; border-top: medium none;" height="21">23hb23h412b-4</td> </tr> <tr style="height: 15.75pt;" height="21"> <td class="xl65" style="height: 15.75pt; border-top: medium none;" height="21">23hb23h412b-5</td> </tr> <tr style="height: 15.75pt;" height="21"> <td class="xl65" style="height: 15.75pt; border-top: medium none;" height="21">23hb23h412b-6</td> </tr> </tbody> </table>
As some of my data were input as something like 123-123,-2,-3,-4 so if it starts compare, the marco wont be able to compare customer's data with these. These data have to be broken down and additional cells in the same col have to be added for the separated data.
can the sub do the separation before comparing the data?

sorry for bothering
Thanks....
 
Last edited:
Upvote 0
Hi,

2 macros: 'correctData' and 'MatchData':
Code:
Option Explicit
Const msInputSheet As String = "Sheet1"
Const msMyDataCol As String = "A"
Const msCustCol As String = "B"
Const msResultCol As String = "C"

Sub CorrectData()
Dim iSplitPtr As Integer
Dim lRow1 As Long, lRow2 As Long, lRowEnd As Long, lPtr As Long
Dim rCur As Range
Dim sFirstAddr As String
Dim sCur As String, saSplit() As String
Dim vaData() As Variant
Dim wsInput As Worksheet

Set wsInput = Sheets(msInputSheet)
lPtr = 0

With wsInput.Columns(msMyDataCol)
    Set rCur = .Find(what:=",", LookIn:=xlValues, lookat:=xlPart)
    If Not rCur Is Nothing Then
        sFirstAddr = rCur.Address
        Do
            sCur = rCur.Value
            saSplit = Split(sCur, ",")
            rCur.Value = saSplit(0) & saSplit(1)
            For iSplitPtr = 2 To UBound(saSplit)
                lPtr = lPtr + 1
                ReDim Preserve vaData(1 To 1, 1 To lPtr)
                vaData(1, lPtr) = saSplit(0) & saSplit(iSplitPtr)
            Next iSplitPtr
            Set rCur = .FindNext(rCur)
            If rCur Is Nothing Then Exit Do
        Loop While rCur.Address <> sFirstAddr
    End If
End With
If lPtr > 0 Then
    lRow1 = wsInput.Cells(Rows.Count, msMyDataCol).End(xlUp).Row + 1
    lRow2 = lRow1 + lPtr - 1
    wsInput.Range(msMyDataCol & lRow1 & ":" & msMyDataCol & lRow2).Value = WorksheetFunction.Transpose(vaData)
End If
End Sub
Sub MatchData()

Dim iPtr As Integer
Dim lRow As Long, lRowEnd As Long, lResultRow As Long
Dim objCustDictionary As Object
Dim rCur As Range
Dim sCur As String, sCurSeries As String, sCurKey As String, sCurSplit() As String
Dim sRange As String
Dim wsInput As Worksheet

Set wsInput = Sheets(msInputSheet)

wsInput.Range(msResultCol & "2:" & msResultCol & wsInput.UsedRange.Rows.Count).ClearContents

Set objCustDictionary = Nothing
Set objCustDictionary = CreateObject("Scripting.Dictionary")
lRowEnd = wsInput.Cells(Rows.Count, msMyDataCol).End(xlUp).Row
For Each rCur In wsInput.Range(msMyDataCol & "2:" & Cells(lRowEnd, msMyDataCol).Address)
    sCur = CStr(rCur.Value)
    If sCur <> "" Then
        sCurSplit = Split("-" & sCur, "-")
    End If
    iPtr = UBound(sCurSplit)
    If LCase$(sCurSplit(iPtr)) = "series" Then
        ReDim Preserve sCurSplit(0 To iPtr - 1)
        sCurSeries = Mid$(Join(sCurSplit, "-"), 2)
        On Error Resume Next
        objCustDictionary.Add key:=sCurSeries, Item:="XXX"
        On Error GoTo 0
    End If
Next rCur

lRowEnd = wsInput.Cells(Rows.Count, msCustCol).End(xlUp).Row
For Each rCur In wsInput.Range(msCustCol & "2:" & Cells(lRowEnd, msCustCol).Address)
    sCur = CStr(rCur.Value)
    lRow = 0
    On Error Resume Next
    lRow = WorksheetFunction.Match(sCur, wsInput.Columns(msMyDataCol), 0)
    On Error GoTo 0
    If lRow = 0 Then
        sCurSplit = Split("-" & sCur, "-")
        iPtr = UBound(sCurSplit)
        ReDim Preserve sCurSplit(0 To iPtr - 1)
        sCurKey = Mid$(Join(sCurSplit, "-"), 2)
        If objCustDictionary.exists(sCurKey) Then wsInput.Range(msResultCol & rCur.Row).Value = sCurKey & "-SERIES"
    Else
        wsInput.Range(msResultCol & rCur.Row).Value = sCur
    End If
Next rCur

On Error Resume Next
objCustDictionary.RemoveAll
Set objCustDictionary = Nothing

End Sub
 
Upvote 0
Thanks again
but there is a problem in MatchData()
it could not find 2 items like

ColA ColB
123123-series 123123
132121664-series 132121664

thanks
 
Upvote 0
also these kinds

ColA ColB
B1421 B1421
G05315 G05315
240545 240545

could not find them..
thanks.....
 
Last edited:
Upvote 0
Hi,

This amended version:
Code:
Sub MatchData()
Dim dblCur As Double
Dim iPtr As Integer
Dim lRow As Long, lRowEnd As Long, lResultRow As Long
Dim objCustDictionary As Object
Dim rCur As Range
Dim sCur As String, sCurSeries As String, sCurKey As String, sCurSplit() As String
Dim sRange As String
Dim vCur As Variant
Dim wsInput As Worksheet

Set wsInput = Sheets(msInputSheet)

wsInput.Range(msResultCol & "2:" & msResultCol & wsInput.UsedRange.Rows.Count).ClearContents

Set objCustDictionary = Nothing
Set objCustDictionary = CreateObject("Scripting.Dictionary")
lRowEnd = wsInput.Cells(Rows.Count, msMyDataCol).End(xlUp).Row
For Each rCur In wsInput.Range(msMyDataCol & "2:" & Cells(lRowEnd, msMyDataCol).Address)
    sCur = CStr(rCur.Value)
    If sCur<> "" Then
        sCurSplit = Split("-" & sCur, "-")
    End If
    iPtr = UBound(sCurSplit)
    If LCase$(sCurSplit(iPtr)) = "series" Then
        ReDim Preserve sCurSplit(0 To iPtr - 1)
        sCurSeries = Mid$(Join(sCurSplit, "-"), 2)
        On Error Resume Next
        objCustDictionary.Add key:=sCurSeries, Item:="XXX"
        On Error GoTo 0
    End If
Next rCur

lRowEnd = wsInput.Cells(Rows.Count, msCustCol).End(xlUp).Row
For Each rCur In wsInput.Range(msCustCol & "2:" & Cells(lRowEnd, msCustCol).Address)
    vCur = rCur.Value
    lRow = 0
    On Error Resume Next
    lRow = WorksheetFunction.Match(vCur, wsInput.Columns(msMyDataCol), 0)
    On Error GoTo 0
    If lRow = 0 Then
        sCurSplit = Split("-" & CStr(vCur), "-")
        iPtr = UBound(sCurSplit)
        ReDim Preserve sCurSplit(0 To iPtr - 1)
        sCurKey = Mid$(Join(sCurSplit, "-"), 2)
        If objCustDictionary.exists(sCurKey) Then wsInput.Range(msResultCol & rCur.Row).Value = sCurKey & "-SERIES"
    Else
        wsInput.Range(msResultCol & rCur.Row).Value = vCur
    End If
Next rCur

On Error Resume Next
objCustDictionary.RemoveAll
Set objCustDictionary = Nothing

End Sub

Returns these results:
Excel Workbook
ABC
1My DataCust DataResults
21231-183744-3
3542-series83744-2
483736-1542-10542-SERIES
583736-11231-11231-1
6020-238-SERIES034100AA000PC0A-4034100AA000PC0A-SERIES
7034100AA000PC0A-SERIES034100AA000PC0A-3034100AA000PC0A-SERIES
8123-123-2020-238-2020-238-SERIES
9324234-6020-238-1020-238-SERIES
1023hb23h412b-4-5123123
11123-123-3132121664
12123-123-4B1421B1421
13324234-7G05315G05315
1423hb23h412b-4-6240545240545
15123123-series
16132121664-series
17B1421
18G05315
19240545
Sheet1
Excel 2003

Regarding the issue that 123123-series not finding 123123, this seems to be a slight rule change here, the code WOULD, however find something like 123123-01. What are the rules?
 
Upvote 0
Hi,

This amended version:
Code:
Sub MatchData()
Dim dblCur As Double
Dim iPtr As Integer
Dim lRow As Long, lRowEnd As Long, lResultRow As Long
Dim objCustDictionary As Object
Dim rCur As Range
Dim sCur As String, sCurSeries As String, sCurKey As String, sCurSplit() As String
Dim sRange As String
Dim vCur As Variant
Dim wsInput As Worksheet

Set wsInput = Sheets(msInputSheet)

wsInput.Range(msResultCol & "2:" & msResultCol & wsInput.UsedRange.Rows.Count).ClearContents

Set objCustDictionary = Nothing
Set objCustDictionary = CreateObject("Scripting.Dictionary")
lRowEnd = wsInput.Cells(Rows.Count, msMyDataCol).End(xlUp).Row
For Each rCur In wsInput.Range(msMyDataCol & "2:" & Cells(lRowEnd, msMyDataCol).Address)
    sCur = CStr(rCur.Value)
    If sCur<> "" Then
        sCurSplit = Split("-" & sCur, "-")
    End If
    iPtr = UBound(sCurSplit)
    If LCase$(sCurSplit(iPtr)) = "series" Then
        ReDim Preserve sCurSplit(0 To iPtr - 1)
        sCurSeries = Mid$(Join(sCurSplit, "-"), 2)
        On Error Resume Next
        objCustDictionary.Add key:=sCurSeries, Item:="XXX"
        On Error GoTo 0
    End If
Next rCur

lRowEnd = wsInput.Cells(Rows.Count, msCustCol).End(xlUp).Row
For Each rCur In wsInput.Range(msCustCol & "2:" & Cells(lRowEnd, msCustCol).Address)
    vCur = rCur.Value
    lRow = 0
    On Error Resume Next
    lRow = WorksheetFunction.Match(vCur, wsInput.Columns(msMyDataCol), 0)
    On Error GoTo 0
    If lRow = 0 Then
        sCurSplit = Split("-" & CStr(vCur), "-")
        iPtr = UBound(sCurSplit)
        ReDim Preserve sCurSplit(0 To iPtr - 1)
        sCurKey = Mid$(Join(sCurSplit, "-"), 2)
        If objCustDictionary.exists(sCurKey) Then wsInput.Range(msResultCol & rCur.Row).Value = sCurKey & "-SERIES"
    Else
        wsInput.Range(msResultCol & rCur.Row).Value = vCur
    End If
Next rCur

On Error Resume Next
objCustDictionary.RemoveAll
Set objCustDictionary = Nothing

End Sub
Returns these results:
Excel Workbook
ABC
1My DataCust DataResults
21231-183744-3
3542-series83744-2
483736-1542-10542-SERIES
583736-11231-11231-1
6020-238-SERIES034100AA000PC0A-4034100AA000PC0A-SERIES
7034100AA000PC0A-SERIES034100AA000PC0A-3034100AA000PC0A-SERIES
8123-123-2020-238-2020-238-SERIES
9324234-6020-238-1020-238-SERIES
1023hb23h412b-4-5123123
11123-123-3132121664
12123-123-4B1421B1421
13324234-7G05315G05315
1423hb23h412b-4-6240545240545
15123123-series
16132121664-series
17B1421
18G05315
19240545
Sheet1
Excel 2003



Regarding the issue that 123123-series not finding 123123, this seems to be a slight rule change here, the code WOULD, however find something like 123123-01. What are the rules?

I could not run it...it says out of range..
 
Upvote 0

Forum statistics

Threads
1,224,828
Messages
6,181,217
Members
453,024
Latest member
Wingit77

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top