# Fuzzy matching on textual data in Power Query



## absherzad (Jul 10, 2017)

Hello everyone,
Is there a way to incorporate Microsoft Fuzzy Lookup Add-In for Excel in Power Query to perform fuzzy matching on textual data in Power Query? Or is it possible to calculate string similarity in Power Query?


----------



## anvg (Jul 11, 2017)

Hi.
The simplest vesion of Fuzzy

```
let
    words1 = "together",
    words2 = "to their",
    simpleLikeFuzzy = (words1 as text, words2 as text) as number =>
        2 * List.Count(List.Intersect({Text.ToList(words1), Text.ToList(words2)})) / (Text.Length(words1) + Text.Length(words2)),
    return = simpleLikeFuzzy(words1, words2)
in
    return
```
Regards,


----------



## absherzad (Jul 11, 2017)

Thank you for your response, and it is appreciated 
The algorithm sounds simple, does it have a name?

My observation:
In some cases this simple version of Fuzzy can be used, such as, minor typo in a column where there is a standard master list for it. However in the following cases it is not advised:


Columns that contains values with few characters. Even with one character difference the result may significantly be decreased.
Also, sometimes there are words completely different, they, however, share common characters.
Moreover, situations that require partial matching.

Fuzzy lookup and comparison is great and has numerous applications. I am looking forward and vote for adding this feature to the Power Query 

Best Regards,


----------



## anvg (Jul 11, 2017)

Hi
Thank you for your observation.


> Columns that contains values with few characters. Even with one character difference the result may significantly be decreased.


You can use another criteria for it like this

```
simpleLikeFuzzy = (words1 as text, words2 as text) as number =>
        List.Count(List.Intersect({Text.ToList(words1), Text.ToList(words2)}))
        / (if Text.Length(words1) > Text.Length(words2) then Text.Length(words2) else Text.Length(words1)),
```
Because chars intersection cannot be more than the shortest char sequence.


> Also, sometimes there are words completely different, they, however, share common characters.


You can use more a strong method. It is not optimized as is

```
let
    words1 = "A big father waks with his own children about a green garden.",
    words2 = "With nice children a big father walks about a greate garden.",
    toTable = (words as text, idName as text) as any =>
    let
    col1 = Table.RenameColumns(Table.FromList(Text.ToList(words)), {{"Column1", "char" & idName}}),
    idx = Table.AddIndexColumn(col1, idName)
    in
    idx,
    chars1 = toTable(words1, "row"),
    chars2 = toTable(words2, "col"),
    joined = Table.Join(chars1, "charrow", chars2, "charcol"),
    addDif = Table.AddColumn(joined, "dif", each [col] -[row]),
    sorted = Table.Sort(addDif,{{"dif", Order.Ascending}, {"row", Order.Ascending}}),
    addIdx = Table.AddIndexColumn(sorted, "idx"),
    addGlobalDif = Table.AddColumn(addIdx, "gdif", each [row] - [idx])[[charrow], [row], [dif], [gdif]],
    rankGroup = Table.Group(addGlobalDif, {"gdif", "dif"}, {{"forExpand", each [[charrow], [row]]}, {"rows", each [row]}, {"count", each List.Count([row])}}),
    rankSort = Table.Sort(rankGroup,{{"count", Order.Ascending}}),
    rankIndex = Table.Buffer(Table.AddIndexColumn(rankSort[[forExpand], [rows]], "idx")),
    step1 = List.Transform(rankIndex[idx], (cur) =>
    let
        rec = rankIndex{cur},
        sub = Table.SelectRows(rankIndex, each (_[idx] > cur) and ( List.Count(List.Intersect({_[rows], rec[rows]})) > 0 )),
        result = if Table.RowCount(sub) > 0 then null else rec
    in
        result
    ),
    step2 = Table.FromRecords(List.RemoveNulls(step1))[[forExpand]],
    uniqieRows = Table.ExpandTableColumn(step2, "forExpand", {"charrow", "row"}),
    sortedUniques = Table.Sort(uniqieRows,{{"row", Order.Ascending}}),
    return = Text.Combine(sortedUniques[charrow])
in
    return
```
Regards, Andrey


----------



## absherzad (Jul 12, 2017)

Thank you very much for more examples. 
I think I like the following solution in your first reply:

```
let
    words1 = "together",
    words2 = "to their",
    simpleLikeFuzzy = (words1 as text, words2 as text) as number =>
        2 * List.Count(List.Intersect({Text.ToList(words1), Text.ToList(words2)})) / (Text.Length(words1) + Text.Length(words2)),
    return = simpleLikeFuzzy(words1, words2)
in
    return
```

By the way, could you please write me which algorithm and approach this method uses? Is it known one?

Best Regards,
Sherzad


----------



## absherzad (Jul 15, 2017)

Dear Andrey,

Is it possible to use the _simpleLikeFuzzy function_ in the merge process while joining two tables? 
Let me elaborate my question, I have two tables without a common key. I like to consider the _name column_ and compare their similarity and if the result of the similarity is greater than the threshold .85 then it is considered a match, otherwise, not.


Regards,


----------



## anvg (Jul 16, 2017)

Hi


> Is it possible to use the _simpleLikeFuzzy function in the merge process while joining two tables? _


Alright

```
let    // an etalon word table, maybe loaded from external source
    defEtalon = #table(type table [value = text], 
    { {"spruce"}, {"cedar"}, {"birch"}, {"pine"}, {"poplar"}, {"aspen"}, 
      {"baobab"}, {"larch"}, {"willow"} }),
    addChars = Table.AddColumn(defEtalon, "chars", each Text.ToList([value]), type list),
    addLength = Table.AddColumn(addChars, "length", each Text.Length([value]), Int64.Type),
    etalonTable = Table.Buffer(addLength),
    // define a search function in the etalon table by the simplest fuzzy
    lookup = (testText as text) as record =>
    let
        simpleFuzzy = (testChars as list) as record =>
        let
            coefficient = Table.AddColumn(etalonTable, "c", each 2 * List.Count(List.Intersect({testChars, [chars]})) / ([length] + List.Count(testChars)), type number)
        in
            Table.Max(coefficient, "c"),
        testChars = Text.ToList(testText)
    in
        // define record fields from the etalon table for returning
        simpleFuzzy(testChars)[[value], [c]],
    // a test table for join with etalon table
    testTable = #table(type table [word = text],
    { {"sprce"}, {"cidar"}, {"pane"}, {"aspern"}, {"babab"} }),
    // to join them
    result = Table.AddColumn(testTable, "rec", each lookup([word]), type record),
    // expand returned joined records
    return = Table.ExpandRecordColumn(result, "rec", {"value", "c"})


in
    return
```
Regards,


----------



## absherzad (Jul 17, 2017)

Many thanks for the input. Frankly speaking I could not manage to customize it in my scenario, attached *screenshot*.
One idea crossed my mind is to use the combination of Name, Job and Department columns using the concatenate transformation function. Can you please have a look at the screenshot and kindly share your opinion with me how to customize the code.


----------



## absherzad (Jul 24, 2017)

Hi Andrey,
Any idea how to use the _simplest vesion of Fuzzy_ on two tables/datasets similar to what Microsoft Excel Fuzzy Lookup Add-In does?

Best,


----------



## Drewbbc (Jun 16, 2018)

Anvg - Brilliant solution. Your solution delivers better results that the 'untweaked' Microsoft fuzzy add-in.  With tweaks - the two solutions are comparable. DrewBbc


----------



## absherzad (Jul 10, 2017)

Hello everyone,
Is there a way to incorporate Microsoft Fuzzy Lookup Add-In for Excel in Power Query to perform fuzzy matching on textual data in Power Query? Or is it possible to calculate string similarity in Power Query?


----------



## Drewbbc (Jun 18, 2018)

let    
    Table1 = #table(type table [value = text], 
    { {"spruce"}, {"cedar"}, {"birch"}, {"pine"}, {"poplar"}, {"aspen"}, 
      {"baobab"}, {"larch"}, {"willow"} }),

    Table2 = #table(type table [word = text],
    { {"sprce"}, {"cidar"}, {"pane"}, {"aspern"}, {"babab"} }),

    AddCol = Table.AddColumn(Table2, "Custom", each Table1),
    Expand = Table.ExpandTableColumn(AddCol, "Custom", {"value"}, {"value"}),

    FuzzyCalc = Table.AddColumn(Expand, "Prct", each 
            2 * List.Count( List.Intersect(
                    { Text.ToList([word]), Text.ToList([value]) } )  )  
            / (List.Count(Text.ToList([word])) + List.Count(Text.ToList([value])))),

    SortRows = Table.Sort(FuzzyCalc,{{"word", Order.Ascending}, {"Prct", Order.Descending}}),
    AddIndex = Table.AddIndexColumn(SortRows, "Index", 1, 1),
    RemoveDupls = Table.Distinct(AddIndex, {"word"})
in
    RemoveDupls


----------



## Drewbbc (Jun 18, 2018)

Andvg - another approach using Cartesian product. 

let
    Src_A = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
    Src_B = Excel.CurrentWorkbook(){[Name="Table2"]}[Content],
    AddCol_Cartesian = Table.AddColumn(Src_B, "Custom", each Src_A),
    Expand = Table.ExpandTableColumn(AddCol_Cartesian, "Custom", {"Words1"}, {"Words1"}),

    FuzzyCalc = Table.AddColumn(Expand, "Prct", each 2 * List.Count(
      List.Intersect(  { Text.ToList([Words2]), Text.ToList([Words1]) } 
      )  )  / (List.Count(Text.ToList([Words2])) + List.Count(Text.ToList([Words1])))),

    SortRows = Table.Sort(FuzzyCalc,{{"Words2", Order.Ascending}, {"Prct", Order.Descending}}),
    AddIndex = Table.AddIndexColumn(SortRows, "Index", 1, 1),
    RemoveDups = Table.Distinct(AddIndex, {"Words2"}),
    RemoveCol = Table.RemoveColumns(RemoveDups,{"Index"})
in
    RemoveCol


----------

