Fuzzy Match different product titles to find Duplicates but with-in same brand name

RajK2005

New Member
Joined
Jul 26, 2022
Messages
24
Office Version
  1. 365
Platform
  1. Windows
Hi all,
I have a list of products by different brands and I am trying to find all such duplicates and variants.

For simplicty, let's say my sheet has 3 columns: brand, product_name, manufacturer_name. There could be duplicate products created due to differences in the title. Example:


product_idbrandproduct_namemanufacturer_name
13435ABCABC Anti-Hairfall Shampoo - For Men & Women - 500mlHindustan Unilever
36743ABC
ABC Anti-Hairfall Shampoo - Unisex - 500mlHindustan Unilever
23575ABC
ABC Anti-Hairfall Shampoo - Paraben Free, Sulphate Free - 500mlHindustan Unilever
57980ABC
DEF Anti-Hairfall Shampoo - Unisex - 500mlITC



First 3 are same products while 4th is a different product.

I can use Fuzzy match in Excel Power Query to find closest match based on title. But the fuzzy match of every title with every other title leads to either too many false positives or missing out on correct matches. There are 1.5L products.

  • If I set the threshold to too low, I will get too many false positives. 2nd and 4th will match in above example.
  • If I set it too high, I will miss out on many variants. 2nd and 3rd will probably get excluded.
So I am thinking doing fuzzy match between titles of same brand and then keeping a low threshold for matching. But I am not sure how to do that. How do I merge the table with itself on name column with Fuzzy match but only if the brand_name matches.

Closest I can think is let fuzzy match every title with every title with a low threshold. Then remove all pairs which don't have the same brand_name but I am afraid it's too slow. There are too many products.

Can anyone help me in this? I don't have Power BI by the way. Only Power Query in Excel 365 in Windows 10. I am guessing steps would be similar though.
 

Excel Facts

Last used cell?
Press Ctrl+End to move to what Excel thinks is the last used cell.
Hello, the example you posted seems rather straightforward as product_names do have the same structure, i.e. something like

Excel Formula:
=TEXTBEFORE(C2:C5," - ",1)&" "&TEXTAFTER(C2:C5," - ",-1)

would do the trick here. But how is it with the rest of the data?
 
Upvote 0
Hello, the example you posted seems rather straightforward as product_names do have the same structure, i.e. something like

Excel Formula:
=TEXTBEFORE(C2:C5," - ",1)&" "&TEXTAFTER(C2:C5," - ",-1)

would do the trick here. But how is it with the rest of the data?

The data is very messy. I just used some very straightforward examples for the post.

We have duplicates like this:
Tedibar Spoo Gentle Baby Shampoo No Tears | Gentle On Scalp | No.1 Pediatrician Prescribed 125ml
Tedibar Spoo Gentle Baby Shampoo - Tear Free, Pediatrician Prescribed - 125ml (Pack of 1)

Hence the need for lower threshold.
 
Upvote 0

Forum statistics

Threads
1,224,737
Messages
6,180,668
Members
452,992
Latest member
TokugawaIesuma

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top