Hi all,
I have a list of products by different brands and I am trying to find all such duplicates and variants.
For simplicty, let's say my sheet has 3 columns: brand, product_name, manufacturer_name. There could be duplicate products created due to differences in the title. Example:
First 3 are same products while 4th is a different product.
I can use Fuzzy match in Excel Power Query to find closest match based on title. But the fuzzy match of every title with every other title leads to either too many false positives or missing out on correct matches. There are 1.5L products.
Closest I can think is let fuzzy match every title with every title with a low threshold. Then remove all pairs which don't have the same brand_name but I am afraid it's too slow. There are too many products.
Can anyone help me in this? I don't have Power BI by the way. Only Power Query in Excel 365 in Windows 10. I am guessing steps would be similar though.
I have a list of products by different brands and I am trying to find all such duplicates and variants.
For simplicty, let's say my sheet has 3 columns: brand, product_name, manufacturer_name. There could be duplicate products created due to differences in the title. Example:
product_id | brand | product_name | manufacturer_name |
13435 | ABC | ABC Anti-Hairfall Shampoo - For Men & Women - 500ml | Hindustan Unilever |
36743 | ABC | ABC Anti-Hairfall Shampoo - Unisex - 500ml | Hindustan Unilever |
23575 | ABC | ABC Anti-Hairfall Shampoo - Paraben Free, Sulphate Free - 500ml | Hindustan Unilever |
57980 | ABC | DEF Anti-Hairfall Shampoo - Unisex - 500ml | ITC |
First 3 are same products while 4th is a different product.
I can use Fuzzy match in Excel Power Query to find closest match based on title. But the fuzzy match of every title with every other title leads to either too many false positives or missing out on correct matches. There are 1.5L products.
- If I set the threshold to too low, I will get too many false positives. 2nd and 4th will match in above example.
- If I set it too high, I will miss out on many variants. 2nd and 3rd will probably get excluded.
Closest I can think is let fuzzy match every title with every title with a low threshold. Then remove all pairs which don't have the same brand_name but I am afraid it's too slow. There are too many products.
Can anyone help me in this? I don't have Power BI by the way. Only Power Query in Excel 365 in Windows 10. I am guessing steps would be similar though.