Match Text string and Remove Duplicates

Pete2020

Board Regular
Joined
Apr 25, 2020
Messages
68
Office Version
  1. 2016
Platform
  1. Windows
I thank each and every Mr. Excel forum members who are helping hundreds of Productivity Issues daily.

I am looking for a solution to remove similar duplicates by identifying the Pattern which every text ends with a Number.

Based on Column A (primary key) - the macro should run on column B data, by matching the similar text name ends with a Number.It should trim the extra spaces and delete the numbers and special characters around it.

Later macro should delete the Column B duplicates based on column A and results should be populated in column D and E.

Sample data

Book7
ABCDE
1IDRAW DATA PrimaryRemove duplicate and Keep Unique
2CD-001Toggle Method - 1CD-001Toggle Method
3CD-001Toggle Method - 2CD-002Update Action -part
4CD-002Update Action - Part 1CD-003Add Form - Part
5CD-002Update Action - Part 2CD-004Code Implementation
6CD-003Add Form - Part (1)CD-005Create and run a while loop, part
7CD-003Add Form : Part (2)CD-006VBA Framework Lecture
8CD-004Code Implementation ICD-007VBA Project Video
9CD-004Code Implementation IICD-008VBA Command Video
10CD-005Create and run a while loop, part 1CD-009Player Setup P
11CD-005Create and run a while loop, part 2CD-010The registerNewUser method part
12CD-006VBA Framework Lecture:-1 CD-011HTML Form @ Part
13CD-006VBA Framework Lecture:-2 CD-012VBA Function
14CD-007VBA Project Video - iCD-013Let's create Macro Pt
15CD-007VBA Project Video- iiCD-014Source Code Lesson
16CD-008VBA Command Video1CD-015VBA Functions - Chapter
17CD-008VBA Command Video 2CD-016Inbuilt Time Function
18CD-009Player Setup P.1CD-016Inbuilt Date Functions
19CD-009Player Setup P.2CD-016Inbuilt Text Functions
20CD-010The registerNewUser method part1
21CD-010The registerNewUser method part2
22CD-011HTML Form @ part2
23CD-011HTML Form @ part3
24CD-012VBA Functions 01
25CD-012VBA Functions02
26CD-013Let's create Macro Pt.1
27CD-013Let's create Macro Pt.2
28CD-014Source Code Lesson 28
29CD-014Source Code Lesson 29
30CD-015VBA Functions - Chapter-1
31CD-015VBA Functions - Chapter-2
32CD-016Inbuilt Time Function
33CD-016Inbuilt Date Functions
34CD-016Inbuilt Text Functions
Sheet1
 
So is mine - that's why I asked for more samples of that and the expected results after removing the end bits and removing duplicates.

Red rows are just for observation whether macro eliminates the numbers at the end. Nothing much to focus on red colour.

After removing the numbers and it should remove duplicates on a combination of Column A and B
 
Upvote 0

Excel Facts

What does custom number format of ;;; mean?
Three semi-colons will hide the value in the cell. Although most people use white font instead.
To try with a Small sample i giving you some data after reviewing many patterns in my raw data.

Few additional pattern were added from my initial sample data given in post #1.

Once done - Kindly apply the macro on sample which i sent in post#6.

Book3
AB
1Course IDTopics
2ID-123Technical Indicators - Part 1
3ID-123Technical Indicators - Part 2
4ID-1243 - Double entry bookkeeping I
5ID-1244 - Double entry bookkeeping II
6ID-125International Financial Reporting Standards-I
7ID-125International Financial Reporting Standards-II
8ID-125International Financial Reporting Standards-III
9ID-126CHAPTER 6,PART 1
10ID-126CHAPTER 6, PART 2
11ID-126CHAPTER 6, PART 3
12ID-126CHAPTER 6. PART 4
13ID-126CHAPTER 6, PART 5
14ID-126CHAPTER 7 PART 1
15ID-126CHAPTER 7 PART 2
16ID-126CHAPTER 7 PART 3
17ID-126CHAPTER 7 PART 4
18ID-126CHAPTER 7 PART 5
19ID-126CHAPTER 7 PART 6
20ID-127MULTI MODAL TRANSPORT 1
21ID-127MULTI MODAL TRANSPORT 2
22ID-127MULTI MODAL TRANSPORT 3
23ID-127MULTI MODAL TRANSPORT 4
24ID-127MULTI MODAL TRANSPORT 5
25ID-128Intangible Assets_3
26ID-128Intangible Assets_4
27ID-129AE-1
28ID-129AE-2
29ID-130Showme Problem #4
30ID-130Showme Problem #5
31ID-130Showme Problem #6
32ID-131Use QuickBooks File Q1
33ID-131Use QuickBooks File Q2
34ID-132Importing and Organizing Data in Python – part I
35ID-132Importing and Organizing Data in Python – part II
36ID-132Importing and Organizing Data in Python – part II
37ID-132Importing and Organizing Data in Python – part III
38ID-133Formations - 14 (a)
39ID-133Formations - 14 (b)
40ID-133Formations - 14 (c)
41ID-133Formations - 14 (d)
42ID-133Formations - 14 (e)
43ID-133Formations - 14 (f)
44ID-133Formations - 14 (g)
45ID-133Formations - 14 (h)
Sheet1
 
Upvote 0
Could we have the expected results for post #12 sample data please?
 
Upvote 0
Here is the Out Put.

ID-123 Technical Indicators - Part
ID-124 3 - Double entry bookkeeping
ID 125 International Financial Reporting Standards
ID 126 CHAPTER 6,PART
ID 126 CHAPTER 7 PART
ID 127 MULTI MODAL TRANSPORT
ID 128 Intangible Assets
ID 129 AE
ID-130 Showme Problem
ID 131 Use QuickBooks File
ID 132 Importing and Organizing Data in Python – part
ID 133 Formations
 
Upvote 0
Thanks for that.

Sorry for all the questions but I am trying to avoid spending time developing a solution only to find "Oh, that was a mistake" or "That worked fine on that sample but with my real data it didn't work" etc.
Therefore I am trying to come up with a set of logical rules about what to remove at the end and it keeps getting trickier. So I have some more comments/questions that you might be able to clarify.

1. The earlier examples seemed to have removable endings that were either normal numbers or roman numerals and they were either by themselves or enclosed in parentheses, and any non-letter characters preceding. Now we also have (a), (b) etc following a number. Makes me wonder could there also be things like "abc - 14 (iv)" and, if so, what would the result be?

2. In the data below (from your samples and expected results) we have three different formations in column B (as noted in column C) but they all combine to one formation in column E (Expected Results). Is the original data correct? If so, how do we determine that all three of those format should combine into one (and which one?). Or is the sample data incorrect?

Pete2020 2020-05-09 1.xlsm
ABCDE
9ID-126CHAPTER 6,PART 1This has no space after the commaID 126CHAPTER 6,PART
10ID-126CHAPTER 6, PART 2This has a space after the comma
11ID-126CHAPTER 6, PART 3
12ID-126CHAPTER 6. PART 4This has a period "." instead of a comma
13ID-126CHAPTER 6, PART 5
Sheet3


3. In the earlier data we had "HTML Form @ part2". All that was removed was the "2" at the end, not any letters beforehand. Now we have "Use QuickBooks File Q1" and this time not only the number at the end but the letter beforehand is also removed. What logic can we apply there, or is one of them wrong?

4. Given the above point about possible inconsistencies about what to remove, I am also interested in this example from your 'large data' (row 207): "How to navigate around MT4 (Metatrader 4)"
This has a number at the end, followed by a closing parenthesis, like cell B6 in post 1, so is anything removed from this one? If so, how much is removed & what is the logic?

5. In rows 4 & 5 below (from your sample data), the text before the roman numerals at the end is not the same in each row. How can that get combined into a single row as a duplicate?
If it does, then do rows 6 & 7 (my data) also get combined since they also have the same ID as each other with different text before the roman numerals?

Pete2020 2020-05-09 1.xlsm
ABCDE
1Course IDTopicsExpected Results
4ID-1243 - Double entry bookkeeping IID-1243 - Double entry bookkeeping
5ID-1244 - Double entry bookkeeping II
6ID-999Major Currency Pairs IID-999????
7ID-999Forex market time frame II
Sheet3
 
Upvote 0
Great Thinking. Yes, Your Questions Point in a right Way

1. The earlier examples seemed to have removable endings that were either normal numbers or roman numerals and they were either by themselves or enclosed in parentheses, and any non-letter characters preceding. Now we also have (a), (b) etc following a number. Makes me wonder could there also be things like "abc - 14 (iv)" and, if so, what would the result be?

Ans: Yes deleting that Number 14 (iv)

2. In the data below (from your samples and expected results) we have three different formations in column B (as noted in column C) but they all combine to one formation in column E (Expected Results). Is the original data correct? If so, how do we determine that all three of those format should combine into one (and which one?). Or is the sample data incorrect?

Ans: Its is from Actual data.
In this case, If last character is a number then remove special characters around it
After removing just compare word match by ignoring all special characters in rest of the cell, and show one amoung them based on most repetitive pattern
Your result on Column E is right

3. In the earlier data we had "HTML Form @ part2". All that was removed was the "2" at the end, not any letters beforehand. Now we have "Use QuickBooks File Q1" and this time not only the number at the end but the letter beforehand is also removed. What logic can we apply there, or is one of them wrong?

Ans: "HTML Form @ part2" - In this case part is a word with 4 characters and in case of Use QuickBooks File Q1 - Q is single character hence to me removed

4. Given the above point about possible inconsistencies about what to remove, I am also interested in this example from your 'large data' (row 207): "How to navigate around MT4 (Metatrader 4)"This has a number at the end, followed by a closing parenthesis, like cell B6 in post 1, so is anything removed from this one? If so, how much is removed & what is the logic

Ans Should not be removed - Retain because in closing parenthesis if it a consist only number then to be removed - Any word extended by number no deletion required .

5. In rows 4 & 5 below (from your sample data), the text before the roman numerals at the end is not the same in each row. How can that get combined into a single row as a duplicate? If it does, then do rows 6 & 7 (my data) also get combined since they also have the same ID as each other with different text before the roman numerals?

Ans Its not a duplicate - Its a mistake from my end -Just remove the Roman number at the end and it to out put column


Thank You for your close observation on the data.I am very much impressed the way to think and save your valuable time before actually start writing the Code.
 
Upvote 0
The answer to point 2 does not make sense to me.
Firstly, can you precisely define what is a "special character"?
Secondly I cannot see how "CHAPTER 6,PART" fits the description of "most repetitive pattern" since there is one of those and three that say "CHAPTER 6, PART"
However, even if we decide that "CHAPTER 6, PART" is the most common in this example, what are we to do if there are 6 similar cells and
2 of them say "CHAPTER 6,PART"
2 of them say "CHAPTER 6, PART"
2 of them say "CHAPTER 6. PART"
That is, we have three equally most common variations??

In relation to point 5 you say "Its a mistake from my end "
What is the mistake?
a) Incorrect sample data (eg it shouldn't have "3- " and then "4 -") or
b) Incorrect expected results (eg there should be two result rows for that Course ID?

Just looking ahead, I'm not sure that I will have a workable solution for you as there seems to be just so many slight variations & imprecise 'rules'. Still, we will see when the latest answers arrive.
 
Upvote 0
The answer to point 2 does not make sense to me.
Firstly, can you precisely define what is a "special character"?
Secondly I cannot see how "CHAPTER 6,PART" fits the description of "most repetitive pattern" since there is one of those and three that say "CHAPTER 6, PART"
However, even if we decide that "CHAPTER 6, PART" is the most common in this example, what are we to do if there are 6 similar cells and
2 of them say "CHAPTER 6,PART"
2 of them say "CHAPTER 6, PART"
2 of them say "CHAPTER 6. PART"
That is, we have three equally most common variations??

In relation to point 5 you say "Its a mistake from my end "
What is the mistake?
a) Incorrect sample data (eg it shouldn't have "3- " and then "4 -") or
b) Incorrect expected results (eg there should be two result rows for that Course ID?

Just looking ahead, I'm not sure that I will have a workable solution for you as there seems to be just so many slight variations & imprecise 'rules'. Still, we will see when the latest answers arrive.


1. Special character means Comma, ":"; "." "-","@" and most repetitive pattern means the macro trim extra spaces and and delete all special characters(Delimiters) and show the Only text. - Then Output will be " CHAPTER 6 PART"- Let it be simple

2. As we are dealing with removing numbers at the end of the cell and not the beginning of the cell. Hence output would as below.They are not duplicate to each other and no need to adjust the spaces and delimiters if they start at the beginning. From your words they should be Incorrect expected results (eg there should be two result rows for that Course ID?

OutPut
3- Double entry bookkeeping
4 -Double entry bookkeeping

Thank
 
Upvote 0
1. Special character means Comma, ":"; "." "-","@"
From your sample data and expected results, it look like "special characters" also includes "_" and "#"

See how this goes for both accuracy and speed. Report any erroneous results (with the incorrect result and the corrected result)
Assumption is that there are no more than 65,536 rows in the results, otherwise a further modification would be required.

VBA Code:
Sub AggregateParts()
  Dim RX As Object, d As Object
  Dim a As Variant
  Dim i As Long
  
  Const Patt1 As String = "[\,:;\.\-@_#]"
  Const Patt2 As String = "([^a-z0-9][a-z])?((\d+|[civxl]+)|(\(\d+|[civxl]+\)))( *\([a-z]+\))?$"
  
  Set d = CreateObject("Scripting.Dictionary")
  Set RX = CreateObject("VBScript.RegExp")
  RX.Global = True
  RX.IgnoreCase = True
  a = Range("A2", Range("B" & Rows.Count).End(xlUp)).Value
  For i = 1 To UBound(a)
    RX.Pattern = Patt1
    a(i, 2) = Application.Trim(RX.Replace(a(i, 2), " "))
    RX.Pattern = Patt2
    If RX.Test(a(i, 2)) Then a(i, 2) = RTrim(RX.Replace(a(i, 2), ""))
    d(a(i, 1) & ";" & a(i, 2)) = Empty
  Next i
  Application.ScreenUpdating = False
  With Range("D2").Resize(d.Count)
    .Value = Application.Transpose(d.Keys)
    .TextToColumns DataType:=xlDelimited, Semicolon:=True, Comma:=False, Space:=False, Other:=False
  End With
  Application.ScreenUpdating = True
End Sub

My sample data and results:
Pete2020 2020-05-09 1.xlsm
ABCDE
1Course IDTopics
2ID-123Technical Indicators - Part 1ID-123Technical Indicators Part
3ID-123Technical Indicators - Part 2ID-1243 Double entry bookkeeping
4ID-1243 - Double entry bookkeeping IID-1244 Double entry bookkeeping
5ID-1244 - Double entry bookkeeping IIID-125International Financial Reporting Standards
6ID-125International Financial Reporting Standards-IID-126CHAPTER 6 PART
7ID-125International Financial Reporting Standards-IIID-126CHAPTER 7 PART
8ID-125International Financial Reporting Standards-IIIID-127MULTI MODAL TRANSPORT
9ID-126CHAPTER 6,PART 1ID-128Intangible Assets
10ID-126CHAPTER 6, PART 2ID-129AE
11ID-126CHAPTER 6, PART 3ID-130Showme Problem
12ID-126CHAPTER 6. PART 4ID-131Use QuickBooks File
13ID-126CHAPTER 6, PART 5ID-132Importing and Organizing Data in Python – part
14ID-126CHAPTER 7 PART 1ID-133Formations
15ID-126CHAPTER 7 PART 2
16ID-126CHAPTER 7 PART 3
17ID-126CHAPTER 7 PART 4
18ID-126CHAPTER 7 PART 5
19ID-126CHAPTER 7 PART 6
20ID-127MULTI MODAL TRANSPORT 1
21ID-127MULTI MODAL TRANSPORT 2
22ID-127MULTI MODAL TRANSPORT 3
23ID-127MULTI MODAL TRANSPORT 4
24ID-127MULTI MODAL TRANSPORT 5
25ID-128Intangible Assets_3
26ID-128Intangible Assets_4
27ID-129AE-1
28ID-129AE-2
29ID-130Showme Problem #4
30ID-130Showme Problem #5
31ID-130Showme Problem #6
32ID-131Use QuickBooks File Q1
33ID-131Use QuickBooks File Q2
34ID-132Importing and Organizing Data in Python – part I
35ID-132Importing and Organizing Data in Python – part II
36ID-132Importing and Organizing Data in Python – part II
37ID-132Importing and Organizing Data in Python – part III
38ID-133Formations - 14 (a)
39ID-133Formations - 14 (b)
40ID-133Formations - 14 (c)
41ID-133Formations - 14 (d)
42ID-133Formations - 14 (e)
43ID-133Formations - 14 (f)
44ID-133Formations - 14 (g)
45ID-133Formations - 14 (h)
Sheet3



@DanteAmor
I encountered errors when running your code. Particularly this line y = Trim(Mid(c.Value, InStrRev(c.Value, " "), 5)) as some rows of the provided sample data (eg row 27 from post #12) do not contain a space character.
 
Upvote 0

Forum statistics

Threads
1,224,823
Messages
6,181,170
Members
453,021
Latest member
Justyna P

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top