Find, remove matching text from string

Jeff P

Board Regular
Joined
Jun 11, 2011
Messages
80
In column C, i have a Text string spanning several hundred rows.
In columns ("H:M") are a list of names, that are repeated down.

I'm in need of a VBA script that will search through the text string, for key words from ("H:M") and remove all except those matching words.

As an example to go from this.

C ........................................................... H..... I..... J....K.....L....M
Bob and Mary, went to a dinner party...........Bob Fred Mary John Tim Sam
Sam met with John...................................Bob Fred Mary John Tim Sam

to this:

C........................................................... H..... I..... J....K.....L....M
Bob and Mary, went to a dinner party...........Bob Fred Mary
Sam met with John...................................Bob Fred Mary John Tim Sam


Thanks for any help!

Jeff
 

Excel Facts

Highlight Duplicates
Home, Conditional Formatting, Highlight Cells, Duplicate records, OK to add pink formatting to any duplicates in selected range.
This sort of text comparison can be tricky.
To clarify, what results would you expect from these 2 rows given that in row 4 John and Fred are not whole words, but neither really is Mary in row 5.

Excel Workbook
CDEFGHIJKLM
4Sam & Mary Johnson went to FredericktonBobFredMaryJohnTimSam
5Sam & Mary's dog had pupsBobFredMaryJohnTimSam
Extract names



Edit: .. and another one

Excel Workbook
CDEFGHIJKLM
6John-Bob is elevenBobFredMaryJohnTimSam
Extract names
 
Last edited:
Upvote 0
Hi Peter_SSs, Thanks for the reply!

Lol, you highlighted a dilemma i've been struggling with for quite a while, extracting names from text.

My project involves searching through and identifying major names within a large body of text. Ive ran into a few problems whereby a particular name was part of another word. ( as you've pointed out)
So far my solution has been to search "exact" matching names. In a case like Fredrickson, Ive been doing a search on "Fred ", with a space at the end. Works fine, as long as its not followed by a punctuation. Either way, Ive been making alterations along the way for those sticky cases.

So to answer your question, If possible to search for Exact matches.. Unless you can think of a better solution?

Thanks again for the reply Peter!


Jeff
 
Last edited:
Upvote 0
A few more questions

1. Can you confirm that the names down columns H:M are identical in each row or might some rows have a different set of names?

2. Assuming the names are identical in each row,do you actually have, or need to have, them physically all the way down? Could they just be listed once somewhere - say P1:U1? I'm not exactly sure how I will attack this problem so I'm not sure if it would be best to have them repeated down the rows or just in a single spot so asking the question in case. :)

3. Do you actually want the names to remain in their original columns or would you want or accept results like this?


Excel 365 (Windows) 32 bit
CDEFGHIJ
1TextNames
2Bob and Mary, went to a dinner partyBobMary
3Sam met with JohnSamJohn
4Sam & Mary Johnson went to FredericktonSamMary
5Sam & Mary's dog had pupsSamMary
6John-Bob is eleven
7John-Bob is eleven & Fred, Tim & Sam are tenFredTimSam
Extract names


Edit: One more

4. What about repeated names? Any different result?
eg "Mary & Bob went out. Only Mary came home."
 
Last edited:
Upvote 0
Hi Peter_SSs, Thank you very much again on this.

To answer your questions.

1. The names will always be identical all the way down the rows. Ive indicated only columns H through M, for simplicity sake. My final need will vary and could encompass 50 or more columns of names in Row1, but always begin in column H.

If you can imagine a header of 50 or more names in row 1, beginning at H. and paragraphs of text containing those names throughout columns C2:C5000. What I'm hoping to achieve is a kind of visual grid point where they intersect letting me know what names are inside all that text.

2. I have a master list of names in another sheet, in a column, that is continually updated. New names added, others removed. Its sorted alphabetically, which means their name positions in the column varies. I will create a script to copy this column and paste /transpose into row H1 of my working sheet, acting as a header. I have another script will copy/paste this header row down to beyond the last cell of information in Column 1, creating identical copies.

My need was to remove all the names that are not found within the paragraphs of text in column C2:C5000, leaving only those visible inside.

I realize my method of copying identical rows is a very sloppy way to achieve what I'm after. However, If there's a way to simply reference a single column of names and achieve the same grid like positioning where the order of names in a column are identical the order of the header. that would be great!. And If you need to access another sheet, or work beyond P1:U1 this is perfectly fine.

3. The names will need to remain in their original column, to maintain the accurate positioning reflected in the header order.

4. As for repeated names? The names in the header top row are unique and will only appear once. In the paragraphs of text C1, there will likely be multiple instances. If the code is designed to remove (or find) all names but those indicated in the header list, would multiple instances in the paragraphs of text matter? But not knowing much in terms of VBA code myself, If multiple instances appear in the paragraphs of text, from C1, only one unique name should be present in its corresponding header column/row.

Sorry for the long winded reply.
Trying to give as much clarity as possible.

Thank you very much again!


Jeff
 
Upvote 0
I will create a script to copy this column and paste /transpose into row H1 of my working sheet, acting as a header. I have another script will copy/paste this header row down to beyond the last cell of information in Column 1, creating identical copies.

Hi, Jeff

Actually in col H onwards you only need the header (in row 1), you can leave row 2 downward empty.

Example:

BEFORE:

Excel 2013/2016
CDEFGHIJKLM
1Adrien BlazeBooneDax DominikRubenZaidenZein
2Adrien Ruben, Adrien Blaze Malaki Dax Leandro
3Maximo Frederick Blaze Anson
4Bowen Dax Dominik, Rubens zein
5Boone Dominik Connor ,Zein
6Cory Zeins deBoone
Sheet2


AFTER

Excel 2013/2016
CDEFGHIJKLM
1Adrien BlazeBooneDax DominikRubenZaidenZein
2Adrien Ruben, Adrien Blaze Malaki Dax LeandroAdrien BlazeRuben
3Maximo Frederick Blaze Anson
4Bowen Dax Dominik, Rubens zeinDax DominikZein
5Boone Dominik Connor ,ZeinBooneZein
6Cory Zeins deBoone
Sheet2


THE CODE:
Code:
[FONT=lucida console][color=Royalblue]Sub[/color] a1111478a()
[i][color=seagreen]'https://www.mrexcel.com/forum/excel-questions/1111478-find-remove-matching-text-string.html[/color][/i]
[color=Royalblue]Dim[/color] i [color=Royalblue]As[/color] [color=Royalblue]Long[/color], j [color=Royalblue]As[/color] [color=Royalblue]Long[/color], n [color=Royalblue]As[/color] [color=Royalblue]Long[/color], rc [color=Royalblue]As[/color] [color=Royalblue]Long[/color]
[color=Royalblue]Dim[/color] va, vb, vx
[color=Royalblue]Dim[/color] regEx [color=Royalblue]As[/color] [color=Royalblue]Object[/color]

n = Range([color=brown]"C"[/color] & Rows.count).[color=Royalblue]End[/color](xlUp).Row
rc = Cells([color=crimson]1[/color], Columns.count).[color=Royalblue]End[/color](xlToLeft).Column - [color=crimson]7[/color]

va = Range([color=brown]"C2:C"[/color] & n)
vx = Range([color=brown]"H1"[/color]).Resize([color=crimson]1[/color], rc)
[color=Royalblue]ReDim[/color] vb([color=crimson]1[/color] [color=Royalblue]To[/color] UBound(va, [color=crimson]1[/color]), [color=crimson]1[/color] [color=Royalblue]To[/color] UBound(vx, [color=crimson]2[/color]))

[color=Royalblue]For[/color] i = [color=crimson]1[/color] [color=Royalblue]To[/color] UBound(va, [color=crimson]1[/color])
    [color=Royalblue]For[/color] j = [color=crimson]1[/color] [color=Royalblue]To[/color] UBound(vx, [color=crimson]2[/color])

       [color=Royalblue]Set[/color] regEx = CreateObject([color=brown]"VBScript.RegExp"[/color])
       
        [color=Royalblue]With[/color] regEx
            .[color=Royalblue]Global[/color] = [color=Royalblue]True[/color]
            .MultiLine = [color=Royalblue]True[/color]
            .ignorecase = [color=Royalblue]True[/color]
            .pattern = [color=brown]"\b"[/color] & vx([color=crimson]1[/color], j) & [color=brown]"\b"[/color]
        [color=Royalblue]End[/color] [color=Royalblue]With[/color]

        [color=Royalblue]If[/color] regEx.test(va(i, [color=crimson]1[/color])) [color=Royalblue]Then[/color]
            vb(i, j) = vx([color=crimson]1[/color], j)
        [color=Royalblue]End[/color] [color=Royalblue]If[/color]

    [color=Royalblue]Next[/color]
[color=Royalblue]Next[/color]

Range([color=brown]"H2"[/color]).Resize(UBound(vb, [color=crimson]1[/color]), UBound(vb, [color=crimson]2[/color])) = vb

[color=Royalblue]End[/color] [color=Royalblue]Sub[/color][/FONT]
 
Upvote 0
I also have a Regular Expression approach but first some comments.

As already mentioned, & you are well aware, text comparisons are difficult to get 100% right. We also have very little idea of what the extent of variation in your data is so some of this is just guessing. However, it might shed some light or help develop a solution that better meets your needs.

Firstly, both Akuini's code and mine below, only require the 'names list' once in row 1 H1:??1 so unless you need them in all the other rows first for some reason, don't bother

I tested Akuini's code on the following sample data and by my understanding it produced correct results except for the coloured cells which are all part of longer or compound 'words'. This is due to how regular expressions define 'word boundaries'. Apostrophes are a particular problem as ideally we would like to treat the one in row 5 as a word boundary but not treat the one in row 10 as a word boundary.

Excel Workbook
CHIJKLMN
1TextBobFredMaryJohnTimSamConnor
2Bob and Mary, went to a dinner partyBobMary
3Sam met with JohnJohnSam
4Sam & Mary Johnson went to FredericktonMarySam
5Sam & Mary's dog had pupsMarySam
6John-Bob is eleven & Fred, Tim & Sam are tenBobFredJohnTimSam
7Mary & Bob went out. Only Mary came home.BobMary
8Tom John-Jones is hereJohn
9Connor WatsonConnor
10Bob O'ConnorBobConnor
Extract names (1)



My code produces this result instead, so you can still see the problem with the blue cell, but it eliminates the green ones. I'm assuming they should be eliminated?

Excel Workbook
CHIJKLMN
1TextBobFredMaryJohnTimSamConnor
2Bob and Mary, went to a dinner partyBobMary
3Sam met with JohnJohnSam
4Sam & Mary Johnson went to FredericktonMarySam
5Sam & Mary's dog had pupsMarySam
6John-Bob is eleven & Fred, Tim & Sam are tenFredTimSam
7Mary & Bob went out. Only Mary came home.BobMary
8Tom John-Jones is here
9Connor WatsonConnor
10Bob O'ConnorBobConnor
Extract names 2


My final comment is that this code is much more efficient. I tested with about 900 rows and 7 columns and this code took less than 0.01 seconds whereas the other code took a little over 4 seconds. With your 5,000 and up to 50 columns, that time different would be increased significantly I imagine.

Rich (BB code):
Sub Get_Names()
  Dim RX As Object, d As Object
  Dim a As Variant, b As Variant, mtch
  Dim lr As Long, cols As Long, i As Long

  Set d = CreateObject("Scripting.Dictionary")
  d.compareMode = 1
  cols = Cells(1, Columns.Count).End(xlToLeft).Column - 7
  For i = 1 To cols
    d(Cells(1, i + 7).Value) = i
  Next i
  Set RX = CreateObject("VBScript.RegExp")
  RX.Global = True
  RX.IgnoreCase = True
  RX.MultiLine = True
  RX.Pattern = "\b(" & Join(Application.Index(Cells(1, 8).Resize(, cols).Value, 1, 0), "|") & ")\b"
  lr = Cells(Rows.Count, 3).End(xlUp).Row
  a = Range("C2:C" & lr).Value
  ReDim b(1 To UBound(a), 1 To cols)
  For i = 1 To UBound(a)
    For Each mtch In RX.Execute(Replace(a(i, 1), "-", "5"))
      b(i, d(CStr(mtch))) = mtch
    Next mtch
  Next i
  Range("H2").Resize(UBound(b), cols).Value = b
End Sub


Anyway, give them both a try and post back with any problems or if you think we need to do something different.
 
Upvote 0
Wow, Thank you very much for this!

Both codes worked perfectly. I couldn't see a difference in the results between them on my working sheets. ( other than the code from Akuini, adhering to the capitalization of names from the header.)

On my large sheets, where my header names reached to over 130, and the paragraph texts, more than 1000 rows,.. Peter_SSs, your code was lighting fast!,.. near instantaneous.

Amazing ..

Thanks again for the help .. and patience!
Cheers!!


Jeff
 
Upvote 0
.. adhering to the capitalization of names from the header.
Are the headers
- all upper case or
- all lower case or
- all proper case or
- could some names be upper and some proper etc and you want to match whatever is there?

Also, would your heading names ever include a hyphenated name (eg "John-Bob" or "Ashley-Cooper")? If so my code needs a few more tweaks.
 
Last edited:
Upvote 0
Hi Peter_SSs. Thanks for the additional inquiry.

The comment I made regarding Akuini's code was merely a noticeable difference between his and yours. In my test sheet example, the first letter of each name in the Header was capitalized, which also reflected in their corresponding column throughout. For cosmetic purposes this would be great, however the code you provided works perfectly for my needs! Also, the text and names never include hyphens, so its all good!

Many thanks again, Peter_SSs

Cheers!

Jeff
 
Upvote 0

Forum statistics

Threads
1,223,893
Messages
6,175,242
Members
452,623
Latest member
russelllowellpercy

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top