RegEx criteria - remove punctuation, name honorifics, suffixes

Dr. Demento · Jan 29, 2018

I am comparing a large dataset by concatenating non-unique identifiers. However, when it comes to names, I'm not getting matches that I should. One of the issues is punctuation and name suffixes (LNAME, Jr., etc.). I thought a RegEx would do the trick, but I'm not that proficient yet.

My goal is to remove all punctuation from a string (easy; see below) but in also exclude certain name honorifics/suffixes (Mr, Mrs, Ms, Jr, Sr, III, etc) AND to keep my pipe-delimiter in the string (desired format: LASTNAME|FIRSTNAME). I know this is used as an OR in RegEx, but hopefully there's a workaround.

Thanks y'all.

Code:

Function RemovePunctuation(rts As String) As String
' ~~ Remove punctuation
' [URL]https://www.extendoffice.com/documents/excel/3296-excel-remove-all-punctuation.html[/URL]  
  With CreateObject("VBScript.RegExp")
    .Pattern = "[^a-zA-Z0-9 ]"
    .IgnoreCase = True
    .Global = True
    RemovePunctuation = .Replace(rts, vbNullString)
  End With
  
End Function

Scott Huish · Jan 30, 2018

Can you post a sample of your data as it would appear with the honorific and one without and how you would like them to appear after the function is run?

Dr. Demento · Feb 2, 2018

Here's a list of inputs/outputs. It contains not only suffixes, honorifics, but also punctuation. Please note, an apostrophe could be (' or `) or a dash could be en dash, em dash, or horizontal bar, so I'm hoping the function would strip any/all punctuation rather than identifying specifics.

O'Malley|Mike --> OMALLEY|MIKE
Smythe-Wilson|John --> SMYTHEWILSON|JOHN
West, Jr.|Phillip --> WEST|PHILLIP
Thomas III|Aaron --> THOMAS|AARON
Adams|Dr. Greg --> ADAMS|GREG

Does that help?

Scott Huish · Feb 19, 2018

Sorry this took so long. After a few attempts at this, this is what I came up with:

Code:

Function FirstLast(r As String) As String
Dim t, u, v As String
t = Split(r, "|")
u = Split(t(0))(0)
t(0) = UCase(u)
u = Split(t(1))
t(1) = UCase(u(UBound(u)))
FirstLast = RemovePunctuation(Join(t, "|"))
End Function

Private Function RemovePunctuation(s As String) As String
With CreateObject("vbscript.regexp")
    .Pattern = "[^A-Z\|]"
    .ignorecase = True
    .Global = True
    RemovePunctuation = .Replace(s, "")
End With
End Function

Excel 2010

A

B

1

O'Malley|Mike

OMALLEY|MIKE

2

Smythe-Wilson|John

SMYTHEWILSON|JOHN

3

West, Jr.|Phillip

WEST|PHILLIP

4

Thomas III|Aaron

THOMAS|AARON

5

Adams|Dr. Greg

ADAMS|GREG

Sheet1

Cell Formulas
Range	Formula
B1	=firstlast(A1)
B2	=firstlast(A2)
B3	=firstlast(A3)
B4	=firstlast(A4)
B5	=firstlast(A5)

Dr. Demento · Feb 22, 2018

Much appreciated, Scott!! That worked great.

I did make one change; if A2 = Smythe Wilson|John (sans hyphen), the function returned SMYTHE|JOHN. My solution was to run it thru the original string (r) thru the RemovePunctuation function PRIOR to splitting it for the first time. It now looks like this:

Code:

Function FirstLast(r As String) As String
Dim t, u, v As String

[B][COLOR=#0000FF]r = RemovePunctuation(r)[/COLOR][/B]
t = Split(r, "|")
u = Split(t(0))(0)
t(0) = UCase(u)
u = Split(t(1))
t(1) = UCase(u(UBound(u)))
FirstLast = RemovePunctuation(Join(t, "|"))
End Function

Thanks again!

Dr. Demento · Feb 22, 2018

Looks like I spoke too soon. I need to determine if there's a space before blindly removing punctuation from the original string (r).

New, new code:

Code:

Function FirstLast(r As String) As String
Dim t, u, v As String

[B][COLOR=#0000ff]if instr(r, " ") = 0 then r = RemovePunctuation(r)[/COLOR][/B]
t = Split(r, "|")
u = Split(t(0))(0)
t(0) = UCase(u)
u = Split(t(1))
t(1) = UCase(u(UBound(u)))
FirstLast = RemovePunctuation(Join(t, "|"))
End Function

Scott Huish · Feb 22, 2018

The problem with that is that it wouldn't be able to differentiate between that and say Thomas III. This is always the problem with trying to parse items that aren't delimited correctly between each piece of data.

Dr. Demento · Feb 22, 2018

I see what you mean.

What do you mean I can't automate EVERYTHING??!!! I don't want to have to think!

*sigh*

Thanks again

Rick Rothstein · Feb 22, 2018

Scott Huish said:
The problem with that is that it wouldn't be able to differentiate between that and say Thomas III. This is always the problem with trying to parse items that aren't delimited correctly between each piece of data.

Worse, it screws up names with natural spaces in them. For example, the first name of one woman I worked with before retirement was Mary Ann with the space... the code Dr. D posted strips out the "Ann" part of her name. Another person I worked with back then had this last name... Della Rosa, again, with the space in it... the code Dr. D posted converts that to Rosa. It would seem that without full and proper delimiters, there can be no splitting of names without some kind of dictionary of names (both first and last) to check against or, perhaps smaller, a dictionary of prefixes and suffixes to be removed)

Dr. Demento · Feb 23, 2018

Thanks for the feedback, Rick.

My main intent for this function is a part of a sub that compares two lists by non-unique identifiers (concatenated names and demographics). In my work, I'll often get two individuals submitted to me with different unique identifiers (SSN) but when you compare name/demographics, they are obviously the same person.

Where it was getting jacked was with those individuals who had spaces, punctuation (O'Malley), suffixes (Brown III or Jones Jr), or honorifics (Dr., PhD - which shouldn't be there at all, but data entry QC is not the greatest). My intent was to delete any punctuation, suffixes, and honorifics (which worked fine), but then those folks with natural spaces in their names gacked it all up. My kludge is to do a manual search for spaces in the name and replace it with a '-' if they had two "real" parts to their name (I'm not working with middle names at all, only first/last). Then, using Scott's function, the full name (sans non-letter characters) provides a concatenated "full" name. The fact I had mashed together their name was only to test for duplicates vs uniques, so it wasn't important to anyone but me.

RegEx criteria - remove punctuation, name honorifics, suffixes

Dr. Demento

Well-known Member

Scott Huish

MrExcel MVP

Dr. Demento

Well-known Member

Scott Huish

MrExcel MVP

Dr. Demento

Well-known Member

Dr. Demento

Well-known Member

Scott Huish

MrExcel MVP

Dr. Demento

Well-known Member

Rick Rothstein

MrExcel MVP

Dr. Demento

Well-known Member

Similar threads

Share this page

RegEx criteria - remove punctuation, name honorifics, suffixes

Well-known Member

MrExcel MVP

Well-known Member

MrExcel MVP

Well-known Member

Well-known Member

MrExcel MVP

Well-known Member

MrExcel MVP

Well-known Member

Similar threads

Share this page

We've detected that you are using an adblocker.

Which adblocker are you using?

Disable AdBlock

Disable AdBlock Plus

Disable uBlock Origin

Disable uBlock