Improve Function to Strip non ASCII chars vba

Nicha

New Member
Joined
Feb 10, 2023
Messages
48
Office Version
  1. 2016
Platform
  1. Windows
I Have a function that usually use to remove special characters from Cells. This function allways worked fine, until I had a problem with the (ª) character - with Unicode [u00AA].
My function uses Regex pattern to exclude "Basic Latin" (\u0000-\u007F) and "Latin-1 Supplement" (\u00C0-\u00FF). This supplment includes the (ª - u00AA) char; but it's not working.

My Regex Pattern :
VBA Code:
"[^\u0000-\u007F\u00C0-\u00FF]"

My Fuction :
VBA Code:
Function StripNonAsciiChars(ByVal InputString As String, Optional Exclude_Latin1_Supplement As Boolean = True) As String
'Orgin::
'https://stackoverflow.com/questions/61336753/regex-to-exclude-non-ascii-but-keep-nordic-characters
'http://www.unicode.org/charts/     -> Unicode 15.0 Character Code Charts

'Exclude_Latin1_Supplement - InputString = "Sem informação disponibilizada"

'\u0020-\u007F - means the characters run from index 32 till index 127 and \u00C0-\u00FF runs from 192 till 255.

'Unicode is a superset of ASCII. However, the ASCII characters are in what is called a "block" not a Unicode category
    Dim RegEx As Object
    Set RegEx = CreateObject("VBScript.RegExp")
    With RegEx
        .Global = True
        .MultiLine = True
        .IgnoreCase = True
        
        Select Case Exclude_Latin1_Supplement
            Case True
                'Includes exception for Latin Characters like çÇ;ãÃ, etc... "Latin-1 Supplement" [u00C0-\u00FF]
                'Added char (ª) - the code u00AA was not recognized. ("[^\u0000-\u007F\u00C0-\u00FF\u00AA]")
                .Pattern = "[^\u0000-\u007F\u00C0-\u00FF\ª]"
            
            Case False
                'Excludes only "Basic Latin (ASCII) - \u0000-\u007F
                .Pattern = "[^\u0000-\u007F]"
        End Select
        
        'StripNonAsciiChars = Application.WorksheetFunction.Trim(RegEx.Replace(InputString, " "))
        StripNonAsciiChars = .Replace(InputString, vbNullString)
    End With
End Function


As you can see below, Works with "[^\u0000-\u007F\u00C0-\u00FF\ª]" and not works with "[^\u0000-\u007F\u00C0-\u00FF\u00AA]". I'm Portuguese and we use both the chars (ª and º). Our Keyboards have the Key shown on the upload image. Can anyone help and tell why those chars included in Unicode List "Latin-1 Supplement", does not work, and wich is the best way to alter my regex.

See the excel outputs:

New Microsoft Excel Worksheet.xlsx
BCD
1Input ResultPattern
2SÉRGIO CARVALHO DOMINGUES & Cª, LIMITADASÉRGIO CARVALHO DOMINGUES & Cª, LIMITADA"[^\u0000-\u007F\u00C0-\u00FF\ª]"
3SÉRGIO CARVALHO DOMINGUES & Cª, LIMITADASÉRGIO CARVALHO DOMINGUES & C, LIMITADA"[^\u0000-\u007F\u00C0-\u00FF]"
Sheet2



Can anyone help please?
 

Attachments

  • Key1.jpg
    Key1.jpg
    111.3 KB · Views: 10

Excel Facts

Which came first: VisiCalc or Lotus 1-2-3?
Dan Bricklin and Bob Frankston debuted VisiCalc in 1979 as a Visible Calculator. Lotus 1-2-3 debuted in the early 1980's, from Mitch Kapor.
Hi Nicha,
If I'm reading your post correctly, you're keeping characters within two ranges, \u00C0-\u00FF and \u0020-\u007F. Does the additional \u00AA need to be also included as a range, ie \u00AA-\u00AA? I'm not so familure with using regex.
I've done something similar with a slightly different attack where I break the string into individual characters and rebuild the string from characters that are within a given range of ascii codes.
Below is my code, you might be able to adapt it for your situation
VBA Code:
Public Function HarshTrim(inputStr As String, Optional OtherChr As String) As String
'   The result will be the contents of the reference cell with all characters stripped out except 0-9, A-Z and a-z.
'   Use optional arguement OtherChr to keep other characters. Enclose otherChr in double quotes

'   Usage:
'   In the worksheet use in syntax HarshTrim(<Cell reference>)
'   Example: harshTrim(A2)
'   or harshtrim(A2,"./ ][") to remove
' Application.MacroOptions _
'        Macro:="HarshTrim", _
'        Description:="The result will be the contents of the reference cell with all characters stripped out except 0-9, A-Z and a-z.", _
'        Category:="Custom Worksheet Functions", _
'        ArgumentDescriptions:=Array( _
'            "inputStr is the first argument.  The string to be operated on, usually a cell reference", _
'            "OtherChr is an optional argument.  Allows you to specify additional characters to keep")


Dim output As String
Dim i As Integer
Dim s() As Byte

s = StrConv(inputStr, vbFromUnicode)
output = ""
For i = 0 To UBound(s)
    If (s(i) >= 48 And s(i) <= 57) Or _
    (s(i) >= 65 And s(i) <= 90) Or _
    (s(i) >= 97 And s(i) <= 122) Or _
    (InStr(1, OtherChr, Chr(s(i)), vbTextCompare) > 0) Then
        output = output & Chr(s(i))
'    ElseIf Right(output, 1) <> " " Then
'        output = output & " "
    End If
Next
HarshTrim = output
End Function
 
Upvote 0
The negation ( ^ ) at the beginning of a character set applies to each range of characters that may be specified, along with any single character. So the regular expression "[^\u0000-\u007F\u00C0-\u00FF\u00AA]" matches a single character not present in that list, which is made up of characters in the range \u0000-\u007F, characters in the range \u00C0-\u00FF, AND the character \u00AA. As a result, there's no match for your string. Note, though, the regular expression "[^\u0000-\u007F\u00C0-\u00FF]" would match the character ª in your string, since it's not present in the list.
 
Upvote 0
The negation ( ^ ) at the beginning of a character set applies to each range of characters that may be specified, along with any single character. So the regular expression "[^\u0000-\u007F\u00C0-\u00FF\u00AA]" matches a single character not present in that list, which is made up of characters in the range \u0000-\u007F, characters in the range \u00C0-\u00FF, AND the character \u00AA. As a result, there's no match for your string. Note, though, the regular expression "[^\u0000-\u007F\u00C0-\u00FF]" would match the character ª in your string, since it's not present in the list.
Hi Domenic. Thank you in advance. Sorry, but I'm not so familiar with regular expressions, to fully understand the meaning of your answer. Can you, please, explain why that pattern is not working, and what would be the solution? With "\u00AA" the expression still excluding the ª char, when I need to achieve just the opposite.
Can you help me to understand it and made tgis work?
 
Upvote 0
[^\u0000-\u007F\u00C0-\u00FF\u00AA] means the following:

[ . . . ] - character class to match a single character present in the list

^ - negated character class to match a single character not present in the list

\u0000-\u007F - characters ranging from 0 to 127

\u00C0-\u00FF - characters ranging from 192 to 255

\u00AA - character 170

So the regular expression matches a single character not present in the character class/ list. And since your string contains the character ª , and since the list of characters to be excluded includes \u00AA, the result is that the character ª is not matched.

With [^\u0000-\u007F\u00C0-\u00FF], you'll see that it matches the character ª in your string, since it's not present in the negated character class/list.
 
Last edited:
Upvote 0
[^\u0000-\u007F\u00C0-\u00FF\u00AA] means the following:

[ . . . ] - character class to match a single character present in the list

^ - negated character class to match a single character not present in the list

\u0000-\u007F - characters ranging from 0 to 127

\u00C0-\u00FF - characters ranging from 192 to 255

\u00AA - character 170

So the negated character class matches a single character not present in its list. And since your string contains the character ª , and since the list of characters to be excluded includes \u00AA, the result is that the character ª is not matched.

With [^\u0000-\u007F\u00C0-\u00FF], you'll see that it matches the character ª in your string, since it's not present in the negated character class/list.
My best Regards, and thank you very much.
 
Upvote 0
Hi Domenic. After your explanation I've been doing some tests, and came into some doubts too.
Considering this Link, (ª - char) does not even corresponds to [u00AA] code, like I thought it were, but to [u00A6]. According to same Table, the ç and Ç characters corresponds to [u0087 and u0080] respectively.
Surely, I'm not looking at the right table, cause with
VBA Code:
[^\u0000-\u007F\u00C0-\u00FF\u00AA]
The expression doesn't removes from string (ª;ç or Ç) characters, which is very good, and it's what I wanted.
Can you please give me a link to the "real" table you are considering?
 
Upvote 0
I'm using the Windows-1252 8-bit ASCII character set...

 
Upvote 0
Solution
Upvote 0

Forum statistics

Threads
1,223,214
Messages
6,170,771
Members
452,353
Latest member
strainu

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top