Improve Function to Strip non ASCII chars vba

Nicha · May 11, 2023

I Have a function that usually use to remove special characters from Cells. This function allways worked fine, until I had a problem with the (ª) character - with Unicode [u00AA].
My function uses Regex pattern to exclude "Basic Latin" (\u0000-\u007F) and "Latin-1 Supplement" (\u00C0-\u00FF). This supplment includes the (ª - u00AA) char; but it's not working.

My Regex Pattern :

VBA Code:

"[^\u0000-\u007F\u00C0-\u00FF]"

My Fuction :

VBA Code:

Function StripNonAsciiChars(ByVal InputString As String, Optional Exclude_Latin1_Supplement As Boolean = True) As String
'Orgin::
'https://stackoverflow.com/questions/61336753/regex-to-exclude-non-ascii-but-keep-nordic-characters
'http://www.unicode.org/charts/     -> Unicode 15.0 Character Code Charts

'Exclude_Latin1_Supplement - InputString = "Sem informação disponibilizada"

'\u0020-\u007F - means the characters run from index 32 till index 127 and \u00C0-\u00FF runs from 192 till 255.

'Unicode is a superset of ASCII. However, the ASCII characters are in what is called a "block" not a Unicode category
    Dim RegEx As Object
    Set RegEx = CreateObject("VBScript.RegExp")
    With RegEx
        .Global = True
        .MultiLine = True
        .IgnoreCase = True
        
        Select Case Exclude_Latin1_Supplement
            Case True
                'Includes exception for Latin Characters like çÇ;ãÃ, etc... "Latin-1 Supplement" [u00C0-\u00FF]
                'Added char (ª) - the code u00AA was not recognized. ("[^\u0000-\u007F\u00C0-\u00FF\u00AA]")
                .Pattern = "[^\u0000-\u007F\u00C0-\u00FF\ª]"
            
            Case False
                'Excludes only "Basic Latin (ASCII) - \u0000-\u007F
                .Pattern = "[^\u0000-\u007F]"
        End Select
        
        'StripNonAsciiChars = Application.WorksheetFunction.Trim(RegEx.Replace(InputString, " "))
        StripNonAsciiChars = .Replace(InputString, vbNullString)
    End With
End Function

As you can see below, Works with "[^\u0000-\u007F\u00C0-\u00FF\ª]" and not works with "[^\u0000-\u007F\u00C0-\u00FF\u00AA]". I'm Portuguese and we use both the chars (ª and º). Our Keyboards have the Key shown on the upload image. Can anyone help and tell why those chars included in Unicode List "Latin-1 Supplement", does not work, and wich is the best way to alter my regex.

See the excel outputs:

New Microsoft Excel Worksheet.xlsx

B

C

D

1

Input

Result

Pattern

2

SÉRGIO CARVALHO DOMINGUES & Cª, LIMITADA

"[^\u0000-\u007F\u00C0-\u00FF\ª]"

3

SÉRGIO CARVALHO DOMINGUES & Cª, LIMITADA

SÉRGIO CARVALHO DOMINGUES & C, LIMITADA

"[^\u0000-\u007F\u00C0-\u00FF]"

Sheet2

Can anyone help please?

rondeondo · May 11, 2023

Hi Nicha,
If I'm reading your post correctly, you're keeping characters within two ranges, \u00C0-\u00FF and \u0020-\u007F. Does the additional \u00AA need to be also included as a range, ie \u00AA-\u00AA? I'm not so familure with using regex.
I've done something similar with a slightly different attack where I break the string into individual characters and rebuild the string from characters that are within a given range of ascii codes.
Below is my code, you might be able to adapt it for your situation

VBA Code:

Public Function HarshTrim(inputStr As String, Optional OtherChr As String) As String
'   The result will be the contents of the reference cell with all characters stripped out except 0-9, A-Z and a-z.
'   Use optional arguement OtherChr to keep other characters. Enclose otherChr in double quotes

'   Usage:
'   In the worksheet use in syntax HarshTrim(<Cell reference>)
'   Example: harshTrim(A2)
'   or harshtrim(A2,"./ ][") to remove
' Application.MacroOptions _
'        Macro:="HarshTrim", _
'        Description:="The result will be the contents of the reference cell with all characters stripped out except 0-9, A-Z and a-z.", _
'        Category:="Custom Worksheet Functions", _
'        ArgumentDescriptions:=Array( _
'            "inputStr is the first argument.  The string to be operated on, usually a cell reference", _
'            "OtherChr is an optional argument.  Allows you to specify additional characters to keep")


Dim output As String
Dim i As Integer
Dim s() As Byte

s = StrConv(inputStr, vbFromUnicode)
output = ""
For i = 0 To UBound(s)
    If (s(i) >= 48 And s(i) <= 57) Or _
    (s(i) >= 65 And s(i) <= 90) Or _
    (s(i) >= 97 And s(i) <= 122) Or _
    (InStr(1, OtherChr, Chr(s(i)), vbTextCompare) > 0) Then
        output = output & Chr(s(i))
'    ElseIf Right(output, 1) <> " " Then
'        output = output & " "
    End If
Next
HarshTrim = output
End Function

Domenic · May 11, 2023

The negation ( ^ ) at the beginning of a character set applies to each range of characters that may be specified, along with any single character. So the regular expression "[^\u0000-\u007F\u00C0-\u00FF\u00AA]" matches a single character not present in that list, which is made up of characters in the range \u0000-\u007F, characters in the range \u00C0-\u00FF, AND the character \u00AA. As a result, there's no match for your string. Note, though, the regular expression "[^\u0000-\u007F\u00C0-\u00FF]" would match the character ª in your string, since it's not present in the list.

Nicha · May 12, 2023

Domenic said:
The negation ( ^ ) at the beginning of a character set applies to each range of characters that may be specified, along with any single character. So the regular expression "[^\u0000-\u007F\u00C0-\u00FF\u00AA]" matches a single character not present in that list, which is made up of characters in the range \u0000-\u007F, characters in the range \u00C0-\u00FF, AND the character \u00AA. As a result, there's no match for your string. Note, though, the regular expression "[^\u0000-\u007F\u00C0-\u00FF]" would match the character ª in your string, since it's not present in the list.

Hi Domenic. Thank you in advance. Sorry, but I'm not so familiar with regular expressions, to fully understand the meaning of your answer. Can you, please, explain why that pattern is not working, and what would be the solution? With "\u00AA" the expression still excluding the ª char, when I need to achieve just the opposite.
Can you help me to understand it and made tgis work?

Domenic · May 12, 2023

[^\u0000-\u007F\u00C0-\u00FF\u00AA] means the following:

[ . . . ] - character class to match a single character present in the list

^ - negated character class to match a single character not present in the list

\u0000-\u007F - characters ranging from 0 to 127

\u00C0-\u00FF - characters ranging from 192 to 255

\u00AA - character 170

So the regular expression matches a single character not present in the character class/ list. And since your string contains the character ª , and since the list of characters to be excluded includes \u00AA, the result is that the character ª is not matched.

With [^\u0000-\u007F\u00C0-\u00FF], you'll see that it matches the character ª in your string, since it's not present in the negated character class/list.

Nicha · May 12, 2023

Domenic said:
[^\u0000-\u007F\u00C0-\u00FF\u00AA] means the following:

[ . . . ] - character class to match a single character present in the list

^ - negated character class to match a single character not present in the list

\u0000-\u007F - characters ranging from 0 to 127

\u00C0-\u00FF - characters ranging from 192 to 255

\u00AA - character 170

So the negated character class matches a single character not present in its list. And since your string contains the character ª , and since the list of characters to be excluded includes \u00AA, the result is that the character ª is not matched.

With [^\u0000-\u007F\u00C0-\u00FF], you'll see that it matches the character ª in your string, since it's not present in the negated character class/list.

My best Regards, and thank you very much.

Nicha · May 15, 2023

Hi Domenic. After your explanation I've been doing some tests, and came into some doubts too.
Considering this Link, (ª - char) does not even corresponds to [u00AA] code, like I thought it were, but to [u00A6]. According to same Table, the ç and Ç characters corresponds to [u0087 and u0080] respectively.
Surely, I'm not looking at the right table, cause with

VBA Code:

[^\u0000-\u007F\u00C0-\u00FF\u00AA]

The expression doesn't removes from string (ª;ç or Ç) characters, which is very good, and it's what I wanted.
Can you please give me a link to the "real" table you are considering?

Domenic · May 15, 2023

I'm using the Windows-1252 8-bit ASCII character set...

ASCII table - Table of ASCII codes, characters and symbols

A complete list of all ASCII codes, characters, symbols and signs included in the 7-bit ASCII table and the extended ASCII table according to the Windows-1252 character set, which is a superset of ISO 8859-1 in terms of printable characters.

www.ascii-code.com

Nicha · May 15, 2023

Domenic said:
I'm using the Windows-1252 8-bit ASCII character set...

ASCII table - Table of ASCII codes, characters and symbols

A complete list of all ASCII codes, characters, symbols and signs included in the 7-bit ASCII table and the extended ASCII table according to the Windows-1252 character set, which is a superset of ISO 8859-1 in terms of printable characters.

www.ascii-code.com

Thank you very much.

Domenic · May 15, 2023

You're very welcome!

Cheers!

Improve Function to Strip non ASCII chars vba

Nicha

Board Regular

Attachments

rondeondo

Active Member

Domenic

MrExcel MVP

Nicha

Board Regular

Domenic

MrExcel MVP

Nicha

Board Regular

Nicha

Board Regular

Domenic

MrExcel MVP

ASCII table - Table of ASCII codes, characters and symbols

Nicha

Board Regular

ASCII table - Table of ASCII codes, characters and symbols

Domenic

MrExcel MVP

Share this page

Improve Function to Strip non ASCII chars vba

Board Regular

Attachments

Active Member

MrExcel MVP

Board Regular

MrExcel MVP

Board Regular

Board Regular

MrExcel MVP

Board Regular

MrExcel MVP

Share this page

We've detected that you are using an adblocker.

Which adblocker are you using?

Disable AdBlock

Disable AdBlock Plus

Disable uBlock Origin

Disable uBlock