clean data - Removing Unwanted characters

John Caines

Well-known Member
Joined
Aug 28, 2006
Messages
1,155
Office Version
  1. 2019
Platform
  1. Windows
Hello All.
I have a problem.:-(
I have an excel sheet with approx 30,000 rows of data.
These are all keyword phrases.
They can be related to any subject, but for this example these are related to the root keyword phrase of "car Rent"

I have some software which basically pulls in keyword phrases from search engines and meta tags etc, including misspelled keyword phrases.

My problem;
The data is uncleaned.
In other words there might be I think it's called "Carriage return" data in there, so the row of data might be very deep (Instead of a row height of say 10.5 it could be anything, IE some could be 100 or 200 even).
There are unwanted characters, for example; ()[]{}+?!""^*

(If it could delete all unwanted characters except for letters/digits)
There is a problem I see, that if it removes _ or - between words, that it will join the words together which won't be of any use. If it deletes anything with a letter either side of it or a letter and digit, or 2 digits, 1 either side it would then need to add a space to replace the hyphen.

So for example; if there was a phrase in the list like
car_for rent
if it just removed the underscore, then the phrase would be
carfor rent
Which isn't correct. It would need to replace the underscore with a space.
I hope I'm making sense here:-)
So basically I'd love to have if possible a macro button that runs through my entire column of data,
(Always in Column A , on a sheet called "AllKWs", and always starting from row3 downwards.)

If it could go through the list and delete all unwanted characters including double spaces.
So the end result is a keyword phrase list without a lot of junk basically.
After it's gone through the list I suppose it needs to then look at what's left and delete any duplicate phrases last (As once some of these unwanted characters are removed, the keyword list may have duplicates).

Once all this is completed, can a pop up window appear saying something like;
=======================================

Starting No. Phrases: 29,745
Finishing No. Phrases: 29,722
No.Deleted Characters: 12,345
No.Deleted Carriage Returns: 234
No.Deleted Spaces: 235
No.Deleted Duplicates: 23

Time Elapsed: 7.78seconds
======================================
I think that's about it:-)
I really hope someone can help me out on this 1.
I can't write this for sure.
Out of my league I'm afraid:-(
I hope it is possible as this would be very very useful for me.
Maybe it isn't possible as it is quite complicated.
If someone can have a look at it for me and have a go that would be brilliant


Thanks for your time.
Many Thanks
John Caines
 
Please try this

Code:
Sub strp()
Dim lastrow As Long, X As Variant, i As Long, j As Integer, r As Range, iRow As Long
Dim A As Integer, t As Date, Replaced As Variant, Spaces As Variant
Dim Returns As Variant, nDelRow As Long
t = Now()
Replaced = 0
Spaces = 0
Returns = 0
lastrow = Cells(Rows.Count, 1).End(xlUp).Row
Application.ScreenUpdating = False
Application.Cursor = xlWait
Set r = Range(Cells(3, 1), Cells(lastrow, 1))
For i = 3 To lastrow
    Application.StatusBar = Format(100 * i / (lastrow - 3), "0") & "% done..."
    X = Cells(i, 1).Value
    For j = 1 To Len(X)
        A = Asc(Mid(X, j, 1))
        Select Case A
            Case Is < 32
                Cells(i, 1).Replace what:=Chr(A), replacement:=Chr(32)
                Replaced = Replaced + 1
                If A = 10 Or A = 13 Then Returns = Returns + 1
            Case 33 To 47
                Cells(i, 1).Replace what:="~" & Chr(A), replacement:=Chr(32)
                Replaced = Replaced + 1
            Case 58 To 64
                Cells(i, 1).Replace what:="~" & Chr(A), replacement:=Chr(32)
                Replaced = Replaced + 1
            Case 91 To 96
                Cells(i, 1).Replace what:=Chr(A), replacement:=Chr(32)
                Replaced = Replaced + 1
            Case Is > 122
                Cells(i, 1).Replace what:=Chr(A), replacement:=Chr(32)
                Replaced = Replaced + 1
        End Select
    Next j
    X = Cells(i, 1).Value
    Cells(i, 1).Value = Application.Trim(X)
    Spaces = Spaces + Len(X) - Len(Cells(i, 1).Value)
    Rows(i).EntireRow.AutoFit
Next i
Application.ScreenUpdating = True
Application.Cursor = xlNormal
Application.StatusBar = False
Spaces = Spaces - Replaced
Replaced = Replaced - Returns
r.Sort Key1:=Range("A3"), Order1:=xlAscending, Header:=xlGuess
For iRow = r.Rows.Count To 3 Step -1
    If r(iRow) = r(iRow - 1) Then
        r(iRow).EntireRow.Delete
        nDelRow = nDelRow + 1
    End If
Next iRow
MsgBox ("Number of phrases: " & vbTab & vbTab & lastrow - 2 _
& vbCrLf & "Deleted characters: " & vbTab & Replaced _
& vbCrLf & "Deleted carriage returns: " & vbTab & Returns _
& vbCrLf & "Deleted spaces: " & vbTab & vbTab & Spaces _
& vbCrLf & "Deleted rows: " & vbTab & vbTab & nDelRow _
& vbCrLf & "Time elapsed: " & vbTab & vbTab & Format(Now() - t, "hh:mm:ss"))
End Sub

I don't really understand the problem with '.' - does this only occur with hyperlinks? And what do you want to do with them?
 
Upvote 0

Excel Facts

Difference between two dates
Secret function! Use =DATEDIF(A2,B2,"Y")&" years"&=DATEDIF(A2,B2,"YM")&" months"&=DATEDIF(A2,B2,"MD")&" days"
Just testing Now VoGII

Just testing your Macro Now VoGII,,
Don't know quite what was happening then.
After almost a minute of running, all row 2 (bar A2 started flashing, and I have a long header jpeg graphic in Row 1, (From ColA to ColE) that was flashing like mad , quickly on and off???

Just finished running now
It's just returned the results VogII.
It's taken twice as long, but I see it says it has deleted 318 duplicates.

I can see another character not wanted that is returned VoGII.
===
~
===
I need to check VoGII what it's deleted.

It's looking FANTASTIC!!! Though VoGII.
I've just sorted by your character Macro, and all the rubbish characters I mentioned earlier are gone!!
Brilliant.:-)
Just that 1 I mention I think that has escaped the VoGII Father of All Macros!:-)

Really really great VoGII.
If I had to be critical though, the only thing I can think of is it's speed. I can tell there is a hell of a lot of number crunching going on there.
But how does your macro deletes duplicates VoGII.?
In excel, manually doing this is quite quick.
Is this the same formula you've added to the end of your macro VoGII.?

As you know,, I don't know what I'm talking about regarding VB language etc.
Just wondered if it could be speeded up a bit that was all, as it seems deleting the duplicates has made it run twice as long,
Here's the stats VoGII
======================
Number of Phrases:43697
Deleted Characters:3324
Deleted Carriage Returns:7366
Deleted Spaces:6670
Time Elapsed: 00:02:11
======================
99% there VoGII:-)
Just the ~ and the speed if it can be adjusted.
But as to what it's doing, it's really amazing.
A clean list:-)
A huge time saver for sure
Many Thanks VoGII
John Caines
 
Upvote 0
just trying to help VoGII

I expect this is no good and very basic, but I did find this formula for deleteing I believe duplicates.
I don't know ifn it's any better or quicker VoGII,
Actually, it might be a complete waste of time,
but I thought I'd post it anyway VoGII.
Here it is
Code:
 Sub FixDuplicateRows()
Dim RowNdx As Long
Dim ColNum As Integer
ColNum = Selection(1).Column
For RowNdx = Selection(Selection.Cells.Count).Row To _ 
        Selection(1).Row + 1 Step -1
    If Cells(RowNdx, ColNum).Value = Cells(RowNdx - 1, ColNum).Value Then
        Cells(RowNdx, ColNum).Value = "----"
    End If
Next RowNdx
End Sub
Hope it might be of some use VoGII
Many Thanks
John Caines
 
Upvote 0
Hello,

Try this code which uses Regular Expressions. It matches anything that is not a-z, A-Z and 0-9 and replaces it with a space. It then converts multiple whitespaces to a single space. \s is the Regular Expression symbol for whitespace which includes tabs, newlines, spaces, etc so I used that instead of spaces since it works just as well.

Let me know if this works and then we can work on some of the other things you are looking to do. Good Luck!
Code:
Option Explicit

' Note: Requires a reference to Microsoft VBScript Regular Expressions 5.5

' \s{2,}
'
' Match a single character that is a "whitespace character" (spaces, tabs, line breaks, etc.) «\s{2,}»
'    Between 2 and unlimited times, as many times as possible, giving back as needed (greedy) «{2,}»
Private Const m_REGEX_PATTERN_MULTIPLE_WHITESPACE As String = "\s{2,}"

' [^a-zA-Z0-9]+
'
' Match a single character NOT present in the list below «[^a-zA-Z0-9]+»
'    Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
'    A character in the range between "a" and "z" «a-z»
'    A character in the range between "A" and "Z" «A-Z»
'    A character in the range between "0" and "9" «0-9»
Private Const m_REGEX_PATTERN_ALPHANUMERIC_ONLY As String = "[^a-zA-Z0-9]+"

Public Sub CleanData()
    On Error Resume Next
    Dim regEx As New RegExp
    Dim strValue As String
    Dim lngRow As Long, lngEndRow As Long
    With regEx
        .MultiLine = True
        .Global = True
    End With
    With ActiveWorkbook.Sheets("AllKWS")
        lngEndRow = .Cells(.Rows.Count, "A").End(xlUp).Row
        For lngRow = 3 To lngEndRow
            regEx.Pattern = m_REGEX_PATTERN_ALPHANUMERIC_ONLY
            ' Are there any non alphanumeric characters in the current cell.
            If regEx.Test(.Cells(lngRow, "A").Value) Then
                ' Replace all non alphanumeric characters with a space.
                strValue = regEx.Replace(.Cells(lngRow, "A").Value, " ")
                regEx.Pattern = m_REGEX_PATTERN_MULTIPLE_WHITESPACE
                If regEx.Test(strValue) Then
                    ' Convert all multiple whitespaces to a single space.
                    strValue = Trim$(regEx.Replace(strValue, " "))
                End If
                .Cells(lngRow, "B").Value = strValue
            End If
        Next lngRow
    End With
    Set regEx = Nothing
    If Err.Number Then Err.Clear
    On Error GoTo 0
End Sub 'CleanData
 
Upvote 0
Thanks sbendbuckeye

Thanks for the reply sbendbuckeye,,

Unfortunately I've just tried running the macro.
I got an error.
"Compile error"
User-defined type not defined
Highlighted in yellow is the following line of data;
Code:
Public Sub CleanData()
Also this is highlighted in grey
Code:
regEx As New RegExp

I hope this makes sense sbendbuckeye.
If you need my actual worksheet with keyword data list inserted, I can email it you no problem if you PM me sbendbuckeye

Many thanks for your help on this sbendbuckeye
All the best
John Caines
 
Upvote 0
another formula

Hello All,
I'm just posting this now.
I haven't tried it yet.
It's gone 2am and I need to get to sleep.
I'm knackered.:-)
Just wanted to post the following which someone has sent me as it might help.
Again,
I haven't tried it yet.
Here it is;
Code:
Sub CleanEm() 
    Dim r As Range 
    Dim iRow As Long, lRow As Long 
    Dim s As String 
    Dim iChr As Integer 
    Dim nDelChr As Long, nDelCR As Long, nDelRow As Long, nDelSp As Long 
    Const sFmt As String = "#,##0" 
     
    lRow = Range("A65536").End(xlUp).Row 
    Set r = Range(Cells(3, 1), Cells(lRow, 1)) 
     
    For iRow = 1 To r.Rows.Count 
        s = r(iRow) 
        For iChr = 1 To Len(s) 
            Select Case Mid(s, iChr, 1) 
            Case "A" To "Z", "a" To "z", "0" To "9" 
                 ' do nothing
            Case vbLf ' replace with space
                s = Left(s, iChr - 1) & " " & Mid(s, iChr + 1) 
                nDelCR = nDelCR + 1 
            Case Else ' not alphanumeric - replace with space
                s = Left(s, iChr - 1) & " " & Mid(s, iChr + 1) 
                nDelChr = nDelChr + 1 
            End Select 
        Next 
        r(iRow) = Trim(s) ' remove leading, trailing, and multiple spaces
        nDelSp = nDelSp + Len(s) - Len(r(iRow)) 
    Next 
     
     ' sort to locate and delete duplicates
    With ActiveWorkbook.Worksheets(ActiveSheet.Name).Sort 
        .SortFields.Clear 
        .SortFields.Add Key:=r, _ 
        SortOn:=xlSortOnValues, Order:=xlAscending, DataOption:=xlSortNormal 
        .SetRange r 
        .Header = xlNo 
        .MatchCase = False 
        .Orientation = xlTopToBottom 
        .SortMethod = xlPinYin 
        .Apply 
    End With 
     
    For iRow = r.Rows.Count To 2 Step -1 
        If r(iRow) = r(iRow - 1) Then 
            r(iRow).EntireRow.Delete 
            nDelRow = nDelRow + 1 
        End If 
    Next 
     
    MsgBox "Starting No Phrases:      " & Format(lRow - 3 + 1, sFmt) & vbLf _ 
    & "Finishing No. Phrases:    " & Format(r.Rows.Count, sFmt) & vbLf _ 
    & "Deleted Characters:       " & Format(nDelChr, sFmt) & vbLf _ 
    & "Deleted Carriage Returns: " & Format(nDelCR, sFmt) & vbLf _ 
    & "Deleted Spaces:           " & Format(nDelSp, sFmt) & vbLf _ 
    & "Deleted Duplicates:       " & Format(nDelRow, sFmt) 
End Sub

Goodnight all.
Many Thanks
John Caines
 
Upvote 0
compatability

hello all,

Just as a note,
I'm using excel2007.
My worsheet is saved as an "xlsm" (Macro enabled) file extention.

I only mention this as there seems to be some compatability problems between 2003 and 2007 excel.

Just thought it best to mention this that's all.
Many Thanks
John Caines
 
Upvote 0
Update

Thanks for the PM sbendbuckeye.
I will try it in a short while.

Many thanks to VoGII on this also.
I have received this from SHG.
I think it's designed with Excel 2007 in mind??
I'll post it now.
I ran it.
It's similar to VoGII's macro, as in how it seemed to run, but gave completely different results???
I think VoGII's is more accurate, but I'll have to try running them several times and figure out what they are deleting differently.
I did try VoGII's several times last night and it seemed to sort perfectly bar the ~ character.
I'll attach this new formula now anyway, as it might help things.
Maybe someone can tell me why they do give different results as I'm no coder for sure:-(
Here it is anyway:
Code:
Sub CleanEm() 
    Dim r As Range 
    Dim iRow As Long, lRow As Long 
    Dim s As String 
    Dim iChr As Integer 
    Dim nDelChr As Long, nDelCR As Long, nDelRow As Long, nDelSp As Long 
    Const sFmt As String = "#,##0" 
     
    lRow = Range("A65536").End(xlUp).Row 
    Set r = Range(Cells(3, 1), Cells(lRow, 1)) 
     
    For iRow = 1 To r.Rows.Count 
        s = r(iRow) 
        For iChr = 1 To Len(s) 
            Select Case Mid(s, iChr, 1) 
            Case "A" To "Z", "a" To "z", "0" To "9" 
                 ' do nothing
            Case vbLf ' replace with space
                s = Left(s, iChr - 1) & " " & Mid(s, iChr + 1) 
                nDelCR = nDelCR + 1 
            Case Else ' not alphanumeric - replace with space
                s = Left(s, iChr - 1) & " " & Mid(s, iChr + 1) 
                nDelChr = nDelChr + 1 
            End Select 
        Next 
        r(iRow) = Trim(s) ' remove leading, trailing, and multiple spaces
        nDelSp = nDelSp + Len(s) - Len(r(iRow)) 
    Next 
     
     ' sort to locate and delete duplicates
    With ActiveWorkbook.Worksheets(ActiveSheet.Name).Sort 
        .SortFields.Clear 
        .SortFields.Add Key:=r, _ 
        SortOn:=xlSortOnValues, Order:=xlAscending, DataOption:=xlSortNormal 
        .SetRange r 
        .Header = xlNo 
        .MatchCase = False 
        .Orientation = xlTopToBottom 
        .SortMethod = xlPinYin 
        .Apply 
    End With 
     
    For iRow = r.Rows.Count To 2 Step -1 
        If r(iRow) = r(iRow - 1) Then 
            r(iRow).EntireRow.Delete 
            nDelRow = nDelRow + 1 
        End If 
    Next 
     
    MsgBox "Starting No Phrases:      " & Format(lRow - 3 + 1, sFmt) & vbLf _ 
    & "Finishing No. Phrases:    " & Format(r.Rows.Count, sFmt) & vbLf _ 
    & "Deleted Characters:       " & Format(nDelChr, sFmt) & vbLf _ 
    & "Deleted Carriage Returns: " & Format(nDelCR, sFmt) & vbLf _ 
    & "Deleted Spaces:           " & Format(nDelSp, sFmt) & vbLf _ 
    & "Deleted Duplicates:       " & Format(nDelRow, sFmt) 
End Sub

Hope this might be of some use.
By the way,,,,
forgot to mention the different results:-)
here they are anyway.
Firstly VoGII's Macro (Both of these results have been ran on exactly the same keyword list;
=========================
Number of Phrases: 43,697
Deleted Characters: 3324
Deleted Carriage Returns: 7366
Deleted Spaces: 6670
Deleted Rows: 318
Time Elapsed 00:02:10
=========================
SHG's Formula;
========================
Starting No Phrases: 43,697
Finishing No. Phrases: 43,547
Deleted Characters: 132,297
Deleted Carriage Returns: 3,683
Deleted Spaces: 488
Deleted Duplicates: 150
========================

There is a big difference between the 2's deleted spaces and Deleted No. Characters.

I'll have to run these a few times and try and see what exactly they are doing differently.
If anyone can decipher this ,,,please,,, I'm all ears:-)

Many Thanks everyone,
Especially VoGII, sbendbuckeye, and SHG.

Many Thanks
John Caines
 
Upvote 0
sbendbuckeye-didn't work??

Thanks for the PM sbendbuckeye.
Regarding the formula you wrote,

You told me to;
====================
To use regular expressions, you have to set a reference to Microsoft VBScript Regular Expressions 5.x. To do this in the VB Editor, click Tools, References and then scroll down the list until you find the above dll. Verions 1.0 will not work, it needs to be 5.x.
=====================
I done this and tried to run your macro, but it didn't run:-(

I'll try and send you the sheet like you told me.

Many thanks for your help on this
(By the way, I'll have to look up on the internet what "Regular Expressions" actually are,, as this code of yours seems completely different to all the other codes??
A kind of VB code within a code?? Looks complicated:-))
Many Thanks
John Caines
 
Upvote 0
Regular Expressions is a pattern matching language supported by many other programming languages. It is very powerful and also very complex. You can zip up some sample data and send it to me if you would like. Just remove the NO_SPAM from my signature line. Good Luck!

The best regular expressions tool I know of is RegExBuddy at www.RegExBuddy.com. Here is a tutorial from the same site: http://www.regular-expressions.info/tutorial.html
 
Upvote 0

Forum statistics

Threads
1,224,823
Messages
6,181,181
Members
453,022
Latest member
Mohamed Magdi Tawfiq Emam

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top