Count Word and Random Phrase Frequency

Sleeplol

Board Regular
Joined
Apr 10, 2019
Messages
194
Office Version
  1. 365
  2. 2019
Platform
  1. Windows
Hello Everyone,
I'm attempting to count the word frequency as well as two and three word phrases from a column. I'm stuck and would really appreciate some help.

Getting error:
Run-time error 1004.png


On this line:
ErrorLastLine.png


Setup:
A macro populates column A with values from AP101:AP50000 (formulas combining additional row cells into a single cell).
(Problem) Macro populates the count of B,E,H

Runs fine though throttles for about 2 min for 30000ish rows; then bugs out after populating the results.
I've tried referencing the sheet with Worksheets("DashTest").Range on all three, but it still errors.

Below is a quick shot of what it looks like, as well as the entire script.
Thanks for any help

WordCountDisplay.png



VBA Code:
Sub test()
Dim a, e, s, Ignore As String, temp, x
Dim dic As Object, dic2 As Object, dic3 As Object
Set dic = CreateObject("Scripting.Dictionary")
dic.CompareMode = 1
Set dic2 = CreateObject("Scripting.Dictionary")
dic2.CompareMode = 1
Set dic3 = CreateObject("Scripting.Dictionary")
dic3.CompareMode = 1
With Range("m1").CurrentRegion.Offset(1)
Ignore = Join(Application.Transpose(.Resize(.Rows.Count - 1).Value), Chr(2))
End With
a = Range("a1", Range("a" & Rows.Count).End(xlUp)).Value
With CreateObject("VBScript.RegExp")
.Global = True: .IgnoreCase = True
.Pattern = "([\$\(\)\-\^\|\\\[\]\*\+\?\.])"
Ignore = "[^\w ]|\b(" & Replace(.Replace(Ignore, "\$1"), Chr(2), "|") & ")\b"
For Each e In a
If e <> "" Then
x = GetSentense(e, 2)
If IsArray(x) Then
For Each s In x
dic2(s) = dic2(s) + 1
Next
End If
x = GetSentense(e, 3)
If IsArray(x) Then
For Each s In x
If s <> "" Then dic3(s) = dic3(s) + 1
Next
End If
.Pattern = Ignore
temp = Application.Trim(.Replace(e, ""))
For Each s In Split(temp)
If s <> "" Then dic(s) = dic(s) + 1
Next
End If
Next
End With
Range("b2").Resize(dic.Count, 2).Value = _
Application.Transpose(Array(dic.keys, dic.items))
Range("e2").Resize(dic2.Count, 2).Value = _
Application.Transpose(Array(dic2.keys, dic2.items))
Range("h2").Resize(dic3.Count, 2).Value = _
Application.Transpose(Array(dic3.keys, dic3.items))
End Sub
Function GetSentense(ByVal txt As String, myStep)
Dim i As Long, ii As Long, temp, x
On Error Resume Next
x = Split(txt): ReDim temp(UBound(x) - myStep)
If Err Then GetSentense = Empty: Exit Function
On Error GoTo 0
For i = 0 To UBound(x) - myStep - 1
For ii = 1 To myStep
temp(i) = Trim$(temp(i) & " " & x(i + ii - 1))
Next
Next
GetSentense = temp
End Function
 
EDIT: sorry, didn't know there's a new reply
 
Upvote 0

Excel Facts

Waterfall charts in Excel?
Office 365 customers have access to Waterfall charts since late 2016. They were added to Excel 2019.
I would "paste" the items into A1 of worksheet "Frequency" , then need to run the macro from worksheet "Dash".
Try this:
Data in worksheet "Frequency", you can run the macro from any sheet.
VBA Code:
Sub regexPhraseFrequency2()
'generate word/phrase frequency
'Data must be in column A, start at A1
'sNumber = "1,2,3"  means it generates 3 frequency list: single word & 2 word phrase & 3 word phrase
'tested on text (from a novel) with 161K words with sNumber = "1,2,3", it took 10.5 seconds


Dim i As Long
Dim sNumber As String, txa As String
Dim z, T
Dim obj As New DataObject

T = Timer
Application.ScreenUpdating = False

With Sheets("Frequency")
.Range("C:Z").ClearContents

txa = Join(Application.Transpose(.Range("A1", .Cells(.Rows.Count, "A").End(xlUp))), vbLf)

'--- CHANGE sNumber VALUE TO SUIT -----------------------------------
sNumber = "1,2,3"   'list each number of words in a phrase, change to suit
                    'sNumber = "1"  will generate 1 frequency list of single word
                    'sNumber = "1,2"  will generate 2 frequency list: single word & 2 word phrase

z = Split(sNumber, ",")
  
    For i = LBound(z) To UBound(z)
        Call toProcess2(CLng(z(i)), txa)
    Next

.Range("C:Z").Columns.AutoFit
End With
Application.ScreenUpdating = True

Debug.Print Timer - T

End Sub


Sub toProcess2(n As Long, tx As String)
'phrase frequency

Dim regEx As Object, matches As Object, x As Object, d As Object
Dim i As Long, rc As Long
Dim va, q
Dim txb As String

    tx = Replace(tx, "'", "___") 'replace apostrophe with "___", so it will match pattern "\w+"
    
        Set regEx = CreateObject("VBScript.RegExp")
        With regEx
            .Global = True
            .MultiLine = True
            .IgnoreCase = True
        End With
        
        
If n > 1 Then
        
        regEx.Pattern = "( ){2,}"
    
        If regEx.test(tx) Then
           tx = regEx.Replace(tx, " ") 'remove excessive space
        End If
        
        tx = Trim(tx)
        tx = Replace(tx, " ", "_____") 'replace space with "_____", , so it won't match pattern "\W+"
 
        regEx.Pattern = "\W+" 'non words character, Matches any character that is not
                              'a word character (alphanumeric & underscore). Equivalent to [^A-Za-z0-9_]
   
        If regEx.test(tx) Then
           tx = regEx.Replace(tx, vbLf) 'replace non words character with new line char
        End If
        
        tx = Replace(tx, "_____", " ") ' change it back to space
        tx = Replace(tx, vbLf & " ", vbLf & "") 'remove space in the beginning of every line
        
End If

    
    Set d = CreateObject("scripting.dictionary")
    d.CompareMode = vbTextCompare

    regEx.Pattern = Trim(WorksheetFunction.Rept("\w+ ", n)) 'match n words (the phrase) separated by a space
        
            Set matches = regEx.Execute(tx)
            
            For Each x In matches
                d(CStr(x)) = d(CStr(x)) + 1 'get phrase frequency
            Next
 
 For i = 1 To n - 1
        
        regEx.Pattern = "^\w+ "
        txb = tx
        If regEx.test(txb) Then
           txb = regEx.Replace(txb, "")   'remove first word in each line to get different combination of n words (phrase)
           
            regEx.Pattern = Trim(WorksheetFunction.Rept("\w+ ", n))
            Set matches = regEx.Execute(txb)
            
            For Each x In matches
                d(CStr(x)) = d(CStr(x)) + 1     'get phrase frequency
            Next

        End If
 
Next

If d.Count = 0 Then MsgBox "Nothing with " & n & " word phrase found": Exit Sub

'Range("D:E").ClearContents
With Sheets("Frequency")
rc = .Cells(1, .Columns.Count).End(xlToLeft).Column
'put the result in col D:E

With .Cells(2, rc + 2).Resize(d.Count, 2)
    
    If d.Count < 65536 Then 'Transpose function has a limit of 65536 item to process
        
        .Value = Application.Transpose(Array(d.Keys, d.items))
        
    Else
        
        ReDim va(1 To d.Count, 1 To 2)
        i = 0
            For Each q In d.Keys
                i = i + 1
                va(i, 1) = q: va(i, 2) = d(q)
            Next
        .Value = va
        
    End If
    
    'get the apostrophe back
    .Replace What:="___", Replacement:="'", lookAt:=xlPart, SearchFormat:=False, ReplaceFormat:=False
    .Sort Key1:=.Cells(1, 2), Order1:=xlDescending, Key2:=.Cells(1, 1), Order2:=xlAscending, Header:=xlNo
End With
    
.Cells(1, rc + 2) = n & " WORD"
.Cells(1, rc + 3) = "FREQUENCY"
End With
End Sub
 
Upvote 0
@Sleeplol

Sorry, I found that the codes in post 6# & post 12# are flawed.
Here's the revised code:
revised code for post 6#:
VBA Code:
Sub regexPhraseFrequency1()
'The code will generate word/phrase frequency
'Data must be in column A, start at A1
'Data can't be more than 65536 rows, because it's using Application.Transpose.
'sNumber = "1,2,3"  means it generates 3 frequency list: single word & 2 word phrase & 3 word phrase, you can change that to suit.
'Word with apostrophe such as "you're" is counted as one word.
'Word with underscore such as "aa_bb" is counted as one.
'Tested on text (from a novel) with 16.500 rows, contains 161K words (12600 unique words) with sNumber = "1,2,3", it took 10.5 seconds


Dim i As Long
Dim sNumber As String, txa As String
Dim z, T
Dim obj As New DataObject

T = Timer
Application.ScreenUpdating = False
Range("C:Z").ClearContents

txa = Join(Application.Transpose(Range("A1", Cells(Rows.Count, "A").End(xlUp))), vbLf)

'--- CHANGE sNumber VALUE TO SUIT -----------------------------------
sNumber = "1,2,3"   'list each number of words in a phrase, change to suit
                    'sNumber = "1"  will generate 1 frequency list of single word
                    'sNumber = "1,2"  will generate 2 frequency list: single word & 2 word phrase

z = Split(sNumber, ",")
   
    For i = LBound(z) To UBound(z)
        Call toProcess(CLng(z(i)), txa)
    Next

Range("C:Z").Columns.AutoFit
Application.ScreenUpdating = True

Debug.Print Timer - T

End Sub

Sub toProcess(n As Long, ByVal tx As String)
'phrase frequency

Dim regEx As Object, matches As Object, x As Object, d As Object
Dim i As Long, rc As Long
Dim va, q

    tx = Replace(tx, "'", "___") 'replace apostrophe with "___", so it will match pattern "\w+"
   
        Set regEx = CreateObject("VBScript.RegExp")
        With regEx
            .Global = True
            .MultiLine = True
            .IgnoreCase = True
        End With
       
       
If n > 1 Then
       
        regEx.Pattern = "( ){2,}"
   
        If regEx.test(tx) Then
           tx = regEx.Replace(tx, " ") 'remove excessive space
        End If
       
        tx = Trim(tx)
        tx = Replace(tx, " ", "_____") 'replace space with "_____", , so it won't match pattern "\W+"

        regEx.Pattern = "\W+" 'non words character, Matches any character that is not
                              'a word character (alphanumeric & underscore). Equivalent to [^A-Za-z0-9_]
  
        If regEx.test(tx) Then
           tx = regEx.Replace(tx, vbLf) 'replace non words character with new line char
        End If
       
        tx = Replace(tx, "_____", " ") ' change it back to space
        tx = Replace(tx, vbLf & " ", vbLf & "") 'remove space in the beginning of every line
       
End If

   
    Set d = CreateObject("scripting.dictionary")
    d.CompareMode = vbTextCompare

    regEx.Pattern = Trim(WorksheetFunction.Rept("\w+ ", n)) 'match n words (the phrase) separated by a space
       
            Set matches = regEx.Execute(tx)
           
            For Each x In matches
                d(CStr(x)) = d(CStr(x)) + 1 'get phrase frequency
            Next


For i = 1 To n - 1
       
        regEx.Pattern = "^\w+ "
       
        If regEx.test(tx) Then
       
           tx = regEx.Replace(tx, "")   'remove first word in each line to get different combination of n words (phrase)

            regEx.Pattern = Trim(WorksheetFunction.Rept("\w+ ", n))
            Set matches = regEx.Execute(tx)
           
            For Each x In matches
                d(CStr(x)) = d(CStr(x)) + 1     'get phrase frequency
            Next

        End If

Next

If d.Count = 0 Then MsgBox "Nothing with " & n & " word phrase found": Exit Sub

rc = Cells(1, Columns.Count).End(xlToLeft).Column
'put the result in col D:E

With Cells(2, rc + 2).Resize(d.Count, 2)
   
    If d.Count < 65536 Then 'Transpose function has a limit of 65536 item to process
       
        .Value = Application.Transpose(Array(d.Keys, d.items))
       
    Else
       
        ReDim va(1 To d.Count, 1 To 2)
        i = 0
            For Each q In d.Keys
                i = i + 1
                va(i, 1) = q: va(i, 2) = d(q)
            Next
        .Value = va
       
    End If
   
    'get the apostrophe back
    .Replace What:="___", Replacement:="'", lookAt:=xlPart, SearchFormat:=False, ReplaceFormat:=False
    .Sort Key1:=.Cells(1, 2), Order1:=xlDescending, Key2:=.Cells(1, 1), Order2:=xlAscending, Header:=xlNo
   
   
End With


Cells(1, rc + 2) = n & " WORD"
Cells(1, rc + 3) = "FREQUENCY"

End Sub

revised code for post 12#:
VBA Code:
Sub regexPhraseFrequency2()
'generate word/phrase frequency
'Data must be in column A, start at A1
'sNumber = "1,2,3"  means it generates 3 frequency list: single word & 2 word phrase & 3 word phrase
'Tested on text (from a novel) with 16.500 rows, contains 161K words (12600 unique words) with sNumber = "1,2,3", it took 10.5 seconds

Dim i As Long
Dim sNumber As String, txa As String
Dim z, T

T = Timer
Application.ScreenUpdating = False

With Sheets("Frequency")
.Range("C:Z").ClearContents

txa = Join(Application.Transpose(.Range("A1", .Cells(.Rows.Count, "A").End(xlUp))), vbLf)

'--- CHANGE sNumber VALUE TO SUIT -----------------------------------
sNumber = "1,2,3"   'list each number of words in a phrase, change to suit
                    'sNumber = "1"  will generate 1 frequency list of single word
                    'sNumber = "1,2"  will generate 2 frequency list: single word & 2 word phrase

z = Split(sNumber, ",")
 
    For i = LBound(z) To UBound(z)
        Call toProcess2(CLng(z(i)), txa)
    Next

.Range("C:Z").Columns.AutoFit
End With
Application.ScreenUpdating = True

Debug.Print Timer - T

End Sub


Sub toProcess2(n As Long, ByVal tx As String)
'phrase frequency

Dim regEx As Object, matches As Object, x As Object, d As Object
Dim i As Long, rc As Long
Dim va, q

    tx = Replace(tx, "'", "___") 'replace apostrophe with "___", so it will match pattern "\w+"
   
        Set regEx = CreateObject("VBScript.RegExp")
        With regEx
            .Global = True
            .MultiLine = True
            .IgnoreCase = True
        End With
       
       
If n > 1 Then
       
        regEx.Pattern = "( ){2,}"
   
        If regEx.test(tx) Then
           tx = regEx.Replace(tx, " ") 'remove excessive space
        End If
       
        tx = Trim(tx)
        tx = Replace(tx, " ", "_____") 'replace space with "_____", , so it won't match pattern "\W+"

        regEx.Pattern = "\W+" 'non words character, Matches any character that is not
                              'a word character (alphanumeric & underscore). Equivalent to [^A-Za-z0-9_]
  
        If regEx.test(tx) Then
           tx = regEx.Replace(tx, vbLf) 'replace non words character with new line char
        End If
       
        tx = Replace(tx, "_____", " ") ' change it back to space
        tx = Replace(tx, vbLf & " ", vbLf & "") 'remove space in the beginning of every line
       
End If

   
    Set d = CreateObject("scripting.dictionary")
    d.CompareMode = vbTextCompare

    regEx.Pattern = Trim(WorksheetFunction.Rept("\w+ ", n)) 'match n words (the phrase) separated by a space
       
            Set matches = regEx.Execute(tx)
           
            For Each x In matches
                d(CStr(x)) = d(CStr(x)) + 1 'get phrase frequency
            Next

For i = 1 To n - 1
       
        regEx.Pattern = "^\w+ "
       
        If regEx.test(tx) Then
           tx = regEx.Replace(tx, "")   'remove first word in each line to get different combination of n words (phrase)
          
            regEx.Pattern = Trim(WorksheetFunction.Rept("\w+ ", n))
            Set matches = regEx.Execute(tx)
           
            For Each x In matches
                d(CStr(x)) = d(CStr(x)) + 1     'get phrase frequency
            Next

        End If

Next

If d.Count = 0 Then MsgBox "Nothing with " & n & " word phrase found": Exit Sub

'Range("D:E").ClearContents
With Sheets("Frequency")
rc = .Cells(1, .Columns.Count).End(xlToLeft).Column
'put the result in col D:E

With .Cells(2, rc + 2).Resize(d.Count, 2)
   
    If d.Count < 65536 Then 'Transpose function has a limit of 65536 item to process
       
        .Value = Application.Transpose(Array(d.Keys, d.items))
       
    Else
       
        ReDim va(1 To d.Count, 1 To 2)
        i = 0
            For Each q In d.Keys
                i = i + 1
                va(i, 1) = q: va(i, 2) = d(q)
            Next
        .Value = va
       
    End If
   
    'get the apostrophe back
    .Replace What:="___", Replacement:="'", lookAt:=xlPart, SearchFormat:=False, ReplaceFormat:=False
    .Sort Key1:=.Cells(1, 2), Order1:=xlDescending, Key2:=.Cells(1, 1), Order2:=xlAscending, Header:=xlNo
End With
   
.Cells(1, rc + 2) = n & " WORD"
.Cells(1, rc + 3) = "FREQUENCY"
End With
End Sub
 
Upvote 0
Example using code in post 6#:
you can see that the result of 3 word phrase are wrong

Book1
ABCDEFGHIJ
1a b c d e f g h1 WORDFREQUENCY2 WORDFREQUENCY3 WORDFREQUENCY
2qqa1a b1b c d2
3b1b c1e f g2
4c1c d1a b c1
5d1d e1d e f1
6e1e f1
7f1f g1
8g1g h1
9h1
10qq1
Sheet1


while using the revised code above:
Book1
ABCDEFGHIJ
1a b c d e f g h1 WORDFREQUENCY2 WORDFREQUENCY3 WORDFREQUENCY
2qqa1a b1a b c1
3b1b c1b c d1
4c1c d1c d e1
5d1d e1d e f1
6e1e f1e f g1
7f1f g1f g h1
8g1g h1
9h1
10qq1
Sheet2
 
Upvote 0
Akuini, thank you much for the update. I didn't catch the 3 word phrase flaw. I assume it would also have affected larger phrases.
This code is running fantastically. Accurate and fast. I'm very glad you took time help
 
Upvote 0
You're welcome, glad to help, & thanks for the feedback.:)
 
Upvote 0
@Sleeplol

I need to explain something:
I see in example in post #1, your code showed e.g. "holidays, Need" as a 2 word phrase.
So there could be a comma (and I guess another sign character like :;. etc?) between words in a phrase.
Is that what you want?

Because my code use different criteria, a phrase can only have space between the words.
So for example: "yes, this one"
your code will get 3 phrases: 1. "yes, this" 2. "this one" 3. "yes, this one"
my code will get only 1 phrase: "this one"
 
Upvote 0
When I saw your results, I think it is better to ignore phrases that are separated by a character.
Your code is being used as part of an analytics tool to discover trends in documented person to person interactions; commas and semi-colons generally break a statement into conceptual units, which ARE the phrases I'm looking for. There are outliers that I might need like an accidental comma: "hit the, wall", and deliberate commas separating a list. However, the latter will be picked up by the single word list anyway. The positives of the built-in "filter" of you code, I think outweighs the negative of the few that might slip through the cracks.

I really can't say enough about how great that macro is.
 
Upvote 0
Ok, I'm glad that the code suits your need. :)

In case you or anyone else need it, here's the criteria that is used by the code:
1. A word can only consist of alphabet (a-zA-Z), number(0-9), underscore or apostrophe.

For example:
each line below is counted as 1 word:
You
A11
33
you're
a_bb

each line below is counted as 2 word:
a,bb
a, book
a\d

2. A phrase: multiple words that are separated by a space or spaces:
So for example:
"I see this" > it has 3 phrases: 1. "I see" 2. "see this" 3. "I see this"
"yes, this one" > it has only 1 phrase: "this one"

EXAMPLE:
Book1
ABCDEFGHIJ
1car1 WORDFREQUENCY2 WORDFREQUENCY3 WORDFREQUENCY
2A11a3I see1I see this1
333this2see this1
4you're331this one1
5a_bba_bb1
6a,bbA111
7a, bookbb1
8a\dbook1
9car1
10yes, this oned1
11I see thisI1
12one1
13see1
14yes1
15you're1
Sheet5
 
Upvote 0
Got it.
I'm positive that this will help anyone who needs this type of criteria
 
Upvote 0

Forum statistics

Threads
1,225,741
Messages
6,186,761
Members
453,370
Latest member
juliewar

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top