Count Word and Random Phrase Frequency

Sleeplol

Board Regular
Joined
Apr 10, 2019
Messages
194
Office Version
  1. 365
  2. 2019
Platform
  1. Windows
Hello Everyone,
I'm attempting to count the word frequency as well as two and three word phrases from a column. I'm stuck and would really appreciate some help.

Getting error:
Run-time error 1004.png


On this line:
ErrorLastLine.png


Setup:
A macro populates column A with values from AP101:AP50000 (formulas combining additional row cells into a single cell).
(Problem) Macro populates the count of B,E,H

Runs fine though throttles for about 2 min for 30000ish rows; then bugs out after populating the results.
I've tried referencing the sheet with Worksheets("DashTest").Range on all three, but it still errors.

Below is a quick shot of what it looks like, as well as the entire script.
Thanks for any help

WordCountDisplay.png



VBA Code:
Sub test()
Dim a, e, s, Ignore As String, temp, x
Dim dic As Object, dic2 As Object, dic3 As Object
Set dic = CreateObject("Scripting.Dictionary")
dic.CompareMode = 1
Set dic2 = CreateObject("Scripting.Dictionary")
dic2.CompareMode = 1
Set dic3 = CreateObject("Scripting.Dictionary")
dic3.CompareMode = 1
With Range("m1").CurrentRegion.Offset(1)
Ignore = Join(Application.Transpose(.Resize(.Rows.Count - 1).Value), Chr(2))
End With
a = Range("a1", Range("a" & Rows.Count).End(xlUp)).Value
With CreateObject("VBScript.RegExp")
.Global = True: .IgnoreCase = True
.Pattern = "([\$\(\)\-\^\|\\\[\]\*\+\?\.])"
Ignore = "[^\w ]|\b(" & Replace(.Replace(Ignore, "\$1"), Chr(2), "|") & ")\b"
For Each e In a
If e <> "" Then
x = GetSentense(e, 2)
If IsArray(x) Then
For Each s In x
dic2(s) = dic2(s) + 1
Next
End If
x = GetSentense(e, 3)
If IsArray(x) Then
For Each s In x
If s <> "" Then dic3(s) = dic3(s) + 1
Next
End If
.Pattern = Ignore
temp = Application.Trim(.Replace(e, ""))
For Each s In Split(temp)
If s <> "" Then dic(s) = dic(s) + 1
Next
End If
Next
End With
Range("b2").Resize(dic.Count, 2).Value = _
Application.Transpose(Array(dic.keys, dic.items))
Range("e2").Resize(dic2.Count, 2).Value = _
Application.Transpose(Array(dic2.keys, dic2.items))
Range("h2").Resize(dic3.Count, 2).Value = _
Application.Transpose(Array(dic3.keys, dic3.items))
End Sub
Function GetSentense(ByVal txt As String, myStep)
Dim i As Long, ii As Long, temp, x
On Error Resume Next
x = Split(txt): ReDim temp(UBound(x) - myStep)
If Err Then GetSentense = Empty: Exit Function
On Error GoTo 0
For i = 0 To UBound(x) - myStep - 1
For ii = 1 To myStep
temp(i) = Trim$(temp(i) & " " & x(i + ii - 1))
Next
Next
GetSentense = temp
End Function
 
@auggie24
See if this helps:
Add reference to "Microsoft VBScript Regular Expressions 5.5":
select Tools –> References, then select Microsoft VBScript Regular Expressions 5.5, then click OK.
 
Upvote 0

Excel Facts

Excel motto
Not everything I do at work revolves around Excel. Only the fun parts.
Akuini, or any other Excel warrior,

This macro will make my day, but it shows an error "User Defined Type Not Defined" when running at "Dim obj As New DataObject". Could you please elaborate on how to correct it? I added a reference to VBA Script Regular Expressions 5.5, but I cannot go further after Google searching.

Thanks!
 
Upvote 0
@actjfc
You need to add the reference “Microsoft Forms 2.0 Object Library” (to work with clipboard). But actually you can delete the line:
Dim obj As New DataObject
we don't need that, I just forgot to delete it, it was from an earlier code when I wrote this code.
Also here's a revised version, it can handle more than 65536 rows which is the Transpose function limit.
Just run "Sub regexPhraseFrequency3".
VBA Code:
Sub regexPhraseFrequency3()
'The code will generate word/phrase frequency
'Data must be in column A, start at A1
'sNumber = "1,2,3"  means it generates 3 frequency list: single word & 2 word phrase & 3 word phrase, you can change that to suit.
'Word with apostrophe such as "you're" is counted as one word.
'Word with underscore such as "aa_bb" is counted as one word.
'Tested on text (from a novel) with 16.500 rows, contains 161K words (12600 unique words) with sNumber = "1,2,3", it took 10.5 seconds

Dim i As Long, j As Long
Dim sNumber As String, txa As String
Dim z, t

t = Timer
Application.ScreenUpdating = False
Range("C:Z").ClearContents
j = Range("A" & Rows.Count).End(xlUp).Row


If j < 65000 Then
    txa = Join(Application.Transpose(Range("A1", Cells(Rows.Count, "A").End(xlUp))), " ")
Else
    For i = 1 To j Step 65000
    txa = txa & Join(Application.Transpose(Range("A" & i).Resize(65000)), " ") & " "
    Next
End If


'Debug.Print Len(txa)

'--- CHANGE sNumber VALUE TO SUIT -----------------------------------
sNumber = "1"   'list each number of words in a phrase, change to suit
                    'sNumber = "1"  will generate 1 frequency list of single word
                    'sNumber = "1,2"  will generate 2 frequency list: single word & 2 word phrase

z = Split(sNumber, ",")
    
    'TO PROCESS
    For i = LBound(z) To UBound(z)
        Call toProcessY(CLng(z(i)), txa)
'        Call toProcessx(CLng(z(i)), txa)
    Next

Range("C:Z").Columns.AutoFit
Application.ScreenUpdating = True

Debug.Print Timer - t

End Sub

Sub toProcessY(n As Long, ByVal tx As String)
'phrase frequency

Dim regEx As Object, matches As Object, x As Object, d As Object
Dim i As Long, rc As Long
Dim xp As String
Dim va, q

    'tx = Replace(tx, "'", "___") 'replace apostrophe with "___", so it will match pattern "\w+"
    
        Set regEx = CreateObject("VBScript.RegExp")
        With regEx
            .Global = True
            .MultiLine = True
            .ignorecase = True
        End With

'define the word characters
'regex pattern, you can add/remove any character as a word character
'xp = "A-Z0-9_'\-"

xp = "A-Z0-9_'"

If n > 1 Then
        
        regEx.Pattern = "( ){2,}"
    
        If regEx.test(tx) Then
           tx = regEx.Replace(tx, " ") 'remove excessive space
        End If
        
        tx = Trim(tx)
        
'        regEx.Pattern = "[^A-Z0-9_' ]+"
        regEx.Pattern = "[^" & xp & " ]+" 'exclude xp and space
        If regEx.test(tx) Then
           tx = regEx.Replace(tx, vbLf) 'replace non words character with new line char
        End If
        
        tx = Replace(tx, vbLf & " ", vbLf & "") 'remove space in the beginning of every line

End If

    
    Set d = CreateObject("scripting.dictionary")
    d.CompareMode = vbTextCompare

'    regEx.Pattern = Trim(WorksheetFunction.Rept("[A-Z0-9_']+ ", n)) 'match n words (the phrase) separated by a space
    regEx.Pattern = Trim(WorksheetFunction.Rept("[" & xp & "]+ ", n)) 'match n words (the phrase) separated by a space
            Set matches = regEx.Execute(tx)
            
            For Each x In matches
                d(CStr(x)) = d(CStr(x)) + 1 'get phrase frequency
            Next
 
 
 For i = 1 To n - 1
        
'        regEx.Pattern = "^[A-Z0-9_']+ "
        regEx.Pattern = "^[" & xp & "]+ "
        If regEx.test(tx) Then
           tx = regEx.Replace(tx, "")   'remove first word in each line to get different combination of n words (phrase)

'            regEx.Pattern = Trim(WorksheetFunction.Rept("[A-Z0-9_']+ ", n))
            regEx.Pattern = Trim(WorksheetFunction.Rept("[" & xp & "]+ ", n))
            
            Set matches = regEx.Execute(tx)
            
            For Each x In matches
                d(CStr(x)) = d(CStr(x)) + 1     'get phrase frequency
            Next

        End If
 
Next

If d.Count = 0 Then MsgBox "Nothing with " & n & " word phrase found": Exit Sub

rc = Cells(1, Columns.Count).End(xlToLeft).Column
'put the result in col D:E

With Cells(2, rc + 2).Resize(d.Count, 2)
    
    If d.Count < 65536 Then 'Transpose function has a limit of 65536 item to process
        
        .Value = Application.Transpose(Array(d.Keys, d.Items))
        
    ElseIf d.Count <= 1048500 Then
        
        ReDim va(1 To d.Count, 1 To 2)
        i = 0
            For Each q In d.Keys
                i = i + 1
                va(i, 1) = q: va(i, 2) = d(q)
            Next
        .Value = va
    Else
        MsgBox "Process canceled, the result is more than 1048500 rows"
    End If
    
    'get the apostrophe back
'    .Replace What:="___", Replacement:="'", lookAt:=xlPart, SearchFormat:=False, ReplaceFormat:=False
    .Sort Key1:=.Cells(1, 2), Order1:=xlDescending, Key2:=.Cells(1, 1), Order2:=xlAscending, Header:=xlNo
    
End With


Cells(1, rc + 2) = n & " WORD"
Cells(1, rc + 3) = "FREQUENCY"

End Sub


If you run into error "Run time error 13" , it's probably because there are error in some cells. Try running this sub first, to remove the error:
VBA Code:
Sub a_removeError()

With ActiveSheet.UsedRange
.Value = .Value
End With
Range("A:A").SpecialCells(xlConstants, xlErrors).ClearContents

End Sub
 
Upvote 0
You're welcome, glad to help & thanks for the feedback.:)
 
Upvote 0
You're welcome, glad to help & thanks for the feedback.:)
Hi Akuini, thank you so much for this awesome macro. I am succeeding in an output, but the frequency is incorrect. Ideas?

I've removed duplicate "Customer Reviews" (the data we are sorting), and a CTRL+F for example shows only 1 instance of
"a reasonable and beautiful ring with 5 raw stones"
but the frequency under 9 word phrases is listed as 24.

Appreciate any insight! <3
 

Attachments

  • Screenshot 2021-11-15 151015.jpg
    Screenshot 2021-11-15 151015.jpg
    87.2 KB · Views: 20
Upvote 0
Actually using your most recent code fixed that issue, I won't delete incase someone runs into the same problem
 
Upvote 0
I've removed duplicate "Customer Reviews" (the data we are sorting), and a CTRL+F for example shows only 1 instance of
"a reasonable and beautiful ring with 5 raw stones"
but the frequency under 9 word phrases is listed as 24.
Which code did you use?
The code in post #6 & #12 are flawed, and I've corrected them in post #13. In post #19, I also described the criteria used by the code in post #13.
And then I post a newer code in post #43, i.e "Sub regexPhraseFrequency3".
The criteria for both codes ( post #13 & #43) are basically the same, except for one thing, for code in post #43:
A sentence/paragraph can occupy more than one cell, it happens when the end of the cell is a word character or a space. For example:

dhee - Macro to create Word & Phrase Frequency 1.xlsm
ABCDEFGHIJ
1Please, go home1 WORDCOUNT2 WORDCOUNT3 WORDCOUNT
2tomorrow.go1go home1go home tomorrow1
3home1home tomorrow1
4Please1
5tomorrow1
Sheet1

"home tomorrow" is considered as a phrase.

I decided to use this criteria because in some cases the text comes from a source that breaks the paragraphs, oftentimes it happens when you copy-paste the text from a PDF.

Actually using your most recent code fixed that issue, I won't delete incase someone runs into the same problem
Thanks for the feedback.
 
Upvote 0

Forum statistics

Threads
1,224,823
Messages
6,181,179
Members
453,021
Latest member
Justyna P

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top