Word Count with hyphens

cpmurray1985

New Member
Joined
Mar 10, 2022
Messages
15
Office Version
  1. 365
Platform
  1. Windows
Hello,

I've looked and attempted to modify the code, but as I am completely new to VBA, I feel I am doing it wrong. I'm not getting errors, but it is not producing the desired results.

I am looking to have a word/phrase counter that count the word frequency, and in addition, displays two and three word phrases from a column. The code provided already does so, however, I need to modify it a little as it will not count words that are hyphenated as one word. I still would like for the macro to ignore special characters, such as ., -, !, etc as it is currently doing, if it is by itself or after a word.

Expected Result:

Word/Sentence1 WordFrequency2 WordFrequency3 WordFrequency
Take pre-caution todayPre-caution2Take Pre-caution1Take pre-caution today1
Pre-caution - Warning signstake1Pre-caution today1
today1warning signs1
warning1
signs1


What I get instead:

Word/Sentence1 WordFrequency2 WordFrequency3 WordFrequency
Take pre-caution todaycaution2caution today1
Pre-caution - Warning signspre2take pre1
signs1warning signs1
Take1
signs1
today1
warning1

Also, is it possible to keep the code in a certain range without clearing the rest? I know the macro contains "Range("C:Z").ClearContents", which is what I want to, but just to clear from C to L. However, when I adjust it to, for example "Range("C:L").ClearContents", and if there is content in column M and N, it will directly start in column P, even if column C:L is empty.

The code is from this post, but just in case it can't be viewed. Thank you.

VBA Code:
Sub regexPhraseFrequency1()
'The code will generate word/phrase frequency
'Data must be in column A, start at A1
'Data can't be more than 65536 rows, because it's using Application.Transpose.
'sNumber = "1,2,3"  means it generates 3 frequency list: single word & 2 word phrase & 3 word phrase, you can change that to suit.
'Word with apostrophe such as "you're" is counted as one word.
'Word with underscore such as "aa_bb" is counted as one.
'Tested on text (from a novel) with 16.500 rows, contains 161K words (12600 unique words) with sNumber = "1,2,3", it took 10.5 seconds


Dim i As Long
Dim sNumber As String, txa As String
Dim z, T
Dim obj As New DataObject

T = Timer
Application.ScreenUpdating = False
Range("C:Z").ClearContents

txa = Join(Application.Transpose(Range("A1", Cells(Rows.Count, "A").End(xlUp))), vbLf)

'--- CHANGE sNumber VALUE TO SUIT -----------------------------------
sNumber = "1,2,3"   'list each number of words in a phrase, change to suit
                    'sNumber = "1"  will generate 1 frequency list of single word
                    'sNumber = "1,2"  will generate 2 frequency list: single word & 2 word phrase

z = Split(sNumber, ",")
   
    For i = LBound(z) To UBound(z)
        Call toProcess(CLng(z(i)), txa)
    Next

Range("C:Z").Columns.AutoFit
Application.ScreenUpdating = True

Debug.Print Timer - T

End Sub

Sub toProcess(n As Long, ByVal tx As String)
'phrase frequency

Dim regEx As Object, matches As Object, x As Object, d As Object
Dim i As Long, rc As Long
Dim va, q

    tx = Replace(tx, "'", "___") 'replace apostrophe with "___", so it will match pattern "\w+"
   
        Set regEx = CreateObject("VBScript.RegExp")
        With regEx
            .Global = True
            .MultiLine = True
            .IgnoreCase = True
        End With
       
       
If n > 1 Then
       
        regEx.Pattern = "( ){2,}"
   
        If regEx.test(tx) Then
           tx = regEx.Replace(tx, " ") 'remove excessive space
        End If
       
        tx = Trim(tx)
        tx = Replace(tx, " ", "_____") 'replace space with "_____", , so it won't match pattern "\W+"

        regEx.Pattern = "\W+" 'non words character, Matches any character that is not
                              'a word character (alphanumeric & underscore). Equivalent to [^A-Za-z0-9_]
  
        If regEx.test(tx) Then
           tx = regEx.Replace(tx, vbLf) 'replace non words character with new line char
        End If
       
        tx = Replace(tx, "_____", " ") ' change it back to space
        tx = Replace(tx, vbLf & " ", vbLf & "") 'remove space in the beginning of every line
       
End If

   
    Set d = CreateObject("scripting.dictionary")
    d.CompareMode = vbTextCompare

    regEx.Pattern = Trim(WorksheetFunction.Rept("\w+ ", n)) 'match n words (the phrase) separated by a space
       
            Set matches = regEx.Execute(tx)
           
            For Each x In matches
                d(CStr(x)) = d(CStr(x)) + 1 'get phrase frequency
            Next


For i = 1 To n - 1
       
        regEx.Pattern = "^\w+ "
       
        If regEx.test(tx) Then
       
           tx = regEx.Replace(tx, "")   'remove first word in each line to get different combination of n words (phrase)

            regEx.Pattern = Trim(WorksheetFunction.Rept("\w+ ", n))
            Set matches = regEx.Execute(tx)
           
            For Each x In matches
                d(CStr(x)) = d(CStr(x)) + 1     'get phrase frequency
            Next

        End If

Next

If d.Count = 0 Then MsgBox "Nothing with " & n & " word phrase found": Exit Sub

rc = Cells(1, Columns.Count).End(xlToLeft).Column
'put the result in col D:E

With Cells(2, rc + 2).Resize(d.Count, 2)
   
    If d.Count < 65536 Then 'Transpose function has a limit of 65536 item to process
       
        .Value = Application.Transpose(Array(d.Keys, d.items))
       
    Else
       
        ReDim va(1 To d.Count, 1 To 2)
        i = 0
            For Each q In d.Keys
                i = i + 1
                va(i, 1) = q: va(i, 2) = d(q)
            Next
        .Value = va
       
    End If
   
    'get the apostrophe back
    .Replace What:="___", Replacement:="'", lookAt:=xlPart, SearchFormat:=False, ReplaceFormat:=False
    .Sort Key1:=.Cells(1, 2), Order1:=xlDescending, Key2:=.Cells(1, 1), Order2:=xlAscending, Header:=xlNo
   
   
End With


Cells(1, rc + 2) = n & " WORD"
Cells(1, rc + 3) = "FREQUENCY"

End Sub
 

Excel Facts

Format cells as currency
Select range and press Ctrl+Shift+4 to format cells as currency. (Shift 4 is the $ sign).
Actually I've revised the code that you mentioned above in "Excel Articles" section with a more detailed explanation of how it works & what criteria the code uses, here:
VBA Macro to create Word & Phrase Frequency

Here's the new code:
VBA Code:
Option Explicit

Sub Word_Phrase_Frequency_v1()

'The code will generate word/phrase frequency
'How to use:
'1. Add reference to "Microsoft VBScript Regular Expressions 5.5" (you need to do it once only):
'   In Visual Basic Editor menu, select Tools –> References, then select Microsoft VBScript Regular Expressions 5.5, then click OK.
'2. Data must be in column A, start at A1
'3. Run Word_Phrase_Frequency_v1


'--- CHANGE sNumber & xPattern VALUE TO SUIT -----------------------------------

Const sNumber As String = "1,2,3"  '"1,2,3"
'sNumber = "1"  will generate 1 word frequency list
'sNumber = "1,2,3"  will generate 1 word, 2 word & 3 word frequency list

Const xPattern As String = "A-Z0-9_'"
'define the word characters, the above pattern will include letter, number, underscore & apostrophe as word character
'word with apostrophe such as "you're" counts as one word.
'word with underscore such as "aa_bb" counts as one word.


Const xCol As String = "C:ZZ" 'columns to clear
Dim i As Long, j As Long
Dim txa As String
Dim z, t

t = Timer
Application.ScreenUpdating = False
Range(xCol).Clear

'if there are errors, remove them
On Error Resume Next
Range("A:A").SpecialCells(xlCellTypeFormulas, xlErrors).ClearContents
Range("A:A").SpecialCells(xlConstants, xlErrors).ClearContents
On Error GoTo 0

j = Range("A" & Rows.Count).End(xlUp).Row

If j < 65000 Then
    txa = Join(Application.Transpose(Range("A1", Cells(Rows.Count, "A").End(xlUp))), " ")
Else
    For i = 1 To j Step 65000
    txa = txa & Join(Application.Transpose(Range("A" & i).Resize(65000)), " ") & " "
    Next
End If


z = Split(sNumber, ",")
    
    'TO PROCESS
    For i = LBound(z) To UBound(z)
        Call toProcessY(CLng(z(i)), txa, xPattern)
    Next

Range(xCol).Columns.AutoFit
Application.ScreenUpdating = True

Debug.Print "It's done in:  " & Timer - t & " seconds"

End Sub

Sub toProcessY(n As Long, ByVal tx As String, xP As String)
'phrase frequency

Dim regEx As Object, matches As Object, x As Object, d As Object
Dim i As Long, rc As Long
Dim va, q

        Set regEx = CreateObject("VBScript.RegExp")
        With regEx
            .Global = True
            .MultiLine = True
            .ignorecase = True
        End With

If n > 1 Then

        regEx.Pattern = "( ){2,}"

        If regEx.Test(tx) Then
           tx = regEx.Replace(tx, " ") 'remove excessive space
        End If
        
        tx = Trim(tx)
               
'        regEx.Pattern = "[^A-Z0-9_' ]+"
        regEx.Pattern = "[^" & xP & " ]+" 'exclude xp and space
        If regEx.Test(tx) Then
           tx = regEx.Replace(tx, vbLf) 'replace non words character (excluding space) with new line char (vbLf)
        End If
        
        tx = Replace(tx, vbLf & " ", vbLf & "") 'remove space in the beginning of every line

End If

    Set d = CreateObject("scripting.dictionary")
    d.CompareMode = vbTextCompare

'    regEx.Pattern = Trim(WorksheetFunction.Rept("[A-Z0-9_']+ ", n)) 'match n words (the phrase) separated by a space
    regEx.Pattern = Trim(WorksheetFunction.Rept("[" & xP & "]+ ", n)) 'match n words (the phrase) separated by a space
            Set matches = regEx.Execute(tx)
            
            For Each x In matches
                d(CStr(x)) = d(CStr(x)) + 1 'get phrase frequency
            Next
 
For i = 1 To n - 1
        
        regEx.Pattern = "^[" & xP & "]+ "
        If regEx.Test(tx) Then
           tx = regEx.Replace(tx, "")   'remove first word in each line to get different combination of n words (phrase)

'            regEx.Pattern = Trim(WorksheetFunction.Rept("[A-Z0-9_']+ ", n))
            regEx.Pattern = Trim(WorksheetFunction.Rept("[" & xP & "]+ ", n))
            
            Set matches = regEx.Execute(tx)
            
            For Each x In matches
                d(CStr(x)) = d(CStr(x)) + 1     'get phrase frequency
            Next

        End If
Next

If d.Count = 0 Then MsgBox "Nothing with " & n & " word phrase found": Exit Sub

rc = Cells(1, Columns.Count).End(xlToLeft).Column
'put the result

With Cells(2, rc + 2).Resize(d.Count, 2)
    
    Select Case d.Count
    Case Is < 65536 'Transpose function has a limit of 65536 item to process
        
        .Value = Application.Transpose(Array(d.Keys, d.Items))
        
    Case Is <= 1048500
        
        ReDim va(1 To d.Count, 1 To 2)
        i = 0
            For Each q In d.Keys
                i = i + 1
                va(i, 1) = q: va(i, 2) = d(q)
            Next
        .Value = va
    
    Case Else
        
        MsgBox "Process is canceled, the result is more than 1048500 rows"
    
    End Select
    
    .Sort Key1:=.Cells(1, 2), Order1:=xlDescending, Key2:=.Cells(1, 1), Order2:=xlAscending, Header:=xlNo
    
End With


Cells(1, rc + 2) = n & " WORD"
Cells(1, rc + 3) = "COUNT"

End Sub

The code provided already does so, however, I need to modify it a little as it will not count words that are hyphenated as one word.
Change this part:
VBA Code:
Const xPattern As String = "A-Z0-9_'"
to this:
VBA Code:
Const xPattern As String = "A-Z0-9_'-"

Also, is it possible to keep the code in a certain range without clearing the rest? I know the macro contains "Range("C:Z").ClearContents", which is what I want to, but just to clear from C to L.
Change this part:
VBA Code:
Const xCol As String = "C:ZZ" 'columns to clear
to this:
VBA Code:
Const xCol As String = "C:L" 'columns to clear
 
Upvote 0
Note:
Please read the criteria that is used by the code in the link above.
One of them is:
3. A sentence/paragraph can occupy more than one cell, it happens when the end of the cell is a word character or a space. For example:

Book1
A
1Please go home
2tomorrow.
Sheet1


"home tomorrow" is considered as a phrase.

I decided to use this criteria because in some cases the text comes from a source that breaks the paragraphs, oftentimes it happens when you copy-paste the text from a PDF.

If this is not what you want then I can amend the code to suit.
 
Upvote 0
Note:
Please read the criteria that is used by the code in the link above.
One of them is:
3. A sentence/paragraph can occupy more than one cell, it happens when the end of the cell is a word character or a space. For example:

Book1
A
1Please go home
2tomorrow.
Sheet1


"home tomorrow" is considered as a phrase.

I decided to use this criteria because in some cases the text comes from a source that breaks the paragraphs, oftentimes it happens when you copy-paste the text from a PDF.

If this is not what you want then I can amend the code to suit.
Hi,

Thank you for your response; I received an error when attempting to run the code: Run-time error '5021': Application defined or object-defined error. When I debug, it highlights the following:

1647369131723.png

1647369205394.png


I made sure to enable reference "Microsoft VBScript Regular Expressions 5.5". I have the following marked:

1647369347558.png


Here is the code after modifying, maybe I typed something wrong?

VBA Code:
Option Explicit
Sub Word_Phrase_Frequency_v1()

'The code will generate word/phrase frequency
'How to use:
'1. Add reference to "Microsoft VBScript Regular Expressions 5.5" (you need to do it once only):
'   In Visual Basic Editor menu, select Tools –> References, then select Microsoft VBScript Regular Expressions 5.5, then click OK.
'2. Data must be in column A, start at A1
'3. Run Word_Phrase_Frequency_v1


'--- CHANGE sNumber & xPattern VALUE TO SUIT -----------------------------------

Const sNumber As String = "1,2,3"  '"1,2,3"
'sNumber = "1"  will generate 1 word frequency list
'sNumber = "1,2,3"  will generate 1 word, 2 word & 3 word frequency list

Const xPattern As String = "A-Z0-9_'-"
'define the word characters, the above pattern will include letter, number, underscore & apostrophe as word character
'word with apostrophe such as "you're" counts as one word.
'word with underscore such as "aa_bb" counts as one word.


Const xCol As String = "C:L" 'columns to clear
Dim i As Long, j As Long
Dim txa As String
Dim z, t

t = Timer
Application.ScreenUpdating = False
Range(xCol).Clear

'if there are errors, remove them
On Error Resume Next
Range("A:A").SpecialCells(xlCellTypeFormulas, xlErrors).ClearContents
Range("A:A").SpecialCells(xlConstants, xlErrors).ClearContents
On Error GoTo 0

j = Range("A" & Rows.Count).End(xlUp).Row

If j < 65000 Then
    txa = Join(Application.Transpose(Range("A1", Cells(Rows.Count, "A").End(xlUp))), " ")
Else
    For i = 1 To j Step 65000
    txa = txa & Join(Application.Transpose(Range("A" & i).Resize(65000)), " ") & " "
    Next
End If


z = Split(sNumber, ",")
    
    'TO PROCESS
    For i = LBound(z) To UBound(z)
        Call toProcessY(CLng(z(i)), txa, xPattern)
    Next

Range(xCol).Columns.AutoFit
Application.ScreenUpdating = True

Debug.Print "It's done in:  " & Timer - t & " seconds"

End Sub

Sub toProcessY(n As Long, ByVal tx As String, xP As String)
'phrase frequency

Dim regEx As Object, matches As Object, x As Object, d As Object
Dim i As Long, rc As Long
Dim va, q

        Set regEx = CreateObject("VBScript.RegExp")
        With regEx
            .Global = True
            .MultiLine = True
            .ignorecase = True
        End With

If n > 1 Then

        regEx.Pattern = "( ){2,}"

        If regEx.Test(tx) Then
           tx = regEx.Replace(tx, " ") 'remove excessive space
        End If
        
        tx = Trim(tx)
               
'        regEx.Pattern = "[^A-Z0-9_' ]+"
        regEx.Pattern = "[^" & xP & " ]+" 'exclude xp and space
        If regEx.Test(tx) Then
           tx = regEx.Replace(tx, vbLf) 'replace non words character (excluding space) with new line char (vbLf)
        End If
        
        tx = Replace(tx, vbLf & " ", vbLf & "") 'remove space in the beginning of every line

End If

    Set d = CreateObject("scripting.dictionary")
    d.CompareMode = vbTextCompare

'    regEx.Pattern = Trim(WorksheetFunction.Rept("[A-Z0-9_']+ ", n)) 'match n words (the phrase) separated by a space
    regEx.Pattern = Trim(WorksheetFunction.Rept("[" & xP & "]+ ", n)) 'match n words (the phrase) separated by a space
            Set matches = regEx.Execute(tx)
            
            For Each x In matches
                d(CStr(x)) = d(CStr(x)) + 1 'get phrase frequency
            Next
 
For i = 1 To n - 1
        
        regEx.Pattern = "^[" & xP & "]+ "
        If regEx.Test(tx) Then
           tx = regEx.Replace(tx, "")   'remove first word in each line to get different combination of n words (phrase)

'            regEx.Pattern = Trim(WorksheetFunction.Rept("[A-Z0-9_']+ ", n))
            regEx.Pattern = Trim(WorksheetFunction.Rept("[" & xP & "]+ ", n))
            
            Set matches = regEx.Execute(tx)
            
            For Each x In matches
                d(CStr(x)) = d(CStr(x)) + 1     'get phrase frequency
            Next

        End If
Next

If d.Count = 0 Then MsgBox "Nothing with " & n & " word phrase found": Exit Sub

rc = Cells(1, Columns.Count).End(xlToLeft).Column
'put the result

With Cells(2, rc + 2).Resize(d.Count, 2)
    
    Select Case d.Count
    Case Is < 65536 'Transpose function has a limit of 65536 item to process
        
        .Value = Application.Transpose(Array(d.Keys, d.Items))
        
    Case Is <= 1048500
        
        ReDim va(1 To d.Count, 1 To 2)
        i = 0
            For Each q In d.Keys
                i = i + 1
                va(i, 1) = q: va(i, 2) = d(q)
            Next
        .Value = va
    
    Case Else
        
        MsgBox "Process is canceled, the result is more than 1048500 rows"
    
    End Select
    
    .Sort Key1:=.Cells(1, 2), Order1:=xlDescending, Key2:=.Cells(1, 1), Order2:=xlAscending, Header:=xlNo
    
End With


Cells(1, rc + 2) = n & " WORD"
Cells(1, rc + 3) = "COUNT"

End Sub
 
Upvote 0
VBA Code:
 I received an error when attempting to run the code: Run-time error '5021': Application defined or object-defined error
Ah, sorry, it should be:
VBA Code:
Const xPattern As String = "A-Z0-9_'\-"

Book1
ABCDEFGHIJ
1Take pre-caution today1 WORDCOUNT2 WORDCOUNT3 WORDCOUNT
2Pre-caution - Warning signspre-caution2- Warning1- Warning signs1
3-1Pre-caution -1Pre-caution - Warning1
4signs1pre-caution today1pre-caution today Pre-caution1
5Take1Take pre-caution1Take pre-caution today1
6today1today Pre-caution1today Pre-caution -1
7Warning1Warning signs1
Sheet2
 
Upvote 0
VBA Code:
 I received an error when attempting to run the code: Run-time error '5021': Application defined or object-defined error
Ah, sorry, it should be:
VBA Code:
Const xPattern As String = "A-Z0-9_'\-"

Book1
ABCDEFGHIJ
1Take pre-caution today1 WORDCOUNT2 WORDCOUNT3 WORDCOUNT
2Pre-caution - Warning signspre-caution2- Warning1- Warning signs1
3-1Pre-caution -1Pre-caution - Warning1
4signs1pre-caution today1pre-caution today Pre-caution1
5Take1Take pre-caution1Take pre-caution today1
6today1today Pre-caution1today Pre-caution -1
7Warning1Warning signs1
Sheet2
So the code now runs, and it will not clear contents after L as I wanted it, however, it will not generate the results starting in column C up to J. Normally, if there is no data on or after Column M, it will post in Column C to J after deleting whatever was there. However, if there is data already on or after Column M, it will post right after the other data and I'm not sure why.

If no data in or after Column M
1647436670086.png


If there is data starting in Column M:Q (that data is before the code, left it there as a placeholder to test), starts at column S instead despite there being empty space.

1647436144747.png


(Disregard column M:N, that was before the code) The other issue is that it is counting "-" as a word itself in column S and after, such as "- warning" as two words, even though "-" is by itself. Is there anyway to ensure special characters like "-" that have spaces before and after, not to be counted? As "Pre-caution" is between two characters, it can count, but if it were "Pre - caution", it would count "Pre" and "caution" as its own words, and not include the "-".

Thank you again.
 
Upvote 0
Try this one:
VBA Code:
Option Explicit
Dim rc As Long
Sub Word_Phrase_Frequency_v1()

'The code will generate word/phrase frequency
'How to use:
'1. Add reference to "Microsoft VBScript Regular Expressions 5.5" (you need to do it once only):
'   In Visual Basic Editor menu, select Tools –> References, then select Microsoft VBScript Regular Expressions 5.5, then click OK.
'2. Data must be in column A, start at A1
'3. Run Word_Phrase_Frequency_v1


'--- CHANGE sNumber & xPattern VALUE TO SUIT -----------------------------------

Const sNumber As String = "1,2,3"  '"1,2,3"
'sNumber = "1"  will generate 1 word frequency list
'sNumber = "1,2,3"  will generate 1 word, 2 word & 3 word frequency list

Const xPattern As String = "A-Z0-9_'\-"
'define the word characters, the above pattern will include letter, number, underscore & apostrophe as word character
'word with apostrophe such as "you're" counts as one word.
'word with underscore such as "aa_bb" counts as one word.


Const xCol As String = "C:L" 'columns to clear
Dim i As Long, j As Long
Dim txa As String
Dim z, t

t = Timer
Application.ScreenUpdating = False
Range(xCol).Clear
rc = 0
'if there are errors, remove them
On Error Resume Next
Range("A:A").SpecialCells(xlCellTypeFormulas, xlErrors).ClearContents
Range("A:A").SpecialCells(xlConstants, xlErrors).ClearContents
On Error GoTo 0

j = Range("A" & Rows.Count).End(xlUp).Row

If j < 65000 Then
    txa = Join(Application.Transpose(Range("A1", Cells(Rows.Count, "A").End(xlUp))), " ")
Else
    For i = 1 To j Step 65000
    txa = txa & Join(Application.Transpose(Range("A" & i).Resize(65000)), " ") & " "
    Next
End If


z = Split(sNumber, ",")
    
    'TO PROCESS
    For i = LBound(z) To UBound(z)
        Call toProcessY(CLng(z(i)), txa, xPattern)
    Next

Range(xCol).Columns.AutoFit
Application.ScreenUpdating = True

Debug.Print "It's done in:  " & Timer - t & " seconds"

End Sub

Sub toProcessY(n As Long, ByVal tx As String, xP As String)
'phrase frequency

Dim regEx As Object, matches As Object, x As Object, d As Object
Dim i As Long
Dim va, q

tx = Replace(tx, " - ", "|")
        Set regEx = CreateObject("VBScript.RegExp")
        With regEx
            .Global = True
            .MultiLine = True
            .ignorecase = True
        End With

If n > 1 Then

        regEx.Pattern = "( ){2,}"

        If regEx.Test(tx) Then
           tx = regEx.Replace(tx, " ") 'remove excessive space
        End If
        
        tx = Trim(tx)
               
'        regEx.Pattern = "[^A-Z0-9_' ]+"
        regEx.Pattern = "[^" & xP & " ]+" 'exclude xp and space
        If regEx.Test(tx) Then
           tx = regEx.Replace(tx, vbLf) 'replace non words character (excluding space) with new line char (vbLf)
        End If
        
        tx = Replace(tx, vbLf & " ", vbLf & "") 'remove space in the beginning of every line

End If

    Set d = CreateObject("scripting.dictionary")
    d.CompareMode = vbTextCompare

'    regEx.Pattern = Trim(WorksheetFunction.Rept("[A-Z0-9_']+ ", n)) 'match n words (the phrase) separated by a space
    regEx.Pattern = Trim(WorksheetFunction.Rept("[" & xP & "]+ ", n)) 'match n words (the phrase) separated by a space
            Set matches = regEx.Execute(tx)
            
            For Each x In matches
                d(CStr(x)) = d(CStr(x)) + 1 'get phrase frequency
            Next
 
For i = 1 To n - 1
        
        regEx.Pattern = "^[" & xP & "]+ "
        If regEx.Test(tx) Then
           tx = regEx.Replace(tx, "")   'remove first word in each line to get different combination of n words (phrase)

'            regEx.Pattern = Trim(WorksheetFunction.Rept("[A-Z0-9_']+ ", n))
            regEx.Pattern = Trim(WorksheetFunction.Rept("[" & xP & "]+ ", n))
            
            Set matches = regEx.Execute(tx)
            
            For Each x In matches
                d(CStr(x)) = d(CStr(x)) + 1     'get phrase frequency
            Next

        End If
Next

If d.Count = 0 Then MsgBox "Nothing with " & n & " word phrase found": Exit Sub

'rc = Cells(1, Columns.Count).End(xlToLeft).Column
rc = rc + 3
'put the result

With Cells(2, rc).Resize(d.Count, 2)
    
    Select Case d.Count
    Case Is < 65536 'Transpose function has a limit of 65536 item to process
        
        .Value = Application.Transpose(Array(d.Keys, d.Items))
        
    Case Is <= 1048500
        
        ReDim va(1 To d.Count, 1 To 2)
        i = 0
            For Each q In d.Keys
                i = i + 1
                va(i, 1) = q: va(i, 2) = d(q)
            Next
        .Value = va
    
    Case Else
        
        MsgBox "Process is canceled, the result is more than 1048500 rows"
    
    End Select
    
    .Sort Key1:=.Cells(1, 2), Order1:=xlDescending, Key2:=.Cells(1, 1), Order2:=xlAscending, Header:=xlNo
    
End With


Cells(1, rc) = n & " WORD"
Cells(1, rc + 1) = "COUNT"

End Sub

Example:
Book1
ABCDEFGHIJKLMN
1pre-caution this1 WORDCOUNT2 WORDCOUNT3 WORDCOUNTAriel
2warning - blue signblue1blue sign1pre-caution this warning1Adam
3pre-caution1pre-caution this1Sullivan
4sign1this warning1Kellen
5this1Zein
6warning1Giovanni
Sheet4
 
Upvote 0
Solution
Try this one:
VBA Code:
Option Explicit
Dim rc As Long
Sub Word_Phrase_Frequency_v1()

'The code will generate word/phrase frequency
'How to use:
'1. Add reference to "Microsoft VBScript Regular Expressions 5.5" (you need to do it once only):
'   In Visual Basic Editor menu, select Tools –> References, then select Microsoft VBScript Regular Expressions 5.5, then click OK.
'2. Data must be in column A, start at A1
'3. Run Word_Phrase_Frequency_v1


'--- CHANGE sNumber & xPattern VALUE TO SUIT -----------------------------------

Const sNumber As String = "1,2,3"  '"1,2,3"
'sNumber = "1"  will generate 1 word frequency list
'sNumber = "1,2,3"  will generate 1 word, 2 word & 3 word frequency list

Const xPattern As String = "A-Z0-9_'\-"
'define the word characters, the above pattern will include letter, number, underscore & apostrophe as word character
'word with apostrophe such as "you're" counts as one word.
'word with underscore such as "aa_bb" counts as one word.


Const xCol As String = "C:L" 'columns to clear
Dim i As Long, j As Long
Dim txa As String
Dim z, t

t = Timer
Application.ScreenUpdating = False
Range(xCol).Clear
rc = 0
'if there are errors, remove them
On Error Resume Next
Range("A:A").SpecialCells(xlCellTypeFormulas, xlErrors).ClearContents
Range("A:A").SpecialCells(xlConstants, xlErrors).ClearContents
On Error GoTo 0

j = Range("A" & Rows.Count).End(xlUp).Row

If j < 65000 Then
    txa = Join(Application.Transpose(Range("A1", Cells(Rows.Count, "A").End(xlUp))), " ")
Else
    For i = 1 To j Step 65000
    txa = txa & Join(Application.Transpose(Range("A" & i).Resize(65000)), " ") & " "
    Next
End If


z = Split(sNumber, ",")
  
    'TO PROCESS
    For i = LBound(z) To UBound(z)
        Call toProcessY(CLng(z(i)), txa, xPattern)
    Next

Range(xCol).Columns.AutoFit
Application.ScreenUpdating = True

Debug.Print "It's done in:  " & Timer - t & " seconds"

End Sub

Sub toProcessY(n As Long, ByVal tx As String, xP As String)
'phrase frequency

Dim regEx As Object, matches As Object, x As Object, d As Object
Dim i As Long
Dim va, q

tx = Replace(tx, " - ", "|")
        Set regEx = CreateObject("VBScript.RegExp")
        With regEx
            .Global = True
            .MultiLine = True
            .ignorecase = True
        End With

If n > 1 Then

        regEx.Pattern = "( ){2,}"

        If regEx.Test(tx) Then
           tx = regEx.Replace(tx, " ") 'remove excessive space
        End If
      
        tx = Trim(tx)
             
'        regEx.Pattern = "[^A-Z0-9_' ]+"
        regEx.Pattern = "[^" & xP & " ]+" 'exclude xp and space
        If regEx.Test(tx) Then
           tx = regEx.Replace(tx, vbLf) 'replace non words character (excluding space) with new line char (vbLf)
        End If
      
        tx = Replace(tx, vbLf & " ", vbLf & "") 'remove space in the beginning of every line

End If

    Set d = CreateObject("scripting.dictionary")
    d.CompareMode = vbTextCompare

'    regEx.Pattern = Trim(WorksheetFunction.Rept("[A-Z0-9_']+ ", n)) 'match n words (the phrase) separated by a space
    regEx.Pattern = Trim(WorksheetFunction.Rept("[" & xP & "]+ ", n)) 'match n words (the phrase) separated by a space
            Set matches = regEx.Execute(tx)
          
            For Each x In matches
                d(CStr(x)) = d(CStr(x)) + 1 'get phrase frequency
            Next
 
For i = 1 To n - 1
      
        regEx.Pattern = "^[" & xP & "]+ "
        If regEx.Test(tx) Then
           tx = regEx.Replace(tx, "")   'remove first word in each line to get different combination of n words (phrase)

'            regEx.Pattern = Trim(WorksheetFunction.Rept("[A-Z0-9_']+ ", n))
            regEx.Pattern = Trim(WorksheetFunction.Rept("[" & xP & "]+ ", n))
          
            Set matches = regEx.Execute(tx)
          
            For Each x In matches
                d(CStr(x)) = d(CStr(x)) + 1     'get phrase frequency
            Next

        End If
Next

If d.Count = 0 Then MsgBox "Nothing with " & n & " word phrase found": Exit Sub

'rc = Cells(1, Columns.Count).End(xlToLeft).Column
rc = rc + 3
'put the result

With Cells(2, rc).Resize(d.Count, 2)
  
    Select Case d.Count
    Case Is < 65536 'Transpose function has a limit of 65536 item to process
      
        .Value = Application.Transpose(Array(d.Keys, d.Items))
      
    Case Is <= 1048500
      
        ReDim va(1 To d.Count, 1 To 2)
        i = 0
            For Each q In d.Keys
                i = i + 1
                va(i, 1) = q: va(i, 2) = d(q)
            Next
        .Value = va
  
    Case Else
      
        MsgBox "Process is canceled, the result is more than 1048500 rows"
  
    End Select
  
    .Sort Key1:=.Cells(1, 2), Order1:=xlDescending, Key2:=.Cells(1, 1), Order2:=xlAscending, Header:=xlNo
  
End With


Cells(1, rc) = n & " WORD"
Cells(1, rc + 1) = "COUNT"

End Sub

Example:
Book1
ABCDEFGHIJKLMN
1pre-caution this1 WORDCOUNT2 WORDCOUNT3 WORDCOUNTAriel
2warning - blue signblue1blue sign1pre-caution this warning1Adam
3pre-caution1pre-caution this1Sullivan
4sign1this warning1Kellen
5this1Zein
6warning1Giovanni
Sheet4
Thank you so much; it does exactly what I need!
 
Upvote 0
You're welcome, glad to help & thanks for the feedback.:)
 
Upvote 0

Forum statistics

Threads
1,223,888
Messages
6,175,207
Members
452,618
Latest member
Tam84

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top