Count Word and Random Phrase Frequency

Sleeplol

Board Regular
Joined
Apr 10, 2019
Messages
194
Office Version
  1. 365
  2. 2019
Platform
  1. Windows
Hello Everyone,
I'm attempting to count the word frequency as well as two and three word phrases from a column. I'm stuck and would really appreciate some help.

Getting error:
Run-time error 1004.png


On this line:
ErrorLastLine.png


Setup:
A macro populates column A with values from AP101:AP50000 (formulas combining additional row cells into a single cell).
(Problem) Macro populates the count of B,E,H

Runs fine though throttles for about 2 min for 30000ish rows; then bugs out after populating the results.
I've tried referencing the sheet with Worksheets("DashTest").Range on all three, but it still errors.

Below is a quick shot of what it looks like, as well as the entire script.
Thanks for any help

WordCountDisplay.png



VBA Code:
Sub test()
Dim a, e, s, Ignore As String, temp, x
Dim dic As Object, dic2 As Object, dic3 As Object
Set dic = CreateObject("Scripting.Dictionary")
dic.CompareMode = 1
Set dic2 = CreateObject("Scripting.Dictionary")
dic2.CompareMode = 1
Set dic3 = CreateObject("Scripting.Dictionary")
dic3.CompareMode = 1
With Range("m1").CurrentRegion.Offset(1)
Ignore = Join(Application.Transpose(.Resize(.Rows.Count - 1).Value), Chr(2))
End With
a = Range("a1", Range("a" & Rows.Count).End(xlUp)).Value
With CreateObject("VBScript.RegExp")
.Global = True: .IgnoreCase = True
.Pattern = "([\$\(\)\-\^\|\\\[\]\*\+\?\.])"
Ignore = "[^\w ]|\b(" & Replace(.Replace(Ignore, "\$1"), Chr(2), "|") & ")\b"
For Each e In a
If e <> "" Then
x = GetSentense(e, 2)
If IsArray(x) Then
For Each s In x
dic2(s) = dic2(s) + 1
Next
End If
x = GetSentense(e, 3)
If IsArray(x) Then
For Each s In x
If s <> "" Then dic3(s) = dic3(s) + 1
Next
End If
.Pattern = Ignore
temp = Application.Trim(.Replace(e, ""))
For Each s In Split(temp)
If s <> "" Then dic(s) = dic(s) + 1
Next
End If
Next
End With
Range("b2").Resize(dic.Count, 2).Value = _
Application.Transpose(Array(dic.keys, dic.items))
Range("e2").Resize(dic2.Count, 2).Value = _
Application.Transpose(Array(dic2.keys, dic2.items))
Range("h2").Resize(dic3.Count, 2).Value = _
Application.Transpose(Array(dic3.keys, dic3.items))
End Sub
Function GetSentense(ByVal txt As String, myStep)
Dim i As Long, ii As Long, temp, x
On Error Resume Next
x = Split(txt): ReDim temp(UBound(x) - myStep)
If Err Then GetSentense = Empty: Exit Function
On Error GoTo 0
For i = 0 To UBound(x) - myStep - 1
For ii = 1 To myStep
temp(i) = Trim$(temp(i) & " " & x(i + ii - 1))
Next
Next
GetSentense = temp
End Function
 

Excel Facts

Repeat Last Command
Pressing F4 adds dollar signs when editing a formula. When not editing, F4 repeats last command.
@Akuini, I used the code from your post #13. How do i make it to count the word with the hyphen? "-" So, if the something with a hyphen will be treated as one word. I have some word like o-ring, or v-side
i would want them to be counted on the one word category.
Much appreciate your help.
 
Upvote 0
I'm glad the code is useful for you . :)


@Akuini, I used the code from your post #13. How do i make it to count the word with the hyphen? "-" So, if the something with a hyphen will be treated as one word. I have some word like o-ring, or v-side
i would want them to be counted on the one word category.
Much appreciate your help.
 
Upvote 0
How do i make it to count the word with the hyphen? "-" So, if the something with a hyphen will be treated as one word.

Hi, Minhwins
Please try using a small sample first, so you can manually check whether the results are correct.
Run Sub regexPhraseFrequency2.

VBA Code:
Sub regexPhraseFrequency2()
'The code will generate word/phrase frequency
'Data must be in column A, start at A1
'sNumber = "1,2,3"  means it generates 3 frequency list: single word & 2 word phrase & 3 word phrase, you can change that to suit.
'Word with apostrophe such as "you're" is counted as one word.
'Word with underscore such as "aa_bb" is counted as one.
'Word with hyphen such as "aa-bb" is counted as one.
'Tested on text (from a novel) with 16.500 rows, contains 161K words (12600 unique words) with sNumber = "1,2,3", it took 10.5 seconds

Dim i As Long, j As Long
Dim sNumber As String, txa As String
Dim z, t

t = Timer
Application.ScreenUpdating = False
Range("C:Z").ClearContents
j = Range("A" & Rows.Count).End(xlUp).Row

If j < 65000 Then
    txa = Join(Application.Transpose(Range("A1", Cells(Rows.Count, "A").End(xlUp))), vbLf)
Else
    For i = 1 To j Step 65000
    txa = txa & Join(Application.Transpose(Range("A" & i).Resize(65000)), vbLf) & vbLf
    Next
End If


'--- CHANGE sNumber VALUE TO SUIT -----------------------------------
sNumber = "1,2"   'list each number of words in a phrase, change to suit
                    'sNumber = "1"  will generate 1 frequency list of single word
                    'sNumber = "1,2"  will generate 2 frequency list: single word & 2 word phrase

z = Split(sNumber, ",")
    
    'TO PROCESS
    For i = LBound(z) To UBound(z)
        Call toProcessY(CLng(z(i)), txa)
'        Call toProcessx(CLng(z(i)), txa)
    Next

Range("C:Z").Columns.AutoFit
Application.ScreenUpdating = True

Debug.Print Timer - t

End Sub

Sub toProcessY(n As Long, ByVal tx As String)
'phrase frequency

Dim regEx As Object, matches As Object, x As Object, d As Object
Dim i As Long, rc As Long
Dim xp As String
Dim va, q

    
        Set regEx = CreateObject("VBScript.RegExp")
        With regEx
            .Global = True
            .MultiLine = True
            .IgnoreCase = True
        End With

'define the word characters
'regex pattern, you can add/remove any character as a word character
xp = "A-Z0-9_'\-"

If n > 1 Then
        
        regEx.Pattern = "( ){2,}"
    
        If regEx.test(tx) Then
           tx = regEx.Replace(tx, " ") 'remove excessive space
        End If
        
        tx = Trim(tx)
        
'        regEx.Pattern = "[^A-Z0-9_' ]+"
        regEx.Pattern = "[^" & xp & " ]+"
        If regEx.test(tx) Then
           tx = regEx.Replace(tx, vbLf) 'replace non words character with new line char
        End If
        
        tx = Replace(tx, vbLf & " ", vbLf & "") 'remove space in the beginning of every line

End If

    
    Set d = CreateObject("scripting.dictionary")
    d.CompareMode = vbTextCompare

'    regEx.Pattern = Trim(WorksheetFunction.Rept("[A-Z0-9_']+ ", n)) 'match n words (the phrase) separated by a space
    regEx.Pattern = Trim(WorksheetFunction.Rept("[" & xp & "]+ ", n)) 'match n words (the phrase) separated by a space
            Set matches = regEx.Execute(tx)
            
            For Each x In matches
                d(CStr(x)) = d(CStr(x)) + 1 'get phrase frequency
            Next
 
 
 For i = 1 To n - 1
        
'        regEx.Pattern = "^[A-Z0-9_']+ "
        regEx.Pattern = "^[" & xp & "]+ "
        If regEx.test(tx) Then
           tx = regEx.Replace(tx, "")   'remove first word in each line to get different combination of n words (phrase)

'            regEx.Pattern = Trim(WorksheetFunction.Rept("[A-Z0-9_']+ ", n))
            regEx.Pattern = Trim(WorksheetFunction.Rept("[" & xp & "]+ ", n))
            
            Set matches = regEx.Execute(tx)
            
            For Each x In matches
                d(CStr(x)) = d(CStr(x)) + 1     'get phrase frequency
            Next

        End If
 
Next

If d.Count = 0 Then MsgBox "Nothing with " & n & " word phrase found": Exit Sub

rc = Cells(1, Columns.Count).End(xlToLeft).Column
'put the result in col D:E

With Cells(2, rc + 2).Resize(d.Count, 2)
    
    If d.Count < 65536 Then 'Transpose function has a limit of 65536 items to process
        
        .Value = Application.Transpose(Array(d.keys, d.items))
        
    ElseIf d.Count <= 1048500 Then
        
        ReDim va(1 To d.Count, 1 To 2)
        i = 0
            For Each q In d.keys
                i = i + 1
                va(i, 1) = q: va(i, 2) = d(q)
            Next
        .Value = va
    Else
        MsgBox "The process is canceled, because the result is more than 1048500 rows"
    End If
    
    'get the apostrophe back
'    .Replace What:="___", Replacement:="'", lookAt:=xlPart, SearchFormat:=False, ReplaceFormat:=False
    .Sort Key1:=.Cells(1, 2), Order1:=xlDescending, Key2:=.Cells(1, 1), Order2:=xlAscending, Header:=xlNo
    
End With


Cells(1, rc + 2) = n & " WORD"
Cells(1, rc + 3) = "FREQUENCY"

End Sub
 
Upvote 0
Hi, first post and an absolute beginner so please bare with me :) I have a list of around 28k rows in a single column which are names of organisations and I’ve been able to extract the volume of common words from this forum, which is great. The purpose is to extract these common words to leave me with key identifying words of an organisation.

Example The Blue Sky Company

Common Words: The, Blue, Company

Result of key identifier: Sky

However I might have an issue that a company is called The Blue Company. That would return me with a nil key identifier as their name is made up from my common words list.

Using the common words list against the 28k organisation names, how can I identify any organisations which are made up of common words only.

Many thanks for your help
 
Upvote 0
@Waves2021 Welcome to the Forum
Could you explain what you mean by common word? and how did you get it?
How can you define "The, Blue, Company" as common words while "Sky" is a key identifier?
 
Upvote 0
Hi @Akuini, thank you for getting in touch.

Sorry I probably didn’t explain myself too well. From the list of 28k organisations I’m looking for repeating words e.g The, &, Company, etc. From this I’m ordering the list of frequent/repeating/common words extracted by how many times they appear from the original 28k list e.g ‘The’ appears 6,356 times. I have a threshold of what I would deem as common - any word which appears more than 200 times. I will then use this list to omit the common words whilst I undertake text similarity (in other software) against a form and what we hold in our database. By omitting the common words it allows a more efficient algorithm for text similarity.

I’ve successfully been able to extract the most frequent or common words by number (from this forum chat) however my problem remains if an organisations name is made up by all common words. I want to exclude this when I undertake text similarity. What I would like to do is create a separate list of any organisations made up entirely of words from my ‘common words list’. So I’m possibly looking to search column D (list of common words) against column A and identify if any rows are made entirely of words in column D. I will then want to extract this as a separate column/list.

I hope that makes sense and thanks again
 
Upvote 0
Are spaces the only character separating words in organizations names? Or it could be other characters such as comma, period etc.
Example: Mark, Louis & Partners
 
Upvote 0
Are spaces the only character separating words in organizations names? Or it could be other characters such as comma, period etc.
Example: Mark, Louis & Partners
Are spaces the only character separating words in organizations names? Or it could be other characters such as comma, period etc.
Example: Mark, Louis & Partners
So looking at spaces, ampersand, comma, hyphen, period and possibly colon.
 
Upvote 0

Forum statistics

Threads
1,225,738
Messages
6,186,725
Members
453,368
Latest member
positivemind

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top