# MSWord Delete duplicate words in selected text



## mole999 (Jun 15, 2016)

As per the title, I'm looking for a bit of code so that selected text will have all duplicate words removed ONLY in the selection, effectively left with a word cloud, even better would to be have an extensible static list in the VBA to delete whole words, and any word less than three characters so (the, if, as) would all go as well, icing on the cake, then sort the words alphabetically so similar words can be identified quickly

not asking much am I, nothing I have seen on the net gets close

ONLY SELECTED
LESS THAN FOUR CHARACTERS
EXCLUDED WORDS
SORTED


----------



## mole999 (Jun 16, 2016)

So I have made a fudge to this, via splitting and sorting as paragraphs

I want to loop a set of redundant words

How best to build an array for *How And Its*, with a space either side to feed into the code as a loop


```
With Selection.Find
        .Text = " How "
        .Replacement.Text = " "
        .Forward = True
        .Wrap = wdFindAsk
        .Format = False
        .MatchCase = True
        .MatchWholeWord = True
        .MatchWildcards = False
        .MatchSoundsLike = False
        .MatchAllWordForms = False
    End With
    Selection.Find.Execute Replace:=wdReplaceAll


    With Selection.Find
        .Text = " its "
        .Replacement.Text = " "
        .Forward = True
        .Wrap = wdFindAsk
        .Format = False
        .MatchCase = True
        .MatchWholeWord = True
        .MatchWildcards = False
        .MatchSoundsLike = False
        .MatchAllWordForms = False
    End With
    Selection.Find.Execute Replace:=wdReplaceAll
```


----------



## mole999 (Jun 17, 2016)

My hashed attempt at resolving my issue, using macro recorder and some items from the web
1. Manually copy all the text I want to look at to a new document
2. Select the story, change case to proper, turn off alerts, set the font and size, remove bullets etc
3. Remove punctuation
4. Replace all spaces with a paragraph mark ^p
5. sort by paragraph
6. DeleteDuplicateParagraphs


```
Sub DeleteDuplicateParagraphs()
    'PURPOSE: Remove Duplicate Paragraphs Throughout the Entire Word Document
    'SOURCE: www.TheSpreadsheetGuru.com/the-code-vault
    Dim p1 As Paragraph
    Dim p2 As Paragraph
    Dim DupCount As Long

    DupCount = 0

    For Each p1 In ActiveDocument.Paragraphs
        If p1.Range.Text <> vbCr Then    'Ignore blank paragraphs

            For Each p2 In ActiveDocument.Paragraphs
                If p1.Range.Text = p2.Range.Text Then
                    DupCount = DupCount + 1
                    If p1.Range.Text = p2.Range.Text And DupCount > 1 Then p2.Range.Delete
                End If
            Next p2

        End If

        'Reset Duplicate Counter
        DupCount = 0

    Next p1

End Sub
```

7. remove ^p and replace with space
8. Delete high volume words "And It Was When" etc (need a loop to be efficient)
9. Remove double spaces

nick it, use it, improve it


----------



## Macropod (Jun 17, 2016)

mole999 said:


> As per the title, I'm looking for a bit of code so that selected text will have all duplicate words removed ONLY in the selection, effectively left with a word cloud, even better would to be have an extensible static list in the VBA to delete whole words, and any word less than three characters so (the, if, as) would all go as well, icing on the cake, then sort the words alphabetically so similar words can be identified quickly
> 
> not asking much am I, nothing I have seen on the net gets close
> 
> ...


Deleting the duplicates is trivial, as is compiling a sorted list, but which words go into the list - all of the selection, the duplicates, something else? Where does the list go?


----------



## mole999 (Jun 17, 2016)

I want something like, 
the words to be removed will be added to the code "Been, By, Call, But" type stuff

I lifted a bit of yours from msofficeforums for dealing with punctuation back in 2011

MyArray

    With Selection.Find
        .Text = MyArray
        .Replacement.Text = " "
        .Forward = True
        .Wrap = wdFindAsk
        .Format = False
        .MatchCase = True
        .MatchWholeWord = True
        .MatchWildcards = False
        .MatchSoundsLike = False
        .MatchAllWordForms = False
    End With
    Selection.Find.Execute Replace:=wdReplaceAll

The list will be static (in the VBA) until I update what is in the list


----------



## Macropod (Jun 17, 2016)

To delete a series of words you could use something like:

```
Sub Demo()
Application.ScreenUpdating = False
Dim StrFnd As String, StrRep As String, i As Long
StrFnd = "One,Two,Three,Four,Five,Six"
With ActiveDocument.Range.Find
  .ClearFormatting
  .Replacement.ClearFormatting
  .Forward = True
  .Wrap = wdFindContinue
  .MatchCase = True
  .MatchWholeWord = True
  .MatchWildcards = False
  .MatchSoundsLike = False
  .MatchAllWordForms = False
  .Replacement.Text = ""
  For i = 0 To UBound(Split(StrFnd, ","))
    .Text = Split(StrFnd, ",")(i)
    .Execute Replace:=wdReplaceAll
  Next
End With
Application.ScreenUpdating = True
End Sub
```
But I can't see what this has to do with duplicates or adding items to a list.


----------



## mole999 (Jun 17, 2016)

I'm trying to take

This book is provided on the worldwide web as a service to the community of practitioners and students. Reproduction for educational purposes is permitted with appropriate citation. If you find this work helpful or have suggestions for additions or corrections, please email Chris Hendrickson: cth@cmu.edu. A hardcopy Instructor's Manual with problem solutions is available for a fee of $ 10 to cover reproduction, mailing and handling. Send a check made out to Carnegie Mellon University to Ms. Patty Langer, Department of Civil and Environmental Engineering, Carnegie Mellon University, Pittsburgh, PA 15213.

This book develops a specific viewpoint in discussing the participants, the processes and the techniques of project management for construction. This viewpoint is that of owners who desire completion of projects in a timely, cost effective fashion. Some profound implications for the objectives and methods of project management result from this perspective: 
•The "life cycle" of costs and benefits from initial planning through operation and disposal of a facility are relevant to decision making. An owner is concerned with a project from the cradle to the grave. Construction costs represent only one portion of the overall life cycle costs. 
•Optimizing performance at one stage of the process may not be beneficial overall if additional costs or delays occur elsewhere. For example, saving money on the design process will be a false economy if the result is excess construction costs. 
•Fragmentation of project management among different specialists may be necessary, but good communication and coordination among the participants is essential to accomplish the overall goals of the project. New information technologies can be instrumental in this process, especially the Internet and specialized Extranets. 
•Productivity improvements are always of importance and value. As a result, introducing new materials and automated construction processes is always desirable as long as they are less expensive and are consistent with desired performance. 
•Quality of work and performance are critically important to the success of a project since it is the owner who will have to live with the results. 

to

Accomplish Additional Additions Always Among Appropriate Are Automated Available Beneficial Benefits Book Can Carnegie Check Chris Citation. Civil Communication Community Completion Concerned Consistent Construction Construction. Coordination Corrections, Cost Costs Costs. Cover Cradle Critically Cth@Cmu.Edu. Cycle Cycle" Decision Delays Department Design Desirable Desire Desired Develops Different Discussing Disposal Economy Educational Effective Elsewhere. Email Engineering, Environmental Especially Essential Example, Excess Expensive Extranets. Facility False Fashion. Fee Find Fragmentation Goals Good Grave. Handling. Hardcopy Helpful Hendrickson: If Implications Importance Important Improvements Information Initial Instructor's Instrumental Internet Introducing Is Langer, Less Life Live Long Made Mailing Making. Management Manual Materials May Mellon Methods Money Ms. Necessary, New Objectives Occur One Only Operation Optimizing Out Overall Owner Owners PA Participants Participants, Patty Performance Performance. Permitted Perspective: Pittsburgh, Planning Please Portion Practitioners Problem Process Process, Processes Productivity Profound Project Project. Projects Provided Purposes Quality Relevant Represent Reproduction Reproduction, Result Result, Results. Saving Send Service Since Solutions Some Specialists Specialized Specific Stage Students. Success Suggestions Techniques Technologies Through Timely, University University, Value. Viewpoint Web Who Will Work Worldwide You 

The list will remove "You,Out,Who,Will" etcetera


----------



## Macropod (Jun 17, 2016)

So, do you want to delete all words of 3 characters or less, or only words from a predefined list?


----------



## mole999 (Jun 17, 2016)

I would like to know how to delete three characters or less within the module, thanx


----------



## Macropod (Jun 17, 2016)

Try the following - I've included code for both deleting items from an exclusion list and deleting words of 3 characters or less. There's some overlap, but you can delete/comment-out whichever portions you don't need.

```
Sub ListBuilder()
Application.ScreenUpdating = False
Dim StrIn As String, StrTmp As String, StrExcl As String, StrOut() As String
Dim i As Long, j As Long
'Define the exlusions list
StrExcl = "A,Am,An,And,Are,As,At,B,Be,But,By,C,Can,Cm,D,Did," & _
          "Do,Does,E,Eg,En,Eq,Etc,F,For,G,Get,Go,Got,H,Has,Have," & _
          "He,Her,Him,How,I,Ie,If,In,Into,Is,It,Its,J,K,L,M,Me," & _
          "Mi,Mm,My,N,Na,Nb,No,Not,O,Of,Off,Ok,On,One,Or,Our,Out," & _
          "P,Q,R,Re,S,She,So,T,The,Their,Them,They,This,T,To,U,V," & _
          "Via,Vs,W,Was,We,Were,Who,Will,With,Would,X,Y,Yd,You,Your,Z"
With ActiveDocument.Range
  'Delete words of 3 characters or less
  With .Find
    .ClearFormatting
    .Replacement.ClearFormatting
    .Text = "<[! ]{1,3}>"
    .Replacement.Text = ""
    .Forward = True
    .Wrap = wdFindContinue
    .Format = False
    .MatchWildcards = True
    .Execute Replace:=wdReplaceAll
  End With
  'Capitalise the first letter of each word
  .Case = wdTitleWord
  'Get the document's text
  StrIn = .Text
  'Strip out unwanted characters. Amongst others, hyphens and formatted single quotes are retained at this stage
  For i = 1 To 255
    Select Case i
      'To strip out unwanted characters
      Case 1 To 38, 40 To 63, 91 To 96, 123 To 144, 147 To 149, 152 To 171, 174 To 191, 247
      StrIn = Replace(StrIn, Chr(i), " ")
    End Select
  Next
  'Convert smart single quotes to plain single quotes & delete any at the start/end of a word
  StrIn = Replace(Replace(Replace(Replace(StrIn, Chr(145), "'"), Chr(146), "'"), "' ", " "), " '", " ")
  'Process the exclusions list
  For i = 0 To UBound(Split(StrExcl, ","))
    While InStr(StrIn, " " & Split(StrExcl, ",")(i) & " ") > 0
      StrIn = Replace(StrIn, " " & Split(StrExcl, ",")(i) & " ", " ")
    Wend
  Next
  'Clean up any duplicate spaces
  While InStr(StrIn, "  ") > 0
    StrIn = Replace(StrIn, "  ", " ")
  Wend
  StrIn = " " & Trim(StrIn) & " "
  j = UBound(Split(StrIn, " "))
  For i = 1 To j
    StrTmp = Split(StrIn, " ")(1)
    While InStr(StrIn, " " & StrTmp & " ") > 0
      StrIn = Replace(StrIn, " " & StrTmp & " ", " ")
    Wend
    'Update the output array
    ReDim Preserve StrOut(i - 1)
    StrOut(i - 1) = StrTmp
    If UBound(Split(StrIn, " ")) = 1 Then Exit For
  Next
  WordBasic.SortArray StrOut()
  .Text = Join(StrOut, ", ")
End With
Application.ScreenUpdating = True
End Sub
```


----------



## mole999 (Jun 15, 2016)

As per the title, I'm looking for a bit of code so that selected text will have all duplicate words removed ONLY in the selection, effectively left with a word cloud, even better would to be have an extensible static list in the VBA to delete whole words, and any word less than three characters so (the, if, as) would all go as well, icing on the cake, then sort the words alphabetically so similar words can be identified quickly

not asking much am I, nothing I have seen on the net gets close

ONLY SELECTED
LESS THAN FOUR CHARACTERS
EXCLUDED WORDS
SORTED


----------



## mole999 (Jun 17, 2016)

Thank you very much, I would never have gotten there


----------



## Macropod (Jun 17, 2016)

I'm sure you'd have gotten there in the end - even if not by quite the same route.

I've found a small glitch - the code deletes email & web addresses, but leaves behind the @ from the email addresses. You can delete those by changing:
40 To 63
to:
40 To 64
If you need, I could do some more work to retain the email & web addresses.


----------



## mole999 (Jun 17, 2016)

No, it was purely a way of sifting large amounts of text to get the key words to support another piece of work, whilst out of context, I can now see groups of similar words, and make a decision to include or exclude. Emails wouldn't be an issue, I just grabbed some free text off the web as a demonstration


----------



## Macropod (Jun 21, 2016)

FWIW, in case anyone else needs to work with a document containing email and/or hyperlink addresses, the following version extracts those too (but not the hyperlink display text).


```
Sub ListBuilder()
Application.ScreenUpdating = False
Dim StrIn As String, StrTmp As String, StrExcl As String, StrOut() As String
Dim i As Long, j As Long
'Define the exlusions list
StrExcl = "A,Am,An,And,Are,As,At,B,Be,But,By,C,Can,Cm,D,Did," & _
          "Do,Does,E,Eg,En,Eq,Etc,F,For,G,Get,Go,Got,H,Has,Have," & _
          "He,Her,Him,How,I,Ie,If,In,Into,Is,It,Its,J,K,L,M,Me," & _
          "Mi,Mm,My,N,Na,Nb,No,Not,O,Of,Off,Ok,On,One,Or,Our,Out," & _
          "P,Q,R,Re,S,She,So,T,The,Their,Them,They,This,T,To,U,V," & _
          "Via,Vs,W,Was,We,Were,Who,Will,With,Would,X,Y,Yd,You,Your,Z"
With ActiveDocument.Range
  'Convert email & web addresses to hyperlinks, then capture & delete
  .AutoFormat
  While .Hyperlinks.Count > 0
    With .Hyperlinks(1)
      StrTmp = StrTmp & " " & .Address
      .Delete
    End With
  Wend
  'Delete words of 3 characters or less
  With .Find
    .ClearFormatting
    .Replacement.ClearFormatting
    .Text = "<[! ]{1,3}>"
    .Replacement.Text = ""
    .Forward = True
    .Wrap = wdFindContinue
    .Format = False
    .MatchWildcards = True
    .Execute Replace:=wdReplaceAll
  End With
  'ActiveWindow.View.ShowFieldCodes = False
  'Capitalise the first letter of each word
  .Case = wdTitleWord
  'Get the document's text
  StrIn = .Text
  'Strip out unwanted characters. Amongst others, hyphens and formatted single quotes are retained at this stage
  For i = 1 To 255
    Select Case i
      'To strip out unwanted characters
      Case 1 To 38, 40 To 64, 91 To 96, 123 To 144, 147 To 149, 152 To 171, 174 To 191, 247
      StrIn = Replace(StrIn, Chr(i), " ")
    End Select
  Next
  'Convert smart single quotes to plain single quotes & delete any at the start/end of a word
  StrIn = Replace(Replace(Replace(Replace(StrIn, Chr(145), "'"), Chr(146), "'"), "' ", " "), " '", " ")
  'Process the exclusions list
  For i = 0 To UBound(Split(StrExcl, ","))
    While InStr(StrIn, " " & Split(StrExcl, ",")(i) & " ") > 0
      StrIn = Replace(StrIn, " " & Split(StrExcl, ",")(i) & " ", " ")
    Wend
  Next
  'Incorporate email & web addresses
  StrIn = StrIn & StrTmp
  'Clean up any duplicate spaces
  While InStr(StrIn, "  ") > 0
    StrIn = Replace(StrIn, "  ", " ")
  Wend
  StrIn = " " & Trim(StrIn) & " "
  j = UBound(Split(StrIn, " "))
  For i = 1 To j
    StrTmp = Split(StrIn, " ")(1)
    While InStr(StrIn, " " & StrTmp & " ") > 0
      StrIn = Replace(StrIn, " " & StrTmp & " ", " ")
    Wend
    'Update the output array
    ReDim Preserve StrOut(i - 1)
    StrOut(i - 1) = StrTmp
    If UBound(Split(StrIn, " ")) = 1 Then Exit For
  Next
  WordBasic.SortArray StrOut()
  .Text = Join(StrOut, ", ")
  'Restore email & web address hyperlinks
  .AutoFormat
  .Style = wdStyleNormal
End With
Application.ScreenUpdating = True
End Sub
```


----------

