VBA routine to perform Text Diff

acatalux

New Member
Joined
Jul 19, 2013
Messages
9
Hello everybody, I am new to the forum but I have been an avid reader of your discussions.
As always stated by common forum rules of behavior, don't start a new post if you can already find the answer in other's discussions.

Well, this is the first time I need to start a new post, because it seems the answer is nowhere on the internet.

I am looking to implement the "Compare Documents" function (available in Word 2010) in Excel 2010 through VBA programming between different text contained in two cells.
In Word this function works quite well (not perfectly), but it highlights in different ways which part has been deleted and which one has been added between an "original" document and a "revised" one.
For the nature of my job, I need to do this on a daily basis and I used to output text from Excel to Word, then compare the two text, and then copy it back to Excel.

Here comes the problem: since in Word the text is formatted (and what I'm looking for is formatted/highlighted text as output), I can't just paste it in Excel as it is: any editing, merging, splitting done on the pasted text (that eventually I need to do) makes the formatting disappear (above all with VBA functions, that can only output data and can't format it).

So, browsing the internet I found an implementation of the LCS (Longest Common Subsequence) algorithm written for VBA: you can find it here (for what I see a very neat job by Travis Hydzik). I looked at the strings function only, even though the author provides functions for arrays as well.

I started to tailor his script to my needs, and at first I claimed victory. But as I applied the code to other situations, I couldn't get the right answer.
Anyway, the behaviour that I expect is exactly the same that is provided by the "Compare Documents" in Word 2010.

In other words: given two cells containing different text, I would like to be able to fill a third cell with text formatted accordingly to the TextDiff output between the two original cell.

E.g.:
INPUT:
Cells(1,1).Value2 = "my name is Andrea and I like jogging" (original)
Cells(1,2).Value2 = "my name is Giovanni and I like running" (revised)

OUTPUT:
Cells(1,3) wll contain: "my name is <strike>Andrea</strike>Giovanni and I like <strike>jogging</strike>running"

Obviously, since UDF doesn't allow formatting of cells, I would need to adjust the main Sub for each pair of document I have to revise, but that won't be the problem: what I need is the engine. It's been two years and a half that I do advanced VBA programming at work but it looks like I can't grasp the rationale behind the LCS algorithm.

Any suggestion, hint, link or other resource is more than welcome.
I hope you could help me find a solution to this (considering it is quite useful anyhow).

Best regards,
Andrea.

P.S.: I don't post any code since I used the code "as is" developed by the guy as the engine/core and I edited the example Sub to my needs, converting "+", "-" and "=" to text formatting using the .Characters object.
 

Excel Facts

Repeat Last Command
Pressing F4 adds dollar signs when editing a formula. When not editing, F4 repeats last command.
Perhaps something like this
Code:
Sub test()
    Dim Delimiter As String
    Dim aString As String, aWords As Variant
    Dim bString As String, bWords As Variant
    Dim aStart() As Long, aLength() As Long
    Dim bStart() As Long, bLength() As Long
    Dim outString As String, outWords As Variant
    Dim Low As Long, High As Long
    Dim outCell As Range
    Dim i As Long, pointer As Long
    
    Delimiter = " "
    aString = "my name is Andrea and I like jogging"
    bString = "my name is Giovanni and I like running"
    Set outCell = Range("P1")
    
    aWords = Split(WorksheetFunction.Trim(aString), Delimiter)
    bWords = Split(WorksheetFunction.Trim(bString), Delimiter)
    
    High = UBound(aWords)
    If UBound(bWords) < UBound(aWords) Then High = UBound(bWords)
    ReDim aStart(1 To High): ReDim aLength(1 To High)
    ReDim bStart(1 To High): ReDim bLength(1 To High)
    For i = 0 To High
        outString = outString & Delimiter & aWords(i)
        If aWords(i) <> bWords(i) Then
            pointer = pointer + 1
            aStart(pointer) = Len(outString) - Len(aWords(i)) 
            aLength(pointer) = Len(aWords(i))
            bStart(pointer) = Len(outString) - Len(Delimiter) + Len(Delimiter)
            bLength(pointer) = Len(bWords(i))
            outString = outString & bWords(i)
        End If
    Next i
    outString = Mid(outString, Len(Delimiter) + 1)
    With outCell
        .Value = outString
        .Font.ColorIndex = xlAutomatic
        
        
        For i = 1 To High + 1
            If aStart(i) = 0 Then
                Exit For
            Else
                With .Characters(aStart(i), aLength(i))
                    .Font.Color = RGB(255, 0, 0)
                    .Font.StrikeThrough = True
                End With
                With .Characters(bStart(i), bLength(i)).Font
                    .Color = RGB(0, 255, 0)
                    .Underline = True
                End With
            End If
        Next i
    End With
End Sub
 
Upvote 0
Thank you very much sir for you quick reply.
I tested it against a common scenario but the logic is not exactly the same found in Word 2010.

I'll give you another example, with the expected output and your output.

Original: "Il presente documento si applica alla "Direttrice Torino - Padova" limitatamente ai sistemi di sicurezza e segnalamento. Potrà essere preso a riferimento per progetti analoghi quale il "Nodo di Genova".

Revised: "Il presente documento si applica alle funzionalità proprie dei sistemi ACCM. Nella presente specifica vengono inserite anche descrizioni ed eventuali requisiti sia di architettura/prodotto che di supervisione della circolazione ritenuti necessari per la completezza della specifica stessa".

Expected output: "Il presente documento si applica alle funzionalità proprie dei<strike>alla "Direttrice Torino - Padova" limitatamente ai</strike> sistemi ACCM. Nella presente specifica vengono inserite anche descrizioni ed eventuali requisiti sia di architettura/prodotto che di supervisione della circolazione ritenuti necessari<strike>sicurezza e segnalamento. Potrà essere preso a riferimento</strike> per la completezza della specifica stessa<strike>progetti analoghi quale il "Nodo di Genova".</strike>

Your output: "Il presente documento si applica <strike>alla</strike>alle <strike>Direttrice</strike>funzionalità <strike>Torino</strike>proprie <strike>-</strike>dei <strike>Padova"</strike>sistemi <strike>limitatamente</strike>ACCM. <strike>ai</strike>Nella <strike>sistemi</strike>presente <strike>di</strike>specifica <strike>sicurezza</strike>vengono <strike>e</strike>inserite <strike>segnalamento.</strike>anche <strike>Potrà</strike>descrizioni <strike>essere</strike>ed <strike>preso</strike>eventuali <strike>a</strike>requisiti <strike>riferimento</strike>sia <strike>per</strike>di <strike>progetti</strike>architettura/prodotto <strike>analoghi</strike>che <strike>quale</strike>di <strike>il</strike>supervisione <strike>"Nodo</strike>della <strike>di</strike>circolazione <strike>Genova"</strike>ritenuti"

I know that being in Italian could prove harder to understand, but I think I give the idea.

About the code: instead of having a word by word comparison it would be optimal to group crossed word and underlined word to form crossed sentences and underlined sentences, like the example. Another thing is that, as soon as it finds the first discrepancy, it starts differentiating words and formatting. In the expected results it splits sentences in a more syntactic way (it lets sentences to end, it respects commas, etc..). Last thing is that doesn't show remaining words (added or removed) from the longest sentence.

Moreover it gives error in a case like this:

Original: "oggi non ho fatto il bidet"

Revised:
"mario guarda che oggi sono esplosi i bidet"

Expected Output: "<ins cite="mailto:Author" datetime="2013-07-19T19:45">mario guarda che </ins>oggi <ins cite="mailto:Author" datetime="2013-07-19T19:45">sono esplosi i</ins><del cite="mailto:Author" datetime="2013-07-19T19:45">non ho fatto il</del> bidet"

Anyway your code is excellent, very fast and it might do the job for me, now it's just a matter of trying to get to the expected output if we wish.
I don't know if that could need a radically different approach, a more timewasting one.

Thank you very much for your time.

Andrea
 
Upvote 0
Something else to add to the conversation.

Output with my actual script.

Original: "oggi non ho fatto il bidet"

Revised: "mario guarda che oggi sono esplosi i bidet"

Output: "mario guarda che oggi <s style="color: rgb(255, 0, 0);">n</s>son<s> h</s>o <s>fatt</s>esplosi i<s>l</s> bidet"

It retains all the words, but it works at character level, so that splits words in letter and attach them in a non meaningful way.

For your information I post the code I use right now, based on the engine I found on the provided link. I wrote exampleString() ​calling the existing functions (in the code box after the main sub).

Code:
Option Explicit
Sub exampleString()


    Dim arr() As Long
    
    Dim String1 As String
    Dim String2 As String
    Dim Target As Range
    
    String1 = "oggi non ho fatto il bidet"
    String2 = "mario guarda che oggi sono esplosi i bidet"
    
    arr = longestCommonSubsequence(String1, String2)
    
    Dim s As String, t As String
    s = backTraceUp(arr, String1, String2, Len(String1), Len(String2))
    t = backTraceLeft(arr, String1, String2, Len(String1), Len(String2))
    Dim a As String, B As String
    a = getDiff(arr, String1, String2, Len(String1), Len(String2))
    
    
    Set Target = ActiveSheet.Cells(4, 3)
    Dim i As Integer
    For i = 2 To Len(a) Step 2
        Target.Value2 = Target.Value2 & Mid(a, i - 1, 1)
    Next i
    For i = 2 To Len(a) Step 2
        Select Case Mid(a, i, 1)
        Case "-"
            With Target.Characters(i / 2, 1).Font
                .Strikethrough = True
                .Color = RGB(255, 0, 0)
                .Bold = False
            End With
        Case "+"
            With Target.Characters(i / 2, 1).Font
                .Color = RGB(0, 128, 128)
                .Bold = False
                .Underline = xlUnderlineStyleSingle
            End With
        End Select
    Next i
    Dim brr() As Long
End Sub
Public Function longestCommonSubsequence(ByRef String1 As String, ByRef String2 As String) As Long()
    If String1 = vbNullString Or String2 = vbNullString Then
        Exit Function
    End If
 
    Dim num() As Long
    
    'define the array, note rows of zeros get added to front automatically
    ReDim num(Len(String1), Len(String2))
    
    Dim i As Long, j As Long
    
    For i = 1 To Len(String1)
        For j = 1 To Len(String2)
            If Mid$(String1, i, 1) = Mid$(String2, j, 1) Then
                num(i, j) = num(i - 1, j - 1) + 1
            Else
                num(i, j) = max(num(i - 1, j), num(i, j - 1))
            End If
        Next j
    Next i


    longestCommonSubsequence = num
End Function
'back traces c, defaulting in the up direction
Public Function backTraceUp(ByRef C() As Long, ByRef String1 As String, ByRef String2 As String, ByRef i As Long, ByRef j As Long) As String
    If i < 1 Or j < 1 Then
        backTraceUp = vbNullString
    ElseIf Mid$(String1, i, 1) = Mid$(String2, j, 1) Then
        'equal characters, save it and then go up and left
        backTraceUp = backTraceUp(C, String1, String2, i - 1, j - 1) & Mid$(String1, i, 1)
    Else
        'go in the direction of the highest number, defaulting to up
        If (C(i, j - 1) > C(i - 1, j)) Then
            backTraceUp = backTraceUp(C, String1, String2, i, j - 1)
        Else
            backTraceUp = backTraceUp(C, String1, String2, i - 1, j)
        End If
    End If
End Function
'back traces c, defaulting in the left direction
Public Function backTraceLeft(ByRef C() As Long, ByRef String1 As String, ByRef String2 As String, ByRef i As Long, ByRef j As Long) As String
    If i < 1 Or j < 1 Then
        backTraceLeft = vbNullString
    ElseIf Mid$(String1, i, 1) = Mid$(String2, j, 1) Then
        'equal characters, save it and then go up and left
        backTraceLeft = backTraceLeft(C, String1, String2, i - 1, j - 1) & Mid$(String1, i, 1)
    Else
        'go in the direction of the highest number, defaulting to left
        If (C(i, j - 1) >= C(i - 1, j)) Then
            backTraceLeft = backTraceLeft(C, String1, String2, i, j - 1)
        Else
            backTraceLeft = backTraceLeft(C, String1, String2, i - 1, j)
        End If
    End If
End Function
'the following function returns a string with indication to what was deleted or added
'proceding character can be;
' = no change
' - deletion
' + addition
Public Function getDiff(ByRef C() As Long, ByRef stringOld As String, ByRef stringNew As String, ByRef i As Long, ByRef j As Long) As String
    If i > 0 Then
        If j > 0 Then 'both are greater than zero
            'can only do the following comparison when i and j are greater than zero
            If Mid$(stringOld, i, 1) = Mid$(stringNew, j, 1) Then
                getDiff = getDiff(C, stringOld, stringNew, i - 1, j - 1) & Mid$(stringOld, i, 1) & "="
            Else
                If i = 0 Then
                    getDiff = getDiff(C, stringOld, stringNew, i, j - 1) & Mid$(stringNew, j, 1) & "+"
                ElseIf C(i, j - 1) >= C(i - 1, j) Then
                    getDiff = getDiff(C, stringOld, stringNew, i, j - 1) & Mid$(stringNew, j, 1) & "+"
                ElseIf j = 0 Then
                    getDiff = getDiff(C, stringOld, stringNew, i - 1, j) & Mid$(stringOld, i, 1) & "-"
                ElseIf C(i, j - 1) < C(i - 1, j) Then
                    getDiff = getDiff(C, stringOld, stringNew, i - 1, j) & Mid$(stringOld, i, 1) & "-"
                Else
                    getDiff = vbNullString
                End If
            End If
        Else 'i is is greater than zero
                If j = 0 Then
                    getDiff = getDiff(C, stringOld, stringNew, i - 1, j) & Mid$(stringOld, i, 1) & "-"
                ElseIf C(i, j - 1) < C(i - 1, j) Then
                    getDiff = getDiff(C, stringOld, stringNew, i - 1, j) & Mid$(stringOld, i, 1) & "-"
                Else
                    getDiff = vbNullString
                End If
        End If
    Else
        If j > 0 Then 'j is  greater than zero
                If i = 0 Then
                    getDiff = getDiff(C, stringOld, stringNew, i, j - 1) & Mid$(stringNew, j, 1) & "+"
                ElseIf C(i, j - 1) >= C(i - 1, j) Then
                    getDiff = getDiff(C, stringOld, stringNew, i, j - 1) & Mid$(stringNew, j, 1) & "+"
                Else
                    getDiff = vbNullString
                End If
        Else 'none are greater than zero
                getDiff = vbNullString
        End If
    End If
End Function
Public Function max(ByRef a As Long, ByRef B As Long) As Long
    If a >= B Then
        max = a
    Else
        max = B
    End If
End Function
 
Upvote 0
OK. I understand, find when they begin to diverge and when they re-align.

What output do you want from:

"John went to the store"
"John Smith went to the store"
 
Upvote 0
Original: "John went to the store"

Revised: "John Smith went to the store"

Output:
"John Smith went to the store"

But maybe this is better.

Original: "The system is intended for everyone to use. It has a clear interface, and it's easy to get familiar with."

Revised: "The ZMax-1000 is intended for everyone. Provided with a clear interface, it's easy to get used to it, plus it has a beautiful screen"

Output:
"The <s>system</s>ZMax-1000 is intended for everyone<s>to use</s>. <s>It has</s>Provided with a clear interface, <s>and </s>it's easy to get <s>familiar with</s>used to it<s><s><s>.</s></s></s>, plus it has a beautiful screen"

Thanks.
 
Upvote 0
Hmm...I've got some ideas, but I'm concerned about identifying cases where a red phrase and its matching green phrase have a common word (e.g. "the") in common.

Hmm...I don't think that that problem can be handled programmatically, it will have to be dealt with by the user.

This looks like a good weekend project.
 
Upvote 0
Does this do what you want
Code:
Sub test()
    Dim aStr As String
    Dim bStr As String
    aStr = "my name is andrea Smith and I like running and jogging fast to work"
    bStr = "my name is John smith and I like running fast"
    
    Call CompareAndDisplay(aStr, bStr, Range("A1"))
End Sub

Sub CompareAndDisplay(strOne As String, strTwo As String, outCell As Range, Optional Delimiter As String = " ")
    Dim strResult As String
    Dim olStart As Variant, olLength As Variant
    Dim nwStart As Variant, nwLength As Variant
    Dim i As Long
    
    strResult = ComparedText(strOne, strTwo, olStart, olLength, nwStart, nwLength)
    
    With outCell.Cells(1, 1)
        .Clear
        .Value = strResult
        For i = LBound(olStart) To UBound(olStart)
            If olStart(i) <> 0 Then
                
                With .Characters(olStart(i), olLength(i)).Font
                    .ColorIndex = 3
                    .StrikeThrough = True
                End With
            End If
        Next i
        For i = LBound(nwStart) To UBound(nwStart)
            If nwStart(i) <> 0 Then
                With .Characters(nwStart(i), nwLength(i)).Font
                    .ColorIndex = 4
                    .Underline = True
                End With
            End If
        Next i
    End With
End Sub


Function ComparedText(aString As String, bString As String, _
                                        Optional ByRef oStart As Variant, Optional oLength As Variant, _
                                        Optional ByRef nStart As Variant, Optional ByRef nLength As Variant, _
                                        Optional Delimiter As String = " ") As String
    Dim aWords As Variant, aWord As String
    Dim bWords As Variant
    Dim outWords() As String
    Dim aPoint As Long, bPoint As Long, outPoint As Long
    Dim matchPoint As Variant, outLength As Long
    Dim High As Long
    Dim oldStart() As Long, oldLength() As Long, oldPoint As Long
    Dim newStart() As Long, newLength() As Long, newPoint As Long
    
    Rem remove double delimiters
        'to be done
        
    aWords = Split(aString, Delimiter)
    bWords = Split(bString, Delimiter)
    High = UBound(aWords)
    ReDim outWords(0 To High + UBound(bWords))
    ReDim oldStart(0 To High + UBound(bWords)): ReDim oldLength(0 To High + UBound(bWords))
    ReDim newStart(0 To High + UBound(bWords)): ReDim newLength(0 To High + UBound(bWords))
    oldPoint = -1: newPoint = -1
    outLength = Len(Delimiter)
    aPoint = 0: bPoint = 0: outPoint = LBound(outWords) - 1
    
    Do
        aWord = aWords(aPoint)
        If LCase(aWord) = LCase(bWords(bPoint)) Then
            Rem is word in common
            outPoint = outPoint + 1
            outWords(outPoint) = aWord
            outLength = outLength + Len(aWord) + Len(Delimiter)
            bWords(bPoint) = vbNullString
            aPoint = aPoint + 1
            bPoint = bPoint + 1
        Else
            Rem word divergence
            matchPoint = Application.Match(aWord, bWords, 0)
            If IsError(matchPoint) Then
                Rem old word is not in new string
                outPoint = outPoint + 1
                outWords(outPoint) = aWord
                
                oldPoint = oldPoint + 1
                oldStart(oldPoint) = outLength: oldLength(oldPoint) = Len(aWord)
                
                outLength = outLength + Len(aWord) + Len(Delimiter)
                aPoint = aPoint + 1
            Else
                Rem old word is in new string, i.e. is a common word
                
                Do Until LCase(bWords(bPoint)) = LCase(aWord)
                    outPoint = outPoint + 1
                    outWords(outPoint) = bWords(bPoint)
                    
                    newPoint = newPoint + 1
                    newStart(newPoint) = outLength: newLength(newPoint) = Len(bWords(bPoint))
        
                    outLength = outLength + Len(outWords(outPoint)) + Len(Delimiter)
                    bWords(bPoint) = vbNullString
                    bPoint = bPoint + 1
                Loop
            End If
        End If
    Loop Until (High < aPoint) Or (UBound(bWords) < bPoint)
    
    Rem last new/different string
    Do Until UBound(bWords) < bPoint
        outPoint = outPoint + 1
        outWords(outPoint) = bWords(bPoint)
        
        newPoint = newPoint + 1
        newStart(newPoint) = outLength: newLength(newPoint) = Len(bWords(bPoint))
        
        outLength = outLength + Len(outWords(outPoint)) + Len(Delimiter)
        bPoint = bPoint + 1
    Loop
    
    Rem final common string
    Do Until High < aPoint
        outPoint = outPoint + 1
        outWords(outPoint) = aWords(aPoint)
        
        oldPoint = oldPoint + 1
        oldStart(oldPoint) = outLength: oldLength(oldPoint) = Len(aWords(aPoint))
        
        outLength = outLength + Len(outWords(outPoint)) + Len(Delimiter)
        aPoint = aPoint + 1
    Loop
    
    ReDim Preserve outWords(0 To outPoint)
    oStart = oldStart: oLength = oldLength
    nStart = newStart: nLength = newLength
    ComparedText = Join(outWords, Delimiter)
End Function
 
Upvote 0
Hello Mike,
thanks for your reply.
As soon as I come back home (I'm going out right now) I will test it thoroughly.

Anyway I developed a solution, not a very elegant one but it does the job exactly how I like it.
I've been the last three days thinking about how to approach the problem and i decided to actually use the in-built Word function "Compare Documents".

You can test it over a Spreadsheet with Original Text in Column A, Revised Text in Column B. Results are output in Column C.
You just have to run it, since it computes the last row of the spreadsheet and cycle over all the cells (in columns A and B) until the last row with data.

I also managed to solve the splitting that arises when pasting it back in Excel, managing cases where text contains at least one "vbLf".
It also manages deleted text between revisions (text in column A, missing text in column B) and added text (missing text in column A, text in column B).
I also tried to time it: yesterday with a typical file of mine (727 rows of original and revised text) and unoptimized code it took 1 hour. I refined all the logic and wrote several functions to polish the structure and avoid calculation when not needed. Right now I'm leaving it running to see any improvement in speed.

Anyway, I will test your code, since having an all Excel solution is way better (and faster ?).

Thank you very much for you help.
Andrea

Code:
Option Base 1
Sub cellTextDiff()
    Dim WordApp, diffDoc, tSht As Variant
    Dim outCell As Range
    Dim docNames(), strData(), retArr() As Variant
    
    xStartTime = Timer
    Application.ScreenUpdating = False
    Application.Calculation = xlCalculationManual
    Set tSht = ActiveSheet
    lRow = lastRow(tSht)
    
    For i = 1 To lRow
        ReDim doc(2)
        ReDim strData(2)

        ' stores input and output cells on current sheet
        For j = 1 To 2
            Set strData(j) = tSht.Cells(i, j)
        Next
        Set outCell = tSht.Cells(i, 3)
        
        If IsEmpty(strData(1)) And Not IsEmpty(strData(2)) Then
            With outCell
                .Value2 = strData(2).Value2
                .Font.Color = RGB(0, 128, 128)
                .Font.Underline = True
                .WrapText = True
                .Rows.AutoFit
            End With
        ElseIf Not IsEmpty(strData(1)) And IsEmpty(strData(2)) Then
            With outCell
                .Value2 = strData(1).Value2
                .Font.Color = RGB(255, 0, 0)
                .Font.Strikethrough = True
                .WrapText = True
                .Rows.AutoFit
            End With
        Else
            docNames = Array("original", "revised")
            Set WordApp = CreateObject("Word.Application")
            For j = 1 To UBound(docNames)
                With WordApp
                    .Visible = False
                    Set doc(j) = .Documents.Add
                    strData(j).Copy
                    doc(j).Paragraphs(1).Range.PasteSpecial
                End With
            Next
            WordApp.CompareDocuments OriginalDocument:=doc(1), RevisedDocument:=doc(2), _
            Destination:=wdCompareDestinationNew, Granularity:=wdGranularityCharLevel
            For j = 1 To 2
                doc(j).Close SaveChanges:=False
            Next j
            WordApp.Visible = False
            Set diffDoc = WordApp.ActiveDocument
            
            ' sets an array with the positions of the linefeed character Chr(10)
            plainText = diffDoc.Paragraphs(1).Range.Text
            If InStr(1, plainText, Chr(11)) = 0 Then
                diffDoc.Paragraphs(1).Range.Copy
                outCell.Select
                tSht.PasteSpecial Format:="HTML"
                outCell.WrapText = True
                outCell.Rows.AutoFit
                diffDoc.Close SaveChanges:=False
                WordApp.Quit
                Application.CutCopyMode = False
            Else
                ' store where are LineFeed characters
                retArr = storeLf(plainText)
                
                ' merge all the eventual diffDoc paragraphs in one
                ' (to avoid cell splitting when pasting back to Excel)
                With diffDoc.Content.Find
                    .ClearFormatting
                    .Replacement.ClearFormatting
                    .Forward = True
                    .Wrap = wdFindStop
                    .Format = False
                    .MatchAllWordForms = False
                    .MatchSoundsLike = False
                    .MatchWildcards = True
                    .Text = Chr(11)
                    .Replacement.Text = " "
                    .Execute Replace:=wdReplaceAll
                End With
                
                ' copy the merged, formatted text to Excel
                diffDoc.Paragraphs(1).Range.Copy
                outCell.Select
                tSht.PasteSpecial Format:="HTML"
                
                ' copy the correct text as plain text to Excel
                outCell.Offset(0, 1).Value2 = lfFormat(plainText, retArr)
                outCell.Offset(0, 1).WrapText = True
                diffDoc.Close SaveChanges:=False
                WordApp.Quit
            
                ' apply diffDoc formatting to the structured text
                For j = 1 To Len(outCell.Value2)
                    If outCell.Characters(j, 1).Font.Color = RGB(255, 0, 0) Then
                        outCell.Offset(0, 1).Characters(j, 1).Font.Color = RGB(255, 0, 0)
                        outCell.Offset(0, 1).Characters(j, 1).Font.Strikethrough = True
                    ElseIf outCell.Characters(j, 1).Font.Color = RGB(0, 128, 128) Then
                        outCell.Offset(0, 1).Characters(j, 1).Font.Color = RGB(0, 128, 128)
                        outCell.Offset(0, 1).Characters(j, 1).Font.Underline = True
                    End If
                Next
                ' final output in adjacent column
                With outCell
                    .Offset(0, 1).Copy
                    .Select
                    tSht.PasteSpecial
                    .Offset(0, 1).Clear
                    Application.CutCopyMode = False
                    .Rows.AutoFit
                End With
            End If
        End If
    Next
    
    Application.Calculation = xlCalculationAutomatic
    Application.CalculateBeforeSave = True
    Application.ScreenUpdating = True
    xFinishTime = Timer
    MsgBox xFinishTime - xStartTime
End Sub
Function lfFormat(sTxt As Variant, lfArr() As Variant) As String
    sTxtLen = Len(sTxt)
    lLen = sTxtLen - lfArr(UBound(lfArr))
    For j = 1 To UBound(lfArr)
        If j = 1 Then
            lfFormat = Mid(sTxt, lfArr(j), lfArr(j + 1) - 1) & vbLf
        ElseIf j > 1 And j < UBound(lfArr) Then
            lfFormat = lfFormat & Mid(sTxt, lfArr(j) + 1, lfArr(j + 1) - lfArr(j) - 1) & vbLf
        ElseIf j = UBound(lfArr) Then
            lfFormat = lfFormat & Mid(sTxt, lfArr(j) + 1, lLen)
        End If
    Next j
End Function
Function storeLf(sTxt As Variant) As Variant()
    Dim retArr() As Variant
    For j = 1 To Len(sTxt)
        If Mid(sTxt, j, 1) = Chr(11) Then cntr = cntr + 1
    Next j
    ReDim retArr(cntr + 1)
    retArr(1) = 1
    cntr = 1
    For j = 1 To Len(sTxt)
        If Mid(sTxt, j, 1) = Chr(11) Then
            cntr = cntr + 1
            retArr(cntr) = j
        End If
    Next j
    storeLf = retArr
End Function
Public Function lastRow(xSheet As Variant) As Variant
    If WorksheetFunction.CountA(xSheet.Cells) <> 0 Then
        lastRow = xSheet.Cells.Find(What:="*", After:=xSheet.Cells(1), SearchOrder:=xlByRows, SearchDirection:=xlPrevious).Row
    Else
        lastRow = 1
    End If
End Function
 
Upvote 0
Sorry for answering so late. Here's the analysis.

Below you can also find the optimized code I use right now (it has a couple of bells and whistles too).
I kept the use of .CompareDocuments function to a minimum (only when strictly needed) and I optimized the recovery of the original structure (in case there are some linefeed in the original text).
I tried the routine on my sample real-life document and now I get the job done in 40 minutes (against 1 hour originally): it' s absolutely fair and it's ok for what I need; plus I get exactly the output I need.

Original:
The system is intended for everyone to use. It has a clear interface, and it's easy to get familiar with.

Revised: The ZMax-1000 is intended for everyone. Provided with a clear interface, it's easy to get used to it, plus it has a beautiful screen

Expected/actual output: The <s>system</s>ZMax-1000 is intended for everyone<s> to use. It has</s>. Provided with a clear interface, <s>and </s>it's easy to get <s>familiar with.</s>used to it, plus it has a beautiful screen

Your output: The <s>system</s>ZMax-1000 is intended for everyone. <s>t</s>Pro<s> us</s>vide<s>.</s>d <s>I</s>wit<s> </s>h<s>as</s> a clear interface,<s> and</s> it's easy to get <s>fam</s>used to it, pl<s>i</s><s>ar</s><s></s>us <s>w</s>it h<s>.</s>as a beautiful screen

Code:
Option Base 1Sub cellTextDiff()
    Dim WordApp, diffDoc, tSht As Variant
    Dim outCell As Range
    Dim docNames(), strData(), retArr() As Variant
    
    ' preliminary steps to speed up execution
    xStartTime = Timer
    Application.ScreenUpdating = False
    Application.Calculation = xlCalculationManual
    Set tSht = ActiveSheet
    lRow = lastRow(tSht)
    
    For i = 1 To lRow
        ReDim doc(2)
        ReDim strData(2)
        
        ' stores input and output cells on current sheet
        For j = 1 To 2
            Set strData(j) = tSht.Cells(i, j)
        Next
        Set outCell = tSht.Cells(i, 3)
        
        'CASE 1: data is present only in revised document (addition)
        If IsEmpty(strData(1)) And Not IsEmpty(strData(2)) Then
            With outCell
                .Value2 = strData(2).Value2
                .Font.Color = RGB(0, 128, 128)
                .Font.Underline = True
                .WrapText = True
                .Font.Name = strData(2).Font.Name
                .Font.Bold = strData(2).Font.Bold
                .Font.Size = strData(2).Font.Size
                .Interior.Color = strData(2).Interior.Color
                .Rows.AutoFit
            End With
        'CASE 2: data is present only in original document (deletion)
        ElseIf Not IsEmpty(strData(1)) And IsEmpty(strData(2)) Then
            With outCell
                .Value2 = strData(1).Value2
                .Font.Color = RGB(255, 0, 0)
                .Font.Strikethrough = True
                .WrapText = True
                .Font.Name = strData(1).Font.Name
                .Font.Bold = strData(1).Font.Bold
                .Font.Size = strData(1).Font.Size
                .Interior.Color = strData(1).Interior.Color
                .Rows.AutoFit
            End With
        'CASE 3: data in original and revised documents is the same (copy)
        ElseIf strData(1) = strData(2) And Not IsEmpty(strData(1)) Then
            With outCell
                .Value2 = strData(1).Value2
                .Font.Color = RGB(0, 0, 0)
                .WrapText = True
                .Font.Name = strData(2).Font.Name
                .Font.Bold = strData(2).Font.Bold
                .Font.Size = strData(2).Font.Size
                .Interior.Color = strData(2).Interior.Color
                .Rows.AutoFit
            End With
        'CASE 4: data in original and revised documents is different
        Else
            docNames = Array("original", "revised")
            Set WordApp = CreateObject("Word.Application")
            For j = 1 To UBound(docNames)
                With WordApp
                    .Visible = False
                    Set doc(j) = .Documents.Add
                    strData(j).Copy
                    doc(j).Paragraphs(1).Range.PasteSpecial
                End With
            Next
            WordApp.CompareDocuments OriginalDocument:=doc(1), RevisedDocument:=doc(2), _
            Destination:=wdCompareDestinationNew, Granularity:=wdGranularityCharLevel
            For j = 1 To 2
                doc(j).Close SaveChanges:=False
            Next j
            WordApp.Visible = False
            Set diffDoc = WordApp.ActiveDocument
            plainText = diffDoc.Paragraphs(1).Range.Text
            
            'CASE 4.1: compared text doesn't contain linefeeds
            If InStr(1, plainText, Chr(11)) = 0 Then
                diffDoc.Paragraphs(1).Range.Copy
                outCell.Select
                tSht.PasteSpecial Format:="HTML"
                With outCell
                    .WrapText = True
                    .Font.Name = strData(2).Font.Name
                    .Font.Bold = strData(2).Font.Bold
                    .Font.Size = strData(2).Font.Size
                    .Rows.AutoFit
                    .Interior.Color = strData(2).Interior.Color
                End With
                diffDoc.Close SaveChanges:=False
                WordApp.Quit
                Application.CutCopyMode = False
                
            'CASE 4.2: compared text contains linefeeds
            Else
                ' store where are LineFeed characters
                retArr = storeLf(plainText)
                ' merge all the eventual diffDoc paragraphs in one
                ' (to avoid cell splitting when pasting back to Excel)
                With diffDoc.Content.Find
                    .ClearFormatting
                    .Replacement.ClearFormatting
                    .Forward = True
                    .Wrap = wdFindStop
                    .Format = False
                    .MatchAllWordForms = False
                    .MatchSoundsLike = False
                    .MatchWildcards = True
                    .Text = Chr(11)
                    .Replacement.Text = " "
                    .Execute Replace:=wdReplaceAll
                End With
                ' paste merged and original text to Excel
                diffDoc.Paragraphs(1).Range.Copy
                outCell.Select
                tSht.PasteSpecial Format:="HTML"
                outCell.Offset(0, 1).Value2 = lfFormat(plainText, retArr)
                diffDoc.Close SaveChanges:=False
                WordApp.Quit
                ' apply diffDoc formatting to the structured text
                For j = 1 To Len(outCell.Value2)
                    If outCell.Characters(j, 1).Font.Color = RGB(255, 0, 0) Then
                        outCell.Offset(0, 1).Characters(j, 1).Font.Color = RGB(255, 0, 0)
                        outCell.Offset(0, 1).Characters(j, 1).Font.Strikethrough = True
                    ElseIf outCell.Characters(j, 1).Font.Color = RGB(0, 128, 128) Then
                        outCell.Offset(0, 1).Characters(j, 1).Font.Color = RGB(0, 128, 128)
                        outCell.Offset(0, 1).Characters(j, 1).Font.Underline = True
                    End If
                Next
                ' final output in adjacent column
                With outCell
                    .Offset(0, 1).Copy
                    .Select
                    tSht.PasteSpecial
                    .Offset(0, 1).Clear
                    Application.CutCopyMode = False
                    .Font.Name = strData(2).Font.Name
                    .Font.Bold = strData(2).Font.Bold
                    .Font.Size = strData(2).Font.Size
                    .Interior.Color = strData(2).Interior.Color
                    .Rows.AutoFit
                End With
            End If
        End If
    Next
    
    ' fine formatting of output column
    With outCell.EntireColumn
        With .Borders(xlEdgeLeft)
            .LineStyle = xlContinuous
            .Weight = xlThin
        End With
        With .Borders(xlEdgeTop)
            .LineStyle = xlContinuous
            .Weight = xlThin
        End With
        With .Borders(xlEdgeBottom)
            .LineStyle = xlContinuous
            .Weight = xlThin
        End With
        With .Borders(xlEdgeRight)
            .LineStyle = xlContinuous
            .Weight = xlThin
        End With
        With .Borders(xlInsideHorizontal)
            .LineStyle = xlContinuous
            .Weight = xlThin
        End With
        .ColumnWidth = strData(2).ColumnWidth
        .VerticalAlignment = xlVAlignTop
    End With
    Cells.Rows.AutoFit
    tSht.Range(tSht.Cells(lRow + 1, 1), tSht.Cells(lRow + 1, 3).End(xlDown)).Clear
    
    ' restore normal state of execution
    Application.Calculation = xlCalculationAutomatic
    Application.CalculateBeforeSave = True
    Application.ScreenUpdating = True
    xFinishTime = Timer
    
    MsgBox "Time elapsed: " & getTime(xFinishTime - xStartTime), vbInformation + vbOKOnly
End Sub
Function lfFormat(sTxt As Variant, lfArr() As Variant) As String
    sTxtLen = Len(sTxt)
    lLen = sTxtLen - lfArr(UBound(lfArr))
    For j = 1 To UBound(lfArr)
        If j = 1 Then
            lfFormat = Mid(sTxt, lfArr(j), lfArr(j + 1) - 1) & vbLf
        ElseIf j > 1 And j < UBound(lfArr) Then
            lfFormat = lfFormat & Mid(sTxt, lfArr(j) + 1, lfArr(j + 1) - lfArr(j) - 1) & vbLf
        ElseIf j = UBound(lfArr) Then
            lfFormat = lfFormat & Mid(sTxt, lfArr(j) + 1, lLen)
        End If
    Next j
End Function
Function storeLf(sTxt As Variant) As Variant()
    Dim retArr() As Variant
    For j = 1 To Len(sTxt)
        If Mid(sTxt, j, 1) = Chr(11) Then cntr = cntr + 1
    Next j
    ReDim retArr(cntr + 1)
    retArr(1) = 1
    cntr = 1
    For j = 1 To Len(sTxt)
        If Mid(sTxt, j, 1) = Chr(11) Then
            cntr = cntr + 1
            retArr(cntr) = j
        End If
    Next j
    storeLf = retArr
End Function
Public Function lastRow(xSheet As Variant) As Variant
    If WorksheetFunction.CountA(xSheet.Cells) <> 0 Then
        lastRow = xSheet.Cells.Find(What:="*", After:=xSheet.Cells(1), SearchOrder:=xlByRows, SearchDirection:=xlPrevious).Row
    Else
        lastRow = 1
    End If
End Function
Function getTime(myTime As Variant) As Variant
    dDay = 86400
    If myTime <= 60 Then
        getTime = myTime & "s"
    ElseIf myTime > 60 And myTime < 3600 Then
        dMins = Round(myTime / 60, 0)
        dSecs = Round((myTime - (dMins * 60)), 0)
        getTime = dMins & "m" & dSecs & "s"
    ElseIf myTime > 3600 Then
        dHours = Round(myTime / 3600, 0)
        dMins = Round((myTime - (dHours * 3600)) / 60, 0)
        dSecs = Round(myTime - (dHours * 3600) - (dMins * 60), 0)
        getTime = dHours & "h" & dMins & "m" & dSecs & "s"
    End If
End Function

Thanks for your time, I hope you find my procedure useuful. (anyway, the search for the all-excel soluion is not over!)
 
Upvote 0

Forum statistics

Threads
1,217,981
Messages
6,139,753
Members
450,230
Latest member
RStasicky

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top