VBA routine to perform Text Diff

acatalux

New Member
Joined
Jul 19, 2013
Messages
9
Hello everybody, I am new to the forum but I have been an avid reader of your discussions.
As always stated by common forum rules of behavior, don't start a new post if you can already find the answer in other's discussions.

Well, this is the first time I need to start a new post, because it seems the answer is nowhere on the internet.

I am looking to implement the "Compare Documents" function (available in Word 2010) in Excel 2010 through VBA programming between different text contained in two cells.
In Word this function works quite well (not perfectly), but it highlights in different ways which part has been deleted and which one has been added between an "original" document and a "revised" one.
For the nature of my job, I need to do this on a daily basis and I used to output text from Excel to Word, then compare the two text, and then copy it back to Excel.

Here comes the problem: since in Word the text is formatted (and what I'm looking for is formatted/highlighted text as output), I can't just paste it in Excel as it is: any editing, merging, splitting done on the pasted text (that eventually I need to do) makes the formatting disappear (above all with VBA functions, that can only output data and can't format it).

So, browsing the internet I found an implementation of the LCS (Longest Common Subsequence) algorithm written for VBA: you can find it here (for what I see a very neat job by Travis Hydzik). I looked at the strings function only, even though the author provides functions for arrays as well.

I started to tailor his script to my needs, and at first I claimed victory. But as I applied the code to other situations, I couldn't get the right answer.
Anyway, the behaviour that I expect is exactly the same that is provided by the "Compare Documents" in Word 2010.

In other words: given two cells containing different text, I would like to be able to fill a third cell with text formatted accordingly to the TextDiff output between the two original cell.

E.g.:
INPUT:
Cells(1,1).Value2 = "my name is Andrea and I like jogging" (original)
Cells(1,2).Value2 = "my name is Giovanni and I like running" (revised)

OUTPUT:
Cells(1,3) wll contain: "my name is <strike>Andrea</strike>Giovanni and I like <strike>jogging</strike>running"

Obviously, since UDF doesn't allow formatting of cells, I would need to adjust the main Sub for each pair of document I have to revise, but that won't be the problem: what I need is the engine. It's been two years and a half that I do advanced VBA programming at work but it looks like I can't grasp the rationale behind the LCS algorithm.

Any suggestion, hint, link or other resource is more than welcome.
I hope you could help me find a solution to this (considering it is quite useful anyhow).

Best regards,
Andrea.

P.S.: I don't post any code since I used the code "as is" developed by the guy as the engine/core and I edited the example Sub to my needs, converting "+", "-" and "=" to text formatting using the .Characters object.
 
Firstly, great post! Invaluable pieces of code.

Mike, sir your code worked wonders for me. I tried tweaking it by having a Userform's "Userform1.Textbox1" and "Userform1.Textbox2" as inputs instead of cells A1 and B1 discussed above.
The output (compared result), I try displaying in Userform1.Textbox3 of the same userform.
But unfortunately, Im a novice when it comes to VBA. I do not know how to change the code to make the result appear in Userform1.Textbox3 using the With..End With like how you have. I understand that the arguments of the function call:
Code:
Call CompareAndDisplay(aStr, bStr, Range("S1"))
Need to change along with the
Code:
CompareAndDisplay(strOne As String, strTwo As String, outCell As Range, Optional Delimiter As String = " ")
[I][COLOR=#574123].
.
.
With outCell.Cells(1, 1)[/COLOR]
[COLOR=#574123].Clear[/COLOR]
[COLOR=#574123].Value = strResult[/COLOR]
[COLOR=#574123]For i = LBound(olStart) To UBound(olStart)[/COLOR]
[COLOR=#574123]If olStart(i) <> 0 Then[/COLOR]

[COLOR=#574123]With .Characters(olStart(i), olLength(i)).Font[/COLOR]
[COLOR=#574123].ColorIndex = 3[/COLOR]
[COLOR=#574123].StrikeThrough = True[/COLOR]
[COLOR=#574123]End With[/COLOR][/I]


I tried and all I managed was Copying the result stored in C1 into Userform.Textbox3, like:

Code:
Userform1.Textbox.Value=Range("C1").Value

The trouble with this is that the userform wont allow for font color/format changes like StrikeThroughs and underlines. So in the end, the result in Userform1.Textbox3 is displayed, but with out any kind of formatting so its impossible to understand and pick out the differences. Could you please help find a way to display the compare result (with differences HIGHLIGHTED) in a Userform.Textbox field?
Thanks in advance!
 
Upvote 0

Excel Facts

Last used cell?
Press Ctrl+End to move to what Excel thinks is the last used cell.
No userform controls support rich text.
I only vaguly remember this thread, but you might want two result text boxes. Combined result and then a result with the words in common replaced by ellipsis.

Input1: John went to the candy store
Input2: John Smith went to the store

Output1: John Smith went to the candy store
Output2: .....Smith.............candy......
 
Upvote 0
Must say this is quite the code. I cant believe my luck, to be honest this is exactly what Ive been searching for.
Except the req is a little different as far as the output goes. Is there anyway the output of the

Code:
ComparedText(strOne, strTwo, olStart, olLength, nwStart, nwLength)

can be displayed in a word document?
 
Upvote 0
ComparedText is a function that returns a string, so your question is equivalent to "Given a string can it be put in Word".

This code will get the text into an open Word document, but I'm not experience at this, if there are further questions about that topic, you should start a new thread.

Code:
Sub test()
    Dim myWord As Object
    Dim StringToInsert As String
    
    StringToInsert =ComparedText(strOne, strTwo, olStart, olLength, nwStart, nwLength)

    Set myWord = GetObject(, "Word.Application")
    
    myWord.Selection.TypeText Text:=StringToInsert
End Sub
 
Upvote 0
Dear All,

I am brand new to VBA so please bear with me. These codes are amazing but I am running into a few issues.

string1= Part II Long / Non Producer F2F / Part III BMI / Part III BP / MIB / BCP-Protein / BCP-KF / BCP-Glucose / BCP-Chol / BCP-LF / BCP-HIV / HOS / I\ADL Qs / APS / MVR

string2= Part II Long / Non Producer F2F / Part III BMI / Part III BP / MIB / BCP-Protein / BCP-KF / BCP-Glucose / BCP-Chol / BCP-LF / BCP-HIV / NT-proBNP / PSA / HOS / APS / MVR

I have barely tweaked a couple codes from this thread as follows:
Code 1:
Code:
Sub exampleString()
'This is the code i am working on
'This code should be used moving forward


Dim r As Integer
Dim w As Integer
For r = 4 To 12


For w = 4 To 10




Cells(r + 26, w).Clear
   Dim arr() As Long
    
    Dim String1 As String
    Dim String2 As String
    Dim Target As Range
    
    String1 = Cells(r, w)
    String2 = Cells(r + 13, w)
    
    arr = longestCommonSubsequence(String1, String2)
    
    Dim s As String, t As String
    s = backTraceUp(arr, String1, String2, Len(String1), Len(String2))
    t = backTraceLeft(arr, String1, String2, Len(String1), Len(String2))
    Dim a As String, B As String
    a = getDiff(arr, String1, String2, Len(String1), Len(String2))
    
    
    Set Target = ActiveSheet.Cells(r + 26, w)
    
    Dim i As Integer
    For i = 2 To Len(a) Step 2
        Target.Value2 = Target.Value2 & Mid(a, i - 1, 1)
    Next i
    For i = 2 To Len(a) Step 2
        Select Case Mid(a, i, 1)
        Case "-"
            With Target.Characters(i / 2, 1).Font
                .Strikethrough = True
                .Color = RGB(255, 0, 0)
                .Bold = False
            End With
        Case "+"
            With Target.Characters(i / 2, 1).Font
                .ColorIndex = 4
                .Bold = False
                .Underline = xlUnderlineStyleSingle
            End With
        End Select
    Next i
    Dim brr() As Long
    Next w
    Next r
    
End Sub
Public Function longestCommonSubsequence(ByRef String1 As String, ByRef String2 As String) As Long()
    If String1 = vbNullString Or String2 = vbNullString Then
        Exit Function
    End If
 
    Dim num() As Long
    
    'define the array, note rows of zeros get added to front automatically
    ReDim num(Len(String1), Len(String2))
    
    Dim i As Long, j As Long
    
    For i = 1 To Len(String1)
        For j = 1 To Len(String2)
            If Mid$(String1, i, 1) = Mid$(String2, j, 1) Then
                num(i, j) = num(i - 1, j - 1) + 1
            Else
                num(i, j) = max(num(i - 1, j), num(i, j - 1))
            End If
        Next j
    Next i




    longestCommonSubsequence = num
End Function
'back traces c, defaulting in the up direction
Public Function backTraceUp(ByRef C() As Long, ByRef String1 As String, ByRef String2 As String, ByRef i As Long, ByRef j As Long) As String
    If i < 1 Or j < 1 Then
        backTraceUp = vbNullString
    ElseIf Mid$(String1, i, 1) = Mid$(String2, j, 1) Then
        'equal characters, save it and then go up and left
        backTraceUp = backTraceUp(C, String1, String2, i - 1, j - 1) & Mid$(String1, i, 1)
    Else
        'go in the direction of the highest number, defaulting to up
        If (C(i, j - 1) > C(i - 1, j)) Then
            backTraceUp = backTraceUp(C, String1, String2, i, j - 1)
        Else
            backTraceUp = backTraceUp(C, String1, String2, i - 1, j)
        End If
    End If
End Function
'back traces c, defaulting in the left direction
Public Function backTraceLeft(ByRef C() As Long, ByRef String1 As String, ByRef String2 As String, ByRef i As Long, ByRef j As Long) As String
    If i < 1 Or j < 1 Then
        backTraceLeft = vbNullString
    ElseIf Mid$(String1, i, 1) = Mid$(String2, j, 1) Then
        'equal characters, save it and then go up and left
        backTraceLeft = backTraceLeft(C, String1, String2, i - 1, j - 1) & Mid$(String1, i, 1)
    Else
        'go in the direction of the highest number, defaulting to left
        If (C(i, j - 1) >= C(i - 1, j)) Then
            backTraceLeft = backTraceLeft(C, String1, String2, i, j - 1)
        Else
            backTraceLeft = backTraceLeft(C, String1, String2, i - 1, j)
        End If
    End If
End Function
'the following function returns a string with indication to what was deleted or added
'proceding character can be;
' = no change
' - deletion
' + addition
Public Function getDiff(ByRef C() As Long, ByRef stringOld As String, ByRef stringNew As String, ByRef i As Long, ByRef j As Long) As String
    If i > 0 Then
        If j > 0 Then 'both are greater than zero
            'can only do the following comparison when i and j are greater than zero
            If Mid$(stringOld, i, 1) = Mid$(stringNew, j, 1) Then
                getDiff = getDiff(C, stringOld, stringNew, i - 1, j - 1) & Mid$(stringOld, i, 1) & "="
            Else
                If i = 0 Then
                    getDiff = getDiff(C, stringOld, stringNew, i, j - 1) & Mid$(stringNew, j, 1) & "+"
                ElseIf C(i, j - 1) >= C(i - 1, j) Then
                    getDiff = getDiff(C, stringOld, stringNew, i, j - 1) & Mid$(stringNew, j, 1) & "+"
                ElseIf j = 0 Then
                    getDiff = getDiff(C, stringOld, stringNew, i - 1, j) & Mid$(stringOld, i, 1) & "-"
                ElseIf C(i, j - 1) < C(i - 1, j) Then
                    getDiff = getDiff(C, stringOld, stringNew, i - 1, j) & Mid$(stringOld, i, 1) & "-"
                Else
                    getDiff = vbNullString
                End If
            End If
        Else 'i is is greater than zero
                If j = 0 Then
                    getDiff = getDiff(C, stringOld, stringNew, i - 1, j) & Mid$(stringOld, i, 1) & "-"
                ElseIf C(i, j - 1) < C(i - 1, j) Then
                    getDiff = getDiff(C, stringOld, stringNew, i - 1, j) & Mid$(stringOld, i, 1) & "-"
                Else
                    getDiff = vbNullString
                End If
        End If
    Else
        If j > 0 Then 'j is  greater than zero
                If i = 0 Then
                    getDiff = getDiff(C, stringOld, stringNew, i, j - 1) & Mid$(stringNew, j, 1) & "+"
                ElseIf C(i, j - 1) >= C(i - 1, j) Then
                    getDiff = getDiff(C, stringOld, stringNew, i, j - 1) & Mid$(stringNew, j, 1) & "+"
                Else
                    getDiff = vbNullString
                End If
        Else 'none are greater than zero
                getDiff = vbNullString
        End If
    End If
End Function
Public Function max(ByRef a As Long, ByRef B As Long) As Long
    If a >= B Then
        max = a
    Else
        max = B
    End If
End Function

Code 2:

Code:
Sub test_UW()
    Dim aStr As String
    Dim bStr As String
    Dim r As Integer
    Dim s As Integer
    For r = 4 To 12
    
    
    For s = 4 To 10
    
    
    aStr = Cells(r, s)
    bStr = Cells(r + 13, s)
    
    Call CompareAndDisplay(aStr, bStr, Cells(r + 26, s))
    
    Next s
    Next r
    
End Sub


Sub CompareAndDisplay(strOne As String, strTwo As String, outCell As Range, Optional Delimiter As String = " ")
    Dim strResult As String
    Dim olStart As Variant, olLength As Variant
    Dim nwStart As Variant, nwLength As Variant
    Dim i As Long
    
    strResult = ComparedText(strOne, strTwo, olStart, olLength, nwStart, nwLength)
    
    With outCell.Cells(1, 1)
        .Clear
        .Value = strResult
        For i = LBound(olStart) To UBound(olStart)
            If olStart(i) <> 0 Then
                
                With .Characters(olStart(i), olLength(i)).Font
                    .ColorIndex = 3
                    .Strikethrough = True
                End With
            End If
        Next i
        For i = LBound(nwStart) To UBound(nwStart)
            If nwStart(i) <> 0 Then
                With .Characters(nwStart(i), nwLength(i)).Font
                    .ColorIndex = 4
                    .Underline = True
                End With
            End If
        Next i
    End With
End Sub




Function ComparedText(aString As String, bString As String, _
                                        Optional ByRef oStart As Variant, Optional oLength As Variant, _
                                        Optional ByRef nStart As Variant, Optional ByRef nLength As Variant, _
                                        Optional Delimiter As String = " ") As String
    Dim aWords As Variant, aWord As String
    Dim bWords As Variant
    Dim outWords() As String
    Dim aPoint As Long, bPoint As Long, outPoint As Long
    Dim matchPoint As Variant, outLength As Long
    Dim High As Long
    Dim oldStart() As Long, oldLength() As Long, oldPoint As Long
    Dim newStart() As Long, newLength() As Long, newPoint As Long
    
    Rem remove double delimiters
        'to be done
        
    aWords = Split(aString, Delimiter)
    bWords = Split(bString, Delimiter)
    High = UBound(aWords)
    ReDim outWords(0 To High + UBound(bWords))
    ReDim oldStart(0 To High + UBound(bWords)): ReDim oldLength(0 To High + UBound(bWords))
    ReDim newStart(0 To High + UBound(bWords)): ReDim newLength(0 To High + UBound(bWords))
    oldPoint = -1: newPoint = -1
    outLength = Len(Delimiter)
    aPoint = 0: bPoint = 0: outPoint = LBound(outWords) - 1
    
    Do
        aWord = aWords(aPoint)
        If LCase(aWord) = LCase(bWords(bPoint)) Then
            Rem is word in common
            outPoint = outPoint + 1
            outWords(outPoint) = aWord
            outLength = outLength + Len(aWord) + Len(Delimiter)
            bWords(bPoint) = vbNullString
            aPoint = aPoint + 1
            bPoint = bPoint + 1
        Else
            Rem word divergence
            matchPoint = Application.Match(aWord, bWords, 0)
            If IsError(matchPoint) Then
                Rem old word is not in new string
                outPoint = outPoint + 1
                outWords(outPoint) = aWord
                
                oldPoint = oldPoint + 1
                oldStart(oldPoint) = outLength: oldLength(oldPoint) = Len(aWord)
                
                outLength = outLength + Len(aWord) + Len(Delimiter)
                aPoint = aPoint + 1
            Else
                Rem old word is in new string, i.e. is a common word
                
                Do Until LCase(bWords(bPoint)) = LCase(aWord)
                    outPoint = outPoint + 1
                    outWords(outPoint) = bWords(bPoint)
                    
                    newPoint = newPoint + 1
                    newStart(newPoint) = outLength: newLength(newPoint) = Len(bWords(bPoint))
        
                    outLength = outLength + Len(outWords(outPoint)) + Len(Delimiter)
                    bWords(bPoint) = vbNullString
                    bPoint = bPoint + 1
                Loop
            End If
        End If
    Loop Until (High < aPoint) Or (UBound(bWords) < bPoint)
    
    Rem last new/different string
    Do Until UBound(bWords) < bPoint
        outPoint = outPoint + 1
        outWords(outPoint) = bWords(bPoint)
        
        newPoint = newPoint + 1
        newStart(newPoint) = outLength: newLength(newPoint) = Len(bWords(bPoint))
        
        outLength = outLength + Len(outWords(outPoint)) + Len(Delimiter)
        bPoint = bPoint + 1
    Loop
    
    Rem final common string
    Do Until High < aPoint
        outPoint = outPoint + 1
        outWords(outPoint) = aWords(aPoint)
        
        oldPoint = oldPoint + 1
        oldStart(oldPoint) = outLength: oldLength(oldPoint) = Len(aWords(aPoint))
        
        outLength = outLength + Len(outWords(outPoint)) + Len(Delimiter)
        aPoint = aPoint + 1
    Loop
    
    ReDim Preserve outWords(0 To outPoint)
    oStart = oldStart: oLength = oldLength
    nStart = newStart: nLength = newLength
    ComparedText = Join(outWords, Delimiter)
End Function

Code 1 yields:
Part II Long / Non Producer F2F / Part III BMI / Part III BP / MIB / BCP-Protein / BCP-KF / BCP-Glucose / BCP-Chol / BCP-LF / BCP-HIV / <strike>HOS</strike>NT-proBNP / <strike>I\</strike>PSA<strike>DL</strike> <strike>Qs</strike>/ HOS / APS / MVR


Code 2 yields:
Part II Long / Non Producer F2F / Part III BMI / Part III BP / MIB / BCP-Protein / BCP-KF / BCP-Glucose / BCP-Chol / BCP-LF / BCP-HIV / NT-proBNP / PSA / HOS / <strike>I\ADL</strike> <strike>Qs</strike> APS / <strike>APS</strike> <strike>/</strike> MVR


I would like to strike through red what has been taken away from string1 and underline green what has been added from string2. So my desired output would be:
Part II Long / Non Producer F2F / Part III BMI / Part III BP / MIB / BCP-Protein / BCP-KF / BCP-Glucose / BCP-Chol / BCP-LF / BCP-HIV / NT-proBNP / PSA / HOS / <strike>I\ADL</strike> <strike>Qs</strike> / APS / MVR



Sometimes these codes work perfectly and other times I run into these issues. Any help would be much appreciated. Thank You!!
 
Upvote 0
I've used acatalux code for some time and it worked quiet well for me but it has some flaws which bothered me at times:
* I'm getting random "error 400" messages as a result of some issues with the clipboard while copying the text from word to excel
* speed is an issue, not just firing up word etc, but also coloring the text of cells containing long strings


So i went back to his original idea of using a VBA implementation of the LCS algorythm he mentioned in his original post. I didn't try to understand the logic behind that code, instead i'd figured how the code processed the strings and what kind of results I get. The code has two implementations of the same problem, one directly processing a string character by character, the other one is processing an array of strings (where each string contains only one character). Both of them produce the ugly result acatalux complained about in post #10.


My idea was to modify the problem a bit and instead of compairing the strings simply character by character it now compaires whole words and non-word characters. It's done by creating a new function to build the array to be processed using RegEx. I've kept it really simple so a word is a string containing only a-z but not umlauts etc. (for the end result that simplification doesn't really matter).

Here is my LCS code (requires a reference to "Microsoft VBScript Regular Expression 5.5" to work). It is used by the module at the end of this post:
Code:
Attribute VB_Name = "StringDiff"'needs a reference to "Microsoft VBScript Regular Expressions 5.5" to work
'stripped down and improved version of the module found here: http://thydzik.com/longest-common-subsequence-implemented-in-vba-visual-basic-for-applications/
'
'improvements:
'* words based instead of charakter based
'* words only contain a-z (case ignored)
'* other characters are used to split words
'* words are compared as whole words ignoring the character case
'* non-word characters are compared individually
'* example: "want2win" is considered as 2 words an one non-word character
'
'future improvements:
'* find minor changes in words and mark them like non-word characters




Option Explicit




'resulting array:
'0: common sequences
'1: removed sequences
'2: added sequences
Public Function compare_strings(ByRef string1 As String, ByRef string2 As String) As String()




    Dim a1() As String
    Dim a2() As String
    
    a1 = stringToWordArray(string1)
    a2 = stringToWordArray(string2)
    
    compare_strings = reduce_diff_array(getDiffArr(longestCommonSubsequenceArr(a1, a2), a1, a2, UBound(a1), UBound(a2)))
        
End Function




Public Function get_string_diff(ByRef dif() As String) As String
    Dim outString As String
    Dim i As Long
    
    For i = LBound(dif, 2) To UBound(dif, 2)
        If Len(dif(0, i)) > 0 Then
            outString = outString & dif(0, i)
        End If
        If Len(dif(1, i)) > 0 Then
            outString = outString & "-" & dif(1, i)
        End If
        If Len(dif(2, i)) > 0 Then
            outString = outString & "+" & dif(2, i)
        End If
    Next i
    get_string_diff = outString
End Function




Public Function get_combined_diff_string(ByRef dif() As String) As String
    Dim outString As String
    Dim i As Long
    
    For i = LBound(dif, 2) To UBound(dif, 2)
        If Len(dif(0, i)) > 0 Then
            outString = outString & dif(0, i)
        End If
        If Len(dif(1, i)) > 0 Then
            outString = outString & dif(1, i)
        End If
        If Len(dif(2, i)) > 0 Then
            outString = outString & dif(2, i)
        End If
    Next i
    get_combined_diff_string = outString
End Function




Public Function get_markup_diff_string(ByRef dif() As String) As String()
    Dim outString As String
    Dim i As Long
    
    For i = LBound(dif, 2) To UBound(dif, 2)
        If Len(dif(0, i)) > 0 Then
            outString = outString & num_chars(Len(dif(0, i)), "=")
        End If
        If Len(dif(1, i)) > 0 Then
            outString = outString & num_chars(Len(dif(1, i)), "-")
        End If
        If Len(dif(2, i)) > 0 Then
            outString = outString & num_chars(Len(dif(2, i)), "+")
        End If
    Next i
    get_markup_diff_string = stringToArray(outString)
End Function




Public Function num_chars(length As Long, char As String) As String
    Dim i As Long
    num_chars = ""
    For i = 1 To length
        num_chars = num_chars + char
    Next i
End Function








Public Function max_long(ByRef a As Long, ByRef b As Long) As Long
    If a >= b Then
        max_long = a
    Else
        max_long = b
    End If
End Function




'note, arrays must be single dimension
Public Function longestCommonSubsequenceArr(ByRef array1() As String, ByRef array2() As String) As Long()
    On Error Resume Next
    If UBound(array1, 2) > 0 Or UBound(array2, 2) > 0 Then 'multidimensional arrays
        If Error = vbNullString Then
            Exit Function
        End If
    End If
    
    If UBound(array1) < 0 Or UBound(array2) < 0 Then 'check if arrays are bounded
        If Error <> vbNullString Then
            Exit Function
        End If
    End If




    Dim num() As Long
    
    'define the array, note rows of zeros get added to front automatically
    ReDim num(UBound(array1) + 1, UBound(array2) + 1)
    
    Dim i As Long, j As Long
    
    'note, arrays must always start at indice zero.
    For i = 0 To UBound(array1)
        For j = 0 To UBound(array2)
'            If array1(i) = array2(j) Then
            If StrComp(array1(i), array2(j)) = 0 Then
                num(i + 1, j + 1) = num(i, j) + 1
            Else
                num(i + 1, j + 1) = max_long(num(i, j + 1), num(i + 1, j))
            End If
        Next j
    Next i




    longestCommonSubsequenceArr = num
End Function




'not uses but kept for future improvements
Public Function stringToArray(ByRef str As String) As String()
    Dim i As Long
    Dim arr() As String
    ReDim arr(Len(str) - 1)
    For i = 1 To Len(str)
        arr(i - 1) = Mid$(str, i, 1)
    Next i
    stringToArray = arr
End Function




Public Function stringToWordArray(ByRef str As String) As String()
    Dim regEx   ' Create variable.
    Dim aMatches, aMatch
    Dim a() As String
    Dim i As Long
    
    Set regEx = New RegExp   ' Create a regular expression.
    regEx.Pattern = "[^a-z]" 'lower case only
    regEx.Pattern = "([a-z]+)|([^a-z])" 'lower case only here, ignoring case anyway
    regEx.Global = True
    regEx.IgnoreCase = True
    
    Set aMatches = regEx.Execute(str)
    
    ReDim a(aMatches.Count - 1)
    For i = 0 To UBound(a)
        a(i) = aMatches.Item(i)
    Next i
    
    stringToWordArray = a
    
End Function




'returns a 2xn array, where
'indice 0 are equal
'indice 1 are deletions
'indice 2 are additions
Public Function getDiffArr(ByRef c() As Long, ByRef arrayOld() As String, ByRef arrayNew() As String, ByRef i As Long, ByRef j As Long) As String()
    Dim arr() As String
    Dim bound As Long
    On Error Resume Next
    If i >= 0 Then
        If j >= 0 Then 'both are greater or equal to zero
            'can only do the following comparison when i and j are greater or equal than zero
            If arrayOld(i) = arrayNew(j) Then
                    arr = getDiffArr(c, arrayOld, arrayNew, i - 1, j - 1)
                    bound = UBound(arr, 2) 'check the bounding of arr
                    If Error <> vbNullString Then
                        Err.Clear
                        ReDim arr(2, 0)
                        arr(0, 0) = arrayOld(i)
                    Else 'no error
                        ReDim Preserve arr(2, bound + 1)
                        arr(0, bound + 1) = arrayOld(i)
                    End If
                    getDiffArr = arr
            Else
                If i = 0 Then
                    arr = getDiffArr(c, arrayOld, arrayNew, i, j - 1)
                    bound = UBound(arr, 2) 'check the bounding of arr
                    If Error <> vbNullString Then
                        Err.Clear
                        ReDim arr(2, 0)
                        arr(2, 0) = arrayNew(j)
                    Else 'no error
                        ReDim Preserve arr(2, bound + 1)
                        arr(2, bound + 1) = arrayNew(j)
                    End If
                    getDiffArr = arr
                ElseIf c(i + 1, j - 1 + 1) >= c(i - 1 + 1, j + 1) Then
                    arr = getDiffArr(c, arrayOld, arrayNew, i, j - 1)
                    bound = UBound(arr, 2) 'check the bounding of arr
                    If Error <> vbNullString Then
                        Err.Clear
                        ReDim arr(2, 0)
                        arr(2, 0) = arrayNew(j)
                    Else 'no error
                        ReDim Preserve arr(2, bound + 1)
                        arr(2, bound + 1) = arrayNew(j)
                    End If
                    getDiffArr = arr
                ElseIf j = 0 Then
                    arr = getDiffArr(c, arrayOld, arrayNew, i - 1, j)
                    bound = UBound(arr, 2) 'check the bounding of arr
                    If Error <> vbNullString Then
                        Err.Clear
                        ReDim arr(2, 0)
                        arr(1, 0) = arrayOld(i)
                    Else 'no error
                        ReDim Preserve arr(2, bound + 1)
                        arr(1, bound + 1) = arrayOld(i)
                    End If
                    getDiffArr = arr
                ElseIf c(i + 1, j - 1 + 1) < c(i - 1 + 1, j + 1) Then
                    arr = getDiffArr(c, arrayOld, arrayNew, i - 1, j)
                    bound = UBound(arr, 2) 'check the bounding of arr
                    If Error <> vbNullString Then
                        Err.Clear
                        ReDim arr(2, 0)
                        arr(1, 0) = arrayOld(i)
                    Else 'no error
                        ReDim Preserve arr(2, bound + 1)
                        arr(1, bound + 1) = arrayOld(i)
                    End If
                    getDiffArr = arr
                Else
                    getDiffArr = arr
                End If
            End If
        Else 'i is is greater or equal to zero
                If j = 0 Then
                    arr = getDiffArr(c, arrayOld, arrayNew, i - 1, j)
                    bound = UBound(arr, 2) 'check the bounding of arr
                    If Error <> vbNullString Then
                        Err.Clear
                        ReDim arr(2, 0)
                        arr(1, 0) = arrayOld(i)
                    Else 'no error
                        ReDim Preserve arr(2, bound + 1)
                        arr(1, bound + 1) = arrayOld(i)
                    End If
                    getDiffArr = arr
                ElseIf c(i + 1, j - 1 + 1) < c(i - 1 + 1, j + 1) Then
                    arr = getDiffArr(c, arrayOld, arrayNew, i - 1, j)
                    bound = UBound(arr, 2) 'check the bounding of arr
                    If Error <> vbNullString Then
                        Err.Clear
                        ReDim arr(2, 0)
                        arr(1, 0) = arrayOld(i)
                    Else 'no error
                        ReDim Preserve arr(2, bound + 1)
                        arr(1, bound + 1) = arrayOld(i)
                    End If
                    getDiffArr = arr
                Else
                    getDiffArr = arr
                End If
        End If
    Else
        If j >= 0 Then 'j is  greater than zero
                If i = 0 Then
                    arr = getDiffArr(c, arrayOld, arrayNew, i - 1, j)
                    bound = UBound(arr, 2) 'check the bounding of arr
                    If Error <> vbNullString Then
                        Err.Clear
                        ReDim arr(2, 0)
                        arr(2, 0) = arrayNew(j)
                    Else 'no error
                        ReDim Preserve arr(2, bound + 1)
                        arr(2, bound + 1) = arrayNew(j)
                    End If
                    getDiffArr = arr
                ElseIf c(i + 1, j - 1 + 1) >= c(i - 1 + 1, j + 1) Then
                    arr = getDiffArr(c, arrayOld, arrayNew, i, j - 1)
                    bound = UBound(arr, 2) 'check the bounding of arr
                    If Error <> vbNullString Then
                        Err.Clear
                        ReDim arr(2, 0)
                        arr(2, 0) = arrayNew(j)
                    Else 'no error
                        ReDim Preserve arr(2, bound + 1)
                        arr(2, bound + 1) = arrayNew(j)
                    End If
                    getDiffArr = arr
                Else
                    getDiffArr = arr
                End If
        Else 'none are greater than zero
                getDiffArr = arr
        End If
    End If
End Function




'collapses the array and shrinks the result to a smaller array
Private Function reduce_diff_array(arr() As String) As String()
    Dim state As Integer
    Dim i As Long
    Dim o As Long
    Dim tmpArr() As String
    Dim junks As Long
            
    state = -1
    o = 0
    junks = 0
    
    For i = LBound(arr, 2) To UBound(arr, 2)
        If Len(arr(0, i)) > 0 Then 'unmodified content
            If state = 0 Then 'previous element was in the same state
                arr(state, o) = arr(state, o) & arr(state, i)
                arr(state, i) = ""
            Else
                o = i
                junks = junks + 1
            End If
            state = 0
        End If
        If Len(arr(1, i)) > 0 Then 'removed content
            If state = 1 Then 'previous element was in the same state
                arr(state, o) = arr(state, o) & arr(state, i)
                arr(state, i) = ""
            Else
                o = i
                junks = junks + 1
            End If
            state = 1
        End If
        If Len(arr(2, i)) > 0 Then 'added content
            If state = 2 Then 'previous element was in the same state
                arr(state, o) = arr(state, o) & arr(state, i)
                arr(state, i) = ""
            Else
                o = i
                junks = junks + 1
            End If
            state = 2
        End If
    Next i
    
    'shrink array size
    ReDim tmpArr(2, junks - 1)
    o = 0
    For i = LBound(arr, 2) To UBound(arr, 2)
        If Len(arr(0, i)) > 0 Then 'unmodified content
            tmpArr(0, o) = arr(0, i)
            o = o + 1
        End If
        If Len(arr(1, i)) > 0 Then 'removed content
            tmpArr(1, o) = arr(1, i)
            o = o + 1
        End If
        If Len(arr(2, i)) > 0 Then 'added content
            tmpArr(2, o) = arr(2, i)
            o = o + 1
        End If
    Next i
    
    reduce_diff_array = tmpArr
    
End Function


The next step was to speed up the colorizing of the cell characters. This one was very easy as instead of colorizing each character by itself I could colorize each result junk as one operation.
Another speedup improvement was to reduce the number of junks by combining the junks to a few big junks.


The resulting output is not exactly like the diff generated by Word but pretty similar and very readable.

And here is the code for comparing two cells (both cells and the output cell must be in the same row):
Code:
Attribute VB_Name = "ColoredCellTextDiffNew"'requires "StringDiff" module


Option Base 1
Option Explicit
Const ColoredCellTextDiff_text = "ColoredCellTextDiff"  'name of the module
Const init_text = "vorbereiten"                         'initializing
Const character_text = "Zeichen"                        'character
Const of_text = "von"                                   'of


Public Sub cellTextDiff(row As Integer, oldCol As Integer, newCol As Integer, outCol As Integer)
    Dim tSht As Variant
    Dim outCell As Range
    Dim retArr() As Variant
    Dim StatusBar As String
    Dim diffData() As String
    Dim oldData As String
    Dim newData As String
    Dim i As Long
    Dim j As Long
    Dim laenge As Long
    Dim colorcode() As String
        
    Set tSht = ActiveSheet
    
    StatusBar = Application.StatusBar
    
    Application.StatusBar = StatusBar + " - " & ColoredCellTextDiff_text & " " & init_text
    DoEvents
    
    oldData = tSht.Cells(row, oldCol).Value2
    newData = tSht.Cells(row, newCol).Value2
    Set outCell = tSht.Cells(row, outCol)
    
    With outCell
        .Interior.Color = RGB(255, 255, 255)
        .Font.Color = RGB(0, 0, 0)
        .WrapText = True
        .Rows.AutoFit
        .VerticalAlignment = xlTop
        .Font.Underline = False
        .Font.Strikethrough = False
    End With
    
    'CASE 1: data is present only in revised document (addition)
    If (Len(oldData) = 0 And Not Len(newData) = 0) Then
        With outCell
            .Value2 = newData
            .Font.Color = RGB(0, 128, 128)
            .Font.Underline = True
        End With
    'CASE 2: data is present only in original document (deletion)
    ElseIf Not Len(oldData) = 0 And Len(newData) = 0 Then
        With outCell
            .Value2 = oldData
            .Font.Color = RGB(255, 0, 0)
            .Font.Strikethrough = True
        End With
    'CASE 3: data in original and revised documents is the same (copy)
    ElseIf StrComp(oldData, newData) = 0 And Not Len(newData) = 0 Then
        With outCell
            .Value2 = oldData
        End With
    'CASE 4: data in original and revised documents is different
    Else
        diffData = compare_strings(oldData, newData)
        outCell.Value2 = get_combined_diff_string(diffData)
        colorcode = get_markup_diff_string(diffData)
        
        j = 1
        For i = LBound(diffData, 2) To UBound(diffData, 2)
            Application.StatusBar = StatusBar + " - " & ColoredCellTextDiff_text & " " & character_text & " " & CStr(j) & " " & of_text & " " & CStr(Len(outCell.Value2))
            DoEvents
            laenge = Len(diffData(0, i))
            If laenge > 0 Then
                j = j + laenge
            End If
            laenge = Len(diffData(1, i))
            If laenge > 0 Then
                outCell.Characters(j, laenge).Font.Color = RGB(255, 0, 0)
                outCell.Characters(j, laenge).Font.Strikethrough = True
                j = j + laenge
            End If
            laenge = Len(diffData(2, i))
            If laenge > 0 Then
                outCell.Characters(j, laenge).Font.Color = RGB(0, 128, 128)
                outCell.Characters(j, laenge).Font.Underline = True
                j = j + laenge
            End If
        Next i
    End If
End Sub
 
Upvote 0
Hello,
First of all, thank you very much for the macros shown here on a topic where it is very difficult to find a correct answer. DieterGsaid that he had limitations using special characters and upper/lower case strings to perform a word-by-word comparison of twho excel cells.

The used function stringToWordArray as shown below

Code:
Public Function stringToWordArray(ByRef str As String) As String()
    Dim regEx   ' Create variable.
    Dim aMatches, aMatch
    Dim a() As String
    Dim i As Long
    
    Set regEx = New RegExp   ' Create a regular expression.
    regEx.Pattern = "[^a-z]" 'lower case only
    regEx.Pattern = "([a-z]+)|([^a-z])" 'lower case only here, ignoring case anyway
    regEx.Global = True
    regEx.IgnoreCase = True
    
    Set aMatches = regEx.Execute(str)
    
    ReDim a(aMatches.Count - 1)
    For i = 0 To UBound(a)
        a(i) = aMatches.Item(i)
    Next i
    
    stringToWordArray = a
    
End Function

can be replaced by this one :
Code:
Public Function stringToWordArray(ByRef str As String) As String()
    Dim a() As String
    Dim i As Long
 
a = VBA.Split(str, " ")
For i = 0 To UBound(a)
a(i) = a(i) & " "
Next i
    
stringToWordArray = a
End Function

and then no longer creates restrictions.
Regards
 
Upvote 0
Hello,
You guys have made a lotta progress on this topic and this has been a great help for me. Regarding Excel files with >2500 characters in a cell (paragraphs and new lines), it appears that the code provided above can only process up to 1000+ characters, thus outputting only a part of the whole difference. I have did some research and turns out this is a problem regarding Excel's inherent lack of operating/processing memory. I have tried batch/chunk processing but it keeps bugging out and, although it produced the whole output, every new chunk refreshes previous formatting, leaving only the whole output with it's last chunk formatted. I tried separate it into different subsequent cells, but the way chunk sizes are determined will affect the accuracy of the comparator around the borders of the chunks, which means you are one space away from shifting the formatting into total havoc.

I have a few other ideas to go about this, but I am by no means fluent in VBA. One of the ideas is to only output the differences with their respective formats, and omit the similarities, but my own versions of those did not go well. Another one of those is to set a hard limit on the operating memory and output the differences and refresh memory before operating on the next batch, but this is too advanced for me.

If there is any way to produce a complete output (either just the differences or with similarities and differences, across multiple cells are optional), Please do help me out. Thanks.

Heres the example inputs and outputs i had:(the output formatting is messed up)

Mahaim's commentary on international labour law delves into normative questions concerning the scope and substance of labour law within individual nation states. Although his primary focus is on the essence and "spirit" of international labour law, he also presents his understanding of the general purpose of labour law: to protect workers and ensure a minimum standard of living for citizens. He concludes that "the final purpose of international labour law is to safeguard a minimum standard of labour achievements from international competition, forming a human charter" (286).

In contrast, Isaäc Pieter de Vooys offered a more detailed and distinct perspective on these issues two years later in his article "Labour Legislation and Economic Possibilities" (1923). De Vooys, a Dutch mechanical engineer, industrialist, social critic, and poet, argued that the development of labour law and the consequent enhancement of workers' terms and conditions must occur "within the limits set by economic laws" (de Vooys, 181). This means that wage increases should not surpass productivity growth. He believed that efforts to improve workers' conditions too rapidly would be counterproductive. Given that productivity levels differ between countries and regions, de Vooys suggested two main objectives for labour law. Firstly, it should establish minimum standards below which labour conditions must not fall (183). Secondly, it should support trade unions and collective bargaining as the most effective methods for adjusting wages according to productivity levels. Additionally, labour law should regulate the demands of trade unions and workers to ensure they do not violate economic laws (185).

De Vooys proposed a highly technocratic view of labour law, arguing that wages are a matter of economic fact rather than moral or political considerations (184-185). While this reasoning might be familiar in contemporary debates about national minimum wage policies, it was relatively novel at the time he wrote (183). Although these two early articles from the ILR period mark the beginning of the debate on the appropriate or desired purposes of labour law, this discussion waned for several decades before resurfacing with renewed interest towards the end of the twentieth century.
Mahaim’s commentary on international labour law addresses normative questions about the scope and substance of labour law within nation states. Although the primary focus of his contribution is the substance and “spirit” of international labour law, he reveals his own understanding of labour law’s purpose in general – namely, the protection of the worker and the guarantee to citizens of a minimum standard of living. He concludes that “the final purpose of international labour law is to place beyond attack from international competition a minimum of conquests in the world of labour, such as shall constitute a human charter” (286).

A more extended and very different consideration of these matters was presented by Isaäc Pieter de Vooys two years later in an article on “Labour Legislation and Economic Possibilities” (1923). De Vooys was a Dutch mechanical engineer and industrialist, a social critic and a poet. The main thesis of his article is that the development of labour law and the consequent improvement in workers’ terms and conditions of employment and in their standard of living must always proceed “within the limits set by economic laws” (de Vooys, 181). In other words, wage increases must not outpace increases in productivity. Attempts to improve the worker’s lot at too great a speed will be self-defeating. Since productivity levels can vary between countries and localities, it follows that the appropriate purposes of labour law are twofold. Firstly, it can properly be used to set minimum standards below which conditions of labour must not fall (183). Secondly, it should be used to support trade unions and collective bargaining as potentially the most effective means of adjusting wages in line with productivity levels. At the same time, labour law serves the purpose of controlling the demands of trade unions and workers, ensuring that these, too, do not breach economic laws (185).

De Vooys offers a strongly technocratic idea of labour law. Wages, he believes, are neither a moral nor a political matter but rather a question of economic fact (184–185). Today, this line of reasoning may be familiar. For example, it may be brought up in policy debates about the benefits and risks of introducing a national minimum wage. At the time he was writing, however, de Vooys was able to claim that his points were rarely advanced during any discussion of the subject (183).

Although we can identify the beginnings of the debate about the proper, or desired, purposes of labour law in these two articles from the early years of the ILR, this discussion subsided for several decades before being taken up again, with renewed enthusiasm, towards the end of the twentieth century.
, labour law shouldserves regulatethe purpose of controlling the demands of trade unions and workers, toensuring ensurethat theythese, too, do not violatebreach economic laws (185).

De Vooys proposedoffers a highlystrongly technocratic viewidea of labour law,. arguingWages, thathe wagesbelieves, are neither a mattermoral ofnor economica factpolitical matter but rather thana moralquestion orof politicaleconomic considerationsfact (184-–185). WhileToday, this line of reasoning mightmay be familiar. For example, it may be brought up in contemporarypolicy debates about the benefits and risks of introducing a national minimum wage. policies,At itthe time he was relativelywriting, novelhowever, atde theVooys timewas heable wroteto (183).claim Althoughthat thesehis twopoints earlywere articlesrarely fromadvanced during any discussion of the ILRsubject period(183).

Although markwe can identify the beginningbeginnings of the debate onabout the appropriateproper, or desired, purposes of labour
 
Upvote 0

Forum statistics

Threads
1,223,911
Messages
6,175,334
Members
452,636
Latest member
laura12345

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top