Find text from repeating sections of text file

ryal001

Board Regular
Joined
Feb 8, 2006
Messages
64
Hi,

I want to make a macro to search for repeating blocks of text in a text file. I know how to load the text file to memory but I don't know how to find the repeated text.

I've seen examples of searching for text using regular expressions which seems like it might be the way to go, but I don't know how to use them correctly.

The text file consists of HTML. I have tried to access the data with PowerQuery but I can't get it to work. Furthermore, some of the data I need is in the tag attributes and so is not displayed by Excel if opened as an HTML file.

The text file looks like this (simplified for illustration):

Rich (BB code):
<div>Images from &quot;Marker #6&quot;</div><img src='Small_Images/045854_157363913466558000_16438.jpg' /></a></div>
<span class='image_taken_on'>Taken on: Nov 08, 2019 09:07 am</span></div>

The data I need to pick out in the above example are "Marker #6", "Small_Images/045854_157363913466558000_16438.jpg" and "Nov 08, 2019 09:07 am"

The text pattern repeats dozens of times with different details. Thus the next "Marker #" could be "Marker #6" or another number such as "Marker #5". The next "Small_Images" will always be different as will the "Taken on" date and time. The text file does not have consistent line breaks so you can't rely on looping through one line at a time. However, the order of the tags is always the same.

I want the output to be on a new sheet with one row for each data block.

E.g. A1 = "Marker #6" B1 = "'Small_Images/045854_157363913466558000_16438.jpg" and C1 = "Nov 08, 2019 09:07 am". The next block will start at A2 and so on.

I know how to put the output on the new sheet, I just don't know how to extract it from the text file.

It seems to be a similar problem to the one on this thread but I don't have the ":" before each piece of data. Instead I presume I could search for each instance of "Marker #", "src=" and "Taken on:"

Any help would be much appreciated.
 

Excel Facts

Do you hate GETPIVOTDATA?
Prevent GETPIVOTDATA. Select inside a PivotTable. In the Analyze tab of the ribbon, open the dropown next to Options and turn it off
Hi,​
did you try to use the easy VBA text function Split ?​
And you forgot to post your actual code …​
 
Upvote 0
According to your sample text - within variable T in this code - as a beginner starter :​
VBA Code:
    Debug.Print Split(T, "&quot;")(1)
    Debug.Print Split(T, "'")(1)
    Debug.Print Split(Split(T, "Taken on: ")(1), "<")(0)
 
Upvote 0
Solution
Thanks so much Marc, that worked like a charm.

I just had to add the "double-Split" line to the "src=" part because there are other tags in the original text files that also use single quotes.

I mangled the split function into smozgur's code and got exactly the output I needed as per the attached screenshot.

VBA Code:
Sub ReadFile()

    Dim myFile As String, text As String, textline As String
    Dim Setup As Object: Set Setup = CreateObject("Scripting.Dictionary")
    Dim itm As Variant
    Dim output_sheet As Worksheet
    Dim markerRow As Integer
    Dim imageRow As Integer
    Dim takenRow As Integer
    
    
    myFile = Application.GetOpenFilename()
    
    Open myFile For Input As #1
    
    ' The data for each block is not always in the same row so need
    ' to track rows individually so we can outout all the data from
    ' each block on the same row in the new worksheet
    markerRow = 1
    imageRow = 1
    takenRow = 1
    
    Do Until EOF(1)
        
        Line Input #1, textline

        If InStr(1, textline, "Marker #") Then
            Setup.Add _
                Key:="A" & markerRow, _
                Item:=Split(textline, "&quot;")(1)
            markerRow = markerRow + 1
        End If
        
        If InStr(1, textline, "src=") Then
            Setup.Add _
                Key:="B" & imageRow, _
                Item:=Split(Split(textline, "src='")(1), "'")(0)
            imageRow = imageRow + 1
        End If
        
        If InStr(1, textline, "Taken on:") Then
            Setup.Add _
                Key:="C" & takenRow, _
                Item:=Split(Split(textline, "Taken on: ")(1), "<")(0)
            takenRow = takenRow + 1
        End If
        
    Loop
        
    
    Set output_sheet = Sheets.Add
    
    For Each itm In Setup.Keys
        output_sheet.Range(itm).Value = Setup(itm)
    Next itm
    
    output_sheet.Columns("A:C").EntireColumn.AutoFit
    
    Close #1

End Sub
 

Attachments

  • Output.PNG
    Output.PNG
    25.5 KB · Views: 20
Upvote 0
Well done !​
With an attachment for the source file (dropbox, whatever) I can show you a variation and the classic DOM way (html) …​
 
Upvote 0
A demonstration as a variation but avoiding your code glitch in the first result :​
VBA Code:
Sub Demo1()
    Dim V, F%, S, N&
        V = Application.GetOpenFilename(",*.html"):  If V = False Then Exit Sub
        [A1].CurrentRegion.Offset(1).Clear
        F = FreeFile
        Open V For Input As #F
        S = Split(Input(LOF(F), #F), "class='image_header'>")
        Close #F
        If UBound(S) < 1 Then Beep: Exit Sub
        ReDim V(1 To UBound(S), 2)
    For N = 1 To UBound(S)
        V(N, 0) = Split(S(N), "&quot;")(1)
        V(N, 1) = Split(Split(S(N), "src='")(1), "'")(0)
        V(N, 2) = Split(Split(S(N), "Taken on: ")(1), "<")(0)
    Next
    With [A2:C2].Resize(N - 1)
        .Value = V
        .Columns.AutoFit
    End With
End Sub
➡️ Do you like it ? ⏩ So thanks to click on the bottom right ?Like icon ! ↘️
 
Upvote 0
For the DOM demonstration as getElementsByClassName crashes in late binding​
⚠️ so for an early binding the reference Microsoft HTML Object Library must be activated :
VBA Code:
Sub Demo2()
    Dim V, F%, N&, oDoc As New HTMLDocument
        V = Application.GetOpenFilename(",*.html"):  If V = False Then Exit Sub
        [A1].CurrentRegion.Offset(1).Clear
        F = FreeFile
        Open V For Input As #F
'    With CreateObject("htmlfile")
    With New HTMLDocument
             .body.innerHTML = Input(LOF(F), #F)
              Close #F
        With .getElementsByClassName("image_main")
                 ReDim V(.Length - 1, 2)
            For N = 0 To .Length - 1
                oDoc.body.innerHTML = .Item(N).innerHTML
                V(N, 0) = Split(oDoc.querySelector(".image_header").innerText, """")(1)
                V(N, 1) = Mid(oDoc.querySelector(".image_main_image").getAttribute("src"), 7)
                V(N, 2) = Mid(oDoc.querySelector(".image_taken_on").innerText, 11)
            Next
        End With
    End With
        Set oDoc = Nothing
    With [A2:C2].Resize(N)
        .Value = V
        .Columns.AutoFit
    End With
End Sub
➡️ Do you like it ? ⏩ So thanks to click on the bottom right ?Like icon ! ↘️
 
Upvote 0

Forum statistics

Threads
1,224,814
Messages
6,181,126
Members
453,021
Latest member
Justyna P

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top