Remove all class names in HTML-tags

strooman

Office Version: 2016 · Platform: Windows
I have this HTML snippet:
Rich (BB code):
<ul id="post">
            <li class="blockbody" id="post_1">
                <div class="header">
                    <div class="datetime">
                        8 februari 2024, 14:30
                    </div><span class="username">JaneDoe</span>
                </div>
                <div class="content">
                    <blockquote class="restore">
                        <div class="container">
                            <div class="description">
                                QUOTE:
                            </div>
                            <div class="quote_printable">
                                <hr>
                                <div>
                                    Originally posted by <strong>JohnDoe</strong> <a href="showthread.php?s=49711caa43ab7391973823e5bb92b6a9&amp;p=832072#post832072" rel="nofollow"><img alt="Bekijk bericht" class="inlineimg" src="images/styles/Aesthetica/buttons/viewpost.gif"></a>
                                </div>
                                <div class="message">
                                    Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.
                                </div>
                                <hr>
                            </div>
                        </div>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. 
                    </blockquote>
                </div>
            </li>
        </ul>

I want to remove all class names from all DIV tags and end up with only <div>, and do the same for other attributes such as id and for other tags such as SPAN. Examples:
<span class="username"> becomes <span>
<div class="message"> becomes <div>
<li class="blockbody" id="post_1"> becomes <li>
<blockquote class="restore"> becomes <blockquote>
<ul id="post"> becomes <ul>

etc.

I'm looking for a VBA solution.
The reason is that I want clean HTML tags with only the basic formatting, as in this screenshot:

[Image: 1sMViug.png]
 

Would you accept an alternative solution without Excel/VBA?

You can simply use the free Text Editor Notepad++ (which also has syntax highlighting) and do a single RegExp Replace.

Take a look at the screenshots:

Before: [screenshot 1707429556679.png]
After: [screenshot 1707429586689.png]

If you prefer using VBA just let me know.

EDIT:
You can also remove all class attributes at once with Notepad++.
But first let me know if you would accept it as a solution.
 
Try this macro. It removes all attributes (id, class, href, etc.) from all tags. The HTML is read from the specified input file and the result is written to the specified output file.

You must set a reference to Microsoft HTML Object Library, via Tools -> References in the VBA editor.

VBA Code:
Public Sub Remove_Attributes_From_HTML()
    
    Dim inputHTMLfile As String, outputHTMLfile As String
    Dim HTML As String
    Dim HTMLdoc As HTMLDocument
    Dim elem As HTMLGenericElement
    Dim attribNode As IHTMLDOMAttribute
    
    inputHTMLfile = "C:\path\to\HTML file.html"  'Change this
    outputHTMLfile = "C:\path\to\HTML file output.html" 'Change this
    
    'Read the whole input file into a string
    Open inputHTMLfile For Binary As #1
    HTML = Space(LOF(1))
    Get #1, , HTML
    Close #1

    'Parse the HTML into an in-memory document
    Set HTMLdoc = New HTMLDocument
    HTMLdoc.body.innerHTML = HTML
    
    'Remove every explicitly specified attribute from every element
    For Each elem In HTMLdoc.body.all
        For Each attribNode In elem.Attributes
            If attribNode.specified Then
                elem.removeAttributeNode attribNode
            End If
        Next
    Next
    
    'Write the cleaned HTML to the output file
    Open outputHTMLfile For Output As #1
    Print #1, HTMLdoc.body.innerHTML
    Close #1

End Sub
The result in the output file is:
HTML:
<ul>
            <li>
                <div>
                    <div>
                        8 februari 2024, 14:30
                    </div><span>JaneDoe</span>
                </div>
                <div>
                    <blockquote>
                        <div>
                            <div>
                                QUOTE:
                            </div>
                            <div>
                                <hr>
                                <div>
                                    Originally posted by <strong>JohnDoe</strong> <a><img></a>
                                </div>
                                <div>
                                    Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.
                                </div>
                                <hr>
                            </div>
                        </div>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
                    </blockquote>
                </div>
            </li>
        </ul>
 
(This post was marked as the solution.)
Very nice suggestions and, more importantly, solutions. Both work. The Find and Replace from PW is very fast: I have 250,000 lines of HTML and it took only 4 seconds. Amazing. I was aware of Find and Replace, but handling regular expressions is not among my skills.
The solution from JW is very neat and effective. It took a little longer, but it gets the job done and is a slick VBA solution.

Thanks to both; this will save me a lot of time (and, more importantly, I learned something new today).
Unfortunately, I can only mark one post as a solution.
 
Thanks for your feedback.
I tweaked the Regular Expression Search/Replace Pattern to remove all class and id attributes in all tags at once, not only the class within div tags:

If this is your HTML
HTML:
<ul id="post">
            <li class="blockbody" id="post_1">
                <div class="header">
                    <div class="datetime">
                        8 februari 2024, 14:30
                    </div><span class="username">JaneDoe</span>
                </div>
                <div class="content">
                    <blockquote class="restore">
                        <div class="container">
                            <div class="description">
                                QUOTE:
                            </div>
                            <div class="quote_printable">
                                <hr>
                                <div>
                                    Originally posted by <strong>JohnDoe</strong> <a href="showthread.php?s=49711caa43ab7391973823e5bb92b6a9&amp;p=832072#post832072" rel="nofollow"><img alt="Bekijk bericht" class="inlineimg" src="images/styles/Aesthetica/buttons/viewpost.gif"></a>
                                </div>
                                <div class="message">
                                    Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.
                                </div>
                                <hr>
                            </div>
                        </div>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
                    </blockquote>
                </div>
            </li>
        </ul>

and you apply the following RegExp search/replace patterns
Search:  (class|id)="[^"]+" (note the space at the beginning: the pattern itself starts with one space character, directly before the opening parenthesis)
Replace: (leave blank)

then you'll get this
HTML:
<ul>
            <li>
                <div>
                    <div>
                        8 februari 2024, 14:30
                    </div><span>JaneDoe</span>
                </div>
                <div>
                    <blockquote>
                        <div>
                            <div>
                                QUOTE:
                            </div>
                            <div>
                                <hr>
                                <div>
                                    Originally posted by <strong>JohnDoe</strong> <a href="showthread.php?s=49711caa43ab7391973823e5bb92b6a9&amp;p=832072#post832072" rel="nofollow"><img alt="Bekijk bericht" src="images/styles/Aesthetica/buttons/viewpost.gif"></a>
                                </div>
                                <div>
                                    Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.
                                </div>
                                <hr>
                            </div>
                        </div>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
                    </blockquote>
                </div>
            </li>
        </ul>

Please note that my method only removes the class and id attributes, keeping everything else.
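
For anyone who wants the same search/replace inside Excel, here is a minimal VBA sketch of it. It is not part of the Notepad++ approach above: the function name StripClassAndId is made up for illustration, and it assumes the VBScript.RegExp object is available (it normally is on Windows). Like the Notepad++ pattern, it treats the HTML as plain text, so it would also strip a matching ` id="..."` that happens to appear in the page text, and it does not catch single-quoted or unquoted attribute values.

```vba
' Sketch: strip class and id attributes from an HTML string with a regular expression.
' Late binding is used, so no extra reference has to be set in Tools -> References.
' StripClassAndId is a hypothetical helper name.
Public Function StripClassAndId(ByVal htmlText As String) As String

    Dim re As Object
    Set re = CreateObject("VBScript.RegExp")

    re.Global = True        ' replace every match, not only the first one
    re.IgnoreCase = True    ' also match CLASS="..." and ID="..."
    ' Same pattern as in Notepad++, including the leading space.
    ' Inside a VBA string literal each double quote is written twice.
    re.Pattern = " (class|id)=""[^""]+"""

    StripClassAndId = re.Replace(htmlText, "")

End Function
```

Called from the Immediate window, `Debug.Print StripClassAndId("<li class=""blockbody"" id=""post_1"">")` prints `<li>`, while attributes such as href and src are left alone.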

The VBA code @John_w posted strips every attribute from your HTML, leaving only bare tags and text; it even removes the src attribute of the img tag and the href attribute of the a tag.
As a result, your HTML will no longer display images or links, to name just two examples.

Is this what you wanted?
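
If the links and images should survive in a VBA-only approach, the attribute loop from the macro can probably be narrowed to class and id. The following is only a sketch under assumptions: RemoveClassAndIdOnly is a hypothetical name, it needs the same reference to Microsoft HTML Object Library, and it has not been checked against every MSHTML version (some document modes may report the class attribute under the name "className", which is why both spellings are listed).

```vba
' Sketch only: remove just the class and id attributes, keeping href, src, alt, etc.
' Assumes a reference to Microsoft HTML Object Library (Tools -> References).
' RemoveClassAndIdOnly is a hypothetical name, not part of the original macro.
Public Function RemoveClassAndIdOnly(ByVal htmlText As String) As String

    Dim doc As HTMLDocument
    Dim elem As Object
    Dim attribNode As IHTMLDOMAttribute

    Set doc = New HTMLDocument
    doc.body.innerHTML = htmlText

    For Each elem In doc.body.all
        For Each attribNode In elem.Attributes
            If attribNode.specified Then
                ' Only touch class and id; every other attribute is kept
                Select Case LCase$(attribNode.nodeName)
                    Case "class", "classname", "id"
                        elem.removeAttributeNode attribNode
                End Select
            End If
        Next attribNode
    Next elem

    RemoveClassAndIdOnly = doc.body.innerHTML

End Function
```

The function takes and returns a string, so the file reading and writing from the original macro can be wrapped around it unchanged.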
 
Sorry for my late reaction and follow-up, but we had Carnival in the Netherlands during the weekend :) (and Monday and Tuesday).

Very nice that you fine-tuned it. This gives me even more options to get the job done, and it is also helpful in other situations. I noticed that the code from JW gives text only. This is also fine because I just need the raw text. With a couple of line breaks <br> or paragraph tags <p> it would be more readable, but I can manage that myself. I'm completely happy with the given solutions.
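
When only the raw text is needed, one possible shortcut is to skip the attribute removal and read the text of the parsed document directly. This is a hedged sketch, not something posted in the thread: HtmlToPlainText is a hypothetical name, it assumes the same reference to Microsoft HTML Object Library, and where exactly line breaks end up in the result depends on how MSHTML renders the block elements.

```vba
' Sketch: return only the text content of an HTML fragment, without any tags.
' Assumes a reference to Microsoft HTML Object Library (Tools -> References).
' HtmlToPlainText is a hypothetical helper name.
Public Function HtmlToPlainText(ByVal htmlText As String) As String

    Dim doc As HTMLDocument

    Set doc = New HTMLDocument
    doc.body.innerHTML = htmlText

    ' innerText holds the text of the body with all markup dropped
    HtmlToPlainText = doc.body.innerText

End Function
```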
 