Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share

.net - Remove all empty HTML tags?

I am imagining a function that I figure would use Regex, and it would be recursive for instances like <p><strong></strong></p>, to remove all empty HTML tags within a string. It would have to account for whitespace too, if possible. There would be no crazy instances where the < character is used in an attribute value.

I am pretty terrible at regex but I imagine this is possible. How can you do it?

Here is the method I have so far:

Public Shared Function stripEmptyHtmlTags(ByVal html As String) As String
    Dim newHtml As String = Regex.Replace(html, "/(<.+?>\s*</.+?>)/Usi", "")

    If html <> newHtml Then
        newHtml = stripEmptyHtmlTags(newHtml)
    End If

    Return newHtml
End Function

However my current Regex is in PHP format, and it doesn't seem to be working. I am not familiar with .NET regex syntax.

To all those saying don't use regex: I am curious what the pattern would be regardless. Surely there is a pattern that could match every pair of opening and closing tags with any amount of whitespace (or none) between them? I've seen regexes that match HTML tags with any number of attributes, a single empty tag (such as just <p></p>), etc.

So far I have tried the following regex patterns in the above method to no avail (as in, I have a text string with empty paragraph tags that didn't even get removed):

Regex.Replace(html, "/(<.+?>\s*</.+?>)/Usi", "")

Regex.Replace(html, "(<.+?>\s*</.+?>)", "")

Regex.Replace(html, "%<(\w+)[^>]*>\s*</\1\s*>%", "")

Regex.Replace(html, "<\w+\s*>\s*</\1\s*>", "")

1 Reply

First, note that empty HTML elements are, by definition, not nested.

Update: The solution below now applies the empty element regex recursively to remove "nested-empty-element" structures such as: <p><strong></strong></p> (subject to the caveats stated below).

Simple version:

This works pretty well (see the caveats below) for HTML whose start tag attributes contain no angle brackets (< or >). Here it is as an (untested) VB.NET snippet:

Dim RegexObj As New Regex("<(\w+)\b[^>]*>\s*</\1\s*>")
Do While RegexObj.IsMatch(html)
    html = RegexObj.Replace(html, "")
Loop

Enhanced version:

<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[\w\-.:]+))?)*\s*/?>\s*</\1\s*>

Here is the uncommented enhanced version in VB.NET (untested):

Dim RegexObj As New Regex("<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:""[^""]*""|'[^']*'|[\w\-.:]+))?)*\s*/?>\s*</\1\s*>")
Do While RegexObj.IsMatch(html)
    html = RegexObj.Replace(html, "")
Loop

This more complex regex correctly matches a valid empty HTML 4.01 element even if it has angle brackets in its attribute values (subject, once again, to the caveats below). In other words, this regex correctly handles all start tag attribute values, whether quoted (which can contain <>), unquoted (which can't), or empty. Here is a fully commented (and tested) PHP version:

function strip_empty_tags($text) {
    // Match empty elements (attribute values may have angle brackets).
    $re = '%
        # Regex to match an empty HTML 4.01 Transitional element.
        <                    # Opening tag opening "<" delimiter.
        (\w+)\b              # $1 Tag name.
        (?:                  # Non-capture group for optional attribute(s).
          \s+                # Attributes must be separated by whitespace.
          [\w\-.:]+          # Attribute name is required for attr=value pair.
          (?:                # Non-capture group for optional attribute value.
            \s*=\s*          # Name and value separated by "=" and optional ws.
            (?:              # Non-capture group for attrib value alternatives.
              "[^"]*"        # Double quoted string.
            | \'[^\']*\'     # Single quoted string.
            | [\w\-.:]+      # Non-quoted attrib value can be A-Z0-9-._:
            )                # End of attribute value alternatives.
          )?                 # Attribute value is optional.
        )*                   # Allow zero or more attribute=value pairs
        \s*                  # Whitespace is allowed before closing delimiter.
        >                    # Opening tag closing ">" delimiter.
        \s*                  # Content is zero or more whitespace.
        </\1\s*>             # Element closing tag.
        %x';
    while (preg_match($re, $text)) {
        // Repeatedly remove innermost empty elements until none remain.
        $text = preg_replace($re, '', $text);
    }
    return $text;
}
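
For anyone who wants to poke at both patterns without a .NET or PHP setup, here is a rough Python port. Python is only my choice for illustration; it is not something the answer itself uses. The names SIMPLE, ENHANCED and strip_empty_tags are made up for this sketch, and both patterns are written with a \b word boundary after the tag name and, for the enhanced one, the optional "/" from the VB.NET variant.

```python
import re

# Simple version: assumes no ">" ever appears inside an attribute value.
SIMPLE = re.compile(r"<(\w+)\b[^>]*>\s*</\1\s*>")

# Enhanced version: attribute values may be double-quoted, single-quoted,
# unquoted or absent, so quoted values are allowed to contain "<" or ">".
ENHANCED = re.compile(r"""
    <(\w+)\b                 # start tag name, captured as group 1
    (?:                      # zero or more attributes
      \s+[\w\-.:]+           # attribute name
      (?:\s*=\s*             # optional value part
        (?:"[^"]*"|'[^']*'|[\w\-.:]+)
      )?
    )*
    \s*/?>                   # end of the start tag
    \s*                      # content: whitespace only
    </\1\s*>                 # end tag with the same name
    """, re.VERBOSE)


def strip_empty_tags(html, pattern=ENHANCED):
    """Delete innermost empty elements repeatedly until none remain."""
    while pattern.search(html):
        html = pattern.sub("", html)
    return html


nested = "<div><p> <strong></strong> </p>text</div>"
print(strip_empty_tags(nested))            # <div>text</div>

quoted = '<a title="x>y"></a>'
print(strip_empty_tags(quoted, SIMPLE))    # unchanged: <a title="x>y"></a>
print(repr(strip_empty_tags(quoted)))      # ''
```

The simple pattern leaves the last element alone because [^>]* stops at the ">" inside the quoted title, so the text after it no longer looks like a closing tag. The enhanced pattern consumes the whole quoted value first and therefore sees the element as empty.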

Caveats: This function does not parse HTML. It simply matches and removes any text sequence that looks like a valid empty HTML 4.01 element (which, by definition, is not nested). Note that it also erroneously matches and removes the same text pattern when it occurs outside normal HTML markup, such as within SCRIPT and STYLE elements, HTML comments, and the attributes of other start tags. This regex does not work with short tags. To any bobenc fan about to give this answer an automatic downvote: please show me one valid HTML 4.01 empty element that this regex fails to match correctly. This regex follows the W3C spec and really does work.

Update: This regex solution also does not work (and will erroneously remove valid markup) if you do something insanely unlikely (but perfectly valid) like this:

<div att="<p att='">stuff</div><div att="'></p>'">stuff</div>
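
To make the caveats concrete, here is a small Python sketch (again just an illustration language, not the answer's own) that runs the enhanced pattern over that input and over a script block. EMPTY_ELEMENT, strip_empty_tags and the sample strings are names and data I made up for the demonstration.

```python
import re

# The enhanced pattern on one line, with \b after the tag name.
EMPTY_ELEMENT = re.compile(
    r"""<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[\w\-.:]+))?)*\s*/?>\s*</\1\s*>"""
)


def strip_empty_tags(html):
    """Delete innermost empty elements repeatedly until none remain."""
    while EMPTY_ELEMENT.search(html):
        html = EMPTY_ELEMENT.sub("", html)
    return html


# The "insanely unlikely" input: two non-empty DIVs whose attribute values,
# read together, look like an empty P element to the regex.
tricky = '''<div att="<p att='">stuff</div><div att="'></p>'">stuff</div>'''
print(strip_empty_tags(tricky))
# <div att="'">stuff</div>

# Text inside a SCRIPT element is not markup, but the regex cannot know that.
script = '<script>var s = "<b></b>";</script>'
print(strip_empty_tags(script))
# <script>var s = "";</script>
```

In the first case the match starts at the "<p" inside the first attribute value and ends at the "</p>" inside the second, so the whole first DIV body is deleted along with it.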

Summary:

On second thought, just use an HTML parser!
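
As a rough idea of what the parser route can look like, here is a sketch using Python's standard html.parser module (in .NET you would reach for an HTML parsing library such as Html Agility Pack instead). Everything below is my own illustration under stated assumptions: the class and function names are invented, the VOID_TAGS and KEEP_WHEN_EMPTY sets are guesses you should adjust, and the input is assumed to be reasonably well formed.

```python
from html.parser import HTMLParser

# Assumption: void elements never have content, so they are not "empty tags".
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Assumption: elements worth keeping even without content; adjust to taste.
KEEP_WHEN_EMPTY = {"script", "style", "iframe", "textarea"}


class Node:
    """An element: raw start tag, children (Nodes or raw strings), end tag."""

    def __init__(self, tag=None, start=""):
        self.tag, self.start, self.end = tag, start, ""
        self.children = []


class TreeBuilder(HTMLParser):
    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; verbatim.
        super().__init__(convert_charrefs=False)
        self.root = Node()
        self.stack = [self.root]

    def _add(self, item):
        self.stack[-1].children.append(item)

    def handle_starttag(self, tag, attrs):
        raw = self.get_starttag_text()
        if tag in VOID_TAGS:
            self._add(raw)
        else:
            node = Node(tag, raw)
            self._add(node)
            self.stack.append(node)

    def handle_startendtag(self, tag, attrs):
        self._add(self.get_starttag_text())  # self-closing, e.g. <br/>

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                self.stack[i].end = f"</{tag}>"
                del self.stack[i:]
                return

    def handle_data(self, data):
        self._add(data)

    def handle_entityref(self, name):
        self._add(f"&{name};")

    def handle_charref(self, name):
        self._add(f"&#{name};")

    def handle_comment(self, data):
        self._add(f"<!--{data}-->")

    def handle_decl(self, decl):
        self._add(f"<!{decl}>")


def render(node):
    """Serialize children, skipping elements with only whitespace inside."""
    parts = []
    for child in node.children:
        if isinstance(child, str):
            parts.append(child)
            continue
        inner = render(child)
        if inner.strip() or child.tag in KEEP_WHEN_EMPTY:
            parts.append(child.start + inner + child.end)
    return "".join(parts)


def strip_empty_elements(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return render(builder.root)
```

Because the parser knows where attribute values and SCRIPT content begin and end, strip_empty_elements should leave the awkward DIV example above untouched while still reducing <p><strong></strong></p> to nothing. Note that this sketch re-emits end tags in lowercase and makes no attempt to handle constructs beyond elements, text, entities, comments and declarations.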
