Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

C# - HTML Agility Pack: removing unwanted tags without removing content?

I've seen a few related questions out here, but they don’t exactly talk about the same problem I am facing.

I want to use the HTML Agility Pack to remove unwanted tags from my HTML without losing the content within the tags.

So for instance, in my scenario, I would like to preserve the tags "b", "i" and "u".

And for an input like:

<p>my paragraph <div>and my <b>div</b></div> are <i>italic</i> and <b>bold</b></p>

The resulting HTML should be:

my paragraph and my <b>div</b> are <i>italic</i> and <b>bold</b>

I tried using HtmlNode's Remove method, but it removes my content too. Any suggestions?

1 Reply

I wrote an algorithm based on Oded's suggestions. Here it is. Works like a charm.

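It strips every tag except strong, em and u. Text nodes are left alone, and the children of each removed element are moved up into its parent, so the content survives. To keep "b", "i" and "u" as in the question, change the acceptableTags array accordingly.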

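// Requires the HtmlAgilityPack package and these usings at the top of the file:
// using System.Collections.Generic;
// using System.Linq;
// using HtmlAgilityPack;
internal static string RemoveUnwantedTags(string data)
{
    if (string.IsNullOrEmpty(data)) return string.Empty;

    var document = new HtmlDocument();
    document.LoadHtml(data);

    // Tags to keep; every other element is unwrapped.
    var acceptableTags = new[] { "strong", "em", "u" };

    // SelectNodes returns null when nothing matches, so guard before building the queue.
    var topLevelNodes = document.DocumentNode.SelectNodes("./*|./text()");
    if (topLevelNodes == null) return document.DocumentNode.InnerHtml;

    var nodes = new Queue<HtmlNode>(topLevelNodes);
    while (nodes.Count > 0)
    {
        var node = nodes.Dequeue();
        if (node.Name == "#text") continue; // text nodes are always kept

        var parentNode = node.ParentNode;
        var keep = acceptableTags.Contains(node.Name);
        var childNodes = node.SelectNodes("./*|./text()");

        if (childNodes != null)
        {
            foreach (var child in childNodes)
            {
                // Inspect every child, including those nested inside a kept tag.
                nodes.Enqueue(child);

                // For an unwanted tag, hoist the child in front of the tag being removed.
                if (!keep) parentNode.InsertBefore(child, node);
            }
        }

        // The children now live in the parent, so dropping the node keeps its content.
        if (!keep) parentNode.RemoveChild(node);
    }

    return document.DocumentNode.InnerHtml;
}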
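
The answer hard-codes strong, em and u, while the question asks to keep b, i and u. Below is a minimal sketch of the same queue-based approach with the list of tags passed in by the caller. The names TagStripper and KeepOnly are hypothetical and only serve the illustration. It assumes the HtmlAgilityPack NuGet package is referenced. Unlike the answer's code, it also visits the children of kept tags, so an unwanted tag nested inside a kept one is unwrapped as well.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class TagStripper
{
    // Unwraps every element whose name is not in allowedTags, keeping its content.
    public static string KeepOnly(string html, params string[] allowedTags)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;

        var document = new HtmlDocument();
        document.LoadHtml(html);

        var allowed = new HashSet<string>(allowedTags, StringComparer.OrdinalIgnoreCase);

        // SelectNodes returns null when nothing matches.
        var topLevel = document.DocumentNode.SelectNodes("./*|./text()");
        if (topLevel == null) return document.DocumentNode.InnerHtml;

        var queue = new Queue<HtmlNode>(topLevel);
        while (queue.Count > 0)
        {
            var node = queue.Dequeue();
            if (node.Name == "#text") continue; // text is always kept

            var parent = node.ParentNode;
            var keep = allowed.Contains(node.Name);
            var children = node.SelectNodes("./*|./text()");

            if (children != null)
            {
                foreach (var child in children)
                {
                    queue.Enqueue(child);                         // inspect it later
                    if (!keep) parent.InsertBefore(child, node);  // hoist out of the unwanted tag
                }
            }

            if (!keep) parent.RemoveChild(node);
        }

        return document.DocumentNode.InnerHtml;
    }
}

public static class Program
{
    public static void Main()
    {
        // Input taken from the question; keep only b, i and u.
        var input = "<p>my paragraph <div>and my <b>div</b></div> are <i>italic</i> and <b>bold</b></p>";
        Console.WriteLine(TagStripper.KeepOnly(input, "b", "i", "u"));
    }
}
```

Passing the tags as a parameter means the same method can serve both the strong/em/u case from the answer and the b/i/u case from the question. How the parser treats a div nested inside a p may vary between HtmlAgilityPack versions, so it is worth checking the printed result against the version in use.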
