Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
663 views
in Technique[技术] by (71.8m points)

c# - How to fix ill-formed HTML with HTML Agility Pack?

I have this ill-formed HTML with overlapping tags:

<p>word1<b>word2</p>
<p>word3</b>word4</p>

The overlapping can be nested, too.

How can I convert it into well-formed HTML with HTML Agility Pack (HAP)?

I'm looking for this output:

<p>word1<b>word2</b></p>
<p><b>word3</b>word4</p>

I tried HtmlNode.ElementsFlags["b"] = HtmlElementFlag.Closed | HtmlElementFlag.CanOverlap, but it does not work as expected.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

It is in fact working as expected, but maybe not working as you expected. Anyway, here is a sample piece of code (a Console application) that demonstrates how you can achieve some HTML fixing using the library.

The library has a ParseErrors collection that you can use to determine what errors were detecting during markup parsing.

There are really two types of problems here:

1) unclosed elements. This one is fixed by default by the library, but there is an option on the P element that prevents that in this case.

2) unopened elements. This one is more complex, because it depends how you want to fix it, where do you want to have the tag opened? In the following sample, I've used the nearest previous text sibling node to open the element.

static void Main(string[] args)
{
    // clear the flags on P so unclosed elements in P will be auto closed.
    HtmlNode.ElementsFlags.Remove("p");

    // load the document
    HtmlDocument doc = new HtmlDocument();
    doc.Load("yourTestFile.htm");

    // build a list of nodes ordered by stream position
    NodePositions pos = new NodePositions(doc);

    // browse all tags detected as not opened
    foreach (HtmlParseError error in doc.ParseErrors.Where(e => e.Code == HtmlParseErrorCode.TagNotOpened))
    {
        // find the text node just before this error
        HtmlTextNode last = pos.Nodes.OfType<HtmlTextNode>().LastOrDefault(n => n.StreamPosition < error.StreamPosition);
        if (last != null)
        {
            // fix the text; reintroduce the broken tag
            last.Text = error.SourceText.Replace("/", "") + last.Text + error.SourceText;
        }
    }

    doc.Save(Console.Out);
}

public class NodePositions
{
    public NodePositions(HtmlDocument doc)
    {
        AddNode(doc.DocumentNode);
        Nodes.Sort(new NodePositionComparer());
    }

    private void AddNode(HtmlNode node)
    {
        Nodes.Add(node);
        foreach (HtmlNode child in node.ChildNodes)
        {
            AddNode(child);
        }
    }

    private class NodePositionComparer : IComparer<HtmlNode>
    {
        public int Compare(HtmlNode x, HtmlNode y)
        {
            return x.StreamPosition.CompareTo(y.StreamPosition);
        }
    }

    public List<HtmlNode> Nodes = new List<HtmlNode>();
}

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...