Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
318 views
in Technique[技术] by (71.8m points)

php - loadHTML LIBXML_HTML_NOIMPLIED on an html fragment generates incorrect tags

Using the LIBXML_HTML_NOIMPLIED flag with an html fragment generates incorrect tags:

$str = '<p>Lorem ipsum dolor sit amet.</p><p>Nunc vel vehicula ante.</p>';
$doc = new DOMDocument();
$doc->loadHTML($str, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
echo $doc->saveHTML();

Outputs:

<p>Lorem ipsum dolor sit amet.<p>Nunc vel vehicula ante.</p></p>

I have found hacks to work around this using regexes, but that defeats the purpose of using DOM. I have tested this with several versions of libxml and php, the latest with libxml 2.9.2, php 5.6.7 (Debian Jessy). Any suggestions appreciated.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

The re-arrangement is done by the LIBXML_HTML_NOIMPLIED option you're using. Looks like it's not stable enough for your case.

Also you might want to not use it for portablility reasons, for example I've got one PHP 5.4.36 with Libxml 2.7.8 at hand that is not supporting LIBXML_HTML_NOIMPLIED (Libxml >= 2.7.7) but later LIBXML_HTML_NODEFDTD (Libxml >= 2.7.8) option.

I know this way of dealing with it. When you load the fragment, you wrap it into a <div> element:

$doc->loadHTML("<div>$str</div>");

This helps to guide DOMDocument on the structure you want.

You can then extract this container from the document itself:

$container = $doc->getElementsByTagName('div')->item(0);
$container = $container->parentNode->removeChild($container);

And then remove all children from the document:

while ($doc->firstChild) {
    $doc->removeChild($doc->firstChild);
}

Now the document is completely empty and you're now able to append children again. Luckily there is the <div> container element we removed earlier, so we can add from it:

while ($container->firstChild ) {
    $doc->appendChild($container->firstChild);
}

The fragment then can be retrieved with the known saveHTML method:

echo $doc->saveHTML();

Which gives in your scenario:

<p>Lorem ipsum dolor sit amet.</p><p>Nunc vel vehicula ante.</p>

This methodology is a little different from the existing material here on site (see the references I give below), so the example at once:

$str = '<p>Lorem ipsum dolor sit amet.</p><p>Nunc vel vehicula ante.</p>';

$doc = new DOMDocument();
$doc->loadHTML("<div>$str</div>");

$container = $doc->getElementsByTagName('div')->item(0);
$container = $container->parentNode->removeChild($container);
while ($doc->firstChild) {
    $doc->removeChild($doc->firstChild);
}

while ($container->firstChild ) {
    $doc->appendChild($container->firstChild);
}

echo $doc->saveHTML();

I also really recommend the reference question on How to saveHTML of DOMDocument without HTML wrapper? for a further read as well as the one about inner-html

References


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...