If you are able to obtain a DOMDocument object representing your HTML, then you just need to traverse it recursively and construct the data structure that you want.
Converting your HTML document into a DOMDocument should be as simple as this:
function html_to_obj($html) {
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    return element_to_obj($dom->documentElement);
}
Then, a simple traversal of $dom->documentElement which gives the kind of structure you described could look like this:
function element_to_obj($element) {
    $obj = array( "tag" => $element->tagName );
    // Copy every attribute onto the object as a key/value pair.
    foreach ($element->attributes as $attribute) {
        $obj[$attribute->name] = $attribute->value;
    }
    foreach ($element->childNodes as $subElement) {
        if ($subElement->nodeType == XML_TEXT_NODE) {
            // Text content; a later text node overwrites an earlier one.
            $obj["html"] = $subElement->wholeText;
        }
        else {
            // Anything else is assumed to be an element and is recursed into.
            $obj["children"][] = element_to_obj($subElement);
        }
    }
    return $obj;
}
Test case
$html = <<<EOF
<!DOCTYPE html>
<html lang="en">
<head>
<title> This is a test </title>
</head>
<body>
<h1> Is this working? </h1>
<ul>
<li> Yes </li>
<li> No </li>
</ul>
</body>
</html>
EOF;
header("Content-Type: text/plain");
echo json_encode(html_to_obj($html), JSON_PRETTY_PRINT);
Output
{
"tag": "html",
"lang": "en",
"children": [
{
"tag": "head",
"children": [
{
"tag": "title",
"html": " This is a test "
}
]
},
{
"tag": "body",
"html": "\n",
"children": [
{
"tag": "h1",
"html": " Is this working? "
},
{
"tag": "ul",
"children": [
{
"tag": "li",
"html": " Yes "
},
{
"tag": "li",
"html": " No "
}
],
"html": "\n"
}
]
}
]
}
Answer to updated question
The solution proposed above does not work with the <script> element, because its content is parsed not as a plain DOMText node, but as a CDATA section (a DOMCdataSection object, whose nodeType is XML_CDATA_SECTION_NODE). This is because the DOM extension in PHP is based on libxml2, which parses your HTML as HTML 4.0, and in HTML 4.0 the content of <script> is of type CDATA and not #PCDATA.
You have two solutions for this problem.
The simple but not very robust solution would be to add the LIBXML_NOCDATA flag to DOMDocument::loadHTML. (I am not actually 100% sure whether this works for the HTML parser.)
The more difficult but, in my opinion, better solution is to add an additional test when you check $subElement->nodeType before the recursion. The recursive function would become:
function element_to_obj($element) {
    $obj = array( "tag" => $element->tagName );
    foreach ($element->attributes as $attribute) {
        $obj[$attribute->name] = $attribute->value;
    }
    foreach ($element->childNodes as $subElement) {
        if ($subElement->nodeType == XML_TEXT_NODE) {
            $obj["html"] = $subElement->wholeText;
        }
        elseif ($subElement->nodeType == XML_CDATA_SECTION_NODE) {
            // <script> content arrives as a CDATA section, not a text node.
            $obj["html"] = $subElement->data;
        }
        else {
            $obj["children"][] = element_to_obj($subElement);
        }
    }
    return $obj;
}
If you hit another bug of this type, the first thing you should do is check what type of node $subElement is, because there are many other possibilities that my short example function does not deal with.
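One way to do that check is to list the node type of every child before deciding how to handle it. The following is only a minimal sketch: describe_children() is a hypothetical helper name that is not part of the code above, and it assumes PHP 7 or later for the ?? operator and the type declarations.

```php
<?php
// Hypothetical debugging helper (an assumption, not part of the original
// answer): returns one "type: nodeName" string per direct child of $node.
function describe_children(DOMNode $node): array {
    // Readable labels for the node-type constants most likely to show up.
    $labels = array(
        XML_ELEMENT_NODE       => "element",
        XML_TEXT_NODE          => "text",
        XML_CDATA_SECTION_NODE => "cdata",
        XML_COMMENT_NODE       => "comment",
    );
    $out = array();
    foreach ($node->childNodes as $child) {
        // Fall back to the raw numeric type for anything not listed above.
        $label = $labels[$child->nodeType] ?? "other (" . $child->nodeType . ")";
        $out[] = $label . ": " . $child->nodeName;
    }
    return $out;
}

$dom = new DOMDocument();
$dom->loadHTML("<div><!-- note --><p>Hi</p>text</div>");
$div = $dom->getElementsByTagName("div")->item(0);
print_r(describe_children($div));
```

Here the comment child shows up as its own node type, which is exactly the kind of case that would fall into the else branch of element_to_obj and break it.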
Additionally, you will notice that libxml2 has to fix mistakes in your HTML in order to be able to build a DOM for it. This is why <html> and <head> elements will appear even if you don't specify them. You can avoid this by using the LIBXML_HTML_NOIMPLIED flag.
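As a rough illustration of that flag, the sketch below passes it as the second argument of loadHTML. It assumes a PHP build whose libxml2 is recent enough to define LIBXML_HTML_NOIMPLIED and LIBXML_HTML_NODEFDTD, and it assumes the fragment has a single root element; with several top-level elements the resulting tree can be surprising.

```php
<?php
$dom = new DOMDocument();
// LIBXML_HTML_NOIMPLIED: do not add implied <html>/<body> wrappers.
// LIBXML_HTML_NODEFDTD: do not add a default doctype when none is given.
$dom->loadHTML(
    "<div><p>Hi</p></div>",
    LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
);
// The fragment's own root element is now the document element.
echo $dom->documentElement->tagName, "\n";
```

With this in place, html_to_obj would start its traversal at the fragment's root rather than at a generated <html> element.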
Test case with script
$html = <<<EOF
<script type="text/javascript">
alert('hi');
</script>
EOF;
header("Content-Type: text/plain");
echo json_encode(html_to_obj($html), JSON_PRETTY_PRINT);
Output
{
"tag": "html",
"children": [
{
"tag": "head",
"children": [
{
"tag": "script",
"type": "text/javascript",
"html": "\nalert('hi');\n"
}
]
}
]
}