I am trying to learn how to use DOMDocument to parse HTML.
I am only doing some simple work. I liked Gordon's answer on scraping data using regex and simplehtmldom, and I based my code on his work.
I found the documentation on PHP.net lacking: it has limited information and almost no examples, and most of the specifics are about parsing XML.
<?php
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTMLFile('http://www.nu.nl/internet/1106541/taalunie-keurt-open-sourcewoordenlijst-goed.html');
libxml_clear_errors();
$recipe = array();
$xpath = new DOMXPath($dom);
$contentDiv = $dom->getElementById('page'); // would have preferred getContentbyClass('content') (unique) in this case.
# title
print_r($xpath->evaluate('string(div/div/div/div/div/h1)', $contentDiv));
# content (this is not working)
#print_r($xpath->evaluate('string(div/div/div/div['content'])', $contentDiv)); // if only this worked
print_r($xpath->evaluate('string(div/div/div/div)', $contentDiv));
?>
For testing purposes, I am trying to get the title (between the h1 tags) and the content (as HTML) of a nu.nl news article.
As you can see, I can get the title, although I am not happy with that evaluate string, since it only works because that happens to be the only h1 tag at that div level.
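
To make clearer what I am after, here is a minimal sketch of a class-based lookup with DOMXPath. The inline markup, the class names, and the variable names `$query` and `$contentHtml` are hypothetical stand-ins, not the real nu.nl page structure, so the expressions would probably need adjusting for the live page. It assumes the content div carries a `content` class and that `saveHTML()` accepts a node argument (PHP 5.3.6 or later).

```php
<?php
// Hypothetical markup standing in for the real article page (assumption).
$html = <<<HTML
<html><body>
<div id="page">
  <div class="wrapper">
    <div class="content main">
      <h1>Example title</h1>
      <p>First <b>paragraph</b>.</p>
    </div>
  </div>
</div>
</body></html>
HTML;

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Select a div whose class attribute contains the whole token "content".
// Padding with spaces avoids matching class names such as "contentwrap".
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " content ")]';
$contentDiv = $xpath->query($query)->item(0);

// Title: the first h1 anywhere below the content div, whatever its depth.
$title = $xpath->evaluate('string(.//h1)', $contentDiv);

// Content as HTML: serialize every child node of the content div.
$contentHtml = '';
foreach ($contentDiv->childNodes as $child) {
    $contentHtml .= $dom->saveHTML($child);
}

echo $title, "\n";
echo trim($contentHtml), "\n";
```

The relative `.//h1` expression no longer depends on counting `div` levels, and serializing the child nodes keeps the markup that `string()` would flatten to plain text.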