I'm trying to parse some HTML using DOMDocument, but when I do, I suddenly lose my encoding (at least that is how it appears to me).
$profile = "<div><p>various japanese characters</p></div>";
$dom = new DOMDocument();
$dom->loadHTML($profile);
$divs = $dom->getElementsByTagName('div');
foreach ($divs as $div) {
echo $dom->saveHTML($div);
}
The result of this code is that I get a bunch of characters that are not Japanese. However, if I do:
echo $profile;
it displays correctly. I've tried saveHTML and saveXML, and neither display correctly. I am using PHP 5.3.
What I see:
?¤?a??¤?·?·???′???|??¢?¤?????3??3???????o-???9?oo?????5?a???¨??|??????????????|4?oo???3?a???a?£????è|a?ˉ?¨??????????1??3?§??ˉè|a?ˉéμ????±????¢??¤??? ?£??é?? ????£?ˉ?-?£??£???¢????¤????¤?????è2è3é????a??????a??ˉ?3???é?? ???é2?-|?
What should be shown:
イリノイ州シカゴにて、アイルランド系の家庭に、9人兄弟の5番目として生まれる。彼を含めて4人が俳優になった。父親は木材のセールスマンで、母親は郵便局の客室係だった。高校時代はキャディのアルバイトに勤しみ、教育資金を受けながらカトリック系の高校へ進学
EDIT: I've simplified the code down to five lines so you can test it yourself.
$profile = "<div lang=ja><p>イリノイ州シカゴにて、アイルランド系の家庭に、</p></div>";
$dom = new DOMDocument();
$dom->loadHTML($profile);
echo $dom->saveHTML();
echo $profile;
Here is the html that is returned:
<div lang="ja"><p>??¤??a?????¤?·???·?????′???|?€??¢??¤????????3??‰?3???????o-???€</p></div>
<div lang="ja"><p>イリノイ州シカゴにて、アイルランド系の家庭に、</p></div>
Question&Answers:
os 与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…