python - HTML encoding and lxml parsing

Question

Welcome To Ask or Share your Answers For Others

python - HTML encoding and lxml parsing

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

python - HTML encoding and lxml parsing

I'm trying to finally solve some encoding issues that pop up from trying to scrape HTML with lxml. Here are three sample HTML documents that I've encountered:

1.

<!DOCTYPE html>
<html lang='en'>
<head>
   <title>Unicode Chars: ? —’</title>
   <meta charset='utf-8'>
</head>
<body></body>
</html>

2.

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="ko-KR" lang="ko-KR">
<head>
    <title>Unicode Chars: ? —’</title>
    <meta http-equiv="content-type" content="text/html; charset=utf-8" />
</head>
<body></body>
</html>

3.

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>Unicode Chars: ? —’</title>
</head>
<body></body>
</html>

My basic script:

from lxml.html import fromstring
...

doc = fromstring(raw_html)
title = doc.xpath('//title/text()')[0]
print title

The results are:

Unicode Chars: ì aa
Unicode Chars: ? —’
Unicode Chars: ? —’

So, obviously an issue with sample 1 and the missing <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> tag. The solution from here will correctly recognize sample 1 as utf-8 and so it is functionally equivalent to my original code.

The lxml docs appear conflicted:

From here the example seems to suggest we should use UnicodeDammit to encode the markup as unicode.

from BeautifulSoup import UnicodeDammit

def decode_html(html_string):
    converted = UnicodeDammit(html_string, isHTML=True)
    if not converted.unicode:
        raise UnicodeDecodeError(
            "Failed to detect encoding, tried [%s]",
            ', '.join(converted.triedEncodings))
    # print converted.originalEncoding
    return converted.unicode

root = lxml.html.fromstring(decode_html(tag_soup))

However here it says:

[Y]ou will get errors when you try [to parse] HTML data in a unicode string that specifies a charset in a meta tag of the header. You should generally avoid converting XML/HTML data to unicode before passing it into the parsers. It is both slower and error prone.

If I try to follow the the first suggestion in the lxml docs, my code is now:

from lxml.html import fromstring
from bs4 import UnicodeDammit
...
dammit = UnicodeDammit(raw_html)
doc = fromstring(dammit.unicode_markup)
title = doc.xpath('//title/text()')[0]
print title

I now get the following results:

Unicode Chars: ? —’
Unicode Chars: ? —’
ValueError: Unicode strings with encoding declaration are not supported.

Sample 1 now works correctly but sample 3 results in an error due to the <?xml version="1.0" encoding="utf-8"?> tag.

Is there a correct way to handle all of these cases? Is there a better solution than the following?

dammit = UnicodeDammit(raw_html)
try:
    doc = fromstring(dammit.unicode_markup)
except ValueError:
    doc = fromstring(raw_html)

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-23T17:46:37+0000

lxml has several issues related to handling Unicode. It might be best to use bytes (for now) while specifying the character encoding explicitly:

#!/usr/bin/env python
import glob
from lxml import html
from bs4 import UnicodeDammit

for filename in glob.glob('*.html'):
    with open(filename, 'rb') as file:
        content = file.read()
        doc = UnicodeDammit(content, is_html=True)

    parser = html.HTMLParser(encoding=doc.original_encoding)
    root = html.document_fromstring(content, parser=parser)
    title = root.find('.//title').text_content()
    print(title)

Output

Unicode Chars: ? —’
Unicode Chars: ? —’
Unicode Chars: ? —’

Categories

python - HTML encoding and lxml parsing

python - HTML encoding and lxml parsing

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Output

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags