python - lxml truncates text that contains 'less than' character

Question

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

>>> s = '<div> < 20 </div>'
>>> import lxml.html
>>> tree = lxml.html.fromstring(s)
>>> lxml.etree.tostring(tree)
'<div> </div>'

Does anybody know any workaround for this?

1 Reply

深蓝 · Answer 1 · 2021-10-23T20:03:34+0000

Your HTML input is broken; that < left angle bracket should have been encoded as &lt; instead. From the lxml documentation on parsing broken HTML:

The support for parsing broken HTML depends entirely on libxml2's recovery algorithm. It is not the fault of lxml if you find documents that are so heavily broken that the parser cannot handle them. There is also no guarantee that the resulting tree will contain all data from the original document. The parser may have to drop seriously broken parts when struggling to keep parsing. Especially misplaced meta tags can suffer from this, which may lead to encoding problems.

In other words, you take what you can get from such documents; the way lxml handles broken HTML is not otherwise configurable.
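If you control the code that produces the HTML, the simplest fix is to escape the text before it is embedded in markup. A minimal sketch using the standard library's html.escape (the variable names are only illustrative, and this assumes you have the text content separately from the tags):

```python
import html

# Text content that contains a literal "less than" character.
text = " < 20 "

# html.escape turns "<" into "&lt;" (and also escapes "&", ">", and quotes),
# so the resulting markup is well-formed and no parser has to guess.
markup = "<div>" + html.escape(text) + "</div>"
print(markup)
```

Markup built this way no longer relies on the parser's recovery behaviour.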

One thing you could try is a different HTML parser. Try BeautifulSoup instead; its broken-HTML handling may give you a different version of that document that does contain what you want. BeautifulSoup can use different parser backends, including lxml and html5lib, so it gives you more flexibility.

The html5lib parser does give you the < character (converted to an &lt; escape):

>>> from bs4 import BeautifulSoup
>>> BeautifulSoup("<div> < 20 </div>", "html5lib")
<html><head></head><body><div> &lt; 20 </div></body></html>
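If switching parsers is not an option and you cannot fix the source of the HTML, another possible workaround is to pre-escape stray < characters before handing the string to lxml. The following is only a heuristic sketch, not something from the lxml documentation: the helper name escape_stray_lt and the pattern are assumptions, and the pattern treats any < that is not followed by a letter, /, ! or ? as text rather than the start of a tag.

```python
import re

# Heuristic (an assumption, not part of lxml): a "<" that is NOT followed by
# a letter, "/", "!" or "?" cannot start a tag, closing tag, comment/doctype
# or processing instruction, so treat it as literal text.
STRAY_LT = re.compile(r"<(?![A-Za-z/!?])")


def escape_stray_lt(markup: str) -> str:
    """Replace stray '<' characters with the '&lt;' entity."""
    return STRAY_LT.sub("&lt;", markup)


cleaned = escape_stray_lt("<div> < 20 </div>")
print(cleaned)
```

The cleaned string can then be passed to lxml.html.fromstring as usual. Note the limitation: a < immediately followed by a letter, as in "a<b", still looks like a tag to this pattern and is left alone.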
