python - parsing XML file gets UnicodeEncodeError (ElementTree) / ValueError (lxml)

Question

Welcome To Ask or Share your Answers For Others

python - parsing XML file gets UnicodeEncodeError (ElementTree) / ValueError (lxml)

posted Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

python - parsing XML file gets UnicodeEncodeError (ElementTree) / ValueError (lxml)

I send a GET request to the CareerBuilder API :

import requests

url = "http://api.careerbuilder.com/v1/jobsearch"
payload = {'DeveloperKey': 'MY_DEVLOPER_KEY',
           'JobTitle': 'Biologist'}
r = requests.get(url, params=payload)
xml = r.text

And get back an XML that looks like this. However, I have trouble parsing it.

Using either lxml

>>> from lxml import etree
>>> print etree.fromstring(xml)

Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    print etree.fromstring(xml)
  File "lxml.etree.pyx", line 2992, in lxml.etree.fromstring (srclxmllxml.etree.c:62311)
  File "parser.pxi", line 1585, in lxml.etree._parseMemoryDocument (srclxmllxml.etree.c:91625)
ValueError: Unicode strings with encoding declaration are not supported.

or ElementTree:

Traceback (most recent call last):
  File "<pyshell#3>", line 1, in <module>
    print ET.fromstring(xml)
  File "C:Python27libxmletreeElementTree.py", line 1301, in XML
    parser.feed(text)
  File "C:Python27libxmletreeElementTree.py", line 1641, in feed
    self._parser.Parse(data, 0)
UnicodeEncodeError: 'ascii' codec can't encode character u'xa0' in position 3717: ordinal not in range(128)

So, even though the XML file starts with

<?xml version="1.0" encoding="UTF-8"?>

I have the impression that it contains characters that are not allowed. How do I parse this file with either lxmlor ElementTree?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-17T00:04:15+0000

You are using the decoded unicode value. Use r.raw raw response data instead:

r = requests.get(url, params=payload, stream=True)
r.raw.decode_content = True
etree.parse(r.raw)

which will read the data from the response directly; do note the stream=True option to .get().

Setting the r.raw.decode_content = True flag ensures that the raw socket will give you the decompressed content even if the response is gzip or deflate compressed.

You don't have to stream the response; for smaller XML documents it is fine to use the response.content attribute, which is the un-decoded response body:

r = requests.get(url, params=payload)
xml = etree.fromstring(r.content)

XML parsers always expect bytes as input as the XML format itself dictates how the parser is to decode those bytes to Unicode text.

Categories

python - parsing XML file gets UnicodeEncodeError (ElementTree) / ValueError (lxml)

python - parsing XML file gets UnicodeEncodeError (ElementTree) / ValueError (lxml)

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags