Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
929 views
in Technique[技术] by (71.8m points)

xml - Python: Unicode and ElementTree.parse

I'm trying to move to Python 2.7 and since Unicode is a Big Deal there, I'd try dealing with them with XML files and texts and parse them using the xml.etree.cElementTree library. But I ran across this error:

>>> import xml.etree.cElementTree as ET
>>> from io import StringIO
>>> source = """
... <?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
... <root>
...   <Parent>
...     <Child>
...       <Element>Text</Element>
...     </Child>
...   </Parent>
... </root>
... """
>>> srcbuf = StringIO(source.decode('utf-8'))
>>> doc = ET.parse(srcbuf)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 56, in parse
  File "<string>", line 35, in parse
cElementTree.ParseError: no element found: line 1, column 0

The same thing happens using io.open('filename.xml', encoding='utf-8') to pass to ET.parse:

>>> with io.open('test.xml', mode='w', encoding='utf-8') as fp:
...     fp.write(source.decode('utf-8'))
...
150L
>>> with io.open('test.xml', mode='r', encoding='utf-8') as fp:
...     fp.read()
...
u'<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<root>
  <Parent>

    <Child>
      <Element>Text</Element>
    </Child>
  </Parent>
</root>

'
>>> with io.open('test.xml', mode='r', encoding='utf-8') as fp:
...     ET.parse(fp)
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "<string>", line 56, in parse
  File "<string>", line 35, in parse
cElementTree.ParseError: no element found: line 1, column 0

Is there something about unicode and ET parsing that I am missing here?

edit: Apparently, the ET parser does not play well with unicode input stream? The following works:

>>> with io.open('test.xml', mode='rb') as fp:
...     ET.parse(fp)
...
<ElementTree object at 0x0180BC10>

But this also means I cannot use io.StringIO if I want to parse from an in-memory text, unless I encode it first into an in-memory buffer?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Your problem is that you are feeding ElementTree unicode, but it prefers to consume bytes. It will provide you with unicode in any case.

In Python 2.x, it can only consume bytes. You can tell it what encoding those bytes are in, but that's it. So, if you literally have to work with an object that represents a text file, like io.StringIO, first you will need to convert it into something else.

If you are literally starting with a 2.x-str (AKA bytes) in UTF-8 encoding, in memory, as in your example, use xml.etree.cElementTree.XML to parse it into XML in one fell swoop and don't worry about any of this :-).

If you want an interface that can deal with data that is incrementally read from a file, use xml.etree.cElementTree.parse with an io.BytesIO to convert it into an in-memory stream of bytes rather than an in-memory string of characters. If you want to use io.open, use it with the b flag, so that you get streams of bytes.

In Python 3.x, you can pass unicode directly in to ElementTree, which is a bit more convenient, and arguably the newer version of ElementTree is more correct to allow this. However, you still might not want to, and Python 3's version does still accept bytes as input. You're always starting with bytes anyway: by passing them directly from your input source to ElementTree, you get to let it do its encoding or decoding intelligently inside the XML parsing engine, as well as do on-the-fly detection of encoding declarations within the input stream, which you can do with XML but you can't do with arbitrary textual data. So letting the XML parser do the work of decoding is the right place to put that responsibility.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...