I'm not going to try to justify the behavior, only to explain why it happens with the code as written.
In short: the XML parser that Python uses, expat, operates on bytes, not Unicode characters. You MUST call .encode('utf-16-be') or .encode('utf-16-le') on the string before you pass it to ElementTree.fromstring:
ElementTree.fromstring(data.encode('utf-16-be'))
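As a slightly fuller sketch of that call, assuming a document that declares encoding="utf-16" and contains one non-ASCII character (the sample string and the variable names are made up for illustration):

```python
from xml.etree import ElementTree

# Hypothetical sample document; the declaration says UTF-16.
data = u'<?xml version="1.0" encoding="utf-16"?><root>\u2163</root>'

# Encode explicitly so that expat receives UTF-16 bytes instead of an
# implicitly converted string.
raw = data.encode('utf-16-be')

root = ElementTree.fromstring(raw)
print(repr(root.text))
```

The point of the sketch is only that the bytes handed to the parser match the declared encoding; .encode('utf-16-le') should behave the same way.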
Proof: ElementTree.fromstring eventually calls down into pyexpat.xmlparser.Parse, which is implemented in pyexpat.c:
static PyObject *
xmlparse_Parse(xmlparseobject *self, PyObject *args)
{
    char *s;
    int slen;
    int isFinal = 0;

    if (!PyArg_ParseTuple(args, "s#|i:Parse", &s, &slen, &isFinal))
        return NULL;

    return get_parse_result(self, XML_Parse(self->itself, s, slen, isFinal));
}
So the unicode parameter you passed in gets converted using s#. The docs for PyArg_ParseTuple say:
s# (string, Unicode or any read buffer compatible object) [const char *, int (or Py_ssize_t, see below)]
This variant on s stores into two C variables, the first one a pointer to a character string, the second one its length. In this case the Python string may contain embedded null bytes. Unicode objects pass back a pointer to the default encoded string version of the object if such a conversion is possible. All other read-buffer compatible objects pass back a reference to the raw internal data representation.
Let's check this out:
from xml.etree import ElementTree
data = u'<?xml version="1.0" encoding="utf-8"?><root>\u2163</root>'
print ElementTree.fromstring(data)
gives the error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2163' in position 44: ordinal not in range(128)
which means that when you were specifying encoding="utf-8", you were just getting lucky that there weren't non-ASCII characters in your input when the Unicode string got encoded to ASCII. If you add the following before you parse, UTF-8 works as expected with that example:
import sys
reload(sys).setdefaultencoding('utf8')
However, it doesn't work to set the default encoding to 'utf-16-be' or 'utf-16-le', since the Python parts of ElementTree do direct string comparisons which start to fail in UTF-16 land.
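For the UTF-8 case, a sketch of an alternative that avoids touching the process-wide default is to encode explicitly before parsing. The snippet first reproduces the failing conversion by encoding to ASCII by hand (assuming ASCII is the default codec, as in the error message above), then passes UTF-8 bytes to the parser. The variable names are illustrative only.

```python
from xml.etree import ElementTree

data = u'<?xml version="1.0" encoding="utf-8"?><root>\u2163</root>'

# Roughly what the implicit conversion attempts when the default codec is ASCII.
try:
    data.encode('ascii')
    failed_at = None
except UnicodeEncodeError as err:
    failed_at = err.start  # index of the first character ASCII cannot encode

# Encoding explicitly to match the declaration avoids relying on the default.
root = ElementTree.fromstring(data.encode('utf-8'))

print(failed_at)
print(repr(root.text))
```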