Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - Encoding gives "'ascii' codec can't encode character … ordinal not in range(128)"

I am working through the Django RSS reader project here.

The RSS feed will read something like "OKLAHOMA CITY (AP) — James Harden let". The RSS feed's encoding reads encoding="UTF-8" so I believe I am passing utf-8 to markdown in the code snippet below. The em dash is where it chokes.

I get the Django error "'ascii' codec can't encode character u'\u2014' in position 109: ordinal not in range(128)", which is a UnicodeEncodeError. In the variables being passed I see "OKLAHOMA CITY (AP) \u2014 James Harden". The line that fails is:

content = content.encode(parsed_feed.encoding, "xmlcharrefreplace")

I am using Markdown 2.0, Django 1.1, and Python 2.4.

What is the magic sequence of encoding and decoding that I need to do to make this work?


(In response to Prometheus' request. I agree the formatting helps)

So in views I add a smart_unicode line above the parsed_feed encoding line...

content = smart_unicode(content, encoding='utf-8', strings_only=False, errors='strict')
content = content.encode(parsed_feed.encoding, "xmlcharrefreplace")

This pushes the problem into my models.py, where I have

def save(self, force_insert=False, force_update=False):
    if self.excerpt:
        self.excerpt_html = markdown(self.excerpt)
        # super save after this

If I change the save method to have...

def save(self, force_insert=False, force_update=False):
    if self.excerpt:
        encoded_excerpt_html = (self.excerpt).encode('utf-8')
        self.excerpt_html = markdown(encoded_excerpt_html)

I get the error "'ascii' codec can't decode byte 0xe2 in position 141: ordinal not in range(128)" because it now reads "\xe2\x80\x94" where the em dash was.


1 Reply


If the data that you are receiving is, in fact, encoded in UTF-8, then it should be a sequence of bytes: a Python 'str' object, in Python 2.x.

You can verify this with an assertion:

assert isinstance(content, str)
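
The check matters because Python 2 silently inserts an ASCII conversion when a method is called on the "wrong" string type: .encode() on a byte string first decodes it as ASCII, and .decode() on a unicode object first encodes it as ASCII. The sketch below makes those hidden steps explicit so that it behaves the same on Python 2.6+ and Python 3. The sample text comes from the question; the variable names are only illustrative.

```python
# Sample text from the question; the em dash is U+2014.
text = u"OKLAHOMA CITY (AP) \u2014 James Harden"
raw = text.encode("utf-8")  # UTF-8 bytes; the dash becomes \xe2\x80\x94

# Hidden step behind Python 2's unicode_obj.decode(...): encode as ASCII first.
try:
    text.encode("ascii")
    encode_error = None
except UnicodeEncodeError as exc:
    encode_error = exc

# Hidden step behind Python 2's byte_str.encode(...): decode as ASCII first.
try:
    raw.decode("ascii")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc

print(type(encode_error).__name__)  # the "can't encode character" case
print(type(decode_error).__name__)  # the "can't decode byte 0xe2" case
```

Both failures point at the em dash. Which one appears depends only on whether the object was bytes or unicode when the implicit ASCII step ran.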

Once you know that is true, you can move on to the actual encoding. Python does not transcode directly, for instance from UTF-8 to ASCII. You first need to turn your sequence of bytes into a Unicode string by decoding it:

unicode_content = content.decode('utf-8')

(If you can trust parsed_feed.encoding, then use that instead of the literal 'utf-8'. Either way, be prepared for errors.)

You can then take that string and encode it in ASCII, replacing non-ASCII characters with their XML numeric character references:

xml_content = unicode_content.encode('ascii', 'xmlcharrefreplace')
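
As a worked example of the two steps, assume the feed delivers the sample sentence from the question as UTF-8 bytes. The literal below is built by hand for illustration.

```python
# UTF-8 bytes as they might arrive from the feed (\xe2\x80\x94 is the em dash).
content = b"OKLAHOMA CITY (AP) \xe2\x80\x94 James Harden let"

# Step 1: bytes -> unicode text.
unicode_content = content.decode("utf-8")

# Step 2: unicode text -> pure-ASCII bytes; every non-ASCII character
# is replaced by a numeric character reference.
xml_content = unicode_content.encode("ascii", "xmlcharrefreplace")

print(xml_content)
```

The em dash (U+2014, decimal 8212) comes out as &#8212;, so the result contains only ASCII bytes and still renders as a dash in HTML.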

The full method, then, would look something like this:

try:
    content = content.decode(parsed_feed.encoding).encode('ascii', 'xmlcharrefreplace')
except UnicodeDecodeError:
    # Couldn't decode the incoming string -- possibly not encoded in utf-8
    # Report the error here, then re-raise so the failure is not silently ignored
    raise
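
One way to package this is a small helper that accepts either bytes or already-decoded text, so it does not matter which one the feed parser hands over. This is only a sketch: to_ascii_xml is a hypothetical name, not part of Django, Markdown, or the feed parser. Falling back to UTF-8 with replacement characters is one possible policy; reporting the error and stopping is equally reasonable.

```python
def to_ascii_xml(content, encoding="utf-8"):
    """Return `content` as ASCII bytes with numeric character references.

    Hypothetical helper for illustration; not an API of any library.
    """
    if isinstance(content, bytes):
        try:
            content = content.decode(encoding or "utf-8")
        except (UnicodeDecodeError, LookupError):
            # Declared encoding was wrong or unknown: fall back to UTF-8
            # and mark undecodable bytes with U+FFFD instead of failing.
            content = content.decode("utf-8", "replace")
    # At this point `content` is unicode text, so encoding is safe.
    return content.encode("ascii", "xmlcharrefreplace")


print(to_ascii_xml(b"(AP) \xe2\x80\x94 James"))   # bytes input
print(to_ascii_xml(u"(AP) \u2014 James"))         # text input
```

In the question's code this would be called as to_ascii_xml(content, parsed_feed.encoding), assuming parsed_feed.encoding holds the encoding name the feed declared.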
