Convert Unicode to ASCII without errors in Python

Question


posted Oct 17, 2021 in Technique by 深蓝 (71.8m points)

My code just scrapes a web page, then converts it to Unicode.

html = urllib.urlopen(link).read()
html.encode("utf8","ignore")
self.response.out.write(html)

But I get a UnicodeDecodeError:

Traceback (most recent call last):
  File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/webapp/__init__.py", line 507, in __call__
    handler.get(*groups)
  File "/Users/greg/clounce/main.py", line 55, in get
    html.encode("utf8","ignore")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 2818: ordinal not in range(128)

I assume that means the HTML contains a malformed attempt at Unicode somewhere. Can I just drop whichever bytes are causing the problem instead of getting an error?

1 Reply

深蓝 · Answer 1 · 2021-10-16T21:24:12+0000

>>> u'aあä'.encode('ascii', 'ignore')
b'a'

Decode the string you get back, using the charset given either in the appropriate meta tag in the response or in the Content-Type header, then encode.
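A minimal sketch of that decode-then-encode step in Python 3, assuming the page bytes are already in hand. The function name `to_ascii`, its `charset` parameter, and the sample bytes are illustrative and not from the question; in real code the charset would come from the Content-Type header or the meta tag.

```python
def to_ascii(raw: bytes, charset: str = "utf-8") -> str:
    """Decode raw bytes with the declared charset, then drop non-ASCII characters."""
    # errors="replace" keeps decoding from failing on malformed input:
    # undecodable bytes become U+FFFD, which the ASCII step below drops.
    text = raw.decode(charset, errors="replace")
    # errors="ignore" silently discards every character outside ASCII.
    return text.encode("ascii", errors="ignore").decode("ascii")


# Hypothetical sample: byte 0xa0 is a non-breaking space in ISO-8859-1,
# the same byte value reported in the traceback above.
sample = b"price:\xa0100"
print(to_ascii(sample, "iso-8859-1"))
```

Decoding first matters because the error handler passed to encode only applies to the encoding step; it has no effect on a decode that fails beforehand.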

The method encode(encoding, errors) accepts custom handlers for errors. The default is strict, which raises an error; besides ignore, other built-in values are:

>>> u'aあä'.encode('ascii', 'replace')
b'a??'
>>> u'aあä'.encode('ascii', 'xmlcharrefreplace')
b'a&#12354;&#228;'
>>> u'aあä'.encode('ascii', 'backslashreplace')
b'a\\u3042\\xe4'

See https://docs.python.org/3/library/stdtypes.html#str.encode
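For a handler beyond the built-in ones, the standard library's codecs.register_error can register a function under a new name. The sketch below is only an illustration: the handler name "underscore" and the function `underscore_handler` are made up for this example.

```python
import codecs


def underscore_handler(exc):
    """Replace each unencodable character with an underscore."""
    if not isinstance(exc, UnicodeEncodeError):
        # This sketch only handles encoding errors.
        raise exc
    # Return the replacement text and the index at which encoding resumes.
    return "_" * (exc.end - exc.start), exc.end


# "underscore" is a hypothetical handler name chosen for this example.
codecs.register_error("underscore", underscore_handler)

print("aあä".encode("ascii", "underscore"))
```

Once registered, the name can be passed as the errors argument anywhere a built-in value such as ignore or replace would be accepted.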
