Problem
Based on the information in the question, the program is processing non-ASCII input data, but is unable to output non-ASCII data.
Specifically, this code:
for i in patchlets_in_latest_list:
    print(str(i))
Results in this exception:
UnicodeEncodeError: 'ascii' codec can't encode character '\u2013'
This behaviour was common in Python2, where calling str on a unicode object would cause Python to try to encode the object as ASCII, resulting in a UnicodeEncodeError if the object contained non-ASCII characters.
In Python3, calling str on a str instance doesn't trigger any encoding. However, calling the print function on a str will encode the str to sys.stdout.encoding. sys.stdout.encoding defaults to the value returned by locale.getpreferredencoding. This is generally determined by your Linux user's locale settings, such as the LANG environment variable.
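To see the difference without touching your real terminal, the following minimal sketch prints the same string to two in-memory streams, one UTF-8 and one ASCII. The helper name print_to_encoded_stream is hypothetical, and the in-memory stream only stands in for sys.stdout; it is meant as an illustration of where the encoding step happens, not as a reproduction of your exact environment.

```python
import io


def print_to_encoded_stream(text, encoding):
    """Print text to an in-memory text stream that uses the given encoding.

    Hypothetical helper: the stream stands in for sys.stdout.
    """
    raw = io.BytesIO()
    # newline="\n" keeps the output identical across platforms.
    stream = io.TextIOWrapper(raw, encoding=encoding, newline="\n")
    # print() hands a str to the stream; the stream does the encoding.
    print(text, file=stream)
    stream.flush()
    return raw.getvalue()


s = "Hello \u2013 World"

# A UTF-8 stream can represent the en dash.
print(print_to_encoded_stream(s, "utf-8"))

# An ASCII stream cannot, so print() raises UnicodeEncodeError.
try:
    print_to_encoded_stream(s, "ascii")
except UnicodeEncodeError as exc:
    print("ascii stream failed:", exc.reason)
```

Note that str(s) never fails here; the error only appears once print writes to a stream whose encoding cannot represent the character.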
Solution
If we assume that your program is not overriding normal encoding behaviour, the problem should be fixed by ensuring that the code is being executed by a Python3 interpreter in a UTF-8 locale.
- be 100% certain that the code is being executed by a Python3 interpreter: print sys.version_info from within the program.
- try setting the PYTHONIOENCODING environment variable when running your script:
PYTHONIOENCODING=UTF-8 python3 myscript.py
- check your locale using the locale command in the terminal (or echo $LANG). If it doesn't end in UTF-8, consider changing it. Consult your system administrators if you are on a corporate machine.
- if your code runs in a cron job, bear in mind that cron jobs often run with the 'C' or 'POSIX' locale - which could be using ASCII encoding - unless a locale is explicitly set. Likewise if the script is run under a different user, check their locale settings.
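The checks above can be gathered into one small diagnostic that you run in the same environment as the failing script (for example from the cron job itself). This is only a sketch: encoding_report is a hypothetical helper name, and LC_ALL is included as an assumption that it may be set on your system, since it takes precedence over LANG when present.

```python
import locale
import os
import sys


def encoding_report():
    """Collect the settings that decide how print() encodes its output."""
    return {
        "python_version": tuple(sys.version_info[:3]),
        # getattr guards against stdout having been replaced by an object
        # that has no encoding attribute.
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
        "preferred_encoding": locale.getpreferredencoding(),
        "PYTHONIOENCODING": os.environ.get("PYTHONIOENCODING"),
        "LANG": os.environ.get("LANG"),
        "LC_ALL": os.environ.get("LC_ALL"),
    }


for name, value in encoding_report().items():
    # repr() keeps the report ASCII-safe even if a value is unusual.
    print(f"{name}: {value!r}")
```

If stdout_encoding reports an ASCII-only encoding, the locale or PYTHONIOENCODING steps above are the likely fix.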
Workaround
If changing the environment is not feasible, you can work around the problem in Python by encoding to ASCII with an error handler, then decoding back to str.
There are four error handlers that are useful in your particular situation; their effects are demonstrated with this code:
>>> s = 'Hello \u2013 World'
>>> s
'Hello – World'
>>> handlers = ['ignore', 'replace', 'xmlcharrefreplace', 'namereplace']
>>> print(str(s))
Hello – World
>>> for h in handlers:
...     print(f'Handler: {h}:', s.encode('ascii', errors=h).decode('ascii'))
...
Handler: ignore: Hello  World
Handler: replace: Hello ? World
Handler: xmlcharrefreplace: Hello &#8211; World
Handler: namereplace: Hello \N{EN DASH} World
The ignore and replace handlers lose information - you can't tell which character was dropped (ignore) or replaced with a question mark (replace).
The xmlcharrefreplace and namereplace handlers do not lose information, but the replacement sequences may make the text less readable to humans.
It's up to you to decide which tradeoff is acceptable for the consumers of your program's output.
If you decided to use the replace handler, you would change your code like this:
for i in patchlets_in_latest_list:
    replaced = i.encode('ascii', errors='replace').decode('ascii')
    print(replaced)
wherever you are printing data that might contain non-ASCII characters.
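If there are many such print calls, one option is to wrap the encode/decode round trip in a small helper so the handler is chosen in one place. The sketch below assumes a hypothetical function name, ascii_safe, and uses made-up sample data in place of your real patchlets_in_latest_list.

```python
def ascii_safe(value, errors="replace"):
    """Return str(value) with non-ASCII characters handled by `errors`.

    Hypothetical helper; `errors` is any codec error handler name, such as
    'ignore', 'replace', 'xmlcharrefreplace' or 'namereplace'.
    """
    return str(value).encode("ascii", errors=errors).decode("ascii")


# Sample data for illustration only.
patchlets_in_latest_list = ["fix \u2013 one", "plain"]

for i in patchlets_in_latest_list:
    print(ascii_safe(i))
```

Switching to a lossless handler is then a one-argument change, for example ascii_safe(i, errors='namereplace').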