tl;dr
You need to convert the URL encoded Unicode string to a bytes str before un-urlparsing it and decoding as UTF-8.
For example, for an S3 object with the filename: my file ?ě?λλυ.txt
:
>>> utf8_urlencoded_key = event['Records'][0]['s3']['object']['key'].encode('utf-8')
# encodes the Unicode string to utf-8 encoded [byte] string. The key shouldn't contain any non-ASCII at this point, but UTF-8 will be safer.
'my+file+%C5%99%C4%9B%C4%85%CE%BB%CE%BB%CF%85.txt'
>>> key_utf8 = urllib.unquote_plus(utf8_urlencoded_key)
# the previous url-escaped UTF-8 are now converted to UTF-8 bytes
# If you passed a Unicode object to unquote_plus, you'd have got a
# Unicode with UTF-8 encoded bytes!
'my file xc5x99xc4x9bxc4x85xcexbbxcexbbxcfx85.txt'
# Decodes key_utf-8 to a Unicode string
>>> key = key_utf8.decode('utf-8')
u'my file u0159u011bu0105u03bbu03bbu03c5.txt'
# Note the u prefix. The utf-8 bytes have been decoded to Unicode points.
>>> type(key)
<type 'unicode'>
>>> print(key)
my file ?ě?λλυ.txt
Background
AWS have commited the cardinal sin of changing the default encoding - https://anonbadger.wordpress.com/2015/06/16/why-sys-setdefaultencoding-will-break-code/
The error you should've got from your decode()
is:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 8-19: ordinal not in range(128)
The value of key
is a Unicode. In Python 2.x you could decode a Unicode, even though it doesn't make sense. In Python 2.x to decode a Unicode, Python first tries to encode it to a [byte] str first before decoding it using the given encoding. In Python 2.x the default encoding should be ASCII, which of course can't contain the characters used.
Had you got the proper UnicodeEncodeError from Python, you may have found suitable answers. On Python 3, you wouldn't have been able to call .decode()
at all.
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…