In Python 2, a unicode string may contain both real Unicode characters and raw bytes stored as code points:
a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
I understand that this is absolutely not something one should write in one's own code, but this is a string that I have to deal with.
The bytes in the string above are UTF-8 for ек (Unicode \u0435\u043a).
My objective is to get a unicode string containing everything in Unicode, which is to say Русский ек (\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a).
Encoding it to UTF-8 yields:
>>> a.encode('utf-8')
'\xd0\xa0\xd1\x83\xd1\x81\xd1\x81\xd0\xba\xd0\xb8\xd0\xb9 \xc3\x90\xc2\xb5\xc3\x90\xc2\xba'
Decoding that from UTF-8 simply gives back the initial string, with the bytes still in it, which is not what I want:
>>> a.encode('utf-8').decode('utf-8')
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
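As far as I can tell, the round trip is a no-op because encode treats \xd0 as the code point U+00D0 rather than as the byte 0xD0, so UTF-8 emits two bytes for it (\xc3\x90). A minimal sketch of that, assuming the embedded bytes sit in the last four characters of the string (the names encoded_d0, tail and tail_fixed are only for illustration; it should behave the same on Python 2 and 3):

```python
a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'

# u'\xd0' is the code point U+00D0, so UTF-8 needs two bytes to encode it.
encoded_d0 = u'\xd0'.encode('utf-8')

# Latin-1 maps code points 0-255 straight onto the same byte values,
# so encoding with it recovers the raw bytes hiding in the tail.
tail = a[-4:]
tail_fixed = tail.encode('latin-1').decode('utf-8')
```

Here encoded_d0 is the two-byte sequence \xc3\x90, and tail_fixed is u'\u0435\u043a'.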
I found a hacky way to solve the problem, however:
>>> repr(a)
"u'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \\xd0\\xb5\\xd0\\xba'"
>>> eval(repr(a)[1:])
'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \xd0\xb5\xd0\xba'
>>> s = eval(repr(a)[1:]).decode('utf8')
>>> s
u'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \u0435\u043a'
# Almost there, the bytes are proper now but the former real-unicode characters
# are now escaped with \u's; need to un-escape them.
>>> import re
>>> re.sub(u'\\\\u([a-f\\d]+)', lambda x: unichr(int(x.group(1), 16)), s)
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a' # Success!
This works fine but looks very hacky due to its use of eval, repr, and then additional regex'ing of the unicode string representation. Is there a cleaner way?
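One direction that might be cleaner, offered as a sketch rather than a definitive answer: match only the runs that look like UTF-8 byte sequences (a lead byte in 0xC2-0xF4 followed by continuation bytes in 0x80-0xBF), turn each run back into real bytes with Latin-1, and decode it as UTF-8. The helper name fix_embedded_utf8 is hypothetical, and the sketch assumes the text contains no genuine Latin-1 characters that happen to form such a run. It should run on both Python 2 and 3.

```python
import re

# A UTF-8 lead byte followed by one or more continuation bytes,
# all stored in the string as code points below U+0100.
_UTF8_RUN = re.compile(u'[\xc2-\xf4][\x80-\xbf]+')


def fix_embedded_utf8(text):
    """Re-decode UTF-8 byte runs embedded in a unicode string (hypothetical helper)."""
    def _redecode(match):
        # Latin-1 maps code points 0-255 onto the identical byte values.
        raw = match.group(0).encode('latin-1')
        try:
            return raw.decode('utf-8')
        except UnicodeDecodeError:
            # Not valid UTF-8 after all: leave the run untouched.
            return match.group(0)
    return _UTF8_RUN.sub(_redecode, text)


a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
fixed = fix_embedded_utf8(a)
```

For the string above, fixed comes out as u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a'. The real Cyrillic characters are never touched because their code points lie outside the matched ranges, so no eval, repr, or un-escaping step is involved.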