Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Bytes in a unicode Python string

In Python 2, Unicode strings may contain both unicode and bytes:

a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'

I understand that this is absolutely not something one should write in one's own code, but this is a string that I have to deal with.

The bytes in the string above are UTF-8 for ек (Unicode \u0435\u043a).

My objective is to get a unicode string containing everything in Unicode, which is to say Русский ек (\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a).

Encoding it to UTF-8 yields

>>> a.encode('utf-8')
'\xd0\xa0\xd1\x83\xd1\x81\xd1\x81\xd0\xba\xd0\xb8\xd0\xb9 \xc3\x90\xc2\xb5\xc3\x90\xc2\xba'

Decoding that from UTF-8 just gives back the initial string with the bytes still in it, which is not good:

>>> a.encode('utf-8').decode('utf-8')
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'

I found a hacky way to solve the problem, however:

>>> repr(a)
"u'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \\xd0\\xb5\\xd0\\xba'"
>>> eval(repr(a)[1:])
'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \xd0\xb5\xd0\xba'
>>> s = eval(repr(a)[1:]).decode('utf8')
>>> s
u'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \u0435\u043a'
# Almost there, the bytes are proper now but the former real-unicode characters
# are now escaped with \u's; need to un-escape them.
>>> import re
>>> re.sub(u'\\\\u([a-f\d]+)', lambda x : unichr(int(x.group(1), 16)), s)
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a' # Success!

This works fine but looks very hacky due to its use of eval, repr, and then additional regex'ing of the unicode string representation. Is there a cleaner way?

1 Reply

"In Python 2, Unicode strings may contain both unicode and bytes:"

No, they may not. They contain Unicode characters.

Within the original string, \xd0 is not a byte that's part of a UTF-8 encoding. It is the Unicode character with code point 208: u'\xd0' == u'\u00d0'. It just happens that the repr for Unicode strings in Python 2 prefers to represent characters with \x escapes where possible (i.e. code points < 256).
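
A quick check of that claim, written so that it should run on both Python 2 and Python 3 (the variable names are only for illustration). It also shows where the \xc3\x90 in the question's encoded output comes from: it is simply the UTF-8 encoding of the single character U+00D0.

```python
# A single character written with a \x escape inside a Unicode literal
ch = u'\xd0'

same_as_u_escape = (ch == u'\u00d0')   # both spellings name the same character
code_point = ord(ch)                   # 208, i.e. 0xd0
length = len(ch)                       # one character, not one byte

# Encoding that one character to UTF-8 produces two bytes
encoded = ch.encode('utf-8')
print(same_as_u_escape, code_point, length, repr(encoded))
```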

There is no way to look at the string and tell whether \xd0 is supposed to be part of some UTF-8 encoded character or whether it actually stands for that Unicode character by itself.

However, if you assume that you can always interpret those values as encoded ones, you could try writing something that analyzes each character in turn (use ord to convert to a code-point integer), decodes characters < 256 as UTF-8, and passes characters >= 256 as they were.
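
A minimal sketch of that idea, under the assumption that every run of code points below 256 really is UTF-8 encoded data. The function name fix_embedded_utf8 is made up for illustration, and the code is written so that it should run on both Python 2 and Python 3.

```python
def fix_embedded_utf8(text):
    """Decode runs of code points < 256 as UTF-8; pass everything else through.

    Assumes every such run is valid UTF-8 (hypothetical helper, not a library API).
    """
    out = []                # pieces of the final Unicode string
    pending = bytearray()   # code points < 256 collected as raw byte values

    def flush():
        # Decode whatever bytes have been collected so far, then reset the buffer
        if pending:
            out.append(bytes(pending).decode('utf-8'))
            del pending[:]

    for ch in text:
        cp = ord(ch)
        if cp < 256:
            pending.append(cp)   # treat the character as one raw byte
        else:
            flush()              # a real Unicode character ends the byte run
            out.append(ch)
    flush()
    return u''.join(out)


a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
fixed = fix_embedded_utf8(a)
print(repr(fixed))
```

If a run is not valid UTF-8, for example a genuine u'\xe9' in otherwise normal text, the decode call raises UnicodeDecodeError. That is the ambiguity described above showing up in practice, so you would have to decide how to handle such input yourself.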

