Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

How to work with surrogate pairs in Python?

This is a follow-up to Converting to Emoji. In that question, the OP had a json.dumps()-encoded file with an emoji represented as a surrogate pair - \ud83d\ude4f. S/he was having problems reading the file and translating the emoji correctly, and the correct answer was to json.loads() each line from the file, letting the json module handle the conversion from the surrogate pair back to the (I'm assuming UTF-8-encoded) emoji.

So here is my situation: say I have just a regular Python 3 unicode string with a surrogate pair in it:

emoji = "This is \ud83d\ude4f, an emoji."

How do I process this string to get a representation of the emoji out of it? I'm looking to get something like this:

"This is 🙏, an emoji."
# or
"This is \U0001f64f, an emoji."

I've tried:

print(emoji)
print(emoji.encode("utf-8")) # also tried "ascii", "utf-16", and "utf-16-le"
json.loads(emoji) # and `.encode()` with various codecs

Generally I get an error similar to UnicodeEncodeError: XXX codec can't encode character '\ud83d' in position 8: surrogates not allowed.

I'm running Python 3.5.1 on Linux, with $LANG set to en_US.UTF-8. I've run these samples both in the Python interpreter on the command line, and within IPython running in Sublime Text - there don't appear to be any differences.

1 Reply

You've mixed up a literal string \ud83d in a JSON file on disk (six characters: \ u d 8 3 d) and a single character u'\ud83d' (specified using a string literal in Python source code) in memory. It is the difference between len(r'\ud83d') == 6 and len('\ud83d') == 1 on Python 3.
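To make the distinction concrete, here is a minimal sketch comparing the two forms. The variable names are only illustrative and are not from the original question.

```python
# Raw string: the backslash is kept, so this is six separate characters
# (backslash, u, d, 8, 3, d), as they would appear in a JSON file on disk.
escaped_text = r"\ud83d"

# Ordinary string literal: Python interprets the escape, so this is a
# single code point, the lone high surrogate U+D83D.
lone_surrogate = "\ud83d"

escaped_len = len(escaped_text)
surrogate_len = len(lone_surrogate)
code_point = hex(ord(lone_surrogate))

print(escaped_len, surrogate_len, code_point)
```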

If you see the Python string '\ud83d\ude4f' (2 characters), then there is a bug upstream. Normally, you shouldn't get such a string. If you get one and you can't fix the upstream code that generates it, you could fix it using the surrogatepass error handler:

>>> "\ud83d\ude4f".encode('utf-16', 'surrogatepass').decode('utf-16')
'🙏'
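Applied to the whole string from the question, the same round trip could be wrapped in a small helper. This is only a sketch: merge_surrogate_pairs is a hypothetical name, and it assumes every surrogate in the input belongs to a well-formed high/low pair. An unpaired surrogate should make the decode step raise UnicodeDecodeError.

```python
def merge_surrogate_pairs(text: str) -> str:
    """Combine UTF-16 surrogate pairs in text into single code points.

    Hypothetical helper; assumes all surrogates in text are properly paired.
    """
    # 'surrogatepass' lets the encoder write surrogates out as raw UTF-16
    # code units instead of raising UnicodeEncodeError.
    utf16_bytes = text.encode("utf-16", "surrogatepass")
    # A normal UTF-16 decode then reads each high/low pair as one character.
    return utf16_bytes.decode("utf-16")


broken = "This is \ud83d\ude4f, an emoji."
fixed = merge_surrogate_pairs(broken)
print(ascii(fixed))
```

The result is one character shorter than the input, because the two surrogates collapse into the single code point U+1F64F.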

Python 2 was more permissive.

Note: even if your JSON file contains the literal \ud83d\ude4f (12 characters), you shouldn't get the surrogate pair:

>>> print(ascii(json.loads(r'"\ud83d\ude4f"')))
'\U0001f64f'

Notice: the result is 1 character ('\U0001f64f'), not the surrogate pair ('\ud83d\ude4f').
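For completeness, here is a short round trip through the json module. It is a sketch that relies on json.dumps() defaulting to ensure_ascii=True, which writes non-BMP characters as an escaped surrogate pair. The variable names are illustrative.

```python
import json

emoji = "\U0001f64f"  # one code point, U+1F64F

# With the default ensure_ascii=True, the emoji is written as the
# 12-character escape sequence plus the surrounding JSON quotes.
encoded = json.dumps(emoji)

# json.loads() recombines the escaped pair into a single code point.
decoded = json.loads(encoded)

print(encoded, len(encoded), len(decoded))
```

So a file written by json.dumps() and read back with json.loads() should never leave surrogates in the resulting Python string.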
