Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share

Python returns length of 2 for single Unicode character string

In Python 2.7:

In [2]: utf8_str = '\xf0\x9f\x91\x8d'
In [3]: print(utf8_str)
👍
In [4]: unicode_str = utf8_str.decode('utf-8')
In [5]: print(unicode_str)
👍
In [6]: unicode_str
Out[6]: u'\U0001f44d'
In [7]: len(unicode_str)
Out[7]: 2

Since unicode_str contains only a single Unicode code point (U+1F44D), why does len(unicode_str) return 2 instead of 1?

1 Reply

Your Python binary was compiled with UCS-2 support (a narrow build) and internally anything outside of the BMP (Basic Multilingual Plane) is represented using a surrogate pair.

That means each such code point counts as 2 characters when you ask for the length.
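To make the surrogate-pair arithmetic concrete, here is a minimal sketch of the standard UTF-16 split applied to U+1F44D. The function name to_surrogate_pair is made up for illustration; it is not part of Python's standard library.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into UTF-16 high and low surrogates.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000      # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go into the low surrogate
    return high, low


high, low = to_surrogate_pair(0x1F44D)
print("%#x %#x" % (high, low))  # 0xd83d 0xdc4d
```

These are the same two code units that a narrow build reports when the string is turned into a list.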

If this matters, you'll have to recompile your Python binary to use UCS-4 instead (./configure --enable-unicode=ucs4 will enable it), or upgrade to Python 3.3 or newer. In 3.3, Python's Unicode support was overhauled to use a flexible string representation that stores each string with 1, 2 or 4 bytes per code point (Latin-1, UCS-2 or UCS-4), as required by the code points it contains.

On Python versions 2.7 and 3.0 - 3.2, you can detect what kind of build you have by inspecting the sys.maxunicode value; it'll be 2^16-1 == 65535 == 0xFFFF for a narrow UCS-2 build, 1114111 == 0x10FFFF for a wide UCS-4 build. In Python 3.3 and up it is always set to 1114111.
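If recompiling or upgrading is not an option, one possible workaround is to check the build at runtime and count code points by hand. The sketch below assumes that a high surrogate immediately followed by a low surrogate always represents one code point; is_narrow_build and count_code_points are hypothetical names, not standard library functions.

```python
import sys


def is_narrow_build():
    """Return True on a narrow (UCS-2) build, as described above."""
    return sys.maxunicode == 0xFFFF


def count_code_points(text):
    """Count code points, treating a high+low surrogate pair as one.

    Hypothetical helper: on a wide build or Python 3.3+ this normally
    equals len(text), unless the string holds adjacent lone surrogates.
    """
    count = 0
    i = 0
    while i < len(text):
        unit = ord(text[i])
        is_high = 0xD800 <= unit <= 0xDBFF
        if is_high and i + 1 < len(text) and 0xDC00 <= ord(text[i + 1]) <= 0xDFFF:
            i += 2  # skip both halves of the pair
        else:
            i += 1
        count += 1
    return count


print(is_narrow_build())
print(count_code_points(u"\U0001f44d"))  # 1 on both narrow and wide builds
```

On a narrow build len(u"\U0001f44d") is still 2, while count_code_points returns 1 for the same string.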

Demo:

# Narrow build
$ bin/python -c 'import sys; print sys.maxunicode, len(u"\U0001f44d"), list(u"\U0001f44d")'
65535 2 [u'\ud83d', u'\udc4d']
# Wide build
$ python -c 'import sys; print sys.maxunicode, len(u"\U0001f44d"), list(u"\U0001f44d")'
1114111 1 [u'\U0001f44d']
