First of all, in Python 2 you need to use Unicode strings (u'<...>') for Unicode characters to be seen as Unicode characters. You also need a correct source encoding declaration if you want to use the characters themselves rather than the \UXXXXXXXX representation in source code.
Now, as per "Python: getting correct string length when it contains surrogate pairs" and "Python returns length of 2 for single Unicode character string", in Python 2 "narrow" builds (with sys.maxunicode==65535), Unicode characters above U+FFFF are represented as surrogate pairs, and this is not transparent to string functions. This has only been fixed in 3.3 (PEP 393).
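A quick way to see which kind of build you are on, and how many UTF-16 code units a character needs, is sketched below. The helper name `utf16_units` is made up for this illustration; it encodes to UTF-16 instead of relying on `len`, so it should give the same answer on narrow builds, wide builds and 3.3+.

```python
import sys

def utf16_units(ch):
    """Number of 16-bit code units needed to store `ch` in UTF-16."""
    # Each UTF-16 code unit is 2 bytes; "utf-16-le" adds no byte order mark.
    return len(ch.encode("utf-16-le")) // 2

# 0xFFFF on a Python 2 "narrow" build, 0x10FFFF on "wide" builds and on 3.3+.
is_narrow = (sys.maxunicode == 0xFFFF)

astral = u"\U00100000"
print("narrow build: %s" % is_narrow)
print("UTF-16 code units: %d" % utf16_units(astral))
print("len(): %d" % len(astral))  # 2 on a narrow build, 1 otherwise
```

If `is_narrow` is true, `len()` and indexing count code units, not characters, which is the behaviour described above.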
The simplest resolution (save for migrating to 3.3+) is to compile a Python "wide" build from source as outlined on the 3rd link. In it, Unicode characters are all 4-byte (thus are a potential memory hog) but if you need to routinely handle wide Unicode chars, this is probably an acceptable price.
The solution for a "narrow" build is to make a custom set of string functions (len, slice; maybe as a subclass of unicode) that would detect surrogate pairs and handle them as a single character. I couldn't readily find an existing one (which is strange), but it's not too hard to write:
- as per "UTF-16#U+10000 to U+10FFFF - Wikipedia",
  - the 1st character (high surrogate) is in range 0xD800..0xDBFF
  - the 2nd character (low surrogate) is in range 0xDC00..0xDFFF
- these ranges are reserved and thus cannot occur as regular characters
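To make the ranges above concrete, here is a minimal sketch of the UTF-16 arithmetic that maps a code point to its surrogate pair and back. The function names `to_surrogates` and `from_surrogates` are illustrative only, not part of any library.

```python
def to_surrogates(code_point):
    """Split a code point in U+10000..U+10FFFF into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point: %#x" % code_point)
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low

def from_surrogates(high, low):
    """Combine a (high, low) surrogate pair back into a code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a surrogate pair: %#x %#x" % (high, low))
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

high, low = to_surrogates(0x100000)
print("%#06x %#06x" % (high, low))  # 0xdbc0 0xdc00
```

Because each half carries 10 bits and the two ranges do not overlap, a code unit can be classified as high surrogate, low surrogate or ordinary character by looking at its value alone.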
So here's the code to detect a surrogate pair:
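def is_surrogate(s, i):
    """Return True if s[i] starts a surrogate pair (high + low surrogate)."""
    if 0xD800 <= ord(s[i]) <= 0xDBFF:
        try:
            low = s[i+1]
        except IndexError:
            # lone high surrogate at the very end of the string
            return False
        if 0xDC00 <= ord(low) <= 0xDFFF:
            return True
        else:
            raise ValueError("Illegal UTF-16 sequence: %r" % s[i:i+2])
    else:
        return False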
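A length function in the same spirit could look like the sketch below. It is self-contained, so it uses its own helper `is_pair` (a hypothetical name) which, unlike the function above, treats a lone surrogate as an ordinary character instead of raising. `ulen` is likewise just an illustrative name.

```python
def is_pair(s, i):
    """True if s[i] is a high surrogate directly followed by a low surrogate."""
    return (i + 1 < len(s)
            and 0xD800 <= ord(s[i]) <= 0xDBFF
            and 0xDC00 <= ord(s[i + 1]) <= 0xDFFF)

def ulen(s):
    """Length of s in characters, counting each surrogate pair once."""
    count = 0
    i = 0
    while i < len(s):
        # Skip two code units for a pair, one for anything else.
        i += 2 if is_pair(s, i) else 1
        count += 1
    return count

# u"\ud800\udc00" spells out the pair for U+10000 as two escapes,
# so this also runs on 3.3+, where such a string holds two code points.
print(ulen(u"a\ud800\udc00b"))  # 3
```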
And a function that returns a simple slice:
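def slice(s, start, end):
    # NB: this name shadows the built-in slice(); start and end count characters
    l = len(s)
    i = 0
    # each pair before `start` shifts both bounds right by one code unit
    while i < start and i < l:
        if is_surrogate(s, i):
            start += 1
            end += 1
            i += 1
        i += 1
    # each pair inside the slice shifts only the end bound
    while i < end and i < l:
        if is_surrogate(s, i):
            end += 1
            i += 1
        i += 1
    return s[start:end]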
Here, the price you pay is performance, as these functions are much slower than built-ins:
>>> import timeit
>>> ux=u"a"*5000+u"\U00100000"*30000+u"b"*50000
>>> timeit.timeit('slice(ux,10000,100000)','from __main__ import slice,ux',number=1000)
46.44128203392029 # seconds for 1000 runs, i.e. msec per call
>>> timeit.timeit('ux[10000:100000]','from __main__ import slice,ux',number=1000000)
8.814016103744507 # seconds for 1000000 runs, i.e. usec per call