Python narrow and wide build (Python versions below 3.3)
The error suggests that you are using a "narrow" (UCS-2) build, which only supports Unicode code points up to 65535 (U+FFFF) as one "Unicode character" [1]. Characters whose code points are above 65535 are represented as surrogate pairs, which means that the Unicode string u'\U0001d300' consists of two "Unicode characters" in a narrow build.
Python 2.7.8 (default, Jul 25 2014, 14:04:36)
[GCC 4.8.3] on cygwin
>>> import sys; sys.maxunicode
65535
>>> len(u'\U0001d300')
2
>>> [hex(ord(i)) for i in u'\U0001d300']
['0xd834', '0xdf00']
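The two values above follow from the standard UTF-16 surrogate arithmetic. Here is a minimal sketch of that calculation; the helper name to_surrogate_pair is my own and is not part of any library.

```python
def to_surrogate_pair(code_point):
    """Split an astral code point (U+10000..U+10FFFF) into UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not an astral code point: %#x" % code_point)
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits select the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits select the low surrogate
    return high, low


# Same result as the narrow-build session above.
print([hex(unit) for unit in to_surrogate_pair(0x1D300)])  # ['0xd834', '0xdf00']
```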
In a "wide" (UCS-4) build, every code point up to 1114111 (U+10FFFF) is recognized as one "Unicode character", so the Unicode string u'\U0001d300' consists of exactly one "Unicode character"/Unicode code point.
Python 2.6.6 (r266:84292, May 1 2012, 13:52:17)
[GCC 4.4.6 20110731 (Red Hat 4.4.6-3)] on linux2
>>> import sys; sys.maxunicode
1114111
>>> len(u'\U0001d300')
1
>>> [hex(ord(i)) for i in u'\U0001d300']
['0x1d300']
[1] I use "Unicode character" (in quotes) to refer to one character in a Python Unicode string, not one Unicode code point. The number of "Unicode characters" in a string is the len() of the string. In a "narrow" build, one "Unicode character" is a 16-bit code unit of UTF-16, so one astral character will appear as two "Unicode characters". In a "wide" build, one "Unicode character" always corresponds to one Unicode code point.
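To make the difference between len() and the number of code points concrete, here is a rough sketch that counts code points by treating each high/low surrogate pair as one unit. The function name code_point_length is hypothetical; on a wide build or Python 3.3+ it simply agrees with len() for ordinary strings.

```python
def code_point_length(text):
    """Count code points, treating each UTF-16 surrogate pair as one."""
    units = [ord(ch) for ch in text]
    count = 0
    i = 0
    while i < len(units):
        is_high = 0xD800 <= units[i] <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        # A high surrogate followed by a low surrogate is one code point.
        i += 2 if (is_high and has_low) else 1
        count += 1
    return count
```

On a narrow build, code_point_length(u'\U0001d300') would give 1 while len() gives 2.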
Matching astral plane characters with regex
Wide build
The regex in the question compiles correctly in "wide" build:
Python 2.6.6 (r266:84292, May 1 2012, 13:52:17)
[GCC 4.4.6 20110731 (Red Hat 4.4.6-3)] on linux2
>>> import re; re.compile(u'[\U0001d300-\U0001d356]', re.DEBUG)
in
  range (119552, 119638)
<_sre.SRE_Pattern object at 0x7f9f110386b8>
Narrow build
However, the same regex won't work in a "narrow" build, since the engine does not recognize surrogate pairs. It just treats \ud834 as one character, then tries to create a character range from \udf00 to \ud834 and fails, because the start of that range (0xDF00) is greater than its end (0xD834).
Python 2.7.8 (default, Jul 25 2014, 14:04:36)
[GCC 4.8.3] on cygwin
>>> [hex(ord(i)) for i in u'[\U0001d300-\U0001d356]']
['0x5b', '0xd834', '0xdf00', '0x2d', '0xd834', '0xdf56', '0x5d']
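The failure can be reproduced on any build by spelling out the code units by hand, which is what the narrow build effectively sees. This is only an illustrative sketch; the variable names are mine.

```python
import re

# The character class as a narrow build sees it: seven UTF-16 code units.
narrow_view = u'[\ud834\udf00-\ud834\udf56]'

try:
    re.compile(narrow_view)
    range_rejected = False
except re.error:
    # The range runs from \udf00 down to \ud834, so the engine rejects it.
    range_rejected = True

print(range_rejected)
```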
The workaround is the same one used in ECMAScript (ES5): construct the regex so that it matches the surrogates representing the code points, rather than the code points themselves.
Python 2.7.8 (default, Jul 25 2014, 14:04:36)
[GCC 4.8.3] on cygwin
>>> import re; re.compile(u'\ud834[\udf00-\udf56]', re.DEBUG)
literal 55348
in
  range (57088, 57174)
<_sre.SRE_Pattern object at 0x6ffffe52210>
>>> input = u'Sample \U0001d340. Another \U0001d305. Leave alone \U00011000'
>>> input
u'Sample \U0001d340. Another \U0001d305. Leave alone \U00011000'
>>> re.sub(u'\ud834[\udf00-\udf56]', '', input)
u'Sample . Another . Leave alone \U00011000'
Using regexpu to derive astral plane regex for Python narrow build
Since the construction to match astral plane characters in a Python narrow build is the same as in ES5, you can use regexpu, a tool that converts RegExp literals from ES6 to ES5, to do the conversion for you.
Just paste the equivalent regex in ES6 (note the u flag and the \u{hh...h} syntax):
/[\u{1d300}-\u{1d356}]/u
and you get back a regex that can be used in both a Python narrow build and ES5:
/(?:\uD834[\uDF00-\uDF56])/
Remember to remove the / delimiters of the JavaScript RegExp literal when you use the regex in Python.
The tool is extremely useful when the range spreads across multiple high surrogates (U+D800 to U+DBFF). For example, if we have to match the character range
/[\u{105c0}-\u{1cb40}]/u
the equivalent regex for a Python narrow build and ES5 is
/(?:\uD801[\uDDC0-\uDFFF]|[\uD802-\uD831][\uDC00-\uDFFF]|\uD832[\uDC00-\uDF40])/
which is rather complex and error-prone to derive.
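If you would rather not depend on an external tool, the derivation can also be sketched in Python. The code below is my own rough approximation of the splitting logic, not regexpu's actual implementation, and all function names are hypothetical. It returns the pattern source in \uXXXX notation, which you can paste into a u'...' literal on a narrow build.

```python
def _split(code_point):
    # Standard UTF-16 surrogate arithmetic.
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def _unit(value):
    return '\\u%04X' % value


def _span(first, last):
    # A single code unit, or a character class covering first..last.
    if first == last:
        return _unit(first)
    return '[%s-%s]' % (_unit(first), _unit(last))


def astral_range_to_surrogate_regex(start, end):
    """Build a surrogate-pair pattern for the astral range start..end."""
    if not 0x10000 <= start <= end <= 0x10FFFF:
        raise ValueError('expected an astral range with start <= end')
    h1, l1 = _split(start)
    h2, l2 = _split(end)
    if h1 == h2:
        parts = [_unit(h1) + _span(l1, l2)]
    else:
        parts = []
        tail = None
        if l1 != 0xDC00:
            # First high surrogate is only partially covered.
            parts.append(_unit(h1) + _span(l1, 0xDFFF))
            h1 += 1
        if l2 != 0xDFFF:
            # Last high surrogate is only partially covered.
            tail = _unit(h2) + _span(0xDC00, l2)
            h2 -= 1
        if h1 <= h2:
            # High surrogates whose whole low range is included.
            parts.append(_span(h1, h2) + _span(0xDC00, 0xDFFF))
        if tail is not None:
            parts.append(tail)
    return '(?:' + '|'.join(parts) + ')'


print(astral_range_to_surrogate_regex(0x105C0, 0x1CB40))
```

For the two ranges used in this answer, the sketch produces the same patterns as shown above, but I have not checked it against regexpu beyond that.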
Python version 3.3 and above
Python 3.3 implements PEP 393, which eliminates the distinction between narrow and wide builds; from that version on, Python always behaves like a wide build. This eliminates the problem in the question altogether.
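For comparison, here is a small sketch of the same substitution on Python 3.3 or later, where the character class can be written directly with the astral code points. The names ASTRAL_RANGE, text and cleaned are just illustrative.

```python
import re
import sys

# On Python 3.3+ every build reports the full Unicode range.
print(sys.maxunicode == 0x10FFFF)

# The range from the question works as written.
ASTRAL_RANGE = re.compile('[\U0001d300-\U0001d356]')

text = 'Sample \U0001d340. Another \U0001d305. Leave alone \U00011000'
cleaned = ASTRAL_RANGE.sub('', text)
```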
Compatibility issues
While it's possible to work around the limitation and match astral plane characters in Python narrow builds, going forward it's best to change the execution environment by switching to a Python wide build, or to port the code to Python 3.3 and above.
The regex code for narrow build is hard to read and maintain for average programmers, and it has to be completely rewritten when porting to Python 3.
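If the code has to run on both kinds of interpreter during a migration, one possible stopgap is to choose the pattern at import time based on sys.maxunicode. This is only a sketch under that assumption, and ASTRAL_RANGE is a hypothetical name; it does not remove the maintenance burden described above.

```python
import re
import sys

if sys.maxunicode > 0xFFFF:
    # Wide build or Python 3.3+: one string element per code point.
    ASTRAL_RANGE = re.compile(u'[\U0001d300-\U0001d356]')
else:
    # Narrow build: match the surrogate pair explicitly.
    ASTRAL_RANGE = re.compile(u'\ud834[\udf00-\udf56]')

sample = u'Sample \U0001d340. Another \U0001d305. Leave alone \U00011000'
stripped = ASTRAL_RANGE.sub(u'', sample)
```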
Reference