If MySQL cannot handle UTF-8 sequences of 4 bytes or more, then you'll have to filter out all Unicode characters at or above codepoint \U00010000; UTF-8 encodes codepoints below that threshold in 3 bytes or fewer.
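To make that threshold concrete, here is a small sketch that reports how many bytes UTF-8 needs for a given codepoint. It assumes Python 3 (the sessions in this answer are Python 2), and the helper name utf8_len is made up for illustration.

```python
def utf8_len(codepoint):
    """Return how many bytes UTF-8 needs to encode a single codepoint."""
    return len(chr(codepoint).encode('utf-8'))

# The highest codepoint below the threshold still fits in 3 bytes.
print(utf8_len(0xFFFF))
# From U+10000 upward, UTF-8 needs 4 bytes.
print(utf8_len(0x10000))
# The sleepy face emoji is one of those 4-byte characters.
print(utf8_len(0x1F62A))
```

This prints 3, 4 and 4, so anything from \U00010000 upward is what a 3-byte-only column would reject.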
You could use a regular expression for that:
>>> import re
>>> highpoints = re.compile(u'[\U00010000-\U0010ffff]')
>>> example = u'Some example text with a sleepy face: \U0001f62a'
>>> highpoints.sub(u'', example)
u'Some example text with a sleepy face: '
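If you are on Python 3, the same idea might look like the sketch below. The function name strip_non_bmp and its replacement parameter are my own additions for illustration, not part of any library.

```python
import re

# Matches every codepoint from U+10000 through U+10FFFF,
# i.e. everything that UTF-8 encodes in 4 bytes.
HIGHPOINTS = re.compile('[\U00010000-\U0010ffff]')

def strip_non_bmp(text, replacement=''):
    """Replace all 4-byte UTF-8 characters in text (default: drop them)."""
    return HIGHPOINTS.sub(replacement, text)

example = 'Some example text with a sleepy face: \U0001f62a'
print(repr(strip_non_bmp(example)))
# Substituting U+FFFD instead keeps a visible marker where a character was lost.
print(repr(strip_non_bmp(example, '\ufffd')))
```

Passing a replacement such as '\ufffd' is optional; whether you drop the characters or mark them depends on what the stored text is used for.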
Alternatively, you could use the .translate() method with a mapping table that only contains None values:
>>> nohigh = { i: None for i in xrange(0x10000, 0x110000) }
>>> example.translate(nohigh)
u'Some example text with a sleepy face: '
However, creating the translation table will eat a lot of memory and take some time to generate; it is probably not worth your effort as the regular expression approach is more efficient.
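If you still prefer .translate(), one way to avoid building the million-entry dict is a table object that computes its answer on demand. This is only a sketch, assuming Python 3, where str.translate() accepts any object with __getitem__ and treats a LookupError as "leave this character alone". The class name DropHighPoints is hypothetical, and I have not measured it against the regular expression.

```python
class DropHighPoints(object):
    """Table for str.translate() that stores nothing up front."""

    def __getitem__(self, codepoint):
        if codepoint >= 0x10000:
            # Returning None tells translate() to delete the character.
            return None
        # Raising LookupError tells translate() to keep the character as-is.
        raise LookupError(codepoint)

example = 'Some example text with a sleepy face: \U0001f62a'
print(repr(example.translate(DropHighPoints())))
```

The table holds no entries at all, so the memory and setup cost of the dict approach goes away; the lookup itself is then a Python-level call per character.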
This all presumes you are using a UCS-4 compiled Python. If your Python was compiled with UCS-2 support then you can only use codepoints up to '\U0000ffff' in regular expressions and you'll never run into this problem in the first place.
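To find out which kind of build you have, you can look at sys.maxunicode, which is 0x10FFFF on a UCS-4 build and 0xFFFF on a UCS-2 build. The helper name is_wide_build below is made up for illustration; note that Python 3.3 and later always report 0x10FFFF.

```python
import sys

def is_wide_build():
    """Return True if this interpreter can hold codepoints above U+FFFF directly."""
    return sys.maxunicode > 0xFFFF

print(hex(sys.maxunicode))
print(is_wide_build())
```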
I note that as of MySQL 5.5.3 the newly-added utf8mb4 character set does support the full Unicode range.