Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


regex - How to convert a character into a five-digit Unicode code point in Python 3.3?

I'd like to replace a character with one whose Unicode code point has five hex digits in Python 3.3. For example,

import re
print(re.sub('a', u'\u1D15D', 'abc'))

but the result is different from what I expected. Do I have to use the character itself rather than its code point? Is there a better way to handle five-digit Unicode characters?



1 Reply


Python Unicode escapes are either 4 hex digits (\uabcd) or 8 hex digits (\Uabcdabcd). For a code point beyond U+FFFF you need the latter (with a capital U), left-padded with zeros to eight digits:

>>> '\U0001D15D'
'𝅝'
>>> '\U0001D15D'.encode('unicode_escape')
b'\\U0001d15d'

(And yes, the U+1D15D code point (MUSICAL SYMBOL WHOLE NOTE) is in the above example, but your browser font may not be able to render it, showing a placeholder glyph (a box or question mark) instead.)
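
If the glyph does not render, the code point can still be checked without relying on the font. A minimal sketch using only the standard library (the variable name note is just for illustration):

```python
import unicodedata

note = '\U0001D15D'  # 8-digit escape, a single code point

# The string holds exactly one character
print(len(note))

# Its code point, in hex, and its official Unicode name
print(hex(ord(note)))
print(unicodedata.name(note))
```

On Python 3.3 and later this should report a length of 1, the code point 0x1d15d, and the name MUSICAL SYMBOL WHOLE NOTE, whatever the terminal or browser manages to draw.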

Because you used a 4-digit \uabcd escape, Python read only the first four hex digits, so you replaced 'a' in 'abc' with two characters: the code point U+1D15 (ᴕ, LATIN LETTER SMALL CAPITAL OU) and the ASCII character D. Using the 8-digit \U escape works:

>>> import re
>>> print(re.sub('a', '\U0001D15D', 'abc'))
𝅝bc
>>> print(re.sub('a', u'\U0001D15D', 'abc').encode('unicode_escape'))
b'\\U0001d15dbc'

where again the U+1D15D codepoint could be displayed by your font as a placeholder glyph instead.
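
To make the difference between the 4-digit and 8-digit escapes concrete, and to touch on the "better way" part of the question, here is a small sketch. It uses only the standard library. chr() and the \N{...} named escape are alternatives that the answer above does not mention, and the variable names are illustrative.

```python
import re

wrong = '\u1D15D'      # 4-digit escape: U+1D15 followed by the letter 'D'
right = '\U0001D15D'   # 8-digit escape: the single code point U+1D15D

# The 4-digit form yields two characters, the 8-digit form yields one
print(len(wrong), len(right))
print([hex(ord(c)) for c in wrong])

# Equivalent ways to spell the same character
assert right == chr(0x1D15D) == '\N{MUSICAL SYMBOL WHOLE NOTE}'

# For a literal one-character substitution, str.replace avoids regex entirely
result_re = re.sub('a', right, 'abc')
result_str = 'abc'.replace('a', chr(0x1D15D))
assert result_re == result_str

# Font-independent view of the result
print(result_re.encode('unicode_escape'))
```

chr() is handy when the code point arrives as a number at run time, while \N{...} keeps the literal readable in source code. Neither changes how re.sub behaves, since the replacement is an ordinary one-character string in every case.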

