Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

regex - Python and regular expression with Unicode

I need to delete some Unicode symbols from the string '?????? ??????? ???????????? ??????????'

I know they exist here for sure. I tried:

re.sub('([\u064B-\u0652\u06D4\u0670\u0674\u06D5-\u06ED]+)', '', '?????? ??????? ???????????? ??????????')

but it doesn't work. String stays the same. What am I doing wrong?

1 Reply

Are you using Python 2.x or 3.x?

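If you're using 2.x, make the pattern a unicode string with the 'u' prefix, so that the \uXXXX escapes are turned into the actual characters instead of being left as plain ASCII text. Since it's a regex, it's also good practice to make the pattern a raw string with 'r'. Putting the entire pattern in parentheses is superfluous. (In Python 3 every str literal is already Unicode, so the 'u' prefix is unnecessary and the combined 'ur' prefix is a syntax error.)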

re.sub(ur'[\u064B-\u0652\u06D4\u0670\u0674\u06D5-\u06ED]+', '', ...)

http://docs.python.org/tutorial/introduction.html#unicode-strings
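
For readers on Python 3, here is a minimal sketch of the same substitution. It assumes Python 3.3 or later, where the re module itself understands \uXXXX escapes inside a raw-string pattern. The helper name strip_diacritics and the sample text are made up for illustration and are not the string from the question.

```python
import re

# Same character ranges as in the question. The raw string leaves the
# \uXXXX escapes untouched, and the re module interprets them itself.
DIACRITICS = re.compile(r'[\u064B-\u0652\u06D4\u0670\u0674\u06D5-\u06ED]+')


def strip_diacritics(text):
    """Remove every character covered by the ranges above (hypothetical helper)."""
    return DIACRITICS.sub('', text)


# Hypothetical sample: three Arabic letters, each followed by a vowel mark
# (U+0650 kasra, U+0652 sukun, U+0650 kasra).
sample = '\u0628\u0650\u0633\u0652\u0645\u0650'
cleaned = strip_diacritics(sample)

# Only the three base letters should remain.
print(cleaned == '\u0628\u0633\u0645')
```

Compiling the pattern once is just a convenience when the same cleanup runs on many strings; a direct re.sub call behaves the same way.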

Edit:

It's also good practice to use the re.UNICODE/re.U/(?u) flag for unicode regexes, but it only affects predefined classes and boundaries such as \w or \b. This pattern uses none of them, so the flag would not change its behavior.
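
The point about the flag can be checked with a small sketch. It assumes Python 3, where str patterns are Unicode-aware by default and re.ASCII is the flag that switches that off (the reverse of the Python 2 situation described above). The sample strings are made up for illustration.

```python
import re

word = 'caf\u00e9'  # "cafe" with an accented final letter

# \w depends on the flag: Unicode-aware by default, ASCII-only with re.ASCII.
unicode_match = re.findall(r'\w+', word)
ascii_match = re.findall(r'\w+', word, flags=re.ASCII)

# An explicit code point range does not depend on the flag at all.
marked = 'a\u064Eb'  # 'a', an Arabic fatha mark, 'b'
explicit_default = re.sub(r'[\u064B-\u0652]+', '', marked)
explicit_ascii = re.sub(r'[\u064B-\u0652]+', '', marked, flags=re.ASCII)

print(unicode_match, ascii_match)
print(explicit_default == explicit_ascii)
```

Here \w+ matches the whole word by default but stops before the accented letter under re.ASCII, while the explicit range removes the mark in both cases.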

