Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

regex - python - problems with regular expression and unicode

Hi, I have a problem in Python. I will try to explain it with an example.

I have this string:

>>> string = 'ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿÀÁÂÃ'
>>> print string
ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿÀÁÂÃ

and I want, for example, to replace every character other than Ñ, Ã, ï with "".

I have tried:

>>> import re
>>> rePat = re.compile('[^ÑÃï]',re.UNICODE)
>>> print rePat.sub("",string)
????????????????????????????????????????????????????

Instead of the three characters I wanted, I get the garbled output above. I think this happens because in Python each of these characters is represented by two positions in the string: for example, \xc3\x91 = Ñ. Because of this, when I apply the regular expression, none of the \xc3 bytes are removed. How can I do this kind of substitution?

Thanks Franco

1 Reply

You need to make sure that your strings are unicode strings, not plain strings (in Python 2, a plain str is effectively a byte array).

Example:

>>> string = 'ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿÀÁÂÃ'
>>> type(string)
<type 'str'>

# do this instead:
# (note the u in front of the ', this marks the character sequence as a unicode literal)
>>> string = u'\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff\xc0\xc1\xc2\xc3'
# or:
>>> string = 'ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿÀÁÂÃ'.decode('utf-8')
# ... but be aware that the latter will only work if the terminal (or source file) has utf-8 encoding
# ... it is a best practice to use the \xNN form in unicode literals, as in the first example

>>> type(string)
<type 'unicode'>
>>> print string
ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿÀÁÂÃ

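# in a unicode pattern, write code points (\xd1 = Ñ, \xc3 = Ã, \xef = ï), not UTF-8 byte pairs
>>> import re
>>> rePat = re.compile(u'[^\xd1\xc3\xef]',re.UNICODE)
>>> print rePat.sub("", string)
ÑïÃ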
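
To see why the byte-string version misbehaves, here is a minimal sketch that is not part of the original session. It assumes a shorter sample string for brevity and is written so that it should run on both Python 2 and Python 3. The names text, raw, byte_pat and uni_pat are only illustrative.

```python
from __future__ import print_function
import re

# Short sample: Ð Ñ Ã ï ñ, written with escapes so the source encoding does not matter.
text = u'\xd0\xd1\xc3\xef\xf1'
raw = text.encode('utf-8')      # the bytes a Python 2 plain string would hold

# Each of these characters takes two bytes in UTF-8, and every pair starts with 0xC3.
print(len(text), len(raw))      # 5 10

# Byte pattern: the class holds the single bytes C3, 91, 83 and AF, not three characters.
byte_pat = re.compile(u'[^\xd1\xc3\xef]'.encode('utf-8'))
byte_result = byte_pat.sub(b'', raw)

# Unicode pattern: the class holds exactly the characters Ñ, Ã and ï.
uni_pat = re.compile(u'[^\xd1\xc3\xef]', re.UNICODE)
uni_result = uni_pat.sub(u'', text)

print(repr(byte_result))        # every 0xC3 lead byte survives, so this is not valid UTF-8
print(repr(uni_result))         # only the three wanted characters remain
```

In the byte version every character keeps its 0xC3 lead byte, because that byte is itself a member of the character class. That is where the unreadable output comes from.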

When reading from a file, string = open('filename.txt').read() reads a byte sequence.

To get the unicode content, do: string = unicode(open('filename.txt').read(), 'encoding'). Or: string = open('filename.txt').read().decode('encoding').
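
A small sketch of that read-then-decode step, under some assumptions: the file name sample.txt and its content are made up, the encoding is taken to be UTF-8, and the file is opened in binary mode so that the code behaves the same on Python 2 and Python 3.

```python
import os
import tempfile

# Hypothetical sample file, written as UTF-8 bytes so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'wb') as f:
    f.write(u'\xd1and\xfa'.encode('utf-8'))   # the word "Ñandú"

# Binary mode returns the raw byte sequence.
with open(path, 'rb') as f:
    raw = f.read()

# Decode once, right after reading, and name the encoding explicitly.
text = raw.decode('utf-8')
```

Here raw holds 7 bytes while text holds 5 characters, since Ñ and ú each take two bytes in UTF-8.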

The codecs module can decode unicode streams (such as files) on-the-fly.
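
As a rough illustration of on-the-fly decoding, the sketch below wraps a binary file in a codecs stream reader. The file name and its content are hypothetical, and UTF-8 is an assumed encoding. Substitute whatever your data really uses.

```python
import codecs
import os
import tempfile

# Hypothetical two-line sample file, stored as UTF-8 bytes.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'wb') as f:
    f.write(u'\xd1and\xfa\nsecond line\n'.encode('utf-8'))

# codecs.getreader returns a StreamReader class for the named encoding.
# Wrapping the binary file in it yields unicode lines as they are read.
utf8_reader = codecs.getreader('utf-8')
with open(path, 'rb') as raw_file:
    stream = utf8_reader(raw_file)
    lines = [line for line in stream]
```

Each element of lines is already a unicode string, so no separate decode call is needed afterwards.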

Do a Google search for "python unicode". Python's unicode handling can be a bit hard to grasp at first, so it pays to read up on it.

I live by this rule: "Software should only work with Unicode strings internally, converting to a particular encoding on output." (from http://www.amk.ca/python/howto/unicode)
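
One way that rule could look in practice is sketched below. The helper name strip_except and its parameters are invented for illustration, and UTF-8 is only an assumed default. Bytes are decoded at the input boundary, the regex works on unicode, and the result is encoded again on the way out.

```python
import re

def strip_except(raw, keep, encoding='utf-8'):
    """Decode raw bytes, drop every character not in keep, and re-encode."""
    text = raw.decode(encoding)                   # bytes -> unicode on input
    # re.escape guards against regex metacharacters inside keep.
    pattern = re.compile(u'[^' + re.escape(keep) + u']', re.UNICODE)
    cleaned = pattern.sub(u'', text)              # all processing happens on unicode
    return cleaned.encode(encoding)               # unicode -> bytes on output

# Usage: keep only Ñ, Ã and ï from a UTF-8 encoded byte string.
raw_input_bytes = u'\xd0\xd1\xc3\xef\xf1'.encode('utf-8')
result = strip_except(raw_input_bytes, u'\xd1\xc3\xef')
```

Keeping the decode and encode steps at the edges means the pattern never sees individual UTF-8 bytes.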

I also recommend: http://www.joelonsoftware.com/articles/Unicode.html

