You need to make sure that your strings are unicode strings, not plain strings (plain strings are like byte arrays).
Example:
>>> string = 'ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿÀÁÂÃ'
>>> type(string)
<type 'str'>
# do this instead:
# (note the u in front of the ', this marks the character sequence as a unicode literal)
>>> string = u'\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff\xc0\xc1\xc2\xc3'
# or:
>>> string = 'ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿÀÁÂÃ'.decode('utf-8')
# ... but be aware that the latter will only work if the terminal (or source file) has utf-8 encoding
# ... it is a best practice to use the \xNN form in unicode literals, as in the first example
>>> type(string)
<type 'unicode'>
>>> print string
ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿÀÁÂÃ
>>> import re
>>> rePat = re.compile(u'[^\xc3\x91\xc3\x83\xc3\xaf]', re.UNICODE)
>>> print rePat.sub("", string)
Ã
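One thing worth checking in a pattern like the one above: inside a unicode literal, each \xNN escape is a code point, not a byte. The sketch below compares a character class written with UTF-8 byte values against one written with code points. The sample string and the variable names are made up for illustration, and the code is written to run under both Python 2 and 3.

```python
import re

# Illustrative sample: code points U+00D0, U+00D1, U+00D2, U+00C3, U+00EF, U+00F1.
sample = u'\xd0\xd1\xd2\xc3\xef\xf1'

# UTF-8 byte values pasted into a unicode literal: this class actually
# contains the code points U+00C3, U+0091, U+0083 and U+00AF.
byte_style = re.compile(u'[^\xc3\x91\xc3\x83\xc3\xaf]', re.UNICODE)

# The same intent written as code points:
# N-tilde (U+00D1), A-tilde (U+00C3), i-diaeresis (U+00EF).
codepoint_style = re.compile(u'[^\xd1\xc3\xef]', re.UNICODE)

kept_byte_style = byte_style.sub(u'', sample)            # only U+00C3 survives
kept_codepoint_style = codepoint_style.sub(u'', sample)  # U+00D1, U+00C3, U+00EF survive
```

If the goal is to keep those three characters, the code point form is probably the one you want.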
When reading from a file, string = open('filename.txt').read() reads a byte sequence.
To get the unicode content, do: string = unicode(open('filename.txt').read(), 'encoding'). Or: string = open('filename.txt').read().decode('encoding').
The codecs module can decode encoded byte streams (such as files) into unicode on the fly.
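A minimal sketch of that, assuming the file is UTF-8: codecs.getreader wraps a binary file object so that reads return unicode. The helper name read_text and the temporary demo file are made up for illustration. codecs.open(path, 'r', 'utf-8') is the common shorthand for the same idea.

```python
import codecs
import os
import tempfile

def read_text(path, encoding):
    """Return the file's content as a unicode string, decoding while reading."""
    with open(path, 'rb') as raw:                 # raw byte stream
        reader = codecs.getreader(encoding)(raw)  # wrapper that decodes on read
        return reader.read()

# Hypothetical demo file: write known UTF-8 bytes, then read them back.
fd, demo_path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as f:
    f.write(u'\xd1\xc3\xef'.encode('utf-8'))      # three non-ASCII characters

text = read_text(demo_path, 'utf-8')              # unicode string, 3 characters
os.remove(demo_path)
```

The encoding has to match what the file was actually written with, otherwise decoding either fails or silently yields the wrong characters.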
Do a Google search for "python unicode". Python's unicode handling can be a bit hard to grasp at first, so it pays to read up on it.
I live by this rule: "Software should only work with Unicode strings internally, converting to a particular encoding on output." (from http://www.amk.ca/python/howto/unicode)
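In practice that rule could look roughly like the sketch below: decode at the input boundary, do all the work (here, a regex filter) on unicode, and encode only when handing the result back out. The function name filter_bytes, the pattern, and the sample input are illustrative assumptions, not taken from the linked HOWTO. It runs under both Python 2 and 3.

```python
import re

# Keep only N-tilde (U+00D1), A-tilde (U+00C3) and i-diaeresis (U+00EF).
KEEP_ONLY = re.compile(u'[^\xd1\xc3\xef]', re.UNICODE)

def filter_bytes(raw_bytes, encoding='utf-8'):
    text = raw_bytes.decode(encoding)    # input boundary: bytes -> unicode
    cleaned = KEEP_ONLY.sub(u'', text)   # internal work happens on unicode only
    return cleaned.encode(encoding)      # output boundary: unicode -> bytes

# Illustrative input, as it might arrive from a file or socket.
raw = u'Espa\xf1a, \xd1and\xfa, S\xc3O, na\xefve'.encode('utf-8')
result = filter_bytes(raw)
```

Keeping the encoding as a parameter means the same function works unchanged for input in, say, latin-1.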
I also recommend: http://www.joelonsoftware.com/articles/Unicode.html