I have a unicode string with accented Latin characters, e.g.
n=unicode('Wikipédia, le projet d’encyclopédie','utf-8')
I want to convert it to plain ASCII, i.e. 'Wikipedia, le projet dencyclopedie', so all acute accents, cedillas, etc. should be removed.
What is the fastest way to do that? It needs to be done for matching against a long autocomplete dropdown list.
Conclusion:
As one of my criteria is speed, Lennart's 'register your own error handler for unicode encoding/decoding' gives the best result (see Alex's answer); the speed difference increases further as more and more of the characters are Latin.
Here is the translation table I am using. I also modified the error handler, as it needs to take care of the whole range of unencodable characters from error.start to error.end.
# -*- coding: utf-8 -*-
import codecs
"""
This is more of a visual translation; it also avoids long multi-character
translations, e.g. £ could otherwise be written as {pound}.
"""
latin_dict = {
u"¡": u"!", u"¢": u"c", u"£": u"L", u"¤": u"o", u"¥": u"Y",
u"¦": u"|", u"§": u"S", u"¨": u"`", u"©": u"c", u"ª": u"a",
u"«": u"<<", u"¬": u"-", u"\xad": u"-", u"®": u"R", u"¯": u"-",
u"°": u"o", u"±": u"+-", u"²": u"2", u"³": u"3", u"´": u"'",
u"µ": u"u", u"¶": u"P", u"·": u".", u"¸": u",", u"¹": u"1",
u"º": u"o", u"»": u">>", u"¼": u"1/4", u"½": u"1/2", u"¾": u"3/4",
u"¿": u"?", u"À": u"A", u"Á": u"A", u"Â": u"A", u"Ã": u"A",
u"Ä": u"A", u"Å": u"A", u"Æ": u"Ae", u"Ç": u"C", u"È": u"E",
u"É": u"E", u"Ê": u"E", u"Ë": u"E", u"Ì": u"I", u"Í": u"I",
u"Î": u"I", u"Ï": u"I", u"Ð": u"D", u"Ñ": u"N", u"Ò": u"O",
u"Ó": u"O", u"Ô": u"O", u"Õ": u"O", u"Ö": u"O", u"×": u"*",
u"Ø": u"O", u"Ù": u"U", u"Ú": u"U", u"Û": u"U", u"Ü": u"U",
u"Ý": u"Y", u"Þ": u"p", u"ß": u"b", u"à": u"a", u"á": u"a",
u"â": u"a", u"ã": u"a", u"ä": u"a", u"å": u"a", u"æ": u"ae",
u"ç": u"c", u"è": u"e", u"é": u"e", u"ê": u"e", u"ë": u"e",
u"ì": u"i", u"í": u"i", u"î": u"i", u"ï": u"i", u"ð": u"d",
u"ñ": u"n", u"ò": u"o", u"ó": u"o", u"ô": u"o", u"õ": u"o",
u"ö": u"o", u"÷": u"/", u"ø": u"o", u"ù": u"u", u"ú": u"u",
u"û": u"u", u"ü": u"u", u"ý": u"y", u"þ": u"p", u"ÿ": u"y",
u"’":u"'"}
def latin2ascii(error):
    """
    error covers the portion of text from error.start to error.end; we
    convert only the first character, hence we return error.start + 1
    instead of error.end
    """
    return latin_dict[error.object[error.start]], error.start + 1
codecs.register_error('latin2ascii', latin2ascii)
if __name__ == "__main__":
    x = u"¼ éíñ§ÐÌëÑ » ¼ ö ® © ’"
    print x
    print x.encode('ascii', 'latin2ascii')
Why I return error.start + 1:
The error object passed to the handler can span multiple characters, and we convert only the first of them. E.g. if I add print error.start, error.end to the error handler, the output is
¼ éíñ§ÐÌëÑ » ¼ ö ® © ’
0 1
2 10
3 10
4 10
5 10
6 10
7 10
8 10
9 10
11 12
13 14
15 16
17 18
19 20
21 22
1/4 einSDIeN >> 1/4 o R c '
So in the second line we get the characters 2-10 but convert only the one at index 2, hence we return 3 as the continue point. If we return error.end instead, the output is
¼ éíñ§ÐÌëÑ » ¼ ö ® © ’
0 1
2 10
11 12
13 14
15 16
17 18
19 20
21 22
1/4 e >> 1/4 o R c '
As we can see, the 2-10 portion has been replaced by a single char. Of course it would be faster to just encode the whole range in one go and return error.end, but for demonstration purposes I have kept it simple.
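For reference, the "whole range in one go" variant could look roughly like the sketch below. It is only an illustration, not the code I benchmarked: SAMPLE_TABLE is a deliberately shortened, hypothetical stand-in for latin_dict, the names latin2ascii_range and to_ascii are made up, and falling back to "?" for unmapped characters is my assumption (the handler above would raise KeyError instead). It is written so that it also runs on Python 3, where encode returns bytes.

```python
import codecs

# Hypothetical, shortened table for illustration; the full latin_dict
# from the post would be used in practice. Escapes are used so the
# sample survives copy/paste through a non-UTF-8 channel.
SAMPLE_TABLE = {
    u"\xbc": u"1/4",   # VULGAR FRACTION ONE QUARTER
    u"\xe9": u"e",     # e with acute
    u"\xed": u"i",     # i with acute
    u"\xf1": u"n",     # n with tilde
    u"\u2019": u"'",   # RIGHT SINGLE QUOTATION MARK
}

def latin2ascii_range(error):
    """Translate the whole unencodable run and resume at error.end."""
    if not isinstance(error, UnicodeEncodeError):
        raise error
    run = error.object[error.start:error.end]
    # Assumption: characters missing from the table become "?".
    replacement = u"".join(SAMPLE_TABLE.get(ch, u"?") for ch in run)
    return replacement, error.end

codecs.register_error("latin2ascii_range", latin2ascii_range)

def to_ascii(text):
    """Encode text to ASCII, translating non-ASCII runs via the table."""
    return text.encode("ascii", "latin2ascii_range")

print(to_ascii(u"\xbc \xe9\xed\xf1 \u2019"))  # b"1/4 ein '" on Python 3
```

With this version the handler is called once per run of non-ASCII characters instead of once per character, which is where the saving would come from.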
see http://docs.python.org/library/codecs.html#codecs.register_error for more details
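To reproduce the start/end pairs listed above without print statements, a small recording handler can be used. This is a minimal sketch: the handler name record_span, the spans list and the shortened sample string are hypothetical, and a "?" placeholder is emitted instead of a real translation. It runs on Python 3 as well.

```python
import codecs

spans = []  # (start, end) pairs reported to the handler, in call order

def record_span(error):
    """Log the reported span, emit a placeholder, resume one char later."""
    if not isinstance(error, UnicodeEncodeError):
        raise error
    spans.append((error.start, error.end))
    return u"?", error.start + 1

codecs.register_error("record_span", record_span)

# One quarter sign, a space, then a run of three accented letters.
sample = u"\xbc \xe9\xed\xf1"
encoded = sample.encode("ascii", "record_span")

print(spans)    # [(0, 1), (2, 5), (3, 5), (4, 5)]
print(encoded)  # b'? ???' on Python 3
```

The end value stays at 5 while start advances by one on each call, the same pattern as the 2 10, 3 10, ... lines in the first output.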