Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
592 views
in Technique[技术] by (71.8m points)

python - How do I get str.translate to work with Unicode strings?

I have the following code:

import string
def translate_non_alphanumerics(to_translate, translate_to='_'):
    not_letters_or_digits = u'!"#%'()*+,-./:;<=>?@[]^_`{|}~'
    translate_table = string.maketrans(not_letters_or_digits,
                                       translate_to
                                         *len(not_letters_or_digits))
    return to_translate.translate(translate_table)

Which works great for non-unicode strings:

>>> translate_non_alphanumerics('<foo>!')
'_foo__'

But fails for unicode strings:

>>> translate_non_alphanumerics(u'<foo>!')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 5, in translate_non_alphanumerics
TypeError: character mapping must return integer, None or unicode

I can't make any sense of the paragraph on "Unicode objects" in the Python 2.6.2 docs for the str.translate() method.

How do I make this work for Unicode strings?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

The Unicode version of translate requires a mapping from Unicode ordinals (which you can retrieve for a single character with ord) to Unicode ordinals. If you want to delete characters, you map to None.

I changed your function to build a dict mapping the ordinal of every character to the ordinal of what you want to translate to:

def translate_non_alphanumerics(to_translate, translate_to=u'_'):
    not_letters_or_digits = u'!"#%'()*+,-./:;<=>?@[]^_`{|}~'
    translate_table = dict((ord(char), translate_to) for char in not_letters_or_digits)
    return to_translate.translate(translate_table)

>>> translate_non_alphanumerics(u'<foo>!')
u'_foo__'

edit: It turns out that the translation mapping must map from the Unicode ordinal (via ord) to either another Unicode ordinal, a Unicode string, or None (to delete). I have thus changed the default value for translate_to to be a Unicode literal. For example:

>>> translate_non_alphanumerics(u'<foo>!', u'bad')
u'badfoobadbad'

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...