What is the best way to remove accents (normalize) in a Python unicode string?

Question

Welcome To Ask or Share your Answers For Others

What is the best way to remove accents (normalize) in a Python unicode string?

posted Oct 16, 2021 in Technique[技术] by 深蓝 (71.8m points)

What is the best way to remove accents (normalize) in a Python unicode string?

I have a Unicode string in Python, and I would like to remove all the accents (diacritics).

I found on the web an elegant way to do this (in Java):

convert the Unicode string to its long normalized form (with a separate character for letters and diacritics)
remove all the characters whose Unicode type is "diacritic".

Do I need to install a library such as pyICU or is this possible with just the Python standard library? And what about python 3?

Important note: I would like to avoid code with an explicit mapping from accented characters to their non-accented counterpart.

Question&Answers:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-16T05:57:24+0000

Unidecode is the correct answer for this. It transliterates any unicode string into the closest possible representation in ascii text.

Example:

accented_string = u'Málaga'
# accented_string is of type 'unicode'
import unidecode
unaccented_string = unidecode.unidecode(accented_string)
# unaccented_string contains 'Malaga'and is of type 'str'

Categories

What is the best way to remove accents (normalize) in a Python unicode string?

What is the best way to remove accents (normalize) in a Python unicode string?

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags