python - Why doesn't unicodedata recognise certain characters?

Question

posted Oct 24, 2021 in Technique by 深蓝 (71.8m points)

In Python 2.7 at least, unicodedata.name() doesn't recognise certain characters.

>>> from unicodedata import name
>>> name(u'\n')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name
>>> name(u'a')
'LATIN SMALL LETTER A'

Certainly Unicode contains the character \n, and it has a name, specifically "LINE FEED".

NB. unicodedata.lookup('LINE FEED') and unicodedata.lookup(u'LINE FEED') both give a KeyError: undefined character name.

1 Reply

深蓝 · Answer 1 · 2021-10-23T17:50:08+0000

The unicodedata.name() lookup relies on the name field (the second column) of the UnicodeData.txt database in the standard (Python 2.7 uses Unicode 5.2.0).

If that name starts with <, it is ignored. All control codes, including the newline, fall into that category, because their name field holds only the placeholder <control>:

000A;<control>;Cc;0;B;;;;;N;LINE FEED (LF);;;;

Field 10 (counting from zero, as the standard does, so the eleventh column) is the old Unicode 1.0 name, which should not be used according to the standard. In other words, \n has no name other than the generic <control>, which the Python database ignores (as it is not unique).
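To make the layout of that record concrete, here is a minimal sketch that splits it on semicolons. The indices are zero-based list positions, so the name is fields[1] and the Unicode 1.0 name is fields[10]. The variable names and the startswith("<") check are illustrative assumptions that mirror the rule described above; they are not how CPython actually builds its database.

```python
# The record quoted above, copied verbatim from UnicodeData.txt.
record = "000A;<control>;Cc;0;B;;;;;N;LINE FEED (LF);;;;"

# Fields are separated by semicolons; list indices start at zero.
fields = record.split(";")

code_point = int(fields[0], 16)   # hexadecimal code point, here U+000A
name_field = fields[1]            # the name that unicodedata.name() would use
unicode_1_name = fields[10]       # old Unicode 1.0 name, not used for lookups

# Assumption for illustration: a name in angle brackets is a placeholder,
# so the character is treated as having no usable name.
has_usable_name = not name_field.startswith("<")

print(hex(code_point), name_field, unicode_1_name, has_usable_name)
```

The placeholder <control> is shared by every control code, which is why it cannot serve as a unique name.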

Python 3.3 added support for NameAliases.txt, which lets you look up characters by alias, so lookup('LINE FEED'), lookup('new line'), lookup('eol') and so on all return \n. However, unicodedata.name() does not support aliases, nor could it (which one would it pick?). The Python 3.3 documentation describes the change:

Added support for Unicode name aliases and named sequences. Both unicodedata.lookup() and '\N{...}' now resolve name aliases, and unicodedata.lookup() resolves named sequences too.

TL;DR: LINE FEED is not the official name for \n; it is only an alias for it. Python 3.3 and up let you look up characters by alias.
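A short sketch of that behaviour, assuming Python 3.3 or later (the Python 2.7 session in the question would still raise KeyError on the lookup). It uses only the standard unicodedata module. The "<no name>" fallback string is an arbitrary choice for illustration.

```python
import unicodedata

# Ordinary characters have an official name.
letter_name = unicodedata.name("a")

# name() accepts an optional default that is returned instead of
# raising ValueError when the character has no official name.
newline_name = unicodedata.name("\n", "<no name>")

# lookup() resolves aliases from NameAliases.txt in Python 3.3+.
newline_char = unicodedata.lookup("LINE FEED")

# The general category still identifies the character as a control code.
newline_category = unicodedata.category("\n")

print(letter_name, newline_name, repr(newline_char), newline_category)
```

Passing a default to name() is a convenient way to label control characters in diagnostics without wrapping every call in try/except.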
