Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
674 views
in Technique[技术] by (71.8m points)

unicode table information about a character in python

Is there a way in python to get the technical information for a given character like it's displayed in the Unicode table? (cf.https://unicode-table.com/en/)

Example: for the letter "?"

  • Name > Latin Capital Letter E with Double Grave
  • Unicode number > U+0204
  • HTML-code > Ȅ
  • Bloc > Latin Extended-B
  • Lowercase > ?

What I actually need is to get for any Unicode number (like here U+0204) the corresponding name (Latin Capital Letter E with Double Grave) and the lowercase version (here "?").

Roughly:
input = a Unicode number
output = corresponding information

The closest thing I've been able to find is the fontTools library but I can't seem to find any tutorial/documentation on how to use it to do that.

Thank you.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

The standard module unicodedata defines a lot of properties, but not everything. A quick peek at its source confirms this.

Fortunately unicodedata.txt, the data file where this comes from, is not hard to parse. Each line consists of exactly 15 elements, ; separated, which makes it ideal for parsing. Using the description of the elements on ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html, you can create a few classes to encapsulate the data. I've taken the names of the class elements from that list; the meaning of each of the elements is explained on that same page.

Make sure to download ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt and ftp://ftp.unicode.org/Public/UNIDATA/Blocks.txt first, and put them inside the same folder as this program.

Code (tested with Python 2.7 and 3.6):

# -*- coding: utf-8 -*-

class UnicodeCharacter:
    def __init__(self):
        self.code = 0
        self.name = 'unnamed'
        self.category = ''
        self.combining = ''
        self.bidirectional = ''
        self.decomposition = ''
        self.asDecimal = None
        self.asDigit = None
        self.asNumeric = None
        self.mirrored = False
        self.uc1Name = None
        self.comment = ''
        self.uppercase = None
        self.lowercase = None
        self.titlecase = None
        self.block = None

    def __getitem__(self, item):
        return getattr(self, item)

    def __repr__(self):
        return '{'+self.name+'}'

class UnicodeBlock:
    def __init__(self):
        self.first = 0
        self.last = 0
        self.name = 'unnamed'

    def __repr__(self):
        return '{'+self.name+'}'

class BlockList:
    def __init__(self):
        self.blocklist = []
        with open('Blocks.txt','r') as uc_f:
            for line in uc_f:
                line = line.strip(' 
')
                if '#' in line:
                    line = line.split('#')[0].strip()
                if line != '':
                    rawdata = line.split(';')
                    block = UnicodeBlock()
                    block.name = rawdata[1].strip()
                    rawdata = rawdata[0].split('..')
                    block.first = int(rawdata[0],16)
                    block.last = int(rawdata[1],16)
                    self.blocklist.append(block)
            # make 100% sure it's sorted, for quicker look-up later
            # (it is usually sorted in the file, but better make sure)
            self.blocklist.sort (key=lambda x: block.first)

    def lookup(self,code):
        for item in self.blocklist:
            if code >= item.first and code <= item.last:
                return item.name
        return None

class UnicodeList:
    """UnicodeList loads Unicode data from the external files
    'UnicodeData.txt' and 'Blocks.txt', both available at unicode.org

    These files must appear in the same directory as this program.

    UnicodeList is a new interpretation of the standard library
    'unicodedata'; you may first want to check if its functionality
    suffices.

    As UnicodeList loads its data from an external file, it does not depend
    on the local build from Python (in which the Unicode data gets frozen
    to the then 'current' version).

    Initialize with

        uclist = UnicodeList()
    """
    def __init__(self):

        # we need this first
        blocklist = BlockList()
        bpos = 0

        self.codelist = []
        with open('UnicodeData.txt','r') as uc_f:
            for line in uc_f:
                line = line.strip(' 
')
                if '#' in line:
                    line = line.split('#')[0].strip()
                if line != '':
                    rawdata = line.strip().split(';')
                    parsed = UnicodeCharacter()
                    parsed.code = int(rawdata[0],16)
                    parsed.characterName = rawdata[1]
                    parsed.category = rawdata[2]
                    parsed.combining = rawdata[3]
                    parsed.bidirectional = rawdata[4]
                    parsed.decomposition = rawdata[5]
                    parsed.asDecimal = int(rawdata[6]) if rawdata[6] else None
                    parsed.asDigit = int(rawdata[7]) if rawdata[7] else None
                    # the following value may contain a slash:
                    #  ONE QUARTER ... 1/4
                    # let's make it Python 2.7 compatible :)
                    if '/' in rawdata[8]:
                        rawdata[8] = rawdata[8].replace('/','./')
                        parsed.asNumeric = eval(rawdata[8])
                    else:
                        parsed.asNumeric = int(rawdata[8]) if rawdata[8] else None
                    parsed.mirrored = rawdata[9] == 'Y'
                    parsed.uc1Name = rawdata[10]
                    parsed.comment = rawdata[11]
                    parsed.uppercase = int(rawdata[12],16) if rawdata[12] else None
                    parsed.lowercase = int(rawdata[13],16) if rawdata[13] else None
                    parsed.titlecase = int(rawdata[14],16) if rawdata[14] else None
                    while bpos < len(blocklist.blocklist) and parsed.code > blocklist.blocklist[bpos].last:
                        bpos += 1
                    parsed.block = blocklist.blocklist[bpos].name if bpos < len(blocklist.blocklist) and parsed.code >= blocklist.blocklist[bpos].first else None
                    self.codelist.append(parsed)

    def find_code(self,codepoint):
        """Find the Unicode information for a codepoint (as int).

        Returns:
            a UnicodeCharacter class object or None.
        """
        # the list is unlikely to contain duplicates but I have seen Unicode.org
        # doing that in similar situations. Again, better make sure.
        val = [x for x in self.codelist if codepoint == x.code]
        return val[0] if val else None

    def find_char(self,str):
        """Find the Unicode information for a codepoint (as character).

        Returns:
            for a single character: a UnicodeCharacter class object or
            None.
            for a multicharacter string: a list of the above, one element
            per character.
        """
        if len(str) > 1:
            result = [self.find_code(ord(x)) for x in str]
            return result
        else:
            return self.find_code(ord(str))

When loaded, you can now look up a character code with

>>> ul = UnicodeList()     # ONLY NEEDED ONCE!
>>> print (ul.find_code(0x204))
{LATIN CAPITAL LETTER E WITH DOUBLE GRAVE}

which by default is shown as the name of a character (Unicode calls this a 'code point'), but you can retrieve other properties as well:

>>> print ('%04X' % uc.find_code(0x204).lowercase)
0205
>>> print (ul.lookup(0x204).block)
Latin Extended-B

and (as long as you don't get a None) even chain them:

>>> print (ul.find_code(ul.find_code(0x204).lowercase))
{LATIN SMALL LETTER E WITH DOUBLE GRAVE}

It does not rely on your particular build of Python; you can always download an updated list from unicode.org and be assured to get the most recent information:

import unicodedata
>>> print (unicodedata.name('U0001F903'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name
>>> print (uclist.find_code(0x1f903))
{LEFT HALF CIRCLE WITH FOUR DOTS}

(As tested with Python 3.5.3.)

There are currently two lookup functions defined:

  • find_code(int) looks up character information by codepoint as an integer.
  • find_char(string) looks up character information for the character(s) in string. If there is only one character, it returns a UnicodeCharacter object; if there are more, it returns a list of objects.

After import unicodelist (assuming you saved this as unicodelist.py), you can use

>>> ul = UnicodeList()
>>> hex(ul.find_char(u'è').code)
'0xe8'

to look up the hex code for any character, and a list comprehension such as

>>> l = [hex(ul.find_char(x).code) for x in 'Hello']
>>> l
['0x48', '0x65', '0x6c', '0x6c', '0x6f']

for longer strings. Note that you don't actually need all of this if all you want is a hex representation of a string! This suffices:

 l = [hex(ord(x)) for x in 'Hello']

The purpose of this module is to give easy access to other Unicode properties. A longer example:

str = 'Héllo...'
dest = ''
for i in str:
    dest += chr(ul.find_char(i).uppercase) if ul.find_char(i).uppercase is not None else i
print (dest)

HéLLO...

and showing a list of properties for a character per your example:

letter = u'?'
print ('Name > '+ul.find_char(letter).name)
print ('Unicode number > U+%04x' % ul.find_char(letter).code)
print ('Bloc > '+ul.find_char(letter).block)
print ('Lowercase > %s' % chr(ul.find_char(letter).lowercase))

(I left out HTML; these names are not defined in the Unicode standard.)


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...