Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - Combining Devanagari characters

I have something like

a = "बिक्रम मेरो नाम हो"

I want to achieve something like

a[0] = बि
a[1] = क्र
a[2] = म

But because म takes 4 bytes while बि takes 8 bytes, I can't get there with plain indexing. How can I achieve this in Python?


1 Reply


The algorithm for splitting text into grapheme clusters is given in Unicode Annex 29, section 3.1. I'm not going to implement the full algorithm for you here, but I'll show you roughly how to handle the case of Devanagari, and then you can read the Annex for yourself and see what else you need to implement.
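
To see why plain indexing does not give the result the question asks for, it helps to look at what a string actually indexes. This is a minimal sketch, assuming Python 3 (where `str` is a sequence of code points); the variable names `bi` and `ma` are only for illustration.

```python
# "बि" is drawn as one unit but consists of two code points:
# DEVANAGARI LETTER BA followed by DEVANAGARI VOWEL SIGN I.
bi = "\N{DEVANAGARI LETTER BA}\N{DEVANAGARI VOWEL SIGN I}"
ma = "\N{DEVANAGARI LETTER MA}"

# len() and indexing count code points, not grapheme clusters.
print(len(ma), len(bi))                      # 1 2
print(bi[0] == "\N{DEVANAGARI LETTER BA}")   # True

# Byte counts depend entirely on the encoding chosen.
print(len(ma.encode("utf-8")), len(bi.encode("utf-8")))          # 3 6
print(len(ma.encode("utf-32-le")), len(bi.encode("utf-32-le")))  # 4 8
```

The 4 and 8 bytes mentioned in the question would match a four-bytes-per-code-point encoding such as UTF-32, though that is a guess about the asker's setup. Either way, neither bytes nor code points line up with what a reader perceives as one character, which is why the clusters have to be computed.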

The unicodedata module contains the information you need to detect the grapheme clusters.

>>> import unicodedata
>>> a = "बिक्रम मेरो नाम हो"
>>> [unicodedata.name(c) for c in a]
['DEVANAGARI LETTER BA', 'DEVANAGARI VOWEL SIGN I', 'DEVANAGARI LETTER KA', 
 'DEVANAGARI SIGN VIRAMA', 'DEVANAGARI LETTER RA', 'DEVANAGARI LETTER MA',
 'SPACE', 'DEVANAGARI LETTER MA', 'DEVANAGARI VOWEL SIGN E',
 'DEVANAGARI LETTER RA', 'DEVANAGARI VOWEL SIGN O', 'SPACE',
 'DEVANAGARI LETTER NA', 'DEVANAGARI VOWEL SIGN AA', 'DEVANAGARI LETTER MA',
 'SPACE', 'DEVANAGARI LETTER HA', 'DEVANAGARI VOWEL SIGN O']

In Devanagari, each grapheme cluster consists of an initial letter, optional pairs of virama (vowel killer) and letter, and an optional vowel sign. In regular expression notation that would be LETTER (VIRAMA LETTER)* VOWEL?. You can tell which is which by looking up the Unicode category for each code point:

>>> [unicodedata.category(c) for c in a]
['Lo', 'Mc', 'Lo', 'Mn', 'Lo', 'Lo', 'Zs', 'Lo', 'Mn', 'Lo', 'Mc', 'Zs',
 'Lo', 'Mc', 'Lo', 'Zs', 'Lo', 'Mc']

Letters are category Lo (Letter, Other). Vowel signs are either Mc (Mark, Spacing Combining) or Mn (Mark, Nonspacing); VOWEL SIGN E in the output above is Mn, for example. The virama is also Mn, and spaces are Zs (Separator, Space).

So here's a rough approach to split out the grapheme clusters:

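import unicodedata

def splitclusters(s):
    """Generate the grapheme clusters for the string s. (Not the full
    Unicode text segmentation algorithm, but probably good enough for
    Devanagari.)
    """
    virama = u'\N{DEVANAGARI SIGN VIRAMA}'
    cluster = u''
    last = None
    for c in s:
        # Major category only: 'L' letter, 'M' mark, 'Z' separator, ...
        cat = unicodedata.category(c)[0]
        # A mark always extends the current cluster; a letter extends it
        # only when it directly follows a virama (forming a conjunct).
        if cat == 'M' or (cat == 'L' and last == virama):
            cluster += c
        else:
            # Anything else starts a new cluster; emit the finished one.
            if cluster:
                yield cluster
            cluster = c
        last = c
    # Emit the final cluster, if any.
    if cluster:
        yield cluster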

>>> list(splitclusters(a))
['बि', 'क्र', 'म', ' ', 'मे', 'रो', ' ', 'ना', 'म', ' ', 'हो']
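
The `LETTER (VIRAMA LETTER)* VOWEL?` pattern described earlier can also be applied almost literally with the standard `re` module. Since `re` has no Unicode-category classes, this sketch first maps every code point to a one-letter class symbol and then matches the pattern against that string of symbols. The names `classify`, `CLUSTER` and `regex_clusters` are made up for this illustration, and the result is just as rough as the generator above: with a strict `VOWEL?`, a second trailing mark (an anusvara after a vowel sign, say) would come out as its own cluster.

```python
import re
import unicodedata

VIRAMA = "\N{DEVANAGARI SIGN VIRAMA}"

def classify(ch):
    """Map one code point to a one-letter class symbol (illustrative)."""
    if ch == VIRAMA:
        return "V"              # virama
    major = unicodedata.category(ch)[0]
    if major == "L":
        return "L"              # letter
    if major == "M":
        return "S"              # vowel sign or other combining mark
    return "O"                  # anything else (space, punctuation, ...)

# LETTER (VIRAMA LETTER)* VOWEL?, falling back to a single code point
# for anything that does not start with a letter.
CLUSTER = re.compile(r"L(?:VL)*S?|.", re.DOTALL)

def regex_clusters(s):
    """Split s into clusters by matching CLUSTER against its class string."""
    # One symbol per code point, so match offsets index s directly.
    classes = "".join(classify(ch) for ch in s)
    return [s[m.start():m.end()] for m in CLUSTER.finditer(classes)]

a = "बिक्रम मेरो नाम हो"
print(regex_clusters(a))
```

For the sample string this gives the same eleven clusters as `splitclusters`. To get the indexing the question asks for, keep the result in a list and index that instead of the string.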
