python - What are the upper and lower bound for Chinese char in UTF-8?

Question

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

I would like to build a set in Python that contains the ord() values of all Chinese characters.

For English, the equivalent is:

english = (set(range(ord('a'), ord('z') + 1)) |
           set(range(ord('A'), ord('Z') + 1)))

1 Reply

深蓝 · Answer 1 · 2021-10-23T19:38:24+0000

From the Unicode Standard (v6.0, section 12.1):

Han ideographic characters are found in seven main blocks of the Unicode Standard, as shown in Table 12-2.

Table 12-2. Blocks Containing Han Ideographs

Block                                   | Range       | Comment
----------------------------------------+-------------+-----------------------------------------------------
CJK Unified Ideographs                  | 4E00–9FFF   | Common
CJK Unified Ideographs Extension A      | 3400–4DBF   | Rare
CJK Unified Ideographs Extension B      | 20000–2A6DF | Rare, historic
CJK Unified Ideographs Extension C      | 2A700–2B73F | Rare, historic
CJK Unified Ideographs Extension D      | 2B740–2B81F | Uncommon, some in current use
CJK Compatibility Ideographs            | F900–FAFF   | Duplicates, unifiable variants, corporate characters
CJK Compatibility Ideographs Supplement | 2F800–2FA1F | Unifiable variants

There are also a few small additions. These lie inside the CJK Unified Ideographs block (4E00–9FFF) rather than outside it:

Table 12-3. Small Extensions to the URO

Range     | Version | Comment
----------+---------+-------------------------------------------------
9FA6–9FB3 | 4.1     | Interoperability with HKSCS standard
9FB4–9FBB | 4.1     | Interoperability with GB 18030 standard
9FBC–9FC2 | 5.1     | Interoperability with commercial implementations
9FC3      | 5.1     | Correction of mistaken unification
9FC4–9FC6 | 5.2     | Interoperability with ARIB standard
9FC7–9FCB | 5.2     | Interoperability with HKSCS standard

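To build a set of the ordinal values of all these ranges, you can do this:

from itertools import chain

# range() excludes its end value, so each end is the last code point + 1.
# chain() is used because range objects cannot be joined with + in Python 3.
chinese = set(chain(range(0x4E00, 0xA000),     # CJK Unified Ideographs
                    range(0x3400, 0x4DC0),     # Extension A
                    range(0x20000, 0x2A6E0),   # Extension B
                    range(0x2A700, 0x2B740),   # Extension C
                    range(0x2B740, 0x2B820),   # Extension D
                    range(0xF900, 0xFB00),     # CJK Compatibility Ideographs
                    range(0x2F800, 0x2FA20),   # Compatibility Ideographs Supplement
                    range(0x9FA6, 0x9FCC)))    # Table 12-3 (already within 4E00-9FFF)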

Be aware, though, that this set contains over 75,000 code points, so it may not be the most compact or efficient data structure for the job.
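If the size of the set is a concern, one lighter alternative is to keep only the block boundaries and compare code points against them. The following is a minimal sketch, not part of the original answer: the names HAN_RANGES, is_han and han_chars are made up for illustration, and the ranges are the inclusive bounds from Table 12-2.

```python
# Inclusive (first, last) code point pairs copied from Table 12-2.
# HAN_RANGES, is_han and han_chars are illustrative names, not a library API.
HAN_RANGES = [
    (0x3400, 0x4DBF),    # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs (covers the Table 12-3 ranges)
    (0xF900, 0xFAFF),    # CJK Compatibility Ideographs
    (0x20000, 0x2A6DF),  # CJK Unified Ideographs Extension B
    (0x2A700, 0x2B73F),  # CJK Unified Ideographs Extension C
    (0x2B740, 0x2B81F),  # CJK Unified Ideographs Extension D
    (0x2F800, 0x2FA1F),  # CJK Compatibility Ideographs Supplement
]


def is_han(ch):
    """Return True if the single character ch lies in one of HAN_RANGES."""
    cp = ord(ch)
    return any(first <= cp <= last for first, last in HAN_RANGES)


def han_chars(text):
    """Return the characters of text that lie in one of HAN_RANGES."""
    return [ch for ch in text if is_han(ch)]
```

With seven ranges, a linear scan per character is cheap, and the table stays easy to compare against the Unicode charts. Note that it treats every code point in a block as a match, including ones that are unassigned.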

Also, if you insist on using ord() on literal characters, characters above U+FFFF need the eight-digit \U escape form:

>>> ord(u'\U0002F800')
194560
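Since the question title mentions UTF-8, it may help to see that the ranges above are Unicode code points, which is what ord() returns, and not UTF-8 byte values. The sketch below assumes Python 3, where str is a Unicode string and encode() yields the UTF-8 bytes. The variable names are chosen for illustration only.

```python
# ord() works on code points; encode('utf-8') gives the byte sequence.
first = chr(0x4E00)      # first code point of the CJK Unified Ideographs block
last = chr(0x9FFF)       # last code point of the same block
astral = '\U0002F800'    # first code point of the Compatibility Supplement

first_bytes = first.encode('utf-8')    # b'\xe4\xb8\x80' (3 bytes)
last_bytes = last.encode('utf-8')      # b'\xe9\xbf\xbf' (3 bytes)
astral_bytes = astral.encode('utf-8')  # b'\xf0\xaf\xa0\x80' (4 bytes)

print(hex(ord(first)), first_bytes)
print(hex(ord(last)), last_bytes)
print(hex(ord(astral)), astral_bytes)
```

So a check on decoded text should compare code points, while a check on raw UTF-8 bytes would have to deal with three-byte and four-byte sequences.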
