Unicode in Python

posted Jan 31, 2022 in Technique by 深蓝 (71.8m points)

In Python 2.7's documentation, three rules about Unicode are described as follows:

If the code point is <128, it’s represented by the corresponding byte value.

If the code point is between 128 and 0x7ff, it’s turned into two byte values between 128 and 255.

Code points >0x7ff are turned into three- or four-byte sequences, where each byte of the sequence is between 128 and 255.

Then I ran a few tests:

>>> unichr(40960)
u'\ua000'
>>> ord(u'\ua000')
40960

As I understand it, 40960 is a code point > 0x7ff, so it should be turned into a three- or four-byte sequence in which each byte is between 128 and 255. Instead it appears to be turned into a two-byte sequence, and the value '00' in u'\ua000' is lower than 128, which does not match the rules above. Why?

What's more, I found other Unicode characters, such as u'\u1234', in which the values ("12" and "34") are also lower than 128, although according to the rules above they should not be. Is there some other rule that I missed?

Thanks for all answers.

1 Reply

深蓝 · Answer 1 · 2022-01-31T07:23:37+0000

Regarding "In Python 2.7's documentation, three rules about Unicode are described as follows":

That is a description of the UTF-8 encoding, not of Unicode code points themselves.
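The three rules quoted in the question can be sketched as a small hand-written encoder. This is only an illustration, not how Python implements it: the function name utf8_encode is hypothetical, the code assumes Python 3 (where chr replaces unichr), and it does not reject surrogate code points.

```python
def utf8_encode(code_point):
    """Encode one code point by hand, following the three quoted rules."""
    if code_point < 0x80:
        # Rule 1: below 128, a single byte equal to the code point.
        return bytes([code_point])
    if code_point <= 0x7FF:
        # Rule 2: two bytes, both in the range 128..255.
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:
        # Rule 3, three-byte case.
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # Rule 3, four-byte case.
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


for cp in (0x41, 0xE9, 0xA000, 0x1F600):
    # The hand-written result should agree with the built-in codec.
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
    print(hex(cp), list(utf8_encode(cp)))
```

For 0xA000 (40960) this yields the three bytes 234, 128, 128, all of which are at least 128, as the third rule says.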

Regarding "Then I made some tests about it":

\ua000 is an escape sequence representing a single Unicode character. The a000 is the hexadecimal representation of the numerical code point value (0xa000 = 40960). It has nothing to do with the UTF-8 encoding, so the digits "a0" and "00" are not bytes.
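A quick way to check this, sketched here in Python 3 (an assumption; the question uses Python 2.7, where unichr plays the role of chr):

```python
ch = '\ua000'

# The escape denotes one character, not two bytes.
print(len(ch))            # 1

# The four hex digits are simply the code point written in base 16.
print(ord(ch))            # 40960
print(hex(ord(ch)))       # 0xa000
print(ch == chr(0xA000))  # True
```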

You only get UTF-8 bytes when you explicitly encode a unicode string using the UTF-8 encoding, for example by calling .encode('utf-8') on it.
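As a minimal sketch, here are the two characters from the question encoded explicitly. It assumes Python 3, where encode returns a bytes object whose items are integers; the variable names are illustrative only.

```python
chars = ['\ua000', '\u1234']

# Map each character to its UTF-8 byte sequence.
encoded = {ch: ch.encode('utf-8') for ch in chars}

for ch in chars:
    data = encoded[ch]
    # Print the code point, then the individual byte values.
    print(hex(ord(ch)), list(data))
    # Decoding the bytes gives back the original character.
    assert data.decode('utf-8') == ch
```

Both characters come out as three bytes (234, 128, 128 and 225, 136, 180), and every byte is between 128 and 255, which matches the documented rule for code points above 0x7ff.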
