In Python 2.7's documentation, three rules about Unicode are described as follows:
If the code point is <128, it’s represented by the corresponding byte value.
If the code point is between 128 and 0x7ff, it’s turned into two byte values between 128 and 255.
Code points >0x7ff are turned into three- or four-byte sequences, where each byte of the sequence is between 128 and 255.
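A minimal sketch of the rules above, assuming they refer to the byte values produced by encoding a string to UTF-8. The helper name `utf8_byte_values` is hypothetical and does not come from the documentation:

```python
# -*- coding: utf-8 -*-
# Assumption: the three rules describe UTF-8 encoded bytes.
# utf8_byte_values is a hypothetical helper, not a standard function.

def utf8_byte_values(text):
    """Return the UTF-8 byte values of a unicode string as a list of ints."""
    # bytearray yields integers on both Python 2 and Python 3
    return list(bytearray(text.encode('utf-8')))

print(utf8_byte_values(u'A'))           # code point 65 (< 128)          -> [65]
print(utf8_byte_values(u'\xe9'))        # code point 233 (128 to 0x7ff)  -> [195, 169]
print(utf8_byte_values(u'\u20ac'))      # code point 0x20ac (> 0x7ff)    -> [226, 130, 172]
print(utf8_byte_values(u'\U0001f600'))  # code point 0x1f600 (> 0xffff)  -> [240, 159, 152, 128]
```

The byte values come from calling `.encode('utf-8')` on the string, which is a separate step from writing the character with a `\u` escape.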
Then I ran some tests:
>>> unichr(40960)
u'\ua000'
>>> ord(u'\ua000')
40960
In my view, 40960 is a code point > 0x7ff, so it should be turned into a three- or four-byte sequence in which each byte is between 128 and 255. But it seems to be turned into only a two-byte sequence, and the value '00' in u'\ua000' is lower than 128, which does not match the rules above. Why?
What's more, I found other Unicode characters, such as u'\u1234'. The values in it ("12" and "34") are also lower than 128, but according to the rules quoted above they shouldn't be. Is there some other theory that I'm missing?
Thanks for all answers.