Consider the following snippet of code:
import sys
for i in range(128, 256):
    sys.stdout.write(chr(i))
Run this with Python 2 and look at the result with hexdump -C:
00000000 80 81 82 83 84 85 86 87 88 89 8a 8b 8c 8d 8e 8f |................|
Et cetera. No surprises; 128 bytes from 0x80 to 0xff.
Do the same with Python 3:
00000000 c2 80 c2 81 c2 82 c2 83 c2 84 c2 85 c2 86 c2 87 |................|
...
00000070 c2 b8 c2 b9 c2 ba c2 bb c2 bc c2 bd c2 be c2 bf |................|
00000080 c3 80 c3 81 c3 82 c3 83 c3 84 c3 85 c3 86 c3 87 |................|
...
000000f0 c3 b8 c3 b9 c3 ba c3 bb c3 bc c3 bd c3 be c3 bf |................|
To summarize:
- Everything from 0x80 to 0xbf has 0xc2 prepended.
- Everything from 0xc0 to 0xff has bit 6 set to zero and has 0xc3 prepended.
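Both rules can be checked mechanically against Python 3's own encoder (a quick sketch using the standard `str.encode` method):

```python
# 0x80-0xbf: same byte, with 0xc2 prepended
for i in range(0x80, 0xc0):
    assert chr(i).encode("utf-8") == bytes([0xc2, i])

# 0xc0-0xff: bit 6 cleared, with 0xc3 prepended
for i in range(0xc0, 0x100):
    assert chr(i).encode("utf-8") == bytes([0xc3, i & ~0x40])
```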
So, what’s going on here?
In Python 2, strings are byte strings and no conversion is done. Tell it to
write something outside the 0-127 ASCII range and it says “okey-doke!” and
just writes those bytes. Simple.
In Python 3, strings are Unicode. When non-ASCII characters are
written, they must be encoded in some way. The default encoding here
(in a typical UTF-8 locale) is UTF-8.
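You can see both facts directly: the string holds one code point, but encoding it yields two bytes. (A minimal check using `encode()` explicitly, so the result doesn't depend on the terminal's locale.)

```python
s = chr(0x80)             # a one-character Unicode string
print(len(s))             # number of code points: 1
print(s.encode("utf-8"))  # UTF-8 bytes: b'\xc2\x80'
```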
So, how are these values encoded in UTF-8?
Code points from 0x80 to 0x7ff are encoded as follows:

110vvvvv 10vvvvvv

where the 11 v characters are the bits of the code point.
Thus:
0x80                 hex
1000 0000            8-bit binary
000 1000 0000        11-bit binary
00010 000000         divide into vvvvv vvvvvv
11000010 10000000    resulting UTF-8 octets in binary
0xc2 0x80            resulting UTF-8 octets in hex

0xc0                 hex
1100 0000            8-bit binary
000 1100 0000        11-bit binary
00011 000000         divide into vvvvv vvvvvv
11000011 10000000    resulting UTF-8 octets in binary
0xc3 0x80            resulting UTF-8 octets in hex
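The worked examples above generalize to a tiny two-byte encoder (a hypothetical helper, `utf8_two_byte`, just for illustration; real code should use `str.encode`):

```python
def utf8_two_byte(cp):
    """Encode a code point in 0x80-0x7ff as 110vvvvv 10vvvvvv (hypothetical helper)."""
    assert 0x80 <= cp <= 0x7ff
    return bytes([0xc0 | (cp >> 6),      # high 5 bits -> 110vvvvv
                  0x80 | (cp & 0x3f)])   # low 6 bits  -> 10vvvvvv

print(utf8_two_byte(0x80))  # b'\xc2\x80'
print(utf8_two_byte(0xc0))  # b'\xc3\x80'
```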
So that’s why you’re getting a c2 before 87.
How to avoid all this in Python 3? Use the bytes type.
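For instance, writing through the binary layer reproduces the Python 2 output byte for byte (a sketch; `sys.stdout.buffer` is the raw binary stream underneath the standard text wrapper):

```python
import sys

# bytes(range(128, 256)) is 128 raw bytes, 0x80 through 0xff;
# writing them via the buffer bypasses text encoding entirely
sys.stdout.buffer.write(bytes(range(128, 256)))
```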