try:
    # Python 2: string is a byte str; decode raises UnicodeError on invalid UTF-8
    string.decode('utf-8')
    print "string is UTF-8, length %d bytes" % len(string)
except UnicodeError:
    print "string is not UTF-8"

In Python 2, str is a sequence of bytes and unicode is a sequence of characters. You use str.decode to decode a byte sequence to unicode, and unicode.encode to encode a sequence of characters to str. For example, u"é" is the unicode string containing the single character U+00E9 and can also be written u"\xe9"; encoding it into UTF-8 gives the byte sequence "\xc3\xa9".

In Python 3, this has changed: bytes is a sequence of bytes and str is a sequence of characters.
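
For Python 3, the same check might look roughly like the sketch below. The helper name is_utf8 is made up for illustration, and it assumes the input is already a bytes object; a str in Python 3 is already decoded text, so it has no decode method to test.

```python
def is_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# str -> bytes with encode, bytes -> str with decode
encoded = "\u00e9".encode("utf-8")   # the single character U+00E9
decoded = encoded.decode("utf-8")

print(encoded)            # b'\xc3\xa9'
print(is_utf8(encoded))   # True
print(is_utf8(b"\xe9"))   # False: a lone 0xE9 byte is not valid UTF-8
```

Note that a successful decode only shows the bytes are valid UTF-8, not that they were meant as UTF-8; plain ASCII, for instance, always passes.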