Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
651 views
in Technique[技术] by (71.8m points)

string - What is a Python bytestring?

What's a Python bytestring?

All I can find are topics on how to encode to bytestring or decode to ascii or utf-8. I'm trying to understand how it works under the hood. In a normal ASCII string, it's an array or list of characters, and each character represents an ASCII value from 0-255, so that's how you know what character is represented by the number. In Unicode, it's the 8- or 16-byte representation for the character that tells you what character it is.

So what is a bytestring? How does Python know which characters to represent as what? How does it work under the hood? Since you can print or even return these strings and it shows you the string representation, I don't quite get it...

Ok, so my point is definitely getting missed here. I've been told that it's an immutable sequence of bytes without any particular interpretation.

A sequence of bytes.. Okay, let's say one byte:
'a'.encode() returns b'a'.

Simple enough. Why can I read the a?

Say I get the ASCII value for a, by doing this:
printf "%d" "'a"

It returns 97. Okay, good, the integer value for the ASCII character a. If we interpret 97 as ASCII, say in a C char, then we get the letter a. Fair enough. If we convert the byte representation to bits, we get this:

01100001

2^0 + 2^5 + 2^6 = 97. Cool.

So why is 'a'.encode() returning b'a' instead of 01100001??
If it's without a particular interpretation, shouldn't it be returning something like b'01100001'?
It seems like it's interpreting it like ASCII.

Someone mentioned that it's calling __repr__ on the bytestring, so it's displayed in human-readable form. However, even if I do something like:

with open('testbytestring.txt', 'wb') as f:
    f.write(b'helloworld')

It will still insert helloworld as a regular string into the file, not as a sequence of bytes... So is a bytestring in ASCII?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

It is a common misconception that text is ascii or utf8 or cp1252, and therefore bytes are text.

Text is only text, in the way that images are only images. The matter of storing text or images to disk is a matter of encoding that data into a sequence of bytes. There are many ways to encode images into bytes: Jpeg, png, svg, and likewise many ways to encode text, ascii, utf8 or cp1252.

Once encoding has happened, bytes are just bytes. Bytes are not images anymore, they have forgotten the colors they mean; although an image format decoder can recover that information. Bytes have similarly forgotten the letters they used to be. In fact, bytes don't remember wether they were images or text at all. Only out of band knowledge (filename, media headers, etcetera) can guess what those bytes should mean, and even that can be wrong (in case of data corruption)

so, in python (py3), we have two types for things that might otherwise look similar; For text, we have str, which knows it's text; it knows which letters it's supposed to mean. It doesn't know which bytes that might be, since letters are not bytes. We also have bytestring, which doesn't know if it's text or images or any other kind of data.

The two types are superficially similar, since they are both sequences of things, but the things that they are sequences of is quite different.

Implementationally, str is stored in memory as UCS-? where the ? is implementation defined, it may be UCS4, UCS2 or UCS1, depending on compile time options and which codepoints are present in the represented string.


edit "but why"?

Some things that look like text are actually defined in other terms. A really good example of this are the many internet protocols of the world. For instance, HTTP is a "text" protocol that is in fact defined using the ABNF syntax common in RFC's. These protocols are expressed in terms of octets, not characters, although an informal encoding may also be suggested:

2.3. Terminal Values

Rules resolve into a string of terminal values, sometimes called
characters. In ABNF, a character is merely a non-negative integer.
In certain contexts, a specific mapping (encoding) of values into a
character set (such as ASCII) will be specified.

This distinction is important, because it's not possible to send text over the internet, the only thing you can do is send bytes. saying "text but in 'foo' encoding" makes the format that much more complex, since clients and servers need to now somehow figure out the encoding business on their own, hopefully in the same way, since they must ultimately pass data around as bytes anyway. This is doubly useless since these protocols are seldom about text handling anyway, and is only a convenience for implementers. Neither the server owners nor end users are ever interested in reading the words Transfer-Encoding: chunked, so long as both the server and the browser understand it correctly.

By comparison, when working with text, you don't really care how it's encoded. You can express the "He?vy M?tal ümlaüts" any way you like, except "Heδvy Mλtal ?mla?ts"


the distinct types thus give you a way to say "this value 'means' text" or "bytes".


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...