Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Reading-in a binary JPEG-Header (in Python)

I would like to read in a JPEG header and analyze it. According to Wikipedia, the header consists of a sequence of markers. Each marker starts with FF xx, where xx is a specific marker ID.

So my idea was to simply read the image in binary mode and search for the corresponding byte combinations in the binary stream. This should enable me to split the header into the corresponding marker fields.

For instance, this is, what I receive, when I read in the first 20 bytes of an image:

binary_data = open('picture.jpg','rb').read(20)
print(binary_data)

b'\xff\xd8\xff\xe1-\xfcExif\x00\x00MM\x00*\x00\x00\x00\x08'

My questions are now:

1) Why does Python not return nice chunks of 2 bytes (in hex format)? I would expect something like this: b'\xff \xd8 \xff \xe1 \x-' ... and so on. Some blocks delimited by '\x' are much longer than 2 bytes.

2) Why are there symbols like -, M, * in the returned string? Those are not characters I would expect in the hex representation of a byte string (only 0-9 and a-f, I think).

Both observations keep me from writing a simple parser. So ultimately my question boils down to: how do I properly read in and parse a JPEG header in Python?


1 Reply


You seem overly worried about how your binary data is represented on your console. Don't worry about that.

The default built-in string representation that print(..) applies to a bytes object is just "printable ASCII characters as such (with a few exceptions), all others as an escaped hex sequence \xNN". The exceptions are semi-special characters such as \, ", and ', which could mess up the string representation. But this alternative representation does not change the values in any way!

>>> a = bytes([1,2,4,92,34,39])
>>> a
b'\x01\x02\x04\\"\''
>>> a[0]
1

See how the entire object is printed 'as if' it's a string, but its individual elements are still perfectly normal bytes?
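
To tie this back to the 20 bytes in the question, here is a small sketch. The variable names are my own, and the literal is the question's output with its backslash escapes written out. It only illustrates that characters like - and M are ordinary byte values shown as ASCII; nothing in it is specific to JPEG.

```python
# The 20 bytes from the question, written as a bytes literal.
data = b'\xff\xd8\xff\xe1-\xfcExif\x00\x00MM\x00*\x00\x00\x00\x08'

# Indexing or listing a bytes object yields plain integers,
# regardless of how the repr chooses to display them.
first_six = list(data[:6])

# '-' is simply the byte 0x2d displayed as its ASCII character.
dash_value = data[4]

# The same 20 bytes rebuilt from pure hex text compare equal.
same = data == bytes.fromhex('ffd8ffe12dfc457869660000' '4d4d002a00000008')

print(len(data), first_six, hex(dash_value), same)
```

So the "-\xfc" right after the ff e1 marker is just the two bytes 2d fc, and the long-looking "blocks" are runs of bytes that happen to be printable.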

If you have a byte array and you don't like the appearance of this default, then you can write your own. But – for clarity – this still doesn't have anything to do with parsing a file.

>>> binary_data = open('iaijiedc.jpg','rb').read(20)
>>> binary_data
b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x02\x01\x00H\x00H\x00\x00'
>>> ''.join(['%02x%02x ' % (binary_data[2*i],binary_data[2*i+1]) for i in range(len(binary_data)>>1)])
'ffd8 ffe0 0010 4a46 4946 0001 0201 0048 0048 0000 '
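
As an alternative to the hand-written join, and assuming Python 3.8 or later, the built-in bytes.hex() accepts a separator and a group size. The sketch below uses the same 20 bytes as a literal so it runs without the file; the negative group size asks for grouping counted from the left.

```python
# Same 20 bytes as in the session above, as a literal instead of a file read.
binary_data = b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x02\x01\x00H\x00H\x00\x00'

# Separator ' ' every 2 bytes, counted from the left (negative group size).
grouped = binary_data.hex(' ', -2)
print(grouped)
```

Again, this only changes how the data is displayed, not the data itself.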

Why does python not return me nice chunks of 2 bytes (in hex-format)?

Because you don't ask it to. You are asking for a sequence of bytes, and that's what you get. If you want chunks of two bytes, transform the data after reading it.

The code above only prints the data; to create a new list that contains 2-byte words, loop over it and convert each pair of bytes, or use struct.unpack (there are actually several ways):

>>> from struct import unpack
>>> wd = [unpack('>H', binary_data[x:x+2])[0] for x in range(0,len(binary_data),2)]
>>> wd
[65496, 65504, 16, 19014, 18758, 1, 513, 72, 72, 0]
>>> [hex(x) for x in wd]
['0xffd8', '0xffe0', '0x10', '0x4a46', '0x4946', '0x1', '0x201', '0x48', '0x48', '0x0']

I'm using the big-endian specifier > and unsigned short H in unpack, because these are the conventional ways to represent JPEG 2-byte codes (JPEG stores its markers and segment lengths in big-endian order). Check the struct documentation if you want to deviate from this.
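
For the actual parsing question, fixed 2-byte chunks are not quite the right unit: after the initial ff d8, each header segment is a marker (ff xx) followed by a 2-byte big-endian length that counts itself plus the payload. A minimal sketch of walking those segments follows. The helper name iter_segments and the sample bytes are my own, it assumes every marker before Start of Scan carries a length field, and it is not a complete JPEG parser.

```python
import struct

def iter_segments(data):
    """Yield (marker, payload) for each header segment up to Start of Scan."""
    if data[:2] != b'\xff\xd8':
        raise ValueError('missing SOI marker, probably not a JPEG')
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError('expected a marker at offset %d' % pos)
        marker = data[pos + 1]
        if marker == 0xFF:
            # An extra ff before the marker is a fill byte; skip it.
            pos += 1
            continue
        # Assumption: the segment has a length field. The length is
        # big-endian and includes its own two bytes, not the marker.
        (length,) = struct.unpack('>H', data[pos + 2:pos + 4])
        yield marker, data[pos + 4:pos + 2 + length]
        if marker == 0xDA:
            # Start of Scan: compressed image data follows, stop here.
            return
        pos += 2 + length

# Hypothetical sample: SOI, a JFIF APP0 segment, and an empty SOS segment.
sample = (b'\xff\xd8'
          b'\xff\xe0\x00\x10JFIF\x00\x01\x02\x01\x00H\x00H\x00\x00'
          b'\xff\xda\x00\x02')
segments = list(iter_segments(sample))
for marker, payload in segments:
    print('ff%02x' % marker, len(payload), payload[:5])
```

For a real file you would pass it open('picture.jpg', 'rb').read(); a truncated read such as read(20) will cut segments short.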

