Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
952 views
in Technique[技术] by (71.8m points)

language agnostic - Extract zlib compressed data from binary file in python

My company uses a legacy file format for Electromiography data, which is no longer in production. However, there is some interest in maintaining retro-compatibility, so I am studying the possibility to write a reader for that file format.

By analyzing a very convoluted former source code written in Delphi, the file reader/writer uses ZLIB, and inside a HexEditor it looks like there is a file header in binary ASCII (with fields like "Player", "Analyzer" readily readable), followed by a compressed string containing raw data.

My doubt is: how should I proceed in order to identify:

  • If it is a compressed stream;
  • Where does the compressed stream start and where does it end;

From Wikipedia:

zlib compressed data is typically written with a gzip or a zlib wrapper. The wrapper encapsulates the raw DEFLATE data by adding a header and trailer. This provides stream identification and error detection

Is this relevant?

I'll be glad to post more information, but I don't know what would be most relevant.

Thanks for any hint.

EDIT: I have the working application, and can use it to record actual data of any time length, getting files even smaller than 1kB if necessary.


Some sample files:

A freshly created one, without datastream: https://dl.dropbox.com/u/4849855/Mio_File/HeltonEmpty.mio

The same as above after a very short (1 second?) datastream has been saved: https://dl.dropbox.com/u/4849855/Mio_File/HeltonFilled.mio

A different one, from a patient named "manco" instead of "Helton", with an even shorter stream (ideal for Hex viewing): https://dl.dropbox.com/u/4849855/Mio_File/manco_short.mio

Instructions: each file should be the file of a patient (a person). Inside these files, one or more exams are saved, each exam consisting of one or more time series. The provided files contain only one exam, with one data series.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

To start, why not scan the files for all valid zip streams (it's good enough for small files and to figure out the format):

import zlib
from glob import glob

def zipstreams(filename):
    """Return all zip streams and their positions in file."""
    with open(filename, 'rb') as fh:
        data = fh.read()
    i = 0
    while i < len(data):
        try:
            zo = zlib.decompressobj()
            yield i, zo.decompress(data[i:])
            i += len(data[i:]) - len(zo.unused_data)
        except zlib.error:
            i += 1

for filename in glob('*.mio'):
    print(filename)
    for i, data in zipstreams(filename):
        print (i, len(data))

Looks like the data streams contain little-endian double precision floating point data:

import numpy
from matplotlib import pyplot

for filename in glob('*.mio'):
    for i, data in zipstreams(filename):
        if data:
            a = numpy.fromstring(data, '<f8')
            pyplot.plot(a[1:])
            pyplot.title(filename + ' - %i' % i)
            pyplot.show()

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

1.4m articles

1.4m replys

5 comments

57.0k users

...