Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


error handling - Checking tarfile integrity in Python

I'm working on converting my backup script from shell to Python. One of the features of my old script was to check the created tarfile for integrity by running gzip -t on it.

This seems to be a bit tricky in Python.

It seems that the only way to do this is to read each of the compressed TarInfo objects within the tarfile.

Is there a way to check a tarfile for integrity without extracting it to disk or keeping it in memory in its entirety?

Good people on #python on freenode suggested that I should read each TarInfo object chunk-by-chunk, discarding each chunk read.

I must admit that I have no idea how to do this, seeing that I have only just started with Python.

Imagine that I have a 30 GB tarfile which contains files ranging from 1 KB to 10 GB...

This is the solution that I started writing:

try:
    tardude = tarfile.open("zero.tar.gz")
except:
    print "There was an error opening tarfile. The file might be corrupt or missing."

for member_info in tardude.getmembers():
    try:
        check = tardude.extractfile(member_info.name)
    except:
        print "File: %r is corrupt." % member_info.name

tardude.close()

This code is far from finished. I would not dare run it on a huge 30 GB tar archive, because at some point check would be an object of 10+ GB (if I have such huge files within the archive).

Bonus: I tried manually corrupting zero.tar.gz (hex editor - edit a few bytes midfile). The first except does not catch IOError... Here is the output:

Traceback (most recent call last):
  File "./test.py", line 31, in <module>
    for member_info in tardude.getmembers():
  File "/usr/lib/python2.7/tarfile.py", line 1805, in getmembers
    self._load()        # all members, we first have to
  File "/usr/lib/python2.7/tarfile.py", line 2380, in _load
    tarinfo = self.next()
  File "/usr/lib/python2.7/tarfile.py", line 2315, in next
    self.fileobj.seek(self.offset)
  File "/usr/lib/python2.7/gzip.py", line 429, in seek
    self.read(1024)
  File "/usr/lib/python2.7/gzip.py", line 256, in read
    self._read(readsize)
  File "/usr/lib/python2.7/gzip.py", line 320, in _read
    self._read_eof()
  File "/usr/lib/python2.7/gzip.py", line 342, in _read_eof
    hex(self.crc)))
IOError: CRC check failed 0xe5384b87 != 0xdfe91e1L
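
The CRC failure in this traceback is raised by the gzip layer rather than by tarfile itself, so a check roughly comparable to gzip -t can be sketched by streaming the file through the gzip module alone and discarding what is read. This is a minimal sketch for Python 3, not a definitive implementation; the function name gzip_stream_ok and the CHUNK_SIZE value are hypothetical and not part of the original script.

```python
import gzip
import zlib

CHUNK_SIZE = 1024 * 1024  # hypothetical chunk size; any moderate value works


def gzip_stream_ok(path):
    """Return True if the whole gzip stream at `path` decompresses cleanly."""
    try:
        with gzip.open(path, "rb") as stream:
            # Read and discard chunk by chunk, so only one chunk is held
            # in memory at a time, however large the archive is.
            while stream.read(CHUNK_SIZE):
                pass
    except (OSError, EOFError, zlib.error):
        # OSError covers CRC mismatches (and a missing file),
        # EOFError a truncated stream, zlib.error damaged compressed data.
        return False
    return True
```

Calling gzip_stream_ok("zero.tar.gz") only validates the compression layer; it says nothing about whether the tar headers inside are sane.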

1 Reply


Just a minor improvement on Aya's answer to make things a little more idiomatic (although I'm removing some of the error checking to make the mechanics more visible):

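import tarfile

BLOCK_SIZE = 1024

with tarfile.open("zero.tar.gz") as tardude:
    for member in tardude.getmembers():
        if not member.isfile():
            continue  # extractfile() returns None for directories and similar members
        with tardude.extractfile(member) as target:
            # Read and discard each chunk; only BLOCK_SIZE bytes are held at a time
            for chunk in iter(lambda: target.read(BLOCK_SIZE), b''):
                pass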

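This really just removes the while 1: (sometimes considered a minor code smell) and the if not data: check. Also note that with tarfile.open(...) requires Python 2.7 or later, and that using the extracted file object in a with statement relies on it being an io.BufferedReader, which is what extractfile() returns from Python 3.3 onward.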
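
Since the question's traceback shows the error surfacing while the member list is being loaded rather than in tarfile.open(), the error checking that was stripped out has to wrap the whole walk, not just the open call. Here is a rough sketch of that for Python 3, under the assumption that a plain True/False result is enough; the name tar_is_readable and the larger BLOCK_SIZE are hypothetical choices of mine.

```python
import tarfile
import zlib

BLOCK_SIZE = 1024 * 1024  # hypothetical; larger blocks mean fewer read calls


def tar_is_readable(path):
    """Return True if every regular file in the archive can be read to its end."""
    try:
        with tarfile.open(path) as archive:
            # Iterating the TarFile yields members one at a time.
            for member in archive:
                if not member.isfile():
                    continue  # extractfile() returns None for directories etc.
                with archive.extractfile(member) as target:
                    # Read and discard; nothing is written to disk.
                    while target.read(BLOCK_SIZE):
                        pass
    except (tarfile.TarError, OSError, EOFError, zlib.error):
        # Errors may come from tarfile or from the compression layer beneath it.
        return False
    return True
```

With this, the backup script could call tar_is_readable("zero.tar.gz") once after creating the archive and treat a False result as a failed backup.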

