Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

tableofcontents - Navigating a large tar.gz file in python without extracting it first

I have seen this question but I need something else.

My file contains a very large number of text files (hundreds of thousands) organized by variable name. Something like

filename/maxvalue/IDXstation.txt     (with X that goes from 100000 to 200000)
filename/minvalue/IDXstation.txt  
filename/meanvalue/IDXstation.txt 

and so on. The problem is that I don't have a readme.txt file that tells me how many folders are in the tar file, how they are named (I made up the names above), or how many stations are in each folder. For now, all I care about is reading the structure of filename.tar.gz and printing something like

filename/maxvalue/  
filename/minvalue/  
filename/meanvalue/

I need to read the structure of it before I start extracting the file, because I am interested only in some folders and not all of them.

If I use

import tarfile
tar = tarfile.open('filename.tar.gz')
for tarinfo in tar:
    print(tarinfo.name)

it prints every file, and there are hundreds of thousands of them, which I don't want. I am not sure how to set it up so that only the folder names are printed.


1 Reply


The Wikipedia page on tar says that, to list the names of the files in an archive, one must read through the entire archive and look for the places where files start; a tar file has no central index. So you will have to decompress and scan the whole data stream to get the file names, although nothing needs to be written to disk. One simple way to print only the expected names is to use a regex that keeps only the relevant directory names. If you are sure that the directories themselves are registered as entries in the tar file, something like this should be enough:

import re

# Match names exactly two levels deep, e.g. filename/maxvalue or filename/maxvalue/
rx = re.compile(r'[^/]+/[^/]+/?$')
...
for tarinfo in tar:
    if rx.match(tarinfo.name):
        print(tarinfo.name)
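
A regex on the name alone cannot tell a second-level directory from a second-level file. As a minimal, self-contained sketch of the same idea, the function below asks `tarfile` whether each entry is a directory via `TarInfo.isdir()` and counts path components. The name `registered_dirs` and the `depth` parameter are made up for illustration, and the sketch assumes that directories are stored as entries of their own and that member names do not start with `./`.

```python
import tarfile


def registered_dirs(archive_path, depth=2):
    """Return directory entries stored exactly `depth` levels deep.

    Hypothetical helper: it only finds directories that were written to
    the archive as entries of their own.
    """
    found = []
    # "r:gz" opens a gzip-compressed tar for reading; iterating walks the
    # member headers and writes nothing to disk.
    with tarfile.open(archive_path, "r:gz") as tar:
        for tarinfo in tar:
            # Normalise the name in case a trailing slash is present.
            name = tarinfo.name.rstrip("/")
            if tarinfo.isdir() and len(name.split("/")) == depth:
                found.append(name + "/")
    return found
```

Calling `registered_dirs('filename.tar.gz')` would return the two-level folder names in the order they appear in the archive. The whole archive is still read once, as explained above.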

If you are not sure that the expected directories are registered in the tar file, you can use a less strict match and put the directory part in a set. Something like:

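import re

# Capture the first two path components of any deeper entry
rx = re.compile(r'([^/]+/[^/]+)/')
...
names = set()
for tarinfo in tar:
    m = rx.match(tarinfo.name)
    if m:
        # Keep only the directory part, not the full file name
        names.add(m.group(1))
for name in sorted(names):
    print(name)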
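
Once the folder names are known, the same kind of iteration can pick out only the members under the folders of interest. The following is a rough sketch rather than part of the original answer; `members_under` and `wanted_dirs` are hypothetical names chosen for illustration.

```python
import tarfile


def members_under(tar, wanted_dirs):
    """Yield members of an open TarFile located under any of `wanted_dirs`.

    Hypothetical helper: `wanted_dirs` is a list of folder names such as
    ["filename/maxvalue"], with or without a trailing slash.
    """
    # Force a single trailing slash so "filename/maxvalue" does not also
    # match a sibling such as "filename/maxvalue2".
    prefixes = tuple(d.rstrip("/") + "/" for d in wanted_dirs)
    for tarinfo in tar:
        # str.startswith accepts a tuple of candidate prefixes.
        if tarinfo.name.startswith(prefixes):
            yield tarinfo
```

The generator can be handed to `TarFile.extractall` through its `members` argument, for example `tar.extractall('out', members=members_under(tar, ['filename/maxvalue']))`, so that only those files are written to disk. The output directory `out` is just a placeholder, and the archive is still read from start to end.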
