Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
570 views
in Technique[技术] by (71.8m points)

python - Split one file into multiple files based on pattern (cut can occur within lines)

A lot of solutions exist, but the specificity here is I need to be able to split within a line, the cut should occur just before the pattern. Ex:

Infile:

<?xml 1><blabla1>
<blabla><blabla2><blabla>
<blabla><blabla>
<blabla><blabla3><blabla><blabla>
<blabla><blabla><blabla><?xml 4>
<blabla>
<blabla><blabla><blabla>
<blabla><?xml 2><blabla><blabla>

Should become with pattern <?xml

Outfile1:

<?xml 1><blabla1>
<blabla><blabla2><blabla>
<blabla><blabla>
<blabla><blabla3><blabla><blabla>
<blabla><blabla><blabla>

Outfile2:

<?xml 4>
<blabla>
<blabla><blabla><blabla>
<blabla>

Outfile3:

<?xml 2><blabla><blabla>

Actually the perl script in the validated answer here works fine for my little example. But it generates an error for my bigger (about 6GB) actual files. The error is:

panic: sv_setpvn called with negative strlen at /home/.../split.pl line 7, <> chunk 1.

I don't have the permissions to comment, that's why I started a new post. And finally, a Python solution would be even more appreciated, as I understand it better.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

This performs the split without reading everything into RAM:

def files():
    n = 0
    while True:
        n += 1
        yield open('/output/dir/%d.part' % n, 'w')


pat = '<?xml'
fs = files()
outfile = next(fs) 

with open(filename) as infile:
    for line in infile:
        if pat not in line:
            outfile.write(line)
        else:
            items = line.split(pat)
            outfile.write(items[0])
            for item in items[1:]:
                outfile = next(fs)
                outfile.write(pat + item)

A word of warning: this doesn't work if your pattern spreads across multiple lines (that is, contains " "). Consider the mmap solution if this is the case.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...