memory error when splitting big file into smaller files in python

posted Jan 31, 2022 in Technique by 深蓝 (71.8m points)

I have read several posts, including this one, but none helped.

Here is the Python code I currently use to split the file.

My input file is 15 GB and I am splitting it into 128 MB pieces. My computer has 8 GB of memory.

import sys

def read_line(f_object,terminal_byte):
     line = ''.join(iter(lambda:f_object.read(1),terminal_byte))
     line+="\x01"
     return line

def read_lines(f_object,terminal_byte):
    tmp = read_line(f_object,terminal_byte)
    while tmp:
        yield tmp
        tmp = read_line(f_object,terminal_byte)

def make_chunks(f_object,terminal_byte,max_size):
    current_chunk = []
    current_chunk_size = 0
    for line in read_lines(f_object,terminal_byte):
        current_chunk.append(line)
        current_chunk_size += len(line)
        if current_chunk_size > max_size:
            yield "".join(current_chunk)
            current_chunk = []
            current_chunk_size = 0
    if current_chunk:
        yield ''.join(current_chunk)

inputfile=sys.argv[1]

with open(inputfile,"rb") as f_in:
    for i,chunk in enumerate(make_chunks(f_in, bytes(chr(1)),1024*1000*128)):
        with open("out%d.txt"%i,"wb") as f_out:
            f_out.write(chunk)

When I execute the script, I get the following error:

Traceback (most recent call last):
  File "splitter.py", line 30, in <module>
    for i,chunk in enumerate(make_chunks(f_in, bytes(chr(1)),1024*1000*128)):
  File "splitter.py", line 17, in make_chunks
    for line in read_lines(f_object,terminal_byte):
  File "splitter.py", line 12, in read_lines
    tmp = read_line(f_object,terminal_byte)
  File "splitter.py", line 4, in read_line
    line = ''.join(iter(lambda:f_object.read(1),terminal_byte))
MemoryError

1 Reply

深蓝 · Answer 1 · 2022-01-31T07:13:24+0000

Question: splitting big file into smaller files

Instead of searching for every single \x01, search for it only in the last chunk of each output file.
Then either reset the file pointer to offset+1 of the last \x01 found and continue reading from there, or write up to that offset into the current chunk file and the remaining part of the chunk into the next chunk file.
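
Both options need the position of the last \x01 in a chunk. A minimal sketch of that step, assuming a single-byte delimiter; split_at_last is a hypothetical helper name that does not appear in the original code, and the search itself is done with the built-in bytes.rfind:

```python
def split_at_last(chunk, delimiter=b"\x01"):
    """Split chunk directly after its last delimiter.

    Returns (head, tail): head ends with the delimiter and belongs to the
    current output file, tail belongs to the next one. If the chunk holds
    no delimiter, no cut is possible and everything stays in head.
    """
    idx = chunk.rfind(delimiter)
    if idx < 0:
        return chunk, b""
    cut = idx + len(delimiter)
    return chunk[:cut], chunk[cut:]
```

The length of tail is the number of bytes to seek back by (first option) or the data to carry into the next file (second option).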

Note: Your chunk_size should be io.DEFAULT_BUFFER_SIZE or a multiple of it.
You gain no speedup by raising chunk_size too high.
See the related SO Q&A: Default buffer size for a file
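
One way to follow that note is to round a desired read size down to a multiple of io.DEFAULT_BUFFER_SIZE. This is only a sketch; pick_chunk_size is a hypothetical helper name, and the actual value of io.DEFAULT_BUFFER_SIZE depends on the Python build:

```python
import io

def pick_chunk_size(target_bytes, base=io.DEFAULT_BUFFER_SIZE):
    """Round target_bytes down to a multiple of base, but never below base."""
    multiples = max(1, target_bytes // base)
    return multiples * base

if __name__ == '__main__':
    # Prints the platform default and a read size close to 1 MiB
    print(io.DEFAULT_BUFFER_SIZE, pick_chunk_size(1024 * 1024))
```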

My example uses the first approach, resetting the file pointer:

import io

large_data = b"""Lorem ipsum\x01dolor sit\x01sadipscing elitr, sed\x01labore et\x01dolores et ea rebum.\x01magna aliquyam erat,\x01"""

def split(chunk_size, split_size):
    with io.BytesIO(large_data) as fh_in:
        _size = 0
        # Used to verify chunked writes
        result_data = io.BytesIO()

        while True:
            chunk = fh_in.read(chunk_size)
            print('read({})'.format(bytearray(chunk)))
            if not chunk: break

            _size += chunk_size
            if _size >= split_size:
                _size = 0
                # Split on last 0x01
                l = len(chunk)
                print('\tsplit_on_last_\\x01({})\t{}'.format(l, bytearray(chunk)))

                # Reverse iterate
                for p in range(l-1, -1, -1):
                    c = chunk[p:p+1]
                    if ord(c) == ord('\x01'):
                        offset = l-(p+1)

                        # Condition if \x01 is the last byte in chunk
                        if offset == 0:
                            print('\toffset={} write({})\t\t{}'.format(offset, l - offset, bytearray(chunk)))
                            result_data.write(chunk)
                        else:
                            # Reset file pointer
                            fh_in.seek(fh_in.tell()-offset)
                            print('\toffset={} write({})\t\t{}'.format(offset, l-offset, bytearray(chunk[:-offset])))
                            result_data.write(chunk[:-offset])
                        break
                else:
                    # No \x01 in this chunk: keep the data, retry on the next chunk
                    _size = split_size
                    result_data.write(chunk)
            else:
                print('\twrite({}) {}'.format(chunk_size, bytearray(chunk)))
                result_data.write(chunk)

        print('INPUT :{}\nOUTPUT:{}'.format(large_data, result_data.getvalue()))

if __name__ == '__main__':
    split(chunk_size=30, split_size=60)

Output:

read(bytearray(b'Lorem ipsum\x01dolor sit\x01sadipsci'))
    write(30) bytearray(b'Lorem ipsum\x01dolor sit\x01sadipsci')
read(bytearray(b'ng elitr, sed\x01labore et\x01dolore'))
    split_on_last_\x01(30)  bytearray(b'ng elitr, sed\x01labore et\x01dolore')
    offset=6 write(24)      bytearray(b'ng elitr, sed\x01labore et\x01')
read(bytearray(b'dolores et ea rebum.\x01magna ali'))
    write(30) bytearray(b'dolores et ea rebum.\x01magna ali')
read(bytearray(b'quyam erat,\x01'))
    split_on_last_\x01(12)  bytearray(b'quyam erat,\x01')
    offset=0 write(12)      bytearray(b'quyam erat,\x01')
read(bytearray(b''))
INPUT :b'Lorem ipsum\x01dolor sit\x01sadipscing elitr, sed\x01labore et\x01dolores et ea rebum.\x01magna aliquyam erat,\x01'
OUTPUT:b'Lorem ipsum\x01dolor sit\x01sadipscing elitr, sed\x01labore et\x01dolores et ea rebum.\x01magna aliquyam erat,\x01'

Tested with Python: 3.4.2
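
For the second option mentioned above (writing the remainder of a chunk into the next chunk file), a rough sketch that works on real files could look like the following. It is an illustration under assumptions, not the tested example from this answer: split_file and the "out{}.txt" naming pattern are hypothetical, the delimiter is assumed to be a single byte, and the last delimiter is located with bytes.rfind rather than a reverse loop.

```python
import io

def split_file(src_path, out_pattern, split_size,
               chunk_size=io.DEFAULT_BUFFER_SIZE, delimiter=b"\x01"):
    """Split src_path into files of roughly split_size bytes each.

    Cuts are made only directly after a delimiter byte. out_pattern is a
    format string such as "out{}.txt". Returns the list of paths written.
    """
    paths = []
    out = None    # currently open output file, if any
    written = 0   # bytes written to the current output file

    def next_output():
        path = out_pattern.format(len(paths))
        paths.append(path)
        return open(path, "wb")

    with open(src_path, "rb") as fh_in:
        try:
            while True:
                chunk = fh_in.read(chunk_size)
                if not chunk:
                    break
                if out is None:
                    out = next_output()
                    written = 0

                # Search for the delimiter only once the size limit is reached
                cut = 0
                if written + len(chunk) >= split_size:
                    idx = chunk.rfind(delimiter)
                    if idx >= 0:
                        cut = idx + len(delimiter)

                if cut:
                    out.write(chunk[:cut])   # up to and including the delimiter
                    out.close()
                    out = None
                    rest = chunk[cut:]       # carried over into the next file
                    if rest:
                        out = next_output()
                        out.write(rest)
                        written = len(rest)
                else:
                    # Below the limit, or no delimiter in this chunk yet
                    out.write(chunk)
                    written += len(chunk)
        finally:
            if out is not None:
                out.close()
    return paths
```

Only one chunk of chunk_size bytes is held in memory at a time, so the size of the input file should not matter. For the 128 MB pieces from the question, a call might look like split_file(sys.argv[1], "out{}.txt", split_size=128 * 1024 * 1024).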
