I used file.read() to read the data in chunks; in the examples below the chunk sizes are 100 MB, 500 MB, 1 GB and 2 GB respectively. The size of my text file is 2.1 GB.
Code:
from functools import partial

def read_in_chunks(size_in_bytes):
    s = 'Lets say i have a text file of 1000 GB'
    with open('data.txt', 'r+b') as f:
        prev = ''
        count = 0
        f_read = partial(f.read, size_in_bytes)
        for text in iter(f_read, ''):
            if not text.endswith('\n'):
                # the chunk ends with a partial line, so don't include it
                # in the substring count yet.
                # (assumes each chunk contains at least one newline)
                text, rest = text.rsplit('\n', 1)
                # prepend the previous partial line, if any.
                text = prev + text
                prev = rest
            else:
                # the chunk ends with a '\n', so simply prepend the
                # previous partial line.
                text = prev + text
                prev = ''
            count += text.count(s)
    count += prev.count(s)
    print count
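Not part of the original timings: here is the same idea ported to Python 3 as a sketch. It works on bytes throughout, and rpartition also covers the edge case of a chunk that contains no newline at all:

from functools import partial

def read_in_chunks_py3(size_in_bytes, path='data.txt'):
    s = b'Lets say i have a text file of 1000 GB'
    count = 0
    prev = b''
    with open(path, 'rb') as f:
        for chunk in iter(partial(f.read, size_in_bytes), b''):
            chunk = prev + chunk
            # hold back everything after the last newline; it may be a
            # partial line that continues in the next chunk.
            chunk, _, prev = chunk.rpartition(b'\n')
            count += chunk.count(s)
    count += prev.count(s)
    print(count)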
Timings:
read_in_chunks(104857600)  # 100 MB
$ time python so.py
10000000
real 0m1.649s
user 0m0.977s
sys 0m0.669s
read_in_chunks(524288000)  # 500 MB
$ time python so.py
10000000
real 0m1.558s
user 0m0.893s
sys 0m0.646s
read_in_chunks(1073741824)  # 1 GB
$ time python so.py
10000000
real 0m1.242s
user 0m0.689s
sys 0m0.549s
read_in_chunks(2147483648)  # 2 GB
$ time python so.py
10000000
real 0m0.844s
user 0m0.415s
sys 0m0.408s
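As an aside (my addition, not in the original answer), the four runs above can also be reproduced from a single script by timing each call in-process instead of with the shell's time:

import time

# time each chunk size in-process; the sizes match the four runs above
for size in (104857600, 524288000, 1073741824, 2147483648):
    start = time.time()
    read_in_chunks(size)  # prints the count, as defined above
    print 'chunk size %d: %.3fs' % (size, time.time() - start)

Note that after the first pass the file sits in the OS page cache, so the later sizes will look faster than cold runs timed separately.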
On the other hand, the simple line-by-line loop takes around 6 seconds on my system:
def simple_loop():
    s = 'Lets say i have a text file of 1000 GB'
    with open('data.txt') as f:
        print sum(line.count(s) for line in f)
$ time python so.py
10000000
real 0m5.993s
user 0m5.679s
sys 0m0.313s
Results of @SlaterTyranus's grep version on my file:
$ time grep -o 'Lets say i have a text file of 1000 GB' data.txt|wc -l
10000000
real 0m11.975s
user 0m11.779s
sys 0m0.568s
Results of @woot's solution:
$ time cat data.txt | parallel --block 10M --pipe grep -o 'Lets say i have a text file of 1000 GB' | wc -l
10000000
real 0m5.955s
user 0m14.825s
sys 0m5.766s
The best timing came when I used a 100 MB block size:
$ time cat data.txt | parallel --block 100M --pipe grep -o 'Lets say i have a text file of 1000 GB' | wc -l
10000000
real 0m4.632s
user 0m13.466s
sys 0m3.290s
Results of @woot's second solution:
$ time python woot_thread.py  # CHUNK_SIZE = 1073741824 (1 GB)
10000000
real 0m1.006s
user 0m0.509s
sys 0m2.171s
$ time python woot_thread.py  # CHUNK_SIZE = 2147483648 (2 GB)
10000000
real 0m1.009s
user 0m0.495s
sys 0m2.144s
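woot_thread.py itself is not pasted here, so the following is only a rough sketch (Python 3, all names mine, not @woot's) of the general shape such a solution can take: one thread keeps issuing large reads while the main thread counts, which would fit the low user time and high sys time above.

import threading
from queue import Queue

CHUNK_SIZE = 1073741824  # 1 GB, matching the first run above
NEEDLE = b'Lets say i have a text file of 1000 GB'

def reader(f, q):
    # producer: keep the disk busy with large sequential reads
    while True:
        chunk = f.read(CHUNK_SIZE)
        q.put(chunk)
        if not chunk:  # b'' signals EOF to the consumer
            return

def count_file(path='data.txt'):
    q = Queue(maxsize=2)  # bound memory to a couple of chunks
    count = 0
    prev = b''
    with open(path, 'rb') as f:
        t = threading.Thread(target=reader, args=(f, q))
        t.start()
        while True:
            chunk = q.get()
            if not chunk:
                break
            chunk = prev + chunk
            # hold back a possible partial final line for the next chunk
            chunk, _, prev = chunk.rpartition(b'\n')
            count += chunk.count(NEEDLE)
        t.join()
    count += prev.count(NEEDLE)
    print(count)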
System Specs: Core i5-4670, 7200 RPM HDD