
python - Why is reading multiple files at the same time slower than reading sequentially?

I am trying to parse many files found in a directory, but using multiprocessing slows my program down.

# Calling my parsing function from the client.
L = getParsedFiles('/home/tony/Lab/slicedFiles')  # <-- 1000 .txt files found here,
                                                  #     ~100 MB combined

Following this example from the Python documentation:

from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    p = Pool(5)
    print(p.map(f, [1, 2, 3]))

I've written this piece of code:

from multiprocessing import Pool
from api.ttypes import *

import gc
import os

def _parse(pathToFile):
    myList = []
    with open(pathToFile) as f:
        for line in f:
            s = line.split()
            x, y = [int(v) for v in s]
            obj = CoresetPoint(x, y)
            gc.disable()
            myList.append(obj)
            gc.enable()
    return Points(myList)

def getParsedFiles(pathToFile):
    myList = []
    p = Pool(2)
    for filename in os.listdir(pathToFile):
        if filename.endswith(".txt"):
            myList.append(os.path.join(pathToFile, filename))
    return p.map(_parse, myList)

I followed the example: I put the names of all the files ending in .txt into a list, created a Pool, and mapped it over my function. The idea is to return a list of objects, each holding the parsed data of one file. To my surprise, however, I got the following results:

#Pool 32  ---> ~162(s)
#Pool 16 ---> ~150(s)
#Pool 12 ---> ~142(s)
#Pool 2 ---> ~130(s)

Graph: [image of runtime vs. pool size omitted]

Machine specification:

62.8 GiB RAM
Intel® Core™ i7-6850K CPU @ 3.60GHz × 12

What am I missing here?
Thanks in advance!



1 Reply


Looks like you're I/O bound:

In computer science, I/O bound refers to a condition in which the time it takes to complete a computation is determined principally by the period spent waiting for input/output operations to be completed. This is the opposite of a task being CPU bound. This circumstance arises when the rate at which data is requested is slower than the rate it is consumed or, in other words, more time is spent requesting data than processing it.

You probably need to have your main thread do the reading and hand the data to the pool as subprocesses become available. This is different from using map.
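For example, a rough sketch of that idea could look like the following (assuming the same CoresetPoint and Points types from the question; the helper names here are just placeholders): the main process does all the disk reads and only hands already-read lines to the workers for parsing.

from multiprocessing import Pool
from api.ttypes import *  # CoresetPoint, Points, as in the question

import os

def _parse_lines(lines):
    # Runs in a worker process: pure CPU work, no disk access.
    points = []
    for line in lines:
        x, y = [int(v) for v in line.split()]
        points.append(CoresetPoint(x, y))
    return Points(points)

def _read_files(directory):
    # Runs in the main process: all disk I/O happens here, one file at a time.
    for filename in os.listdir(directory):
        if filename.endswith(".txt"):
            with open(os.path.join(directory, filename)) as f:
                yield f.readlines()

if __name__ == '__main__':
    with Pool(2) as pool:
        # imap pulls from the generator lazily instead of reading every
        # file up front, and the workers only do the parsing.
        results = list(pool.imap(_parse_lines,
                                 _read_files('/home/tony/Lab/slicedFiles')))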

Alternatively, since you are processing a line at a time and the input is already split across files, you can use fileinput to iterate over the lines of multiple files, and map a function that processes lines instead of files.

Passing one line at a time to the workers might be too slow, so we batch the lines into chunks and can adjust the chunk size until we find a sweet spot. Our function parses chunks of lines:

def _parse_coreset_points(lines):
    # Parse a whole chunk of lines into a single Points object.
    return Points([_parse_coreset_point(line) for line in lines])

def _parse_coreset_point(line):
    # Each line holds two whitespace-separated integers: x and y.
    s = line.split()
    x, y = [int(v) for v in s]
    return CoresetPoint(x, y)

And our main function:

import fileinput
import os
from itertools import islice
from multiprocessing import Pool

def _chunks(iterable, size):
    # Batch the line stream into lists of `size` lines, one list per task.
    it = iter(iterable)
    return iter(lambda: list(islice(it, size)), [])

def getParsedFiles(directory):
    pool = Pool(2)
    txts = [os.path.join(directory, filename)
            for filename in os.listdir(directory)
            if filename.endswith(".txt")]
    return pool.imap(_parse_coreset_points, _chunks(fileinput.input(txts), 100))
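A hypothetical caller could look like this; imap returns a lazy iterator, so the results still have to be consumed, and a chunk size of 100 is only a starting point to tune:

if __name__ == '__main__':
    # Each item yielded is one Points object built from a chunk of up to 100 lines.
    for points in getParsedFiles('/home/tony/Lab/slicedFiles'):
        print(points)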
