Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - multiprocessing.Pool with a global variable

I am using the Pool class from Python's multiprocessing library to write a program that will run on an HPC cluster.

Here is an abstraction of what I am trying to do:

from multiprocessing import Pool

def myFunction(x):
    # myObject is a global variable in this case
    return myFunction2(x, myObject)

def myFunction2(x, myObject):
    myObject.modify()  # here I am calling some method that changes myObject
    return myObject.f(x)

poolVar = Pool()
argsArray = [...]  # ARGS ARRAY GOES HERE
output = poolVar.map(myFunction, argsArray)

The function f(x) is contained in a *.so file, i.e., it is calling a C function.

The problem I am having is that the value of the output variable differs from run to run, even though myObject.f() is a deterministic function. If I use only one process, the output is the same on every run.

I have tried creating the object rather than storing it as a global variable:

def myFunction(x):
    myObject = createObject()
    return myFunction2(x, myObject)

However, in my program the object creation is expensive, and thus, it is a lot easier to create myObject once and then modify it each time I call myFunction2(). Thus, I would like to not have to create the object each time.

Do you have any tips? I am very new to parallel programming so I could be going about this all wrong. I decided to use the Pool class since I wanted to start with something simple. But I am willing to try a better way of doing it.


1 Reply


Regarding the goal of using the Pool class from Python's multiprocessing library to do shared memory processing on an HPC cluster:

Processes are not threads! You cannot simply replace Thread with Process and expect everything to work the same. Processes do not share memory: each worker process gets its own copy of the global variables, so a change made in a worker is not visible in the original process or in the other workers.

If you want to share memory between processes, you must use multiprocessing's shared data types, such as Value or Array, or use a Manager to create shared lists, dicts, etc.
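As a rough illustration of the shared data types, here is a sketch using Value with its built-in lock. The add_many helper and the numbers are made up for the example:

```python
from multiprocessing import Process, Value


def add_many(total, n):
    """Add 1 to the shared integer n times, holding its lock for each update."""
    for _ in range(n):
        with total.get_lock():
            total.value += 1


if __name__ == "__main__":
    total = Value("i", 0)  # shared C int, created with a lock by default
    workers = [Process(target=add_many, args=(total, 1000)) for _ in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print(total.value)  # prints 4000
```

Without the get_lock() block, the read-modify-write in `total.value += 1` could interleave between processes and lose updates.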

In particular, you might be interested in the register method of BaseManager (in multiprocessing.managers), which lets a manager create shared custom objects (although the arguments and return values of their methods must be picklable).
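A sketch of what registering a custom class might look like. Counter, CounterManager and bump are hypothetical names invented for this example. The object lives in the manager's server process, and the other processes talk to it through a proxy:

```python
import threading
from multiprocessing import Process
from multiprocessing.managers import BaseManager


class Counter:
    """Hypothetical shared object; it lives in the manager's server process."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()  # the server may handle callers concurrently

    def increment(self):
        with self._lock:
            self._value += 1

    def get(self):
        return self._value


class CounterManager(BaseManager):
    pass


# Make manager.Counter() available; it returns a proxy to a server-side Counter.
CounterManager.register("Counter", Counter)


def bump(counter, n):
    """Call increment() n times; works on a proxy or on a plain Counter."""
    for _ in range(n):
        counter.increment()


if __name__ == "__main__":
    with CounterManager() as manager:
        counter = manager.Counter()
        workers = [Process(target=bump, args=(counter, 10)) for _ in range(4)]
        for w in workers:
            w.start()
        for w in workers:
            w.join()
        print(counter.get())
```

Every increment() call here is a round trip to the manager process, which is where the overhead mentioned in this answer comes from.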

However, I'm not sure whether this will improve performance, since any communication between processes requires pickling, and pickling usually takes more time than simply instantiating the object.

Note that you can do some initialization of the worker processes by passing the initializer and initargs arguments when creating the Pool.

For example, in its simplest form, to create a global variable in the worker process:

def initializer():
    global data
    data = createObject()

Used as:

pool = Pool(4, initializer, ())

Then the worker functions can use the data global variable without worries.
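Put together, a self-contained sketch could look like the following. ExpensiveObject, its methods and the scale argument are hypothetical stand-ins for the object in the question, and init_worker plays the role of initializer:

```python
from multiprocessing import Pool


class ExpensiveObject:
    """Hypothetical stand-in for the costly-to-build object in the question."""

    def __init__(self, scale):
        self.scale = scale
        self.offset = 0

    def modify(self, offset):
        self.offset = offset

    def f(self, x):
        return self.scale * x + self.offset


data = None  # set once in each worker by init_worker


def init_worker(scale):
    """Runs once per worker process, so the object is built once per worker."""
    global data
    data = ExpensiveObject(scale)


def work(x):
    # modify() sets the state entirely from the argument, so the result does
    # not depend on which tasks this worker handled earlier.
    data.modify(offset=x)
    return data.f(x)


if __name__ == "__main__":
    with Pool(4, initializer=init_worker, initargs=(2,)) as pool:
        print(pool.map(work, range(5)))  # prints [0, 3, 6, 9, 12]
```

Each worker keeps its own data object between tasks. If modify() accumulated state instead of resetting it, the results could depend on how tasks happen to be distributed among workers, which may be the source of the run-to-run differences described in the question.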


Style note: Never use the name of a built-in for your variables/modules. In your case object is a built-in. Otherwise you'll end up with unexpected errors which may be obscure and hard to track down.

