Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - pandas and numpy thread safety

I'm using pandas on a web server (Apache + mod_wsgi + Django) and have a hard-to-reproduce bug which, as I have now discovered, is caused by pandas not being thread-safe.

After a lot of code reduction I finally found a short standalone program which can be used to reproduce the problem. You can see it below.

The point is: contrary to the answer to this question, this example shows that pandas can crash even with very simple operations that do not modify a DataFrame. I can't see how such a simple snippet could possibly be unsafe with threads.

The question is about using pandas and NumPy in a web server. Is it possible? How am I supposed to fix my code that uses pandas? (An example of lock usage would be helpful.)

Here is the code which causes a Segmentation Fault:

import threading
import pandas as pd
import numpy as np

def let_crash(crash=True):
    t = 0.02 * np.arange(100000)  # OK with 10000
    data = pd.DataFrame({'t': t})
    if crash:
        data['t'] * 1.5  # CRASH
    else:
        data['t'].values * 1.5  # THIS IS OK!

if __name__ == '__main__':
    threads = []
    for i in range(100):
        if True:  # asynchronous
            t = threading.Thread(target=let_crash, args=())
            t.daemon = True
            t.start()
            threads.append(t)
        else:  # synchronous
            let_crash()
    for t in threads:
        t.join()

My environment: Python 2.7.3, NumPy 1.8.0, pandas 0.13.1


1 Reply


See the caveat in the docs here: http://pandas.pydata.org/pandas-docs/dev/gotchas.html#thread-safety

pandas is not thread-safe because the underlying copy mechanism is not. NumPy, I believe, has an atomic copy operation, but pandas has a layer above this.

Copying is the basis of pandas operations, since most operations generate a new object to return to the user.

This is not trivial to fix and would come with a pretty heavy performance cost, so it would need a fair amount of work to handle properly.

The easiest approach is simply not to share objects across threads, or to lock them on use.
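Since the question asks for an example of lock usage, here is a minimal sketch of the "lock them on use" option. It assumes that one process-wide lock around every pandas call is acceptable. The names `PANDAS_LOCK`, `with_pandas_lock`, `scale_column` and `demo` are hypothetical helpers made up for illustration, not part of pandas. It has not been checked against the environment in the question, so treat it as a starting point rather than a confirmed fix.

```python
import threading
from functools import wraps

# Hypothetical module-level lock shared by every thread that touches pandas.
PANDAS_LOCK = threading.Lock()


def with_pandas_lock(func):
    """Run func while holding PANDAS_LOCK, so one thread at a time is inside."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        with PANDAS_LOCK:
            return func(*args, **kwargs)
    return wrapper


@with_pandas_lock
def scale_column(data, column, factor):
    # With a DataFrame, this is the operation marked CRASH in the question.
    return data[column] * factor


def demo(n_threads=100, n_rows=100000):
    """Re-run the question's test with every pandas call serialized."""
    # Imported here so the helpers above do not require pandas to be installed.
    import numpy as np
    import pandas as pd

    @with_pandas_lock
    def make_frame():
        return pd.DataFrame({'t': 0.02 * np.arange(n_rows)})

    def worker():
        data = make_frame()           # each thread owns its own DataFrame
        scale_column(data, 't', 1.5)  # the lock is taken and released per call

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Calling `demo()` in an environment with pandas and NumPy installed runs the same workload as the question, but with the pandas calls executed one at a time. The trade-off is that the locked sections no longer run in parallel, so it is worth keeping them as small as possible. `threading.Lock` is not reentrant, so a function decorated with `with_pandas_lock` must not call another decorated function.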

