Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
119 views
in Technique[技术] by (71.8m points)

python - Pandas DataFrame performance

Pandas is really great, but I am really surprised by how inefficient it is to retrieve values from a Pandas.DataFrame. In the following toy example, even the DataFrame.iloc method is more than 100 times slower than a dictionary.

The question: Is the lesson here just that dictionaries are the better way to look up values? Yes, I get that that is precisely what they were made for. But I just wonder if there is something I am missing about DataFrame lookup performance.

I realize this question is more "musing" than "asking" but I will accept an answer that provides insight or perspective on this. Thanks.

import timeit

setup = '''
import numpy, pandas
df = pandas.DataFrame(numpy.zeros(shape=[10, 10]))
dictionary = df.to_dict()
'''

f = ['value = dictionary[5][5]', 'value = df.loc[5, 5]', 'value = df.iloc[5, 5]']

for func in f:
    print func
    print min(timeit.Timer(func, setup).repeat(3, 100000))

value = dictionary[5][5]

0.130625009537

value = df.loc[5, 5]

19.4681699276

value = df.iloc[5, 5]

17.2575249672

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

A dict is to a DataFrame as a bicycle is to a car. You can pedal 10 feet on a bicycle faster than you can start a car, get it in gear, etc, etc. But if you need to go a mile, the car wins.

For certain small, targeted purposes, a dict may be faster. And if that is all you need, then use a dict, for sure! But if you need/want the power and luxury of a DataFrame, then a dict is no substitute. It is meaningless to compare speed if the data structure does not first satisfy your needs.

Now for example -- to be more concrete -- a dict is good for accessing columns, but it is not so convenient for accessing rows.

import timeit

setup = '''
import numpy, pandas
df = pandas.DataFrame(numpy.zeros(shape=[10, 1000]))
dictionary = df.to_dict()
'''

# f = ['value = dictionary[5][5]', 'value = df.loc[5, 5]', 'value = df.iloc[5, 5]']
f = ['value = [val[5] for col,val in dictionary.items()]', 'value = df.loc[5]', 'value = df.iloc[5]']

for func in f:
    print(func)
    print(min(timeit.Timer(func, setup).repeat(3, 100000)))

yields

value = [val[5] for col,val in dictionary.iteritems()]
25.5416321754
value = df.loc[5]
5.68071913719
value = df.iloc[5]
4.56006002426

So the dict of lists is 5 times slower at retrieving rows than df.iloc. The speed deficit becomes greater as the number of columns grows. (The number of columns is like the number of feet in the bicycle analogy. The longer the distance, the more convenient the car becomes...)

This is just one example of when a dict of lists would be less convenient/slower than a DataFrame.

Another example would be when you have a DatetimeIndex for the rows and wish to select all rows between certain dates. With a DataFrame you can use

df.loc['2000-1-1':'2000-3-31']

There is no easy analogue for that if you were to use a dict of lists. And the Python loops you would need to use to select the right rows would again be terribly slow compared to the DataFrame.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...