Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
544 views
in Technique[技术] by (71.8m points)

python - Understanding performance difference

Answering this question I faced an interesting situation 2 similar code snippets performed quite differently. I'm asking here just to understand the reason for that and to improve my intuition for such cases.

I'll adapt code snippets for Python 2.7 (In Python 3 the performance differences are the same).

from collections import OrderedDict
from operator import itemgetter
from itertools import izip

items = OrderedDict([('a', 10), ('b', 9), ('c', 4), ('d', 7), ('e', 3), ('f', 0), ('g', -5), ('h', 9)])

def f1():
    return min(items, key=items.get)

def f2():
    return min(items.iteritems(), key=itemgetter(1))[0]


from timeit import Timer
N = 100000

print(Timer(stmt='f1()', setup='from __main__ import f1').timeit(number = N))
print(Timer(stmt='f2()', setup='from __main__ import f2').timeit(number = N))

Output:

0.603327797248
1.21580172899

The first solution has to make lookups in OrderedDictionary to get value for each key. Second solution just iterates through OrderedDictionary key-value pairs, which have to be packed into tuples.

The second solution is 2 times slower.

Why is that?

I recently watched this video, where Raymond Hettinger says that Python tends to reuse tuples, so no extra allocations.

So, what does this performance issue boil down to?


I want to elaborate a bit on why I'm asking.

The first solution has dictionary lookup. It implies taking key hash, then finding bin by this hash, then getting key from that bin (hopefully there will be no key collisions), and then getting value associated with that key.

The second solution just goes through all bins and yields all the keys in those bins. It goes through all the bins one-by-one without an overhead of calculation which bin to take. Yes, it has to access values associated with those keys, but the value is only one step from the key, while the first solution has to go through hash-bin-key-value chain to get the value when it needs it. Each solution has to get the value, the first one gets it through hash-bin-key-value chain, the second gets it following one more pointer when accessing key. The only overhead of the second solution is it has to store this value in the tuple temporary along with the key. It turns out that this storing is the major reason for the overhead. Still I don't fully understand why is it so, given the fact there is so-called "tuple reuse" (see the video mentioned above).

To my mind, the second solution has to save value along with the key, but it avoid us of having to make hash-bin-key calculation and access to get value for that key.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

The performance differences is mainly caused by OrderedDict.
OrderedDict uses dict's get and __getitem__, but redefined its own __iter__ and iteritems.


    def __iter__(self):
        'od.__iter__()  iter(od)'
        # Traverse the linked list in order.
        root = self.__root
        curr = root[1]                                  # start at the first node
        while curr is not root:
            yield curr[2]                               # yield the curr[KEY]
            curr = curr[1]                              # move to next node

    def iteritems(self):
        'od.iteritems -> an iterator over the (key, value) pairs in od'
        for k in self:
            yield (k, self[k])

Look at what we found: self[k].
Your second solution did not help us avoid the hash-bin-key calculation. While the iterator generated by dict, more precisely, items.iteritems().next() if items is a dict, does not make that calculation.

Moreover, iteritems is also more expensive.

from timeit import Timer
N = 1000

d = {i:i for i in range(10000)}

def f1():
    for k in d: pass

def f2():
    for k in d.iterkeys(): pass

def f3():
    for v in d.itervalues(): pass

def f4():
    for t in d.iteritems(): pass

print(Timer(stmt='f1()', setup='from __main__ import f1').timeit(number=N))
print(Timer(stmt='f2()', setup='from __main__ import f2').timeit(number=N))
print(Timer(stmt='f3()', setup='from __main__ import f3').timeit(number=N))
print(Timer(stmt='f4()', setup='from __main__ import f4').timeit(number=N))

Output

0.256800375467
0.265079360645
0.260599391822
0.492333103788

Comparing to iterkeys' dictiter_iternextkey and itervalues' dictiter_iternextvalue, iteritems'dictiter_iternextitem has additional parts.


    if (result->ob_refcnt == 1) {
        Py_INCREF(result);
        Py_DECREF(PyTuple_GET_ITEM(result, 0));
        Py_DECREF(PyTuple_GET_ITEM(result, 1));
    } else {
        result = PyTuple_New(2);
        if (result == NULL)
            return NULL;
    }
    di->len--;
    key = ep[i].me_key;
    value = ep[i].me_value;
    Py_INCREF(key);
    Py_INCREF(value);
    PyTuple_SET_ITEM(result, 0, key);
    PyTuple_SET_ITEM(result, 1, value);

I think that tuple creation could decrease the performance.

Python indeed tends to reuse tuples.
tupleobject.c shows

/* Speed optimization to avoid frequent malloc/free of small tuples */
#ifndef PyTuple_MAXSAVESIZE
#define PyTuple_MAXSAVESIZE     20  /* Largest tuple to save on free list */
#endif
#ifndef PyTuple_MAXFREELIST
#define PyTuple_MAXFREELIST  2000  /* Maximum number of tuples of each size to save */
#endif

This optimization just means Python does not build some tuples from scratch. But there is still a lot of work to be done.


Case: dict

If OrderedDict is replaced by dict, I think the second solution is slightly better in general.
Python dictionaries are implemented using hash tables. So the lookup is fast. The average time complexity of lookup is O(1), while the worst is O(n)1. The average time complexity of your first solution is the same as the time complexity of your second solution. They are both O(n). Therefore, the second solution has no advantages or is even slower sometimes, especially when the input data is small. In that case, the extra cost caused by iteritems could not be compensated.

from collections import OrderedDict
from operator import itemgetter
from timeit import Timer
from random import randint, random

N = 100000
xs = [('a', 10), ('b', 9), ('c', 4), ('d', 7), ('e', 3), ('f', 0), ('g', -5), ('h', 9)]

od = OrderedDict(xs)
d = dict(xs)

def f1od_min():
    return min(od, key=od.get)

def f2od_min():
    return min(od.iteritems(), key=itemgetter(1))[0]

def f1d_min():
    return min(d, key=d.get)

def f2d_min():
    return min(d.iteritems(), key=itemgetter(1))[0]

def f1od():
    for k in od: pass

def f2od():
    for t in od.iteritems(): pass

def f1d():
    for k in d: pass

def f2d():
    for t in d.iteritems(): pass

print 'min'
print(Timer(stmt='f1od_min()', setup='from __main__ import f1od_min').timeit(number=N))
print(Timer(stmt='f2od_min()', setup='from __main__ import f2od_min').timeit(number=N))
print(Timer(stmt='f1d_min()', setup='from __main__ import f1d_min').timeit(number=N))
print(Timer(stmt='f2d_min()', setup='from __main__ import f2d_min').timeit(number=N))
print
print 'traverse'
print(Timer(stmt='f1od()', setup='from __main__ import f1od').timeit(number=N))
print(Timer(stmt='f2od()', setup='from __main__ import f2od').timeit(number=N))
print(Timer(stmt='f1d()', setup='from __main__ import f1d').timeit(number=N))
print(Timer(stmt='f2d()', setup='from __main__ import f2d').timeit(number=N))

Output

min
0.398274431527
0.813040903243
0.185168156847
0.249574387248    <-- dict/the second solution

traverse
0.251634216081
0.642283865687
0.0565099754298
0.0958057518483

Then replace N and xs by

N = 50
xs = [(x, randint(1, 100)) for x in range(100000)]

Output

min
1.5148923257
3.47020082161
0.712828585756
0.70823812803    <-- dict/the second solution

traverse
0.975989336634
2.92283956481
0.127676073356
0.253622387762

Now replace N and xs by

N = 10
xs = [(random(), random()) for x in range(1000000)]

Output

min
6.23311265817
10.702984667
4.32852708934
2.87853889251    <-- dict/the second solution

traverse
2.06231783648
9.49360449443
1.33297618831
1.73723008092

Finally, the second solution starts to shine.


The worse case for the first solution: hash collision

Let

N = 10000
xs = [(2 ** (32 + x) - 2 ** x + 1, 1) for x in range(100)]
# hash(2 ** (32 + x) - 2 ** x + 1) is always 1

Output

min
2.44175265292    <-- lookup is slow
2.76424538594    <-- lookup is slow
2.26508627493    <-- lookup is slow
0.199363955475

traverse
0.200654482623
2.59635966303    <-- lookup is slow
0.0454684184722
0.0733798569371

1 The Average Case times listed for dict objects assume that the hash function for the objects is sufficiently robust to make collisions uncommon. The Average Case assumes the keys used in parameters are selected uniformly at random from the set of all keys. See TimeComplexity.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...