Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
655 views
in Technique[技术] by (71.8m points)

performance - Fastest way to sort a python 3.7+ dictionary

Now that the insertion order of Python dictionaries is guaranteed starting in Python 3.7 (and in CPython 3.6), what is the best/fastest way to sort a dictionary - both by value and by key?

The most obvious way to do it is probably this:

by_key = {k: dct[k] for k in sorted(dct.keys())}
by_value = {k: dct[k] for k in sorted(dct.keys(), key=dct.__getitem__)}

Are there alternative, faster ways to do this?

Note that this question is not a duplicate since previous questions about how to sort a dictionary are out of date (to which the answer was, basically, You can't; use a collections.OrderedDict instead).

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

TL;DR: Best ways to sort by key or by value (respectively), in CPython 3.7:

{k: d[k] for k in sorted(d)}
{k: v for k,v in sorted(d.items(), key=itemgetter(1))}

Tested on a macbook with sys.version:

3.7.0b4 (v3.7.0b4:eb96c37699, May  2 2018, 04:13:13)
[Clang 6.0 (clang-600.0.57)]

One-time setup with a dict of 1000 floats:

>>> import random
>>> from operator import itemgetter
>>> random.seed(123)
>>> d = {random.random(): random.random() for i in range(1000)}

Sorting numbers by key (best to worst):

>>> %timeit {k: d[k] for k in sorted(d)}
# 296 μs ± 2.6 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit {k: d[k] for k in sorted(d.keys())}
# 306 μs ± 9.25 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit dict(sorted(d.items(), key=itemgetter(0)))
# 345 μs ± 4.15 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit {k: v for k,v in sorted(d.items(), key=itemgetter(0))}
# 359 μs ± 2.42 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit dict(sorted(d.items(), key=lambda kv: kv[0]))
# 391 μs ± 8.7 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit dict(sorted(d.items()))
# 409 μs ± 9.33 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit {k: v for k,v in sorted(d.items())}
# 420 μs ± 5.39 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit {k: v for k,v in sorted(d.items(), key=lambda kv: kv[0])}
# 432 μs ± 39.6 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Sorting numbers by value (best to worst):

>>> %timeit {k: v for k,v in sorted(d.items(), key=itemgetter(1))}
# 355 μs ± 2.24 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit dict(sorted(d.items(), key=itemgetter(1)))
# 375 μs ± 31.7 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit {k: v for k,v in sorted(d.items(), key=lambda kv: kv[1])}
# 393 μs ± 1.89 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit dict(sorted(d.items(), key=lambda kv: kv[1]))
# 402 μs ± 9.74 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit {k: d[k] for k in sorted(d, key=d.get)}
# 404 μs ± 3.55 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit {k: d[k] for k in sorted(d, key=d.__getitem__)}
# 404 μs ± 20.3 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit {k: d[k] for k in sorted(d, key=lambda k: d[k])}
# 480 μs ± 12 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

One-time setup with a large dict of strings:

>>> import random
>>> from pathlib import Path
>>> from operator import itemgetter
>>> random.seed(456)
>>> words = Path('/usr/share/dict/words').read_text().splitlines()
>>> random.shuffle(words)
>>> keys = words.copy()
>>> random.shuffle(words)
>>> values = words.copy()
>>> d = dict(zip(keys, values))
>>> list(d.items())[:5]
[('ragman', 'polemoscope'),
 ('fenite', 'anaesthetically'),
 ('pycnidiophore', 'Colubridae'),
 ('propagate', 'premiss'),
 ('postponable', 'Eriglossa')]
>>> len(d)
235886

Sorting a dict of strings by key:

>>> %timeit {k: d[k] for k in sorted(d)}
# 387 ms ± 1.98 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit {k: d[k] for k in sorted(d.keys())}
# 387 ms ± 2.87 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit dict(sorted(d.items(), key=itemgetter(0)))
# 461 ms ± 1.61 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit dict(sorted(d.items(), key=lambda kv: kv[0]))
# 466 ms ± 2.62 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit {k: v for k,v in sorted(d.items(), key=itemgetter(0))}
# 488 ms ± 10.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit {k: v for k,v in sorted(d.items(), key=lambda kv: kv[0])}
# 536 ms ± 16.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit dict(sorted(d.items()))
# 661 ms ± 9.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit {k: v for k,v in sorted(d.items())}
# 687 ms ± 5.38 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Sorting a dict of strings by value:

>>> %timeit {k: v for k,v in sorted(d.items(), key=itemgetter(1))}
# 468 ms ± 5.74 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit dict(sorted(d.items(), key=itemgetter(1)))
# 473 ms ± 2.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit dict(sorted(d.items(), key=lambda kv: kv[1]))
# 492 ms ± 9.06 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit {k: v for k,v in sorted(d.items(), key=lambda kv: kv[1])}
# 496 ms ± 1.87 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit {k: d[k] for k in sorted(d, key=d.__getitem__)}
# 533 ms ± 5.33 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit {k: d[k] for k in sorted(d, key=d.get)}
# 544 ms ± 6.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit {k: d[k] for k in sorted(d, key=lambda k: d[k])}
# 566 ms ± 5.77 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Note: Real-world data often contains long runs of already-sorted sequences, which Timsort algorithm can exploit. If sorting a dict lies on your fast path, then it's recommended to benchmark on your own platform with your own typical data before drawing any conclusions about the best approach. I have prepended a comment character (#) on each timeit result so that IPython users can copy/paste the entire code block to re-run all the tests on their own platform.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...