multithreading - Why does this Python script run 4x slower on multiple cores than on a single core

Question

Welcome To Ask or Share your Answers For Others

multithreading - Why does this Python script run 4x slower on multiple cores than on a single core

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

multithreading - Why does this Python script run 4x slower on multiple cores than on a single core

I'm trying to understand how CPython's GIL works and what are the differences between GIL in CPython 2.7.x and CPython 3.4.x. I'm using this code for benchmarking:

from __future__ import print_function

import argparse
import resource
import sys
import threading
import time


def countdown(n):
    while n > 0:
        n -= 1


def get_time():
    stats = resource.getrusage(resource.RUSAGE_SELF)
    total_cpu_time = stats.ru_utime + stats.ru_stime
    return time.time(), total_cpu_time, stats.ru_utime, stats.ru_stime


def get_time_diff(start_time, end_time):
    return tuple((end-start) for start, end in zip(start_time, end_time))


def main(total_cycles, max_threads, no_headers=False):
    header = ("%4s %8s %8s %8s %8s %8s %8s %8s %8s" %
              ("#t", "seq_r", "seq_c", "seq_u", "seq_s",
               "par_r", "par_c", "par_u", "par_s"))
    row_format = ("%(threads)4d "
                  "%(seq_r)8.2f %(seq_c)8.2f %(seq_u)8.2f %(seq_s)8.2f "
                  "%(par_r)8.2f %(par_c)8.2f %(par_u)8.2f %(par_s)8.2f")
    if not no_headers:
        print(header)
    for thread_count in range(1, max_threads+1):
        # We don't care about a few lost cycles
        cycles = total_cycles // thread_count

        threads = [threading.Thread(target=countdown, args=(cycles,))
                   for i in range(thread_count)]

        start_time = get_time()
        for thread in threads:
            thread.start()
            thread.join()
        end_time = get_time()
        sequential = get_time_diff(start_time, end_time)

        threads = [threading.Thread(target=countdown, args=(cycles,))
                   for i in range(thread_count)]
        start_time = get_time()
        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()
        end_time = get_time()
        parallel = get_time_diff(start_time, end_time)

        print(row_format % {"threads": thread_count,
                            "seq_r": sequential[0],
                            "seq_c": sequential[1],
                            "seq_u": sequential[2],
                            "seq_s": sequential[3],
                            "par_r": parallel[0],
                            "par_c": parallel[1],
                            "par_u": parallel[2],
                            "par_s": parallel[3]})


if __name__ == "__main__":
    arg_parser = argparse.ArgumentParser()
    arg_parser.add_argument("max_threads", nargs="?",
                            type=int, default=5)
    arg_parser.add_argument("total_cycles", nargs="?",
                            type=int, default=50000000)
    arg_parser.add_argument("--no-headers",
                            action="store_true")
    args = arg_parser.parse_args()
    sys.exit(main(args.total_cycles, args.max_threads, args.no_headers))

When running this script on my quad-core i5-2500 machine under Ubuntu 14.04 with Python 2.7.6, I get the following results (_r stands for real time, _c for CPU time, _u for user mode, _s for kernel mode):

  #t    seq_r    seq_c    seq_u    seq_s    par_r    par_c    par_u    par_s
   1     1.47     1.47     1.47     0.00     1.46     1.46     1.46     0.00
   2     1.74     1.74     1.74     0.00     3.33     5.45     3.52     1.93
   3     1.87     1.90     1.90     0.00     3.08     6.42     3.77     2.65
   4     1.78     1.83     1.83     0.00     3.73     6.18     3.88     2.30
   5     1.73     1.79     1.79     0.00     3.74     6.26     3.87     2.39

Now if I bind all threads to one core, the results are very different:

taskset -c 0 python countdown.py 
  #t    seq_r    seq_c    seq_u    seq_s    par_r    par_c    par_u    par_s
   1     1.46     1.46     1.46     0.00     1.46     1.46     1.46     0.00
   2     1.74     1.74     1.73     0.00     1.69     1.68     1.68     0.00
   3     1.47     1.47     1.47     0.00     1.58     1.58     1.54     0.04
   4     1.74     1.74     1.74     0.00     2.02     2.02     1.87     0.15
   5     1.46     1.46     1.46     0.00     1.91     1.90     1.75     0.15

So the question is: why running this Python code on multiple cores is 1.5x-2x slower by wall clock and 4x-5x slower by CPU clock than running it on a single core?

Asking around and googling produced two hypotheses:

When running on multiple cores, a thread can be re-scheduled to run on a different core which means that local cache gets invalidated, hence the slowdown.
The overhead of suspending a thread on one core and activating it on another core is larger than suspending and activating the thread on the same core.

Are there any other reasons? I would like to understand what's going on and to be able to back my understanding with numbers (meaning that if the slowdown is due to cache misses, I want to see and compare the numbers for both cases).

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-23T17:56:02+0000

It's due to GIL thrashing when multiple native threads are competing for the GIL. David Beazley's materials on this subject will tell your everything you want to know.

See info here for a nice graphical representation of what is happening.

Python3.2 introduced changes to the GIL that help solve this problem so you should see improved performance with 3.2 and later.

It should also be noted that the GIL is an implementation detail of the cpython reference implementation of the language. Other implementations like Jython do not have GIL and do not suffer this particular problem.

The rest of D. Beazley's info on the GIL will also be helpful to you.

To specifically answer your question about why performance is so much worse when multiple cores are involved, see slide 29-41 of the Inside the GIL presentation. It goes into a detailed discussion on multicore GIL contention as opposed to multiple threads on a single core. Slide 32 specifically shows that the number of system calls due to thread signaling overhead goes through the roof as you add cores. This is because the threads are now running simulatneously on different cores and which allows them to engage in a true GIL battle. As opposed to multiple threads sharing a single CPU. A good summary bullet from the above presentation is:

With multiple cores, CPU-bound threads get scheduled simultaneously (on different cores) and then have a GIL battle.

Categories

multithreading - Why does this Python script run 4x slower on multiple cores than on a single core

multithreading - Why does this Python script run 4x slower on multiple cores than on a single core

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags