c++ - Why is __int128_t faster than long long on x86-64 GCC?

Question

Welcome To Ask or Share your Answers For Others

c++ - Why is __int128_t faster than long long on x86-64 GCC?

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

c++ - Why is __int128_t faster than long long on x86-64 GCC?

This is my test code:

#include <chrono>
#include <iostream>
#include <cstdlib>
using namespace std;

using ll = long long;

int main()
{
    __int128_t a, b;
    ll x, y;

    a = rand() + 10000000;
    b = rand() % 50000;
    auto t0 = chrono::steady_clock::now();
    for (int i = 0; i < 100000000; i++)
    {
        a += b;
        a /= b;
        b *= a;
        b -= a;
        a %= b;
    }
    cout << chrono::duration_cast<chrono::milliseconds>(chrono::steady_clock::now() - t0).count() << ' '
         << (ll)a % 100000 << '
';

    x = rand() + 10000000;
    y = rand() % 50000;
    t0 = chrono::steady_clock::now();
    for (int i = 0; i < 100000000; i++)
    {
        x += y;
        x /= y;
        y *= x;
        y -= x;
        x %= y;
    }
    cout << chrono::duration_cast<chrono::milliseconds>(chrono::steady_clock::now() - t0).count() << ' '
         << (ll)x % 100000 << '
';

    return 0;
}

This is the test result:

$ g++ main.cpp -o main -O2
$ ./main
2432 1
2627 1

Using GCC 10.1.0 on x64 GNU/Linux, no matter if it's using optimization of -O2 or un-optimized, __int128_t is always a little faster than long long.

int and double are both significantly faster than long long; long long has become the slowest type.

How does this happen?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-23T18:37:21+0000

The performance difference comes from the efficiency of 128-bit divisions/modulus with GCC/Clang in this specific case.

Indeed, on my system as well as on godbolt, sizeof(long long) = 8 and sizeof(__int128_t) = 16. Thus operation on the former are performed by native instruction while not the latter (since we focus on 64 bit platforms). Additions, multiplications and subtractions are slower with __int128_t. But, built-in functions for divisions/modulus on 16-byte types (__divti3 and __modti3 on x86 GCC/Clang) are surprisingly faster than the native idiv instruction (which is pretty slow, at least on Intel processors).

If we look deeper in the implementation of GCC/Clang built-in functions (used only for __int128_t here), we can see that __modti3 uses conditionals (when calling __udivmodti4). Intel processors can execute the code faster because:

taken branches can be well predicted in this case since they are always the same (and also because the loop is executed millions of times);
the division/modulus is split in faster native instructions that can mostly be executed in parallel by multiple CPU ports (and that benefit from out-of-order execution). A div instruction is still used in most possible paths (especially in this case);
The execution time of the div/idiv instructions covers most of the overall execution time because of their very high latencies. The div/idiv instructions cannot be executed in parallel because of the loop dependencies. However, the latency of a div lower than a idiv making the former faster.

Please note that the performance of the two implementations can highly differ from one architecture to another (because of the number of CPU ports, the branch prediction capability and the latency/througput of the idiv instruction). Indeed, the latency of a 64-bit idiv instruction takes 41-95 cycles on Skylake while it takes 8-41 cycles on AMD Ryzen processors for example. Respectively the latency of a div is about 6-89 cycles on Skylake and still the same on Ryzen. This means that the benchmark performance results should be significantly different on Ryzen processors (the opposite effect may be seen due to the additional instructions/branch costs in the 128-bits case).

Categories

c++ - Why is __int128_t faster than long long on x86-64 GCC?

c++ - Why is __int128_t faster than long long on x86-64 GCC?

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags