The performance difference comes from the efficiency of 128-bit divisions/modulus with GCC/Clang in this specific case.
Indeed, on my system as well as on godbolt, sizeof(long long) = 8
and sizeof(__int128_t) = 16
. Thus operation on the former are performed by native instruction while not the latter (since we focus on 64 bit platforms). Additions, multiplications and subtractions are slower with __int128_t
. But, built-in functions for divisions/modulus on 16-byte types (__divti3
and __modti3
on x86 GCC/Clang) are surprisingly faster than the native idiv
instruction (which is pretty slow, at least on Intel processors).
If we look deeper in the implementation of GCC/Clang built-in functions (used only for __int128_t
here), we can see that __modti3
uses conditionals (when calling __udivmodti4
). Intel processors can execute the code faster because:
- taken branches can be well predicted in this case since they are always the same (and also because the loop is executed millions of times);
- the division/modulus is split in faster native instructions that can mostly be executed in parallel by multiple CPU ports (and that benefit from out-of-order execution). A
div
instruction is still used in most possible paths (especially in this case);
- The execution time of the
div
/idiv
instructions covers most of the overall execution time because of their very high latencies. The div
/idiv
instructions cannot be executed in parallel because of the loop dependencies. However, the latency of a div
lower than a idiv
making the former faster.
Please note that the performance of the two implementations can highly differ from one architecture to another (because of the number of CPU ports, the branch prediction capability and the latency/througput of the idiv
instruction).
Indeed, the latency of a 64-bit idiv
instruction takes 41-95 cycles on Skylake while it takes 8-41 cycles on AMD Ryzen processors for example. Respectively the latency of a div
is about 6-89 cycles on Skylake and still the same on Ryzen. This means that the benchmark performance results should be significantly different on Ryzen processors (the opposite effect may be seen due to the additional instructions/branch costs in the 128-bits case).
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…