There is an intrinsic to do this: _addcarry_u64. However, only Visual Studio and ICC (at least VS 2013 and 2015 and ICC 13 and ICC 15) do this efficiently. Clang 3.7 and GCC 5.2 still don't produce efficient code with this intrinsic.
Clang in addition has a built-in which one would think does this, __builtin_addcll
, but it does not produce efficient code either.
The reason Visual Studio does this is that it does not allow inline assembly in 64-bit mode so the compiler should provide a way to do this with an intrinsic (though Microsoft took their time implementing this).
Therefore, with Visual Studio use _addcarry_u64
. With ICC use _addcarry_u64
or inline assembly. With Clang and GCC use inline assembly.
Note that since the Broadwell microarchitecture there are two new instructions: adcx
and adox
which you can access with the _addcarryx_u64 intrinsic . Intel's documentation for these intrinsics used to be different then the assembly produced by the compiler but it appears their documentation is correct now. However, Visual Studio still only appears to produce adcx
with _addcarryx_u64
whereas ICC produces both adcx
and adox
with this intrinsic. But even though ICC produces both instructions it does not produce the most optimal code (ICC 15) and so inline assembly is still necessary.
Personally, I think the fact that a non-standard feature of C/C++, such as inline assembly or intrinsics, is required to do this is a weakness of C/C++ but others might disagree. The adc
instruction has been in the x86 instruction set since 1979. I would not hold my breath on C/C++ compilers being able to optimally figure out when you want adc
. Sure they can have built-in types such as __int128
but the moment you want a larger type that's not built-in you have to use some non-standard C/C++ feature such as inline assembly or intrinsics.
In terms of inline assembly code to do this I already posted a solution for 256-bit addition for eight 64-bit integers in register at multi-word addition using the carry flag.
Here is that code reposted.
#define ADD256(X1, X2, X3, X4, Y1, Y2, Y3, Y4)
__asm__ __volatile__ (
"addq %[v1], %[u1]
"
"adcq %[v2], %[u2]
"
"adcq %[v3], %[u3]
"
"adcq %[v4], %[u4]
"
: [u1] "+&r" (X1), [u2] "+&r" (X2), [u3] "+&r" (X3), [u4] "+&r" (X4)
: [v1] "r" (Y1), [v2] "r" (Y2), [v3] "r" (Y3), [v4] "r" (Y4))
If you want to explicitly load the values from memory you can do something like this
//uint64_t dst[4] = {1,1,1,1};
//uint64_t src[4] = {1,2,3,4};
asm (
"movq (%[in]), %%rax
"
"addq %%rax, %[out]
"
"movq 8(%[in]), %%rax
"
"adcq %%rax, 8%[out]
"
"movq 16(%[in]), %%rax
"
"adcq %%rax, 16%[out]
"
"movq 24(%[in]), %%rax
"
"adcq %%rax, 24%[out]
"
: [out] "=m" (dst)
: [in]"r" (src)
: "%rax"
);
That produces nearlly identical assembly as from the following function in ICC
void add256(uint256 *x, uint256 *y) {
unsigned char c = 0;
c = _addcarry_u64(c, x->x1, y->x1, &x->x1);
c = _addcarry_u64(c, x->x2, y->x2, &x->x2);
c = _addcarry_u64(c, x->x3, y->x3, &x->x3);
_addcarry_u64(c, x->x4, y->x4, &x->x4);
}
I have limited experience with GCC inline assembly (or inline assembly in general - I usually use an assembler such as NASM) so maybe there are better inline assembly solutions.
So what I'm looking for is code that I could generalize to any length
To answer this question here is another solution using template meta programming. I used this same trick for loop unrolling. This produces optimal code with ICC. If Clang or GCC ever implement _addcarry_u64
efficiently this would be a good general solution.
#include <x86intrin.h>
#include <inttypes.h>
#define LEN 4 // N = N*64-bit add e.g. 4=256-bit add, 3=192-bit add, ...
static unsigned char c = 0;
template<int START, int N>
struct Repeat {
static void add (uint64_t *x, uint64_t *y) {
c = _addcarry_u64(c, x[START], y[START], &x[START]);
Repeat<START+1, N>::add(x,y);
}
};
template<int N>
struct Repeat<LEN, N> {
static void add (uint64_t *x, uint64_t *y) {}
};
void sum_unroll(uint64_t *x, uint64_t *y) {
Repeat<0,LEN>::add(x,y);
}
Assembly from ICC
xorl %r10d, %r10d #12.13
movzbl c(%rip), %eax #12.13
cmpl %eax, %r10d #12.13
movq (%rsi), %rdx #12.13
adcq %rdx, (%rdi) #12.13
movq 8(%rsi), %rcx #12.13
adcq %rcx, 8(%rdi) #12.13
movq 16(%rsi), %r8 #12.13
adcq %r8, 16(%rdi) #12.13
movq 24(%rsi), %r9 #12.13
adcq %r9, 24(%rdi) #12.13
setb %r10b
Meta programming is a basic feature of assemblers so it's too bad C and C++ (except through template meta programming hacks) have no solution for this either (the D language does).
The inline assembly I used above which referenced memory was causing some problems in a function. Here is a new version which seems to work better
void foo(uint64_t *dst, uint64_t *src)
{
__asm (
"movq (%[in]), %%rax
"
"addq %%rax, (%[out])
"
"movq 8(%[in]), %%rax
"
"adcq %%rax, 8(%[out])
"
"movq 16(%[in]), %%rax
"
"addq %%rax, 16(%[out])
"
"movq 24(%[in]), %%rax
"
"adcq %%rax, 24(%[out])
"
:
: [in] "r" (src), [out] "r" (dst)
: "%rax"
);
}