Actually gcc will use the carry automatically if you write your code carefully...
Current GCC can optimize hiWord += (loWord < loAdd); into add/adc (x86's add-with-carry). This optimization was introduced in GCC 5.3.
(Editor's note: of course the hard part is writing a correct full adder with carry-in and carry-out; that's hard in C, and GCC doesn't know how to optimize any that I've seen.)
Also related: https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins.html can give you carry-out from unsigned, or signed-overflow detection.
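For example, a 64-bit full adder with carry-in and carry-out could be built on those builtins (a sketch of my own, not from the answer; it needs GCC 5 or later, and add64_carry is just a name I made up):

// Sketch only: a 64-bit full adder using the overflow builtins linked above.
static inline unsigned long add64_carry(unsigned long a, unsigned long b,
                                        unsigned long carry_in,
                                        unsigned long *sum)
{
    unsigned long s;
    unsigned long c1 = __builtin_add_overflow(a, b, &s);         // carry out of a + b
    unsigned long c2 = __builtin_add_overflow(s, carry_in, sum); // carry out of s + carry_in
    return c1 | c2;  // at most one of the two carries can actually be set
}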
Older GCC, like GCC 4.5, will branch or setc on the carry-out from an add instead of using adc, and only uses adc (add-with-carry) on the flag result from an add if you use __int128 (or uint64_t on a 32-bit target). See Is there a 128 bit integer in gcc? - only on 64-bit targets, supported since GCC 4.1.
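For instance (my own sketch, assuming a 64-bit target where unsigned __int128 is available), a plain 128-bit addition is all it takes to get the add/adc pair:

// Sketch, not from the original answer: adding two unsigned __int128 values
// (64-bit targets only) lets GCC emit add/adc by itself.
unsigned __int128 add128(unsigned __int128 a, unsigned __int128 b)
{
    return a + b;
}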
I compiled this code with gcc -O2 -Wall -Werror -S:
void increment128_1(unsigned long &hiWord, unsigned long &loWord)
{
    const unsigned long hiAdd=0x0000062DE49B5241;
    const unsigned long loAdd=0x85DC198BCDD714BA;
    loWord += loAdd;
    if (loWord < loAdd) ++hiWord; // test_and_add_carry
    hiWord += hiAdd;
}
void increment128_2(unsigned long &hiWord, unsigned long &loWord)
{
    const unsigned long hiAdd=0x0000062DE49B5241;
    const unsigned long loAdd=0x85DC198BCDD714BA;
    loWord += loAdd;
    hiWord += hiAdd;
    hiWord += (loWord < loAdd); // test_and_add_carry
}
This is the assembly for increment128_1:
        .cfi_startproc
        movabsq $-8801131483544218438, %rax
        addq (%rsi), %rax
        movabsq $-8801131483544218439, %rdx
        cmpq %rdx, %rax
        movq %rax, (%rsi)
        ja .L5
        movq (%rdi), %rax
        addq $1, %rax
.L3:
        movabsq $6794178679361, %rdx
        addq %rdx, %rax
        movq %rax, (%rdi)
        ret
...and this is the assembly for increment128_2:
        movabsq $-8801131483544218438, %rax
        addq %rax, (%rsi)
        movabsq $6794178679361, %rax
        addq (%rdi), %rax
        movabsq $-8801131483544218439, %rdx
        movq %rax, (%rdi)
        cmpq %rdx, (%rsi)
        setbe %dl
        movzbl %dl, %edx
        leaq (%rdx,%rax), %rax
        movq %rax, (%rdi)
        ret
Note the lack of conditional branches in the second version.
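(As a quick sanity check, here is a small driver of my own, not part of the original answer; it assumes the two functions above are in the same file and picks a low word that forces a carry into the high word:)

#include <cstdio>

int main()
{
    // 0xFFFF... + loAdd wraps around, so the carry into the high word must be taken.
    unsigned long hi1 = 1, lo1 = 0xFFFFFFFFFFFFFFFF;
    unsigned long hi2 = 1, lo2 = 0xFFFFFFFFFFFFFFFF;
    increment128_1(hi1, lo1);
    increment128_2(hi2, lo2);
    printf("%016lx%016lx\n", hi1, lo1);  // both lines should be identical
    printf("%016lx%016lx\n", hi2, lo2);
    return 0;
}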
[edit]
Also, references are often bad for performance, because GCC has to worry about aliasing... It is often better to just pass things by value. Consider:
struct my_uint128_t {
    unsigned long hi;
    unsigned long lo;
};
my_uint128_t increment128_3(my_uint128_t x)
{
    const unsigned long hiAdd=0x0000062DE49B5241;
    const unsigned long loAdd=0x85DC198BCDD714BA;
    x.lo += loAdd;
    x.hi += hiAdd + (x.lo < loAdd);
    return x;
}
Assembly:
        .cfi_startproc
        movabsq $-8801131483544218438, %rdx
        movabsq $-8801131483544218439, %rax
        movabsq $6794178679362, %rcx
        addq %rsi, %rdx
        cmpq %rdx, %rax
        sbbq %rax, %rax
        addq %rcx, %rax
        addq %rdi, %rax
        ret
This is actually the tightest code of the three.
...OK so none of them actually used the carry automatically :-). But they do avoid the conditional branch, which I bet is the slow part (since the branch prediction logic will get it wrong half the time).
[edit 2]
And one more, which I stumbled across doing a little searching. Did you know GCC has built-in support for 128-bit integers?
typedef unsigned long my_uint128_t __attribute__ ((mode(TI)));

my_uint128_t increment128_4(my_uint128_t x)
{
    const my_uint128_t hiAdd=0x0000062DE49B5241;
    const unsigned long loAdd=0x85DC198BCDD714BA;
    return x + (hiAdd << 64) + loAdd;
}
The assembly for this one is about as good as it gets:
        .cfi_startproc
        movabsq $-8801131483544218438, %rax
        movabsq $6794178679361, %rdx
        pushq %rbx
        .cfi_def_cfa_offset 16
        addq %rdi, %rax
        adcq %rsi, %rdx
        popq %rbx
        .cfi_offset 3, -16
        .cfi_def_cfa_offset 8
        ret
(Not sure where the push/pop of rbx came from, but this is still not bad.)
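(If you need the two 64-bit halves back out of that type, it is just shifts and casts; a sketch of my own, with split128 being a made-up name:)

// Sketch, not from the original answer: splitting the 128-bit result
// from increment128_4 back into high and low 64-bit words.
void split128(my_uint128_t x, unsigned long *hi, unsigned long *lo)
{
    my_uint128_t r = increment128_4(x);
    *hi = (unsigned long)(r >> 64);
    *lo = (unsigned long)r;
}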
All of these are with GCC 4.5.2, by the way.