There is no direct hardware support for an atomic clear-lowest-bit on x86. BMI1 blsr is only available in memory-source form, not memory-destination form; lock blsr [shared_var] does not exist. (Compilers know how to optimize (var-1) & (var) into blsr for local vars when you compile with -march=haswell or otherwise enable code-gen that assumes BMI1 support.) So even if you can assume BMI1 support (see footnote 1 below), it doesn't let you do anything fundamentally different.
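If you're curious what that non-atomic, local-variable case looks like, here's a minimal sketch; the function name is mine, purely for illustration:

// With -march=haswell (or anything else enabling BMI1 code-gen),
// GCC and clang compile this to a single blsr plus ret.
unsigned long clear_lowest_set_bit(unsigned long x) {
    return x & (x - 1);   // clears the lowest set bit; 0 stays 0
}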
The best you can do on x86 is the lock cmpxchg retry loop you propose in the question. This is a better option than finding the right bit in an old version of the variable and then using lock btr, especially if it would be a correctness problem to clear the wrong bit because a lower bit was set between the bit-scan and the lock btr. And you'd still need a retry loop in case another thread already cleared the bit you wanted.
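For readers who don't have the question open, that retry loop looks roughly like the following sketch. It's my reconstruction rather than the asker's exact source, but it should compile to essentially the asm shown further down.

#include <atomic>

// Atomically clear the lowest set bit of the shared value and return that bit
// (returns 0 if the value was already 0). Reconstruction for illustration only.
unsigned long pop_lowest_non_zero(std::atomic<unsigned long>& v) {
    unsigned long expected = v.load(std::memory_order_relaxed);
    while (!v.compare_exchange_weak(expected, expected & (expected - 1))) {
        // on failure, expected is updated with the current value; just retry
    }
    return expected & -expected;   // isolate the lowest set bit of the old value
}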
CAS retry loops are not bad in practice: retry is quite rare. (Unless your program has very high contention for the shared variable, which would be a performance problem even with lock add, where there's no retrying, just hardware arbitration for access to cache lines. If that's the case, you should probably redesign your lockless algorithm and/or consider some kind of OS-assisted sleep/wake instead of having a lot of cores spending a lot of CPU time hammering on the same cache line. Lockless is great in the low-contention case.)
The window for the CPU to lose the cache line between the load to get expected and running lock cmpxchg with the result of a couple instructions on that value is tiny. Usually it will succeed the first time through, because the cache line will still be present in L1d cache when the cmpxchg runs. When the cache line arrives, it will (hopefully) already be in MESI Exclusive state, if the CPU saw far enough ahead to do an RFO for it.
You can instrument your cmpxchg retry loops to see how much contention you actually get in your real program (e.g. by incrementing a local inside the loop and using an atomic relaxed += into a shared counter once you succeed, or with thread-local counters).
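A sketch of what that instrumentation could look like (the counter name and the choice to add to it only after a non-zero retry count are mine):

#include <atomic>

std::atomic<unsigned long> total_retries{0};   // shared, hypothetical counter

unsigned long pop_lowest_non_zero_instrumented(std::atomic<unsigned long>& v) {
    unsigned long expected = v.load(std::memory_order_relaxed);
    unsigned long retries = 0;                 // local: no extra contention inside the loop
    while (!v.compare_exchange_weak(expected, expected & (expected - 1))) {
        ++retries;
    }
    if (retries)                               // one relaxed RMW, only after success
        total_retries.fetch_add(retries, std::memory_order_relaxed);
    return expected & -expected;
}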
Remember that your real code (hopefully) does plenty of work between atomic operations on this bitmask, so it's very different from a microbenchmark where all the threads spend all their time hammering on that cache line.
EDIT: The update has to be atomic and global progress must be guaranteed, but just as with solution above, it doesn't have to be a wait free algorithm (of course I'd be very happy if you can show me one).
A CAS retry loop (even when compiled on an LL/SC machine, see below) is lock-free in the technical sense: you only have to retry if some other thread made progress.
CAS leaves the value unmodified if it fails (although on x86 a failed lock cmpxchg still dirties the cache line, putting it in MESI Modified state). x86 cmpxchg doesn't detect ABA, it only compares, so if several threads loaded the same expected, exactly one of them will succeed. On LL/SC machines, hopefully an SC failure on one core doesn't cause spurious SC failures on other cores, otherwise it could livelock; that's something you can assume CPU architects thought of.
It's not wait-free because a single thread can in theory have to retry an unbounded number of times.
Your code compiles with gcc8.1 -O3 -march=haswell
to this asm (from the Godbolt compiler explorer)
# gcc's code-gen for x86 with BMI1 looks optimal to me. No wasted instructions!
# presumably you'll get something similar when inlining.
pop_lowest_non_zero(std::atomic<unsigned long>&):
    mov     rax, QWORD PTR [rdi]
.L2:
    blsr    rdx, rax                    # Bit Lowest Set Reset
    lock cmpxchg    QWORD PTR [rdi], rdx
    jne     .L2                         # fall through on success: cmpxchg sets ZF on equal like regular cmp
    blsi    rax, rax                    # Bit Lowest Set Isolate
    ret
Without BMI1, blsr and blsi become two instructions each. This is close to irrelevant for performance given the cost of a locked instruction.
clang and MSVC unfortunately are slightly more clunky, with a taken branch on the no-contention fast path. (And clang bloats the function by peeling the first iteration. IDK if this helps with branch prediction or something, or if it's purely a missed optimization. We can get clang to generate the fast path with no taken branches using an unlikely() macro; see: How do the likely() and unlikely() macros in the Linux kernel work and what is their benefit?)
Footnote 1:
Unless you're building binaries for a known set of machines, you can't assume BMI1 is available anyway. So the compiler will need to do something like lea rdx, [rax-1] / and rdx, rax instead of blsr rdx, rax. This makes a trivial difference for this function; the majority of the cost is the atomic operation, even in the uncontended, hot-in-L1d-cache case, looking at the out-of-order execution throughput impact on surrounding code. (e.g. 10 uops for lock cmpxchg on Skylake (http://agner.org/optimize/) vs. saving 1 uop with blsr instead of 2 other instructions. Assuming the front-end is the bottleneck, rather than memory or something else. Being a full memory barrier has an impact on loads/stores in surrounding code, too, but fortunately not on out-of-order execution of independent ALU instructions.)
Most non-x86 machines do all their atomic operations with load-linked / store-conditional retry loops. It's a bit unfortunate that C++11 doesn't allow you to create custom LL/SC operations (e.g. with (x-1) & x inside an LL/SC instead of just the add that you'd get from using fetch_add), but CAS machines (like x86) can't give you the ABA detection that LL/SC provides. So it's not clear how you'd design a C++ class that could compile efficiently on x86 but also directly to an LL/SC retry loop on ARM and other LL/SC ISAs. (See this discussion.)
So you just have to write code using compare_exchange_weak if C++ doesn't provide the operation you want as a built-in RMW (it has fetch_or and fetch_and, for example, but nothing that clears the lowest set bit).
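One way to package that, as a sketch: a small helper that applies an arbitrary pure function via a compare_exchange_weak retry loop. The name fetch_update is mine; it is not a std::atomic member.

#include <atomic>

// Hypothetical helper: apply op to the atomic's value with a CAS retry loop.
// Returns the old value, like fetch_add does.
template <class T, class Op>
T fetch_update(std::atomic<T>& a, Op op) {
    T expected = a.load(std::memory_order_relaxed);
    while (!a.compare_exchange_weak(expected, op(expected))) {
        // expected now holds the freshly-loaded value; recompute and retry
    }
    return expected;
}

// usage, e.g.: fetch_update(mask, [](unsigned long x) { return x & (x - 1); });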
What you get in practice from current compilers is a compare_exchange_weak implemented with LL/SC, and your arithmetic operation separate from that. The asm actually does the relaxed load before the exclusive-load-acquire (ldaxr), instead of just basing the computation on the ldaxr result. And there's extra branching to check that expected from the last attempt matches the new load result before attempting the store.
The LL/SC window is maybe shorter than with 2 dependent ALU instructions between the load and store, in case that matters. The CPU has the desired value ready ahead of time, not dependent on the current load result (it depends on the previous load result). Clang's code-gen puts the sub/and inside the loop, but dependent on the previous iteration's load, so with out-of-order execution they can still be ready ahead of time.
But if that was really the most efficient way to do things, compilers should be using it for fetch_add as well. They don't, because it probably isn't. LL/SC retries are rare, just like CAS retries on x86. I don't know if it could make a difference for very-high-contention situations. Maybe, but compilers don't slow down the fast path to optimize for that when compiling fetch_add.
I renamed your functions and re-formatted the while() for readability, because it was already too long for one line without wrapping it with unlikely().
This version compiles to maybe slightly nicer asm than your original, with clang. I also fixed your function names so it actually compiles at all; there's a mismatch in your question. I picked totally different names that are similar to x86's BMI instruction names because they succinctly describe the operation.
#include <atomic>
#include <cstdint>
#ifdef __GNUC__
#define unlikely(expr) __builtin_expect(!!(expr), 0)
#define likely(expr) __builtin_expect(!!(expr), 1)