No, volatile
wouldn't be helpful at all, and certainly not for this reason. It would just give that first read "acquire" semantics instead of effectively relaxed, but either way will compile to similar asm that runs a load instruction.
if you get a dirty value from a CPU cache
CPU caches are coherent, so anything you read from CPU cache is the current globally agreed-on value for this line. "Dirty" just means it doesn't match DRAM contents, and will have to get written-back if / when evicted. A load value can also be forwarded from the store buffer, for a value this thread stored recently that isn't yet globally visible, but that's fine, Interlocked methods are full barriers that result in waiting for the store buffer to drain as well.
If you mean stale, then no, that's impossible, cache coherency protocols like MESI prevent that. This is why Interlocked things like CAS aren't horribly slow if the cache line is already owned by this core (MESI Modified or Exclusive state). See Myths Programmers Believe about CPU Caches which talks some about Java volatiles, which I think are similar to C# volatile.
This C++11 answer also explains some about cache coherency and asm. (Note that C++11 volatile
is significantly different from C#, and doesn't imply any thread-safety or ordering, but does still imply the asm must do a load or a store, not optimize into a register.)
On non-x86, running extra barrier instructions after the initial read (to give those acquire semantics) before you even try a CAS just slows things down. (On x86 including x86-64, a volatile read compiles to the same asm as a plain read, except it prevents compile-time reordering).
A volatile read can't be optimized into just using a value in a register if the current thread just wrote something via a non-interlocked =
assignment. That's not helpful either; if we just stored something and remember in a register what we stored, a load that does store-forwarding from the store buffer is morally equivalent to just using the register value.
Most of the good use-cases for lock-free atomics are when contention is lowish, so usually things can succeed without hardware having to wait a long time for access / ownership of the cache line. So you want to make the uncontended case as fast as possible. Avoid volatile
even if there was anything to gain from it in highly-contended cases, which I don't think there is anyway.
If you ever did any plain stores (assignments with =
, not interlocked RMW), volatile
would have an effect on those, too. That might mean waiting for the store buffer to drain before later memory ops in this thread can run, if C# volatile
gives semantics like C++ memory_order_seq_cst
. In that case, you'd be slowing down the code involving the stores a lot, if you didn't need ordering wrt. other loads/stores. If you did such a store before this CAS code, yeah you'd be waiting until the store (and all previous stores) were globally visible to try reloading it. This would mean a reload + CAS the CPU is waiting to do right after are very likely to not have to spin because the CPU will have ownership of that line, but I think you'd effectively get similar behaviour from the full barrier that's part of an Interlocked CAS.