Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
545 views
in Technique[技术] by (71.8m points)

winapi - Alignment requirements for atomic x86 instructions vs. MS's InterlockedCompareExchange documentation?

Microsoft offers the InterlockedCompareExchange function for performing atomic compare-and-swap operations. There is also an _InterlockedCompareExchange intrinsic.

On x86 these are implemented using the lock cmpxchg instruction.

However, reading through the documentation on these three approaches, they don't seem to agree on the alignment requirements.

Intel's reference manual says nothing about alignment (other than that if alignment checking is enabled and an unaligned memory reference is made, an exception is generated)

I also looked up the lock prefix, which specifically states that

The integrity of the LOCK prefix is not affected by the alignment of the memory field.

(emphasis mine)

So Intel seems to say that alignment is irrelevant. The operation will be atomic no matter what.

The _InterlockedCompareExchange intrinsic documentation also says nothing about alignment, however the InterlockedCompareExchange function states that

The parameters for this function must be aligned on a 32-bit boundary; otherwise, the function will behave unpredictably on multiprocessor x86 systems and any non-x86 systems.

So what gives? Are the alignment requirements for InterlockedCompareExchange just to make sure the function will work even on pre-486 CPU's where the cmpxchg instruction isn't available? That seems likely based on the above information, but I'd like to be sure before I rely on it. :)

Or is alignment required by the ISA to guarantee atomicity, and I'm just looking the wrong places in Intel's reference manuals?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

x86 does not require alignment for a lock cmpxchg instruction to be atomic. However, alignment is necessary for good performance.

This should be no surprise, backward compatibility means that software written with a manual from 14 years ago will still run on today's processors. Modern CPUs even have a performance counter specifically for split-lock detection because it's so expensive. (The core can't just hold onto exclusive access to a single cache line for the duration of the operation; it does have to do something like a traditional bus lock).

Why exactly Microsoft documents an alignment requirement is not clear. It's certainly necessary for supporting RISC architectures, but the specific claim of unpredictable behaviour on multiprocessor x86 might not even be valid. (Unless they mean unpredictable performance, rather than a correctness problem.)

Your guess of applying only to pre-486 systems without lock cmpxchg might be right; a different mechanism would be needed there which might have required some kind of locking around pure loads or pure stores. (Also note that 486 cmpxchg has a different and currently-undocumented opcode (0f a7) from modern cmpxchg (0f b1) which was new with 586 Pentium; Windows might have only used cmpxchg on P5 Pentium and later, I don't know.) That could maybe explain weirdness on some x86, without implying weirdness on modern x86.

Intel? 64 and IA-32 Architectures Software Developer’s Manual
Volume 3 (3A): System Programming Guide
January 2013

8.1.2.2 Software Controlled Bus Locking

To explicitly force the LOCK semantics, software can use the LOCK prefix with the following instructions when they are used to modify a memory location. [...]

? The exchange instructions (XADD, CMPXCHG, and CMPXCHG8B).
? The LOCK prefix is automatically assumed for XCHG instruction.
? [...]

[...] The integrity of a bus lock is not affected by the alignment of the memory field. The LOCK semantics are followed for as many bus cycles as necessary to update the entire operand. However, it is recommend that locked accesses be aligned on their natural boundaries for better system performance:

? Any boundary for an 8-bit access (locked or otherwise).
? 16-bit boundary for locked word accesses.
? 32-bit boundary for locked doubleword accesses.
? 64-bit boundary for locked quadword accesses.


Fun fact: cmpxchg without a lock prefix is still atomic wrt. context switches, so is usable for multi-threading on a single-core system.

Even misaligned it's still atomic wrt. interrupts (either completely before or completely after), and only memory reads by other devices (e.g. DMA) could see tearing. But such accesses could also see the separation between load and store, so even if old Windows did use that for a more efficient InterlockedCompareExchange on single-core systems, it still wouldn't require alignment for correctness, only performance. If this can be used for hardware access, Windows probably wouldn't do that.

If the library function needed to do a pure load separate from the lock cmpxchg this might make sense, but it doesn't need to do that. (If not inlined, the 32-bit version would have to load its args from the stack, but that's private, not access to the shared variable.)


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...