In the Intel? 64 and IA-32 Architectures Developer's Manual: Vol. 3A, which nowadays contains the specifications of the memory ordering white paper you mention, it is said in section 8.2.3.1, as you note yourself, that
The Intel-64 memory ordering model guarantees that, for each of the following
memory-access instructions, the constituent memory operation appears to execute
as a single memory access:
? Instructions that read or write a single byte.
? Instructions that read or write a word (2 bytes) whose address is aligned on a 2
byte boundary.
? Instructions that read or write a doubleword (4 bytes) whose address is aligned
on a 4 byte boundary.
? Instructions that read or write a quadword (8 bytes) whose address is aligned on
an 8 byte boundary.
Any locked instruction (either the XCHG instruction or another read-modify-write
instruction with a LOCK prefix) appears to execute as an indivisible and
uninterruptible sequence of load(s) followed by store(s) regardless of alignment.
Now, since the above list does NOT contain the same language for double quadword (16 bytes), it follows that the architecture does NOT guarantee that instructions which access 16 bytes of memory are atomic.
That being said, the last paragraph does hint at a way out, namely the CMPXCHG16B instruction with the LOCK prefix. You can use the CPUID instruction to figure out if your processor supports CMPXCHG16B (the "CX16" feature bit).
In the corresponding AMD document, AMD64 Technology AMD64 Architecture Programmer’s Manual Volume 2: System Programming, I can't find similar clear language.
EDIT: Test program results
(Test program modified to increase #iterations by a factor of 10)
On a Xeon X3450 (x86-64):
0000 999998139 1572
0001 0 0
0010 0 0
0011 0 0
0100 0 0
0101 0 0
0110 0 0
0111 0 0
1000 0 0
1001 0 0
1010 0 0
1011 0 0
1100 0 0
1101 0 0
1110 0 0
1111 1861 999998428
On a Xeon 5150 (32-bit):
0000 999243100 283087
0001 0 0
0010 0 0
0011 0 0
0100 0 0
0101 0 0
0110 0 0
0111 0 0
1000 0 0
1001 0 0
1010 0 0
1011 0 0
1100 0 0
1101 0 0
1110 0 0
1111 756900 999716913
On an Opteron 2435 (x86-64):
0000 999995893 1901
0001 0 0
0010 0 0
0011 0 0
0100 0 0
0101 0 0
0110 0 0
0111 0 0
1000 0 0
1001 0 0
1010 0 0
1011 0 0
1100 0 0
1101 0 0
1110 0 0
1111 4107 999998099
Does this mean that Intel and/or AMD guarantee that 16 byte memory accesses are atomic on these machines? IMHO, it does not. It's not in the documentation as guaranteed architectural behavior, and thus one cannot know if on these particular processors 16 byte memory accesses really are atomic or whether the test program merely fails to trigger them for one reason or another. And thus relying on it is dangerous.
EDIT 2: How to make the test program fail
Ha! I managed to make the test program fail. On the same Opteron 2435 as above, with the same binary, but now running it via the "numactl" tool specifying that each thread runs on a separate socket, I got:
0000 999998634 5990
0001 0 0
0010 0 0
0011 0 0
0100 0 0
0101 0 0
0110 0 0
0111 0 0
1000 0 0
1001 0 0
1010 0 0
1011 0 0
1100 0 1 Not a single memory access!
1101 0 0
1110 0 0
1111 1366 999994009
So what does this imply? Well, the Opteron 2435 may, or may not, guarantee that 16-byte memory accesses are atomic for intra-socket accesses, but at least the cache coherency protocol running on the HyperTransport interconnect between the two sockets does not provide such a guarantee.
EDIT 3: ASM for the thread functions, on request of "GJ."
Here's the generated asm for the thread functions for the GCC 4.4 x86-64 version used on the Opteron 2435 system:
.globl thread2
.type thread2, @function
thread2:
.LFB537:
.cfi_startproc
movdqa .LC3(%rip), %xmm1
xorl %eax, %eax
.p2align 5,,24
.p2align 3
.L11:
movaps x(%rip), %xmm0
incl %eax
movaps %xmm1, x(%rip)
movmskps %xmm0, %edx
movslq %edx, %rdx
incl n2(,%rdx,4)
cmpl $1000000000, %eax
jne .L11
xorl %eax, %eax
ret
.cfi_endproc
.LFE537:
.size thread2, .-thread2
.p2align 5,,31
.globl thread1
.type thread1, @function
thread1:
.LFB536:
.cfi_startproc
pxor %xmm1, %xmm1
xorl %eax, %eax
.p2align 5,,24
.p2align 3
.L15:
movaps x(%rip), %xmm0
incl %eax
movaps %xmm1, x(%rip)
movmskps %xmm0, %edx
movslq %edx, %rdx
incl n1(,%rdx,4)
cmpl $1000000000, %eax
jne .L15
xorl %eax, %eax
ret
.cfi_endproc
and for completeness, .LC3 which is the static data containing the (-1, -1, -1, -1) vector used by thread2:
.LC3:
.long -1
.long -1
.long -1
.long -1
.ident "GCC: (GNU) 4.4.4 20100726 (Red Hat 4.4.4-13)"
.section .note.GNU-stack,"",@progbits
Also note that this is AT&T ASM syntax, not the Intel syntax Windows programmers might be more familiar with. Finally, this is with march=native which makes GCC prefer MOVAPS; but it doesn't matter, if I use march=core2 it will use MOVDQA for storing to x, and I can still reproduce the failures.