The example at 8.2.3.5 should be "surprising" if you expect memory ordering to be all strict and clean, and even if you acknowledge that 8.2.3.4 allows loads to reorder with earlier stores to different addresses.
Processor 0    | Processor 1
--------------------------------------
mov [x], 1     | mov [y], 1
mov R1, [x]    | mov R3, [y]
mov R2, [y]    | mov R4, [x]
Note that the key part is that the newly added loads in the middle both return 1 (store-to-load forwarding makes that possible in the uarch without stalling). So in theory, you would expect both stores to have been "observed" globally by the time both of these loads completed (that would have been the case with sequential consistency, where there is a unique ordering between stores and all cores see it).
However, having R2 = R4 = 0 afterwards as a valid outcome proves this is not the case: the stores are in fact observed locally first. In other words, allowing this outcome means that processor 0 sees the stores as time(x) < time(y), while processor 1 sees the opposite.
This is a very important observation about the consistency of this memory model, which the previous example doesn't prove. This nuance is the biggest difference between Sequential Consistency and Total Store Ordering - the second example breaks SC, the first one doesn't.
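To make the "observed locally first" point concrete, here is a rough sketch of a toy model that enumerates every interleaving of the 8.2.3.5 example. It is not how real hardware works and not anything from the Intel manual: the function name `reachable_outcomes`, the tuple encoding of instructions, and the per-processor FIFO store buffer with forwarding are all assumptions made for illustration, with x = y = 0 initially.

```python
def reachable_outcomes(programs, store_buffers):
    """Return every final register assignment reachable from x = y = 0.

    With store_buffers=True, each CPU queues its stores in a private FIFO
    and its loads read the newest matching queued store first (forwarding).
    With store_buffers=False, stores hit memory immediately (an SC-like model).
    """
    n = len(programs)
    # State: (program counters, memory, per-CPU store buffers, registers).
    start = ((0,) * n, (("x", 0), ("y", 0)), ((),) * n, ())
    seen, stack, results = {start}, [start], set()
    while stack:
        pcs, mem, bufs, regs = stack.pop()
        successors = []
        for cpu in range(n):
            # Step kind 1: drain the oldest buffered store to shared memory.
            if bufs[cpu]:
                (addr, val), rest = bufs[cpu][0], bufs[cpu][1:]
                new_mem = tuple(sorted({**dict(mem), addr: val}.items()))
                new_bufs = bufs[:cpu] + (rest,) + bufs[cpu + 1:]
                successors.append((pcs, new_mem, new_bufs, regs))
            # Step kind 2: execute this CPU's next instruction.
            if pcs[cpu] < len(programs[cpu]):
                op, a, b = programs[cpu][pcs[cpu]]
                new_pcs = pcs[:cpu] + (pcs[cpu] + 1,) + pcs[cpu + 1:]
                if op == "store":  # a = address, b = value
                    if store_buffers:
                        new_bufs = bufs[:cpu] + (bufs[cpu] + ((a, b),),) + bufs[cpu + 1:]
                        successors.append((new_pcs, mem, new_bufs, regs))
                    else:
                        new_mem = tuple(sorted({**dict(mem), a: b}.items()))
                        successors.append((new_pcs, new_mem, bufs, regs))
                else:  # load: a = destination register, b = address
                    pending = [v for (addr, v) in bufs[cpu] if addr == b]
                    value = pending[-1] if pending else dict(mem)[b]
                    new_regs = tuple(sorted({**dict(regs), a: value}.items()))
                    successors.append((new_pcs, mem, bufs, new_regs))
        if not successors:
            # Both programs finished and all buffers drained: a final state.
            results.add(regs)
        for state in successors:
            if state not in seen:
                seen.add(state)
                stack.append(state)
    return results


# The two instruction streams from the table above.
PROGRAMS = (
    (("store", "x", 1), ("load", "R1", "x"), ("load", "R2", "y")),
    (("store", "y", 1), ("load", "R3", "y"), ("load", "R4", "x")),
)
# The outcome discussed above: middle loads return 1, last loads return 0.
SURPRISING = (("R1", 1), ("R2", 0), ("R3", 1), ("R4", 0))

with_buffers = reachable_outcomes(PROGRAMS, store_buffers=True)
without_buffers = reachable_outcomes(PROGRAMS, store_buffers=False)
```

In this toy model, `SURPRISING in with_buffers` holds while `SURPRISING in without_buffers` does not: each processor reads its own store out of its buffer before the other processor can see it in memory, which is the "local first" observation described above.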