This afflicts only the 32-bit compiler; x86-64 builds are not affected, regardless of optimization settings. However, you see the problem manifest in 32-bit builds whether optimizing for speed (/O2) or size (/O1). As you mentioned, it works as expected in debugging builds with optimization disabled.
Wimmel's suggestion of changing the packing, accurate though it is, does not change the behavior. (The code below assumes the packing is correctly set to 1 for WMatrix.)
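For reference, here is roughly the shape of struct the rest of this answer assumes: sixteen individually named float members, packed to 1, which matches the _11 member and the byte offsets 0 through 60 visible in the listings below. This is a sketch reconstructed from those details, not a verbatim copy of the question's code:

#pragma pack(push, 1)
struct WMatrix
{
    float _11, _12, _13, _14;
    float _21, _22, _23, _24;
    float _31, _32, _33, _34;
    float _41, _42, _43, _44;
};
#pragma pack(pop)

WMatrix mul1(WMatrix m, float f);   // scales every element of m by f (shown below)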
I can't reproduce it in VS 2010, but I can in VS 2013 and 2015. I don't have 2012 installed. That's good enough, though, to allow us to analyze the difference between the object code produced by the two compilers.
Here is the code for mul1 from VS 2010 (the "working" code):
(Actually, in many cases, the compiler inlined the code from this function at the call site. But the compiler will still output disassembly files containing the code it generated for the individual functions prior to inlining. That's what we're looking at here, because it is less cluttered. The behavior of the code is entirely equivalent whether it's been inlined or not.)
PUBLIC mul1
_TEXT SEGMENT
_m$ = 8 ; size = 64
_f$ = 72 ; size = 4
mul1 PROC
___$ReturnUdt$ = eax
push esi
push edi
; WMatrix out = m;
mov ecx, 16 ; 00000010H
lea esi, DWORD PTR _m$[esp+4]
mov edi, eax
rep movsd
; for (unsigned int i = 0; i < 4; i++)
; {
; for (unsigned int j = 0; j < 4; j++)
; {
; unsigned int idx = i * 4 + j; // critical code
; *(&out._11 + idx) *= f; // critical code
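; (note: the loops have been fully unrolled; each of the 16 elements below
;  gets the same pattern: load the float, widen it and f to double with
;  cvtps2pd, multiply with mulsd, narrow back to float, and store it)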
movss xmm0, DWORD PTR [eax]
cvtps2pd xmm1, xmm0
movss xmm0, DWORD PTR _f$[esp+4]
cvtps2pd xmm2, xmm0
mulsd xmm1, xmm2
cvtpd2ps xmm1, xmm1
movss DWORD PTR [eax], xmm1
movss xmm1, DWORD PTR [eax+4]
cvtps2pd xmm1, xmm1
cvtps2pd xmm2, xmm0
mulsd xmm1, xmm2
cvtpd2ps xmm1, xmm1
movss DWORD PTR [eax+4], xmm1
movss xmm1, DWORD PTR [eax+8]
cvtps2pd xmm1, xmm1
cvtps2pd xmm2, xmm0
mulsd xmm1, xmm2
cvtpd2ps xmm1, xmm1
movss DWORD PTR [eax+8], xmm1
movss xmm1, DWORD PTR [eax+12]
cvtps2pd xmm1, xmm1
cvtps2pd xmm2, xmm0
mulsd xmm1, xmm2
cvtpd2ps xmm1, xmm1
movss DWORD PTR [eax+12], xmm1
movss xmm2, DWORD PTR [eax+16]
cvtps2pd xmm2, xmm2
cvtps2pd xmm1, xmm0
mulsd xmm1, xmm2
cvtpd2ps xmm1, xmm1
movss DWORD PTR [eax+16], xmm1
movss xmm1, DWORD PTR [eax+20]
cvtps2pd xmm1, xmm1
cvtps2pd xmm2, xmm0
mulsd xmm1, xmm2
cvtpd2ps xmm1, xmm1
movss DWORD PTR [eax+20], xmm1
movss xmm1, DWORD PTR [eax+24]
cvtps2pd xmm1, xmm1
cvtps2pd xmm2, xmm0
mulsd xmm1, xmm2
cvtpd2ps xmm1, xmm1
movss DWORD PTR [eax+24], xmm1
movss xmm1, DWORD PTR [eax+28]
cvtps2pd xmm1, xmm1
cvtps2pd xmm2, xmm0
mulsd xmm1, xmm2
cvtpd2ps xmm1, xmm1
movss DWORD PTR [eax+28], xmm1
movss xmm1, DWORD PTR [eax+32]
cvtps2pd xmm1, xmm1
cvtps2pd xmm2, xmm0
mulsd xmm1, xmm2
cvtpd2ps xmm1, xmm1
movss DWORD PTR [eax+32], xmm1
movss xmm1, DWORD PTR [eax+36]
cvtps2pd xmm1, xmm1
cvtps2pd xmm2, xmm0
mulsd xmm1, xmm2
cvtpd2ps xmm1, xmm1
movss DWORD PTR [eax+36], xmm1
movss xmm2, DWORD PTR [eax+40]
cvtps2pd xmm2, xmm2
cvtps2pd xmm1, xmm0
mulsd xmm1, xmm2
cvtpd2ps xmm1, xmm1
movss DWORD PTR [eax+40], xmm1
movss xmm1, DWORD PTR [eax+44]
cvtps2pd xmm1, xmm1
cvtps2pd xmm2, xmm0
mulsd xmm1, xmm2
cvtpd2ps xmm1, xmm1
movss DWORD PTR [eax+44], xmm1
movss xmm2, DWORD PTR [eax+48]
cvtps2pd xmm1, xmm0
cvtps2pd xmm2, xmm2
mulsd xmm1, xmm2
cvtpd2ps xmm1, xmm1
movss DWORD PTR [eax+48], xmm1
movss xmm1, DWORD PTR [eax+52]
cvtps2pd xmm1, xmm1
cvtps2pd xmm2, xmm0
mulsd xmm1, xmm2
cvtpd2ps xmm1, xmm1
movss DWORD PTR [eax+52], xmm1
movss xmm1, DWORD PTR [eax+56]
cvtps2pd xmm1, xmm1
cvtps2pd xmm2, xmm0
mulsd xmm1, xmm2
cvtpd2ps xmm1, xmm1
cvtps2pd xmm0, xmm0
movss DWORD PTR [eax+56], xmm1
movss xmm1, DWORD PTR [eax+60]
cvtps2pd xmm1, xmm1
mulsd xmm1, xmm0
pop edi
cvtpd2ps xmm0, xmm1
movss DWORD PTR [eax+60], xmm0
pop esi
; return out;
ret 0
mul1 ENDP
Compare that to the code for mul1 generated by VS 2015:
mul1 PROC
_m$ = 8 ; size = 64
; ___$ReturnUdt$ = ecx
; _f$ = xmm2s
; WMatrix out = m;
movups xmm0, XMMWORD PTR _m$[esp-4]
; for (unsigned int i = 0; i < 4; i++)
xor eax, eax
movaps xmm1, xmm2
movups XMMWORD PTR [ecx], xmm0
movups xmm0, XMMWORD PTR _m$[esp+12]
shufps xmm1, xmm1, 0
movups XMMWORD PTR [ecx+16], xmm0
movups xmm0, XMMWORD PTR _m$[esp+28]
movups XMMWORD PTR [ecx+32], xmm0
movups xmm0, XMMWORD PTR _m$[esp+44]
movups XMMWORD PTR [ecx+48], xmm0
npad 4
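; (the movups pairs above copy the 64-byte matrix into the return slot,
;  and movaps/shufps broadcast f from xmm2 into all four lanes of xmm1
;  for the packed multiply in the loop below)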
$LL4@mul1:
; for (unsigned int j = 0; j < 4; j++)
; {
; unsigned int idx = i * 4 + j; // critical code
; *(&out._11 + idx) *= f; // critical code
movups xmm0, XMMWORD PTR [ecx+eax*4]
mulps xmm0, xmm1
movups XMMWORD PTR [ecx+eax*4], xmm0
inc eax
cmp eax, 4
jb SHORT $LL4@mul1
; return out;
mov eax, ecx
ret 0
?mul1@@YA?AUWMatrix@@U1@M@Z ENDP ; mul1
_TEXT ENDS
It is immediately obvious how much shorter the code is. Apparently the optimizer got a lot smarter between VS 2010 and VS 2015. Unfortunately, sometimes the source of the optimizer's "smarts" is the exploitation of bugs in your code.
Looking at the code that matches up with the loops, you can see that VS 2010 is unrolling the loops. All of the computations are done inline so that there are no branches. This is kind of what you'd expect for loops with upper and lower bounds that are known at compile time and, as in this case, reasonably small.
What happened in VS 2015? Well, it didn't unroll anything. There are 5 lines of code, and then a conditional jump JB back to the top of the loop sequence. That alone doesn't tell you much. What does look highly suspicious is that it only loops 4 times (see the cmp eax, 4 statement that sets flags right before doing the jb, effectively continuing the loop as long as the counter is less than 4). Well, that might be okay if it had merged the two loops into one. Let's see what it's doing inside of the loop:
$LL4@mul1:
movups xmm0, XMMWORD PTR [ecx+eax*4] ; load a packed unaligned value into XMM0
mulps xmm0, xmm1 ; do a packed multiplication of XMM0 by XMM1,
; storing the result in XMM0
movups XMMWORD PTR [ecx+eax*4], xmm0 ; store the result of the previous multiplication
; back into the memory location that we
; initially loaded from
inc eax ; one iteration done, increment loop counter
cmp eax, 4 ; see how many loops we've done
jb $LL4@mul1 ; keep looping if < 4 iterations
The code reads a value from memory (an XMM-sized value from the location determined by ecx + eax * 4) into XMM0, multiplies it by a value in XMM1 (which was set outside the loop, based on the f parameter), and then stores the result back into the original memory location.
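If you find raw assembly hard to read, the body of that loop corresponds roughly to the following SSE intrinsics. This is a sketch for illustration only, not the compiler's actual source; scale_row and factor are names invented for this example:

#include <xmmintrin.h>

// One iteration of the VS 2015 loop, expressed with SSE intrinsics.
// 'factor' is f broadcast to all four lanes (what the shufps did).
void scale_row(float *p, __m128 factor)
{
    __m128 row = _mm_loadu_ps(p);     // movups: unaligned load of 4 floats
    row = _mm_mul_ps(row, factor);    // mulps:  multiply all 4 lanes at once
    _mm_storeu_ps(p, row);            // movups: store the result back
}

The broadcast factor would be produced once, outside the loop, with something like _mm_set1_ps(f).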
Compare that to the code for the corresponding loop in mul2:
$LL4@mul2:
lea eax, DWORD PTR [eax+16]
movups xmm0, XMMWORD PTR [eax-24]
mulps xmm0, xmm2
movups XMMWORD PTR [eax-24], xmm0
sub ecx, 1
jne $LL4@mul2
Aside from a different loop control sequence (this sets ECX to 4 outside of the loop, subtracts 1 each time through, and keeps looping as long as ECX != 0), the big difference here is the actual XMM values that it manipulates in memory. Instead of loading from [ecx+eax*4], it loads from [eax-24] (after having previously added 16 to EAX).
What's different about mul2? You had added code to track a separate index in idx2, incrementing it each time through the loop. Now, this alone would not be enough. If you comment out the assignment to the bool variable b, mul1 and mul2 result in identical object code. Clearly, without the comparison of idx to idx2, the compiler is able to deduce that idx2 is completely unused, and therefore eliminate it, turning mul2 into mul1. But with that comparison, the compiler apparently becomes unable to eliminate idx2, and its presence ever so slightly changes what optimizations are deemed possible for the function, resulting in the output discrepancy.
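To make that concrete, here is a rough reconstruction of the two functions, based on the question's description and on the source comments embedded in the listings above. This is a sketch, not the asker's exact code:

WMatrix mul1(WMatrix m, float f)
{
    WMatrix out = m;
    for (unsigned int i = 0; i < 4; i++)
    {
        for (unsigned int j = 0; j < 4; j++)
        {
            unsigned int idx = i * 4 + j;   // critical code
            *(&out._11 + idx) *= f;         // critical code
        }
    }
    return out;
}

WMatrix mul2(WMatrix m, float f)
{
    WMatrix out = m;
    unsigned int idx2 = 0;                  // separately tracked index
    for (unsigned int i = 0; i < 4; i++)
    {
        for (unsigned int j = 0; j < 4; j++)
        {
            unsigned int idx = i * 4 + j;
            bool b = (idx == idx2);         // comment this out and mul2 compiles
                                            // to the same object code as mul1
            *(&out._11 + idx) *= f;
            idx2++;
        }
    }
    return out;
}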
Now the question turns to why this is happening. Is it an optimizer bug, as you first suspected? Well, no—and as some of the commenters have mentioned, it should never be your first instinct to blame the compiler/optimizer. Always assume that there are bugs in your code unless you can prove otherwise. That proof would always involve looking at the disassembly, and preferably referencing the relevant portions of the language standard if you really want to be taken seriously.
In this case, Mystical has already nailed the problem. Your code exhibits undefined behavior when it does *(&out._11 + idx). This makes certain assumptions about the layout of the WMatrix struct in memory, which you cannot legally make, even after explicitly setting the packing.
This is why undefined behavior is evil—it results in code that seems to work sometimes, but other times it doesn't. It is very sensitive to compiler flags, especially optimizations, but also target platforms (as we saw at the top of this answer). mul2 only works by accident. Both mul1 and mul2 are wrong. Unfortunately, the bug is in your code. Worse, the compiler didn't issue a warning that might have alerted you to your use of undefined behavior.
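If you want the indexed access without the undefined behavior, one option (a sketch, not the only possible fix) is to give WMatrix an actual array member, so that indexing is defined by the language rather than by assumptions about member layout:

struct WMatrix
{
    float m[4][4];   // element [i][j] replaces the individually named _11 .. _44
};

WMatrix mul(WMatrix v, float f)
{
    WMatrix out = v;
    for (unsigned int i = 0; i < 4; i++)
    {
        for (unsigned int j = 0; j < 4; j++)
        {
            out.m[i][j] *= f;   // plain array indexing: no layout assumptions
        }
    }
    return out;
}

The optimizer is still free to vectorize this loop just as it did above, but the result no longer hinges on details the standard leaves unspecified.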