In almost any situation where there's a fast mode and a safe mode, you'll find a trade-off of some sort. Otherwise everything would run in fast-safe mode :-).
And, if you're getting different results with the same input, your process is not deterministic, no matter how much you believe it to be (in spite of the empirical evidence).
I would say your explanation is the most likely. Put it in safe mode and see if the non-determinism goes away. That will tell you for sure.
As to whether there are other optimizations: if you're compiling on the same hardware with the same compiler/linker and the same options to those tools, it should generate identical code. I can't see any possibility other than the fast mode (or bit rot in the memory due to cosmic rays, but that's pretty unlikely).
Following your update:
Intel has a document here which explains some of the optimizations the compiler is not allowed to perform in safe mode, including but not limited to:
- reassociation: `(a+b)+c -> a+(b+c)`
- zero folding: `x + 0 -> x`, `x * 0 -> 0`
- reciprocal multiply: `a/b -> a*(1/b)`
While you state that these operations are compile-time defined, the Intel chips are pretty darned clever: they can reorder instructions to keep pipelines full in multi-CPU set-ups, so, unless the code specifically prohibits such behavior, things may change at run-time (not compile-time) to keep everything going at full speed.
This is covered (briefly) on page 15 of that linked document that talks about vectorization ("Issue: different results re-running the same binary on the same data on the same processor").
My advice would be to decide whether you need raw grunt or total reproducibility of results, and then choose the mode based on that.