Sometimes associativity can be used to loose data dependencies and I was curious how much it can help. I was rather surprised to find out that I can nearly get a speed-up factor of 4 by manually unrolling a trivial loop, both in Java (build 1.7.0_51-b13) and in C (gcc 4.4.3).
So either I'm doing something pretty stupid or the compilers ignore a powerful tool. I started with
int a = 0;
for (int i=0; i<N; ++i) a = M1 * a + t[i];
which computes something close to String.hashCode()
(set M1=31
and use a char[]
). The computation is pretty trivial and for t.length=1000
takes about 1.2 microsecond on my i5-2400 @ 3.10GHz (both in Java and C).
Observe that each two steps a
gets multiplied by M2 = M1*M1
and added something. This leads to this piece of code
int a = 0;
for (int i=0; i<N; i+=2) {
a = M2 * a + (M1 * t[i] + t[i+1]); // <-- note the parentheses!
}
if (i < len) a = M1 * a + t[i]; // Handle odd length.
This is exactly twice as fast as the first snippet. Strangely, leaving out the parentheses eats 20% of the speed-up. Funnily enough, this can be repeated and a factor of 3.8 can be achieved.
Unlike java, gcc -O3
chooses not to unroll the loop. It's wise choice since it wouldn't help anyway (as -funroll-all-loops
shows).
So my question1 is: What prevents such an optimization?
Googling didn't work, I got "associative arrays" and "associative operators" only.
Update
I polished up my benchmark a little bit and can provide some results now. There's no speedup beyond unrolling 4 times, probably because of multiplication and addition together taking 4 cycles.
Update 2
As Java already unrolls the loop, all the hard work is done. What we get is something like
...pre-loop
for (int i=0; i<N; i+=2) {
a2 = M1 * a + t[i];
a = M1 * a2 + t[i+1];
}
...post-loop
where the interesting part can be rewritten like
a = M1 * ((M1 * a) + t[i]) + t[i+1]; // latency 2mul + 2add
This reveals that there are 2 multiplications and 2 additions, all of them to be performed sequentially, thus needing 8 cycles on a modern x86 CPU. All we need now is some primary school math (working for int
s even in case of overflow or whatever, but not applicable to floating point).
a = ((M1 * (M1 * a)) + (M1 * t[i])) + t[i+1]; // latency 2mul + 2add
So far we gained nothing, but it allows us to fold the constants
a = ((M2 * a) + (M1 * t[i])) + t[i+1]; // latency 1mul + 2add
and gain even more by regrouping the sum
a = (M2 * a) + ((M1 * t[i]) + t[i+1]); // latency 1mul + 1add
See Question&Answers more detail:
os