The sorting algorithm of this question becomes twice faster(!) if -fprofile-arcs
is enabled in gcc (4.7.2). The heavily simplified C code of that question (it turned out that I can initialize the array with all zeros, the weird performance behavior remains but it makes the reasoning much much simpler):
#include <time.h>
#include <stdio.h>
#define ELEMENTS 100000
int main() {
int a[ELEMENTS] = { 0 };
clock_t start = clock();
for (int i = 0; i < ELEMENTS; ++i) {
int lowerElementIndex = i;
for (int j = i+1; j < ELEMENTS; ++j) {
if (a[j] < a[lowerElementIndex]) {
lowerElementIndex = j;
}
}
int tmp = a[i];
a[i] = a[lowerElementIndex];
a[lowerElementIndex] = tmp;
}
clock_t end = clock();
float timeExec = (float)(end - start) / CLOCKS_PER_SEC;
printf("Time: %2.3f
", timeExec);
printf("ignore this line %d
", a[ELEMENTS-1]);
}
After playing with the optimization flags for a long while, it turned out that -ftree-vectorize
also yields this weird behavior so we can take -fprofile-arcs
out of the question. After profiling with perf
I have found that the only relevant difference is:
Fast case gcc -std=c99 -O2 simp.c
(runs in 3.1s)
cmpl %esi, %ecx
jge .L3
movl %ecx, %esi
movslq %edx, %rdi
.L3:
Slow case gcc -std=c99 -O2 -ftree-vectorize simp.c
(runs in 6.1s)
cmpl %ecx, %esi
cmovl %edx, %edi
cmovl %esi, %ecx
As for the first snippet: Given that the array only contains zeros, we always jump to .L3
. It can greatly benefit from branch prediction.
I guess the cmovl
instructions cannot benefit from branch prediction.
Questions:
Are all my above guesses correct? Does this make the algorithm slow?
If yes, how can I prevent gcc from emitting this instruction (other than the trivial -fno-tree-vectorization
workaround of course) but still doing as much optimizations as possible?
What is this -ftree-vectorization
? The documentation is quite
vague, I would need a little more explanation to understand what's happening.
Update: Since it came up in comments: The weird performance behavior w.r.t. the -ftree-vectorize
flag remains with random data. As Yakk points out, for selection sort, it is actually hard to create a dataset that would result in a lot of branch mispredictions.
Since it also came up: I have a Core i5 CPU.
Based on Yakk's comment, I created a test. The code below (online without boost) is of course no longer a sorting algorithm; I only took out the inner loop. Its only goal is to examine the effect of branch prediction: We skip the if
branch in the for
loop with probability p
.
#include <algorithm>
#include <cstdio>
#include <random>
#include <boost/chrono.hpp>
using namespace std;
using namespace boost::chrono;
constexpr int ELEMENTS=1e+8;
constexpr double p = 0.50;
int main() {
printf("p = %.2f
", p);
int* a = new int[ELEMENTS];
mt19937 mt(1759);
bernoulli_distribution rnd(p);
for (int i = 0 ; i < ELEMENTS; ++i){
a[i] = rnd(mt)? i : -i;
}
auto start = high_resolution_clock::now();
int lowerElementIndex = 0;
for (int i=0; i<ELEMENTS; ++i) {
if (a[i] < a[lowerElementIndex]) {
lowerElementIndex = i;
}
}
auto finish = high_resolution_clock::now();
printf("%ld ms
", duration_cast<milliseconds>(finish-start).count());
printf("Ignore this line %d
", a[lowerElementIndex]);
delete[] a;
}
The loops of interest:
This will be referred to as cmov
g++ -std=c++11 -O2 -lboost_chrono -lboost_system -lrt branch3.cpp
xorl %eax, %eax
.L30:
movl (%rbx,%rbp,4), %edx
cmpl %edx, (%rbx,%rax,4)
movslq %eax, %rdx
cmovl %rdx, %rbp
addq $1, %rax
cmpq $100000000, %rax
jne .L30
This will be referred to as no cmov, the -fno-if-conversion
flag was pointed out by Turix in his answer.
g++ -std=c++11 -O2 -fno-if-conversion -lboost_chrono -lboost_system -lrt branch3.cpp
xorl %eax, %eax
.L29:
movl (%rbx,%rbp,4), %edx
cmpl %edx, (%rbx,%rax,4)
jge .L28
movslq %eax, %rbp
.L28:
addq $1, %rax
cmpq $100000000, %rax
jne .L29
The difference side by side
cmpl %edx, (%rbx,%rax,4) | cmpl %edx, (%rbx,%rax,4)
movslq %eax, %rdx | jge .L28
cmovl %rdx, %rbp | movslq %eax, %rbp
| .L28:
The execution time as a function of the Bernoulli parameter p
The code with the cmov
instruction is absolutely insensitive to p
. The code without the cmov
instruction is the winner if p<0.26
or 0.81<p
and is at most 4.38x faster (p=1
). Of course, the worse situation for the branch predictor is at around p=0.5
where the code is 1.58x slower than the code with the cmov
instruction.
See Question&Answers more detail:
os