Let's first consider what your code is doing. Essentially, your code transforms a matrix (2D array) in which the values in each row depend on the previous row, but the values in each column are independent of the other columns. Let me choose a simpler example of this:
    for(int i=1; i<n; i++) {
        for(int j=0; j<n; j++) {
            a[i*n+j] += a[(i-1)*n+j];
        }
    }
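To make the dependency concrete, here is a minimal serial sketch of the same loop wrapped in a function. The name `column_sum_serial` and the `double` element type are assumptions for illustration; the original does not say what type `a` has.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: serial reference version of the loop above.
 * a is an n x n matrix stored row by row (element (i,j) at a[i*n+j]).
 * Row i is updated from row i-1, so the rows must be visited in order;
 * the columns within a row do not depend on each other. */
void column_sum_serial(double *a, int n)
{
    assert(a != NULL && n > 0);
    for (int i = 1; i < n; i++) {
        for (int j = 0; j < n; j++) {
            a[i*n + j] += a[(i-1)*n + j];
        }
    }
}
```

Starting from a matrix filled with ones, every element of row i ends up as i+1, which is a convenient check for the parallel variants.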
One way to parallelize this is to swap the loops like this:
Method 1:
    #pragma omp parallel for
    for(int j=0; j<n; j++) {
        for(int i=1; i<n; i++) {
            a[i*n+j] += a[(i-1)*n+j];
        }
    }
With this method each thread runs all n-1 iterations of i in the inner loop but only n/nthreads iterations of j. This effectively processes strips of columns in parallel. However, this method is highly cache unfriendly: consecutive iterations of the inner loop touch elements that are n apart in memory instead of adjacent ones.
Another possibility is to only parallelize the inner loop.
Method 2:
    for(int i=1; i<n; i++) {
        #pragma omp parallel for
        for(int j=0; j<n; j++) {
            a[i*n+j] += a[(i-1)*n+j];
        }
    }
This essentially processes the columns of a single row in parallel but the rows sequentially. Only the master thread iterates over i.
Another way to process the columns in parallel but each row sequentially is:
Method 3:
    #pragma omp parallel
    for(int i=1; i<n; i++) {
        #pragma omp for
        for(int j=0; j<n; j++) {
            a[i*n+j] += a[(i-1)*n+j];
        }
    }
In this method, like method 1, each thread runs all n-1 iterations of i. However, this method has an implicit barrier after the inner loop, which makes each thread wait until all threads have finished a row. The rows are therefore processed sequentially, as in method 2.
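The best solution is one which processes strips of columns in parallel like method 1 but is still cache friendly. This can be achieved using the nowait clause. Dropping the barrier is only safe if each thread is assigned the same columns in every row, which a static schedule guarantees when the loop bounds are the same for every row.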
Method 4:
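    #pragma omp parallel
    for(int i=1; i<n; i++) {
        // schedule(static) is spelled out so the column-to-thread mapping is the same for every row
        #pragma omp for schedule(static) nowait
        for(int j=0; j<n; j++) {
            a[i*n+j] += a[(i-1)*n+j];
        }
    }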
In my tests the nowait clause does not make much difference. This is probably because the load is even (which is why static scheduling is ideal in this case). If the load were less even, nowait would probably make more of a difference.
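One way to see what method 4 amounts to under a static schedule is to write the column strips out by hand. The following is only a sketch, not the code that was timed: the function name `column_sum_strips`, the `double` element type, and the exact way the columns are split are assumptions. The `_OPENMP` guards let it compile (as a single-threaded loop) even without OpenMP enabled.

```c
#include <assert.h>
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: each thread owns one contiguous strip of columns
 * and walks down all rows of that strip without synchronizing. */
void column_sum_strips(double *a, int n)
{
    assert(a != NULL && n > 0);
    #pragma omp parallel
    {
        int tid = 0, nthreads = 1;
#ifdef _OPENMP
        tid = omp_get_thread_num();
        nthreads = omp_get_num_threads();
#endif
        /* Column range [lo, hi) owned by this thread. */
        int lo = (int)((long long)n * tid / nthreads);
        int hi = (int)((long long)n * (tid + 1) / nthreads);

        /* Rows in order, columns contiguous: the inner loop has unit stride. */
        for (int i = 1; i < n; i++) {
            for (int j = lo; j < hi; j++) {
                a[i*n + j] += a[(i-1)*n + j];
            }
        }
    }
}
```

No barrier is needed because a thread only ever reads and writes columns in its own strip, so the row i-1 values it needs were written by that same thread.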
Here are the times in seconds for n=3000 on my four-core Ivy Bridge (IVB) system with GCC 4.9.2:
method 1: 3.00
method 2: 0.26
method 3: 0.21
method 4: 0.21
This test is probably memory bandwidth bound so I could have chosen a better case using more computation but nevertheless the differences are significant enough. In order to remove a bias due to creating the thread pool I ran one of the methods without timing it first.
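For reference, a timing harness along these lines could look like the sketch below. It is not the harness used for the numbers above: the names `wall_seconds`, `fill_ones`, `method3` and `time_method`, the `double` element type, and the choice to warm up with the same method being timed are all assumptions. Without OpenMP it falls back to `clock()`, which measures CPU time rather than wall time.

```c
#include <stdlib.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Wall-clock seconds with OpenMP; CPU-time fallback otherwise. */
static double wall_seconds(void)
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    return (double)clock() / CLOCKS_PER_SEC;
#endif
}

static void fill_ones(double *a, size_t count)
{
    for (size_t k = 0; k < count; k++) a[k] = 1.0;
}

/* Method 3: one parallel region, barrier after each row. */
void method3(double *a, int n)
{
    #pragma omp parallel
    for (int i = 1; i < n; i++) {
        #pragma omp for
        for (int j = 0; j < n; j++) {
            a[i*n + j] += a[(i-1)*n + j];
        }
    }
}

/* Hypothetical harness: returns elapsed seconds, or -1.0 on allocation failure. */
double time_method(void (*method)(double *, int), int n)
{
    size_t count = (size_t)n * (size_t)n;
    double *a = malloc(count * sizeof *a);
    if (a == NULL) return -1.0;

    fill_ones(a, count);
    method(a, n);            /* untimed warm-up: thread pool gets created here */

    fill_ones(a, count);     /* reset so the timed run starts from the same data */
    double t0 = wall_seconds();
    method(a, n);
    double t1 = wall_seconds();

    free(a);
    return t1 - t0;
}
```

A call such as `time_method(method3, 3000)` would then give one measurement; repeating it a few times and taking the minimum or median is a common way to reduce noise.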
It's clear from the timing how cache unfriendly method 1 is. It's also clear that method 3 is faster than method 2 and that nowait has little effect in this case.
Since method 2 and method 3 both process the columns of a row in parallel but the rows sequentially, one might expect their timings to be the same. So why do they differ? Let me make some observations:
Due to a thread pool the threads are not created and destroyed for each iteration of the outer loop of method 2 so it's not clear to me what the extra overhead is. Note that OpenMP says nothing about a thread pool. This is something that each compiler implements.
The only other difference between method 3 and method 2 is that in method 2 only the master thread processes i, whereas in method 3 each thread processes a private i. But this seems too trivial to me to explain the significant difference between the methods, because the implicit barrier in method 3 causes the threads to sync anyway and processing i is a matter of an increment and a conditional test.
The fact that method 3 is no slower than method 4, which processes whole strips of columns in parallel, says that the extra overhead in method 2 is all in leaving and entering a parallel region for each iteration of i.
So my conclusion is that explaining why method 2 is so much slower than method 3 requires looking into the implementation of the thread pool. For GCC, which uses pthreads, this could probably be explained by creating a toy model of a thread pool, but I don't have enough experience with that yet.