For a homework assignment, I need to implement the multiplication of a matrix by a vector, parallelizing it by rows and by columns. I understand the row version, but I am a little confused by the column version.
Let's say we have the following data:
And the code for the row version:
#pragma omp parallel default(none) shared(i,v2,v1,matrix,tam) private(j)
{
    #pragma omp for
    for (i = 0; i < tam; i++)
        for (j = 0; j < tam; j++) {
            // printf("Thread %d did %d,%d\n", omp_get_thread_num(), i, j);
            v2[i] += matrix[i][j] * v1[j];
        }
}
Here the calculations are done right and the result is correct.
The column version:
#pragma omp parallel default(none) shared(j,v2,v1,matrix,tam) private(i)
{
    for (i = 0; i < tam; i++)
        #pragma omp for
        for (j = 0; j < tam; j++) {
            // printf("Thread %d did %d,%d\n", omp_get_thread_num(), i, j);
            v2[i] += matrix[i][j] * v1[j];
        }
}
Here, because of how the work is split, the result varies from run to run depending on which thread executes each column. Something interesting happens, though (I suspect it is due to compiler optimizations): if I uncomment the printf, the results are the same as in the row version and therefore correct. For example:
Thread 0 did 0,0
Thread 2 did 0,2
Thread 1 did 0,1
Thread 2 did 1,2
Thread 1 did 1,1
Thread 0 did 1,0
Thread 2 did 2,2
Thread 1 did 2,1
Thread 0 did 2,0
2.000000 3.000000 4.000000
3.000000 4.000000 5.000000
4.000000 5.000000 6.000000
V2:
20.000000, 26.000000, 32.000000,
That is correct, but if I remove the printf:
V2:
18.000000, 11.000000, 28.000000,
What kind of mechanism should I use to get the column version right?
Note: I care more about the explanation than about any code you may post as an answer, because what I really want is to understand what is going wrong in the column version.
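For readers comparing the two versions: `v2[i] += ...` is a read-modify-write on a shared element. In the row version each `v2[i]` is touched by only one thread, but in the column version several threads update the same `v2[i]` at once, so updates can be lost. The printf most likely only changes the timing enough to hide this; it does not make the code correct. A minimal sketch of one possible mechanism, an OpenMP `reduction`, is below. The function name and the flat row-major layout are assumptions made for illustration and are not the original code.

```c
/* Hypothetical helper, not from the original post: computes out = m * v
   for an n x n matrix stored row-major in a flat array (an assumption),
   splitting the work of each row across threads by columns. */
void matvec_cols_reduction(int n, const double *m, const double *v, double *out)
{
    for (int i = 0; i < n; i++) {
        double sum = 0.0;

        /* Each thread accumulates into its own private copy of sum over
           its share of the columns; OpenMP combines the private copies
           into the original sum when the loop finishes. */
        #pragma omp parallel for reduction(+:sum)
        for (int j = 0; j < n; j++)
            sum += m[i * n + j] * v[j];

        /* Only one thread runs here, so this write cannot race. */
        out[i] = sum;
    }
}
```

Because no thread ever writes to a shared accumulator directly, the race disappears. For floating-point data the summation order can still differ between runs, so the last digits may vary slightly.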
EDIT
I've found a way to get rid of the private vector proposed by Z boson in his answer. I replaced that vector with a scalar variable; here is the code:
#pragma omp parallel
{
    double sLocal = 0;
    int i, j;
    for (i = 0; i < tam; i++) {
        #pragma omp for
        for (j = 0; j < tam; j++) {
            sLocal += matrix[i][j] * v1[j];
        }
        #pragma omp critical
        {
            v2[i] += sLocal;
            sLocal = 0;
        }
    }
}
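The same idea can be packaged as a self-contained function, which makes it easier to test in isolation. This is only a sketch under stated assumptions: the function name, the flat row-major matrix layout, and the use of `atomic` in place of `critical` are illustrative choices, not part of the code above.

```c
/* Hypothetical helper: out = m * v for an n x n matrix stored row-major
   in a flat array (an assumption), parallelized by columns. */
void matvec_cols_atomic(int n, const double *m, const double *v, double *out)
{
    /* The accumulation below relies on out starting at zero. */
    for (int i = 0; i < n; i++)
        out[i] = 0.0;

    #pragma omp parallel
    {
        for (int i = 0; i < n; i++) {
            /* Declared inside the parallel region, so each thread has
               its own copy, reset for every row. */
            double partial = 0.0;

            /* The columns of row i are divided among the threads. */
            #pragma omp for
            for (int j = 0; j < n; j++)
                partial += m[i * n + j] * v[j];

            /* One protected update per thread per row, instead of one
               unprotected update per matrix element. */
            #pragma omp atomic
            out[i] += partial;
        }
    }
}
```

Declaring `partial` inside the row loop avoids having to reset it by hand, and `atomic` is usually enough for a single scalar update like this one.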