> The segmentation fault was solved by _mm_loadu_ps
> Also there is something wrong with logic...
You're loading 4 overlapping windows on b[k][j+0..7]. (This is why you needed loadu.)
Perhaps you meant to load b[k][j+0], +4, +8, +12? If so, you should align b by 64 so all four loads come from the same cache line (for performance). In practice that also means keeping j a multiple of 16 and the row length a multiple of 16 floats. Strided access is not great, but using all 64 bytes of every cache line you touch is a lot better than getting row-major vs. column-major totally wrong in scalar code with no blocking.
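A minimal sketch of the non-overlapping version, assuming that is what was intended. The function name load_row_block and the MAX_DIM value are hypothetical, not taken from the question's code; the caller is assumed to declare b with alignas(64).

```c
#include <assert.h>
#include <stdint.h>
#include <xmmintrin.h>   /* SSE: __m128, _mm_load_ps */

#define MAX_DIM 16       /* hypothetical size; a multiple of 16 floats (64 bytes) per row */

/* Load b[k][j+0..15] as four non-overlapping 4-float vectors.
 * Assumes the caller declared b as: alignas(64) float b[MAX_DIM][MAX_DIM];
 * With j a multiple of 16, all four loads hit the same 64-byte cache line,
 * and the aligned load (_mm_load_ps) is safe, so loadu is not needed. */
void load_row_block(float b[][MAX_DIM], int k, int j, __m128 out[4])
{
    assert(j + 16 <= MAX_DIM);                      /* stay inside row k */
    assert(((uintptr_t)&b[k][j] & 15) == 0);        /* 16-byte alignment for _mm_load_ps */

    out[0] = _mm_load_ps(&b[k][j + 0]);             /* b[k][j+0..3]   */
    out[1] = _mm_load_ps(&b[k][j + 4]);             /* b[k][j+4..7]   */
    out[2] = _mm_load_ps(&b[k][j + 8]);             /* b[k][j+8..11]  */
    out[3] = _mm_load_ps(&b[k][j + 12]);            /* b[k][j+12..15] */
}
```

Compare this with loads from &b[k][j+0], &b[k][j+1], &b[k][j+2], &b[k][j+3]: those overlap, and three of the four are misaligned, which is why the aligned load faulted.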
> I take 4 elements from row from a multiply it by 4 elements from a column from b
I'm not sure your text description describes your code.
Unless you've already transposed b, you can't load multiple values from the same column with a SIMD load, because they aren't contiguous in memory.
C multidimensional arrays are "row major": the last index is the one that varies most quickly when moving to the next higher memory address. Did you think that _mm_loadu_ps(&b[k][j+1]) was going to give you b[k+0..3][j+1]? If so, this is a duplicate of "SSE matrix-matrix multiplication". (That question uses 32-bit integers, not 32-bit floats, but it has the same layout problem. See it for a working loop structure.)
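To make the layout point concrete, here is a small scalar sketch that mimics what a 16-byte load sees versus what a column access would need. The helper names row_window and column_gather and the MAX_DIM value are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <string.h>

#define MAX_DIM 16   /* hypothetical size for illustration */

/* Copy the 4 floats that a 16-byte load from &b[k][j] reads:
 * b[k][j], b[k][j+1], b[k][j+2], b[k][j+3] (contiguous, same row). */
void row_window(float b[][MAX_DIM], int k, int j, float out[4])
{
    assert(j + 4 <= MAX_DIM);
    memcpy(out, &b[k][j], 4 * sizeof(float));
}

/* Collect b[k+0..3][j] one scalar at a time. These elements are
 * MAX_DIM floats apart in memory, so no single SIMD load can fetch them. */
void column_gather(float b[][MAX_DIM], int k, int j, float out[4])
{
    assert(k + 4 <= MAX_DIM);
    for (int i = 0; i < 4; i++)
        out[i] = b[k + i][j];
}
```

With b[i][j] = 100 * i + j, row_window at (2, 1) yields 201, 202, 203, 204, while column_gather at (2, 1) yields 201, 301, 401, 501. A load from &b[k][j+1] always gives the first kind of result.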
To debug this, put a simple pattern of values into b[], like:
#include <stdalign.h>
alignas(64) float b[MAX_DIM][MAX_DIM] = {
    {   0,   1,   2,   3,   4, ... },
    { 100, 101, 102, ... },
    { 200, 201, 202, ... },
};
// i.e. for (...) b[i][j] = 100 * i + j;
// (no leading zeros: 0100 would be an octal literal, i.e. 64)
Then when you step through your code in the debugger, you can see what values end up in your vectors.
For your a[][] values, maybe use 90000.0 + 100 * i + j, so if you're looking at registers (instead of C variables) you can still tell which values are from a and which are from b.
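The two patterns can be filled in with loops rather than typed out. This is only a sketch: the function names and the MAX_DIM value are hypothetical, and the encoding stays unambiguous only while the column index is below 100.

```c
#include <assert.h>

#define MAX_DIM 16   /* hypothetical size; must stay <= 100 for this encoding */

/* b[i][j] = 100 * i + j, so the value 203 means "row 2, column 3 of b". */
void fill_b_pattern(float b[][MAX_DIM], int n)
{
    assert(n <= MAX_DIM);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            b[i][j] = (float)(100 * i + j);
}

/* a[i][j] = 90000 + 100 * i + j, so anything >= 90000 in a register came from a.
 * All of these values are exactly representable as float. */
void fill_a_pattern(float a[][MAX_DIM], int n)
{
    assert(n <= MAX_DIM);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            a[i][j] = 90000.0f + (float)(100 * i + j);
}
```

After filling both arrays, a vector holding {201, 202, 203, 204} is immediately recognizable as four consecutive elements of row 2 of b, not a column.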
Related: