c - Matrix Multiplication of size 100*100 using SSE Intrinsics

Posted Jan 31, 2022 in Technique by 深蓝 (71.8m points)

    int MAX_DIM = 100;
    float a[MAX_DIM][MAX_DIM] __attribute__((aligned(16)));
    float b[MAX_DIM][MAX_DIM] __attribute__((aligned(16)));
    float d[MAX_DIM][MAX_DIM] __attribute__((aligned(16)));
    /*
     * I fill these arrays with some values
     */

    for (int i = 0; i < MAX_DIM; i += 1) {
        for (int j = 0; j < MAX_DIM; j += 4) {
            for (int k = 0; k < MAX_DIM; k += 4) {
                __m128 result  = _mm_load_ps(&d[i][j]);
                __m128 a_line  = _mm_load_ps(&a[i][k]);
                __m128 b_line0 = _mm_load_ps(&b[k][j+0]);
                __m128 b_line1 = _mm_loadu_ps(&b[k][j+1]);
                __m128 b_line2 = _mm_loadu_ps(&b[k][j+2]);
                __m128 b_line3 = _mm_loadu_ps(&b[k][j+3]);

                result = _mm_add_ps(result, _mm_mul_ps(_mm_shuffle_ps(a_line, a_line, 0x00), b_line0));
                result = _mm_add_ps(result, _mm_mul_ps(_mm_shuffle_ps(a_line, a_line, 0x55), b_line1));
                result = _mm_add_ps(result, _mm_mul_ps(_mm_shuffle_ps(a_line, a_line, 0xaa), b_line2));
                result = _mm_add_ps(result, _mm_mul_ps(_mm_shuffle_ps(a_line, a_line, 0xff), b_line3));
                _mm_store_ps(&d[i][j], result);
            }
        }
    }

I wrote the code above to do matrix multiplication using SSE. It works as follows: I take 4 elements from a row of a, multiply them by 4 elements from a column of b, then move on to the next 4 elements in the row of a and the next 4 elements in the column of b.

I get the error "Segmentation fault (core dumped)" and I don't know why.

I use gcc 5.4.0 on Ubuntu 16.04.5.

Edit: The segmentation fault was solved by _mm_loadu_ps. Also, there is something wrong with the logic; I would be grateful if someone could help me find it.

1 Reply

深蓝 · Answer 1 · 2022-01-31T07:05:55+0000

"The segmentation fault was solved by _mm_loadu_ps. Also, there is something wrong with the logic..."

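You're loading 4 overlapping windows that together cover b[k][j+0..6]. (This is why you needed loadu: &b[k][j+1], +2 and +3 are not 16-byte aligned.) When j is close to MAX_DIM, the later windows also read past the end of row k.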

Perhaps you meant to load b[k][j+0], +4, +8, +12? If so, you should align b by 64, so all four loads come from the same cache line (for performance). Strided access is not great, but using all 64 bytes of every cache line you touch is a lot better than getting row-major vs. column-major totally wrong in scalar code with no blocking.

"I take 4 elements from a row of a, multiply them by 4 elements from a column of b"

I'm not sure your text description matches your code.

Unless you've already transposed b, you can't load multiple values from the same column with a SIMD load, because they aren't contiguous in memory.
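To make the "not contiguous" point concrete, here is a minimal sketch that measures how far apart two elements of a row-major array are, counted in floats. The helper name float_distance is made up for this illustration; MAX_DIM matches the question.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DIM 100

/* Distance in floats from m[r0][c0] to m[r1][c1] (hypothetical helper).
 * The subtraction is done on char pointers into the whole array object,
 * then converted from bytes to floats. */
static ptrdiff_t float_distance(float m[MAX_DIM][MAX_DIM],
                                int r0, int c0, int r1, int c1)
{
    assert(r0 >= 0 && r0 < MAX_DIM && c0 >= 0 && c0 < MAX_DIM);
    assert(r1 >= 0 && r1 < MAX_DIM && c1 >= 0 && c1 < MAX_DIM);
    const char *from = (const char *)&m[r0][c0];
    const char *to   = (const char *)&m[r1][c1];
    return (to - from) / (ptrdiff_t)sizeof(float);
}
```

float_distance(b, k, j, k, j + 1) is 1, so a 4-float SIMD load starting at &b[k][j] picks up four neighbours in row k. float_distance(b, k, j, k + 1, j) is MAX_DIM, so the next element of the same column is a whole row away and cannot be part of that load.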

C multidimensional arrays are "row major": the last index is the one that varies most quickly when moving to the next higher memory address. Did you think that _mm_loadu_ps(&b[k][j+1]) was going to give you b[k+0..3][j+1]? If so, this is a duplicate of "SSE matrix-matrix multiplication". (That question uses 32-bit integers, not 32-bit floats, but it has the same layout problem. See it for a working loop structure.)
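For reference, here is a minimal sketch of one loop structure that works with row-major data. It is not the code from the linked question. It assumes N is a multiple of 4, and the names matmul_sse and N are made up for this illustration. Instead of loading a column of b, it broadcasts a single a[i][k] and multiplies it by four contiguous elements of row k of b.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_loadu_ps, _mm_set1_ps, ... */

#define N 100  /* same size as MAX_DIM in the question; must be a multiple of 4 */

/* d = a * b for N x N row-major float matrices (hypothetical helper).
 * d is overwritten, so it does not need to be zeroed first. */
static void matmul_sse(float a[N][N], float b[N][N], float d[N][N])
{
    assert(N % 4 == 0);
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j += 4) {
            __m128 sum = _mm_setzero_ps();              /* accumulates d[i][j..j+3] */
            for (int k = 0; k < N; k++) {
                __m128 a_ik  = _mm_set1_ps(a[i][k]);    /* one scalar in all 4 lanes */
                __m128 b_row = _mm_loadu_ps(&b[k][j]);  /* b[k][j..j+3], contiguous */
                sum = _mm_add_ps(sum, _mm_mul_ps(a_ik, b_row));
            }
            _mm_storeu_ps(&d[i][j], sum);               /* unaligned store: no alignment assumed */
        }
    }
}
```

The unaligned load and store are used so the sketch does not depend on how the arrays were declared. With 16-byte-aligned arrays and N = 100, each row is 400 bytes (a multiple of 16), so &b[k][j] is aligned whenever j is a multiple of 4 and _mm_load_ps / _mm_store_ps would also be valid here.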

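To debug this, put a simple pattern of values into b[], for example:

    #include <stdalign.h>

    alignas(64) float b[MAX_DIM][MAX_DIM];

    // Each value encodes its own position: b[i][j] == 100 * i + j
    //   row 0:   0,   1,   2,   3,   4, ...
    //   row 1: 100, 101, 102, ...
    //   row 2: 200, 201, 202, ...
    // (Don't write the values as literals with leading zeros: 0100 is octal in C.)
    for (int i = 0; i < MAX_DIM; i++)
        for (int j = 0; j < MAX_DIM; j++)
            b[i][j] = 100 * i + j;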

Then when you step through your code in the debugger, you can see what values end up in your vectors.

For your a[][] values, maybe use 90000.0 + 100 * i + j so if you're looking at registers (instead of C variables) you can still tell which values are a and which are b.
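If you would rather not single-step, the same check can be sketched in code. The helpers below (fill_pattern and question_b_loads are made-up names) fill b with the 100 * i + j pattern and copy out the four vectors that the question's inner loop loads from b at a given (k, j), so they can be printed or inspected.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE: _mm_loadu_ps, _mm_storeu_ps */

#define MAX_DIM 100

/* Fill m so that every value encodes its position: m[i][j] == 100*i + j. */
static void fill_pattern(float m[MAX_DIM][MAX_DIM])
{
    for (int i = 0; i < MAX_DIM; i++)
        for (int j = 0; j < MAX_DIM; j++)
            m[i][j] = (float)(100 * i + j);
}

/* Copy the four vectors the question's code loads from b at (k, j)
 * into out[0..3] (hypothetical helper). */
static void question_b_loads(float b[MAX_DIM][MAX_DIM], int k, int j,
                             float out[4][4])
{
    assert(k >= 0 && k < MAX_DIM && j >= 0);
    assert(j + 6 < MAX_DIM);  /* the last window reads b[k][j+3 .. j+6] */
    for (int n = 0; n < 4; n++)
        _mm_storeu_ps(out[n], _mm_loadu_ps(&b[k][j + n]));
}
```

With k = 2 and j = 4, out[0] holds 204, 205, 206, 207 and out[3] holds 207, 208, 209, 210. Every value starts with the same row number, which shows that the loads slide along row k rather than walking down a column.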

Ulrich Drepper's "What Every Programmer Should Know About Memory" shows an optimized matmul with cache-blocking using SSE intrinsics for double-precision. It should be straightforward to adapt for float.
"How does BLAS get such extreme performance?" (You might want to just use an optimized matmul library; tuning matmul for optimal cache-blocking is non-trivial but important.)
"Matrix Multiplication with blocks"
"Poor maths performance in C vs Python/numpy" has some links to other questions
"how to optimize matrix multiplication (matmul) code to run fast on a single processor core"
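As a rough idea of what cache-blocking means in the links above, here is a minimal scalar sketch. It is not Drepper's code and it is not tuned. The names matmul_blocked, N and BLOCK are made up, and the tile size of 20 is an arbitrary assumption that happens to divide 100.

```c
#include <assert.h>

#define N 100
#define BLOCK 20  /* hypothetical tile size; must divide N in this sketch */

/* d = a * b, computed tile by tile so each BLOCK x BLOCK piece of a, b and d
 * is reused while it is still likely to be in cache (hypothetical helper). */
static void matmul_blocked(float a[N][N], float b[N][N], float d[N][N])
{
    assert(N % BLOCK == 0);

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            d[i][j] = 0.0f;

    for (int ii = 0; ii < N; ii += BLOCK)
        for (int kk = 0; kk < N; kk += BLOCK)
            for (int jj = 0; jj < N; jj += BLOCK)
                /* multiply tile a[ii..][kk..] by tile b[kk..][jj..] */
                for (int i = ii; i < ii + BLOCK; i++)
                    for (int k = kk; k < kk + BLOCK; k++) {
                        float a_ik = a[i][k];
                        /* innermost loop walks along rows of b and d */
                        for (int j = jj; j < jj + BLOCK; j++)
                            d[i][j] += a_ik * b[k][j];
                    }
}
```

The innermost loop touches contiguous memory in both b and d, so it is also the natural place to use 4-wide SSE loads. Whether blocking pays off for a matrix as small as 100*100, and which tile size is best, depends on the machine and should be measured.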
