Implementing a matrix operation using AVX in C

I'm trying to implement the following operation using AVX:

for (int i=0; i<N; i++) {
  for (int j=0; j<N; j++) {
    for (int k=0; k<K; k++) {
      d[i][j] += 2 * a[i][k] * (b[k][j] - c[k]);
    }
  }
}

for (int i=0; i<N; i++){
   f+= d[ind[i]][ind[i]]/2;
}    

Here d is an NxN matrix, a is NxK, b is KxN, and c is a vector of length K. All of them are doubles. Of course, all the data is aligned, and I am using #pragma vector aligned to help the compiler (gcc).

I know how to use AVX extensions with one-dimensional arrays, but doing it with matrices is proving a bit tricky for me. Currently I have the following, but I'm not getting correct results:

    for (int i=0; i< floor (N/4); i++){
        for (int j=0; j< floor (N/4); j++){
            __m256d D, A, B, C;
            D = _mm256_setzero_pd();
            #pragma vector aligned
            for (int k=0; k<K_MAX; k++){
                A = _mm256_load_pd(a[i] + k*4);
                B = _mm256_load_pd(b[k] + j*4);
                C = _mm256_load_pd(c + 4*k);
                B = _mm256_sub_pd(B, C);
                A = _mm256_mul_pd(A, B);
                D = _mm256_add_pd(_mm256_set1_pd(2.0), A);
                _mm256_store_pd(d[i] + j*4, D);
            }

        }
    }


    for (int i=0; i<N; i++){
        f+= d[ind[i]][ind[i]]/2;
    }    

I hope someone can tell me where the mistake is.

Thanks in advance.

NOTE: I'm not willing to introduce OpenMP; I just want to use Intel SIMD instructions.



Solution 1:[1]

b[k][j] is your problem: the elements b[k + 0..3][j] aren't contiguous in memory. Using SIMD (in a reasonable / useful way) is not something you can drop into the classic naive matmul loop. See What Every Programmer Should Know About Memory? - there's an appendix with an example of an SSE2 matmul (with cache-blocking) which shows how to do operations in a different order that's SIMD-friendly.

Soonts's answer shows how to vectorize at all, by vectorizing over j, the middle loop. But that leaves a relatively poor memory access pattern, and 3 loads + 3 ALU operations inside the loop. (This answer started out as a comment on it, see it for the code I'm talking about and proposing changes to.)

Loop inversion should be possible, making j the inner-most loop. That would mean doing stores for d[i][j] += ... inside the inner-most loop, but OTOH it makes more of 2 * a[i][k] * ( b[k][j]- c[k] ) loop-invariant, so you can usefully transform it to d[i][j] += (2*a_ik) * b[k][j] - (2*a_ik*c_k), i.e. one VFMSUBPD and one VADDPD per load & store. (With the bv load folding into the FMSUB as a memory source operand, and the dv load folding into VADDPD, so hopefully only 3 uops for the front-end, including a separate store, not including loop overhead.)

The compiler will have to unroll and avoid an indexed addressing mode so the store-address uop can stay micro-fused and run on port 7 on Intel CPUs (Haswell through Skylake-family), not competing with the two loads. Ice Lake doesn't have that problem, having two full independent store-AGUs separate from the two load AGUs, but it probably still needs some loop unrolling to avoid a front-end bottleneck.
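For concreteness, here's a rough, untested sketch of that FMSUB form for a single (i,k) pair. The helper name, row pointers, and N_aligned (N rounded down to a multiple of 4) are hypothetical names for illustration, and the actual example below uses a different transformation. A compiler should be able to fold the bv load into the FMSUB and the dv load into the VADDPD:

#include <stddef.h>
#include <immintrin.h>

// Untested sketch of d[i][j] += (2*a_ik)*b[k][j] - (2*a_ik*c_k) for one (i,k) pair.
// Hypothetical helper; compile with FMA enabled (e.g. -march=haswell or later).
static void fmsub_row(double *restrict rowDi, const double *restrict rowBk,
                      double a_ik, double c_k, size_t N_aligned)
{
    const __m256d two_a   = _mm256_set1_pd( 2.0 * a_ik );        // broadcast 2*a[i][k]
    const __m256d two_a_c = _mm256_set1_pd( 2.0 * a_ik * c_k );  // broadcast 2*a[i][k]*c[k]
    for (size_t j = 0; j < N_aligned; j += 4) {
        __m256d bv = _mm256_loadu_pd( rowBk + j );                     // b[k][j+0..3]
        __m256d t  = _mm256_fmsub_pd( two_a, bv, two_a_c );            // VFMSUBPD: 2a*b - 2a*c
        __m256d dv = _mm256_add_pd( t, _mm256_loadu_pd( rowDi + j ) ); // VADDPD: + d[i][j]
        _mm256_storeu_pd( rowDi + j, dv );                             // store updated d[i][j]
    }
}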

Here's an example, untested (original version contributed by Soonts, thanks). It optimizes down to 2 FP math ops in the loop in a different way: simply hoisting 2*a out of the loop, doing SUB then FMA for dv += (2av)*(sub_result). bv can't be a memory source operand for vsubpd, though, because we need bv - cv; we can fix that by negating cv to allow (-cv) + bv in the inner loop, with bv as a memory source operand. Sometimes compilers will do things like that for you, but here it seems they didn't, so I did it manually. Otherwise we'd get a separate vmovupd load going through the front-end.

#include <stdint.h>
#include <stdlib.h>
#include <immintrin.h>

// This double [N][N] C99 VLA syntax isn't portable to C++ even with GNU extensions
// restrict tells the compiler the output doesn't overlap with any of the inputs
void matop(size_t N, size_t K, double d[restrict N][N], const double a[restrict N][K], const double b[restrict K][N], const double c[restrict K])
{
    for( size_t i = 0; i < N; i++ ) {
        // loop-invariant pointers for this outer iteration
        //double* restrict rowDi = &d[ i ][ 0 ];
        const double* restrict rowAi = &a[ i ][ 0 ];
        for( size_t k = 0; k < K; k++ ) {
            const double* restrict rowBk = &b[ k ][ 0 ];
            double* restrict rowDi = &d[ i ][ 0 ];
#if 0  // pure scalar
            // auto-vectorizes ok; still a lot of extra checking outside outermost loop even with  restrict
            for (size_t j=0 ; j<N ; j++){
                rowDi[j] += 2*rowAi[k] * (rowBk[j] - c[k]);
            }
#else   // SIMD inner loop with cleanup

         // *** TODO: unroll over 2 or 3 i values
         // and maybe also 2 or 3 k values, to reuse each bv a few times while it's loaded.
            __m256d av = _mm256_broadcast_sd( rowAi + k );
            av = _mm256_add_pd( av, av );   // 2*a[ i ][ k ] broadcasted
            const __m256d cv = _mm256_broadcast_sd( &c[ k ] );
            const __m256d minus_ck = _mm256_xor_pd(cv, _mm256_set1_pd(-0.0));    // broadcasted -c[k]

            //const size_t N_aligned = ( (size_t)N / 4 ) * 4;
            size_t N_aligned = N & -4;    // round down to a multiple of 4 j iterations
            const double* endBk = rowBk + N_aligned;
            //for( ; j < N_aligned; j += 4 )
            for ( ; rowBk != endBk ; rowBk += 4, rowDi += 4) {  // coax GCC into using pointer-increments in the asm, instead of j+=4
                // Load the output vector to update
                __m256d dv = _mm256_loadu_pd( rowDi );
                // Update with FMA
                __m256d bv = _mm256_loadu_pd( rowBk );
                __m256d t2 = _mm256_add_pd( minus_ck, bv );   // bv - cv
                dv = _mm256_fmadd_pd( av, t2, dv );
                // Store back to the same address
                _mm256_storeu_pd( rowDi, dv );
            }
            // rowDi and rowBk point to the double after the last full vector

            // The remainder, if you can't pad your rows to a multiple of 4 and step on that padding
            for(int j=0 ; j < (N&3); j++ )
                rowDi[ j ] += _mm256_cvtsd_f64( av ) * ( rowBk[ j ] + _mm256_cvtsd_f64( minus_ck ) );
#endif
        }
    }
}

Without unrolling (https://godbolt.org/z/6WeYKbnYY), GCC11's inner loop asm looks like this, all single-uop instructions that can stay micro-fused even in the back-end on Haswell and later.

.L7:                                             # do{
        vaddpd  ymm0, ymm2, YMMWORD PTR [rax]     # -c[k] + rowBk[0..3]
        add     rax, 32                           # rowBk += 4
        add     rdx, 32                           # rowDi += 4
        vfmadd213pd     ymm0, ymm1, YMMWORD PTR [rdx-32]  # fma(2aik, Bkj-ck, Dij)
        vmovupd YMMWORD PTR [rdx-32], ymm0        # store FMA result
        cmp     rcx, rax
        jne     .L7                              # }while(p != endp)

But it's 6 total uops, 3 of them loop overhead (pointer increments and fused cmp+jne), so Haswell through Skylake could only run it at 1 iteration per 1.5 clocks, bottlenecked on the 4-wide issue stage in the front-end. (Which wouldn't let OoO exec get ahead on executing the pointer increments and loop branch, to notice early and recover while the back-end was still chewing on older loads and FP math.)


So loop unrolling should be helpful, since we managed to coax GCC into avoiding indexed addressing modes. Without that, unrolling is relatively useless with AVX code on Intel Haswell/Skylake CPUs: each vaddpd ymm5, ymm4, [rax + r14] decodes as 1 micro-fused uop, but un-laminates into 2 at issue into the back-end, not helping us get more work through the narrowest part of the front-end. (A lot like if we'd used a separate vmovupd load, as we'd have gotten with _mm256_sub_pd(bv, cv) instead of add(bv, -cv).)

The vmovupd ymmword ptr [rbp + r14], ymm5 store stays micro-fused but can't run on port 7, limiting us to a total of 2 memory operations per clock (up to 1 of which can be a store.) So a best case of 1.5 cycles per vector.

Compiled on https://godbolt.org/z/rd3rn9zor with GCC and clang -O3 -march=skylake -funroll-loops. GCC does actually use pointer increments with loads folded into 8x vaddpd and 8x vfmadd213pd. But clang uses indexed addressing modes and doesn't unroll. (You probably don't want -funroll-loops for your whole program, so either compile this separately or manually unroll. GCC's unrolling fully peels a prologue that does 0..7 vector iterations before entering the actual SIMD loop, so it's quite aggressive.)

GCC's loop-unrolling looks useful here for large N, amortizing the pointer increments and loop overhead over multiple vectors. (GCC doesn't know how to invent multiple accumulators for FP dep chains in a dot product for example, making its unrolling useless in that case, unlike clang.)

Unfortunately clang doesn't unroll the inner loop for us, but it does use vmaskmovpd in an interesting way for the cleanup.

It's maybe a good thing that we use a separate loop counter for the cleanup, in a way that lets the compiler easily prove the trip-count for the cleanup is 0..3, so it doesn't try to auto-vectorize it with YMM vectors.

The other way to do it is to use an actual j variable for the inner loop and its cleanup, more like Soonts's edit. IIRC, compilers did try to auto-vectorize the cleanup for that version, wasting code size and adding some always-false branching.

            size_t j = 0;     // used for cleanup loop after
            for( ; j < N_aligned; j += 4 )
            {
                // Load the output vector to update
                __m256d dv = _mm256_loadu_pd( rowDi + j );
                // Update with FMA
                __m256d bv = _mm256_loadu_pd( rowBk + j );
                __m256d t2 = _mm256_sub_pd( bv, cv );   // bv - cv
                dv = _mm256_fmadd_pd( av, t2, dv );
                // Store back to the same address
                _mm256_storeu_pd( rowDi + j, dv );
            }
            // The remainder, if you can't pad your rows to a multiple of 4
            for( ; j < N; j++ )
                rowDi[ j ] += _mm256_cvtsd_f64( av ) * ( rowBk[ j ] - _mm256_cvtsd_f64( cv ) );

This has a fairly good mix of load&store vs. FP math for modern CPUs (https://agner.org/optimize/ and https://uops.info/), especially Intel where we can do 2 loads and 1 store. I think Zen 2 or 3 can also do 2 loads + 1 store. It needs to hit in L1d cache to sustain that kind of throughput, though. (And even then, Intel's optimization manual says the max sustained L1d bandwidth on Skylake is less than the full 96 bytes/cycle that would require. More like mid-80s IIRC, so we can't quite expect one result vector per cycle, even with sufficient unrolling to avoid front-end bottlenecks.)

There's no latency bottleneck, since we move on to a new dv every iteration instead of accumulating anything across loop iterations.

The other advantage to this is that memory access to d[i][j] and b[k][j] would be sequential, with no other memory access in the inner-most loop. (The middle loop would do broadcast-loads of a[i][k] and c[k]. Those seem likely to cache-miss if the inner loop evicts too much; with some unrolling of the outer loop, one SIMD load and some shuffling could help, but probably cache-blocking would avoid a need for that.)

Looping over the same d[i] row repeatedly for different b[k] rows gives us locality for the part that we're modifying (i.e. use k as the middle loop, keeping i as the outer-most.) With k as the outer loop, we'd be looping K times over the whole d[0..N-1][0..N-1], probably needing to write + read each pass all that way out to whichever level of cache or memory could hold it.

But really you'd still want to cache-block if each row is really long, so you avoid the cache misses to bring all of b[][] in from DRAM N times. And avoid evicting the stuff you're going to broadcast-load next.
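As a rough, untested sketch of what that blocking could look like (JB is a hypothetical tile size to tune, and the scalar j loop stands in for the vectorized inner loop above), you can sweep b[][] in column blocks so the K x JB vertical slice of b, plus the current d[i] segment, can stay cache-resident across all i:

#include <stddef.h>

void matop_blocked(size_t N, size_t K, double d[restrict N][N],
                   const double a[restrict N][K], const double b[restrict K][N],
                   const double c[restrict K])
{
    const size_t JB = 256;                          // columns per block; tune for cache size
    for (size_t j0 = 0; j0 < N; j0 += JB) {
        size_t jend = (j0 + JB < N) ? j0 + JB : N;
        for (size_t i = 0; i < N; i++)              // i outer, k middle, j inner, as above
            for (size_t k = 0; k < K; k++)
                for (size_t j = j0; j < jend; j++)  // vectorize this loop like the earlier example
                    d[i][j] += 2 * a[i][k] * (b[k][j] - c[k]);
    }
}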


Smarter unrolling: a first step towards cache-blocking

Some of the above problems with maxing out load/store execution unit throughput, and requiring the compiler to use non-indexed addressing modes, can go away if we do more with each vector of data while it's loaded.

For example, instead of working on just one row of d[][], we could be working on 2, 3, or 4. Then every (rowBk[j] - c[k]) result can be used that many times (with a different 2*a[i][k] broadcast) for the corresponding d[i + 0..unroll][j + 0..3] vector.

And we can also load a couple of different b rows, (rowBk + N*0..unroll)[j + 0..3] for consecutive k values, each with a corresponding minus_ck0, minus_ck1, etc. (Or keep an array of vectors; as long as it's small and the compiler has enough registers, the elements won't exist in memory.)

With multiple bv-cv and dv vectors in registers all at the same time, we can do significantly more FMAs per load without increasing the total amount of FP work. It takes more registers for constants, though, otherwise we could be defeating the purpose by forcing more reloads.

The d[i][j] += (2*a_ik) * b[k][j] - (2*a_ik*c_k) transformation wouldn't be useful here; we want to keep bv-cv separate from i so we can reuse that result as an input for different FMAs.

The b[k][j]+(-c[k]) can still benefit from micro-fusion of a load with a vaddpd so ideally it would still use a pointer increment, but the front-end might not be a bottleneck anymore.
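Here's an untested sketch of that idea for two rows of d at a time; the helper name and parameters are hypothetical, and the row pointers / N_aligned would be set up by the surrounding loops the same way as in the earlier example:

#include <stddef.h>
#include <immintrin.h>

// Untested: one k step for TWO rows of d, so each (b[k][j+0..3] - c[k]) vector
// feeds two FMAs.  a0k = a[i][k], a1k = a[i+1][k], ck = c[k].
static void update_two_rows(double *restrict rowDi0, double *restrict rowDi1,
                            const double *restrict rowBk,
                            double a0k, double a1k, double ck, size_t N_aligned)
{
    const __m256d two_a0   = _mm256_set1_pd( 2.0 * a0k );
    const __m256d two_a1   = _mm256_set1_pd( 2.0 * a1k );
    const __m256d minus_ck = _mm256_set1_pd( -ck );
    for (size_t j = 0; j < N_aligned; j += 4) {
        __m256d sub = _mm256_add_pd( minus_ck, _mm256_loadu_pd( rowBk + j ) );   // b[k][j..] - c[k], loaded once
        __m256d d0  = _mm256_fmadd_pd( two_a0, sub, _mm256_loadu_pd( rowDi0 + j ) );
        __m256d d1  = _mm256_fmadd_pd( two_a1, sub, _mm256_loadu_pd( rowDi1 + j ) );
        _mm256_storeu_pd( rowDi0 + j, d0 );   // two FMAs and two stores per bv load
        _mm256_storeu_pd( rowDi1 + j, d1 );
    }
}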

Don't overdo it with this; too many memory input streams can be a problem for cache conflict misses especially for some N values that might create aliasing, and also for HW prefetching tracking them all. (Although Intel's L2 streamer is said to track 1 forward and 1 backward stream per 4k page, IIRC.) Probably about 4 to 8 ish streams is ok. But if d[][] isn't missing in L1d, then it's not really an input stream from memory. You don't want your b[][] input rows to be evicting the d data, though, since you'll be looping over 2 to 4 rows of d data repeatedly.


By comparison: Soonts's loop - less frequent cleanup, but worse memory access pattern.

Soonts's current loop with 3 loads and 3 ALU operations isn't ideal, although 1 load per FMA operation is already ok if they hit in cache (most modern CPUs can do 2 each per clock, although AMD Zen can also do 2 FP adds in parallel with mul/fma). If that extra ALU operation was a bottleneck, we could pre-multiply a[][] by 2 once, taking only O(N*K) work vs. O(N^2*K) to do it on the fly. But it's probably not a bottleneck and thus not worth it.

More importantly, the memory access pattern in Soonts's current answer is looping forward 1 double at a time for the broadcast loads of c[k] and a[i][k], which is good, but the bv = _mm256_loadu_pd of b[k][j + 0..3] is unfortunately striding down a column.

If you're going to unroll as Soonts suggested, don't just do two dep chains for one dv; do at least two vectors, d[i][j + 0..3] and d[i][j + 4..7], so you use a whole 64 bytes (a full cache line) from every line of b[k][] you touch. Or four vectors for a pair of cache lines. (Intel CPUs at least use an adjacent-line prefetcher, which likes to complete a 128-byte aligned pair of cache lines, so you'd benefit from aligning the rows of b[][] by 128. Or at least by 64, and get some benefit from adjacent-line prefetching.)
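A rough, untested sketch of that, reusing the loop structure and variable names from Soonts's code in Solution 2 below, with j stepping by 8 (so it assumes N_aligned is rounded down to a multiple of 8):

// Untested: two dv vectors per k loop, so each b[k] row contributes a full 64-byte line.
__m256d dv0 = _mm256_loadu_pd( &d[ i ][ j ] );
__m256d dv1 = _mm256_loadu_pd( &d[ i ][ j + 4 ] );
for( int k = 0; k < K; k++ )
{
    __m256d t1 = _mm256_broadcast_sd( &a[ i ][ k ] );
    t1 = _mm256_add_pd( t1, t1 );                                          // 2*a[i][k]
    __m256d cv = _mm256_broadcast_sd( &c[ k ] );
    __m256d t2_0 = _mm256_sub_pd( _mm256_loadu_pd( &b[ k ][ j ] ), cv );
    __m256d t2_1 = _mm256_sub_pd( _mm256_loadu_pd( &b[ k ][ j + 4 ] ), cv );
    dv0 = _mm256_fmadd_pd( t1, t2_0, dv0 );   // reuse the same 2*a[i][k] and c[k]
    dv1 = _mm256_fmadd_pd( t1, t2_1, dv1 );   // for the adjacent 4 columns
}
_mm256_storeu_pd( &d[ i ][ j ],     dv0 );
_mm256_storeu_pd( &d[ i ][ j + 4 ], dv1 );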

If a vertical slice of b[][] fits in some level of cache (along with the row of d[i][] you're currently accumulating into), the next stride down the next group of columns can benefit from that prefetching and locality. If not, fully using the lines you touch is more important, so they don't have to get pulled in again later.

So with Soonts's vectorization strategy, for large problems where this won't fit in L1d cache, it's probably good to make sure b's rows are aligned by 64, even if that means padding at the end of each row. (The storage geometry doesn't have to match the actual matrix dimension; you pass N and row_stride separately, using row_stride for index calculations and N for loop bounds.)
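For example (hypothetical allocation code, assuming C11 aligned_alloc is available), padding each row of b out to a multiple of 64 bytes keeps every row 64-byte aligned while the loop bounds still use N:

#include <stdlib.h>     // aligned_alloc (C11)

// Hypothetical storage setup: rows padded and aligned to 64 bytes (8 doubles).
size_t row_stride = (N + 7) & ~(size_t)7;     // doubles per stored row, >= N
double *b_storage = aligned_alloc( 64, K * row_stride * sizeof(double) );
// b[k][j] lives at b_storage[ k*row_stride + j ]; index with row_stride, loop with N.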

Solution 2:[2]

Assuming both N and K are relatively large (much larger than 4, the hardware vector width in doubles), here's one way to vectorize your main loop. Untested.

The main idea is vectorizing the middle loop instead of the inner one. This is done for two reasons.

  1. This avoids horizontal operations. When vectorizing just the inner loop, we would have to compute the horizontal sum of a vector.

  2. The b[k][j] load has an unfortunate RAM access pattern when loading for 4 consecutive k values: you need either 4 separate load instructions or a gather load, and both methods are relatively slow. Loading elements for 4 consecutive j values is a single full-vector load instruction, which is very efficient, especially since you align your inputs.

    const int N_aligned = ( N / 4 ) * 4;
    for( int i = 0; i < N; i++ )
    {
        int j = 0;
        for( ; j < N_aligned; j += 4 )
        {
            // Load 4 scalars from d
            __m256d dv = _mm256_loadu_pd( &d[ i ][ j ] );

            // Run the inner loop which only loads from RAM but never stores any data
            for( int k = 0; k < K; k++ )
            {
                __m256d av = _mm256_broadcast_sd( &a[ i ][ k ] );
                __m256d bv = _mm256_loadu_pd( &b[ k ][ j ] );
                __m256d cv = _mm256_broadcast_sd( &c[ k ] );

                // dv += 2*av*( bv - cv )
                __m256d t1 = _mm256_add_pd( av, av );   // 2*av
                __m256d t2 = _mm256_sub_pd( bv, cv );   // bv - cv
                dv = _mm256_fmadd_pd( t1, t2, dv );
            }
            // Store the updated 4 values
            _mm256_storeu_pd( &d[ i ][ j ], dv );
        }

        // Handle remainder with scalar code
        for( ; j < N; j++ )
        {
            double ds = d[ i ][ j ];
            for( int k = 0; k < K; k++ )
                ds += 2 * a[ i ][ k ] * ( b[ k ][ j ] - c[ k ] );
            d[ i ][ j ] = ds;
        }
    }

If you want to optimize further, try unrolling the inner loop by a small factor like 2: use 2 independent accumulators initialized with _mm256_setzero_pd(), and add them together (and into dv) after the loop, as shown below. On some processors, this version could stall on the latency of the FMA instruction instead of saturating the load ports or ALUs; multiple independent accumulators sometimes help.
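Here's a rough, untested sketch of that unrolling, written as a drop-in for the body of the j loop above (same variable names; the second loop handles odd K):

    // Untested: two independent accumulators hide FMA latency across the k loop
    __m256d dv   = _mm256_loadu_pd( &d[ i ][ j ] );
    __m256d acc0 = _mm256_setzero_pd();
    __m256d acc1 = _mm256_setzero_pd();
    int k = 0;
    for( ; k + 1 < K; k += 2 )
    {
        __m256d av0 = _mm256_broadcast_sd( &a[ i ][ k ] );
        __m256d av1 = _mm256_broadcast_sd( &a[ i ][ k + 1 ] );
        __m256d bv0 = _mm256_loadu_pd( &b[ k ][ j ] );
        __m256d bv1 = _mm256_loadu_pd( &b[ k + 1 ][ j ] );
        __m256d cv0 = _mm256_broadcast_sd( &c[ k ] );
        __m256d cv1 = _mm256_broadcast_sd( &c[ k + 1 ] );
        acc0 = _mm256_fmadd_pd( _mm256_add_pd( av0, av0 ), _mm256_sub_pd( bv0, cv0 ), acc0 );
        acc1 = _mm256_fmadd_pd( _mm256_add_pd( av1, av1 ), _mm256_sub_pd( bv1, cv1 ), acc1 );
    }
    for( ; k < K; k++ )   // cleanup for odd K
    {
        __m256d av = _mm256_broadcast_sd( &a[ i ][ k ] );
        __m256d bv = _mm256_loadu_pd( &b[ k ][ j ] );
        __m256d cv = _mm256_broadcast_sd( &c[ k ] );
        acc0 = _mm256_fmadd_pd( _mm256_add_pd( av, av ), _mm256_sub_pd( bv, cv ), acc0 );
    }
    // Combine the accumulators, add into dv, and store the updated 4 values
    dv = _mm256_add_pd( dv, _mm256_add_pd( acc0, acc1 ) );
    _mm256_storeu_pd( &d[ i ][ j ], dv );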

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution sources:
[1] Solution 1
[2] Solution 2