'Need Help Understanding OpenMP Matrix Multiplication C++ code
Here is my Matrix Multiplication C++ OpenMP code that I have written. I am trying to use OpenMP to optimize the program. The sequential code speed was 7 seconds but when I added openMP statements but it only got faster by 3 seconds. I thought it was going to get much faster and don't understand if I'm doing it right.
The OpenMP statements are in the fill_random function and in the matrix multiplication triple for loop section in main.
I would appreciate any help or advice you can give to understand this!
#include <iostream>
#include <cassert>
#include <omp.h>
#include <chrono>
using namespace std::chrono;
double** fill_random(int rows, int cols )
{
double** mat = new double* [rows]; //Allocate rows.
#pragma omp parallell collapse(2)
for (int i = 0; i < rows; ++i)
{
mat[i] = new double[cols]; // added
for( int j = 0; j < cols; ++j)
{
mat[i][j] = rand() % 10;
}
}
return mat;
}
double** create_matrix(int rows, int cols)
{
double** mat = new double* [rows]; //Allocate rows.
for (int i = 0; i < rows; ++i)
{
mat[i] = new double[cols](); //Allocate each row and zero initialize..
}
return mat;
}
void destroy_matrix(double** &mat, int rows)
{
if (mat)
{
for (int i = 0; i < rows; ++i)
{
delete[] mat[i]; //delete each row..
}
delete[] mat; //delete the rows..
mat = nullptr;
}
}
int main()
{
int rowsA = 1000; // number of rows
int colsA= 1000; // number of columns
double** matA = fill_random(rowsA, colsA);
int rowsB = 1000; // number of rows
int colsB = 1000; // number of columns
double** matB = fill_random(rowsB, colsB);
//Checking matrix multiplication qualification
assert(colsA == rowsB);
double** matC = create_matrix(rowsA, colsB);
//measure the multiply only
const auto start = high_resolution_clock::now();
//Multiplication
#pragma omp parallel for
for(int i = 0; i < rowsA; ++i)
{
for(int j = 0; j < colsB; ++j)
{
for(int k = 0; k < colsA; ++k) //ColsA..
{
matC[i][j] += matA[i][k] * matB[k][j];
}
}
}
const auto stop = high_resolution_clock::now();
const auto duration = duration_cast<seconds>(stop - start);
std::cout << "Time taken by function: " << duration.count() << " seconds" << std::endl;
//Clean up..
destroy_matrix(matA, rowsA);
destroy_matrix(matB, rowsB);
destroy_matrix(matC, rowsA);
return 0;
}
Solution 1:[1]
- Your problem is rather small.
- The
collapse
in the matrix creation does nothing because the loops are not perfectly nested. On the other hand, in the multiplication routine you should add acollapse(2)
directive. - Creating a matrix with an array of pointers means that the expression
matB[k][j]
dances all over memory. Allocate your matrices as a single array and then usei*N+j
as an indexing expression. (Of course I would put that in a macro or so.)
Solution 2:[2]
Matrix size of 1000x1000 with double(64 bit) element type requires 8MB data. When you multiply two matrices, you read 16MB data. When you write to a third matrix, you also access 24MB data total.
If L3 cache is smaller than 24MB then RAM is bottleneck. Maybe single thread did not fully use its bandwidth but when OpenMP is used, RAM bandwidth is fully used. In your case it had only 50% headroom for bandwidth.
Naive version is not using cache well. You need to swap order of two loops to gain more caching:
loop
loop k
loop
C[..] += B[..] * A[..]
although incrementing C does not re-use a register in this optimized version, it re-uses cache that is more important in this case. If you do it, it should get ~100-200 milliseconds computation time even in single-thread.
Also if you need performance, don't do this:
//Allocate each row and zero initialize..
allocate whole matrix at once so that your matrix is not scattered in memory.
To add more threads efficiently, you can do sub-matrix multiplications to compute full matrix multiplication. Scan-line multiplication is not good for load-balancing between threads. When sub-matrices are multiplied, they give better load distribution due to caching and higher floating-point operations per element fetched from memory.
Edit:
Swapping order of loops also makes compiler able to vectorize the innermost loop because one of the input matrices becomes a constant during the innermost loop.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Victor Eijkhout |
Solution 2 |