STL parallel execution vs. OpenMP performance

I'm starting a new project and would like to parallelize some computations. I've used OpenMP in the past, but I'm aware that many STL algorithms can now be parallelized directly. Since the two approaches follow different paradigms (e.g. raw loops versus iterators and anonymous functions), I'd like to choose one up front.

Which is generally faster?

To test this I benchmarked the following C++20 code:

#include <algorithm>
#include <iostream>
#include <vector>
#include <numeric>
#include <cmath>
#include <chrono>
#include <execution>

template <class ExecutionPolicy>
int test_stl(const std::vector<double>& X, ExecutionPolicy policy) {
    std::vector<double> Y(X.size());
    const auto start = std::chrono::high_resolution_clock::now();
    std::transform(policy, X.cbegin(), X.cend(), Y.begin(), [](double x){
        volatile double y = std::sin(x);
        return y;
    });
    const auto stop = std::chrono::high_resolution_clock::now();
    auto diff = std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
    return diff.count();
}

int test_openmp(const std::vector<double>& X) {
    std::vector<double> Y(X.size());
    const auto start = std::chrono::high_resolution_clock::now();
#pragma omp parallel for
    for (size_t i = 0; i < X.size(); ++i) {
        volatile double y = std::sin(X[i]);
        Y[i] = y;
    }
    const auto stop = std::chrono::high_resolution_clock::now();
    auto diff = std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
    return diff.count();
}

int main() {
    const size_t N = 10000000;
    std::vector<double> data(N);
    std::iota(data.begin(), data.end(), 1);
    std::cout << "OpenMP:        " << test_openmp(data) << std::endl;
    std::cout << "STL seq:       " << test_stl(data, std::execution::seq) << std::endl;
    std::cout << "STL par:       " << test_stl(data, std::execution::par) << std::endl;
    std::cout << "STL par_unseq: " << test_stl(data, std::execution::par_unseq) << std::endl;
    std::cout << "STL unseq:     " << test_stl(data, std::execution::unseq) << std::endl;
    return 0;
}

Compiled on my machine with GCC 10.3.0 (MSYS2), the OpenMP code consistently runs ~10 times faster (times in microseconds):

OpenMP:        54719
STL seq:       628451
STL par:       638454
STL par_unseq: 494143
STL unseq:     506647

Is OpenMP faster in general (heuristically) for functionally equivalent code? Given the current state of development, might this change in the future?

Edit:

I'm building this benchmark using the following CMakeLists.txt:

cmake_minimum_required(VERSION 3.19)

add_executable(TEST main.cpp)
target_compile_features(TEST PRIVATE cxx_std_20)
set_target_properties(TEST PROPERTIES CXX_EXTENSIONS OFF)

find_package(OpenMP)
target_link_libraries(TEST PUBLIC OpenMP::OpenMP_CXX)

I then build and run it with these Windows PowerShell commands:

cmake .. -G "MinGW Makefiles"
mingw32-make
./TEST.exe


Solution 1:[1]

I tested your code with MSVC on my Windows 11 machine (changing only size_t to int in the OpenMP implementation), because it seemed very strange that almost all of the STL variants performed the same. The seq execution policy does no parallelism at all, yet in your test it performed very close to the other execution policies.

So I compiled it with this:

cl.exe /Zi /EHsc /nologo /std:c++latest /O2 /openmp /Fe: .\openmp-vs-exec-policy.exe .\openmp-vs-exec-policy.cpp

And my results were:

.\openmp-vs-exec-policy.exe
OpenMP:        14089
STL seq:       99299
STL par:       10659
STL par_unseq: 9811
STL unseq:     68051

In other tests of mine, the STL almost always performs better than OpenMP.

So my guess is that the STL you are using is not very well implemented, or that GCC for Windows does not do a good job compiling it.

[EDIT]

I looked into the g++ implementation of STL parallelism and found that it only works if libtbb is installed alongside it.
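For the CMakeLists.txt in the question, linking TBB might look like the fragment below. This is only a sketch: it assumes TBB is already installed and ships a CMake package config that exports the TBB::tbb imported target (oneTBB does; other packagings may differ). I have not tried it on the MSYS2 setup from the question.

```cmake
# Assumption: TBB is installed and find_package can locate its package config,
# which exports the imported target TBB::tbb.
find_package(TBB REQUIRED)

# Link the benchmark target from the question against TBB so that the
# parallel execution policies have a backend to run on.
target_link_libraries(TEST PRIVATE TBB::tbb)
```

These two lines would go after the add_executable(TEST main.cpp) call, next to the existing OpenMP lines.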

Just as OpenMP only works if you compile with -fopenmp (without that flag everything falls back to sequential), the STL implementation of the execution policies falls back to sequential if you don't have libtbb installed, and it does not come with g++ by default.
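One way to check whether that fallback is happening is to count how many distinct threads actually run the callable under a given policy. The helper below is a minimal sketch, not a definitive test: the name count_worker_threads is made up for illustration and is not part of any library. With g++ the program may also need to be linked against TBB (e.g. -ltbb) when the TBB headers are present.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <execution>
#include <mutex>
#include <set>
#include <thread>
#include <vector>

// Hypothetical helper: returns the number of distinct threads that executed
// the callable under `policy`. Intended for seq and par only, because the
// callable locks a mutex.
template <class ExecutionPolicy>
std::size_t count_worker_threads(ExecutionPolicy policy) {
    std::vector<int> items(1 << 20);   // enough elements to be worth splitting
    std::set<std::thread::id> ids;     // thread IDs seen so far
    std::mutex ids_mutex;              // protects `ids`

    std::for_each(policy, items.begin(), items.end(), [&](int&) {
        std::lock_guard<std::mutex> lock(ids_mutex);
        ids.insert(std::this_thread::get_id());
    });

    assert(!ids.empty());              // the callable ran at least once
    return ids.size();
}
```

If count_worker_threads(std::execution::par) returns 1, the algorithm most likely ran serially (or the machine has a single core); a larger number shows the work really was spread over several threads. With std::execution::seq the result is always 1. Don't pass par_unseq or unseq to this helper, since locking a mutex is not allowed inside vectorized execution.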

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
