How can I implement a parallel Cartesian product efficiently in CUDA C++?
I have two arrays A and B containing m and n int values respectively. I have a function f, and I want to generate f(a, b) for every a in A and every b in B:
A = {a1, a2, a3, ...., am}
B = {b1, b2, b3, ...., bn}
C = f(A * B) = {f(a1,b1), f(a1,b2), f(a1,b3), ..., f(a1, bn),
f(a2,b1), f(a2,b2), f(a2,b3), ..., f(a2, bn),
...
f(am,b1), f(am,b2), f(am,b3), ..., f(am, bn)}
More precisely, my function f is a simple bitwise OR between numbers, but I wanted to ask the question in general. Moreover, the order of the items in the result does not matter; I only want to generate all possible pairs from A and B.
As I need to generate and store the result, how can I implement this efficiently (in CUDA C++) so as to reduce the number of global memory accesses? Do I need to use shared memory?
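For reference, this is the naive version I have in mind (one thread per output item, with f as a bitwise OR; the kernel name is just for illustration):

```cpp
// Naive baseline: every thread reads a[i] and b[j] straight from global
// memory and writes one output element of the m-by-n result matrix.
__global__ void cartesianNaive(const int* a, const int* b, int* c,
                               int m, int n)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // index into A
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // index into B
    if (i < m && j < n)
        c[(size_t)i * n + j] = a[i] | b[j];         // f(a, b) = a | b
}
```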
Any recommendation would be really appreciated.
Solution 1:[1]
The idea is to operate on thread blocks of size 32x32, for example. For each thread block at location (i, j), `a[i:i+32]` can be loaded into shared memory by a part of the threads of the current block; the same applies to `b[j:j+32]`. Then all the threads can compute `f(shared_a[thread_i], shared_b[thread_j])`, where `shared_a` and `shared_b` hold the values previously loaded into shared memory, and `thread_i` and `thread_j` are the indices of the thread within the current block.
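Here is a minimal sketch of this scheme (the kernel name is illustrative; f is assumed to be a bitwise OR, and m and n are assumed to be multiples of 32 so bound checks can be omitted):

```cpp
// Launch with dim3 block(32, 32) and dim3 grid(n / 32, m / 32).
__global__ void cartesianTiled(const int* a, const int* b, int* c,
                               int m, int n)
{
    __shared__ int shared_a[32];
    __shared__ int shared_b[32];

    const int thread_i = threadIdx.y;
    const int thread_j = threadIdx.x;
    const int i = blockIdx.y * 32 + thread_i;  // global index into A
    const int j = blockIdx.x * 32 + thread_j;  // global index into B

    // Two rows of threads load the two tiles; the rest just wait.
    if (thread_i == 0)
        shared_b[thread_j] = b[j];
    else if (thread_i == 1)
        shared_a[thread_j] = a[blockIdx.y * 32 + thread_j];
    __syncthreads();

    // Every thread computes one output item from shared memory.
    c[(size_t)i * n + j] = shared_a[thread_i] | shared_b[thread_j];
}
```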
For better performance, a thread block can compute several adjacent chunks of size 32x32. For example, a thread block of 32x32 threads can compute a 4x4 grid of packed 32x32 chunks (covering 128x128 items). The idea is to load more values into shared memory so that they are reloaded less often.
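A sketch of this packed variant (same assumptions as above, with m and n now assumed to be multiples of 128):

```cpp
// Launch with dim3 block(32, 32) and dim3 grid(n / 128, m / 128):
// each 32x32 block covers a 128x128 area, i.e. 4x4 items per thread,
// so each shared value is reused for many outputs.
__global__ void cartesianTiled4x4(const int* a, const int* b, int* c,
                                  int m, int n)
{
    __shared__ int shared_a[128];
    __shared__ int shared_b[128];

    const int base_i = blockIdx.y * 128;
    const int base_j = blockIdx.x * 128;
    const int tid = threadIdx.y * 32 + threadIdx.x;  // 0..1023

    // The first 128 threads load the A tile, the next 128 the B tile.
    if (tid < 128)
        shared_a[tid] = a[base_i + tid];
    else if (tid < 256)
        shared_b[tid - 128] = b[base_j + (tid - 128)];
    __syncthreads();

    // Each thread writes 16 items; this tj layout keeps the writes
    // coalesced (consecutive threadIdx.x write consecutive addresses).
    for (int di = 0; di < 4; ++di)
        for (int dj = 0; dj < 4; ++dj) {
            const int ti = di * 32 + threadIdx.y;  // 0..127 within tile
            const int tj = dj * 32 + threadIdx.x;
            c[(size_t)(base_i + ti) * n + (base_j + tj)]
                = shared_a[ti] | shared_b[tj];
        }
}
```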
That being said, note that writing the `C` matrix to global memory will be very expensive, and this will certainly make all the previous optimizations worthless. Computing the operations that require `C` on the fly should be a good idea.
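For example, if the operation needed on `C` were (hypothetically) the sum of all the `f(a, b)` values, the whole computation can be fused into one kernel so that `C` is never stored at all:

```cpp
// Hypothetical example of consuming C on the fly: sum all f(a, b)
// (assuming non-negative values) instead of storing the m*n matrix.
// One atomicAdd per thread is simple but slow; a per-block reduction
// before the atomic would be faster.
__global__ void cartesianReduce(const int* a, const int* b,
                                unsigned long long* sum, int m, int n)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < m && j < n)
        atomicAdd(sum, (unsigned long long)(a[i] | b[j]));
}
```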
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Jérôme Richard |