'CUDA dynamic parallelism: Access child kernel results in global memory
I am currently trying my first dynamic parallelism code in CUDA. It is pretty simple. In the parent kernel I am doing something like this:
int aPayloads[32];
// Compute aPayloads start values here
int* aGlobalPayloads = nullptr;
cudaMalloc(&aGlobalPayloads, (sizeof(int) *32));
cudaMemcpyAsync(aGlobalPayloads, aPayloads, (sizeof(int)*32), cudaMemcpyDeviceToDevice));
mykernel<<<1, 1>>>(aGlobalPayloads); // Modifies data in aGlobalPayloads
cudaDeviceSynchronize();
// Access results in payload array here
Assuming that I do things right so far, what is the fastest way to access the results in aGlobalPayloads
after kernel execution? (I tried cudaMemcpy()
to copy aGlobalPayloads
back to aPayloads
but cudaMemcpy()
is not allowed in device code).
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|