'Writing from Device to Host and notifying the host
Using CUDA 5 with VS 2012 and capability 3.5 (Titan and K20).
At particular stages of my kernel execution, I want to send a generated data chunk to the host memory and notify the host that the data is ready, so the host will operate on it.
I cannot wait until the end of the kernel execution to read the data back from the device, because:
- The data is no longer relevant to the device once it is calculated, so there is no point keeping it to the end.
- The data size is too large to fit on the device memory and wait until the end.
- The host should not have to wait until the end of the kernel execution to start processing the data.
Could you point me to the path I have to take and the possible cuda concepts and functions I have to use to achieve my requirements? Put simply, how can I write to the host and notify the host that a chunk data is ready for host processing?
N.B. Each thread does not share any generated data with any other thread, they run independently. So, as far as I know (and please correct me if I am wrong), the concept of blocks, threads and warps do not affect the question. Or in other words, if they aid the answer, I am free to alter their combination.
Below is a sample code that shows that I am trying to do:
#pragma once
#include <conio.h>
#include <cstdio>
#include <cuda_runtime_api.h>
__global__ void Kernel(size_t length, float* hResult)
{
int tid = threadIdx.x + blockIdx.x * blockDim.x;
// Processing multiple data chunks
for(int i = 0;i < length;i++)
{
// Once this is assigned, I don't need it on the device anymore.
hResult[i + (tid * length)] = i * 100;
}
}
void main()
{
size_t length = 10;
size_t threads = 2;
float* hResult;
// An array that will hold all data from all threads
cudaMallocHost((void**)&hResult, threads * length * sizeof(float));
Kernel<<<threads,1>>>(length, hResult);
// I DO NOT want to wait to the end and block to get the data
cudaError_t error = cudaDeviceSynchronize();
if (error != cudaSuccess) { throw error; }
for(int i = 0;i < threads * length;i++)
{
printf("%f\n", hResult[i]);;
}
cudaFreeHost(hResult);
system("pause");
}
Solution 1:[1]
Here is one possible approach. At a high level, on the device:
- You'll need to write the data to either device global memory (allocated previously with
cudaMalloc
) or else directly to host memory (allocated previously withcudaHostAlloc
). This memory should be accessed via avolatile
pointer. - You may wish to do all the data writing to this region from a single threadblock, to be sure that all the data is written prior to the following steps
- You'll then want to issue a
threadfence()
(if you're using device global memory) orthreadfence_system()
call (if using host memory) prior to the following steps - Next you'll write to a special location in device global memory or host memory, let's call it the mailbox location, with a specific value indicating the data is ready. This location should also be accessed with a
volatile
pointer. - Optionally issue another threadfence or threadfence_system call
- for device memory usage on the receiving end, again both regions (payload and "mailbox") should be accessed using a
volatile
pointer.
On the host:
- Before launching the kernel, the host will need to set the mailbox location to a default value.
- After launching the kernel, the host thread will need to "poll" the mailbox location, looking for the specific value indicating data is ready
- Once the specific value is seen, indicating that the data is ready, the host can consume the data
- Optionally, if you want to repeat this process, the host can reset the mailbox location to the default value. The device can check for this default value before updating the data block with new data.
- Both the mailbox location and the payload region should be accessed by the host thread using a
volatile
pointer.
Note that even with the above process, there is still an implied device-wide synchronization needed, if the data is being generated/created from multiple threadblocks. The only straightforward device-wide synchronization available is the kernel launch (or completion of the kernel, specifically). Copying the data from a single threadblock simply moves the requirement for device-wide sync out of this particular sequence (to somewhere before this sequence).
The reasons you give don't really suggest to me that the code could not be refactored to create the data on a kernel-launch by kernel-launch basis, which would neatly solve these issues and eliminate the need for the above process as well.
EDIT: responding to a question in the comments. It's difficult to be more specific about how to refactor the code to deliver one data chunk per kernel call, without a specific example.
Let's take an image processing case, where I have a video sequence of 30 frames stored in global memory. The kernel will process each frame according to some algorithm, then make the processed data available to the host.
In your proposal, after the kernel is done processing a frame, it can signal to the host that the data is ready, and go on to process the next frame. The problem is, if the frame is processed by multiple threadblocks, there's no easy way to know when all threadblocks are done processing that frame. A device-wide synchronization barrier might be what is needed, but it doesn't exist conveniently, except via the kernel call mechanism. However, presumably inside such a kernel we might have a sequence like this:
- while (more_frames)
-
- process a frame
-
- signal host
-
- increment frame pointer
In a refactored approach, we would move the loop outside the kernel, to host code:
- while (more_frames)
-
- call kernel to process frame
-
- consume frame
-
- increment frame pointer
By doing this, the kernel marks the explicit synchronization needed to know when the frame processing is complete, and the data can be consumed.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 |