'OpenCL - clEnqueueNDRangeKernel - output array becomes input array

I perform a 5-point stencil operation on a 2D array until getting a convergence computed on this 2D array. So I have multiple iterations (until convergence) and for each iteration, I am calling clEnqueueNDRangeKernel function to compute the new values of 2D input array.

Actually, I manipulate 1D array since kernel code doesn't support 2D (at least, I believe).

My issue is that I don't know how to do the affectation between output and input array. After a computing iteration (stencil operation), I want to assign the output to the input for the next iteration.

But I am confused about how to achieve this.

Below the function used in my main loop :

while(!convergence)
    {
      step = step + 1;

      Compute_Stencil(command_queue, global_item_size, local_item_size, kernel, x0_mem_obj, x_mem_obj, r_mem_obj, x_input, r, size_x, size_y, &error) ;

      convergence = sqrt(error);

      if ((convergence<epsilon) || (step>maxStep)) break;
    }

where x0_mem_obj is the buffer associated to x_input array and x_mem_obj is associated to x_ouput array.

and the Compute_Stencil function that interests me :

void Compute_Stencil(cl_command_queue command_queue, size_t* global_item_size, size_t* local_item_size, cl_kernel kernel, cl_mem x0_mem_obj, cl_mem x_mem_obj, cl_mem r_mem_obj, double* x, double* r, int size_x, int size_y, double* error)
{

status = clEnqueueNDRangeKernel(command_queue, kernel, 2, NULL,
        global_item_size, local_item_size, 0, NULL, NULL);

 // Read the buffer back to the array
  if(clEnqueueReadBuffer(command_queue, x_mem_obj, CL_TRUE, 0,
        (size_x+2) * (size_y+2) * sizeof(double), x, 0, NULL, NULL) != CL_SUCCESS)
    fprintf(stderr,"Error in clEnqueueReadBuffer with x_mem_obj\n");


  if(clEnqueueReadBuffer(command_queue, r_mem_obj, CL_TRUE, 0,
        (size_x+2) * (size_y+2) * sizeof(double), r, 0, NULL, NULL) != CL_SUCCESS)
    fprintf(stderr,"Error in clEnqueueReadBuffer with r_mem_obj\n");


  status = clFlush(command_queue);
  if(status)
  {fprintf(stderr,"Failed to flush command Queue\n");
    exit(-1);}

  if(clEnqueueWriteBuffer(command_queue, x0_mem_obj, CL_TRUE, 0,
        (size_x+2) * (size_y+2) * sizeof(cl_double), x, 0, NULL, NULL) != CL_SUCCESS)
    fprintf(stderr,"Error in clEnqueueWriteuffer with x0_mem_obj\n");

 // Set new Argument - Outputs become Inputs
  status = clSetKernelArg(
      kernel,
      5,
      sizeof(cl_mem),
      (void*)&x0_mem_obj);

...

I think this is not the best method because for each iteration, I have to read the output x_mem_obj buffer to x_input (with clEnqueueReadBuffer) and write x_input to x0_mem_obj buffer (with clEnqueueWWriteBuffer) and finally set the x0_mem_obj buffer to the kernelArg (5th argument) : this buffer represents the input x0_mem_obj in main :

ret = clSetKernelArg(kernel, 5, sizeof(cl_mem), (void *)&x0_mem_obj);

I think this is not the good method because performances are very bad ( I think Read and Write Buffer operations cost a lot of time).

I try not to use ReadBuffer and WriteBuffer in Compute_Stencil function and put directly the output buffer x_mem_obj in the 5th argument for the next call :

 status = clEnqueueNDRangeKernel(command_queue, kernel, 2, NULL,
        global_item_size, local_item_size, 0, NULL, NULL);

status = clFlush(command_queue);

// Set new Argument - Outputs become Inputs
  status = clSetKernelArg(
      kernel,
      5,
      sizeof(cl_mem),
      (void*)&x_mem_obj);

But the results are not valid.

How can I transfer simply, after a NDRangeKernel call, the output array to the input array for the next call of NDRangeKernel?

Update 1

@doqtor, thanks for your answer but I have to specify that, after the computing of new values (i.e after the call of NDRangeKernel), I need to assign the new calculated values to the input, but I think I don't need to replace the input array by the output one : the output buffer will be systematically overwritted by the new values calculated from the input buffer values.

In my kernel code, I have the following arguments :

__kernel void kernelHeat2D(const double diagx, const double diagy,
                                 const double weightx, const double weighty,
                                 const int size_x,
                                 __global double* tab_current,
                                 __global double* tab_new,
                                 __global double* r)

where tab_new is the output array and tab_current the input one. tab_current is the 6th argument (so numbered by 5 in clSetKernelArg).

That's why, after NDRangeKernel call, I think that I have only to use :

// Set new Argument - Outputs become Inputs
  status = clSetKernelArg(
      kernel,
      5,
      sizeof(cl_mem),
      (void*)&x_mem_obj);

Update 2

The method above in Update 1 doesn't work : I get at the execution random difference values in array "r" (whose buffer is r_mem_obj in my code ). This array allows to compute the convergence, so I get different number of steps at each execution.

To work, I have to put explicitly in main loop :

while (!convergence) {

clEnqueueNDRangeKernel();

// Read output buffer and put it into xOutput
clEnqueueReadBuffer( x_mem_obj, xOutput);

// Read error buffer and put it into r
clEnqueueReadBuffer( r_mem_obj, r);

// Write output array to input buffer
clEnqueueWriteBuffer( x0_mem_obj, xOutput)

// put input buffer into input argument for next call of NDRangeKernel
status = clSetKernelArg(
      kernel,
      5,
      sizeof(cl_mem),
       (void*)&x0_mem_obj);
}

I would like to avoid using ReadBuffer and WriteBuffer (to force setting xOutput to input x0_mem_obj buffer) because it gives poor performances from a time execution point of view.



Solution 1:[1]

The problem seems to be that you set output as input only and then you have the same buffer as input and output. You need to swap buffers:

buffer1 = create buffer 1
buffer2 = create buffer 2

clEnqueueWriteBuffer(..., buffer1, ...);
clEnqueueWriteBuffer(..., buffer2, ...);

cl_mem *ptrInput = &buffer1;
cl_mem *ptrOutput = &buffer2;

for(..)
{
    clSetKernelArg(..., inputIdx, ptrInput, ...);
    clSetKernelArg(..., outputIdx, ptrOutout, ...);
    clEnqueueNDRangeKernel(...);

    // swap buffers
    cl_mem *ptrTpm = ptrInput;
    ptrInput = ptrOutput;
    ptrOuput = ptrTmp;
}

// ...
// Read results data back
clEnqueueReadBuffer(..., ptrInput, ...); // read from ptrInput because we did extra swap

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1