'In a CUDA kernel, how do I store an array in "local thread memory"?
I'm trying to develop a small program with CUDA, but since it was SLOW I made some tests and googled a bit. I found out that while single variables are by default stored within the local thread memory, arrays usually aren't. I suppose that's why it takes so much time to execute. Now I wonder: since local thread memory should be at least of 16KB and since my arrays are just like 52 chars long, is there any way (syntax please :) ) to store them in local memory?
Shouldn't it be something like:
__global__ my_kernel(int a)
{
__local__ unsigned char p[50];
}
Solution 1:[1]
Arrays, local memory and registers
There is a misconception here regarding the definition of "local memory". "Local memory" in CUDA is actually global memory (and should really be called "thread-local global memory") with interleaved addressing (which makes iterating over an array in parallel a bit faster than having each thread's data blocked together). If you want things to be really fast, you want to use either shared memory, or, better yet, registers (especially on the latest devices where you get up to 255 registers per thread). Explaining the entire CUDA memory hierarchy falls out of scope of this post. Let us instead focus on making small array computations fast.
Small arrays, just like variables can be stored entirely in registers. On current NVIDIA hardware, however, putting arrays into registers is difficult. Why? Because registers need very careful treatment. If you don't do it exactly right, your data will end up in local memory (which, again, is really global memory, which is the slowest memory you have). The CUDA Programming Guide, section 5.3.2 tells you when local memory is used:
Local Memory
Local memory accesses only occur for some automatic variables as mentioned in Variable Type Qualifiers. Automatic variables that the compiler is likely to place in local memory are:
- Arrays for which it cannot determine that they are indexed with constant quantities,
- Large structures or arrays that would consume too much register space,
- Any variable if the kernel uses more registers than available (this is also known as register spilling).
How does register allocation work?
Note that register allocation is an extremely complicated process which is why you cannot (and should not) interfere with it. Instead, the compiler will convert CUDA code into PTX code (a sort of bytecode) which assumes a machine with infinitely many registers. You can write inline PTX but it won't do too much to register allocation. PTX code is device-independent code and it is only the first stage. In a second stage, PTX will be compiled into device assembly code, called SASS. SASS code has the actual register allocations. The SASS compiler and it's optimizer will also be the ultimate authority on whether a variable will be in registers or local memory. All you can do is try to understand what the SASS compiler does in certain cases and use that for your advantage. Code correlation view in Nsight can help you with that (see below). However, since the compiler and optimizer keep changing, there is no guarantees as to what will or will not be in registers.
Insufficient registers
Appendix G, section 1 tells you how many registers a thread can have. Look for "Maximum number of 32-bit registers per thread". In order to interpret that table, you must know your compute capability (see below). Don't forget that registers are used for all kinds of things, and don't just correlate to single variables. Registers on all devices up to CC 3.5 are 32 bit each. If the compiler is smart enough (and the CUDA compiler keeps changing), it can for example pack multiple bytes into the same register. The Nsight code correlation view (see "Analyzing Memory Accesses" below) also reveals that.
Constant vs. Dynamic Indexing
While the space constraint is an obvious hurdle to in-register arrays, the thing that is easily overseen is the fact that, on current hardware (Compute Capability 3.x and below), the compiler places any array in local memory that is accessed with dynamic indexing. A dynamic index is an index which the compiler cannot figure out. Arrays accessed with dynamic indices can't be placed in registers because registers must be determined by the compiler, and thus the actual register being used must not depend on a value determined at run-time. For example, given an array arr
, arr[k]
is constant indexing if and only if k
is a constant, or only depends on constants. If k
, in any way, depends on some non-constant value, the compiler cannot compute the value of k
and you got dynamic indexing. In loops where k
starts and ends at a (small) constant numbers, the compiler (most probably) can unroll your loop, and can still achieve constant indexing.
Example
For example, sorting a small array can be done in registers but you must use sorting networks or similarly "hard-wired" approaches, and can't just use a standard algorithm because most algorithms use dynamic indexing.
With quite a high probability, in the following code example, the compiler keeps the entire aBytes
array in registers because it is not too large and the loops can fully be unrolled (because the loop iterates over a constant range). The compiler (very probably) knows which register is being accessed at every step and can thus keep it fully in registers. Keep in mind that there are no guarantees. The best you can do is to verify it on a case-by-case basis using CUDA developer tools, as described below.
__global__
void
testSortingNetwork4(const char * aInput, char * aResult)
{
const int NBytes = 4;
char aBytes[NBytes];
// copy input to local array
for (int i = 0; i < NBytes; ++i)
{
aBytes[i] = aInput[i];
}
// sort using sorting network
CompareAndSwap(aBytes, 0, 2); CompareAndSwap(aBytes, 1, 3);
CompareAndSwap(aBytes, 0, 1); CompareAndSwap(aBytes, 2, 3);
CompareAndSwap(aBytes, 1, 2);
// copy back to result array
for (int i = 0; i < NBytes; ++i)
{
aResult[i] = aBytes[i];
}
}
Analyzing memory accesses
Once you are done, you generally want to verify whether the data is actually stored in registers or whether it went to local memory. The first thing you can do is to tell your compiler to give you memory statistics using the --ptxas-options=-v
flag. A more detailed way of analyzing memory accesses is using Nsight.
Nsight has many cool features. Nsight for Visual Studio has a built-in profiler and a CUDA <-> SASS code correlation view. The feature is explained here. Note that Nsight versions for different IDEs are probably developed independently, and thus their features might vary between the different implementations.
If you follow the instructions in above link (make sure to add the corresponding flags when compiling!), you can find the "CUDA Memory Transactions" button at the very bottom of the lower menu. In that view, you want to find that there is no memory transaction coming from the lines that are only working on the corresponding array (e.g. the CompareAndSwap lines in my code example). Because if it does not report any memory access for those lines, you (very probably) were able to keep the entire computation in registers and might just have gained a speed up of thousands, if not tenthousands, of percent (You might also want to check the actual speed gain, you get out of this!).
Figuring out Compute Capability
In order to figure out how many registers you have, you need to know your device's compute capability. The standard way of getting such device information is running the deviceQuery
sample.
(Update - as mentioned by paleonix
in the comments) deviceQuery
is part of the official cuda-samples
repo. You can find it here.
If you have Nsight for Visual Studio, just go to Nsight -> Windows -> System Info.
Don't optimize early
I am sharing this today because I came across this very problem very recently. However, as mentioned in this thread, forcing data to be in registers is definitely not the first step you want to take. First, make sure that you actually understand what is going on, then approach the problem step by step. Looking at the assembly code is certainly a good step, but it should generally not be your first. If you are new to CUDA, the CUDA Best Practices Guide will help you figure out some of those steps.
Solution 2:[2]
All you need is this:
__global__ my_kernel(int a)
{
unsigned char p[50];
........
}
The compiler will automatically spill this to thread local memory if it needs to. But be aware that local memory is stored in SDRAM off the GPU, and it is as slow as global memory. So if you are hoping that this will yield a performance improvement, it might be that you are in for a disappointment.....
Solution 3:[3]
~ For someone that runs across this in the future ~
In a nutshell, to create an array for each thread, you would want to create them in device memory. To do this, a little bit of shared memory can be carved out per thread. Special attention must be taken to prevent conflicts or performance will drop.
Here is an example from an nvidia blog post by Maxim Milakov in 2015:
// Should be multiple of 32
#define THREADBLOCK_SIZE 64
// Could be any number, but the whole array should fit into shared memory
#define ARRAY_SIZE 32
__device__ __forceinline__ int no_bank_conflict_index(int thread_id, int logical_index)
{
return logical_index * THREADBLOCK_SIZE + thread_id;
}
__global__ void kernel5(float * buf, int * index_buf)
{
// Declare shared memory array A which will hold virtual
// private arrays of size ARRAY_SIZE elements for all
// THREADBLOCK_SIZE threads of a threadblock
__shared__ float A[ARRAY_SIZE * THREADBLOCK_SIZE];
...
int index = index_buf[threadIdx.x + blockIdx.x * blockDim.x];
// Here we assume thread block is 1D so threadIdx.x
// enumerates all threads in the thread block
float val = A[no_bank_conflict_index(threadIdx.x, index)];
...
}
Solution 4:[4]
The keyword you are looking for is __shared__
. Large arrays will not fit in the shared memory space, but the compiler should used shared memory for a small fixed-size array like in this case. You can use the __shared__
keyword to ensure this happens. You will see a compile-time error if you exceed the maximum amount of shared memory for a block.
Solution 5:[5]
You are mixing up local and register memory space.
Single variables and constant sized arrays are automatically saved in register space on the chip with almost no costs for read and write.
If you exceed your amount of registers per multiprocessor they will get stored in local memory.
Local memory resides in global memory space and has the same slow bandwidth for read and write operations.
#DEFINE P_SIZE = 50
__global__ void kernel()
{
unsigned char p[P_SIZE];
}
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | |
Solution 3 | SunsetQuest |
Solution 4 | Bruce Hart |
Solution 5 | talonmies |