I've seen a similar post on Stack Overflow which tackles the problem in C++: Parallel implementation for multiple SVDs using CUDA. I want to do exactly the same i…
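For context, the usual C++ route for many small SVDs in parallel is cuSOLVER's batched Jacobi SVD. The sketch below is a minimal, assumption-laden outline of that API (matrix sizes, batch count, and fill-in are illustrative, not the linked post's code); note that `gesvdjBatched` only supports matrices up to 32×32:

```cuda
#include <cuda_runtime.h>
#include <cusolverDn.h>

int main() {
    const int m = 8, n = 8, batch = 32;          // illustrative sizes
    cusolverDnHandle_t handle;
    cusolverDnCreate(&handle);

    gesvdjInfo_t params;
    cusolverDnCreateGesvdjInfo(&params);

    float *d_A, *d_S, *d_U, *d_V;
    int *d_info;
    cudaMalloc(&d_A, sizeof(float) * m * n * batch);
    cudaMalloc(&d_S, sizeof(float) * n * batch);
    cudaMalloc(&d_U, sizeof(float) * m * m * batch);
    cudaMalloc(&d_V, sizeof(float) * n * n * batch);
    cudaMalloc(&d_info, sizeof(int) * batch);
    // ... fill d_A with the `batch` column-major m x n matrices ...

    int lwork = 0;
    cusolverDnSgesvdjBatched_bufferSize(handle, CUSOLVER_EIG_MODE_VECTOR,
                                        m, n, d_A, m, d_S, d_U, m, d_V, n,
                                        &lwork, params, batch);
    float *d_work;
    cudaMalloc(&d_work, sizeof(float) * lwork);

    // one call decomposes all `batch` matrices in parallel
    cusolverDnSgesvdjBatched(handle, CUSOLVER_EIG_MODE_VECTOR,
                             m, n, d_A, m, d_S, d_U, m, d_V, n,
                             d_work, lwork, d_info, params, batch);
    cudaDeviceSynchronize();

    cudaFree(d_work); cudaFree(d_info); cudaFree(d_V);
    cudaFree(d_U); cudaFree(d_S); cudaFree(d_A);
    cusolverDnDestroyGesvdjInfo(params);
    cusolverDnDestroy(handle);
    return 0;                                    // link with -lcusolver
}
```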
I'm somewhat lost on the point of converting, for instance, cgbn_mem_t<256> into cgbn_mem_t<1024> in device code. Say, the kernel receives two point…
I am trying to set up a cross-compile environment on an AWS EC2 Ubuntu box targeting Nvidia Xavier devices on CUDA 10.2. I tried following the "instructions" at…
I am having trouble using cuda-gdb. My program starts from Python and loads a shared library containing TensorFlow and CUDA code. The command I used to start…
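For a setup like this, the debugger is typically attached by launching the interpreter under cuda-gdb; a minimal sketch (the script and kernel names are placeholders):

```bash
# launch the Python interpreter under cuda-gdb, passing the script as argv
cuda-gdb --args python3 my_script.py
# then, at the (cuda-gdb) prompt:
#   break my_kernel     # break on a device function by name
#   run                 # breakpoints resolve once the CUDA library loads
```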
I want to run CUDA code on Google Colab. For that I am following the steps below, but I am not able to install the CUDA packages. Step 1: Removing previous CUDA versions…
I am trying to implement some JPEG encoding CUDA code based on the sample code below: https://docs.nvidia.com/cuda/nvjpeg/index.html#nvjpeg-encode-examples I pos…
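For orientation, a minimal sketch following the encode example that page documents; the image size, interleaved-RGB layout, and 4:2:0 subsampling here are illustrative choices:

```cuda
#include <cuda_runtime.h>
#include <nvjpeg.h>
#include <vector>

int main() {
    nvjpegHandle_t handle;
    nvjpegEncoderState_t state;
    nvjpegEncoderParams_t params;
    cudaStream_t stream = 0;

    nvjpegCreateSimple(&handle);
    nvjpegEncoderStateCreate(handle, &state, stream);
    nvjpegEncoderParamsCreate(handle, &params, stream);
    nvjpegEncoderParamsSetSamplingFactors(params, NVJPEG_CSS_420, stream);

    const int w = 640, h = 480;                        // illustrative size
    nvjpegImage_t src = {};
    cudaMalloc((void **)&src.channel[0], w * h * 3);   // interleaved RGB
    src.pitch[0] = w * 3;
    // ... fill src.channel[0] with device-side RGB pixels ...

    nvjpegEncodeImage(handle, state, params, &src,
                      NVJPEG_INPUT_RGBI, w, h, stream);

    // retrieve the bitstream: first query the size, then copy it out
    size_t length = 0;
    nvjpegEncodeRetrieveBitstream(handle, state, nullptr, &length, stream);
    std::vector<unsigned char> jpeg(length);
    nvjpegEncodeRetrieveBitstream(handle, state, jpeg.data(), &length, stream);
    cudaStreamSynchronize(stream);

    cudaFree(src.channel[0]);
    nvjpegEncoderParamsDestroy(params);
    nvjpegEncoderStateDestroy(state);
    nvjpegDestroy(handle);
    return 0;                                          // link with -lnvjpeg
}
```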
During training this code with Ray Tune (1 GPU per trial), after a few hours of training (about 20 trials) a CUDA out-of-memory error occurred on GPUs 0 and 1. And ev…
I tried VS 2015, 2017, 2019, and 2022 without success; for CMake I tried 3.14.1 and the latest version. CUDA is available, and VS 2019 also seems to have compiled test.cu…
As of April 26th, 2022, CUDA has been updated to version 11.6, which can be installed per Nvidia's instructions: wget https://developer.download.nvidia.com/compute/cuda/11.6…
I am using Numba CUDA to calculate a function. The code simply adds up all the values into one result, but Numba CUDA gives me a different result from NumPy…
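A mismatch like this is usually an unsynchronized accumulation: many threads read-modify-write the same result slot. A minimal race-free sketch using `cuda.atomic.add` (the data and launch shape are illustrative):

```python
import numpy as np
from numba import cuda

@cuda.jit
def sum_kernel(arr, result):
    i = cuda.grid(1)
    if i < arr.size:
        # atomic update avoids the read-modify-write race that a plain
        # `result[0] += arr[i]` would have across threads
        cuda.atomic.add(result, 0, arr[i])

arr = np.random.rand(100_000)
result = np.zeros(1)
threads = 256
blocks = (arr.size + threads - 1) // threads
sum_kernel[blocks, threads](arr, result)
print(result[0], arr.sum())  # may still differ in the last digits:
                             # float addition order is not associative
```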
I made a new CUDA executable project in CLion and when it opened I got a CMake error: CUDA_ARCHITECTURES is empty for target "cmTC_908f4". CMakeLists.txt: cmake_…
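One common fix, assuming CMake ≥ 3.18: set `CMAKE_CUDA_ARCHITECTURES` before `project()`, so both your targets and CMake's internal compiler-check targets (the `cmTC_*` ones) get an architecture. The value 75 below is illustrative; pick your GPU's compute capability:

```cmake
cmake_minimum_required(VERSION 3.18)
# policy CMP0104 requires CUDA_ARCHITECTURES to be non-empty; setting the
# cache-level default before project() covers the compiler checks as well
set(CMAKE_CUDA_ARCHITECTURES 75)
project(my_cuda_project LANGUAGES CXX CUDA)
add_executable(my_cuda_project main.cu)
```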
Do __shfl_xx_sync() instructions, where only some lanes participate, need an additional __syncwarp() instruction, or is setting a mask enough? I cannot provide…
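For reference, the pattern in question looks like the sketch below; per the programming guide, the `_sync` intrinsics themselves converge exactly the lanes named in the mask, so no surrounding `__syncwarp()` is needed as long as every lane in the mask reaches the call:

```cuda
__global__ void partial_shuffle(int *out) {
    const unsigned mask = 0x0000FFFFu;   // only lanes 0-15 participate
    int lane = threadIdx.x & 31;
    int v = lane;
    if (lane < 16) {
        // the mask performs the required convergence: all 16 named lanes
        // wait for each other here, independent of the other half-warp
        v = __shfl_down_sync(mask, v, 1);
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = v;
}
```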
I was not able to find other topics about finding the largest prime factor of a number using CUDA, and I am having some issues. #include <cuda.h> #include…
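A minimal sketch of one way to parallelize this: one thread per candidate divisor up to √n, recording the largest prime divisor (or prime cofactor) with `atomicMax`. All names and the example input are mine, not the question's code; 64-bit `atomicMax` needs compute capability 3.5+:

```cuda
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

__host__ __device__ bool is_prime(unsigned long long x) {
    if (x < 2) return false;
    for (unsigned long long d = 2; d * d <= x; ++d)
        if (x % d == 0) return false;
    return true;
}

__global__ void largest_prime_factor(unsigned long long n,
                                     unsigned long long *best) {
    unsigned long long d = (unsigned long long)blockIdx.x * blockDim.x
                         + threadIdx.x + 2;      // candidates start at 2
    if (d * d <= n && n % d == 0) {
        if (is_prime(d))     atomicMax(best, d);
        if (is_prime(n / d)) atomicMax(best, n / d);   // the cofactor
    }
}

int main() {
    unsigned long long n = 600851475143ull;      // illustrative input
    unsigned long long h_best = 1, *d_best;
    cudaMalloc(&d_best, sizeof(h_best));
    cudaMemcpy(d_best, &h_best, sizeof(h_best), cudaMemcpyHostToDevice);

    unsigned long long limit = (unsigned long long)sqrt((double)n) + 1;
    unsigned threads = 256;
    unsigned blocks = (unsigned)((limit + threads - 1) / threads);
    largest_prime_factor<<<blocks, threads>>>(n, d_best);

    cudaMemcpy(&h_best, d_best, sizeof(h_best), cudaMemcpyDeviceToHost);
    if (h_best == 1) h_best = n;                 // no divisor: n is prime
    printf("largest prime factor of %llu = %llu\n", n, h_best);
    cudaFree(d_best);
    return 0;
}
```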
Suppose that, in my CUDA grid block, I have a matrix which I want to multiply by a vector, and that my data type is either half, single, or double precision (i…
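The shape this usually takes is a type-templated kernel with one thread per output row; a minimal sketch under that assumption (all names are mine; native `__half` arithmetic needs compute capability 5.3+):

```cuda
#include <cuda_fp16.h>

// y = A * x for one row-major `rows x cols` matrix handled by this
// block; T may be __half, float, or double.
template <typename T>
__global__ void block_matvec(const T * __restrict__ A,
                             const T * __restrict__ x,
                             T * __restrict__ y,
                             int rows, int cols) {
    int row = threadIdx.x;              // one thread per output row
    if (row < rows) {
        T acc = static_cast<T>(0.0f);
        for (int c = 0; c < cols; ++c)
            acc = acc + A[row * cols + c] * x[c];
        y[row] = acc;
    }
}
// launched e.g. as block_matvec<float><<<1, rows>>>(dA, dx, dy, rows, cols);
```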
In the CUDA Programming Guide, v11.7, section B.24.6 "Element Types & Matrix Sizes", there's a table of supported type combinations, in which the multiplicat…
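For concreteness, one supported row of that table (half × half with float accumulation at the 16×16×16 shape) looks like this minimal sketch using the `nvcuda::wmma` API (requires compute capability 7.0+):

```cuda
#include <mma.h>
using namespace nvcuda;

// one warp multiplies a 16x16 half tile by a 16x16 half tile,
// accumulating into float, matching the table's half/half/float row
__global__ void wmma_tile(const half *a, const half *b, float *c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> fc;

    wmma::fill_fragment(fc, 0.0f);
    wmma::load_matrix_sync(fa, a, 16);   // leading dimension 16
    wmma::load_matrix_sync(fb, b, 16);
    wmma::mma_sync(fc, fa, fb, fc);
    wmma::store_matrix_sync(c, fc, 16, wmma::mem_row_major);
}
```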
Newbie here; I reckon this may be a very foolish question. I am simultaneously running on CUDA, in two distinct processes, a simple 3-layer MLP neural network ov…
The CUDA runtime API allows us to launch kernels using the variable-number-of-arguments triple-chevron syntax: my_kernel<<<grid_dims, block_dims, shared_mem_size, stream>>>(…)
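For comparison, the same launch expressed through the runtime API's `cudaLaunchKernel`, which takes the kernel arguments as an array of pointers (the kernel here is illustrative):

```cuda
#include <cuda_runtime.h>

__global__ void my_kernel(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2;
}

cudaError_t launch(int *data, int n, dim3 grid_dims, dim3 block_dims,
                   size_t shared_mem_size, cudaStream_t stream) {
    // the runtime marshals arguments as pointers to each argument value
    void *args[] = { &data, &n };
    return cudaLaunchKernel((const void *)my_kernel, grid_dims, block_dims,
                            args, shared_mem_size, stream);
}
```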
I am trying to install PyTorch with CUDA. I followed the instructions (installation using conda) mentioned at https://pytorch.org/get-started/locally/: conda in…
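For reference, the conda command that page generates has this general shape (the CUDA version pinned below is illustrative; use whatever the selector on pytorch.org shows for your setup):

```bash
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
# then verify the CUDA build is actually active:
python -c "import torch; print(torch.cuda.is_available())"
```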
When I run nvidia-smi, I get the following message: Failed to initialize NVML: Driver/library version mismatch. An hour ago I received the same…
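This usually means the user-space driver library and the loaded kernel modules come from different driver versions, e.g. after a driver upgrade without a reboot. A minimal sketch of the usual remedies, assuming nothing is currently using the GPU:

```bash
# simplest: reboot so the new kernel modules are loaded
sudo reboot

# or reload the NVIDIA modules in place
sudo rmmod nvidia_uvm nvidia_drm nvidia_modeset nvidia
sudo modprobe nvidia
nvidia-smi   # should now report matching versions
```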
Numba CUDA has syncthreads() to sync all threads within a block. How can I sync all blocks in a grid without exiting the current kernel? In CUDA C there's a cooperative groups API for this…
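Numba exposes the same cooperative-groups mechanism: `cuda.cg.this_grid()` returns a grid group whose `sync()` is a grid-wide barrier (available in recent Numba versions, on devices that support cooperative launch; the grid must fit co-resident on the device). A minimal sketch:

```python
import numpy as np
from numba import cuda

@cuda.jit
def two_phase(x):
    g = cuda.cg.this_grid()
    i = cuda.grid(1)
    if i < x.size:
        x[i] += 1.0
    g.sync()           # every block reaches here before any continues
    if i < x.size:
        x[i] *= 2.0    # safely reads the post-barrier state

x = np.zeros(1024)
two_phase[4, 256](x)   # launched cooperatively because the kernel syncs
print(x[:4])           # [2. 2. 2. 2.]
```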