When should NVRTC compilation produce a CUBIN?

If I understand the workflow description in the NVRTC documentation correctly, here's how it works:

  • Create an NVRTC program from the source text.
  • Compile the NVRTC program to get PTX code.
  • Device-link the PTX code using NVIDIA's Driver API (cuLinkCreate, cuLinkAddData, cuLinkComplete) to get the cubin.

However, beginning with CUDA 11.3, NVRTC has the following API call:

nvrtcResult nvrtcGetCUBIN ( nvrtcProgram prog, char* cubin );

So how can I get a cubin from compilation alone, without the device-link step?



Solution 1:[1]

Well, on the host side you get proper machine code after just compilation, so why not on the device side?

It seems that cubin availability depends on what you targeted with your compilation:

  • If you targeted a "virtual architecture", i.e. a certain compute capability (e.g. compute_60), then the only thing you can get is PTX, which is not yet specific to any micro-architecture.

  • If you targeted a concrete (micro-)architecture (e.g. sm_70), then compilation can proceed all the way to SASS, packaged in a cubin.
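
To make the distinction concrete, here is a minimal sketch of how a program could decide which NVRTC getter to call based on the architecture value it passed to the compiler. The helper names targets_real_architecture and getter_for are hypothetical (they are not part of NVRTC), and the sketch assumes the value always starts with either sm_ or compute_, as in the two examples above.

```cpp
#include <string>

// Hypothetical helper, not an NVRTC function: returns true when the
// architecture value names a concrete micro-architecture (sm_XX) rather
// than a virtual one (compute_XX).
bool targets_real_architecture(const std::string& arch) {
    const std::string prefix = "sm_";
    // compare() on a shorter (or empty) string simply reports a mismatch.
    return arch.compare(0, prefix.size(), prefix) == 0;
}

// Hypothetical helper: name of the NVRTC getter that is expected to
// yield usable output for the given architecture value.
const char* getter_for(const std::string& arch) {
    return targets_real_architecture(arch) ? "nvrtcGetCUBIN" : "nvrtcGetPTX";
}
```

With this, getter_for("sm_70") picks nvrtcGetCUBIN and getter_for("compute_60") picks nvrtcGetPTX. In real code you would probably skip the string check and just ask NVRTC via nvrtcGetCUBINSize() whether a cubin was produced, but the check shows the rule described above.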

Now, when you link using the CUDA driver, you have a context at play, and that's always associated with a physical GPU - a concrete micro-architecture. So that necessarily gives you a cubin.

PS:

  1. Other switches could also affect the availability of cubin output, e.g. --dlink-time-opt.
  2. Before CUDA 11.3, we couldn't call nvrtcGetCUBIN() at all. This seems to have also affected the creation of modules, i.e. whether you can create a module from the PTX or from the cubin.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1