Warp Matrix-Multiply functions - are single-precision multiplicands supported?

In the CUDA Programming Guide, v11.7, section B.24.6 (Element Types & Matrix Sizes), there's a table of supported type combinations, in which the multiplicand types are either sub-single-precision floating-point types or `double` - never `float`. But in section B.24.1 (Description), it says that:

The data type, T [for matrix fragments], may be double, float, __half, __nv_bfloat16, char, or unsigned char for multiplicands and double, float, int, or __half for accumulators.

So, can the multiplicand matrices be float, or can't they?



Solution 1:[1]

Probably not: single-precision floating-point multiplicands are not supported.

The PTX ISA guide lists the lower-level WMMA primitives and their permitted operand combinations in section 9.7.13; and, indeed, there are no primitives in which single-precision floating-point (f32) is the data type of the multiplicand matrices. The closest we can get is tf32 for the multiplicands, with f32 for the addend and the result. Now, if PTX doesn't have the primitives we're after, it is all but impossible that they exist on the actual micro-architectures and are simply not exposed (and the compiler will not optimize other PTX code into fp32 WMMA).

Note that double-precision multiplicands are supported (although YMMV when it comes to their speed).
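For concreteness, here is roughly what this looks like with the C++ `nvcuda::wmma` API - a sketch only, assuming an Ampere-class GPU (compute capability 8.0 or later): the multiplicand fragments can be declared with element type `precision::tf32` or `double`, and the accumulator can be `float`, but there is no fragment specialization with `float` multiplicands.

```cpp
#include <mma.h>
using namespace nvcuda;

// Compiles for sm_80 and later (e.g. nvcc -arch=sm_80).
__global__ void fragment_types_demo()
{
    // TF32 multiplicands with an FP32 accumulator - the 16x16x8 shape
    // from the guide's table; this is as close to "float" as it gets.
    wmma::fragment<wmma::matrix_a, 16, 16, 8, wmma::precision::tf32, wmma::row_major> a_tf32;
    wmma::fragment<wmma::matrix_b, 16, 16, 8, wmma::precision::tf32, wmma::col_major> b_tf32;
    wmma::fragment<wmma::accumulator, 16, 16, 8, float> c_f32;

    // FP64 multiplicands and accumulator - the 8x8x4 shape.
    wmma::fragment<wmma::matrix_a, 8, 8, 4, double, wmma::row_major> a_f64;
    wmma::fragment<wmma::matrix_b, 8, 8, 4, double, wmma::col_major> b_f64;
    wmma::fragment<wmma::accumulator, 8, 8, 4, double> c_f64;

    // No such specialization exists - float is not a valid multiplicand
    // element type, so this line would fail to compile:
    // wmma::fragment<wmma::matrix_a, 16, 16, 8, float, wmma::row_major> a_f32;
}
```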


Edit: This answer may need some qualification. A 2020 NVIDIA blog post about TF32 says:

TF32 Tensor Cores operate on FP32 inputs and produce results in FP32. Non-matrix operations continue to use FP32.

So, if you can pretend your multiplicands are TF32 (and I'm not sure you actually can), then, in a sense, single-precision floating-point is usable, though the effective precision of the multiplication is lower than full FP32 (TF32 keeps only a 10-bit mantissa).
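To illustrate what "pretending the multiplicands are TF32" might look like, here is a minimal sketch following the pattern of the programming guide's TF32 example: the inputs are plain `float` buffers, but the fragment element type is `precision::tf32`, and each loaded value is explicitly rounded with `wmma::__float_to_tf32()` before the multiply-accumulate. The kernel name, launch configuration, and single-tile layout are my own assumptions, and it requires compute capability 8.0 or later.

```cpp
#include <mma.h>
using namespace nvcuda;

// Computes C (16x16) = A (16x8) * B (8x16) for a single tile, using one warp.
// Launch as e.g. wmma_tf32_gemm_tile<<<1, 32>>>(dA, dB, dC); requires sm_80+.
__global__ void wmma_tf32_gemm_tile(const float *a, const float *b, float *c)
{
    // TF32 fragments for the multiplicands, FP32 for the accumulator.
    wmma::fragment<wmma::matrix_a, 16, 16, 8, wmma::precision::tf32, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 8, wmma::precision::tf32, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 8, float> acc_frag;

    wmma::fill_fragment(acc_frag, 0.0f);

    // The source buffers hold ordinary FP32 values; the leading dimension
    // is 8 (A is 16x8 row-major, B is 8x16 column-major).
    wmma::load_matrix_sync(a_frag, a, 8);
    wmma::load_matrix_sync(b_frag, b, 8);

    // Explicitly round each loaded value to TF32 precision (10-bit mantissa) -
    // this is the "pretend it's TF32" step.
    for (int i = 0; i < a_frag.num_elements; ++i)
        a_frag.x[i] = wmma::__float_to_tf32(a_frag.x[i]);
    for (int i = 0; i < b_frag.num_elements; ++i)
        b_frag.x[i] = wmma::__float_to_tf32(b_frag.x[i]);

    // Tensor-core multiply-accumulate; the accumulation itself is in FP32.
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);

    // Write the 16x16 FP32 result back, row-major with leading dimension 16.
    wmma::store_matrix_sync(c, acc_frag, 16, wmma::mem_row_major);
}
```

The multiply inputs are truncated to TF32, but the products are accumulated in FP32, which is why the accuracy loss is confined to the rounding of the multiplicands.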

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1