Are tensor cores / WMMA useful for matrix-vector multiplication?

Suppose that, in my CUDA grid block, I have a matrix which I want to multiply by a vector, and that my data type is half, single, or double precision (i.e. nothing more exotic).

Is it faster for me to use the "tensor core" hardware (via the Warp Matrix-Multiply-Add facilities) for this purpose? Or will it not provide any speed benefit, because the second multiplicand matrix is just one column wide (A x v instead of A x B) and there is no addend?
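
For concreteness, here is roughly what I imagine the WMMA route would look like: a minimal sketch only, assuming a single 16x16 half-precision tile handled by one warp, with names like wmma_matvec_tile being my own. The vector is padded into a 16-column B tile whose other columns are zero:

    #include <mma.h>
    #include <cuda_fp16.h>
    using namespace nvcuda;

    // Illustrative only (requires compute capability 7.0+): one warp multiplies
    // a single 16x16 half tile A by a length-16 vector v via a 16x16x16 MMA.
    __global__ void wmma_matvec_tile(const half* A,   // 16x16, row-major
                                     const half* v,   // 16 elements
                                     float* out)      // 16 elements
    {
        const int lane = threadIdx.x;                 // launched with 32 threads

        // Pad v into a 16x16 "B" tile: column 0 holds v, everything else is zero.
        __shared__ __align__(32) half  B[16 * 16];
        __shared__ __align__(32) float C[16 * 16];
        for (int i = lane; i < 16 * 16; i += 32)
            B[i] = __float2half(0.0f);
        __syncwarp();
        if (lane < 16)
            B[lane * 16] = v[lane];                   // element (lane, 0)
        __syncwarp();

        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

        wmma::fill_fragment(c_frag, 0.0f);
        wmma::load_matrix_sync(a_frag, A, 16);        // leading dimension 16
        wmma::load_matrix_sync(b_frag, B, 16);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);

        // Column 0 of the 16x16 result tile is A * v; the other 15 columns are padding.
        wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
        __syncwarp();
        if (lane < 16)
            out[lane] = C[lane * 16];
    }

My concern is that 15 of the 16 columns of B (and of the result tile) are zero padding, which is what makes me doubt there is any benefit here.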

Notes:

  • If the answer is "it depends on the specifics", that's good too - just give a (non-contrived) example in which this is useful.
  • You may assume the vector is small-ish - otherwise we can always break it up into pieces, do several matrix-vector multiplications, and perform reductions.
  • Similarly, you may assume the matrix is small-ish - in one dimension it matches the small-ish vector, and along the other we can always break the work up into more blocks, or have the block work on adjacent submatrices sequentially.
  • The baseline for "faster" would be, say, putting the vector in shared memory and performing the multiplication with a straightforward loop (with an appropriate distribution of work among the block's threads, etc.), roughly like the sketch below.
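
To be explicit about the baseline, this is the kind of kernel I mean (an illustrative sketch with assumed names such as matvec_baseline, M, and N, shown in single precision; a shared-memory copy of the vector plus a plain per-row dot-product loop):

    #include <cuda_runtime.h>

    // Illustrative baseline: one block handles an M x N row-major matrix A.
    // The work split (one row per thread, round-robin) is just one reasonable choice.
    __global__ void matvec_baseline(const float* A,   // M x N, row-major
                                    const float* v,   // N elements
                                    float* out,       // M elements
                                    int M, int N)
    {
        extern __shared__ float v_sh[];               // N floats, sized at launch

        // Cooperatively stage the (small-ish) vector in shared memory.
        for (int j = threadIdx.x; j < N; j += blockDim.x)
            v_sh[j] = v[j];
        __syncthreads();

        // Distribute rows of A among the block's threads; straightforward loop.
        for (int i = threadIdx.x; i < M; i += blockDim.x)
        {
            float acc = 0.0f;
            for (int j = 0; j < N; ++j)
                acc += A[(size_t)i * N + j] * v_sh[j];
            out[i] = acc;
        }
    }

    // Launch sketch: matvec_baseline<<<1, 256, N * sizeof(float)>>>(dA, dv, dout, M, N);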

