Bit-twiddling Wizardry for Index of Min or Max Element in XMM/YMM/ZMM
Is there an instruction or efficient branchless sequence of instructions to figure out the INDEX of (not the value of) the largest (or smallest) element of an unordered (unsorted) ZMM?
The data type doesn't matter; I'm more interested to know whether there's an established usage pattern for this.
A related problem with a known solution: given a strictly ordered ZMM, one can use CMPPS, MOVMSKPS, and TZCNT to get the index where an outside element WOULD fit into the list (i.e. BSEARCH).
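For reference, the sorted-vector lookup mentioned above can be sketched as follows for a 4-lane SSE vector. This is only an illustration under assumptions: the name `insert_pos_ps` is hypothetical, the vector is assumed to be sorted ascending, and `__builtin_ctz` assumes GCC or Clang (use `std::countr_zero` or `_BitScanForward` elsewhere).

```cpp
#include <xmmintrin.h> // SSE: _mm_set1_ps, _mm_cmpge_ps, _mm_movemask_ps
#include <cstdint>

// Hypothetical helper: index at which `key` would be inserted into `sorted`
// (4 floats, ascending) to keep it sorted. Returns 4 when key is larger than every lane.
inline uint32_t insert_pos_ps( __m128 sorted, float key )
{
    // CMPPS: mark the lanes whose value is >= key
    const __m128 ge = _mm_cmpge_ps( sorted, _mm_set1_ps( key ) );
    // MOVMSKPS: one bit per lane; bit 4 is a sentinel so the mask is never zero
    const uint32_t mask = (uint32_t)_mm_movemask_ps( ge ) | 0x10u;
    // TZCNT: the lowest set bit is the first lane that is not less than key
    return (uint32_t)__builtin_ctz( mask );
}
```

The sentinel bit keeps the sequence branchless while still giving a defined result for keys beyond the last lane.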
Solution 1:[1]
Broadcast the minimum (or maximum) element over the complete vector, compare the original vector with the broadcast one for equality, use a movemask instruction to convert the comparison result to a bitmap, then count trailing zeroes in the bitmap.
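In scalar terms, that sequence computes the index of the first lane equal to the minimum. A minimal reference sketch, useful for checking a vectorized version against; the name `minpos_scalar` is made up for illustration and is not part of any library:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical reference helper: index of the first smallest element in lanes[0..count).
// Assumes count >= 1.
template<typename T>
inline uint32_t minpos_scalar( const T* lanes, size_t count )
{
    uint32_t best = 0;
    for( size_t i = 1; i < count; i++ )
    {
        // Strict '<' keeps the lowest index when several lanes share the minimum,
        // matching what counting trailing zeroes in the bitmap yields
        if( lanes[ i ] < lanes[ best ] )
            best = (uint32_t)i;
    }
    return best;
}
```

Storing a vector to a small array and comparing the result of this loop with the SIMD routine is one simple way to test the latter.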
Example for FP32 lanes in a 128-bit vector (note that _mm_permute_ps requires AVX):
#include <immintrin.h> // SIMD intrinsics
#include <bit>         // std::countr_zero, C++20
#include <cstdint>

uint32_t minpos_ps( __m128 vec )
{
    // Broadcast minimum value in the vector with a few shuffles:
    // first swap the 64-bit halves, then swap adjacent lanes
    __m128 i = _mm_min_ps( vec, _mm_permute_ps( vec, _MM_SHUFFLE( 1, 0, 3, 2 ) ) );
    i = _mm_min_ps( i, _mm_permute_ps( i, _MM_SHUFFLE( 2, 3, 0, 1 ) ) );
    // Compare lanes for equality with the minimum
    uint32_t mask = (uint32_t)_mm_movemask_ps( _mm_cmpeq_ps( vec, i ) );
    // Return index of the smallest set bit in the mask
    return (uint32_t)std::countr_zero( mask );
}
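The question also asks about the maximum. A possible sketch of that variant is below; it is not from the original answer. It swaps in `_mm_shuffle_ps` for `_mm_permute_ps` so it builds with plain SSE, uses `__builtin_ctz` on the assumption of GCC or Clang, and ORs in a sentinel bit (my own addition) so that an all-NaN input, where no lane compares equal, returns 4 instead of hitting undefined behavior.

```cpp
#include <xmmintrin.h> // SSE: _mm_max_ps, _mm_shuffle_ps, _mm_cmpeq_ps, _mm_movemask_ps
#include <cstdint>

// Hypothetical counterpart of minpos_ps: index of the first largest lane, or 4 if no lane matches
inline uint32_t maxpos_ps( __m128 vec )
{
    // Swap the 64-bit halves, then adjacent lanes; afterwards every lane holds the maximum
    __m128 m = _mm_max_ps( vec, _mm_shuffle_ps( vec, vec, _MM_SHUFFLE( 1, 0, 3, 2 ) ) );
    m = _mm_max_ps( m, _mm_shuffle_ps( m, m, _MM_SHUFFLE( 2, 3, 0, 1 ) ) );
    // One bit per lane that equals the maximum
    const uint32_t mask = (uint32_t)_mm_movemask_ps( _mm_cmpeq_ps( vec, m ) );
    // Bit 4 is a sentinel that keeps the argument of __builtin_ctz non-zero
    return (uint32_t)__builtin_ctz( mask | 0x10u );
}
```

When several lanes hold the maximum, the lowest index wins, because the lowest set bit of the mask is the one that gets counted.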
A more complicated example, for unsigned bytes in a 32-byte AVX2 vector:
uint32_t minpos_epu8( __m256i vec )
{
    // Reduce across the two 128-bit halves, then across 64-bit and 32-bit groups
    __m256i i = _mm256_min_epu8( vec, _mm256_permute2x128_si256( vec, vec, 1 ) );
    i = _mm256_min_epu8( i, _mm256_shuffle_epi32( i, _MM_SHUFFLE( 1, 0, 3, 2 ) ) );
    i = _mm256_min_epu8( i, _mm256_shuffle_epi32( i, _MM_SHUFFLE( 2, 3, 0, 1 ) ) );
    // If you are calling this in a loop where the compiler can preload constant vectors,
    // replace the shuffles and shifts below with _mm256_shuffle_epi8
    // Swap adjacent 16-bit words to reduce within each 32-bit group
    __m256i tmp = _mm256_shufflehi_epi16( i, _MM_SHUFFLE( 2, 3, 0, 1 ) );
    tmp = _mm256_shufflelo_epi16( tmp, _MM_SHUFFLE( 2, 3, 0, 1 ) );
    i = _mm256_min_epu8( i, tmp );
    // Swap the two bytes of every 16-bit word; after this step every byte holds the minimum
    tmp = _mm256_or_si256( _mm256_slli_epi16( i, 8 ), _mm256_srli_epi16( i, 8 ) );
    i = _mm256_min_epu8( i, tmp );
    // Compare bytes for equality with the minimum, return index of the smallest set bit
    uint32_t mask = (uint32_t)_mm256_movemask_epi8( _mm256_cmpeq_epi8( vec, i ) );
    return (uint32_t)std::countr_zero( mask );
}
The std::countr_zero standard library function requires C++20. If you don't yet have that version, replace it with the _tzcnt_u32, _BitScanForward, or __builtin_ctz intrinsics, depending on compiler and target platform.
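One way to hide that choice behind a single function is sketched below. The wrapper name `ctz32` is hypothetical, and the preprocessor test is an assumption about which compilers are in play: MSVC gets `_BitScanForward`, everything else is assumed to support the GCC/Clang builtin. Note that the alternatives differ for a zero input: `std::countr_zero` and `_tzcnt_u32` return 32, while `__builtin_ctz` is undefined and `_BitScanForward` leaves the index unspecified, so the sketch requires a non-zero mask.

```cpp
#include <cassert>
#include <cstdint>
#if defined(_MSC_VER) && !defined(__clang__)
#include <intrin.h> // _BitScanForward
#endif

// Hypothetical wrapper: index of the lowest set bit; the mask must be non-zero
inline uint32_t ctz32( uint32_t mask )
{
    assert( mask != 0 );
#if defined(_MSC_VER) && !defined(__clang__)
    unsigned long index;
    _BitScanForward( &index, mask );
    return (uint32_t)index;
#else
    return (uint32_t)__builtin_ctz( mask );
#endif
}
```

For the integer example above the mask is never zero, since at least one byte always equals the minimum; for floating-point input containing NaN the comparison can come back all-false, so that case needs separate handling.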
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Soonts |