'Bit-twiddling Wizardry for Index of Min or Max Element in XMM/YMM/ZMM

Is there an instruction or efficient branchless sequence of instructions to figure out the INDEX of (not the value of) the largest (or smallest) element of an unordered (unsorted) ZMM?

Data type doesn't matter- i'm more interested to know if there's a usage pattern for this established.


A related problem with a known solutions is, with a strictly ordered ZMM, one may use CMPPS, MOVMSKPS, and TZCNT to get the index of where an outside element WOULD fit into this list (i.e. BSEARCH)



Solution 1:[1]

Broadcast the minimum (or maximum) element over the complete vector, compare vectors for equality, use movemask instruction to convert to bitmap, then count trailing zeroes in the bitmap.

Example for FP32 lanes in SSE vector:

uint32_t minpos_ps( __m128 vec )
{
    // Broadcast minimum value in the vector with a few shuffles
    __m128 i = _mm_min_ps( vec, _mm_permute_ps( vec, _MM_SHUFFLE( 1, 0, 3, 2 ) ) );
    i = _mm_min_ps( i, _mm_permute_ps( i, _MM_SHUFFLE( 2, 3, 0, 1 ) ) );

    // Compare lanes for equality with the minimum
    uint32_t mask = (uint32_t)_mm_movemask_ps( _mm_cmpeq_ps( vec, i ) );

    // Return index of the smallest set bit in the mask
    return std::countr_zero( mask );
}

More complicated example, for unsigned bytes in 32-byte AVX vector:

uint32_t minpos_epu8( __m256i vec )
{
    __m256i i = _mm256_min_epu8( vec, _mm256_permute2x128_si256( vec, vec, 1 ) );
    i = _mm256_min_epu8( i, _mm256_shuffle_epi32( i, _MM_SHUFFLE( 1, 0, 3, 2 ) ) );
    i = _mm256_min_epu8( i, _mm256_shuffle_epi32( i, _MM_SHUFFLE( 2, 3, 0, 1 ) ) );
    // If you calling this in a loop where compiler can preload constant vectors,
    // replace shuffles and shifts below with _mm256_shuffle_epi8
    __m256i tmp = _mm256_shufflehi_epi16( i, _MM_SHUFFLE( 2, 3, 0, 1 ) );
    tmp = _mm256_shufflelo_epi16( tmp, _MM_SHUFFLE( 2, 3, 0, 1 ) );
    i = _mm256_min_epu8( i, tmp );
    tmp = _mm256_or_si256( _mm256_slli_epi16( i, 8 ), _mm256_srli_epi16( i, 8 ) );
    i = _mm256_min_epu8( i, tmp );

    uint32_t mask = (uint32_t)_mm256_movemask_epi8( _mm256_cmpeq_epi8( vec, i ) );

    return std::countr_zero( mask );
}

The std::countr_zero standard library function requires C++/20.

If you don't yet have that version, replace with _tzcnt_u32, _BitScanForward, or __builtin_ctz intrinsics depending on compiler and target platform.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Soonts