Efficiently shift-or large bit vector

I have a large in-memory array given as a pointer uint64_t * arr (plus a size), which represents plain bits. I need to shift these bits to the right, as efficiently (fast) as possible, by some amount from 0 to 63.

By shifting the whole array I mean not shifting each element independently (like a[i] <<= Shift), but shifting it as a single large bit vector. In other words, for each intermediate position i (except for the first and last element) I can do the following in a loop:

dst[i] = w | (src[i] << Shift);
w = src[i] >> (64 - Shift);

where w is a temporary variable holding the right-shifted value of the previous array element.

This solution is simple and obvious. But I need something more efficient, as I have gigabytes of data.

Ideally I would use some SIMD instructions for that, so I'm looking for SIMD suggestions from experts. I need to implement the shifting code for all four families of popular instruction sets - SSE-SSE4.2 / AVX / AVX2 / AVX-512.

But as far as I know, for SSE2 for example there exists only the _mm_slli_si128() intrinsic/instruction, which shifts only by a multiple of 8 bits (in other words, byte-shifting). And I need shifting by an arbitrary bit size, not only byte shifts.
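
For what it's worth, SSE2 can do a sub-byte bit shift of a single 128-bit register by combining the per-lane 64-bit shifts with the byte shift. Below is a minimal sketch of that trick (the name ShiftLeft128 is mine); it does not handle the carry between neighbouring 128-bit blocks, which still has to come from the adjacent word:

#include <emmintrin.h>   // SSE2

// Sketch: left-shift one 128-bit register by Shift bits (1..63) using SSE2 only.
// Bits shifted out of the top of this block are lost here; carrying them into
// the next 128-bit block has to be handled separately.
template <int Shift>
__m128i ShiftLeft128(__m128i v) {
    __m128i shifted     = _mm_slli_epi64(v, Shift);                 // each 64-bit lane << Shift
    __m128i low_to_high = _mm_slli_si128(v, 8);                     // move low lane into high lane
    __m128i lane_carry  = _mm_srli_epi64(low_to_high, 64 - Shift);  // bits crossing the lane boundary
    return _mm_or_si128(shifted, lane_carry);
}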

Without SIMD I can also shift by 128 bits at once using the shld reg, reg, cl instruction, which shifts across a register pair. It is exposed as the intrinsic __shiftleft128() in MSVC; the assembler code it produces can be seen here.
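
A sketch of how that intrinsic could express the carrying loop (MSVC x64 only; the function and parameter names are mine, and it assumes a shift of 0..63 and non-overlapping dst/src):

#include <intrin.h>
#include <cstddef>
#include <cstdint>

// dst[i] = (src[i] << Shift) | (src[i-1] >> (64 - Shift)), one shld per word.
void ShiftLeftShld(uint64_t *dst, const uint64_t *src, size_t n, unsigned char Shift)
{
    dst[0] = src[0] << Shift;
    for (size_t i = 1; i < n; ++i)
        dst[i] = __shiftleft128(src[i - 1], src[i], Shift);   // high half of the 128-bit shift
}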

BTW, I need solutions for all of MSVC/GCC/Clang.

Also, inside a single loop iteration I can shift 4 or 8 words in sequential operations; this lets the CPU pipeline and execute several independent instructions out of order.
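
As a sketch (my own example, assuming 1 <= Shift <= 63 and a word count that is a multiple of 4), an unrolled body could look like this; each destination word depends only on two adjacent loads, so the four statements have no loop-carried arithmetic chain and can execute out of order:

#include <cstddef>
#include <cstdint>

void ShiftLeftUnrolled(uint64_t *dst, const uint64_t *src, size_t n, unsigned Shift)
{
    uint64_t w = 0;   // carry out of the previous word (conceptually src[-1] == 0)
    for (size_t i = 0; i < n; i += 4) {
        dst[i + 0] = w                            | (src[i + 0] << Shift);
        dst[i + 1] = (src[i + 0] >> (64 - Shift)) | (src[i + 1] << Shift);
        dst[i + 2] = (src[i + 1] >> (64 - Shift)) | (src[i + 2] << Shift);
        dst[i + 3] = (src[i + 2] >> (64 - Shift)) | (src[i + 3] << Shift);
        w = src[i + 3] >> (64 - Shift);
    }
}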

If needed, my bit vector can be aligned to any number of bytes in memory, if this helps, for example, to improve SIMD speed by doing aligned reads/writes. Also, the source and destination bit vectors are in different (non-overlapping) memory.

In other words, I'm looking for any suggestions about how to solve my task most efficiently (most performantly) on different Intel CPUs.

Note, to clarify: I actually have to do several shift-ors, not just a single shift. I have a large bit vector X and several hundred shift sizes s0, s1, ..., sN, where each shift size is different and can also be large (for example a shift by 100K bits), and I want to compute the resulting large bit vector Y = (X << s0) | (X << s1) | ... | (X << sN). I just simplified my question for StackOverflow to shifting a single vector. But this detail about the original task is probably very important.

As requested by @Jake'Alquimista'LEE, I decided to implement a ready-made toy minimal reproducible example of what I want to do, computing shift-ors of an input bit vector src into an or-ed final dst bit vector. This example is not optimized at all, just a straightforward, simple variant of how my task can be solved. For simplicity this example has a small input vector, not gigabytes as in my case. It is a toy example; I didn't check whether it solves the task correctly, and it may contain minor bugs:


#include <algorithm>
#include <cstdint>
#include <vector>
#include <random>

#define bit_sizeof(x) (sizeof(x) * 8)

using u64 = uint64_t;
using T = u64;

int main() {
    std::mt19937_64 rng{123};

    // Randomly generate the source bit vector
    std::vector<T> src(100'000);
    for (size_t i = 0; i < src.size(); ++i)
        src[i] = rng();

    size_t const src_bitsize = src.size() * bit_sizeof(T);

    // Destination bit vector, for example twice as big
    std::vector<T> dst(src.size() * 2);

    // Randomly generate shift amounts
    std::vector<u64> shifts(200);
    for (size_t i = 0; i < shifts.size(); ++i)
        shifts[i] = rng() % src_bitsize;

    // Right-shift that returns 0 when the shift count reaches the word size (plain >> would be UB)
    auto Shr = [](auto x, size_t s) {
        return s >= bit_sizeof(x) ? 0 : (x >> s);
    };

    // Do actual Shift-Ors
    for (auto orig_shift: shifts) {
        size_t const
            word_off = orig_shift / bit_sizeof(T),
            bit_off = orig_shift % bit_sizeof(T);

        if (word_off >= dst.size())
            continue;
        
        size_t const
            lim = std::min(src.size(), dst.size() - word_off);

        T w = 0;
        
        for (size_t i = 0; i < lim; ++i) {
            dst[word_off + i] |= w | (src[i] << bit_off);
            w = Shr(src[i], bit_sizeof(T) - bit_off);
        }

        // Special case: OR the final carry into the next destination word, if it exists
        if (word_off + lim < dst.size())
            dst[word_off + lim] |= w;
    }
}

My real project's current code is different from the toy example above. The project already solves a real-world task correctly; I just need to do extra optimizations. Some optimizations I have already done, like using OpenMP to parallelize the shift-or operations across all cores. Also, as said in the comments, I created specialized templated functions for each shift size, 64 functions in total, and I choose one of the 64 functions to do the actual shift-or. Each C++ function has a compile-time value of the shift size, hence the compiler does extra optimizations taking the compile-time values into account.
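
A minimal sketch of that dispatch scheme (the names ShiftOrWords, MakeTable and DoShiftOr are hypothetical, not the project's real functions; requires C++14 for std::index_sequence):

#include <array>
#include <cstddef>
#include <cstdint>
#include <utility>

// One instantiation per bit offset; Shift is a compile-time constant, so the
// compiler can fold (64 - Shift) and unroll/vectorize each version separately.
template <unsigned Shift>
void ShiftOrWords(uint64_t *dst, const uint64_t *src, size_t n)
{
    uint64_t w = 0;
    for (size_t i = 0; i < n; ++i) {
        dst[i] |= w | (src[i] << Shift);
        w = (src[i] >> 1) >> (63 - Shift);   // == src[i] >> (64 - Shift), safe for Shift == 0
    }
}

using ShiftOrFn = void (*)(uint64_t *, const uint64_t *, size_t);

// Build the 64-entry table of specializations at compile time.
template <size_t... I>
constexpr std::array<ShiftOrFn, 64> MakeTable(std::index_sequence<I...>)
{
    return {{ &ShiftOrWords<I>... }};
}

void DoShiftOr(uint64_t *dst, const uint64_t *src, size_t n, unsigned bit_off)
{
    static const auto table = MakeTable(std::make_index_sequence<64>{});
    table[bit_off & 63](dst, src, n);   // dispatch to the compile-time-shift version
}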



Solution 1:[1]

You can, but you possibly don't even need to use SIMD instructions explicitly. The target compilers GCC, Clang and MSVC, as well as other compilers like ICC, all support auto-vectorization. While hand-optimized assembly can outperform compiler-generated vectorized instructions, it's generally harder to achieve and you may need several versions for different architectures. Generic code that leads to efficient auto-vectorized instructions is a solution that may be portable across many platforms.

For instance a simple shiftvec version

void shiftvec(uint64_t* dst, uint64_t* src, int size, int shift)
{
    // Note: reads src[size] (one word past 'size') for the carry,
    // and assumes 1 <= shift <= 63 so that (64 - shift) stays valid.
    for (int i = 0; i < size; ++i,++src,++dst)
    {
        *dst = ((*src)<<shift) | (*(src+1)>>(64-shift));
    }
}

compiled with a recent GCC (Clang works as well) and -O3 -std=c++11 -mavx2 leads to SIMD instructions in the core loop of the assembly:

.L5:
  vmovdqu ymm4, YMMWORD PTR [rsi+rax]
  vmovdqu ymm5, YMMWORD PTR [rsi+8+rax]
  vpsllq ymm0, ymm4, xmm2
  vpsrlq ymm1, ymm5, xmm3
  vpor ymm0, ymm0, ymm1
  vmovdqu YMMWORD PTR [rdi+rax], ymm0
  add rax, 32
  cmp rax, rdx
  jne .L5

See on godbolt.org: https://godbolt.org/z/5TxhqMhnK

This also generalizes if you want to combine multiple shifts into dst:

void shiftvec2(uint64_t* dst, uint64_t* src1, uint64_t* src2, int size1, int size2, int shift1, int shift2)
{
    int size = size1<size2 ? size1 : size2;
    for (int i = 0; i < size; ++i,++src1,++src2,++dst)
    {
        *dst = ((*src1)<<shift1) | (*(src1+1)>>(64-shift1));
        *dst |= ((*src2)<<shift2) | (*(src2+1)>>(64-shift2)); 
    }
    for (int i = size; i < size1; ++i,++src1,++dst)
    {
        *dst = ((*src1)<<shift1) | (*(src1+1)>>(64-shift1));        
    }
    for (int i = size; i < size2; ++i,++src2,++dst)
    {
        *dst = ((*src2)<<shift2) | (*(src2+1)>>(64-shift2));
    }
}

compiles to this core loop:

.L38:
  vmovdqu ymm7, YMMWORD PTR [rsi+rcx]
  vpsllq ymm1, ymm7, xmm4
  vmovdqu ymm7, YMMWORD PTR [rsi+8+rcx]
  vpsrlq ymm0, ymm7, xmm6
  vpor ymm1, ymm1, ymm0
  vmovdqu YMMWORD PTR [rax+rcx], ymm1
  vmovdqu ymm7, YMMWORD PTR [rdx+rcx]
  vpsllq ymm0, ymm7, xmm3
  vmovdqu ymm7, YMMWORD PTR [rdx+8+rcx]
  vpsrlq ymm2, ymm7, xmm5
  vpor ymm0, ymm0, ymm2
  vpor ymm0, ymm0, ymm1
  vmovdqu YMMWORD PTR [rax+rcx], ymm0
  add rcx, 32
  cmp r10, rcx
  jne .L38

Combining multiple sources in one loop reduces the total memory bandwidth spent on loading/writing the destination. How many sources you can combine is of course limited by the available registers. Note that xmm2 and xmm3 for shiftvec contain the shift values, so having different versions for compile-time-known shift values may free those registers.

Additionally, using __restrict (supported by GCC, Clang, MSVC) on each of the pointers tells the compiler that the ranges do not overlap.

I initially had problems getting MSVC to produce properly auto-vectorized code, but it seems that adding a more SIMD-like structure makes it work for all three desired compilers GCC, Clang and MSVC:

void shiftvec(uint64_t* __restrict dst, const uint64_t* __restrict src, int size, int shift)
{
    int i = 0;
    // MSVC: use steps of 2 for SSE, 4 for AVX2, 8 for AVX512
    for (; i+4 < size; i+=4,dst+=4,src+=4)
    {
        for (int j = 0; j < 4; ++j)
            *(dst+j) = (*(src+j))<<shift;
        for (int j = 0; j < 4; ++j)
            *(dst+j) |= (*(src+j+1)>>(64-shift));
    }
    for (; i < size; ++i,++src,++dst)
    {
        *dst = ((*src)<<shift) | (*(src+1)>>(64-shift));
    }    
}

Solution 2:[2]

I would attempt to rely on the x64 ability to read from unaligned addresses, which has almost no visible penalty when the stars are properly (un)aligned. One would only need to handle a few cases of (shift % 8) or (shift % 16) -- all doable with the SSE2 instruction set -- filling the remainder with zeros, using an unaligned byte offset into the data vector, and addressing the UB with memcpy.

That said, the inner loop would look like:

uint16_t const *ptr;                            // source, viewed as 16-bit words
uint16_t *out;                                  // separate, non-overlapping destination
int shift;                                      // per-word bit shift, 1..15
__m128i c  = _mm_cvtsi32_si128(shift);
__m128i c2 = _mm_cvtsi32_si128(16 - shift);
auto a = _mm_loadu_si128((__m128i const *)ptr);
auto b = _mm_loadu_si128((__m128i const *)(ptr - 1));
a = _mm_srl_epi16(a, c);
b = _mm_sll_epi16(b, c2);
_mm_storeu_si128((__m128i *)out, _mm_or_si128(a, b));
ptr += 8;
out += 8;

Unrolling this loop a few times, one might be able to use _mm_alignr_epi8 on SSSE3+ to relax memory bandwidth (and those pipeline stages that need to combine results from unaligned memory accesses):

auto a0 = w; 
auto a1 = _mm_load_si128(m128ptr + 1);
auto a2 = _mm_load_si128(m128ptr + 2);
auto a3 = _mm_load_si128(m128ptr + 3);
auto a4 = _mm_load_si128(m128ptr + 4);
auto b0 = _mm_alignr_epi8(a1, a0, 2);
auto b1 = _mm_alignr_epi8(a2, a1, 2);
auto b2 = _mm_alignr_epi8(a3, a2, 2);
auto b3 = _mm_alignr_epi8(a4, a3, 2);
// ... do the computation as above ...
w = a4;   // rotate the context

Solution 3:[3]

In other words, I'm looking for any suggestions about how to solve my task most efficiently (most performantly) on different Intel CPUs.

The key to efficiency is to be lazy. The key to being lazy is to lie - pretend you shifted without actually doing any shifting.

For an initial example (to illustrate the concept only), consider:

struct Thingy {
    int ignored_bits;
    uint64_t data[];
};

void shift_right(struct Thingy * thing, int count) {
    thing->ignored_bits += count;
}

void shift_left(struct Thingy * thing, int count) {
    thing->ignored_bits -= count;
}

int get_bit(struct Thingy * thing, int bit_number) {
    bit_number += thing->ignored_bits;
    return !!(thing->data[bit_number / 64] & (1ULL << (bit_number % 64)));
}

For practical code you'll need to care about various details - you'll probably want to start with spare bits at the start of the array (and a non-zero ignored_bits) so that you can pretend to shift right; for each small shift you'll probably want to clear the "shifted in" bits, as in the sketch below (otherwise it'll behave like floating point - e.g. ((5.0 << 8) >> 8) == 5.0); if/when ignored_bits goes outside a certain range you'll probably want a large memcpy(); etc.
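
For example, a hypothetical helper for the bit-clearing part might look like this (a sketch only, no bounds checks; it zeroes count visible bits starting at visible index from by masking the underlying words):

// Clear the visible bits [from, from + count) of 'thing'.
void clear_visible_bits(struct Thingy * thing, int from, int count) {
    int bit = from + thing->ignored_bits;          // index into data[], in bits
    while (count > 0) {
        int word = bit / 64, off = bit % 64;
        int n = 64 - off;                          // bits clearable in this word
        if (n > count) n = count;
        uint64_t mask = (n == 64) ? ~0ULL : (((1ULL << n) - 1) << off);
        thing->data[word] &= ~mask;
        bit += n;
        count -= n;
    }
}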

For more fun, abuse low-level memory management - use VirtualAlloc() (Windows) or mmap() (Linux) to reserve a huge address space, then put your array in the middle of that space, then allocate/free pages at the start/end of the array as needed, so that you only need a memcpy() after the original bits have been "shifted" many billions of bits to the left/right.
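
A Linux-only sketch of that reserve-then-commit idea (the sizes and the function name are arbitrary; on Windows the equivalent would be VirtualAlloc with MEM_RESERVE and MEM_COMMIT):

#include <sys/mman.h>
#include <cstddef>
#include <cstdint>

uint64_t * reserve_vector_in_the_middle(size_t vec_bytes)
{
    size_t const reserve_bytes = size_t(1) << 40;   // 1 TiB of address space, no memory committed
    void * base = mmap(nullptr, reserve_bytes, PROT_NONE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (base == MAP_FAILED)
        return nullptr;

    // Put the vector in the middle and commit only the pages it currently occupies;
    // later, commit/decommit pages at either end (mprotect / madvise(MADV_DONTNEED))
    // instead of copying gigabytes.
    char * middle = (char *)base + reserve_bytes / 2;
    if (mprotect(middle, vec_bytes, PROT_READ | PROT_WRITE) != 0)
        return nullptr;
    return (uint64_t *)middle;
}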

Of course the consequence is that it's going to complicate other parts of your code - e.g. to OR 2 bitfields together you'll have to do a tricky "fetch A; shift A to match B; result = A OR B" adjustment. This isn't a deal breaker for performance.

Solution 4:[4]

#include <cstdint>
#include <immintrin.h>

template<unsigned Shift>
void foo(uint64_t* __restrict pDst, const uint64_t* __restrict pSrc, intptr_t size)
{
    const uint64_t* pSrc0, * pSrc1, * pSrc2, * pSrc3;
    uint64_t* pDst0, * pDst1, * pDst2, * pDst3;
    __m256i prev, current;
    intptr_t i, stride;

    stride = size >> 2;
    i = stride;

    pSrc0 = pSrc;
    pSrc1 = pSrc + stride;
    pSrc2 = pSrc + 2 * stride;
    pSrc3 = pSrc + 3 * stride;

    pDst0 = pDst;
    pDst1 = pDst + stride;
    pDst2 = pDst + 2 * stride;
    pDst3 = pDst + 3 * stride;

    prev = _mm256_set_epi64x(0, pSrc1[-1], pSrc2[-1], pSrc3[-1]);

    while (i--)
    {
        current = _mm256_set_epi64x(*pSrc0++, *pSrc1++, *pSrc2++, *pSrc3++);
        prev = _mm256_srli_epi64(prev, 64 - Shift);
        prev = _mm256_or_si256(prev, _mm256_slli_epi64(current, Shift));
        *pDst0++ = _mm256_extract_epi64(prev, 3);
        *pDst1++ = _mm256_extract_epi64(prev, 2);
        *pDst2++ = _mm256_extract_epi64(prev, 1);
        *pDst3++ = _mm256_extract_epi64(prev, 0);

        prev = current;
    }
}

You can do the operation on up to four 64-bit elements at once with AVX2 (up to eight with AVX-512).

If size isn't a multiple of four, there will be up to three trailing elements to deal with, for example with a scalar tail loop.
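
A sketch of such a tail (in the same spirit, assuming 1 <= Shift <= 63 and size >= 4; stride, pSrc, pDst and Shift as in the function above):

    intptr_t done = stride * 4;                 // words already covered by the SIMD body
    uint64_t w = pSrc[done - 1];                // carry out of the last processed word
    for (intptr_t k = done; k < size; ++k)
    {
        pDst[k] = (pSrc[k] << Shift) | (w >> (64 - Shift));
        w = pSrc[k];
    }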

PS: Auto vectorization is never a proper solution.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2
Solution 3
Solution 4 Jake 'Alquimista' LEE