Category "avx2"

AVX2 code cannot be faster than gcc base optmization

I am studying AVX by writing AVX code with inline assembly. In this case, I tried to implement AVX in a simple function. The function name I made is lower_all_c

Is it possible to popcount __m256i and store result in 8 32-bit words instead of the 4 64-bit using Wojciech Mula algorithm's?

I have recently discovered that AVX2 doesn't have a popcount for __m256i and the only way I found to do something similar is to follow the Wojciech Mula algori