Category "micro-optimization"

AVX2 code cannot be faster than gcc base optmization

I am studying AVX by writing AVX code with inline assembly. In this case, I tried to implement AVX in a simple function. The function name I made is lower_all_c

Optimization of a short-length cyclic convolution

I have two sequences of 8 unsigned bytes and I need to compute their cyclic convolution, which yields 8 unsigned 19 bits integers. As I repeat this million time

During thread contention how can I speed up this ConcurrentQueue implementation which uses ReaderWriterLockSlim over a regular Queue<T>

Question: How can I implement a faster thread safe queue to support an object pool when under heavy thread contention? Scenario: My overall final objective is a

Can modern x86 implementations store-forward from more than one prior store?

In the case that a load overlaps two earlier stores (and the load is not fully contained in the oldest store), can modern Intel or AMD x86 implementations forwa

Can x86's MOV really be "free"? Why can't I reproduce this at all?

I keep seeing people claim that the MOV instruction can be free in x86, because of register renaming. For the life of me, I can't verify this in a single tes

One instruction to clear PF (Parity Flag) -- get odd number of bits in result register

In x86 assembly, is it possible to clear the Parity Flag in one and only one instruction, working under any initial register configuration? This is equivalent