Category "cpu-architecture"

How can some architectures guarantee that aligned memory operations are atomic?

As explained in this post: Why is integer assignment on a naturally aligned variable atomic on x86? : Memory load/store on a byte value - and any correctly alig

Create a branch history in loop

Consider int t = 0; for( int i = 0; i < 8; i++ ) { for( int j = 0; j < 8; j++ ) { t = t + i*j; } } Ex: Create a branch history table in t =

What is the difference between BZ and BNZ in instruction pipeline?

I am confused about the branch instructions BZ and BNZ. Can anybody please explain the concept and how BZ and BNZ work, with an example?

Is a mov to a segment register slower than a mov to a general purpose register?

Specifically, is mov %eax, %ds slower than mov %eax, %ebx, or are they the same speed? I've researched online, but have been unable to find a definitive an

Can CPU Out-of-Order-Execution cause memory reordering?

I know that store buffers and invalidate queues can cause memory reordering. What I don't know is whether out-of-order execution can cause memory reordering too.

Optimize a loop for static predict-not-taken? Which prediction problems exist for that in a normal loop?

Which problems arise in the following assembly loop, if Predict Not Taken is chosen by default? Optimize the example to Predict not Taken. addi $s1, $zero, 1024

Can compilers break control dependencies used for LoadStore memory ordering or similar, in any real use-cases?

I'm reading the mailing list thread about LKMM: Add volatile_if(). The control dependency is somewhat subtle since it is easily forgotten by us developers. So I wonder i

Why didn't x86 implement direct core-to-core messaging assembly/cpu instructions?

After serious development, CPUs gained many cores, gained distributed blocks of cores on multiple chiplets, NUMA systems, etc., but still a piece of data has to p

In computers 32-bit or 64-bit processors are used, why not 40-bit or other numbers?

For example, in the case of 32-bit processors, a word is 4 bytes long. Is it also possible to use a 5-byte word or other sizes?

SIMD intrinsic and memory bus size - How CPU fetches all 128/256 bits in a single memory read?

Hello Forum – I have a few similar/related questions about SIMD intrinsics, for which I searched online, including Stack Overflow, but did not find good answer

Why is a conditional move not vulnerable to Branch Prediction Failure?

After reading this post (answer on StackOverflow) (in the optimization section), I was wondering why conditional moves are not vulnerable to Branch Prediction

Can x86's MOV really be "free"? Why can't I reproduce this at all?

I keep seeing people claim that the MOV instruction can be free in x86, because of register renaming. For the life of me, I can't verify this in a single tes

Why is processing a sorted array faster than processing an unsorted array?

Here is a piece of C++ code that shows some very peculiar behavior. For some strange reason, sorting the data (before the timed region) miraculously makes the l

Understanding bubble vs stall vs repeated decode/fetch

I'm really confused about the difference between bubbles, stalls, and repeated decoding/fetching. My text is the Patterson text, 3rd edition. Example 1: add $3,

Micro fusion and addressing modes

I have found something unexpected (to me) using the Intel® Architecture Code Analyzer (IACA). The following instruction using [base+index] addressing add