Optimize a loop for static predict-not-taken: which prediction problems arise in a normal loop?

Which problems arise in the following assembly loop if predict-not-taken is the default static branch prediction? Optimize the example for predict-not-taken.

addi $s1, $zero, 1024       # s1 := 1024
loop: addi $s1, $s1, -1     # s1--
jal subroutine              # call subroutine()
bne $s1, $zero, loop        # if (s1 != 0) jump to loop

To me, the most obvious answer is to unroll the loop completely: copy the code of the subroutine 1024 times so that there are no branches at all. Problem solved, but that seems too simple. Any ideas?
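To make the problem concrete, here is a rough sketch in Python (not MIPS) that counts how often a static predict-not-taken scheme guesses wrong for two loop shapes. The function names are hypothetical, and the second shape (a conditional exit branch followed by an unconditional jump back) is only one possible restructuring, not something given in the exercise. It also assumes that unconditional jumps such as `jal` and `j` are not counted as mispredictions and ignores branch delay slots.

```python
def mispredictions_bottom_test(iterations: int) -> int:
    """Original shape: `bne $s1, $zero, loop` at the bottom of the loop.

    Predict-not-taken always guesses "fall through", but this branch is
    taken on every iteration except the last one.
    """
    s1 = iterations                # addi $s1, $zero, iterations
    mispredicted = 0
    while True:
        s1 -= 1                    # addi $s1, $s1, -1
        taken = s1 != 0            # bne $s1, $zero, loop
        if not taken:
            break                  # predicted not taken, correct
        mispredicted += 1          # predicted not taken, actually taken
    return mispredicted


def mispredictions_exit_test(iterations: int) -> int:
    """Restructured shape (an assumption, not from the original post):

        loop: addi $s1, $s1, -1
              jal  subroutine
              beq  $s1, $zero, done   # conditional exit, rarely taken
              j    loop               # unconditional jump back
        done:
    """
    s1 = iterations
    mispredicted = 0
    while True:
        s1 -= 1                    # addi $s1, $s1, -1
        taken = s1 == 0            # beq $s1, $zero, done
        if taken:
            mispredicted += 1      # only the final exit is mispredicted
            break
        # j loop: unconditional, so not counted here
    return mispredicted


if __name__ == "__main__":
    print(mispredictions_bottom_test(1024))
    print(mispredictions_exit_test(1024))
```

Under these assumptions the original loop mispredicts on almost every iteration, while the restructured loop mispredicts only on the final exit. The price is one extra jump instruction per iteration, and whether that jump itself costs a pipeline bubble depends on the particular processor.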



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
