Program in assembly x86 [closed]
I recently wrote a program in C++ and ASM. Can anyone help me make this code more efficient, in the ASM part or in both? I would really appreciate it, because I don't know every ASM instruction and I am probably using far too many. By the way, the program sums two integer vectors of any size. The code I have is shown below:
C++:
#include <cstdio>
#include <cstdlib>
extern "C" {
    int add_vtr_asm(int*, int*, int*, int);
}
void add_vtr() {
    // 16-byte alignment is required by the movdqa instructions in the ASM routine
    __declspec(align(16)) int vetor1[1024];
    __declspec(align(16)) int vetor2[1024];
    __declspec(align(16)) int soma[1024];
    for (int i = 0; i <= 1023; i++) {
        vetor1[i] = i;
        vetor2[i] = i;
    }
    add_vtr_asm(vetor1, vetor2, soma, 1024);
    for (int i = 0; i <= 1023; i++) {
        printf("% d + % d = % d \n", vetor1[i], vetor2[i], soma[i]);
    }
    exit(0);
}
int main()
{
    printf("Programa para somar vetores de inteiros: \n");
    printf("Soma de vetores com % d elementos \n", 1024);
    add_vtr();
}
ASM:
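.686
.XMM
.MODEL FLAT, C
.CODE
; The three pointers must be 16-byte aligned (movdqa) and the count
; must be a nonzero multiple of 4, because the loop handles 4 ints per pass.
add_vtr_asm PROC
    push ebp
    mov ebp, esp
    push esi
    push edi
    push ebx            ; ebx is callee-saved in cdecl, so preserve it
    mov esi, [ebp+8]    ; first input vector
    mov ebx, [ebp+12]   ; second input vector
    mov edi, [ebp+16]   ; output vector
    mov ecx, [ebp+20]   ; element count
    shr ecx, 2          ; 4 ints per 128-bit register
next:
    movdqa xmm0, [esi]
    add esi, 16
    paddd xmm0, [ebx]
    add ebx, 16
    movdqa [edi], xmm0
    add edi, 16
    dec ecx
    jnz next
    pop ebx
    pop edi
    pop esi
    pop ebp
    ret
add_vtr_asm ENDP
END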
Solution 1:[1]
I just coded it up in plain C++ and got this: https://godbolt.org/z/P1zPWv65b
struct Vec {
    alignas(16) int data[1024];
};
Vec add(const Vec &v1, const Vec &v2) {
    Vec v;
    for (int i = 0; i < 1024; ++i) {
        v.data[i] = v1.data[i] + v2.data[i];
    }
    return v;
}
Compiling with: g++ -std=c++20 -O2 -W -Wall
add(Vec const&, Vec const&):
        xor     eax, eax
.L2:
        movdqa  xmm0, XMMWORD PTR [rsi+rax]
        paddd   xmm0, XMMWORD PTR [rdx+rax]
        add     rax, 16
        movaps  XMMWORD PTR [rax-16+rdi], xmm0
        cmp     rax, 4096
        jne     .L2
        mov     rax, rdi
        ret
I don't see how I can improve on that. I didn't even ask for any extra vectorization options; plain -O2 already picks SIMD instructions just fine.
Note: without alignas(16) or std::array<int, 1024> the compiler uses movdqu, which I assume might be slower, but I didn't test that. It's likely the whole code is limited by memory bandwidth.
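To make the aligned versus unaligned distinction concrete, here is a minimal sketch using SSE2 intrinsics. The function name add_vtr_sse2 is made up for illustration and is not part of the original code. It uses the unaligned load/store intrinsics (which correspond to movdqu), so it does not depend on alignas(16), and it finishes any leftover elements with scalar code so the length need not be a multiple of 4. It assumes an x86 or x86-64 target with SSE2 available.

```cpp
#include <cassert>
#include <cstddef>
#include <emmintrin.h>  // SSE2 intrinsics (x86 / x86-64 only)

// Hypothetical helper: out[i] = a[i] + b[i] for i in [0, n).
// Unaligned loads/stores are used, so no 16-byte alignment is required.
void add_vtr_sse2(const int* a, const int* b, int* out, std::size_t n) {
    assert(n == 0 || (a != nullptr && b != nullptr && out != nullptr));
    std::size_t i = 0;
    // Main loop: 4 ints (128 bits) per iteration.
    for (; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out + i),
                         _mm_add_epi32(va, vb));
    }
    // Scalar tail: the remaining n % 4 elements.
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

If the data is known to be 16-byte aligned, _mm_load_si128 and _mm_store_si128 are the aligned counterparts; whether that makes a measurable difference would have to be tested.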
Adding -funroll-loops makes the loop process 128 bytes at a time using xmm0-xmm7, but you have to measure for yourself whether that is better.
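One rough way to do that measurement is a small timing helper like the sketch below. The name time_add_ns and its parameters are assumptions for illustration only. It calls whatever addition routine you pass in a number of times and reports the average time per call; for real comparisons a proper benchmarking setup with warm-up and several runs would be more trustworthy.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: calls add(a, b, out, n) `repeats` times and
// returns the average wall-clock time per call in nanoseconds.
template <typename AddFn>
double time_add_ns(AddFn add, std::size_t n, int repeats) {
    assert(n > 0 && repeats > 0);
    std::vector<int> a(n, 1), b(n, 2), out(n, 0);
    volatile int sink = 0;  // keeps the results observable to the optimizer
    auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r) {
        add(a.data(), b.data(), out.data(), n);
        sink = out[static_cast<std::size_t>(r) % n];
    }
    auto stop = std::chrono::steady_clock::now();
    (void)sink;
    return std::chrono::duration<double, std::nano>(stop - start).count()
           / repeats;
}
```

Build the same source once with and once without -funroll-loops and compare the numbers on the machine you actually care about.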
The C/C++ code is much more readable and will work on any CPU. A plain vector addition is so easy for the compiler to optimize that writing it in asm can only make it worse.
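Since the question asks about vectors of any size, here is a minimal sketch of a plain C++ routine with the same argument order as add_vtr_asm. The name add_vtr_cpp is hypothetical, and the count is taken as std::size_t rather than int. It makes no alignment assumptions and handles lengths that are not a multiple of 4; whether and how the compiler vectorizes it depends on the compiler and flags.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical portable counterpart to add_vtr_asm:
// out[i] = a[i] + b[i] for every i in [0, n).
void add_vtr_cpp(const int* a, const int* b, int* out, std::size_t n) {
    assert(n == 0 || (a != nullptr && b != nullptr && out != nullptr));
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

A call such as add_vtr_cpp(vetor1, vetor2, soma, 1024) would take the place of the add_vtr_asm call in the question.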
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Goswin von Brederlow |