Program in assembly x86 [closed]

I recently wrote a program in C++ and ASM. Can anyone help me make this code more efficient, in the ASM part or in both? I would really appreciate it, because I don't know every ASM instruction and I am probably using far too many. The program sums two integer vectors of any size. The code I have is below:

C++:


#include <cstdio>   // printf
#include <cstdlib>  // exit

extern "C" {
    int add_vtr_asm(int*, int*, int*, int);
}

void add_vtr() {
    // 16-byte alignment is required by the movdqa instructions in the ASM routine.
    __declspec(align(16))
        int vetor1[1024];
    __declspec(align(16))
        int vetor2[1024];
    __declspec(align(16))
        int soma[1024];
    for (int i = 0; i <= 1023; i++) {
        vetor1[i] = i;
        vetor2[i] = i;
    }
    add_vtr_asm(vetor1, vetor2, soma, 1024);
    for (int i = 0; i <= 1023; i++) {
        printf("% d + % d = % d \n", vetor1[i], vetor2[i], soma[i]);
    }
    exit(0);
}

int main()
{

    printf("Programa para somar vetores de inteiros: \n");
    printf("Soma de vetores com % d elementos \n", 1024);
    add_vtr();
}

ASM:
 

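.686                    ; a processor directive is needed before .MODEL FLAT
.XMM                    ; enable SSE/SSE2 instructions (movdqa, paddd)
.MODEL FLAT, C

.CODE
; int add_vtr_asm(int* a, int* b, int* sum, int n)
; Assumes n is a non-zero multiple of 4 and all pointers are 16-byte aligned.
add_vtr_asm PROC
    push ebp
    mov ebp,esp
    push esi
    push edi
    push ebx            ; ebx is callee-saved in cdecl, so preserve it
    mov esi,[ebp+8]     ; first input vector
    mov ebx,[ebp+12]    ; second input vector
    mov edi,[ebp+16]    ; output vector
    mov ecx,[ebp+20]    ; element count
    shr ecx,2           ; four 32-bit ints per 128-bit register
next:
    movdqa xmm0,[esi]   ; load 4 ints from the first vector
    add esi,16
    paddd xmm0,[ebx]    ; add 4 ints from the second vector
    add ebx,16
    movdqa [edi],xmm0   ; store 4 sums
    add edi,16
    dec ecx
    jnz next
    pop ebx
    pop edi
    pop esi
    pop ebp
    ret
add_vtr_asm ENDP
END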



Solution 1:[1]

I just wrote it in plain C++ and got this: https://godbolt.org/z/P1zPWv65b

struct Vec {
    alignas(16) int data[1024];
};

Vec add(const Vec &v1, const Vec &v2) {
    Vec v;
    for (int i = 0; i < 1024; ++i) {
        v.data[i] = v1.data[i] + v2.data[i];
    }
    return v;
}

Compiling with: g++ -std=c++20 -O2 -W -Wall

add(Vec const&, Vec const&):
        xor     eax, eax
.L2:
        movdqa  xmm0, XMMWORD PTR [rsi+rax]
        paddd   xmm0, XMMWORD PTR [rdx+rax]
        add     rax, 16
        movaps  XMMWORD PTR [rax-16+rdi], xmm0
        cmp     rax, 4096
        jne     .L2
        mov     rax, rdi
        ret

I don't see how I could improve on that. I did not even ask for vectorization explicitly; plain -O2 already picks SIMD instructions (movdqa and paddd on the 128-bit xmm registers) just fine.

Note: without alignas(16) or std::array<int, 1024>, the compiler uses movdqu instead of movdqa, which I assume might be slower, but I didn't test that. The whole loop is likely limited by memory bandwidth anyway.
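To make the aligned versus unaligned distinction concrete, here is a minimal sketch of the same loop written with SSE2 intrinsics. The function name add_ints_unaligned and the HAVE_SSE2 macro are made up for this illustration and are not part of the code above. It uses the unaligned load and store intrinsics (which correspond to movdqu), so it makes no alignment assumption, and it finishes with a scalar loop so the element count does not have to be a multiple of 4.

```cpp
#include <cstddef>

// Use SSE2 only where the compiler reports it is available; otherwise fall back
// to the scalar loop. HAVE_SSE2 is a hypothetical macro for this sketch.
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>  // SSE2 intrinsics
#define HAVE_SSE2 1
#endif

// Hypothetical helper: out[i] = a[i] + b[i] for any n and any alignment.
void add_ints_unaligned(const int* a, const int* b, int* out, std::size_t n) {
    std::size_t i = 0;
#ifdef HAVE_SSE2
    // Process four 32-bit ints per iteration with unaligned loads and stores.
    for (; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out + i), _mm_add_epi32(va, vb));
    }
#endif
    // Scalar tail: handles the remaining 0 to 3 elements (or everything without SSE2).
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

If the data is known to be 16-byte aligned, _mm_load_si128 and _mm_store_si128 are the aligned counterparts. Whether that makes a measurable difference would have to be benchmarked on the target CPU.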

Adding -funroll-loops makes the loop process 128 bytes per iteration using xmm0-xmm7, but you have to measure for yourself whether that is actually faster.

The C/C++ code is much more readable and will work on any CPU. A plain vector addition is so easy for the compiler to optimize that writing it in ASM by hand can only make it worse.
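Since the question asks for vectors of any size, the same idea can be sketched in portable C++ without a fixed length. This is only an illustration: the function name add_vectors is hypothetical, and rejecting mismatched sizes with an exception is an assumption of this sketch, not something the original code does. The loop is left to the compiler, which is free to vectorize it at the chosen optimization level.

```cpp
#include <algorithm>
#include <functional>
#include <stdexcept>
#include <vector>

// Hypothetical helper: returns the element-wise sum of two equally sized vectors.
std::vector<int> add_vectors(const std::vector<int>& a, const std::vector<int>& b) {
    if (a.size() != b.size()) {
        throw std::invalid_argument("add_vectors: size mismatch");
    }
    std::vector<int> sum(a.size());
    // std::transform applies std::plus to each pair (a[i], b[i]) and writes sum[i].
    std::transform(a.begin(), a.end(), b.begin(), sum.begin(), std::plus<int>{});
    return sum;
}
```

Called as add_vectors(v1, v2), it works for any length, including lengths that are not a multiple of 4, because the compiler generates whatever remainder handling its vectorized loop needs.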

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Goswin von Brederlow