MPI_alltoallw working and MPI_Ialltoallw failing
I am trying to implement non-blocking communications in a large code, but it fails in some of those cases. I have reproduced the error below. When run on one CPU, the code below works when switch is set to .false. but fails when switch is set to .true.
program main
   use mpi
   implicit none
   logical :: switch
   integer, parameter :: maxSize = 128
   integer :: scounts(maxSize), sdispls(maxSize)
   integer :: rcounts(maxSize), rdispls(maxSize)
   integer :: types(maxSize)
   double precision :: sbuf(maxSize), rbuf(maxSize)
   integer :: comm, size, rank, req
   integer :: ierr
   integer :: ii

   call MPI_Init(ierr)
   comm = MPI_COMM_WORLD
   call MPI_Comm_size(comm, size, ierr)
   call MPI_Comm_rank(comm, rank, ierr)
   switch = .true.

   ! Init
   sbuf(:) = rank
   scounts(:) = 0
   rcounts(:) = 0
   sdispls(:) = 0
   rdispls(:) = 0
   types(:) = MPI_INTEGER

   if (switch) then
      ! Send N double precision values once, as a single subarray type
      scounts(1) = 1
      rcounts(1) = 1
      sdispls(1) = 0
      rdispls(1) = 0
      call MPI_Type_create_subarray(1, (/maxSize/), &
                                    (/maxSize/), &
                                    (/0/), &
                                    MPI_ORDER_FORTRAN, MPI_DOUBLE_PRECISION, &
                                    types(1), ierr)
      call MPI_Type_commit(types(1), ierr)
   else
      ! Send N times one double precision value
      do ii = 1, maxSize
         scounts(ii) = 1
         rcounts(ii) = 1
         sdispls(ii) = ii - 1
         rdispls(ii) = ii - 1
         types(ii) = MPI_DOUBLE_PRECISION
      end do
   end if

   call MPI_Ibarrier(comm, req, ierr)
   call MPI_Wait(req, MPI_STATUS_IGNORE, ierr)

   if (switch) then
      call MPI_Ialltoallw(sbuf, scounts, sdispls, types, &
                          rbuf, rcounts, rdispls, types, &
                          comm, req, ierr)
      call MPI_Wait(req, MPI_STATUS_IGNORE, ierr)
      call MPI_Type_free(types(1), ierr)
   else
      call MPI_Alltoallw(sbuf, scounts, sdispls, types, &
                         rbuf, rcounts, rdispls, types, &
                         comm, ierr)
   end if

   call MPI_Finalize(ierr)
end program main
Compiling with debug flags and running with
mpirun -np 1 valgrind --vgdb=yes --vgdb-error=0 ./a.out
leads to the following errors in valgrind and gdb:
valgrind:
==249074== Invalid read of size 8
==249074== at 0x4EB0A6D: release_vecs_callback (coll_base_util.c:222)
==249074== by 0x4EB100A: complete_vecs_callback (coll_base_util.c:245)
==249074== by 0x74AD1CC: ompi_request_complete (request.h:441)
==249074== by 0x74AE86D: ompi_coll_libnbc_progress (coll_libnbc_component.c:466)
==249074== by 0x4FC0C39: opal_progress (opal_progress.c:231)
==249074== by 0x4E04795: ompi_request_wait_completion (request.h:415)
==249074== by 0x4E047EB: ompi_request_default_wait (req_wait.c:42)
==249074== by 0x4E80AF7: PMPI_Wait (pwait.c:74)
==249074== by 0x48A30D2: mpi_wait (pwait_f.c:76)
==249074== by 0x10961A: MAIN__ (tmp.f90:61)
==249074== by 0x1096C6: main (tmp.f90:7)
==249074== Address 0x7758830 is 0 bytes inside a block of size 8 free'd
==249074== at 0x483CA3F: free (vg_replace_malloc.c:540)
==249074== by 0x4899CCC: PMPI_IALLTOALLW (pialltoallw_f.c:125)
==249074== by 0x1095FC: MAIN__ (tmp.f90:61)
==249074== by 0x1096C6: main (tmp.f90:7)
==249074== Block was alloc'd at
==249074== at 0x483B7F3: malloc (vg_replace_malloc.c:309)
==249074== by 0x4899B4A: PMPI_IALLTOALLW (pialltoallw_f.c:90)
==249074== by 0x1095FC: MAIN__ (tmp.f90:61)
==249074== by 0x1096C6: main (tmp.f90:7)
gdb:
Thread 1 received signal SIGTRAP, Trace/breakpoint trap.
0x0000000004eb0a6d in release_vecs_callback (request=0x7758af8) at ../../../../openmpi-4.1.0/ompi/mca/coll/base/coll_base_util.c:222
222 if (NULL != request->data.vecs.stypes[i]) {
(gdb) bt
#0 0x0000000004eb0a6d in release_vecs_callback (request=0x7758af8) at ../../../../openmpi-4.1.0/ompi/mca/coll/base/coll_base_util.c:222
#1 0x0000000004eb100b in complete_vecs_callback (req=0x7758af8) at ../../../../openmpi-4.1.0/ompi/mca/coll/base/coll_base_util.c:245
#2 0x00000000074ad1cd in ompi_request_complete (request=0x7758af8, with_signal=true) at ../../../../../openmpi-4.1.0/ompi/request/request.h:441
#3 0x00000000074ae86e in ompi_coll_libnbc_progress () at ../../../../../openmpi-4.1.0/ompi/mca/coll/libnbc/coll_libnbc_component.c:466
#4 0x0000000004fc0c3a in opal_progress () at ../../openmpi-4.1.0/opal/runtime/opal_progress.c:231
#5 0x0000000004e04796 in ompi_request_wait_completion (req=0x7758af8) at ../../openmpi-4.1.0/ompi/request/request.h:415
#6 0x0000000004e047ec in ompi_request_default_wait (req_ptr=0x1ffeffdbb8, status=0x1ffeffdbc0) at ../../openmpi-4.1.0/ompi/request/req_wait.c:42
#7 0x0000000004e80af8 in PMPI_Wait (request=0x1ffeffdbb8, status=0x1ffeffdbc0) at pwait.c:74
#8 0x00000000048a30d3 in ompi_wait_f (request=0x1ffeffe6cc, status=0x10c0a0 <mpi_fortran_status_ignore_>, ierr=0x1ffeffeee0) at pwait_f.c:76
#9 0x000000000010961b in MAIN__ () at tmp.f90:61
Any help would be appreciated. Ubuntu 20.04, gfortran 9.3.0, Open MPI 4.1.0. Thanks.
Solution 1:
The program is valid; it triggers a bug in Open MPI (note the free inside PMPI_IALLTOALLW in the valgrind trace), tracked in issue https://github.com/open-mpi/ompi/issues/8763. The current workaround is to use MPICH.
EDIT: the bug is fixed on the main branch of Open MPI and should be fixed in versions 5.0 and above.
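Since the fix is only expected in Open MPI 5.0 and later, a build-time guard can warn before the affected MPI_Ialltoallw path is used. This is a minimal shell sketch: the hard-coded version string stands in for the real query (an assumption; Open MPI's mpirun prints a line like "mpirun (Open MPI) 4.1.0", from which the last field could be taken instead).

```shell
# Sketch: warn when the detected Open MPI release predates the
# MPI_Ialltoallw fix (assumed present from 5.0 onward, per issue 8763).
# Hard-coded for illustration; in practice something like
#   ver=$(mpirun --version | awk 'NR==1{print $NF}')
ver="4.1.0"
major=${ver%%.*}          # major version, e.g. "4"
if [ "$major" -lt 5 ]; then
  echo "Open MPI $ver: MPI_Ialltoallw Fortran binding may be broken (issue 8763)"
fi
```

In a Makefile or configure script, the same check could switch the code to the blocking MPI_Alltoallw path, which the question shows working.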
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow