Running parallel jobs in Slurm
I was wondering if I could ask something about running Slurm jobs in parallel. (Please note that I am new to Slurm and Linux and have only started using them two days ago...)
As per the instructions in the linked tutorial (source: https://hpc.nmsu.edu/discovery/slurm/serial-parallel-jobs/), I have written the following bash script:
#!/bin/bash
#SBATCH --job-name fmriGLM   # to give this job a distinct name
#SBATCH --nodes=1
#SBATCH --ntasks=30
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=3000MB
#SBATCH -t 16:00:00          # time limit for the job
#SBATCH -o /scratch/connectome/dyhan316/fmri_preprocessing/FINAL_loop_over_all/output_fmri_glm.o%j   # %j: the job ID
#SBATCH -e /scratch/connectome/dyhan316/fmri_preprocessing/FINAL_loop_over_all/error_fmri_glm.e%j

pwd; hostname; date

for num in {0..29}
do
    srun --ntasks=1 python FINAL_ARGPARSE_RUN.py --n_division 30 --start_num ${num} &
done
wait
Then, I ran sbatch as follows: sbatch test_bash
However, when I view the outputs, it is apparent that only one of the srun commands in the bash script is being executed... Could anyone tell me where I went wrong and how I can fix it?
**Update: when I look at the error file I get the following: srun: Job 43969 step creation temporarily disabled, retrying. I searched the internet and it says this can be caused by not specifying the memory, and hence not having enough memory left for the second step... but I thought I had already specified the memory when I set --mem-per-cpu=3000MB?
**Update: I have tried changing the code as suggested in "Why are my slurm job steps not launching in parallel?", but it still didn't work.
**Potentially pertinent information: our node has about 96 cores, which seems odd compared to tutorials that say one node has something like 4 cores.
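(A 96-core node is not unusual for an HPC cluster. If you want to confirm what each node actually offers, one quick way, assuming the standard sinfo tool is available on your cluster, is:

sinfo -N -o "%N %c %m"   # node name, CPUs per node, memory per node in MB

This only reports hardware per node; it does not change how your job is scheduled.)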
Thank you!!
Solution 1:[1]
Try adding --exclusive to the srun command line:

srun --exclusive --ntasks=1 python FINAL_ARGPARSE_RUN.py --n_division 30 --start_num ${num} &

This will instruct srun to use a sub-allocation and work as you intended.

Note that the --exclusive option has a different meaning in this context than when used with sbatch.

Note also that different versions of Slurm have a distinct canonical way of doing this, but using --exclusive should work across most versions.
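For completeness, here is a minimal sketch of what the whole submission script could look like with --exclusive added to each step. It reuses the paths, options, and loop from the question and is meant as a template rather than a verified drop-in replacement:

#!/bin/bash
#SBATCH --job-name fmriGLM
#SBATCH --nodes=1
#SBATCH --ntasks=30
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=3000MB
#SBATCH -t 16:00:00
#SBATCH -o /scratch/connectome/dyhan316/fmri_preprocessing/FINAL_loop_over_all/output_fmri_glm.o%j
#SBATCH -e /scratch/connectome/dyhan316/fmri_preprocessing/FINAL_loop_over_all/error_fmri_glm.e%j

pwd; hostname; date

for num in {0..29}
do
    # --exclusive gives each step its own sub-allocation, so the 30 steps can run side by side
    srun --exclusive --ntasks=1 python FINAL_ARGPARSE_RUN.py --n_division 30 --start_num ${num} &
done
wait   # keep the batch job alive until all background steps finish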
Solution 2:[2]
Even though you have solved your problem, which turned out to be something else, and you have already specified --mem-per-cpu=3000MB in your sbatch script, I would like to add that in my case my Slurm setup does not allow --mem-per-cpu in sbatch, only --mem. So the srun command will still allocate all of the memory and block the subsequent steps. The key for me is to specify --mem-per-cpu (or --mem) in the srun command itself.
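As a sketch of what that looks like in the loop from the question (the 3000MB value simply mirrors the sbatch setting above; size it to what each Python process actually needs on your cluster):

for num in {0..29}
do
    # per-step memory request, so one step does not claim the whole job allocation
    srun --exclusive --ntasks=1 --mem-per-cpu=3000MB python FINAL_ARGPARSE_RUN.py --n_division 30 --start_num ${num} &
done
wait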
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source
---|---
Solution 1 | damienfrancois
Solution 2 | Isabella