Running Python scripts in Slurm

I've recently started a new job and need to run some scripts on the HPC through Slurm.

My scripts are written in Python, so I want to execute them with python script.py from my .slurm file.

However, when I submit the .slurm file, it doesn't seem to be able to call the Python scripts. I've tried loading the Python environment with module load anaconda3 and variations thereof (e.g. module load python). Attached is my array.slurm file for reference. I've left the account and mail-user fields empty here for anonymity, but they are filled in when I actually run the script.

The error file output by Slurm indicates the following:

/var/spool/slurmd/job220829/slurm_script: line 19: module: command not found

Can someone offer practical guidance? I need to run these Python scripts as soon as possible.



Solution 1:[1]

As md2perpe mentioned, every HPC system is different: each site customizes the Slurm scheduler to some extent. Still, most HPC systems share the same basic commands.

For instance, here is a job submission script that I created to run a Python file on a GPU node.

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=00:00:40
#SBATCH --ntasks=1
#SBATCH --job-name=gpu_check
#SBATCH --output=gpu.%j.out
#SBATCH --error=gpu.%j.err
#SBATCH --gres=gpu:1
#SBATCH --account=scw1901
#SBATCH --partition=accel_ai

module load anaconda/3
source activate base
python gpu.py
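
You would then submit and monitor the job roughly like this (assuming the script above is saved as gpu.slurm; adjust the filename to match yours):

sbatch gpu.slurm        # submit the batch script; Slurm prints the job ID
squeue -u $USER         # check whether the job is pending or running
cat gpu.12345.out       # inspect the output once it finishes (replace 12345 with your job ID)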

I can suggest the following:

  • After loading the Anaconda module, you should activate a conda environment, for example source activate base. To see the available conda environments, run conda env list, then activate the one you need (see the conda sketch after this list).
  • I don't know what your Python script does, so I can't really comment on the arguments you are passing to it.
  • Make sure you have access to the partition. To see the list of partitions, run sinfo and check each partition's state: if it is drain or reserved, you simply can't use that partition (see the sinfo example after this list).
  • Maybe you can run your script without --ntasks-per-node and --array. Why not try my job script first?
  • If nothing works, please paste the contents of the error file into your question. Note that in my script the output filename uses %j (the job ID), not %a (the array task index) as in yours.
  • You can remove the email arguments (--mail-type, --mail-user) if you don't need notifications.
  • Do you actually use SLURM_ARRAY_TASK_ID? It is only set for array jobs (see the array sketch after this list); if you don't know what it is for, remove it.
  • You said the module command is not found. The error points to line 19 of your script, but you also use module on line 18. Are you sure you are sharing the correct job script?
  • Can you run module load anaconda/3 on the login node? Just copy and paste it after SSHing in. If that works, then the module command is available.
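
For the conda point above, here is a minimal sketch of what I mean (my_env is just a placeholder name; use whatever conda env list shows on your system):

module load anaconda/3   # or however the Anaconda module is named on your cluster
conda env list           # list the environments available to you
source activate my_env   # activate one of them (my_env is a placeholder)
python script.py         # the python on your PATH now comes from that environment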
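
For checking the partition, something like this works (accel_ai is the partition on my system; substitute yours):

sinfo                                    # list all partitions and their node states
sinfo -p accel_ai -o "%P %a %l %D %T"    # one partition: name, availability, time limit, node count, state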
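
And if you really do need an array job, SLURM_ARRAY_TASK_ID is the index Slurm sets for each array element. A minimal sketch (the 0-4 range and script.py are placeholders):

#!/bin/bash
#SBATCH --job-name=py_array
#SBATCH --ntasks=1
#SBATCH --time=00:10:00
#SBATCH --array=0-4
#SBATCH --output=array.%A_%a.out

module load anaconda/3
source activate base
# SLURM_ARRAY_TASK_ID takes the values 0,1,2,3,4 -- one per array element
python script.py ${SLURM_ARRAY_TASK_ID}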

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

[1] Solution 1, Stack Overflow