What does it mean for a Slurm job to crash with `bus error`?

When I run a Python script in an interactive Slurm session started with `srun --pty bash`, it crashes with the cryptic error message `Bus error: core dumped`.

I searched the Slurm documentation, but it doesn't mention this error type.

What's going on and how can I fix it?

I found general information on bus errors in the question "What is a bus error? Is it different from a segmentation fault?", but it doesn't explain how and why the error happens in a Slurm environment or what can be done to avoid it.



Solution 1:[1]

Helpful answer from Ben Evans on the Yale cluster Discourse that may apply more generally to other clusters:

On the Yale clusters, a bus error usually means your job ran out of memory (RAM). If you cannot reduce the memory usage of your code, you can request additional memory for your job using the --mem-per-cpu or --mem Slurm flags.
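
To pick a value for `--mem`, it can help to know how much memory the script actually peaks at. The following is a minimal sketch, not part of the quoted answer: `peak_rss_mib` and `suggested_mem_mib` are hypothetical helper names, the 20% headroom is an arbitrary assumption, and the measurement covers only the current Python process, not any child processes it starts.

```python
import resource
import sys


def peak_rss_mib() -> float:
    """Return this process's peak resident set size in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in KiB on Linux but in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def suggested_mem_mib(headroom: float = 1.2) -> int:
    """Hypothetical helper: peak usage plus a safety margin, rounded up."""
    return int(peak_rss_mib() * headroom) + 1


if __name__ == "__main__":
    # Stand-in for the real workload; replace with your own code.
    data = [bytes(1024) for _ in range(50_000)]
    print(f"peak RSS: {peak_rss_mib():.0f} MiB")
    print(f"value to try: --mem={suggested_mem_mib()}M")
```

Run it on a small input or a node with plenty of memory first. A job that has already hit its limit is killed before it can report anything.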

More details: Your program can run into this fault because of the way we manage memory with cgroups so that many jobs can be run on the same physical machine without interfering with one another. If a process inside a job tries to access memory “outside” what was allocated to that job, e.g. more than what you requested, the operating system tells your program that address is invalid with the fault Bus Error, aka SIGBUS, exit(10). A similar fault you might be more familiar with is a Segmentation Fault, aka SIGSEGV, exit(11) which usually results from a program incorrectly trying to access a valid memory address.
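
If the script is launched from a Python wrapper, termination by SIGBUS can be told apart from an ordinary non-zero exit. This is a minimal sketch under assumptions: the child here sends SIGBUS to itself to simulate the crash, and `describe_exit` and `run_child` are hypothetical names. The numeric value of SIGBUS differs between platforms, so the code refers to `signal.SIGBUS` rather than a hard-coded number.

```python
import signal
import subprocess
import sys

# Child program: disable core dumps, then send SIGBUS to itself
# to simulate the crash described above.
CHILD = (
    "import os, resource, signal\n"
    "resource.setrlimit(resource.RLIMIT_CORE, (0, 0))\n"
    "os.kill(os.getpid(), signal.SIGBUS)\n"
)


def describe_exit(returncode: int) -> str:
    """Translate a subprocess return code into a readable message."""
    # subprocess reports death by signal N as a return code of -N.
    if returncode < 0:
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with status {returncode}"


def run_child() -> int:
    """Run the simulated crash and return the child's return code."""
    return subprocess.run([sys.executable, "-c", CHILD]).returncode


if __name__ == "__main__":
    print(describe_exit(run_child()))
```

A shell such as bash reports the same event as exit status 128 plus the signal number, rather than as a negative value.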

https://ask.cyberinfrastructure.org/t/what-does-it-mean-when-i-get-a-bus-error-in-my-job/1101/2

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Cornelius Roemer