How to use multiple CPUs with Python on an HPC?

I'm working on a data analysis project in Python and I'm using an HPC cluster to process my data. I'm having a hard time getting my program to use multiple CPUs to make it run faster. Here's an example of what I'm doing:

import multiprocessing as mp
import csv

def calc_function(sample_file):
    # do some heavy calculation and make a dictionary about it
    sample_dict = {'Name': str(sample_file), 'Info': 'blablabla'}
    return sample_dict

list_of_files = [1, 2]
with mp.Pool( mp.cpu_count() ) as pool:
    list_of_results = pool.map(calc_function, list_of_files)

# make csv file
# add all data to csv

I have found that my jobs time out even when I give them the same amount of time, or more, than they would need without multiprocessing at all.

I know this way is not very good as it relies on the whole pool finishing before anything gets saved into the CSV. I've found in other questions that it's probably better to put everything in a Queue and have that write to the CSV, but I can't get that to work either.

I tried a small 3GB dataset that takes around 40 mins without mp.Pool. Here is my SBATCH-file as well, if that's helpful:

#SBATCH --partition=general
#SBATCH --qos=short
#SBATCH --time=1:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=4G
#SBATCH --mail-type=BEGIN,END

conda activate main_env
srun python <my_file.py>
conda deactivate

Thanks for the help!



Solution 1:

Q : "How to use multiple CPUs with Python on an HPC?"

A :
Wrong, sorry,
there is no way Python will ever
( unless a total re-design from scratch, bottom-up, takes place, which Mr. Guido van ROSSUM, the Father of Python, is often cited himself not to consider that probable, if ever, within a foreseeable amount of efforts and within some acceptable time horizon )
use more CPU(cores), due to a fundamental design property, as Python-interpreter processes and threads all wait for a single ( a process-central, concurrency principally-A-V-O-I-D-I-N-G, MUTEX-lock as a coordinating point ), the Global Interpreter Lock ( a.k.a. GIL ).

This said, whatever amount of threads the Python-Interpreter process launches, all of them wait, and the one and only thread which has successfully grabbed the GIL MUTEX-lock does a small amount of useful work ( a configurable chunk of about ~ 100 ms ) before handing the lock over. I repeat, ALL OTHERS WAIT, doing nothing. This is principally the very opposite of the state HPC fights for in doing the computing ( i.e. High Performance Computing - kindly notice the accent on the word PERFORMANCE, and more exactly the HIGH performance ).

If we were consistent in using the same words for the very same meanings these words naturally carry, launching Python-Interpreter based jobs onto an HPC-infrastructure is closer to High-Performance-Waiting than it will ever be to High-Performance-Computing ( smart-vectorised, GIL-bypassing non-Python math-libraries are surely the honourable exceptions to this fact, yet these are per-se not the Python-Interpreter, but well-oiled, performance-optimised external mathematical engines, so the observation still holds ).
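
A minimal sketch one can run locally to observe exactly this effect - the cpu_bound_work function body and the thread count of 4 are illustrative placeholders, not part of the original question :

import sys
import time
from threading import Thread

def cpu_bound_work(n=10_000_000):
    # pure-Python, CPU-bound loop, never releases the GIL for I/O
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == "__main__":
    print("GIL switch interval [s]:", sys.getswitchinterval())

    t0 = time.perf_counter()
    for _ in range(4):
        cpu_bound_work()                      # four calls, one after another
    serial = time.perf_counter() - t0

    t0 = time.perf_counter()
    threads = [Thread(target=cpu_bound_work) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    threaded = time.perf_counter() - t0

    # on CPython the threaded run is not ~4x faster - the GIL lets only
    # one thread execute Python bytecode at any given moment
    print(f"serial   : {serial:.2f} s")
    print(f"threaded : {threaded:.2f} s")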

The Problem ( as-is ) - what is the best Processing Strategy for ... ?

for

  • a given list-of-filenames to process
  • a given process to convert file-content into results
  • a given HPC-infrastructure to launch a computing strategy on

The Solution

Avoid any and all add-on overhead costs :

  1. forget the Queue - the Queue-handling overheads plus the data-SER/DES overheads ( read : pickle.dumps( sample_dict ) + pickle.loads( sample_dict ) ) represent awfully bad anti-patterns if performance is to get improved ( a small timing sketch follows right after this list )
  2. forget spawning mp.cpu_count()-many full copies of the whole Python-Interpreter process ( incl. all its internal data-structures, file-handles, etc., etc. ) - the RAM-footprint of doing so will, on modern { 8- | 16- | 32- | 64- | ... }-core CPUs, soon deplete the physical RAM and turn the O/S into a RAM-thrashing, swap-in / swap-out moribundus, where performance gets annihilated just by moving physical-RAM blocks from-disk-to-RAM / from-RAM-to-disk ad nauseam, leaving little, if any, residual RAM-I/O bandwidth to the ( already starved, now suffocated ) Computing. So the already inefficient ( High-Performance-Waiting ) GIL-lock engine now additionally waits for receiving the next pseudo-instructions' data back from RAM to CPU, just to become able to do some next ~ 100 ms of "calculation"
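
To get a feeling for the SER/DES costs mentioned in point 1., here is a small, self-contained timing sketch ( the dictionary shape mimics the question's sample_dict, the repeat count is an arbitrary choice ) :

import pickle
import timeit

# a result-dictionary shaped like the question's sample_dict, slightly inflated
sample_dict = {'Name': 'some_sample_file', 'Info': 'blablabla' * 1000}

n_repeats = 100_000

ser  = timeit.timeit(lambda: pickle.dumps(sample_dict), number=n_repeats)
blob = pickle.dumps(sample_dict)
des  = timeit.timeit(lambda: pickle.loads(blob), number=n_repeats)

# every item returned from a Pool worker or pushed through a Queue pays
# this serialise + deserialise price on top of the useful calculation
print(f"pickle.dumps() : {ser:.3f} s for {n_repeats:,} calls")
print(f"pickle.loads() : {des:.3f} s for {n_repeats:,} calls")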

The most promising Candidate
with maximum Python-allowed performance :

If you cannot or do not wish to re-factor the code so as to bypass the GIL-lock performance-avoiding trap, do the following :

a) use slurm tooling to launch as many ( N ) python-processes as your HPC-usage plan permits ( the Python-Interpreter does not require multiple cores, and neither does any other part of this Strategy; right-sized RAM-specific parameters may help to find more free nodes to map the batch onto the HPC-Infrastructure - see the remark below )

b) configure each slurm-launched python-process via its own process-specific launch-configuration parameter(s)

c) self-configure each python-process, based on the slurm-launch delivered process-specific parameter(s)

d) make each python-process start a trivial, private, concurrency-avoiding, embarrassingly-parallel, yet still Python-Interpreter-natively pure-[SERIAL] processing of those and only those files that come from its own, non-overlapping sub-set of the given <_list-of-filenames_> : the 1st-of-N processes taking the 1st N-th of the list, the 2nd the 2nd N-th part, the 3rd the 3rd N-th, the last one the last N-th of the original list ( a minimal sketch of steps b) - e) follows right after this list ).

e) make the file-processing function self-store its own results in some file, be it a .CSV or other ( HPC-policies' details can favour append-based storing, or not, with the later collection and joining of the individual files to be safely performed "outside" of the HPC-time-quota, by O/S-tools )

f) finally collect and join the resulting files, using O/S tools ex-post, ad libitum
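
A minimal sketch of steps b) - e), assuming each process receives its own index and the total count of launched processes on the command line ( the file names, the CSV layout and the calc_function body are illustrative placeholders mirroring the question's code, not a definitive implementation ) :

# worker.py - launched N times by slurm, e.g.:   srun python worker.py <my_index> <N>
import csv
import sys

def calc_function(sample_file):
    # do some heavy calculation and make a dictionary about it
    return {'Name': str(sample_file), 'Info': 'blablabla'}

if __name__ == "__main__":
    my_index = int(sys.argv[1])                  # 0 .. N-1, passed at launch ( step b) )
    n_procs  = int(sys.argv[2])                  # N, the total number of launched processes

    list_of_files = ['sample_001.dat', 'sample_002.dat']     # placeholder names

    # steps c) + d) : self-select the own, non-overlapping, contiguous sub-set
    chunk    = (len(list_of_files) + n_procs - 1) // n_procs
    my_files = list_of_files[my_index * chunk : (my_index + 1) * chunk]

    # step e) : each process stores its results in its own, private CSV file
    #           - no Pool, no Queue, no inter-process data transfer at all
    with open(f"results_{my_index:03d}.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=['Name', 'Info'])
        writer.writeheader()
        for sample_file in my_files:
            writer.writerow(calc_function(sample_file))

With a slurm job array ( #SBATCH --array=0-7 ), the same index could instead be read from the SLURM_ARRAY_TASK_ID environment variable, and step f) then reduces to concatenating the results_*.csv files once all array tasks have finished.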

Summary :

This recipe performs all the performance-oriented steps without adding a single ( expensive and many times repeated ) add-on instruction and will be the fastest Computing Strategy achievable on your HPC-infrastructure.

Having focused on performance, this orchestration adds minimum add-on overheads to the pure-[SERIAL] sequential-processing, so the achieved speedup will best scale with N :

               1
S =  __________________________ ;  where s := the [SERIAL]-fraction of the processing
                                     (1-s) := the parallelisable fraction
                ( 1 - s )                N := the number of slurm-launched processes
     s  + pSO + _________ + pTO        pSO := [PAR]-Setup-Overhead     add-on
                    N                  pTO := [PAR]-Terminate-Overhead add-on
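
A quick numeric illustration of the formula, with purely assumed example values for s, pSO and pTO :

def speedup(s, N, pSO=0.0, pTO=0.0):
    # overhead-strict re-formulation of Amdahl's Law, exactly as written above
    return 1.0 / (s + pSO + (1.0 - s) / N + pTO)

# assumed example values : 5 % serial fraction, 1 % setup- and 1 % terminate-overhead
for N in (1, 2, 4, 8, 16):
    print(f"N = {N:2d}   S = {speedup(s=0.05, N=N, pSO=0.01, pTO=0.01):.2f}")

Even these small, assumed overheads cap the achievable speedup well below N, which is why the recipe above insists on not adding any per-item add-on overheads at all.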

For HPC-usage-policies and HPC-resources / HPC-usage-quota details, please consult your HPC-infrastructure administrators & Technical Support Dept.

Details matter ( different file-storage filesystems and scratch-storage access / usage / deletion policies can and will decide on the best pre-launch & post-completion file-manipulations to take place, so that the whole batch-process gets the most out of the legitimate HPC-Infrastructure resources' usage )

No other magicks will help here.
