Python parallel processing running all tasks on one core - multiprocessing, ray

I have a model.predict() method and 65536 rows of data, which takes about 7 seconds to run. I wanted to speed this up using the joblib.parallel_backend tooling, following this example.

This is my code:

import numpy as np
from joblib import load, parallel_backend
from time import perf_counter as time  # time.clock was removed in Python 3.8

from urllib.request import urlopen

NN_model=load(urlopen("http://clima-dods.ictp.it/Users/tompkins/CRM/nnet_3var.jl"))

npt=65536
t=np.random.uniform(low=-1,high=1,size=npt)
u=np.random.uniform(low=-1,high=1,size=npt)
q=np.random.uniform(low=-1,high=1,size=npt)
X=np.column_stack((u,t,q))

t0=time()
out1=NN_model.predict(X)

t1=time()
print("serial",t1-t0)
with parallel_backend('threading', n_jobs=-1):
    out2=NN_model.predict(X)
t2=time()
print("parallel",t2-t1)

And these are my timings:

serial   6.481805
parallel 6.389198

I know from past experience that very small tasks are not sped up by shared-memory parallel techniques due to the overhead, as the answer posted here also notes, but that is not the case here: the job takes 7 seconds and should far exceed any overhead. In fact, I traced the load on the machine and it appears to be running only in serial.

What am I doing wrong with the joblib specification? How can I use threading on my desktop to parallelize this task with joblib (or an alternative)?


Edit 1

From the post below, I wondered whether joblib attempts to apply the parallelization to the model itself, rather than dividing the rows of data into ncore batches to distribute to each core. I therefore decided that I might need to do this division manually and farm the data "chunks" out to each core myself. I've now tried to use Parallel and delayed instead, chunking the data as per this post:

from joblib import Parallel, delayed

ncore    = 8
nchunk   = npt // ncore
parallel = Parallel(n_jobs=ncore)
results  = parallel(delayed(NN_model.predict)(X[i*nchunk:(i+1)*nchunk, :])
                    for i in range(ncore))

This now runs ncore instances on my machine, but each of them runs at roughly 1/ncore efficiency (as if something were gating them?) and the wall-clock time is still not improved...
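As a quick diagnostic (my own suggestion, not from the original post), one can check the CPU-affinity mask that each joblib worker actually sees; on Linux, os.sched_getaffinity reports the set of cores a process is allowed to run on, so if every worker reports only core 0 the gating is an affinity problem rather than a joblib one. A minimal sketch, assuming Linux and the default loky backend:

import os
from joblib import Parallel, delayed

def report_affinity(i):
    # the set of CPUs this worker process is allowed to run on
    return i, os.getpid(), sorted(os.sched_getaffinity(0))

# if every tuple shows the same single core, the workers are being gated by affinity
print(Parallel(n_jobs=8)(delayed(report_affinity)(i) for i in range(8)))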


Edit 2

As an alternative, I have now also tried to divide the dataset manually using the multiprocessing package:

import os
import multiprocessing

def predict_chunk(Xchunk):
    results = NN_model.predict(Xchunk)
    return results

pool = multiprocessing.Pool(processes=ncore)
# attempt to undo any core-pinning inherited from the numpy/BLAS import
os.system('taskset -cp 0-%d %s' % (ncore, os.getpid()))
stats = pool.starmap(predict_chunk, ([X[i*nchunk:(i+1)*nchunk, :]] for i in range(ncore)))
res = np.vstack(stats).flatten()
pool.close()
pool.join()

Apart from the overhead of dividing up the input data and restacking the results, the problem should be embarrassingly parallel. I then recalled earlier posts and wondered whether the slow performance was due to the task-affinity issue upon importing numpy reported here, so I added the os.system command, but that doesn't seem to help: I still get each of the 8 cores using around 12% of its CPU load, and the overall timing is now slightly slower than the serial solution due to the aforementioned overhead.
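For what it is worth, a Python-level alternative to the taskset call (an assumption on my part, not something tried in the linked post) is os.sched_setaffinity, applied after the numpy import and before the pool is created, so that the forked workers inherit the full CPU set:

import os

# 0 means "the current process"; forked pool workers inherit this mask,
# undoing any core-pinning a numpy/BLAS import may have applied
os.sched_setaffinity(0, range(ncore))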


Edit 3

I've now tried to use ray instead:

import ray

@ray.remote
def predict_chunk(Xchunk,start,end):
    results=NN_model.predict(Xchunk[start:end,:])
    return (results)

ray.init(num_cpus=ncore)
data_id=ray.put(X)
stats=ray.get([predict_chunk.remote(data_id,i*nchunk,(i+1)*nchunk) for i in range(ncore)])
res=np.vstack(stats).flatten()

Again, this creates 8 sub-processes, but they are all running on a single CPU, and thus the parallel version is slower than the serial one.

I'm almost certain this is related to the affinity issue referred to above, but the solutions don't seem to be working.
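Another thing worth ruling out (my suggestion, not something attempted above) is the thread configuration of the BLAS library that model.predict() ultimately calls into; the threadpoolctl package can report which native thread pools are loaded in the process and how many threads each one is allowed to use:

from threadpoolctl import threadpool_info

# list the native thread pools (OpenBLAS / MKL / OpenMP) loaded in this process
for pool in threadpool_info():
    print(pool.get("internal_api"), pool.get("num_threads"), pool.get("filepath"))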

This is a summary of the architecture:

Linux hp6g4-clima-5.xxxx.it 4.15.0-124-generic #127-Ubuntu SMP Fri Nov 6 10:54:43 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux


Solution 1:[1]

Q : "What am I doing wrong with the joblib specification?"

The biggest sin (excusable given a FORTRAN history, where smart uses of COMMON-blocks had an unparalleled beauty of their own) is to assume that process-based Python parallelism remains a shared-memory one, which it does not, and to assume that the non-process-based, just-[CONCURRENT] flow of processing will perform any faster, which (for python-evangelisation reasons) it will not: the central GIL-lock re-[SERIAL]-ises any amount of thread-based code execution back into a naive sequence of small-time-quota driven, monopolistic, pure-[SERIAL] processing, so concurrency is principally avoided.

Q : "How can I use threading on my desktop to parallelize this task with joblib (or an alternative)?"

There is no such way for your code.

Python threading is a no-go for compute-intensive & heavily memory-I/O-bound workloads in Python.
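To make the GIL point concrete, here is a small self-contained sketch (my own illustration, not part of the original answer): a pure-Python, CPU-bound function gains nothing from the 'threading' backend, while the process-based 'loky' backend can actually use several cores:

from time import perf_counter
from joblib import Parallel, delayed

def burn(n=2_000_000):
    # pure-Python arithmetic loop: holds the GIL for its whole duration
    s = 0
    for i in range(n):
        s += i * i
    return s

for backend in ("threading", "loky"):
    t0 = perf_counter()
    Parallel(n_jobs=4, backend=backend)(delayed(burn)() for _ in range(4))
    print(backend, round(perf_counter() - t0, 2), "s")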

If in need of more reading, feel free to read this, perhaps my previous answers under this tag, and try inspecting your system's NUMA-map details using lstopo.


DISCUSSION :

As the timings suggest:

serial   6.481805
parallel 6.389198

There is not more than about a 1.5 % "improvement", and noise from other O/S processes falls within that same range of "runtime" differences; only a small share of the memory-I/O accesses can enjoy any meaningful latency-masking, as you operate matrix-heavy many-MULs/many-ADDs (transformers) inside the neural network.


PRINCIPAL MISS :

The source of similar impacts of (not only) the initial range of value-related uncertainty was demonstrated as early as 1972 by no one less than a METEO guru, the mathematician and meteorologist Edward N. LORENZ, in his fabulous lecture held at the 139th meeting of the American Association for the Advancement of Science, right on this very day, DEC-29, 1972.

Neural networks are fine for model-less (statistically justified, as being only least-penalised) guessing and for the classification of non-critical objects (where humans soon tire or are unable to see/hear a "hidden" pattern inside devastatingly many gazillions of samples to "learn" from; otherwise we humans are excellent at pattern recognition and at "learning" on the fly, since Mother Evolution has developed our cognitive apparatus to do that enormously efficiently in energy and remarkably hierarchically, finding "a cat" pictured by oranges inside a pool of bananas).

Neural networks being "used" in all (known) model-driven domains is, sorry for being straight on this, an awful sin of its own.

Sure, thermodynamic models, state-change modes, and humidity/temperature/pressure/ion-interaction-specific atmosphere models are complex, yet they are known, and physics is not the penalty-driven guessing that the neural-network evangelisation of many-MULs/many-ADDs (transformers) is claimed to be so blindly "good" at.

Sure, we can spend infinite HPC budgets and infinite R&D capacity, yet no model-less NN-driven guessing will outperform a smart, responsibly implemented, physics-respecting model within the same amount of time and energy (yes, the HPC-infrastructure toys consume immense amounts of energy, both for computing (turning it directly into dissipated heat) and for cooling (turning yet more immense amounts of energy into removing the exhaust heat dissipated by the HPC infrastructure doing any kind of number-crunching games, be they wise or less so, in the prior step)).

Last but not least, as secondary-school graders should already know, MULs/ADDs increase the propagation of the principal uncertainty (not only due to the limitations of the float-IEEE-specified storage of values). After such processing, the resulting uncertainty of the "result" is orders of magnitude worse than that of the inputs. This is known alphabet for HPC computing, so there is no need to remind you of it; yet introducing NN-many-MULs/many-ADDs (transformers) into any kind of predictive system, the less so into long-range predictive systems (like climate evolution or weather near-casting), is an awful anti-pattern (even when it might get fat financing from EU agencies or from hardware vendors, a.k.a. technology marketing). Sorry, numbers do not work this way, and responsible scientists should not close their eyes to these principal gaps, if not biased cognitive manipulations, not to call them intentionally broadcast lies.
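As a trivial, hedged illustration of the float-IEEE point (mine, not the answer's): even ten additions of 0.1 in IEEE-754 doubles leave a representation error behind, before any model uncertainty enters the picture:

import math

print(sum([0.1] * 10))        # naive accumulation: 0.9999999999999999
print(math.fsum([0.1] * 10))  # error-compensated summation: 1.0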

Given as trivial an example as possible, take any super-trivial model-based chaotic attractor, be it a { Duffing | Lorenz } one.

As we "know" both the exact model (so we can compute & simulate its exact evolution in time-space with zero uncertainty) and its parameters, we get a unique chance to use these demonstrators to show how fast the (known, repeatable & inspectable) solution gets devastated by the natural propagation of any and all imprecisions & uncertainties (discussed here), since we can quantitatively "show" the growing ranges of uncertainty alongside the numerical simulation, a comfort we never have with unknown, empirical models (the less so with approximate ones, oversimplified with many hidden degrees of freedom).

Such results are visually impressive and might be captivating, as they look so acceptable (and we get zero chance to review the model results against reality in time; we cannot repeat the whole reality to re-review the deltas of the model, etc., so we just let others "believe").

Now, for these reasons, let's turn to the "known" model demonstrators and add any tiny amount of initial data uncertainty - in position, in speed, in time-stepping (as an abstracted coexistence of all kinds of persistently present & unavoidable systematic + random imprecisions in observations/readouts, incongruent times of data acquisition/assimilation, etc.) - and you soon get the same simulation at work, but now with the "new" dataPOINTs; these so quickly start to bear greater and greater, soon indeed infinite, ranges of their respectively associated principal uncertainties (of X, Y, Z positions, of dX/dt, dY/dt, dZ/dt speeds) that it renders them meaningless.
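A minimal sketch of that divergence (my own toy, assuming the standard Lorenz-63 parameters and a crude Euler integrator): two trajectories whose initial states differ by 1e-9 grow apart by many orders of magnitude within a few tens of model time units:

import numpy as np

def lorenz_step(s, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    # one crude Euler step of the Lorenz-63 system
    x, y, z = s
    return s + dt * np.array([sigma * (y - x), x * (rho - z) - y, x * y - beta * z])

a = np.array([1.0, 1.0, 1.0])
b = a + np.array([1e-9, 0.0, 0.0])   # a tiny initial-condition uncertainty
for step in range(5001):
    if step % 1000 == 0:
        print(f"t = {step * 0.01:5.1f}   |a - b| = {np.linalg.norm(a - b):.3e}")
    a, b = lorenz_step(a), lorenz_step(b)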

Is there any field of seriously accepted science that can make any serious use of a DataPOINT == 17.3476 ± ?, which is exactly what the many-MULs/many-ADDs (transformers) produce so insanely fast?

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Cody Gray