Keras model.fit() runs faster on GPU when the CPU is loaded with a heavy multiprocessing script

I wasn't expecting this to happen. The relevant code pieces are:

import os
# set the XLA flag before importing TensorFlow so it is picked up at initialization
os.environ['TF_XLA_FLAGS'] = '--tf_xla_enable_xla_devices'
import tensorflow as tf

...

    csv_logger = CSVLogger(out_dir + 'log.csv', append = True, separator = '|')
    
    for epoch in range(epochs):
        
        np_in_data = generate_data(...)  # arguments elided

        model.fit([np_in_data[:, 0], np_in_data[:, 1]], np_in_data[:, 2], 
                  batch_size = 128, callbacks = [csv_logger])

Yielded:

703512/703512 [==============================] - 4478s 6ms/step - loss: 0.3591 
703512/703512 [==============================] - 4486s 6ms/step - loss: 0.3330 
703512/703512 [==============================] - 3919s 6ms/step - loss: 0.3354 
703512/703512 [==============================] - 3503s 5ms/step - loss: 0.3379 

Here I launched another script in the middle of the 3rd epoch. That script tries to utilize all available CPU cores as follows:

import multiprocessing as mp

n_cpu_worker = mp.cpu_count()

def collect_data():  # enclosing function implied by the original snippet's `return`

    main_data = []

    with mp.Pool(processes = n_cpu_worker) as pool:

        # arguments elided; close()/join() are redundant inside `with` but kept from the original
        main_data.extend(pool.starmap(para_proc_func, zip(...)))
        pool.close()
        pool.join()

    return main_data

if __name__ == "__main__":
    main_data = collect_data()

The execution time of this second script increased (as somewhat expected) from around 2200 seconds to 2700, while GPU usage (according to nvidia-smi) rose from around 17% (model.fit alone) to 26% (model.fit plus this script). The script does no GPU work, and the model has no dropout or anything else that should change the runtime between epochs.
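
One thing I could try (purely an assumption on my part, not something I've measured yet) is to stop claiming every core for the pool and leave some headroom for the training process:

    import multiprocessing as mp

    # Hypothetical variant: leave a couple of cores free for the training process
    # instead of handing every core to the pool.
    n_cpu_worker = max(1, mp.cpu_count() - 2)

    if __name__ == "__main__":
        with mp.Pool(processes = n_cpu_worker) as pool:
            results = pool.map(abs, range(-8, 8))  # stand-in work for para_proc_func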

Is it possible that my Keras model uses both the CPU and the GPU, but would benefit from prioritizing the GPU for some tasks? How should its CPU usage be limited?
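
To see where the work actually lands, device-placement logging should tell whether individual ops run on the CPU or the GPU. A minimal sketch (with a stand-in model, since the real one is omitted above):

    import numpy as np
    import tensorflow as tf

    # Log the device (CPU/GPU) each op is placed on; must be enabled
    # before the ops are created.
    tf.debugging.set_log_device_placement(True)

    # Stand-in model purely for illustration.
    model = tf.keras.Sequential([tf.keras.layers.Dense(8, input_shape = (4,)),
                                 tf.keras.layers.Dense(1)])
    model.compile(optimizer = 'adam', loss = 'mse')

    x = np.random.rand(32, 4).astype('float32')
    y = np.random.rand(32, 1).astype('float32')
    model.fit(x, y, epochs = 1)  # placement messages are printed to stderr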



Solution 1:[1]

After more research, my experiments confirm that limiting TensorFlow's CPU parallelism indeed speeds up model.fit().

First, I found:

config = tf.compat.v1.ConfigProto(intra_op_parallelism_threads = 4,
                        inter_op_parallelism_threads = 1, 
                        allow_soft_placement = True,
                        device_count = {'CPU' : 4,
                                        'GPU' : 1})

session = tf.compat.v1.Session(config = config)
tf.compat.v1.keras.backend.set_session(session)

But that didn't affect the CPU usage, presumably because Keras in TF 2.x runs eagerly and doesn't go through the compat-v1 session. Then, following this answer: Tensorflow 2.2 not respecting thread settings (inter_op, intra_op, and OMP_NUM_THREADS)

I tested the following code instead:

tf.config.threading.set_inter_op_parallelism_threads(1)
tf.config.threading.set_intra_op_parallelism_threads(4)

This successfully limited the number of CPU cores used for feeding data into model.fit(), and that, for some reason, reduced the time per epoch during training. Now I'm experimenting with different values for these arguments.
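
For reference, a minimal sketch of how I order these calls, assuming they run before any other TensorFlow work (the thread pools can't be changed once the runtime has initialized them):

    import tensorflow as tf

    # Must run before building the model or executing any ops;
    # otherwise TensorFlow refuses to change its thread pools.
    tf.config.threading.set_inter_op_parallelism_threads(1)
    tf.config.threading.set_intra_op_parallelism_threads(4)

    # Sanity check that the values were accepted.
    print(tf.config.threading.get_inter_op_parallelism_threads())  # 1
    print(tf.config.threading.get_intra_op_parallelism_threads())  # 4

    # ... build the model and call model.fit() only after this point ...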

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
[1] Solution 1: oliver.c