How to run TensorFlow inference for multiple models on one GPU in parallel?

Is there an elegant way to run TensorFlow inference from two Python processes sharing a single GPU?

Suppose I have two processes: the first classifies cats/dogs, the second classifies birds/planes. Each process runs a different TensorFlow model on the GPU, and each model is fed images from a different camera continuously. By default, TensorFlow occupies nearly all of the GPU's memory, so when you start the second process it crashes with OUT OF MEMORY, a failed CUDA convolution, or something along those lines. Is there a tutorial, article, or sample code that shows how to load two models in different processes and run both in parallel? This would also be useful when running model inference while doing something graphics-heavy, e.g. playing a game; I'd also like to know how running the model affects the game.

I've tried using Python threads and it works, but each model predicts about 2x slower (and Python threads don't utilize multiple CPU cores anyway). I'd like to use Python processes instead, but I can't get that working. If you have a few lines of working sample code, I'd appreciate it very much.

I've also attached my current Thread code (posted as a screenshot in the original question).
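Since the screenshot is not reproduced here, the snippet below is only a rough sketch of that kind of Thread setup; the model paths and the get_next_frame/handle helpers are placeholders rather than the actual code.

import threading
import tensorflow as tf

def classify_forever(model_path):
    model = tf.keras.models.load_model(model_path)   # placeholder model file
    while True:
        frame = get_next_frame()                      # hypothetical camera-capture helper
        prediction = model(frame[None, ...])
        handle(prediction)                            # hypothetical downstream handler

threads = [
    threading.Thread(target=classify_forever, args=('cats_dogs.h5',)),
    threading.Thread(target=classify_forever, args=('birds_planes.h5',)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()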



Solution 1:[1]

OK. I think I've found the solution now.

I use TensorFlow 2, and there are essentially two methods to manage GPU memory usage:

  1. set memory growth to true
  2. set memory limit to some number

You can use either method and ignore all the warning messages about out-of-memory stuff. I still don't know exactly what they mean, but the model keeps running and that's what I care about. I measured the exact time the model takes to run, and it's a lot better than running on CPU. If I run both processes at the same time, the speed drops a bit, but it's still a lot better than running on CPU.

With the memory-growth approach, my GPU has 3 GB, so the first process tries to allocate everything and the second process then complains about being out of memory, but it still works.

With the memory-limit approach, I set the limit to some number, e.g. 1024 MB, and both processes work.
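Roughly, the memory-limit approach looks like this in TensorFlow 2 (the 1024 MB value is just the example figure above, and this has to run before the process does any GPU work):

import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Restrict this process to roughly 1024 MB of GPU memory.
        tf.config.experimental.set_virtual_device_configuration(
            gpus[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])
    except RuntimeError as e:
        # Virtual devices must be configured before the GPU has been initialized.
        print(e)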

So what is the smallest limit you can set?

I kept reducing the memory limit and found that my model runs fine with a 64 MB limit; the prediction speed is the same as with a 1024 MB limit. At 32 MB I noticed a 50% speed drop, and at 16 MB the model refuses to run because there isn't enough memory to store the image tensor. So my model needs a minimum of about 64 MB, which is very little considering I have 3 GB to spare. This also lets me run the model while playing video games.

Conclusion: I chose the memory-limit approach with a 64 MB limit. You can see how to set a memory limit here: https://www.tensorflow.org/guide/gpu

I suggest experimenting with the memory limit to find the minimum your model needs; you will see a speed drop, or the model will refuse to run, when the memory is not enough.
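For completeness, a minimal two-process sketch along these lines might look as follows. The model paths, the 64 MB limit, and the get_next_frame/handle helpers are placeholders, and each process has to set its memory limit before it touches the GPU:

import multiprocessing as mp

def run_model(model_path, memory_limit_mb):
    import tensorflow as tf  # import inside the child so each process gets its own runtime
    gpus = tf.config.experimental.list_physical_devices('GPU')
    if gpus:
        # Restrict this process to its own slice of GPU memory.
        tf.config.experimental.set_virtual_device_configuration(
            gpus[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=memory_limit_mb)])
    model = tf.keras.models.load_model(model_path)    # placeholder model file
    while True:
        frame = get_next_frame()                      # hypothetical camera-capture helper
        handle(model(frame[None, ...]))               # hypothetical downstream handler

if __name__ == '__main__':
    mp.set_start_method('spawn')  # a fresh interpreter per process avoids sharing CUDA state
    procs = [
        mp.Process(target=run_model, args=('cats_dogs.h5', 64)),
        mp.Process(target=run_model, args=('birds_planes.h5', 64)),
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()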

Solution 2:[2]

Apart from setting the GPU memory fraction, you should enable MPS (Multi-Process Service) in CUDA to get better speed when running more than one model on the GPU simultaneously; otherwise, inference will be slower than running a single model on the GPU.

sudo nvidia-smi -i 0 -c EXCLUSIVE_PROCESS
sudo nvidia-cuda-mps-control -d

Here, 0 is your GPU index.
After you finish, stop the MPS daemon:

echo quit | sudo nvidia-cuda-mps-control

Solution 3:[3]

As summarized here, you can specify the proportion of GPU memory allocated per process.

import tensorflow as tf  # TensorFlow 1.x API

# Allow this process to use roughly one third of the GPU's memory.
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))

With Keras, it may be simpler to allow 'memory growth', which expands the allocated memory on demand, as described here.

import tensorflow as tf
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Allocate GPU memory on demand instead of grabbing it all up front.
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        # Memory growth must be set before the GPUs have been initialized.
        print(e)

The following should work for TensorFlow 2.0 (via the compat.v1 API):

from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession

config = ConfigProto()
# Cap this process at ~20% of GPU memory and let allocations grow on demand.
config.gpu_options.per_process_gpu_memory_fraction = 0.2
config.gpu_options.allow_growth = True
session = InteractiveSession(config=config)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 Ritesh
Solution 3