Keras/TensorFlow network inference performance

I am using a Keras network on which I call predict() many times, each time on a single input. A rough calculation based on the layers gives ~3 million operations per inference, so running on my CPU should give on the order of ~1000 inferences per second. However, a test run of 400 predict() calls took 12 seconds, i.e. ~30 inferences per second. The network has only 139k parameters, which easily fit into cache, so it cannot be bandwidth-limited. How can I speed this up?

    Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
input_1 (InputLayer)            [(None, 2, 7, 6)]    0
__________________________________________________________________________________________________
tf.compat.v1.transpose (TFOpLam (None, 7, 6, 2)      0           input_1[0][0]
__________________________________________________________________________________________________
conv2d (Conv2D)                 (None, 7, 6, 64)     1216        tf.compat.v1.transpose[0][0]
__________________________________________________________________________________________________
dropout (Dropout)               (None, 7, 6, 64)     0           conv2d[0][0]
__________________________________________________________________________________________________
conv2d_1 (Conv2D)               (None, 7, 6, 32)     18464       dropout[0][0]
__________________________________________________________________________________________________
dropout_1 (Dropout)             (None, 7, 6, 32)     0           conv2d_1[0][0]
__________________________________________________________________________________________________
conv2d_2 (Conv2D)               (None, 5, 4, 64)     18496       dropout_1[0][0]
__________________________________________________________________________________________________
dropout_2 (Dropout)             (None, 5, 4, 64)     0           conv2d_2[0][0]
__________________________________________________________________________________________________
conv2d_3 (Conv2D)               (None, 5, 4, 64)     36928       dropout_2[0][0]
__________________________________________________________________________________________________
dropout_3 (Dropout)             (None, 5, 4, 64)     0           conv2d_3[0][0]
__________________________________________________________________________________________________
conv2d_4 (Conv2D)               (None, 5, 4, 64)     36928       dropout_3[0][0]
__________________________________________________________________________________________________
dropout_4 (Dropout)             (None, 5, 4, 64)     0           conv2d_4[0][0]
__________________________________________________________________________________________________
conv2d_5 (Conv2D)               (None, 3, 2, 32)     18464       dropout_4[0][0]
__________________________________________________________________________________________________
dropout_5 (Dropout)             (None, 3, 2, 32)     0           conv2d_5[0][0]
__________________________________________________________________________________________________
flatten (Flatten)               (None, 192)          0           dropout_5[0][0]
__________________________________________________________________________________________________
dense_2 (Dense)                 (None, 20)           3860        flatten[0][0]
__________________________________________________________________________________________________
dense (Dense)                   (None, 20)           3860        flatten[0][0]
__________________________________________________________________________________________________
dense_3 (Dense)                 (None, 20)           420         dense_2[0][0]
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 20)           420         dense[0][0]
__________________________________________________________________________________________________
policy (Dense)                  (None, 7)            147         dense_3[0][0]
__________________________________________________________________________________________________
value (Dense)                   (None, 1)            21          dense_1[0][0]
==================================================================================================
Total params: 139,224
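
The kind of measurement described above can be reproduced with a loop along these lines (a minimal sketch; the model path, random input, and timing harness are assumptions, not the actual test code):

import time
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model('model.h5')  # assumed path

# Single input matching the (2, 7, 6) input shape from the summary above
x = np.random.rand(1, 2, 7, 6).astype(np.float32)

start = time.time()
for _ in range(400):
    model.predict(x, verbose=0)
elapsed = time.time() - start
print(f"{400 / elapsed:.1f} inferences per second")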


Solution 1:[1]

It seems that converting the model to a TensorFlow Lite (tflite) model gives over a 100x speed-up.
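
A minimal sketch of that conversion and of running the TFLite interpreter on a single input, assuming the Keras model is already loaded as model and takes a (1, 2, 7, 6) float32 input:

import numpy as np
import tensorflow as tf

# Convert the in-memory Keras model to a TFLite flatbuffer
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()

# Build an interpreter around the converted model
interpreter = tf.lite.Interpreter(model_content=tflite_model)
interpreter.allocate_tensors()
input_index = interpreter.get_input_details()[0]['index']
# This network has two outputs (policy and value); index 0 is just the first one
output_index = interpreter.get_output_details()[0]['index']

# Per-call overhead is much lower than model.predict() on single inputs
x = np.random.rand(1, 2, 7, 6).astype(np.float32)
interpreter.set_tensor(input_index, x)
interpreter.invoke()
result = interpreter.get_tensor(output_index)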

Solution 2:[2]

It can also be done by using a different toolkit for inference, e.g. OpenVINO. OpenVINO is optimized for Intel hardware, but it should work with any CPU. It optimizes your model by converting it to Intermediate Representation (IR), performing graph pruning, and fusing some operations into others while preserving accuracy. It then uses vectorization at runtime.

It is rather straightforward to convert a Keras model to OpenVINO unless you have fancy custom layers. A full tutorial is available in the OpenVINO documentation; some snippets are below.

Install OpenVINO

The easiest way is with pip. Alternatively, you can use the selector tool on the OpenVINO download page to find the best option for your case.

pip install openvino-dev[tensorflow2]
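
To check that the installation worked, you can list the devices the runtime sees (a small sanity check, not part of the original snippets):

from openvino.runtime import Core

# A working installation should report at least ['CPU']
print(Core().available_devices)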

Save your model as SavedModel

OpenVINO cannot convert an HDF5 model directly, so you have to save it as a SavedModel first.

import tensorflow as tf
from custom_layer import CustomLayer  # only needed if the model uses custom layers

# Load the trained HDF5 model (drop custom_objects if there are no custom layers)
model = tf.keras.models.load_model('model.h5', custom_objects={'CustomLayer': CustomLayer})

# Export in the SavedModel format expected by the Model Optimizer
tf.saved_model.save(model, 'model')

Use Model Optimizer to convert the SavedModel

The Model Optimizer is a command-line tool that comes with the OpenVINO Development Package. It converts the TensorFlow model to IR, the default format for OpenVINO. You can also try FP16 precision, which should give you better performance without a significant accuracy drop (just change data_type). Run in the command line:

mo --saved_model_dir "model" --input_shape "[1, 3, 224, 224]" --data_type FP32 --output_dir "model_ir"
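
Note that the input shape above is for a 224x224 image model; for the network in the question it would be [1, 2, 7, 6]. Assuming that shape, an FP16 conversion only changes data_type:

mo --saved_model_dir "model" --input_shape "[1, 2, 7, 6]" --data_type FP16 --output_dir "model_ir"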

Run the inference

The converted model can be loaded by the runtime and compiled for a specific device, e.g. CPU or GPU (integrated graphics such as Intel HD Graphics).

# Load the network (Core comes from the OpenVINO runtime package)
from openvino.runtime import Core

ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
compiled_model_ir = ie.compile_model(model=model_ir, device_name="CPU")

# Get output layer
output_layer_ir = compiled_model_ir.output(0)

# Run inference on the input image
result = compiled_model_ir([input_image])[output_layer_ir]
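
To compare with the Keras predict() numbers from the question, the same 400-inference loop can be timed against the compiled model (a sketch; the (1, 2, 7, 6) input shape is taken from the model summary above):

import time
import numpy as np

x = np.random.rand(1, 2, 7, 6).astype(np.float32)

start = time.time()
for _ in range(400):
    result = compiled_model_ir([x])[output_layer_ir]
elapsed = time.time() - start
print(f"{400 / elapsed:.1f} inferences per second")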

Disclaimer: I work on OpenVINO.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

[1] Solution 1: Rob
[2] Solution 2: dragon7