Keras/TensorFlow network inference performance
I am using a Keras network on which I call predict() many times, each on a single input. A rough calculation based on the layers gives ~3 MFLOPs per inference, so at the few GFLOP/s my CPU can sustain I should see on the order of 1,000 inferences per second. However, in a test run, 400 predict() calls took 12 seconds, i.e. ~33 inferences per second. The network has only 139k parameters, which fit easily into cache, so it cannot be memory-bandwidth limited. How can I speed this up?
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_1 (InputLayer) [(None, 2, 7, 6)] 0
__________________________________________________________________________________________________
tf.compat.v1.transpose (TFOpLam (None, 7, 6, 2) 0 input_1[0][0]
__________________________________________________________________________________________________
conv2d (Conv2D) (None, 7, 6, 64) 1216 tf.compat.v1.transpose[0][0]
__________________________________________________________________________________________________
dropout (Dropout) (None, 7, 6, 64) 0 conv2d[0][0]
__________________________________________________________________________________________________
conv2d_1 (Conv2D) (None, 7, 6, 32) 18464 dropout[0][0]
__________________________________________________________________________________________________
dropout_1 (Dropout) (None, 7, 6, 32) 0 conv2d_1[0][0]
__________________________________________________________________________________________________
conv2d_2 (Conv2D) (None, 5, 4, 64) 18496 dropout_1[0][0]
__________________________________________________________________________________________________
dropout_2 (Dropout) (None, 5, 4, 64) 0 conv2d_2[0][0]
__________________________________________________________________________________________________
conv2d_3 (Conv2D) (None, 5, 4, 64) 36928 dropout_2[0][0]
__________________________________________________________________________________________________
dropout_3 (Dropout) (None, 5, 4, 64) 0 conv2d_3[0][0]
__________________________________________________________________________________________________
conv2d_4 (Conv2D) (None, 5, 4, 64) 36928 dropout_3[0][0]
__________________________________________________________________________________________________
dropout_4 (Dropout) (None, 5, 4, 64) 0 conv2d_4[0][0]
__________________________________________________________________________________________________
conv2d_5 (Conv2D) (None, 3, 2, 32) 18464 dropout_4[0][0]
__________________________________________________________________________________________________
dropout_5 (Dropout) (None, 3, 2, 32) 0 conv2d_5[0][0]
__________________________________________________________________________________________________
flatten (Flatten) (None, 192) 0 dropout_5[0][0]
__________________________________________________________________________________________________
dense_2 (Dense) (None, 20) 3860 flatten[0][0]
__________________________________________________________________________________________________
dense (Dense) (None, 20) 3860 flatten[0][0]
__________________________________________________________________________________________________
dense_3 (Dense) (None, 20) 420 dense_2[0][0]
__________________________________________________________________________________________________
dense_1 (Dense) (None, 20) 420 dense[0][0]
__________________________________________________________________________________________________
policy (Dense) (None, 7) 147 dense_3[0][0]
__________________________________________________________________________________________________
value (Dense) (None, 1) 21 dense_1[0][0]
==================================================================================================
Total params: 139,224
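A minimal sketch of the kind of timing loop used for the measurement (the loop body and input are illustrative; the zero input is a placeholder with the shape of input_1):

import time
import numpy as np

x = np.zeros((1, 2, 7, 6), dtype=np.float32)  # placeholder, shape of input_1
start = time.perf_counter()
for _ in range(400):
    model.predict(x)  # one single-sample inference per call
print(f"{400 / (time.perf_counter() - start):.1f} inferences/s")  # ~33 here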
Solution 1:
It seems that converting the model to TFLite gives over a 100x speed-up.
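A minimal sketch of that conversion and of per-call inference, assuming `model` is the Keras model from the question (the helper name is illustrative):

import numpy as np
import tensorflow as tf

# Convert the Keras model to TensorFlow Lite (one-time cost).
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()

# Build the interpreter once, outside the prediction loop.
interpreter = tf.lite.Interpreter(model_content=tflite_model)
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

def predict_tflite(x):
    # x has shape (1, 2, 7, 6), matching input_1 above.
    interpreter.set_tensor(input_details[0]['index'], x.astype(np.float32))
    interpreter.invoke()
    # The network has two outputs (policy and value).
    return [interpreter.get_tensor(d['index']) for d in output_details]

Most of the gain comes from doing the setup once and paying only the interpreter's small per-invoke cost in the loop, instead of Keras's per-predict() overhead.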
Solution 2:
It can also be done by using a different toolkit for inference, e.g. OpenVINO. OpenVINO is optimized for Intel hardware, but it should work with any CPU. It optimizes your model by converting it to the Intermediate Representation (IR), performing graph pruning, and fusing some operations into others while preserving accuracy. It then uses vectorization at runtime.
It's rather straightforward to convert a Keras model to OpenVINO unless you have fancy custom layers. A full tutorial can be found in the OpenVINO documentation. Some snippets are below.
Install OpenVINO
The easiest way is via pip. Alternatively, the OpenVINO documentation offers a selector tool to find the best installation method for your case.
pip install openvino-dev[tensorflow2]
Save your model as SavedModel
OpenVINO cannot convert an HDF5 model directly, so you have to save it in the SavedModel format first.
import tensorflow as tf
from custom_layer import CustomLayer  # only needed if the model has custom layers

# Load the HDF5 model and re-save it in the SavedModel format
model = tf.keras.models.load_model('model.h5', custom_objects={'CustomLayer': CustomLayer})
tf.saved_model.save(model, 'model')
Use Model Optimizer to convert the SavedModel
The Model Optimizer is a command-line tool that comes with the OpenVINO Development Package. It converts the TensorFlow model to IR, the default format for OpenVINO. You can also try FP16 precision, which should give you better performance without a significant accuracy drop (just change data_type). Note that --input_shape must match your model; for the network in the question it is "[1, 2, 7, 6]". Run in the command line:
mo --saved_model_dir "model" --input_shape "[1, 2, 7, 6]" --data_type FP32 --output_dir "model_ir"
Run the inference
The converted model can be loaded by the runtime and compiled for a specific device, e.g. CPU or GPU (an integrated GPU such as Intel HD Graphics).
from openvino.runtime import Core

# Load the network
ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
compiled_model_ir = ie.compile_model(model=model_ir, device_name="CPU")
# Get the output layer
output_layer_ir = compiled_model_ir.output(0)
# Run inference on the input
result = compiled_model_ir([input_image])[output_layer_ir]
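To verify the speed-up, the compiled model can be timed in the same single-input loop as the original predict() calls; a minimal sketch (the zero input is a placeholder, with the shape taken from the model summary above):

import time
import numpy as np

x = np.zeros((1, 2, 7, 6), dtype=np.float32)  # placeholder single input
start = time.perf_counter()
for _ in range(400):
    compiled_model_ir([x])[output_layer_ir]
print(f"{400 / (time.perf_counter() - start):.0f} inferences/s")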
Disclaimer: I work on OpenVINO.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Rob |
| Solution 2 | dragon7 |