OnnxRuntime vs OnnxRuntime+OpenVinoEP inference time difference

I'm trying to accelerate my model by converting it to ONNX and running it with ONNX Runtime. However, I'm getting strange results when measuring inference time.

With only 1 iteration, ONNX Runtime's CPUExecutionProvider greatly outperforms OpenVINOExecutionProvider:

  • CPUExecutionProvider - 0.72 seconds
  • OpenVINOExecutionProvider - 4.47 seconds

But if I run, say, 5 iterations, the result is different:

  • CPUExecutionProvider - 3.83 seconds
  • OpenVINOExecutionProvider - 14.13 seconds

And if I run 100 iterations, the result is drastically different:

  • CPUExecutionProvider - 74.19 seconds
  • OpenVINOExecutionProvider - 46.96 seconds

It seems to me that the inference time of the OpenVINO EP does not scale linearly with the number of iterations, but I don't understand why. So my questions are:

  • Why does OpenVINOExecutionProvider behave this way?
  • Which ExecutionProvider should I use?

The code is very basic:

import onnxruntime as rt
import numpy as np
import time
from tqdm import tqdm

limit = 5  # number of inference iterations to time

# Model
device = 'CPU_FP32'
model_file_path = 'road.onnx'

# Dummy input matching the model's expected input shape
image = np.random.rand(1, 3, 512, 512).astype(np.float32)

# OnnxRuntime
sess = rt.InferenceSession(model_file_path, providers=['CPUExecutionProvider'], provider_options=[{'device_type' : device}])
input_name = sess.get_inputs()[0].name

start = time.time()
for i in tqdm(range(limit)):
    out = sess.run(None, {input_name: image})
end = time.time()
inference_time = end - start
print(inference_time)

# OnnxRuntime + OpenVinoEP
sess = rt.InferenceSession(model_file_path, providers=['OpenVINOExecutionProvider'], provider_options=[{'device_type' : device}])
input_name = sess.get_inputs()[0].name

start = time.time()
for i in tqdm(range(limit)):
    out = sess.run(None, {input_name: image})
end = time.time()
inference_time = end - start
print(inference_time)


Solution 1:[1]

Using ONNX Runtime with the OpenVINO Execution Provider lets you run inference on ONNX models through the ONNX Runtime API while the OpenVINO toolkit runs as the backend. Compared to generic acceleration, this speeds up ONNX models on the same Intel® hardware (CPU, GPU, VPU and FPGA).

Generally, the CPU Execution Provider works best for small iteration counts, since it is designed to be lightweight and keep the binary size small. The OpenVINO Execution Provider, on the other hand, is intended for deep learning inference on Intel CPUs, Intel integrated GPUs, and Intel® Movidius™ Vision Processing Units (VPUs).

This is why the OpenVINO Execution Provider outperforms the CPU Execution Provider at larger iteration counts.

You should choose the Execution Provider that fits your own requirements. If you are going to run a complex deep learning model for many iterations, go for the OpenVINO Execution Provider. For a simpler use case, where you need a smaller binary and only run a few iterations, choose the CPU Execution Provider instead.

For more information, you may refer to the ONNX Runtime Performance Tuning documentation.
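
If it helps, here is a minimal sketch (not from the original answer) for checking which execution providers your ONNX Runtime build supports and which one a session actually selected; it assumes the onnxruntime-openvino package is installed and reuses road.onnx from the question.

# Minimal sketch: list the providers available in this ONNX Runtime build and
# verify which one a session actually selected. Assumes the onnxruntime-openvino
# package is installed and reuses 'road.onnx' from the question.
import onnxruntime as rt

print(rt.get_available_providers())  # e.g. ['OpenVINOExecutionProvider', 'CPUExecutionProvider']

sess = rt.InferenceSession(
    'road.onnx',
    providers=['OpenVINOExecutionProvider', 'CPUExecutionProvider'],  # ordered by preference
)
print(sess.get_providers())  # the providers the session ended up using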

Solution 2:[2]

Regarding the non-linear timing: there is likely some preparation happening the first time you run the model with OpenVINO - perhaps the model is compiled for the OpenVINO backend on the first call to sess.run. I observed a similar effect with TFLite. In such scenarios it makes sense to discard the first iteration when benchmarking. There also tends to be quite a bit of variance, so running >10 or ideally >100 iterations is a good idea.
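
As a rough sketch of that benchmarking approach (assuming the same road.onnx model and input shape as in the question), you could warm the session up once and then time each iteration separately:

import time
import numpy as np
import onnxruntime as rt

# Assumes 'road.onnx' and the 1x3x512x512 input from the question.
sess = rt.InferenceSession('road.onnx',
                           providers=['OpenVINOExecutionProvider'],
                           provider_options=[{'device_type': 'CPU_FP32'}])
input_name = sess.get_inputs()[0].name
image = np.random.rand(1, 3, 512, 512).astype(np.float32)

# Warm-up run: the first call may include backend compilation, so exclude it from timing.
sess.run(None, {input_name: image})

# Time each iteration separately and report the median to dampen variance.
timings = []
for _ in range(100):
    start = time.perf_counter()
    sess.run(None, {input_name: image})
    timings.append(time.perf_counter() - start)

print(f"median: {np.median(timings):.4f} s, mean: {np.mean(timings):.4f} s")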

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Rommel_Intel
Solution 2: Wave