Speed up text generation of GPT-2 simple

Hi, I am trying to generate a 20-token text using GPT-2 simple. It takes me around 15 seconds to generate the sentence, while AI Dungeon takes around 4 seconds to generate a sentence of the same size. Is there a way to speed up GPT-2 text generation?



Solution 1:[1]

I think they get quicker results because their program is better optimized and they have more computing power; they pay a lot for servers. Also, AI Dungeon uses GPT-3, which might simply be faster. I'm struggling with the speed of GPT-2 as well. Let me know if you figure anything out. Cheers

Solution 2:[2]

Text generation models like GPT-2 are slow, and it is of course even worse with bigger models like GPT-J and GPT-NeoX.

If you want to speed up your text generation you have a couple of options:

  • Use a GPU. GPT-2 doesn't require much VRAM, so an entry-level GPU will do. On a GPU, generating 20 tokens with GPT-2 shouldn't take more than 1 second (see the first sketch after this list).
  • Quantize your model and convert it to TensorRT. See this good tutorial: https://github.com/NVIDIA/TensorRT/tree/main/demo/HuggingFace/GPT2 (a rough quantization sketch also follows below).
  • Serve it through a dedicated inference server (like TorchServe or Triton Inference Server).
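For reference, here is a minimal sketch of 20-token generation on a GPU using the Hugging Face transformers implementation of GPT-2 (not gpt-2-simple itself). The "gpt2" checkpoint, prompt, and sampling settings are just illustrative assumptions, not something from the original answer:

    # Minimal sketch: 20-token generation with GPT-2 on a GPU via Hugging Face transformers.
    # Assumes torch and transformers are installed; the checkpoint, prompt and sampling
    # settings below are illustrative.
    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    device = "cuda" if torch.cuda.is_available() else "cpu"

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").to(device).eval()

    prompt = "The dungeon door creaks open and"
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=20,                    # 20 new tokens, as in the question
            do_sample=True,
            top_k=50,
            pad_token_id=tokenizer.eos_token_id,  # silences the padding warning
        )

    print(tokenizer.decode(output[0], skip_special_tokens=True))

On a recent GPU this typically finishes in well under a second once the model is loaded; loading the model is a one-time cost you should keep out of the per-request path.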
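The linked tutorial covers the full quantize-and-convert-to-TensorRT path on GPU. As a much rougher, CPU-only illustration of the quantization idea (my own simplification, not what the TensorRT demo does), PyTorch dynamic quantization can be applied to the model's nn.Linear modules. Note that in the Hugging Face GPT-2 implementation most transformer layers use Conv1D rather than nn.Linear, so only the output head gets quantized and the speedup is limited; for real gains, follow the TensorRT tutorial:

    # Rough CPU-only sketch of weight quantization with PyTorch dynamic quantization.
    # This is NOT the TensorRT pipeline from the linked tutorial. Only nn.Linear
    # modules are converted to int8 (in GPT-2 that is mainly the LM head), so the
    # speedup here is modest.
    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    quantized = torch.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )

    inputs = tokenizer("The dungeon door creaks open and", return_tensors="pt")
    with torch.no_grad():
        out = quantized.generate(**inputs, max_new_tokens=20,
                                 pad_token_id=tokenizer.eos_token_id)
    print(tokenizer.decode(out[0], skip_special_tokens=True))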

I actually wrote an article about how to speed up inference of transformer-based models. You might find it helpful: how to speed up deep learning inference

Solution 3:[3]

You can use the OpenVINO-optimized version of the GPT-2 model. The demo can be found here. It should be much faster, as it's heavily optimized.
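If you want to try an OpenVINO build of GPT-2 outside that demo, one possible sketch (assuming the optimum-intel package and its OVModelForCausalLM class, which are not mentioned in the original answer) looks like this:

    # Sketch: running GPT-2 on an OpenVINO backend via Optimum Intel.
    # Assumes `pip install optimum[openvino]`; this illustrates the idea,
    # it is not the exact demo linked above.
    from optimum.intel import OVModelForCausalLM
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    # export=True converts the PyTorch checkpoint to OpenVINO IR on the fly
    model = OVModelForCausalLM.from_pretrained("gpt2", export=True)

    inputs = tokenizer("The dungeon door creaks open and", return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=20,
                         pad_token_id=tokenizer.eos_token_id)
    print(tokenizer.decode(out[0], skip_special_tokens=True))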

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

[1] Solution 1: Michal Aleksandrowicz
[2] Solution 2: Julien Salinas
[3] Solution 3: dragon7