Get alternative suggestions during speech recognition

I would like to use offline speech-to-text recognition, mostly for German.

Specifically, I want to use Mozilla DeepSpeech (a TensorFlow implementation of Baidu's DeepSpeech architecture), but I fear that the quality of the audio input is not good enough to produce a low word error rate (WER).

(English) example:

The speaker said "know" but the engine might have understood "flow" or "show" or "go" or "know".

I would like to get [flow, show, go, know] back from the engine, so that I can manually decide afterwards which suggestion fits best. How can I achieve this?

Do other speech-to-text engines offer this possibility?



Solution 1:[1]

DeepSpeech has updated releases. For better inference results, you need to follow the project's instructions and suggestions: for example, your input audio file should be 16000 Hz, mono, 16-bit. Keep in mind that resampling the audio may affect inference quality. I personally use SoX for resampling, but there are other options, such as samplerate. There are also many good suggestions on the DeepSpeech forum.
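Regarding the alternatives the question asks for: the DeepSpeech Python API (around version 0.9.x) exposes sttWithMetadata, which can return several candidate transcripts instead of only the best one. A minimal sketch, assuming a 16 kHz, mono, 16-bit WAV file and the 0.9.3 model and scorer files (all file names here are placeholders for whatever release you downloaded):

```python
import wave

import numpy as np
import deepspeech

# Load the acoustic model and (optionally) the external scorer.
model = deepspeech.Model("deepspeech-0.9.3-models.pbmm")
model.enableExternalScorer("deepspeech-0.9.3-models.scorer")

# Read a 16 kHz, mono, 16-bit PCM WAV file into an int16 buffer.
with wave.open("recording.wav", "rb") as wav:
    assert wav.getframerate() == 16000 and wav.getnchannels() == 1
    audio = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)

# Ask for several candidate transcripts instead of just the single best one.
metadata = model.sttWithMetadata(audio, num_results=5)
for candidate in metadata.transcripts:
    text = "".join(token.text for token in candidate.tokens)
    print(f"{candidate.confidence:.2f}  {text}")
```

Each candidate transcript also carries per-token timing information, which can help when you review the alternatives manually. For German you would need a German-trained model instead of the English release files assumed above.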

There is also a Python library called SpeechRecognition. It wraps several offline engines and online API services for speech to text.
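As a rough illustration of that library (not a definitive setup): the offline PocketSphinx backend works locally, and the online Google Web Speech backend can return a list of alternative transcripts when called with show_all=True. The file name below is a placeholder, and German support for PocketSphinx requires installing a German model separately:

```python
import speech_recognition as sr

recognizer = sr.Recognizer()

# Load audio from a WAV file (file name is a placeholder).
with sr.AudioFile("recording.wav") as source:
    audio = recognizer.record(source)

# Offline: PocketSphinx. The library ships only en-US; "de-DE" works only
# if you have added a German model to pocketsphinx's model directory.
try:
    print("Sphinx:", recognizer.recognize_sphinx(audio, language="de-DE"))
except sr.UnknownValueError:
    print("Sphinx could not understand the audio")

# Online: Google Web Speech API. With show_all=True the raw response is
# returned, which contains a list of alternative transcripts.
result = recognizer.recognize_google(audio, language="de-DE", show_all=True)
if isinstance(result, dict):
    for alternative in result.get("alternative", []):
        print(alternative.get("transcript"), alternative.get("confidence"))
```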

Solution 2:[2]

You could use .NET speech recognition: https://docs.microsoft.com/en-us/dotnet/api/system.speech.recognition?view=netframework-4.8.

Just note that .NET speech recognition only works properly if you set up a grammar for the recognizer (the rules describing what the speaker may say).

Have a look at the Alternates or Homophones properties of the RecognitionResult object.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 erolrecep
Solution 2 Wudfulstan