'Does "precision" of audio files have importance during training ASR systems?

I am resampling audio files from 8 kHz to 16 kHz with torchaudio.

An example of an original file:

Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 8000 Hz, 1 channels, s16, 128 kb/s

After resampling it's become:

Stream #0:0: Audio: pcm_f32le ([3][0][0][0] / 0x0003), 16000 Hz, 1 channels, flt, 512 kb/s

So the precision (sample format) has changed to pcm_f32le, i.e. 32-bit float.
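
For reference, a minimal sketch of this kind of resampling step, assuming torchaudio.functional.resample and the default torchaudio.save behavior (file names are placeholders):

import torchaudio
import torchaudio.functional as F

# Load the original 8 kHz file; torchaudio returns a normalized float32 tensor.
waveform, sample_rate = torchaudio.load("input_8k.wav")

# Resample from 8 kHz to 16 kHz.
resampled = F.resample(waveform, orig_freq=sample_rate, new_freq=16000)

# Saving the float tensor without specifying an encoding writes a 32-bit float
# WAV (pcm_f32le), which matches the stream info shown above.
torchaudio.save("output_16k.wav", resampled, 16000)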

I'd like to know whether this matters when training ASR systems.



Solution 1:[1]

Actually, Kaldi's documentation says "Support only KSDATAFORMAT_SUBTYPE_PCM for now." That makes pcm_f32le (which is of the KSDATAFORMAT_SUBTYPE_IEEE_FLOAT subtype) incompatible, so save in an integer PCM format:

torchaudio.save(path, waveform, sample_rate, encoding="PCM_S", bits_per_sample=16)

And if you want to increase audio precision, do so only by increasing bits_per_sample while keeping the PCM_S encoding, as in the sketch below.
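
For example, a sketch of the full pipeline saving signed 16-bit PCM, plus an optional higher bit depth, using the same placeholder file names as above:

import torchaudio
import torchaudio.functional as F

waveform, sr = torchaudio.load("input_8k.wav")
resampled = F.resample(waveform, orig_freq=sr, new_freq=16000)

# Signed 16-bit PCM (pcm_s16le), i.e. the KSDATAFORMAT_SUBTYPE_PCM subtype.
torchaudio.save("output_16k.wav", resampled, 16000,
                encoding="PCM_S", bits_per_sample=16)

# More precision while staying in integer PCM; check that your toolkit
# accepts this bit depth before committing to it.
torchaudio.save("output_16k_24bit.wav", resampled, 16000,
                encoding="PCM_S", bits_per_sample=24)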

As for your actual question, it most likely depends on your dataset, so perhaps try both ways and pick the better-performing one.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1