Convert a .npy file to wav following Tacotron2 training

I am training the Tacotron2 model using TensorFlowTTS for a new language. I managed to train the model (performed pre-processing and normalization, and decoded the few generated output files). The files in the output directory are .npy files, which makes sense as they are mel-spectrograms. I am trying to find a way to convert these files to .wav files in order to check whether my work has been fruitful.

I used this:

 melspectrogram = librosa.feature.melspectrogram(
    "/content/prediction/tacotron2-0/paol_wavpaol_8-norm-feats.npy", sr=22050, 
    window=scipy.signal.hanning, n_fft=1024, hop_length=256)

 print('melspectrogram.shape', melspectrogram.shape)
 print(melspectrogram)

 audio_signal = librosa.feature.inverse.mel_to_audio(
       melspectrogram, sr=22050, n_fft=1024, hop_length=256, window=scipy.signal.hanning)
 print(audio_signal, audio_signal.shape)

 sf.write('test.wav', audio_signal, 22050)

But it gives me this error: "Audio data must be of type numpy.ndarray", although I am already giving it a numpy.ndarray file. Does anyone know where the issue might be, or know a better way to do this?



Solution 1:[1]

I'm not sure what your error is, but the output of a Tacotron 2 system is a sequence of log Mel spectral features, and you can't just apply the inverse Fourier transform to get a waveform: the phase information is missing, and the Mel features themselves are not invertible. You can learn why at places like Speech.Zone (https://speech.zone/courses/).
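For what it's worth, the error message most likely arises because librosa.feature.melspectrogram expects a waveform array and is being handed a file path instead; the .npy file already holds the Mel features, so it only needs to be loaded with np.load. The sketch below illustrates both points using NumPy only. The file name, the 120 x 80 array shape, and the crude linearly spaced triangular filterbank are all assumptions made for illustration (a real Mel filterbank is spaced on the Mel scale), so treat it as a rough demonstration rather than a reproduction of your pipeline.

```python
import os
import tempfile

import numpy as np

# Synthetic stand-in for a Tacotron 2 output file: 120 frames x 80 Mel bins.
# Both numbers are assumptions; check your own training config.
rng = np.random.default_rng(0)
fake_mel = rng.standard_normal((120, 80)).astype(np.float32)
path = os.path.join(tempfile.mkdtemp(), "example-norm-feats.npy")
np.save(path, fake_mel)

# Load the array from disk. A path string is not audio data, so it must not
# be passed directly to functions that expect a numpy.ndarray.
mel = np.load(path)
print(type(mel).__name__, mel.shape)

# Why the features are not invertible: a filterbank that maps 513 linear
# frequency bins (n_fft=1024) onto 80 bands is a wide matrix, so its rank
# is at most 80 and many different spectra give the same Mel features.
n_freq, n_mels = 1024 // 2 + 1, 80
centers = np.linspace(0, n_freq - 1, n_mels + 2)
bins = np.arange(n_freq)
filterbank = np.zeros((n_mels, n_freq))
for i in range(n_mels):
    left, mid, right = centers[i], centers[i + 1], centers[i + 2]
    rising = (bins - left) / (mid - left)
    falling = (right - bins) / (right - mid)
    # Triangle: the smaller of the two ramps, floored at zero.
    filterbank[i] = np.clip(np.minimum(rising, falling), 0.0, None)

rank = np.linalg.matrix_rank(filterbank)
print("filterbank shape:", filterbank.shape, "rank:", rank)
```

Even with a perfect inverse of the logarithm and of any normalization, the 80 values per frame cannot pin down 513 magnitudes, and the phase is absent entirely.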

Instead of using librosa like you are doing, you need to use a vocoder like HiFi-GAN (https://github.com/jik876/hifi-gan) that is trained to reconstruct a waveform from log Mel spectral features. You can use a pre-trained model, and most off-the-shelf vocoders will do. Just make sure that the sample rate, Mel range, FFT size, hop size and window size are all the same between your Tacotron2 feature prediction network and whatever vocoder you choose, otherwise you'll just get noise!
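One way to catch such mismatches before running the vocoder is to compare the two sets of feature settings directly. The following is a minimal sketch: the key names and values are hypothetical placeholders, not taken from any actual TensorFlowTTS or HiFi-GAN config file, so map them onto whatever your own configs call these parameters.

```python
# Hypothetical feature settings for the acoustic model and the vocoder.
# Key names and values are placeholders for illustration only.
acoustic_cfg = {
    "sampling_rate": 22050,
    "num_mels": 80,
    "fft_size": 1024,
    "hop_size": 256,
    "win_length": 1024,
    "fmin": 80,
    "fmax": 7600,
}
vocoder_cfg = {
    "sampling_rate": 22050,
    "num_mels": 80,
    "fft_size": 1024,
    "hop_size": 256,
    "win_length": 1024,
    "fmin": 0,
    "fmax": 8000,
}

KEYS = ("sampling_rate", "num_mels", "fft_size", "hop_size",
        "win_length", "fmin", "fmax")


def find_mismatches(a, b, keys):
    """Return {key: (value_in_a, value_in_b)} for every key that differs."""
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


mismatches = find_mismatches(acoustic_cfg, vocoder_cfg, KEYS)
for key, (ours, theirs) in mismatches.items():
    print(f"{key}: acoustic model uses {ours}, vocoder expects {theirs}")
```

With these placeholder values the check flags the Mel range (fmin and fmax), which is the kind of difference that is easy to miss and enough to turn the vocoder output into noise.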

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 A. Pine