Hi, I'm trying to fine-tune a custom FastPitch model provided by AI4Bharat on the OpenSLR dataset. The model appears to have been trained on 4 speakers, and the speaker IDs of those 4 speakers are shipped as a .pth file: they provide best_model.pth, speakers.pth, and an IndicTTS config file. With these files I'm able to fine-tune the model using the Coqui-ai implementation.
However, I would like to use my own speakers instead of the available speaker embeddings. My question is: which of the following is the right approach?
1. Feed my custom multi-speaker audio directly into the Coqui-ai implementation to derive only the speaker embeddings, and then generate the output speech using a selected speaker_id.
2. Train a speaker embedding model as a standalone module and inject the speaker embeddings externally.
3. Re-train a single model from scratch so it learns both the acoustic and the speaker information, keeping the speaker_id information in a separate .json file that is passed at synthesis time, as shown below (a rough config sketch for this option follows the command).
$ tts --text "Text for TTS" --out_path output/path/speech.wav --model_path path/to/model.pth --config_path path/to/config.json --speakers_file_path path/to/speaker.json --speaker_idx <speaker_id>
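For context, here is roughly what I have in mind for option 3: a minimal fine-tuning sketch patterned on Coqui's multi-speaker FastPitch recipe (recipes/vctk/fast_pitch/train_fast_pitch.py). The dataset path, formatter, cleaner settings, and restore checkpoint are placeholders for my OpenSLR data, and some class/argument names may differ slightly between TTS versions.

```python
# Minimal multi-speaker FastPitch fine-tuning sketch, patterned on Coqui's
# recipes/vctk/fast_pitch/train_fast_pitch.py. Paths, the formatter, and the
# restore checkpoint are placeholders; API names may vary across TTS versions.
import os

from trainer import Trainer, TrainerArgs

from TTS.config import BaseDatasetConfig
from TTS.tts.configs.fast_pitch_config import FastPitchConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.forward_tts import ForwardTTS
from TTS.tts.utils.speakers import SpeakerManager
from TTS.tts.utils.text.tokenizer import TTSTokenizer
from TTS.utils.audio import AudioProcessor

output_path = "output"

# Placeholder: point this at the multi-speaker OpenSLR data and use a formatter
# (or a custom one) that yields a "speaker_name" for every sample.
dataset_config = BaseDatasetConfig(
    formatter="vctk",            # placeholder formatter
    meta_file_train="",
    path="path/to/openslr_dataset",
)

config = FastPitchConfig(
    run_name="fastpitch_custom_speakers",
    batch_size=16,
    eval_batch_size=8,
    compute_f0=True,
    f0_cache_path=os.path.join(output_path, "f0_cache"),
    run_eval=True,
    epochs=1000,
    use_phonemes=False,          # adjust cleaner/phoneme settings for the target language
    text_cleaner="basic_cleaners",
    output_path=output_path,
    datasets=[dataset_config],
    use_speaker_embedding=True,  # learn one embedding per speaker ID
)

ap = AudioProcessor.init_from_config(config)
tokenizer, config = TTSTokenizer.init_from_config(config)

train_samples, eval_samples = load_tts_samples(dataset_config, eval_split=True)

# The SpeakerManager collects my own speaker names from the data; the resulting
# name -> id mapping is what gets passed later to --speakers_file_path
# (a JSON along the lines of {"speaker_a": 0, "speaker_b": 1, ...}).
speaker_manager = SpeakerManager()
speaker_manager.set_ids_from_data(train_samples + eval_samples, parse_key="speaker_name")
config.model_args.num_speakers = speaker_manager.num_speakers

model = ForwardTTS(config, ap, tokenizer, speaker_manager=speaker_manager)

trainer = Trainer(
    TrainerArgs(),  # e.g. TrainerArgs(restore_path="path/to/best_model.pth") to fine-tune
    config,
    output_path,
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
)
trainer.fit()
```

For option 2, if I understand correctly, the config can instead consume precomputed d-vectors via use_d_vector_file / d_vector_file rather than a learned speaker embedding layer, but I'm not sure which route is preferred here.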