Hi, I'm trying to fine-tune a custom FastPitch model provided by AI4Bharat on the OpenSLR dataset. The model appears to have been trained on 4 speakers, and the speaker IDs of those 4 speakers are shipped as a .pth file: they provide best_model.pth, speakers.pth, and an IndicTTS config file. With these files I'm able to fine-tune the model using the Coqui-ai implementation.
However, I would like to use my own speakers instead of the available speaker embeddings. My question is: which of the following is the right approach?
1. Feed my custom multi-speaker audio directly into the Coqui-ai implementation to derive only the speaker embeddings, and then generate the output speech using a selected speaker_id.
2. Train a speaker embedding model as a standalone module and inject the speaker embeddings externally.
3. Re-train a single model from scratch so it learns both the acoustic and the speaker information, keeping the speaker_id information in a separate .json file that is passed at synthesis time, as shown below (a rough config sketch for this option follows the command).
$ tts --text "Text for TTS" --out_path output/path/speech.wav --model_path path/to/model.pth --config_path path/to/config.json --speakers_file_path path/to/speaker.json --speaker_idx <speaker_id>
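For context, here is roughly what I have in mind for option 3: a minimal fine-tuning sketch patterned on Coqui's multi-speaker FastPitch recipe (recipes/vctk/fast_pitch/train_fast_pitch.py). The dataset path, formatter, cleaner settings, and restore checkpoint are placeholders for my OpenSLR data, and some class/argument names may differ slightly between TTS versions.

```python
# Minimal multi-speaker FastPitch fine-tuning sketch, patterned on Coqui's
# recipes/vctk/fast_pitch/train_fast_pitch.py. Paths, the formatter, and the
# restore checkpoint are placeholders; API names may vary across TTS versions.
import os

from trainer import Trainer, TrainerArgs

from TTS.config import BaseDatasetConfig
from TTS.tts.configs.fast_pitch_config import FastPitchConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.forward_tts import ForwardTTS
from TTS.tts.utils.speakers import SpeakerManager
from TTS.tts.utils.text.tokenizer import TTSTokenizer
from TTS.utils.audio import AudioProcessor

output_path = "output"

# Placeholder: point this at the multi-speaker OpenSLR data and use a formatter
# (or a custom one) that yields a "speaker_name" for every sample.
dataset_config = BaseDatasetConfig(
    formatter="vctk",            # placeholder formatter
    meta_file_train="",
    path="path/to/openslr_dataset",
)

config = FastPitchConfig(
    run_name="fastpitch_custom_speakers",
    batch_size=16,
    eval_batch_size=8,
    compute_f0=True,
    f0_cache_path=os.path.join(output_path, "f0_cache"),
    run_eval=True,
    epochs=1000,
    use_phonemes=False,          # adjust cleaner/phoneme settings for the target language
    text_cleaner="basic_cleaners",
    output_path=output_path,
    datasets=[dataset_config],
    use_speaker_embedding=True,  # learn one embedding per speaker ID
)

ap = AudioProcessor.init_from_config(config)
tokenizer, config = TTSTokenizer.init_from_config(config)

train_samples, eval_samples = load_tts_samples(dataset_config, eval_split=True)

# The SpeakerManager collects my own speaker names from the data; the resulting
# name -> id mapping is what gets passed later to --speakers_file_path
# (a JSON along the lines of {"speaker_a": 0, "speaker_b": 1, ...}).
speaker_manager = SpeakerManager()
speaker_manager.set_ids_from_data(train_samples + eval_samples, parse_key="speaker_name")
config.model_args.num_speakers = speaker_manager.num_speakers

model = ForwardTTS(config, ap, tokenizer, speaker_manager=speaker_manager)

trainer = Trainer(
    TrainerArgs(),  # e.g. TrainerArgs(restore_path="path/to/best_model.pth") to fine-tune
    config,
    output_path,
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
)
trainer.fit()
```

For option 2, if I understand correctly, the config can instead consume precomputed d-vectors via use_d_vector_file / d_vector_file rather than a learned speaker embedding layer, but I'm not sure which route is preferred here.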