PromotiCon
Text-to-speech (TTS) technologies have recently expanded to incorporate natural language prompts for user-friendly control of speech styles, driven by significant advancements in language models. Traditional prompt-based TTS research, however, typically requires large-scale prompt collection, which often necessitates costly human annotation. To address this challenge, we propose PromotiCon, a model that controls emotion in speech using prompts generated without human annotation. Our model draws on a large pool of prompts generated by a large language model. We also propose an emotion distance-based prompt-speech matching method that pairs each generated prompt with the speech data it most closely resembles. To enhance speaker adaptation, we adopt a semi-supervised approach that allows multi-speaker data without emotion labels to be used jointly in training. As a result, our model enables zero-shot emotional speech synthesis. Our experimental results confirm the effectiveness of our approach.
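To make the idea of emotion distance-based prompt-speech matching more concrete, the following is a minimal sketch rather than the method used in PromotiCon. It assumes that prompts and speech clips can both be mapped to points in a shared emotion space and that closeness is measured with Euclidean distance. The 2-D coordinates, the function name `match_prompts_to_speech`, and the choice of distance are all hypothetical and chosen only for illustration.

```python
import math


def match_prompts_to_speech(prompt_vecs, speech_vecs):
    """For each prompt vector, return the index of the nearest speech vector.

    Assumption: both inputs live in the same emotion space, and Euclidean
    distance is a reasonable stand-in for "emotion distance".
    """
    if not speech_vecs:
        raise ValueError("speech_vecs must not be empty")
    matches = []
    for p in prompt_vecs:
        distances = [math.dist(p, s) for s in speech_vecs]
        # Index of the smallest distance, i.e. the closest speech clip.
        matches.append(min(range(len(distances)), key=distances.__getitem__))
    return matches


# Hypothetical 2-D emotion coordinates, made up for this example.
prompts = {
    "full of anger": (0.9, -0.8),
    "quietly sorrowful": (-0.7, -0.6),
}
speech = [(-0.6, -0.7), (0.8, -0.9), (0.1, 0.0)]

matched = match_prompts_to_speech(list(prompts.values()), speech)
for text, idx in zip(prompts, matched):
    print(f"{text!r} -> speech clip {idx}")
```

In this toy setup each prompt is simply assigned its nearest clip. A real system would need an emotion representation for both modalities, and how that is obtained is not specified on this page.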
Emotional Speech Database (ESD)
Target Speaker | Prompt | Target Speech | Synthesized
---|---|---|---
0011 | (Neutral) His speech is flat, free from any emotional undulations. | GT, Vocoded | PromotiCon, Mellotron, Fs2+EDM
0011 | (Angry) He responds with a rapid, high-pitched tone, full of anger. | GT, Vocoded | PromotiCon, Mellotron, Fs2+EDM
0011 | (Happy) He's absolutely thrilled, his excitement contagious. | GT, Vocoded | PromotiCon, Mellotron, Fs2+EDM
0011 | (Sad) He articulates with a subdued grief, a low, sorrowful tone. | GT, Vocoded | PromotiCon, Mellotron, Fs2+EDM
0011 | (Surprise) His speech has a spirited rhythm, lively and full of wonder. | GT, Vocoded | PromotiCon, Mellotron, Fs2+EDM
Target Speaker | Prompt | Target Speech | Synthesized
---|---|---|---
0016 | (Neutral) She maintains a steady, neutral voice. | GT, Vocoded | PromotiCon, Mellotron, Fs2+EDM
0016 | (Angry) Her voice is gruff, each word heavier with annoyance. | GT, Vocoded | PromotiCon, Mellotron, Fs2+EDM
0016 | (Happy) She sounds elated, her voice bright and lively. | GT, Vocoded | PromotiCon, Mellotron, Fs2+EDM
0016 | (Sad) She speaks with a slow, deepening sadness. | GT, Vocoded | PromotiCon, Mellotron, Fs2+EDM
0016 | (Surprise) She speaks in a higher register, her voice reflecting her amazement. | GT, Vocoded | PromotiCon, Mellotron, Fs2+EDM
All speakers are unseen during training
Target Speaker | Prompt | Synthesized
---|---|---
8230 | (Neutral) He speaks in an even, calm tone. | PromotiCon, Fs2+EDM
5105 | (Angry) He speaks with a tone of indignation, his annoyance evident. | PromotiCon, Fs2+EDM
5683 | (Happy) Her voice is radiant, reflecting her inner joy. | PromotiCon, Fs2+EDM
237 | (Sad) She conveys her grief in a tone that's quietly sorrowful. | PromotiCon, Fs2+EDM
3575 | (Surprise) She speaks with a quick, high tone, her surprise is clear and unmistakable. | PromotiCon, Fs2+EDM