PromotiCon Demo

PromotiCon: Prompt-based Emotion Controllable Text-to-Speech via Prompt Generation and Matching

 

Figure: Overall framework of PromotiCon


Abstract

 

Text-to-speech (TTS) technologies have recently expanded to incorporate natural language prompts for user-friendly control of speech styles, driven by significant advancements in language models. Traditional prompt-based TTS research, however, typically requires large-scale prompt generation that often necessitates costly human annotations. To address this challenge, we propose PromotiCon, a model that controls emotions in speech using prompts generated without human annotations. Our model draws on abundant prompts generated by a large language model. Additionally, we propose an emotion distance-based prompt-speech matching method that pairs each generated prompt with the most closely matching speech data. To enhance speaker adaptation, we use a semi-supervised approach that allows multi-speaker data without emotion labels to be used jointly in training. As a result, our model facilitates zero-shot emotional speech synthesis. Our experimental results confirm the effectiveness of our approach.
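The idea of emotion distance-based prompt-speech matching can be illustrated with a small sketch. The abstract does not state which emotion features or distance measure are used, so everything below is an assumption made for illustration: prompts and speech samples are each represented by a hypothetical two-dimensional emotion vector, the distance is Euclidean, and the function name `match_prompts_to_speech` is invented here. It is not the authors' implementation.

```python
import math


def match_prompts_to_speech(prompt_vecs, speech_vecs):
    """Return, for each prompt, the index of the closest speech sample.

    Both arguments are sequences of equal-length numeric vectors.
    Euclidean distance is an assumption; the source does not name the metric.
    """
    matches = []
    for p in prompt_vecs:
        # Distance from this prompt to every speech sample.
        distances = [math.dist(p, s) for s in speech_vecs]
        # Index of the speech sample with the smallest distance.
        best = min(range(len(speech_vecs)), key=distances.__getitem__)
        matches.append(best)
    return matches


# Hypothetical emotion vectors; the values are made up for this example.
prompt_vecs = [(0.9, -0.8), (0.8, 0.9), (-0.7, -0.6)]
speech_vecs = [(0.7, 0.8), (-0.5, -0.7), (0.95, -0.6)]

matches = match_prompts_to_speech(prompt_vecs, speech_vecs)
print(matches)  # [2, 0, 1]
```

In this toy setup each prompt is paired with the speech sample whose emotion vector lies nearest to it. A real system would need emotion representations for both text prompts and speech in a shared space.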

 

Emotion Control

Emotional Speech Dataset (ESD)

Target Speaker Prompt Target Speech Synthesized

0011

(Neutral) His speech is flat, free from any emotional undulations.

GT

Vocoded

PromotiCon

Mellotron

Fs2+EDM

(Angry) He responds with a rapid, high-pitched tone, full of anger.

GT

Vocoded

PromotiCon

Mellotron

Fs2+EDM

(Happy) He's absolutely thrilled, his excitement contagious.

GT

Vocoded

PromotiCon

Mellotron

Fs2+EDM

(Sad) He articulates with a subdued grief, a low, sorrowful tone.

GT

Vocoded

PromotiCon

Mellotron

Fs2+EDM

(Surprise) His speech has a spirited rhythm, lively and full of wonder.

GT

Vocoded

PromotiCon

Mellotron

Fs2+EDM

Target Speaker Prompt Target Speech Synthesized

0016

(Neutral) She maintains a steady, neutral voice.

GT

Vocoded

PromotiCon

Mellotron

Fs2+EDM

(Angry) Her voice is gruff, each word heavier with annoyance.

GT

Vocoded

PromotiCon

Mellotron

Fs2+EDM

(Happy) She sounds elated, her voice bright and lively.

GT

Vocoded

PromotiCon

Mellotron

Fs2+EDM

(Sad) She speaks with a slow, deepening sadness.

GT

Vocoded

PromotiCon

Mellotron

Fs2+EDM

(Surprise) She speaks in a higher register, her voice reflecting her amazement.

GT

Vocoded

PromotiCon

Mellotron

Fs2+EDM

Zero-shot (Libri test-clean)

All speakers are unseen during training

Target Speaker Prompt Synthesized

8230

(Neutral) He speaks in an even, calm tone.

PromotiCon

Fs2+EDM

5105

(Angry) He speaks with a tone of indignation, his annoyance evident.

PromotiCon

Fs2+EDM

5683

(Happy) Her voice is radiant, reflecting her inner joy.

PromotiCon

Fs2+EDM

237

(Sad) She conveys her grief in a tone that's quietly sorrowful.

PromotiCon

Fs2+EDM

3575

(Surprise) She speaks with a quick, high tone; her surprise is clear and unmistakable.

PromotiCon

Fs2+EDM