PromotiCon Demo

Abstract

Text-to-speech (TTS) technologies have recently expanded to incorporate natural language prompts for user-friendly control of speech styles, driven by significant advancements in language models. Traditional prompt-based TTS research, however, typically requires large-scale prompt collection, which often depends on costly human annotation. To address this challenge, we propose PromotiCon, a model that controls emotion in speech using prompts generated without human annotation. Our model draws on a large pool of prompts generated by a large language model. Additionally, we propose an emotion distance-based prompt-speech matching method that pairs each generated prompt with the speech data it most closely resembles. To enhance speaker adaptation, we use a semi-supervised approach that allows multi-speaker data without emotion labels to be used jointly in training. As a result, our model facilitates zero-shot emotional speech synthesis. Our experimental results confirm the effectiveness of our approach.
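
The abstract does not say how the emotion distance is computed, so the following is only a minimal sketch of the matching idea, not the authors' implementation. It assumes that every generated prompt and every speech utterance has already been mapped to an emotion vector in a shared space, and it uses Euclidean distance as a stand-in metric. The function name `match_prompts_to_speech` and the toy two-dimensional vectors are hypothetical.

```python
import math


def match_prompts_to_speech(prompt_vecs, speech_vecs):
    """Pair each prompt with the utterance whose emotion vector is nearest.

    Returns a list of (prompt_index, speech_index, distance) tuples.
    """
    pairs = []
    for p_idx, p_vec in enumerate(prompt_vecs):
        # Assumption: Euclidean distance; the source does not name the metric.
        distances = [math.dist(p_vec, s_vec) for s_vec in speech_vecs]
        # Index of the utterance with the smallest distance to this prompt.
        best = min(range(len(distances)), key=distances.__getitem__)
        pairs.append((p_idx, best, distances[best]))
    return pairs


# Hypothetical 2-D emotion vectors, made up purely for illustration.
prompts = [(0.9, -0.8), (0.1, 0.0)]
speech = [(0.0, 0.1), (0.8, -0.7), (0.7, 0.9)]
pairs = match_prompts_to_speech(prompts, speech)
```

With these toy values, the first prompt is paired with the second utterance and the second prompt with the first utterance. A real system would replace the hand-written vectors with outputs of whatever emotion representation the model actually uses.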

Emotion Control

Emotion Speech Dataset (ESD)

Target Speaker	Prompt	Target Speech (GT)	Target Speech (Vocoded)	Synthesized (PromotiCon)	Synthesized (Mellotron)	Synthesized (Fs2+EDM)
0011	(Neutral) His speech is flat, free from any emotional undulations.	GT	Vocoded	PromotiCon	Mellotron	Fs2+EDM
	(Angry) He responds with a rapid, high-pitched tone, full of anger.	GT	Vocoded	PromotiCon	Mellotron	Fs2+EDM
	(Happy) He's absolutely thrilled, his excitement contagious.	GT	Vocoded	PromotiCon	Mellotron	Fs2+EDM
	(Sad) He articulates with a subdued grief, a low, sorrowful tone.	GT	Vocoded	PromotiCon	Mellotron	Fs2+EDM
	(Surprise) His speech has a spirited rhythm, lively and full of wonder.	GT	Vocoded	PromotiCon	Mellotron	Fs2+EDM

Target Speaker	Prompt	Target Speech (GT)	Target Speech (Vocoded)	Synthesized (PromotiCon)	Synthesized (Mellotron)	Synthesized (Fs2+EDM)
0016	(Neutral) She maintains a steady, neutral voice.	GT	Vocoded	PromotiCon	Mellotron	Fs2+EDM
	(Angry) Her voice is gruff, each word heavier with annoyance.	GT	Vocoded	PromotiCon	Mellotron	Fs2+EDM
	(Happy) She sounds elated, her voice bright and lively.	GT	Vocoded	PromotiCon	Mellotron	Fs2+EDM
	(Sad) She speaks with a slow, deepening sadness.	GT	Vocoded	PromotiCon	Mellotron	Fs2+EDM
	(Surprise) She speaks in a higher register, her voice reflecting her amazement.	GT	Vocoded	PromotiCon	Mellotron	Fs2+EDM

Zero-shot (Libri test-clean)

All speakers are unseen during training

Target Speaker	Prompt	Synthesized (PromotiCon)	Synthesized (Fs2+EDM)
8230	(Neutral) He speaks in an even, calm tone.	PromotiCon	Fs2+EDM
5105	(Angry) He speaks with a tone of indignation, his annoyance evident.	PromotiCon	Fs2+EDM
5683	(Happy) Her voice is radiant, reflecting her inner joy.	PromotiCon	Fs2+EDM
237	(Sad) She conveys her grief in a tone that's quietly sorrowful.	PromotiCon	Fs2+EDM
3575	(Surprise) She speaks with a quick, high tone, her surprise is clear and unmistakable.	PromotiCon	Fs2+EDM