Audio samples from "Emotional Voice Conversion with Semi-Supervised Generative Modeling"

Authors: Hai Zhu, Huayi Zhan, Hong Cheng, Ying Wu

Abstract:
Emotional Voice Conversion (EVC) is a task that aims to convert the emotional state of speech from one to another while preserving the linguistic information and identity of the speaker. Previous works have been limited to converting parallel and labelled emotional speech data, which is not widely available in real-life applications. To solve this problem, we propose SGEVC, a novel semi-supervised generative model for emotional voice conversion. The proposed method constructs a continuous latent space of disentangled representations of linguistic information, speaker identity, and emotion attributes. In addition, we introduce the TTS (Text-to-encoder) into EVC to guide linguistic information and design the SGEVC framework in an end-to-end manner. We show that as little as 1% supervised data (20 minutes) is enough to perform emotional voice conversion. Experimental results show that our proposed model achieves extraordinary performance and consistently outperforms EVC baseline frameworks.

We conduct experiments on the Emotional Speech Dataset (ESD). The dataset consists of 350 parallel utterances with an average duration of 2.9 seconds, recorded by 10 native Mandarin speakers and 10 native English speakers. For each speaker, the corpus covers five emotions: happy, sad, neutral, angry, and surprised. In this paper, we only consider the Mandarin speakers in ESD. For each speaker, we conduct emotion conversion from neutral to happy (N2H), neutral to angry (N2A), neutral to sad (N2S1), and neutral to surprised (N2S2). For each pair, we split the corpus into a training set (330 samples) and a test set (20 samples). To make sure that our proposed model is trained under non-parallel conditions, we randomly shuffle the training set and construct non-parallel utterances for each training batch.
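A minimal sketch of the split and the non-parallel batch construction described above, under stated assumptions. The text only says the training set is randomly shuffled; the choice of which 20 utterances are held out, the batch size, the function names, and the rotation used to guarantee that no source/target pair shares the same sentence are all hypothetical and are not taken from the paper.

```python
import random


def split_ids(num_utterances=350, num_test=20):
    """Split utterance indices into train and test lists.

    Assumption: the last `num_test` utterances are held out. The source
    only gives the sizes (330 / 20), not which utterances are used.
    """
    ids = list(range(num_utterances))
    return ids[:-num_test], ids[-num_test:]


def non_parallel_batches(train_ids, batch_size=10, rng=None):
    """Yield batches of (source_id, target_id) pairs with source_id != target_id.

    Assumption: the source order is shuffled, and the target order is the
    same shuffled list rotated by a random non-zero offset, so a neutral
    utterance is never paired with the emotional reading of the same text.
    """
    rng = rng or random.Random()
    src = list(train_ids)
    rng.shuffle(src)
    shift = rng.randrange(1, len(src))  # non-zero rotation, needs >= 2 items
    tgt = src[shift:] + src[:shift]
    pairs = list(zip(src, tgt))
    for start in range(0, len(pairs), batch_size):
        yield pairs[start:start + batch_size]


train_ids, test_ids = split_ids()
batches = list(non_parallel_batches(train_ids, rng=random.Random(0)))
```

Calling `non_parallel_batches` again at the start of each epoch draws a fresh shuffle and offset, so the pairing changes from epoch to epoch.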

 

Emotional voice conversion

neutral-to-happy

Source | Target | StarGAN | PPG | SGEVC1 | SGEVC10

neutral-to-angry

Source | Target | StarGAN | PPG | SGEVC1 | SGEVC10

neutral-to-sad

Source | Target | StarGAN | PPG | SGEVC1 | SGEVC10

neutral-to-surprised

Source | Target | StarGAN | PPG | SGEVC1 | SGEVC10