- The document describes a framework for emotional voice conversion using VAW-GAN that can disentangle and recompose emotional elements in speech. It proposes using VAW-GAN with continuous wavelet transform to model prosody and decompose fundamental frequency into different time scales. Conditioning the decoder on fundamental frequency is shown to improve emotion conversion performance. Experiments demonstrate the effectiveness of the approach on an English emotional speech database.
VAW-GAN for Emotional Voice Conversion
1. NUS Presentation Title 2001
VAW-GAN for Disentanglement and
Recomposition of Emotional Elements in Speech
Kun Zhou 1, Berrak Sisman 2,1, Haizhou Li 1
1 Department of Electrical and Computer Engineering, National University of Singapore
2 Information System Technology and Design, Singapore University of Technology and Design
IEEE SLT 2021
Highlight
- In this paper, we propose a speaker-dependent emotional voice
conversion framework.
- We propose to disentangle and recompose emotional elements in
speech through VAW-GAN.
- We propose a non-linear prosody modelling method that uses CWT to decompose F0 into different time-scales.
- We investigate the effect of F0 conditioning on the decoder to
improve the emotion conversion performance.
Figure 1 The conversion stage of proposed emotional voice conversion framework.
Outline
1. Introduction
2. Related Work
3. Contributions
4. Proposed Framework
5. Experiments
6. Conclusions
Introduction
1. Emotional voice conversion
Emotional voice conversion (EVC) is a voice conversion technique that converts the emotional prosody of speech from one emotion to another while preserving the linguistic content and speaker identity.
Applications:
Emotional text-to-speech, virtual assistants, conversational robots, …
Fig. 1. An illustration of an emotional voice conversion system which converts the emotion in speech from happy to sad.
Introduction
1. Emotional voice conversion
[Block diagram: Speech analysis (e.g., the WORLD vocoder [1]) decomposes both the source speech (neutral) and the target speech (angry) into fundamental frequency (F0), spectral features (SP), and aperiodicity (AP); the feature mapping between these representations is the main focus.]
Fig. 1. An illustration of a typical emotional voice conversion system at the training stage.
[1] Masanori Morise, Fumiya Yokomori, and Kenji Ozawa, “World: a vocoder-based high-quality speech synthesis system for real-time applications,” IEICE
TRANSACTIONS on Information and Systems, vol. 99, no. 7, pp. 1877–1884, 2016.
Introduction
2. Emotional Voice Conversion (EVC) vs. Voice Conversion (VC)
Similarities:
Both EVC and VC aim to preserve the linguistic information and convert paralinguistic information.
Differences:
1. EVC aims to convert the emotional prosody;
VC aims to convert the speaker identity and considers prosody-related features
as speaker-independent.
2. VC mainly focuses on spectrum conversion;
-- Speaker identity is mainly determined by voice quality.
EVC focuses on both spectrum and prosody conversion;
-- Emotion is the interplay of multiple acoustic attributes concerning spectrum
and prosody.
Introduction
3. Challenges in Emotional Voice Conversion:
1) Non-parallel training
Parallel data is difficult and expensive to collect in real life.
2) Emotional prosody modelling
Emotional prosody is hierarchical in nature, and is perceived at the phoneme, word, and sentence level [2].
3) Disentanglement of emotional prosody
Emotional prosody involves multiple intertwined signal attributes, which makes it difficult to disentangle and recompose [3].
[2] Yi Xu, “Speech prosody: A methodological review,” Journal of Speech Sciences, vol. 1, no. 1, pp. 85–115, 2011
[3] Dagmar M Schuller and Björn W Schuller, “A review on five recent and near-future developments in computational processing of emotion in the human
voice,” Emotion Review, p. 1754073919898526, 2020.
Related Work
1. Variational Auto-encoding Wasserstein Generative Adversarial Network (VAW-GAN) [4]
- Based on the variational auto-encoder (VAE) [5]
- Consists of three components:
Encoder: learns a latent code from the input;
Generator: reconstructs the input from the latent representation and an inference code;
Discriminator: judges whether the reconstructed input is real or fake.
- First introduced to voice conversion for spectrum conversion in 2017 [4].
Fig. 2. Illustration of the VAW-GAN framework, where E, G, and D denote the encoder, generator, and discriminator respectively. A VAW-GAN combines a VAE and a GAN by using the VAE's decoder as the GAN's generator.
[4] Chin-Cheng Hsu, Hsin-Te Hwang, Yi-Chiao Wu, Yu Tsao, and Hsin-Min Wang, “Voice conversion from unaligned corpora using variational autoencoding
wasserstein generative adversarial networks,” arXiv preprint arXiv:1704.00849, 2017.
[5] Chin-Cheng Hsu, Hsin-Te Hwang, Yi-Chiao Wu, Yu Tsao, and HsinMin Wang, “Voice conversion from non-parallel corpora using variational auto-encoder,” in
2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA). IEEE, 2016, pp. 1–6.
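For intuition, the three components above can be related through the loss terms they contribute. The sketch below is illustrative only: the shapes, weighting, and function name are assumptions, not the exact objective of [4].

```python
import numpy as np

def vaw_gan_losses(x, x_rec, mu, log_var, d_real, d_fake):
    """Toy breakdown of the three VAW-GAN loss terms (illustrative only).

    x, x_rec:       input features and their reconstruction by the generator
    mu, log_var:    encoder outputs parameterising the Gaussian posterior q(z|x)
    d_real, d_fake: discriminator scores on real vs reconstructed inputs
    """
    # VAE reconstruction term: how well the generator rebuilds the input
    rec = np.mean((x - x_rec) ** 2)
    # KL divergence of q(z|x) from the standard normal prior N(0, I)
    kl = -0.5 * np.mean(1.0 + log_var - mu ** 2 - np.exp(log_var))
    # Wasserstein critic objective: separation of real from reconstructed
    w = np.mean(d_real) - np.mean(d_fake)
    return rec, kl, w
```

With a perfect reconstruction and a posterior equal to the prior, the reconstruction and KL terms both vanish, which is why the adversarial term is what pushes the generator beyond the over-smoothed outputs typical of a plain VAE.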
Related Work
2. Prosody Characterization with Wavelet Transform
- Emotional prosody usually refers to pitch, intonation, and
speaking rate;
- F0 is used to describe pitch and intonation, but F0 is hierarchical in nature and difficult to model [2];
- Continuous wavelet transform (CWT) is a multi-scale modelling technique that decomposes F0 into components ranging from short-term to long-term variations [6].
[6] Hans Kruschke and Michael Lenz, “Estimation of the parameters of the quantitative intonation model with continuous wavelet analysis,” in Eighth European
Conference on Speech Communication and Technology, 2003.
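The multi-scale decomposition above can be sketched in a few lines of numpy. This is a minimal illustration, assuming the F0 contour has already been interpolated over unvoiced frames and log-normalised (as is standard before CWT analysis); the Mexican-hat (Ricker) wavelet and power-of-two scales are common choices in prosody work, and the function names are illustrative, not from the paper's code.

```python
import numpy as np

def ricker(points, a):
    # Ricker ("Mexican hat") wavelet of width `a`, sampled at `points` positions
    t = np.arange(points) - (points - 1) / 2.0
    amp = 2.0 / (np.sqrt(3.0 * a) * np.pi ** 0.25)
    return amp * (1.0 - (t / a) ** 2) * np.exp(-(t ** 2) / (2.0 * a ** 2))

def cwt_f0(f0, num_scales=10):
    """Decompose an F0 contour into `num_scales` CWT coefficient tracks.

    Scales are spaced by powers of two, so low indices capture short-term
    (phoneme-level) variation and high indices long-term (utterance-level)
    variation -- the hierarchy the slide refers to.
    """
    scales = [2.0 ** (i + 1) for i in range(num_scales)]
    coeffs = np.empty((num_scales, len(f0)))
    for i, a in enumerate(scales):
        wavelet = ricker(min(10 * int(a), len(f0)), a)
        coeffs[i] = np.convolve(f0, wavelet, mode="same")
    return coeffs
```

Each row of the result is one time-scale of the contour, which is what Fig. 3 on the next slide visualises for a neutral and an angry utterance.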
Related Work
2. Prosody Characterization with Wavelet Transform
Fig. 3. 10-scale CWT analysis of the F0 of an utterance spoken in a neutral and an angry tone with the same linguistic content.
Contributions
- We propose an emotional voice conversion framework with
VAW-GAN that is trained on non-parallel data;
- We study the use of CWT decomposition to characterize F0 for
VAW-GAN prosody mapping;
- We propose to use an encoder to disentangle emotion
information from speech, and a decoder to recompose target
emotional speech.
Proposed Framework
1. Training stage
There are two pipelines in the proposed framework:
1) VAW-GAN for Prosody
- The encoder learns an emotion-independent latent code z from the CWT-based F0;
- The decoder learns to reconstruct the CWT-based F0 from the latent code z and the emotion ID.
2) VAW-GAN for Spectrum
- The encoder learns an emotion-independent latent code z from the spectra;
- The decoder learns to reconstruct the spectra from the latent code z, F0, and the emotion ID.
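The conditioning in both pipelines amounts to concatenating per-frame features before decoding. A minimal numpy sketch of the spectrum decoder's input, assuming illustrative dimensions (64-dim latent code, 10-dim CWT-based F0, 3 emotions); the function names are hypothetical, not from the paper's code:

```python
import numpy as np

def one_hot(emotion_id, num_emotions=3):
    # broadcastable emotion label (e.g., 0 = neutral, 1 = angry, 2 = sleepy)
    vec = np.zeros(num_emotions)
    vec[emotion_id] = 1.0
    return vec

def decoder_input(z, f0, emotion_id, num_emotions=3):
    """Frame-wise conditioning of the spectrum decoder.

    z:  emotion-independent latent codes, shape (T, latent_dim)
    f0: CWT-based F0 features, shape (T, f0_dim)
    The one-hot emotion ID is broadcast to every frame and concatenated.
    """
    num_frames = z.shape[0]
    emo = np.tile(one_hot(emotion_id, num_emotions), (num_frames, 1))
    return np.concatenate([z, f0, emo], axis=1)
```

The prosody decoder's input is built the same way, minus the F0 block. Because the emotion label is supplied externally at every frame, the encoder is free to drop emotion information from z.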
Proposed Framework
2. Conversion stage
- VAW-GAN for prosody converts F0; its decoder is conditioned on the target emotion ID.
- VAW-GAN for spectrum converts the spectra; its decoder is conditioned on the converted F0 and the target emotion ID.
- APs are carried over directly from the source without modification, as APs do not affect conversion performance.
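The conversion-stage data flow can be wired up as follows. The four callables stand in for the trained VAW-GAN encoders and decoders; they are placeholders for this sketch, not a released API.

```python
def convert(src_sp, src_f0_cwt, src_ap, tgt_emotion_id,
            enc_prosody, dec_prosody, enc_spectrum, dec_spectrum):
    """Sketch of the conversion stage: prosody first, then spectrum."""
    # 1) prosody pipeline: decode the latent F0 code with the target emotion ID
    conv_f0_cwt = dec_prosody(enc_prosody(src_f0_cwt), tgt_emotion_id)
    # 2) spectrum pipeline: decoder conditioned on the *converted* F0
    #    and the target emotion ID
    conv_sp = dec_spectrum(enc_spectrum(src_sp), conv_f0_cwt, tgt_emotion_id)
    # 3) aperiodicities pass through from the source unchanged
    return conv_sp, conv_f0_cwt, src_ap
```

The returned features would then go through inverse CWT and WORLD synthesis to produce the converted waveform. Note the ordering constraint: the spectrum decoder needs the converted F0, so the prosody pipeline must run first.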
Proposed Framework
3. Conditioning on F0 for decoding
- In VAE, the latent code extracted from the source spectral features still contains source F0 information [7].
- The conversion performance suffers from this flaw!
- We propose to use F0 as a condition to the decoder.
- The decoder thus learns to discard prosodic information during training!
[7] Wen-Chin Huang, Yi-Chiao Wu, Chen-Chou Lo, Patrick Lumban Tobing, Tomoki Hayashi, Kazuhiro Kobayashi, Tomoki Toda, Yu Tsao, and Hsin-Min Wang,
“Investigation of f0 conditioning and fully convolutional networks in variational autoencoder based voice conversion,” arXiv preprint arXiv:1905.00615, 2019.
Experiments
1. Experimental Setup
- Database: EmoV-DB [8], an English emotional speech database
- Speakers: One male speaker (‘Sam’), one female speaker
(‘Bea’)
- Emotions: neutral, sleepy, angry
- Details:
1) training data: 300 non-parallel utterances from each speaker;
2) evaluation data: 20 utterances from each speaker
3) we conduct emotion conversion for each speaker from:
a) neutral to angry b) neutral to sleepy
[8] Adaeze Adigwe, Noé Tits, Kevin El Haddad, Sarah Ostadabbas, and Thierry Dutoit, “The emotional voices database: Towards controlling the emotion
dimension in voice generation systems,” arXiv preprint arXiv:1806.09514, 2018.
Experiments
2. Comparative study
Baseline Frameworks:
1) VAW-GAN (SP + CWT):
(to show the effect of F0 conditioning)
a VAW-GAN based EVC framework where both spectral and CWT-based F0 features are converted, without F0 conditioning on the spectral decoder.
2) VAW-GAN (SP + F0 + C):
(to show the effect of CWT modelling of F0)
a VAW-GAN based EVC framework where spectral features are converted with logarithm Gaussian (LG)-based F0 conditioning on the spectral decoder, and F0 is converted with an LG-based linear transformation.
Proposed Framework:
VAW-GAN (SP + CWT + C)
Experiments
3. Objective Evaluation
- Mel-cepstral distortion (MCD) [dB] and log-spectral distortion (LSD) [dB] measure spectrum conversion;
- Root mean square error (RMSE) [Hz] and Pearson correlation coefficient (PCC) measure prosody conversion.
A higher PCC and lower MCD, LSD, and RMSE indicate better performance!
- F0 conditioning improves spectrum conversion performance;
- CWT modelling of F0 improves prosody conversion performance.
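The objective metrics above can be computed as follows. A minimal numpy sketch, assuming the converted and reference features have already been time-aligned (e.g., by dynamic time warping) and, for MCD, that the energy coefficient c0 has been excluded beforehand:

```python
import numpy as np

def mcd(mc_ref, mc_conv):
    """Mel-cepstral distortion in dB between aligned MCEP frames (T x D)."""
    diff = mc_ref - mc_conv
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))

def f0_rmse(f0_ref, f0_conv):
    """Root mean square error in Hz between aligned F0 contours."""
    return float(np.sqrt(np.mean((f0_ref - f0_conv) ** 2)))

def f0_pcc(f0_ref, f0_conv):
    """Pearson correlation coefficient between aligned F0 contours."""
    return float(np.corrcoef(f0_ref, f0_conv)[0, 1])
```

RMSE penalises absolute pitch errors while PCC only measures how well the contour's shape is tracked, which is why the two prosody metrics are reported together.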
Experiments
4. Subjective Evaluation
11 subjects participate in three listening tests:
-XAB test for speech quality to assess the
effect of F0 conditioning;
-XAB test for emotion similarity to assess the
effect of F0 conditioning;
-XAB test for emotion similarity to assess the
effect of CWT modelling of F0;
- F0 conditioning improves speech quality
and emotion conversion performance!
- CWT modelling improves emotion
conversion performance!
Experiments
5. Speech Samples
For more speech samples:
https://demo9646.github.io/slt2020_demo/
(Better to open with Safari or Firefox!)
Source (Neutral) Converted (Angry) Target (Angry)
Source (Neutral) Converted (Sleepy) Target (Sleepy)
Conclusions
- In this paper, we propose a speaker-dependent emotional voice
conversion framework.
- We perform both spectrum and prosody conversion based on VAW-GAN.
- We propose a non-linear prosody modelling method that uses CWT to decompose F0 into different time-scales.
- Moreover, we investigate the effect of F0 conditioning on the decoder to improve the emotion conversion performance.