SlideShare a Scribd company logo
1 of 27
NUS Presentation Title 2001
VAW-GAN for Disentanglement and
Recomposition of Emotional Elements in Speech
Kun Zhou 1, Berrak Sisman 2,1, Haizhou Li 1
1 Department of Electrical and Computer Engineering, National University of Singapore
2 Information System Technology and Design, Singapore University of Technology and Design
IEEE SLT 2021
NUS Presentation Title 2001
Highlight
- In this paper, we propose a speaker-dependent emotional voice
conversion framework.
- We propose to disentangle and recompose emotional elements in
speech through VAW-GAN.
- We perform a non-linear prosody modelling method which uses
CWT to decompose F0 into different time-scales.
- We investigate the effect of F0 conditioning on the decoder to
improve the emotion conversion performance.
Figure 1 The conversion stage of proposed emotional voice conversion framework.
NUS Presentation Title 2001
Outline
1. Introduction
2. Related Work
3. Contributions
4. Proposed Framework
5. Experiments
6. Conclusions
NUS Presentation Title 2001
Outline
1. Introduction
2. Related Work
3. Contributions
4. Proposed Framework
5. Experiments
6. Conclusions
NUS Presentation Title 2001
Introduction
1. Emotional voice conversion
Emotional voice conversion (EVC) is a voice conversion technique to convert
the emotional prosody of speech from one to another while preserving the
linguistic content and speaker identity.
Applications:
Emotional text-to-speech, virtual assistants, conversational robots, …
Fig. 1. An illustration of emotional voice conversion system which converts the
emotion in speech from happy to sad.
NUS Presentation Title 2001
Introduction
1. Emotional voice conversion
Fundamental
Frequency
(F0)
Fundamental
Frequency
(F0)
Spectral
Features
(SP)
Spectral
Features
(SP)
Aperiodicity
(AP)
Aperiodicity
(AP)
Source Speech
(Neutral)
Target Speech
(Angry)
Feature Mapping
(Main focus)
Speech Analysis
(such as WORLD[1]
vocoder)
Fig. 1. An illustration of a typical emotional voice conversion system at the training stage.
Speech Analysis
(such as WORLD vocoder)
[1] Masanori Morise, Fumiya Yokomori, and Kenji Ozawa, “World: a vocoder-based high-quality speech synthesis system for real-time applications,” IEICE
TRANSACTIONS on Information and Systems, vol. 99, no. 7, pp. 1877–1884, 2016.
NUS Presentation Title 2001
Introduction
2. Emotional Voice Conversion (EVC) vs. Voice Conversion (VC)
Similarities:
Both EVC and VC aim to preserve the linguistic information and convert para-
linguistic information.
Differences:
1. EVC aims to convert the emotional prosody;
VC aims to convert the speaker identity and considers prosody-related features
as speaker-independent.
2. VC mainly focuses on spectrum conversion;
-- Speaker identity is mainly determined by voice quality.
EVC focuses on both spectrum and prosody conversion;
-- Emotion is the interplay of multiple acoustic attributes concerning spectrum
and prosody.
NUS Presentation Title 2001
Introduction
3. Challenges in Emotional Voice Conversion:
1) Non-parallel training
Parallel data is difficult and expensive to collect in real life.
2) Emotional prosody modelling
Emotional prosody is hierarchical in nature, and perceived at phoneme, word,
and sentence level [2].
2) Disentanglement of emotional prosody
Emotional prosody is complex with multiple signal attributes, which makes it
difficult to disentangle and recompose [3].
[2] Yi Xu, “Speech prosody: A methodological review,” Journal of Speech Sciences, vol. 1, no. 1, pp. 85–115, 2011
[3] Dagmar M Schuller and Björn W Schuller, “A review on five recent and near-future developments in computational processing of emotion in the human
voice,” Emotion Review, p. 1754073919898526, 2020.
NUS Presentation Title 2001
Outline
1. Introduction
2. Related Work
3. Contributions
4. Proposed Framework
5. Experiments
6. Conclusions
NUS Presentation Title 2001
Related Work
1. Variational Auto-encoding Wasserstein Generative Network
(VAW-GAN)[4]
- Based on variational auto-encoder
(VAE)[5]
- Consisting of three components:
Encoder: to learn a latent code from the input;
Generator: to reconstruct the input from the latent
representation and inference code;
Discriminator: to judge the reconstructed input
whether real or not.
- First introduced in voice conversion for
spectrum conversion in 2017 [3].
Fig. 2. Illustration of VAW-GAN framework, where E, G, D
represent encoder, generator, and discriminator respectively.
A VAW-GAN is incorporated with a VAE and a GAN model
by assigning VAE’s decoder as GAN’s generator.
[4] Chin-Cheng Hsu, Hsin-Te Hwang, Yi-Chiao Wu, Yu Tsao, and Hsin-Min Wang, “Voice conversion from unaligned corpora using variational autoencoding
wasserstein generative adversarial networks,” arXiv preprint arXiv:1704.00849, 2017.
[5] Chin-Cheng Hsu, Hsin-Te Hwang, Yi-Chiao Wu, Yu Tsao, and HsinMin Wang, “Voice conversion from non-parallel corpora using variational auto-encoder,” in
2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA). IEEE, 2016, pp. 1–6.
NUS Presentation Title 2001
Related Work
2. Prosody Characterization with Wavelet Transform
- Emotional prosody usually refers to pitch, intonation, and
speaking rate;
- F0 is used to describe the pitch and intonation, but F0 is
hierarchical in nature and difficult to model [2];
- Continuous wavelet transform (CWT) is a multi-scale modelling
technique, which decompose F0 from short-term to long-term
variations[6];
[6] Hans Kruschke and Michael Lenz, “Estimation of the parameters of the quantitative intonation model with continuous wavelet analysis,” in Eighth European
Conference on Speech Communication and Technology, 2003.
NUS Presentation Title 2001
Related Work
2. Prosody Characterization with Wavelet Transform
Fig. 3. 10-scales CWT analysis of F0 of an utterance in neutral and angry tone with the same linguistic content.
NUS Presentation Title 2001
Outline
1. Introduction
2. Related Work
3. Contributions
4. Proposed Framework
5. Experiments
6. Conclusions
NUS Presentation Title 2001
Contributions
- We propose an emotional voice conversion framework with
VAW-GAN that is trained on non-parallel data;
- We study the use of CWT decomposition to characterize F0 for
VAW-GAN prosody mapping;
- We propose to use an encoder to disentangle emotion
information from speech, and a decoder to recompose target
emotional speech.
NUS Presentation Title 2001
Outline
1. Introduction
2. Related Work
3. Contributions
4. Proposed Framework
5. Experiments
6. Conclusions
NUS Presentation Title 2001
Proposed Framework
1. Training stage
2) VAW-GAN for Spectrum
- Encoder learns an emotion-independent latent code z from the spectra;
- Decoder learns to reconstruct spectra from latent code z, F0, and emotion ID.
1) VAW-GAN for Prosody
- Encoder learns an emotion-independent latent code z from the CWT-based F0;
- Decoder learns to reconstruct CWT-based F0 from latent code z and emotion ID.
There are two pipelines in proposed framework:
NUS Presentation Title 2001
Proposed Framework
2. Conversion stage
- VAW-GAN for prosody to convert F0;
Decoder is conditioned on target emotion ID.
- VAW-GAN for spectrum to convert spectra;
Decoder is conditioned on converted F0 and target emotion ID.
- APs are carried directly from the source without modifications.
APs do not affect conversion performance.
NUS Presentation Title 2001
Proposed Framework
3. Conditioning on F0 for decoding
- In VAE, the latent code extract
from source spectral features
still contains source F0
information[7].
- The conversion performance
suffers from this flaw!
- We propose to use F0 as a
condition to the decoder.
- Decoder learns to discard
prosodic information during the
training!
[7] Wen-Chin Huang, Yi-Chiao Wu, Chen-Chou Lo, Patrick Lumban Tobing, Tomoki Hayashi, Kazuhiro Kobayashi, Tomoki Toda, Yu Tsao, and Hsin-Min Wang,
“Investigation of f0 conditioning and fully convolutional networks in variational autoencoder based voice conversion,” arXiv preprint arXiv:1905.00615, 2019.
NUS Presentation Title 2001
Outline
1. Introduction
2. Related Work
3. Contributions
4. Proposed Framework
5. Experiments
6. Conclusions
NUS Presentation Title 2001
Experiments
1. Experimental Setup
- Database: EmoV-DB [8], an English emotional speech database
- Speakers: One male speaker (‘Sam’), one female speaker
(‘Bea’)
- Emotions: neutral, sleepy, angry
- Details:
1) training data: 300 non-parallel utterances from each speaker;
2) evaluation data: 20 utterances from each speaker
3) we conduct emotion conversion for each speaker from:
a) neutral to angry b) neutral to sleepy
[8] Adaeze Adigwe, Noé Tits, Kevin El Haddad, Sarah Ostadabbas, and Thierry Dutoit, “The emotional voices database: Towards controlling the emotion
dimension in voice generation systems,” arXiv preprint arXiv:1806.09514, 2018.
NUS Presentation Title 2001
Experiments
2. Comparative study
Baseline Frameworks:
1)VAW-GAN (SP + CWT) :
(To show the efficiency of F0 conditioning)
VAW-GAN based EVC framework, where both spectral and CWT-based F0 features are
converted without F0 conditioning on the spectral decoder
2)VAW-GAN (SP + F0 + C):
(To show the efficiency of CWT modelling of F0)
VAWGAN based EVC framework, where both spectral features are converted with LG-based F0
conditioning on the spectral decoder, and F0 is converted with LG-based linear transformation.
Proposed Framework:
VAW-GAN (SP + CWT + C)
NUS Presentation Title 2001
Experiments
3. Objective Evaluation
- MCD [dB] and LSD [dB] to measure spectrum conversion;
- RMSE [Hz] and PCC to measure prosody conversion;
Higher values of PCC, Lower values of MCD, LSD, and RMSE indicate better performance!
- F0 conditioning improves spectrum conversion performance;
- CWT modelling of F0 improves prosody conversion performance.
NUS Presentation Title 2001
Experiments
4. Subjective Evaluation
11 subjects participate in three listening tests:
-XAB test for speech quality to assess the
effect of F0 conditioning;
-XAB test for emotion similarity to assess the
effect of F0 conditioning;
-XAB test for emotion similarity to assess the
effect of CWT modelling of F0;
- F0 conditioning improves speech quality
and emotion conversion performance!
- CWT modelling improves emotion
conversion performance!
NUS Presentation Title 2001
Experiments
5. Speech Samples
For more speech samples:
https://demo9646.github.io/slt2020_demo/
(Better to open with Safari or Firefox!)
Source (Neutral) Converted (Angry) Target (Angry)
Source (Neutral) Converted (Sleepy) Target (Sleepy)
NUS Presentation Title 2001
Outline
1. Introduction
2. Related Work
3. Contributions
4. Proposed Framework
5. Experiments
6. Conclusions
NUS Presentation Title 2001
Conclusions
- In this paper, we propose a speaker-dependent emotional voice
conversion framework.
- We perform both spectrum and prosody conversion based on VAW-
GAN.
- We perform a non-linear prosody modelling method which uses
CWT to decompose F0 into different time-scales.
- Moreover, we investigate the effect of F0 conditioning on the
decoder to improve the emotion conversion performance.
NUS Presentation Title 2001
Thank you!

More Related Content

What's hot

Conversational transfer learning for emotion recognition
Conversational transfer learning for emotion recognitionConversational transfer learning for emotion recognition
Conversational transfer learning for emotion recognitionTakato Hayashi
 
Estimating the quality of digitally transmitted speech over satellite communi...
Estimating the quality of digitally transmitted speech over satellite communi...Estimating the quality of digitally transmitted speech over satellite communi...
Estimating the quality of digitally transmitted speech over satellite communi...Alexander Decker
 
Malayalam Isolated Digit Recognition using HMM and PLP cepstral coefficient
Malayalam Isolated Digit Recognition using HMM and PLP cepstral coefficientMalayalam Isolated Digit Recognition using HMM and PLP cepstral coefficient
Malayalam Isolated Digit Recognition using HMM and PLP cepstral coefficientijait
 
Speaker recognition system by abhishek mahajan
Speaker recognition system by abhishek mahajanSpeaker recognition system by abhishek mahajan
Speaker recognition system by abhishek mahajanAbhishek Mahajan
 
Speech Recognition System By Matlab
Speech Recognition System By MatlabSpeech Recognition System By Matlab
Speech Recognition System By MatlabAnkit Gujrati
 
IRJET- Study of Effect of PCA on Speech Emotion Recognition
IRJET- Study of Effect of PCA on Speech Emotion RecognitionIRJET- Study of Effect of PCA on Speech Emotion Recognition
IRJET- Study of Effect of PCA on Speech Emotion RecognitionIRJET Journal
 
Speaker recognition in android
Speaker recognition in androidSpeaker recognition in android
Speaker recognition in androidAnshuli Mittal
 
A Survey on Speaker Recognition System
A Survey on Speaker Recognition SystemA Survey on Speaker Recognition System
A Survey on Speaker Recognition SystemVani011
 
IRJET- Emotion recognition using Speech Signal: A Review
IRJET-  	  Emotion recognition using Speech Signal: A ReviewIRJET-  	  Emotion recognition using Speech Signal: A Review
IRJET- Emotion recognition using Speech Signal: A ReviewIRJET Journal
 
Automatic speech recognition system using deep learning
Automatic speech recognition system using deep learningAutomatic speech recognition system using deep learning
Automatic speech recognition system using deep learningAnkan Dutta
 
Speech recognition system seminar
Speech recognition system seminarSpeech recognition system seminar
Speech recognition system seminarDiptimaya Sarangi
 

What's hot (20)

Conversational transfer learning for emotion recognition
Conversational transfer learning for emotion recognitionConversational transfer learning for emotion recognition
Conversational transfer learning for emotion recognition
 
Estimating the quality of digitally transmitted speech over satellite communi...
Estimating the quality of digitally transmitted speech over satellite communi...Estimating the quality of digitally transmitted speech over satellite communi...
Estimating the quality of digitally transmitted speech over satellite communi...
 
H010215561
H010215561H010215561
H010215561
 
Malayalam Isolated Digit Recognition using HMM and PLP cepstral coefficient
Malayalam Isolated Digit Recognition using HMM and PLP cepstral coefficientMalayalam Isolated Digit Recognition using HMM and PLP cepstral coefficient
Malayalam Isolated Digit Recognition using HMM and PLP cepstral coefficient
 
Speech Signal Processing
Speech Signal ProcessingSpeech Signal Processing
Speech Signal Processing
 
Speaker recognition system by abhishek mahajan
Speaker recognition system by abhishek mahajanSpeaker recognition system by abhishek mahajan
Speaker recognition system by abhishek mahajan
 
Speech Recognition System By Matlab
Speech Recognition System By MatlabSpeech Recognition System By Matlab
Speech Recognition System By Matlab
 
Speaker recognition.
Speaker recognition.Speaker recognition.
Speaker recognition.
 
Nd2421622165
Nd2421622165Nd2421622165
Nd2421622165
 
Mini Project- Audio Enhancement
Mini Project-  Audio EnhancementMini Project-  Audio Enhancement
Mini Project- Audio Enhancement
 
IRJET- Study of Effect of PCA on Speech Emotion Recognition
IRJET- Study of Effect of PCA on Speech Emotion RecognitionIRJET- Study of Effect of PCA on Speech Emotion Recognition
IRJET- Study of Effect of PCA on Speech Emotion Recognition
 
Speaker recognition in android
Speaker recognition in androidSpeaker recognition in android
Speaker recognition in android
 
A Survey on Speaker Recognition System
A Survey on Speaker Recognition SystemA Survey on Speaker Recognition System
A Survey on Speaker Recognition System
 
Ho3114511454
Ho3114511454Ho3114511454
Ho3114511454
 
IRJET- Emotion recognition using Speech Signal: A Review
IRJET-  	  Emotion recognition using Speech Signal: A ReviewIRJET-  	  Emotion recognition using Speech Signal: A Review
IRJET- Emotion recognition using Speech Signal: A Review
 
Kf2517971799
Kf2517971799Kf2517971799
Kf2517971799
 
Mini Project- Audio Enhancement
Mini Project- Audio EnhancementMini Project- Audio Enhancement
Mini Project- Audio Enhancement
 
Speech processing
Speech processingSpeech processing
Speech processing
 
Automatic speech recognition system using deep learning
Automatic speech recognition system using deep learningAutomatic speech recognition system using deep learning
Automatic speech recognition system using deep learning
 
Speech recognition system seminar
Speech recognition system seminarSpeech recognition system seminar
Speech recognition system seminar
 

Similar to VAW-GAN for Emotional Voice Conversion

A Study to Assess the Effectiveness of Planned Teaching Programme on Knowledg...
A Study to Assess the Effectiveness of Planned Teaching Programme on Knowledg...A Study to Assess the Effectiveness of Planned Teaching Programme on Knowledg...
A Study to Assess the Effectiveness of Planned Teaching Programme on Knowledg...ijtsrd
 
Speech Emotion Recognition Using Neural Networks
Speech Emotion Recognition Using Neural NetworksSpeech Emotion Recognition Using Neural Networks
Speech Emotion Recognition Using Neural Networksijtsrd
 
silent sound technology pdf
silent sound technology pdfsilent sound technology pdf
silent sound technology pdfrahul mishra
 
IRJET - Audio Emotion Analysis
IRJET - Audio Emotion AnalysisIRJET - Audio Emotion Analysis
IRJET - Audio Emotion AnalysisIRJET Journal
 
Performance estimation based recurrent-convolutional encoder decoder for spee...
Performance estimation based recurrent-convolutional encoder decoder for spee...Performance estimation based recurrent-convolutional encoder decoder for spee...
Performance estimation based recurrent-convolutional encoder decoder for spee...karthik annam
 
Fusion of Learned Multi-Modal Representations and Dense Trajectories for Emot...
Fusion of Learned Multi-Modal Representations and Dense Trajectories for Emot...Fusion of Learned Multi-Modal Representations and Dense Trajectories for Emot...
Fusion of Learned Multi-Modal Representations and Dense Trajectories for Emot...Esra Açar
 
Investigations on the role of analysis window shape parameter in speech enhan...
Investigations on the role of analysis window shape parameter in speech enhan...Investigations on the role of analysis window shape parameter in speech enhan...
Investigations on the role of analysis window shape parameter in speech enhan...karthik annam
 
Utterance Based Speaker Identification Using ANN
Utterance Based Speaker Identification Using ANNUtterance Based Speaker Identification Using ANN
Utterance Based Speaker Identification Using ANNIJCSEA Journal
 
Utterance Based Speaker Identification Using ANN
Utterance Based Speaker Identification Using ANNUtterance Based Speaker Identification Using ANN
Utterance Based Speaker Identification Using ANNIJCSEA Journal
 
Incremental Difference as Feature for Lipreading
Incremental Difference as Feature for LipreadingIncremental Difference as Feature for Lipreading
Incremental Difference as Feature for LipreadingIDES Editor
 
Utterance based speaker identification
Utterance based speaker identificationUtterance based speaker identification
Utterance based speaker identificationIJCSEA Journal
 
On the use of voice activity detection in speech emotion recognition
On the use of voice activity detection in speech emotion recognitionOn the use of voice activity detection in speech emotion recognition
On the use of voice activity detection in speech emotion recognitionjournalBEEI
 
Bachelors project summary
Bachelors project summaryBachelors project summary
Bachelors project summaryAditya Deshmukh
 
IRJET- Music Genre Recognition using Convolution Neural Network
IRJET- Music Genre Recognition using Convolution Neural NetworkIRJET- Music Genre Recognition using Convolution Neural Network
IRJET- Music Genre Recognition using Convolution Neural NetworkIRJET Journal
 
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...IJERD Editor
 
IRJET- A Survey on Sound Recognition
IRJET- A Survey on Sound RecognitionIRJET- A Survey on Sound Recognition
IRJET- A Survey on Sound RecognitionIRJET Journal
 
NEURAL DISCOURSE MODELLING OF CONVERSATIONS
NEURAL DISCOURSE MODELLING OF CONVERSATIONSNEURAL DISCOURSE MODELLING OF CONVERSATIONS
NEURAL DISCOURSE MODELLING OF CONVERSATIONSkevig
 

Similar to VAW-GAN for Emotional Voice Conversion (20)

Preliminary study on using vector quantization latent spaces for TTS/VC syste...
Preliminary study on using vector quantization latent spaces for TTS/VC syste...Preliminary study on using vector quantization latent spaces for TTS/VC syste...
Preliminary study on using vector quantization latent spaces for TTS/VC syste...
 
A Study to Assess the Effectiveness of Planned Teaching Programme on Knowledg...
A Study to Assess the Effectiveness of Planned Teaching Programme on Knowledg...A Study to Assess the Effectiveness of Planned Teaching Programme on Knowledg...
A Study to Assess the Effectiveness of Planned Teaching Programme on Knowledg...
 
Speech Emotion Recognition Using Neural Networks
Speech Emotion Recognition Using Neural NetworksSpeech Emotion Recognition Using Neural Networks
Speech Emotion Recognition Using Neural Networks
 
silent sound technology pdf
silent sound technology pdfsilent sound technology pdf
silent sound technology pdf
 
IRJET - Audio Emotion Analysis
IRJET - Audio Emotion AnalysisIRJET - Audio Emotion Analysis
IRJET - Audio Emotion Analysis
 
Performance estimation based recurrent-convolutional encoder decoder for spee...
Performance estimation based recurrent-convolutional encoder decoder for spee...Performance estimation based recurrent-convolutional encoder decoder for spee...
Performance estimation based recurrent-convolutional encoder decoder for spee...
 
Fusion of Learned Multi-Modal Representations and Dense Trajectories for Emot...
Fusion of Learned Multi-Modal Representations and Dense Trajectories for Emot...Fusion of Learned Multi-Modal Representations and Dense Trajectories for Emot...
Fusion of Learned Multi-Modal Representations and Dense Trajectories for Emot...
 
Investigations on the role of analysis window shape parameter in speech enhan...
Investigations on the role of analysis window shape parameter in speech enhan...Investigations on the role of analysis window shape parameter in speech enhan...
Investigations on the role of analysis window shape parameter in speech enhan...
 
Utterance Based Speaker Identification Using ANN
Utterance Based Speaker Identification Using ANNUtterance Based Speaker Identification Using ANN
Utterance Based Speaker Identification Using ANN
 
Utterance Based Speaker Identification Using ANN
Utterance Based Speaker Identification Using ANNUtterance Based Speaker Identification Using ANN
Utterance Based Speaker Identification Using ANN
 
Incremental Difference as Feature for Lipreading
Incremental Difference as Feature for LipreadingIncremental Difference as Feature for Lipreading
Incremental Difference as Feature for Lipreading
 
Utterance based speaker identification
Utterance based speaker identificationUtterance based speaker identification
Utterance based speaker identification
 
On the use of voice activity detection in speech emotion recognition
On the use of voice activity detection in speech emotion recognitionOn the use of voice activity detection in speech emotion recognition
On the use of voice activity detection in speech emotion recognition
 
Servey
ServeyServey
Servey
 
Bachelors project summary
Bachelors project summaryBachelors project summary
Bachelors project summary
 
IRJET- Music Genre Recognition using Convolution Neural Network
IRJET- Music Genre Recognition using Convolution Neural NetworkIRJET- Music Genre Recognition using Convolution Neural Network
IRJET- Music Genre Recognition using Convolution Neural Network
 
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
 
Speech driven gesture generation with Autoencoders - Project
Speech driven gesture generation with Autoencoders - ProjectSpeech driven gesture generation with Autoencoders - Project
Speech driven gesture generation with Autoencoders - Project
 
IRJET- A Survey on Sound Recognition
IRJET- A Survey on Sound RecognitionIRJET- A Survey on Sound Recognition
IRJET- A Survey on Sound Recognition
 
NEURAL DISCOURSE MODELLING OF CONVERSATIONS
NEURAL DISCOURSE MODELLING OF CONVERSATIONSNEURAL DISCOURSE MODELLING OF CONVERSATIONS
NEURAL DISCOURSE MODELLING OF CONVERSATIONS
 

Recently uploaded

Mathematics of Finance Presentation.pptx
Mathematics of Finance Presentation.pptxMathematics of Finance Presentation.pptx
Mathematics of Finance Presentation.pptxMoumonDas2
 
Microsoft Copilot AI for Everyone - created by AI
Microsoft Copilot AI for Everyone - created by AIMicrosoft Copilot AI for Everyone - created by AI
Microsoft Copilot AI for Everyone - created by AITatiana Gurgel
 
Thirunelveli call girls Tamil escorts 7877702510
Thirunelveli call girls Tamil escorts 7877702510Thirunelveli call girls Tamil escorts 7877702510
Thirunelveli call girls Tamil escorts 7877702510Vipesco
 
CTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdf
CTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdfCTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdf
CTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdfhenrik385807
 
CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...
CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...
CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...henrik385807
 
WhatsApp 📞 9892124323 ✅Call Girls In Juhu ( Mumbai )
WhatsApp 📞 9892124323 ✅Call Girls In Juhu ( Mumbai )WhatsApp 📞 9892124323 ✅Call Girls In Juhu ( Mumbai )
WhatsApp 📞 9892124323 ✅Call Girls In Juhu ( Mumbai )Pooja Nehwal
 
Re-membering the Bard: Revisiting The Compleat Wrks of Wllm Shkspr (Abridged)...
Re-membering the Bard: Revisiting The Compleat Wrks of Wllm Shkspr (Abridged)...Re-membering the Bard: Revisiting The Compleat Wrks of Wllm Shkspr (Abridged)...
Re-membering the Bard: Revisiting The Compleat Wrks of Wllm Shkspr (Abridged)...Hasting Chen
 
Presentation on Engagement in Book Clubs
Presentation on Engagement in Book ClubsPresentation on Engagement in Book Clubs
Presentation on Engagement in Book Clubssamaasim06
 
BDSM⚡Call Girls in Sector 93 Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Sector 93 Noida Escorts >༒8448380779 Escort ServiceBDSM⚡Call Girls in Sector 93 Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Sector 93 Noida Escorts >༒8448380779 Escort ServiceDelhi Call girls
 
VVIP Call Girls Nalasopara : 9892124323, Call Girls in Nalasopara Services
VVIP Call Girls Nalasopara : 9892124323, Call Girls in Nalasopara ServicesVVIP Call Girls Nalasopara : 9892124323, Call Girls in Nalasopara Services
VVIP Call Girls Nalasopara : 9892124323, Call Girls in Nalasopara ServicesPooja Nehwal
 
George Lever - eCommerce Day Chile 2024
George Lever -  eCommerce Day Chile 2024George Lever -  eCommerce Day Chile 2024
George Lever - eCommerce Day Chile 2024eCommerce Institute
 
Navi Mumbai Call Girls Service Pooja 9892124323 Real Russian Girls Looking Mo...
Navi Mumbai Call Girls Service Pooja 9892124323 Real Russian Girls Looking Mo...Navi Mumbai Call Girls Service Pooja 9892124323 Real Russian Girls Looking Mo...
Navi Mumbai Call Girls Service Pooja 9892124323 Real Russian Girls Looking Mo...Pooja Nehwal
 
Introduction to Prompt Engineering (Focusing on ChatGPT)
Introduction to Prompt Engineering (Focusing on ChatGPT)Introduction to Prompt Engineering (Focusing on ChatGPT)
Introduction to Prompt Engineering (Focusing on ChatGPT)Chameera Dedduwage
 
ANCHORING SCRIPT FOR A CULTURAL EVENT.docx
ANCHORING SCRIPT FOR A CULTURAL EVENT.docxANCHORING SCRIPT FOR A CULTURAL EVENT.docx
ANCHORING SCRIPT FOR A CULTURAL EVENT.docxNikitaBankoti2
 
SaaStr Workshop Wednesday w/ Lucas Price, Yardstick
SaaStr Workshop Wednesday w/ Lucas Price, YardstickSaaStr Workshop Wednesday w/ Lucas Price, Yardstick
SaaStr Workshop Wednesday w/ Lucas Price, Yardsticksaastr
 
No Advance 8868886958 Chandigarh Call Girls , Indian Call Girls For Full Nigh...
No Advance 8868886958 Chandigarh Call Girls , Indian Call Girls For Full Nigh...No Advance 8868886958 Chandigarh Call Girls , Indian Call Girls For Full Nigh...
No Advance 8868886958 Chandigarh Call Girls , Indian Call Girls For Full Nigh...Sheetaleventcompany
 
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptx
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptxChiulli_Aurora_Oman_Raffaele_Beowulf.pptx
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptxraffaeleoman
 
Governance and Nation-Building in Nigeria: Some Reflections on Options for Po...
Governance and Nation-Building in Nigeria: Some Reflections on Options for Po...Governance and Nation-Building in Nigeria: Some Reflections on Options for Po...
Governance and Nation-Building in Nigeria: Some Reflections on Options for Po...Kayode Fayemi
 
Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024
Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024
Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024eCommerce Institute
 
Night 7k Call Girls Noida Sector 128 Call Me: 8448380779
Night 7k Call Girls Noida Sector 128 Call Me: 8448380779Night 7k Call Girls Noida Sector 128 Call Me: 8448380779
Night 7k Call Girls Noida Sector 128 Call Me: 8448380779Delhi Call girls
 

Recently uploaded (20)

Mathematics of Finance Presentation.pptx
Mathematics of Finance Presentation.pptxMathematics of Finance Presentation.pptx
Mathematics of Finance Presentation.pptx
 
Microsoft Copilot AI for Everyone - created by AI
Microsoft Copilot AI for Everyone - created by AIMicrosoft Copilot AI for Everyone - created by AI
Microsoft Copilot AI for Everyone - created by AI
 
Thirunelveli call girls Tamil escorts 7877702510
Thirunelveli call girls Tamil escorts 7877702510Thirunelveli call girls Tamil escorts 7877702510
Thirunelveli call girls Tamil escorts 7877702510
 
CTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdf
CTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdfCTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdf
CTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdf
 
CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...
CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...
CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...
 
WhatsApp 📞 9892124323 ✅Call Girls In Juhu ( Mumbai )
WhatsApp 📞 9892124323 ✅Call Girls In Juhu ( Mumbai )WhatsApp 📞 9892124323 ✅Call Girls In Juhu ( Mumbai )
WhatsApp 📞 9892124323 ✅Call Girls In Juhu ( Mumbai )
 
Re-membering the Bard: Revisiting The Compleat Wrks of Wllm Shkspr (Abridged)...
Re-membering the Bard: Revisiting The Compleat Wrks of Wllm Shkspr (Abridged)...Re-membering the Bard: Revisiting The Compleat Wrks of Wllm Shkspr (Abridged)...
Re-membering the Bard: Revisiting The Compleat Wrks of Wllm Shkspr (Abridged)...
 
Presentation on Engagement in Book Clubs
Presentation on Engagement in Book ClubsPresentation on Engagement in Book Clubs
Presentation on Engagement in Book Clubs
 
BDSM⚡Call Girls in Sector 93 Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Sector 93 Noida Escorts >༒8448380779 Escort ServiceBDSM⚡Call Girls in Sector 93 Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Sector 93 Noida Escorts >༒8448380779 Escort Service
 
VVIP Call Girls Nalasopara : 9892124323, Call Girls in Nalasopara Services
VVIP Call Girls Nalasopara : 9892124323, Call Girls in Nalasopara ServicesVVIP Call Girls Nalasopara : 9892124323, Call Girls in Nalasopara Services
VVIP Call Girls Nalasopara : 9892124323, Call Girls in Nalasopara Services
 
George Lever - eCommerce Day Chile 2024
George Lever -  eCommerce Day Chile 2024George Lever -  eCommerce Day Chile 2024
George Lever - eCommerce Day Chile 2024
 
Navi Mumbai Call Girls Service Pooja 9892124323 Real Russian Girls Looking Mo...
Navi Mumbai Call Girls Service Pooja 9892124323 Real Russian Girls Looking Mo...Navi Mumbai Call Girls Service Pooja 9892124323 Real Russian Girls Looking Mo...
Navi Mumbai Call Girls Service Pooja 9892124323 Real Russian Girls Looking Mo...
 
Introduction to Prompt Engineering (Focusing on ChatGPT)
Introduction to Prompt Engineering (Focusing on ChatGPT)Introduction to Prompt Engineering (Focusing on ChatGPT)
Introduction to Prompt Engineering (Focusing on ChatGPT)
 
ANCHORING SCRIPT FOR A CULTURAL EVENT.docx
ANCHORING SCRIPT FOR A CULTURAL EVENT.docxANCHORING SCRIPT FOR A CULTURAL EVENT.docx
ANCHORING SCRIPT FOR A CULTURAL EVENT.docx
 
SaaStr Workshop Wednesday w/ Lucas Price, Yardstick
SaaStr Workshop Wednesday w/ Lucas Price, YardstickSaaStr Workshop Wednesday w/ Lucas Price, Yardstick
SaaStr Workshop Wednesday w/ Lucas Price, Yardstick
 
No Advance 8868886958 Chandigarh Call Girls , Indian Call Girls For Full Nigh...
No Advance 8868886958 Chandigarh Call Girls , Indian Call Girls For Full Nigh...No Advance 8868886958 Chandigarh Call Girls , Indian Call Girls For Full Nigh...
No Advance 8868886958 Chandigarh Call Girls , Indian Call Girls For Full Nigh...
 
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptx
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptxChiulli_Aurora_Oman_Raffaele_Beowulf.pptx
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptx
 
Governance and Nation-Building in Nigeria: Some Reflections on Options for Po...
Governance and Nation-Building in Nigeria: Some Reflections on Options for Po...Governance and Nation-Building in Nigeria: Some Reflections on Options for Po...
Governance and Nation-Building in Nigeria: Some Reflections on Options for Po...
 
Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024
Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024
Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024
 
Night 7k Call Girls Noida Sector 128 Call Me: 8448380779
Night 7k Call Girls Noida Sector 128 Call Me: 8448380779Night 7k Call Girls Noida Sector 128 Call Me: 8448380779
Night 7k Call Girls Noida Sector 128 Call Me: 8448380779
 

VAW-GAN for Emotional Voice Conversion

  • 1. NUS Presentation Title 2001 VAW-GAN for Disentanglement and Recomposition of Emotional Elements in Speech Kun Zhou 1, Berrak Sisman 2,1, Haizhou Li 1 1 Department of Electrical and Computer Engineering, National University of Singapore 2 Information System Technology and Design, Singapore University of Technology and Design IEEE SLT 2021
  • 2. NUS Presentation Title 2001 Highlight - In this paper, we propose a speaker-dependent emotional voice conversion framework. - We propose to disentangle and recompose emotional elements in speech through VAW-GAN. - We perform a non-linear prosody modelling method which uses CWT to decompose F0 into different time-scales. - We investigate the effect of F0 conditioning on the decoder to improve the emotion conversion performance. Figure 1 The conversion stage of proposed emotional voice conversion framework.
  • 3. NUS Presentation Title 2001 Outline 1. Introduction 2. Related Work 3. Contributions 4. Proposed Framework 5. Experiments 6. Conclusions
  • 4. NUS Presentation Title 2001 Outline 1. Introduction 2. Related Work 3. Contributions 4. Proposed Framework 5. Experiments 6. Conclusions
  • 5. NUS Presentation Title 2001 Introduction 1. Emotional voice conversion Emotional voice conversion (EVC) is a voice conversion technique to convert the emotional prosody of speech from one to another while preserving the linguistic content and speaker identity. Applications: Emotional text-to-speech, virtual assistants, conversational robots, … Fig. 1. An illustration of emotional voice conversion system which converts the emotion in speech from happy to sad.
  • 6. NUS Presentation Title 2001 Introduction 1. Emotional voice conversion Fundamental Frequency (F0) Fundamental Frequency (F0) Spectral Features (SP) Spectral Features (SP) Aperiodicity (AP) Aperiodicity (AP) Source Speech (Neutral) Target Speech (Angry) Feature Mapping (Main focus) Speech Analysis (such as WORLD[1] vocoder) Fig. 1. An illustration of a typical emotional voice conversion system at the training stage. Speech Analysis (such as WORLD vocoder) [1] Masanori Morise, Fumiya Yokomori, and Kenji Ozawa, “World: a vocoder-based high-quality speech synthesis system for real-time applications,” IEICE TRANSACTIONS on Information and Systems, vol. 99, no. 7, pp. 1877–1884, 2016.
  • 7. NUS Presentation Title 2001 Introduction 2. Emotional Voice Conversion (EVC) vs. Voice Conversion (VC) Similarities: Both EVC and VC aim to preserve the linguistic information and convert para- linguistic information. Differences: 1. EVC aims to convert the emotional prosody; VC aims to convert the speaker identity and considers prosody-related features as speaker-independent. 2. VC mainly focuses on spectrum conversion; -- Speaker identity is mainly determined by voice quality. EVC focuses on both spectrum and prosody conversion; -- Emotion is the interplay of multiple acoustic attributes concerning spectrum and prosody.
  • 8. NUS Presentation Title 2001 Introduction 3. Challenges in Emotional Voice Conversion: 1) Non-parallel training Parallel data is difficult and expensive to collect in real life. 2) Emotional prosody modelling Emotional prosody is hierarchical in nature, and perceived at phoneme, word, and sentence level [2]. 2) Disentanglement of emotional prosody Emotional prosody is complex with multiple signal attributes, which makes it difficult to disentangle and recompose [3]. [2] Yi Xu, “Speech prosody: A methodological review,” Journal of Speech Sciences, vol. 1, no. 1, pp. 85–115, 2011 [3] Dagmar M Schuller and Björn W Schuller, “A review on five recent and near-future developments in computational processing of emotion in the human voice,” Emotion Review, p. 1754073919898526, 2020.
  • 9. NUS Presentation Title 2001 Outline 1. Introduction 2. Related Work 3. Contributions 4. Proposed Framework 5. Experiments 6. Conclusions
  • 10. NUS Presentation Title 2001 Related Work 1. Variational Auto-encoding Wasserstein Generative Network (VAW-GAN)[4] - Based on variational auto-encoder (VAE)[5] - Consisting of three components: Encoder: to learn a latent code from the input; Generator: to reconstruct the input from the latent representation and inference code; Discriminator: to judge the reconstructed input whether real or not. - First introduced in voice conversion for spectrum conversion in 2017 [3]. Fig. 2. Illustration of VAW-GAN framework, where E, G, D represent encoder, generator, and discriminator respectively. A VAW-GAN is incorporated with a VAE and a GAN model by assigning VAE’s decoder as GAN’s generator. [4] Chin-Cheng Hsu, Hsin-Te Hwang, Yi-Chiao Wu, Yu Tsao, and Hsin-Min Wang, “Voice conversion from unaligned corpora using variational autoencoding wasserstein generative adversarial networks,” arXiv preprint arXiv:1704.00849, 2017. [5] Chin-Cheng Hsu, Hsin-Te Hwang, Yi-Chiao Wu, Yu Tsao, and HsinMin Wang, “Voice conversion from non-parallel corpora using variational auto-encoder,” in 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA). IEEE, 2016, pp. 1–6.
  • 11. NUS Presentation Title 2001 Related Work 2. Prosody Characterization with Wavelet Transform - Emotional prosody usually refers to pitch, intonation, and speaking rate; - F0 is used to describe the pitch and intonation, but F0 is hierarchical in nature and difficult to model [2]; - Continuous wavelet transform (CWT) is a multi-scale modelling technique, which decompose F0 from short-term to long-term variations[6]; [6] Hans Kruschke and Michael Lenz, “Estimation of the parameters of the quantitative intonation model with continuous wavelet analysis,” in Eighth European Conference on Speech Communication and Technology, 2003.
  • 12. NUS Presentation Title 2001 Related Work 2. Prosody Characterization with Wavelet Transform Fig. 3. 10-scales CWT analysis of F0 of an utterance in neutral and angry tone with the same linguistic content.
  • 13. NUS Presentation Title 2001 Outline 1. Introduction 2. Related Work 3. Contributions 4. Proposed Framework 5. Experiments 6. Conclusions
  • 14. NUS Presentation Title 2001 Contributions - We propose an emotional voice conversion framework with VAW-GAN that is trained on non-parallel data; - We study the use of CWT decomposition to characterize F0 for VAW-GAN prosody mapping; - We propose to use an encoder to disentangle emotion information from speech, and a decoder to recompose target emotional speech.
  • 15. NUS Presentation Title 2001 Outline 1. Introduction 2. Related Work 3. Contributions 4. Proposed Framework 5. Experiments 6. Conclusions
  • 16. NUS Presentation Title 2001 Proposed Framework 1. Training stage 2) VAW-GAN for Spectrum - Encoder learns an emotion-independent latent code z from the spectra; - Decoder learns to reconstruct spectra from latent code z, F0, and emotion ID. 1) VAW-GAN for Prosody - Encoder learns an emotion-independent latent code z from the CWT-based F0; - Decoder learns to reconstruct CWT-based F0 from latent code z and emotion ID. There are two pipelines in proposed framework:
  • 17. NUS Presentation Title 2001 Proposed Framework 2. Conversion stage - VAW-GAN for prosody to convert F0; Decoder is conditioned on target emotion ID. - VAW-GAN for spectrum to convert spectra; Decoder is conditioned on converted F0 and target emotion ID. - APs are carried directly from the source without modifications. APs do not affect conversion performance.
  • 18. NUS Presentation Title 2001 Proposed Framework 3. Conditioning on F0 for decoding - In VAE, the latent code extract from source spectral features still contains source F0 information[7]. - The conversion performance suffers from this flaw! - We propose to use F0 as a condition to the decoder. - Decoder learns to discard prosodic information during the training! [7] Wen-Chin Huang, Yi-Chiao Wu, Chen-Chou Lo, Patrick Lumban Tobing, Tomoki Hayashi, Kazuhiro Kobayashi, Tomoki Toda, Yu Tsao, and Hsin-Min Wang, “Investigation of f0 conditioning and fully convolutional networks in variational autoencoder based voice conversion,” arXiv preprint arXiv:1905.00615, 2019.
  • 19. NUS Presentation Title 2001 Outline 1. Introduction 2. Related Work 3. Contributions 4. Proposed Framework 5. Experiments 6. Conclusions
  • 20. NUS Presentation Title 2001 Experiments 1. Experimental Setup - Database: EmoV-DB [8], an English emotional speech database - Speakers: One male speaker (‘Sam’), one female speaker (‘Bea’) - Emotions: neutral, sleepy, angry - Details: 1) training data: 300 non-parallel utterances from each speaker; 2) evaluation data: 20 utterances from each speaker 3) we conduct emotion conversion for each speaker from: a) neutral to angry b) neutral to sleepy [8] Adaeze Adigwe, Noé Tits, Kevin El Haddad, Sarah Ostadabbas, and Thierry Dutoit, “The emotional voices database: Towards controlling the emotion dimension in voice generation systems,” arXiv preprint arXiv:1806.09514, 2018.
  • 21. NUS Presentation Title 2001 Experiments 2. Comparative study Baseline Frameworks: 1)VAW-GAN (SP + CWT) : (To show the efficiency of F0 conditioning) VAW-GAN based EVC framework, where both spectral and CWT-based F0 features are converted without F0 conditioning on the spectral decoder 2)VAW-GAN (SP + F0 + C): (To show the efficiency of CWT modelling of F0) VAWGAN based EVC framework, where both spectral features are converted with LG-based F0 conditioning on the spectral decoder, and F0 is converted with LG-based linear transformation. Proposed Framework: VAW-GAN (SP + CWT + C)
  • 22. NUS Presentation Title 2001 Experiments 3. Objective Evaluation - MCD [dB] and LSD [dB] to measure spectrum conversion; - RMSE [Hz] and PCC to measure prosody conversion; Higher values of PCC, Lower values of MCD, LSD, and RMSE indicate better performance! - F0 conditioning improves spectrum conversion performance; - CWT modelling of F0 improves prosody conversion performance.
  • 23. NUS Presentation Title 2001 Experiments 4. Subjective Evaluation 11 subjects participate in three listening tests: -XAB test for speech quality to assess the effect of F0 conditioning; -XAB test for emotion similarity to assess the effect of F0 conditioning; -XAB test for emotion similarity to assess the effect of CWT modelling of F0; - F0 conditioning improves speech quality and emotion conversion performance! - CWT modelling improves emotion conversion performance!
  • 24. NUS Presentation Title 2001 Experiments 5. Speech Samples For more speech samples: https://demo9646.github.io/slt2020_demo/ (Better to open with Safari or Firefox!) Source (Neutral) Converted (Angry) Target (Angry) Source (Neutral) Converted (Sleepy) Target (Sleepy)
  • 25. NUS Presentation Title 2001 Outline 1. Introduction 2. Related Work 3. Contributions 4. Proposed Framework 5. Experiments 6. Conclusions
  • 26. NUS Presentation Title 2001 Conclusions - In this paper, we propose a speaker-dependent emotional voice conversion framework. - We perform both spectrum and prosody conversion based on VAW- GAN. - We perform a non-linear prosody modelling method which uses CWT to decompose F0 into different time-scales. - Moreover, we investigate the effect of F0 conditioning on the decoder to improve the emotion conversion performance.
  • 27. NUS Presentation Title 2001 Thank you!