NUS Presentation Title 2001
Seen and Unseen emotional style transfer for voice
conversion with a new emotional speech dataset
Kun Zhou1, Berrak Sisman2, Rui Liu2, Haizhou Li1
1National University of Singapore
2Singapore University of Technology and Design
NUS Presentation Title 2001
Outline
1. Introduction
2. Related Work
3. Contributions
4. Proposed Framework
5. Experiments
6. Conclusions
1. Introduction
1.1 Emotional voice conversion
Emotional voice conversion aims to transfer the emotional style of an utterance
from one emotion to another, while preserving the linguistic content and speaker identity.
Applications: emotional text-to-speech, conversational agents
Compared with speaker voice conversion:
1) Speaker identity is mainly determined by voice quality;
-- speaker voice conversion mainly focuses on spectrum conversion
2) Emotion is the interplay of multiple acoustic attributes concerning both spectrum
and prosody.
-- emotional voice conversion focuses on both spectrum and prosody conversion
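To make the spectrum/prosody distinction concrete, the sketch below (numpy only; a toy FFT magnitude and autocorrelation pitch tracker on a synthetic tone, not the feature extractors used in this work) computes a spectral snapshot, where voice quality mostly lives, and an F0 estimate, the core prosodic cue, for one frame:

```python
import numpy as np

fs = 16000                                   # sample rate (Hz)
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 220 * t)              # synthetic 220 Hz "voice"

frame = x[:1024]
# Spectrum: magnitude of the windowed FFT (voice quality / timbre)
spectrum = np.abs(np.fft.rfft(frame * np.hanning(1024)))

# Prosody: F0 estimated from the autocorrelation peak (80-400 Hz search range)
ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
lo, hi = fs // 400, fs // 80
lag = lo + np.argmax(ac[lo:hi])
f0 = fs / lag
```

Speaker voice conversion would mainly manipulate `spectrum`; emotional voice conversion must manipulate both `spectrum` and `f0`-style prosodic trajectories.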
1.2 Current Challenges
1) Non-parallel training
Parallel data is difficult and expensive to collect in real life;
2) Emotional prosody modelling
Emotional prosody is hierarchical in nature, and perceived at phoneme,
word, and sentence level;
3) Unseen emotion conversion
Existing studies use discrete emotion categories, such as one-hot labels, to
memorize a fixed set of emotional styles.
Our focus
2. Related Work
2.1 Emotion representation
1) Categorical representation
Example: Ekman’s six basic emotions theory [1]:
i.e., anger, disgust, fear, happiness, sadness and surprise
2) Dimensional representation
Example: Russell’s circumplex model [2]:
i.e., arousal, valence, and dominance
[1] Paul Ekman, “An argument for basic emotions,” Cognition & emotion, 1992
[2] James A Russell, “A circumplex model of affect.,” Journal of personality and social psychology, vol. 39, no. 6, pp. 1161, 1980.
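The two representations can be contrasted in a few lines of Python (the arousal/valence coordinates below are illustrative placements, not values taken from [2]):

```python
# Categorical: Ekman's six basic emotions as one-hot labels
EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise"]

def one_hot(emotion):
    vec = [0] * len(EMOTIONS)
    vec[EMOTIONS.index(emotion)] = 1
    return vec

# Dimensional: points in Russell's arousal/valence plane (illustrative coordinates)
CIRCUMPLEX = {
    "anger":     (0.8, -0.6),   # high arousal, negative valence
    "happiness": (0.6,  0.8),   # high arousal, positive valence
    "sadness":   (-0.5, -0.7),  # low arousal, negative valence
}
```

A one-hot vector can only select among styles seen in training, while a point in the dimensional space can describe an emotion continuously, which is why the latter suits unseen emotion conversion.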
2.2 Emotional descriptor
1) One-hot emotion labels
2) Expert-crafted features, such as openSMILE [3]
3) Deep emotional features: data-driven, less dependent on human
knowledge, and more suitable for emotional style transfer
[3] Florian Eyben, Martin Wöllmer, and Björn Schuller, “openSMILE: the Munich versatile and fast open-source audio feature extractor,” in Proceedings of the 18th ACM International Conference on Multimedia, 2010, pp. 1459–1462.
2.3 Variational Auto-encoding Wasserstein Generative Adversarial
Network (VAW-GAN)[4]
Fig 2. An illustration of VAW-GAN framework.
- Based on the variational auto-encoder
(VAE)
- Consists of three components:
Encoder: learns a latent code from the
input;
Generator: reconstructs the input from the
latent code and the inference code;
Discriminator: judges whether the
reconstructed input is real or not;
- First applied to speaker voice
conversion in 2017
[4] Chin-Cheng Hsu, Hsin-Te Hwang, Yi-Chiao Wu, Yu Tsao, and Hsin-Min Wang, “Voice conversion from unaligned
corpora using variational autoencoding Wasserstein generative adversarial networks,” in Proc. Interspeech, 2017.
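To make the three components concrete, here is a minimal numpy sketch of the data flow only; random weights stand in for trained networks, and all dimensions (spectral, latent, conditioning) are assumed for illustration, not taken from [4]:

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_Z, D_COND = 513, 128, 64     # assumed: spectral dim, latent dim, conditioning dim

# Random weights stand in for the trained encoder / generator / discriminator
W_enc = rng.standard_normal((D_IN, 2 * D_Z)) * 0.01
W_gen = rng.standard_normal((D_Z + D_COND + 1, D_IN)) * 0.01
W_dis = rng.standard_normal((D_IN, 1)) * 0.01

def encode(x):
    """Encoder: map input frames to a latent code z (VAE reparameterization)."""
    h = x @ W_enc
    mu, logvar = h[:, :D_Z], h[:, D_Z:]
    return mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)

def generate(z, cond_emb, f0):
    """Generator: reconstruct frames from z plus conditioning (inference) codes."""
    cond = np.concatenate([z, cond_emb, f0[:, None]], axis=1)
    return cond @ W_gen

def discriminate(x):
    """Discriminator: probability that a frame is real rather than reconstructed."""
    return 1.0 / (1.0 + np.exp(-(x @ W_dis)))

frames = rng.standard_normal((10, D_IN))          # 10 spectral frames
emo = np.tile(rng.standard_normal(D_COND), (10, 1))
f0 = rng.uniform(80, 300, size=10)
z = encode(frames)
recon = generate(z, emo, f0)
score = discriminate(recon)
```

Only the shapes and the encoder-generator-discriminator wiring are the point here; real VAW-GAN training also involves the KL, reconstruction, and Wasserstein adversarial losses.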
3. Contributions
1. A one-to-many emotional style transfer framework;
2. No requirement for parallel training data;
3. Use of a speech emotion recognition model to describe the emotional style;
4. A publicly available multi-lingual and multi-speaker emotional speech
corpus (ESD) that can be used for various speech synthesis tasks.
To the best of our knowledge, this is the first reported study on emotional style
transfer for unseen emotions.
4. Proposed Framework
1) Stage I: Emotion Descriptor Training
2) Stage II: Encoder-Decoder Training with VAW-GAN
3) Stage III: Run-time Conversion
Fig. 3. The training phase of the proposed DeepEST framework. Blue boxes represent the networks
involved in training, and red boxes represent the networks that are already trained.
1) Emotion Descriptor Training
- A pre-trained SER model is used to extract deep emotional features from the
input utterance.
- The utterance-level deep emotional features are thought to describe different
emotions in a continuous space.
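One common way to obtain such an utterance-level descriptor from a frame-level SER model, sketched below as an assumption rather than the exact recipe on the slide, is to average-pool the model's hidden states over time:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical frame-level SER hidden states: T frames x D dims
hidden = rng.standard_normal((120, 64))

# Utterance-level deep emotional feature: mean over the time axis
utt_emb = hidden.mean(axis=0)

# Two utterances can then be compared in this continuous space
other = rng.standard_normal((95, 64)).mean(axis=0)
cos = utt_emb @ other / (np.linalg.norm(utt_emb) * np.linalg.norm(other))
```

Because the descriptor is a continuous vector rather than a discrete label, nearby points can represent related but previously unseen emotional styles.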
2) Encoder-Decoder Training with VAW-GAN
- The encoder learns to generate emotion-independent latent representation z
from the input features
- The decoder learns to reconstruct the input features with latent representation,
F0 contour, and deep emotional features;
- The discriminator learns to determine whether the reconstructed features are
real or not;
3) Run-time Conversion
- The pre-trained SER generates the deep emotional features from the reference
set;
- We then concatenate the deep emotional features with the converted F0 and
emotion-independent z as the input to the decoder;
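The slide does not specify how the converted F0 is obtained; a standard choice in voice conversion, assumed here, is the log-Gaussian normalized transform, where log-F0 statistics of the source are mapped to those of the target (unvoiced frames, F0 = 0, are passed through):

```python
import numpy as np

def convert_f0(f0_src, mu_s, sigma_s, mu_t, sigma_t):
    """Log-Gaussian normalized F0 transform (a common voice-conversion recipe).
    mu/sigma are the mean/std of log-F0 over voiced frames of source and target."""
    f0_src = np.asarray(f0_src, dtype=float)
    out = np.zeros_like(f0_src)
    voiced = f0_src > 0                    # F0 == 0 marks unvoiced frames
    out[voiced] = np.exp(
        (np.log(f0_src[voiced]) - mu_s) / sigma_s * sigma_t + mu_t
    )
    return out
```

The converted contour is then concatenated with the deep emotional features and the emotion-independent z before being fed to the decoder.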
5. Experiments
5.1 Emotional Speech Dataset -- ESD
1) The dataset consists of 350 parallel utterances with an average duration
of 2.9 seconds spoken by 10 native English and 10 native Mandarin
speakers;
2) For each language, the dataset consists of 5 male and 5 female speakers
in five emotions:
a) happy, b) sad, c) neutral, d) angry, and e) surprise
3) ESD dataset is publicly available at:
https://github.com/HLTSingapore/Emotional-Speech-Data
To the best of our knowledge, this is the first parallel emotional voice
conversion dataset in a multi-lingual and multi-speaker setup.
5.2 Experimental setup
1) 300-30-20 utterances (training-reference-test set);
2) One universal model that takes neutral, happy and sad utterances as input;
3) SER is trained on a subset of IEMOCAP[5], with four emotion types (happy, angry,
sad and neutral)
4) We conduct emotion conversion from neutral to seen emotional states (happy, sad)
and to an unseen emotional state (angry).
5) Baseline (VAW-GAN-EVC [6]): only capable of one-to-one conversion and of generating
seen emotions;
Proposed (DeepEST): one-to-many conversion, with both seen and unseen emotion generation
[5] Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N Chang,
Sungbok Lee, and Shrikanth S Narayanan, “Iemocap: Interactive emotional dyadic motion capture database,” Language
resources and evaluation, vol. 42, no. 4, pp. 335, 2008.
[6] Kun Zhou, Berrak Sisman, Mingyang Zhang, and Haizhou Li, “Converting Anyone’s Emotion: Towards Speaker-
Independent Emotional Voice Conversion,” in Proc. Interspeech 2020, 2020, pp. 3416–3420.
5.3 Objective Evaluation
We calculate Mel-cepstral distortion (MCD) [dB] to measure spectral distortion.
Proposed DeepEST outperforms the baseline framework for seen emotion
conversion and achieves comparable results for unseen emotion conversion.
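For reference, MCD between time-aligned Mel-cepstral (MCEP) frames is usually computed as below; excluding the 0th (energy) coefficient is a common convention and an assumption here, not a detail stated on the slide:

```python
import numpy as np

def mcd_db(c_ref, c_conv):
    """Mean Mel-cepstral distortion in dB between aligned MCEP sequences (T x D)."""
    diff = c_ref[:, 1:] - c_conv[:, 1:]          # drop the 0th (energy) coefficient
    per_frame = (10.0 / np.log(10)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))
```

Lower MCD indicates smaller spectral distortion between converted and reference speech.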
5.4 Subjective Evaluation
To evaluate speech quality: 1) MOS listening experiments; 2) AB preference test
To evaluate emotion similarity: 3) XAB preference test
Proposed DeepEST achieves competitive
results for both seen and unseen emotions
Code & Speech Samples
https://kunzhou9646.github.io/controllable-evc/
6. Conclusion
1. We propose a one-to-many emotional style transfer framework based on
VAW-GAN without the need for parallel data;
2. We propose to leverage deep emotional features from speech emotion
recognition to describe the emotional prosody in a continuous space;
3. We condition the decoder with controllable attributes such as deep emotional
features and F0 values;
4. We achieve competitive results for both seen and unseen emotions over the
baselines;
5. We introduce and release a new emotional speech dataset, ESD, that can be
used in speech synthesis and voice conversion.
Thank you for listening!

Introduction to IEEE STANDARDS and its different types.pptx
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptx
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptx
 

Emotional Voice Conversion with DeepEST Framework

  • 1. Seen and Unseen Emotional Style Transfer for Voice Conversion with a New Emotional Speech Dataset. Kun Zhou1, Berrak Sisman2, Rui Liu2, Haizhou Li1. 1National University of Singapore; 2Singapore University of Technology and Design.
  • 2. Outline: 1. Introduction; 2. Related Work; 3. Contributions; 4. Proposed Framework; 5. Experiments; 6. Conclusions
  • 3. Outline: 1. Introduction; 2. Related Work; 3. Contributions; 4. Proposed Framework; 5. Experiments; 6. Conclusions
  • 4. 1. Introduction. 1.1 Emotional voice conversion. Emotional voice conversion aims to transfer the emotional style of an utterance from one emotion to another while preserving the linguistic content and speaker identity. Applications: emotional text-to-speech, conversational agents. Compared with speaker voice conversion: 1) speaker identity is mainly determined by voice quality, so speaker voice conversion focuses mainly on spectrum conversion; 2) emotion is the interplay of multiple acoustic attributes concerning both spectrum and prosody, so emotional voice conversion must convert both spectrum and prosody.
  • 5. 1. Introduction. 1.2 Current challenges. 1) Non-parallel training: parallel data are difficult and expensive to collect in real life (our focus); 2) emotional prosody modelling: emotional prosody is hierarchical in nature and is perceived at the phoneme, word, and sentence level; 3) unseen emotion conversion: existing studies use discrete emotion categories, such as one-hot labels, to memorize a fixed set of emotional styles (our focus).
  • 6. Outline: 1. Introduction; 2. Related Work; 3. Contributions; 4. Proposed Framework; 5. Experiments; 6. Conclusions
  • 7. 2. Related Work. 2.1 Emotion representation. 1) Categorical representation, e.g., Ekman’s six basic emotions theory [1]: anger, disgust, fear, happiness, sadness, and surprise. 2) Dimensional representation, e.g., Russell’s circumplex model [2]: arousal, valence, and dominance. [1] Paul Ekman, “An argument for basic emotions,” Cognition & Emotion, 1992. [2] James A. Russell, “A circumplex model of affect,” Journal of Personality and Social Psychology, vol. 39, no. 6, p. 1161, 1980.
  • 8. 2. Related Work. 2.2 Emotional descriptors. 1) One-hot emotion labels; 2) expert-crafted features, e.g., from openSMILE [3]; 3) deep emotional features: data-driven, less dependent on human knowledge, and more suitable for emotional style transfer. [3] Florian Eyben, Martin Wöllmer, and Björn Schuller, “openSMILE: the Munich versatile and fast open-source audio feature extractor,” in Proc. 18th ACM International Conference on Multimedia, 2010, pp. 1459–1462.
  • 9. 2. Related Work. 2.3 Variational Autoencoding Wasserstein Generative Adversarial Network (VAW-GAN) [4]. Fig. 2: An illustration of the VAW-GAN framework. Based on the variational autoencoder (VAE) and consisting of three components: an encoder, which learns a latent code from the input; a generator, which reconstructs the input from the latent code and an inference code; and a discriminator, which judges whether the reconstructed input is real or not. First introduced for speaker voice conversion in 2017. [4] Chin-Cheng Hsu, Hsin-Te Hwang, Yi-Chiao Wu, Yu Tsao, and Hsin-Min Wang, “Voice conversion from unaligned corpora using variational autoencoding Wasserstein generative adversarial networks,” in Proc. Interspeech, 2017.
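The three VAW-GAN components above can be sketched as toy functions. This is a minimal illustrative sketch with made-up dimensions and plain linear maps, not the actual networks from the paper:

```python
import numpy as np

# Toy stand-ins for the three VAW-GAN components; all dimensions are
# hypothetical choices for illustration only.
rng = np.random.default_rng(0)
D_IN, D_Z, D_COND = 36, 16, 4            # feature, latent, condition dims

W_enc = rng.standard_normal((D_IN, D_Z))
W_dec = rng.standard_normal((D_Z + D_COND, D_IN))
W_dis = rng.standard_normal((D_IN, 1))

def encoder(x):
    """Map input features to a latent code z."""
    return np.tanh(x @ W_enc)

def generator(z, cond):
    """Reconstruct features from the latent code plus an inference/condition code."""
    return np.tanh(np.concatenate([z, cond], axis=-1) @ W_dec)

def discriminator(x):
    """Score whether features look real (higher = more real)."""
    return (x @ W_dis).squeeze(-1)

x = rng.standard_normal((8, D_IN))       # a batch of 8 feature frames
cond = rng.standard_normal((8, D_COND))  # e.g. an emotion/speaker code
x_hat = generator(encoder(x), cond)      # reconstruction, shape (8, 36)
score = discriminator(x_hat)             # realism scores, shape (8,)
```

In the full framework the encoder/generator pair is trained with a VAE reconstruction objective while the discriminator provides the Wasserstein adversarial signal.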
  • 10. Outline: 1. Introduction; 2. Related Work; 3. Contributions; 4. Proposed Framework; 5. Experiments; 6. Conclusions
  • 11. 3. Contributions. 1. A one-to-many emotional style transfer framework; 2. no parallel training data required; 3. a speech emotion recognition (SER) model used to describe the emotional style; 4. a publicly available multi-lingual and multi-speaker emotional speech corpus (ESD) that can be used for various speech synthesis tasks. To the best of our knowledge, this is the first reported study on emotional style transfer for unseen emotions.
  • 12. Outline: 1. Introduction; 2. Related Work; 3. Contributions; 4. Proposed Framework; 5. Experiments; 6. Conclusions
  • 13. 4. Proposed Framework. 1) Stage I: Emotion Descriptor Training; 2) Stage II: Encoder-Decoder Training with VAW-GAN; 3) Stage III: Run-time Conversion. Fig. 3: The training phase of the proposed DeepEST framework. Blue boxes represent the networks involved in training; red boxes represent networks that are already trained.
  • 14. 4. Proposed Framework. 1) Emotion Descriptor Training: a pre-trained SER model is used to extract deep emotional features from the input utterance. These utterance-level deep emotional features are taken to describe different emotions in a continuous space.
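One common way to collapse a frame-level SER representation into a single utterance-level descriptor is mean pooling over time. The sketch below assumes that reading; the slides do not specify the pooling operation, and the array shapes are hypothetical:

```python
import numpy as np

def utterance_emotion_embedding(frame_embeddings: np.ndarray) -> np.ndarray:
    """Average frame-level SER hidden states of shape (T, D) into one
    utterance-level deep emotional feature of shape (D,), i.e. a point
    in a continuous emotion space."""
    return frame_embeddings.mean(axis=0)

# e.g. 120 frames of hypothetical 128-dim SER activations
frames = np.random.default_rng(0).standard_normal((120, 128))
emb = utterance_emotion_embedding(frames)   # shape (128,)
```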
  • 15. 4. Proposed Framework. 2) Encoder-Decoder Training with VAW-GAN: the encoder learns to generate an emotion-independent latent representation z from the input features; the decoder learns to reconstruct the input features from the latent representation, the F0 contour, and the deep emotional features; the discriminator learns to determine whether the reconstructed features are real or not.
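Assembling the decoder's conditioning inputs frame by frame might look like the following sketch; the function name and shapes are assumptions for illustration, not taken from the paper:

```python
import numpy as np

def decoder_input(z: np.ndarray, f0: np.ndarray, emo_embed: np.ndarray) -> np.ndarray:
    """Build the decoder input by tiling the utterance-level deep emotional
    feature (D_e,) over T frames and concatenating it with the
    emotion-independent latent code z (T, D_z) and the F0 contour (T,)."""
    T = z.shape[0]
    emo = np.tile(emo_embed, (T, 1))                       # (T, D_e)
    return np.concatenate([z, f0[:, None], emo], axis=-1)  # (T, D_z + 1 + D_e)
```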
  • 16. 4. Proposed Framework. 3) Run-time Conversion: the pre-trained SER generates the deep emotional features from the reference set; we then concatenate the deep emotional features with the converted F0 and the emotion-independent z as the input to the decoder.
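The slides do not state how the F0 is converted; a standard choice in voice conversion, assumed here purely for illustration, is the log-Gaussian normalized transformation, which maps the source log-F0 statistics onto the target style's:

```python
import numpy as np

def convert_f0_log_gaussian(f0_src, src_mean, src_std, tgt_mean, tgt_std):
    """Map voiced source F0 (Hz) onto the target style's log-F0 statistics.
    Means/stds are computed over log-F0 of voiced frames; unvoiced frames
    (F0 == 0) are passed through unchanged."""
    f0_out = np.zeros_like(f0_src, dtype=float)
    voiced = f0_src > 0
    log_f0 = np.log(f0_src[voiced])
    f0_out[voiced] = np.exp((log_f0 - src_mean) / src_std * tgt_std + tgt_mean)
    return f0_out
```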
  • 17. Outline: 1. Introduction; 2. Related Work; 3. Contributions; 4. Proposed Framework; 5. Experiments; 6. Conclusions
  • 18. 5. Experiments. 5.1 Emotional Speech Dataset (ESD). 1) The dataset consists of 350 parallel utterances, with an average duration of 2.9 seconds, spoken by 10 native English and 10 native Mandarin speakers; 2) for each language, the dataset covers 5 male and 5 female speakers in five emotions: a) happy, b) sad, c) neutral, d) angry, and e) surprise; 3) the ESD dataset is publicly available at https://github.com/HLTSingapore/Emotional-Speech-Data. To the best of our knowledge, this is the first parallel emotional voice conversion dataset in a multi-lingual and multi-speaker setup.
  • 19. 5. Experiments. 5.2 Experimental setup. 1) 300-30-20 utterances (training-reference-test split); 2) one universal model that takes neutral, happy, and sad utterances as input; 3) the SER is trained on a subset of IEMOCAP [5] with four emotion types (happy, angry, sad, and neutral); 4) we conduct emotion conversion from neutral to seen emotional states (happy, sad) and to an unseen emotional state (angry); 5) baseline: VAW-GAN-EVC [6], which is only capable of one-to-one conversion and of generating seen emotions; proposed: DeepEST, which performs one-to-many conversion and generates both seen and unseen emotions. [5] Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N. Chang, Sungbok Lee, and Shrikanth S. Narayanan, “IEMOCAP: interactive emotional dyadic motion capture database,” Language Resources and Evaluation, vol. 42, no. 4, p. 335, 2008. [6] Kun Zhou, Berrak Sisman, Mingyang Zhang, and Haizhou Li, “Converting Anyone’s Emotion: Towards Speaker-Independent Emotional Voice Conversion,” in Proc. Interspeech 2020, pp. 3416–3420.
  • 20. 5. Experiments. 5.3 Objective Evaluation. We calculate Mel-cepstral distortion (MCD) [dB] to measure the spectral distortion. The proposed DeepEST outperforms the baseline framework for seen emotion conversion and achieves comparable results for unseen emotion conversion.
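MCD between time-aligned Mel-cepstral sequences is conventionally computed as (10 / ln 10) · sqrt(2 · Σ_d (c_d − c′_d)²), averaged over frames. A sketch under the common convention of excluding the 0th (energy) coefficient (the slides do not give the exact recipe):

```python
import numpy as np

def mel_cepstral_distortion(mcep_ref: np.ndarray, mcep_conv: np.ndarray) -> float:
    """MCD in dB between two time-aligned Mel-cepstra of shape (T, D),
    skipping the 0th (energy) coefficient as is conventional."""
    diff = mcep_ref[:, 1:] - mcep_conv[:, 1:]
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff**2, axis=1))
    return float(np.mean(per_frame))
```

A lower MCD indicates a smaller spectral distortion between converted and reference speech.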
  • 21. 5. Experiments. 5.4 Subjective Evaluation. To evaluate speech quality: 1) MOS listening experiments and 2) AB preference tests; to evaluate emotion similarity: 3) XAB preference tests. The proposed DeepEST achieves competitive results for both seen and unseen emotions.
  • 22. 5. Experiments. Code & speech samples: https://kunzhou9646.github.io/controllable-evc/
  • 23. Outline: 1. Introduction; 2. Related Work; 3. Contributions; 4. Proposed Framework; 5. Experiments; 6. Conclusions
  • 24. 6. Conclusions. 1. We propose a one-to-many emotional style transfer framework based on VAW-GAN without the need for parallel data; 2. we leverage deep emotional features from speech emotion recognition to describe emotional prosody in a continuous space; 3. we condition the decoder on controllable attributes such as deep emotional features and F0 values; 4. we achieve competitive results for both seen and unseen emotions over the baseline; 5. we introduce and release a new emotional speech dataset, ESD, which can be used for speech synthesis and voice conversion.
  • 25. Thank you for listening!