Diagnosing Articulatory Errors of English Learners Using Articulatory Goodness-Of-Pronunciation Features

Mispronunciation Diagnosis of L2 English at Articulatory Level Using Articulatory Goodness-Of-Pronunciation Features
Naver Tech Talk
Hyuksu Ryu
Department of Linguistics, Seoul National University, Seoul, Korea
July 3, 2017

Table of Contents
1 Introduction
2 Articulatory features
3 Method
4 Quantitative analysis of salient mispronunciation
5 Experiments
6 Conclusion

Introduction
CALL/CAPT
• Computer-Assisted Language Learning
• Computer-Aided Pronunciation Training
Mispronunciation detection & diagnosis
• Necessary for conducting effective CALL/CAPT
Previous works regarding mispronunciation detection
• Extended recognition network (ERN) based approach
• Harrison et al. (2009)
• Confidence score based approach
• Franco et al. (1997), Witt & Young (2000)

Introduction
Mispronunciation detection - ERN
• Expands pronunciation dictionaries of learners
• By predicting frequent erroneous pronunciation sequences
• When an erroneous pronunciation sequence is recognized
• the learner is considered to have made a pronunciation error
• Drawbacks
• difficult to identify mispronunciation patterns that learners frequently show for each L1-L2 pair
• difficult to guarantee that the ERN covers most of the possible mispronunciations

Introduction
Mispronunciation detection - Confidence score
• Goodness-Of-Pronunciation (Witt & Young, 2000)
• Virtues
• easy to compute
• L1/L2 independence
• Drawbacks
• difficult to provide corrective feedback
• learners do not know how to interpret a confidence score alone
• Diagnosis for the detected errors is not provided

Introduction
Previous works regarding diagnosis for mispronunciation
• Li et al. (2017)
• suggested a multi-distribution DNN
• using acoustic features, graphemes, and canonical pronunciation as input
• to predict the learners' actual pronunciation
• predicted pronunciation ≠ canonical pronunciation → mispronunciation
• Xie et al. (2016)
• extracted landmark features for nasal codas
• spoken by learners of Chinese
• detected pronunciation errors by applying an SVM
• diagnosed mispronunciation from the recognition and detection results

Introduction
How is diagnosis performed?
Pronunciation segments → Mispronunciation | Correct pronunciation
Mispronunciation → False Acceptance (FA) | True Rejection (TR)
Correct pronunciation → True Acceptance (TA) | False Rejection (FR)
True Rejection (TR) → Correct Diagnosis (CD) | Diagnostic Error (DE)
1 Pronunciation error detector
• distinguishes between mispronunciation & correct pronunciation
2 Mispronunciation Diagnosis
• carried out for instances which are correctly detected as mispronunciations (True Rejection)
• diagnosis performance - DER (diagnosis error rate)
• defined as the % of incorrectly recognized instances among TR
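The DER definition above can be written out as a small computation. This is a minimal sketch: the function name and the example counts are hypothetical and are not taken from the talk.

```python
def diagnosis_error_rate(correct_diagnoses: int, diagnostic_errors: int) -> float:
    """DER: share of true rejections (TR = CD + DE) that received a wrong diagnosis."""
    true_rejections = correct_diagnoses + diagnostic_errors
    if true_rejections == 0:
        raise ValueError("no true rejections to diagnose")
    return diagnostic_errors / true_rejections


# Hypothetical counts, for illustration only.
der = diagnosis_error_rate(correct_diagnoses=80, diagnostic_errors=20)
print(f"DER = {der:.1%}")
```

Only true rejections enter the denominator, so false acceptances and false rejections made by the detector do not show up in DER.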

Introduction
Limitation of hierarchical approaches for diagnosis
• Provide diagnosis at phone level only
• example: ‘give’ /gIv/ as /gIb/
• if the system detects an error & recognizes the phone as /b/
• it reports a diagnosis of /v/→/b/
• Diagnosis should instead be provided at the articulatory level
• for more effective feedback
• diagnosis of fricative → stop, rather than /v/→/b/
• 2-step diagnosis procedure: detection & recognition
• detection errors and recognition errors accumulate
• and affect diagnosis accuracy

Introduction
Previous studies using articulatory features
• Ryu & Chung (2016)
• propose articulatory Goodness-Of-Pronunciation
• as novel features for pronunciation assessment in English
• Li et al. (2016a)
• extend GOP into speech attributes
• to detect mispronunciation of onset consonants in learners’
Chinese
Goal of this paper
• Propose a method to provide an articulatory diagnosis
• in English produced by Korean learners
• using articulatory Goodness-Of-Pronunciation features
• based on the distinctive feature theory

Distinctive features
Phoneme
• The smallest unit that distinguishes meaning between words in a particular language
• Chomsky and Halle (1968)
• The minimum unit that discriminates phonemes in a language
• Differentiated by phonological features (Hayes 2008)
• makes the two phonemes ‘distinctively’ different
• /p/: [-voice] & /b/: [+voice]
Natural class
• A set of distinctive features
• Phoneme - represented by natural classes
• /p/: [-voice, -sonorant, -continuant, . . . , +labial]

Characteristics of distinctive features
• Binary values
• present / absent
• Possible to distinguish phonemes by multiple distinctive features
• /p/: [-voice, -sonorant, -delayed release, . . . +labial]
• /d/: [+voice, -sonorant, -delayed release, . . . -labial]
• Articulatory properties
• articulatory features in this paper → distinctive features
• based on Hayes (2008)
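The /p/ vs. /d/ example above amounts to comparing two binary feature bundles. A minimal sketch, using only the four feature values spelled out on the slide (the elided features are left out); the dictionary layout and the function name are illustrative assumptions.

```python
# Binary feature values for /p/ and /d/ exactly as listed above (True = "+", False = "-").
P = {"voice": False, "sonorant": False, "delayed release": False, "labial": True}
D = {"voice": True, "sonorant": False, "delayed release": False, "labial": False}


def distinguishing_features(a: dict, b: dict) -> list:
    """Return the features whose binary values differ between two phonemes."""
    return sorted(f for f in a.keys() & b.keys() if a[f] != b[f])


diff = distinguishing_features(P, D)
print(diff)  # ['labial', 'voice']
```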

List of Distinctive features
24 Articulatory attributes (distinctive features)
• In terms of categories of Manner, Place, and Laryngeal
Category    Attribute         Phonemes
Manner      Consonantal       /p, b, m, f, v, T, D, t, d, s, z, n, l, Ù, Ã, S, Z, ô, j, k, g, N, h, w/
Manner      Sonorant          /m, n, l, ô, j, N, w, i, u, I, U, E, o, 2, O, æ, A, aU, aI, eI, OI, Ä/
Manner      Continuant        /f, v, T, D, s, z, l, S, Z, ô, j, h, w, i, u, I, U, E, o, 2, O, æ, A, aU, aI, eI, OI, Ä/
Manner      Approximant       /l, ô, j, w, i, u, I, U, E, o, 2, O, æ, A, aU, aI, eI, OI, Ä/
Manner      Delayed release   /f, v, T, D, s, z, Ù, Ã, S, Z/
Manner      Nasal             /m, n, N/
Manner      Stop              /p, b, t, d, k, g/
Manner      Fricative         /f, v, T, D, s, z, S, Z/
Manner      Affricate         /Ù, Ã/
Place       Labial            /p, b, m, f, v, u, U, o, O, aU, OI/
Place       Round             /w, u, U, o, O, aU, OI/
Place       Labiodental       /f, v/
Place       Coronal           /T, D, t, d, s, z, n, l, Ù, Ã, S, Z, ô, Ä/
Place       Anterior          /T, D, s, z, n, l, Ä/
Place       Distributed       /T, D, Ù, Ã, S, Z, ô, Ä/
Place       Strident          /s, z, Ù, Ã, S, Z/
Place       Lateral           /l/
Place       Dorsal            /j, k, g, N, w/
Place       High              /j, k, g, N, w, i, u, I, U, aI, eI, OI/
Place       Low               /æ, A, aU, aI/
Place       Front             /j, i, I, E, æ, aI, eI, OI/
Place       Back              /w, u, U, o, 2, O, A, aU, aI, OI/
Place       Tense             /j, w, i, u, E, o, OI, eI, Ä/
Laryngeal   Voice             /b, m, v, D, d, z, n, l, Ã, Z, ô, j, g, N, w, i, u, I, U, E, o, 2, O, æ, A, aU, aI, eI, OI, Ä/
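The attribute table above can be read as a membership map from which a binary canonical value per attribute is derived for any phone. The sketch below copies six of the 24 attributes from the table, with phones written in the slide's own symbols; the function names are hypothetical, and the reduction to six attributes is only to keep the example short.

```python
# Six of the 24 attributes, with the phoneme sets as given in the table above.
ATTRIBUTES = {
    "stop": set("p b t d k g".split()),
    "fricative": set("f v T D s z S Z".split()),
    "affricate": {"Ù", "Ã"},
    "nasal": set("m n N".split()),
    "labiodental": {"f", "v"},
    "lateral": {"l"},
}


def canonical_values(phone: str) -> dict:
    """Binary presence/absence of each attribute for one phone."""
    return {name: phone in members for name, members in ATTRIBUTES.items()}


def changed_attributes(canonical: str, actual: str) -> list:
    """Attributes whose value differs between the canonical and the realized phone."""
    c, a = canonical_values(canonical), canonical_values(actual)
    return [name for name in ATTRIBUTES if c[name] != a[name]]


# 'give' /gIv/ realized as /gIb/
print(changed_attributes("v", "b"))  # ['stop', 'fricative', 'labiodental']
```

Reading the error this way yields the fricative → stop description directly, rather than the phone-level label /v/→/b/.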

Goodness-Of-Pronunciation (GOP)
• Suggested by Witt & Young (2000)
• To detect individual pronunciation errors
• Defined as the normalized posterior probability
• The distance between the learner's phone and the native acoustic model (AM)
\[
\mathrm{GOP}(p) \equiv \frac{\log P(o^{p} \mid p)}{N(p)} - \frac{\log \max_{i=1}^{I} P(o^{p} \mid q_i)}{N(p)}
\]
• N(p): number of frames composing the target phone p
• P(o^p | q_i): the probability of observing o^p given the phone q_i
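A toy computation may make the GOP definition above concrete. This is a sketch under stated assumptions, not the Kaldi-based pipeline used in the talk: frame-level log-likelihoods are assumed to be available for every phone, frames are treated as independent so that segment log-likelihoods are sums over frames, and the numbers are made up.

```python
import math


def gop(frame_loglik, target):
    """GOP of one segment.

    frame_loglik: list of frames, each a list of log P(o_t | q_i) over the I phones.
    target: index of the canonical phone p.
    """
    n_frames = len(frame_loglik)
    n_phones = len(frame_loglik[0])
    # Segment log-likelihood log P(o^p | q_i), assuming frame independence.
    segment = [sum(frame[i] for frame in frame_loglik) for i in range(n_phones)]
    return (segment[target] - max(segment)) / n_frames


# Two frames, three candidate phones (hypothetical likelihoods).
toy = [[math.log(v) for v in row] for row in ([0.7, 0.2, 0.1], [0.6, 0.3, 0.1])]
print(gop(toy, target=0))  # 0.0: the canonical phone is also the best-scoring one
print(gop(toy, target=1))  # negative: another phone explains the segment better
```

GOP is at most 0 by construction, and more negative values indicate a larger distance from the native model.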

Articulatory GOP (aGOP)
• Suggested in this paper
• Used to compare articulatory characteristics between natives and learners w.r.t. articulatory attributes
• Also used for pronunciation assessment (Ryu & Chung 2016)
\[
\mathrm{aGOP}^{k}(p) \equiv \frac{\log P(o^{p} \mid q^{k})}{N(p)} - \frac{\log \max_{i} P(o^{p} \mid q^{k}_{i})}{N(p)}
\]
• k: index of the articulatory attribute
• q^k: the canonical value of the k-th articulatory attribute at the position of the forced-aligned target segment p
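Since each attribute is modeled as present/absent, the aGOP of one attribute can be sketched with two segment log-likelihoods. The sketch assumes that the second term is a log-likelihood, in line with the GOP definition, and that the competing values q_i^k are just the two values of attribute k; the function name and all numbers are hypothetical.

```python
def agop(loglik_present: float, loglik_absent: float,
         canonical_present: bool, n_frames: int) -> float:
    """aGOP for one attribute k of a segment with n_frames frames.

    loglik_present / loglik_absent: segment log-likelihoods under the
    "attribute present" and "attribute absent" models (assumed inputs).
    """
    canonical = loglik_present if canonical_present else loglik_absent
    best = max(loglik_present, loglik_absent)
    return (canonical - best) / n_frames


# Hypothetical scores for canonical /v/ realized as a stop, over 6 frames.
fricative = agop(-30.0, -24.0, canonical_present=True, n_frames=6)
voice = agop(-20.0, -26.0, canonical_present=True, n_frames=6)
print(fricative, voice)  # -1.0 0.0
```

Computing this for each of the 24 attributes gives the aGOP part of the feature vector that the diagnosis models take as input alongside GOP.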

Previous study using articulatory features
Li et al. (2016b)
• Mispronunciation detection of Mandarin learners
• Focused on mispronunciation detection of onset consonants
• Articulatory modeling in terms of categories
• only 4 articulatory models; manner, place, voice, aspiration
• each category - multiple attributes
• limitation: low performance when a category has multiple attributes, such as place (Li et al. 2016a)
This study
• Articulatory modeling based on each attribute
• binary modeling: presence/absence
• Specify articulatory attributes in more detail based on the phonological theory
• richer articulatory information
• use them for mispronunciation diagnosis

Corpus and Annotation
Corpus
• ETRI English speech corpus produced by Korean learners
• 21,110 sentences (21 hours)
• 151 learners
Annotation
• Phone-level transcription
• Ten Korean annotators
• expertise in phonetics/phonology
• experience in phone-level transcription
• 88.13% phone-level agreement (Ryu et al., 2012)

Acoustic model
• AM for English native speech
• Using WSJ corpus of 37,000 sentences
• CD-DNN-HMM AM
• 39-Dim. MFCC+∆+∆∆
• using the default configurations of the Kaldi toolkit
• In addition to phone AM
• articulatory AM also trained in terms of articulatory attributes
• in order to compute aGOPs

Diagnosis modeling
Articulatory diagnosis framework
Forced alignment/Recognition → GOP/aGOPs extraction → Is the forced-aligned segment a consonant?
• Yes → Voicing/Place/Manner Diagnosis
• No → Rounding/Height/Backness Diagnosis

Diagnosis modeling
Articulatory diagnosis
• Based on forced-alignment, examine whether the
corresponding segment is a consonant or a vowel
• Articulatory diagnosis in the case of consonants
• voicing
• place of articulation
• manner of articulation
• Articulatory diagnosis in the case of vowels
• rounding
• height
• backness
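The consonant/vowel routing described above can be sketched as a simple lookup. The consonant set is taken from the Consonantal row of the attribute list, written in the slide's own phone symbols; the function and constant names are hypothetical.

```python
# Phones carrying the Consonantal attribute, in the slide's own phone symbols.
CONSONANTS = set("p b m f v T D t d s z n l Ù Ã S Z ô j k g N h w".split())

LEVELS = {
    True: ("voicing", "place", "manner"),       # consonants
    False: ("rounding", "height", "backness"),  # vowels
}


def diagnosis_levels(canonical_phone: str) -> tuple:
    """Pick the three diagnosis levels from the forced-aligned canonical phone."""
    return LEVELS[canonical_phone in CONSONANTS]


print(diagnosis_levels("T"))  # ('voicing', 'place', 'manner')
print(diagnosis_levels("A"))  # ('rounding', 'height', 'backness')
```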

Articulatory Diagnosis for Consonants
[Figure: GOP (phone) and the aGOPs (continuant, voice, …, alveolar) are the inputs to three diagnosis models: Voicing, Place, and Manner Diagnosis]

Articulatory Diagnosis for Consonants
• Explanatory variables: 24 aGOPs + GOP
• Response variable:
• Binary value - correct/incorrect at each articulatory level
• by comparing canonical pronunciation & the actual realization
• Example of /T/→/s/
            phone   voice       place       manner
canonical   /T/     voiceless   dental      fricative
actual      /s/     voiceless   alveolar    fricative
response            correct     incorrect   correct
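The response variable for the /T/→/s/ example above can be derived by comparing the two phones level by level. A minimal sketch: only the two phones from the example are described, and the table and function names are hypothetical.

```python
# Articulatory descriptions of the two phones in the example above.
ARTICULATION = {
    "T": {"voicing": "voiceless", "place": "dental", "manner": "fricative"},
    "s": {"voicing": "voiceless", "place": "alveolar", "manner": "fricative"},
}


def response(canonical: str, actual: str) -> dict:
    """True = correct, False = incorrect, for each articulatory level."""
    c, a = ARTICULATION[canonical], ARTICULATION[actual]
    return {level: c[level] == a[level] for level in ("voicing", "place", "manner")}


labels = response("T", "s")
print(labels)  # {'voicing': True, 'place': False, 'manner': True}
```

Each of the three values serves as the binary target of its own diagnosis model.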

Articulatory diagnosis modeling
• Feed-Forward Neural Network (FFNN)
• for each articulatory-level diagnosis
• Implemented by TensorFlow (Abadi et al., 2015)
• Hyper-parameters & configurations
• # of hidden layers: [3, 4, 5, 6, 7]
• # of nodes per layer: [128, 256, 512, 1024]
• activation function: Exponential Linear Unit (Clevert et al., 2016)
• dropout rate: 0.5
• weight initialization: He initialization (He et al., 2015)
• learning rate: 0.005
• 10,000 epochs & early stopping based on the accuracy of the validation set
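The search space listed above can be enumerated in a few lines. This is a library-free sketch rather than the TensorFlow implementation used in the talk: it only lists the candidate architectures and spells out the ELU activation and the He-initialization standard deviation. The single output unit is an assumption based on the binary response variable, and all names are hypothetical.

```python
import itertools
import math

HIDDEN_LAYERS = [3, 4, 5, 6, 7]
NODES_PER_LAYER = [128, 256, 512, 1024]
FIXED = {"dropout": 0.5, "learning_rate": 0.005, "max_epochs": 10_000}
N_INPUTS = 25  # 24 aGOPs + GOP


def elu(x: float, alpha: float = 1.0) -> float:
    """Exponential Linear Unit: identity for x > 0, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def he_std(fan_in: int) -> float:
    """Standard deviation of He-initialized weights for a layer with fan_in inputs."""
    return math.sqrt(2.0 / fan_in)


def layer_sizes(n_hidden: int, n_nodes: int) -> list:
    # One output unit is assumed here (binary correct/incorrect decision).
    return [N_INPUTS] + [n_nodes] * n_hidden + [1]


grid = [
    {"layers": layer_sizes(h, n), **FIXED}
    for h, n in itertools.product(HIDDEN_LAYERS, NODES_PER_LAYER)
]
print(len(grid))  # 20 candidate architectures per articulatory level
```

The validation set would then be used to pick one configuration from this grid for each diagnosis model.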

Articulatory Diagnosis for Vowels
[Figure: GOP (phone) and the aGOPs (continuant, voice, …, alveolar) are the inputs to three diagnosis models: Rounding, Height, and Backness Diagnosis]

Articulatory Diagnosis for Vowels
• Explanatory variables: 24 aGOPs + GOP
• Response variable:
• Binary value - correct/incorrect at each articulatory level
• by comparing canonical pronunciation & the actual realization
• Example of /A/→/o/
            phone   rounding    height      backness
canonical   /A/     unrounded   low         back
actual      /o/     rounded     mid         back
response            incorrect   incorrect   correct
• Identical configurations of diagnosis modeling w/ consonants

Quantitative analysis
Corpus analysis for salient mispronunciations
• Mispronunciation patterns in English by Korean learners
• 38,100 of the 602,810 phones in total are marked as incorrect
→ variation rate of 6.32%
Criteria for choosing salient phones
1 The variation rate > the overall variation rate (6.32%)
2 Entire freq. > 500
Salient phones
• 9 salient phones - variation freq. of 26,553 instances
• Accounting for approx. 70% of all variations

Details of salient phones
Category    Phone   Entire freq.   Variation freq.   Variation rate
Consonant   /z/     12,603         3,425             27.18%
Consonant   /D/     14,967         3,488             23.30%
Consonant   /T/     3,392          613               18.07%
Consonant   /v/     9,492          1,434             15.11%
Consonant   /d/     23,814         2,999             12.59%
Consonant   /t/     49,804         4,206             8.45%
Vowel       /A/     11,327         2,381             21.02%
Vowel       /O/     10,204         1,690             16.56%
Vowel       /2/     44,490         6,317             14.20%
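The two selection criteria can be checked against the counts in the table above. A short sketch: the counts are copied from the slides, while the variable names are hypothetical.

```python
TOTAL_PHONES, TOTAL_VARIATIONS = 602_810, 38_100

# phone: (entire freq., variation freq.), copied from the table above
COUNTS = {
    "z": (12_603, 3_425), "D": (14_967, 3_488), "T": (3_392, 613),
    "v": (9_492, 1_434), "d": (23_814, 2_999), "t": (49_804, 4_206),
    "A": (11_327, 2_381), "O": (10_204, 1_690), "2": (44_490, 6_317),
}

overall_rate = TOTAL_VARIATIONS / TOTAL_PHONES

# Criterion 1: variation rate above the overall rate; criterion 2: entire freq. > 500.
salient = [p for p, (total, var) in COUNTS.items()
           if var / total > overall_rate and total > 500]
covered = sum(COUNTS[p][1] for p in salient)

print(f"{overall_rate:.2%}", len(salient), covered, f"{covered / TOTAL_VARIATIONS:.1%}")
# 6.32% 9 26553 69.7%
```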

Determining the most noticeable variations
• Appear only in the learners’ speech
• Choose variations more frequent than in native speech (Hong et al., 2014) among salient phones
• /d, t/
1 deletion in consonant clusters (‘just’ /Ã2st/→/Ã2s/)
2 flapping (‘body’ /bAdi/→/bARi/)
• such variations are also frequent in natives’ speech (Hong et al., 2014)
• so they are not included in the list of the most noticeable variations
• Adopting the analysis of Hong et al. (2014)
• The most noticeable variations are considered salient mispronunciation patterns

Salient mispronunciations in consonants at articulatory level
Level     Canon. (variation freq.)   Act.   Example                      Freq.   Ratio
Voicing   /z/ (3,425)                /s/    does /d2z/→/d2s/             2,935   85.69%
Voicing   /v/ (1,434)                /f/    love to /l2v tU/→/l2f tU/    305     21.27%
Place     /D/ (3,488)                /d/    this /DIs/→/dIs/             3,235   92.75%
Place     /T/ (613)                  /s/    thing /TIN/→/sIN/            213     34.75%
Place     /T/ (613)                  /t/    thank /TæNk/→/tæNk/          331     54.00%
Manner    /D/ (3,488)                /d/    this /DIs/→/dIs/             3,235   92.75%
Manner    /T/ (613)                  /s/    thing /TIN/→/sIN/            213     34.75%
Manner    /v/ (1,434)                /b/    give /gIv/→/gIb/             766     53.42%

Salient mispronunciations in consonants at articulatory level
• Voicing
• devoicing
• /z/→/s/: mainly occurs in word-final position
• /v/→/f/: mostly caused by regressive assimilation
• Place of articulation
• dental→alveolar
• dental consonants do not exist among the L1 phonemes
• Manner of articulation
• fricative→stop
• learners fail to produce fricatives
• which do not exist in L1
• and substitute them w/ their corresponding stops

Salient mispronunciations in vowels at articulatory level
Level      Canon. (variation freq.)   Act.   Example                        Freq.   Ratio
Rounding   /A/ (2,381)                /o/    project /prAÃEkt/→/proÃEkt/    295     12.39%
Height     /A/ (2,381)                /o/    project /prAÃEkt/→/proÃEkt/    295     12.39%
Height     /O/ (1,690)                /o/    law /lO:/→/lo/                 735     43.49%
Height     /2/ (6,317)                /A/    another /@n2DÄ/→/@nADÄ/        1,106   17.51%
Height     /2/ (6,317)                /æ/    and /2nd/→/ænd/                1,030   16.31%
Backness   /2/ (6,317)                /æ/    and /2nd/→/ænd/                1,030   16.31%
Backness   /2/ (6,317)                /E/    Helen /hEl@n/→/hElEn/          654     10.35%

Salient mispronunciations in vowels at articulatory level
• Rounding
• unrounded→rounded
• Height
• raising: low→mid
• lowering: mid→low
• Backness
• fronting: back→front
Reason for variations
• The phoneme does not exist in L1 and is replaced w/ the most similar L1 phoneme
• /O/→/o/
• Orthographic interference (Hong et al. 2015)
• ‘project’ /prAÃEkt/→/proÃEkt/
• influenced by the grapheme ‘o’ for /A/

Experimental setup
Articulatory diagnosis experiment
• Based on the corpus analysis of salient mispronunciations
• 7 salient phones
Data balancing
• Correct ≫ incorrect → bias problem
• Adopt other phones’ correctly pronounced observations as mispronounced samples of the target segment (Li et al., 2016)
Data split
• Training : test = 8:2
• 1:1 balance of correct/incorrect in training & test set
• Augmented instances - only in training set
• Validation = 20% of training set
• to determine hyper-parameters of FFNN

Experimental setup
Details of training, validation, and test sets
Cat.        Phone   Training (Validation)   Test     Total
consonant   /z/     14,685 (2,937)          3,671    18,356
consonant   /D/     18,367 (3,673)          4,591    22,958
consonant   /T/     4,447 (889)             1,111    5,558
consonant   /v/     12,893 (2,578)          3,223    16,116
vowel       /A/     14,314 (2,862)          3,578    17,892
vowel       /O/     13,624 (2,724)          3,405    17,029
vowel       /2/     61,077 (12,215)         15,269   76,346
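The set sizes above follow from the 8:2 split with 20% of the training set held out for validation. In the sketch below, rounding the test and validation sizes down is an assumption; it happens to reproduce every row of the table. The function name is hypothetical.

```python
# Total instances per phone, copied from the table above.
TOTALS = {"z": 18_356, "D": 22_958, "T": 5_558, "v": 16_116,
          "A": 17_892, "O": 17_029, "2": 76_346}


def split_sizes(total: int) -> tuple:
    """Return (training, validation, test); validation is a subset of training."""
    test = total // 5          # 20% for testing, rounded down (assumed)
    training = total - test    # remaining 80%
    validation = training // 5 # 20% of training, rounded down (assumed)
    return training, validation, test


for phone, total in TOTALS.items():
    print(phone, split_sizes(total))
```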

Experimental results
Performance of articulatory diagnosis in consonants
• On average: > 70% accuracy & > .75 F1 score
• The proposed method is effective for articulatory diagnosis
Phone     Level     Accuracy   Precision   Recall   F1
/z/       voicing   70.14%     0.683       0.890    0.773
/z/       place     85.57%     0.857       0.877    0.867
/z/       manner    79.38%     0.821       0.825    0.823
/D/       voicing   83.60%     0.837       0.898    0.866
/D/       place     60.50%     0.623       0.670    0.646
/D/       manner    62.13%     0.632       0.852    0.726
/T/       voicing   79.68%     0.814       0.857    0.835
/T/       place     65.83%     0.672       0.697    0.684
/T/       manner    71.76%     0.761       0.830    0.794
/v/       voicing   80.18%     0.821       0.859    0.840
/v/       place     75.43%     0.795       0.842    0.818
/v/       manner    71.40%     0.751       0.815    0.782
average   voicing   78.40%     0.789       0.876    0.828
average   place     71.83%     0.737       0.772    0.754
average   manner    71.17%     0.741       0.831    0.781
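The F1 column is the harmonic mean of precision and recall, and the "average" rows are plain means over the four phones. A quick check on two entries of the table above; the helper name is hypothetical.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# /z/ voicing: precision 0.683, recall 0.890
print(round(f1(0.683, 0.890), 3))  # 0.773

# Average voicing accuracy over /z/, /D/, /T/, /v/
voicing_accuracy = [70.14, 83.60, 79.68, 80.18]
mean_voicing = sum(voicing_accuracy) / len(voicing_accuracy)
print(round(mean_voicing, 2))  # 78.4
```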

Performance of articulatory diagnosis in consonants
• Performance of place for /D, T/
• slightly lower than average
• Why?
• inter-dental fricatives
• relatively small amplitude (low energy)
• difficult to distinguish mispronunciations
• these factors affect the performance

Performance of articulatory diagnosis in vowels
• High average performance, except in height
• rounding, backness - > 70% accuracy
• height - 65%
Phone     Level      Accuracy   Precision   Recall   F1
/A/       rounding   83.79%     0.839       0.888    0.863
/A/       height     59.60%     0.658       0.853    0.743
/A/       backness   80.94%     0.811       0.896    0.851
/O/       rounding   70.43%     0.731       0.855    0.788
/O/       height     70.93%     0.759       0.898    0.823
/O/       backness   75.98%     0.753       0.887    0.815
/2/       rounding   88.25%     0.883       0.865    0.874
/2/       height     65.16%     0.693       0.887    0.778
/2/       backness   65.59%     0.761       0.859    0.807
average   rounding   80.82%     0.818       0.869    0.843
average   height     65.23%     0.703       0.879    0.782
average   backness   74.17%     0.775       0.881    0.824

Performance of articulatory diagnosis in vowels
• Low performance for certain articulatory levels
• Training sets contain variations to diphthongs
• /A/→/aI/
• Diphthongs - drastic articulatory change within a segment
• ex. /OI/
• mid→high in height
• back→front in backness

Conclusion
In this paper,
• We proposed a method to provide an articulatory diagnosis
• in English spoken by Korean learners
• using articulatory Goodness-Of-Pronunciation (aGOP) features
• based on the distinctive feature theory in phonology
So far, previous studies regarding diagnosis have a limitation
• They carry out diagnosis at phone level
• Diagnosis needs to be performed at articulatory level for corrective feedback

Conclusion
We performed
• Articulatory diagnosis modeling
• consonants: voicing, place, and manner of articulation
• vowels: rounding, height, and backness
• Corpus-based analysis of salient mispronunciation patterns
According to the results,
• The proposed method for articulatory diagnosis
• > 70% accuracy & > .75 of F1-score for all articulatory levels
• except height in vowels
• Effective mispronunciation diagnosis at articulatory level by the proposed method

Conclusion
Limitations
• Only decides whether the pronunciation is correct or not at the articulatory level
• Does not provide corrective feedback on how to correct the pronunciation
In future work,
• Need to extend the experiment to provide corrective feedback at articulatory levels

Thanks for listening
Any questions?
