8. WFST in Speech recognition
• WFSTs give a common and natural representation for
the major components of speech recognition systems:
Hidden Markov models (HMMs)
Context-dependency models
Pronunciation dictionaries
Statistical grammars
Word/phone lattices
• Why WFST?
Efficient algorithms exist
A unified framework for representing
different layers of knowledge
The graph can be optimized offline, at training time
9. WFST in Speech recognition
• WFST in KALDI
Decoding graph : min(det(H ∘ C ∘ L ∘ G))
H: mapping from PDFs to context labels
C: mapping from context labels to phones
L: mapping from phones to words
G: grammar or language model
What are ∘, det, min?
11. WFST : Theory and examples
• Finite state automata (acceptors)
Representation of a possibly infinite set of strings
(ex) {ab}
Numbers in circles : state labels
Labels on arcs : symbols
The set of accepted strings can be infinite
(ex) {ab, aab, aaab, …}
A string is ‘accepted’ if
there is a path carrying that sequence of symbols
Epsilon symbol : ‘no symbol there’
Usually the symbol numbered 0
In the figure it simply forms a loop, consuming no symbol
They are called acceptors since they accept each string
that can be read along a path from the start state to a final state
12. WFST : Theory and examples
• Weight sets as semirings
Ring : R(⊕, ⊗) with identity elements 0̄ and 1̄
Semiring : a ring that does not require an additive
inverse for each element
Sum (⊕) : to compute the weight of a sequence (combining alternative paths)
Product (⊗) : to compute the weight of a path (combining arc weights)
(Figure: circles show state labels; arcs are labeled symbol/weight.)
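As a quick illustration (a toy sketch, not tied to any particular toolkit), the two semirings most used in decoding graphs can be written in a few lines of Python; ⊕ combines alternative paths for the same string and ⊗ combines the arcs along one path:

```python
import math

# Minimal sketch of the tropical and log semirings used in ASR decoding graphs.
# Weights are negative log probabilities (costs) in both cases.

class TropicalSemiring:
    """x ⊕ y = min(x, y), x ⊗ y = x + y, 0̄ = +inf, 1̄ = 0."""
    zero = math.inf   # additive identity: an impossible path
    one = 0.0         # multiplicative identity: a free transition

    @staticmethod
    def plus(x, y):    # combine alternative paths: keep the best (Viterbi)
        return min(x, y)

    @staticmethod
    def times(x, y):   # combine arcs along a path: add costs
        return x + y

class LogSemiring:
    """x ⊕ y = -log(e^-x + e^-y), x ⊗ y = x + y (sums path probabilities)."""
    zero = math.inf
    one = 0.0

    @staticmethod
    def plus(x, y):    # combine alternative paths: add their probabilities
        if math.isinf(x):
            return y
        if math.isinf(y):
            return x
        return min(x, y) - math.log1p(math.exp(-abs(x - y)))

    @staticmethod
    def times(x, y):
        return x + y

# Two paths of cost 1.0 and 2.0 carrying the same string:
print(TropicalSemiring.plus(1.0, 2.0))  # 1.0   (best path only)
print(LogSemiring.plus(1.0, 2.0))       # ~0.69 (total probability of both paths)
```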
13. WFST : Theory and examples
• Weighted finite state automata
(Figures: a toy finite-state language model, and the possible pronunciations of ‘data’ as in a real language model.)
A weighted finite state automaton consists of :
- A set of states
- An initial state
- A set of final states
- A set of transitions between states
- Each transition : source state / destination state / label / weight
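To make the definition concrete, here is a minimal Python sketch (illustrative only, with made-up weights) of a weighted acceptor over the tropical semiring and a function returning the weight of the best path accepting a given string:

```python
import math
from collections import defaultdict

# Minimal weighted acceptor over the tropical semiring (illustrative only).
# A transition is (source, label, weight, destination); weights are costs.

class WFSA:
    def __init__(self, start, finals, transitions):
        self.start = start
        self.finals = dict(finals)            # final state -> final weight
        self.arcs = defaultdict(list)         # source -> [(label, weight, dest)]
        for src, label, weight, dst in transitions:
            self.arcs[src].append((label, weight, dst))

    def weight(self, symbols):
        """Best-path (tropical) weight of the string, or inf if not accepted."""
        best = {self.start: 0.0}              # state -> cost of best prefix path
        for sym in symbols:
            nxt = {}
            for state, cost in best.items():
                for label, w, dst in self.arcs[state]:
                    if label == sym:
                        nxt[dst] = min(nxt.get(dst, math.inf), cost + w)
            best = nxt
        return min((cost + self.finals[s] for s, cost in best.items()
                    if s in self.finals), default=math.inf)

# Toy acceptor for pronunciations of 'data' (weights are made-up costs):
a = WFSA(start=0, finals={4: 0.0}, transitions=[
    (0, "d", 1.0, 1),
    (1, "ey", 0.5, 2), (1, "ae", 0.3, 2),
    (2, "t", 0.3, 3), (2, "dx", 0.7, 3),
    (3, "ax", 1.0, 4),
])
print(a.weight(["d", "ey", "t", "ax"]))   # 2.8
print(a.weight(["d", "ax"]))              # inf (string is not accepted)
```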
14. WFST : Theory and examples
• Weighted finite state transducers :
A WFSA with an input label, an output label, and a weight
on each transition
(ex) transduces a phone string to a word string
(Figure: pronunciation transducer; inputs are phones and outputs are words; the word is output by the transition that consumes the first phone of its pronunciation, along a path that can be read from the start state to a final state.)
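A minimal sketch of the same idea in code (toy arcs and weights, not an actual toolkit API): the transducer maps a phone string to the best word string, with the word emitted on the arc that consumes the first phone of its pronunciation.

```python
import math

# Toy WFST (illustrative only): each arc is
# (source, input_label, output_label, weight, destination).
# "<eps>" as an output label means "no output symbol on this arc".

ARCS = [
    # 'data': the word is output on the arc that consumes its first phone.
    (0, "d", "data", 1.0, 1),
    (1, "ey", "<eps>", 0.5, 2),
    (1, "ae", "<eps>", 0.3, 2),
    (2, "t", "<eps>", 0.3, 3),
    (2, "dx", "<eps>", 0.7, 3),
    (3, "ax", "<eps>", 1.0, 4),
]
START, FINALS = 0, {4: 0.0}

def transduce(phones):
    """Return (best cost, word string) for a phone string, tropical semiring."""
    best = {START: (0.0, [])}                 # state -> (cost, output so far)
    for p in phones:
        nxt = {}
        for state, (cost, out) in best.items():
            for src, ilabel, olabel, w, dst in ARCS:
                if src == state and ilabel == p:
                    cand = (cost + w,
                            out + ([olabel] if olabel != "<eps>" else []))
                    if dst not in nxt or cand[0] < nxt[dst][0]:
                        nxt[dst] = cand
        best = nxt
    finals = [(c + FINALS[s], out) for s, (c, out) in best.items() if s in FINALS]
    return min(finals) if finals else (math.inf, [])

print(transduce(["d", "ae", "t", "ax"]))      # (2.6, ['data'])
```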
15. WFST : Theory and examples
• Weighted finite state transducers contain more
information than WFSAs
Can represent a relationship between two levels of
representation
(ex) between phones and words / between HMMs and context-
independent phones.
Possible to combine the pronunciation transducers for
more than one word without losing word identity
16. WFST : Theory and examples
• Elementary operations
Combine transducers in parallel (union) or in series (concatenation)
Two weighted automata are equivalent
if they associate the same weight to each input string
• Composition
• Determinization
• Weight pushing
• Minimization
17. WFST : Theory and examples
• Composition
Transducer operation for combining different levels of
representation
Key operation for model combination
(Figure: composition example in the log probability semiring.)
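A rough sketch of epsilon-free composition on the toy arc representation above: states of the result are pairs of component states, an arc of T1 pairs with an arc of T2 whenever T1's output label matches T2's input label, and weights combine with ⊗ (here, addition of costs). Epsilon handling, which real composition needs, is deliberately omitted.

```python
from collections import deque

# Epsilon-free composition sketch (illustrative only). A transducer is a dict:
#   {"start": s, "finals": {state: weight}, "arcs": [(src, in, out, w, dst)]}
# Weights are costs, so the ⊗ operation is ordinary addition.

def compose(t1, t2):
    start = (t1["start"], t2["start"])
    arcs, finals, seen, queue = [], {}, {start}, deque([start])
    while queue:
        q1, q2 = queue.popleft()
        # A pair state is final iff both component states are final.
        if q1 in t1["finals"] and q2 in t2["finals"]:
            finals[(q1, q2)] = t1["finals"][q1] + t2["finals"][q2]
        for s1, i1, o1, w1, d1 in t1["arcs"]:
            if s1 != q1:
                continue
            for s2, i2, o2, w2, d2 in t2["arcs"]:
                # Match an output label of T1 against an input label of T2.
                if s2 == q2 and i2 == o1:
                    dst = (d1, d2)
                    arcs.append(((q1, q2), i1, o2, w1 + w2, dst))
                    if dst not in seen:
                        seen.add(dst)
                        queue.append(dst)
    return {"start": start, "finals": finals, "arcs": arcs}

# T1 maps "a" to "b", T2 maps "b" to "c"; their composition maps "a" to "c".
T1 = {"start": 0, "finals": {1: 0.0}, "arcs": [(0, "a", "b", 1.0, 1)]}
T2 = {"start": 0, "finals": {1: 0.0}, "arcs": [(0, "b", "c", 2.0, 1)]}
print(compose(T1, T2)["arcs"])   # [((0, 0), 'a', 'c', 3.0, (1, 1))]
```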
18. WFST : Theory and examples
• Determinization
Here ‘deterministic’ means deterministic on the input symbols
A deterministic automaton:
1) has a unique initial state, and 2) no two transitions
leaving any state share the same input label
Key operation for removing redundant paths
(Figure: determinization example in the tropical semiring.)
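The definition is easy to state in code. This toy check (same arc representation as the earlier sketches) tests whether a machine is deterministic on its input labels, which is exactly the property determinization establishes:

```python
def is_deterministic(fst):
    """True if no two arcs leaving any state share the same input label.
    fst["arcs"] holds (src, in_label, out_label, weight, dst) tuples;
    a single start state is assumed (property 1 of the definition)."""
    seen = set()
    for src, ilabel, _olabel, _w, _dst in fst["arcs"]:
        if (src, ilabel) in seen:
            return False          # two arcs leaving `src` read the same symbol
        seen.add((src, ilabel))
    return True

# Non-deterministic: state 0 has two arcs reading "d".
fst = {"start": 0, "finals": {2: 0.0},
       "arcs": [(0, "d", "data", 0.5, 1), (0, "d", "dew", 0.5, 2)]}
print(is_deterministic(fst))   # False
```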
19. WFST : Theory and examples
• Weight pushing
Creates an equivalent pushed/stochastic machine
Operation that makes the FST stochastic
Stochastic FST : the weights of the arcs leaving each state sum to one
Useful as a first step of minimization; also redistributes
weight among transitions to improve pruned search
(Figure: weight pushing example in the probability semiring.)
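For an acyclic machine in the probability semiring, weight pushing can be sketched directly from its definition: compute each state's total outgoing path probability (its potential), then renormalize arcs and final weights with it, after which each state's outgoing weights sum to one. This is an illustrative sketch only; the labels and numbers are made up, and it assumes an acyclic graph.

```python
from functools import lru_cache

# Acyclic weight pushing in the probability semiring (illustrative sketch).
# Arcs are (src, label, prob, dst); finals map a state to its final probability.

FST = {
    "start": 0,
    "finals": {2: 1.0},
    "arcs": [(0, "a", 0.1, 1), (0, "b", 0.3, 1), (1, "c", 0.5, 2)],
}

def push_weights(fst):
    arcs_from = {}
    for arc in fst["arcs"]:
        arcs_from.setdefault(arc[0], []).append(arc)

    @lru_cache(maxsize=None)
    def potential(q):
        # Total probability mass of all paths from q to a final state.
        total = fst["finals"].get(q, 0.0)
        for _src, _label, p, dst in arcs_from.get(q, []):
            total += p * potential(dst)
        return total

    pushed_arcs = [
        (src, label, p * potential(dst) / potential(src), dst)
        for src, label, p, dst in fst["arcs"]
    ]
    pushed_finals = {q: w / potential(q) for q, w in fst["finals"].items()}
    # potential(start) is the machine's total weight, left over after pushing.
    return {"start": fst["start"], "finals": pushed_finals,
            "arcs": pushed_arcs, "total": potential(fst["start"])}

pushed = push_weights(FST)
print(pushed["arcs"])    # arcs out of state 0 now carry 0.25 and 0.75 (sum to 1)
print(pushed["total"])   # 0.2 = total probability mass of the original machine
```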
20. WFST : Theory and examples
• Minimization
Any deterministic weighted automaton can be
minimized
The minimized automaton B of A
has the fewest states and transitions
among all deterministic automata equivalent to A
Key operation for size reduction
(Figures: a deterministic weighted automaton; the same automaton after weight pushing in the tropical semiring; the equivalent minimal weighted automaton.)
22. Speech recognition revisited
• WFST in KALDI
Decoding graph : min(det(H ∘ C ∘ L ∘ G))
H: mapping from PDFs to context labels
C: mapping from context labels to phones
L: mapping from phones to words
G: grammar or language model
H ∘ C ∘ L ∘ G : mapping from PDFs to words, based on the language model
23. Speech recognition revisited
• Construction of decoding network
Decoding graph : min(det(H ∘ C ∘ L ∘ G))
1) The decoder finds word pronunciations in its lexicon and
substitutes them into the grammar
(the grammar might be restricted to trigrams)
2) The decoder identifies the correct context-dependent models to use for
each phone in context
(the models might be triphonic)
3) The decoder substitutes the HMMs, with their particular model topologies,
to create an HMM-level transducer
In KALDI, these steps are performed by the mkgraph script
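A literal reading of the slide's formula as pseudocode (the FST operations are passed in as functions, e.g. toy versions like the sketches above or a real FST library; Kaldi's actual mkgraph recipe interleaves determinization and minimization with each composition stage and handles disambiguation symbols and self-loops separately):

```python
def make_decoding_graph(H, C, L, G, compose, determinize, minimize):
    """Sketch of min(det(H ∘ C ∘ L ∘ G)); the operations are injected
    so this stays a self-contained illustration rather than a toolkit API."""
    HCLG = compose(H, compose(C, compose(L, G)))   # PDFs -> words under grammar G
    return minimize(determinize(HCLG))             # optimize the composed graph
```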
24. Speech recognition revisited
• Construction of decoding network
G : Probabilistic grammar or language model acceptor
Stochastic n-gram models can be represented compactly by
finite-state models
Input : word
Weight : history-dependent word probability
(Figure: word bigram acceptor; arcs carry weights −log p̂(w2 | w1), and backoff arcs carry the backoff weight.)
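As an illustration (toy probabilities, not a real language model), a backoff bigram can be laid out as an acceptor whose states are word histories: observed bigrams get direct arcs weighted by −log p̂(w2 | w1), and an epsilon backoff arc leads to a unigram state.

```python
import math

# Toy backoff bigram acceptor (illustrative numbers only).
# States: "<bo>" is the backoff/unigram state; every other state is a
# one-word history. Arcs are (src, label, cost, dst) with cost = -log prob.

UNIGRAMS = {"hello": 0.6, "world": 0.4}
BIGRAMS  = {("hello", "world"): 0.9}     # p(world | hello)
BACKOFF  = {"hello": 0.1, "world": 1.0}  # backoff weight per history

def build_bigram_acceptor():
    arcs = []
    # Leaving the backoff state: unigram probabilities.
    for w, p in UNIGRAMS.items():
        arcs.append(("<bo>", w, -math.log(p), w))
    for hist in UNIGRAMS:
        # Observed bigrams: direct arc between history states.
        for (w1, w2), p in BIGRAMS.items():
            if w1 == hist:
                arcs.append((hist, w2, -math.log(p), w2))
        # Backoff: epsilon arc down to the unigram state.
        arcs.append((hist, "<eps>", -math.log(BACKOFF[hist]), "<bo>"))
    return {"start": "<bo>", "finals": set(UNIGRAMS), "arcs": arcs}

G = build_bigram_acceptor()
for arc in G["arcs"]:
    print(arc)
```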
25. Speech recognition revisited
• Construction of decoding network
L : Pronunciation lexicon
Input : context-independent phone (phoneme)
Output : word
Weight : pronunciation probability
26. Speech recognition revisited
• Construction of decoding network
L : Pronunciation lexicon
Non-deterministic because of homophones
• Ex) read <-> red
Disambiguation symbols are added (see the sketch below)
• Removed at the last stage
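A sketch (toy lexicon, hypothetical helper, not Kaldi's actual lexicon preparation) of how L can be built: each pronunciation becomes a chain of arcs from the start state back to it, the word is output on the first phone's arc, and homophones get distinct #k disambiguation symbols appended so the result stays determinizable.

```python
from collections import defaultdict

# Toy lexicon transducer builder (illustrative only).
# Arcs are (src, input_phone, output_word, cost, dst); cost 0.0 throughout.

LEXICON = [
    ("read", ["r", "eh", "d"]),
    ("red",  ["r", "eh", "d"]),   # homophone of 'read'
    ("data", ["d", "ey", "t", "ax"]),
]

def build_L(lexicon):
    # Group words by pronunciation to find homophones.
    by_pron = defaultdict(list)
    for word, phones in lexicon:
        by_pron[tuple(phones)].append(word)

    arcs, next_state = [], 1      # state 0 is both start and (single) final state
    for phones, words in by_pron.items():
        for k, word in enumerate(words, start=1):
            seq = list(phones)
            if len(words) > 1:
                seq.append(f"#{k}")            # disambiguation symbol, removed later
            src = 0
            for i, phone in enumerate(seq):
                out = word if i == 0 else "<eps>"   # word emitted on first phone
                dst = 0 if i == len(seq) - 1 else next_state
                arcs.append((src, phone, out, 0.0, dst))
                if dst != 0:
                    src, next_state = dst, next_state + 1
    return {"start": 0, "finals": {0: 0.0}, "arcs": arcs}

L = build_L(LEXICON)
for arc in L["arcs"]:
    print(arc)
```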
27. Speech recognition revisited
• Construction of decoding network
C : Context-dependency transducer
Input : context-dependent phone
(triphone)
Output : context-independent phone
(phone)
(Figures: non-deterministic and deterministic forms of C; arcs are labeled Triphone:Phone/LeftContext_RightContext.)
28. Speech recognition revisited
• Construction of decoding network
Decoding graph : min(det(H ∘ C ∘ L ∘ G))
C ∘ L ∘ G : transducer that maps from
context-dependent phones to word strings,
restricted to grammar G
• Determinizable if C, L, and G are determinizable
• G is determinizable if G is an n-gram language model
• L may not be determinizable if L has ambiguities
• Revised L̃ with auxiliary symbols tagging homophones
• Modified C̃ that pairs the context-independent auxiliary symbols
in the lexicon with new context-dependent auxiliary symbols
C̃ ∘ L̃ ∘ G : revised transducer that is determinizable and minimizable
29. Speech recognition revisited
• Construction of decoding network
H : HMM topology transducer (maps HMM states to context-dependent phones)
Input : state
Output : context-dependent phone (triphone)
Weight : HMM transition probability
(Figure: H transducer for the monophone case, without self-loops, as used in KALDI.)
30. Speech recognition revisited
• Construction of decoding network
Decoding graph : min(det(H̃ ∘ C̃ ∘ L̃ ∘ G))
C̃ ∘ L̃ ∘ G : revised determinizable and minimizable transducer
• H : closure of the union of the individual HMMs
• H̃ : H with self-loops added for the auxiliary distribution-name input labels
and the auxiliary context-dependent phone output labels
H̃ ∘ C̃ ∘ L̃ ∘ G : transducer that maps from distributions to word
strings, restricted to G
Standardized integrated transducer :
unique deterministic, minimal transducer for which
the weights for all transitions leaving any state sum to 1 in probability
33. Speech recognition revisited
• Construction of decoding network
min_tropical(Det(L̃ ∘ G))
min_log(Det(L̃ ∘ G))
The log-semiring version is conjectured to be best for the pruning efficiency
of a standard Viterbi beam search
34. Speech recognition revisited
• Construction of decoding network
Weight and label pushing
Decoding graph construction
Decoding with WFSTs
Weight pushing makes the outgoing arcs of each state a stochastic distribution
• After label pushing, output labels are no longer synchronized with inputs in the WFST
35. Speech recognition revisited
• Construction of decoding network
Weight and label pushing
Decoding graph construction
Decoding with WFSTs
Determinization for WFSTs can fail
Need to guarantee that the final HCLG is stochastic
• Needed for optimal pruning
36. Speech recognition revisited
• Construction of decoding network
Weight and label pushing
Decoding graph construction
Decoding with WFSTs
Finding the best path : solving W′ = argmax_W P(X | W) P(W)
• Compose the recognizer as HCLG, which maps HMM states to word sequences
• Decode by aligning the feature vectors X with HCLG
• W′ = argmax_W [ X ∘ (H ∘ C ∘ L ∘ G) ]
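As a very rough illustration of the last bullet (toy data structures and numbers, not Kaldi's decoder), Viterbi search over HCLG can be sketched as token passing: each arc's input label selects a PDF, its graph weight is combined with the acoustic cost of that PDF for the current frame, and only the best-scoring token per state survives.

```python
import math

# Toy Viterbi decoding over an HCLG-like graph (illustrative sketch only).
# Arcs are (src, pdf_id, word, graph_cost, dst). Epsilon arcs, self-loop
# handling, lattice generation and beam pruning are all omitted.

HCLG = {
    "start": 0,
    "finals": {2: 0.0},
    "arcs": [(0, 0, "yes", 0.7, 1), (0, 1, "no", 0.7, 1), (1, 2, "<eps>", 0.0, 2)],
}

def decode(acoustic_costs):
    """acoustic_costs[t][pdf] = -log p(x_t | pdf). Returns (cost, words)."""
    tokens = {HCLG["start"]: (0.0, [])}        # state -> (cost, word sequence)
    for frame in acoustic_costs:
        new_tokens = {}
        for state, (cost, words) in tokens.items():
            for src, pdf, word, gcost, dst in HCLG["arcs"]:
                if src != state:
                    continue
                total = cost + gcost + frame[pdf]          # graph + acoustic cost
                out = words + [word] if word != "<eps>" else words
                if dst not in new_tokens or total < new_tokens[dst][0]:
                    new_tokens[dst] = (total, out)
        tokens = new_tokens
    finals = [(c + HCLG["finals"][s], w) for s, (c, w) in tokens.items()
              if s in HCLG["finals"]]
    return min(finals) if finals else (math.inf, [])

# Two frames of acoustic costs for three PDFs (made-up numbers):
print(decode([{0: 0.2, 1: 1.5, 2: 3.0}, {0: 2.0, 1: 2.0, 2: 0.1}]))
# -> (1.0, ['yes'])
```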
39. WFST level utterance verification
• WFST-level detection
Operations when building the graph
Build a new decoding graph based on a new corpus
• For example, from a Q&A-style corpus
Operations when searching paths
Detect the utterance by giving higher scores to paths related to the
objective
• Lattice structures and classification algorithms from NLP can be
considered
(Figure: an example utterance marked ‘Not a question!’ because the classification algorithm assigns it a low score.)
40. Summary
• WFSTs give a common and natural representation for the
major components of speech recognition systems.
• In a speech recognition system, the central WFST is the
decoding graph, which maps PDFs to words based on the
language model.
• WFST-based utterance verification includes changing
weights in graphs such as C, L, or G, or reweighting the
lattice structure.
41. References
• M. Mohri, F. Pereira, and M. Riley, “Speech recognition with weighted
finite-state transducers,” in Springer Handbook of Speech Processing,
Springer Berlin Heidelberg, pp. 559-584, 2008.
• OpenFst: An Open-Source, Weighted Finite-State Transducer Library and
its Applications to Speech and Language, Part I. Theory and Algorithms.
http://www.openfst.org/twiki/pub/FST/FstHltTutorial/tutorial_part1.pdf
• M. Hannemann, Weighted Finite State Transducers in Automatic Speech
Recognition, ZRE lecture 15, Apr. 2015.
http://www.fit.vutbr.cz/study/courses/ZRE/public/pred/10_wfst_lvcsr/zre_lecture_asr_wfst_2015.pdf
• T. Hanneforth, Finite-state Machines: Theory and Applications, Dec. 2008.
http://tagh.de/tom/wp-content/uploads/fsm_weightedautomata.pdf
• A helpful explanation of KALDI decoding-graph construction:
http://vpanayotov.blogspot.kr/2012/06/kaldi-decoding-graph-construction.html