8. WFST in Speech recognition
• WFSTs give a common and natural representation for
the major components of speech recognition systems:
Hidden Markov models (HMMs)
Context-dependency models
Pronunciation dictionaries
Statistical grammars
Word/phone lattices
• Why WFST?
Efficient algorithms exist
A unified framework for representing
different layers of knowledge
The graph can be optimized offline, at training time
9. WFST in Speech recognition
• WFST in KALDI
Decoding graph : min(det(H ∘ C ∘ L ∘ G))
H: mapping from PDFs to context labels
C: mapping from context labels to phones
L: mapping from phones to words
G: grammar or language model
What are ∘, det, min?
11. WFST : Theory and examples
• Finite state automata (acceptors)
Representation of a possibly infinite set of strings
(ex) {ab}
Numbers in circles : state labels
Labels on arcs : symbols
The set of accepted strings can be infinite
(ex) {ab, aab, aaab, …}
A string is ‘accepted’ if
there is a path carrying that sequence of symbols
Epsilon symbol : ‘no symbol there’
Usually the symbol numbered 0
In the figure it simply forms a loop, consuming no symbol
They are called acceptors since they accept each string
that can be read along a path from the start state to a final state
12. WFST : Theory and examples
• Weight sets as semirings
Ring : R(⊕, ⊗) with identity elements 0̄ and 1̄
Semiring : a ring that does not require an additive
inverse for each element
Sum (⊕) : to compute the weight of a sequence (combining alternative paths)
Product (⊗) : to compute the weight of a path (combining arc weights)
(Figure: circles show state labels; arcs are labeled symbol/weight.)
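As a quick illustration (a toy sketch, not tied to any particular toolkit), the two semirings most used in decoding graphs can be written in a few lines of Python; ⊕ combines alternative paths for the same string and ⊗ combines the arcs along one path:

```python
import math

# Minimal sketch of the tropical and log semirings used in ASR decoding graphs.
# Weights are negative log probabilities (costs) in both cases.

class TropicalSemiring:
    """x ⊕ y = min(x, y), x ⊗ y = x + y, 0̄ = +inf, 1̄ = 0."""
    zero = math.inf   # additive identity: an impossible path
    one = 0.0         # multiplicative identity: a free transition

    @staticmethod
    def plus(x, y):    # combine alternative paths: keep the best (Viterbi)
        return min(x, y)

    @staticmethod
    def times(x, y):   # combine arcs along a path: add costs
        return x + y

class LogSemiring:
    """x ⊕ y = -log(e^-x + e^-y), x ⊗ y = x + y (sums path probabilities)."""
    zero = math.inf
    one = 0.0

    @staticmethod
    def plus(x, y):    # combine alternative paths: add their probabilities
        if math.isinf(x):
            return y
        if math.isinf(y):
            return x
        return min(x, y) - math.log1p(math.exp(-abs(x - y)))

    @staticmethod
    def times(x, y):
        return x + y

# Two paths of cost 1.0 and 2.0 carrying the same string:
print(TropicalSemiring.plus(1.0, 2.0))  # 1.0   (best path only)
print(LogSemiring.plus(1.0, 2.0))       # ~0.69 (total probability of both paths)
```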
13. WFST : Theory and examples
• Weighted finite state automata
(Figures: a toy finite-state language model, and the possible pronunciations of ‘data’ as in a real language model.)
A weighted finite state automaton consists of :
- A set of states
- An initial state
- A set of final states
- A set of transitions between states
- Each transition : source state / destination state / label / weight
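To make the definition concrete, here is a minimal Python sketch (illustrative only, with made-up weights) of a weighted acceptor over the tropical semiring and a function returning the weight of the best path accepting a given string:

```python
import math
from collections import defaultdict

# Minimal weighted acceptor over the tropical semiring (illustrative only).
# A transition is (source, label, weight, destination); weights are costs.

class WFSA:
    def __init__(self, start, finals, transitions):
        self.start = start
        self.finals = dict(finals)            # final state -> final weight
        self.arcs = defaultdict(list)         # source -> [(label, weight, dest)]
        for src, label, weight, dst in transitions:
            self.arcs[src].append((label, weight, dst))

    def weight(self, symbols):
        """Best-path (tropical) weight of the string, or inf if not accepted."""
        best = {self.start: 0.0}              # state -> cost of best prefix path
        for sym in symbols:
            nxt = {}
            for state, cost in best.items():
                for label, w, dst in self.arcs[state]:
                    if label == sym:
                        nxt[dst] = min(nxt.get(dst, math.inf), cost + w)
            best = nxt
        return min((cost + self.finals[s] for s, cost in best.items()
                    if s in self.finals), default=math.inf)

# Toy acceptor for pronunciations of 'data' (weights are made-up costs):
a = WFSA(start=0, finals={4: 0.0}, transitions=[
    (0, "d", 1.0, 1),
    (1, "ey", 0.5, 2), (1, "ae", 0.3, 2),
    (2, "t", 0.3, 3), (2, "dx", 0.7, 3),
    (3, "ax", 1.0, 4),
])
print(a.weight(["d", "ey", "t", "ax"]))   # 2.8
print(a.weight(["d", "ax"]))              # inf (string is not accepted)
```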
14. WFST : Theory and examples
• Weighted finite state transducers :
A WFSA with an input label, an output label, and a weight
on each transition
(ex) transduces a phone string to a word string
(Figure: pronunciation transducer; inputs are phones and outputs are words; the word is output by the transition that consumes the first phone of its pronunciation, along a path that can be read from the start state to a final state.)
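A minimal sketch of the same idea in code (toy arcs and weights, not an actual toolkit API): the transducer maps a phone string to the best word string, with the word emitted on the arc that consumes the first phone of its pronunciation.

```python
import math

# Toy WFST (illustrative only): each arc is
# (source, input_label, output_label, weight, destination).
# "<eps>" as an output label means "no output symbol on this arc".

ARCS = [
    # 'data': the word is output on the arc that consumes its first phone.
    (0, "d", "data", 1.0, 1),
    (1, "ey", "<eps>", 0.5, 2),
    (1, "ae", "<eps>", 0.3, 2),
    (2, "t", "<eps>", 0.3, 3),
    (2, "dx", "<eps>", 0.7, 3),
    (3, "ax", "<eps>", 1.0, 4),
]
START, FINALS = 0, {4: 0.0}

def transduce(phones):
    """Return (best cost, word string) for a phone string, tropical semiring."""
    best = {START: (0.0, [])}                 # state -> (cost, output so far)
    for p in phones:
        nxt = {}
        for state, (cost, out) in best.items():
            for src, ilabel, olabel, w, dst in ARCS:
                if src == state and ilabel == p:
                    cand = (cost + w,
                            out + ([olabel] if olabel != "<eps>" else []))
                    if dst not in nxt or cand[0] < nxt[dst][0]:
                        nxt[dst] = cand
        best = nxt
    finals = [(c + FINALS[s], out) for s, (c, out) in best.items() if s in FINALS]
    return min(finals) if finals else (math.inf, [])

print(transduce(["d", "ae", "t", "ax"]))      # (2.6, ['data'])
```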
15. WFST : Theory and examples
• Weighted finite state transducers contain more
information than WFSAs
Can represent a relationship between two levels of
representation
(ex) between phones and words / between HMMs and context-
independent phones.
Possible to combine the pronunciation transducers for
more than one word without losing word identity
16. WFST : Theory and examples
• Elementary operations
Combine transducers in parallel (union) or in series (concatenation)
Two weighted automata are equivalent
if they associate the same weight to each input string
• Composition
• Determinization
• Weight pushing
• Minimization
17. WFST : Theory and examples
• Composition
Transducer operation for combining different levels of
representation
Key operation for model combination
(Figure: composition example in the log probability semiring.)
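A rough sketch of epsilon-free composition on the toy arc representation above: states of the result are pairs of component states, an arc of T1 pairs with an arc of T2 whenever T1's output label matches T2's input label, and weights combine with ⊗ (here, addition of costs). Epsilon handling, which real composition needs, is deliberately omitted.

```python
from collections import deque

# Epsilon-free composition sketch (illustrative only). A transducer is a dict:
#   {"start": s, "finals": {state: weight}, "arcs": [(src, in, out, w, dst)]}
# Weights are costs, so the ⊗ operation is ordinary addition.

def compose(t1, t2):
    start = (t1["start"], t2["start"])
    arcs, finals, seen, queue = [], {}, {start}, deque([start])
    while queue:
        q1, q2 = queue.popleft()
        # A pair state is final iff both component states are final.
        if q1 in t1["finals"] and q2 in t2["finals"]:
            finals[(q1, q2)] = t1["finals"][q1] + t2["finals"][q2]
        for s1, i1, o1, w1, d1 in t1["arcs"]:
            if s1 != q1:
                continue
            for s2, i2, o2, w2, d2 in t2["arcs"]:
                # Match an output label of T1 against an input label of T2.
                if s2 == q2 and i2 == o1:
                    dst = (d1, d2)
                    arcs.append(((q1, q2), i1, o2, w1 + w2, dst))
                    if dst not in seen:
                        seen.add(dst)
                        queue.append(dst)
    return {"start": start, "finals": finals, "arcs": arcs}

# T1 maps "a" to "b", T2 maps "b" to "c"; their composition maps "a" to "c".
T1 = {"start": 0, "finals": {1: 0.0}, "arcs": [(0, "a", "b", 1.0, 1)]}
T2 = {"start": 0, "finals": {1: 0.0}, "arcs": [(0, "b", "c", 2.0, 1)]}
print(compose(T1, T2)["arcs"])   # [((0, 0), 'a', 'c', 3.0, (1, 1))]
```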
18. WFST : Theory and examples
• Determinization
Here ‘deterministic’ means deterministic on the input symbols
A deterministic automaton:
1) has a unique initial state, and 2) no two transitions
leaving any state share the same input label
Key operation for removing redundant paths
(Figure: determinization example in the tropical semiring.)
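The definition is easy to state in code. This toy check (same arc representation as the earlier sketches) tests whether a machine is deterministic on its input labels, which is exactly the property determinization establishes:

```python
def is_deterministic(fst):
    """True if no two arcs leaving any state share the same input label.
    fst["arcs"] holds (src, in_label, out_label, weight, dst) tuples;
    a single start state is assumed (property 1 of the definition)."""
    seen = set()
    for src, ilabel, _olabel, _w, _dst in fst["arcs"]:
        if (src, ilabel) in seen:
            return False          # two arcs leaving `src` read the same symbol
        seen.add((src, ilabel))
    return True

# Non-deterministic: state 0 has two arcs reading "d".
fst = {"start": 0, "finals": {2: 0.0},
       "arcs": [(0, "d", "data", 0.5, 1), (0, "d", "dew", 0.5, 2)]}
print(is_deterministic(fst))   # False
```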
19. WFST : Theory and examples
• Weight pushing
Creates an equivalent pushed/stochastic machine
Operation that makes the FST stochastic
Stochastic FST : the weights of the arcs leaving each state sum to one
Useful as a first step of minimization; also redistributes
weight among transitions to improve pruned search
(Figure: weight pushing example in the probability semiring.)
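For an acyclic machine in the probability semiring, weight pushing can be sketched directly from its definition: compute each state's total outgoing path probability (its potential), then renormalize arcs and final weights with it, after which each state's outgoing weights sum to one. This is an illustrative sketch only; the labels and numbers are made up, and it assumes an acyclic graph.

```python
from functools import lru_cache

# Acyclic weight pushing in the probability semiring (illustrative sketch).
# Arcs are (src, label, prob, dst); finals map a state to its final probability.

FST = {
    "start": 0,
    "finals": {2: 1.0},
    "arcs": [(0, "a", 0.1, 1), (0, "b", 0.3, 1), (1, "c", 0.5, 2)],
}

def push_weights(fst):
    arcs_from = {}
    for arc in fst["arcs"]:
        arcs_from.setdefault(arc[0], []).append(arc)

    @lru_cache(maxsize=None)
    def potential(q):
        # Total probability mass of all paths from q to a final state.
        total = fst["finals"].get(q, 0.0)
        for _src, _label, p, dst in arcs_from.get(q, []):
            total += p * potential(dst)
        return total

    pushed_arcs = [
        (src, label, p * potential(dst) / potential(src), dst)
        for src, label, p, dst in fst["arcs"]
    ]
    pushed_finals = {q: w / potential(q) for q, w in fst["finals"].items()}
    # potential(start) is the machine's total weight, left over after pushing.
    return {"start": fst["start"], "finals": pushed_finals,
            "arcs": pushed_arcs, "total": potential(fst["start"])}

pushed = push_weights(FST)
print(pushed["arcs"])    # arcs out of state 0 now carry 0.25 and 0.75 (sum to 1)
print(pushed["total"])   # 0.2 = total probability mass of the original machine
```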
20. WFST : Theory and examples
• Minimization
Any deterministic weighted automaton can be
minimized
The minimized automaton B of A
has the fewest states and transitions
among all deterministic automata equivalent to A
Key operation for size reduction
(Figures: a deterministic weighted automaton; the same automaton after weight pushing in the tropical semiring; the equivalent minimal weighted automaton.)
22. Speech recognition revisited
• WFST in KALDI
Decoding graph : min(det(H ∘ C ∘ L ∘ G))
H: mapping from PDFs to context labels
C: mapping from context labels to phones
L: mapping from phones to words
G: grammar or language model
H ∘ C ∘ L ∘ G : mapping from PDFs to words, based on the language model
23. Speech recognition revisited
• Construction of decoding network
Decoding graph : min(det(H ∘ C ∘ L ∘ G))
1) The decoder finds word pronunciations in its lexicon and
substitutes them into the grammar
(the grammar might be restricted to trigrams)
2) The decoder identifies the correct context-dependent models to use for
each phone in context
(the models might be triphonic)
3) The decoder substitutes the HMMs, with their particular model topologies,
to create an HMM-level transducer
In KALDI, these steps are performed by the mkgraph script
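A literal reading of the slide's formula as pseudocode (the FST operations are passed in as functions, e.g. toy versions like the sketches above or a real FST library; Kaldi's actual mkgraph recipe interleaves determinization and minimization with each composition stage and handles disambiguation symbols and self-loops separately):

```python
def make_decoding_graph(H, C, L, G, compose, determinize, minimize):
    """Sketch of min(det(H ∘ C ∘ L ∘ G)); the operations are injected
    so this stays a self-contained illustration rather than a toolkit API."""
    HCLG = compose(H, compose(C, compose(L, G)))   # PDFs -> words under grammar G
    return minimize(determinize(HCLG))             # optimize the composed graph
```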
24. Speech recognition revisited
• Construction of decoding network
G : Probabilistic grammar or language model acceptor
Stochastic n-gram models can be represented compactly by
finite-state models
Input : word
Weight : history-dependent word probability
(Figure: word bigram acceptor; arcs carry weights −log p̂(w2 | w1), and backoff arcs carry the backoff weight.)
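As an illustration (toy probabilities, not a real language model), a backoff bigram can be laid out as an acceptor whose states are word histories: observed bigrams get direct arcs weighted by −log p̂(w2 | w1), and an epsilon backoff arc leads to a unigram state.

```python
import math

# Toy backoff bigram acceptor (illustrative numbers only).
# States: "<bo>" is the backoff/unigram state; every other state is a
# one-word history. Arcs are (src, label, cost, dst) with cost = -log prob.

UNIGRAMS = {"hello": 0.6, "world": 0.4}
BIGRAMS  = {("hello", "world"): 0.9}     # p(world | hello)
BACKOFF  = {"hello": 0.1, "world": 1.0}  # backoff weight per history

def build_bigram_acceptor():
    arcs = []
    # Leaving the backoff state: unigram probabilities.
    for w, p in UNIGRAMS.items():
        arcs.append(("<bo>", w, -math.log(p), w))
    for hist in UNIGRAMS:
        # Observed bigrams: direct arc between history states.
        for (w1, w2), p in BIGRAMS.items():
            if w1 == hist:
                arcs.append((hist, w2, -math.log(p), w2))
        # Backoff: epsilon arc down to the unigram state.
        arcs.append((hist, "<eps>", -math.log(BACKOFF[hist]), "<bo>"))
    return {"start": "<bo>", "finals": set(UNIGRAMS), "arcs": arcs}

G = build_bigram_acceptor()
for arc in G["arcs"]:
    print(arc)
```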
25. Speech recognition revisited
• Construction of decoding network
L : Pronunciation lexicon
Input : context-independent phone (phoneme)
Output : word
Weight : pronunciation probability
26. Speech recognition revisited
• Construction of decoding network
L : Pronunciation lexicon
Non-deterministic because of homophones
• Ex) read <-> red
Disambiguation symbols are added (see the sketch below)
• Removed at the last stage
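A sketch (toy lexicon, hypothetical helper, not Kaldi's actual lexicon preparation) of how L can be built: each pronunciation becomes a chain of arcs from the start state back to it, the word is output on the first phone's arc, and homophones get distinct #k disambiguation symbols appended so the result stays determinizable.

```python
from collections import defaultdict

# Toy lexicon transducer builder (illustrative only).
# Arcs are (src, input_phone, output_word, cost, dst); cost 0.0 throughout.

LEXICON = [
    ("read", ["r", "eh", "d"]),
    ("red",  ["r", "eh", "d"]),   # homophone of 'read'
    ("data", ["d", "ey", "t", "ax"]),
]

def build_L(lexicon):
    # Group words by pronunciation to find homophones.
    by_pron = defaultdict(list)
    for word, phones in lexicon:
        by_pron[tuple(phones)].append(word)

    arcs, next_state = [], 1      # state 0 is both start and (single) final state
    for phones, words in by_pron.items():
        for k, word in enumerate(words, start=1):
            seq = list(phones)
            if len(words) > 1:
                seq.append(f"#{k}")            # disambiguation symbol, removed later
            src = 0
            for i, phone in enumerate(seq):
                out = word if i == 0 else "<eps>"   # word emitted on first phone
                dst = 0 if i == len(seq) - 1 else next_state
                arcs.append((src, phone, out, 0.0, dst))
                if dst != 0:
                    src, next_state = dst, next_state + 1
    return {"start": 0, "finals": {0: 0.0}, "arcs": arcs}

L = build_L(LEXICON)
for arc in L["arcs"]:
    print(arc)
```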
27. Speech recognition revisited
• Construction of decoding network
C : Context-dependency transducer
Input : context-dependent phone
(triphone)
Output : context-independent phone
(phone)
(Figures: non-deterministic and deterministic forms of C; arcs are labeled Triphone:Phone/LeftContext_RightContext.)
28. Speech recognition revisited
• Construction of decoding network
Decoding graph : min(det(H ∘ C ∘ L ∘ G))
C ∘ L ∘ G : transducer that maps from
context-dependent phones to word strings,
restricted to grammar G
• Determinizable if C, L, and G are determinizable
• G is determinizable if G is an n-gram language model
• L may not be determinizable if L has ambiguities
• Revised L̃ with auxiliary symbols tagging homophones
• Modified C̃ that pairs the context-independent auxiliary symbols
in the lexicon with new context-dependent auxiliary symbols
C̃ ∘ L̃ ∘ G : revised transducer that is determinizable and minimizable
29. Speech recognition revisited
• Construction of decoding network
H : HMM topology transducer (maps HMM states to context-dependent phones)
Input : state
Output : context-dependent phone (triphone)
Weight : HMM transition probability
(Figure: H transducer for the monophone case, without self-loops, as used in KALDI.)
30. Speech recognition revisited
• Construction of decoding network
Decoding graph : min(det(H̃ ∘ C̃ ∘ L̃ ∘ G))
C̃ ∘ L̃ ∘ G : revised determinizable and minimizable transducer
• H : closure of the union of the individual HMMs
• H̃ : H with self-loops added for the auxiliary distribution-name input labels
and the auxiliary context-dependent phone output labels
H̃ ∘ C̃ ∘ L̃ ∘ G : transducer that maps from distributions to word
strings, restricted to G
Standardized integrated transducer :
unique deterministic, minimal transducer for which
the weights for all transitions leaving any state sum to 1 in probability
33. Speech recognition revisited
• Construction of decoding network
min_tropical(Det(L̃ ∘ G))
min_log(Det(L̃ ∘ G))
The log-semiring version is conjectured to be best for the pruning efficiency
of a standard Viterbi beam search
34. Speech recognition revisited
• Construction of decoding network
Weight and label pushing
Decoding graph construction
Decoding with WFSTs
Weight pushing makes the outgoing arcs of each state a stochastic distribution
• After label pushing, output labels are no longer synchronized with inputs in the WFST
35. Speech recognition revisited
• Construction of decoding network
Weight and label pushing
Decoding graph construction
Decoding with WFSTs
Determinization for WFSTs can fail
Need to guarantee that the final HCLG is stochastic
• Needed for optimal pruning
36. Speech recognition revisited
• Construction of decoding network
Weight and label pushing
Decoding graph construction
Decoding with WFSTs
Finding the best path : solving W′ = argmax_W P(X | W) P(W)
• Compose the recognizer as HCLG, which maps HMM states to word sequences
• Decode by aligning the feature vectors X with HCLG
• W′ = argmax_W [ X ∘ (H ∘ C ∘ L ∘ G) ]
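As a very rough illustration of the last bullet (toy data structures and numbers, not Kaldi's decoder), Viterbi search over HCLG can be sketched as token passing: each arc's input label selects a PDF, its graph weight is combined with the acoustic cost of that PDF for the current frame, and only the best-scoring token per state survives.

```python
import math

# Toy Viterbi decoding over an HCLG-like graph (illustrative sketch only).
# Arcs are (src, pdf_id, word, graph_cost, dst). Epsilon arcs, self-loop
# handling, lattice generation and beam pruning are all omitted.

HCLG = {
    "start": 0,
    "finals": {2: 0.0},
    "arcs": [(0, 0, "yes", 0.7, 1), (0, 1, "no", 0.7, 1), (1, 2, "<eps>", 0.0, 2)],
}

def decode(acoustic_costs):
    """acoustic_costs[t][pdf] = -log p(x_t | pdf). Returns (cost, words)."""
    tokens = {HCLG["start"]: (0.0, [])}        # state -> (cost, word sequence)
    for frame in acoustic_costs:
        new_tokens = {}
        for state, (cost, words) in tokens.items():
            for src, pdf, word, gcost, dst in HCLG["arcs"]:
                if src != state:
                    continue
                total = cost + gcost + frame[pdf]          # graph + acoustic cost
                out = words + [word] if word != "<eps>" else words
                if dst not in new_tokens or total < new_tokens[dst][0]:
                    new_tokens[dst] = (total, out)
        tokens = new_tokens
    finals = [(c + HCLG["finals"][s], w) for s, (c, w) in tokens.items()
              if s in HCLG["finals"]]
    return min(finals) if finals else (math.inf, [])

# Two frames of acoustic costs for three PDFs (made-up numbers):
print(decode([{0: 0.2, 1: 1.5, 2: 3.0}, {0: 2.0, 1: 2.0, 2: 0.1}]))
# -> (1.0, ['yes'])
```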
39. WFST level utterance verification
• WFST-level detection
Operations when building the graph
Build a new decoding graph based on a new corpus
• For example, from a Q&A-style corpus
Operations when searching paths
Detect the utterance by giving higher scores to paths related to the
objective
• Lattice structures and classification algorithms from NLP can be
considered
(Figure: an example utterance marked ‘Not a question!’ because the classification algorithm assigns it a low score.)
40. Summary
• WFSTs give a common and natural representation for the
major components of speech recognition systems.
• In a speech recognition system, the central WFST is the
decoding graph, which maps PDFs to words based on the
language model.
• WFST-based utterance verification includes changing
weights in graphs such as C, L, or G, or reweighting the
lattice structure.
41. References
• M. Mohri, F. Pereira, and M. Riley, “Speech recognition with weighted
finite-state transducers,” in Springer Handbook of Speech Processing,
Springer Berlin Heidelberg, pp. 559-584, 2008.
• OpenFst: An Open-Source, Weighted Finite-State Transducer Library and
its Applications to Speech and Language, Part I. Theory and Algorithms.
http://www.openfst.org/twiki/pub/FST/FstHltTutorial/tutorial_part1.pdf
• M. Hannemann, Weighted Finite State Transducers in Automatic Speech
Recognition, ZRE lecture 15, Apr. 2015.
http://www.fit.vutbr.cz/study/courses/ZRE/public/pred/10_wfst_lvcsr/zre_lecture_asr_wfst_2015.pdf
• T. Hanneforth, Finite-state Machines: Theory and Applications, Dec. 2008.
http://tagh.de/tom/wp-content/uploads/fsm_weightedautomata.pdf
• A helpful explanation of KALDI decoding-graph construction:
http://vpanayotov.blogspot.kr/2012/06/kaldi-decoding-graph-construction.html