X: states
y: possible observations
a: state transition probabilities
b: output probabilities
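The legend maps onto a small data structure. A minimal sketch, assuming a two-state toy model: the state names, the `start` distribution, and every probability are invented for illustration. It uses the forward algorithm to score how likely a sequence of observations is under the model.

```python
# Toy HMM; every number below is made up for illustration.
states = ["X1", "X2"]                      # X: hidden states
start = {"X1": 0.6, "X2": 0.4}             # initial state probabilities (assumed)
a = {                                      # a: state transition probabilities
    "X1": {"X1": 0.7, "X2": 0.3},
    "X2": {"X1": 0.4, "X2": 0.6},
}
b = {                                      # b: output probabilities
    "X1": {"y1": 0.5, "y2": 0.5},
    "X2": {"y1": 0.1, "y2": 0.9},
}

def forward(observations):
    """Return P(observations), summed over every hidden state path."""
    # Probability of starting in each state and emitting the first observation.
    alpha = {s: start[s] * b[s][observations[0]] for s in states}
    for obs in observations[1:]:
        # Move one step: arrive in s from any previous state, then emit obs.
        alpha = {
            s: b[s][obs] * sum(alpha[prev] * a[prev][s] for prev in states)
            for s in states
        }
    return sum(alpha.values())

print(forward(["y1", "y2", "y2"]))
```

The states are never observed directly; only the y values are, which is why the model is called hidden.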
"HiddenMarkovModel" by Tdunningvectorization: Wikimedia
Voice User Interface Designer
10 years in the field
English major, former coder; got interested in UX
President of the Association for Voice Interaction Design
Consultant for Versay Solutions
2 weeks in a row for conferences
Jarvis:
Audio and gestural
Perfect recognition.
No error recovery needed
Great voice quality
Connected to vast amounts of data
Understands all the parts of the model: “Lose the landscape.”
Context-sensitive.
Aware of the space around him
Sense of humor. “Am I to include the Belgian Waffle stands?”
Takes initiative. “What is it you’re trying to achieve, sir?”
Replicator:
Good recognition
No error recovery needed
Good voice quality – understandable
Connected to data – perhaps too much so?
Context-sensitive, but was this enough?
A design failure (not a tech failure)
Specifically around excessive disambiguation
A Better Replicator Conversation
“Speech to Text” ?
Spoken Language – Machine readable format
Not necessarily tied to speech recognition
Also called voiceprints, biometrics, voice authentication, etc.
I'm not going to discuss this one in much detail today, but it’s important that you understand the difference between these technologies.
Recognizes a person, not necessarily what they are saying.
You can have ASR without Voice Verification
And vice versa
Human voice talent
Hundreds of hours of recording
Digitized
Phonemes:
Concatenated speech synthesis
Dynamic Speech Synthesis
Many commercial products are available
API-based
Downloadable
Quality varies
If possible, record audio
TTS has improved considerably, but it is still noticeably synthetic
High quality TTS may not be available in all situations
If you have a lot of dynamic data, TTS is useful
You can mix recorded audio and TTS
You may have to use TTS
Voice Agent (Alexa, Cortana, etc.)
API-based
Some of them do let you mark up your TTS with SSML
More phonemes = higher quality voice
Also means a bigger download and install (if on device)
Exceptions (addresses, names) can be iffy
May require a lot of work to handle well
St. James St.
Saint James Street
Punctuation
Your data needs to be clean and ready to voice back
Acronyms, incomplete sentences will not sound good
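The "St. James St." case can be handled with a cleanup pass before the text reaches the TTS engine. A minimal sketch, assuming a hypothetical helper `normalize_for_tts` and one simple heuristic: "St." before a capitalized word is read as "Saint", and any other "St." as "Street". Real address data will break this rule often, which is part of why exceptions take so much work.

```python
import re

def normalize_for_tts(text):
    """Expand the abbreviation "St." so a TTS engine can voice it."""
    # Heuristic (an assumption, not a rule): "St." followed by a
    # capitalized word is a saint's name.
    text = re.sub(r"\bSt\.(?=\s+[A-Z])", "Saint", text)
    # Any "St." still left is treated as a street suffix.
    text = re.sub(r"\bSt\.", "Street", text)
    return text

print(normalize_for_tts("St. James St."))  # Saint James Street
```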
It is possible to build a custom voice
But it takes a lot of work!
Speech Synthesis Markup Language
XML-based W3C standard
Not universally supported
Tags that allow you to produce more natural-sounding output.
Emphasis
Break
Voice
Prosody
Pitch
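The tags above can be combined in a single prompt. A minimal sketch that builds the markup with Python's standard library; the prompt wording and attribute values are invented, and a given platform may support only some of these tags. A complete W3C document would also carry version and namespace attributes on `speak`, which many voice platforms do not require.

```python
import xml.etree.ElementTree as ET

def build_prompt():
    """Build a short SSML prompt using emphasis, break, and prosody."""
    speak = ET.Element("speak")
    speak.text = "Your balance is "

    # Emphasis: stress the part the caller is listening for.
    emphasis = ET.SubElement(speak, "emphasis", level="strong")
    emphasis.text = "forty two dollars"
    emphasis.tail = "."

    # Break: insert a half-second pause before the next question.
    pause = ET.SubElement(speak, "break", time="500ms")
    pause.tail = " "

    # Prosody: raise the pitch and slow the rate for the follow-up.
    prosody = ET.SubElement(speak, "prosody", pitch="high", rate="slow")
    prosody.text = "Anything else?"

    return ET.tostring(speak, encoding="unicode")

print(build_prompt())
```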
World Knowledge: Concepts of the world around us, e.g., tables have four legs, what left and right are, what a car is. This is the level before language.
Semantics: The first level of language. Knowledge can be represented in structured meaningful elements. Example: semantics of a party invitation
Syntax: The rules that govern putting words together to form meaningful units
Lexicon: What words mean
Morphology: How words change their form to perform differently in a language, e.g., horse / horses
Phonetics: Phonemes and how words are built
Acoustics: What phonemes sound like and how to create them
Speech is never stationary
Coarticulation
Noisy environments
Accents
Different speakers have voices with different acoustic qualities
Goats
Challenges vary depending on what you are going to recognize
Spelling (short utterances) can be difficult even for humans
Phonetic alphabet (Military)
Humans can deduce meaning from context, even when some words are unknown
“How can I help you?”
I’m having a problem with my account.
I’d like that one. No, not the green one, the red one.
Time flies like an arrow.
Fruit flies like a banana.
All modern speech recognition is probabilistic
GUI: Button clicked? true / false
VUI: There is an 85% chance that button was clicked
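Because the recognizer returns a confidence rather than a true / false, the application has to decide what to do with each score. A minimal sketch, assuming two hypothetical thresholds; the values are illustrative, not recommendations, and are normally tuned against real call data.

```python
# Threshold values are illustrative assumptions.
ACCEPT_AT = 0.80
CONFIRM_AT = 0.45

def next_action(confidence):
    """Map a recognition confidence (0.0 to 1.0) to a dialog action."""
    if confidence >= ACCEPT_AT:
        return "accept"      # act on the result without asking
    if confidence >= CONFIRM_AT:
        return "confirm"     # "I think you said ... is that right?"
    return "reprompt"        # too unsure; ask the question again

print(next_action(0.85))  # accept
```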
Three Dimensions of Speech Problems
AUDREY: Davis, Biddulph, and Balashek - Bell Labs 1952
Analog
Isolated digit recognition
Pause between digits
Speaker-dependent
Speech recognition with vacuum tubes – How very steampunk.
Her name was AUDREY. Let that sink in a minute.
(Automatic Digit Recognizer)
1980s: The Power of Statistics
The recognition of connected speech becomes a search for the best path in a large network
Problem of finding the probabilities
Statistical Language Models
Not all sequences of words are equally probable
Rank all permissible sentences in terms of probability
“Correct” grammar is not applicable
Restricted by domain
Hidden Markov Models (HMM)
Unified probabilistic model for speech
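The "search for the best path" can be sketched with the Viterbi algorithm, the standard way to find the most likely state sequence in an HMM. A minimal sketch on a toy model: the two states, the "low" / "high" energy observations, and all probabilities are invented, and a real recognizer works over a vastly larger network.

```python
def viterbi(observations, states, start, trans, emit):
    """Return (probability, path) for the most likely hidden state sequence."""
    # best[s] = (probability of the best path ending in s, that path)
    best = {s: (start[s] * emit[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        step = {}
        for s in states:
            # Pick the best predecessor for state s.
            prob, path = max(
                ((best[p][0] * trans[p][s], best[p][1]) for p in states),
                key=lambda item: item[0],
            )
            step[s] = (prob * emit[s][obs], path + [s])
        best = step
    return max(best.values(), key=lambda item: item[0])

# Toy model: is the caller quiet or speaking, given audio energy?
states = ["quiet", "speech"]
start = {"quiet": 0.8, "speech": 0.2}
trans = {
    "quiet": {"quiet": 0.7, "speech": 0.3},
    "speech": {"quiet": 0.2, "speech": 0.8},
}
emit = {
    "quiet": {"low": 0.9, "high": 0.1},
    "speech": {"low": 0.2, "high": 0.8},
}

prob, path = viterbi(["low", "high", "high"], states, start, trans, emit)
print(path)  # ['quiet', 'speech', 'speech']
```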
You’re Only As Good As What You’re Trained On
Corpora
Collection of speech used to train a recognizer
Acoustic and/or Pronunciation Model
Associates sounds with symbols and words.
Created from a general speech corpus and a phonetic and orthographic transcription
Statistical Language Model (SLM)
A probability distribution over sequences of words
Created from a domain-specific speech corpus and a tagged transcription to extract meaning
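A probability distribution over word sequences can be sketched with a bigram model, which scores each word given the one before it. A minimal sketch, assuming a three-sentence banking corpus invented for illustration and add-one smoothing; production SLMs are trained on far more data with better smoothing.

```python
from collections import Counter

def train_bigram(sentences):
    """Count word pairs in a tiny domain corpus."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        contexts.update(words[:-1])             # words that have a successor
        bigrams.update(zip(words, words[1:]))   # adjacent word pairs
        vocab.update(words[1:])                 # everything that can be predicted
    return contexts, bigrams, len(vocab)

def sentence_probability(sentence, model):
    """Score a sentence as a product of smoothed bigram probabilities."""
    contexts, bigrams, vocab_size = model
    words = ["<s>"] + sentence.lower().split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        # Add-one smoothing keeps unseen pairs from scoring exactly zero.
        prob *= (bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size)
    return prob

model = train_bigram(["check my balance", "check my account", "pay my bill"])
likely = sentence_probability("check my balance", model)
unlikely = sentence_probability("balance my check", model)
print(likely > unlikely)  # True
```

Both sentences use the same words, but only the first matches the word order seen in the domain, so it ranks higher.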
Speech Agent: The “Person” who
Distributed speech recognition
Collection and compression of speech is on the device
The language models are typically on the network
Phone can be speaker-dependent
Trains itself on your voice and on the acoustic environments you are in most often
Many companies are providing APIs to use their speech recognition
Alexa, ask Capital One: what’s my current credit card balance?
Observations to make: Represents the entirety of a VUI experience
Placement of Spanish prompt would vary depending on type of call.
Confirmation is variable
Confirmation prompt is general
What do you need it for?
What kind of device will you be running it on?
Connectivity?
Can you use cloud-based ASR?
How much control do you need over the application / user interface?