Where's Jarvis? The Future of Voice Recognition and Natural Language User Interfaces UXPA 2016

#UXPA2016
Session Survey: http://www.uxpa2016.org/sessionsurvey?sessionid=321
© 2016 Versay Solutions
Where’s Jarvis? The Future of Voice Recognition and Natural Language User Interfaces.
Crispin Reedy, Versay Solutions
@crispinTX crispinreedy.com
#UXPA2016

From the session description
• What is voice recognition?
• What is natural language understanding?
• What are the common technologies in the market today?
• How does this fit with IoT?
• What are design considerations / methods to evaluate these types of interfaces?
• Implied: Should I speech-enable my ___?
• Bonus Q: Why doesn’t it work the way we want it to, and when will it?

Should I Speech-Enable My ___?

Iron Man 2: Marvel Studios, Paramount Pictures

Star Trek Voyager: Paramount Television

User: “Tomato soup.”
System: “Tomato soup. OK, what kind?”
User: “Just plain.”
System: “Coming right up!”
• Implicit confirmation
• Second-level open-ended prompting
• Cultural context: plain = hot

Terms & Technologies
• Speech Recognition
• Natural Language Understanding
• Voice Verification (Biometrics)
• Text to Speech

Speech Recognition “ASR”
“See the cat.”

Natural Language Understanding
• Extracting meaning from natural text
“Hello, yes, I’d like to pay my water bill. Can you help me with that?”
Intent = BillPay
Entity (Bill Type) = Water
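
As a rough illustration of how an utterance becomes an intent plus an entity, here is a minimal keyword-scoring sketch. The keyword sets, the `CheckBalance` intent, the list of bill types, and the `understand` function are all assumptions made up for this example; real NLU engines are typically trained on data rather than hand-written like this.

```python
import re

# Hypothetical keyword sets; a production NLU engine would learn these from data.
INTENT_KEYWORDS = {
    "BillPay": {"pay", "bill", "payment"},
    "CheckBalance": {"balance", "owe"},
}
BILL_TYPES = ["water", "electric", "gas"]


def understand(utterance):
    """Return the best-scoring intent and any bill-type entity found."""
    words = set(re.findall(r"[a-z']+", utterance.lower()))
    # Score each intent by how many of its keywords appear in the utterance.
    scores = {intent: len(kws & words) for intent, kws in INTENT_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    intent = best if scores[best] > 0 else None
    bill_type = next((b.title() for b in BILL_TYPES if b in words), None)
    return {"intent": intent, "bill_type": bill_type}


result = understand(
    "Hello, yes, I'd like to pay my water bill. Can you help me with that?"
)
print(result)
```

The polite filler ("Hello, yes", "Can you help me with that?") contributes nothing to the score, which is the point: the system keeps only the parts of the sentence that carry meaning for the task.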

Voice Verification
User: “My voice is my password.”
System: “Authenticated. Welcome, Mr. Smith.” ✓

What Is Good TTS?
• Phonemes change based on location
• “Cat”
• “Alligator”
• Elision
• “I’m. Awaiting. You.”
• “I’m awaiting you.”
• Intonation
• “Do you want coffee?”
• “Do you want soda, tea, or coffee?”
• Most TTS isn’t “Movie Quality”
IMDB

SSML Example
SSML
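
A minimal sketch of what SSML markup can look like, assuming the W3C SSML 1.0 elements `speak`, `break`, `emphasis`, and `prosody`. The sentences reuse the TTS examples from the previous slide; the attribute values are illustrative only, and each TTS engine supports its own subset of the standard.

```python
import xml.etree.ElementTree as ET

# Illustrative SSML; the attribute values are made up for this sketch.
ssml = (
    '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">'
    "I'm awaiting you."
    '<break time="500ms"/>'
    'Do you want <emphasis level="strong">soda</emphasis>, tea, '
    '<prosody rate="slow" pitch="low">or coffee?</prosody>'
    "</speak>"
)

# Parsing confirms the markup is well-formed XML before it goes to a TTS engine.
root = ET.fromstring(ssml)
NS = "{http://www.w3.org/2001/10/synthesis}"
tags = [el.tag.replace(NS, "") for el in root.iter()]
print(tags)
```

The markup lets the designer control pauses, emphasis, and intonation by hand where the engine's default reading would sound wrong.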

Speech Recognition
• Hands-free command / control
• Dictation
• Input text
• Small form factor device, etc.
Text To Speech
• Output text dynamically
• Respond to input
• Useful when no display is available
Natural Language Understanding
• Necessary for all language-based input
• Extract meaning
• Parse large volumes of text
Voice Verification
• Security

[Diagram: speech application architecture. Components: ASR, NLU, TTS, Verification, Voiceprints, Application, Data.]
• Sign-In
• Interaction
• Request
• Action
• Meaning
• Access Data
• Output

[Diagram: speech application architecture with more channels. Components: ASR, NLU, TTS, Verification, Voiceprints, Touch, Keyboard, Visual, Application, Data.]
• Sign-In
• Interaction
• Request
• Action
• Meaning
• Access Data
• Output
Manage I/O Modality
Determine Meaning in Context
Context!

ASR

[Diagram: layers of speech and language knowledge.]
Fields: World Knowledge, Semantics, Syntax, Lexicon, Morphology, Phonetics, Acoustics, Linguistics, Physiology
Units: Concepts, Phrases, Words, Phonemes, Sounds
Spans: ASR, NLU

Speech is ambiguous

Language is ambiguous

Everything is ambiguous

Speaker Independence
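[Chart: dimensions of recognizer capability, each increasing toward “Humanlike”.]
Speakers: Speaker Dependent, Multiple Speakers, Speaker Independent
Speech style: Isolated Words, Connected Words, Natural Speech
Vocabulary size: 10 words, 1000 words, 100,000 words, Unlimited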

AUDREY: Automatic Digit Recognizer
Bell Labs, 1952

X: states
y: possible observations
a: state transition probabilities
b: output probabilities
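
To make the notation concrete, here is a small sketch of Viterbi decoding, a standard way to find the most likely hidden state sequence in a hidden Markov model. It reuses the slide's symbols (X, y, a, b). The three states, the three observation labels, the `start` probabilities, and every number are invented for illustration and do not come from any real acoustic model.

```python
# Toy HMM in the slide's notation: X = states, y = observations,
# a = state transition probabilities, b = output probabilities.
# All names and numbers are made up for illustration.
X = ["s", "ih", "t"]                      # hidden states (phoneme-like units)
start = {"s": 0.8, "ih": 0.1, "t": 0.1}   # assumed initial state probabilities
a = {
    "s":  {"s": 0.5, "ih": 0.4, "t": 0.1},
    "ih": {"s": 0.1, "ih": 0.5, "t": 0.4},
    "t":  {"s": 0.1, "ih": 0.1, "t": 0.8},
}
b = {
    "s":  {"hiss": 0.7, "tone": 0.2, "click": 0.1},
    "ih": {"hiss": 0.1, "tone": 0.8, "click": 0.1},
    "t":  {"hiss": 0.2, "tone": 0.1, "click": 0.7},
}


def viterbi(y):
    """Return the most likely hidden-state sequence for observations y."""
    # best[state] = (probability of the best path ending in state, that path)
    best = {x: (start[x] * b[x][y[0]], [x]) for x in X}
    for obs in y[1:]:
        step = {}
        for x in X:
            # Pick the predecessor that gives the most probable path into x.
            prob, path = max(
                (best[prev][0] * a[prev][x], best[prev][1]) for prev in X
            )
            step[x] = (prob * b[x][obs], path + [x])
        best = step
    return max(best.values())[1]


print(viterbi(["hiss", "tone", "click"]))
```

The recognizer never sees the states directly. It only sees the observations and infers the state sequence that best explains them, which is why the model is called "hidden".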
"HiddenMarkovModel" by Tdunningvectorization: Wikimedia

[Diagram: training the Speech Recognition Engine.]
• Acoustic Model
• SLM and/or Grammar
• Pronunciation Model

[Diagram: recognition pipeline.]
Utterance (Noise levels? Barge-in?) → Feature Extraction → Endpointing → Speech Recognition Engine (Grammar or SLM; Probabilities) → Recognition Event (n-best list, literal return, tokens)
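
A sketch of how an application might consume a recognition event. The `Hypothesis` fields mirror the slide's terms (literal return, token, probability), but the class, the thresholds, and the `next_step` function are hypothetical and not taken from any specific engine's API.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    literal: str        # the words the engine thinks it heard
    token: str          # the meaning tag attached by the grammar
    confidence: float   # engine score between 0 and 1


# Hypothetical thresholds; real values come from tuning against recorded audio.
ACCEPT = 0.80
CONFIRM = 0.45


def next_step(n_best):
    """Decide what the application does with a recognition event."""
    if not n_best:
        return ("no_match", None)
    top = max(n_best, key=lambda h: h.confidence)
    if top.confidence >= ACCEPT:
        return ("accept", top.token)
    if top.confidence >= CONFIRM:
        return ("confirm", top.token)   # e.g. "Did you say checking?"
    return ("no_match", None)


event = [
    Hypothesis("checking", "ACCOUNT_CHECKING", 0.62),
    Hypothesis("check in", "CHECK_IN", 0.21),
]
print(next_step(event))
```

The middle band is where design matters most: the system heard something plausible but not certain, so it confirms rather than guessing or rejecting outright.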

Early Commercial Adoptions
• Interactive Voice Response
• “Those Phone Menus”
• Server-based ASR
• Nuance
• Microsoft
• Voice-Enabled Handheld Devices
• Industrial / Productivity applications
• Device-based ASR
• Network not needed
Note: Call center is still an important customer touchpoint!

Today’s Speech Agents vs. APIs
• Siri / Apple APIs
• Cortana / Cortana APIs
• Google Now / Google Voice Actions
• Amazon Echo (Alexa) / AVS API
• Jibo
• Ubi / Ubi Kit
• Assistant.ai / Api.ai

Alexa Skill vs. Amazon Voice Service
Amazon.com

Alexa Skill Example
Amazon.com
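
For a sense of what a skill involves, here is a minimal sketch of a custom skill backend written as an AWS Lambda handler in Python. It assumes the Alexa Skills Kit JSON request and response envelope. The intent name `OrderSoupIntent`, the slot name `Kind`, and the `speak` helper are hypothetical and exist only for this example.

```python
def speak(text, end_session=True):
    """Build the JSON response envelope returned to the Alexa service."""
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": text},
            "shouldEndSession": end_session,
        },
    }


def lambda_handler(event, context):
    """Entry point for a custom skill hosted on AWS Lambda."""
    request = event["request"]
    if request["type"] == "LaunchRequest":
        return speak("Welcome. What kind of soup would you like?", end_session=False)
    if request["type"] == "IntentRequest" and request["intent"]["name"] == "OrderSoupIntent":
        # ASR and NLU already ran in the cloud; the skill only sees intent and slots.
        slots = request["intent"].get("slots", {})
        kind = slots.get("Kind", {}).get("value", "plain")
        return speak(f"One {kind} soup, coming right up!")
    return speak("Sorry, I didn't catch that.", end_session=False)
```

The skill author never touches audio. The platform handles recognition and hands over structured meaning, which is the main trade-off of building on an agent rather than on raw ASR.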

NLU

Natural Language Understanding
• Parsing input to extract meaning
• Covers a large field
• Commands
• Automatic classification of emails
• Newspaper articles, large chunks of text
• Bots
• Conversational agents
• Messaging apps
• Personal assistants
• Input could be via speech or via text

Levels of Meaning
System: “How can I help you?”
Too Broad / Ambiguous: “I’m having a problem with my account.”
Too Much: “Well, I was looking at my bill, because I do that every week, and I was reviewing everything on there, and I saw…”
Just Right: “I’m seeing an unusual charge on my bill.”

NLU Tasks
http://www.conversational-technologies.com/nldemos/nlDemos.html

Intents and Entities
• “I’d like to transfer $50 from my checking account to my savings account.”
• ACTION = Transfer (Intent)
• FROM_ACCOUNT = Checking (Entity)
• TO_ACCOUNT = Savings (Entity)
• AMOUNT = $50 (Entity)
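
The mapping above can be sketched as slot filling with a hand-written pattern. The regular expression and the `parse_transfer` function are assumptions for illustration. They cover only this one phrasing, which is exactly why statistical NLU services are usually preferred over patterns like this.

```python
import re

# Hypothetical pattern covering one phrasing of a transfer request.
TRANSFER = re.compile(
    r"transfer\s+\$(?P<amount>\d+(?:\.\d{2})?)\s+"
    r"from\s+my\s+(?P<from_account>\w+)\s+account\s+"
    r"to\s+my\s+(?P<to_account>\w+)\s+account",
    re.IGNORECASE,
)


def parse_transfer(utterance):
    """Return the intent and its entities, or None if the pattern does not match."""
    match = TRANSFER.search(utterance)
    if match is None:
        return None
    return {
        "ACTION": "Transfer",
        "FROM_ACCOUNT": match["from_account"].capitalize(),
        "TO_ACCOUNT": match["to_account"].capitalize(),
        "AMOUNT": "$" + match["amount"],
    }


print(parse_transfer(
    "I'd like to transfer $50 from my checking account to my savings account."
))
```

A user who says "move fifty bucks into savings" means the same thing but would not match, so the pattern approach breaks down as soon as phrasing varies.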

NLU APIs
• API.ai
• Alexa
• Microsoft LUIS
• Wit.ai
• Google Voice Actions
• Etc.

Today’s NLU APIs
• Microsoft LUIS (part of Project Oxford)
Microsoft.com

Today’s NLU APIs
• API.ai

The Future Is Here
• DNN (Deep Neural Networks)
• Being applied to both ASR and NLU problems
• Requires large amounts of data to train the models

What’s The Glue Here?
Consistency Across Contexts?
“Omnichannel CX”
Data Is Everywhere
State Chart XML?

ASR vs. NLU: Wrap Up
ASR
• Spoken aloud
• Requires some NLU even if it’s hand-crafted (tagging)
• Useful in hands-free, eyes-free contexts
NLU
• Focuses on meaning extraction
• Could be used for chat bots, etc.
• Machine learning to train models

Design Considerations

Design Considerations
• What are you trying to build?
• What’s your platform?
• Existing guidelines / research
• User testing is key
• Especially if you’re trying to do something complicated

What’s Your ASR/NLU Platform?
• Write an app (skill) for an agent such as Cortana / Alexa
• Use cloud APIs to add ASR / NLU to your app / device / page / gadget
• Download software and use full-featured capabilities for more robust recognition on a specific device
• Build your own

Network Availability
• Simply irritating… or totally unusable?
User: “What’s on my calendar today?”
System: “Sorry, I can’t complete that request right now.”

Appropriate Modality?
• Voice Only? Voice + Display?
• Is it possible for the user to switch modalities?
• Or would switching potentially be dangerous?
User: “How long is the flight from Dallas to Seattle?”
System: “I’ve got a few results to show you.”

Is State Maintained?
• Does your platform support a multiple-stage interaction?
• Does it remember what you did previously?
User: “Who is Barack Obama?”
System: “Barack Obama is the 44th president of the United States.”
User: “How old is he?”
System: “I’m sorry, I don’t understand your question.”
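
The failure in this exchange comes from missing dialog state. Below is a minimal sketch of what "remembering" can mean: the system keeps the last entity it talked about and uses it to resolve a pronoun in the next turn. The `Dialog` class, the tiny `FACTS` table, and the pronoun list are hypothetical. The default year reflects when this talk was given.

```python
# Hypothetical knowledge base for the example exchange.
FACTS = {
    "barack obama": {
        "who": "Barack Obama is the 44th president of the United States.",
        "birth_year": 1961,
    },
}
PRONOUNS = {"he", "she", "they", "it"}


class Dialog:
    def __init__(self, current_year=2016):
        self.current_year = current_year
        self.last_entity = None   # the state carried between turns

    def ask(self, question):
        text = question.lower().rstrip("?")
        words = set(text.split())
        entity = next((name for name in FACTS if name in text), None)
        if entity is None and PRONOUNS & words:
            entity = self.last_entity      # resolve "he" from the previous turn
        if entity is None:
            return "I'm sorry, I don't understand your question."
        self.last_entity = entity
        if text.startswith("how old"):
            age = self.current_year - FACTS[entity]["birth_year"]
            return f"About {age}."
        return FACTS[entity]["who"]


dialog = Dialog()
print(dialog.ask("Who is Barack Obama?"))
print(dialog.ask("How old is he?"))
```

Without `last_entity`, the second question has nothing for "he" to refer to, and the system falls back to the apology shown in the example.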

Wake-Up Words
• How many of these “Agents” will we be talking to?
“Jibo, take a picture.”
“Alexa, play music.”
“OK Google, set the temperature to 77 degrees.”

System Personality
• Are you writing for an “Agent” who has an existing style?
• What if your skill or app doesn’t match that style?
• If not, should you create one?
“Hi, I’m Julie!”

Context
• Real-world context
• Digital context
• How much does your app know about where you are and what it can do?
User: “When I get home, remind me to take out the trash.”
System: “I’m sorry, your calendar doesn’t support location-based reminders.”

What Are You Trying To Recognize?
• Long utterances work better than short ones
• Letter names require extra work
“Start a session”
“Got it”

And So Much More….
• What will you do when the recognizer just can’t get it?
User: “I want my…. BARK BARK BARK Timmy STOP THAT NOW GET DOWN!”
System: ????
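
One common answer is escalating error handling: reprompt with progressively more guidance, then hand off instead of looping forever. The sketch below assumes a `recognize` callable that returns a token or `None` on a no-match. The prompt wording, the attempt limit, and the `transfer_to_agent` outcome are placeholders, not recommendations from the talk.

```python
# Hypothetical prompts; wording would normally come out of usability testing.
REPROMPTS = [
    "Sorry, I didn't get that. What would you like to do?",
    "You can say things like 'pay my bill' or 'check my balance'.",
]
MAX_ATTEMPTS = len(REPROMPTS) + 1


def handle_turn(recognize):
    """Collect one input; `recognize` returns a token, or None on no-match."""
    attempts = 0
    while attempts < MAX_ATTEMPTS:
        token = recognize()
        if token is not None:
            return ("understood", token)
        if attempts < len(REPROMPTS):
            print(REPROMPTS[attempts])      # play the next, more specific prompt
        attempts += 1
    return ("transfer_to_agent", None)      # stop looping and hand off


# Simulated recognizer: two failures (the barking dog), then success.
results = iter([None, None, "BillPay"])
print(handle_turn(lambda: next(results)))
```

Each reprompt gives the user more to work with than the last, and the loop has a hard exit so a noisy environment never traps the user.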

Existing Guidelines / Research
• Caveat: Best practices evolved in one modality (e.g. voice-only) may not apply the same way in another (e.g. combined voice + touch)
• But they could be adapted
• Association for Voice Interaction Design (AVIxD.org)
• Wiki
• Peer-Reviewed Journal
• Virtual “Brown Bags”
• Academic Sources, Books

AVIxD.org
CUI Working Group is actively recruiting!

Specific Example: “Help”
VoiceXML Standard (2004): “Help” should be a global command
AVIxD Wiki (2014): Stop using “Help” as a global
Agent API Doc (2015): Offer “Help”

Specific Example: “Help”
• Designers who tune applications have seen that the word “help” is a known “False Attractor”
• Other short things that you say get recognized as “help”
• People don’t voluntarily come up with “help” unless they are prompted
• Give callers a context-specific command only where help may truly be needed, and call it something besides “help”
• System: “Say or enter your account number, or say, ‘Where do I find it?’”

Special Case: Car
• “Distracted Driver” is a hot topic!
• Richard Young, Wayne State University
• Paper: “Safe Interaction For Drivers”
• “Visual-Manual Mode” – What we do today
• “Auditory-Vocal Mode” – Speech only. NO GUI.
• “Mixed Mode” – Speech and GUI being used together
• Finding: If you give someone a graphic interface,
they’re going to look at it
• And take their eyes off the road

Usability Studies / Research
• Special Challenges
• Technical setup
• Phone tap / Recording both sides

Early Stage Voice Only Prototype

What’s the Use Case?
• Enabling application
• User can’t do it any other way
• New tasks
• Enhancing application
• User can do it now
• But speech makes it better
• Faster
• Safer

[Diagram: platform options ordered API-Based → Device-Based → Roll Your Own / Open-Source, plotted against:]
• Flexibility
• Power
• Customization
• Time
• Difficulty
Cloud vs. Downloadable / Embedded
Cloud
• Easy to get started
• Lightweight
• Not much specialized knowledge
Downloadable / Embedded
• Customizable
• Probably better recognition
• Can be device-specific
• More features
• Higher powered
• May require specialized knowledge (speech scientist)

Open Source ASR
• CMU Sphinx
• pocketsphinx
• Kaldi
• http://kaldi-asr.org/
• Github
• New updates include some pretty interesting stuff (DNN)
• Requires:
• Corpus
• Tech know-how

Should I Speech-Enable My ___?
Maybe

Iron Man 2: Marvel Studios, Paramount Pictures
Where’s Jarvis?

Where’s Jarvis?
Gesture Based Interface
Artificial Intelligence
Voice Based Interface

Where’s Jarvis?
ASR
NLU
Voice Design
Context

Resources
• Handout / Web page
