The Voice Interface Revolution

WHY NOW?_
‣ WEB SERVICES AND IoT EXPLOSION
‣ HARDWARE NOW SUPPORTS FAR-FIELD VOICE INPUT PROCESSING
‣ SCIENCE BEHIND THE SCENES IS NOW ACCESSIBLE
  ‣ AUTOMATIC SPEECH RECOGNITION, NATURAL LANGUAGE
    UNDERSTANDING, TEXT TO SPEECH
‣ ARTIFICIAL INTELLIGENCE IS MAKING VOICE INTERFACES SMARTER
  ‣ PERSONALIZATION TO USER SPEECH, CONTEXTS AND
    PREFERENCES

BENEFITS_
‣ MOST NATURAL INTERFACE FOR HUMANS
‣ INSTANT VALUE FOR QUICK DEMANDS
‣ SUITABLE FOR NON-TECHNICAL USERS
‣ LOW HARDWARE REQUIREMENTS, NO SCREEN REQUIRED
‣ ACCESSIBILITY FOR USERS WITH LOW VISION OR HAND
DISABILITIES

DRAWBACKS_
‣ ERRORS IN SPEECH RECOGNITION
‣ SPEECH RECOGNITION/GENERATION ACCURACY VARIES
ACROSS LANGUAGES
‣ SUSCEPTIBLE TO BACKGROUND NOISE
‣ NOT ACCESSIBLE TO DEAF OR NON-SPEAKING USERS
‣ SPOKEN RESPONSES DELIVER DATA SLOWLY

USE CASES_
‣ SUITABLE
  ‣ QUICK INFORMATION REQUESTS WITH FEW PARAMETERS
  ‣ NON-CRITICAL TRANSACTIONS WITH FEW PARAMETERS
  ‣ RESPONSES WITH A SMALL AMOUNT OF DATA
‣ NOT SUITABLE
  ‣ QUESTIONS OR TRANSACTIONS WITH MANY PARAMETERS
  ‣ CRITICAL TRANSACTIONS, GIVEN THE RISK OF ERRORS
  ‣ RESPONSES WITH A LARGE AMOUNT OF DATA

BASICS_
‣ CHOOSE YOUR CHANNEL WISELY
  ‣ CUSTOM APPLICATION
  ‣ GENERAL ASSISTANT
‣ STUDY YOUR DOMAIN
‣ VOICE-ONLY?
  ‣ BEST OPTION IS USUALLY A COMBINED GRAPHIC AND VOICE
    INTERFACE

RECOMMENDATIONS_
‣ SHORT INTERACTIONS
  ‣ SHORTER THAN TEXT-BASED EXPERIENCES
  ‣ NO FUNNELS LONGER THAN TWO STEPS
‣ MEANINGFUL RESPONSES WITH VALUE
‣ TRANSPARENCY
‣ ENGAGEMENT
  ‣ TAIL QUESTIONS
  ‣ NOTIFICATIONS

ENTRANCE_
‣ GUIDED FIRST INTERACTION
‣ A SINGLE ENTRY POINT ALLOWS MORE CONTROL
‣ QUICK WELCOME

CUT THE BULLSHIT_
‣ VALUE OVER SMALLTALK
‣ VALUE OVER PERSONALITY
‣ VALUE OVER HUMOUR
‣ BE HONEST

CONVERSATION BASICS_
‣ Turn-taking
‣ Threading
‣ Leveraging inherent efficiency of language
‣ Anticipating variable user behaviour
‣ Understanding cooperative behaviour
  ‣ Cooperative principle
  ‣ Paul Grice’s Maxims
‣ Use everyday language
‣ Instilling user confidence

GRICE’S MAXIMS_
‣ The maxim of quantity, where one tries to be as informative as one
possibly can, and gives as much information as is needed, and no
more.
‣ The maxim of quality, where one tries to be truthful, and does not
give information that is false or that is not supported by evidence.
‣ The maxim of relation, where one tries to be relevant, and says things
that are pertinent to the discussion.
‣ The maxim of manner, where one tries to be as clear, as brief, and as
orderly as one can in what one says, and where one avoids obscurity
and ambiguity.

BASICS_
‣ SPEECH RECOGNITION/GENERATION
  ‣ AUTOMATIC IN GENERAL ASSISTANTS
  ‣ SERVICE- OR LIBRARY-BASED IN CUSTOM ASSISTANTS
‣ CORE COMPONENT IS THE DIALOG ENGINE
  ‣ GOOGLE DIALOGFLOW
  ‣ MICROSOFT BOT FRAMEWORK
  ‣ IBM WATSON ASSISTANT
  ‣ YOUR OWN

EXAMPLE DIALOG ENGINE DIAGRAM_
‣ An NLU platform receives requests and converts them into intents
and parameters
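
A toy sketch of the NLU step described above: an utterance goes in, an intent name and its parameters come out. The intent names, the patterns and the `parse` helper are hypothetical and only illustrate the shape of the mapping; hosted engines rely on trained models rather than regular expressions.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ParsedRequest:
    intent: str
    parameters: dict = field(default_factory=dict)


# Hypothetical intents: each name maps to a pattern whose named groups
# become the extracted parameters.
INTENT_PATTERNS = {
    "book_room": re.compile(
        r"book a (?P<room_type>single|double) room for (?P<nights>\d+) nights?"
    ),
    "opening_hours": re.compile(r"when does the (?P<place>\w+) open"),
}


def parse(utterance: str) -> ParsedRequest:
    """Return the first matching intent with its parameters, else a fallback."""
    text = utterance.lower().strip()
    for intent, pattern in INTENT_PATTERNS.items():
        match = pattern.search(text)
        if match:
            return ParsedRequest(intent, match.groupdict())
    return ParsedRequest("fallback")


result = parse("Book a double room for 2 nights")
print(result.intent, result.parameters)
```

The fallback intent matters in practice: a request that matches nothing should still produce a response instead of an error.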

RECOMMENDATIONS_
‣ NODE.JS AS BACKEND TECHNOLOGY
  ‣ IDEAL FOR PaaS AND EVEN FaaS
‣ OWN USER SYSTEM
‣ MIXED CONTEXT STRATEGY:
  ‣ KEEP CONVERSATIONS IN MEMORY
  ‣ KEEP MEANINGFUL, ACTIONABLE DATA IN A DATABASE
‣ PRINCIPLES OF MODULARITY AND COMPONENTIZATION
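
One way the mixed context strategy could look, sketched in Python for brevity (the same split applies to a Node.js backend). The `ContextStore` class, the `bookings` table and its columns are assumptions made for illustration: conversation turns stay in a plain in-process dictionary, while actionable data goes to a database (SQLite here as a stand-in).

```python
import sqlite3
from collections import defaultdict


class ContextStore:
    def __init__(self, db_path=":memory:"):
        # Conversation turns live only in process memory and are lost on restart.
        self.conversations = defaultdict(list)
        # Actionable data is written to a database. ":memory:" keeps this
        # sketch self-contained; pass a file path to persist across restarts.
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS bookings "
            "(user_id TEXT, room_type TEXT, nights INTEGER)"
        )

    def remember_turn(self, user_id, utterance):
        self.conversations[user_id].append(utterance)

    def save_booking(self, user_id, room_type, nights):
        with self.db:  # commits on success, rolls back on error
            self.db.execute(
                "INSERT INTO bookings VALUES (?, ?, ?)",
                (user_id, room_type, nights),
            )

    def bookings_for(self, user_id):
        cursor = self.db.execute(
            "SELECT room_type, nights FROM bookings WHERE user_id = ?",
            (user_id,),
        )
        return cursor.fetchall()

    def end_conversation(self, user_id):
        # Dropping the transcript does not touch the stored bookings.
        self.conversations.pop(user_id, None)


store = ContextStore()
store.remember_turn("u1", "book a double room for two nights")
store.save_booking("u1", "double", 2)
store.end_conversation("u1")
print(store.bookings_for("u1"))
```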

PATTERNS_
‣ ADAPTER
  ‣ IDEAL FOR CUSTOM INPUT ENTRIES:
    ‣ OWN/THIRD PARTY WEBHOOK
    ‣ MESSAGE SYSTEM LIBRARY
  ‣ IDEAL FOR CUSTOM OUTPUT EXITS
‣ MIDDLEWARE
  ‣ FOR USER INPUT
  ‣ FOR OUTPUT GENERATION
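
A minimal sketch of both patterns working together, assuming a hypothetical webhook payload shape (`sender.id`, `query`) and hypothetical middleware names. The adapter converts a channel-specific payload into one common message type; the middlewares then process that message in order.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Message:
    user_id: str
    text: str
    meta: dict = field(default_factory=dict)


def webhook_adapter(payload: dict) -> Message:
    """Adapter: turn one channel's payload (assumed shape) into a Message."""
    return Message(user_id=payload["sender"]["id"], text=payload["query"])


Middleware = Callable[[Message], Message]


def normalize(msg: Message) -> Message:
    """Input middleware: trim whitespace and lowercase the text."""
    msg.text = msg.text.strip().lower()
    return msg


def tag_language(msg: Message) -> Message:
    """Input middleware: placeholder that always tags English."""
    msg.meta["lang"] = "en"
    return msg


def run_pipeline(msg: Message, middlewares: List[Middleware]) -> Message:
    """Apply each middleware in order, passing the result along."""
    for middleware in middlewares:
        msg = middleware(msg)
    return msg


incoming = webhook_adapter({"sender": {"id": "u1"}, "query": "  Hello There "})
processed = run_pipeline(incoming, [normalize, tag_language])
print(processed.text, processed.meta)
```

Adding a new channel then means writing one more adapter, and the same pipeline shape can be reused on the output side.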

SPEECH RECOGNITION_
‣ GOOGLE REPORTS AN ERROR RATE DOWN TO 4.9%
THANKS TO AI TECHNIQUES LIKE DEEP LEARNING
‣ THREE MODELS WORK TOGETHER IN A GRAPH:
  ‣ ACOUSTIC: WAVEFORM TO SOUND FRAGMENTS
  ‣ PRONUNCIATION: SOUNDS TO WORDS
  ‣ LANGUAGE: WORDS TO SENTENCES
‣ STANDARD DATASET TO MEASURE ACCURACY IS NIST 2000
SWITCHBOARD
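
Assuming the error percentage above refers to word error rate (WER), the usual metric for speech recognition, it can be computed as the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. A small sketch, with made-up example sentences:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion of a reference word
                    curr[j - 1] + 1,     # insertion of a hypothesis word
                    prev[j - 1] + cost,  # substitution or exact match
                )
            )
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("turn on the kitchen lights", "turn on kitchen light")
print(f"{wer:.0%}")
```

In this example one word is dropped and one is substituted out of five reference words, so two errors are counted.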

SPEECH RECOGNITION APIS_
‣ GOOGLE CLOUD SPEECH
  ‣ Converts audio to text, synchronously and asynchronously, in 80+
    languages with a high degree of accuracy
  ‣ https://cloud.google.com/speech/docs
‣ MICROSOFT LUIS
  ‣ Interprets intents and extracts entities, with prebuilt trained ones available
  ‣ https://www.luis.ai/home
‣ IBM WATSON SPEECH-TO-TEXT
  ‣ https://www.ibm.com/watson/services/speech-to-text/
‣ AMAZON TRANSCRIBE
  ‣ https://aws.amazon.com/es/transcribe/

GOOGLE CLOUD API_
‣ NATURAL LANGUAGE
  ‣ Provides natural language understanding technologies to developers.
    Examples include sentiment analysis, entity recognition, entity
    sentiment analysis, and text annotations.
  ‣ https://cloud.google.com/natural-language/docs/reference/rest
‣ TRANSLATION
  ‣ Translates 80+ languages and detects the language of the input.
  ‣ https://cloud.google.com/translate/docs/reference/rest

GOOGLE ASSISTANT DEVELOPMENT + EXAMPLE

ACTIONS ON GOOGLE_
‣ Platform to build actions invoked by users to fulfill some need
  ‣ Easy way with Dialogflow integration
  ‣ Custom way with ACTIONS SDK
‣ How it works:
  ‣ User requests an action, for example “Talk to my Hotel Concierge”
  ‣ Assistant asks Actions on Google to invoke the particular app
  ‣ The conversation between the user and the app begins
  ‣ Subsequent user input is sent directly to the app until the app
    fulfills the intent and ends

INTENTS_
‣ Represent a mapping between what a user says and what action
should be taken by your software.
‣ User Says (Expressions)
  ‣ Natural language expressions annotated with parameters that
    are linked to entities
‣ Actions
  ‣ Trigger name with associated parameters to perform an action
    in the app
‣ Response
  ‣ You can add Simple Text or a Rich Response depending on the platform
‣ Contexts
  ‣ Pass info from other intents or from external sources. Input
    contexts are prerequisites for matching

ENTITIES_
‣ Significant data extracted from user input in the form of parameter
values
‣ Entities are associated with particular actions
‣ There are three types:
  ‣ System
    ‣ Pre-built entities provided by Dialogflow (formerly API.AI) to
      facilitate handling common concepts (colors, locations,…)
  ‣ Developer
    ‣ Custom entities created with a Reference Value plus Synonyms
  ‣ User Entities
    ‣ Defined for the session, for instance specific playlists
CONTEXTS_
‣ Persisted information that can be used across intents
  ‣ It can be internal, like a particular movie the user is asking about
  ‣ Or external, like user data retrieved from a user system
‣ Lifespan:
  ‣ By default they last for 5 requests or 10 minutes
‣ Input Context:
  ‣ Limits an intent to be matched only when certain contexts are set
  ‣ For example, when you need specific info to perform an action
‣ Output Context:
  ‣ Tied to the user session and shared by the intent that sets it
  ‣ Automatically added to follow-up intents
EVENTS & DIALOGS_
‣ Events are a feature that allows you to invoke intents by an event
name instead of a user query
‣ Dialogs
  ‣ Linear
    ‣ With Slot Filling you define required parameters with prompts
      and order them. The agent asks for them until it has all the info.
  ‣ Non-linear
    ‣ Complex dialogs are formed through context routing: removing the
      Output Context for Intent Responses and adding a new Output
      Context that is matched for the next question
GOOGLE ASSISTANT_
‣ ACTIONS ON GOOGLE ALLOWS BUILDING APPS
  ‣ GOOGLE HOME, HOME MINI, ANDROID, ANDROID AUTO, WEAR OS
  ‣ FUTURE REACH OF UP TO 80% OF THE WORLD MOBILE MARKET
‣ CUSTOM DIALOG ENGINE: DIALOGFLOW
‣ BEST SPEECH RECOGNITION
‣ BEST POSSIBLE FUTURE INTEGRATION WITH OTHER GOOGLE
SERVICES

ALEXA_
‣ ALEXA SKILLS FOR THIRD PARTY INTEGRATIONS
‣ AMAZON ECHO, ECHO DOT
‣ LARGEST SALES CHANNEL
‣ LARGEST CURRENT MARKET SHARE THANKS TO EARLY TIME-TO-MARKET
‣ IOT INTEGRATION THROUGH ALEXA VOICE SERVICE

HOME DEVICES_
‣ SMART SPEAKERS WILL BE THE CENTER OF HOME USAGE
  ‣ HOME AUTOMATION
  ‣ IoT DEVICES
‣ GENERAL USE CASES:
  ‣ PURCHASES
  ‣ INFORMATION DEMAND
  ‣ AGENDA
  ‣ CONTROL OVER OTHER DEVICES

CURRENT USAGE & PREDICTIONS_
‣ 50% OF ALL SEARCHES WILL BE VOICE-BASED BY 2020
‣ 22M SMART SPEAKERS IN THE US BY 2020
‣ 400M DEVICES WITH ACCESS TO GOOGLE ASSISTANT THIS YEAR
‣ A GOOGLE HOME IS SOLD EVERY SECOND IN THE US
‣ 40% OF ADULTS USE VOICE SEARCH
