A modern take on the design and development of voice-interface applications, with comprehensive recommendations and near-future forecasts about the Internet of Things and home automation.
3. WHY NOW?_
‣ WEB SERVICES AND IoT EXPLOSION
‣ HARDWARE NOW SUPPORTS FAR-FIELD VOICE INPUT PROCESSING
‣ SCIENCE BEHIND THE SCENES IS NOW ACCESSIBLE
‣ AUTOMATIC SPEECH RECOGNITION, NATURAL LANGUAGE
UNDERSTANDING, TEXT TO SPEECH
‣ ARTIFICIAL INTELLIGENCE IS MAKING VOICE INTERFACES SMARTER
‣ PERSONALIZATION TO USER SPEECH, CONTEXTS AND
PREFERENCES
4. BENEFITS_
‣ MOST NATURAL INTERFACE FOR HUMANS
‣ INSTANT VALUE FOR QUICK DEMANDS
‣ SUITABLE FOR NON-TECHNOLOGICAL USERS
‣ LOW HARDWARE NEED, NO SCREEN REQUIRED
‣ ACCESSIBILITY FOR USERS WITH LOW VISION OR LIMITED
HAND MOBILITY
5. DRAWBACKS_
‣ ERRORS IN SPEECH RECOGNITION
‣ DIFFERENT SPEECH RECOGNITION/GENERATION ACCURACY
AMONG LANGUAGES
‣ BACKGROUND NOISE SUSCEPTIBILITY
‣ NOT ACCESSIBLE FOR DEAF OR MUTE USERS
‣ SPOKEN RESPONSES DELIVER DATA SLOWLY
6. USE CASES_
‣ SUITABLE
‣ QUICK LOW-PARAMETERIZED INFORMATION DEMANDS
‣ LOW-PARAMETERIZED NON-CRITICAL TRANSACTIONS
‣ RESPONSES WITH REDUCED AMOUNT OF DATA
‣ NON-SUITABLE
‣ HIGH-PARAMETERIZED QUESTIONS OR TRANSACTIONS
‣ CRITICAL TRANSACTIONS, DUE TO THE RISK OF RECOGNITION ERRORS
‣ RESPONSES WITH LARGE AMOUNT OF DATA
8. BASICS_
‣ CHOOSE YOUR CHANNEL WISELY
‣ CUSTOM APPLICATION
‣ GENERAL ASSISTANT
‣ STUDY YOUR DOMAIN
‣ VOICE-ONLY?
‣ BEST OPTION IS USUALLY A COMBINED GRAPHIC AND VOICE
INTERFACE
9. RECOMMENDATIONS_
‣ SHORT INTERACTIONS
‣ SHORTER THAN TEXT-BASED EXPERIENCES
‣ NO FUNNELS LONGER THAN TWO STEPS
‣ MEANINGFUL RESPONSES WITH VALUE
‣ TRANSPARENCY
‣ ENGAGEMENT
‣ TAIL QUESTIONS
‣ NOTIFICATIONS
11. CUT THE BULLSHIT_
‣ VALUE OVER SMALLTALK
‣ VALUE OVER PERSONALITY
‣ VALUE OVER HUMOUR
‣ BE HONEST
12. CONVERSATION BASICS_
‣ Turn-taking
‣ Threading
‣ Leveraging inherent efficiency of language
‣ Anticipating variable user behaviour
‣ Understanding cooperative behaviour
‣ Cooperative principle
‣ Paul Grice’s Maxims
‣ Use everyday language
‣ Instilling user confidence
13. GRICE’S MAXIMS_
‣ The maxim of quantity, where one tries to be as informative as one
possibly can, and gives as much information as is needed, and no
more.
‣ The maxim of quality, where one tries to be truthful, and does not
give information that is false or that is not supported by evidence.
‣ The maxim of relation, where one tries to be relevant, and says things
that are pertinent to the discussion.
‣ The maxim of manner, where one tries to be as clear, as brief, and as
orderly as one can in what one says, and where one avoids obscurity
and ambiguity.
15. BASICS_
‣ SPEECH RECOGNITION/GENERATION
‣ AUTOMATIC IN GENERAL ASSISTANTS
‣ SERVICE OR LIBRARY BASED IN CUSTOM ASSISTANTS
‣ CORE COMPONENT IS THE DIALOG ENGINE
‣ GOOGLE DIALOGFLOW
‣ MICROSOFT BOT FRAMEWORK
‣ IBM WATSON ASSISTANT
‣ YOUR OWN
16. EXAMPLE DIALOG ENGINE DIAGRAM_
‣ An NLU platform receives requests and converts them into intents
and parameters
17. RECOMMENDATIONS_
‣ NODEJS AS BACKEND TECHNOLOGY
‣ IDEAL FOR PaaS AND EVEN FaaS
‣ OWN USER SYSTEM
‣ MIXED CONTEXT STRATEGY:
‣ KEEP CONVERSATIONS IN MEMORY
‣ KEEP MEANINGFUL, ACTIONABLE DATA IN A DATABASE
‣ PRINCIPLES OF MODULARITY AND COMPONENTIZATION
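The mixed context strategy listed above could be sketched roughly as follows. This is only an illustration under stated assumptions: `ConversationMemory`, `ActionableStore` and `FakeDatabase` are hypothetical names, and the `Map`-backed class merely stands in for a real database.

```typescript
// Hypothetical sketch: conversation turns stay in memory,
// while meaningful, actionable data goes to a persistent store.
interface ActionableStore {
  save(userId: string, key: string, value: string): void;
  load(userId: string, key: string): string | undefined;
}

// Stand-in for a real database, used only to keep the sketch runnable.
class FakeDatabase implements ActionableStore {
  private rows = new Map<string, string>();

  save(userId: string, key: string, value: string): void {
    this.rows.set(`${userId}:${key}`, value);
  }

  load(userId: string, key: string): string | undefined {
    return this.rows.get(`${userId}:${key}`);
  }
}

class ConversationMemory {
  private turns = new Map<string, string[]>();
  private maxTurns: number;

  constructor(maxTurns: number) {
    this.maxTurns = maxTurns;
  }

  remember(userId: string, utterance: string): void {
    const history = this.turns.get(userId) ?? [];
    history.push(utterance);
    // Keep only the most recent turns so memory use stays bounded.
    this.turns.set(userId, history.slice(-this.maxTurns));
  }

  recall(userId: string): string[] {
    return this.turns.get(userId) ?? [];
  }
}
```

The split keeps short-lived dialog state cheap to discard, while anything worth acting on later (a preference, a booking) survives a restart.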
18. PATTERNS_
‣ ADAPTER
‣ IDEAL FOR CUSTOM INPUT ENTRIES:
‣ OWN/THIRD PARTY WEBHOOK
‣ MESSAGE SYSTEM LIBRARY
‣ IDEAL FOR CUSTOM OUTPUT EXITS
‣ MIDDLEWARE
‣ FOR USER INPUT
‣ FOR OUTPUT GENERATION
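A minimal sketch of how the two patterns above might fit together on the input side. All names here (`ChatMessage`, `InputAdapter`, `WebhookBody`, `runPipeline`) are hypothetical, and the webhook payload shape is an assumption; real channels define their own.

```typescript
// Shape the dialog engine works with internally (hypothetical).
interface ChatMessage {
  userId: string;
  text: string;
}

// Adapter: converts a channel-specific payload into the engine's own shape.
interface InputAdapter<Raw> {
  toMessage(raw: Raw): ChatMessage;
}

// Assumed shape of a webhook body, for illustration only.
interface WebhookBody {
  sender: string;
  body: string;
}

const webhookAdapter: InputAdapter<WebhookBody> = {
  toMessage: (raw: WebhookBody): ChatMessage => ({ userId: raw.sender, text: raw.body }),
};

// Middleware: each step receives a message and returns a possibly changed copy.
type Middleware = (message: ChatMessage) => ChatMessage;

const trimText: Middleware = (m) => ({ ...m, text: m.text.trim() });
const lowerCase: Middleware = (m) => ({ ...m, text: m.text.toLowerCase() });

// Runs the steps in order, feeding each one the output of the previous step.
function runPipeline(message: ChatMessage, steps: Middleware[]): ChatMessage {
  return steps.reduce((current, step) => step(current), message);
}

const incoming = webhookAdapter.toMessage({ sender: "u1", body: "  Turn ON the lights " });
const processed = runPipeline(incoming, [trimText, lowerCase]);
```

Supporting a new channel then only requires a new adapter, and the same pipeline idea can be mirrored for output generation.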
19. SPEECH RECOGNITION_
‣ GOOGLE'S WORD ERROR RATE IS DOWN TO 4.9%
THANKS TO AI TECHNIQUES LIKE DEEP LEARNING
‣ THREE MODELS WORK TOGETHER IN A GRAPH:
‣ ACOUSTIC: WAVEFORM TO EACH SOUND FRAGMENT
‣ PRONUNCIATION: SOUNDS TO WORDS
‣ LANGUAGE: WORDS TO SENTENCES
‣ STANDARD DATASET TO MEASURE ACCURACY IS NIST 2000
SWITCHBOARD
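Assuming the error percentage quoted above is a word error rate (the usual metric on benchmarks of this kind), it can be computed as the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. The function name `wordErrorRate` is hypothetical and the sketch skips the text normalization a real evaluation would apply.

```typescript
// Word error rate = (substitutions + deletions + insertions) / reference word count,
// computed as a word-level Levenshtein distance.
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = reference.trim().split(/\s+/).filter((w) => w.length > 0);
  const hyp = hypothesis.trim().split(/\s+/).filter((w) => w.length > 0);
  if (ref.length === 0) {
    throw new Error("reference must contain at least one word");
  }

  // previous[j] = edit distance between the reference prefix seen so far and hyp[0..j).
  let previous: number[] = [];
  for (let j = 0; j <= hyp.length; j++) {
    previous.push(j);
  }

  for (let i = 1; i <= ref.length; i++) {
    const current: number[] = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      current.push(
        Math.min(
          previous[j] + 1, // deletion of a reference word
          current[j - 1] + 1, // insertion of an extra hypothesis word
          previous[j - 1] + cost, // substitution or exact match
        ),
      );
    }
    previous = current;
  }

  return previous[hyp.length] / ref.length;
}
```

For example, recognizing "turn on the kitchen light" when the speaker said "turn on the kitchen lights" is one substitution in five words.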
20. SPEECH RECOGNITION APIS_
‣ GOOGLE CLOUD SPEECH
‣ Converts audio to text, synchronously and asynchronously in 80+
different languages with a high degree of accuracy
‣ https://cloud.google.com/speech/docs
‣ MICROSOFT LUIS
‣ Interprets intents and extracts entities, with prebuilt trained models available
‣ https://www.luis.ai/home
‣ IBM WATSON SPEECH-TO-TEXT
‣ https://www.ibm.com/watson/services/speech-to-text/
‣ AMAZON TRANSCRIBE
‣ https://aws.amazon.com/es/transcribe/
21. GOOGLE CLOUD API_
‣ NATURAL LANGUAGE
‣ Provides natural language understanding technologies to developers.
Examples include sentiment analysis, entity recognition, entity
sentiment analysis, and text annotations.
‣ https://cloud.google.com/natural-language/docs/reference/rest
‣ TRANSLATION
‣ Translates between 80+ languages and detects the source language of the text.
‣ https://cloud.google.com/translate/docs/reference/rest
23. ACTIONS ON GOOGLE_
‣ Platform to build actions invoked by users to fulfill some need
‣ Easy way with Dialogflow integration
‣ Custom way with ACTIONS SDK
‣ How it works:
‣ User requests an action “Talk to my Hotel Concierge”
‣ Assistant asks Actions on Google to invoke the particular app
‣ The conversation between the user and the app begins
‣ Subsequent user input is sent directly to the app until the app
fulfills the intent and ends the conversation
24. INTENTS_
‣ Represent a mapping between what a user says and what action
should be taken by your software.
‣ User Says (Expressions)
‣ Natural language expressions annotated with parameters that
are linked to entities
‣ Actions
‣ A trigger name with associated parameters, used to perform an action
in the app
‣ Response
‣ You can add Simple Text or Rich Response depending on platform
‣ Contexts
‣ Pass information from other intents or from external sources. Input contexts are prerequisites for matching the intent
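The four parts of an intent described above can be pictured as a plain data structure. This is a hypothetical shape for illustration, not the Dialogflow API, and the exact-phrase matching stands in for the statistical matching a real NLU platform performs. The intent names and action strings are invented.

```typescript
// Hypothetical data shape mirroring the parts of an intent.
interface Intent {
  name: string;
  expressions: string[]; // "User says" examples
  action: string; // trigger name handed to the app
  response: string; // simple text response
  inputContexts: string[]; // contexts that must be active for a match
}

const intents: Intent[] = [
  {
    name: "ask_checkout",
    expressions: ["when is checkout", "what time is checkout"],
    action: "hotel.checkout_time",
    response: "Checkout is at noon.",
    inputContexts: [],
  },
  {
    name: "confirm_booking",
    expressions: ["yes", "confirm"],
    action: "hotel.confirm",
    response: "Your room is booked.",
    inputContexts: ["booking_pending"],
  },
];

// Returns the first intent whose input contexts are all active and
// whose expressions contain the utterance (case-insensitive, exact phrase).
function matchIntent(
  utterance: string,
  activeContexts: string[],
  candidates: Intent[],
): Intent | undefined {
  const normalized = utterance.trim().toLowerCase();
  const matches = candidates.filter(
    (intent) =>
      intent.inputContexts.every((c) => activeContexts.indexOf(c) !== -1) &&
      intent.expressions.some((e) => e.toLowerCase() === normalized),
  );
  return matches.length > 0 ? matches[0] : undefined;
}
```

Note how "yes" only matches `confirm_booking` while the `booking_pending` context is active, which is the prerequisite role input contexts play.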
25. ENTITIES_
‣ Significant data extracted from user input in the form of parameter
values
‣ Entities are associated with particular actions
‣ There are three types:
‣ System
‣ Pre-built entities provided by Dialogflow (formerly API.AI) to facilitate
handling common concepts (colors, locations,…)
‣ Developer
‣ Custom entities created with Reference Value plus Synonyms
‣ User
‣ Defined per session, for instance a user's specific playlists
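A developer entity built from a reference value plus synonyms could be resolved along these lines. The `EntityEntry` type, the `roomType` entries and the substring matching are assumptions made for this sketch; a real NLU platform uses far more robust matching.

```typescript
// One entry of a developer entity: a reference value and its synonyms.
interface EntityEntry {
  value: string;
  synonyms: string[];
}

// Hypothetical "room type" entity for a hotel assistant.
const roomType: EntityEntry[] = [
  { value: "double", synonyms: ["double", "twin", "two beds"] },
  { value: "suite", synonyms: ["suite", "junior suite"] },
];

// Returns the reference value of the first entry with a synonym found in the
// utterance, or undefined when nothing matches. Naive substring search only.
function extractEntity(utterance: string, entries: EntityEntry[]): string | undefined {
  const text = utterance.toLowerCase();
  for (const entry of entries) {
    if (entry.synonyms.some((s) => text.indexOf(s.toLowerCase()) !== -1)) {
      return entry.value;
    }
  }
  return undefined;
}
```

Whatever synonym the user says, the app receives the same reference value as the parameter.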
26. CONTEXTS_
‣ Persisted information that can be used across intents
‣ It can be internal like a particular movie the user is asking for
‣ Or external like the user data retrieved from a user system
‣ Lifespan:
‣ By default they last for 5 requests or 10 minutes
‣ Input Context:
‣ Limit intents to be matched only when certain contexts are set
‣ For example when you need specific info to perform action
‣ Output Context:
‣ Set by an intent, tied to the user session, and shared with later intents
‣ Automatically added to follow-up intents
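The default lifespan mentioned above (5 requests or 10 minutes, whichever ends first) can be modelled with a small amount of bookkeeping. This is a hypothetical sketch of the idea, not how the platform implements it; `ActiveContext`, `setContext` and `afterRequest` are invented names.

```typescript
interface ActiveContext {
  name: string;
  requestsLeft: number; // remaining requests before the context expires
  expiresAt: number; // timestamp in milliseconds
}

const DEFAULT_REQUESTS = 5;
const DEFAULT_MINUTES = 10;

// Adds a context, or resets its lifespan if it is already active.
function setContext(contexts: ActiveContext[], contextName: string, nowMs: number): ActiveContext[] {
  const others = contexts.filter((c) => c.name !== contextName);
  return others.concat([
    {
      name: contextName,
      requestsLeft: DEFAULT_REQUESTS,
      expiresAt: nowMs + DEFAULT_MINUTES * 60 * 1000,
    },
  ]);
}

// Call after each handled request: ages every context and drops the ones
// that ran out of requests or passed their time limit.
function afterRequest(contexts: ActiveContext[], nowMs: number): ActiveContext[] {
  return contexts
    .map((c) => ({ ...c, requestsLeft: c.requestsLeft - 1 }))
    .filter((c) => c.requestsLeft > 0 && c.expiresAt > nowMs);
}
```

An intent's input contexts would then be checked against the list that survives `afterRequest`.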
27. EVENTS & DIALOGS_
‣ Events are a feature that allows you to invoke intents by an event
name instead of a user query
‣ Dialogs
‣ Linear
‣ With Slot Filling you define required parameters with prompts
and order them. The agent keeps asking until it has all the information.
‣ Non-linear
‣ Complex dialogs are built with context routing: remove the
Output Context once an Intent responds, and add a new Output
Context that the Intent for the next question matches
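The linear, slot-filling dialog described above boils down to a simple loop: find the first required parameter that is still missing and ask for it. A minimal sketch, with the `Slot` type, `bookingSlots` and `nextPrompt` being hypothetical names:

```typescript
// A required parameter and the question the agent asks to obtain it.
interface Slot {
  name: string;
  prompt: string;
}

// Ordered list of required parameters for a hypothetical booking intent.
const bookingSlots: Slot[] = [
  { name: "date", prompt: "For which date?" },
  { name: "nights", prompt: "How many nights?" },
];

// Returns the prompt for the first missing slot, in the defined order,
// or undefined once every required slot has a value.
function nextPrompt(
  slots: Slot[],
  filled: Partial<Record<string, string>>,
): string | undefined {
  for (const slot of slots) {
    if (filled[slot.name] === undefined) {
      return slot.prompt;
    }
  }
  return undefined;
}
```

The agent would call `nextPrompt` after every user turn and only trigger the intent's action once it returns `undefined`.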
30. GOOGLE ASSISTANT_
‣ ACTIONS ON GOOGLE ALLOWS BUILDING APPS
‣ GOOGLE HOME, HOME MINI, ANDROID, ANDROID AUTO, WEAR OS
‣ POTENTIAL REACH OF UP TO 80% OF THE WORLD MOBILE
MARKET
‣ CUSTOM DIALOG ENGINE: DIALOGFLOW
‣ BEST SPEECH RECOGNITION
‣ BEST POSSIBLE FUTURE INTEGRATION WITH OTHER GOOGLE
SERVICES
31. ALEXA_
‣ ALEXA SKILLS FOR THIRD PARTY INTEGRATIONS
‣ AMAZON ECHO, ECHO DOT
‣ LARGEST SALES CHANNEL
‣ LARGEST CURRENT MARKET SHARE THANKS
TO EARLY TIME-TO-MARKET
‣ IOT INTEGRATION THROUGH ALEXA VOICE SERVICE
32. HOME DEVICES_
‣ SMART SPEAKERS WILL BE THE CENTER OF HOME USAGE
‣ HOME AUTOMATION
‣ IoT DEVICES
‣ GENERAL USE CASES:
‣ PURCHASES
‣ INFORMATION DEMAND
‣ AGENDA
‣ CONTROL OVER OTHER DEVICES
35. CURRENT USAGE & PREDICTIONS_
‣ 50% OF ALL SEARCHES WILL BE VOICE-BASED BY 2020
‣ 22M SMART SPEAKERS IN THE US BY 2020
‣ 400M DEVICES WITH ACCESS TO GOOGLE ASSISTANT THIS YEAR
‣ A GOOGLE HOME IS SOLD EVERY SECOND IN THE US
‣ 40% OF ADULTS USE VOICE SEARCH