This talk provides an overview of the history of design in technology, highlighting what we have learned over the years in developing for a screen. Designing for the ear, however, is different from designing for the screen. This talk establishes best practices for voice-first design, contrasting them with GUI design principles. You will learn the similarities and differences between developing for voice and developing for screen-oriented mediums, and how to create engaging experiences where customers can speak in their own words, receive individualised responses, and easily find what they need via voice.
More details:
https://confengine.com/agile-india-2019/proposal/9648/voice-design-how-designing-for-voice-is-different-from-designing-for-screens
Conference link: https://2019.agileindia.org
18. Turn up the volume
Volume up
Louder
Make it louder
Turn it down
Set the volume to 6
Utterances
Affordances
min max
🔊 Volume
19. For the best colleges.
For the best colleges in Bangalore.
For colleges with a Computer Science program.
For colleges with a Computer Science program in Bangalore.
Utterances
20. For the best colleges.
For the best colleges in Bangalore.
For colleges with a Computer Science program.
For colleges with a Computer Science program in
Bangalore.
Intent
CollegeSearch
Intents
21. For colleges with a Computer Science program in Bangalore
For colleges with a {degree} program in {location}
Slots
22. {size} Synonyms
tiny minuscule, little
small little, modest, small-scale
medium average, middle, in between
large huge, gigantic, big, populous
I want a {size} college
I want a college that’s {size}
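On the Alexa platform, this synonym table is declared in the interaction model and resolved by entity resolution. Purely to illustrate the idea, here is a minimal Python sketch of the same resolution logic (the function and table names are mine, not the platform's):

```python
from typing import Optional

# Hypothetical sketch of synonym-to-canonical resolution for a {size}
# slot. The table mirrors the slide; in a real skill this lives in the
# interaction model JSON, not in code.
SIZE_SYNONYMS = {
    "tiny": ["minuscule", "little"],
    "small": ["little", "modest", "small-scale"],
    "medium": ["average", "middle", "in between"],
    "large": ["huge", "gigantic", "big", "populous"],
}

# Invert the table: spoken word -> canonical slot value.
# Note: "little" appears under two sizes; the last entry wins here.
RESOLVE = {}
for canonical, synonyms in SIZE_SYNONYMS.items():
    RESOLVE[canonical] = canonical
    for word in synonyms:
        RESOLVE[word] = canonical

def resolve_size(spoken: str) -> Optional[str]:
    """Return the canonical size for what the user said, if known."""
    return RESOLVE.get(spoken.lower().strip())
```

With this, "I want a huge college" and "I want a college that's populous" both resolve {size} to "large".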
24. Alexa, open Holiday Planner and tell
me the cheapest return flight from
Pune to Goa on the 10th of January
and returning on the 16th of January
and what offers are available….
25. “Alexa, start outdoor guru”
“Welcome to…”
“Alexa, ask outdoor guru about fishing ”
“Where to?”
“Goa”
“Goa?”
“Yes”
“When .. leave?”
“Next Friday”
“Until when?”
“The following Tuesday”
“What would you like to do?”
“I’ll be fishing”
“Did I get all this right?”
“yes”
“I have two ideas for you…”
Invoking
the skill
Dialog
support
Intent
handling
Use Multi-turn Dialog
26. “Alexa, start outdoor guru”
“Welcome to…”
“Alexa, ask outdoor guru about fishing ”
“Where to?”
“Goa”
“Goa?”
“Yes”
“When .. leave?”
“Next Friday”
“Until when?”
“The following Tuesday”
“What would you like to do?”
“I’ll be fishing”
“Did I get all this right?”
“yes”
“I have two ideas for you…”
Slot confirmation
Intent confirmation
Slot elicitation
Use Multi-turn Dialog
40. The First Time Storyboard
[Storyboard diagram: cards laid out as utterance / situation / response / prompt for a first-time user. “Launch” (first time) → “Welcome” → prompt “State?”; “I don’t know” (need more info) → “No Problem”; “Karnataka” (enough info) → “KA, Nice!” → prompt “Where?” → “NIT K”]
41. The Fifth Time Storyboard
[Storyboard diagram: a returning user on a five-day streak. “Launch” (5-time streak) → “Streak!”; “How was..” the assigned school, “BMSCE” → “Glad you liked it”; prompt “Degree?” → “Engineer” (enough info) → “That helps”. Cards tagged utterance / situation / response / prompt]
42. The Disrupted Storyboard
[Storyboard diagram: the user breaks out of the expected flow. “Launch” (2 of 5, 5-time streak) → “Streak!”; “How was..” the assigned school → user interrupts with “What’s my fav?” (has schools with ratings) → “NIT K”; “Learn more?” → “MSRIT” → “Great sports”; “Let’s keep looking” → prompt “Size?” (enough info) → “Ok”. Cards tagged utterance / situation / response / prompt]
43. Storyboards → Cards
[Summary diagram: the three storyboards above broken down into individual cards, each tagged with its utterance, situation, response, and prompt]
“Conversational platforms will drive the next big paradigm shift in how humans interact with the digital world.”
— Gartner, Top 10 Strategic Technology Trends for 2018
Speaker Notes:
Every decade, we’ve embraced a new way to interact with computers,
from character mode to a graphical user interface, to web and mobile.
[PAUSE]
Each step brought a magic moment when we realized what we could suddenly do. Think back to the first time you clicked on something with a mouse or pinched the screen with your fingers.
In that same way, we are now awakening to the vast potential of the voice user interface, or VUI.
Voice is the next major disruption
We know this because customers are telling us.
Sales
Usage
40K reviews. 4.4 stars.
Why? Voice that actually works well.
Customer love video
We believe voice represents the next major disruption in computing. We believe this because customers are telling us it is so. The sales success of Amazon Echo only tells part of the story. The amount customers are using Alexa, the cloud service that powers those devices, is staggering.
Amazon Echo has over 40,000 reviews on Amazon.com with an average review rating of 4.4 stars. So not only are customers buying the product and using it, they apparently love it.
Customers tell us they love it because the voice and speech recognition really do work in a seamless, natural way.
Check out this video of real customers in real homes using Alexa. These videos were all sent to the Echo team by real Amazon customers who just wanted to share.
Added by Amit
As we can see, every 10 years there’s a dynamic shift in user interfaces. We think we’re on the cusp of a major disruption in computing.
Every few years, there’s a dynamic shift in the interfaces we use.
It started with character mode.
Touch: while touch has been great for xyz things, it’s not so great for abc things. Voice is the future for these kinds of “quick in and out” interactions.
Added by Amit
For that same reason, we think that voice is the next major disruption in computing.
Voice is a really good interface because of its speed, its naturalness, and the fact that there is nothing to learn. Voice is the next major disruption: easy, fast, natural.
Voice assistants are not just for the home anymore: we are enabling devices to be used on the go (headphones, car accessories), at hotels, businesses, and more.
NOTE TO SPEAKERS: Please make sure you have the latest approved number. You can find this easily on the ASK detail page.
Alexa provides capabilities, or what we call skills. Skills are like apps for the smartphone. Developers build skills to add new functionalities and make Alexa smarter. And customers can enable the skills of their choosing to create a more personalized experience.
Our community of developers has already built more than 25,000 skills, and that number is growing every day. These skills span a wide range. Gaming skills like Jeopardy provide entertaining experiences that keep customers coming back. Flash briefing skills like NPR and Bloomberg deliver fresh content whenever customers ask. And smart home skills like Wemo make it faster and easier for customers to control their smart home devices.
Speaker Notes:
We have learned a lot by designing for different platforms over the years. One thing that has stayed the same: the experience is successful if the user is happy and we get the experience right. So focusing on a flawless experience for the user, no matter what the interface is, is still the key.
To get the experience right, we need to understand the platform and its use cases before designing any applications for it.
We didn’t do that for other platforms to begin with, and that led to many bad experiences.
Let me give you an example in the context of shifting between web and mobile. At first, we crammed all the information from a web page onto the small screen of a mobile device. It was hard to skim, and small buttons were hard to press. It just turned out to be a bad experience.
The use case was different too. People might use mobile more to locate stores, whereas they use the web page to browse items.
Speaker Notes:
It is important to be consistent in visual design to give users clues about how to find things.
If I asked you to create a mobile app where all the pages look different (different fonts, colors, design) [Break] you might think I am crazy. In a visual experience, we want to be consistent. We want to give users visual clues so that they can skim easily, and to reduce their cognitive load.
In a voice user experience, there is nothing to skim through, so we are not solving any problem by being consistent. At the same time, ears expect variety. If I ask you “How was your day?” every day and you always answer the same way, I lose interest. I stop listening. It gets boring. So we need to keep in mind that ears expect variety, and design for that.
[Opportunity to role play by asking a few audiences a question and no matter what they respond to answer the same way.]
[Can use Dev tips to demonstrate how good bye and hello message is always different]
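The "vary your hello and goodbye messages" tip can be sketched in a few lines: keep several phrasings per response and pick one at random, never repeating the one just used. The phrasings and function below are illustrative, not from a real skill.

```python
import random

# Ears expect variety: keep several phrasings for each response and
# pick one at random, avoiding an immediate repeat. All strings are
# illustrative placeholders.
GOODBYES = [
    "Goodbye!",
    "See you later!",
    "Talk to you soon.",
    "Bye for now!",
]

_last_goodbye = [None]  # remember the previous pick between turns

def say_goodbye() -> str:
    """Pick a goodbye that differs from the one used last time."""
    choices = [g for g in GOODBYES if g != _last_goodbye[0]]
    pick = random.choice(choices)
    _last_goodbye[0] = pick
    return pick
```

Because the previous pick is excluded, two consecutive calls are guaranteed to sound different.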
Speaker Notes:
Screen-based applications have a well-defined happy path.
There is one way to get to your goal: the menu item and the things to click are always in the same spot. But in voice, you can ask for the same thing in many different ways. Let’s say you want to book a flight. You can provide all the information in any order, all at once, or a little bit at a time. There is no single specific way to provide information and reach your goal.
Skills that try very guided approaches (“say Option 1 for this, Option 2 for that…”) end up being like IVR systems, something we want to move on from.
Speaker Notes:
Screen-based applications are designed for how people write. We don’t speak the way we write.
Anyone who has both read and watched Harry Potter knows this: the dialogue in the book is not the same as the dialogue in the movie.
When you start designing for voice, you first have to adopt a different writing process. In web or mobile design, once you write your copy, you likely don’t need to test and iterate many times. In voice design, it’s important to test and iterate; it’s not sufficient to simply write the script and build the skill. This is because we don’t speak the way we write.
Speaker Notes:
The first one is “Be adaptable”, meaning let users speak in their own words.
There is a pretty fundamental difference between voice and GUI design. With web and mobile, designers are thinking about what words to put on buttons and what the labels will look like. In voice, the customer says what they want to say.
There is this concept of affordance in design. An affordance lets you know how to use something by its design, like a door handle shaped in a way that affords pushing. In GUI, it can be a button with a hover state to indicate you can click on it.
When we first design affordances for a GUI, they tend to be very skeuomorphic, meaning the design cues are taken from the physical world, like a volume control that looks like a radio dial you can grab and twist. [Break] Over time they get simpler, to the point that we only have an icon or label and people know how to use it.
In voice there is no visual clue to guide users. So in designing for voice, affordance means we can’t guide users with visual clues about what we afford. Instead, we let people express themselves the way they want, and we handle it through intents, utterances, slots, synonyms, and so on.
We have this concept of an intent in voice. “I want to TURN THE VOLUME UP” is my intent. But I have so much flexibility to say it in different ways using utterances:
Things like:
-Turn up the volume
-Volume up
-Louder
Our skill should handle all these different ways of saying the same thing.
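In a real Alexa skill, the utterance-to-intent mapping is declared in the interaction model JSON and matched by the platform's NLU. Purely to show the shape of that mapping, here is a plain-Python sketch (the intent names and sample phrases are illustrative):

```python
# Many utterances, one intent. In a real skill this table lives in the
# interaction model JSON; the platform does the matching for you.
INTENTS = {
    "VolumeUpIntent": [
        "turn up the volume",
        "volume up",
        "louder",
        "make it louder",
    ],
    "VolumeDownIntent": [
        "turn it down",
        "volume down",
        "quieter",
    ],
}

def match_intent(utterance: str):
    """Return the intent whose sample utterances contain this phrase."""
    spoken = utterance.lower().strip()
    for intent, samples in INTENTS.items():
        if spoken in samples:
            return intent
    return None
```

"Louder", "volume up", and "make it louder" all land on the same VolumeUpIntent; the handler code only ever sees the intent.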
https://www.istockphoto.com/photo/emergency-exit-door-gm471460697-34869824
Stimulus-Response Compatibility
Explicit
door that says push and has a place to push rather than a handle
Pattern and Metaphorical
Hyperlink
Save icon
Collectively learn
Hidden
Menu
Negative (disabled)
Start familiar and expected (skeuomorphic)
Adapt to the medium and simplify
Utterances – Your voice user interface, things the user says to get the skill to do “something”
Intent – the “something” that happens as a result of the utterance.
Transition:
What are slots?
Turns out you’ve been looking at them the whole time. Let’s turn them on.
Here’s an utterance with the slots.
Notice how the same sentence structure can have multiple meanings just by changing the slot values.
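On Alexa the platform fills the slots for you; to make the idea concrete, here is a small self-contained sketch (the template helper is mine, not an ASK API) that turns the sample utterance into named slots:

```python
import re

# One sentence shape, many meanings: the same template yields different
# results as the slot values change. Template and names are illustrative.
TEMPLATE = "for colleges with a {degree} program in {location}"

def template_to_regex(template: str) -> re.Pattern:
    """Turn '{slot}' placeholders into named capture groups."""
    parts = re.split(r"(\{\w+\})", template)
    out = []
    for part in parts:
        m = re.fullmatch(r"\{(\w+)\}", part)
        if m:
            out.append(f"(?P<{m.group(1)}>.+?)")  # lazy named group
        else:
            out.append(re.escape(part))           # literal text
    return re.compile("".join(out) + "$", re.IGNORECASE)

PATTERN = template_to_regex(TEMPLATE)

def extract_slots(utterance: str):
    """Return slot name -> value for a matching utterance, else None."""
    m = PATTERN.match(utterance.strip())
    return m.groupdict() if m else None
```

"For colleges with a Computer Science program in Bangalore" and "For colleges with a Law program in Pune" match the same template with different {degree} and {location} values.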
Speaker Notes:
“Be relatable” is about conversation being a cooperative experience. Web and mobile are not cooperative; the GUI tells us what to do by providing wizards and showing options on menus.
Voice is meant to be more cooperative: we have the ability to hold back-and-forth dialogues. You can imagine that if someone wants to make a long, very complex request, they probably won’t be able to say the whole thing in one go. We need multi-turn dialogues to gather all the information.
E.g., no one will say: “Alexa, ask college finder to help me find a college in California, that offers an Animal Sciences degree, with a large student body, relatively low tuition, that’s pretty hard to get into….
No one talks like that.
Using the Alexa platform, we can help you define the dialogue to handle over-answering and under-answering situations. A sample:
Alexa, plan a trip?
OK, Where do you want to go?
I want to go to Sydney.
When do you want to go?
The 10th of December.
It is more like a wizard flow to gather information. But people might not answer that way. In fact, they often won’t.
When Alexa asks:
Where do you want to go?
The answer can be:
To Sydney for Surfing.
So we should not ask:
- What do you want to do when you get there?
They already told us that. That’s an example of over-answering.
Another example can be
- Alexa, plan a trip?
OK, Where do you want to go?
I want to go surfing.
They haven’t mentioned the city, but we should still capture “surfing” as an activity, even though we asked for a place. This is what makes it cooperative. Then, instead of repeating
“Where do you want to go?”, we should probably ask “Which city do you want to go to?” You ask the question in a slightly different way.
So here we talked about slot elicitation: getting slot values based on what the user said. We can also confirm the slot values we heard.
The explicit way would be: “Did you say Seattle?” with a Yes/No answer. That’s for when hearing correctly is very important for the task at hand, like booking a flight.
But we can also confirm implicitly, by saying:
“Here is the weather in Sydney.” It isn’t worth asking “Did you mean the weather in Sydney?” Explicit confirmation can get annoying if we overuse it.
And finally, for intent confirmation on important tasks like booking a flight, we can ask:
“Did you want me to book the flight?”
That is for the FlightBooking intent, to make sure the user wants to proceed.
<Here we are going to open developer console and to show sample code for flight booking and Over and under answering with confirmations>
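On the Alexa platform, slot elicitation and confirmation are driven by Dialog directives declared in the dialog model. As a plain-Python sketch of the control flow only (slot names and prompts are illustrative), the loop looks like this:

```python
# Hedged sketch of a multi-turn dialog for a hypothetical trip-planning
# intent: elicit missing slots, accept over-answering (the user supplies
# slots we did not ask for), then confirm the intent.
REQUIRED_SLOTS = ["city", "date", "activity"]
PROMPTS = {
    "city": "Which city do you want to go to?",
    "date": "When do you want to go?",
    "activity": "What would you like to do there?",
}

def next_prompt(slots: dict) -> str:
    """Elicit the first missing slot; confirm the intent when all are set."""
    for name in REQUIRED_SLOTS:
        if not slots.get(name):
            return PROMPTS[name]          # slot elicitation
    return "Did you want me to book the trip?"  # intent confirmation

def handle_answer(slots: dict, answer: dict) -> dict:
    """Merge every slot the user supplied, even ones we didn't ask for."""
    slots.update({k: v for k, v in answer.items() if v})
    return slots
```

Asked only “Which city?”, a user may still answer “To Sydney, for surfing”; handle_answer keeps both city and activity, so next_prompt skips straight to asking for the date.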
This is not good because the user will stop using your skill.
How do you solve this problem?
Design note: This slide aside from the following is a typical slide that we use for Multi turn dialogue. It might require some design touch especially the color.
AP: Done
The skill we built this morning was a simple call and response. Let’s talk about a more conversational experience.
Design note: The same as previous slide
AP: done
Slot elicitation, Slot confirmation, Intent confirmation
Speaker Notes:
“Be contextual” is all about having the right context, or letting people interact in the right context. It is not just about gathering a profile of the user and using their name back to them, although that is also good.
A good flavor of this is that the experience should be tailored over time. When I use a mobile app, I see the same screen every time, and that’s actually good, because I can learn it and understand what is going on on the page. But in voice you want to tailor it: the first time you come to a skill it says one thing, and the more you use it, the more it can change.
A sample I usually use here is the first time we open a skill. We can provide some information about what the skill can do and how to navigate. But the third time, that can be unnecessary; the user should have learned it by hearing it twice.
One important aspect of being contextual: let’s say I am in a game and I pause in the middle. I should be able to come back any time and pick it up where I left off. This involves memory: how to remember state and respond properly.
Also, when you are designing your skill, you should tailor it based on what the customer is trying to get out of it.
There are different roles for a skill:
Get information / command and control: for example, the customer says “Turn on the lights” or “What’s the weather?” It is very annoying if, instead of doing what the user asked, the skill says, “Hey, I am going to turn on the lights. Are you ready?….” Or when they ask “Give me a fact”: that’s a “do this” or “fetch me this information” request. Don’t say much more; just do it. <Sample: Big Sky for weather>
Support/guidance: this is more about helping them navigate to accomplish something. This might be the trip-planning skill, where they don’t know exactly where they want to go or which plane to take. You walk them through the process, provide options, help them make decisions, and act more like an assistant. Here we design it more as a dialogue, with back-and-forth conversations and confirmations. <Sample: Flight Booking skill>
Entertainment: you want the person to lean back and just enjoy the experience. This is mainly a storytelling style. <Sample: Bedtime story>
So being contextual is about making sure the skill behaves the way people expect it to.
Sample code: Persistence.
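The persistence idea can be sketched without the ASK SDK: on Alexa you would typically use persistent attributes backed by S3 or DynamoDB, but a JSON file makes the idea runnable anywhere. The class, paths, and messages below are illustrative.

```python
import json
import os

# Being contextual needs memory. This fake attribute store stands in
# for Alexa persistent attributes (normally S3 or DynamoDB).
class AttributeStore:
    def __init__(self, path: str):
        self.path = path

    def load(self) -> dict:
        if os.path.exists(self.path):
            with open(self.path) as f:
                return json.load(f)
        return {}

    def save(self, attrs: dict) -> None:
        with open(self.path, "w") as f:
            json.dump(attrs, f)

def welcome(store: AttributeStore) -> str:
    """Tailor the welcome message by how often the user has visited."""
    attrs = store.load()
    attrs["visits"] = attrs.get("visits", 0) + 1
    store.save(attrs)
    if attrs["visits"] == 1:
        return "Welcome! You can ask me to find a college by state or degree."
    return "Welcome back! Where were we?"
```

The first launch explains what the skill can do; every later launch skips the tutorial, which is exactly the first-time/fifth-time tailoring the storyboards illustrate.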
Tailor your responses and prompts
Speaker Notes:
“Be available” is mainly about restructuring the way we think about web and mobile information architecture.
In a graph-based UI, you walk a chain of logic: I do this, then I do that; I make choices and I get where I want to go. I can get deeper and deeper into the experience, and this is good because you can map out an entire site, and people learn how it is arranged and can navigate to different places. That’s very good design for the number of pixels we have on a screen, and for the number of concepts people can skim through at each stage; otherwise they would be overwhelmed.
A good example is to compare a banking mobile app with a banking skill. To find your routing number (or IFSC code in India), you first tap the hamburger menu, go to your account info, tap account details, and then see your routing info. That’s the best practice on a GUI with the limited pixels available.
With voice it’s different. You can’t expect your users to remember a series of nested menus to get to a routing number. They’d rather just ask the skill what the routing number is.
So you end up with a top-level UI instead of nested menus, which is the concept this frame is trying to illustrate.
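Structurally, a top-level voice UI is just a flat dispatch table: every request arrives as an intent and routes straight to a handler, with no navigation state. A hypothetical banking-skill sketch (intent names, data, and wording are all illustrative):

```python
# With voice there is no menu tree: each intent routes directly to a
# handler, one hop deep. Intent names and account data are made up.
HANDLERS = {
    "RoutingNumberIntent":
        lambda acct: f"Your routing number is {acct['routing']}.",
    "BalanceIntent":
        lambda acct: f"Your balance is {acct['balance']} rupees.",
}

def handle(intent: str, account: dict) -> str:
    """Dispatch directly to the intent's handler; no nested navigation."""
    handler = HANDLERS.get(intent)
    if handler is None:
        return "Sorry, I can't help with that yet."
    return handler(account)
```

Asking “what’s my routing number” goes straight to the answer, where the GUI equivalent took three taps through nested menus.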
Speaker note: Here are the 4 key design principles we talked about. If you have any questions, please ask, or we can walk through some code that illustrates the topics we covered.