This presentation is about how reliability engineering is applied to ABEMA, a Japanese video streaming service. Streaming reliability engineering matters for a video streaming service that must keep evolving for years in order to deliver new value in media. Engineers have to take on new technical challenges rapidly while keeping the product reliable enough for viewers to use without concern.
2. Yusuke Goto
五藤 佑典
https://ygoto3.com/
@ygoto3_
● Majored in Graphic Design at California State University, San Bernardino
● Software engineer @ CyberAgent, Inc. and AbemaTV, Inc.
● CyberAgent, Inc.
○ A Developer Expert in video technology and product design
● AbemaTV, Inc.
○ Leads the Video Streaming Client Engineering team
○ Leads the Cross Device Engineering team
4. Internet media
● Blogging platform
● Music streaming
● Game
● Curation
● Ad product development
● Ad agency
Broadcasting
● News
● Documentary
● Drama
● Anime
● Sports
● Music
● Movies
- Internet TV station -
10. Why Reliability Engineering is Important
● Keep creating new service values by challenging technical edges
● Keep our service reliable all the time
Our limited resource: Development ? % / Operation ? %
11. Why Reliability Engineering is Important
● Keep creating new values by challenging technical edges
● Keep our system available all the time
Our limited resource: Development ? % / Operation ? %
Reliability engineering gives us a way to decide
the balance between development and operation
13. W/o Reliability Engineering
● Quality improvement depends on individual ownership
● Confusion caused by subjective reviews on app stores and social media
● Gradual degradation of service quality can go unnoticed
As a result,
reliability will be lost
14. Google’s SRE Practice
The ideas of Google’s SRE practice can be applied to video streaming
https://sre.google/books/
15. Streaming Reliability Engineering
Not Site Reliability Engineering
● We consider it a part of Site Reliability Engineering
● The part of reliability engineering that requires domain knowledge in video
streaming
16. Defining the Video Streaming Level
When can you say your video streaming service is reliable enough?
Service Level Terminologies
● Service Level Indicators a.k.a. SLI
○ a carefully defined quantitative measure of some aspect of the level of service that is provided
● Service Level Objectives a.k.a. SLO
○ a target value or range of values for a service level that is measured by an SLI
● Service Level Agreements a.k.a. SLA
○ an explicit or implicit contract with your users that includes consequences of meeting (or
missing) the SLOs they contain
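To make these terms concrete, here is a minimal sketch (the class, names, and numbers are invented for illustration, not from the talk): the SLI is the measured quantity, and the SLO is the target it is compared against.

```python
from dataclasses import dataclass

@dataclass
class ServiceLevelObjective:
    name: str      # which SLI this objective targets
    target: float  # e.g. 0.99 means 99% of attempts must succeed

    def is_met(self, sli_value: float) -> bool:
        """An SLO is met when the measured SLI reaches the target value."""
        return sli_value >= self.target

# SLI: a carefully defined quantitative measure, here a success ratio.
startup_attempts, startup_successes = 10_000, 9_950
startup_success_rate = startup_successes / startup_attempts  # 0.995

slo = ServiceLevelObjective(name="Startup Success Rate", target=0.99)
print(slo.is_met(startup_success_rate))  # True: 99.5% >= 99%
```

An SLA would then be the contract layered on top: what happens (to users, or between teams) when `is_met` stays false for too long.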
17. This is an example of how
a video streaming service tries
Reliability Engineering
18. Defining Video Streaming SLOs
Based on which playback behaviors really matter to a user
Key questions: Does a viewer
● Succeed in starting to watch the content?
● Smoothly watch the whole content with no interruption?
● Watch the content in good video quality?
● Not wait long for the content to start playing?
19. Defining Video Streaming SLOs
Does a viewer succeed in starting to watch the content?
● The most critical question
● It matters regardless of the type of use case
● If a viewer fails, it means we fail to provide any value as a video streaming
service
20. Defining Video Streaming SLOs
Does a viewer smoothly watch the whole content with no interruption?
● The second most critical question
● It matters regardless of the type of use case
● If a viewer doesn't, it means we give them a stressful experience
21. Defining Video Streaming SLOs
Does a viewer watch the content in good video quality?
● This question differentiates ABEMA from others
● 2 types of video quality
○ Resolution
■ Matters most when watching on a big screen
○ Encoding
■ Matters regardless of the type of use case
● If the quality is not satisfactory, we risk driving viewers away from
our service
22. We’re Reinventing the TV, but Extremely Convenient
Professional news
Live streaming
Simultaneity
Anywhere
● Phones
● TVs
● Desktops
● Tablets
● Smart Speakers
● Smart Displays
Anytime
● Timeshift playback
● Chasing playback
● Double-speed
playback
23. Various Types of Usecases
● Phones
● Tablets
● TVs
● Web on Desktops
● Web on Phones
● Smart Speakers
● Smart Displays
24. Defining Video Streaming SLOs
Does a viewer not wait long for the content to start playing?
● It matters most when a viewer zaps between channels
○ An essential experience of TV
● If a viewer waits long, it means we fail to provide value equivalent to TV
broadcasting
25. Converting Key Questions into Indicators
Does a viewer succeed in starting to watch the content?
Startup Success Rate
27. Converting Key Questions into Indicators
Does a viewer smoothly watch the whole content with no interruption?
Successful Playback Rate
30. Converting Key Questions into Indicators
Does a viewer watch the content in good video quality?
Rendition Distribution Success Rate (this indicator is about the video resolution)
Video Quality MOS Success Rate
33. Converting Key Questions into Indicators
We used to value other indicators:
● Round-Trip Time
● Bandwidth
● Bitrate
● Throughput
A user doesn't care even if these are bad
A user cares about the video's resolution!
35. Converting Key Questions into Indicators
Rendition Distribution Success Rate: a more user-centric indicator
How often we deliver the preferable rendition is what matters
for video quality
37. Converting Key Questions into Indicators
Does a viewer not wait long for the content to start playing?
Join Time Success Rate
38. Categorizing the SLIs
Indicators for QoS (Quality of Service)
● Startup Success Rate
● Successful Playback Rate
Indicators for QoE (Quality of Experience)
● Rendition Distribution Success Rate
● Video Quality MOS Success Rate
● Join Time Success Rate
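The indicators above can be sketched as simple ratios over playback-session records. This is an illustrative sketch: the per-session fields (`startup_ok`, `played_ok`, `join_time_s`) and the data are invented, not ABEMA's actual schema.

```python
# Hypothetical playback-session records; field names are illustrative.
sessions = [
    {"startup_ok": True,  "played_ok": True,  "join_time_s": 1.2},
    {"startup_ok": True,  "played_ok": False, "join_time_s": 2.8},
    {"startup_ok": False, "played_ok": False, "join_time_s": None},
    {"startup_ok": True,  "played_ok": True,  "join_time_s": 4.1},
]

def ratio(hits: int, total: int) -> float:
    return hits / total if total else 0.0

# QoS: Startup Success Rate -- did playback start at all?
startup_rate = ratio(sum(s["startup_ok"] for s in sessions), len(sessions))

# QoS: Successful Playback Rate -- among started sessions, no interruption.
started = [s for s in sessions if s["startup_ok"]]
playback_rate = ratio(sum(s["played_ok"] for s in started), len(started))

# QoE: Join Time Success Rate -- playback began within a 3-second budget.
join_rate = ratio(sum(s["join_time_s"] <= 3.0 for s in started), len(started))

print(startup_rate, round(playback_rate, 2), round(join_rate, 2))
```

Each rate maps back to one of the key questions: a failed startup spends the QoS budget, a slow join spends the QoE budget.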
40. What do SLOs mean?
When an SLO is not met:
● Our service loses users' trust
● A missed SLO should trigger an action to re-achieve it
● Restoring reliability should take priority over developing
a new feature
41. Defining Video Streaming SLOs
1. Collect the current facts
2. Decide a threshold for each SLI in order to keep the current service level
3. Decide the ideal objectives we should aim for over a longer period
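Steps 1 and 2 can be read as: measure the current level, then pick a threshold the service already meets most of the time. A sketch under that reading, with invented daily numbers:

```python
# Step 1: recent daily Startup Success Rate measurements (invented numbers).
daily_sli = [0.993, 0.996, 0.991, 0.995, 0.994, 0.992, 0.997]

def percentile(values, pct):
    """Nearest-rank percentile, no external dependencies."""
    ordered = sorted(values)
    rank = max(0, int(round(pct / 100 * len(ordered))) - 1)
    return ordered[rank]

# Step 2: a threshold the service already clears on almost every day,
# so adopting it as the SLO keeps the current service level.
current_threshold = percentile(daily_sli, 10)  # worst decile as the floor

# Step 3: the ideal objective to aim for over a longer period.
ideal_objective = 0.9999

print(current_threshold)  # 0.991 -- keeps today's level
```

Starting from the observed floor rather than an aspirational number avoids declaring the service "unreliable" on day one; the ideal objective then drives the later SLO updates.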
43. Creating an SLO Document
Category | SLI | SLO
QoS | Startup Success Rate | 99%
QoS | Smooth Playback Success Rate | 99%
QoE | Rendition Distribution Success (>= 1080p 60%) Rate, for TV | 90%
QoE | Video Quality MOS Success (>= 3.0) Rate | 90%
QoE | Join Time Success (<= 3 secs) Rate, for Linear | 75%
44. Updating the SLO Document towards Your Ideal Objectives
Category | SLI | SLO
QoS | Startup Success Rate | 99.99%
QoS | Smooth Playback Success Rate | 99%
QoE | Rendition Distribution Success (>= 1080p 80%) Rate, for TV / Desktop | 90%
QoE | Video Quality MOS Success (>= 4.0) Rate | 90%
QoE | Join Time Success (<= 2 secs) Rate, for Linear | 90%
45. Deciding Internal SLAs
More like agreements between developers and business owners
46. Collecting Indicators from Monitoring Metrics
● # of Startup Errors
● # of In-Stream Errors
● Rendition Distribution
● Join Time
● Video Quality MOS
47. It's time to take on new technical experiments
while your SLOs are met ...
48. Now we need to prevent the SLIs from degrading
as time passes
49. Practices to Keep Achieving SLOs
● Alerting on SLOs
○ Respond to problems before you consume too much of your error budget
● Change Management
○ Watch for uncontrollable changes in your users' environments
● Fallback Strategies
○ Make an aggressive stream less risky
50. Alerting on SLOs
The simplest way to set an alert is too fragile.
Target Error Rate ≥ SLO Threshold
51. Target Error Rate ≥ SLO Threshold
produces too many false-positive alerts
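A common remedy from the SRE books is burn-rate alerting: page only when the error budget is being spent far faster than the SLO allows, and require a long and a short window to agree so one-off blips don't fire. A sketch; the 99% target, the window roles, and the 14.4x factor are illustrative parameters, not values from the talk:

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than allowed the error budget is burning."""
    allowed_error_rate = 1.0 - slo_target
    return error_rate / allowed_error_rate if allowed_error_rate else float("inf")

def should_page(long_window_er: float, short_window_er: float,
                slo_target: float = 0.99, factor: float = 14.4) -> bool:
    """Page only if both a long and a short window burn fast, so a
    momentary spike (short window only) or an old, already-recovered
    incident (long window only) does not wake anyone up."""
    return (burn_rate(long_window_er, slo_target) >= factor and
            burn_rate(short_window_er, slo_target) >= factor)

# 20% errors in both the 1h and 5m windows: burning 20x budget -> page.
print(should_page(0.20, 0.20))   # True
# A brief spike that has already subsided -> no page.
print(should_page(0.002, 0.20))  # False
```

Compared with `error rate >= SLO threshold`, this rule only fires when the budget consumption is both fast and still ongoing, which is what cuts the false positives.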
60. Ephemeral Change Management
Watch for uncontrollable changes in your users' environments
● App version updates
● OS version updates
● Firmware updates
● Brand-new hardware releases
● Automatic SDK updates
● Combinations of updates
73. Combinations of updates
[Diagram: the CAF Receiver SDK updates automatically on every device, while the underlying firmware may be old, latest, or no longer updating, so the same SDK version ends up running on many untested SDK-firmware combinations.]
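One way to catch a bad SDK-firmware combination like this is to break the error rate down by version pair rather than watching a single global rate. A sketch with invented session data; the version labels are illustrative:

```python
from collections import defaultdict

# Hypothetical per-session records: (sdk_version, firmware_version, had_error).
sessions = [
    ("caf-3.0.0", "fw-old",    True),
    ("caf-3.0.0", "fw-old",    True),
    ("caf-3.0.0", "fw-latest", False),
    ("caf-2.0.0", "fw-old",    False),
    ("caf-3.0.0", "fw-old",    False),
]

totals = defaultdict(int)
errors = defaultdict(int)
for sdk, fw, had_error in sessions:
    totals[(sdk, fw)] += 1
    errors[(sdk, fw)] += had_error  # True counts as 1

# The error rate per combination reveals which pairing regressed,
# even when each version looks healthy on its own.
rates = {combo: errors[combo] / totals[combo] for combo in totals}
worst = max(rates, key=rates.get)
print(worst, round(rates[worst], 2))
```

Here the new SDK is fine on the latest firmware but fails on devices that stopped updating, which a global error rate would blur away.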
74. Fallback Strategy: Aggressive stream to Safe Stream
Modifying clients so that aggressive failures are less likely to cause user harm
75. Aggressive video stream - Low-latency news
Pros:
● A viewer watches news content close to real time
● Knowing what is happening right now can save lives during a disaster
Cons:
● Shorter segment length => smaller playback buffer => more rebuffering events
76. Aggressive video stream - Ad-personalized linear
Pros:
● A viewer watches potentially more interesting ads
○ Viewers and advertisers are both happy
Cons:
● High load from deciding personalized ads => streams can become unstable
79. Fallback Strategy: Aggressive stream to Safe Stream
Modifying clients so that aggressive failures are less likely to cause user harm
If a viewer watching an aggressive stream experiences
● Rebuffer Events
● Startup Errors
● In-Stream Errors
x times in y seconds,
then the player automatically switches to a safe stream
A fallback strategy makes
aggressive video streams safer
in a user’s environment
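The "x times in y seconds" rule above can be sketched as a sliding-window counter inside the player. The class name and the thresholds are illustrative, not from ABEMA's client code:

```python
from collections import deque

class FallbackTrigger:
    """Switch to the safe stream after x qualifying events within y seconds."""

    def __init__(self, max_events: int, window_seconds: float):
        self.max_events = max_events
        self.window_seconds = window_seconds
        self.events = deque()  # timestamps of rebuffers / errors

    def record(self, now: float) -> bool:
        """Record a rebuffer event, startup error, or in-stream error.
        Returns True when the player should fall back to the safe stream."""
        self.events.append(now)
        # Drop events that have aged out of the window.
        while self.events and now - self.events[0] > self.window_seconds:
            self.events.popleft()
        return len(self.events) >= self.max_events

# x = 3 events within y = 30 seconds triggers the fallback.
trigger = FallbackTrigger(max_events=3, window_seconds=30.0)
print(trigger.record(0.0))   # False
print(trigger.record(10.0))  # False
print(trigger.record(20.0))  # True -> switch to the safe stream
```

Because the window slides, occasional isolated events never trigger the switch; only a burst of trouble in a viewer's actual environment does.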
80. Conclusion
● Reliability engineering is an essential strategy for service innovation
● Applying reliability engineering to a video streaming service is hard
● Some practices work well for your service and others don't
○ Find your own practice
○ The example here is far from perfect - we keep exploring further
● There is no option but to keep practicing