14. Reactive monitoring: new challenges
Containers != VM
Cannot focus on host metrics anymore with dynamic container changes
Which metric(s) to track?
10,000s of metrics, massive numbers of alerts
How to identify or isolate the problem (correlate metrics)?
HW/Container/Host/Neighbor/Cluster/Dependency…
How to resolve these performance problems? (RCA)
Unmanaged resources -> Manual tuning
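One possible way to approach the "correlate metrics" question above is to rank candidate metrics by how strongly they move with a symptom metric. The following is a minimal sketch, not a method from the slides: the metric names and sample values are made up for illustration, and plain Pearson correlation is only one of many techniques that could be used here.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / math.sqrt(var_x * var_y)

def rank_by_correlation(symptom, candidates):
    """Sort candidate metrics by |correlation| with the symptom, strongest first."""
    scores = {name: pearson(symptom, series) for name, series in candidates.items()}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)

# Hypothetical samples: request latency plus three candidate metrics.
latency_ms = [10, 12, 11, 30, 45, 50]
candidates = {
    "container_cpu": [0.2, 0.25, 0.22, 0.7, 0.9, 0.95],
    "host_disk_io": [5, 4, 6, 5, 4, 6],
    "neighbor_net": [1, 3, 2, 2, 1, 3],
}
ranking = rank_by_correlation(latency_ms, candidates)
for name, score in ranking:
    print(f"{name}: {score:+.2f}")
```

Correlation only narrows down where to look; it does not by itself establish the root cause.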
15. Proactive challenges
Goal: Maximize performance with minimum cost!
Capacity Planning / What-If Analysis
Cluster scheduling (interference, heterogeneous, etc.)
Dynamic configuration
- Orchestration
- Network Storage
- Runtime
- OS
- Application
- VM
16. Gartner definition
AIOps platforms utilize big data, modern machine learning
and other advanced analytics technologies to directly and
indirectly enhance IT operations (monitoring, automation
and service desk) functions with proactive, personal and
dynamic insight. AIOps platforms enable the concurrent use
of multiple data sources, data collection methods, analytical
(real-time and deep) technologies, and presentation
technologies.
18. AIOps is not a new concept
Decades of academic research around managing performance in computing…
- Cluster Scheduling
- SLA-driven interference, storage, network, caching...
26. The Difficulty of using the Public Cloud
Many instance types
Reserved, on-demand, spot instances
Many instance sizes (10s)
Application churn
Variability in load
Can we guarantee SLOs & minimize cost automatically?
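The question above can be read as a constrained selection problem: among the available options, keep those that meet the SLO and pick the cheapest. A minimal sketch of that idea follows. The option names, prices, and latencies are hypothetical placeholders, not real cloud offerings or measurements, and a real system would also have to account for application churn and load variability.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Option:
    name: str              # hypothetical instance type / pricing model label
    hourly_cost: float     # illustrative price, not a real quote
    p99_latency_ms: float  # illustrative measured tail latency

def cheapest_meeting_slo(options: List[Option], slo_ms: float) -> Optional[Option]:
    """Return the lowest-cost option whose p99 latency is within the SLO, if any."""
    feasible = [o for o in options if o.p99_latency_ms <= slo_ms]
    return min(feasible, key=lambda o: o.hourly_cost) if feasible else None

options = [
    Option("small-on-demand", 0.10, 250.0),
    Option("medium-spot", 0.12, 90.0),
    Option("medium-on-demand", 0.20, 85.0),
    Option("large-reserved", 0.30, 40.0),
]
choice = cheapest_meeting_slo(options, slo_ms=100.0)
print(choice.name if choice else "no option meets the SLO")
```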
29. Confidential: do not distribute
Results in a couple of hours, with 20% sampling
...and there is room to further improve on speed & sampling
Optimal solution: highest Perf/Cost ratio
2nd-highest Perf/Cost ratio, with higher TPM
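The Perf/Cost comparison on this slide can be illustrated with a small ranking sketch. Everything here is an assumption for illustration: the configuration names and numbers are invented, TPM is taken to be the performance measure, and the slide's actual search and sampling procedure is not shown in the source.

```python
def perf_per_cost(config):
    """Performance per unit cost: TPM divided by hourly cost."""
    return config["tpm"] / config["cost_per_hour"]

def rank_by_perf_per_cost(configs):
    """Sort configurations by Perf/Cost ratio, best first."""
    return sorted(configs, key=perf_per_cost, reverse=True)

# Hypothetical sampled configurations (not results from the slides).
configs = [
    {"name": "config-A", "tpm": 12000, "cost_per_hour": 2.0},
    {"name": "config-B", "tpm": 20000, "cost_per_hour": 4.0},
    {"name": "config-C", "tpm": 9000, "cost_per_hour": 3.0},
]
ranked = rank_by_perf_per_cost(configs)
for c in ranked:
    print(c["name"], c["tpm"], round(perf_per_cost(c)))
```

In this made-up data the second-ranked configuration has the higher absolute TPM, which mirrors the trade-off the slide points out: the best ratio is not necessarily the highest throughput.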
38. Variability in user load
Bloated reservations due to performance variability
Twitter, Google
39. Oversubscription
Google search + Google brain (deep neural nets)
>90% hardware utilization
No latency violations for search
[Lo et al’15]
>90% HW utilization, no tail latency problems
42. Hyperpilot
Using ML to build the intelligence that helps operators gain deep insights
and further enables dynamic infrastructure.
Processors, Memory, Flash, Disk, NIC
Operating system
Cluster resource manager
Distributed analytics: Hadoop, Spark, …
IaaS
Online, data-intensive (OLDI): Search, social nets,
SaaS