2. About me
Apache PMC Mesos, Drill
Container engineer lead at Mesosphere
Helped start the community effort for Spark on Kubernetes (Google, Red Hat, Palantir, Salesforce, etc.).
@tnachen
14. Reactive monitoring: new challenges
Containers != VM
Cannot focus only on host metrics anymore: containers change dynamically
Which metric(s) to track?
10,000s of metrics, massive numbers of alerts
How to identify or isolate the problem (correlate metrics)?
HW/Container/Host/Neighbor/Cluster/Dependency….
How to resolve these performance problems? (RCA)
Unmanaged resources -> Manual tuning
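
One simple way to approach the "correlate metrics" question above is to rank candidate metrics by how strongly they move with a symptom such as request latency. The sketch below is only an illustration under stated assumptions: the metric names, the sample values, and the helper `rank_metrics` are hypothetical, and a high correlation only suggests where to look first, it does not prove a root cause.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_metrics(target, candidates):
    """Sort candidate metrics by absolute correlation with the target series."""
    scores = {name: pearson(series, target) for name, series in candidates.items()}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Hypothetical samples taken at the same six points in time.
latency_ms = [10, 12, 11, 30, 45, 13]
metrics = {
    "host_cpu": [0.50, 0.52, 0.51, 0.50, 0.53, 0.51],
    "neighbor_disk_io": [5, 6, 5, 40, 60, 7],
    "container_mem": [0.30, 0.30, 0.31, 0.30, 0.30, 0.31],
}

ranked = rank_metrics(latency_ms, metrics)
for name, score in ranked:
    print(f"{name}: {score:+.2f}")
```

In this made-up data the noisy-neighbor disk metric tracks the latency spikes most closely, which is the kind of Host/Neighbor distinction the slide lists.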
15. Proactive challenges
Goal: Maximize performance with minimum cost!
Capacity Planning / What-If Analysis
Cluster scheduling (interference, heterogeneous, etc.)
Optimal configuration
[Diagram: configuration layers: Orchestration, Network Storage, Runtime, OS, Application, VM]
18. AIOps is not a new concept
Decades of academic research around managing performance in computing…
- Cluster Scheduling
- SLA-driven interference, storage, network, caching...
19. Now is the time
- Workloads are more complex than before
- Infrastructure interfaces and data are becoming more standardized (Kubernetes, Prometheus, CTE, etc.)
- All metrics centrally collected
27. The Difficulty of using the Public Cloud
Many instance types
Reserved, on-demand, spot instances
Many instance sizes (10s)
Application churn
Variability in load
Can we guarantee SLOs & minimize cost automatically?
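
The question above can be read as a constrained selection problem: among the available instance options, keep those predicted to meet the SLO and take the cheapest. The sketch below assumes a latency estimate already exists for each option. The option names, prices, and latencies are invented for illustration, and `cheapest_meeting_slo` is a hypothetical helper, not part of any cloud API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    name: str              # hypothetical instance type / purchase model label
    hourly_cost: float     # assumed price in dollars per hour
    p99_latency_ms: float  # assumed measured or predicted tail latency


def cheapest_meeting_slo(options, slo_ms):
    """Return the lowest-cost option whose p99 latency meets the SLO, else None."""
    feasible = [o for o in options if o.p99_latency_ms <= slo_ms]
    if not feasible:
        return None
    return min(feasible, key=lambda o: o.hourly_cost)


# Made-up candidates; real numbers would come from profiling or a model.
candidates = [
    Option("small-spot", 0.03, 180.0),
    Option("medium-spot", 0.06, 95.0),
    Option("medium-on-demand", 0.10, 90.0),
    Option("large-reserved", 0.14, 60.0),
]

choice = cheapest_meeting_slo(candidates, slo_ms=100.0)
print(choice.name if choice else "no option meets the SLO")
```

Application churn and load variability mean the latency estimates go stale, so in practice a selection like this would have to be re-evaluated continuously rather than once.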
37. Variability in user load
Bloated reservations due to performance variability
[Figure panels: Twitter, Google]
38. Oversubscription
Google search + Google brain (deep neural nets)
>90% hardware utilization
No latency violations for search
[Lo et al’15]
>90% HW utilization, no tail latency problems
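
To make the oversubscription idea concrete, here is a minimal sketch of a latency-slack feedback step: best-effort work gets more cores while the latency-critical service has headroom, and is cut back as the service approaches its SLO. This is a generic illustration, not the controller from the cited work. The function name, the 10% slack threshold, and the one-core step size are all assumptions.

```python
def adjust_best_effort_cores(be_cores, p99_ms, slo_ms, max_cores):
    """Return the new core count for best-effort tasks after one control step."""
    slack = (slo_ms - p99_ms) / slo_ms  # fraction of the latency budget still unused
    if slack < 0.0:
        return 0                        # SLO violated: take all cores back
    if slack < 0.1:                     # assumed threshold: too close to the SLO
        return max(be_cores - 1, 0)     # back off by one core
    return min(be_cores + 1, max_cores) # plenty of headroom: grow by one core


# Example control steps with a 100 ms SLO and at most 8 best-effort cores.
print(adjust_best_effort_cores(4, p99_ms=50.0, slo_ms=100.0, max_cores=8))
print(adjust_best_effort_cores(4, p99_ms=95.0, slo_ms=100.0, max_cores=8))
print(adjust_best_effort_cores(4, p99_ms=120.0, slo_ms=100.0, max_cores=8))
```

A real system would also have to manage other shared resources (memory bandwidth, caches, network), which a single core count does not capture.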
40. Hyperpilot
Help enterprises operate containers in production.
Using AI to learn how applications behave on your infrastructure and actively manage these resources.
Processors, Memory, Flash, Disk, NIC
Operating system
Cluster resource manager
Distributed analytics: Hadoop, Spark, …
IaaS
Online, data-intensive (OLDI): Search, social nets, SaaS