Why AIOps Matters For Kubernetes

Timothy Chen

Why AIOps Matters For Kubernetes (talk at Taiwan Kubernetes Day 2017)

Timothy Chen
Co-founder / CTO
Why AIOps Matters for Kubernetes

About me
Apache PMC (Mesos, Drill)
Container engineering lead at Mesosphere
Helped start the community effort for Spark on Kubernetes (Google, Red Hat, Palantir, Salesforce, etc.)
@tnachen

Agenda
Operating Kubernetes in production at scale
What is AIOps?
Operating Kubernetes in production at scale w/ AIOps!

Container Production Stack
Orchestration
Security Network Storage
Runtime
OS
Image

Infrastructure Monitoring
Orchestration
Network
Storage
Runtime
OS
Image
Users:
SRE
System Engineers
Ops
DevOps

Application Monitoring
Kafka
Microservice #1
Spark
Microservice #2
Microservice #3
Users:
SRE
DevOps
Developers

Reactive monitoring: new challenges
Containers != VMs
Cannot focus on host metrics anymore, because containers change dynamically
Which metric(s) to track?
Tens of thousands of metrics, massive numbers of alerts
How to identify or isolate the problem (correlate metrics)?
HW/Container/Host/Neighbor/Cluster/Dependency...
How to resolve these performance problems? (RCA)
Unmanaged resources -> Manual tuning
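As a rough illustration of the "correlate metrics" step, the sketch below ranks candidate metrics by how strongly they move with a symptom such as tail latency. The metric names and sample values are hypothetical, and plain Pearson correlation is only one possible starting point: a high score points at a suspect, it does not prove a root cause.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no signal
    return cov / sqrt(var_x * var_y)


def rank_suspects(symptom, candidates):
    """Rank candidate metrics by absolute correlation with the symptom."""
    scored = [(name, pearson(symptom, series)) for name, series in candidates.items()]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)


# Hypothetical samples, one value per scrape interval.
p99_latency_ms = [20, 22, 21, 35, 60, 58, 24, 21]
candidates = {
    "container_cpu_throttled": [0, 0, 0, 3, 9, 8, 1, 0],
    "host_disk_io_wait": [5, 4, 6, 5, 4, 6, 5, 5],
    "neighbor_network_mbps": [80, 75, 82, 79, 81, 78, 80, 77],
}

ranking = rank_suspects(p99_latency_ms, candidates)
for name, score in ranking:
    print(f"{name}: {score:+.2f}")
```

With these made-up samples the throttling metric ranks first, which would narrow the investigation to the container level rather than the host or a neighbor.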

Proactive challenges
Goal: Maximize performance with minimum cost!
Capacity planning / what-if analysis
Cluster scheduling (interference, heterogeneity, etc.)
Optimal configuration
Orchestration
Network Storage
Runtime
OS
Application
VM
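A minimal sketch of what a what-if analysis for capacity planning could look like, assuming a stateless service whose throughput scales linearly with replica count. The function names, the 70% utilization target, and all numbers are hypothetical and chosen only for illustration.

```python
from math import ceil


def replicas_needed(peak_rps, rps_per_replica, target_utilization=0.7):
    """Replicas required so each one stays at or below the target utilization at peak."""
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    usable_rps = rps_per_replica * target_utilization
    return ceil(peak_rps / usable_rps)


def what_if(current_peak_rps, growth_factors, rps_per_replica, cost_per_replica_hour):
    """Replica count and hourly cost for each hypothetical growth scenario."""
    plan = {}
    for growth in growth_factors:
        replicas = replicas_needed(current_peak_rps * growth, rps_per_replica)
        plan[growth] = (replicas, replicas * cost_per_replica_hour)
    return plan


# Hypothetical inputs: 2000 req/s peak today, 500 req/s per replica, $0.25 per replica-hour.
plan = what_if(2000, [1.0, 1.5, 2.0], rps_per_replica=500, cost_per_replica_hour=0.25)
for growth, (replicas, cost) in plan.items():
    print(f"{growth:.1f}x load -> {replicas} replicas, ${cost:.2f}/hour")
```

Real workloads rarely scale linearly, which is one reason the deck argues for learning this relationship from data instead of hard-coding it.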

https://netman.cs.tsinghua.edu.cn/contacts/projects/

AIOps is not a new concept
Decades of academic research around managing performance in computing...
- Cluster Scheduling
- SLA-driven interference, storage, network, caching...

Now is the time
- Workloads are more complex than before
- Infrastructure interfaces and data are becoming more standardized (Kubernetes, Prometheus, CTE, etc.)
- All metrics are centrally collected

https://www.youtube.com/watch?v=dJxGtfTPVCg

The Difficulty of using the Public Cloud
Many instance types
Reserved, on-demand, spot instances
Many instance sizes (10s)
Application churn
Variability in load
Can we guarantee SLOs & minimize cost automatically?
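To make the question concrete, here is a minimal sketch of picking the cheapest single-type fleet that still meets an SLO. It assumes each instance type has already been load-tested to find the highest request rate it sustains within the latency SLO. The type names, prices (in cents per hour), and capacities are all hypothetical, and mixed fleets, reserved pricing, and spot pricing are left out.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class InstanceProfile:
    name: str
    hourly_price_cents: int   # hypothetical on-demand price
    max_rps_within_slo: int   # highest load per instance that still met the SLO in testing


def cheapest_fleet(profiles, required_rps):
    """Return (type name, instance count, hourly cost in cents) of the cheapest fleet."""
    best = None
    for profile in profiles:
        if profile.max_rps_within_slo <= 0:
            continue  # this type never met the SLO, so it cannot be used
        count = ceil(required_rps / profile.max_rps_within_slo)
        cost = count * profile.hourly_price_cents
        if best is None or cost < best[2]:
            best = (profile.name, count, cost)
    return best


# Hypothetical benchmark results for three instance sizes.
profiles = [
    InstanceProfile("small", 10, 200),
    InstanceProfile("medium", 20, 450),
    InstanceProfile("large", 40, 800),
]

best = cheapest_fleet(profiles, required_rps=1800)
print(best)
```

Even in this toy case the answer is neither the smallest nor the largest type. With tens of sizes, several pricing models, and changing load, searching by hand stops being practical.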

Live traffic load testing
https://engineering.linkedin.com/blog/2017/02/redliner--how-linkedin-determines-the-capacity-limits-of-its-ser

Live traffic bottleneck
https://research.fb.com/publications/kraken-leveraging-live-traf%EF%AC%81c-tests-to-identify-and-resolve-resource-utilization-bottlenecks-in-large-scale-web-services/

Batch scheduling
Improve deadline by 5x - 13x, increase utilization 14 - 28%

Variability in user load
Bloated reservations due to performance variability
Twitter, Google

Oversubscription
Google search + Google brain (deep neural nets)
>90% hardware utilization
No latency violations for search
[Lo et al’15]
>90% HW utilization, no tail latency problems
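The general idea behind this kind of oversubscription can be sketched as a feedback loop: give best-effort batch work more CPU while the latency-critical service has slack, back off as latency approaches the SLO, and evict batch work on a violation. The code below is a simplified illustration under assumed thresholds and step sizes, not the algorithm from [Lo et al’15].

```python
def next_batch_cpu_share(current_share, p99_latency_ms, slo_ms,
                         step=0.05, safe_fraction=0.8, max_share=0.9):
    """One control step for the CPU share granted to best-effort batch work."""
    if p99_latency_ms > slo_ms:
        return 0.0  # SLO violated: take all CPU back from batch work
    if p99_latency_ms > safe_fraction * slo_ms:
        # Close to the SLO: shrink the batch share by one step.
        return max(0.0, round(current_share - step, 2))
    # Plenty of slack: grow the batch share by one step, up to the cap.
    return min(max_share, round(current_share + step, 2))


# Hypothetical p99 latency observations (ms) against a 100 ms SLO.
observations = [40, 45, 50, 85, 60, 120, 50]
share = 0.0
history = []
for latency in observations:
    share = next_batch_cpu_share(share, latency, slo_ms=100)
    history.append(share)

print(history)
```

A production controller would also have to manage memory bandwidth, cache, network, and power, and would need to react faster than a fixed step allows.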

Hyperpilot
Help enterprises operate containers in production.
Using AI to learn how applications behave with your infrastructure and actively manage these resources.
Processors, Memory, Flash, Disk, NIC
Operating system
Cluster resource manager
Distributed analytics: Hadoop, Spark, …
IaaS
Online, data-intensive (OLDI): Search, social nets, SaaS

Hiring Engineers in Silicon Valley / Taiwan!
tim@hyperpilot.io
Christos Kozyrakis (Stanford)
Michael Huang (ex-CoreOS)
Timothy Chen (ex-Mesosphere)

Why AIOps Matters For Kubernetes

4. Adoption

5. Infrastructure Maturity

6. Microservices

7. Microservices at scale

8. Open source software

13. Metrics Explosion

28. Choosing the right VM/HW

29. Performance tuning

33. Chaos Engineering

36. Placement

41. Oversubscription in Action
