15 Kubernetes failure points you should watch
Jorge Salamero - @bencerillo
About me
Jorge Salamero
Tech Marketing aka container gamer @ Sysdig
github.com/bencer
@bencerillo
OSS fan
Monitoring, containers, IoT/home-automation, cars
Monitoring & Security Platform for Containers
Monitoring 15 Kubernetes failure points
- Apps
- Hosts
- Orchestration
- Containers
- Yourself
https://sysdig.com/blog/monitoring-kubernetes-with-sysdig-cloud/
https://sysdig.com/blog/alerting-kubernetes/
The holy service metrics
- KPI / biz metrics / synthetic monitoring / user metrics
- Google SRE book: “The Four Golden Signals”
  Latency + Traffic + Errors + Saturation
USE method
- Utilization
(how busy we are, close to 100% bottleneck)
- Saturation
(amount of work waiting on the queue)
- Errors
RED method
- Request Rate
- Request Errors
- Request Duration
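As a rough illustration of the RED method, the three values can be derived from a window of request records. This is a minimal sketch using only the Go standard library; the `Request` type, `ComputeRED`, and the sample data are hypothetical names made up for this example, and the 95th percentile uses the simple nearest-rank definition.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Request is a hypothetical record of one served request.
type Request struct {
	Duration time.Duration
	Failed   bool
}

// RED holds the three RED-method values for one observation window.
type RED struct {
	RatePerSecond float64       // Request Rate
	ErrorRatio    float64       // Request Errors, as failed/total
	P95           time.Duration // Request Duration, 95th percentile (nearest rank)
}

// ComputeRED summarizes the requests seen during window.
func ComputeRED(reqs []Request, window time.Duration) RED {
	if len(reqs) == 0 || window <= 0 {
		return RED{}
	}
	durations := make([]time.Duration, 0, len(reqs))
	failed := 0
	for _, r := range reqs {
		durations = append(durations, r.Duration)
		if r.Failed {
			failed++
		}
	}
	sort.Slice(durations, func(i, j int) bool { return durations[i] < durations[j] })
	// Nearest rank: ceil(0.95 * n), computed with integers.
	rank := (95*len(durations) + 99) / 100
	return RED{
		RatePerSecond: float64(len(reqs)) / window.Seconds(),
		ErrorRatio:    float64(failed) / float64(len(reqs)),
		P95:           durations[rank-1],
	}
}

// sampleWindow and sampleRequests are made-up data for the example.
const sampleWindow = 10 * time.Second

func sampleRequests() []Request {
	reqs := make([]Request, 0, 20)
	for i := 0; i < 20; i++ {
		reqs = append(reqs, Request{
			Duration: time.Duration(i+1) * 10 * time.Millisecond,
			Failed:   i%10 == 0, // 2 of the 20 requests fail
		})
	}
	return reqs
}

func main() {
	red := ComputeRED(sampleRequests(), sampleWindow)
	fmt.Printf("rate=%.1f req/s errors=%.0f%% p95=%s\n",
		red.RatePerSecond, red.ErrorRatio*100, red.P95)
}
```

In practice a metrics library or a monitoring agent would do this aggregation; the sketch only shows what the three numbers mean.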
The holy service metrics
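- Code instrumentation (statsd, JMX or Prometheus metrics):
import "github.com/prometheus/client_golang/prometheus"

// Histogram of HTTP request durations, labeled by method, route and status code.
var httpDurationsHistogram = prometheus.NewHistogramVec(prometheus.HistogramOpts{
    Name:    "http_durations_histogram_seconds",
    Help:    "Seconds spent serving HTTP requests.",
    Buckets: prometheus.DefBuckets,
}, []string{"method", "route", "status_code"})

// Register the histogram once at start-up; MustRegister panics on error.
func init() { prometheus.MustRegister(httpDurationsHistogram) }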
- or Sysdig autodiscovery ;-)
1. connections per second
net.request.count
2. response time
net.response.time
3. errors
net.request.error.count
Prometheus + Grafana UI
Kubernetes orchestration
Kubernetes hierarchy
Services vs hosts+containers
Kubernetes metadata: labels
Pod
  app: shopping
  tier: api
Pod
  app: shopping
  tier: db
Pod
  app: social
  tier: api
  role: search
Pod
  app: social
  tier: api
  role: search
Leverage metadata (by service)
Leverage metadata (by pod)
Health vs state monitoring
- Health:
  - CPU, memory, disk
  - connections, response time, errors
- State (orchestration):
  - Are containers up and running properly?
- kube-state-metrics: calculates new metrics based on the state of Kubernetes resources
  https://github.com/kubernetes/kube-state-metrics
  https://sysdig.com/blog/introducing-kube-state-metrics/
Container scheduling
- Need to deploy a container:
  - given the requirements, where can we run it?
and let’s ignore affinity, taints and tolerations:
https://sysdig.com/blog/kubernetes-scheduler/
- capacity planning
4. node availability
Based on the host or the kubelet component status:
kube_node_status_condition{condition="Ready",status="true"} == 0
count(kube_node_status_condition{condition="Ready",status="true"} == 0) > 1 and
(count(kube_node_status_condition{condition="Ready",status="true"} == 0) /
count(kube_node_status_condition{condition="Ready",status="true"})) > 0.2
count(up{job="kubelet"} == 0) / count(up{job="kubelet"}) * 100 > 3
kube_node_status_condition: kube_node_status_ready,
kube_node_status_out_of_disk, kube_node_status_memory_pressure,
kube_node_status_disk_pressure, and kube_node_status_network_unavailable
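The second expression above combines an absolute and a relative condition: more than one node not Ready, and more than 20% of all nodes not Ready. A minimal Go sketch of that logic, assuming the Ready status of each node is already available as a boolean (the function name `NotReadyAlert` is made up for this example):

```go
package main

import "fmt"

// NotReadyAlert reports whether more than one node is not Ready AND
// the not-Ready nodes are more than 20% of all nodes.
func NotReadyAlert(ready []bool) bool {
	total := len(ready)
	notReady := 0
	for _, ok := range ready {
		if !ok {
			notReady++
		}
	}
	// Integer arithmetic avoids floating-point edge cases at exactly 20%.
	return notReady > 1 && notReady*100 > total*20
}

func main() {
	// Hypothetical 10-node cluster with 3 nodes not Ready.
	nodes := []bool{true, true, false, true, false, true, false, true, true, true}
	fmt.Println("alert:", NotReadyAlert(nodes))
}
```

The absolute condition keeps a small cluster from alerting on a single node, and the relative one keeps a large cluster from alerting on a handful of nodes.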
Sysdig alert UI
Container resource requirements
resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"
    cpu: "500m"
https://github.com/kubernetes-incubator/cluster-capacity
5. CPU resources
6. memory resources
kube_node_status_capacity_pods
kube_node_status_allocatable_pods
kube_node_status_capacity_cpu_cores
kube_node_status_capacity_memory_bytes
kube_node_status_allocatable_cpu_cores
kube_node_status_allocatable_memory_bytes
capacity - used (by OS and kube services) = allocatable
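To make the capacity arithmetic concrete: the scheduler places pods based on their requests (not their limits), so the room left on a node is its allocatable resources minus the requests of the pods already on it. A minimal sketch under that assumption; the `Resources` type, the helper names, and the 1 CPU / 1Gi node are hypothetical, and the pod requests are the ones from the manifest above.

```go
package main

import "fmt"

// Resources is a simplified resource vector: CPU in millicores, memory in bytes.
type Resources struct {
	MilliCPU    int64
	MemoryBytes int64
}

const mi = int64(1) << 20 // one mebibyte, as in "256Mi"

// Free returns what is left of a node's allocatable resources after
// subtracting the requests of the pods already scheduled on it.
func Free(allocatable Resources, requests []Resources) Resources {
	free := allocatable
	for _, r := range requests {
		free.MilliCPU -= r.MilliCPU
		free.MemoryBytes -= r.MemoryBytes
	}
	return free
}

// Fits reports whether a pod with the given requests fits into free.
func Fits(free, pod Resources) bool {
	return pod.MilliCPU <= free.MilliCPU && pod.MemoryBytes <= free.MemoryBytes
}

// slidePod uses the requests from the manifest above: cpu "250m", memory "256Mi".
var slidePod = Resources{MilliCPU: 250, MemoryBytes: 256 * mi}

// smallNode is a hypothetical node with 1 CPU and 1Gi allocatable.
var smallNode = Resources{MilliCPU: 1000, MemoryBytes: 1024 * mi}

func main() {
	running := []Resources{slidePod, slidePod, slidePod}
	free := Free(smallNode, running)
	fmt.Printf("free: %dm CPU, %dMi memory, one more pod fits: %v\n",
		free.MilliCPU, free.MemoryBytes/mi, Fits(free, slidePod))
}
```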
Container disk requirements
here things get more complicated...
- ephemeral disk usage
- persistent volumes claims
7. disk resources
predict_linear(node_filesystem_free[30m], 3600 * 2) < 0
kube_node_status_condition: kube_node_status_out_of_disk
but within containers this is still WIP, at least as of Kubernetes 1.8:
container_fs_* doesn’t work with PV
https://github.com/kubernetes/kubernetes/pull/59170
https://github.com/kubernetes/kubernetes/pull/51553
https://kubernetes.io/docs/concepts/cluster-administration/controller-metrics/
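The predict_linear expression above fits a straight line through the recent samples and extrapolates it two hours ahead; a negative prediction means the filesystem is expected to fill up. The following is a minimal sketch of that idea with an ordinary least-squares fit. It is a simplification, not the Prometheus implementation: it extrapolates from the last sample rather than from the evaluation time, and the `Sample` type and the sample values are made up.

```go
package main

import "fmt"

// Sample is one (timestamp in seconds, value) pair.
type Sample struct{ T, V float64 }

// PredictLinear fits v = intercept + slope*t by least squares and returns
// the predicted value horizon seconds after the last sample.
// It returns false when fewer than two distinct timestamps are given.
func PredictLinear(samples []Sample, horizon float64) (float64, bool) {
	if len(samples) < 2 {
		return 0, false
	}
	n := float64(len(samples))
	t0 := samples[0].T // shift timestamps to keep the sums small
	var sumT, sumV, sumTV, sumTT float64
	for _, s := range samples {
		t := s.T - t0
		sumT += t
		sumV += s.V
		sumTV += t * s.V
		sumTT += t * t
	}
	den := n*sumTT - sumT*sumT
	if den == 0 {
		return 0, false
	}
	slope := (n*sumTV - sumT*sumV) / den
	intercept := (sumV - slope*sumT) / n
	last := samples[len(samples)-1].T - t0
	return intercept + slope*(last+horizon), true
}

// sampleFree is made-up data: free space dropping by 100 units every 10 minutes.
func sampleFree() []Sample {
	return []Sample{{0, 1000}, {600, 900}, {1200, 800}, {1800, 700}}
}

func main() {
	predicted, ok := PredictLinear(sampleFree(), 2*3600)
	fmt.Printf("ok=%v predicted free space in 2h: %.1f (alert if < 0)\n", ok, predicted)
}
```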
Container orchestration
- ReplicationController
- ReplicaSet
- Deployment
- DaemonSet
- StatefulSet
Kubernetes deployments
Is Kubernetes doing what it is
supposed to do?
Orchestration needs monitoring too.
8. running instances
9. desired instances
((kube_deployment_status_replicas_updated != kube_deployment_spec_replicas)
or
(kube_deployment_status_replicas_available != kube_deployment_spec_replicas))
10. deployment update glitches
kube_deployment_status_observed_generation !=
kube_deployment_metadata_generation
kube_deployment_spec_paused
kube_deployment_spec_strategy_rollingupdate_max_unavailable
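The deployment checks for points 8 to 10 boil down to three comparisons between what was asked for and what the controller reports. A small Go sketch, assuming the metric values have already been fetched into a struct (the `DeploymentState` type and `Problems` function are hypothetical names for this example):

```go
package main

import "fmt"

// DeploymentState holds values of the kube-state-metrics series named above.
type DeploymentState struct {
	SpecReplicas       int   // kube_deployment_spec_replicas
	UpdatedReplicas    int   // kube_deployment_status_replicas_updated
	AvailableReplicas  int   // kube_deployment_status_replicas_available
	ObservedGeneration int64 // kube_deployment_status_observed_generation
	MetadataGeneration int64 // kube_deployment_metadata_generation
}

// Problems lists which of the three checks fail for d.
func Problems(d DeploymentState) []string {
	var out []string
	if d.UpdatedReplicas != d.SpecReplicas {
		out = append(out, "not all replicas updated")
	}
	if d.AvailableReplicas != d.SpecReplicas {
		out = append(out, "not all replicas available")
	}
	if d.ObservedGeneration != d.MetadataGeneration {
		out = append(out, "latest spec not yet observed by the controller")
	}
	return out
}

func main() {
	// Hypothetical deployment in the middle of a rolling update.
	rolling := DeploymentState{
		SpecReplicas: 3, UpdatedReplicas: 2, AvailableReplicas: 3,
		ObservedGeneration: 7, MetadataGeneration: 7,
	}
	fmt.Println(Problems(rolling))
}
```

A mismatch is normal for a short time during a rollout, so an alert on these conditions would usually require them to hold for a while.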
Container lifecycle state
https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/
Liveness probes
To know when to restart a container:
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
    httpHeaders:
    - name: X-Custom-Header
      value: Awesome
  initialDelaySeconds: 3
  periodSeconds: 3
Readiness probes
To know when a container is ready to start accepting traffic:
readinessProbe:
  exec:
    command:
    - cat
    - /tmp/healthy
  initialDelaySeconds: 5
  periodSeconds: 5
11. pod status
kube_pod_status_phase: Pending|Running|Succeeded|Failed|Unknown
kube_pod_status_ready
kube_pod_status_scheduled
kube_pod_container_status_waiting
kube_pod_container_status_running
kube_pod_container_status_terminated
kube_pod_container_status_ready
12. pod restarts
You can look at this as a metric or as an event:
ALERT PodRestartingTooMuch
  IF rate(k8s_pod_status_restartCount[1m]) > 1/(5*60)
  FOR 1h
  LABELS { severity="warning" }
  ANNOTATIONS {
    summary = "Pod {{$labels.namespace}}/{{$labels.name}} restarting too much.",
    description = "Pod {{$labels.namespace}}/{{$labels.name}} restarting too much.",
  }
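The threshold 1/(5*60) reads as "one restart every five minutes", expressed in restarts per second. A short Go sketch of the arithmetic, assuming the rate is simply the counter increase divided by the window length (Prometheus additionally extrapolates to the window edges, which is ignored here); the function names are made up for the example.

```go
package main

import "fmt"

// threshold is one restart per five minutes, in restarts per second.
const threshold = 1.0 / (5 * 60)

// RestartRate approximates the per-second rate of a restart counter
// from its first and last value within a window.
func RestartRate(first, last, windowSeconds float64) float64 {
	if windowSeconds <= 0 {
		return 0
	}
	increase := last - first
	if increase < 0 {
		// A decreasing counter means it was reset; count from zero.
		increase = last
	}
	return increase / windowSeconds
}

// RestartingTooMuch applies the threshold to the approximated rate.
func RestartingTooMuch(first, last, windowSeconds float64) bool {
	return RestartRate(first, last, windowSeconds) > threshold
}

func main() {
	// Hypothetical counter going from 3 to 4 within a one-minute window.
	fmt.Printf("rate=%.4f/s threshold=%.4f/s too much: %v\n",
		RestartRate(3, 4, 60), threshold, RestartingTooMuch(3, 4, 60))
}
```

Under this simplification a single restart inside the one-minute window already exceeds the threshold, so it is the FOR 1h clause that keeps the alert from firing on an isolated restart.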
CrashLoopBackOff event
https://sysdig.com/blog/debug-kubernetes-crashloopbackoff/
Sysdig Inspect
https://github.com/draios/sysdig-inspect
Kubernetes internals
- APIserver
- KubeDNS / Istio
- container registry
- any other piece of Kubernetes
https://sysdig.com/blog/monitor-etcd/
13. APIserver
rate(apiserver_request_count{code=~"^(?:5..)$"}[5m]) /
rate(apiserver_request_count[5m]) * 100 > 5
apiserver_latency_seconds:quantile{quantile="0.99",subresource!="log",
verb!~"^(?:WATCH|WATCHLIST|PROXY|CONNECT)$"} > 4
Or just do Golden signals on APIserver endpoint too :-)
14. KubeDNS / Istio
histogram_quantile(0.95,
sum(rate(kubedns_probe_kubedns_latency_ms_bucket[1m])) BY (le,
kubernetes_pod_name)) > 1000
All export native metrics in Prometheus format, just scrape them!
https://sysdig.com/blog/monitor-istio/
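The histogram_quantile call above estimates a percentile from cumulative bucket counts by finding the bucket that contains the target rank and interpolating linearly inside it. The following Go sketch shows that estimation in simplified form; it is not the Prometheus source, and the `Bucket` type and the bucket values are made up for the example.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Bucket is one cumulative histogram bucket: Count observations were <= UpperBound.
type Bucket struct {
	UpperBound float64 // the "le" label; +Inf for the last bucket
	Count      float64
}

// HistogramQuantile estimates the q-quantile (0 <= q <= 1) from buckets
// sorted by UpperBound, with a final +Inf bucket.
func HistogramQuantile(q float64, buckets []Bucket) float64 {
	if len(buckets) < 2 || q < 0 || q > 1 {
		return math.NaN()
	}
	total := buckets[len(buckets)-1].Count
	if total == 0 {
		return math.NaN()
	}
	rank := q * total
	// First finite bucket whose cumulative count reaches the rank.
	i := sort.Search(len(buckets)-1, func(i int) bool { return buckets[i].Count >= rank })
	if i == len(buckets)-1 {
		// The rank falls into the +Inf bucket: report the highest finite bound.
		return buckets[len(buckets)-2].UpperBound
	}
	lower, prevCount := 0.0, 0.0
	if i > 0 {
		lower, prevCount = buckets[i-1].UpperBound, buckets[i-1].Count
	}
	inBucket := buckets[i].Count - prevCount
	if inBucket == 0 {
		return lower
	}
	// Assume observations are spread evenly inside the bucket.
	return lower + (buckets[i].UpperBound-lower)*(rank-prevCount)/inBucket
}

// sampleBuckets is made-up latency data in milliseconds.
func sampleBuckets() []Bucket {
	return []Bucket{{100, 50}, {500, 90}, {1000, 98}, {math.Inf(1), 100}}
}

func main() {
	p95 := HistogramQuantile(0.95, sampleBuckets())
	fmt.Printf("p95=%.1f ms, above 1000 ms: %v\n", p95, p95 > 1000)
}
```

The estimate is only as precise as the bucket layout, so bucket bounds should be chosen around the thresholds being alerted on.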
What are we deploying?
- CI/CD and commits
- Manual deploys
You need to validate what you
tell Kubernetes too!
15. monitor your commands
kubeval: validates YAML and JSON config files
https://github.com/garethr/kubeval
kubediff: shows differences between running state and version-controlled configuration
https://github.com/weaveworks/kubediff
Configuration reconciliation discussion:
https://github.com/kubernetes/kubernetes/issues/1702
Although this is getting automated too:
https://sysdig.com/blog/kubernetes-scaler/
Recap
1. connections per second
2. response time
3. errors
4. node availability
5. CPU resources
6. memory resources
7. disk and external resources
Recap (2)
8. running instances
9. desired instances
10. deployment update glitches
Recap (3)
11. pod status
12. pod restarts
13. APIserver health
14. KubeDNS / Istio health
15. monitor your commands
Thank you!
Jorge Salamero - @bencerillo
https://sysdig.com/blog/

More Related Content

Similar to 15 kubernetes failure points you should watch

Keynote #Tech - Google : aperçu de la gestion des services distribués chez Go...
Keynote #Tech - Google : aperçu de la gestion des services distribués chez Go...Keynote #Tech - Google : aperçu de la gestion des services distribués chez Go...
Keynote #Tech - Google : aperçu de la gestion des services distribués chez Go...Paris Open Source Summit
 
Kubernetes extensibility: crd & operators
Kubernetes extensibility: crd & operators Kubernetes extensibility: crd & operators
Kubernetes extensibility: crd & operators Giacomo Tirabassi
 
Kubernetes extensibility: CRDs & Operators
Kubernetes extensibility: CRDs & OperatorsKubernetes extensibility: CRDs & Operators
Kubernetes extensibility: CRDs & OperatorsSIGHUP
 
Using kubernetes to lose your fear of using containers
Using kubernetes to lose your fear of using containersUsing kubernetes to lose your fear of using containers
Using kubernetes to lose your fear of using containersjosfuecas
 
Azure kubernetes service (aks) part 3
Azure kubernetes service (aks)   part 3Azure kubernetes service (aks)   part 3
Azure kubernetes service (aks) part 3Nilesh Gule
 
Monitoring Kubernetes with Prometheus
Monitoring Kubernetes with PrometheusMonitoring Kubernetes with Prometheus
Monitoring Kubernetes with PrometheusTobias Schmidt
 
A GitOps model for High Availability and Disaster Recovery on EKS
A GitOps model for High Availability and Disaster Recovery on EKSA GitOps model for High Availability and Disaster Recovery on EKS
A GitOps model for High Availability and Disaster Recovery on EKSWeaveworks
 
using Mithril.js + postgREST to build and consume API's
using Mithril.js + postgREST to build and consume API'susing Mithril.js + postgREST to build and consume API's
using Mithril.js + postgREST to build and consume API'sAntônio Roberto Silva
 
使用 Prometheus 監控 Kubernetes Cluster
使用 Prometheus 監控 Kubernetes Cluster 使用 Prometheus 監控 Kubernetes Cluster
使用 Prometheus 監控 Kubernetes Cluster inwin stack
 
How Honestbee Does CI/CD on Kubernetes - Vincent DeSmet
How Honestbee Does CI/CD on Kubernetes - Vincent DeSmetHow Honestbee Does CI/CD on Kubernetes - Vincent DeSmet
How Honestbee Does CI/CD on Kubernetes - Vincent DeSmetDevOpsDaysJKT
 
Jacopo Nardiello - Monitoring Cloud-Native applications with Prometheus - Cod...
Jacopo Nardiello - Monitoring Cloud-Native applications with Prometheus - Cod...Jacopo Nardiello - Monitoring Cloud-Native applications with Prometheus - Cod...
Jacopo Nardiello - Monitoring Cloud-Native applications with Prometheus - Cod...Codemotion
 
Cluster management with Kubernetes
Cluster management with KubernetesCluster management with Kubernetes
Cluster management with KubernetesSatnam Singh
 
Serverless Machine Learning Model Inference on Kubernetes with KServe.pdf
Serverless Machine Learning Model Inference on Kubernetes with KServe.pdfServerless Machine Learning Model Inference on Kubernetes with KServe.pdf
Serverless Machine Learning Model Inference on Kubernetes with KServe.pdfStavros Kontopoulos
 
Kubernetes x PaaS – コンテナアプリケーションのNoOpsへの挑戦
Kubernetes x PaaS – コンテナアプリケーションのNoOpsへの挑戦Kubernetes x PaaS – コンテナアプリケーションのNoOpsへの挑戦
Kubernetes x PaaS – コンテナアプリケーションのNoOpsへの挑戦Yoichi Kawasaki
 
Introducing github.com/open-cluster-management – How to deliver apps across c...
Introducing github.com/open-cluster-management – How to deliver apps across c...Introducing github.com/open-cluster-management – How to deliver apps across c...
Introducing github.com/open-cluster-management – How to deliver apps across c...Michael Elder
 
Kubernetes Administration from Zero to Hero.pdf
Kubernetes Administration from Zero to Hero.pdfKubernetes Administration from Zero to Hero.pdf
Kubernetes Administration from Zero to Hero.pdfArzooGupta16
 
An Introduction to the Kubernetes API
An Introduction to the Kubernetes APIAn Introduction to the Kubernetes API
An Introduction to the Kubernetes APIStefan Schimanski
 
Orchestraing the Blockchain Using Containers
Orchestraing the Blockchain Using ContainersOrchestraing the Blockchain Using Containers
Orchestraing the Blockchain Using ContainersAndrew Kennedy
 
kubernetes를 부탁해~ Prometheus 기반 Monitoring 구축&활용기
kubernetes를 부탁해~ Prometheus 기반 Monitoring 구축&활용기kubernetes를 부탁해~ Prometheus 기반 Monitoring 구축&활용기
kubernetes를 부탁해~ Prometheus 기반 Monitoring 구축&활용기Jinsu Moon
 
[OpenInfra Days Korea 2018] Day 2 - E6 - OpenInfra monitoring with Prometheus
[OpenInfra Days Korea 2018] Day 2 - E6 - OpenInfra monitoring with Prometheus[OpenInfra Days Korea 2018] Day 2 - E6 - OpenInfra monitoring with Prometheus
[OpenInfra Days Korea 2018] Day 2 - E6 - OpenInfra monitoring with PrometheusOpenStack Korea Community
 

Similar to 15 kubernetes failure points you should watch (20)

Keynote #Tech - Google : aperçu de la gestion des services distribués chez Go...
Keynote #Tech - Google : aperçu de la gestion des services distribués chez Go...Keynote #Tech - Google : aperçu de la gestion des services distribués chez Go...
Keynote #Tech - Google : aperçu de la gestion des services distribués chez Go...
 
Kubernetes extensibility: crd & operators
Kubernetes extensibility: crd & operators Kubernetes extensibility: crd & operators
Kubernetes extensibility: crd & operators
 
Kubernetes extensibility: CRDs & Operators
Kubernetes extensibility: CRDs & OperatorsKubernetes extensibility: CRDs & Operators
Kubernetes extensibility: CRDs & Operators
 
Using kubernetes to lose your fear of using containers
Using kubernetes to lose your fear of using containersUsing kubernetes to lose your fear of using containers
Using kubernetes to lose your fear of using containers
 
Azure kubernetes service (aks) part 3
Azure kubernetes service (aks)   part 3Azure kubernetes service (aks)   part 3
Azure kubernetes service (aks) part 3
 
Monitoring Kubernetes with Prometheus
Monitoring Kubernetes with PrometheusMonitoring Kubernetes with Prometheus
Monitoring Kubernetes with Prometheus
 
A GitOps model for High Availability and Disaster Recovery on EKS
A GitOps model for High Availability and Disaster Recovery on EKSA GitOps model for High Availability and Disaster Recovery on EKS
A GitOps model for High Availability and Disaster Recovery on EKS
 
using Mithril.js + postgREST to build and consume API's
using Mithril.js + postgREST to build and consume API'susing Mithril.js + postgREST to build and consume API's
using Mithril.js + postgREST to build and consume API's
 
使用 Prometheus 監控 Kubernetes Cluster
使用 Prometheus 監控 Kubernetes Cluster 使用 Prometheus 監控 Kubernetes Cluster
使用 Prometheus 監控 Kubernetes Cluster
 
How Honestbee Does CI/CD on Kubernetes - Vincent DeSmet
How Honestbee Does CI/CD on Kubernetes - Vincent DeSmetHow Honestbee Does CI/CD on Kubernetes - Vincent DeSmet
How Honestbee Does CI/CD on Kubernetes - Vincent DeSmet
 
Jacopo Nardiello - Monitoring Cloud-Native applications with Prometheus - Cod...
Jacopo Nardiello - Monitoring Cloud-Native applications with Prometheus - Cod...Jacopo Nardiello - Monitoring Cloud-Native applications with Prometheus - Cod...
Jacopo Nardiello - Monitoring Cloud-Native applications with Prometheus - Cod...
 
Cluster management with Kubernetes
Cluster management with KubernetesCluster management with Kubernetes
Cluster management with Kubernetes
 
Serverless Machine Learning Model Inference on Kubernetes with KServe.pdf
Serverless Machine Learning Model Inference on Kubernetes with KServe.pdfServerless Machine Learning Model Inference on Kubernetes with KServe.pdf
Serverless Machine Learning Model Inference on Kubernetes with KServe.pdf
 
Kubernetes x PaaS – コンテナアプリケーションのNoOpsへの挑戦
Kubernetes x PaaS – コンテナアプリケーションのNoOpsへの挑戦Kubernetes x PaaS – コンテナアプリケーションのNoOpsへの挑戦
Kubernetes x PaaS – コンテナアプリケーションのNoOpsへの挑戦
 
Introducing github.com/open-cluster-management – How to deliver apps across c...
Introducing github.com/open-cluster-management – How to deliver apps across c...Introducing github.com/open-cluster-management – How to deliver apps across c...
Introducing github.com/open-cluster-management – How to deliver apps across c...
 
Kubernetes Administration from Zero to Hero.pdf
Kubernetes Administration from Zero to Hero.pdfKubernetes Administration from Zero to Hero.pdf
Kubernetes Administration from Zero to Hero.pdf
 
An Introduction to the Kubernetes API
An Introduction to the Kubernetes APIAn Introduction to the Kubernetes API
An Introduction to the Kubernetes API
 
Orchestraing the Blockchain Using Containers
Orchestraing the Blockchain Using ContainersOrchestraing the Blockchain Using Containers
Orchestraing the Blockchain Using Containers
 
kubernetes를 부탁해~ Prometheus 기반 Monitoring 구축&활용기
kubernetes를 부탁해~ Prometheus 기반 Monitoring 구축&활용기kubernetes를 부탁해~ Prometheus 기반 Monitoring 구축&활용기
kubernetes를 부탁해~ Prometheus 기반 Monitoring 구축&활용기
 
[OpenInfra Days Korea 2018] Day 2 - E6 - OpenInfra monitoring with Prometheus
[OpenInfra Days Korea 2018] Day 2 - E6 - OpenInfra monitoring with Prometheus[OpenInfra Days Korea 2018] Day 2 - E6 - OpenInfra monitoring with Prometheus
[OpenInfra Days Korea 2018] Day 2 - E6 - OpenInfra monitoring with Prometheus
 

More from Sysdig

Wordpress y Docker, de desarrollo a produccion
Wordpress y Docker, de desarrollo a produccionWordpress y Docker, de desarrollo a produccion
Wordpress y Docker, de desarrollo a produccionSysdig
 
What Prometheus means for monitoring vendors
What Prometheus means for monitoring vendorsWhat Prometheus means for monitoring vendors
What Prometheus means for monitoring vendorsSysdig
 
Docker Runtime Security
Docker Runtime SecurityDocker Runtime Security
Docker Runtime SecuritySysdig
 
CI / CD / CS - Continuous Security in Kubernetes
CI / CD / CS - Continuous Security in KubernetesCI / CD / CS - Continuous Security in Kubernetes
CI / CD / CS - Continuous Security in KubernetesSysdig
 
Continuous Security
Continuous SecurityContinuous Security
Continuous SecuritySysdig
 
The top 5 Kubernetes metrics to monitor
The top 5 Kubernetes metrics to monitorThe top 5 Kubernetes metrics to monitor
The top 5 Kubernetes metrics to monitorSysdig
 
The top 5 Kubernetes metrics to monitor
The top 5 Kubernetes metrics to monitorThe top 5 Kubernetes metrics to monitor
The top 5 Kubernetes metrics to monitorSysdig
 
Behavioural activity monitoring on CoreOS with Sysdig Falco
Behavioural activity monitoring on CoreOS with Sysdig FalcoBehavioural activity monitoring on CoreOS with Sysdig Falco
Behavioural activity monitoring on CoreOS with Sysdig FalcoSysdig
 
How to Monitor Microservices
How to Monitor MicroservicesHow to Monitor Microservices
How to Monitor MicroservicesSysdig
 
WTF my container just spawned a shell!
WTF my container just spawned a shell!WTF my container just spawned a shell!
WTF my container just spawned a shell!Sysdig
 
Trace everything, when APM meets SysAdmins
Trace everything, when APM meets SysAdminsTrace everything, when APM meets SysAdmins
Trace everything, when APM meets SysAdminsSysdig
 
You're monitoring Kubernetes Wrong
You're monitoring Kubernetes WrongYou're monitoring Kubernetes Wrong
You're monitoring Kubernetes WrongSysdig
 
The Dark Art of Container Monitoring - Spanish
The Dark Art of Container Monitoring - SpanishThe Dark Art of Container Monitoring - Spanish
The Dark Art of Container Monitoring - SpanishSysdig
 
Lions, Tigers and Deers: What building zoos can teach us about securing micro...
Lions, Tigers and Deers: What building zoos can teach us about securing micro...Lions, Tigers and Deers: What building zoos can teach us about securing micro...
Lions, Tigers and Deers: What building zoos can teach us about securing micro...Sysdig
 
Building Trustworthy Containers
Building Trustworthy ContainersBuilding Trustworthy Containers
Building Trustworthy ContainersSysdig
 
A brief history of system calls
A brief history of system callsA brief history of system calls
A brief history of system callsSysdig
 
Designing Tracing Tools
Designing Tracing ToolsDesigning Tracing Tools
Designing Tracing ToolsSysdig
 
Extending Sysdig with Chisel
Extending Sysdig with ChiselExtending Sysdig with Chisel
Extending Sysdig with ChiselSysdig
 
Intro to sysdig in 15 minutes
Intro to sysdig in 15 minutesIntro to sysdig in 15 minutes
Intro to sysdig in 15 minutesSysdig
 
Troubleshooting Kubernetes
Troubleshooting KubernetesTroubleshooting Kubernetes
Troubleshooting KubernetesSysdig
 

More from Sysdig (20)

Wordpress y Docker, de desarrollo a produccion
Wordpress y Docker, de desarrollo a produccionWordpress y Docker, de desarrollo a produccion
Wordpress y Docker, de desarrollo a produccion
 
What Prometheus means for monitoring vendors
What Prometheus means for monitoring vendorsWhat Prometheus means for monitoring vendors
What Prometheus means for monitoring vendors
 
Docker Runtime Security
Docker Runtime SecurityDocker Runtime Security
Docker Runtime Security
 
CI / CD / CS - Continuous Security in Kubernetes
CI / CD / CS - Continuous Security in KubernetesCI / CD / CS - Continuous Security in Kubernetes
CI / CD / CS - Continuous Security in Kubernetes
 
Continuous Security
Continuous SecurityContinuous Security
Continuous Security
 
The top 5 Kubernetes metrics to monitor
The top 5 Kubernetes metrics to monitorThe top 5 Kubernetes metrics to monitor
The top 5 Kubernetes metrics to monitor
 
The top 5 Kubernetes metrics to monitor
The top 5 Kubernetes metrics to monitorThe top 5 Kubernetes metrics to monitor
The top 5 Kubernetes metrics to monitor
 
Behavioural activity monitoring on CoreOS with Sysdig Falco
Behavioural activity monitoring on CoreOS with Sysdig FalcoBehavioural activity monitoring on CoreOS with Sysdig Falco
Behavioural activity monitoring on CoreOS with Sysdig Falco
 
How to Monitor Microservices
How to Monitor MicroservicesHow to Monitor Microservices
How to Monitor Microservices
 
WTF my container just spawned a shell!
WTF my container just spawned a shell!WTF my container just spawned a shell!
WTF my container just spawned a shell!
 
Trace everything, when APM meets SysAdmins
Trace everything, when APM meets SysAdminsTrace everything, when APM meets SysAdmins
Trace everything, when APM meets SysAdmins
 
You're monitoring Kubernetes Wrong
You're monitoring Kubernetes WrongYou're monitoring Kubernetes Wrong
You're monitoring Kubernetes Wrong
 
The Dark Art of Container Monitoring - Spanish
The Dark Art of Container Monitoring - SpanishThe Dark Art of Container Monitoring - Spanish
The Dark Art of Container Monitoring - Spanish
 
Lions, Tigers and Deers: What building zoos can teach us about securing micro...
Lions, Tigers and Deers: What building zoos can teach us about securing micro...Lions, Tigers and Deers: What building zoos can teach us about securing micro...
Lions, Tigers and Deers: What building zoos can teach us about securing micro...
 
Building Trustworthy Containers
Building Trustworthy ContainersBuilding Trustworthy Containers
Building Trustworthy Containers
 
A brief history of system calls
A brief history of system callsA brief history of system calls
A brief history of system calls
 
Designing Tracing Tools
Designing Tracing ToolsDesigning Tracing Tools
Designing Tracing Tools
 
Extending Sysdig with Chisel
Extending Sysdig with ChiselExtending Sysdig with Chisel
Extending Sysdig with Chisel
 
Intro to sysdig in 15 minutes
Intro to sysdig in 15 minutesIntro to sysdig in 15 minutes
Intro to sysdig in 15 minutes
 
Troubleshooting Kubernetes
Troubleshooting KubernetesTroubleshooting Kubernetes
Troubleshooting Kubernetes
 

Recently uploaded

MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSSIVASHANKAR N
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Dr.Costas Sachpazis
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Dr.Costas Sachpazis
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).pptssuser5c9d4b1
 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)Suman Mia
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxupamatechverse
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINESIVASHANKAR N
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
Biology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxBiology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxDeepakSakkari2
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024hassan khalil
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024Mark Billinghurst
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxupamatechverse
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxAsutosh Ranjan
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxpranjaldaimarysona
 
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝soniya singh
 

Recently uploaded (20)

MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
 
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptxExploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
 
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx

15 Kubernetes failure points you should watch

  • 1. 15 Kubernetes failure points you should watch. Jorge Salamero - @bencerillo
  • 2. About me: Jorge Salamero. Tech Marketing aka container gamer @ Sysdig. github.com/bencer, @bencerillo. OSS fan. Monitoring, containers, IoT/home-automation, cars.
  • 3. Monitoring & Security Platform for Containers
  • 4. Monitoring 15 Kubernetes failure points - Apps - Hosts - Orchestration - Containers - Yourself https://sysdig.com/blog/monitoring-kubernetes-with-sysdig-cloud/ https://sysdig.com/blog/alerting-kubernetes/
  • 5. The holy service metrics - KPI / biz metrics / synthetic monitoring / user metrics - Google SRE book: “The Four Golden Signals” Latency+Traffic+Errors+Saturation
  • 6. USE method - Utilization (how busy we are, close to 100% bottleneck) - Saturation (amount of work waiting on the queue) - Errors
  • 7. RED method - Request Rate - Request Errors - Request Duration
  • 8. The holy service metrics - Code instrumentation (statsd, JMX or Prometheus metrics):
      import "github.com/prometheus/client_golang/prometheus"

      // Package-level declaration: use "var ... =", since ":=" is only valid inside functions.
      var httpDurationsHistogram = prometheus.NewHistogramVec(prometheus.HistogramOpts{
          Name:    "http_durations_histogram_seconds",
          Help:    "Seconds spent serving HTTP requests.",
          Buckets: prometheus.DefBuckets,
      }, []string{"method", "route", "status_code"})

      // Register the histogram once at startup.
      func init() { prometheus.MustRegister(httpDurationsHistogram) }
      - or Sysdig autodiscovery ;-)
  • 9. 1. connections per second net.request.count 2. response time net.response.time 3. errors net.request.error.count
  • 14. Kubernetes metadata: labels Pod app: shopping tier: api Pod app: shopping tier: db Pod app: social tier: api role: search Pod app: social tier: api role: search
  • 17. Health vs state monitoring - Health: - CPU, memory, disk - connections, response time, errors
  • 18. Health vs state monitoring - State (orchestration): - Are containers up and running properly?
  • 19. Health vs state monitoring - kube-state-metrics https://github.com/kubernetes/kube-state-metrics https://sysdig.com/blog/introducing-kube-state-metrics/ calculate new metrics based on the state of Kubernetes resources
  • 20. Container scheduling - Need to deploy a container: - given the requirements, where can we run it? and let’s ignore affinity, taints and tolerations: https://sysdig.com/blog/kubernetes-scheduler/ - capacity planning
  • 21. 4. node availability
      Based on the host or the kubelet component status:
      Any node not Ready:
        kube_node_status_condition{condition="Ready",status="true"} == 0
      More than one node not Ready and more than 20% of the nodes not Ready:
        count(kube_node_status_condition{condition="Ready",status="true"} == 0) > 1 and (count(kube_node_status_condition{condition="Ready",status="true"} == 0) / count(kube_node_status_condition{condition="Ready",status="true"})) > 0.2
      More than 3% of the kubelets down:
        count(up{job="kubelet"} == 0) / count(up{job="kubelet"}) * 100 > 3
      kube_node_status_condition: kube_node_status_ready, kube_node_status_out_of_disk, kube_node_status_memory_pressure, kube_node_status_disk_pressure, and kube_node_status_network_unavailable
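
The combined condition on this slide (more than one node not Ready, and those nodes being more than 20% of the cluster) can be sketched in plain Go. This is only an illustration of the arithmetic, not how Prometheus evaluates the rule; the function name `nodesUnavailable` and the boolean slice input are assumptions made for the example.

```go
package main

import "fmt"

// nodesUnavailable mirrors the slide's alert condition: it reports true when
// more than one node is not Ready AND the not-Ready nodes are more than 20%
// of all nodes. The input is a hypothetical list of per-node Ready flags.
func nodesUnavailable(ready []bool) bool {
	if len(ready) == 0 {
		return false // no nodes reported, nothing to compare
	}
	notReady := 0
	for _, r := range ready {
		if !r {
			notReady++
		}
	}
	fraction := float64(notReady) / float64(len(ready))
	return notReady > 1 && fraction > 0.2
}

func main() {
	// One node down out of four: 25% not Ready, but only one node, so no alert.
	fmt.Println(nodesUnavailable([]bool{true, true, true, false}))
	// Two nodes down out of five: both conditions hold.
	fmt.Println(nodesUnavailable([]bool{true, false, false, true, true}))
}
```

Requiring both an absolute count and a ratio avoids paging on a single node failure in a small cluster while still catching a proportionally large outage.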
  • 23. Container resource requirements
      resources:
        requests:
          memory: "256Mi"
          cpu: "250m"
        limits:
          memory: "512Mi"
          cpu: "500m"
      https://github.com/kubernetes-incubator/cluster-capacity
  • 24. 5. CPU resources 6. memory resources
      kube_node_status_capacity_pods
      kube_node_status_allocatable_pods
      kube_node_status_capacity_cpu_cores
      kube_node_status_capacity_memory_bytes
      kube_node_status_allocatable_cpu_cores
      kube_node_status_allocatable_memory_bytes
      capacity - used (by OS and kube services) = allocatable
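
A minimal sketch of the "capacity - used = allocatable" arithmetic, and of checking whether a pod with the requests from the previous slide (250m CPU, 256Mi memory) still fits. The `Resources` type, its methods and all node figures are hypothetical values chosen for illustration; the scheduler compares pod requests, not live usage, against a node's allocatable resources.

```go
package main

import "fmt"

// Resources holds CPU in millicores and memory in bytes.
// This type is an assumption for the example, not a Kubernetes API type.
type Resources struct {
	MilliCPU    int64
	MemoryBytes int64
}

// Sub returns a - b for each dimension.
func (a Resources) Sub(b Resources) Resources {
	return Resources{a.MilliCPU - b.MilliCPU, a.MemoryBytes - b.MemoryBytes}
}

// Fits reports whether req fits into the free resources.
func (free Resources) Fits(req Resources) bool {
	return req.MilliCPU <= free.MilliCPU && req.MemoryBytes <= free.MemoryBytes
}

func main() {
	capacity := Resources{MilliCPU: 4000, MemoryBytes: 16 << 30} // hypothetical node: 4 cores, 16Gi
	reserved := Resources{MilliCPU: 500, MemoryBytes: 2 << 30}   // hypothetical OS + kube services share
	allocatable := capacity.Sub(reserved)

	requested := Resources{MilliCPU: 3300, MemoryBytes: 10 << 30} // sum of requests already scheduled
	free := allocatable.Sub(requested)

	pod := Resources{MilliCPU: 250, MemoryBytes: 256 << 20} // requests from the previous slide
	fmt.Printf("allocatable=%+v free=%+v fits=%v\n", allocatable, free, free.Fits(pod))
}
```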
  • 25. Container disk requirements here things get more complicated... - ephemeral disk usage - persistent volumes claims
  • 26. 7. disk resources
      predict_linear(node_filesystem_free[30m], 3600 * 2) < 0
      kube_node_status_condition: kube_node_status_out_of_disk
      but within containers this is still WIP, at least as of Kubernetes 1.8: container_fs_* doesn’t work with PV
      https://github.com/kubernetes/kubernetes/pull/59170
      https://github.com/kubernetes/kubernetes/pull/51553
      https://kubernetes.io/docs/concepts/cluster-administration/controller-metrics/
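
The idea behind the predict_linear expression above is a straight-line fit over the recent samples of free disk space, extrapolated two hours ahead. The following is a rough sketch of that idea using ordinary least squares; it is not Prometheus's implementation, it extrapolates from the last sample rather than from the evaluation time, and the `Sample` type and `predictLinear` function are names made up for this example.

```go
package main

import "fmt"

// Sample is one observation: T in seconds, V in free bytes.
type Sample struct {
	T float64
	V float64
}

// predictLinear fits v = slope*t + intercept by ordinary least squares and
// returns the fitted value `horizon` seconds after the last sample.
// The second result is false when a line cannot be fitted.
func predictLinear(samples []Sample, horizon float64) (float64, bool) {
	if len(samples) < 2 {
		return 0, false
	}
	n := float64(len(samples))
	var sumT, sumV, sumTT, sumTV float64
	for _, s := range samples {
		sumT += s.T
		sumV += s.V
		sumTT += s.T * s.T
		sumTV += s.T * s.V
	}
	denom := n*sumTT - sumT*sumT
	if denom == 0 {
		return 0, false // all samples share the same timestamp
	}
	slope := (n*sumTV - sumT*sumV) / denom
	intercept := (sumV - slope*sumT) / n
	last := samples[len(samples)-1].T
	return slope*(last+horizon) + intercept, true
}

func main() {
	// Hypothetical 30 minutes of samples, losing 100 bytes every 10 minutes.
	samples := []Sample{{0, 1000}, {600, 900}, {1200, 800}, {1800, 700}}
	if v, ok := predictLinear(samples, 2*3600); ok {
		fmt.Printf("predicted free bytes in 2h: %.0f (alert: %v)\n", v, v < 0)
	}
}
```

A negative prediction means the filesystem is on track to fill up within the horizon, which is the condition the slide alerts on.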
  • 27. Container orchestration - ReplicationController - ReplicaSet - Deployment - DaemonSet - StatefulSet
  • 28. Kubernetes deployments Is Kubernetes doing what it is supposed to do? Orchestration needs monitoring too.
  • 30. 9. desired instances
      ((kube_deployment_status_replicas_updated != kube_deployment_spec_replicas) or (kube_deployment_status_replicas_available != kube_deployment_spec_replicas))
  • 31. 10. deployment updates glitches
      kube_deployment_status_observed_generation != kube_deployment_metadata_generation
      kube_deployment_spec_paused
      kube_deployment_spec_strategy_rollingupdate_max_unavailable
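
The "desired instances" expression flags a deployment whose updated or available replica count differs from the replica count in its spec. A small sketch of the same comparison follows; the `DeploymentStatus` struct and its field names are assumptions for illustration and are not the Kubernetes API types.

```go
package main

import "fmt"

// DeploymentStatus is a hypothetical snapshot of the three values the
// slide's expression compares.
type DeploymentStatus struct {
	SpecReplicas      int
	UpdatedReplicas   int
	AvailableReplicas int
}

// Converged reports whether both the updated and the available replica
// counts match the desired count, i.e. the alert expression is NOT true.
func (d DeploymentStatus) Converged() bool {
	return d.UpdatedReplicas == d.SpecReplicas && d.AvailableReplicas == d.SpecReplicas
}

func main() {
	rolling := DeploymentStatus{SpecReplicas: 3, UpdatedReplicas: 3, AvailableReplicas: 2}
	steady := DeploymentStatus{SpecReplicas: 3, UpdatedReplicas: 3, AvailableReplicas: 3}
	fmt.Println(rolling.Converged(), steady.Converged())
}
```

In practice a mismatch is normal for a short time during a rollout, so an alert on this condition would typically require it to hold for some minutes.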
  • 33. Liveness probes To know when to restart a container:
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
          httpHeaders:
          - name: X-Custom-Header
            value: Awesome
        initialDelaySeconds: 3
        periodSeconds: 3
  • 34. Readiness probes To know when a container is ready to start accepting traffic:
      readinessProbe:
        exec:
          command:
          - cat
          - /tmp/healthy
        initialDelaySeconds: 5
        periodSeconds: 5
  • 35. 11. pod status
      kube_pod_status_phase: Pending|Running|Succeeded|Failed|Unknown
      kube_pod_status_ready
      kube_pod_status_scheduled
      kube_pod_container_status_waiting
      kube_pod_container_status_running
      kube_pod_container_status_terminated
      kube_pod_container_status_ready
  • 36. 12. pod restarts You can look at this as a metric or as an event:
      ALERT PodRestartingTooMuch
        IF rate(k8s_pod_status_restartCount[1m]) > 1/(5*60)
        FOR 1h
        LABELS { severity="warning" }
        ANNOTATIONS {
          summary = "Pod {{$labels.namespace}}/{{$labels.name}} restarting too much.",
          description = "Pod {{$labels.namespace}}/{{$labels.name}} restarting too much.",
        }
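
The threshold in this rule, 1/(5*60), is a per-second rate equal to one restart every five minutes, and FOR 1h means the condition has to hold at every evaluation for an hour before the alert fires. A simplified sketch of that logic is below. The helper names are hypothetical, counter resets are ignored, and one evaluation per minute is an assumption of the example rather than something the slide states.

```go
package main

import "fmt"

// threshold is one restart per five minutes, expressed per second.
const threshold = 1.0 / (5 * 60)

// perSecondRate is a simplified stand-in for rate(): the counter increase
// divided by the window length. It returns 0 for invalid input or a counter
// reset, which real Prometheus handles differently.
func perSecondRate(start, end, windowSeconds float64) float64 {
	if windowSeconds <= 0 || end < start {
		return 0
	}
	return (end - start) / windowSeconds
}

// firing reports whether the rate exceeded the threshold in every one of the
// last forCount evaluations, which is what a FOR clause requires.
func firing(rates []float64, forCount int) bool {
	if forCount <= 0 || len(rates) < forCount {
		return false
	}
	for _, r := range rates[len(rates)-forCount:] {
		if r <= threshold {
			return false
		}
	}
	return true
}

func main() {
	// One restart inside a 60-second window is already above the threshold.
	r := perSecondRate(3, 4, 60)
	fmt.Println(r > threshold)

	// Assume one evaluation per minute: 60 evaluations cover the 1h FOR clause.
	rates := make([]float64, 60)
	for i := range rates {
		rates[i] = r
	}
	fmt.Println(firing(rates, 60))
}
```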
  • 39. Kubernetes internals - APIserver - KubeDNS / Istio - container registry - any other piece of Kubernetes https://sysdig.com/blog/monitor-etcd/
  • 40. 13. APIserver
      rate(apiserver_request_count{code=~"^(?:5..)$"}[5m]) / rate(apiserver_request_count[5m]) * 100 > 5
      apiserver_latency_seconds:quantile{quantile="0.99",subresource!="log",verb!~"^(?:WATCH|WATCHLIST|PROXY|CONNECT)$"} > 4
      Or just do Golden signals on APIserver endpoint too :-)
  • 41. 14. KubeDNS / Istio
      histogram_quantile(0.95, sum(rate(kubedns_probe_kubedns_latency_ms_bucket[1m])) BY (le, kubernetes_pod_name)) > 1000
      All export native metrics in Prometheus format, just scrape them!
      https://sysdig.com/blog/monitor-istio/
  • 42. What are we deploying? - CI/CD and commits - Manual deploys You need to validate what you tell Kubernetes too!
  • 43. 15. monitor your commands
      kubeval: validates YAML and JSON config files https://github.com/garethr/kubeval
      kubediff: shows differences between running state and version-controlled configuration https://github.com/weaveworks/kubediff
      Configuration reconciliation discussion: https://github.com/kubernetes/kubernetes/issues/1702
      Although this is getting automated too: https://sysdig.com/blog/kubernetes-scaler/
  • 44. Recap 1. connections per second 2. response time 3. errors 4. node availability 5. CPU resources 6. memory resources 7. disk and external resources
  • 45. Recap (2) 8. running instances 9. desired instances 10. deployment updates glitches
  • 46. Recap (3) 11. pod status 12. pod restarts 13. APIserver health 14. KubeDNS / Istio health 15. monitor your commands
  • 47. Grazie! Jorge Salamero - @bencerillo https://sysdig.com/blog/