[Open Source Consulting] Exploring and Configuring Prometheus Monitoring

Open Source Consulting
Korea's leading open source specialist company
Private/Public Cloud | Data Center to Cloud | Atlassian
H. www.osci.kr T. 02-516-0711 F. 02-516-0722
5F, Narakium Samseong-dong A Building, 32 Teheran-ro 83-gil, Gangnam-gu, Seoul
Copyright 2019 Open Source Consulting Inc. All rights reserved.

Lee Young-ju (이영주)
Prometheus
2019.04.17

Contents
01. Prometheus?
02. Usage
03. Alertmanager
04. Cluster
05. Performance

01. Prometheus?
• Prometheus?
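⚫ Started in 2012 at SoundCloud by a small group of developers.
⚫ Joined the CNCF (Cloud Native Computing Foundation) in 2016 as its second member project.
⚫ Its own query language, PromQL, makes searching metrics fast.
⚫ Came into the spotlight as it became widely used for monitoring Kubernetes.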
⚫ Designed so that a single server can ingest on the order of millions of samples per second.
⚫ Performs considerably better than earlier monitoring systems.
⚫ Can monitor almost any platform, including OpenStack, AWS, Azure, and GCE.

01. Prometheus?
• Monitoring?
• Alerting
⚫ Notifying a person when something goes wrong.
⚫ E-mail, Slack, ...
• Debugging
⚫ Identifying the root cause of a problem.
• Trending
⚫ Forecasting usage and feeding the forecast into planning.

01. Prometheus?
• Categories of Monitoring
• Profiling
⚫ tcpdump ...
• Tracing
⚫ OpenZipkin, Jaeger ...
• Logging
⚫ Elasticsearch, Graylog
• Metric
⚫ Prometheus, Zabbix

01. Prometheus?
• Prometheus Architecture
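⚫ Service discovery: finds targets and registers them automatically.
⚫ Pushgateway: an indirect way to support push (applications push their metrics here).
⚫ Exporters: convert metrics into a format Prometheus understands.
⚫ Alertmanager: sends notifications to people.
⚫ Dashboards (e.g. Grafana): draw the graphs.
⚫ Prometheus federation: one Prometheus can also pull metrics from another Prometheus.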
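
As a rough illustration of the pull model described above, the sketch below serves one metric in the text format Prometheus scrapes. It uses only the Python standard library; the metric name demo_queue_depth, the handler class, and port 8000 are hypothetical choices for this example, not part of the original deck.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(queue_depth: int) -> str:
    """Render one gauge in the Prometheus text exposition format."""
    return (
        "# HELP demo_queue_depth Items waiting in a (hypothetical) queue.\n"
        "# TYPE demo_queue_depth gauge\n"
        f"demo_queue_depth {queue_depth}\n"
    )


class MetricsHandler(BaseHTTPRequestHandler):
    """Answers GET /metrics the way an exporter would."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(queue_depth=3).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Block forever serving /metrics; the port is an arbitrary assumption."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling serve() and adding the address as a static target would let Prometheus scrape it. A real exporter would normally use an official client library instead of hand-written text.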

02. Usage
• Running Prometheus
[root@yj26-ovstest3 prometheus]# wget \
> https://github.com/prometheus/prometheus/releases/download/v2.9.1/prometheus-2.9.1.linux-amd64.tar.gz
--2019-04-18 13:22:50-- https://github.com/prometheus/prometheus/releases/download/v2.9.1/prometheus-2.9.1...
...
[root@yj26-ovstest3 prometheus]# tar xvzf prometheus-2.9.1.linux-amd64.tar.gz
prometheus-2.9.1.linux-amd64/
prometheus-2.9.1.linux-amd64/consoles/
...
[root@yj26-ovstest3 prometheus]# cd prometheus-2.9.1.linux-amd64/
[root@yj26-ovstest3 prometheus-2.9.1.linux-amd64]# grep -Eiv '^$|^#|^[[:space:]]*#' prometheus.yml
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
alerting:
  alertmanagers:
  - static_configs:
    - targets:
rule_files:
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
    - targets: ['localhost:9090']
[root@yj26-ovstest3 prometheus-2.9.1.linux-amd64]#
⚫ Download the binary.
⚫ Extract the archive.
⚫ Self-monitoring: the default 'prometheus' job scrapes Prometheus itself at localhost:9090.

02. Usage
• Running Prometheus
Start Prometheus!
Listen address!

02. Usage
• Running Node-exporter
[root@yj26-ovstest3 temp]# wget \
> https://github.com/prometheus/node_exporter/releases/download/v0.17.0/node_exporter-0.17.0.linux-amd64.tar.gz
[root@yj26-ovstest3 temp]# tar xvzf node_exporter-0.17.0.linux-amd64.tar.gz
[root@yj26-ovstest3 temp]# cd node_exporter-0.17.0.linux-amd64/
[root@yj26-ovstest3 node_exporter-0.17.0.linux-amd64]# ./node_exporter
...
INFO[0000] - uname source="node_exporter.go:97"
INFO[0000] - vmstat source="node_exporter.go:97"
INFO[0000] - xfs source="node_exporter.go:97"
INFO[0000] - zfs source="node_exporter.go:97"
INFO[0000] Listening on :9100 source="node_exporter.go:111"
Default port: 9100

02. Usage
• Running Node-exporter
[root@yj26-ovstest3 prometheus-2.9.1.linux-amd64]# grep -Eiv '^$|^#|[[:space:]]*#' prometheus.yml
global:
alerting:
  alertmanagers:
  - static_configs:
    - targets:
rule_files:
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
    - targets: ['localhost:9090']
  - job_name: 'node3'
    static_configs:
    - targets: ['localhost:9100']
[root@yj26-ovstest3 prometheus-2.9.1.linux-amd64]# ps aux |grep -i prometheus
root 12286 0.0 2.2 157280 41652 pts/1 Sl+ 13:41 0:04 ./prometheus
root 12520 0.0 0.0 116812 1032 pts/3 S+ 15:25 0:00 grep --color=auto -i prometheus
[root@yj26-ovstest3 prometheus-2.9.1.linux-amd64]# kill -SIGHUP 12286
Log ...
... caller=main.go:724 msg="Loading configuration file" filename=prometheus.yml
... caller=main.go:751 msg="Completed loading of configuration file" filename=prometheus.yml
⚫ The node exporter target has been added.
⚫ Send signal 1 (SIGHUP) to reload the configuration.
⚫ The log confirms that the configuration reloaded successfully.

02. Usage
• Scraping
[root@yj26-ovstest3 ~]# curl localhost:9090/metrics |head -n 20
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 41814 0 41814 0 0 5023k 0 --:--:-- --:--:-- --:--:-- 5833k
# HELP go_gc_duration_seconds A summary of the GC invocation durations.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 1.2143e-05
go_gc_duration_seconds{quantile="0.25"} 2.9441e-05
go_gc_duration_seconds{quantile="0.5"} 9.6832e-05
go_gc_duration_seconds{quantile="0.75"} 0.000199094
go_gc_duration_seconds{quantile="1"} 0.000424251
go_gc_duration_seconds_sum 0.000761761
go_gc_duration_seconds_count 5
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 38
# HELP go_info Information about the Go environment.
# TYPE go_info gauge
go_info{version="go1.12.4"} 1
# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
# TYPE go_memstats_alloc_bytes gauge
go_memstats_alloc_bytes 1.385236e+07
# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.
# TYPE go_memstats_alloc_bytes_total counter
[root@yj26-ovstest3 ~]#
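⚫ Metric name: for example go_goroutines.
⚫ Label: for example quantile="0.5".
⚫ Time series: one unique combination of metric name and label values.
⚫ The cardinality of go_gc_duration_seconds is 7 (five quantile series plus _sum and _count).
⚫ Metric type: the # TYPE line.
⚫ Description of the metric: the # HELP line.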
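
The cardinality of 7 noted above can be checked by counting sample lines per metric family. The sketch below is a simplified reader of the text exposition format, not a full parser: it assumes label values contain no braces or spaces, and the function name cardinality_by_family is made up for this example. The sample text is copied from the curl output above.

```python
from collections import Counter

SAMPLE = """\
# HELP go_gc_duration_seconds A summary of the GC invocation durations.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 1.2143e-05
go_gc_duration_seconds{quantile="0.25"} 2.9441e-05
go_gc_duration_seconds{quantile="0.5"} 9.6832e-05
go_gc_duration_seconds{quantile="0.75"} 0.000199094
go_gc_duration_seconds{quantile="1"} 0.000424251
go_gc_duration_seconds_sum 0.000761761
go_gc_duration_seconds_count 5
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 38
"""


def cardinality_by_family(text):
    """Count time series per metric family in exposition-format text."""
    types = {}          # family name -> declared metric type
    counts = Counter()  # family name -> number of series
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# TYPE "):
            _, _, name, mtype = line.split(None, 3)
            types[name] = mtype
            continue
        if not line or line.startswith("#"):
            continue
        # The sample name ends at the first '{' or the first space.
        name = line.split("{", 1)[0].split(" ", 1)[0]
        family = name
        # _sum, _count and _bucket series belong to their summary/histogram.
        for suffix in ("_sum", "_count", "_bucket"):
            base = name[: -len(suffix)]
            if name.endswith(suffix) and types.get(base) in ("summary", "histogram"):
                family = base
        counts[family] += 1
    return dict(counts)


print(cardinality_by_family(SAMPLE))
# {'go_gc_duration_seconds': 7, 'go_goroutines': 1}
```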

02. Usage
• PromQL
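Diagram labels: internal function, time series name, selector, range, sample, instant vector, access URL, metric name.
The example query computes the per-second average rate of change, over 3 minutes, of the total bytes received on eth0 by every node except kvm2, which gives the network usage.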
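
The query itself was on the slide image and is not reproduced here; an expression of roughly the shape rate(node_network_receive_bytes_total{device="eth0", instance!~"kvm2.*"}[3m]) would match the description, though the exact label matchers are an assumption. The arithmetic behind "per-second average rate of change" can be sketched as follows. This is a simplification: PromQL's rate() also extrapolates to the edges of the range window, which this hypothetical simple_rate function does not attempt.

```python
def simple_rate(samples):
    """Per-second average increase of a counter.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    A drop in value is treated as a counter reset, so counting restarts
    from zero at that point.
    """
    if len(samples) < 2:
        return None
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += (cur - prev) if cur >= prev else cur
    return increase / elapsed


# Received-bytes counter sampled once a minute over a 3-minute window.
window = [(0, 1000), (60, 7000), (120, 13000), (180, 19000)]
print(simple_rate(window))  # 100.0 bytes per second
```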

03. Alertmanager
• Alertmanager Architecture

03. Alertmanager
• Running Alertmanager
[root@yj26-ovstest3 alertmanager]# wget \
> https://github.com/prometheus/alertmanager/releases/download/v0.16.2/alertmanager-0.16.2.linux-amd64.tar.gz
[root@yj26-ovstest3 alertmanager]# tar xvzf alertmanager-0.16.2.linux-amd64.tar.gz
[root@yj26-ovstest3 alertmanager]# cd alertmanager-0.16.2.linux-amd64/
[root@yj26-ovstest3 alertmanager-0.16.2.linux-amd64]# grep -Eiv '^$|^#|^[[:space:]]*#' alertmanager.yml
route:
  group_by: [alertname]
  receiver: email-me
receivers:
- name: email-me
  email_configs:
  - to: leeyj7141@gmail.com
    from: leeyj7141@gmail.com
    smarthost: smtp.gmail.com:587
    auth_username: "leeyj7141@gmail.com"
    auth_identity: "leeyj7141@gmail.com"
    auth_password: "xxxxxxxxxxxxxxxxxx"
[root@yj26-ovstest3 alertmanager-0.16.2.linux-amd64]# ./alertmanager
... caller=main.go:177 msg="Starting Alertmanager" version="(version=0.16.2, branch=HEAD, revision=308b7620642dc147794e6686a3f94d1b6fc8ef4d
... caller=main.go:178 build_context="(go=go1.11.6, user=root@1e9a48272b38, date=20190405-12:27:40)"
... caller=cluster.go:161 component=cluster msg="setting advertise address explicitly" addr=10.26.1.13 port=9094
... caller=cluster.go:632 component=cluster msg="Waiting for gossip to settle..." interval=2s
... caller=main.go:334 msg="Loading configuration file" file=alertmanager.yml
... caller=main.go:428 msg=Listening address=:9093
... caller=cluster.go:657 component=cluster msg="gossip not settled" polls=0 before=0 now=1 elapsed=2.000210783s
... caller=cluster.go:649 component=cluster msg="gossip settled; proceeding" elapsed=10.001438416s
⚫ auth_password is a Google app password.

03. Alertmanager
[root@yj26-ovstest3 ~]# grep -Eiv '^$|^#|^[[:space:]]*#' \
> prometheus/prometheus-2.9.1.linux-amd64/prometheus.yml
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - localhost:9093
rule_files:
  - rules.yml
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
    - targets: ['localhost:9090']
  - job_name: 'node3'
    static_configs:
    - targets: ['localhost:9100']
[root@yj26-ovstest3 ~]# grep -Eiv '^$|^#|^[[:space:]]*#' ~/prometheus/prometheus-2.9.1.linux-amd64/rules.yml
groups:
- name: example
  rules:
  - alert: InstanceDown
    expr: up{instance="localhost:9100",job="node3"} == 0
    for: 1m
[root@yj26-ovstest3 ~]# ps aux |grep -i prometheus
root 12286 0.0 2.3 157424 44836 pts/1 Sl+ 13:41 0:10 ./prometheus
[root@yj26-ovstest3 ~]# kill -1 12286
⚫ alerting.alertmanagers gives the location of Alertmanager (localhost:9093).
⚫ The rule watches the node exporter added earlier.
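
How the "for: 1m" clause delays the InstanceDown alert can be sketched with a small state model. This is a simplified illustration, assuming one rule evaluation every 15 seconds as in the configuration above; the function alert_states is hypothetical and does not reproduce everything Prometheus does (for example, state restoration after a restart).

```python
def alert_states(up_values, eval_interval_s=15, for_s=60):
    """Return the alert state at each rule evaluation.

    up_values: value of the `up` metric at each evaluation (1 = up, 0 = down).
    The alert is 'pending' while `up == 0` has held for less than `for_s`
    seconds, and 'firing' once it has held for at least `for_s` seconds.
    """
    states = []
    active_since = None  # evaluation time at which the condition became true
    for i, up in enumerate(up_values):
        now = i * eval_interval_s
        if up != 0:
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        states.append("firing" if now - active_since >= for_s else "pending")
    return states


# node_exporter is killed after the first evaluation and restarted at the end.
print(alert_states([1, 0, 0, 0, 0, 0, 1]))
# ['inactive', 'pending', 'pending', 'pending', 'pending', 'firing', 'inactive']
```

Only the firing state is forwarded to Alertmanager, so a target that recovers within a minute produces no e-mail.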

03. Alertmanager
[root@yj26-ovstest3 ~]# pkill node_exporter
[root@yj26-ovstest3 ~]# ps aux |grep -i node
root 12838 0.0 0.0 116812 1028 pts/2 S+ 17:34 0:00 grep --color=auto -i node

04. Cluster
• What if the monitoring server dies?
Add another Prometheus that does the same job!

04. Cluster
• What if Alertmanager dies?
What if the gossip network is cut?
You receive every notification twice.
That is still better than receiving none.
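
Why a broken gossip link leads to duplicate notifications can be shown with a toy model. Everything here (the Peer class, the sent_log set, the deliver function) is invented for illustration and greatly simplifies what Alertmanager actually does; it only assumes that each instance receives the same alert and that peers share a record of what has already been sent.

```python
class Peer:
    """One Alertmanager instance with its own record of sent notifications."""

    def __init__(self, name):
        self.name = name
        self.sent_log = set()

    def maybe_notify(self, alert, outbox):
        # Skip the alert if this peer already knows it was sent.
        if alert in self.sent_log:
            return
        outbox.append((self.name, alert))
        self.sent_log.add(alert)


def gossip(a, b):
    """Exchange sent-notification records between two peers."""
    merged = a.sent_log | b.sent_log
    a.sent_log, b.sent_log = set(merged), set(merged)


def deliver(alert, partitioned):
    """Both peers receive the same alert; return the notifications sent."""
    a, b = Peer("am-1"), Peer("am-2")
    outbox = []
    a.maybe_notify(alert, outbox)
    if not partitioned:
        gossip(a, b)  # the second peer learns the alert was already sent
    b.maybe_notify(alert, outbox)
    return outbox


print(len(deliver("InstanceDown", partitioned=False)))  # 1
print(len(deliver("InstanceDown", partitioned=True)))   # 2
```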

05. Performance
• Hardware
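⚫ A compressed sample takes about 1.3 bytes of storage.
⚫ With the default 15-day retention and 100,000 samples per second, plan for about 240 GB of storage (straight multiplication at 1.3 bytes per sample gives about 170 GB; the larger figure presumably includes headroom).
⚫ Ingesting about 100,000 samples per second takes about 0.25 of a CPU core.
⚫ Allow one more core for queries, recording rules, and Go garbage collection.
⚫ So about 1.25 CPU cores is enough.
⚫ About 8 GB of memory is enough for 100,000 samples per second.
⚫ Prometheus receives scrapes compressed, so each sample costs about 20 bytes of network bandwidth.
⚫ Processing 100,000 samples per second therefore uses about 16 Mbps of network bandwidth.
Example: one node exporter exposes about 3,000 time series. With 100 nodes scraped once every 15 seconds (the interval in the sample configuration), that is 3,000 × 100 / 15 = 20,000 samples per second.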
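
The sizing figures above follow from simple multiplication, sketched below. The function names are made up for this example, and the per-sample costs are the rough numbers quoted on the slide rather than measured values.

```python
SECONDS_PER_DAY = 86_400


def ingest_rate(targets, series_per_target, scrape_interval_s):
    """Samples ingested per second across all targets."""
    return targets * series_per_target / scrape_interval_s


def disk_bytes(samples_per_s, retention_days, bytes_per_sample):
    """Raw storage needed to keep every sample for the retention period."""
    return samples_per_s * retention_days * SECONDS_PER_DAY * bytes_per_sample


def network_mbps(samples_per_s, bytes_per_sample_on_wire):
    """Scrape bandwidth in megabits per second."""
    return samples_per_s * bytes_per_sample_on_wire * 8 / 1_000_000


# 100 nodes, about 3,000 series each, scraped every 15 seconds.
print(ingest_rate(100, 3000, 15))              # 20000.0 samples per second

# 100,000 samples per second kept for 15 days at 1.3 bytes per sample.
print(round(disk_bytes(100_000, 15, 1.3) / 1e9))  # about 168 (GB)

# 100,000 samples per second at about 20 bytes per sample on the wire.
print(network_mbps(100_000, 20))               # 16.0 Mbps
```

At 2 bytes per sample the same retention needs about 259 GB, so a 240 GB budget sits between the two estimates.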

05. Performance
• Reducing Cardinality
Screenshot: metrics listed in descending order of cardinality.
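
The effect of removing a high-cardinality label can be estimated by counting distinct series before and after. The sketch below uses invented data (an http_requests_total metric with a user_id label) and a hypothetical series_count helper; in practice the label would be removed in the instrumentation or by relabeling.

```python
from collections import Counter


def series_count(series, drop_labels=()):
    """Count distinct time series per metric name.

    series: iterable of (metric_name, labels_dict).
    drop_labels: label names to ignore, simulating their removal.
    """
    distinct = set()
    for name, labels in series:
        kept = tuple(sorted((k, v) for k, v in labels.items() if k not in drop_labels))
        distinct.add((name, kept))
    return Counter(name for name, _ in distinct)


# Invented example: 2 paths x 3 users, plus one `up` series.
series = [
    ("http_requests_total", {"path": p, "user_id": u})
    for p in ("/a", "/b")
    for u in ("1", "2", "3")
]
series.append(("up", {"job": "node3"}))

print(series_count(series).most_common())
# [('http_requests_total', 6), ('up', 1)]
print(series_count(series, drop_labels=("user_id",)).most_common())
# [('http_requests_total', 2), ('up', 1)]
```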

05. Performance
• Recording rule
[root@yj26-ovstest1 prometheus-2.8.1.linux-amd64]# grep -Eiv '^$|^#|[[:space:]]*#' rules.yml
groups:
- name: node
  rules:
  - record: job:node_cpu_seconds_total:rate3m
    expr: >
      100 - (avg by (instance) (rate(node_cpu_seconds_total{job="node",mode="idle"}[3m])) * 100)
[root@yj26-ovstest1 prometheus-2.8.1.linux-amd64]#
CPU utilization formula
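
The arithmetic inside the recording rule can be written out for a single instance. In this sketch, each input value stands for the per-second rate of idle CPU time on one core (between 0.0 and 1.0), which is what rate(node_cpu_seconds_total{mode="idle"}[3m]) yields per core; the function name and sample values are assumptions for illustration.

```python
def cpu_utilization_percent(idle_rates):
    """CPU utilization of one instance, in percent.

    idle_rates: per-core fraction of each second spent idle (0.0 to 1.0).
    Mirrors `100 - avg(idle rate) * 100` from the recording rule.
    """
    if not idle_rates:
        raise ValueError("at least one core is required")
    return 100 - (sum(idle_rates) / len(idle_rates)) * 100


# Two cores: one idle 75% of the time, the other 25%.
print(cpu_utilization_percent([0.75, 0.25]))  # 50.0
```

Recording the result under a new series name means dashboards read one precomputed value per instance instead of re-evaluating the rate on every refresh.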

05. Performance
• What if there are too many targets? Horizontal sharding!
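
Horizontal sharding means each Prometheus server scrapes only its own share of the targets. In Prometheus this is typically done with the hashmod relabel action; the sketch below only illustrates the idea of hashing a target address and taking it modulo the number of shards. The function names and the exact hash handling are assumptions and are not guaranteed to match Prometheus's own assignment.

```python
import hashlib


def shard_for(target, shards):
    """Map a target address to a shard index in [0, shards)."""
    digest = hashlib.md5(target.encode("utf-8")).digest()
    return int.from_bytes(digest[-8:], "big") % shards


def assign(targets, shards):
    """Split targets so that every target belongs to exactly one shard."""
    plan = {i: [] for i in range(shards)}
    for target in targets:
        plan[shard_for(target, shards)].append(target)
    return plan


# 100 hypothetical node exporters spread over 3 Prometheus servers.
targets = [f"node{i}:9100" for i in range(100)]
plan = assign(targets, 3)
for shard, members in plan.items():
    print(shard, len(members))
```

Because the assignment depends only on the target address, every server can compute its own share independently, with no coordination between servers.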
