2. About me
• Mitsuhiro Tanda
• Infrastructure Engineer @GREE
• Use Prometheus on AWS (1 year)
• Grafana committer
• @mtanda
3. Features
• multi-dimensional data model
• flexible query language
• pull model over HTTP
• service discovery
• Prometheus values reliability
4. AWS Monitoring Problems
• Instance lifecycles are short
• Instances are launched/terminated by ASG
• Instance workload is not the same among AZs, …
5. Why we use Prometheus
• multi-dimensional data model & flexible query language
– aggregate metrics by Role/AZ and compare the results
– detect instances whose workload differs from the rest of their Role (see the PromQL sketch below)
• pull model over HTTP & service discovery
– specify monitoring targets by Role, ...
– easily adapt to increases in monitoring targets
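A minimal PromQL sketch of those two queries, assuming node_exporter-style metrics and the labels described on the next slide (metric and label names are illustrative):

  # average CPU usage per Role and AZ
  avg by (role, availability_zone) (rate(node_cpu{mode!="idle"}[5m]))

  # instances running 50% above their Role's average
  sum by (instance_id, role) (rate(node_cpu{mode!="idle"}[5m]))
    > on (role) group_left
      1.5 * avg by (role) (sum by (instance_id, role) (rate(node_cpu{mode!="idle"}[5m])))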
6. multi-dimensional data model
• record instance metadata as labels
key                         value
instance_id                 i-1234abcd
instance_type               ec2, rds, elasticache, elb, …
instance_model              t2.large, m4.large, c4.large, r3.large, …
region                      ap-northeast-1, us-east-1, …
availability_zone           ap-northeast-1a, ap-northeast-1c, …
role (instance tag)         web, db, …
environment (instance tag)  production, staging, …
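Put together, a single series from one web instance carries all of this metadata (a hypothetical sample; the metric name is illustrative):

  node_cpu{instance_id="i-1234abcd", instance_type="ec2",
           region="ap-northeast-1", availability_zone="ap-northeast-1a",
           role="web", environment="production"}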
10. Service Discovery
• auto-detect monitoring targets
• Prometheus provides several SD mechanisms
– ec2_sd, consul_sd, kubernetes_sd, file_sd
• (a fundamental feature for the pull architecture)
11. ec2_sd
• detects monitoring targets via the ec2:DescribeInstances API
• specify monitoring targets by AZ, instance tags, ...
• example setting targeting the Web Role:
  - job_name: 'job_name'
    ec2_sd_configs:
      - region: ap-northeast-1
        port: 9100
    relabel_configs:
      - source_labels: [__meta_ec2_tag_Role]
        regex: web.*
        action: keep
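ec2_sd also exposes instance metadata as __meta_ec2_* labels, which relabel_configs can copy onto every scraped series (a sketch; the target label names follow the table on slide 6):

  relabel_configs:
    - source_labels: [__meta_ec2_availability_zone]
      target_label: availability_zone
    - source_labels: [__meta_ec2_instance_id]
      target_label: instance_id
    - source_labels: [__meta_ec2_tag_Role]
      target_label: role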
12. How we deploy setting
(diagram: the setting is edited, packed, uploaded, and deployed through Jenkins, with one Prometheus per Role: Prometheus (for web) monitoring Role=web instances, Prometheus (for db) monitoring Role=db instances)
This logo belongs to the Jenkins project (https://jenkins.io/).
13. CloudWatch support
• We store CloudWatch metrics in Prometheus
• We don't use cloudwatch_exporter, because it depends on Java
• We created an in-house CloudWatch exporter with aws-sdk-go (see the sketch below)
• Recording timestamps causes some problems
– CloudWatch metric emission is delayed by several minutes
– Prometheus treats such delayed metrics as stale and drops them
– I gave up recording timestamps for some metrics
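A minimal sketch of such an exporter, assuming aws-sdk-go and client_golang (the namespace, metric, ELB name, and port are illustrative): it fetches a slightly delayed window from CloudWatch and exposes the value without its original timestamp, so Prometheus stamps the sample at scrape time instead of dropping it as stale.

  package main

  import (
      "log"
      "net/http"
      "time"

      "github.com/aws/aws-sdk-go/aws"
      "github.com/aws/aws-sdk-go/aws/session"
      "github.com/aws/aws-sdk-go/service/cloudwatch"
      "github.com/prometheus/client_golang/prometheus"
      "github.com/prometheus/client_golang/prometheus/promhttp"
  )

  // Hypothetical gauge for one CloudWatch metric; a real exporter
  // would generate these dynamically.
  var elbRequests = prometheus.NewGaugeVec(
      prometheus.GaugeOpts{
          Name: "cloudwatch_elb_request_count_sum",
          Help: "ELB RequestCount (Sum) fetched from CloudWatch.",
      },
      []string{"load_balancer"},
  )

  func scrape(cw *cloudwatch.CloudWatch) {
      // Query a window ending a few minutes in the past, because
      // CloudWatch publishes datapoints with several minutes of delay.
      end := time.Now().Add(-5 * time.Minute)
      out, err := cw.GetMetricStatistics(&cloudwatch.GetMetricStatisticsInput{
          Namespace:  aws.String("AWS/ELB"),
          MetricName: aws.String("RequestCount"),
          Dimensions: []*cloudwatch.Dimension{
              {Name: aws.String("LoadBalancerName"), Value: aws.String("my-elb")},
          },
          StartTime:  aws.Time(end.Add(-1 * time.Minute)),
          EndTime:    aws.Time(end),
          Period:     aws.Int64(60),
          Statistics: []*string{aws.String(cloudwatch.StatisticSum)},
      })
      if err != nil || len(out.Datapoints) == 0 {
          return
      }
      // Set the value WITHOUT the CloudWatch timestamp: Prometheus
      // records it at scrape time, avoiding the staleness drop.
      elbRequests.WithLabelValues("my-elb").Set(*out.Datapoints[0].Sum)
  }

  func main() {
      prometheus.MustRegister(elbRequests)
      cw := cloudwatch.New(session.Must(session.NewSession()))
      go func() {
          for range time.Tick(time.Minute) {
              scrape(cw)
          }
      }()
      http.Handle("/metrics", promhttp.Handler())
      log.Fatal(http.ListenAndServe(":9106", nil))
  }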
14. Instance Spec we use
• use t2.micro - t2.medium instances
• use gp2 EBS, volume size 50-100GB
• If the number of monitoring targets is 50-100, t2.medium is enough to monitor them
• I recommend using t2.small or larger
– t2.micro's memory size is not enough
– need to lower storage.local.memory-chunks (example below)
• Sudden load increases can be handled by bursting
– t2 instance burst
– EBS (gp2) burst
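storage.local.memory-chunks is a Prometheus 1.x command-line flag; a hedged example for a memory-constrained instance (the value is illustrative; each chunk holds 1024 bytes of sample data):

  prometheus -storage.local.memory-chunks=262144   # cap chunk memory at ~256MB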
16. Disk usage
• calculated per monitored instance
• We have 150-300 metrics per instance
• scrape interval is 15 seconds
• Disk usage comes to approximately 200MB per instance per month (rough arithmetic below)
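A rough sanity check, assuming ~3.3 bytes per sample (the figure commonly cited for Prometheus 1.x's double-delta chunk encoding):

  300 series × (30 × 86,400s / 15s) ≈ 52M samples per month
  52M samples × ~3.3 bytes ≈ 170MB, i.e. on the order of 200MB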
17. Long term metrics storage
• Prometheus doesn't support summarizing metrics the way rrdtool does
• The data size becomes large if you set a long retention period
• The default retention period is 15 days
• Prometheus is not designed for long-term metrics storage
• To store metrics for the long term:
– use Remote Storage (e.g. Graphite)
– launch another Prometheus for long-term storage, and store summarized metrics data (we created a metrics-summarize exporter; see the sketch below)
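A sketch of the second option, assuming Prometheus 1.x flags and config (the retention, interval, exporter name, and port are illustrative): the long-term server keeps data for a year and scrapes the summarizing exporter at a coarse interval.

  prometheus -storage.local.retention=8760h

  scrape_configs:
    - job_name: 'summarized_metrics'
      scrape_interval: 5m
      static_configs:
        - targets: ['summarize-exporter:9099']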
18. Using Prometheus for 1 year
• daily operation
– Prometheus workload is very stable
– mostly no operation required
• upgrading Prometheus
– need to change the configuration file due to format changes
– breaking changes will keep coming until version 1.0
• supporting new monitoring-target middleware
– create an exporter for each middleware
– thanks to Prometheus's powerful query language, exporters can stay very simple (example below)
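For instance, an exporter only needs to expose a raw cumulative counter; rating and aggregation stay in PromQL (the metric name is illustrative):

  # the exporter exposes only:
  myapp_requests_total{role="web", instance_id="i-1234abcd"} 102934

  # Prometheus derives the per-Role request rate at query time:
  sum by (role) (rate(myapp_requests_total[5m]))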