2. About me
• Mitsuhiro Tanda
• Infrastructure Engineer @GREE
• Use Prometheus on AWS (1 year)
• Grafana committer
• @mtanda
3. Features
• multi-dimensional data model
• flexible query language
• pull model over HTTP
• service discovery
• Prometheus values reliability
4. AWS Monitoring Problems
• Instance lifecycles are short
• Instances are launched/terminated by ASG
• Instance workload is not the same among AZs, …
5. Why we use Prometheus
• multi-dimensional data model & flexible query language
– aggregate metrics by Role/AZ and compare the results
– detect instances whose workload differs from the rest of their Role (see the PromQL sketch below)
• pull model over HTTP & service discovery
– specify monitoring targets by Role, ...
– easily adapt to increases in monitoring targets
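A minimal PromQL sketch of those two queries, assuming node_exporter-style metrics and the labels described on the next slide (metric and label names are illustrative):

  # average CPU usage per Role and AZ
  avg by (role, availability_zone) (rate(node_cpu{mode!="idle"}[5m]))

  # instances running 50% above their Role's average
  sum by (instance_id, role) (rate(node_cpu{mode!="idle"}[5m]))
    > on (role) group_left
      1.5 * avg by (role) (sum by (instance_id, role) (rate(node_cpu{mode!="idle"}[5m])))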
6. multi-dimensional data model
• record instance metadata as labels
key                         value
instance_id                 i-1234abcd
instance_type               ec2, rds, elasticache, elb, …
instance_model              t2.large, m4.large, c4.large, r3.large, …
region                      ap-northeast-1, us-east-1, …
availability_zone           ap-northeast-1a, ap-northeast-1c, …
role (instance tag)         web, db, …
environment (instance tag)  production, staging, …
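Put together, a single series from one web instance carries all of this metadata (a hypothetical sample; the metric name is illustrative):

  node_cpu{instance_id="i-1234abcd", instance_type="ec2",
           region="ap-northeast-1", availability_zone="ap-northeast-1a",
           role="web", environment="production"}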
10. Service Discovery
• auto-detect monitoring targets
• Prometheus provides several SD mechanisms
– ec2_sd, consul_sd, kubernetes_sd, file_sd
• (a fundamental feature for the pull architecture)
11. ec2_sd
• detects monitoring targets via the ec2:DescribeInstances API
• specify monitoring targets by AZ, instance tags, ...
• example setting targeting the Web Role:
  - job_name: 'job_name'
    ec2_sd_configs:
      - region: ap-northeast-1
        port: 9100
    relabel_configs:
      - source_labels: [__meta_ec2_tag_Role]
        regex: web.*
        action: keep
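ec2_sd also exposes instance metadata as __meta_ec2_* labels, which relabel_configs can copy onto every scraped series (a sketch; the target label names follow the table on slide 6):

  relabel_configs:
    - source_labels: [__meta_ec2_availability_zone]
      target_label: availability_zone
    - source_labels: [__meta_ec2_instance_id]
      target_label: instance_id
    - source_labels: [__meta_ec2_tag_Role]
      target_label: role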
12. How we deploy setting
(diagram: the setting is edited, packed, uploaded, and deployed through Jenkins, with one Prometheus per Role: Prometheus (for web) monitoring Role=web instances, Prometheus (for db) monitoring Role=db instances)
This logo belongs to the Jenkins project (https://jenkins.io/).
13. CloudWatch support
• We store CloudWatch metrics in Prometheus
• We don't use cloudwatch_exporter, because it depends on Java
• We created an in-house CloudWatch exporter with aws-sdk-go (see the sketch below)
• Recording timestamps causes some problems
– CloudWatch metric emission is delayed by several minutes
– Prometheus treats such delayed metrics as stale and drops them
– I gave up recording timestamps for some metrics
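A minimal sketch of such an exporter, assuming aws-sdk-go and client_golang (the namespace, metric, ELB name, and port are illustrative): it fetches a slightly delayed window from CloudWatch and exposes the value without its original timestamp, so Prometheus stamps the sample at scrape time instead of dropping it as stale.

  package main

  import (
      "log"
      "net/http"
      "time"

      "github.com/aws/aws-sdk-go/aws"
      "github.com/aws/aws-sdk-go/aws/session"
      "github.com/aws/aws-sdk-go/service/cloudwatch"
      "github.com/prometheus/client_golang/prometheus"
      "github.com/prometheus/client_golang/prometheus/promhttp"
  )

  // Hypothetical gauge for one CloudWatch metric; a real exporter
  // would generate these dynamically.
  var elbRequests = prometheus.NewGaugeVec(
      prometheus.GaugeOpts{
          Name: "cloudwatch_elb_request_count_sum",
          Help: "ELB RequestCount (Sum) fetched from CloudWatch.",
      },
      []string{"load_balancer"},
  )

  func scrape(cw *cloudwatch.CloudWatch) {
      // Query a window ending a few minutes in the past, because
      // CloudWatch publishes datapoints with several minutes of delay.
      end := time.Now().Add(-5 * time.Minute)
      out, err := cw.GetMetricStatistics(&cloudwatch.GetMetricStatisticsInput{
          Namespace:  aws.String("AWS/ELB"),
          MetricName: aws.String("RequestCount"),
          Dimensions: []*cloudwatch.Dimension{
              {Name: aws.String("LoadBalancerName"), Value: aws.String("my-elb")},
          },
          StartTime:  aws.Time(end.Add(-1 * time.Minute)),
          EndTime:    aws.Time(end),
          Period:     aws.Int64(60),
          Statistics: []*string{aws.String(cloudwatch.StatisticSum)},
      })
      if err != nil || len(out.Datapoints) == 0 {
          return
      }
      // Set the value WITHOUT the CloudWatch timestamp: Prometheus
      // records it at scrape time, avoiding the staleness drop.
      elbRequests.WithLabelValues("my-elb").Set(*out.Datapoints[0].Sum)
  }

  func main() {
      prometheus.MustRegister(elbRequests)
      cw := cloudwatch.New(session.Must(session.NewSession()))
      go func() {
          for range time.Tick(time.Minute) {
              scrape(cw)
          }
      }()
      http.Handle("/metrics", promhttp.Handler())
      log.Fatal(http.ListenAndServe(":9106", nil))
  }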
14. Instance Spec we use
• use t2.micro - t2.medium instances
• use gp2 EBS, volume size 50-100GB
• If the number of monitoring targets is 50-100, t2.medium is enough to monitor them
• I recommend using t2.small or larger
– t2.micro's memory size is not enough
– need to lower storage.local.memory-chunks (example below)
• Sudden load increases can be handled by bursting
– t2 instance burst
– EBS (gp2) burst
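storage.local.memory-chunks is a Prometheus 1.x command-line flag; a hedged example for a memory-constrained instance (the value is illustrative; each chunk holds 1024 bytes of sample data):

  prometheus -storage.local.memory-chunks=262144   # cap chunk memory at ~256MB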
16. Disk usage
• calculated per monitored instance
• We have 150-300 metrics per instance
• scrape interval is 15 seconds
• Disk usage comes to approximately 200MB per instance per month (rough arithmetic below)
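A rough sanity check, assuming ~3.3 bytes per sample (the figure commonly cited for Prometheus 1.x's double-delta chunk encoding):

  300 series × (30 × 86,400s / 15s) ≈ 52M samples per month
  52M samples × ~3.3 bytes ≈ 170MB, i.e. on the order of 200MB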
17. Long term metrics storage
• Prometheus doesn't support summarizing metrics the way rrdtool does
• The data size becomes large if you set a long retention period
• The default retention period is 15 days
• Prometheus is not designed for long-term metrics storage
• To store metrics for the long term:
– use Remote Storage (e.g. Graphite)
– launch another Prometheus for long-term storage, and store summarized metrics data (we created a metrics-summarize exporter; see the sketch below)
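A sketch of the second option, assuming Prometheus 1.x flags and config (the retention, interval, exporter name, and port are illustrative): the long-term server keeps data for a year and scrapes the summarizing exporter at a coarse interval.

  prometheus -storage.local.retention=8760h

  scrape_configs:
    - job_name: 'summarized_metrics'
      scrape_interval: 5m
      static_configs:
        - targets: ['summarize-exporter:9099']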
18. Using Prometheus for 1 year
• daily operation
– Prometheus workload is very stable
– mostly no operation required
• upgrading Prometheus
– need to change the configuration file due to format changes
– breaking changes will keep coming until version 1.0
• supporting new monitoring-target middleware
– create an exporter for each middleware
– thanks to Prometheus's powerful query language, exporters can stay very simple (example below)
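For instance, an exporter only needs to expose a raw cumulative counter; rating and aggregation stay in PromQL (the metric name is illustrative):

  # the exporter exposes only:
  myapp_requests_total{role="web", instance_id="i-1234abcd"} 102934

  # Prometheus derives the per-Role request rate at query time:
  sum by (role) (rate(myapp_requests_total[5m]))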