SlideShare a Scribd company logo
1 of 19
Prometheus on AWS
About me
• MitsuhiroTanda
• Infrastructure Engineer @GREE
• Use Prometheus on AWS (1 year)
• Grafana committer
• @mtanda
Features
• multi-dimensional data model
• flexible query language
• pull model over HTTP
• service discovery
• Prometheus values reliability
AWS Monitoring Problems
• Instance lifecycle is short
• Instance is launched/terminated byASG
• Instance workload is not same amongAZ, …
Why we use Prometheus
• multi-dimensional data model & flexible query
language
– aggregate metrics by Role/AZ, and compare the result
– detect the instance which workload is differ among the
Role
• pull model over HTTP & service discovery
– specify monitoring target by Role, ...
– easily adapt monitoring target increase
multi-dimensional data model
• record instance metadata to labels
key value
instance_id i-1234abcd
instance_type ec2, rds, elasticache, elb, …
instance_model t2.large, m4.large, c4.large, r3.large, …
region ap-northeast-1, us-east-1, …
availability_zone ap-northeast-1a, ap-northeast-1c, …
role (instance tag) web, db, …
environment (instance tag) production, staging, …
avg(cpu) by (availability_zone)
cpu{role="web"}
avg(cpu) by (role)
Service Discovery
• auto detect monitoring target
• Prometheus provides several SD
– ec2_sd, consul_sd, kubernetes_sd, file_sd
• (fundamental feature for Pull architecture)
ec2_sd
• detect monitoring target by ec2:DescribeInstances API
• specify monitoring target by AZ, InstanceTags, ...
• example setting for specifying Web Role target
- job_name: 'job_name'
ec2_sd_configs:
- region: ap-northeast-1
port: 9100
relabel_configs:
- source_labels: [__meta_ec2_tag_Role]
regex: web.*
action: keep
How we deploy setting
Prometheus
(for web)
Prometheus
(for db)
Role=web Role=db
pack
upload
deploy
edit
このロゴはJenkins project (https://jenkins.io/)に帰属します。
CloudWatch support
• We store CloudWatch metrics to Prometheus
• Don't use cloudwatch_exporter, because it's depend on
Java
• Create in-house CloudWatch exporter by aws-sdk-go
• Recording timestamp cause some problems
– CloudWatch metrics emission is delayed for several minutes
– Prometheus treat the metrics as stale, and drop it
– I give up to record timestamp for some metrics
Instance Spec we use
• use t2.micro - t2.medium instance
• use gp2 EBS, volume size is 50-100GB
• If the number of monitoring target is 50-100, t2.medium is enough to
monitor them
• I recommend to use t2.small or upper
– t2.micro's memory size is not enough
– need to change storage.local.memory-chunks
• Sudden load increase can handled by Burst
– t2 Instance burst
– EBS(gp2) burst
Disk write workload
Disk usage
• calculate per monitoring target instance
• We have 150 - 300 metrics per one instance
• scrape interval is 15 seconds
• Disk usage becomes approximately 200MB
per 1 month
Long term metrics storage
• Prometheus doesn't support summarize metrics like rrdtool
• The data size becomes large if you set long retention period
• The default retention period is 15 days
• Prometheus is not designed for long term metrics storage
• To store metrics for a long term
– Use Remote Storage (e.g. Graphite)
– Launch another Prometheus for long term storage, and store
summarized metrics data (we create metrics summarize exporter)
Using 1 year
• daily operation
– Prometheus workload is very stable
– mostly no operation required
• upgrade Prometheus
– need to change configuration file due to format change
– breaking change will come until version 1.0
• support new monitoring target middleware
– create exporter for each middleware
– by using Prometheus powerful query, exporter becomes very simple
Reference URL
• http://www.robustperception.io/automatically-monitoring-ec2-instances/
• http://www.robustperception.io/how-to-have-labels-for-machine-roles/
• http://www.robustperception.io/life-of-a-label/
• http://www.slideshare.net/FabianReinartz/prometheus-storage-57557499

More Related Content

What's hot

Prometheus
PrometheusPrometheus
Prometheuswyukawa
 
Kubernetes at Telekom Austria Group
Kubernetes at Telekom Austria Group Kubernetes at Telekom Austria Group
Kubernetes at Telekom Austria Group Oliver Moser
 
Prometheus london
Prometheus londonPrometheus london
Prometheus londonwyukawa
 
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)Brian Brazil
 
An Introduction to Prometheus
An Introduction to PrometheusAn Introduction to Prometheus
An Introduction to PrometheusEvgeny Shmarnev
 
3.2 Streaming and Messaging
3.2 Streaming and Messaging3.2 Streaming and Messaging
3.2 Streaming and Messaging振东 刘
 
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Brian Brazil
 
Monitoring a Kubernetes-backed microservice architecture with Prometheus
Monitoring a Kubernetes-backed microservice architecture with PrometheusMonitoring a Kubernetes-backed microservice architecture with Prometheus
Monitoring a Kubernetes-backed microservice architecture with PrometheusFabian Reinartz
 
Prometheus for Monitoring Metrics (Fermilab 2018)
Prometheus for Monitoring Metrics (Fermilab 2018)Prometheus for Monitoring Metrics (Fermilab 2018)
Prometheus for Monitoring Metrics (Fermilab 2018)Brian Brazil
 
Cloud Monitoring with Prometheus
Cloud Monitoring with PrometheusCloud Monitoring with Prometheus
Cloud Monitoring with PrometheusQAware GmbH
 
Lessons Learned from Building and Operating Scuba
Lessons Learned from Building and Operating ScubaLessons Learned from Building and Operating Scuba
Lessons Learned from Building and Operating ScubaSingleStore
 
Prometheus Overview
Prometheus OverviewPrometheus Overview
Prometheus OverviewBrian Brazil
 
Project Frankenstein: A multitenant, horizontally scalable Prometheus as a se...
Project Frankenstein: A multitenant, horizontally scalable Prometheus as a se...Project Frankenstein: A multitenant, horizontally scalable Prometheus as a se...
Project Frankenstein: A multitenant, horizontally scalable Prometheus as a se...Weaveworks
 
Low latency stream processing with jet
Low latency stream processing with jetLow latency stream processing with jet
Low latency stream processing with jetStreamNative
 
Portable Streaming Pipelines with Apache Beam
Portable Streaming Pipelines with Apache BeamPortable Streaming Pipelines with Apache Beam
Portable Streaming Pipelines with Apache Beamconfluent
 
Kafka Summit NYC 2017 - Introducing Exactly Once Semantics in Apache Kafka
Kafka Summit NYC 2017 - Introducing Exactly Once Semantics in Apache KafkaKafka Summit NYC 2017 - Introducing Exactly Once Semantics in Apache Kafka
Kafka Summit NYC 2017 - Introducing Exactly Once Semantics in Apache Kafkaconfluent
 
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)Prometheus: A Next Generation Monitoring System (FOSDEM 2016)
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)Brian Brazil
 
Distributed Kafka Architecture Taboola Scale
Distributed Kafka Architecture Taboola ScaleDistributed Kafka Architecture Taboola Scale
Distributed Kafka Architecture Taboola ScaleApache Kafka TLV
 
Introduction to Streaming Distributed Processing with Storm
Introduction to Streaming Distributed Processing with StormIntroduction to Streaming Distributed Processing with Storm
Introduction to Streaming Distributed Processing with StormBrandon O'Brien
 

What's hot (20)

Prometheus
PrometheusPrometheus
Prometheus
 
Kubernetes at Telekom Austria Group
Kubernetes at Telekom Austria Group Kubernetes at Telekom Austria Group
Kubernetes at Telekom Austria Group
 
Prometheus london
Prometheus londonPrometheus london
Prometheus london
 
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)
 
An Introduction to Prometheus
An Introduction to PrometheusAn Introduction to Prometheus
An Introduction to Prometheus
 
3.2 Streaming and Messaging
3.2 Streaming and Messaging3.2 Streaming and Messaging
3.2 Streaming and Messaging
 
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
 
Monitoring a Kubernetes-backed microservice architecture with Prometheus
Monitoring a Kubernetes-backed microservice architecture with PrometheusMonitoring a Kubernetes-backed microservice architecture with Prometheus
Monitoring a Kubernetes-backed microservice architecture with Prometheus
 
Prometheus for Monitoring Metrics (Fermilab 2018)
Prometheus for Monitoring Metrics (Fermilab 2018)Prometheus for Monitoring Metrics (Fermilab 2018)
Prometheus for Monitoring Metrics (Fermilab 2018)
 
Cloud Monitoring with Prometheus
Cloud Monitoring with PrometheusCloud Monitoring with Prometheus
Cloud Monitoring with Prometheus
 
Project Reactor By Example
Project Reactor By ExampleProject Reactor By Example
Project Reactor By Example
 
Lessons Learned from Building and Operating Scuba
Lessons Learned from Building and Operating ScubaLessons Learned from Building and Operating Scuba
Lessons Learned from Building and Operating Scuba
 
Prometheus Overview
Prometheus OverviewPrometheus Overview
Prometheus Overview
 
Project Frankenstein: A multitenant, horizontally scalable Prometheus as a se...
Project Frankenstein: A multitenant, horizontally scalable Prometheus as a se...Project Frankenstein: A multitenant, horizontally scalable Prometheus as a se...
Project Frankenstein: A multitenant, horizontally scalable Prometheus as a se...
 
Low latency stream processing with jet
Low latency stream processing with jetLow latency stream processing with jet
Low latency stream processing with jet
 
Portable Streaming Pipelines with Apache Beam
Portable Streaming Pipelines with Apache BeamPortable Streaming Pipelines with Apache Beam
Portable Streaming Pipelines with Apache Beam
 
Kafka Summit NYC 2017 - Introducing Exactly Once Semantics in Apache Kafka
Kafka Summit NYC 2017 - Introducing Exactly Once Semantics in Apache KafkaKafka Summit NYC 2017 - Introducing Exactly Once Semantics in Apache Kafka
Kafka Summit NYC 2017 - Introducing Exactly Once Semantics in Apache Kafka
 
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)Prometheus: A Next Generation Monitoring System (FOSDEM 2016)
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)
 
Distributed Kafka Architecture Taboola Scale
Distributed Kafka Architecture Taboola ScaleDistributed Kafka Architecture Taboola Scale
Distributed Kafka Architecture Taboola Scale
 
Introduction to Streaming Distributed Processing with Storm
Introduction to Streaming Distributed Processing with StormIntroduction to Streaming Distributed Processing with Storm
Introduction to Streaming Distributed Processing with Storm
 

Viewers also liked

Grafana optimization for Prometheus
Grafana optimization for PrometheusGrafana optimization for Prometheus
Grafana optimization for PrometheusMitsuhiro Tanda
 
Application security as crucial to the modern distributed trust model
Application security as crucial to   the modern distributed trust modelApplication security as crucial to   the modern distributed trust model
Application security as crucial to the modern distributed trust modelLINE Corporation
 
Implementing Trusted Endpoints in the Mobile World
Implementing Trusted Endpoints in the Mobile WorldImplementing Trusted Endpoints in the Mobile World
Implementing Trusted Endpoints in the Mobile WorldLINE Corporation
 
“Your Security, More Simple.” by utilizing FIDO Authentication
“Your Security, More Simple.” by utilizing FIDO Authentication“Your Security, More Simple.” by utilizing FIDO Authentication
“Your Security, More Simple.” by utilizing FIDO AuthenticationLINE Corporation
 
Drawing the Line Correctly: Enough Security, Everywhere
Drawing the Line Correctly:   Enough Security, EverywhereDrawing the Line Correctly:   Enough Security, Everywhere
Drawing the Line Correctly: Enough Security, EverywhereLINE Corporation
 
FIDO認証で「あんしんをもっと便利に」
FIDO認証で「あんしんをもっと便利に」FIDO認証で「あんしんをもっと便利に」
FIDO認証で「あんしんをもっと便利に」LINE Corporation
 
ゲーム開発を加速させる クライアントセキュリティ
ゲーム開発を加速させる クライアントセキュリティゲーム開発を加速させる クライアントセキュリティ
ゲーム開発を加速させる クライアントセキュリティLINE Corporation
 

Viewers also liked (8)

Grafana optimization for Prometheus
Grafana optimization for PrometheusGrafana optimization for Prometheus
Grafana optimization for Prometheus
 
Application security as crucial to the modern distributed trust model
Application security as crucial to   the modern distributed trust modelApplication security as crucial to   the modern distributed trust model
Application security as crucial to the modern distributed trust model
 
Implementing Trusted Endpoints in the Mobile World
Implementing Trusted Endpoints in the Mobile WorldImplementing Trusted Endpoints in the Mobile World
Implementing Trusted Endpoints in the Mobile World
 
“Your Security, More Simple.” by utilizing FIDO Authentication
“Your Security, More Simple.” by utilizing FIDO Authentication“Your Security, More Simple.” by utilizing FIDO Authentication
“Your Security, More Simple.” by utilizing FIDO Authentication
 
Drawing the Line Correctly: Enough Security, Everywhere
Drawing the Line Correctly:   Enough Security, EverywhereDrawing the Line Correctly:   Enough Security, Everywhere
Drawing the Line Correctly: Enough Security, Everywhere
 
FRONTIERS IN CRYPTOGRAPHY
FRONTIERS IN CRYPTOGRAPHYFRONTIERS IN CRYPTOGRAPHY
FRONTIERS IN CRYPTOGRAPHY
 
FIDO認証で「あんしんをもっと便利に」
FIDO認証で「あんしんをもっと便利に」FIDO認証で「あんしんをもっと便利に」
FIDO認証で「あんしんをもっと便利に」
 
ゲーム開発を加速させる クライアントセキュリティ
ゲーム開発を加速させる クライアントセキュリティゲーム開発を加速させる クライアントセキュリティ
ゲーム開発を加速させる クライアントセキュリティ
 

Similar to Prometheus on AWS

Presto At Treasure Data
Presto At Treasure DataPresto At Treasure Data
Presto At Treasure DataTaro L. Saito
 
Understanding Elastic Block Store Availability and Performance
Understanding Elastic Block Store Availability and PerformanceUnderstanding Elastic Block Store Availability and Performance
Understanding Elastic Block Store Availability and PerformanceAmazon Web Services
 
Big data Argentina meetup 2020-09: Intro to presto on docker
Big data Argentina meetup 2020-09: Intro to presto on dockerBig data Argentina meetup 2020-09: Intro to presto on docker
Big data Argentina meetup 2020-09: Intro to presto on dockerFederico Palladoro
 
Cloud Security Monitoring and Spark Analytics
Cloud Security Monitoring and Spark AnalyticsCloud Security Monitoring and Spark Analytics
Cloud Security Monitoring and Spark Analyticsamesar0
 
[Virtual Meetup] Using Elasticsearch as a Time-Series Database in the Endpoin...
[Virtual Meetup] Using Elasticsearch as a Time-Series Database in the Endpoin...[Virtual Meetup] Using Elasticsearch as a Time-Series Database in the Endpoin...
[Virtual Meetup] Using Elasticsearch as a Time-Series Database in the Endpoin...Anna Ossowski
 
Hardware Provisioning
Hardware Provisioning Hardware Provisioning
Hardware Provisioning MongoDB
 
Capacity Planning
Capacity PlanningCapacity Planning
Capacity PlanningMongoDB
 
AWS re:Invent 2016| DAT318 | Migrating from RDBMS to NoSQL: How Sony Moved fr...
AWS re:Invent 2016| DAT318 | Migrating from RDBMS to NoSQL: How Sony Moved fr...AWS re:Invent 2016| DAT318 | Migrating from RDBMS to NoSQL: How Sony Moved fr...
AWS re:Invent 2016| DAT318 | Migrating from RDBMS to NoSQL: How Sony Moved fr...Amazon Web Services
 
Performance testing in scope of migration to cloud by Serghei Radov
Performance testing in scope of migration to cloud by Serghei RadovPerformance testing in scope of migration to cloud by Serghei Radov
Performance testing in scope of migration to cloud by Serghei RadovValeriia Maliarenko
 
DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data
DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big DataDataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data
DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big DataHakka Labs
 
Optimizing spark based data pipelines - are you up for it?
Optimizing spark based data pipelines - are you up for it?Optimizing spark based data pipelines - are you up for it?
Optimizing spark based data pipelines - are you up for it?Etti Gur
 
AWS Storage and Database Architecture Best Practices (DAT203) | AWS re:Invent...
AWS Storage and Database Architecture Best Practices (DAT203) | AWS re:Invent...AWS Storage and Database Architecture Best Practices (DAT203) | AWS re:Invent...
AWS Storage and Database Architecture Best Practices (DAT203) | AWS re:Invent...Amazon Web Services
 
Presto Summit 2018 - 07 - Lyft
Presto Summit 2018 - 07 - LyftPresto Summit 2018 - 07 - Lyft
Presto Summit 2018 - 07 - Lyftkbajda
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1Ruslan Meshenberg
 
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...javier ramirez
 
Architectures, Frameworks and Infrastructure
Architectures, Frameworks and InfrastructureArchitectures, Frameworks and Infrastructure
Architectures, Frameworks and Infrastructureharendra_pathak
 

Similar to Prometheus on AWS (20)

Presto At Treasure Data
Presto At Treasure DataPresto At Treasure Data
Presto At Treasure Data
 
Understanding Elastic Block Store Availability and Performance
Understanding Elastic Block Store Availability and PerformanceUnderstanding Elastic Block Store Availability and Performance
Understanding Elastic Block Store Availability and Performance
 
Big data Argentina meetup 2020-09: Intro to presto on docker
Big data Argentina meetup 2020-09: Intro to presto on dockerBig data Argentina meetup 2020-09: Intro to presto on docker
Big data Argentina meetup 2020-09: Intro to presto on docker
 
Fastest Servlets in the West
Fastest Servlets in the WestFastest Servlets in the West
Fastest Servlets in the West
 
Presto
PrestoPresto
Presto
 
Cloud Security Monitoring and Spark Analytics
Cloud Security Monitoring and Spark AnalyticsCloud Security Monitoring and Spark Analytics
Cloud Security Monitoring and Spark Analytics
 
Megastore by Google
Megastore by GoogleMegastore by Google
Megastore by Google
 
[Virtual Meetup] Using Elasticsearch as a Time-Series Database in the Endpoin...
[Virtual Meetup] Using Elasticsearch as a Time-Series Database in the Endpoin...[Virtual Meetup] Using Elasticsearch as a Time-Series Database in the Endpoin...
[Virtual Meetup] Using Elasticsearch as a Time-Series Database in the Endpoin...
 
Hardware Provisioning
Hardware Provisioning Hardware Provisioning
Hardware Provisioning
 
Capacity Planning
Capacity PlanningCapacity Planning
Capacity Planning
 
AWS re:Invent 2016| DAT318 | Migrating from RDBMS to NoSQL: How Sony Moved fr...
AWS re:Invent 2016| DAT318 | Migrating from RDBMS to NoSQL: How Sony Moved fr...AWS re:Invent 2016| DAT318 | Migrating from RDBMS to NoSQL: How Sony Moved fr...
AWS re:Invent 2016| DAT318 | Migrating from RDBMS to NoSQL: How Sony Moved fr...
 
Log analytics with ELK stack
Log analytics with ELK stackLog analytics with ELK stack
Log analytics with ELK stack
 
Performance testing in scope of migration to cloud by Serghei Radov
Performance testing in scope of migration to cloud by Serghei RadovPerformance testing in scope of migration to cloud by Serghei Radov
Performance testing in scope of migration to cloud by Serghei Radov
 
DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data
DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big DataDataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data
DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data
 
Optimizing spark based data pipelines - are you up for it?
Optimizing spark based data pipelines - are you up for it?Optimizing spark based data pipelines - are you up for it?
Optimizing spark based data pipelines - are you up for it?
 
AWS Storage and Database Architecture Best Practices (DAT203) | AWS re:Invent...
AWS Storage and Database Architecture Best Practices (DAT203) | AWS re:Invent...AWS Storage and Database Architecture Best Practices (DAT203) | AWS re:Invent...
AWS Storage and Database Architecture Best Practices (DAT203) | AWS re:Invent...
 
Presto Summit 2018 - 07 - Lyft
Presto Summit 2018 - 07 - LyftPresto Summit 2018 - 07 - Lyft
Presto Summit 2018 - 07 - Lyft
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1
 
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
 
Architectures, Frameworks and Infrastructure
Architectures, Frameworks and InfrastructureArchitectures, Frameworks and Infrastructure
Architectures, Frameworks and Infrastructure
 

Recently uploaded

Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 

Recently uploaded (20)

Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 

Prometheus on AWS

  • 2. About me • MitsuhiroTanda • Infrastructure Engineer @GREE • Use Prometheus on AWS (1 year) • Grafana committer • @mtanda
  • 3. Features • multi-dimensional data model • flexible query language • pull model over HTTP • service discovery • Prometheus values reliability
  • 4. AWS Monitoring Problems • Instance lifecycle is short • Instance is launched/terminated byASG • Instance workload is not same amongAZ, …
  • 5. Why we use Prometheus • multi-dimensional data model & flexible query language – aggregate metrics by Role/AZ, and compare the result – detect the instance which workload is differ among the Role • pull model over HTTP & service discovery – specify monitoring target by Role, ... – easily adapt monitoring target increase
  • 6. multi-dimensional data model • record instance metadata to labels key value instance_id i-1234abcd instance_type ec2, rds, elasticache, elb, … instance_model t2.large, m4.large, c4.large, r3.large, … region ap-northeast-1, us-east-1, … availability_zone ap-northeast-1a, ap-northeast-1c, … role (instance tag) web, db, … environment (instance tag) production, staging, …
  • 10. Service Discovery • auto detect monitoring target • Prometheus provides several SD – ec2_sd, consul_sd, kubernetes_sd, file_sd • (fundamental feature for Pull architecture)
  • 11. ec2_sd • detect monitoring target by ec2:DescribeInstances API • specify monitoring target by AZ, InstanceTags, ... • example setting for specifying Web Role target - job_name: 'job_name' ec2_sd_configs: - region: ap-northeast-1 port: 9100 relabel_configs: - source_labels: [__meta_ec2_tag_Role] regex: web.* action: keep
  • 12. How we deploy setting Prometheus (for web) Prometheus (for db) Role=web Role=db pack upload deploy edit このロゴはJenkins project (https://jenkins.io/)に帰属します。
  • 13. CloudWatch support • We store CloudWatch metrics to Prometheus • Don't use cloudwatch_exporter, because it's depend on Java • Create in-house CloudWatch exporter by aws-sdk-go • Recording timestamp cause some problems – CloudWatch metrics emission is delayed for several minutes – Prometheus treat the metrics as stale, and drop it – I give up to record timestamp for some metrics
  • 14. Instance Spec we use • use t2.micro - t2.medium instance • use gp2 EBS, volume size is 50-100GB • If the number of monitoring target is 50-100, t2.medium is enough to monitor them • I recommend to use t2.small or upper – t2.micro's memory size is not enough – need to change storage.local.memory-chunks • Sudden load increase can handled by Burst – t2 Instance burst – EBS(gp2) burst
  • 16. Disk usage • calculate per monitoring target instance • We have 150 - 300 metrics per one instance • scrape interval is 15 seconds • Disk usage becomes approximately 200MB per 1 month
  • 17. Long term metrics storage • Prometheus doesn't support summarize metrics like rrdtool • The data size becomes large if you set long retention period • The default retention period is 15 days • Prometheus is not designed for long term metrics storage • To store metrics for a long term – Use Remote Storage (e.g. Graphite) – Launch another Prometheus for long term storage, and store summarized metrics data (we create metrics summarize exporter)
  • 18. Using 1 year • daily operation – Prometheus workload is very stable – mostly no operation required • upgrade Prometheus – need to change configuration file due to format change – breaking change will come until version 1.0 • support new monitoring target middleware – create exporter for each middleware – by using Prometheus powerful query, exporter becomes very simple
  • 19. Reference URL • http://www.robustperception.io/automatically-monitoring-ec2-instances/ • http://www.robustperception.io/how-to-have-labels-for-machine-roles/ • http://www.robustperception.io/life-of-a-label/ • http://www.slideshare.net/FabianReinartz/prometheus-storage-57557499