© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Mahdi Ben Hamida - SignalFx
11/30/2016
DEV307
How to Scale and Operate
Elasticsearch on AWS
What to Expect from the Session
• Elasticsearch (ES) usage at SignalFx
• What do we use ES for?
• How is ES deployed on AWS?
• Backup/restore of ES on Amazon S3
• Important ES/AWS metrics to monitor; what to alert on
• ES capacity planning
• Zero-downtime re-sharding
• SignalFx metadata storage architecture overview
• Scaling up and zero-downtime re-sharding on AWS
Elasticsearch at SignalFx
ES Usage
• Ad-hoc queries
• Auto-complete
• Full-text search
Cluster Size
• 4 clusters in production on Amazon EC2
• Biggest cluster
• 54 data nodes, 3 master nodes, 6 client nodes deployed
across 3 AZs
• Over 1.3 billion unique documents
• 10+ TB of data
• 270 shards (primaries + replicas)
• Sustained 75 QPS, 1K index/sec
ES Deployment on AWS
• Dockerized ES 2.3/1.7 clusters. Orchestration done
using MaestroNG
• Biggest cluster
• Data nodes: i2.2xlarge – 16 GB heap (61GB total)
• Master nodes: m3.large – 2 GB heap (7.5GB total)
• Client nodes: m3.xlarge – 10 GB heap (15GB total)
• ES rack awareness to distribute each primary and its 2 replicas
across 3 Availability Zones
Backup/Restore
• Made easy using the AWS Cloud plugin:
    PUT _snapshot/s3-repo
    {
      "type": "s3",
      "settings": { "bucket": "signalfx-es-backups", "region": "us-east" }
    }
• Incremental backups
• Un-versioned S3 bucket
• VPC S3 endpoint to avoid bandwidth constraints
• Instance profiles for authentication to S3
• Cron job for hourly snapshots and weekly rotation
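The hourly snapshot and weekly rotation job could be sketched roughly as follows. This is only an illustration, not the actual SignalFx cron job: the `hourly-` naming scheme, the 7-day retention window, and both helper names are assumptions. The sketch only computes which snapshot to create and which to delete; the calls to the snapshot API are left out.

```python
from datetime import datetime, timedelta

SNAPSHOT_PREFIX = "hourly-"            # hypothetical naming scheme
RETENTION = timedelta(days=7)          # assumed meaning of "weekly rotation"
NAME_FORMAT = SNAPSHOT_PREFIX + "%Y%m%d-%H"


def snapshot_name(now):
    """Name of the snapshot taken during the hour that contains `now`."""
    return now.strftime(NAME_FORMAT)


def expired_snapshots(existing_names, now):
    """Return the snapshot names that fall outside the retention window."""
    cutoff = now - RETENTION
    expired = []
    for name in existing_names:
        taken_at = datetime.strptime(name, NAME_FORMAT)
        if taken_at < cutoff:
            expired.append(name)
    return sorted(expired)
```

Because snapshots are incremental, deleting an old snapshot only frees the segment files that no newer snapshot still references.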
ES Monitoring & Alerting
Key Performance Metrics
Key Detectors
• High CPU usage, low free disk space
• Sustained high heap usage
• Master nodes availability
• Cluster state (green/yellow/red)
• Unassigned shards
• Thread pool rejections (search, bulk, index are the most
critical)
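Two of the detectors above can be sketched as simple predicates over sampled metrics. This is a hedged illustration rather than SignalFx detector syntax: the function names, thresholds, and window sizes are assumptions, and the rejection check assumes the sampled value is a cumulative counter.

```python
def sustained_above(samples, threshold, min_consecutive):
    """True if the most recent `min_consecutive` samples all exceed `threshold`.

    Requiring several consecutive samples avoids alerting on a single
    heap spike that the next garbage collection would clear.
    """
    if min_consecutive <= 0:
        raise ValueError("min_consecutive must be positive")
    if len(samples) < min_consecutive:
        return False
    return all(value > threshold for value in samples[-min_consecutive:])


def rejections_increased(counter_samples):
    """True if a cumulative rejection counter grew between the last two samples."""
    return len(counter_samples) >= 2 and counter_samples[-1] > counter_samples[-2]
```

For example, `sustained_above(heap_percent_samples, 90, 3)` fires only after three high readings in a row, while `rejections_increased` fires on any new rejection in the search, bulk, or index thread pools.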
Always Test your ES Detectors/Alerts
Elasticsearch Capacity Planning
Capacity Factors
• Indexing
• CPU/IO utilization can be considerable
• Merges are CPU/IO intensive. Improved in ES 2.0
• Queries
• CPU load
• Memory load
ES Sharding & Scale-up
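[Diagram: a two-shard index (shards 0 and 1; P = primary, R = replica) scaled up in three steps]
• 2 nodes: node-1 (1P, 0R), node-2 (0P, 1R)
• 4 nodes: node-1 (1P), node-2 (0P), node-3 (0R), node-4 (1R)
• 6 nodes: node-1 (1P), node-2 (0P), node-3 (0R), node-4 (1R), node-5 (0R), node-6 (1R)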
Sizing Shards
• Create an index with one shard
• Simulate what you expect your indexing load to be –
measure CPU/IO load, find where it breaks
• Do the same with queries
• Determine disk consumption (average document size)
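Once the single-shard benchmark has produced a breaking point, the numbers can be turned into a primary shard count. The sketch below is one possible way to do that arithmetic, not a formula from the talk: the function name, the `headroom` safety factor, and the `max_shard_bytes` ceiling are all assumptions to be replaced with measured values.

```python
import math


def primaries_needed(index_rate, per_shard_index_rate,
                     total_docs, avg_doc_bytes, max_shard_bytes,
                     headroom=0.5):
    """Estimate the number of primary shards for an index.

    index_rate            expected documents indexed per second
    per_shard_index_rate  rate at which the one-shard test index broke
    headroom              fraction of the breaking rate to run at (assumed)
    max_shard_bytes       largest shard size you are willing to operate (assumed)
    """
    if not 0 < headroom <= 1:
        raise ValueError("headroom must be in (0, 1]")
    usable_rate = per_shard_index_rate * headroom
    for_indexing = math.ceil(index_rate / usable_rate)
    for_disk = math.ceil(total_docs * avg_doc_bytes / max_shard_bytes)
    return max(1, for_indexing, for_disk)
```

Query load is deliberately left out of the formula: a search runs against one copy of every shard, so query capacity is usually raised by adding replicas rather than primaries.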
Zero-downtime Re-sharding
Why Re-shard?
• Required if you can’t scale up indexing by adding more
nodes
• If the index is read-only, you could implement a simpler
approach using aliases
• If the index is being written to, it’s more complicated
SignalFx’s Metadata Storage Architecture
[Diagram: service-A (using metabase-client), mb-server instances, the write-topic and index-topic queues, and C*]
(1) enqueue write
(2) dequeue write
(3) write to C*
(4) enqueue index
(5) dequeue index
(6) read from C*
(7) index document
Index Re-sharding Process
• Pre-requisites
• Phase 1: create target index
• Phase 2: bulk re-indexing
• Phase 3: double writing & change reconciliation
• Phase 4: testing new index
• Phase 5: complete re-sharding process
Pre-requisite 1: readers query from an alias
[Diagram: three readers query the alias myindex, which points to myindex_v1]
Pre-requisite 2: indexing state +
generation number
[Diagram: the indexer writes to myindex_v1]
Indexer state:
• generation: 42
• extra: <null>
• current: myindex_v1
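The indexing state shown on this slide can be modelled as a small record. The sketch below is a guess at the shape of that state, assuming that every indexed document is stamped with the current generation in a `_generation` field and that `extra`, when set, names a second index to write to. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IndexerState:
    generation: int              # stamped on every document written
    current: str                 # index that always receives writes
    extra: Optional[str] = None  # second index to write to, if any

    def write_targets(self):
        """Indices a write must go to under this state."""
        targets = [self.current]
        if self.extra is not None:
            targets.append(self.extra)
        return targets


def stamp(document, state):
    """Return a copy of `document` carrying the state's generation."""
    stamped = dict(document)
    stamped["_generation"] = state.generation
    return stamped
```

With `IndexerState(42, "myindex_v1")` the indexer writes to a single index; setting `extra` later is what turns on double writing.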
Phase 1: create new index with updated
mappings
[Diagram: myindex_v2 is created next to myindex_v1; the indexer still writes to myindex_v1]
Indexer state:
• generation: 42
• extra: <null>
• current: myindex_v1
Phase 2: increment generation, then start
bulk re-indexing of older generations
[Diagram: documents with _generation <= 42 are bulk re-indexed from myindex_v1 into myindex_v2]
Indexer state:
• generation: 43
• extra: <null>
• current: myindex_v1
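Incrementing the generation before the bulk copy starts presumably gives a clean cut: everything at or below the old generation is covered by the bulk copy, and anything written while it runs carries the new generation. A minimal sketch of that step, assuming a `_generation` field on each document and a hypothetical helper name, builds the range query that selects the documents to copy:

```python
def start_bulk_phase(generation):
    """Bump the generation and build the query selecting older documents.

    Returns the new generation for the indexer and a query body matching
    every document stamped with the previous generation or an earlier one.
    """
    new_generation = generation + 1
    query = {"query": {"range": {"_generation": {"lte": generation}}}}
    return new_generation, query
```

Starting from generation 42, this yields generation 43 for the indexer and a query equivalent to the slide's `_generation <= 42`.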
During this step, documents may get
added/updated (or deleted*)
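[Diagram: while documents with _generation <= 42 are copied to myindex_v2, documents created or updated in myindex_v1 are marked with generation 43]
Indexer state:
• generation: 43
• extra: <null>
• current: myindex_v1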
Index state at the end of the bulk indexing
[Diagram: myindex_v1 shown with five documents marked 43, next to myindex_v2]
Indexer state:
• generation: 43
• extra: <null>
• current: myindex_v1
Phase 3 – (a): enable double writing & bump
generation
[Diagram: the indexer is connected to both myindex_v1 and myindex_v2; documents marked 43 remain from the bulk phase]
Indexer state:
• generation: 44
• extra: myindex_v2
• current: myindex_v1
Phase 3 – (b), (c): re-index documents at
generation 43
[Diagram, repeated over several animation frames: documents marked 43, and a growing number of document pairs marked 44 across myindex_v1 and myindex_v2]
Indexer state:
• generation: 44
• extra: myindex_v2
• current: myindex_v1
Phase 3 – (e): perfect sync of both indices
[Diagram: documents marked 43 alongside document pairs marked 44 in myindex_v1 and myindex_v2]
Indexer state:
• generation: 44
• extra: myindex_v2
• current: myindex_v1
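The double-writing and reconciliation steps of Phase 3 can be simulated with plain dictionaries standing in for the two indices. This is a simplified sketch under stated assumptions, not the SignalFx implementation: the function names are hypothetical, documents are re-read from the old index rather than from C*, and deletions are ignored.

```python
def index_document(doc_id, body, generation, targets):
    """Write one document, stamped with `generation`, to every target index."""
    for index in targets:
        index[doc_id] = dict(body, _generation=generation)


def reconcile(source, targets, stale_generation, new_generation):
    """Re-index every document still at `stale_generation` in `source`.

    With double writing enabled, `targets` holds both indices, so each
    re-indexed document lands in both at `new_generation`.
    Returns the number of documents re-indexed.
    """
    stale_ids = [doc_id for doc_id, doc in source.items()
                 if doc["_generation"] == stale_generation]
    for doc_id in stale_ids:
        body = {k: v for k, v in source[doc_id].items() if k != "_generation"}
        index_document(doc_id, body, new_generation, targets)
    return len(stale_ids)
```

After the bulk copy, the new index may hold outdated versions of documents updated at generation 43 and miss documents created at that generation. Calling `reconcile(v1, [v1, v2], 43, 44)` rewrites exactly those documents into both indices, after which the two hold the same content.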
Phase 4: A/B testing of the new index
[Diagram: readers query the alias myindex while both myindex_v1 and myindex_v2 are kept in sync by the indexer]
Indexer state:
• generation: 44
• extra: myindex_v2
• current: myindex_v1
Phase 4: swap read alias (or swap back!)
[Diagram: readers query the alias myindex, which can point to either myindex_v1 or myindex_v2 while both are kept in sync]
Indexer state:
• generation: 44
• extra: myindex_v2
• current: myindex_v1
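Swapping the read alias is a single request to the `_aliases` endpoint, which applies its list of actions atomically. The sketch below only builds the request body; the helper name is hypothetical, and the index and alias names are the ones used on the slides.

```python
import json


def swap_alias_actions(alias, old_index, new_index):
    """Body for POST /_aliases that moves `alias` from one index to another."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


if __name__ == "__main__":
    # Move readers to the new index; swap the arguments to roll back.
    print(json.dumps(swap_alias_actions("myindex", "myindex_v1", "myindex_v2"), indent=2))
```

Because both indices are still being written to at this point, swapping back is the same call with the two index names reversed.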
Phase 5: switch write index, generation,
stop double writing
[Diagram: the indexer writes only to myindex_v2, where new documents are marked 45; myindex_v1 no longer receives writes]
Indexer state:
• generation: 45
• extra: <null>
• current: myindex_v2
Handling Failures
• Bulk re-indexing can fail (and it does); you don’t want to
re-start from scratch
• Use a “partition” field
• Migrate partition ranges
• Deletions could be a problem. We handle that by writing
“deletion markers” instead, then cleaning them up afterwards
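Migrating by partition range makes the bulk copy resumable. The sketch below shows one way to structure it, assuming each document carries a numeric partition field (for example a hash of its ID modulo a fixed count). The function names and the in-memory `completed` set are hypothetical; a real job would persist its progress somewhere durable.

```python
def partition_ranges(num_partitions, batch_size):
    """Split partitions [0, num_partitions) into consecutive half-open ranges."""
    return [(start, min(start + batch_size, num_partitions))
            for start in range(0, num_partitions, batch_size)]


def migrate(ranges, migrate_range, completed):
    """Migrate every range not yet in `completed`, recording progress as it goes.

    `migrate_range(start, end)` copies one range and may raise. Ranges
    finished before a failure stay in `completed`, so a re-run skips them.
    """
    for bounds in ranges:
        if bounds in completed:
            continue
        migrate_range(*bounds)
        completed.add(bounds)
    return completed
```

If the job fails halfway, calling `migrate` again with the same `completed` set restarts at the first unfinished range instead of from scratch.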
Performance Considerations
• Migrate using partition ranges to avoid holding segments
for a long time
• Add temporary nodes to handle the load
• Disable refreshes on the target index (so worth it!)
• Start with no replica (or one just in case)
• Avoid “hot” shards by sorting on a field (a timestamp for
example)
• Use throttling controls to limit the indexing load
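Several of these points map onto index settings and a simple pacing rule. The sketch below builds the bodies one might send to the target index's `_settings` endpoint before and after the bulk copy, plus a minimal throttling calculation. The helper names and default values are assumptions; `refresh_interval` and `number_of_replicas` are standard dynamic index settings.

```python
def bulk_load_settings():
    """Settings for the target index during bulk re-indexing."""
    return {"index": {"refresh_interval": "-1",   # disable periodic refreshes
                      "number_of_replicas": 0}}   # start with no replica


def serving_settings(refresh_interval="1s", replicas=2):
    """Settings to restore before the new index takes read traffic."""
    return {"index": {"refresh_interval": refresh_interval,
                      "number_of_replicas": replicas}}


def min_batch_interval(batch_docs, max_docs_per_sec):
    """Seconds to wait between bulk requests to stay under a target rate."""
    if max_docs_per_sec <= 0:
        raise ValueError("max_docs_per_sec must be positive")
    return batch_docs / max_docs_per_sec
```

A migration loop would apply `bulk_load_settings()` first, sleep for `min_batch_interval(...)` between bulk requests, and apply `serving_settings()` once the copy is complete.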
Thank you!
Sign-up for a free trial at
signalfx.com
Remember to complete
your evaluations!

AWS re:Invent 2016: How to Scale and Operate Elasticsearch on AWS (DEV307)

  • 1. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Mahdi Ben Hamida - SignalFx 11/30/2016 DEV307 How to Scale and Operate Elasticsearch on AWS
  • 2. What to Expect from the Session • Elasticsearch (ES) usage at SignalFx • What do we use ES for? • How ES is deployed on AWS? • Backup/restore of ES on Amazon S3 • Important ES/AWS metrics to monitor; what to alert on • ES capacity planning • Zero-downtime re-sharding • SignalFx metadata storage architecture overview • Scaling up and zero-downtime re-sharding on AWS
  • 4. ES Usage Ad-hoc queries Auto-complete Full-text search
  • 5. Cluster Size • 4 clusters in production on Amazon EC2 • Biggest cluster • 54 data nodes, 3 master nodes, 6 client nodes deployed across 3 AZs • Over 1.3 billion unique documents • 10+ TB of data • 270 shards (primaries + replicas) • Sustained 75 QPS, 1K index/sec
  • 6. ES Deployment on AWS • Dockerized ES 2.3/1.7 clusters. Orchestration done using MaestroNG • Biggest cluster • Data nodes: i2.2xlarge – 16 GB heap (61 GB total) • Master nodes: m3.large – 2 GB heap (7.5 GB total) • Client nodes: m3.xlarge – 10 GB heap (15 GB total) • ES rack awareness to distribute the primary and 2 replicas across 3 Availability Zones
  • 7. Backup/Restore • Made easy using the AWS Cloud plugin: PUT _snapshot/s3-repo { "type": "s3", "settings": { "bucket": "signalfx-es-backups", "region": "us-east" } } • Incremental backups • Un-versioned S3 bucket • VPC S3 endpoint to avoid bandwidth constraints • Instance profiles for authentication to S3 • Cron job for hourly snapshots and weekly rotation
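The repository registration and the hourly/weekly rotation on the slide above can be sketched in Python. This is a minimal sketch, not SignalFx's actual cron job: the snapshot naming scheme (`snap-YYYYMMDD-HH00`), the helper names, and the 7-day retention window are assumptions made for illustration. The functions only build the request and decide what to rotate; sending the requests is left out.

```python
from datetime import datetime, timedelta

REPO = "s3-repo"  # repository name taken from the slide


def repository_request(bucket: str, region: str):
    """Return (method, path, body) for registering an S3 snapshot repository."""
    body = {"type": "s3", "settings": {"bucket": bucket, "region": region}}
    return "PUT", f"_snapshot/{REPO}", body


def snapshot_name(now: datetime) -> str:
    # Hypothetical naming scheme: one snapshot per hour, sortable by name.
    return now.strftime("snap-%Y%m%d-%H00")


def expired(names, now: datetime, keep_days: int = 7):
    """Return snapshot names older than the retention window (assumed 7 days)."""
    cutoff = snapshot_name(now - timedelta(days=keep_days))
    # Names sort chronologically, so a plain string comparison is enough.
    return sorted(n for n in names if n < cutoff)
```

A cron job could call `snapshot_name(datetime.utcnow())` every hour to name the new snapshot, then delete whatever `expired(...)` returns.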
  • 8. ES Monitoring & Alerting
  • 10. Key Detectors • High CPU usage, low disk size • Sustained high heap usage • Master nodes availability • Cluster state (green/yellow/red) • Unassigned shards • Thread pool rejections (search, bulk, index are the most critical)
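A few of the detectors listed above can be sketched as plain Python predicates. The slides give no thresholds, so the 75% heap threshold in the usage note, the sample window, and the function names are hypothetical; a real deployment would express these as detectors in the monitoring system rather than in application code.

```python
def sustained_above(samples, threshold, min_samples):
    """True if the most recent `min_samples` readings are all above `threshold`."""
    if len(samples) < min_samples:
        return False
    return all(s > threshold for s in samples[-min_samples:])


def cluster_alerts(status, unassigned_shards, rejections):
    """Build alert messages from cluster state, unassigned shards, and thread pool rejections."""
    alerts = []
    if status != "green":
        alerts.append(f"cluster state is {status}")
    if unassigned_shards > 0:
        alerts.append(f"{unassigned_shards} unassigned shards")
    # The slide singles out the search, bulk, and index pools as the most critical.
    for pool in ("search", "bulk", "index"):
        count = rejections.get(pool, 0)
        if count > 0:
            alerts.append(f"{pool} thread pool rejections: {count}")
    return alerts
```

For example, `sustained_above(heap_percent_samples, 75, 3)` fires only when the last three heap readings all exceed 75%, which ignores a single spike after a garbage collection cycle.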
  • 11. Always Test your ES Detectors/Alerts
  • 13. Capacity Factors • Indexing • CPU/IO utilization can be considerable • Merges are CPU/IO intensive. Improved in ES 2.0 • Queries • CPU load • Memory load
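  • 14. ES Sharding & Scale-up • Diagram: an index with 2 primary shards (0P, 1P) and their replicas (0R, 1R) • 2 nodes: node-1 holds 1P and 0R, node-2 holds 0P and 1R • 4 nodes: node-1 holds 1P, node-2 holds 0P, node-3 holds 0R, node-4 holds 1R • 6 nodes (second replica added): node-5 holds 0R, node-6 holds 1R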
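The scale-up limit illustrated on this slide is simple arithmetic: once every shard copy sits on its own node, extra data nodes hold nothing for that index. A minimal sketch, with function names chosen here for illustration:

```python
def max_useful_nodes(primaries: int, replicas: int) -> int:
    """Number of data nodes at which each shard copy has a node to itself."""
    return primaries * (1 + replicas)


def copies_per_node(total_shard_copies: int, data_nodes: int) -> float:
    """Average number of shard copies hosted by each data node."""
    return total_shard_copies / data_nodes
```

With 2 primaries, one replica allows 4 nodes and two replicas allow 6, matching the diagram. For the biggest cluster described earlier (270 shard copies on 54 data nodes), `copies_per_node(270, 54)` gives 5 copies per node.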
  • 15. Sizing Shards • Create an index with one shard • Simulate what you expect your indexing load to be – measure CPU/IO load, find where it breaks • Do the same with queries • Determine disk consumption (average document size)
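Once the single-shard experiment above has produced measurements, the number of primaries can be estimated from them. The slides do not give a formula; the sketch below assumes the shard count is driven by whichever of indexing throughput or disk size is the tighter limit, and every input is a measured or chosen value, not something from the talk.

```python
import math


def primaries_needed(target_index_rate, shard_index_rate,
                     doc_count, avg_doc_bytes, max_shard_bytes):
    """Estimate primary shard count from single-shard measurements.

    target_index_rate: expected documents indexed per second
    shard_index_rate:  rate one shard sustained before it broke
    max_shard_bytes:   largest shard size you are willing to operate
    """
    by_indexing = math.ceil(target_index_rate / shard_index_rate)
    by_disk = math.ceil(doc_count * avg_doc_bytes / max_shard_bytes)
    return max(by_indexing, by_disk, 1)
```

The same function can be run a second time with query throughput in place of indexing throughput, since the slide recommends repeating the measurement for queries.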
  • 17. Why Re-shard? • Required if you can’t scale up indexing by adding more nodes • If the index is read-only, you could implement a simpler approach using aliases • If the index is being written to, it’s more complicated
  • 18. SignalFx’s Metadata Storage Architecture • Diagram components: service-A, metabase-client, mb-server-1, metabase-1, write-topic, index-topic • Flow: (1) enqueue write (2) dequeue write (3) write to C* (4) enqueue index (5) dequeue index (6) read from C* (7) index document
  • 19. Index Re-sharding Process • Pre-requisites • Phase 1: create target index • Phase 2: bulk re-indexing • Phase 3: double writing & change re-conciliation • Phase 4: testing new index • Phase 5: complete re-sharding process
  • 20. Pre-requisite 1: readers query from an alias • Diagram: three readers query the alias myindex, which points to myindex_v1
  • 21. Pre-requisite 2: indexing state + generation number • Index: myindex_v1 • Indexer state: generation: 42, extra: <null>, current: myindex_v1
  • 22. Phase 1: create new index with updated mappings • Indices: myindex_v1, myindex_v2 (new) • Indexer state: generation: 42, extra: <null>, current: myindex_v1
  • 23. Phase 2: increment generation, then start bulk re-indexing of older generations • Bulk re-index from myindex_v1 to myindex_v2: documents with _generation <= 42 • Indexer state: generation: 43, extra: <null>, current: myindex_v1
  • 24. During this step, documents may get added/updated (or deleted*) • Diagram: while documents with _generation <= 42 are copied, documents created or updated in myindex_v1 are tagged with generation 43 • Indexer state: generation: 43, extra: <null>, current: myindex_v1
  • 25. Index state at the end of the bulk indexing • Diagram: myindex_v1 contains documents at generation 43 • Indexer state: generation: 43, extra: <null>, current: myindex_v1
  • 26. Phase 3 – (a): enable double writing & bump generation • Diagram: indexer writes to both myindex_v1 and myindex_v2 • Indexer state: generation: 44, extra: myindex_v2, current: myindex_v1
  • 27. Phase 3 – (b): re-index documents at generation 43 • Diagram: documents at generations 43 and 44 in myindex_v1 and myindex_v2 • Indexer state: generation: 44, extra: myindex_v2, current: myindex_v1
  • 28. Phase 3 – (c): re-index documents at generation 43 • Diagram (continued): documents at generations 43 and 44 in both indices • Indexer state: generation: 44, extra: myindex_v2, current: myindex_v1
  • 29. Phase 3 – (c): re-index documents at generation 43 • Diagram (continued): documents at generations 43 and 44 in both indices • Indexer state: generation: 44, extra: myindex_v2, current: myindex_v1
  • 30. Phase 3 – (c): re-index documents at generation 43 • Diagram (continued): documents at generations 43 and 44 in both indices • Indexer state: generation: 44, extra: myindex_v2, current: myindex_v1
  • 31. Phase 3 – (e): perfect sync of both indices • Diagram: myindex_v1 and myindex_v2 hold the same documents at generations 43 and 44 • Indexer state: generation: 44, extra: myindex_v2, current: myindex_v1
  • 32. Phase 4: A/B testing of the new index • Diagram: readers query through the alias myindex; both indices are in sync • Indexer state: generation: 44, extra: myindex_v2, current: myindex_v1
  • 33. Phase 4: swap read alias (or swap back !) • Diagram: readers query through the alias myindex; both indices are in sync • Indexer state: generation: 44, extra: myindex_v2, current: myindex_v1
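Swapping the read alias can be done in one atomic call to the Elasticsearch `_aliases` endpoint, so readers never see a moment without an index behind `myindex`. The sketch below only builds the request body; the helper name is made up for illustration, and the index and alias names are the ones used on the slides.

```python
def swap_alias_actions(alias: str, old_index: str, new_index: str) -> dict:
    """Body for POST /_aliases that moves `alias` from old_index to new_index."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Swap readers to the new index.
forward = swap_alias_actions("myindex", "myindex_v1", "myindex_v2")
# Swapping back is the same call with the indices reversed.
rollback = swap_alias_actions("myindex", "myindex_v2", "myindex_v1")
```

Because both indices stay in sync while double writing is enabled, the rollback body can be applied at any point during Phase 4.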
  • 34. Phase 5: switch write index, generation, stop double writing • Diagram: new documents at generation 45 are written to myindex_v2 only • Indexer state: generation: 45, extra: <null>, current: myindex_v2
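The five phases can be replayed end to end with an in-memory model. This is a toy simulation, not the SignalFx implementation: indices are plain dictionaries, the `Cluster` and `IndexerState` classes are invented here, and the rule that a re-indexed copy never overwrites a newer generation in the target is an assumption added to keep the example safe against stale copies.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IndexerState:
    generation: int
    current: str
    extra: Optional[str] = None  # second index to write to while double writing


class Cluster:
    def __init__(self, state: IndexerState):
        self.state = state
        self.indices = {state.current: {}}

    def write(self, doc_id, body):
        """Tag the document with the current generation; write to current (+ extra)."""
        doc = {"_generation": self.state.generation, "body": body}
        for name in (self.state.current, self.state.extra):
            if name is not None:
                self.indices[name][doc_id] = dict(doc)

    def reindex(self, source, target, lo, hi):
        """Copy documents with lo <= _generation <= hi, never replacing newer ones."""
        for doc_id, doc in list(self.indices[source].items()):
            if not lo <= doc["_generation"] <= hi:
                continue
            existing = self.indices[target].get(doc_id)
            if existing is None or existing["_generation"] < doc["_generation"]:
                self.indices[target][doc_id] = dict(doc)


V1, V2 = "myindex_v1", "myindex_v2"
c = Cluster(IndexerState(generation=42, current=V1))
c.write("a", "a0")
c.write("b", "b0")

c.indices[V2] = {}                 # Phase 1: create the target index
c.state.generation = 43            # Phase 2: bump generation, then bulk re-index
c.reindex(V1, V2, 0, 42)
c.write("b", "b1")                 # writes during Phase 2 land in v1 only
c.write("c", "c0")

c.state.extra = V2                 # Phase 3 (a): double writing, bump generation
c.state.generation = 44
c.write("a", "a1")                 # goes to both indices
c.reindex(V1, V2, 43, 43)          # Phase 3 (b): reconcile generation 43
in_sync = c.indices[V1] == c.indices[V2]

c.state = IndexerState(generation=45, current=V2)  # Phase 5: switch write index
c.write("d", "d0")                 # only the new index receives this write
```

After the generation-43 pass, `in_sync` is true: everything written before double writing started has been copied, and everything written after it went to both indices.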
  • 35. Handling Failures • Bulk re-indexing can fail (and it does); you don’t want to re-start from scratch • Use a “partition” field • Migrate partition ranges • Deletions could be a problem. We handle that by using “deletion markers” instead, then cleaning up
  • 36. Performance Considerations • Migrate using partition ranges to avoid holding segments for a long time • Add temporary nodes to handle the load • Disable refreshes on the target index (so worth it!) • Start with no replica (or one just in case) • Avoid “hot” shards by sorting on a field (a timestamp for example) • Have throttling controls to control indexing load
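Migrating by partition ranges makes the bulk re-indexing resumable: each range is a small unit of work that is recorded once it finishes. The sketch below assumes an integer field named `partition` on every document and a caller-supplied `copy_range` function; both are hypothetical, as is the checkpoint being a plain set. The query uses the `bool`/`filter` form of the ES 2.x query DSL.

```python
def partition_ranges(num_partitions: int, step: int):
    """Split [0, num_partitions) into inclusive (lo, hi) ranges of at most `step`."""
    return [(lo, min(lo + step, num_partitions) - 1)
            for lo in range(0, num_partitions, step)]


def range_query(lo: int, hi: int, max_generation: int) -> dict:
    """Select documents in one partition range at or below a generation."""
    return {"query": {"bool": {"filter": [
        {"range": {"partition": {"gte": lo, "lte": hi}}},
        {"range": {"_generation": {"lte": max_generation}}},
    ]}}}


def migrate(ranges, done: set, copy_range):
    """Copy every range not yet in `done`; a failure leaves `done` as a checkpoint."""
    for r in ranges:
        if r in done:
            continue
        copy_range(r)  # may raise; the range is then retried on the next run
        done.add(r)
```

If `copy_range` raises halfway through, calling `migrate` again with the same `done` set skips the completed ranges and resumes at the one that failed.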
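Two of the considerations above map directly onto index settings, and throttling can be as simple as pacing bulk requests. In this sketch, `refresh_interval` and `number_of_replicas` are standard Elasticsearch index settings applied with `PUT /<index>/_settings`; the helper names, the `1s` refresh value restored afterwards, and the pacing rule are assumptions for illustration.

```python
def bulk_load_settings() -> dict:
    """Settings for the target index during migration: no refreshes, no replicas."""
    return {"index": {"refresh_interval": "-1", "number_of_replicas": 0}}


def serving_settings(replicas: int, refresh_interval: str = "1s") -> dict:
    """Settings to restore before the index starts serving reads."""
    return {"index": {"refresh_interval": refresh_interval,
                      "number_of_replicas": replicas}}


def pause_for(batch_size: int, max_docs_per_sec: float) -> float:
    """Seconds to wait after a bulk request to stay under the target indexing rate."""
    return batch_size / max_docs_per_sec
```

For a cluster that keeps a primary and 2 replicas, `serving_settings(2)` would be applied once the bulk re-indexing has finished, and a bulk loop sending 500 documents at a 1,000 documents/sec cap would sleep `pause_for(500, 1000)` seconds between requests.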
  • 37. Thank you! Sign-up for a free trial at signalfx.com