Log Analytics with ELK Stack
(Architecture for aggressive cost optimization and infinite data scale)
Denis D’Souza | 27th July 2019
About me...
● Currently a DevOps engineer at Moonfrog Labs
● 6+ years working as a DevOps Engineer, SRE and Linux administrator
Worked on a variety of technologies in both service-based and
product-based organisations
● How do I spend my free time? Learning new technologies and playing PC games
www.linkedin.com/in/denis-dsouza
• A Mobile Gaming Company making mass market social games
• 5M+ Daily Active Users, 15M+ Weekly Active Users
• Real time, Cross platform games optimized for Primary
Market(s) - India and subcontinent
• Profitable!
Current Scale
Who we are?
1. Our business requirements
2. Choosing the right option
3. ELK Stack overview
4. Our ELK architecture
5. Optimizations we did
6. Cost savings
7. Key takeaways
Our problem statement
● Log analytics platform (Web-Server, Application, Database logs)
● Data Ingestion rate: ~300GB/day
● Frequently accessed data: last 8 days
● Infrequently accessed data: older than 8 days
● Uptime: 99.90%
● Hot Retention period: 90 days
● Cold Retention period: 90 days (with potential to increase)
● Simple and Cost effective solution
● Fairly predictable concurrent user-base
● Not to be used for storing user/business data
Our business requirements
|                | ELK stack           | Splunk                                          | Sumo logic                                     |
|----------------|---------------------|-------------------------------------------------|------------------------------------------------|
| Product        | Self managed        | Cloud                                           | Professional                                   |
| Pricing        | ~$30 per GB / month | ~$100 per GB / month *                          | ~$108 per GB / month *                         |
| Data Ingestion | ~300 GB / day       | ~100 GB / day * (post-ingestion custom pricing) | ~20 GB / day * (post-ingestion custom pricing) |
| Retention      | ~90 days            | ~90 days *                                      | ~30 days *                                     |
| Cost/GB/day    | ~$0.98 per GB / day | ~$3.33 per GB / day *                           | ~$3.60 per GB / day *                          |
* Values are estimates taken from the product pricing web-pages of the respective products; they may not represent actual values and are meant for comparison only.
References:
https://www.splunk.com/en_us/products/pricing/calculator.html#tabs/tab2
https://www.sumologic.com/pricing/apac/
Choosing the right option
ELK Stack overview
● Index
● Shard
○ Primary
○ Replica
● Segment
● Node
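As a minimal sketch of how these pieces relate (hypothetical index name; <domain> placeholder as in the deck's other examples), an index is created with a fixed number of primary shards plus replicas, and each shard lives on a node as a set of segments:

curl -XPUT "http://<domain>:9200/app-logs-2019.07.27" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  }
}'

Each primary shard is stored as immutable segments on whichever node it is allocated to; replicas are full copies that serve searches and provide redundancy.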
References:
https://www.elastic.co/guide/en/elasticsearch/reference/5.6/_basic_concepts.html
ELK Stack overview: Terminologies
Our ELK architecture
Our ELK architecture: Hot-Warm-Cold data storage (infinite scale)
| # | Service       | Nodes | Total CPU cores | Total RAM | EBS storage |
|---|---------------|-------|-----------------|-----------|-------------|
| 1 | Elasticsearch | 7     | 28              | 141 GB    |             |
| 2 | Logstash      | 3     | 6               | 12 GB     |             |
| 3 | Kibana        | 1     | 1               | 4 GB      |             |
|   | Total         | 11    | 35              | 157 GB    | ~20 TB      |

Data-ingestion per day: ~300 GB
Hot Retention period: 90 days
Docs/sec (at peak load): ~7K
Our ELK architecture: Size and scale
Application Side
● Logstash
● Elasticsearch
Infrastructure Side
● EC2
● EBS
● Data transfer
Optimizations we did
Optimizations we did: Application side
Logstash
Pipeline Workers:
● Adjusted "pipeline.workers" to 4x the number of cores to improve CPU utilisation on the Logstash server (threads may spend significant time in an I/O wait state)
### Core-count: 2 ###
...
pipeline.workers: 8
...
logstash.yml
References:
https://www.elastic.co/guide/en/logstash/current/tuning-logstash.html
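To check whether the extra workers are actually being used rather than stuck in I/O wait, Logstash's monitoring API can be polled; a sketch, assuming the default API port 9600:

curl -s "http://localhost:9600/_node/stats?pretty"

The response includes pipeline event counts plus JVM and process stats, which can be compared against CPU utilisation while tuning the worker count.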
Optimizations we did: Logstash
'info' logs:
● Routed application 'info' logs to a separate index with a shorter retention policy
if [sourcetype] == "app_logs" and [level] == "info"
{
elasticsearch {
index => "%{sourcetype}-%{level}-%{+YYYY.MM.dd}"
...
Filter config
if [sourcetype] == "nginx" and [status] == "200"
{
elasticsearch {
index => "%{sourcetype}-%{status}-%{+YYYY.MM.dd}"
...
References:
https://www.elastic.co/guide/en/logstash/current/event-dependent-configuration.html
'200' response-code logs:
● Routed access logs with a '200' response code to a separate index with a shorter retention policy
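Putting both conditionals together, a minimal sketch of a full output section (the hosts value is a placeholder, and the final else branch is an assumed default index):

output {
  if [sourcetype] == "app_logs" and [level] == "info" {
    elasticsearch {
      hosts => ["<domain>:9200"]   # placeholder host
      index => "%{sourcetype}-%{level}-%{+YYYY.MM.dd}"
    }
  } else if [sourcetype] == "nginx" and [status] == "200" {
    elasticsearch {
      hosts => ["<domain>:9200"]
      index => "%{sourcetype}-%{status}-%{+YYYY.MM.dd}"
    }
  } else {
    elasticsearch {
      hosts => ["<domain>:9200"]
      index => "%{sourcetype}-%{+YYYY.MM.dd}"   # assumed default index
    }
  }
}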
Optimizations we did: Logstash
Log ‘message’ field:
● Removed the "message" field whenever grok parsing succeeded (no grok failure tagged), reducing the storage footprint by ~30% per doc
if "_grokparsefailure" not in [tags] {
mutate {
remove_field => ["message"]
}
}
Filter config
Eg:
Nginx Log-message: 127.0.0.1 - - [26/Mar/2016:19:09:19 -0400] "GET / HTTP/1.1" 401 194 "" "Mozilla/5.0
Gecko" "-"
Grok Pattern: %{IPORHOST:clientip} (?:-|(%{WORD}.%{WORD})) %{USER:ident}
[%{HTTPDATE:timestamp}] "(?:%{WORD:verb} %{NOTSPACE:request}(?:
HTTP/%{NUMBER:httpversion})?|%{DATA:rawrequest})" %{NUMBER:response} (?:%{NUMBER:bytes}|-)
%{QS:referrer} %{QS:agent} %{QS:forwarder}
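For completeness, a sketch of how the grok parse and the conditional removal fit together in one filter block (the pattern is abbreviated to the one shown above):

filter {
  grok {
    # full nginx pattern from above, abbreviated here
    match => { "message" => "%{IPORHOST:clientip} ... %{QS:forwarder}" }
  }
  if "_grokparsefailure" not in [tags] {
    # the raw line is redundant once all fields are extracted
    mutate { remove_field => ["message"] }
  }
}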
Optimizations we did: Logstash
Elasticsearch
Optimizations we did: Application side
JVM heap vs non-heap memory:
● Optimised the JVM heap-size by monitoring the GC interval; this enabled efficient utilization of system memory (~33% for the JVM heap, ~66% left for non-heap use such as the OS file-system cache) *
jvm.options
### Total system Memory 15GB ###
-Xms5g
-Xmx5g
(GC monitoring graphs: heap too small, heap too large, optimised heap)
* Recommended heap-size settings by Elastic:
https://www.elastic.co/guide/en/elasticsearch/reference/current/heap-size.html
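Heap pressure and GC behaviour can be watched while tuning via the nodes stats API; a sketch (look at jvm.mem.heap_used_percent and the gc collector timings in the response):

curl -s "http://<domain>:9200/_nodes/stats/jvm?pretty"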
Optimizations we did: Elasticsearch
Shards:
● Created templates with a number of shards that is a multiple of the number of Elasticsearch nodes
(helps fix shard-distribution imbalance, which caused uneven disk and compute resource usage)
### Number of ES nodes: 5 ###
{
"template": "appserver-*",
"settings": {
"number_of_shards": "5",
"number_of_replicas": "0",
...
}
}'
Trade-offs:
● Removing replicas makes search queries slower, as replicas also serve search operations
● It is not recommended to run production clusters without replicas
Replicas:
● Removed replicas for the required indexes
(50% savings on storage cost, ~30% reduction in
compute resource utilization)
Optimizations we did: Elasticsearch
Template config
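As a sketch, the template above would be applied with a PUT to the _template endpoint (hypothetical template name):

curl -XPUT "http://<domain>:9200/_template/appserver" -H 'Content-Type: application/json' -d'
{
  "template": "appserver-*",
  "settings": {
    "number_of_shards": "5",
    "number_of_replicas": "0"
  }
}'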
AWS
● EC2
● EBS
● Data transfer (Inter AZ)
The Spotinst platform lets users reliably leverage excess capacity, simplify cloud operations and save up to 80% on compute costs.
Optimizations we did: Infrastructure side
Optimizations we did: Infrastructure side
EC2
Stateful EC2 Spot instances:
● Moved all ELK nodes to run on spot instances
(instances retain their IP address and EBS volumes)
Recovery time: < 10 mins
Trade-offs:
● Prefer using previous generation instance
types to reduce frequent spot take-backs
Optimizations we did: EC2 and spot
Auto-Scaling:
● Performance/time based auto-scaling for
Logstash Instances
Optimizations we did: EC2 and spot
Optimizations we did: Infrastructure side
EBS
"Hot-Warm" Architecture:
● "Hot" nodes: store active indexes, use GP2
EBS-disks (General purpose SSD)
● "Warm" nodes: store passive indexes, use SC1
EBS-disks (Cold storage)
(~69% savings on storage cost)
node.attr.box_type: hot
...
elasticsearch.yml
"template": "appserver-*",
"settings": {
"index": {
"routing": {
"allocation": {
"require": {
"box_type": "hot"}
}
}
},
...
Template config
Trade-offs:
● Since "Warm" nodes are using SC1 EBS-disks,
they have lower IOPS, throughput this will result
in search operations being comparatively slower
References:
https://cinhtau.net/2017/06/14/hot-warm-architecture/
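Moving an index between tiers is a single settings update; Curator (next slide) automates the equivalent of this sketch (hypothetical index name):

curl -XPUT "http://<domain>:9200/appserver-2019.07.19/_settings" -H 'Content-Type: application/json' -d'
{ "index.routing.allocation.require.box_type": "warm" }'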
Optimizations we did: EBS
Moving indexes to "Warm" nodes:
● Reallocated indexes older than 8 days to "Warm"
nodes
● Recommended to perform this operation during
off-peak hours as it is I/O intensive
actions:
  1:
    action: allocation
    description: "Move index to Warm-nodes after 8 days"
    options:
      key: box_type
      value: warm
      allocation_type: require
      timeout_override:
      continue_if_exception: false
    filters:
    - filtertype: age
      source: name
      direction: older
      timestring: '%Y.%m.%d'
      unit: days
      unit_count: 8
...
Curator config
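The action file is then executed with the curator CLI, typically from a nightly cron job during off-peak hours (file paths here are hypothetical):

curator --config /etc/curator/config.yml /etc/curator/action_allocation.yml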
References:
https://www.elastic.co/blog/hot-warm-architecture-in-elasticsearch-5-x
Optimizations we did: EBS
Single Availability Zone:
● Migrated all ELK nodes to a single availability zone (reducing inter-AZ data transfer cost for ELK nodes by 100%)
● Data transfer/day: ~700GB
(Logstash to Elasticsearch: ~300GB,
Elasticsearch inter-communication: ~400GB)
Trade-offs:
● It is not recommended to run production clusters in a single AZ, as it will result in downtime and potential data loss in case of an AZ failure
Optimizations we did: Inter-AZ data transfer
Using S3 for index Snapshots:
● Take snapshots of indexes and store them in S3
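Before the first snapshot, the S3 repository itself has to be registered; a minimal sketch, assuming the repository-s3 plugin is installed (bucket name is hypothetical):

curl -XPUT "http://<domain>:9200/_snapshot/s3_repository" -H 'Content-Type: application/json' -d'
{
  "type": "s3",
  "settings": {
    "bucket": "elk-index-snapshots"
  }
}'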
curl -XPUT "http://<domain>:9200/_snapshot/s3_repository/snap1?wait_for_completion=true&pretty" -d'
{
  "indices": "index_1,index_2",
  "ignore_unavailable": true,
  "include_global_state": false
}'
Backup:
References:
https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-snapshots.html
https://medium.com/@federicopanini/elasticsearch-backup-snapshot-and-restore-on-aws-s3-f1fc32fbca7f
Data backup and restore
curl -s -XPOST --url "http://<domain>:9200/_snapshot/s3_repository/snap1/_restore" -d'
{
  "indices": "index_1,index_2",
  "ignore_unavailable": true,
  "include_global_state": false
}'
On-demand Elasticsearch cluster:
● Launch an on-demand ES cluster and import the snapshots from S3
Existing Cluster:
● Restore the required snapshots to the existing cluster
Restore:
Data backup and restore
Data corruption:
● List indexes with status 'Red' (see the _cat sketch below)
● Delete the corrupted indexes
● Restore the indexes from S3 snapshots
● Recovery time: depends on the size of the data
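Listing red indexes is a single _cat call; a sketch:

curl -s "http://<domain>:9200/_cat/indices?health=red&v"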
Node failure due to AZ going down:
● Launch a new ELK cluster using AWS CloudFormation templates
● Do the necessary config changes in Filebeat,
Logstash etc.
● Restore the required indexes from S3 snapshots
● Recovery time: depends on provisioning time and
size of data
Node failures due to underlying hardware issue:
● Recycle node in Spotinst console
(takes an AMI of the root volume, launches a new instance, attaches the EBS volumes, maintains the private IP)
● Recovery time: < 10 mins/node
Snapshot restore time (estimates):
● < 4 mins for a 20GB snapshot (test cluster: 3 nodes, multiple indexes with 3 primary shards each, no replicas)
Disaster recovery
EC2

| Instance type              | Service       | Daily cost |
|----------------------------|---------------|------------|
| 5 x r5.xlarge (20C, 160GB) | Elasticsearch | $40.80     |
| 3 x c5.large (6C, 12GB)    | Logstash      | $7.17      |
| 1 x t3.medium (2C, 4GB)    | Kibana        | $1.29      |
| Total                      |               | ~$49.26    |

EC2 (optimized) — daily cost reflects ~65% spot savings plus Spotinst charges (20% of the savings)

| Instance type              | Service            | Daily cost | Total savings |
|----------------------------|--------------------|------------|---------------|
| 5 x m4.xlarge (20C, 80GB)  | Elasticsearch Hot  | $14.64     |               |
| 2 x r4.xlarge (8C, 61GB)   | Elasticsearch Warm | $7.50      |               |
| 3 x c4.large (6C, 12GB)    | Logstash           | $3.50      |               |
| 1 x t2.medium (2C, 4GB)    | Kibana             | $0.69      |               |
| Total                      |                    | ~$26.33    | ~47%          |
Cost savings: EC2
Storage (ingesting 300GB/day, 90-day retention, replica count: 1)

| Storage type | Retention | Daily cost |
|--------------|-----------|------------|
| ~54TB (GP2)  | 90 days   | ~$237.60   |

Storage (optimized; ingesting 300GB/day, 90-day retention, replica count: 0, daily S3 snapshots)

| Storage type      | Retention | Daily cost | Total savings |
|-------------------|-----------|------------|---------------|
| ~3TB (GP2) Hot    | 8 days    | $12.00     |               |
| ~24TB (SC1) Warm  | 82 days   | $24.00     |               |
| ~27TB (S3) Backup | 90 days   | $22.50     |               |
| Total             |           | ~$58.50    | ~75%          |
Cost savings: Storage
|                    | ELK stack | ELK stack (optimized) | Savings |
|--------------------|-----------|-----------------------|---------|
| EC2                | $49.40    | $26.33                | 47%     |
| Storage            | $237.60   | $58.50                | 75%     |
| Data-transfer      | $7.00     | $0                    | 100%    |
| Total (daily cost) | ~$294.00  | ~$84.83               | ~71% *  |
| Cost/GB (daily)    | ~$0.98    | ~$0.28                |         |
* Total savings are exclusive of some of the application-level optimizations done
Total savings
|                | ELK Stack (optimized) | ELK Stack           | Splunk                                          | Sumo logic                                     |
|----------------|-----------------------|---------------------|-------------------------------------------------|------------------------------------------------|
| Product        | Self managed          | Self managed        | Cloud                                           | Professional                                   |
| Data Ingestion | ~300GB/day            | ~300GB/day          | ~100 GB / day * (post-ingestion custom pricing) | ~20 GB / day * (post-ingestion custom pricing) |
| Retention      | ~90 days              | ~90 days            | ~90 days *                                      | ~30 days *                                     |
| Cost/GB/day    | ~$0.28 per GB / day   | ~$0.98 per GB / day | ~$3.33 per GB / day *                           | ~$3.60 per GB / day *                          |
Savings over traditional ELK stack: 71% *
* Total savings are exclusive of some of the application-level optimizations done
Our Costs vs other Platforms
ELK Stack Scalability:
● Logstash: auto-scaling
● Elasticsearch: overprovisioning (nodes run at 60% capacity during peak load), predictive vertical/horizontal scaling
Handling potential data-loss while AZ is down:
● DR mechanisms in place; daily/hourly backups stored in S3; potential data loss of about 1 hour
● We do not store user data or business metrics in ELK, so users/business will not be impacted
Handling potential data-corruptions in Elasticsearch:
● DR mechanisms in place, recover index from S3 index-snapshots
Managing downtime during spot take-backs:
● Logstash: multiple nodes, minimal impact
● Elasticsearch/Kibana: < 10min downtime per node
● Use previous generation instance types, as their spot take-back chances are comparatively lower
Key Takeaways
Handling back-pressure when a node is down:
● Filebeat: will auto-retry sending old logs
● Logstash: use the 'date' filter for the document timestamp (see the sketch after this list), auto-scaling
● Elasticsearch: overprovisioning
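A sketch of such a date filter, matching the HTTPDATE format captured by the grok pattern earlier; it keeps @timestamp at the event's original log time even when delivery is delayed:

filter {
  date {
    # "timestamp" is the field captured by %{HTTPDATE:timestamp}
    match => ["timestamp", "dd/MMM/yyyy:HH:mm:ss Z"]
  }
}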
Other log analytics alternatives:
● We have only evaluated ELK, Splunk and Sumo Logic
ELK stack upgrade path:
● Blue Green deployment for major version upgrade
Key Takeaways
● We built a platform tailored to our requirements, yours might be different...
● Building a log analytics platform is not rocket science, but it can be painfully iterative if you
are not aware of the options
● Be aware of the trade-offs you are ‘OK with’ and you can roll out a solution optimised for
your specific requirements
Reflection
Thank you!
Happy to take your questions!
Copyright Disclaimer: All rights to the materials used in this presentation belong to their respective owners.