Real time analytics using 
Hadoop 
and 
Elasticsearch 
by 
ABHISHEK ANDHAVARAPU
Thank you Sponsors!
About Me 
• Currently working as Software 
Engineer (Data Platform) at 
Allegiance Software Inc. 
• Passion for distributed 
systems and data visualization. 
• Masters in Distributed 
Systems. 
• abhishek376.wordpress.com
Agenda 
Use Case. 
Architecture. 
Elasticsearch 101. 
Demo. 
Lessons learnt.
Legacy Architecture 
Current Architecture
Why Hadoop ?
Elasticsearch 101 
• Document-oriented, JSON-based search engine with Apache 
Lucene under the covers. 
• Schema free. 
• It's distributed and supports aggregations, similar to GROUP BY. 
• Uses bit sets to cache efficiently. 
• It's fast. Super fast. 
• It has REST and Java-based APIs.
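The GROUP BY comparison can be made concrete with a small sketch. The request body below uses a terms aggregation; the field `last_name` is borrowed from the CRUD slide, while the aggregation name `by_last_name`, the sample documents, and the local helper `group_by_count` are assumptions made for illustration. The helper only mimics, on a Python list, the bucket counts a cluster would return.

```python
import json
from collections import Counter

# Body of a search request with a terms aggregation, roughly
# "SELECT last_name, COUNT(*) ... GROUP BY last_name".
# "by_last_name" is an illustrative name; "size": 0 asks for buckets only.
agg_request = {
    "size": 0,
    "aggs": {"by_last_name": {"terms": {"field": "last_name"}}},
}


def group_by_count(docs, field):
    """Local stand-in for the per-value counts a terms aggregation produces."""
    return dict(Counter(doc[field] for doc in docs if field in doc))


# Hypothetical sample documents, not data from the talk.
sample_docs = [
    {"first_name": "Abhishek", "last_name": "Andhavarapu"},
    {"first_name": "Abhi", "last_name": "Andhavarapu"},
    {"first_name": "Jane", "last_name": "Doe"},
]

if __name__ == "__main__":
    print(json.dumps(agg_request, indent=2))
    print(group_by_count(sample_docs, "last_name"))
```

In a real cluster the counting happens per shard and the partial buckets are merged, which is why the result comes back quickly even on large indexes.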
Elasticsearch CRUD 
Index a person: 
curl -XPUT 'localhost:9200/person/1' -d '{ 
"first_name" : "Abhishek", 
"last_name" : "Andhavarapu" 
}' 
Get a person: 
curl -XGET 'localhost:9200/person/1' 
Delete a person: 
curl -XDELETE 'localhost:9200/person/1' 
Update a person: 
curl -XPOST 'localhost:9200/person/1/_update' -d '{ 
"doc" : { 
"first_name" : "Abhi" 
} 
}'
Elasticsearch data 
Node1 Node2 
S0 S1 
Shard
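How a document ends up on S0 or S1 can be sketched as a hash of its id taken modulo the number of primary shards. This is a minimal illustration, assuming `zlib.crc32` as a stand-in hash; the function name `shard_for` is hypothetical and the sketch does not claim to reproduce the hash Elasticsearch uses internally.

```python
import zlib


def shard_for(doc_id: str, num_primary_shards: int) -> int:
    """Pick a shard by hashing the document id modulo the shard count.

    crc32 is a stand-in hash chosen for this sketch only.
    """
    if num_primary_shards <= 0:
        raise ValueError("num_primary_shards must be positive")
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


if __name__ == "__main__":
    # Two primary shards, as in the S0/S1 diagram.
    for doc_id in ["1", "2", "3", "4"]:
        print(doc_id, "-> S%d" % shard_for(doc_id, 2))
```

Because the shard number depends on the shard count, the same id always maps to the same shard only while the number of primary shards stays fixed.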
Replicas 
Node1 Node2 
S0 S0 
S1 S1 
Blue - Replica 
Red - Primary 
Shard
More nodes.. 
Node1 Node2 
S0 S1 
Node3 Node4 
S1 S0 
Blue - Replica 
Red - Primary
Node down 
Node1 Node2 
S0 S1 
Node3 Node4 
S1 S0 
Blue - Replica 
Red - Primary
Node1 
S0 
Node down 
Node3 Node4 
A1 S1 
S0 
Blue - Replica 
Red - Primary 
S1 
Re-replicated 
Promoted to Primary
Elasticsearch 101 
• Lucene under the covers. 
• Each index (like a database) is made up of multiple 
shards (each a Lucene instance). 
• Shards are distributed among all nodes in the 
cluster. 
• When a node fails or new nodes are added, 
shards are automatically moved from one node to 
another.
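The node-down slides can be replayed with a toy allocator. This is a sketch under assumptions: which copies are primary in the four-node diagram is a guess, the function `fail_node` is hypothetical, and the "least-loaded node" rule is a simplification rather than the allocation logic Elasticsearch actually runs.

```python
def fail_node(cluster, dead):
    """Return a new allocation after node `dead` leaves the cluster.

    cluster maps node name -> {shard name: "primary" | "replica"}.
    The input dict is not modified.
    """
    survivors = {n: dict(s) for n, s in cluster.items() if n != dead}
    for shard, role in cluster[dead].items():
        holders = [n for n, s in survivors.items() if shard in s]
        if not holders:
            continue  # no surviving copy: nothing to promote or copy from
        if role == "primary":
            # Promote a surviving replica so writes can continue.
            survivors[holders[0]][shard] = "primary"
        # Re-replicate onto the least-loaded node holding no copy of this shard.
        candidates = [n for n in survivors if shard not in survivors[n]]
        if candidates:
            target = min(candidates, key=lambda n: (len(survivors[n]), n))
            survivors[target][shard] = "replica"
    return survivors


# Assumed roles for the four-node layout shown on the slides.
cluster = {
    "Node1": {"S0": "primary"},
    "Node2": {"S1": "primary"},
    "Node3": {"S1": "replica"},
    "Node4": {"S0": "replica"},
}

if __name__ == "__main__":
    for node, shards in fail_node(cluster, "Node2").items():
        print(node, shards)
```

The replica is never placed on a node that already holds a copy of the same shard, which matches the rule that a primary and its replica do not share a node.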
How is it Fast ? 
Distributed execution 
Client 
Node 2 
Node 1 
S0 S1 S0 S1 
Query 
Red - Primary 
Blue - Replica
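Distributed execution amounts to scatter and gather: every shard answers the query for its own documents, and the partial results are merged before going back to the client. The sketch below simulates that with in-memory lists standing in for shards; `search_shard`, `scatter_gather`, the `score` field, and the sample data are all assumptions for illustration.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def search_shard(shard_docs, predicate, k):
    """Each shard returns only its own top-k matches by score."""
    matches = [d for d in shard_docs if predicate(d)]
    return heapq.nlargest(k, matches, key=lambda d: d["score"])


def scatter_gather(shards, predicate, k):
    """Query every shard in parallel, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda s: search_shard(s, predicate, k), shards))
    merged = [doc for part in partials for doc in part]
    return heapq.nlargest(k, merged, key=lambda d: d["score"])


# Hypothetical documents spread over two shards (S0 and S1).
shards = [
    [{"id": 1, "score": 0.9}, {"id": 2, "score": 0.2}],
    [{"id": 3, "score": 0.7}, {"id": 4, "score": 0.95}],
]

if __name__ == "__main__":
    top = scatter_gather(shards, lambda d: d["score"] > 0.5, k=2)
    print([d["id"] for d in top])
```

Each shard only ships its top k candidates, so the merge step stays small no matter how many documents a shard holds.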
DEMO 
• Import data from a SQL database 
into Hive. (Extract) 
• Run the necessary 
computations using 
Hadoop/Hive. (Transform) 
• Push the data into 
Elasticsearch. (Load) 
• Run queries against 
Elasticsearch.
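The load step usually means turning rows into a bulk request. The sketch below builds the newline-delimited body the bulk endpoint expects (one action line followed by one source line per document). The helper `to_bulk_payload`, the index name `person`, and the sample row are assumptions; the talk does not show the actual loader, and older Elasticsearch versions may also expect a type in the action line or URL.

```python
import json


def to_bulk_payload(rows, index_name, id_field):
    """Turn row dicts into a newline-delimited bulk body.

    Each row becomes two lines: an "index" action, then the document source.
    """
    lines = []
    for row in rows:
        action = {"index": {"_index": index_name, "_id": str(row[id_field])}}
        source = {k: v for k, v in row.items() if k != id_field}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    # The bulk format requires a newline after the last line as well.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical row as it might come out of the transform step.
    rows = [{"id": 1, "first_name": "Abhishek", "last_name": "Andhavarapu"}]
    print(to_bulk_payload(rows, "person", "id"), end="")
```

Sending many documents in one request like this avoids a network round trip per row, which matters when an ETL job pushes a full batch every run.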
Current Elasticsearch Cluster 
• 9 bare metal boxes 
• 128 GB RAM 
• 2X SSD 
• 10 Gb Ethernet 
• 2X 10 core Xeon Processors 
• 2X 30GB Elasticsearch instances per box 
• 1 Elasticsearch load balancing instance to handle index requests
Zabbix 
What’s slow ? 
Any request that takes more than 300ms is slow
Lessons Learnt
Concurrency 
• More replication for more concurrency. Updates are costly. 
• More shards, much faster. 
• SQL 3 to 5k per minute
Filter Cache 
• All the filters have a cache flag that controls if they 
are cached or not. 
• Once the filter cache is warmed, all the requests are 
served from the memory. 
• Defaults - 10% for the filter cache. 
• LRU. 
• Bit Sets.
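The combination of bit sets and LRU eviction can be sketched in a few lines. This is a toy model, assuming a Python integer as the bit set and a cap on the number of entries; the real cache is sized as a share of the heap (10% by default, per the slide), and the class name `FilterCache` and its methods are hypothetical.

```python
from collections import OrderedDict


class FilterCache:
    """LRU cache mapping a filter key to a bit set of matching doc ids.

    A Python int serves as the bit set: bit i is 1 when document i matches.
    """

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key, docs, predicate):
        if key in self._entries:
            self.hits += 1
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]
        self.misses += 1
        bits = 0
        for doc_id, doc in enumerate(docs):
            if predicate(doc):
                bits |= 1 << doc_id
        self._entries[key] = bits
        if len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict least recently used
        return bits


def matching_ids(bits):
    """Expand a bit set back into a list of doc ids."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]


if __name__ == "__main__":
    docs = [{"state": "UT"}, {"state": "CA"}, {"state": "UT"}]
    cache = FilterCache(max_entries=2)
    bits = cache.get("state:UT", docs, lambda d: d["state"] == "UT")
    print(matching_ids(bits), cache.hits, cache.misses)
```

Two cached filters can be combined with a bitwise AND of their bit sets, which is why a warmed cache answers compound filters without touching the documents again.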
Field Data 
• For sorting, aggregations, etc., all the field values are 
loaded into memory, called field data. 
• By default it's unbounded. 
• Expensive to build, so it's recommended to hold it in 
memory. 
• There are circuit breakers to protect against this. 
• If a query is going to use more than 60% of the JVM 
heap, the circuit breaker kills the query.
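The circuit breaker idea is an estimate-then-refuse check: before loading field data, compare the projected total against a fraction of the heap and fail the query instead of the node. The sketch below assumes the 60% limit and the 30 GB heap from the slides; `check_field_data` and `CircuitBreakingError` are hypothetical names, not the Elasticsearch implementation.

```python
GB = 1024 ** 3


class CircuitBreakingError(Exception):
    """Raised when loading more field data would cross the limit."""


def check_field_data(estimated_bytes, loaded_bytes, heap_bytes, limit_fraction=0.60):
    """Return the new field data total, or raise before anything is loaded."""
    limit = heap_bytes * limit_fraction
    projected = loaded_bytes + estimated_bytes
    if projected > limit:
        raise CircuitBreakingError(
            "field data would use %d bytes, limit is %d" % (projected, limit)
        )
    return projected


if __name__ == "__main__":
    heap = 30 * GB  # one 30 GB instance, as on the cluster slide
    print(check_field_data(10 * GB, 5 * GB, heap) // GB, "GB after load")
    try:
        check_field_data(10 * GB, 10 * GB, heap)
    except CircuitBreakingError as err:
        print("query rejected:", err)
```

The check only guards what it can estimate up front, which is consistent with the later point that circuit breakers do not protect against everything.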
JVM memory - Friend or Foe ? 
to replicate which are still serving requests causing additional heap
Getting Bad 
Solution ? 
More memory. 
Not necessarily more boxes.
Elasticsearch Cons 
• Not commodity hardware: 6K (Hadoop) vs 10K (SSD). 
• GC issues. 
• Circuit breakers don't protect you against everything. 
• No built-in security. Use an nginx proxy with authentication. 
• Learning curve. 
• Lots of updates hurt: the filter cache has to be rebuilt, merges, etc.
Thank you 
• abhishek376.wordpress.com 
• abhishek376@gmail.com 
• Twitter : abhishek376 
We are Hiring !!


Editor's Notes

  1. I would like to thank all my sponsors, without whose support this wouldn't have been possible.
  2. I timed the session and it took me about 40 minutes. Stop me any time; this is an interactive session.
  3. Demo: pushing data from SQL to Hadoop and from Hadoop to Elasticsearch.
  4. Allegiance, the company I work for, is in the voice-of-the-customer space. We provide tools for analytics over customer data. Every night the cube is torn down, and it takes about 6 to 12 hours to rebuild. It is a multi-tenant environment. Not scalable, very expensive, and hard to manage.
  5. SQL is the master system of record. We plan to use HBase but are still prototyping. The ETL process runs every hour and pushes the data from SQL to Hadoop and from Hadoop to Elasticsearch. Hadoop is for ETL (it converts SQL rows into NoSQL documents); Elasticsearch is for reporting.
  6. If you want to know more about it or how we use it in our product, come see me in the experts room. There are a lot of sessions. I'll demo our architecture. It's a batch processing engine that runs on commodity hardware and is highly scalable.
  7. Aggregations are similar to GROUP BY in SQL, but much more powerful. I have slides to show how fast it is. Craig has a session that introduces Elasticsearch.
  8. How documents are stored in Elasticsearch: indexes are like SQL databases. The data of an index is distributed across multiple shards.
  9. Elasticsearch makes sure that a single node doesn't hold both the primary and the replica of the same shard. All write requests hit the primary shards. Search requests can hit either a primary or a replica. Routing for inserts and updates is handled by Elasticsearch.
  10. Shards are automatically moved to the new nodes.
  11. Shards are distributed among the nodes. A query is executed simultaneously across multiple nodes and the results are aggregated and returned to the client. More nodes mean faster response times. The more replication, the costlier the updates.
  12. Queries that took 100s now take 1s by using aggregations.
  13. Elasticsearch is distributed and designed to handle something like this. More shards, more concurrency.
  14. Filter cache and field data are what use the memory in Elasticsearch, and they are the reason it is so fast. If you're not careful, they can very quickly use all the JVM heap, leading to more GC, possibly OOM, and then a cascading failure of the cluster. In the past we have seen 500 GB of filter cache fill up and crash the cluster in 4 seconds, due to lots of nodes and SSDs with the help of the file system cache.
  15. Memory is why Elasticsearch is so fast. It can also be a foe if you're not careful. These graphs show common symptoms of stop-the-world GC. While it's running, nothing else is allowed to run. We would like to keep this to a minimum. Talk about cascading failure. There are two generations, young and old. When objects are created they are put in the young generation, and after they survive a couple of GCs they are moved to the old generation.
  16. When we added more indexes, the healthy graph became something like this. GC has a hard time catching up. We run 2 Elasticsearch instances per box as we are not I/O, CPU, or bandwidth limited, just memory limited.
  17. Real time. People are OK with AWS. But for us it's real-time analytics, so no commodity hardware. Still working on stability. We saw that a single query can bring down the entire cluster.
  18. If you like what you saw, you should come work with us. Come talk to me. We have a booth; if you are interested, please submit your resume. Email me if you have any questions. You can also find me in the experts room.