3. About Me
• Currently working as a Software Engineer (Data Platform) at Allegiance Software Inc.
• Passionate about distributed systems and data visualization.
• Master's in Distributed Systems.
• abhishek376.wordpress.com
8. Elasticsearch 101
• Document-oriented search engine; JSON based, Apache Lucene under the covers.
• Schema free.
• It's distributed and supports aggregations, similar to GROUP BY.
• Uses bit sets to cache efficiently.
• It's fast. Super fast.
• It has REST and Java APIs.
9. Elasticsearch CRUD
Index a person:
curl -XPUT 'localhost:9200/person/person/1' -d '{
  "first_name" : "Abhishek",
  "last_name" : "Andhavarapu"
}'
Get a person:
curl -XGET 'localhost:9200/person/person/1'
Delete a person:
curl -XDELETE 'localhost:9200/person/person/1'
Update a person:
curl -XPOST 'localhost:9200/person/person/1/_update' -d '{
  "doc" : {
    "first_name" : "Abhi"
  }
}'
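The `_update` call above does a partial, server-side merge: the fields under `"doc"` are merged into the stored document and the result is reindexed. A minimal Python sketch of that merge semantics (the helper name is illustrative, this is not the Elasticsearch client):

```python
# Sketch of what the _update API does server-side: fetch the existing
# _source, shallow-merge the partial "doc", and reindex the result.
def partial_update(source, doc):
    merged = dict(source)   # copy the stored _source
    merged.update(doc)      # fields in "doc" overwrite existing ones
    return merged

person = {"first_name": "Abhishek", "last_name": "Andhavarapu"}
print(partial_update(person, {"first_name": "Abhi"}))
# -> {'first_name': 'Abhi', 'last_name': 'Andhavarapu'}
```

Note that fields absent from `"doc"` (here `last_name`) are preserved, unlike a full reindex with `PUT`.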
12. More nodes..
[Diagram: shards S0 and S1 distributed across Node1-Node4; each shard has a primary (red) and a replica (blue) on different nodes]
13. Node down
[Diagram: the same four-node cluster at the moment one node goes down; Red - primary, Blue - replica]
14. Node down
[Diagram: one node is down; the replica of shard S1 on a surviving node is promoted to primary, and shard S0 is re-replicated onto another node. Red - primary, Blue - replica]
15. Elasticsearch 101
• Lucene under the covers.
• Each index (like a database) is made up of multiple shards (Lucene instances).
• Shards are distributed among all nodes in the cluster.
• On failure, or when new nodes are added, shards are automatically moved from one node to another.
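Elasticsearch decides which shard holds a document with a routing formula, `shard = hash(routing) % number_of_primary_shards`, where the routing value defaults to the document id. A small Python sketch of the rule (crc32 stands in here for the murmur3 hash Elasticsearch actually uses):

```python
import zlib

# Sketch of Elasticsearch's shard-routing rule:
#   shard = hash(routing) % number_of_primary_shards
def route(doc_id, num_primary_shards):
    # crc32 as a stand-in for Elasticsearch's murmur3 hash
    return zlib.crc32(doc_id.encode()) % num_primary_shards

ids = ["person-1", "person-2", "person-3", "person-4"]
placement = {i: route(i, 2) for i in ids}
# every document deterministically lands on shard 0 or 1
assert all(s in (0, 1) for s in placement.values())
print(placement)
```

This is also why the number of primary shards is fixed at index-creation time: changing it would invalidate every document's placement.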
16. How is it fast?
Distributed execution
[Diagram: the client sends a query; it executes in parallel across shards S0 and S1 on Node 1 and Node 2, hitting primary (red) or replica (blue) copies, and the results are merged]
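The distributed execution above is a scatter-gather: the coordinating node fans the query out to one copy of every shard, each shard computes its local top-k, and the coordinator merges the partial results into a global top-k. A minimal Python sketch (toy scores, and a sequential loop where Elasticsearch runs the shards in parallel):

```python
import heapq

def shard_top_k(shard_docs, k):
    # each shard ranks only its own documents by score
    return heapq.nlargest(k, shard_docs, key=lambda d: d["score"])

def search(shards, k):
    partials = [shard_top_k(s, k) for s in shards]  # parallel in ES
    merged = [d for part in partials for d in part]
    # coordinator reduces the per-shard top-k lists to a global top-k
    return heapq.nlargest(k, merged, key=lambda d: d["score"])

s0 = [{"id": 1, "score": 0.9}, {"id": 2, "score": 0.4}]
s1 = [{"id": 3, "score": 0.7}, {"id": 4, "score": 0.2}]
print([d["id"] for d in search([s0, s1], 2)])  # -> [1, 3]
```

Because each shard only returns k candidates, adding nodes (and spreading shards over them) shrinks per-shard work without growing the merge step much.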
17. DEMO
• Import data from a SQL database into Hive. (Extract)
• Run the necessary computations using Hadoop/Hive. (Transform)
• Push the data into Elasticsearch. (Load)
• Run queries against Elasticsearch.
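For the Load step, documents are typically pushed with Elasticsearch's `_bulk` API, whose body alternates one action line and one source line per document. A sketch of building such a body from transformed rows (the `person` index/type names are illustrative):

```python
import json

# Sketch of the "Load" step: turn transformed rows into an
# Elasticsearch _bulk request body (action line + source line per doc).
def to_bulk(index, doc_type, rows):
    lines = []
    for row in rows:
        lines.append(json.dumps(
            {"index": {"_index": index, "_type": doc_type, "_id": row["id"]}}))
        lines.append(json.dumps({k: v for k, v in row.items() if k != "id"}))
    return "\n".join(lines) + "\n"   # _bulk bodies must end with a newline

rows = [{"id": 1, "first_name": "Abhi"}]
print(to_bulk("person", "person", rows))
```

The resulting body would be POSTed to `localhost:9200/_bulk`, amortizing network round-trips over many documents.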
18. Current Elasticsearch Cluster
• 9 bare-metal boxes
• 128 GB RAM
• 2x SSD
• 10 Gb Ethernet
• 2x 10-core Xeon processors
• 2x 30 GB Elasticsearch instances per box
• 1 Elasticsearch load-balancing instance to handle index requests
21. Concurrency
• More replicas allow more concurrency; updates become costlier.
• More shards, much faster queries.
• SQL: 3 to 5k per minute.
22. Filter Cache
• All filters have a cache flag that controls whether they are cached.
• Once the filter cache is warmed, all requests are served from memory.
• Default: 10% of the heap for the filter cache.
• LRU eviction.
• Cached as bit sets.
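The bit sets mentioned above are what make cached filters cheap to combine: each cached filter is a bit set over a segment's documents, so AND/OR between filters become single bitwise operations. A minimal Python sketch using integers as arbitrary-length bit sets:

```python
# Each cached filter is a bit set over the segment's documents:
# bit i is set iff document i matches the filter.
def bitset(matching_doc_ids):
    bits = 0
    for doc_id in matching_doc_ids:
        bits |= 1 << doc_id
    return bits

status_active = bitset([0, 2, 3, 5])   # docs matching status:active
country_us    = bitset([2, 4, 5])      # docs matching country:US

both = status_active & country_us      # AND of two cached filters
docs = [i for i in range(8) if both >> i & 1]
print(docs)  # -> [2, 5]
```

Once warmed, answering a combined filter touches no index structures at all, only these in-memory bit sets.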
23. Field Data
• For sorting, aggregations, etc., all the field values are loaded into memory, called field data.
• By default it's unbounded.
• Expensive to build; it's recommended to hold it in memory.
• There are circuit breakers to protect against this.
• If a query would use more than 60% of the JVM heap, it is killed.
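The circuit breaker works by estimating, before field data is loaded, how many bytes would be needed, and rejecting the request if the running total would cross a fixed fraction of the heap (60% in the slide above). A simplified sketch (the class name is illustrative, not the Elasticsearch implementation):

```python
class CircuitBreakerError(Exception):
    pass

# Sketch of a fielddata circuit breaker: track estimated bytes and
# reject any load that would push usage past a fraction of the heap.
class FielddataBreaker:
    def __init__(self, heap_bytes, limit_fraction=0.6):
        self.limit = int(heap_bytes * limit_fraction)
        self.used = 0

    def add_estimate(self, estimated_bytes):
        if self.used + estimated_bytes > self.limit:
            raise CircuitBreakerError("fielddata would exceed breaker limit")
        self.used += estimated_bytes

breaker = FielddataBreaker(heap_bytes=1_000_000)
breaker.add_estimate(500_000)          # fine: 50% of heap
try:
    breaker.add_estimate(200_000)      # would reach 70% -> rejected
except CircuitBreakerError as e:
    print("rejected:", e)
```

The key point is that the check happens on the estimate, before allocation, so a single oversized query fails fast instead of taking the heap (and the node) down with it.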
24. JVM memory - Friend or Foe ?
When a node goes down, its shards have to be re-replicated from nodes that are still serving requests, causing additional heap pressure.
26. Elasticsearch Cons
• Not commodity hardware: 6K (Hadoop) vs 10K (SSD).
• GC issues.
• Circuit breakers don't protect you against everything.
• No built-in security; use an nginx proxy with authentication.
• Learning curve.
• Lots of updates hurt: the filter cache has to be rebuilt, segment merges, etc.
27. Thank you
• abhishek376.wordpress.com
• abhishek376@gmail.com
• Twitter : abhishek376
We are Hiring !!
Editor's Notes
I would like to thank all my sponsors, without whose support this wouldn't have been possible.
I timed the session; it took me about 40 minutes.
Stop me any time. Interactive session.
Demo: pushing data from SQL to Hadoop and from Hadoop to Elasticsearch.
Allegiance, the company I work for, is in the Voice of Customer space.
We provide tools for analytics over customer data.
Every night the CUBE is torn down; it takes about 6 to 12 hours to rebuild. Multi-tenant environment.
Not scalable, very expensive, hard to manage.
SQL - master system of record. Plans to use HBase; still prototyping.
An ETL process runs every hour, pushing data from SQL to Hadoop and from Hadoop to ES.
Hadoop for ETL (converts SQL rows into NoSQL docs).
Elasticsearch for reporting.
If you want to know more about it, or how we use it in our product, come see me in the experts room. There are a lot of sessions.
Demo of our architecture.
It's a batch-processing engine that runs on commodity hardware and is highly scalable.
Aggregations - similar to GROUP BY in SQL, but much more powerful.
I have slides to show how fast it is.
Craig has a session with an intro to Elasticsearch.
How documents are stored in Elasticsearch.
Indexes are like SQL databases.
The data of an index is distributed across multiple shards.
Elasticsearch makes sure a single node doesn't hold both the primary and its replica.
All write requests hit the primary shards.
Search requests can hit either the primary or a replica.
Routing for inserts and updates is handled by Elasticsearch.
Shards are automatically moved to the new nodes.
Shards are distributed among the nodes.
A query executes simultaneously across multiple nodes, and the results are aggregated back to the client.
More nodes, faster response times. More replication, costlier updates.
Queries that took 100s now take 1s, using aggregations.
ES is distributed - designed to handle something like this. More shards, more concurrency.
The filter cache and field data are what use memory in Elasticsearch, and they are the reason it is so fast.
If you're not careful they can very quickly use all the JVM heap, leading to more GC, which can cause OOM and then cascading failure of the cluster.
In the past we have seen 500 GB of filter cache fill up and crash the cluster in 4 seconds, due to lots of nodes and SSDs with the help of the file system cache.
Memory is why Elasticsearch is so fast. It can also be a foe if you're not careful.
These graphs show common symptoms of stop-the-world GC. While it's running, nothing else is allowed to run. We want to keep it to a minimum. Talk about cascading failure.
There are two generations, young and old: objects are put in the young generation when created, and after surviving a couple of GCs they are moved to the old generation.
When we added more indexes, the healthy graph became something like this.
GC has a hard time catching up.
2 Elasticsearch instances per box, as we are not IO/CPU/bandwidth limited - just memory.
Real time. People are OK with AWS, but for us it's real-time analytics, so no commodity hardware.
Still working on stability. We saw that a single query can bring down the entire cluster.
If you like what you saw, you should come work with us. Come talk to me.
We have a booth; if you are interested, please submit your resume.
Email me if you have any questions. You can also find me in the experts room.