Building an ETL pipeline for
Elasticsearch using Spark
Itai Yaffe, Big Data Infrastructure Developer
December 2015
Agenda
• About eXelate
• About the team
• eXelate’s architecture overview
• The need
• The problem
• Why Elasticsearch and how do we use it?
• Loading the data
• Re-designing the loading process
• Additional improvements
• To summarize
About eXelate, a Nielsen company
• Founded in 2007
• Acquired by Nielsen in March 2015
• A leader in the Ad Tech industry
• Provides data and software services through:
• eXchange (2 billion users)
• maX DMP (data management platform)
Our numbers
• ~10 billion events per day
• ~150TB of data per day
• Hybrid cloud infrastructure
• 4 Data Centers
• Amazon Web Services
About the team
• The BDI (Big Data Infrastructure) team is in charge
of shipping, transforming and loading eXelate’s data
into various data stores, making it ready to be
queried efficiently
• For the last year and a half, we’ve been transitioning
our legacy systems to modern, scale-out
infrastructure (Spark, Kafka, etc.)
About me
• Dealing with Big Data challenges for the last 3.5 years, using:
• Cassandra
• Spark
• Elasticsearch
• And others…
• Joined eXelate in May 2014
• Previously: OpTier, Mamram
• LinkedIn: https://www.linkedin.com/in/itaiy
• Email: itai.yaffe@nielsen.com
eXelate’s architecture overview
[Architecture diagram: frontend serving servers handle incoming HTTP requests; ETL processes load the data into a DB and a DWH, which back the DMP (SaaS) applications]
The need
• From the data perspective:
• ETL – collect raw data and load it into
Elasticsearch periodically
• Tens of millions of events per day
• Data is already labeled
• Query – allow ad hoc calculations based on the
stored data
• Mainly counting unique users related to a specific
campaign in conjunction with
geographic/demographic data limited by date range
• The number of permutations is huge, so real-time
queries are a must! (and can’t be pre-calculated)
The problem
• We chose Elasticsearch as the data store (details to
follow)
• But… the ETL process was far from optimal
• Also affected query performance
Why Elasticsearch?
• Originally designed as a text search engine
• Today it has advanced real-time analytics
capabilities
• Distributed, scalable and highly available
How do we use Elasticsearch?
• We rely heavily on its counting capabilities (a sketch of such a count request follows below)
• Splitting the data into separate indices based on a few criteria (e.g. TTL, tags vs. segments)
• Each user (i.e. device) is stored as a document with many nested documents
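As a rough illustration of the counting use case, here is a minimal sketch of a count request, written as a Scala string. It assumes the events and segments arrays are mapped as nested types and borrows field names from the sample document on the next slide; the endpoint, mapping and filter values are illustrative assumptions, not the production queries:

// Hypothetical count request: users that have an event with a given segment inside a
// date range. Since each user is a single document, a document count is a unique-user count.
// The body could be sent to e.g. POST /sample/user/_count.
val countUsersInSegment: String =
  """{
    |  "query": {
    |    "nested": {
    |      "path": "events",
    |      "query": {
    |        "bool": {
    |          "must": [
    |            { "range": { "events.event_time": { "gte": "2014-01-01", "lte": "2014-03-31" } } },
    |            { "nested": {
    |                "path": "events.segments",
    |                "query": { "term": { "events.segments.segment": "Airplane tickets" } }
    |            } }
    |          ]
    |        }
    |      }
    |    }
    |  }
    |}""".stripMargin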
How do we use Elasticsearch?
{
"_index": "sample",
"_type": "user",
"_id": "0c31644ad41e32c819be29ba16e14300",
"_version": 4,
"_score": 1,
"_source": {
"events": [
{
"event_time": "2014-01-18",
"segments": [
{
"segment": "female"
}
,{
"segment": "Airplane tickets"
}
]
},
{
"event_time": "2014-02-19",
"segments": [
{
"segment": "female"
}
,{
"segment": "Hotel reservations"
}
]
}
]
}
}
Loading the data
Standalone Java loader application
• Runs every few minutes
• Parses the log files
• For each user encountered:
• Queries Elasticsearch to get the user’s document
• Merges the new data into the document on the client-side
• Bulk-indexes documents into Elasticsearch (sketched below)
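To make the cost of this pattern concrete, here is a minimal sketch (in Scala, against the Elasticsearch Java client API of that era) of the per-user get, client-side merge and bulk index described above. The client wiring, parsing and merge logic are illustrative assumptions, not the original Java code:

import org.elasticsearch.client.Client

// Hypothetical per-user input produced by parsing the log files
case class UserEvents(userId: String, docJson: String)

// Placeholder for the real client-side merge logic of the legacy loader
def mergeClientSide(existingJson: String, user: UserEvents): String =
  user.docJson // the actual loader merged the new events into the existing document

def loadBatch(client: Client, index: String, users: Seq[UserEvents]): Unit = {
  val bulk = client.prepareBulk()
  users.foreach { user =>
    // 1. Query Elasticsearch for the user's current document (one GET per user)
    val existing = client.prepareGet(index, "user", user.userId).get()
    // 2. Merge the new data into the document on the client side
    val merged =
      if (existing.isExists) mergeClientSide(existing.getSourceAsString, user)
      else user.docJson
    // 3. Re-index the whole merged document (an update is effectively delete + insert)
    bulk.add(client.prepareIndex(index, "user", user.userId).setSource(merged))
  }
  bulk.get() // bulk-index the batch
}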
OK, so what’s the problem?
• Multiple updates per user per day
• Updates in Elasticsearch are expensive (basically delete +
insert)
• Merges are done on the client-side
• Involves redundant queries
• Leads to degradation of query performance
• Not scalable or highly available
Re-designing the loading process
• Batch processing once a day during off-hours
• Daily dedup leads to ~75% fewer update operations in Elasticsearch (see the sketch below)
• Using Spark as our processing framework
• Distributed, scalable and highly available
• Unified framework for batch, streaming, machine
learning, etc.
• Using an update script
• Merges are done on the server-side
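A minimal sketch of the daily dedup idea with Spark's RDD API (the version in use at the time). The S3 path, CSV layout and field positions are illustrative assumptions; the point is that grouping a full day of events by user collapses many small updates into a single update per user:

import org.apache.spark.{SparkConf, SparkContext}

case class Event(userId: String, eventTime: String, segment: String)

val sc = new SparkContext(new SparkConf().setAppName("daily-dedup"))

// Read one day of raw log lines (gzipped CSVs are decompressed transparently by textFile)
val rawEvents = sc.textFile("s3://some-bucket/logs/2015-12-01/*.gz") // hypothetical path
  .map(_.split(','))
  .map(f => Event(userId = f(0), eventTime = f(1), segment = f(2)))  // assumed column order

// Group the whole day's events by user, so each user is updated in Elasticsearch only once
val dailyEventsPerUser = rawEvents
  .groupBy(_.userId)
  .mapValues(events => events.groupBy(_.eventTime).mapValues(_.map(_.segment).toSet))

// dailyEventsPerUser can now be turned into a single JSON update document per user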
Elasticsearch update script
import groovy.json.JsonSlurper;

// param1 holds the new event(s) for this user as a JSON string; ttl is the document's TTL
added = false;
def slurper = new JsonSlurper();
def result = slurper.parseText(param1);
ctx._ttl = ttl;

// If an event with the same event_time already exists, merge only the new segments into it
ctx._source.events.each() { item ->
    if (item.event_time == result[0].event_time) {
        def segmentMap = [:];
        item.segments.each() {
            segmentMap.put(it.segment, it.segment)
        };
        result[0].segments.each {
            if (!segmentMap[it.segment]) {
                item.segments += it
            }
        };
        added = true;
    }
};

// Otherwise append the new event(s) to the user's events array
if (!added) {
    ctx._source.events += result
}
Re-designing the loading process
[Flow diagram: AWS Data Pipeline orchestrates the daily flow – raw logs in AWS S3 are processed on AWS EMR, and notifications are sent via AWS SNS]
Zoom-in
• Log files are compressed (.gz) CSVs
• Once a day:
• Files are copied and uncompressed into the EMR cluster using S3DistCp
• The Spark application:
• Groups events by user and builds JSON documents, which include an inline update script (sketched below)
• Writes the JSON documents back to S3
• The Scala application reads the documents from S3 and
bulk-indexes them into Elasticsearch
• Notifications are sent via SNS
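To connect the two previous slides, here is a hedged sketch of what "building JSON documents which include an inline update script" might look like. The layout mirrors the Groovy script shown earlier (which expects param1 holding the new events and a ttl); the exact document format used in production is an assumption for illustration:

// Hypothetical builder: one update document per user, carrying the inline script and its
// parameters. updateScript is the Groovy script from the earlier slide collapsed to one line.
// param1 is passed as a string and parsed by the script; JSON escaping of the embedded
// script and events is omitted here for readability.
def buildUpdateDoc(userId: String, newEventsJson: String, ttlMillis: Long, updateScript: String): String =
  s"""{
     |  "id": "$userId",
     |  "script": "$updateScript",
     |  "lang": "groovy",
     |  "params": { "param1": "$newEventsJson", "ttl": $ttlMillis },
     |  "upsert": { "events": $newEventsJson }
     |}""".stripMargin

// The resulting RDD of JSON strings is what gets written back to S3 and later bulk-indexed
// (initially by the standalone Scala application, later directly via saveJsonToEs).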
We discovered it wasn’t enough…
• Redundant moving parts
• Excessive network traffic
• Still not scalable enough
Elasticsearch-Spark plug-in to the rescue…
[Flow diagram: the same daily flow – AWS S3, AWS Data Pipeline and AWS EMR – but the standalone indexing step is replaced by the Elasticsearch-Spark plug-in, with notifications via AWS SNS]
Deep-dive
• Bulk-indexing directly from Spark using the elasticsearch-hadoop plug-in for Spark:
// Save created RDD records to a file
documentsRdd.saveAsTextFile(outputPath)
Is now:
// Save created RDD records directly to Elasticsearch
documentsRdd.saveJsonToEs(configData.documentResource,
scala.collection.Map(ConfigurationOptions.ES_MAPPING_ID ->
configData.documentIdFieldName))
• Storing the update script on the server-side (Elasticsearch) – see the sketch below
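Putting the two bullets together, a hedged sketch of the write call once merges happen via the update script. The option keys follow elasticsearch-hadoop's documented "es.*" settings of that era; the resource, script, parameter mapping and field names are illustrative assumptions rather than the production configuration:

import org.elasticsearch.hadoop.cfg.ConfigurationOptions
import org.elasticsearch.spark._ // adds saveJsonToEs to RDD[String]

// documentsRdd: RDD[String] of per-user JSON documents built by the Spark job
// updateScript: the Groovy merge script (kept on the Elasticsearch side in the final design)
documentsRdd.saveJsonToEs(
  configData.documentResource, // e.g. "sample/user"
  scala.collection.Map(
    ConfigurationOptions.ES_MAPPING_ID -> configData.documentIdFieldName,
    "es.write.operation"      -> "upsert",               // update/upsert instead of re-indexing
    "es.update.script"        -> updateScript,           // the merge script
    "es.update.script.lang"   -> "groovy",
    "es.update.script.params" -> "param1:events,ttl:ttl" // script params taken from document fields (assumed names)
  )
)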
Better…
• Single component for both processing and indexing
• Elastically scalable
• Out-of-the-box error handling and fault-tolerance
• Spark-level (e.g. spark.task.maxFailures)
• Plug-in level (e.g. ConfigurationOptions.ES_BATCH_WRITE_RETRY_COUNT/WAIT – see the configuration sketch below)
• Less network traffic (update script is stored in
Elasticsearch)
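A short sketch of wiring the fault-tolerance knobs above into the Spark configuration; the values are illustrative, not the production settings:

import org.apache.spark.SparkConf
import org.elasticsearch.hadoop.cfg.ConfigurationOptions

val conf = new SparkConf()
  .setAppName("es-bulk-indexing")
  .set("spark.task.maxFailures", "8")                         // Spark-level task retries (illustrative)
  .set(ConfigurationOptions.ES_BATCH_WRITE_RETRY_COUNT, "6")  // retry rejected bulk batches
  .set(ConfigurationOptions.ES_BATCH_WRITE_RETRY_WAIT, "60s") // wait between bulk retries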
… But
• Number of deleted documents continually grows
• Also affects query performance
• Elasticsearch itself becomes the bottleneck
• org.elasticsearch.hadoop.EsHadoopException: Could not
write all entries [5/1047872] (maybe ES was
overloaded?). Bailing out...
• [INFO ][index.engine ] [NODE_NAME]
[INDEX_NAME][7] now throttling indexing:
numMergesInFlight=6, maxNumMerges=5
Expunging deleted documents
• Theoretically not a “best practice” but necessary
when doing significant bulk-indexing
• Done through the optimize API
• curl -XPOST
http://localhost:9200/_optimize?only_expunge_deletes
• curl -XPOST
http://localhost:9200/_optimize?max_num_segments=5
• A heavy operation (time, CPU, I/O)
Improving indexing performance
• Set index.refresh_interval to -1
• Set indices.store.throttle.type to none
• Properly set the retry-related configuration properties (e.g. spark.task.maxFailures)
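A minimal sketch of applying these settings around the bulk load over plain HTTP from the JDK (so no particular Elasticsearch client is assumed); host, index name and the restored values are illustrative:

import java.net.{HttpURLConnection, URL}

// PUT a JSON settings body to the given Elasticsearch URL and return the HTTP status code
def putSettings(url: String, json: String): Int = {
  val conn = new URL(url).openConnection().asInstanceOf[HttpURLConnection]
  conn.setRequestMethod("PUT")
  conn.setDoOutput(true)
  conn.setRequestProperty("Content-Type", "application/json")
  conn.getOutputStream.write(json.getBytes("UTF-8"))
  val code = conn.getResponseCode
  conn.disconnect()
  code
}

// Before the bulk load: disable refresh on the target index (restore it afterwards, e.g. to "1s")
putSettings("http://localhost:9200/sample/_settings",
  """{ "index": { "refresh_interval": "-1" } }""")

// Disable store throttling cluster-wide for the duration of the load (restore to "merge" afterwards)
putSettings("http://localhost:9200/_cluster/settings",
  """{ "transient": { "indices.store.throttle.type": "none" } }""")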
What’s next?
• Further improve indexing performance, e.g.:
• Reduce excessive concurrency on Elasticsearch nodes by
limiting Spark’s maximum concurrent tasks
• Bulk-index objects rather than JSON documents to avoid
excessive parsing
• Better monitoring (e.g. using Spark Accumulators – see the sketch below)
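For the monitoring bullet, a minimal sketch using Spark 1.x accumulators to count parsed and skipped records while building documents; the variable names, input RDD and skip condition are illustrative assumptions:

// sc: SparkContext, rawLines: RDD[String] of raw CSV log lines (assumed to exist)
val parsedEvents = sc.accumulator(0L, "parsedEvents")
val skippedLines = sc.accumulator(0L, "skippedLines")

val eventFields = rawLines.flatMap { line =>
  val fields = line.split(',')
  if (fields.length < 3) { skippedLines += 1L; None } // malformed line (assumed minimum field count)
  else { parsedEvents += 1L; Some(fields) }
}

// After an action has run, the driver can report the totals, e.g.:
// log.info(s"parsed=${parsedEvents.value}, skipped=${skippedLines.value}")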
To summarize
• We use:
• S3 to store (raw) labeled data
• Spark on EMR to process the data
• Elasticsearch-hadoop plug-in for bulk-indexing
• Data Pipeline to manage the flow
• Elasticsearch for real-time analytics
To summarize
• Updates are expensive – consider daily dedup
• Avoid excessive querying and network traffic – perform merges on the server-side
• Use an update script
• Store it on your Elasticsearch cluster
• Make sure your loading process is scalable and
fault-tolerant – use Spark
• Reduce # of moving parts
• Index the data directly using elasticsearch-hadoop plug-in
To summarize
• Improve indexing performance – properly configure
your cluster before indexing
• Avoid excessive disk usage – optimize your indices
• Can also help query performance
• Making the processing phase elastically scalable (i.e.
using Spark) doesn’t mean the whole ETL flow is
elastically scalable
• Elasticsearch becomes the new bottleneck…
Questions?
Also - we’re hiring!
http://exelate.com/about-us/careers/
• DevOps team leader
• Senior frontend developers
• Senior Java developers
Thank you
Itai Yaffe
Keep an eye on…
• S3 limitations:
• The penalty involved in moving files
• File partitioning and hash prefixes