SUMMINGBIRD 
Sofian DJAMAA 
@sdjamaa
ANALYTICS INFRASTRUCTURE 
We log all data coming out of our web servers 
• Events of ad displays 
• Events of clicks 
• Events of arbitrage 
• …
SO MUCH DATA TO COMPUTE THAT WE USE 
HADOOP 
Copyright © 2013 Criteo. Confidential
YEAH, BUT WHAT ABOUT REAL NUMBERS? 
We have those too, but you’ll note that the 
marketing team was not involved in the design  
• 1.4M requests per second average across 6 DCs 
• 30B events logged and imported into HDFS each day 
• 2.5B displays 
• 840M unique users 
…and 70 BI Analysts who want to play with it all…
DATA ANALYSTS 
Data analysts (or BIs) use our 
data to 
– Improve our algorithms and 
strategy 
– Build financial reports 
They obviously learn SQL
QUERYING TOOL 
Of course we have a « fast » datastore fitting 80% of 
analyst needs… 
… with limited updates per day 
Vertica 
(datamart) 
Hadoop (log 
store & 
compute 
platform) 
LOGS 
Batch loading here 
Some PhD level 
computations and 
transformations 
Batch loading here 
Querying
ANOTHER QUERYING TOOL 
Therefore sometimes data analysts need to use Hive
MAKE LIFE BETTER 
Can we get « big data » faster? 
• Current situation: we have PBs of useful data to handle 
– We have a low/middle latency datastore 
– But batches take hours (literally)
• e.g. arbitrage events are up to 200GB per hour
STREAMING ISSUES 
But we do have Storm! 
• Fault-tolerance??? 
• We need to recompute historical data whenever a 
new metric emerges 
• To be honest those issues can be overcome but at a 
high cost 
– We already have a satisfying Hadoop architecture 
– The switch to a fully streamed architecture would require 
rebuilding our backends
OUR WISH 
Ideally, we’d like… 
Hadoop with reliable data 
• Get reliable data 
• Process lots of events 
(daily/monthly/quarterly 
aggregates) 
Storm with fresh data 
• Get a first overview of our 
aggregates faster 
– Errors are bounded to the current 
« batch » processing
THIS IS CALLED: LAMBDA ARCHITECTURE 
HADOOP 
STORM
WHAT ARE THE CHOICES THEN? 
One job to write per platform? 
– MapReduce/Cascading/Hive jobs 
– Storm topology 
→ Learn different technologies to achieve the same result 
→ Risks of discrepancies 
→ Deployment complexity 
Our choice: SummingBird 
PIOU!
CORE CONCEPTS 
Main concept: Platform[P] 
Currently P = [Scalding | Storm | Spark] 
Every job is written with the same piece of code, in Scala: 
• Storm topology 
• Scalding pipes 
• And the thing that runs on Spark (?!)
SUMMINGBIRD: CODE SAMPLE 
Data source (either HDFS 
path or Kafka queue) 
All the processing 
happens here (converted 
to either a Storm topology 
or a Cascading job) 
We merge the processing 
with another source 
This is the output of 
the job 
We sum output of both sources (DisplayClick) 
inside a datastore (Memcache, MySQL…)
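The code on this slide did not survive extraction. As a rough, dependency-free sketch of the idea (not the SummingBird API), the snippet below writes the per-event logic once and runs it through two hypothetical runners, one batch-style and one streaming-style. The entity names come from the slides; their fields, the `Event` types and both runner functions are assumptions made for illustration.

```scala
object DisplayClickJobSketch {
  // Entities named after the slides; the fields are assumptions.
  final case class CampaignAffiliate(campaignId: Int, affiliateId: Int)
  final case class DisplayClick(displays: Long, clicks: Long) {
    def +(that: DisplayClick): DisplayClick =
      DisplayClick(displays + that.displays, clicks + that.clicks)
  }

  // Hypothetical input events coming from the two sources.
  sealed trait Event { def key: CampaignAffiliate }
  final case class Display(key: CampaignAffiliate) extends Event
  final case class Click(key: CampaignAffiliate) extends Event

  // The job logic, written once: one event becomes one (key, value) pair.
  def toPair(e: Event): (CampaignAffiliate, DisplayClick) = e match {
    case Display(k) => k -> DisplayClick(1L, 0L)
    case Click(k)   => k -> DisplayClick(0L, 1L)
  }

  // "Batch" runner: merge both sources, then sum the values per key.
  def runBatch(displays: Seq[Event], clicks: Seq[Event]): Map[CampaignAffiliate, DisplayClick] =
    (displays ++ clicks).map(toPair).groupBy(_._1).map {
      case (k, pairs) => k -> pairs.map(_._2).reduce(_ + _)
    }

  // "Streaming" runner: update a mutable store one event at a time.
  def runStreaming(events: Iterator[Event]): Map[CampaignAffiliate, DisplayClick] = {
    val store = scala.collection.mutable.Map.empty[CampaignAffiliate, DisplayClick]
    events.map(toPair).foreach { case (k, v) =>
      store.update(k, store.get(k).fold(v)(_ + v))
    }
    store.toMap
  }
}
```

Both runners share `toPair`, so the two paths cannot drift apart in what they count; only the way events are fed in and summed differs.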
BATCHID: CORE OF SUMMINGBIRD API 
We merge previous 
results of Hadoop with 
the new batch 
It means Storm results 
are volatile 
BatchID is a unit of time, e.g. 
1H, 6H… 
The BatchID duration follows the Hadoop 
job frequency to bound errors
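To make the "unit of time" idea concrete, here is a minimal sketch of a batcher that assigns each event timestamp to a fixed-size time bucket. `BatchID` and `Batcher` here are simplified stand-ins written for this example, not the SummingBird classes, and the one-hour duration is just one of the values mentioned above.

```scala
object BatchIdSketch {
  // A batch is identified by its index on the time axis.
  final case class BatchID(id: Long)

  // Hypothetical batcher with fixed-size buckets of batchMillis milliseconds.
  final case class Batcher(batchMillis: Long) {
    require(batchMillis > 0, "batch duration must be positive")

    // Bucket that contains the given timestamp (floorDiv also handles negatives).
    def batchOf(timestampMillis: Long): BatchID =
      BatchID(Math.floorDiv(timestampMillis, batchMillis))

    // First timestamp (inclusive) that belongs to the given batch.
    def earliestTimeOf(batch: BatchID): Long = batch.id * batchMillis
  }

  // One-hour batches, matching the "1H" example.
  val hourly: Batcher = Batcher(60L * 60L * 1000L)
}
```

With batches defined this way, any error in the streamed numbers is confined to the batches Hadoop has not yet recomputed.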
UPDATING DATA IN REAL-TIME 
We use a data type: monoids 
« In abstract algebra, a branch of mathematics, a monoid is 
an algebraic structure with a single associative binary 
operation and an identity element. » Wikipedia 
Monoids are cool 
– Associativity loves parallelism 
– No need to be commutative
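A small sketch of why associativity matters, in plain Scala rather than through Algebird: the `Monoid` trait below is a simplified stand-in, and `sumInChunks` is a hypothetical helper that simulates partial sums computed independently and then combined. String concatenation is included because it is associative but not commutative.

```scala
object MonoidSketch {
  // Simplified monoid: an identity element and an associative binary operation.
  trait Monoid[T] {
    def zero: T
    def plus(a: T, b: T): T
  }

  val longSum: Monoid[Long] = new Monoid[Long] {
    val zero: Long = 0L
    def plus(a: Long, b: Long): Long = a + b
  }

  // Associative but not commutative: order matters, grouping does not.
  val stringConcat: Monoid[String] = new Monoid[String] {
    val zero: String = ""
    def plus(a: String, b: String): String = a + b
  }

  // Maps form a monoid when their values do: merge key by key.
  def mapMonoid[K, V](v: Monoid[V]): Monoid[Map[K, V]] = new Monoid[Map[K, V]] {
    val zero: Map[K, V] = Map.empty
    def plus(a: Map[K, V], b: Map[K, V]): Map[K, V] =
      b.foldLeft(a) { case (acc, (k, bv)) =>
        acc.updated(k, acc.get(k).fold(bv)(av => v.plus(av, bv)))
      }
  }

  // Sequential sum from the identity element.
  def sumAll[T](items: Seq[T])(m: Monoid[T]): T =
    items.foldLeft(m.zero)(m.plus)

  // Sum each chunk on its own, then sum the partial results in order.
  def sumInChunks[T](items: Seq[T], chunkSize: Int)(m: Monoid[T]): T =
    sumAll(items.grouped(chunkSize).map(c => sumAll(c)(m)).toSeq)(m)
}
```

Because the operation is associative, the chunked sum equals the sequential one as long as chunk order is preserved, which is what lets partial aggregates be computed on different workers.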
ALGEBIRD: CODE SAMPLE
PUSHING DATA TO THE STORE 
For ease of programming, we define entities 
CampaignAffiliate & 
DisplayClick are stored in a 
« store »
PUSHING DATA TO THE STORE
BIJECTION 
How can we transform an entity to 
– An Array[Byte] for Memcached store? 
– A String for an in-memory Map[String, String] store? 
→ Another cool library: Bijection! 
« In mathematics, an injective function or injection or one-to-one 
function is a function that preserves distinctness: it never maps 
distinct elements of its domain to the same element of its codomain ». 
Wikipedia 
Example: Int => String is an injection but not a bijection, 
because not every String maps back to an Int: 
« foo » => Int is not possible 
« 12.42 » => Int is not possible
BIJECTION: CODE SAMPLE 
A simple Injection from Twitter 
From Int to String 
From String to maybe an Int
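The slide's code was an image and is not reproduced here. A minimal stand-in for the Int to String case could look like the sketch below; the `Injection` trait is a simplified version written for this example (using `Option` for the "maybe an Int" direction), not the Bijection library's own type.

```scala
import scala.util.Try

object InjectionSketch {
  // Simplified injection: always goes A => B, only sometimes comes back.
  trait Injection[A, B] {
    def apply(a: A): B
    def invert(b: B): Option[A]
  }

  val intToString: Injection[Int, String] = new Injection[Int, String] {
    // From Int to String: always possible.
    def apply(a: Int): String = a.toString

    // From String to maybe an Int: only strings produced by apply are accepted,
    // so non-canonical forms such as "007" or "+5" are rejected as well.
    def invert(b: String): Option[Int] =
      Try(b.toInt).toOption.filter(_.toString == b)
  }
}
```

The round-trip check in `invert` keeps the pair consistent: whatever `invert` accepts is exactly what `apply` would have produced.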
DATA PERSISTENCE 
• Now we know how to convert a (CampaignAffiliate, 
DisplayClick) tuple to an Array[Byte] and we can even 
modify it 
• One last thing is missing: what’s the interaction with the 
data store? 
→ Last cool thing: Storehaus!!!
STOREHAUS 
« Storehaus is a library that makes it easy to work 
with asynchronous key value stores. Storehaus is 
built on top of Twitter's Future. » Twitter’s GitHub 
What is really cool: storehaus-algebird 
• Make all stores mergeable (using 
Monoid[V])
STOREHAUS: CODE SAMPLE 
Injection because MySqlStore 
takes only MySqlValues 
Original store with only 
put/get operations 
Make the store Mergeable to 
allow updates
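As an illustration of what "making a store mergeable" means, the sketch below wraps a plain get/put store and adds a `merge` that combines the stored value with a new one. It is synchronous and in-memory to stay self-contained, whereas Storehaus is asynchronous; the trait and class names are assumptions for this example, not the Storehaus API.

```scala
object MergeableStoreSketch {
  // Original store: only get/put operations.
  trait Store[K, V] {
    def get(key: K): Option[V]
    def put(key: K, value: V): Unit
  }

  // Hypothetical in-memory backend standing in for MySQL or Memcached.
  final class InMemoryStore[K, V] extends Store[K, V] {
    private val data = scala.collection.mutable.Map.empty[K, V]
    def get(key: K): Option[V] = data.get(key)
    def put(key: K, value: V): Unit = data.update(key, value)
  }

  // Wrapper that adds merge: read, combine with plus, write back.
  final class MergeableStore[K, V](underlying: Store[K, V])(plus: (V, V) => V)
      extends Store[K, V] {
    def get(key: K): Option[V] = underlying.get(key)
    def put(key: K, value: V): Unit = underlying.put(key, value)

    // Returns the value held before the merge (None if the key was absent).
    def merge(key: K, delta: V): Option[V] = synchronized {
      val before = underlying.get(key)
      underlying.put(key, before.fold(delta)(plus(_, delta)))
      before
    }
  }
}
```

The `plus` function plays the role of the monoid operation: with it, callers send increments instead of reading, adding and writing back themselves.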
SUMMINGBIRD CLIENT 
HADOOP 
STORM 
Batch 
store 
Live 
store 
Client query 
With 2 different stores: we have to use the ClientStore from 
SummingBird to merge online and offline data 
With a single store: we need to know if the data is reliable or not
SUMMINGBIRD CLIENT
Client query 
We want to query data up to 
BatchID #3 : 
• We get the latest 
processed batch from 
Hadoop (#1) 
• We merge those results 
with the 2 latest from 
Storm
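The merge described above can be sketched as follows. This is a simplified model rather than SummingBird's ClientStore: it assumes the batch store exposes the last batch it has fully processed together with totals up to and including that batch, and that the live store keeps one partial aggregate per (key, batch). `BatchView`, `query` and both layouts are assumptions for this example.

```scala
object ClientMergeSketch {
  // Batch side: last fully processed batch and totals up to (and including) it.
  final case class BatchView(lastBatch: Long, totals: Map[String, Long])

  // Merge the reliable batch total with the live partials of the later batches.
  def query(
      key: String,
      upToBatch: Long,
      batch: BatchView,
      online: Map[(String, Long), Long]
  ): Long = {
    val base = batch.totals.getOrElse(key, 0L)
    // Only batches after the last Hadoop batch are read from the live store.
    val fresh = ((batch.lastBatch + 1) to upToBatch)
      .map(b => online.getOrElse((key, b), 0L))
      .sum
    base + fresh
  }
}
```

In the slide's scenario, Hadoop has finished batch #1 and the query goes up to batch #3, so the result is the Hadoop total plus the Storm partials for batches #2 and #3; whatever Storm holds for batch #1 is ignored because Hadoop already covers it.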
DEPLOY EVERYTHING 
• We decided to build one assembly per platform… 
– … using SBT 
– Unfortunately the 2 samples you can find on all the Internets are 
monolithic 
• This is how we run the Scalding part 
$> java <f******g lots of Hadoop conf keys> -cp <sh*t 
classpath value> org.apache.hadoop.util.RunJar job.jar 
JobName --arg1 value --arg2 value 
• The Storm JAR assembly is deployed as usual
BENEFITS OF SUMMINGBIRD 
Cool 
• Write once, run everywhere 
– We plan to deploy pure Scalding 
jobs thanks to SummingBird 
– Same can be done with Storm 
• I didn’t need to learn any Storm o/ 
• Use of Scala to write concise jobs 
– Especially after some months 
using Cascading... 
• Opportunity to do some cool 
Open Source coding 
Not cool 
• SummingBird = SummingBird + 
Storehaus + Bijection + Algebird 
– Need to learn all those APIs 
– And it’s hard if you’re not a Scala 
master 
• Not all of these libraries are up-to-date 
• As for any new Open Source 
project, you won’t find: 
– Tutorials 
– Examples 
– StackOverflow posts 
– Weird Chinese mailing-lists listing 
your stacktrace
WE HIRE 

Paris DataGeek - SummingBird

  • 2. ANALYTICS INFRASTRUCTURE We log all data coming out of our web servers • Events of ad displays • Events of clicks • Events of arbitrage • …
  • 3. SO MUCH DATA TO COMPUTE THAT WE USE HADOOP
  • 4. YEAH, BUT WHAT ABOUT REAL NUMBERS? We have those too, but you’ll note that the marketing team was not involved in the design :) • 1.4M requests per second average across 6 DCs • 30B events logged and imported into HDFS each day • 2.5B displays • 840M unique users …and 70 BI Analysts who want to play with it all…
  • 5. DATA ANALYSTS Data analysts (or BIs) use our data to – Improve our algorithms and strategy – Build financial reports They obviously learn SQL
  • 6. QUERYING TOOL Of course we have a « fast » datastore fitting 80% of analyst needs… … with limited updates per day. Diagram: LOGS → (batch loading) → Hadoop (log store & compute platform; some PhD-level computations and transformations) → (batch loading) → Vertica (datamart) → Querying
  • 7. ANOTHER QUERYING TOOL Therefore sometimes data analysts need to use Hive
  • 8. MAKE LIFE BETTER Can we get « big data » faster? • Current situation: we have PBs of useful data to handle – We have a low/middle latency datastore – But batches take hours (literally) :( • e.g. arbitrage events are up to 200GB per hour
  • 9. STREAMING ISSUES But we do have Storm! • Fault-tolerance??? • We need to recompute historical data whenever a new metric emerges • To be honest those issues can be overcome but at a high cost – We already have a satisfying Hadoop architecture – The switch to a fully streamed architecture would require rebuilding our backends
  • 10. OUR WISH Ideally, we’d like… Hadoop with reliable data • Get reliable data • Process lots of events (daily/monthly/quarterly aggregates) Storm with fresh data • Get a first overview of our aggregates faster – Errors are bounded to the current « batch » processing
  • 11. THIS IS CALLED: LAMBDA ARCHITECTURE HADOOP STORM
  • 12. WHAT ARE THE CHOICES THEN? One job to write per platform? – MapReduce/Cascading/Hive jobs – Storm topology → Learn different technologies to achieve the same result → Risks of discrepancies → Deployment complexity Our choice: SummingBird PIOU!
  • 13. CORE CONCEPTS Main concept: Platform[P] Currently P = [Scalding | Storm | Spark] Every job is written with the same piece of code, in Scala: • Storm topology • Scalding pipes • And the thing that runs on Spark (?!)
  • 14. SUMMINGBIRD: CODE SAMPLE Data source (either HDFS path or Kafka queue) All the processing happens here (converted to either a Storm topology or a Cascading job) We merge the processing with another source This is the output of the job We sum output of both sources (DisplayClick) inside a datastore (Memcache, MySQL…)
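The "same piece of code for every platform" idea from slides 13 and 14 can be illustrated without the library. The sketch below is a plain-Scala toy and not the SummingBird API: `Platform`, `BatchPlatform`, `StreamPlatform`, `countClicks` and the `campaignId,eventType` log format are all hypothetical names and assumptions made for illustration.

```scala
object PlatformSketch {
  // Hypothetical stand-in for a platform: it decides HOW key/value pairs are summed.
  trait Platform {
    def sumByKey[K](events: Seq[String])(extract: String => Seq[(K, Long)]): Map[K, Long]
  }

  // "Batch" flavour: sees the whole dataset, groups by key, then sums each group.
  object BatchPlatform extends Platform {
    def sumByKey[K](events: Seq[String])(extract: String => Seq[(K, Long)]): Map[K, Long] =
      events.flatMap(extract).groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).sum }
  }

  // "Streaming" flavour: updates a running total one event at a time.
  object StreamPlatform extends Platform {
    def sumByKey[K](events: Seq[String])(extract: String => Seq[(K, Long)]): Map[K, Long] =
      events.foldLeft(Map.empty[K, Long]) { (totals, event) =>
        extract(event).foldLeft(totals) { case (acc, (k, v)) =>
          acc.updated(k, acc.getOrElse(k, 0L) + v)
        }
      }
  }

  // The job logic is written once; only the platform passed in changes.
  def countClicks(platform: Platform, logs: Seq[String]): Map[String, Long] =
    platform.sumByKey[String](logs) { line =>
      line.split(",") match {
        case Array(campaign, "click") => Seq(campaign -> 1L)
        case _                        => Seq.empty
      }
    }
}
```

Calling `PlatformSketch.countClicks` with either platform object yields the same totals, which is the property that removes the risk of discrepancies mentioned on slide 12.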
  • 15. BATCHID: CORE OF SUMMINGBIRD API We merge previous results of Hadoop with the new batch It means Storm results are volatile BatchID is a unit of time, e.g. 1H, 6H… The BatchID is aligned with the Hadoop job frequency to bound errors
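A BatchID as "a unit of time" can be pictured as a simple bucketing of event timestamps. The following is a minimal sketch under that assumption; `BatchIdSketch`, `batchOf` and `batchStartMs` are hypothetical helpers, not SummingBird's own batching classes.

```scala
object BatchIdSketch {
  val HourMs: Long = 60L * 60L * 1000L

  // Batch number for an event timestamp (milliseconds since the epoch, assumed non-negative).
  def batchOf(timestampMs: Long, batchLengthMs: Long = HourMs): Long =
    timestampMs / batchLengthMs

  // Earliest timestamp (inclusive) that belongs to a given batch.
  def batchStartMs(batch: Long, batchLengthMs: Long = HourMs): Long =
    batch * batchLengthMs
}
```

With a 1H batch, every event of the same hour lands in the same batch, so an error in the streaming layer stays confined to the batches that Hadoop has not yet recomputed.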
  • 16. UPDATING DATA IN REAL-TIME We use a data type: monoids « In abstract algebra, a branch of mathematics, a monoid is an algebraic structure with a single associative binary operation and an identity element. » Wikipedia Monoids are cool – Associativity loves parallelism – No need to be commutative
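The monoid definition quoted on slide 16 translates almost directly into code. The sketch below defines its own tiny `Monoid` trait rather than using Algebird's, so every name in it (`MonoidSketch`, `sum`, `chunkedSum`) is an illustrative assumption.

```scala
object MonoidSketch {
  // Minimal monoid: an identity element plus an associative binary operation.
  trait Monoid[T] {
    def zero: T
    def plus(a: T, b: T): T
  }

  object Monoid {
    implicit val longMonoid: Monoid[Long] = new Monoid[Long] {
      def zero: Long = 0L
      def plus(a: Long, b: Long): Long = a + b
    }

    // String concatenation is associative but NOT commutative.
    implicit val stringMonoid: Monoid[String] = new Monoid[String] {
      def zero: String = ""
      def plus(a: String, b: String): String = a + b
    }

    // Maps form a monoid whenever their values do: merge key by key.
    implicit def mapMonoid[K, V](implicit vm: Monoid[V]): Monoid[Map[K, V]] =
      new Monoid[Map[K, V]] {
        def zero: Map[K, V] = Map.empty
        def plus(a: Map[K, V], b: Map[K, V]): Map[K, V] =
          b.foldLeft(a) { case (acc, (k, v)) =>
            acc.updated(k, vm.plus(acc.getOrElse(k, vm.zero), v))
          }
      }
  }

  def sum[T](items: Seq[T])(implicit m: Monoid[T]): T =
    items.foldLeft(m.zero)(m.plus)

  // Associativity lets us sum chunks independently (e.g. on different workers)
  // and combine the partial results afterwards, as long as chunk order is kept.
  def chunkedSum[T](items: Seq[T], chunkSize: Int)(implicit m: Monoid[T]): T =
    sum(items.grouped(chunkSize).map(chunk => sum(chunk)).toSeq)
}
```

`chunkedSum` gives the same answer as `sum` for any chunk size, including for the non-commutative string monoid, which is why associativity alone is enough for parallel aggregation.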
  • 18. PUSHING DATA TO THE STORE For ease of programming, we define entities CampaignAffiliate & DisplayClick are stored in a « store »
  • 19. PUSHING DATA TO THE STORE
  • 20. BIJECTION How can we transform an entity to – An Array[Byte] for Memcached store? – A String for an in-memory Map[String, String] store? → Another cool library: Bijection! « In mathematics, an injective function or injection or one-to-one function is a function that preserves distinctness: it never maps distinct elements of its domain to the same element of its codomain ». Wikipedia Example: Int => String is an injection in the whole set of Strings because « foo » => Int is not possible « 12.42 » => Int is not possible
  • 21. BIJECTION: CODE SAMPLE A simple Injection from Twitter From Int to String From String to maybe an Int
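The "From Int to String / From String to maybe an Int" annotations on slide 21 describe a conversion that always succeeds one way and may fail the other way. Here is a standalone sketch of that shape; the `Injection` trait below is defined locally for illustration, and the exact signature of `invert` in the Bijection library should be checked against its documentation.

```scala
import scala.util.Try

object InjectionSketch {
  // apply always succeeds; invert may fail because not every B comes from an A.
  trait Injection[A, B] {
    def apply(a: A): B
    def invert(b: B): Option[A]
  }

  val intToString: Injection[Int, String] = new Injection[Int, String] {
    def apply(a: Int): String = a.toString
    // "foo" or "12.42" cannot be read back as an Int, so they yield None.
    def invert(b: String): Option[Int] = Try(b.toInt).toOption
  }
}
```

Round-tripping a value through `apply` and then `invert` always gives the original back, which is what makes an injection safe to use as a serialization format for a store.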
  • 22. DATA PERSISTENCE • Now we know how to convert a (CampaignAffiliate, DisplayClick) tuple to an Array[Byte] and we can even modify it • One last thing is missing: what’s the interaction with the data store? → Last cool thing: Storehaus!!!
  • 23. STOREHAUS « Storehaus is a library that makes it easy to work with asynchronous key value stores. Storehaus is built on top of Twitter's Future. » Twitter’s GitHub What is really cool: storehaus-algebird • Make all stores mergeable (using Monoid[V])
  • 24. STOREHAUS: CODE SAMPLE Injection because MySqlStore takes only MySqlValues Original store with only put/get operations Make the store Mergeable to allow updates
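The step from "a store with only put/get" to "a mergeable store" can be sketched with the standard library alone. This toy uses `scala.concurrent.Future` where Storehaus uses Twitter's Future, and `InMemoryStore` and `MergeableStore` are hypothetical classes written for this example, not the Storehaus types.

```scala
import scala.collection.concurrent.TrieMap
import scala.concurrent.{ExecutionContext, Future}

object MergeableStoreSketch {
  // Plain asynchronous key-value store: only get and put.
  class InMemoryStore[K, V] {
    private val data = TrieMap.empty[K, V]
    def get(k: K): Future[Option[V]] = Future.successful(data.get(k))
    def put(k: K, v: V): Future[Unit] = Future.successful { data.put(k, v); () }
  }

  // Adds merge: combine the stored value with a delta using an associative plus.
  // Not atomic: this sketch assumes a single writer per key.
  class MergeableStore[K, V](underlying: InMemoryStore[K, V])(plus: (V, V) => V)(
      implicit ec: ExecutionContext) {
    def get(k: K): Future[Option[V]] = underlying.get(k)

    def merge(k: K, delta: V): Future[Unit] =
      underlying.get(k).flatMap { current =>
        underlying.put(k, current.fold(delta)(plus(_, delta)))
      }
  }
}
```

The `plus` function plays the role of the monoid from slide 16: merging a delta into a missing key stores the delta itself, and merging into an existing key stores the combined value.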
  • 25. SUMMINGBIRD CLIENT HADOOP STORM Batch store Live store Client query With 2 different stores: we have to use the ClientStore from SummingBird to merge online and offline data With a single store: we need to know if the data is reliable or not
  • 27. Client query We want to query data up to BatchID #3: • We get the latest processed batch from Hadoop (#1) • We merge those results with the 2 latest batches from Storm
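The client query on slide 27 boils down to "take the Hadoop aggregate up to its last processed batch, then add the Storm batches that come after it". A minimal sketch of that merge follows, assuming counts are plain `Long` values; `ClientQuerySketch.query` and its parameters are hypothetical and do not reproduce SummingBird's ClientStore.

```scala
object ClientQuerySketch {
  // offline: (last batch fully processed by Hadoop, aggregate up to and including it).
  // online:  per-batch partial aggregates written by Storm, keyed by batch number.
  def query(
      offline: Option[(Long, Long)],
      online: Map[Long, Long],
      upToBatch: Long
  ): Long = {
    // With no offline result yet, start from batch 0 with an empty total.
    val (lastOfflineBatch, offlineTotal) = offline.getOrElse((-1L, 0L))
    // Only Storm batches AFTER the last Hadoop batch are used; older ones are superseded.
    val onlineTotal =
      ((lastOfflineBatch + 1) to upToBatch).map(b => online.getOrElse(b, 0L)).sum
    offlineTotal + onlineTotal
  }
}
```

In the slide's scenario, Hadoop has processed batch #1, so the Storm values for batches #2 and #3 are added and any Storm value for batch #1 is ignored, which is the sense in which Storm results are volatile.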
  • 28. DEPLOY EVERYTHING • We decided to build one assembly per platform… – … using SBT :( – Unfortunately the 2 samples you can find on all Internets are monolithic • This is how we run the Scalding part $> java <f******g lots of Hadoop conf keys> -cp <sh*t classpath value> org.apache.hadoop.util.RunJar job.jar JobName --arg1 value --arg2 value • The Storm JAR assembly is deployed as usual
  • 29. BENEFITS OF SUMMINGBIRD Cool :) • Write once, run everywhere – We plan to deploy pure Scalding jobs thanks to SummingBird – Same can be done with Storm • I didn’t need to learn any Storm o/ • Use of Scala to write concise jobs – Especially after some months using Cascading... • Opportunity to do some cool Open Source coding :) Not cool :( • SummingBird = SummingBird + Storehaus + Bijection + Algebird – Need to learn all those APIs – And it’s hard if you’re not a Scala master • Not all of these libraries are up-to-date • As for any new Open Source project, you won’t find: – Tutorials – Examples – StackOverflow posts – Weird Chinese mailing-lists listing your stacktrace
  • 30.
  • 31. WE HIRE :)

Editor's Notes

  1. Lighten the slide