Spark Streaming as Near-Real-Time ETL
Paris Data Geek 
18/09/2014 
Djamel Zouaoui 
@DjamelOnLine
Who am I?
Djamel Zouaoui
Director of Engineering
@DjamelOnLine
#Data #Scala #RecSys #Tech #MachineLearning #NoSql #BigData #Spark #Dev #R #Architecture
What is Spark?
Fast and Expressive Cluster Computing
Engine Compatible with Apache Hadoop
• Efficient
– General execution graphs
– In-memory storage
• Usable
– Rich APIs in Java, Scala, Python
– Interactive shell
RDD in Spark
• Resilient Distributed Dataset
• Storage abstraction for datasets in Spark
• Immutable
• Fault recovery
– Each RDD remembers how it was created, and can recover if any part of
the data is lost
• 3 kinds of operations
– Transformations: lazy in nature, create a new dataset from an existing one
– Actions: return a value or export data after performing a computation
– Persistence: cache a dataset (on disk, in RAM, or mixed) for future operations
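The three kinds of operations can be illustrated without a cluster. The sketch below is only an analogy in plain Scala, not Spark code: a collection view stands in for an RDD, and the object and method names are made up for illustration. It counts how often the mapping function runs, to show that a transformation computes nothing until an action forces it, and that a second action recomputes everything unless the result was kept.

```scala
// Plain-Scala analogy only: a lazy collection view stands in for an RDD.
object LazyTransformations {
  /** Returns the number of function evaluations
    * (before any action, after one action, after two actions). */
  def demo(): (Int, Int, Int) = {
    var evaluations = 0
    val dataset = List(1, 2, 3, 4)

    // "Transformation": only describes the computation, runs nothing yet.
    val doubled = dataset.view.map { x => evaluations += 1; x * 2 }
    val beforeAction = evaluations

    // "Action": forces the computation and returns a value.
    val total = doubled.sum
    val afterFirstAction = evaluations

    // A second action recomputes every element, because nothing was kept.
    // Keeping the result around is what "persistence" is for.
    val largest = doubled.max
    require(total == 20 && largest == 8)

    (beforeAction, afterFirstAction, evaluations)
  }
}
```

Calling `LazyTransformations.demo()` shows zero evaluations before the first action and a full recomputation on the second one.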
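sparkContext.textFile("hdfs://…")
  .flatMap(line => line.split(" "))   // split each line into words
  .map(word => (word, 1))             // pair each word with a count of 1
  .reduceByKey((a, b) => a + b)       // sum the counts per word
  .collect()                          // bring the results back to the driver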
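To see what each step of this word count produces without a Spark cluster, the same pipeline can be written against ordinary Scala collections. This is a local stand-in under stated assumptions: `LocalWordCount` is a hypothetical name, the input is an in-memory `Seq` instead of an HDFS file, and `groupBy` plays the role that the shuffle plays in `reduceByKey`.

```scala
// Local model of the word count, using plain Scala collections instead of RDDs.
object LocalWordCount {
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split(" "))     // one element per word
      .filter(_.nonEmpty)                   // drop empty tokens from repeated spaces
      .map(word => (word, 1))               // pair each word with a count of 1
      .groupBy { case (word, _) => word }   // bring equal words together (the "shuffle")
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) } // sum counts per word
}
```

For example, `LocalWordCount.wordCount(Seq("to be or", "not to be"))` counts "to" and "be" twice and the other words once.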
[Diagram, built up over several slides: the lineage textFile, flatMap, map, reduceByKey, collect for the word count above, then split into Stage 1 and Stage 2.]
Ecosystem
• Spark Streaming: DStreams (streams of RDDs)
• GraphX: RDD-based graphs
• MLlib: RDD-based matrices
• Spark SQL: RDD-based tables
• Spark RDD API (common base of the libraries above)
• Storage: HDFS, S3, Cassandra
• Cluster managers: YARN, Mesos, Standalone
What is Spark Streaming?
Project started in early 2012; extends Spark
for big data stream processing that:
• Scales to hundreds of nodes
• Achieves second-scale latencies
• Recovers efficiently from failures
• Integrates with batch and interactive processing
How does it work?
Definition
• Input source
• Input D-Stream
D-Stream computations
• Window level
• Stateful option
• …
Classic RDD manipulation
• Transformation
• Action
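The stateful option can be pictured as carrying a value from one micro-batch to the next. The sketch below is a plain-Scala model of that idea, not Spark Streaming code: each inner `Seq` is assumed to be the keys seen in one batch, and `StatefulModel` and `runningCounts` are hypothetical names. In Spark Streaming the corresponding operation is `updateStateByKey`.

```scala
// Plain-Scala model of a stateful stream computation: a running count per key
// that is updated once per micro-batch.
object StatefulModel {
  def runningCounts(batches: Seq[Seq[String]]): Seq[Map[String, Int]] =
    batches
      .scanLeft(Map.empty[String, Int]) { (state, batch) =>
        // Fold the keys of the current batch into the state left by the previous one.
        batch.foldLeft(state) { (counts, key) =>
          counts.updated(key, counts.getOrElse(key, 0) + 1)
        }
      }
      .tail // drop the initial empty state, keep one snapshot per batch
}
```

An empty batch leaves the state unchanged, which is what distinguishes this from a per-batch count.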
Code 
TOPOLOGY FREE
//StreamingContext & Input source creation 
//Standard transformations 
//Window usage 
//Start the streaming and put it in the background
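Only the four comments of this slide survive; the code itself is not in the transcript. As a rough illustration of what the middle two steps mean, here is a plain-Scala model with no Spark dependency. It is not the presenter's code and not the Spark Streaming API: a stream is assumed to be a sequence of micro-batches of text lines, and `MicroBatchModel`, `countWords`, and `windowedCounts` are hypothetical names. In Spark Streaming the same roles are played by a `StreamingContext`, DStream transformations, window operations, and `start()`.

```scala
// Plain-Scala model of micro-batches and sliding windows.
object MicroBatchModel {
  type Batch = Seq[String] // the lines received during one batch interval

  // Standard transformation: applied to each micro-batch independently.
  def countWords(batch: Batch): Map[String, Int] =
    batch
      .flatMap(_.split(" "))
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.size) }

  // Window usage: merge the counts of `windowLength` consecutive batches,
  // moving forward by `slide` batches each time.
  def windowedCounts(batches: Seq[Batch], windowLength: Int, slide: Int): Seq[Map[String, Int]] =
    batches
      .sliding(windowLength, slide)
      .map { window =>
        window.map(countWords).foldLeft(Map.empty[String, Int]) { (merged, counts) =>
          counts.foldLeft(merged) { case (m, (word, n)) =>
            m.updated(word, m.getOrElse(word, 0) + n)
          }
        }
      }
      .toList
}
```

With a window of two batches sliding by one, every batch except the first and last contributes to two consecutive results, which is the overlap a windowed count is meant to capture.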
Internals 
• Two main processes 
– Receivers, in charge of D-Stream creation
– Workers, in charge of data processing
• These processes are autonomous & independent 
– No cores & resources shared 
– No information shared
Execution Model – Receiving Data 
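[Diagram: StreamingContext.start() on the driver side (Spark Streaming + Spark driver, with the Network Input Tracker and the Block Manager Master) and a Receiver on the Spark workers. Data received by the Receiver is pushed as blocks to a Block Manager, and the blocks are replicated to another Block Manager.]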
Execution Model – Job Scheduling 
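[Diagram: on the driver side (Spark Streaming + Spark driver), the Network Input Tracker provides block IDs, the DStream Graph produces RDDs and jobs, and the Job Scheduler and Job Manager pass jobs through a job queue to Spark's schedulers. Jobs are executed on worker nodes, next to the Receiver and the Block Managers.]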
Use Case: Find True Love!
Build a recommender system based on implicit
and explicit data to find the best match for you
• Based on Machine Learning models
• Processed offline (batch)
• On big (bunch of) data
• Main goals of the streaming platform:
– Store a lot of data
– Clean it
– Transform it
Overview
[Diagram: Kafka topics feed the Data Receiver, followed by the Data Cleaning job and the Data Modeling job inside the Spark cluster; two HDFS Storage boxes are attached to the pipeline.]
• Spark in Standalone mode
• 120 cores available on Spark
• 4.5 GB RAM per core
• Based on a Hadoop cluster for HDFS storage (10 TB)
• HDP 2.0
• 8 machines (2 masters, 6 slaves)
Job details
Data Receiver
• Use of the provided Kafka source
• Naive implementation:
– Based on autocommit
– Automatic offset management
Data Cleaning job
• Cleaning with classic RDD transformations
• Persist new RDDs
– In HDFS for other Spark jobs (batch)
– In RAM to speed up the next step
Data Modeling job
• Binary matrix
• Scoring based on current events and history
– History is loaded from RDDs stored on HDFS
Issues
• Data loss
– In the receiver phase, due to the naive Kafka consumer
– Need a more robust client with manual offset management (vs.
autocommit)
• The delights of (de)serialisation
– Kryo / Avro / Parquet…: not directly due to Spark, but not easy
– Major issues occur during import/export steps
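Why committing offsets automatically can lose data is easiest to see in a small simulation. The sketch below is a plain-Scala model, not Kafka client code and not the consumer used in the talk: the log is an in-memory `Vector`, a crash is simulated once at a chosen offset, and all names are hypothetical. It simplifies autocommit to "the offset advances as soon as a batch is fetched", whereas a real auto-committing consumer commits periodically and independently of processing; the failure mode is the same.

```scala
// Plain-Scala model of offset handling: commit before processing (autocommit-like)
// versus commit only after the batch was stored (manual offset management).
object OffsetCommitModel {
  /** Consumes `log` in batches of `batchSize`. The first attempt at the batch that
    * starts at offset `failAt` crashes before its records are stored.
    * Returns the records that end up stored. */
  def run(log: Vector[String], batchSize: Int, failAt: Int,
          commitBeforeProcessing: Boolean): Vector[String] = {
    require(batchSize > 0, "batchSize must be positive")
    var committed = 0           // offset the consumer restarts from
    var alreadyCrashed = false
    var stored = Vector.empty[String]

    while (committed < log.length) {
      val start = committed
      val batch = log.slice(start, start + batchSize)
      val end = start + batch.length

      if (commitBeforeProcessing) committed = end // offset moves on before processing

      if (start == failAt && !alreadyCrashed) {
        alreadyCrashed = true // simulated crash: nothing stored, consumer restarts
      } else {
        stored = stored ++ batch
        if (!commitBeforeProcessing) committed = end // commit only after success
      }
    }
    stored
  }
}
```

With commit-before-processing, the batch in flight during the crash is never seen again; with commit-after-success it is read a second time, which gives at-least-once delivery at the cost of possible duplicates downstream.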
And Beyond… @VIADEO
More than ETL, an analytics backend
[Diagram: RabbitMQ feeds a Data Receiver and several Data Modeling jobs on the Spark cluster; results go to a generic index on an ElasticSearch cluster, which backs a D3.js web app.]
Join the Viadeo adventure
Wanted: Software Engineers
• We use Node.js, Spark, ElasticSearch, CQRS, AWS and many more…
• We love full-stack engineers and a flat organization
• We work in autonomous product teams
• We lunch for free ;-)
QUESTIONS ?

Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 

Paris Data Geek - Spark Streaming

  • 1. Spark Streaming As Near Realtime ETL Paris Data Geek 18/09/2014 Djamel Zouaoui @DjamelOnLine
  • 2. Who am I ? Djamel Zouaoui Director Of Engineering @DjamelOnLine #Data #Scala #RecSys #Tech #MachineLearning #NoSql #BigData #Spark #Dev #R #Architecture
  • 3. What is Spark? Fast and Expressive Cluster Computing Engine Compatible with Apache Hadoop • Efficient: general execution graphs, in-memory storage • Usable: rich APIs in Java, Scala, Python; interactive shell
  • 4. RDD in Spark • Resilient Distributed Dataset • Storage abstraction for datasets in Spark • Immutable • Fault recovery – Each RDD remembers how it was created and can recover if any part of the data is lost • 3 kinds of operations – Transformations: lazy in nature, create a new dataset from an existing one – Actions: return a value or export data after performing a computation – Persistence: cache a dataset (on disk, in RAM, or mixed) for future operations
  • 5. sparkContext.textFile("hdfs://…") .flatMap(line => line.split(" ")) .map(word => (word, 1)) .reduceByKey((a, b) => a + b) .collect()
  • 6. Diagram: textFile, map, map, reduceByKey, collect. sparkContext.textFile("hdfs://…") .flatMap(line => line.split(" ")) .map(word => (word, 1)) .reduceByKey((a, b) => a + b) .collect()
  • 7. sparkContext.textFile("hdfs://…") .flatMap(line => line.split(" ")) .map(word => (word, 1)) .reduceByKey((a, b) => a + b) .collect() Diagram: textFile, map, map, reduceByKey, collect, split into Stage 1 and Stage 2
  • 8. sparkContext.textFile("hdfs://…") .flatMap(line => line.split(" ")) .map(word => (word, 1)) .reduceByKey((a, b) => a + b) .collect() Diagram: textFile, map, map, reduceByKey, collect, split into Stage 1 and Stage 2
  • 9. Ecosystem. Spark Streaming (DStreams: streams of RDDs); GraphX (RDD-based graphs); MLlib (RDD-based matrices); Spark SQL (RDD-based tables); all on top of the Spark RDD API; storage: HDFS, S3, Cassandra; cluster managers: YARN, Mesos, Standalone
  • 10. What is Spark Streaming? Project started in early 2012 that extends Spark for big data stream processing and: scales to hundreds of nodes; achieves second-scale latencies; recovers efficiently from failures; integrates with batch and interactive processing
  • 13. How it works? • Input source definition • Input D-Stream • D-Stream computations: window level, stateful option, … • Classic RDD manipulation: transformation, action
  • 14. Code: TOPOLOGY FREE //StreamingContext & input source creation //Standard transformations //Window usage //Start the streaming and put it in the background
  • 15. Internals • Two main processes – Receivers, which are in charge of D-Stream creation – Workers, which are in charge of data processing • These processes are autonomous & independent – No cores or resources shared – No information shared
  • 16. Execution Model – Receiving Data. Diagram labels: Spark Streaming + Spark Driver; Spark Workers; StreamingContext.start(); Network Input Tracker; Receiver; Data received; Blocks pushed; Blocks replicated; Block Manager (two instances); Block Manager Master
  • 17. Execution Model – Job Scheduling. Diagram labels: Spark Streaming + Spark Driver; Network Input Tracker; RDDs; Block IDs; Job Scheduler; Spark’s Schedulers; Receiver; Block Manager (two instances); Jobs executed on worker nodes; DStream Graph; Job Manager; Job Queue; Jobs
  • 18. Use Case: Find The True Love! Build a recommender system based on implicit and explicit data to find the best match for you • Based on Machine Learning models • Processed offline (batch) • On big (bunch of) data • Main goals of streaming platforms: – Need to store a lot of data – Need to clean it – Need to transform it
  • 19. Overview. Diagram labels: KAFKA Topics; Data Receiver; Data Cleaning job; Data Modeling job; HDFS Storage (twice); Spark Cluster • Spark in Standalone mode • 120 cores available on Spark • 4.5 GB RAM per core • Based on a Hadoop cluster for HDFS storage (10 TB) • HDP 2.0 • 8 machines (2 masters, 6 slaves)
  • 20. Job details. Diagram labels: Data Receiver; Data Cleaning job; Data Modeling job; HDFS Storage (twice) • Use of the provided Kafka source • Naive implementation: – Based on autocommit – Automatic offset management • Cleaning with classic RDD transformations • Persist new RDDs – In HDFS for other Spark jobs (batch) – In RAM to speed up the next step • Binary matrix • Scoring based on current events and history – History is loaded from RDDs stored on HDFS
  • 21. Issues • Data loss – In the receiver phase, due to the naive Kafka consumer – Need a more robust client with manual offset management (vs. autocommit) • The delights of (de)serialisation – Kryo / Avro / Parquet…: not directly due to Spark, but not easy • Major issues occur during import/export steps
  • 22. And Beyond… @VIADEO: More than ETL, an analytics backend. Diagram labels: RabbitMQ; Data Receiver; Data Modeling (several jobs); Spark Cluster; Generic Index; ElasticSearch Cluster; D3.js webapp
  • 23. Join the Viadeo adventure. Wanted: Software Engineers • We use Node.js, Spark, ElasticSearch, CQRS, AWS and many more… • We love FullStack Engineers and a flat organization • We work in autonomous product teams • We lunch for free ;-)
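
Slide 14 gives only the outline of a streaming job as comments (context and input source creation, standard transformations, window usage, start). A minimal Scala sketch of that outline could look like the following. It rests on assumptions that are not from the talk: a socket text source on localhost:9999 stands in for the Kafka source, the batch interval is 2 seconds, and the window is 30 seconds sliding every 10 seconds. The object name StreamingWordCount and the helper tokenize are hypothetical.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
// Older Spark 1.x releases may also need, for the pair operations used below:
// import org.apache.spark.streaming.StreamingContext._

object StreamingWordCount {

  // Hypothetical helper: split a line on spaces and emit (word, 1) pairs,
  // dropping the empty tokens produced by repeated spaces.
  def tokenize(line: String): Seq[(String, Int)] =
    line.split(" ").filter(_.nonEmpty).map(word => (word, 1)).toSeq

  def main(args: Array[String]): Unit = {
    // StreamingContext & input source creation.
    // local[2]: one thread for the receiver, one for processing (assumed local run).
    val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(2)) // assumed 2 s batch interval
    val lines = ssc.socketTextStream("localhost", 9999) // assumed source, not Kafka

    // Standard transformations: the same operators as on a plain RDD.
    val pairs = lines.flatMap(tokenize)

    // Window usage: counts over the last 30 s, recomputed every 10 s.
    // Both durations must be multiples of the batch interval.
    val windowedCounts =
      pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

    // Output operation: print the first elements of each windowed RDD.
    windowedCounts.print()

    // Start the streaming; start() returns immediately and the job runs in the
    // background, so awaitTermination() keeps the driver alive.
    ssc.start()
    ssc.awaitTermination()
  }
}
```

For a local try-out, a tool such as `nc -lk 9999` can feed lines to the socket. The same transformations apply unchanged to a Kafka-backed D-Stream, since only the input source definition differs.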