SlideShare a Scribd company logo
1 of 34
Spark Meetup chez Viadeo
Mercredi 4 février 2015
• 19h-19h45 : Présentation de la technologie Spark et exemple de nouveaux cas métiers
pouvant être traités par du BigData temps réel.
Cédric Carbone - Cofondateur d'Influans – cedric@influans.com -Spark vs Hadoop
MapReduce
-Spark Streaming vs Storm
-Le Machine Learning avec Spark -Use case métier : NextProductToBuy
• 19h45-20h : Extension de Spark (Tachyon / Spark JobServer).
Jonathan Lamiel - Talend Labs – jlamiel@talend.com
-La mémoire partagée de Spark avec Tachyon -Rendre Spark Interactif avec Spark
JobServer
• 20h-21h : Big Data analytics avec Spark & Cassandra.
DuyHai DOAN - Technical Advocate at DataStax – duy_hai.doan@datastax.com
Apache Spark is a general data processing framework which allows you perform data
processing tasks in memory. Apache Cassandra is a highly available and massively scalable
NoSQL data-store.
By combining Spark flexible API and Cassandra performance, we get an interesting combo
for both real-time and batch processing.
Map Reduce
➜ Map() : parse inputs and generate 0 to n <key, value>
➜ Reduce() : sums all values of the same key and
generate a <key, value>
WordCount Example
➜ Each map take a line as an input and break into words
• It emits a key/value pair of the word and 1
➜ Each Reducer sums the counts for each word
• It emits a key/value pair of the word and sum
Map Reduce
Hadoop MapReduce v1
Hadoop MapReduce v1
Hadoop MapReduce v1
MapReduce v1
Not good for low-latency jobs on smallest dataset
MapReduce v1
Good for off-line batch jobs on massive data
Hadoop 1
➜ Batch ONLY
• High latency jobs
HDFS (Redundant, Reliable Storage)
MapReduce1
Cluster Resource Management + Data Processing
BATCH
HIVE
Query
Pig
Scripting
Cascading
Accelerate Dev.
Hadoop2 : Big Data
Operating System
➜ Customers want to store ALL DATA in one place and
interact with it in MULTIPLE WAYS
• Simultaneously & with predictable levels of service
• Data analysts and real-time applications
HDFS (Redundant, Reliable Storage)
MapReduce1
Data Processing
BATCH
YARN (Cluster Resource Management)
Other
Data Processing
…
Hadoop2 : Big Data
Operating System
➜ Customers want to store ALL DATA in one place and
interact with it in MULTIPLE WAYS
• Simultaneously & with predictable levels of service
• Data analysts and real-time applications
HDFS (Redundant, Reliable Storage)
BATCH INTERACTIVE STREAMING GRAPH ML IN-MEMORYONLINE SEARCH
YARN (Cluster Resource Management)
Hadoop2 : Big Data
Operating System
➜ Customers want to store ALL DATA in one place and
interact with it in MULTIPLE WAYS
• Simultaneously & with predictable levels of service
• Data analysts and real-time applications
HDFS (Redundant, Reliable Storage)
YARN (Cluster Resource Management)
BATCH
(MapReduce)
INTERACTIVE
(Tez)
STREAMING
(Storm, Samza
SparkStreaming)
GRAPH
(Giraph,
GraphX)
Machine
Learning
(MLLIb)
In-Memory
(Spark)
ONLINE
(Hbase HOYA)
OTHER
(ElasticSearch)
https://spark.apache.org
Apache Spark™ is a
fast and general
engine for large-scale
data processing.
The most active project
0
50
100
150
200
250
Patches
MapReduce Storm
Yarn Spark
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
Lines Added
MapReduce Storm
Yarn Spark
Spark won the
Daytona GraySort contest!
Run programs up to 100x
faster than Hadoop
MapReduce in memory, or 10x
faster on disk.
Sort on disk 100TB of data 3x faster than Hadoop
MapReduce using 10x fewer machines.
RDD & Operation
Resilient Distributed Datasets (RDDs)
Operations
➜ Transformations (e.g. map, filter, groupBy)
➜ Actions (e.g. count, collect, save)
Spark
scala> val textFile = sc.textFile("README.md")
➜ textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3
scala> textFile.count()
➜ res0: Long = 126
scala> textFile.first()
➜ res1: String = # Apache Spark
scala> val linesWithSpark = textFile.filter(line =>
line.contains("Spark"))
➜ linesWithSpark: spark.RDD[String]=spark.FilteredRDD@7dd4
scala>
textFile.filter(line=>line.contains("Spark")).count()
➜ res3: Long = 15
Streaming
Streaming
Storm
Storm
Storm vs Spark
Spark Streaming Storm Storm Trident
Processing model Micro batches Record-at-a-time Micro batches
Thoughput ++++ ++ ++++
Latency Second Sub-second Second
Reliability Models Exactly once At least once Exactly once
Embedded Hadoop Distro HDP, CDH, MapR HDP HDP
Support Databricks N/A N/A
Community ++++ ++ ++
Spark Storm
Scope Batch, Streaming, Graph, ML, SQL Streaming only
Machine Learning Library (Mllib)
Collaborative Filtering
Collaborative Filtering
(learning)
Collaborative Filtering
(learning)
Collaborative Filtering
(learning)
Collaborative Filtering :
Let’s use the model
Collaborative Filtering :
similar behaviors
Collaborative Filtering
Prediction
Netflix Prize (2009)
Netflix is a provider of on-demand Internet streaming media
Input Data
UserID::MovieID::Rating::Timestamp
1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968
Etc…
2::1357::5::978298709
2::3068::4::978299000
2::1537::4::978299620
The result
1 ; Lyndon Wilson ; 4.608531808535918 ; 858 ; Godfather, The (1972)
1 ; Lyndon Wilson ; 4.596556961095434 ; 318 ; Shawshank Redemption, The (1994)
1 ; Lyndon Wilson ; 4.575789377957803 ; 527 ; Schindler's List (1993)
1 ; Lyndon Wilson ; 4.549694932928024 ; 593 ; Silence of the Lambs, The (1991)
1 ; Lyndon Wilson ; 4.46311974037361 ; 919 ; Wizard of Oz, The (1939)
2 ; Benjamin Harrison ; 4.99545499047152 ; 318 ; Shawshank Redemption, The (1994)
2 ; Benjamin Harrison ; 4.94255532354725 ; 356 ; Forrest Gump (1994)
2 ; Benjamin Harrison ; 4.80168679606128 ; 527 ; Schindler's List (1993)
2 ; Benjamin Harrison ; 4.7874247577586795 ; 1097 ; E.T. the Extra-Terrestrial (1982)
2 ; Benjamin Harrison ; 4.7635998147872325 ; 110 ; Braveheart (1995)
3 ; Richard Hoover ; 4.962687467351026 ; 110 ; Braveheart (1995)
3 ; Richard Hoover ; 4.8316542374095315 ; 318 ; Shawshank Redemption, The (1994)
3 ; Richard Hoover ; 4.7307103243995385 ; 356 ; Forrest Gump (19
Real Time Big Data
Use Case
Next Product To Buy
➜ Right Person
➜ Right Product
➜ Right Price
➜ Right Time
➜ Right Channel
Questions?
Cédric Carbone
cedric@influans.com
@carbone
www.hugfrance.fr
hug-france-orga@googlegroups.com
@hugfrance

More Related Content

What's hot

Spark Under the Hood - Meetup @ Data Science London
Spark Under the Hood - Meetup @ Data Science LondonSpark Under the Hood - Meetup @ Data Science London
Spark Under the Hood - Meetup @ Data Science LondonDatabricks
 
An efficient data mining solution by integrating Spark and Cassandra
An efficient data mining solution by integrating Spark and CassandraAn efficient data mining solution by integrating Spark and Cassandra
An efficient data mining solution by integrating Spark and CassandraStratio
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in SparkDatabricks
 
Spark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with SparkSpark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with SparkDatabricks
 
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)Databricks
 
Lessons from Running Large Scale Spark Workloads
Lessons from Running Large Scale Spark WorkloadsLessons from Running Large Scale Spark Workloads
Lessons from Running Large Scale Spark WorkloadsDatabricks
 
Spark Meetup at Uber
Spark Meetup at UberSpark Meetup at Uber
Spark Meetup at UberDatabricks
 
Using SparkR to Scale Data Science Applications in Production. Lessons from t...
Using SparkR to Scale Data Science Applications in Production. Lessons from t...Using SparkR to Scale Data Science Applications in Production. Lessons from t...
Using SparkR to Scale Data Science Applications in Production. Lessons from t...Spark Summit
 
What's new in pandas and the SciPy stack for financial users
What's new in pandas and the SciPy stack for financial usersWhat's new in pandas and the SciPy stack for financial users
What's new in pandas and the SciPy stack for financial usersWes McKinney
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkRahul Jain
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksDatabricks
 
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...Spark Summit
 
Adding Complex Data to Spark Stack by Tug Grall
Adding Complex Data to Spark Stack by Tug GrallAdding Complex Data to Spark Stack by Tug Grall
Adding Complex Data to Spark Stack by Tug GrallSpark Summit
 
Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015Databricks
 
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0Databricks
 
Spark Summit EU talk by Miklos Christine paddling up the stream
Spark Summit EU talk by Miklos Christine paddling up the streamSpark Summit EU talk by Miklos Christine paddling up the stream
Spark Summit EU talk by Miklos Christine paddling up the streamSpark Summit
 
New directions for Apache Spark in 2015
New directions for Apache Spark in 2015New directions for Apache Spark in 2015
New directions for Apache Spark in 2015Databricks
 
Announcing Databricks Cloud (Spark Summit 2014)
Announcing Databricks Cloud (Spark Summit 2014)Announcing Databricks Cloud (Spark Summit 2014)
Announcing Databricks Cloud (Spark Summit 2014)Databricks
 

What's hot (20)

Spark Under the Hood - Meetup @ Data Science London
Spark Under the Hood - Meetup @ Data Science LondonSpark Under the Hood - Meetup @ Data Science London
Spark Under the Hood - Meetup @ Data Science London
 
An efficient data mining solution by integrating Spark and Cassandra
An efficient data mining solution by integrating Spark and CassandraAn efficient data mining solution by integrating Spark and Cassandra
An efficient data mining solution by integrating Spark and Cassandra
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
 
Spark meetup TCHUG
Spark meetup TCHUGSpark meetup TCHUG
Spark meetup TCHUG
 
Spark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with SparkSpark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with Spark
 
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
 
Lessons from Running Large Scale Spark Workloads
Lessons from Running Large Scale Spark WorkloadsLessons from Running Large Scale Spark Workloads
Lessons from Running Large Scale Spark Workloads
 
Spark Meetup at Uber
Spark Meetup at UberSpark Meetup at Uber
Spark Meetup at Uber
 
Using SparkR to Scale Data Science Applications in Production. Lessons from t...
Using SparkR to Scale Data Science Applications in Production. Lessons from t...Using SparkR to Scale Data Science Applications in Production. Lessons from t...
Using SparkR to Scale Data Science Applications in Production. Lessons from t...
 
What's new in pandas and the SciPy stack for financial users
What's new in pandas and the SciPy stack for financial usersWhat's new in pandas and the SciPy stack for financial users
What's new in pandas and the SciPy stack for financial users
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
 
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
 
Adding Complex Data to Spark Stack by Tug Grall
Adding Complex Data to Spark Stack by Tug GrallAdding Complex Data to Spark Stack by Tug Grall
Adding Complex Data to Spark Stack by Tug Grall
 
Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015
 
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
 
Spark Summit EU talk by Miklos Christine paddling up the stream
Spark Summit EU talk by Miklos Christine paddling up the streamSpark Summit EU talk by Miklos Christine paddling up the stream
Spark Summit EU talk by Miklos Christine paddling up the stream
 
Apache Spark Briefing
Apache Spark BriefingApache Spark Briefing
Apache Spark Briefing
 
New directions for Apache Spark in 2015
New directions for Apache Spark in 2015New directions for Apache Spark in 2015
New directions for Apache Spark in 2015
 
Announcing Databricks Cloud (Spark Summit 2014)
Announcing Databricks Cloud (Spark Summit 2014)Announcing Databricks Cloud (Spark Summit 2014)
Announcing Databricks Cloud (Spark Summit 2014)
 

Similar to Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / NextProductToBuy

Big Data, Simple and Fast: Addressing the Shortcomings of Hadoop
Big Data, Simple and Fast: Addressing the Shortcomings of HadoopBig Data, Simple and Fast: Addressing the Shortcomings of Hadoop
Big Data, Simple and Fast: Addressing the Shortcomings of HadoopHazelcast
 
Taboola Road To Scale With Apache Spark
Taboola Road To Scale With Apache SparkTaboola Road To Scale With Apache Spark
Taboola Road To Scale With Apache Sparktsliwowicz
 
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in ProductionTugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in ProductionCodemotion
 
2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI Pipelines
2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI Pipelines2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI Pipelines
2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI PipelinesTimothy Spann
 
Taboola's experience with Apache Spark (presentation @ Reversim 2014)
Taboola's experience with Apache Spark (presentation @ Reversim 2014)Taboola's experience with Apache Spark (presentation @ Reversim 2014)
Taboola's experience with Apache Spark (presentation @ Reversim 2014)tsliwowicz
 
Essential Data Engineering for Data Scientist
Essential Data Engineering for Data Scientist Essential Data Engineering for Data Scientist
Essential Data Engineering for Data Scientist SoftServe
 
Big Data & Open Source - Neil Jadhav
Big Data & Open Source - Neil JadhavBig Data & Open Source - Neil Jadhav
Big Data & Open Source - Neil JadhavSwapnil (Neil) Jadhav
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingPaco Nathan
 
Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi
Apache-Flink-What-How-Why-Who-Where-by-Slim-BaltagiApache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi
Apache-Flink-What-How-Why-Who-Where-by-Slim-BaltagiSlim Baltagi
 
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyScaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyRohit Kulkarni
 
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014Dataiku
 
Making Machine Learning Easy with H2O and WebFlux
Making Machine Learning Easy with H2O and WebFluxMaking Machine Learning Easy with H2O and WebFlux
Making Machine Learning Easy with H2O and WebFluxTrayan Iliev
 
Azure Cafe Marketplace with Hortonworks March 31 2016
Azure Cafe Marketplace with Hortonworks March 31 2016Azure Cafe Marketplace with Hortonworks March 31 2016
Azure Cafe Marketplace with Hortonworks March 31 2016Joan Novino
 
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)Jason Dai
 
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 1)
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 1)SQLSaturday #230 - Introduction to Microsoft Big Data (Part 1)
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 1)Sascha Dittmann
 
BDTC2015 databricks-辛湜-state of spark
BDTC2015 databricks-辛湜-state of sparkBDTC2015 databricks-辛湜-state of spark
BDTC2015 databricks-辛湜-state of sparkJerry Wen
 
Hadoop and IoT Sinergija 2014
Hadoop and IoT Sinergija 2014Hadoop and IoT Sinergija 2014
Hadoop and IoT Sinergija 2014Milos Milovanovic
 

Similar to Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / NextProductToBuy (20)

Big Data, Simple and Fast: Addressing the Shortcomings of Hadoop
Big Data, Simple and Fast: Addressing the Shortcomings of HadoopBig Data, Simple and Fast: Addressing the Shortcomings of Hadoop
Big Data, Simple and Fast: Addressing the Shortcomings of Hadoop
 
Taboola Road To Scale With Apache Spark
Taboola Road To Scale With Apache SparkTaboola Road To Scale With Apache Spark
Taboola Road To Scale With Apache Spark
 
Real time analytics
Real time analyticsReal time analytics
Real time analytics
 
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in ProductionTugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
 
2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI Pipelines
2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI Pipelines2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI Pipelines
2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI Pipelines
 
Taboola's experience with Apache Spark (presentation @ Reversim 2014)
Taboola's experience with Apache Spark (presentation @ Reversim 2014)Taboola's experience with Apache Spark (presentation @ Reversim 2014)
Taboola's experience with Apache Spark (presentation @ Reversim 2014)
 
Essential Data Engineering for Data Scientist
Essential Data Engineering for Data Scientist Essential Data Engineering for Data Scientist
Essential Data Engineering for Data Scientist
 
Big Data & Open Source - Neil Jadhav
Big Data & Open Source - Neil JadhavBig Data & Open Source - Neil Jadhav
Big Data & Open Source - Neil Jadhav
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
 
Big data clustering
Big data clusteringBig data clustering
Big data clustering
 
Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi
Apache-Flink-What-How-Why-Who-Where-by-Slim-BaltagiApache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi
Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi
 
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyScaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
 
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014
 
Making Machine Learning Easy with H2O and WebFlux
Making Machine Learning Easy with H2O and WebFluxMaking Machine Learning Easy with H2O and WebFlux
Making Machine Learning Easy with H2O and WebFlux
 
Azure Cafe Marketplace with Hortonworks March 31 2016
Azure Cafe Marketplace with Hortonworks March 31 2016Azure Cafe Marketplace with Hortonworks March 31 2016
Azure Cafe Marketplace with Hortonworks March 31 2016
 
Javantura v3 - Real-time BigData ingestion and querying of aggregated data – ...
Javantura v3 - Real-time BigData ingestion and querying of aggregated data – ...Javantura v3 - Real-time BigData ingestion and querying of aggregated data – ...
Javantura v3 - Real-time BigData ingestion and querying of aggregated data – ...
 
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
 
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 1)
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 1)SQLSaturday #230 - Introduction to Microsoft Big Data (Part 1)
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 1)
 
BDTC2015 databricks-辛湜-state of spark
BDTC2015 databricks-辛湜-state of sparkBDTC2015 databricks-辛湜-state of spark
BDTC2015 databricks-辛湜-state of spark
 
Hadoop and IoT Sinergija 2014
Hadoop and IoT Sinergija 2014Hadoop and IoT Sinergija 2014
Hadoop and IoT Sinergija 2014
 

Recently uploaded

办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书zdzoqco
 
Call Girls Service Adil Nagar 7001305949 Need escorts Service Pooja Vip
Call Girls Service Adil Nagar 7001305949 Need escorts Service Pooja VipCall Girls Service Adil Nagar 7001305949 Need escorts Service Pooja Vip
Call Girls Service Adil Nagar 7001305949 Need escorts Service Pooja VipCall Girls Lucknow
 
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)Christopher H Felton
 
Top 10 Interactive Website Design Trends in 2024.pptx
Top 10 Interactive Website Design Trends in 2024.pptxTop 10 Interactive Website Design Trends in 2024.pptx
Top 10 Interactive Website Design Trends in 2024.pptxDyna Gilbert
 
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170Sonam Pathan
 
Blepharitis inflammation of eyelid symptoms cause everything included along w...
Blepharitis inflammation of eyelid symptoms cause everything included along w...Blepharitis inflammation of eyelid symptoms cause everything included along w...
Blepharitis inflammation of eyelid symptoms cause everything included along w...Excelmac1
 
Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24Paul Calvano
 
Git and Github workshop GDSC MLRITM
Git and Github  workshop GDSC MLRITMGit and Github  workshop GDSC MLRITM
Git and Github workshop GDSC MLRITMgdsc13
 
PHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 DocumentationPHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 DocumentationLinaWolf1
 
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)Dana Luther
 
Magic exist by Marta Loveguard - presentation.pptx
Magic exist by Marta Loveguard - presentation.pptxMagic exist by Marta Loveguard - presentation.pptx
Magic exist by Marta Loveguard - presentation.pptxMartaLoveguard
 
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012rehmti665
 
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一Fs
 
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作ys8omjxb
 
Film cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasaFilm cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasa494f574xmv
 
Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Call Girls Near The Suryaa Hotel New Delhi 9873777170
Call Girls Near The Suryaa Hotel New Delhi 9873777170Call Girls Near The Suryaa Hotel New Delhi 9873777170
Call Girls Near The Suryaa Hotel New Delhi 9873777170Sonam Pathan
 

Recently uploaded (20)

办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
 
Call Girls Service Adil Nagar 7001305949 Need escorts Service Pooja Vip
Call Girls Service Adil Nagar 7001305949 Need escorts Service Pooja VipCall Girls Service Adil Nagar 7001305949 Need escorts Service Pooja Vip
Call Girls Service Adil Nagar 7001305949 Need escorts Service Pooja Vip
 
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
 
Top 10 Interactive Website Design Trends in 2024.pptx
Top 10 Interactive Website Design Trends in 2024.pptxTop 10 Interactive Website Design Trends in 2024.pptx
Top 10 Interactive Website Design Trends in 2024.pptx
 
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
 
Blepharitis inflammation of eyelid symptoms cause everything included along w...
Blepharitis inflammation of eyelid symptoms cause everything included along w...Blepharitis inflammation of eyelid symptoms cause everything included along w...
Blepharitis inflammation of eyelid symptoms cause everything included along w...
 
Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24
 
Hot Sexy call girls in Rk Puram 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in  Rk Puram 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in  Rk Puram 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Rk Puram 🔝 9953056974 🔝 Delhi escort Service
 
Git and Github workshop GDSC MLRITM
Git and Github  workshop GDSC MLRITMGit and Github  workshop GDSC MLRITM
Git and Github workshop GDSC MLRITM
 
PHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 DocumentationPHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 Documentation
 
Model Call Girl in Jamuna Vihar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in  Jamuna Vihar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in  Jamuna Vihar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Jamuna Vihar Delhi reach out to us at 🔝9953056974🔝
 
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
 
Magic exist by Marta Loveguard - presentation.pptx
Magic exist by Marta Loveguard - presentation.pptxMagic exist by Marta Loveguard - presentation.pptx
Magic exist by Marta Loveguard - presentation.pptx
 
young call girls in Uttam Nagar🔝 9953056974 🔝 Delhi escort Service
young call girls in Uttam Nagar🔝 9953056974 🔝 Delhi escort Serviceyoung call girls in Uttam Nagar🔝 9953056974 🔝 Delhi escort Service
young call girls in Uttam Nagar🔝 9953056974 🔝 Delhi escort Service
 
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
 
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一
 
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
 
Film cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasaFilm cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasa
 
Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝
 
Call Girls Near The Suryaa Hotel New Delhi 9873777170
Call Girls Near The Suryaa Hotel New Delhi 9873777170Call Girls Near The Suryaa Hotel New Delhi 9873777170
Call Girls Near The Suryaa Hotel New Delhi 9873777170
 

Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / NextProductToBuy

  • 1. Spark Meetup chez Viadeo Mercredi 4 février 2015 • 19h-19h45 : Présentation de la technologie Spark et exemple de nouveaux cas métiers pouvant être traités par du BigData temps réel. Cédric Carbone - Cofondateur d'Influans – cedric@influans.com -Spark vs Hadoop MapReduce -Spark Streaming vs Storm -Le Machine Learning avec Spark -Use case métier : NextProductToBuy • 19h45-20h : Extension de Spark (Tachyon / Spark JobServer). Jonathan Lamiel - Talend Labs – jlamiel@talend.com -La mémoire partagée de Spark avec Tachyon -Rendre Spark Interactif avec Spark JobServer • 20h-21h : Big Data analytics avec Spark & Cassandra. DuyHai DOAN - Technical Advocate at DataStax – duy_hai.doan@datastax.com Apache Spark is a general data processing framework which allows you perform data processing tasks in memory. Apache Cassandra is a highly available and massively scalable NoSQL data-store. By combining Spark flexible API and Cassandra performance, we get an interesting combo for both real-time and batch processing.
  • 2. Map Reduce ➜ Map() : parse inputs and generate 0 to n <key, value> ➜ Reduce() : sums all values of the same key and generate a <key, value> WordCount Example ➜ Each map take a line as an input and break into words • It emits a key/value pair of the word and 1 ➜ Each Reducer sums the counts for each word • It emits a key/value pair of the word and sum
  • 7. MapReduce v1 Not good for low-latency jobs on smallest dataset
  • 8. MapReduce v1 Good for off-line batch jobs on massive data
  • 9. Hadoop 1 ➜ Batch ONLY • High latency jobs HDFS (Redundant, Reliable Storage) MapReduce1 Cluster Resource Management + Data Processing BATCH HIVE Query Pig Scripting Cascading Accelerate Dev.
  • 10. Hadoop2 : Big Data Operating System ➜ Customers want to store ALL DATA in one place and interact with it in MULTIPLE WAYS • Simultaneously & with predictable levels of service • Data analysts and real-time applications HDFS (Redundant, Reliable Storage) MapReduce1 Data Processing BATCH YARN (Cluster Resource Management) Other Data Processing …
  • 11. Hadoop2 : Big Data Operating System ➜ Customers want to store ALL DATA in one place and interact with it in MULTIPLE WAYS • Simultaneously & with predictable levels of service • Data analysts and real-time applications HDFS (Redundant, Reliable Storage) BATCH INTERACTIVE STREAMING GRAPH ML IN-MEMORYONLINE SEARCH YARN (Cluster Resource Management)
  • 12. Hadoop2 : Big Data Operating System ➜ Customers want to store ALL DATA in one place and interact with it in MULTIPLE WAYS • Simultaneously & with predictable levels of service • Data analysts and real-time applications HDFS (Redundant, Reliable Storage) YARN (Cluster Resource Management) BATCH (MapReduce) INTERACTIVE (Tez) STREAMING (Storm, Samza SparkStreaming) GRAPH (Giraph, GraphX) Machine Learning (MLLIb) In-Memory (Spark) ONLINE (Hbase HOYA) OTHER (ElasticSearch)
  • 13. https://spark.apache.org Apache Spark™ is a fast and general engine for large-scale data processing.
  • 14. The most active project 0 50 100 150 200 250 Patches MapReduce Storm Yarn Spark 0 5000 10000 15000 20000 25000 30000 35000 40000 45000 Lines Added MapReduce Storm Yarn Spark
  • 15. Spark won the Daytona GraySort contest! Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. Sort on disk 100TB of data 3x faster than Hadoop MapReduce using 10x fewer machines.
  • 16. RDD & Operation Resilient Distributed Datasets (RDDs) Operations ➜ Transformations (e.g. map, filter, groupBy) ➜ Actions (e.g. count, collect, save)
  • 17. Spark scala> val textFile = sc.textFile("README.md") ➜ textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3 scala> textFile.count() ➜ res0: Long = 126 scala> textFile.first() ➜ res1: String = # Apache Spark scala> val linesWithSpark = textFile.filter(line => line.contains("Spark")) ➜ linesWithSpark: spark.RDD[String]=spark.FilteredRDD@7dd4 scala> textFile.filter(line=>line.contains("Spark")).count() ➜ res3: Long = 15
  • 19. Storm
  • 20. Storm
  • 21. Storm vs Spark Spark Streaming Storm Storm Trident Processing model Micro batches Record-at-a-time Micro batches Thoughput ++++ ++ ++++ Latency Second Sub-second Second Reliability Models Exactly once At least once Exactly once Embedded Hadoop Distro HDP, CDH, MapR HDP HDP Support Databricks N/A N/A Community ++++ ++ ++ Spark Storm Scope Batch, Streaming, Graph, ML, SQL Streaming only
  • 30. Netflix Prize (2009) Netflix is a provider of on-demand Internet streaming media
  • 32. The result 1 ; Lyndon Wilson ; 4.608531808535918 ; 858 ; Godfather, The (1972) 1 ; Lyndon Wilson ; 4.596556961095434 ; 318 ; Shawshank Redemption, The (1994) 1 ; Lyndon Wilson ; 4.575789377957803 ; 527 ; Schindler's List (1993) 1 ; Lyndon Wilson ; 4.549694932928024 ; 593 ; Silence of the Lambs, The (1991) 1 ; Lyndon Wilson ; 4.46311974037361 ; 919 ; Wizard of Oz, The (1939) 2 ; Benjamin Harrison ; 4.99545499047152 ; 318 ; Shawshank Redemption, The (1994) 2 ; Benjamin Harrison ; 4.94255532354725 ; 356 ; Forrest Gump (1994) 2 ; Benjamin Harrison ; 4.80168679606128 ; 527 ; Schindler's List (1993) 2 ; Benjamin Harrison ; 4.7874247577586795 ; 1097 ; E.T. the Extra-Terrestrial (1982) 2 ; Benjamin Harrison ; 4.7635998147872325 ; 110 ; Braveheart (1995) 3 ; Richard Hoover ; 4.962687467351026 ; 110 ; Braveheart (1995) 3 ; Richard Hoover ; 4.8316542374095315 ; 318 ; Shawshank Redemption, The (1994) 3 ; Richard Hoover ; 4.7307103243995385 ; 356 ; Forrest Gump (19
  • 33. Real Time Big Data Use Case Next Product To Buy ➜ Right Person ➜ Right Product ➜ Right Price ➜ Right Time ➜ Right Channel