Presentation of the Spark technology and examples of new business use cases that real-time Big Data can address, by Cédric Carbone
-Spark vs Hadoop MapReduce (& Hadoop v2 vs Hadoop v1)
-Spark Streaming vs Storm
-Machine Learning with Spark
-Business use case: NextProductToBuy
Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / NextProductToBuy
1. Spark Meetup at Viadeo
Wednesday, February 4, 2015
• 19:00-19:45: Presentation of the Spark technology and examples of new business use cases
that real-time Big Data can address.
Cédric Carbone - Co-founder of Influans - cedric@influans.com
-Spark vs Hadoop MapReduce
-Spark Streaming vs Storm
-Machine Learning with Spark
-Business use case: NextProductToBuy
• 19:45-20:00: Spark extensions (Tachyon / Spark JobServer).
Jonathan Lamiel - Talend Labs - jlamiel@talend.com
-Spark's shared memory with Tachyon
-Making Spark interactive with Spark JobServer
• 20:00-21:00: Big Data analytics with Spark & Cassandra.
DuyHai DOAN - Technical Advocate at DataStax - duy_hai.doan@datastax.com
Apache Spark is a general-purpose data processing framework that lets you perform data
processing tasks in memory. Apache Cassandra is a highly available and massively scalable
NoSQL data store.
By combining Spark's flexible API with Cassandra's performance, we get an interesting
combination for both real-time and batch processing.
2. MapReduce
➜ Map(): parses inputs and generates 0 to n <key, value> pairs
➜ Reduce(): sums all values of the same key and
generates a <key, value> pair
WordCount Example
➜ Each mapper takes a line as input and breaks it into words
• It emits a key/value pair of the word and 1
➜ Each reducer sums the counts for each word
• It emits a key/value pair of the word and the sum
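The WordCount flow above can be sketched in a few lines of plain Python. This is only an illustration of the map, shuffle, and reduce phases running in a single process, not Hadoop or Spark code; the function names `map_fn`, `shuffle`, `reduce_fn`, and `word_count` are made up for this example, and lowercasing plus whitespace splitting are assumed tokenization choices.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_fn(line: str) -> Iterator[Tuple[str, int]]:
    """Map(): break one input line into words and emit a (word, 1) pair for each."""
    for word in line.lower().split():
        yield (word, 1)


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group values by key, as a framework would between the map and reduce phases."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_fn(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce(): sum the counts of one word and emit a (word, sum) pair."""
    return (word, sum(counts))


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three phases in sequence on an in-memory list of lines."""
    mapped = (pair for line in lines for pair in map_fn(line))
    grouped = shuffle(mapped)
    return dict(reduce_fn(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    sample = ["spark streaming vs storm", "spark vs hadoop"]
    print(word_count(sample))
```

In a real cluster the map and reduce calls run in parallel on different machines and the shuffle moves data over the network; here everything happens in one process to keep the data flow visible.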
9. Hadoop 1
➜ Batch ONLY
• High latency jobs
HDFS (Redundant, Reliable Storage)
MapReduce1 (Cluster Resource Management + Data Processing): BATCH
Hive (Query) / Pig (Scripting) / Cascading (Accelerate Dev.)
10. Hadoop2: Big Data Operating System
➜ Customers want to store ALL DATA in one place and
interact with it in MULTIPLE WAYS
• Simultaneously & with predictable levels of service
• Data analysts and real-time applications
HDFS (Redundant, Reliable Storage)
YARN (Cluster Resource Management)
MapReduce1 (Data Processing): BATCH
Other (Data Processing): …
11. Hadoop2: Big Data Operating System
HDFS (Redundant, Reliable Storage)
BATCH / INTERACTIVE / STREAMING / GRAPH / ML / IN-MEMORY / ONLINE / SEARCH
YARN (Cluster Resource Management)
12. Hadoop2: Big Data Operating System
HDFS (Redundant, Reliable Storage)
YARN (Cluster Resource Management)
BATCH (MapReduce)
INTERACTIVE (Tez)
STREAMING (Storm, Samza, Spark Streaming)
GRAPH (Giraph, GraphX)
MACHINE LEARNING (MLlib)
IN-MEMORY (Spark)
ONLINE (HBase, HOYA)
OTHER (Elasticsearch)
15. Spark won the Daytona GraySort contest!
Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
Sort 100 TB of data on disk 3x faster than Hadoop MapReduce using 10x fewer machines.
21. Storm vs Spark
                         Spark Streaming    Storm              Storm Trident
Processing model         Micro-batches      Record-at-a-time   Micro-batches
Throughput               ++++               ++                 ++++
Latency                  Seconds            Sub-second         Seconds
Reliability model        Exactly once       At least once      Exactly once
Embedded Hadoop distro   HDP, CDH, MapR     HDP                HDP
Support                  Databricks         N/A                N/A
Community                ++++               ++                 ++

                         Spark                               Storm
Scope                    Batch, Streaming, Graph, ML, SQL    Streaming only
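To make the "processing model" row more concrete, here is a toy Python sketch contrasting record-at-a-time handling with micro-batching. It does not use the Storm or Spark Streaming APIs; the names `record_at_a_time` and `micro_batches`, the `(timestamp, payload)` event shape, and the 1-second batch interval are all assumptions chosen for illustration. Events are assumed to arrive ordered by timestamp, and empty windows are simply skipped.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, payload), a hypothetical shape


def record_at_a_time(events: Iterable[Event], handle: Callable[[str], None]) -> int:
    """Record-at-a-time model: invoke the handler once per event, as soon as it is seen."""
    calls = 0
    for _, payload in events:
        handle(payload)
        calls += 1
    return calls


def micro_batches(events: Iterable[Event], interval: float) -> Iterator[List[str]]:
    """Micro-batch model: group events into consecutive windows of `interval` seconds."""
    batch: List[str] = []
    current_window = None
    for ts, payload in events:
        window = int(ts // interval)
        if current_window is not None and window != current_window:
            # The previous window is closed: hand it over as one batch.
            yield batch
            batch = []
        current_window = window
        batch.append(payload)
    if batch:
        # Flush the last, still open window.
        yield batch


if __name__ == "__main__":
    stream = [(0.1, "a"), (0.4, "b"), (1.2, "c"), (1.9, "d"), (2.5, "e")]
    seen: List[str] = []
    print(record_at_a_time(stream, seen.append), "handler calls")
    for batch in micro_batches(stream, interval=1.0):
        print("batch:", batch)
```

The trade-off in the table follows from this shape: a batch cannot be processed before its window closes, so latency is bounded below by the interval, while per-batch work amortizes overhead and tends to help throughput.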