SlideShare a Scribd company logo
1 of 27
Spark and Storm at Yahoo 
Wh y c h o o s e o n e o v e r t h e o t h e r ? 
P R E S E N T E D B Y B o b b y E v a n s a n d T o m G r a v e s
Tom Graves 
Bobby Evans (bobby@apache.org) 
2 
 Committers and PMC/PPMC Members for 
› Apache Storm incubating (Bobby) 
› Apache Hadoop (Tom and Bobby) 
› Apache Spark (Tom and Bobby) 
› Apache TEZ (Tom and Bobby) 
 Low Latency Big Data team at Yahoo (Part of the Hadoop Team) 
› Apache Storm as a service 
• 1,300+ nodes total, 250 node cluster (soon to be 4000 nodes). 
› Apache Spark on YARN 
• 40,000 nodes total, 5000+ node cluster 
› Help with distributed ML and deep learning.
Where we come from 
Yahoo Champaign: 
• 100+ engineers 
• Located in UIUC Research Park http://researchpark.illinois.edu/ 
• Split between Advertising and Data Platform team and Hadoop team. 
• Hadoop team provides the Hadoop ecosystem as a service to all of Yahoo. 
• Site is 7 years old, and we are building a new building with room for 200. 
• We are Hiring 
• resume-hadoop@yahoo-inc.com 
• http://bit.ly/1ybTXMe
Agenda 
Spark Overview (1.1) 
Storm Overview (0.9.2) 
Things to Consider 
Example Architectures 
4 Yahoo Confidential & Proprietary
Apache Spark 
5
Spark Key Concepts 
Write programs in terms of 
transformations on distributed 
Resilient Distributed 
Datasets 
 Collections of objects spread 
across a cluster, stored in RAM 
or on Disk 
 Built through parallel 
transformations 
 Automatically rebuilt on failure 
Operations 
 Transformations 
(e.g. map, filter, 
groupBy) 
 Actions 
(e.g. count, collect, 
save) 
datasets
Working With RDDs 
RDD 
RDD 
RDD 
RDD 
Transformations 
textFile = sc.textFile(”SomeFile.txt”) 
Action Value 
linesWithSpark = textFile.filter(lambda line: "Spark” in line) 
linesWithSpark.count() 
74 
linesWithSpark.first() 
# Apache Spark
Example: Word Count 
> lines = sc.textFile(“hamlet.txt”) 
> counts = lines.flatMap(lambda line: line.split(“ ”)) 
.map(lambda word => (word, 1)) 
.reduceByKey(lambda x, y: x + y) 
“to be or” 
“not to be” 
“to” 
“be” 
“or” 
“not” 
“to” 
“be” 
(to, 1) 
(be, 1) 
(or, 1) 
(not, 1) 
(to, 1) 
(be, 1) 
(be, 2) 
(not, 1) 
(or, 1) 
(to, 2)
Spark Streaming Word Count 
updateFunc = (values: Seq[Int], state: Option[Int]) => { 
val currentCount = values.foldLeft(0)(_ + _) 
val previousCount = state.getOrElse(0) 
Some(currentCount + previousCount) 
} 
… 
lines = ssc.socketTextStream(args(0), args(1).toInt) 
Words = lines.flatMap(lambda line: line.split(“ ”)) 
wordDstream = words.map(lambda word => (word, 1)) 
stateDstream = wordDstream.updateStateByKey[Int](updateFunc) 
ssc.start() 
ssc.awaitTermination()
10 
Apache Storm
Storm Concepts 
1. Streams 
› Unbounded sequence of tuples 
2. Spout 
› Source of Stream 
› E.g. Read from Twitter streaming API 
3. Bolts 
› Processes input streams and produces 
new streams 
› E.g. Functions, Filters, Aggregation, 
Joins 
4. Topologies 
› Network of spouts and bolts
Storm Architecture 
Master 
Node 
Cluster 
Coordination 
Worker 
Worker 
Worker 
Worker 
Processes 
Nimbus 
Zookeeper 
Zookeeper 
Zookeeper 
Supervisor 
Supervisor 
Supervisor 
Supervisor Worker 
Launches 
Workers
Trident (Storm) Word Count 
TridentTopology topology = new TridentTopology(); 
TridentState wordCounts = topology.newStream("spout1", spout) 
.each(new Fields("sentence"), new Split(), new Fields("word")) 
.groupBy(new Fields("word")) 
.persistentAggregate(new MemoryMapState.Factory(), new Count(), 
new Fields("count")).parallelismHint(6); 
“to be or” 
“to” 
“be” 
“or” 
(to, 1) 
(be, 1) 
(or, 1) 
1) 
1) 
“not to be” 
“not” 
“to” 
“be” 
(not, 1) 
(to, 1) 
(be, 1) 
(be, 2) 
(not, 1) 
(or, 1) 
(to, 2)
Use the Right Tool for the Job 
14 
https://www.flickr.com/photos/hikingartist/4193330368/
Things to Consider 
15 
Scale 
Latency 
 Iterative Processing 
› Are there suitable non-iterative alternatives? 
Use What You Know 
Code Reuse 
Maturity
When We Recommend Spark 
16 
 Iterative Batch Processing (most Machine Learning) 
› There really is nothing else right now. 
› Has some scale issues. 
 Tried ETL (Not at Yahoo scale yet) 
 Tried Shark/Interactive Queries (Not at Yahoo scale yet) 
 < 1 TB (or memory size of your cluster) 
 Tuning it to run well can be a pain 
 Data Bricks and others are working on scaling. 
 Streaming is all μ-batch so latency is at least 1 sec 
 Streaming has single points of failure still 
 All streaming inputs are replicated in memory
When We Recommend Storm 
17 
 Latency < 1 second (single event at a time) 
› There is little else (especially not open source) 
 “Real Time” … 
› Analytics 
› Budgeting 
› ML 
› Anything 
 Lower Level API than Spark 
 No built-in concept of look back aggregations 
 Takes more effort to combine batch with streaming
Fictitious Example: My Commute App 
18 
 Mobile App that lets users track their commute. 
 Cities, users, companies, etc. compete daily for 
› Shortest commute time 
› Greenest commute 
 Make money by selling location based ads and aggregate data to 
› Governments 
› Advertisers 
 Feel free to steal my crazy idea, I just want to be invited to the launch 
party, and I wouldn't say no to some stock.
Chicago vs. Champaign Urbana 
19 
Champaign Urbana: 14-15 min 
Chicago: 20-30 min 
35 
30 
25 
20 
15 
10 
5 
0 
Bobby 
CU Chicago 
Source: http://project.wnyc.org/commute-times-us/embed.html#5.00/42.000/-89.500
Things to Consider 
20 
Scale 
› everyone in the world!!! 
Latency 
› a few seconds max 
 Iterative Processing 
› Possibly for targeting, but there are alternatives
Architecture 
App Web 
Service 
(User, Commute 
ID, Location 
History, MPG) 
Kafka Storm 
HBase/NoSQ 
L 
HDFS Spark 
Customer 
21
Architecture (Alternative) 
App Web 
Service 
(User, Commute 
ID, Location 
History, MPG) 
HBase/NOS 
QL 
HDFS Spark 
Customer 
22 
Go directly to Spark Streaming, 
but data loss potential goes up.
Architecture (Alternative 2) 
App Web 
Service 
(User, Commute 
ID, Location 
History, MPG) 
Kafka Storm 
HBase/NOS 
QL 
Customer 
23 
Streaming Operations Only 
(Kappa Architecture)
Fictitious Example 2: Web Scale Monitoring 
24 
 Look for trends that can indicate a problem. 
› Alert or provide automated corrections 
 Provide an interface to visualize 
› Current data very quickly 
› Historical data in depth 
 If you commercialize this one please give me/Yahoo a free license for 
life (open source works too)
Things to Consider 
25 
Scale 
› Lots of events from many different servers 
Latency 
› a few seconds max, but the fewer the better 
 Iterative Processing 
› For in depth analysis definetly
Fictitious Example 2: Web Scale Monitoring 
26 
Servers 
HBase 
Kafka Storm 
HDFS Spark 
UI 
Alert!! 
JDBC 
Server 
Rules 
ML and trend 
analysis
Questions? 
bobby@apache.org resume-hadoop@yahoo-inc.com 
http://bit.ly/1ybTXMe

More Related Content

What's hot

Vert.x for Microservices Architecture
Vert.x for Microservices ArchitectureVert.x for Microservices Architecture
Vert.x for Microservices ArchitectureIdan Fridman
 
Best Practices For Workflow
Best Practices For WorkflowBest Practices For Workflow
Best Practices For WorkflowTimothy Spann
 
Hadoop Strata Talk - Uber, your hadoop has arrived
Hadoop Strata Talk - Uber, your hadoop has arrived Hadoop Strata Talk - Uber, your hadoop has arrived
Hadoop Strata Talk - Uber, your hadoop has arrived Vinoth Chandar
 
Git 101: Git and GitHub for Beginners
Git 101: Git and GitHub for Beginners Git 101: Git and GitHub for Beginners
Git 101: Git and GitHub for Beginners HubSpot
 
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringApache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringDatabricks
 
Server monitoring using grafana and prometheus
Server monitoring using grafana and prometheusServer monitoring using grafana and prometheus
Server monitoring using grafana and prometheusCeline George
 
Apache Storm Tutorial
Apache Storm TutorialApache Storm Tutorial
Apache Storm TutorialDavide Mazza
 
Grafana introduction
Grafana introductionGrafana introduction
Grafana introductionRico Chen
 
Git and GitHub
Git and GitHubGit and GitHub
Git and GitHubJames Gray
 
Hive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use CasesHive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use Casesnzhang
 
Compression Options in Hadoop - A Tale of Tradeoffs
Compression Options in Hadoop - A Tale of TradeoffsCompression Options in Hadoop - A Tale of Tradeoffs
Compression Options in Hadoop - A Tale of TradeoffsDataWorks Summit
 
Introduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingIntroduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingCloudera, Inc.
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introductioncolorant
 
Apache Beam and Google Cloud Dataflow - IDG - final
Apache Beam and Google Cloud Dataflow - IDG - finalApache Beam and Google Cloud Dataflow - IDG - final
Apache Beam and Google Cloud Dataflow - IDG - finalSub Szabolcs Feczak
 
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3Migrating your clusters and workloads from Hadoop 2 to Hadoop 3
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3DataWorks Summit
 

What's hot (20)

Vert.x for Microservices Architecture
Vert.x for Microservices ArchitectureVert.x for Microservices Architecture
Vert.x for Microservices Architecture
 
Best Practices For Workflow
Best Practices For WorkflowBest Practices For Workflow
Best Practices For Workflow
 
Hadoop Strata Talk - Uber, your hadoop has arrived
Hadoop Strata Talk - Uber, your hadoop has arrived Hadoop Strata Talk - Uber, your hadoop has arrived
Hadoop Strata Talk - Uber, your hadoop has arrived
 
Advanced Git
Advanced GitAdvanced Git
Advanced Git
 
Git 101: Git and GitHub for Beginners
Git 101: Git and GitHub for Beginners Git 101: Git and GitHub for Beginners
Git 101: Git and GitHub for Beginners
 
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringApache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
 
Server monitoring using grafana and prometheus
Server monitoring using grafana and prometheusServer monitoring using grafana and prometheus
Server monitoring using grafana and prometheus
 
Github
GithubGithub
Github
 
Apache Storm
Apache StormApache Storm
Apache Storm
 
Apache Storm Tutorial
Apache Storm TutorialApache Storm Tutorial
Apache Storm Tutorial
 
Grafana introduction
Grafana introductionGrafana introduction
Grafana introduction
 
Git and GitHub
Git and GitHubGit and GitHub
Git and GitHub
 
Hive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use CasesHive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use Cases
 
Compression Options in Hadoop - A Tale of Tradeoffs
Compression Options in Hadoop - A Tale of TradeoffsCompression Options in Hadoop - A Tale of Tradeoffs
Compression Options in Hadoop - A Tale of Tradeoffs
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive Queries
 
Introduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingIntroduction to Apache Spark Developer Training
Introduction to Apache Spark Developer Training
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
 
Rds data lake @ Robinhood
Rds data lake @ Robinhood Rds data lake @ Robinhood
Rds data lake @ Robinhood
 
Apache Beam and Google Cloud Dataflow - IDG - final
Apache Beam and Google Cloud Dataflow - IDG - finalApache Beam and Google Cloud Dataflow - IDG - final
Apache Beam and Google Cloud Dataflow - IDG - final
 
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3Migrating your clusters and workloads from Hadoop 2 to Hadoop 3
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3
 

Viewers also liked

Realtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and HadoopRealtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and HadoopDataWorks Summit
 
Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014P. Taylor Goetz
 
Hadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm ArchitectureHadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm ArchitectureP. Taylor Goetz
 
Apache storm vs. Spark Streaming
Apache storm vs. Spark StreamingApache storm vs. Spark Streaming
Apache storm vs. Spark StreamingP. Taylor Goetz
 
Safety hand tools & grinding
Safety hand tools & grindingSafety hand tools & grinding
Safety hand tools & grindingLavanya Singh
 
Kafka Tutorial Advanced Kafka Consumers
Kafka Tutorial Advanced Kafka ConsumersKafka Tutorial Advanced Kafka Consumers
Kafka Tutorial Advanced Kafka ConsumersJean-Paul Azar
 

Viewers also liked (6)

Realtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and HadoopRealtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and Hadoop
 
Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014
 
Hadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm ArchitectureHadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm Architecture
 
Apache storm vs. Spark Streaming
Apache storm vs. Spark StreamingApache storm vs. Spark Streaming
Apache storm vs. Spark Streaming
 
Safety hand tools & grinding
Safety hand tools & grindingSafety hand tools & grinding
Safety hand tools & grinding
 
Kafka Tutorial Advanced Kafka Consumers
Kafka Tutorial Advanced Kafka ConsumersKafka Tutorial Advanced Kafka Consumers
Kafka Tutorial Advanced Kafka Consumers
 

Similar to Yahoo compares Storm and Spark

Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...Folio3 Software
 
Open Security Operations Center - OpenSOC
Open Security Operations Center - OpenSOCOpen Security Operations Center - OpenSOC
Open Security Operations Center - OpenSOCSheetal Dolas
 
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...Brian O'Neill
 
Cassandra Day 2014: Re-envisioning the Lambda Architecture - Web-Services & R...
Cassandra Day 2014: Re-envisioning the Lambda Architecture - Web-Services & R...Cassandra Day 2014: Re-envisioning the Lambda Architecture - Web-Services & R...
Cassandra Day 2014: Re-envisioning the Lambda Architecture - Web-Services & R...DataStax Academy
 
Making Machine Learning Easy with H2O and WebFlux
Making Machine Learning Easy with H2O and WebFluxMaking Machine Learning Easy with H2O and WebFlux
Making Machine Learning Easy with H2O and WebFluxTrayan Iliev
 
Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015Databricks
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsDataStax Academy
 
Hadoop & Hive Change the Data Warehousing Game Forever
Hadoop & Hive Change the Data Warehousing Game ForeverHadoop & Hive Change the Data Warehousing Game Forever
Hadoop & Hive Change the Data Warehousing Game ForeverDataWorks Summit
 
Phily JUG : Web Services APIs for Real-time Analytics w/ Storm and DropWizard
Phily JUG : Web Services APIs for Real-time Analytics w/ Storm and DropWizardPhily JUG : Web Services APIs for Real-time Analytics w/ Storm and DropWizard
Phily JUG : Web Services APIs for Real-time Analytics w/ Storm and DropWizardBrian O'Neill
 
Strata Stinger Talk October 2013
Strata Stinger Talk October 2013Strata Stinger Talk October 2013
Strata Stinger Talk October 2013alanfgates
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Databricks
 
Apache Beam (incubating)
Apache Beam (incubating)Apache Beam (incubating)
Apache Beam (incubating)Apache Apex
 
Real-Time Big Data with Storm, Kafka and GigaSpaces
Real-Time Big Data with Storm, Kafka and GigaSpacesReal-Time Big Data with Storm, Kafka and GigaSpaces
Real-Time Big Data with Storm, Kafka and GigaSpacesOleksii Diagiliev
 
Scio - Moving to Google Cloud, A Spotify Story
 Scio - Moving to Google Cloud, A Spotify Story Scio - Moving to Google Cloud, A Spotify Story
Scio - Moving to Google Cloud, A Spotify StoryNeville Li
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkC4Media
 
UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015Christopher Curtin
 
SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017Jags Ramnarayan
 
SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...SnappyData
 

Similar to Yahoo compares Storm and Spark (20)

Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
 
Open Security Operations Center - OpenSOC
Open Security Operations Center - OpenSOCOpen Security Operations Center - OpenSOC
Open Security Operations Center - OpenSOC
 
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
 
Cassandra Day 2014: Re-envisioning the Lambda Architecture - Web-Services & R...
Cassandra Day 2014: Re-envisioning the Lambda Architecture - Web-Services & R...Cassandra Day 2014: Re-envisioning the Lambda Architecture - Web-Services & R...
Cassandra Day 2014: Re-envisioning the Lambda Architecture - Web-Services & R...
 
Making Machine Learning Easy with H2O and WebFlux
Making Machine Learning Easy with H2O and WebFluxMaking Machine Learning Easy with H2O and WebFlux
Making Machine Learning Easy with H2O and WebFlux
 
Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
 
Hadoop & Hive Change the Data Warehousing Game Forever
Hadoop & Hive Change the Data Warehousing Game ForeverHadoop & Hive Change the Data Warehousing Game Forever
Hadoop & Hive Change the Data Warehousing Game Forever
 
Phily JUG : Web Services APIs for Real-time Analytics w/ Storm and DropWizard
Phily JUG : Web Services APIs for Real-time Analytics w/ Storm and DropWizardPhily JUG : Web Services APIs for Real-time Analytics w/ Storm and DropWizard
Phily JUG : Web Services APIs for Real-time Analytics w/ Storm and DropWizard
 
Strata Stinger Talk October 2013
Strata Stinger Talk October 2013Strata Stinger Talk October 2013
Strata Stinger Talk October 2013
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
Apache Beam (incubating)
Apache Beam (incubating)Apache Beam (incubating)
Apache Beam (incubating)
 
Real-Time Big Data with Storm, Kafka and GigaSpaces
Real-Time Big Data with Storm, Kafka and GigaSpacesReal-Time Big Data with Storm, Kafka and GigaSpaces
Real-Time Big Data with Storm, Kafka and GigaSpaces
 
Scio - Moving to Google Cloud, A Spotify Story
 Scio - Moving to Google Cloud, A Spotify Story Scio - Moving to Google Cloud, A Spotify Story
Scio - Moving to Google Cloud, A Spotify Story
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015
 
SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017
 
SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...
 
Tech
TechTech
Tech
 
So you think you can stream.pptx
So you think you can stream.pptxSo you think you can stream.pptx
So you think you can stream.pptx
 

More from Chicago Hadoop Users Group

Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...Chicago Hadoop Users Group
 
Choosing the Right Big Data Architecture for your Business
Choosing the Right Big Data Architecture for your BusinessChoosing the Right Big Data Architecture for your Business
Choosing the Right Big Data Architecture for your BusinessChicago Hadoop Users Group
 
Everything you wanted to know, but were afraid to ask about Oozie
Everything you wanted to know, but were afraid to ask about OozieEverything you wanted to know, but were afraid to ask about Oozie
Everything you wanted to know, but were afraid to ask about OozieChicago Hadoop Users Group
 
An Introduction to Impala – Low Latency Queries for Apache Hadoop
An Introduction to Impala – Low Latency Queries for Apache HadoopAn Introduction to Impala – Low Latency Queries for Apache Hadoop
An Introduction to Impala – Low Latency Queries for Apache HadoopChicago Hadoop Users Group
 
HCatalog: Table Management for Hadoop - CHUG - 20120917
HCatalog: Table Management for Hadoop - CHUG - 20120917HCatalog: Table Management for Hadoop - CHUG - 20120917
HCatalog: Table Management for Hadoop - CHUG - 20120917Chicago Hadoop Users Group
 
Avro - More Than Just a Serialization Framework - CHUG - 20120416
Avro - More Than Just a Serialization Framework - CHUG - 20120416Avro - More Than Just a Serialization Framework - CHUG - 20120416
Avro - More Than Just a Serialization Framework - CHUG - 20120416Chicago Hadoop Users Group
 

More from Chicago Hadoop Users Group (19)

Kinetica master chug_9.12
Kinetica master chug_9.12Kinetica master chug_9.12
Kinetica master chug_9.12
 
Chug dl presentation
Chug dl presentationChug dl presentation
Chug dl presentation
 
Using Apache Drill
Using Apache DrillUsing Apache Drill
Using Apache Drill
 
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
 
Meet Spark
Meet SparkMeet Spark
Meet Spark
 
Choosing the Right Big Data Architecture for your Business
Choosing the Right Big Data Architecture for your BusinessChoosing the Right Big Data Architecture for your Business
Choosing the Right Big Data Architecture for your Business
 
An Overview of Ambari
An Overview of AmbariAn Overview of Ambari
An Overview of Ambari
 
Hadoop and Big Data Security
Hadoop and Big Data SecurityHadoop and Big Data Security
Hadoop and Big Data Security
 
Introduction to MapReduce
Introduction to MapReduceIntroduction to MapReduce
Introduction to MapReduce
 
Advanced Oozie
Advanced OozieAdvanced Oozie
Advanced Oozie
 
Scalding for Hadoop
Scalding for HadoopScalding for Hadoop
Scalding for Hadoop
 
Financial Data Analytics with Hadoop
Financial Data Analytics with HadoopFinancial Data Analytics with Hadoop
Financial Data Analytics with Hadoop
 
Everything you wanted to know, but were afraid to ask about Oozie
Everything you wanted to know, but were afraid to ask about OozieEverything you wanted to know, but were afraid to ask about Oozie
Everything you wanted to know, but were afraid to ask about Oozie
 
An Introduction to Impala – Low Latency Queries for Apache Hadoop
An Introduction to Impala – Low Latency Queries for Apache HadoopAn Introduction to Impala – Low Latency Queries for Apache Hadoop
An Introduction to Impala – Low Latency Queries for Apache Hadoop
 
HCatalog: Table Management for Hadoop - CHUG - 20120917
HCatalog: Table Management for Hadoop - CHUG - 20120917HCatalog: Table Management for Hadoop - CHUG - 20120917
HCatalog: Table Management for Hadoop - CHUG - 20120917
 
Map Reduce v2 and YARN - CHUG - 20120604
Map Reduce v2 and YARN - CHUG - 20120604Map Reduce v2 and YARN - CHUG - 20120604
Map Reduce v2 and YARN - CHUG - 20120604
 
Hadoop in a Windows Shop - CHUG - 20120416
Hadoop in a Windows Shop - CHUG - 20120416Hadoop in a Windows Shop - CHUG - 20120416
Hadoop in a Windows Shop - CHUG - 20120416
 
Running R on Hadoop - CHUG - 20120815
Running R on Hadoop - CHUG - 20120815Running R on Hadoop - CHUG - 20120815
Running R on Hadoop - CHUG - 20120815
 
Avro - More Than Just a Serialization Framework - CHUG - 20120416
Avro - More Than Just a Serialization Framework - CHUG - 20120416Avro - More Than Just a Serialization Framework - CHUG - 20120416
Avro - More Than Just a Serialization Framework - CHUG - 20120416
 

Recently uploaded

08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 

Recently uploaded (20)

08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 

Yahoo compares Storm and Spark

  • 1. Spark and Storm at Yahoo Wh y c h o o s e o n e o v e r t h e o t h e r ? P R E S E N T E D B Y B o b b y E v a n s a n d T o m G r a v e s
  • 2. Tom Graves Bobby Evans (bobby@apache.org) 2  Committers and PMC/PPMC Members for › Apache Storm incubating (Bobby) › Apache Hadoop (Tom and Bobby) › Apache Spark (Tom and Bobby) › Apache TEZ (Tom and Bobby)  Low Latency Big Data team at Yahoo (Part of the Hadoop Team) › Apache Storm as a service • 1,300+ nodes total, 250 node cluster (soon to be 4000 nodes). › Apache Spark on YARN • 40,000 nodes total, 5000+ node cluster › Help with distributed ML and deep learning.
  • 3. Where we come from Yahoo Champaign: • 100+ engineers • Located in UIUC Research Park http://researchpark.illinois.edu/ • Split between Advertising and Data Platform team and Hadoop team. • Hadoop team provides the Hadoop ecosystem as a service to all of Yahoo. • Site is 7 years old, and we are building a new building with room for 200. • We are Hiring • resume-hadoop@yahoo-inc.com • http://bit.ly/1ybTXMe
  • 4. Agenda Spark Overview (1.1) Storm Overview (0.9.2) Things to Consider Example Architectures 4 Yahoo Confidential & Proprietary
  • 6. Spark Key Concepts Write programs in terms of transformations on distributed Resilient Distributed Datasets  Collections of objects spread across a cluster, stored in RAM or on Disk  Built through parallel transformations  Automatically rebuilt on failure Operations  Transformations (e.g. map, filter, groupBy)  Actions (e.g. count, collect, save) datasets
  • 7. Working With RDDs RDD RDD RDD RDD Transformations textFile = sc.textFile(”SomeFile.txt”) Action Value linesWithSpark = textFile.filter(lambda line: "Spark” in line) linesWithSpark.count() 74 linesWithSpark.first() # Apache Spark
  • 8. Example: Word Count > lines = sc.textFile(“hamlet.txt”) > counts = lines.flatMap(lambda line: line.split(“ ”)) .map(lambda word => (word, 1)) .reduceByKey(lambda x, y: x + y) “to be or” “not to be” “to” “be” “or” “not” “to” “be” (to, 1) (be, 1) (or, 1) (not, 1) (to, 1) (be, 1) (be, 2) (not, 1) (or, 1) (to, 2)
  • 9. Spark Streaming Word Count updateFunc = (values: Seq[Int], state: Option[Int]) => { val currentCount = values.foldLeft(0)(_ + _) val previousCount = state.getOrElse(0) Some(currentCount + previousCount) } … lines = ssc.socketTextStream(args(0), args(1).toInt) Words = lines.flatMap(lambda line: line.split(“ ”)) wordDstream = words.map(lambda word => (word, 1)) stateDstream = wordDstream.updateStateByKey[Int](updateFunc) ssc.start() ssc.awaitTermination()
  • 11. Storm Concepts 1. Streams › Unbounded sequence of tuples 2. Spout › Source of Stream › E.g. Read from Twitter streaming API 3. Bolts › Processes input streams and produces new streams › E.g. Functions, Filters, Aggregation, Joins 4. Topologies › Network of spouts and bolts
  • 12. Storm Architecture Master Node Cluster Coordination Worker Worker Worker Worker Processes Nimbus Zookeeper Zookeeper Zookeeper Supervisor Supervisor Supervisor Supervisor Worker Launches Workers
  • 13. Trident (Storm) Word Count TridentTopology topology = new TridentTopology(); TridentState wordCounts = topology.newStream("spout1", spout) .each(new Fields("sentence"), new Split(), new Fields("word")) .groupBy(new Fields("word")) .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count")).parallelismHint(6); “to be or” “to” “be” “or” (to, 1) (be, 1) (or, 1) 1) 1) “not to be” “not” “to” “be” (not, 1) (to, 1) (be, 1) (be, 2) (not, 1) (or, 1) (to, 2)
  • 14. Use the Right Tool for the Job 14 https://www.flickr.com/photos/hikingartist/4193330368/
  • 15. Things to Consider 15 Scale Latency  Iterative Processing › Are there suitable non-iterative alternatives? Use What You Know Code Reuse Maturity
  • 16. When We Recommend Spark 16  Iterative Batch Processing (most Machine Learning) › There really is nothing else right now. › Has some scale issues.  Tried ETL (Not at Yahoo scale yet)  Tried Shark/Interactive Queries (Not at Yahoo scale yet)  < 1 TB (or memory size of your cluster)  Tuning it to run well can be a pain  Data Bricks and others are working on scaling.  Streaming is all μ-batch so latency is at least 1 sec  Streaming has single points of failure still  All streaming inputs are replicated in memory
  • 17. When We Recommend Storm 17  Latency < 1 second (single event at a time) › There is little else (especially not open source)  “Real Time” … › Analytics › Budgeting › ML › Anything  Lower Level API than Spark  No built-in concept of look back aggregations  Takes more effort to combine batch with streaming
  • 18. Fictitious Example: My Commute App 18  Mobile App that lets users track their commute.  Cities, users, companies, etc. compete daily for › Shortest commute time › Greenest commute  Make money by selling location based ads and aggregate data to › Governments › Advertisers  Feel free to steal my crazy idea, I just want to be invited to the launch party, and I wouldn't say no to some stock.
  • 19. Chicago vs. Champaign Urbana 19 Champaign Urbana: 14-15 min Chicago: 20-30 min 35 30 25 20 15 10 5 0 Bobby CU Chicago Source: http://project.wnyc.org/commute-times-us/embed.html#5.00/42.000/-89.500
  • 20. Things to Consider 20 Scale › everyone in the world!!! Latency › a few seconds max  Iterative Processing › Possibly for targeting, but there are alternatives
  • 21. Architecture App Web Service (User, Commute ID, Location History, MPG) Kafka Storm HBase/NoSQ L HDFS Spark Customer 21
  • 22. Architecture (Alternative) App Web Service (User, Commute ID, Location History, MPG) HBase/NOS QL HDFS Spark Customer 22 Go directly to Spark Streaming, but data loss potential goes up.
  • 23. Architecture (Alternative 2) App Web Service (User, Commute ID, Location History, MPG) Kafka Storm HBase/NOS QL Customer 23 Streaming Operations Only (Kappa Architecture)
  • 24. Fictitious Example 2: Web Scale Monitoring 24  Look for trends that can indicate a problem. › Alert or provide automated corrections  Provide an interface to visualize › Current data very quickly › Historical data in depth  If you commercialize this one please give me/Yahoo a free license for life (open source works too)
  • 25. Things to Consider 25 Scale › Lots of events from many different servers Latency › a few seconds max, but the fewer the better  Iterative Processing › For in depth analysis definetly
  • 26. Fictitious Example 2: Web Scale Monitoring 26 Servers HBase Kafka Storm HDFS Spark UI Alert!! JDBC Server Rules ML and trend analysis

Editor's Notes

  1. RDD  Colloquially referred to as RDDs (e.g. caching in RAM) Lazy operations to build RDDs from other RDDs Return a result or write it to storage
  2. Let me illustrate this with some bad powerpoint diagrams and animations This diagram is LOGICAL,
  3. Trend analysis is difficult but sketches for approximations on many aggregates and Gradient Decent or VW for ML make this still an attractive option.