S4: Distributed Stream Computing Platform
YAHOO LABS 2010
FARZAD NOZARIAN, MAZAHER BAZARI
Cloud Computing
CEIT@AUT 12/22/2014
/* Who we are! */
Farzad Nozarian
fnozarian@aut.ac.ir
Big Data Processing And Mining
Mazaher Bazari
mbazari@aut.ac.ir
Mobile Cloud Computing
1
What is S4
 Simple Scalable Streaming System
 Inspired by the MapReduce model!
S4 is a general-purpose, distributed, scalable,
partially fault-tolerant, pluggable platform for
processing continuous unbounded streams of data
2
Motivation
 Real-time search
 High frequency trading
 Social networks
3
“cost-per-click” billing model
 Render the most relevant ads in an optimal position on the page
 Process thousands of queries per second
 Include user preferences from context:
 Recent user activity
 Geographic location
 Prior queries
 Prior clicks
5
Reinvent the Wheel!
 Extending the open-source Hadoop platform to support computation over unbounded streams
But, Hadoop isn’t suitable!
 The Hadoop platform was highly optimized for batch processing
 MapReduce systems typically operate on static data by
scheduling batch jobs.
6
Real world systems!
[Figure: a data stream cut into fixed-size segments, Partition #1 to Partition #N]
Latency is proportional to:
 the length of the segment
 the overhead of segmenting the stream and initiating the processing jobs
7
Design goals
Simple API
Scale using commodity hardware
Minimize latency by using local memory in
each processing node
8
Design goals
Decentralized and symmetric architecture
Pluggable architecture
Science-friendly (easy to program and flexible)
9
S4 Model
Avoiding the use of shared memory across the cluster
Distributed operation on commodity hardware
Actors model
G. Agha, Actors: A Model of Concurrent Computation in Distributed Systems.
10
S4 model (cont.)
 Computation is performed by Processing Elements (PEs)
 Messages are transmitted between them in the form of data events
 The state of each PE is inaccessible to other PEs
 Event emission and consumption is the only mode of interaction
between PEs
 The framework provides the capability to route events to
appropriate PEs and to create new instances of PEs
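The model above can be sketched in plain Java. This is only an illustration under stated assumptions, not the S4 API: the names `Event`, `CountingPE`, and `Router` are hypothetical. Each PE keeps private state, the only interaction is consuming one event and emitting another, and a small router stands in for the framework by creating a PE instance the first time a key is seen.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical names throughout; this is not the S4 API.
class S4ModelSketch {

    // A data event: a type, a keyed attribute value, and a payload.
    static final class Event {
        final String type;
        final String key;
        final int value;

        Event(String type, String key, int value) {
            this.type = type;
            this.key = key;
            this.value = value;
        }
    }

    // A PE keeps private state; other PEs can only send it events.
    static final class CountingPE {
        private int total; // not reachable from outside this PE

        // Consumes one event and returns the event it emits.
        Event processEvent(Event in) {
            total += in.value;
            return new Event("UpdatedCount", in.key, total);
        }
    }

    // Stand-in for the framework: routes each event to the PE instance
    // for its key, creating that instance the first time the key is seen.
    static final class Router {
        private final Map<String, CountingPE> instances = new HashMap<>();

        Event route(Event in) {
            CountingPE pe = instances.computeIfAbsent(in.key, k -> new CountingPE());
            return pe.processEvent(in);
        }

        int instanceCount() {
            return instances.size();
        }
    }

    public static void main(String[] args) {
        Router router = new Router();
        router.route(new Event("WordEvent", "said", 2));
        router.route(new Event("WordEvent", "i", 4));
        Event out = router.route(new Event("WordEvent", "said", 3));
        System.out.println(out.key + " -> " + out.value
                + " (" + router.instanceCount() + " PE instances)");
    }
}
```

In this sketch everything runs in one process; in a real cluster the PE instances would be spread over many nodes.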
11
Design Assumptions
 Lossy failover is acceptable!
 Nodes will not be added to or removed from a running cluster!
12
Design: Example
 What is the task?
The task is to continuously produce a sorted list
of the top K most frequent words across all
documents with minimal latency
13
A keyless event (EV) arrives at PE1 with the quote:
“I meant what I said and I said what I meant.”, Dr. Seuss
EV Quote: KEY null, VALUE quote=“I …”
QuoteSplitterPE (PE1) counts unique words in the quote and emits an event for each word.
EV WordEvent: KEY word="said", VALUE count=2
EV WordEvent: KEY word="i", VALUE count=4
WordCountPE (PE2-4) keeps total counts for each word across all quotes and emits an event any time a count is updated.
EV UpdatedCountEv: KEY sortID=2, VALUE word="said" count=9
EV UpdatedCountEv: KEY sortID=9, VALUE word="i" count=35
EV PartialTopKEv: KEY topk=1234, VALUE words={w:cnt}
MergePE (PE8) combines partial TopK lists and outputs the final TopK list.
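The word-count pipeline can be approximated in a few lines of Java. This is a single-process sketch under simplifying assumptions, not S4 code: the class `TopKWordsSketch` and its methods are hypothetical, the per-word PE instances are collapsed into one map, and the partial sorting and merging stages are collapsed into one `topK` call.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical single-process sketch; not the S4 API.
class TopKWordsSketch {

    // Running totals per word, as the WordCountPE instances would hold them.
    private final Map<String, Integer> totals = new HashMap<>();

    // QuoteSplitterPE step: count each distinct word within one quote.
    static Map<String, Integer> splitQuote(String quote) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : quote.toLowerCase(Locale.ROOT).split("[^a-z']+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // WordCountPE step: fold the per-quote counts into the running totals.
    void consume(String quote) {
        for (Map.Entry<String, Integer> e : splitQuote(quote).entrySet()) {
            totals.merge(e.getKey(), e.getValue(), Integer::sum);
        }
    }

    // Sort and merge step: the k most frequent words, ties broken alphabetically.
    List<String> topK(int k) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(totals.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(k, entries.size()); i++) {
            result.add(entries.get(i).getKey() + ":" + entries.get(i).getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        TopKWordsSketch sketch = new TopKWordsSketch();
        sketch.consume("I meant what I said and I said what I meant.");
        System.out.println(sketch.topK(3));
    }
}
```

For the Dr. Seuss quote, `splitQuote` yields the same per-word counts as the WordEvent examples (4 for "i", 2 for "said").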
14
Design: Processing Elements
Each PE instance is identified by four components:
 Functionality
 Types of events it consumes
 Keyed attribute of those events
 Value of the keyed attribute
Example: EV WordEvent, KEY word="said", VALUE count=2
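One way to picture this identity is as a composite map key. The sketch below is an assumption-laden illustration, not S4 code: `PeId` is a hypothetical value class whose four fields mirror the four components, so two events with the same word resolve to the same PE instance and a different word resolves to a new one.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical illustration of PE identity; not the S4 API.
class PeIdentitySketch {

    static final class PeId {
        final String functionality;  // PE class and its configuration
        final String eventType;      // type of events consumed
        final String keyedAttribute; // attribute used as the key
        final String keyValue;       // value of that attribute

        PeId(String functionality, String eventType, String keyedAttribute, String keyValue) {
            this.functionality = functionality;
            this.eventType = eventType;
            this.keyedAttribute = keyedAttribute;
            this.keyValue = keyValue;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof PeId)) return false;
            PeId other = (PeId) o;
            return functionality.equals(other.functionality)
                    && eventType.equals(other.eventType)
                    && keyedAttribute.equals(other.keyedAttribute)
                    && keyValue.equals(other.keyValue);
        }

        @Override
        public int hashCode() {
            return Objects.hash(functionality, eventType, keyedAttribute, keyValue);
        }
    }

    public static void main(String[] args) {
        // One counter per PE identity; the counts are example values.
        Map<PeId, Integer> counts = new HashMap<>();
        counts.merge(new PeId("WordCountPE", "WordEvent", "word", "said"), 2, Integer::sum);
        counts.merge(new PeId("WordCountPE", "WordEvent", "word", "i"), 4, Integer::sum);
        counts.merge(new PeId("WordCountPE", "WordEvent", "word", "said"), 7, Integer::sum);
        System.out.println(counts.size() + " PE instances");
    }
}
```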
15
Design: Processing Elements (cont.)
 Keyless PEs
 No keyed attribute or value
 Consume all events of the type with which they are associated
 Typically used at the input layer of an S4 cluster, where events are assigned a key
 Example: EV Quote, KEY null, VALUE quote=“I …”
 Standard PEs
 Count
 Aggregate
 Join
16
Design: Processing Node
 Processing Nodes (PNs) are the logical hosts to PEs.
 They are responsible for:
 listening to events
 executing operations on the incoming events
 dispatching events with the assistance of the communication layer
 emitting output events
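Dispatching needs a rule for choosing the node that hosts the PE for a given key. A common approach is to hash the keyed attribute and its value; the sketch below assumes that approach. The class `NodeRoutingSketch`, the method `nodeFor`, and the concrete hash are hypothetical choices for illustration, not the function S4 uses.

```java
// Hypothetical routing sketch; the hash function is an assumption.
class NodeRoutingSketch {

    // Maps a keyed attribute and its value to a node index in [0, nodeCount).
    static int nodeFor(String keyedAttribute, String keyValue, int nodeCount) {
        if (nodeCount <= 0) {
            throw new IllegalArgumentException("nodeCount must be positive");
        }
        int hash = 31 * keyedAttribute.hashCode() + keyValue.hashCode();
        // floorMod keeps the result non-negative even for negative hashes.
        return Math.floorMod(hash, nodeCount);
    }

    public static void main(String[] args) {
        int nodes = 4;
        System.out.println("word=said -> node " + nodeFor("word", "said", nodes));
        System.out.println("word=i    -> node " + nodeFor("word", "i", nodes));
    }
}
```

Because the mapping depends only on the key and the node count, every node computes the same destination for the same event without consulting a central router. That fits the assumption that nodes are not added to or removed from a running cluster.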
17
Design: Processing Node (cont.)
[Figure: Processing Node architecture]
Processing Node: Event Listener, Processing Element Container (PE1, PE2, …, PEn), Dispatcher, Emitter
Communication Layer: Routing, Load Balancing, Failover Management, Transport Protocols
ZooKeeper
18
Programming Model
 High-level programming paradigm
 Generic
 Reusable
 Configurable
 Java Programming Language
19
Programming Model (cont.)
 Create a PE
 Inherit from AbstractPE
 Implement processEvent()
 Implement output() (optional)
 Define multiple processEvent() methods, one per event type (polymorphism)
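The shape of such a PE can be sketched as follows. To keep the example self-contained it does not use the S4 library: `SketchPE` is a hypothetical stand-in for the framework base class, and `WordEvent` and `ResetEvent` are made-up event types. It shows only the pattern of one `processEvent` overload per event type plus an optional `output` hook.

```java
// Self-contained illustration; SketchPE is a stand-in, not the S4 base class.
class ProgrammingModelSketch {

    static abstract class SketchPE {
        // Optional hook for publishing state; does nothing unless overridden.
        void output() { }
    }

    // Hypothetical event types.
    static final class WordEvent {
        final String word;
        final int count;

        WordEvent(String word, int count) {
            this.word = word;
            this.count = count;
        }
    }

    static final class ResetEvent { }

    static final class WordCountPE extends SketchPE {
        private int total;

        // One overload per event type the PE consumes.
        void processEvent(WordEvent event) {
            total += event.count;
        }

        void processEvent(ResetEvent event) {
            total = 0;
        }

        int total() {
            return total;
        }

        @Override
        void output() {
            System.out.println("total=" + total);
        }
    }

    public static void main(String[] args) {
        WordCountPE pe = new WordCountPE();
        pe.processEvent(new WordEvent("said", 2));
        pe.processEvent(new WordEvent("said", 7));
        pe.output();
    }
}
```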
20
Programming Model (cont.)
21
Programming Model (cont.)
 Configuration
[Figure: PE1, PE2, PE3, PE4, and PE5]
22
Programming Model (cont.)
 Configuration
23
Streaming Click-Through Rate Computation
 CTR = (number of clicks) / (number of impressions)
 Two types of events:
 Serve event: a search result page is returned to the user
 Click event
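The CTR ratio can be maintained incrementally as events arrive. The sketch below is a minimal illustration: the class `CtrSketch` and its method names are hypothetical, and it treats every serve event as one impression.

```java
// Hypothetical streaming CTR counter; names are illustrative only.
class CtrSketch {
    private long serves; // impressions seen so far
    private long clicks; // clicks seen so far

    void onServe() {
        serves++;
    }

    void onClick() {
        clicks++;
    }

    // CTR = clicks / impressions; defined as 0 before any serve arrives.
    double ctr() {
        return serves == 0 ? 0.0 : (double) clicks / serves;
    }

    public static void main(String[] args) {
        CtrSketch counter = new CtrSketch();
        for (int i = 0; i < 4; i++) {
            counter.onServe();
        }
        counter.onClick();
        System.out.println("CTR = " + counter.ctr());
    }
}
```

With four serves and one click the ratio is 1/4.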
24
Streaming Click-Through Rate Computation (cont.)
 A serve event contains:
 serveID
 query
 user
 ads
 …
 A click event contains:
 Click information
 serveID
Use a set of heuristic rules to eliminate suspicious serves and clicks
25
Event Flow of CTR Computation
Input to PE1 (keyless events):
EV RawServe: KEY null, VALUE serve data
EV RawClick: KEY null, VALUE click data
PE1 to PE2 (keyed by serve ID):
EV Serve: KEY serve=123, VALUE serve data
EV Click: KEY serve=123, VALUE click data
PE2 to PE3 (keyed by user):
EV JoinedServe: KEY usr=Peter, VALUE joined data
EV JoinedClick: KEY usr=Peter, VALUE joined data
PE3 to PE4 (keyed by g-ad):
EV FilteredServe: KEY g-ad=Ipod-78, VALUE joined data
EV FilteredClick: KEY g-ad=Ipod-78, VALUE joined data
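The join step in this flow pairs each click with the serve that has the same serve ID, so the click can inherit the user and ad from the serve. The sketch below shows that idea only. The names `ServeClickJoinSketch`, `Serve`, and `JoinedClick` are hypothetical, and the per-key PE instances are collapsed into one map.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical join sketch; not the S4 API.
class ServeClickJoinSketch {

    static final class Serve {
        final String serveId;
        final String user;
        final String ad;

        Serve(String serveId, String user, String ad) {
            this.serveId = serveId;
            this.user = user;
            this.ad = ad;
        }
    }

    static final class JoinedClick {
        final String serveId;
        final String user;
        final String ad;

        JoinedClick(String serveId, String user, String ad) {
            this.serveId = serveId;
            this.user = user;
            this.ad = ad;
        }
    }

    // Serves seen so far, indexed by serve ID.
    private final Map<String, Serve> serves = new HashMap<>();

    void onServe(Serve serve) {
        serves.put(serve.serveId, serve);
    }

    // Returns the joined click, or null when no matching serve has been seen.
    JoinedClick onClick(String serveId) {
        Serve serve = serves.get(serveId);
        return serve == null ? null : new JoinedClick(serveId, serve.user, serve.ad);
    }

    public static void main(String[] args) {
        ServeClickJoinSketch join = new ServeClickJoinSketch();
        join.onServe(new Serve("123", "Peter", "Ipod-78"));
        JoinedClick joined = join.onClick("123");
        System.out.println(joined.user + " clicked " + joined.ad);
    }
}
```

A real deployment would also need to expire old serves so the map does not grow without bound; that is left out here.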
26
Apache S4
27
Apache S4
[Figures: deployment diagrams showing ZooKeeper, a repository (Repo) holding the S4 app, Cluster 1 (Node1, Node2), and Cluster 2 (Node1), with numbered steps]
Apache S4: Commands
 s4 <command> <options>
Command      Purpose
newApp       Create a new application
zkServer     Start a ZooKeeper server
newCluster   Define an S4 cluster
s4r          Package an application
deploy       Deploy/configure an application
node         Start an S4 node
status       Get information about S4 infrastructure
32
Apache S4: Failover
 High availability
 State recovery
Both are provided while preserving low processing latency
33
Apache S4: Failover
[Figure: High Availability and State Recovery, showing ZooKeeper, Nodes 1 to 4, and standby nodes]
34
Apache S4: Failover
 A PE can recover a previous state
 Checkpoints are taken periodically (uncoordinated and asynchronous)
 State is recovered lazily
[Figure: S4 node with PE 1 and PE 2, a keyed message, the checkpointing framework with its hooks, and the storage backend]
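Periodic checkpointing with lazy recovery can be sketched as below. This is an illustration under simplifying assumptions, not the S4 checkpointing framework: the class `CheckpointSketch.Node` is hypothetical, a plain map stands in for the storage backend, and checkpoints are written synchronously every `interval` events per key rather than asynchronously.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; a Map stands in for the storage backend.
class CheckpointSketch {

    static final class Node {
        private final Map<String, Integer> backend;                 // shared storage
        private final Map<String, Integer> live = new HashMap<>();  // in-memory PE state
        private final Map<String, Integer> sinceCheckpoint = new HashMap<>();
        private final int interval;

        Node(Map<String, Integer> backend, int interval) {
            this.backend = backend;
            this.interval = interval;
        }

        int process(String key, int delta) {
            // Lazy recovery: state is read from storage only when a key
            // is first needed on this node.
            int state = live.containsKey(key) ? live.get(key) : backend.getOrDefault(key, 0);
            state += delta;
            live.put(key, state);

            // Per-key checkpoint, taken without coordinating with other keys.
            int pending = sinceCheckpoint.merge(key, 1, Integer::sum);
            if (pending >= interval) {
                backend.put(key, state);
                sinceCheckpoint.put(key, 0);
            }
            return state;
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> storage = new HashMap<>();
        Node active = new Node(storage, 2);
        active.process("said", 1);
        active.process("said", 1); // second event triggers a checkpoint of 2
        active.process("said", 1); // 3 in memory, not yet checkpointed

        // The active node fails; a standby takes over with the same storage.
        Node standby = new Node(storage, 2);
        System.out.println(standby.process("said", 1));
    }
}
```

The standby resumes from the last checkpoint (2) and reports 3 rather than 4: the update made after the last checkpoint is lost, which matches the design assumption that lossy failover is acceptable.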
35
Summary
 S4: Simple Scalable Streaming System
 Design
 Processing Elements
 Processing Nodes
 Communication Layer
 Programming Model
 Apache S4
 Deployment
 Failover
36
Thanks
Q&A

More Related Content

What's hot

Why apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics FrameworksWhy apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics FrameworksSlim Baltagi
 
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...Databricks
 
Share and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
Share and analyze geonomic data at scale by Andy Petrella and Xavier TordoirShare and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
Share and analyze geonomic data at scale by Andy Petrella and Xavier TordoirSpark Summit
 
Twitter's Real Time Stack - Processing Billions of Events Using Distributed L...
Twitter's Real Time Stack - Processing Billions of Events Using Distributed L...Twitter's Real Time Stack - Processing Billions of Events Using Distributed L...
Twitter's Real Time Stack - Processing Billions of Events Using Distributed L...Karthik Ramasamy
 
Towards Benchmaking Modern Distruibuted Systems-(Grace Huang, Intel)
Towards Benchmaking Modern Distruibuted Systems-(Grace Huang, Intel)Towards Benchmaking Modern Distruibuted Systems-(Grace Huang, Intel)
Towards Benchmaking Modern Distruibuted Systems-(Grace Huang, Intel)Spark Summit
 
Cloud-based Data Stream Processing
Cloud-based Data Stream ProcessingCloud-based Data Stream Processing
Cloud-based Data Stream ProcessingZbigniew Jerzak
 
Self Regulating Streaming - Data Platforms Conference 2018
Self Regulating Streaming - Data Platforms Conference 2018Self Regulating Streaming - Data Platforms Conference 2018
Self Regulating Streaming - Data Platforms Conference 2018Streamlio
 
Storm – Streaming Data Analytics at Scale - StampedeCon 2014
Storm – Streaming Data Analytics at Scale - StampedeCon 2014Storm – Streaming Data Analytics at Scale - StampedeCon 2014
Storm – Streaming Data Analytics at Scale - StampedeCon 2014StampedeCon
 
Data Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and RData Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and RRadek Maciaszek
 
Real Time Processing Using Twitter Heron by Karthik Ramasamy
Real Time Processing Using Twitter Heron by Karthik RamasamyReal Time Processing Using Twitter Heron by Karthik Ramasamy
Real Time Processing Using Twitter Heron by Karthik RamasamyData Con LA
 
Strata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case StudiesStrata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case StudiesPaco Nathan
 
Storm@Twitter, SIGMOD 2014 paper
Storm@Twitter, SIGMOD 2014 paperStorm@Twitter, SIGMOD 2014 paper
Storm@Twitter, SIGMOD 2014 paperKarthik Ramasamy
 
Making Pretty Charts in Splunk
Making Pretty Charts in SplunkMaking Pretty Charts in Splunk
Making Pretty Charts in SplunkSplunk
 
20120907 microbiome-intro
20120907 microbiome-intro20120907 microbiome-intro
20120907 microbiome-introLeo Lahti
 
Functional Comparison and Performance Evaluation of Streaming Frameworks
Functional Comparison and Performance Evaluation of Streaming FrameworksFunctional Comparison and Performance Evaluation of Streaming Frameworks
Functional Comparison and Performance Evaluation of Streaming FrameworksHuafeng Wang
 
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataFrom Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataDatabricks
 
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-timeChris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-timeFlink Forward
 

What's hot (20)

Why apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics FrameworksWhy apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics Frameworks
 
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
 
Share and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
Share and analyze geonomic data at scale by Andy Petrella and Xavier TordoirShare and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
Share and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
 
Twitter's Real Time Stack - Processing Billions of Events Using Distributed L...
Twitter's Real Time Stack - Processing Billions of Events Using Distributed L...Twitter's Real Time Stack - Processing Billions of Events Using Distributed L...
Twitter's Real Time Stack - Processing Billions of Events Using Distributed L...
 
Towards Benchmaking Modern Distruibuted Systems-(Grace Huang, Intel)
Towards Benchmaking Modern Distruibuted Systems-(Grace Huang, Intel)Towards Benchmaking Modern Distruibuted Systems-(Grace Huang, Intel)
Towards Benchmaking Modern Distruibuted Systems-(Grace Huang, Intel)
 
Cloud-based Data Stream Processing
Cloud-based Data Stream ProcessingCloud-based Data Stream Processing
Cloud-based Data Stream Processing
 
Self Regulating Streaming - Data Platforms Conference 2018
Self Regulating Streaming - Data Platforms Conference 2018Self Regulating Streaming - Data Platforms Conference 2018
Self Regulating Streaming - Data Platforms Conference 2018
 
Storm – Streaming Data Analytics at Scale - StampedeCon 2014
Storm – Streaming Data Analytics at Scale - StampedeCon 2014Storm – Streaming Data Analytics at Scale - StampedeCon 2014
Storm – Streaming Data Analytics at Scale - StampedeCon 2014
 
Neo4j vs giraph
Neo4j vs giraphNeo4j vs giraph
Neo4j vs giraph
 
Data Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and RData Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and R
 
Real Time Processing Using Twitter Heron by Karthik Ramasamy
Real Time Processing Using Twitter Heron by Karthik RamasamyReal Time Processing Using Twitter Heron by Karthik Ramasamy
Real Time Processing Using Twitter Heron by Karthik Ramasamy
 
Strata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case StudiesStrata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case Studies
 
Storm@Twitter, SIGMOD 2014 paper
Storm@Twitter, SIGMOD 2014 paperStorm@Twitter, SIGMOD 2014 paper
Storm@Twitter, SIGMOD 2014 paper
 
Making Pretty Charts in Splunk
Making Pretty Charts in SplunkMaking Pretty Charts in Splunk
Making Pretty Charts in Splunk
 
Yahoo compares Storm and Spark
Yahoo compares Storm and SparkYahoo compares Storm and Spark
Yahoo compares Storm and Spark
 
20120907 microbiome-intro
20120907 microbiome-intro20120907 microbiome-intro
20120907 microbiome-intro
 
Tuning Java Servers
Tuning Java Servers Tuning Java Servers
Tuning Java Servers
 
Functional Comparison and Performance Evaluation of Streaming Frameworks
Functional Comparison and Performance Evaluation of Streaming FrameworksFunctional Comparison and Performance Evaluation of Streaming Frameworks
Functional Comparison and Performance Evaluation of Streaming Frameworks
 
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataFrom Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
 
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-timeChris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
 

Viewers also liked

Big data Clustering Algorithms And Strategies
Big data Clustering Algorithms And StrategiesBig data Clustering Algorithms And Strategies
Big data Clustering Algorithms And StrategiesFarzad Nozarian
 
The Continuous Distributed Monitoring Model
The Continuous Distributed Monitoring ModelThe Continuous Distributed Monitoring Model
The Continuous Distributed Monitoring ModelFarzad Nozarian
 
Tank Battle - A simple game powered by JMonkey engine
Tank Battle - A simple game powered by JMonkey engineTank Battle - A simple game powered by JMonkey engine
Tank Battle - A simple game powered by JMonkey engineFarzad Nozarian
 
Apache HDFS - Lab Assignment
Apache HDFS - Lab AssignmentApache HDFS - Lab Assignment
Apache HDFS - Lab AssignmentFarzad Nozarian
 
Apache HBase - Lab Assignment
Apache HBase - Lab AssignmentApache HBase - Lab Assignment
Apache HBase - Lab AssignmentFarzad Nozarian
 
Apache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce TutorialApache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce TutorialFarzad Nozarian
 
Big Data Processing in Cloud Computing Environments
Big Data Processing in Cloud Computing EnvironmentsBig Data Processing in Cloud Computing Environments
Big Data Processing in Cloud Computing EnvironmentsFarzad Nozarian
 
Introduction to Database Services
Introduction to Database ServicesIntroduction to Database Services
Introduction to Database ServicesAmazon Web Services
 
Big Data and Cloud Computing
Big Data and Cloud ComputingBig Data and Cloud Computing
Big Data and Cloud ComputingFarzad Nozarian
 
The Mini-Guide to Presentation Practice
The Mini-Guide to Presentation PracticeThe Mini-Guide to Presentation Practice
The Mini-Guide to Presentation PracticeEthos3
 
How I got 2.5 Million views on Slideshare (by @nickdemey - Board of Innovation)
How I got 2.5 Million views on Slideshare (by @nickdemey - Board of Innovation)How I got 2.5 Million views on Slideshare (by @nickdemey - Board of Innovation)
How I got 2.5 Million views on Slideshare (by @nickdemey - Board of Innovation)Board of Innovation
 
The Seven Deadly Social Media Sins
The Seven Deadly Social Media SinsThe Seven Deadly Social Media Sins
The Seven Deadly Social Media SinsXPLAIN
 
Five Killer Ways to Design The Same Slide
Five Killer Ways to Design The Same SlideFive Killer Ways to Design The Same Slide
Five Killer Ways to Design The Same SlideCrispy Presentations
 
How People Really Hold and Touch (their Phones)
How People Really Hold and Touch (their Phones)How People Really Hold and Touch (their Phones)
How People Really Hold and Touch (their Phones)Steven Hoober
 
Upworthy: 10 Ways To Win The Internets
Upworthy: 10 Ways To Win The InternetsUpworthy: 10 Ways To Win The Internets
Upworthy: 10 Ways To Win The InternetsUpworthy
 

Viewers also liked (20)

Object Based Databases
Object Based DatabasesObject Based Databases
Object Based Databases
 
Big data Clustering Algorithms And Strategies
Big data Clustering Algorithms And StrategiesBig data Clustering Algorithms And Strategies
Big data Clustering Algorithms And Strategies
 
The Continuous Distributed Monitoring Model
The Continuous Distributed Monitoring ModelThe Continuous Distributed Monitoring Model
The Continuous Distributed Monitoring Model
 
Tank Battle - A simple game powered by JMonkey engine
Tank Battle - A simple game powered by JMonkey engineTank Battle - A simple game powered by JMonkey engine
Tank Battle - A simple game powered by JMonkey engine
 
Apache HDFS - Lab Assignment
Apache HDFS - Lab AssignmentApache HDFS - Lab Assignment
Apache HDFS - Lab Assignment
 
Shark - Lab Assignment
Shark - Lab AssignmentShark - Lab Assignment
Shark - Lab Assignment
 
Apache Storm Tutorial
Apache Storm TutorialApache Storm Tutorial
Apache Storm Tutorial
 
Apache HBase - Lab Assignment
Apache HBase - Lab AssignmentApache HBase - Lab Assignment
Apache HBase - Lab Assignment
 
Apache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce TutorialApache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce Tutorial
 
Big Data Processing in Cloud Computing Environments
Big Data Processing in Cloud Computing EnvironmentsBig Data Processing in Cloud Computing Environments
Big Data Processing in Cloud Computing Environments
 
Introduction to Database Services
Introduction to Database ServicesIntroduction to Database Services
Introduction to Database Services
 
Big Data and Cloud Computing
Big Data and Cloud ComputingBig Data and Cloud Computing
Big Data and Cloud Computing
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
The Mini-Guide to Presentation Practice
The Mini-Guide to Presentation PracticeThe Mini-Guide to Presentation Practice
The Mini-Guide to Presentation Practice
 
The Minimum Loveable Product
The Minimum Loveable ProductThe Minimum Loveable Product
The Minimum Loveable Product
 
How I got 2.5 Million views on Slideshare (by @nickdemey - Board of Innovation)
How I got 2.5 Million views on Slideshare (by @nickdemey - Board of Innovation)How I got 2.5 Million views on Slideshare (by @nickdemey - Board of Innovation)
How I got 2.5 Million views on Slideshare (by @nickdemey - Board of Innovation)
 
The Seven Deadly Social Media Sins
The Seven Deadly Social Media SinsThe Seven Deadly Social Media Sins
The Seven Deadly Social Media Sins
 
Five Killer Ways to Design The Same Slide
Five Killer Ways to Design The Same SlideFive Killer Ways to Design The Same Slide
Five Killer Ways to Design The Same Slide
 
How People Really Hold and Touch (their Phones)
How People Really Hold and Touch (their Phones)How People Really Hold and Touch (their Phones)
How People Really Hold and Touch (their Phones)
 
Upworthy: 10 Ways To Win The Internets
Upworthy: 10 Ways To Win The InternetsUpworthy: 10 Ways To Win The Internets
Upworthy: 10 Ways To Win The Internets
 

Similar to S4: Distributed Stream Computing Platform

Overview Of Parallel Development - Ericnel
Overview Of Parallel Development -  EricnelOverview Of Parallel Development -  Ericnel
Overview Of Parallel Development - Ericnelukdpe
 
Overview of VS2010 and .NET 4.0
Overview of VS2010 and .NET 4.0Overview of VS2010 and .NET 4.0
Overview of VS2010 and .NET 4.0Bruce Johnson
 
Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with...
Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with...Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with...
Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with...jsvetter
 
Managing V Sphere With The Vesi
Managing V Sphere With The VesiManaging V Sphere With The Vesi
Managing V Sphere With The VesiEric Sloof
 
Flink Forward San Francisco 2018: David Reniz & Dahyr Vergara - "Real-time m...
Flink Forward San Francisco 2018:  David Reniz & Dahyr Vergara - "Real-time m...Flink Forward San Francisco 2018:  David Reniz & Dahyr Vergara - "Real-time m...
Flink Forward San Francisco 2018: David Reniz & Dahyr Vergara - "Real-time m...Flink Forward
 
Thinking in parallel ab tuladev
Thinking in parallel ab tuladevThinking in parallel ab tuladev
Thinking in parallel ab tuladevPavel Tsukanov
 
Guido schmutz-jax2011-event-driven soa
Guido schmutz-jax2011-event-driven soaGuido schmutz-jax2011-event-driven soa
Guido schmutz-jax2011-event-driven soaGuido Schmutz
 
Building Event Driven Architectures with Kafka and Cloud Events (Dan Rosanova...
Building Event Driven Architectures with Kafka and Cloud Events (Dan Rosanova...Building Event Driven Architectures with Kafka and Cloud Events (Dan Rosanova...
Building Event Driven Architectures with Kafka and Cloud Events (Dan Rosanova...confluent
 
Data Analysis with Apache Flink (Hadoop Summit, 2015)
Data Analysis with Apache Flink (Hadoop Summit, 2015)Data Analysis with Apache Flink (Hadoop Summit, 2015)
Data Analysis with Apache Flink (Hadoop Summit, 2015)Aljoscha Krettek
 
Spark Based Distributed Deep Learning Framework For Big Data Applications
Spark Based Distributed Deep Learning Framework For Big Data Applications Spark Based Distributed Deep Learning Framework For Big Data Applications
Spark Based Distributed Deep Learning Framework For Big Data Applications Humoyun Ahmedov
 
WSO2 Product Release Webinar - Introducing the WSO2 Complex Event Processor
WSO2 Product Release Webinar - Introducing the WSO2 Complex Event Processor WSO2 Product Release Webinar - Introducing the WSO2 Complex Event Processor
WSO2 Product Release Webinar - Introducing the WSO2 Complex Event Processor WSO2
 
Achieve Sub-Second Analytics on Apache Kafka with Confluent and Imply
Achieve Sub-Second Analytics on Apache Kafka with Confluent and ImplyAchieve Sub-Second Analytics on Apache Kafka with Confluent and Imply
Achieve Sub-Second Analytics on Apache Kafka with Confluent and Implyconfluent
 
Compiler Construction for DLX Processor
Compiler Construction for DLX Processor Compiler Construction for DLX Processor
Compiler Construction for DLX Processor Soham Kulkarni
 
Complex Event Processor 3.0.0 - An overview of upcoming features
Complex Event Processor 3.0.0 - An overview of upcoming features Complex Event Processor 3.0.0 - An overview of upcoming features
Complex Event Processor 3.0.0 - An overview of upcoming features WSO2
 
Cloud Native London 2019 Faas composition using Kafka and cloud-events
Cloud Native London 2019 Faas composition using Kafka and cloud-eventsCloud Native London 2019 Faas composition using Kafka and cloud-events
Cloud Native London 2019 Faas composition using Kafka and cloud-eventsNeil Avery
 
Simplified Data Processing On Large Cluster
Simplified Data Processing On Large ClusterSimplified Data Processing On Large Cluster
Simplified Data Processing On Large ClusterHarsh Kevadia
 
Cogility intel-web site-v1.0
Cogility intel-web site-v1.0Cogility intel-web site-v1.0
Cogility intel-web site-v1.0Cogility
 

Similar to S4: Distributed Stream Computing Platform (20)

Overview Of Parallel Development - Ericnel
Overview Of Parallel Development -  EricnelOverview Of Parallel Development -  Ericnel
Overview Of Parallel Development - Ericnel
 
Overview of VS2010 and .NET 4.0
Overview of VS2010 and .NET 4.0Overview of VS2010 and .NET 4.0
Overview of VS2010 and .NET 4.0
 
Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with...
Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with...Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with...
Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with...
 
Managing V Sphere With The Vesi
Managing V Sphere With The VesiManaging V Sphere With The Vesi
Managing V Sphere With The Vesi
 
Flink Forward San Francisco 2018: David Reniz & Dahyr Vergara - "Real-time m...
Flink Forward San Francisco 2018:  David Reniz & Dahyr Vergara - "Real-time m...Flink Forward San Francisco 2018:  David Reniz & Dahyr Vergara - "Real-time m...
Flink Forward San Francisco 2018: David Reniz & Dahyr Vergara - "Real-time m...
 
Thinking in parallel ab tuladev
Thinking in parallel ab tuladevThinking in parallel ab tuladev
Thinking in parallel ab tuladev
 
Guido schmutz-jax2011-event-driven soa
Guido schmutz-jax2011-event-driven soaGuido schmutz-jax2011-event-driven soa
Guido schmutz-jax2011-event-driven soa
 
Building Event Driven Architectures with Kafka and Cloud Events (Dan Rosanova...
Building Event Driven Architectures with Kafka and Cloud Events (Dan Rosanova...Building Event Driven Architectures with Kafka and Cloud Events (Dan Rosanova...
Building Event Driven Architectures with Kafka and Cloud Events (Dan Rosanova...
 
Data Analysis with Apache Flink (Hadoop Summit, 2015)
Data Analysis with Apache Flink (Hadoop Summit, 2015)Data Analysis with Apache Flink (Hadoop Summit, 2015)
Data Analysis with Apache Flink (Hadoop Summit, 2015)
 
My Master's Thesis
My Master's ThesisMy Master's Thesis
My Master's Thesis
 
Spark Based Distributed Deep Learning Framework For Big Data Applications
Spark Based Distributed Deep Learning Framework For Big Data Applications Spark Based Distributed Deep Learning Framework For Big Data Applications
Spark Based Distributed Deep Learning Framework For Big Data Applications
 
PhD thesis
PhD thesisPhD thesis
PhD thesis
 
WSO2 Product Release Webinar - Introducing the WSO2 Complex Event Processor
WSO2 Product Release Webinar - Introducing the WSO2 Complex Event Processor WSO2 Product Release Webinar - Introducing the WSO2 Complex Event Processor
WSO2 Product Release Webinar - Introducing the WSO2 Complex Event Processor
 
Achieve Sub-Second Analytics on Apache Kafka with Confluent and Imply
Achieve Sub-Second Analytics on Apache Kafka with Confluent and ImplyAchieve Sub-Second Analytics on Apache Kafka with Confluent and Imply
Achieve Sub-Second Analytics on Apache Kafka with Confluent and Imply
 
Compiler Construction for DLX Processor
Compiler Construction for DLX Processor Compiler Construction for DLX Processor
Compiler Construction for DLX Processor
 
Complex Event Processor 3.0.0 - An overview of upcoming features
Complex Event Processor 3.0.0 - An overview of upcoming features Complex Event Processor 3.0.0 - An overview of upcoming features
Complex Event Processor 3.0.0 - An overview of upcoming features
 
Cloud Native London 2019 Faas composition using Kafka and cloud-events
Cloud Native London 2019 Faas composition using Kafka and cloud-eventsCloud Native London 2019 Faas composition using Kafka and cloud-events
Cloud Native London 2019 Faas composition using Kafka and cloud-events
 
Simplified Data Processing On Large Cluster
Simplified Data Processing On Large ClusterSimplified Data Processing On Large Cluster
Simplified Data Processing On Large Cluster
 
Cogility intel-web site-v1.0
Cogility intel-web site-v1.0Cogility intel-web site-v1.0
Cogility intel-web site-v1.0
 
My Saminar On Php
My Saminar On PhpMy Saminar On Php
My Saminar On Php
 

Recently uploaded

How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPCeline George
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Mark Reed
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTiammrhaywood
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...JhezDiaz1
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
S4: Distributed Stream Computing Platform

  • 1. S4: Distributed Stream Computing Platform (Yahoo Labs, 2010). Farzad Nozarian, Mazaher Bazari. Cloud Computing, CEIT@AUT, 12/22/2014.
  • 2. Who we are: Farzad Nozarian (fnozarian@aut.ac.ir), Big Data Processing and Mining; Mazaher Bazari (mbazari@aut.ac.ir), Mobile Cloud Computing.
  • 3. What is S4? Simple Scalable Streaming System, inspired by the MapReduce model. S4 is a general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform for processing continuous unbounded streams of data.
  • 4. Motivation: real-time search; high-frequency trading; social networks.
  • 6. “Cost-per-click” billing model: render the most relevant ads in an optimal position on the page. Process thousands of queries per second. Include user preferences from context: recent user activity, geographic location, prior queries, prior clicks.
  • 7. Reinvent the Wheel! One option: extend the open-source Hadoop platform to support computation over unbounded streams. But Hadoop isn't suitable: the Hadoop platform is highly optimized for batch processing, and MapReduce systems typically operate on static data by scheduling batch jobs.
  • 8. Real-world systems: the data stream is cut into fixed-size segments (Partition #1, Partition #2, Partition #3, …, Partition #N). Latency is proportional to the length of the segment plus the overhead of segmenting the stream and initiating the processing jobs.
  • 9. Design goals: simple API; scale using commodity hardware; minimize latency by using local memory in each processing node.
  • 10. Design goals (cont.): decentralized and symmetric architecture; pluggable architecture; science-friendly.
  • 11. S4 Model: avoids the use of shared memory across the cluster; distributed operation on commodity hardware; follows the Actors model (G. Agha, Actors: A Model of Concurrent Computation in Distributed Systems).
  • 12. S4 model (cont.): Computation is performed by Processing Elements (PEs). Messages are transmitted between them in the form of data events. The state of each PE is inaccessible to other PEs. Event emission and consumption is the only mode of interaction between PEs. The framework provides the capability to route events to the appropriate PEs and to create new instances of PEs.
  • 13. Design Assumptions: lossy failover is acceptable; nodes will not be added to or removed from a running cluster.
  • 14. Design: Example. What is the task? To continuously produce a sorted list of the top K most frequent words across all documents with minimal latency.
  • 15. A keyless event (EV Quote, KEY null, VALUE quote="I …") arrives at PE1 with the quote “I meant what I said and I said what I meant.” (Dr. Seuss). QuoteSplitterPE (PE1) counts unique words in the quote and emits an event for each word, e.g. EV WordEvent, KEY word="said", VALUE count=2, and EV WordEvent, KEY word="i", VALUE count=4. WordCountPE (PE2-4) keeps total counts for each word across all quotes and emits an event any time a count is updated, e.g. EV UpdatedCountEv, KEY sortID=2, VALUE word="said" count=9, and EV UpdatedCountEv, KEY sortID=9, VALUE word="i" count=35. Partial top-K lists travel as EV PartialTopKEv, KEY topk=1234, VALUE words={w:cnt}. MergePE (PE8) combines the partial TopK lists and outputs the final TopK list.
  • 16. Design: Processing Elements. A PE is identified by: its functionality; the types of events it consumes; the keyed attribute of those events; the value of the keyed attribute. Example: EV WordEvent, KEY word="said", VALUE count=2.
  • 17. Design: Processing Elements (cont.). Keyless PEs (e.g. for EV Quote, KEY null, VALUE quote="I …"): have no keyed attribute or value; consume all events of the type with which they are associated; are typically used at the input layer of an S4 cluster, where events are assigned a key. Standard PEs: count, aggregate, join.
  • 18. Design: Processing Node. Processing Nodes (PNs) are the logical hosts to PEs. They are responsible for: listening to events; executing operations on the incoming events; dispatching events with the assistance of the communication layer; emitting output events.
  • 19. Design: Processing Node (cont.). A processing node holds a Processing Element Container (PE1, PE2, …, PEn) together with an Event Listener, a Dispatcher, and an Emitter. The Communication Layer provides routing, load balancing, failover management, and transport protocols, and works with ZooKeeper.
  • 20. Programming Model: high-level programming paradigm; generic, reusable, configurable; Java programming language.
  • 21. Programming Model (cont.). Create a PE: inherit from AbstractPE; implement ProcessEvent() (many ProcessEvent methods can be defined, using polymorphism); optionally implement Output().
  • 23. Programming Model (cont.): Configuration (diagram of PE1 to PE5).
  • 24. Programming Model (cont.): Configuration.
  • 25. Streaming Click-Through Rate Computation. CTR = (number of clicks) / (number of impressions). Two types of events: serve events and click events. A serve occurs when a search result page is returned to the user.
  • 26. Streaming Click-Through Rate Computation (cont.). A serve event contains: serveID, query, user, ads, … A click event contains: click information, serveID. A set of heuristic rules is used to eliminate suspicious serves and clicks.
  • 27. Event Flow of CTR Computation (PE1 to PE4). Keyless inputs: EV RawServe (KEY null, VALUE serve data) and EV RawClick (KEY null, VALUE click data). Keyed by serve: EV Serve (KEY serve=123, VALUE serve data) and EV Click (KEY serve=123, VALUE click data). Keyed by user: EV JoinedServe (KEY usr=Peter, VALUE joined data) and EV JoinedClick (KEY usr=Peter, VALUE joined data). Keyed by ad: EV FilteredServe (KEY g-ad=Ipod-78, VALUE joined data) and EV FilteredClick (KEY g-ad=Ipod-78, VALUE joined data).
  • 30. Apache S4 (diagram): ZooKeeper; Cluster 1 with Node1 and Node2; Repo; S4 App; steps 1 to 4.
  • 33. Apache S4: Commands. Usage: s4 <command> <options>. Commands and their purpose: newApp (create a new application); zkServer (start a ZooKeeper server); newCluster (define an S4 cluster); s4r (package an application); deploy (deploy/configure an application); node (start an S4 node); status (get information about S4 infrastructure).
  • 34. Apache S4: Failover while preserving low processing latency. Two parts: high availability and state recovery.
  • 35. Apache S4: Failover, high availability (diagram): ZooKeeper; Node 1, Node 2, Node 3, Node 4; three Standby nodes; Node 3.
  • 36. Apache S4: Failover, state recovery. A PE recovers a previous state: state is checkpointed periodically (uncoordinated and asynchronous) and recovered lazily. Diagram labels: S4 node, PE 1, PE 2, keyed message, checkpoint framework, hooks, storage backend, storage.
  • 37. Summary. S4: Simple Scalable Streaming System. Design: Processing Elements, Processing Nodes, Communication Layer. Programming Model. Apache S4: Deployment, Failover.
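
The word-count example from slides 14 and 15 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the S4 API: the `Runtime` class, the `top_words` helper, and the single `TopKPE` (which stands in for the sort and merge stages) are hypothetical names, and everything runs in one process. What it tries to show is the model itself: PEs keep private state, interact only by emitting events, and the framework creates one PE instance per key value on demand.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    type: str      # event type, e.g. "WordEvent"
    key: object    # value of the keyed attribute; None for keyless events
    value: object  # payload


class Runtime:
    """Hypothetical single-process stand-in for the framework."""

    def __init__(self):
        self.prototypes = {}  # event type -> PE class that consumes it
        self.instances = {}   # (event type, key value) -> PE instance

    def register(self, event_type, pe_class):
        self.prototypes[event_type] = pe_class

    def emit(self, event):
        slot = (event.type, event.key)
        if slot not in self.instances:
            # First event for this key value: instantiate a new PE.
            self.instances[slot] = self.prototypes[event.type](self)
        self.instances[slot].process_event(event)


class QuoteSplitterPE:
    """Keyless PE: counts unique words in a quote, emits one event per word."""

    def __init__(self, runtime):
        self.runtime = runtime

    def process_event(self, event):
        words = re.findall(r"[a-z']+", event.value.lower())
        for word, count in Counter(words).items():
            self.runtime.emit(Event("WordEvent", word, count))


class WordCountPE:
    """One instance per word: keeps the running total across all quotes."""

    def __init__(self, runtime):
        self.runtime = runtime
        self.total = 0

    def process_event(self, event):
        self.total += event.value
        self.runtime.emit(Event("UpdatedCountEv", None, (event.key, self.total)))


class TopKPE:
    """Simplification: a single keyless PE replaces the sort and merge PEs."""

    def __init__(self, runtime):
        self.counts = {}

    def process_event(self, event):
        word, total = event.value
        self.counts[word] = total

    def top_k(self, k):
        ranked = sorted(self.counts.items(), key=lambda kv: (-kv[1], kv[0]))
        return ranked[:k]


def top_words(quotes, k=3):
    runtime = Runtime()
    runtime.register("Quote", QuoteSplitterPE)
    runtime.register("WordEvent", WordCountPE)
    runtime.register("UpdatedCountEv", TopKPE)
    for quote in quotes:
        runtime.emit(Event("Quote", None, quote))
    merger = runtime.instances.get(("UpdatedCountEv", None))
    return merger.top_k(k) if merger else []
```

Calling `top_words(["I meant what I said and I said what I meant."])` feeds the Dr. Seuss quote through the three stages; the per-word counts it produces for "i" and "said" match the WordEvent values on slide 15.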

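The click-through rate computation on slides 25 to 27 joins click events to serve events through their shared serveID and then divides clicks by impressions. A minimal sketch of that join, assuming one ad per serve and no bot filtering (both simplifications; the `CtrCounter` class and its method names are hypothetical, not part of S4):

```python
from collections import defaultdict


class CtrCounter:
    """Joins clicks to serves by serveID and tracks CTR per ad."""

    def __init__(self):
        self.serve_to_ad = {}                # serveID -> ad shown in that serve
        self.impressions = defaultdict(int)  # ad -> number of serves
        self.clicks = defaultdict(int)       # ad -> number of joined clicks

    def on_serve(self, serve_id, ad):
        self.serve_to_ad[serve_id] = ad
        self.impressions[ad] += 1

    def on_click(self, serve_id):
        ad = self.serve_to_ad.get(serve_id)
        if ad is None:
            return  # no matching serve: the click cannot be joined, so drop it
        self.clicks[ad] += 1

    def ctr(self, ad):
        shown = self.impressions.get(ad, 0)
        return self.clicks.get(ad, 0) / shown if shown else 0.0
```

For example, after four serves of the ad "Ipod-78" and one click on one of them, `ctr("Ipod-78")` is 1/4. In S4 the same steps would be spread over several PEs keyed first by serve, then by user, then by ad, as the event flow on slide 27 indicates.
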
Editor's Notes

  1. thousands of queries per second, which may include several ads per page. To process user feedback, we developed S4, a low latency, scalable stream processing engine.
  2. The main requirement for research is to have a high degree of flexibility to deploy algorithms to the field very quickly. The main requirements for a production environment are scalability and high availability.
  3. Small segments will reduce latency, add overhead, and make it more complex to manage intersegment dependencies. On the other hand, large segments would increase latency. The optimal segment size will depend on the application.
  4. Minimize latency by using local memory in each processing node and avoiding disk I/O bottlenecks.
  5. Decentralized architecture greatly simplifies deployment and maintenance. Use a pluggable architecture to keep the design as generic and customizable as possible.
  6. Upon a server failure, processes are automatically moved to a standby server. The state of the processes, which is stored in local memory, is lost during the handoff. The state is regenerated using the input streams.
  7. QuoteSplitterPE is a keyless PE object that processes all Quote events. For each unique word in a document, the QuoteSplitterPE object will assign a count and emit a new event of type WordEvent, keyed on word. If a WordCountPE object already exists for that word, it is called and its counter is incremented; otherwise a new WordCountPE object is instantiated.
  8. S4 routes each event to PNs based on a hash function of the values of all known keyed attributes in that event.
  9. The Communication Layer provides cluster management and automatic failover to standby nodes, and maps physical nodes to logical nodes. It uses a pluggable architecture to select the network protocol. Events may be sent with or without a guarantee. It uses ZooKeeper to help coordinate between nodes.
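
Note 8 says that events are routed to processing nodes by hashing the values of their keyed attributes. A minimal sketch of such a router follows. The choice of SHA-1, the `name=value` encoding, and the `route` function itself are assumptions made for illustration; the notes do not say which hash function S4 uses.

```python
import hashlib


def route(keyed_values, num_nodes):
    """Pick a processing-node index from an event's keyed attribute values."""
    if num_nodes <= 0:
        raise ValueError("num_nodes must be positive")
    # Sort by attribute name so the result does not depend on dict order.
    parts = [f"{name}={value}" for name, value in sorted(keyed_values.items())]
    digest = hashlib.sha1("\x1f".join(parts).encode("utf-8")).digest()
    # A stable digest (unlike Python's salted hash()) gives every process the
    # same answer, so all events with the same key reach the same node.
    return int.from_bytes(digest[:8], "big") % num_nodes
```

Because the node index is taken modulo a fixed `num_nodes`, every event with `word="said"` lands on the same node, which is what lets a single WordCountPE instance own that word's counter. It also suggests why the design assumes that nodes are not added to or removed from a running cluster: changing `num_nodes` would remap most keys.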