There is an ever-increasing number of use cases, such as online fraud detection, for which the response times of traditional batch processing are too slow. To react to such events in close to real time, you need to go beyond classical batch processing and use stream processing systems such as Apache Spark Streaming, Apache Flink, or Apache Storm. These systems, however, are not sufficient on their own. For an efficient and fault-tolerant setup, you also need a message queue and a storage system. One common example of a fast data pipeline is the SMACK stack:
• Spark (Streaming) – the stream processing system
• Mesos – the cluster orchestrator
• Akka – the system for providing custom actors that react to the analyses
• Cassandra – the storage system
• Kafka – the message queue
Setting up this kind of pipeline in a scalable, efficient, and fault-tolerant manner is not trivial. First, this workshop discusses the different components of the SMACK stack. Then, participants get hands-on experience in setting up and maintaining data pipelines.
5. Batch – Micro-Batch – Event Processing
[Diagram: data-processing latency spectrum from days and hours (batch) through minutes and seconds down to microseconds (event processing). Batch processing reports what has happened using descriptive analytics (billing, chargeback, product recommendations); event processing solves problems using predictive and prescriptive analytics (predictive user interfaces, real-time pricing and routing, real-time advertising).]
6. Fast Data Pipelines
[Diagram: sensors, devices, and clients feed request/response data ingestion into a message queue/bus, which connects microservices, distributed storage, and (streaming) analytics.]
8. Fast Data Pipelines
• EVENTS – ubiquitous data streams from connected devices
• INGEST – ingest millions of events per second
• STORE – distributed & highly scalable database
• ANALYZE – real-time and batch process data
• ACT – visualize data and build data-driven applications
9. SMACK Stack
• EVENTS – ubiquitous data streams from sensors, devices, and clients
• INGEST (Apache Kafka) – ingest millions of events per second
• STORE (Apache Cassandra) – distributed & highly scalable database
• ANALYZE (Apache Spark) – real-time and batch process data
• ACT (Akka) – visualize data and build data-driven applications
All components run on Apache Mesos.
11. Message Queues
[SMACK pipeline diagram as on slide 9, running on DC/OS, with the ingest layer highlighted: Apache Kafka ingesting millions of events per second from sensors, devices, and clients.]
13. APACHE KAFKA
• High-throughput, distributed, persistent publish-subscribe messaging system
• Originated at LinkedIn
• Typically used as a buffer/de-coupling layer in online stream processing
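Kafka's core abstraction is an append-only, partitioned log that consumer groups read at their own offsets. A minimal in-memory sketch of that idea (not the real Kafka API; the `Topic` class and its methods are illustrative only):

```python
class Topic:
    """Append-only log, like a single Kafka topic partition."""

    def __init__(self):
        self.log = []        # messages are retained, not deleted on read
        self.offsets = {}    # consumer-group name -> next offset to read

    def produce(self, message):
        self.log.append(message)
        return len(self.log) - 1          # offset of the new message

    def consume(self, group, max_messages=10):
        """Each consumer group tracks its own position in the log."""
        start = self.offsets.get(group, 0)
        batch = self.log[start:start + max_messages]
        self.offsets[group] = start + len(batch)
        return batch


events = Topic()
for i in range(5):
    events.produce(f"event-{i}")

# Two independent consumer groups each see the full stream:
print(events.consume("analytics"))   # ['event-0', ..., 'event-4']
print(events.consume("archiver"))    # same messages, separate offset
```

Because messages persist after being read, the log can act as the buffer/de-coupling layer the slide mentions: slow consumers simply lag behind at an older offset instead of blocking producers.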
16. DELIVERY GUARANTEES
• At most once – messages may be lost but are never redelivered
• At least once – messages are never lost but may be redelivered
• Exactly once – what people actually want: each message is delivered once and only once

Murphy's Law of Distributed Systems: anything that can go wrong, will go wrong … partially!
18. Stream Processing
[SMACK pipeline diagram, with the analytics layer highlighted: Apache Spark for real-time and batch processing.]
21. APACHE SPARK (STREAMING 2.0)
Typical use: distributed, large-scale data processing; micro-batching
Why Spark Streaming?
• Micro-batching keeps latency low while delivering high throughput, which can make it faster overall
• A well-defined role means it fits in well with the other pieces of the pipeline
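Micro-batching means the engine collects the events arriving within a short, fixed interval and processes them together as one small batch. A minimal simulation of that grouping (timestamps, values, and the 1000 ms interval are made up for illustration):

```python
from collections import defaultdict


def micro_batches(events, interval):
    """Group (timestamp_ms, value) events into fixed-width batch windows
    and aggregate each window, like a micro-batch streaming engine does."""
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[timestamp // interval].append(value)   # assign event to its window
    # one aggregate result per window, in time order
    return [(window * interval, sum(values))
            for window, values in sorted(batches.items())]


# events arriving over ~3 seconds, grouped into 1000 ms micro-batches
events = [(120, 5), (450, 3), (1010, 7), (1900, 1), (2500, 4)]
print(micro_batches(events, interval=1000))
# [(0, 8), (1000, 8), (2000, 4)]
```

Latency is bounded by the interval: a window's result is available roughly one interval after the window closes, which is the trade-off behind "fast enough" micro-batch latency versus per-event processing.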
26. Storage
[SMACK pipeline diagram, with the storage layer highlighted: Apache Cassandra as the distributed & highly scalable database.]
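The deck labels Cassandra simply as the distributed, highly scalable database; the mechanism behind that scalability is hash-partitioning rows across a ring of nodes by their partition key, so any node can compute where a row lives without coordination. A simplified sketch (real Cassandra uses Murmur3 tokens and replication, both omitted here; node and key names are invented):

```python
import hashlib


def owner_node(partition_key, nodes):
    """Deterministically map a partition key to one node, roughly how a
    partitioner assigns rows to positions on the ring (no replication)."""
    digest = hashlib.md5(partition_key.encode()).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


nodes = ["node-a", "node-b", "node-c"]
for key in ["sensor-1", "sensor-2", "sensor-3"]:
    print(key, "->", owner_node(key, nodes))
```

Because the mapping is a pure function of the key, both writes and reads go straight to the owning node, which is what lets the database scale near-linearly by adding nodes.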
32. MULTIPLEXING OF DATA, SERVICES, USERS, ENVIRONMENTS
• Typical datacenter: siloed, over-provisioned servers, low utilization
• Mesos / DC/OS: automated schedulers multiplex workloads (MySQL, microservices, Cassandra, Spark/Hadoop, Kafka) onto the same machines
33. Apache Mesos
• A top-level Apache project
• A cluster resource negotiator
• Scalable to 10,000s of nodes
• Fault-tolerant, battle-tested
• An SDK for distributed apps
• Native Docker support
34. MESOS: FUNDAMENTAL ARCHITECTURE
[Diagram: a quorum of Mesos masters coordinates Mesos agents; framework schedulers (Cassandra, Spark, container/Docker) launch their executors and tasks on the agents.]
Two-level scheduling:
1. Agents advertise resources to the master
2. The master offers resources to a framework
3. The framework rejects or uses the resources
4. The agent reports task status to the master
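The four scheduling steps above can be simulated in a few lines; agent names, framework names, and CPU counts are invented for illustration, and the "framework logic" is reduced to a simple fits/doesn't-fit check:

```python
def schedule(agents, frameworks):
    """Toy two-level scheduling: the master forwards each agent's
    advertised resources to the frameworks as offers, and each framework
    decides for itself whether to use or decline the offer."""
    launched = []
    for agent, free_cpus in agents.items():            # 1. agents advertise resources
        for name, needed_cpus in frameworks.items():   # 2. master offers them around
            if needed_cpus and needed_cpus <= free_cpus:
                launched.append((name, agent))         # 3. framework uses the offer
                free_cpus -= needed_cpus
                frameworks[name] = 0                   # task placed, nothing left to run
                # 4. the agent would now report task status back to the master
    return launched


agents = {"agent-1": 4, "agent-2": 2}                  # free CPUs per agent
frameworks = {"cassandra": 3, "spark": 2}              # CPUs each framework wants
print(schedule(agents, frameworks))
# [('cassandra', 'agent-1'), ('spark', 'agent-2')]
```

The key design point is that placement logic lives in the frameworks, not the master: the master only brokers offers, which is why Mesos can host Cassandra, Spark, and plain containers with very different scheduling needs side by side.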
35. Challenges
• Mesos is just the kernel
• A full "OS" around it still needs:
• Scheduler
• Monitoring
• Security
• CLI
• Package repository
• …
38. DC/OS
• Datacenter-wide services to power your apps
• Turnkey installation and lifecycle management
• Container operations & big data operations
• Security, fault tolerance & high availability
• Open source (ASL 2.0)
• Based on Apache Mesos
• Production proven at scale
• Requires only a modern Linux distro (Windows coming soon)
• Hybrid datacenter
[Stack diagram: microservices (containers) and big data + analytics engines (streaming, batch, machine learning, analytics, search, time series, SQL/NoSQL databases, modern app components) run on the Datacenter Operating System (DC/OS) and its package repository (DC/OS Universe), on top of the distributed systems kernel (Mesos), on any infrastructure (physical, virtual, cloud).]
41. Distributed Services: Challenges
• Defaults – as simple as Docker Compose; no need to write any Java code, to be an app expert, or to understand DC/OS
• Custom jobs & strategies – need to be an app expert and to write a little Java code
• Build your own scheduler – can't use the default scheduler; need to write a lot of Java code and be willing to understand DC/OS
42. Demo Time
[Diagram: generator → Kafka → Spark/Flink → Kafka → display]
1. Financial data created by the generator
2. Written to Kafka topics
3. Kafka topics consumed by Spark or Flink
4. Results written back into the Kafka stream (another topic)
7. Results displayed
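Wired together, the demo steps look roughly like this end to end; the "financial data" and the analysis (a moving average over the last five events) are stand-ins for whatever the real generator and the Spark/Flink job compute, and plain lists stand in for the Kafka topics:

```python
import random


def run_pipeline():
    """End-to-end sketch of the demo: generator -> topic ->
    stream processor -> results topic -> display."""
    random.seed(42)                # fixed seed so the run is repeatable
    prices_topic = []              # step 2: raw events land in a Kafka topic
    results_topic = []             # step 4: analysis results go to another topic

    # step 1: the generator creates financial data (prices between 90 and 110)
    for _ in range(10):
        prices_topic.append(round(random.uniform(90, 110), 2))

    # step 3: the stream processor (Spark or Flink in the demo) consumes the topic
    window = prices_topic[-5:]                       # e.g. a 5-event window
    results_topic.append(sum(window) / len(window))  # moving average

    # step 7: the display consumes the results topic
    return results_topic[-1]


print("moving average:", run_pipeline())
```

Keeping a topic between every pair of stages is the point of the architecture: generator, processor, and display can be deployed, scaled, and restarted independently because they only share the log.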