Smack Stack and Beyond—Building Fast Data Pipelines with Jorg Schad


There is an ever-increasing number of use cases, such as online fraud detection, for which the response times of traditional batch processing are too slow. To react to such events in close to real time, you need to go beyond classical batch processing and use stream processing systems such as Apache Spark Streaming, Apache Flink, or Apache Storm. These systems, however, are not sufficient on their own. For an efficient and fault-tolerant setup, you also need a message queue and a storage system. One common example of a fast data pipeline is the SMACK stack. SMACK stands for:

  • Spark (Streaming): the stream processing system
  • Mesos: the cluster orchestrator
  • Akka: the system for providing custom actors for reacting upon the analyses
  • Cassandra: the storage system
  • Kafka: the message queue

Setting up this kind of pipeline in a scalable, efficient, and fault-tolerant manner is not trivial. First, this workshop will discuss the different components in the SMACK stack. Then, participants will get hands-on experience in setting up and maintaining data pipelines.
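To make the division of roles concrete, here is a toy, in-memory sketch of such a pipeline, using the fraud detection use case mentioned above. It does not use the real Kafka, Spark, Cassandra, or Akka APIs: a `deque` stands in for the message queue, a plain function for the micro-batching stream processor, a `dict` for the storage system, and a list of alert strings for the acting component. The event shape, `FRAUD_THRESHOLD`, and all function names are assumptions made up for illustration.

```python
from collections import deque

# Hypothetical threshold and event shape, chosen only for illustration.
FRAUD_THRESHOLD = 1000.0


def micro_batches(queue, batch_size):
    """Drain the queue in small batches (stand-in for the stream processor)."""
    while queue:
        size = min(batch_size, len(queue))
        yield [queue.popleft() for _ in range(size)]


def analyze(batch):
    """Return the events in this batch that look suspicious."""
    return [event for event in batch if event["amount"] > FRAUD_THRESHOLD]


def run_pipeline(events, batch_size=2):
    queue = deque(events)  # message queue role (Kafka in SMACK)
    store = {}             # storage role (Cassandra in SMACK)
    alerts = []            # acting role (Akka actors in SMACK)
    for batch in micro_batches(queue, batch_size):
        for event in batch:
            store[event["id"]] = event  # persist every event
        for suspicious in analyze(batch):
            alerts.append(f"review transaction {suspicious['id']}")
    return store, alerts


events = [
    {"id": "t1", "amount": 25.0},
    {"id": "t2", "amount": 4200.0},
    {"id": "t3", "amount": 310.0},
]
store, alerts = run_pipeline(events)
print(alerts)  # ['review transaction t2']
```

The queue decouples producers from the processor, which is why the stream processor alone is not enough: without the queue there is nothing to buffer events while a batch is being processed, and without the store there is nothing to query afterwards.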


  1. SMACK STACK AND BEYOND: BUILDING FAST DATA PIPELINES. Jörg Schad, Mesosphere. @dcos @joerg_schad
  2. Jörg Schad, Software Engineer @Mesosphere. @joerg_schad, @joerg.mesosphere. © 2017 Mesosphere, Inc. All Rights Reserved.
  3. Ancient Times... MapReduce is crunching data
  4. Today... But then business demanded FAST DATA. We need to turn faster!
  5. Data Processing (diagram)
     • Processing styles: Batch, Micro-Batch, Event Processing
     • Latency scale: Days, Hours, Minutes, Seconds, Microseconds
     • Reports what has happened using descriptive analytics
     • Solves problems using predictive and prescriptive analytics
     • Example use cases: Billing, Chargeback; Product recommendations; Real-time Advertising; Real-time Pricing and Routing; Predictive User Interface
  6. Fast Data Pipelines (diagram)
     • Data Ingestion: Request/Response; Devices, Client, Sensors
     • Message Queue/Bus
     • Microservices
     • Distributed Storage
     • Analytics (Streaming)
  8. Fast Data Pipelines
     EVENTS: Ubiquitous data streams from connected devices
     • INGEST: Ingest millions of events per second
     • STORE: Distributed & highly scalable database
     • ANALYZE: Real-time and batch process data
     • ACT: Visualize data and build data-driven applications
  9. SMACK Stack
     EVENTS: Ubiquitous data streams from connected devices
     Sources: Sensors, Devices, Clients
     • INGEST (Apache Kafka): Ingest millions of events per second
     • STORE (Apache Cassandra): Distributed & highly scalable database
     • ANALYZE (Apache Spark): Real-time and batch process data
     • ACT (Akka): Visualize data and build data-driven applications
     Platform: Apache Mesos
  10. SMACK Stack
      EVENTS: Ubiquitous data streams from connected devices
      • INGEST: Ingest millions of events per second
      • STORE: Distributed & highly scalable database
      • ANALYZE: Real-time and batch process data
      • ACT: Visualize data and build data-driven applications
  11. Message Queues
      EVENTS: Ubiquitous data streams from connected devices
      Sources: Sensors, Devices, Clients
      • INGEST (Apache Kafka): Ingest millions of events per second
      • STORE (Apache Cassandra): Distributed & highly scalable database
      • ANALYZE (Apache Spark): Real-time and batch process data
      • ACT (Akka): Visualize data and build data-driven applications
      Platform: DC/OS
  12. MESSAGE QUEUES
      • Apache Kafka
      • ØMQ, RabbitMQ, Disque (Redis-based), etc.
      • fluentd, Logstash, Flume
      • Akka streams
      • cloud-only: AWS SQS, Google Cloud Pub/Sub
  13. APACHE KAFKA
      • High-throughput, distributed, persistent publish-subscribe messaging system
      • Originates from LinkedIn
      • Typically used as buffer/de-coupling layer in online stream processing
  14. fluentd
  15. HOW TO CHOOSE?
      • Scalability
      • Message Type
      • Log vs …
      • Delivery Guarantees/Message durability
      • Routing Capabilities
      • Failover
      • Community
      • Mesos Support ;-)
  16. DELIVERY GUARANTEES
      • At most once: messages may be lost but are never redelivered.
      • At least once: messages are never lost but may be redelivered.
      • Exactly once: this is what people actually want; each message is delivered once and only once.
      Murphy’s Law of Distributed Systems: Anything that can go wrong, will go wrong … partially!
  17. Routing (diagram labels: Simple Pipes, Routing)
  18. Stream Processing
      EVENTS: Ubiquitous data streams from connected devices
      Sources: Sensors, Devices, Clients
      • INGEST (Apache Kafka): Ingest millions of events per second
      • STORE (Apache Cassandra): Distributed & highly scalable database
      • ANALYZE (Apache Spark): Real-time and batch process data
      • ACT (Akka): Visualize data and build data-driven applications
      Platform: DC/OS
  19. STREAM PROCESSING
      • Apache Storm
      • Apache Spark
      • Apache Samza
      • Apache Flink
      • Apache Apex
      • Concord
      • cloud-only: AWS Kinesis, Google Cloud Dataflow
  20. APACHE SPARK
  21. APACHE SPARK (STREAMING 2.0)
      Typical Use: distributed, large-scale data processing; micro-batching
      Why Spark Streaming?
      • Micro-batching creates very low latency, which can be faster
      • Well-defined role means it fits in well with other pieces of the pipeline
  22. HOW TO CHOOSE?
      • Execution Model
      • Native Streaming vs Microbatch
      • Fault Tolerance Granularity
      • Per record, per batch
      • Delivery Guarantees
      • API
      • SQL
      • Spark
      • Performance….
      • Realtime ≠ Realtime
      • Community
      • Mesos Support ;-)
  23. EXECUTION MODEL (diagram labels: Micro-Batching, Native Streaming)
  24. FAULT TOLERANCE (diagram labels: Checkpoint per “Batch”, Ack-Per-Record, Checkpoint per Batch)
  25. DELIVERY GUARANTEES (diagram labels: “Exactly once”, At least Once)
  26. Storage
      EVENTS: Ubiquitous data streams from connected devices
      Sources: Sensors, Devices, Clients
      • INGEST (Apache Kafka): Ingest millions of events per second
      • STORE (Apache Cassandra): Distributed & highly scalable database
      • ANALYZE (Apache Spark): Real-time and batch process data
      • ACT (Akka): Visualize data and build data-driven applications
      Platform: DC/OS
  27. Datastores
  28. Data Model
      • Relational: Schema; SQL; Foreign Keys/Joins; OLTP/OLAP
      • Key-Value: Simple; Scalable; Cache
      • Graph: Complex relations; Social Graph; Recommendation; Fraud detection
      • Document: Schema-less; Semi-structured queries; Product catalogue; Session data
      • Files
      • Time-Series
  29. Datacenter
  30. NAIVE APPROACH
      • Typical Datacenter: siloed, over-provisioned servers, low utilization
      • Industry Average: 12-15% utilization
      • Workloads shown: MySQL, microservice, Cassandra, Spark/Hadoop, Kafka
  32. MULTIPLEXING OF DATA, SERVICES, USERS, ENVIRONMENTS
      • Typical Datacenter: siloed, over-provisioned servers, low utilization
      • Mesos / DC/OS: automated schedulers, workload multiplexing onto the same machines
      • Workloads shown: MySQL, microservice, Cassandra, Spark/Hadoop, Kafka
  33. Apache Mesos
      • A top-level Apache project
      • A cluster resource negotiator
      • Scalable to 10,000s of nodes
      • Fault-tolerant, battle-tested
      • An SDK for distributed apps
      • Native Docker support
  34. MESOS: FUNDAMENTAL ARCHITECTURE
      Diagram labels: Mesos Master (three instances); Mesos Agent; Mesos Agent Service; Cassandra Scheduler, Cassandra Executor, Cassandra Task; Spark Scheduler, Spark Executor, Spark Task; Container Scheduler; Docker Executor, Docker Task
      Two-level Scheduling
      1. Agents advertise resources to Master
      2. Master offers resources to Framework
      3. Framework rejects / uses resources
      4. Agent reports task status to Master
  35. Challenges
      • Mesos is just the kernel
      • Need for OS:
        • Scheduler
        • Monitoring
        • Security
        • CLI
        • Package Repository
        • …
  36. Operating Systems
  38. DC/OS
      Column labels: DC/OS Universe; DC/OS; Any Infrastructure
      • Datacenter-wide services to power your apps
      • Turnkey installation and lifecycle management
      • Container operations & big data operations
      • Security, fault tolerance & high availability
      • Open Source (ASL2.0)
      • Based on Apache Mesos
      • Production proven at scale
      • Requires only a modern Linux distro (Windows coming soon)
      • Hybrid Datacenter
      Layer diagram labels: Datacenter Operating System (DC/OS); Distributed Systems Kernel (Mesos); Big Data + Analytics Engines; Microservices (containers); Streaming; Batch; Machine Learning; Analytics; Search; Time Series; SQL / NoSQL Databases; Modern App Components; Any Infrastructure (Physical, Virtual, Cloud)
  39. Developing Distributed Services
      • Failures (Task, Node, Network, …)
      • Zero Downtime Upgrades
      • Persistence
      • Multiple Frameworks
      • Service Discovery
      • Metrics
      • ….
  40. Operating Distributed Services
  41. Distributed Services: Challenges
      • Defaults: As simple as Docker Compose; Don’t need to write any Java code; Don’t need to be an app expert
      • Custom Jobs & Strategies: Need to be an app expert; Need to write a little Java code; Don’t want to understand DC/OS
      • Build Your Own Scheduler: Can’t use the default scheduler; Need to write a lot of Java code; Willing to understand DC/OS
  42. Demo Time (diagram labels: Generator, Display)
      1. Financial data created by generator
      2. Written to Kafka topics
      3. Kafka Topics consumed by Spark or Flink
      4. Results written back into Kafka stream (another topic)
      7. Results displayed
  43. Keep it running!
  44. SERVICE OPERATIONS
      • Configuration Updates (ex: Scaling, re-configuration)
      • Binary Upgrades
      • Cluster Maintenance (ex: Backup, Restore, Restart)
      • Monitor progress of operations
      • Debug any runtime blockages
  45. Questions? @dcos @joerg_schad dcos.io Demo
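
The delivery guarantees on slide 16 can be illustrated with a small simulation. This is a sketch under stated assumptions, not how Kafka or any other queue implements delivery: the scripted `ack_lost_on_attempt` failure model and both function names are hypothetical. It shows that an at-least-once sender, which resends until it sees an acknowledgement, can hand the consumer the same message twice, and that a consumer which remembers message ids can still compute a result as if each message had arrived exactly once.

```python
def deliver_at_least_once(messages, ack_lost_on_attempt):
    """Resend each message until its ack arrives; return what the consumer sees.

    ack_lost_on_attempt is a hypothetical, scripted failure model: a set of
    (message_id, attempt) pairs where the message arrives but its ack is lost.
    """
    received = []
    for msg_id, payload in messages:
        attempt = 1
        while True:
            received.append((msg_id, payload))  # the consumer got the message
            if (msg_id, attempt) not in ack_lost_on_attempt:
                break                           # ack arrived, stop resending
            attempt += 1                        # ack lost, so the sender retries
    return received


def process_idempotently(received):
    """Sum the payloads, applying each message id only once."""
    seen = set()
    total = 0
    for msg_id, amount in received:
        if msg_id in seen:
            continue  # a redelivery: skip it
        seen.add(msg_id)
        total += amount
    return total


messages = [("m1", 10), ("m2", 20), ("m3", 30)]
received = deliver_at_least_once(messages, ack_lost_on_attempt={("m2", 1)})

print([msg_id for msg_id, _ in received])     # ['m1', 'm2', 'm2', 'm3']
print(sum(amount for _, amount in received))  # 80: the duplicate is counted twice
print(process_idempotently(received))         # 60: the duplicate is ignored
```

An at-most-once sender would be the same loop without the retry: it never duplicates, but any message lost in transit is gone for good.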