Design Streaming Fast Data Applications with Spark, Akka, Kafka and Cassandra on Mesos & DC/OS

Understanding Data Streaming
•To understand Fast Data we must first understand traditional data streaming:
•The processing of large or unbounded sequences of data
•Datasets that are too large to fit in memory or on disk
•Can be used to provide insights for data that never ends
•Typical use cases are to provide aggregations and predictions
•Usually synchronous and single-threaded
•Very low latency and close to real time
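
As a rough illustration of aggregating over a sequence that never ends, the sketch below keeps a running mean in constant memory. The function name `running_mean` and the use of `itertools.count` as a stand-in for an unbounded source are assumptions made for this example, not something taken from the slides.

```python
from itertools import count, islice
from typing import Iterable, Iterator, Tuple


def running_mean(values: Iterable[float]) -> Iterator[Tuple[int, float]]:
    """Yield (count, mean) after each element, using O(1) memory."""
    n = 0
    mean = 0.0
    for value in values:
        n += 1
        # Incremental update: earlier elements never need to be stored.
        mean += (value - mean) / n
        yield n, mean


# count(1) is an infinite source (1, 2, 3, ...); islice only bounds the demo.
first_five = list(islice(running_mean(count(1)), 5))
print(first_five)
```

Because the generator emits a result after every element, a consumer can read an up-to-date aggregate at any point without waiting for the input to finish.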

What is Fast Data?
•The processing of high volumes of a continuous stream of data
•Fast Data combines properties from traditional stream processing, big data infrastructure,
and reactive applications
•Low to medium latency data processing in near real time
•Scales horizontally to handle a high volume of data
•Stream processing is parallelized across many CPU cores and machines by partitioning the
data stream
•Fault Tolerance provides resilience and the ability to recover from failures
•High Availability provides responsiveness and ensures uptime guarantees
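
The idea of parallelizing by partitioning the data stream can be sketched as follows. This is a minimal illustration, assuming key-based hash partitioning; the CRC32 hash, the function names, and the sensor keys are choices made for the example and do not describe how any particular engine assigns partitions.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Event = Tuple[str, int]


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition using a stable hash (CRC32, chosen for the demo)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_stream(events: Iterable[Event], num_partitions: int) -> Dict[int, List[Event]]:
    """Route each (key, value) event to the partition that owns its key."""
    partitions: Dict[int, List[Event]] = defaultdict(list)
    for key, value in events:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return dict(partitions)


events = [("sensor-a", 1), ("sensor-b", 2), ("sensor-a", 3),
          ("sensor-c", 4), ("sensor-b", 5)]
partitions = partition_stream(events, num_partitions=3)
for partition_id, partition_events in sorted(partitions.items()):
    print(partition_id, partition_events)
```

All events with the same key land in the same partition and keep their relative order, so each partition could be handed to a separate core or machine.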

Fast Data Sources
•Sensor data: Processing discrete data from many Internet of Things (IoT) devices
•Network traffic: Telecommunication network optimization using SDNs
•Web/mobile user activity: Up-to-date trends of user behaviour from web or mobile apps
•Database event logs: Data pumps and streaming ETL to create new views of source data

Applications for Fast Data
•Monitoring
•Better Products & Services

Application: Monitoring
•Monitoring data for anomalies has many finance applications: Fraud Detection, Risk Management, Compliance
•Credit card companies have multiple levels of Fraud Detection
•Transaction-time fraud detection occurs at the time of purchase
•Secondary fraud detection occurs after transaction time
•Requirements
•Reliable data capture is important for monitoring compliance
•Model training & scoring for fraud detection

Application: Better Products & Services
•Recommendation Engines
•Media companies suggest new songs & TV shows to users (Netflix, Spotify)
•eCommerce companies recommend new products (Amazon)
•Requirements
•Joining historical data with real-time data
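
One way to picture joining historical data with real-time data is a lookup join, where each incoming event is enriched from a precomputed view. The sketch below assumes a small in-memory dictionary as the historical view; the field names (`user_id`, `favourite_genre`, `played`) and the sample data are hypothetical.

```python
from typing import Dict, Iterable, Iterator

# Hypothetical historical view, e.g. built earlier by a batch job.
historical_profiles: Dict[str, Dict[str, str]] = {
    "u1": {"favourite_genre": "jazz"},
    "u2": {"favourite_genre": "rock"},
}


def enrich(events: Iterable[Dict[str, str]],
           profiles: Dict[str, Dict[str, str]]) -> Iterator[Dict[str, str]]:
    """Join each real-time event with the historical profile of its user."""
    for event in events:
        # Users without history pass through unchanged.
        profile = profiles.get(event["user_id"], {})
        yield {**event, **profile}


live_events = [
    {"user_id": "u1", "played": "song-42"},
    {"user_id": "u3", "played": "song-7"},
]
enriched = list(enrich(live_events, historical_profiles))
print(enriched)
```

In a real system the historical view would typically live in a database or a distributed cache rather than in process memory.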

So how should we design Fast Data systems?
•To implement Fast Data systems we need to review the evolution of two subsets of software
development
•Building Application Services
•Building Data Systems

Why?
•The worlds of Data Systems (aka Streaming Applications) and Applications (Microservices) are
converging.

The Software Spectrum
•Monoliths and Microservices exist on a spectrum
•Monolith on one end, Microservices on the other
•Most applications live somewhere in the middle

Characteristics of a Monolith
•Deployed as a single unit
•Single shared database
•Communicate with synchronous method calls
•Deep coupling between libraries and components (often through the DB)
•“Big Bang” style releases
•Long cycle times (weeks to months)
•Teams carefully synchronize features and releases

The Monolithic Ball of Mud
•The ball of mud represents the worst-case scenario for a Monolith
•No clear isolation in the application
•Complex dependencies
•Hard to understand and harder to modify

The Microservice Architecture
•Microservices are a subset of SOA
•Logical components are separated into isolated microservices
•Microservices can be physically separated and independently deployed
•Each component/microservice has its own independent data store

Scaling a Microservice Application
•Each microservice is scaled independently
•There can be one or more copies of a service per machine
•Each machine hosts a subset of the entire system

Characteristics of Microservices
•Each service is deployed independently
•Multiple independent databases
•Communication is synchronous or asynchronous (Reactive Microservices)
•Loose coupling between components
•Rapid deployments (possibly continuous)
•Teams release features when they are ready

Hadoop
•Hadoop is a system for collecting and processing massive amounts of data
•Focus on batch processing and analytics
•Divided into three projects: MapReduce, HDFS, YARN
•Linear scalability with inexpensive commodity servers
•Open Source

Disadvantages of Hadoop
•Batch semantics delay gaining insight from your data
•Discovering insights faster is a competitive advantage
•Customers today expect up-to-date and accurate information
•It’s difficult to implement business processes in the MapReduce programming
model
•A poor choice for iterative operations such as Machine Learning

Distributed Stream Processors
•There are many distributed stream processors to choose from: Spark Streaming, Storm, Samza, Flink, Apex, Gearpump
•They fill in the gap of streaming requirements that exists in Hadoop
•Target YARN, Mesos, and standalone cluster resource managers

Complexity of Distributed Stream Processors
•Distributed stream processors address complexities not found in
batch semantics
•Handling out-of-order messages
•Message delivery & processing semantics
•Event-time vs processing-time
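
To make the difference between event time and processing time concrete, the sketch below counts events per event-time tumbling window while they arrive out of order. It is a simplified illustration: the watermark rule (largest event time seen minus an allowed lateness) and the names `window_counts`, `window_size`, and `allowed_lateness` are assumptions for this example, not the semantics of any specific engine.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Event = Tuple[int, str]  # (event_time, payload)


def window_counts(events: Iterable[Event], window_size: int,
                  allowed_lateness: int) -> Tuple[Dict[int, int], List[Event]]:
    """Count events per event-time window; events arrive in processing order."""
    counts: Dict[int, int] = defaultdict(int)
    dropped: List[Event] = []
    max_event_time = None
    for event_time, payload in events:
        if max_event_time is None or event_time > max_event_time:
            max_event_time = event_time
        # Watermark: we assume nothing older than this will still arrive.
        watermark = max_event_time - allowed_lateness
        window_start = (event_time // window_size) * window_size
        if window_start + window_size <= watermark:
            # The window closed before this event showed up: too late.
            dropped.append((event_time, payload))
            continue
        counts[window_start] += 1
    return dict(counts), dropped


arrivals = [(1, "a"), (12, "b"), (8, "c"), (25, "d"), (3, "e")]
counts, dropped = window_counts(arrivals, window_size=10, allowed_lateness=5)
print(counts)   # {0: 2, 10: 1, 20: 1}
print(dropped)  # [(3, 'e')]
```

Event `(8, "c")` arrives after `(12, "b")` but is still assigned to the window starting at 0, whereas `(3, "e")` arrives after the watermark has passed its window and is set aside.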

Reactive Principles
•Responsive: A Reactive System consistently responds in a timely fashion
•Resilient: A Reactive System remains responsive, even when failures occur
•Elastic: A Reactive System remains responsive, despite changes in system load
•Message Driven: A Reactive System is built on a foundation of async, non-blocking messages
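
The message-driven principle can be illustrated with two coroutines that communicate only through a bounded queue. This is a minimal sketch using Python's standard `asyncio` module; the `producer` and `consumer` names, the `None` sentinel, and the queue size are choices made for the example.

```python
import asyncio
from typing import List


async def producer(queue: asyncio.Queue, items: List[int]) -> None:
    for item in items:
        # put() suspends this coroutine, not the thread, while the queue is full.
        await queue.put(item)
    await queue.put(None)  # sentinel: no more messages


async def consumer(queue: asyncio.Queue, results: List[int]) -> None:
    while True:
        item = await queue.get()
        if item is None:
            break
        results.append(item * 2)


async def main() -> List[int]:
    # A bounded queue gives simple back-pressure between the two sides.
    queue: asyncio.Queue = asyncio.Queue(maxsize=2)
    results: List[int] = []
    await asyncio.gather(producer(queue, [1, 2, 3, 4]), consumer(queue, results))
    return results


results = asyncio.run(main())
print(results)  # [2, 4, 6, 8]
```

Neither side calls the other directly, so either one could be replaced or scaled without the other changing.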

Introducing Lightbend Fast Data Platform

What is Lightbend Fast Data Platform?
Lightbend Fast Data Platform is a
● curated,
● integrated and
● fully supported platform
that provides you with an easy on-ramp for designing, building and
running streaming Fast Data applications.

Why Lightbend Fast Data Platform?
● For architects: Design capabilities and guided tool choices so you can demystify
complexity and reduce risk.
● For developers: An easy on-ramp that accelerates developer velocity so you can
build & launch performant apps on time.
● For ops teams: Run-time capabilities so you can serve users reliably at scale, along
with one-stop support for all components to ensure peace of mind.

Introducing Lightbend Fast Data Platform
•Stream Processing: (1) Streaming Engines
•Machine Learning: (2) Pluggable ML Libraries
•Microservices: (3) Reactive Platform
•Operational Tooling: (4) Intelligent Management and Monitoring, (5) Cluster Analysis (FUTURE)
•Infrastructure: (6) Durable Messaging Backplane, (7) Persistence, (8) Infrastructure (On-Prem, Cloud, Hybrid)


When Choosing Streaming Engines…
•Low latency? How low?
•High volume? How high?
•Kinds of data processing and analytics? Which ones?
•Process data:
•Individually (e.g., complex event processing)?
•In bulk (e.g., like SQL joins)?
•Required integrations with other tools? Which ones?
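
The distinction between processing data individually and in bulk can be sketched as follows. Both functions are hypothetical illustrations written for this example: one reacts to each event as it arrives, the other groups events into small batches before computing over them.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def process_individually(events: Iterable[int], threshold: int) -> Iterator[int]:
    """Per-event processing: emit an alert as soon as one event exceeds the threshold."""
    for event in events:
        if event > threshold:
            yield event


def micro_batches(events: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Bulk processing: group events into fixed-size batches."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


readings = [3, 9, 4, 12, 5, 1, 8]
alerts = list(process_individually(readings, threshold=8))
batch_sums = [sum(batch) for batch in micro_batches(readings, batch_size=3)]
print(alerts)      # [9, 12]
print(batch_sums)  # [16, 18, 8]
```

Per-event processing tends to favour latency, while batching tends to favour throughput and set-style operations such as joins; which trade-off fits depends on the answers to the questions above.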
