Data today is online, in motion, and generated constantly. For architects, developers and their businesses, this means that there is a critical need for tools and applications that can process this data in real time, generating actionable insights that can drive business value. How? By moving away from batch-centric processing to streaming Fast Data pipelines that are fully Reactive.
In this webinar with Craig Pottinger, Senior Consultant at Lightbend, we examine the design choices around building streaming systems with technologies like Akka Streams, Apache Kafka, Apache Spark, Apache Flink, Mesosphere DC/OS and Lightbend Reactive Platform, all of which come integrated with Lightbend Fast Data Platform.
Craig will discuss how Fast Data Platform brings together a curated set of technologies (including streaming engines, a data backplane, reactive microservices, persistence, machine learning, purpose-built monitoring and more) plus on-demand guidance from experts to help you design, and your developers build, streaming fast data applications that are successful in the long run.
Design Streaming Fast Data Applications with Spark, Akka, Kafka and Cassandra on Mesos & DC/OS
2. Understanding Data Streaming
•To understand Fast Data we must first understand traditional data streaming:
•The processing of large or unbounded sequences of data
•Datasets that are too large to fit in memory or on disk
•Can be used to provide insights for data that never ends
•Typical use cases are to provide aggregations and predictions
•Usually synchronous and single-threaded
•Very low latency and close to real time
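The aggregation use case above can be illustrated with a minimal sketch that keeps only a fixed-size summary while consuming a sequence of any length. The names `RunningStats` and `running_mean`, and the sample `readings`, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class RunningStats:
    """Aggregate state whose size does not grow with the length of the stream."""
    count: int = 0
    total: float = 0.0

    def update(self, value: float) -> None:
        self.count += 1
        self.total += value

    @property
    def mean(self) -> float:
        return self.total / self.count if self.count else 0.0


def running_mean(values: Iterable[float]) -> Iterator[float]:
    """Yield the mean of everything seen so far, one element at a time."""
    stats = RunningStats()
    for value in values:  # the source may be unbounded; nothing is buffered
        stats.update(value)
        yield stats.mean


readings = [10.0, 20.0, 30.0, 40.0]  # stand-in for an endless sensor feed
means = list(running_mean(readings))
```

Because the loop is synchronous and single-threaded, it matches the traditional model described on this slide; the Fast Data slides that follow are about spreading this kind of work across cores and machines.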
3. What is Fast Data?
•The processing of high volumes of a continuous stream of data
•Fast Data combines properties from traditional stream processing, big data infrastructure, and reactive applications
•Low to medium latency data processing in near real time
•Scales horizontally to handle a high volume of data
•Stream processing is parallelized across many CPU cores and machines by partitioning the data stream
•Fault Tolerance provides resilience and the ability to recover from failures
•High Availability provides responsiveness and ensures uptime guarantees
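Partitioning the data stream, as mentioned above, is commonly done by hashing a key so that all events for one key reach the same worker. The following is a minimal sketch of that idea, not the partitioner of any particular engine; `NUM_PARTITIONS`, `partition_for` and `partition_stream` are hypothetical names, and CRC32 is used only because it gives a stable hash across processes.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

NUM_PARTITIONS = 4  # hypothetical partition count


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition with a hash that is stable across processes."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_stream(
    events: Iterable[Tuple[str, int]], num_partitions: int = NUM_PARTITIONS
) -> Dict[int, List[Tuple[str, int]]]:
    """Route each (key, payload) event to its partition, preserving arrival order."""
    partitions: Dict[int, List[Tuple[str, int]]] = defaultdict(list)
    for key, payload in events:
        partitions[partition_for(key, num_partitions)].append((key, payload))
    return dict(partitions)


events = [("sensor-a", 1), ("sensor-b", 2), ("sensor-a", 3), ("sensor-c", 4)]
partitions = partition_stream(events)
```

Each partition can then be handled by a separate core or machine, while per-key ordering is kept inside a partition.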
4. Fast Data Sources
•Sensor data: Processing discrete data from many Internet of Things (IoT) devices
•Network traffic: Telecommunication network optimization using SDNs
•Web/mobile user activity: Up-to-date trends of user behaviour from web or mobile apps
•Database event logs: Data pumps and streaming ETL to create new views of source data
6. Application: Monitoring
•Monitoring data for anomalies has many finance applications: Fraud Detection, Risk Management, Compliance
•Credit card companies have multiple levels of Fraud Detection
•Transaction-time fraud detection occurs at time of purchase
•Secondary fraud detection occurs after transaction time
•Requirements
•Reliable data capture is important for monitoring compliance
•Model training & scoring for fraud detection
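As a rough illustration of the training and scoring requirement above, the sketch below learns a running mean and variance of transaction amounts (Welford's online algorithm) and flags amounts that deviate strongly. This is an assumed toy model, not how any credit card company scores fraud; `OnlineAmountModel`, `is_suspicious` and `THRESHOLD` are hypothetical.

```python
import math


class OnlineAmountModel:
    """Running mean and variance of amounts, updated one transaction at a time."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def train(self, amount: float) -> None:
        """Welford's update: fold one amount into the running statistics."""
        self.n += 1
        delta = amount - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (amount - self.mean)

    def score(self, amount: float) -> float:
        """Absolute z-score of the amount; 0.0 until there is enough history."""
        if self.n < 2:
            return 0.0
        std = math.sqrt(self.m2 / (self.n - 1))
        return abs(amount - self.mean) / std if std > 0 else 0.0


THRESHOLD = 3.0  # hypothetical cut-off, in standard deviations


def is_suspicious(model: OnlineAmountModel, amount: float) -> bool:
    return model.score(amount) > THRESHOLD


model = OnlineAmountModel()
for amount in [20.0, 25.0, 22.0, 19.0, 24.0]:  # made-up purchase history
    model.train(amount)
```

Scoring a new amount at purchase time corresponds to transaction-time detection; re-scoring stored transactions later with an updated model corresponds to secondary detection.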
7. Application: Better Products & Services
•Recommendation Engines
•Media companies suggest new songs & TV shows to users (Netflix, Spotify)
•eCommerce companies recommend new products (Amazon)
•Requirements
•Joining historical data with real-time data
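Joining historical data with real-time data can be sketched as enriching each incoming event with a lookup into a stored view. The example below is a minimal in-memory illustration under that assumption; `listening_history`, `enrich` and the event fields are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List

# Hypothetical historical view: genres each user has already listened to.
listening_history: Dict[str, List[str]] = {
    "user-1": ["jazz", "blues"],
    "user-2": ["rock"],
}


def enrich(events: Iterable[dict], history: Dict[str, List[str]]) -> Iterator[dict]:
    """Join each real-time event with the user's history, then update the history."""
    for event in events:
        past = history.get(event["user"], [])
        yield {
            **event,
            "past_genres": list(past),  # snapshot taken before this event is recorded
            "is_new_genre": event["genre"] not in past,
        }
        # Fold the live event back into the historical view for later joins.
        history.setdefault(event["user"], []).append(event["genre"])


events = [
    {"user": "user-1", "genre": "rock"},
    {"user": "user-1", "genre": "rock"},
    {"user": "user-3", "genre": "jazz"},
]
enriched = list(enrich(events, listening_history))
```

In a real pipeline the historical view would live in a database or a state store rather than a dictionary, but the join logic has the same shape.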
8. So how should we design Fast Data systems?
•To implement Fast Data systems we need to review the evolution of two subsets of software development
•Building Application Services
•Building Data Systems
9. Why?
•The worlds of Data Systems (aka Streaming Applications) and Applications (Microservices) are converging.
11. The Software Spectrum
•Monoliths and Microservices exist on a spectrum
•Monolith on one end, Microservices on the other
•Most applications live somewhere in the middle
12. Characteristics of a Monolith
•Deployed as a single unit
•Single shared database
•Communicate with synchronous method calls
•Deep coupling between libraries and components (often through the DB)
•“Big Bang” style releases
•Long cycle times (weeks to months)
•Teams carefully synchronize features and releases
13. The Monolithic Ball of Mud
•The ball of mud represents the worst-case scenario for a Monolith
•No clear isolation in the application
•Complex dependencies
•Hard to understand and harder to modify
14. The Microservice Architecture
•Microservices are a subset of SOA
•Logical components are separated into isolated microservices
•Microservices can be physically separated and independently deployed
•Each component/microservice has its own independent data store
15. Scaling a Microservice Application
•Each microservice is scaled independently
•Could be one or more copies of a service per machine
•Each machine hosts a subset of the entire system
16. Characteristics of Microservices
•Each service is deployed independently
•Multiple independent databases
•Communication is synchronous or asynchronous (Reactive Microservices)
•Loose coupling between components
•Rapid deployments (possibly continuous)
•Teams release features when they are ready
19. Hadoop
•Hadoop is a system for collecting and processing massive amounts of data
•Focus on batch processing and analytics
•Divided into three projects: MapReduce, HDFS, YARN
•Linear scalability with inexpensive commodity servers
•Open Source
20. Disadvantages of Hadoop
•Batch semantics delay gaining insight from your data
•Discovering insights faster is a competitive advantage
•Customers today expect up-to-date and accurate information
•It’s difficult to implement business processes in the MapReduce programming model
•A poor choice for iterative operations such as Machine Learning
21. Distributed Stream Processors
•There are lots of distributed stream processors to choose from: Spark Streaming, Storm, Samza, Flink, Apex, Gearpump
•They fill in the gap of streaming requirements that exists in Hadoop
•Target YARN, Mesos, and standalone cluster resource managers
22. Complexity of Distributed Stream Processors
•Distributed stream processors address complexities not found in batch semantics
•Handling out-of-order messages
•Message delivery & processing semantics
•Event-time vs processing-time
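To make the difference between event time and processing time concrete, the sketch below counts events in tumbling windows keyed by the time each event occurred, while events arrive out of order. It is a simplified illustration and not the windowing API of any engine listed here; `WINDOW_SECONDS`, `ALLOWED_LATENESS` and `count_by_event_time` are hypothetical, and the watermark rule (maximum event time seen minus the allowed lateness) is one common simplification.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

WINDOW_SECONDS = 10   # hypothetical tumbling window size
ALLOWED_LATENESS = 5  # hypothetical: how far the watermark trails the newest event


def window_start(event_time: int) -> int:
    return event_time - (event_time % WINDOW_SECONDS)


def count_by_event_time(
    events: Iterable[Tuple[int, str]]
) -> Tuple[Dict[int, int], List[Tuple[int, str]]]:
    """Count events per event-time window; `events` is given in arrival order."""
    open_windows: Dict[int, int] = defaultdict(int)
    closed: Dict[int, int] = {}
    dropped: List[Tuple[int, str]] = []
    watermark = float("-inf")
    for event_time, payload in events:
        watermark = max(watermark, event_time - ALLOWED_LATENESS)
        start = window_start(event_time)
        if start + WINDOW_SECONDS <= watermark:
            dropped.append((event_time, payload))  # its window was already emitted
        else:
            open_windows[start] += 1
        for s in sorted(open_windows):  # emit windows the watermark has passed
            if s + WINDOW_SECONDS <= watermark:
                closed[s] = open_windows.pop(s)
    for s in sorted(open_windows):  # end of input: flush what is still open
        closed[s] = open_windows.pop(s)
    return closed, dropped


# (event_time, payload) in arrival order; 8 and 3 arrive after later events.
arrivals = [(1, "a"), (12, "b"), (8, "c"), (17, "d"), (3, "e"), (25, "f")]
closed, dropped = count_by_event_time(arrivals)
```

Here the event at time 8 arrives late but is still counted in its window, while the event at time 3 arrives after its window has closed and is dropped. What to do with such events (drop, re-emit, side output) is one of the semantics choices a stream processor has to make.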
23. Reactive Principles
•Responsive: A Reactive System consistently responds in a timely fashion
•Resilient: A Reactive System remains responsive, even when failures occur
•Elastic: A Reactive System remains responsive, despite changes in system load
•Message Driven: A Reactive System is built on a foundation of async, non-blocking messages
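The message-driven principle can be sketched with two components that communicate only through an asynchronous, bounded queue. This uses Python's standard `asyncio` purely for illustration (the platform discussed here builds on Akka); the `producer`, `consumer` and `main` functions and the queue size are assumptions.

```python
import asyncio
from typing import Iterable, List, Optional


async def producer(queue: "asyncio.Queue[Optional[int]]", items: Iterable[int]) -> None:
    for item in items:
        # Suspends this task, without blocking a thread, while the queue is full.
        await queue.put(item)
    await queue.put(None)  # sentinel: no more messages


async def consumer(queue: "asyncio.Queue[Optional[int]]", results: List[int]) -> None:
    while True:
        item = await queue.get()  # suspends until a message is available
        if item is None:
            break
        results.append(item * 2)


async def main() -> List[int]:
    # A bounded queue slows a fast producer down to the consumer's pace.
    queue: "asyncio.Queue[Optional[int]]" = asyncio.Queue(maxsize=2)
    results: List[int] = []
    await asyncio.gather(producer(queue, range(5)), consumer(queue, results))
    return results


results = asyncio.run(main())
```

The two sides never call each other directly, which is what allows them to be scaled, restarted or moved independently.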
25. What is Lightbend Fast Data Platform?
Lightbend Fast Data Platform is a
● curated,
● integrated and
● fully supported platform
that provides you with an easy on-ramp for designing, building and
running streaming Fast Data applications.
26. Why Lightbend Fast Data Platform?
● For architects: Design capabilities and guided tool choices so you can demystify
complexity and reduce risk.
● For developers: An easy on-ramp that accelerates developer velocity so you can
build & launch performant apps on time.
● For ops teams: Run-time capabilities so you can serve users reliably at scale, along
with one-stop support for all components to ensure peace of mind.
29. When Choosing Streaming Engines…
•Low latency? How low?
•High volume? How high?
•Kinds of data processing and analytics? Which ones?
•Process data:
•Individually (e.g., complex event processing)?
•In bulk (e.g., like SQL joins)?
•Required integrations with other tools? Which ones?
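The choice between processing data individually and in bulk can be sketched as follows: one function reacts to each event as it arrives, the other groups events into small batches and computes over each batch. The function names, threshold and batch size are hypothetical and do not correspond to the API of any specific engine.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple


def process_individually(events: Iterable[int], threshold: int) -> Iterator[Tuple[str, int]]:
    """Per-event processing: emit an alert as soon as one event crosses the threshold."""
    for event in events:
        if event > threshold:
            yield ("alert", event)


def micro_batches(events: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Group a stream into consecutive batches of at most `batch_size` events."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def process_in_bulk(events: Iterable[int], batch_size: int) -> Iterator[float]:
    """Bulk processing: compute one aggregate (here, the mean) per batch."""
    for batch in micro_batches(events, batch_size):
        yield sum(batch) / len(batch)


events = [3, 9, 4, 12, 5]
alerts = list(process_individually(events, threshold=8))
averages = list(process_in_bulk(events, batch_size=2))
```

Per-event processing tends to give the lowest latency for each record, while batching trades some latency for cheaper set-oriented operations such as joins and aggregations.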