@wip
Building Reactive Fast Data & the Data Lake
with Akka, Kafka, Spark
AJUG, Jan 17, 2017
Todd Fritz
Cox Automotive, Inc.
This presentation is a draft of what will be presented, next month, at DevNexus.
“Sneak peek”
Reactions and questions may influence evolution of content
Disclaimer
2
3
tafritz@outlook.com
Todd.Fritz@coxautoinc.com
www.linkedin.com/in/tfritz
http://www.slideshare.net/ToddFritz
https://github.com/todd-fritz
4
License: CC BY-SA 3.0 5
• Senior Solutions Architect @ Cox Automotive, Inc.
• Strategic Data Services
• The opinions contained herein may not represent my employer, but I believe they
should.
• Background is building platforms, middleware, MoM, EIP, EDA, etc
• DevOps mentality
• Exposed to many environments, technologies, people
• Life-long learner and always curious
• Novice bass player
• Scuba diver
About Me
DevNexus 2015
http://www.slideshare.net/ToddFritz/2015-03-11_Todd_Fritz_Devnexus_2015
Great Wide Open - Atlanta (April 3, 2014)
http://www.slideshare.net/ToddFritz/2014-04-03legacytocloud
AJUG (April 15, 2014)
http://www.slideshare.net/ToddFritz/2014-april-15-atlanta-java-users-group
Video - https://vimeo.com/94556976
Previous Presentations
6
• Forward
• The Briefest History
• Background: Reactive Systems, Patterns, Implementations
• The Enterprise
• Fast Data
• The Data Lake for Analytics, App Dev
• Presentation Improvements Planned for DevNexus
• Questions
• Resources
Agenda
7
Forward
“Our greatest glory is not in never failing,
but in rising every time we fall.”
-Confucius
8
• Why is Reactive Important?
• Reactive Systems and Programming != Reactive Management
• Reactive underpins every use case, every business capability, every product feature
• Tendency for companies to survey market and select products to match perceived
business need
• Importance of vision, governance, tenancy, entitlements, security
• Both Process and Technology
• Use Case: Build a system that can scale thousands to millions of users, handling
millions to billions of messages
• Real time data, for application components, middleware, data processing,
analytics
• A journey of innovation and successive refinement
Onward
9
The Briefest History
History is the sum total of things that could have been avoided.
- Konrad Adenauer
10
• “Reactive” is not new
• Underlying principles go back almost 50 years, to the days of punch cards
• Erlang
• Built to scale, handle extremely high volume
• Extensive use in Telco for decades; many billions of messages
• Actors
• Bedrock is messaging
• We’ve been using this technique for decades, via many technologies
• Improvements over time around component isolation, decoupling to benefit scalability and concurrency
The Briefest History
11
Reactive Systems, Patterns, Implementations
“People who think they know everything
are a great annoyance to those of us who do.”
- Isaac Asimov
12
• The Reactive Manifesto
• http://www.reactivemanifesto.org
• Many organizations independently building software to new, and
similar, patterns
• Increasing pressures to simplify, scale, innovate and improve customer experience
• Increasing proliferation and interoperability of system environments and connected devices
• More data with contextual use cases
• Yesterday’s architectures just don’t cut it
• Need more flexible, resilient, robust systems
• Solution through evolution
• Spawn of Actors (Akka, Erlang)
• Good starting point: https://www.lightbend.com/blog/architect-reactive-design-patterns
Reactive Systems
13
“…“Reactive” is a set of design principles for creating cohesive systems. It’s a way of
thinking about systems architecture and design in a distributed environment where
implementation techniques, tooling, and design patterns are components of a larger
whole.”*
“A Reactive System is based on an architectural style that allows … multiple individual
services to coalesce as a single unit and react to its surroundings while remaining aware
of each … scale up/down, load balance and even ...” * (proactive steps)
Components may individually qualify as reactive, but combining them does not guarantee a
Reactive System
What is Reactive? A Set of Design Principles
14* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
1. Reactive Systems (architecture, design)
2. Reactive Programming (declarative event-based)
3. Functional Reactive Programming (FRP)
NOTE: The inventor of this term, Conal Elliott, says this term is misapplied today (e.g. RxJS, RxJava,
Bacon.js, etc). Refer to his presentation (July 22, 2015) for the details:
https://begriffs.com/posts/2015-07-22-essence-of-frp.html
Reactive Begets*
15* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
• Reactive Programming != Functional Reactive Programming
• Subset of Asynch Programming
• New information drives logic flow vs. control flow driven by thread-of-execution
• Avoids resource contention (Amdahl’s Law) that impedes scalability.
• Decompose into multiple steps that are asynch and nonblocking
• Combine into a composed workflow
• Reactive Systems very rarely block
• Reactive API libraries are either declarative (functional composition,
combinators) or callback-based (attached to events, executed during dataflow
chain) with stream-based operators (windowing, triggers, etc)
• Reactive programming is event-driven
• Reactive systems are message driven
• Wait? What? More about this distinction in a few slides.
Reactive Programming*
16* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
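The "decompose into asynch, nonblocking steps, then combine into a composed workflow" idea above can be sketched with Python's asyncio. This is only an illustration: the step names (fetch_order, fetch_price, total_for) and the values they return are hypothetical, and the sleeps stand in for real non-blocking IO.

```python
import asyncio

async def fetch_order(order_id: int) -> dict:
    # Hypothetical non-blocking step: awaiting releases the thread instead of blocking it.
    await asyncio.sleep(0.01)
    return {"id": order_id, "qty": 3}

async def fetch_price(order_id: int) -> float:
    # Second hypothetical step, independent of the first.
    await asyncio.sleep(0.01)
    return 2.5

async def total_for(order_id: int) -> float:
    # Both steps run concurrently; the arrival of their results drives the next step.
    order, price = await asyncio.gather(fetch_order(order_id), fetch_price(order_id))
    return order["qty"] * price

async def main() -> list:
    # Compose the per-order workflow over several orders on a single thread.
    return await asyncio.gather(*(total_for(i) for i in range(3)))

totals = asyncio.run(main())
print(totals)
```

No step holds a thread while it waits, which is the property the slide attributes to reactive code.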
• Reactive Programming is related to Dataflow Programming
• Both emphasize flow of data vs. flow of control
• Examples
• Futures / Promises
• (Reactive) Streams – unbounded data processing. Asynch, non-blocking, back-pressured
pipelines connecting sources and destinations.
• Dataflow Variables – single assignment variables (AKA a cell in Excel) whereby a value
change can trigger dependent functions to produce new values (state)
• Technologies that do this, include
• Akka Streams
• RxJava
• Vert.X
• Reactive Streams Specification
• The standard for interoperability amongst Reactive Programming libraries on the JVM
• “…an initiative to provide a standard for asynchronous stream processing with
non-blocking back pressure.”
About Dataflow Programming*
17* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
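The spreadsheet-cell analogy above can be made concrete with a small sketch. The Cell class and the price/quantity formulas are hypothetical, and the cells here are re-assignable in the Excel sense the slide uses, rather than strictly single-assignment.

```python
class Cell:
    """A spreadsheet-like cell: either a plain value or a formula over other cells."""

    def __init__(self, value=None, formula=None, inputs=()):
        self.formula = formula
        self.inputs = list(inputs)
        self.dependents = []
        for cell in self.inputs:
            cell.dependents.append(self)
        self.value = value if formula is None else self._compute()

    def _compute(self):
        return self.formula(*(cell.value for cell in self.inputs))

    def set(self, value):
        # A new value flows to every dependent cell; nobody polls for changes.
        self.value = value
        self._propagate()

    def _propagate(self):
        for dep in self.dependents:
            dep.value = dep._compute()
            dep._propagate()


price = Cell(10.0)
qty = Cell(2)
subtotal = Cell(formula=lambda p, q: p * q, inputs=(price, qty))
total = Cell(formula=lambda s: round(s * 1.07, 2), inputs=(subtotal,))

qty.set(5)  # the change ripples through subtotal and then total
print(subtotal.value, total.value)
```

The data change, not a caller's control flow, decides which functions run next.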
• Notable benefits
• Increased (efficient) utilization of compute resources (incl. multi-core)
• Increased performance via serialization reduction
• Amdahl’s Law
• Neil Gunther’s Universal Scalability Law. To quantify the effects of contention and
coordination in concurrent, distributed systems. This explains how the cost of
coherency in a system can lead to negative results, as new resources are added
to the system.
• Productivity. Reactive libraries handle complexities such as dealing with asynch,
nonblocking compute, IO, coordination between components.
• Great for creating components that are composed to workflows, that are
back-pressured, scalable, high-performance
Why Reactive Programming*
18* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
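Both laws mentioned above are short formulas, so a sketch may help show why adding resources can eventually hurt. The 95% parallel fraction and the contention and coherency coefficients below are assumed values chosen only for illustration.

```python
import math

def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Amdahl's Law: speedup is capped by the serial fraction of the work."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / workers)

def usl_capacity(workers: int, contention: float, coherency: float) -> float:
    """Universal Scalability Law: relative capacity with N workers."""
    n = workers
    return n / (1.0 + contention * (n - 1) + coherency * n * (n - 1))

ALPHA, BETA = 0.05, 0.001  # assumed contention and coherency coefficients

# Under the USL, capacity peaks near sqrt((1 - alpha) / beta) workers, then declines.
peak_workers = math.sqrt((1 - ALPHA) / BETA)

for n in (1, 8, 32, 128):
    print(n, round(amdahl_speedup(0.95, n), 2), round(usl_capacity(n, ALPHA, BETA), 2))
```

Amdahl's curve flattens toward 1 / (serial fraction), while the USL curve turns downward once the coherency term dominates, which is the "negative results" effect described above.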
• Reactive Programming - event-driven
• Computation via dataflow chains
• Events are not directed; “addressable event sources”
• Events are mere facts that can be observed
• Emitted by changes in state (state machine)
• Listeners attach to event sources and react to the events they emit
• Emitted locally
• Reactive Systems - message-driven
• Basis of communication across network/components; prefer asynch
• Sender/Receiver are decoupled
• Focus on resilience and elasticity via the communication and coordination inherent to distributed systems
• Long-lived, addressable components
• Waits for messages to be sent, then reacts to them
• Messages are ”directed”
• A message has a clear destination; “addressable recipient”
Event-Driven vs. Message-Driven*
19* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
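The distinction above can be sketched in a few lines. All names here (EventSource, Mailbox, Router, the "alerts" address, the temperature threshold) are hypothetical; the point is only that an event source does not know its listeners, while a message is sent to a specific address.

```python
from collections import deque

class EventSource:
    """Event-driven: the source announces facts and does not know who listens."""
    def __init__(self):
        self._listeners = []

    def subscribe(self, listener):
        self._listeners.append(listener)

    def emit(self, event):
        for listener in self._listeners:
            listener(event)

class Mailbox:
    """Message-driven: a long-lived, addressable recipient with its own queue."""
    def __init__(self, address):
        self.address = address
        self.queue = deque()

class Router:
    """Delivers each message to exactly one addressed recipient."""
    def __init__(self):
        self._mailboxes = {}

    def register(self, mailbox):
        self._mailboxes[mailbox.address] = mailbox

    def send(self, address, message):
        self._mailboxes[address].queue.append(message)


router = Router()
alerts = Mailbox("alerts")
router.register(alerts)

sensor = EventSource()
seen = []
sensor.subscribe(seen.append)  # a local observer of every event
# A listener that wraps hot readings in a message directed at "alerts".
sensor.subscribe(lambda e: router.send("alerts", {"event": e}) if e["temp"] > 90 else None)

sensor.emit({"temp": 72})
sensor.emit({"temp": 95})
print(len(seen), list(alerts.queue))
```

The second listener also shows a common combination: an event carried inside a message once it has to cross a component boundary.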
• Common pattern is to use Messaging as means to communicate Events across
network/components
• Events within Messages
• Examples
• AWS Lambda, distributed streaming (Spark Streaming, Flink, Kafka, Akka Streams, Pub/Sub)
• Pros
• Abstraction and simplicity
• Cons
• Lose some control
• Messaging forces developers to deal with complex realities of distributed programming
• Failure detection
• Message delivery contracts (dupes, retry, ordering)
• Consistency guarantees
• Can’t hide behind “leaky” abstractions that pretend a network does not exist (EJB, XA, RPC, etc)
Event-Driven & Message-Driven*
20* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
Reactive Systems: Characteristics*
21
Responsive
• Low latency
• Consistent &
predictable
• Foundation of Usability
• Essential for Utility
• Expose problems
quickly
• Happier Customers
Resilient
• Responsive through failure
• Resilient (H/A) through
replication
• Failure isolated to
Component (bulkheads)
• Delegate recovery to
external Component
(supervisor hierarchies)
• Client does not handle
failures
Elastic
• Responsive as
workload varies
• Devoid of bottlenecks or
hot spots
• Perf metrics drive
predictive / reactive
autonomic scaling
• Favor commodity
infrastructure
Message Driven
• Asynchronous
• Loose Coupling
• Isolation
• Location Transparency
• Non Blocking
• Recipients can passivate
• Message Passing
• Flow Control
• Exception Management
• Elasticity
• Back Pressure
* Source: The Reactive Manifesto
Reactive Systems: Patterns
22
Architecture Pattern:
Single Component
• Component does one thing, fully and well
• Single Responsibility Principle
• Max cohesion, min coupling
Architecture Pattern:
Let-it-Crash
• Prefer a full component restart to complex internal error
handling
• Design for failure
• Leads to more reliable components
• Avoids hard to find/fix errors
• Failure is unavoidable
Implementation Patterns
Circuit Breaker
• Protect services by breaking connections during
failures
• Borrowed from electrical engineering
• Protects clients from timeouts
• Allows time for service to recover
Source: https://www.lightbend.com/blog/architect-reactive-design-patterns
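A minimal sketch of the Circuit Breaker pattern described above, assuming a simple failure-count threshold and a fixed reset timeout. The class, parameter names and thresholds are illustrative, not taken from Akka or any other library.

```python
import time

class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without contacting the service."""

class CircuitBreaker:
    def __init__(self, max_failures=3, reset_timeout=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so the behavior can be tested
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast so clients do not wait on timeouts.
                raise CircuitOpenError("failing fast while the service recovers")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()   # trip, or re-trip after a failed trial
            raise
        self.failures = 0
        self.opened_at = None                   # success closes the circuit
        return result


def flaky_service():
    raise IOError("service down")

now = [0.0]  # fake clock for the demonstration
breaker = CircuitBreaker(max_failures=2, reset_timeout=10.0, clock=lambda: now[0])
for _ in range(2):
    try:
        breaker.call(flaky_service)
    except IOError:
        pass
print("open" if breaker.opened_at is not None else "closed")
```

While the circuit is open the service gets time to recover, and callers get an immediate error instead of a timeout.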
Implementation Patterns
Saga
• Divide long-lived, distributed transactions into
quick local ones with compensating actions for
recovery.
• Compensating txns run during saga rollback
• Concurrent sagas can see intermediate state
• Sagas need to be persistent to recover from
hardware failures. Save points.
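The Saga pattern above can be sketched as a list of local steps, each paired with a compensating action. The booking steps and the in-memory log are hypothetical; a real saga would write its save points to durable storage so it can resume after a hardware failure.

```python
class SagaFailed(Exception):
    """Raised after a failed step has been rolled back via compensations."""

def run_saga(steps, log):
    """Run (name, action, compensation) steps; on failure, compensate in reverse."""
    completed = []
    for name, action, compensate in steps:
        try:
            action()
            log.append(f"done:{name}")          # save point (in memory here)
            completed.append((name, compensate))
        except Exception as exc:
            log.append(f"failed:{name}")
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"compensated:{done_name}")
            raise SagaFailed(name) from exc


state = {"seat": False, "charged": False}

def reserve_seat():
    state["seat"] = True

def release_seat():
    state["seat"] = False

def charge_card():
    raise RuntimeError("card declined")

def refund_card():
    state["charged"] = False

log = []
try:
    run_saga([("reserve", reserve_seat, release_seat),
              ("charge", charge_card, refund_card)], log)
except SagaFailed:
    pass
print(log, state)
```

Between the reservation and its compensation another saga could observe the seat as taken, which is the "intermediate state" caveat listed above.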
• Still need to use the brain and learn
• Not going to be able to build reactive systems with just paper
certifications
• No substitute for experience!
• The “new” is based on evolution (old techniques and patterns)
• Paradigms and patterns not isolated to components or
technology
Sound Complicated?
23Source: The Reactive Manifesto
The Enterprise
“Even if you are on the right track,
you’ll get run over if you just sit there.”
- Will Rogers
24
25
• We all work for 1..n companies that have a size, complexity and age
• Young companies (e.g. start-ups)
• Less legacy overhead
• More able to adopt newer technology; attempting to innovate, disrupt, find a niche
• Less able to leverage expensive, enterprise-class solutions; value through IP or a unique product
• More likely to build cloud native (various reasons)
• Staff typically has larger sphere of influence
• Sometimes filled with Unicorns
• Mid-size companies
• May have legacy overhead
• Complexity if growth through acquisition vs. growth through product evolution
• Strategic use of Enterprise products
• May be cloud native, or hybrid
• Increased division of labor, defined roles, paying customers to keep happy
The Enterprise
26
• Generalizing large companies
• Probably where the terms “technical debt” and “culture debt” came from
• Certainly has legacy overhead, very complex environments
• Fear of change, less room to fail – backed by valid business reasons
• Likely has complexity due to both acquisition and product evolution
• Purchase a company, adding a ”different” tech stack
• Money to fully absorb? Risk? Or purchase to evolve existing lines of business?
• Use of Enterprise products, likely prefers “supported” flavors of open source
• A blend of on-premises, hybrid, cloud
• More politics and cats to herd
• Slower to see value from new technology
• Matrixed division of labor, defined roles, paying customers to keep happy
• Transformative change is more difficult; planning, budgeting, process, management structure
The Enterprise
27
• An Enterprise has a variety of software applications
• Customer facing, revenue generating
• “Supporting” applications for operation, development, maintenance activities
• Datamarts used for analytics
• Data typically moved from system of record, into one or more centralized data
“hubs”
• OLAP / Data Warehouses
• Data Lake, common use of Hadoop
• Older application architectures focused on the application, not on enterprise
interoperability
• Drop in ETL, ESB, SOA, iPaaS to do so ($$$)
• Perhaps refactor service layers into more modern middleware
• Enterprises becoming ”real time”
• The old, batch-oriented, application silos are too complex and just can’t do this well
Breaking Down the Enterprise
28
• (Not a complete list)
• Products (and supporting software) operate in Real Time
• Opens door to do things to increase customer satisfaction/retention
• Save money, become more agile (speed to market) – productivity enhancer
• Less reliant on batch or pull (streaming batch to interop legacy)
• Data flows to the systems that need it, where it is acted on in real time
• More powerful data processing
• Easier to manage; reduced TCO
• Decoupled, enables greater interoperability, CI/CD (Infrastructure as a Service)
• Leverage cloud solutions, auto scaling, resiliency, high availability
• Analysts get to use latest tools, oftentimes in the cloud
• Enables autonomic automation
How the Enterprise Benefits from Reactive
29
• Friends don’t let their friend-analysts do map-reduce
• Analysts are not highly technical (some do code in R)
• Larger companies may have communities of SAS users
• They just want to connect a friendly BI tool (e.g. Tableau) to run their SQL
against data.
• Can’t expect people to switch careers or skill sets to accommodate new
technology.
And Again, those Analysts
30
31Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam
Fast Data
“If you are in a spaceship that is traveling at the speed of light,
and you turn on the headlights,
does anything happen?”
- Steven Wright
32
• Big Data -> Fast Data
• Constant stream of inbound data, at incredible rates
• The old way is to store it, then analyze
• Say hello to Map Reduce!
• Hadoop did reinvent how to process petabytes of data on commodity hardware
• Why not analyze and act on data as it is received?
• Using (proven) technology built to scale to massive volume?
• Akka, Kafka, Spark, Flink
• Use case reminder: millions of concurrent actors handling billions of
messages, in real-time
• Fast Data means acting on data in real-time, and sending it to destinations
• Data Lake
• Other systems
Fast Data
33* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
Time is Gold
34* Source: “Towards Benchmarking Modern Distributed Systems”, Grace Huang (Intel)
• Fast Data is the obvious future
• Many of us have been using these techs for years
• The easy part
• Build new systems using Reactive patterns, architectures and technology
• You start ups have it easy…
• The ability of a company to adopt disruptive architectures and patterns is inversely related to its size
• In a real world where SDLC costs money
• How to reconcile with legacy architectures and implementations?
• How to interoperate with disparate systems with varying capabilities?
• Does it make sense to refactor “what works”?
• Are Frankenstein / Stove-piped solutions the norm, or (partially) a matter of
perspective, when viewed in the rear-facing mirror of innovation?
• What is a cost efficient approach to adopt and adapt to Reactive?
Reality Check
35* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
• Enough background, time to talk tech
• For this discussion, the reference platform is Lightbend’s Fast Data
platform
• (A few deviations mentioned)
• Core techs will be
• Kafka - open source or Confluent
• Spark - open source or Databricks (a nice managed service in AWS)
• Akka (and Akka-HTTP) (open source or Lightbend stack)
• Alpakka
Now, the Fun Stuff
36* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
37
Source: Lightbend, Inc.
• JMS -> AMQP -> Kafka
• Streaming platform
• Process messages when provided
• Fault-tolerant storage
• Pub/Sub capability
• Common use cases
• Data pipelines (messaging) between a source and destination
• Real-time action / transformation
• Metrics, weblogs, stream processing, event sourcing
• A commit log
• Spread across a cluster
• Record streams stored in topics
• Each record has a key, value, timestamp
• Each topic has offsets and a retention policy
Kafka 101
38
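The commit-log bullets above (records with key, value and timestamp; offsets; a retention policy) can be sketched with an in-memory, single-partition topic. This is an illustrative model only, not the Kafka client API, and the Topic and Record classes are hypothetical.

```python
import time
from dataclasses import dataclass, field

@dataclass
class Record:
    key: str
    value: str
    timestamp: float

@dataclass
class Topic:
    retention_seconds: float
    base_offset: int = 0                      # offset of the oldest retained record
    records: list = field(default_factory=list)

    def append(self, key, value, timestamp=None):
        """Append to the end of the log and return the record's offset."""
        ts = time.time() if timestamp is None else timestamp
        self.records.append(Record(key, value, ts))
        return self.base_offset + len(self.records) - 1

    def read(self, offset, max_records=10):
        """Read forward from an offset; the consumer, not the log, tracks position."""
        start = max(offset, self.base_offset) - self.base_offset
        return self.records[start:start + max_records]

    def enforce_retention(self, now):
        """Drop records older than the retention period, oldest first."""
        while self.records and now - self.records[0].timestamp > self.retention_seconds:
            self.records.pop(0)
            self.base_offset += 1


topic = Topic(retention_seconds=60)
for i, word in enumerate(["a", "b", "c"]):
    topic.append(key="k", value=word, timestamp=float(i * 50))  # t = 0, 50, 100

first_pass = [r.value for r in topic.read(offset=0)]
replay = [r.value for r in topic.read(offset=1)]      # resetting the offset replays data
topic.enforce_retention(now=100.0)
after_retention = [r.value for r in topic.read(offset=0)]
print(first_pass, replay, after_retention)
```

Because reading does not remove anything, any consumer can rewind its offset and replay, as long as the records are still within retention.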
Kafka Performance
39Source: “Introduction to Kafka”, Ducas Francis
30k/s 1.8M/min 108M/hr 2.7B/day
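As a quick sanity check, the per-second rate above can be multiplied out. Assuming a sustained 30,000 messages per second, the daily total comes to roughly 2.59 billion, so the 2.7B/day figure presumably reflects rounding or a slightly higher sustained rate (an assumption; the source slide does not say).

```python
per_second = 30_000              # rate quoted on the slide
per_minute = per_second * 60
per_hour = per_minute * 60
per_day = per_hour * 24

print(f"{per_minute:,} per minute, {per_hour:,} per hour, {per_day:,} per day")
```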
• Producer API
• To publish a record to Kafka
• Consumer API
• To subscribe to a topic(s)
• Consumer groups
• Handle records pushed to topics
• Streams API
• Stream processing
• Consume from Stream An, do processing,
publish to Stream Bn
• E.g. aggregation, joining
• Connector API
• To build custom Producers/Consumers
• Purpose built integration components
Kafka APIs
40
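How consumer groups divide a topic's work can be sketched without a broker. The functions below are hypothetical helpers, not Kafka APIs: the CRC32 hash is a stand-in (Kafka's default Java partitioner uses murmur2 for keyed records), and the round-robin assignment is only one possible strategy.

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    """Map a record key to a partition so that equal keys stay together."""
    # Stand-in hash for illustration; not the hash Kafka itself uses.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def assign(partitions, consumers):
    """Round-robin assignment of partitions to the members of one consumer group."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


partitions = list(range(6))
two_members = assign(partitions, ["c1", "c2"])
three_members = assign(partitions, ["c1", "c2", "c3"])   # a new member triggers a rebalance
print(two_members)
print(three_members)
print(partition_for("vin-123", 6) == partition_for("vin-123", 6))
```

Each partition is read by one member of the group at a time, so adding members (up to the partition count) spreads the load.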
• > Messaging systems (RabbitMQ, AMQ, etc)
• Just don’t scale well and become complex
• Need to use other abstractions for batching
• Lacks replay ability (reset offset, etc)
• > Log forwarders, e.g. Scribe or Flume
• Push architecture
• High performance
• Scales well
• Sensitive to business logic in endpoints, because needs to push data fast
• Assumes data is pushed to large sink (e.g. Hadoop)
• Oh, and then queries later. So much for real-time.
• Supports Polyglot
• Python, Go, .NET, Node.js, C/C++, etc
• Robust ecosystem
• https://cwiki.apache.org/confluence/display/KAFKA/Ecosystem
Why Use Kafka?
41
• “…a fast and general engine for large-scale data processing.”
• Hadoop / YARN
• Map-Reduce – mapper, reducer, disk IO, queue, fetch resource
• Great for parallel file processing of large files
• Synchronization barrier during persistence
• Spark
• In-memory data processing
• Interactive/iterative data query
• Better supports more complex, interactive (real-time) apps
• Up to 100x faster than Hadoop MR in memory, up to 10x faster on disk
• Microbatching
Spark: What is it?
42Source: spark.apache.org
• Combine SQL, streaming, complex analytics
• SQL
• Dataframes
• MLlib
• GraphX
• Spark Streaming
Spark
43Source: spark.apache.org
• Run it
• Standalone
• Hadoop
• Mesos
• Cloud
• Access data
• HDFS
• Cassandra
• HBase
• S3
• Hive
• Tachyon, and more
• Write code
• Scala
• Java
• Python, Clojure, R
• Interactive query shell (notebooks)
Spark Execution Modes
44Source: spark.apache.org
• Slow due to replication, serialization, filesystem IO
• Inefficient use cases:
• Iterative algorithms (ML, Graphs, Network analysis)
• Interactive / Ad-hoc data mining (R, Excel, Searching, analyst queries)
Spark: Hadoop MR
45Source: “Spark Overview”, Lisa Hua
Spark: Hadoop & Spark (in Hadoop)
46Source: “Spark Overview”, Lisa Hua
• Spark has additional features, such as interop with S3 storage
Spark: A Clustered Application
47Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam
Spark: Execution Terminology
48Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam
• Job – a set of tasks to be executed as a result of an action
• Stage – a set of tasks in a job that can be run in parallel
• Task – an individual unit of work sent to a single executor
Spark: SQL
49Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam
• Spark SQL is a module for structured data querying
• Supports basic SQL and HiveQL
• Can act as distributed query engine via JDBC/ODBC, or CLI
Spark: Dataframe, Datasets, RDDs
50Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam
• Dataframe is a distributed assembly of data into named columns
• Analogous to a relational table, or data frame in R/Python (with richer optimizations)
• Dataset was added in Spark 1.6 to combine the benefits of RDDs with Spark SQL’s
optimized execution engine
• Build datasets from JVM objects and then manipulate with functional transformations
• Scalable, high-throughput, fault tolerant processing of real time streams and use
cases.
• Microbatched. Think of it in terms of EDA (e.g. Esper): “windowing” and similar operations vs.
handling a single message/event.
Spark Streaming
51Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam
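Microbatching and windowing, as described above, can be illustrated in plain Python without Spark. The function names and the sample events are hypothetical; Spark Streaming performs the equivalent grouping across a cluster.

```python
from collections import defaultdict

def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into consecutive fixed-length batches."""
    grouped = defaultdict(list)
    for ts, value in events:
        grouped[int(ts // batch_interval)].append(value)
    if not grouped:
        return []
    first, last = min(grouped), max(grouped)
    # Intervals with no events still produce an (empty) batch.
    return [grouped.get(i, []) for i in range(first, last + 1)]

def sliding_window_sums(batches, window_batches):
    """Sum over the most recent `window_batches` batches, emitted once per batch."""
    sums = []
    for i in range(len(batches)):
        window = batches[max(0, i - window_batches + 1): i + 1]
        sums.append(sum(sum(batch) for batch in window))
    return sums


events = [(0.2, 1), (0.9, 2), (1.1, 3), (2.5, 4), (2.7, 5), (3.0, 6)]
batches = micro_batches(events, batch_interval=1.0)
window_sums = sliding_window_sums(batches, window_batches=2)
print(batches)
print(window_sums)
```

The unit of processing is the batch (or a window of batches), not the individual event, which is the trade-off compared with per-message engines.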
• Can merge inbound data with historical data
• Write code in Scala, Python, Java, etc
• Lower level access via DStream, obtained via StreamingContext.
• Create RDDs from the DStream
• Two primary metrics to monitor and tune:
• Processing time (per batch)
• Scheduling delay (processed upon arrival?)
• Use Kryo for serialization
Spark Streaming
52Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam
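The relationship between the two metrics above can be shown with a small simulation. It assumes batches are processed one at a time in arrival order, and the interval and processing times are made-up numbers.

```python
def scheduling_delays(batch_interval, processing_times):
    """Time each batch waits between being ready and starting to process."""
    delays = []
    free_at = 0.0                                # when the processor is next idle
    for i, processing_time in enumerate(processing_times):
        ready_at = (i + 1) * batch_interval      # a batch is ready when its interval closes
        start = max(ready_at, free_at)
        delays.append(start - ready_at)
        free_at = start + processing_time
    return delays


keeping_up = scheduling_delays(1.0, [0.5, 0.5, 0.5, 0.5])
falling_behind = scheduling_delays(1.0, [1.5, 1.5, 1.5, 1.5])
print(keeping_up)
print(falling_behind)
```

When processing time stays below the batch interval the delay remains zero; once it exceeds the interval the delay grows with every batch, which is the signal to tune or add capacity.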
Spark Streaming - comparison with other techs
53Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam
Building a Spark Application
54Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam
• Scala or Java, compiled to JAR (in turn uploaded to worker nodes)
• Running a spark app is as easy as pie (submit options can expand the
experience):
$ spark-submit myAwesomePythonScript.py theFileURL
$ spark-submit --class SkyNetInScala skynet2017.jar theFileURL
• Spark uses log4j (beware)
• YARN can aggregate worker logs
Akka, who uses it?
55Source: “Akka Actor Introduction”, Gene
Akka
56Source: “Introducing Akka”, Jonas Bonér
• Vision
• Simpler: Concurrency, Scalability, Fault-Tolerance
• With a single unified
• Programming Model
• Managed Runtime
• Open Source Distribution
• Manage System Overload (backpressure)
• Scale up & Scale out
• Program to a higher level
• No more shared state, state visibility, threads, locks, concurrent collections, thread
notifications
• Low-level concurrency is built into the plumbing; the work becomes a simple workflow of dealing with
messages
• Increases CPU utilization, lowers latency, high throughput, scalable!
• Superior, Proven model to detect and recover from errors
Akka: Perfect for the Cloud
57Source: “Introducing Akka”, Jonas Bonér
• Elastic and dynamic
• Fault tolerant & self healing (autonomic)
• Adaptive load-balancing, cluster rebalancing & Actor migration
• Build loosely coupled systems that can dynamically adapt at runtime
Akka 101
58Source: “Introducing Akka”, Jonas Bonér
• Akka’s unit of code is called an Actor
• The vehicle to create concurrent, scalable, fault-tolerant apps atop the fabric
• Encapsulates code like servlets or session beans; policy decisions separated
from biz logic
• Actors have been around since 1973, and if you’ve ever used a phone, that
software helped make it work. 9 nines of uptime!
• Think of an actor as a VM in the cloud (it isn’t, but)
• Encapsulated, decoupled
• Managing own memory and behavior
• Communicates asynchronously with non-blocking messages
• Elastic – grow/shrink on demand
• Hot deploy, change runtime behavior
Akka: How to use Actors?
59Source: “Introducing Akka”, Jonas Bonér
• Alternative to:
• A thread
• An object instance or component
• Callback or listener
• Singleton or service
• Router, load-balancer or pool
• Session bean or MDB
• Out of process service
• A Finite State Machine (FSM)
Akka: What is it?
60Source: http://bit.ly/hewitt-on-actors
• Carl Hewitt’s definition
• Fundamental unit of computation that embodies:
• Processing
• Storage
• Communication
• 3 Axioms – When an actor receives a message it can:
• Create new Actors
• Send messages to Actors it knows
• Designate how it should handle the next message it receives
Akka: Core Actor Operations
61Source: “Introducing Akka”, Jonas Bonér
0. Define
1. Create
2. Send
3. Become
4. Supervise
Akka: Define Operation
62Source: “Introducing Akka”, Jonas Bonér
0. Define
• Define the message (class) the actor should respond to, and Actor class
Akka: Create Operation
63Source: “Introducing Akka”, Jonas Bonér
1. Create
• Yes, creates new actor. From ActorSystem, then ActorRef.
• Lightweight: about 2.6M actors per GB of RAM
• Strong encapsulation of: state/behavior (indistinguishable), message queue
Akka: Send Operation
64Source: “Introducing Akka”, Jonas Bonér
2. Send
• Sends a message to an Actor
• Asynch and non-blocking (fire & forget)
• Everything is Reactive
• Actor is passivated until receiving a message, which triggers it to awaken
• Messages are energy
• Everything is asynch and lockless
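The Define, Create and Send operations above can be sketched with a toy, single-threaded actor runtime. This is not Akka: the class and method names (ActorSystem, actor_of, tell) only loosely echo Akka's vocabulary, the Greeting message and Greeter actor are hypothetical, and real Akka runs actors on dispatcher thread pools rather than one loop.

```python
from collections import deque

class Actor:
    def __init__(self):
        self.mailbox = deque()      # private message queue

    def receive(self, message):
        raise NotImplementedError

class ActorRef:
    """Handle used to send messages; callers never touch the actor's state."""
    def __init__(self, system, actor):
        self._system, self._actor = system, actor

    def tell(self, message):
        self._actor.mailbox.append(message)   # fire and forget: returns immediately
        self._system.schedule(self._actor)

class ActorSystem:
    def __init__(self):
        self._ready = deque()

    def actor_of(self, actor_cls, *args):
        return ActorRef(self, actor_cls(*args))

    def schedule(self, actor):
        if actor not in self._ready:
            self._ready.append(actor)

    def run(self):
        # Actors stay passive until a message arrives; each handles one message at a time.
        while self._ready:
            actor = self._ready.popleft()
            actor.receive(actor.mailbox.popleft())
            if actor.mailbox:
                self._ready.append(actor)


# 0. Define the message and the actor
class Greeting:
    def __init__(self, who):
        self.who = who

class Greeter(Actor):
    def __init__(self, out):
        super().__init__()
        self.out = out

    def receive(self, message):
        if isinstance(message, Greeting):
            self.out.append(f"Hello {message.who}")


out = []
system = ActorSystem()
greeter = system.actor_of(Greeter, out)   # 1. Create
greeter.tell(Greeting("AJUG"))            # 2. Send
greeter.tell(Greeting("DevNexus"))
system.run()
print(out)
```

Because only the actor itself reads its mailbox, its state needs no locks even when many senders call tell.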
Akka: Performance
65Source: “Introducing Akka”, Jonas Bonér
+50 million messages per second !!!
Akka: Remote Deployment
66Source: “Introducing Akka”, Jonas Bonér
Akka: Become Operation
67Source: “Introducing Akka”, Jonas Bonér
3. Become
• Dynamically redefine Actor’s behavior
• Reactively triggered by receipt of a message
• Will now react differently to messages it receives
• Behaviors are stacked – can be pushed and popped…
(Think in terms of the object changing its type – interface, protocol, implementation)
Akka: Become Operation – Why?
68Source: “Introducing Akka”, Jonas Bonér
Why do this?
• A busy actor can become an Actor Pool or Router!
• Implement FSM (Finite State Machine)
• Implement graceful degradation
• Spawn empty workers that can ”Become” whatever the Master desires
• Very useful. Limited only by your imagination.
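The Become operation and its behavior stack can be sketched as follows. The SwitchableActor class and the happy/grumpy behaviors are hypothetical; the become/unbecome names and the discard_old flag are modeled on Akka's classic actor API, but this is plain Python, not Akka.

```python
class SwitchableActor:
    def __init__(self, initial):
        self._behaviors = [initial]      # top of the stack handles the next message

    def become(self, behavior, discard_old=True):
        if discard_old:
            self._behaviors[-1] = behavior       # replace the current behavior
        else:
            self._behaviors.append(behavior)     # push: can be undone with unbecome()

    def unbecome(self):
        if len(self._behaviors) > 1:
            self._behaviors.pop()

    def receive(self, message):
        return self._behaviors[-1](self, message)


def happy(actor, message):
    if message == "rain":
        actor.become(grumpy, discard_old=False)  # takes effect for the next message
    return "smile"

def grumpy(actor, message):
    if message == "sun":
        actor.unbecome()
    return "frown"


actor = SwitchableActor(happy)
replies = [actor.receive(m) for m in ["hello", "rain", "hello", "sun", "hello"]]
print(replies)
```

A finite state machine falls out naturally: each state is a behavior, and each transition is a become call triggered by a message.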
Akka: Failure Management
69Source: “Introducing Akka”, Jonas Bonér
In Java/C/C++ etc.
• Single thread of control
• If that thread blows up you are screwed
• Only option to do explicit error handling within your single thread
• Errors isolated within thread; other threads have no clue
• Results in tons of defensive code, scattered throughout the codebase and
entangled in your business logic
Akka: Supervise Operation
70Source: “Introducing Akka”, Jonas Bonér
4. Supervise
• Manage another Actor’s failures (or the person sitting next to you)
• Let actors monitor (supervise) each other for failure, then respond
• Notification sent to Actor’s supervisor if failure occurs.
• Neat separation of processing from error handling
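A minimal sketch of supervision, assuming a restart-on-failure policy with a restart limit. The Worker and Supervisor classes are hypothetical and synchronous; in Akka the supervisor is notified asynchronously and chooses among directives such as resume, restart, stop or escalate.

```python
class Worker:
    """Business logic only: no defensive error handling."""
    def __init__(self):
        self.processed = 0

    def handle(self, job):
        if job == "poison":
            raise ValueError("corrupt job")
        self.processed += 1
        return f"done:{job}"

class Supervisor:
    """Owns the failure policy so the worker does not have to."""
    def __init__(self, worker_factory, max_restarts=3):
        self._factory = worker_factory
        self._worker = worker_factory()
        self.max_restarts = max_restarts
        self.restarts = 0

    def submit(self, job):
        try:
            return self._worker.handle(job)
        except Exception:
            self.restarts += 1
            if self.restarts > self.max_restarts:
                raise                          # escalate: too many failures
            self._worker = self._factory()     # let it crash, restart with fresh state
            return None


supervisor = Supervisor(Worker)
results = [supervisor.submit(job) for job in ["a", "poison", "b"]]
print(results, supervisor.restarts)
```

The failed job is dropped and the worker replaced, while later jobs proceed normally: processing and error handling live in separate components.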
Akka Streaming
71Source: Akka Documentation
• Akka Streams API is completely decoupled from Reactive Streams interfaces
• Implementation details to pass stream data between processing stages
• Akka Streams is interoperable with any conformant Reactive Streams implementation
• Principles
• All features explicit in the API
• Compositionality; combined pieces retain the function of each part
• Model of domain of distributed bounded stream processing
• Reactive Streams -> JDK9 Flow APIs
Akka Streaming
72Source: Akka Documentation
• Immutable building blocks / blueprints enabled for libraries, include
• Source - something with exactly one output stream
• Sink - something with exactly one input stream
• Flow - something with exactly one input and one output stream
• BidiFlow - something with exactly two input streams and two output streams that conceptually behave like
two Flows of opposite direction
• Graph - a packaged stream processing topology that exposes a certain set of input and output ports,
characterized by an object of type Shape.
• Built in backpressure capability
• No stage can push downstream unless it received a pull beforehand
• Difference between error and failure
• Error is accessible within the stream as a data element (signaled via onNext)
• Failure means the stream itself has collapsed (signaled via onError).
• Want failure to propagate faster than data (essential, to deal with backpressure)
• Data elements emitted before a failure can still be lost if the onError overtakes them
• Recovery element acts as bulkhead to confine a stream collapse to a given region of stream topology, to
isolate outside from impact of collapsed region (e.g. buffered elements)
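The back-pressure rule above (no stage pushes downstream unless it has received a pull) can be sketched with a demand-driven publisher. The class names are hypothetical; the method names mirror the Reactive Streams signals (onSubscribe, onNext, onComplete, request) in Python naming, and the sketch is synchronous and ignores error signalling and cancellation.

```python
class ListPublisher:
    """Emits items only in response to demand signalled via request(n)."""
    def __init__(self, items):
        self._items = list(items)

    def subscribe(self, subscriber):
        subscriber.on_subscribe(_Subscription(self._items, subscriber))

class _Subscription:
    def __init__(self, items, subscriber):
        self._items, self._subscriber = items, subscriber
        self._index = 0
        self._done = False

    def request(self, n):
        # Push at most n elements: the subscriber's demand bounds the flow.
        while n > 0 and not self._done and self._index < len(self._items):
            item = self._items[self._index]
            self._index += 1
            n -= 1
            self._subscriber.on_next(item)
        if not self._done and self._index >= len(self._items):
            self._done = True
            self._subscriber.on_complete()

class SlowSubscriber:
    def __init__(self, batch):
        self.batch = batch
        self.received = []
        self.completed = False
        self.subscription = None

    def on_subscribe(self, subscription):
        self.subscription = subscription   # nothing flows until demand is signalled

    def on_next(self, item):
        self.received.append(item)

    def on_complete(self):
        self.completed = True

    def pull(self):
        self.subscription.request(self.batch)


subscriber = SlowSubscriber(batch=2)
ListPublisher(range(5)).subscribe(subscriber)
before_pull = list(subscriber.received)
subscriber.pull()
after_first_pull = list(subscriber.received)
subscriber.pull()
subscriber.pull()
print(before_pull, after_first_pull, subscriber.received, subscriber.completed)
```

A slow consumer simply requests less often, so the producer never overruns it and no unbounded buffer is needed between stages.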
Akka HTTP
73Source: Akka Documentation
• Built atop Akka Streams
• Can expose an incoming connection in form of a Source instance
• To start listening on network with Akka HTTP, create a Route and bind it to a port (similar syntax
to Spray).
• Backpressure on source? Akka HTTP stops consuming data from network; eventually leads to 0
TCP window – applying backpressure to sending party itself (e.g. a sensor)
• Rules
• Libraries shall provide their users with reusable pieces, i.e. expose factories that return graphs,
allowing full compositionality
• Avoid destruction of compositionality.
• Express functionality of a library such that materialization can be done by user outside of
library’s control.
• Libraries may optionally and additionally provide facilities that consume and materialize graphs
• Allows a library to provide convenience “sugar” for use cases
• Akka Streams Integration
• https://github.com/akka/alpakka
• http://developer.lightbend.com/docs/alpakka/current/
• Adds interesting capabilities to Akka Streams
• Modern alternative to Apache Camel (EIP implementation)
• Camel en ze Akka
• Community driven, focused on connectors to external libraries,
integration patterns and conversions.
• ”A call to arms”
• https://github.com/akka/alpakka/releases/tag/v0.1
Alpakka 101
74
The Data Lake for Analytics, App Dev
“Lake Wobegon, the little town that time forgot
and the decades cannot improve.”
- Garrison Keillor
75
The Data Lake
76
• Note: This section will be expanded for DevNexus.
The Data Lake
77
• The Data Lake is where a copy of much of the data from source systems “ends
up”, via Fast Data, etc.
• Easily accessible, massive repository of data built on commodity hardware (or
Cloud).
• Data is not stored in a way that is optimized for data analysis (S3)
• Data Lake retains all attributes
• Beware the Data Lake fallacy: http://www.gartner.com/newsroom/id/2809117
• Let’s combine all this data to drive increased information sharing, usage, while
reducing cost through consolidation / tech simplification.
• Does it really work?
• Has the ideal of Enterprise-wide data management been realized?
• Deriving value from data still in hands of business end user (enter: Fast Data Platforms)
The Dark Side of the Data Lake
78Source: “Gartner Says Beware of the Data Lake Fallacy”
• Many companies tend to vacuum data into Hadoop for later use
• Many companies use overlapping tools within the same ecosystem, that do not
interoperate
• Data lakes ignore how/why data is used, governed, defined and secured.
• Does this sound like a good solution?
• Data Lake solves old problem of siloing data. Great, so now it’s all commingled.
• Federated query? AWS Athena? Why move data if not necessary?
• Inability to quantitatively measure data quality
• Accepts any data without governance or oversight
• Accepts any data without metadata (description)
• Inability to share lineage of findings with other analysts, so found value is not shared
• Security, access control (and tracking of both)
• Data Ownership, Entitlements?
• Tenancy?
• Regard for regulatory controls, compliance issues?
• What to do?
More to Come…
79Source: “Gartner Says Beware of the Data Lake Fallacy”
• At DevNexus 2017
• Thurs, Feb 23 @ 2:30pm
• http://devnexus.com/s/devnexus2017/presentations/17212
Presentation Improvements for DevNexus
“Build something 100 people love,
not something 1 million people kind of like.”
- Brian Chesky
80
• More diagrams, fewer words
• Refine, refine, refine
• Mix in coding examples
• Improve contrast between older architectures and reactive, for the Enterprise
• Content that contrasts different streaming options (Akka, Spark, Kafka)
• Add specific performance details
• Incorporate additional, interesting content (incl. Data Lake related)
Planned Improvements
81
Questions?
“I'm sorry, if you were right, I'd agree with you.”
- Robin Williams
82
• The Reactive Manifesto (http://www.reactivemanifesto.org/)
• Chaos Monkey? Use Linear Fault Driven Testing instead.
• https://www.lightbend.com/blog/architect-reactive-design-patterns
• http://www.infoworld.com/article/2608040/big-data/fast-data--the-next-step-after-big-data.html
• https://www.lightbend.com/blog/lessons-learned-from-paypal-implementing-back-pressure-with-akka-streams-and-kafka
• https://kafka.apache.org
• http://www.slideshare.net/ducasf/introduction-to-kafka
• http://www.slideshare.net/SparkSummit/grace-huang-49762421
• http://www.slideshare.net/HadoopSummit/performance-comparison-of-streaming-big-data-platforms
• https://github.com/akka/alpakka
• http://developer.lightbend.com/docs/alpakka/current/
• https://github.com/akka/alpakka/releases/tag/v0.1
• http://www.slideshare.net/LisaHua/spark-overview-37479609
• http://spark.apache.org/
• https://www.realdbamagic.com/intro-to-apache-spark-2016-slides/
• http://www.slideshare.net/gene7299/akka-actor-presentation
• http://www.slideshare.net/jboner/introducing-akka
• http://bit.ly/hewitt-on-actors
• http://tech.measurence.com/2016/06/01/a-dive-into-akka-streams.html
• https://infocus.emc.com/rachel_haines/is-the-data-lake-the-best-architecture-to-support-big-data/
Resources
83
Ideas
84

Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

  • 1. @wip Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark AJUG, Jan 17, 2017 Todd Fritz Cox Automotive, Inc.
  • 2. This presentation is a draft of what will be presented, next month, at DevNexus. “Sneak peek” Reactions and questions may influence evolution of content Disclaimer 2
  • 3. 3
  • 5. License: CC BY-SA 3.0 5 • Senior Solutions Architect @ Cox Automotive, Inc. • Strategic Data Services • The opinions contained herein may not represent my employer, but I believe they should. • Background is building platforms, middleware, MoM, EIP, EDA, etc • DevOps mentality • Exposed to many environments, technologies, people • Life-long learner and always curious • Novice bass player • Scuba diver About Me
  • 6. DevNexus 2015 http://www.slideshare.net/ToddFritz/2015-03-11_Todd_Fritz_Devnexus_2015 Great Wide Open - Atlanta (April 3, 2014) http://www.slideshare.net/ToddFritz/2014-04-03legacytocloud AJUG (April 15, 2014) http://www.slideshare.net/ToddFritz/2014-april-15-atlanta-java-users-group Video - https://vimeo.com/94556976 Previous Presentations 6
  • 7. • Forward • The Briefest History • Background: Reactive Systems, Patterns, Implementations • The Enterprise • Fast Data • The Data Lake for Analytics, App Dev • Presentation Improvements Planned for DevNexus • Questions • Resources Agenda 7
  • 8. Forward “Our greatest glory is not in never failing, but in rising every time we fall.” -Confucius 8
  • 9. • Why is Reactive Important? • Reactive Systems and Programming != Reactive Management • Reactive underpins every use case, every business capability, every product feature • Tendency for companies to survey market and select products to match perceived business need • Importance of vision, governance, tenancy, entitlements, security • Both Process and Technology • Use Case: Build a system that can scale thousands to millions of users, handling millions to billions of messages • Real time data, for application components, middleware, data processing, analytics • A journey of innovation and successive refinement Onward 9
  • 10. The Briefest History History is the sum total of things that could have been avoided. - Konrad Adenauer 10
  • 11. • “Reactive” is not new • Underlying principles go back almost 50 years, to the days of punch cards • Erlang • Built to scale, handle extremely high volume • Extensive use in Telco for decades; many billions of messages • Actors • Bedrock is messaging • We’ve been using this technique for decades, via many technologies • Improvements over time around component isolation, decoupling to benefit scalability and concurrency The Briefest History 11
  • 12. Reactive Systems, Patterns, Implementations “People who think they know everything are a great annoyance to those of us who do.” - Isaac Asimov 12
  • 13. • The Reactive Manifesto • http://www.reactivemanifesto.org • Many organizations independently building software to new, and similar, patterns • Increasing pressures to simplify, scale, innovate and improve customer experience • Increasing proliferation and interoperability of system environments and connected devices • More data with contextual use cases • Yesterday’s architectures just don’t cut it • Need more flexible, resilient, robust systems • Solution through evolution • Spawn of Actors (Akka, Erlang) • Good starting point: https://www.lightbend.com/blog/architect-reactive-design-patterns Reactive Systems 13
  • 14. “…“Reactive” is a set of design principles for creating cohesive systems. It’s a way of thinking about systems architecture and design in a distributed environment where implementation techniques, tooling, and design patterns are components of a larger whole.”* “A Reactive System is based on an architectural style that allows … multiple individual services to coalesce as a single unit and react to its surroundings while remaining aware of each … scale up/down, load balance and even ...” * (proactive steps) Components may each qualify as reactive, but combining them does not guarantee a Reactive System What is Reactive? A Set of Design Principles * Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
  • 15. 1. Reactive Systems (architecture, design) 2. Reactive Programming (declarative event-based) 3. Functional Reactive Programming (FRP) NOTE: The inventor of this term, Conal Elliott, says this term is misapplied today (e.g. RxJS, RxJava, Bacon.js, etc). Refer to his presentation (July 22, 2015) for the details: https://begriffs.com/posts/2015-07-22-essence-of-frp.html Reactive Begets* * Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
  • 16. • Reactive Programming != Functional Reactive Programming • Subset of Asynch Programming • New information drives logic flow vs. control flow driven by thread-of-execution • Avoids resource contention (Amdahl’s Law) that impedes scalability. • Decompose into multiple steps that are asynch and nonblocking • Combine into a composed workflow • Reactive Systems very rarely block • Reactive API libraries are either declarative (functional composition, combinators) or callback-based (attached to events, executed during dataflow chain) with stream-based operators (windowing, triggers, etc) • Reactive programming is event-driven • Reactive systems are message driven • Wait? What? More about this distinction in a few slides. Reactive Programming* 16* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
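The "decompose into asynch, nonblocking steps, then combine into a composed workflow" idea can be sketched with Python's standard `asyncio` module. This is only an illustration, not code from the talk: the `PRICES` table and the `fetch_price`, `add_tax`, `quote` and `quote_all` names are hypothetical, and `asyncio.sleep(0)` merely stands in for real non-blocking I/O.

```python
import asyncio

# Hypothetical price table standing in for a remote service.
PRICES = {"A": 10.0, "B": 2.5}


async def fetch_price(sku: str) -> float:
    # Yield control to the event loop, as a non-blocking I/O call would.
    await asyncio.sleep(0)
    return PRICES[sku]


async def add_tax(price: float, rate: float = 0.1) -> float:
    return round(price * (1 + rate), 2)


async def quote(sku: str) -> float:
    # Compose two asynchronous steps into one workflow.
    return await add_tax(await fetch_price(sku))


async def quote_all(skus):
    # Run the composed workflows concurrently; results keep the input order.
    return await asyncio.gather(*(quote(sku) for sku in skus))


if __name__ == "__main__":
    print(asyncio.run(quote_all(["A", "B"])))
```

No step holds a thread while it waits; the arrival of each result is what drives the next step.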
  • 17. • Reactive Programming is related to Dataflow Programming • Both emphasize flow of data vs. flow of control • Examples • Futures / Promises • (Reactive) Streams – unbounded data processing. Asynch, non-blocking, back-pressured pipelines connecting sources and destinations. • Dataflow Variables – single assignment variables (AKA a cell in Excel) whereby a value change can trigger dependent functions to produce new values (state) • Technologies that do this include • Akka Streams • RxJava • Vert.x • Reactive Streams Specification • The standard for interoperability amongst Reactive Programming libraries on the JVM • “…an initiative to provide a standard for asynchronous stream processing with non-blocking back pressure.” About Dataflow Programming* * Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
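The spreadsheet-cell analogy for dataflow variables can be sketched in a few lines of Python. The `Cell` class below is a hypothetical illustration, not any library's API: a cell either holds a plain value or a formula over other cells, and a change to an input recomputes every dependent cell.

```python
class Cell:
    """A spreadsheet-like cell: a plain value, or a formula over other cells."""

    def __init__(self, value=None, formula=None, inputs=()):
        self._formula = formula
        self._inputs = list(inputs)
        self._dependents = []
        # Register with every input so changes can be pushed to this cell.
        for cell in self._inputs:
            cell._dependents.append(self)
        self.value = value if formula is None else self._compute()

    def _compute(self):
        return self._formula(*(cell.value for cell in self._inputs))

    def set(self, value):
        # A value change triggers dependent formulas to produce new values.
        self.value = value
        self._propagate()

    def _propagate(self):
        for dependent in self._dependents:
            dependent.value = dependent._compute()
            dependent._propagate()


if __name__ == "__main__":
    a, b = Cell(2), Cell(3)
    total = Cell(formula=lambda x, y: x + y, inputs=(a, b))
    a.set(10)
    print(total.value)
```

The caller never asks `total` to recompute; the data change itself drives the computation, which is the "flow of data vs. flow of control" emphasis.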
  • 18. • Notable benefits • Increased (efficient) utilization of compute resources (incl. multi-core) • Increased performance via serialization reduction • Amdahl’s Law • Neil Gunther’s Universal Scalability Law, which quantifies the effects of contention and coordination in concurrent, distributed systems. It explains how the cost of coherency can lead to negative returns as new resources are added to the system. • Productivity. Reactive libraries handle complexities such as dealing with asynch, nonblocking compute, IO, coordination between components. • Great for creating components that are composed into workflows that are back-pressured, scalable, high-performance Why Reactive Programming* * Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
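Both laws are short formulas, so a worked sketch may help. Amdahl's Law gives the speedup on N workers as 1 / ((1 - p) + p / N), where p is the parallelizable fraction. The Universal Scalability Law gives relative capacity as N / (1 + α(N - 1) + βN(N - 1)), where α models contention and β models coherency cost. The parameter values in the usage example are illustrative assumptions, not measurements.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Amdahl's Law: speedup when only part of the work can run in parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


def usl_capacity(workers: int, contention: float, coherency: float) -> float:
    """Universal Scalability Law: relative capacity with N workers."""
    n = workers
    return n / (1.0 + contention * (n - 1) + coherency * n * (n - 1))


if __name__ == "__main__":
    # Illustrative parameters: 95% parallel work, 5% contention, 0.1% coherency.
    for n in (1, 8, 32, 64, 512):
        print(n, round(amdahl_speedup(0.95, n), 2),
              round(usl_capacity(n, 0.05, 0.001), 2))
```

With these assumed parameters, Amdahl's speedup approaches but never reaches 20x, while USL capacity peaks near 31 workers and then declines. That decline is the "negative results as new resources are added" effect.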
  • 19. • Reactive Programming - event-driven • Computation via dataflow chains • Events are not directed; “addressable event sources” • Events are mere facts that can be observed • Emitted by changes in state (state machine) • Listeners attach to event sources and react to the events they emit • Emitted locally • Reactive Systems - message-driven • Basis of communication across network/components; prefer asynch • Sender/Receiver are decoupled • Focus on resilience and elasticity via the communication/coordination inherent to distributed systems • Long-lived, addressable components • Waits for messages to be sent, then reacts to them • Messages are “directed” • A message has a clear destination; “addressable recipient” Event-Driven vs. Message-Driven* * Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
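The distinction between an addressable event source and an addressable recipient can be shown side by side. This is a minimal in-process sketch; `EventSource` and `Router` are hypothetical names, and a real message-driven system would cross process or network boundaries.

```python
from collections import deque


class EventSource:
    """Event-driven: emits facts to whoever chose to listen."""

    def __init__(self):
        self._listeners = []

    def subscribe(self, listener):
        self._listeners.append(listener)

    def emit(self, event):
        # The source does not know or care who reacts to the event.
        for listener in self._listeners:
            listener(event)


class Router:
    """Message-driven: delivers each message to one addressed recipient."""

    def __init__(self):
        self._mailboxes = {}

    def register(self, address):
        self._mailboxes[address] = deque()
        return self._mailboxes[address]

    def send(self, address, message):
        # The sender names a destination; delivery is decoupled from handling.
        self._mailboxes[address].append(message)


if __name__ == "__main__":
    source = EventSource()
    source.subscribe(print)
    source.emit("price_changed")

    router = Router()
    billing = router.register("billing")
    router.send("billing", "charge order 42")
    print(list(billing))
```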
  • 20. • Common pattern is to use Messaging as a means to communicate Events across network/components • Events within Messages • Examples • AWS Lambda, distributed streaming (Spark Streaming, Flink, Kafka, Akka Streams, Pub/Sub) • Pros • Abstraction and simplicity • Cons • Lose some control • Messaging forces developers to deal with complex realities of distributed programming • Failure detection • Message delivery contracts (dupes, retry, ordering) • Consistency guarantees • Can’t hide behind “leaky” abstractions that pretend a network does not exist (EJB, XA, RPC, etc) Event-Driven & Message-Driven* * Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
  • 21. Reactive Systems: Characteristics* 21 Responsive • Low latency • Consistent & predictable • Foundation of Usability • Essential for Utility • Expose problems quickly • Happier Customers Resilient • Responsive through failure • Resilient (H/A) through replication • Failure isolated to Component (bulkheads) • Delegate recovery to external Component (supervisor hierarchies) • Client does not handle failures Elastic • Responsive as workload varies • Devoid of bottlenecks or hot spots • Perf metrics drive predictive / reactive autonomic scaling • Favor commodity infrastructure Message Driven • Asynchronous • Loose Coupling • Isolation • Location Transparency • Non Blocking • Recipients can passivate • Message Passing • Flow Control • Exception Management • Elasticity • Back Pressure * Source: The Reactive Manifesto
  • 22. Reactive Systems: Patterns Architecture Pattern: Single Component • Component does one thing, fully and well • Single Responsibility Principle • Max cohesion, min coupling Architecture Pattern: Let-it-Crash • Prefer a full component restart to complex internal error handling • Design for failure • Leads to more reliable components • Avoids hard to find/fix errors • Failure is unavoidable Implementation Patterns Circuit Breaker • Protect services by breaking connections during failures • From EE • Protects clients from timeouts • Allows time for service to recover Source: https://www.lightbend.com/blog/architect-reactive-design-patterns Implementation Patterns Saga • Divide long-lived, distributed transactions into quick local ones with compensating actions for recovery. • Compensating txns run during saga rollback • Concurrent sagas can see intermediate state • Sagas need to be persistent to recover from hardware failures. Save points.
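The Circuit Breaker pattern above can be sketched as a small state machine. This is a simplified illustration under stated assumptions, not Akka's or any other library's implementation: the `CircuitBreaker` class, its parameters, and the injectable `clock` are all hypothetical.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is rejected immediately."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_timeout=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Fail fast: protect the client from waiting on a timeout.
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.max_failures:
                # Open (or re-open) the circuit to give the service time to recover.
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Passing the clock in as a parameter is a design choice made here so the time-based behavior can be exercised without real waiting.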
  • 23. • Still need to use the brain and learn • Not going to be able to build reactive systems with just paper certifications • No substitute for experience! • The “new” is based on evolution (old techniques and patterns) • Paradigms and patterns not isolated to components or technology Sound Complicated? 23Source: The Reactive Manifesto
  • 24. The Enterprise “Even if you are on the right track, you’ll get run over if you just sit there.” - Will Rogers 24
  • 25. 25
  • 26. • We all work for 1..n companies that have a size, complexity and age • Young companies (e.g. start-ups) • Less legacy overhead • More able to adopt newer technology; attempting to innovate, disrupt, find a niche • Less able to leverage expensive, enterprise-class solutions; value through IP or a unique product • More likely to build cloud native (various reasons) • Staff typically has larger sphere of influence • Sometimes filled with Unicorns • Mid-size companies • May have legacy overhead • Complexity if growth through acquisition vs. growth through product evolution • Strategic use of Enterprise products • May be cloud native, or hybrid • Increased division of labor, defined roles, paying customers to keep happy The Enterprise
  • 27. • Generalizing large companies • Probably where the terms “technical debt” and “culture debt” came from • Certainly has legacy overhead, very complex environments • Fear of change, less room to fail – backed by valid business reasons • Likely has complexity due to both acquisition and product evolution • Purchase a company, adding a “different” tech stack • Money to fully absorb? Risk? Or purchase to evolve existing lines of business? • Use of Enterprise products, likely prefers “supported” flavors of open source • A blend of on-premises, hybrid, cloud • More politics and cats to herd • Slower to see value from new technology • Matrixed division of labor, defined roles, paying customers to keep happy • Transformative change is more difficult; planning, budgeting, process, management structure The Enterprise
  • 28. • An Enterprise has a variety of software applications • Customer facing, revenue generating • “Supporting” applications for operation, development, maintenance activities • Datamarts used for analytics • Data typically moved from system of record, into one or more centralized data “hubs” • OLAP / Data Warehouses • Data Lake, common use of Hadoop • Older application architectures focused on the application, not on enterprise interoperability • Drop in ETL, ESB, SOA, iPaaS to do so ($$$) • Perhaps refactor service layers into more modern middleware • Enterprises becoming ”real time” • The old, batch-oriented, application silos are too complex and just can’t do this well Breaking Down the Enterprise 28
  • 29. • (Not a complete list) • Products (and supporting software) operate in Real Time • Opens door to do things to increase customer satisfaction/retention • Save money, become more agile (speed to market) – productivity enhancer • Less reliant on batch or pull (streaming batch to interop legacy) • Data flows to systems that need it, where it is acted on in real time • More powerful data processing • Easier to manage; reduced TCO • Decoupled, enables greater interoperability, CI/CD (Infrastructure as a Service) • Leverage cloud solutions, auto scaling, resiliency, high availability • Analysts get to use latest tools, oftentimes in the cloud • Enables autonomic automation How the Enterprise Benefits from Reactive
  • 30. • Friends don’t let their friend-analysts do map-reduce • Analysts are not highly technical (some do code in R) • Larger companies may have communities of SAS users • They just want to connect a friendly BI tool (e.g. Tableau) to run their SQL against data. • Can’t expect people to switch careers or skill sets to accommodate new technology. And Again, those Analysts 30
  • 31. 31Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam
  • 32. Fast Data “If you are in a spaceship that is traveling at the speed of light, and you turn on the headlights, does anything happen?” - Steven Wright 32
  • 33. • Big Data -> Fast Data • Constant stream of inbound data, at incredible rates • The old way is to store it, then analyze • Say hello to Map Reduce! • Hadoop did reinvent how to process petabytes of data on commodity hardware • Why not analyze and act on data as it is received? • Using (proven) technology built to scale to massive volume? • Akka, Kafka, Spark, Flink • Use case reminder: millions of concurrent actors handling billions of messages, in real-time • Fast Data means acting on data in real-time, and sending it to destinations • Data Lake • Other systems Fast Data * Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
  • 34. Time is Gold 34* Source: “Towards Benchmarking Modern Distributed Systems”, Grace Huang (Intel)
  • 35. • Fast Data is the obvious future • Many of us have been using these techs for years • The easy part • Build new systems using Reactive patterns, architectures and technology • You start-ups have it easy… • The ability of a company to adopt disruptive architectures and patterns is inversely related to its size • In a real world where SDLC costs money • How to reconcile with legacy architectures and implementations? • How to interoperate with disparate systems with varying capabilities? • Does it make sense to refactor “what works”? • Are Frankenstein / Stove-piped solutions the norm, or (partially) a matter of perspective, when viewed in the rear-view mirror of innovation? • What is a cost-efficient approach to adopt and adapt to Reactive? Reality Check * Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
  • 36. • Enough background, time to talk tech • For this discussion, the reference platform is Lightbend’s Fast Data platform • (A few deviations mentioned) • Core techs will be • Kafka - open source or Confluent • Spark - open source or Databricks (a nice managed service in AWS) • Akka (and Akka-HTTP) (open source or Lightbend stack) • Alpakka Now, the Fun Stuff 36* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
  • 38. • JMS -> AMQP -> Kafka • Streaming platform • Process messages when provided • Fault-tolerant storage • Pub/Sub capability • Common use cases • Data pipelines (messaging) between a source and destination • Real-time action / transformation • Metrics, weblogs, stream processing, event sourcing • A commit log • Spread across a cluster • Record streams stored in topics • Each record has a key, value, timestamp • Each topic has offsets and a retention policy Kafka 101
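The commit-log model (records with a key, value and timestamp, addressed by offset, subject to retention) can be sketched as an in-memory toy. The `TopicLog` class is hypothetical and does not use any Kafka client API. It also simplifies retention to a record count, whereas a Kafka topic is partitioned and its retention is typically configured by time or size.

```python
class TopicLog:
    """Toy append-only log for one topic; not a Kafka client."""

    def __init__(self, name, retention=1000):
        self.name = name
        self.retention = retention  # simplification: max number of records kept
        self._records = []          # (offset, key, value, timestamp) tuples
        self._next_offset = 0

    def append(self, key, value, timestamp):
        offset = self._next_offset
        self._records.append((offset, key, value, timestamp))
        self._next_offset += 1
        if len(self._records) > self.retention:
            # Retention drops the oldest record; offsets are never reused.
            self._records.pop(0)
        return offset

    def read(self, from_offset, max_records=100):
        # Consumers choose where to start, so replay is just an earlier offset.
        wanted = [r for r in self._records if r[0] >= from_offset]
        return wanted[:max_records]


if __name__ == "__main__":
    log = TopicLog("orders")
    for i in range(3):
        log.append(f"customer-{i}", {"total": i * 10}, timestamp=1000 + i)
    print(log.read(from_offset=1))
```

Because the consumer, not the log, tracks the read position, resetting an offset replays history.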
  • 39. Kafka Performance Source: “Introduction to Kafka”, Ducas Francis 30k/s 1.8M/min 108M/hr ~2.6B/day
  • 40. • Producer API • To publish a record to Kafka • Consumer API • To subscribe to a topic(s) • Consumer groups • Handle records pushed to topics • Streams API • Stream processing • Consume from Stream An, do processing, publish to Stream Bn • E.g. aggregation, joining • Connector API • To build custom Producers/Consumers • Purpose built integration components Kafka APIs 40
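How keys, partitions and consumer groups fit together can be sketched without a broker. Everything below is an illustrative assumption: the CRC32 hash and the round-robin assignment are stand-ins, not the partitioner or assignor that Kafka clients actually ship with.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a record key to a partition so equal keys stay ordered together.

    CRC32 is used here only because it is deterministic and in the
    standard library; it is not the hash Kafka clients use.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(partitions, consumers):
    """Spread partitions across the members of one consumer group."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


if __name__ == "__main__":
    print(partition_for("customer-42", 4))
    print(assign_partitions([0, 1, 2, 3], ["consumer-a", "consumer-b"]))
```

Within a group each partition has exactly one owner, which is how a group scales consumption while preserving per-key ordering.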
  • 41. • > Messaging systems (RabbitMQ, AMQ, etc) • Just don’t scale well and become complex • Need to use other abstractions for batching • Lacks replay ability (reset offset, etc) • > Log forwarders, e.g. Scribe or Flume • Push architecture • High performance • Scales well • Sensitive to business logic in endpoints, because needs to push data fast • Assumes data is pushed to large sink (e.g. Hadoop) • Oh, and then queries later. So much for real-time. • Supports Polyglot • Python, Go, .NET, Node.js, C/C++, etc • Robust ecosystem • https://cwiki.apache.org/confluence/display/KAFKA/Ecosystem Why Use Kafka? 41
  • 42. • “…a fast and general engine for large-scale data processing.” • Hadoop / YARN • Map-Reduce – mapper, reducer, disk IO, queue, fetch resource • Great for parallel file processing of large files • Synchronization barrier during persistence • Spark • In-memory data processing • Interactive/iterative data query • Better supports more complex, interactive (real-time) apps • 100x faster than Hadoop MR (in memory), 10x faster on disk • Microbatching Spark: What is it? 42Source: spark.apache.org
  • 43. • Combine SQL, streaming, complex analytics • SQL • Dataframes • MLlib • GraphX • Spark Streaming Spark 43Source: spark.apache.org
  • 44. • Run it • Standalone • Hadoop • Mesos • Cloud • Access data • HDFS • Cassandra • HBase • S3 • Hive • Tachyon, and more • Write code • Scala • Java • Python, Clojure, R • Interactive query shell (notebooks) Spark Execution Modes Source: spark.apache.org
  • 45. • Slow due to replication, serialization, filesystem IO • Inefficient use cases: • Iterative algorithms (ML, Graphs, Network analysis) • Interactive / Ad-hoc data mining (R, Excel, Searching, analyst queries) Spark: Hadoop MR 45Source: “Spark Overview”, Lisa Hua
  • 46. Spark: Hadoop & Spark (in Hadoop) 46Source: “Spark Overview”, Lisa Hua • Spark has additional features, such as interop with S3 storage
  • 47. Spark: A Clustered Application 47Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam
  • 48. Spark: Execution Terminology Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam • Job – a set of tasks to be executed as a result of an action • Stage – a set of tasks in a job that can be run in parallel • Task – an individual unit of work sent to a single executor
  • 49. Spark: SQL 49Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam • Spark SQL is a module for structured data querying • Supports basic SQL and HiveQL • Can act as distributed query engine via JDBC/ODBC, or CLI
  • 50. Spark: Dataframe, Datasets, RDDs 50Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam • Dataframe is a distributed assembly of data into named columns • Analogous to a relational table, or data frame in R/Python (with richer optimizations) • Dataset was added in 1.6 to provide benefit of RDDs and Spark SQL’s execution engine • Build datasets from JVM objects and then manipulate with functional transformations
  • 51. • Scalable, high-throughput, fault tolerant processing of real time streams and use cases. • Microbatched. Think of in terms of EDA (e.g. Esper), for “windowing” (etc) vs. handling a single message/event. Spark Streaming 51Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam
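Micro-batching means events are grouped by arrival interval and each group is processed as a small batch, rather than one event at a time. The sketch below is plain Python and deliberately does not use the Spark API; the function names and the sample events are hypothetical.

```python
from collections import Counter, defaultdict


def micro_batches(events, batch_seconds):
    """Group (timestamp, value) events into consecutive fixed-length batches."""
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // batch_seconds)].append(value)
    # Return batches in time order; intervals with no events are skipped.
    return [batches[index] for index in sorted(batches)]


def counts_per_batch(events, batch_seconds):
    """Run the same small batch computation over every interval."""
    return [Counter(batch) for batch in micro_batches(events, batch_seconds)]


if __name__ == "__main__":
    events = [(0.1, "a"), (0.9, "b"), (1.2, "a"), (2.5, "a"), (2.7, "a")]
    print(counts_per_batch(events, batch_seconds=1))
```

The batch interval is the trade-off: shorter intervals lower latency but add per-batch scheduling overhead.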
  • 52. • Can merge inbound data with historical data • Write code in Scala, Python, Java, etc • Lower level access via DStream, obtained via StreamingContext. • Create RDDs from the DStream • Two primary metrics to monitor and tune: • Processing time (per batch) • Scheduling delay (processed upon arrival?) • Use Kryo for serialization Spark Streaming Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam
  • 53. Spark Streaming - comparison with other techs 53Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam
  • 54. Building a Spark Application Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam • Scala or Java, compiled to JAR (in turn uploaded to worker nodes) • Running a Spark app is as easy as pie (submit options can expand the experience): $ spark-submit myAwesomePythonScript.py theFileURL $ spark-submit --class SkyNetInScala skynet2017.jar theFileURL • Spark uses log4j (beware) • YARN can aggregate worker logs
  • 55. Akka, who uses it? 55Source: “Akka Actor Introduction”, Gene
  • 56. Akka Source: “Introducing Akka”, Jonas Bonér • Vision • Simpler: Concurrency, Scalability, Fault-Tolerance • With a single unified • Programming Model • Managed Runtime • Open Source Distribution • Manage System Overload (backpressure) • Scale up & Scale out • Program to a higher level • No more shared state, state visibility, threads, locks, concurrent collections, thread notifications • Low level concurrency is built into the plumbing; the workflow becomes simple: just deal with messages • Increases CPU utilization, lowers latency, high throughput, scalable! • Superior, Proven model to detect and recover from errors
  • 57. Akka: Perfect for the Cloud 57Source: “Introducing Akka”, Jonas Bonér • Elastic and dynamic • Fault tolerant & self healing (autonomic) • Adaptive load-balancing, cluster rebalancing & Actor migration • Build loosely coupled systems that can dynamically adapt at runtime
  • 58. Akka 101 58Source: “Introducing Akka”, Jonas Bonér • Akka’s unit of code is called an Actor • The vehicle to create concurrent, scalable, fault-tolerant apps atop the fabric • Encapsulates code like servlets or session beans; policy decisions separated from biz logic • Actors have been around since 1973, and if you’ve ever used a phone, that software helped make it work. 9 nines of uptime! • Think of an actor as a VM in the cloud (it isn’t, but) • Encapsulated, decoupled • Managing own memory and behavior • Communicates asynchronously with non-blocking messages • Elastic – grow/shrink on demand • Hot deploy, change runtime behavior
  • 59. Akka: How to use Actors? 59Source: “Introducing Akka”, Jonas Bonér • Alternative to: • A thread • An object instance or component • Callback or listener • Singleton or service • Router, load-balancer or pool • Session bean or MDB • Out of process service • A Finite State Machine (FSM)
  • 60. Akka: What is it? 60Source: http://bit.ly/hewitt-on-actors • Carl Hewitt’s definition • Fundamental unit of computation that embodies: • Processing • Storage • Communication • 3 Axioms – When an actor receives a message it can: • Create new Actors • Send messages to Actors it knows • Designate how it should handle the next message it receives
  • 61. Akka: Core Actor Operations 61Source: “Introducing Akka”, Jonas Bonér 0. Define 1. Create 2. Send 3. Become 4. Supervise
  • 62. Akka: Define Operation 62Source: “Introducing Akka”, Jonas Bonér 0. Define • Define the message (class) the actor should respond to, and Actor class
  • 63. Akka: Create Operation Source: “Introducing Akka”, Jonas Bonér 1. Create • Yes, creates new actor. From ActorSystem, then ActorRef. • Lightweight, 2.6M per GB RAM • Strong encapsulation of: state/behavior (indistinguishable), message queue
  • 64. Akka: Send Operation Source: “Introducing Akka”, Jonas Bonér 2. Send • Sends a message to an Actor • Asynch and non-blocking (fire & forget) • Everything is Reactive • Actor is passivated until receiving a message, which triggers it to awaken • Messages are energy • Everything is asynch and lockless
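The Define, Create and Send operations can be sketched with a mailbox and one worker thread per actor. This is a teaching sketch in Python's standard library, not Akka: the `Actor` and `CounterActor` classes and the `tell` method name are illustrative, and real actors are far lighter than a thread each.

```python
import queue
import threading

_STOP = object()  # sentinel message used to shut an actor down


class Actor:
    """Minimal actor: private state, a mailbox, one message at a time."""

    def __init__(self):
        self._mailbox = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def tell(self, message):
        # Fire and forget: enqueue and return without waiting for processing.
        self._mailbox.put(message)

    def _run(self):
        while True:
            message = self._mailbox.get()  # idle until a message arrives
            if message is _STOP:
                return
            self.receive(message)

    def receive(self, message):
        raise NotImplementedError

    def stop(self):
        # Messages are handled in order, so earlier ones finish first.
        self._mailbox.put(_STOP)
        self._thread.join()


class CounterActor(Actor):
    def __init__(self):
        self.count = 0  # state is touched only by the actor's own thread
        super().__init__()

    def receive(self, message):
        if message == "increment":
            self.count += 1


if __name__ == "__main__":
    counter = CounterActor()
    for _ in range(1000):
        counter.tell("increment")
    counter.stop()
    print(counter.count)
```

No locks appear in `CounterActor`: because only the actor's own thread processes messages, its state needs no synchronization.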
  • 65. Akka: Performance 65Source: “Introducing Akka”, Jonas Bonér +50 million messages per second !!!
  • 66. Akka: Remote Deployment 66Source: “Introducing Akka”, Jonas Bonér
  • 67. Akka: Become Operation Source: “Introducing Akka”, Jonas Bonér 3. Become • Dynamically redefine Actor’s behavior • Reactively triggered by receipt of a message • Will now react differently to messages it receives • Behaviors are stacked – can be pushed and popped… (Think of it as the object changing its type – interface, protocol, implementation)
  • 68. Akka: Become Operation – Why? 68Source: “Introducing Akka”, Jonas Bonér Why do this? • A busy actor can become an Actor Pool or Router! • Implement FSM (Finite State Machine) • Implement graceful degradation • Spawn empty workers that can ”Become” whatever the Master desires • Very useful. Limited only by your imagination.
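Become amounts to swapping the function that handles the next message, which is also how a finite state machine falls out naturally. The `Gate` class below is a hypothetical, single-threaded sketch of stacked behaviors; it does not use Akka.

```python
class Gate:
    """Toy actor whose message handler changes as messages arrive."""

    def __init__(self):
        self.log = []
        self._behaviors = [self._closed]  # the behavior stack; top is active

    def become(self, behavior):
        self._behaviors.append(behavior)  # push a new behavior

    def unbecome(self):
        if len(self._behaviors) > 1:
            self._behaviors.pop()         # pop back to the previous behavior

    def receive(self, message):
        self._behaviors[-1](message)

    def _closed(self, message):
        if message == "open":
            self.become(self._open)
        else:
            self.log.append(f"rejected {message}")

    def _open(self, message):
        if message == "close":
            self.unbecome()
        else:
            self.log.append(f"passed {message}")


if __name__ == "__main__":
    gate = Gate()
    for message in ["x", "open", "x", "close", "y"]:
        gate.receive(message)
    print(gate.log)
```

The same message produces different results depending on which behavior is on top of the stack, with no conditional on a state flag.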
  • 69. Akka: Failure Management Source: “Introducing Akka”, Jonas Bonér In Java/C/C++ etc. • Single thread of control • If that thread blows up you are screwed • The only option is explicit error handling within your single thread • Errors isolated within thread; other threads have no clue • Results in tons of defensive code, scattered throughout the codebase and entangled in your business logic
  • 70. Akka: Supervise Operation 70Source: “Introducing Akka”, Jonas Bonér 4. Supervise • Manage another Actor’s failures (or the person sitting next to you) • Let actors monitor (supervise) each other for failure, then respond • Notification sent to Actor’s supervisor if failure occurs. • Neat separation of processing from error handling
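Supervision separates processing from error handling: the child contains no defensive code, and the supervisor decides what a failure means. The sketch below is a hypothetical, synchronous simplification (restart with fresh state, escalate after too many restarts); Akka's supervision strategies are richer and asynchronous.

```python
class Summer:
    """Hypothetical child: keeps a running total of integer messages."""

    def __init__(self):
        self.total = 0

    def receive(self, message):
        self.total += int(message)  # raises ValueError on bad input
        return self.total


class Supervisor:
    """Restarts a failed child; escalates once the restart budget is spent."""

    def __init__(self, child_factory, max_restarts=3):
        self._child_factory = child_factory
        self._max_restarts = max_restarts
        self.restarts = 0
        self.child = child_factory()

    def deliver(self, message):
        try:
            return self.child.receive(message)
        except Exception:
            if self.restarts >= self._max_restarts:
                raise  # escalate to whoever supervises this supervisor
            # Let it crash: replace the child with a fresh instance.
            self.restarts += 1
            self.child = self._child_factory()
            return None


if __name__ == "__main__":
    supervisor = Supervisor(Summer)
    for message in ["1", "2", "oops", "5"]:
        print(message, supervisor.deliver(message))
```

After the bad message the total starts again from zero, which shows the cost of a restart: in-memory state is lost unless it is persisted elsewhere.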
  • 71. Akka Streaming Source: Akka Documentation • Akka Streams API is completely decoupled from Reactive Streams interfaces • Implementation details to pass stream data between processing stages • Akka Streams is interoperable with any conformant Reactive Streams implementation • Principles • All features explicit in the API • Compositionality; combined pieces retain the function of each part • Model of domain of distributed bounded stream processing • Reactive Streams -> JDK9 Flow APIs
  • 72. Akka Streaming Source: Akka Documentation • Immutable building blocks / blueprints enabled for libraries, include • Source - something with exactly one output stream • Sink - something with exactly one input stream • Flow - something with exactly one input and one output stream • BidiFlow - something with exactly two input streams and two output streams that conceptually behave like two Flows of opposite direction • Graph - a packaged stream processing topology that exposes a certain set of input and output ports, characterized by an object of type Shape. • Built in backpressure capability • No stage can push downstream unless it received a pull beforehand • Difference between error and failure • Error is accessible within the stream as a data element (signaled via onNext) • Failure means the stream itself has collapsed (signaled via onError). • Want failure to propagate faster than data (essential, to deal with backpressure) • Data elements emitted before a failure can still be lost if the onError overtakes them • Recovery element acts as bulkhead to confine a stream collapse to a given region of stream topology, to isolate outside from impact of collapsed region (e.g. buffered elements)
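The rule that no stage pushes downstream without a prior pull can be illustrated with Python generators, which are pull-driven by construction. This is an analogy only, not the Akka Streams or Reactive Streams API; the `numbers` source, the squaring flow and the `slow_sink` function are hypothetical.

```python
import itertools


def numbers(produced):
    """Unbounded source; records each element it actually emits."""
    n = 0
    while True:
        produced.append(n)
        yield n          # execution pauses here until downstream pulls again
        n += 1


def squares(stream):
    """A flow stage: one input, one output, pulls only when pulled."""
    for value in stream:
        yield value * value


def slow_sink(stream, demand):
    """A sink that signals demand for a fixed number of elements."""
    return list(itertools.islice(stream, demand))


if __name__ == "__main__":
    produced = []
    print(slow_sink(squares(numbers(produced)), demand=3))
    print(produced)
```

The source is unbounded, yet it emits only as many elements as the sink requested, so a slow consumer cannot be overrun and no buffer grows without limit.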
  • 73. Akka HTTP 73Source: Akka Documentation • Built atop Akka Streams • Can expose an incoming connection in form of a Source instance • To start listening on network with Akka HTTP, create a Route and bind it to a port (similar syntax to Spray). • Backpressure on source? Akka HTTP stops consuming data from network; eventually leads to 0 TCP window – applying backpressure to sending party itself (e.g. a sensor) • Rules • Libraries shall provide their users with reusable pieces, i.e. expose factories that return graphs, allowing full compositionality • Avoid destruction of compositionality. • Express functionality of a library such that materialization can be done by user outside of library’s control. • Libraries may optionally and additionally provide facilities that consume and materialize graphs • Allows a library to provide convenience “sugar” for use cases
  • 74. • Akka Streams Integration • https://github.com/akka/alpakka • http://developer.lightbend.com/docs/alpakka/current/ • Adds interesting capabilities to Akka Streams • Modern alternative to Apache Camel (EIP implementation) • Camel en ze Akka • Community driven, focused on connectors to external libraries, integration patterns and conversions. • ”A call to arms” • https://github.com/akka/alpakka/releases/tag/v0.1 Alpakka 101 74
  • 75. The Data Lake for Analytics, App Dev “Lake Wobegon, the little town that time forgot and the decades cannot improve.” - Garrison Keillor 75
  • 76. The Data Lake 76 • Note: This section will be expanded for DevNexus.
  • 77. The Data Lake • The Data Lake is where a copy of much of the data from source systems “ends up”, via Fast Data, etc. • Easily accessible, massive repository of data built on commodity hardware (or Cloud). • Data is not stored in a way that is optimized for data analysis (S3) • Data Lake retains all attributes • Beware the Data Lake fallacy: http://www.gartner.com/newsroom/id/2809117 • Let’s combine all this data to drive increased information sharing, usage, while reducing cost through consolidation / tech simplification. • Does it really work? • Has the ideal of Enterprise-wide data management been realized? • Deriving value from data still in hands of business end user (enter: Fast Data Platforms)
  • 78. The Dark Side of the Data Lake Source: “Gartner Says Beware of the Data Lake Fallacy” • Many companies tend to vacuum data into Hadoop for later use • Many companies use overlapping tools within the same ecosystem, that do not interoperate • Data lakes ignore how/why data is used, governed, defined and secured. • Does this sound like a good solution? • Data Lake solves old problem of siloing data. Great, so now it’s all commingled. • Federated query? AWS Athena? Why move data if not necessary? • Inability to quantitatively measure data quality • Accepts any data without governance or oversight • Accepts any data without metadata (description) • Inability to share lineage of findings by other analysts to share found value • Security, access control (and tracking of both) • Data Ownership, Entitlements? • Tenancy? • Regard for regulatory controls, compliance issues? • What to do?
  • 79. More to Come… 79Source: “Gartner Says Beware of the Data Lake Fallacy” • At DevNexus 2017 • Thurs, Feb 23 @ 2:30pm • http://devnexus.com/s/devnexus2017/presentations/17212
  • 80. Presentation Improvements for DevNexus “Build something 100 people love, not something 1 million people kind of like.” - Brian Chesky 80
  • 81. • More diagrams, fewer words • Refine, refine, refine • Mix in coding examples • Improve contrast between older architectures and reactive, for the Enterprise • Content that contrasts different streaming options (Akka, Spark, Kafka) • Add specific performance details • Incorporate additional, interesting content (incl. Data Lake related) Planned Improvements 81
  • 82. Questions? “I'm sorry, if you were right, I'd agree with you.” - Robin Williams 82
  • 83. • The Reactive Manifesto (http://www.reactivemanifesto.org/) • Chaos Monkey? Use Linear Fault Driven Testing instead. • https://www.lightbend.com/blog/architect-reactive-design-patterns • http://www.infoworld.com/article/2608040/big-data/fast-data--the-next-step-after-big-data.html • https://www.lightbend.com/blog/lessons-learned-from-paypal-implementing-back-pressure-with-akka-streams-and-kafka • https://kafka.apache.org • http://www.slideshare.net/ducasf/introduction-to-kafka • http://www.slideshare.net/SparkSummit/grace-huang-49762421 • http://www.slideshare.net/HadoopSummit/performance-comparison-of-streaming-big-data-platforms • https://github.com/akka/alpakka • http://developer.lightbend.com/docs/alpakka/current/ • https://github.com/akka/alpakka/releases/tag/v0.1 • http://www.slideshare.net/LisaHua/spark-overview-37479609 • http://spark.apache.org/ • https://www.realdbamagic.com/intro-to-apache-spark-2016-slides/ • http://www.slideshare.net/gene7299/akka-actor-presentation • http://www.slideshare.net/jboner/introducing-akka • http://bit.ly/hewitt-on-actors • http://tech.measurence.com/2016/06/01/a-dive-into-akka-streams.html • https://infocus.emc.com/rachel_haines/is-the-data-lake-the-best-architecture-to-support-big-data/ Resources 83