Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

@wip
Building Reactive Fast Data & the Data Lake
with Akka, Kafka, Spark
AJUG, Jan 17, 2017
Todd Fritz
Cox Automotive, Inc.

This presentation is a draft of what will be presented, next month, at DevNexus.
“Sneak peek”
Reactions and questions may influence evolution of content
Disclaimer

tafritz@outlook.com
Todd.Fritz@coxautoinc.com
www.linkedin.com/in/tfritz
http://www.slideshare.net/ToddFritz
https://github.com/todd-fritz

License: CC BY-SA 3.0
• Senior Solutions Architect @ Cox Automotive, Inc.
• Strategic Data Services
• The opinions contained herein may not represent my employer, but I believe they
should.
• Background is building platforms, middleware, MoM, EIP, EDA, etc
• DevOps mentality
• Exposed to many environments, technologies, people
• Life-long learner and always curious
• Novice bass player
• Scuba diver
About Me

DevNexus 2015
http://www.slideshare.net/ToddFritz/2015-03-11_Todd_Fritz_Devnexus_2015
Great Wide Open - Atlanta (April 3, 2014)
http://www.slideshare.net/ToddFritz/2014-04-03legacytocloud
AJUG (April 15, 2014)
http://www.slideshare.net/ToddFritz/2014-april-15-atlanta-java-users-group
Video - https://vimeo.com/94556976
Previous Presentations

• Forward
• The Briefest History
• Background: Reactive Systems, Patterns, Implementations
• The Enterprise
• Fast Data
• The Data Lake for Analytics, App Dev
• Presentation Improvements Planned for DevNexus
• Questions
• Resources
Agenda

Forward
“Our greatest glory is not in never failing,
but in rising every time we fall.”
-Confucius

• Why is Reactive Important?
• Reactive Systems and Programming != Reactive Management
• Reactive underpins every use case, every business capability, every product feature
• Tendency for companies to survey market and select products to match perceived
business need
• Importance of vision, governance, tenancy, entitlements, security
• Both Process and Technology
• Use Case: Build a system that can scale from thousands to millions of users, handling
millions to billions of messages
• Real time data, for application components, middleware, data processing,
analytics
• A journey of innovation and successive refinement
Onward

The Briefest History
History is the sum total of things that could have been avoided.
- Konrad Adenauer

• “Reactive” is not new
• Underlying principles go back almost 50 years, to the days of punch cards
• Erlang
• Built to scale, handle extremely high volume
• Extensive use in Telco for decades; many billions of messages
• Actors
• Bedrock is messaging
• We’ve been using this technique for decades, via many technologies
• Improvements over time around component isolation, decoupling to benefit scalability and concurrency
The Briefest History

Reactive Systems, Patterns, Implementations
“People who think they know everything
are a great annoyance to those of us who do.”
- Isaac Asimov

• The Reactive Manifesto
• http://www.reactivemanifesto.org
• Many organizations independently building software to new, and
similar, patterns
• Increasing pressures to simplify, scale, innovate and improve customer experience
• Increasing proliferation and interoperability of system environments and connected devices
• More data with contextual use cases
• Yesterday’s architectures just don’t cut it
• Need more flexible, resilient, robust systems
• Solution through evolution
• Spawn of Actors (Akka, Erlang)
• Good starting point: https://www.lightbend.com/blog/architect-reactive-design-patterns
Reactive Systems

“…“Reactive” is a set of design principles for creating cohesive systems. It’s a way of
thinking about systems architecture and design in a distributed environment where
implementation techniques, tooling, and design patterns are components of a larger
whole.”*
“A Reactive System is based on an architectural style that allows … multiple individual
services to coalesce as a single unit and react to its surroundings while remaining aware
of each … scale up/down, load balance and even ...” * (proactive steps)
Individual components may qualify as reactive, but combining them does not guarantee a Reactive
System
What is Reactive? A Set of Design Principles
* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang

1. Reactive Systems (architecture, design)
2. Reactive Programming (declarative event-based)
3. Functional Reactive Programming (FRP)
NOTE: The inventor of this term, Conal Elliott, says this term is misapplied today (e.g. RxJS, RxJava,
Bacon.js, etc). Refer to his presentation (July 22, 2015) for the details:
https://begriffs.com/posts/2015-07-22-essence-of-frp.html
Reactive Begets*

• Reactive Programming != Functional Reactive Programming
• Subset of Asynch Programming
• New information drives logic flow vs. control flow driven by thread-of-execution
• Avoids resource contention (Amdahl’s Law) that impedes scalability.
• Decompose into multiple steps that are asynch and nonblocking
• Combine into a composed workflow
• Reactive Systems very rarely block
• Reactive API libraries are either declarative (functional composition,
combinators) or callback-based (attached to events, executed during dataflow
chain) with stream-based operators (windowing, triggers, etc)
• Reactive programming is event-driven
• Reactive systems are message driven
• Wait? What? More about this distinction in a few slides.
Reactive Programming*
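
A minimal sketch of the "decompose into asynch, nonblocking steps, then combine into a composed workflow" idea, using only the JDK's CompletableFuture. The class, the step names (fetchPriceCents, applyTax) and the pricing rule are hypothetical and exist only for illustration.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical two-step workflow. Each step is asynchronous, and the steps are
// composed into one pipeline instead of being sequenced by a blocked thread.
class ComposedWorkflow {

    // Step 1 (hypothetical): look up a base price in cents on another thread.
    static CompletableFuture<Integer> fetchPriceCents(String sku) {
        return CompletableFuture.supplyAsync(() -> sku.length() * 100);
    }

    // Step 2 (hypothetical): add 10% tax.
    static int applyTax(int cents) {
        return cents + cents / 10;
    }

    // Composition: the arrival of each result drives the next stage.
    static CompletableFuture<String> quote(String sku) {
        return fetchPriceCents(sku)
                .thenApply(ComposedWorkflow::applyTax)
                .thenApply(cents -> sku + ": " + cents + " cents");
    }

    public static void main(String[] args) {
        // join() blocks only here, at the edge of the program, to print the result.
        System.out.println(quote("ABC").join());
    }
}
```

Nothing inside quote() waits on a thread; the only blocking call is the join() at the outer edge.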

• Reactive Programming is related to Dataflow Programming
• Both emphasize flow of data vs. flow of control
• Examples
• Futures / Promises
• (Reactive) Streams – unbounded data processing. Asynch, non-blocking, back-pressured
pipelines connecting sources and destinations.
• Dataflow Variables – single assignment variables (AKA a cell in Excel) whereby a value
change can trigger dependent functions to produce new values (state)
• Technologies that do this, include
• Akka Streams
• RxJava
• Vert.X
• Reactive Streams Specification
• The standard for interoperability amongst Reactive Programming libraries on the JVM
• “…an initiative to provide a standard for asynchronous stream processing with
non-blocking back pressure.”
About Dataflow Programming*
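
A small sketch of non-blocking back pressure, assuming JDK 9 or later, where the Reactive Streams interfaces are available as java.util.concurrent.Flow. The subscriber asks for one element at a time, so the publisher can never run ahead of it. BackPressureDemo, OneAtATime and squares are illustrative names, not part of any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Flow;
import java.util.concurrent.SubmissionPublisher;

class BackPressureDemo {

    // Subscriber that signals demand one element at a time.
    static class OneAtATime implements Flow.Subscriber<Integer> {
        final List<Integer> seen = new ArrayList<>();
        final CountDownLatch done = new CountDownLatch(1);
        private Flow.Subscription subscription;

        @Override public void onSubscribe(Flow.Subscription s) {
            subscription = s;
            s.request(1);              // initial demand: exactly one element
        }
        @Override public void onNext(Integer item) {
            seen.add(item * item);     // the "processing" step
            subscription.request(1);   // ask for the next element only when ready
        }
        @Override public void onError(Throwable t) { done.countDown(); }
        @Override public void onComplete() { done.countDown(); }
    }

    // Publishes 1..n and returns the squares the subscriber collected.
    static List<Integer> squares(int n) {
        OneAtATime subscriber = new OneAtATime();
        try (SubmissionPublisher<Integer> publisher = new SubmissionPublisher<>()) {
            publisher.subscribe(subscriber);
            for (int i = 1; i <= n; i++) {
                publisher.submit(i);   // blocks only if the subscriber's buffer is full
            }
        }                              // close() leads to onComplete after buffered items
        try {
            subscriber.done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return subscriber.seen;
    }

    public static void main(String[] args) {
        System.out.println(squares(4));
    }
}
```

The demand signal (request) travels upstream while data travels downstream; that pairing is what the specification standardizes across libraries.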

• Notable benefits
• Increased (efficient) utilization of compute resources (incl. multi-core)
• Increased performance via serialization reduction
• Amdahl’s Law
• Neil Gunther’s Universal Scalability Law quantifies the effects of contention and
coordination in concurrent, distributed systems. It explains how the cost of
coherency in a system can lead to negative returns as new resources are added
to the system.
• Productivity. Reactive libraries handle complexities such as dealing with asynch,
nonblocking compute, IO, coordination between components.
• Great for creating components that are composed to workflows, that are
back-pressured, scalable, high-performance
Why Reactive Programming*
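
For reference, the two laws named above can be written out as follows. The symbols are the conventional ones and do not appear on the slide: p is the parallelizable fraction of the work, N the number of processors or nodes, α the contention coefficient and β the coherency (crosstalk) coefficient.

```latex
% Amdahl's Law: speedup on N processors when a fraction p of the work is parallelizable
S(N) = \frac{1}{(1 - p) + \frac{p}{N}}

% Worked example: p = 0.95, N = 8
S(8) = \frac{1}{0.05 + 0.11875} = \frac{1}{0.16875} \approx 5.93,
\qquad \lim_{N \to \infty} S(N) = \frac{1}{1 - p} = 20

% Universal Scalability Law: relative capacity with contention \alpha and coherency cost \beta
C(N) = \frac{N}{1 + \alpha (N - 1) + \beta N (N - 1)}

% With \beta > 0, capacity peaks and then declines as more resources are added
N^{*} = \sqrt{\frac{1 - \alpha}{\beta}}
```

The β term is what produces the "negative results" mentioned above: past N*, each added node costs more in coordination than it contributes.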

• Reactive Programming - event-driven
• Computation via dataflow chains
• Events are not directed; “addressable event sources”
• Events are mere facts that can be observed
• Emitted by changes in state (state machine)
• Listeners attach to event sources and react to the events they emit
• Emitted locally
• Reactive Systems - message-driven
• Basis of communication across network/components; prefer asynch
• Sender/Receiver are decoupled
• Focus on resilience and elasticity via the communication/coordination inherent to distributed systems
• Long-lived, addressable components
• Waits for messages to be sent, then reacts to them
• Messages are “directed”
• A message has a clear destination; “addressable recipient”
Event-Driven vs. Message-Driven*

• Common pattern is to use Messaging as means to communicate Events across
network/components
• Events within Messages
• Examples
• AWS Lambda, distributed streaming (Spark Streaming, Flink, Kafka, Akka Streams, Pub/Sub)
• Pros
• Abstraction and simplicity
• Cons
• Lose some control
• Messaging forces developers to deal with complex realities of distributed programming
• Failure detection
• Message delivery contracts (dupes, retry, ordering)
• Consistency guarantees
• Can’t hide behind “leaky” abstractions that pretend a network does not exist (EJB, XA, RPC, etc)
Event-Driven & Message-Driven*

Reactive Systems: Characteristics*
Responsive
• Low latency
• Consistent & predictable
• Foundation of Usability
• Essential for Utility
• Expose problems quickly
• Happier Customers
Resilient
• Responsive through failure
• Resilient (H/A) through replication
• Failure isolated to Component (bulkheads)
• Delegate recovery to external Component (supervisor hierarchies)
• Client does not handle failures
Elastic
• Responsive as workload varies
• Devoid of bottlenecks or hot spots
• Perf metrics drive predictive / reactive autonomic scaling
• Favor commodity infrastructure
Message Driven
• Asynchronous
• Loose Coupling
• Isolation
• Location Transparency
• Non Blocking
• Recipients can passivate
• Message Passing
• Flow Control
• Exception Management
• Elasticity
• Back Pressure
* Source: The Reactive Manifesto

Reactive Systems: Patterns
Architecture Pattern:
Single Component
• Component does one thing, fully and well
• Single Responsibility Principle
• Max cohesion, min coupling
Architecture Pattern:
Let-it-Crash
• Prefer a full component restart to complex internal error
handling
• Design for failure
• Leads to more reliable components
• Avoids hard to find/fix errors
• Failure is unavoidable
Implementation Patterns
Circuit Breaker
• Protect services by breaking connections during
failures
• Borrowed from electrical engineering (EE)
• Protects clients from timeouts
• Allows time for service to recover
Source: https://www.lightbend.com/blog/architect-reactive-design-patterns
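
A minimal sketch of the Circuit Breaker pattern described above, in plain Java. The class and its parameters (maxFailures, resetTimeoutMillis) are hypothetical, not a library API. It fails fast while open and lets one trial call through after the timeout.

```java
import java.util.function.Supplier;

// Hypothetical circuit breaker, for illustration only.
class CircuitBreaker {
    private final int maxFailures;
    private final long resetTimeoutMillis;
    private int failures = 0;
    private long openedAt = -1;          // -1 means the breaker has never tripped or was reset

    CircuitBreaker(int maxFailures, long resetTimeoutMillis) {
        this.maxFailures = maxFailures;
        this.resetTimeoutMillis = resetTimeoutMillis;
    }

    // Note: the lock is held while the service runs; acceptable for a sketch only.
    synchronized <T> T call(Supplier<T> service, T fallback) {
        long now = System.currentTimeMillis();
        if (openedAt >= 0 && now - openedAt < resetTimeoutMillis) {
            return fallback;             // open: fail fast and give the service time to recover
        }
        try {
            T result = service.get();    // closed, or half-open trial call after the timeout
            failures = 0;
            openedAt = -1;               // success closes the breaker
            return result;
        } catch (RuntimeException e) {
            failures++;
            if (failures >= maxFailures) {
                openedAt = now;          // trip (or re-trip) the breaker
            }
            return fallback;
        }
    }

    synchronized boolean isOpen() {
        return openedAt >= 0 && System.currentTimeMillis() - openedAt < resetTimeoutMillis;
    }
}
```

While the breaker is open, callers get the fallback immediately instead of waiting for a timeout, which is the client-protection benefit listed above.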
Implementation Patterns
Saga
• Divide long-lived, distributed transactions into
quick local ones with compensating actions for
recovery.
• Compensating txns run during saga rollback
• Concurrent sagas can see intermediate state
• Sagas need to be persistent to recover from
hardware failures. Save points.
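
A minimal sketch of the Saga idea: run local steps in order, and on failure run the compensating actions of the completed steps in reverse. The Saga and Step classes are hypothetical. The log here lives in memory only; a real saga would persist it (the "save points" above), which this sketch omits.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hypothetical saga runner, for illustration only.
class Saga {

    // One quick local transaction plus the compensating action that undoes it.
    static class Step {
        final String name;
        final Runnable action;
        final Runnable compensation;

        Step(String name, Runnable action, Runnable compensation) {
            this.name = name;
            this.action = action;
            this.compensation = compensation;
        }
    }

    // Returns true if every step succeeded; otherwise rolls back and returns false.
    static boolean run(List<Step> steps, List<String> log) {
        Deque<Step> completed = new ArrayDeque<>();
        for (Step step : steps) {
            try {
                step.action.run();
                log.add("done:" + step.name);
                completed.push(step);
            } catch (RuntimeException e) {
                log.add("failed:" + step.name);
                while (!completed.isEmpty()) {      // most recent step is compensated first
                    Step undo = completed.pop();
                    undo.compensation.run();
                    log.add("compensated:" + undo.name);
                }
                return false;
            }
        }
        return true;
    }
}
```

Between a step and its compensation, other work can observe the intermediate state, which is the visibility caveat noted above.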

• Still need to use the brain and learn
• Not going to be able to build reactive systems with just paper
certifications
• No substitute for experience!
• The “new” is based on evolution (old techniques and patterns)
• Paradigms and patterns not isolated to components or
technology
Sound Complicated?
Source: The Reactive Manifesto

The Enterprise
“Even if you are on the right track,
you’ll get run over if you just sit there.”
- Will Rogers

• We all work for 1..n companies that have a size, complexity and age
• Young companies (e.g. start-ups)
• Less legacy overhead
• More able to adopt newer technology; attempting to innovate, disrupt, find a niche
• Less able to leverage expensive, enterprise-class solutions; value through IP or a unique product
• More likely to build cloud native (various reasons)
• Staff typically has larger sphere of influence
• Sometimes filled with Unicorns
• Mid-size companies
• May have legacy overhead
• Complexity if growth through acquisition vs. growth through product evolution
• Strategic use of Enterprise products
• May be cloud native, or hybrid
• Increased division of labor, defined roles, paying customers to keep happy
The Enterprise

• Generalizing large companies
• Probably where the terms “technical debt” and “culture debt” came from
• Certainly has legacy overhead, very complex environments
• Fear of change, less room to fail – backed by valid business reasons
• Likely has complexity due to both acquisition and product evolution
• Purchase a company, adding a “different” tech stack
• Money to fully absorb? Risk? Or purchase to evolve existing lines of business?
• Use of Enterprise products, likely prefers “supported” flavors of open source
• A blend of on-premises, hybrid, cloud
• More politics and cats to herd
• Slower to see value from new technology
• Matrixed division of labor, defined roles, paying customers to keep happy
• Transformative change is more difficult; planning, budgeting, process, management structure
The Enterprise

• An Enterprise has a variety of software applications
• Customer facing, revenue generating
• “Supporting” applications for operation, development, maintenance activities
• Datamarts used for analytics
• Data typically moved from system of record, into one or more centralized data
“hubs”
• OLAP / Data Warehouses
• Data Lake, common use of Hadoop
• Older application architectures focused on the application, not on enterprise
interoperability
• Drop in ETL, ESB, SOA, iPaaS to do so ($$$)
• Perhaps refactor service layers into more modern middleware
• Enterprises becoming “real time”
• The old, batch-oriented, application silos are too complex and just can’t do this well
Breaking Down the Enterprise

• (Not a complete list)
• Products (and supporting software) operate in Real Time
• Opens door to do things to increase customer satisfaction/retention
• Save money, become more agile (speed to market) – productivity enhancer
• Less reliant on batch or pull (streaming batch to interop legacy)
• Data flows to systems that need it, where it is acted on in real time
• More powerful data processing
• Easier to manage; reduced TCO
• Decoupled, enables greater interoperability, CI/CD (Infrastructure as a Service)
• Leverage cloud solutions, auto scaling, resiliency, high availability
• Analysts get to use latest tools, oftentimes in the cloud
• Enables autonomic automation
How the Enterprise Benefits from Reactive

• Friends don’t let their friend-analysts do map-reduce
• Analysts are not highly technical (some do code in R)
• Larger companies may have communities of SAS users
• They just want to connect a friendly BI tool (e.g. Tableau) to run their SQL
against data.
• Can’t expect people to switch careers or skill sets to accommodate new
technology.
And Again, those Analysts

Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam

Fast Data
“If you are in a spaceship that is traveling at the speed of light,
and you turn on the headlights,
does anything happen?”
- Steven Wright

• Big Data -> Fast Data
• Constant stream of inbound data, at incredible rates
• The old way is to store it, then analyze
• Say hello to Map Reduce!
• Hadoop did reinvent how to process petabytes of data on commodity hardware
• Why not analyze and act on data as it is received?
• Using (proven) technology built to scale to massive volume?
• Akka, Kafka, Spark, Flink
• Use case reminder: millions of concurrent actors handling billions of
messages, in real-time
• Fast Data means acting on data in real-time, and sending it to destinations
• Data Lake
• Other systems
Fast Data

Time is Gold
* Source: “Towards Benchmarking Modern Distributed Systems”, Grace Huang (Intel)

• Fast Data is the obvious future
• Many of us have been using these techs for years
• The easy part
• Build new systems using Reactive patterns, architectures and technology
• You start ups have it easy…
• The ability of a company to adopt disruptive architectures and patterns is inversely related to its size
• In a real world where SDLC costs money
• How to reconcile with legacy architectures and implementations?
• How to interoperate with disparate systems with varying capabilities?
• Does it make sense to refactor “what works”?
• Are Frankenstein / Stove-piped solutions the norm, or (partially) a matter of
perspective, when viewed in the rear-facing mirror of innovation?
• What is a cost efficient approach to adopt and adapt to Reactive?
Reality Check

• Enough background, time to talk tech
• For this discussion, the reference platform is Lightbend’s Fast Data
platform
• (A few deviations mentioned)
• Core techs will be
• Kafka - open source or Confluent
• Spark - open source or Databricks (a nice managed service in AWS)
• Akka (and Akka-HTTP) (open source or Lightbend stack)
• Alpakka
Now, the Fun Stuff

• JMS -> AMQP -> Kafka
• Streaming platform
• Process messages when provided
• Fault-tolerant storage
• Pub/Sub capability
• Common use cases
• Data pipelines (messaging) between a source and destination
• Real-time action / transformation
• Metrics, weblogs, stream processing, event sourcing
• A commit log
• Spread across a cluster
• Record streams stored in topics
• Each record has a key, value, timestamp
• Each topic has offsets and a retention policy
Kafka 101
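
To make the commit-log model concrete, here is an in-memory stand-in for a single topic partition. It is not the Kafka client API; TopicLog, LogRecord, append and readFrom are hypothetical names, and retention and clustering are left out. It shows only that records carry a key, value and timestamp, get a monotonically increasing offset, and can be re-read from any earlier offset.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical append-only log; NOT the Kafka API.
class TopicLog {

    static class LogRecord {
        final long offset;
        final String key;
        final String value;
        final long timestamp;

        LogRecord(long offset, String key, String value, long timestamp) {
            this.offset = offset;
            this.key = key;
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    private final List<LogRecord> records = new ArrayList<>();

    // Append-only write; returns the offset assigned to the new record.
    synchronized long append(String key, String value, long timestamp) {
        long offset = records.size();
        records.add(new LogRecord(offset, key, value, timestamp));
        return offset;
    }

    // Consumers track their own offset, so reading from an earlier offset is a replay.
    synchronized List<LogRecord> readFrom(long offset, int max) {
        int from = (int) Math.min(Math.max(offset, 0), records.size());
        int to = Math.min(from + Math.max(max, 0), records.size());
        return new ArrayList<>(records.subList(from, to));
    }
}
```

Because reading never removes anything, several consumers can each keep their own position in the same log, and any of them can rewind.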

Kafka Performance
Source: “Introduction to Kafka”, Ducas Francis
30k/s 1.8M/min 108M/hr 2.7B/day

• Producer API
• To publish a record to Kafka
• Consumer API
• To subscribe to a topic(s)
• Consumer groups
• Handle records pushed to topics
• Streams API
• Stream processing
• Consume from input streams A1..An, do processing,
publish to output streams B1..Bn
• E.g. aggregation, joining
• Connector API
• To build custom Producers/Consumers
• Purpose built integration components
Kafka APIs

• > Messaging systems (RabbitMQ, AMQ, etc)
• Just don’t scale well and become complex
• Need to use other abstractions for batching
• Lacks replay ability (reset offset, etc)
• > Log forwarders, e.g. Scribe or Flume
• Push architecture
• High performance
• Scales well
• Sensitive to business logic in endpoints, because needs to push data fast
• Assumes data is pushed to large sink (e.g. Hadoop)
• Oh, and then queries later. So much for real-time.
• Supports Polyglot
• Python, Go, .NET, Node.js, C/C++, etc
• Robust ecosystem
• https://cwiki.apache.org/confluence/display/KAFKA/Ecosystem
Why Use Kafka?

• “…a fast and general engine for large-scale data processing.”
• Hadoop / YARN
• Map-Reduce – mapper, reducer, disk IO, queue, fetch resource
• Great for parallel file processing of large files
• Synchronization barrier during persistence
• Spark
• In-memory data processing
• Interactive/iterative data query
• Better supports more complex, interactive (real-time) apps
• 100x faster than Hadoop MR (in memory), 10x faster on disk
• Microbatching
Spark: What is it?
Source: spark.apache.org

• Combine SQL, streaming, complex analytics
• SQL
• Dataframes
• MLlib
• GraphX
• Spark Streaming
Spark

• Run it
• Standalone
• Hadoop
• Mesos
• Cloud
• Access data
• HDFS
• Cassandra
• HBase
• S3
• Hive
• Tachyon, and more
• Write code
• Scala
• Java
• Python, Clojure, R
• Interactive query shell (notebooks)
Spark Execution Modes

• Slow due to replication, serialization, filesystem IO
• Inefficient use cases:
• Iterative algorithms (ML, Graphs, Network analysis)
• Interactive / Ad-hoc data mining (R, Excel, Searching, analyst queries)
Spark: Hadoop MR
Source: “Spark Overview”, Lisa Hua

Spark: Hadoop & Spark (in Hadoop)
Source: “Spark Overview”, Lisa Hua
• Spark has additional features, such as interop with S3 storage

Spark: A Clustered Application

Spark: Execution Terminology
• Job – a set of tasks to be executed as a result of an action
• Stage – a set of tasks in a job that can be run in parallel
• Task – an individual unit of work sent to a single executor

Spark: SQL
• Spark SQL is a module for structured data querying
• Supports basic SQL and HiveQL
• Can act as distributed query engine via JDBC/ODBC, or CLI

Spark: Dataframe, Datasets, RDDs
• Dataframe is a distributed collection of data organized into named columns
• Analogous to a relational table, or data frame in R/Python (with richer optimizations)
• Dataset was added in 1.6 to provide the benefits of RDDs together with Spark SQL’s
execution engine
• Build datasets from JVM objects and then manipulate with functional transformations

• Scalable, high-throughput, fault tolerant processing of real time streams and use
cases.
• Micro-batched. Think of it in terms of EDA (e.g. Esper): processing over “windows” (etc.) of
events vs. handling a single message/event.
Spark Streaming
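
To illustrate what "micro-batched" means, the sketch below groups arrival timestamps into fixed intervals, so that work happens once per batch rather than once per event. This is plain Java, not the Spark API; MicroBatcher and batch are hypothetical names, and the interval is assumed to be positive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical micro-batching helper; NOT the Spark Streaming API.
class MicroBatcher {

    // Assigns each arrival time (ms) to the batch interval that contains it and
    // returns the batches in time order, keyed by the start of the interval.
    static Map<Long, List<Long>> batch(List<Long> arrivalTimesMillis, long intervalMillis) {
        Map<Long, List<Long>> batches = new TreeMap<>();
        for (long t : arrivalTimesMillis) {
            long start = Math.floorDiv(t, intervalMillis) * intervalMillis;
            batches.computeIfAbsent(start, k -> new ArrayList<>()).add(t);
        }
        return batches;
    }

    public static void main(String[] args) {
        // Four events with a 1-second batch interval fall into three batches.
        System.out.println(batch(List.of(100L, 900L, 1000L, 2500L), 1000L));
    }
}
```

The trade-off is latency: an event waits until its interval closes before anything acts on it, which is why batch processing time and scheduling delay matter.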

• Can merge inbound data with historical data
• Write code in Scala, Python, Java, etc
• Lower level access via DStream, obtained via StreamingContext.
• Create RDDs from the DStream
• Two primary metrics to monitor and tune:
• Processing time (per batch)
• Scheduling delay (processed upon arrival?)
• Use Kryo for serialization
Spark Streaming

Spark Streaming - comparison with other techs

Building a Spark Application
• Scala or Java, compiled to JAR (in turn uploaded to worker nodes)
• Running a spark app is as easy as pie (submit options can expand the
experience):
$ spark-submit myAwesomePythonScript.py theFileURL
$ spark-submit --class SkyNetInScala skynet2017.jar theFileURL
• Spark uses log4j (beware)
• YARN can aggregate worker logs

Akka, who uses it?
Source: “Akka Actor Introduction”, Gene

Akka
Source: “Introducing Akka”, Jonas Bonér
• Vision
• Simpler: Concurrency, Scalability, Fault-Tolerance
• With a single unified
• Programming Model
• Managed Runtime
• Open Source Distribution
• Manage System Overload (backpressure)
• Scale up & Scale out
• Program to a higher level
• No more shared state, state visibility, threads, locks, concurrent collections, thread
notifications
• Low-level concurrency is built into the plumbing; your code becomes a simple workflow that just deals with
messages
• Increases CPU utilization, lowers latency, high throughput, scalable!
• Superior, Proven model to detect and recover from errors

Akka: Perfect for the Cloud
• Elastic and dynamic
• Fault tolerant & self healing (autonomic)
• Adaptive load-balancing, cluster rebalancing & Actor migration
• Build loosely coupled systems that can dynamically adapt at runtime

Akka 101
• Akka’s unit of code is called an Actor
• The vehicle to create concurrent, scalable, fault-tolerant apps atop the fabric
• Encapsulates code like servlets or session beans; policy decisions separated
from biz logic
• Actors have been around since 1973, and if you’ve ever used a phone, that
software helped make it work. 9 nines of uptime!
• Think of an actor as a VM in the cloud (it isn’t, but)
• Encapsulated, decoupled
• Managing own memory and behavior
• Communicates asynchronously with non-blocking messages
• Elastic – grow/shrink on demand
• Hot deploy, change runtime behavior

Akka: How to use Actors?
• Alternative to:
• A thread
• An object instance or component
• Callback or listener
• Singleton or service
• Router, load-balancer or pool
• Session bean or MDB
• Out of process service
• A Finite State Machine (FSM)

Akka: What is it?
Source: http://bit.ly/hewitt-on-actors
• Carl Hewitt’s definition
• Fundamental unit of computation that embodies:
• Processing
• Storage
• Communication
• 3 Axioms – When an actor receives a message it can:
• Create new Actors
• Send messages to Actors it knows
• Designate how it should handle the next message it receives

Akka: Core Actor Operations
0. Define
1. Create
2. Send
3. Become
4. Supervise

Akka: Define Operation
0. Define
• Define the message (class) the actor should respond to, and Actor class

Akka: Create Operation
1. Create
• Yes, creates new actor. From ActorSystem, then ActorRef.
• Lightweight, about 2.6M actors per GB of RAM
• Strong encapsulation of: state/behavior (indistinguishable), message queue

Akka: Send Operation
2. Send
• Sends a message to an Actor
• Asynch and non-blocking (fire & forget)
• Everything is Reactive
• Actor is passivated until receiving a message, which triggers it to awaken
• Messages are energy
• Everything is asynch and lockless

Akka: Performance
+50 million messages per second !!!

Akka: Remote Deployment

Akka: Become Operation
3. Become
• Dynamically redefine Actor’s behavior
• Reactively triggered by receipt of a message
• Will now react differently to messages it receives
• Behaviors are stacked – can be pushed and popped…
(Think of it as the object changing its type – interface, protocol, implementation)
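
A toy model of Become with stacked behaviors: the current behavior is the top of a stack, a message can push a new one, and a later message can pop back to the previous one. Switchable and its message strings are hypothetical, and it is meant to be driven from a single thread, as an actor handles one message at a time.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Function;

// Hypothetical behavior stack, for illustration only.
class Switchable {
    private final Deque<Function<String, String>> behaviors = new ArrayDeque<>();

    Switchable() {
        behaviors.push(this::normal);              // initial behavior
    }

    // Every message is handled by whatever behavior is currently on top.
    String receive(String msg) {
        return behaviors.peek().apply(msg);
    }

    private String normal(String msg) {
        if (msg.equals("overload")) {
            behaviors.push(this::degraded);        // become: push a new behavior
            return "switching to degraded mode";
        }
        return "processed " + msg;
    }

    private String degraded(String msg) {
        if (msg.equals("recovered")) {
            behaviors.pop();                       // restore the previous behavior
            return "back to normal";
        }
        return "rejected " + msg;
    }
}
```

Here the same object answers "job" messages in two different ways depending on which messages it has already seen, which is a small form of graceful degradation.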

Akka: Become Operation – Why?
Why do this?
• A busy actor can become an Actor Pool or Router!
• Implement FSM (Finite State Machine)
• Implement graceful degradation
• Spawn empty workers that can “Become” whatever the Master desires
• Very useful. Limited only by your imagination.

Akka: Failure Management
In Java/C/C++ etc.
• Single thread of control
• If that thread blows up you are screwed
• Only option is to do explicit error handling within your single thread
• Errors isolated within thread; other threads have no clue
• Results in tons of defensive code, scattered throughout the codebase and
entangled in your business logic

Akka: Supervise Operation
4. Supervise
• Manage another Actor’s failures (or the person sitting next to you)
• Let actors monitor (supervise) each other for failure, then respond
• Notification sent to Actor’s supervisor if failure occurs.
• Neat separation of processing from error handling
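
A toy supervisor in plain Java shows the separation of processing from error handling: the worker contains no defensive code, and the supervisor reacts to a crash by replacing the worker with a fresh instance. Supervisor, handle and restarts are hypothetical names, not Akka's supervision API, and returning null for a failed message is a simplification.

```java
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical supervisor, for illustration only.
class Supervisor<I, O> {
    private final Supplier<Function<I, O>> workerFactory;
    private Function<I, O> worker;
    private int restarts = 0;

    Supervisor(Supplier<Function<I, O>> workerFactory) {
        this.workerFactory = workerFactory;
        this.worker = workerFactory.get();
    }

    // Returns the worker's result, or null if the message crashed the worker.
    O handle(I message) {
        try {
            return worker.apply(message);          // the worker has no error handling of its own
        } catch (RuntimeException failure) {
            worker = workerFactory.get();          // let it crash, then restart with clean state
            restarts++;
            return null;
        }
    }

    int restarts() {
        return restarts;
    }
}
```

A usage example under the same assumptions: `new Supervisor<String, Integer>(() -> text -> Integer.parseInt(text))` keeps serving valid input after a malformed message has crashed and replaced the parser.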

Akka Streaming
Source: Akka Documentation
• Akka Streams API is completely decoupled from Reactive Streams interfaces
• Reactive Streams interfaces are an implementation detail of how stream data passes between processing stages
• Akka Streams is interoperable with any conformant Reactive Streams implementation
• Principles
• All features explicit in the API
• Compositionality; combined pieces retain the function of each part
• Model of domain of distributed bounded stream processing
• Reactive Streams -> JDK9 Flow APIs

Akka Streaming
• Immutable building blocks / blueprints enabled for libraries, include
• Source - something with exactly one output stream
• Sink - something with exactly one input stream
• Flow - something with exactly one input and one output stream
• BidiFlow - something with exactly two input streams and two output streams that conceptually behave like
two Flows of opposite direction
• Graph - a packaged stream processing topology that exposes a certain set of input and output ports,
characterized by an object of type Shape.
• Built in backpressure capability
• No stage can push downstream unless it received a pull beforehand
• Difference between error and failure
• Error is accessible within the stream as a data element (signaled via onNext)
• Failure means the stream itself has collapsed (signaled via onError).
• Want failure to propagate faster than data (essential, to deal with backpressure)
• Data elements emitted before a failure can still be lost if the onError overtakes them
• Recovery element acts as bulkhead to confine a stream collapse to a given region of stream topology, to
isolate outside from impact of collapsed region (e.g. buffered elements)

Akka HTTP
• Built atop Akka Streams
• Can expose an incoming connection in form of a Source instance
• To start listening on network with Akka HTTP, create a Route and bind it to a port (similar syntax
to Spray).
• Backpressure on source? Akka HTTP stops consuming data from network; eventually leads to 0
TCP window – applying backpressure to sending party itself (e.g. a sensor)
• Rules
• Libraries shall provide their users with reusable pieces, i.e. expose factories that return graphs,
allowing full compositionality
• Avoid destruction of compositionality.
• Express functionality of a library such that materialization can be done by user outside of
library’s control.
• Libraries may optionally and additionally provide facilities that consume and materialize graphs
• Allows a library to provide convenience “sugar” for use cases

• Akka Streams Integration
• https://github.com/akka/alpakka
• http://developer.lightbend.com/docs/alpakka/current/
• Adds interesting capabilities to Akka Streams
• Modern alternative to Apache Camel (EIP implementation)
• Camel en ze Akka
• Community driven, focused on connectors to external libraries,
integration patterns and conversions.
• “A call to arms”
• https://github.com/akka/alpakka/releases/tag/v0.1
Alpakka 101

The Data Lake for Analytics, App Dev
“Lake Wobegon, the little town that time forgot
and the decades cannot improve.”
- Garrison Keillor

The Data Lake
• Note: This section will be expanded for DevNexus.

The Data Lake
• The Data Lake is where a copy of much of the data from source systems “ends
up”, via Fast Data, etc.
• Easily accessible, massive repository of data built on commodity hardware (or
Cloud).
• Data is not stored in a way that is optimized for data analysis (S3)
• Data Lake retains all attributes
• Beware the Data Lake fallacy: http://www.gartner.com/newsroom/id/2809117
• Let’s combine all this data to drive increased information sharing, usage, while
reducing cost through consolidation / tech simplification.
• Does it really work?
• Has the ideal of Enterprise-wide data management been realized?
• Deriving value from data still in hands of business end user (enter: Fast Data Platforms)

The Dark Side of the Data Lake
Source: “Gartner Says Beware of the Data Lake Fallacy”
• Many companies tend to vacuum data into Hadoop for later use
• Many companies use overlapping tools within the same ecosystem, that do not
interoperate
• Data lakes ignore how/why data is used, governed, defined and secured.
• Does this sound like a good solution?
• Data Lake solves old problem of siloing data. Great, so now it’s all commingled.
• Federated query? AWS Athena? Why move data if not necessary?
• Inability to quantitatively measure data quality
• Accepts any data without governance or oversight
• Accepts any data without metadata (description)
• Inability to share lineage of findings by other analysts to share found value
• Security, access control (and tracking of both)
• Data Ownership, Entitlements?
• Tenancy?
• Regard for regulatory controls, compliance issues?
• What to do?

More to Come…
Source: “Gartner Says Beware of the Data Lake Fallacy”
• At DevNexus 2017
• Thurs, Feb 23 @ 2:30pm
• http://devnexus.com/s/devnexus2017/presentations/17212

Presentation Improvements for DevNexus
“Build something 100 people love,
not something 1 million people kind of like.”
- Brian Chesky

• More diagrams, fewer words
• Refine, refine, refine
• Mix in coding examples
• Improve contrast between older architectures and reactive, for the Enterprise
• Content that contrasts different streaming options (Akka, Spark, Kafka)
• Add specific performance details
• Incorporate additional, interesting content (incl. Data Lake related)
Planned Improvements

Questions?
“I'm sorry, if you were right, I'd agree with you.”
- Robin Williams

• The Reactive Manifesto (http://www.reactivemanifesto.org/)
• Chaos Monkey? Use Linear Fault Driven Testing instead.
• https://www.lightbend.com/blog/architect-reactive-design-patterns
• http://www.infoworld.com/article/2608040/big-data/fast-data--the-next-step-after-big-data.html
• https://www.lightbend.com/blog/lessons-learned-from-paypal-implementing-back-pressure-with-akka-streams-and-kafka
• https://kafka.apache.org
• http://www.slideshare.net/ducasf/introduction-to-kafka
• http://www.slideshare.net/SparkSummit/grace-huang-49762421
• http://www.slideshare.net/HadoopSummit/performance-comparison-of-streaming-big-data-platforms
• https://github.com/akka/alpakka
• http://developer.lightbend.com/docs/alpakka/current/
• https://github.com/akka/alpakka/releases/tag/v0.1
• http://www.slideshare.net/LisaHua/spark-overview-37479609
• http://spark.apache.org/
• https://www.realdbamagic.com/intro-to-apache-spark-2016-slides/
• http://www.slideshare.net/gene7299/akka-actor-presentation
• http://www.slideshare.net/jboner/introducing-akka
• http://bit.ly/hewitt-on-actors
• http://tech.measurence.com/2016/06/01/a-dive-into-akka-streams.html
• https://infocus.emc.com/rachel_haines/is-the-data-lake-the-best-architecture-to-support-big-data/
Resources
