Big Data Real-Time Architectures
How to do big data processing in real time?
What architectures are out there to support this paradigm?
Which one should we choose?
What advantages and pitfalls does each one have?
● Machines FAIL
● Humans make mistakes
● We want everything in real time!
○ We can’t do everything in real time :(
● We might think of a new way to analyse old data
● We might want to take a look at older versions of the raw / aggregated data
● Looking at raw data is cool, looking at aggregated data is cooler, looking at indexed data with
ad-hoc filters is the coolest. What if we want them all on the same set of data?
◦ Large amount of static data
◦ Scalable solution
◦ Computing streaming data
◦ Low latency
◦ Lambda Architecture
◦ Kappa Architecture
Big Data Timeline
● Nathan Marz (Twitter)
● How to beat the CAP theorem
● Concepts :
○ Immutable data
○ Everything can be re-run
○ Using the best tool for the purpose
○ Query = Function(All Data)
○ Real time isn’t accurate; batch will fix any mistakes
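The "Immutable data" and "Query = Function(All Data)" concepts above can be sketched in a few lines. This is a minimal illustration, not Marz's implementation: `PageView`, `MasterDataset`, `views_per_url` and `unique_users` are hypothetical names, and an in-memory tuple stands in for a distributed, append-only store.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)  # frozen: an event cannot be modified once created
class PageView:
    user: str
    url: str
    ts: int


class MasterDataset:
    """Append-only store: events are added, never updated or deleted."""

    def __init__(self) -> None:
        self._events: Tuple[PageView, ...] = ()

    def append(self, event: PageView) -> None:
        # Build a new tuple instead of mutating existing data.
        self._events = self._events + (event,)

    def query(self, fn: Callable[[Iterable[PageView]], T]) -> T:
        # Query = Function(All Data): every query sees the full history.
        return fn(self._events)


def views_per_url(events: Iterable[PageView]) -> dict:
    counts: dict = {}
    for e in events:
        counts[e.url] = counts.get(e.url, 0) + 1
    return counts


def unique_users(events: Iterable[PageView]) -> int:
    return len({e.user for e in events})


dataset = MasterDataset()
dataset.append(PageView("ann", "/home", 1))
dataset.append(PageView("bob", "/home", 2))
dataset.append(PageView("ann", "/docs", 3))

print(dataset.query(views_per_url))  # {'/home': 2, '/docs': 1}
print(dataset.query(unique_users))   # 2
```

Because the raw events are never changed, a new way of analysing old data is just another function passed to `query`, and any view can be rebuilt by re-running it.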
Lambda Architecture is:
A complementary pair of:
- in-memory real-time processing: fast, but small and volatile
- large HDD/SSD batch processing: slow, but large and persistent
Proposed by Nathan Marz
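As a rough sketch of how the complementary pair fits together: the batch side recomputes a view from all retained data, the real-time side keeps a small in-memory view of whatever arrived since the last batch run, and a query merges the two. Everything here is assumed for illustration (`BATCH_HORIZON`, `build_batch_view`, `SpeedLayer`, `serve`); real deployments would use separate systems for each part.

```python
from collections import Counter

# (url, timestamp) events; the master copy lives on large, persistent storage.
events = [
    ("/home", 1), ("/docs", 2), ("/home", 3),  # covered by the last batch run
    ("/home", 4), ("/docs", 5),                # arrived after the batch run
]
BATCH_HORIZON = 3  # hypothetical: latest timestamp the batch job has processed


def build_batch_view(all_events, horizon):
    """Slow path: recompute the view from the full history up to `horizon`."""
    return Counter(url for url, ts in all_events if ts <= horizon)


class SpeedLayer:
    """Fast path: small in-memory view of events newer than the batch view."""

    def __init__(self):
        self.view = Counter()

    def on_event(self, url, ts):
        self.view[url] += 1

    def expire(self):
        # Called once a newer batch view covers these events.
        self.view.clear()


def serve(url, batch_view, speed_view):
    """Answer a query by merging the batch view and the real-time view."""
    return batch_view[url] + speed_view[url]


batch_view = build_batch_view(events, BATCH_HORIZON)
speed = SpeedLayer()
for url, ts in events:
    if ts > BATCH_HORIZON:
        speed.on_event(url, ts)

print(serve("/home", batch_view, speed.view))  # 3
print(serve("/docs", batch_view, speed.view))  # 2
```

If the speed layer loses its state or miscounts, the next batch run over the full data produces a corrected view and the real-time view is simply expired.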
● Data duplication
○ Columnar + Compressed
○ Don’t be cheap...
● Too many tools!
○ Stay on 1 platform - Hadoop/YARN
● Do I really need to write everything twice? (Cross DB ORM)
■ Twitter Summingbird (MR + Storm)
■ Apache Spark (batch / Streaming)
■ Google Dataflow
● No place for ad-hoc analysis
○ Add more specialised data sources
■ Solr / Elasticsearch
● Incremental Algorithms are HARD - stream process based on smart thresholds (= history)
○ Mix it up - Key value access during speed process
● A new event may be related to an old one, which might be related to an older one…
○ Add graph processing (GraphX / Giraph / Titan)
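On the "write everything twice" pitfall: the idea behind tools that cover both batch and streaming is to express the logic once as a per-event map plus an associative merge, then hand it to either runner. The sketch below is plain Python and does not use the Summingbird, Spark or Dataflow APIs; `to_counts`, `merge`, `run_batch` and `StreamRunner` are hypothetical names.

```python
from functools import reduce


def to_counts(event):
    """Map one event to a partial aggregate (a one-entry count dict)."""
    return {event["url"]: 1}


def merge(a, b):
    """Associative merge of two partial aggregates."""
    out = dict(a)
    for key, value in b.items():
        out[key] = out.get(key, 0) + value
    return out


def run_batch(events):
    """Batch runner: fold the same logic over the whole data set."""
    return reduce(merge, map(to_counts, events), {})


class StreamRunner:
    """Streaming runner: apply the same logic one event at a time."""

    def __init__(self):
        self.state = {}

    def on_event(self, event):
        self.state = merge(self.state, to_counts(event))


events = [{"url": "/home"}, {"url": "/docs"}, {"url": "/home"}]

stream = StreamRunner()
for e in events:
    stream.on_event(e)

print(run_batch(events))  # {'/home': 2, '/docs': 1}
print(stream.state)       # {'/home': 2, '/docs': 1}
```

Only `to_counts` and `merge` carry business logic, so a change is made in one place and both paths stay consistent.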
● Jay Kreps (LinkedIn)
● Questioning The Lambda Architecture
○ retain the full log of the data
○ processing = new instance of the same stream
○ input - choose where to start reading from the log (now, 1 day ago, 1 year ago..)
○ real time is accurate!
○ re-processing only when code changes
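The Kappa points above can be sketched as a retained log plus jobs that choose their starting offset. This is an illustration under stated assumptions: `Log`, `job_v1` and `job_v2` are hypothetical, and a Python list stands in for a durable log such as a Kafka topic.

```python
class Log:
    """Append-only, fully retained log; every record gets an offset."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def read_from(self, offset=0):
        # A consumer chooses where to start: 0 = full history.
        return iter(self._records[offset:])


def job_v1(records):
    """Original stream job: count every event per URL."""
    counts = {}
    for r in records:
        counts[r["url"]] = counts.get(r["url"], 0) + 1
    return counts


def job_v2(records):
    """Changed code: same count, but bot traffic is ignored."""
    return job_v1(r for r in records if not r["bot"])


log = Log()
for url, bot in [("/home", False), ("/home", True), ("/docs", False)]:
    log.append({"url": url, "bot": bot})

view_v1 = job_v1(log.read_from(0))  # current output
view_v2 = job_v2(log.read_from(0))  # new instance replays the log from the start
recent = job_v2(log.read_from(2))   # or starts from a later offset

print(view_v1)  # {'/home': 2, '/docs': 1}
print(view_v2)  # {'/home': 1, '/docs': 1}
print(recent)   # {'/docs': 1}
```

There is no separate batch code path: re-processing is the same stream job started again at an earlier offset, and serving switches to the new output once it has caught up.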
● Common
○ Greek letters
○ Real time processing at scale
○ Immutable architectures
■ “Replay” possible
○ Born out of need
○ Both use materialised views / indexed results for serving
● Different
○ Re-processing: only when code changes (Kappa)
○ Reliability: batch is reliable (Lambda)
○ Function = Query
○ Running on deltas
● Zeta Architecture
○ Includes cluster management
■ Container system etc.
○ Inspired by Google
○ Internet of Things
■ MQ (Kafka - RT)
■ DB (HBase - Interactive)
■ DFS (Batch)
● Mu Architecture
○ Lambda with only 1 set of aggregated views