Talk I gave at Strata+Hadoop in Barcelona on November 21, 2014.
In this talk I discuss our experience with realtime analysis of high-volume event data streams.
1. Realtime Data
Analysis Patterns
Mikio Braun
@mikiobraun
streamdrill & TU Berlin
O'Reilly Strata+Hadoop, Barcelona
Nov 21, 2014
Mikio L. Braun, @mikiobraun Realtime Data Analysis Patterns (c) 2014 by Mikio Braun
2. How it all started: Realtime
Twitter Retweet Trends
Rails app + PostgreSQL
About 100 tweets/second, and it got worse
3. Road from there
● Version 1.0: Rails + PostgreSQL
– store and batch
● Version 2.0: Scala + Cassandra
– stream processing & working data on disk
● Version 3.0: streamdrill
– “in-memory realtime analytics database”
– approximation algorithms to bound resources
– moderate parallelism for some things
4. Lessons learned?
Not just one kind of realtime.
6. Two Dimensions of Real-Time
Complexity
● counting
● trends
● outlier detection
● recommendation
● prediction (churn, etc.)
Latency
● now (ms, RTB)
● seconds (fraud)
● hours (monitoring)
● days (reporting)
7. What makes realtime hard
● Many Events
– 100 events / second
– 360k per hour
– 8.6M per day
– 260M per month
– 3.2B per year
● Many Objects
http://www.flickr.com/photos/arenamontanus/269158554/
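The event totals above follow from a constant rate of 100 events per second. A small sketch of the arithmetic, assuming a 30-day month and a 365-day year (the function and constant names are illustrative only):

```python
# Rate taken from the slide: 100 events per second, assumed constant.
RATE_PER_SECOND = 100

# Seconds per period. The 30-day month and 365-day year are assumptions
# chosen because they reproduce the rounded numbers on the slide.
SECONDS = {
    "hour": 60 * 60,
    "day": 24 * 60 * 60,
    "month": 30 * 24 * 60 * 60,
    "year": 365 * 24 * 60 * 60,
}


def events_per(period, rate=RATE_PER_SECOND):
    """Number of events that arrive in one period at the given rate."""
    return rate * SECONDS[period]


for period in SECONDS:
    print(f"{period:>5}: {events_per(period):,}")
```

Rounded, these are the 360k, 8.6M, 260M, and 3.2B figures listed above.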
8. Classes of Realtime
● Events per second (100s? 1000s? 10k?)
● Number of objects (A few dozen? Millions?)
● Complexity (Counting? Trends?)
● Latency (Milliseconds? Hours?)
10. Data Acquisition
● Flat files / HDFS
● Apache Flume / Logstash
● Apache Kafka for distributed logging
11. Processing
● Depending on Latency: Batch or Streaming
● Batch
– Apache Hadoop
– Apache Spark
– Apache Flink
● Streaming
– Apache Storm
– Apache Samza
12. Query Layer
● Hadoop/Storm/Spark have no query layer
● Use a database backend such as Redis to store the results
13. Lambda Architecture: Mixing
Batch & Streaming
http://lambda-architecture.net/
15. Scaling vs. Approximation
● Scaling is expensive
● Not all results are relevant
● Data changes all the time anyway
● Approximate:
Trade accuracy for resource usage
17. Heavy Hitters
● Count activities over large item sets (millions, even
more, e.g. IP addresses, Twitter users)
● Interested in most active elements only.
Fixed table of counts:
frank 15
paul  12
jan    8
felix  5
leo    3
alex   2
Case 1: element already in the table
paul: 12 → 13
Case 2: new element
nico replaces alex (count 2) and gets count 3
Metwally, Agrawal, Abbadi. Efficient Computation of Frequent and Top-k Elements in Data Streams.
International Conference on Database Theory, 2005.
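The two cases above can be sketched as a small fixed-size counter table. This is a minimal illustration in the spirit of the cited paper, not streamdrill's implementation; the class name `SpaceSaving` and its methods are assumptions made for this example, and the linear scan for the smallest entry is chosen for brevity rather than speed.

```python
class SpaceSaving:
    """Fixed-size table of approximate counts (illustrative sketch)."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.counts = {}

    def add(self, item):
        if item in self.counts:
            # Case 1: element already in the table, increment its count.
            self.counts[item] += 1
        elif len(self.counts) < self.capacity:
            # The table still has room, start a new counter.
            self.counts[item] = 1
        else:
            # Case 2: new element and the table is full. Evict the entry
            # with the smallest count and inherit that count plus one.
            victim = min(self.counts, key=self.counts.get)
            self.counts[item] = self.counts.pop(victim) + 1

    def top(self, k):
        """Return the k entries with the largest counts."""
        ranked = sorted(self.counts.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:k]


# Reproduce the example table from the slide.
hh = SpaceSaving(capacity=6)
hh.counts.update({"frank": 15, "paul": 12, "jan": 8, "felix": 5, "leo": 3, "alex": 2})

hh.add("paul")  # Case 1: paul goes from 12 to 13
hh.add("nico")  # Case 2: nico replaces alex and gets count 3
print(hh.top(3))
```

Inheriting the evicted count means a new element may be overestimated, but an element that is truly frequent is unlikely to be pushed out of the table.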
18. Count Min Sketch
● Summarize histograms over large feature sets
● Like Bloom filters, but with counts instead of bits
[Figure: table of counters with n rows (one per hash function) and m bins per row. An update for a new entry increments one bin in each row; the example query returns 1.]
● Query: Take minimum over all hash functions
G. Cormode and S. Muthukrishnan. An Improved Data Stream Summary: The Count-Min Sketch and its Applications.
LATIN 2004; J. Algorithms 55(1): 58-75 (2005).
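The update and query steps can be sketched as follows. This is a minimal illustration, not a tuned implementation: the class name `CountMinSketch`, the parameters `depth` (n hash functions) and `width` (m bins), and the use of salted BLAKE2b as the family of hash functions are all assumptions made for this example.

```python
import hashlib


class CountMinSketch:
    """Table of depth x width counters (illustrative sketch)."""

    def __init__(self, depth, width):
        self.depth = depth  # n different hash functions, one per row
        self.width = width  # m bins per row
        self.table = [[0] * width for _ in range(depth)]

    def _bin(self, row, item):
        # One hash function per row, derived by salting BLAKE2b with the row index.
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        # Update: increment one bin in every row.
        for row in range(self.depth):
            self.table[row][self._bin(row, item)] += count

    def query(self, item):
        # Query: take the minimum over all hash functions.
        return min(self.table[row][self._bin(row, item)] for row in range(self.depth))


sketch = CountMinSketch(depth=4, width=64)
for word in ["a", "b", "a", "c", "a"]:
    sketch.add(word)
print(sketch.query("a"))
```

Collisions can only add to a bin, so the estimate never falls below the true count; taking the minimum picks the row that was least affected by collisions.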
19. HyperLogLog
● Hash stream to generate random bit strings
● Look for infrequent events
● If a bit pattern has probability 1/100, then once it occurs we should have
seen about 100 distinct elements on average.
● Average to improve estimate.
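The idea in the bullets above can be sketched in a few lines. This is a simplified illustration, not a production implementation: the class name `HyperLogLog`, the precision parameter `p`, and the choice of BLAKE2b as the hash are assumptions for this example, and only the small-range correction is included.

```python
import hashlib
import math


class HyperLogLog:
    """Approximate distinct counter with 2**p registers (illustrative sketch)."""

    def __init__(self, p=10):
        if not 7 <= p <= 16:
            raise ValueError("p must be between 7 and 16 for this sketch")
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # Hash the item to a 64-bit pseudo-random bit string.
        digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        # The first p bits select a register.
        index = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        # The infrequent event: a long run of leading zeros in the remaining bits.
        # rank is the 1-based position of the first 1 bit.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        # Combine the registers with a harmonic mean to improve the estimate.
        alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant, valid for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


hll = HyperLogLog(p=10)
for i in range(10_000):
    hll.add(f"user-{i}")
estimate = hll.count()
print(round(estimate))
```

With 2**10 registers the expected relative error is roughly 3%, and adding the same element again never changes the registers, so duplicates are ignored.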
20. Comparing Approx. Algorithms
● Heavy Hitters:
– approx. counts + top-k
– large memory requirement
● Count Min Sketch
– approx. counts for all, but no top-k, no elements
– needs to know size beforehand
● HyperLogLog
– approx. number of distinct elements
21. Exponential Decay
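A minimal sketch of a counter with exponential decay, as one common way to make recent events weigh more than old ones. It is not necessarily how streamdrill does it: the class name `DecayedCounter`, the `half_life` parameter, and timestamps given in seconds are assumptions for this example, and timestamps are assumed to be non-decreasing.

```python
import math


class DecayedCounter:
    """Counter whose value halves every half_life seconds (illustrative sketch)."""

    def __init__(self, half_life):
        self.rate = math.log(2) / half_life
        self.value = 0.0
        self.last = None  # timestamp of the most recent update

    def read(self, now):
        """Return the decayed value at time now without changing the counter."""
        if self.last is None:
            return 0.0
        elapsed = max(0.0, now - self.last)
        return self.value * math.exp(-self.rate * elapsed)

    def add(self, now, weight=1.0):
        """Decay the stored value up to now, then add the new event."""
        self.value = self.read(now) + weight
        self.last = now


counter = DecayedCounter(half_life=60.0)
counter.add(now=0.0)
counter.add(now=0.0)
print(counter.read(now=60.0))
```

Only one number and one timestamp are stored per counter, so no window of past events has to be kept in memory.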
22. Beyond Counting
23. Streamdrill & Demos
● Realtime Analysis Solutions
● Core Engine:
– Heavy Hitters + exponential decay + secondary indices
– Instant counts & top-k results over time windows
– In-memory
– Written in Scala
● Modules
– Profiling and Trending
– Recommendations
– Count Distinct
24. Example: Twitter Stock Analysis
http://play.streamdrill.com/vis/
25. Example: Twitter Stock Analysis
● Trends:
– symbol:combinations $AAPL:$GOOG
– symbol:hashtag $AAPL:#trading
– symbol:keywords $GOOG:disruption
– symbol:mentions $GOOG:WallStreetCom
– symbol trend $AAPL
– symbol:url $FB:http://on.wsj.com/15fHaZW
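One way to derive such trend keys from a single tweet is sketched below. This is a hypothetical illustration, not the demo's actual analyzer: the function name `trend_keys` and the regular expressions are assumptions, and keyword extraction is left out because it would need a stop-word list.

```python
import re

# Simplified patterns, chosen for this example only.
SYMBOL = re.compile(r"\$[A-Za-z]{1,6}\b")
HASHTAG = re.compile(r"#\w+")
MENTION = re.compile(r"@(\w+)")
URL = re.compile(r"https?://\S+")


def trend_keys(tweet):
    """Return the trend keys to update for one tweet."""
    urls = URL.findall(tweet)
    text = URL.sub(" ", tweet)  # strip URLs so they are not parsed as tags
    symbols = sorted({s.upper() for s in SYMBOL.findall(text)})
    hashtags = sorted(set(HASHTAG.findall(text)))
    mentions = sorted(set(MENTION.findall(text)))

    keys = []
    for symbol in symbols:
        keys.append(symbol)  # symbol trend
        # symbol:combinations, emitted once per unordered pair
        keys += [f"{symbol}:{other}" for other in symbols if other > symbol]
        keys += [f"{symbol}:{tag}" for tag in hashtags]   # symbol:hashtag
        keys += [f"{symbol}:{name}" for name in mentions]  # symbol:mentions
        keys += [f"{symbol}:{url}" for url in urls]        # symbol:url
    return keys


example = "$AAPL and $GOOG are up #trading @WallStreetCom http://on.wsj.com/15fHaZW"
for key in trend_keys(example):
    print(key)
```

Each key would then be sent as one count update, so a single tweet touches several trends at once.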
26. Example: Twitter Stock Analysis
27. Example: Twitter Stock Analysis
28. Example: Twitter Stock Analysis
[Diagram: Twitter → tweets → Tweet Analyzer → updates → streamdrill → via REST → JavaScript]
29. Realtime User Profiles
30. Realtime User Profiles
31. Realtime User Profiles
32. Realtime User Profiles
33. Realtime user profiles
● Process 10k events / second on one machine
● Track about 1 Million counts per 1 GB
● Shard by user for higher accuracy
34. Realtime Data Analysis Patterns
● Acquisition / Processing / Query Layer
● Acquisition: Flat files and distributed logs
● Processing: Scaling batch or streaming
● Query Layer: Separate query from processing
● Lambda and Kappa Architecture
● Approximation as alternative to scaling
● Trends with indices as building blocks for data
analysis
35. Thank You
Mikio Braun
mikio@streamdrill.com
@mikiobraun