Realtime Data 
Analysis Patterns 
Mikio Braun 
@mikiobraun 
streamdrill & TU Berlin 
O'Reilly Strata+Hadoop, Barcelona
Nov 21, 2014 
Mikio L. Braun, @mikiobraun Realtime Data Analysis Patterns (c) 2014 by Mikio Braun
How it all started: Realtime 
Twitter Retweet Trends 
Rails app + PostgreSQL 
About 100 tweets/second, and it got worse
Road from there 
● Version 1.0: Rails + PostgreSQL 
– store and batch 
● Version 2.0: Scala + Cassandra 
– stream processing & working data on disk 
● Version 3.0: streamdrill 
– “in-memory realtime analytics database” 
– approximative algorithms to bound resources 
– moderate parallelism for some things 
Lessons learned? 
Not just one kind of 
realtime. 
Applications 
● Finance
● Gaming
● Monitoring
● Advertisement
● Sensor Networks
● Social Media
Attribution: flickr users kenteegardin, fguillen, torkildr, Docklandsboy, brewbooks, ellbrown, JasonAHowie 
Two Dimensions of Real-Time
● Complexity
– counting
– trends
– outlier detection
– recommendation
– prediction (churn, etc.)
● Latency
– now (ms, RTB)
– seconds (fraud)
– hours (monitoring)
– days (reporting)
What makes realtime hard 
● Many Events 
– 100 events / second 
– 360k per hour 
– 8.6M per day 
– 260M per month 
– 3.2B per year 
● Many Objects 
http://www.flickr.com/photos/arenamontanus/269158554/ 
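The event volumes on this slide follow from simple multiplication. The short sketch below reproduces them, assuming a constant rate of 100 events per second, a 30-day month, and a 365-day year (the month and year lengths are assumptions, not stated on the slide). The names `RATE_PER_SECOND`, `SECONDS_PER_PERIOD`, and `event_volumes` are made up for this illustration.

```python
RATE_PER_SECOND = 100  # events per second, as on the slide

# Seconds per period; month and year lengths are assumptions.
SECONDS_PER_PERIOD = {
    "hour": 3600,
    "day": 86_400,
    "month": 30 * 86_400,
    "year": 365 * 86_400,
}


def event_volumes(rate_per_second):
    """Return the number of events per period at a constant event rate."""
    return {
        period: rate_per_second * seconds
        for period, seconds in SECONDS_PER_PERIOD.items()
    }


volumes = event_volumes(RATE_PER_SECOND)
for period, count in volumes.items():
    print(f"{period:>5}: {count:,}")
```

The rounded results match the slide: 360k per hour, about 8.6M per day, about 260M per month, and about 3.2B per year.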
Classes of Realtime 
● Events per second (100s? 1000s? 10k?) 
● Number of objects (A few dozen? Millions?) 
● Complexity (Counting? Trends?) 
● Latency (Milliseconds? Hours?) 
General Architecture 
Data Acquisition 
● Flat files / HDFS 
● Apache Flume / Logstash 
● Apache Kafka for distributed logging 
Processing 
● Depending on Latency: Batch or Streaming 
● Batch 
– Apache Hadoop 
– Apache Spark 
– Apache Flink 
● Streaming 
– Apache Storm 
– Apache Samza 
Query Layer 
● Hadoop/Storm/Spark have no query layer 
● Some db backend like redis to store the results 
Lambda Architecture: Mixing 
Batch & Streaming 
http://lambda-architecture.net/ 
Kappa Architecture 
http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html 
Scaling vs. Approximation 
● Scaling is expensive 
● Not all results are relevant 
● Data changes all the time anyway 
● Approximate: 
Trade accuracy for resource usage 
Approximation harmful? 
Heavy Hitters
● Count activities over large item sets (millions or even more, e.g. IP addresses, Twitter users)
● Interested in most active elements only.
● Fixed table of counts:
– frank 15
– paul 12
– jan 8
– felix 5
– leo 3
– alex 2
● Case 1: element already in the table (paul): increment its count, 12 → 13
● Case 2: new element (nico): replace the entry with the smallest count (alex, 2) and give the new element that count plus one (nico, 3)
Metwally, Agrawal, Abbadi, Efficient Computation of Frequent and Top-k Elements in Data Streams, International Conference on Database Theory, 2005
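The two update cases above can be sketched in a few lines of Python. This is a minimal illustration in the spirit of the cited paper, not streamdrill's implementation: the class name `SpaceSaving` and its methods are hypothetical, and the linear scan for the smallest entry is a simplification chosen for readability.

```python
class SpaceSaving:
    """Fixed-size table of approximate counts (hypothetical sketch)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.counts = {}  # item -> approximate count

    def add(self, item):
        if item in self.counts:
            # Case 1: element already in the table, just increment.
            self.counts[item] += 1
        elif len(self.counts) < self.capacity:
            # Table not full yet: start a fresh counter.
            self.counts[item] = 1
        else:
            # Case 2: new element. Evict the entry with the smallest
            # count and let the new element inherit that count plus one.
            victim = min(self.counts, key=self.counts.get)
            smallest = self.counts.pop(victim)
            self.counts[item] = smallest + 1

    def top(self, k):
        """Return the k entries with the largest approximate counts."""
        ranked = sorted(self.counts.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:k]


# Reproduce the example from the slide.
table = SpaceSaving(capacity=6)
table.counts = {"frank": 15, "paul": 12, "jan": 8, "felix": 5, "leo": 3, "alex": 2}
table.add("paul")  # Case 1: paul goes from 12 to 13
table.add("nico")  # Case 2: alex (2) is replaced by nico (3)
print(table.top(3))
```

Because a new element inherits the evicted count, counts can be overestimated but the table never grows beyond its fixed capacity. That is the accuracy-for-resources trade described earlier.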
Count Min Sketch
● Summarize histograms over large feature sets
● Like Bloom filters, but better
● Table of counters: n different hash functions, each with m bins
● Update for a new entry: increment one bin per hash function
● Query: Take minimum over all hash functions (query result in the example: 1)
G. Cormode and S. Muthukrishnan. An improved data stream summary: The count-min sketch and its applications. LATIN 2004, J. Algorithms 55(1): 58-75 (2005).
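A minimal Python sketch of the update and query steps follows. The class `CountMinSketch` and its parameter names are hypothetical, and deriving the n hash functions by salting SHA-256 with the row index is an assumption made to keep the example self-contained; the cited paper uses pairwise independent hash families instead.

```python
import hashlib


class CountMinSketch:
    """n hash functions (rows), each with m bins (hypothetical sketch)."""

    def __init__(self, num_hashes, num_bins):
        self.num_hashes = num_hashes
        self.num_bins = num_bins
        self.table = [[0] * num_bins for _ in range(num_hashes)]

    def _bin(self, row, item):
        # Assumption: one hash function per row, obtained by salting
        # SHA-256 with the row index.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.num_bins

    def add(self, item, count=1):
        # Update: increment one bin in every row.
        for row in range(self.num_hashes):
            self.table[row][self._bin(row, item)] += count

    def query(self, item):
        # Query: take the minimum over all hash functions.
        return min(
            self.table[row][self._bin(row, item)]
            for row in range(self.num_hashes)
        )


sketch = CountMinSketch(num_hashes=4, num_bins=8)
for user in ["frank"] * 15 + ["paul"] * 12 + ["jan"] * 8:
    sketch.add(user)
print(sketch.query("paul"))  # never less than the true count of 12
```

Collisions can only add to a bin, so the estimate never undercounts. Taking the minimum picks the row that was least affected by collisions.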
HyperLogLog
● Hash stream to generate random bit strings
● Look for infrequent events
● If an event has a probability of one in a hundred → should have seen about 100 events on average when it occurs.
● Average to improve estimate.
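The idea can be sketched as follows. Here the infrequent event is a long run of leading zero bits in the hash: a run of length k has probability about 2^-k, so seeing one suggests roughly 2^k distinct elements. The class `HyperLogLog` below is a simplified, hypothetical illustration. It assumes SHA-256 as the hash, uses the standard bias constant for 128 or more registers, and applies only the small-range correction from the original HyperLogLog paper.

```python
import hashlib
import math


class HyperLogLog:
    """Approximate distinct counter (simplified, hypothetical sketch)."""

    def __init__(self, p=12):
        self.p = p                    # number of bits used as register index
        self.m = 1 << p               # number of registers
        self.registers = [0] * self.m
        # Bias correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        # Hash the item to a 64-bit pseudo-random bit string.
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        index = h >> (64 - self.p)                # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining 64 - p bits
        # Rank: position of the leftmost 1 bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        # Average over registers (harmonic mean) to improve the estimate.
        raw = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


hll = HyperLogLog(p=12)
for i in range(100_000):
    hll.add(f"user-{i}")
for i in range(10_000):
    hll.add(f"user-{i}")  # duplicates do not change the registers
estimate = hll.count()
print(round(estimate))
```

With 4096 registers the expected relative error is about 1.04 / sqrt(4096), or roughly 1.6%, while memory stays fixed regardless of how many distinct elements are seen.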
Comparing Approx. Algorithms 
● Heavy Hitters: 
– approx. counts + top-k 
– large memory requirement 
● Count Min Sketch 
– approx. counts for all, but no top-k, no elements 
– needs to know size beforehand 
● HyperLogLog 
– approx. number of distinct elements 
Exponential Decay 
Beyond Counting 
Streamdrill & Demos 
● Realtime Analysis Solutions 
● Core Engine: 
– Heavy Hitters + exponential decay + secondary indices
– Instant counts & top-k results over time windows 
– In-memory 
– Written in Scala 
● Modules 
– Profiling and Trending 
– Recommendations 
– Count Distinct 
Example: Twitter Stock Analysis 
http://play.streamdrill.com/vis/ 
Example: Twitter Stock Analysis 
● Trends: 
– symbol:combinations $AAPL:$GOOG 
– symbol:hashtag $AAPL:#trading 
– symbol:keywords $GOOG:disruption 
– symbol:mentions $GOOG:WallStreetCom 
– symbol trend $AAPL 
– symbol:url $FB:http://on.wsj.com/15fHaZW 
Example: Twitter Stock Analysis 
Twitter → (tweets) → Tweet Analyzer → (updates) → streamdrill → (via REST) → JavaScript
Realtime user profiles 
● Process 10k events / second on one machine 
● Track about 1 Million counts per 1 GB 
● Shard by user for higher accuracy 
Realtime Data Analysis Patterns 
● Acquisition / Processing / Query Layer 
● Acquisition: Flat files and distributed logs 
● Processing: Scaling batch or streaming 
● Query Layer: Separate query from processing 
● Lambda and Kappa Architecture 
● Approximation as alternative to scaling 
● Trends with indices as building blocks for data 
analysis 
Thank You 
Mikio Braun 
mikio@streamdrill.com 
@mikiobraun 
Mikio L. Braun, @mikiobraun Realtime Data Analysis Patterns (c) 2014 by Mikio Braun

More Related Content

What's hot

Game Playing in Artificial Intelligence
Game Playing in Artificial IntelligenceGame Playing in Artificial Intelligence
Game Playing in Artificial Intelligencelordmwesh
 
3.3 hierarchical methods
3.3 hierarchical methods3.3 hierarchical methods
3.3 hierarchical methodsKrish_ver2
 
Data 101: Introduction to Data Visualization
Data 101:  Introduction to Data VisualizationData 101:  Introduction to Data Visualization
Data 101: Introduction to Data VisualizationDavid Newbury
 
PCA and LDA in machine learning
PCA and LDA in machine learningPCA and LDA in machine learning
PCA and LDA in machine learningAkhilesh Joshi
 
Tutorial on AI-based Analytics in Traffic Management
Tutorial on AI-based Analytics in Traffic ManagementTutorial on AI-based Analytics in Traffic Management
Tutorial on AI-based Analytics in Traffic ManagementBiplav Srivastava
 
Crop prediction using machine learning
Crop prediction using machine learningCrop prediction using machine learning
Crop prediction using machine learningdataalcott
 
Tutorial on IEEE 802.15.4e standard
Tutorial on IEEE 802.15.4e standardTutorial on IEEE 802.15.4e standard
Tutorial on IEEE 802.15.4e standardGiuseppe Anastasi
 
Foundations of Machine Learning
Foundations of Machine LearningFoundations of Machine Learning
Foundations of Machine Learningmahutte
 
Congestion on computer network
Congestion on computer networkCongestion on computer network
Congestion on computer networkDisi Dc
 
Machine Learning and its Applications
Machine Learning and its ApplicationsMachine Learning and its Applications
Machine Learning and its ApplicationsDr Ganesh Iyer
 
Stock Market Prediction using Machine Learning
Stock Market Prediction using Machine LearningStock Market Prediction using Machine Learning
Stock Market Prediction using Machine LearningAravind Balaji
 
Convolution Neural Network (CNN)
Convolution Neural Network (CNN)Convolution Neural Network (CNN)
Convolution Neural Network (CNN)Suraj Aavula
 
Machine Learning for Recommender Systems MLSS 2015 Sydney
Machine Learning for Recommender Systems MLSS 2015 SydneyMachine Learning for Recommender Systems MLSS 2015 Sydney
Machine Learning for Recommender Systems MLSS 2015 SydneyAlexandros Karatzoglou
 
Regularization in deep learning
Regularization in deep learningRegularization in deep learning
Regularization in deep learningKien Le
 
UNIT-5 IoT Reference Architecture.pdf
UNIT-5 IoT Reference Architecture.pdfUNIT-5 IoT Reference Architecture.pdf
UNIT-5 IoT Reference Architecture.pdfMansiMehta96928
 

What's hot (20)

Game Playing in Artificial Intelligence
Game Playing in Artificial IntelligenceGame Playing in Artificial Intelligence
Game Playing in Artificial Intelligence
 
3.3 hierarchical methods
3.3 hierarchical methods3.3 hierarchical methods
3.3 hierarchical methods
 
Data 101: Introduction to Data Visualization
Data 101:  Introduction to Data VisualizationData 101:  Introduction to Data Visualization
Data 101: Introduction to Data Visualization
 
IoT transport protocols
IoT transport protocolsIoT transport protocols
IoT transport protocols
 
PCA and LDA in machine learning
PCA and LDA in machine learningPCA and LDA in machine learning
PCA and LDA in machine learning
 
Tutorial on AI-based Analytics in Traffic Management
Tutorial on AI-based Analytics in Traffic ManagementTutorial on AI-based Analytics in Traffic Management
Tutorial on AI-based Analytics in Traffic Management
 
Crop prediction using machine learning
Crop prediction using machine learningCrop prediction using machine learning
Crop prediction using machine learning
 
Tutorial on IEEE 802.15.4e standard
Tutorial on IEEE 802.15.4e standardTutorial on IEEE 802.15.4e standard
Tutorial on IEEE 802.15.4e standard
 
Foundations of Machine Learning
Foundations of Machine LearningFoundations of Machine Learning
Foundations of Machine Learning
 
Congestion on computer network
Congestion on computer networkCongestion on computer network
Congestion on computer network
 
Machine Learning and its Applications
Machine Learning and its ApplicationsMachine Learning and its Applications
Machine Learning and its Applications
 
IOT - Unit 3.pptx
IOT - Unit 3.pptxIOT - Unit 3.pptx
IOT - Unit 3.pptx
 
Stock Market Prediction using Machine Learning
Stock Market Prediction using Machine LearningStock Market Prediction using Machine Learning
Stock Market Prediction using Machine Learning
 
Convolution Neural Network (CNN)
Convolution Neural Network (CNN)Convolution Neural Network (CNN)
Convolution Neural Network (CNN)
 
Zigbee 802-15-4
Zigbee 802-15-4Zigbee 802-15-4
Zigbee 802-15-4
 
Machine Learning for Recommender Systems MLSS 2015 Sydney
Machine Learning for Recommender Systems MLSS 2015 SydneyMachine Learning for Recommender Systems MLSS 2015 Sydney
Machine Learning for Recommender Systems MLSS 2015 Sydney
 
Regularization in deep learning
Regularization in deep learningRegularization in deep learning
Regularization in deep learning
 
Vanet ppt
Vanet pptVanet ppt
Vanet ppt
 
Clustering
ClusteringClustering
Clustering
 
UNIT-5 IoT Reference Architecture.pdf
UNIT-5 IoT Reference Architecture.pdfUNIT-5 IoT Reference Architecture.pdf
UNIT-5 IoT Reference Architecture.pdf
 

Similar to Realtime Data Analysis Patterns

How to Leverage Machine Learning (R, Hadoop, Spark, H2O) for Real Time Proces...
How to Leverage Machine Learning (R, Hadoop, Spark, H2O) for Real Time Proces...How to Leverage Machine Learning (R, Hadoop, Spark, H2O) for Real Time Proces...
How to Leverage Machine Learning (R, Hadoop, Spark, H2O) for Real Time Proces...Codemotion Tel Aviv
 
Scalable Machine Learning
Scalable Machine LearningScalable Machine Learning
Scalable Machine LearningMikio L. Braun
 
Extracting Insights from Data at Twitter
Extracting Insights from Data at TwitterExtracting Insights from Data at Twitter
Extracting Insights from Data at TwitterPrasad Wagle
 
Tibco Augmented Intelligence - Analytics, IoT, Big Data, Streaming 20161025
Tibco Augmented Intelligence - Analytics, IoT, Big Data, Streaming 20161025Tibco Augmented Intelligence - Analytics, IoT, Big Data, Streaming 20161025
Tibco Augmented Intelligence - Analytics, IoT, Big Data, Streaming 20161025Nicola Sandoli
 
Snowflakes in the Cloud Real world experience on a new approach for Big Data
Snowflakes in the Cloud Real world experience on a new approach for Big DataSnowflakes in the Cloud Real world experience on a new approach for Big Data
Snowflakes in the Cloud Real world experience on a new approach for Big DataDevFest DC
 
6 Open Source Data Science Projects To Impress Your Interviewer
6 Open Source Data Science Projects To Impress Your Interviewer6 Open Source Data Science Projects To Impress Your Interviewer
6 Open Source Data Science Projects To Impress Your InterviewerPrachiVarshney7
 
Modern Monitoring - devopsdays Cuba
Modern Monitoring - devopsdays CubaModern Monitoring - devopsdays Cuba
Modern Monitoring - devopsdays Cubabridgetkromhout
 
Introduction to Machine Learning with H2O and Python
Introduction to Machine Learning with H2O and PythonIntroduction to Machine Learning with H2O and Python
Introduction to Machine Learning with H2O and PythonJo-fai Chow
 
R, Spark, Tensorflow, H20.ai Applied to Streaming Analytics
R, Spark, Tensorflow, H20.ai Applied to Streaming AnalyticsR, Spark, Tensorflow, H20.ai Applied to Streaming Analytics
R, Spark, Tensorflow, H20.ai Applied to Streaming AnalyticsKai Wähner
 
How to Leverage Machine Learning (R, Hadoop, Spark, H2O) for Real Time Proces...
How to Leverage Machine Learning (R, Hadoop, Spark, H2O) for Real Time Proces...How to Leverage Machine Learning (R, Hadoop, Spark, H2O) for Real Time Proces...
How to Leverage Machine Learning (R, Hadoop, Spark, H2O) for Real Time Proces...Codemotion
 
Kafka, Killer of Point-to-Point Integrations, Lucian Lita
Kafka, Killer of Point-to-Point Integrations, Lucian LitaKafka, Killer of Point-to-Point Integrations, Lucian Lita
Kafka, Killer of Point-to-Point Integrations, Lucian Litaconfluent
 
H2O at Poznan R Meetup
H2O at Poznan R MeetupH2O at Poznan R Meetup
H2O at Poznan R MeetupJo-fai Chow
 
WSO2Con USA 2015: Patterns for Deploying Analytics in the Real World
WSO2Con USA 2015: Patterns for Deploying Analytics in the Real WorldWSO2Con USA 2015: Patterns for Deploying Analytics in the Real World
WSO2Con USA 2015: Patterns for Deploying Analytics in the Real WorldWSO2
 
Python PPT
Python PPTPython PPT
Python PPTEdureka!
 
Python for Big Data Analytics
Python for Big Data AnalyticsPython for Big Data Analytics
Python for Big Data AnalyticsEdureka!
 
London atlassian meetup 31 jan 2016 jira metrics-extract slides
London atlassian meetup 31 jan 2016 jira metrics-extract slidesLondon atlassian meetup 31 jan 2016 jira metrics-extract slides
London atlassian meetup 31 jan 2016 jira metrics-extract slidesRudiger Wolf
 
Scaling graph investigations with Math, GPUs, & Experts
Scaling graph investigations with Math, GPUs, & ExpertsScaling graph investigations with Math, GPUs, & Experts
Scaling graph investigations with Math, GPUs, & Expertsgraphistry
 
Hardcore Data Science - in Practice
Hardcore Data Science - in PracticeHardcore Data Science - in Practice
Hardcore Data Science - in PracticeMikio L. Braun
 
What to expect when you are visualizing
What to expect when you are visualizingWhat to expect when you are visualizing
What to expect when you are visualizingKrist Wongsuphasawat
 
Building Reactive Real-time Data Pipeline
Building Reactive Real-time Data PipelineBuilding Reactive Real-time Data Pipeline
Building Reactive Real-time Data PipelineTrieu Nguyen
 

Similar to Realtime Data Analysis Patterns (20)

How to Leverage Machine Learning (R, Hadoop, Spark, H2O) for Real Time Proces...
How to Leverage Machine Learning (R, Hadoop, Spark, H2O) for Real Time Proces...How to Leverage Machine Learning (R, Hadoop, Spark, H2O) for Real Time Proces...
How to Leverage Machine Learning (R, Hadoop, Spark, H2O) for Real Time Proces...
 
Scalable Machine Learning
Scalable Machine LearningScalable Machine Learning
Scalable Machine Learning
 
Extracting Insights from Data at Twitter
Extracting Insights from Data at TwitterExtracting Insights from Data at Twitter
Extracting Insights from Data at Twitter
 
Tibco Augmented Intelligence - Analytics, IoT, Big Data, Streaming 20161025
Tibco Augmented Intelligence - Analytics, IoT, Big Data, Streaming 20161025Tibco Augmented Intelligence - Analytics, IoT, Big Data, Streaming 20161025
Tibco Augmented Intelligence - Analytics, IoT, Big Data, Streaming 20161025
 
Snowflakes in the Cloud Real world experience on a new approach for Big Data
Snowflakes in the Cloud Real world experience on a new approach for Big DataSnowflakes in the Cloud Real world experience on a new approach for Big Data
Snowflakes in the Cloud Real world experience on a new approach for Big Data
 
6 Open Source Data Science Projects To Impress Your Interviewer
6 Open Source Data Science Projects To Impress Your Interviewer6 Open Source Data Science Projects To Impress Your Interviewer
6 Open Source Data Science Projects To Impress Your Interviewer
 
Modern Monitoring - devopsdays Cuba
Modern Monitoring - devopsdays CubaModern Monitoring - devopsdays Cuba
Modern Monitoring - devopsdays Cuba
 
Introduction to Machine Learning with H2O and Python
Introduction to Machine Learning with H2O and PythonIntroduction to Machine Learning with H2O and Python
Introduction to Machine Learning with H2O and Python
 
R, Spark, Tensorflow, H20.ai Applied to Streaming Analytics
R, Spark, Tensorflow, H20.ai Applied to Streaming AnalyticsR, Spark, Tensorflow, H20.ai Applied to Streaming Analytics
R, Spark, Tensorflow, H20.ai Applied to Streaming Analytics
 
How to Leverage Machine Learning (R, Hadoop, Spark, H2O) for Real Time Proces...
How to Leverage Machine Learning (R, Hadoop, Spark, H2O) for Real Time Proces...How to Leverage Machine Learning (R, Hadoop, Spark, H2O) for Real Time Proces...
How to Leverage Machine Learning (R, Hadoop, Spark, H2O) for Real Time Proces...
 
Kafka, Killer of Point-to-Point Integrations, Lucian Lita
Kafka, Killer of Point-to-Point Integrations, Lucian LitaKafka, Killer of Point-to-Point Integrations, Lucian Lita
Kafka, Killer of Point-to-Point Integrations, Lucian Lita
 
H2O at Poznan R Meetup
H2O at Poznan R MeetupH2O at Poznan R Meetup
H2O at Poznan R Meetup
 
WSO2Con USA 2015: Patterns for Deploying Analytics in the Real World
WSO2Con USA 2015: Patterns for Deploying Analytics in the Real WorldWSO2Con USA 2015: Patterns for Deploying Analytics in the Real World
WSO2Con USA 2015: Patterns for Deploying Analytics in the Real World
 
Python PPT
Python PPTPython PPT
Python PPT
 
Python for Big Data Analytics
Python for Big Data AnalyticsPython for Big Data Analytics
Python for Big Data Analytics
 
London atlassian meetup 31 jan 2016 jira metrics-extract slides
London atlassian meetup 31 jan 2016 jira metrics-extract slidesLondon atlassian meetup 31 jan 2016 jira metrics-extract slides
London atlassian meetup 31 jan 2016 jira metrics-extract slides
 
Scaling graph investigations with Math, GPUs, & Experts
Scaling graph investigations with Math, GPUs, & ExpertsScaling graph investigations with Math, GPUs, & Experts
Scaling graph investigations with Math, GPUs, & Experts
 
Hardcore Data Science - in Practice
Hardcore Data Science - in PracticeHardcore Data Science - in Practice
Hardcore Data Science - in Practice
 
What to expect when you are visualizing
What to expect when you are visualizingWhat to expect when you are visualizing
What to expect when you are visualizing
 
Building Reactive Real-time Data Pipeline
Building Reactive Real-time Data PipelineBuilding Reactive Real-time Data Pipeline
Building Reactive Real-time Data Pipeline
 

More from Mikio L. Braun

Bringing ML To Production, What Is Missing? AMLD 2020
Bringing ML To Production, What Is Missing? AMLD 2020Bringing ML To Production, What Is Missing? AMLD 2020
Bringing ML To Production, What Is Missing? AMLD 2020Mikio L. Braun
 
Academia to industry looking back on a decade of ml
Academia to industry looking back on a decade of mlAcademia to industry looking back on a decade of ml
Academia to industry looking back on a decade of mlMikio L. Braun
 
Architecting AI Applications
Architecting AI ApplicationsArchitecting AI Applications
Architecting AI ApplicationsMikio L. Braun
 
Machine Learning for Time Series, Strata London 2018
Machine Learning for Time Series, Strata London 2018Machine Learning for Time Series, Strata London 2018
Machine Learning for Time Series, Strata London 2018Mikio L. Braun
 
Data flow vs. procedural programming: How to put your algorithms into Flink
Data flow vs. procedural programming: How to put your algorithms into FlinkData flow vs. procedural programming: How to put your algorithms into Flink
Data flow vs. procedural programming: How to put your algorithms into FlinkMikio L. Braun
 
Cassandra - An Introduction
Cassandra - An IntroductionCassandra - An Introduction
Cassandra - An IntroductionMikio L. Braun
 
Cassandra - Eine Einführung
Cassandra - Eine EinführungCassandra - Eine Einführung
Cassandra - Eine EinführungMikio L. Braun
 

More from Mikio L. Braun (7)

Bringing ML To Production, What Is Missing? AMLD 2020
Bringing ML To Production, What Is Missing? AMLD 2020Bringing ML To Production, What Is Missing? AMLD 2020
Bringing ML To Production, What Is Missing? AMLD 2020
 
Academia to industry looking back on a decade of ml
Academia to industry looking back on a decade of mlAcademia to industry looking back on a decade of ml
Academia to industry looking back on a decade of ml
 
Architecting AI Applications
Architecting AI ApplicationsArchitecting AI Applications
Architecting AI Applications
 
Machine Learning for Time Series, Strata London 2018
Machine Learning for Time Series, Strata London 2018Machine Learning for Time Series, Strata London 2018
Machine Learning for Time Series, Strata London 2018
 
Data flow vs. procedural programming: How to put your algorithms into Flink
Data flow vs. procedural programming: How to put your algorithms into FlinkData flow vs. procedural programming: How to put your algorithms into Flink
Data flow vs. procedural programming: How to put your algorithms into Flink
 
Cassandra - An Introduction
Cassandra - An IntroductionCassandra - An Introduction
Cassandra - An Introduction
 
Cassandra - Eine Einführung
Cassandra - Eine EinführungCassandra - Eine Einführung
Cassandra - Eine Einführung
 

Recently uploaded

『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书rnrncn29
 
Top 10 Interactive Website Design Trends in 2024.pptx
Top 10 Interactive Website Design Trends in 2024.pptxTop 10 Interactive Website Design Trends in 2024.pptx
Top 10 Interactive Website Design Trends in 2024.pptxDyna Gilbert
 
SCM Symposium PPT Format Customer loyalty is predi
SCM Symposium PPT Format Customer loyalty is prediSCM Symposium PPT Format Customer loyalty is predi
SCM Symposium PPT Format Customer loyalty is predieusebiomeyer
 
IP addressing and IPv6, presented by Paul Wilson at IETF 119
IP addressing and IPv6, presented by Paul Wilson at IETF 119IP addressing and IPv6, presented by Paul Wilson at IETF 119
IP addressing and IPv6, presented by Paul Wilson at IETF 119APNIC
 
Film cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasaFilm cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasa494f574xmv
 
ETHICAL HACKING dddddddddddddddfnandni.pptx
ETHICAL HACKING dddddddddddddddfnandni.pptxETHICAL HACKING dddddddddddddddfnandni.pptx
ETHICAL HACKING dddddddddddddddfnandni.pptxNIMMANAGANTI RAMAKRISHNA
 
TRENDS Enabling and inhibiting dimensions.pptx
TRENDS Enabling and inhibiting dimensions.pptxTRENDS Enabling and inhibiting dimensions.pptx
TRENDS Enabling and inhibiting dimensions.pptxAndrieCagasanAkio
 
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书rnrncn29
 
Company Snapshot Theme for Business by Slidesgo.pptx
Company Snapshot Theme for Business by Slidesgo.pptxCompany Snapshot Theme for Business by Slidesgo.pptx
Company Snapshot Theme for Business by Slidesgo.pptxMario
 
Unidad 4 – Redes de ordenadores (en inglés).pptx
Unidad 4 – Redes de ordenadores (en inglés).pptxUnidad 4 – Redes de ordenadores (en inglés).pptx
Unidad 4 – Redes de ordenadores (en inglés).pptxmibuzondetrabajo
 
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书zdzoqco
 

Recently uploaded (11)

『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书
 
Top 10 Interactive Website Design Trends in 2024.pptx
Top 10 Interactive Website Design Trends in 2024.pptxTop 10 Interactive Website Design Trends in 2024.pptx
Top 10 Interactive Website Design Trends in 2024.pptx
 
SCM Symposium PPT Format Customer loyalty is predi
SCM Symposium PPT Format Customer loyalty is prediSCM Symposium PPT Format Customer loyalty is predi
SCM Symposium PPT Format Customer loyalty is predi
 
IP addressing and IPv6, presented by Paul Wilson at IETF 119
IP addressing and IPv6, presented by Paul Wilson at IETF 119IP addressing and IPv6, presented by Paul Wilson at IETF 119
IP addressing and IPv6, presented by Paul Wilson at IETF 119
 
Film cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasaFilm cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasa
 
ETHICAL HACKING dddddddddddddddfnandni.pptx
ETHICAL HACKING dddddddddddddddfnandni.pptxETHICAL HACKING dddddddddddddddfnandni.pptx
ETHICAL HACKING dddddddddddddddfnandni.pptx
 
TRENDS Enabling and inhibiting dimensions.pptx
TRENDS Enabling and inhibiting dimensions.pptxTRENDS Enabling and inhibiting dimensions.pptx
TRENDS Enabling and inhibiting dimensions.pptx
 
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
 
Company Snapshot Theme for Business by Slidesgo.pptx
Company Snapshot Theme for Business by Slidesgo.pptxCompany Snapshot Theme for Business by Slidesgo.pptx
Company Snapshot Theme for Business by Slidesgo.pptx
 
Unidad 4 – Redes de ordenadores (en inglés).pptx
Unidad 4 – Redes de ordenadores (en inglés).pptxUnidad 4 – Redes de ordenadores (en inglés).pptx
Unidad 4 – Redes de ordenadores (en inglés).pptx
 
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
 

Realtime Data Analysis Patterns

  • 1. Realtime Data Analysis Patterns Mikio Braun @mikiobraun streamdrill & TU Berlin O'Really Strata+Hadoop, Barcelona Nov 21, 2014 Mikio L. Braun, @mikiobraun Realtime Data Analysis Patterns (c) 2014 by Mikio Braun
  • 2. How it all started: Realtime Twitter Retweet Trends Rails app + PostgreSQL About 100 tweets/second,and it got worse Mikio L. Braun, @mikiobraun Realtime Data Analysis Patterns (c) 2014 by Mikio Braun
  • 3. Road from there ● Version 1.0: Rails + PostgreSQL – store and batch ● Version 2.0: Scala + Cassandra – stream processing & working data on disk ● Version 3.0: streamdrill – “in-memory realtime analytics database” – approximative algorithms to bound resources – moderate parallelism for some things Mikio L. Braun, @mikiobraun Realtime Data Analysis Patterns (c) 2014 by Mikio Braun
  • 4. Lessons learned? Not just one kind of realtime. Mikio L. Braun, @mikiobraun Realtime Data Analysis Patterns (c) 2014 by Mikio Braun
  • 5. Applications FFiinnaannccee GGaammiinngg MMoonniittoorriinngg AAddvveerrttiissmmeenntt SSeennssoorr NNeettwwoorrkkss SSoocciiaall MMeeddiiaa Attribution: flickr users kenteegardin, fguillen, torkildr, Docklandsboy, brewbooks, ellbrown, JasonAHowie Mikio L. Braun, @mikiobraun Realtime Data Analysis Patterns (c) 2014 by Mikio Braun
  • 6. Two Dimensions of Real-Time Complexity Latency ● counting ● trends ● outlier detection ● recommendation ● prediction (churn, etc.) ● now (ms, RTB) ● seconds (fraud) ● hours (monitoring) ● days (reporting) Mikio L. Braun, @mikiobraun Realtime Data Analysis Patterns (c) 2014 by Mikio Braun
  • 7. What makes realtime hard ● Many Events – 100 events / second – 360k per hour – 8.6M per day – 260M per month – 3.2B per year ● Many Objects http://www.flickr.com/photos/arenamontanus/269158554/ Mikio L. Braun, @mikiobraun Realtime Data Analysis Patterns (c) 2014 by Mikio Braun
  • 8. Classes of Realtime ● Events per second (100s? 1000s? 10k?) ● Number of objects (A few dozen? Millions?) ● Complexity (Counting? Trends?) ● Latency (Milliseconds? Hours?) Mikio L. Braun, @mikiobraun Realtime Data Analysis Patterns (c) 2014 by Mikio Braun
  • 9. General Architecture Mikio L. Braun, @mikiobraun Realtime Data Analysis Patterns (c) 2014 by Mikio Braun
  • 10. Data Acquisition ● Flat files / HDFS ● Apache Flume / Logstash ● Apache Kafka for distributed logging Mikio L. Braun, @mikiobraun Realtime Data Analysis Patterns (c) 2014 by Mikio Braun
  • 11. Processing ● Depending on Latency: Batch or Streaming ● Batch – Apache Hadoop – Apache Spark – Apache Flink ● Streaming – Apache Storm – Apache Samza Mikio L. Braun, @mikiobraun Realtime Data Analysis Patterns (c) 2014 by Mikio Braun
  • 12. Query Layer ● Hadoop/Storm/Spark have no query layer ● Some db backend like redis to store the results Mikio L. Braun, @mikiobraun Realtime Data Analysis Patterns (c) 2014 by Mikio Braun
  • 13. Lambda Architecture: Mixing Batch & Streaming http://lambda-architecture.net/
  • 14. Kappa Architecture http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
  • 15. Scaling vs. Approximation ● Scaling is expensive ● Not all results are relevant ● Data changes all the time anyway ● Approximate: Trade accuracy for resource usage
  • 16. Approximation harmful?
  • 17. Heavy Hitters ● Count activities over large item sets (millions or more, e.g. IP addresses, Twitter users) ● Interested in the most active elements only ● Fixed table of counts, e.g. frank 15, paul 12, jan 8, felix 5, leo 3, alex 2 ● Case 1: element already in the table (paul): increment its count, 12 → 13 ● Case 2: new element (nico): it replaces the entry with the smallest count (alex, 2) and takes over that count plus one, giving nico 3 ● Reference: Metwally, Agrawal, Abbadi, Efficient Computation of Frequent and Top-k Elements in Data Streams, International Conference on Database Theory, 2005
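The two cases on this slide can be sketched in a few lines. This is a minimal illustration in the spirit of the Space-Saving algorithm from the cited paper, not streamdrill's implementation; the class name `SpaceSaving` and its methods are hypothetical, and the linear scan for the minimum is chosen for brevity rather than speed.

```python
class SpaceSaving:
    """Fixed-size table of approximate counts (illustrative sketch)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.counts = {}  # item -> approximate count

    def add(self, item):
        if item in self.counts:
            # Case 1: element already in the table, just increment.
            self.counts[item] += 1
        elif len(self.counts) < self.capacity:
            # Table not yet full: start a fresh counter.
            self.counts[item] = 1
        else:
            # Case 2: new element evicts the smallest entry and
            # inherits its count plus one (an overestimate).
            victim = min(self.counts, key=self.counts.get)
            self.counts[item] = self.counts.pop(victim) + 1

    def top(self, k):
        ranked = sorted(self.counts.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:k]


# Reproduce the example table from the slide.
hh = SpaceSaving(capacity=6)
hh.counts = {"frank": 15, "paul": 12, "jan": 8, "felix": 5, "leo": 3, "alex": 2}
hh.add("paul")  # case 1
hh.add("nico")  # case 2
print(hh.top(3))
```

Because a newcomer inherits the evicted count, counts are never underestimated, and memory stays bounded by `capacity` regardless of how many distinct items appear in the stream.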
  • 18. Count Min Sketch ● Summarize histograms over large feature sets ● Like Bloom filters, but with counts instead of single bits ● [Figure: table of counters with m bins per row and n different hash functions, one row per hash function; an update for a new entry increments one bin in each row; the example query returns 1] ● Query: Take minimum over all hash functions ● Reference: G. Cormode and S. Muthukrishnan. An improved data stream summary: The count-min sketch and its applications. LATIN 2004, J. Algorithms 55(1): 58-75 (2005).
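The update and query rules on this slide can be sketched as follows. This is a minimal illustration, not a tuned implementation: the class name `CountMinSketch` is hypothetical, and deriving the n hash functions by salting SHA-1 with the row number is an assumption made for simplicity.

```python
import hashlib


class CountMinSketch:
    """depth rows (hash functions) x width bins of counters."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _bins(self, item):
        # One bin per row; the row number salts the hash (an assumption,
        # standing in for n independent hash functions).
        for row in range(self.depth):
            digest = hashlib.sha1(f"{row}:{item}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._bins(item):
            self.table[row][col] += count

    def query(self, item):
        # Collisions only ever add to a bin, so the minimum over all
        # rows is the least inflated estimate.
        return min(self.table[row][col] for row, col in self._bins(item))


cms = CountMinSketch(width=1000, depth=4)
for _ in range(5):
    cms.add("paul")
cms.add("frank", 2)
print(cms.query("paul"), cms.query("frank"))
```

The estimate can exceed the true count when items collide in every row, but it never falls below it. The table size is fixed up front, which matches the later remark that the sketch needs to know its size beforehand.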
  • 19. HyperLogLog ● Hash the stream to generate random bit strings ● Look for infrequent events (rare bit patterns) ● If an event has probability one in a hundred, about 100 events should have been seen on average by the time it occurs ● Average over many such estimates to improve accuracy
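A simplified sketch of this idea: the "infrequent event" is a hash value with many leading zero bits, which occurs with probability 2^-k for k zeros. The class name `HyperLogLog`, the use of SHA-1, and the default of 1024 registers are assumptions for illustration; the bias constant and the small-range correction follow the standard published formulation for 128 or more registers.

```python
import hashlib
import math


class HyperLogLog:
    """Approximate distinct count with 2**p small registers."""

    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha1(str(item).encode()).digest()
        h = int.from_bytes(digest[:8], "big")      # 64 pseudo-random bits
        index = h >> (64 - self.p)                  # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        # Rank = position of the first 1-bit in the remaining bits;
        # a high rank is the rare event being looked for.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)       # bias constant, m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


hll = HyperLogLog()
for i in range(10_000):
    hll.add(f"user-{i}")
print(round(hll.estimate()))
```

With 1024 registers the typical relative error is around 3 percent, and adding an item a second time never changes the registers, so duplicates do not affect the estimate.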
  • 20. Comparing Approx. Algorithms ● Heavy Hitters: – approx. counts + top-k – large memory requirement ● Count Min Sketch – approx. counts for all, but no top-k, no elements – needs to know size beforehand ● HyperLogLog – approx. number of distinct elements
  • 21. Exponential Decay
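The slide itself carries only the title. One common way to realize exponential decay for counters, shown here as an assumption rather than as the talk's or streamdrill's actual method, is to halve a count every fixed half-life and apply the decay lazily whenever the counter is touched. The class name `DecayedCounter` and its parameters are hypothetical.

```python
class DecayedCounter:
    """Counter whose value halves every `half_life` time units."""

    def __init__(self, half_life):
        self.half_life = half_life
        self.value = 0.0
        self.last_update = None  # timestamp of the last add()

    def _decayed(self, now):
        # Value the counter would have at time `now` without new events.
        if self.last_update is None:
            return 0.0
        elapsed = now - self.last_update
        return self.value * 0.5 ** (elapsed / self.half_life)

    def add(self, now, weight=1.0):
        # Timestamps are assumed to be non-decreasing.
        self.value = self._decayed(now) + weight
        self.last_update = now

    def read(self, now):
        return self._decayed(now)


counter = DecayedCounter(half_life=10.0)
counter.add(now=0.0)
counter.add(now=10.0)          # old event has decayed to 0.5, total 1.5
print(counter.read(now=20.0))  # 0.75
```

Storing only a value and a timestamp per counter means old activity fades out without keeping any per-event history, which is what makes decayed counts usable as trend scores.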
  • 22. Beyond Counting
  • 23. Streamdrill & Demos ● Realtime Analysis Solutions ● Core Engine: – Heavy Hitters + exponential decay + secondary indices – Instant counts & top-k results over time windows – In-memory – Written in Scala ● Modules – Profiling and Trending – Recommendations – Count Distinct
  • 24. Example: Twitter Stock Analysis http://play.streamdrill.com/vis/
  • 25. Example: Twitter Stock Analysis ● Trends: – symbol:combinations $AAPL:$GOOG – symbol:hashtag $AAPL:#trading – symbol:keywords $GOOG:disruption – symbol:mentions $GOOG:WallStreetCom – symbol trend $AAPL – symbol:url $FB:http://on.wsj.com/15fHaZW
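To make the trend keys above concrete, here is a hedged sketch of how a tweet might be turned into keys for a few of these trends. The function `trend_keys`, the regular expressions, and the output layout are assumptions for illustration; the slides do not show how the actual Tweet Analyzer extracts them, and keywords and URLs are left out for brevity.

```python
import re
from itertools import combinations

# Hypothetical patterns; real tweet parsing is more involved.
SYMBOL = re.compile(r"\$[A-Z]+\b")   # cashtags such as $AAPL
HASHTAG = re.compile(r"#\w+")        # hashtags such as #trading
MENTION = re.compile(r"@(\w+)")      # mentions, captured without the "@"


def trend_keys(text):
    """Map one tweet to the keys it would update in each trend."""
    symbols = sorted(set(SYMBOL.findall(text)))
    hashtags = sorted(set(HASHTAG.findall(text)))
    mentions = sorted(set(MENTION.findall(text)))
    return {
        "symbol": symbols,
        "symbol:combinations": [f"{a}:{b}" for a, b in combinations(symbols, 2)],
        "symbol:hashtag": [f"{s}:{h}" for s in symbols for h in hashtags],
        "symbol:mentions": [f"{s}:{m}" for s in symbols for m in mentions],
    }


keys = trend_keys("$GOOG and $AAPL up today #trading via @WallStreetCom")
for trend, values in keys.items():
    print(trend, values)
```

Each key would then be counted in its own trend, so that a query such as "top hashtags for $AAPL" becomes a prefix lookup rather than a scan over raw tweets.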
  • 26. Example: Twitter Stock Analysis
  • 27. Example: Twitter Stock Analysis
  • 28. Example: Twitter Stock Analysis [Architecture diagram with the components Twitter, Tweet Analyzer, streamdrill, and JavaScript via REST; arrows labeled "tweets" and "updates"]
  • 29. Realtime User Profiles
  • 30. Realtime User Profiles
  • 31. Realtime User Profiles
  • 32. Realtime User Profiles
  • 33. Realtime user profiles ● Process 10k events / second on one machine ● Track about 1 Million counts per 1 GB ● Shard by user for higher accuracy
  • 34. Realtime Data Analysis Patterns ● Acquisition / Processing / Query Layer ● Acquisition: Flat files and distributed logs ● Processing: Scaling batch or streaming ● Query Layer: Separate query from processing ● Lambda and Kappa Architecture ● Approximation as alternative to scaling ● Trends with indices as building blocks for data analysis
  • 35. Thank You Mikio Braun mikio@streamdrill.com @mikiobraun