SlideShare a Scribd company logo
1 of 32
1©MapR Technologies - Confidential
Real-time and Long-time with
Storm and Hadoop
2©MapR Technologies - Confidential
Real-time and Long-time with
Storm and Hadoop MapR
3©MapR Technologies - Confidential
 Contact:
– tdunning@maprtech.com
– @ted_dunning
 Slides and such (available late tonight):
– http://info.mapr.com/ted-pjug
 Hash tags: #mapr #pjug
4©MapR Technologies - Confidential
The Challenge
 Hadoop is great of processing vats of data
– But sucks for real-time (by design!)
 Storm is great for real-time processing
– But lacks any way to deal with batch processing
 It sounds like there isn’t a solution
– Neither fashionable solution handles everything
5©MapR Technologies - Confidential
This is not a problem.
It’s an opportunity!
6©MapR Technologies - Confidential
What is Map-Reduce?
 Map-reduce programs are defined (mostly) by
– A map function that does independent record transformations (and
deletions and replications)
– Reduce functions that do aggregation
 Map-reduce programs run in framework that
– Schedules and re-runs tasks
– Splits the input
– Moves map outputs to reduce inputs
– Receives the results
6
7©MapR Technologies - Confidential
Inside Map-Reduce
7
Input Map CombineShuffle
and sort
Reduce Output
Reduce
"The time has come," the Walrus said,
"To talk of many things:
Of shoes—and ships—and sealing-wax
the, 1
time, 1
has, 1
come, 1
…
come, [3,2,1]
has, [1,5,2]
the, [1,2,1]
time,
[10,1,3]
…
come, 6
has, 8
the, 4
time, 14
…
8©MapR Technologies - Confidential
Not Just Text
 Counting words is easy
 Many other problems work as well
– Sessionize user logs
– Very large scale joins
– Large scale matrix recommendations
– Computing the quadrillionth digit of π
 Map-reduce is inherently batch oriented
9©MapR Technologies - Confidential
Inside Map-Reduce
9
Input Map Shuffle
and sort
Reduce Output
road1, polyline(p1, p2, p3, …)
lake1, polygon(p4, p5, p7, p9, …)
road2, polyline(p6, p7, p9, …)
tile0918-1412, road1
tile1082-8143, road1
tile0014-3284, lake1
tile1082-8143, lake1
…
tile0918-1412, [road1]
tile1082-8143, [road1, lake1]
tile0014-3284, [lake1]
…
tile1082-8143, img#1
tile0014-3284, img#2
tile1082-8143, img#3
…
10©MapR Technologies - Confidential
What is Storm
 A Storm program is called a topology
– Spouts inject data into a topology
– Bolts process data
 The units of data are called tuples
 All processing is flow-through
 Bolts can buffer or persist
 Output tuples can be anchored
 Bolts that fail are restarted and un-acked tuples are replayed
11©MapR Technologies - Confidential
t
now
Hadoop is Not Very Real-time
Unprocessed
Data
Fully
processed
Latest full
period
Hadoop job
takes this
long for this
data
12©MapR Technologies - Confidential
t
now
Hadoop works
great back here
Storm
works
here
Real-time and Long-time together
Blended
view
Blended
view
Blended
View
13©MapR Technologies - Confidential
One Alternative
Search
Engine
NoSql
de Jour
Consumer
Real-time Long-time
?
14©MapR Technologies - Confidential
Problems
 Simply dumping into noSql engine doesn’t quite work
 Insert rate is limited
 No load isolation
– Big retrospective jobs kill real-time
 Low scan performance
– Hbase pretty good, but not stellar
 Difficult to set boundaries
– where does real-time end and long-time begin?
15©MapR Technologies - Confidential
Rough Design – Data Flow
Search
Engine
Query Event
Spout
Logger
Bolt
Counter
Bolt
Raw
Logs
Logger
Bolt
Semi
Agg
Hadoop
Aggregator
Snap
Long
agg
Query Event
Spout
Counter
Bolt
Logger
Bolt
16©MapR Technologies - Confidential
Closer Look
 Critical design goals:
– fast ack for all tuples
– fast restart of counter
 Ack happens when tuple hits the replay log (100’s of milliseconds)
 Restart involves replaying semi-agg’s + replay log (very fast)
Counter
Bolt
Replay
Log
Semi-
aggregated
records
Incoming
records
Real-time Long-time
17©MapR Technologies - Confidential
A Frozen Moment in Time
 Snapshot defines the dividing
line
 All data in the snap is long-
time, all after is real-time
 Semi-agg strategy allows clean
query
Semi
Agg
Hadoop
Aggregator
Snap
Long
agg
18©MapR Technologies - Confidential
Guarantees
 Counter output volume is small-ish
– the greater of k tuples per 100K inputs or k tuple/s
– 1 tuple/s/label/bolt for this exercise
 Persistence layer must provide guarantees
– distributed against node failure
– must have either readable flush or closed-append
 HDFS is distributed, but provides no guarantees and strange
semantics
 MapRfs is distributed, provides all necessary guarantees
19©MapR Technologies - Confidential
Presentation Layer
 Presentation must
– read recent output of Logger bolt
– read relevant output of Hadoop jobs
– combine semi-aggregated records
 User will see
– counts that increment within 0-2 s of events
– seamless meld of short and long-term data
20©MapR Technologies - Confidential
Example 2 – AB testing in real-time
 I have 15 versions of my landing page
 Each visitor is assigned to a version
– Which version?
 A conversion or sale or whatever can happen
– How long to wait?
 Some versions of the landing page are horrible
– Don’t want to give them traffic
21©MapR Technologies - Confidential
A Quick Diversion
 You see a coin
– What is the probability of heads?
– Could it be larger or smaller than that?
 I flip the coin and while it is in the air ask again
 I catch the coin and ask again
 I look at the coin (and you don’t) and ask again
 Why does the answer change?
– And did it ever have a single value?
22©MapR Technologies - Confidential
A Philosophical Conclusion
 Probability as expressed by humans is subjective and depends on
information and experience
23©MapR Technologies - Confidential
I Dunno
24©MapR Technologies - Confidential
5 heads out of 10 throws
25©MapR Technologies - Confidential
2 heads out of 12 throws
26©MapR Technologies - Confidential
Bayesian Bandit
 Compute distributions based on data
 Sample p1 and p2 from these distributions
 Put a coin in bandit 1 if p1 > p2
 Else, put the coin in bandit 2
27©MapR Technologies - Confidential
And it works!
11000 100 200 300 400 500 600 700 800 900 1000
0.12
0
0.01
0.02
0.03
0.04
0.05
0.06
0.07
0.08
0.09
0.1
0.11
n
regret
ε- greedy, ε = 0.05
Bayesian Bandit with Gamma- Normal
28©MapR Technologies - Confidential
Video Demo
29©MapR Technologies - Confidential
The Code
 Select an alternative
 Select and learn
 But we already know how to count!
n = dim(k)[1]
p0 = rep(0, length.out=n)
for (i in 1:n) {
p0[i] = rbeta(1, k[i,2]+1, k[i,1]+1)
}
return (which(p0 == max(p0)))
for (z in 1:steps) {
i = select(k)
j = test(i)
k[i,j] = k[i,j]+1
}
return (k)
30©MapR Technologies - Confidential
The Basic Idea
 We can encode a distribution by sampling
 Sampling allows unification of exploration and exploitation
 Can be extended to more general response models
31©MapR Technologies - Confidential
 Contact:
– tdunning@maprtech.com
– @ted_dunning
 Slides and such (available late tonight):
– http://info.mapr.com/ted-pjug
 Hash tags: #mapr #pjug
32©MapR Technologies - Confidential
Thank You

More Related Content

What's hot

"Approaches for Vision-based Driver Monitoring," a Presentation from PathPart...
"Approaches for Vision-based Driver Monitoring," a Presentation from PathPart..."Approaches for Vision-based Driver Monitoring," a Presentation from PathPart...
"Approaches for Vision-based Driver Monitoring," a Presentation from PathPart...Edge AI and Vision Alliance
 
Boston hug-2012-07
Boston hug-2012-07Boston hug-2012-07
Boston hug-2012-07Ted Dunning
 
"How to Test and Validate an Automated Driving System," a Presentation from M...
"How to Test and Validate an Automated Driving System," a Presentation from M..."How to Test and Validate an Automated Driving System," a Presentation from M...
"How to Test and Validate an Automated Driving System," a Presentation from M...Edge AI and Vision Alliance
 
Cmu Lecture on Hadoop Performance
Cmu Lecture on Hadoop PerformanceCmu Lecture on Hadoop Performance
Cmu Lecture on Hadoop PerformanceTed Dunning
 
Chicago finance-big-data
Chicago finance-big-dataChicago finance-big-data
Chicago finance-big-dataTed Dunning
 
Oxford 05-oct-2012
Oxford 05-oct-2012Oxford 05-oct-2012
Oxford 05-oct-2012Ted Dunning
 
"Collaboratively Benchmarking and Optimizing Deep Learning Implementations," ...
"Collaboratively Benchmarking and Optimizing Deep Learning Implementations," ..."Collaboratively Benchmarking and Optimizing Deep Learning Implementations," ...
"Collaboratively Benchmarking and Optimizing Deep Learning Implementations," ...Edge AI and Vision Alliance
 
Parallel & Distributed Deep Learning - Dataworks Summit
Parallel & Distributed Deep Learning - Dataworks SummitParallel & Distributed Deep Learning - Dataworks Summit
Parallel & Distributed Deep Learning - Dataworks SummitRafael Arana
 
Nervana and the Future of Computing
Nervana and the Future of ComputingNervana and the Future of Computing
Nervana and the Future of ComputingIntel Nervana
 
"New Dataflow Architecture for Machine Learning," a Presentation from Wave Co...
"New Dataflow Architecture for Machine Learning," a Presentation from Wave Co..."New Dataflow Architecture for Machine Learning," a Presentation from Wave Co...
"New Dataflow Architecture for Machine Learning," a Presentation from Wave Co...Edge AI and Vision Alliance
 
Nervana Systems
Nervana SystemsNervana Systems
Nervana SystemsNand Dalal
 
AWS Forcecast: DeepAR Predictor Time-series
AWS Forcecast: DeepAR Predictor Time-series AWS Forcecast: DeepAR Predictor Time-series
AWS Forcecast: DeepAR Predictor Time-series PolarSeven Pty Ltd
 
Uncertainties in large scale power systems
Uncertainties in large scale power systemsUncertainties in large scale power systems
Uncertainties in large scale power systemsOlivier Teytaud
 
Introduction to Deep Learning (NVIDIA)
Introduction to Deep Learning (NVIDIA)Introduction to Deep Learning (NVIDIA)
Introduction to Deep Learning (NVIDIA)Rakuten Group, Inc.
 
“A Highly Data-Efficient Deep Learning Approach,” a Presentation from Samsung
“A Highly Data-Efficient Deep Learning Approach,” a Presentation from Samsung“A Highly Data-Efficient Deep Learning Approach,” a Presentation from Samsung
“A Highly Data-Efficient Deep Learning Approach,” a Presentation from SamsungEdge AI and Vision Alliance
 
Deep learning: what? how? why? How to win a Kaggle competition
Deep learning: what? how? why? How to win a Kaggle competitionDeep learning: what? how? why? How to win a Kaggle competition
Deep learning: what? how? why? How to win a Kaggle competition317070
 
Data Science and Machine Learning Using Python and Scikit-learn
Data Science and Machine Learning Using Python and Scikit-learnData Science and Machine Learning Using Python and Scikit-learn
Data Science and Machine Learning Using Python and Scikit-learnAsim Jalis
 
Rethinking computation: A processor architecture for machine intelligence
Rethinking computation: A processor architecture for machine intelligenceRethinking computation: A processor architecture for machine intelligence
Rethinking computation: A processor architecture for machine intelligenceIntel Nervana
 

What's hot (19)

"Approaches for Vision-based Driver Monitoring," a Presentation from PathPart...
"Approaches for Vision-based Driver Monitoring," a Presentation from PathPart..."Approaches for Vision-based Driver Monitoring," a Presentation from PathPart...
"Approaches for Vision-based Driver Monitoring," a Presentation from PathPart...
 
Boston hug-2012-07
Boston hug-2012-07Boston hug-2012-07
Boston hug-2012-07
 
"How to Test and Validate an Automated Driving System," a Presentation from M...
"How to Test and Validate an Automated Driving System," a Presentation from M..."How to Test and Validate an Automated Driving System," a Presentation from M...
"How to Test and Validate an Automated Driving System," a Presentation from M...
 
Cmu Lecture on Hadoop Performance
Cmu Lecture on Hadoop PerformanceCmu Lecture on Hadoop Performance
Cmu Lecture on Hadoop Performance
 
Chicago finance-big-data
Chicago finance-big-dataChicago finance-big-data
Chicago finance-big-data
 
Oxford 05-oct-2012
Oxford 05-oct-2012Oxford 05-oct-2012
Oxford 05-oct-2012
 
"Collaboratively Benchmarking and Optimizing Deep Learning Implementations," ...
"Collaboratively Benchmarking and Optimizing Deep Learning Implementations," ..."Collaboratively Benchmarking and Optimizing Deep Learning Implementations," ...
"Collaboratively Benchmarking and Optimizing Deep Learning Implementations," ...
 
Parallel & Distributed Deep Learning - Dataworks Summit
Parallel & Distributed Deep Learning - Dataworks SummitParallel & Distributed Deep Learning - Dataworks Summit
Parallel & Distributed Deep Learning - Dataworks Summit
 
Nervana and the Future of Computing
Nervana and the Future of ComputingNervana and the Future of Computing
Nervana and the Future of Computing
 
"New Dataflow Architecture for Machine Learning," a Presentation from Wave Co...
"New Dataflow Architecture for Machine Learning," a Presentation from Wave Co..."New Dataflow Architecture for Machine Learning," a Presentation from Wave Co...
"New Dataflow Architecture for Machine Learning," a Presentation from Wave Co...
 
presentation
presentationpresentation
presentation
 
Nervana Systems
Nervana SystemsNervana Systems
Nervana Systems
 
AWS Forcecast: DeepAR Predictor Time-series
AWS Forcecast: DeepAR Predictor Time-series AWS Forcecast: DeepAR Predictor Time-series
AWS Forcecast: DeepAR Predictor Time-series
 
Uncertainties in large scale power systems
Uncertainties in large scale power systemsUncertainties in large scale power systems
Uncertainties in large scale power systems
 
Introduction to Deep Learning (NVIDIA)
Introduction to Deep Learning (NVIDIA)Introduction to Deep Learning (NVIDIA)
Introduction to Deep Learning (NVIDIA)
 
“A Highly Data-Efficient Deep Learning Approach,” a Presentation from Samsung
“A Highly Data-Efficient Deep Learning Approach,” a Presentation from Samsung“A Highly Data-Efficient Deep Learning Approach,” a Presentation from Samsung
“A Highly Data-Efficient Deep Learning Approach,” a Presentation from Samsung
 
Deep learning: what? how? why? How to win a Kaggle competition
Deep learning: what? how? why? How to win a Kaggle competitionDeep learning: what? how? why? How to win a Kaggle competition
Deep learning: what? how? why? How to win a Kaggle competition
 
Data Science and Machine Learning Using Python and Scikit-learn
Data Science and Machine Learning Using Python and Scikit-learnData Science and Machine Learning Using Python and Scikit-learn
Data Science and Machine Learning Using Python and Scikit-learn
 
Rethinking computation: A processor architecture for machine intelligence
Rethinking computation: A processor architecture for machine intelligenceRethinking computation: A processor architecture for machine intelligence
Rethinking computation: A processor architecture for machine intelligence
 

Viewers also liked

Transactional Data Mining Ted Dunning 2004
Transactional Data Mining Ted Dunning 2004Transactional Data Mining Ted Dunning 2004
Transactional Data Mining Ted Dunning 2004MapR Technologies
 
Strata 2014-tdunning-anomaly-detection-140211162923-phpapp01
Strata 2014-tdunning-anomaly-detection-140211162923-phpapp01Strata 2014-tdunning-anomaly-detection-140211162923-phpapp01
Strata 2014-tdunning-anomaly-detection-140211162923-phpapp01MapR Technologies
 
Big Data Lessons from the Cloud
Big Data Lessons from the CloudBig Data Lessons from the Cloud
Big Data Lessons from the CloudMapR Technologies
 
Lawrence Livermore Labs talk 2011
Lawrence Livermore Labs talk 2011Lawrence Livermore Labs talk 2011
Lawrence Livermore Labs talk 2011MapR Technologies
 
Introduction to Mahout given at Twin Cities HUG
Introduction to Mahout given at Twin Cities HUGIntroduction to Mahout given at Twin Cities HUG
Introduction to Mahout given at Twin Cities HUGMapR Technologies
 
Securing Hadoop by MapR's Senior Principal Technologist Keys Botzum
Securing Hadoop by MapR's Senior Principal Technologist Keys BotzumSecuring Hadoop by MapR's Senior Principal Technologist Keys Botzum
Securing Hadoop by MapR's Senior Principal Technologist Keys BotzumMapR Technologies
 

Viewers also liked (9)

SD Forum 11 04-2010
SD Forum 11 04-2010SD Forum 11 04-2010
SD Forum 11 04-2010
 
Transactional Data Mining Ted Dunning 2004
Transactional Data Mining Ted Dunning 2004Transactional Data Mining Ted Dunning 2004
Transactional Data Mining Ted Dunning 2004
 
Mahout classifier tour
Mahout classifier tourMahout classifier tour
Mahout classifier tour
 
Strata 2014-tdunning-anomaly-detection-140211162923-phpapp01
Strata 2014-tdunning-anomaly-detection-140211162923-phpapp01Strata 2014-tdunning-anomaly-detection-140211162923-phpapp01
Strata 2014-tdunning-anomaly-detection-140211162923-phpapp01
 
Big Data Lessons from the Cloud
Big Data Lessons from the CloudBig Data Lessons from the Cloud
Big Data Lessons from the Cloud
 
Lawrence Livermore Labs talk 2011
Lawrence Livermore Labs talk 2011Lawrence Livermore Labs talk 2011
Lawrence Livermore Labs talk 2011
 
Introduction to Mahout given at Twin Cities HUG
Introduction to Mahout given at Twin Cities HUGIntroduction to Mahout given at Twin Cities HUG
Introduction to Mahout given at Twin Cities HUG
 
Big Data Analytics London
Big Data Analytics LondonBig Data Analytics London
Big Data Analytics London
 
Securing Hadoop by MapR's Senior Principal Technologist Keys Botzum
Securing Hadoop by MapR's Senior Principal Technologist Keys BotzumSecuring Hadoop by MapR's Senior Principal Technologist Keys Botzum
Securing Hadoop by MapR's Senior Principal Technologist Keys Botzum
 

Similar to Real-time and Long-time Together

Real-time and long-time together
Real-time and long-time togetherReal-time and long-time together
Real-time and long-time togetherTed Dunning
 
Buzz words-dunning-real-time-learning
Buzz words-dunning-real-time-learningBuzz words-dunning-real-time-learning
Buzz words-dunning-real-time-learningTed Dunning
 
Goto amsterdam-2013-skinned
Goto amsterdam-2013-skinnedGoto amsterdam-2013-skinned
Goto amsterdam-2013-skinnedTed Dunning
 
CMU Lecture on Hadoop Performance
CMU Lecture on Hadoop PerformanceCMU Lecture on Hadoop Performance
CMU Lecture on Hadoop PerformanceMapR Technologies
 
Graphlab dunning-clustering
Graphlab dunning-clusteringGraphlab dunning-clustering
Graphlab dunning-clusteringTed Dunning
 
Storm users group real time hadoop
Storm users group real time hadoopStorm users group real time hadoop
Storm users group real time hadoopTed Dunning
 
Storm Users Group Real Time Hadoop
Storm Users Group Real Time HadoopStorm Users Group Real Time Hadoop
Storm Users Group Real Time HadoopMapR Technologies
 
New Directions for Mahout
New Directions for MahoutNew Directions for Mahout
New Directions for MahoutTed Dunning
 
Buzz Words Dunning Real-Time Learning
Buzz Words Dunning Real-Time LearningBuzz Words Dunning Real-Time Learning
Buzz Words Dunning Real-Time LearningMapR Technologies
 
Boston Hug by Ted Dunning 2012
Boston Hug by Ted Dunning 2012Boston Hug by Ted Dunning 2012
Boston Hug by Ted Dunning 2012MapR Technologies
 
Introduction to Mahout
Introduction to MahoutIntroduction to Mahout
Introduction to MahoutTed Dunning
 
Droidcon2013 triangles gangolells_imagination
Droidcon2013 triangles gangolells_imaginationDroidcon2013 triangles gangolells_imagination
Droidcon2013 triangles gangolells_imaginationDroidcon Berlin
 
Graphlab Ted Dunning Clustering
Graphlab Ted Dunning  ClusteringGraphlab Ted Dunning  Clustering
Graphlab Ted Dunning ClusteringMapR Technologies
 
The Sierra Supercomputer: Science and Technology on a Mission
The Sierra Supercomputer: Science and Technology on a MissionThe Sierra Supercomputer: Science and Technology on a Mission
The Sierra Supercomputer: Science and Technology on a Missioninside-BigData.com
 
What is the past future tense of data?
What is the past future tense of data?What is the past future tense of data?
What is the past future tense of data?Ted Dunning
 

Similar to Real-time and Long-time Together (20)

Real-time and long-time together
Real-time and long-time togetherReal-time and long-time together
Real-time and long-time together
 
Buzz words-dunning-real-time-learning
Buzz words-dunning-real-time-learningBuzz words-dunning-real-time-learning
Buzz words-dunning-real-time-learning
 
News From Mahout
News From MahoutNews From Mahout
News From Mahout
 
London hug
London hugLondon hug
London hug
 
Strata New York 2012
Strata New York 2012Strata New York 2012
Strata New York 2012
 
Goto amsterdam-2013-skinned
Goto amsterdam-2013-skinnedGoto amsterdam-2013-skinned
Goto amsterdam-2013-skinned
 
GoTo Amsterdam 2013 Skinned
GoTo Amsterdam 2013 SkinnedGoTo Amsterdam 2013 Skinned
GoTo Amsterdam 2013 Skinned
 
CMU Lecture on Hadoop Performance
CMU Lecture on Hadoop PerformanceCMU Lecture on Hadoop Performance
CMU Lecture on Hadoop Performance
 
Graphlab dunning-clustering
Graphlab dunning-clusteringGraphlab dunning-clustering
Graphlab dunning-clustering
 
Storm users group real time hadoop
Storm users group real time hadoopStorm users group real time hadoop
Storm users group real time hadoop
 
Storm Users Group Real Time Hadoop
Storm Users Group Real Time HadoopStorm Users Group Real Time Hadoop
Storm Users Group Real Time Hadoop
 
New directions for mahout
New directions for mahoutNew directions for mahout
New directions for mahout
 
New Directions for Mahout
New Directions for MahoutNew Directions for Mahout
New Directions for Mahout
 
Buzz Words Dunning Real-Time Learning
Buzz Words Dunning Real-Time LearningBuzz Words Dunning Real-Time Learning
Buzz Words Dunning Real-Time Learning
 
Boston Hug by Ted Dunning 2012
Boston Hug by Ted Dunning 2012Boston Hug by Ted Dunning 2012
Boston Hug by Ted Dunning 2012
 
Introduction to Mahout
Introduction to MahoutIntroduction to Mahout
Introduction to Mahout
 
Droidcon2013 triangles gangolells_imagination
Droidcon2013 triangles gangolells_imaginationDroidcon2013 triangles gangolells_imagination
Droidcon2013 triangles gangolells_imagination
 
Graphlab Ted Dunning Clustering
Graphlab Ted Dunning  ClusteringGraphlab Ted Dunning  Clustering
Graphlab Ted Dunning Clustering
 
The Sierra Supercomputer: Science and Technology on a Mission
The Sierra Supercomputer: Science and Technology on a MissionThe Sierra Supercomputer: Science and Technology on a Mission
The Sierra Supercomputer: Science and Technology on a Mission
 
What is the past future tense of data?
What is the past future tense of data?What is the past future tense of data?
What is the past future tense of data?
 

More from MapR Technologies

Converging your data landscape
Converging your data landscapeConverging your data landscape
Converging your data landscapeMapR Technologies
 
ML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationMapR Technologies
 
Self-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataSelf-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataMapR Technologies
 
Enabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureEnabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureMapR Technologies
 
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...MapR Technologies
 
ML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsMapR Technologies
 
Machine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMachine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMapR Technologies
 
Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action MapR Technologies
 
Live Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIsLive Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIsMapR Technologies
 
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageBringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageMapR Technologies
 
Live Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionLive Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionMapR Technologies
 
An Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data PlatformAn Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data PlatformMapR Technologies
 
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...MapR Technologies
 
Best Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in HealthcareBest Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in HealthcareMapR Technologies
 
Geo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsGeo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsMapR Technologies
 
MapR Product Update - Spring 2017
MapR Product Update - Spring 2017MapR Product Update - Spring 2017
MapR Product Update - Spring 2017MapR Technologies
 
3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data AnalyticsMapR Technologies
 
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsCisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsMapR Technologies
 
MapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR Technologies
 
Evolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLEvolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLMapR Technologies
 

More from MapR Technologies (20)

Converging your data landscape
Converging your data landscapeConverging your data landscape
Converging your data landscape
 
ML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & Evaluation
 
Self-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataSelf-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your Data
 
Enabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureEnabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data Capture
 
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
 
ML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning Logistics
 
Machine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMachine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model Management
 
Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action
 
Live Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIsLive Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIs
 
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageBringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
 
Live Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionLive Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn Prediction
 
An Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data PlatformAn Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data Platform
 
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
 
Best Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in HealthcareBest Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in Healthcare
 
Geo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsGeo-Distributed Big Data and Analytics
Geo-Distributed Big Data and Analytics
 
MapR Product Update - Spring 2017
MapR Product Update - Spring 2017MapR Product Update - Spring 2017
MapR Product Update - Spring 2017
 
3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics
 
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsCisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
 
MapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR and Cisco Make IT Better
MapR and Cisco Make IT Better
 
Evolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLEvolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQL
 

Recently uploaded

Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkPixlogix Infotech
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...itnewsafrica
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxfnnc6jmgwh
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024TopCSSGallery
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 

Recently uploaded (20)

Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App Framework
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 

Real-time and Long-time Together

  • 1. 1©MapR Technologies - Confidential Real-time and Long-time with Storm and Hadoop
  • 2. 2©MapR Technologies - Confidential Real-time and Long-time with Storm and Hadoop MapR
  • 3. 3©MapR Technologies - Confidential  Contact: – tdunning@maprtech.com – @ted_dunning  Slides and such (available late tonight): – http://info.mapr.com/ted-pjug  Hash tags: #mapr #pjug
  • 4. 4©MapR Technologies - Confidential The Challenge  Hadoop is great of processing vats of data – But sucks for real-time (by design!)  Storm is great for real-time processing – But lacks any way to deal with batch processing  It sounds like there isn’t a solution – Neither fashionable solution handles everything
  • 5. 5©MapR Technologies - Confidential This is not a problem. It’s an opportunity!
  • 6. 6©MapR Technologies - Confidential What is Map-Reduce?  Map-reduce programs are defined (mostly) by – A map function that does independent record transformations (and deletions and replications) – Reduce functions that do aggregation  Map-reduce programs run in framework that – Schedules and re-runs tasks – Splits the input – Moves map outputs to reduce inputs – Receives the results 6
  • 7. 7©MapR Technologies - Confidential Inside Map-Reduce 7 Input Map CombineShuffle and sort Reduce Output Reduce "The time has come," the Walrus said, "To talk of many things: Of shoes—and ships—and sealing-wax the, 1 time, 1 has, 1 come, 1 … come, [3,2,1] has, [1,5,2] the, [1,2,1] time, [10,1,3] … come, 6 has, 8 the, 4 time, 14 …
  • 8. 8©MapR Technologies - Confidential Not Just Text  Counting words is easy  Many other problems work as well – Sessionize user logs – Very large scale joins – Large scale matrix recommendations – Computing the quadrillionth digit of π  Map-reduce is inherently batch oriented
  • 9. 9©MapR Technologies - Confidential Inside Map-Reduce 9 Input Map Shuffle and sort Reduce Output road1, polyline(p1, p2, p3, …) lake1, polygon(p4, p5, p7, p9, …) road2, polyline(p6, p7, p9, …) tile0918-1412, road1 tile1082-8143, road1 tile0014-3284, lake1 tile1082-8143, lake1 … tile0918-1412, [road1] tile1082-8143, [road1, lake1] tile0014-3284, [lake1] … tile1082-8143, img#1 tile0014-3284, img#2 tile1082-8143, img#3 …
  • 10. 10©MapR Technologies - Confidential What is Storm  A Storm program is called a topology – Spouts inject data into a topology – Bolts process data  The units of data are called tuples  All processing is flow-through  Bolts can buffer or persist  Output tuples can be anchored  Bolts that fail are restarted and un-acked tuples are replayed
  • 11. 11©MapR Technologies - Confidential t now Hadoop is Not Very Real-time Unprocessed Data Fully processed Latest full period Hadoop job takes this long for this data
  • 12. 12©MapR Technologies - Confidential t now Hadoop works great back here Storm works here Real-time and Long-time together Blended view Blended view Blended View
  • 13. 13©MapR Technologies - Confidential One Alternative Search Engine NoSql de Jour Consumer Real-time Long-time ?
  • 14. 14©MapR Technologies - Confidential Problems  Simply dumping into noSql engine doesn’t quite work  Insert rate is limited  No load isolation – Big retrospective jobs kill real-time  Low scan performance – Hbase pretty good, but not stellar  Difficult to set boundaries – where does real-time end and long-time begin?
  • 15. 15©MapR Technologies - Confidential Rough Design – Data Flow Search Engine Query Event Spout Logger Bolt Counter Bolt Raw Logs Logger Bolt Semi Agg Hadoop Aggregator Snap Long agg Query Event Spout Counter Bolt Logger Bolt
  • 16. 16©MapR Technologies - Confidential Closer Look  Critical design goals: – fast ack for all tuples – fast restart of counter  Ack happens when tuple hits the replay log (100’s of milliseconds)  Restart involves replaying semi-agg’s + replay log (very fast) Counter Bolt Replay Log Semi- aggregated records Incoming records Real-time Long-time
  • 17. 17©MapR Technologies - Confidential A Frozen Moment in Time  Snapshot defines the dividing line  All data in the snap is long- time, all after is real-time  Semi-agg strategy allows clean query Semi Agg Hadoop Aggregator Snap Long agg
  • 18. 18©MapR Technologies - Confidential Guarantees  Counter output volume is small-ish – the greater of k tuples per 100K inputs or k tuple/s – 1 tuple/s/label/bolt for this exercise  Persistence layer must provide guarantees – distributed against node failure – must have either readable flush or closed-append  HDFS is distributed, but provides no guarantees and strange semantics  MapRfs is distributed, provides all necessary guarantees
  • 19. 19©MapR Technologies - Confidential Presentation Layer  Presentation must – read recent output of Logger bolt – read relevant output of Hadoop jobs – combine semi-aggregated records  User will see – counts that increment within 0-2 s of events – seamless meld of short and long-term data
  • 20. 20©MapR Technologies - Confidential Example 2 – AB testing in real-time  I have 15 versions of my landing page  Each visitor is assigned to a version – Which version?  A conversion or sale or whatever can happen – How long to wait?  Some versions of the landing page are horrible – Don’t want to give them traffic
  • 21. 21©MapR Technologies - Confidential A Quick Diversion  You see a coin – What is the probability of heads? – Could it be larger or smaller than that?  I flip the coin and while it is in the air ask again  I catch the coin and ask again  I look at the coin (and you don’t) and ask again  Why does the answer change? – And did it ever have a single value?
  • 22. 22©MapR Technologies - Confidential A Philosophical Conclusion  Probability as expressed by humans is subjective and depends on information and experience
  • 23. 23©MapR Technologies - Confidential I Dunno
  • 24. 24©MapR Technologies - Confidential 5 heads out of 10 throws
  • 25. 25©MapR Technologies - Confidential 2 heads out of 12 throws
  • 26. 26©MapR Technologies - Confidential Bayesian Bandit  Compute distributions based on data  Sample p1 and p2 from these distributions  Put a coin in bandit 1 if p1 > p2  Else, put the coin in bandit 2
  • 27. 27©MapR Technologies - Confidential And it works! 11000 100 200 300 400 500 600 700 800 900 1000 0.12 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0.11 n regret ε- greedy, ε = 0.05 Bayesian Bandit with Gamma- Normal
  • 28. 28©MapR Technologies - Confidential Video Demo
  • 29. 29©MapR Technologies - Confidential The Code  Select an alternative  Select and learn  But we already know how to count! n = dim(k)[1] p0 = rep(0, length.out=n) for (i in 1:n) { p0[i] = rbeta(1, k[i,2]+1, k[i,1]+1) } return (which(p0 == max(p0))) for (z in 1:steps) { i = select(k) j = test(i) k[i,j] = k[i,j]+1 } return (k)
  • 30. 30©MapR Technologies - Confidential The Basic Idea  We can encode a distribution by sampling  Sampling allows unification of exploration and exploitation  Can be extended to more general response models
  • 31. 31©MapR Technologies - Confidential  Contact: – tdunning@maprtech.com – @ted_dunning  Slides and such (available late tonight): – http://info.mapr.com/ted-pjug  Hash tags: #mapr #pjug
  • 32. 32©MapR Technologies - Confidential Thank You