Hadoop Performance

Agenda

     What is performance? Optimization?
     Case 1: Aggregation
     Case 2: Recommendations
     Case 3: Clustering
     Case 4: Matrix decomposition




What is Performance?

     Is doing something faster better?


     Is it the right task?


     Do you have a wide enough view?


     What is the right performance metric?




Aggregation

     Word-count and friends
       –   How many times did X occur?
       –   How many unique X’s occurred?


     Associative metrics permit decomposition
       –   Partial sums and grand totals for example
       –   Use combiners
       –   Use high resolution aggregates to compute low resolution aggregates


     Rank-based statistics do not permit decomposition
       –   Avoid them
       –   Use approximations

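As a concrete illustration of why associative metrics decompose (a minimal Python sketch, not from the talk; the shards and words are made up): per-shard partial counts play the role of combiner output, and the grand total is just the merge of the partial sums.

```python
from collections import Counter

# Two input shards, as mappers on separate splits would see them.
shards = [
    ["the", "time", "has", "come", "the", "walrus", "said"],
    ["to", "talk", "of", "many", "things", "the", "time"],
]

# Combiner step: each shard produces partial counts on its own.
partials = [Counter(shard) for shard in shards]

# Reduce step: addition is associative, so partial sums merge in any order
# and still give the same grand totals.
totals = Counter()
for p in partials:
    totals.update(p)

print(totals["the"])   # 3, the same as counting over the concatenated input
```
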
Inside Map-Reduce




                          [Figure: word-count data flow — Input → Map → Combine → Shuffle and sort → Reduce → Output. The fragment "The time has come," the Walrus said, "To talk of many things: Of shoes—and ships—and sealing-wax is mapped to (word, 1) pairs, combined into partial counts, shuffled, and reduced to totals such as come, 6; has, 8; the, 4; time, 14.]
Don’t Do This



                                    [Figure: the raw data is read three separate times, once each to produce the daily, weekly, and monthly aggregates.]
Do This Instead



                                    [Figure: the raw data is read once to produce daily aggregates; the weekly and monthly aggregates are then rolled up from the daily aggregates.]
Aggregation

     First rule:
       –   Don’t read the big input multiple times
       –   Compute longer term aggregates from short term aggregates


     Second rule:
       –   Don’t read the big input multiple times
       –   Compute multiple windowed aggregates at the same time




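Both rules in one sketch (toy records and invented field names, not the talk's code): the raw data is scanned exactly once to build daily sums, and the weekly and monthly aggregates are computed from the daily ones rather than from the raw input.

```python
from collections import defaultdict
from datetime import date, timedelta

# Toy stand-in for the big raw input: (day, value) event records.
start = date(2012, 10, 1)
raw = [(start + timedelta(days=i % 45), 1.0) for i in range(1000)]

# One pass over the raw data: high-resolution (daily) aggregates.
daily = defaultdict(float)
for day, value in raw:
    daily[day] += value

# Longer-term aggregates come from the daily sums, never from raw again.
weekly = defaultdict(float)
monthly = defaultdict(float)
for day, total in daily.items():
    year, week, _ = day.isocalendar()
    weekly[(year, week)] += total
    monthly[(day.year, day.month)] += total

print(len(daily), "days ->", len(weekly), "weeks,", len(monthly), "months")
```
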
Rank Statistics Can Be Tamed

     Approximate quartiles are easily computed
       –   (but sorted data is evil)
     Approximate unique counts are easily computed
       –   use Bloom filter and extrapolate from number of set bits
       –   use multiple filters at different down-sample rates
     Approximate high or low quantiles are easily computed
       –   keep largest 1000 elements
       –   keep largest 1000 elements from 10x down-sampled data
       –   and so on
     Approximate top-40 also possible


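One reading of the Bloom-filter bullet, as a toy sketch (single hash function, invented sizes; not the talk's code): the number of distinct items can be back-solved from the fraction of bits that end up set.

```python
import math
import random

m = 1 << 16                       # filter size in bits
bits = bytearray(m)               # one byte per bit, for simplicity

items = [random.randrange(10**9) for _ in range(50000)]
for x in items:
    bits[hash(x) % m] = 1         # single hash function

set_bits = sum(bits)
# With one hash, E[fraction of set bits] = 1 - (1 - 1/m)^n; invert to estimate n.
estimate = math.log(1.0 - set_bits / m) / math.log(1.0 - 1.0 / m)

print(len(set(items)), "unique vs. estimate", round(estimate))
```
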
Recommendations

     Common patterns in the past may predict common patterns in the
      future


     People who bought item x also bought item y


     But also, people who bought Chinese food in the past, …


     Or people in SoMa really liked this restaurant in the past




People who bought …

     Key operation is counting number of people who bought x and y
       –   for all x’s and all y’s


     The raw problem appears to be O(N^3)


     At the least, O(k_max^2)
       –   for most prolific user, there are k^2 pairs to count
       –   k_max can be near N


     Scalable problems must be O(N)



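The quadratic blow-up is easiest to see in code (a toy sketch with invented baskets): a user with k items contributes about k²/2 cooccurrence pairs, so a single indiscriminate user dominates the counting cost.

```python
from collections import Counter
from itertools import combinations

baskets = {
    "alice":  ["a", "b", "c"],
    "bob":    ["b", "c", "d", "e"],
    "qa_bot": ["item%d" % i for i in range(1000)],   # buys everything
}

cooccur = Counter()
pairs_per_user = {}
for user, items in baskets.items():
    pairs = list(combinations(sorted(set(items)), 2))
    pairs_per_user[user] = len(pairs)        # ~ k^2 / 2 pairs for k items
    cooccur.update(pairs)

print(pairs_per_user)          # qa_bot alone contributes 499,500 pairs
print(cooccur[("b", "c")])     # 2: seen together for both alice and bob
```
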
But …

     What do we learn from users who buy everything
       –   they have no discrimination
       –   they are often the QA team
       –   they tell us nothing


     What do we learn from items bought by everybody
       –   the dual of omnivorous buyers
       –   these are often teaser items
       –   they tell us nothing




Also …

     What would you learn about a user from purchases
       –   1 … 20?
       –   21 … 100?
       –   101 … 1000?
       –   1001 … ∞?


     What about learning about an item?
       –   how many people do we need to see before we understand the item?




So …

     Cheat!


     Downsample every user to at most 1000 interactions
       –   most recent
       –   most rare
       –   random selection
       –   whatever is easiest


     Now k_max ≤ 1000




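A one-function version of the cheat (random selection, since it is the simplest of the options above; the limit and data are illustrative):

```python
import random

def downsample(items, limit=1000, seed=0):
    """Keep at most `limit` interactions for one user, chosen at random."""
    items = list(items)
    if len(items) <= limit:
        return items
    return random.Random(seed).sample(items, limit)

history = ["item%d" % i for i in range(250000)]   # a pathologically busy user
print(len(downsample(history)))                   # 1000, so k_max stays bounded
```
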
The Fundamental Things Apply

     Don’t read the raw data repeatedly


     Sessionize and denormalize per hour/day/week
       –   that is, group by user
       –   expand items with categories and content descriptors if feasible


     Feed all down-stream processing in one pass
       –   baby join to item characteristics
       –   downsample
       –   count grand totals
       –   compute cooccurrences


Deployment Matters, Too

     For restaurant case, basic recommendation info includes:
       –   user x merchant histories
       –   user x cuisine histories
       –   top local restaurant by anomalous repeat visits
       –   restaurant x indicator merchant cooccurrence matrix
       –   restaurant x indicator cuisine cooccurrence matrix


     These can all be stored and accessed using text retrieval
      techniques


     Fast deployment using mirrors and NFS (not standard Hadoop)


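The slide only names the technique, so the sketch below shows the shape of the idea rather than the actual system: each restaurant becomes a retrievable "document" whose indicator merchants are plain tokens, and recommendation is a text-style query scored by overlap with the user's merchant history. All names, data, and the scoring function are invented; a real deployment would index these documents in a search engine.

```python
# Indicator cooccurrence rows stored as token sets, one "document" per restaurant.
indicator_docs = {
    "noodle_house": {"m17", "m23", "m99"},
    "taqueria":     {"m23", "m41"},
    "bistro":       {"m07", "m41", "m99"},
}

def recommend(user_merchants, docs, top_n=2):
    """Score each restaurant by overlap between its indicators and the user's history."""
    user = set(user_merchants)
    scored = sorted(((len(user & ind), name) for name, ind in docs.items()), reverse=True)
    return [name for score, name in scored[:top_n] if score > 0]

print(recommend(["m23", "m99"], indicator_docs))   # ['noodle_house', ...]
```
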
Non-Traditional Deployment Demo




                                    DEMO




EM Algorithms

     Start with random model estimates
     Use model estimates to classify examples
     Use classified examples to find maximum likelihood estimates
     Use model estimates to classify examples
     Use classified examples to find maximum likelihood estimates
       … And so on …




K-means as EM Algorithm

     Assign a random seed to each cluster
     Assign points to nearest cluster
     Move cluster to average of contained points
     Assign points to nearest cluster
                       … and so on …




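A minimal single-machine sketch of the two alternating steps (assign points to the nearest centroid, then move each centroid to the mean of its points); the map-reduce formulation on the next slide parallelizes exactly these steps. One-dimensional toy data, not from the talk.

```python
import random

def kmeans(points, k, iterations=10, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)                  # random seeds
    for _ in range(iterations):
        # "Classify" step: assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: (p - centroids[i]) ** 2)
            clusters[nearest].append(p)
        # "Re-estimate" step: move each centroid to the mean of its points.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

data = [random.gauss(0, 1) for _ in range(200)] + [random.gauss(10, 1) for _ in range(200)]
print(sorted(round(c, 1) for c in kmeans(data, 2)))    # roughly [0.0, 10.0]
```
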
K-means as Map-Reduce

     Assignment of points to cluster is trivially parallel


     Computation of new clusters is also parallel


     Moving points to averages is ideal for map-reduce




But …

     With map-reduce, iteration is evil


     Starting a program can take 10-30s


     Saving data to disk and then immediately reading from disk is silly


     Input might even fit in cluster memory




Fix #1

     Don’t do that!
     Use Spark
       –   in memory interactive map-reduce
       –   100x to 1000x faster
       –   must fit in memory
     Use Giraph
       –   BSP programming model rather than map-reduce
       –   essentially map-reduce-reduce-reduce…
     Use GraphLab
       –   Like BSP without the speed brakes
       –   100x faster


Fix #2

     Use a sketch-based algorithm


     Do one pass over the data to compute sketch of the data


     Cluster the sketch


     Done. With good theoretical bounds on accuracy


     Speedup of 3000x or more



An Example




The Problem

     Spirals are a classic counterexample for k-means
     Classic low dimensional manifold with added noise


     But clustering still makes modeling work well




An Example




An Example




The Cluster Proximity Features

     Every point can be described by the nearest cluster
       –   4.3 bits per point in this case
       –   Significant error that can be decreased (to a point) by increasing number of
           clusters
     Or by the proximity to the 2 nearest clusters (2 x 4.3 bits + 1 sign
      bit + 2 proximities)
       –   Error is negligible
       –   Unwinds the data into a simple representation




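A sketch of the proximity encoding described above: the ids of the two nearest centroids plus the two proximities (the slide's extra sign bit is left out here). The centroids and test point are invented.

```python
import math

# Pretend these centroids came from a k-means run.
centroids = [(2.0 * i, 0.0) for i in range(20)]

def proximity_features(point, centroids):
    """Describe a point by its two nearest centroid ids and the two proximities."""
    ranked = sorted((math.dist(point, c), i) for i, c in enumerate(centroids))
    (d1, i1), (d2, i2) = ranked[0], ranked[1]
    return i1, i2, d1, d2

print(proximity_features((3.1, 0.4), centroids))   # (2, 1, ~0.98, ~1.17)
```
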
Lots of Clusters Are Fine




Surrogate Method

     Start with sloppy clustering into κ = k log n clusters
     Use this sketch as a weighted surrogate for the data
     Cluster surrogate data using ball k-means
     Results are provably good for highly clusterable data
     Sloppy clustering is on-line
     Surrogate can be kept in memory
     Ball k-means pass can be done at any time




Algorithm Costs

     O(k d log n) per point per iteration for Lloyd’s algorithm
     Number of iterations not well known
     Iterations > log n is a reasonable assumption




Algorithm Costs

     Surrogate methods
       –   fast, sloppy single pass clustering with κ = k log n
       –   fast sloppy search for nearest cluster, O(d log κ) = O(d (log k + log log n))
           per point
       –   fast, in-memory, high-quality clustering of κ weighted centroids
                        O(κ k d + k³ d) = O(k² d log n + k³ d) for small k, high quality
                       O(κ d log k) or O(d log κ log k) for larger k, looser quality
       –   result is k high-quality centroids
            •    Even the sloppy clusters may suffice




Algorithm Costs

     How much faster for the sketch phase?
       –   take k = 2000, d = 10, n = 100,000
       –   k d log n = 2000 x 10 x 26 = 500,000
       –   log k + log log n = 11 + 5 = 17
       –   30,000 times faster is a bona fide big deal




Pragmatics

     But this requires a fast search internally
     Have to cluster on the fly for sketch
     Have to guarantee sketch quality
     Previous methods had very high complexity




How It Works

     For each point
       –   Find approximately nearest centroid (distance = d)
       –   If (d > threshold) new centroid
        –   Else if (u > d/threshold) new centroid
       –   Else add to nearest centroid
     If centroids > κ ≈ C log N
       –   Recursively cluster centroids with higher threshold


     Result is large set of centroids
       –   these provide approximation of original distribution
       –   we can cluster centroids to get a close approximation of clustering original
       –   or we can just use the result directly


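A toy one-dimensional rendering of the per-point rule above. Two assumptions of mine, not stated on the slide: u is read as a uniform random draw, and a new centroid is created with probability proportional to d/threshold, which matches the usual streaming k-means heuristic; the recursive re-clustering of centroids is only crudely imitated here by raising the threshold. A sketch of the idea, not the talk's implementation.

```python
import random

def sketch_cluster(points, threshold=1.0, max_centroids=50, seed=0):
    """Single-pass, sloppy clustering: returns a list of [location, weight] centroids."""
    rng = random.Random(seed)
    centroids = []
    for p in points:
        if not centroids:
            centroids.append([p, 1.0])
            continue
        c = min(centroids, key=lambda cand: abs(p - cand[0]))
        d = abs(p - c[0])
        if d > threshold or rng.random() < d / threshold:
            centroids.append([p, 1.0])       # far (or unlucky nearby) points start new centroids
        else:
            c[1] += 1.0
            c[0] += (p - c[0]) / c[1]        # fold the point into the running mean
        if len(centroids) > max_centroids:
            threshold *= 1.5                 # crude stand-in for recursive re-clustering
    return centroids

data = [random.gauss(m, 0.3) for m in (0.0, 5.0, 10.0) for _ in range(300)]
print(len(sketch_cluster(data)), "weighted centroids summarize", len(data), "points")
```
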
Matrix Decomposition

     Big matrices can often be compressed

                                    [Figure: a large matrix written as the product of two much smaller matrices.]

     Often used in recommendations




Nearest Neighbor

     Very high dimensional vectors can be compressed to 10-100
      dimensions with little loss of accuracy


     Fast search algorithms work up to dimension 50-100 but don’t work above that




Random Projections

     Many problems in high dimension can be reduced to low dimension


     Reductions with good distance approximation are available


     Surprisingly, these methods can be done using random vectors




Fundamental Trick

     Random orthogonal projection preserves action of A




                                    ‖Ax − Ay‖ ≈ ‖QᵀAx − QᵀAy‖




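A quick numpy check of the trick (a sketch, not the talk's code): with Q a random matrix with orthonormal columns, distances after projection track the original distances. The √(n/k) rescaling, not shown above, makes the projected distance an unbiased estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 500, 50
A = rng.normal(size=(n, n))                        # the operator A
x, y = rng.normal(size=n), rng.normal(size=n)

# Q: n x k random matrix with orthonormal columns.
Q, _ = np.linalg.qr(rng.normal(size=(n, k)))

lhs = np.linalg.norm(A @ x - A @ y)
rhs = np.sqrt(n / k) * np.linalg.norm(Q.T @ (A @ x - A @ y))
print(round(lhs, 1), round(rhs, 1))                # close, despite a 10x dimension cut
```
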
Projection Search


                                         [Figure: points projected onto a single random direction — the projected values give a total ordering!]
LSH Bit-match Versus Cosine
                       [Figure: scatter plot of cosine similarity (y axis, −1 to 1) against the number of matching LSH bits (x axis, 0 to 64), showing how bit matches relate to cosine similarity.]
But How?

                   Y = AW
                   Q₁R = Y
                   B = Q₁ᵀA
                   LQ₂ = B
                   USVᵀ = L

                   (Q₁U) S (Q₂V)ᵀ ≈ A




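One plausible reading of the recipe above is a randomized SVD: W is a random matrix, Q₁R = Y and LQ₂ = B are QR and LQ factorizations, and the small SVD of L is lifted back to a factorization of A. The numpy sketch below follows that reading (the LQ step is computed via a QR of Bᵀ); it is an interpretation, not the talk's code.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(1000, 40)) @ rng.normal(size=(40, 800))   # an exactly low-rank matrix
k = 60                                                         # sketch width > rank

W = rng.normal(size=(A.shape[1], k))      # random matrix W
Y = A @ W                                 # Y = A W
Q1, _ = np.linalg.qr(Y)                   # Q1 R = Y
B = Q1.T @ A                              # B = Q1^T A
Qt, R = np.linalg.qr(B.T)                 # LQ factorization of B, via QR of B^T
L, Q2 = R.T, Qt.T                         # so that L Q2 = B
U, s, Vt = np.linalg.svd(L)               # U S V^T = L  (small, cheap SVD)

approx = (Q1 @ U) @ np.diag(s) @ (Vt @ Q2)
print(np.linalg.norm(A - approx) / np.linalg.norm(A))          # ~1e-14: A is recovered
```
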
Summary

     Don’t repeat big scans
       –   Cascade aggregations
       –   Compute several aggregates at once
     Use approximate measures for rank statistics
     Downsample where appropriate
     Use non-traditional deployment
     Use sketches
     Use random projections




Contact Me!

     We’re hiring at MapR in US and Europe


     Come get the slides at
            http://www.mapr.com/company/events/cmu-hadoop-performance-11-1-12


     Get the code at
           https://github.com/tdunning


     Contact me at tdunning@maprtech.com or @ted_dunning



