CMU Lecture on Hadoop Performance
Ted Dunning
A lecture describing several ways to make Hadoop programs go faster.
1.
Hadoop Performance
©MapR Technologies - Confidential
2.
Agenda
What is performance?
Optimization?
Case 1: Aggregation
Case 2: Recommendations
Case 3: Clustering
Case 4: Matrix decomposition
3.
What is Performance?
Is doing something faster better?
Is it the right task?
Do you have a wide enough view?
What is the right performance metric?
4.
Aggregation
Word-count and friends
  – How many times did X occur?
  – How many unique X’s occurred?
Associative metrics permit decomposition
  – Partial sums and grand totals for example
  – Use combiners
  – Use high resolution aggregates to compute low resolution aggregates
Rank-based statistics do not permit decomposition
  – Avoid them
  – Use approximations
5.
Inside Map-Reduce
[Figure: word-count data flow through the stages Input, Map, Combine, Shuffle and sort, Reduce, Output.
Input: "The time has come," the Walrus said, / "To talk of many things: / Of shoes—and ships—and sealing-wax
Map emits pairs such as (the, 1), (time, 1), (come, 1), (has, 1).
After combine and shuffle, each key carries a list of partial counts: come, [3,2,1]; has, [1,5,2]; the, [1,2,1]; time, [10,1,3].
Reduce sums each list: come, 6; has, 8; the, 4; time, 14.]
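The data flow on this slide can be sketched in a few lines of plain Python. This is a minimal single-process illustration, not Hadoop code: the function names `map_phase`, `combine`, and `word_count` are hypothetical, and each inner list of lines stands in for one input split.

```python
from collections import Counter, defaultdict


def map_phase(line):
    """Map: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        word = word.strip('.,:;"!?')
        if word:
            yield word, 1


def combine(pairs):
    """Combine: pre-sum counts inside one split before the shuffle."""
    partial = Counter()
    for word, count in pairs:
        partial[word] += count
    return partial.items()


def word_count(splits):
    """Run map, combine, shuffle and reduce over a list of splits."""
    shuffled = defaultdict(list)  # shuffle: group partial counts by key
    for split in splits:
        pairs = (pair for line in split for pair in map_phase(line))
        for word, count in combine(pairs):
            shuffled[word].append(count)
    # reduce: sum the partial counts for each key, keys in sorted order
    return {word: sum(counts) for word, counts in sorted(shuffled.items())}
```

Because addition is associative, the combiner can collapse repeated keys inside a split, so the shuffle moves one partial count per key per split rather than one record per word occurrence.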
6.
Don’t Do This
[Figure: boxes labeled Raw, Daily, Weekly, and Monthly, with each aggregate computed directly from Raw.]
7.
Do This Instead
[Figure: boxes labeled Raw, Daily, Weekly, and Monthly, with Raw read once and the longer-term aggregates computed from the shorter-term ones.]
8.
Aggregation
First rule:
  – Don’t read the big input multiple times
  – Compute longer term aggregates from short term aggregates
Second rule:
  – Don’t read the big input multiple times
  – Compute multiple windowed aggregates at the same time
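Both rules can be illustrated with a small roll-up. The sketch below assumes raw events are `(date, key)` pairs and that the aggregate is a simple count; the helper names `daily_counts`, `roll_up`, `iso_week`, and `month` are hypothetical. The raw input is scanned once, and the weekly and monthly windows are derived from the much smaller daily table.

```python
from collections import Counter
from datetime import date


def daily_counts(events):
    """Single pass over the raw (date, key) events."""
    counts = Counter()
    for day, key in events:
        counts[(day, key)] += 1
    return counts


def roll_up(daily, bucket):
    """Build a coarser aggregate from daily counts, never from raw data."""
    out = Counter()
    for (day, key), n in daily.items():
        out[(bucket(day), key)] += n
    return out


def iso_week(day):
    """Bucket a date by (ISO year, ISO week number)."""
    iso = day.isocalendar()
    return (iso[0], iso[1])


def month(day):
    """Bucket a date by (year, month)."""
    return (day.year, day.month)
```

A typical use computes `daily = daily_counts(events)` once and then calls `roll_up(daily, iso_week)` and `roll_up(daily, month)`. This works because counts are associative; it would not work for rank statistics such as medians.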
9.
Rank Statistics Can Be Tamed
Approximate quartiles are easily computed
  – (but sorted data is evil)
Approximate unique counts are easily computed
  – use Bloom filter and extrapolate from number of set bits
  – use multiple filters at different down-sample rates
Approximate high or low quantiles are easily computed
  – keep largest 1000 elements
  – keep largest 1000 elements from 10x down-sampled data
  – and so on
Approximate top-40 also possible
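The "extrapolate from number of set bits" idea can be sketched as follows. This is an illustration under stated assumptions, not the lecture's code: the class name `BloomCounter`, the filter size `m`, the number of hash functions `k`, and the use of SHA-256 as the hash are all choices made here. With X of m bits set after inserting n distinct items, the expected fill is about 1 - exp(-k n / m), which inverts to n ≈ -(m / k) ln(1 - X / m).

```python
import hashlib
import math


class BloomCounter:
    """Bloom filter that estimates how many distinct items were added."""

    def __init__(self, m=1 << 14, k=2):
        self.m = m                 # number of bits in the filter
        self.k = k                 # number of hash functions
        self.bits = bytearray(m)   # one byte per bit, for clarity only

    def _positions(self, item):
        # Derive k bit positions by salting the item with the hash index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def estimate(self):
        """Invert the expected fill fraction to estimate the distinct count."""
        set_bits = sum(self.bits)
        if set_bits == self.m:
            return float("inf")    # filter is saturated, no estimate possible
        return -(self.m / self.k) * math.log(1 - set_bits / self.m)
```

The estimate degrades as the filter fills up, which is one reason to keep several filters at different down-sample rates and read the one that is neither nearly empty nor nearly full.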
10.
Recommendations
Common patterns in the past may predict common patterns in the future
People who bought item x also bought item y
But also, people who bought Chinese food in the past, …
Or people in SoMa really liked this restaurant in the past
11.
People who bought …
Key operation is counting number of people who bought x and y
  – for all x’s and all y’s
The raw problem appears to be O(N^3)
At the least, O(k_max^2)
  – for most prolific user, there are k^2 pairs to count
  – k_max can be near N
Scalable problems must be O(N)
12.
But …
What do we learn from users who buy everything
  – they have no discrimination
  – they are often the QA team
  – they tell us nothing
What do we learn from items bought by everybody
  – the dual of omnivorous buyers
  – these are often teaser items
  – they tell us nothing
13.
Also …
What would you learn about a user from purchases
  – 1 … 20?
  – 21 … 100?
  – 101 … 1000?
  – 1001 … ∞?
What about learning about an item?
  – how many people do we need to see before we understand the item?
14.
So …
Cheat!
Downsample every user to at most 1000 interactions
  – most recent
  – most rare
  – random selection
  – whatever is easiest
Now k_max ≤ 1000
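A minimal sketch of the downsampling trick, assuming each user's history is a list of item ids held in a dict and using the "random selection" option from the slide. The function names `downsample` and `cooccurrences` are hypothetical. Capping each history at `k_max` items bounds the pairs per user by k_max (k_max - 1) / 2, so total work grows linearly with the number of users.

```python
import random
from collections import Counter
from itertools import combinations


def downsample(history, k_max, rng):
    """Keep at most k_max distinct items from one user's history."""
    items = sorted(set(history))
    if len(items) <= k_max:
        return items
    return sorted(rng.sample(items, k_max))  # random selection variant


def cooccurrences(histories, k_max=1000, seed=0):
    """Count, for each item pair (x, y) with x < y, the users who had both."""
    rng = random.Random(seed)
    counts = Counter()
    for history in histories.values():
        for x, y in combinations(downsample(history, k_max, rng), 2):
            counts[(x, y)] += 1
    return counts
```

Swapping the sampling rule for "most recent" or "most rare" only changes `downsample`; the counting loop stays the same.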
15.
The Fundamental Things Apply
Don’t read the raw data repeatedly
Sessionize and denormalize per hour/day/week
  – that is, group by user
  – expand items with categories and content descriptors if feasible
Feed all down-stream processing in one pass
  – baby join to item characteristics
  – downsample
  – count grand totals
  – compute cooccurrences
16.
Deployment Matters, Too
For restaurant case, basic recommendation info includes:
  – user x merchant histories
  – user x cuisine histories
  – top local restaurant by anomalous repeat visits
  – restaurant x indicator merchant cooccurrence matrix
  – restaurant x indicator cuisine cooccurrence matrix
These can all be stored and accessed using text retrieval techniques
Fast deployment using mirrors and NFS (not standard Hadoop)
17.
Non-Traditional Deployment Demo
DEMO
18.
EM Algorithms
Start with random model estimates
Use model estimates to classify examples
Use classified examples to find probability maximum estimates
Use model estimates to classify examples
Use classified examples to find probability maximum estimates
… And so on …
19.
K-means as EM Algorithm
Assign a random seed to each cluster
Assign points to nearest cluster
Move cluster to average of contained points
Assign points to nearest cluster
… and so on …
20.
K-means as Map-Reduce
Assignment of points to cluster is trivially parallel
Computation of new clusters is also parallel
Moving points to averages is ideal for map-reduce
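One k-means iteration maps naturally onto the map and reduce roles described here. The sketch below is plain single-process Python, not Hadoop code; `nearest`, `kmeans_step`, and `kmeans` are hypothetical names, and the initial centroids are passed in by the caller rather than seeded at random.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def kmeans_step(points, centroids):
    """One iteration: assign points (map), then average them (reduce)."""
    sums = {}  # cluster id -> (coordinate sums, point count)
    for point in points:
        # Map side: each point is assigned independently of the others.
        i = nearest(point, centroids)
        total, n = sums.get(i, ([0.0] * len(point), 0))
        # Combiner-style partial sum; sums and counts are associative.
        sums[i] = ([a + b for a, b in zip(total, point)], n + 1)
    # Reduce side: new centroid is sum / count; empty clusters stay put.
    return [
        [a / sums[i][1] for a in sums[i][0]] if i in sums else list(c)
        for i, c in enumerate(centroids)
    ]


def kmeans(points, centroids, iterations=10):
    """Repeat the assign-and-average step a fixed number of times."""
    for _ in range(iterations):
        centroids = kmeans_step(points, centroids)
    return centroids
```

In a real map-reduce job each call to `kmeans_step` would be a separate job launch with the data reread from disk.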
21.
But …
With map-reduce, iteration is evil
Starting a program can take 10-30s
Saving data to disk and then immediately reading from disk is silly
Input might even fit in cluster memory
22.
Fix #1
Don’t do that!
Use Spark
  – in memory interactive map-reduce
  – 100x to 1000x faster
  – must fit in memory
Use Giraph
  – BSP programming model rather than map-reduce
  – essentially map-reduce-reduce-reduce…
Use GraphLab
  – Like BSP without the speed brakes
  – 100x faster
23.
Fix #2
Use a sketch-based algorithm
Do one pass over the data to compute sketch of the data
Cluster the sketch
Done. With good theoretic bounds on accuracy
Speedup of 3000x or more
24.
An Example
25.
The Problem
Spirals are a classic “counter” example for k-means
Classic low dimensional manifold with added noise
But clustering still makes modeling work well
26.
An Example
27.
An Example
28.
The Cluster Proximity Features
Every point can be described by the nearest cluster
  – 4.3 bits per point in this case
  – Significant error that can be decreased (to a point) by increasing number of clusters
Or by the proximity to the 2 nearest clusters (2 x 4.3 bits + 1 sign bit + 2 proximities)
  – Error is negligible
  – Unwinds the data into a simple representation
29.
Lots of Clusters Are Fine
30.
Surrogate Method
Start with sloppy clustering into κ = k log n clusters
Use this sketch as a weighted surrogate for the data
Cluster surrogate data using ball k-means
Results are provably good for highly clusterable data
Sloppy clustering is on-line
Surrogate can be kept in memory
Ball k-means pass can be done at any time
31.
Algorithm Costs
O(k d log n) per point per iteration for Lloyd’s algorithm
Number of iterations not well known
Iteration > log n reasonable assumption
32.
Algorithm Costs
Surrogate methods
  – fast, sloppy single pass clustering with κ = k log n
  – fast sloppy search for nearest cluster, O(d log κ) = O(d (log k + log log n)) per point
  – fast, in-memory, high-quality clustering of κ weighted centroids
      O(κ k d + k^3 d) = O(k^2 d log n + k^3 d) for small k, high quality
      O(κ d log k) or O(d log κ log k) for larger k, looser quality
  – result is k high-quality centroids
      • Even the sloppy clusters may suffice
33.
Algorithm Costs
How much faster for the sketch phase?
  – take k = 2000, d = 10, n = 100,000
  – k d log n = 2000 x 10 x 26 = 500,000
  – log k + log log n = 11 + 5 = 17
  – 30,000 times faster is a bona fide big deal
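The cost comparison can be re-derived numerically. The sketch below assumes base-2 logarithms, which the slides do not state, and the function names are hypothetical. Under that assumption log n ≈ 26 corresponds to n near 10^8 rather than 10^5, and the ratio depends on whether the factor d from the O(d log κ) search cost is counted, so `n` is left as a parameter and both ratios are returned.

```python
import math


def lloyd_cost(k, d, n):
    """k * d * log2(n): the per-point cost quoted for Lloyd's algorithm."""
    return k * d * math.log2(n)


def sketch_search_terms(k, n):
    """log2(kappa) with kappa = k * log2(n), i.e. log k + log log n."""
    return math.log2(k * math.log2(n))


def speedups(k, d, n):
    """Return (ratio counting d in the search cost, ratio ignoring d)."""
    lloyd = lloyd_cost(k, d, n)
    terms = sketch_search_terms(k, n)
    return lloyd / (d * terms), lloyd / terms
```

Calling `speedups(2000, 10, 10**8)` gives a ratio in the low thousands when d is counted and in the tens of thousands when it is not, which brackets both speedup figures quoted in the deck.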
34.
Pragmatics
But this requires a fast search internally
Have to cluster on the fly for sketch
Have to guarantee sketch quality
Previous methods had very high complexity
35.
How It Works
For each point
  – Find approximately nearest centroid (distance = d)
  – If (d > threshold) new centroid
  – Else if (u > d/threshold) new cluster
  – Else add to nearest centroid
If centroids > κ ≈ C log N
  – Recursively cluster centroids with higher threshold
Result is large set of centroids
  – these provide approximation of original distribution
  – we can cluster centroids to get a close approximation of clustering original
  – or we can just use the result directly
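The steps above can be sketched as a small single-pass routine. Several things here are assumptions rather than the lecture's implementation: u is taken to be uniform on [0, 1), the random test is read as "open a new centroid with probability d/threshold" (which also covers the d > threshold case), the nearest-centroid search is exhaustive rather than approximate, and the names `sketch`, `_add`, `max_centroids`, and `growth` are hypothetical.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def _add(centroids, point, weight, threshold, rng):
    """Merge a weighted point into its nearest centroid or open a new one."""
    if centroids:
        j = min(range(len(centroids)), key=lambda i: dist(point, centroids[i][0]))
        d = dist(point, centroids[j][0])
        # Assumption: a new centroid opens with probability min(1, d / threshold).
        if rng.random() >= d / threshold:
            mean, w = centroids[j]
            total = w + weight
            merged = [m + (x - m) * weight / total for m, x in zip(mean, point)]
            centroids[j] = (merged, total)
            return
    centroids.append((list(point), weight))


def sketch(points, max_centroids, threshold=1.0, growth=1.5, seed=0):
    """One pass over points, keeping at most max_centroids weighted centroids."""
    rng = random.Random(seed)
    centroids = []
    for point in points:
        _add(centroids, point, 1.0, threshold, rng)
        while len(centroids) > max_centroids:
            # Too many centroids: re-cluster them with a higher threshold.
            threshold *= growth
            old, centroids = centroids, []
            for mean, w in old:
                _add(centroids, mean, w, threshold, rng)
    return centroids
```

The result is a list of `(mean, weight)` pairs. Total weight and the weighted sum of the means are preserved by every merge, which is what lets the sketch stand in for the original data in a later clustering pass.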
36.
Matrix Decomposition
Many big matrices can often be compressed
[Figure: a large matrix shown equal (=) to its compressed form.]
Often used in recommendations
37.
Nearest Neighbor
Very high dimensional vectors can be compressed to 10-100 dimensions with little loss of accuracy
Fast search algorithms work up to dimension 50-100, don’t work above that
38.
Random Projections
Many problems in high dimension can be reduced to low dimension
Reductions with good distance approximation are available
Surprisingly, these methods can be done using random vectors
39.
Fundamental Trick
Random orthogonal projection preserves action of A
$$\|Ax - Ay\| \approx \|Q^T A x - Q^T A y\|$$
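The distance-preserving property is easy to check empirically. The sketch below swaps in a scaled Gaussian random matrix in place of an exactly orthogonal projection, which is an assumption made here for brevity; such a matrix preserves distances approximately, in expectation. The function names are hypothetical.

```python
import math
import random


def random_projection(dim_in, dim_out, seed=0):
    """dim_out x dim_in Gaussian matrix, scaled so lengths are kept on average."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(dim_out)
    return [[rng.gauss(0, 1) * scale for _ in range(dim_in)] for _ in range(dim_out)]


def project(matrix, vector):
    """Multiply the projection matrix by one vector."""
    return [sum(r * x for r, x in zip(row, vector)) for row in matrix]


def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distortion(matrix, x, y):
    """Ratio of projected distance to original distance (ideally near 1)."""
    return distance(project(matrix, x), project(matrix, y)) / distance(x, y)
```

With an output dimension of a few dozen the ratio typically stays within roughly ten percent of 1, and the spread shrinks as the output dimension grows.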
40.
Projection Search
total ordering!
41.
LSH Bit-match Versus Cosine
[Figure: plot with an X axis running from 0 to 64 and a Y axis running from -1 to 1.]
42.
But How?
$$Y = A \Omega$$
$$Q_1 R = Y$$
$$B = Q_1^T A$$
$$L Q_2 = B$$
$$U S V^T = L$$
$$(Q_1 U) \, S \, (Q_2^T V)^T \approx A$$
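The sequence of factorizations on this slide can be followed step by step with NumPy. This is a minimal in-memory sketch, not a distributed implementation: it assumes NumPy is installed, takes Ω to be a Gaussian random matrix, obtains the LQ step from a QR factorization of the transpose of B, and uses hypothetical names (`stochastic_svd`, `oversample`).

```python
import numpy as np


def stochastic_svd(A, k, oversample=10, seed=0):
    """Approximate rank-k SVD of A following the steps on the slide."""
    rng = np.random.default_rng(seed)
    p = min(k + oversample, min(A.shape))
    omega = rng.standard_normal((A.shape[1], p))  # random test matrix Omega
    Y = A @ omega                                 # Y = A Omega
    Q1, _ = np.linalg.qr(Y)                       # Q1 R = Y
    B = Q1.T @ A                                  # B = Q1^T A (small: p rows)
    Q2t, Lt = np.linalg.qr(B.T)                   # B^T = Q2^T L^T, so L Q2 = B
    L = Lt.T
    U, S, Vt = np.linalg.svd(L)                   # U S V^T = L
    left = Q1 @ U                                 # Q1 U
    right = Q2t @ Vt.T                            # Q2^T V
    return left[:, :k], S[:k], right[:, :k]
```

The large matrix A is touched only in the two products `A @ omega` and `Q1.T @ A`; everything after that works on matrices with about k rows or columns, which is what makes the approach attractive for big inputs.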
43.
Summary
Don’t repeat big scans
  – Cascade aggregations
  – Compute several aggregates at once
Use approximate measures for rank statistics
Downsample where appropriate
Use non-traditional deployment
Use sketches
Use random projections
44.
Contact Me!
We’re hiring at MapR in US and Europe
Come get the slides at http://www.mapr.com/company/events/cmu-hadoop-performance-11-1-12
Get the code at https://github.com/tdunning
Contact me at tdunning@maprtech.com or @ted_dunning