CMU Lecture on Hadoop Performance

A lecture describing several ways to make Hadoop programs go faster.

Hadoop Performance

Agenda
- What is performance? Optimization?
- Case 1: Aggregation
- Case 2: Recommendations
- Case 3: Clustering
- Case 4: Matrix decomposition

What is Performance?
- Is doing something faster better?
- Is it the right task?
- Do you have a wide enough view?
- What is the right performance metric?

Aggregation
- Word-count and friends
  – How many times did X occur?
  – How many unique X's occurred?
- Associative metrics permit decomposition
  – Partial sums and grand totals, for example
  – Use combiners
  – Use high-resolution aggregates to compute low-resolution aggregates
- Rank-based statistics do not permit decomposition
  – Avoid them
  – Use approximations

Inside Map-Reduce
[Figure: word-count data flow. The input text ("The time has come," the Walrus said, "To talk of many things: Of shoes—and ships—and sealing-wax") is split across map tasks, which emit (word, 1) pairs; combiners produce partial sums; shuffle-and-sort groups the counts by word, e.g. the, [1, 2, 1] and time, [10, 1, 3]; reducers emit grand totals such as the, 4; time, 14; has, 8; come, 6.]
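
The flow in the figure can be written out in a few lines. Below is a minimal single-process sketch of that pipeline; the function names are illustrative stand-ins, not Hadoop API calls.

```python
from collections import Counter

def map_words(line):
    # map: emit (word, 1) for every word
    return [(w.strip('",.:;-').lower(), 1) for w in line.split()]

def combine(pairs):
    # combiner: partial sums within a single map task
    return Counter(w for w, _ in pairs)

def reduce_counts(partials):
    # shuffle/sort + reduce: merge partial sums into grand totals
    totals = Counter()
    for p in partials:
        totals.update(p)
    return totals

lines = ['"The time has come," the Walrus said,',
         '"To talk of many things:',
         'Of shoes and ships and sealing-wax']
print(reduce_counts(combine(map_words(l)) for l in lines))
```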

Don't Do This
[Figure: the Daily, Weekly, and Monthly aggregates are each computed directly from the Raw input.]

Do This Instead
[Figure: the Raw input feeds only the Daily aggregate; Weekly is computed from Daily and Monthly from Weekly.]

Aggregation
- First rule:
  – Don't read the big input multiple times
  – Compute longer-term aggregates from short-term aggregates
- Second rule:
  – Don't read the big input multiple times
  – Compute multiple windowed aggregates at the same time (see the sketch below)
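
A minimal sketch of both rules, assuming events arrive as hypothetical (day_index, value) pairs: one scan of the raw input produces the daily aggregate, and the weekly and monthly aggregates are rolled up from the daily numbers.

```python
from collections import defaultdict

def daily_sums(events):
    """One (and only one) pass over the big raw input."""
    daily = defaultdict(float)
    for day, value in events:
        daily[day] += value
    return daily

def rollup(daily, days_per_bucket):
    """Aggregate the aggregates; never touches the raw input."""
    out = defaultdict(float)
    for day, total in daily.items():
        out[day // days_per_bucket] += total
    return out

daily = daily_sums([(0, 1.0), (1, 2.0), (8, 3.0), (40, 4.0)])
weekly = rollup(daily, 7)     # weekly from daily
monthly = rollup(daily, 30)   # monthly from daily, in the same run
```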

Rank Statistics Can Be Tamed
- Approximate quartiles are easily computed
  – (but sorted data is evil)
- Approximate unique counts are easily computed
  – use a Bloom filter and extrapolate from the number of set bits (sketched below)
  – use multiple filters at different down-sample rates
- Approximate high or low quantiles are easily computed
  – keep the largest 1000 elements
  – keep the largest 1000 elements from 10x down-sampled data
  – and so on
- Approximate top-40 is also possible
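
The Bloom-filter trick works because duplicates set no new bits: a filter of M bits with K hash functions is expected to have a fraction 1 - e^(-Kn/M) of its bits set after n distinct insertions, and that formula can be inverted to estimate n. A toy sketch with illustrative (untuned) sizes:

```python
import hashlib, math

M, K = 1 << 16, 4                      # filter bits, hash functions (toy sizes)

def bit_positions(item):
    h = hashlib.md5(item.encode()).digest()
    return [int.from_bytes(h[4 * i:4 * i + 4], 'big') % M for i in range(K)]

bits = set()                           # stands in for the bit array
for item in ('user%d' % (i % 3000) for i in range(100_000)):
    bits.update(bit_positions(item))   # duplicates set no new bits

x = len(bits)                          # number of set bits
print(-(M / K) * math.log(1 - x / M))  # ~3000, the true unique count
```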

Recommendations
- Common patterns in the past may predict common patterns in the future
- People who bought item x also bought item y
- But also, people who bought Chinese food in the past, …
- Or people in SoMa really liked this restaurant in the past

People who bought …
- The key operation is counting the number of people who bought both x and y
  – for all x's and all y's
- The raw problem appears to be O(N^3)
- At the least, it is O(k_max^2)
  – for the most prolific user, there are k^2 pairs to count
  – k_max can be near N
- Scalable problems must be O(N)

But …
- What do we learn from users who buy everything?
  – they have no discrimination
  – they are often the QA team
  – they tell us nothing
- What do we learn from items bought by everybody?
  – the dual of omnivorous buyers
  – these are often teaser items
  – they tell us nothing

Also …
- What would you learn about a user from purchases
  – 1 … 20?
  – 21 … 100?
  – 101 … 1000?
  – 1001 … ∞?
- What about learning about an item?
  – how many people do we need to see before we understand the item?

So …
- Cheat!
- Downsample every user to at most 1000 interactions
  – most recent
  – most rare
  – random selection
  – whatever is easiest
- Now k_max ≤ 1000 (see the sketch below)
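
A minimal sketch of the cheat using the "whatever is easiest" option, random selection, over a hypothetical user-to-item-list mapping; the cap bounds the k^2 pair-counting cost per user:

```python
import random
from collections import defaultdict

MAX_K = 1000                                   # the cap from the slide

def downsample(user_items):
    """Cap each user's item list at MAX_K by random selection."""
    return {user: (random.sample(items, MAX_K) if len(items) > MAX_K else items)
            for user, items in user_items.items()}

def cooccurrence_counts(user_items):
    """Count (x, y) pairs; at most MAX_K^2 / 2 pairs per user."""
    counts = defaultdict(int)
    for items in user_items.values():
        for i, x in enumerate(items):
            for y in items[i + 1:]:
                counts[(min(x, y), max(x, y))] += 1
    return counts

histories = {'u1': ['a', 'b', 'c'], 'u2': ['b', 'c', 'd']}
print(cooccurrence_counts(downsample(histories)))
```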

The Fundamental Things Apply
- Don't read the raw data repeatedly
- Sessionize and denormalize per hour/day/week
  – that is, group by user
  – expand items with categories and content descriptors if feasible
- Feed all downstream processing in one pass
  – baby join to item characteristics
  – downsample
  – count grand totals
  – compute cooccurrences

Deployment Matters, Too
- For the restaurant case, basic recommendation info includes:
  – user x merchant histories
  – user x cuisine histories
  – top local restaurants by anomalous repeat visits
  – restaurant x indicator-merchant cooccurrence matrix
  – restaurant x indicator-cuisine cooccurrence matrix
- These can all be stored and accessed using text-retrieval techniques
- Fast deployment using mirrors and NFS (not standard Hadoop)

Non-Traditional Deployment Demo
[Live demo]

EM Algorithms
- Start with random model estimates
- Use the model estimates to classify examples
- Use the classified examples to find maximum-likelihood model estimates
- Use the model estimates to classify examples
- Use the classified examples to find maximum-likelihood model estimates
- … and so on …

K-means as an EM Algorithm
- Assign a random seed to each cluster
- Assign points to the nearest cluster
- Move each cluster to the average of its contained points
- Assign points to the nearest cluster
- … and so on …

K-means as Map-Reduce
- Assignment of points to clusters is trivially parallel
- Computation of the new clusters is also parallel
- Moving points to averages is ideal for map-reduce (one iteration is sketched below)
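
One Lloyd iteration phrased as map-reduce, as a minimal in-memory sketch (numpy stands in for the cluster): the "map" side keys each point by its nearest centroid, and the "reduce" side averages the points assigned to each centroid. On Hadoop, each pass of the loop would be a separate job, which is exactly the cost the next slide complains about.

```python
import numpy as np

def kmeans_iteration(points, centroids):
    # map: assign each point to its nearest centroid
    d = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    assignment = d.argmin(axis=1)
    # reduce: new centroid = mean of the points assigned to it
    return np.array([points[assignment == j].mean(axis=0)
                     if (assignment == j).any() else centroids[j]
                     for j in range(len(centroids))])

points = np.random.randn(1000, 2)
centroids = points[np.random.choice(len(points), 5, replace=False)]
for _ in range(10):                    # one map-reduce job per iteration
    centroids = kmeans_iteration(points, centroids)
```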

But …
- With map-reduce, iteration is evil
- Starting a program can take 10-30 s
- Saving data to disk and then immediately reading it back is silly
- The input might even fit in cluster memory

Fix #1
- Don't do that!
- Use Spark
  – in-memory interactive map-reduce
  – 100x to 1000x faster
  – must fit in memory
- Use Giraph
  – BSP programming model rather than map-reduce
  – essentially map-reduce-reduce-reduce…
- Use GraphLab
  – like BSP without the speed brakes
  – 100x faster

Fix #2
- Use a sketch-based algorithm
- Do one pass over the data to compute a sketch of the data
- Cluster the sketch
- Done. With good theoretical bounds on accuracy
- Speedup of 3000x or more

An Example
[Figure]

The Problem
- Spirals are a classic "counter" example for k-means
- A classic low-dimensional manifold with added noise
- But clustering still makes modeling work well

An Example
[Figure]

An Example
[Figure]

The Cluster Proximity Features
- Every point can be described by the nearest cluster
  – 4.3 bits per point in this case
  – Significant error that can be decreased (to a point) by increasing the number of clusters
- Or by the proximity to the 2 nearest clusters (2 x 4.3 bits + 1 sign bit + 2 proximities)
  – Error is negligible
  – Unwinds the data into a simple representation (see the sketch below)
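
A minimal numpy sketch of the second encoding: each point is described by the ids of its two nearest centroids plus the two proximities (the slide's extra sign bit is omitted here for brevity):

```python
import numpy as np

def proximity_features(points, centroids):
    d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    nearest2 = np.argsort(d, axis=1)[:, :2]         # ids of the 2 nearest clusters
    prox = np.take_along_axis(d, nearest2, axis=1)  # the two proximities
    return nearest2, prox

points = np.random.randn(500, 2)
centroids = np.random.randn(40, 2)                  # lots of clusters are fine
ids, prox = proximity_features(points, centroids)
```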

Lots of Clusters Are Fine
[Figure]

Surrogate Method
- Start with sloppy clustering into κ = k log n clusters
- Use this sketch as a weighted surrogate for the data
- Cluster the surrogate data using ball k-means
- Results are provably good for highly clusterable data
- Sloppy clustering is on-line
- The surrogate can be kept in memory
- The ball k-means pass can be done at any time (the second phase is sketched below)
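
The second phase is weighted k-means over the κ sketch centroids. A minimal sketch, using plain Lloyd iterations rather than true ball k-means, and assuming the sketch pass (see "How It Works" below) has already produced centroids and weights:

```python
import numpy as np

def weighted_kmeans(centroids, weights, k, iters=20):
    means = centroids[np.random.choice(len(centroids), k, replace=False)]
    for _ in range(iters):
        d = ((centroids[:, None, :] - means[None, :, :]) ** 2).sum(axis=-1)
        a = d.argmin(axis=1)
        for j in range(k):
            m = a == j
            if m.any():
                w = weights[m][:, None]          # weight = points absorbed
                means[j] = (centroids[m] * w).sum(axis=0) / w.sum()
    return means

surrogate = np.random.randn(200, 2)   # stand-in for the kappa sketch centroids
weights = np.random.rand(200) + 0.5
final = weighted_kmeans(surrogate, weights, k=5)
```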

Algorithm Costs
- O(k d log n) per point per iteration for Lloyd's algorithm
- The number of iterations is not well known
- Iterations > log n is a reasonable assumption

Algorithm Costs
- Surrogate methods
  – fast, sloppy single-pass clustering with κ = k log n
  – fast, sloppy search for the nearest cluster, O(d log κ) = O(d (log k + log log n)) per point
  – fast, in-memory, high-quality clustering of the κ weighted centroids:
    O(κ k d + k^3 d) = O(k^2 d log n + k^3 d) for small k, high quality
    O(κ d log k) or O(d log κ log k) for larger k, looser quality
  – the result is k high-quality centroids
- Even the sloppy clusters may suffice

Algorithm Costs
- How much faster for the sketch phase?
  – take k = 2000, d = 10, n = 100,000
  – k d log n = 2000 x 10 x 26 ≈ 500,000
  – log k + log log n = 11 + 5 = 16
  – 30,000 times faster is a bona fide big deal

Pragmatics
- But this requires a fast search internally
- Have to cluster on the fly for the sketch
- Have to guarantee sketch quality
- Previous methods had very high complexity

How It Works
- For each point (sketched below)
  – Find the approximately nearest centroid (distance = d)
  – If (d > threshold), make a new centroid
  – Else if (u < d/threshold), where u is a uniform random draw, make a new centroid
  – Else add the point to the nearest centroid
- If the number of centroids exceeds κ ≈ C log N
  – Recursively cluster the centroids with a higher threshold
- The result is a large set of centroids
  – these provide an approximation of the original distribution
  – we can cluster the centroids to get a close approximation of clustering the original data
  – or we can just use the result directly
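
A minimal sketch of this streaming pass in plain Python. For simplicity it raises the threshold when the centroid count exceeds κ instead of recursively reclustering, so it is an outline of the slide's rule, not the Mahout implementation:

```python
import math, random

def streaming_sketch(points, kappa, threshold=1.0):
    centroids = []                             # list of (coords, weight) pairs
    for p in points:
        if not centroids:
            centroids.append((list(p), 1.0))
            continue
        dists = [math.dist(p, c) for c, _ in centroids]
        j = min(range(len(centroids)), key=dists.__getitem__)
        d = dists[j]
        if d > threshold or random.random() < d / threshold:
            centroids.append((list(p), 1.0))   # make a new centroid
        else:                                  # fold into the nearest centroid
            c, w = centroids[j]
            centroids[j] = ([(ci * w + pi) / (w + 1) for ci, pi in zip(c, p)],
                            w + 1)
        if len(centroids) > kappa:
            threshold *= 1.5                   # simplification of the recursive step
    return centroids

pts = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(10_000)]
sketch = streaming_sketch(pts, kappa=100)
```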

Matrix Decomposition
- Many big matrices can often be compressed: a large matrix is approximated by a product of much smaller factors
- Often used in recommendations

Nearest Neighbor
- Very high-dimensional vectors can be compressed to 10-100 dimensions with little loss of accuracy
- Fast search algorithms work up to dimension 50-100, but don't work above that

Random Projections
- Many problems in high dimension can be reduced to low dimension
- Reductions with good distance approximation are available
- Surprisingly, these methods can be implemented using random vectors

Fundamental Trick
- Random orthogonal projection preserves the action of A:
  ||Ax - Ay|| ≈ ||Q^T Ax - Q^T Ay||
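
A quick numerical check of the trick, assuming A is (nearly) low rank: take Q as an orthonormal basis for the range of AW with W random, and projected distances match the originals almost exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 30)) @ rng.standard_normal((30, 1000))
W = rng.standard_normal((1000, 50))      # random test matrix, width 50 > rank 30
Q, _ = np.linalg.qr(A @ W)               # orthonormal basis for range(A W)

x, y = rng.standard_normal((2, 1000))
print(np.linalg.norm(A @ x - A @ y))
print(np.linalg.norm(Q.T @ (A @ x) - Q.T @ (A @ y)))   # nearly identical
```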

Projection Search
[Figure: points projected onto a single direction; the projected values give a total ordering!]

LSH Bit-match Versus Cosine
[Figure: scatter plot of cosine similarity (y axis, -1 to 1) against the number of matching LSH bits out of 64 (x axis, 0 to 64).]
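
The plotted relationship can be reproduced with random-hyperplane hashing: each bit records which side of a random hyperplane a vector falls on, two vectors agree on a bit with probability 1 - θ/π, so the bit-match count estimates the angle and hence the cosine. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
planes = rng.standard_normal((64, 100))     # 64 random hyperplanes in 100-d

def signature(v):
    return planes @ v > 0                   # one bit per hyperplane

x = rng.standard_normal(100)
y = x + 0.5 * rng.standard_normal(100)

matches = int((signature(x) == signature(y)).sum())
estimated = np.cos(np.pi * (1 - matches / 64))
actual = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
print(matches, estimated, actual)           # bit matches track the cosine
```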

But How?
- Y = A W (W is a skinny random matrix)
- Q_1 R = Y (QR decomposition)
- B = Q_1^T A
- L Q_2 = B (LQ decomposition)
- U S V^T = L (SVD of the small matrix L)
- (Q_1 U) S (Q_2^T V)^T ≈ A
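
These steps are a randomized SVD: project A onto a random low-dimensional subspace, reduce it to a small matrix L, take the exact SVD of L, and expand back. A direct numpy transcription, with a small oversampling parameter p added as an assumption (standard practice, not shown on the slide):

```python
import numpy as np

def randomized_svd(A, k, p=10):
    m, n = A.shape
    W = np.random.default_rng(2).standard_normal((n, k + p))
    Q1, _ = np.linalg.qr(A @ W)       # Q1 R = Y = A W
    B = Q1.T @ A                      # B = Q1^T A
    Q2t, Lt = np.linalg.qr(B.T)       # L Q2 = B, via QR of B^T
    L, Q2 = Lt.T, Q2t.T
    U, S, Vt = np.linalg.svd(L)       # U S V^T = L (small, cheap)
    return Q1 @ U, S, Vt @ Q2         # (Q1 U) S (V^T Q2) ~ A

A = np.random.standard_normal((2000, 100)) @ np.random.standard_normal((100, 500))
U, S, Vt = randomized_svd(A, k=100)
print(np.linalg.norm(U @ np.diag(S) @ Vt - A))   # tiny residual
```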

Summary
- Don't repeat big scans
  – Cascade aggregations
  – Compute several aggregates at once
- Use approximate measures for rank statistics
- Downsample where appropriate
- Use non-traditional deployment
- Use sketches
- Use random projections

Contact Me!
- We're hiring at MapR in the US and Europe
- Come get the slides at http://www.mapr.com/company/events/cmu-hadoop-performance-11-1-12
- Get the code at https://github.com/tdunning
- Contact me at tdunning@maprtech.com or @ted_dunning