Published in:
Technology

(the most significant increase in goodness-of-fit)

Because direct simulation was computationally expensive when CCC was first developed, the technique uses a heuristic formula, derived from numerous Monte Carlo simulations, to generate a single hyper-cube reference distribution based on the dimensions of the given training dataset. That one reference distribution is used to test all k of interest. Despite the intrinsic shortcomings of heuristic approximation, CCC remains perhaps the best method for estimating k, with only one meaningful improvement proposed since its introduction: using Monte Carlo simulation directly, instead of heuristically, to generate a hyper-cube reference distribution. Reference: Tibshirani et al., "Estimating the number of clusters in a dataset via the Gap Statistic," J.R. Statist. Soc. B 63, Oxford, UK: Wiley-Blackwell, 2001. 12 pp.

Normalize the curve of log Wk vs. k.

ABC performs a more precise test at each k than CCC does. Instead of comparing a k-cluster solution in the training data to a k-cluster solution in an approximated hyper-cube, ABC compares it to a k-cluster solution in a data-adaptive reference distribution composed of k hyper-cubes whose dimensions change with the training data, the clustering algorithm, and k. Such descriptive reference distributions make differences between Wk and Wk* easier to detect, which in turn leads to more accurate determinations of k.

Wk+ is calculated from a clustering solution in the reference distribution.

The difference between ABC and competing techniques is the reference distribution.

Show R code in EM

So something between 2,3,4 and at 9

- 1. Copyright © 2012, SAS Institute Inc. All rights reserved. DETERMINING THE NUMBER OF CLUSTERS IN A DATASET USING ABC. I. KABUL, P. HALL, J. SILVA, W. SARLE, ENTERPRISE MINER R&D, SAS INSTITUTE
- 2. CLUSTERING Objects within a cluster are as similar as possible. Objects from different clusters are as dissimilar as possible. Hossein Parsaei
- 3. CHALLENGES IN CLUSTERING • No prior knowledge • Which similarity measure? • Which clustering algorithm? • How to evaluate the results? • How many clusters? The Aligned Box Criterion (ABC) addresses the important, unsolved problem of determining the number of clusters in a data set. ABC can be applied in market segmentation and many other types of statistical, data mining, and machine learning analyses.
- 4. CONTENTS • Background • Aligned Box Criterion (ABC) Method • Results • ABC Method in Parallel and Distributed Architecture • Conclusions
- 5. BACKGROUND
- 6. FINDING THE RIGHT NUMBER OF CLUSTERS • Many methods have been proposed: • Calinski-Harabasz index [Calinski 1974] • Cubic clustering criterion (CCC) [Sarle 1983] • Silhouette statistic [Rousseeuw 1987] • Gap statistic [Tibshirani 2001] • Jump method [Sugar 2003] • Prediction strength [Tibshirani 2005] • Dirichlet process [Teh 2006]
- 7. WITHIN CLUSTER SUM OF SQUARES • A good clustering yields clusters where observations have small within-cluster sum-of-squares (and high between-cluster sum-of-squares). • Low values when the partition is good, BUT these are by construction monotone nonincreasing (within-cluster dissimilarity always decreases with more clusters). $D_r = \sum_{i \in C_r} \sum_{j \in C_r} \lVert x_i - x_j \rVert^2 = 2 n_r \sum_{i \in C_r} \lVert x_i - \bar{x}_r \rVert^2$, $W_k = \sum_{r=1}^{k} \frac{1}{2 n_r} D_r$. Within-cluster SSE: measure of compactness of clusters.
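The two forms of $D_r$ on slide 7 can be checked numerically. The following is a minimal NumPy sketch, not code from the deck; the function names and the four-point toy dataset are illustrative assumptions.

```python
import numpy as np

def within_cluster_dispersion(X, labels):
    """W_k via the centroid form: sum over clusters of squared distances to the mean."""
    W = 0.0
    for r in np.unique(labels):
        pts = X[labels == r]
        W += ((pts - pts.mean(axis=0)) ** 2).sum()
    return W

def pairwise_dispersion(X, labels):
    """W_k via the pairwise form: sum over clusters of D_r / (2 n_r)."""
    W = 0.0
    for r in np.unique(labels):
        pts = X[labels == r]
        diffs = pts[:, None, :] - pts[None, :, :]   # all ordered pairs (i, j)
        D_r = (diffs ** 2).sum()
        W += D_r / (2 * len(pts))
    return W

# Hypothetical toy data: two clusters of two points each.
X = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]])
labels = np.array([0, 0, 1, 1])
w_centroid = within_cluster_dispersion(X, labels)
w_pairwise = pairwise_dispersion(X, labels)
```

Both functions return the same value for any labeling, which is why the cheaper centroid form is normally the one used in practice.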
- 8. BACKGROUND USING WK TO DETERMINE # OF CLUSTERS Elbow method (L-curve method). Idea: use the k corresponding to the "elbow". Problem: there is no reference clustering to compare against, and the differences $W_k - W_{k+1}$ are not normalized for comparison.
- 9. BACKGROUND REFERENCE DISTRIBUTIONS • Cubic Clustering Criterion (CCC), Gap Statistic and ABC amplify the elbow phenomenon by using differences between the within-cluster sum of squares of a clustering solution in the training data (Wk) and of a clustering solution in a reference distribution (Wk*). [Figure: Aligned box criterion (ABC), Gap statistic, and Cubic clustering criterion (CCC) arranged by reference distribution complexity] Cubic Clustering Criterion (CCC): SAS Technical Report A-108, 1983. Gap Statistic: Tibshirani et al, J.R. Statist. Soc., 2001.
- 10. CCC METHOD Instead of using Wk directly, CCC uses $R^2$: $R^2 = 1 - \frac{\operatorname{Trace}(W)}{\operatorname{Trace}(T)}$, with $\operatorname{Trace}(W) = W_k$. For the CCC calculation, $R^2$ and $E(R^2)$ are approximated by heuristic formulas: $\mathrm{CCC} = \log\left[\frac{1 - E(R^2)}{1 - R^2}\right] \cdot \frac{\sqrt{n p^* / 2}}{\left(0.001 + E(R^2)\right)^{1.2}}$. The formulas were derived from numerous Monte Carlo simulations to generate one hyper-cube reference distribution, based on the dimensions of the given training dataset, to test all k of interest. Cubic Clustering Criterion (CCC): SAS Technical Report A-108, 1983.
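The CCC expression on slide 10 can be written as a small function. This is only a sketch of the final formula: the heuristic approximations for $E(R^2)$ and $p^*$ are not given in the slides, so they are taken here as plain inputs, and the function names are assumptions.

```python
import math

def r_squared(trace_w, trace_t):
    """R^2 = 1 - Trace(W) / Trace(T), with Trace(W) = W_k."""
    return 1.0 - trace_w / trace_t

def ccc(r2, expected_r2, n, p_star):
    """CCC from slide 10. expected_r2 (E(R^2)) and p_star (p*) must be supplied
    by the caller; their heuristic formulas are not reproduced here."""
    return (math.log((1.0 - expected_r2) / (1.0 - r2))
            * math.sqrt(n * p_star / 2.0)
            / (0.001 + expected_r2) ** 1.2)

# Hypothetical inputs, for illustration only.
value = ccc(r2=0.8, expected_r2=0.6, n=100, p_star=2)
```

The sign of the result follows the log term: CCC is positive when the observed $R^2$ exceeds the value expected under the reference distribution, and zero when the two are equal.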
- 11. GAP STATISTICS METHOD The Gap Statistic computes the (log) ratio Wk* / Wk: $\mathrm{Gap}(k) = \log W_k^* - \log W_k$. Wk* is calculated from a clustering solution in the reference distribution. Finds the k that maximizes Gap(k) (within some tolerance).
- 12. TWO TYPES OF UNIFORM DISTRIBUTIONS 1. Align with feature axes (data-geometry independent). [Figure panels: Observations; Bounding box (aligned with feature axes); Monte Carlo simulations]
- 13. TWO TYPES OF UNIFORM DISTRIBUTIONS 2. Align with principal axes (data-geometry dependent). [Figure panels: Observations; Bounding box (aligned with principal axes); Monte Carlo simulations]
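The two reference distributions on slides 12 and 13 can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the deck's implementation: the function names are hypothetical, and the principal axes are taken from an SVD of the centered data.

```python
import numpy as np

def sample_axis_aligned_box(X, rng):
    """Uniform sample from the bounding box of X aligned with the feature axes."""
    return rng.uniform(X.min(axis=0), X.max(axis=0), size=X.shape)

def sample_principal_axes_box(X, rng):
    """Uniform sample from the bounding box of X aligned with its principal axes."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are the principal directions of the centered data.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt.T                                   # data in principal coordinates
    Z_ref = rng.uniform(Z.min(axis=0), Z.max(axis=0), size=Z.shape)
    return Z_ref @ Vt + mean                        # rotate back to feature space

# Hypothetical elongated, rotated cloud of points.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 3.0], [-0.3, 0.3]])
ref_feature = sample_axis_aligned_box(X, rng)
ref_principal = sample_principal_axes_box(X, rng)
```

For a cloud that is elongated along a diagonal, the feature-aligned box covers a large empty area, while the principal-axes box follows the shape of the data more closely.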
- 14. COMPUTATION OF THE GAP STATISTIC
  • for b = 1 to B: compute Monte Carlo sample $X_{1b}, X_{2b}, \dots, X_{nb}$ (n is # obs.)
  • for k = 1 to K:
    • Cluster the observations into k groups and compute $\log W_k$
    • for b = 1 to B: cluster the M.C. sample into k groups and compute $\log W_{kb}$
    • Compute $\mathrm{Gap}(k) = \frac{1}{B} \sum_{b=1}^{B} \log W_{kb} - \log W_k$
    • Compute sd(k), the standard deviation of $\{\log W_{kb}\}_{b=1,\dots,B}$
    • Set the total s.e. $s_k = \sqrt{1 + 1/B}\,\mathrm{sd}(k)$
  • Find the smallest k such that $\mathrm{Gap}(k) \ge \mathrm{Gap}(k+1) - s_{k+1}$
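The procedure on slide 14 can be sketched end to end in NumPy. This is a small illustration, not the deck's code: the k-means routine (Lloyd's algorithm with farthest-first seeding and a few restarts), all function names, and the toy data are assumptions, and the reference samples come from the feature-aligned bounding box.

```python
import numpy as np

def kmeans_labels(X, k, rng, n_iter=50):
    """Lloyd's algorithm with farthest-first seeding (illustrative, not tuned)."""
    centers = [X[rng.integers(len(X))]]
    while len(centers) < k:
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d2.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for r in range(k):
            if np.any(labels == r):
                centers[r] = X[labels == r].mean(axis=0)
    return labels

def log_wk(X, k, rng, n_init=3):
    """log W_k of the best of n_init k-means runs."""
    best = np.inf
    for _ in range(n_init):
        labels = kmeans_labels(X, k, rng)
        w = sum(((X[labels == r] - X[labels == r].mean(axis=0)) ** 2).sum()
                for r in np.unique(labels))
        best = min(best, w)
    return np.log(best)

def gap_statistic(X, K=5, B=10, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    refs = [rng.uniform(lo, hi, size=X.shape) for _ in range(B)]  # M.C. samples
    gap, s = np.zeros(K + 1), np.zeros(K + 1)
    for k in range(1, K + 1):
        log_w = log_wk(X, k, rng)
        log_wb = np.array([log_wk(ref, k, rng) for ref in refs])
        gap[k] = log_wb.mean() - log_w
        s[k] = np.sqrt(1 + 1 / B) * log_wb.std()   # total s.e. s_k
    for k in range(1, K):                          # smallest k passing the rule
        if gap[k] >= gap[k + 1] - s[k + 1]:
            return k, gap[1:], s[1:]
    return K, gap[1:], s[1:]

# Hypothetical data: three well-separated Gaussian blobs.
rng = np.random.default_rng(1)
blob_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + 0.5 * rng.normal(size=(60, 2)) for c in blob_centers])
k_hat, gap, s = gap_statistic(X, K=5, B=10)
```

The returned `gap` and `s` arrays are indexed from k = 1, so `gap[0]` is Gap(1). Swapping the `rng.uniform` call for a principal-axes box would give the data-geometry-dependent variant of the reference distribution.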
- 15. GAP STATISTIC
- 16. NO-CLUSTER EXAMPLE (JOURNAL VERSION)
- 17. ABC (ALIGNED BOX CRITERION)
- 18. ABC METHOD ABC improves upon CCC and the Gap Statistic by generating better estimates for Wk*. ABC uses k reference distributions, one for each tested k (k is the number of clusters). • Data-driven Monte Carlo simulation of the reference distribution at each tested k. • The reference distribution is k uniform hyper-boxes aligned with the principal components from the clustering solution of the input data. [Figure panels: Gap Statistic Reference Distribution; ABC Reference Distribution]
- 19. ABC METHOD Why multiple reference distributions? • The gap statistic performs hypothesis testing between k clusters/no-clusters for the whole input space. • ABC is similar to recursive hypothesis testing between 1 cluster/2 clusters for each of the k candidate clusters. • More stringent test: it is harder for larger k to pass this test. This is desirable. [Figure panels: Gap Statistic Reference Distribution; ABC Reference Distribution]
- 20. ESTIMATING k REFERENCE DISTRIBUTIONS Sample Data
- 21. ESTIMATING k REFERENCE DISTRIBUTIONS Aligned Box Criterion
- 22. ESTIMATING k REFERENCE DISTRIBUTIONS Aligned Box Criterion
- 23. ESTIMATING k REFERENCE DISTRIBUTIONS Aligned Box Criterion
- 24. ESTIMATING k REFERENCE DISTRIBUTIONS Aligned Box Criterion
- 25. ESTIMATING k REFERENCE DISTRIBUTIONS Aligned Box Criterion
- 26. ESTIMATING k REFERENCE DISTRIBUTIONS Aligned Box Criterion
- 27. ESTIMATING k REFERENCE DISTRIBUTIONS Aligned Box Criterion
- 28. ESTIMATING k REFERENCE DISTRIBUTIONS Aligned Box Criterion
- 29. ESTIMATING k REFERENCE DISTRIBUTIONS Aligned Box Criterion
- 30. ESTIMATING k REFERENCE DISTRIBUTIONS Aligned Box Criterion
- 31. ALIGNED BOX CRITERION (ABC)
  • for k = 1 to K:
    • Cluster the observations into k groups and compute $\log W_k$
    • for b = 1 to B:
      • Considering each of the k clusters separately, compute Monte Carlo sample $X_{1b}, X_{2b}, \dots, X_{nb}$ (n is # obs.)
      • Cluster the M.C. sample into k groups and compute $\log W_{kb}$
    • Compute $\mathrm{ABC}(k) = \log W_k^+ - \log W_k$
    • Compute sd(k), the s.d. of $\{\log W_{kb}\}_{b=1,\dots,B}$
    • Set the total s.e. $s_k = \sqrt{1 + 1/B}\,\mathrm{sd}(k)$
  • Find the smallest k such that $\mathrm{ABC}(k) \ge \mathrm{ABC}(k+1) - s_{k+1}$
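A rough NumPy sketch of the ABC loop on slide 31 follows. It is written under several assumptions that the slides do not confirm: $\log W_k^+$ is taken as the mean of the $\log W_{kb}$ values (by analogy with the Gap statistic), each cluster's box holds as many reference points as the cluster has observations, the k-means routine is a plain Lloyd's algorithm with farthest-first seeding, and all function names are hypothetical. It is not PROC HPCLUS.

```python
import numpy as np

def kmeans_labels(X, k, rng, n_iter=50):
    """Lloyd's algorithm with farthest-first seeding (illustrative, not tuned)."""
    centers = [X[rng.integers(len(X))]]
    while len(centers) < k:
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d2.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for r in range(k):
            if np.any(labels == r):
                centers[r] = X[labels == r].mean(axis=0)
    return labels

def dispersion(X, labels):
    """Within-cluster sum of squares W_k for a given labeling."""
    return sum(((X[labels == r] - X[labels == r].mean(axis=0)) ** 2).sum()
               for r in np.unique(labels))

def best_labels(X, k, rng, n_init=3):
    runs = [kmeans_labels(X, k, rng) for _ in range(n_init)]
    return min(runs, key=lambda lab: dispersion(X, lab))

def aligned_box_sample(pts, rng):
    """Uniform sample from the bounding box of pts aligned with its principal axes."""
    mean = pts.mean(axis=0)
    _, _, Vt = np.linalg.svd(pts - mean, full_matrices=False)
    Z = (pts - mean) @ Vt.T
    return rng.uniform(Z.min(axis=0), Z.max(axis=0), size=Z.shape) @ Vt + mean

def abc_reference(X, labels, rng):
    """One reference set: one aligned box per cluster of the data solution."""
    return np.vstack([aligned_box_sample(X[labels == r], rng)
                      for r in np.unique(labels)])

def abc(X, K=4, B=5, seed=0):
    rng = np.random.default_rng(seed)
    score, s = np.zeros(K + 1), np.zeros(K + 1)
    for k in range(1, K + 1):
        labels = best_labels(X, k, rng)
        log_w = np.log(dispersion(X, labels))
        log_wb = np.empty(B)
        for b in range(B):
            ref = abc_reference(X, labels, rng)       # depends on k and the data
            log_wb[b] = np.log(dispersion(ref, best_labels(ref, k, rng)))
        score[k] = log_wb.mean() - log_w              # assumed form of log W_k^+
        s[k] = np.sqrt(1 + 1 / B) * log_wb.std()
    for k in range(1, K):
        if score[k] >= score[k + 1] - s[k + 1]:
            return k, score[1:], s[1:]
    return K, score[1:], s[1:]

# Hypothetical data: two Gaussian blobs.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(size=(80, 2)), rng.normal(size=(80, 2)) + [8.0, 0.0]])
k_hat, scores, s = abc(X)
```

The structural difference from the Gap statistic is confined to `abc_reference`: the reference set is rebuilt for every k from the clustering of the data, instead of being drawn once from a single box around all observations.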
- 32. ABC METHOD RESULTS
- 33. ESTIMATING k REFERENCE DISTRIBUTIONS Wk* decreases faster. [Figure panels: Gap Statistic; Aligned Box Criterion]
- 34. ESTIMATING k REFERENCE DISTRIBUTIONS Clearer maxima. [Figure panels: Gap Statistic; Aligned Box Criterion]
- 35. RESULTS SIMULATED: SEVEN OVERLAPPING CLUSTERS
- 36. RESULTS SIMULATED: SEVEN OVERLAPPING CLUSTERS • Observations: 7,000 • Variables: 2 • Monte Carlo Replications: 20 [Figure panels: CCC method; ABC method]
- 37. RESULTS SIMULATED: SEVEN OVERLAPPING CLUSTERS
- 38. ESTIMATING k CLAIMS PREDICTION CHALLENGE DATA • Anonymized customer data • 32 customer and product features • 13,184,290 customer records
- 39. ESTIMATING k EXECUTING CALCULATIONS • Cubic clustering criterion: PROC FASTCLUS • Gap statistic: R cluster package in the Open Source Integration Node in SAS Enterprise Miner • Aligned box criterion: PROC HPCLUS
- 40. ESTIMATING k INTERPRETING RESULTS Cubic Clustering Criterion
- 41. ESTIMATING k INTERPRETING RESULTS Gap Statistic
- 42. ESTIMATING k INTERPRETING RESULTS Aligned Box Criterion
- 43. REFERENCE DISTRIBUTION EFFECT OF CHANGING NUMBER OF OBSERVATIONS • How the number of observations in the reference distribution affects the result • Given n observations in the input dataset, we generated w·n observations in the reference distribution, where w is between 0 and 1
- 44. RESULTS SIMPLE CASE
- 45. RESULTS DATA SET WITH MORE CLUSTERS
- 46. RESULTS DATA SET WITH MORE OBSERVATIONS
- 47. RESULTS REAL DATA Kaggle Claims Prediction Challenge (n = 13,184,290, p = 35), 50 runs
- 48. RESULTS SCALABILITY
- 49. RESULTS STABILITY
- 50. ABC METHOD FOR PARALLEL AND DISTRIBUTED ARCHITECTURES
- 51. PARALLEL ABC PART 1-2 Part 1: 1) Run k-means clustering (in parallel) for k clusters 2) Assign each observation to a cluster 3) Compute $W_k$. Part 2: 1) Assign each cluster to a node 2) Collect the XX' matrix for each cluster in the assigned node using a tree-based algorithm 3) Do PCA using the XX' matrix. [Diagrams: Root with Node1, Node2, Node3, ..., NodeN]
- 52. PARALLEL ABC PART 3-4 Part 3: 1) Eigenvectors are broadcast to every node 2) Based on their assigned clusters, the observations in each node are projected into the new space. Part 4: 1) Bounding boxes are computed locally at each node for each cluster k 2) Bounding box information from each node is collected at the root, and the root computes the bounding box coordinates for each cluster k 3) This information is distributed to each node, and each node generates reference distributions. [Diagrams: Root with Node1, Node2, Node3, ..., NodeN]
- 53. PARALLEL ABC PART 5 Run k-means clustering in parallel for the reference distribution and compute $W_k^+$. Do this for B reference distributions. Compute ABC for cluster k. [Diagram: Root with Node1, Node2, Node3, ..., NodeN]
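Parts 2 to 4 of the parallel scheme rely on quantities that can be computed per node and merged at the root. The sketch below simulates this on one machine for a single cluster. It is an assumption-laden illustration: "nodes" are just array shards, the merge is a flat sum where the slides describe a tree-based collection, the p-by-p cross-product matrix stands in for the slides' XX', and all function names are hypothetical.

```python
import numpy as np

def local_moments(shard):
    """Per-node sufficient statistics: count, column sums, cross-product matrix."""
    return len(shard), shard.sum(axis=0), shard.T @ shard

def merge_moments(parts):
    """Root-side merge: the partial sums add up to the global mean and covariance."""
    n = sum(p[0] for p in parts)
    total = sum(p[1] for p in parts)
    cross = sum(p[2] for p in parts)
    mean = total / n
    cov = cross / n - np.outer(mean, mean)
    return mean, cov

def local_box(shard, mean, axes):
    """Per-node bounding box of the shard in principal coordinates."""
    Z = (shard - mean) @ axes
    return Z.min(axis=0), Z.max(axis=0)

def merge_boxes(boxes):
    """Root-side merge: elementwise min of lower corners, max of upper corners."""
    lo = np.min([b[0] for b in boxes], axis=0)
    hi = np.max([b[1] for b in boxes], axis=0)
    return lo, hi

# Hypothetical cluster, split across four simulated nodes.
rng = np.random.default_rng(0)
cluster = rng.normal(size=(300, 2)) @ np.array([[2.0, 1.0], [0.0, 0.5]])
shards = np.array_split(cluster, 4)

mean, cov = merge_moments([local_moments(s) for s in shards])
_, axes = np.linalg.eigh(cov)               # columns are the principal axes
lo, hi = merge_boxes([local_box(s, mean, axes) for s in shards])

# Each node could now draw its share of the reference distribution locally.
ref_shard = rng.uniform(lo, hi, size=shards[0].shape) @ axes.T + mean
```

Only small summaries (a count, a p-vector, a p-by-p matrix, and two box corners per cluster) cross the network in this sketch, and the observations themselves never leave their node.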
- 54. PARALLEL ABC PART 6 What about the O(n^3) complexity of SVD? • Computation of XX' is parallelized • Or, do stochastic SVD
- 55. ABC METHOD CONCLUSION
- 56. RESULTS More accurate reference distributions lead to: • Better defined maxima. • Wk* values decreasing rapidly, especially for K > k. • Exposure of possible alternative solutions.
- 57. CONCLUSION For large, high-dimensional or noisy data, ABC is found to be: • Stable • Scalable. Moreover, it exhibits desirable properties: • Clearer peaks • A more stringent hypothesis test that promotes smaller k values
- 58. www.SAS.com Q&A THANK YOU
