Clustering on Database Systems 
Vahid Mirjalili 
Michigan State University
Clustering 
• Partitioning data into groups 
Items in the same group should have higher similarity to each 
other than items from different groups 
• A similarity/dissimilarity measure 
• Examples: 
– Clustering patients in a hospital 
– Genomic clustering 
– Hand-written character recognition 
A. Jain, “Data Clustering: 50 years beyond K-means”
Clustering vs. Classification 
[Figure: predictive modeling tasks: supervised learning, unsupervised learning, reinforcement learning] 
• Classification is supervised 
– Class labels are provided 
– Learn a classifier to predict class labels of novel/unseen data 
• Clustering is unsupervised or semi-supervised 
– No class labels are given 
– Understand the structure underlying your data
Clustering Approaches 
• Probability-based 
– Assumes statistical independence among features 
– Inefficient updating and storing of clusters 
• Distance-based 
– Assumes direct access to all data points 
– Hierarchical clustering: O(N^2), and does not give the best clustering
Distance-Based Clustering Algorithms 
• K-means and its variants (K-medoids, kernel K-means, fuzzy c-means, …) 
• Density based methods (DBSCAN) 
• Hierarchical methods
Challenges 
• Unknown number of clusters (from 1 to N) 
[Figure: the same input data clustered with K=2 and K=6] 
You always get some output as clusters. Are they really distinct clusters? 
A. Jain, “Data Clustering: 50 years beyond K-means”
Challenges 
• Clusters with different shapes, sizes and 
densities 
Shapes: globular shape, linear vs. non-linear 
shapes 
A. Jain, “Data Clustering: 50 years beyond K-means”
Standard K-Means Algorithm 
• Choose the initial cluster centroids randomly 
• An iterative algorithm 
1. Assignment step: assign each data point to the cluster whose mean is closest (smallest distance) 
2. Update step: update the mean (centroid) of each cluster 
Distance: squared Euclidean distance 
$dist(x, \mu) = \sum_{j=1}^{d} (x_j - \mu_j)^2$ 
Centroid: mean of feature vectors 
$\mu_C = \frac{1}{N_C} \sum_{i \in C} X_i$
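The two steps above can be sketched in a few lines of plain Python. This is only an illustrative in-memory version, not the code from the talk: the names `sq_dist`, `mean` and `kmeans`, the `max_iter` cap, the fixed random seed, and the choice to keep an old centroid when a cluster becomes empty are all assumptions.

```python
import random

def sq_dist(x, mu):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((xj - mj) ** 2 for xj, mj in zip(x, mu))

def mean(points):
    """Component-wise mean (centroid) of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def kmeans(data, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Initial centroids: k data points picked at random.
    centroids = [list(p) for p in rng.sample(data, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [min(range(k), key=lambda c: sq_dist(x, centroids[c]))
                      for x in data]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Update step: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [x for x, lab in zip(data, labels) if lab == c]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[c] = mean(members)
    return centroids, labels
```

Every iteration reads all N points, which is the cost the disk-based variant later in the deck tries to limit.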
Standard K-Means Algorithm
Problem in Database-oriented 
Clustering 
• Low memory available compared to the size of the dataset → data doesn’t fit in main memory 
• High I/O 
• Necessary to avoid too many iterations
RKM: An Efficient Disk-based KMeans 
Method 
• Find the initial centroids by $\mu_c = \mu_{all} \pm \sigma\, r / d$ 
• Only 3 iterations: 
– Assign every L points to nearest centroids, with $L = \sqrt{N}$ 
– Update the cluster centroids 
• Minor efficiency tricks: 
– Keep track of LS, SS and Nc for each cluster during assignment → update step: $\mu_c = LS / N_c$
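A rough sketch of the incremental idea on this slide: running sums LS, SS and Nc are kept per cluster, and the centroids are refreshed from LS / Nc every L points instead of once per full pass. The function name `rkm_like`, taking the initial centroids as an argument, using L = floor(sqrt(N)), and resetting the sums at the start of every pass are assumptions made for illustration, not details confirmed by the slides.

```python
import math

def sq_dist(x, mu):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, mu))

def rkm_like(data, init_centroids, passes=3):
    n, k, d = len(data), len(init_centroids), len(data[0])
    L = max(1, math.isqrt(n))  # assumed refresh interval: L = sqrt(N)
    centroids = [list(c) for c in init_centroids]
    labels = [0] * n
    for _ in range(passes):
        # Assumption: the running statistics are reset at the start of a pass.
        LS = [[0.0] * d for _ in range(k)]  # linear sum per cluster
        SS = [[0.0] * d for _ in range(k)]  # squared sum per cluster
        Nc = [0] * k                        # number of points per cluster
        for i, x in enumerate(data, start=1):
            c = min(range(k), key=lambda j: sq_dist(x, centroids[j]))
            labels[i - 1] = c
            Nc[c] += 1
            for j in range(d):
                LS[c][j] += x[j]
                SS[c][j] += x[j] * x[j]
            # Refresh the centroids every L points and at the end of the pass.
            if i % L == 0 or i == n:
                for c2 in range(k):
                    if Nc[c2] > 0:
                        centroids[c2] = [s / Nc[c2] for s in LS[c2]]
    return centroids, labels, (LS, SS, Nc)
```

Because each update only needs LS and Nc, no second scan over the cluster members is required to recompute a centroid.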
Implementation of RKM: 
storing data matrices 
• D → input dataset 
• Pj → cluster j (for j in [1..k]) 
• Mj, Qj, Nj → linear sum, squared sum, cluster size 
• Cj, Rj, Wj → centroids, variances, weights (accessed during update step) 
$C_j = M_j / N_j$ 
$R_j = Q_j / N_j - M_j M_j^t / N_j^2$ 
$W_j = N_j / \sum_{l=1..k} N_l$
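As a small illustration, the centroids, variances and weights can be derived from the stored sums without touching the data again. The sketch below assumes per-dimension (diagonal) variances and non-empty clusters; the function name `cluster_summaries` is hypothetical.

```python
def cluster_summaries(M, Q, N):
    """Derive centroids C, variances R and weights W from the sums M, Q, N.

    M[j] and Q[j] are the linear and squared sums of cluster j (one value
    per dimension); N[j] is the cluster size, assumed to be non-zero.
    """
    total = sum(N)
    C, R, W = [], [], []
    for Mj, Qj, Nj in zip(M, Q, N):
        C.append([m / Nj for m in Mj])                    # C_j = M_j / N_j
        R.append([q / Nj - (m / Nj) ** 2                  # diagonal of R_j
                  for m, q in zip(Mj, Qj)])
        W.append(Nj / total)                              # W_j = N_j / sum of all N_l
    return C, R, W
```

For example, a cluster holding the points (0, 1) and (2, 3) has M = (2, 4), Q = (4, 10) and N = 2, which gives the centroid (1, 2) and a variance of 1 in each dimension.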
RKM avoids local minima: 
split large clusters 
• Only performed if the size of a cluster is less than a user-defined threshold 
1. Remove the centroid of the small cluster 
2. Find the largest cluster (largest Wj) 
3. Randomly choose two centroids for the largest cluster (using Cj and Rj) 
4. Reassign the items of the small and large clusters
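One way the splitting steps could look in code is sketched below. The slide only says that the two new centroids are chosen randomly using Cj and Rj; placing them at Cj plus and minus a random fraction of the per-dimension standard deviation sqrt(Rj) is an assumption made here for illustration, as are the function name `split_largest` and the `min_weight` parameter. Reassignment (step 4) is left to the next assignment pass.

```python
import math
import random

def split_largest(C, R, W, min_weight, rng=None):
    """Replace the centroid of a too-small cluster by splitting the largest.

    Returns (new_centroids, (small, large)), or (C, None) when no cluster
    weight falls below min_weight.
    """
    if rng is None:
        rng = random.Random(0)
    small = min(range(len(W)), key=lambda j: W[j])
    if W[small] >= min_weight:
        return C, None                      # nothing to split
    large = max(range(len(W)), key=lambda j: W[j])
    # Assumed rule: offset = r * sqrt(R_j) per dimension, r uniform in [0, 1).
    offset = [rng.random() * math.sqrt(max(r, 0.0)) for r in R[large]]
    new_C = [list(c) for c in C]
    new_C[small] = [c - o for c, o in zip(C[large], offset)]  # reuse freed slot
    new_C[large] = [c + o for c, o in zip(C[large], offset)]
    return new_C, (small, large)
```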
RKM vs. Standard K-means: 
Random Dataset
RKM vs. Standard K-means: 
Initial Cluster Centroids 
K = 3
Cluster assignment: 
Results after one pass over all the data 
[Figure: standard K-means (many iterations needed) vs. RKM (2 more iterations)]
RKM: Database design 
• Relational schema for sparse data 
representation: D(pid, inx, value) 
• For other matrices: one I/O per matrix row, to minimize I/O 
[Table: matrix access during the E step (assignment step) and the M step (update step)] 
Ordonez and Omiecinski, “Efficient Disk-Based K-Means Clustering for 
Relational Databases”
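To make the sparse schema D(pid, inx, value) concrete, here is a small sketch using Python's built-in sqlite3 module. The in-memory database, the primary key, the 1-based indices and the helper names `load_sparse` and `sq_dist_to_centroid` are assumptions for illustration. The distance helper starts from the squared norm of the centroid and corrects only the stored non-zero dimensions, so zero entries never have to be read.

```python
import sqlite3

def load_sparse(points):
    """Store only the non-zero entries of dense vectors in D(pid, inx, value)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE D (pid INTEGER, inx INTEGER, value REAL, "
        "PRIMARY KEY (pid, inx))"
    )
    rows = [(pid, inx, v)
            for pid, x in enumerate(points, start=1)
            for inx, v in enumerate(x, start=1) if v != 0]
    conn.executemany("INSERT INTO D VALUES (?, ?, ?)", rows)
    return conn

def sq_dist_to_centroid(conn, pid, centroid):
    """Squared Euclidean distance from point pid to a dense centroid."""
    # Distance of the all-zero vector to the centroid ...
    total = sum(c * c for c in centroid)
    # ... corrected for each stored (non-zero) dimension of the point.
    query = "SELECT inx, value FROM D WHERE pid = ?"
    for inx, v in conn.execute(query, (pid,)):
        c = centroid[inx - 1]
        total += (v - c) ** 2 - c * c
    return total
```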
Performance Comparison 
• RKM (disk-based) 
• Memory based: 
– Standard K-means 
– Scalable K-means 
$Quan.error(C) = \sum_{j=1..k} \sum_{i \in P_j} dist(x_i, C_j)$
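The quantization error used for the comparison can be computed as the summed distance of every point to the centroid of its own cluster. A minimal sketch, assuming dist is the squared Euclidean distance defined earlier in the deck and that cluster membership is given as a list of labels (the function names are hypothetical):

```python
def sq_dist(x, mu):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, mu))

def quantization_error(data, labels, centroids):
    """Sum over all points of the distance to their assigned centroid."""
    return sum(sq_dist(x, centroids[c]) for x, c in zip(data, labels))
```

A lower value means the points sit closer to their centroids, so two clusterings with the same K can be compared directly on this number.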
Time Complexity of RKM 
Ordonez and Omiecinski, “Efficient Disk-Based K-Means Clustering for 
Relational Databases”
Time Complexity 
Ordonez and Omiecinski, “Efficient Disk-Based K-Means Clustering for 
Relational Databases”
Conclusion 
• RKM resolves some of the limitations of K-means 
• RKM limits disk access (I/O) 
• Final clustering is achieved with 3 iterations 
• On large datasets RKM outperforms standard K-means 
• Other limitations of K-means clustering still 
remain
Read more … 
General implementation in IPython notebook: 
http://goo.gl/YZScH9 
http://www.vahidmirjalili.com
