Data Mining & Machine Learning
Tilani Gunawardena
PhD(UNIBAS), BSc.Eng(Pera), FHEA(UK), AMIE(SL)
2019/11/22
Data Mining
Data mining: the process of discovering patterns in data
Machine Learning
◦ Grew out of work in AI
◦ A new capability for computers
Machine learning is the science of getting computers to learn without being
explicitly programmed.
Learning = improving with experience at some task
◦ Improve over task T
◦ With respect to performance measure P
◦ Based on experience E
Machine Learning
Database mining
◦ Large datasets from the growth of automation and the web
◦ Ex: web click data, medical records, biology, engineering
Applications that cannot be programmed by hand
◦ Ex: autonomous helicopters, handwriting recognition, most of NLP,
computer vision
Self-customizing programs
◦ Amazon and Netflix product recommendations
Understanding human learning (brain, real AI)
Types of Learning
◦ Supervised learning: learn to predict
◦ The correct answer is given for each example. The answer can be a numeric variable, a categorical variable, etc.
◦ Unsupervised learning: learn to understand and describe the data
◦ Correct answers are not given, just examples (e.g. the same figures, without the labels)
◦ Reinforcement learning: learn to act
◦ Only occasional rewards are given
[Figure: example figures labelled M and F]
Machine Learning Problems
10
Algorithms
The success of a machine learning system also depends on its algorithms.
The algorithms control the search to find and build the knowledge
structures.
Learning algorithms should extract useful information from the training
examples.
Algorithms
Supervised learning
◦ Prediction
◦ Classification (discrete labels), regression (real values)
Unsupervised learning
◦ Clustering
◦ Probability distribution estimation
◦ Finding associations (in features)
◦ Dimensionality reduction
Reinforcement learning
◦ Decision making (robot, chess machine)
Supervised Learning
• The problem of taking a labeled dataset and gleaning information from it
so that you can label new data
• Learn to predict the output from the input
• Function approximation
Supervised learning: example 1
Predict housing prices
Regression: predict a continuous-valued output (price)
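As a rough illustration of regression, the sketch below fits a straight line to house sizes and prices by ordinary least squares and uses it to predict a price. The data values, the single "size" feature, and the function name `fit_line` are assumptions made up for this example; the slides do not specify a dataset or a fitting method.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data (invented): house size and price in thousands.
sizes = [50, 80, 100, 120, 150]
prices = [150, 240, 300, 360, 450]

slope, intercept = fit_line(sizes, prices)

# Predict the continuous-valued output (price) for an unseen size.
predicted_price = slope * 110 + intercept
print(f"predicted price for size 110: {predicted_price:.1f}")
```

The output is a real number rather than a class label, which is what distinguishes regression from classification.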
Supervised learning: example 2
Breast cancer (malignant, benign)
This is a classification problem: discrete-valued output (0 or 1)
1 attribute/feature: tumor size
Supervised learning: example 3
2 attributes/features: tumor size and age
Q?
1. Input: credit history (number of loans, how much money you make, …)
Output: lend money or not?
2. Input: picture. Output: predict BSc, MSc, or PhD
3. Input: picture. Output: predict age
4. Input: large inventory of identical items. Output: predict how many
items will sell over the next 3 months
5. Input: customer accounts. Output: hacked or not
Unsupervised Learning
Find patterns and structure in data
Unsupervised Learning: examples
Organize computing clusters
◦ Large data centers: which machines work together?
Social network analysis
◦ Given information about which friends you email most, your Facebook
friends, or your Google+ circles
◦ Can we automatically identify cohesive groups of friends?
Market segmentation
◦ Take a customer data set and group the customers into different market segments
Astronomical data analysis
◦ Clustering algorithms yield interesting and useful theories, e.g. about how
galaxies are formed
Q?
1. Given emails labeled as spam/not spam, learn a spam filter
2. Given a set of news articles found on the web, group them into
sets of articles about the same story
3. Given a database of customer data, automatically discover
market segments and group customers into different market
segments
4. Given a dataset of patients diagnosed as either having
diabetes or not, learn to classify new patients as having
diabetes or not
Algorithms: K Nearest Neighbors
23
Simple analogy
Tell me about your friends (who your neighbors are) and I
will tell you who you are.
KNN – Different names
•K-Nearest Neighbors
•Memory-Based Reasoning
•Example-Based Reasoning
•Instance-Based Learning
•Lazy Learning
25
What is KNN?
A powerful classification algorithm used in pattern recognition.
K nearest neighbors stores all available cases and classifies
new cases based on a similarity measure (e.g. a distance function).
One of the top data mining algorithms used today.
A non-parametric, lazy learning algorithm (an instance-based
learning method).
KNN: Classification Approach
An object (a new instance) is classified by a majority vote of its neighbors'
classes.
The object is assigned to the most common class among its K nearest
neighbors (as measured by a distance function).
Distance Measure
[Figure: compute the distance from the test record to the training records, then choose the k "nearest" records]
Distance measure for Continuous Variables
30
Distance Between Neighbors
Calculate the distance between new example (E) and all examples in the
training set.
Euclidean distance between two examples.
◦ X = [x1,x2,x3,..,xn]
◦ Y = [y1,y2,y3,...,yn]
◦ The Euclidean distance between X and Y is defined as:
$$D(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
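The Euclidean distance between two examples can be written directly from its definition. This is a minimal sketch; the function name `euclidean_distance` and the sample vectors are assumptions chosen for illustration.

```python
import math


def euclidean_distance(x, y):
    """Euclidean distance between two equal-length numeric vectors X and Y."""
    if len(x) != len(y):
        raise ValueError("X and Y must have the same number of attributes")
    # Sum the squared differences attribute by attribute, then take the root.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


d = euclidean_distance([0, 0], [3, 4])
print(d)
```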
K-Nearest Neighbor Algorithm
All instances correspond to points in an n-dimensional feature space.
Each instance is represented by a set of numerical attributes.
The training data consist of a set of vectors, each with an associated
class label.
Classification is done by comparing the feature vector of the new example
with those of its K nearest points.
Select the K nearest examples to E in the training set.
Assign E to the most common class among its K nearest neighbors.
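The steps above can be sketched in a few lines. The function name `knn_classify`, the toy two-dimensional training set, and the class labels "A" and "B" are assumptions for illustration. `math.dist` (Python 3.8 or later) computes the Euclidean distance; when the vote is tied, this sketch falls back to the label that appears first among the nearest neighbors, which is a choice of the example, not something the slides specify.

```python
import math
from collections import Counter


def knn_classify(training, new_example, k):
    """training: list of (feature_vector, class_label) pairs."""
    # Sort the training examples by their distance to the new example E.
    by_distance = sorted(training, key=lambda rec: math.dist(rec[0], new_example))
    # Keep the class labels of the K nearest examples.
    nearest_labels = [label for _, label in by_distance[:k]]
    # Assign E to the most common class among them.
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical training set: two well-separated groups of points.
training = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"), ((7.0, 6.5), "B"), ((6.5, 7.0), "B"),
]

label_near_a = knn_classify(training, (1.2, 1.4), k=3)
label_near_b = knn_classify(training, (6.4, 6.2), k=3)
print(label_near_a, label_near_b)
```

Nothing is learned ahead of time: all the work happens when a new example arrives, which is why KNN is called a lazy learner.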
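Example: will John respond? (income in thousands)

Customer  Age  Income  No. credit cards  Class  Distance from John
George    35   35K     3                 No     sqrt[(35-37)^2 + (35-50)^2 + (3-2)^2] = 15.17
Rachel    22   50K     2                 Yes    sqrt[(22-37)^2 + (50-50)^2 + (2-2)^2] = 15
Steve     63   200K    1                 No     sqrt[(63-37)^2 + (200-50)^2 + (1-2)^2] = 152.24
Tom       59   170K    1                 No     sqrt[(59-37)^2 + (170-50)^2 + (1-2)^2] = 122.00
Anne      25   40K     4                 Yes    sqrt[(25-37)^2 + (40-50)^2 + (4-2)^2] = 15.75
John      37   50K     2                 ?      predicted: Yes

With K = 3, the nearest neighbors are Rachel (Yes), George (No) and Anne (Yes), so the majority vote is Yes.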
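The worked example can be checked with a short script. The customer data come from the table; the choice of K = 3 and the variable names are assumptions of this sketch, since the slide does not state which K was used.

```python
import math
from collections import Counter

# (age, income in thousands, number of credit cards), class label
customers = {
    "George": ((35, 35, 3), "No"),
    "Rachel": ((22, 50, 2), "Yes"),
    "Steve": ((63, 200, 1), "No"),
    "Tom": ((59, 170, 1), "No"),
    "Anne": ((25, 40, 4), "Yes"),
}
john = (37, 50, 2)

# Euclidean distance from John to every training customer.
distances = {
    name: math.dist(features, john)
    for name, (features, _) in customers.items()
}

k = 3  # assumed value of K
nearest = sorted(distances, key=distances.get)[:k]
votes = Counter(customers[name][1] for name in nearest)
prediction = votes.most_common(1)[0][0]

for name in nearest:
    print(f"{name}: {distances[name]:.2f}")
print("prediction for John:", prediction)
```

Income is on a much larger scale than the other two attributes, so it contributes most of each distance in this example.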
How to choose K?
If K is too small, the classifier is sensitive to noise points.
A larger K is usually more robust, but if K is too large the neighborhood
may be dominated by points from other classes.
A rule of thumb is K < sqrt(n), where n is the number of examples.
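One way to turn the rule of thumb into a starting value is sketched below. The function name `rule_of_thumb_k` is hypothetical, and stepping down to an odd number (to reduce tied votes in two-class problems) is an added assumption, not part of the rule on the slide. The result is only a first guess to be checked against held-out data.

```python
import math


def rule_of_thumb_k(n):
    """Largest odd K that is strictly below sqrt(n), with a minimum of 1."""
    k = math.isqrt(n)        # floor of sqrt(n)
    if k * k == n:           # sqrt(n) is an integer, so step below it
        k -= 1
    if k % 2 == 0:           # assumption: prefer odd K to reduce tied votes
        k -= 1
    return max(1, k)


for n in (5, 30, 100, 1000):
    print(n, rule_of_thumb_k(n))
```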
[Figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor of a record X]
The K nearest neighbors of a record x are the data points
that have the k smallest distances to x.
Strengths of KNN
Very simple and intuitive.
Can be applied to the data from any distribution.
Good classification if the number of samples is
large enough.
37
Weaknesses of KNN
Classifying a new example takes more time.
◦ The distance from the new example to every other example must be
calculated and compared.
Choosing k may be tricky.
A large number of samples is needed for accuracy.
Algorithms: Clustering
39
What is Clustering?
The grouping of records, observations, or cases into classes of
similar objects.
A cluster is a collection of records that are
◦ Similar to one another
◦ Dissimilar to records in other clusters
Difference between Clustering and Classification
There is no target variable for clustering.
Clustering does not try to classify or predict the values of a target variable.
Instead, clustering algorithms seek to segment the entire data set into relatively
homogeneous subgroups or clusters,
◦ where the similarity of the records within a cluster is maximized, and
◦ the similarity to records outside the cluster is minimized.
Goal of Clustering
The between-cluster variation (BCV) should be large compared to the
within-cluster variation (WCV).
◦ Within-cluster variation (intra-cluster distance): the sum of distances
between objects in the same cluster is minimized
◦ Between-cluster variation (inter-cluster distance): the distances between
different clusters are maximized
K-means Clustering
45
k-Means Clustering
Input: n objects (or points) and a number k
Algorithm
1) Randomly choose k records to be the initial cluster centers
2) Assign each object to the cluster that has the closest centroid
3) When all objects have been assigned, recalculate the positions of the k
centroids
4) Repeat steps 2 and 3 until convergence or termination
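The four steps can be sketched in plain Python as follows. The function names, the fixed random seed, the iteration cap `max_iters`, and the sample points are assumptions for illustration. Keeping the old centroid when a cluster ends up empty is one common convention, not something the slides prescribe.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    # Step 1: randomly choose k records as the initial cluster centers.
    centroids = rng.sample(points, k)
    for _ in range(max_iters):
        # Step 2: assign each object to the cluster with the closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 3: recalculate the centroids (an empty cluster keeps its old one).
        new_centroids = [
            mean_point(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        # Step 4: stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical points forming two loose groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = k_means(points, k=2)
print(centroids)
print(clusters)
```

Different seeds can produce different final clusters, because the result depends on the initial centers.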
K-Mean Clustering
47
Termination Conditions
The algorithm terminates when the centroids no longer
change.
The SSE (sum of squared errors) value is less than some small
threshold value:

$$\mathrm{SSE} = \sum_{i=1}^{k} \sum_{p \in C_i} d(p, m_i)^2$$

where p ∈ C_i represents each data point in cluster i and m_i
represents the centroid of cluster i.
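The SSE can be computed from a set of clusters and their centroids as sketched below. The sketch assumes that d is the Euclidean distance; the function name `sse` and the sample clusters are made up for illustration.

```python
def sse(clusters, centroids):
    """Sum over clusters i and points p in C_i of d(p, m_i)^2 (Euclidean d)."""
    total = 0.0
    for cluster, m in zip(clusters, centroids):
        for p in cluster:
            # Squared Euclidean distance between point p and centroid m.
            total += sum((a - b) ** 2 for a, b in zip(p, m))
    return total


# Hypothetical clusters with their centroids (the mean of each cluster).
clusters = [[(0, 0), (0, 2)], [(10, 10), (12, 10)]]
centroids = [(0, 1), (11, 10)]

value = sse(clusters, centroids)
print(value)
```

Each of the four points lies at distance 1 from its centroid, so every point contributes 1 to the sum.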
Example 1:
Let's suppose the following points are the delivery locations for
pizza.
Let's locate three cluster centers randomly.
Find the distance of each point from the centers, as shown.
Assign each point to the nearest cluster center, based on the
distance between each center and the point.
Re-assign the cluster centres and locate the nearest points.
Re-assign the cluster centres, locate the nearest points, and calculate the
distances.
Form the three clusters.
Example 2:
56
57
How to decide k?
Unless the analyst has prior knowledge of the number of underlying clusters,
◦ Clustering solutions for each value of K are compared
◦ The value of K resulting in the smallest SSE is selected
Because SSE tends to keep decreasing as K grows, the comparison is usually made
over a limited range of candidate K values.
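Comparing clustering solutions for several values of K might look like the sketch below. It contains its own compact k-means so that it runs on its own; the function names, the fixed seed, the sample points, and the candidate range 1 to 6 are all assumptions for illustration.

```python
import random


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means_sse(points, k, seed=0, max_iters=100):
    """Run a basic k-means from random initial centers and return the final SSE."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(max_iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[idx].append(p)
        updated = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return sum(
        sq_dist(p, centroids[i])
        for i, cluster in enumerate(clusters)
        for p in cluster
    )


# Hypothetical data: nine distinct points in three loose groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (1, 8), (2, 9), (1, 9)]

sse_by_k = {k: k_means_sse(points, k) for k in range(1, 7)}
for k, value in sse_by_k.items():
    print(k, round(value, 2))
```

With K equal to the number of distinct points the SSE drops to zero, which is why the smallest SSE on its own cannot pick K over an unlimited range.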
Evaluation
59
Model Evaluation in the KDD Process
It is important that the test data is not used in any way to create the
classifier.
Classification Step 1: Split data into train and test sets
[Figure: data from the past, with known results (+/-), is split into a training set and a testing set]
Classification Step 2: Build a model on a training set
[Figure: the training set, with known results, is passed to the model builder; the testing set is held back]
Classification Step 3: Evaluate on test set (Re-train?)
[Figure: the model's predictions (Y/N) on the testing set are compared with the known results to evaluate the model]
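The three steps can be sketched end to end as follows. Everything here is an assumption for illustration: the helper names, the 30% test fraction, the invented one-feature data set, and the toy 1-nearest-neighbour "model builder". The point is only that the test set is never used when building the model.

```python
import random


def train_test_split(records, test_fraction=0.3, seed=0):
    """Step 1: shuffle the labeled records and hold part of them back for testing."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def build_model(training_set):
    """Step 2: toy model builder (1-nearest neighbour on one numeric feature)."""
    def model(x):
        nearest = min(training_set, key=lambda rec: abs(rec[0] - x))
        return nearest[1]
    return model


def accuracy(model, test_set):
    """Step 3: fraction of test records whose prediction matches the known result."""
    correct = sum(1 for x, label in test_set if model(x) == label)
    return correct / len(test_set)


# Hypothetical labeled data: "+" when the feature is at least 50, otherwise "-".
data = [(x, "+" if x >= 50 else "-") for x in range(0, 100, 5)]

train, test = train_test_split(data)
model = build_model(train)   # only the training set is used here
acc = accuracy(model, test)  # the test set is used only for evaluation
print(f"accuracy on the test set: {acc:.2f}")
```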

Introduction to data mining and machine learning

  • 1. Data Mining & Machine Learning Tilani Gunawardena PhD(UNIBAS), BSc.Eng(Pera), FHEA(UK), AMIE(SL) 2019/11/22
  • 6. Data Mining: the process of discovering patterns in data.
  • 7. Machine Learning
      ◦ Grew out of work in AI
      ◦ A new capability for computers
    Machine learning is the science of getting computers to learn without being explicitly programmed.
    Learning = improving with experience at some task:
      ◦ Improve over task T
      ◦ With respect to performance measure P
      ◦ Based on experience E
  • 8. Machine Learning
    Database mining
      ◦ Large datasets from the growth of automation and the web
      ◦ Ex: web click data, medical records, biology, engineering
    Applications that cannot be programmed by hand
      ◦ Ex: autonomous helicopter, handwriting recognition, most of NLP, computer vision
    Self-customizing programs
      ◦ Ex: Amazon and Netflix product recommendations
    Understanding human learning (brain, real AI)
  • 9. Types of Learning
      ◦ Supervised learning: learn to predict
          ◦ The correct answer is given for each example. The answer can be a numeric variable, a categorical variable, etc.
      ◦ Unsupervised learning: learn to understand and describe the data
          ◦ Correct answers are not given, only examples (e.g. the same figures as above, without the labels)
      ◦ Reinforcement learning: learn to act
          ◦ Only occasional rewards are given
    [Figure: example figures labeled M and F]
  • 11. Algorithms: The success of a machine learning system also depends on its algorithms. The algorithms control the search to find and build the knowledge structures. A learning algorithm should extract useful information from the training examples.
  • 12. Algorithms
    Supervised learning
      ◦ Prediction
      ◦ Classification (discrete labels), regression (real values)
    Unsupervised learning
      ◦ Clustering
      ◦ Probability distribution estimation
      ◦ Finding associations (in features)
      ◦ Dimension reduction
    Reinforcement learning
      ◦ Decision making (robot, chess machine)
  • 13. Supervised Learning
      ◦ The problem of taking a labeled dataset and gleaning information from it so that you can label new data sets
      ◦ Learn to predict output from input
      ◦ Function approximation
  • 14. Supervised learning: example 1. Predict housing prices. Regression: predict a continuous-valued output (price).
  • 15. Supervised learning: example 2. Breast cancer (malignant, benign). This is a classification problem: discrete-valued output (0 or 1).
  • 16. One attribute/feature: tumor size.
  • 17. Supervised learning: example 3. Two attributes/features: tumor size and age.
  • 18. Q?
      1. Input: credit history (number of loans, how much money you make, …). Output: lend money or not?
      2. Input: picture. Output: predict BSc, MSc, or PhD.
      3. Input: picture. Output: predict age.
      4. Input: large inventory of identical items. Output: predict how many items will sell over the next 3 months.
      5. Input: customer accounts. Output: hacked or not.
  • 19. Unsupervised Learning: find patterns and structure in data.
  • 20. Unsupervised Learning: examples
    Organize computing clusters
      ◦ Large data centers: which machines work together?
    Social network analysis
      ◦ Given information about which friends you email most, your Facebook friends, or your Google+ circles
      ◦ Can we automatically identify which are cohesive groups of friends?
  • 21. Market segmentation
      ◦ Take a customer data set and group the customers into different market segments
    Astronomical data analysis
      ◦ Clustering algorithms give interesting and useful theories, e.g. how galaxies are formed
  • 22. Q?
      1. Given email labeled as spam/not spam, learn a spam filter.
      2. Given a set of news articles found on the web, group them into sets of articles about the same story.
      3. Given a database of customer data, automatically discover market segments and group customers into different market segments.
      4. Given a dataset of patients diagnosed as either having diabetes or not, learn to classify new patients as having diabetes or not.
  • 23. Algorithms: K Nearest Neighbors
  • 24. Simple analogy: Tell me about your friends (who your neighbors are) and I will tell you who you are.
  • 25. KNN: different names
      ◦ K-Nearest Neighbors
      ◦ Memory-Based Reasoning
      ◦ Example-Based Reasoning
      ◦ Instance-Based Learning
      ◦ Lazy Learning
  • 26. What is KNN?
      ◦ A powerful classification algorithm used in pattern recognition.
      ◦ K nearest neighbors stores all available cases and classifies new cases based on a similarity measure (e.g. a distance function).
      ◦ One of the top data mining algorithms used today.
      ◦ A non-parametric, lazy learning algorithm (an instance-based learning method).
  • 27. KNN: Classification Approach. An object (a new instance) is classified by a majority vote of its neighbors' classes: the object is assigned to the most common class among its K nearest neighbors, as measured by a distance function.
  • 30. Distance measure for continuous variables
  • 31. Distance Between Neighbors
    Calculate the distance between the new example (E) and all examples in the training set, using the Euclidean distance between two examples.
      ◦ X = [x1, x2, x3, ..., xn]
      ◦ Y = [y1, y2, y3, ..., yn]
      ◦ The Euclidean distance between X and Y is defined as:
    $$D(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
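The Euclidean distance defined on this slide can be written as a short function. This is a minimal sketch in plain Python; the name `euclidean_distance` is chosen here for illustration and does not come from the slides.

```python
import math

def euclidean_distance(x, y):
    """Return the Euclidean distance between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    # Sum the squared coordinate differences, then take the square root.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

print(euclidean_distance([0, 0], [3, 4]))  # 5.0
```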
  • 32. K-Nearest Neighbor Algorithm
      ◦ All instances correspond to points in an n-dimensional feature space.
      ◦ Each instance is represented by a set of numerical attributes.
      ◦ The training data consists of a set of vectors, with a class label associated with each vector.
      ◦ Classification is done by comparing the feature vector of the new example E with those of its K nearest points:
          1. Select the K nearest examples to E in the training set.
          2. Assign E to the most common class among its K nearest neighbors.
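The two steps of the algorithm (select the K nearest examples, then take the most common class) might be sketched as follows. The function names and the toy training set are assumptions made for illustration; ties are broken in favor of the label seen first among the nearest neighbors, which is a choice of this sketch rather than something the slides specify.

```python
import math
from collections import Counter

def euclidean(x, y):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_classify(training, query, k):
    """Classify `query` by majority vote among its k nearest training examples.

    `training` is a list of (feature_vector, class_label) pairs.
    """
    # Rank every training example by its distance to the query.
    ranked = sorted(training, key=lambda example: euclidean(example[0], query))
    # Keep the labels of the k closest examples and take the most common one.
    nearest_labels = [label for _, label in ranked[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical toy data: two well-separated groups.
training_set = [((1, 1), "A"), ((1, 2), "A"), ((5, 5), "B"), ((6, 5), "B")]
print(knn_classify(training_set, (2, 1), k=3))  # A
```

Because every training example is compared with the query at prediction time, nothing is learned in advance, which is why KNN is called a lazy learner.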
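  • 33. Distance from John
      Customer | Age | Income | No. of credit cards | Class | Distance from John
      George   | 35  | 35K    | 3                   | No    | sqrt[(35-37)^2 + (35-50)^2 + (3-2)^2] = 15.17
      Rachel   | 22  | 50K    | 2                   | Yes   | sqrt[(22-37)^2 + (50-50)^2 + (2-2)^2] = 15
      Steve    | 63  | 200K   | 1                   | No    | sqrt[(63-37)^2 + (200-50)^2 + (1-2)^2] = 152.24
      Tom      | 59  | 170K   | 1                   | No    | sqrt[(59-37)^2 + (170-50)^2 + (1-2)^2] = 122
      Anne     | 25  | 40K    | 4                   | Yes   | sqrt[(25-37)^2 + (40-50)^2 + (4-2)^2] = 15.75
      John     | 37  | 50K    | 2                   | ?     |
    John's three nearest neighbors are Rachel (Yes), George (No), and Anne (Yes), so the majority vote gives John's class: YES.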
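The distances in this example can be checked with a few lines of Python. The feature values are taken from the slide's table (income in thousands); the choice of K = 3 is an assumption of this sketch, since the slide gives the answer but does not state K.

```python
import math
from collections import Counter

# (name, (age, income in thousands, number of credit cards), class) from the slide.
customers = [
    ("George", (35, 35, 3), "No"),
    ("Rachel", (22, 50, 2), "Yes"),
    ("Steve", (63, 200, 1), "No"),
    ("Tom", (59, 170, 1), "No"),
    ("Anne", (25, 40, 4), "Yes"),
]
john = (37, 50, 2)

def distance(x, y):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Rank the customers by their distance from John, closest first.
ranked = sorted((distance(features, john), name, label)
                for name, features, label in customers)
for d, name, label in ranked:
    print(f"{name:7s} {d:7.2f}  {label}")

k = 3  # assumed value; not stated on the slide
votes = Counter(label for _, _, label in ranked[:k])
prediction = votes.most_common(1)[0][0]
print("Predicted class for John:", prediction)  # Yes
```

Income is on a much larger scale than the other two attributes, so it dominates these unscaled distances.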
  • 34. How to choose K?
      ◦ If K is too small, the classifier is sensitive to noise points.
      ◦ A larger K generally works better, but too large a K may include a majority of points from other classes.
      ◦ Rule of thumb: K < sqrt(n), where n is the number of examples.
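As a rough illustration of the rule of thumb, the helper below returns the largest odd integer strictly below sqrt(n). Preferring an odd K (to avoid tied votes in two-class problems) is an assumption added in this sketch, not part of the slide, and the function name is hypothetical.

```python
import math

def rule_of_thumb_k(n):
    """Largest odd K with K < sqrt(n), never smaller than 1."""
    upper = math.isqrt(n)          # floor(sqrt(n))
    if upper * upper == n:
        upper -= 1                 # keep the inequality strict when n is a perfect square
    upper = max(upper, 1)
    # Step down to an odd value so a two-class vote cannot tie.
    return upper if upper % 2 == 1 else max(upper - 1, 1)

print(rule_of_thumb_k(100))  # 9
print(rule_of_thumb_k(5))    # 1
```

This only gives a starting point; the slide's warnings about K being too small or too large still apply.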
  • 36. The K-nearest neighbors of a record x are the data points that have the K smallest distances to x. [Figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor]
  • 37. Strengths of KNN
      ◦ Very simple and intuitive.
      ◦ Can be applied to data from any distribution.
      ◦ Gives good classification if the number of samples is large enough.
  • 38. Weaknesses of KNN
      ◦ Takes more time to classify a new example, because the distance from the new example to all other examples must be calculated and compared.
      ◦ Choosing K may be tricky.
      ◦ Needs a large number of samples for accuracy.
  • 40. What is Clustering? Grouping records, observations, or cases into classes of similar objects. A cluster is a collection of records that are:
      ◦ Similar to one another
      ◦ Dissimilar to records in other clusters
  • 43. Difference between Clustering and Classification
      ◦ There is no target variable for clustering.
      ◦ Clustering does not try to classify or predict the values of a target variable.
      ◦ Instead, clustering algorithms seek to segment the entire data set into relatively homogeneous subgroups or clusters, where the similarity of the records within a cluster is maximized and the similarity to records outside the cluster is minimized.
  • 44. Goal of Clustering
    The between-cluster variation (BCV) should be large compared to the within-cluster variation (WCV):
      ◦ Intra-cluster distance: the sum of distances between objects in the same cluster is minimized
      ◦ Inter-cluster distance: the distances between different clusters are maximized
  • 46. k-Means Clustering
    Input: n objects (or points) and a number K
    Algorithm:
      1) Randomly assign K records to be the initial cluster center locations
      2) Assign each object to the group that has the closest centroid
      3) When all objects have been assigned, recalculate the positions of the K centroids
      4) Repeat steps 2 to 3 until convergence or termination
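The four steps above can be sketched in plain Python. This is an illustrative implementation under a few assumptions not stated on the slide: points are tuples of numbers, an empty cluster keeps its previous centroid, convergence means the centroids stop changing, and `max_iter` (at least 1) and `seed` are hypothetical parameters added to bound and reproduce the run.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance; enough for finding the closest centroid."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, k, max_iter=100, seed=0):
    """Return (centroids, clusters) for a list of coordinate tuples."""
    rng = random.Random(seed)
    # Step 1: pick K records at random as the initial cluster centers.
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(max_iter):
        # Step 2: assign each point to the group with the closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 3: recalculate each centroid as the mean of its cluster.
        updated = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]  # assumption: empty cluster keeps its center
            for i, cluster in enumerate(clusters)
        ]
        # Step 4: repeat until the centroids no longer change.
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters

# Hypothetical data: two visibly separate groups of points.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
centroids, clusters = k_means(points, k=2)
print(centroids)
print(clusters)
```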
  • 48. Termination Conditions
      ◦ The algorithm terminates when the centroids no longer change.
      ◦ The SSE (sum of squared errors) value is less than some small threshold value.
    $$\mathrm{SSE} = \sum_{i=1}^{k} \sum_{p \in C_i} d(p, m_i)^2$$
    where p ∈ C_i represents each data point in cluster i and m_i represents the centroid of cluster i.
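The SSE formula can be computed directly from a clustering result. In this sketch, d is taken to be the Euclidean distance (an assumption consistent with the distance used earlier in the deck), and the function name and sample data are hypothetical.

```python
def sum_of_squared_errors(clusters, centroids):
    """SSE: for every cluster i, add d(p, m_i)^2 over all points p in that cluster."""
    total = 0.0
    for cluster, centroid in zip(clusters, centroids):
        for point in cluster:
            # Squared Euclidean distance between the point and its centroid.
            total += sum((pc - mc) ** 2 for pc, mc in zip(point, centroid))
    return total

# Hypothetical clustering: two clusters with their centroids.
example_clusters = [[(0, 0), (2, 0)], [(10, 10)]]
example_centroids = [(1, 0), (10, 10)]
print(sum_of_squared_errors(example_clusters, example_centroids))  # 2.0
```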
  • 49. Example 1: Let's suppose the following points are the delivery locations for pizza.
  • 50. Let's locate three cluster centers randomly.
  • 51. Find the distance of each point from the centers, as shown.
  • 52. Assign each point to the nearest cluster center, based on the distance between each center and the point.
  • 53. Re-assign the cluster centers and locate the nearest points.
  • 54. Re-assign the cluster centers, locate the nearest points, and calculate the distances.
  • 55. Form the three clusters.
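  • 58. How to decide K? Unless the analyst has prior knowledge of the number of underlying clusters:
      ◦ The clustering solutions for each value of K are compared
      ◦ The value of K resulting in the smallest SSE is selected
    Note that SSE generally shrinks as K grows (it reaches zero when every point is its own cluster), so in practice the comparison looks for the K beyond which SSE stops improving much.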
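Comparing clustering solutions across values of K could look like the sketch below, which runs a small k-means for each K and records the resulting SSE. The data set, the function names, and the `seed` and `max_iter` parameters are assumptions made for illustration; a single random start per K is used, whereas a real analysis would usually try several.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, k, max_iter=100, seed=0):
    """Basic k-means on coordinate tuples; returns (centroids, clusters)."""
    centroids = random.Random(seed).sample(points, k)
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        updated = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters

def sse(centroids, clusters):
    """Sum of squared distances from each point to its cluster centroid."""
    return sum(squared_distance(p, m)
               for m, cluster in zip(centroids, clusters)
               for p in cluster)

# Hypothetical data with two natural groups.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]

# Run the clustering once per candidate K and record the SSE of each solution.
sse_by_k = {k: sse(*k_means(points, k)) for k in range(1, len(points) + 1)}
for k, value in sse_by_k.items():
    print(k, round(value, 3))
```

On this data the SSE drops sharply from K = 1 to K = 2 and falls to zero when K equals the number of points, which is why the raw minimum alone is not a useful selection rule.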
  • 60. Model’s Evaluation in the KDD Process
  • 61. It is important that the test data is not used in any way to create the classifier.
  • 63. Classification Step 1: Split data into train and test sets. [Diagram: data from the past, with known results (+/-), is divided into a training set and a testing set.]
  • 64. Classification Step 2: Build a model on the training set. [Diagram: the training set, with known results, feeds a model builder; the testing set is held aside.]
  • 65. Classification Step 3: Evaluate on the test set (re-train?). [Diagram: the model built from the training set makes predictions (Y/N) on the testing set, and these are evaluated against the known results.]
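The three classification steps can be sketched end to end in plain Python. Everything specific here is an assumption for illustration: the toy data, the 70/30 split, the fixed seed, and the use of a 1-nearest-neighbor classifier as the "model". The test set is used only in the final evaluation, never to build the classifier.

```python
import random

def train_test_split(examples, test_fraction=0.3, seed=0):
    """Step 1: shuffle the labeled examples and split them into train and test sets."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def predict(training, x):
    """Step 2 (lazy model): label x with the class of its nearest training example."""
    closest = min(training,
                  key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], x)))
    return closest[1]

def accuracy(training, test):
    """Step 3: fraction of test examples whose predicted label matches the known one."""
    correct = sum(predict(training, x) == y for x, y in test)
    return correct / len(test)

# Hypothetical labeled data: "+" points near the origin, "-" points far from it.
data = [((i, i + 1), "+") for i in range(5)] + \
       [((i + 10, i + 11), "-") for i in range(5)]

train_set, test_set = train_test_split(data)
print(len(train_set), len(test_set))
print(accuracy(train_set, test_set))
```

If the accuracy on the test set is unsatisfactory, the model is rebuilt from the training set with different settings and evaluated again.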