Quick Tour of Data Mining
Yi-Shin Chen
Institute of Information Systems and Applications
Department of Computer Science
National Tsing Hua University
About Speaker
陳宜欣 Yi-Shin Chen
▷ Currently
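• Associate Professor, Department of Computer Science, National Tsing Hua University
• Director of the Intelligent Data Engineering and Applications Lab (IDEA Lab)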
▷ Education
• Ph.D. in Computer Science, USC, USA
• M.B.A. in Information Management, NCU, TW
• B.B.A. in Information Management, NCU, TW
▷ Courses (all in English)
• Research and Presentation Skills
• Introduction to Database Systems
• Advanced Database Systems
• Data Mining: Concepts, Techniques, and
Evolution of Data
The relationships between the techniques and our world
[Timeline, 1900 to 1970]
1931: Gödel's Incompleteness Theorem
1944: Mark I (Server)
1948: Information theory, Information Entropy
1950: Univac had developed a magnetic tape
1951: Univac I delivered to the US Census Bureau
1963: The origins of the Internet
Programmed Record
• Birth of high-level
• Batch processing
Record Managers
On-line Network
• Indexed sequential
• Data independence
• Concurrent Access
[Timeline, 1970 to 2010]
1974: IBM System R
1976: E-R Model by Peter Chen
1980: Artificial Neural
1985: 1st standardized of SQL
1993: WWW
2001: Data Science
2006: Elastic Compute Cloud
2009: Deep Learning
Knowledge Discovery in Databases
Relational Model
• Give Database users high-level set-oriented data access
Object Relational
• Support multiple datatypes and
Data Mining
What we know, and what we do now
Data Mining
▷ What is data mining?
• Algorithms for seeking unexpected “pearls of wisdom”
▷ Current data mining research:
• Focus on efficient ways to discover models of existing data sets
• Developed algorithms include classification, clustering,
association-rule discovery, summarization, etc.
Data Mining Examples
Slide from: Prof. Shou-De Lin
Origins of Data Mining
▷ Draws ideas from
• Machine learning/AI
• Pattern recognition
• Statistics
• Database systems
▷ Traditional Techniques may be unsuitable due to
• Enormity of data
• High dimensionality of data
• Heterogeneous, distributed nature of data
© Tan, Steinbach, Kumar Introduction to Data Mining
Knowledge Discovery (KDD) Process
Data Cleaning
Data Integration
Data Mining
Informal Design Guidelines for Database
▷ Design a schema that can be explained easily relation by
relation. The semantics of attributes should be easy to interpret
▷ Should avoid update anomaly problems
▷ Relations should be designed such that their tuples will have as
few NULL values as possible
▷ The relations should be designed to satisfy the lossless join
condition (guarantee meaningful results for join operations)
Data Warehouse
▷Assemble and manage data from various sources
for the purpose of answering business questions
Data Warehouse
Knowledge Discovery (KDD) Process
Data Cleaning
Data Integration
Data Mining
KDD Process: Several Key Steps
▷ Pre-processing
• Learning the application domain
→ Relevant prior knowledge and goals of application
• Creating a target data set: data selection
• Data cleaning and preprocessing: (may take 60% of effort!)
• Data reduction and transformation
→ Find useful features
▷ Data mining
• Choosing functions of data mining
→ Choosing the mining algorithm
• Search for patterns of interest
▷ Evaluation
• Pattern evaluation and knowledge presentation
→ visualization, transformation, removing redundant patterns, etc.
© Han & Kamber Data Mining: Concepts and Techniques
Many slides provided by Tan, Steinbach, Kumar for book “Introduction to Data Mining” are adapted in this presentation
The most important part in the whole process
Types of Attributes
▷There are different types of attributes
• Nominal (=,≠)
→ Nominal values can only distinguish one object from another
→ Examples: ID numbers, eye color, zip codes
• Ordinal (<,>)
→ Ordinal values can help to order objects
→ Examples: rankings, grades
• Interval (+,-)
→ The differences between values are meaningful
→ Examples: calendar dates
• Ratio (*,/)
→ Both differences and ratios are meaningful
→ Examples: temperature in Kelvin, length, time, counts
Types of Data Sets
• Data Matrix
• Document Data
• Transaction Data
• World Wide Web
• Molecular Structures
• Spatial Data
• Temporal Data
• Sequential Data
• Genetic Sequence Data
Document 1: 3 0 5 0 2 6 0 2 0 2
Document 2: 7 0 2 1 0 0 3 0 0
Document 3: 1 0 0 1 2 2 0 3 0
TID Time Items
1 2009/2/8 Bread, Coke, Milk
2 2009/2/13 Beer, Bread
3 2009/2/23 Beer, Diaper
4 2009/3/1 Coke, Diaper, Milk
A Facebook Example
Data Matrix/Graph Data Example
Document Data
Transaction Data
Spatio-Temporal Data
Sequential Data
Tips for Converting Text to
Numerical Values
Recap: Types of Attributes
▷There are different types of attributes
• Nominal (=,≠)
→ Nominal values can only distinguish one object from another
→ Examples: ID numbers, eye color, zip codes
• Ordinal (<,>)
→ Ordinal values can help to order objects
→ Examples: rankings, grades
• Interval (+,-)
→ The differences between values are meaningful
→ Examples: calendar dates
• Ratio (*,/)
→ Both differences and ratios are meaningful
→ Examples: temperature in Kelvin, length, time, counts
Vector Space Model
▷Represent the keywords of objects using a term vector
• Term: basic concept, e.g., keywords to describe an object
• Each term represents one dimension in a vector
• N total terms define an N-dimensional term vector
• The value of each term in a vector corresponds to the
importance of that term
▷Measure similarity by the vector distances
Document 1: 3 0 5 0 2 6 0 2 0 2
Document 2: 7 0 2 1 0 0 3 0 0
Document 3: 1 0 0 1 2 2 0 3 0
Term Frequency and Inverse
Document Frequency (TFIDF)
▷Since not all objects in the vector space are equally
important, we can weight each term using its
occurrence probability in the object description
• Term frequency: TF(d,t)
→ number of times t occurs in the object description d
• Inverse document frequency: IDF(t)
→ to scale down the terms that occur in many descriptions
Normalizing Term Frequency
▷n_ij represents the number of times a term t_i occurs in
a description d_j. tf_ij can be normalized using the total
number of terms in the document
• $tf_{ij} = \frac{n_{ij}}{NormalizedValue}$
▷NormalizedValue could be:
• Sum of all frequencies of terms
• Max frequency value
• Any other value that keeps tf_ij between 0 and 1
Inverse Document Frequency
▷ IDF seeks to scale down the coordinates of terms
that occur in many object descriptions
• For example, some stop words(the, a, of, to, and…) may
occur many times in a description. However, they should
be considered as non-important in many cases
• $idf_i = \log\left(\frac{N}{df_i}\right) + 1$
→ where N is the total number of descriptions and df_i (document
frequency of term t_i) is the number of descriptions in which t_i occurs
▷ IDF can be replaced with ICF (inverse class frequency) and
many other concepts based on applications
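A minimal sketch of TF-IDF weighting, under two assumptions that are not fixed by the slides: NormalizedValue is the max frequency in each description, and idf is computed as log(N / df) + 1 with N the number of descriptions. The toy descriptions and the function name `tf_idf` are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(descriptions):
    """Return one {term: weight} dict per description (each a non-empty token list)."""
    n_docs = len(descriptions)
    # df[t]: number of descriptions in which term t occurs at least once
    df = Counter(term for doc in descriptions for term in set(doc))
    weights = []
    for doc in descriptions:
        counts = Counter(doc)
        max_count = max(counts.values())  # assumed NormalizedValue: max frequency
        weights.append({
            term: (n / max_count) * (math.log(n_docs / df[term]) + 1)
            for term, n in counts.items()
        })
    return weights

# Hypothetical toy descriptions
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]
w = tf_idf(docs)
```

A term such as "the", which occurs in every description, gets the smallest possible idf (1 under this formula), so rarer terms such as "mat" or "bird" end up with larger weights.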
Reasons for Log
▷ Each distribution can indicate the hidden force
Power-law distribution Normal distribution Normal distribution
Data Quality
Dirty Data
Big Data?
▷ “Every day, we create 2.5 quintillion bytes of data — so much
that 90% of the data in the world today has been created in the
last two years alone. This data comes from everywhere:
sensors used to gather climate information, posts to social
media sites, digital pictures and videos, purchase transaction
records, and cell phone GPS signals to name a few. This data
is “big data.”
• --from
Data Quality
▷What kinds of data quality problems?
▷How can we detect problems with the data?
▷What can we do about these problems?
▷Examples of data quality problems:
•Noise and outliers
•Missing values
•Duplicate data
▷Noise refers to modification of original values
• Examples: distortion of a person’s voice when talking
on a poor phone connection, and “snow” on a television screen
[Figure panels: Two Sine Waves, Two Sine Waves + Noise]
▷Outliers are data objects with characteristics
that are considerably different than most of
the other data objects in the data set
Missing Values
▷Reasons for missing values
• Information is not collected
→ e.g., people decline to give their age and weight
• Attributes may not be applicable to all cases
→ e.g., annual income is not applicable to children
▷Handling missing values
• Eliminate Data Objects
• Estimate Missing Values
• Ignore the Missing Value During Analysis
• Replace with all possible values
→ Weighted by their probabilities
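The first three handling strategies above can be sketched as follows. This is an illustration only: the records, the attribute names, and the choice of the mean as the estimate are assumptions, not something the slides prescribe.

```python
from statistics import mean

# Hypothetical records; None marks a missing value
records = [
    {"name": "A", "age": 25, "weight": 60.0},
    {"name": "B", "age": None, "weight": 72.5},
    {"name": "C", "age": 31, "weight": None},
    {"name": "D", "age": 40, "weight": 80.0},
]

def eliminate(rows, attrs):
    """Eliminate data objects that have a missing value in any of attrs."""
    return [r for r in rows if all(r[a] is not None for a in attrs)]

def estimate_with_mean(rows, attr):
    """Estimate missing values of attr with the mean of the observed values."""
    fill = mean(r[attr] for r in rows if r[attr] is not None)
    return [dict(r, **{attr: fill if r[attr] is None else r[attr]}) for r in rows]

def mean_ignoring_missing(rows, attr):
    """Ignore missing values during analysis: use observed values only."""
    return mean(r[attr] for r in rows if r[attr] is not None)

complete = eliminate(records, ["age", "weight"])
imputed = estimate_with_mean(records, "age")
avg_weight = mean_ignoring_missing(records, "weight")
```

Elimination keeps only two of the four records here, which shows why it can be costly when many objects have at least one missing attribute.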
Duplicate Data
▷Data set may include data objects that are
duplicates, or almost duplicates of one another
• Major issue when merging data from heterogeneous sources
• Same person with multiple email addresses
▷Data cleaning
• Process of dealing with duplicate data issues
Data Preprocessing
To be or not to be
Data Preprocessing
▷Dimensionality reduction
▷Feature subset selection
▷Feature creation
▷Discretization and binarization
▷Attribute transformation
▷Combining two or more attributes (or objects) into a single
attribute (or object)
• Data reduction
→ Reduce the number of attributes or objects
• Change of scale
→ Cities aggregated into regions, states, countries, etc
• More “stable” data
→ Aggregated data tends to have less variability
SELECT d.Name, avg(Salary)
FROM Employee AS e, Department AS d
WHERE e.Dept = d.DNo
GROUP BY d.Name
▷Sampling is the main technique employed for data selection
• It is often used for both
→ Preliminary investigation of the data
→ The final data analysis
• Reasons:
→ Statistics: Obtaining the entire set of data of interest is too expensive or time consuming
→ Data mining: Processing the entire data set is too expensive or time consuming
Key Principle For Effective Sampling
▷The sample is representative
•Using a sample will work almost as well as using
the entire data set
•The sample has approximately the same property (of interest) as the
original set of data
Sample Size Matters
8000 points 2000 Points 500 Points
Sampling Bias
▷ 2004 Taiwan presidential election polls
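Survey dates: January 15 to January 17, 2004 (ROC year 93)
Valid samples: 1,068 respondents; refusals: 699
Sampling error: about ±3 percentage points at the 95% confidence level
Survey area: Taiwan
Sampling method: stratified systematic sampling from the telephone directory, with the last two digits of each phone number randomized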
Dimensionality Reduction
• Avoid curse of dimensionality
• Reduce amount of time and memory required by data
mining algorithms
• Allow data to be more easily visualized
• May help to eliminate irrelevant features or reduce noise
• Principal Component Analysis
• Singular Value Decomposition
• Others: supervised and non-linear techniques
Curse of Dimensionality
▷When dimensionality increases, data becomes
increasingly sparse in the space that it occupies
• Definitions of density and distance between points, which are
critical for clustering and outlier detection, become less meaningful
• Randomly generate 500 points
• Compute the difference between the max and min
distance between any pair of points
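The experiment described above can be sketched as below. Two choices here are assumptions: the difference is divided by the minimum distance to make it comparable across dimensions, and 200 points are used instead of 500 to keep the run short.

```python
import math
import random

def relative_contrast(n_points, n_dims, rng):
    """Return (max_dist - min_dist) / min_dist over all pairs of random points."""
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    dists = [
        math.dist(points[i], points[j])
        for i in range(n_points)
        for j in range(i + 1, n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)

rng = random.Random(0)  # fixed seed so the run is repeatable
contrast = {d: relative_contrast(200, d, rng) for d in (2, 10, 50)}
for dims, value in contrast.items():
    print(dims, round(value, 2))
```

The printed ratio shrinks as the number of dimensions grows: the nearest and farthest pairs become almost equally far apart, which is why distance-based notions lose meaning.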
Dimensionality Reduction: PCA
▷Goal is to find a projection that captures
the largest amount of variation in data
Feature Subset Selection
▷Another way to reduce dimensionality of data
▷Redundant features
•Duplicate much or all of the information contained in
one or more other attributes
•E.g. purchase price of a product vs. sales tax
▷Irrelevant features
•Contain no information that is useful for the data
mining task at hand
•E.g. students' ID is often irrelevant to the task of
predicting students' GPA
Feature Creation
▷Create new attributes that can capture the
important information in a data set much
more efficiently than the original attributes
▷Three general methodologies:
•Feature extraction
•Mapping data to new space
•Feature construction
→Combining features
Mapping Data to a New Space
▷Fourier transform
▷Wavelet transform
Two Sine
Two Sine Waves +
Noise Frequency
Discretization Using Class Labels
▷Entropy based approach
3 categories for both x and y 5 categories for both x and y
Discretization Without Using Class Labels
Attribute Transformation
▷A function that maps the entire set of values of a
given attribute to a new set of replacement values
• So each old value can be identified with one of the new values
• Simple functions: x^k, log(x), e^x, |x|
• Standardization and Normalization
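A short sketch of the last bullet, assuming "standardization" means the z-score and "normalization" means min-max rescaling to [0, 1] (the slides do not spell out either formula). The income values are illustrative.

```python
from statistics import mean, pstdev

def standardize(values):
    """z-score: subtract the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def min_max_normalize(values):
    """Rescale values linearly so that they fall in [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

incomes = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]  # illustrative, in thousands
z = standardize(incomes)
scaled = min_max_normalize(incomes)
```

Both functions assume the values are not all identical; otherwise the divisor is zero.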
Transformation Examples
Preprocessing in Reality
Data Collection
▷Align /Classify the attributes correctly
[Figure labels: Who posted this message, Mentioned User, Shared URL]
Language Detection
▷To detect the language (or possible languages)
in which the specified text is written
•Short message
•Different languages in one statement
你好 現在幾點鐘 → Traditional Chinese (zh-tw)
apa kabar sekarang jam berapa ? → Indonesian (id)
Wrong Detection Examples
▷Twitter examples
@sayidatynet top song #LailaGhofran
shokran ya garh new album #listen
#ChineseTaipei #Sochi #2014冬奧
Before / after removing noise
en -> id
it -> zh-tw
en -> ja
Removing Noise
▷Removing noise before detection
•Html file ->tags
•Twitter -> hashtag, mention, URL
<meta name="twitter:description"
搜 尋 引 擎 巨 擘 Google8 日 在 法 文 版 首 頁
(張貼悔過書 ..."/>
首頁(張貼悔過書 ...
Data Cleaning
▷Special character
▷Utilize regular expressions to clean data
Unicode emoticons ☺, ♥…
Symbol icon ☏, ✉…
Currency symbol €, £, $...
Tweet URL
Filter out non-(letters, space,
punctuation, digit) ◕‿◕ Friendship is everything ♥ ✉
I added a video to a @YouTube playlist Jamie Riepe
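A minimal sketch of regex-based cleaning for English tweets. The patterns and the allowed-character set are assumptions chosen for illustration, and the URL in the second example is a placeholder, not taken from the slides.

```python
import re

URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
# Anything that is not an ASCII letter, digit, whitespace, or basic punctuation
NOT_ALLOWED = re.compile(r"[^A-Za-z0-9\s.,!?'\"-]")

def clean_tweet(text):
    """Remove URLs, mentions, hashtags, and special characters, in that order."""
    for pattern in (URL, MENTION, HASHTAG, NOT_ALLOWED):
        text = pattern.sub(" ", text)
    return " ".join(text.split())  # collapse runs of whitespace

cleaned_1 = clean_tweet("◕‿◕ Friendship is everything ♥ ✉")
cleaned_2 = clean_tweet(
    "I added a video to a @YouTube playlist http://example.com/x Jamie Riepe"
)
```

The ASCII-only filter is deliberate for this example; it would erase Chinese or Japanese text, so other languages need their own allowed-character ranges.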
Japanese Examples
▷Use regular expressions to remove all
special words
す^o^ アイコン、ラブラブ(-_-)♡
•うふふふふ 楽しむ ありがとうございます ア
イコン ラブラブ
Part-of-speech (POS) Tagging
▷Processing text and assigning parts of
speech to each word
▷Twitter POS tagging
•Noun (N), Adjective (A), Verb (V), URL (U)…
Happy Easter! I went to work and came home to an empty house now im
going for a quick run
Happy_A Easter_N !_, I_O went_V to_P work_N and_& came_V home_N
to_P an_D empty_A house_N now_R im_L going_V for_P a_D quick_A
▷@DirtyDTran gotta be caught up for
tomorrow nights episode
▷@ASVP_Jaykey for some reasons I found
this very amusing
• @DirtyDTran gotta be catch up for tomorrow night episode
• @ASVP_Jaykey for some reason I find this very amusing
RT @kt_biv : @caycelynnn loving and missing you! we are
still looking for Lucy
love miss be
Hashtag Segmentation
▷By using Microsoft Web N-Gram Service
(or by using Viterbi algorithm)
#pray #for #boston
Wow! explosion at a boston race ... #prayforboston
#citizen #science
#boston #marathon
#good #things #are #coming
#low #blood #pressure
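The Viterbi option can be sketched with a unigram model: dynamic programming over split points, keeping the most probable segmentation of each prefix. The vocabulary and its probabilities below are hypothetical; a real system would estimate them from a large corpus or an n-gram service.

```python
import math

# Hypothetical unigram probabilities, for illustration only
UNIGRAM = {
    "pray": 0.002, "for": 0.02, "boston": 0.001, "marathon": 0.0005,
    "citizen": 0.0005, "science": 0.001, "good": 0.01, "things": 0.004,
    "are": 0.02, "coming": 0.002, "low": 0.003, "blood": 0.002,
    "pressure": 0.001, "a": 0.03, "on": 0.02, "or": 0.01,
}
UNKNOWN_LOG_PROB = math.log(1e-12)  # penalty for each unknown single character

def segment(hashtag, max_word_len=12):
    """Split a hashtag into the most probable word sequence."""
    text = hashtag.lstrip("#").lower()
    n = len(text)
    # best[i] = (log-probability, words) of the best segmentation of text[:i]
    best = [(0.0, [])] + [(-math.inf, [])] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = text[start:end]
            if word in UNIGRAM:
                score = math.log(UNIGRAM[word])
            elif len(word) == 1:
                score = UNKNOWN_LOG_PROB  # fall back to a single character
            else:
                continue
            candidate = best[start][0] + score
            if candidate > best[end][0]:
                best[end] = (candidate, best[start][1] + [word])
    return best[n][1]

words = segment("#prayforboston")
```

Allowing unknown single characters with a heavy penalty guarantees that every hashtag gets some segmentation, even when it contains words outside the vocabulary.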
More Preprocesses for Different Web
▷Extract source code without javascript
▷Removing html tags
Extract Source Code Without Javascript
▷Javascript code should be considered as an exception
• it may contain hidden content
Remove Html Tags
▷Removing html tags to extract meaningful content
More Preprocesses for Different Languages
▷Chinese Simplified/Traditional Conversion
▷Word segmentation
Chinese Simplified/Traditional Conversion
▷Word conversion
• 请乘客从后门落车 → 請乘客從後門下車
▷One-to-many mapping
• @shinrei 出去旅游还是崩坏 → @shinrei 出去旅游還是崩壞
游 (zh-cn) → 游|遊 (zh-tw)
▷Wrong segmentation
• 人体内存在很多微生物 → 內存: 人體 記憶體 在很多微生物
→ 存在: 人體內 存在 很多微生物
Wrong Chinese Word Segmentation
▷Wrong segmentation
• 這(Nep) 地面(Nc) 積(VJ) 還(D) 真(D) 不(D) 小(VH)
▷Wrong word
• @iamzeke 實驗(Na) 室友(Na) 多(Dfa) 危險(VH) 你(Nh) 不(D) 知道(VK) 嗎
(T) ?
▷Wrong order
• 人體(Na) 存(VC) 內在(Na) 很多(Neqa) 微生物(Na)
▷Unknown word
• 半夜(Nd) 逛團(Na) 購(VC) 看到(VE) 太(Dfa) 吸引人(VH) !!
Unknown word: 團購
Similarity and Dissimilarity
To like or not to like
Similarity and Dissimilarity
▷Similarity
• Numerical measure of how alike two data objects are
• Is higher when objects are more alike
• Often falls in the range [0,1]
▷Dissimilarity
• Numerical measure of how different two data objects are
• Lower when objects are more alike
• Minimum dissimilarity is often 0
• Upper limit varies
Euclidean Distance
$dist(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}$
where n is the number of dimensions (attributes) and p_k and
q_k are, respectively, the kth attributes (components) of data
objects p and q.
▷Standardization is necessary if scales differ.
Minkowski Distance
▷ Minkowski Distance is a generalization of Euclidean Distance
$dist(p, q) = \left( \sum_{k=1}^{n} |p_k - q_k|^r \right)^{1/r}$
where r is a parameter, n is the number of dimensions
(attributes) and p_k and q_k are, respectively, the kth attributes
(components) of data objects p and q.
▷Minkowski distance is extremely sensitive to the scales of the variables involved
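A small sketch of the Minkowski distance, with r = 1 (city block) and r = 2 (Euclidean) as special cases. The two points are arbitrary illustration values.

```python
def minkowski(p, q, r):
    """Minkowski distance with parameter r (r=1: city block, r=2: Euclidean)."""
    return sum(abs(pk - qk) ** r for pk, qk in zip(p, q)) ** (1 / r)

p, q = (0, 2), (3, 6)
manhattan = minkowski(p, q, 1)  # |0-3| + |2-6|
euclidean = minkowski(p, q, 2)  # sqrt(3^2 + 4^2)
```

If one attribute were measured on a much larger scale than the other, it would dominate both results, which is the reason to standardize first.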
Mahalanobis Distance
▷Mahalanobis distance measure:
•Transforms the variables so that their covariance is removed
•Scales each transformed variable to unit variance
•Calculates simple Euclidean distance
$mahalanobis(p, q) = (p - q)\, S^{-1}\, (p - q)^T$
S is the covariance matrix of the input data
Similarity Between Binary Vectors
▷ Common situation is that objects, p and q, have
only binary attributes
▷ Compute similarities using the following quantities
M01 = the number of attributes where p was 0 and q was 1
M10 = the number of attributes where p was 1 and q was 0
M00 = the number of attributes where p was 0 and q was 0
M11 = the number of attributes where p was 1 and q was 1
▷ Simple Matching and Jaccard Coefficients
SMC = number of matches / number of attributes
= (M11 + M00) / (M01 + M10 + M11 + M00)
J = number of 11 matches / number of not-both-zero attribute values
= (M11) / (M01 + M10 + M11)
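The two coefficients can be computed directly from the four counts. The vectors p and q below are an illustrative pair chosen so that the two measures disagree.

```python
def binary_similarity(p, q):
    """Return (SMC, Jaccard) for two equal-length 0/1 vectors."""
    m11 = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)
    m00 = sum(1 for a, b in zip(p, q) if a == 0 and b == 0)
    m10 = sum(1 for a, b in zip(p, q) if a == 1 and b == 0)
    m01 = sum(1 for a, b in zip(p, q) if a == 0 and b == 1)
    smc = (m11 + m00) / (m01 + m10 + m11 + m00)
    non_zero = m01 + m10 + m11
    jaccard = m11 / non_zero if non_zero else 0.0  # convention when both are all-zero
    return smc, jaccard

p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
smc, jaccard = binary_similarity(p, q)
```

Here M11 = 0, M00 = 7, M10 = 1, M01 = 2, so SMC is 0.7 while Jaccard is 0: SMC rewards the shared zeros, Jaccard ignores them.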
Cosine Similarity
▷ If d1 and d2 are two document vectors, then
cos( d1, d2 ) = (d1 · d2) / (||d1|| ||d2||),
where · indicates vector dot product and || d || is the length of vector d.
▷ Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
d1 · d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3+2*2+0*0+5*5+0*0+0*0+0*0+2*2+0*0+0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (1*1+0*0+0*0+0*0+0*0+0*0+0*0+1*1+0*0+2*2)^0.5 = (6)^0.5 = 2.449
cos( d1, d2 ) = 0.3150
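The worked example can be checked with a few lines of code; the function name is our own.

```python
import math

def cosine_similarity(d1, d2):
    """Dot product of the two vectors divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
similarity = cosine_similarity(d1, d2)
print(round(similarity, 4))
```

The result is 5 / (sqrt(42) * sqrt(6)), which rounds to 0.3150.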
▷ Correlation measures the linear relationship between objects
▷ To compute correlation, we standardize data objects, p and q,
and then take their dot product
$p'_k = (p_k - mean(p)) / std(p)$
$q'_k = (q_k - mean(q)) / std(q)$
$correlation(p, q) = p' \cdot q'$
Using Weights to Combine Similarities
▷May not want to treat all attributes the same.
• Use weights wk which are between 0 and 1 and sum to 1.
▷Density-based clustering requires a notion of density
• Euclidean density
→ Euclidean density = number of points per unit volume
• Probability density
• Graph-based density
Data Exploration
Seeing is believing
Data Exploration
▷A preliminary exploration of the data to better
understand its characteristics
▷Key motivations of data exploration include
• Helping to select the right tool for preprocessing or analysis
• Making use of humans’ abilities to recognize patterns
• People can recognize patterns not captured by
data analysis tools
Summary Statistics
▷Summary statistics are numbers that
summarize properties of the data
•Summarized properties include frequency,
location and spread
→ Examples: location - mean
spread - standard deviation
•Most summary statistics can be calculated in
a single pass through the data
Frequency and Mode
▷Given a set of unordered categorical values
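→ Computing the frequency with which each value occurs is the
easiest way to summarize them
▷The mode of a categorical attribute
• The attribute value that has the highest frequency
$frequency(v_i) = \frac{\text{number of objects with attribute value } v_i}{m}$, where m is the total number of objects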
▷For ordered data, the notion of a percentile is
more useful
• An ordinal or continuous attribute x
• A number p between 0 and 100
▷The pth percentile xp is a value of x
• p% of the observed values of x are less than xp
Measures of Location: Mean and Median
▷The mean is the most common measure of the
location of a set of points.
• However, the mean is very sensitive to outliers.
• Thus, the median or a trimmed mean is also commonly used
Measures of Spread: Range and Variance
▷Range is the difference between the max and min
▷The variance or standard deviation is the most
common measure of the spread of a set of points.
▷However, this is also sensitive to outliers, so that
other measures are often used
Visualization is the conversion of data into a
visual or tabular format
▷Visualization of data is one of the most
powerful and appealing techniques for data exploration
• Humans have a well developed ability to analyze large
amounts of information that is presented visually
• Can detect general patterns and trends
• Can detect outliers and unusual patterns
▷Arrangement is the placement of visual elements within a display
▷Can make a large difference in how easy it is to
understand the data
Visualization Techniques: Histograms
▷ Histogram
• Usually shows the distribution of values of a single variable
• Divide the values into bins and show a bar plot of the number of
objects in each bin.
• The height of each bar indicates the number of objects
• Shape of histogram depends on the number of bins
▷ Example: Petal Width (10 and 20 bins, respectively)
Visualization Techniques: Box Plots
▷Another way of displaying the distribution of data
• Following figure shows the basic part of a box plot
Scatter Plot Array
Visualization Techniques: Contour Plots
▷Contour plots
• Partition the plane into regions of similar values
• The contour lines that form the boundaries of these regions
connect points with equal values
• The most common example is contour maps of elevation
• Can also display temperature, rainfall, air pressure, etc.
Sea Surface Temperature (SST)
Visualization of the Iris Data Matrix
Visualization of the Iris Correlation Matrix
Visualization Techniques: Star Plots
▷ Similar approach to parallel coordinates
• One axis for each attribute
▷ The size and the shape of the polygon give a visual description of
the attribute values of the object
Petal length sepal length
Visualization Techniques: Chernoff Faces
▷This approach associates each attribute with a
characteristic of a face
▷The values of each attribute determine the
appearance of the corresponding facial characteristic
▷Each object becomes a separate face
Data Feature Facial Feature
Sepal length Size of face
Sepal width Forehead/jaw relative arc length
Petal length Shape of forehead
Petal width Shape of jaw
Do's and Don'ts
▷ Apprehension
• Correctly perceive relations among variables
▷ Clarity
• Visually distinguish all the elements of a graph
▷ Consistency
• Interpret a graph based on similarity to previous graphs
▷ Efficiency
• Portray a possibly complex relation in as simple a way as possible
▷ Necessity
• The need for the graph, and the graphical elements
▷ Truthfulness
• Determine the true value represented by any graphical element
Data Mining Techniques
Understand the objectives
Tasks in Data Mining
▷Problems should be well defined at the beginning
▷Two categories of tasks [Fayyad et al., 1996]
Predictive Tasks
• Predict unknown values
• e.g., potential customers
Descriptive Tasks
• Find patterns to describe data
• e.g., Friendship finding
Select Techniques
▷Problems could be further decomposed
Predictive Tasks
• Classification
• Ranking
• Regression
• …
Descriptive Tasks
• Clustering
• Association rules
• Summarization
• …
Supervised vs. Unsupervised Learning
▷ Supervised learning
• Supervision: The training data (observations, measurements,
etc.) are accompanied by labels indicating the class of the observations
• New data is classified based on the training set
▷ Unsupervised learning
• The class labels of the training data are unknown
• Given a set of measurements, observations, etc. with the aim of
establishing the existence of classes or clusters in the data
▷ Given a collection of records (training set )
• Each record contains a set of attributes
• One of the attributes is the class
▷ Find a model for class attribute:
• The model forms a function of the values of other attributes
▷ Goal: previously unseen records should be assigned a class as
accurately as possible.
• A test set is needed
→ To determine the accuracy of the model
▷Usually, the given data set is divided into training & test
• With training set used to build the model
• With test set used to validate it
▷Produce a permutation of the items as a new list
• Items ranked in higher positions should be more important
• E.g., rank webpages in a search engine: webpages in
higher positions are more relevant.
▷Find a function which models the data with the least error
• The output might be a numerical value
• E.g.: Predict the stock value
▷Group data into clusters
• Similar to the objects within the same cluster
• Dissimilar to the objects in other clusters
• No predefined classes (unsupervised classification)
Association Rule Mining
▷Basic concept
• Given a set of transactions
• Find rules that will predict the occurrence of an item
• Based on the occurrences of other items in the transaction
▷Provide a more compact representation of the data
• Data: Visualization
• Text – Document Summarization
→ E.g.: Snippet
Illustrating Classification Task
Training Set
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes
Test Set
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
Decision Tree
Training Data
Tid  Refund  Marital   Income  Cheat
1    Yes     Single    125K    No
2    No      Married   100K    No
3    No      Single    70K     No
4    Yes     Married   120K    No
5    No      Divorced  95K     Yes
6    No      Married   60K     No
7    Yes     Divorced  220K    No
8    No      Single    85K     Yes
9    No      Married   75K     No
10   No      Single    90K     Yes
[Figure: Model: Decision Tree. Splitting attributes and branches: Refund (Yes / No), Marital (Married / Single, Divorced), Income (< 80K / > 80K)]
There could be more than one tree that fits the same data!
Algorithm for Decision Tree Induction
▷ Basic algorithm (a greedy algorithm)
• Tree is constructed in a top-down recursive divide-and-conquer manner
• At start, all the training examples are at the root
• Attributes are categorical (if continuous-valued, they are
discretized in advance)
• Examples are partitioned recursively based on selected attributes
• Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
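As one example of such a measure, the sketch below computes information gain for the two categorical attributes of the ten-record Refund / Marital / Cheat training data used in the decision tree example. Income is left out because it would first need to be discretized; the function names are our own.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attr, target):
    """Entropy of the target minus the weighted entropy after splitting on attr."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

raw = [
    ("Yes", "Single", "No"), ("No", "Married", "No"), ("No", "Single", "No"),
    ("Yes", "Married", "No"), ("No", "Divorced", "Yes"), ("No", "Married", "No"),
    ("Yes", "Divorced", "No"), ("No", "Single", "Yes"), ("No", "Married", "No"),
    ("No", "Single", "Yes"),
]
rows = [dict(zip(("Refund", "Marital", "Cheat"), r)) for r in raw]
gain_refund = information_gain(rows, "Refund", "Cheat")
gain_marital = information_gain(rows, "Marital", "Cheat")
```

On this data the gain for Marital (about 0.28 bits) is larger than for Refund (about 0.19 bits), so a purely gain-driven greedy step would split on Marital first. This is one possible heuristic and need not match the tree drawn in the figure.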
Tree Induction
▷Greedy strategy.
• Split the records based on an attribute test that optimizes
a certain criterion.
• Determine how to split the records
→ How to specify the attribute test condition?
→ How to determine the best split?
• Determine when to stop splitting
The Problem Of Decision Tree
[Figure panels: Deep Bushy Tree, Deep Bushy Tree, Useless]
The Decision Tree has a hard time with correlated attributes
Advantages/Disadvantages of Decision Trees
▷Advantages
• Easy to understand
• Easy to generate rules
▷Disadvantages
• May suffer from overfitting.
• Classifies by rectangular partitioning (so does not handle
correlated features very well).
• Can be quite large – pruning is necessary.
• Does not handle streaming data easily
Underfitting and Overfitting
Underfitting: when model is too simple, both training and test errors are large
Overfitting due to Noise
Decision boundary is distorted by noise point
Overfitting due to Insufficient Examples
Lack of data points in the lower half of the diagram makes it difficult to
predict correctly the class labels of that region
- Insufficient number of training records in the region causes the decision
tree to predict the test examples using other training records that are
irrelevant to the classification task
Bayes Classifier
▷A probabilistic framework for solving classification problems
▷Conditional Probability:
$P(C \mid A) = \frac{P(A, C)}{P(A)}$, $P(A \mid C) = \frac{P(A, C)}{P(C)}$
▷ Bayes theorem:
$P(C \mid A) = \frac{P(A \mid C)\, P(C)}{P(A)}$
Bayesian Classifiers
▷Consider each attribute and class label as random variables
▷Given a record with attributes (A1, A2,…,An)
• Goal is to predict class C
• Specifically, we want to find the value of C that maximizes
P(C| A1, A2,…,An )
▷Can we estimate P(C| A1, A2,…,An ) directly from data?
Bayesian Classifier Approach
▷Compute the posterior probability P(C | A1, A2, …,
An) for all values of C using the Bayes theorem
▷Choose value of C that maximizes
P(C | A1, A2, …, An)
▷Equivalent to choosing value of C that maximizes
P(A1, A2, …, An|C) P(C)
▷How to estimate P(A1, A2, …, An | C )?
Naïve Bayes Classifier
▷A simplified assumption: attributes are conditionally
independent and each data sample has n attributes
$P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)$
▷No dependence relation between attributes
▷By Bayes theorem,
$P(C_i \mid X) = \frac{P(X \mid C_i)\, P(C_i)}{P(X)}$
▷As P(X) is constant for all classes, assign X to the
class with maximum P(X|Ci)*P(Ci)
Naïve Bayesian Classifier: Comments
▷ Advantages :
• Easy to implement
• Good results obtained in most of the cases
▷ Disadvantages
• Assumption: class conditional independence
• Practically, dependencies exist among variables
→ E.g., hospital patients: profile (age, family history, etc.),
symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
• Dependencies among these cannot be modeled by Naïve
Bayesian Classifier
▷ How to deal with these dependencies?
• Bayesian Belief Networks
Bayesian Networks
▷A Bayesian belief network allows a subset of the
variables to be conditionally independent
▷A graphical model of causal relationships
• Represents dependency among the variables
• Gives a specification of joint probability distribution
Bayesian Belief Network: An Example
(FH, S) (FH, ~S) (~FH, S) (~FH, ~S)
Bayesian Belief Networks
The conditional probability table for a variable
shows the conditional probability for each
possible combination of its parents
$P(z_1, \ldots, z_n) = \prod_{i=1}^{n} P(z_i \mid Parents(Z_i))$
Neural Networks
▷Artificial neuron
• Each input is multiplied by a weighting factor.
• Output is 1 if sum of weighted inputs exceeds a threshold
value; 0 otherwise
▷Network is programmed by adjusting weights using
feedback from examples
General Structure
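[Figure: input vector x_i, input nodes, hidden nodes, output nodes, output vector]
$I_j = \sum_i w_{ij} O_i + \theta_j$
$O_j = \frac{1}{1 + e^{-I_j}}$
$Err_j = O_j (1 - O_j)(T_j - O_j)$
$Err_j = O_j (1 - O_j) \sum_k Err_k\, w_{jk}$
$w_{ij} = w_{ij} + (l)\, Err_j\, O_i$
$\theta_j = \theta_j + (l)\, Err_j$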
Network Training
▷The ultimate objective of training
• Obtain a set of weights that makes almost all the tuples in
the training data classified correctly
• Initialize weights with random values
• Feed the input tuples into the network one by one
• For each unit
→ Compute the net input to the unit as a linear combination of
all the inputs to the unit
→ Compute the output value using the activation function
→ Compute the error
→ Update the weights and the bias
Summary of Neural Networks
▷Advantages
• Prediction accuracy is generally high
• Robust, works when training examples contain errors
• Fast evaluation of the learned target function
▷Disadvantages
• Long training time
• Difficult to understand the learned function (weights)
• Not easy to incorporate domain knowledge
The k-Nearest Neighbor Algorithm
▷All instances correspond to points in the n-D space.
▷The nearest neighbors are defined in terms of
Euclidean distance.
▷The target function could be discrete- or real-valued.
▷For discrete-valued, the k-NN returns the most
common value among the k training examples
nearest to xq.
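A minimal k-NN sketch for the discrete-valued case: sort the training examples by Euclidean distance to the query point and take a majority vote among the k nearest. The training points and labels are made up for illustration.

```python
import math
from collections import Counter

def knn_classify(training, query, k=3):
    """training: list of (point, label); return the majority label of the k nearest."""
    neighbors = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical 2-D training examples with two classes
training = [
    ((1.0, 1.0), "+"), ((1.5, 2.0), "+"), ((2.0, 1.5), "+"),
    ((6.0, 6.0), "-"), ((6.5, 7.0), "-"), ((7.0, 6.5), "-"),
]
label = knn_classify(training, (2.0, 2.0), k=3)
```

An odd k avoids ties with two classes; sorting all training points is fine at this size, while large data sets would call for an index structure.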
Discussion on the k-NN Algorithm
▷Distance-weighted nearest neighbor algorithm
• Weight the contribution of each of the k neighbors
according to their distance to the query point xq
→ Giving greater weight to closer neighbors
▷Curse of dimensionality: distance between
neighbors could be dominated by irrelevant attributes
• To overcome it, eliminate the least relevant attributes.
Association Rule Mining
Definition: Frequent Itemset
▷ Itemset: A collection of one or more items
• Example: {Milk, Bread, Diaper}
▷ k-itemset
• An itemset that contains k items
▷ Support count (σ)
• Frequency of occurrence of an itemset
• E.g. σ({Milk, Bread, Diaper}) = 2
▷ Support
• Fraction of transactions that contain an itemset
• E.g. s({Milk, Bread, Diaper}) = 2/5
▷ Frequent Itemset
• An itemset whose support is greater than or
equal to a minsup threshold
Market-Basket transactions
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
Definition: Association Rule
Association Rule
– An implication expression of the form
X → Y, where X and Y are itemsets
– Example:
{Milk, Diaper} → {Beer}
Rule Evaluation Metrics
– Support (s)
• Fraction of transactions that contain
both X and Y
– Confidence (c)
• Measures how often items in Y
appear in transactions that
contain X
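The two rule metrics can be worked through for the example rule {Milk, Diaper} → {Beer} on the market-basket transactions. The sketch below assumes the usual formulas s = σ(X ∪ Y)/|T| and c = σ(X ∪ Y)/σ(X); the function name is hypothetical.

```python
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def rule_support_confidence(x, y, transactions):
    """Support and confidence of the rule X -> Y."""
    count_x = sum(1 for t in transactions if x <= t)          # sigma(X)
    count_xy = sum(1 for t in transactions if (x | y) <= t)   # sigma(X and Y together)
    return count_xy / len(transactions), count_xy / count_x

s, c = rule_support_confidence({"Milk", "Diaper"}, {"Beer"}, transactions)
print(f"s = {s:.2f}, c = {c:.2f}")
```

Here two of the five transactions contain all three items and three contain {Milk, Diaper}, so s = 2/5 and c = 2/3.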
Strong Rules & Interesting
• Corr(A, B)=1, A & B are independent
• Corr(A, B)<1, occurrence of A is negatively correlated with B
• Corr(A, B)>1, occurrence of A is positively correlated with B
▷E.g. Corr(games, videos)=0.4/(0.6*0.75)=0.89
• In fact, games & videos are negatively associated
→ Purchase of one actually decreases the likelihood of purchasing the other
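The games/videos figure can be reproduced with a one-line function. It assumes Corr(A, B) = P(A and B) / (P(A) × P(B)), which is the definition consistent with the arithmetic 0.4/(0.6*0.75) shown above; the function name is hypothetical.

```python
def corr(p_a, p_b, p_ab):
    """Assumed definition: joint probability divided by the product of the marginals."""
    return p_ab / (p_a * p_b)

value = corr(p_a=0.6, p_b=0.75, p_ab=0.4)
print(round(value, 2))  # below 1, so the two purchases are negatively correlated
```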
Clustering Analysis
Good Clustering
▷Good clustering (produce high quality clusters)
• Intra-cluster similarity is high
• Inter-cluster similarity is low
▷Quality factors
• Similarity measure and its implementation
• Definition and representation of cluster chosen
• Clustering algorithm
Types of Clusters: Well-Separated
▷Well-Separated clusters:
• A cluster is a set of points such that any point in a cluster
is closer (or more similar) to every other point in the
cluster than to any point not in the cluster.
3 well-separated clusters
Types of Clusters: Center-Based
• A cluster is a set of objects such that an object in a cluster
is closer (more similar) to the “center” of its cluster than to the
center of any other cluster
• The center of a cluster is often a centroid, the average of
all the points in the cluster, or a medoid, the most
“representative” point of a cluster
4 center-based clusters
Types of Clusters: Contiguity-Based
▷Contiguous cluster (Nearest neighbor or transitive)
• A cluster is a set of points such that a point in a cluster is
closer (or more similar) to one or more other points in the
cluster than to any point not in the cluster.
Types of Clusters: Density-Based
• A cluster is a dense region of points, which is separated
by low-density regions, from other regions of high density.
• Used when the clusters are irregular or intertwined, and
when noise and outliers are present.
Types of Clusters: Objective Function
▷Clusters defined by an objective function
• Finds clusters that minimize or maximize an objective function
• Naïve approaches:
→ Enumerate all possible ways
→ Evaluate the `goodness' of each potential set of clusters
→ NP-hard
• Can have global or local objectives.
→ Hierarchical clustering algorithms typically have local objectives
→ Partitional algorithms typically have global objectives
Partitioning Algorithms: Basic Concept
▷Given a k, find a partition of k clusters that optimizes
the chosen partitioning criterion
• Global optimal: exhaustively enumerate all partitions.
• Heuristic methods.
→ k-means: each cluster is represented by the center of the cluster
→ k-medoids or PAM (Partition Around Medoids) : each
cluster is represented by one of the objects in the cluster.
K-Means Clustering Algorithm
• Randomly initialize k cluster means
• Iterate:
→ Assign each point to the nearest cluster mean
→ Recompute cluster means
• Stop when clustering converges
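The loop above can be sketched in plain Python. This is a minimal illustration under stated assumptions: squared Euclidean distance, initial means sampled from the data, and "converged" taken to mean that no assignment changed. The function name, the `seed` parameter, and the toy points are hypothetical.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    means = rng.sample(points, k)            # randomly initialize k cluster means
    assignment = None
    for _ in range(max_iter):
        # Assign each point to the nearest cluster mean.
        new_assignment = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, means[j])))
            for p in points
        ]
        if new_assignment == assignment:     # stop when the clustering no longer changes
            break
        assignment = new_assignment
        # Recompute each cluster mean from its current members.
        for j in range(k):
            members = [p for p, c in zip(points, assignment) if c == j]
            if members:                      # keep the old mean if a cluster is empty
                means[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return means, assignment

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
means, labels = kmeans(points, k=2)
print(labels)
```

On well-separated toy data like this the two groups are recovered; on harder data the result depends on the initial means, so different seeds may give different clusterings.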
Two different K-means Clusterings
[Figure: original points, with an optimal clustering and a sub-optimal clustering of the same points]
Solutions to Initial Centroids Problem
▷ Multiple runs
• Helps, but probability is not on your side
▷ Sample and use hierarchical clustering to determine initial centroids
▷ Select more than k initial centroids and then select among
these initial centroids
• Select most widely separated
▷ Postprocessing
▷ Bisecting K-means
• Not as susceptible to initialization issues
Bisecting K-means
▷ Bisecting K-means algorithm
• Variant of K-means that can produce a partitioned or a
hierarchical clustering
Bisecting K-means Example
Produce a hierarchical clustering based on the sequence of
clusterings produced
Limitations of K-means: Differing Sizes
[Figure: original points and K-means (3 clusters)]
One solution is to use many clusters.
K-means then finds parts of clusters, which need to be put together afterwards.
[Figure: K-means (10 clusters)]
Limitations of K-means: Differing Density
[Figure: original points and K-means (3 clusters)]
One solution is to use many clusters.
K-means then finds parts of clusters, which need to be put together afterwards.
[Figure: K-means (10 clusters)]
Limitations of K-means: Non-globular Shapes
[Figure: original points and K-means (2 clusters)]
One solution is to use many clusters.
K-means then finds parts of clusters, which need to be put together afterwards.
[Figure: K-means (10 clusters)]
Hierarchical Clustering
▷Produces a set of nested clusters organized as a
hierarchical tree
▷Can be visualized as a dendrogram
• A tree-like diagram that records the sequences of merges
or splits
[Figure: nested clusters of six points and the corresponding dendrogram]
Strengths of Hierarchical Clustering
▷Do not have to assume any particular number of clusters
• Any desired number of clusters can be obtained by
‘cutting’ the dendrogram at the proper level
▷They may correspond to meaningful taxonomies
• Example in biological sciences (e.g., animal kingdom,
phylogeny reconstruction, …)
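The idea of obtaining any number of clusters from one merge sequence can be illustrated with a small agglomerative sketch. The slides do not fix a linkage, so single-link (minimum) distance on 1-D points is an assumption here, and the function name and data are hypothetical.

```python
def single_link(points, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    clusters = [[p] for p in points]
    merges = []                      # the merge sequence a dendrogram would record
    while len(clusters) > n_clusters:
        # Pair of clusters with the smallest single-link (minimum pairwise) distance.
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: min(abs(x - y) for x in clusters[ab[0]] for y in clusters[ab[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return clusters, merges

points = [1, 2, 6, 7, 15]
three, _ = single_link(points, 3)    # "cut" that leaves 3 clusters
two, _ = single_link(points, 2)      # cutting higher leaves 2 clusters
print(sorted(sorted(c) for c in three))
print(sorted(sorted(c) for c in two))
```

Stopping the merges earlier or later plays the role of cutting the dendrogram at a lower or higher level.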
Density-Based Clustering
▷Clustering based on density (local cluster criterion),
such as density-connected points
▷Each cluster has a considerably higher density of
points than outside of the cluster
Density-Based Clustering Methods
▷Major features:
• Discover clusters of arbitrary shape
• Handle noise
• One scan
• Need density parameters as termination condition
DBSCAN
▷Density = number of points within a specified radius (Eps)
▷A point is a core point if it has more than a specified
number of points (MinPts) within Eps
• These are points that are at the interior of a cluster
▷A border point has fewer than MinPts within Eps, but
is in the neighborhood of a core point
▷A noise point is any point that is not a core point or a
border point.
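The three point types can be computed directly from their definitions. The sketch below is not a full DBSCAN implementation; it only labels points. Counting a point as its own neighbor is an assumption (conventions differ), and the function name and data are hypothetical.

```python
import math

def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border', or 'noise'."""
    def neighbors(i):
        # Indices of all points within Eps of point i (point i itself included).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    # Core point: more than MinPts points within Eps.
    core = {i for i in range(len(points)) if len(neighbors(i)) > min_pts}
    labels = []
    for i in range(len(points)):
        if i in core:
            labels.append("core")
        elif any(j in core for j in neighbors(i)):
            labels.append("border")   # not core, but inside a core point's neighborhood
        else:
            labels.append("noise")
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (2, 1), (10, 10)]
labels = classify_points(points, eps=1.1, min_pts=3)
print(labels)
```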
DBSCAN: Core, Border, and Noise Points
DBSCAN Examples
[Figure: original points and their point types (core, border, and noise); Eps = 10, MinPts = 4]
When DBSCAN Works Well
[Figure: original points and the clusters found]
• Resistant to Noise
• Can handle clusters of different shapes and sizes
Recap: Data Mining Techniques
Predictive Tasks
• Classification
• Ranking
• Regression
• …
Descriptive Tasks
• Clustering
• Association rules
• Summarization
• …
Many slides provided by Tan, Steinbach, Kumar for book “Introduction to Data Mining” are adapted in this presentation
Tasks in Data Mining
▷Problems should be well defined at the beginning
▷Two categories of tasks [Fayyad et al., 1996]
Predictive Tasks
• Predict unknown values
• e.g., potential customers
Descriptive Tasks
• Find patterns to describe data
• e.g., Friendship finding
For Predictive Tasks
Metrics for Performance Evaluation
• Focus on the predictive capability of a model
• Confusion Matrix:
                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL   Class=Yes    a           b
CLASS    Class=No     c           d
a: TP (true positive)
b: FN (false negative)
c: FP (false positive)
d: TN (true negative)
Metrics for Performance Evaluation
• Most widely-used metric: accuracy
  $\text{Accuracy} = \frac{a + d}{a + b + c + d} = \frac{TP + TN}{TP + TN + FP + FN}$
Limitation of Accuracy
▷ Consider a 2-class problem
• Number of Class 0 examples = 9990
• Number of Class 1 examples = 10
▷ If model predicts everything to be class 0, accuracy is
9990/10000 = 99.9 %
▷ Accuracy is misleading because model does not detect any
class 1 example
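The imbalance example can be checked numerically. The sketch assumes accuracy = (TP + TN) / (TP + FN + FP + TN) and treats class 1 as the positive class; the function name is hypothetical.

```python
def accuracy(tp, fn, fp, tn):
    return (tp + tn) / (tp + fn + fp + tn)

# A model that predicts class 0 for all 10000 examples.
tp, fn = 0, 10       # all 10 class-1 examples are missed
fp, tn = 0, 9990     # all 9990 class-0 examples are predicted correctly
acc = accuracy(tp, fn, fp, tn)
detected = tp / (tp + fn)   # fraction of class-1 examples that were found
print(acc, detected)
```

The accuracy is 99.9% even though none of the class-1 examples is detected.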
Cost-Sensitive Measures
  $\text{Precision } (p) = \frac{a}{a + c}$
  $\text{Recall } (r) = \frac{a}{a + b}$
  $\text{F-measure } (F) = \frac{2rp}{r + p} = \frac{2a}{2a + b + c}$
• Precision is biased towards C(Yes|Yes) & C(Yes|No)
• Recall is biased towards C(Yes|Yes) & C(No|Yes)
• F-measure is biased towards all except C(No|No)
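A small sketch makes the "biased towards" statements concrete. It uses the confusion-matrix cells a = TP, b = FN, c = FP, d = TN and the standard definitions of precision, recall and F-measure; the counts are made up for illustration, and empty denominators are not handled.

```python
def precision_recall_f(a, b, c, d):
    """a=TP, b=FN, c=FP, d=TN. Note that d (TN) is never used."""
    p = a / (a + c)            # depends on C(Yes|Yes) and C(Yes|No)
    r = a / (a + b)            # depends on C(Yes|Yes) and C(No|Yes)
    f = 2 * r * p / (r + p)    # harmonic mean of precision and recall
    return p, r, f

p, r, f = precision_recall_f(a=40, b=10, c=20, d=930)
print(round(p, 3), round(r, 3), round(f, 3))

# Changing only the number of true negatives leaves all three measures unchanged.
same = precision_recall_f(a=40, b=10, c=20, d=5)
```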
Test of Significance
▷ Given two models:
• Model M1: accuracy = 85%, tested on 30 instances
• Model M2: accuracy = 75%, tested on 5000 instances
▷ Can we say M1 is better than M2?
• How much confidence can we place on accuracy of M1 and M2?
• Can the difference in performance measure be explained as a
result of random fluctuations in the test set?
Confidence Interval for Accuracy
▷ Prediction can be regarded as a Bernoulli trial
• A Bernoulli trial has 2 possible outcomes
→ Possible outcomes for prediction: correct or wrong
• Collection of Bernoulli trials has a Binomial distribution:
→ x ∼ Bin(N, p) x: number of correct predictions
→ e.g., Toss a fair coin 50 times, how many heads would turn up?
Expected number of heads = N × p = 50 × 0.5 = 25
▷ Given x (# of correct predictions) or equivalently, accuracy
(ac)=x/N, and N (# of test instances)
Can we predict p (true accuracy of model)?
Confidence Interval for Accuracy
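▷For large test sets (N > 30),
• ac has a normal distribution
with mean p and variance p(1 − p)/N:
  $P\left( Z_{\alpha/2} < \frac{\mathrm{ac} - p}{\sqrt{p(1-p)/N}} < Z_{1-\alpha/2} \right) = 1 - \alpha$
  (the area under the standard normal curve between $Z_{\alpha/2}$ and $Z_{1-\alpha/2}$ is $1 - \alpha$)
• Confidence Interval for p:
  $p = \frac{2N\,\mathrm{ac} + Z_{\alpha/2}^2 \pm Z_{\alpha/2}\sqrt{Z_{\alpha/2}^2 + 4N\,\mathrm{ac} - 4N\,\mathrm{ac}^2}}{2\,(N + Z_{\alpha/2}^2)}$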
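The interval can be evaluated numerically. The sketch below solves the normal approximation for p in closed form (the standard score interval for a binomial proportion); the function name, the default z = 1.96 for a 95% level, and the example values N = 100 and observed accuracy 0.8 are assumptions for illustration.

```python
import math

def accuracy_interval(acc, n, z=1.96):
    """Confidence interval for the true accuracy p, given observed accuracy acc on n instances."""
    center = 2 * n * acc + z * z
    spread = z * math.sqrt(z * z + 4 * n * acc - 4 * n * acc * acc)
    denom = 2 * (n + z * z)
    return (center - spread) / denom, (center + spread) / denom

low, high = accuracy_interval(acc=0.8, n=100)
print(round(low, 3), round(high, 3))

# The interval narrows as the test set grows.
low_big, high_big = accuracy_interval(acc=0.8, n=5000)
```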
Example: Comparing Performance of 2 Models
▷Given: M1: n1 = 30, e1 = 0.15
M2: n2 = 5000, e2 = 0.25
• d = |e2 – e1| = 0.1 (2-sided test)
▷At 95% confidence level, $Z_{\alpha/2} = 1.96$
  $\hat{\sigma}_d^2 = \frac{0.15\,(1 - 0.15)}{30} + \frac{0.25\,(1 - 0.25)}{5000} = 0.0043$
  $d_t = 0.100 \pm 1.96 \times \sqrt{0.0043} = 0.100 \pm 0.128$
▷Interval contains 0:
difference may not be statistically significant
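The same comparison can be reproduced in code. The sketch assumes the variance of the difference is the sum of the two binomial variances e(1 − e)/n, which matches the 0.0043 and ±0.128 figures in this example; the function name is hypothetical.

```python
import math

def difference_interval(e1, n1, e2, n2, z=1.96):
    """Confidence interval for the true difference between two error rates."""
    d = abs(e2 - e1)
    variance = e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2
    margin = z * math.sqrt(variance)
    return d - margin, d + margin

low, high = difference_interval(e1=0.15, n1=30, e2=0.25, n2=5000)
print(round(low, 3), round(high, 3))
significant = not (low <= 0 <= high)   # False here: the interval contains 0
```

Almost all of the variance comes from M1, which was tested on only 30 instances.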
For Descriptive Tasks
Computing Interestingness Measure
▷ Given a rule X → Y, information needed to compute rule
interestingness can be obtained from a contingency table
Contingency table for X → Y
        Y     ¬Y
X       f11   f10   f1+
¬X      f01   f00   f0+
        f+1   f+0   |T|
f11: support of X and Y
f10: support of X and ¬Y
f01: support of ¬X and Y
f00: support of ¬X and ¬Y
Used to define various measures
• support, confidence, lift, Gini,
J-measure, etc.
Drawback of Confidence
        Coffee  ¬Coffee
Tea     15      5       20
¬Tea    75      5       80
        90      10      100
Association Rule: Tea → Coffee
Confidence = P(Coffee|Tea) = 0.75
but P(Coffee) = 0.9
• Although confidence is high, rule is misleading
• P(Coffee|¬Tea) = 0.9375
Statistical Independence
▷ Population of 1000 students
• 600 students know how to swim (S)
• 700 students know how to bike (B)
• 420 students know how to swim and bike (S,B)
• P(S∧B) = 420/1000 = 0.42
• P(S) × P(B) = 0.6 × 0.7 = 0.42
• P(S∧B) = P(S) × P(B) => Statistical independence
• P(S∧B) > P(S) × P(B) => Positively correlated
• P(S∧B) < P(S) × P(B) => Negatively correlated
Statistical-based Measures
▷Measures that take into account statistical dependence
Example: Interest Factor
        Coffee  ¬Coffee
Tea     15      5       20
¬Tea    75      5       80
        90      10      100
Association Rule: Tea → Coffee
P(Coffee, Tea) = 0.15
P(Coffee) = 0.9, P(Tea) = 0.2
• Interest = 0.15/(0.9×0.2) = 0.83 (< 1, therefore is negatively associated)
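The tea/coffee numbers can be recomputed from the four contingency counts. The sketch assumes Interest = P(X, Y) / (P(X) × P(Y)), which matches the arithmetic in this example; the function name and dictionary keys are hypothetical.

```python
def rule_measures(f11, f10, f01, f00):
    """Measures for X -> Y from contingency counts (f11: X and Y, f10: X and not Y, ...)."""
    n = f11 + f10 + f01 + f00
    p_x, p_y, p_xy = (f11 + f10) / n, (f11 + f01) / n, f11 / n
    return {
        "support": p_xy,                               # P(X, Y)
        "confidence": p_xy / p_x,                      # P(Y | X)
        "interest": p_xy / (p_x * p_y),                # below 1 means negative association
        "confidence_given_not_x": f01 / (f01 + f00),   # P(Y | not X)
    }

# X = Tea, Y = Coffee
m = rule_measures(f11=15, f10=5, f01=75, f00=5)
for name, value in m.items():
    print(name, round(value, 4))
```

The confidence of 0.75 looks high in isolation, but it is lower than both P(Coffee) = 0.9 and P(Coffee|¬Tea) = 0.9375, which is what the interest value below 1 captures.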
Subjective Interestingness Measure
▷Objective measure:
• Rank patterns based on statistics computed from data
• e.g., 21 measures of association (support, confidence,
Laplace, Gini, mutual information, Jaccard, etc).
▷Subjective measure:
• Rank patterns according to user’s interpretation
→ A pattern is subjectively interesting if it contradicts the
expectation of a user (Silberschatz & Tuzhilin)
→ A pattern is subjectively interesting if it is actionable
(Silberschatz & Tuzhilin)
Interestingness via Unexpectedness
• Need to model expectation of users (domain knowledge)
• Need to combine expectation of users with evidence from
data (i.e., extracted patterns)
[Figure legend: + pattern expected to be frequent; − pattern expected to be infrequent; each pattern is then marked as found to be frequent or found to be infrequent]
• Expected patterns: what was found agrees with what was expected
• Unexpected patterns: what was found contradicts what was expected
Different Proposed Measures
Some measures are
good for certain
applications, but not for others
What criteria should we
use to determine
whether a measure is
good or bad?
Comparing Different Measures
Example f11 f10 f01 f00
E1 8123 83 424 1370
E2 8330 2 622 1046
E3 9481 94 127 298
E4 3954 3080 5 2961
E5 2886 1363 1320 4431
E6 1500 2000 500 6000
E7 4000 2000 1000 3000
E8 4000 2000 2000 2000
E9 1720 7121 5 1154
E10 61 2483 4 7452
(Table above: 10 examples of contingency tables)
[Figure: rankings of the contingency tables using various measures]
Property under Variable Permutation
Does M(A,B) = M(B,A)?
Symmetric measures:
• support, lift, collective strength, cosine, Jaccard, etc
Asymmetric measures:
• confidence, conviction, Laplace, J-measure, etc
Contingency table for M(A,B):
      B   ¬B
A     p   q
¬A    r   s
Contingency table for M(B,A):
      A   ¬A
B     p   r
¬B    q   s
Cluster Validity
▷ For supervised classification we have a variety of measures to
evaluate how good our model is
• Accuracy, precision, recall
▷ For cluster analysis, the analogous question is how to evaluate
the “goodness” of the resulting clusters?
▷ But “clusters are in the eye of the beholder”!
▷ Then why do we want to evaluate them?
• To avoid finding patterns in noise
• To compare clustering algorithms
• To compare two sets of clusters
• To compare two clusters
Clusters Found in Random Data
Measures of Cluster Validity
▷ Numerical measures that are applied to judge various aspects
of cluster validity are classified into the following three types
• External Index: Used to measure the extent to which cluster
labels match externally supplied class labels, e.g., Entropy
• Internal Index: Used to measure the goodness of a clustering
structure without respect to external information, e.g., Sum of
Squared Error (SSE)
• Relative Index: Used to compare two different clusters
▷ Sometimes these are referred to as criteria instead of indices
• However, sometimes criterion is the general strategy and index
is the numerical measure that implements the criterion.
Measuring Cluster Validity Via Correlation
▷ Two matrices
• Proximity Matrix
• “Incidence” Matrix
→ One row and one column for each data point
→ An entry is 1 if the associated pair of points belongs to the same cluster
→ An entry is 0 if the associated pair of points belongs to different clusters
▷ Compute the correlation between the two matrices
• Since the matrices are symmetric, only the correlation between
n(n-1) / 2 entries needs to be calculated.
▷ High correlation indicates that points that belong to the same
cluster are close to each other.
▷ Not a good measure for some density or contiguity based clusters
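The correlation between the two matrices can be sketched as follows. The proximity matrix is assumed to hold Euclidean distances, so a good clustering gives a strongly negative correlation (same-cluster pairs have small distances). The function name and toy data are hypothetical.

```python
import math
from itertools import combinations

def incidence_proximity_corr(points, labels):
    """Pearson correlation between proximity and incidence entries over all point pairs."""
    pairs = list(combinations(range(len(points)), 2))      # the n(n-1)/2 distinct entries
    prox = [math.dist(points[i], points[j]) for i, j in pairs]
    inc = [1.0 if labels[i] == labels[j] else 0.0 for i, j in pairs]
    mean_p, mean_i = sum(prox) / len(pairs), sum(inc) / len(pairs)
    cov = sum((p - mean_p) * (q - mean_i) for p, q in zip(prox, inc))
    norm = math.sqrt(sum((p - mean_p) ** 2 for p in prox) *
                     sum((q - mean_i) ** 2 for q in inc))
    return cov / norm

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = incidence_proximity_corr(points, [0, 0, 0, 1, 1, 1])   # labels follow the two groups
bad = incidence_proximity_corr(points, [0, 1, 0, 1, 0, 1])    # labels ignore the groups
print(round(good, 3), round(bad, 3))
```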
Measuring Cluster Validity Via Correlation
▷Correlation of incidence and proximity matrices for
the K-means clustering of the following two data sets
[Figure: the two data sets, both axes from 0 to 1]
Corr = -0.9235 (first data set), Corr = -0.5810 (second data set)
Internal Measures: SSE
▷ Clusters in more complicated figures aren’t well separated
▷ Internal Index: Used to measure the goodness of a clustering
structure without respect to external information
• Sum of Squared Error (SSE)
▷ SSE is good for comparing two clusters
▷ Can also be used to estimate the number of clusters
[Figure: SSE plotted against the number of clusters]
Internal Measures: Cohesion and Separation
▷ Cluster Cohesion: Measures how closely related the objects in
a cluster are
• Cohesion is measured by the within-cluster sum of squares (SSE)
  $WSS = \sum_i \sum_{x \in C_i} (x - m_i)^2$
▷ Cluster Separation: Measures how distinct or well-separated a
cluster is from other clusters
• Separation is measured by the between-cluster sum of squares
  $BSS = \sum_i |C_i| \, (m - m_i)^2$
• Where |Ci| is the size of cluster i, m_i is the mean of cluster i, and m is the overall mean
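A short numeric sketch of cohesion and separation, assuming 1-D points, m_i as the mean of cluster i, and m as the mean of all points. The function name and the points are hypothetical.

```python
def wss_bss(clusters):
    """clusters: list of lists of 1-D points. Returns (WSS, BSS)."""
    all_points = [x for c in clusters for x in c]
    m = sum(all_points) / len(all_points)            # overall mean
    wss = bss = 0.0
    for c in clusters:
        m_i = sum(c) / len(c)                        # mean of cluster i
        wss += sum((x - m_i) ** 2 for x in c)        # cohesion: within-cluster sum of squares
        bss += len(c) * (m - m_i) ** 2               # separation: between-cluster sum of squares
    return wss, bss

one = wss_bss([[1, 2, 4, 5]])        # K = 1
two = wss_bss([[1, 2], [4, 5]])      # K = 2
print(one, two)
```

For these points WSS + BSS is the same for both clusterings, so improving cohesion is equivalent to improving separation.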
Final Comment on Cluster Validity
“The validation of clustering structures is the most difficult and
frustrating part of cluster analysis.
Without a strong effort in this direction, cluster analysis will
remain a black art accessible only to those true believers who
have experience and great courage.”
Algorithms for Clustering Data, Jain and Dubes
Case Studies
Case: Mining Reddit Data
Please check the data set during the breaks
Reddit Data
Reddit: The Front Page of the Internet
50k+ on this set
Subreddit Categories
▷Reddit’s structure may already provide a
baseline similarity
Provided Data
Recover Structure

More Related Content

What's hot

Machine Learning - Challenges, Learnings & Opportunities
Machine Learning - Challenges, Learnings & OpportunitiesMachine Learning - Challenges, Learnings & Opportunities
Machine Learning - Challenges, Learnings & OpportunitiesCodePolitan
孫民/從電腦視覺看人工智慧 : 下一件大事
孫民/從電腦視覺看人工智慧 : 下一件大事孫民/從電腦視覺看人工智慧 : 下一件大事
孫民/從電腦視覺看人工智慧 : 下一件大事台灣資料科學年會
Data Wrangling and Visualization Using Python
Data Wrangling and Visualization Using PythonData Wrangling and Visualization Using Python
Data Wrangling and Visualization Using PythonMOHITKUMAR1379
Machine learning on Hadoop data lakes
Machine learning on Hadoop data lakesMachine learning on Hadoop data lakes
Machine learning on Hadoop data lakesDataWorks Summit
A (Very) Gentle Introduction to Generative Adversarial Networks (a.k.a GANs)
 A (Very) Gentle Introduction to Generative Adversarial Networks (a.k.a GANs) A (Very) Gentle Introduction to Generative Adversarial Networks (a.k.a GANs)
A (Very) Gentle Introduction to Generative Adversarial Networks (a.k.a GANs)Thomas da Silva Paula
Using the search engine as recommendation engine
Using the search engine as recommendation engineUsing the search engine as recommendation engine
Using the search engine as recommendation engineLars Marius Garshol
Generative Adversarial Networks and Their Applications
Generative Adversarial Networks and Their ApplicationsGenerative Adversarial Networks and Their Applications
Generative Adversarial Networks and Their ApplicationsArtifacia
Tutorial Data Management and workflows
Tutorial Data Management and workflowsTutorial Data Management and workflows
Tutorial Data Management and workflowsSSSW
Generative Adversarial Networks (D2L5 Deep Learning for Speech and Language U...
Generative Adversarial Networks (D2L5 Deep Learning for Speech and Language U...Generative Adversarial Networks (D2L5 Deep Learning for Speech and Language U...
Generative Adversarial Networks (D2L5 Deep Learning for Speech and Language U...Universitat Politècnica de Catalunya
Scalable Strategies for Computing with Massive Data: The Bigmemory Project
Scalable Strategies for Computing with Massive Data: The Bigmemory ProjectScalable Strategies for Computing with Massive Data: The Bigmemory Project
Scalable Strategies for Computing with Massive Data: The Bigmemory Projectjoshpaulson
A New Year in Data Science: ML Unpaused
A New Year in Data Science: ML UnpausedA New Year in Data Science: ML Unpaused
A New Year in Data Science: ML UnpausedPaco Nathan
Generative Adversarial Networks and Their Applications in Medical Imaging
Generative Adversarial Networks  and Their Applications in Medical ImagingGenerative Adversarial Networks  and Their Applications in Medical Imaging
Generative Adversarial Networks and Their Applications in Medical ImagingSanghoon Hong
The LDBC Social Network Benchmark Interactive Workload - SIGMOD 2015
The LDBC Social Network Benchmark Interactive Workload - SIGMOD 2015The LDBC Social Network Benchmark Interactive Workload - SIGMOD 2015
The LDBC Social Network Benchmark Interactive Workload - SIGMOD 2015Ioan Toma
Gentle Introduction: Bayesian Modelling and Probabilistic Programming in R
Gentle Introduction: Bayesian Modelling and Probabilistic Programming in RGentle Introduction: Bayesian Modelling and Probabilistic Programming in R
Gentle Introduction: Bayesian Modelling and Probabilistic Programming in RMarco Wirthlin
Chapter - 7 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 7 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; KamberChapter - 7 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 7 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kambererror007
“Introducing Machine Learning and How to Teach Machines to See,” a Presentati...
“Introducing Machine Learning and How to Teach Machines to See,” a Presentati...“Introducing Machine Learning and How to Teach Machines to See,” a Presentati...
“Introducing Machine Learning and How to Teach Machines to See,” a Presentati...Edge AI and Vision Alliance

What's hot (20)

[系列活動] 機器學習速遊
[系列活動] 機器學習速遊[系列活動] 機器學習速遊
[系列活動] 機器學習速遊
Machine Learning - Challenges, Learnings & Opportunities
Machine Learning - Challenges, Learnings & OpportunitiesMachine Learning - Challenges, Learnings & Opportunities
Machine Learning - Challenges, Learnings & Opportunities
孫民/從電腦視覺看人工智慧 : 下一件大事
孫民/從電腦視覺看人工智慧 : 下一件大事孫民/從電腦視覺看人工智慧 : 下一件大事
孫民/從電腦視覺看人工智慧 : 下一件大事
Data Wrangling and Visualization Using Python
Data Wrangling and Visualization Using PythonData Wrangling and Visualization Using Python
Data Wrangling and Visualization Using Python
Machine learning on Hadoop data lakes
Machine learning on Hadoop data lakesMachine learning on Hadoop data lakes
Machine learning on Hadoop data lakes
A (Very) Gentle Introduction to Generative Adversarial Networks (a.k.a GANs)
 A (Very) Gentle Introduction to Generative Adversarial Networks (a.k.a GANs) A (Very) Gentle Introduction to Generative Adversarial Networks (a.k.a GANs)
A (Very) Gentle Introduction to Generative Adversarial Networks (a.k.a GANs)
Using the search engine as recommendation engine
Using the search engine as recommendation engineUsing the search engine as recommendation engine
Using the search engine as recommendation engine
Generative Adversarial Networks and Their Applications
Generative Adversarial Networks and Their ApplicationsGenerative Adversarial Networks and Their Applications
Generative Adversarial Networks and Their Applications
Tutorial Data Management and workflows
Tutorial Data Management and workflowsTutorial Data Management and workflows
Tutorial Data Management and workflows
Data science with Perl & Raku
Data science with Perl & RakuData science with Perl & Raku
Data science with Perl & Raku
Generative Adversarial Networks (D2L5 Deep Learning for Speech and Language U...
Generative Adversarial Networks (D2L5 Deep Learning for Speech and Language U...Generative Adversarial Networks (D2L5 Deep Learning for Speech and Language U...
Generative Adversarial Networks (D2L5 Deep Learning for Speech and Language U...
Scalable Strategies for Computing with Massive Data: The Bigmemory Project
Scalable Strategies for Computing with Massive Data: The Bigmemory ProjectScalable Strategies for Computing with Massive Data: The Bigmemory Project
Scalable Strategies for Computing with Massive Data: The Bigmemory Project
Icml2017 overview
Icml2017 overviewIcml2017 overview
Icml2017 overview
A New Year in Data Science: ML Unpaused
A New Year in Data Science: ML UnpausedA New Year in Data Science: ML Unpaused
A New Year in Data Science: ML Unpaused
Generative Adversarial Networks and Their Applications in Medical Imaging
Generative Adversarial Networks  and Their Applications in Medical ImagingGenerative Adversarial Networks  and Their Applications in Medical Imaging
Generative Adversarial Networks and Their Applications in Medical Imaging
The LDBC Social Network Benchmark Interactive Workload - SIGMOD 2015
The LDBC Social Network Benchmark Interactive Workload - SIGMOD 2015The LDBC Social Network Benchmark Interactive Workload - SIGMOD 2015
The LDBC Social Network Benchmark Interactive Workload - SIGMOD 2015
Gentle Introduction: Bayesian Modelling and Probabilistic Programming in R
Gentle Introduction: Bayesian Modelling and Probabilistic Programming in RGentle Introduction: Bayesian Modelling and Probabilistic Programming in R
Gentle Introduction: Bayesian Modelling and Probabilistic Programming in R
Chapter - 7 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 7 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; KamberChapter - 7 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 7 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
“Introducing Machine Learning and How to Teach Machines to See,” a Presentati...
“Introducing Machine Learning and How to Teach Machines to See,” a Presentati...“Introducing Machine Learning and How to Teach Machines to See,” a Presentati...
“Introducing Machine Learning and How to Teach Machines to See,” a Presentati...
Data in Action
Data in ActionData in Action
Data in Action

Viewers also liked

qconsf 2013: Top 10 Performance Gotchas for scaling in-memory Algorithms - Sr...
qconsf 2013: Top 10 Performance Gotchas for scaling in-memory Algorithms - Sr...qconsf 2013: Top 10 Performance Gotchas for scaling in-memory Algorithms - Sr...
qconsf 2013: Top 10 Performance Gotchas for scaling in-memory Algorithms - Sr...Sri Ambati
TensorFlow 深度學習快速上手班--電腦視覺應用
TensorFlow 深度學習快速上手班--電腦視覺應用TensorFlow 深度學習快速上手班--電腦視覺應用
TensorFlow 深度學習快速上手班--電腦視覺應用Mark Chang
Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#Secur...
Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#Secur...Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#Secur...
Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#Secur...Alex Pinto
Deploy Spark ML and Tensorflow AI Models from Notebooks to Microservices - No...
Deploy Spark ML and Tensorflow AI Models from Notebooks to Microservices - No...Deploy Spark ML and Tensorflow AI Models from Notebooks to Microservices - No...
Deploy Spark ML and Tensorflow AI Models from Notebooks to Microservices - No...Chris Fregly
高嘉良/Open Innovation as Strategic Plan
高嘉良/Open Innovation as Strategic Plan高嘉良/Open Innovation as Strategic Plan
高嘉良/Open Innovation as Strategic Plan台灣資料科學年會
Gradient Descent, Back Propagation, and Auto Differentiation - Advanced Spark...
Gradient Descent, Back Propagation, and Auto Differentiation - Advanced Spark...Gradient Descent, Back Propagation, and Auto Differentiation - Advanced Spark...
Gradient Descent, Back Propagation, and Auto Differentiation - Advanced Spark...Chris Fregly
Machine Learning Essentials (dsth Meetup#3)
Machine Learning Essentials (dsth Meetup#3)Machine Learning Essentials (dsth Meetup#3)
Machine Learning Essentials (dsth Meetup#3)Data Science Thailand
Boston Spark Meetup May 24, 2016
Boston Spark Meetup May 24, 2016Boston Spark Meetup May 24, 2016
Boston Spark Meetup May 24, 2016Chris Fregly
High Performance Distributed TensorFlow with GPUs - TensorFlow Chicago Meetup...
High Performance Distributed TensorFlow with GPUs - TensorFlow Chicago Meetup...High Performance Distributed TensorFlow with GPUs - TensorFlow Chicago Meetup...
High Performance Distributed TensorFlow with GPUs - TensorFlow Chicago Meetup...Chris Fregly
The Genome Assembly Problem
The Genome Assembly ProblemThe Genome Assembly Problem
The Genome Assembly ProblemMark Chang
Big Data Spain - Nov 17 2016 - Madrid Continuously Deploy Spark ML and Tensor...
Big Data Spain - Nov 17 2016 - Madrid Continuously Deploy Spark ML and Tensor...Big Data Spain - Nov 17 2016 - Madrid Continuously Deploy Spark ML and Tensor...
Big Data Spain - Nov 17 2016 - Madrid Continuously Deploy Spark ML and Tensor...Chris Fregly
Machine Learning Preliminaries and Math Refresher
Machine Learning Preliminaries and Math RefresherMachine Learning Preliminaries and Math Refresher
Machine Learning Preliminaries and Math Refresherbutest
Machine Learning without the Math: An overview of Machine Learning
Machine Learning without the Math: An overview of Machine LearningMachine Learning without the Math: An overview of Machine Learning
Machine Learning without the Math: An overview of Machine LearningArshad Ahmed
Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi,...
Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi,...Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi,...
Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi,...Chris Fregly
Advanced Spark and TensorFlow Meetup 08-04-2016 One Click Spark ML Pipeline D...
Advanced Spark and TensorFlow Meetup 08-04-2016 One Click Spark ML Pipeline D...Advanced Spark and TensorFlow Meetup 08-04-2016 One Click Spark ML Pipeline D...
Advanced Spark and TensorFlow Meetup 08-04-2016 One Click Spark ML Pipeline D...Chris Fregly
DRAW: Deep Recurrent Attentive Writer
DRAW: Deep Recurrent Attentive WriterDRAW: Deep Recurrent Attentive Writer
DRAW: Deep Recurrent Attentive WriterMark Chang
NTHU AI Reading Group: Improved Training of Wasserstein GANs
NTHU AI Reading Group: Improved Training of Wasserstein GANsNTHU AI Reading Group: Improved Training of Wasserstein GANs
NTHU AI Reading Group: Improved Training of Wasserstein GANsMark Chang
Generative Adversarial Networks
Generative Adversarial NetworksGenerative Adversarial Networks
Generative Adversarial NetworksMark Chang
[DSC 2016] 系列活動:李泳泉 / 星火燎原 - Spark 機器學習初探
[DSC 2016] 系列活動:李泳泉 / 星火燎原 - Spark 機器學習初探[DSC 2016] 系列活動:李泳泉 / 星火燎原 - Spark 機器學習初探
[DSC 2016] 系列活動:李泳泉 / 星火燎原 - Spark 機器學習初探台灣資料科學年會

Quick Guide to Data Mining Concepts

  • 1. Quick Tour of Data Mining Yi-Shin Chen Institute of Information Systems and Applications Department of Computer Science National Tsing Hua University
  • 2. About Speaker: 陳宜欣 Yi-Shin Chen ▷ Currently • Associate Professor, Department of Computer Science, National Tsing Hua University • Director of the Intelligent Data Engineering and Applications Lab (IDEA Lab) ▷ Education • Ph.D. in Computer Science, USC, USA • M.B.A. in Information Management, NCU, TW • B.B.A. in Information Management, NCU, TW ▷ Courses (all in English) • Research and Presentation Skills • Introduction to Database Systems • Advanced Database Systems • Data Mining: Concepts, Techniques, and Applications
  • 3. Evolution of Data Management: The relationships between the techniques and our world
  • 4. Timeline, 1900 to 1970. Eras: Manual Record Managers; Punched-Card Record Managers; Programmed Record Managers (birth of high-level programming languages, batch processing); On-line Network Databases (indexed sequential records, data independence, concurrent access). Milestones: 1931: Gödel's Incompleteness Theorem; 1944: Mark I (Server); 1948: Information theory (by Shannon), information entropy; 1950: Univac had developed a magnetic tape; 1951: Univac I delivered to the US Census Bureau; 1963: The origins of the Internet
  • 5. Timeline, 1970 to 2010. Eras: Relational Model (gives database users high-level set-oriented data access operations); Object Relational Model (supports multiple datatypes and applications); Knowledge Discovery in Databases. Milestones: 1974: IBM System R; 1976: E-R Model by Peter Chen; 1980: Artificial Neural Networks; 1985: first standardization of SQL; 1993: WWW; 2001: Data Science; 2006: Elastic Compute Cloud; 2009: Deep Learning
  • 6. Data Mining: What we know, and what we do now
  • 7. Data Mining ▷ What is data mining? • Algorithms for seeking unexpected “pearls of wisdom” ▷ Current data mining research: • Focuses on efficient ways to discover models of existing data sets • Developed algorithms include classification, clustering, association-rule discovery, summarization, etc.
  • 8. Data Mining Examples (slide from Prof. Shou-De Lin)
  • 9. Origins of Data Mining ▷ Draws ideas from • Machine learning/AI • Pattern recognition • Statistics • Database systems ▷ Traditional techniques may be unsuitable due to • Enormity of data • High dimensionality of data • Heterogeneous, distributed nature of data (© Tan, Steinbach, Kumar, Introduction to Data Mining)
  • 10. Knowledge Discovery (KDD) Process: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation
  • 14. Informal Design Guidelines for Database ▷ Design a schema that can be explained easily relation by relation. The semantics of attributes should be easy to interpret ▷ Should avoid update anomaly problems ▷ Relations should be designed such that their tuples will have as few NULL values as possible ▷ The relations should be designed to satisfy the lossless join condition (guarantee meaningful results for join operations)
  • 15. Data Warehouse ▷ Assemble and manage data from various sources for the purpose of answering business questions [Figure labels: CRM, ERP, POS, … (OLTP); Data Warehouse; Meaningful]
  • 16. Knowledge Discovery (KDD) Process: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation
  • 17. KDD Process: Several Key Steps ▷ Pre-processing • Learning the application domain → Relevant prior knowledge and goals of application • Creating a target data set: data selection • Data cleaning and preprocessing (may take 60% of effort!) • Data reduction and transformation → Find useful features ▷ Data mining • Choosing functions of data mining → Choosing the mining algorithm • Search for patterns of interest ▷ Evaluation • Pattern evaluation and knowledge presentation → visualization, transformation, removing redundant patterns, etc. (© Han & Kamber, Data Mining: Concepts and Techniques)
  • 18. Data: The most important part in the whole process (Many slides provided by Tan, Steinbach, Kumar for the book “Introduction to Data Mining” are adapted in this presentation)
  • 19. Types of Attributes ▷ There are different types of attributes • Nominal (=, ≠) → Nominal values can only distinguish one object from another → Examples: ID numbers, eye color, zip codes • Ordinal (<, >) → Ordinal values can help to order objects → Examples: rankings, grades • Interval (+, -) → The differences between values are meaningful → Examples: calendar dates • Ratio (*, /) → Both differences and ratios are meaningful → Examples: temperature in Kelvin, length, time, counts → Note: only this type supports all of the processing methods
  • 20. Types of Data Sets ▷ Record • Data Matrix • Document Data • Transaction Data ▷ Graph • World Wide Web • Molecular Structures ▷ Ordered • Spatial Data • Temporal Data • Sequential Data • Genetic Sequence Data [Examples: a data matrix with columns Projection of x Load, Projection of y Load, Distance, Load, Thickness and row 10.23, 5.27, 15.22, 2.7, 1.2; a document-term count matrix (terms in extracted order: season, timeout, lost, win, game, score, ball, play, coach, team) with Document 1: 3 0 5 0 2 6 0 2 0 2, Document 2: 0 7 0 2 1 0 0 3 0 0, Document 3: 0 1 0 0 1 2 2 0 3 0; a transaction table (TID, Time, Items): 1, 2009/2/8, Bread, Coke, Milk; 2, 2009/2/13, Beer, Bread; 3, 2009/2/23, Beer, Diaper; 4, 2009/3/1, Coke, Diaper, Milk]
  • 27. Tips for Converting Text to Numerical Values
  • 28. Recap: Types of Attributes ▷ There are different types of attributes • Nominal (=, ≠) → Nominal values can only distinguish one object from another → Examples: ID numbers, eye color, zip codes • Ordinal (<, >) → Ordinal values can help to order objects → Examples: rankings, grades • Interval (+, -) → The differences between values are meaningful → Examples: calendar dates • Ratio (*, /) → Both differences and ratios are meaningful → Examples: temperature in Kelvin, length, time, counts
  • 29. Vector Space Model ▷ Represent the keywords of objects using a term vector • Term: basic concept, e.g., keywords to describe an object • Each term represents one dimension in a vector • N total terms define an N-element vector • The value of each term in a vector corresponds to the importance of that term ▷ Measure similarity by the vector distances [Example document-term counts (terms in extracted order: season, timeout, lost, win, game, score, ball, play, coach, team): Document 1: 3 0 5 0 2 6 0 2 0 2; Document 2: 0 7 0 2 1 0 0 3 0 0; Document 3: 0 1 0 0 1 2 2 0 3 0]
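The slide says similarity is measured by vector distances but does not name a measure. The sketch below assumes cosine similarity, a common choice for term vectors, and applies it to the three document vectors from the slide's example. The function name `cosine_similarity` is illustrative, not something from the deck.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here: an empty vector is similar to nothing
    return dot / (norm_u * norm_v)

# Term-count vectors copied from the slide's document-term example.
doc1 = [3, 0, 5, 0, 2, 6, 0, 2, 0, 2]
doc2 = [0, 7, 0, 2, 1, 0, 0, 3, 0, 0]
doc3 = [0, 1, 0, 0, 1, 2, 2, 0, 3, 0]

for name, other in (("doc2", doc2), ("doc3", doc3)):
    print("doc1 vs", name, round(cosine_similarity(doc1, other), 3))
```

Cosine similarity ignores vector length, so a long and a short description with the same term proportions score as identical; a Euclidean distance would not.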
  • 30. Term Frequency and Inverse Document Frequency (TFIDF) ▷ Since not all objects in the vector space are equally important, we can weight each term using its occurrence probability in the object description • Term frequency: TF(d,t) → number of times t occurs in the object description d • Inverse document frequency: IDF(t) → to scale down the terms that occur in many descriptions
  • 31. Normalizing Term Frequency ▷ $n_{ij}$ represents the number of times a term $t_i$ occurs in a description $d_j$. $tf_{ij}$ can be normalized using the total number of terms in the document • $tf_{ij} = \frac{n_{ij}}{\mathit{NormalizedValue}}$ ▷ NormalizedValue could be: • Sum of all frequencies of terms • Max frequency value • Any other value that makes $tf_{ij}$ fall between 0 and 1
  • 32. Inverse Document Frequency ▷ IDF seeks to scale down the coordinates of terms that occur in many object descriptions • For example, some stop words (the, a, of, to, and…) may occur many times in a description. However, they should be considered unimportant in many cases • $idf_i = \log \frac{N}{df_i} + 1$ → where $df_i$ (document frequency of term $t_i$) is the number of descriptions in which $t_i$ occurs, and $N$ is the total number of descriptions ▷ IDF can be replaced with ICF (inverse class frequency) and many other concepts, depending on the application
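The term-frequency and inverse-document-frequency slides can be combined into a small weighting routine. This is a minimal sketch under two assumptions: term frequency is normalized by the sum of all term counts in the description (one of the options the slides list), and the IDF formula is read as log(N / df) + 1, since the flattened slide text does not show how the terms are grouped. The function name `tf_idf` and the toy descriptions are illustrative.

```python
import math
from collections import Counter

def tf_idf(descriptions):
    """Return one {term: weight} dict per tokenized description."""
    n_docs = len(descriptions)
    counts = [Counter(tokens) for tokens in descriptions]

    # Document frequency: number of descriptions in which each term occurs.
    df = Counter()
    for c in counts:
        df.update(c.keys())

    weights = []
    for c in counts:
        total = sum(c.values())  # normalizer: sum of all term frequencies
        weights.append({
            term: (n / total) * (math.log(n_docs / df[term]) + 1)  # assumed grouping
            for term, n in c.items()
        })
    return weights

docs = [
    ["the", "game", "score"],
    ["the", "team"],
    ["the", "coach", "team"],
]
for w in tf_idf(docs):
    print({t: round(v, 3) for t, v in w.items()})
```

In this toy corpus "the" appears in every description, so its IDF factor is the minimum (log 1 + 1 = 1), while a term such as "game" that appears once is weighted up.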
  • 33. Reasons for Log ▷ Each distribution can indicate the hidden force [Figure panels: power-law distribution, normal distribution, normal distribution]
  • 35. Big Data? ▷ “Every day, we create 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals to name a few. This data is ‘big data.’” • --from
  • 36. 4V
  • 37. Data Quality ▷ What kinds of data quality problems? ▷ How can we detect problems with the data? ▷ What can we do about these problems? ▷ Examples of data quality problems: • Noise and outliers • Missing values • Duplicate data
  • 38. Noise ▷ Noise refers to modification of original values • Examples: distortion of a person’s voice when talking on a poor phone, and “snow” on a television screen [Figure panels: Two Sine Waves; Two Sine Waves + Noise]
  • 39. Outliers ▷ Outliers are data objects with characteristics that are considerably different from most of the other data objects in the data set
  • 40. Missing Values ▷ Reasons for missing values • Information is not collected → e.g., people decline to give their age and weight • Attributes may not be applicable to all cases → e.g., annual income is not applicable to children ▷ Handling missing values • Eliminate data objects • Estimate missing values • Ignore the missing value during analysis • Replace with all possible values → Weighted by their probabilities
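Two of the handling strategies listed above, eliminating data objects and estimating missing values, can be sketched in a few lines. The example assumes records are dictionaries with `None` marking a missing value, and it uses the mean of the observed values as the estimate; both are illustrative choices, not something the slides prescribe.

```python
def drop_missing(rows, key):
    """Eliminate data objects whose value for `key` is missing."""
    return [r for r in rows if r.get(key) is not None]

def impute_mean(rows, key):
    """Estimate missing values of `key` with the mean of the observed values."""
    observed = [r[key] for r in rows if r.get(key) is not None]
    if not observed:
        return [dict(r) for r in rows]  # nothing to estimate from
    mean = sum(observed) / len(observed)
    return [{**r, key: mean if r.get(key) is None else r[key]} for r in rows]

# Hypothetical survey records; the second person declined to give an age.
people = [
    {"name": "A", "age": 20},
    {"name": "B", "age": None},
    {"name": "C", "age": 40},
]
print(drop_missing(people, "age"))
print(impute_mean(people, "age"))
```

Dropping rows shrinks the data set, while mean imputation keeps every row but reduces the variance of the attribute, so the better option depends on how many values are missing and why.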
  • 41. Duplicate Data ▷ A data set may include data objects that are duplicates, or almost duplicates, of one another • Major issue when merging data from heterogeneous sources ▷ Examples: • Same person with multiple email addresses ▷ Data cleaning • Process of dealing with duplicate data issues
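The "same person with multiple email addresses" case can be illustrated with a small merge step. This sketch assumes each record has hypothetical `name` and `email` fields and treats records as the same person when their names match after lowercasing and whitespace cleanup. That matching rule is a deliberate simplification: two different people can share a name, so real data cleaning usually needs more evidence.

```python
def normalize_name(name):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(name.lower().split())

def merge_duplicates(records):
    """Merge records that share a normalized name, collecting distinct emails."""
    merged = {}
    for rec in records:
        key = normalize_name(rec["name"])
        entry = merged.setdefault(key, {"name": rec["name"].strip(), "emails": []})
        email = rec["email"].strip().lower()
        if email not in entry["emails"]:
            entry["emails"].append(email)
    return list(merged.values())

# Hypothetical records merged from two sources.
records = [
    {"name": "Alice Lee", "email": "alice@example.com"},
    {"name": "alice  lee", "email": "a.lee@example.org"},
    {"name": "Bob Wu", "email": "bob@example.com"},
]
print(merge_duplicates(records))
```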
  • 42. Data Preprocessing: To be or not to be
  • 43. Data Preprocessing ▷ Aggregation ▷ Sampling ▷ Dimensionality reduction ▷ Feature subset selection ▷ Feature creation ▷ Discretization and binarization ▷ Attribute transformation
  • 44. Aggregation ▷ Combining two or more attributes (or objects) into a single attribute (or object) ▷ Purpose • Data reduction → Reduce the number of attributes or objects • Change of scale → Cities aggregated into regions, states, countries, etc. • More “stable” data → Aggregated data tends to have less variability ▷ Example query: SELECT d.Name, AVG(Salary) FROM Employee AS e, Department AS d WHERE e.Dept = d.DNo GROUP BY d.Name HAVING COUNT(e.ID) >= 2;
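For readers who prefer to see the aggregation outside SQL, the slide's query can be mirrored in plain Python. The sketch assumes the two tables are lists of dictionaries using the same column names as the query (`Dept`, `Salary`, `DNo`, `Name`); the sample rows are made up for illustration.

```python
from collections import defaultdict

def average_salary_by_department(employees, departments, min_employees=2):
    """Join employees to departments, then average salary per department name,
    keeping only departments with at least `min_employees` employees."""
    dept_names = {d["DNo"]: d["Name"] for d in departments}
    salaries = defaultdict(list)
    for e in employees:
        if e["Dept"] in dept_names:  # inner join on e.Dept = d.DNo
            salaries[dept_names[e["Dept"]]].append(e["Salary"])
    return {
        name: sum(values) / len(values)
        for name, values in salaries.items()
        if len(values) >= min_employees  # HAVING COUNT(e.ID) >= 2
    }

# Hypothetical rows.
departments = [{"DNo": 1, "Name": "Research"}, {"DNo": 2, "Name": "Sales"}]
employees = [
    {"ID": 10, "Dept": 1, "Salary": 50000},
    {"ID": 11, "Dept": 1, "Salary": 70000},
    {"ID": 12, "Dept": 2, "Salary": 40000},
]
print(average_salary_by_department(employees, departments))
```

Three employee rows are reduced to one aggregated row, which shows both the data reduction and the change of scale described on the slide.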
  • 45. Sampling ▷ Sampling is the main technique employed for data selection • It is often used for both → Preliminary investigation of the data → The final data analysis • Reasons: → Statistics: Obtaining the entire set of data of interest is too expensive → Data mining: Processing the entire data set is too expensive
  • 46. Key Principle for Effective Sampling ▷ The sample is representative • Using a sample will work almost as well as using the entire data set • The sample has approximately the same properties as the original set of data
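One way to check whether a sample is representative is to compare a property of the sample with the same property of the full data set. The sketch below assumes simple random sampling without replacement and uses the mean as the property of interest; the population, the seed, and the function name `sample_mean_error` are illustrative.

```python
import random
import statistics

def sample_mean_error(population, sample_size, seed=0):
    """Absolute difference between the sample mean and the population mean."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    sample = rng.sample(population, sample_size)  # simple random sample
    return abs(statistics.mean(sample) - statistics.mean(population))

population = list(range(10000))
for size in (8000, 2000, 500):
    print(size, sample_mean_error(population, size))
```

Larger samples tend to give smaller errors, but any single draw can deviate, which is why the sample sizes on the following slide matter.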
  • 47. Sample Size Matters [Figure panels: 8000 points; 2000 points; 500 points]
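  • 48. Sampling Bias ▷ 2004 Taiwan presidential election polls (TVBS; United Daily News 聯合報) • Interview dates: January 15 to 17, 2004 (ROC year 93) • Valid sample: 1,068 people • Refusals: 699 people • Sampling error: about ±3 percentage points at the 95% confidence level • Survey area: Taiwan • Sampling method: stratified systematic sampling from the telephone directory, with the last two digits of each phone number randomized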
  • 49. Dimensionality Reduction ▷ Purpose: • Avoid the curse of dimensionality • Reduce the amount of time and memory required by data mining algorithms • Allow data to be more easily visualized • May help to eliminate irrelevant features or reduce noise ▷ Techniques • Principal Component Analysis • Singular Value Decomposition • Others: supervised and non-linear techniques
  • 50. Curse of Dimensionality ▷When dimensionality increases, data becomes increasingly sparse in the space that it occupies • Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful ▷Experiment • Randomly generate 500 points • Compute the difference between the max and min distance between any pair of points
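The experiment on this slide can be sketched in a few lines. This is a minimal illustration, not the code behind the original figure: the function name `distance_contrast`, the uniform sampling in the unit hypercube, and the normalization by the minimum distance are all assumptions made here.

```python
import math
import random


def distance_contrast(n_points, n_dims, seed=0):
    """Return (max_dist - min_dist) / min_dist over all pairs of random points.

    Points are drawn uniformly from the unit hypercube (an assumption).
    """
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    dists = [
        math.dist(points[i], points[j])
        for i in range(n_points)
        for j in range(i + 1, n_points)
    ]
    d_min, d_max = min(dists), max(dists)
    return (d_max - d_min) / d_min


if __name__ == "__main__":
    # The relative contrast shrinks as the number of dimensions grows.
    for dims in (2, 10, 50):
        print(dims, round(distance_contrast(200, dims), 2))
```

As the dimensionality grows, the nearest and farthest pairs become almost equally far apart, which is why distance-based notions such as density lose meaning.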
  • 51. Dimensionality Reduction: PCA ▷Goal is to find a projection that captures the largest amount of variation in data [Figure: data plotted on axes x1 and x2 with the projection direction e]
  • 52. Feature Subset Selection ▷Another way to reduce dimensionality of data ▷Redundant features •Duplicate much or all of the information contained in one or more other attributes •E.g. purchase price of a product vs. sales tax ▷Irrelevant features •Contain no information that is useful for the data mining task at hand •E.g. students' ID is often irrelevant to the task of predicting students' GPA 52
  • 53. Feature Creation ▷Create new attributes that can capture the important information in a data set much more efficiently than the original attributes ▷Three general methodologies: •Feature extraction →Domain-specific •Mapping data to new space •Feature construction →Combining features 53
  • 54. Mapping Data to a New Space ▷Fourier transform ▷Wavelet transform 54 Two Sine Waves Two Sine Waves + Noise Frequency
  • 55. Discretization Using Class Labels ▷Entropy based approach 55 3 categories for both x and y 5 categories for both x and y
  • 56. Discretization Without Using Class Labels 56
  • 57. Attribute Transformation ▷A function that maps the entire set of values of a given attribute to a new set of replacement values • So each old value can be identified with one of the new values • Simple functions: x^k, log(x), e^x, |x| • Standardization and Normalization
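Standardization and normalization can be illustrated with a short sketch. The function names are hypothetical, and the choice of the population standard deviation and of min-max scaling to [0, 1] is an assumption; the slide does not say which variants are meant.

```python
import statistics


def standardize(values):
    """Map values to z-scores: subtract the mean, divide by the standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumption)
    return [(v - mean) / std for v in values]


def min_max_normalize(values):
    """Rescale values linearly so that the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


if __name__ == "__main__":
    data = [2, 4, 4, 4, 5, 5, 7, 9]
    print(standardize(data))
    print(min_max_normalize(data))
```

Both functions assume the attribute is not constant; a constant attribute would cause a division by zero and carries no information anyway.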
  • 60. Data Collection ▷Align /Classify the attributes correctly 60 Who post this message Mentioned User Hashtag Shared URL
  • 61. Language Detection ▷Detect the language (or the possible languages) in which the specified text is written ▷Difficulties •Short messages •Different languages in one statement •Noisy text ▷Example: 你好 現在幾點鐘 apa kabar sekarang jam berapa ? → Traditional Chinese (zh-tw), Indonesian (id)
  • 62. Wrong Detection Examples ▷Twitter examples 62 @sayidatynet top song #LailaGhofran shokran ya garh new album #listen 中華隊的服裝挺特別的,好藍。。。 #ChineseTaipei #Sochi #2014冬奧 授業前の雪合戦w Before / after removing noise en -> id it -> zh-tw en -> ja
  • 63. Removing Noise ▷Removing noise before detection •Html file ->tags •Twitter -> hashtag, mention, URL ▷Example • With tags, detected as English (en): <meta name="twitter:description" content="觸犯法國隱私法〔駐歐洲特派記者胡蕙寧、國際新聞中心/綜合報導〕網路搜尋引擎巨擘Google8日在法文版首頁(張貼悔過書 ..."/> • Without tags, detected as Traditional Chinese (zh-tw): 觸犯法國隱私法〔駐歐洲特派記者胡蕙寧、國際新聞中心/綜合報導〕網路搜尋引擎巨擘Google8日在法文版首頁(張貼悔過書 ...
  • 64. Data Cleaning ▷Special characters • Unicode emoticons: ☺, ♥… • Symbol icons: ☏, ✉… • Currency symbols: €, £, $... ▷Utilize regular expressions to clean data • Tweet URL: (^|\s*)http(\S+)?(\s*|$) • Filter out everything that is not a letter, space, punctuation mark, or digit: (\p{L}+)|(\p{Z}+)|(\p{Punct}+)|(\p{Digit}+) ▷Examples • ◕‿◕ Friendship is everything ♥ ✉ • I added a video to a @YouTube playlist
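The cleaning step on this slide can be sketched as follows. Python's built-in `re` module does not support `\p{...}` classes, so this sketch substitutes Unicode category lookups from `unicodedata`; treating punctuation and digits as ASCII-only is an assumption, and the names `clean_tweet` and `keep_char` are hypothetical.

```python
import re
import string
import unicodedata

# Simplified URL pattern: "http" followed by any run of non-space characters.
URL_PATTERN = re.compile(r"http\S*")


def keep_char(ch):
    """Keep letters, separators (spaces), ASCII punctuation and ASCII digits."""
    category = unicodedata.category(ch)
    return (
        category[0] in ("L", "Z")
        or ch in string.punctuation
        or ch in string.digits
    )


def clean_tweet(text):
    """Remove URLs and special characters, then collapse repeated spaces."""
    text = URL_PATTERN.sub(" ", text)
    text = "".join(ch for ch in text if keep_char(ch))
    return " ".join(text.split())


if __name__ == "__main__":
    print(clean_tweet("◕‿◕ Friendship is everything ♥ ✉ http://t.co/abc"))
    print(clean_tweet("I added a video to a @YouTube playlist"))
```

Mentions such as @YouTube survive because "@" counts as punctuation here; removing mentions and hashtags would need an extra pattern.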
  • 65. Japanese Examples ▷Use a regular expression (\W) to remove all special characters • Before: うふふふふ(*^^*)楽しむ!ありがとうございます^o^ アイコン、ラブラブ(-_-)♡ • After: うふふふふ 楽しむ ありがとうございます アイコン ラブラブ
  • 66. Part-of-speech (POS) Tagging ▷Processing text and assigning parts of speech to each word ▷Twitter POS tagging •Noun (N), Adjective (A), Verb (V), URL (U)… 66 Happy Easter! I went to work and came home to an empty house now im going for a quick run Happy_A Easter_N !_, I_O went_V to_P work_N and_& came_V home_N to_P an_D empty_A house_N now_R im_L going_V for_P a_D quick_A run_N
  • 67. Stemming ▷@DirtyDTran gotta be caught up for tomorrow nights episode ▷@ASVP_Jaykey for some reasons I found this very amusing 67 • @DirtyDTran gotta be catch up for tomorrow night episode • @ASVP_Jaykey for some reason I find this very amusing RT @kt_biv : @caycelynnn loving and missing you! we are still looking for Lucy love miss be look
  • 68. Hashtag Segmentation ▷By using Microsoft Web N-Gram Service (or by using Viterbi algorithm) 68 #pray #for #boston Wow! explosion at a boston race ... #prayforboston #citizenscience #bostonmarathon #goodthingsarecoming #lowbloodpressure → → → → #citizen #science #boston #marathon #good #things #are #coming #low #blood #pressure
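A Viterbi-style segmentation can be sketched with dynamic programming over a unigram word model. The word counts below are a hypothetical toy vocabulary invented for this illustration; a real system would use counts from a large corpus or an n-gram service, and the penalty for unknown strings is also an assumption.

```python
import math

# Hypothetical toy unigram counts (not real corpus statistics).
UNIGRAM_COUNTS = {
    "pray": 50, "for": 500, "boston": 80, "marathon": 30,
    "citizen": 40, "science": 60, "good": 300, "things": 120,
    "are": 600, "coming": 90, "low": 70, "blood": 60, "pressure": 40,
    "a": 900, "is": 700, "in": 800, "on": 500,
}
TOTAL = sum(UNIGRAM_COUNTS.values())


def word_logprob(word):
    """Log-probability of a word; unknown strings are penalized by length."""
    count = UNIGRAM_COUNTS.get(word)
    if count is None:
        return math.log(1.0 / (TOTAL * 10 ** len(word)))
    return math.log(count / TOTAL)


def segment(hashtag, max_word_len=20):
    """Split a hashtag into the most probable word sequence."""
    text = hashtag.lstrip("#").lower()
    n = len(text)
    # best[i] = (best log-probability of text[:i], start index of the last word)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            score = best[start][0] + word_logprob(text[start:end])
            if score > best[end][0]:
                best[end] = (score, start)
    # Follow the back-pointers from the end of the string.
    words, end = [], n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


if __name__ == "__main__":
    for tag in ("#prayforboston", "#citizenscience", "#lowbloodpressure"):
        print(tag, "->", segment(tag))
```

The quality of the split depends entirely on the vocabulary: with this toy dictionary, any hashtag containing words outside it is left mostly unsegmented.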
  • 69. More Preprocesses for Different Web Data ▷Extract source code without javascript ▷Removing html tags 69
  • 70. Extract Source Code Without Javascript ▷Javascript code should be considered as an exception • it may contain hidden content 70
  • 71. Remove Html Tags ▷Removing html tags to extract meaningful content 71
  • 72. More Preprocesses for Different Languages ▷Chinese Simplified/Traditional Conversion ▷Word segmentation 72
  • 73. Chinese Simplified/Traditional Conversion ▷Word conversion • 请乘客从后门落车 → 請乘客從後門下車 ▷One-to-many mapping • @shinrei 出去旅游还是崩坏 → @shinrei 出去旅游還是崩壞 游 (zh-cn) → 游|遊 (zh-tw) ▷Wrong segmentation • 人体内存在很多微生物 → 內存: 人體 記憶體 在很多微生物 → 存在: 人體內 存在 很多微生物 73 內存|存在
  • 74. Wrong Chinese Word Segmentation ▷Wrong segmentation (地面|面積) • 這(Nep) 地面(Nc) 積(VJ) 還(D) 真(D) 不(D) 小(VH) ▷Wrong word (實驗室|室友) • @iamzeke 實驗(Na) 室友(Na) 多(Dfa) 危險(VH) 你(Nh) 不(D) 知道(VK) 嗎(T) ? ▷Wrong order (存在|內在) • 人體(Na) 存(VC) 內在(Na) 很多(Neqa) 微生物(Na) ▷Unknown word (unknown word: 團購) • 半夜(Nd) 逛團(Na) 購(VC) 看到(VE) 太(Dfa) 吸引人(VH) !!
  • 75. Similarity and Dissimilarity To like or not to like 75
  • 76. Similarity and Dissimilarity ▷Similarity • Numerical measure of how alike two data objects are • Is higher when objects are more alike • Often falls in the range [0,1] ▷Dissimilarity • Numerical measure of how different two data objects are • Is lower when objects are more alike • Minimum dissimilarity is often 0 • Upper limit varies
  • 77. Euclidean Distance ▷ $dist(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}$, where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q. ▷Standardization is necessary if scales differ.
  • 78. Minkowski Distance ▷ Minkowski Distance is a generalization of Euclidean Distance: $dist(p, q) = \left( \sum_{k=1}^{n} |p_k - q_k|^r \right)^{1/r}$, where r is a parameter, n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q. ▷ The distance is extremely sensitive to the scales of the variables involved
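A small sketch of the Minkowski distance shows how the parameter r changes the result: r = 1 gives the city-block (Manhattan) distance and r = 2 gives the Euclidean distance. The function name `minkowski` is chosen here for illustration.

```python
def minkowski(p, q, r):
    """Minkowski distance between two equal-length numeric vectors."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of attributes")
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1.0 / r)


if __name__ == "__main__":
    p, q = (0, 0), (3, 4)
    print(minkowski(p, q, 1))  # city-block distance
    print(minkowski(p, q, 2))  # Euclidean distance
```

Because every attribute contributes its raw difference, an attribute measured on a large scale dominates the sum unless the data is standardized first.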
  • 79. Mahalanobis Distance ▷Mahalanobis distance measure: •Transforms the variables so that they are uncorrelated •Rescales them so that each variance equals 1 •Calculates the simple Euclidean distance on the transformed variables ▷ $d(x, y) = \sqrt{(x - y)^{T} S^{-1} (x - y)}$, where S is the covariance matrix of the input data
  • 80. Similarity Between Binary Vectors ▷ Common situation is that objects, p and q, have only binary attributes ▷ Compute similarities using the following quantities • M01 = the number of attributes where p was 0 and q was 1 • M10 = the number of attributes where p was 1 and q was 0 • M00 = the number of attributes where p was 0 and q was 0 • M11 = the number of attributes where p was 1 and q was 1 ▷ Simple Matching and Jaccard Coefficients • SMC = number of matches / number of attributes = (M11 + M00) / (M01 + M10 + M11 + M00) • J = number of 11 matches / number of not-both-zero attribute values = (M11) / (M01 + M10 + M11)
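The two coefficients can be computed directly from the four counts. The sketch below uses two illustrative vectors chosen here (they are not from the slide); the function name `binary_similarity` is hypothetical.

```python
def binary_similarity(p, q):
    """Return (SMC, Jaccard) for two equal-length 0/1 vectors."""
    m01 = sum(1 for a, b in zip(p, q) if a == 0 and b == 1)
    m10 = sum(1 for a, b in zip(p, q) if a == 1 and b == 0)
    m00 = sum(1 for a, b in zip(p, q) if a == 0 and b == 0)
    m11 = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)
    smc = (m11 + m00) / (m01 + m10 + m11 + m00)
    not_both_zero = m01 + m10 + m11
    # Jaccard is undefined when both vectors are all zeros; return 0.0 (assumption).
    jaccard = m11 / not_both_zero if not_both_zero else 0.0
    return smc, jaccard


if __name__ == "__main__":
    p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
    print(binary_similarity(p, q))
```

For these vectors, SMC is high because the many shared zeros count as matches, while Jaccard is zero because no attribute is 1 in both. This is why Jaccard is preferred for sparse data.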
  • 81. Cosine Similarity ▷ If d1 and d2 are two document vectors, then cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||), where · indicates the vector dot product and ||d|| is the length of vector d. ▷ Example: d1 = 3 2 0 5 0 0 0 2 0 0 d2 = 1 0 0 0 0 0 0 1 0 2 d1 · d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5 ||d1|| = (3*3+2*2+0*0+5*5+0*0+0*0+0*0+2*2+0*0+0*0)^0.5 = (42)^0.5 = 6.481 ||d2|| = (1*1+0*0+0*0+0*0+0*0+0*0+0*0+1*1+0*0+2*2)^0.5 = (6)^0.5 = 2.449 cos(d1, d2) = 0.3150
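The worked example can be checked with a few lines of code. This is a plain-Python sketch; the function name `cosine_similarity` is chosen here.

```python
import math


def cosine_similarity(d1, d2):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)


if __name__ == "__main__":
    d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
    d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
    print(round(cosine_similarity(d1, d2), 4))
```

Running it on the two document vectors from the slide gives the same final value of about 0.3150, using a dot product of 5 and lengths of √42 and √6.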
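  • 82. Correlation ▷ Correlation measures the linear relationship between objects ▷ To compute correlation, we standardize data objects, p and q, and then take their dot product • $p'_k = (p_k - mean(p)) / std(p)$ • $q'_k = (q_k - mean(q)) / std(q)$ • $correlation(p, q) = p' \cdot q'$ (to obtain a value in [-1, 1], the dot product is divided by the number of attributes n, or by n - 1 when the sample standard deviation is used)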
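A minimal sketch of the computation follows. It standardizes both objects with the population standard deviation and divides the dot product by n so that the result lies in [-1, 1]; that normalization is an assumption made here, since the slide only states the dot product.

```python
import statistics


def correlation(p, q):
    """Pearson correlation via standardization followed by a dot product."""
    n = len(p)
    mean_p, mean_q = statistics.fmean(p), statistics.fmean(q)
    std_p, std_q = statistics.pstdev(p), statistics.pstdev(q)
    p_std = [(x - mean_p) / std_p for x in p]
    q_std = [(x - mean_q) / std_q for x in q]
    # Divide by n because the population standard deviation was used.
    return sum(a * b for a, b in zip(p_std, q_std)) / n


if __name__ == "__main__":
    print(correlation([1, 2, 3], [2, 4, 6]))   # perfectly linear, increasing
    print(correlation([1, 2, 3], [3, 2, 1]))   # perfectly linear, decreasing
```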
  • 83. Using Weights to Combine Similarities ▷May not want to treat all attributes the same. • Use weights wk which are between 0 and 1 and sum to 1. 83
  • 84. Density ▷Density-based clustering requires a notion of density ▷Examples: • Euclidean density → Euclidean density = number of points per unit volume • Probability density • Graph-based density
  • 86. Data Exploration ▷A preliminary exploration of the data to better understand its characteristics ▷Key motivations of data exploration include • Helping to select the right tool for preprocessing or analysis • Making use of humans’ abilities to recognize patterns • People can recognize patterns not captured by data analysis tools 86
  • 87. Summary Statistics ▷Summary statistics are numbers that summarize properties of the data •Summarized properties include frequency, location and spread → Examples: location - mean spread - standard deviation •Most summary statistics can be calculated in a single pass through the data 87
  • 88. Frequency and Mode ▷Given a set of unordered categorical values → Computing the frequency with which each value occurs is the easiest way to summarize them: $frequency(v_i) = \frac{\text{number of objects with attribute value } v_i}{m}$, where m is the number of objects ▷The mode of a categorical attribute • The attribute value that has the highest frequency
  • 89. Percentiles ▷For ordered data, the notion of a percentile is more useful ▷Given • An ordinal or continuous attribute x • A number p between 0 and 100 ▷The pth percentile xp is a value of x • p% of the observed values of x are less than xp 89
  • 90. Measures of Location: Mean and Median ▷The mean is the most common measure of the location of a set of points. • However, the mean is very sensitive to outliers. • Thus, the median or a trimmed mean is also commonly used 90
  • 91. Measures of Spread: Range and Variance ▷Range is the difference between the max and min ▷The variance or standard deviation is the most common measure of the spread of a set of points. ▷However, this is also sensitive to outliers, so that other measures are often used 91
  • 92. Visualization Visualization is the conversion of data into a visual or tabular format ▷Visualization of data is one of the most powerful and appealing techniques for data exploration. • Humans have a well developed ability to analyze large amounts of information that is presented visually • Can detect general patterns and trends • Can detect outliers and unusual patterns 92
  • 93. Arrangement ▷Is the placement of visual elements within a display ▷Can make a large difference in how easy it is to understand the data ▷Example 93
  • 94. Visualization Techniques: Histograms ▷ Histogram • Usually shows the distribution of values of a single variable • Divide the values into bins and show a bar plot of the number of objects in each bin. • The height of each bar indicates the number of objects • Shape of histogram depends on the number of bins ▷ Example: Petal Width (10 and 20 bins, respectively) 94
  • 95. Visualization Techniques: Box Plots ▷Another way of displaying the distribution of data • Following figure shows the basic part of a box plot 95
  • 97. Visualization Techniques: Contour Plots ▷Contour plots • Partition the plane into regions of similar values • The contour lines that form the boundaries of these regions connect points with equal values • The most common example is contour maps of elevation • Can also display temperature, rainfall, air pressure, etc. 97 Celsius Sea Surface Temperature (SST)
  • 98. Visualization of the Iris Data Matrix 98 standard deviation
  • 99. Visualization of the Iris Correlation Matrix 99
  • 100. Visualization Techniques: Star Plots ▷ Similar approach to parallel coordinates • One axis for each attribute ▷ The size and the shape of the polygon give a visual description of the attribute values of the object [Figure: star plot with axes sepal length, sepal width, petal length, petal width]
  • 101. Visualization Techniques: Chernoff Faces ▷This approach associates each attribute with a characteristic of a face ▷The values of each attribute determine the appearance of the corresponding facial characteristic ▷Each object becomes a separate face 101 Data Feature Facial Feature Sepal length Size of face Sepal width Forehead/jaw relative arc length Petal length Shape of forehead Petal width Shape of jaw
  • 102. Do's and Don'ts ▷ Apprehension • Correctly perceive relations among variables ▷ Clarity • Visually distinguish all the elements of a graph ▷ Consistency • Interpret a graph based on similarity to previous graphs ▷ Efficiency • Portray a possibly complex relation in as simple a way as possible ▷ Necessity • The need for the graph, and the graphical elements ▷ Truthfulness • Determine the true value represented by any graphical element 102
  • 103. Data Mining Techniques Yi-Shin Chen Institute of Information Systems and Applications Department of Computer Science National Tsing Hua University Many slides provided by Tan, Steinbach, Kumar for book “Introduction to Data Mining” are adapted in this presentation
  • 105. Tasks in Data Mining ▷Problems should be well defined at the beginning ▷Two categories of tasks [Fayyad et al., 1996] 105 Predictive Tasks • Predict unknown values • e.g., potential customers Descriptive Tasks • Find patterns to describe data • e.g., Friendship finding VIP Cheap Potential
  • 106. Select Techniques ▷Problems could be further decomposed 106 Predictive Tasks • Classification • Ranking • Regression • … Descriptive Tasks • Clustering • Association rules • Summarization • … Supervised Learning Unsupervised Learning
  • 107. Supervised vs. Unsupervised Learning ▷ Supervised learning • Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations • New data is classified based on the training set ▷ Unsupervised learning • The class labels of the training data are unknown • Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
  • 108. Classification ▷ Given a collection of records (training set ) • Each record contains a set of attributes • One of the attributes is the class ▷ Find a model for class attribute: • The model forms a function of the values of other attributes ▷ Goal: previously unseen records should be assigned a class as accurately as possible. • A test set is needed → To determine the accuracy of the model ▷Usually, the given data set is divided into training & test • With training set used to build the model • With test set used to validate it 108
  • 109. Ranking ▷Produce a permutation of the items in a new list • Items ranked in higher positions should be more important • E.g., rank webpages in a search engine: webpages in higher positions are more relevant.
  • 110. Regression ▷Find a function that models the data with the least error • The output might be a numerical value • E.g.: Predict the stock value
  • 111. Clustering ▷Group data into clusters • Objects are similar to the other objects within the same cluster • Objects are dissimilar to the objects in other clusters • No predefined classes (unsupervised classification)
  • 112. Association Rule Mining ▷Basic concept • Given a set of transactions • Find rules that will predict the occurrence of an item • Based on the occurrences of other items in the transaction 112
  • 113. Summarization ▷Provide a more compact representation of the data • Data: Visualization • Text – Document Summarization → E.g.: Snippet 113
  • 115. Illustrating Classification Task ▷Induction: a learning algorithm learns a model from the training set ▷Deduction: the model is applied to the test set
    Training Set
    Tid  Attrib1  Attrib2  Attrib3  Class
    1    Yes      Large    125K     No
    2    No       Medium   100K     No
    3    No       Small    70K      No
    4    Yes      Medium   120K     No
    5    No       Large    95K      Yes
    6    No       Medium   60K      No
    7    Yes      Large    220K     No
    8    No       Small    85K      Yes
    9    No       Medium   75K      No
    10   No       Small    90K      Yes
    Test Set
    Tid  Attrib1  Attrib2  Attrib3  Class
    11   No       Small    55K      ?
    12   Yes      Medium   80K      ?
    13   Yes      Large    110K     ?
    14   No       Small    95K      ?
    15   No       Large    67K      ?
  • 116. Decision Tree © Tan, Steinbach, Kumar Introduction to Data Mining
    Training Data (Refund and Marital Status are categorical, Taxable Income is continuous, Cheat is the class)
    Tid  Refund  Marital Status  Taxable Income  Cheat
    1    Yes     Single          125K            No
    2    No      Married         100K            No
    3    No      Single          70K             No
    4    Yes     Married         120K            No
    5    No      Divorced        95K             Yes
    6    No      Married         60K             No
    7    Yes     Divorced        220K            No
    8    No      Single          85K             Yes
    9    No      Married         75K             No
    10   No      Single          90K             Yes
    Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)
    Refund = Yes → NO
    Refund = No → test MarSt
      MarSt = Married → NO
      MarSt = Single, Divorced → test TaxInc
        TaxInc < 80K → NO
        TaxInc > 80K → YES
    There could be more than one tree that fits the same data!
  • 117. Algorithm for Decision Tree Induction ▷ Basic algorithm (a greedy algorithm) • Tree is constructed in a top-down recursive divide-and-conquer manner • At start, all the training examples are at the root • Attributes are categorical (if continuous-valued, they are discretized in advance) • Examples are partitioned recursively based on selected attributes • Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain) Data Mining 117
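Information gain, the example heuristic named above, can be computed for the categorical attributes of the Refund / Marital Status / Cheat training data shown on the Decision Tree slide. This is a sketch under the usual entropy-based definition; the function names are chosen here.

```python
import math
from collections import Counter

# (Refund, Marital Status, Cheat) for the ten training records.
RECORDS = [
    ("Yes", "Single", "No"), ("No", "Married", "No"), ("No", "Single", "No"),
    ("Yes", "Married", "No"), ("No", "Divorced", "Yes"), ("No", "Married", "No"),
    ("Yes", "Divorced", "No"), ("No", "Single", "Yes"), ("No", "Married", "No"),
    ("No", "Single", "Yes"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(records, attr_index):
    """Entropy of the class minus the weighted entropy after splitting on one attribute."""
    labels = [r[-1] for r in records]
    groups = {}
    for r in records:
        groups.setdefault(r[attr_index], []).append(r[-1])
    remainder = sum(len(g) / len(records) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


if __name__ == "__main__":
    print("Refund:", round(information_gain(RECORDS, 0), 4))
    print("Marital Status:", round(information_gain(RECORDS, 1), 4))
```

A greedy tree builder would evaluate every candidate attribute this way at each node and split on the one with the largest gain.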
  • 118. Tree Induction ▷Greedy strategy. • Split the records based on an attribute test that optimizes certain criterion. ▷Issues • Determine how to split the records → How to specify the attribute test condition? → How to determine the best split? • Determine when to stop splitting 118 © Tan, Steinbach, Kumar Introduction to Data Mining
  • 119. The Problem Of Decision Tree ▷The Decision Tree has a hard time with correlated attributes [Figure: scatter plots on 1 to 10 and 10 to 100 axes, with panels labeled "Deep Bushy Tree" and "Useless"]
  • 120. Advantages/Disadvantages of Decision Trees ▷Advantages: • Easy to understand • Easy to generate rules ▷Disadvantages: • May suffer from overfitting. • Classifies by rectangular partitioning (so does not handle correlated features very well). • Can be quite large – pruning is necessary. • Does not handle streaming data easily 120
  • 121. Underfitting and Overfitting 121 Underfitting: when model is too simple, both training and test errors are large © Tan, Steinbach, Kumar Introduction to Data Mining
  • 122. Overfitting due to Noise 122 © Tan, Steinbach, Kumar Introduction to Data Mining Decision boundary is distorted by noise point
  • 123. Overfitting due to Insufficient Examples 123 © Tan, Steinbach, Kumar Introduction to Data Mining Lack of data points in the lower half of the diagram makes it difficult to predict correctly the class labels of that region - Insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task
  • 124. Bayes Classifier ▷A probabilistic framework for solving classification problems ▷Conditional Probability: $P(C \mid A) = \frac{P(A, C)}{P(A)}$ and $P(A \mid C) = \frac{P(A, C)}{P(C)}$ ▷ Bayes theorem: $P(C \mid A) = \frac{P(A \mid C) P(C)}{P(A)}$ © Tan, Steinbach, Kumar Introduction to Data Mining
  • 125. Bayesian Classifiers ▷Consider each attribute and class label as random variables ▷Given a record with attributes (A1, A2,…,An) • Goal is to predict class C • Specifically, we want to find the value of C that maximizes P(C| A1, A2,…,An ) ▷Can we estimate P(C| A1, A2,…,An ) directly from data? 125 © Tan, Steinbach, Kumar Introduction to Data Mining
  • 126. Bayesian Classifier Approach ▷Compute the posterior probability P(C | A1, A2, …, An) for all values of C using the Bayes theorem: $P(C \mid A_1 A_2 \ldots A_n) = \frac{P(A_1 A_2 \ldots A_n \mid C) P(C)}{P(A_1 A_2 \ldots A_n)}$ ▷Choose value of C that maximizes P(C | A1, A2, …, An) ▷Equivalent to choosing value of C that maximizes P(A1, A2, …, An|C) P(C) ▷How to estimate P(A1, A2, …, An | C )? © Tan, Steinbach, Kumar Introduction to Data Mining
  • 127. Naïve Bayes Classifier ▷A simplifying assumption: attributes are conditionally independent given the class, and each data sample X = (x_1, …, x_n) has n attributes, so $P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)$ ▷No dependence relation between attributes ▷By Bayes theorem, $P(C_i \mid X) = \frac{P(X \mid C_i) P(C_i)}{P(X)}$ ▷As P(X) is constant for all classes, assign X to the class with maximum P(X|Ci)*P(Ci)
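The decision rule can be sketched for categorical attributes using the Refund / Marital Status / Cheat training data from the Decision Tree slide. Probabilities are estimated by simple counting without smoothing, which is an assumption of this sketch; the function names are hypothetical.

```python
from collections import Counter

# (Refund, Marital Status, Cheat) for the ten training records.
TRAINING = [
    ("Yes", "Single", "No"), ("No", "Married", "No"), ("No", "Single", "No"),
    ("Yes", "Married", "No"), ("No", "Divorced", "Yes"), ("No", "Married", "No"),
    ("Yes", "Divorced", "No"), ("No", "Single", "Yes"), ("No", "Married", "No"),
    ("No", "Single", "Yes"),
]


def train(records):
    """Count class frequencies and (attribute index, value, class) frequencies."""
    class_counts = Counter(r[-1] for r in records)
    cond_counts = Counter()
    for r in records:
        for k, value in enumerate(r[:-1]):
            cond_counts[(k, value, r[-1])] += 1
    return class_counts, cond_counts


def predict(x, class_counts, cond_counts):
    """Return the class maximizing P(X|Ci) * P(Ci), plus all class scores."""
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(Ci)
        for k, value in enumerate(x):
            score *= cond_counts[(k, value, label)] / count  # P(x_k | Ci)
        scores[label] = score
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    model = train(TRAINING)
    print(predict(("No", "Married"), *model))
    print(predict(("No", "Single"), *model))
```

Without smoothing, a single attribute value never seen with a class drives that class's score to zero; Laplace smoothing is a common remedy.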
  • 128. Naïve Bayesian Classifier: Comments ▷ Advantages : • Easy to implement • Good results obtained in most of the cases ▷ Disadvantages • Assumption: class conditional independence • Practically, dependencies exist among variables → E.g., hospitals: patients: Profile: age, family history etc → E.g., Symptoms: fever, cough etc., Disease: lung cancer, diabetes etc • Dependencies among these cannot be modeled by Naïve Bayesian Classifier ▷ How to deal with these dependencies? • Bayesian Belief Networks 128
  • 129. Bayesian Networks ▷A Bayesian belief network allows a subset of the variables to be conditionally independent ▷A graphical model of causal relationships • Represents dependency among the variables • Gives a specification of joint probability distribution
  • 130. Bayesian Belief Network: An Example ▷Variables in the network: FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, Dyspnea ▷The conditional probability table for the variable LungCancer shows the conditional probability for each possible combination of its parents
         (FH, S)  (FH, ~S)  (~FH, S)  (~FH, ~S)
    LC   0.8      0.5       0.7       0.1
    ~LC  0.2      0.5       0.3       0.9
    ▷Joint probability: $P(z_1, \ldots, z_n) = \prod_{i=1}^{n} P(z_i \mid Parents(Z_i))$
  • 131. Neural Networks ▷Artificial neuron • Each input is multiplied by a weighting factor. • Output is 1 if sum of weighted inputs exceeds a threshold value; 0 otherwise ▷Network is programmed by adjusting weights using feedback from examples 131
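  • 132. General Structure [Figure: feed-forward network mapping an input vector x_i through input nodes, hidden nodes and output nodes to an output vector, with weights w_ij] ▷Net input: $I_j = \sum_i w_{ij} O_i + \theta_j$ ▷Output: $O_j = \frac{1}{1 + e^{-I_j}}$ ▷Error of an output unit: $Err_j = O_j (1 - O_j)(T_j - O_j)$ ▷Error of a hidden unit: $Err_j = O_j (1 - O_j) \sum_k Err_k w_{jk}$ ▷Weight update: $w_{ij} = w_{ij} + (l) Err_j O_i$ ▷Bias update: $\theta_j = \theta_j + (l) Err_j$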
  • 133. Network Training ▷The ultimate objective of training • Obtain a set of weights that makes almost all the tuples in the training data classified correctly ▷Steps • Initialize weights with random values • Feed the input tuples into the network one by one • For each unit → Compute the net input to the unit as a linear combination of all the inputs to the unit → Compute the output value using the activation function → Compute the error → Update the weights and the bias 133
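The steps above can be sketched for a small network with one hidden layer and a single sigmoid output unit. The layer sizes, learning rate, weight range and function names are assumptions chosen for illustration, not values from the slides.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_network(n_inputs, n_hidden, seed=0):
    """Initialize weights and biases with small random values."""
    rng = random.Random(seed)
    return {
        "w_hidden": [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
                     for _ in range(n_hidden)],
        "b_hidden": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],
        "w_out": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],
        "b_out": rng.uniform(-0.5, 0.5),
    }


def train_step(net, x, target, rate=0.5):
    """Feed one tuple forward, backpropagate the error, update weights and biases."""
    # Net input and output of each hidden unit, then of the output unit.
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(net["w_hidden"], net["b_hidden"])]
    out = sigmoid(sum(w * h for w, h in zip(net["w_out"], hidden)) + net["b_out"])
    # Errors: output unit first, then hidden units (using the old output weights).
    err_out = out * (1 - out) * (target - out)
    err_hidden = [h * (1 - h) * err_out * w for h, w in zip(hidden, net["w_out"])]
    # Updates.
    for j, h in enumerate(hidden):
        net["w_out"][j] += rate * err_out * h
    net["b_out"] += rate * err_out
    for j, ws in enumerate(net["w_hidden"]):
        for i, xi in enumerate(x):
            ws[i] += rate * err_hidden[j] * xi
        net["b_hidden"][j] += rate * err_hidden[j]
    return out


if __name__ == "__main__":
    net = init_network(n_inputs=2, n_hidden=2)
    for step in range(201):
        output = train_step(net, [1.0, 0.0], 1.0)
        if step % 50 == 0:
            print(step, round(output, 4))
```

Repeating the step on the same tuple moves the output toward the target; real training cycles through all tuples for many epochs.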
  • 134. Summary of Neural Networks ▷Advantages • Prediction accuracy is generally high • Robust, works when training examples contain errors • Fast evaluation of the learned target function ▷Criticism • Long training time • Difficult to understand the learned function (weights) • Not easy to incorporate domain knowledge 134
  • 135. The k-Nearest Neighbor Algorithm ▷All instances correspond to points in the n-D space. ▷The nearest neighbors are defined in terms of Euclidean distance. ▷The target function could be discrete- or real-valued. ▷For a discrete-valued target, k-NN returns the most common value among the k training examples nearest to the query point xq. [Figure: query point xq surrounded by training examples labeled + and _]
  • 136. Discussion on the k-NN Algorithm ▷Distance-weighted nearest neighbor algorithm • Weight the contribution of each of the k neighbors according to their distance to the query point xq → Giving greater weight to closer neighbors ▷Curse of dimensionality: distance between neighbors could be dominated by irrelevant attributes. • To overcome it, eliminate the least relevant attributes.
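A minimal sketch of k-NN classification with Euclidean distance and majority voting follows. The training examples and the function name `knn_classify` are made up for illustration.

```python
import math
from collections import Counter


def knn_classify(query, examples, k=3):
    """Return the most common label among the k examples nearest to the query.

    `examples` is a list of (point, label) pairs.
    """
    neighbours = sorted(examples, key=lambda ex: math.dist(ex[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    examples = [
        ((0, 0), "-"), ((0, 1), "-"), ((1, 0), "-"),
        ((5, 5), "+"), ((5, 6), "+"), ((6, 5), "+"),
    ]
    print(knn_classify((0.5, 0.5), examples))
    print(knn_classify((5.5, 5.5), examples))
```

Sorting all examples costs O(n log n) per query, which is fine for a sketch; larger data sets usually rely on spatial indexes.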
  • 138. Definition: Frequent Itemset ▷ Itemset: A collection of one or more items • Example: {Milk, Bread, Diaper} ▷ k-itemset • An itemset that contains k items ▷ Support count (σ) • Frequency of occurrence of an itemset • E.g. σ({Milk, Bread, Diaper}) = 2 ▷ Support • Fraction of transactions that contain an itemset • E.g. s({Milk, Bread, Diaper}) = 2/5 ▷ Frequent Itemset • An itemset whose support is greater than or equal to a minsup threshold © Tan, Steinbach, Kumar Introduction to Data Mining
    Market-Basket transactions
    TID  Items
    1    Bread, Milk
    2    Bread, Diaper, Beer, Eggs
    3    Milk, Diaper, Beer, Coke
    4    Bread, Milk, Diaper, Beer
    5    Bread, Milk, Diaper, Coke
  • 139. Definition: Association Rule ▷Association Rule • An implication expression of the form X → Y, where X and Y are itemsets • Example: {Milk, Diaper} → {Beer} ▷Rule Evaluation Metrics • Support (s): fraction of transactions that contain both X and Y • Confidence (c): measures how often items in Y appear in transactions that contain X ▷Example for {Milk, Diaper} → {Beer}, using the market-basket transactions of the previous slide: • $s = \frac{\sigma(\text{Milk, Diaper, Beer})}{|T|} = \frac{2}{5} = 0.4$ • $c = \frac{\sigma(\text{Milk, Diaper, Beer})}{\sigma(\text{Milk, Diaper})} = \frac{2}{3} = 0.67$ © Tan, Steinbach, Kumar Introduction to Data Mining
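Support and confidence for the rule {Milk, Diaper} → {Beer} can be recomputed from the five market-basket transactions. The function names below are chosen for this sketch.

```python
TRANSACTIONS = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Fraction of transactions that contain the itemset."""
    return support_count(itemset, transactions) / len(transactions)


def confidence(lhs, rhs, transactions):
    """How often the items in rhs appear in transactions that contain lhs."""
    return support_count(lhs | rhs, transactions) / support_count(lhs, transactions)


if __name__ == "__main__":
    lhs, rhs = {"Milk", "Diaper"}, {"Beer"}
    print("support:", support(lhs | rhs, TRANSACTIONS))
    print("confidence:", round(confidence(lhs, rhs, TRANSACTIONS), 2))
```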
  • 140. Strong Rules & Interesting ▷Corr(A, B) = P(A ∩ B) / (P(A) P(B)), where P(A ∩ B) is the probability that a transaction contains both A and B • Corr(A, B)=1, A & B are independent • Corr(A, B)<1, occurrence of A is negatively correlated with B • Corr(A, B)>1, occurrence of A is positively correlated with B ▷E.g. of 10,000 transactions, 6,000 contain games, 7,500 contain videos, and 4,000 contain both: Corr(games, videos)=0.4/(0.6*0.75)=0.89 • In fact, games & videos are negatively associated → Purchase of one actually decreases the likelihood of purchasing the other
  • 142. Good Clustering ▷Good clustering (produces high-quality clusters) • Intra-cluster similarity is high • Inter-cluster similarity is low ▷Quality factors • Similarity measure and its implementation • Definition and representation of cluster chosen • Clustering algorithm
  • 143. Types of Clusters: Well-Separated ▷Well-Separated clusters: • A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster. 143 © Tan, Steinbach, Kumar Introduction to Data Mining 3 well-separated clusters
  • 144. Types of Clusters: Center-Based ▷Center-based • A cluster is a set of objects such that an object in a cluster is closer (more similar) to the “center” of its cluster than to the center of any other cluster • The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most “representative” point of a cluster © Tan, Steinbach, Kumar Introduction to Data Mining [Figure: 4 center-based clusters]
  • 145. Types of Clusters: Contiguity-Based ▷Contiguous cluster (Nearest neighbor or transitive) • A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster. 145 © Tan, Steinbach, Kumar Introduction to Data Mining
  • 146. Types of Clusters: Density-Based ▷Density-based • A cluster is a dense region of points, which is separated by low-density regions, from other regions of high density. • Used when the clusters are irregular or intertwined, and when noise and outliers are present. 146 © Tan, Steinbach, Kumar Introduction to Data Mining
  • 147. Types of Clusters: Objective Function ▷Clusters defined by an objective function • Finds clusters that minimize or maximize an objective function. • Naïve approach: → Enumerate all possible ways of dividing the points into clusters → Evaluate the `goodness' of each potential set of clusters → This is NP-hard • Can have global or local objectives. → Hierarchical clustering algorithms typically have local objectives → Partitional algorithms typically have global objectives © Tan, Steinbach, Kumar Introduction to Data Mining
  • 148. Partitioning Algorithms: Basic Concept ▷Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion • Global optimal: exhaustively enumerate all partitions. • Heuristic methods. → k-means: each cluster is represented by the center of the cluster → k-medoids or PAM (Partition Around Medoids) : each cluster is represented by one of the objects in the cluster. 148
  • 149. K-Means Clustering Algorithm ▷Algorithm: • Randomly initialize k cluster means • Iterate: → Assign each point to the nearest cluster mean → Recompute cluster means • Stop when clustering converges [Figure: example with K=4]
  • 150. Two different K-means Clusterings © Tan, Steinbach, Kumar Introduction to Data Mining [Figure: three scatter plots of x versus y: Original Points, Optimal Clustering, Sub-optimal Clustering]
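The loop can be sketched in plain Python. Initializing the means by sampling k distinct data points is one common choice and an assumption here, as are the function name and the iteration cap.

```python
import math
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Return (cluster means, cluster index of each point)."""
    rng = random.Random(seed)
    means = rng.sample(points, k)  # random initialization from the data
    assignment = None
    for _ in range(max_iter):
        # Assign each point to the nearest cluster mean.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, means[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Recompute each cluster mean from its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                means[c] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return means, assignment


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    centers, labels = kmeans(data, k=2)
    print(centers)
    print(labels)
```

The result depends on the initial means: on less clearly separated data, different seeds can converge to different, possibly sub-optimal, clusterings.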
  • 151. Solutions to Initial Centroids Problem ▷ Multiple runs • Helps, but probability is not on your side ▷ Sample and use hierarchical clustering to determine initial centroids ▷ Select more than k initial centroids and then select among these initial centroids • Select most widely separated ▷ Postprocessing ▷ Bisecting K-means • Not as susceptible to initialization issues 151 © Tan, Steinbach, Kumar Introduction to Data Mining
  • 152. Bisecting K-means ▷ Bisecting K-means algorithm • Variant of K-means that can produce a partitioned or a hierarchical clustering 152 © Tan, Steinbach, Kumar Introduction to Data Mining
  • 154. Bisecting K-means Example 154 K=4 Produce a hierarchical clustering based on the sequence of clusterings produced
  • 155. Limitations of K-means: Differing Sizes 155 © Tan, Steinbach, Kumar Introduction to Data Mining Original Points K-means (3 Clusters) One solution is to use many clusters. Find parts of clusters, but need to put together K-means (10 Clusters)
  • 156. Limitations of K-means: Differing Density 156 © Tan, Steinbach, Kumar Introduction to Data Mining Original Points K-means (3 Clusters) One solution is to use many clusters. Find parts of clusters, but need to put together K-means (10 Clusters)
  • 157. Limitations of K-means: Non-globular Shapes 157 © Tan, Steinbach, Kumar Introduction to Data Mining Original Points K-means (2 Clusters) One solution is to use many clusters. Find parts of clusters, but need to put together K-means (10 Clusters)
  • 158. Hierarchical Clustering ▷Produces a set of nested clusters organized as a hierarchical tree ▷Can be visualized as a dendrogram • A tree-like diagram that records the sequences of merges or splits © Tan, Steinbach, Kumar Introduction to Data Mining [Figure: nested clusters of six points and the corresponding dendrogram]
  • 159. Strengths of Hierarchical Clustering ▷Do not have to assume any particular number of clusters • Any desired number of clusters can be obtained by ‘cutting’ the dendrogram at the proper level ▷They may correspond to meaningful taxonomies • Example in biological sciences (e.g., animal kingdom, phylogeny reconstruction, …) © Tan, Steinbach, Kumar Introduction to Data Mining
  • 160. Density-Based Clustering ▷Clustering based on density (local cluster criterion), such as density-connected points ▷Each cluster has a considerably higher density of points than the area outside of the cluster
  • 161. Density-Based Clustering Methods ▷Major features: • Discover clusters of arbitrary shape • Handle noise • One scan • Need density parameters as termination condition ▷Approaches • DBSCAN (KDD’96) • OPTICS (SIGMOD’99). • DENCLUE (KDD’98) • CLIQUE (SIGMOD’98) Data Mining 161
  • 162. DBSCAN ▷Density = number of points within a specified radius (Eps) ▷A point is a core point if it has more than a specified number of points (MinPts) within Eps • These are points that are at the interior of a cluster ▷A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point ▷A noise point is any point that is not a core point or a border point. 162 © Tan, Steinbach, Kumar Introduction to Data Mining
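The three point types can be sketched as below. Following the wording of the slide, a point is a core point when strictly more than MinPts points lie within Eps; counting the point itself among its neighbors is an assumption of this sketch, and the example coordinates are made up.

```python
import math


def label_points(points, eps, min_pts):
    """Label each point as 'core', 'border' or 'noise'."""
    # Indices of all points within eps of each point (the point itself included).
    neighbours = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    core = {i for i, nb in enumerate(neighbours) if len(nb) > min_pts}
    labels = []
    for i, nb in enumerate(neighbours):
        if i in core:
            labels.append("core")
        elif any(j in core for j in nb):
            labels.append("border")  # not dense itself, but near a core point
        else:
            labels.append("noise")
    return labels


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (2, 1), (10, 10)]
    print(label_points(pts, eps=1.1, min_pts=3))
```

This only classifies the points; the full DBSCAN algorithm additionally connects core points that lie within Eps of each other into clusters and attaches border points to them.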
  • 163. DBSCAN: Core, Border, and Noise Points 163 © Tan, Steinbach, Kumar Introduction to Data Mining
  • 164. DBSCAN Examples 164 © Tan, Steinbach, Kumar Introduction to Data Mining Original Points Point types: core, border and noise Eps = 10, MinPts = 4
  • 165. When DBSCAN Works Well 165 © Tan, Steinbach, Kumar Introduction to Data Mining Original Points Clusters • Resistant to Noise • Can handle clusters of different shapes and sizes
  • 166. Recap: Data Mining Techniques 166 Predictive Tasks • Classification • Ranking • Regression • … Descriptive Tasks • Clustering • Association rules • Summarization • …
  • 167. Evaluation
  • 168. Tasks in Data Mining ▷Problems should be well defined at the beginning ▷Two categories of tasks [Fayyad et al., 1996] • Predictive Tasks: predict unknown values, e.g., potential customers • Descriptive Tasks: find patterns to describe data, e.g., friendship finding [Figure labels: VIP, Cheap, Potential]
  • 170. Metrics for Performance Evaluation ▷Focus on the predictive capability of a model ▷Confusion matrix:
                              PREDICTED CLASS
                              Class=Yes    Class=No
      ACTUAL    Class=Yes     a            b
      CLASS     Class=No      c            d
    a: TP (true positive) • b: FN (false negative) • c: FP (false positive) • d: TN (true negative) © Tan, Steinbach, Kumar, Introduction to Data Mining
  • 171. Metrics for Performance Evaluation ▷Most widely used metric:
                              PREDICTED CLASS
                              Class=Yes    Class=No
      ACTUAL    Class=Yes     a (TP)       b (FN)
      CLASS     Class=No      c (FP)       d (TN)
    $\text{Accuracy} = \dfrac{a + d}{a + b + c + d} = \dfrac{TP + TN}{TP + TN + FP + FN}$
    © Tan, Steinbach, Kumar, Introduction to Data Mining
  • 172. Limitation of Accuracy ▷ Consider a 2-class problem • Number of Class 0 examples = 9990 • Number of Class 1 examples = 10 ▷ If a model predicts everything to be class 0, its accuracy is 9990/10000 = 99.9% ▷ Accuracy is misleading here because the model does not detect a single class 1 example © Tan, Steinbach, Kumar, Introduction to Data Mining
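The arithmetic on this slide can be checked with a short sketch. It assumes class 1 is treated as the positive class, so the "predict everything as class 0" model has no true positives; the function name `accuracy` is a hypothetical helper, not something taken from the slides.

```python
def accuracy(tp, fn, fp, tn):
    """Accuracy = (TP + TN) / (TP + FN + FP + TN)."""
    return (tp + tn) / (tp + fn + fp + tn)

# Model that predicts class 0 for all 10000 examples (class 1 is "positive"):
# all 10 class 1 examples are missed, all 9990 class 0 examples are correct.
tp, fn, fp, tn = 0, 10, 0, 9990

acc = accuracy(tp, fn, fp, tn)
detected = tp / (tp + fn)  # fraction of class 1 examples the model finds

print(f"accuracy = {acc:.3f}")
print(f"fraction of class 1 detected = {detected:.1f}")
```

The accuracy is high only because class 0 dominates the test set; the fraction of class 1 examples detected is zero.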
  • 173. Cost-Sensitive Measures
    $\text{Precision } (p) = \dfrac{a}{a + c}$
    $\text{Recall } (r) = \dfrac{a}{a + b}$
    $\text{F-measure } (F) = \dfrac{2rp}{r + p} = \dfrac{2a}{2a + b + c}$
    ▷Precision is biased towards C(Yes|Yes) & C(Yes|No) ▷Recall is biased towards C(Yes|Yes) & C(No|Yes) ▷F-measure is biased towards all except C(No|No)
    $\text{Weighted Accuracy} = \dfrac{w_1 a + w_4 d}{w_1 a + w_2 b + w_3 c + w_4 d}$
    © Tan, Steinbach, Kumar, Introduction to Data Mining
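A minimal sketch of these measures, written directly in terms of the confusion-matrix counts a (TP), b (FN), c (FP), and d (TN). The function names and the example counts are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def precision(a, c):
    """p = a / (a + c): fraction of predicted positives that are correct."""
    return a / (a + c)

def recall(a, b):
    """r = a / (a + b): fraction of actual positives that are found."""
    return a / (a + b)

def f_measure(a, b, c):
    """F = 2rp / (r + p), which simplifies to 2a / (2a + b + c)."""
    return 2 * a / (2 * a + b + c)

def weighted_accuracy(a, b, c, d, w1=1.0, w2=1.0, w3=1.0, w4=1.0):
    """(w1*a + w4*d) / (w1*a + w2*b + w3*c + w4*d); all weights 1 gives plain accuracy."""
    return (w1 * a + w4 * d) / (w1 * a + w2 * b + w3 * c + w4 * d)

# Hypothetical counts: a = TP, b = FN, c = FP, d = TN.
a, b, c, d = 40, 10, 20, 30

p = precision(a, c)
r = recall(a, b)
f = f_measure(a, b, c)
print(f"precision = {p:.3f}, recall = {r:.3f}, F-measure = {f:.3f}")
print(f"weighted accuracy (unit weights) = {weighted_accuracy(a, b, c, d):.3f}")
```

Note that none of precision, recall, or F-measure uses d, which matches the slide's remark that the F-measure ignores C(No|No).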
  • 174. Test of Significance ▷ Given two models: • Model M1: accuracy = 85%, tested on 30 instances • Model M2: accuracy = 75%, tested on 5000 instances ▷ Can we say M1 is better than M2? • How much confidence can we place on accuracy of M1 and M2? • Can the difference in performance measure be explained as a result of random fluctuations in the test set? 174 © Tan, Steinbach, Kumar Introduction to Data Mining
  • 175. Confidence Interval for Accuracy ▷ Prediction can be regarded as a Bernoulli trial • A Bernoulli trial has 2 possible outcomes → Possible outcomes for prediction: correct or wrong • A collection of N Bernoulli trials has a binomial distribution: → x ~ Bin(N, p), where x is the number of correct predictions → e.g., toss a fair coin 50 times; how many heads would turn up? Expected number of heads = N × p = 50 × 0.5 = 25 ▷ Given x (# of correct predictions), or equivalently the accuracy (ac) = x/N, and N (# of test instances), can we predict p (the true accuracy of the model)? © Tan, Steinbach, Kumar, Introduction to Data Mining
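  • 176. Confidence Interval for Accuracy ▷For large test sets (N > 30), • ac (the observed accuracy) is approximately normally distributed with mean p and variance p(1−p)/N
    $P\left( Z_{\alpha/2} < \dfrac{ac - p}{\sqrt{p(1-p)/N}} < Z_{1-\alpha/2} \right) = 1 - \alpha$
    [Figure: standard normal curve; the area between $Z_{\alpha/2}$ and $Z_{1-\alpha/2}$ is $1 - \alpha$]
    • Confidence interval for p:
    $p = \dfrac{2 N \cdot ac + Z_{\alpha/2}^2 \pm Z_{\alpha/2} \sqrt{Z_{\alpha/2}^2 + 4 N \cdot ac - 4 N \cdot ac^2}}{2 \left( N + Z_{\alpha/2}^2 \right)}$
    © Tan, Steinbach, Kumar, Introduction to Data Mining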
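The interval for p can be evaluated numerically. The sketch below assumes the closed form with the $Z_{\alpha/2}$ factor in front of the square root, which is obtained by solving the normal approximation for p; the variable `acc` stands for the slide's ac, and the function name and the example model (80% accuracy on 100 instances) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def accuracy_confidence_interval(acc, n, confidence=0.95):
    """Interval for the true accuracy p, given observed accuracy acc on n test instances."""
    # Two-sided critical value of the standard normal, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    center = 2 * n * acc + z ** 2
    spread = z * sqrt(z ** 2 + 4 * n * acc - 4 * n * acc ** 2)
    denom = 2 * (n + z ** 2)
    return (center - spread) / denom, (center + spread) / denom

# Hypothetical model: 80% observed accuracy on 100 test instances.
low, high = accuracy_confidence_interval(acc=0.8, n=100)
print(f"95% interval for p: [{low:.3f}, {high:.3f}]")

# The same observed accuracy on more data gives a narrower interval.
low_big, high_big = accuracy_confidence_interval(acc=0.8, n=5000)
print(f"with N = 5000:      [{low_big:.3f}, {high_big:.3f}]")
```

For the first case the interval runs from roughly 0.71 to 0.87, which shows how much uncertainty remains around an accuracy measured on only 100 instances.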
  • 177. Example: Comparing Performance of 2 Models ▷Given: M1: n1 = 30, e1 = 0.15; M2: n2 = 5000, e2 = 0.25 • d = |e2 − e1| = 0.1 (2-sided test) • Estimated variance of d:
    $\hat{\sigma}_d^2 = \dfrac{0.15(1 - 0.15)}{30} + \dfrac{0.25(1 - 0.25)}{5000} = 0.0043$
    ▷At 95% confidence level, $Z_{\alpha/2} = 1.96$
    $d_t = 0.100 \pm 1.96 \times \sqrt{0.0043} = 0.100 \pm 0.128$
    ▷The interval contains 0: the difference may not be statistically significant © Tan, Steinbach, Kumar, Introduction to Data Mining
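The comparison of M1 and M2 can be reproduced with a short sketch. It assumes the two test sets are independent, so the variances of the two error rates add; the function name `difference_interval` is a hypothetical helper.

```python
from math import sqrt

def difference_interval(e1, n1, e2, n2, z=1.96):
    """Confidence interval for the true difference between two error rates."""
    d = abs(e2 - e1)
    # Variance of the difference: sum of the two binomial-proportion variances.
    variance = e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2
    half_width = z * sqrt(variance)
    return d - half_width, d + half_width

# Values from the slide: M1 has error 0.15 on 30 instances, M2 has error 0.25 on 5000.
low, high = difference_interval(e1=0.15, n1=30, e2=0.25, n2=5000)
significant = not (low <= 0 <= high)

print(f"interval for the difference: [{low:.3f}, {high:.3f}]")
print("statistically significant at 95%:", significant)
```

Almost all of the variance comes from M1's tiny test set, which is why the interval is wide enough to include 0.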
  • 179. Computing Interestingness Measure ▷ Given a rule X → Y, the information needed to compute rule interestingness can be obtained from a contingency table
    Contingency table for X → Y (¬ denotes absence of the itemset):
               Y      ¬Y
      X        f11    f10    f1+
      ¬X       f01    f00    f0+
               f+1    f+0    |T|
    f11: support of X and Y • f10: support of X and ¬Y • f01: support of ¬X and Y • f00: support of ¬X and ¬Y
    ▷ Used to define various measures: support, confidence, lift, Gini, J-measure, etc. © Tan, Steinbach, Kumar, Introduction to Data Mining
  • 180. Drawback of Confidence
               Coffee   ¬Coffee
      Tea      15       5          20
      ¬Tea     75       5          80
               90       10         100
    Association rule: Tea → Coffee • Confidence = P(Coffee|Tea) = 0.75, but P(Coffee) = 0.9 ▷Although confidence is high, the rule is misleading: P(Coffee|¬Tea) = 0.9375, so not drinking tea makes coffee even more likely © Tan, Steinbach, Kumar, Introduction to Data Mining
  • 181. Statistical Independence ▷ Population of 1000 students • 600 students know how to swim (S) • 700 students know how to bike (B) • 420 students know how to swim and bike (S,B) • P(S∧B) = 420/1000 = 0.42 • P(S) × P(B) = 0.6 × 0.7 = 0.42 • P(S∧B) = P(S) × P(B) => Statistical independence • P(S∧B) > P(S) × P(B) => Positively correlated • P(S∧B) < P(S) × P(B) => Negatively correlated © Tan, Steinbach, Kumar, Introduction to Data Mining
  • 182. Statistical-based Measures ▷Measures that take into account statistical dependence
    $\text{Lift} = \dfrac{P(Y \mid X)}{P(Y)}$
    $\text{Interest} = \dfrac{P(X, Y)}{P(X)\,P(Y)}$
    $PS = P(X, Y) - P(X)\,P(Y)$
    $\phi\text{-coefficient} = \dfrac{P(X, Y) - P(X)\,P(Y)}{\sqrt{P(X)\,[1 - P(X)]\,P(Y)\,[1 - P(Y)]}}$
    © Tan, Steinbach, Kumar, Introduction to Data Mining
  • 183. Example: Interest Factor
               Coffee   ¬Coffee
      Tea      15       5          20
      ¬Tea     75       5          80
               90       10         100
    Association rule: Tea → Coffee • Support = P(Coffee, Tea) = 0.15 • P(Coffee) = 0.9, P(Tea) = 0.2 ▷Interest = 0.15/(0.9 × 0.2) = 0.83 (< 1, therefore Tea and Coffee are negatively associated) © Tan, Steinbach, Kumar, Introduction to Data Mining
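The statistical measures can be computed directly from the four contingency counts f11, f10, f01, and f00. The sketch below is illustrative only; the function name `association_measures` is hypothetical. It is applied to the tea and coffee table and to the swim and bike population, whose counts 420, 180, 280, and 120 follow from the 1000 students described earlier (600 swimmers, 700 bikers, 420 who do both).

```python
from math import sqrt

def association_measures(f11, f10, f01, f00):
    """Confidence, interest, PS, and phi-coefficient for a rule X -> Y."""
    n = f11 + f10 + f01 + f00
    p_xy = f11 / n          # P(X, Y)
    p_x = (f11 + f10) / n   # P(X)
    p_y = (f11 + f01) / n   # P(Y)
    return {
        "confidence": p_xy / p_x,
        "interest": p_xy / (p_x * p_y),
        "ps": p_xy - p_x * p_y,
        "phi": (p_xy - p_x * p_y) / sqrt(p_x * (1 - p_x) * p_y * (1 - p_y)),
    }

# Tea -> Coffee: X = Tea, Y = Coffee.
tea_coffee = association_measures(f11=15, f10=5, f01=75, f00=5)
# Swim -> Bike: X = Swim, Y = Bike.
swim_bike = association_measures(f11=420, f10=180, f01=280, f00=120)

for name, measures in [("tea/coffee", tea_coffee), ("swim/bike", swim_bike)]:
    print(name, {key: round(value, 4) for key, value in measures.items()})
```

Lift, defined as P(Y|X)/P(Y), is algebraically the same quantity as interest, so it is not computed separately. For tea and coffee the confidence is high while interest falls below 1 and PS and φ are negative; for swim and bike interest is 1 and the other two are 0 up to rounding, which is the independence case.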
  • 184. Subjective Interestingness Measure ▷Objective measure: • Rank patterns based on statistics computed from data • e.g., 21 measures of association (support, confidence, Laplace, Gini, mutual information, Jaccard, etc.) ▷Subjective measure: • Rank patterns according to the user’s interpretation → A pattern is subjectively interesting if it contradicts the expectation of a user (Silberschatz & Tuzhilin) → A pattern is subjectively interesting if it is actionable (Silberschatz & Tuzhilin) © Tan, Steinbach, Kumar, Introduction to Data Mining
  • 185. Interestingness via Unexpectedness ▷Need to model expectation of users (domain knowledge) ▷Need to combine expectation of users with evidence from data (i.e., extracted patterns)
    [Figure legend: + pattern expected to be frequent; − pattern expected to be infrequent; patterns found to be frequent vs. found to be infrequent. Expected patterns: + found frequent, − found infrequent. Unexpected patterns: − found frequent, + found infrequent.]
    © Tan, Steinbach, Kumar, Introduction to Data Mining
  • 186. Different Proposed Measures ▷Some measures are good for certain applications, but not for others ▷What criteria should we use to determine whether a measure is good or bad? © Tan, Steinbach, Kumar, Introduction to Data Mining
  • 187. Comparing Different Measures ▷10 examples of contingency tables:
      Example   f11     f10     f01     f00
      E1        8123    83      424     1370
      E2        8330    2       622     1046
      E3        9481    94      127     298
      E4        3954    3080    5       2961
      E5        2886    1363    1320    4431
      E6        1500    2000    500     6000
      E7        4000    2000    1000    3000
      E8        4000    2000    2000    2000
      E9        1720    7121    5       1154
      E10       61      2483    4       7452
    ▷Rankings of contingency tables using various measures: [Figure: ranking table] © Tan, Steinbach, Kumar, Introduction to Data Mining
  • 188. Property under Variable Permutation ▷Does M(A,B) = M(B,A)?
               B    ¬B                  A    ¬A
      A        p    q           B       p    r
      ¬A       r    s           ¬B      q    s
    ▷Symmetric measures: support, lift, collective strength, cosine, Jaccard, etc. ▷Asymmetric measures: confidence, conviction, Laplace, J-measure, etc. © Tan, Steinbach, Kumar, Introduction to Data Mining
  • 189. Cluster Validity ▷ For supervised classification we have a variety of measures to evaluate how good our model is • Accuracy, precision, recall ▷ For cluster analysis, the analogous question is how to evaluate the “goodness” of the resulting clusters? ▷ But “clusters are in the eye of the beholder”! ▷ Then why do we want to evaluate them? • To avoid finding patterns in noise • To compare clustering algorithms • To compare two sets of clusters • To compare two clusters 189 © Tan, Steinbach, Kumar Introduction to Data Mining
  • 190. Clusters Found in Random Data [Figure: four scatter plots of y against x on the unit square: Random Points, K-means, DBSCAN, Complete Link] © Tan, Steinbach, Kumar, Introduction to Data Mining
  • 191. Measures of Cluster Validity ▷ Numerical measures that are applied to judge various aspects of cluster validity, are classified into the following three types • External Index: Used to measure the extent to which cluster labels match externally supplied class labels, e.g., Entropy • Internal Index: Used to measure the goodness of a clustering structure without respect to external information, e.g., Sum of Squared Error (SSE) • Relative Index: Used to compare two different clusters ▷ Sometimes these are referred to as criteria instead of indices • However, sometimes criterion is the general strategy and index is the numerical measure that implements the criterion. 191 © Tan, Steinbach, Kumar Introduction to Data Mining
  • 192. Measuring Cluster Validity Via Correlation ▷ Two matrices • Proximity Matrix • “Incidence” Matrix → One row and one column for each data point → An entry is 1 if the associated pair of points belongs to the same cluster → An entry is 0 if the associated pair of points belongs to different clusters ▷ Compute the correlation between the two matrices • Since the matrices are symmetric, only the correlation between n(n-1) / 2 entries needs to be calculated. ▷ High correlation indicates that points that belong to the same cluster are close to each other. ▷ Not a good measure for some density or contiguity based clusters. © Tan, Steinbach, Kumar, Introduction to Data Mining
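The correlation between the incidence and proximity matrices can be sketched as follows. This is a minimal illustration under stated assumptions: proximity is taken to be Euclidean distance, so a good clustering yields a strongly negative correlation (same-cluster pairs have incidence 1 and small distance); the function names and the toy data are hypothetical.

```python
from itertools import combinations
from math import dist, sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def incidence_proximity_correlation(points, labels):
    """Correlate the upper triangles of the incidence and distance matrices."""
    # Only the n(n-1)/2 entries above the diagonal are needed (symmetric matrices).
    pairs = list(combinations(range(len(points)), 2))
    incidence = [1 if labels[i] == labels[j] else 0 for i, j in pairs]
    proximity = [dist(points[i], points[j]) for i, j in pairs]
    return pearson(incidence, proximity)

# Hypothetical data: two tight, well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good_labels = [0, 0, 0, 1, 1, 1]
mixed_labels = [0, 1, 0, 1, 0, 1]  # a labeling that ignores the geometry

corr_good = incidence_proximity_correlation(points, good_labels)
corr_mixed = incidence_proximity_correlation(points, mixed_labels)
print(f"good clustering:  corr = {corr_good:.4f}")
print(f"mixed clustering: corr = {corr_mixed:.4f}")
```

With a similarity matrix instead of distances the sign flips, so what matters is the magnitude and the direction expected for the chosen proximity.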
  • 193. Measuring Cluster Validity Via Correlation ▷Correlation of incidence and proximity matrices for the K-means clustering of the following two data sets [Figure: two scatter plots of y against x on the unit square; Corr = -0.9235 (left) and Corr = -0.5810 (right)] © Tan, Steinbach, Kumar, Introduction to Data Mining
  • 194. Internal Measures: SSE ▷ Clusters in more complicated figures aren’t well separated ▷ Internal Index: Used to measure the goodness of a clustering structure without respect to external information • Sum of Squared Error (SSE) ▷ SSE is good for comparing two clusters ▷ Can also be used to estimate the number of clusters [Figure: example data set (left) and SSE plotted against the number of clusters K (right)] © Tan, Steinbach, Kumar, Introduction to Data Mining
  • 195. Internal Measures: Cohesion and Separation ▷ Cluster Cohesion: Measures how closely related the objects in a cluster are • Cohesion is measured by the within-cluster sum of squares (SSE)
    $WSS = \sum_i \sum_{x \in C_i} (x - m_i)^2$
    ▷ Cluster Separation: Measures how distinct or well separated a cluster is from other clusters • Separation is measured by the between-cluster sum of squares
    $BSS = \sum_i |C_i| \, (m - m_i)^2$
    • Where |Ci| is the size of cluster i, m_i is the centroid of cluster i, and m is the overall mean of the data © Tan, Steinbach, Kumar, Introduction to Data Mining
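A small sketch of the two sums of squares, assuming m_i is the centroid of cluster i and m is the mean of all points. The helper names and the one-dimensional data set (1, 2, 4, 5) are hypothetical, chosen so the numbers can be checked by hand.

```python
def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def wss_bss(clusters):
    """Return (WSS, BSS) for a clustering given as a list of lists of points."""
    all_points = [p for cluster in clusters for p in cluster]
    overall_mean = centroid(all_points)  # m
    wss = 0.0
    bss = 0.0
    for cluster in clusters:
        m_i = centroid(cluster)
        wss += sum(squared_distance(x, m_i) for x in cluster)       # cohesion
        bss += len(cluster) * squared_distance(overall_mean, m_i)   # separation
    return wss, bss

# The same four 1-D points, clustered two ways.
one_cluster = wss_bss([[(1,), (2,), (4,), (5,)]])
two_clusters = wss_bss([[(1,), (2,)], [(4,), (5,)]])

print("K = 1: WSS = %.1f, BSS = %.1f" % one_cluster)
print("K = 2: WSS = %.1f, BSS = %.1f" % two_clusters)
```

In both cases WSS + BSS equals 10, the total sum of squares about the overall mean, so lowering WSS (tighter clusters) necessarily raises BSS (better separation) for a fixed data set.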
  • 196. Final Comment on Cluster Validity “The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage.” Algorithms for Clustering Data, Jain and Dubes 196 © Tan, Steinbach, Kumar Introduction to Data Mining
  • 197. Case Studies Yi-Shin Chen Institute of Information Systems and Applications Department of Computer Science National Tsing Hua University Many slides provided by Tan, Steinbach, Kumar for book “Introduction to Data Mining” are adapted in this presentation
  • 198. Case: Mining Reddit Data Please check the data set during the breaks 198
  • 200. Reddit: The Front Page of the Internet 50k+ on this set
  • 201. Subreddit Categories ▷Reddit’s structure may already provide a baseline similarity