SlideShare a Scribd company logo
1 of 59
Download to read offline
Prof. Pier Luca Lanzi
Data Mining
Data Mining andText Mining (UIC 583 @ Politecnico di Milano)
Prof. Pier Luca Lanzi
Motivations
2
Prof. Pier Luca Lanzi
Why Data Mining?
“Necessity is the mother of invention”
Explosive Growth of Data
Pressing need for the automated analysis of massive data
Emerged in the late 1980s
Major developments in the mid 1990s
Prof. Pier Luca Lanzihttp://www.iflscience.com/technology/amount-data-internet-generates-every-minute-crazy/
Prof. Pier Luca Lanzi
Evolution of Technology
• 1960s: data collection, database creation, & network DBMS
• 1970s: relational data model, relational DBMS implementation
• 1980s: RDBMS, advanced data models (extended-relational, OO,
deductive, etc.); application-oriented DBMS
(spatial, scientific, engineering, etc.)
• 1990s: data mining, data warehousing, multimedia databases,
and Web databases
• 2000s: stream data management and mining, web technology (XML,
data integration), global information systems
• 2010s: social networks, NoSQL, unstructured data, etc.
5
Prof. Pier Luca Lanzi
http://launchhack.com/content/25-cartoons-give-current-big-data-hype-perspective/
Prof. Pier Luca Lanzi
What is the Commercial Viewpoint?
• Huge amounts of data is being collected
and warehoused everyday
§ Web data, e-commerce
§ Purchases at department stores
§ Bank/Credit Card transactions
• Computers have become
cheaper and more powerful
• Competitive pressure is strong
to provide better, customized services
(e.g., CRM or Customer Relationship Management)
• Poor data across businesses and the government costs
huge amount of money
7
Prof. Pier Luca Lanzi
What is the Scientific Viewpoint?
• Data collected and stored at
enormous speeds (GB/hour)
§remote sensors on a satellite
§telescopes scanning the skies
§microarrays generating gene
expression data
§scientific simulations
generating terabytes of data
• Traditional techniques infeasible for raw data
• Data mining may help scientists
§in classifying and segmenting data
§in Hypothesis Formation
8
Prof. Pier Luca Lanzi
Examples
• Customer attrition
§ Given customer information for the past months
§ Predict who is likely to attrite next month,
or estimate customer value
• Credit assessment
§ Given a loan application
§ Predict whether the bank should approve the loan
• Customer segmentation
§ Given several information about the customers
§ Identify interesting groups among them
• Community detection
§ Given a social network of users
§ Identify community based on their connections
(friendship relation, discussions, etc.)
9
Prof. Pier Luca Lanzi
Big Data?
Prof. Pier Luca Lanzi
http://simplystatistics.org/2014/05/07/why-big-data-is-in-trouble-they-forgot-about-applied-statistics/
http://bits.blogs.nytimes.com/2014/03/28/google-flu-trends-the-limits-of-big-data/?_r=0
http://gking.harvard.edu/files/gking/files/0314policyforumff.pdf
Prof. Pier Luca Lanzi
Prof. Pier Luca Lanzi
Data Science?
Prof. Pier Luca Lanzihttp://wikibon.org/blog/role-of-the-data-scientist/
Prof. Pier Luca Lanzi
https://s3.amazonaws.com/aws.drewconway.com/viz/venn_diagram/data_science.html
Prof. Pier Luca Lanzi
Data Mining?
Prof. Pier Luca Lanzi
Data Mining
The non-trivial process of identifying
(1) valid, (2) novel, (3) potentially useful,
and (4) understandable patterns in data.
Prof. Pier Luca Lanzi
An Example Using Contact Lens Data 18
NoneReducedYesHypermetropePre-presbyopic
NoneNormalYesHypermetropePre-presbyopic
NoneReducedNoMyopePresbyopic
NoneNormalNoMyopePresbyopic
NoneReducedYesMyopePresbyopic
HardNormalYesMyopePresbyopic
NoneReducedNoHypermetropePresbyopic
SoftNormalNoHypermetropePresbyopic
NoneReducedYesHypermetropePresbyopic
NoneNormalYesHypermetropePresbyopic
SoftNormalNoHypermetropePre-presbyopic
NoneReducedNoHypermetropePre-presbyopic
HardNormalYesMyopePre-presbyopic
NoneReducedYesMyopePre-presbyopic
SoftNormalNoMyopePre-presbyopic
NoneReducedNoMyopePre-presbyopic
hardNormalYesHypermetropeYoung
NoneReducedYesHypermetropeYoung
SoftNormalNoHypermetropeYoung
NoneReducedNoHypermetropeYoung
HardNormalYesMyopeYoung
NoneReducedYesMyopeYoung
SoftNormalNoMyopeYoung
NoneReducedNoMyopeYoung
Recommended lensesTear production rateAstigmatismSpectacle prescriptionAge
Prof. Pier Luca Lanzi
An example of possible pattern
if astigmatism = yes
and tear production rate = normal
and spectacle prescription = myope
then recommendation = hard
Prof. Pier Luca Lanzi
is it a “good” pattern?
Prof. Pier Luca Lanzi
How Can We Evaluate a Pattern?
• Is it valid?
§The pattern has to be valid with respect
to a certainty level (rule true for the 86%)
• Is it novel?
§Is the relation between astigmatism and
hard contact lenses already well-known?
• Is it useful? Is it actionable?
§The pattern should provide information
useful to the bank for assessing credit risk
• Is it understandable?
22
Prof. Pier Luca Lanzi
but there is another important question …
was it “worth” finding it? (I mean $ worth)
how much did the search cost?
how much value did it bring
23
Prof. Pier Luca Lanzi
Example of Cost-Based
Model Evaluation
• A bank has a predictive model that can identify risky loans with an
accuracy of 72%
• Your company develops a model that can improve their
performance by 3% reaching an accuracy of 75%
• Is this a good result?
• We might simply evaluate the 3% of improvement but giving out
loans has a cost that depends on the type of error we make
24
Prof. Pier Luca Lanzi
Example of Cost-Based
Model Evaluation
• Predictive accuracy in this case is defined as
• True Positives, true negatives
§ Safe and risky loans predicted as safe and risky respectively
• False Positive Errors
§ We accept a risky loan which we predicted was safe
§ We are likely not to get the money back
(let’s say on average 30000 euros)
• False Negative Errors
§ We don’t give a safe loan since we predicted it was risky
§ We will loose the interest money
(let’s say on average 10000 euros
25
Prof. Pier Luca Lanzi
Example of Cost-Based
Model Evaluation
• Original Model
§1576 false positives and 1224 false negatives
§Total cost is 59525443
• Our Model
§1407 false positives and 1093 false negatives
§Total cost is 53147717 (more than 6 millions saved)
• What if we can change the way our model makes mistakes?
§1093 false positives and 1407 false negatives
§Total cost becomes 46852283 (more than 12 millions saved)
26
Prof. Pier Luca Lanzi
What is the General Idea?
• Build computer programs that navigate through databases
automatically, seeking regularities or patterns
• There will be problems
§Most patterns are banal and uninteresting
§Most patterns are spurious, inexact, or contingent on
accidental coincidences in the particular dataset used
§Real data is imperfect: Some parts will be garbled,
and some will be missing
• Algorithms need to be robust enough to cope with imperfect
data and to extract regularities that are inexact but useful
28
Prof. Pier Luca Lanzi
Pitfalls
29
Prof. Pier Luca Lanzi
http://launchhack.com/content/25-cartoons-give-current-big-data-hype-perspective/
Prof. Pier Luca Lanzi
http://www.tylervigen.com
Prof. Pier Luca Lanzi
http://www.tylervigen.com
Prof. Pier Luca Lanzi
http://www.tylervigen.com
Prof. Pier Luca Lanzi
http://www.tylervigen.com
Prof. Pier Luca Lanzi
http://www.tylervigen.com
Prof. Pier Luca Lanzi
http://www.tylervigen.com
Prof. Pier Luca Lanzi
Statistics, Machine Learning,
and Data Mining
• Statistics is more theory-based, focuses on testing hypotheses
• Machine learning is more based on heuristic, focuses on building
program that learns, more general than Data Mining
• Data Mining
§Integrates theory and heuristics
§Focus on the entire process of discovery, including
data cleaning, learning, integration and visualization
Distinctions are blurred!
37
Prof. Pier Luca Lanzi
http://www.imdb.com/title/tt1178663/
Prof. Pier Luca Lanzi
Why Is It Different?
• Tremendous amount of data
§ High scalability to handle terabytes of data
• High-dimensionality of data
§ Micro-array may have tens of thousands of dimensions
• High complexity of data
§ Data streams and sensor data
§ Time-series data, temporal data, sequence data
§ Structure data, graphs, social networks and multi-linked data
§ Heterogeneous databases and legacy databases
§ Spatial, spatiotemporal, multimedia, text and Web data
§ Software programs, scientific simulations
• New and sophisticated applications
39
Prof. Pier Luca Lanzi
Knowledge Discovery Process 40
selection
cleaning
transformation
mining
evaluation
Prof. Pier Luca Lanzi
Knowledge Discovery Process
What are the main steps?
• Learning the application domain to extract
relevant prior knowledge and goals
• Data selection
• Data cleaning
• Data reduction and transformation
• Mining
§ Select the mining approach: classification,
regression, association, clustering, etc.
§ Choosing the mining algorithm(s)
§ Perform mining: search for patterns of interest
• Pattern evaluation and knowledge presentation
§ visualization, transformation,
removing redundant patterns, etc.
• Use of discovered knowledge
41
Prof. Pier Luca Lanzi
What are the typical
Data Mining tasks?
42
Prof. Pier Luca Lanzi
What are the Major Data Mining
Tasks?
• Classification: predicting an item class
• Clustering: finding clusters in data
• Associations: frequent occurring events…
• Visualization: to facilitate human discovery
• Summarization: describing a group
• Deviation Detection: finding changes
• Estimation: predicting a continuous value
• Link Analysis: finding relationship
• “Sentiment” Analysis, “Opinion” Mining
• But many appears as time goes by, opinion mining,
sentiment mining
43
Prof. Pier Luca Lanzi
Prediction & Regression
Prof. Pier Luca Lanzi
Input variable: LSTAT - % lower status of the population
Output variable: MEDV - Median value of owner-occupied homes in $1000's
Prof. Pier Luca Lanzi
Input variable: LSTAT - % lower status of the population
Output variable: MEDV - Median value of owner-occupied homes in $1000's
Prof. Pier Luca Lanzi
Classification
Prof. Pier Luca Lanzi
Prof. Pier Luca Lanzi
Clustering
Prof. Pier Luca Lanzi
Prof. Pier Luca Lanzi
Prof. Pier Luca Lanzi
Associations
Prof. Pier Luca Lanzi
Data Mining Tasks: Associations
Bread
Peanuts
Milk
Fruit
Jam
Bread
Jam
Soda
Chips
Milk
Fruit
Steak
Jam
Soda
Chips
Bread
Jam
Soda
Chips
Milk
Bread
Fruit
Soda
Chips
Milk
Jam
Soda
Peanuts
Milk
Fruit
Fruit
Soda
Peanuts
Milk
Fruit
Peanuts
Cheese
Yogurt
Is there something interesting to be noted?
53
Prof. Pier Luca Lanzi
Data Mining Tasks: Associations
• Finds interesting associations and/or correlation relationships
among large set of data items.
• E.g., 98% of people who purchase tires and auto accessories also
get automotive services done
54
Prof. Pier Luca Lanzi
Data Mining Tasks: Other Tasks
• Outlier analysis
§ Outlier: a data object that does not comply
with the general behavior of the data
§ It can be considered as noise or exception
but is quite useful in fraud detection, rare events analysis
• Trend and evolution analysis
§ Trend and deviation: regression analysis
§ Sequential pattern mining, periodicity analysis
§ Similarity-based analysis
• Text Mining, Topic Modeling, Graph Mining, Data Streams
• Sentiment Analysis, Opinion Mining, etc.
• Other pattern-directed or statistical analyses
55
Prof. Pier Luca Lanzi
Relevant Issues
56
Prof. Pier Luca Lanzi
Are all the “Discovered” Patterns
Interesting?
• Data Mining may generate thousands of patterns,
not all of them are interesting.
• Interestingness measures: a pattern is interesting if it is easily understood by
humans, valid on new or test data with some degree of certainty, potentially
useful, novel, or validates some hypothesis that a user seeks to confirm
• Objective vs. subjective interestingness measures:
• Objective: based on statistics and structures of patterns, e.g., support,
confidence, etc.
• Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty, etc.
57
Prof. Pier Luca Lanzi
Can We Find All and Only
Interesting Patterns?
• Completeness
§Find all the interesting patterns
§Can a data mining system find all
the interesting patterns?
§Association vs. classification vs. clustering
• Optimization
§Search for only interesting patterns:
§Can a data mining system find only
the interesting patterns?
§Two approaches: (1) first general all the patterns and then
filter out the uninteresting ones; (2) generate only the
interesting patterns—mining query optimization
58
Prof. Pier Luca Lanzi
What About Background Knowledge?
• A typical kind of background knowledge: Concept hierarchies
• Schema hierarchy
§ E.g., Street < City < ProvinceOrState < Country
• Set-grouping hierarchy
§ E.g., {20-39} = young, {40-59} = middle_aged
• Operation-derived hierarchy
§ email address: hagonzal@cs.uiuc.edu
§ login-name < department < university < country
• Rule-based hierarchy
§ LowProfitMargin (X) <= Price(X, P1) and Cost (X, P2)
and (P1 - P2) < $50
59
Prof. Pier Luca Lanzi
https://www.youtube.com/watch?v=CO2mGny6fFs
Prof. Pier Luca Lanzi
Assignments
• Mining of Massive Datasets (Chapter 1)
61

More Related Content

What's hot

DMTM Lecture 15 Clustering evaluation
DMTM Lecture 15 Clustering evaluationDMTM Lecture 15 Clustering evaluation
DMTM Lecture 15 Clustering evaluationPier Luca Lanzi
 
DMTM Lecture 03 Regression
DMTM Lecture 03 RegressionDMTM Lecture 03 Regression
DMTM Lecture 03 RegressionPier Luca Lanzi
 
DMTM Lecture 04 Classification
DMTM Lecture 04 ClassificationDMTM Lecture 04 Classification
DMTM Lecture 04 ClassificationPier Luca Lanzi
 
DMTM 2015 - 13 Naive bayes, Nearest Neighbours and Other Methods
DMTM 2015 - 13 Naive bayes, Nearest Neighbours and Other MethodsDMTM 2015 - 13 Naive bayes, Nearest Neighbours and Other Methods
DMTM 2015 - 13 Naive bayes, Nearest Neighbours and Other MethodsPier Luca Lanzi
 
DMTM Lecture 09 Other classificationmethods
DMTM Lecture 09 Other classificationmethodsDMTM Lecture 09 Other classificationmethods
DMTM Lecture 09 Other classificationmethodsPier Luca Lanzi
 
DMTM Lecture 06 Classification evaluation
DMTM Lecture 06 Classification evaluationDMTM Lecture 06 Classification evaluation
DMTM Lecture 06 Classification evaluationPier Luca Lanzi
 
DMTM Lecture 10 Classification ensembles
DMTM Lecture 10 Classification ensemblesDMTM Lecture 10 Classification ensembles
DMTM Lecture 10 Classification ensemblesPier Luca Lanzi
 
DMTM 2015 - 06 Introduction to Clustering
DMTM 2015 - 06 Introduction to ClusteringDMTM 2015 - 06 Introduction to Clustering
DMTM 2015 - 06 Introduction to ClusteringPier Luca Lanzi
 
DMTM 2015 - 15 Classification Ensembles
DMTM 2015 - 15 Classification EnsemblesDMTM 2015 - 15 Classification Ensembles
DMTM 2015 - 15 Classification EnsemblesPier Luca Lanzi
 
DMTM Lecture 07 Decision trees
DMTM Lecture 07 Decision treesDMTM Lecture 07 Decision trees
DMTM Lecture 07 Decision treesPier Luca Lanzi
 
SSSW 2013 - Feeding Recommender Systems with Linked Open Data
SSSW 2013 - Feeding Recommender Systems with Linked Open DataSSSW 2013 - Feeding Recommender Systems with Linked Open Data
SSSW 2013 - Feeding Recommender Systems with Linked Open DataPolytechnic University of Bari
 
Tutorial - Recommender systems meet linked open data - ICWE 2016 - Lugano - 0...
Tutorial - Recommender systems meet linked open data - ICWE 2016 - Lugano - 0...Tutorial - Recommender systems meet linked open data - ICWE 2016 - Lugano - 0...
Tutorial - Recommender systems meet linked open data - ICWE 2016 - Lugano - 0...Polytechnic University of Bari
 
DMTM 2015 - 11 Decision Trees
DMTM 2015 - 11 Decision TreesDMTM 2015 - 11 Decision Trees
DMTM 2015 - 11 Decision TreesPier Luca Lanzi
 

What's hot (14)

DMTM Lecture 15 Clustering evaluation
DMTM Lecture 15 Clustering evaluationDMTM Lecture 15 Clustering evaluation
DMTM Lecture 15 Clustering evaluation
 
DMTM Lecture 03 Regression
DMTM Lecture 03 RegressionDMTM Lecture 03 Regression
DMTM Lecture 03 Regression
 
DMTM Lecture 04 Classification
DMTM Lecture 04 ClassificationDMTM Lecture 04 Classification
DMTM Lecture 04 Classification
 
DMTM 2015 - 13 Naive bayes, Nearest Neighbours and Other Methods
DMTM 2015 - 13 Naive bayes, Nearest Neighbours and Other MethodsDMTM 2015 - 13 Naive bayes, Nearest Neighbours and Other Methods
DMTM 2015 - 13 Naive bayes, Nearest Neighbours and Other Methods
 
DMTM Lecture 09 Other classificationmethods
DMTM Lecture 09 Other classificationmethodsDMTM Lecture 09 Other classificationmethods
DMTM Lecture 09 Other classificationmethods
 
DMTM Lecture 06 Classification evaluation
DMTM Lecture 06 Classification evaluationDMTM Lecture 06 Classification evaluation
DMTM Lecture 06 Classification evaluation
 
DMTM Lecture 10 Classification ensembles
DMTM Lecture 10 Classification ensemblesDMTM Lecture 10 Classification ensembles
DMTM Lecture 10 Classification ensembles
 
DMTM 2015 - 06 Introduction to Clustering
DMTM 2015 - 06 Introduction to ClusteringDMTM 2015 - 06 Introduction to Clustering
DMTM 2015 - 06 Introduction to Clustering
 
DMTM 2015 - 15 Classification Ensembles
DMTM 2015 - 15 Classification EnsemblesDMTM 2015 - 15 Classification Ensembles
DMTM 2015 - 15 Classification Ensembles
 
DMTM Lecture 07 Decision trees
DMTM Lecture 07 Decision treesDMTM Lecture 07 Decision trees
DMTM Lecture 07 Decision trees
 
SSSW 2013 - Feeding Recommender Systems with Linked Open Data
SSSW 2013 - Feeding Recommender Systems with Linked Open DataSSSW 2013 - Feeding Recommender Systems with Linked Open Data
SSSW 2013 - Feeding Recommender Systems with Linked Open Data
 
Tutorial - Recommender systems meet linked open data - ICWE 2016 - Lugano - 0...
Tutorial - Recommender systems meet linked open data - ICWE 2016 - Lugano - 0...Tutorial - Recommender systems meet linked open data - ICWE 2016 - Lugano - 0...
Tutorial - Recommender systems meet linked open data - ICWE 2016 - Lugano - 0...
 
Recommender Systems and Linked Open Data
Recommender Systems and Linked Open DataRecommender Systems and Linked Open Data
Recommender Systems and Linked Open Data
 
DMTM 2015 - 11 Decision Trees
DMTM 2015 - 11 Decision TreesDMTM 2015 - 11 Decision Trees
DMTM 2015 - 11 Decision Trees
 

Similar to DMTM Lecture 02 Data mining

DMTM 2015 - 02 Data Mining
DMTM 2015 - 02 Data MiningDMTM 2015 - 02 Data Mining
DMTM 2015 - 02 Data MiningPier Luca Lanzi
 
POWRR Tools: Lessons learned from an IMLS National Leadership Grant
POWRR Tools: Lessons learned from an IMLS National Leadership GrantPOWRR Tools: Lessons learned from an IMLS National Leadership Grant
POWRR Tools: Lessons learned from an IMLS National Leadership GrantLynne Thomas
 
World Future Society 2015 Professional Members Forum
World Future Society 2015 Professional Members ForumWorld Future Society 2015 Professional Members Forum
World Future Society 2015 Professional Members ForumWendy Schultz
 
APLIC 2012: Discovering & Dealing with Data
APLIC 2012: Discovering & Dealing with DataAPLIC 2012: Discovering & Dealing with Data
APLIC 2012: Discovering & Dealing with DataHamilton Public Library
 
Big and Small Web Data
Big and Small Web DataBig and Small Web Data
Big and Small Web DataMarieke Guy
 
Mining the Social Web - Lecture 1 - T61.6020 lecture-01-slides
Mining the Social Web - Lecture 1 - T61.6020 lecture-01-slidesMining the Social Web - Lecture 1 - T61.6020 lecture-01-slides
Mining the Social Web - Lecture 1 - T61.6020 lecture-01-slidesMichael Mathioudakis
 
datamining_Lecture_1(introduction).pptx
datamining_Lecture_1(introduction).pptxdatamining_Lecture_1(introduction).pptx
datamining_Lecture_1(introduction).pptxHASHEMHASH
 
Lecture_1_Intro.pdf
Lecture_1_Intro.pdfLecture_1_Intro.pdf
Lecture_1_Intro.pdfpaijitk
 
Intro to Data Science for Non-Data Scientists
Intro to Data Science for Non-Data ScientistsIntro to Data Science for Non-Data Scientists
Intro to Data Science for Non-Data ScientistsSri Ambati
 
Getting Started in Data Science
Getting Started in Data ScienceGetting Started in Data Science
Getting Started in Data ScienceThinkful
 
Introduction Data Science.pptx
Introduction Data Science.pptxIntroduction Data Science.pptx
Introduction Data Science.pptxAkhirulAminulloh2
 
Privacy, Ethics, and Future Uses of the Social Web
Privacy, Ethics, and Future Uses of the Social WebPrivacy, Ethics, and Future Uses of the Social Web
Privacy, Ethics, and Future Uses of the Social WebMatthew Russell
 
Career in Data Science (July 2017, DTLA)
Career in Data Science (July 2017, DTLA)Career in Data Science (July 2017, DTLA)
Career in Data Science (July 2017, DTLA)Thinkful
 
Digital project planning and pedagogy
Digital project planning and pedagogyDigital project planning and pedagogy
Digital project planning and pedagogylibrarianrafia
 
Data Science-1 (1).ppt
Data Science-1 (1).pptData Science-1 (1).ppt
Data Science-1 (1).pptSanjayAcharaya
 

Similar to DMTM Lecture 02 Data mining (20)

DMTM 2015 - 02 Data Mining
DMTM 2015 - 02 Data MiningDMTM 2015 - 02 Data Mining
DMTM 2015 - 02 Data Mining
 
Lecture 01 Data Mining
Lecture 01 Data MiningLecture 01 Data Mining
Lecture 01 Data Mining
 
POWRR Tools: Lessons learned from an IMLS National Leadership Grant
POWRR Tools: Lessons learned from an IMLS National Leadership GrantPOWRR Tools: Lessons learned from an IMLS National Leadership Grant
POWRR Tools: Lessons learned from an IMLS National Leadership Grant
 
World Future Society 2015 Professional Members Forum
World Future Society 2015 Professional Members ForumWorld Future Society 2015 Professional Members Forum
World Future Society 2015 Professional Members Forum
 
APLIC 2012: Discovering & Dealing with Data
APLIC 2012: Discovering & Dealing with DataAPLIC 2012: Discovering & Dealing with Data
APLIC 2012: Discovering & Dealing with Data
 
Big and Small Web Data
Big and Small Web DataBig and Small Web Data
Big and Small Web Data
 
Mining the Social Web - Lecture 1 - T61.6020 lecture-01-slides
Mining the Social Web - Lecture 1 - T61.6020 lecture-01-slidesMining the Social Web - Lecture 1 - T61.6020 lecture-01-slides
Mining the Social Web - Lecture 1 - T61.6020 lecture-01-slides
 
datamining_Lecture_1(introduction).pptx
datamining_Lecture_1(introduction).pptxdatamining_Lecture_1(introduction).pptx
datamining_Lecture_1(introduction).pptx
 
Lecture_1_Intro.pdf
Lecture_1_Intro.pdfLecture_1_Intro.pdf
Lecture_1_Intro.pdf
 
DBMS
DBMSDBMS
DBMS
 
Intro to Data Science for Non-Data Scientists
Intro to Data Science for Non-Data ScientistsIntro to Data Science for Non-Data Scientists
Intro to Data Science for Non-Data Scientists
 
Getting Started in Data Science
Getting Started in Data ScienceGetting Started in Data Science
Getting Started in Data Science
 
Introduction Data Science.pptx
Introduction Data Science.pptxIntroduction Data Science.pptx
Introduction Data Science.pptx
 
Privacy, Ethics, and Future Uses of the Social Web
Privacy, Ethics, and Future Uses of the Social WebPrivacy, Ethics, and Future Uses of the Social Web
Privacy, Ethics, and Future Uses of the Social Web
 
Career in Data Science (July 2017, DTLA)
Career in Data Science (July 2017, DTLA)Career in Data Science (July 2017, DTLA)
Career in Data Science (July 2017, DTLA)
 
Digital project planning and pedagogy
Digital project planning and pedagogyDigital project planning and pedagogy
Digital project planning and pedagogy
 
Lecture4 Social Web
Lecture4 Social Web Lecture4 Social Web
Lecture4 Social Web
 
Text Mining : Experience
Text Mining : ExperienceText Mining : Experience
Text Mining : Experience
 
Ux and Data Visualisation
Ux and Data VisualisationUx and Data Visualisation
Ux and Data Visualisation
 
Data Science-1 (1).ppt
Data Science-1 (1).pptData Science-1 (1).ppt
Data Science-1 (1).ppt
 

More from Pier Luca Lanzi

11 Settembre 2021 - Giocare con i Videogiochi
11 Settembre 2021 - Giocare con i Videogiochi11 Settembre 2021 - Giocare con i Videogiochi
11 Settembre 2021 - Giocare con i VideogiochiPier Luca Lanzi
 
Breve Viaggio al Centro dei Videogiochi
Breve Viaggio al Centro dei VideogiochiBreve Viaggio al Centro dei Videogiochi
Breve Viaggio al Centro dei VideogiochiPier Luca Lanzi
 
Global Game Jam 19 @ POLIMI - Morning Welcome
Global Game Jam 19 @ POLIMI - Morning WelcomeGlobal Game Jam 19 @ POLIMI - Morning Welcome
Global Game Jam 19 @ POLIMI - Morning WelcomePier Luca Lanzi
 
Data Driven Game Design @ Campus Party 2018
Data Driven Game Design @ Campus Party 2018Data Driven Game Design @ Campus Party 2018
Data Driven Game Design @ Campus Party 2018Pier Luca Lanzi
 
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...Pier Luca Lanzi
 
GGJ18 al Politecnico di Milano - Presentazione di apertura
GGJ18 al Politecnico di Milano - Presentazione di aperturaGGJ18 al Politecnico di Milano - Presentazione di apertura
GGJ18 al Politecnico di Milano - Presentazione di aperturaPier Luca Lanzi
 
Presentation for UNITECH event - January 8, 2018
Presentation for UNITECH event - January 8, 2018Presentation for UNITECH event - January 8, 2018
Presentation for UNITECH event - January 8, 2018Pier Luca Lanzi
 
DMTM Lecture 17 Text mining
DMTM Lecture 17 Text miningDMTM Lecture 17 Text mining
DMTM Lecture 17 Text miningPier Luca Lanzi
 
DMTM Lecture 14 Density based clustering
DMTM Lecture 14 Density based clusteringDMTM Lecture 14 Density based clustering
DMTM Lecture 14 Density based clusteringPier Luca Lanzi
 
DMTM Lecture 13 Representative based clustering
DMTM Lecture 13 Representative based clusteringDMTM Lecture 13 Representative based clustering
DMTM Lecture 13 Representative based clusteringPier Luca Lanzi
 
DMTM Lecture 12 Hierarchical clustering
DMTM Lecture 12 Hierarchical clusteringDMTM Lecture 12 Hierarchical clustering
DMTM Lecture 12 Hierarchical clusteringPier Luca Lanzi
 
DMTM Lecture 11 Clustering
DMTM Lecture 11 ClusteringDMTM Lecture 11 Clustering
DMTM Lecture 11 ClusteringPier Luca Lanzi
 
DMTM Lecture 08 Classification rules
DMTM Lecture 08 Classification rulesDMTM Lecture 08 Classification rules
DMTM Lecture 08 Classification rulesPier Luca Lanzi
 
VDP2016 - Lecture 16 Rendering pipeline
VDP2016 - Lecture 16 Rendering pipelineVDP2016 - Lecture 16 Rendering pipeline
VDP2016 - Lecture 16 Rendering pipelinePier Luca Lanzi
 
VDP2016 - Lecture 15 PCG with Unity
VDP2016 - Lecture 15 PCG with UnityVDP2016 - Lecture 15 PCG with Unity
VDP2016 - Lecture 15 PCG with UnityPier Luca Lanzi
 
VDP2016 - Lecture 14 Procedural content generation
VDP2016 - Lecture 14 Procedural content generationVDP2016 - Lecture 14 Procedural content generation
VDP2016 - Lecture 14 Procedural content generationPier Luca Lanzi
 
VDP2016 - Lecture 13 Data driven game design
VDP2016 - Lecture 13 Data driven game designVDP2016 - Lecture 13 Data driven game design
VDP2016 - Lecture 13 Data driven game designPier Luca Lanzi
 

More from Pier Luca Lanzi (17)

11 Settembre 2021 - Giocare con i Videogiochi
11 Settembre 2021 - Giocare con i Videogiochi11 Settembre 2021 - Giocare con i Videogiochi
11 Settembre 2021 - Giocare con i Videogiochi
 
Breve Viaggio al Centro dei Videogiochi
Breve Viaggio al Centro dei VideogiochiBreve Viaggio al Centro dei Videogiochi
Breve Viaggio al Centro dei Videogiochi
 
Global Game Jam 19 @ POLIMI - Morning Welcome
Global Game Jam 19 @ POLIMI - Morning WelcomeGlobal Game Jam 19 @ POLIMI - Morning Welcome
Global Game Jam 19 @ POLIMI - Morning Welcome
 
Data Driven Game Design @ Campus Party 2018
Data Driven Game Design @ Campus Party 2018Data Driven Game Design @ Campus Party 2018
Data Driven Game Design @ Campus Party 2018
 
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
 
GGJ18 al Politecnico di Milano - Presentazione di apertura
GGJ18 al Politecnico di Milano - Presentazione di aperturaGGJ18 al Politecnico di Milano - Presentazione di apertura
GGJ18 al Politecnico di Milano - Presentazione di apertura
 
Presentation for UNITECH event - January 8, 2018
Presentation for UNITECH event - January 8, 2018Presentation for UNITECH event - January 8, 2018
Presentation for UNITECH event - January 8, 2018
 
DMTM Lecture 17 Text mining
DMTM Lecture 17 Text miningDMTM Lecture 17 Text mining
DMTM Lecture 17 Text mining
 
DMTM Lecture 14 Density based clustering
DMTM Lecture 14 Density based clusteringDMTM Lecture 14 Density based clustering
DMTM Lecture 14 Density based clustering
 
DMTM Lecture 13 Representative based clustering
DMTM Lecture 13 Representative based clusteringDMTM Lecture 13 Representative based clustering
DMTM Lecture 13 Representative based clustering
 
DMTM Lecture 12 Hierarchical clustering
DMTM Lecture 12 Hierarchical clusteringDMTM Lecture 12 Hierarchical clustering
DMTM Lecture 12 Hierarchical clustering
 
DMTM Lecture 11 Clustering
DMTM Lecture 11 ClusteringDMTM Lecture 11 Clustering
DMTM Lecture 11 Clustering
 
DMTM Lecture 08 Classification rules
DMTM Lecture 08 Classification rulesDMTM Lecture 08 Classification rules
DMTM Lecture 08 Classification rules
 
VDP2016 - Lecture 16 Rendering pipeline
VDP2016 - Lecture 16 Rendering pipelineVDP2016 - Lecture 16 Rendering pipeline
VDP2016 - Lecture 16 Rendering pipeline
 
VDP2016 - Lecture 15 PCG with Unity
VDP2016 - Lecture 15 PCG with UnityVDP2016 - Lecture 15 PCG with Unity
VDP2016 - Lecture 15 PCG with Unity
 
VDP2016 - Lecture 14 Procedural content generation
VDP2016 - Lecture 14 Procedural content generationVDP2016 - Lecture 14 Procedural content generation
VDP2016 - Lecture 14 Procedural content generation
 
VDP2016 - Lecture 13 Data driven game design
VDP2016 - Lecture 13 Data driven game designVDP2016 - Lecture 13 Data driven game design
VDP2016 - Lecture 13 Data driven game design
 

Recently uploaded

Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Seán Kennedy
 
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfAMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfphamnguyenenglishnb
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)lakshayb543
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPCeline George
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management SystemChristalin Nelson
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Celine George
 
Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)cama23
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYKayeClaireEstoconing
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Jisc
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...Nguyen Thanh Tu Collection
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatYousafMalik24
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONHumphrey A Beña
 
ACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfSpandanaRallapalli
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designMIPLM
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management systemChristalin Nelson
 
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfGrade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfJemuel Francisco
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPCeline George
 

Recently uploaded (20)

Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...
 
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfAMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
 
Raw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptxRaw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptx
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERP
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management System
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17
 
Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)
 
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptxFINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
 
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice great
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
 
ACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdf
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-design
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management system
 
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfGrade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERP
 

DMTM Lecture 02 Data mining

  • 1. Prof. Pier Luca Lanzi Data Mining Data Mining andText Mining (UIC 583 @ Politecnico di Milano)
  • 2. Prof. Pier Luca Lanzi Motivations 2
  • 3. Prof. Pier Luca Lanzi Why Data Mining? “Necessity is the mother of invention” Explosive Growth of Data Pressing need for the automated analysis of massive data Emerged in the late 1980s Major developments in the mid 1990s
  • 4. Prof. Pier Luca Lanzihttp://www.iflscience.com/technology/amount-data-internet-generates-every-minute-crazy/
  • 5. Prof. Pier Luca Lanzi Evolution of Technology • 1960s: data collection, database creation, & network DBMS • 1970s: relational data model, relational DBMS implementation • 1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.); application-oriented DBMS (spatial, scientific, engineering, etc.) • 1990s: data mining, data warehousing, multimedia databases, and Web databases • 2000s: stream data management and mining, web technology (XML, data integration), global information systems • 2010s: social networks, NoSQL, unstructured data, etc. 5
  • 6. Prof. Pier Luca Lanzi http://launchhack.com/content/25-cartoons-give-current-big-data-hype-perspective/
  • 7. Prof. Pier Luca Lanzi What is the Commercial Viewpoint? • Huge amounts of data is being collected and warehoused everyday § Web data, e-commerce § Purchases at department stores § Bank/Credit Card transactions • Computers have become cheaper and more powerful • Competitive pressure is strong to provide better, customized services (e.g., CRM or Customer Relationship Management) • Poor data across businesses and the government costs huge amount of money 7
  • 8. Prof. Pier Luca Lanzi What is the Scientific Viewpoint? • Data collected and stored at enormous speeds (GB/hour) §remote sensors on a satellite §telescopes scanning the skies §microarrays generating gene expression data §scientific simulations generating terabytes of data • Traditional techniques infeasible for raw data • Data mining may help scientists §in classifying and segmenting data §in Hypothesis Formation 8
  • 9. Prof. Pier Luca Lanzi Examples • Customer attrition § Given customer information for the past months § Predict who is likely to attrite next month, or estimate customer value • Credit assessment § Given a loan application § Predict whether the bank should approve the loan • Customer segmentation § Given several information about the customers § Identify interesting groups among them • Community detection § Given a social network of users § Identify community based on their connections (friendship relation, discussions, etc.) 9
  • 10. Prof. Pier Luca Lanzi Big Data?
  • 11. Prof. Pier Luca Lanzi http://simplystatistics.org/2014/05/07/why-big-data-is-in-trouble-they-forgot-about-applied-statistics/ http://bits.blogs.nytimes.com/2014/03/28/google-flu-trends-the-limits-of-big-data/?_r=0 http://gking.harvard.edu/files/gking/files/0314policyforumff.pdf
  • 13. Prof. Pier Luca Lanzi Data Science?
  • 14. Prof. Pier Luca Lanzihttp://wikibon.org/blog/role-of-the-data-scientist/
  • 15. Prof. Pier Luca Lanzi https://s3.amazonaws.com/aws.drewconway.com/viz/venn_diagram/data_science.html
  • 16. Prof. Pier Luca Lanzi Data Mining?
  • 17. Prof. Pier Luca Lanzi Data Mining The non-trivial process of identifying (1) valid, (2) novel, (3) potentially useful, and (4) understandable patterns in data.
  • 18. Prof. Pier Luca Lanzi An Example Using Contact Lens Data 18 NoneReducedYesHypermetropePre-presbyopic NoneNormalYesHypermetropePre-presbyopic NoneReducedNoMyopePresbyopic NoneNormalNoMyopePresbyopic NoneReducedYesMyopePresbyopic HardNormalYesMyopePresbyopic NoneReducedNoHypermetropePresbyopic SoftNormalNoHypermetropePresbyopic NoneReducedYesHypermetropePresbyopic NoneNormalYesHypermetropePresbyopic SoftNormalNoHypermetropePre-presbyopic NoneReducedNoHypermetropePre-presbyopic HardNormalYesMyopePre-presbyopic NoneReducedYesMyopePre-presbyopic SoftNormalNoMyopePre-presbyopic NoneReducedNoMyopePre-presbyopic hardNormalYesHypermetropeYoung NoneReducedYesHypermetropeYoung SoftNormalNoHypermetropeYoung NoneReducedNoHypermetropeYoung HardNormalYesMyopeYoung NoneReducedYesMyopeYoung SoftNormalNoMyopeYoung NoneReducedNoMyopeYoung Recommended lensesTear production rateAstigmatismSpectacle prescriptionAge
  • 19. Prof. Pier Luca Lanzi An example of possible pattern if astigmatism = yes and tear production rate = normal and spectacle prescription = myope then recommendation = hard
  • 20. Prof. Pier Luca Lanzi is it a “good” pattern?
  • 21. Prof. Pier Luca Lanzi How Can We Evaluate a Pattern? • Is it valid? §The pattern has to be valid with respect to a certainty level (rule true for the 86%) • Is it novel? §Is the relation between astigmatism and hard contact lenses already well-known? • Is it useful? Is it actionable? §The pattern should provide information useful to the bank for assessing credit risk • Is it understandable? 22
  • 22. Prof. Pier Luca Lanzi but there is another important question … was it “worth” finding it? (I mean $ worth) how much did the search cost? how much value did it bring 23
  • 23. Prof. Pier Luca Lanzi Example of Cost-Based Model Evaluation • A bank has a predictive model that can identify risky loans with an accuracy of 72% • Your company develops a model that can improve their performance by 3% reaching an accuracy of 75% • Is this a good result? • We might simply evaluate the 3% of improvement but giving out loans has a cost that depends on the type of error we make 24
  • 24. Prof. Pier Luca Lanzi Example of Cost-Based Model Evaluation • Predictive accuracy in this case is defined as • True Positives, true negatives § Safe and risky loans predicted as safe and risky respectively • False Positive Errors § We accept a risky loan which we predicted was safe § We are likely not to get the money back (let’s say on average 30000 euros) • False Negative Errors § We don’t give a safe loan since we predicted it was risky § We will loose the interest money (let’s say on average 10000 euros 25
  • 25. Prof. Pier Luca Lanzi Example of Cost-Based Model Evaluation • Original Model §1576 false positives and 1224 false negatives §Total cost is 59525443 • Our Model §1407 false positives and 1093 false negatives §Total cost is 53147717 (more than 6 millions saved) • What if we can change the way our model makes mistakes? §1093 false positives and 1407 false negatives §Total cost becomes 46852283 (more than 12 millions saved) 26
  • 26. Prof. Pier Luca Lanzi What is the General Idea? • Build computer programs that navigate through databases automatically, seeking regularities or patterns • There will be problems §Most patterns are banal and uninteresting §Most patterns are spurious, inexact, or contingent on accidental coincidences in the particular dataset used §Real data is imperfect: Some parts will be garbled, and some will be missing • Algorithms need to be robust enough to cope with imperfect data and to extract regularities that are inexact but useful 28
  • 27. Prof. Pier Luca Lanzi Pitfalls 29
  • 28. Prof. Pier Luca Lanzi http://launchhack.com/content/25-cartoons-give-current-big-data-hype-perspective/
  • 29. Prof. Pier Luca Lanzi http://www.tylervigen.com
  • 30. Prof. Pier Luca Lanzi http://www.tylervigen.com
  • 31. Prof. Pier Luca Lanzi http://www.tylervigen.com
  • 32. Prof. Pier Luca Lanzi http://www.tylervigen.com
  • 33. Prof. Pier Luca Lanzi http://www.tylervigen.com
  • 34. Prof. Pier Luca Lanzi http://www.tylervigen.com
  • 35. Prof. Pier Luca Lanzi Statistics, Machine Learning, and Data Mining • Statistics is more theory-based, focuses on testing hypotheses • Machine learning is more based on heuristic, focuses on building program that learns, more general than Data Mining • Data Mining §Integrates theory and heuristics §Focus on the entire process of discovery, including data cleaning, learning, integration and visualization Distinctions are blurred! 37
  • 36. Prof. Pier Luca Lanzi http://www.imdb.com/title/tt1178663/
  • 37. Prof. Pier Luca Lanzi Why Is It Different? • Tremendous amount of data § High scalability to handle terabytes of data • High-dimensionality of data § Micro-array may have tens of thousands of dimensions • High complexity of data § Data streams and sensor data § Time-series data, temporal data, sequence data § Structure data, graphs, social networks and multi-linked data § Heterogeneous databases and legacy databases § Spatial, spatiotemporal, multimedia, text and Web data § Software programs, scientific simulations • New and sophisticated applications 39
  • 38. Prof. Pier Luca Lanzi Knowledge Discovery Process 40 selection cleaning transformation mining evaluation
  • 39. Prof. Pier Luca Lanzi Knowledge Discovery Process What are the main steps? • Learning the application domain to extract relevant prior knowledge and goals • Data selection • Data cleaning • Data reduction and transformation • Mining § Select the mining approach: classification, regression, association, clustering, etc. § Choosing the mining algorithm(s) § Perform mining: search for patterns of interest • Pattern evaluation and knowledge presentation § visualization, transformation, removing redundant patterns, etc. • Use of discovered knowledge 41
  • 40. Prof. Pier Luca Lanzi What are the typical Data Mining tasks? 42
  • 41. Prof. Pier Luca Lanzi What are the Major Data Mining Tasks? • Classification: predicting an item class • Clustering: finding clusters in data • Associations: frequent occurring events… • Visualization: to facilitate human discovery • Summarization: describing a group • Deviation Detection: finding changes • Estimation: predicting a continuous value • Link Analysis: finding relationship • “Sentiment” Analysis, “Opinion” Mining • But many appears as time goes by, opinion mining, sentiment mining 43
  • 42. Prof. Pier Luca Lanzi Prediction & Regression
  • 43. Prof. Pier Luca Lanzi Input variable: LSTAT - % lower status of the population Output variable: MEDV - Median value of owner-occupied homes in $1000's
  • 44. Prof. Pier Luca Lanzi Input variable: LSTAT - % lower status of the population Output variable: MEDV - Median value of owner-occupied homes in $1000's
  • 45. Prof. Pier Luca Lanzi Classification
  • 47. Prof. Pier Luca Lanzi Clustering
  • 50. Prof. Pier Luca Lanzi Associations
  • 51. Prof. Pier Luca Lanzi Data Mining Tasks: Associations Bread Peanuts Milk Fruit Jam Bread Jam Soda Chips Milk Fruit Steak Jam Soda Chips Bread Jam Soda Chips Milk Bread Fruit Soda Chips Milk Jam Soda Peanuts Milk Fruit Fruit Soda Peanuts Milk Fruit Peanuts Cheese Yogurt Is there something interesting to be noted? 53
  • 52. Prof. Pier Luca Lanzi Data Mining Tasks: Associations • Finds interesting associations and/or correlation relationships among large set of data items. • E.g., 98% of people who purchase tires and auto accessories also get automotive services done 54
  • 53. Prof. Pier Luca Lanzi Data Mining Tasks: Other Tasks • Outlier analysis § Outlier: a data object that does not comply with the general behavior of the data § It can be considered as noise or exception but is quite useful in fraud detection, rare events analysis • Trend and evolution analysis § Trend and deviation: regression analysis § Sequential pattern mining, periodicity analysis § Similarity-based analysis • Text Mining, Topic Modeling, Graph Mining, Data Streams • Sentiment Analysis, Opinion Mining, etc. • Other pattern-directed or statistical analyses 55
  • 54. Prof. Pier Luca Lanzi Relevant Issues 56
  • 55. Prof. Pier Luca Lanzi Are all the “Discovered” Patterns Interesting? • Data Mining may generate thousands of patterns, not all of them are interesting. • Interestingness measures: a pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm • Objective vs. subjective interestingness measures: • Objective: based on statistics and structures of patterns, e.g., support, confidence, etc. • Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty, etc. 57
  • 56. Prof. Pier Luca Lanzi Can We Find All and Only Interesting Patterns? • Completeness §Find all the interesting patterns §Can a data mining system find all the interesting patterns? §Association vs. classification vs. clustering • Optimization §Search for only interesting patterns: §Can a data mining system find only the interesting patterns? §Two approaches: (1) first general all the patterns and then filter out the uninteresting ones; (2) generate only the interesting patterns—mining query optimization 58
  • 57. Prof. Pier Luca Lanzi What About Background Knowledge? • A typical kind of background knowledge: Concept hierarchies • Schema hierarchy § E.g., Street < City < ProvinceOrState < Country • Set-grouping hierarchy § E.g., {20-39} = young, {40-59} = middle_aged • Operation-derived hierarchy § email address: hagonzal@cs.uiuc.edu § login-name < department < university < country • Rule-based hierarchy § LowProfitMargin (X) <= Price(X, P1) and Cost (X, P2) and (P1 - P2) < $50 59
  • 58. Prof. Pier Luca Lanzi https://www.youtube.com/watch?v=CO2mGny6fFs
  • 59. Prof. Pier Luca Lanzi Assignments • Mining of Massive Datasets (Chapter 1) 61