Data Mining
Data Mining and Text Mining (UIC 583 @ Politecnico)
Lecture outline                            2



  Why Data Mining?
  What is Data Mining?
  What are the typical tasks?
  What are the primitives?
  What are the typical applications?
  What are the major issues?




                   Prof. Pier Luca Lanzi
Why
Data Mining?
Why Data Mining?                              4
“Necessity is the mother of invention”

  Explosive Growth of Data
     Terabytes of available data
     Data collections and data availability
     Major sources of abundant data

  Pressing need for the automated analysis of massive data




Evolution of Database Technology                        5



  1960s:
     Data collection, database creation, IMS and network DBMS

  1970s:
     Relational data model, relational DBMS implementation

  1980s:
     RDBMS, advanced data models
     (extended-relational, OO, deductive, etc.)
     Application-oriented DBMS
     (spatial, scientific, engineering, etc.)

  1990s:
     Data mining, data warehousing, multimedia databases,
     and Web databases

  2000s:
     Stream data management and mining
     Data mining and its applications
     Web technology (XML, data integration)
     Global information systems


Examples                                       6



  In vitro fertilization
     Given: embryos described by 60 features
     Problem: selection of embryos that will survive
     Data: historical records of embryos and outcome

  Cow culling
    Given: cows described by 700 features
    Problem: selection of cows that should be culled
    Data: historical records and farmers’ decisions




Examples                                        7



  Customer attrition
     Given: customer information for the past months
     Problem: predict who is likely to attrite next month,
     or estimate customer value
     Data: historical customer records

  Credit assessment
     Given: a loan application
     Problem: predict whether the bank should
     approve the loan
     Data: records from other loans




What is
Data Mining?
What is Data Mining?                           9



  The non-trivial process of identifying
     valid
     novel
     potentially useful, and
     ultimately understandable patterns in data.

  Alternative names:
     Data Fishing, Data Dredging (1960-)
     Data Mining (1990-), used by DB and business
     Knowledge Discovery in Databases (1989-), used by AI
     Business Intelligence, Information Harvesting,
     Information Discovery, Knowledge Extraction, ...

  Currently, Data Mining and Knowledge Discovery
  are used interchangeably


Example: Credit Risk                     10




  [Figure: scatter plot of loans, with salary on the x-axis and loan
  on the y-axis; points are labeled "not repaid" or "repaid"]

Example: Credit Risk                                      11




  [Figure: loan vs. salary plot with a threshold k marked on the
  salary axis]

  IF salary < k THEN not repaid

Example: Credit Risk                             12



  Is it valid?
      The pattern has to be valid with respect
      to a certainty level (e.g., the rule holds in 86% of the cases)

  Is it novel?
      The value k should be previously
      unknown and not obvious

  Is it useful?
      The pattern should provide information
      useful to the bank for assessing credit risk

  Is it understandable?
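
The validity check above can be made concrete with a minimal sketch. The loan records and the threshold value below are hypothetical and chosen only for illustration; the rule itself is the one from the slide.

```python
# Hypothetical loan records (salary, repaid); the values are made up.
loans = [
    (18_000, False), (22_000, False), (25_000, True), (27_000, False),
    (31_000, True), (35_000, True), (40_000, True), (52_000, True),
]

def predict_not_repaid(salary, k):
    """The rule from the slide: IF salary < k THEN not repaid."""
    return salary < k

def rule_validity(records, k):
    """Fraction of records on which the rule agrees with the observed outcome."""
    correct = sum(
        predict_not_repaid(salary, k) == (not repaid)
        for salary, repaid in records
    )
    return correct / len(records)

# k = 28_000 is an arbitrary threshold picked for this example.
validity = rule_validity(loans, k=28_000)
print(f"rule holds for {validity:.1%} of the records")
```

A pattern would count as valid only if this fraction reaches the certainty level the analyst requires.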




What is the general idea?                      13



  Build computer programs that sift through databases
  automatically, seeking regularities or patterns

  There will be problems
     Most patterns are banal and uninteresting
     Most patterns are spurious, inexact, or contingent on
     accidental coincidences in the particular dataset used
     Real data is imperfect: Some parts will be garbled,
     and some will be missing

  Algorithms need to be robust enough to cope with imperfect
  data and to extract regularities that are inexact but useful




What are the related fields?                  14




  Knowledge Discovery and Data Mining sits at the intersection of
     Visualization
     Machine Learning
     Statistics
     Databases


Statistics, Machine Learning,                  15
and Data Mining

  Statistics:
     more theory-based, focused on testing hypotheses
  Machine learning
     more heuristic, focused on building programs
     that learn, more general than Data Mining
  Knowledge Discovery
     integrates theory and heuristics
     focus on the entire process of discovery, including
     data cleaning, learning, integration and visualization
  Data Mining
     focus on the algorithms to extract patterns from data



                  Distinctions are blurred!

Why Not Traditional Data Analysis?                  16



  Tremendous amount of data
     High scalability to handle terabytes of data

  High-dimensionality of data
     Microarray data may have tens of thousands of dimensions

  High complexity of data
     Data streams and sensor data
     Time-series data, temporal data, sequence data
     Structured data, graphs, social networks
     and multi-linked data
     Heterogeneous databases and legacy databases
     Spatial, spatiotemporal, multimedia, text and Web data
     Software programs, scientific simulations

  New and sophisticated applications



Knowledge Discovery Process                            17



  [Figure: process flow]
  raw data -> selection -> cleaning -> transformation -> mining -> evaluation
Knowledge Discovery Process                    18
What are the main steps?

  Learning the application domain to extract
  relevant prior knowledge and goals
  Data selection
  Data cleaning
  Data reduction and transformation
  Mining
     Select the mining approach: classification,
     regression, association, clustering, etc.
     Choosing the mining algorithm(s)
     Perform mining: search for patterns of interest
  Pattern evaluation and knowledge presentation
     visualization, transformation,
     removing redundant patterns, etc.
  Use of discovered knowledge
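
The steps above can be sketched as a chain of small functions. This is only an illustrative sketch: the records, the field names, and the toy "pattern" (mean loan-to-salary ratio per outcome) are assumptions made for this example and do not come from the lecture.

```python
from statistics import mean

# Hypothetical raw records; None marks a missing value.
raw = [
    {"id": 1, "salary": 30_000, "loan": 10_000, "repaid": True},
    {"id": 2, "salary": None,   "loan": 5_000,  "repaid": False},
    {"id": 3, "salary": 20_000, "loan": 15_000, "repaid": False},
    {"id": 4, "salary": 50_000, "loan": 20_000, "repaid": True},
]

def select(rows, fields):
    """Data selection: keep only the task-relevant attributes."""
    return [{f: r[f] for f in fields} for r in rows]

def clean(rows):
    """Data cleaning: here, simply drop records with missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def transform(rows):
    """Transformation: derive a loan-to-salary ratio."""
    return [{"ratio": r["loan"] / r["salary"], "repaid": r["repaid"]} for r in rows]

def mine(rows):
    """Mining (toy version): mean ratio for each outcome."""
    return {
        outcome: mean(r["ratio"] for r in rows if r["repaid"] is outcome)
        for outcome in (True, False)
    }

def evaluate(pattern):
    """Evaluation: is the ratio higher for loans that were not repaid?"""
    return pattern[False] > pattern[True]

cleaned = clean(select(raw, ["salary", "loan", "repaid"]))
pattern = mine(transform(cleaned))
print(pattern, evaluate(pattern))
```

In a real project each stage is far more involved, and the process is usually iterative rather than a single pass.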


Knowledge Discovery and                                19
Business Intelligence


  [Figure: layered pyramid; the potential to support business
  decisions increases from bottom to top]

  Layers, from top to bottom:
     Making Decisions
     Data Presentation: Visualization Techniques
     Data Mining: Information Discovery
     Data Exploration: Statistical Analysis, Querying and Reporting
     Data Warehouses / Data Marts: OLAP, MDA
     Data Sources: Paper, Files, Information Providers,
     Database Systems, OLTP

  Roles, from top to bottom:
     End User, Business Analyst, Data Analyst, DBA

Architecture of a Typical                 20
Knowledge Discovery System



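  [Figure: layered architecture]

  Layers, from top to bottom:
     Graphical user interface
     Pattern evaluation
     Data mining engine
     Database or data warehouse server
     DB, DW

  KB is drawn alongside pattern evaluation and the data mining engine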


What tasks?
Major Data Mining Tasks                          22



  Classification: predicting an item class
  Clustering: finding clusters in data
  Associations: frequently occurring events
  Visualization: to facilitate human discovery
  Summarization: describing a group
  Deviation Detection: finding changes
  Estimation: predicting a continuous value
  Link Analysis: finding relationships




Data Mining Tasks: classification                          23




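  [Figure: loan vs. salary plot with a threshold k marked on the
  salary axis; points marked "?" are new cases to be classified]

  IF salary < k THEN not repaid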

Data Mining Tasks: classification                24



  Classification and Prediction
     Finding models (functions) that describe
     and distinguish classes or concepts
     The goal is to describe the data or
     to make future predictions
     E.g., classify countries based on climate,
     or classify cars based on gas mileage
     Presentation: decision-tree, classification rule, neural
     network
     Prediction: Predict some unknown numerical values
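
As a minimal sketch of finding such a model, the threshold k of the credit-risk rule can be learned from labeled examples by trying every cut point between adjacent salaries and keeping the most accurate one (a one-level decision tree, often called a decision stump). The training records are hypothetical, and this exhaustive search is only one simple way to do it.

```python
# Hypothetical training records (salary, repaid); the values are made up.
records = [
    (18_000, False), (20_000, True), (22_000, False), (25_000, False),
    (30_000, True), (38_000, True), (45_000, True), (52_000, True),
]

def accuracy(records, k):
    """Share of records classified correctly by: IF salary < k THEN not repaid."""
    return sum((s < k) == (not repaid) for s, repaid in records) / len(records)

def best_threshold(records):
    """Evaluate the midpoints between adjacent distinct salaries; return the best."""
    salaries = sorted({s for s, _ in records})
    candidates = [(a + b) / 2 for a, b in zip(salaries, salaries[1:])]
    return max(candidates, key=lambda k: accuracy(records, k))

k = best_threshold(records)
print(k, accuracy(records, k))

# Classify a new, unlabeled applicant with the learned rule.
new_salary = 24_000
print("not repaid" if new_salary < k else "repaid")
```

The accuracy printed here is measured on the training records; a fair estimate of validity requires data that was not used to choose k.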




Data Mining Tasks: clustering                   26



  Cluster analysis
     The class label is unknown
     Group data to form new classes, e.g., cluster houses to
     find distribution patterns
     Clustering based on the principle: maximize the intra-class
     similarity and minimize the inter-class similarity
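
One common way to follow this principle is k-means, sketched below. The lecture does not name a specific algorithm on this slide, so k-means is an assumption of this example, as are the 2-D points (which could stand for house locations) and the choice of initial centroids.

```python
from math import dist

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then
    move each centroid to the mean of the points assigned to it."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else old
            for cl, old in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical points; no class label is given.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]

# Initial centroids are taken from the data; this choice is arbitrary.
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
print(centroids)
print(clusters)
```

Points within a cluster end up close to their own centroid (high intra-class similarity) and far from the other centroid (low inter-class similarity).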




Data Mining Tasks: associations                       27




 Bread         Bread                       Steak     Jam
 Peanuts       Jam                         Jam       Soda
 Milk          Soda                        Soda      Peanuts
 Fruit         Chips                       Chips     Milk
 Jam           Milk                        Bread     Fruit
               Fruit


   Is there something interesting?
 Jam            Fruit                      Fruit     Fruit
 Soda           Soda                       Soda      Peanuts
 Chips          Chips                      Peanuts   Cheese
 Milk           Milk                       Milk      Yogurt
 Bread




Data Mining Tasks: associations                28



  Association Rule Mining
     Finds interesting associations and/or correlation
     relationships among a large set of data items.
     E.g., 98% of people who purchase tires and auto
     accessories also get automotive services done
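
A minimal sketch of how such a rule is scored with support and confidence. The eight baskets are the ones shown on the previous slide, read column by column from the extracted text, so the exact grouping of items into baskets is an assumption.

```python
# The eight shopping baskets from the previous slide (grouping assumed).
baskets = [
    {"Bread", "Peanuts", "Milk", "Fruit", "Jam"},
    {"Bread", "Jam", "Soda", "Chips", "Milk", "Fruit"},
    {"Steak", "Jam", "Soda", "Chips", "Bread"},
    {"Jam", "Soda", "Peanuts", "Milk", "Fruit"},
    {"Jam", "Soda", "Chips", "Milk", "Bread"},
    {"Fruit", "Soda", "Chips", "Milk"},
    {"Fruit", "Soda", "Peanuts", "Milk"},
    {"Fruit", "Peanuts", "Cheese", "Yogurt"},
]

def support(itemset):
    """Fraction of baskets that contain every item in the itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """Of the baskets containing the antecedent, the fraction that
    also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

# Rule: Chips -> Soda
print(support({"Chips", "Soda"}))       # share of all baskets with both items
print(confidence({"Chips"}, {"Soda"}))  # share of Chips baskets that also have Soda
```

With this reading of the baskets, every basket that contains Chips also contains Soda, which is the kind of regularity association rule mining looks for.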




Data Mining Tasks: others                          29



  Outlier analysis
    Outlier: a data object that does not comply with the
    general behavior of the data
    It can be considered as noise or exception but is quite
    useful in fraud detection, rare events analysis
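
A minimal sketch of one simple outlier test: flag values that lie more than a chosen number of standard deviations from the mean. The transaction amounts and the threshold of 2 are assumptions for illustration; the lecture does not prescribe a particular method.

```python
from statistics import mean, pstdev

def find_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds
    `threshold` population standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical transaction amounts; one of them does not fit the rest.
amounts = [20, 22, 19, 21, 23, 20, 250]
outliers = find_outliers(amounts)
print(outliers)
```

Whether the flagged value is noise to discard or an exception worth investigating (for example, possible fraud) depends on the application.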

  Trend and evolution analysis
     Trend and deviation: regression analysis
     Sequential pattern mining, periodicity analysis
     Similarity-based analysis
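
For the regression-based view of trend and deviation, a minimal sketch: fit a least-squares line to a series and read each point's deviation from the trend as its residual. The series is hypothetical.

```python
def fit_trend(values):
    """Ordinary least-squares line through (0, v0), (1, v1), ...
    Returns (slope, intercept). Needs at least two values."""
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical measurements taken at equally spaced times.
series = [10, 12, 14, 16, 23]
slope, intercept = fit_trend(series)

# Deviation of each observation from the fitted trend line.
deviations = [y - (intercept + slope * x) for x, y in enumerate(series)]
print(slope, intercept, deviations)
```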

  Text Mining, Graph Mining, Data Streams

  Other pattern-directed or statistical analyses


Are all the “Discovered” Patterns              30
Interesting?

  Data Mining may generate thousands of patterns,
  but not all of them are interesting.
     Suggested approach: Human-centered, query-based,
     focused mining

  Interestingness measures: a pattern is interesting if it is
  easily understood by humans, valid on new or test data with
  some degree of certainty, potentially useful, novel, or
  validates some hypothesis that a user seeks to confirm

  Objective vs. subjective interestingness measures:
    Objective: based on statistics and structures of patterns,
    e.g., support, confidence, etc.
    Subjective: based on user’s belief in the data, e.g.,
    unexpectedness, novelty, etc.



Can we find all and only                           31
interesting patterns?

  Completeness: Find all the interesting patterns
    Can a data mining system find all
    the interesting patterns?
    Association vs. classification vs. clustering

  Optimization: Search for only interesting patterns:
    Can a data mining system find only
    the interesting patterns?
    Approaches
      • First generate all the patterns and then filter out the
        uninteresting ones.
      • Generate only the interesting patterns: mining query
        optimization
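
The two approaches can be contrasted on frequent itemsets, taking "interesting" to mean "support of at least 50%" (an assumption made for this sketch). The second function grows itemsets level by level from smaller frequent ones, in the style of the Apriori algorithm, which the slide does not name. The baskets are those of the associations slide, with the grouping read from the extracted text.

```python
from itertools import combinations

baskets = [
    {"Bread", "Peanuts", "Milk", "Fruit", "Jam"},
    {"Bread", "Jam", "Soda", "Chips", "Milk", "Fruit"},
    {"Steak", "Jam", "Soda", "Chips", "Bread"},
    {"Jam", "Soda", "Peanuts", "Milk", "Fruit"},
    {"Jam", "Soda", "Chips", "Milk", "Bread"},
    {"Fruit", "Soda", "Chips", "Milk"},
    {"Fruit", "Soda", "Peanuts", "Milk"},
    {"Fruit", "Peanuts", "Cheese", "Yogurt"},
]
MIN_SUPPORT = 0.5  # assumed interestingness threshold
ITEMS = sorted(set().union(*baskets))

def support(itemset):
    return sum(itemset <= b for b in baskets) / len(baskets)

def generate_then_filter():
    """Approach 1: enumerate every non-empty itemset, then filter."""
    candidates = [
        frozenset(c)
        for size in range(1, len(ITEMS) + 1)
        for c in combinations(ITEMS, size)
    ]
    frequent = {c for c in candidates if support(c) >= MIN_SUPPORT}
    return frequent, len(candidates)

def generate_only_frequent():
    """Approach 2: build size-k candidates only from frequent
    itemsets of size k-1, so most itemsets are never evaluated."""
    level = {frozenset([i]) for i in ITEMS if support(frozenset([i])) >= MIN_SUPPORT}
    frequent, evaluated = set(level), len(ITEMS)
    while level:
        size = len(next(iter(level))) + 1
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        evaluated += len(candidates)
        level = {c for c in candidates if support(c) >= MIN_SUPPORT}
        frequent |= level
    return frequent, evaluated

all_frequent, n_all = generate_then_filter()
pruned_frequent, n_pruned = generate_only_frequent()
print(all_frequent == pruned_frequent, n_all, n_pruned)
```

Both functions return the same itemsets, but the second evaluates far fewer candidates, because any itemset containing an infrequent item can never be frequent.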




Data Mining tasks                              32



  General functionality
    Descriptive data mining
    Predictive data mining

  Different views, different classifications
      Kinds of data to be mined
      Kinds of knowledge to be discovered
      Kinds of techniques utilized
      Kinds of applications adapted




What
primitives?
Primitives that Define a Data Mining Task      34



  Task-relevant data
  Type of knowledge to be mined
  Background knowledge
  Pattern interestingness measurements
  Visualization/presentation of discovered patterns




Primitive 1:                                35
Task-Relevant Data

  Database or data warehouse name
  Database tables or data warehouse cubes
  Condition for data selection
  Relevant attributes or dimensions
  Data grouping criteria




Primitive 2:                               36
Types of Knowledge to Be Mined

  Characterization
  Discrimination
  Association
  Classification/prediction
  Clustering
  Outlier analysis
  Other data mining tasks




Primitive 3:                                  37
Background Knowledge

  A typical kind of background knowledge: Concept hierarchies

  Schema hierarchy
     E.g., Street < City < ProvinceOrState < Country

  Set-grouping hierarchy
     E.g., {20-39} = young, {40-59} = middle_aged

  Operation-derived hierarchy
    email address: hagonzal@cs.uiuc.edu
    login-name < department < university < country

  Rule-based hierarchy
     LowProfitMargin (X) <= Price(X, P1) and Cost (X, P2)
     and (P1 - P2) < $50
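
Three of these hierarchies can be sketched as small functions. The function names and dictionary keys are hypothetical, and the e-mail parser assumes an address of the form login@department.university.tld; it returns the top-level domain as-is, because mapping it to a country would need extra knowledge not given on the slide.

```python
def age_group(age):
    """Set-grouping hierarchy: {20-39} = young, {40-59} = middle_aged."""
    if 20 <= age <= 39:
        return "young"
    if 40 <= age <= 59:
        return "middle_aged"
    return None  # the slide defines no group outside these ranges

def email_hierarchy(address):
    """Operation-derived hierarchy: split an e-mail address into levels.
    Assumes exactly three dot-separated labels after the '@'."""
    login, domain = address.split("@")
    department, university, tld = domain.split(".")
    return {"login": login, "department": department,
            "university": university, "tld": tld}

def low_profit_margin(price, cost):
    """Rule-based hierarchy: LowProfitMargin(X) holds when P1 - P2 < $50."""
    return (price - cost) < 50

print(age_group(25), age_group(45))
print(email_hierarchy("hagonzal@cs.uiuc.edu"))
print(low_profit_margin(price=120, cost=90))
```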


Primitive 4:                             38
Pattern Interestingness Measure

  Simplicity
  Certainty
  Utility
  Novelty




Primitive 5:                                      39
Presentation of Discovered Patterns

  Different backgrounds/usages may require
  different forms of representation
      E.g., rules, tables, crosstabs, pie/bar chart, etc.

  Concept hierarchy is also important
    Discovered knowledge might be more understandable
    when represented at a high level of abstraction
    Interactive drill up/down, pivoting, slicing and dicing
    provide different perspectives to data
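
A minimal sketch of drilling up along a concept hierarchy: the same figures are summarized at the city level and then rolled up to the country level. The sales rows and the two-level City < Country hierarchy are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows: (city, country, sales)
sales = [
    ("Milan", "Italy", 120),
    ("Rome", "Italy", 80),
    ("Paris", "France", 200),
    ("Lyon", "France", 50),
]

def roll_up(rows, level):
    """Total sales at the requested level of the hierarchy."""
    column = {"city": 0, "country": 1}[level]
    totals = defaultdict(int)
    for row in rows:
        totals[row[column]] += row[2]
    return dict(totals)

by_city = roll_up(sales, "city")        # detailed view (drill down)
by_country = roll_up(sales, "country")  # summarized view (drill up)
print(by_city)
print(by_country)
```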

  Different kinds of knowledge require different representation:
  association, classification, clustering, etc.




Integration of Data Mining and                 40
Data Warehousing

  Coupling of data mining systems, DBMS, and data warehouse systems
     No coupling, loose coupling, semi-tight coupling, tight coupling
  On-line analytical mining data
     integration of mining and OLAP technologies
  Interactive mining of multi-level knowledge
     Necessity of mining knowledge and patterns at different
     levels of abstraction by drilling/rolling, pivoting,
     slicing/dicing, etc.
  Integration of multiple mining functions
      Characterized classification, first clustering and then
     association




Coupling Data Mining with                     41
Databases and Data Warehouses

  No coupling—flat file processing, not recommended
  Loose coupling
     Fetching data from DB/DW
  Semi-tight coupling—enhanced DM performance
     Provide efficient implementations of a few data mining primitives
     in a DB/DW system, e.g., sorting, indexing, aggregation,
     histogram analysis, multiway join, precomputation of
     some stat functions
  Tight coupling—A uniform information processing
  environment
     DM is smoothly integrated into a DB/DW system, mining
     query is optimized based on mining query, indexing,
     query processing methods, etc.
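
The difference between loose and semi-tight coupling can be illustrated with Python's built-in sqlite3 module. The table, its columns, and its rows are hypothetical; the point is only where the aggregation runs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE loans (salary REAL, repaid INTEGER)")
conn.executemany(
    "INSERT INTO loans VALUES (?, ?)",
    [(18_000, 0), (22_000, 0), (30_000, 1), (45_000, 1)],
)

# Loose coupling: fetch the raw rows from the database and compute
# the statistic in the mining program.
rows = conn.execute("SELECT salary, repaid FROM loans").fetchall()
loose = {}
for outcome in (0, 1):
    salaries = [s for s, r in rows if r == outcome]
    loose[outcome] = sum(salaries) / len(salaries)

# Semi-tight coupling: let the database compute the aggregate, so only
# the summary leaves the DB system.
semi_tight = dict(
    conn.execute("SELECT repaid, AVG(salary) FROM loans GROUP BY repaid").fetchall()
)
conn.close()

print(loose)
print(semi_tight)
```

Both variants give the same averages; pushing the aggregation into the database avoids moving every row into the mining program, which matters as data volumes grow.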




What issues?
Major Issues in Data Mining                                43



  Mining methodology
      Mining different kinds of knowledge from diverse data types, e.g., bio,
      stream, Web
      Performance: efficiency, effectiveness, and scalability
      Pattern evaluation: the interestingness problem
      Incorporation of background knowledge
      Handling noise and incomplete data
      Parallel, distributed and incremental mining methods
      Integration of the discovered knowledge with existing knowledge:
      knowledge fusion

  User interaction
     Data mining query languages and ad-hoc mining
     Expression and visualization of data mining results
     Interactive mining of knowledge at multiple levels of abstraction

  Applications and social impacts
     Domain-specific data mining & invisible data mining
     Protection of data security, integrity, and privacy




Summary
Summary                                        45



 Data mining: Discovering interesting patterns
 from large amounts of data
 A natural evolution of database technology,
 in great demand, with wide applications
 A KDD process includes data cleaning, data integration,
 data selection, transformation, data mining,
 pattern evaluation, and knowledge presentation
 Data mining functionalities: characterization, discrimination,
 association, classification, clustering,
 outlier and trend analysis, etc.
 Data mining systems and architectures
 Major issues in data mining





More Related Content

What's hot

Supervised and unsupervised learning
Supervised and unsupervised learningSupervised and unsupervised learning
Supervised and unsupervised learningParas Kohli
 
Differences Between Machine Learning Ml Artificial Intelligence Ai And Deep L...
Differences Between Machine Learning Ml Artificial Intelligence Ai And Deep L...Differences Between Machine Learning Ml Artificial Intelligence Ai And Deep L...
Differences Between Machine Learning Ml Artificial Intelligence Ai And Deep L...SlideTeam
 
Artificial Neural Network | Deep Neural Network Explained | Artificial Neural...
Artificial Neural Network | Deep Neural Network Explained | Artificial Neural...Artificial Neural Network | Deep Neural Network Explained | Artificial Neural...
Artificial Neural Network | Deep Neural Network Explained | Artificial Neural...Simplilearn
 
Introduction to machine learning
Introduction to machine learningIntroduction to machine learning
Introduction to machine learningSangath babu
 
Unsupervised learning clustering
Unsupervised learning clusteringUnsupervised learning clustering
Unsupervised learning clusteringArshad Farhad
 
introduction to data mining tutorial
introduction to data mining tutorial introduction to data mining tutorial
introduction to data mining tutorial Salah Amean
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine LearningRahul Jain
 
Machine learning seminar presentation
Machine learning seminar presentationMachine learning seminar presentation
Machine learning seminar presentationsweety seth
 
Machine Learning
Machine LearningMachine Learning
Machine LearningRahul Kumar
 
Artificial Neural Networks - ANN
Artificial Neural Networks - ANNArtificial Neural Networks - ANN
Artificial Neural Networks - ANNMohamed Talaat
 
Intro to deep learning
Intro to deep learning Intro to deep learning
Intro to deep learning David Voyles
 
Data mining presentation.ppt
Data mining presentation.pptData mining presentation.ppt
Data mining presentation.pptneelamoberoi1030
 
AI Vs ML Vs DL PowerPoint Presentation Slide Templates Complete Deck
AI Vs ML Vs DL PowerPoint Presentation Slide Templates Complete DeckAI Vs ML Vs DL PowerPoint Presentation Slide Templates Complete Deck
AI Vs ML Vs DL PowerPoint Presentation Slide Templates Complete DeckSlideTeam
 

What's hot (20)

Supervised and unsupervised learning
Supervised and unsupervised learningSupervised and unsupervised learning
Supervised and unsupervised learning
 
Differences Between Machine Learning Ml Artificial Intelligence Ai And Deep L...
Differences Between Machine Learning Ml Artificial Intelligence Ai And Deep L...Differences Between Machine Learning Ml Artificial Intelligence Ai And Deep L...
Differences Between Machine Learning Ml Artificial Intelligence Ai And Deep L...
 
web mining
web miningweb mining
web mining
 
Machine learning
Machine learningMachine learning
Machine learning
 
Artificial Neural Network | Deep Neural Network Explained | Artificial Neural...
Artificial Neural Network | Deep Neural Network Explained | Artificial Neural...Artificial Neural Network | Deep Neural Network Explained | Artificial Neural...
Artificial Neural Network | Deep Neural Network Explained | Artificial Neural...
 
Machine learning
Machine learningMachine learning
Machine learning
 
Introduction to machine learning
Introduction to machine learningIntroduction to machine learning
Introduction to machine learning
 
Unsupervised learning clustering
Unsupervised learning clusteringUnsupervised learning clustering
Unsupervised learning clustering
 
introduction to data mining tutorial
introduction to data mining tutorial introduction to data mining tutorial
introduction to data mining tutorial
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
Machine learning seminar presentation
Machine learning seminar presentationMachine learning seminar presentation
Machine learning seminar presentation
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
Artificial Neural Networks - ANN
Artificial Neural Networks - ANNArtificial Neural Networks - ANN
Artificial Neural Networks - ANN
 
Intro to deep learning
Intro to deep learning Intro to deep learning
Intro to deep learning
 
Lecture #01
Lecture #01Lecture #01
Lecture #01
 
Data mining presentation.ppt
Data mining presentation.pptData mining presentation.ppt
Data mining presentation.ppt
 
3. mining frequent patterns
3. mining frequent patterns3. mining frequent patterns
3. mining frequent patterns
 
AI Vs ML Vs DL PowerPoint Presentation Slide Templates Complete Deck
AI Vs ML Vs DL PowerPoint Presentation Slide Templates Complete DeckAI Vs ML Vs DL PowerPoint Presentation Slide Templates Complete Deck
AI Vs ML Vs DL PowerPoint Presentation Slide Templates Complete Deck
 
1.Introduction to deep learning
1.Introduction to deep learning1.Introduction to deep learning
1.Introduction to deep learning
 

Viewers also liked

Data Mining: an Introduction
Data Mining: an IntroductionData Mining: an Introduction
Data Mining: an IntroductionAli Abbasi
 
Text mining presentation in Data mining Area
Text mining presentation in Data mining AreaText mining presentation in Data mining Area
Text mining presentation in Data mining AreaMahamudHasanCSE
 
Machine Learning and Data Mining: 19 Mining Text And Web Data
Machine Learning and Data Mining: 19 Mining Text And Web DataMachine Learning and Data Mining: 19 Mining Text And Web Data
Machine Learning and Data Mining: 19 Mining Text And Web DataPier Luca Lanzi
 
NumPy and SciPy for Data Mining and Data Analysis Including iPython, SciKits,...
NumPy and SciPy for Data Mining and Data Analysis Including iPython, SciKits,...NumPy and SciPy for Data Mining and Data Analysis Including iPython, SciKits,...
NumPy and SciPy for Data Mining and Data Analysis Including iPython, SciKits,...Ryan Rosario
 
Knowledge Discovery and Data Mining
Knowledge Discovery and Data MiningKnowledge Discovery and Data Mining
Knowledge Discovery and Data MiningAmritanshu Mehra
 
Data warehousing and data mining
Data warehousing and data miningData warehousing and data mining
Data warehousing and data miningSnehali Chake
 
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & KamberChapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kambererror007
 
Data Mining and Data Warehousing
Data Mining and Data WarehousingData Mining and Data Warehousing
Data Mining and Data WarehousingAmdocs
 
Data Mining and Business Intelligence Tools
Data Mining and Business Intelligence ToolsData Mining and Business Intelligence Tools
Data Mining and Business Intelligence ToolsMotaz Saad
 
Data Mining: Application and trends in data mining
Data Mining: Application and trends in data miningData Mining: Application and trends in data mining
Data Mining: Application and trends in data miningDataminingTools Inc
 
Ch 1 Intro to Data Mining
Ch 1 Intro to Data MiningCh 1 Intro to Data Mining
Ch 1 Intro to Data MiningSushil Kulkarni
 
Data mining slides
Data mining slidesData mining slides
Data mining slidessmj
 

Viewers also liked (18)

Data Mining: an Introduction
Data Mining: an IntroductionData Mining: an Introduction
Data Mining: an Introduction
 
Text mining presentation in Data mining Area
Text mining presentation in Data mining AreaText mining presentation in Data mining Area
Text mining presentation in Data mining Area
 
Analytics and Data Mining Industry Overview
Analytics and Data Mining Industry OverviewAnalytics and Data Mining Industry Overview
Analytics and Data Mining Industry Overview
 
Machine Learning and Data Mining: 19 Mining Text And Web Data
Machine Learning and Data Mining: 19 Mining Text And Web DataMachine Learning and Data Mining: 19 Mining Text And Web Data
Machine Learning and Data Mining: 19 Mining Text And Web Data
 
Data mining and_big_data_web
Data mining and_big_data_webData mining and_big_data_web
Data mining and_big_data_web
 
Big Data v Data Mining
Big Data v Data MiningBig Data v Data Mining
Big Data v Data Mining
 
Data Mining Overview
Data Mining OverviewData Mining Overview
Data Mining Overview
 
NumPy and SciPy for Data Mining and Data Analysis Including iPython, SciKits,...
NumPy and SciPy for Data Mining and Data Analysis Including iPython, SciKits,...NumPy and SciPy for Data Mining and Data Analysis Including iPython, SciKits,...
NumPy and SciPy for Data Mining and Data Analysis Including iPython, SciKits,...
 
Knowledge Discovery and Data Mining
Knowledge Discovery and Data MiningKnowledge Discovery and Data Mining
Knowledge Discovery and Data Mining
 
Data warehousing and data mining
Data warehousing and data miningData warehousing and data mining
Data warehousing and data mining
 
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & KamberChapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
 
Data Mining and Data Warehousing
Data Mining and Data WarehousingData Mining and Data Warehousing
Data Mining and Data Warehousing
 
Data Mining and Business Intelligence Tools
Data Mining and Business Intelligence ToolsData Mining and Business Intelligence Tools
Data Mining and Business Intelligence Tools
 
Data Mining: Application and trends in data mining
Data Mining: Application and trends in data miningData Mining: Application and trends in data mining
Data Mining: Application and trends in data mining
 
Ch 1 Intro to Data Mining
Ch 1 Intro to Data MiningCh 1 Intro to Data Mining
Ch 1 Intro to Data Mining
 
DATA WAREHOUSING AND DATA MINING
DATA WAREHOUSING AND DATA MININGDATA WAREHOUSING AND DATA MINING
DATA WAREHOUSING AND DATA MINING
 
Data mining
Data miningData mining
Data mining
 
Data mining slides
Data mining slidesData mining slides
Data mining slides
 

Similar to Lecture 01 Data Mining

DMTM Lecture 02 Data mining
DMTM Lecture 02 Data miningDMTM Lecture 02 Data mining
DMTM Lecture 02 Data miningPier Luca Lanzi
 
Introduction to-data-mining chapter 1
Introduction to-data-mining  chapter 1Introduction to-data-mining  chapter 1
Introduction to-data-mining chapter 1Mahmoud Alfarra
 
ODSC East 2017: Data Science Models For Good
ODSC East 2017: Data Science Models For GoodODSC East 2017: Data Science Models For Good
ODSC East 2017: Data Science Models For GoodKarry Lu
 
Ml pluss ejan2013
Ml pluss ejan2013Ml pluss ejan2013
Ml pluss ejan2013CS, NcState
 
20120718 linkedopendataandnextgenerationsciencemcguinnessesip final
20120718 linkedopendataandnextgenerationsciencemcguinnessesip final20120718 linkedopendataandnextgenerationsciencemcguinnessesip final
20120718 linkedopendataandnextgenerationsciencemcguinnessesip finalDeborah McGuinness
 
UNIT2-Data Mining.pdf
UNIT2-Data Mining.pdfUNIT2-Data Mining.pdf
UNIT2-Data Mining.pdfNancykumari47
 
First, Firster, Firstest: Three lessons from history on information overload
First, Firster, Firstest: Three lessons from history on information overloadFirst, Firster, Firstest: Three lessons from history on information overload
First, Firster, Firstest: Three lessons from history on information overloadmark madsen
 
PowerPoint Template
PowerPoint TemplatePowerPoint Template
PowerPoint Templatebutest
 
Ci2004-10.doc
Ci2004-10.docCi2004-10.doc
Ci2004-10.docbutest
 
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Data Science London
 
Upstate CSCI 525 Data Mining Chapter 1
Upstate CSCI 525 Data Mining Chapter 1Upstate CSCI 525 Data Mining Chapter 1
Upstate CSCI 525 Data Mining Chapter 1DanWooster1
 
Data mining & Decison Trees
Data mining & Decison TreesData mining & Decison Trees
Data mining & Decison TreesSelman Bozkır
 
Data mining Introduction
Data mining IntroductionData mining Introduction
Data mining IntroductionVijayasankariS
 
Exploring the Information Ecosystem
Exploring the Information EcosystemExploring the Information Ecosystem
Exploring the Information EcosystemRob Hanna, ECMs
 
Introduction to Data Mining and technologies .ppt
Introduction to Data Mining and technologies .pptIntroduction to Data Mining and technologies .ppt
Introduction to Data Mining and technologies .pptSangrangBargayary3
 


Lecture 01 Data Mining

  • 1. Data Mining: Data Mining and Text Mining (UIC 583 @ Politecnico)
  • 2. Lecture outline: Why Data Mining? What is Data Mining? What are the typical tasks? What are the primitives? What are the typical applications? What are the major issues? Prof. Pier Luca Lanzi
  • 4. Why Data Mining? “Necessity is the mother of invention.” Explosive growth of data: terabytes of available data; data collections and data availability; major sources of abundant data. Pressing need for the automated analysis of massive data.
  • 5. Evolution of Database Technology. 1960s: data collection, database creation, IMS and network DBMS. 1970s: relational data model, relational DBMS implementation. 1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.), application-oriented DBMS (spatial, scientific, engineering, etc.). 1990s: data mining, data warehousing, multimedia databases, and Web databases. 2000s: stream data management and mining, data mining and its applications, Web technology (XML, data integration), global information systems.
  • 6. Examples. In vitro fertilization. Given: embryos described by 60 features. Problem: selection of embryos that will survive. Data: historical records of embryos and outcome. Cow culling. Given: cows described by 700 features. Problem: selection of cows that should be culled. Data: historical records and farmers’ decisions.
  • 7. Examples. Customer attrition. Given: customer information for the past months. Problem: predict who is likely to attrite next month, or estimate customer value. Data: historical customer records. Credit assessment. Given: a loan application. Problem: predict whether the bank should approve the loan. Data: records from other loans.
  • 9. What is Data Mining? The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. Alternative names: Data Fishing, Data Dredging (1960-); Data Mining (1990-), used by DB and business; Knowledge Discovery in Databases (1989-), used by AI; Business Intelligence, Information Harvesting, Information Discovery, Knowledge Extraction, ... Currently, Data Mining and Knowledge Discovery are used interchangeably.
  • 10. Example: Credit Risk. [Figure: plot with axes loan and salary; points marked as repaid or not repaid]
  • 11. Example: Credit Risk. Rule: IF salary < k THEN not repaid. [Figure: plot with axes loan and salary; threshold k marked on the salary axis]
  • 12. Example: Credit Risk. Is it valid? The pattern has to be valid with respect to a certainty level (e.g., the rule holds in 86% of the cases). Is it novel? The value k should be previously unknown and not obvious. Is it useful? The pattern should provide information useful to the bank for assessing credit risk. Is it understandable?
  • 13. What is the general idea? Build computer programs that sift through databases automatically, seeking regularities or patterns. There will be problems: most patterns are banal and uninteresting; most patterns are spurious, inexact, or contingent on accidental coincidences in the particular dataset used; real data is imperfect, so some parts will be garbled and some will be missing. Algorithms need to be robust enough to cope with imperfect data and to extract regularities that are inexact but useful.
  • 14. What are the related fields? [Figure: Knowledge Discovery and Data Mining surrounded by Machine Learning, Visualization, Statistics, and Databases]
  • 15. Statistics, Machine Learning, and Data Mining. Statistics: more theory-based, focused on testing hypotheses. Machine learning: more heuristic, focused on building programs that learn, more general than Data Mining. Knowledge Discovery: integrates theory and heuristics, focuses on the entire process of discovery, including data cleaning, learning, integration and visualization. Data Mining: focuses on the algorithms to extract patterns from data. Distinctions are blurred!
  • 16. Why Not Traditional Data Analysis? Tremendous amount of data: high scalability is needed to handle terabytes of data. High dimensionality of data: a micro-array may have tens of thousands of dimensions. High complexity of data: data streams and sensor data; time-series data, temporal data, sequence data; structured data, graphs, social networks and multi-linked data; heterogeneous databases and legacy databases; spatial, spatiotemporal, multimedia, text and Web data; software programs, scientific simulations. New and sophisticated applications.
  • 17. Knowledge Discovery Process. [Figure: raw data passing through selection, cleaning, transformation, mining, and evaluation]
  • 18. Knowledge Discovery Process. What are the main steps? Learning the application domain to extract relevant prior knowledge and goals. Data selection. Data cleaning. Data reduction and transformation. Mining: select the mining approach (classification, regression, association, clustering, etc.); choose the mining algorithm(s); perform mining, i.e., search for patterns of interest. Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc. Use of discovered knowledge.
  • 19. Knowledge Discovery and Business Intelligence. [Figure: pyramid with increasing potential to support business decisions toward the top. Layers from top to bottom: Making Decisions (End User); Data Presentation, Visualization Techniques (Business Analyst); Data Mining, Information Discovery (Data Analyst); Data Exploration: Statistical Analysis, Querying and Reporting; OLAP, MDA; Data Warehouses / Data Marts (DBA); Data Sources: Paper, Files, Information Providers, Database Systems, OLTP]
  • 20. Architecture of a Typical Knowledge Discovery System. [Figure: components from top to bottom: graphical user interface; pattern evaluation; data mining engine; database or data warehouse server; DB and DW; with a knowledge base (KB) alongside]
  • 22. Major Data Mining Tasks. Classification: predicting an item class. Clustering: finding clusters in data. Associations: frequently occurring events. Visualization: to facilitate human discovery. Summarization: describing a group. Deviation detection: finding changes. Estimation: predicting a continuous value. Link analysis: finding relationships.
  • 23. Data Mining Tasks: classification. Rule: IF salary < k THEN not repaid. [Figure: plot with axes loan and salary; threshold k on the salary axis; two points marked “?”]
  • 24. Data Mining Tasks: classification. Classification and prediction: finding models (functions) that describe and distinguish classes or concepts. The goal is to describe the data or to make future predictions. E.g., classify countries based on climate, or classify cars based on gas mileage. Presentation: decision tree, classification rule, neural network. Prediction: predict some unknown numerical values.
  • 25. Data Mining Tasks: clustering
  • 26. Data Mining Tasks: clustering. Cluster analysis: the class label is unknown. Group data to form new classes, e.g., cluster houses to find distribution patterns. Clustering is based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity.
  • 27. Data Mining Tasks: associations. [Figure: example shopping baskets containing items such as Bread, Steak, Jam, Peanuts, Soda, Milk, Fruit, Chips, Cheese, and Yogurt] Is there something interesting?
  • 28. Data Mining Tasks: associations. Association rule mining finds interesting associations and/or correlation relationships among large sets of data items. E.g., 98% of people who purchase tires and auto accessories also get automotive services done.
  • 29. Data Mining Tasks: others. Outlier analysis. Outlier: a data object that does not comply with the general behavior of the data. It can be considered as noise or an exception, but it is quite useful in fraud detection and rare events analysis. Trend and evolution analysis. Trend and deviation: regression analysis. Sequential pattern mining, periodicity analysis. Similarity-based analysis. Text Mining, Graph Mining, Data Streams. Other pattern-directed or statistical analyses.
  • 30. Are all the “Discovered” Patterns Interesting? Data Mining may generate thousands of patterns; not all of them are interesting. Suggested approach: human-centered, query-based, focused mining. Interestingness measures: a pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm. Objective vs. subjective interestingness measures. Objective: based on statistics and structures of patterns, e.g., support, confidence, etc. Subjective: based on the user’s belief in the data, e.g., unexpectedness, novelty, etc.
  • 31. Can we find all and only interesting patterns? Completeness: find all the interesting patterns. Can a data mining system find all the interesting patterns? Association vs. classification vs. clustering. Optimization: search for only interesting patterns. Can a data mining system find only the interesting patterns? Approaches: (1) first generate all the patterns and then filter out the uninteresting ones; (2) generate only the interesting patterns (mining query optimization).
  • 32. Data Mining tasks. General functionality: descriptive data mining; predictive data mining. Different views, different classifications: kinds of data to be mined; kinds of knowledge to be discovered; kinds of techniques utilized; kinds of applications adapted.
  • 34. Primitives that Define a Data Mining Task. Task-relevant data. Type of knowledge to be mined. Background knowledge. Pattern interestingness measurements. Visualization/presentation of discovered patterns.
  • 35. Primitive 1: Task-Relevant Data. Database or data warehouse name. Database tables or data warehouse cubes. Condition for data selection. Relevant attributes or dimensions. Data grouping criteria.
  • 36. Primitive 2: Types of Knowledge to Be Mined. Characterization. Discrimination. Association. Classification/prediction. Clustering. Outlier analysis. Other data mining tasks.
  • 37. Primitive 3: Background Knowledge. A typical kind of background knowledge: concept hierarchies. Schema hierarchy, e.g., Street < City < ProvinceOrState < Country. Set-grouping hierarchy, e.g., {20-39} = young, {40-59} = middle_aged. Operation-derived hierarchy, e.g., for the email address hagonzal@cs.uiuc.edu: login-name < department < university < country. Rule-based hierarchy, e.g., LowProfitMargin(X) <= Price(X, P1) and Cost(X, P2) and (P1 - P2) < $50.
  • 38. Primitive 4: Pattern Interestingness Measure. Simplicity. Certainty. Utility. Novelty.
  • 39. Primitive 5: Presentation of Discovered Patterns. Different backgrounds/usages may require different forms of representation, e.g., rules, tables, crosstabs, pie/bar charts, etc. Concept hierarchy is also important: discovered knowledge might be more understandable when represented at a high level of abstraction; interactive drill up/down, pivoting, slicing and dicing provide different perspectives on the data. Different kinds of knowledge require different representations: association, classification, clustering, etc.
  • 40. Integration of Data Mining and Data Warehousing. Coupling of data mining systems, DBMS, and data warehouse systems: no coupling, loose coupling, semi-tight coupling, tight coupling. On-line analytical mining: integration of mining and OLAP technologies. Interactive mining of multi-level knowledge: necessity of mining knowledge and patterns at different levels of abstraction by drilling/rolling, pivoting, slicing/dicing, etc. Integration of multiple mining functions: characterized classification, first clustering and then association.
  • 41. Coupling Data Mining with Databases and Data Warehouses. No coupling: flat file processing, not recommended. Loose coupling: fetching data from DB/DW. Semi-tight coupling: enhanced DM performance; provide efficient implementations of a few data mining primitives in a DB/DW system, e.g., sorting, indexing, aggregation, histogram analysis, multiway join, precomputation of some statistical functions. Tight coupling: a uniform information processing environment; DM is smoothly integrated into a DB/DW system, and mining queries are optimized based on mining query analysis, indexing, query processing methods, etc.
  • 43. Major Issues in Data Mining. Mining methodology: mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web; performance (efficiency, effectiveness, and scalability); pattern evaluation (the interestingness problem); incorporation of background knowledge; handling noise and incomplete data; parallel, distributed and incremental mining methods; integration of the discovered knowledge with existing knowledge (knowledge fusion). User interaction: data mining query languages and ad-hoc mining; expression and visualization of data mining results; interactive mining of knowledge at multiple levels of abstraction. Applications and social impacts: domain-specific data mining and invisible data mining; protection of data security, integrity, and privacy.
  • 45. Summary. Data mining: discovering interesting patterns from large amounts of data. A natural evolution of database technology, in great demand, with wide applications. A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation. Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc. Data mining systems and architectures. Major issues in data mining.
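
The slides on associations and on interestingness name support and confidence as objective measures for a pattern. As a minimal sketch of how these two measures could be computed for a rule "antecedent => consequent", the Python below uses a small set of toy baskets. The baskets, the function names `support` and `confidence`, and the constant `BASKETS` are all assumptions made for illustration; they are not taken from the lecture.

```python
from typing import FrozenSet, Iterable, List

# Hypothetical toy baskets, invented for illustration (not the slide's data).
BASKETS: List[FrozenSet[str]] = [
    frozenset({"bread", "jam", "milk"}),
    frozenset({"bread", "jam"}),
    frozenset({"soda", "chips", "peanuts"}),
    frozenset({"bread", "milk", "fruit"}),
    frozenset({"soda", "chips"}),
]


def support(itemset: Iterable[str], baskets: List[FrozenSet[str]]) -> float:
    """Fraction of baskets that contain every item in `itemset`."""
    items = frozenset(itemset)
    if not baskets:
        return 0.0
    return sum(1 for basket in baskets if items <= basket) / len(baskets)


def confidence(
    antecedent: Iterable[str],
    consequent: Iterable[str],
    baskets: List[FrozenSet[str]],
) -> float:
    """support(antecedent and consequent together) / support(antecedent).

    Returns 0.0 when the antecedent never occurs, so the ratio is undefined.
    """
    a = frozenset(antecedent)
    c = frozenset(consequent)
    base = support(a, baskets)
    if base == 0.0:
        return 0.0
    return support(a | c, baskets) / base


if __name__ == "__main__":
    # Rule under test: people who buy bread also buy jam.
    print("support(bread, jam):", support({"bread", "jam"}, BASKETS))
    print("confidence(bread => jam):", confidence({"bread"}, {"jam"}, BASKETS))
```

A rule such as the "98% of people who purchase tires and auto accessories also get automotive services done" example corresponds to a confidence of 0.98 under this definition; the sketch only evaluates a given rule and does not search for frequent itemsets.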