Introduction to Data Mining
       for Newbies



                         Nov. 2nd, 2012
                          @echojuliett
Google Datacenter
@Douglas County, Georgia

“These colorful pipes send and receive water for cooling our facility.
Also pictured is a G-Bike, the vehicle of choice for team members to get
around outside our data centers.”




Source: http://www.google.com/about/datacenters/gallery/#/tech/10
Eunjeong Lucy Park
PhDs, Data scientist @SNU DMLab



A person who lives on lattes.




Find me at:
http://dmlab.snu.ac.kr, http://lucypark.kr




“All scientists are data scientists.”
                - Monica Rogati, Senior Research Scientist @LinkedIn




                                           Source: http://xkcd.com/242/
“Data is everywhere.”

                   Tweets
                                                      Cell phone logs




                     Social networking data


                                                Politician data


        Web documents




 Manufacturing fault data                     Credit card transactions



“Data mining is…”

   •   “…the process of exploration and analysis, by automatic or semi-automatic means,
       of large quantities of data in order to discover meaningful patterns and rules.”
                                                                                        - Berry and Linoff, 1997




Source: Berry and Linoff, Data Mining Techniques: For Marketing, Sales, and Customer Support, New York: Wiley, 1997.
“Data mining is…”

•   “…the belief in data.”
                                                                 - @echojuliett, 2012




•   Inductive reasoning
      Mathematical induction: prove for k=1, assume for k, then prove for k+1
      Induction vs. prejudice: # of cases
      Ex: What is your hobby?


1.   Basic Concepts of Data Mining

2.   Origins of Data Mining

3.   Data Mining Tools

4.   Masters of Data Mining




Data types




       Source: http://www.tipforest.com/t/83




      Structured data                          Unstructured data
(the general) Data mining process

  DATA warehouse → Selection → Target data → Preprocessing → Preprocessed data
    → Data mining → Patterns → Interpretation → KNOWLEDGE

  The DATA warehouse belongs to some domain (Marketing, Finance, Manufacturing, etc.)
Selection

  • Data exploration
     – How many variables?
           •   Independent variables, dependent variables, …

           •   Continuous variables, categorical variables, …

     – How many records?

     – What distribution?

     – …



  • Variable selection & dimensionality reduction
     – Ex: Step-wise selection, PCA (Principal Component Analysis)
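
As a minimal sketch of the dimensionality-reduction step, the snippet below applies PCA to a small synthetic data set. The data, the function name `pca`, and the use of NumPy are assumptions for illustration; none of them come from the slides.

```python
import numpy as np

def pca(X, n_components):
    """Project X (records x variables) onto its top principal components."""
    X_centered = X - X.mean(axis=0)              # PCA works on centered variables
    cov = np.cov(X_centered, rowvar=False)       # variable-by-variable covariance
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]               # columns are principal directions
    explained = eigvals[order] / eigvals.sum()   # share of variance per component
    return X_centered @ components, explained

# Synthetic example: 3 variables, where the third is nearly a copy of the first.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
X = np.column_stack([x1, x2, x1 + 0.01 * rng.normal(size=200)])

scores, explained = pca(X, n_components=2)
print(scores.shape)      # (200, 2)
print(explained.sum())   # close to 1: two components carry almost all the variance
```

Because one variable is redundant here, two components are enough; on real data the explained-variance shares are what guide how many components to keep.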
Preprocessing

  • “Partitioning” the data
     – training data & validation data (& test data …)




                                  Data set




              Training data                      Validation data
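
The partitioning step can be sketched in a few lines of plain Python. The function name `partition`, the 60/20/20 proportions, and the fixed seed are assumptions chosen for illustration, not values from the slides.

```python
import random

def partition(records, train_frac=0.6, valid_frac=0.2, seed=42):
    """Shuffle records and split them into training, validation, and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)    # fixed seed makes the split repeatable
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]      # whatever remains is held out
    return train, valid, test

train, valid, test = partition(range(100))
print(len(train), len(valid), len(test))     # 60 20 20
```

Shuffling before splitting matters when the records are stored in a meaningful order (for example, by date or by class).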
Preprocessing

  • Beware of “overfitting”




 Source: Bishop, PRML, p.7
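
To make the warning concrete, here is a hedged sketch in the spirit of the polynomial example in PRML: polynomials of increasing degree are fitted to a few noisy samples of a sine curve, and the error on the training data is compared with the error on separate validation data. The noise level, sample sizes, degrees, and helper names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of sin(2*pi*x) on [0, 1] (a synthetic stand-in)."""
    x = np.linspace(0.0, 1.0, n)
    t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)
    return x, t

def rmse(coeffs, x, t):
    """Root-mean-square error of the fitted polynomial on (x, t)."""
    return float(np.sqrt(np.mean((np.polyval(coeffs, x) - t) ** 2)))

x_train, t_train = make_data(10)
x_valid, t_valid = make_data(100)

errors = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, t_train, degree)   # least-squares polynomial fit
    errors[degree] = (rmse(coeffs, x_train, t_train),
                      rmse(coeffs, x_valid, t_valid))
    print(degree, errors[degree])
```

The degree-9 polynomial passes through all ten training points, so its training error is essentially zero; a validation error that is much larger than the training error is the usual symptom of overfitting.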
Data mining methods

  Predictive methods
   • Classification: learns a method for predicting the instance class from pre-labeled (classified) instances
   • Regression: an attempt to predict a continuous attribute

  Descriptive methods
   • Clustering: finds “natural” grouping of instances given un-labeled data
   • Association Rules: method for discovering interesting relations between variables in large DBs
Regression
  • Linear regression, k-nearest neighbors (k-NN), artificial neural networks (ANN), …


  • Polynomial curve fitting
        •   The basic form: min …
        •   The advanced form: min …



  • Example:
        •   Tomorrow’s stock price = f (recent prices, economic indicators, …)
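
The two objective functions on this slide did not survive as text. The sketch below assumes they follow the polynomial curve fitting treatment in PRML: the basic form minimizes the sum of squared errors between the polynomial y(x, w) and the targets t, and the advanced form adds a penalty lam * ||w||^2 on the coefficients. That reading, the function names, and the toy data are assumptions.

```python
import numpy as np

def fit_polynomial(x, t, degree, lam=0.0):
    """Minimize sum_n (y(x_n, w) - t_n)^2 + lam * ||w||^2 over w.

    lam=0 gives plain least squares (assumed "basic form");
    lam>0 gives the regularized fit (assumed "advanced form").
    """
    # Design matrix with columns x^0, x^1, ..., x^degree.
    Phi = np.vander(x, degree + 1, increasing=True)
    # Normal equations: (Phi^T Phi + lam * I) w = Phi^T t.
    A = Phi.T @ Phi + lam * np.eye(degree + 1)
    return np.linalg.solve(A, Phi.T @ t)

def predict(w, x):
    """Evaluate the polynomial with coefficients w (lowest order first) at x."""
    return np.vander(x, len(w), increasing=True) @ w

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
t = 1.0 + 2.0 * x                        # noise-free line t = 1 + 2x
w_basic = fit_polynomial(x, t, degree=1)
w_ridge = fit_polynomial(x, t, degree=1, lam=1.0)
print(w_basic)                           # approximately [1. 2.]
print(w_ridge)                           # coefficients shrunk toward zero
print(predict(w_basic, np.array([5.0]))) # approximately [11.]
```

Note that this sketch penalizes the intercept together with the other coefficients; many implementations leave the intercept out of the penalty.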
Classification
  • Regression with a categorical dependent variable


  • Naïve Bayes classification, decision trees, ANNs, SVMs, …




  • Ex: E-mail spam detection (does an incoming message go to the inbox or to spam?)
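
As an illustration of the spam example, here is a minimal Naïve Bayes classifier in plain Python with add-one (Laplace) smoothing. The four training messages are made up, and the function names are hypothetical; a real filter would need far more data and better tokenization.

```python
import math
from collections import Counter

def train_naive_bayes(messages):
    """messages: list of (text, label). Returns class counts, word counts, vocabulary."""
    class_counts = Counter(label for _, label in messages)
    word_counts = {label: Counter() for label in class_counts}
    for text, label in messages:
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab

def classify(text, model):
    """Return the label with the highest posterior log-probability."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        # log P(label) + sum of log P(word | label), with add-one smoothing
        score = math.log(n / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy, made-up training messages.
training = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting at noon", "inbox"),
    ("lunch meeting tomorrow", "inbox"),
]
model = train_naive_bayes(training)
print(classify("cheap money", model))        # spam
print(classify("meeting tomorrow", model))   # inbox
```

The "naïve" part is the assumption that words are independent given the class, which is why the per-word log-probabilities can simply be added.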
Clustering
  • Grouping of similar objects
  • Unsupervised, Exploratory Knowledge Discovery


  • k-means, hierarchical clustering, SOM, …




  • Ex: Politician segmentation
     [Figure: Jaccard Similarity based Hierarchical Clustering Dendrogram (D9). Vertical axis runs from 0 to 0.8; the leaves are numbered politicians, grouped as Democratic United Party (liberal), Grand National Party (conservative), and Others.]
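
The idea behind a Jaccard-similarity dendrogram can be sketched with a small agglomerative clustering routine. The slides do not say which features were compared, so the co-sponsored bill sets below are hypothetical, as are the function names and the choice of average linkage.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def agglomerative(items, n_clusters):
    """Average-linkage agglomerative clustering on Jaccard similarity.

    items: dict mapping a name to a set of features.
    """
    clusters = [[name] for name in items]

    def linkage(c1, c2):
        sims = [jaccard(items[a], items[b]) for a in c1 for b in c2]
        return sum(sims) / len(sims)

    while len(clusters) > n_clusters:
        # Find the most similar pair of clusters and merge it.
        i, j = max(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters

# Hypothetical data: the bills each legislator co-sponsored.
bills = {
    "A": {1, 2, 3},
    "B": {2, 3, 4},
    "C": {1, 2, 3, 4},
    "D": {7, 8, 9},
    "E": {8, 9},
}
result = agglomerative(bills, n_clusters=2)
print(result)   # [['A', 'C', 'B'], ['D', 'E']]
```

Recording the similarity at which each merge happens, instead of stopping at a fixed number of clusters, yields the heights drawn in a dendrogram.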
Association Rules




 Source: http://lucypark.tistory.com/48
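
Association rules are usually scored by support (how often the items occur together) and confidence (how often the right-hand side appears when the left-hand side does). The sketch below computes both for rules between single items; the market baskets, thresholds, and function names are made up for illustration.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions containing every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def pair_rules(transactions, min_support=0.4, min_confidence=0.7):
    """Return rules (lhs, rhs, support, confidence) between single items."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        s = support({a, b}, transactions)
        if s < min_support:
            continue                      # the pair is too rare to matter
        for lhs, rhs in ((a, b), (b, a)):
            confidence = s / support({lhs}, transactions)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, s, confidence))
    return rules

# Made-up market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]
rules = pair_rules(baskets)
for lhs, rhs, s, c in rules:
    print(f"{lhs} -> {rhs}: support={s:.2f}, confidence={c:.2f}")
```

In this toy data every basket with beer also contains diapers, so the rule beer -> diapers has confidence 1.0, while diapers -> beer is weaker because diapers also appear without beer.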
Data mining methods

  Predictive methods
   • Classification: learns a method for predicting the instance class from pre-labeled (classified) instances
   • Regression: an attempt to predict a continuous attribute

  Descriptive methods
   • Clustering: finds “natural” grouping of instances given un-labeled data
   • Association Rules: method for discovering interesting relations between variables in large DBs
Pop quiz!

 Source: http://www.cis.hut.fi/research/som-research/worldmap.html
 Source: http://popupcity.net/2009/04/why-are-that-many-logos-blue/
2.   Origins of Data Mining
Historical Note
  Data Fishing, Data Dredging: 1960-
     • used by statisticians (as a pejorative)



  Knowledge Discovery in Databases (KDD): 1989-
     • used by Artificial Intelligence (AI), Machine Learning (ML) communities



  Data Mining, Data Analytics: 1990-
     • used in DB communities, business



  Big data: 2000-
Comparisons
  • Data mining
  • Statistics
  • Machine learning
  • Pattern recognition
  • …
3.   Data Mining Tools
R




Source: http://www.kdnuggets.com/2012/05/top-analytics-data-mining-big-data-software.html
SAS Enterprise Miner (“E-miner”)
XLMiner
  • 15-day trial version available at http://www.solver.com/xlminer-data-mining
  • Useful for prototyping


  • Supports:
      •   Preprocessing
           •   Data partitioning
           •   Missing data imputation
           •   Categorical data transformation
           •   PCA (Principal Component Analysis)
      •   Algorithms
           •   Multiple linear regression
           •   k-NN (k nearest neighbors)
           •   CART (classification and regression trees)
           •   ANN (artificial neural networks)
           •   Discriminant analysis
           •   Logistic regression
           •   Naïve Bayes classification
           •   Association rules
           •   k-means clustering
           •   Hierarchical clustering
More…
 • Mathworks MATLAB / GNU Octave
     Most DM algorithms are preinstalled
     Relatively easy to learn



 • General purpose programming languages
     For example, C, Java, Python, etc.
     Packages such as Orange (http://orange.biolab.si/) for Python are available
     May be a better fit for tasks like natural language processing


 • Even more…
     Try visiting http://www.kdnuggets.com/software/suites.html
4.   Masters of Data Mining
Foreign warriors




  •   Mitchell (Carnegie Mellon University)

  •   Vapnik (NEC Labs)

  •   Bishop (Microsoft Cambridge)

  •   Smola (Yahoo, Australian National University)

  •   Ng (Stanford University)
Domestic warriors




  •   조성준 (Seoul National University)

  •   조재희 (Kwangwoon University)

  •   조성배 (Yonsei University)

  •   이성임 (Dankook University)

  •   김성범 (Korea University)
References
  •   [1] Duda, Hart, Stork, Pattern Classification, 2nd ed., Wiley, 2001.

  •   [2] Bishop, Pattern Recognition and Machine Learning (PRML), Springer, 2006.

  •   [3] Shmueli, Patel, Bruce, Data Mining for Business Intelligence, 2nd ed., Wiley, 2010.
Any Questions?


                 ?
