2. Google Datacenter
@Douglas County, Georgia
“These colorful pipes send and receive water for cooling our facility.
Also pictured is a G-Bike, the vehicle of choice for team members to get
around outside our data centers.”
Source: http://www.google.com/about/datacenters/gallery/#/tech/10
3. Eunjeong Lucy Park
PhDs, Data scientist @SNU DMLab
A person who lives on lattes.
Find me at:
http://dmlab.snu.ac.kr, http://lucypark.kr
4. “All scientists are data scientists.”
- Monica Rogati, Senior Research Scientist @LinkedIn
Source: http://xkcd.com/242/
5. “Data is everywhere.”
Tweets
Cell phone logs
Social networking data
Politician data
Web documents
Manufacturing fault data
Credit card transactions
6. “Data mining is…”
• “…the process of exploration and analysis, by automatic or semi-automatic means,
of large quantities of data in order to discover meaningful patterns and rules.”
- Berry and Linoff, 1997
Source: Berry and Linoff, Data Mining Techniques: For Marketing, Sales, and Customer Support, New York: Wiley, 1997.
7. “Data mining is…”
• “…the belief in data.”
- @echojuliett, 2012
• Inductive reasoning
Mathematical induction: prove for k=1, assume for k, then prove for k+1
Induction vs. prejudice: # of cases
Ex: What is your hobby?
9. 1. Basic Concepts of Data Mining
2. Origins of Data Mining
3. Data Mining Tools
4. Masters of Data Mining
10. Data types
Source: http://www.tipforest.com/t/83
Structured data Unstructured data
11. (the general) Data mining process
DATA warehouse of some domain (Marketing, Finance, Manufacturing, etc.)
→ Selection → Target data
→ Preprocessing → Preprocessed data
→ Data mining → Patterns
→ Interpretation → KNOWLEDGE
12. Selection
• Data exploration
– How many variables?
• Independent variables, dependent variables, …
• Continuous variables, categorical variables, …
– How many records?
– What distribution?
– …
• Variable selection & dimensionality reduction
– Ex: Step-wise selection, PCA (Principal Component Analysis)
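The data exploration questions above (how many variables, how many records, what distribution) can be sketched in a few lines of Python. This is only a minimal illustration: the `records` data and the `explore` helper are hypothetical and not part of the original slides, and any numeric variable is simply treated as continuous.

```python
import statistics

# Hypothetical toy data set: each dict is one record, each key one variable.
records = [
    {"age": 34, "income": 52000.0, "segment": "A", "churned": "no"},
    {"age": 45, "income": 61000.0, "segment": "B", "churned": "no"},
    {"age": 29, "income": 38000.0, "segment": "A", "churned": "yes"},
    {"age": 51, "income": 75000.0, "segment": "C", "churned": "no"},
    {"age": 38, "income": 47000.0, "segment": "B", "churned": "yes"},
]


def explore(rows):
    """Count records and variables, then summarize each variable's distribution."""
    summary = {"n_records": len(rows), "n_variables": len(rows[0]), "variables": {}}
    for name in rows[0]:
        values = [row[name] for row in rows]
        if all(isinstance(v, (int, float)) for v in values):
            # Numeric values are treated as a continuous variable.
            summary["variables"][name] = {
                "kind": "continuous",
                "mean": statistics.mean(values),
                "stdev": statistics.stdev(values),
                "min": min(values),
                "max": max(values),
            }
        else:
            # Everything else is treated as categorical: count each level.
            counts = {}
            for v in values:
                counts[v] = counts.get(v, 0) + 1
            summary["variables"][name] = {"kind": "categorical", "counts": counts}
    return summary


report = explore(records)
print(report["n_records"], "records,", report["n_variables"], "variables")
for name, info in report["variables"].items():
    print(name, info)
```

In this toy set, `churned` would play the role of the dependent variable and the other three the independent variables; that assignment is a modeling decision, not something the code infers.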
13. Preprocessing
• “Partitioning” the data
– training data & validation data (& test data …)
Data set
Training data Validation data
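A minimal sketch of partitioning, assuming a simple random split. The `partition` helper, the 60/40 ratio, and the seed are illustrative choices, not values taken from the slides.

```python
import random


def partition(data, train_fraction=0.6, seed=0):
    """Shuffle a copy of the data and split it into training and validation parts."""
    rows = list(data)
    random.Random(seed).shuffle(rows)  # seeded so the split is reproducible
    cut = round(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


data_set = list(range(10))  # hypothetical records, identified by index
training, validation = partition(data_set, train_fraction=0.6, seed=42)
print("training:", training)
print("validation:", validation)
```

A third test partition could be produced by calling `partition` again on the validation part; models are fitted on the training data only, and the held-out part is used to check how well they generalize.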
14. Preprocessing
• Beware of “overfitting”
Source: Bishop, PRML, p.7
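The figure cited from Bishop illustrates overfitting with polynomial curves. The sketch below shows the same effect with a different, simpler technique, k-NN regression, on hand-made hypothetical data: with k=1 the model memorizes the noisy training targets (zero training error) but does worse on validation data than the smoother k=2 model.

```python
def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def mse(train, data, k):
    """Mean squared error of k-NN predictions (fitted on train) over data."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in data) / len(data)


# Hypothetical data: the true relation is y = x.
# Training targets carry alternating noise of +1 and -1; validation targets are noise-free.
training = [(float(i), i + (1.0 if i % 2 == 0 else -1.0)) for i in range(10)]
validation = [(i + 0.4, i + 0.4) for i in range(9)]

for k in (1, 2):
    print(
        f"k={k}: training MSE={mse(training, training, k):.3f}, "
        f"validation MSE={mse(training, validation, k):.3f}"
    )
```

Low training error alone says little about a model; this is why the partitioned validation data is needed.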
15. Data mining methods
Predictive methods
• Classification: Learns a method for predicting the instance class from pre-labeled (classified) instances
• Regression: An attempt to predict a continuous attribute
Descriptive methods
• Clustering: Finds “natural” grouping of instances given un-labeled data
• Association Rules: Method for discovering interesting relations between variables in large DBs
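Of the four families on this slide, association rules are perhaps the easiest to illustrate compactly. The sketch below computes two standard measures of how "interesting" a rule is, support and confidence, on hypothetical market-basket data; the transactions and helper names are made up for illustration.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent): support(A and C) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)


# Rule under test: {diapers} -> {beer}
rule_support = support({"diapers", "beer"}, transactions)
rule_confidence = confidence({"diapers"}, {"beer"}, transactions)
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

Algorithms for large databases avoid scanning every candidate itemset like this, but the two measures they report are the same.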
16. Regression
• Linear regression, k-nearest neighbors (k-NN), artificial neural networks (ANN), …
• Polynomial curve fitting
• The basic form
min
• The advanced form
min
• Example:
• Tomorrow’s stock price = f (recent prices, economic indicators, …)
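The two minimization problems named on this slide can be sketched as follows, assuming they follow the polynomial curve fitting example in Bishop's PRML (the source cited on the overfitting slide): a sum-of-squares error for the basic form, and the same error plus a quadratic penalty on the weights for the advanced form. The notation ($y(x_n, \mathbf{w})$, $t_n$, $N$, $M$, $\lambda$) is borrowed from that textbook and is an assumption, not something stated on the slide.

```latex
% Polynomial model of order M (assumed notation)
y(x, \mathbf{w}) = \sum_{j=0}^{M} w_j x^j

% Basic form (assumed): minimize the sum-of-squares error over N training pairs (x_n, t_n)
\min_{\mathbf{w}} \; E(\mathbf{w})
  = \frac{1}{2} \sum_{n=1}^{N} \bigl( y(x_n, \mathbf{w}) - t_n \bigr)^2

% Advanced form (assumed): add a penalty on large weights, with strength \lambda
\min_{\mathbf{w}} \; \tilde{E}(\mathbf{w})
  = \frac{1}{2} \sum_{n=1}^{N} \bigl( y(x_n, \mathbf{w}) - t_n \bigr)^2
  + \frac{\lambda}{2} \lVert \mathbf{w} \rVert^2
```

Under this reading, the penalty term discourages the large, oscillating coefficients that high-order polynomials pick up when they overfit.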
20. Data mining methods
Predictive methods
• Classification: Learns a method for predicting the instance class from pre-labeled (classified) instances
• Regression: An attempt to predict a continuous attribute
Descriptive methods
• Clustering: Finds “natural” grouping of instances given un-labeled data
• Association Rules: Method for discovering interesting relations between variables in large DBs
28. 1. Basic Concepts of Data Mining
2. Origins of Data Mining
3. Data Mining Tools
4. Masters of Data Mining
29. Historical Note
Data Fishing, Data Dredging: 1960-
• used by statisticians (as a pejorative)
Knowledge Discovery in Databases (KDD): 1989-
• used by Artificial Intelligence (AI), Machine Learning (ML) communities
Data Mining, Data Analytics: 1990-
• used in DB communities, business
Big data: 2000-
34. XLMiner
• 15-day trial version available at http://www.solver.com/xlminer-data-mining
• Useful for prototyping
• Supports:
• Preprocessing
• Data partitioning
• Missing data imputation
• Categorical data transformation
• PCA (Principal Component Analysis)
• Algorithms
• Multiple linear regression
• k-NN (k nearest neighbors)
• CART (classification and regression trees)
• ANN (artificial neural networks)
• Discriminant analysis
• Logistic regression
• Naïve Bayes classification
• Association rules
• k-means clustering
• Hierarchical clustering
35. More…
• MathWorks MATLAB / GNU Octave
Most DM algorithms are preinstalled
Relatively easy to learn
• General purpose programming languages
For example, C, Java, Python, etc.
Packages such as Orange (http://orange.biolab.si/) for Python are available
May be better suited to tasks like natural language processing
• Even more…
Try visiting http://www.kdnuggets.com/software/suites.html
36. 1. Basic Concepts of Data Mining
2. Origins of Data Mining
3. Data Mining Tools
4. Masters of Data Mining
37. Foreign warriors
• Mitchell (Carnegie Mellon University)
• Vapnik (NEC Labs)
• Bishop (Microsoft Cambridge)
• Smola (Yahoo, Australian National University)
• Ng (Stanford University)