Prof. Pier Luca Lanzi	

Data Representation
Data Mining and Text Mining (UIC 583 @ Politecnico di Milano)

Readings	

•  “Data Mining and Analysis” – Chapter 1	

•  “Mining of Massive Datasets” – Chapter 1	


Describing the Data

Contact Lenses Data

Age             Spectacle prescription  Astigmatism  Tear production rate  Recommended lenses
Young           Myope                   No           Reduced               None
Young           Myope                   No           Normal                Soft
Young           Myope                   Yes          Reduced               None
Young           Myope                   Yes          Normal                Hard
Young           Hypermetrope            No           Reduced               None
Young           Hypermetrope            No           Normal                Soft
Young           Hypermetrope            Yes          Reduced               None
Young           Hypermetrope            Yes          Normal                Hard
Pre-presbyopic  Myope                   No           Reduced               None
Pre-presbyopic  Myope                   No           Normal                Soft
Pre-presbyopic  Myope                   Yes          Reduced               None
Pre-presbyopic  Myope                   Yes          Normal                Hard
Pre-presbyopic  Hypermetrope            No           Reduced               None
Pre-presbyopic  Hypermetrope            No           Normal                Soft
Pre-presbyopic  Hypermetrope            Yes          Reduced               None
Pre-presbyopic  Hypermetrope            Yes          Normal                None
Presbyopic      Myope                   No           Reduced               None
Presbyopic      Myope                   No           Normal                None
Presbyopic      Myope                   Yes          Reduced               None
Presbyopic      Myope                   Yes          Normal                Hard
Presbyopic      Hypermetrope            No           Reduced               None
Presbyopic      Hypermetrope            No           Normal                Soft
Presbyopic      Hypermetrope            Yes          Reduced               None
Presbyopic      Hypermetrope            Yes          Normal                None

•  Data are often abstracted as an n × d data matrix, with n rows and
d columns

•  Rows are called instances, examples, records, transactions,
objects, points, feature-vectors, etc. 	

•  Columns are called attributes, properties, features, dimensions,
variables, fields, etc.	
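
Written out, the n × d data matrix can be sketched as follows. The symbols are assumptions made here for illustration: D names the matrix and x_{ij} is the value of the j-th attribute (column) for the i-th instance (row).

```latex
D = \begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1d} \\
x_{21} & x_{22} & \cdots & x_{2d} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{nd}
\end{pmatrix}
```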


Instances, Attributes, Concepts	

•  Instances (observations, cases)

§ The atomic elements of information from a dataset	

§ Also known as records, prototypes, or examples	

•  Attributes (variables)

§ Measures aspects of an instance	

§ Also known as features or variables	

§ Each instance is composed of a certain number of attributes	

•  Concepts	

§ Special content inside the data	

§ Kind of things that can be learned	

§ Intelligible and operational concept description	


CPU Performance Data

      Cycle time (ns)  Main memory (Kb)  Cache (Kb)  Channels       Performance
      MYCT             MMIN     MMAX     CACH        CHMIN  CHMAX   PRP
1     125              256      6000     256         16     128     198
2     29               8000     32000    32          8      32      269
…
208   480              512      8000     32          0      0       67
209   480              1000     4000     0           0      0       45

Two Versions of the Weather Data

Outlook   Temperature  Humidity  Windy  Play
Sunny     Hot          High      False  No
Sunny     Hot          High      True   No
Overcast  Hot          High      False  Yes
Rainy     Mild         Normal    False  Yes
…         …            …         …      …

Outlook   Temperature  Humidity  Windy  Play
Sunny     85           85        False  No
Sunny     80           90        True   No
Overcast  83           86        False  Yes
Rainy     75           80        False  Yes
…         …            …         …      …

Attribute Types

Attributes	

•  Numeric Attributes	

§ Real-valued or integer-valued domain	

§ Interval-scaled when only differences are meaningful
(e.g., temperature)	

§ Ratio-scaled when differences and ratios are meaningful 
(e.g., Age)	

•  Categorical Attributes	

§ Set-valued domain composed of a set of symbols	

§ Nominal when only equality is meaningful
(e.g., domain(Sex) = { M, F})	

§ Ordinal when both equality (are two values the same?) and
inequality (is one value less than another?) are meaningful
(e.g., domain(Education) = { High School, BS, MS, PhD})	


Numerical Attributes	

•  Interval-scaled attributes are not only ordered but also measured in fixed and equal units

•  Examples	

§ Attribute “temperature” expressed in degrees	

§ Attribute “year”	

•  Characteristics	

§ Difference of two values makes sense	

§ Sum or product doesn’t make sense	

§ Zero point is not defined	

•  Sometimes they are divided into “discrete” and “continuous”	


Ratio Attributes	

•  Ratio quantities are ones for which the 
measurement scheme defines a zero point	

•  Example	

§ Attribute “distance”	

•  Characteristics	

§ Distance between an object and itself is zero	

§ Ratio quantities are treated as real numbers	

§ All mathematical operations are allowed	

§ Is there an “inherently” defined zero point?	

§ It depends on scientific knowledge 	


Nominal Attributes (or Categorical)	

•  Values are distinct symbols	

•  Values themselves serve only as labels or names	

•  Example	

§ Attribute “outlook” from weather data	

§ Values: “sunny”, “overcast”, and “rainy”	

•  Characteristics	

§ No relation is implied among nominal values	

§ No ordering	

§ No distance measure 	

§ Only equality tests can be performed	


Ordinal Attributes	

•  Impose order on values	

•  No distance between values defined	

•  Example	

§ The attribute “temperature” in weather data	

§ Values: “hot” > “mild” > “cool”

•  Characteristics	

§ Addition and subtraction don’t make sense	

§ Distinction between nominal and ordinal not always clear (e.g.
attribute “outlook”)	


Nominal or Ordinal?	

•  Attribute “age” nominal	

§ If age = young and astigmatic = no
and tear production rate = normal
then recommendation = soft	

•  Attribute “age” ordinal
(e.g. “young” < “pre-presbyopic” < “presbyopic”)

§ If age ≤ pre-presbyopic and astigmatic = no
and tear production rate = normal
then recommendation = soft	
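
The two readings of “age” can be sketched in a few lines. This is a minimal illustration, not part of the original slides: the names `AGE_ORDER`, `rule_nominal`, and `rule_ordinal` are hypothetical, and the ordering young < pre-presbyopic < presbyopic is taken from the example above.

```python
# Ordered domain of the "age" attribute (ordinal interpretation).
AGE_ORDER = ["young", "pre-presbyopic", "presbyopic"]

def rule_nominal(age, astigmatic, tear_rate):
    """Nominal reading: only an equality test on age is allowed."""
    if age == "young" and astigmatic == "no" and tear_rate == "normal":
        return "soft"
    return None  # the rule does not fire

def rule_ordinal(age, astigmatic, tear_rate):
    """Ordinal reading: age <= pre-presbyopic compares positions in AGE_ORDER."""
    at_most_pre = AGE_ORDER.index(age) <= AGE_ORDER.index("pre-presbyopic")
    if at_most_pre and astigmatic == "no" and tear_rate == "normal":
        return "soft"
    return None

print(rule_nominal("pre-presbyopic", "no", "normal"))  # None
print(rule_ordinal("pre-presbyopic", "no", "normal"))  # soft
```

With the ordinal reading a single rule covers both the young and the pre-presbyopic cases, which the nominal reading would need two rules to express.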


Why Specify Attribute Types?

•  Some algorithms fit some specific data types best	

•  Express the best possible patterns in the data

•  Make the most adequate comparisons	

•  Example 	

§ Outlook > “sunny” does not make sense, while

§ Temperature > “cool” or

§ Humidity > 70 does

•  Additional uses of attribute type	

§ Check for valid values	

§ Deal with missing values, etc.	
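
One way to make such checks mechanical is to record, for each attribute type, which comparisons are meaningful. The sketch below is an assumption-laden illustration: the `ALLOWED` table, the `SCHEMA` for the weather data, and the function `is_meaningful` are hypothetical names, not part of any tool mentioned in these slides.

```python
# Comparisons that are meaningful for each attribute type.
ALLOWED = {
    "nominal": {"=="},            # only equality
    "ordinal": {"==", "<", ">"},  # equality and order
    "numeric": {"==", "<", ">"},  # equality and order (plus arithmetic)
}

# Hypothetical schema for the weather data, chosen for this sketch.
SCHEMA = {"outlook": "nominal", "temperature": "ordinal", "humidity": "numeric"}

def is_meaningful(attribute, operator):
    """Return True if a test `attribute operator value` respects the attribute type."""
    return operator in ALLOWED[SCHEMA[attribute]]

print(is_meaningful("outlook", ">"))      # False
print(is_meaningful("temperature", ">"))  # True
print(is_meaningful("humidity", ">"))     # True
```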


Missing Values

Why Do Missing Values Exist?

•  Faulty equipment, incorrect measurements, missing cells in manual
data entry, censored/anonymous data	

•  Review scores for movies, books, etc.	

•  Very frequent in questionnaires for medical scenarios	

•  In practice, a low rate of missing values may be suspicious	

•  Interview data (“Did you ever …”)	


Missing Values	

•  Frequently indicated by out-of-range entries (e.g. max/min float)	

•  Missing value may have significance in itself	

§ E.g. missing test in a medical examination	

•  Most schemes assume that is not the case	

§ “missing” may need to be coded as additional value 	

•  Does absence of value have some significance?	

§ If it does, “missing” is a separate value 	

	

§ If it does not, “missing” must be treated in a special way	
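
A minimal sketch of both cases, assuming hypothetical sentinel codes (-999 and the largest representable float) and hypothetical helper names `decode_missing` and `as_category`: out-of-range entries are first mapped to an explicit marker, and the marker becomes a value of its own only when absence is informative.

```python
import sys

# Hypothetical sentinel codes that a data source might use for "missing".
SENTINELS = {-999.0, sys.float_info.max, -sys.float_info.max}

def decode_missing(values, sentinels=SENTINELS):
    """Replace sentinel entries with None so they are not mistaken for real data."""
    return [None if v in sentinels else v for v in values]

def as_category(value):
    """Treat absence as a separate value when it carries significance."""
    return "missing" if value is None else value

readings = [36.6, -999.0, 37.1, sys.float_info.max]
print(decode_missing(readings))  # [36.6, None, 37.1, None]
print([as_category(v) for v in ["positive", None, "negative"]])
```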


What Types of Missing Values?	

•  Missing completely at random (MCAR)	

§  The probability that an example has a missing value for an attribute depends on
neither the observed data nor the missing data

§  Example: each respondent is asked only a random sample of the questions in the whole questionnaire

•  Missing at random (MAR)

§  The probability that an example has a missing value for an attribute depends on the
observed data, but does not depend on the missing data

§  That is, the propensity for a data point to be missing is not related to the
missing data, but it is related to some of the observed data

§  Whether or not someone answered question #13 on a survey has nothing to do with the missing
values, but it does have to do with the values of some other variable

§  Example: respondents in service occupations are less likely to report income

•  Not missing at random (NMAR)

§  The probability that an example has a missing value for an attribute depends on the missing
values

§  Example: respondents with high income are less likely to report income
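
The three mechanisms can be told apart by what the probability of missingness is allowed to depend on. The simulation below is only a sketch on synthetic data: the records, the probabilities (0.2, 0.5, 0.1), the income threshold, and the names `make_masks` and `missing_rate` are all assumptions made for illustration.

```python
import random

def make_masks(records, seed=0):
    """Return three boolean lists (True = income is missing), one per mechanism."""
    rng = random.Random(seed)
    mcar, mar, nmar = [], [], []
    for occupation, income in records:
        # MCAR: the same probability for every record.
        mcar.append(rng.random() < 0.2)
        # MAR: depends only on an observed attribute (occupation).
        mar.append(rng.random() < (0.5 if occupation == "service" else 0.1))
        # NMAR: depends on the value that would go missing (income).
        nmar.append(rng.random() < (0.5 if income > 80_000 else 0.1))
    return mcar, mar, nmar

# Synthetic (occupation, income) pairs, made up for this sketch.
data_rng = random.Random(1)
records = [(data_rng.choice(["service", "office"]), data_rng.randint(20_000, 120_000))
           for _ in range(10_000)]
mcar, mar, nmar = make_masks(records)

def missing_rate(mask, keep):
    """Fraction of missing entries among the records selected by `keep`."""
    chosen = [m for m, k in zip(mask, keep) if k]
    return sum(chosen) / len(chosen)

high = [income > 80_000 for _, income in records]
low = [not h for h in high]
print("MCAR:", round(missing_rate(mcar, high), 2), round(missing_rate(mcar, low), 2))
print("NMAR:", round(missing_rate(nmar, high), 2), round(missing_rate(nmar, low), 2))
```

Under MCAR the two rates should be close to each other, while under NMAR the high-income group loses noticeably more values; in real data the second comparison cannot be made, because the missing incomes are not observed.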


Dealing with Missing Values	

•  Use what you know 	

§ Why data is missing	

§ Distribution of missing data	

•  Decide on the best strategy to yield the least biased estimates	

§ Deletion Methods (listwise deletion, pairwise deletion)	

§ Single Imputation Methods (mean/mode substitution, dummy variable
method, single regression)	

§ Model-Based Methods (maximum likelihood, multiple imputation)


Strategies for Handling Missing Values

•  The handling of missing data depends on the type of missingness (MCAR, MAR, NMAR)

•  Discarding all the examples with missing values

§ Simplest approach	

§ Allows the use of unmodified data mining methods	

§ Only practical if there are few examples with missing values. Otherwise, it
can introduce bias	

•  Fill in the missing value manually :-)

•  Convert the missing values into a new value	

§ Use a special value for it	

§ Add an attribute that indicates if value is missing or not	

§ Greatly increases the difficulty of the data mining process	

•  Imputation methods	

§ Assign a value to the missing one, based on the rest of the dataset. Use
the unmodified data mining methods.	


Listwise Deletion (Complete Case Analysis)
	

•  Only analyze cases with available data
on each variable	

•  Simple, but reduces the data	

•  Analyses remain comparable, since they all use the same cases

•  Does not use all the information	

•  Estimates may be biased if the data are not
MCAR
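
Listwise deletion amounts to a single filter over the rows. A minimal sketch on toy data, assuming missing values are marked with `None`; the function name `listwise_delete` and the rows are hypothetical.

```python
def listwise_delete(rows):
    """Keep only the rows in which every attribute is observed (no None)."""
    return [row for row in rows if all(value is not None for value in row)]

# Toy data: (age, income, score); None marks a missing value.
rows = [
    (25, 30_000, 0.7),
    (40, None, 0.4),
    (31, 52_000, None),
    (58, 61_000, 0.9),
]
complete = listwise_delete(rows)
print(complete)  # [(25, 30000, 0.7), (58, 61000, 0.9)]
```

Half of the toy rows are discarded even though each of them lacks only one value, which is the loss of information noted above.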


Pairwise deletion (Available Case Analysis)	

•  Analysis with all cases in which
the variables of interest are
present	

•  Advantage	

§ Keeps as many cases as
possible for each analysis	

§ Uses all information
possible with each analysis	

•  Disadvantage	

§ Can’t compare analyses
because the sample is different
each time
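
Pairwise deletion selects the cases separately for each analysis. The sketch below uses toy rows and a hypothetical helper `pairwise_cases`, with `None` as the missing marker; it counts how many cases are available for each pair of attributes.

```python
from itertools import combinations

def pairwise_cases(rows, i, j):
    """Value pairs from the rows in which both attribute i and attribute j are observed."""
    return [(r[i], r[j]) for r in rows if r[i] is not None and r[j] is not None]

# Toy data: (age, income, score); None marks a missing value.
rows = [
    (25, 30_000, 0.7),
    (40, None, 0.4),
    (31, 52_000, None),
    (58, 61_000, 0.9),
]
names = ["age", "income", "score"]
for i, j in combinations(range(len(names)), 2):
    used = pairwise_cases(rows, i, j)
    print(names[i], names[j], len(used))
```

Each pair of attributes is analysed on a different subset of rows (three, three, and two cases here), which is why results from different analyses are hard to compare.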


Imputation methods	

•  Extract a model from the dataset to perform the imputation	

•  Suitable for MCAR and, to a lesser extent, for MAR	

•  Not suitable for NMAR type of missing data	

•  For NMAR we need to go back to the source of the data to
obtain more information	

•  Survey of imputation methods available at
http://sci2s.ugr.es/MVDM/index.php
http://sci2s.ugr.es/MVDM/biblio.php	


Single Imputation Methods	

•  Mean/mode substitution (most common value)	

§ Replace missing value with sample mean or mode	

§ Run analyses as if all cases were complete

§ Advantages: Can use complete case analysis methods	

§ Disadvantages: Reduces variability	

•  Dummy variable control	

§ Create an indicator for missing value (1=value is missing for observation;
0=value is observed for observation)	

§ Impute missing values to a constant (such as the mean)	

§ Include missing indicator in the algorithm	

§ Advantage: uses all available information about missing observation	

§ Disadvantage: results in biased estimates, not theoretically driven	

•  Regression Imputation	

§ Replaces missing values with predicted score from a regression equation.	
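
The three single imputation methods can be sketched with the standard library alone. All function names below are hypothetical, `None` is assumed to mark a missing value, and the regression variant uses a simple least-squares line with one predictor, which is only one possible choice of regression model.

```python
from statistics import mean, mode

def impute_mean(values):
    """Mean substitution for a numeric attribute (None = missing)."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

def impute_mode(values):
    """Mode substitution for a categorical attribute."""
    fill = mode(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

def with_indicator(values):
    """Dummy variable control: (imputed value, 1 if it was missing else 0)."""
    return [(x, int(v is None)) for x, v in zip(impute_mean(values), values)]

def impute_regression(xs, ys):
    """Fill missing ys with predictions from a least-squares line fitted on complete pairs."""
    pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    mx = mean(x for x, _ in pairs)
    my = mean(y for _, y in pairs)
    slope = (sum((x - mx) * (y - my) for x, y in pairs)
             / sum((x - mx) ** 2 for x, _ in pairs))
    return [my + slope * (x - mx) if y is None else y for x, y in zip(xs, ys)]

humidity = [85, None, 86, 80]
print(impute_mean(humidity))
print(with_indicator(humidity))
print(impute_mode(["sunny", None, "sunny", "rainy"]))
print(impute_regression([1, 2, 3, 4], [2.0, None, 6.0, 8.0]))
```

Mean substitution places every imputed value at the centre of the distribution, which is the reduction in variability listed as its disadvantage.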


Multiple Imputation Process 
	


Do Not Impute (DNI)	

•  Simply use the default policy of the data mining method	

•  Works only if the policy exists	


Inaccurate Values

Inaccurate Values	

•  Data have usually not been collected with mining in mind

•  Errors and omissions that don’t affect original purpose of data
(e.g. age of customer)	

•  Typographical errors in nominal attributes, 
thus values need to be checked for consistency	

•  Typographical and measurement errors in numeric attributes,
thus outliers need to be identified	

•  Errors may be deliberate (e.g. wrong zip codes)	
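
Both checks can be sketched briefly: nominal values are compared against the declared domain, and numeric values are screened for outliers. The outlier test below uses Tukey's 1.5 × IQR rule, which is an assumed choice and not one prescribed by the slides; the function names and sample values are hypothetical.

```python
from statistics import quantiles

def invalid_nominal(values, domain):
    """Values that are not in the declared domain (likely typos)."""
    return sorted({v for v in values if v not in domain})

def iqr_outliers(values, k=1.5):
    """Values further than k * IQR outside the first and third quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if v < low or v > high]

outlook = ["sunny", "suny", "rainy", "overcast", "Rainy"]
print(invalid_nominal(outlook, {"sunny", "overcast", "rainy"}))  # ['Rainy', 'suny']

ages = [23, 31, 35, 38, 41, 44, 47, 52, 250]
print(iqr_outliers(ages))  # [250]
```

A flagged value is only a candidate for inspection; whether it is an error or a genuine extreme case still has to be decided with knowledge of the domain.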


The Geometric View

The Geometrical View of the Data	

•  When the data matrix contains only numerical values	

§ Every row can be viewed as a point in a d-dimensional space

§ Every column can be viewed as a point in an n-dimensional space
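
A small numeric example of the two views, using the temperature and humidity values of the first three rows of the numeric weather data. The variable names and the use of Euclidean distance are assumptions made for this sketch.

```python
import math

# Data matrix with n = 3 instances (rows) and d = 2 attributes (columns):
# temperature and humidity from the numeric weather data.
D = [
    [85, 85],
    [80, 90],
    [83, 86],
]

rows = D                                  # n points in d-dimensional space
columns = [list(col) for col in zip(*D)]  # d points in n-dimensional space

def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(len(rows), len(rows[0]))        # 3 2
print(len(columns), len(columns[0]))  # 2 3
print(euclidean(rows[0], rows[1]))    # distance between the first two instances
```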


The Probabilistic View

Data Formats

Data Format	

•  Most commercial tools have their own proprietary format	

•  Most tools import Excel files and comma-separated values (CSV) files

Year,Make,Model,Length
1997,Ford,E350,2.34
2000,Mercury,Cougar,2.38

Year;Make;Model;Length
1997;Ford;E350;2,34
2000;Mercury;Cougar;2,38
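
The two variants differ in the field delimiter and in the decimal separator, so a reader has to be told both. A minimal sketch with Python's standard `csv` module; the helper `read_cars` and its `decimal` parameter are hypothetical names introduced here.

```python
import csv
import io

COMMA = "Year,Make,Model,Length\n1997,Ford,E350,2.34\n2000,Mercury,Cougar,2.38\n"
SEMICOLON = "Year;Make;Model;Length\n1997;Ford;E350;2,34\n2000;Mercury;Cougar;2,38\n"

def read_cars(text, delimiter, decimal):
    """Parse the CSV text and convert Year to int and Length to float."""
    rows = []
    for record in csv.DictReader(io.StringIO(text), delimiter=delimiter):
        record["Year"] = int(record["Year"])
        record["Length"] = float(record["Length"].replace(decimal, "."))
        rows.append(record)
    return rows

a = read_cars(COMMA, ",", ".")
b = read_cars(SEMICOLON, ";", ",")
print(a == b)  # True
```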

Attribute-Relation File Format (ARFF)
	


%
% ARFF file for weather data with some numeric features
%
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play? {yes, no}
@data
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
...
http://www.cs.waikato.ac.nz/~ml/weka/arff.html
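
To make the structure of the file concrete, the sketch below reads a small subset of ARFF: comment lines, `@attribute` declarations with nominal or numeric types, and dense `@data` rows. It is a hand-written illustration, not Weka's parser; the function `parse_arff` is a hypothetical name, and quoting, strings, dates, and sparse rows are deliberately not handled.

```python
ARFF = """\
% ARFF file for weather data with some numeric features
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play? {yes, no}
@data
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
"""

def parse_arff(text):
    """Parse a small ARFF subset into (attributes, rows)."""
    attributes, rows, in_data = [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        lower = line.lower()
        if lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):
                # Nominal attribute: keep the list of allowed values.
                spec = [v.strip() for v in spec.strip("{}").split(",")]
            attributes.append((name, spec))
        elif lower.startswith("@data"):
            in_data = True
        elif in_data:
            values = [v.strip() for v in line.split(",")]
            rows.append([float(v) if spec == "numeric" else v
                         for v, (_, spec) in zip(values, attributes)])
    return attributes, rows

attributes, rows = parse_arff(ARFF)
print(attributes[0])  # ('outlook', ['sunny', 'overcast', 'rainy'])
print(rows[0])        # ['sunny', 85.0, 85.0, 'false', 'no']
```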

Additional Attribute Types	

•  ARFF supports string attributes:

@attribute description string

•  Similar to nominal attributes but list of values 
is not pre-specified

•  ARFF also supports date attributes:

@attribute today date

•  Uses the ISO-8601 combined date 
and time format yyyy-MM-dd'T'HH:mm:ss

Additional Attribute Types	

•  ARFF supports sparse data; for instance, the following examples,

0, 26, 0, 0, 0, 0, 63, 0, 0, 0, "class A"
0, 0, 0, 42, 0, 0, 0, 0, 0, 0, "class B"

•  can also be represented as

{1 26, 6 63, 10 "class A"}
{3 42, 10 "class B"}
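
The conversion from the dense to the sparse form can be sketched as follows. The function `to_sparse` is a hypothetical helper; it assumes 0-based attribute indices and that only zero entries are omitted, as in the example above.

```python
def to_sparse(row):
    """Format a dense row as a sparse ARFF instance: {index value, ...}."""
    items = []
    for index, value in enumerate(row):
        if value == 0:
            continue  # zero entries are omitted
        text = f'"{value}"' if isinstance(value, str) else str(value)
        items.append(f"{index} {text}")
    return "{" + ", ".join(items) + "}"

print(to_sparse([0, 26, 0, 0, 0, 0, 63, 0, 0, 0, "class A"]))  # {1 26, 6 63, 10 "class A"}
print(to_sparse([0, 0, 0, 42, 0, 0, 0, 0, 0, 0, "class B"]))   # {3 42, 10 "class B"}
```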

Missing Values in ARFF	

•  Missing values are denoted by a question mark (?)

@relation labor
@attribute 'duration' real
@attribute 'wage-increase-first-year' real
@attribute 'wage-increase-second-year' real
@attribute 'wage-increase-third-year' real
@attribute 'cost-of-living-adjustment' {'none','tcf','tc'}
@attribute 'working-hours' real
@attribute 'pension' {'none','ret_allw','empl_contr'}
@attribute 'standby-pay' real
@attribute 'shift-differential' real
@attribute 'education-allowance' {'yes','no'}
@attribute 'statutory-holidays' real
@attribute 'vacation' {'below_average','average','generous'}
@attribute 'longterm-disability-assistance' {'yes','no'}
@attribute 'contribution-to-dental-plan' {'none','half','full'}
@attribute 'bereavement-assistance' {'yes','no'}
@attribute 'contribution-to-health-plan' {'none','half','full'}
@attribute 'class' {'bad','good'}
@data
1,5,?, ?, ?,40, ?, ?,2, ?,11,'average', ?, ?,'yes',?,'good'
2,4.5,5.8, ?, ?,35,'ret_allw', ?, ?,'yes',11,'below_average', ?,'full', ?,'full','good'
?, ?, ?, ?, ?,38,'empl_contr', ?,5, ?,11,'generous','yes','half','yes','half','good'
3,3.7,4,5,'tc', ?, ?, ?, ?,'yes', ?, ?, ?, ?,'yes', ?,'good'
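
When such a file is read, each `?` has to be turned into an explicit missing marker. A minimal sketch for one data line of the labor example, assuming that no quoted value contains a comma; `parse_row` is a hypothetical helper, not part of Weka.

```python
def parse_row(line):
    """Split one ARFF data line, strip quotes, and map '?' to None."""
    values = []
    for token in line.split(","):
        token = token.strip().strip("'")
        values.append(None if token == "?" else token)
    return values

line = "1,5,?, ?, ?,40, ?, ?,2, ?,11,'average', ?, ?,'yes',?,'good'"
row = parse_row(line)
print(len(row))                     # 17
print(sum(v is None for v in row))  # 9
```

More than half of the 17 attributes are missing in this single instance, so discarding incomplete examples would leave very little of this dataset.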

Attribute Types and Interpretation	

•  Interpretation of attribute types in ARFF depends 
on the mining scheme	

•  Numeric attributes are interpreted as 	

§ Ordinal scales if less-than and greater-than are used 	

§ Ratio scales if distance calculations are performed
(normalization/standardization may be required)	

•  Instance-based schemes define distance between nominal values
(0 if values are equal, 1 otherwise)	

•  Integers in some given data file: nominal, ordinal, or ratio scale?	
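
The distance used by an instance-based scheme can be sketched for instances that mix nominal and numeric attributes. This is one common formulation under stated assumptions, not the definition used by any particular tool: numeric attributes are min-max normalized to [0, 1], nominal attributes contribute 0 or 1, and the contributions are combined as in the Euclidean distance. The function names are hypothetical.

```python
import math

def min_max_normalize(column):
    """Rescale a numeric column to [0, 1] so that no attribute dominates the distance."""
    low, high = min(column), max(column)
    return [(v - low) / (high - low) if high > low else 0.0 for v in column]

def mixed_distance(a, b, numeric):
    """Distance between two instances; `numeric` holds one boolean per attribute."""
    total = 0.0
    for x, y, is_num in zip(a, b, numeric):
        # Numeric: difference of values. Nominal: 0 if equal, 1 otherwise.
        diff = (x - y) if is_num else (0.0 if x == y else 1.0)
        total += diff ** 2
    return math.sqrt(total)

temperature = min_max_normalize([85, 80, 83])
outlook = ["sunny", "sunny", "overcast"]
instances = list(zip(outlook, temperature))
print(mixed_distance(instances[0], instances[1], [False, True]))  # 1.0
print(mixed_distance(instances[0], instances[2], [False, True]))
```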


DSPL: Dataset Publishing Language	

•  Open format by Google available at
http://code.google.com/apis/publicdata/	

•  Use existing data: add an XML metadata file to existing CSV files

•  Read by the Google Public Data Explorer, which includes
animated bar chart, motion chart, and map visualization	

•  Allows linking to concepts in other datasets

•  Geo-enabled: allows adding latitude and longitude data to your
concept definitions	


Model Representation

Predictive Model Markup Language	

•  XML-based markup language developed by the Data Mining
Group (DMG) to provide a way for applications to define models
related to predictive analytics and data mining	

•  The goal is to share models between applications	

•  Vendor-independent method of defining models	

•  Allows models to be exchanged between applications

•  PMML Components: data dictionary, data transformations, model,
mining schema, targets, output	


Data Repository

Publicly Available Datasets	

•  UCI repository	

§ http://archive.ics.uci.edu/ml/	

§ Probably the most famous collection of datasets	

•  Kaggle	

§ http://www.kaggle.com/	

§ It is not a static repository of datasets, but a site that manages
Data Mining competitions	

§ Example of the modern concept of crowdsourcing	


Publicly Available Datasets	

•  KDNuggets 	

§ http://www.kdnuggets.com/datasets/	

•  PSPbenchmarks	

§ http://www.infobiotic.net/PSPbenchmarks/	

§ Datasets derived from Protein Structure Prediction problems	

§ Interesting benchmarks because they can be parametrised in a
very large variety of ways	

•  Pascal Large Scale Learning Challenge	

§ http://largescale.ml.tu-berlin.de/about/	

 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdf
 
Third Battle of Panipat detailed notes.pptx
Third Battle of Panipat detailed notes.pptxThird Battle of Panipat detailed notes.pptx
Third Battle of Panipat detailed notes.pptx
 

DMTM 2015 - 03 Data Representation

  • 1. Prof. Pier Luca Lanzi. Data Representation. Data Mining and Text Mining (UIC 583 @ Politecnico di Milano)
  • 2. Readings
      • “Data Mining and Analysis” – Chapter 1
      • “Mining of Massive Datasets” – Chapter 1
  • 3. Describing the Data
  • 4. Contact Lenses Data
      Age             Spectacle prescription  Astigmatism  Tear production rate  Recommended lenses
      Young           Myope                   No           Reduced               None
      Young           Myope                   No           Normal                Soft
      Young           Myope                   Yes          Reduced               None
      Young           Myope                   Yes          Normal                Hard
      Young           Hypermetrope            No           Reduced               None
      Young           Hypermetrope            No           Normal                Soft
      Young           Hypermetrope            Yes          Reduced               None
      Young           Hypermetrope            Yes          Normal                Hard
      Pre-presbyopic  Myope                   No           Reduced               None
      Pre-presbyopic  Myope                   No           Normal                Soft
      Pre-presbyopic  Myope                   Yes          Reduced               None
      Pre-presbyopic  Myope                   Yes          Normal                Hard
      Pre-presbyopic  Hypermetrope            No           Reduced               None
      Pre-presbyopic  Hypermetrope            No           Normal                Soft
      Pre-presbyopic  Hypermetrope            Yes          Reduced               None
      Pre-presbyopic  Hypermetrope            Yes          Normal                None
      Presbyopic      Myope                   No           Reduced               None
      Presbyopic      Myope                   No           Normal                None
      Presbyopic      Myope                   Yes          Reduced               None
      Presbyopic      Myope                   Yes          Normal                Hard
      Presbyopic      Hypermetrope            No           Reduced               None
      Presbyopic      Hypermetrope            No           Normal                Soft
      Presbyopic      Hypermetrope            Yes          Reduced               None
      Presbyopic      Hypermetrope            Yes          Normal                None
  • 5.
      • Data are often abstracted as an n×d data matrix, with n rows and d columns, given as [matrix shown as a figure on the slide]
      • Rows are called instances, examples, records, transactions, objects, points, feature-vectors, etc.
      • Columns are called attributes, properties, features, dimensions, variables, fields, etc.
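As a minimal sketch of the data-matrix abstraction, the snippet below stores a toy dataset as a list of rows. The three rows and the helper names `instance` and `attribute` are assumptions chosen for illustration; they do not come from the slides.

```python
# Hypothetical toy data: three instances described by two numeric attributes.
attribute_names = ["temperature", "humidity"]
D = [
    [85.0, 85.0],
    [80.0, 90.0],
    [83.0, 86.0],
]

n = len(D)      # number of rows (instances)
d = len(D[0])   # number of columns (attributes)


def instance(i):
    """Return row i of the data matrix, i.e. one feature vector."""
    return D[i]


def attribute(j):
    """Return column j of the data matrix: one value per instance."""
    return [row[j] for row in D]
```

Here `instance(1)` is the second example as a 2-dimensional feature vector, while `attribute(1)` collects the humidity values of all three examples.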
  • 7. Instances, Attributes, Concepts
      • Instances (observations, cases)
          § The atomic elements of information from a dataset
          § Also known as records, prototypes, or examples
      • Attributes (variables)
          § Measure aspects of an instance
          § Also known as features or variables
          § Each instance is composed of a certain number of attributes
      • Concepts
          § Special content inside the data
          § Kind of things that can be learned
          § Intelligible and operational concept description
  • 8. CPU Performance Data
      MYCT = cycle time (ns); MMIN, MMAX = main memory (Kb); CACH = cache (Kb); CHMIN, CHMAX = channels; PRP = performance
      #    MYCT  MMIN  MMAX   CACH  CHMIN  CHMAX  PRP
      1    125   256   6000   256   16     128    198
      2    29    8000  32000  32    8      32     269
      …
      208  480   512   8000   32    0      0      67
      209  480   1000  4000   0     0      0      45
  • 9. Two Versions of the Weather Data
      Nominal version:
      Outlook   Temperature  Humidity  Windy  Play
      Sunny     Hot          High      False  No
      Sunny     Hot          High      True   No
      Overcast  Hot          High      False  Yes
      Rainy     Mild         Normal    False  Yes
      …         …            …         …      …
      Numeric version:
      Outlook   Temperature  Humidity  Windy  Play
      Sunny     85           85        False  No
      Sunny     80           90        True   No
      Overcast  83           86        False  Yes
      Rainy     75           80        False  Yes
      …         …            …         …      …
  • 10. Attribute Types
  • 11. Attributes
      • Numeric Attributes
          § Real-valued or integer-valued domain
          § Interval-scaled when only differences are meaningful (e.g., temperature)
          § Ratio-scaled when differences and ratios are meaningful (e.g., Age)
      • Categorical Attributes
          § Set-valued domain composed of a set of symbols
          § Nominal when only equality is meaningful (e.g., domain(Sex) = {M, F})
          § Ordinal when both equality (are two values the same?) and inequality (is one value less than another?) are meaningful (e.g., domain(Education) = {High School, BS, MS, PhD})
  • 12. Numerical Attributes
      • Not only ordered but measured in fixed and equal units
      • Examples
          § Attribute “temperature” expressed in degrees
          § Attribute “year”
      • Characteristics
          § Difference of two values makes sense
          § Sum or product doesn’t make sense
          § Zero point is not defined
      • Sometimes they are divided into “discrete” and “continuous”
  • 13. Ratio Attributes
      • Ratio quantities are ones for which the measurement scheme defines a zero point
      • Example
          § Attribute “distance”
      • Characteristics
          § Distance between an object and itself is zero
          § Ratio quantities are treated as real numbers
          § All mathematical operations are allowed
          § Is there an “inherently” defined zero point?
          § It depends on scientific knowledge
  • 14. Nominal Attributes (or Categorical)
      • Values are distinct symbols
      • Values themselves serve only as labels or names
      • Example
          § Attribute “outlook” from weather data
          § Values: “sunny”, “overcast”, and “rainy”
      • Characteristics
          § No relation is implied among nominal values
          § No ordering
          § No distance measure
          § Only equality tests can be performed
  • 15. Ordinal Attributes
      • Impose order on values
      • No distance between values defined
      • Example
          § The attribute “temperature” in weather data
          § Values: “hot”, “mild”, “cool”
      • Characteristics
          § Addition and subtraction don’t make sense
          § Distinction between nominal and ordinal not always clear (e.g. attribute “outlook”)
  • 16. Nominal or Ordinal?
      • Attribute “age” nominal
          § If age = young and astigmatic = no and tear production rate = normal then recommendation = soft
      • Attribute “age” ordinal (e.g. “young” < “pre-presbyopic” < “presbyopic”)
          § If age ≤ pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft
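The difference between the two rules can be sketched in code. The snippet below is only an illustration: the rank table `AGE_RANK` and the function names are assumptions, and the rules return `None` when they do not fire.

```python
# Assumed ordering of the "age" values when the attribute is treated as ordinal.
AGE_ORDER = ["young", "pre-presbyopic", "presbyopic"]
AGE_RANK = {value: rank for rank, value in enumerate(AGE_ORDER)}


def rule_nominal(age, astigmatic, tear_rate):
    """Nominal view: only an equality test on age is meaningful."""
    if age == "young" and astigmatic == "no" and tear_rate == "normal":
        return "soft"
    return None  # the rule does not fire


def rule_ordinal(age, astigmatic, tear_rate):
    """Ordinal view: age can be compared with <= through its rank."""
    if (AGE_RANK[age] <= AGE_RANK["pre-presbyopic"]
            and astigmatic == "no" and tear_rate == "normal"):
        return "soft"
    return None  # the rule does not fire
```

With the ordinal interpretation a single rule covers both young and pre-presbyopic patients, whereas the nominal interpretation would need one rule per value.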
  • 17. Why Specify Attribute Types?
      • Some algorithms fit some specific data types best
      • Express the best possible patterns in the data
      • Make the most adequate comparisons
      • Example
          § Outlook > “sunny” does not make sense, while
          § Temperature > “cool” or
          § Humidity > 70 does
      • Additional uses of attribute type
          § Check for valid values
          § Deal with missing values, etc.
  • 18. Missing Values
  • 19. Why Do Missing Values Exist?
      • Faulty equipment, incorrect measurements, missing cells in manual data entry
      • Review scores for movies, books, etc.
      • Very frequent in questionnaires for medical scenarios
      • Censored/anonymous data
      • In practice, a low rate of missing values may be suspicious
      • Interview data (“Did you ever …”)
  • 20. Missing Values
      • Frequently indicated by out-of-range entries (e.g. max/min float)
      • A missing value may have significance in itself
          § E.g. missing test in a medical examination
      • Most schemes assume that is not the case
          § “missing” may need to be coded as an additional value
      • Does absence of value have some significance?
          § If it does, “missing” is a separate value
          § If it does not, “missing” must be treated in a special way
  • 21. What Types of Missing Values?
      • Missing completely at random (MCAR)
          § The probability that an example has a missing value for an attribute depends on neither the observed data nor the missing data
          § Example: each respondent is asked only a random sample of the questions in the whole questionnaire
      • Missing at random (MAR)
          § The probability that an example has a missing value for an attribute depends on the observed data, but not on the missing data
          § That is, the propensity for a data point to be missing is not related to the missing data, but it is related to some of the observed data
          § Whether or not someone answered question 13 on your survey has nothing to do with the missing values, but it does have to do with the values of some other variable
          § Example: respondents in service occupations are less likely to report income
      • Not missing at random (NMAR)
          § The probability that an example has a missing value for an attribute depends on the missing values themselves
          § Example: respondents with high income are less likely to report income
  • 22. Dealing with Missing Values
      • Use what you know
          § Why data is missing
          § Distribution of missing data
      • Decide on the best strategy to yield the least biased estimates
          § Deletion methods (listwise deletion, pairwise deletion)
          § Single imputation methods (mean/mode substitution, dummy variable method, single regression)
          § Model-based methods (maximum likelihood, multiple imputation)
  • 23. Strategies for Handling Missing Values
      • The handling of missing data depends on the type
      • Discarding all the examples with missing values
          § Simplest approach
          § Allows the use of unmodified data mining methods
          § Only practical if there are few examples with missing values; otherwise, it can introduce bias
      • Fill in the missing value manually :-)
      • Convert the missing values into a new value
          § Use a special value for it
          § Add an attribute that indicates if the value is missing or not
          § Greatly increases the difficulty of the data mining process
      • Imputation methods
          § Assign a value to the missing one, based on the rest of the dataset, then use the unmodified data mining methods
  • 24. Listwise Deletion (Complete Case Analysis)
      • Only analyze cases with available data on each variable
      • Simple, but reduces the data
      • Comparability across analyses
      • Does not use all the information
      • Estimates may be biased if data are not MCAR
  • 25. Pairwise Deletion (Available Case Analysis)
      • Analysis with all cases in which the variables of interest are present
      • Advantage
          § Keeps as many cases as possible for each analysis
          § Uses all information possible with each analysis
      • Disadvantage
          § Can’t compare analyses because the sample is different each time
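The two deletion methods can be contrasted with a short sketch. The records below are hypothetical, `None` is an assumed marker for a missing value, and the function names are illustrative.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"age": 25, "income": 30000, "score": 7},
    {"age": 40, "income": None, "score": 5},
    {"age": None, "income": 52000, "score": 9},
    {"age": 31, "income": 41000, "score": None},
]


def listwise_deletion(rows):
    """Keep only the cases that are complete on every variable."""
    return [r for r in rows if all(v is not None for v in r.values())]


def pairwise_deletion(rows, variables):
    """Keep the cases that are complete on the variables of one analysis."""
    return [r for r in rows if all(r[v] is not None for v in variables)]


complete_cases = listwise_deletion(records)
age_score_cases = pairwise_deletion(records, ["age", "score"])
income_score_cases = pairwise_deletion(records, ["income", "score"])
```

Listwise deletion leaves a single case here, while each pairwise analysis keeps two cases. The two pairwise samples are not the same cases, which is why results from different analyses are hard to compare.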
  • 26. Imputation Methods
      • Extract a model from the dataset to perform the imputation
      • Suitable for MCAR and, to a lesser extent, for MAR
      • Not suitable for NMAR type of missing data
      • For NMAR we need to go back to the source of the data to obtain more information
      • Survey of imputation methods available at http://sci2s.ugr.es/MVDM/index.php and http://sci2s.ugr.es/MVDM/biblio.php
  • 27. Single Imputation Methods
      • Mean/mode substitution (most common value)
          § Replace missing value with sample mean or mode
          § Run analyses as if all cases were complete
          § Advantages: can use complete case analysis methods
          § Disadvantages: reduces variability
      • Dummy variable control
          § Create an indicator for missing value (1 = value is missing for observation; 0 = value is observed for observation)
          § Impute missing values to a constant (such as the mean)
          § Include missing indicator in the algorithm
          § Advantage: uses all available information about missing observation
          § Disadvantage: results in biased estimates, not theoretically driven
      • Regression imputation
          § Replaces missing values with the predicted score from a regression equation
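A minimal sketch of mean/mode substitution and of the missing-value indicator follows. The data, the `None` marker, and the function names are assumptions for illustration; regression imputation is not covered.

```python
from statistics import mean, mode, pvariance


def impute_mean(values):
    """Replace missing numeric values with the mean of the observed ones."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def impute_mode(values):
    """Replace missing categorical values with the most common observed one."""
    fill = mode(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def missing_indicator(values):
    """Dummy variable: 1 if the value is missing, 0 if it is observed."""
    return [1 if v is None else 0 for v in values]


# Hypothetical attributes with missing entries.
humidity = [85, None, 90, 80, None]
outlook = ["sunny", None, "sunny", "rainy"]

humidity_filled = impute_mean(humidity)
outlook_filled = impute_mode(outlook)
humidity_missing = missing_indicator(humidity)

# Mean substitution shrinks the spread of the attribute.
variance_observed = pvariance([v for v in humidity if v is not None])
variance_filled = pvariance(humidity_filled)
```

In this toy case `variance_filled` is smaller than `variance_observed`, which illustrates the "reduces variability" drawback of mean substitution.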
  • 28. Multiple Imputation Process
  • 29. Do Not Impute (DNI)
      • Simply use the default policy of the data mining method
      • Works only if the policy exists
  • 30. Inaccurate Values
  • 31. Inaccurate Values
      • The data were usually not collected with mining in mind
      • Errors and omissions that don’t affect the original purpose of the data (e.g. age of customer)
      • Typographical errors in nominal attributes, thus values need to be checked for consistency
      • Typographical and measurement errors in numeric attributes, thus outliers need to be identified
      • Errors may be deliberate (e.g. wrong zip codes)
  • 32. The Geometric View
  • 33. The Geometrical View of the Data
      • When the data matrix contains only numerical values
          § Every row can be viewed as a point in a d-dimensional space
          § Every column can be viewed as a point in an n-dimensional space
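The geometric view can be made concrete with a small sketch. The matrix values and the use of Euclidean distance are assumptions for illustration; the slide does not prescribe a particular distance.

```python
import math

# Hypothetical numeric data matrix with n = 3 rows and d = 2 columns.
D = [
    [85.0, 85.0],
    [80.0, 90.0],
    [83.0, 86.0],
]


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


row_points = D                                    # n points in d-dimensional space
column_points = [list(col) for col in zip(*D)]    # d points in n-dimensional space

row_distance = euclidean(row_points[0], row_points[1])
column_distance = euclidean(column_points[0], column_points[1])
```

The same `euclidean` function works for both views because it only requires the two points to have the same number of coordinates: 2 for rows, 3 for columns in this example.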
  • 38. The Probabilistic View
  • 43. Data Formats
  • 44. Data Format
      • Most commercial tools have their own proprietary format
      • Most tools import Excel files and comma-separated value files
          Year,Make,Model,Length
          1997,Ford,E350,2.34
          2000,Mercury,Cougar,2.38

          Year;Make;Model;Length
          1997;Ford;E350;2,34
          2000;Mercury;Cougar;2,38
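The two CSV variants on the slide differ in the field delimiter and in the decimal separator. The sketch below reads both with Python's standard `csv` module; the function name `read_cars` and its parameters are illustrative assumptions.

```python
import csv
import io

comma_text = (
    "Year,Make,Model,Length\n"
    "1997,Ford,E350,2.34\n"
    "2000,Mercury,Cougar,2.38\n"
)
semicolon_text = (
    "Year;Make;Model;Length\n"
    "1997;Ford;E350;2,34\n"
    "2000;Mercury;Cougar;2,38\n"
)


def read_cars(text, delimiter, decimal):
    """Parse the car table, converting Year to int and Length to float."""
    rows = []
    for record in csv.DictReader(io.StringIO(text), delimiter=delimiter):
        row = dict(record)
        row["Year"] = int(row["Year"])
        # Normalize the decimal separator before converting to float.
        row["Length"] = float(row["Length"].replace(decimal, "."))
        rows.append(row)
    return rows


cars_a = read_cars(comma_text, delimiter=",", decimal=".")
cars_b = read_cars(semicolon_text, delimiter=";", decimal=",")
```

Both calls yield the same records, which shows that the delimiter and the decimal separator have to be specified together when importing such files.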
  • 45. Attribute-Relation File Format (ARFF)
          %
          % ARFF file for weather data with some numeric features
          %
          @relation weather
          @attribute outlook {sunny, overcast, rainy}
          @attribute temperature numeric
          @attribute humidity numeric
          @attribute windy {true, false}
          @attribute play? {yes, no}
          @data
          sunny, 85, 85, false, no
          sunny, 80, 90, true, no
          overcast, 83, 86, false, yes
          ...
      http://www.cs.waikato.ac.nz/~ml/weka/arff.html
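To show how the header of such a file maps to attribute definitions, here is a deliberately simplified sketch of a header reader. It is not Weka's parser: it assumes unquoted attribute names without spaces and ignores everything after `@data`. The function name `parse_arff_header` is hypothetical.

```python
def parse_arff_header(text):
    """Return (relation name, list of (attribute name, type or value list))."""
    relation = None
    attributes = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        keyword = line.lower()
        if keyword.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif keyword.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):
                # Nominal attribute: the spec enumerates the allowed values.
                values = [v.strip() for v in spec.strip("{}").split(",")]
                attributes.append((name, values))
            else:
                attributes.append((name, spec.lower()))
        elif keyword.startswith("@data"):
            break  # the header ends where the data section starts
    return relation, attributes


header = """% ARFF file for weather data with some numeric features
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play? {yes, no}
@data
sunny, 85, 85, false, no
"""

relation, attributes = parse_arff_header(header)
```

For real files a dedicated ARFF reader is the safer choice, since quoting and the string and date types are not handled by this sketch.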
  • 46. Additional Attribute Types
      • ARFF supports string attributes:
          @attribute description string
          § Similar to nominal attributes, but the list of values is not pre-specified
      • ARFF also supports date attributes:
          @attribute today date
          § Uses the ISO-8601 combined date and time format yyyy-MM-dd'T'HH:mm:ss
  • 47. Additional Attribute Types
      • ARFF supports sparse data; for instance, the following examples,
          0, 26, 0, 0, 0, 0, 63, 0, 0, 0, "class A"
          0, 0, 0, 42, 0, 0, 0, 0, 0, 0, "class B"
      • can also be represented as
          {1 26, 6 63, 10 "class A"}
          {3 42, 10 "class B"}
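The conversion from the dense to the sparse notation can be sketched as follows. The function name `to_sparse` is an assumption; the sketch relies on the fact that sparse ARFF indices start at 0 and that omitted entries stand for the value 0, not for a missing value.

```python
def to_sparse(row):
    """Format a dense row as a sparse ARFF instance (0-based indices)."""
    parts = []
    for index, value in enumerate(row):
        if value != 0:  # zero entries are simply omitted
            text = f'"{value}"' if isinstance(value, str) else str(value)
            parts.append(f"{index} {text}")
    return "{" + ", ".join(parts) + "}"


sparse_a = to_sparse([0, 26, 0, 0, 0, 0, 63, 0, 0, 0, "class A"])
sparse_b = to_sparse([0, 0, 0, 42, 0, 0, 0, 0, 0, 0, "class B"])
```

Applied to the two dense rows of the slide, the function reproduces the two sparse instances shown there.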
  • 48. Missing Values in ARFF
          @relation labor
          @attribute 'duration' real
          @attribute 'wage-increase-first-year' real
          @attribute 'wage-increase-second-year' real
          @attribute 'wage-increase-third-year' real
          @attribute 'cost-of-living-adjustment' {'none','tcf','tc'}
          @attribute 'working-hours' real
          @attribute 'pension' {'none','ret_allw','empl_contr'}
          @attribute 'standby-pay' real
          @attribute 'shift-differential' real
          @attribute 'education-allowance' {'yes','no'}
          @attribute 'statutory-holidays' real
          @attribute 'vacation' {'below_average','average','generous'}
          @attribute 'longterm-disability-assistance' {'yes','no'}
          @attribute 'contribution-to-dental-plan' {'none','half','full'}
          @attribute 'bereavement-assistance' {'yes','no'}
          @attribute 'contribution-to-health-plan' {'none','half','full'}
          @attribute 'class' {'bad','good'}
          @data
          1,5,?, ?, ?,40, ?, ?,2, ?,11,'average', ?, ?,'yes',?,'good'
          2,4.5,5.8, ?, ?,35,'ret_allw', ?, ?,'yes',11,'below_average', ?,'full', ?,'full','good'
          ?, ?, ?, ?, ?,38,'empl_contr', ?,5, ?,11,'generous','yes','half','yes','half','good'
          3,3.7,4,5,'tc', ?, ?, ?, ?,'yes', ?, ?, ?, ?,'yes', ?,'good'
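In ARFF data rows a question mark stands for a missing value. The sketch below turns one such row into a Python list with `None` for the missing entries, using the standard `csv` module. The function name is hypothetical, values are kept as strings, and a quoted `'?'` would also be treated as missing, which is a simplification.

```python
import csv
import io


def parse_arff_data_line(line):
    """Split one ARFF data row and map '?' to None (values stay strings)."""
    reader = csv.reader(io.StringIO(line), quotechar="'", skipinitialspace=True)
    fields = next(reader)
    return [None if field.strip() == "?" else field.strip() for field in fields]


first_row = parse_arff_data_line(
    "1,5,?, ?, ?,40, ?, ?,2, ?,11,'average', ?, ?,'yes',?,'good'"
)
missing_count = sum(value is None for value in first_row)
```

The first labor record has 17 fields, one per declared attribute, and more than half of them are missing, which is typical for this dataset.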
  • 49. Attribute Types and Interpretation
      • Interpretation of attribute types in ARFF depends on the mining scheme
      • Numeric attributes are interpreted as
          § Ordinal scales if less-than and greater-than are used
          § Ratio scales if distance calculations are performed (normalization/standardization may be required)
      • Instance-based schemes define distance between nominal values (0 if values are equal, 1 otherwise)
      • Integers in some given data file: nominal, ordinal, or ratio scale?
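A minimal sketch of how an instance-based scheme might combine the two ideas above: numeric attributes are rescaled before distances are computed, and nominal attributes contribute 0 or 1. Min-max normalization and the Euclidean combination are assumptions chosen for illustration, as are all function names.

```python
import math


def min_max_normalize(values):
    """Rescale numeric values to [0, 1]; constant columns map to 0.0."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) if high > low else 0.0 for v in values]


def nominal_distance(a, b):
    """Distance between nominal values: 0 if equal, 1 otherwise."""
    return 0.0 if a == b else 1.0


def mixed_distance(x, y, is_nominal):
    """Euclidean-style distance over mixed nominal and numeric attributes."""
    total = 0.0
    for a, b, nominal in zip(x, y, is_nominal):
        diff = nominal_distance(a, b) if nominal else a - b
        total += diff * diff
    return math.sqrt(total)


normalized_temperature = min_max_normalize([80, 85, 90])
distance = mixed_distance(("sunny", 0.0), ("rainy", 0.0), (True, False))
```

Without normalization, an attribute measured on a large scale would dominate the 0/1 contributions of the nominal attributes.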
  • 50. DSPL: Dataset Publishing Language
      • Open format by Google available at http://code.google.com/apis/publicdata/
      • Uses existing data: add an XML metadata file to existing CSV files
      • Read by the Google Public Data Explorer, which includes animated bar chart, motion chart, and map visualization
      • Allows linking to concepts in other datasets
      • Geo-enabled: allows adding latitude and longitude data to your concept definitions
  • 51. Model Representation
  • 52. Predictive Model Markup Language
      • XML-based markup language developed by the Data Mining Group (DMG) to provide a way for applications to define models related to predictive analytics and data mining
      • The goal is to share models between applications
      • Vendor-independent method of defining models
      • Allows the exchange of models between applications
      • PMML components: data dictionary, data transformations, model, mining schema, targets, output
  • 53. Data Repository
  • 54. Publicly Available Datasets
      • UCI repository
          § http://archive.ics.uci.edu/ml/
          § Probably the most famous collection of datasets
      • Kaggle
          § http://www.kaggle.com/
          § It is not a static repository of datasets, but a site that manages data mining competitions
          § Example of the modern concept of crowdsourcing
  • 55. Publicly Available Datasets
      • KDNuggets
          § http://www.kdnuggets.com/datasets/
      • PSPbenchmarks
          § http://www.infobiotic.net/PSPbenchmarks/
          § Datasets derived from Protein Structure Prediction problems
          § Interesting benchmarks because they can be parametrised in a very large variety of ways
      • Pascal Large Scale Learning Challenge
          § http://largescale.ml.tu-berlin.de/about/