7. SVM summary
• Avoids the plague of local minima (training is a convex optimization problem)
• The engineer’s expertise lies in choosing the appropriate kernel (beware of overfitting: cross-validate and experiment with your own kernels)
• Natively classifies between only 2 classes (use a one-vs-all or one-vs-one scheme for more)
• A reference method for use cases in computer vision and bioinformatics
9. Neural Network summary
• Gradient descent algorithms: stochastic, mini-batch, conjugate
• Plague of local minima: difficult to calibrate
• The engineer’s expertise lies in choosing the appropriate architecture (beware of overfitting: cross-validate and experiment with your own architecture, ‘deeper learning’)
10. Regression Trees
>> t = classregtree(X,Y);
>> Y_pred = t(X_new);
13. Why a regression and what is a regression ?
• A regression is a model to explain and predict a process: supervised machine learning
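14. Why regularizing ?
• Terms are correlated
• The regression matrix becomes close to singular
• A badly conditioned matrix yields poor numerical results
• Bayesian interpretation: the posterior is proportional to the likelihood times the prior, and the prior plays the role of the regularisation term
• Rather than the plain least-squares error, we minimize the error plus a regularisation term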
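The formula on this slide did not survive extraction. As a sketch only, the standard ridge (L2-regularised) form of such an objective is shown below; the symbols $X$, $y$, $b$, $\lambda$, $\sigma$ and $\tau$ are assumptions chosen to match the notation of the later regression slides, not the author's own.

```latex
% Penalised least squares (ridge) and its closed-form solution
\hat{b} = \arg\min_{b} \; \lVert y - X b \rVert_2^2 + \lambda \lVert b \rVert_2^2
\qquad\Longrightarrow\qquad
\hat{b} = (X^\top X + \lambda I)^{-1} X^\top y

% Bayesian reading: Gaussian likelihood (noise variance \sigma^2)
% and Gaussian prior on b (variance \tau^2)
-\log p(b \mid y) = \frac{1}{2\sigma^2} \lVert y - X b \rVert_2^2
                  + \frac{1}{2\tau^2} \lVert b \rVert_2^2 + \text{const},
\qquad \lambda = \frac{\sigma^2}{\tau^2}
```

Adding $\lambda I$ moves the eigenvalues of $X^\top X$ away from zero, which is why the badly conditioned case becomes numerically tractable.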
15. Why Lasso and Elastic Net?
• No method owns the truth
• Reduce the number of predictors in a regression model
• Identify important predictors
• Select among redundant predictors
• Produce shrinkage estimates with potentially lower predictive errors than ordinary least squares (checked by cross-validation)
• Lasso: penalises the L1 norm of the coefficients
• Elastic Net: penalises a mix of the L1 and L2 norms of the coefficients
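The two objectives were images on the original slide and are missing here. A hedged reconstruction, written in the convention used by the Statistics Toolbox `lasso` function ($N$ observations, intercept $\beta_0$, mixing parameter $\alpha$), could read as follows; the symbol names are assumptions.

```latex
% Lasso: L1 penalty
\min_{\beta_0,\,\beta} \; \frac{1}{2N} \sum_{i=1}^{N} \bigl( y_i - \beta_0 - x_i^\top \beta \bigr)^2
  + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert

% Elastic net: mix of L1 and L2 penalties, 0 < \alpha \le 1
\min_{\beta_0,\,\beta} \; \frac{1}{2N} \sum_{i=1}^{N} \bigl( y_i - \beta_0 - x_i^\top \beta \bigr)^2
  + \lambda \Bigl( \frac{1-\alpha}{2} \lVert \beta \rVert_2^2 + \alpha \lVert \beta \rVert_1 \Bigr)
```

With $\alpha = 1$ the elastic net reduces to the lasso; as $\alpha$ approaches 0 it approaches ridge regression.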
16. Ensemble learning
Why ensemble learning ?
‘Melding results from many weak learners into one high-quality ensemble predictor’
17. Main differences between Bagging and Boosting
Defining features
BAGGING | BOOSTING
Relies on randomness | Adaptive and deterministic
Bootstrapped samples | Complete initial sample
Each model must perform well over the whole sample | Each model has to perform better than the previous one on outliers
Every model has the same weight | Models are weighted according to their performance
Advantages and disadvantages
BAGGING | BOOSTING
Reduces model variance | Variance might rise
Not a simple model anymore | Not a simple model anymore
Can be parallelized | Cannot be parallelized
Overfits noise less: better than boosting on noisy data | Models are weighted according to their performance
Usually more efficient than boosting | In specific cases, boosting might achieve far better accuracy
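As an illustration of the comparison above, the sketch below grows a bagged and a boosted tree ensemble on the same data with `fitensemble` from the Statistics Toolbox. The synthetic data and all variable names are hypothetical, and the ensemble size of 100 is an arbitrary choice.

```matlab
% Hypothetical two-class data, for illustration only
rng(0);
X = [randn(100,2); randn(100,2) + 2];
Y = [ones(100,1); 2*ones(100,1)];

% Bagging: every tree is grown on a bootstrap replica of the sample
bag = fitensemble(X, Y, 'Bag', 100, 'Tree', 'Type', 'classification');

% Boosting (AdaBoost.M1): trees are grown sequentially on reweighted data
boost = fitensemble(X, Y, 'AdaBoostM1', 100, 'Tree');

% Resubstitution errors are optimistic; cross-validate before comparing
errBag   = resubLoss(bag);
errBoost = resubLoss(boost);
```

Comparing the two errors on held-out data rather than on the training sample is what reveals the variance behaviour listed in the table.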
24. Exploratory Data Analysis
Why exploratory analysis ? Can be used to:
o Graphical view
o “Pre filtering”: preliminary data trends and behaviour
• Means:
• Multivariate Plots
• Features transformation : principal component analysis, factor model
• Features selection : stepwise optimization
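To make the principal component analysis step concrete, here is a minimal sketch in base MATLAB using the singular value decomposition. The data matrix is randomly generated and purely hypothetical; in practice the toolbox's own PCA routine would normally be used.

```matlab
% Hypothetical correlated data: 200 observations, 3 features
rng(0);
X = randn(200,3) * [2 0 0; 0.5 1 0; 0 0 0.1];

Xc = bsxfun(@minus, X, mean(X, 1));        % centre each column
[U, S, V] = svd(Xc, 'econ');               % columns of V are the principal directions
scores = U * S;                            % observations expressed in the principal axes
explained = diag(S).^2 / sum(diag(S).^2);  % share of variance carried by each component
```

Keeping only the first few columns of `scores` gives the transformed, lower-dimensional features.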
28. Factor model
Alternative to PCA to improve your components
>> [Lambda,Psi,T,stats,F] = factoran(stocks,3,'rotate','promax');
[Figure: 3-D plot of the factor loadings on Component 1, Component 2 and Component 3, each point labelled with a stock name (DeutscheBank, Siemens, SAP, BASF, ...)]
29. Paring predictors : stepwise optimization
• Some predictors might be correlated, others irrelevant
Requires Statistics Toolbox™
>>[coeff,inOut]=stepwisefit(stocks, index);
[Figure: two panels over 2007 to 2011. Returns: original data versus stepwise fit. Prices: original data versus stepwise fit.]
30. Cloud of randomly generated points
• Each cluster center is randomly chosen inside specified bounds
• Each cluster contains the specified number of points per cluster
• Each cluster point is sampled from a Gaussian distribution
• Multidimensional dataset
>>clusters = 8; % number of clusters.
>>points = 30; % number of points in each cluster.
>>std_dev = 0.05; % common cluster standard deviation
>>bounds = [0 1]; % bounds for the cluster center
>>[x,vcentroid,proportions,groups] =cluster_generation(bounds,clusters,points,std_dev);
[Figure: scatter plot of the generated points, coloured by group (Group1 to Group8)]
31. Clustering
Why clustering ?
o Segment populations into natural subgroups
o Identify outliers
o As a preprocessing method: build separate models on each subgroup
• Means:
• Hierarchical clustering
• Clustering with neural networks (self-organising map, competitive layer)
• Clustering with K-means (points assigned to the nearest centroid)
• Clustering with fuzzy c-means (fuzzy logic)
• Clustering using Gaussian mixture models
• Predictors: categorical, ordinal, discontinuous
[Figure: scatter plot "Input Vectors", axes x(1) and x(2)]
33. Hierarchical Cluster Analysis – how do I do it ?
• Calculate pairwise distances between points
>> distances = pdist(x)
• Carry out hierarchical cluster analysis
>> tree = linkage(distances)
• Visualise as a dendrogram
>> dendrogram(tree)
• Assign points to clusters
>> assignments = cluster(tree,'cutoff',0.1)
34. Assessing the quality of a hierarchical cluster analysis
• The cophenetic correlation coefficient measures how closely the lengths of the tree links match the original distances between points
• How ‘faithful’ the tree is to the original data
• 0 is poor, 1 is good
>> cophenet(tree,distances)
35. K-Means Cluster Analysis – what is it doing?
• Randomly pick K cluster centroids
• Assign points to the closest centroid
• Recalculate positions of cluster centroids
• Reassign points to the closest centroid
• Recalculate positions of cluster centroids
• Repeat until centroid positions converge
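The loop described above can be sketched in a few lines of base MATLAB. This is a minimal illustration on hypothetical data, not the toolbox's `kmeans` implementation; the iteration cap of 100 and the tolerance are arbitrary assumptions.

```matlab
% Hypothetical data: two well-separated blobs in 2-D
rng(1);
x = [randn(50,2)*0.1; bsxfun(@plus, randn(50,2)*0.1, [1 1])];
K = 2;
n = size(x,1);

perm = randperm(n);
centroids = x(perm(1:K), :);               % randomly pick K points as initial centroids
for iter = 1:100
    d = zeros(n, K);                       % squared distance of each point to each centroid
    for k = 1:K
        d(:,k) = sum(bsxfun(@minus, x, centroids(k,:)).^2, 2);
    end
    [~, memberships] = min(d, [], 2);      % assign each point to its closest centroid

    newCentroids = centroids;
    for k = 1:K
        if any(memberships == k)           % an empty cluster keeps its previous centroid
            newCentroids(k,:) = mean(x(memberships == k, :), 1);
        end
    end
    if max(abs(newCentroids(:) - centroids(:))) < 1e-10
        break;                             % centroid positions have converged
    end
    centroids = newCentroids;
end
```

Because the starting centroids are random, different runs can end in different local minima, which is why several restarts are usually compared.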
36. K-Means Cluster Analysis – how do I do it ?
Running the K-means algorithm for a fixed K
>> [memberships,centroids] = kmeans(x,K);
37. Evaluating a K-Means analysis and choosing K
• Try a range of different K’s, and compare the total point-to-centroid distances for each
>> for K=3:15
       [clusters,centroids,sumd] = kmeans(data,K); % sumd: within-cluster sums of distances
       totaldist(K-2) = sum(sumd);
   end
>> plot(3:15,totaldist);
• Create silhouette plots
>> silhouette(data,clusters)
38. Sidebar: Distance Metrics
• Measures of how similar datapoints are – different
definitions make sense for different data
• Many built-in distance metrics, or define your own
>> doc pdist
>> distances = pdist(data,metric); %pdist = pairwise distances
>> squareform(distances)
>> kmeans(data,k,'distance','cityblock') %not all metrics supported
• Euclidean distance: the default
• Cityblock distance: useful for discrete variables
• Cosine distance: useful for clustering variables
39. Fuzzy c-means Cluster Analysis – what is it doing?
• Very similar to K-means
• Samples are not assigned definitively to a cluster, but
have a ‘membership’ value relative to each cluster
Requires Fuzzy Logic Toolbox™
Running the fuzzy c-means algorithm for a fixed number of clusters K
>> [centroids, memberships]=fcm(x,K);
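For reference, the 'membership' values can be written out explicitly. The block below gives the standard fuzzy c-means update equations as a sketch; the symbols $u_{ij}$, $c_j$ and the fuzzifier $m$ are notation assumed here, not taken from the slides.

```latex
% Membership of sample x_i in cluster j (fuzzifier m > 1)
u_{ij} = \frac{1}{\displaystyle \sum_{k=1}^{K}
         \left( \frac{\lVert x_i - c_j \rVert}{\lVert x_i - c_k \rVert} \right)^{\frac{2}{m-1}}},
\qquad \sum_{j=1}^{K} u_{ij} = 1

% Centroid update: membership-weighted mean of the samples
c_j = \frac{\sum_{i} u_{ij}^{\,m} \, x_i}{\sum_{i} u_{ij}^{\,m}}
```

As $m$ approaches 1 the memberships become nearly 0 or 1 and the method behaves like K-means.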
40. Gaussian Mixture Models
• Assume that data is drawn from a fixed number K of normal
distributions
• Fit these parameters using the EM algorithm
>> gmobj = gmdistribution.fit(x,8);
>> assignments = cluster(gmobj,x);
Plot the probability density
>> ezsurf(@(x,y)pdf(gmobj,[x y]));
41. Evaluating a Gaussian Mixture Model clustering
• Plot the probability density function of the model
>> ezsurf(@(x,y)pdf(gmobj,[x y]));
• Plot the posterior probabilities of observations
>> p = posterior(gmobj,data);
>> scatter(data(:,1),data(:,2),5,p(:,g)); % Do this for each group g
• Plot the Mahalanobis distances of observations to components
>> m = mahal(gmobj,data);
>> scatter(data(:,1),data(:,2),5,m(:,g)); % Do this for each group g
42. Choosing the right number of components in a Gaussian Mixture Model
• Evaluate for a range of K and plot AIC and/or BIC
• AIC (Akaike Information Criterion) and BIC (Bayesian
Information Criterion) are measures of the quality of
the model fit, with a penalty for higher K
>> for K=3:15
gmobj = gmdistribution.fit(data,K);
AIC(K-2) = gmobj.AIC;
end
plot(3:15,AIC);
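The two criteria mentioned above have simple closed forms. As a sketch, with $\hat{L}$ the maximised likelihood, $p$ the number of free parameters and $n$ the number of observations (symbol names are assumptions):

```latex
\mathrm{AIC} = 2p - 2\ln\hat{L},
\qquad
\mathrm{BIC} = p\,\ln n - 2\ln\hat{L}

% Free parameters of a K-component mixture in d dimensions
% with full covariance matrices
p = (K - 1) + K\,d + K\,\frac{d(d+1)}{2}
```

Lower values are better for both; because $p$ grows with K, BIC in particular penalises large K more heavily once $\ln n > 2$.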
43. Neural Networks – what are they?
[Diagram: a two-layer feedforward network. Input variables are multiplied by weights, a bias is added, and a transfer function produces the output variable.]
• Build your own architecture
44. Self Organising Maps Neural Net – what are they?
• Start with a regular grid of ‘neurons’ laid over the dataset
• The size of the grid gives the number of clusters
• Neurons compete to recognise datapoints (by being close to them)
• Winning neurons are moved closer to the datapoints
• Repeat until convergence
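A minimal sketch of these steps with the Neural Network Toolbox function `selforgmap` follows. The data and the 2-by-2 grid size are hypothetical choices made for illustration.

```matlab
% Hypothetical data: two blobs in 2-D, one row per sample
rng(0);
x = [randn(100,2)*0.05; bsxfun(@plus, randn(100,2)*0.05, [0.5 0.5])];

net = selforgmap([2 2]);             % 2-by-2 grid of neurons, i.e. at most 4 clusters
net.trainParam.showWindow = false;   % train without opening the training window
net = train(net, x');                % the toolbox expects one column per sample
assignments = vec2ind(net(x'));      % index of the winning neuron for each sample
```

Each entry of `assignments` is the grid index of the neuron that won the corresponding sample, so it can be used directly as a cluster label.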
[Figure: two "SOM Weight Positions" plots, axes Weight 1 and Weight 2]
45. Summary: Cluster analysis
No method owns the truth
Use the diagnostic tools to assess your clusters
Beware of local minima : global optimization
46. Classification
Why classification ? Can be used to:
o Learn how to classify from already classified observations
o Classify new observations
• Means:
• Discriminant analysis classification
• Bootstrapped aggregated decision tree classifier
• Neural network classifier
• Support vector machine classifier
[Figure: scatter plot of the dataset, coloured by group (Group1 to Group8)]
47. Discriminant Analysis – how does it work?
• Fit a multivariate normal density to each class
• linear — Fits a multivariate normal density to each group,
with a pooled estimate of covariance. This is the default.
• diaglinear — Similar to linear, but with a diagonal
covariance matrix estimate (naive Bayes classifiers).
• quadratic — Fits multivariate normal densities with
covariance estimates stratified by group.
• diagquadratic — Similar to quadratic, but with a diagonal
covariance matrix estimate (naive Bayes classifiers).
• Classify a new point by evaluating its probability for
each density function, and classifying to the highest
probability
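A short sketch of the variants above using the Statistics Toolbox function `classify`. It assumes the Fisher iris sample data (`meas`, `species`), which matches the variable names used on the following slides; the restriction to two predictors is an illustrative choice.

```matlab
load fisheriris                      % assumed sample data: meas (150x4), species (150x1)
X = meas(:,1:2);                     % keep two predictors for simplicity

% Resubstitution: classify the training points with two of the variants
[~, errLin]        = classify(X, X, species, 'linear');
[cls, errQuad, P]  = classify(X, X, species, 'quadratic');

% P(i,j) is the posterior probability that observation i belongs to class j;
% the predicted class is the one with the highest posterior
[~, best] = max(P, [], 2);
```

`errLin` and `errQuad` are apparent (training) error rates, so they understate the error to expect on new data.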
49. Interpreting Discriminant Analyses
• Visualise the posterior probability surfaces
>> load fisheriris % assumed source of meas and species
>> [XI,YI] = meshgrid(linspace(4,8), linspace(2,4.5));
>> X = XI(:); Y = YI(:);
>> [class,err,P] = classify([X Y], meas(:,1:2), species, 'quadratic');
>> for i=1:3
       ZI = reshape(P(:,i),100,100);
       surf(XI,YI,ZI,'EdgeColor','none');
       hold on;
   end
50. Interpreting Discriminant Analyses
• Visualise the probability density of sample observations
• An indicator of the region in which the model has support from training data
>> [XI,YI] = meshgrid(linspace(4,8), linspace(2,4.5));
>> X = XI(:); Y = YI(:);
>> [class,err,P,logp] = classify([X Y], meas(:,1:2), species, 'quadratic');
>> ZI = reshape(logp,100,100);
>> surf(XI,YI,ZI,'EdgeColor','none');
51. Classifying K-Nearest Neighbours – what does it do?
• One of the simplest classifiers – a sample is classified
by taking the K nearest points from the training set,
and choosing the majority class of those K points
• There is no real training phase – all the work is done
during the application of the model
>> classes = knnclassify(sample,training,group,K)
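The majority vote itself is easy to write out in base MATLAB. The sketch below uses tiny hypothetical data and squared Euclidean distance; it illustrates the idea and is not the `knnclassify` implementation.

```matlab
% Hypothetical training set with two groups, and two samples to classify
training = [0 0; 0 1; 1 0; 5 5; 5 6; 6 5];
group    = [1; 1; 1; 2; 2; 2];
sample   = [0.5 0.5; 5.5 5.5];
K = 3;

classes = zeros(size(sample,1), 1);
for i = 1:size(sample,1)
    d = sum(bsxfun(@minus, training, sample(i,:)).^2, 2);  % squared distances to all training points
    [~, order] = sort(d);                                  % nearest first
    classes(i) = mode(group(order(1:K)));                  % majority class among the K nearest
end
```

All the computation happens at prediction time, which is why large training sets make this classifier slow to apply.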
[Figure: K-nearest-neighbour classification regions, axes x1 and x2, legend group1 to group8]
52. Decision Trees – how do they work?
• Find the threshold value for a variable that best partitions the dataset
• Search for such thresholds over all predictors
• The resulting model is a tree where each node is a logical test on a predictor (var1<thresh1, var2>thresh2)
53. Decision Trees – how do I build them ?
• Build tree model
>> tree = classregtree(x,y);
>> view(tree)
• Evaluate the model on new data
>> tree(x_new)
[Figure: decision-tree classification regions, axes x1 and x2, legend group1 to group8]
54. Enhancing the model : bagged trees
• Prune the decision tree
>> [cost,secost,ntnodes,bestlevel] = test(t, 'test', x, y);
>> topt = prune(t, 'level', bestlevel);
• Bootstrap-aggregated ("bagged") forest of trees
>> forest = TreeBagger(100, x, y);
>> y_pred = predict(forest,x);
• Visualise class boundaries as before
[Figure: bagged-trees classification regions, axes x1 and x2, legend group1 to group8]
55. Pattern Recognition Neural Network – what are they?
• Two-layer (i.e. one-hidden-layer) feed forward neural
networks can learn any input-output relationship
given enough neurons in the hidden layer.
• No restrictions on the predictors
56. Pattern Recognition Neural Network – how do I build them ?
• Build a neural network model
>> net = patternnet(10);
• Train the net to classify
observations
>> [net,tr] = train(net,x,y);
• Apply the model to new data
>> y_pred = net(x);
[Figure: neural-network classification regions, axes x1 and x2, classes 1 to 8]
57. Support Vector Machines – what are they?
• The SVM algorithm finds a boundary between the classes
that maximises the minimum distance of the boundary
to any of the points
• No restrictions on the predictors
• 1 vs all to classify multiple classes
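The 'maximise the minimum distance' statement corresponds to a small optimisation problem. As a sketch for the linearly separable, linear-kernel case only, with labels $y_i \in \{-1,+1\}$ (the symbols $w$ and $b$ are notation assumed here):

```latex
\min_{w,\,b} \; \frac{1}{2} \lVert w \rVert_2^2
\quad \text{subject to} \quad
y_i \,\bigl( w^\top x_i + b \bigr) \ge 1, \qquad i = 1, \dots, n
```

The distance between the two margin hyperplanes is $2 / \lVert w \rVert_2$, so minimising $\lVert w \rVert_2$ maximises the margin; the training points for which the constraint holds with equality are the support vectors.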
58. Support Vector Machines – how do I build them ?
• Build an SVM model
>> svmmodel = svmtrain(x,y)
• Try different kernel functions
>> svmmodel = svmtrain(x,y,'kernel_function','rbf')
• Apply the model to new data
>> classes = svmclassify(svmmodel,x_new);
[Figure: two classes (1 and 2) separated by the SVM boundary, with the support vectors highlighted]
59. Evaluating a Classifying Model
• Three main strategies
• Resubstitution – test the model on the same data that you trained it with
• Cross-validation
• Holdout – test on a completely new dataset
• Use cross-validation to evaluate model parameters such as the number of leaves for a tree or the number of hidden neurons
• Apply cross-validation to your classifying model
>> cp = cvpartition(y,'k',10);
>> ldaFun= @(xtrain,ytrain,xtest)(classify(xtest,xtrain,ytrain));
>> ldaCVErr = crossval('mcr',x,y,'predfun',ldaFun,'partition',cp)
60. Summary: Classification algorithms
No absolute best method
Simple does not mean inefficient
Decision trees and neural networks can overfit the noise : use bootstrapping and cross-validation
Parallelize
61. Regression
Why Regression ? Can be used to:
o Learn to model a continuous response from observations
o Predict the response for new observations
• Means:
• Linear regressions
• Non-linear regressions
• Bootstrapped regression tree
• Neural network as a fitting tool
62. New data set with a continuous response from one predictor
• Non-linear function to fit
• A continuous response to fit from one continuous predictor
>>[x,t] = simplefit_dataset;
63. Linear Regression – what is it?
• A collection of methods that
find the best coefficients b
such that y ≈ X*b
• Best b means minimising
the least squares difference
between the predicted and
actual values of y
• “Linear” means linear in b –
you can include extra
variables to give a nonlinear
relationship in X
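The 'linear in b' point can be illustrated with a quadratic fit: the model is nonlinear in x but still solved by ordinary least squares. This is a minimal sketch in base MATLAB on hypothetical data; the true coefficients and noise level are made up for the example.

```matlab
% Hypothetical noisy quadratic response
rng(0);
x = linspace(0, 10, 50)';
y = 1 + 2*x - 0.3*x.^2 + 0.1*randn(50,1);

X = [ones(size(x)) x x.^2];   % design matrix: constant, x and x^2 as extra variables
b = X \ y;                    % least-squares coefficients, y is approximately X*b
y_fit = X * b;                % fitted values
```

Adding the `x.^2` column changes the shape of the fitted curve without changing the estimation method.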
64. Linear Regression – how do I do it ?
• Least squares with the backslash operator
>> b = X\y
• Linear Regression
>> b = regress(y, [ones(size(X,1),1) X])
>> stats = regstats(y, X) %regstats adds the constant term itself
• Robust Regression – better in the presence of outliers
>> robust_b = robustfit(X,y) %NB (X,y) not (y,X)
• Ridge Regression – better if data is close to collinear
>> ridge_b = ridge(y,X,k) %k is the ridge parameter
• Apply the model to new data
>> y_new = newdata*b;
65. Interpreting a linear regression model
• Examine coefficients to see which predictors have a large effect on the response
>> [b,bint,r,rint,stats] = regress(y,X)
>> errorbar(1:size(b,1), b, b-bint(:,1), bint(:,2)-b)
• Examine residuals to check for possible outliers
>> rcoplot(r,rint)
• Examine the R2 statistic and p-value to check overall model significance
>> stats(1)*100 %R2 as a percentage
>> stats(3) %p-value
• Additional diagnostics with regstats
67. Fit Neural Network– what are they?
• Fitting networks are feedforward neural networks used to fit
an input-output relationship.
• This architecture can learn any input-output relationship given
enough neurons.
• No restrictions on the predictors
(categorical,ordinal,discontinuous)
68. Fit Neural Network– how do I build them ?
• Build a fit neural net model
>> net = fitnet(10);
• Train the net to fit the target
>> [net,tr] = train(net,x,t);
• Apply the model to new data
>> y_pred = net(x);
[Figure: "Function Fit for Output Element 1". Top panel: Output and Target versus Input, showing Targets, Outputs, Errors and Fit. Bottom panel: Error (Targets - Outputs) versus Input.]
69. Regression trees– what are they?
• A decision tree with binary splits for regression. An object
of class RegressionTree can predict responses for new data
with the predict method.
• No restrictions on the predictors
(categorical,ordinal,discontinuous)
70. Regression trees – how do I use them?
• Build a regression tree model
>> rtree = RegressionTree.fit(x',t'); % observations in rows
• Apply the model to new data
>> y_tree = predict(rtree,x');
71. Summary
Data Mining
• Exploration
o Univariate: pie chart, histogram, etc.
o Multivariate: feature selection and transformation
• Modelling
o Clustering: partitive (K-means, Gaussian mixture model, SOM) and hierarchical
o Classification: discriminant, decision tree, neural network, support vector machine
o Regression