1. Big Data modeling tools
SEPIA
3 Dec 2013
Pr Michel Béra
Chaire de Modélisation statistique du Risque
CNAM/SITI/IMATH
2. • Talk outline
– Vapnik's inequality and the foundations of a new theory of robustness (1971 and 1995)
– New light on classical methods (NN, decision trees, factor analysis)
– The notion of data geometry and of extended space – the kernel trick – qualitative vs. quantitative: an outdated battle
– Big Data and the Vapnikian world, utopias and realities – notions of computational complexity
– Modern modeling: a chain of approaches, from blind Machine Learning to the subtleties of Evidence-Based Policy
3. Statistical history
[Timeline figure:]
– 1930: Fisher; Kolmogorov-Smirnov
– 1950: Cramér
– 1960: Mainframe; huge datasets start appearing
– 1974: VC dimension (Vapnik-Chervonenkis)
– 1980: SRM (Vapnik)
– 1995: Support Vector Machines (Vapnik)
– 2001: Start of the internet era; millions of records and thousands of variables
[Annotations: theoretical statistics ("data are as they are") vs. applied statistics ("modeling data then testing"); theory of ill-posed problems; empirical methods of conjuration (PCA, NN, Bayes); the curse ("malediction") of high-dimensional problems; traffic-light markers "GO!", "Watch out!", "STOP!"]
4. 1. The world of Vapnik
– 1995 conference at Bell Labs (New Jersey)
5. Consistency: definition
1) A learning process (model) is said to be consistent if the model error, measured on new data sampled from the same underlying probability law as our original sample, converges, as the original sample size increases, towards the model error measured on the original sample.
2) A consistent model is also said to generalize well, or to be robust.
6. Consistent training?
[Figure: two plots of %error vs. number of training examples, each showing the test-error and training-error curves.]
7. Generalization: definition
• The generalization capacity of a model describes how the model (e.g., its error function) will perform on data it has never seen before (data outside its training set).
• Good generalization means that the model's error on new, unknown data will be of the same "size" as its known error on the training set. Such a model is also called "robust".
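The convergence behind these definitions can be illustrated numerically. A minimal sketch, under assumptions not in the slides (ordinary least squares on synthetic data y = 2x + noise; sample sizes and repetition counts are illustrative): as the training sample grows, the average gap between test error and training error shrinks.

```python
import numpy as np

# Illustrative consistency check (assumed setup: least squares on y = 2x + noise).
rng = np.random.default_rng(0)

def train_test_gap(n_train, n_reps=100):
    """Average (test MSE - training MSE) for a linear fit on n_train points."""
    gaps = []
    for _ in range(n_reps):
        x = rng.normal(size=n_train)
        y = 2 * x + rng.normal(size=n_train)
        a, b = np.polyfit(x, y, 1)                 # fit slope and intercept
        train_mse = np.mean((a * x + b - y) ** 2)
        xt = rng.normal(size=5000)                 # large fresh test sample
        yt = 2 * xt + rng.normal(size=5000)
        test_mse = np.mean((a * xt + b - yt) ** 2)
        gaps.append(test_mse - train_mse)
    return float(np.mean(gaps))

# The train/test gap shrinks as the sample size L grows: consistent training.
assert train_test_gap(2000) < train_test_gap(20)
```

The fitted family here has small, fixed capacity, so the learning process is consistent in the sense of slide 5.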
10. Vapnik's approach to modeling (1)
• Vapnik's approach is based on a family of functions S = {f(X,w), w ∈ W}, in which a model is chosen as one specific function, described by a specific w.
• For Vapnik, the model function must properly answer, for a given row X, the question described by the target Y, i.e. predict Y; the quality of the answer is measured by a cost function Q.
• Different families of functions may provide the same "quality" of answer.
11. Vapnik's approach to modeling (2)
• The whole trick is then to find a good family of functions S that not only answers in a "good way" the question described by the target Y, but is also easy to understand, i.e. provides a good description, allowing us to explain easily what underlies the data behaviour of the problem.
• VC dimension will be a key to understanding and controlling model robustness.
12. VC dimension - definition (1)
• Let us consider a sample (x1, .., xL) from Rn.
• There are 2^L different ways to separate the sample into two sub-samples.
• A set S of functions f(X,w) shatters the sample if all 2^L separations can be obtained by different f(X,w) from the family S.
13. VC dimension - definition (2)
A function family S has VC dimension h (h an integer) if:
1) There is at least one sample of h vectors from Rn that can be shattered by functions from S;
2) No sample of h+1 vectors can be shattered by functions from S.
14. Example: VC dimension
VC dimension:
- Measures the complexity of a solution (function).
- Is not directly related to the number of variables.
15. Other examples
• The VC dimension of hyperplanes of Rn is n+1.
• The VC dimension of the set of functions
f(x,w) = sign(sin(w·x)), c ≤ x ≤ 1, c > 0,
where w is a free parameter, is infinite.
– Conclusion: VC dimension is not always equal to the number n of parameters (X1, .., Xn) of a given family S of functions from Rn to {-1,+1}.
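Shattering can be checked by brute force in tiny cases. A minimal sketch (the point sets and the finite grid of candidate hyperplanes are illustrative assumptions): lines in R2, whose VC dimension is n+1 = 3, realize all 2^3 labelings of 3 points in general position, but the XOR labeling of 4 points is not linearly separable.

```python
# Brute-force shattering check for lines in R^2 (toy points, finite grid of
# candidate hyperplanes; the grid is an illustrative assumption).

def predictions(points, w1, w2, b):
    # Labels assigned by the classifier sign(w1*x + w2*y + b).
    return tuple(1 if w1 * x + w2 * y + b > 0 else -1 for (x, y) in points)

def achievable_labelings(points):
    # All labelings realizable over a small finite grid of (w, b).
    return {predictions(points, w1, w2, b)
            for w1 in range(-2, 3) for w2 in range(-2, 3)
            for b in (-1.5, -0.5, 0.5, 1.5)}

three = [(0, 0), (1, 0), (0, 1)]             # 3 points in general position
four_xor = [(0, 0), (1, 1), (0, 1), (1, 0)]  # the XOR configuration

# Lines shatter the 3 points: all 2^3 separations are realized.
assert len(achievable_labelings(three)) == 2 ** 3
# But no line realizes the XOR labeling of 4 points (provably, not just
# on this grid), so 4 points witness the bound h = 3.
assert (1, 1, -1, -1) not in achievable_labelings(four_xor)
```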
16. Key example: linear models -> y = <w|x> + b
• The VC dimension of the family S of linear models
f(x,w) = sign(<w|x> + b), with ||w|| ≤ C,
for data contained in a sphere of radius R, satisfies h ≤ min(R²C², n) + 1:
it depends on C and can take any value between 0 and n.
This is the basis for Machine Learning approaches such as SVM (Support Vector Machines) or Ridge Regression.
17. VC dimension: interpretation
• The VC dimension of S is an integer that measures the shattering (or separating) power ("complexity") of the function family S.
• We shall now see that VC dimension (via a major theorem of Vapnik) gives a powerful indication of model consistency, hence "robustness".
18. What is a Risk Functional?
• A function of the parameters of the learning machine, assessing how much it is expected to fail on a given task.
[Figure: the risk R[f(x,w)] plotted over the parameter space w, with its minimum at w*.]
19. Examples of Risk Functionals
• Classification:
– Error rate
– AUC
• Regression:
– Mean square error
20. Lift Curve
[Figure: lift curve – fraction of good customers selected vs. fraction of customers selected; customers are ordered according to f(x) and the top-ranking ones are selected; the ideal lift reaches 100% of the good customers first; O and M label the areas defined by the curves.]
Gini index: KI = M / O, with 0 ≤ KI ≤ 1.
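The KI index can be computed directly from scores and labels. A minimal sketch under one standard definition (the area between the model's lift/CAP curve and the diagonal, divided by the same area for the ideal model that ranks all good customers first); the discrete area computation and the toy data are illustrative assumptions.

```python
def accuracy_ratio(scores, labels):
    """Gini index KI of a lift (CAP) curve, one standard definition:
    (model area - diagonal area) / (ideal area - diagonal area)."""
    n, n_good = len(labels), sum(labels)
    ranking = sorted(range(n), key=lambda i: -scores[i])
    captured, model_area = 0, 0.0
    for i in ranking:                       # walk down the score ranking
        captured += labels[i]
        model_area += captured / n_good / n  # rectangle of width 1/n
    diag_area = (n + 1) / (2 * n)
    ideal_area = ((n_good + 1) / 2 + (n - n_good)) / n
    return (model_area - diag_area) / (ideal_area - diag_area)

labels = [1, 0, 1, 0, 0, 1, 0, 0]
# A perfect ranking (scores identical to the target) reaches KI = 1.
assert abs(accuracy_ratio(labels, labels) - 1.0) < 1e-9
```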
22. Learning Theory Problem (1)
• A model computes a function y = f(X, w).
• Problem: minimize in w the Risk Expectation
R(w) = ∫ Q(z, w) dP(z)
– w: a parameter that specifies the chosen model
– z = (X, y): a possible value of the attributes (variables) X and the target y
– Q measures (quantifies) the model error cost
– P(z): the underlying (unknown) probability law of the data z
23. Learning Theory Problem (2)
• We get L data points from a learning sample (z1, .., zL), assumed iid sampled from the law P(z).
• To minimize R(w), we start by minimizing the Empirical Risk over this sample:
E(w) = (1/L) Σ i=1..L Q(zi, w)
• Examples of classical cost functions:
– classification (e.g., Q can be a cost function based on the cost of misclassified points)
– regression (e.g., Q can be a cost function of least-squares type)
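Empirical risk minimization can be made concrete on a toy family. A minimal sketch (family, cost function and sample are illustrative assumptions): S is the set of threshold classifiers f(x,w) = sign(x − w), Q is the 0-1 misclassification cost, and E(w) is minimized by scanning candidate thresholds.

```python
# Minimal empirical-risk-minimization sketch: the model family is
# S = {f(x, w) = sign(x - w)} (threshold classifiers on the real line),
# Q is the 0-1 misclassification cost, and we minimize
# E(w) = (1/L) * sum_i Q(z_i, w) over candidate thresholds.

def empirical_risk(w, sample):
    return sum(1 for x, y in sample if (1 if x > w else -1) != y) / len(sample)

sample = [(0.2, -1), (0.9, -1), (1.4, -1), (2.1, 1), (2.8, 1), (3.5, 1)]
candidates = [x for x, _ in sample]       # thresholds at the sample points
w0 = min(candidates, key=lambda w: empirical_risk(w, sample))
assert empirical_risk(w0, sample) == 0.0  # this toy sample is separable at w0
```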
24. Learning Theory Problem (3)
• Central problem for Statistical Learning Theory:
What is the relation between the Risk Expectation R(w) and the Empirical Risk E(w)?
• How can we define and measure a generalization capacity ("robustness") for a model?
25. Four Pillars for SLT (1 and 2)
• Consistency (guarantees generalization)
– Under what conditions will a model be consistent ?
• Model convergence speed (a measure for
generalization capacity)
– How does generalization capacity improve when
sample size L grows?
26. Four Pillars for SLT (3 and 4)
• Generalization capacity control
– How to control in an efficient way model generalization
starting with the only given information we have: our
sample data?
• A strategy for good learning algorithms
– Is there a strategy that guarantees, measures and controls
our learning model generalization capacity ?
27. Vapnik's main theorem
• Q: Under which conditions will a learning process (model) be consistent?
• A: A model will be consistent if and only if the function f that defines the model comes from a family of functions S with finite VC dimension h.
• A finite VC dimension h not only guarantees generalization capacity (consistency): picking f in a family S with finite VC dimension h is the only way to build a model that generalizes.
28. Model convergence speed (generalization capacity)
• Q: What is the nature of the model risk difference between learning data (sample: empirical risk) and test data (expected risk), for a sample of finite size L?
• A: This difference is no greater than a limit that depends only on the ratio between the VC dimension h of the model function family S and the sample size L, i.e. h/L.
This statement is a theorem in the Kolmogorov-Smirnov tradition, i.e. a distribution-free result that does not depend on the data's underlying probability law.
29. Empirical risk minimization in the LS case
• With probability 1-q, the following inequality is true:
R(w0) ≤ E(w0) + √( [h (ln(2L/h) + 1) − ln(q/4)] / L )
where w0 is the parameter value that minimizes the Empirical Risk:
E(w0) = min over w of E(w)
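The confidence term of the bound is easy to evaluate. A minimal sketch (using the bound in the form quoted above; the values of h, L and q are illustrative): the term shrinks as the sample size L grows and grows with the VC dimension h.

```python
import math

# VC confidence term from the bound quoted above:
# Phi(h, L, q) = sqrt( (h * (ln(2L/h) + 1) - ln(q/4)) / L )
def vc_confidence(h, L, q=0.05):
    return math.sqrt((h * (math.log(2 * L / h) + 1) - math.log(q / 4)) / L)

# The bound tightens as the sample grows ...
assert vc_confidence(h=10, L=10_000) < vc_confidence(h=10, L=1_000)
# ... and loosens as the family's shattering power h grows.
assert vc_confidence(h=100, L=1_000) > vc_confidence(h=10, L=1_000)
```

As slide 31 notes, the numerical value is usually too large to be useful; what matters is its monotone dependence on h/L.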
31. The “SRM” methodology: how to control model generalization capacity
Expected Risk ≤ Empirical Risk + Confidence Interval
• Minimizing the Empirical Risk alone will not always give good generalization capacity: one wants to minimize the sum of the Empirical Risk and the Confidence Interval.
• What matters is not the numerical value of Vapnik's limit, most often too large to be of any practical use; it is the fact that this limit is a non-decreasing function of the model function family's "richness", i.e. its shattering power.
32. SRM strategy (1)
• With probability 1-q,
R(w) ≤ E(w) + √( [h (ln(2L/h) + 1) − ln(q/4)] / L )
• When h/L is too large, the second term of the inequality becomes large.
• The basic idea of the SRM strategy is to minimize simultaneously both terms on the right-hand side of this bound on R(w).
• To do this, one has to make h a controlled parameter.
33. SRM strategy (2)
• Let us consider a nested sequence S1 ⊂ S2 ⊂ .. ⊂ Sn of model function families, with respective growing VC dimensions
h1 < h2 < .. < hn
• For each family Si of our sequence, the inequality
R(w) ≤ E(w) + √( [hi (ln(2L/hi) + 1) − ln(q/4)] / L )
is valid.
34. SRM strategy (3)
SRM: find i such that the expected risk R(w) becomes minimum, for a specific h* = hi, corresponding to a specific family Si of our sequence; build the model using f from Si.
[Figure: risk vs. model complexity (ln h/L) – the empirical risk decreases while the confidence interval grows; their sum, the total risk, is minimized at h*, the best model.]
35. How to choose h*: cross-validation
• The learning sample of size L is divided in two: a basic learning set of size L1 and a validation set of size L2.
• For a given meta-parameter that controls the richness of the model family S, hence its h, a model is built on the basic learning set and its actual risk is measured on the validation set.
• The meta-parameter is chosen so that the model's actual risk is minimum on the validation set: this leads to the best family, i.e. h*.
• The final model is computed from this optimal family: the best trade-off between fit and robustness is achieved by construction.
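The holdout procedure can be sketched in a few lines. A minimal sketch, under illustrative assumptions not in the slides (the meta-parameter controlling family richness is a polynomial degree; the synthetic dataset, split sizes and degree grid are arbitrary choices):

```python
import numpy as np

# Holdout selection of the meta-parameter h* (here: polynomial degree).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 60)
y = np.sin(3 * x) + rng.normal(0, 0.2, 60)   # noisy synthetic target

x_learn, y_learn = x[:40], y[:40]            # basic learning set (size L1)
x_valid, y_valid = x[40:], y[40:]            # validation set (size L2)

def validation_risk(degree):
    """Fit on the learning set, measure actual risk on the validation set."""
    coeffs = np.polyfit(x_learn, y_learn, degree)
    return np.mean((np.polyval(coeffs, x_valid) - y_valid) ** 2)

degrees = range(1, 16)
best = min(degrees, key=validation_risk)     # the best family, i.e. h*

# By construction the selected degree does at least as well as the
# underfitting (1) and heavily overfitting (15) extremes.
assert validation_risk(best) <= validation_risk(1)
assert validation_risk(best) <= validation_risk(15)
```

The final model would then be refit on the whole sample with the selected degree.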
36. Some Learning Machines
• Linear models
• Polynomial models
• Kernel methods
• Neural networks
• Decision trees
37. Learning Process
• Learning machines include:
– Linear discriminant (including Naïve Bayes)
– Kernel methods
– Neural networks
– Decision trees, Random Forests
• Learning is tuning:
– Parameters (weights w or α, threshold b)
– Hyperparameters (basis functions, kernels, number of units, number of features/attributes)
38. Industrial Data Mining: implementation example
[Diagram: inputs x1..xn feed a System producing outputs y1..yp; the modeling pipeline chains Data Preparation, Data Encoding (descriptors; hyper-parameters κ, σ), a Class of Models (polynomials, ridge regression; parameters w, γ), a Loss Criterion (KI, Gini index) and a Learning Algorithm, with hyper-parameters tuned automatically via SRM.]
39. Data Encoding/Compression
• Encodes nominal and ordinal variables numerically.
• Encodes continuous variables non-linearly.
• Compresses variables into robust categories.
• Handles missing values and outliers.
• This process includes adjustable hyper-parameters.
40. Multiple Structures
S1 ⊂ S2 ⊂ … ⊂ SN
• Weight decay/Ridge regression:
Sk = { w : ||w||² < ωk }, ω1 < ω2 < … < ωk
γ1 > γ2 > … > γk (γ is the ridge)
• Feature selection:
Sk = { w : ||w||0 < σk }, σ1 < σ2 < … < σk (σ is the number of features)
• Data compression:
κ1 < κ2 < … < κk (κ may be the number of clusters)
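The nesting of the ridge structure can be verified numerically. A minimal sketch (synthetic dataset and γ grid are illustrative assumptions): with the closed-form ridge solution w(γ) = (XᵀX + γI)⁻¹Xᵀy, increasing γ shrinks ||w||, so larger γ confines the solution to a smaller set Sk.

```python
import numpy as np

# Ridge structure sketch: w(γ) = (XᵀX + γI)⁻¹ Xᵀy.  Increasing γ shrinks
# ||w||, so the sets Sk = {w : ||w||² < ωk} reached for decreasing γ are
# nested, exactly the multiple structure S1 ⊂ S2 ⊂ … above.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(0, 0.1, 50)

def ridge_weights(gamma):
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + gamma * np.eye(n_features), X.T @ y)

norms = [np.linalg.norm(ridge_weights(g)) for g in (0.01, 1.0, 100.0)]
assert norms[0] > norms[1] > norms[2]   # growing γ  =>  shrinking ||w||
```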
41. Hyper-parameter selection
• w = parameter vector; γ, σ, κ = hyper-parameters.
• Cross-validation with K folds – for various values of γ, σ, κ:
– Adjust w on (K-1)/K of the training examples.
– Test on the remaining 1/K of the examples.
– Rotate the folds and average the test results (CV error).
– Select γ, σ, κ to minimize the CV error.
– Re-compute w on all training examples using the optimal γ, σ, κ.
[Figure: the training data (X, y) split into K folds, plus held-out test data; a prospective study provides the "real" validation.]
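The K-fold recipe above can be sketched as follows (synthetic dataset, fold count and γ grid are illustrative assumptions; ridge weights are computed in closed form):

```python
import numpy as np

# K-fold cross-validation to select the ridge hyper-parameter γ.
rng = np.random.default_rng(2)
X = rng.normal(size=(60, 8))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.5, 60)

def fit(Xtr, ytr, gamma):
    """Closed-form ridge weights on a training subset."""
    return np.linalg.solve(Xtr.T @ Xtr + gamma * np.eye(Xtr.shape[1]), Xtr.T @ ytr)

def cv_error(gamma, K=5):
    folds = np.array_split(np.arange(len(y)), K)
    errs = []
    for k in range(K):                      # rotate the held-out fold
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        w = fit(X[train], y[train], gamma)
        errs.append(np.mean((X[test] @ w - y[test]) ** 2))
    return float(np.mean(errs))             # average test results = CV error

grid = [0.01, 0.1, 1.0, 10.0, 100.0]
gamma_opt = min(grid, key=cv_error)         # γ minimizing the CV error
w_final = fit(X, y, gamma_opt)              # re-compute w on all examples
```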
42. SRM put to work: campaign optimization
[Figure: lift curve as on slide 20 – fraction of good customers selected vs. fraction of customers selected, customers ordered according to f(x) and the top-ranking ones selected; alongside the ideal lift, the cross-validated (CV) lift defines an area G, with KI = M / O and KR = 1 − G / O.]
43. Summary
• Weight decay is a powerful means of avoiding overfitting.
• It is also known as "ridge regression".
• It is grounded in SRM theory.
• Multiple structures are used by most current DM engines: ridge, feature selection, data compression.
44. Some concrete examples
• Census: explaining what makes a person earn more or less than $50,000/year
• Biostatistical data: feature reduction
45. Ockham's Razor
• Principle proposed by William of Ockham in the fourteenth century: "Pluralitas non est ponenda sine necessitate".
• Of two theories providing similarly good predictions, prefer the simplest one.
• Shave off unnecessary parameters of your models.
46. Vision: the predictive modeling workshop
• Data mining/machine learning intervenes upstream to select, within a large set of variables and for a given problem, the "good" variables likely to support useful inference. This step can be "automated".
• One then sets up the stratification, the randomization and the appropriate RCTs, based on these "particularly interesting" variables.
• One finishes with tests on the results (a step that can also be automated).
• => an accelerator for producing results, for an ever more effective Evidence-Based Policy.