Representation learning
in limited-data settings
Gaël Varoquaux
Limited-data settings
n is to be compared to:
- a measure of the signal-to-noise ratio
- the dimensionality of the data, p
Deep learning does not work well in
small-sample regimes,
but we can borrow ideas from it
This talk: no silver bullet,
but many simple (shallow) tricks
G Varoquaux 1
Small-n problems are important
83% of data scientists1 never have n > 1M
n is often small for applications such as medicine
Bigger is better (how not to use this talk)
- Get more data (pool related datasets)
- Find a related problem and try transfer
This talk: data that differs from common sources
1 www.kaggle.com/laurae2/data-scientists-vs-size-of-datasets
G Varoquaux 2
Perils of deep learning with small n
Selecting architecture, learning rate...
A deep architecture is validated by its measured accuracy
⇒ overfitting of the validation & test set
Sampling noise for ntest = 1000:
[figure: binomial distribution of the error on test accuracy, spanning roughly -2% to +2%]
Optimizing test accuracy will explore the tails
cf online challenges
Need for guiding principles
[Varoquaux 2018]
G Varoquaux 3
Outline
1 Representations for machine learning
Non-asymptotic supervised learning
Learning with representations
Supervised learning of representations
2 Matrix factorization and its variants
For signals
For discrete objects
3 Fisher kernels
Kernels and feature maps
From likelihoods to Kernels
G Varoquaux 4
1 Representations for machine
learning
Defining the notion of representations
Their use for supervised learning
1 Representations for machine learning
Non-asymptotic supervised learning
Learning with representations
Supervised learning of representations
Settings: supervised learning
Given n pairs (x, y) ∈ X × Y drawn i.i.d.,
find a function f : X → Y such that f(x) ≈ y
Notation: ŷ def= f(x)
Empirical risk minimization
Loss function l : Y × Y → ℝ
Estimation of f: f* = argmin_{f∈F} 𝔼[l(ŷ, y)]
This course: how to choose good function classes F
G Varoquaux 7
Example: finite-sample estimation of f
Data generated with a 9th-order polynomial + noise
Fit polynomials of various degrees:
degree 1, degree 2, degree 5, degree 9, compared to the truth
Model too simple: underfit
Model too complex: overfit
G Varoquaux 8
Theory: the generalization error
Generalization error of a prediction function f:
Notation: E(f) def= 𝔼[l(y, f(x))]
Finite-sample regime
Ideally: f* = argmin_{f∈F} 𝔼[l(f(x), y)]
In practice: ˆf = argmin_{f∈F} Σ_{i=1}^n l(f(xi), yi)
E(ˆf) ≥ E(f*)
G Varoquaux 9
Theory: decomposing the generalization error
Assuming y = g(x) + e, e random with 𝔼[e] = 0,
the generalization error of ˆf is:
E(ˆf) = 𝔼[l(g(x) + e, ˆf(x))]
= E(g) + (E(f*) − E(g)) + (E(ˆf) − E(f*))
Bayes rate E(g) = 𝔼[l(g(x) + e, g(x))]: best possible prediction
Due to the noise e; cannot be avoided
Approximation error E(f*) − E(g): g ∉ F
Our model is wrong
Decreases for larger F
Empirical upper bound: train error
Estimation error E(ˆf) − E(f*): sampling noise on the train data, ˆf ≠ f*
Finite-sample problem
Decreases as n grows
Increases for larger F
Guesstimate: difference between train and test error
G Varoquaux 10
Example: polynomial regression degree
ˆf = argmin_{f∈F} Σ_i l(f(xi), yi)
Degree 9, small n:
no approximation error, large estimation error
Function class F not restrictive enough
Degree 1, large n:
small estimation error, large approximation error
Function class F too restrictive
G Varoquaux 11
Gauging overfit vs underfit: learning curves
[figure: training and generalization error as a function of the
number of samples, for degree-9 and degree-1 polynomials]
sklearn.model_selection.learning_curve
Overfit region
Underfit? Or Bayes rate?
Estimation error ∼ gap between train and test error
Simpler models reach the asymptotic regime faster
(smaller “sample complexity”)
But can underfit
G Varoquaux 12
Gauging overfit vs underfit: validation curves
[figure: training and generalization error as a function of the
polynomial degree]
sklearn.model_selection.validation_curve
Reveals underfits
G Varoquaux 13
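A minimal usage sketch of these two scikit-learn helpers (the estimator and parameter ranges are illustrative):

import numpy as np
from sklearn.model_selection import learning_curve, validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = np.polyval(rng.randn(10), X[:, 0]) + 0.1 * rng.randn(200)

model = make_pipeline(PolynomialFeatures(degree=9), Ridge(alpha=1e-3))

# Learning curve: error as a function of the number of samples
sizes, train_scores, test_scores = learning_curve(
    model, X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

# Validation curve: error as a function of model complexity (polynomial degree)
train_scores, test_scores = validation_curve(
    model, X, y, param_name="polynomialfeatures__degree",
    param_range=range(1, 15), cv=5)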
Linear models for limited-data settings
In high-dimensional limited-data settings,
linear models are often the best choice
For p-dimensional data, x ∈ ℝ^p,
they have p parameters
Example with n ∼ 200 000:
Inpatient Mortality, AUROC (95% CI)   Hospital A        Hospital B
Deep learning                         0.95 (0.94-0.96)  0.93 (0.92-0.94)
Baseline (logistic regression)        0.93 (0.92-0.95)  0.91 (0.89-0.92)
G Varoquaux 14
Theory: approximating with linear predictors
Linear predictor1: ˆy = x^T w, w ∈ ℝ^p
Data model: y = x^T w* + δ(x) + e,  𝔼[e] = 0
x^T w*: best linear predictor
Ridge estimator:
ˆw = argmin_w ‖y_train − X_train w‖²₂ + λ ‖w‖²₂
Error compared to the best linear predictor:
𝔼[(y − x^T ˆw)²] = 𝔼[(y − x^T w*)²] + o(σ² p / n_train)
[Hsu... 2014, sec 2.5]
Random design analysis can characterize the generalization
error without assuming a correct data-generating model
(mis-specified model) [Hsu... 2014, Rosset and Tibshirani 2018]
Approximation error: data not linearly generated
⇒ craft more features
Estimation error: curse of dimensionality
⇒ limit the number of features
1 Predictor, not model: we do not assume it is a data-generating process.
G Varoquaux 15
Example: extrapolating sea level (tides)
Predict sea level as a function of time
Test outside of the observed range1
Covariates tried:
- polynomial regression: dim = 10, 100, 1000
- sines and cosines basis: dim = 10, 100, 1000
Choice of covariates / basis / signal representation
⇒ huge difference on approximation error
⇒ huge difference on generalization error
1 Technically, this is not in our theory: the test set is not drawn from the same range as the train set.
G Varoquaux 16
Summary
ˆy = f(x), f chosen in F
to minimize the observed error Σ_{i∈train} l(f(xi), yi)
Generalization error:
- approximation error ⇒ F adapted to the data
- estimation error ⇒ F small
Limited-data settings
Linear models are the best option when p ≳ n
A good choice of covariates is crucial
G Varoquaux 17
1 Representations for machine learning
Non-asymptotic supervised learning
Learning with representations
Supervised learning of representations
Representations to build F
Settings
z = r(x): representation of the data, z ∈ ℝ^k
Predictor f : x → ˆy = hw(r(x))
Function composition: “depth”
Benefits
For expressiveness: composition ≫ basis expansion
Composing L rectifying functions on intermediate representations
of dimension k gives O((k/p)^{p(L−1)} k^p) linear regions;
basis expansion + a linear predictor gives O(k)
Exponential in depth, linear with dimension [Montufar... 2014]
For multi-task learning: sharing representations across tasks
(y multidimensional)
For limited data: hw(z) = w^T z, a linear predictor
A good choice of z can decrease sample complexity
Transfer: r is learned on large data; a simple h is used.
G Varoquaux 19
Background: information theory
Entropy = amount of information in x
H(x) = 𝔼_p[− log p(x)]
Equi-probable distribution = high entropy
Uneven distribution = low entropy
[figure: two distributions over x = 0 ... 5]
Mutual information between x and y
I(x; y) = H(x) + H(y) − H(x, y)
x ⊥⊥ y (independent) ⇔ I(x; y) = 0
Independence ⇔ p(x, y) = p(x) p(y), and then
H(x, y) = −𝔼_{(x,y)}[log p(x, y)] = −𝔼_{(x,y)}[log p(x) + log p(y)]
= −𝔼_x[log p(x)] − 𝔼_y[log p(y)] = H(x) + H(y)
G Varoquaux 20
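A small numerical check of these definitions for discrete distributions (a sketch, not from the slides; the joint distribution is illustrative):

import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Joint distribution of two discrete variables
pxy = np.array([[0.2, 0.1],
                [0.1, 0.6]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)

mutual_information = entropy(px) + entropy(py) - entropy(pxy.ravel())
print(mutual_information)  # 0 iff pxy == np.outer(px, py)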
Theory: information in representations
A representation z of x is sufficient for y if y ⊥⊥ x | z,
or equivalently if I(z; y) = I(x; y)
x, z, y form a Markov chain if P(y|x, z) = P(y|z):
x → z → y
Data processing inequality: I(x; y) ≤ I(x; z)
A sufficient representation z is minimal when
I(x; z) is smallest among sufficient representations
[Achille and Soatto 2018]
G Varoquaux 21
Nuisances and invariances
A nuisance n: I(x; n) ≥ 0, but I(y; n) = 0
Representation z is invariant to the nuisance n
if z ⊥⊥ n, i.e. I(z; n) = 0 ⇒ we want I(z; n) low
In a Markov chain x → z1 → z2 · · · → zL → y:
if z is a sufficient representation for y,
I(z; n) ≤ I(z; x) − I(x; y)
Communication bottleneck: I(z1; z2) < I(z1; x)
⇒ I(z2; n) ≤ I(z1; z2) − I(x; y)
Stacking increases invariance
[Achille and Soatto 2018]
G Varoquaux 22
Invariant representations on a continuous space
Shift-invariant representation = Fourier basis
Fourier transform: F(s)_f = Σ_t e^{−i f t} s_t    (complex i)
Shifting the signal: s_t → s'_t = s_{t+k}
F(s')_f = Σ_t e^{−i f t} s_{t+k} = Σ_t e^{−i f (t−k)} s_t = e^{i k f} Σ_t e^{−i f t} s_t
= e^{i k f} F(s)_f → change in phase only
An orthonormal basis of shift-invariant vectors
G Varoquaux 23
Invariant representations on a continuous space
Shift invariance = Fourier basis
Local deformations = wavelets
Locally equivalent to a Fourier basis,
but without the global extent
Decimated wavelets:
isometric transform of the signal;
higher scales lose shift invariance
Redundant wavelets:
increase the dimensionality;
good shift invariance
G Varoquaux 23
Representations invariant to rich deformations
Scaling, rotations, deformations
Ingredients:
modulus of wavelet / Fourier transform
⇒ non-linearity & filter banks (convolutions)
+ stacking (repeating simple invariants)
Scattering transform
Derived from first principles
Building first-order invariants
Convolutional networks
Learned from data
Pooling across pixels (eg max)
[Mallat 2016]
G Varoquaux 24
Summary
Intermediate representations give
expressiveness to predictive models
Good representations keep predictive information
and lose nuisance information
Bottleneck and regularization to lose information
Limited-data settings
Given known invariants of the problem,
reusing existing representations helps
eg headless conv-net, wavelets... [Oyallon... 2017]
G Varoquaux 25
1 Representations for machine learning
Non-asymptotic supervised learning
Learning with representations
Supervised learning of representations
The need for supervision
Maximizing I(z; y) (≤ I(x; y)) → sufficient representations
⇒ supervised learning
while minimizing I(z; n) → nuisance removal
⇒ sampling the nuisance / invariants:
data augmentation
Challenge: amount of labeled data
Pretext tasks
Other targets y that capture useful information
Finding them needs domain knowledge
G Varoquaux 27
Deep architectures
ˆy = f^d_{Wd} ∘ ... ∘ f^1_{W1}(x)
Typically f^k_{Wk}(x) = g_k(W_k^T x), with g_k an element-wise non-linearity
Thus ˆy = g_d(W_d^T ... g_1(W_1^T x))
Stacked representations: the W_k
{W_k} optimized to minimize a prediction error
G Varoquaux 28
Shallow architectures for limited data
Keep one latent layer
Without non-linearity:
ˆy = x^T W1 W2,  y ∈ ℝ^k,  W1 ∈ ℝ^{p×d},  W2 ∈ ℝ^{d×k}:
a factored / reduced-rank linear model
Multi-task / multi-output literature
⇒ structured loss (multiple soft-max’s)
Overparametrization sometimes useful: d > k
can be achieved with dropout
[Bzdok... 2015, Mensch... 2018]
G Varoquaux 29
Simple case: square loss = reduced-rank regression
ˆY = X W1 W2,  Y ∈ ℝ^{n×k},  W1 ∈ ℝ^{p×d},  W2 ∈ ℝ^{d×k}
ˆW1, ˆW2 = argmin_{W1,W2} ‖ˆY − Y_train‖²_Fro
For the squared loss the problem is convex
Full-rank solution1 (X and Y on the train set):
ˆW = ˆΣ_X^{−1} X^T Y,   ˆY = X ˆW = X ˆΣ_X^{−1} X^T Y
Rank-d solution: [Izenman 1975, Rahim... 2017b]
ˆR_d def= Y^T ˆY ∈ ℝ^{k×k},  SVD → ˆU_d ˆs_d ˆV_d, with ˆU_d ∈ ℝ^{k×d}
then ˆW1 = ˆΣ_X^{−1} X^T Y ˆU_d,  ˆW2 = ˆU_d^T
(the full-rank solution composed with a low-rank projector2)
1 No need for pesky SGDs
2 The projector captures the variance explained on the multiple outputs
G Varoquaux 30
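A minimal numpy sketch of this closed-form estimator (variable names are illustrative; a small ridge term is added to Σ_X for numerical stability):

import numpy as np

def reduced_rank_regression(X, Y, d, ridge=1e-6):
    """Rank-d multi-output linear model Y ~ X @ W1 @ W2."""
    p = X.shape[1]
    # Full-rank (lightly regularized) solution
    W_full = np.linalg.solve(X.T @ X + ridge * np.eye(p), X.T @ Y)
    Y_hat = X @ W_full
    # Project onto the d leading directions of explained output variance
    U, s, Vt = np.linalg.svd(Y.T @ Y_hat)
    U_d = U[:, :d]
    W1 = W_full @ U_d        # (p, d): maps inputs to the latent representation
    W2 = U_d.T               # (d, k): maps the latent representation to outputs
    return W1, W2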
Model stacking
x —f1→ z —f2→ y        Learn f1 separately
Directly supervising z:
z = ˆy for a (simple) predictive model
Trick: “cross-fit” during training:
obtain ˆy by splitting the training data
(in sklearn: cross_val_predict)
Application: tackling dimensionality [Rahim... 2017a]
Some features are a high-dimensional signal
eg medical images
f1: linear, to reduce the signal features
f2: non-linear (eg trees) on all features
G Varoquaux 31
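A sketch of this cross-fitted stacking with scikit-learn (the choice of first- and second-level models, and the random data, are illustrative):

import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import RidgeCV
from sklearn.ensemble import RandomForestRegressor

# X_signal: high-dimensional signal features; X_other: a few extra columns
rng = np.random.RandomState(0)
X_signal, X_other = rng.randn(300, 5000), rng.randn(300, 5)
y = rng.randn(300)

# f1: linear model on the signal, cross-fitted so that z is out-of-fold
z = cross_val_predict(RidgeCV(), X_signal, y, cv=5)

# f2: non-linear model on the reduced signal + the other features
X_second = np.column_stack([z, X_other])
f2 = RandomForestRegressor(n_estimators=100).fit(X_second, y)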
Model stacking to encode discrete items
Sex  Date Hired   Employee Position       predict→  Salary
M    09/12/1988   Master Police Officer             69222.18
F    06/26/2006   Social Worker III                 97392.47
M    07/16/2007   Police Officer III                104717.28
Difficulty: the number of different positions
what invariants?
[figure: distribution of salaries for a few positions,
from Crossing Guard to Manager II]
Target encoding1 [Micci-Barreca 2001]
position → 𝔼_train[salary | position]
1 To inject categories in ℝ, before a second level that combines all columns
G Varoquaux 32
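A pandas sketch of this encoder (toy data; a cross-fitted variant, as for stacking above, would compute the per-category means on held-out folds):

import pandas as pd

df = pd.DataFrame({
    "position": ["Officer", "Officer", "Social Worker", "Social Worker", "Clerk"],
    "salary":   [69000.0,   72000.0,   97000.0,         95000.0,         45000.0],
})

# Target encoding: replace each category by E_train[salary | position]
means = df.groupby("position")["salary"].mean()
df["position_encoded"] = df["position"].map(means)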
Summary
Supervision helps selecting
the relevant part of the signal
In limited-sample settings, simple
models can create representations
Simple latent-factor models
Multi-output models
Stacking: fit a first-level model
G Varoquaux 33
Summary of first section
For generalization: a small family of functions fw that
approximates the signal well
Generalization error of a linear predictor:
approximation error + o(p/n_train)
Predictors by composition: ˆy = f2(z), z = f1(x)
x —f1→ z —f2→ y   ideally, f1 makes z invariant to nuisances
Reuse representations with the right invariances:
wavelets, fastText, pretrained headless neural nets
Simple supervised models can create representations:
stacking, multi-output, pretext tasks
G Varoquaux 34
2 Matrix factorization and its
variants
Simple unsupervised representation learning
More unlabeled data than labeled data
Learn representations and transfer them
Here: Focus on simple models for limited n or low SNR settings
Particularly interesting regime: p large and n large.
2 Matrix factorization and its variants
For signals
For discrete objects
Principal Component Analysis
Find the directions of largest variance
Computation
X ∈ ℝ^{n×p},  Σ_X = X^T X ∈ ℝ^{p×p}
PCA projector: P_PCA ∈ ℝ^{p×k}, from SVD_k(X) or EVD_k(Σ_X)
Reduced X: X P_PCA ∈ ℝ^{n×k}
Model: low-rank Gaussian latent factors
X ≈ U V + E,  E ∼ N(0, I_p),  U ∈ ℝ^{n×k},  V ∈ ℝ^{k×p}
ˆU, ˆV = argmin_{U,V} ‖X − U V‖²_Fro
Rotationally invariant: (U O, O^T V) is also a solution for any O s.t. O^T O = I
In a learning pipeline
Useful for dimensionality reduction (eg when p is large)
Eases statistics and computations
Generalization error of PCA + OLS is
within a factor of 4 of ridge [Dhillon... 2013]
G Varoquaux 37
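A scikit-learn sketch of PCA used for dimension reduction before a linear predictor (the number of components is illustrative):

from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

# Reduce p-dimensional data to k=50 principal components, then regress
model = make_pipeline(PCA(n_components=50), LinearRegression())
# model.fit(X_train, y_train); model.predict(X_test)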
Beyond variance: Independent Component Analysis
Separate out signals U observed mixed1
[figure: true source signals, mixed observations, and the
ICA-recovered signals]
Model: X = U V,  V ∈ ℝ^{p×p},  V^T V = I_p
If V is Gaussian, the model is not identifiable
Seek low mutual information across the {uj}
⇒ maximally non-Gaussian marginals [Cardoso 2003]
Computation: FastICA [Hyvärinen and Oja 2000]
Power iterations on V; each time:
- apply a smooth increasing non-linearity on the {uj}
- decorrelate
Preprocessing: whiten the data, eg with PCA
1 Classic ICA has no noise model: it does not do dimension reduction
G Varoquaux 38
ICA to learn representations
Across patches of natural images:
Gabor-like filters
Similar to wavelets
and to the first layer of convnets
[Hyvärinen and Oja 2000]
G Varoquaux 39
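A scikit-learn sketch of learning such components with FastICA on image patches (the patch extraction and the random stand-in image are illustrative; real natural images are needed to obtain Gabor-like filters):

import numpy as np
from sklearn.decomposition import FastICA
from sklearn.feature_extraction.image import extract_patches_2d

rng = np.random.RandomState(0)
image = rng.rand(256, 256)           # stand-in for a natural image
patches = extract_patches_2d(image, (8, 8), max_patches=5000, random_state=0)
X = patches.reshape(len(patches), -1)
X -= X.mean(axis=0)                   # center the patches

ica = FastICA(n_components=40, random_state=0)
codes = ica.fit_transform(X)          # per-patch representation
filters = ica.components_             # Gabor-like filters on real images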
Dictionary learning
Find vectors V that represent the signal well
with sparse combinations U
Model: X = U V s.t. U is sparse
U ∈ ℝ^{n×k}, V ∈ ℝ^{k×p}
k can be > p (overcomplete dictionary)
Estimation: ˆU, ˆV = argmin_{U,V, s.t. ‖vi‖²₂ ≤ 1} ‖X − U V‖²_Fro + λ ‖U‖₁
Combining a squared loss and an ℓ1 penalty creates sparsity
The constraint on ‖vi‖²₂ is required to avoid cancelling
out the penalty with V → ∞ and U → 0
More generally: ˆU, ˆV = argmin_{U,V, s.t. V∈C} ‖X − U V‖²_Fro + λ Ω(U)
The constraint set and the penalty can be varied1:
typically ℓ2, ℓ1, and positivity2 on U or V.
1 Fast when C and Ω lead to simple projections and penalized regression.
2 Recovers a form of NMF (non-negative matrix factorization)
G Varoquaux 40
Sparse dictionary learning to learn representations
Across patches of natural images:
Also learns Gabor-like filters1
Good for sparse models,
eg for denoising
1 As ICA, K-Means, etc on image patches
[Mairal... 2014]
G Varoquaux 41
Large n, large p: brain imaging
Brain activity at rest
1000 subjects with ∼ 100–10 000 samples
Images of dimensionality > 100 000
Dense matrix, large both ways
[figure: X (time × voxels) ≈ U · V + E]
G Varoquaux 42
Large n, large p: recommender systems
Product ratings
Millions of entries
Hundreds of thousands of products and users
Large sparse matrix
[figure: X (users × products, sparsely observed) ≈ U · V + E]
G Varoquaux 43
Online estimation: stochastic optimization
min_w Σ_i l(xi · w)
Many samples: min_w 𝔼[l(y, x · w)]
Gradient descent: w_{t+1} ← w_t − αt ∇_w l
Stochastic gradient descent: w_{t+1} ← w_t − αt Ê[∇_w l]
Use a cheap estimate of 𝔼[∇_w l] (e.g. subsampling)
αt must decrease “suitably” with t.
Those pesky learning rates
G Varoquaux 44
Online estimation for matrix factorization
Large matrices = terabytes of data
argmin_{U,V} ‖X − U V‖²_Fro + λ Ω(U)
Rewrite as an expectation over the rows xi of X:
argmin_V Σ_i min_u ‖xi − u V‖²₂ + λ Ω(u) = argmin_V 𝔼[f(V)]
⇒ Optimize on approximations (sub-samples)
Alternating minimization (data access, code computation, dictionary update)
becomes:
- online matrix factorization: stream the data rows [Mairal... 2010]
- subsampled & online: also subsample the columns [Mensch... 2017]
G Varoquaux 45
Online matrix factorization algorithm [Mairal... 2010]
Stream samples xt:
1. Compute the code
ut = argmin_{u ∈ ℝ^k} ‖xt − u Vt−1‖²₂ + λ Ω(u)
2. Update the surrogate function
gt(V) = (1/t) Σ_{i=1}^t ½ ‖xi − ui V‖²₂ = ½ tr(V^T At V) − tr(V^T Bt) + const
At def= (1 − 1/t) At−1 + (1/t) ut^T ut,   Bt def= (1 − 1/t) Bt−1 + (1/t) ut^T xt
gt is a surrogate of Σ_x l(x, V): ui is used, and not u*(V)
At and Bt are sufficient statistics of the loss accumulated over the data
3. Minimize the surrogate
Vt = argmin_{V∈C} gt(V),   ∇gt(V) = At V − Bt
G Varoquaux 46
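A toy numpy sketch of this bookkeeping for the ℓ2-penalized case, where the code step has a closed form (an illustration of the sufficient-statistics updates, not the full algorithm of Mairal et al.):

import numpy as np

def online_mf(X, k, lam=0.1, n_epochs=1, seed=0):
    rng = np.random.RandomState(seed)
    n, p = X.shape
    V = rng.randn(k, p)                    # dictionary
    A, B = np.zeros((k, k)), np.zeros((k, p))
    t = 0
    for _ in range(n_epochs):
        for i in rng.permutation(n):
            t += 1
            x = X[i]
            # 1. code: ridge regression of x on the current dictionary
            u = np.linalg.solve(V @ V.T + lam * np.eye(k), V @ x)
            # 2. update the sufficient statistics of the surrogate
            A = (1 - 1 / t) * A + np.outer(u, u) / t
            B = (1 - 1 / t) * B + np.outer(u, x) / t
            # 3. minimize the surrogate: A V = B (block coordinate descent in practice)
            V = np.linalg.solve(A + 1e-10 * np.eye(k), B)
    return V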
Stochastic Majorization-Minimization [Mairal 2013]
V* = argmin_{V∈C} Σ_x l(x, V)   where l(x, V) = min_u f(x, V, u)
Algorithm: gt(V) is a majorant of Σ_x l(x, V),
because ui is used, and not u*(V)
⇒ Majorization-Minimization scheme1
Surrogate computation: 2nd-order information
SMM: full minimization, no learning rate
1 SOMF uses an approximate majorant and minimization [Mensch... 2017]
G Varoquaux 47
Experimental convergence: large images
[figure: test objective value vs time on four problems —
ADHD, sparse dictionary, 2 GB; Aviris, NMF, 103 GB;
Aviris, dictionary learning, 103 GB; HCP, sparse dictionary, 2 TB —
comparing OMF, SOMF with subsampling ratios r = 4, 6, 8, 12, 24,
and SGD with its best step size]
SOMF = Subsampled Online Matrix Factorization
G Varoquaux 48
Experimental convergence: recommender system
SOMF = Subsampled Online Matrix Factorization
G Varoquaux 49
Summary
Versatile matrix-factorization formulation1:
argmin_{U ∈ ℝ^{n×k}, V∈C} ‖X − U V‖²_Fro + λ Ω(U)
Estimation
Stochastic majorization-minimization2
⇒ an online alternated optimization
Example use of learned representations
Biomarkers of autism on brain images:
p ∼ 100 000, n ∼ 1 000 [Abraham... 2017]
1 A 1-layer linear autoencoder
2 Common-case algorithm readily usable in scikit-learn:
MiniBatchDictionaryLearning
G Varoquaux 50
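A scikit-learn usage sketch of this online estimator (hyper-parameters are illustrative):

from sklearn.decomposition import MiniBatchDictionaryLearning

# X: (n_samples, n_features) array, streamed in mini-batches internally
dico = MiniBatchDictionaryLearning(n_components=50, alpha=1.0,
                                   batch_size=256, random_state=0)
U = dico.fit_transform(X)      # sparse codes (the representation)
V = dico.components_           # the dictionary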
2 Matrix factorization and its variants
For signals
For discrete objects
Gamma-Poisson for factorizing counts [Canny 2004]
When X is a matrix of counts
- Recommender systems [Gopalan... 2014]
- Database string entries [Cerda and Varoquaux 2019]
=⇒ Poisson loss, instead of the squared loss
P(xj | u, V) = Poisson((u V)j) = (1/xj!) (u V)j^{xj} e^{−(u V)j}
u are loadings, modeled as random with a Gamma prior3:
p(ui) = ui^{αi−1} e^{−ui/βi} / (βi^{αi} Γ(αi))
Maximum a posteriori estimation:
ˆU, ˆV = argmin_{U,V} − Σ_j log P(xj | u, V) − Σ_i log p(ui)
3 Because it is the conjugate prior of the Poisson, it imposes soft sparsity,
and it lifts the rotational invariance
G Varoquaux 52
Gamma-Poisson estimation
Full log-likelihood expression:
log L = Σ_{j=1}^p [ xj log((u V)j) − (u V)j − log(xj!) ]
      + Σ_{i=1}^k [ (αi − 1) log(ui) − ui/βi − αi log βi − log Γ(αi) ]
Gradients:
∂ log L / ∂Vij = (xj / (u V)j) ui − ui
∂ log L / ∂ui  = Σ_{j=1}^p [ (xj / (u V)j) Vij − Vij ] + (αi − 1)/ui − 1/βi
G Varoquaux 53
Gamma-Poisson estimation
Equivalent to an NMF formulation: multiplicative updates1
Vij ← Vij · ( Σ_{ℓ=1}^n (xℓj / (U V)ℓj) uℓi ) / ( Σ_{ℓ=1}^n uℓi )
uℓi ← uℓi · ( Σ_{j=1}^p (xℓj / (U V)ℓj) Vij + (αi − 1)/uℓi ) / ( Σ_{j=1}^p Vij + βi^{−1} )
1 Efficient implementation with sparse matrices: the summations can be
done only on the non-zero entries of X.
G Varoquaux 53
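A toy numpy sketch of these multiplicative updates on a dense count matrix (no sparse-matrix optimization; α, β, and the number of iterations are illustrative):

import numpy as np

def gamma_poisson_mf(X, k, alpha=1.1, beta=1.0, n_iter=200, seed=0):
    rng = np.random.RandomState(seed)
    n, p = X.shape
    U = rng.gamma(alpha, beta, size=(n, k))
    V = rng.gamma(1.0, 1.0, size=(k, p))
    eps = 1e-10
    for _ in range(n_iter):
        R = X / (U @ V + eps)                       # element-wise ratios x / (UV)
        V *= (U.T @ R) / (U.sum(axis=0)[:, None] + eps)
        R = X / (U @ V + eps)
        U *= (R @ V.T + (alpha - 1) / (U + eps)) / (V.sum(axis=1)[None, :] + 1 / beta)
    return U, V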
Adapting the majorization-minimization algorithm
while ‖V(t) − V(t−1)‖_F > η do
    draw xt from the training set
    while ‖ut − ut_old‖₂ > ε do
        ut ← ut ⊙ [ (xt ./ (ut V(t))) V(t)^T + (a − 1) ./ ut ] ./ [ 1 V(t)^T + b^{−1} ]
    At ← V(t) ⊙ ( ut^T (xt ./ (ut V(t))) )
    Bt ← ut^T 1
    A(t) ← ρ A(t−1) + At
    B(t) ← ρ B(t−1) + Bt
    V(t) ← A(t) ./ B(t)
    t ← t + 1
[Lefevre... 2011, Cerda and Varoquaux 2019]
G Varoquaux 54
Application: sub-string representation
Problem: representing non-normalized categories
Drug Name:
alcohol, ethyl alcohol, isopropyl alcohol, polyvinyl alcohol,
isopropyl alcohol swab, 62% ethyl alcohol, alcohol 68%,
alcohol denat, benzyl alcohol, dehydrated alcohol
Employee Position Title:
Police Aide, Master Police Officer, Mechanic Technician II,
Police Officer III, Senior Architect, Senior Engineer Technician,
Social Worker III
[Cerda and Varoquaux 2019]
G Varoquaux 55
Application: sub-string representation
Gamma-Poisson factorization on sub-string (3-gram) counts
Models strings as a linear combination of substrings
[figure: strings such as “police”, “officer”, “pol off”, “polis”,
“policeman”, “policier” represented by their counts of 3-grams
(pol, oli, lic, ice, ce_, _of, off, fic, cer, er_)]
The factorization of the counts matrix gives:
- what substrings are in a latent category
- what latent categories are in an entry
[Cerda and Varoquaux 2019]
G Varoquaux 56
Application: sub-string representation
Representations that extract latent categories
[figure: activations of entries such as “Legislative Analyst II”,
“Equipment Operator I”, “Bus Operator”, “Senior Architect”,
“Master Police Officer”, “Police Sergeant” on the latent categories]
Inferring plausible feature names from the most active substrings, eg:
assistant/library, equipment/operator, administration/specialist,
craftsworker/warehouse, program/manager, mechanic/community,
firefighter/rescue, correction/officer
[Cerda and Varoquaux 2019]
G Varoquaux 57
Natural language processing: topic-modeling history
Topic modeling: embedding documents1
[figure: a documents × terms count matrix factorized into
(documents × topics) × (topics × terms): what terms are in a
topic, and what topics are in a document]
LSA (Latent Semantic Analysis) [Landauer... 1998]:
SVD2 of the terms × documents matrix
1 Typically for information-retrieval purposes, aka search engines
2 Later: refinements for more complex losses: LDA (Latent Dirichlet Allocation)
[Blei... 2003] and Gamma-Poisson [Canny 2004].
G Varoquaux 58
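A scikit-learn sketch of LSA on a small corpus (counts, then a truncated SVD; the corpus and dimensions are illustrative):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the Python profiling module",
        "profiling Python code for performance",
        "the performance of this code"]

X = CountVectorizer().fit_transform(docs)      # documents x terms counts
lsa = TruncatedSVD(n_components=2)
doc_topics = lsa.fit_transform(X)              # what topics are in a document
topic_terms = lsa.components_                  # what terms are in a topic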
Word embeddings
Distributional semantics: the meaning of words
“You shall know a word by the company it keeps”
Firth, 1957
Example: A glass of red ___, please
Could be wine, maybe juice?
wine and juice have related meanings
Factorization of the word × context matrix
What choice of context? What loss?
word2vec [Mikolov... 2013a], glove [Pennington... 2014]
G Varoquaux 59
Word2vec: skip-gram sampling [Mikolov... 2013b]
{ˆuw, ˆvc} = argmax_{uw, vc} Σ_{pairs of words (w, c) in the same window1} log softmax(V uw^T)c
softmax(z)i = exp(zi) / Σ_j exp(zj)
uw ∈ ℝ^k: embedding of word w
V ∈ ℝ^{card(voc)×k}: [vc, c ∈ voc], all context words
Big sum on contexts ⇒ solved by SGD2
[figure: U, the embedding of the center word, and V, the embeddings
of the context words, for words such as salad, meat, juice, wine, glass, red, green]
Other view: language models, prediction of words
1 Efficient: never build the matrix, stream directly from text.
2 These windows are called skip-grams
G Varoquaux 60
Word2vec: negative sampling [Mikolov... 2013a]
Costly loss: log softmax(z)i = log( exp(zi) / Σ_j exp(zj) )
Approximation1: the huge sum in the softmax (over all the vocabulary) is
downsampled by drawing the positive example (numerator)
and a few negative examples (denominator)
Negative-sampling loss2: [Goldberg and Levy 2014]
log σ(vc uw^T) + Σ_{nneg words w' not in the window} log σ(−vc uw'^T)
σ: sigmoid (log σ(z) = −log(1 + exp(−z)))
1 Related to noise contrastive estimation, which avoids computing costly
normalizations in likelihoods [Gutmann and Hyvärinen 2010]
2 Related to a matrix factorization of the mutual information of word co-occurrences
[Levy and Goldberg 2014]
G Varoquaux 61
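A numpy sketch of this loss for one (word, context) pair and a few sampled negatives (embedding dimension and random vectors are illustrative):

import numpy as np

def log_sigmoid(z):
    return -np.log1p(np.exp(-z))

def negative_sampling_loss(u_w, v_c, V_neg):
    """u_w: center-word embedding (k,), v_c: context embedding (k,),
    V_neg: (n_neg, k) embeddings of sampled negative contexts."""
    positive = log_sigmoid(v_c @ u_w)
    negatives = log_sigmoid(-V_neg @ u_w).sum()
    return -(positive + negatives)      # minimized by SGD

rng = np.random.RandomState(0)
k = 50
loss = negative_sampling_loss(rng.randn(k), rng.randn(k), rng.randn(5, k))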
Beyond natural language: metric learning
Triplet loss
For an “anchor” a, b close to a, c far from a:
log σ(va^T ub) − log σ(va^T uc)
Quadruplet loss [Chen... 2017]
For a and b close by, c and d far apart:
log σ(va^T ub) − log σ(vc^T ud)
In practice: draw1 (a, b, c) or (a, b, c, d) randomly
Metric learning: [Bellet... 2013]
learning embeddings with weak supervision
1 Many strategies, eg “hard negative mining”; requires a good test set and
metric to set, as with SGD hyperparameters.
G Varoquaux 62
Embedding entities in knowledge graphs
Structured (graph) representation of human knowledge
eg dbpedia, Yago
Learning embeddings of entities {ei} and relations {rj}:
ea ∼ eb + rc, a model of the relation
Then triplet / quadruplet loss
Reuse existing embeddings: conceptnet.io
[Bordes... 2013, Wang... 2017]
G Varoquaux 63
The value of simple models
Risk of invisible overfit during the search for hyper-parameters
and models
Complex models call for a clear utility measure with low
measurement error, and many reliable labels
Matrix factorization models1: 2 hyper-parameters:
dimensionality k and regularization λ
Set them to optimize the representations for supervised problems
1 Using majorization-minimization approaches to avoid learning rates
G Varoquaux 64
Summary
Discrete entities lead to counting occurrences
⇒ Poisson and logistic losses (ugly logs in the equations)
Word & entity embeddings
Factorization of co-occurrences in a notion of context;
more generally: metric learning
Limited-data settings:
Avoid negative-sampling models (SGD hyper-parameters)
Try to reuse representations (fastText, conceptnet.io)
G Varoquaux 65
3 Fisher kernels
What if the objects studied do not naturally
live in a vector space?
eg graphs of varying number of nodes
3 Fisher kernels
Kernels and feature maps
From likelihoods to Kernels
Learning with Kernels [Scholkopf and Smola 2001]
Kernels
A kernel K is a function: X × X → ℝ+,
symmetric and positive definite
It captures similarity between observations
Building functions with kernels
on the training data: Ki def= K(xi, ·), i ∈ train
prediction function2: f(x) = Σ_{i∈train} wi Ki(x)
2 Benefits of this formulation: i) non-linear predictor trained with a linear
problem; ii) expressiveness that increases with the amount of training data
G Varoquaux 68
Feature maps [Scholkopf and Smola 2001]
Drawbacks of kernels
Compute cost O(n²)
Representations are not explicit: f(x) = Σ_{i∈train} wi Ki(x)
As K is symmetric positive1, there exists
φ : X → ℝ^d such that ∀ x, x′: K(x, x′) = φ(x)^T φ(x′)
φ is a “feature map”
f(x) is a linear function of φ(x),
but d can be ∞
⇒ Approximate φ
1 Think of it as a generalization of the Cholesky decomposition
G Varoquaux 69
Nyström approximate feature maps [Drineas and Mahoney 2005]
On a random subset {x1, ..., xm} of the training data:
G def= [ K(xi, xj) ]_{i,j=1..m} ∈ ℝ^{m×m}
Let L ∈ ℝ^{k×m} be a rank-k approximation: L^T L ≈ G^{−1}
Feature map1: φ_Nystrom(x) = L [ K(x1, x), ..., K(xm, x) ]^T
sklearn.kernel_approximation.Nystroem
See also: random features [Rahimi and Recht 2008]
sklearn.kernel_approximation.RBFSampler
1 Exercise: check that φ_Nystrom(x)^T φ_Nystrom(x′) ≈ φ(x)^T φ(x′) for x, x′ in the subset.
G Varoquaux 70
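A scikit-learn usage sketch (the kernel and number of components are illustrative):

from sklearn.pipeline import make_pipeline
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import LogisticRegression

# Approximate RBF-kernel features, then a linear model on top
model = make_pipeline(Nystroem(kernel="rbf", n_components=300, random_state=0),
                      LogisticRegression(max_iter=1000))
# model.fit(X_train, y_train)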
3 Fisher kernels
Kernels and feature maps
From likelihoods to Kernels
Parametric generative model
Consider a model of x parametrized by w ∈ ℝ^k:
P(x) = Pw(x),  log-likelihood LP def= log Pw
Maximum-likelihood estimate: ˆw = argmax_w LP(x)
Kullback-Leibler divergence
Natural distance1 to another distribution:
KL(P|Q) = 𝔼_P[LP − LQ]
Goal: benefit from our model to build a representation
All models are wrong but some are useful
1 Not a distance, technically, as it is not symmetric.
G Varoquaux 72
Local behavior of parametric models
Fisher information matrix
Expectation of the Hessian of L given w:
I(w) def= 𝔼[ ∂²L(x|w) / ∂w² ] ∈ ℝ^{k×k}
Order-2 approximation of the KL divergence:
KL(Pw | Pw+ε) ≈ ε^T I(w) ε
I(w) also scales the covariance of the estimation
error on maximum-likelihood estimates of w
(Cramér-Rao bound)
G Varoquaux 73
Fisher-Rao manifold (information geometry)
Order-2 approximation: KL(Pw | Pw+ε) ≈ ε^T I(w) ε
[figure: level sets of the KL divergence close to w1 and close to w2]
The metric is not constant across the family of distributions {Pw, w ∈ ℝ^k}
{Pw, w ∈ ℝ^k} form a Riemannian manifold, with I as the metric tensor
[Rao 1945]
G Varoquaux 74
Riemannian manifolds
Continuous geometry on curved spaces (eg the Earth)
Locally, but not globally, Euclidean
A Riemannian manifold M is a differentiable space endowed
with a metric d that is locally equivalent to a Euclidean vector space:
for M and M′ ∈ M, if d(M, M′) → 0, M and M′ can be mapped to
elements m, m′ of a vector space such that d(M, M′) ∼ ‖m − m′‖₂
Global structure: geodesic distance
[figure: the tangent space T_M M and the Log_M / Exp_M maps]
G Varoquaux 75
Fisher Kernel [Jaakkola and Haussler 1999]
A kernel locally equivalent to the KL divergence
Built upon the Fisher matrix
⇒ Create a feature map:
a vector space where the Euclidean distance ≈ KL
In practice:
1. Fit the model Pw on the train data:
ˆw ← argmax_w Σ_{i∈train} L(xi, w)
2. Compute the gradient with respect to w of the likelihood at ˆw:
zFisher(x) = ∇w L(x, ˆw) ∈ ℝ^k
G Varoquaux 76
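A numpy sketch of such a feature map for a simple generative model, a Gaussian with parameters w = (μ, σ²) fitted on the train data (the model choice is illustrative; the full Fisher kernel would additionally whiten by the Fisher matrix):

import numpy as np

# 1. Fit the generative model on the training data (maximum likelihood)
x_train = np.random.RandomState(0).randn(500)
mu, var = x_train.mean(), x_train.var()

def fisher_features(x, mu, var):
    """Gradient of the Gaussian log-likelihood w.r.t. (mu, var) at the MLE."""
    d_mu = (x - mu) / var
    d_var = (x - mu) ** 2 / (2 * var ** 2) - 1 / (2 * var)
    return np.stack([d_mu, d_var], axis=-1)    # one 2D feature vector per sample

z = fisher_features(np.array([0.1, -2.0, 3.0]), mu, var)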
Fisher Kernel applications
Text: TF-IDF [Elkan 2005]
Multinomial model of word appearance
Genomics [Jaakkola and Haussler 1999]
Hidden Markov model of DNA sequences
(variable-length sequences ⇒ encoding difficult)
Tree-structured data [Nicotra... 2004]
A transition model on the tree
Brain connectivity [Varoquaux... 2010]
Multivariate Gaussian model (covariances)
G Varoquaux 77
Summary
Kernels build prediction functions on similarities
Feature maps / kernel approximations capture the
corresponding representation
Fisher Kernels go from a likelihood to a vector space
Very useful for non-numeric objects
G Varoquaux 78
Limited-data settings
Reminder: your validation measure is intrinsically unreliable
(sampling noise)
Get more data
For instance by acquiring data on a related task, to learn representations
Use simple models
Do not spend too much time tweaking
[figure: distribution of errors under a binomial law as a function of the
number of available samples — roughly ±2% for n = 1000, ±4% for 300,
±5% for 200, ±7% for 100, −15%/+12% for 30]
[Varoquaux 2018]
G Varoquaux 79
References I
A. Abraham, M. P. Milham, A. Di Martino, R. C. Craddock, D. Samaras,
B. Thirion, and G. Varoquaux. Deriving reproducible biomarkers
from multi-site resting-state data: an autism-based example.
NeuroImage, 147:736, 2017.
A. Achille and S. Soatto. Emergence of invariance and
disentanglement in deep representations. The Journal of Machine
Learning Research, 19(1):1947–1980, 2018.
A. Bellet, A. Habrard, and M. Sebban. A survey on metric learning for
feature vectors and structured data. arXiv preprint
arXiv:1306.6709, 2013.
D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation.
Journal of machine Learning research, 3(Jan):993–1022, 2003.
G Varoquaux 80
References II
A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko.
Translating embeddings for modeling multi-relational data. In
Advances in Neural Information Processing Systems, pages
2787–2795, 2013.
D. Bzdok, M. Eickenberg, O. Grisel, B. Thirion, and G. Varoquaux.
Semi-supervised factored logistic regression for
high-dimensional neuroimaging data. In Advances in Neural
Information Processing Systems, page 3348, 2015.
J. Canny. Gap: A factor model for discrete data. In SIGIR, page 122,
2004.
J.-F. Cardoso. Dependence, correlation and gaussianity in
independent component analysis. Journal of Machine Learning
Research, 4:1177, 2003.
P. Cerda and G. Varoquaux. Encoding high-cardinality string
categorical variables. arXiv:1907.01860, 2019.
G Varoquaux 81
References III
W. Chen, X. Chen, J. Zhang, and K. Huang. Beyond triplet loss: a deep
quadruplet network for person re-identification. In Proceedings
of the IEEE Conference on Computer Vision and Pattern
Recognition, page 403, 2017.
P. S. Dhillon, D. P. Foster, S. M. Kakade, and L. H. Ungar. A risk
comparison of ordinary least squares vs ridge regression. The
Journal of Machine Learning Research, 14:1505, 2013.
P. Drineas and M. W. Mahoney. On the Nyström method for
approximating a gram matrix for improved kernel-based learning.
journal of machine learning research, 6:2153, 2005.
C. Elkan. Deriving tf-idf as a fisher kernel. In International
Symposium on String Processing and Information Retrieval, page
295, 2005.
G Varoquaux 82
References IV
Y. Goldberg and O. Levy. word2vec explained: Deriving mikolov et
al.’s negative-sampling word-embedding method. arXiv:1402.3722,
2014.
P. K. Gopalan, L. Charlin, and D. Blei. Content-based
recommendations with poisson factorization. In Advances in
Neural Information Processing Systems, page 3176, 2014.
M. Gutmann and A. Hyvärinen. Noise-contrastive estimation: A new
estimation principle for unnormalized statistical models. In
Proceedings of the International Conference on Artificial
Intelligence and Statistics, page 297, 2010.
D. Hsu, S. Kakade, and T. Zhang. Random design analysis of ridge
regression. Foundations of Computational Mathematics, 14, 2014.
A. Hyvärinen and E. Oja. Independent component analysis:
algorithms and applications. Neural networks, 13(4):411, 2000.
G Varoquaux 83
References V
A. J. Izenman. Reduced-rank regression for the multivariate linear
model. Journal of multivariate analysis, 5:248, 1975.
T. Jaakkola and D. Haussler. Exploiting generative models in
discriminative classifiers. In Advances in neural information
processing systems, pages 487–493, 1999.
T. K. Landauer, P. W. Foltz, and D. Laham. An introduction to latent
semantic analysis. Discourse processes, 25:259, 1998.
A. Lefevre, F. Bach, and C. Févotte. Online algorithms for
nonnegative matrix factorization with the itakura-saito
divergence. In Applications of Signal Processing to Audio and
Acoustics (WASPAA), page 313. IEEE, 2011.
O. Levy and Y. Goldberg. Neural word embedding as implicit matrix
factorization. In Advances in neural information processing
systems, page 2177, 2014.
G Varoquaux 84
References VI
J. Mairal. Stochastic majorization-minimization algorithms for
large-scale optimization. In Advances in Neural Information
Processing Systems, 2013.
J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix
factorization and sparse coding. Journal of Machine Learning
Research, 11:19–60, 2010.
J. Mairal, F. Bach, and J. Ponce. Sparse modeling for image and
vision processing. Foundations and Trends® in Computer
Graphics and Vision, 8(2-3):85–283, 2014.
S. Mallat. Understanding deep convolutional networks.
Philosophical Transactions of the Royal Society A, 374:20150203,
2016.
A. Mensch, J. Mairal, B. Thirion, and G. Varoquaux. Stochastic
subsampling for factorizing huge matrices. IEEE Transactions on
Signal Processing, 66:113, 2017.
G Varoquaux 85
References VII
A. Mensch, J. Mairal, B. Thirion, and G. Varoquaux. Extracting
universal representations of cognition across brain-imaging
studies. arXiv preprint arXiv:1809.06035, 2018.
D. Micci-Barreca. A preprocessing scheme for high-cardinality
categorical attributes in classification and prediction problems.
ACM SIGKDD Explorations Newsletter, 3:27, 2001.
T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of
word representations in vector space. In ICLR Workshop Papers.
2013a.
T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean.
Distributed representations of words and phrases and their
compositionality. In Advances in neural information processing
systems, page 3111, 2013b.
G Varoquaux 86
References VIII
G. F. Montufar, R. Pascanu, K. Cho, and Y. Bengio. On the number of
linear regions of deep neural networks. In Advances in neural
information processing systems, page 2924, 2014.
L. Nicotra, A. Micheli, and A. Starita. Fisher kernel for tree structured
data. In 2004 IEEE International Joint Conference on Neural
Networks (IEEE Cat. No. 04CH37541), volume 3, pages 1917–1922.
IEEE, 2004.
E. Oyallon, E. Belilovsky, and S. Zagoruyko. Scaling the scattering
transform: Deep hybrid networks. In Proceedings of the IEEE
international conference on computer vision, page 5618, 2017.
J. Pennington, R. Socher, and C. Manning. Glove: Global vectors for
word representation. In Proceedings of the 2014 conference on
empirical methods in natural language processing (EMNLP), page
1532, 2014.
G Varoquaux 87
References IX
M. Rahim, B. Thirion, D. Bzdok, I. Buvat, and G. Varoquaux. Joint
prediction of multiple scores captures better individual traits
from brain images. Neuroimage, 158:145–154, 2017a.
M. Rahim, B. Thirion, and G. Varoquaux. Multi-output predictions
from neuroimaging: assessing reduced-rank linear models. In
2017 International Workshop on Pattern Recognition in
Neuroimaging (PRNI), pages 1–4. IEEE, 2017b.
A. Rahimi and B. Recht. Random features for large-scale kernel
machines. In Advances in neural information processing systems,
pages 1177–1184, 2008.
C. Rao. Information and accuracy attainable in the estimation of
statistical parameters. Bull Calcutta. Math. Soc., 37:81, 1945.
G Varoquaux 88
References X
S. Rosset and R. J. Tibshirani. From fixed-x to random-x regression:
Bias-variance decompositions, covariance penalties, and
prediction error estimation. Journal of the American Statistical
Association, pages 1–14, 2018.
B. Scholkopf and A. J. Smola. Learning with kernels: support vector
machines, regularization, optimization, and beyond. MIT press,
2001.
G. Varoquaux. Cross-validation failure: small sample sizes lead to
large error bars. Neuroimage, 180:68–77, 2018.
G. Varoquaux, F. Baronnet, A. Kleinschmidt, P. Fillard, and B. Thirion.
Detection of brain functional-connectivity difference in
post-stroke patients using group-level covariance modeling. In
International Conference on Medical Image Computing and
Computer-Assisted Intervention, pages 200–208. Springer, 2010.
G Varoquaux 89
References XI
Q. Wang, Z. Mao, B. Wang, and L. Guo. Knowledge graph embedding:
A survey of approaches and applications. IEEE Transactions on
Knowledge and Data Engineering, 29(12):2724–2743, 2017.
G Varoquaux 90
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 

Recently uploaded (20)

A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 

Representation learning in limited-data settings

  • 1. Representation learning in limited-data settings Ga¨el Varoquaux
  • 2. Limited-data settings n to be compared to: A measure of the signal-to-noise ratio The dimensional of the data p Deep learning does not work well in small-sample regimes But we can borrow ideas This talk: No silver bullet, many simple (shallow) tricks G Varoquaux 1
  • 3. Small-n problems are important 83% of data scientists1 never have n > 1M n is often small for applications such as medicine Bigger is better (how to not use this talk) Get more data (pool related datasets) Find a related problem and try transfer This talk: data that differs from common sources 1www.kaggle.com/laurae2/data-scientists-vs-size-of-datasetsG Varoquaux 2
  • 4. Perils of deep learning with small n Selecting architecture, learning rate... A deep architecture is validated by its measured accuracy overfitting the validation & test set Sampling noise for ntest = 1000: -10% -5% 0% +5% +10% Binomial distribution of error on test accuracy -2% +2% Optimizing test accuracy will explore the tails cf online challenges Need for guiding principles G Varoquaux 3[Varoquaux 2018]
  • 5. Outline 1 Representations for machine learning Non-asymptotic supervised learning Learning with representations Supervised learning of representations 2 Matrix factorization and its variants For signals For discrete objects 3 Fisher kernels Kernels feature maps From likelihoods to Kernels G Varoquaux 4
  • 6. 1 Representations for machine learning Defining the notion of representations Their use for supervised learning
  • 7. 1 Representations for machine learning Non-asymptotic supervised learning Learning with representations Supervised learning of representations
  • 8. Settings: supervised learning Given n pairs (x, y) ∈ X × Y drawn i.i.d. find a function f : X → Y such that f(x) ≈ y Notation: ˆy def = f(x) Empirical risk minimization Loss function l : Y × Y → Estimation of f: f = argmin f∈F ¾ l(ˆy, y) This course: how to choose good function classes F G Varoquaux 7
  • 9. Example: finite-sample estimation of f Data generated with 9th order polynomial + noise G Varoquaux 8
  • 10. Example: finite-sample estimation of f Data generated with 9th order polynomial + noise Fit polynomials of various degrees Degree 1 G Varoquaux 8
  • 11. Example: finite-sample estimation of f Data generated with 9th order polynomial + noise Fit polynomials of various degrees Degree 1 Degree 2 G Varoquaux 8
  • 12. Example: finite-sample estimation of f Data generated with 9th order polynomial + noise Fit polynomials of various degrees Degree 1 Degree 2 Degree 5 G Varoquaux 8
  • 13. Example: finite-sample estimation of f Data generated with 9th order polynomial + noise Fit polynomials of various degrees Degree 1 Degree 2 Degree 5 Degree 9 G Varoquaux 8
  • 14. Example: finite-sample estimation of f Data generated with 9th order polynomial + noise Fit polynomials of various degrees Degree 1 Degree 2 Degree 5 Degree 9 Truth Model too simple: underfit Model too complex: overfit G Varoquaux 8
  • 15. Theory: the generalization error Generalization error of a prediction function f: Notation : E(f) def = ¾ l(y, f(x)) Finite-sample regime Ideally: f = argmin f∈F ¾ l f(x), y In practice: ˆf = argmin f∈F n i=1 l f(xi), yi E(ˆf) ≥ E(f ) f f G Varoquaux 9
  • 16. Theory: decomposing the generalization error Assuming y = g(x) + e, e random with ¾[e] = 0, the generalization error of ˆf is: E(ˆf) = ¾ l(g(x) + e, ˆf(x)) = E(g) + E(f ) − E(g) + E(ˆf) − E(f ) Bayes rate Best possible pre- diction ¾ l(g(x) + e, g(x)) Approximation error: g F Our model is wrong Estimation Sampling noise on train data ˆf f G Varoquaux 10
  • 17. Theory: decomposing the generalization error Assuming y = g(x) + e, e random with 𝔼[e] = 0, the generalization error of f̂ is: E(f̂) = 𝔼[l(g(x) + e, f̂(x))] = E(g) + [E(f*) − E(g)] + [E(f̂) − E(f*)] Bayes rate Best possible prediction 𝔼[l(g(x) + e, g(x))] Due to the noise e Cannot be avoided G Varoquaux 10
  • 18. Theory: decomposing the generalization error Assuming y = g(x) + e, e random with 𝔼[e] = 0, the generalization error of f̂ is: E(f̂) = 𝔼[l(g(x) + e, f̂(x))] = E(g) + [E(f*) − E(g)] + [E(f̂) − E(f*)] Approximation error: g ∉ F Our model is wrong Decreases for larger F Empirical upper bound: train error G Varoquaux 10
  • 19. Theory: decomposing the generalization error Assuming y = g(x) + e, e random with 𝔼[e] = 0, the generalization error of f̂ is: E(f̂) = 𝔼[l(g(x) + e, f̂(x))] = E(g) + [E(f*) − E(g)] + [E(f̂) − E(f*)] Estimation error Sampling noise on train data: f̂ ≠ f* Finite-sample problem Decreases as n grows Increases for larger F Guesstimate: difference between train and test error G Varoquaux 10
  • 20. Example: polynomial regression degree f f Degree 9, small n no approximation error large estimation error f f g Degree 1, large n small estimation error large approximation error G Varoquaux 11
  • 21. Example: polynomial regression degree f f Degree 9, small n no approximation error large estimation error ˆf = argminf∈F i l f(xi), yi f f g Degree 1, large n small estimation error large approximation error Function class F not restrictive enough Function class F too restrictive G Varoquaux 11
  • 22. Gauging overfit vs underfit: learning curves [plot: error vs number of samples, 100–1000] sklearn.model_selection.learning_curve Overfit region Underfit? Or Bayes rate? G Varoquaux 12
  • 23. Gauging overfit vs underfit: learning curves [plot: generalization error and training error vs number of samples] sklearn.model_selection.learning_curve Estimation error ∼ gap between train and test error G Varoquaux 12
  • 24. Gauging overfit vs underfit: learning curves [plot: generalization and training error for polynomial degrees 9 and 1] Simpler models reach the asymptotic regime faster (smaller “sample complexity”) But can underfit G Varoquaux 12
  • 25. Gauging overfit vs underfit: validation curves [plot: generalization and training error vs polynomial degree, 5–15] sklearn.model_selection.validation_curve Reveals underfits G Varoquaux 13
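A minimal sketch of how these diagnostic curves can be computed with scikit-learn. The toy dataset and the polynomial-degree pipeline are illustrative assumptions, not the data behind the slides.

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import learning_curve, validation_curve

    rng = np.random.RandomState(0)
    X = rng.uniform(-1, 1, size=(300, 1))
    y = np.sin(3 * X[:, 0]) + 0.3 * rng.randn(300)      # toy data

    model = make_pipeline(PolynomialFeatures(degree=9), Ridge(alpha=1e-3))

    # Learning curve: train and test error as a function of the number of samples
    sizes, train_scores, test_scores = learning_curve(
        model, X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
        scoring="neg_mean_squared_error")
    print("train error:", -train_scores.mean(axis=1))
    print("test error: ", -test_scores.mean(axis=1))

    # Validation curve: error as a function of a hyper-parameter (here the degree)
    train_scores, test_scores = validation_curve(
        model, X, y, param_name="polynomialfeatures__degree",
        param_range=np.arange(1, 15), cv=5, scoring="neg_mean_squared_error")

The gap between the two curves gauges the estimation error; a training error that stays high reveals an underfit.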
  • 26. Linear models for limited-data settings In high-dimensional limited-data settings, linear models are often the best choice For p-dimensional data, x ∈ ℝ^p, they have p parameters n ∼ 200 000 — Inpatient Mortality, AUROC (95% CI): Hospital A / Hospital B — Deep learning 0.95 (0.94–0.96) / 0.93 (0.92–0.94); Baseline (logistic regression) 0.93 (0.92–0.95) / 0.91 (0.89–0.92) G Varoquaux 14
  • 27. Theory: Approximating with linear predictors Linear predictor1: ŷ = x^T w, w ∈ ℝ^p Data model: y = x^T w* + δ(x) + e, 𝔼[e] = 0, x^T w*: best linear predictor Ridge estimator: ŵ = argmin_w ‖y_train − X_train w‖²₂ + λ‖w‖²₂ Error compared to the best linear predictor: 𝔼[(y − x^T ŵ)²] = 𝔼[(y − x^T w*)²] + o(σ² p / n_train) [Hsu... 2014, sec 2.5] Random design analysis can characterize the generalization error without assuming a correct data-generating model (misspecified model) [Hsu... 2014, Rosset and Tibshirani 2018] 1Predictor, not model: we do not assume it is a data-generating process. G Varoquaux 15
  • 28. Theory: Approximating with linear predictors Linear predictor1: ŷ = x^T w, w ∈ ℝ^p Data model: y = x^T w* + δ(x) + e, 𝔼[e] = 0, x^T w*: best linear predictor Ridge estimator: ŵ = argmin_w ‖y_train − X_train w‖²₂ + λ‖w‖²₂ Error compared to the best linear predictor: 𝔼[(y − x^T ŵ)²] = 𝔼[(y − x^T w*)²] + o(σ² p / n_train) Approximation error: data not linearly generated ⇒ craft more features Estimation error: curse of dimensionality ⇒ limit the number of features 1Predictor, not model: we do not assume it is a data-generating process. G Varoquaux 15
  • 29. Example: extrapolating sea level (tides) Predict sea level as a function of time Test outside of the observed range1 1Technically, this is not covered by our theory: the test set is not drawn from the same distribution as the train set. G Varoquaux 16
  • 30. Example: extrapolating sea level (tides) Polynomial regression dim=10 Covariates G Varoquaux 16
  • 31. Example: extrapolating sea level (tides) Polynomial regression dim=10 dim=100 Covariates G Varoquaux 16
  • 32. Example: extrapolating sea level (tides) Polynomial regression dim=10 dim=100 dim=1000 Covariates G Varoquaux 16
  • 33. Example: extrapolating sea level (tides) Polynomial regression dim=10 dim=100 dim=1000 Covariates Sines and cosines basis dim=10 G Varoquaux 16
  • 34. Example: extrapolating sea level (tides) Polynomial regression dim=10 dim=100 dim=1000 Covariates Sines and cosines basis dim=10 dim=100 G Varoquaux 16
  • 35. Example: extrapolating sea level (tides) Polynomial regression dim=10 dim=100 dim=1000 Covariates Sines and cosines basis dim=10 dim=100 dim=1000 G Varoquaux 16
  • 36. Example: extrapolating sea level (tides) Polynomial regression dim=10 dim=100 dim=1000 Covariates Sines and cosines basis dim=10 dim=100 dim=1000 Choice of covariates / basis / signal representation ⇒ huge difference on approximation error ⇒ huge difference on generalization error G Varoquaux 16
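To make the role of the representation concrete, here is a minimal sketch contrasting a polynomial basis with a sine/cosine basis in a ridge model; the synthetic tide-like signal and the chosen frequencies are assumptions made for the illustration, not the data of the slides.

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.metrics import mean_squared_error

    rng = np.random.RandomState(0)
    t_train = np.sort(rng.uniform(0, 10, 200))
    t_test = np.linspace(10, 12, 100)                    # extrapolation range
    signal = lambda t: np.sin(2 * np.pi * t) + 0.5 * np.sin(4 * np.pi * t)
    y_train = signal(t_train) + 0.1 * rng.randn(t_train.size)

    def poly_basis(t, dim):
        return np.vander(t / 10, dim, increasing=True)   # 1, t, t^2, ... (rescaled)

    def fourier_basis(t, dim):
        freqs = np.arange(1, dim // 2 + 1)
        return np.hstack([np.sin(2 * np.pi * freqs * t[:, None]),
                          np.cos(2 * np.pi * freqs * t[:, None])])

    for name, basis in [("polynomial", poly_basis), ("sines/cosines", fourier_basis)]:
        model = Ridge(alpha=1e-6).fit(basis(t_train, 20), y_train)
        err = mean_squared_error(signal(t_test), model.predict(basis(t_test, 20)))
        print(name, "extrapolation MSE:", err)

With the same number of covariates and the same linear estimator, only the choice of basis changes, and with it the approximation error outside the observed range.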
  • 37. Summary ŷ = f(x), f chosen in F to minimize the observed error Σ_{i∈train} l(f(x_i), y_i) generalization error: - approximation error ⇒ F adapted to the data - estimation error ⇒ F small Limited-data settings Linear models are often the best option when p is large compared to n A good choice of covariates is crucial G Varoquaux 17
  • 38. 1 Representations for machine learning Non-asymptotic supervised learning Learning with representations Supervised learning of representations
  • 39. Representations to build F Settings z = r(x): representation of the data, z ∈ k Predictor f : x → ˆy = hw r(x) Function composition: “depth” G Varoquaux 19
  • 40. Representations to build F Settings z = r(x): representation of the data, z ∈ ℝ^k Predictor f : x → ŷ = h_w(r(x)) Function composition: “depth” Benefits For expressiveness: composition ≻ basis expansion Composing L rectifying functions on intermediate representations of dimension k gives O((k/p)^{p(L−1)} k^p) linear regions. Basis expansion + linear predictor gives O(k) Exponential in depth, linear with dimension [Montufar... 2014] G Varoquaux 19
  • 41. Representations to build F Settings z = r(x): representation of the data, z ∈ k Predictor f : x → ˆy = hw r(x) Function composition: “depth” Benefits For expressiveness composition basis expansion For multi-tasks sharing representations across tasks y multidimensional G Varoquaux 19
  • 42. Representations to build F Settings z = r(x): representation of the data, z ∈ k Predictor f : x → ˆy = hw r(x) Function composition: “depth” Benefits For expressiveness composition basis expansion For multi-tasks sharing representations across tasks For limited data hw(z) = wTz, a linear predictor A good choice of z can decrease sample complexity G Varoquaux 19
  • 43. Representations to build F Settings z = r(x): representation of the data, z ∈ k Predictor f : x → ˆy = hw r(x) Function composition: “depth” Benefits For expressiveness composition basis expansion For multi-tasks sharing representations across tasks For limited data hw(z) = wTz, a linear predictor Transfer: r is learned on large data; a simple h used. G Varoquaux 19
  • 44. Background: Information theory Entropy = amount of information in x: H(x) = 𝔼_p[− log p(x)] Equi-probable distribution = high entropy; uneven distribution = low entropy [bar plots over x = 0 … 5] Mutual information between x and y: I(x; y) = H(x) + H(y) − H(x, y) x ⊥⊥ y (independent) ⇔ I(x; y) = 0 independence ⇔ p(x, y) = p(x)p(y), then H(x, y) = 𝔼_{(x,y)}[− log p(x, y)] = 𝔼_{(x,y)}[− log p(x) − log p(y)] = 𝔼_x[− log p(x)] + 𝔼_y[− log p(y)] = H(x) + H(y) G Varoquaux 20
  • 45. Theory: information in representations A representation z of x is sufficient for y if y ⊥⊥ x | z, or equivalently if I(z; y) = I(x; y) x, z, y form a Markov chain if p(y|x, z) = p(y|z). x → z → y Data processing inequality: I(x; y) ≤ I(x; z) A sufficient representation z is minimal when I(x; z) is smallest among sufficient representations G Varoquaux 21 [Achille and Soatto 2018]
  • 46. Nuisances and invariances A nuisance n: I(x, n) ≥ 0, but I(y, n) = 0 Representation z is invariant to the nuisance n if z ⊥⊥ n, or I(z; n) = 0 ⇒ We want I(z; n) low In a Markov chain x → z1 → z2 · · · → zL → y If z is a sufficient representation for y, I(z; n) ≤ I(z; x) − I(x; y) Communication bottleneck: I(z1; z2) < I(z1; x) ⇒ I(z2; n) ≤ I(z1; z2) − I(x; y) Stacking increases invariance G Varoquaux 22[Achille and Soatto 2018]
  • 47. Invariant representations on a continuous space s_t Shift invariance ⇒ representation = Fourier basis Fourier transform: F(s)_f = Σ_t e^{−ift} s_t (complex i) Shifting the signal: s_t → s′_t = s_{t+k} F(s′)_f = Σ_t e^{−ift} s_{t+k} = Σ_t e^{−if(t−k)} s_t = e^{ikf} Σ_t e^{−ift} s_t = e^{ikf} F(s)_f → change in phase An orthonormal basis of shift-invariant vectors G Varoquaux 23
  • 48. Invariant representations on a continuous space s_t Shift invariance = Fourier basis Local deformations = Wavelets Locally equivalent to Fourier basis But without the global extent Decimated wavelets: isometric transform of the signal Higher scales lose shift invariance Redundant wavelets: increase the dimensionality Good shift invariance G Varoquaux 23
  • 49. Representations invariant to rich deformations Scaling Rotations Deformations Ingredients Modulus of wavelet / Fourier transform ⇒ non linearity & filter banks (convolutions) + stacking (repeating simple invariants) Scattering transform Derived from first principles Building first-order invariants Convolutional networks Learned from data Pooling across pixels (eg max) G Varoquaux 24[Mallat 2016]
  • 50. Summary Intermediate representations give expressiveness to predictive models Good representations keep predictive information and lose nuisance information Bottleneck and regularization to lose information Limited-data settings Given known invariants of the problem, reusing existing representations helps eg Headless conv-net, wavelets... [Oyallon... 2017] G Varoquaux 25
  • 51. 1 Representations for machine learning Non-asymptotic supervised learning Learning with representations Supervised learning of representations
  • 52. The need for supervision Maximizing I(z; y) (≤ I(x; y)) sufficient representations ⇒ supervised learning while minimizing I(z; n) nuisance ⇒ sampling nuisance / invariants data augmentation Challenge: amount of labeled data Pretext tasks Other targets y that capture useful information Finding them needs domain knowledge G Varoquaux 27
  • 53. Deep architectures ŷ = f^d_{W_d} ∘ ... ∘ f^1_{W_1}(x) Typically f^k_{W_k}(x) = g_k(W_k^T x) with g_k an element-wise non-linearity Thus ŷ = g_d(W_d^T ... g_1(W_1^T x)) Stacked representations: the {W_k} are optimized to minimize a prediction error G Varoquaux 28
  • 54. Shallow architectures for limited data Keep one latent layer Without non-linearity: ŷ = x^T W_1 W_2, y ∈ ℝ^k, W_1 ∈ ℝ^{p×d}, W_2 ∈ ℝ^{d×k}: a factored / reduced-rank linear model Multi-task / multi-output literature ⇒ structured loss (multiple soft-max’s) Overparametrization sometimes useful: d > k can be achieved with dropout G Varoquaux 29 [Bzdok... 2015, Mensch... 2018]
  • 55. Simple case: square loss = reduced-rank regression Ŷ = X W_1 W_2, Y ∈ ℝ^{n×k}, W_1 ∈ ℝ^{p×d}, W_2 ∈ ℝ^{d×k} Ŵ_1, Ŵ_2 = argmin_{W_1,W_2} ‖Ŷ − Y_train‖²_Fro For the squared loss the problem is convex Full-rank solution1 (X and Y on the train set): Ŵ = Σ̂_X^{−1} X^T Y, Ŷ = X Ŵ = X Σ̂_X^{−1} X^T Y Rank-d solution: [Izenman 1975, Rahim... 2017b] R̂_d := Y^T Ŷ ∈ ℝ^{k×k}, SVD → Û_d ŝ_d V̂_d, Û_d ∈ ℝ^{k×d} then Ŵ_1 = Σ̂_X^{−1} X^T Y Û_d (full-rank solution) and Ŵ_2 = Û_d^T (low-rank projector2) 1No need for pesky SGDs 2The projector captures the variance explained on the multiple outputs G Varoquaux 30
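A small NumPy sketch of this closed-form reduced-rank solution, following the equations above; the synthetic data and the small ridge term (added to keep the inversion well-posed) are assumptions of the example.

    import numpy as np

    rng = np.random.RandomState(0)
    n, p, k, d = 200, 50, 10, 3                  # samples, features, outputs, rank
    X = rng.randn(n, p)
    Y = X @ rng.randn(p, d) @ rng.randn(d, k) + 0.1 * rng.randn(n, k)

    # Full-rank multi-output solution (with a small ridge for stability)
    W_full = np.linalg.solve(X.T @ X + 1e-3 * np.eye(p), X.T @ Y)
    Y_hat = X @ W_full

    # Rank-d solution: project the outputs on the top-d singular vectors of Y^T Y_hat
    U_d = np.linalg.svd(Y.T @ Y_hat)[0][:, :d]   # k x d
    W1, W2 = W_full @ U_d, U_d.T                 # p x d and d x k

    print("full-rank fit:", np.mean((Y - Y_hat) ** 2))
    print("rank-d fit:   ", np.mean((Y - X @ W1 @ W2) ** 2))

No stochastic optimization is needed: a linear solve and one small SVD.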
  • 56. Model stacking x →f1→ z →f2→ y Learn f1 separately Directly supervising z: z = ŷ for a (simple) predictive model Trick: “cross-fit” during training, obtain ŷ by splitting the training data [figure: train / test splits of the full data] (in sklearn: cross_val_predict) G Varoquaux 31
  • 57. Model stacking x →f1→ z →f2→ y Learn f1 separately Directly supervising z: z = ŷ for a (simple) predictive model Trick: “cross-fit” during training, obtain ŷ by splitting the training data (in sklearn: cross_val_predict) Application: tackling dimensionality [Rahim... 2017a] Some features are a high-dimensional signal eg medical images f1: linear, to reduce the signal features f2: non-linear (eg trees) on all features G Varoquaux 31
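A minimal sketch of this cross-fitting trick with scikit-learn; the split into a high-dimensional “signal” block and a few extra features, and the choice of first- and second-level models, are assumptions made for the example.

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import RidgeCV
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import cross_val_predict, cross_val_score

    X, y = make_regression(n_samples=300, n_features=500, n_informative=20,
                           noise=10, random_state=0)
    X_signal, X_other = X[:, :480], X[:, 480:]   # high-dim block vs a few extra features

    # First level: linear model on the high-dimensional block,
    # with out-of-fold predictions ("cross-fit") so z does not leak the train labels
    z = cross_val_predict(RidgeCV(), X_signal, y, cv=5)

    # Second level: non-linear model on the reduced feature plus the remaining ones
    X_stacked = np.column_stack([z, X_other])
    score = cross_val_score(RandomForestRegressor(random_state=0), X_stacked, y, cv=5)
    print("stacked model R2:", score.mean())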
  • 58. Model stacking to encode discrete items Table: Sex, Date Hired, Employee Position (eg M, 09/12/1988, Master Police Officer; F, 06/26/2006, Social Worker III; M, 07/16/2007, Police Officer III) predict → Salary (69222.18, 97392.47, 104717.28) Difficulty: the number of different positions — what invariants? [plot: salary distribution per position, from Crossing Guard to Manager II] Target encoding1 [Micci-Barreca 2001]: position → 𝔼_train[salary | position] 1To inject categories in ℝ, before a second level that combines all columns G Varoquaux 32
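A minimal sketch of target encoding with pandas; the tiny table is made up for the illustration. Recent scikit-learn versions also ship a TargetEncoder with built-in cross-fitting, which is preferable in practice.

    import pandas as pd

    # Made-up data, for illustration only
    df = pd.DataFrame({
        "position": ["Officer", "Officer", "Clerk", "Clerk", "Manager"],
        "salary":   [69000,     72000,     41000,   39000,   120000],
    })

    # Replace each category by the mean target observed on the training data
    means = df.groupby("position")["salary"].mean()
    global_mean = df["salary"].mean()                 # fallback for unseen categories
    df["position_encoded"] = df["position"].map(means).fillna(global_mean)
    print(df)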
  • 59. Summary Supervision helps selecting the relevant part of the signal In limited-sample settings, simple models can create representations Simple latent-factor models Multi-output models Stacking: fit a first-level model G Varoquaux 33
  • 60. Summary of first section For generalization: small family of functions fw that approximate the signal well Generalization of a linear predictor: approximation error + o(p/ntrain ) Predictors by composition: ˆy = f2(z), z = f1(x) x f1 → z f2 → y ideally, f1 makes z invariant to nuisances Reuse representations with the right invariances: wavelets, fasttext, pretrained headless neural nets Simple supervised models can create representations stacking multioutput pretext tasks G Varoquaux 34
  • 61. 2 Matrix factorization and its variants Simple unsupervised representation learning More unlabeled data than labeled data Learn representations and transfer them Here: Focus on simple models for limited n or low SNR settings Particularly interesting regime: p large and n large.
  • 62. 2 Matrix factorization and its variants For signals For discrete objects
  • 63. Principal Component Analysis Find the directions of largest variance Computation: X ∈ ℝ^{n×p}, Σ_X = X^T X ∈ ℝ^{p×p} PCA projector: P_PCA ∈ ℝ^{p×k}, from SVD_k(X) or EVD_k(Σ_X) Reduced X: X P_PCA ∈ ℝ^{n×k} G Varoquaux 37
  • 64. Principal Component Analysis Find the directions of largest variance Computation: X ∈ ℝ^{n×p}, Σ_X = X^T X ∈ ℝ^{p×p} PCA projector: P_PCA ∈ ℝ^{p×k}, from SVD_k(X) or EVD_k(Σ_X) Reduced X: X P_PCA ∈ ℝ^{n×k} Model: low-rank Gaussian latent factors X ≈ U V + E, E ∼ N(0, I_p), U ∈ ℝ^{n×k}, V ∈ ℝ^{k×p} Û, V̂ = argmin_{U,V} ‖X − U V‖²_Fro Rotationally invariant: U O, O^T V is also a solution for any O s.t. O^T O = I G Varoquaux 37
  • 65. Principal Component Analysis Find the directions of largest variance In a learning pipeline Useful for dimensionality reduction (eg p is large) Eases statistics and computations Generalization error of PCA + OLS within a factor of 4 of ridge [Dhillon... 2013] G Varoquaux 37
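A minimal sketch of PCA as a dimension-reduction step before a linear predictor; the synthetic dataset and the choice of 50 components are illustrative assumptions.

    from sklearn.datasets import make_classification
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=200, n_features=500, n_informative=20,
                               random_state=0)

    # Reduce p = 500 features to k = 50 directions of largest variance
    model = make_pipeline(PCA(n_components=50), LogisticRegression(max_iter=1000))
    print("with PCA:   ", cross_val_score(model, X, y, cv=5).mean())
    print("without PCA:", cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean())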
  • 66. Beyond variance: Independent Component Analysis Separate out signals U observed mixed1 True sources, signals U Observations (mixed signal) ICA recovered signals 1Classic ICA has no noise model: it does not do dimension reductionG Varoquaux 38
  • 67. Beyond variance: Independent Component Analysis Separate out signals U observed mixed1 Model: X = U V, V ∈ ℝ^{p×p}, V^T V = I_p If the latent signals are Gaussian, the model is not identifiable Seek low mutual information across the {u_j} ⇒ maximally non-Gaussian marginals [Cardoso 2003] [figure: latent signals and observed data U V] 1Classic ICA has no noise model: it does not do dimension reduction G Varoquaux 38
  • 68. Beyond variance: Independent Component Analysis Separate out signals U observed mixed1 Model: X = U V, V ∈ ℝ^{p×p}, V^T V = I_p If the latent signals are Gaussian, the model is not identifiable Seek low mutual information across the {u_j} ⇒ maximally non-Gaussian marginals [Cardoso 2003] Computation: FastICA [Hyvärinen and Oja 2000] Power iterations on V Each time: - apply a smooth increasing non-linearity on {u_j} - decorrelate Preprocessing: whiten the data eg with PCA 1Classic ICA has no noise model: it does not do dimension reduction G Varoquaux 38
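A minimal sketch of blind source separation with FastICA in scikit-learn; the two synthetic sources and the random mixing matrix are assumptions for the illustration.

    import numpy as np
    from sklearn.decomposition import FastICA

    rng = np.random.RandomState(0)
    t = np.linspace(0, 8, 2000)
    sources = np.column_stack([np.sin(3 * t),                 # smooth source
                               np.sign(np.sin(5 * t))])       # square-wave source
    X = sources @ rng.randn(2, 2).T                           # observed mixtures

    ica = FastICA(n_components=2, random_state=0)
    recovered = ica.fit_transform(X)     # estimated sources, up to sign and scale
    print(recovered.shape, ica.mixing_.shape)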
  • 69. ICA to learn representations Across patches of natural images: Gabor-like filters Similar to wavelets and first layer of convnets G Varoquaux 39 [Hyvärinen and Oja 2000]
  • 70. Dictionary learning Find vectors V that represent the signal well with sparse combinations U Model: X = U V s.t. U is sparse, U ∈ ℝ^{n×k}, V ∈ ℝ^{k×p}, k can be > p (overcomplete dictionary) Estimation: Û, V̂ = argmin_{U,V s.t. ‖v_i‖²₂ ≤ 1} ‖X − U V‖²_Fro + λ‖U‖₁ Combining squared loss and ℓ₁ penalty creates sparsity Constraint on ‖v_i‖²₂ required to avoid cancelling the penalty with V → ∞ and U → 0 G Varoquaux 40
  • 71. Dictionary learning Find vectors V that represent the signal well with sparse combinations U Model: X = U V s.t. U is sparse, U ∈ ℝ^{n×k}, V ∈ ℝ^{k×p}, k can be > p (overcomplete dictionary) Estimation: Û, V̂ = argmin_{U,V s.t. V∈C} ‖X − U V‖²_Fro + λΩ(U) Constraint set and penalty can be varied1 Typically ℓ₂, ℓ₁, and positivity2 on U or V. 1Fast when C and Ω lead to simple projections and penalized regression. 2Recovers a form of NMF (non-negative matrix factorization) G Varoquaux 40
  • 72. Sparse dictionary learning to learn representations Across patches of natural images: Also learns Gabor-like filters1 Good for sparse models, eg for denoising 1as ICA, K-Means, etc on image patches G Varoquaux 41 [Mairal... 2014]
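A minimal sketch of learning such a dictionary on image patches with scikit-learn; the sample image, patch size, and number of atoms are illustrative assumptions.

    from sklearn.datasets import load_sample_image
    from sklearn.feature_extraction.image import extract_patches_2d
    from sklearn.decomposition import MiniBatchDictionaryLearning

    # Grey-level image, cut into small patches
    image = load_sample_image("china.jpg").mean(axis=2) / 255.0
    patches = extract_patches_2d(image, (8, 8), max_patches=5000, random_state=0)
    X = patches.reshape(len(patches), -1)
    X = X - X.mean(axis=1, keepdims=True)          # remove the mean of each patch

    # Sparse dictionary: k = 100 atoms, l1-penalized codes
    dico = MiniBatchDictionaryLearning(n_components=100, alpha=1.0,
                                       batch_size=200, random_state=0)
    codes = dico.fit_transform(X)                  # U: sparse codes (n_patches x k)
    atoms = dico.components_                       # V: dictionary  (k x 64)
    print(codes.shape, atoms.shape)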
  • 73. Large n large p: brain imaging Brain activity at rest 1000 subjects with ∼ 100–10 000 samples Images of dimensionality > 100 000 Dense matrix, large both ways [figure: time × voxels matrix X ≈ U · V + E] G Varoquaux 42
  • 74. Large n large p: recommender systems Product ratings Millions of entries Hundreds of thousands of products and users Large sparse matrix [figure: users × products matrix X ≈ U · V + E] G Varoquaux 43
  • 75. Online estimation: stochastic optimization min_w Σ_i l(y_i, x_i^T w) Many samples: min_w 𝔼[l(y, x^T w)] Gradient descent: w_{t+1} ← w_t − α_t ∇_w l Stochastic gradient descent: w_{t+1} ← w_t − α_t ĝ_t, where ĝ_t is a cheap estimate of 𝔼[∇_w l] (e.g. by subsampling) α_t must decrease “suitably” with t. Those pesky learning rates G Varoquaux 44
  • 76. Online estimation for matrix factorization - Data access (stream columns) - Dictionary update - Code computation Alternating minimization Data matrix Large matrices = terabytes of data argmin_{U,V} ‖X − U V‖²_Fro + λΩ(U) G Varoquaux 45 [Mairal... 2010]
  • 77. Online estimation for matrix factorization Large matrices = terabytes of data argmin_{U,V} ‖X − U V‖²_Fro + λΩ(U) Rewrite as an expectation: argmin_V Σ_i min_u ‖x_i − u V‖²₂ + λΩ(u) = argmin_V 𝔼[f(V)] ⇒ Optimize on approximations (sub-samples) G Varoquaux 45 [Mairal... 2010]
  • 78. Online estimation for matrix factorization - Data access (stream columns) - Dictionary update - Code computation Online matrix factorization Alternating minimization Seen at t Seen at t+1 Unseen at t Data matrix G Varoquaux 45 [Mairal... 2010]
  • 79. Online estimation for matrix factorization - Data access (stream columns, subsample rows) - Dictionary update - Code computation Online matrix factorization Subsampled & online Alternating minimization Seen at t Seen at t+1 Unseen at t Data matrix G Varoquaux 45 [Mensch... 2017]
  • 80. Online matrix factorization algorithm [Mairal... 2010] Stream samples x_t: 1. Compute code u_t = argmin_{u∈ℝ^k} ‖x_t − u V_{t−1}‖²₂ + λΩ(u) G Varoquaux 46
  • 81. Online matrix factorization algorithm [Mairal... 2010] Stream samples x_t: 1. Compute code u_t = argmin_{u∈ℝ^k} ‖x_t − u V_{t−1}‖²₂ + λΩ(u) 2. Update the surrogate function g_t(V) = (1/t) Σ_{i=1}^t ‖x_i − u_i V‖²₂ g_t(V) is a surrogate of Σ_x l(x, V), because u_i is used, and not u*(V) G Varoquaux 46
  • 82. Online matrix factorization algorithm [Mairal... 2010] Stream samples x_t: 1. Compute code u_t = argmin_{u∈ℝ^k} ‖x_t − u V_{t−1}‖²₂ + λΩ(u) 2. Update the surrogate function g_t(V) = (1/t) Σ_{i=1}^t ‖x_i − u_i V‖²₂ = tr(½ V^T A_t V − V^T B_t) + const, with A_t := (1 − 1/t) A_{t−1} + (1/t) u_t^T u_t and B_t := (1 − 1/t) B_{t−1} + (1/t) u_t^T x_t A_t and B_t are sufficient statistics of the loss accumulated over the data G Varoquaux 46
  • 83. Online matrix factorization algorithm [Mairal... 2010] Stream samples x_t: 1. Compute code u_t = argmin_{u∈ℝ^k} ‖x_t − u V_{t−1}‖²₂ + λΩ(u) 2. Update the surrogate function g_t(V) = (1/t) Σ_{i=1}^t ‖x_i − u_i V‖²₂ = tr(½ V^T A_t V − V^T B_t) + const A_t := (1 − 1/t) A_{t−1} + (1/t) u_t^T u_t, B_t := (1 − 1/t) B_{t−1} + (1/t) u_t^T x_t 3. Minimize the surrogate: V_t = argmin_{V∈C} g_t(V), with ∇g_t(V) = A_t V − B_t G Varoquaux 46
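A bare-bones NumPy sketch of this online loop, following the updates above. The toy data, the few ISTA iterations used to compute the sparse codes, the small ridge added when minimizing the surrogate, and the non-negativity constraint standing in for the constraint set C are all simplifying assumptions, not the exact choices of [Mairal... 2010].

    import numpy as np

    rng = np.random.RandomState(0)
    n, p, k = 1000, 50, 10
    X = rng.rand(n, k) @ rng.rand(k, p) + 0.01 * rng.randn(n, p)   # toy data stream

    lam = 0.1                            # l1 penalty on the codes
    V = rng.rand(k, p)                   # dictionary: k atoms of dimension p
    A, B = np.zeros((k, k)), np.zeros((k, p))   # sufficient statistics

    def sparse_code(x, V, lam, n_iter=50):
        # a few ISTA steps for min_u 1/2 ||x - u V||^2 + lam ||u||_1
        u = np.zeros(V.shape[0])
        step = 1.0 / np.linalg.norm(V @ V.T, 2)
        for _ in range(n_iter):
            v = u - step * ((u @ V - x) @ V.T)
            u = np.sign(v) * np.maximum(np.abs(v) - step * lam, 0)
        return u

    for t, x in enumerate(X, start=1):
        u = sparse_code(x, V, lam)
        # Online update of the sufficient statistics (running averages)
        A = (1 - 1 / t) * A + np.outer(u, u) / t
        B = (1 - 1 / t) * B + np.outer(u, x) / t
        # Minimize the surrogate: grad = A V - B; here one regularized exact step
        V = np.linalg.solve(A + 1e-3 * np.eye(k), B)
        V = np.clip(V, 0, None)          # stand-in for the constraint set C

    codes = np.array([sparse_code(x, V, lam) for x in X[:100]])
    print("reconstruction error:", np.mean((X[:100] - codes @ V) ** 2))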
  • 84. Stochastic Majorization-Minimization [Mairal 2013] V* = argmin_{V∈C} Σ_x l(x, V) where l(x, V) = min_u f(x, V, u) Algorithm: g_t(V) is a majorant of Σ_x l(x, V) (u_i is used, and not u*(V)) ⇒ Majorization-Minimization scheme1 Surrogate computation (SMM) vs full minimization 2nd-order information No learning rate 1SOMF uses an approximate majorant and minimization [Mensch... 2017] G Varoquaux 47
  • 85. Experimental convergence: large images [plots: test objective value vs time for ADHD (sparse dictionary, 2 GB), Aviris (NMF and dictionary learning, 103 GB), and HCP (sparse dictionary, 2 TB), comparing OMF, SOMF with subsampling ratios r = 1 to 24, and best step-size SGD] SOMF = Subsampled Online Matrix Factorization G Varoquaux 48
  • 86. Experimental convergence: recommender system SOMF = Subsampled Online Matrix Factorization G Varoquaux 49
  • 87. Summary Versatile matrix-factorization formulation1 argmin_{U∈ℝ^{n×k}, V∈C} ‖X − U V‖²_Fro + λΩ(U) Estimation: stochastic majorization-minimization2 ⇒ an online alternated optimization Example use of learned representations Biomarkers of autism on brain images: p ∼ 100 000, n ∼ 1 000 [Abraham... 2017] 1A 1-layer linear autoencoder 2Common-case algorithm readily usable in scikit-learn: MiniBatchDictionaryLearning G Varoquaux 50
  • 88. 2 Matrix factorization and its variants For signals For discrete objects
  • 89. Gamma-Poisson for factorizing counts [Canny 2004] When X is a matrix of counts - Recommender systems [Gopalan... 2014] - Database string entries [Cerda and Varoquaux 2019] ⇒ Poisson loss, instead of squared loss p(x_j | u, V) = Poisson((u V)_j) = (1/x_j!) (u V)_j^{x_j} e^{−(u V)_j} u are loadings, modeled as random with a Gamma prior3 p(u_i) = u_i^{α_i−1} e^{−u_i/β_i} / (β_i^{α_i} Γ(α_i)) 3Because it is the conjugate prior of the Poisson, it imposes soft sparsity, and it lifts the rotational invariance G Varoquaux 52
  • 90. Gamma-Poisson for factorizing counts [Canny 2004] When X is a matrix of counts - Recommender systems [Gopalan... 2014] - Database string entries [Cerda and Varoquaux 2019] ⇒ Poisson loss, instead of squared loss p(x_j | u, V) = Poisson((u V)_j) = (1/x_j!) (u V)_j^{x_j} e^{−(u V)_j} u are loadings, modeled as random with a Gamma prior3 p(u_i) = u_i^{α_i−1} e^{−u_i/β_i} / (β_i^{α_i} Γ(α_i)) Maximum a posteriori estimation: Û, V̂ = argmin_{U,V} −Σ_j log p(x_j | u, V) − Σ_i log p(u_i) 3Because it is the conjugate prior of the Poisson, it imposes soft sparsity, and it lifts the rotational invariance G Varoquaux 52
  • 91. Gamma-Poisson estimation Full log-likelihood expression: log L = Σ_{j=1}^p [x_j log((u V)_j) − (u V)_j − log(x_j!)] + Σ_{i=1}^k [(α_i − 1) log(u_i) − u_i/β_i − α_i log β_i − log Γ(α_i)] Gradients: ∂ log L / ∂V_ij = (x_j / (u V)_j) u_i − u_i ∂ log L / ∂u_i = Σ_{j=1}^p [(x_j / (u V)_j) V_ij − V_ij] + (α_i − 1)/u_i − 1/β_i G Varoquaux 53
  • 92. Gamma-Poisson estimation Gradients: ∂ log L / ∂V_ij = (x_j / (u V)_j) u_i − u_i ∂ log L / ∂u_i = Σ_{j=1}^p [(x_j / (u V)_j) V_ij − V_ij] + (α_i − 1)/u_i − 1/β_i Equivalent to an NMF formulation: multiplicative updates1 V_ij ← V_ij [Σ_{ℓ=1}^n (x_ℓj / (U V)_ℓj) u_ℓi] [Σ_{ℓ=1}^n u_ℓi]^{−1} u_ℓi ← u_ℓi [Σ_{j=1}^p (x_ℓj / (U V)_ℓj) V_ij + (α_i − 1)/u_ℓi] [Σ_{j=1}^p V_ij + β_i^{−1}]^{−1} 1Efficient implementation with sparse matrices: the summations can be done only on non-zero entries of X. G Varoquaux 53
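A small NumPy sketch of these multiplicative updates on a toy count matrix; the uniform Gamma hyper-parameters (a, b) and the fixed number of iterations are assumptions made for the illustration.

    import numpy as np

    rng = np.random.RandomState(0)
    n, p, k = 100, 30, 5
    X = rng.poisson(rng.rand(n, k) @ (3 * rng.rand(k, p)))    # toy count data

    a, b = 1.1, 1.0                      # Gamma prior hyper-parameters (alpha, beta)
    U, V = rng.rand(n, k) + 0.1, rng.rand(k, p) + 0.1
    eps = 1e-10

    for _ in range(200):
        R = X / (U @ V + eps)            # element-wise ratios x / (UV)
        # Multiplicative update of the loadings U (one row per sample)
        U *= (R @ V.T + (a - 1) / (U + eps)) / (V.sum(axis=1) + 1 / b)
        R = X / (U @ V + eps)
        # Multiplicative update of the dictionary V
        V *= (U.T @ R) / (U.sum(axis=0)[:, None] + eps)

    # Poisson negative log-likelihood, up to constants
    print(np.mean((U @ V) - X * np.log(U @ V + eps)))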
  • 93. Adapt the majorization-minimization algorithm while ‖V(t) − V(t−1)‖_F > η do: draw x_t from the training set; while ‖u_t − u_t^old‖₂ > ε do: u_t ← u_t ⊙ [(x_t ⊘ (u_t V(t))) V(t)^T + (a − 1) ⊘ u_t] ⊘ [1 V(t)^T + b^{−1}]; A_t ← V(t) ⊙ [u_t^T (x_t ⊘ (u_t V(t)))]; B_t ← u_t^T 1; A(t) ← ρ A(t−1) + A_t; B(t) ← ρ B(t−1) + B_t; V(t) ← A(t) ⊘ B(t); t ← t + 1 (⊙, ⊘: element-wise product and division) G Varoquaux 54 [Lefevre... 2011, Cerda and Varoquaux 2019]
  • 94. Application: sub-string representation Problem: representing non-normalized categories Drug Name alcohol ethyl alcohol isopropyl alcohol polyvinyl alcohol isopropyl alcohol swab 62% ethyl alcohol alcohol 68% alcohol denat benzyl alcohol dehydrated alcohol Employee Position Title Police Aide Master Police Officer Mechanic Technician II Police Officer III Senior Architect Senior Engineer Technician Social Worker III G Varoquaux 55[Cerda and Varoquaux 2019]
  • 95. Application: sub-string representation Gamma-Poisson factorization on sub-string counts (3-grams) Models strings as a linear combination of substrings [figure: strings such as “police officer”, “policeman”, “policier” decomposed on 3-grams like “pol”, “ice”, “off”, “cer”] G Varoquaux 56
  • 96. Application: sub-string representation Gamma-Poisson factorization on sub-string counts (3-grams) Models strings as a linear combination of substrings [figure: the count matrix factorizes into “what latent categories are in an entry” × “what substrings are in a latent category”] G Varoquaux 56
  • 97. Application: sub-string representation Representations that extract latent categories [figure: activations of job-title entries, eg “Master Police Officer”, “Bus Operator”, “Senior Engineer Technician”, on the latent categories] G Varoquaux 57
  • 98. Application: sub-string representation Inferring plausible feature names from the most activated substrings, eg “assistant, library”, “equipment, operator”, “rescue, officer” [figure: inferred feature names vs job-title categories] G Varoquaux 57
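A rough sketch of the idea with scikit-learn building blocks: character 3-gram counts factorized with an NMF, used here as a squared-loss stand-in for the Gamma-Poisson model of the slides; the job titles and the number of latent categories are made up. (The Gamma-Poisson version itself is, to the best of my knowledge, available as the GapEncoder of the dirty_cat / skrub library.)

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import NMF

    titles = ["Master Police Officer", "Police Officer III", "Police Aide",
              "Bus Operator", "Equipment Operator I",
              "Senior Engineer Technician", "Mechanic Technician II"]

    # Character 3-gram counts (word boundaries included)
    vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(3, 3))
    counts = vectorizer.fit_transform(titles)

    # Factorize the counts into latent categories
    nmf = NMF(n_components=3, random_state=0, max_iter=500)
    activations = nmf.fit_transform(counts)       # entries x latent categories
    ngrams = vectorizer.get_feature_names_out()

    for i, topic in enumerate(nmf.components_):
        top = [ngrams[j] for j in topic.argsort()[-5:][::-1]]
        print("latent category", i, ":", top)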
  • 99. Natural language processing: topic-modeling history Topic modeling: embedding documents1 [figure: the documents × terms count matrix factorizes into “what topics are in a document” × “what terms are in a topic”] LSA (Latent Semantic Analysis) [Landauer... 1998]: SVD2 of the terms × documents matrix 1Typically for information-retrieval purposes, aka search engines 2Later: refinements for more complex losses: LDA (Latent Dirichlet Allocation) [Blei... 2003] and Gamma-Poisson [Canny 2004]. G Varoquaux 58
  • 100. Word embeddings Distributional semantics: meaning of words “You shall know a word by the company it keeps” Firth, 1957 Example: A glass of red , please Could be wine maybe juice? wine and juice have related meanings Factorization of the word×context matrix What choice of context? What loss? word2vec [Mikolov... 2013a] glove [Pennington... 2014] G Varoquaux 59
  • 101. Word2vec: skip-gram sampling [Mikolov... 2013b] {û_w, v̂_c} = argmax_{u_w, v_c} Σ_{pairs of words (w,c) in the same window1} log softmax(V u_w^T)_c softmax(z)_i = exp(z_i) / Σ_j exp(z_j) u_w ∈ ℝ^k: embedding of word w V ∈ ℝ^{card(voc)×k}: [v_c, c ∈ voc], all context words Big sum on contexts ⇒ solved by SGD2 [figure: center-word embeddings U and context-word embeddings V for words such as salad, meat, juice, wine, glass, red, green] Other view: language models, prediction of words 1Efficient: never build the matrix, stream directly from text. 2These windows are called skip-grams G Varoquaux 60
  • 102. Word2vec: negative sampling [Mikolov... 2013a] Costly loss: log softmax(z)_i = log[exp(z_i) / Σ_j exp(z_j)] Approximate1 the huge sum in the softmax (all the vocabulary): downsample it by drawing the positive (numerator) and a few negative examples (denominator) Negative-sampling loss2: [Goldberg and Levy 2014] log σ(v_c u_w^T) + Σ_{n_neg words w′ not in the window} log σ(−v_c u_{w′}^T) σ: sigmoid, σ(z) = 1/(1 + e^{−z}) 1Related to noise-contrastive estimation, which avoids computing costly normalizations in likelihoods [Gutmann and Hyvärinen 2010] 2Related to a matrix factorization of the mutual information in word co-occurrence [Levy and Goldberg 2014] G Varoquaux 61
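A bare-bones NumPy sketch of one skip-gram negative-sampling update, to make the loss above concrete; the tiny corpus, window size, learning rate, and number of epochs are all illustrative assumptions (real implementations such as word2vec or fastText add many refinements, eg frequency-based negative sampling).

    import numpy as np

    rng = np.random.RandomState(0)
    corpus = "a glass of red wine a glass of orange juice".split()
    vocab = sorted(set(corpus))
    idx = {w: i for i, w in enumerate(vocab)}
    k, lr, n_neg = 16, 0.05, 3

    U = 0.1 * rng.randn(len(vocab), k)   # center-word embeddings
    V = 0.1 * rng.randn(len(vocab), k)   # context-word embeddings
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    for _ in range(200):                 # a few passes over the toy corpus
        for pos, word in enumerate(corpus):
            for ctx in corpus[max(0, pos - 2): pos + 3]:    # window of size 2
                if ctx == word:
                    continue
                w, c = idx[word], idx[ctx]
                # Positive pair: increase sigma(v_c . u_w)
                g = 1.0 - sigmoid(V[c] @ U[w])
                U[w], V[c] = U[w] + lr * g * V[c], V[c] + lr * g * U[w]
                # Negative pairs: decrease sigma(v_n . u_w) for random words n
                for n_ in rng.randint(len(vocab), size=n_neg):
                    g = -sigmoid(V[n_] @ U[w])
                    U[w], V[n_] = U[w] + lr * g * V[n_], V[n_] + lr * g * U[w]

    print("wine . juice:", U[idx["wine"]] @ U[idx["juice"]])
    print("wine . glass:", U[idx["wine"]] @ U[idx["glass"]])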
  • 103. Beyond natural language: metric learning Triplet loss For an “anchor” a, b close to a, c far from a: log σ(v_a^T u_b) − log σ(v_a^T u_c) Quadruplet loss [Chen... 2017] For a and b close by, c and d far apart: log σ(v_a^T u_b) − log σ(v_c^T u_d) In practice: draw1 (a, b, c) or (a, b, c, d) randomly Metric learning: [Bellet... 2013] learning embeddings with weak supervision 1Many strategies, eg “hard negative mining”; setting them requires a good test set and metric, as with SGD hyperparameters. G Varoquaux 62
  • 104. Embedding entities in knowledge graphs Structured (graph) representation of human knowledge eg dbpedia, Yago G Varoquaux 63
  • 105. Embedding entities in knowledge graphs Structured (graph) representation of human knowledge eg dbpedia, Yago Learning embeddings of entities {e_i} and relations {r_j}: e_a ∼ e_b + r_c, a model of the relation Then triplet / quadruplet loss Reuse existing: conceptnet.io G Varoquaux 63 [Bordes... 2013, Wang... 2017]
  • 106. The value of simple models Risk of invisible overfit during the search for hyperparameters and models Complex models call for a clear utility measure with low measurement error Many reliable labels G Varoquaux 64
  • 107. The value of simple models Risk of invisible overfit during the search for hyperparameters and models Complex models call for a clear utility measure with low measurement error Many reliable labels Matrix-factorization models1: 2 hyper-parameters: dimensionality k, regularization λ Set them to optimize representations for supervised problems 1Using majorization-minimization approaches to avoid learning rates G Varoquaux 64
  • 108. Summary Discrete entities lead to counting occurrences ⇒ Poisson and logistic loss (ugly logs in equations) Word & entity embeddings Factorization of co-occurrences in a notion of context more generally: metric learning Limited-data settings: Avoid negative-sampling models (hyper-parameters) Try to reuse representations (fastText, conceptnet.io) G Varoquaux 65
  • 109. 3 Fisher kernels What if the objects studied do not naturally live in a vector space? eg graphs of varying number of nodes
  • 110. 3 Fisher kernels Kernels feature maps From likelihoods to Kernels
  • 111. Learning with Kernels [Scholkopf and Smola 2001] Kernels A kernel K is a function X × X → ℝ+, positive, symmetric It captures similarity between observations Building functions with kernels on the training data: K_i := K(x_i, ·), i ∈ train prediction function2: f(x) = Σ_{i∈train} w_i K_i(x) 2Benefits of this formulation: i) non-linear predictor trained with a linear problem; ii) expressiveness that increases with the amount of training data G Varoquaux 68
  • 112. Feature maps [Scholkopf and Smola 2001] Drawbacks of kernels Compute cost O(n²) Representations not explicit: f(x) = Σ_{i∈train} w_i K_i(x) As K is symmetric positive1, ∃ φ : X → ℝ^d such that ∀ x, x′: K(x, x′) = φ(x)^T φ(x′) φ is a “feature map” f(x) is a linear function of φ(x), but d can be ∞ ⇒ Approximate φ 1Think of it as a generalization of the Cholesky decomposition G Varoquaux 69
  • 113. Nyström approximate feature maps [Drineas and Mahoney 2005] On a random subset {x_1, ..., x_m} of the training data: G := [K(x_i, x_j)]_{i,j=1..m} ∈ ℝ^{m×m} Let L ∈ ℝ^{k×m} be a rank-k approximation: L^T L ≈ G^{−1} Feature map1: φ_Nystrom(x) = L [K(x_1, x), ..., K(x_m, x)]^T sklearn.kernel_approximation.Nystroem 1Exercise: check that φ_Nystrom(x)^T φ_Nystrom(x′) ≈ φ(x)^T φ(x′) for x, x′ in our subset. G Varoquaux 70
  • 114. Nyström approximate feature maps [Drineas and Mahoney 2005] On a random subset {x_1, ..., x_m} of the training data: G := [K(x_i, x_j)]_{i,j=1..m} ∈ ℝ^{m×m} Let L ∈ ℝ^{k×m} be a rank-k approximation: L^T L ≈ G^{−1} Feature map: φ_Nystrom(x) = L [K(x_1, x), ..., K(x_m, x)]^T sklearn.kernel_approximation.Nystroem See also: Random features [Rahimi and Recht 2008] sklearn.kernel_approximation.RBFSampler G Varoquaux 70
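A minimal sketch of this kernel-approximation route in scikit-learn: an RBF kernel approximated by a Nystroem feature map, followed by a linear model; the dataset, gamma, and number of components are illustrative assumptions.

    from sklearn.datasets import load_digits
    from sklearn.kernel_approximation import Nystroem
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import cross_val_score

    X, y = load_digits(return_X_y=True)

    # Explicit approximate feature map for the RBF kernel, then a linear predictor
    model = make_pipeline(
        Nystroem(kernel="rbf", gamma=0.002, n_components=200, random_state=0),
        LogisticRegression(max_iter=2000))
    print("Nystroem + linear model:", cross_val_score(model, X, y, cv=5).mean())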
  • 115. 3 Fisher kernels Kernels feature maps From likelihoods to Kernels
  • 116. Parametric generative model Consider a model of x parametrized by w ∈ ℝ^k: p(x) = P_w(x) log-likelihood L_P := log P_w Maximum-likelihood estimate: ŵ = argmax_w L_P(x) Kullback-Leibler divergence Natural distance1 to another distribution: KL(P|Q) = 𝔼_P[L_P − L_Q] Goal: benefit from our model to build a representation All models are wrong but some are useful 1Not a distance, technically, as it is not symmetric. G Varoquaux 72
  • 117. Local behavior of parametric models Fisher information matrix Expectation of the Hessian of L at w: I(w) := 𝔼[∂²L(x|w)/∂w²] ∈ ℝ^{k×k} Order-2 approximation of the KL divergence: KL(P_w | P_{w+ε}) ≈ ε^T I_w ε I_w also scales the covariance of the estimation error on maximum-likelihood estimates of w (Cramér-Rao bounds) G Varoquaux 73
  • 118. Fisher-Rao manifold (information geometry) Order-2 approximation: KL(P_w | P_{w+ε}) ≈ ε^T I_w ε [figure: KL level sets close to w1 and close to w2] G Varoquaux 74
  • 119. Fisher-Rao manifold (information geometry) Order-2 approximation: KL(P_w | P_{w+ε}) ≈ ε^T I_w ε The metric is not constant across the family of distributions {P_w, w ∈ ℝ^k} G Varoquaux 74
  • 120. Fisher-Rao manifold (information geometry) Order-2 approximation: KL(P_w | P_{w+ε}) ≈ ε^T I_w ε {P_w, w ∈ ℝ^k} form a Riemannian manifold, with I as the metric tensor [Rao 1945] G Varoquaux 74
  • 121. Riemannian manifolds Continuous geometry on curved spaces (eg the Earth) Locally, but not globally, Euclidean A Riemannian manifold M is a differentiable space endowed with a metric d that is locally equivalent to a Euclidean vector space: [figure: tangent space at M, with Log_M and Exp_M maps] for M and M′ ∈ M, if d(M, M′) → 0, M and M′ can be mapped to elements m, m′ of a vector space such that d(M, M′) is approximately the Euclidean distance between m and m′ Global structure: geodesic distance G Varoquaux 75
  • 122. Fisher Kernel [Jaakkola and Haussler 1999] A Kernel locally equivalent to the KL divergence Build upon the Fisher matrix Create a feature map Vector space where Euclidean distance ≈ KL ⇒ G Varoquaux 76
  • 123. Fisher Kernel [Jaakkola and Haussler 1999] A kernel locally equivalent to the KL divergence Build upon the Fisher matrix Create a feature map Vector space where Euclidean distance ≈ KL In practice: 1. Fit the model P_w on the train data: ŵ ← argmax_w Σ_{i∈train} L(x_i, w) 2. Compute the gradient w.r.t. w of the likelihood at ŵ: z_Fisher(x) = ∇_w L(x, ŵ) ∈ ℝ^k G Varoquaux 76
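A small NumPy sketch of these two steps for the simplest possible model, a univariate Gaussian: fit (mu, sigma^2) by maximum likelihood, then represent each observation by the gradient of its log-likelihood. The Gaussian model and the omission of the Fisher-matrix normalization are simplifying assumptions.

    import numpy as np

    rng = np.random.RandomState(0)
    x_train = 2.0 * rng.randn(500) + 1.0        # toy 1-d data

    # 1. Maximum-likelihood fit of the Gaussian model P_w, w = (mu, var)
    mu, var = x_train.mean(), x_train.var()

    # 2. Fisher representation: gradient of the log-likelihood at w_hat
    #    log P(x) = -0.5 * log(2 pi var) - (x - mu)^2 / (2 var)
    def fisher_features(x):
        d_mu = (x - mu) / var
        d_var = -0.5 / var + (x - mu) ** 2 / (2 * var ** 2)
        return np.column_stack([d_mu, d_var])

    z = fisher_features(np.array([0.0, 1.0, 5.0]))
    print(z)     # each row: a 2-d vector representing one observation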
  • 124. Fisher Kernel applications Text: TF-IDF [Elkan 2005] Multinomial model of word appearance Genomics [Jaakkola and Haussler 1999] Hidden Markov model of DNA sequences (variable-length sequences ⇒ encoding difficult) Tree-structured data [Nicotra... 2004] A transition model on the tree Brain connectivity [Varoquaux... 2010] Multivariate Gaussian model (covariances) G Varoquaux 77
  • 125. Summary Kernels build prediction functions on similarities Features maps / kernel approximation captures the corresponding representation Fisher Kernels can go from likelihood to vector space Very useful for non numeric objects G Varoquaux 78
  • 126. Limited-data settings Reminder: your validation measure is intrinsically unreliable (sampling noise) Get more data For instance acquiring data on a related task, to learn representations Use simple models Do not spend too much time tweaking [figure: distribution of errors under a binomial law, from ±2% with 1000 available samples to −15%/+12% with 30] G Varoquaux 79 [Varoquaux 2018]
  • 127. References I A. Abraham, M. P. Milham, A. Di Martino, R. C. Craddock, D. Samaras, B. Thirion, and G. Varoquaux. Deriving reproducible biomarkers from multi-site resting-state data: an autism-based example. NeuroImage, 147:736, 2017. A. Achille and S. Soatto. Emergence of invariance and disentanglement in deep representations. The Journal of Machine Learning Research, 19(1):1947–1980, 2018. A. Bellet, A. Habrard, and M. Sebban. A survey on metric learning for feature vectors and structured data. arXiv preprint arXiv:1306.6709, 2013. D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. Journal of machine Learning research, 3(Jan):993–1022, 2003. G Varoquaux 80
  • 128. References II A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, pages 2787–2795, 2013. D. Bzdok, M. Eickenberg, O. Grisel, B. Thirion, and G. Varoquaux. Semi-supervised factored logistic regression for high-dimensional neuroimaging data. In Advances in Neural Information Processing Systems, page 3348, 2015. J. Canny. Gap: A factor model for discrete data. In SIGIR, page 122, 2004. J.-F. Cardoso. Dependence, correlation and gaussianity in independent component analysis. Journal of Machine Learning Research, 4:1177, 2003. P. Cerda and G. Varoquaux. Encoding high-cardinality string categorical variables. arXiv:1907.01860, 2019. G Varoquaux 81
  • 129. References III W. Chen, X. Chen, J. Zhang, and K. Huang. Beyond triplet loss: a deep quadruplet network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, page 403, 2017. P. S. Dhillon, D. P. Foster, S. M. Kakade, and L. H. Ungar. A risk comparison of ordinary least squares vs ridge regression. The Journal of Machine Learning Research, 14:1505, 2013. P. Drineas and M. W. Mahoney. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6:2153, 2005. C. Elkan. Deriving TF-IDF as a Fisher kernel. In International Symposium on String Processing and Information Retrieval, page 295, 2005. G Varoquaux 82
  • 130. References IV Y. Goldberg and O. Levy. word2vec explained: Deriving Mikolov et al.’s negative-sampling word-embedding method. arXiv:1402.3722, 2014. P. K. Gopalan, L. Charlin, and D. Blei. Content-based recommendations with Poisson factorization. In Advances in Neural Information Processing Systems, page 3176, 2014. M. Gutmann and A. Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the International Conference on Artificial Intelligence and Statistics, page 297, 2010. D. Hsu, S. Kakade, and T. Zhang. Random design analysis of ridge regression. Foundations of Computational Mathematics, 14, 2014. A. Hyvärinen and E. Oja. Independent component analysis: algorithms and applications. Neural networks, 13(4):411, 2000. G Varoquaux 83
  • 131. References V A. J. Izenman. Reduced-rank regression for the multivariate linear model. Journal of multivariate analysis, 5:248, 1975. T. Jaakkola and D. Haussler. Exploiting generative models in discriminative classifiers. In Advances in neural information processing systems, pages 487–493, 1999. T. K. Landauer, P. W. Foltz, and D. Laham. An introduction to latent semantic analysis. Discourse processes, 25:259, 1998. A. Lefevre, F. Bach, and C. Févotte. Online algorithms for nonnegative matrix factorization with the Itakura-Saito divergence. In Applications of Signal Processing to Audio and Acoustics (WASPAA), page 313. IEEE, 2011. O. Levy and Y. Goldberg. Neural word embedding as implicit matrix factorization. In Advances in neural information processing systems, page 2177, 2014. G Varoquaux 84
  • 132. References VI J. Mairal. Stochastic majorization-minimization algorithms for large-scale optimization. In Advances in Neural Information Processing Systems, 2013. J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 11:19–60, 2010. J. Mairal, F. Bach, and J. Ponce. Sparse modeling for image and vision processing. Foundations and Trends® in Computer Graphics and Vision, 8(2-3):85–283, 2014. S. Mallat. Understanding deep convolutional networks. Philosophical Transactions of the Royal Society A, 374:20150203, 2016. A. Mensch, J. Mairal, B. Thirion, and G. Varoquaux. Stochastic subsampling for factorizing huge matrices. IEEE Transactions on Signal Processing, 66:113, 2017. G Varoquaux 85
  • 133. References VII A. Mensch, J. Mairal, B. Thirion, and G. Varoquaux. Extracting universal representations of cognition across brain-imaging studies. arXiv preprint arXiv:1809.06035, 2018. D. Micci-Barreca. A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems. ACM SIGKDD Explorations Newsletter, 3:27, 2001. T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. In ICLR Workshop Papers. 2013a. T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, page 3111, 2013b. G Varoquaux 86
  • 134. References VIII G. F. Montufar, R. Pascanu, K. Cho, and Y. Bengio. On the number of linear regions of deep neural networks. In Advances in neural information processing systems, page 2924, 2014. L. Nicotra, A. Micheli, and A. Starita. Fisher kernel for tree structured data. In 2004 IEEE International Joint Conference on Neural Networks (IEEE Cat. No. 04CH37541), volume 3, pages 1917–1922. IEEE, 2004. E. Oyallon, E. Belilovsky, and S. Zagoruyko. Scaling the scattering transform: Deep hybrid networks. In Proceedings of the IEEE international conference on computer vision, page 5618, 2017. J. Pennington, R. Socher, and C. Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), page 1532, 2014. G Varoquaux 87
  • 135. References IX M. Rahim, B. Thirion, D. Bzdok, I. Buvat, and G. Varoquaux. Joint prediction of multiple scores captures better individual traits from brain images. Neuroimage, 158:145–154, 2017a. M. Rahim, B. Thirion, and G. Varoquaux. Multi-output predictions from neuroimaging: assessing reduced-rank linear models. In 2017 International Workshop on Pattern Recognition in Neuroimaging (PRNI), pages 1–4. IEEE, 2017b. A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Advances in neural information processing systems, pages 1177–1184, 2008. C. Rao. Information and accuracy attainable in the estimation of statistical parameters. Bull Calcutta. Math. Soc., 37:81, 1945. G Varoquaux 88
  • 136. References X S. Rosset and R. J. Tibshirani. From fixed-x to random-x regression: Bias-variance decompositions, covariance penalties, and prediction error estimation. Journal of the American Statistical Association, pages 1–14, 2018. B. Scholkopf and A. J. Smola. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT press, 2001. G. Varoquaux. Cross-validation failure: small sample sizes lead to large error bars. Neuroimage, 180:68–77, 2018. G. Varoquaux, F. Baronnet, A. Kleinschmidt, P. Fillard, and B. Thirion. Detection of brain functional-connectivity difference in post-stroke patients using group-level covariance modeling. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 200–208. Springer, 2010. G Varoquaux 89
  • 137. References XI Q. Wang, Z. Mao, B. Wang, and L. Guo. Knowledge graph embedding: A survey of approaches and applications. IEEE Transactions on Knowledge and Data Engineering, 29(12):2724–2743, 2017. G Varoquaux 90