A 4-hour long course given at the Deep learning 2019 summer school.
An updated version of this slide deck can be found here:
https://www.slideshare.net/GaelVaroquaux/representation-learning-in-limiteddata-settings-250095542
The topic is how to learn representations for machine learning when the amount of data is limited, for instance when the number of samples is not large compared to the dimensionality of the problem, or when there is a lot of noise that makes learning difficult. This course bridges deep learning to more classic "shallow" learning techniques that work well in limited-data settings, with some theory and some practical recommendations.
1. Representations for machine learning: some learning theory results, some reflections on representations, and some simple models that extract representations.
2. Matrix factorizations: covering the wide spectrum from PCA to word2vec via dictionary learning and metric learning
3. Fisher kernels: building representations from likelihood models (slightly more academic)
2. Limited-data settings
n to be compared to:
a measure of the signal-to-noise ratio
the dimensionality of the data p
Deep learning does not work well in
small-sample regimes
But we can borrow ideas
This talk: No silver bullet,
many simple (shallow) tricks
3. Small-n problems are important
83% of data scientists1 never have n > 1M
n is often small for applications such as medicine
Bigger is better (how to not use this talk)
Get more data (pool related datasets)
Find a related problem and try transfer
This talk: data that differs from common sources
1 www.kaggle.com/laurae2/data-scientists-vs-size-of-datasets
4. Perils of deep learning with small n
Selecting architecture, learning rate...
A deep architecture is validated by its measured accuracy
overfitting the validation & test set
Sampling noise for ntest = 1000:
[Figure: binomial distribution of the error on test accuracy, spanning roughly -2% to +2%]
Optimizing test accuracy will explore the tails
cf online challenges
Need for guiding principles
[Varoquaux 2018]
5. Outline
1 Representations for machine learning
Non-asymptotic supervised learning
Learning with representations
Supervised learning of representations
2 Matrix factorization and its variants
For signals
For discrete objects
3 Fisher kernels
Kernels and feature maps
From likelihoods to Kernels
6. 1 Representations for machine learning
Defining the notion of representations
Their use for supervised learning
7. 1 Representations for machine learning
Non-asymptotic supervised learning
Learning with representations
Supervised learning of representations
8. Settings: supervised learning
Given n pairs (x, y) ∈ X × Y drawn i.i.d.
find a function f : X → Y such that f(x) ≈ y
Notation: ŷ ≝ f(x)
Empirical risk minimization
Loss function l : Y × Y → ℝ
Estimation of f: f* = argmin_{f∈F} 𝔼[l(ŷ, y)]
This course: how to choose good function classes F
10-14. Example: finite-sample estimation of f
Data generated with a 9th-order polynomial + noise
Fit polynomials of various degrees (1, 2, 5, 9) and compare to the truth
Model too simple: underfit
Model too complex: overfit
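A minimal sketch of this experiment, assuming synthetic data (the exact data and plots from the slides are not reproduced):

```python
# Fit polynomials of increasing degree to noisy data generated from a
# degree-9 polynomial; low degrees underfit, degree 9 overfits at small n.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
x = rng.uniform(-1, 1, size=(50, 1))
y = np.polyval(rng.randn(10), x.ravel()) + 0.1 * rng.randn(50)

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)
for degree in [1, 2, 5, 9]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    print(degree, mean_squared_error(y_test, model.predict(x_test)))
```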
15. Theory: the generalization error
Generalization error of a prediction function f:
Notation: E(f) ≝ 𝔼[l(y, f(x))]
Finite-sample regime
Ideally: f* = argmin_{f∈F} 𝔼[l(f(x), y)]
In practice: f̂ = argmin_{f∈F} ∑_{i=1}^{n} l(f(x_i), y_i)
E(f̂) ≥ E(f*)
16-19. Theory: decomposing the generalization error
Assuming y = g(x) + e, e random with 𝔼[e] = 0, the generalization error of f̂ is:
E(f̂) = 𝔼[l(g(x) + e, f̂(x))]
     = E(g) + [E(f*) − E(g)] + [E(f̂) − E(f*)]
Bayes rate E(g) = 𝔼[l(g(x) + e, g(x))]: the best possible prediction; due to the noise e; cannot be avoided
Approximation error E(f*) − E(g): our model is wrong (g ∉ F); decreases for larger F; empirical upper bound: train error
Estimation error E(f̂) − E(f*): sampling noise on train data (f̂ ≠ f*); a finite-sample problem; decreases as n grows; increases for larger F; guesstimate: difference between train and test error
20-21. Example: polynomial regression degree
f̂ = argmin_{f∈F} ∑_i l(f(x_i), y_i)
Degree 9, small n: no approximation error, large estimation error
(function class F not restrictive enough)
Degree 1, large n: small estimation error, large approximation error
(function class F too restrictive)
22-24. Gauging overfit vs underfit: learning curves
[Figure: generalization error and training error vs number of samples (100 to 1000), for polynomial degrees 9 and 1]
sklearn.model_selection.learning_curve
Overfit region at small n; at large n: underfit? Or Bayes rate?
Estimation error ∼ gap between train and test error
Simpler models reach the asymptotic regime faster (smaller "sample complexity"), but can underfit
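A minimal sketch of such a learning curve, assuming a synthetic regression task (the slide's polynomial data is not reproduced here):

```python
# Learning curve: train/test score as a function of the number of samples.
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.randn(1000, 20)
y = X @ rng.randn(20) + rng.randn(1000)

train_sizes, train_scores, test_scores = learning_curve(
    Ridge(), X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5)
# A persistent train/test gap ~ estimation error; both scores plateauing
# at a low value ~ underfit (or the Bayes rate).
print(train_scores.mean(axis=1) - test_scores.mean(axis=1))
```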
25. Gauging overfit vs underfit: validation curves
[Figure: generalization error and training error vs polynomial degree (1 to 15)]
sklearn.model_selection.validation_curve
Reveals underfits
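A minimal sketch of a validation curve over the polynomial degree, assuming synthetic data as above:

```python
# Validation curve: train/test score as a function of a hyper-parameter.
import numpy as np
from sklearn.model_selection import validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
x = rng.uniform(-1, 1, (200, 1))
y = np.polyval(rng.randn(10), x.ravel()) + 0.1 * rng.randn(200)

model = make_pipeline(PolynomialFeatures(), LinearRegression())
train_scores, test_scores = validation_curve(
    model, x, y, param_name="polynomialfeatures__degree",
    param_range=range(1, 16), cv=5)
# Low train AND test scores at small degrees reveal underfit.
```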
26. Linear models for limited-data settings
In high-dimensional limited-data settings, linear models are often the best choice
For p-dimensional data, x ∈ ℝᵖ, they have p parameters
n ∼ 200 000
Inpatient Mortality, AUROC (95% CI):   Hospital A         Hospital B
Deep learning                          0.95 (0.94-0.96)   0.93 (0.92-0.94)
Baseline (logistic regression)         0.93 (0.92-0.95)   0.91 (0.89-0.92)
27-28. Theory: approximating with linear predictors
Linear predictor¹: ŷ = xᵀw, w ∈ ℝᵖ
Data model: y = xᵀw* + δ(x) + e, 𝔼[e] = 0
xᵀw*: the best linear predictor
Ridge estimator: ŵ = argmin_w ‖y_train − X_train w‖²₂ + λ‖w‖²₂
Error compared to the best linear predictor:
𝔼[(y − xᵀŵ)²] = 𝔼[(y − xᵀw*)²] + o(σ²p/n_train)   [Hsu... 2014, sec 2.5]
Random-design analysis can characterize the generalization error without assuming a correct data-generating model (mis-specified model) [Hsu... 2014, Rosset and Tibshirani 2018]
Approximation error: data not linearly generated ⇒ craft more features
Estimation error: curse of dimensionality ⇒ limit the number of features
¹ Predictor, not model: we do not assume it is a data-generating process.
29. Example: extrapolating sea level (tides)
Predict sea level as a function of time
Test outside of observed range1
1 Technically, this is not covered by our theory: the test set is outside the range of the train set.
36. Example: extrapolating sea level (tides)
Covariates: polynomial regression vs a sines-and-cosines basis, with dim = 10, 100, 1000
Choice of covariates / basis / signal representation
⇒ huge difference on approximation error
⇒ huge difference on generalization error
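A hedged sketch of the point made here: on a periodic signal, a sine/cosine basis extrapolates where polynomial features fail. The tide data is simulated, and the true frequencies are assumed to be in the basis:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
t = np.linspace(0, 10, 500)
y = np.sin(2 * np.pi * t) + 0.5 * np.cos(6 * np.pi * t) + 0.1 * rng.randn(500)

def fourier_basis(t, n_freqs=10):
    # Columns: sin(2 pi k t), cos(2 pi k t) for k = 1..n_freqs
    k = np.arange(1, n_freqs + 1)
    return np.hstack([np.sin(2 * np.pi * np.outer(t, k)),
                      np.cos(2 * np.pi * np.outer(t, k))])

train, test = t < 8, t >= 8   # test outside the observed range
for basis in (fourier_basis,
              lambda t: PolynomialFeatures(10).fit_transform(t[:, None])):
    model = Ridge().fit(basis(t[train]), y[train])
    print(model.score(basis(t[test]), y[test]))  # Fourier basis wins
```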
37. Summary
ŷ = f(x), f chosen in F to minimize the observed error ∑_{i∈train} l(f(x_i), y_i)
Generalization error:
- approximation error ⇒ F adapted to the data
- estimation error ⇒ F small
Limited-data settings
Linear models are the best option when p ≳ n
A good choice of covariates is crucial
38. 1 Representations for machine learning
Non-asymptotic supervised learning
Learning with representations
Supervised learning of representations
39-43. Representations to build F
Settings
z = r(x): representation of the data, z ∈ ℝᵏ
Predictor f : x → ŷ = h_w(r(x))
Function composition: "depth"
Benefits
For expressiveness: composition ≫ basis expansion
Composing L rectifying functions on intermediate representations of dimension k gives O((k/p)^{p(L−1)} kᵖ) linear regions; basis expansion + a linear predictor gives O(k).
Exponential in depth, linear in dimension [Montufar... 2014]
For multi-task: sharing representations across tasks (y multidimensional)
For limited data: h_w(z) = wᵀz, a linear predictor
A good choice of z can decrease sample complexity
Transfer: r is learned on large data; a simple h is used.
44. Background: Information theory
Entropy = amount of information in x
H(x) = 𝔼_p[− log p(x)]
Equi-probable distribution = high entropy
Uneven distribution = low entropy
Mutual information between x and y
I(x; y) = H(x) + H(y) − H(x, y)
x ⊥⊥ y (independent) ⇔ I(x; y) = 0
Indeed, independence ⇔ p(x, y) = p(x) p(y), and then
H(x, y) = 𝔼_{(x,y)}[− log p(x, y)] = 𝔼_{(x,y)}[− log p(x) − log p(y)]
        = 𝔼_x[− log p(x)] + 𝔼_y[− log p(y)] = H(x) + H(y)
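A minimal sketch of these definitions on a discrete joint distribution (the probability table is made up for illustration):

```python
# Entropy H(x) = E[-log p(x)] and mutual information
# I(x;y) = H(x) + H(y) - H(x,y), computed from a joint table.
import numpy as np

p_xy = np.array([[0.3, 0.1],
                 [0.1, 0.5]])                 # joint distribution p(x, y)
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)  # marginals

def H(p):
    p = p[p > 0]
    return -(p * np.log(p)).sum()

I_xy = H(p_x) + H(p_y) - H(p_xy.ravel())
print(I_xy)   # 0 iff x and y are independent
```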
45. Theory: information in representations
A representation z of x is sufficient for y if y ⊥⊥ x | z,
or equivalently if I(z; y) = I(x; y)
x, z, y form a Markov chain if p(y|x, z) = p(y|z):
x → z → y
Data processing inequality: I(x; y) ≤ I(x; z)
A sufficient representation z is minimal when I(x; z) is smallest among sufficient representations
[Achille and Soatto 2018]
46. Nuisances and invariances
A nuisance n: I(x; n) ≥ 0, but I(y; n) = 0
Representation z is invariant to the nuisance n if z ⊥⊥ n, or I(z; n) = 0 ⇒ we want I(z; n) low
In a Markov chain x → z1 → z2 · · · → zL → y
If z is a sufficient representation for y: I(z; n) ≤ I(z; x) − I(x; y)
Communication bottleneck: I(z1; z2) < I(z1; x) ⇒ I(z2; n) ≤ I(z1; z2) − I(x; y)
Stacking increases invariance
[Achille and Soatto 2018]
47. Invariant representations on a continuous space
Shift-invariant representation = Fourier basis
Fourier transform: F(s)_f = ∑_t e^{−i f t} s_t (i: the complex unit)
Shifting the signal: s_t → s′_t = s_{t+k}
F(s′)_f = ∑_t e^{−i f t} s_{t+k} = ∑_t e^{−i f (t−k)} s_t = e^{i k f} ∑_t e^{−i f t} s_t = e^{i k f} F(s)_f
→ a shift only changes the phase
An orthonormal basis of shift-invariant vectors
48. Invariant representations on a continuous space
Shift invariance = Fourier basis
Local deformations = Wavelets
Locally equivalent to Fourier basis
But without the global extent
Decimated wavelets
Isometric transform of the signal
Higher scales lose shift invariance
Redundant wavelets
Increase the dimensionality
Good shift invariance
49. Representations invariant to rich deformations
Scaling
Rotations
Deformations
Ingredients
Modulus of wavelet / Fourier transform
⇒ non linearity & filter banks (convolutions)
+ stacking (repeating simple invariants)
Scattering transform
Derived from first principles
Building first-order invariants
Convolutional networks
Learned from data
Pooling across pixels (eg max)
[Mallat 2016]
50. Summary
Intermediate representations give expressiveness to predictive models
Good representations keep predictive information and lose nuisance information
Bottleneck and regularization to lose information
Limited-data settings
Given known invariants of the problem, reusing existing representations helps
eg headless conv-nets, wavelets... [Oyallon... 2017]
51. 1 Representations for machine learning
Non-asymptotic supervised learning
Learning with representations
Supervised learning of representations
52. The need for supervision
Maximizing I(z; y) (≤ I(x; y)) gives sufficient representations
⇒ supervised learning
while minimizing I(z; n) removes nuisances
⇒ sampling nuisances / invariants: data augmentation
Challenge: the amount of labeled data
Pretext tasks
Other targets y that capture useful information
Finding them needs domain knowledge
53. Deep architectures
ŷ = f^d_{W_d} ∘ ... ∘ f^1_{W_1}(x)
Typically f^k_{W_k}(x) = g_k(W_kᵀ x), with g_k an element-wise non-linearity
Thus ŷ = g_d(W_dᵀ ... g_1(W_1ᵀ x))
Stacked representations
{W_k} optimized to minimize a prediction error
54. Shallow architectures for limited data
Keep one latent layer
Without non-linearity: ŷ = xᵀ W1 W2, y ∈ ℝᵏ, W1 ∈ ℝ^{p×d}, W2 ∈ ℝ^{d×k}:
a factored / reduced-rank linear model
Multi-task / multi-output literature
⇒ structured loss (multiple soft-maxes)
Overparametrization sometimes useful: d > k
can be achieved with dropout
[Bzdok... 2015, Mensch... 2018]
55. Simple case: square loss = reduced-rank regression
Ŷ = X W1 W2, Y ∈ ℝ^{n×k}, W1 ∈ ℝ^{p×d}, W2 ∈ ℝ^{d×k}
Ŵ1, Ŵ2 = argmin_{W1,W2} ‖Ŷ − Y_train‖²_Fro
For the squared loss the problem is convex
Full-rank solution¹ (X and Y on the train set):
Ŵ = Σ̂_X⁻¹ Xᵀ Y,  Ŷ = X Ŵ = X Σ̂_X⁻¹ Xᵀ Y
Rank-d solution [Izenman 1975, Rahim... 2017b]:
R̂_d ≝ Yᵀ Ŷ ∈ ℝ^{k×k},  SVD: R̂_d = Û_d ŝ_d V̂_d, Û_d ∈ ℝ^{k×d}
then Ŵ1 = Σ̂_X⁻¹ Xᵀ Y Û_d (full-rank solution), Ŵ2 = Û_dᵀ (rank-d projector²)
¹ No need for pesky SGDs
² The projector captures the variance explained on the multiple outputs
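A hedged sketch of this closed-form recipe with plain numpy (the data is synthetic, and a pseudo-inverse stands in for Σ̂_X⁻¹):

```python
# Reduced-rank regression: full-rank least squares, then an SVD-based
# rank-d projector on the outputs.
import numpy as np

rng = np.random.RandomState(0)
n, p, k, d = 200, 30, 10, 3
X = rng.randn(n, p)
Y = X @ rng.randn(p, d) @ rng.randn(d, k) + 0.1 * rng.randn(n, k)

W_full = np.linalg.pinv(X.T @ X) @ X.T @ Y   # full-rank solution
Y_hat = X @ W_full
U, _, _ = np.linalg.svd(Y.T @ Y_hat)          # R_d = Y^T Y_hat, k x k
U_d = U[:, :d]
W1 = W_full @ U_d                             # p x d
W2 = U_d.T                                    # d x k
Y_rr = X @ W1 @ W2                            # rank-d predictions
```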
56-57. Model stacking
x →(f1) z →(f2) y
Learn f1 separately
Directly supervising z: z = ŷ for a (simple) predictive model
Trick: "cross-fit" during training
obtain ŷ by splitting the training data
(in sklearn: cross_val_predict)
Application: tackling dimensionality [Rahim... 2017a]
Some features are a high-dimensional signal, eg medical images
f1: linear, to reduce the signal features
f2: non-linear (eg trees) on all features
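A hedged sketch of this stacking pattern, assuming synthetic data (the medical-imaging application is not reproduced):

```python
# Stacking with cross-fitting: first-level predictions are obtained
# out-of-fold with cross_val_predict, then fed to a second-level model.
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import RidgeCV
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X_signal = rng.randn(300, 500)    # high-dimensional signal features
X_other = rng.randn(300, 5)       # a few other features
y = X_signal[:, 0] + X_other[:, 0] + 0.1 * rng.randn(300)

# f1: linear model reducing the signal to one out-of-fold prediction
z = cross_val_predict(RidgeCV(), X_signal, y, cv=5)
# f2: non-linear model on the reduced signal plus the other features
f2 = RandomForestRegressor(random_state=0).fit(
    np.column_stack([z, X_other]), y)
```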
58. Model stacking to encode discrete items
Sex   Date Hired   Employee Position       → predict →  Salary
M     09/12/1988   Master Police Officer                 69222.18
F     06/26/2006   Social Worker III                     97392.47
M     07/16/2007   Police Officer III                   104717.28
Difficulty: the number of different positions
what invariants?
[Figure: distribution of employee salary (40 000 to 140 000) across positions such as Crossing Guard, Liquor Store Clerk I, Library Aide, Police Cadet, Administrative Specialist II, Manager I-III]
Target encoding¹ [Micci-Barreca 2001]
position → 𝔼_train[salary | position]
¹ To inject categories in ℝ, before a second level that combines all columns
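A minimal sketch of target encoding with pandas (the column names and values are made up for illustration):

```python
# Target encoding: replace each category by the train-set mean of the
# target, 𝔼_train[salary | position].
import pandas as pd

train = pd.DataFrame({
    "position": ["Officer", "Clerk", "Officer", "Manager"],
    "salary": [69000.0, 42000.0, 71000.0, 98000.0],
})
encoding = train.groupby("position")["salary"].mean()
train["position_encoded"] = train["position"].map(encoding)
# At test time, map unseen categories to the global mean as a fallback.
test = pd.DataFrame({"position": ["Clerk", "Analyst"]})
test["position_encoded"] = test["position"].map(encoding).fillna(
    train["salary"].mean())
```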
59. Summary
Supervision helps select the relevant part of the signal
In limited-sample settings, simple models can create representations:
Simple latent-factor models
Multi-output models
Stacking: fit a first-level model
60. Summary of first section
For generalization: a small family of functions f_w that approximate the signal well
Generalization error of a linear predictor: approximation error + o(p/n_train)
Predictors by composition: ŷ = f2(z), z = f1(x)
x →(f1) z →(f2) y: ideally, f1 makes z invariant to nuisances
Reuse representations with the right invariances: wavelets, fastText, pretrained headless neural nets
Simple supervised models can create representations: stacking, multi-output models, pretext tasks
61. 2 Matrix factorization and its variants
Simple unsupervised representation learning
More unlabeled data than labeled data
Learn representations and transfer them
Here: Focus on simple models for limited n or low SNR settings
Particularly interesting regime: p large and n large.
63-65. Principal Component Analysis
Find the directions of largest variance
Computation
X ∈ ℝ^{n×p}, Σ_X = XᵀX ∈ ℝ^{p×p}
PCA projector P_PCA ∈ ℝ^{p×k}: from SVD_k(X) or EVD_k(Σ_X)
Reduced X: X P_PCA ∈ ℝ^{n×k}
Model: low-rank Gaussian latent factors
X ≈ U V + E, E ∼ N(0, I_p), U ∈ ℝ^{n×k}, V ∈ ℝ^{k×p}
Û, V̂ = argmin_{U,V} ‖X − U V‖²_Fro
Rotationally invariant: U O, Oᵀ V is also a solution for any O s.t. OᵀO = I
In a learning pipeline
Useful for dimensionality reduction (eg when p is large)
Eases statistics and computations
Generalization error of PCA + OLS within a factor of 4 of ridge [Dhillon... 2013]
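A minimal sketch of PCA as a reduction step in a learning pipeline (PCA + OLS, in the spirit of the comparison to ridge; the data is synthetic):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.randn(100, 300)                   # n < p
y = X[:, :5].sum(axis=1) + rng.randn(100)

# Reduce to k = 20 directions of largest variance, then fit OLS.
model = make_pipeline(PCA(n_components=20), LinearRegression())
model.fit(X, y)
```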
66-68. Beyond variance: Independent Component Analysis
Separate out signals U observed mixed¹
[Figure: true source signals U; observations (mixed signals); ICA-recovered signals]
Model: X = U V, V ∈ ℝ^{p×p}, VᵀV = I_p
If U is Gaussian, the model is not identifiable
Seek low mutual information across {u_j}
⇒ maximally non-Gaussian marginals [Cardoso 2003]
Computation: FastICA [Hyvärinen and Oja 2000]
Power iterations on V; each time:
- apply a smooth increasing non-linearity on {u_j}
- decorrelate
Preprocessing: whiten the data, eg with PCA
¹ Classic ICA has no noise model: it does not do dimension reduction
69. ICA to learn representations
Across patches of natural images: Gabor-like filters
Similar to wavelets and to the first layer of convnets
[Hyvärinen and Oja 2000]
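A minimal sketch of blind source separation with FastICA, akin to the slide's illustration (the signals are simulated):

```python
# FastICA recovers independent sources from linear mixtures, up to
# permutation and scale.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.RandomState(0)
t = np.linspace(0, 8, 2000)
sources = np.column_stack([np.sin(2 * t),             # smooth signal
                           np.sign(np.sin(3 * t))])   # square signal
mixing = rng.randn(2, 2)
X = sources @ mixing.T                                # observed mixtures

ica = FastICA(n_components=2, random_state=0)
recovered = ica.fit_transform(X)   # estimates of the sources
```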
70-71. Dictionary learning
Find vectors V that represent the signal well with sparse combinations U
Model: X = U V s.t. U is sparse
U ∈ ℝ^{n×k}, V ∈ ℝ^{k×p}
k can be > p (overcomplete dictionary)
Estimation: Û, V̂ = argmin_{U,V s.t. ‖v_i‖²₂ ≤ 1} ‖X − U V‖²_Fro + λ‖U‖₁
Combining a squared loss and an ℓ1 penalty creates sparsity
The constraint on ‖v_i‖²₂ is required to avoid cancelling out the penalty with V → ∞ and U → 0
More generally: Û, V̂ = argmin_{U, V∈C} ‖X − U V‖²_Fro + λΩ(U)
The constraint set and the penalty can be varied¹
Typically ℓ2, ℓ1, and positivity² on U or V
¹ Fast when C and Ω lead to simple projections and penalized regressions.
² Recovers a form of NMF (non-negative matrix factorization)
72. Sparse dictionary learning to learn representations
Across patches of natural images: also learns Gabor-like filters¹
Good for sparse models, eg for denoising
¹ As do ICA, K-Means, etc. on image patches
[Mairal... 2014]
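A minimal sketch of dictionary learning on natural image patches, which typically yields Gabor-like atoms (a sample image from scikit-learn stands in for the slides' data):

```python
import numpy as np
from sklearn.datasets import load_sample_image
from sklearn.feature_extraction.image import extract_patches_2d
from sklearn.decomposition import MiniBatchDictionaryLearning

img = load_sample_image("china.jpg").mean(axis=2)      # grayscale image
patches = extract_patches_2d(img, (8, 8), max_patches=5000,
                             random_state=0).reshape(5000, -1)
patches -= patches.mean(axis=1, keepdims=True)         # center each patch

dico = MiniBatchDictionaryLearning(n_components=64, alpha=1.0,
                                   random_state=0)
V = dico.fit(patches).components_                      # the dictionary
```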
73. Large n large p: brain imaging
Brain activity at rest: 1000 subjects with ∼ 100-10 000 samples each
Images of dimensionality > 100 000
A dense matrix, large both ways
[Diagram: X (time × voxels) ≈ U · V + E]
74. Large n large p: recommender systems
Product ratings: millions of entries
Hundreds of thousands of products and users
A large sparse matrix
[Diagram: X (users × products) ≈ U · V + E]
75. Online estimation: stochastic optimization
min_w ∑_i l(x_i, w); with many samples: min_w 𝔼[l(x, w)]
Gradient descent: w_{t+1} ← w_t − α_t ∇_w l
Stochastic gradient descent: w_{t+1} ← w_t − α_t ∇̂_w l,
using a cheap estimate ∇̂_w l of 𝔼[∇_w l] (e.g. by subsampling)
α_t must decrease "suitably" with t: those pesky learning rates
76-79. Online estimation for matrix factorization
argmin_{U,V} ‖X − U V‖²_Fro + λΩ(U)
Large matrices = terabytes of data
[Diagram: alternating minimization on the data matrix: data access, code computation, dictionary update]
Rewrite as an expectation:
argmin_V ∑_i min_u ‖X_i − V u‖²₂ + λΩ(u) = argmin_V 𝔼[f(V)]
⇒ optimize on approximations (sub-samples)
Online matrix factorization streams columns: at each t, only the samples seen so far are used [Mairal... 2010]
Subsampled & online estimation also subsamples rows [Mensch... 2017]
80-83. Online matrix factorization algorithm [Mairal... 2010]
Stream samples x_t:
1. Compute the code
u_t = argmin_{u∈ℝᵏ} ‖x_t − V_{t−1} u‖²₂ + λΩ(u)
2. Update the surrogate function
g_t(V) = (1/t) ∑_{i=1}^{t} ‖x_i − V u_i‖²₂ = tr(½ VᵀV A_t − Vᵀ B_t)
A_t ≝ (1 − 1/t) A_{t−1} + (1/t) u_t u_tᵀ
B_t ≝ (1 − 1/t) B_{t−1} + (1/t) x_t u_tᵀ
g_t is a surrogate of ∑_x l(x, V): u_i is used, and not u*
A_t and B_t are sufficient statistics of the loss accumulated over the data
3. Minimize the surrogate
V_t = argmin_{V∈C} g_t(V),  ∇g_t = V A_t − B_t
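A hedged sketch of this loop, assuming an ℓ1 code step (sklearn's Lasso stands in for the sparse coding solver; the alpha scaling follows sklearn's 1/(2n) loss convention) and a block coordinate-descent dictionary update:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
p, k, lam = 100, 10, 0.1
V = rng.randn(p, k)                          # dictionary (p x k here)
A, B = np.zeros((k, k)), np.zeros((p, k))    # sufficient statistics

for t, x in enumerate(rng.randn(500, p), start=1):
    # 1. code: u_t = argmin_u ||x - V u||^2 / 2 + lam ||u||_1
    u = Lasso(alpha=lam / p, fit_intercept=False).fit(V, x).coef_
    # 2. update the surrogate's sufficient statistics
    A = (1 - 1 / t) * A + np.outer(u, u) / t
    B = (1 - 1 / t) * B + np.outer(x, u) / t
    # 3. minimize the surrogate: coordinate descent on columns of V,
    #    with projection on the constraint ||v_j|| <= 1
    for j in range(k):
        if A[j, j] > 1e-12:
            V[:, j] += (B[:, j] - V @ A[:, j]) / A[j, j]
            V[:, j] /= max(1.0, np.linalg.norm(V[:, j]))
```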
84. Stochastic Majorization-Minimization [Mairal 2013]
V* = argmin_{V∈C} ∑_x l(x, V), where l(x, V) = min_u f(x, V, u)
Algorithm: g_t(V) is a majorant of ∑_x l(x, V): u_i is used, and not u*
⇒ a Majorization-Minimization scheme¹
Surrogate computation in SMM: full minimization, 2nd-order information, no learning rate
¹ SOMF uses an approximate majorant and minimization [Mensch... 2017]
85. Experimental convergence: large images
[Figure: test objective value vs time on three problems: ADHD, sparse dictionary, 2 GB; Aviris, NMF and dictionary learning, 103 GB; HCP, sparse dictionary, 2 TB. Compared: OMF, SOMF with subsampling ratios r = 1 to 24, and best step-size SGD]
SOMF = Subsampled Online Matrix Factorization
87. Summary
Versatile matrix-factorization formulation¹:
argmin_{U∈ℝ^{n×k}, V∈C} ‖X − U V‖²_Fro + λΩ(U)
Estimation
Stochastic majorization-minimization²
⇒ an online alternated optimization
Example use of learned representations
Biomarkers of autism on brain images: p ∼ 100 000, n ∼ 1 000 [Abraham... 2017]
¹ A 1-layer linear autoencoder
² The common-case algorithm is readily usable in scikit-learn: MiniBatchDictionaryLearning
89-90. Gamma-Poisson for factorizing counts [Canny 2004]
When X is a matrix of counts
- Recommender systems [Gopalan... 2014]
- Database string entries [Cerda and Varoquaux 2019]
⇒ Poisson loss, instead of squared loss:
p(x_j | u, V) = Poisson((u V)_j) = (1/x_j!) (u V)_j^{x_j} e^{−(u V)_j}
u are loadings, modeled as random with a Gamma prior³:
p(u_i) = u_i^{α_i − 1} e^{−u_i/β_i} / (β_i^{α_i} Γ(α_i))
Maximum a posteriori estimation:
Û, V̂ = argmin_{U,V} − ∑_j log p(x_j | u, V) − ∑_i log p(u_i)
³ Because it is the conjugate prior of the Poisson; it imposes soft sparsity and lifts the rotational invariance
91-92. Gamma-Poisson estimation
Full log-likelihood expression:
log L = ∑_{j=1}^{p} [x_j log((u V)_j) − (u V)_j − log(x_j!)]
      + ∑_{i=1}^{k} [(α_i − 1) log(u_i) − u_i/β_i − α_i log β_i − log Γ(α_i)]
Gradients:
∂ log L / ∂V_{ij} = (x_j / (u V)_j) u_i − u_i
∂ log L / ∂u_i = ∑_{j=1}^{p} [(x_j / (u V)_j) V_{ij} − V_{ij}] + (α_i − 1)/u_i − 1/β_i
Equivalent to an NMF formulation: multiplicative updates¹
V_{ij} ← V_{ij} [∑_{ℓ=1}^{n} (x_{ℓj} / (U V)_{ℓj}) u_{ℓi}] [∑_{ℓ=1}^{n} u_{ℓi}]⁻¹
u_{ℓi} ← u_{ℓi} [∑_{j=1}^{p} (x_{ℓj} / (U V)_{ℓj}) V_{ij} + (α_i − 1)/u_{ℓi}] [∑_{j=1}^{p} V_{ij} + β_i⁻¹]⁻¹
¹ Efficient implementation with sparse matrices: the summations can be done only on the non-zero entries of X.
93. Adapting the majorization-minimization algorithm
while ‖V(t) − V(t−1)‖_F > η do
    draw x_t from the training set
    while ‖u_t − u_t^old‖₂ > ε do
        u_t ← u_t ⊙ [(x_t ⊘ (u_t V(t))) V(t)ᵀ + (a − 1) ⊘ u_t] ⊘ [1 V(t)ᵀ + b⁻¹]
    Ã(t) ← V(t) ⊙ (u_tᵀ (x_t ⊘ (u_t V(t))))
    B̃(t) ← u_tᵀ 1
    A(t) ← ρ A(t−1) + Ã(t)
    B(t) ← ρ B(t−1) + B̃(t)
    V(t) ← A(t) ⊘ B(t)
    t ← t + 1
[Lefevre... 2011, Cerda and Varoquaux 2019]
94. Application: sub-string representation
Problem: representing non-normalized categories
Drug Name
alcohol
ethyl alcohol
isopropyl alcohol
polyvinyl alcohol
isopropyl alcohol swab
62% ethyl alcohol
alcohol 68%
alcohol denat
benzyl alcohol
dehydrated alcohol
Employee Position Title
Police Aide
Master Police Officer
Mechanic Technician II
Police Officer III
Senior Architect
Senior Engineer Technician
Social Worker III
[Cerda and Varoquaux 2019]
95-96. Application: sub-string representation
Gamma-Poisson factorization on sub-string counts (3-grams: "pol", "oli", "lic", "ice", ...)
Models strings as a linear combination of substrings
[Figure: binary matrix of 3-gram occurrences (er_, cer, fic, off, _of, ce_, ice, lic, pol) for the entries police, officer, pol off, polis, policeman, policier, factorized into a documents × topics matrix and a topics × substrings matrix]
What substrings are in a latent category
What latent categories are in an entry
[Cerda and Varoquaux 2019]
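A hedged sketch of the idea with plain scikit-learn: character 3-gram counts factorized with NMF as a stand-in for the Gamma-Poisson model of Cerda and Varoquaux (the entries are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import NMF

entries = ["police officer", "master police officer", "social worker",
           "police aide", "senior social worker"]
counts = CountVectorizer(analyzer="char_wb",
                         ngram_range=(3, 3)).fit_transform(entries)
topics = NMF(n_components=2, random_state=0).fit_transform(counts)
# Rows of `topics`: activations of latent categories for each entry.
```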
97. Application: sub-string representation
Representations that extract latent categories
[Figure: activations of latent categories (library, operator, specialist, warehouse, manager, community, rescue, officer) for position titles such as Legislative Analyst II, Legislative Attorney, Equipment Operator I, Transit Coordinator, Bus Operator, Senior Architect, Senior Engineer Technician, Financial Programs Manager, Capital Projects Manager, Mechanic Technician II, Master Police Officer, Police Sergeant]
[Cerda and Varoquaux 2019]
98. Application: sub-string representation
Inferring plausible feature names
[Figure: the same position titles, with inferred feature names such as "accountant, assistant, library"; "operator, equipment"; "administration, specialist"; "craftsworker, warehouse"; "crossing, program, manager"; "technician, mechanic, community"; "firefighter, rescuer, rescue"; "correctional, correction, officer"]
[Cerda and Varoquaux 2019]
99. Natural language processing: topic-modeling history
Topic modeling: embedding documents¹
[Figure: a documents × terms count matrix (terms: the, Python, performance, profiling, module, is, code, can, a) factorized into a documents × topics matrix and a topics × terms matrix: what terms are in a topic; what documents are in a topic]
LSA (Latent Semantic Analysis) [Landauer... 1998]
SVD² of the terms × documents matrix
¹ Typically for information-retrieval purposes, aka search engines
² Later: refinements with more complex losses: LDA (Latent Dirichlet Allocation) [Blei... 2003] and Gamma-Poisson [Canny 2004].
100. Word embeddings
Distributional semantics: the meaning of words
"You shall know a word by the company it keeps"
Firth, 1957
Example: "A glass of red ___, please"
Could be wine, maybe juice?
wine and juice have related meanings
Factorization of the word × context matrix
What choice of context? What loss?
word2vec [Mikolov... 2013a], glove [Pennington... 2014]
101. Word2vec: skip-gram sampling [Mikolov... 2013b]
{û_w, v̂_c} = argmax_{u_w, v_c} ∑_{pairs of words (w, c) in the same window¹} log softmax(V u_w)_c
softmax(z)_i = exp(z_i) / ∑_j exp(z_j)
u_w ∈ ℝᵏ: embedding of word w
V ∈ ℝ^{card(voc)×k}: [v_c, c ∈ voc], all context words
Big sum on contexts ⇒ solved by SGD²
[Figure: word embeddings U and context embeddings V for salad, meat, juice, wine, glass, green, red]
Other view: language models, prediction of words
¹ Efficient: never build the matrix, stream directly from text.
² These windows are called skip-grams
102. Word2vec: negative sampling [Mikolov... 2013a]
Costly loss: log softmax(z)_i = log(exp(z_i) / ∑_j exp(z_j))
Approximation¹: the huge sum in the softmax (over the whole vocabulary) is downsampled, drawing the positive example (numerator) and a few negative examples (denominator)
Negative-sampling loss² [Goldberg and Levy 2014]:
log σ(v_c u_wᵀ) + ∑_{n_neg words w′ not in window} log σ(−v_c u_{w′}ᵀ)
σ: the sigmoid (log σ(z) = −log(1 + e^{−z}))
¹ Related to noise-contrastive estimation, which avoids computing costly normalizations in likelihoods [Gutmann and Hyvärinen 2010]
² Related to a matrix factorization of the mutual information of word co-occurrence [Levy and Goldberg 2014]
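A minimal sketch of this objective for one (word, context) pair, with random embeddings and uniform negative sampling (word2vec's unigram^0.75 sampling is omitted):

```python
import numpy as np

rng = np.random.RandomState(0)
k, vocab = 50, 1000
U = 0.1 * rng.randn(vocab, k)   # word embeddings u_w
V = 0.1 * rng.randn(vocab, k)   # context embeddings v_c

def log_sigmoid(z):
    return -np.logaddexp(0.0, -z)   # numerically stable log sigma(z)

def neg_sampling_loss(w, c, n_neg=5):
    negatives = rng.randint(vocab, size=n_neg)
    loss = -log_sigmoid(V[c] @ U[w])                  # positive pair
    loss -= log_sigmoid(-(V[negatives] @ U[w])).sum() # sampled negatives
    return loss

print(neg_sampling_loss(w=3, c=7))
```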
103. Beyond natural language: metric learning
Triplet loss
For an "anchor" a, b close to a, c far from a:
log σ(v_aᵀ u_b) − log σ(v_aᵀ u_c)
Quadruplet loss [Chen... 2017]
For a and b close by, c and d far apart:
log σ(v_aᵀ u_b) − log σ(v_cᵀ u_d)
In practice: draw¹ (a, b, c) or (a, b, c, d) randomly
Metric learning [Bellet... 2013]: learning embeddings with weak supervision
¹ Many strategies, eg "hard negative mining"; requires a good test set and metric to tune, as with SGD hyperparameters.
104-105. Embedding entities in knowledge graphs
Structured (graph) representation of human knowledge, eg dbpedia, Yago
Learning embeddings of entities {e_i} and relations {r_j}:
e_a ∼ e_b + r_c, a model of the relation
Then triplet / quadruplet loss
Reuse existing: conceptnet.io
[Bordes... 2013, Wang... 2017]
106-107. The value of simple models
Risk of invisible overfit during the search for hyper-parameters and models
Complex models call for a clear utility measure with low measurement error, and many reliable labels
Matrix-factorization models¹ have 2 hyper-parameters: the dimensionality k and the regularization λ
Set them to optimize representations for supervised problems
¹ Using majorization-minimization approaches to avoid learning rates
108. Summary
Discrete entities lead to counting occurrences
⇒ Poisson and logistic losses (ugly logs in the equations)
Word & entity embeddings
Factorization of co-occurrences in a notion of context
More generally: metric learning
Limited-data settings:
Avoid negative-sampling models (many hyper-parameters)
Try to reuse representations (fastText, conceptnet.io)
109. 3 Fisher kernels
What if the objects studied do not naturally
live in a vector space?
eg graphs of varying number of nodes
111. Learning with Kernels [Scholkopf and Smola 2001]
Kernels
A kernel K is a function X × X → ℝ₊, positive and symmetric
It captures the similarity between observations
Building functions with kernels
On the training data: K_i ≝ K(x_i, ·), i ∈ train
Prediction function²: f(x) = ∑_{i∈train} w_i K_i(x)
² Benefits of this formulation: i) a non-linear predictor trained with a linear problem; ii) expressiveness that increases with the amount of training data
112. Feature maps [Scholkopf and Smola 2001]
Drawbacks of kernels
Compute cost O(n²); representations not explicit: f(x) = ∑_{i∈train} w_i K_i(x)
As K is symmetric positive¹, there exists φ : X → ℝᵈ such that ∀ x, x′: K(x, x′) = φ(x)ᵀφ(x′)
φ is a "feature map"
f(x) is a linear function of φ(x), but d can be ∞
⇒ approximate φ
¹ Think of it as a generalization of the Cholesky decomposition
113-114. Nyström approximate feature maps [Drineas and Mahoney 2005]
On a random subset {x_1, ..., x_m} of the training data:
G ≝ [K(x_i, x_j)]_{i,j=1..m} ∈ ℝ^{m×m}
Let L ∈ ℝ^{k×m} give a rank-k approximation LᵀL ≈ G⁻¹
Feature map¹: φ_Nystrom(x) = [K(x_1, x), ..., K(x_m, x)] Lᵀ
sklearn.kernel_approximation.Nystroem
See also: random features [Rahimi and Recht 2008]
sklearn.kernel_approximation.RBFSampler
¹ Exercise: check that φ_Nystrom(x)ᵀφ_Nystrom(x′) ≈ φ(x)ᵀφ(x′) for x, x′ in our subset.
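A minimal sketch of the Nyström map in a pipeline, as a scalable stand-in for a kernel method (the data is synthetic):

```python
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=500, random_state=0)
# Explicit approximate feature map + linear model ~ kernel classifier.
model = make_pipeline(
    Nystroem(kernel="rbf", n_components=100, random_state=0),
    LogisticRegression(max_iter=1000),
)
model.fit(X, y)
```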
116. Parametric generative model
Consider a model of x parametrized by w ∈ ℝᵏ:
p(x) = P_w(x), log-likelihood L_P ≝ log P_w
Maximum-likelihood estimate: ŵ = argmax_w L_P(x)
Kullback-Leibler divergence
A natural distance¹ to another distribution:
KL(P|Q) = 𝔼_P[L_P − L_Q]
Goal: benefit from our model to build a representation
All models are wrong but some are useful
¹ Not a distance, technically, as it is not symmetric.
117. Local behavior of parametric models
Fisher information matrix
Expectation of the (negative) Hessian of L at w:
I(w) ≝ −𝔼[∂²L(x|w)/∂w²] ∈ ℝ^{k×k}
Order-2 approximation of the KL divergence:
KL(P_w|P_{w+ε}) ≈ εᵀ I_w ε
I_w also scales the covariance of the estimation error on maximum-likelihood estimates of w (Cramér-Rao bounds)
118-120. Fisher-Rao manifold (information geometry)
Order-2 approximation: KL(P_w|P_{w+ε}) ≈ εᵀ I_w ε
[Figure: KL balls close to w1 and close to w2: not constant across the family of distributions {P_w, w ∈ ℝᵏ}]
{P_w, w ∈ ℝᵏ} forms a Riemannian manifold, with I as the metric tensor [Rao 1945]
121. Riemannian manifolds
Continuous geometry on curved spaces (eg the Earth)
Locally, but not globally, Euclidean
A Riemannian manifold M is a differentiable space endowed with a metric d that is locally equivalent to a Euclidean vector space:
[Figure: tangent space T_M M at M, with the Log_M and Exp_M maps]
for M and M′ ∈ M, if d(M, M′) → 0, M and M′ can be mapped to elements m, m′ of a vector space such that d(M, M′) ∼ ‖m − m′‖
Global structure: geodesic distance
122-123. Fisher Kernel [Jaakkola and Haussler 1999]
A kernel locally equivalent to the KL divergence
Build upon the Fisher matrix ⇒ create a feature map: a vector space where the Euclidean distance ≈ KL
In practice:
1. Fit the model P_w on the train data:
ŵ ← argmax_w ∑_{i∈train} L(x_i, w)
2. Compute the gradient in w of the likelihood at ŵ:
z_Fisher(x) = ∇_w L(x, ŵ) ∈ ℝᵏ
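A minimal sketch of this recipe for a univariate Gaussian model (a toy likelihood chosen for illustration; the gradient is derived by hand):

```python
# Fisher feature map: fit (mu, sigma^2) by maximum likelihood, then
# represent each point by the gradient of its log-likelihood at the fit.
import numpy as np

rng = np.random.RandomState(0)
x_train = rng.randn(1000) * 2 + 1

mu, sigma2 = x_train.mean(), x_train.var()   # MLE for a Gaussian

def z_fisher(x):
    # Gradient of log N(x; mu, sigma2) wrt (mu, sigma2)
    d_mu = (x - mu) / sigma2
    d_sigma2 = (x - mu) ** 2 / (2 * sigma2 ** 2) - 1 / (2 * sigma2)
    return np.stack([d_mu, d_sigma2], axis=-1)

features = z_fisher(rng.randn(5))   # 5 points -> 5 x 2 feature vectors
```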
124. Fisher Kernel applications
Text: TF-IDF [Elkan 2005]
Multinomial model of word appearance
Genomics [Jaakkola and Haussler 1999]
Hidden Markov model of DNA sequences
(variable-length sequences ⇒ encoding difficult)
Tree-structured data [Nicotra... 2004]
A transition model on the tree
Brain connectivity [Varoquaux... 2010]
Multivariate Gaussian model (covariances)
125. Summary
Kernels build prediction functions on similarities
Feature maps / kernel approximations capture the corresponding representation
Fisher kernels go from a likelihood to a vector space
Very useful for non-numeric objects
126. Limited-data settings
Reminder: your validation measure is intrinsically unreliable (sampling noise)
Get more data
For instance acquiring data on a related task, to learn representations
Use simple models
Do not spend too much time tweaking
[Figure: distribution of errors on the measured accuracy under a binomial law, vs the number of available samples: n = 1000: ±2%; 300: ±4%; 200: ±5%; 100: ±7%; 30: −15%/+12%]
[Varoquaux 2018]
127. References I
A. Abraham, M. P. Milham, A. Di Martino, R. C. Craddock, D. Samaras,
B. Thirion, and G. Varoquaux. Deriving reproducible biomarkers
from multi-site resting-state data: an autism-based example.
NeuroImage, 147:736, 2017.
A. Achille and S. Soatto. Emergence of invariance and
disentanglement in deep representations. The Journal of Machine
Learning Research, 19(1):1947–1980, 2018.
A. Bellet, A. Habrard, and M. Sebban. A survey on metric learning for
feature vectors and structured data. arXiv preprint
arXiv:1306.6709, 2013.
D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.
128. References II
A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko.
Translating embeddings for modeling multi-relational data. In
Advances in Neural Information Processing Systems, pages
2787–2795, 2013.
D. Bzdok, M. Eickenberg, O. Grisel, B. Thirion, and G. Varoquaux.
Semi-supervised factored logistic regression for
high-dimensional neuroimaging data. In Advances in Neural
Information Processing Systems, page 3348, 2015.
J. Canny. GaP: A factor model for discrete data. In SIGIR, page 122, 2004.
J.-F. Cardoso. Dependence, correlation and gaussianity in
independent component analysis. Journal of Machine Learning
Research, 4:1177, 2003.
P. Cerda and G. Varoquaux. Encoding high-cardinality string
categorical variables. arXiv:1907.01860, 2019.
129. References III
W. Chen, X. Chen, J. Zhang, and K. Huang. Beyond triplet loss: a deep
quadruplet network for person re-identification. In Proceedings
of the IEEE Conference on Computer Vision and Pattern
Recognition, page 403, 2017.
P. S. Dhillon, D. P. Foster, S. M. Kakade, and L. H. Ungar. A risk
comparison of ordinary least squares vs ridge regression. The
Journal of Machine Learning Research, 14:1505, 2013.
P. Drineas and M. W. Mahoney. On the Nyström method for approximating a gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6:2153, 2005.
C. Elkan. Deriving tf-idf as a fisher kernel. In International
Symposium on String Processing and Information Retrieval, page
295, 2005.
130. References IV
Y. Goldberg and O. Levy. word2vec explained: Deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv:1402.3722, 2014.
P. K. Gopalan, L. Charlin, and D. Blei. Content-based
recommendations with poisson factorization. In Advances in
Neural Information Processing Systems, page 3176, 2014.
M. Gutmann and A. Hyvärinen. Noise-contrastive estimation: A new
estimation principle for unnormalized statistical models. In
Proceedings of the International Conference on Artificial
Intelligence and Statistics, page 297, 2010.
D. Hsu, S. Kakade, and T. Zhang. Random design analysis of ridge
regression. Foundations of Computational Mathematics, 14, 2014.
A. Hyvärinen and E. Oja. Independent component analysis:
algorithms and applications. Neural networks, 13(4):411, 2000.
131. References V
A. J. Izenman. Reduced-rank regression for the multivariate linear
model. Journal of multivariate analysis, 5:248, 1975.
T. Jaakkola and D. Haussler. Exploiting generative models in
discriminative classifiers. In Advances in neural information
processing systems, pages 487–493, 1999.
T. K. Landauer, P. W. Foltz, and D. Laham. An introduction to latent
semantic analysis. Discourse processes, 25:259, 1998.
A. Lefevre, F. Bach, and C. Févotte. Online algorithms for
nonnegative matrix factorization with the itakura-saito
divergence. In Applications of Signal Processing to Audio and
Acoustics (WASPAA), page 313. IEEE, 2011.
O. Levy and Y. Goldberg. Neural word embedding as implicit matrix
factorization. In Advances in neural information processing
systems, page 2177, 2014.
132. References VI
J. Mairal. Stochastic majorization-minimization algorithms for
large-scale optimization. In Advances in Neural Information
Processing Systems, 2013.
J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix
factorization and sparse coding. Journal of Machine Learning
Research, 11:19–60, 2010.
J. Mairal, F. Bach, and J. Ponce. Sparse modeling for image and
vision processing. Foundations and Trends® in Computer
Graphics and Vision, 8(2-3):85–283, 2014.
S. Mallat. Understanding deep convolutional networks.
Philosophical Transactions of the Royal Society A, 374:20150203,
2016.
A. Mensch, J. Mairal, B. Thirion, and G. Varoquaux. Stochastic
subsampling for factorizing huge matrices. IEEE Transactions on
Signal Processing, 66:113, 2017.
133. References VII
A. Mensch, J. Mairal, B. Thirion, and G. Varoquaux. Extracting
universal representations of cognition across brain-imaging
studies. arXiv preprint arXiv:1809.06035, 2018.
D. Micci-Barreca. A preprocessing scheme for high-cardinality
categorical attributes in classification and prediction problems.
ACM SIGKDD Explorations Newsletter, 3:27, 2001.
T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of
word representations in vector space. In ICLR Workshop Papers.
2013a.
T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean.
Distributed representations of words and phrases and their
compositionality. In Advances in neural information processing
systems, page 3111, 2013b.
134. References VIII
G. F. Montufar, R. Pascanu, K. Cho, and Y. Bengio. On the number of
linear regions of deep neural networks. In Advances in neural
information processing systems, page 2924, 2014.
L. Nicotra, A. Micheli, and A. Starita. Fisher kernel for tree structured
data. In 2004 IEEE International Joint Conference on Neural
Networks (IEEE Cat. No. 04CH37541), volume 3, pages 1917–1922.
IEEE, 2004.
E. Oyallon, E. Belilovsky, and S. Zagoruyko. Scaling the scattering
transform: Deep hybrid networks. In Proceedings of the IEEE
international conference on computer vision, page 5618, 2017.
J. Pennington, R. Socher, and C. Manning. Glove: Global vectors for
word representation. In Proceedings of the 2014 conference on
empirical methods in natural language processing (EMNLP), page
1532, 2014.
135. References IX
M. Rahim, B. Thirion, D. Bzdok, I. Buvat, and G. Varoquaux. Joint
prediction of multiple scores captures better individual traits
from brain images. Neuroimage, 158:145–154, 2017a.
M. Rahim, B. Thirion, and G. Varoquaux. Multi-output predictions
from neuroimaging: assessing reduced-rank linear models. In
2017 International Workshop on Pattern Recognition in
Neuroimaging (PRNI), pages 1–4. IEEE, 2017b.
A. Rahimi and B. Recht. Random features for large-scale kernel
machines. In Advances in neural information processing systems,
pages 1177–1184, 2008.
C. Rao. Information and accuracy attainable in the estimation of
statistical parameters. Bull Calcutta. Math. Soc., 37:81, 1945.
136. References X
S. Rosset and R. J. Tibshirani. From fixed-x to random-x regression:
Bias-variance decompositions, covariance penalties, and
prediction error estimation. Journal of the American Statistical
Association, pages 1–14, 2018.
B. Scholkopf and A. J. Smola. Learning with kernels: support vector
machines, regularization, optimization, and beyond. MIT press,
2001.
G. Varoquaux. Cross-validation failure: small sample sizes lead to
large error bars. Neuroimage, 180:68–77, 2018.
G. Varoquaux, F. Baronnet, A. Kleinschmidt, P. Fillard, and B. Thirion.
Detection of brain functional-connectivity difference in
post-stroke patients using group-level covariance modeling. In
International Conference on Medical Image Computing and
Computer-Assisted Intervention, pages 200–208. Springer, 2010.
137. References XI
Q. Wang, Z. Mao, B. Wang, and L. Guo. Knowledge graph embedding:
A survey of approaches and applications. IEEE Transactions on
Knowledge and Data Engineering, 29(12):2724–2743, 2017.