Deep Learning Practice and Theory

Deep Learning
Filling the gap between
practice and theory
Preferred Networks
Daisuke Okanohara
hillbig@preferred.jp
Aug. 3rd 2017
Summer School of Correspondence and Fusion of AI and Brain Science

Background:
Unreasonable success of deep learning
l DL has succeeded in solving many complex tasks
̶ Image recognition, speech recognition, natural language processing, robot
control, computational chemistry, etc.
l But we don’t understand why DL works so well
̶ Its success far outpaces our understanding

Background
The DL research process has become close to the scientific process
l Try first, examine next
̶ First, we obtain an unexpectedly good result experimentally
̶ We then find a theory that explains why it works so well
l This process is different from previous ML research
̶ Careful design of new algorithms sometimes (or often) doesn’t work
̶ Many results contradict our intuition

Outline
Three main unsolved problems in deep learning
l Why can DL learn ?
l Why can DL recognize and generate real world data ?
l Why can DL keep and manipulate complex information ?

Optimization in training DL
l Learn a NN model f(x; θ) by minimizing a training error L(θ)
L(θ) = Σ_i l(f(x_i; θ), y_i)
where l(f(x_i; θ), y_i) is a loss function and θ is the set of parameters
l E.g. two-layer feed-forward NN
f(x; θ) = a(W_2 a(W_1 x))
where a is an element-wise activation function such as
a(z) = max(0, z)
l(f(x_i; θ), y_i) = ||f(x_i; θ) - y_i||^2 (L2 loss)
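The two-layer model and the summed L2 loss can be sketched in NumPy as follows. The weight values and the two training pairs are made up for illustration; only the functional form comes from the slide.

```python
import numpy as np

def relu(z):
    # Element-wise activation a(z) = max(0, z)
    return np.maximum(0.0, z)

def two_layer_nn(x, W1, W2):
    # f(x; theta) = a(W2 a(W1 x)), with theta = (W1, W2)
    return relu(W2 @ relu(W1 @ x))

def training_error(X, Y, W1, W2):
    # L(theta) = sum_i ||f(x_i; theta) - y_i||^2
    return sum(np.sum((two_layer_nn(x, W1, W2) - y) ** 2) for x, y in zip(X, Y))

# Illustrative (hypothetical) parameters and data
W1 = np.array([[1.0, -1.0], [0.5, 0.5]])
W2 = np.array([[1.0, 1.0]])
X = np.array([[2.0, 1.0], [0.0, 3.0]])
Y = np.array([[2.0], [1.0]])
error = training_error(X, Y, W1, W2)
```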

Gradient descent
Stochastic Gradient Descent
l Gradient descent
̶ Compute the gradient g(θ) of L(θ) with respect to θ, then update θ
using g(θ) as
θ_{t+1} := θ_t - α_t g(θ_t)
where α_t > 0 is a learning rate
l Stochastic gradient descent:
̶ Since exact computation of the gradient is expensive, we instead use an
approximate gradient computed on a sampled subset of the data (mini-batch B)
g'(θ_t) = (1/|B|) Σ_{i∈B} ∇_θ l(f(x_i; θ_t), y_i)
[Figure: contour of L(θ) over (θ1, θ2) with an update step -αg]
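A minimal sketch of mini-batch SGD, assuming a linear least-squares model instead of a neural network so that the gradient can be written by hand. The function name, learning rate, batch size and data are illustrative choices, not taken from the slides.

```python
import numpy as np

def sgd_least_squares(X, y, lr=0.05, batch_size=8, steps=500, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        # Sample a mini-batch B of indices
        idx = rng.choice(len(X), size=batch_size, replace=False)
        residual = X[idx] @ theta - y[idx]
        # Mini-batch gradient: (1/|B|) sum_i grad of (x_i . theta - y_i)^2
        grad = 2.0 * X[idx].T @ residual / batch_size
        # Update: theta_{t+1} = theta_t - lr * g'(theta_t)
        theta = theta - lr * grad
    return theta

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta  # noise-free targets
theta_hat = sgd_least_squares(X, y)
```

Because the targets here are noise-free, every mini-batch gradient vanishes at the true parameters, so the noisy updates still settle there.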

Optimization in Deep learning
l L(θ) is highly non-convex and includes many local optima,
plateaus and saddle points
̶ In plateau regions, the gradient becomes almost zero and
convergence becomes significantly slower
̶ At saddle points, only a few directions decrease L(θ) and it is hard to
escape from such points
[Figure: saddle point, plateau, local optimum]

Miracle of deep learning training
l It was believed that we cannot train large NNs using SGD
̶ Impossible to optimize a non-convex problem of over a million dimensions
l However, SGD can find a solution with low training error
̶ When using a large model, it often finds a solution with zero training error
̶ Moreover, the initialization hardly matters
(c.f. K-means, which requires a good initializer)
l More surprisingly, SGD can find a solution with low-test error
̶ Although the model is over-parameterized, it does not over-fit and
achieves generalization
l Practically OK, but we want to know why

Why can DL learn ?
l Why does DL succeed in finding a solution with a low training error?
̶ Although the optimization is a highly non-convex problem
l Why does DL succeed in finding a solution with a low test
error ?
̶ Although the NN is over-parametrized and no effective regularization is used

Loss surface analysis using spherical spin glass model (1/5)
[Choromanska+ 2015]
l Consider a DNN with ReLU σ(x) = max(0, x) whose output Y is
scaled by a normalization factor q
l We can re-express the output Y as a sum over input-output paths,
where A_{i,j} = 1 if the path (i, j) is active
and A_{i,j} = 0 if the path is inactive
̶ ReLU can be considered as a switch; a path is active if
all ReLUs on it are active and is inactive otherwise
[Figure: network from input x_i to output Y; a path is active only if every ReLU on it is active]

l After several assumptions, this function can be re-expressed
as an H-spin spherical spin-glass model
(the weights are constrained to lie on a sphere)
l Now, we can use the analysis of the spherical spin-glass model
̶ We now know the distribution of critical points
̶ k: index (the number of negative eigenvalues of the Hessian)
k=0: local minimum, k>0: saddle point
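The index of a critical point can be computed by counting negative Hessian eigenvalues. The sketch below uses two hand-picked 2x2 Hessians (of x^2 + y^2 and x^2 - y^2 at the origin) as illustrative inputs; the function name and tolerance are assumptions.

```python
import numpy as np

def critical_point_index(hessian, tol=1e-9):
    # Index k = number of negative eigenvalues of the (symmetric) Hessian
    eigenvalues = np.linalg.eigvalsh(hessian)
    return int(np.sum(eigenvalues < -tol))

hessian_minimum = np.array([[2.0, 0.0], [0.0, 2.0]])   # x^2 + y^2 at the origin
hessian_saddle = np.array([[2.0, 0.0], [0.0, -2.0]])   # x^2 - y^2 at the origin
index_minimum = critical_point_index(hessian_minimum)  # no descent direction
index_saddle = critical_point_index(hessian_saddle)    # one descent direction
```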

Distribution of critical points
Above LE_inf, almost all critical points are saddle points with large k
-> few local minima there
In the band [LE_0, LE_inf], many critical points with
small k are found near LE_0
-> local minima are close to the global minimum

Distribution of test losses

Remaining problem
l This analysis relies on several unrealistic assumptions
̶ Such as
“Each activation is independent from inputs”
“Each path‘s input is independent”
l Can we remove these assumptions or show that they
hold in most training cases ?

Depth creates no bad local minima [Lu+ 2017]
l Non-convexity comes from depth and nonlinearity
l Depth alone creates non-convexity
̶ Weight-space symmetry means that there are many distinct
configurations with the same loss value, which results in a non-convex
epigraph
l Consider the following feed-forward linear NN
min_W L(W) = ||W_H W_{H-1} … W_1 X - Y||^2
If X and Y have full row rank, then all local minima of L(W)
are global minima [Theorem 2.3, Lu, Kawaguchi 2017]
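A tiny worked example of how depth alone breaks convexity, assuming a scalar two-layer linear network with a single training pair x = y = 1 (an illustrative choice). Two symmetric weight settings reach zero loss, yet their midpoint has a higher loss, which a convex function would not allow.

```python
def deep_linear_loss(w1, w2, x=1.0, y=1.0):
    # L(W) = (w2 * w1 * x - y)^2 for a scalar two-layer linear network
    return (w2 * w1 * x - y) ** 2

loss_a = deep_linear_loss(1.0, 1.0)     # global minimum
loss_b = deep_linear_loss(-1.0, -1.0)   # symmetric global minimum
loss_mid = deep_linear_loss(0.0, 0.0)   # midpoint of the two minima
```

The midpoint (0, 0) is a saddle point rather than a local minimum, so the example is consistent with the statement that all local minima are global.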

Deep and Wide NN also create no bad local minima
[Nguyen+ 2017]
l If the following conditions hold
̶ (1) The activation function σ is analytic on R and strictly monotonically
increasing
̶ (2) σ is bounded*
̶ (3) The loss function l(a) is twice differentiable, and
l’(a)=0 if a is a global minimum
̶ (4) Training samples are linearly independent,
then every critical point at which the weight matrices have full
column rank is a global minimum
̶ We can satisfy these conditions if we use sigmoid, tanh or softplus for
σ and the squared loss for l
̶ -> Solved in the case of non-linear NNs under some conditions

Why can DL learn ?
l Why does DL succeed in finding a solution with a low training error?
̶ Although the optimization is a highly non-convex problem
l Why does DL succeed in finding a solution with a low test error ?
̶ Although the NN is over-parametrized and no effective regularization is used

NN is over parametrized but achieves generalization
l Although the number of parameters of a DNN is much larger
than the number of samples, the DNN does not overfit and
achieves generalization
l Large models tend to achieve low test error
[Figure: test error (lower is better) vs. number of parameters]
̶ Conventional ML models: when the number of parameters is larger
than the number of training samples, overfitting is observed
̶ DNN: no overfitting is observed; moreover, the test error decreases as the
number of parameters is increased

Random Labeling experiment [Zhang+ 17]
l Conventionally, model capacity should be restricted to achieve generalization
̶ C.f. Rademacher complexity, VC-dimension, uniform stability
l Conduct an experiment on a copy of the training data where the
true labels are replaced by random labels
-> The NN model easily fits even random labels
l Compare the result with that obtained using regularization techniques
-> No significant difference
l Therefore the NN model has enough capacity to fit
random labels, yet it generalizes well w/o regularization
̶ For random labels, the NN memorizes the samples, but for true labels the NN
learns patterns that generalize [Arpit+ 17]
l … WHY?
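The "fits even random labels" observation can be imitated on a toy scale. This sketch is not the experiment of [Zhang+ 17]: it swaps in a random, untrained ReLU hidden layer with a least-squares output layer instead of an SGD-trained deep network, and all sizes are arbitrary. It only shows that a model with far more hidden units than samples can interpolate labels that carry no information.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_inputs, n_hidden = 20, 5, 200  # hidden units >> samples

X = rng.normal(size=(n_samples, n_inputs))
random_labels = rng.integers(0, 2, size=n_samples).astype(float)

# Random (untrained) first layer followed by ReLU
W1 = rng.normal(size=(n_hidden, n_inputs))
H = np.maximum(0.0, X @ W1.T)  # shape (n_samples, n_hidden)

# Fit only the output layer by least squares (assumption: replaces SGD)
w2, *_ = np.linalg.lstsq(H, random_labels, rcond=None)
train_error = float(np.mean((H @ w2 - random_labels) ** 2))
```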

SGD plays a significant role for generalization
l SGD performs approximate Bayesian inference [Mandt+ 17]
̶ Bayesian inference provides a sample θ ~ P(θ|D)
l SGD’s noise removes information about the input that is unnecessary for
estimating the output [Shwartz-Ziv+ 17]
̶ During training, the mutual information between the input and the network
decreases while that between the network and the output is kept
l Sharpness and norms of weights also relate to generalization
̶ Flat minima generalize well, but flatness
depends on the scale of the weights
̶ If we find a flat minimum with a small norm of weights, then it achieves
generalization [Neyshabur+ 17]
[Figure: flat minimum vs. sharp minimum]
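One way to see why flatness has to be judged together with the weight scale: in a ReLU network, multiplying the first layer by c > 0 and dividing the second layer by c leaves the function unchanged while changing the weight norms (and hence how sharply the loss reacts to a fixed-size perturbation). The sketch below checks only the invariance; the network sizes and c = 10 are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))
x = rng.normal(size=3)

def forward(W1, W2, x):
    # Two-layer ReLU network without biases
    return W2 @ np.maximum(0.0, W1 @ x)

c = 10.0
out_original = forward(W1, W2, x)
out_rescaled = forward(c * W1, W2 / c, x)  # same function, rescaled weights

norm_original = np.linalg.norm(W1) ** 2 + np.linalg.norm(W2) ** 2
norm_rescaled = np.linalg.norm(c * W1) ** 2 + np.linalg.norm(W2 / c) ** 2
```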

Training always converges to a solution with low test error
[Wu+ 17]
l Even when we optimize the model from different initializations,
training always converges to a solution with low test error
l Flat minima have large basins while sharp minima have small basins
̶ Almost all initial parameters converge to flat minima
l Flat minima correspond to low model complexity
= low test error
l Question: Why does NN learning induce flat minima ?
[Figure: flat minima have a large basin; sharp minima have a small basin]

Why can DL recognize and generate
real world data ?

Why does deep learning work ?
Lin’s hypothesis [Lin+ 16]
l Real-world phenomena have the following characteristics
1. Low-order polynomial
̶ Known physical interactions are at most 4th-order polynomials
2. Local interaction
̶ The number of interactions between objects increases linearly
3. Symmetry
̶ Few degrees of freedom
4. Markovian
̶ Most generation processes depend only on the previous state
l -> DNNs can exploit these characteristics

Generation and recognition (1/2)
l Data x is generated from unknown factors z
l Generation and recognition are inverse operations
̶ Generation: z -> x
̶ Recognition (inference): x -> z, i.e. infer the posterior P(z|x)
E.g. image generation and recognition
z: object, position of camera, lighting condition
(Dragon, [10, 2, -4], white)
x: image

Generation and recognition（2/2）
l Data is often generated from multiple factors
̶ Uninteresting factors are sometimes called covariates or
disturbance variables of hidden variable
l The generation process can be very complex
̶ Each step can be non-linear
̶ Gaussian and non-Gaussian noise is added at several steps
̶ E.g. image rendering requires dozens of steps
l In general, the generation process is unknown
̶ Any modeled generation process is an approximation of the actual process
[Figure: generation graph over the nodes z1, z2, c, h, hm and x]

Why do we consider generative models?
l For more accurate recognition and inference
̶ If we know the generation process, we can improve recognition and inference
u “What I cannot create, I do not understand”
Richard Feynman
u “Computer vision is inverse computer graphics”
Geoffrey Hinton
̶ By inverting the generation process, we obtain the recognition process
l For transfer learning
̶ By changing covariates, we can transfer the learned model to other
environments
l For sampling examples to compute statistics and validation

E.g. Mapping of hand-written data into 2D using VAE
The original hand-written data is high-dimensional (784-dim)
If we map these data into a 2-dim space, types and shapes change smoothly
If we want to classify “1”,
we only need to find this simple
boundary

Representation learning is more powerful than
the nearest neighbor method and manifold learning
l In fact, we can significantly reduce the number of required training samples
by using representation learning [Arora+ 2017]
l Using the distance metric defined on the original space, or its
neighborhood notion, may not work
̶ Ideally, nearby samples would help to determine the label
̶ In reality, samples with the same label (e.g. “Man with glasses”) are
located in very different places in the original space. Their region may not
even be connected in the original space

Real-world data is distributed in low-dimensional manifold
l Each point corresponds to a possible data sample
l Data is distributed in a low-dimensional space
̶ C.f. distribution of galaxies in the universe
l Why does a low-dimensional manifold appear ?
̶ Low-dimensional factors are converted to high-dimensional data
without increasing the complexity [Lin+ 16]

Original space and latent space
[Figure: generation maps the latent space to the original space; recognition maps back]
l In the latent space, the meaning of data changes smoothly

Learning is easy in the latent space
[Figure: generation maps the latent space to the original space; recognition maps back]
l Since many tasks are related to the factors, the classification
boundary becomes simple in the latent space
̶ Many training examples are required in the original space
̶ Few training examples are required in the latent space

How to learn a generative and inference model ?
l The generation process and its counterpart, the recognition process,
are highly non-linear and complex
l -> Use a deep neural network to approximate them
̶ Generation (z -> x): x = f(z)
̶ Recognition (x -> z): z = g(x)

Deep generative models
Model | Fast sampling of x | Compute the likelihood P(x) | Produce sharp image | Stable training
VAE [Kingma+ 14] | √ | △ Lower-bound (IW-VAE [Burda+ 15]) | X | √
GAN [Goodfellow+ 14, 16] (IPM) | √ | X | √ | X-△
AutoRegressive [Oord+ 16ab] | △-√ (Parallel multi-scale [Reed+ 17]) | √ | √ | √
Energy model [Zhao+ 16] [Dai+ 17] | △-√ | △ Up to constant | √ | △

VAE: Variational AutoEncoder [Kingma+ 14]
A neural network (decoder) outputs the mean and covariance
(μ, σ) = Dec(z; φ)
Generate x in the following steps
(1) Sample z ~ N(0, I)
(2) Compute (μ, σ) = Dec(z; φ)
(3) Sample x ~ N(μ, σI)
Defined distribution
p(x) = ∫p(x|z)p(z)dz
[Figure: z -> Dec -> (μ, σ) -> x ~ N(μ, σ)]
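The three generation steps can be sketched as below. The decoder here is a hypothetical stand-in (a fixed random map with tanh), not a trained network, and σ is treated as a scalar standard deviation, which is an assumption about the slide's notation.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, data_dim = 2, 4

# Hypothetical decoder parameters (phi); a real decoder would be a trained NN
W_mu = rng.normal(size=(data_dim, latent_dim))
w_log_sigma = rng.normal(size=latent_dim)

def decode(z):
    # (mu, sigma) = Dec(z; phi)
    mu = np.tanh(W_mu @ z)
    sigma = np.exp(0.1 * (w_log_sigma @ z))  # positive scalar
    return mu, sigma

def sample_x():
    z = rng.standard_normal(latent_dim)            # (1) z ~ N(0, I)
    mu, sigma = decode(z)                          # (2) decode
    return mu + sigma * rng.standard_normal(data_dim)  # (3) x ~ N(mu, sigma I)

samples = np.stack([sample_x() for _ in range(5)])
```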

VAE: Variational Autoencoder
Induced distribution
l p(x|z) is a Gaussian and p(x) corresponds to an (infinite)
mixture of Gaussians
p(x) = ∫p(x|z)p(z)dz
̶ A neural network can model a complex relation between z and x

VAE: Variational AutoEncoder
Use maximum likelihood estimation to learn the parameters θ
Since the exact likelihood is intractable, we instead maximize
a lower bound of the log-likelihood known as the ELBO (evidence lower bound)
The proposal distribution q(z|x)
should be close to the true
posterior p(z|x)
Maximizing the ELBO wrt. q(z|x) corresponds
to minimizing
KL(q(z|x) || p(z|x))
= The encoder is learned as a side effect
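The bound itself is not present in the extracted text. As a reminder, the standard decomposition from [Kingma+ 14] can be written as follows (the subscript θ on p is notation assumed here):

```latex
\log p_\theta(x)
  = \underbrace{\mathbb{E}_{q(z|x)}\!\left[\log p_\theta(x|z)\right]
      - \mathrm{KL}\!\left(q(z|x)\,\|\,p(z)\right)}_{\text{ELBO}}
  + \mathrm{KL}\!\left(q(z|x)\,\|\,p_\theta(z|x)\right)
  \;\ge\; \text{ELBO}
```

Since the left-hand side does not depend on q, raising the ELBO with respect to q(z|x) must lower KL(q(z|x) || p(z|x)) by the same amount.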

Reparameterization Trick
Since we take an expectation with respect to q(z|x), it is difficult to compute
the gradient of the ELBO wrt. the parameters of q(z|x)
-> We can use the reparameterization trick: write z = μ + σε with noise ε
[Figure: computation graph x -> (μ, σ); (μ, σ, ε) -> z -> (μ', σ') -> x']
The converted computation graph
can be regarded as an
auto-encoder where a noise εσ
is added to the latent variable μ
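A minimal numerical sketch of the trick, assuming ε ~ N(0, I) and hand-picked μ and σ (in a VAE these would come from the encoder). Sampling is moved into ε, so z becomes a deterministic, differentiable function of μ and σ.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])      # illustrative encoder outputs
sigma = np.array([0.5, 2.0])

# Noise is drawn independently of the parameters
eps = rng.standard_normal(size=(100_000, 2))

# z = mu + sigma * eps has the same distribution as z ~ N(mu, sigma^2)
z = mu + sigma * eps

# Gradients are now simple: dz/dmu = 1 and dz/dsigma = eps
dz_dsigma = eps
```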

The problem of maximum likelihood estimation against
low-dimensional manifold data (1/3) [Arjovsky+ 17ab]
l Maximum likelihood estimation (MLE) estimates a distribution
P(x) using a model Q(x)
L_MLE(P, Q) = Σ_x P(x) log Q(x)
̶ Usually, P is replaced with the empirical distribution: (1/N) Σ_i log Q(x_i)
l For low-dimensional manifold data, P(x) = 0 for most x
l To model such P, Q(x) should also satisfy Q(x) = 0 for most x
l With such Q(x), log Q(x_i) is undefined (or NaN) when Q(x_i) =
0, so we cannot optimize Q(x) using MLE
l To solve this -> use Q(x) s.t. Q(x_i) > 0 for all {x_i}
̶ E.g. Q(x) = N(μ, σ); this means a sample is μ with added noise σ
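The failure mode can be reproduced with a toy discrete model: if Q assigns zero probability to even one observed point, the empirical log-likelihood collapses to minus infinity, while a model that spreads a little mass everywhere stays finite. The four-state distributions and data below are made up for illustration.

```python
import numpy as np

def mean_log_likelihood(q_probs, data_indices):
    # Empirical MLE objective: (1/N) sum_i log Q(x_i)
    with np.errstate(divide="ignore"):  # allow log(0) = -inf without a warning
        return float(np.mean(np.log(q_probs[data_indices])))

data = np.array([0, 1, 1, 2])                    # observed states
q_sparse = np.array([0.5, 0.5, 0.0, 0.0])        # Q(x) = 0 on state 2
q_smoothed = np.array([0.4, 0.4, 0.1, 0.1])      # Q(x) > 0 everywhere

ll_sparse = mean_log_likelihood(q_sparse, data)
ll_smoothed = mean_log_likelihood(q_smoothed, data)
```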

The problem of maximum likelihood estimation against
low-dimensional manifold data (2/2)
l MLE requires Q(x_i) > 0 for all {x_i}
l To solve this -> use Q(x) s.t. Q(x_i) > 0 for all {x_i}
l Q(x) = N(μ, σ) means a sample is μ with added noise σ
̶ This produces blurry images
l Another difficulty is that MLE has no notion of
closeness wrt. the geometry of the space
̶ When the areas of the intersections are the
same, MLE gives the same score
̶ Although the left distribution (in the figure) is closer to the
true distribution, the MLE scores are the same

GAN（Generative Adversarial Net）
[Goodfellow+ 14, 16]
l Two neural networks compete to learn a distribution
l Generator (counterfeiter)
̶ Goal: deceive the discriminator
̶ Learn to generate realistic samples that can deceive the discriminator
l Discriminator (police)
̶ Goal: detect samples generated by the generator
̶ Learn to detect the difference between real and generated samples
[Figure: a sample chosen randomly from the generator or the real data is passed to the discriminator, which outputs real or fake]

GAN: Generative Adversarial Net
Sample x in the following steps
(1) Sample z ~ U(0, I)
(2) Compute x = G(z)
(no noise is added at the last step)
[Figure: z -> x = G(z)]

Training of GAN
l Use a discriminator D(x)
̶ Outputs 1 if x is estimated to be real and 0 otherwise
l Train D to maximize V and G to minimize V
̶ If learning succeeds, it reaches
the following Nash equilibrium
∫p(z)G(z)dz=P(x), D(x)=1/2
̶ Since D provides dD(x)/dx to update G,
they actually cooperate to learn P(x)
[Figure: z -> x' = G(z); x' or a real x -> y = D(x) ∈ {1 (real), 0 (fake)}]
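The value V is not defined in the extracted text. Assuming the standard value function of [Goodfellow+ 14], V(D, G) = E_x[log D(x)] + E_z[log(1 - D(G(z)))], a small sketch shows what the equilibrium D(x) = 1/2 means numerically. The discriminator outputs below are made-up constants rather than outputs of a trained network.

```python
import numpy as np

def gan_value(d_real, d_fake):
    # V(D, G) = E_x[log D(x)] + E_z[log(1 - D(G(z)))]
    # d_real: D's outputs on real samples, d_fake: D's outputs on generated ones
    return float(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))

# At the equilibrium D cannot tell real from fake and outputs 1/2 everywhere
v_equilibrium = gan_value(np.full(4, 0.5), np.full(4, 0.5))

# A discriminator that separates the two well achieves a higher V
v_confident = gan_value(np.full(4, 0.9), np.full(4, 0.1))
```

With D(x) = 1/2 everywhere, V equals -log 4, the value the generator drives towards.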

Modeling low dimensional manifold
l When z is low-dimensional, the deterministic function
x = F(z) outputs a low-dimensional manifold in the space of x
l Using a CNN for G(z) and D(x) is also important
̶ D(x) gives similar scores when x and x’ are similar
l These two factors are important for producing realistic data
l A recent study showed that training without a discriminator
can also generate realistic data [Bojanowski+ 17]
[Figure: x = F(z) maps z ∈ R^1 to a curve in x ∈ R^2]

Demonstration of GAN training
http://www.inference.vc/an-alternative-update-rule-for-generative-adversarial-networks/
Each generated
sample follows
dD(x)/dx

Training GAN
https://github.com/mattya/chainer-DCGAN
After 30 minutes

Stacked GAN
http://mtyka.github.io/machine/learning/2017/06/06/highres-gan-faces.html

New GAN papers are coming out every week
GAN Zoo https://github.com/hindupuravinash/the-gan-zoo
l Since GAN provides a new way to train a probabilistic model,
many GAN papers are coming out (about 20 papers per month as of Jul. 2017)
l Interpretation of the GAN framework
̶ Wasserstein distance, integral probability metric, inverse RL
l New stable training methods
̶ Lipschitzness of D, ensemble of Ds, etc.
l New applications
̶ Speech, text, inference model (q(z|x))
l Conditional GAN
̶ Multi-class Super-resolution,

Super Resolution + Regression loss for perception network
[Chen+ 17]
l Generate a photo-realistic image from a segmentation result
̶ High resolution, globally consistent, stable training
[Figure: input: segmentation; output: photo-realistic image]

ICA: Independent component analysis
Reference: [Hyvärinen+ 01]
l Find components z that generate data x
x = f(z)
where f is an unknown function called the mixing function and
the components are independent of each other: p(z) = Π_i p(z_i)
l When f is linear and p(z_i) is non-Gaussian, we can identify f and
z correctly
l However, when f is nonlinear, we cannot identify f and z
̶ There are infinitely many possible f and z
l -> When the data is a time series x(1), x(2), …, x(n) generated
from z, which are (1) non-stationary or (2) stationary
independent sources, we can identify the non-linear f and z

Non-linear ICA for non-stationary time series data
[Hyvärinen+ 16]
l When sources are independent and non-stationary, we can
identify a non-linear mixing function f and z
l Assumption: sources change slowly
̶ sources can be considered stationary within a
short time segment
̶ Many interesting data have this property
1. Divide the time series data into segments
2. Train a multi-class classifier to predict
which segment each data point belongs to
3. The last layer’s features correspond to
(a linear mixture of) independent sources
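Steps 1 and 2 amount to building a supervised dataset whose labels are segment indices. A minimal sketch, with a hypothetical helper name, random placeholder data and an arbitrary segment length; the classifier itself is omitted:

```python
import numpy as np

def segment_labels(n_points, segment_length):
    # Step 1: split the time axis into consecutive segments.
    # The label of time step t is the index of the segment it falls into.
    return np.arange(n_points) // segment_length

rng = np.random.default_rng(0)
x = rng.normal(size=(12, 3))  # placeholder time series: 12 steps, 3 channels
labels = segment_labels(len(x), segment_length=4)

# Step 2 would train a multi-class classifier on the pairs (x[t], labels[t])
```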

Non-linear ICA for stationary time series data
[Hyvärinen+ 17]
l When sources are independent and stationary, we can also
identify a non-linear mixing function f and z
l Sources should be uniformly dependent
̶ defined on the pair x = s(t) and y = s(t-1)
1. Train a binary classifier to classify whether given data pairs are
taken from adjacent time steps (x(t), x(t+1)) or random ones (x(t), x(u))
2. The last layer’s features correspond to
(a linear mixture of) independent sources
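Step 1 needs a dataset of adjacent pairs (label 1) and randomly paired time steps (label 0). A minimal sketch of that construction, with a hypothetical helper name and placeholder data; the binary classifier is omitted:

```python
import numpy as np

def contrastive_pairs(x, rng):
    t = np.arange(len(x) - 1)
    # Positive class: adjacent pairs (x(t), x(t+1))
    adjacent = np.concatenate([x[t], x[t + 1]], axis=1)
    # Negative class: (x(t), x(u)) with u drawn at random
    # (u may occasionally equal t+1; ignored in this sketch)
    u = rng.integers(0, len(x), size=len(t))
    shuffled = np.concatenate([x[t], x[u]], axis=1)
    pairs = np.concatenate([adjacent, shuffled], axis=0)
    labels = np.concatenate([np.ones(len(t)), np.zeros(len(t))])
    return pairs, labels

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 2))  # placeholder time series: 10 steps, 2 channels
pairs, labels = contrastive_pairs(x, rng)
```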

Conjectures [Okanohara]
l Train a multi-class classifier with a very large number of classes
(e.g. ImageNet). Then the features of the last layer correspond to
(a mixture of) independent components
̶ To show this, we need a reasonable model relating the set of labels
and the independent components
̶ Dark knowledge [Hinton14] is effective for transferring the model because
it reveals the independent components
l Similarly, GAN discriminators (or energy functions) also
extract the independent components

Why can DL keep and manipulate
complex information ?

Information Abstract Level
l Abstract knowledge
̶ Text, relation
l Model
̶ Simulator / generative model
l Raw Experience
̶ Sensory stream
[Figure: axis from abstract (top) to detailed (bottom)]
̶ Abstract: small volume, independent of problem/task/context
̶ Detailed: large volume, dependent on problem/task/context

Local representation vs distributed representation
l Local representation
̶ each concept is represented by one symbol
̶ e.g. Giraffe=1, Panda=2, Lion=3, Tiger=4
̶ no interference, noise immunity, precise
l Distributed representation
̶ each concept is represented by a set of symbols, and each symbol
participates in representing many concepts
̶ generalizable
̶ less accurate
̶ interference
Giraffe Panda Lion Tiger
Long neck ◯
four legs ◯ ◯ ◯ ◯
body hair ◯ ◯ ◯
paw pad ◯ ◯

High dimensional vector vs low dimensional vector
l High dimensional vector
̶ Two random vectors are almost always nearly orthogonal
̶ many concepts can be stored within one vector
u w = x + y + z
̶ Same characteristics as local representation
l Low dimensional vector
̶ Concepts interfere with each other
̶ Cannot keep precise memory
̶ Beneficial for generalization
l Interference and generalization are strongly related
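Both high-dimensional properties can be checked numerically. The sketch assumes Gaussian random vectors of dimension 10,000, scaled to roughly unit length: two of them are nearly orthogonal, and a superposition w = x + y + z can still be queried with a dot product to tell stored vectors from unrelated ones.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 10_000
# Four random vectors with norm close to 1
x, y, z, unrelated = (rng.standard_normal(dim) / np.sqrt(dim) for _ in range(4))

# Near-orthogonality of two random vectors
cosine_xy = float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

# Store three concepts in one vector and query by dot product
w = x + y + z
score_stored = float(w @ x)             # close to 1: x is contained in w
score_unrelated = float(w @ unrelated)  # close to 0: not contained in w
```

Repeating this with a small dimension makes the cross terms comparable to the signal, which is the interference the low-dimensional bullets describe.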

Two layer feedforward network = memory augmented network
[Vaswani+ 17]
l Memory augmented network
a = V Softmax(Kq)
̶ K is a key matrix (the i-th row corresponds to the key for the i-th memory)
̶ V is a value matrix (the i-th column corresponds to the value for the i-th memory)
̶ We may use winner-take-all instead of Softmax
l Two layer feedforward network
a = W2 Relu(W1 x)
̶ the i-th row of W1 corresponds to the key for the i-th memory
̶ the i-th column of W2 corresponds to the value for the i-th memory
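The correspondence can be made concrete by running the same key and value matrices through both forms. The matrices and the query below are illustrative: the query matches the first key, so the memory lookup returns (almost exactly) the first value column, and the ReLU network returns the same column scaled by the unnormalized match strength.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

def memory_lookup(K, V, q):
    # a = V Softmax(K q): rows of K are keys, columns of V are values
    return V @ softmax(K @ q)

def two_layer_ffn(W1, W2, x):
    # a = W2 Relu(W1 x): same structure with ReLU in place of Softmax
    return W2 @ np.maximum(0.0, W1 @ x)

K = np.array([[10.0, 0.0], [0.0, 10.0]])            # two keys
V = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 4.0]])  # two value columns
q = np.array([1.0, 0.0])                            # query matching key 0

a_memory = memory_lookup(K, V, q)
a_ffn = two_layer_ffn(K, V, q)
```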

Three layer feed-forward network is also memory-augmented
network [Okanohara unpublished]
l Three layer feed-forward network can be considered as first
layer is used for computing keys and second stores key and t
a = W3Relu(W2Relu(W1x))
l key: Relu(W1x)
l The i-th row of W2 corresponds to the key of i-th memory cell
l The i-th column of W3 corresponds to the value of i-th
memory cell

Two-layer NN update rule interpretation
[Okanohara unpublished]
l The update rule of a two-layer feed-forward network
h = Relu(W_1 x)
a = W_2 h
is
dh = W_2^T da
dW_2 = da h^T
dW_1 = diag(Relu’(W_1 x)) dh x^T
= diag(Relu’(W_1 x)) W_2^T da x^T
l These update rules correspond to storing the error (da) as a
value and storing the input (x) as a key in a memory network
̶ Only active memories are updated (Relu’(W_1 x))
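The gradients can be verified against a finite difference. The sketch assumes a squared-error loss 0.5 * ||a - target||^2, so that da = a - target; the loss, sizes and random values are assumptions made for the check, not part of the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))
x = rng.normal(size=3)
target = rng.normal(size=2)

def loss(W1, W2):
    # Assumed loss: 0.5 * ||a - target||^2
    a = W2 @ np.maximum(0.0, W1 @ x)
    return 0.5 * np.sum((a - target) ** 2)

# Forward pass
pre = W1 @ x
h = np.maximum(0.0, pre)
a = W2 @ h

# Backward pass following the update rules
da = a - target
dh = W2.T @ da
dW2 = np.outer(da, h)              # error stored against the hidden activation
dW1 = np.outer((pre > 0) * dh, x)  # input stored as key, only for active units

# Finite-difference check on one entry of W1
eps = 1e-6
W1_shift = W1.copy()
W1_shift[0, 0] += eps
numeric = (loss(W1_shift, W2) - loss(W1, W2)) / eps
```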

Resnet is memory augmented network
[Okanohara unpublished]
l Since ResNet has the following form
h = h + Resnet(h)
and Resnet(h) consists of two layers, we can interpret it as
recalling a memory and adding it to the current vector
̶ The squeeze operation corresponds to limiting the number of memory cells
l ResNet looks up memory iteratively
̶ Large number of steps = large number of memory lookups
l This interpretation is different from the shortcut view [He+15] and
unrolled iterative estimation [Greff+16]

Infinite memory network
l What happens if we increase the number of hidden units
iteratively for each training sample ?
̶ This is similar to “Memory Networks”, where we store previous hidden
activations in explicit memory, or “Progressive Network” [Rusu+ 16],
where we incrementally add a new network (and freeze the old networks) for
each new task
l We expect that it can prevent catastrophic forgetting and
achieve one-shot learning
̶ How can we ensure generalization ?

Conclusion
l There are still many unsolved problems in DNN
̶ Why can DNN learn in general setting ?
̶ How to represent real world information ?
l There are still many unsolved problems in AI
̶ Disentanglement of information
̶ One-shot learning using attention and memory mechanism
u Avoid catastrophic forgetting, interference
̶ Stable, data-efficient reinforcement learning
̶ How to abstract information
u grounding (language), strong noise (e.g. dropout), extracting hidden
factors by using (non-)stationarity or commonality among tasks

References
l [Choromanska+ 2015] “The loss surface of multilayer networks”, A. Choromanska
et al., AISTATS 2015
l [Lu+ 2017] “Depth Creates No Bad Local Minima”, H. Lu et al.,
arXiv:1702.08580
l [Nguyen+ 2017] “The loss surface of deep and wide neural networks”, Q. Nguyen
et al., arXiv:1704.08045
l [Zhang+ 2017] “Understanding deep learning requires rethinking generalization”, C.
Zhang et al., ICLR 2017
l [Arpit+ 2017] “A Closer Look at Memorization in Deep Networks”, D. Arpit et al.,
ICML 2017
l [Mandt+ 2017] “Stochastic Gradient Descent as Approximate Bayesian Inference”, S.
Mandt et al., arXiv:1704.04289
l [Shwartz-Ziv+ 2017] “Opening the Black Box of Deep Neural Networks via
Information”, R. Shwartz-Ziv et al., arXiv:1703.00810
l [Neyshabur+ 17] “Exploring Generalization in Deep Learning”, B. Neyshabur
et al., arXiv:1706.08947
l [Wu+ 17] “Towards Understanding Generalization of Deep Learning:
Perspective of Loss Landscapes”, L. Wu et al., arXiv:1706.10239
l [Lin+ 16] “Why does deep and cheap learning work so well?”, H. W. Lin
et al., arXiv:1608.08225
l [Arora+ 17] “Provable benefits of representation learning”, S. Arora
et al., arXiv:1706.04601
l [Kingma+ 14] “Auto-Encoding Variational Bayes”, D. P. Kingma et al.,
ICLR 2014
l [Burda+ 15] “Importance Weighted Autoencoders”, Y. Burda et al.,
arXiv:1509.00519
l [Goodfellow+ 14] “Generative Adversarial Nets”, I. Goodfellow et al.,
NIPS 2014
l [Goodfellow 16] “NIPS 2016 Tutorial: Generative Adversarial Networks”,
arXiv:1701.00160
l [Oord+ 16a] “Conditional Image Generation with PixelCNN Decoders”, A.
Oord et al., NIPS 2016
l [Oord+ 16b] “WaveNet: A Generative Model for Raw Audio”, A. Oord
et al., arXiv:1609.03499
l [Reed+ 17] “Parallel Multiscale Autoregressive Density Estimation”, S. Reed
et al., arXiv:1703.03664
l [Zhao+ 16] “Energy-based Generative Adversarial Network”, J. Zhao
et al., arXiv:1609.03126
l [Dai+ 17] “Calibrating Energy-based Generative Adversarial Networks”, Z.
Dai et al., ICLR 2017
l [Arjovsky+ 17a] “Towards principled methods for training generative
adversarial networks”, M. Arjovsky et al., arXiv:1701.04862
l [Arjovsky+ 17b] “Wasserstein Generative Adversarial Networks”, M.
Arjovsky et al., ICML 2017
l [Bojanowski+ 17] “Optimizing the Latent Space of Generative Networks”, P.
Bojanowski et al., arXiv:1707.05776
l [Chen+ 17] “Photographic Image Synthesis with Cascaded Refinement
Networks”, Q. Chen et al., arXiv:1707.09405
l [Hyvärinen+ 01] “Independent Component Analysis”, A. Hyvärinen
et al., John Wiley & Sons, 2001
l [Hyvärinen+ 16] “Unsupervised Feature Extraction by Time-Contrastive
Learning and Nonlinear ICA”, A. Hyvärinen et al., NIPS 2016
l [Hyvärinen+ 17] “Nonlinear ICA of Temporally Dependent Stationary
Sources”, A. Hyvärinen et al., AISTATS 2017
l [Vaswani+ 17] “Attention is all you need”, A. Vaswani et al., arXiv:1706.03762 (the
idea appears only in version 3 https://arxiv.org/abs/1706.03762v3)
l [He+ 15] “Deep Residual Learning for Image Recognition”, K. He et al.,
arXiv:1512.03385
l [Rusu+ 16] “Progressive Neural Networks”, A. Rusu et al.,
arXiv:1606.04671
