Slides by Louis Monier (AltaVista co-founder & CTO) for Deep Learning keynote #1 at Holberton School. The keynote was followed by a workshop prepared by Gregory Renard. If you want to attend a similar keynote for free, check out http://www.meetup.com/Holberton-School/
5. More Neurons, More Layers: Deep Learning
[Diagram: a network with an input layer (input1, input2, input3), hidden layers whose neurons detect circles, irises, eyes, and faces, and an output layer reporting "89% likely to be a face"]
6. How to find the best values for the weights?
[Plot: the error as a function of a parameter w, with two points a and b marked on the curve]
Define error = |expected - computed|²
Find the parameters that minimize the average error.
The gradient of the error with respect to w, ∂error/∂w, tells us which direction is uphill.
Update: move w a small step in the direction -∂error/∂w, i.e. take a step downhill.
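A minimal sketch of this update loop in plain Python, on a made-up 1-D toy problem (the data point, starting value, and step size are illustrative choices, not values from the slides):

```python
# Toy 1-D example: error(w) = (expected - computed)^2 with computed = w * x
x, expected = 2.0, 6.0      # made-up data point
w = 0.0                     # initial value of the parameter
step_size = 0.1             # how big a step downhill to take

for step in range(50):
    computed = w * x
    error = (expected - computed) ** 2
    grad = -2 * (expected - computed) * x   # d(error)/dw
    w = w - step_size * grad                # take a step downhill

print(w)   # converges toward 3.0, where the error is smallest
```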
7. Egg carton in 1 million dimensions
[Plot: a bumpy, egg-carton-shaped error surface, with arrows pointing to two different low spots: "Get to here" or "here"]
9. Stochastic Gradient Descent (SGD) and Minibatch
It’s too expensive to compute the gradient on all inputs to take a step.
We prefer a quick approximation.
Use a small random sample of inputs (minibatch) to compute the gradient.
Apply that approximate gradient to update all the weights.
Welcome to SGD, much more efficient than regular GD!
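A minimal sketch of minibatch SGD in numpy. The linear model, squared error, dataset, and learning rate are all stand-ins chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 3))                 # 10k inputs, 3 features (made up)
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=10_000)   # noisy targets

w = np.zeros(3)
learning_rate = 0.1
batch_size = 32

for step in range(1_000):
    idx = rng.integers(0, len(X), size=batch_size)   # small random sample of inputs
    xb, yb = X[idx], y[idx]
    grad = 2 * xb.T @ (xb @ w - yb) / batch_size     # gradient computed on the minibatch only
    w -= learning_rate * grad                        # ...applied to all the weights

print(w)   # close to [1.0, -2.0, 0.5] without ever using the full dataset for one step
```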
10. Usually things would get intense at this point...
Chain rule, partial derivatives, Hessian, sum over paths, Jacobian matrix, local gradient…
(Credit: Richard Socher, Stanford class CS224d)
11. Learning a new representation
Credit: http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/
13. Train. Validate. Test.
Training set
Validation set
Cross-validation
Testing set
[Diagram: the data split into chunks 1 2 3 4, fed chunk by chunk; shuffle! One full pass over the data is one epoch]
Never ever, ever, ever, ever, ever, ever use the testing set for training!
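A minimal sketch of such a split in numpy; the 80/10/10 proportions and the batch count are illustrative assumptions, not values from the slides:

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.arange(1_000)                 # stand-in for your examples

perm = rng.permutation(len(data))       # shuffle!
train = data[perm[:800]]                # training set: fit the weights
val   = data[perm[800:900]]             # validation set: tune hyperparameters
test  = data[perm[900:]]                # testing set: touch it once, at the very end

# One epoch = one full pass over the training set, reshuffled each time
for epoch in range(3):
    for batch in np.array_split(rng.permutation(train), 25):
        pass   # train on this minibatch
```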
15. Tensors are not scary
Number: 0D
Vector: 1D
Matrix: 2D
Tensor: 3D, 4D, … array of numbers
This image is a 3264 x 2448 x 3 tensor: each pixel is an [r, g, b] triple, e.g. [r=0.45 g=0.84 b=0.76].
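A quick way to see these dimensions in numpy (the image size matches the slide; the zero-filled pixels are just a placeholder):

```python
import numpy as np

number = np.array(3.14)                             # 0D
vector = np.array([1.0, 2.0, 3.0])                  # 1D
matrix = np.ones((2, 3))                            # 2D
image  = np.zeros((3264, 2448, 3), dtype=np.uint8)  # 3D: height x width x RGB channels

print(number.ndim, vector.ndim, matrix.ndim, image.ndim)   # 0 1 2 3
print(image[0, 0])   # one pixel: its [r, g, b] values
```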
17. Fully-Connected Networks
Per layer, every input connects to every neuron.
[Diagram: the same face-detection network as before: inputs (input1, input2, input3), hidden layers detecting circles, irises, eyes, and faces, and an output of "89% likely to be a face"]
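A minimal sketch of one fully-connected layer in numpy (the layer sizes and the ReLU activation are illustrative assumptions):

```python
import numpy as np

def dense_layer(x, W, b):
    """Every input connects to every neuron: one matrix multiply plus a bias."""
    return np.maximum(0.0, x @ W + b)   # ReLU nonlinearity

x  = np.array([0.2, -1.3, 0.7])        # 3 inputs (input1..input3)
W1 = np.random.randn(3, 5) * 0.1       # 3 inputs -> 5 hidden neurons
b1 = np.zeros(5)
W2 = np.random.randn(5, 1) * 0.1       # 5 hidden neurons -> 1 output
b2 = np.zeros(1)

hidden = dense_layer(x, W1, b1)
output = dense_layer(hidden, W2, b2)   # e.g. a "likely to be a face" score
```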
18. Recurrent Neural Networks (RNN)
Appropriate when inputs are sequences.
We’ll cover in detail when we work on text.
[Diagram: the sentence "this is not a pipe" fed into the network one word per time step, t=0 through t=4]
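A minimal sketch of the recurrence on the slide's sentence; the tiny random word vectors and the tanh update are illustrative assumptions, and the full treatment comes in the text session:

```python
import numpy as np

words = "this is not a pipe".split()    # one word per time step, t=0..4
dim = 8
rng = np.random.default_rng(0)
vectors = {w: rng.normal(size=dim) for w in words}   # toy word embeddings

W_x = rng.normal(size=(dim, dim)) * 0.1   # input-to-hidden weights
W_h = rng.normal(size=(dim, dim)) * 0.1   # hidden-to-hidden weights (the recurrence)
h = np.zeros(dim)                         # hidden state carried across time steps

for t, word in enumerate(words):
    h = np.tanh(vectors[word] @ W_x + h @ W_h)   # the same weights are reused at every t

# h now summarizes the whole sequence
```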
19. Convolutional Neural Networks (ConvNets)
Appropriate for image tasks, but not limited to image tasks.
We’ll cover in detail in the next session.
20. Hyperparameters
Activations: a zoo of nonlinear functions (a few are sketched after this list)
Initializations: Distribution of initial weights. Not all zeros.
Optimizers: driving the gradient descent
Objectives: comparing a prediction to the truth
Regularizers: forcing the function we learn to remain “simple”
...and many more
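As a taste of the activation "zoo", here is a numpy sketch of a few common nonlinearities (which ones the keynote had in mind is an assumption; ReLU, sigmoid, and tanh are just the usual suspects):

```python
import numpy as np

def relu(x):    return np.maximum(0.0, x)
def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))
def tanh(x):    return np.tanh(x)

x = np.linspace(-3, 3, 7)
print(relu(x))     # zero for negative inputs, identity for positive ones
print(sigmoid(x))  # squashes everything into (0, 1)
print(tanh(x))     # squashes everything into (-1, 1)
```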
29. Cross-entropy Loss
Raw scores → SOFTMAX → Probabilities:
x1 = 6.78 → dog = 0.92536
x2 = 4.25 → cat = 0.07371
x3 = -0.16 → car = 0.00089
x4 = -3.5 → hat = 0.00003
Compare the probabilities to the label (truth): loss = -log(probability assigned to the true class).
If the label is:
dog: loss = 0.078
cat: loss = 2.607
car: loss = 7.024
hat: loss = 10.414
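A minimal numpy sketch that reproduces the slide's numbers from the raw scores (the class order dog, cat, car, hat follows the slide):

```python
import numpy as np

scores = np.array([6.78, 4.25, -0.16, -3.5])   # raw scores x1..x4
classes = ["dog", "cat", "car", "hat"]

exp = np.exp(scores)
probs = exp / exp.sum()                        # softmax -> [0.92536, 0.07371, 0.00089, 0.00003]

for label, p in zip(classes, probs):
    print(label, -np.log(p))                   # cross-entropy loss if this label is the truth
# dog 0.078   cat 2.607   car 7.024   hat 10.414
```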
32. Regularization: Avoiding Overfitting
Idea: keep the functions “simple” by constraining the weights.
Loss = Error(prediction, truth) + L(keep solution simple)
L2 keeps all the parameters small and of similar (medium) size.
L1 drives many parameters to exactly zero ("kills" them).
L1 + L2 combined is sometimes used.
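A minimal sketch of adding an L2 penalty to the loss; the squared-error term and the weighting factor lam are illustrative assumptions:

```python
import numpy as np

def loss_with_l2(prediction, truth, weights, lam=0.01):
    error = np.mean((prediction - truth) ** 2)   # Error(prediction, truth)
    penalty = lam * np.sum(weights ** 2)         # L2: pushes every weight toward small values
    return error + penalty

# L1 would use lam * np.sum(np.abs(weights)), which drives many weights to exactly zero
```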
33. Regularization: Dropout
At every step during training, ignore the output of a fraction p of the neurons.
p=0.5 is a good default.
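A minimal sketch of dropout applied to one layer's outputs during training; the 1/(1-p) rescaling ("inverted dropout") is an implementation choice not spelled out on the slide:

```python
import numpy as np

def dropout(activations, p=0.5, training=True):
    """Ignore the output of a random fraction p of neurons during training."""
    if not training:
        return activations
    keep = (np.random.rand(*activations.shape) >= p)   # which neurons survive this step
    return activations * keep / (1.0 - p)               # rescale so the expected value is unchanged

h = np.array([0.3, 1.2, 0.0, 2.4, 0.7, 1.1])
print(dropout(h, p=0.5))   # roughly half the entries zeroed out, the rest scaled up
```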