If you want to understand how neural networks work behind the scenes and debug the back-propagation algorithm step by step yourself, this presentation should be a good starting point.
We'll cover:
- the popularity of neural networks and their applications
- the artificial neuron and the analogy with the biological one
- the perceptron
- the architecture of multi-layer perceptrons
- loss functions
- activation functions
- the gradient descent algorithm
At the end, there will be an implementation FROM SCRATCH of a fully functioning neural net.
code: https://github.com/ahmedbesbes/Neural-Network-from-scratch
4. • Five decades of research in machine learning
• Cloud services and hardware (CPU, GPU, TPU)
• Lots of data from “the internet”
• Tools and culture of collaborative and reproducible science
• Resources and efforts from large corporations
39. What happens inside the hidden layer(s):
• Activations of the previous layer's neurons become inputs to the neurons of the next layer
How to interpret this?
• Intermediate non-linear computations ~ feature engineering
• Transformation of the input space
• New representations of the data over one or many layers (see the sketch after the diagram below)
[Diagram: input layer (x1, x2) → hidden layer → output layer (ŷ1, ŷ2, the predictions)]
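To make the "new representation" idea concrete, here is a minimal NumPy sketch (the layer sizes, weight values and the choice of a sigmoid activation are illustrative, not taken from the slides): a hidden layer maps the raw inputs into a new feature space.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 2))    # hidden-layer weights: 2 inputs -> 3 hidden neurons
b1 = np.zeros((3, 1))           # hidden-layer biases

x = np.array([[0.5], [-1.0]])   # one input example (x1, x2), shape (2, 1)

z1 = W1 @ x + b1                # pre-activations of the hidden neurons
a1 = sigmoid(z1)                # activations: a new 3-dimensional representation of x
print(a1.shape)                 # (3, 1)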
51. Wait… how did we come up with the weights to solve the XOR problem?
We trained the network!
52. [Diagram: the training pipeline. 0: input (raw pixels) → 1: forward propagation → 2: prediction ("Car") → 3: loss computation]
53. $\ell\bigl(f(x^{(i)}; W),\, y^{(i)}\bigr)$
• $\ell$: loss function
• $f(x^{(i)}; W)$: model prediction (car)
• $y^{(i)}$: ground-truth label (boat)
• $x^{(i)}$: training example
• $W$: parameters of the network
The loss function quantifies the cost that we pay when misclassifying a boat as a car.
54. Examples of loss functions

Loss | Formula for a single training example | Formula for all training data | Task
Mean Square Error (MSE) | $\frac{1}{2}(\hat{y}^{(i)} - y^{(i)})^2$ | $\frac{1}{2N}\sum_{i=1}^{N}(\hat{y}^{(i)} - y^{(i)})^2$ | Regression
Cross Entropy | $-\,y^{(i)}\log(\hat{y}^{(i)})$ | $-\frac{1}{N}\sum_{i=1}^{N} y^{(i)}\log(\hat{y}^{(i)})$ | Classification
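As a rough NumPy sketch of these formulas (the function names are mine, not from the repo; y_pred and y_true are assumed to be arrays of per-example predictions and targets):

import numpy as np

def mse_loss(y_pred, y_true):
    # 1/(2N) * sum((y_pred - y_true)^2): regression
    return np.mean((y_pred - y_true) ** 2) / 2

def cross_entropy_loss(y_pred, y_true, eps=1e-12):
    # -1/N * sum(y_true * log(y_pred)): classification
    y_pred = np.clip(y_pred, eps, 1.0)   # avoid log(0)
    return -np.mean(y_true * np.log(y_pred))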
55. Optimization problem

$$\min_{W} \; \frac{1}{n} \sum_{i=1}^{n} \ell\bigl(f(x^{(i)}; W),\, y^{(i)}\bigr)$$

• $\ell$: loss function
• $f(x^{(i)}; W)$: model prediction
• $x^{(i)}$: training example
• $y^{(i)}$: example label
• $W$: model parameters
• $\frac{1}{n}\sum_{i=1}^{n}$: average over the training set

Training the network = finding the parameters that minimize the average loss on the training set.
56. Gradient descent algorithm

$$w_{n+1} \leftarrow w_n - \eta \, \frac{df(w)}{dw}, \quad \eta > 0$$
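A minimal sketch of this update rule on a toy one-dimensional function (the function f(w) = (w - 3)² and the learning rate are my choices for illustration):

# Gradient descent on f(w) = (w - 3)^2, whose derivative is 2 * (w - 3)
w = 0.0            # initial value of the weight
eta = 0.1          # learning rate, eta > 0
for step in range(50):
    grad = 2 * (w - 3)     # df/dw evaluated at the current w
    w = w - eta * grad     # w_{n+1} <- w_n - eta * df/dw
print(w)                   # approaches the minimum at w = 3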
57. $$W^{[l]}_{n+1} \leftarrow W^{[l]}_{n} - \eta \, \nabla_{W^{[l]}}\Bigl(\frac{1}{n}\sum_{i=1}^{n} \ell\bigl(f(x^{(i)}; W),\, y^{(i)}\bigr)\Bigr)$$

• $W^{[l]}_{n+1}$: weight values of layer $l$ at iteration $n+1$
• $W^{[l]}_{n}$: weight values of layer $l$ at iteration $n$
• $\eta$: learning rate
• $\nabla_{W^{[l]}}(\cdot)$: gradient of the loss w.r.t. the weights of layer $l$
• $\frac{1}{n}\sum_{i=1}^{n}\ell(\cdot)$: average training loss
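In code, this per-layer rule is just a loop over the weight matrices; a short sketch with toy arrays standing in for the real weights and their averaged gradients:

import numpy as np

eta = 0.1
weights = [np.ones((3, 2)), np.ones((1, 3))]              # W[1], W[2] (toy values)
grads   = [np.full((3, 2), 0.5), np.full((1, 3), 0.5)]    # stand-ins for the averaged dL/dW[l]

for l in range(len(weights)):
    weights[l] = weights[l] - eta * grads[l]   # W[l]_{n+1} <- W[l]_n - eta * gradient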
58. [Diagram: the full training loop. 0: input (raw pixels) → 1: forward propagation → 2: prediction ("Car") → 3: loss computation → 4: backward propagation of the gradients $\frac{\partial L}{\partial W^{[6]}}, \frac{\partial L}{\partial W^{[5]}}, \dots, \frac{\partial L}{\partial W^{[1]}}$ → 5: weight update using gradient descent]
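To make steps 1 to 5 concrete for a single training example, here is a minimal end-to-end sketch for a tiny two-layer network; the sizes, the sigmoid activations and the squared-error loss are my illustrative choices, not the deck's exact setup:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = np.array([[0.5], [-1.0]])                 # input, shape (2, 1)
y = np.array([[1.0]])                         # target
W1, b1 = rng.normal(size=(3, 2)), np.zeros((3, 1))
W2, b2 = rng.normal(size=(1, 3)), np.zeros((1, 1))

# 1 - forward propagation, 2 - prediction, 3 - loss computation
z1 = W1 @ x + b1;  a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; y_hat = sigmoid(z2)
loss = 0.5 * (y_hat - y) ** 2

# 4 - backward propagation: chain rule from the output layer back to the input layer
dz2 = (y_hat - y) * y_hat * (1 - y_hat)       # dL/dz2
dW2 = dz2 @ a1.T                              # dL/dW[2]
db2 = dz2
dz1 = (W2.T @ dz2) * a1 * (1 - a1)            # dL/dz1
dW1 = dz1 @ x.T                               # dL/dW[1]
db1 = dz1

# 5 - weight update using gradient descent
eta = 0.1
W2 -= eta * dW2; b2 -= eta * db2
W1 -= eta * dW1; b1 -= eta * db1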
60. One training example at a time: stochastic gradient descent
A mini-batch of training examples at a time: mini-batch gradient descent (using the whole training set for each update is called batch gradient descent)
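The difference is only in how many examples contribute to each gradient step; a rough sketch of the two ways of slicing the data (toy data, with the gradient computation and weight update left as comments):

import numpy as np

X_train = np.random.rand(100, 2)   # 100 toy examples with 2 features
batch_size = 16

# Stochastic gradient descent: one example per update
for i in range(len(X_train)):
    x_i = X_train[i:i+1]                       # shape (1, 2)
    # ... compute the gradient on x_i and update the weights ...

# Mini-batch gradient descent: a batch of examples per update
for start in range(0, len(X_train), batch_size):
    batch = X_train[start:start+batch_size]    # shape (<=16, 2)
    # ... compute the gradient averaged over the batch and update the weights ...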
61. Term | Definition | Formula
$z_j^{l}$ | Weighted input to neuron $j$ in layer $l$ (pre-activation) | $z_j^{l} = \sum_{k=1}^{n_{l-1}} w_{jk}^{l}\, a_k^{l-1} + b_j^{l}$
$a_j^{l}$ | Activation of neuron $j$ in layer $l$ | $a_j^{l} = \sigma(z_j^{l})$
$b_j^{l}$ | Bias of neuron $j$ in layer $l$ | -
$w_{jk}^{l}$ | Weight connecting neuron $k$ in layer $l-1$ to neuron $j$ in layer $l$ | -

[Diagram: a small network with layers L0, L1, L2; the input activations $a_k^{0}$ feed the hidden pre-activations $z_1^{1}, z_2^{1}, z_3^{1}$ and activations $a_1^{1}, a_2^{1}, a_3^{1}$, which feed the output activations $a_1^{2}, a_2^{2}$; example weights $w_{22}^{1}$ and $w_{23}^{2}$ are labeled]
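A direct, loop-based translation of these per-neuron definitions (a sketch; the layer sizes and the sigmoid activation are arbitrary choices):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_prev, n_curr = 3, 2                       # number of neurons in layer l-1 and layer l
rng = np.random.default_rng(0)
w = rng.normal(size=(n_curr, n_prev))       # w[j, k]: weight from neuron k (layer l-1) to neuron j (layer l)
b = np.zeros(n_curr)                        # b[j]: bias of neuron j in layer l
a_prev = rng.random(n_prev)                 # activations of layer l-1

z = np.empty(n_curr)
a = np.empty(n_curr)
for j in range(n_curr):
    z[j] = sum(w[j, k] * a_prev[k] for k in range(n_prev)) + b[j]   # z_j^l
    a[j] = sigmoid(z[j])                                            # a_j^l = sigma(z_j^l)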
65. Term | Definition | Formula | Shape
$z^{l}$ | Vector of weighted inputs to the neurons in layer $l$ | $z^{l} = W^{l} a^{l-1} + b^{l}$ | $(n_l,)$
$a^{l}$ | Vector of neuron activations in layer $l$ | $a^{l} = \sigma(z^{l})$ | $(n_l,)$
$b^{l}$ | Vector of neuron biases in layer $l$ | - | $(n_l,)$
$W^{l}$ | Weight matrix connecting the activations of layer $l-1$ to the neurons of layer $l$ | - | $(n_l, n_{l-1})$
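In vectorized form, the whole layer is computed with a single matrix product (same arbitrary sizes as in the previous sketch):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_prev, n_curr = 3, 2
rng = np.random.default_rng(0)
W = rng.normal(size=(n_curr, n_prev))   # W^l, shape (n_l, n_{l-1})
b = np.zeros(n_curr)                    # b^l, shape (n_l,)
a_prev = rng.random(n_prev)             # a^{l-1}, shape (n_{l-1},)

z = W @ a_prev + b                      # z^l = W^l a^{l-1} + b^l
a = sigmoid(z)                          # a^l = sigma(z^l)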
73. [Plot: effect of the hyperparameters. Learning rate (1e-3, 1e-2, 1e-1, 0.5, 1) vs. number of layers (2, 5, 8, 10, 20)]
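Such a plot is typically produced by a grid search over the two hyperparameters; a rough sketch (build_and_train is a hypothetical helper, not a function from the repo):

def build_and_train(n_layers, learning_rate):
    # placeholder: build the network, train it, and return a validation accuracy
    return 0.0

n_layers_grid = [2, 5, 8, 10, 20]
learning_rates = [1e-3, 1e-2, 1e-1, 0.5, 1]

results = {}
for n_layers in n_layers_grid:
    for lr in learning_rates:
        results[(n_layers, lr)] = build_and_train(n_layers=n_layers, learning_rate=lr)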
74. Don't:
• Initialize the weights to 0: this causes symmetry and the same gradient for all weights
• Initialize the weights to very small values: this causes very small gradients
Do:
• He initialization: w = np.random.randn(D, H) * np.sqrt(2.0 / n), where n is the number of inputs of the layer (fan-in); see the sketch below
• Initialize all biases with a small constant value ~ 0.01
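A minimal sketch of these recommendations applied to a whole list of layer sizes (the function name and structure are mine; the repo may organize this differently):

import numpy as np

def initialize_parameters(layer_sizes, bias_value=0.01):
    """He initialization for the weights, small constant value for the biases."""
    weights, biases = [], []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        # He initialization: scale by sqrt(2 / fan_in) to keep the gradients at a healthy scale
        weights.append(np.random.randn(n_out, n_in) * np.sqrt(2.0 / n_in))
        biases.append(np.full((n_out, 1), bias_value))
    return weights, biases

weights, biases = initialize_parameters([2, 3, 1])   # a 2-3-1 network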
86. def train(self, X, y, batch_size, epochs, learning_rate, validation_split=0.2, print_every=10):
    history_train_losses = []
    history_train_accuracies = []
    history_test_losses = []
    history_test_accuracies = []

    # train/test split
    x_train, x_test, y_train, y_test = train_test_split(X.T, y.T, test_size=validation_split)
    x_train, x_test, y_train, y_test = x_train.T, x_test.T, y_train.T, y_test.T

    for e in tqdm_notebook(range(epochs)):
        if x_train.shape[1] % batch_size == 0:
            n_batches = int(x_train.shape[1] / batch_size)
        else:
            n_batches = int(x_train.shape[1] / batch_size) - 1

        x_train, y_train = shuffle(x_train.T, y_train.T, random_state=0)
        x_train, y_train = x_train.T, y_train.T

        # preparation of the mini-batches of data and labels
        batches_x = [x_train[:, batch_size*i:batch_size*(i+1)] for i in range(0, n_batches)]
        batches_y = [y_train[:, batch_size*i:batch_size*(i+1)] for i in range(0, n_batches)]

        # keep track of KPIs (accuracy/loss) on the train and validation sets
        train_losses = []
        train_accuracies = []
        test_losses = []
        test_accuracies = []
87. Training over mini-batches:

        dw_per_epoch = [np.zeros(w.shape) for w in self.weights]
        db_per_epoch = [np.zeros(b.shape) for b in self.biases]

        for batch_x, batch_y in zip(batches_x, batches_y):
            batch_y_pred, pre_activations, activations = self.forward(batch_x)
            deltas = self.compute_deltas(pre_activations, batch_y, batch_y_pred)
            dW, db = self.backpropagate(deltas, pre_activations, activations)
            for i, (dw_i, db_i) in enumerate(zip(dW, db)):
                dw_per_epoch[i] += dw_i / batch_size
                db_per_epoch[i] += db_i / batch_size

            batch_y_train_pred = self.predict(batch_x)
            train_loss = cost_function(batch_y, batch_y_train_pred)
            train_losses.append(train_loss)
            train_accuracy = accuracy_score(batch_y.T, batch_y_train_pred.T)
            train_accuracies.append(train_accuracy)

            batch_y_test_pred = self.predict(x_test)
            test_loss = cost_function(y_test, batch_y_test_pred)
            test_losses.append(test_loss)
            test_accuracy = accuracy_score(y_test.T, batch_y_test_pred.T)
            test_accuracies.append(test_accuracy)

        # weight update
        for i, (dw_epoch, db_epoch) in enumerate(zip(dw_per_epoch, db_per_epoch)):
            self.weights[i] = self.weights[i] - learning_rate * dw_epoch
            self.biases[i] = self.biases[i] - learning_rate * db_epoch
88.     # monitoring the model performance
        history_train_losses.append(np.mean(train_losses))
        history_train_accuracies.append(np.mean(train_accuracies))
        history_test_losses.append(np.mean(test_losses))
        history_test_accuracies.append(np.mean(test_accuracies))

        if e % print_every == 0:
            print('Epoch {} / {} | train loss: {} | train accuracy: {} | val loss : {} | val accuracy : {}'.format(
                e, epochs, np.round(np.mean(train_losses), 3), np.round(np.mean(train_accuracies), 3),
                np.round(np.mean(test_losses), 3), np.round(np.mean(test_accuracies), 3)))

    history = {'epochs': epochs,
               'train_loss': history_train_losses,
               'train_acc': history_train_accuracies,
               'test_loss': history_test_losses,
               'test_acc': history_test_accuracies}
    return history

Inference method:

def predict(self, a):
    # input shape: (input_shape, batch_size)
    for w, b in zip(self.weights, self.biases):
        z = np.dot(w, a) + b
        a = activation(z)
    predictions = (a > 0.5).astype(int)
    return predictions
90. The same network, trained with PyTorch (each step annotated as a comment):

import torch
import torch.nn as nn
import torch.optim as optim

h = 50                                   # number of hidden neurons

# architecture: one-hidden-layer neural net
net = nn.Sequential(
    nn.Linear(2, h),
    nn.ReLU(),
    nn.Linear(h, 1),
    nn.Sigmoid()
)

# defining the optimizer
optimizer = optim.SGD(net.parameters(), lr=1)

# binary cross-entropy loss
criterion = nn.BCELoss()

# training loop over the data: one epoch
# (X, Y: training examples and labels, defined elsewhere)
for i in range(100):
    optimizer.zero_grad()              # set stored gradients to zero
    output = net(X[i])                 # forward pass
    loss = criterion(output, Y[i])     # compute the loss
    loss.backward()                    # backprop
    optimizer.step()                   # weight update