2. Basic Neuron Model In A Feedforward Network
• Inputs xi arrive through pre-synaptic connections
• Synaptic efficacy is modeled using real weights wi
• The response of the neuron is a nonlinear function f of its weighted inputs
4. Task
Plot the following types of neural activation functions.
1(a) Threshold Function
φ(v) = +1 for v ≥ 0
       0 for v < 0
1(b) Threshold Function
φ(v) = +1 for v ≥ 0
       -1 otherwise
2 Piecewise Linear Function
φ(v) = 1 for v ≥ +1/2
       v for +1/2 > v > -1/2
       0 for v ≤ -1/2
3(a) Sigmoid Function
φ(v) = 1/(1 + exp(-λv))
3(b) Sigmoid Function
φ(v) = 2/(1 + exp(-λv))
3(c) Sigmoid Function
φ(v) = tanh(λv)
For 3, vary the value of λ and show the changes in the graph.
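The functions above can be sketched in plain Python before plotting (a minimal sketch; the function names and the `lam` parameter name for λ are our choices, and the matplotlib plotting itself is omitted):

```python
import math

def threshold_binary(v):          # 1(a): +1 for v >= 0, 0 otherwise
    return 1.0 if v >= 0 else 0.0

def threshold_bipolar(v):         # 1(b): +1 for v >= 0, -1 otherwise
    return 1.0 if v >= 0 else -1.0

def piecewise_linear(v):          # 2: as written on the slide
    if v >= 0.5:
        return 1.0
    if v <= -0.5:
        return 0.0
    return v

def sigmoid(v, lam=1.0):          # 3(a): logistic sigmoid with slope lam
    return 1.0 / (1.0 + math.exp(-lam * v))

def sigmoid_scaled(v, lam=1.0):   # 3(b): as written on the slide
    return 2.0 / (1.0 + math.exp(-lam * v))

def sigmoid_tanh(v, lam=1.0):     # 3(c): hyperbolic tangent
    return math.tanh(lam * v)
```

To show the effect of λ in part 3, evaluate the same function over a grid of v values for several λ (e.g. 0.5, 1, 2) and overlay the curves.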
15. 1970s
The Backpropagation algorithm was first proposed by Paul Werbos in the 1970s. However, it was rediscovered in 1986 by Rumelhart and McClelland and became widely used.
It thus took some 30 years before the error backpropagation (or, in short, backprop) algorithm became popular.
17. Differences In Networks
Feedforward Networks
• Solutions are known
• Weights are learned
• Evolves in the weight space
• Used for:
  – Prediction
  – Classification
  – Function approximation
Feedback Networks
• Solutions are unknown
• Weights are prescribed
• Evolves in the state space
• Used for:
  – Constraint satisfaction
  – Optimization
  – Feature matching
18. Architecture
A BackProp network has at least three layers of units: an input layer, at least one intermediate hidden layer, and an output layer. Connection weights in a BackProp network are one-way. Units are connected in a feedforward fashion, with input units fully connected to units in the hidden layer and hidden units fully connected to units in the output layer. When a BackProp network is cycled, an input pattern is propagated forward to the output units through the intervening input-to-hidden and hidden-to-output weights.
19. Inputs To Neurons
• Arise from other neurons or from outside the network
• Nodes whose inputs arise outside the network are called input nodes and simply copy values
• An input may excite or inhibit the response of the neuron to which it is applied, depending upon the weight of the connection
21. Weights
• Represent synaptic efficacy and may be excitatory or inhibitory
• Normally, positive weights are considered excitatory while negative weights are thought of as inhibitory
• Learning is the process of modifying the weights in order to produce a network that performs some function
26. Backpropagation Preparation
• Training Set
A collection of input-output patterns that are used to train the network
• Testing Set
A collection of input-output patterns that are used to assess network performance
• Learning Rate η
A scalar parameter, analogous to step size in numerical integration, used to set the rate of weight adjustments
27. Learning
• Learning occurs during a training phase in which each input pattern in the training set is applied to the input units and then propagated forward.
• The pattern of activation arriving at the output layer is then compared with the correct output pattern to calculate an error signal.
• The error signal for each target output pattern is then backpropagated from the outputs to the inputs in order to appropriately adjust the weights in each layer of the network.
28. Learning
• The process goes on for several cycles until the error reduces to a predefined limit.
• After a BackProp network has learned the correct classification for a set of inputs, it can be tested on a second set of inputs to see how well it classifies untrained patterns.
• Thus, an important consideration in applying BackProp learning is how well the network generalizes.
29. The basic principles of the backpropagation algorithm are: (1) the error of the output signal of a neuron is used to adjust its weights such that the error decreases, and (2) the error in hidden layers is estimated proportional to the weighted sum of the (estimated) errors in the layer above.
31. During training, the data is presented to the network several thousand times. For each data sample, the current output of the network is calculated and compared to the "true" target value. The error signal δj of neuron j is computed from the difference between the target and the calculated output. For hidden neurons, this difference is estimated from the weighted error signals of the layer above. The error terms are then used to adjust the weights wij of the neural network.
32. A Pseudo-Code Algorithm
• Randomly choose the initial weights
• While error is too large
  – For each training pattern (presented in random order)
    • Apply the inputs to the network
    • Calculate the output for every neuron from the input layer, through the hidden layer(s), to the output layer
    • Calculate the error at the outputs
    • Use the output error to compute error signals for pre-output layers
    • Use the error signals to compute weight adjustments
    • Apply the weight adjustments
  – Periodically evaluate the network performance
35. Apply Inputs From A Pattern
• Apply the value of each input parameter to each input node
• Input nodes compute only the identity function
[Figure: feedforward network, inputs on the left, outputs on the right]
36. Calculate Outputs For Each Neuron Based On The Pattern
• The output from neuron j for pattern p is Opj, where
  Opj = 1 / (1 + exp(-λ netj))
  and
  netj = bias·Wbias,j + Σk Opk Wjk
• k ranges over the input indices and Wjk is the weight on the connection from input k to neuron j
[Figure: feedforward network, inputs on the left, outputs on the right]
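The forward computation for a single neuron can be sketched directly from these two formulas (a sketch; the function name and parameter names are our choices):

```python
import math

def neuron_output(inputs, weights, w_bias, lam=1.0, bias=1.0):
    """Forward step for one neuron j (slide 36):
    net_j = bias * W_bias + sum_k O_pk * W_jk
    O_pj  = 1 / (1 + exp(-lam * net_j))
    """
    net = bias * w_bias + sum(o * w for o, w in zip(inputs, weights))
    return 1.0 / (1.0 + math.exp(-lam * net))
```

A whole layer is just this function applied once per neuron, with the previous layer's outputs as `inputs`.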
37. Calculate The Error Signal For Each Output Neuron
• The output neuron error signal δpj is given by
  δpj = (Tpj − Opj) Opj (1 − Opj)
• Tpj is the target value of output neuron j for pattern p
• Opj is the actual output value of output neuron j for pattern p
38. Calculate The Error Signal For Each Hidden Neuron
• The hidden neuron error signal δpj is given by
  δpj = Opj (1 − Opj) Σk δpk Wkj
• where δpk is the error signal of a post-synaptic neuron k and Wkj is the weight of the connection from hidden neuron j to the post-synaptic neuron k
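The two error-signal formulas (slides 37 and 38) translate almost line for line into code (a sketch; the function names are our choices):

```python
def output_error(target, output):
    # slide 37: delta_pj = (T_pj - O_pj) * O_pj * (1 - O_pj)
    return (target - output) * output * (1.0 - output)

def hidden_error(output, deltas_next, weights_out):
    # slide 38: delta_pj = O_pj * (1 - O_pj) * sum_k delta_pk * W_kj
    # deltas_next: error signals of the post-synaptic neurons k
    # weights_out: weights W_kj from this hidden neuron j to each k
    return output * (1.0 - output) * sum(
        d * w for d, w in zip(deltas_next, weights_out))
```

Note the O(1−O) factor in both: it is the derivative of the logistic output, which is why these signals vanish when a neuron saturates near 0 or 1.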
39. Calculate And Apply Weight Adjustments
• Compute weight adjustments ΔWji at time t by
  ΔWji(t) = η δpj Opi
• Apply weight adjustments according to
  Wji(t+1) = Wji(t) + ΔWji(t)
• Some add a momentum term α·ΔWji(t−1)
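The adjustment rule, including the optional momentum term, can be sketched as a small helper (the function name and the default values of η and α are our assumptions, not from the slides):

```python
def weight_update(w, delta_j, o_i, eta=0.5, alpha=0.9, prev_dw=0.0):
    """One weight adjustment (slide 39):
    dW = eta * delta_pj * O_pi + alpha * dW_previous
    Returns the new weight and the adjustment (kept for the next
    momentum step)."""
    dw = eta * delta_j * o_i + alpha * prev_dw
    return w + dw, dw
```

The caller stores the returned `dw` and passes it back as `prev_dw` on the next presentation of the pattern, which is what gives momentum its smoothing effect.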
40. • Thus, the network adjusts its weights after each data sample. This learning process is in fact gradient descent on the error surface in weight space, with all its drawbacks: the learning algorithm is slow and prone to getting stuck in a local minimum.
42. Simulation Issues
• How to Select Initial Weights
• Local Minima
• Solutions to Local Minima
• Rate of Learning
• Stopping Criterion
• Initialization
43. • For the standard backpropagation algorithm, the initial weights of the multi-layer perceptron have to be relatively small. They can, for instance, be selected randomly from a small interval around zero. During training they are slowly adapted. Starting with small weights is crucial, because large weights are rigid and cannot be changed quickly.
44. Sequential & Batch Modes
For a given training set, back-propagation learning proceeds in two basic ways:
1. Sequential Mode
2. Batch Mode
45. Sequential Mode
• The sequential mode of back-propagation learning is also referred to as on-line, pattern, or stochastic mode.
• To be specific, consider an epoch consisting of N training examples arranged in the order (x(1),d(1)), …, (x(N),d(N)).
• The first example pair (x(1),d(1)) in the epoch is presented to the network, and the sequence of forward and backward computations described previously is performed, resulting in certain adjustments to the synaptic weights and bias levels of the network.
• The second example (x(2),d(2)) in the epoch is presented, and the sequence of forward and backward computations is repeated, resulting in further adjustments to the synaptic weights and bias levels. This process is continued until the last example pair (x(N),d(N)) in the epoch is accounted for.
46. Batch Mode
• In this mode of back-propagation learning, weight updating is performed after the presentation of all the training examples that constitute an epoch.
• For a particular epoch, the cost function is the average squared error, reproduced here in composite form:
  ξav = (1/2N) Σn=1..N Σj∈C ej²(n)
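The batch cost ξav can be computed as follows (a sketch; the function and argument names are ours):

```python
def batch_cost(errors_per_pattern):
    """Average squared error over an epoch (slide 46):
    xi_av = (1/2N) * sum over patterns n of sum over output
    neurons j in C of e_j(n)**2.
    errors_per_pattern: one list of output errors e_j per pattern."""
    N = len(errors_per_pattern)
    return sum(e * e for errs in errors_per_pattern for e in errs) / (2.0 * N)
```

In batch mode this quantity is evaluated once per epoch, and the weight gradients are accumulated over all N patterns before a single update is applied.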
47. • Let N denote the total number of patterns contained in the training set. The average squared error energy is obtained by summing ξ(n) over all n and then normalizing with respect to the set size N, as shown by:
• ξav = (1/N) Σn=1..N ξ(n)
48. Stopping Criteria
• The back-propagation algorithm cannot, in general, be shown to converge.
• To formulate a criterion, it is logical to think in terms of the unique properties of a local or global minimum.
• The back-propagation algorithm is considered to have converged when the Euclidean norm of the gradient vector reaches a sufficiently small gradient threshold.
• Alternatively, the back-propagation algorithm is considered to have converged when the absolute rate of change in the average squared error per epoch is sufficiently small.
• The drawback of this convergence criterion is that, for successful trials, learning time may be long.
50. • The back-propagation algorithm makes adjustments by computing the derivative, or slope, of the network error with respect to each neuron's output. It attempts to minimize the overall error by descending this slope to the minimum value for every weight. It advances one step down the slope each epoch. If the network takes steps that are too large, it may pass over the global minimum. If it takes steps that are too small, it may settle on local minima, or take an inordinate amount of time to arrive at the global minimum. The ideal step size for a given problem requires detailed, high-order derivative analysis, a task not performed by the algorithm.
53. Local Minima
• For simple two-layer networks (without a hidden layer), the error surface is bowl-shaped, and using gradient descent to minimize error is not a problem; the network will always find an errorless solution (at the bottom of the bowl). Such errorless solutions are called global minima.
• However, an extra hidden layer implies complex error surfaces. Since some minima are deeper than others, it is possible that gradient descent may not find a global minimum. Instead, the network may fall into local minima which represent suboptimal solutions.
54. • The algorithm cycles through the training samples as follows:
• Initialization
• Presentation of Training Examples
• Forward Computation
55. Initialization
• Assuming that no prior information is available, pick the synaptic weights and thresholds from a uniform distribution whose mean is zero and whose variance is chosen to make the standard deviation of the induced local fields of the neurons lie at the transition between the linear and saturated parts of the sigmoid activation function.
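This rule can be made concrete as follows. The slide gives no formula, so one common assumption (ours, not the slide's) is to set the weight variance to 1/fan_in, which keeps the induced local field near unit variance; a uniform U(−a, a) distribution has variance a²/3, so a = sqrt(3/fan_in):

```python
import random

def init_weights(fan_in, rng=random.Random(0)):
    """Zero-mean uniform initialization for one neuron's weights.
    Assumption: variance 1/fan_in, hence half-width a = sqrt(3/fan_in),
    since Var(U(-a, a)) = a**2 / 3."""
    a = (3.0 / fan_in) ** 0.5
    return [rng.uniform(-a, a) for _ in range(fan_in)]
```

With more inputs, each weight is drawn from a narrower interval, so the summed local field stays in the sigmoid's linear region regardless of fan-in.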
56. Presentation of Training Examples
Present the network with an epoch of training examples. For each example in the set, ordered in some fashion, perform the sequence of forward and backward computations described below.
57. Solutions to Local Minima
• Usual solution: more hidden layers. Logic: although additional hidden units increase the complexity of the error surface, the extra dimensionality increases the number of possible escape routes.
• Our solution: tunneling
58. Rate of Learning
• If the learning rate η is very small, then the algorithm proceeds slowly, but accurately follows the path of steepest descent in weight space.
• If η is large, the algorithm may oscillate.
• A simple method of effectively increasing the rate of learning is to modify the delta rule by including a momentum term:
  Δwji(n) = α Δwji(n−1) + η δj(n) yi(n)
  where α is a positive constant termed the momentum constant. This is called the generalized delta rule.
• The effect is that if the basic delta rule is consistently pushing a weight in the same direction, then it gradually gathers "momentum" in that direction.
61. An Example: Exclusive "OR"
• Training set
  – ((0.1, 0.1), 0.1)
  – ((0.1, 0.9), 0.9)
  – ((0.9, 0.1), 0.9)
  – ((0.9, 0.9), 0.1)
• Testing set
  – Use at least 121 pairs equally spaced on the unit square and plot the results
  – Omit the training set (if desired)
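This training set can be run through a minimal backprop loop assembled from the slide-37/38/39 formulas. This is a sketch only: the 2-2-1 architecture, the random seed, η, α, and the epoch count are all our assumptions, not prescribed by the slides.

```python
import math, random

TRAIN = [((0.1, 0.1), 0.1), ((0.1, 0.9), 0.9),
         ((0.9, 0.1), 0.9), ((0.9, 0.9), 0.1)]

rng = random.Random(1)
# Each neuron's weight list is [w_input1, w_input2, w_bias].
w_hid = [[rng.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(2)]
w_out = [rng.uniform(-0.5, 0.5) for _ in range(3)]

def sig(v):
    return 1.0 / (1.0 + math.exp(-v))

def forward(x):
    h = [sig(w[0] * x[0] + w[1] * x[1] + w[2]) for w in w_hid]
    o = sig(w_out[0] * h[0] + w_out[1] * h[1] + w_out[2])
    return h, o

def epoch_error():
    return sum((t - forward(x)[1]) ** 2 for x, t in TRAIN)

eta, alpha = 0.5, 0.9
prev = {"out": [0.0] * 3, "hid": [[0.0] * 3, [0.0] * 3]}
e0 = epoch_error()
for _ in range(5000):
    for x, t in TRAIN:
        h, o = forward(x)
        d_o = (t - o) * o * (1 - o)                               # slide 37
        d_h = [h[j] * (1 - h[j]) * d_o * w_out[j] for j in range(2)]  # slide 38
        for j, inp in enumerate([h[0], h[1], 1.0]):               # slide 39
            dw = eta * d_o * inp + alpha * prev["out"][j]
            w_out[j] += dw
            prev["out"][j] = dw
        for j in range(2):
            for k, inp in enumerate([x[0], x[1], 1.0]):
                dw = eta * d_h[j] * inp + alpha * prev["hid"][j][k]
                w_hid[j][k] += dw
                prev["hid"][j][k] = dw
```

After training, evaluating `forward` over a grid of at least 121 points on the unit square gives the plot the task asks for.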
64. Feedforward Network Training by Backpropagation: Process Summary
• Select an architecture
• Randomly initialize weights
• While error is too large
  – Select training pattern and feedforward to find actual network output
  – Calculate errors and backpropagate error signals
  – Adjust weights
• Evaluate performance using the test set
65. An Example (continued): Network Architecture
[Figure: network diagram with sample input (0.1, 0.9), bias inputs of 1, as-yet-unknown weights (marked ??), an actual output of ???, and a target output of 0.9]
67. Backpropagation
• Very powerful: with enough hidden units, a network can approximate any function.
• Has the usual trade-off of generalization vs. memorization: with too many units, the network will tend to memorize the input and not generalize well. Some schemes exist to "prune" the neural network.
68. BackProp networks are not limited in their use, because they can adapt their weights to acquire new knowledge. BackProp networks learn by example, and can be used to make predictions.
69. Write a program to train and simulate a neural network for each of the following:
– Input nodes = 2 and output nodes = 1
– Input nodes = 3 and output nodes = 1

Inputs   Output
A  B     Y
0  0     0
0  1     1
1  0     1
1  1     0

Inputs      Output
A  B  C     Y
0  0  0     0
0  0  1     0
0  1  0     0
0  1  1     0
1  0  0     1
1  0  1     1
1  1  0     1
1  1  1     1
70. References
• Neural Networks: A Comprehensive Foundation – Simon Haykin
• Introduction to Artificial Neural Systems – Jacek Zurada