An Artificial Neural Network (ANN) is an efficient approach for solving a variety of tasks by training on sample data. With proper training, an ANN is capable of generalizing and recognizing similarity among different input patterns. The main problem in using an ANN is parameter setting, because there is no definite, explicit method for selecting optimal parameters. A number of parameters must be decided upon, such as the number of layers, the number of neurons per layer, the number of training iterations, the number of samples, etc.
2. NEED FOR SETTING PARAMETER VALUES
1. LOCAL MINIMA
[Figure: plot of the RMS error Erms versus weight value, showing a global minimum at w1 and local minima at w2 and w3.]
3. NEED FOR SETTING PARAMETER VALUES
2. LEARNING RATE
Small learning rate – learning is slow and lengthy.
Large learning rate – the output may saturate or may swing across the desired value; training may take too long.
3. Learning improves and network training converges if the inputs and outputs are statistical, i.e. numeric, quantities.
4. TYPES OF TRAINING
Supervised Training
• Supplies the neural network with inputs and the desired outputs.
• The response of the network to the inputs is measured.
• The weights are modified to reduce the difference between the actual and desired outputs (see the sketch below).
Unsupervised Training
• Only supplies inputs.
• The neural network adjusts its own weights so that similar inputs cause similar outputs.
• The network identifies the patterns and differences in the inputs without any external assistance.
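As a minimal illustration of the supervised case, here is a sketch of one training step for a single linear neuron using the delta rule; the function and variable names, the learning rate, and the toy pattern are assumptions of this sketch, not part of the slides.

    import numpy as np

    def supervised_step(w, x, target, eta=0.1):
        # forward pass: actual response of the neuron to the input
        output = np.dot(w, x)
        # weight change reduces the difference between actual and desired output
        error = target - output
        return w + eta * error * x

    # usage: one training pattern
    w = np.zeros(3)
    w = supervised_step(w, np.array([1.0, 0.5, -0.2]), target=1.0)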
5. I. INITIALISATION OF WEIGHTS
Large initial weights will drive the output of layer 1 to saturation.
The network will then require a longer training time to emerge out of saturation.
Weights are therefore chosen small:
between -1 and 1, or
between -0.5 and 0.5 (see the sketch below).
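A minimal sketch of this kind of small random initialisation (NumPy; the layer sizes and the seed are chosen only for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    n_inputs, n_hidden = 4, 3

    # small initial weights, uniformly drawn between -0.5 and 0.5
    V = rng.uniform(-0.5, 0.5, size=(n_hidden, n_inputs))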
6. INITIALISATION OF WEIGHTS
PROBLEM WITH THIS CHOICE:
If some of the input parameters are very large, they will dominate the output.
e.g. x = [ 10 2 0.2 1 ]
SOLUTION:
Initialise the weights inversely proportional to the corresponding inputs.
The output will then not depend on any individual parameter, but on the total input as a whole.
7. RULE FOR INITIALISATION OF WEIGHTS
Weight between the input and the 1st layer:
vij = (1/2P) ∑p=1..P ( 1 / |xj(p)| )
where P is the total number of input patterns.
Weight between the 1st layer and the output layer:
wij = (1/2P) ∑p=1..P ( 1 / f( ∑j vij xj(p) ) )
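A minimal sketch of this initialisation rule (NumPy; the pattern matrix X, the choice of a logistic sigmoid for f, and the small eps guard against division by zero are assumptions of this sketch):

    import numpy as np

    def f(a):
        # assumed activation: logistic sigmoid
        return 1.0 / (1.0 + np.exp(-a))

    def init_weights(X, n_hidden, n_outputs, eps=1e-6):
        P, n_inputs = X.shape
        # v_ij = (1/2P) * sum_p 1/|x_j(p)|  -- same value for every hidden unit i
        v_row = np.sum(1.0 / (np.abs(X) + eps), axis=0) / (2.0 * P)
        V = np.tile(v_row, (n_hidden, 1))
        # w_ij = (1/2P) * sum_p 1 / f(sum_j v_ij x_j(p))
        hidden = f(X @ V.T)                      # shape (P, n_hidden)
        w_row = np.sum(1.0 / (hidden + eps), axis=0) / (2.0 * P)
        W = np.tile(w_row, (n_outputs, 1))
        return V, W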
8. II. FREQUENCY OF WEIGHT UPDATES
Per-pattern training: the weights change after every input pattern is applied.
The input set is repeated if the network is not yet trained.
Per-epoch training: an epoch is one pass through the entire set of training patterns, presenting each input to the network and computing the corresponding weight changes.
Many epochs are required to train the neural network.
The weight changes suggested by every input are accumulated into a single change applied at the end of each epoch, i.e. at the end of the whole set of patterns.
There is no change in the weights after each individual input.
Also called BATCH MODE training.
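A minimal sketch of the two update modes for a single linear neuron trained with the delta rule (the function names, the learning rate, and the data arrays X, T are assumptions of this sketch):

    import numpy as np

    def per_pattern_epoch(w, X, T, eta=0.1):
        # weights change after every input pattern
        for x, t in zip(X, T):
            w = w + eta * (t - np.dot(w, x)) * x
        return w

    def batch_epoch(w, X, T, eta=0.1):
        # weight changes are accumulated and applied once at the end of the epoch
        dw = np.zeros_like(w)
        for x, t in zip(X, T):
            dw += eta * (t - np.dot(w, x)) * x
        return w + dw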
9. FREQUENCY OF WEIGHT UPDATES
Advantages / Disadvantages
Batch mode training is not possible for on-line training.
For large applications with long training times, parallel processing may reduce the time of batch mode training.
Per-pattern training is more expensive, as the weights change more often.
Per-pattern training is suitable for small networks and small data sets.
10. III. LEARNING RATE FOR PERCEPTRON TRAINING ALGORITHM
Too small η – very slow learning.
Too large η – the output may saturate in one direction.
η = 0 – no weight change.
η = 1 – a common choice.
11. PROBLEM WITH η = 1
If η = 1, then ∆w = ± x.
New output = (w + ∆w)ᵀx = wᵀx ± xᵀx.
If |wᵀx| > xᵀx, the output keeps its sign and grows in one direction only.
For the correction to be able to change the sign, we need |∆wᵀx| > |wᵀx| with ∆w = ± ηx, i.e.
η xᵀx > |wᵀx|
η > |wᵀx| / xᵀx
In practice η is normally chosen between 0 and 1.
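For concreteness, a minimal perceptron weight-update step with an explicit learning rate (a sketch only; the bipolar ±1 target coding and the names are assumptions, not part of the slides):

    import numpy as np

    def perceptron_step(w, x, target, eta=0.5):
        # target is +1 or -1; update only when the pattern is misclassified
        if np.sign(np.dot(w, x)) != target:
            w = w + eta * target * x     # Δw = ±ηx
        return w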
12. III. LEARNING RATE FOR BACK PROPAGATION ALGORITHM
Use a large η in early iterations and steadily decrease it as the network converges.
Increase η at every iteration that improves performance by a significant amount, and vice versa.
Steadily double η until the error value worsens.
If the second derivative of E, ∇²E, is constant and low, η can be large.
If the second derivative of E, ∇²E, is large, η should be small.
These last rules require additional computation.
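A minimal sketch of the "increase η while the error improves, otherwise cut it back" heuristic mentioned above (the growth and shrink factors are assumptions of this sketch):

    def adapt_learning_rate(eta, prev_error, new_error, up=1.05, down=0.5):
        # grow the step while performance improves, shrink it when the error worsens
        if new_error < prev_error:
            return eta * up
        return eta * down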
14. MOMENTUM
Getting trapped in a local minimum can be prevented if the weight changes depend on the average gradient of the error, rather than on the gradient at a single point.
Averaging δE/δw over a small neighbourhood leads the network in the general direction of MSE decrease without getting stuck at local minima.
This may, however, become complex.
15. MOMENTUM
Shortcut method:
The weight change at the i-th iteration of the back propagation algorithm also depends on the immediately preceding weight changes.
This has an averaging effect and diminishes drastic fluctuations in the weight changes over consecutive iterations.
It is achieved by adding a momentum term to the weight update rule.
16. MOMENTUM
∆wkj(t+1) = η δk xj + α ∆wkj(t)
∆wkj(t) is the weight change applied at time t.
α is a constant, 0 ≤ α ≤ 1.
Disadvantage:
The past training trend can strongly bias the current training.
α depends on the application.
α = 0 – no effect of the past value.
α = 1 – no effect of the current value.
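A minimal sketch of this momentum update for a whole weight matrix (NumPy; writing the gradient term as the generic η·δ·xᵀ product, and all names and constants, are assumptions of this sketch):

    import numpy as np

    def momentum_step(W, dW_prev, delta, x, eta=0.1, alpha=0.9):
        # ΔW(t+1) = η δ xᵀ + α ΔW(t)
        dW = eta * np.outer(delta, x) + alpha * dW_prev
        return W + dW, dW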
17. What constitutes a “good” training set?
Samples must represent the general population.
Samples must contain members of each class.
Samples in each class must contain a wide range of variations or noise effects.
18. GENERALIZABILITY
Lack of generalizability occurs more in large networks trained on few input samples.
The inputs are repeated during training until the error reduces.
This leads to the network memorizing the input samples.
Such a trained network may behave correctly on the training data but fail on any unknown data.
This is also called overtraining.
19. GENERALIZABILITY- SOLUTION
The set of all known samples is broken into two orthogonal (independent) sets:
Training set – a group of samples used to train the neural network.
Testing set – a group of samples used to test the performance of the neural network;
◦ used to estimate the error rate.
Training continues as long as the error on the test data gradually reduces.
Training terminates as soon as the error on the test data increases.
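A minimal early-stopping sketch for a single linear neuron trained with the delta rule (the data arrays, names, and learning rate are assumptions of this sketch; in practice the test error would be monitored over several iterations, as the next slide notes):

    import numpy as np

    def mse(w, X, T):
        return float(np.mean((X @ w - T) ** 2))

    def train_with_early_stopping(X_train, T_train, X_test, T_test,
                                  eta=0.01, max_epochs=1000):
        w = np.zeros(X_train.shape[1])
        best_error = float("inf")
        for _ in range(max_epochs):
            for x, t in zip(X_train, T_train):       # weights change only on training data
                w = w + eta * (t - np.dot(w, x)) * x
            error = mse(w, X_test, T_test)           # weights do NOT change on test data
            if error > best_error:
                break                                # stop when the test error starts to rise
            best_error = error
        return w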
20. GENERALIZABILITY
[Figure: error E versus training time; the error on the training data decreases steadily, while the error on the test data first decreases and then starts to increase, marking the time at which training should stop.]
• Performance over the test data is monitored over several iterations, not just one iteration.
21. GENERALIZABILITY
The weights do NOT change on the test data.
Overtraining can be avoided by using a small number of parameters (hidden nodes and weights).
If the size of the training set is small, multiple sets can be created by adding small randomly generated noise or displacements:
if X = { x1, x2, x3, …, xn } then
X’ = { x1+ß1, x2+ß2, x3+ß3, …, xn+ßn }
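A minimal sketch of this noise-based augmentation (NumPy; the number of copies, noise scale, and seed are assumptions of this sketch):

    import numpy as np

    def augment_with_noise(X, n_copies=5, scale=0.01, seed=0):
        # create extra training sets X' = {x + beta} with small random noise beta
        rng = np.random.default_rng(seed)
        return [X + rng.normal(0.0, scale, size=X.shape) for _ in range(n_copies)]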
22. NO. OF HIDDEN LAYERS AND NODES
The numbers are mostly obtained by trial and error.
Too few nodes – the network may not be efficient.
Too many nodes –
computation is tedious and expensive;
the network may memorize the inputs and perform poorly on test data.
A network is called well trained if it performs well on data not used for training.
Hence the network should be capable of generalizing from the inputs, rather than memorizing them.
23. NO. OF HIDDEN LAYERS AND NODES
Methods:
Adaptive algorithm –
◦ Choose a large number of nodes and train.
◦ Gradually discard nodes one by one during training.
◦ Continue until performance drops below an acceptable level.
◦ The network has to be retrained at each change in the number of nodes.
◦ Or vice versa:
◦ Choose a small number of nodes and increase the number of nodes until performance is satisfactory (a sketch follows below).
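A minimal sketch of the second (growing) variant of this adaptive approach, using scikit-learn purely for illustration; the library choice, the data arrays, and the accuracy target are assumptions of this sketch, not part of the slides.

    from sklearn.neural_network import MLPClassifier

    def grow_hidden_nodes(X_train, y_train, X_test, y_test,
                          max_nodes=20, target_accuracy=0.95):
        # start small and add hidden nodes until performance is satisfactory;
        # the network is retrained from scratch at each change in size
        for n in range(1, max_nodes + 1):
            net = MLPClassifier(hidden_layer_sizes=(n,), max_iter=2000, random_state=0)
            net.fit(X_train, y_train)
            if net.score(X_test, y_test) >= target_accuracy:
                return net, n
        return net, max_nodes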
24. Let’s see how NN size advances:
Linear Classification:
[Figure: a single decision line L1 separating the region ax1+bx2+c > 0 from the region ax1+bx2+c < 0.]
25. Let’s see how NN size advances:
Two class problem - Nonlinear
[Figure: a nonlinear two-class problem whose decision boundary is formed by two lines, L1 and L2.]
26. Let’s see how NN size advances:
Two class problem - Nonlinear
[Figure: a nonlinear two-class problem whose decision region P is formed by combining four lines, L1 to L4.]
27. Let’s see how NN size advances:
Two class problem - Nonlinear
[Figure: a nonlinear two-class problem in which several such regions, P1 to P4, are combined to form the final decision region.]
29. NUMBER OF INPUT SAMPLES
As a rule of thumb: use 5 to 10 times as many samples as the number of weights to be trained.
Baum and Haussler suggest:
◦ P > |W| / (1 - a)
◦ P is the number of samples,
◦ |W| is the number of weights to be trained,
◦ a is the expected accuracy on the test set (see the worked example below).
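For illustration (assumed numbers, not from the slides): a network with |W| = 300 trainable weights and an expected test-set accuracy of a = 0.9 would need P > 300 / (1 - 0.9) = 3000 training samples under this rule.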
30. Non-numeric inputs
Non-numeric inputs such as colours have no inherent order.
They cannot be placed along an axis, e.g. red-blue-green-yellow.
Colour would then become position sensitive, resulting in erroneous training.
Hence assign a binary vector with one component corresponding to each colour, e.g. (see the sketch below):
red – 1 0 0 0, blue – 0 1 0 0, green – 0 0 1 0, yellow – 0 0 0 1
But the input dimension increases drastically.
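A minimal one-hot encoding sketch for such categorical inputs (the colour list and function name are only an example):

    import numpy as np

    COLOURS = ["red", "blue", "green", "yellow"]

    def one_hot(colour):
        # binary vector with a single 1 in the position of the given colour
        v = np.zeros(len(COLOURS))
        v[COLOURS.index(colour)] = 1.0
        return v

    # one_hot("green") -> array([0., 0., 1., 0.])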
31. Termination criteria
“Halt when goal is achieved.”
Perceptron training of linearly separable patterns –
◦ Halt on correct classification of all samples.
◦ Termination is assured if η is sufficiently small.
◦ The program may run indefinitely if η is not appropriate.
◦ Different choices of η may yield different classifications.
Back propagation algorithm using the delta rule –
◦ Termination can never be achieved with the above criterion, as the output can never be exactly +1 or -1.
◦ We therefore fix Emin, the minimum acceptable error, and terminate as the error goes below Emin.
32. Termination criteria
Perceptron training of linearly non-separable patterns –
◦ The above criterion would allow the procedure to run indefinitely.
◦ Compare the amount of progress made in the recent past.
◦ If the number of misclassifications has not changed over a large number of steps, the samples are probably not linearly separable.
◦ A minimum percentage of correct classifications can be fixed as the termination criterion (a sketch follows below).
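A minimal sketch combining these termination checks, an error threshold Emin plus a no-progress check; all names and thresholds are assumptions of this sketch.

    def should_terminate(error_history, e_min=0.01, patience=50):
        # stop if the error has dropped below the acceptable minimum Emin
        if error_history and error_history[-1] < e_min:
            return True
        # stop if there has been no improvement over the last `patience` steps
        if len(error_history) > patience:
            recent = error_history[-patience:]
            if min(recent) >= min(error_history[:-patience]):
                return True
        return False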