Neural Networks
Lectures 5+6
[Figure: single-layer network with inputs x1 … xn]
1st question:
what do the extra layers gain you? Start by looking at
what a single layer can't do.
The XOR problem:
0 + 0 = 0
1 + 1 = 2 = 0 mod 2
1 + 0 = 1
0 + 1 = 1
A perceptron does not work here: the two classes are not linearly separable, and a single layer can only draw one linear decision boundary.
Minsky & Papert (1969) offered a solution to the XOR problem by
combining perceptron unit responses using a second layer of
units.

[Figure: two-layer network for XOR — inputs plus bias (+1) feeding hidden units 1 and 2, which together with a bias (+1) feed output unit 3. Input layer, hidden layer, output.]
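One way to realise the two-layer solution with hard-threshold units is sketched below; the specific threshold values are an illustrative choice, not necessarily those of the lecture figure. One hidden unit computes OR, the other AND, and the output fires for "OR but not AND", which is exactly XOR:

```python
def step(a):
    # Hard-threshold unit: output 1 when the activation is positive.
    return 1 if a > 0 else 0

def two_layer_xor(x1, x2):
    h1 = step(x1 + x2 - 0.5)   # hidden unit 1: OR of the inputs
    h2 = step(x1 + x2 - 1.5)   # hidden unit 2: AND of the inputs
    return step(h1 - h2 - 0.5) # output: OR-but-not-AND = XOR

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", two_layer_xor(x1, x2))
```

No single threshold unit can compute this table, but the two-unit hidden layer makes the problem linearly separable for the output unit.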
Properties of architecture

Each unit computes a weighted sum of its inputs passed through an activation function:

y_i = f( Σ_{j=1}^{m} w_ij x_j + b_i )
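The per-unit equation y_i = f(Σ_j w_ij x_j + b_i) computes every unit of a layer at once as a matrix-vector product; the weights, bias, and tanh activation below are illustrative choices only:

```python
import numpy as np

def layer_output(W, b, x, f=np.tanh):
    # y_i = f( sum_j W[i, j] * x[j] + b[i] ) for every unit i at once;
    # tanh is just one possible elementwise activation f.
    return f(W @ x + b)

W = np.array([[1.0, -1.0],
              [0.5,  0.5]])
b = np.array([0.0, -0.5])
x = np.array([1.0, 1.0])
print(layer_output(W, b, x))
```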
•But with more layers, how are the weights for the first two layers
found when the error is computed only for layer 3?
•There is no direct error signal for the first layers!
This is the credit assignment problem.
We therefore want to find out how the weight wij affects the error, i.e. we
want:

∂E(t) / ∂w_ij(t)
Backpropagation learning algorithm 'BP'

Forward pass, for a network with inputs x_i, hidden units z_j, and outputs y_k:

u_j(t) = Σ_i v_ji(t) x_i(t)    (first-layer activations, weights v_ji)
z_j(t) = g(u_j(t))             (hidden-unit outputs)
a_k(t) = Σ_j w_kj(t) z_j(t)    (second-layer activations, weights w_kj)
y_k(t) = g(a_k(t))             (network outputs)
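The forward-pass equations map directly onto code. A sketch assuming a logistic sigmoid for g and small illustrative weight matrices v (input-to-hidden) and w (hidden-to-output):

```python
import numpy as np

def g(a):
    # Logistic sigmoid activation (slope constant k = 1).
    return 1.0 / (1.0 + np.exp(-a))

def forward(v, w, x):
    u = v @ x   # u_j = sum_i v[j, i] x_i   (first-layer activations)
    z = g(u)    # z_j = g(u_j)              (hidden-unit outputs)
    a = w @ z   # a_k = sum_j w[k, j] z_j   (second-layer activations)
    y = g(a)    # y_k = g(a_k)              (network outputs)
    return u, z, a, y

v = np.array([[0.5, -0.5], [1.0, 1.0]])
w = np.array([[1.0, -1.0]])
u, z, a, y = forward(v, w, np.array([1.0, 0.0]))
print(y)
```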
Backward Pass

We will use a sum-of-squares error measure. For each training pattern
we have:

E(t) = (1/2) Σ_k (d_k(t) - y_k(t))²

where d_k is the target value for dimension k. We want to know how
to modify the weights in order to decrease E. Use gradient descent, i.e.

w_ij(t+1) = w_ij(t) - η ∂E(t)/∂w_ij(t)

By the chain rule:

∂E(t)/∂w_ij(t) = [∂E(t)/∂a_i(t)] [∂a_i(t)/∂w_ij(t)]

and since a_i(t) = Σ_j w_ij(t) z_j(t) and u_i(t) = Σ_j v_ij(t) x_j(t):

∂a_i(t)/∂w_ij(t) = z_j(t),   ∂u_i(t)/∂v_ij(t) = x_j(t)    (Term B)
Term A: Let

δ_i(t) = -∂E(t)/∂u_i(t),   Δ_i(t) = -∂E(t)/∂a_i(t)

(error terms). We can evaluate these by the chain rule.

For output units we therefore have:

Δ_i(t) = -∂E(t)/∂a_i(t) = -g'(a_i(t)) ∂E(t)/∂y_i(t)
Δ_i(t) = g'(a_i(t)) (d_i(t) - y_i(t))

For hidden units we must use the chain rule:

δ_i(t) = -∂E(t)/∂u_i(t) = -Σ_j [∂E(t)/∂a_j(t)] [∂a_j(t)/∂u_i(t)]

and since ∂a_j(t)/∂u_i(t) = w_ji g'(u_i(t)):

δ_i = g'(u_i) Σ_j w_ji Δ_j
Combining A + B gives:

∂E(t)/∂v_ij(t) = -δ_i(t) x_j(t)
∂E(t)/∂w_ij(t) = -Δ_i(t) z_j(t)

So to achieve gradient descent in E we should change the weights by:

w_ij(t+1) = w_ij(t) + η Δ_i(t) z_j(t)
v_ij(t+1) = v_ij(t) + η δ_i(t) x_j(t)
          = v_ij(t) + η g'(u_i(t)) x_j(t) Σ_k Δ_k(t) w_ki
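The complete update rules can be sketched as a single training step. A logistic sigmoid is assumed for g, so g' = g(1-g); the weights, input, and target below are illustrative, not from the lecture:

```python
import numpy as np

def g(a):
    return 1.0 / (1.0 + np.exp(-a))   # logistic sigmoid, g' = g(1 - g)

def bp_step(v, w, x, d, eta=0.1):
    # Forward pass.
    u = v @ x; z = g(u)
    a = w @ z; y = g(a)
    # Output error terms: Delta_i = g'(a_i) (d_i - y_i).
    Delta = y * (1.0 - y) * (d - y)
    # Hidden error terms: delta_i = g'(u_i) sum_k Delta_k w_ki.
    delta = z * (1.0 - z) * (w.T @ Delta)
    # Gradient-descent weight changes from the update rules above.
    w = w + eta * np.outer(Delta, z)
    v = v + eta * np.outer(delta, x)
    return v, w

v = np.array([[0.5, -0.5], [1.0, 1.0]])
w = np.array([[1.0, -1.0]])
x = np.array([1.0, 0.5])
d = np.array([1.0])
for _ in range(200):
    v, w = bp_step(v, w, x, d)
```

Note that the hidden deltas are computed with the old second-layer weights, before those weights are changed.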
Multi-Layer Perceptron (2) - Dynamics of MLP

Topics:
• Summary of BP algorithm
• Network training
• Dynamics of BP learning
• Regularization

Algorithm (sequential)
Example: take a 2-2-2 network with first-layer weights and biases

v11 = -1   v12 = 0   v10 = 1
v21 = 0    v22 = 1   v20 = 1

and second-layer weights and biases

w11 = 1    w12 = 0   w10 = 1
w21 = -1   w22 = 1   w20 = 1

Each bias is treated as a weight from a constant +1 input, and for this example the activation function is the identity, g(u) = u.

Forward pass with input x1 = 0, x2 = 1. The first-layer activations are:

u1 = -1×0 + 0×1 + 1 = 1
u2 = 0×0 + 1×1 + 1 = 2
Calculate the first-layer outputs by passing the activations through the activation
function:

z1 = g(u1) = 1
z2 = g(u2) = 2
Calculate the 2nd-layer outputs (weighted sums through activation functions):

y1 = a1 = 1×1 + 0×2 + 1 = 2
y2 = a2 = -1×1 + 1×2 + 1 = 2
Backward pass. The second-layer weights are updated by

w_ij(t+1) = w_ij(t) + η Δ_i(t) z_j(t)
          = w_ij(t) + η (d_i(t) - y_i(t)) g'(a_i(t)) z_j(t)

With targets d1 = 1, d2 = 0 and g' = 1 (identity activation), the output error
terms are Δ1 = -1, Δ2 = -2, giving the products:

Δ1 z1 = -1   Δ1 z2 = -2
Δ2 z1 = -2   Δ2 z2 = -4
With learning rate η = 0.1, the second-layer weight changes give:

w11 = 1 + 0.1×(-1) = 0.9
w12 = 0 + 0.1×(-2) = -0.2
w21 = -1 + 0.1×(-2) = -1.2
w22 = 1 + 0.1×(-4) = 0.6
But first we must calculate the δ's for the hidden units. The Δ's propagate back
through the (old) second-layer weights:

Δ1 w11 = -1×1 = -1    Δ2 w21 = -2×(-1) = 2
Δ1 w12 = -1×0 = 0     Δ2 w22 = -2×1 = -2

δ1 = Δ1 w11 + Δ2 w21 = -1 + 2 = 1
δ2 = Δ1 w12 + Δ2 w22 = 0 - 2 = -2
And these are multiplied by the inputs:

δ1 x1 = 1×0 = 0     δ1 x2 = 1×1 = 1
δ2 x1 = -2×0 = 0    δ2 x2 = -2×1 = -2
Finally change the first-layer weights:

v_ij(t+1) = v_ij(t) + η δ_i(t) x_j(t)

v11 = -1 + 0.1×0 = -1
v12 = 0 + 0.1×1 = 0.1
v21 = 0 + 0.1×0 = 0
v22 = 1 + 0.1×(-2) = 0.8

The biases update in the same way, with input +1:

v10 = 1 + 0.1×1 = 1.1
v20 = 1 + 0.1×(-2) = 0.8
Now go forward again (normally we would use a new input vector). With the
updated weights and biases the hidden outputs become:

z1 = -1×0 + 0.1×1 + 1.1 = 1.2
z2 = 0×0 + 0.8×1 + 0.8 = 1.6
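The whole worked example can be checked in a few lines; the targets d = (1, 0) and the learning rate η = 0.1 are inferred from the Δ values and weight changes shown above:

```python
import numpy as np

# Worked example: identity activation g(u) = u, eta = 0.1,
# biases treated as weights from a constant +1 input.
eta = 0.1
v  = np.array([[-1.0, 0.0], [0.0, 1.0]]); v0 = np.array([1.0, 1.0])
w  = np.array([[1.0, 0.0], [-1.0, 1.0]]); w0 = np.array([1.0, 1.0])
x  = np.array([0.0, 1.0])
d  = np.array([1.0, 0.0])   # targets implied by Delta1 = -1, Delta2 = -2

# Forward pass.
z = v @ x + v0              # [1, 2]
y = w @ z + w0              # [2, 2]

# Backward pass (g' = 1 for the identity activation).
Delta = d - y               # [-1, -2]
delta = w.T @ Delta         # [1, -2]  (uses the old second-layer weights)

# Weight and bias updates.
w += eta * np.outer(Delta, z); w0 += eta * Delta
v += eta * np.outer(delta, x); v0 += eta * delta

# Second forward pass reproduces z1 = 1.2, z2 = 1.6.
print(v @ x + v0)           # [1.2, 1.6]
```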
The derivative g'(a_i(t)) = dg(a)/da depends on the activation function. A
common choice is the sigmoid

g(a) = 1 / (1 + exp(-k a))

where k is a positive constant. The sigmoidal function gives a value in the
range 0 to 1 (note: when the net input is 0, the output is 0.5). Alternatively
one can use tanh(ka), which has the same shape but range -1 to 1. This is the
input-output function of a neuron (rate coding assumption).

The derivative of the sigmoidal function is

g'(a_i(t)) = k exp(-k a_i(t)) / [1 + exp(-k a_i(t))]² = k g(a_i(t)) [1 - g(a_i(t))]
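The closed-form derivative g'(a) = k g(a)(1 - g(a)) can be checked against a finite difference; the test point a = 0.7 and slope k = 2 below are arbitrary choices:

```python
import math

def g(a, k=1.0):
    # Logistic sigmoid with slope parameter k.
    return 1.0 / (1.0 + math.exp(-k * a))

def g_prime(a, k=1.0):
    # Closed form from above: g'(a) = k g(a) (1 - g(a)).
    s = g(a, k)
    return k * s * (1.0 - s)

# Compare against a central finite difference.
a, k, h = 0.7, 2.0, 1e-6
numeric = (g(a + h, k) - g(a - h, k)) / (2 * h)
print(abs(numeric - g_prime(a, k)) < 1e-8)  # True
```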
Sequential mode
• Less storage for each weighted connection
• Random order of presentation and updating per pattern means the
search of weight space is stochastic, reducing the risk of local
minima
• Able to take advantage of any redundancy in the training set (i.e.
the same pattern occurring more than once, especially in large,
difficult training sets)
• Simpler to implement

Batch mode:
• Faster learning than sequential mode
• Easier from a theoretical viewpoint
• Easier to parallelise
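The two modes differ only in when the weight change is applied: after each pattern, or once from the gradient summed over all patterns. A minimal sketch on a toy linear unit (the data and learning rate are illustrative):

```python
import numpy as np

# Toy problem: one linear output unit, squared error, three patterns.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
d = np.array([1.0, -1.0, 0.0])

def grad(w, x, t):
    # Gradient of 0.5 * (w.x - t)^2 with respect to w.
    return (x @ w - t) * x

eta = 0.1

w_seq = np.zeros(2)
for x, t in zip(X, d):            # sequential: update after each pattern
    w_seq -= eta * grad(w_seq, x, t)

w_bat = np.zeros(2)
total = sum(grad(w_bat, x, t) for x, t in zip(X, d))
w_bat -= eta * total              # batch: one update from the summed gradient

print(w_seq, w_bat)
```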
Dynamics of BP learning

Recall that the mean squared error is typically used:

E(t) = (1/2) Σ_{k=1}^{p} (d_k(t) - O_k(t))²

The idea is to reduce E by moving downhill on the error surface, into its
valleys.
Selecting initial weight values

Regularization adds a penalty term to the error:

E(F) = E_s(F) + λ E_R(F)

where E_s(F) is the standard (data) error, E_R(F) is a regularization term
penalizing complex or non-smooth solutions, and λ controls the trade-off.

[Figure: a fitted curve without regularization vs with regularization]
Momentum

In batch mode the error is summed over all p patterns and M outputs:

E = Σ_{n=1}^{p} Σ_{j=1}^{M} [d_j(n) - y_j(n)]²
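The momentum update itself is not spelled out above, so here is a sketch of the standard form: the new weight change blends the current gradient step with a fraction α of the previous change, which smooths descent along the valleys of the error surface. The quadratic test function and constants are illustrative:

```python
def momentum_step(w, grad, velocity, eta=0.1, alpha=0.9):
    # velocity(t+1) = alpha * velocity(t) - eta * gradient;
    # alpha is the momentum coefficient, eta the learning rate.
    velocity = alpha * velocity - eta * grad
    return w + velocity, velocity

# Minimise E(w) = 0.5 * w^2, whose gradient is simply w.
w, vel = 5.0, 0.0
for _ in range(300):
    w, vel = momentum_step(w, w, vel)
print(w)
```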
Hold-out method
The simplest method when data is not scarce: hold back part of the data and use
it only for validation.
where H(x_1, …, x_m) = Σ_{i=1}^{k} a_i f( Σ_{j=1}^{m} w_ij x_j + b_i )
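A minimal hold-out split for the method above; the dataset shape and the 80/20 split fraction are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# A generic dataset of 100 patterns; hold one fifth back for validation.
data = rng.normal(size=(100, 4))
idx = rng.permutation(len(data))
n_val = len(data) // 5
val, train = data[idx[:n_val]], data[idx[n_val:]]
print(train.shape, val.shape)   # (80, 4) (20, 4)
```

Training uses only `train`; `val` is reserved for estimating generalisation error, e.g. to decide when to stop training.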