
Feed-forward Neural Networks

Ying Wu
Electrical Engineering and Computer Science
Northwestern University
Evanston, IL 60208
http://www.eecs.northwestern.edu/~yingwu
Connectionism

How does our brain process information?

Are we Turing Machines?

Things that are difficult for Turing Machines

Perception is difficult for Turing Machines

We have so many neurons

How do they work?

Can we have computational models?

Connectionism vs. Computationalism


History

in 1940s

The model for neuron (McCulloch & Pitts, 1943)

The Hebbian learning rule (Hebb, 1949)

in 1950s

Perceptron (Rosenblatt, 1950s)

in 1960s

Limitation of Perceptron (Minsky & Papert, 1969)

Expert systems were so hot back then

again in 1980s

Hopfield feed-back network (Hopfield, 1982)

Back-propagation algorithm (Rumelhart & Le Cun, 1986)

Expert systems

again in 1990s

Overfitting in neural networks

SVM was so hot (Vapnik, 1995)

where to go?
Outline
Neuron Model
Multi-Layer Perceptron
Radial Basis Function Networks
Neuron: the Basic Unit
[Figure: a single neuron with inputs $x_1, \ldots, x_d$, weights $w_1, \ldots, w_d$, and one output]

Input $x = [1, x_1, \ldots, x_d]^T \in \mathbb{R}^{d+1}$

Connection weights (i.e., synapses) $w = [w_0, w_1, \ldots, w_d]^T \in \mathbb{R}^{d+1}$

Net activation:
$$net = \sum_{i=0}^{d} w_i x_i = w^T x$$

Activation function and output:
$$y = f(net) = f(w^T x)$$
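As a concrete illustration of this neuron model, here is a minimal NumPy sketch; the function name and the example numbers are my own, not from the slides.

```python
import numpy as np

def neuron_output(x, w, f):
    """Single neuron: y = f(net) = f(w^T x), with x augmented by a leading 1."""
    x_aug = np.concatenate(([1.0], x))   # x = [1, x_1, ..., x_d]^T
    net = w @ x_aug                      # net = sum_{i=0}^{d} w_i x_i
    return f(net)

# Example: d = 2 inputs, sign activation
y = neuron_output(np.array([0.5, -1.2]), np.array([0.1, 0.8, 0.3]), np.sign)
```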
Activation Function

Activation function introduces nonlinearity

We can use
$$f(x) = \mathrm{sgn}(x) = \begin{cases} 1 & x \geq 0 \\ -1 & x < 0 \end{cases}$$

Or we can use the Sigmoid function
$$f(x) = \frac{2}{1 + e^{-2x}} - 1, \quad f(x) \in (-1, 1)$$
with derivative
$$f'(x) = 1 - f^2(x)$$

Or
$$f(x) = \frac{1}{1 + e^{-x}}, \quad f(x) \in (0, 1)$$
with derivative
$$f'(x) = f(x)\,[1 - f(x)] = \frac{e^{-x}}{(1 + e^{-x})^2}$$
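The two sigmoids above and their derivatives translate directly into code; this is a small sketch of mine, written straight from the formulas on this slide.

```python
import numpy as np

def sigmoid_bipolar(x):
    """f(x) = 2 / (1 + e^{-2x}) - 1, range (-1, 1); identical to tanh(x)."""
    return 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0

def sigmoid_bipolar_deriv(x):
    """f'(x) = 1 - f(x)^2."""
    fx = sigmoid_bipolar(x)
    return 1.0 - fx ** 2

def sigmoid_logistic(x):
    """f(x) = 1 / (1 + e^{-x}), range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_logistic_deriv(x):
    """f'(x) = f(x)(1 - f(x)) = e^{-x} / (1 + e^{-x})^2."""
    fx = sigmoid_logistic(x)
    return fx * (1.0 - fx)
```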
Outline
Neuron Model
Multi-Layer Perceptron
Radial Basis Function Networks
Perceptron
[Figure: a two-layer perceptron with input nodes $x_1, \ldots, x_d$ and output nodes $z_1, \ldots, z_c$]

Two layers (input and linear output)

Desired output $t = [t_1, \ldots, t_c]^T \in \mathbb{R}^c$

Actual output $z_i = w_i^T x, \quad i = 1, \ldots, c$

Learning (Widrow-Hoff)
$$w_i(t+1) = w_i(t) + \eta (t_i - z_i)\, x = w_i(t) + \eta (t_i - w_i^T x)\, x$$

It only works for linearly separable patterns

It cannot even solve the simple XOR problem


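A minimal sketch of the Widrow-Hoff rule for this two-layer perceptron; the learning-rate value, epoch count, and array shapes are illustrative assumptions of mine.

```python
import numpy as np

def widrow_hoff(X, T, eta=0.01, epochs=100):
    """Train linear output units z_i = w_i^T x with the Widrow-Hoff rule.

    X : (N, d+1) inputs with a leading 1 for the bias
    T : (N, c) desired outputs
    Returns W of shape (c, d+1), one weight vector w_i per row.
    """
    N, d1 = X.shape
    c = T.shape[1]
    W = np.zeros((c, d1))
    for _ in range(epochs):
        for x, t in zip(X, T):
            z = W @ x                         # actual outputs z_i = w_i^T x
            W += eta * np.outer(t - z, x)     # w_i <- w_i + eta (t_i - z_i) x
    return W
```

Because each $z_i$ is a linear function of $x$, no amount of training lets this model represent the XOR mapping, which is the limitation noted above.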
Multi-layer Network
[Figure: a three-layer network with input layer $x_1, \ldots, x_d$, a hidden layer, and an output layer]

Input layer $x = [1, x_1, \ldots, x_d]^T \in \mathbb{R}^{d+1}$

Hidden layer $y_j = f(w_j^T x), \quad j = 1, \ldots, n_H$

Output layer $z_k = f(w_k^T y), \quad k = 1, \ldots, c$

The weight between hidden node $y_j$ and input node $x_i$ is $w_{ji}$

The weight between output node $z_k$ and hidden node $y_j$ is $w_{kj}$

May have multiple hidden layers


Discriminant Function

For a 3-layer network, the discriminant function is
$$g_k(x) = z_k = f\left( \sum_{j=1}^{n_H} w_{kj} \, f\left( \sum_{i=1}^{d} w_{ji} x_i + w_{j0} \right) + w_{k0} \right)$$

Kolmogorov showed that a 3-layer structure is enough to approximate any nonlinear function

We expect a 3-layer MLP to be able to form any decision boundary

Certainly, the nonlinearity depends on $n_H$, the number of hidden units

Larger $n_H$ results in overfitting

Smaller $n_H$ leads to underfitting
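A sketch of this discriminant function in NumPy; the layout is my own convention (bias weights $w_{j0}$ and $w_{k0}$ stored in column 0 of each weight matrix), and any activation $f$ from the earlier slide can be plugged in.

```python
import numpy as np

def mlp_discriminant(x, W_hidden, W_out, f):
    """g_k(x) = f( sum_j w_kj f( sum_i w_ji x_i + w_j0 ) + w_k0 ).

    x        : input of length d
    W_hidden : (n_H, d+1) weights w_ji, column 0 holds the biases w_j0
    W_out    : (c, n_H+1) weights w_kj, column 0 holds the biases w_k0
    f        : activation function, applied elementwise
    """
    x_aug = np.concatenate(([1.0], x))      # prepend the bias input 1
    y = f(W_hidden @ x_aug)                 # hidden layer y_j = f(w_j^T x)
    y_aug = np.concatenate(([1.0], y))      # prepend the bias unit 1
    return f(W_out @ y_aug)                 # outputs z_k = g_k(x)
```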
Training the Network

Desired output of the network $t = [t_1, \ldots, t_c]^T$

The objective in training the weights $\{w_j, w_k\}$:
$$J(w) = \frac{1}{2} \| t - z \|^2$$

We need to find the best set of $\{w_j, w_k\}$ that minimizes $J$

It can be done through gradient-based optimization

In a general form
$$w(k+1) = w(k) - \eta \frac{\partial J}{\partial w}$$

To make it clear, let's do it component by component


Back-propagation (BP): output-hidden $w_k$

$w_{kj}$ is the weight between output node $k$ and hidden node $j$
$$\frac{\partial J}{\partial w_{kj}} = \frac{\partial J}{\partial net_k} \frac{\partial net_k}{\partial w_{kj}}$$

Define the sensitivity for a general node $i$ as
$$\delta_i = -\frac{\partial J}{\partial net_i}$$

In this case, for the output node $k$
$$\delta_k = -\frac{\partial J}{\partial net_k} = -\frac{\partial J}{\partial z_k} \frac{\partial z_k}{\partial net_k} = (t_k - z_k)\, f'(net_k)$$

As $net_k = \sum_{j=1}^{n_H} w_{kj} y_j$, it is clear that
$$\frac{\partial net_k}{\partial w_{kj}} = y_j$$

So we have $\Delta w_{kj} = \eta \delta_k y_j = \eta (t_k - z_k)\, f'(net_k)\, y_j$

This is a generalization of Widrow-Hoff


Back-propagation (BP): hidden-input $w_j$

As before,
$$\frac{\partial J}{\partial w_{ji}} = \frac{\partial J}{\partial y_j} \underbrace{\frac{\partial y_j}{\partial net_j} \frac{\partial net_j}{\partial w_{ji}}}_{\text{easy}}$$

The first factor is a little more complicated
$$\frac{\partial J}{\partial y_j} = \frac{\partial}{\partial y_j}\left[ \frac{1}{2} \sum_{k=1}^{c} (t_k - z_k)^2 \right] = -\sum_{k=1}^{c} (t_k - z_k) \frac{\partial z_k}{\partial y_j} = -\sum_{k=1}^{c} (t_k - z_k) \frac{\partial z_k}{\partial net_k} \frac{\partial net_k}{\partial y_j}$$
$$= -\sum_{k=1}^{c} (t_k - z_k)\, f'(net_k)\, w_{kj} = -\sum_{k=1}^{c} \delta_k w_{kj}$$

We can compute the sensitivity for the hidden node $j$
$$\delta_j = -\frac{\partial J}{\partial net_j} = -\frac{\partial J}{\partial y_j} \frac{\partial y_j}{\partial net_j} = f'(net_j) \sum_{k=1}^{c} w_{kj} \delta_k$$
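Combining the sensitivities from this slide and the previous one, a small vectorized sketch of mine; it assumes the squared-error criterion, an elementwise activation with derivative `f_prime`, and no bias terms.

```python
import numpy as np

def sensitivities(t, z, net_out, net_hidden, W_out, f_prime):
    """delta_k = (t_k - z_k) f'(net_k);  delta_j = f'(net_j) sum_k w_kj delta_k.
    W_out[k, j] holds the output-hidden weight w_kj."""
    delta_k = (t - z) * f_prime(net_out)                  # output-node sensitivities
    delta_j = f_prime(net_hidden) * (W_out.T @ delta_k)   # hidden-node sensitivities
    return delta_k, delta_j
```

The weight changes are then $\Delta w_{kj} = \eta \delta_k y_j$ and $\Delta w_{ji} = \eta \delta_j x_i$.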
Why is it Called Back Propagation?
[Figure: output nodes $1, 2, \ldots, k, \ldots, c$ connected to hidden node $j$ through weights $w_{kj}$]

The sensitivity $\delta_i$ reflects the information at node $i$

$\delta_j$ of a hidden node $j$ combines two sources of information:

a linear combination of the sensitivities from the output layer, $\sum_{k=1}^{c} w_{kj} \delta_k$

its local information $f'(net_j)$

The learning rule for the hidden-input weights is
$$\Delta w_{ji} = \eta \delta_j x_i = \eta \left[ f'(net_j) \sum_{k=1}^{c} w_{kj} \delta_k \right] x_i$$
Algorithm: Back-propagation (BP)
Algorithm 1: Stochastic Back-propagation
Init: $n_H$, $w$, stopping criterion $\theta$, $\eta$, $k = 0$
Do $k \leftarrow k + 1$
    $x^k \leftarrow$ randomly picked sample
    forward: compute $y$ and then $z$
    backward: compute $\{\delta_k\}$ and then $\{\delta_j\}$
    $w_{kj} \leftarrow w_{kj} + \eta \delta_k y_j$
    $w_{ji} \leftarrow w_{ji} + \eta \delta_j x_i$
Until $J(w) < \theta$
Return $w$

This is the one-sample BP training

It can be easily extended to batch training


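A compact NumPy sketch of Algorithm 1 for a 3-layer network with the logistic sigmoid; the initialization scale, default learning rate, and epoch cap are illustrative choices of mine, and bias weights are omitted for brevity.

```python
import numpy as np

def f(x):
    return 1.0 / (1.0 + np.exp(-x))               # logistic sigmoid

def f_prime(net):
    a = f(net)
    return a * (1.0 - a)

def stochastic_bp(X, T, n_H, eta=0.1, theta=1e-3, max_epochs=1000, seed=0):
    """One-sample back-propagation; X is (N, d), T is (N, c)."""
    rng = np.random.default_rng(seed)
    d, c = X.shape[1], T.shape[1]
    W_h = rng.normal(scale=0.1, size=(n_H, d))    # hidden weights w_ji
    W_o = rng.normal(scale=0.1, size=(c, n_H))    # output weights w_kj
    for _ in range(max_epochs):
        for idx in rng.permutation(len(X)):       # x^k: randomly picked sample
            x, t = X[idx], T[idx]
            net_h = W_h @ x; y = f(net_h)         # forward: hidden layer
            net_o = W_o @ y; z = f(net_o)         # forward: output layer
            d_k = (t - z) * f_prime(net_o)        # backward: {delta_k}
            d_j = f_prime(net_h) * (W_o.T @ d_k)  # backward: {delta_j}
            W_o += eta * np.outer(d_k, y)         # w_kj <- w_kj + eta delta_k y_j
            W_h += eta * np.outer(d_j, x)         # w_ji <- w_ji + eta delta_j x_i
        J = 0.5 * np.sum((T - f(f(X @ W_h.T) @ W_o.T)) ** 2)
        if J < theta:                             # stop when J(w) < theta
            break
    return W_h, W_o
```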
Bayes Discriminant and MLP

In linear discriminant models, we know that the MSE and MMSE solutions approximate the Bayes discriminant asymptotically

MLP can do better by approximating the posteriors

Suppose we have $c$ classes and the desired output is $t_k(x) = 1$ if $x \in \omega_k$, and $0$ otherwise

The MLP criterion
$$J(w) = \sum_{x} [g_k(x; w) - t_k]^2 = \sum_{x \in \omega_k} [g_k(x; w) - 1]^2 + \sum_{x \notin \omega_k} [g_k(x; w) - 0]^2$$

It can be shown that minimizing $\lim_{n \to \infty} J(w)$ is equivalent to minimizing
$$\int [g_k(x; w) - P(\omega_k \mid x)]^2\, p(x)\, dx$$

This means the output units represent the posteriors
$$g_k(x; w) \simeq P(\omega_k \mid x)$$
Outputs as Probabilities

If we want the outputs of the MLP to be posteriors

The desired outputs in training should be in $[0, 1]$

As we have a limited number of training samples, the outputs may not sum to 1

We can use a different activation function for the output layer

Softmax activation
$$z_k = \frac{e^{net_k}}{\sum_{m=1}^{c} e^{net_m}}$$
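A small sketch of the softmax activation; subtracting the maximum is a standard numerical-stability trick I have added, and it does not change the result.

```python
import numpy as np

def softmax(net):
    """z_k = exp(net_k) / sum_m exp(net_m); outputs are in (0, 1) and sum to 1."""
    e = np.exp(net - np.max(net))   # shift by max(net) for numerical stability
    return e / e.sum()
```

Because the outputs lie in $(0, 1)$ and sum to 1, they can be read directly as posterior estimates.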
Practice: Number of Hidden Units

The number of hidden nodes is the most critical parameter in an MLP

It determines the expressive power of the network and the complexity of the decision boundary

A smaller number leads to a simpler boundary, while a larger number can produce a very complicated one

Overfitting and generalizability

Unfortunately, there is no foolproof method to choose this parameter

Many heuristics have been proposed


Practice: Learning Rates

Another critical parameter in an MLP is the learning rate $\eta$

In principle, if $\eta$ is small enough, the iteration converges

But very slowly

To speed it up, we need to use 2nd-order gradient information, e.g., Newton's method, in training
Practice: Plateaus and Momentum

Error surfaces often have plateaus where $\frac{\partial J(w)}{\partial w}$ is very small

Then the iteration can hardly move on

Introduce momentum to push it through

Momentum uses the weight change at the previous iteration
$$w(k+1) = w(k) + (1 - \alpha)\,\Delta w_{bp}(k) + \alpha\,\Delta w(k-1)$$

Algorithm 2: Stochastic Back-propagation with Momentum
Init: $n_H$, $w$, stopping criterion $\theta$, $\eta$, $\alpha$, $k = 0$
Do $k \leftarrow k + 1$
    $x^k \leftarrow$ randomly picked sample
    $b_{kj} \leftarrow (1 - \alpha)\,\eta \delta_k y_j + \alpha b_{kj}$;  $b_{ji} \leftarrow (1 - \alpha)\,\eta \delta_j x_i + \alpha b_{ji}$
    $w_{kj} \leftarrow w_{kj} + b_{kj}$;  $w_{ji} \leftarrow w_{ji} + b_{ji}$
Until $J(w) < \theta$
Return $w$
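A sketch of the momentum update used inside Algorithm 2; the variable names are mine, and `grad_step` stands for the plain BP increment $\eta \delta_k y_j$ or $\eta \delta_j x_i$.

```python
def momentum_step(w, b_prev, grad_step, alpha):
    """b <- (1 - alpha) * grad_step + alpha * b_prev, then w <- w + b.
    With alpha = 0 this reduces to plain stochastic back-propagation."""
    b = (1.0 - alpha) * grad_step + alpha * b_prev
    return w + b, b
```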
Outline
Neuron Model
Multi-Layer Perceptron
Radial Basis Function Networks
Radial Basis Function Network
[Figure: an RBF network with inputs $x_1, \ldots, x_d$, hidden units with kernel activation $K$, and output weights $w_{kj}$]

The input-hidden weights are all 1

The activation function for the hidden units is the Radial Basis Function (RBF), e.g.,
$$K(\|x - x_c\|) = \exp\left\{ -\frac{\|x - x_c\|^2}{2\sigma^2} \right\}$$

The output
$$z_k(x) = \sum_{j=0}^{n_H} w_{kj}\, K(x, x_j)$$
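A sketch of the RBF network output with a Gaussian kernel; the array shapes are my own convention, and the $j = 0$ bias term is dropped to keep the code short.

```python
import numpy as np

def rbf_forward(x, centers, sigma, W):
    """z_k(x) = sum_j w_kj K(x, x_j) with K(x, x_j) = exp(-||x - x_j||^2 / (2 sigma^2)).
    centers is (n_H, d) holding the basis centers x_j; W is (c, n_H)."""
    K = np.exp(-np.sum((centers - x) ** 2, axis=1) / (2.0 * sigma ** 2))
    return W @ K
```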
Interpretation

It can be treated as function approximation, a linear combination of a set of bases

The hidden units transform the original feature space to another (high-dimensional) feature space by using the kernel

We hope the data become linearly separable in the new feature space

This is what we did in the Kernel Machines!


Learning

Parameters in the RBF network:

the basis center $x_j$ for each hidden node

the variance $\sigma^2$ of the RBF

the weights $W$

Once the RBF parameters are set, $W$ can be solved by pseudo-inverse or Widrow-Hoff

Finding the RBF parameters is not easy

Uniformly select the centers

Use the data cluster centers as $x_j$
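Once the centers and the variance are fixed, the output weights can be obtained in closed form; here is a sketch of the pseudo-inverse solution (my own code, consistent with the shapes used in the RBF sketch above).

```python
import numpy as np

def fit_rbf_weights(X, T, centers, sigma):
    """Least-squares fit of W from Phi @ W^T ~= T, where Phi[n, j] = K(x_n, x_j).
    X is (N, d), T is (N, c), centers is (n_H, d)."""
    d2 = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=2)   # (N, n_H)
    Phi = np.exp(-d2 / (2.0 * sigma ** 2))                            # kernel matrix
    return (np.linalg.pinv(Phi) @ T).T     # (c, n_H), matching z = W @ K above
```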
