
Feed-forward Neural Networks

Ying Wu
Electrical Engineering and Computer Science
Northwestern University
Evanston, IL 60208
http://www.eecs.northwestern.edu/~yingwu
Connectionism

How does our brain process information?

Are we Turing Machines?

Things that are difficult for Turing Machines

Perception is difficult for Turing Machines

We have so many neurons

How do they work?

Can we have computational models?

Connectionism vs. Computationalism


History

in 1940s

The model for neuron (McCulloch & Pitts, 1943)

The Hebbian learning rule (Hebb, 1949)

in 1950s

Perceptron (Rosenblatt, 1950s)

in 1960s

Limitation of Perceptron (Minsky & Papert, 1969)

Expert systems were so hot back then

again in 1980s

Hopfield feed-back network (Hopfield, 1982)

Back-propagation algorithm (Rumelhart & Le Cun, 1986)

Expert systems

again in 1990s

Overfitting in neural networks

SVM was so hot (Vapnik, 1995)

where to go?
Outline
Neuron Model
Multi-Layer Perceptron
Radial Basis Function Networks
Neuron: the Basic Unit
[Figure: a single neuron with inputs $x_1, \ldots, x_d$, weights $w_1, \ldots, w_d$, and one output]

Input $x = [1, x_1, \ldots, x_d]^T \in \mathbb{R}^{d+1}$

Connection weights (i.e., synapses) $w = [w_0, w_1, \ldots, w_d]^T \in \mathbb{R}^{d+1}$

Net activation:
$$net = \sum_{i=0}^{d} w_i x_i = w^T x$$

Activation function and output:
$$y = f(net) = f(w^T x)$$
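As a concrete illustration of this neuron model, here is a minimal NumPy sketch; the function name and the example numbers are my own, not from the slides.

```python
import numpy as np

def neuron_output(x, w, f):
    """Single neuron: y = f(net) = f(w^T x), with x augmented by a leading 1."""
    x_aug = np.concatenate(([1.0], x))   # x = [1, x_1, ..., x_d]^T
    net = w @ x_aug                      # net = sum_{i=0}^{d} w_i x_i
    return f(net)

# Example: d = 2 inputs, sign activation
y = neuron_output(np.array([0.5, -1.2]), np.array([0.1, 0.8, 0.3]), np.sign)
```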
Activation Function

Activation function introduces nonlinearity

We can use
$$f(x) = \mathrm{sgn}(x) = \begin{cases} 1 & x \geq 0 \\ -1 & x < 0 \end{cases}$$

Or we can use the Sigmoid function
$$f(x) = \frac{2}{1 + e^{-2x}} - 1, \quad f(x) \in (-1, 1)$$
with derivative
$$f'(x) = 1 - f^2(x)$$

Or
$$f(x) = \frac{1}{1 + e^{-x}}, \quad f(x) \in (0, 1)$$
with derivative
$$f'(x) = f(x)\,[1 - f(x)] = \frac{e^{-x}}{(1 + e^{-x})^2}$$
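The two sigmoids above and their derivatives translate directly into code; this is a small sketch of mine, written straight from the formulas on this slide.

```python
import numpy as np

def sigmoid_bipolar(x):
    """f(x) = 2 / (1 + e^{-2x}) - 1, range (-1, 1); identical to tanh(x)."""
    return 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0

def sigmoid_bipolar_deriv(x):
    """f'(x) = 1 - f(x)^2."""
    fx = sigmoid_bipolar(x)
    return 1.0 - fx ** 2

def sigmoid_logistic(x):
    """f(x) = 1 / (1 + e^{-x}), range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_logistic_deriv(x):
    """f'(x) = f(x)(1 - f(x)) = e^{-x} / (1 + e^{-x})^2."""
    fx = sigmoid_logistic(x)
    return fx * (1.0 - fx)
```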
Outline
Neuron Model
Multi-Layer Perceptron
Radial Basis Function Networks
Perceptron
[Figure: a two-layer perceptron with input nodes $x_1, \ldots, x_d$ and output nodes $z_1, \ldots, z_c$]

Two layers (input and linear output)

Desired output $t = [t_1, \ldots, t_c]^T \in \mathbb{R}^c$

Actual output $z_i = w_i^T x, \quad i = 1, \ldots, c$

Learning (Widrow-Hoff)
$$w_i(t+1) = w_i(t) + \eta (t_i - z_i)\, x = w_i(t) + \eta (t_i - w_i^T x)\, x$$

It only works for linearly separable patterns

It cannot even solve the simple XOR problem


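A minimal sketch of the Widrow-Hoff rule for this two-layer perceptron; the learning-rate value, epoch count, and array shapes are illustrative assumptions of mine.

```python
import numpy as np

def widrow_hoff(X, T, eta=0.01, epochs=100):
    """Train linear output units z_i = w_i^T x with the Widrow-Hoff rule.

    X : (N, d+1) inputs with a leading 1 for the bias
    T : (N, c) desired outputs
    Returns W of shape (c, d+1), one weight vector w_i per row.
    """
    N, d1 = X.shape
    c = T.shape[1]
    W = np.zeros((c, d1))
    for _ in range(epochs):
        for x, t in zip(X, T):
            z = W @ x                         # actual outputs z_i = w_i^T x
            W += eta * np.outer(t - z, x)     # w_i <- w_i + eta (t_i - z_i) x
    return W
```

Because each $z_i$ is a linear function of $x$, no amount of training lets this model represent the XOR mapping, which is the limitation noted above.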
Multi-layer Network
[Figure: a three-layer network with input layer $x_1, \ldots, x_d$, a hidden layer, and an output layer]

Input layer $x = [1, x_1, \ldots, x_d]^T \in \mathbb{R}^{d+1}$

Hidden layer $y_j = f(w_j^T x), \quad j = 1, \ldots, n_H$

Output layer $z_k = f(w_k^T y), \quad k = 1, \ldots, c$

The weight between hidden node $y_j$ and input node $x_i$ is $w_{ji}$

The weight between output node $z_k$ and hidden node $y_j$ is $w_{kj}$

May have multiple hidden layers


Discriminant Function

For a 3-layer network, the discriminant function is
$$g_k(x) = z_k = f\left( \sum_{j=1}^{n_H} w_{kj} \, f\left( \sum_{i=1}^{d} w_{ji} x_i + w_{j0} \right) + w_{k0} \right)$$

Kolmogorov showed that a 3-layer structure is enough to approximate any nonlinear function

We expect a 3-layer MLP to be able to form any decision boundary

Certainly, the nonlinearity depends on $n_H$, the number of hidden units

Larger $n_H$ results in overfitting

Smaller $n_H$ leads to underfitting
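A sketch of this discriminant function in NumPy; the layout is my own convention (bias weights $w_{j0}$ and $w_{k0}$ stored in column 0 of each weight matrix), and any activation $f$ from the earlier slide can be plugged in.

```python
import numpy as np

def mlp_discriminant(x, W_hidden, W_out, f):
    """g_k(x) = f( sum_j w_kj f( sum_i w_ji x_i + w_j0 ) + w_k0 ).

    x        : input of length d
    W_hidden : (n_H, d+1) weights w_ji, column 0 holds the biases w_j0
    W_out    : (c, n_H+1) weights w_kj, column 0 holds the biases w_k0
    f        : activation function, applied elementwise
    """
    x_aug = np.concatenate(([1.0], x))      # prepend the bias input 1
    y = f(W_hidden @ x_aug)                 # hidden layer y_j = f(w_j^T x)
    y_aug = np.concatenate(([1.0], y))      # prepend the bias unit 1
    return f(W_out @ y_aug)                 # outputs z_k = g_k(x)
```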
Training the Network

Desired output of the network $t = [t_1, \ldots, t_c]^T$

The objective in training the weights $\{w_j, w_k\}$:
$$J(w) = \frac{1}{2} \| t - z \|^2$$

We need to find the best set of $\{w_j, w_k\}$ that minimizes $J$

It can be done through gradient-based optimization

In a general form
$$w(k+1) = w(k) - \eta \frac{\partial J}{\partial w}$$

To make it clear, let's do it component by component


Back-propagation (BP): output-hidden $w_k$

$w_{kj}$ is the weight between output node $k$ and hidden node $j$
$$\frac{\partial J}{\partial w_{kj}} = \frac{\partial J}{\partial net_k} \frac{\partial net_k}{\partial w_{kj}}$$

Define the sensitivity for a general node $i$ as
$$\delta_i = -\frac{\partial J}{\partial net_i}$$

In this case, for the output node $k$
$$\delta_k = -\frac{\partial J}{\partial net_k} = -\frac{\partial J}{\partial z_k} \frac{\partial z_k}{\partial net_k} = (t_k - z_k)\, f'(net_k)$$

As $net_k = \sum_{j=1}^{n_H} w_{kj} y_j$, it is clear that
$$\frac{\partial net_k}{\partial w_{kj}} = y_j$$

So we have $\Delta w_{kj} = \eta \delta_k y_j = \eta (t_k - z_k)\, f'(net_k)\, y_j$

This is a generalization of Widrow-Hoff


Back-propagation (BP): hidden-input $w_j$

As before,
$$\frac{\partial J}{\partial w_{ji}} = \frac{\partial J}{\partial y_j} \underbrace{\frac{\partial y_j}{\partial net_j} \frac{\partial net_j}{\partial w_{ji}}}_{\text{easy}}$$

The first factor is a little more complicated
$$\frac{\partial J}{\partial y_j} = \frac{\partial}{\partial y_j}\left[ \frac{1}{2} \sum_{k=1}^{c} (t_k - z_k)^2 \right] = -\sum_{k=1}^{c} (t_k - z_k) \frac{\partial z_k}{\partial y_j} = -\sum_{k=1}^{c} (t_k - z_k) \frac{\partial z_k}{\partial net_k} \frac{\partial net_k}{\partial y_j}$$
$$= -\sum_{k=1}^{c} (t_k - z_k)\, f'(net_k)\, w_{kj} = -\sum_{k=1}^{c} \delta_k w_{kj}$$

We can compute the sensitivity for the hidden node $j$
$$\delta_j = -\frac{\partial J}{\partial net_j} = -\frac{\partial J}{\partial y_j} \frac{\partial y_j}{\partial net_j} = f'(net_j) \sum_{k=1}^{c} w_{kj} \delta_k$$
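Combining the sensitivities from this slide and the previous one, a small vectorized sketch of mine; it assumes the squared-error criterion, an elementwise activation with derivative `f_prime`, and no bias terms.

```python
import numpy as np

def sensitivities(t, z, net_out, net_hidden, W_out, f_prime):
    """delta_k = (t_k - z_k) f'(net_k);  delta_j = f'(net_j) sum_k w_kj delta_k.
    W_out[k, j] holds the output-hidden weight w_kj."""
    delta_k = (t - z) * f_prime(net_out)                  # output-node sensitivities
    delta_j = f_prime(net_hidden) * (W_out.T @ delta_k)   # hidden-node sensitivities
    return delta_k, delta_j
```

The weight changes are then $\Delta w_{kj} = \eta \delta_k y_j$ and $\Delta w_{ji} = \eta \delta_j x_i$.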
Why is it Called Back Propagation?
[Figure: output nodes $1, 2, \ldots, k, \ldots, c$ connected to hidden node $j$ through weights $w_{kj}$]

The sensitivity $\delta_i$ reflects the information at node $i$

$\delta_j$ of a hidden node $j$ combines two sources of information:

a linear combination of the sensitivities from the output layer, $\sum_{k=1}^{c} w_{kj} \delta_k$

its local information $f'(net_j)$

The learning rule for the hidden-input weights is
$$\Delta w_{ji} = \eta \delta_j x_i = \eta \left[ f'(net_j) \sum_{k=1}^{c} w_{kj} \delta_k \right] x_i$$
Algorithm: Back-propagation (BP)
Algorithm 1: Stochastic Back-propagation
Init: $n_H$, $w$, stopping criterion $\theta$, $\eta$, $k = 0$
Do $k \leftarrow k + 1$
    $x^k \leftarrow$ randomly picked sample
    forward: compute $y$ and then $z$
    backward: compute $\{\delta_k\}$ and then $\{\delta_j\}$
    $w_{kj} \leftarrow w_{kj} + \eta \delta_k y_j$
    $w_{ji} \leftarrow w_{ji} + \eta \delta_j x_i$
Until $J(w) < \theta$
Return $w$

This is the one-sample BP training

It can be easily extended to batch training


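A compact NumPy sketch of Algorithm 1 for a 3-layer network with the logistic sigmoid; the initialization scale, default learning rate, and epoch cap are illustrative choices of mine, and bias weights are omitted for brevity.

```python
import numpy as np

def f(x):
    return 1.0 / (1.0 + np.exp(-x))               # logistic sigmoid

def f_prime(net):
    a = f(net)
    return a * (1.0 - a)

def stochastic_bp(X, T, n_H, eta=0.1, theta=1e-3, max_epochs=1000, seed=0):
    """One-sample back-propagation; X is (N, d), T is (N, c)."""
    rng = np.random.default_rng(seed)
    d, c = X.shape[1], T.shape[1]
    W_h = rng.normal(scale=0.1, size=(n_H, d))    # hidden weights w_ji
    W_o = rng.normal(scale=0.1, size=(c, n_H))    # output weights w_kj
    for _ in range(max_epochs):
        for idx in rng.permutation(len(X)):       # x^k: randomly picked sample
            x, t = X[idx], T[idx]
            net_h = W_h @ x; y = f(net_h)         # forward: hidden layer
            net_o = W_o @ y; z = f(net_o)         # forward: output layer
            d_k = (t - z) * f_prime(net_o)        # backward: {delta_k}
            d_j = f_prime(net_h) * (W_o.T @ d_k)  # backward: {delta_j}
            W_o += eta * np.outer(d_k, y)         # w_kj <- w_kj + eta delta_k y_j
            W_h += eta * np.outer(d_j, x)         # w_ji <- w_ji + eta delta_j x_i
        J = 0.5 * np.sum((T - f(f(X @ W_h.T) @ W_o.T)) ** 2)
        if J < theta:                             # stop when J(w) < theta
            break
    return W_h, W_o
```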
Bayes Discriminant and MLP

In linear discriminant models, we know that the MSE and MMSE solutions approximate the Bayes discriminant asymptotically

MLP can do better by approximating the posteriors

Suppose we have $c$ classes and the desired output is $t_k(x) = 1$ if $x \in \omega_k$, and $0$ otherwise

The MLP criterion
$$J(w) = \sum_{x} [g_k(x; w) - t_k]^2 = \sum_{x \in \omega_k} [g_k(x; w) - 1]^2 + \sum_{x \notin \omega_k} [g_k(x; w) - 0]^2$$

It can be shown that minimizing $\lim_{n \to \infty} J(w)$ is equivalent to minimizing
$$\int [g_k(x; w) - P(\omega_k \mid x)]^2\, p(x)\, dx$$

This means the output units represent the posteriors
$$g_k(x; w) \simeq P(\omega_k \mid x)$$
Outputs as Probabilities

If we want the outputs of the MLP to be posteriors

The desired outputs in training should be in $[0, 1]$

As we have a limited number of training samples, the outputs may not sum to 1

We can use a different activation function for the output layer

Softmax activation
$$z_k = \frac{e^{net_k}}{\sum_{m=1}^{c} e^{net_m}}$$
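A small sketch of the softmax activation; subtracting the maximum is a standard numerical-stability trick I have added, and it does not change the result.

```python
import numpy as np

def softmax(net):
    """z_k = exp(net_k) / sum_m exp(net_m); outputs are in (0, 1) and sum to 1."""
    e = np.exp(net - np.max(net))   # shift by max(net) for numerical stability
    return e / e.sum()
```

Because the outputs lie in $(0, 1)$ and sum to 1, they can be read directly as posterior estimates.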
Practice: Number of Hidden Units

The number of hidden nodes is the most critical parameter in an MLP

It determines the expressive power of the network and the complexity of the decision boundary

A smaller number leads to a simpler boundary, while a larger number can produce a very complicated one

Overfitting and generalizability

Unfortunately, there is no foolproof method to choose this parameter

Many heuristics have been proposed


Practice: Learning Rates

Another critical parameter in an MLP is the learning rate $\eta$

In principle, if $\eta$ is small enough, the iteration converges

But very slowly

To speed it up, we need to use 2nd-order gradient information, e.g., Newton's method, in training
Practice: Plateaus and Momentum

Error surfaces often have plateaus where $\frac{\partial J(w)}{\partial w}$ is very small

Then the iteration can hardly move on

Introduce momentum to push it through

Momentum uses the weight change at the previous iteration
$$w(k+1) = w(k) + (1 - \alpha)\,\Delta w_{bp}(k) + \alpha\,\Delta w(k-1)$$

Algorithm 2: Stochastic Back-propagation with Momentum
Init: $n_H$, $w$, stopping criterion $\theta$, $\eta$, $\alpha$, $k = 0$
Do $k \leftarrow k + 1$
    $x^k \leftarrow$ randomly picked sample
    $b_{kj} \leftarrow (1 - \alpha)\,\eta \delta_k y_j + \alpha b_{kj}$;  $b_{ji} \leftarrow (1 - \alpha)\,\eta \delta_j x_i + \alpha b_{ji}$
    $w_{kj} \leftarrow w_{kj} + b_{kj}$;  $w_{ji} \leftarrow w_{ji} + b_{ji}$
Until $J(w) < \theta$
Return $w$
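A sketch of the momentum update used inside Algorithm 2; the variable names are mine, and `grad_step` stands for the plain BP increment $\eta \delta_k y_j$ or $\eta \delta_j x_i$.

```python
def momentum_step(w, b_prev, grad_step, alpha):
    """b <- (1 - alpha) * grad_step + alpha * b_prev, then w <- w + b.
    With alpha = 0 this reduces to plain stochastic back-propagation."""
    b = (1.0 - alpha) * grad_step + alpha * b_prev
    return w + b, b
```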
Outline
Neuron Model
Multi-Layer Perceptron
Radial Basis Function Networks
Radial Basis Function Network
[Figure: an RBF network with inputs $x_1, \ldots, x_d$, hidden units with kernel activation $K$, and output weights $w_{kj}$]

The input-hidden weights are all 1

The activation function for the hidden units is the Radial Basis Function (RBF), e.g.,
$$K(\|x - x_c\|) = \exp\left\{ -\frac{\|x - x_c\|^2}{2\sigma^2} \right\}$$

The output
$$z_k(x) = \sum_{j=0}^{n_H} w_{kj}\, K(x, x_j)$$
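A sketch of the RBF network output with a Gaussian kernel; the array shapes are my own convention, and the $j = 0$ bias term is dropped to keep the code short.

```python
import numpy as np

def rbf_forward(x, centers, sigma, W):
    """z_k(x) = sum_j w_kj K(x, x_j) with K(x, x_j) = exp(-||x - x_j||^2 / (2 sigma^2)).
    centers is (n_H, d) holding the basis centers x_j; W is (c, n_H)."""
    K = np.exp(-np.sum((centers - x) ** 2, axis=1) / (2.0 * sigma ** 2))
    return W @ K
```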
Interpretation

It can be treated as function approximation, a linear combination of a set of bases

The hidden units transform the original feature space to another (high-dimensional) feature space by using the kernel

We hope the data become linearly separable in the new feature space

This is what we did in the Kernel Machines!


Learning

Parameters in the RBF network:

the basis center $x_j$ for each hidden node

the variance $\sigma^2$ of the RBF

the weights $W$

Once the RBF parameters are set, $W$ can be solved by pseudo-inverse or Widrow-Hoff

Finding the RBF parameters is not easy

Uniformly select the centers

Use the data cluster centers as $x_j$
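Once the centers and the variance are fixed, the output weights can be obtained in closed form; here is a sketch of the pseudo-inverse solution (my own code, consistent with the shapes used in the RBF sketch above).

```python
import numpy as np

def fit_rbf_weights(X, T, centers, sigma):
    """Least-squares fit of W from Phi @ W^T ~= T, where Phi[n, j] = K(x_n, x_j).
    X is (N, d), T is (N, c), centers is (n_H, d)."""
    d2 = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=2)   # (N, n_H)
    Phi = np.exp(-d2 / (2.0 * sigma ** 2))                            # kernel matrix
    return (np.linalg.pinv(Phi) @ T).T     # (c, n_H), matching z = W @ K above
```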
