
Let's set up the notation first.

Following conventions similar to those of the ML class, let

$a^{(\ell)}_i$ = the activation level of unit $i$ of layer $\ell$ when $i \ge 1$,
$a^{(\ell)}$ = the column vector whose $i$-th entry is $a^{(\ell)}_i$ (with a 0-indexed bias unit appended),
$g(\cdot)$ = the activation function, usually a sigmoid, but not necessarily the logistic function,
$\Theta^{(\ell)}_{ij}$ = the weight for connecting unit $j$ of layer $\ell$ to unit $i$ of layer $\ell+1$,
$\Theta^{(\ell)}$ = the weight matrix whose $(i,j)$-th entry is $\Theta^{(\ell)}_{ij}$,
$x$ = the input vector $= a^{(1)}$ by convention,
$y$ = the target vector,
$m$ = number of samples = size of the training set.
For convenience, we denote the size of each layer (excluding the bias unit) by $N^{(\ell)}$. When the $i$-th unit is not a bias unit (i.e. $i \ne 0$ if we follow the ML class's convention), define the net input to (or excitation level of) unit $i$ of a hidden or output layer as

$$z^{(\ell)}_i = \sum_{j=0}^{N^{(\ell-1)}} \Theta^{(\ell-1)}_{ij}\, a^{(\ell-1)}_j; \qquad 1 \le i \le N^{(\ell)},\ \ell \ge 2.$$
(Be careful with the range of the indices $i$, $j$, $\ell$ in the above.) That is, $z^{(\ell)}_i$ is the argument we supply to $g$ in order to obtain the activation $a^{(\ell)}_i$ if $\ell \ge 2$ and $i \ne 0$:

$$a^{(\ell)}_i = g\!\left(z^{(\ell)}_i\right) = g\!\left(\sum_{j=0}^{N^{(\ell-1)}} \Theta^{(\ell-1)}_{ij}\, a^{(\ell-1)}_j\right); \qquad 1 \le i \le N^{(\ell)},\ \ell \ge 2.$$
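As a concrete sketch of this forward computation (my own pure-Python illustration, not part of the original note; the helper names and toy weights are made up), one layer's activations can be computed as:

```python
import math

def sigmoid(z):
    """Logistic g; the note allows any sigmoid-shaped activation."""
    return 1.0 / (1.0 + math.exp(-z))

def forward_layer(theta, a_prev):
    """Compute a^(l) from a^(l-1).

    theta  : matrix Theta^(l-1) with N^(l) rows and N^(l-1)+1 columns
             (column 0 holds the bias weights)
    a_prev : activations of layer l-1, WITHOUT the bias unit
    """
    a_prev = [1.0] + list(a_prev)              # prepend the 0-indexed bias unit
    z = [sum(theta[i][j] * a_prev[j] for j in range(len(a_prev)))
         for i in range(len(theta))]           # z_i = sum_j Theta_ij a_j
    return [sigmoid(zi) for zi in z]           # a_i = g(z_i)

# Toy layer: 2 inputs -> 1 unit; the weights are arbitrary made-up numbers.
print(forward_layer([[0.0, 1.0, -1.0]], [2.0, 2.0]))  # -> [0.5], since z = 0
```

Note that the bias unit lives in `a_prev`, not in the returned activations, matching the convention that $a^{(\ell)}$ gets its bias unit appended only when it feeds the next layer.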
In the neural network literature, the net input is usually denoted $\mathrm{net}^{(\ell)}_i$, but we follow the notation of the ML class here. Many textbooks on NN also adopt a row-vector convention rather than a column-vector convention. So, what we write as $z^{(\ell)} = \Theta^{(\ell-1)} a^{(\ell-1)}$ here might be written as $z^{(\ell)} = a^{(\ell-1)} W^{(\ell-1)}$ there instead, where $z$ and $a$ become row vectors and $W = \Theta^{\top}$. But we won't go further into these little details here.


We have suppressed the indices for training samples in the above. When there is more than one sample, let $x(s)$, $y(s)$, $a^{(\ell)}(s)$ and $z^{(\ell)}(s)$ denote the relevant quantities for the $s$-th sample. This deviates slightly from the notation of the ML class, in which what we call $y_i(s)$ here (i.e. the $i$-th entry of the output vector for sample $s$), for instance, is called $y^{(s)}_i$ in the class. In this short note, superscripts are always reserved for layer indices and subscripts are reserved for row/column indices of vectors and matrices. The sample index $s$ is neither superscripted nor subscripted.
Suppose there are $L$ layers in total, including the input and output layers. Let $J$ be the training cost for the network. For instance, if we take the sum of squared errors as the training cost, we set $J = \sum_{s=1}^{m} \left\| a^{(L)}(s) - y(s) \right\|^2$. In the ML class, $J$ is taken as

$$J = \frac{1}{m} \sum_{s=1}^{m} \underbrace{\sum_{k=1}^{N^{(L)}} \left[ -y_k(s) \log\!\left(a^{(L)}_k(s)\right) - (1 - y_k(s)) \log\!\left(1 - a^{(L)}_k(s)\right) \right]}_{C(s)} + \frac{\lambda}{2m} \|\theta\|^2 \qquad (1)$$

$$= \frac{1}{m} \sum_{s=1}^{m} C(s) + \frac{\lambda}{2m} \|\theta\|^2 \quad \text{(say)},$$

where $\theta$ is the unrolled vector obtained by stacking the columns of $\Theta^{(\ell)}$ for all $\ell$ together, and $C(s)$ is the training cost (without regularization) for the single sample $s$. (Again, be careful with the range of the indices.)
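Formula (1) can be sketched in code as follows (a pure-Python illustration of my own; the function names are hypothetical, and I take the penalty sum over the unrolled weight vector exactly as written above):

```python
import math

def sample_cost(a_L, y):
    """C(s): the bracketed cross-entropy term of formula (1) for one sample."""
    return sum(-yk * math.log(ak) - (1.0 - yk) * math.log(1.0 - ak)
               for ak, yk in zip(a_L, y))

def total_cost(outputs, targets, theta, lam):
    """J = (1/m) sum_s C(s) + (lambda / 2m) ||theta||^2.

    outputs : list of output vectors a^(L)(s), one per sample
    targets : list of target vectors y(s)
    theta   : the unrolled weight vector
    """
    m = len(outputs)
    data_term = sum(sample_cost(a, y) for a, y in zip(outputs, targets)) / m
    penalty = lam / (2.0 * m) * sum(t * t for t in theta)
    return data_term + penalty

# Two made-up samples with one output unit each, no regularization:
J = total_cost([[0.9], [0.2]], [[1.0], [0.0]], [0.5, -0.5], lam=0.0)
```

(In the actual ML class the bias weights are excluded from the penalty; the sketch keeps the simpler form written in (1) above.)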
To train the neural network, we mean to find the weights that minimize $J$. Many optimization methods can be used to find the (locally) optimal weights. Some of them (e.g. look up the Nelder-Mead method in Wikipedia) only require successive evaluations of $J$ at different values of $\theta$. In that case, the backpropagation algorithm is not needed. However, most optimization methods (gradient descent, CG or BFGS, to name a few) require calculations of not only $J$ but also the gradient of $J$. Thus we need some way to evaluate $\partial J / \partial \Theta^{(\ell)}_{ij}$. In practice, the cost function $J$ can be broken into training costs for individual samples and penalties for individual weights, so the gradient is evaluated and accumulated sample-by-sample and penalty-by-penalty. For instance, in the last example, we have
$$\frac{\partial J}{\partial \Theta^{(\ell)}_{ij}} = \frac{1}{m} \sum_{s=1}^{m} \frac{\partial C(s)}{\partial \Theta^{(\ell)}_{ij}} + \frac{\lambda}{m}\, \Theta^{(\ell)}_{ij}.$$
If we can evaluate the gradient of C(s) for each s, the gradient of J can also be obtained
easily.
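The sample-by-sample and penalty-by-penalty accumulation can be sketched as follows (a pure-Python illustration; the names and toy numbers are my own, with the weights kept as a flat unrolled list):

```python
def grad_J(per_sample_grads, theta, lam):
    """dJ/dTheta_ij = (1/m) sum_s dC(s)/dTheta_ij + (lambda/m) Theta_ij.

    per_sample_grads : one flat gradient list dC(s)/dTheta per sample,
                       aligned entry-by-entry with theta
    theta            : flat list of the weights Theta_ij
    """
    m = len(per_sample_grads)
    return [sum(g[k] for g in per_sample_grads) / m + (lam / m) * theta[k]
            for k in range(len(theta))]

# Two made-up per-sample gradients over two weights:
print(grad_J([[1.0, 2.0], [3.0, 4.0]], [10.0, 20.0], lam=1.0))  # -> [7.0, 13.0]
```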
The backpropagation algorithm is essentially a way to compute, by the chain rule in calculus, the gradients $\partial C(s) / \partial \Theta^{(\ell)}_{ij}$ of the training cost $C(s)$ for a single sample $s$. Let us first define an auxiliary variable

$$\delta^{(\ell)}_i(s) = \frac{\partial C(s)}{\partial z^{(\ell)}_i} \qquad (\text{with } \ell \ge 2 \text{ and } 1 \le i \le N^{(\ell)}).$$
This is the delta term we met in the ML class. For convenience, we drop the sample index $s$ in the sequel. We also write $\delta^{(\ell)} = \left(\delta^{(\ell)}_1, \delta^{(\ell)}_2, \ldots, \delta^{(\ell)}_{N^{(\ell)}}\right)^{\!\top}$. By the chain rule,

$$\frac{\partial C}{\partial \Theta^{(\ell)}_{ij}} = \frac{\partial C}{\partial z^{(\ell+1)}_i}\, \frac{\partial z^{(\ell+1)}_i}{\partial \Theta^{(\ell)}_{ij}} = \delta^{(\ell+1)}_i a^{(\ell)}_j \qquad (\ell \ge 1,\ 1 \le i \le N^{(\ell+1)} \text{ and } 0 \le j \le N^{(\ell)}).$$
(Note that in the above, the index $j$ may refer to the bias unit, but $i$ does not.) So we are able to compute $\partial C / \partial \Theta^{(\ell)}_{ij}$ provided that we know how to compute $\delta^{(\ell+1)}$. But how do we compute $\delta^{(\ell+1)}$? For the output layer, the quantity can be computed directly:

$$\delta^{(L)}_i = \frac{\partial C}{\partial z^{(L)}_i} = \frac{\partial C}{\partial a^{(L)}_i}\, \frac{d a^{(L)}_i}{d z^{(L)}_i} = \frac{\partial C}{\partial a^{(L)}_i}\, g'\!\left(z^{(L)}_i\right) \qquad (1 \le i \le N^{(L)}). \qquad (2)$$
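Formula (2) translates directly into code. The following is a minimal sketch (my own naming, not the note's; it fixes $g$ to the logistic function for concreteness while keeping $\partial C / \partial a$ pluggable):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def delta_output(z_L, a_L, y, dC_da):
    """delta^(L)_i = (dC/da^(L)_i) * g'(z^(L)_i), i.e. formula (2).

    dC_da : a function (a_i, y_i) -> partial derivative of C with respect
            to that output activation; kept generic, as in the note.
    Assumes the logistic g, whose derivative is g(z)(1 - g(z)).
    """
    return [dC_da(ai, yi) * sigmoid(zi) * (1.0 - sigmoid(zi))
            for zi, ai, yi in zip(z_L, a_L, y)]

# With the cross-entropy C of (1), dC/da = (a - y) / (a (1 - a)),
# so delta^(L) collapses to a^(L) - y:
cross_entropy_dC_da = lambda a, y: (a - y) / (a * (1.0 - a))
a = sigmoid(0.3)
d = delta_output([0.3], [a], [1.0], cross_entropy_dC_da)  # equals [a - 1]
```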
In the ML class, $g$ is the logistic function and $C$ is shown in formula (1) above. You know how to differentiate them. Here I want to keep $C$ and $g$ generic, so that formula (2) is generic too.
For hidden layers, by the chain rule, we have

$$\delta^{(\ell+1)}_j = \frac{\partial C}{\partial z^{(\ell+1)}_j} = \sum_{i=1}^{N^{(\ell+2)}} \frac{\partial C}{\partial z^{(\ell+2)}_i}\, \frac{\partial z^{(\ell+2)}_i}{\partial a^{(\ell+1)}_j}\, \frac{\partial a^{(\ell+1)}_j}{\partial z^{(\ell+1)}_j} = \sum_{i=1}^{N^{(\ell+2)}} \delta^{(\ell+2)}_i\, \Theta^{(\ell+1)}_{ij}\, g'\!\left(z^{(\ell+1)}_j\right) \qquad (2 \le \ell+1 \le L-1).$$
(Note: Be very careful with the indices here. In the above, both $i$ and $j$ are nonzero. Hence the column of $\Theta^{(\ell+1)}$ that corresponds to the bias unit, i.e. the first column if we follow the ML class's convention, is not involved in the calculation.) So, given that we have already evaluated and stored the values of $z^{(\ell+1)}$ and $a^{(\ell+1)}$ for each layer in the forward pass, we can use the above recurrence relation to compute $\delta^{(\ell+1)}$ in a backward manner (where $\ell+1$ runs from $L-1$ down to $2$).
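Putting the forward pass, formula (2), and the backward recurrence together, a single-sample backpropagation routine might look like this (a pure-Python sketch of my own; it assumes the logistic $g$ and the cross-entropy $C$ of (1), so that $\delta^{(L)} = a^{(L)} - y$):

```python
import math

def sigmoid(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-z))

def backprop_single_sample(thetas, x, y):
    """Gradients dC/dTheta^(l)_ij for one sample.

    thetas : list of weight matrices [Theta^(1), ..., Theta^(L-1)];
             Theta^(l) has N^(l+1) rows and N^(l)+1 columns,
             column 0 being the bias column.
    Assumes the logistic g and the cross-entropy C of formula (1),
    so that delta^(L) = a^(L) - y.
    """
    # Forward pass: store a^(l) (with bias unit prepended) and z^(l).
    a = [[1.0] + list(x)]
    zs = [None]  # the input layer has no z
    for theta in thetas:
        z = [sum(theta[i][j] * a[-1][j] for j in range(len(a[-1])))
             for i in range(len(theta))]
        zs.append(z)
        a.append([1.0] + [sigmoid(zi) for zi in z])

    # Output layer: delta^(L) = a^(L) - y (skip the bias entry of a^(L)).
    delta = [ai - yi for ai, yi in zip(a[-1][1:], y)]

    grads = [None] * len(thetas)
    for l in reversed(range(len(thetas))):
        # dC/dTheta^(l)_ij = delta^(l+1)_i * a^(l)_j (j may hit the bias).
        grads[l] = [[delta[i] * a[l][j] for j in range(len(a[l]))]
                    for i in range(len(delta))]
        if l > 0:
            # Recurrence: delta_j = sum_i delta_i Theta_ij g'(z_j),
            # skipping the bias column of Theta (hence the j + 1 offset).
            gprime = [sigmoid(zj) * (1.0 - sigmoid(zj)) for zj in zs[l]]
            delta = [sum(delta[i] * thetas[l][i][j + 1]
                         for i in range(len(delta))) * gprime[j]
                     for j in range(len(zs[l]))]
    return grads
```

Storing $z^{(\ell)}$ and $a^{(\ell)}$ during the forward pass is what lets the backward loop reuse them, exactly as the note points out.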
We have finished the derivation of the backpropagation algorithm. A small remark: in the ML class, $\delta$ is interpreted as an "error term". Such an interpretation seems to be rather widespread in the NN literature. Now, if we take $g$ as the logistic function and $C$ as the cost function described in (1) (as Prof. Ng did in his class), $\delta^{(L)}$ is indeed an error term: it is equal to the discrepancy $a^{(L)} - y$ between the predicted output and the target output. However, if other $g$ or $C$ are used, things may change. For instance, if $C$ is replaced by $C = \left\|a^{(L)} - y\right\|^2$, we would get $\delta^{(L)}_j = 2\left(a^{(L)}_j - y_j\right) a^{(L)}_j \left(1 - a^{(L)}_j\right)$ instead. In general, $\delta$ is just what you see in its definition: it is a partial derivative, a rate of change, or a differential quotient. No more, no less. For pedagogical purposes and historical reasons, there may be some value in simply calling it an "error", but please keep in mind that it is actually a sensitivity term, an "error" per infinitesimal change in the net input $z^{(\ell)}_i$, and the error we are talking about is in principle an error in the cost function, not an error between the predicted and target outputs.
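The squared-error case above is easy to check numerically. The sketch below (my own, with arbitrary test values) compares the closed-form $\delta^{(L)}_j$ against a central finite difference of $C$ in $z$:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def delta_squared_error(z, y):
    """delta^(L)_j = 2 (a_j - y_j) a_j (1 - a_j) for C = ||a^(L) - y||^2
    with a logistic g."""
    a = sigmoid(z)
    return 2.0 * (a - y) * a * (1.0 - a)

# Numerical sanity check at arbitrary values z = 0.7, y = 1:
z0, y0, h = 0.7, 1.0, 1e-6
C = lambda z: (sigmoid(z) - y0) ** 2
numeric = (C(z0 + h) - C(z0 - h)) / (2.0 * h)   # central difference of C in z
assert abs(delta_squared_error(z0, y0) - numeric) < 1e-8
```

This kind of finite-difference comparison is the same idea as the gradient checking exercise in the ML class, applied here to a single $\delta$.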