Hypothesis #1
Inspired by a hypothesis from Xie & Seung (2000), as well as Hinton (2007, Deep Learning Workshop talk)
$$\Delta W_{i,j} \;\propto\; \rho(s_j)\,\frac{d\,\rho(s_i)}{dt}$$
where $\Delta W_{i,j}$ is the synaptic change, $\rho(s_j)$ is the pre-synaptic state (firing rate), and $d\rho(s_i)/dt$ is the temporal change in the post-synaptic firing rate.
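A minimal numerical sketch of this rule, assuming a sigmoid firing-rate nonlinearity $\rho$ and sinusoidal pre/post activity traces (all names and values here are illustrative):

```python
import numpy as np

def rho(s):                                  # assumed firing-rate nonlinearity
    return 1.0 / (1.0 + np.exp(-s))

dt = 1e-3
t = np.arange(0.0, 1.0, dt)
lag = 0.05                                   # post-synaptic activity lags the pre-synaptic one
s_pre = np.sin(2 * np.pi * t)                # illustrative pre-synaptic state s_j
s_post = np.sin(2 * np.pi * (t - lag))       # post-synaptic state s_i

d_rho_post_dt = np.gradient(rho(s_post), dt)           # d rho(s_i) / dt
delta_W = np.sum(rho(s_pre) * d_rho_post_dt) * dt      # accumulate rho(s_j) * d rho(s_i)/dt

print(f"accumulated weight change: {delta_W:+.4f}")
# A positive lag (pre leads post, so the post-synaptic rate rises while the
# pre-synaptic unit is active) gives a positive change; a negative lag flips
# the sign, qualitatively matching the STDP curve discussed next.
```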
Happy Coincidence
Comparative behavior:
[Figure: the rule's weight change compared with the biological observation of Bi & Poo (2001); a positive slope of the post-synaptic state s at the time of a pre-synaptic spike yields a weight increase, and vice versa.]
If the neural state change follows the gradient of some objective $J$, i.e. $\Delta s_i \propto \frac{\partial J}{\partial s_i}$, then it would be nice if we could get
$$\Delta W_{ij} \;\propto\; \frac{\partial J}{\partial W_{ij}} \;=\; \frac{\partial J}{\partial s_i}\,\frac{\partial s_i}{\partial W_{ij}} \;\propto\; \Delta s_i\,\rho(s_j),$$
since $\frac{\partial s_i}{\partial W_{ij}} \propto \rho(s_j)$.
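A toy sanity check of this chain-rule argument, assuming $s_i = W_{ij}\,\rho(s_j) + b_i$ as a stand-in for the settled post-synaptic activation and an arbitrary smooth objective $J$ (purely illustrative names and values):

```python
import numpy as np

rho = np.tanh                      # assumed firing-rate nonlinearity
s_j, b_i, W_ij = 0.7, 0.1, 0.5     # toy values

def s_i(W):
    # Stand-in for the settled post-synaptic activation as a function of W_ij.
    return W * rho(s_j) + b_i

def J(s):
    # Arbitrary smooth objective of the post-synaptic state.
    return np.sin(3 * s) + 0.5 * s ** 2

eps = 1e-6
dJ_dsi = (J(s_i(W_ij) + eps) - J(s_i(W_ij) - eps)) / (2 * eps)
dJ_dWij = (J(s_i(W_ij + eps)) - J(s_i(W_ij - eps))) / (2 * eps)

# Chain rule: dJ/dW_ij = dJ/ds_i * ds_i/dW_ij with ds_i/dW_ij = rho(s_j),
# so a state change Delta s_i ∝ dJ/ds_i turns the rule
# Delta W_ij ∝ Delta s_i * rho(s_j) into a gradient step on J.
print(dJ_dWij, dJ_dsi * rho(s_j))  # the two numbers agree (up to finite-difference error)
```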
$$\Delta s \;=\; \epsilon\,(R(s) - s) \;=\; -\epsilon\,\frac{\partial E}{\partial s} \qquad \text{(gradient descent)}$$
or
$$s^{(t)} \;=\; s^{(t-1)} + \epsilon\,\bigl(R(s^{(t-1)}) - s^{(t-1)}\bigr),$$
where $R(s)$ is where the neuron's activation would converge (if the rest of the network continues with state $s$), with
$$R(s) \;\propto\; b + W\rho(s).$$
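A minimal sketch of this discrete-time relaxation, assuming $\rho = \tanh$ and a small random symmetric weight matrix (illustrative values only):

```python
import numpy as np

rng = np.random.default_rng(0)
rho = np.tanh

n = 5
W = rng.normal(scale=0.1, size=(n, n))
W = (W + W.T) / 2                  # symmetric weights (anticipating Hypothesis #4)
np.fill_diagonal(W, 0.0)
b = rng.normal(scale=0.1, size=n)

def R(s):
    # Where each neuron's activation would converge if the rest of the
    # network kept its current state s:  R(s) ∝ b + W rho(s).
    return b + W @ rho(s)

eps = 0.1
s = rng.normal(size=n)
for _ in range(500):
    s = s + eps * (R(s) - s)       # s_t = s_{t-1} + eps (R(s_{t-1}) - s_{t-1})

print("fixed-point residual:", float(np.max(np.abs(R(s) - s))))   # ~0 after settling
```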
Happy Coincidence
Hypothesis #3
Inspired by Hopfield nets and Boltzmann machines
The energy
$$E(s) \;=\; \frac{\|s\|^2}{2} \;-\; \frac{1}{2}\sum_{i \neq j} W_{i,j}\,\rho(s_i)\rho(s_j) \;-\; \sum_i b_i\,\rho(s_i)$$
has derivative
$$\frac{\partial E(s)}{\partial s} \;=\; s - R(s),$$
where $R(s)$ is slightly different from before (see below). So
$$\Delta s \;=\; \epsilon\,(R(s) - s) \;=\; -\epsilon\,\frac{\partial E}{\partial s}$$
is gradient descent on the energy.
Happy Coincidence
The leaky-integrator unit dynamics,
$$\frac{ds_i}{dt} \;\propto\; R_i - s_i, \qquad R_i \;\propto\; \sum_j W_{i,j}\,\rho(s_j),$$
match gradient descent on the energy,
$$\frac{ds}{dt} \;\propto\; -\frac{\partial E}{\partial s}, \qquad \text{i.e.} \qquad \frac{ds_i}{dt} \;\propto\; -\frac{\partial E(s)}{\partial s_i} \;=\; R_i(s) - s_i.$$
To get the same form as in Eq. 2 of the paper, we need to impose symmetric connections, i.e. $W_{i,j} = W_{j,i}$. Finally,
$$R_i(s) \;:=\; \rho'(s_i)\Bigl(\sum_j W_{j,i}\,\rho(s_j) + b_i\Bigr).$$
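A small numerical check of this coincidence, assuming the energy above with $\rho = \tanh$ and symmetric random weights (illustrative values only): the finite-difference gradient of $E$ should equal $s - R(s)$.

```python
import numpy as np

rng = np.random.default_rng(1)
rho = np.tanh
rho_prime = lambda s: 1.0 - np.tanh(s) ** 2

n = 4
W = rng.normal(scale=0.3, size=(n, n))
W = (W + W.T) / 2                       # symmetric connections (W_ij = W_ji)
np.fill_diagonal(W, 0.0)
b = rng.normal(scale=0.3, size=n)

def E(s):
    # E(s) = ||s||^2 / 2 - 1/2 sum_{i!=j} W_ij rho(s_i) rho(s_j) - sum_i b_i rho(s_i)
    r = rho(s)
    return 0.5 * s @ s - 0.5 * r @ W @ r - b @ r

def R(s):
    # R_i(s) = rho'(s_i) * (sum_j W_ji rho(s_j) + b_i)
    return rho_prime(s) * (W @ rho(s) + b)

s = rng.normal(size=n)
eps = 1e-6
grad_E = np.array([(E(s + eps * np.eye(n)[i]) - E(s - eps * np.eye(n)[i])) / (2 * eps)
                   for i in range(n)])

# dE/ds = s - R(s), so the leaky dynamics ds/dt ∝ R(s) - s descend the energy.
print(np.allclose(grad_E, s - R(s), atol=1e-6))   # expected: True
```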
Hypothesis #4
Inspired by Hopfield nets and Boltzmann machines: symmetric connections, $W_{i,j} = W_{j,i}$
Happy Coincidence
Very Happy Coincidence
$s = (x, h, y)$, with $h = (h_1, h_2)$. Once $x$ is clamped and $h$ and $y$ have settled to a fixed point $s^-$, an external signal then drives $y$ slightly towards a target value, creating a perturbation.
[Figure: trajectory in the $(h_1, h_2)$ plane from the free fixed point $s^-$ (observe $x$) to the perturbed state $s^+$ (observe $x$ and $y$).]
Modified energy:
$$F \;=\; E(s) \;+\; \beta_x\,C_x(s) \;+\; \beta_y\,C_y(s)$$
No need to do a full inference to minimize $C_y$ and set $y$ = target: propagating a small perturbation (from $t^-$ to $t^+$) is enough.
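A sketch of this two-phase idea under simple assumptions (quadratic output cost $C_y$, the input clamped directly rather than via $\beta_x C_x$, $\tanh$ units, small $\beta_y$; all names and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
rho = np.tanh

n_x, n_h, n_y = 3, 4, 2
n = n_x + n_h + n_y
W = rng.normal(scale=0.2, size=(n, n))
W = (W + W.T) / 2                        # symmetric connections (Hypothesis #4)
np.fill_diagonal(W, 0.0)
b = np.zeros(n)

x = rng.normal(size=n_x)                 # observed (clamped) input
target = rng.normal(size=n_y)            # target value for the output units y
out = slice(n_x + n_h, n)                # indices of the output units

def relax(s, beta_y, steps=2000, eps=0.05):
    """Gradient descent on F = E + beta_y * C_y, keeping x clamped."""
    for _ in range(steps):
        grad_F = s - (1 - rho(s) ** 2) * (W @ rho(s) + b)   # dE/ds = s - R(s)
        grad_F[out] += beta_y * (s[out] - target)           # dC_y/ds on the outputs
        s = s - eps * grad_F
        s[:n_x] = x                                         # clamp the input
    return s

s_free = relax(np.concatenate([x, np.zeros(n_h + n_y)]), beta_y=0.0)   # s^-
s_nudged = relax(s_free.copy(), beta_y=0.05)                           # s^+

print("||s+ - s-|| =", np.linalg.norm(s_nudged - s_free))
# A small beta_y only slightly perturbs the fixed point; no full inference
# with y clamped to the target is needed.
```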
Lagrangian, with Lagrange coefficients $\lambda$ and the fixed-point constraint $\frac{\partial F}{\partial s}\big|_{s=s^-} = 0$:
$$\mathcal{L}(s, W, \lambda) \;=\; C_y(s) \;+\; \lambda^\top \frac{\partial F}{\partial s}$$
$$\frac{\partial \mathcal{L}}{\partial s} = 0 \;\;\Rightarrow\;\; \frac{\partial C_y}{\partial s} + \lambda^\top \frac{\partial^2 F}{\partial s^2} \;=\; 0 \quad \text{at } s = s^-.$$
At the fixed point and at the solution for $\lambda$, we can take a gradient step wrt $W$:
$$\Delta W \;\propto\; -\frac{\partial \mathcal{L}}{\partial W} \;=\; -\lambda^\top \frac{\partial^2 F}{\partial s\,\partial W}\bigg|_{s=s^-}.$$
Differentiating the perturbed fixed-point condition $\frac{\partial F}{\partial s}\big|_{s=s^+} = 0$ with respect to $\beta_y$ gives
$$\frac{\partial^2 F^+}{\partial s\,\partial \beta_y} \;+\; \frac{\partial^2 F^+}{\partial s^2}\,\frac{\partial s^+}{\partial \beta_y} \;=\; 0. \qquad (*)$$
When we take $\beta_y \to 0$ (so that $s^+ \to s^-$) and use $\frac{\partial F}{\partial \beta_y} = C_y$, $(*)$ becomes $\frac{\partial C_y}{\partial s} + \frac{\partial^2 F}{\partial s^2}\,\frac{\partial s^+}{\partial \beta_y} = 0$. We had $\frac{\partial C_y}{\partial s} + \lambda^\top \frac{\partial^2 F}{\partial s^2} = 0$, and so (by symmetry of the Hessian) $\lambda = \frac{\partial s^+}{\partial \beta_y}$, giving
$$\Delta W \;\propto\; -\frac{\partial \mathcal{L}}{\partial W} \;=\; -\frac{\partial^2 F}{\partial s\,\partial W}\bigg|_{s^-}\,\frac{\partial s^+}{\partial \beta_y} \;=\; -\frac{\partial^2 E}{\partial s\,\partial W}\bigg|_{s^-}\,\frac{\partial s^+}{\partial \beta_y}$$
(the cost terms do not depend on $W$).
Had
$$E(s) \;=\; \frac{\|s\|^2}{2} \;-\; \frac{1}{2}\sum_{i \neq j} W_{ij}\,\rho(s_i)\rho(s_j) \;-\; \sum_i b_i\,\rho(s_i),$$
so
$$\frac{\partial^2 E}{\partial s\,\partial W_{ij}} \;=\; -\frac{\partial}{\partial s}\,\rho(s_i)\rho(s_j),$$
and
$$\Delta W_{ij} \;\propto\; \frac{\partial s^+}{\partial \beta_y}^{\!\top} \frac{\partial}{\partial s}\,\rho(s_i)\rho(s_j) \;=\; \frac{\partial}{\partial \beta_y}\,\rho(s_i)\rho(s_j)\bigg|_{\beta_y = 0},$$
i.e.
$$\Delta W_{ij} \;\propto\; \rho(s_i^+)\rho(s_j^+) - \rho(s_i^-)\rho(s_j^-) \;=\; \int_{t^-}^{t^+} \frac{d}{dt}\bigl[\rho(s_i)\rho(s_j)\bigr]\,dt \;=\; \int_{t^-}^{t^+} \Bigl(\frac{d\rho(s_i)}{dt}\,\rho(s_j) + \rho(s_i)\,\frac{d\rho(s_j)}{dt}\Bigr)dt.$$
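Putting the pieces together, a toy end-to-end sketch under the same illustrative assumptions as above ($\tanh$ units, symmetric random weights, $x$ clamped, quadratic output cost, small $\beta_y$): the weight change accumulated as $d/dt\,[\rho(s_i)\rho(s_j)]$ during the perturbation telescopes to the contrastive term $\rho(s_i^+)\rho(s_j^+) - \rho(s_i^-)\rho(s_j^-)$, and repeating the update should reduce the prediction error.

```python
import numpy as np

rng = np.random.default_rng(3)
rho = np.tanh

n_x, n_h, n_y = 3, 4, 2
n = n_x + n_h + n_y
W = rng.normal(scale=0.2, size=(n, n))
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)
x, target = rng.normal(size=n_x), rng.normal(size=n_y)
out = slice(n_x + n_h, n)

def relax(s, W, beta_y, steps=2000, eps=0.05, trace=None):
    """Settle s by gradient descent on F = E + beta_y*C_y (x clamped);
    optionally accumulate the rate-based rule d/dt[rho(s_i) rho(s_j)]."""
    for _ in range(steps):
        grad_F = s - (1 - rho(s) ** 2) * (W @ rho(s))
        grad_F[out] += beta_y * (s[out] - target)
        s_new = s - eps * grad_F
        s_new[:n_x] = x
        if trace is not None:
            trace += np.outer(rho(s_new), rho(s_new)) - np.outer(rho(s), rho(s))
        s = s_new
    return s

beta_y, lr = 0.1, 0.2
for step in range(20):
    s_free = relax(np.concatenate([x, np.zeros(n_h + n_y)]), W, beta_y=0.0)    # s^-
    dW = np.zeros_like(W)
    s_nudged = relax(s_free.copy(), W, beta_y=beta_y, trace=dW)                # s^+
    # The accumulated rule telescopes to the contrastive Hebbian term:
    assert np.allclose(dW, np.outer(rho(s_nudged), rho(s_nudged))
                           - np.outer(rho(s_free), rho(s_free)))
    W += (lr / beta_y) * dW                 # SGD-like step on the prediction error
    W = (W + W.T) / 2
    np.fill_diagonal(W, 0.0)
    err = 0.5 * np.sum((s_free[out] - target) ** 2)
    print(f"step {step:2d}  prediction error {err:.4f}")
```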
Propagation of errors = Incremental Target Prop
If temporal derivatives = error gradients, the feedback paths compute incremental targets for the feedforward paths, moving the hidden activations in the right direction. The top-down perturbations which are propagated represent the surprise signal, while the feedback paths compute the targets towards which the feedforward activations are moved.
Summary of Contributions
Showed that a rate-based update emulates STDP (Bengio et al., 2015, arXiv:1509.05936)
Showed that propagation of perturbations at the fixed point of a symmetrically connected recurrent net propagates gradients (Bengio & Fischer, 2015, arXiv:1510.02777)
Showed that the rate-based STDP update after propagation of perturbations corresponds to SGD on the prediction error (Scellier & Bengio, 2016, arXiv:1602.05179)
Showed experimentally that a deep fixed-point recurrent net with 1, 2 or 3 hidden layers can be trained on MNIST to 0% training error (Scellier & Bengio, 2016, arXiv:1602.05179)