
Towards bridging the gap between
deep learning and biology

Yoshua Bengio
19 February 2016
MILA

Towards Bridging the Gap Between
Deep Learning and Biology

Supervised backprop works incredibly well for deep learning
AI applications, but:
• It is not clear how brains could implement it
• It does not address the unsupervised and RL problems,
  which are more biologically relevant

Central problem: credit assignment in hidden layers (and
through time, with recurrent networks)

Central Issue in Deep Learning:
Credit Assignment

What should hidden layers do?

Established approaches:
• Backpropagation
• Stochastic relaxation in Boltzmann machines
• REINFORCE (its variance scales linearly with the
  number of neurons getting the credit)

Are these related?

How does the brain do it?

What is the brain's learning algorithm?


Cue: Spike-Timing Dependent Plasticity

• Observed throughout the nervous system, especially in cortex
• STDP: the weight increases if the post-spike comes just after
  the pre-spike, and decreases if it comes just before
• Timing counts only if there is a spike on only one side
  within the window

Hypothesis #1
Inspired by a hypothesis from Xie & Seung 2000, as well as
Hinton 2007 (Deep Learning Workshop talk)

STDP is explained by a learning rule of this form:
weight change proportional to the post-synaptic rate of change
times the pre-synaptic spike, or

$$\Delta W_{i,j} \propto \frac{d\rho(s_i)}{dt}\,\rho(s_j)$$

Proposed Interpretation of STDP
Inspired by Hinton 2007 (Deep Learning Workshop talk)

Let s = continuous-valued state of all neurons
= soma integrated voltage potential (averaging out the effect of spikes)

Proposed learning rule:

$$\Delta W_{i,j} \propto \frac{d\rho(s_i)}{dt}\,\rho(s_j)$$

where ΔW_{i,j} is the synaptic change, ρ is the neuron nonlinearity,
dρ(s_i)/dt is the temporal change in the post-synaptic firing rate,
and ρ(s_j) is the pre-synaptic spike rate (or equivalently, the
spikes themselves).
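To make the rule concrete, here is a minimal numerical sketch of one update, assuming a sigmoid nonlinearity ρ and a finite-difference estimate of dρ(s_i)/dt (both are illustrative choices, not specified on the slide):

```python
import numpy as np

def rho(s):
    # Assumed neuron nonlinearity (illustrative choice: a sigmoid).
    return 1.0 / (1.0 + np.exp(-s))

def stdp_rate_update(W, s_prev, s_now, dt, lr=0.01):
    """One step of the rate-based rule dW_ij ∝ dρ(s_i)/dt · ρ(s_j).

    s_prev, s_now: states of all neurons at two consecutive times;
    entry (i, j) of W connects pre-synaptic neuron j to post-synaptic neuron i.
    """
    drho_post = (rho(s_now) - rho(s_prev)) / dt  # dρ(s_i)/dt, finite difference
    rho_pre = rho(s_now)                         # ρ(s_j), pre-synaptic rate
    return W + lr * np.outer(drho_post, rho_pre)
```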

Happy Coincidence

In simulations, this learning rule fits the
classical STDP curves

Comparative Behavior:
Simulations support the hypothesis

"STDP as presynaptic activity times rate of change of postsynaptic
activity", Bengio et al., 2015, arXiv:1509.05936

[Figure: weight change vs. post-minus-pre spike timing difference;
left: biological observation (Bi & Poo 2001); right: our simulation
using the proposed rule]

Why it matches the STDP curve

When the post-synaptic s increases, the probability of a post-spike
is larger after some event (a pre-spike) than before: the nearest
post-spike is more likely to come after the pre-spike. A positive
slope of the post-synaptic s therefore yields a weight increase,
and vice-versa.

[Figure: post-synaptic s rising around the time of a pre-synaptic spike]
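A tiny numerical illustration of this argument (the Gaussian bump for the post-synaptic state, the nonlinearity, and the pre-spike times below are all assumptions made for illustration): a pre-spike that arrives while the post-synaptic state is rising yields a positive weight change under the Hypothesis #1 rule, and a negative one on the falling side.

```python
import numpy as np

t = np.linspace(0.0, 1.0, 1001)
s_post = np.exp(-((t - 0.5) / 0.1) ** 2)  # post-synaptic state: rises, then falls
rho = lambda s: 1.0 / (1.0 + np.exp(-4.0 * (s - 0.5)))
drho_dt = np.gradient(rho(s_post), t)     # dρ(s_post)/dt

for t_pre in (0.35, 0.50, 0.65):          # pre-spike before / at / after the peak
    i = np.abs(t - t_pre).argmin()
    dW = drho_dt[i]                       # weight change ∝ dρ(s_i)/dt at the pre-spike
    print(f"pre-spike at t={t_pre:.2f}: dW ∝ {dW:+.3f}")
```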

How to make sense of this view of STDP
from a machine learning point of view?

Intuition: this would be SGD on some
objective function if the rate of change of
neurons corresponds to the gradient of an
objective function

If

$$\dot{s}_i \propto -\frac{\partial J}{\partial s_i}$$

(neurons try to move towards configurations with smaller J),
and s_i is driven by a term of the form W_{ij}ρ(s_j), so that
∂s_i/∂W_{ij} ∝ ρ(s_j), then it would be nice if we could get

$$\Delta W_{ij} \propto -\frac{\partial J}{\partial W_{ij}}$$

and the proposed rule delivers exactly that:

$$\Delta W_{ij} \propto \dot{s}_i\,\rho(s_j) \propto -\frac{\partial J}{\partial s_i}\,\frac{\partial s_i}{\partial W_{ij}}$$

Leaky integrator neurons slowly
moving towards better configurations

Leaky integrator neuron with state (integrated voltage) s:

$$\dot{s} = \epsilon\,(R(s) - s) = -\epsilon\,\frac{\partial E}{\partial s}$$

(gradient descent), or, in discrete time,

$$s^{(t)} = s^{(t-1)} + \epsilon\,\big(R(s^{(t-1)}) - s^{(t-1)}\big)$$

where R(s) ∝ b + Wρ(s) is where the neuron's activation would
converge (if the rest of the network continued with state s).
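A minimal sketch of this relaxation, assuming the R(s) = b + Wρ(s) form from this slide, a sigmoid ρ, and a small symmetric random W (all illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps = 8, 0.1
W = rng.normal(0.0, 0.3, (n, n))
W = (W + W.T) / 2.0                    # symmetric connections (see Hypothesis #4)
np.fill_diagonal(W, 0.0)
b = rng.normal(0.0, 0.1, n)
rho = lambda s: 1.0 / (1.0 + np.exp(-s))

def R(s):
    # Where each neuron's state would settle if the rest of the network froze.
    return b + W @ rho(s)

s = rng.normal(size=n)
for _ in range(500):
    s = s + eps * (R(s) - s)           # leaky-integrator step towards R(s)

print("residual ||R(s) - s|| after relaxation:", np.linalg.norm(R(s) - s))
```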

Happy Coincidence

Denoising auto-encoders with reconstruction function R(s)
converge towards having R(s) − s estimate the gradient of the
log-density, i.e., minus the gradient of the energy
(Alain & Bengio, ICLR 2013)

Hypothesis #3
Inspired by Hopfield nets and Boltzmann machines

NEURAL COMPUTATION = INFERENCE:
neural activations tend to noisily move towards configurations
that make neurons' activations more compatible with each other,
according to some energy function

Variant of the energy function of the
continuous Hopfield Net

Energy (or Lyapunov) function:

$$E(s) = \frac{1}{2}\lVert s\rVert^2 - \frac{1}{2}\sum_{i\neq j} W_{i,j}\,\rho(s_i)\rho(s_j) - \sum_i b_i\,\rho(s_i)$$

It has derivative

$$\frac{\partial E(s)}{\partial s} = s - R(s)$$

where (note the extra ρ'(s) factor, different from the earlier
leaky-integrator slide)

$$R(s) = \rho'(s)\,(b + W\rho(s))$$

So

$$\dot{s} = \epsilon\,(R(s) - s) = -\epsilon\,\frac{\partial E}{\partial s}$$

is gradient descent on the energy.

Happy Coincidence

Continuous Hopfield Net + Noise =
Langevin MCMC

Neural Computation as Inference

Langevin MCMC (and most MCMC) = small steps going down
the energy, plus injecting randomness:

$$z_{t+1} = z_t - \frac{\epsilon^2}{2}\,\frac{\partial E(z_t)}{\partial z_t} + \text{GaussianNoise}$$

This is inference: moving towards good configurations of h that
explain x, given the current synaptic weights.
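A minimal sketch of this sampler, assuming a toy quadratic energy; standard Langevin dynamics pairs a drift step of ε²/2 with Gaussian noise of standard deviation ε, which is the convention used below:

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_E(z):
    # Illustrative energy E(z) = ||z||^2 / 2, so grad E(z) = z.
    return z

def langevin_step(z, eps=0.1):
    # Small step down the energy, plus injected Gaussian randomness.
    return z - 0.5 * eps**2 * grad_E(z) + eps * rng.normal(size=z.shape)

z = rng.normal(size=5)
for _ in range(1000):
    z = langevin_step(z)
# After many steps, z is approximately distributed as p(z) ∝ exp(-E(z)).
print(z)
```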

The need for symmetry

If we want each unit to follow the energy gradient,

$$\frac{ds_i}{dt} = R_i(s) - s_i \propto -\frac{\partial E}{\partial s_i},$$

with the per-unit update depending on its inputs through

$$R_i(s) := \rho'(s_i)\Big(\sum_j W_{j,i}\,\rho(s_j) + b_i\Big) \qquad \text{(Eq. 2)},$$

then we need symmetry, because differentiating the energy actually gives

$$R_i(s) = \rho'(s_i)\Big(\sum_{j\neq i} \tfrac{1}{2}(W_{i,j} + W_{j,i})\,\rho(s_j) + b_i\Big).$$

To get the same form as in Eq. 2, we need to impose symmetric
connections, i.e. W_{i,j} = W_{j,i}.
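A quick numerical check of this claim, assuming a sigmoid ρ and the energy from the continuous-Hopfield slide above, E(s) = ½‖s‖² − ½ Σ_{i≠j} W_{i,j} ρ(s_i)ρ(s_j) − Σ_i b_i ρ(s_i): with a symmetric W, s − R(s) matches ∂E/∂s computed by finite differences; with an asymmetric W, it does not.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
b = rng.normal(0.0, 0.1, n)
rho = lambda s: 1.0 / (1.0 + np.exp(-s))
drho = lambda s: rho(s) * (1.0 - rho(s))   # ρ'(s) for the sigmoid

def E(s, W):
    return 0.5 * s @ s - 0.5 * rho(s) @ W @ rho(s) - b @ rho(s)

def R(s, W):
    return drho(s) * (W @ rho(s) + b)

def grad_E_numeric(s, W, h=1e-6):
    g = np.zeros_like(s)
    for i in range(len(s)):
        e = np.zeros_like(s); e[i] = h
        g[i] = (E(s + e, W) - E(s - e, W)) / (2.0 * h)
    return g

s = rng.normal(size=n)
A = rng.normal(0.0, 0.5, (n, n)); np.fill_diagonal(A, 0.0)
for name, W in [("symmetric", (A + A.T) / 2.0), ("asymmetric", A)]:
    err = np.linalg.norm((s - R(s, W)) - grad_E_numeric(s, W))
    print(f"{name}: ||(s - R(s)) - dE/ds|| = {err:.2e}")
```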

Hypothesis #4
Inspired by Hopfield nets and Boltzmann machines

There is an inference network made of neuronal units (one or
more neurons each) such that the synaptic influence between
any pair of such units is symmetric:

$$W_{i,j} \approx W_{j,i}$$

Happy Coincidence

Autoencoders without forced symmetry
end up with symmetric weights
• Experimentally found: (Vincent et al 2011)
• WHY? (Arora et al 2015, arXiv:1511.05653)

$$h \approx \mathrm{rect}(W\,\mathrm{rect}(W^T h))$$
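A toy probe of this observation, offered only as a sketch (the architecture, random Gaussian data, and hyper-parameters are all assumptions, and such a tiny setting is not guaranteed to reproduce the effect reported by Vincent et al.): train a small untied autoencoder with SGD and measure how aligned the decoder is with the transposed encoder.

```python
import numpy as np

rng = np.random.default_rng(0)
rect = lambda a: np.maximum(a, 0.0)

n_in, n_hid, lr = 20, 10, 0.01
W1 = rng.normal(0.0, 0.1, (n_hid, n_in))   # encoder (untied)
W2 = rng.normal(0.0, 0.1, (n_in, n_hid))   # decoder (untied)

for _ in range(50000):
    x = rng.normal(size=n_in)
    h = rect(W1 @ x)
    err = W2 @ h - x                        # reconstruction error
    g_h = (W2.T @ err) * (h > 0)            # backprop through the rectifier
    W2 -= lr * np.outer(err, h)
    W1 -= lr * np.outer(g_h, x)

cos = np.sum(W2 * W1.T) / (np.linalg.norm(W2) * np.linalg.norm(W1))
print("cosine similarity between W2 and W1^T:", cos)
```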

Very Happy Coincidence

Early Inference in Continuous-Variable Energy-Based Models
Approximates Back-Propagation

The Connection to Backprop

"Early Inference in Energy-Based Models Approximates Back-Propagation",
Bengio & Fischer, 2015, arXiv:1510.02777

$$s = (x, h, y), \qquad h = (h_1, h_2)$$

Near a fixed point of the update s ← R(s), consider what happens
when the input x is clamped, h and y have settled, and then an
external signal drives y slightly towards a target value,
creating a perturbation Δy.

Now the closest layer h1 gets updated, and the perturbation is
propagated just as back-prop would mandate; similarly, this
perturbation gets propagated to h2, and so on for the next layers.

[Figure: network x → h1 → h2 → y]

Expanding the Energy Function to
Include External Influences

Modified energy:

$$F = E(s) + \beta_x C_x(s) + \beta_y C_y(s)$$

Negative phase: F⁻ = E(s) s.t. C_x(s) = 0
• x is clamped, by setting β_x = 1
• y is free, by setting β_y = 0

Positive phase: F⁺ = F⁻ + β_y C_y(s)
• x is still clamped
• y is pushed towards reducing C_y, by setting a small β_y > 0

No need to do a full inference to minimize C_y and set y = target:
propagating a small perturbation is enough.

[Figure: relaxation to the negative-phase fixed point s⁻ between t⁻
(observe x) and t⁺ (observe x & y), then towards s⁺]

Minimizing the prediction cost
during the negative phase

Consider minimizing the cost C_y(s) under the fixed-point constraint
∂F/∂s = 0, with Lagrange coefficients λ:

$$\mathcal{L}(s, W, \lambda) = C_y(s) + \lambda^T \frac{\partial F}{\partial s}$$

$$\frac{\partial \mathcal{L}}{\partial s} = 0 \;\Rightarrow\; \frac{\partial C_y}{\partial s} + \lambda^T \frac{\partial^2 F}{\partial s^2} = 0 \quad \text{at } s = s^-$$

At the fixed point and the solution λ, we can take a gradient step
with respect to W:

$$\Delta W = -\frac{\partial \mathcal{L}}{\partial W} = -\lambda^T \frac{\partial^2 F}{\partial s\,\partial W}$$

Finding the Lagrange Coefficients

The fixed-point solution s⁺ is seen as a function of β_y:

$$\frac{\partial F^+(s,\beta_y)}{\partial s}\bigg|_{s=s^+} = 0
\;\Rightarrow\;
\frac{\partial^2 F^+}{\partial s^2}\frac{\partial s^+}{\partial \beta_y} + \frac{\partial^2 F^+}{\partial s\,\partial \beta_y} = 0 \qquad (*)$$

Since F⁺ = E(s) + β_x C_x(s) + β_y C_y(s), we get

$$\frac{\partial^2 F^+}{\partial s\,\partial \beta_y} = \frac{\partial C_y}{\partial s}
\qquad\text{and}\qquad
\frac{\partial^2 F^+}{\partial s^2} = \frac{\partial^2 F}{\partial s^2} + \beta_y\,\frac{\partial^2 C_y}{\partial s^2},$$

so if we slowly raise β_y from 0, then initially the term in β_y is 0
and (*) yields

$$\frac{\partial^2 F}{\partial s^2}\frac{\partial s^+}{\partial \beta_y} + \frac{\partial C_y}{\partial s} = 0,$$

which matches the condition on λ from the previous slide when we take

$$\lambda = \frac{\partial s^+}{\partial \beta_y}.$$

SGD Update Gives a Contrastive
Hebbian Learning Step

We had λ = ∂s⁺/∂β_y, and so (the clamping costs do not depend on W,
hence ∂²F/∂s∂W = ∂²E/∂s∂W):

$$\Delta W = -\frac{\partial \mathcal{L}}{\partial W}
= -\frac{\partial^2 F}{\partial s\,\partial W}\bigg|_{s^-}\frac{\partial s^+}{\partial \beta_y}
= -\frac{\partial^2 E}{\partial s\,\partial W}\bigg|_{s^-}\frac{\partial s^+}{\partial \beta_y}$$

We had

$$E(s) = \frac{1}{2}\lVert s\rVert^2 - \frac{1}{2}\sum_{i\neq j} W_{i,j}\,\rho(s_i)\rho(s_j) - \sum_i b_i\,\rho(s_i)$$

so

$$\frac{\partial^2 E}{\partial s\,\partial W_{ij}} = -\frac{\partial}{\partial s}\,\rho(s_i)\rho(s_j)$$

and, at β_y = 0,

$$\Delta W_{ij} \propto \frac{\partial}{\partial s}\big(\rho(s_i)\rho(s_j)\big)\frac{\partial s^+}{\partial \beta_y}
= \frac{\partial}{\partial \beta_y}\,\rho(s_i)\rho(s_j)
\approx \frac{\rho(s_i^+)\rho(s_j^+) - \rho(s_i^-)\rho(s_j^-)}{\beta_y},$$

which is a form of (incremental) Contrastive Hebbian Learning,
where the positive phase is epsilon away from the negative phase.

Which corresponds to STDP if weights
are forced to be symmetric

We had

$$\Delta W_{ij} \propto \rho(s_i^+)\rho(s_j^+) - \rho(s_i^-)\rho(s_j^-)
= \int_{t^-}^{t^+}\!\frac{d}{dt}\big(\rho(s_i)\rho(s_j)\big)\,dt
= \int_{t^-}^{t^+}\!\Big(\frac{d\rho(s_i)}{dt}\,\rho(s_j) + \rho(s_i)\,\frac{d\rho(s_j)}{dt}\Big)\,dt$$

= the symmetrized version of the STDP update

$$\frac{dW_{ij}}{dt} \propto \frac{d\rho(s_i)}{dt}\,\rho(s_j)$$

from which we started.

Propagation of errors =
Incremental Target Prop

If temporal derivatives = error gradients, then feedback paths
compute incremental targets for the feedforward paths, moving
the hidden activations in the right direction.

The top-down perturbations which are propagated represent the
surprise signal, while the feedback paths compute targets towards
which the feedforward activations are moved.

[Figure: network x → h1 → h2 → y]

It works: MNIST experiments

• Clamp x, let hidden and output units relax till convergence
• Propagate the output gradient as a perturbation
• Update by the STDP rule (with forced symmetry)
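Putting the pieces together, here is a minimal sketch of that recipe on a toy fixed-point network (this is not the talk's MNIST code: the layer sizes, relaxation schedule, quadratic output cost, and learning rates are all assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
rho = lambda s: 1.0 / (1.0 + np.exp(-s))

n_x, n_h, n_y = 10, 20, 2
W1 = rng.normal(0.0, 0.1, (n_h, n_x))   # x <-> h, used symmetrically
W2 = rng.normal(0.0, 0.1, (n_y, n_h))   # h <-> y, used symmetrically
eps, beta, lr = 0.2, 0.5, 0.05

def relax(x, h, y, target=None, steps=50):
    # Leaky-integrator relaxation with x clamped throughout; when a target
    # is given, y is weakly nudged towards it (the "positive phase").
    for _ in range(steps):
        h = h + eps * (W1 @ rho(x) + W2.T @ rho(y) - h)
        drive = beta * (target - y) if target is not None else 0.0
        y = y + eps * (W2 @ rho(h) - y + drive)
    return h, y

def train_step(x, target):
    global W1, W2
    h0, y0 = relax(x, np.zeros(n_h), np.zeros(n_y))        # negative phase
    h1, y1 = relax(x, h0, y0, target=target, steps=10)     # positive phase
    # Incremental contrastive Hebbian / symmetrized-STDP update:
    W1 += lr / beta * (np.outer(rho(h1), rho(x)) - np.outer(rho(h0), rho(x)))
    W2 += lr / beta * (np.outer(rho(y1), rho(h1)) - np.outer(rho(y0), rho(h0)))

x, target = rng.normal(size=n_x), np.array([1.0, 0.0])
for _ in range(200):
    train_step(x, target)
_, y = relax(x, np.zeros(n_h), np.zeros(n_y))
print("output after training:", y, "target:", target)
```

The weight update (ρ⁺ρ⁺ᵀ − ρ⁻ρ⁻ᵀ)/β is exactly the incremental contrastive Hebbian step derived above.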

Summary of Contributions

• Showed that a rate-based update emulates STDP
  (Bengio et al., 2015, arXiv:1509.05936)
• Showed that propagation of perturbations at a fixed point of a
  symmetrically connected recurrent net propagates gradients
  (Bengio & Fischer, 2015, arXiv:1510.02777)
• Showed that the rate-based STDP update after propagation of
  perturbations corresponds to SGD on the prediction error
  (Scellier & Bengio, 2016, arXiv:1602.05179)
• Showed experimentally that a deep fixed-point recurrent net with
  1, 2 or 3 hidden layers can be trained on MNIST to 0% training error
  (Scellier & Bengio, 2016, arXiv:1602.05179)

Many Open Questions Remain

Trying to bridge the gap between neuroscience and deep learning
has seemingly helped us bridge the gap between Boltzmann machines
or continuous Hopfield nets and backprop, while connecting with
older work on fixed-point RNNs and contrastive Hebbian learning.
Many exciting & happy coincidences, and many questions!

• Experiments with noise, going from a fixed point in s to a
  fixed point in P(s)?
• How about when we are not at a fixed point? (a fixed point
  is not realistic)
• How to handle the unsupervised case?
• What if we do not have exact symmetry?
• How to go beyond static inputs? (a biologically plausible
  alternative to BPTT)

MILA: Montreal Institute for Learning Algorithms
