
Towards bridging the gap between
deep learning and biology

Yoshua Bengio
19 February 2016
MILA

Towards Bridging the Gap Between
Deep Learning and Biology

Supervised backprop works incredibly well for deep learning
AI applications, but:
• It is not clear how brains could implement it
• It does not address the unsupervised and RL problems,
  which are more biologically relevant

Central problem: credit assignment in hidden layers (and
through time, with recurrent networks)

Central Issue in Deep Learning:
Credit Assignment

What should hidden layers do?

Established approaches:
• Backpropagation
• Stochastic relaxation in Boltzmann machines
• REINFORCE (its variance scales linearly with the
  number of neurons getting the credit)

Are these related?

How does the brain do it?

What is the brain's learning algorithm?


Cue: Spike-Timing Dependent Plasticity

• Observed throughout the nervous system, especially in cortex
• STDP: the weight increases if the post-spike comes just after
  the pre-spike, and decreases if it comes just before
• Timing counts only if there is a spike on only one side
  within the window

Hypothesis #1
Inspired by a hypothesis from Xie & Seung 2000, as well as
Hinton 2007 (Deep Learning Workshop talk)

STDP is explained by a learning rule of this form:
weight change proportional to the post-synaptic rate of change
times the pre-synaptic spike, or

$$\Delta W_{i,j} \propto \frac{d\rho(s_i)}{dt}\,\rho(s_j)$$

Proposed Interpretation of STDP
Inspired by Hinton 2007 (Deep Learning Workshop talk)

Let s = continuous-valued state of all neurons
= soma integrated voltage potential (averaging out the effect of spikes)

Proposed learning rule:

$$\Delta W_{i,j} \propto \frac{d\rho(s_i)}{dt}\,\rho(s_j)$$

where ΔW_{i,j} is the synaptic change, ρ is the neuron nonlinearity,
dρ(s_i)/dt is the temporal change in the post-synaptic firing rate,
and ρ(s_j) is the pre-synaptic spike rate (or equivalently, the
spikes themselves).
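To make the rule concrete, here is a minimal numerical sketch of one update, assuming a sigmoid nonlinearity ρ and a finite-difference estimate of dρ(s_i)/dt (both are illustrative choices, not specified on the slide):

```python
import numpy as np

def rho(s):
    # Assumed neuron nonlinearity (illustrative choice: a sigmoid).
    return 1.0 / (1.0 + np.exp(-s))

def stdp_rate_update(W, s_prev, s_now, dt, lr=0.01):
    """One step of the rate-based rule dW_ij ∝ dρ(s_i)/dt · ρ(s_j).

    s_prev, s_now: states of all neurons at two consecutive times;
    entry (i, j) of W connects pre-synaptic neuron j to post-synaptic neuron i.
    """
    drho_post = (rho(s_now) - rho(s_prev)) / dt  # dρ(s_i)/dt, finite difference
    rho_pre = rho(s_now)                         # ρ(s_j), pre-synaptic rate
    return W + lr * np.outer(drho_post, rho_pre)
```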

Happy Coincidence

In simulations, this learning rule fits the
classical STDP curves

Comparative Behavior:
Simulations support the hypothesis

"STDP as presynaptic activity times rate of change of postsynaptic
activity", Bengio et al., 2015, arXiv:1509.05936

[Figure: weight change vs. post-minus-pre spike timing difference;
left: biological observation (Bi & Poo 2001); right: our simulation
using the proposed rule]

Why it matches the STDP curve

When the post-synaptic s increases, the probability of a post-spike
is larger after some event (a pre-spike) than before: the nearest
post-spike is more likely to come after the pre-spike. A positive
slope of the post-synaptic s therefore yields a weight increase,
and vice-versa.

[Figure: post-synaptic s rising around the time of a pre-synaptic spike]
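A tiny numerical illustration of this argument (the Gaussian bump for the post-synaptic state, the nonlinearity, and the pre-spike times below are all assumptions made for illustration): a pre-spike that arrives while the post-synaptic state is rising yields a positive weight change under the Hypothesis #1 rule, and a negative one on the falling side.

```python
import numpy as np

t = np.linspace(0.0, 1.0, 1001)
s_post = np.exp(-((t - 0.5) / 0.1) ** 2)  # post-synaptic state: rises, then falls
rho = lambda s: 1.0 / (1.0 + np.exp(-4.0 * (s - 0.5)))
drho_dt = np.gradient(rho(s_post), t)     # dρ(s_post)/dt

for t_pre in (0.35, 0.50, 0.65):          # pre-spike before / at / after the peak
    i = np.abs(t - t_pre).argmin()
    dW = drho_dt[i]                       # weight change ∝ dρ(s_i)/dt at the pre-spike
    print(f"pre-spike at t={t_pre:.2f}: dW ∝ {dW:+.3f}")
```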

How to make sense of this view of STDP
from a machine learning point of view?

Intuition: this would be SGD on some
objective function if the rate of change of
neurons corresponds to the gradient of an
objective function

If

$$\dot{s}_i \propto -\frac{\partial J}{\partial s_i}$$

(neurons try to move towards configurations with smaller J),
and s_i is driven by a term of the form W_{ij}ρ(s_j), so that
∂s_i/∂W_{ij} ∝ ρ(s_j), then it would be nice if we could get

$$\Delta W_{ij} \propto -\frac{\partial J}{\partial W_{ij}}$$

and the proposed rule delivers exactly that:

$$\Delta W_{ij} \propto \dot{s}_i\,\rho(s_j) \propto -\frac{\partial J}{\partial s_i}\,\frac{\partial s_i}{\partial W_{ij}}$$

Leaky integrator neurons slowly
moving towards better configurations

Leaky integrator neuron with state (integrated voltage) s:

$$\dot{s} = \epsilon\,(R(s) - s) = -\epsilon\,\frac{\partial E}{\partial s}$$

(gradient descent), or, in discrete time,

$$s^{(t)} = s^{(t-1)} + \epsilon\,\big(R(s^{(t-1)}) - s^{(t-1)}\big)$$

where R(s) ∝ b + Wρ(s) is where the neuron's activation would
converge (if the rest of the network continued with state s).
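A minimal sketch of this relaxation, assuming the R(s) = b + Wρ(s) form from this slide, a sigmoid ρ, and a small symmetric random W (all illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps = 8, 0.1
W = rng.normal(0.0, 0.3, (n, n))
W = (W + W.T) / 2.0                    # symmetric connections (see Hypothesis #4)
np.fill_diagonal(W, 0.0)
b = rng.normal(0.0, 0.1, n)
rho = lambda s: 1.0 / (1.0 + np.exp(-s))

def R(s):
    # Where each neuron's state would settle if the rest of the network froze.
    return b + W @ rho(s)

s = rng.normal(size=n)
for _ in range(500):
    s = s + eps * (R(s) - s)           # leaky-integrator step towards R(s)

print("residual ||R(s) - s|| after relaxation:", np.linalg.norm(R(s) - s))
```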

Happy Coincidence

Denoising auto-encoders with reconstruction function R(s)
converge towards having R(s) − s estimate the gradient of the
log-density, i.e., minus the gradient of the energy
(Alain & Bengio, ICLR 2013)

Hypothesis #3
Inspired by Hopfield nets and Boltzmann machines

NEURAL COMPUTATION = INFERENCE:
neural activations tend to noisily move towards configurations
that make neurons' activations more compatible with each other,
according to some energy function

Variant of the energy function of the
continuous Hopfield Net

Energy (or Lyapunov) function:

$$E(s) = \frac{1}{2}\lVert s\rVert^2 - \frac{1}{2}\sum_{i\neq j} W_{i,j}\,\rho(s_i)\rho(s_j) - \sum_i b_i\,\rho(s_i)$$

It has derivative

$$\frac{\partial E(s)}{\partial s} = s - R(s)$$

where (note the extra ρ'(s) factor, different from the earlier
leaky-integrator slide)

$$R(s) = \rho'(s)\,(b + W\rho(s))$$

So

$$\dot{s} = \epsilon\,(R(s) - s) = -\epsilon\,\frac{\partial E}{\partial s}$$

is gradient descent on the energy.

Happy Coincidence

Continuous Hopfield Net + Noise =
Langevin MCMC

Neural Computation as Inference

Langevin MCMC (and most MCMC) = small steps going down
the energy, plus injecting randomness:

$$z_{t+1} = z_t - \frac{\epsilon^2}{2}\,\frac{\partial E(z_t)}{\partial z_t} + \text{GaussianNoise}$$

This is inference: moving towards good configurations of h that
explain x, given the current synaptic weights.
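A minimal sketch of this sampler, assuming a toy quadratic energy; standard Langevin dynamics pairs a drift step of ε²/2 with Gaussian noise of standard deviation ε, which is the convention used below:

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_E(z):
    # Illustrative energy E(z) = ||z||^2 / 2, so grad E(z) = z.
    return z

def langevin_step(z, eps=0.1):
    # Small step down the energy, plus injected Gaussian randomness.
    return z - 0.5 * eps**2 * grad_E(z) + eps * rng.normal(size=z.shape)

z = rng.normal(size=5)
for _ in range(1000):
    z = langevin_step(z)
# After many steps, z is approximately distributed as p(z) ∝ exp(-E(z)).
print(z)
```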

The need for symmetry

If we want each unit to follow the energy gradient,

$$\frac{ds_i}{dt} = R_i(s) - s_i \propto -\frac{\partial E}{\partial s_i},$$

with the per-unit update depending on its inputs through

$$R_i(s) := \rho'(s_i)\Big(\sum_j W_{j,i}\,\rho(s_j) + b_i\Big) \qquad \text{(Eq. 2)},$$

then we need symmetry, because differentiating the energy actually gives

$$R_i(s) = \rho'(s_i)\Big(\sum_{j\neq i} \tfrac{1}{2}(W_{i,j} + W_{j,i})\,\rho(s_j) + b_i\Big).$$

To get the same form as in Eq. 2, we need to impose symmetric
connections, i.e. W_{i,j} = W_{j,i}.
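A quick numerical check of this claim, assuming a sigmoid ρ and the energy from the continuous-Hopfield slide above, E(s) = ½‖s‖² − ½ Σ_{i≠j} W_{i,j} ρ(s_i)ρ(s_j) − Σ_i b_i ρ(s_i): with a symmetric W, s − R(s) matches ∂E/∂s computed by finite differences; with an asymmetric W, it does not.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
b = rng.normal(0.0, 0.1, n)
rho = lambda s: 1.0 / (1.0 + np.exp(-s))
drho = lambda s: rho(s) * (1.0 - rho(s))   # ρ'(s) for the sigmoid

def E(s, W):
    return 0.5 * s @ s - 0.5 * rho(s) @ W @ rho(s) - b @ rho(s)

def R(s, W):
    return drho(s) * (W @ rho(s) + b)

def grad_E_numeric(s, W, h=1e-6):
    g = np.zeros_like(s)
    for i in range(len(s)):
        e = np.zeros_like(s); e[i] = h
        g[i] = (E(s + e, W) - E(s - e, W)) / (2.0 * h)
    return g

s = rng.normal(size=n)
A = rng.normal(0.0, 0.5, (n, n)); np.fill_diagonal(A, 0.0)
for name, W in [("symmetric", (A + A.T) / 2.0), ("asymmetric", A)]:
    err = np.linalg.norm((s - R(s, W)) - grad_E_numeric(s, W))
    print(f"{name}: ||(s - R(s)) - dE/ds|| = {err:.2e}")
```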

Hypothesis #4
Inspired by Hopfield nets and Boltzmann machines

There is an inference network made of neuronal units (one or
more neurons each) such that the synaptic influence between
any pair of such units is symmetric:

$$W_{i,j} \approx W_{j,i}$$

Happy Coincidence

Autoencoders without forced symmetry
end up with symmetric weights
• Experimentally found: (Vincent et al 2011)
• WHY? (Arora et al 2015, arXiv:1511.05653)

$$h \approx \mathrm{rect}(W\,\mathrm{rect}(W^T h))$$
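A toy probe of this observation, offered only as a sketch (the architecture, random Gaussian data, and hyper-parameters are all assumptions, and such a tiny setting is not guaranteed to reproduce the effect reported by Vincent et al.): train a small untied autoencoder with SGD and measure how aligned the decoder is with the transposed encoder.

```python
import numpy as np

rng = np.random.default_rng(0)
rect = lambda a: np.maximum(a, 0.0)

n_in, n_hid, lr = 20, 10, 0.01
W1 = rng.normal(0.0, 0.1, (n_hid, n_in))   # encoder (untied)
W2 = rng.normal(0.0, 0.1, (n_in, n_hid))   # decoder (untied)

for _ in range(50000):
    x = rng.normal(size=n_in)
    h = rect(W1 @ x)
    err = W2 @ h - x                        # reconstruction error
    g_h = (W2.T @ err) * (h > 0)            # backprop through the rectifier
    W2 -= lr * np.outer(err, h)
    W1 -= lr * np.outer(g_h, x)

cos = np.sum(W2 * W1.T) / (np.linalg.norm(W2) * np.linalg.norm(W1))
print("cosine similarity between W2 and W1^T:", cos)
```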

Very Happy Coincidence

Early Inference in Continuous-Variable Energy-Based Models
Approximates Back-Propagation

The Connection to Backprop

"Early Inference in Energy-Based Models Approximates Back-Propagation",
Bengio & Fischer, 2015, arXiv:1510.02777

$$s = (x, h, y), \qquad h = (h_1, h_2)$$

Near a fixed point of the update s ← R(s), consider what happens
when the input x is clamped, h and y have settled, and then an
external signal drives y slightly towards a target value,
creating a perturbation Δy.

Now the closest layer h1 gets updated, and the perturbation is
propagated just as back-prop would mandate; similarly, this
perturbation gets propagated to h2, and so on for the next layers.

[Figure: network x → h1 → h2 → y]

Expanding the Energy Function to
Include External Influences

Modified energy:

$$F = E(s) + \beta_x C_x(s) + \beta_y C_y(s)$$

Negative phase: F⁻ = E(s) s.t. C_x(s) = 0
• x is clamped, by setting β_x = 1
• y is free, by setting β_y = 0

Positive phase: F⁺ = F⁻ + β_y C_y(s)
• x is still clamped
• y is pushed towards reducing C_y, by setting a small β_y > 0

No need to do a full inference to minimize C_y and set y = target:
propagating a small perturbation is enough.

[Figure: relaxation to the negative-phase fixed point s⁻ between t⁻
(observe x) and t⁺ (observe x & y), then towards s⁺]

Minimizing the prediction cost
during the negative phase

Consider minimizing the cost C_y(s) under the fixed-point constraint
∂F/∂s = 0, with Lagrange coefficients λ:

$$\mathcal{L}(s, W, \lambda) = C_y(s) + \lambda^T \frac{\partial F}{\partial s}$$

$$\frac{\partial \mathcal{L}}{\partial s} = 0 \;\Rightarrow\; \frac{\partial C_y}{\partial s} + \lambda^T \frac{\partial^2 F}{\partial s^2} = 0 \quad \text{at } s = s^-$$

At the fixed point and the solution λ, we can take a gradient step
with respect to W:

$$\Delta W = -\frac{\partial \mathcal{L}}{\partial W} = -\lambda^T \frac{\partial^2 F}{\partial s\,\partial W}$$

Finding the Lagrange Coefficients

The fixed-point solution s⁺ is seen as a function of β_y:

$$\frac{\partial F^+(s,\beta_y)}{\partial s}\bigg|_{s=s^+} = 0
\;\Rightarrow\;
\frac{\partial^2 F^+}{\partial s^2}\frac{\partial s^+}{\partial \beta_y} + \frac{\partial^2 F^+}{\partial s\,\partial \beta_y} = 0 \qquad (*)$$

Since F⁺ = E(s) + β_x C_x(s) + β_y C_y(s), we get

$$\frac{\partial^2 F^+}{\partial s\,\partial \beta_y} = \frac{\partial C_y}{\partial s}
\qquad\text{and}\qquad
\frac{\partial^2 F^+}{\partial s^2} = \frac{\partial^2 F}{\partial s^2} + \beta_y\,\frac{\partial^2 C_y}{\partial s^2},$$

so if we slowly raise β_y from 0, then initially the term in β_y is 0
and (*) yields

$$\frac{\partial^2 F}{\partial s^2}\frac{\partial s^+}{\partial \beta_y} + \frac{\partial C_y}{\partial s} = 0,$$

which matches the condition on λ from the previous slide when we take

$$\lambda = \frac{\partial s^+}{\partial \beta_y}.$$

SGD Update Gives a Contrastive
Hebbian Learning Step

We had λ = ∂s⁺/∂β_y, and so (the clamping costs do not depend on W,
hence ∂²F/∂s∂W = ∂²E/∂s∂W):

$$\Delta W = -\frac{\partial \mathcal{L}}{\partial W}
= -\frac{\partial^2 F}{\partial s\,\partial W}\bigg|_{s^-}\frac{\partial s^+}{\partial \beta_y}
= -\frac{\partial^2 E}{\partial s\,\partial W}\bigg|_{s^-}\frac{\partial s^+}{\partial \beta_y}$$

We had

$$E(s) = \frac{1}{2}\lVert s\rVert^2 - \frac{1}{2}\sum_{i\neq j} W_{i,j}\,\rho(s_i)\rho(s_j) - \sum_i b_i\,\rho(s_i)$$

so

$$\frac{\partial^2 E}{\partial s\,\partial W_{ij}} = -\frac{\partial}{\partial s}\,\rho(s_i)\rho(s_j)$$

and, at β_y = 0,

$$\Delta W_{ij} \propto \frac{\partial}{\partial s}\big(\rho(s_i)\rho(s_j)\big)\frac{\partial s^+}{\partial \beta_y}
= \frac{\partial}{\partial \beta_y}\,\rho(s_i)\rho(s_j)
\approx \frac{\rho(s_i^+)\rho(s_j^+) - \rho(s_i^-)\rho(s_j^-)}{\beta_y},$$

which is a form of (incremental) Contrastive Hebbian Learning,
where the positive phase is epsilon away from the negative phase.

Which corresponds to STDP if weights
are forced to be symmetric

We had

$$\Delta W_{ij} \propto \rho(s_i^+)\rho(s_j^+) - \rho(s_i^-)\rho(s_j^-)
= \int_{t^-}^{t^+}\!\frac{d}{dt}\big(\rho(s_i)\rho(s_j)\big)\,dt
= \int_{t^-}^{t^+}\!\Big(\frac{d\rho(s_i)}{dt}\,\rho(s_j) + \rho(s_i)\,\frac{d\rho(s_j)}{dt}\Big)\,dt$$

= the symmetrized version of the STDP update

$$\frac{dW_{ij}}{dt} \propto \frac{d\rho(s_i)}{dt}\,\rho(s_j)$$

from which we started.

Propagation of errors =
Incremental Target Prop

If temporal derivatives = error gradients, then feedback paths
compute incremental targets for the feedforward paths, moving
the hidden activations in the right direction.

The top-down perturbations which are propagated represent the
surprise signal, while the feedback paths compute targets towards
which the feedforward activations are moved.

[Figure: network x → h1 → h2 → y]

It works: MNIST experiments

• Clamp x, let hidden and output units relax till convergence
• Propagate the output gradient as a perturbation
• Update by the STDP rule (with forced symmetry)
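Putting the pieces together, here is a minimal sketch of that recipe on a toy fixed-point network (this is not the talk's MNIST code: the layer sizes, relaxation schedule, quadratic output cost, and learning rates are all assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
rho = lambda s: 1.0 / (1.0 + np.exp(-s))

n_x, n_h, n_y = 10, 20, 2
W1 = rng.normal(0.0, 0.1, (n_h, n_x))   # x <-> h, used symmetrically
W2 = rng.normal(0.0, 0.1, (n_y, n_h))   # h <-> y, used symmetrically
eps, beta, lr = 0.2, 0.5, 0.05

def relax(x, h, y, target=None, steps=50):
    # Leaky-integrator relaxation with x clamped throughout; when a target
    # is given, y is weakly nudged towards it (the "positive phase").
    for _ in range(steps):
        h = h + eps * (W1 @ rho(x) + W2.T @ rho(y) - h)
        drive = beta * (target - y) if target is not None else 0.0
        y = y + eps * (W2 @ rho(h) - y + drive)
    return h, y

def train_step(x, target):
    global W1, W2
    h0, y0 = relax(x, np.zeros(n_h), np.zeros(n_y))        # negative phase
    h1, y1 = relax(x, h0, y0, target=target, steps=10)     # positive phase
    # Incremental contrastive Hebbian / symmetrized-STDP update:
    W1 += lr / beta * (np.outer(rho(h1), rho(x)) - np.outer(rho(h0), rho(x)))
    W2 += lr / beta * (np.outer(rho(y1), rho(h1)) - np.outer(rho(y0), rho(h0)))

x, target = rng.normal(size=n_x), np.array([1.0, 0.0])
for _ in range(200):
    train_step(x, target)
_, y = relax(x, np.zeros(n_h), np.zeros(n_y))
print("output after training:", y, "target:", target)
```

The weight update (ρ⁺ρ⁺ᵀ − ρ⁻ρ⁻ᵀ)/β is exactly the incremental contrastive Hebbian step derived above.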

Summary of Contributions

• Showed that a rate-based update emulates STDP
  (Bengio et al., 2015, arXiv:1509.05936)
• Showed that propagation of perturbations at a fixed point of a
  symmetrically connected recurrent net propagates gradients
  (Bengio & Fischer, 2015, arXiv:1510.02777)
• Showed that the rate-based STDP update after propagation of
  perturbations corresponds to SGD on the prediction error
  (Scellier & Bengio, 2016, arXiv:1602.05179)
• Showed experimentally that a deep fixed-point recurrent net with
  1, 2 or 3 hidden layers can be trained on MNIST to 0% training error
  (Scellier & Bengio, 2016, arXiv:1602.05179)

Many Open Questions Remain

Trying to bridge the gap between neuroscience and deep learning
has seemingly helped us bridge the gap between Boltzmann machines
or continuous Hopfield nets and backprop, while connecting with
older work on fixed-point RNNs and contrastive Hebbian learning.
Many exciting & happy coincidences, and many questions!

• Experiments with noise, going from a fixed point in s to a
  fixed point in P(s)?
• How about when we are not at a fixed point? (a fixed point
  is not realistic)
• How to handle the unsupervised case?
• What if we do not have exact symmetry?
• How to go beyond static inputs? (a biologically plausible
  alternative to BPTT)

MILA: Montreal Institute for Learning Algorithms
