
Deep Learning for NLP
(without Magic)
Richard Socher and Christopher Manning
Stanford University
NAACL 2013, Atlanta
http://nlp.stanford.edu/courses/NAACL2013/
*with a big thank you to Yoshua Bengio, with whom we
participated in the previous ACL 2012 version of this tutorial
Deep Learning
Most current machine learning works
well because of human-designed
representations and input features
Machine learning then becomes just optimizing
weights to best make a final prediction
Representation learning attempts to
automatically learn good features or representations
Deep learning algorithms attempt to learn multiple levels of
representation of increasing complexity/abstraction
(Figure: examples of hand-designed representations and features:
NER, WordNet, SRL, parser)
A Deep Architecture
Mainly, work has explored deep belief networks (DBNs), Markov
Random Fields with multiple layers, and various types of
multiple-layer neural networks
Output layer
Here predicting a supervised target
Hidden layers
These learn more abstract
representations as you head up
Input layer
Raw sensory inputs (roughly)
Five Reasons to Explore
Deep Learning
Part 1.1: The Basics
Learning representations
Handcrafting features is time-consuming
The features are often both over-specified and incomplete
The work has to be done again for each task/domain/...
We must move beyond handcrafted features and simple ML
Humans develop representations for learning and reasoning
Our computers should do the same
Deep learning provides a way of doing this
The need for distributed
representations
Current NLP systems are incredibly fragile because of
their atomic symbol representations
(Figure: a parse error caused by a crazy sentential
complement, such as for
"likes [(being) crazy]")
The need for distributional &
distributed representations
Learned word representations help enormously in NLP
They provide a powerful similarity model for words
Distributional-similarity-based word clusters greatly help most
applications
+1.4% F1 Dependency Parsing, 13.2% error reduction (Koo &
Collins 2008, Brown clustering)
+3.4% F1 Named Entity Recognition, 23.7% error reduction
(Stanford NER, exchange clustering)
Distributed representations can do even better by representing
more dimensions of similarity
Learning features that are not mutually exclusive can be exponentially
more efficient than nearest-neighbor-like or clustering-like models
The need for distributed
representations
(Figure: clustering vs. multi-clustering over inputs C1, C2, C3 -- a
multi-clustering model can represent many more regions with the
same number of parameters)
Distributed representations deal with
the curse of dimensionality
Generalizing locally (e.g., nearest
neighbors) requires representative
examples for all relevant variations!
Classic solutions:
Manual feature design
Assuming a smooth target
function (e.g., linear models)
Kernel methods (linear in terms
of a kernel based on data points)
Neural networks parameterize and
learn a "similarity" kernel
Unsupervised feature and
weight learning
Today, most practical, good NLP & ML methods require
labeled training data (i.e., supervised learning)
But almost all data is unlabeled
Most information must be acquired unsupervised
Fortunately, a good model of observed data can really help you
learn classification decisions
Learning multiple levels of
representation
We need good intermediate representations
that can be shared across tasks
Multiple levels of latent variables allow
combinatorial sharing of statistical strength
Insufficient model depth can be
exponentially inefficient
Biologically inspired learning:
The cortex seems to have a generic
learning algorithm
The brain has a deep architecture
(Figure: a shared linguistic input feeding Task 1, Task 2, and Task 3 outputs)
Learning multiple levels
of representation
Successive model layers learn deeper intermediate representations
Layer 1
Layer 2
Layer 3
High-level
linguistic representations
[Lee et al. ICML 2009; Lee et al. NIPS 2009]
Handling the recursivity of human
language
Human sentences are composed
from words and phrases
We need compositionality in our
ML models
Recursion: the same operator
(same parameters) is applied
repeatedly on different
components
(Figure: the parse tree of "A small crowd quietly enters the
historic church" mapped to semantic representations, next to a
recurrent chain over x_{t-1}, x_t, x_{t+1} and z_{t-1}, z_t, z_{t+1})
Why now?
Despite prior investigation and understanding of many of the
algorithmic techniques ...
Before 2006 training deep architectures was unsuccessful!
What has changed?
New methods for unsupervised pre-training have been
developed (Restricted Boltzmann Machines = RBMs,
autoencoders, contrastive estimation, etc.)
More efficient parameter estimation methods
Better understanding of model regularization
Deep Learning models have already
achieved impressive results for HLT
Neural Language Model
[Mikolov et al. Interspeech 2011]

Model \ WSJ ASR task      | Eval WER
KN5 baseline              | 17.2
Discriminative LM         | 16.9
Recurrent NN combination  | 14.4

MSR MAVIS Speech System
[Dahl et al. 2012; Seide et al. 2011;
following Mohamed et al. 2011]
"The algorithms represent the first time a
company has released a deep-neural-
networks (DNN)-based speech-recognition
algorithm in a commercial product."

Acoustic model & training          | Recog \ WER   | RT03S FSH   | Hub5 SWB
GMM 40-mix, BMMI, SWB 309h         | 1-pass -adapt | 27.4        | 23.6
DBN-DNN 7 layers x 2048, SWB 309h  | 1-pass -adapt | 18.5 (-33%) | 16.1 (-32%)
GMM 72-mix, BMMI, FSH 2000h        | k-pass +adapt | 18.6        | 17.1
Deep Learning Models Have Interesting
Performance Characteristics
Deep learning models can now be very fast in some circumstances
SENNA [Collobert et al. 2011] can do POS or NER faster than
other SOTA taggers (16x to 122x), using 25x less memory
WSJ POS 97.29% acc; CoNLL NER 89.59 F1; CoNLL Chunking 94.32 F1
Changes in computing technology favor deep learning
In NLP, speed has traditionally come from exploiting sparsity
But with modern machines, branches and widely spaced
memory accesses are costly
Uniform parallel operations on dense vectors are faster
These trends are even stronger with multi-core CPUs and GPUs
Outline of the Tutorial
1. The Basics
   1. Motivations
   2. From logistic regression to neural networks
   3. Word representations
   4. Unsupervised word vector learning
   5. Backpropagation training
   6. Learning word-level classifiers: POS and NER
   7. Sharing statistical strength
2. Recursive neural networks
3. Applications, discussion, and resources
Outline of the Tutorial
1. The Basics
2. Recursive neural networks
   1. Motivation
   2. Recursive neural networks for parsing
   3. Optimization and backpropagation through structure
   4. Compositional Vector Grammars: parsing
   5. Recursive Autoencoders: paraphrase detection
   6. Matrix-Vector RNNs: relation classification
   7. Recursive Neural Tensor Networks: sentiment analysis
3. Applications, discussion, and resources
Outline of the Tutorial
1. The Basics
2. Recursive neural networks
3. Applications, discussion, and resources
   1. Assorted speech and NLP applications
   2. Deep learning: general strategy and tricks
   3. Resources (readings, code, ...)
   4. Discussion
From logistic regression to
neural nets
Part 1.2: The Basics
Demystifying neural networks
Neural networks come with
their own terminological
baggage
... just like SVMs
But if you understand how
logistic regression or maxent
models work
Then you already understand the
operation of a basic neural
network neuron!
A single neuron
A computational unit with n (= 3) inputs
and 1 output
and parameters W, b
Activation
function
Inputs
Bias unit corresponds to the intercept term
Output
From Maxent Classifiers to Neural
Networks
In NLP, a maxent classifier is normally written as:

P(c | d, \lambda) = \frac{\exp \sum_i \lambda_i f_i(c, d)}{\sum_{c' \in C} \exp \sum_i \lambda_i f_i(c', d)}

Supervised learning gives us a distribution for datum d over the classes in C
Vector form:

P(c | d, \lambda) = \frac{e^{\lambda^\top f(c, d)}}{\sum_{c'} e^{\lambda^\top f(c', d)}}

Such a classifier is used as-is in a neural network (a "softmax layer")
Often as the top layer: softmax(Wx)
But for now we'll derive a two-class logistic model for one neuron
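A minimal NumPy sketch of the softmax/maxent probability above. The feature matrix and weights are toy values (assumptions), and the max-subtraction is just a standard numerical-stability trick, not part of the slide's formula.

    import numpy as np

    def maxent_probs(lmbda, features):
        """P(c | d) for a maxent / softmax classifier.
        lmbda:    weight vector, shape (k,)
        features: row c holds f(c, d), shape (num_classes, k)"""
        scores = features @ lmbda          # lambda^T f(c, d) for every class c
        scores -= scores.max()             # subtract max for numerical stability
        exp_scores = np.exp(scores)
        return exp_scores / exp_scores.sum()

    # Toy usage: 3 classes, 4 features (made-up numbers)
    f = np.array([[1., 0., 2., 0.],
                  [0., 1., 0., 1.],
                  [1., 1., 1., 1.]])
    lam = np.array([0.5, -0.2, 0.1, 0.3])
    print(maxent_probs(lam, f))            # class probabilities, sum to 1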
From Maxent Classifiers to Neural
Networks
Vector form:

P(c | d, \lambda) = \frac{e^{\lambda^\top f(c, d)}}{\sum_{c'} e^{\lambda^\top f(c', d)}}

Make it two-class:

P(c_1 | d, \lambda)
  = \frac{e^{\lambda^\top f(c_1, d)}}{e^{\lambda^\top f(c_1, d)} + e^{\lambda^\top f(c_2, d)}}
  = \frac{e^{\lambda^\top f(c_1, d)}}{e^{\lambda^\top f(c_1, d)} + e^{\lambda^\top f(c_2, d)}} \cdot \frac{e^{-\lambda^\top f(c_1, d)}}{e^{-\lambda^\top f(c_1, d)}}
  = \frac{1}{1 + e^{\lambda^\top [f(c_2, d) - f(c_1, d)]}}
  = \frac{1}{1 + e^{-\lambda^\top x}}   for x = f(c_1, d) - f(c_2, d)
= f (!
!
x)
P(c | d, !) =
e
!
!
f (c,d)
e
!
!
f ( ! c ,d)
! c
"
for f(z) = 1/(1 + exp(-z)), Lhe loglsuc funcuon - a slgmold non-llnearlLy.
This is exactly what a neuron
computes
h
w,b
(x) = f (w
!
x +b)
f (z) =
1
1+e
!z
w, b are Lhe parameLers of Lhls neuron
l.e., Lhls loglsuc regresslon model
23
b. We can have an always on"
feaLure, whlch glves a class prlor,
or separaLe lL ouL, as a blas Lerm
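A minimal sketch of the single neuron above; the input, weight, and bias values are made up for illustration.

    import numpy as np

    def logistic(z):
        return 1.0 / (1.0 + np.exp(-z))

    def neuron(x, w, b):
        """Output of a single logistic neuron: f(w^T x + b)."""
        return logistic(np.dot(w, x) + b)

    x = np.array([0.5, -1.0, 2.0])   # three inputs
    w = np.array([0.1, 0.4, -0.3])   # weights (parameters)
    b = 0.05                         # bias, i.e. the intercept term
    print(neuron(x, w, b))           # a value in (0, 1)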
A neural network = running several
logistic regressions at the same time
If we feed a vector of inputs through a bunch of logistic regression
functions, then we get a vector of outputs ...
But we don't have to decide
ahead of time what variables
these logistic regressions are
trying to predict!
A neural network = running several
logistic regressions at the same time
... which we can feed into another logistic regression function
It is the training
criterion that will direct
what the intermediate
hidden variables should
be, so as to do a good
job at predicting the
targets for the next
layer, etc.
A neural network = running several
logistic regressions at the same time
Before we know it, we have a multilayer neural network ...
Matrix notation for a layer
We have

a_1 = f(W_{11} x_1 + W_{12} x_2 + W_{13} x_3 + b_1)
a_2 = f(W_{21} x_1 + W_{22} x_2 + W_{23} x_3 + b_2)
etc.

In matrix notation

z = W x + b
a = f(z)

where f is applied element-wise:

f([z_1, z_2, z_3]) = [f(z_1), f(z_2), f(z_3)]
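A minimal sketch of the layer equations above with toy sizes (3 inputs, 2 hidden units); the random weights are placeholders.

    import numpy as np

    def logistic(z):
        return 1.0 / (1.0 + np.exp(-z))      # f is applied element-wise

    def layer(x, W, b):
        """One layer: a = f(z), z = W x + b."""
        z = W @ x + b
        return logistic(z)

    x = np.array([1.0, 0.5, -0.5])           # x1, x2, x3
    W = np.random.randn(2, 3) * 0.1          # weights W_ij (2 hidden units, 3 inputs)
    b = np.zeros(2)                          # biases b_1, b_2
    print(layer(x, W, b))                    # activations a_1, a_2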
How do we train the weights W?
For a single supervised layer, we train just like a maxent model:
we calculate and use error derivatives (gradients) to improve
Online learning: stochastic gradient descent (SGD)
Or improved versions like AdaGrad (Duchi, Hazan, & Singer 2010)
Batch learning: conjugate gradient or L-BFGS
A multilayer net could be more complex because the internal
("hidden") logistic units make the function non-convex ... just as
for hidden CRFs [Quattoni et al. 2005, Gunawardana et al. 2005]
But we can use the same ideas and techniques
Just without guarantees ...
We "backpropagate" error derivatives through the model
Non-linearities: Why they're needed
For logistic regression: map to probabilities
Here: function approximation,
e.g., regression or classification
Without non-linearities, deep neural networks
can't do anything more than a linear transform
Extra layers could just be compiled down into
a single linear transform
A probabilistic interpretation is unnecessary except in
the Boltzmann machine/graphical models
People often use other non-linearities, such as
tanh, as we'll discuss in part 3
Summary
Knowing the meaning of the words!
You now understand the basics and the relation to other models
Neuron = logistic regression or similar function
Input layer = input training/test vector
Bias unit = intercept term / always-on feature
Activation = response
Activation function is a logistic (or similar "sigmoid") nonlinearity
Backpropagation = running stochastic gradient descent backward
layer-by-layer in a multilayer network
Weight decay = regularization / Bayesian prior
Effective deep learning became possible
through unsupervised pre-training
[Erhan et al., JMLR 2010]
(Figure: 0-9 handwritten digit recognition error rate on MNIST data:
purely supervised neural net vs. with unsupervised pre-training
(with RBMs and denoising auto-encoders))
Word Representations
Part 1.3: The Basics
The standard word representation
The vast majority of rule-based and statistical NLP work regards
words as atomic symbols: hotel, conference, walk
In vector space terms, this is a vector with one 1 and a lot of zeroes
[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
Dimensionality: 20K (speech) - 50K (PTB) - 500K (big vocab) - 13M (Google 1T)
We call this a "one-hot" representation. Its problem:
motel [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0] AND
hotel [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0] = 0
Distributional similarity based
representations
You can get a lot of value by representing a word by
means of its neighbors
"You shall know a word by the company it keeps"
(J. R. Firth 1957: 11)
One of the most successful ideas of modern statistical NLP
government debt problems turning into banking crises as has happened in
saying that Europe needs unified banking regulation to replace the hodgepodge
These context words will represent banking!
You can vary whether you use local or large context
to get a more syntactic or semantic clustering
Class-based (hard) and soft
clustering word representations
Class-based models learn word classes of similar words based on
distributional information ( ~ class HMM)
Brown clustering (Brown et al. 1992)
Exchange clustering (Martin et al. 1998, Clark 2003)
Desparsification and a great example of unsupervised pre-training
Soft clustering models learn for each cluster/topic a distribution
over words of how likely that word is in each cluster
Latent Semantic Analysis (LSA/LSI), random projections
Latent Dirichlet Allocation (LDA), HMM clustering
Neural word embeddings
as a distributed representation
Similar idea
Combine vector space
semantics with the prediction of
probabilistic models (Bengio et
al. 2003, Collobert & Weston
2008, Turian et al. 2010)
In all of these approaches,
including deep learning models,
a word is represented as a
dense vector, e.g.

linguistics = [0.286, 0.792, -0.177, -0.107, 0.109, -0.342, 0.349, 0.271]

Neural word embeddings -
visualization
(Figure: 2-D visualization of learned word embeddings)
Stunning new result at this conference!
Mikolov, Yih & Zweig (NAACL 2013)
These representations are way better at encoding dimensions of
similarity than we realized!
Analogies testing dimensions of similarity can be solved quite
well just by doing vector subtraction in the embedding space
Syntactically:
x_apple - x_apples ≈ x_car - x_cars ≈ x_family - x_families
Similarly for verb and adjective morphological forms
Semantically (SemEval 2012 task 2):
x_shirt - x_clothing ≈ x_chair - x_furniture
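A sketch of the vector-arithmetic analogy test above. The embedding table here is random, purely to show the interface; with real learned embeddings (e.g. the RNN LM vectors used by Mikolov et al.) the top answer for "apple : apples :: car : ?" would be "cars".

    import numpy as np

    def analogy(emb, a, b, c, topn=1):
        """Return word(s) d such that vec(b) - vec(a) + vec(c) is closest to vec(d)."""
        target = emb[b] - emb[a] + emb[c]
        target /= np.linalg.norm(target)
        scores = {w: float(v @ target / np.linalg.norm(v))
                  for w, v in emb.items() if w not in (a, b, c)}
        return sorted(scores, key=scores.get, reverse=True)[:topn]

    # Tiny made-up embedding table just to show the interface
    emb = {w: np.random.randn(50) for w in ['apple', 'apples', 'car', 'cars', 'family']}
    print(analogy(emb, 'apple', 'apples', 'car'))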
Stunning new result at this conference!
Mikolov, Yih & Zweig (NAACL 2013)

Method        | Syntax: % correct
LSA 320 dim   | 16.5 [best LSA]
RNN 80 dim    | 16.2
RNN 320 dim   | 28.5
RNN 1600 dim  | 39.6

Method                   | Semantics: Spearman rho
UTD-NB (Rink & H. 2012)  | 0.230 [SemEval win]
LSA 640                  | 0.149
RNN 80                   | 0.211
RNN 1600                 | 0.275 [new SOTA]
Advantages of the neural word
embedding approach
Compared to a method like LSA, neural word embeddings
can become more meaningful through adding supervision
from one or multiple tasks
"Discriminative fine-tuning"
For instance, sentiment is usually not captured in unsupervised
word embeddings but can be in neural word vectors
We can build representations for larger linguistic units
See part 2
Unsupervised word vector
learning
Part 1.4: The Basics
A neural network for learning word
vectors (Collobert et al. JMLR 2011)
Idea: A word and its context is a positive training
sample; a random word in that same context gives
a negative training sample:
cat chills on a mat        vs.        cat chills Jeju a mat
Similar: implicit negative evidence in Contrastive
Estimation (Smith and Eisner 2005)
A neural network for learning word
vectors
How do we formalize this idea? Ask that
score(cat chills on a mat) > score(cat chills Jeju a mat)
How do we compute the score?
With a neural network
Each word is associated with an
n-dimensional vector
Word embedding matrix
Initialize all word vectors randomly to form a word embedding
matrix L of size n x |V|, with one column per vocabulary word:
L = [ ... ]      the   cat   mat   ...
These are the word features we want to learn
Also called a look-up table
Conceptually you get a word's vector by left-multiplying a
one-hot vector e by L:  x = Le
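A minimal sketch of the look-up table above; the vocabulary and embedding size are toy assumptions. It also shows that multiplying by a one-hot vector and indexing a column give the same result.

    import numpy as np

    vocab = ['the', 'cat', 'mat', 'on', 'a', 'chills']   # hypothetical vocabulary
    n, V = 8, len(vocab)                                 # embedding size, vocab size
    L = 0.01 * np.random.randn(n, V)                     # word embedding matrix (look-up table)

    def one_hot(i, size):
        e = np.zeros(size)
        e[i] = 1.0
        return e

    i = vocab.index('cat')
    x_via_matmul = L @ one_hot(i, V)   # conceptually: x = L e
    x_via_lookup = L[:, i]             # in practice: just index the column
    assert np.allclose(x_via_matmul, x_via_lookup)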
Word vectors as input to a neural
network
score(cat chills on a mat)
To describe a phrase, retrieve (via index) the corresponding
vectors from L:
cat   chills   on   a   mat
Then concatenate them into a 5n-dimensional vector:
x = [ x_cat  x_chills  x_on  x_a  x_mat ]
How do we then compute score(x)?
A Single Layer Neural Network
A single layer is a combination of a linear
layer and a nonlinearity:  a = f(Wx + b)
The neural activations a can then
be used to compute some function
For instance, the score we care about:  score(x) = U^T a
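A minimal sketch of the window scorer just described (concatenate five word vectors, one hidden layer, scalar score). The sizes and random parameters are toy assumptions.

    import numpy as np

    def score_window(word_ids, L, W, b, U):
        """s = U^T a,  a = tanh(W x + b), x = concatenation of the 5 word vectors."""
        x = np.concatenate([L[:, i] for i in word_ids])   # 5n-dimensional input
        a = np.tanh(W @ x + b)                            # single hidden layer
        return float(U @ a)                               # scalar score

    n, V, h = 8, 100, 20                                  # toy sizes (assumptions)
    L = 0.01 * np.random.randn(n, V)
    W = 0.01 * np.random.randn(h, 5 * n)
    b = np.zeros(h)
    U = 0.01 * np.random.randn(h)
    print(score_window([3, 17, 42, 7, 99], L, W, b, U))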
Summary: Feed-forward Computation
Computing a window's score with a 3-layer neural
net: s = score(cat chills on a mat)
Summary: Feed-forward Computation
s   = score(cat chills on a mat)
s_c = score(cat chills Jeju a mat)
Idea for the training objective: make the score of the true window
larger and the corrupt window's score lower (until they're
good enough): minimize

J = max(0, 1 - s + s_c)

This is continuous, so we can perform SGD
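A one-line sketch of the max-margin window objective above, with made-up scores to show when the loss (and hence the gradient) is zero.

    def window_loss(s, s_c):
        """J = max(0, 1 - s + s_c): zero once the true window outscores
        the corrupt one by a margin of 1."""
        return max(0.0, 1.0 - s + s_c)

    print(window_loss(s=2.3, s_c=0.4))   # 0.0 -> good enough, no gradient
    print(window_loss(s=0.2, s_c=0.5))   # 1.3 -> push s up, s_c down with SGD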
Training with Backpropagation
Assuming the cost J is > 0, it is simple to see that we
can compute the derivatives of s and s_c with respect to all the
involved variables: U, W, b, x
Training with Backpropagation
Let's consider the derivative of a single weight W_ij
This only appears inside a_i
For example: W_23 is only
used to compute a_2
(Figure: the network with inputs x_1, x_2, x_3, bias +1, hidden
activations a_1, a_2, score s; the arrows for U_2 and W_23 are highlighted)
Training with Backpropagation
Derivative of weight W_ij:

ds/dW_ij = U_i f'(z_i) x_j

Training with Backpropagation
Derivative of a single weight W_ij:

ds/dW_ij = delta_i x_j,   with  delta_i = U_i f'(z_i)

delta_i is the local error
signal and x_j is the local input
signal
Training with Backpropagation
From the single weight W_ij to the full W:
We want all combinations of
i = 1, 2 and j = 1, 2, 3
Solution: outer product:

ds/dW = delta x^T

where delta is the vector of
"responsibilities" coming from
each activation a
(Figure on these slides: the same single-layer network with inputs
x_1, x_2, x_3, bias +1, activations a_1, a_2, and score s)
Training with Backpropagation
For the biases b, we get:

ds/db_i = U_i f'(z_i) = delta_i

Training with Backpropagation
That's almost backpropagation
It's simply taking derivatives and using the chain rule!
Remaining trick: we can re-use derivatives computed for
higher layers in computing derivatives for lower layers
Example: the last derivatives of the model, the word vectors in x
Training with Backpropagation
Take the derivative of the score with
respect to a single word vector
(for simplicity a 1-d vector,
but the same if it was longer)
Now, we cannot just take
into consideration one a_i,
because each x_j is connected
to all the neurons above and
hence x_j influences the
overall score through all of
these, hence:

ds/dx_j = sum_i U_i f'(z_i) W_ij = sum_i delta_i W_ij

Re-used part of the previous derivative
Training with Backpropagation:
softmax
What is the major benefit of deep learned word vectors?
The ability to also propagate labeled information into them,
via a softmax/maxent classifier and hidden layer:

P(c | d, \lambda) = \frac{e^{\lambda^\top f(c, d)}}{\sum_{c'} e^{\lambda^\top f(c', d)}}

(Figure: softmax output units c_1, c_2, c_3 on top of hidden
activations a_1, a_2 over inputs x_1, x_2, x_3, +1)
Backpropagation Training
Part 1.5: The Basics
Back-Prop
Compute the gradient of the example-wise loss with respect to the
parameters
Simply applying the derivative chain rule wisely
If computing the loss(example, parameters) is O(n)
computation, then so is computing the gradient
Simple Chain Rule
Multiple Paths Chain Rule
Multiple Paths Chain Rule - General
Chain Rule in a Flow Graph
Flow graph: any directed acyclic graph
node = computation result
arc = computation dependency
{y_1, y_2, ...} = successors of x
Back-Prop in a Multi-Layer Net
h = sigmoid(Vx)
Back-Prop in a General Flow Graph
Single scalar output
1. Fprop: visit nodes in topological order
   - Compute the value of each node given its predecessors
2. Bprop:
   - Initialize the output gradient = 1
   - Visit nodes in reverse order:
     compute the gradient with respect to each node using
     the gradient with respect to its successors
{y_1, y_2, ...} = successors of x
Automatic Differentiation
The gradient computation can
be automatically inferred from
the symbolic expression of the
fprop.
Each node type needs to know
how to compute its output and
how to compute the gradient
with respect to its inputs given the
gradient with respect to its output.
Easy and fast prototyping
Learning word-level classifiers:
POS and NER
Part 1.6: The Basics
The Model
(Collobert & Weston 2008,
Collobert et al. 2011)
Similar to the word vector
learning model, but it replaces the
single scalar score with a
softmax/maxent classifier
Training is again done via
backpropagation, which gives
an error similar to the score
in the unsupervised word
vector learning model
(Figure: softmax classes c_1, c_2, c_3 over hidden units a_1, a_2
and window input x_1, x_2, x_3, +1)
The Model - Training
We already know the softmax classifier and how to optimize it
The interesting twist in deep learning is that the input features
are also learned, similar to learning word vectors with a score
(Figure: the supervised softmax network next to the unsupervised
scoring network from part 1.4)

Model                                        | POS WSJ (acc.) | NER CoNLL (F1)
State-of-the-art*                            | 97.24          | 89.31
Supervised NN                                | 96.37          | 81.47
Unsupervised pre-training                    | 97.20          | 88.87
followed by supervised NN**                  |                |
+ hand-crafted features***                   | 97.29          | 89.59

* Representative systems: POS: (Toutanova et al. 2003), NER: (Ando & Zhang
2005)
** 130,000-word embedding trained on Wikipedia and Reuters with an 11-word
window, 100-unit hidden layer - for 7 weeks! - then supervised task training
*** Features are character suffixes for POS and a gazetteer for NER
The secret sauce is the unsupervised
pre-training on a large text collection

Model                      | POS WSJ (acc.) | NER CoNLL (F1)
Supervised NN              | 96.37          | 81.47
NN with Brown clusters     | 96.92          | 87.15
Fixed embeddings*          | 97.10          | 88.87
C&W 2011**                 | 97.29          | 89.59

* Same architecture as C&W 2011, but word embeddings are kept constant
during the supervised training phase
** C&W is the unsupervised pre-train + supervised NN + features model of the last slide
Supervised refinement of the
unsupervised word representation helps
Sharing statistical strength
Part 1.7: The Basics
Multi-Task Learning
Generalizing better to new
tasks is crucial to approach
AI
Deep architectures learn
good intermediate
representations that can be
shared across tasks
Good representations make
sense for many tasks
(Figure: raw input x feeding a shared intermediate representation h,
which feeds task 1 output y1, task 2 output y2, and task 3 output y3)
Combining Multiple Sources of
Evidence with Shared Embeddings
Relational learning
Multiple sources of information / relations
Some symbols (e.g. words, Wikipedia entries) shared
Shared embeddings help propagate information
among data sources: e.g., WordNet (WN), Wikipedia,
FreeBase, ...
Sharing Statistical Strength
Besides very fast prediction, the main advantage of
deep learning is statistical
Potential to learn from fewer labeled examples because
of the sharing of statistical strength:
Unsupervised pre-training & multi-task learning
Semi-supervised learning
Semi-Supervised Learning
Hypothesis: P(c|x) can be more accurately computed using
shared structure with P(x)
(Figure: purely supervised decision boundary)
Semi-Supervised Learning
Hypothesis: P(c|x) can be more accurately computed using
shared structure with P(x)
(Figure: semi-supervised decision boundary, shaped by the unlabeled data)
Deep autoencoders
An alternative to contrastive unsupervised word learning
Another is RBMs (Hinton et al. 2006), which we don't cover today
Works well for fixed input representations
1. Definition, intuition and variants of autoencoders
2. Stacking for deep autoencoders
3. Why do autoencoders improve deep neural nets so much?
Auto-Encoders
Multilayer neural net with target output = input
Reconstruction = decoder(encoder(input))
Probable inputs have
small reconstruction error
(Figure: input -> encoder -> code (latent features) -> decoder -> reconstruction)
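A minimal single-layer auto-encoder sketch matching the description above (tanh encoder, linear decoder, squared reconstruction error). Sizes and parameters are toy assumptions.

    import numpy as np

    def autoencoder_loss(x, W_e, b_e, W_d, b_d):
        """code = encoder(x); reconstruction = decoder(code);
        loss = squared reconstruction error."""
        code = np.tanh(W_e @ x + b_e)          # latent features (the "code")
        recon = W_d @ code + b_d               # linear decoder
        return 0.5 * np.sum((recon - x) ** 2), code

    d, k = 20, 5                               # input dim, code dim (bottleneck)
    W_e = 0.1 * np.random.randn(k, d); b_e = np.zeros(k)
    W_d = 0.1 * np.random.randn(d, k); b_d = np.zeros(d)
    loss, code = autoencoder_loss(np.random.randn(d), W_e, b_e, W_d, b_d)
    print(loss, code.shape)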
PCA = Linear Manifold = Linear Auto-
Encoder
Input x, 0-mean
features = code = h(x) = W x
reconstruction(x) = W^T h(x) = W^T W x
W = principal eigen-basis of Cov(X)
(Figure: the reconstruction error vector from x to its projection
reconstruction(x) on the linear manifold)
LSA example:
x = (normalized) distribution
of co-occurrence frequencies
The Manifold Learning Hypothesis
Examples concentrate near a lower-dimensional
"manifold" (a region of high density where small changes are only
allowed in certain directions)
Auto-Encoders Learn Salient
Variations, like a non-linear PCA
Minimizing reconstruction error
forces the latent representation of
"similar inputs" to stay on
the manifold
Auto-Encoder Variants
Discrete inputs: cross-entropy or log-likelihood reconstruction
criterion (similar to that used for discrete targets for MLPs)
Preventing them from learning the identity everywhere:
Undercomplete (e.g. PCA): bottleneck code smaller than input
Sparsity: penalize hidden unit activations so they are at or near 0
[Goodfellow et al. 2009]
Denoising: predict the true input from a corrupted input
[Vincent et al. 2008]
Contractive: force the encoder to have small derivatives
[Rifai et al. 2011]
Sparse autoencoder illustration for
images
Natural images
Learned bases: "edges"
(Figure: image patches and the learned edge-like bases)
Test example:
x ≈ 0.8 * basis_i + 0.3 * basis_j + 0.5 * basis_k
[a_1, ..., a_64] = [0, 0, ..., 0, 0.8, 0, ..., 0, 0.3, 0, ..., 0, 0.5, 0]
(feature representation)
Stacking Auto-Encoders
Auto-encoders can be stacked successfully (Bengio et al. NIPS 2006) to form
highly non-linear representations
Layer-wise Unsupervised Learning
(Figure: start with the raw input)
Layer-wise Unsupervised Pre-training
(Figure: train a first layer of features on the input)
Layer-wise Unsupervised Pre-training
(Figure: check the features by reconstructing the input:
reconstruction of input =? input)
Layer-wise Unsupervised Pre-training
(Figure: keep the trained first-layer features)
Layer-wise Unsupervised Pre-training
(Figure: train a second layer of more abstract features on top)
Layer-wise Unsupervised Pre-training
(Figure: check them by reconstructing the first-layer features:
reconstruction of features =? features)
Layer-wise Unsupervised Learning
(Figure: keep the more abstract features)
Layer-wise Unsupervised Pre-training
(Figure: train a third layer of even more abstract features)
Layer-wise Unsupervised Learning
(Figure: the stack now maps input -> features -> more abstract
features -> even more abstract features)
Supervised Fine-Tuning
(Figure: add an output layer f(X) and train the whole stack against
the target Y, e.g. output "six" vs. target "two")
Why is unsupervised pre-training
working so well?
Regularization hypothesis:
Representations good for P(x)
are good for P(y|x)
Optimization hypothesis:
Unsupervised initializations start
near a better local minimum of the
supervised training error
Minima otherwise not
achievable by random
initialization
Erhan, Courville, Manzagol,
Vincent, Bengio (JMLR, 2010)
Recursive Deep Learning
Part 2
Building on Word Vector Space Models
(Figure: a 2-D word vector space with, e.g., Monday at (9, 2),
Tuesday at (9.5, 1.5), France at (2, 2.5), Germany at (1, 3))
By mapping them into the same vector space!
But how can we represent the meaning of longer phrases,
such as "the country of my birth" or "the place where I was born"?
How should we map phrases into a
vector space?
Use the principle of compositionality:
The meaning (vector) of a sentence
is determined by
(1) the meanings of its words and
(2) the rules that combine them.
The models in this section
can jointly learn parse
trees and compositional
vector representations
(Figure: phrase vectors for "the country of my birth" and
"the place where I was born" placed in the same space as the
word vectors for Monday, Tuesday, France, Germany)
Semantic Vector Spaces
Single word vectors:
Distributional techniques,
Brown clusters
Useful as features inside
models, e.g. CRFs for
NER, etc.
Cannot capture longer
phrases
Document vectors:
Bag-of-words models,
LSA, LDA
Great for IR, document
exploration, etc.
Ignore word order, no
detailed understanding
In between, what we want: vectors
representing
phrases and sentences
that do not ignore word order
and capture semantics for NLP tasks
Recursive Deep Learning
1. Motivation
2. Recursive neural networks for parsing
3. Optimization and backpropagation through structure
4. Compositional Vector Grammars: parsing
5. Recursive Autoencoders: paraphrase detection
6. Matrix-Vector RNNs: relation classification
7. Recursive Neural Tensor Networks: sentiment analysis
Sentence Parsing: What we want
(Figure: the parse tree of "The cat sat on the mat.", with S, NP,
VP and PP nodes, and a vector at each word)
Learn Structure and Representation
(Figure: the same tree, but now with a learned vector at every
internal node -- NP, PP, VP, S -- as well as at every word)
Recursive Neural Networks for
Structure Prediction
Inputs: the representations of two candidate children
Outputs:
1. The semantic representation if the two nodes are merged.
2. A score of how plausible the new node would be.
(Figure: for "on the mat.", the vectors for "the" and "mat" are fed
to the neural network, which outputs a new parent vector and a
score of 1.3)
Recursive Neural Network Definition
score = U^T p
p = tanh(W [c1; c2] + b),
where [c1; c2] is the concatenation of the two children's vectors
The same W parameters are used at all nodes
of the tree
(Figure: the neural network merges children c1 and c2 into the
parent p and also outputs score = 1.3)
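A minimal sketch of the merge operation just defined; the vector size and random parameters are toy assumptions, and in the real model the same W, b, U would be reused at every node of the tree.

    import numpy as np

    def compose(c1, c2, W, b, u):
        """One recursive-NN merge: p = tanh(W [c1; c2] + b), score = u^T p."""
        children = np.concatenate([c1, c2])    # [c1; c2], a 2n-vector
        p = np.tanh(W @ children + b)          # n-dimensional parent vector
        score = float(u @ p)
        return p, score

    n = 10                                     # toy vector size
    W = 0.1 * np.random.randn(n, 2 * n)
    b = np.zeros(n)
    u = 0.1 * np.random.randn(n)
    p, s = compose(np.random.randn(n), np.random.randn(n), W, b, u)
    print(p.shape, s)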
Related Work to Socher et al. (ICML
2011)
Pollack (1990): Recursive auto-associative memories
Previous recursive neural network work by
Goller & Küchler (1996) and Costa et al. (2003) assumed
fixed tree structure and used one-hot vectors.
Hinton (1990) and Bottou (2011): related ideas about
recursive models and recursive operators as smooth
versions of logic operations
Parsing a sentence with an RNN
(Figure: for "The cat sat on the mat.", the network scores every
adjacent pair of candidates, e.g. 0.1, 0.4, 2.3, 3.1, 0.3)
Parsing a sentence
(Figure: the best-scoring pair ("the mat") is merged into a new
node; the remaining candidates are re-scored, e.g. 1.1, 0.1, 0.4, 2.3)
Parsing a sentence
(Figure: "on the mat" is merged next, with a score of 3.6)
Parsing a sentence
(Figure: the process continues greedily until the full parse tree of
"The cat sat on the mat." is built)
Max-Margin Framework - Details
The score of a tree is computed as
the sum of the parsing decision
scores at each node.
Similar to max-margin parsing (Taskar et al. 2004), a supervised
max-margin objective
The loss penalizes all incorrect decisions
Structure search for A(x) was maximally greedy
Instead: beam search with a chart
Backpropagation Through Structure
Introduced by Goller & Küchler (1996)
Principally the same as general backpropagation
Two differences resulting from the tree structure:
Split derivatives at each node
Sum derivatives of W from all nodes
BTS: Split derivatives at each node
During forward prop, the parent is computed using its 2 children:

p = tanh(W [c1; c2] + b)

Hence, the errors need to be computed with respect to each of them,
where each child's error is n-dimensional
BTS: Sum derivatives of all nodes
You can actually assume it's a different W at each node
Intuition via example:
If we take separate derivatives of each occurrence, we get the same
result as summing the derivatives of the shared W over all nodes
BTS: Optimization
As before, we can plug the gradients into a
standard off-the-shelf L-BFGS optimizer
Best results with AdaGrad (Duchi et al., 2011)
For a non-continuous objective use the subgradient
method (Ratliff et al. 2007)
Discussion: Simple RNN
Good results with a single-matrix RNN (more later)
A single weight matrix RNN can capture some
phenomena but is not adequate for more complex,
higher-order composition and parsing long sentences
The composition function is the same
for all syntactic categories, punctuation, etc.
Solution: Syntactically-Untied RNN
Idea: Condition the composition function on the
syntactic categories, "untie the weights"
Allows different composition functions for pairs
of syntactic categories, e.g. Adv + AdjP, VP + NP
Combines discrete syntactic categories with
continuous semantic information
Solution: CVG =
PCFG + Syntactically-Untied RNN
Problem: Speed. Every candidate score in beam
search needs a matrix-vector product.
Solution: Compute the score using a linear combination
of the log-likelihood from a simple PCFG + the RNN
The PCFG prunes very unlikely candidates for speed
and provides the coarse syntactic categories of the
children for each beam candidate
Compositional Vector Grammars: CVG = PCFG + RNN
Details: Compositional Vector
Grammar
Scores at each node are computed by a combination of the
PCFG and the SU-RNN
Interpretation: factoring discrete and continuous
parsing in one model
Socher et al. (2013): more details at ACL
Related Work
The resulting CVG parser is related to previous work that extends PCFG
parsers
Klein and Manning (2003a): manual feature engineering
Petrov et al. (2006): a learning algorithm that splits and merges
syntactic categories
Lexicalized parsers (Collins, 2003; Charniak, 2000): describe each
category with a lexical item
Hall and Klein (2012) combine several such annotation schemes in a
factored parser
CVGs extend these ideas from discrete representations to richer
continuous ones
Hermann & Blunsom (2013): combine Combinatory Categorial
Grammars with RNNs and also untie weights; see the upcoming ACL 2013
Experiments
Standard WSJ split, labeled F1
Based on a simple PCFG with fewer states
Fast pruning of the search space, few matrix-vector products
3.8% higher F1, 20% faster than the Stanford parser

Parser                                                     | Test, All Sentences
Stanford PCFG (Klein and Manning, 2003a)                   | 85.5
Stanford Factored (Klein and Manning, 2003b)               | 86.6
Factored PCFGs (Hall and Klein, 2012)                      | 89.4
Collins (Collins, 1997)                                    | 87.7
SSN (Henderson, 2004)                                      | 89.4
Berkeley Parser (Petrov and Klein, 2007)                   | 90.1
CVG (RNN) (Socher et al., ACL 2013)                        | 85.0
CVG (SU-RNN) (Socher et al., ACL 2013)                     | 90.4
Charniak - Self-Trained (McClosky et al. 2006)             | 91.0
Charniak - Self-Trained & Re-Ranked (McClosky et al. 2006) | 92.1
SU-RNN Analysis
Learns a notion of soft head words
(Figure: learned DT-NP and VP-NP composition matrices)
Analysis of resulting vector
representations
Nearest-neighbor phrases, e.g.:
"All the figures are adjusted for seasonal variations"
1. All the numbers are adjusted for seasonal fluctuations
2. All the figures are adjusted to remove usual seasonal patterns
"Knight-Ridder wouldn't comment on the offer"
1. Harsco declined to say what country placed the order
2. Coastal wouldn't disclose the terms
"Sales grew almost 7% to UNK m. from UNK m."
1. Sales rose more than 7% to 94.9 m. from 88.3 m.
2. Sales surged 40% to UNK b. yen from UNK b.
SU-RNN Analysis
Can transfer semantic information from a
single related example
Train sentences:
He eats spaghetti with a fork.
She eats spaghetti with pork.
Test sentences:
He eats spaghetti with a spoon.
He eats spaghetti with meat.
SU-RNN Analysis
(Figure: the resulting PP-attachment structures for the test sentences)
Labeling in Recursive Neural Networks
We can use each node's
representation as features for a
softmax classifier
Training is similar to the model in part 1, with
standard cross-entropy error + scores
(Figure: a softmax layer on top of a node of the recursive network)
Scene Parsing
The meaning of a scene image is
also a function of smaller regions,
how they combine as parts to form
larger objects,
and how the objects interact.
Similar principle of compositionality.
Algorithm for Parsing Images
Same recursive neural network as for natural language parsing!
(Socher et al. ICML 2011)
(Figure: features, grass, tree, segments, semantic representations,
people, building -- parsing natural scene images)
Multi-class segmentation

Method                                            | Accuracy
Pixel CRF (Gould et al., ICCV 2009)               | 74.3
Classifier on superpixel features                 | 75.9
Region-based energy (Gould et al., ICCV 2009)     | 76.4
Local labelling (Tighe & Lazebnik, ECCV 2010)     | 76.9
Superpixel MRF (Tighe & Lazebnik, ECCV 2010)      | 77.5
Simultaneous MRF (Tighe & Lazebnik, ECCV 2010)    | 77.5
Recursive neural network                          | 78.1

Stanford Background Dataset (Gould et al. 2009)
Recursive Deep Learning
1. Motivation
2. Recursive neural networks for parsing
3. Theory: backpropagation through structure
4. Compositional Vector Grammars: parsing
5. Recursive Autoencoders: paraphrase detection
6. Matrix-Vector RNNs: relation classification
7. Recursive Neural Tensor Networks: sentiment analysis
Semi-supervised Recursive
Autoencoder
To capture sentiment and solve the antonym problem, add a softmax classifier
The error is a weighted combination of reconstruction error and cross-entropy
Socher et al. (EMNLP 2011)
(Figure: a tree node with both a reconstruction error and a
cross-entropy error)
Paraphrase Detection
Pollack said the plaintiffs failed to show that Merrill
and Blodget directly caused their losses
Basically, the plaintiffs did not show that omissions
in Merrill's research caused the claimed losses
The initial report was made to Modesto Police
December 28
It stems from a Modesto police report
How to compare
the meaning
of two sentences?
Unsupervised Recursive Autoencoders
Similar to the recursive neural net, but instead of a
supervised score we compute a reconstruction error
at each node. Socher et al. (EMNLP 2011)
(Figure: each parent is trained to reconstruct its children,
e.g. y_1 = f(W[x_2; x_3] + b), y_2 = f(W[x_1; y_1] + b))
Unsupervised unfolding RAE
Attempt to encode the entire tree structure at each node
Recursive Autoencoders for Full
Sentence Paraphrase Detection
Unsupervised unfolding RAE and a pair-wise sentence
comparison of nodes in parsed trees
Socher et al. (NIPS 2011)
137
Recursive Autoencoders for Full
Sentence Paraphrase Detection
LxperlmenLs on Mlcroso 8esearch araphrase Corpus
(uolan eL al. 2004)
Method Acc. I1
8us eL al.(2008) 70.6 80.3
Mlhalcea eL al.(2006) 70.3 81.3
lslam eL al.(2007) 72.6 81.3
Clu eL al.(2006) 72.0 81.6
lernando eL al.(2008) 74.1 82.4
Wan eL al.(2006) 73.6 83.0
uas and SmlLh (2009) 73.9 82.3
uas and SmlLh (2009) + 18 Surface leaLures 76.1 82.7
l. 8u eL al. (ACL 2012): SLrlng 8e-wrlung kernel 76.3 --
unfoldlng 8ecurslve AuLoencoder (nlS 2011) 76.8 83.6
138
Recursive Autoencoders for Full
Sentence Paraphrase Detection
139
Recursive Deep Learning
1. Motivation
2. Recursive neural networks for parsing
3. Theory: backpropagation through structure
4. Compositional Vector Grammars: parsing
5. Recursive Autoencoders: paraphrase detection
6. Matrix-Vector RNNs: relation classification
7. Recursive Neural Tensor Networks: sentiment analysis
Compositionality Through Recursive
Matrix-Vector Spaces
One way to make the composition function more powerful
was by untying the weights W
But what if words act mostly as an operator, e.g. "very" in
"very good"?
Proposal: a new composition function
Before: p = tanh(W [c1; c2] + b)
Compositionality Through Recursive
Matrix-Vector Recursive Neural Networks
Before: p = tanh(W [c1; c2] + b)
Now:    p = tanh(W [C2 c1; C1 c2] + b)
Each word is represented by both a vector c and an operator matrix C
Predicting Sentiment Distributions
Good example of non-linearity in language
(Figure: predicted sentiment distributions for phrases with
operators such as "not" and "very")
MV-RNN for Relationship Classification

Relationship          | Sentence with labeled nouns for which to predict relationships
Cause-Effect(e2,e1)   | Avian [influenza]e1 is an infectious disease caused by type A strains of the influenza [virus]e2.
Entity-Origin(e1,e2)  | The [mother]e1 left her native [land]e2 about the same time and they were married in that city.
Message-Topic(e2,e1)  | Roadside [attractions]e1 are frequently advertised with [billboards]e2 to attract tourists.
Sentiment Detection
Sentiment detection is crucial to business
intelligence, stock trading, ...
Sentiment Detection and Bag-of-Words
Models
Most methods start with a bag of words
+ linguistic features/processing/lexica
But such methods (including tf-idf) can't
distinguish:
+ white blood cells destroying an infection
- an infection destroying white blood cells
Sentiment Detection and Bag-of-Words
Models
A common claim is that "sentiment is easy"
Detection accuracy for longer documents is ~90%
Lots of easy cases (... horrible ... or ... awesome ...)
But for the dataset of single-sentence movie reviews
(Pang and Lee, 2005), accuracy never reached
above 80% for 7 years
Harder cases require actual understanding of
negation and its scope and other semantic effects
Data: Movie Reviews
"Stealing Harvard doesn't care about
cleverness, wit or any other kind of
intelligent humor."
"There are slow and repetitive parts,
but it has just enough spice to keep it
interesting."
Two missing pieces for improving
sentiment
1. Compositional training data
2. A better compositional model
1. New Sentiment Treebank
Parse trees of 11,855 sentences
215,154 phrases with labels
Allows training and evaluating
with compositional information
2. New Compositional Model
Recursive Neural Tensor Network
More expressive than any other RNN so far
Idea: allow more interactions of vectors
(Figures: the Recursive Neural Tensor Network composition, which
adds a tensor-based term to the standard RNN layer)
Experimental Result on Treebank
(Figure: accuracy results on the new treebank)
Experimental Result on Treebank
The RNTN can capture constructions like "X but Y"
RNTN accuracy of 72%, compared to MV-RNN (63%),
biNB (38%) and RNN (34%)
Negation Results
(Figure: accuracy on the negation dataset)
Negation Results
Most methods capture that negation often makes
things more negative (see Potts, 2010)
Analysis on the negation dataset
Negation Results
But how about negating negatives?
The positive activation should increase!
Visualizing Deep Learning: Word
Embeddings
(Figure: visualization of the learned word embeddings)
Overview of RNN Model Variations
Objective functions
Supervised scores for structure prediction
Classifier for sentiment, relations, visual objects, logic
Unsupervised autoencoding of immediate children or the entire tree structure
Composition functions
Syntactically-untied weights
Matrix-Vector RNN
Tensor-based models
Tree structures
Constituency parse trees
Combinatory Categorial Grammar trees
Dependency parse trees
Fixed tree structures (connections to CNNs)
Summary: Recursive Deep Learning
Recursive deep learning can predict hierarchical structure and classify the
structured output using compositional vectors
State-of-the-art performance (all with code on www.socher.org)
Parsing on the WSJ (Java code soon)
Sentiment analysis on multiple corpora
Paraphrase detection with unsupervised RNNs
Relation classification on SemEval 2010, Task 8
Object detection on the Stanford Background and MSRC datasets
(Figure: the same recursive architecture parsing natural language
sentences -- "A small crowd quietly enters the historic church" --
and parsing natural scene images into segments (people, building),
with shared semantic representations)
Part 3
1. Assorted speech and NLP applications
2. Deep learning: general strategy and tricks
3. Resources (readings, code, ...)
4. Discussion
Assorted Speech and NLP
Applications
Part 3.1: Applications
Existing NLP Applications
Language modeling (speech recognition, machine translation)
Word-sense learning and disambiguation
Reasoning over knowledge bases
Acoustic modeling
Part-of-speech tagging
Chunking
Named entity recognition
Semantic role labeling
Parsing
Sentiment analysis
Paraphrasing
Question answering
Language Modeling
Predict P(next word | previous words)
Gives a probability for a longer sequence
Applications to speech, translation and compression
Computational bottleneck: a large vocabulary V means that
computing the output costs #hidden units x |V|
Neural Language Model
Bengio et al. NIPS 2000
and JMLR 2003 "A
Neural Probabilistic
Language Model"
Each word is represented by
a distributed, continuous-
valued code
Generalizes to sequences
of words that are
semantically similar to
training sequences
Recurrent Neural Net Language
Modeling for ASR
[Mikolov et al. 2011]
Bigger is better:
experiments on Broadcast
News NIST-RT04
Perplexity goes from
140 to 102
The paper shows how to
train a recurrent neural net
with a single core in a few
days, with a 1% absolute
improvement in WER
Code: http://www.fit.vutbr.cz/~imikolov/rnnlm/
Application to Statistical Machine
Translation
Schwenk (NAACL 2012 workshop on the future of LM)
41M words, Arabic/English bitexts + 131M English from LDC
Perplexity down from 71.1 (6 Gig back-off) to 36.9 (neural
model, 300M memory)
+1.8 BLEU score (30.73 to 32.28)
Can take advantage of longer contexts
Code: http://lium.univ-lemans.fr/cslm/
Learning Multiple Word Vectors
Tackles problems with polysemous words
Can be done with standard tf-idf based
methods [Reisinger and Mooney, NAACL 2010]
A recent neural word vector model by [Huang et al. ACL 2012]
learns multiple prototypes using both local and global context
State-of-the-art
correlations with
human similarity
judgments
Learning Multiple Word Vectors
Visualization of learned word vectors from
Huang et al. (ACL 2012)
Common Sense Reasoning
Inside Knowledge Bases
Question: Can neural networks learn to capture logical
inference, set inclusions, part-of and hypernym relationships?
Neural Networks for Reasoning
over Relationships
Higher scores for a
triplet T = (e1, R, e2)
indicate that the entities are
more likely to be in the relationship
Training uses a contrastive
estimation function, similar
to word vector learning
NTN scoring function and cost: see Chen et al. (2013)
Accuracy of Predicting True and False
Relationships

Model                                    | FreeBase | WordNet
Distance Model                           | 68.3     | 61.0
Hadamard Model                           | 80.0     | 68.8
Standard Layer Model (NTN)               | 76.0     | 83.3
Bilinear Model (NTN)                     | 84.1     | 87.7
Neural Tensor Network (Chen et al. 2013) | 86.2     | 90.0

Related work: (Bordes, Weston,
Collobert & Bengio, AAAI 2011);
(Bordes, Glorot, Weston & Bengio,
AISTATS 2012)
Accuracy Per Relationship
(Figure: per-relationship accuracy chart)
Deep Learning
General Strategy and Tricks
Part 3.2
General Strategy
1. Select a network structure appropriate for the problem
   1. Structure: single words, fixed windows vs. recursive
      sentence based vs. bag of words
   2. Nonlinearity
2. Check for implementation bugs with gradient checks
3. Parameter initialization
4. Optimization tricks
5. Check if the model is powerful enough to overfit
   1. If not, change the model structure or make the model "larger"
   2. If you can overfit: regularize
Non-linearities: What's used
logistic ("sigmoid") and tanh

tanh(z) = 2 logistic(2z) - 1

tanh is just a rescaled and shifted sigmoid
tanh is what is most used and often performs best for deep nets
Non-linearities: There are various
other choices
hard tanh, softsign, rectifier

softsign(z) = z / (1 + |z|)        rect(z) = max(z, 0)

hard tanh is similar to but computationally cheaper than tanh, and saturates hard.
[Glorot and Bengio, AISTATS 2010, 2011] discuss softsign and rectifier
MaxOut Network
A very recent type of nonlinearity/network
Goodfellow et al. (2013)
Each unit takes the max over a set of linear functions of its input
This function too is a universal approximator
State of the art on several image datasets
Gradient Checks are Awesome!
They let you know that there are no bugs in your neural
network implementation!
Steps:
1. Implement your gradient
2. Implement a finite-difference computation by looping
through the parameters of your network, adding and
subtracting a small epsilon (~1e-4) and estimating derivatives
3. Compare the two and make sure they are (almost) the same
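A minimal sketch of the check above using centered finite differences; the toy loss and tolerance are assumptions.

    import numpy as np

    def gradient_check(loss_fn, grad_fn, theta, eps=1e-4, tol=1e-6):
        """Compare analytic gradients with centered finite differences."""
        analytic = grad_fn(theta)
        numeric = np.zeros_like(theta)
        for i in range(theta.size):
            step = np.zeros_like(theta); step[i] = eps
            numeric[i] = (loss_fn(theta + step) - loss_fn(theta - step)) / (2 * eps)
        return np.max(np.abs(analytic - numeric)) < tol

    # Toy check on L(theta) = 0.5 * ||theta||^2, whose gradient is theta
    theta = np.random.randn(10)
    print(gradient_check(lambda t: 0.5 * np.sum(t ** 2), lambda t: t, theta))  # True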
General Strategy
1. Select an appropriate network structure
   1. Structure: single words, fixed windows vs. recursive
      sentence based vs. bag of words
   2. Nonlinearity
2. Check for implementation bugs with gradient checks
3. Parameter initialization
4. Optimization tricks
5. Check if the model is powerful enough to overfit
   1. If not, change the model structure or make the model "larger"
   2. If you can overfit: regularize
Parameter Initialization
Initialize hidden layer biases to 0 and output (or reconstruction)
biases to the optimal value if the weights were 0 (e.g. the mean target or
the inverse sigmoid of the mean target).
Initialize weights ~ Uniform(-r, r), with r inversely proportional to
fan-in (previous layer size) and fan-out (next layer size):

r = sqrt(6 / (fan-in + fan-out))

for tanh units, and 4x bigger for sigmoid units [Glorot AISTATS 2010]
Pre-training with Restricted Boltzmann Machines also helps
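A minimal sketch of the initialization recipe above; fan-in/fan-out sizes are toy assumptions.

    import numpy as np

    def init_weights(fan_in, fan_out, sigmoid=False):
        """Uniform(-r, r) init with r from fan-in and fan-out (Glorot & Bengio,
        AISTATS 2010) for tanh units; 4x larger for sigmoid units."""
        r = np.sqrt(6.0 / (fan_in + fan_out))
        if sigmoid:
            r *= 4.0
        return np.random.uniform(-r, r, size=(fan_out, fan_in))

    W1 = init_weights(fan_in=100, fan_out=50)   # tanh layer
    b1 = np.zeros(50)                           # hidden biases start at 0
    print(W1.min(), W1.max())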
Stochastic Gradient Descent (SGD)
Gradient descent uses the total gradient over all examples per
update; SGD updates after only 1 or a few examples:

theta_new = theta_old - epsilon * grad_theta L(z_t, theta)

where L = loss function, z_t = current example, theta = parameter vector, and
epsilon = learning rate.
Ordinary gradient descent as a batch method is very slow and should
never be used. Use a 2nd-order batch method such as L-BFGS. On
large datasets, SGD usually wins over all batch methods. On
smaller datasets L-BFGS or conjugate gradients win. Large-batch
L-BFGS extends the reach of L-BFGS [Le et al. ICML 2011].
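A minimal sketch of the SGD update rule above on a toy problem (fitting the mean of some data under squared loss); the learning rate and epoch count are arbitrary assumptions.

    import numpy as np

    def sgd(theta, data, grad_fn, lr=0.01, epochs=5):
        """theta <- theta - lr * grad L(z_t, theta), one example at a time."""
        for _ in range(epochs):
            np.random.shuffle(data)
            for z in data:
                theta -= lr * grad_fn(z, theta)
        return theta

    # Toy problem: fit theta to the mean of the data under squared loss
    data = list(np.random.randn(1000) + 3.0)
    theta = np.zeros(1)
    print(sgd(theta, data, lambda z, t: t - z))   # close to 3.0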
Learning Rates
Simplest recipe: keep it fixed and use the same for all
parameters.
Collobert scales them by the inverse of the square root of the fan-in
of each neuron
Better results can generally be obtained by allowing learning
rates to decrease, typically in O(1/t), because of theoretical
convergence guarantees, e.g.,

epsilon_t = epsilon_0 * tau / max(t, tau)

with hyper-parameters epsilon_0 and tau
Better yet: no hand-tuned learning rates, by using L-BFGS or AdaGrad (Duchi
et al. 2011)
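A minimal sketch of an AdaGrad update (Duchi et al. 2011) as mentioned above: per-parameter learning rates shrink with the accumulated squared gradient. The toy quadratic objective is an assumption for illustration.

    import numpy as np

    def adagrad_update(theta, grad, hist, lr=0.1, eps=1e-8):
        hist += grad ** 2                              # running sum of squared gradients
        theta -= lr * grad / (np.sqrt(hist) + eps)     # per-parameter step sizes
        return theta, hist

    theta = np.zeros(3)
    hist = np.zeros(3)
    for _ in range(100):
        grad = theta - np.array([1.0, -2.0, 0.5])      # gradient of a toy quadratic
        theta, hist = adagrad_update(theta, grad, hist)
    print(theta)                                       # approaches [1, -2, 0.5]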
Long-Term Dependencies
and the Clipping Trick
In very deep networks such as recurrent networks (or possibly
recursive ones), the gradient is a product of Jacobian matrices,
each associated with a step in the forward computation. This
can become very small or very large quickly [Bengio et al 1994],
and the locality assumption of gradient descent breaks down.
The solution first introduced by Mikolov is to clip gradients
to a maximum value. It makes a big difference in RNNs.
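A minimal sketch of norm-based gradient clipping as described above; the maximum norm is an arbitrary assumption.

    import numpy as np

    def clip_gradient(grad, max_norm=5.0):
        """Rescale the gradient if its norm exceeds max_norm; leave small
        gradients untouched."""
        norm = np.linalg.norm(grad)
        if norm > max_norm:
            grad = grad * (max_norm / norm)
        return grad

    g = np.random.randn(1000) * 100.0        # an exploding gradient
    print(np.linalg.norm(clip_gradient(g)))  # <= 5.0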
General Strategy
1. Select an appropriate network structure
   1. Structure: single words, fixed windows vs. recursive sentence based vs. bag of words
   2. Nonlinearity
2. Check for implementation bugs with gradient checks
3. Parameter initialization
4. Optimization tricks
5. Check if the model is powerful enough to overfit
   1. If not, change the model structure or make the model "larger"
   2. If you can overfit: regularize
Assuming you found the right network structure, implemented it
correctly, and optimized it properly, you can make your model
overfit on your training data.
Now, it's time to regularize
Prevent Overfitting:
Model Size and Regularization
Simple first step: reduce model size by lowering the number of units
and layers and other parameters
Standard L1 or L2 regularization on weights
Early stopping: use the parameters that gave the best validation error
Sparsity constraints on hidden activations, e.g. add a penalty to the cost
(such as a KL-divergence term pushing average activations toward a small target)
Dropout (Hinton et al. 2012):
Randomly set 50% of the inputs at each layer to 0
At test time halve the outgoing weights (there are now twice as many active units)
Prevents co-adaptation
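A minimal sketch of dropout as described above (Hinton et al. 2012); using a 50% drop rate and test-time scaling of the activations, which is equivalent to halving the outgoing weights.

    import numpy as np

    def dropout_forward(x, p_drop=0.5, train=True):
        """Training: randomly zero a fraction p_drop of the inputs to a layer.
        Test: keep everything but scale by (1 - p_drop)."""
        if train:
            mask = (np.random.rand(*x.shape) >= p_drop)
            return x * mask
        return x * (1.0 - p_drop)

    x = np.random.randn(10)
    print(dropout_forward(x, train=True))
    print(dropout_forward(x, train=False))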
Deep Learning Tricks of the Trade
Y. Bengio (2012), "Practical Recommendations for Gradient-
Based Training of Deep Architectures"
Unsupervised pre-training
Stochastic gradient descent and setting learning rates
Main hyper-parameters
Learning rate schedule & early stopping
Minibatches
Parameter initialization
Number of hidden units
L1 or L2 weight decay
Sparsity regularization
Debugging -> finite-difference gradient check (yay)
How to efficiently search for hyper-parameter configurations
Resources: Tutorials and Code
Part 3.3: Resources
Related Tutorials
See the "Neural Net Language Models" Scholarpedia entry
Deep Learning tutorials: http://deeplearning.net/tutorials
Stanford deep learning tutorials with simple programming
assignments and reading list: http://deeplearning.stanford.edu/wiki/
Recursive Autoencoder class project:
http://cseweb.ucsd.edu/~elkan/250B/learningmeaning.pdf
Graduate Summer School: Deep Learning, Feature Learning:
http://www.ipam.ucla.edu/programs/gss2012/
ICML 2012 Representation Learning tutorial:
http://www.iro.umontreal.ca/~bengioy/talks/deep-learning-tutorial-2012.html
More reading (including tutorial references):
http://nlp.stanford.edu/courses/NAACL2013/
Software
Theano (Python CPU/GPU) mathematical and deep learning
library: http://deeplearning.net/software/theano
Can do automatic, symbolic differentiation
SENNA: POS, Chunking, NER, SRL
by Collobert et al. http://ronan.collobert.com/senna/
State-of-the-art performance on many tasks
3500 lines of C, extremely fast and using very little memory
Recurrent Neural Network Language Model:
http://www.fit.vutbr.cz/~imikolov/rnnlm/
Recursive neural net and RAE models for paraphrase detection,
sentiment analysis, relation classification: www.socher.org
Software: what's next
Off-the-shelf SVM packages are useful to researchers
from a wide variety of fields (no need to understand
RKHS).
One of the goals of deep learning: build off-the-shelf
NLP classification packages that use as training
input only raw text (instead of features), possibly with a
label.
Discussion
Part 3.4
Concerns
Many algorithms and variants (a burgeoning field)
Hyper-parameters (layer size, regularization, possibly
learning rate)
Use multi-core machines, clusters and random
sampling for cross-validation (Bergstra & Bengio 2012)
Pretty common for powerful methods, e.g. BM25, LDA
Can use (mini-batch) L-BFGS instead of SGD
Concerns
Not always obvious how to combine with existing NLP
Simple: add word or phrase vectors as features. Gets
close to state of the art for NER [Turian et al., ACL
2010]
Integrate with known problem structures: recursive
and recurrent networks for trees and chains
Your research here
Concerns
Slower to train than linear models
Only by a small constant factor, and much more
compact than non-parametric models (e.g. n-gram models)
Very fast during inference/test time (the feed-forward
pass is just a few matrix multiplies)
Need more training data
Can handle and benefit from more training data,
suitable for the age of Big Data (Google trains neural
nets with a billion connections, [Le et al., ICML 2012])
Concerns
There aren't many good ways to encode prior
knowledge about the structure of language into deep
learning models
There is some truth to this. However:
You can choose architectures suitable for a problem
domain, as we did for linguistic structure
You can include human-designed features in the first
layer, just like for a linear model
And the goal is to get the machine doing the learning!
Concern:
Problems with model interpretability
No discrete categories or words; everything is a continuous
vector. We'd like to have symbolic features like NP, VP, etc. and
see why their combination makes sense.
True, but most of language is fuzzy and many words have soft
relationships to each other. Also, many NLP features are
already not human-understandable (e.g., concatenations/
combinations of different features).
Can try projections of weights and nearest neighbors, see
part 2
Concern: non-convex optimization
Can initialize the system with a convex learner
Convex SVM
Fixed feature space
Then optimize the non-convex variant (add and tune learned
features); it can't be worse than the convex learner
Not a big problem in practice (often relatively stable
performance across different local optima)
Advantages
Despite a small community at the intersection of deep
learning and NLP, there are already many state-of-the-art results
on a variety of language tasks
Often very simple matrix derivatives (backprop) for
training and matrix multiplications for testing -> fast
implementation
Fast inference and well suited for multi-core CPUs/GPUs
and parallelization across machines
Learning Multiple Levels of
Abstraction
The big payoff of deep learning
is to learn feature
representations and higher
levels of abstraction
This allows much easier
generalization and transfer
between domains, languages,
and tasks
The End