Latent Structure Models for NLP

André Martins, Instituto de Telecomunicações & IST & Unbabel
Tsvetomila Mihaylova, Instituto de Telecomunicações
Nikita Nangia, NYU
Vlad Niculae, Instituto de Telecomunicações

deep-spin.github.io/tutorial
Structured prediction and NLP

• Structured prediction: a machine learning framework for predicting structured, constrained, and interdependent outputs
• NLP deals with structured and ambiguous textual data:
  • machine translation
  • speech recognition
  • syntactic parsing
  • semantic parsing
  • information extraction
  • ...
Examples of structure in NLP

POS tagging: "dog on wheels" admits several candidate taggings:
  VERB PREP NOUN · NOUN PREP NOUN · NOUN DET NOUN · ...

Dependency parsing: many candidate trees over "⋆ dog on wheels".

Word alignments: many candidate alignments between "dog on wheels" and the Dutch "hond op wielen".

Exponentially many parse trees! Cannot enumerate.
NLP 5 years ago: Structured prediction and pipelines

• Big pipeline systems, connecting different structured predictors, trained separately
• Advantages: fast and simple to train, can rearrange pieces
• Disadvantage: linguistic annotations required for each component
• Bigger disadvantage: errors propagate through the pipeline
NLP today: End-to-end training

• Forget pipelines: train everything from scratch!
• No more error propagation or linguistic annotations!
• Treat everything as latent!
Representation learning

• Uncover hidden representations useful for the downstream task.
• Neural networks are well-suited for this: deep computation graphs.
• Neural representations are unstructured, inscrutable. Language data has underlying structure!

(Figure: a network mapping an input to positive / neutral / negative labels.)
Latent structure models

• Seek structured hidden representations instead!

(Figure: a network with a structured latent layer between the input and the positive / neutral / negative output.)
Latent structure models aren't so new!

• They have a very long history in NLP:
  • IBM Models for SMT (latent word alignments) [Brown et al., 1993]
  • HMMs [Rabiner, 1989]
  • CRFs with hidden variables [Quattoni et al., 2007]
  • Latent PCFGs [Petrov and Klein, 2008, Cohen et al., 2012]
• Trained with EM, spectral learning, method of moments, ...
• Often, very strict assumptions (e.g. strong factorizations)
• Today, neural networks opened up some new possibilities!
Why do we love latent structure models?

• The inferred latent variables can bring us some interpretability
• They offer a way of injecting prior knowledge as a structured bias
• Hopefully: higher predictive power with fewer model parameters
  • smaller carbon footprint!
What this tutorial is about:

• Discrete, combinatorial latent structures
• Often the structure is inspired by some linguistic intuition
• We'll cover both:
  • RL methods (structure built incrementally, reward coming from the downstream task)
  • ... vs. end-to-end differentiable approaches (global optimization, marginalization)
  • stochastic computation graphs
  • ... vs. deterministic graphs.
• All plugged into discriminative neural models.
This tutorial is not about:

• It's not about continuous latent variables
• It's not about deep generative learning
• We won't cover GANs, VAEs, etc.
• There are (very good) recent tutorials on deep variational models for NLP:
  • "Variational Inference and Deep Generative Models" (Schulz and Aziz, ACL 2018)
  • "Deep Latent-Variable Models for Natural Language" (Kim, Wiseman, Rush, EMNLP 2018)
Unstructured vs structured

• To better explain the math, we'll often backtrack to unstructured models (where the latent variable is a categorical) before jumping to the structured ones
The unstructured case: Probability simplex

• Each vertex is an indicator vector, representing one class:
  z_c = [0, ..., 0, 1, 0, ..., 0]   (1 in the c-th position)
• Points inside are probability vectors, a convex combination of classes:
  p ≥ 0,  Σ_c p_c = 1.

(Figure: the simplex △, with vertex [0, 0, 1] and interior point [.3, .2, .5].)
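As a quick sketch in code (toy numbers, our own illustration, not from the tutorial): the softmax of any score vector lands inside the simplex, while the indicator vectors sit at its vertices.

```python
import numpy as np

def softmax(s):
    """Map an arbitrary score vector to a point in the probability simplex."""
    e = np.exp(s - s.max())  # shift by the max for numerical stability
    return e / e.sum()

# A point strictly inside the simplex: nonnegative entries summing to 1.
p = softmax(np.array([0.5, 0.1, 1.2]))
assert (p >= 0).all() and np.isclose(p.sum(), 1.0)

# A vertex of the simplex: the indicator vector for class c = 2.
z = np.zeros(3)
z[2] = 1.0
```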
What's the analogue of the simplex for a structure?

• A structured object z can be represented as a bit vector.
• Example:
  • a dependency tree can be represented as an O(L²) vector indexed by arcs
  • each entry is 1 iff the arc belongs to the tree
  • structural constraints: not all bit vectors represent valid trees!

z1 = [1,0,0,0,1,0,0,0,1]   (one tree over "⋆ dog on wheels")
z2 = [0,0,1,0,0,1,1,0,0]   (another tree)
z3 = [1,0,0,0,1,0,0,1,0]   (another tree)
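A minimal encoding sketch (our own toy arc ordering; the tutorial does not fix a particular index scheme): for L = 3 tokens plus the root ⋆ there are 4 × 3 − 3 = 9 candidate arcs, matching the 9-dimensional vectors above.

```python
import numpy as np
from itertools import product

words = ["*", "dog", "on", "wheels"]  # index 0 is the root symbol

# All candidate arcs (head, modifier): the root is never a modifier,
# and no word can head itself -- 4 * 3 - 3 = 9 arcs in total.
arcs = [(h, m) for h, m in product(range(4), range(1, 4)) if h != m]

def encode(tree):
    """Bit vector over candidate arcs: entry is 1 iff the arc is in the tree."""
    z = np.zeros(len(arcs))
    for arc in tree:
        z[arcs.index(arc)] = 1.0
    return z

# One candidate tree: * -> dog, dog -> on, on -> wheels.
z = encode({(0, 1), (1, 2), (2, 3)})
```

A tree over L tokens always has exactly L arcs set, but most 9-bit vectors do not correspond to any valid tree.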
The structured case: Marginal polytope [Wainwright and Jordan, 2008]

• Each vertex corresponds to one such bit vector z
• Points inside correspond to marginal distributions: convex combinations of structured objects

μ = p1 z1 + ... + pN zN   (exponentially many terms),   p ∈ △.

p1 = 0.2, z1 = [1,0,0,0,1,0,0,0,1]
p2 = 0.7, z2 = [0,0,1,0,0,1,1,0,0]
p3 = 0.1, z3 = [1,0,0,0,1,0,0,1,0]
⇒ μ = [.3, 0, .7, 0, .3, .7, .7, .1, .2].

(Figure: the marginal polytope M.)
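The example above can be checked directly (a numpy sketch reproducing the slide's numbers):

```python
import numpy as np

# The three trees from the slide, as arc-indicator vectors.
z1 = np.array([1, 0, 0, 0, 1, 0, 0, 0, 1.0])
z2 = np.array([0, 0, 1, 0, 0, 1, 1, 0, 0.0])
z3 = np.array([1, 0, 0, 0, 1, 0, 0, 1, 0.0])
p = np.array([0.2, 0.7, 0.1])  # a distribution over the three trees

# mu is a point in the marginal polytope: entry i is the marginal
# probability that arc i appears in a tree drawn from p.
mu = p[0] * z1 + p[1] * z2 + p[2] * z3
# mu ≈ [.3, 0, .7, 0, .3, .7, .7, .1, .2], as on the slide
```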
Unstructured vs Structured

• Unstructured case: simplex △
• Structured case: marginal polytope M
Computing the most likely structure is a very high-dimensional argmax

(Figure: input x → scores s → structure z → output ŷ.)

There are exponentially many structures (s cannot fit in memory; we cannot "loop" over s nor z)
Dealing with the combinatorial explosion

1. Incremental structures
• Build structure greedily, as a sequence of discrete choices (e.g., shift-reduce).
• Scores (partial structure, action) tuples.
• Advantages: flexible, rich histories.
• Disadvantages: greedy, local decisions are suboptimal, error propagation.

2. Factorization into parts
• Optimizes globally (e.g. Viterbi, Chu-Liu-Edmonds, Kuhn-Munkres).
• Scores smaller parts.
• Advantages: optimal, elegant, can handle hard & global constraints.
• Disadvantages: strong assumptions.
The challenge of discrete choices

(Figure: input x → scores s = f_θ(x) → discrete choice z ∈ {1, ..., N} → output ŷ = g_φ(z, x).)

∂L(ŷ, y)/∂w = ?   or, essentially,   ∂z/∂s = ?
Discrete mappings are "flat"

(Figure: as the scores s vary continuously, z jumps between the discrete values z ∈ {1, ..., N}.)

∂z/∂s = ?
Argmax

∂z/∂s = 0

(Figure: z₁ as a function of s₁, with the other score s₂ fixed: constant at 0 below s₂, jumping to 1 at s₁ = s₂, flat everywhere else.)
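This flatness is easy to see numerically (a toy sketch): small perturbations of the scores leave the argmax indicator unchanged, so the derivative is zero almost everywhere.

```python
import numpy as np

def onehot_argmax(s):
    """The discrete mapping z = argmax, as an indicator vector."""
    z = np.zeros_like(s)
    z[np.argmax(s)] = 1.0
    return z

s = np.array([1.0, 3.0, 2.0])
z = onehot_argmax(s)  # [0., 1., 0.]

# Nudging any score by a small epsilon does not change z:
# the mapping is piecewise constant, hence dz/ds = 0 (a.e.).
for i in range(3):
    s_eps = s.copy()
    s_eps[i] += 1e-3
    assert (onehot_argmax(s_eps) == z).all()
```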
Example: Regression with latent categorization

(Figure: input x → embeddings E → u; u → scores s = W_s u → latent topic z; [u, z] → v → output ŷ.)

u = 1/|x| Σ_j E x_j        s = W_s u        (predict topic c: z = e_c)
v = tanh(W_v [u, z])       ŷ = W_y v        L = (ŷ − y)²

The problem: ∂L/∂W_s = (∂L/∂ŷ)(∂ŷ/∂v)(∂v/∂z)(∂z/∂s)(∂s/∂W_s), and ∂z/∂s ≡ 0.

Workarounds: circumventing the issue, bypassing discrete variables
• Option 1. Pretrain the latent classifier W_s
• Option 2. Multi-task learning: add a supervised topic loss, −log (exp(s_c)/Z)

Tackling discreteness end-to-end
• Option 3. Stochasticity! L = E_z (ŷ − y)², and ∂ E_z(ŷ(z) − y)² / ∂W_s ≠ 0
• Option 4. Gradient surrogates (e.g. straight-through, ∂z/∂s ← I)
• Option 5. Continuous relaxation (e.g. softmax): v = tanh(W_v [u, p])
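As an illustration of Option 5 (a sketch with made-up dimensions; E, W_s, W_v, W_y are randomly initialized stand-ins for learned parameters): replacing the hard choice z with the softmax distribution p makes every step of the forward pass differentiable.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

# Toy sizes: vocabulary 10, embedding dim 4, C = 3 topics, hidden dim 5.
E = rng.normal(size=(10, 4))      # word embeddings
Ws = rng.normal(size=(3, 4))      # latent topic classifier
Wv = rng.normal(size=(5, 4 + 3))  # combiner over [u, p]
Wy = rng.normal(size=(1, 5))      # output regressor

x = [2, 5, 7]                      # token ids of the input
u = E[x].mean(axis=0)              # u = 1/|x| sum_j E x_j
s = Ws @ u                         # topic scores
p = softmax(s)                     # relaxation: p instead of z = e_c
v = np.tanh(Wv @ np.concatenate([u, p]))
y_hat = (Wy @ v)[0]                # prediction; (y_hat - y)^2 is the loss
# Every operation above is differentiable, so dL/dWs is nonzero in general.
```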
Dealing with discrete latent variables

1. Pre-train external classifier
2. Multi-task learning
3. Stochastic latent variables (Part 2)
4. Gradient surrogates (Part 3)
5. Continuous relaxation (Part 4)
Roadmap of the tutorial

• Part 1: Introduction
• Part 2: Reinforcement learning
• Part 3: Gradient surrogates

Coffee Break

• Part 4: End-to-end differentiable models
• Part 5: Conclusions
Latent structure via marginalization

• Given a sentence-label pair (x, y) and its known parse tree z, we can make a prediction ŷ(z; x) and incur a loss, L(ŷ(z; x), y), or simply L(z)
• But we don't know z!
• In this section: we jointly learn a structured prediction model π_θ(z | x) by optimizing the expected loss,

E_{π_θ(z|x)} L(z)
Stack-augmented Parser-Interpreter Neural Network (SPINN) [Bowman et al., 2016]

• Joint learning: combines a constituency parser and a sentence representation model.
• The parser, f_θ(x), is a transition-based shift-reduce parser. It looks at the top two elements of the stack and the top element of the buffer.
• A TreeLSTM combines the top two elements of the stack when the parser chooses the reduce action.
Stack-augmented Parser-Interpreter Neural Network [Bowman et al., 2016]

(Figure: step-by-step trace of SPINN's stack and buffer during shift-reduce parsing.)
Shift-reduce parsing

We can write a shift-reduce style parse as a sequence of Bernoulli random variables,

z = z1, ..., z_{2L−1}

where z_j ∈ {0, 1} ∀ j ∈ [1, 2L − 1]
Shift-reduce parsing

A sequence of Bernoulli trials, but with conditional dependence,

p(z1, z2, ..., z_{2L−1}) = ∏_{j=1}^{2L−1} p(z_j | z_{<j})
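In code, this factorization is just a sum of conditional log-probabilities (a sketch; `cond_prob` is a hypothetical stand-in for the parser network's reduce probability):

```python
import numpy as np

def sequence_logprob(actions, cond_prob):
    """log p(z_1, ..., z_{2L-1}) = sum_j log p(z_j | z_{<j}).

    `actions` is a list of 0/1 decisions (shift/reduce);
    `cond_prob(prefix)` returns P(z_j = 1 | z_{<j} = prefix).
    """
    logp = 0.0
    for j in range(len(actions)):
        p1 = cond_prob(actions[:j])
        logp += np.log(p1) if actions[j] == 1 else np.log(1.0 - p1)
    return logp

# Toy conditional: reduce becomes likelier as the prefix grows.
toy = lambda prefix: 0.2 + 0.1 * len(prefix)
lp = sequence_logprob([0, 1, 1, 0, 1], toy)  # a length-5 action sequence
```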
Latent structure learning with SPINN

• But now, remove syntactic supervision from SPINN.
• We model the parse, z, as a latent variable with our parser as the score function estimator, f_θ(x).
• With shift-reduce parsing, we're making discrete decisions ⇒ REINFORCE as a "natural" solution.
Unsupervised SPINN

No syntactic supervision.
The only reward comes from the downstream task.
We only get this reward after parsing the full sentence.
SPINN with REINFORCE [Williams, 1992]

Some basic terminology:
• The action space is z_j ∈ {shift, reduce}, and z is a sequence of actions.
• Training parser network parameters, θ, with REINFORCE
• The state, h, is the top two elements of the stack and the top element of the buffer.
• Learning a policy network π(z | h; θ)
• Maximize the reward, where R is performance on the downstream task, like sentence classification.

NOTE: Only a single reward at the end of parsing.
Through the looking glass of REINFORCE

∇_θ E_{z∼π_θ(z|x)}[L(z)]
  = ∇_θ Σ_z L(z) π_θ(z | x)                       (by definition of expectation; how to evaluate?)
  = Σ_z L(z) ∇_θ π_θ(z | x)
  = Σ_z L(z) π_θ(z | x) ∇_θ log π_θ(z | x)        (by the log-derivative trick, ∇π = π ∇ log π)
  = E_{z∼π_θ(z|x)}[L(z) ∇_θ log π_θ(z | x)]
SPINN with REINFORCE, aka RL-SPINN

Yogatama et al. [2017] use REINFORCE to train SPINN!
However, this vanilla implementation isn't very effective at learning syntax.
The model fails to solve a simple toy problem.
Toy problem: ListOps [Nangia and Bowman, 2018]

[max 2 9 [min 4 7 ] 0 ]
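To make the task concrete, here is a tiny evaluator for such expressions (our own sketch; we assume ListOps' `sm` is summation modulo 10, and `med` is a simple median, which may differ in edge cases from the dataset's definition):

```python
def eval_listops(tokens):
    """Recursively evaluate a ListOps expression given as a token list."""
    ops = {
        "max": max,
        "min": min,
        "sm": lambda vals: sum(vals) % 10,        # summation modulo 10
        "med": lambda vals: sorted(vals)[len(vals) // 2],
    }
    def parse(i):
        if tokens[i].startswith("["):             # e.g. "[max" opens a list
            op, vals, i = tokens[i][1:], [], i + 1
            while tokens[i] != "]":
                v, i = parse(i)
                vals.append(v)
            return ops[op](vals), i + 1           # skip the closing "]"
        return int(tokens[i]), i + 1              # a literal digit
    return parse(0)[0]

print(eval_listops("[max 2 9 [min 4 7 ] 0 ]".split()))  # 9
```

A model must implicitly recover the bracketing to answer correctly, which is why latent tree structure matters here.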
Toy problem: ListOps [Nangia and Bowman, 2018]

Accuracy:
Model         | μ (σ)      | max  | Self F1
LSTM          | 71.5 (1.5) | 74.4 | -
RL-SPINN      | 60.7 (2.6) | 64.8 | 30.8
Random Trees  | -          | -    | 30.1

Model         | F1 wrt. LB | RB   | GT    | Avg. Depth
48D RL-SPINN  | 64.5       | 16.0 | 32.1  | 14.6
128D RL-SPINN | 43.5       | 13.0 | 71.1  | 10.4
GT Trees      | 41.6       | 8.8  | 100.0 | 9.6
Random Trees  | 24.0       | 24.0 | 24.2  | 5.2

But why?
RL-SPINN's troubles

This system faces at least two big problems:
1. High variance of gradients
2. Coadaptation
High variance

• We have a single reward at the end of parsing.
• We are sampling parses from a very large search space! Catalan number of binary trees:
  3 tokens ⇒ 5 trees
  5 tokens ⇒ 42 trees
  10 tokens ⇒ 16796 trees
• And the policy is stochastic.
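The tree counts above are Catalan numbers, easy to reproduce (a quick check):

```python
from math import comb

def catalan(n):
    """n-th Catalan number, C_n = (2n choose n) / (n + 1)."""
    return comb(2 * n, n) // (n + 1)

# The counts quoted on the slide:
print(catalan(3), catalan(5), catalan(10))  # 5 42 16796
```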
High variance

So, sometimes the policy lands in a "rewarding state":

[sm [sm [sm [max 5 6 ] 2 ] 0 ] 5 0 8 6 ]   (Truth: 7; Pred: 7)

Sometimes it doesn't:

[max [med [med 1 [sm 3 1 3 ] 9 ] 6 ] 5 ]   (Truth: 6; Pred: 5)
High variance

The Catalan number of parses means we need many, many samples to lower the variance!

Possible solutions:
1. Gradient normalization
2. Control variates, aka baselines
Control variates

• A simple control variate: a moving average of recent rewards.
• Parameters are updated using the advantage: the difference between the reward R and the baseline prediction b(x).

So,

∇ E_{z∼π(z)}[L(z)] = E_{z∼π(z)}[(L(z) − b(x)) ∇ log π(z)]

which we can do because

∑_z b(x) ∇π(z) = b(x) ∑_z ∇π(z) = b(x) ∇1 = 0
deep-spin.github.io/tutorial 44
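A toy sketch of the baseline identity (all names and numbers here are illustrative, not from the tutorial): for a 3-action policy, the REINFORCE gradient estimate stays approximately the same when we subtract a constant baseline, but its variance drops.

```python
import math
import random

random.seed(0)

# Toy policy over 3 actions with scores s; each action incurs loss L.
s = [0.5, -0.2, 0.1]
L = [1.0, 3.0, 2.0]

def softmax(s):
    m = max(s)
    e = [math.exp(x - m) for x in s]
    Z = sum(e)
    return [x / Z for x in e]

def grad_estimate(baseline, n_samples=5000):
    """Monte Carlo REINFORCE gradient w.r.t. s, with a constant baseline.

    Uses d log pi(z) / d s_k = 1[k == z] - p_k for the softmax policy.
    Returns (per-coordinate mean, per-coordinate variance) of the estimator.
    """
    p = softmax(s)
    grads = []
    for _ in range(n_samples):
        z = random.choices(range(3), weights=p)[0]
        g = [(L[z] - baseline) * ((1.0 if k == z else 0.0) - p[k])
             for k in range(3)]
        grads.append(g)
    mean = [sum(col) / n_samples for col in zip(*grads)]
    var = [sum((g[k] - mean[k]) ** 2 for g in grads) / n_samples
           for k in range(3)]
    return mean, var

m0, v0 = grad_estimate(baseline=0.0)             # no control variate
mb, vb = grad_estimate(baseline=sum(L) / 3)      # crude average-loss baseline
```

Both estimates target the same gradient (the baseline term has zero expectation), but the baselined estimator has markedly lower variance.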
Issues with SPINN with REINFORCE
This system faces two big problems:
1. High variance of gradients
2. Coadaptation
deep-spin.github.io/tutorial 45
Coadaptation problem

Learning the composition function parameters φ with backpropagation,
and the parser parameters θ with REINFORCE.

Generally, φ will be learned more quickly than θ,
making it harder to explore the parsing search space and optimize for θ.

The two gradient estimates differ in variance.

Possible solution:
Proximal Policy Optimization (Schulman et al., 2017)
deep-spin.github.io/tutorial 46
Making REINFORCE+SPINN work

Havrylov et al. [2019] use:
1. An input-dependent control variate
2. Gradient normalization
3. Proximal Policy Optimization

They solve ListOps!
However, the model does not learn English grammar.
deep-spin.github.io/tutorial 47
Should I? Shouldn’t I?

Pros:
• Unbiased!
• In a simple setting, with enough tricks, it can work!

Cons:
• High variance
• Has not yet been very effective at learning English syntax.
deep-spin.github.io/tutorial 48
So far:
• Tackled the expected loss in a stochastic computation graph

E_{π_θ(z|x)}[L(z)]

• Optimized it with the REINFORCE estimator.
• Struggled with variance & sampling.

In this section:
• Consider the deterministic alternative:

pick the “best” structure z(x) := argmax_{z∈M} π_θ(z | x), and incur the loss L(z(x))

• 3A: try to optimize the deterministic loss directly
• 3B: use this strategy to reduce variance in the stochastic model.

deep-spin.github.io/tutorial 49
Recap: The argmax problem

The map s ↦ z = argmax(s) selects one of z = 1, 2, …, N.

[Figure: z₁ as a function of s₁, with s₂ fixed: a step function jumping from 0 to 1 at s₁ = s₂. The map is piecewise constant, so ∂z/∂s = 0 almost everywhere.]
deep-spin.github.io/tutorial 50
Softmax

The map s ↦ p = softmax(s), with p_j = exp(s_j)/Z, is a smooth surrogate, with Jacobian ∂p/∂s = diag(p) − pp⊤.

[Figure: p₁ as a function of s₁, with s₂ fixed: a sigmoid rising from 0 to 1 around s₁ = s₂.]
deep-spin.github.io/tutorial 51
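The Jacobian formula can be verified numerically with a small plain-Python sketch (assuming nothing beyond the formula on this slide):

```python
import math

def softmax(s):
    m = max(s)
    e = [math.exp(x - m) for x in s]
    Z = sum(e)
    return [x / Z for x in e]

def softmax_jacobian(s):
    # Analytic Jacobian: J[i][j] = p_i * (delta_ij - p_j), i.e. diag(p) - p p^T
    p = softmax(s)
    n = len(s)
    return [[p[i] * ((1.0 if i == j else 0.0) - p[j]) for j in range(n)]
            for i in range(n)]

s = [1.0, -0.5, 0.3]
J = softmax_jacobian(s)

# Finite-difference check of the first column (perturb s_0):
eps = 1e-6
s_plus = [s[0] + eps, s[1], s[2]]
fd = [(pi - qi) / eps for pi, qi in zip(softmax(s_plus), softmax(s))]
```

The finite-difference column matches the analytic one to within the discretization error.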
Straight-Through Estimator
[Hinton, 2012, Bengio et al., 2013]

• Forward: z = argmax(s)
• Backward: pretend z was some continuous p, with ∂p/∂s ≠ 0
  • simplest: identity, p(s) = s, so ∂p/∂s = I
  • others, e.g. softmax: p(s) = softmax(s), so ∂p/∂s = diag(p) − pp⊤
• More explanation in a while

What about the structured case?
deep-spin.github.io/tutorial 52
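A minimal sketch of the estimator for a single argmax, using the softmax variant for the backward pass (illustrative only; a real implementation would hook into an autodiff framework):

```python
import math

def softmax(s):
    m = max(s)
    e = [math.exp(x - m) for x in s]
    Z = sum(e)
    return [x / Z for x in e]

def ste_forward(s):
    """Forward: hard one-hot argmax; stash softmax(s) for the backward pass."""
    k = max(range(len(s)), key=lambda i: s[i])
    z = [1.0 if i == k else 0.0 for i in range(len(s))]
    return z, softmax(s)

def ste_backward(p, grad_z):
    """Backward: pretend z was p, so grad_s = (diag(p) - p p^T) grad_z.

    Componentwise: grad_s[i] = p_i * (grad_z[i] - <p, grad_z>).
    """
    dot = sum(pi * gi for pi, gi in zip(p, grad_z))
    return [pi * (gi - dot) for pi, gi in zip(p, grad_z)]

s = [2.0, 0.5, -1.0]
z, p = ste_forward(s)              # hard one-hot in the forward pass
grad_s = ste_backward(p, [1.0, 0.0, 0.0])  # nonzero surrogate gradient
```

The forward pass emits a discrete one-hot vector, yet a usable (if biased) gradient flows back through the softmax surrogate.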
Dealing with the combinatorial explosion

1. Incremental structures
• Build the structure greedily, as a sequence of discrete choices (e.g., shift-reduce).
• Scores (partial structure, action) tuples.
• Advantages: flexible, rich histories.
• Disadvantages: greedy; local decisions are suboptimal; error propagation.

2. Factorization into parts
• Optimizes globally (e.g. Viterbi, Chu-Liu-Edmonds, Kuhn-Munkres).
• Scores smaller parts.
• Advantages: optimal, elegant, can handle hard & global constraints.
• Disadvantages: strong assumptions.
deep-spin.github.io/tutorial 53
STE for incremental structures

• Build a structure as a sequence of discrete choices (e.g., shift-reduce).
• Assign a score to any (partial structure, action) tuple.
• In this case, we just apply the straight-through estimator at each step.
• Forward: take the highest-scoring action at each step.
• Backward: pretend that we had used a differentiable surrogate function.

Example: Latent Tree Learning with Differentiable Parsers: Shift-Reduce Parsing and Chart Parsing [Maillard and Clark, 2018] (STE through beam search).
deep-spin.github.io/tutorial 54
STE for the factorized approach
Requires a bit more work:
• Recap: the marginal polytope
• Predicting structures globally: Maximum A Posteriori (MAP)
• Deriving Straight-Through and SPIGOT
deep-spin.github.io/tutorial 55
The structured case: Marginal polytope
[Wainwright and Jordan, 2008]

• Each vertex corresponds to one such bit vector z.
• Points inside correspond to marginal distributions: convex combinations of structured objects

μ = p₁z₁ + … + p_N z_N (exponentially many terms), p ∈ △.

Example:
p₁ = 0.2, z₁ = [1,0,0,0,1,0,0,0,1]
p₂ = 0.7, z₂ = [0,0,1,0,0,1,1,0,0]
p₃ = 0.1, z₃ = [1,0,0,0,1,0,0,1,0]
⇒ μ = [.3, 0, .7, 0, .3, .7, .7, .1, .2]
deep-spin.github.io/tutorial 56
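The example above can be checked in a couple of lines:

```python
# Recompute the marginal vector mu from the slide's example distribution.
p = [0.2, 0.7, 0.1]
Z = [
    [1, 0, 0, 0, 1, 0, 0, 0, 1],
    [0, 0, 1, 0, 0, 1, 1, 0, 0],
    [1, 0, 0, 0, 1, 0, 0, 1, 0],
]
# mu_d = sum_k p_k * (z_k)_d, a convex combination of the vertices
mu = [sum(pk * zk[d] for pk, zk in zip(p, Z)) for d in range(9)]
print(mu)
```

Each coordinate of μ is the marginal probability that the corresponding part is selected.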
Predicting structures from scores of parts

• η(i→j): score of arc i→j
• z(i→j): is arc i→j selected?
• A task-specific algorithm finds the highest-scoring structure.

[Diagram: η → z → output ŷ]
deep-spin.github.io/tutorial 57
Algorithms for specific structures

Structure           | Best structure (MAP)                                          | Marginals
Sequence tagging    | Viterbi [Rabiner, 1989]                                       | Forward-Backward [Rabiner, 1989]
Constituent trees   | CKY [Kasami, 1966, Younger, 1967, Cocke and Schwartz, 1970]   | Inside-Outside [Baker, 1979]
Temporal alignments | DTW [Sakoe and Chiba, 1978]                                   | Soft-DTW [Cuturi and Blondel, 2017]
Dependency trees    | Max. Spanning Arborescence [Chu and Liu, 1965, Edmonds, 1967] | Matrix-Tree [Kirchhoff, 1847]
Assignments         | Kuhn-Munkres [Kuhn, 1955, Jonker and Volgenant, 1987]         | #P-complete [Valiant, 1979, Taskar, 2004]
deep-spin.github.io/tutorial 58
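As an illustration of the MAP column, here is a minimal Viterbi sketch for a linear-chain model with additive scores (toy scores, not the tutorial's code):

```python
def viterbi(emission, transition):
    """MAP tag sequence for a linear-chain model with additive scores.

    emission[t][y]: score of tag y at position t
    transition[y][y2]: score of moving from tag y to tag y2
    """
    T, K = len(emission), len(emission[0])
    score = [emission[0][:]]  # best score of any path ending in tag y at step t
    back = []
    for t in range(1, T):
        row, ptr = [], []
        for y in range(K):
            best_prev = max(range(K),
                            key=lambda y2: score[t - 1][y2] + transition[y2][y])
            row.append(score[t - 1][best_prev] + transition[best_prev][y]
                       + emission[t][y])
            ptr.append(best_prev)
        score.append(row)
        back.append(ptr)
    # Backtrack from the best final tag.
    y = max(range(K), key=lambda k: score[-1][k])
    path = [y]
    for ptr in reversed(back):
        y = ptr[y]
        path.append(y)
    return path[::-1]

# Toy example with 2 tags: staying in tag 0 is penalized.
emission = [[2.0, 0.0], [1.5, 1.0], [0.0, 2.0]]
transition = [[-1.0, 0.0], [0.0, 0.0]]
print(viterbi(emission, transition))
```

Dynamic programming over the factorized scores finds the global optimum in O(T·K²) instead of enumerating all K^T sequences.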
Structured Straight-Through

• Forward pass: find the highest-scoring structure: z = argmax_{z∈Z} η⊤z
• Backward pass: pretend we used μ = η.
deep-spin.github.io/tutorial 59
Straight-Through Estimator Revisited

• In the forward pass, z = argmax(s).
• If we had labels (multi-task learning): L_MTL = L(y(z), y) + L_hid(s, z_true)
• One choice: the perceptron loss L_hid(s, z_true) = s⊤z − s⊤z_true, so ∂L_hid/∂s = z − z_true.
• We don’t have labels! Induce labels by “pulling back” the downstream target:
the “best” (unconstrained) latent value would be argmin_{z∈R^D} L(y(z), y)
• One gradient descent step starting from z: z_true ← z − ∂L/∂z

∂L_MTL/∂s = ∂L/∂s + ∂L_hid/∂s = 0 + (z − z_true) = z − (z − ∂L/∂z) = ∂L/∂z

[Martins and Niculae, 2019]
deep-spin.github.io/tutorial 60
Straight-Through in the structured case
[Peng et al., 2018, Martins and Niculae, 2019]

• Structured STE: a perceptron update with the induced annotation

z_true := argmin_{μ∈R^D} L(y(μ), y) ≈ z − ∇_z L(z)

(one step of gradient descent)

• SPIGOT takes the constraints into account; it uses the induced annotation

z_true := argmin_{μ∈M} L(y(μ), y) ≈ Proj_M(z − ∇_z L(z))

(one step of projected gradient descent!)

• We discuss a generic way to compute the projection in part 4.
deep-spin.github.io/tutorial 61
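SPIGOT's projection Proj_M depends on the structure. As a minimal illustration, here is one projected-gradient step in the simplest case where M is the probability simplex, using the standard sort-based Euclidean projection (toy numbers, not the SPIGOT implementation):

```python
def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = sorted(v, reverse=True)
    cumsum, theta = 0.0, 0.0
    for i, ui in enumerate(u, start=1):
        cumsum += ui
        t = (cumsum - 1.0) / i
        if ui - t > 0:   # ui is still an "active" coordinate
            theta = t
    return [max(x - theta, 0.0) for x in v]

# One SPIGOT-style projected update (grad stands in for grad_z L(z)):
z = [1.0, 0.0, 0.0]          # current argmax vertex
grad = [0.8, -0.4, -0.1]     # downstream gradient w.r.t. z
z_true = project_simplex([zi - gi for zi, gi in zip(z, grad)])
print(z_true)
```

Unlike the unconstrained STE target, the projected target is again a valid point of the polytope (here: a probability vector).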
Summary: Straight-Through Estimator

We saw how to use the Straight-Through Estimator to allow learning models with an argmax in the middle of the computation graph.

We were optimizing L(z(x)).

Now we will see how to apply STE to stochastic graphs, as an alternative to REINFORCE.
deep-spin.github.io/tutorial 62
Stochastic node in the computation graph

Recall the stochastic objective:

E_{π_θ(z|x)}[L(z)]

• REINFORCE (previous section). High variance.
• An alternative is the reparameterization trick [Kingma and Welling, 2014].
deep-spin.github.io/tutorial 63
Categorical reparameterization
[Jang et al., 2017, Maddison et al., 2016]

• Sampling a categorical value in the middle of the computation graph: z ∼ π_θ(z | x) ∝ exp s_θ(z | x)
• What is the gradient of a sample, ∂z/∂θ?!
• Reparameterization: move the stochasticity out of the gradient path.
• Makes z deterministic w.r.t. s!

[Diagram: the noise 𝝐 enters as an extra input; s + 𝝐 is fed to z]

As a result:
• Stochasticity is moved to an input.
• We can backpropagate through the deterministic input to z.
deep-spin.github.io/tutorial 64
Categorical reparameterization
[Jang et al., 2017, Maddison et al., 2016]

How do we sample from a categorical variable?
deep-spin.github.io/tutorial 65
Sampling from a categorical variable

We want to sample from a categorical variable with scores s (class i has score s_i).

1. Inverse transform sampling:
• p = softmax(s)
• c_i = ∑_{j≤i} p_j (with c_0 = 0)
• u ∼ Uniform(0,1)
• return z = e_t s.t. c_{t−1} ≤ u < c_t

2. The Gumbel-Max trick:
• u_i ∼ Uniform(0,1)
• ε_i = − log(− log(u_i))
• z = argmax(s + ε)

The two methods are equivalent. (Not obvious, but we will not prove it now.)
The Gumbel-Max trick requires sampling from the standard Gumbel distribution G(0,1).

Derivation & more info: [Adams, 2013, Vieira, 2014]

We have an argmax again and cannot backpropagate!
deep-spin.github.io/tutorial 66
Sampling from a categorical variableWe want to sample from a categorical variable with scores s (class i has a score si)
1. Inverse transform sampling:• p = softmax(s)• ci =∑
j≤i pj• u ∼ Uniform(0,1)• return z = et s.t. ct ≤ u < ct+1
2. The Gumbel-Max trick• ui ∼ Uniform(0,1)• εi = − log(− log(ui))
• z = argmax(s + ε)
The two methods are equivalent. (Not obvious, but we will not prove it now.)Requires sampling from the Standard Gumbel Distribu on G(0,1).
Deriva on & more info: [Adams, 2013, Vieira, 2014]
s+𝝐s
𝝐 (stochastic)
+
z
We have an argmax againand cannot backpropagate!
deep-spin.github.io/tutorial 66
Sampling from a categorical variableWe want to sample from a categorical variable with scores s (class i has a score si)
1. Inverse transform sampling:• p = softmax(s)• ci =∑
j≤i pj• u ∼ Uniform(0,1)• return z = et s.t. ct ≤ u < ct+1
2. The Gumbel-Max trick• ui ∼ Uniform(0,1)• εi = − log(− log(ui))• z = argmax(s + ε)
The two methods are equivalent. (Not obvious, but we will not prove it now.)Requires sampling from the Standard Gumbel Distribu on G(0,1).
Deriva on & more info: [Adams, 2013, Vieira, 2014]
s+𝝐s
𝝐 (stochastic)
+
z
We have an argmax againand cannot backpropagate!
deep-spin.github.io/tutorial 66
Sampling from a categorical variableWe want to sample from a categorical variable with scores s (class i has a score si)
1. Inverse transform sampling:• p = softmax(s)• ci =∑
j≤i pj• u ∼ Uniform(0,1)• return z = et s.t. ct ≤ u < ct+1
2. The Gumbel-Max trick• ui ∼ Uniform(0,1)• εi = − log(− log(ui))• z = argmax(s + ε)
The two methods are equivalent. (Not obvious, but we will not prove it now.)
Requires sampling from the Standard Gumbel Distribu on G(0,1).Deriva on & more info: [Adams, 2013, Vieira, 2014]
s+𝝐s
𝝐 (stochastic)
+
z
We have an argmax againand cannot backpropagate!
deep-spin.github.io/tutorial 66
Sampling from a categorical variableWe want to sample from a categorical variable with scores s (class i has a score si)
1. Inverse transform sampling:• p = softmax(s)• ci =∑
j≤i pj• u ∼ Uniform(0,1)• return z = et s.t. ct ≤ u < ct+1
2. The Gumbel-Max trick• ui ∼ Uniform(0,1)• εi = − log(− log(ui))• z = argmax(s + ε)
The two methods are equivalent. (Not obvious, but we will not prove it now.)Requires sampling from the Standard Gumbel Distribu on G(0,1).
Deriva on & more info: [Adams, 2013, Vieira, 2014]
s+𝝐s
𝝐 (stochastic)
+
z
We have an argmax againand cannot backpropagate!
deep-spin.github.io/tutorial 66
Sampling from a categorical variableWe want to sample from a categorical variable with scores s (class i has a score si)
1. Inverse transform sampling:• p = softmax(s)• ci =∑
j≤i pj• u ∼ Uniform(0,1)• return z = et s.t. ct ≤ u < ct+1
2. The Gumbel-Max trick• ui ∼ Uniform(0,1)• εi = − log(− log(ui))• z = argmax(s + ε)
The two methods are equivalent. (Not obvious, but we will not prove it now.)Requires sampling from the Standard Gumbel Distribu on G(0,1).
Deriva on & more info: [Adams, 2013, Vieira, 2014]
s+𝝐s
𝝐 (stochastic)
+
z
We have an argmax againand cannot backpropagate!
deep-spin.github.io/tutorial 66
Sampling from a categorical variableWe want to sample from a categorical variable with scores s (class i has a score si)
1. Inverse transform sampling:• p = softmax(s)• ci =∑
j≤i pj• u ∼ Uniform(0,1)• return z = et s.t. ct ≤ u < ct+1
2. The Gumbel-Max trick• ui ∼ Uniform(0,1)• εi = − log(− log(ui))• z = argmax(s + ε)
The two methods are equivalent. (Not obvious, but we will not prove it now.)Requires sampling from the Standard Gumbel Distribu on G(0,1).
Deriva on & more info: [Adams, 2013, Vieira, 2014]
s+𝝐s
𝝐 (stochastic)
+
z
We have an argmax againand cannot backpropagate!
deep-spin.github.io/tutorial 66
Sampling from a categorical variableWe want to sample from a categorical variable with scores s (class i has a score si)
1. Inverse transform sampling:• p = softmax(s)• ci =∑
j≤i pj• u ∼ Uniform(0,1)• return z = et s.t. ct ≤ u < ct+1
2. The Gumbel-Max trick• ui ∼ Uniform(0,1)• εi = − log(− log(ui))• z = argmax(s + ε)
The two methods are equivalent. (Not obvious, but we will not prove it now.)Requires sampling from the Standard Gumbel Distribu on G(0,1).
Deriva on & more info: [Adams, 2013, Vieira, 2014]
s+𝝐s
𝝐 (stochastic)
+
z
We have an argmax againand cannot backpropagate!
deep-spin.github.io/tutorial 66
Straight-Through Gumbel Estimator
Apply a variant of the Straight-Through Estimator to Gumbel-Max!
[Jang et al., 2017, Maddison et al., 2016]

• Forward: z = argmax(s + ε)
• Backward: pretend we had done p = softmax(s + ε)

[Figure: s plus Gumbel noise ε, with εi = −log(−log(ui)), ui ∼ U(0,1); the forward pass outputs z, the backward pass uses p.]

What about the structured case?

deep-spin.github.io/tutorial 67
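A sketch of the estimator with the backward pass written out by hand. In autodiff frameworks this is commonly implemented as z + p − detach(p); here the softmax Jacobian J = diag(p) − pp⊤ is applied explicitly (the upstream gradient grad_z is arbitrary):

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def st_gumbel_forward(s, rng):
    # Forward: hard one-hot sample z = onehot(argmax(s + eps)).
    eps = -np.log(-np.log(rng.uniform(size=s.shape)))
    z = np.zeros_like(s)
    z[np.argmax(s + eps)] = 1.0
    return z, eps

def st_gumbel_backward(s, eps, grad_z):
    # Backward: pretend the forward pass was p = softmax(s + eps),
    # so grad_s = J^T grad_z with J = diag(p) - p p^T (J is symmetric).
    p = softmax(s + eps)
    J = np.diag(p) - np.outer(p, p)
    return J @ grad_z

rng = np.random.default_rng(1)
s = np.array([0.7, 0.1, 1.5])
z, eps = st_gumbel_forward(s, rng)
grad_s = st_gumbel_backward(s, eps, grad_z=np.array([0.0, 0.0, 1.0]))
```

The forward output is a true one-hot vector, while the surrogate gradient is dense; since J1 = 0, the components of grad_s sum to zero.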
Dealing with the combinatorial explosion

1. Incremental structures
• Build structure greedily, as a sequence of discrete choices (e.g., shift-reduce).
• Scores (partial structure, action) tuples.
• Advantages: flexible, rich histories.
• Disadvantages: greedy; local decisions are suboptimal; error propagation.

2. Factorization into parts
• Optimizes globally (e.g. Viterbi, Chu-Liu-Edmonds, Kuhn-Munkres).
• Scores smaller parts.
• Advantages: optimal, elegant, can handle hard & global constraints.
• Disadvantages: strong assumptions.

deep-spin.github.io/tutorial 68
Sampling from incremental structures

• Build a structure as a sequence of discrete choices (e.g., shift-reduce).
• Assign a score to any (partial structure, action) tuple.
• Reparameterize the scores with Gumbel-Max: now we have a deterministic node.
• Forward: the argmax of the reparameterized scores at each step.
• Backward: pretend we had used a differentiable surrogate function.

Example: Gumbel Tree-LSTM [Choi et al., 2018].

deep-spin.github.io/tutorial 69
Example: Gumbel Tree-LSTM
[Choi et al., 2018]

• Building task-specific tree structures.
• Straight-Through Gumbel-Softmax at each step to select one arc.

[Figure: for "The cat sat on", the candidate merges "The cat", "cat sat", and "sat on" receive validity scores v1 = 0.5, v2 = 0.1, v3 = 0.4 from a query vector q; one merge is selected.]

deep-spin.github.io/tutorial 70
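A toy sketch of this selection loop, with a plain vector average standing in for the Tree-LSTM composition function (the token vectors, the query q, and the 4-dimensional states are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def st_gumbel_select(scores, rng):
    # Forward pass of ST Gumbel-Softmax: sample an index with Gumbel-Max.
    eps = -np.log(-np.log(rng.uniform(size=scores.shape)))
    return int(np.argmax(scores + eps))

# Toy encoder state: one vector per token (the real model uses Tree-LSTM states).
nodes = [rng.normal(size=4) for _ in ["The", "cat", "sat", "on"]]
q = rng.normal(size=4)  # query vector scoring candidate merges

while len(nodes) > 1:
    # Candidate = merge of each adjacent pair; a plain average stands in
    # for the Tree-LSTM composition function here.
    candidates = [(a + b) / 2 for a, b in zip(nodes, nodes[1:])]
    v = np.array([q @ c for c in candidates])  # validity scores v_i
    i = st_gumbel_select(v, rng)               # pick one merge stochastically
    nodes = nodes[:i] + [candidates[i]] + nodes[i + 2:]

root = nodes[0]  # a single vector encoding the sampled binary tree
```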
Sampling from factorized models: Perturb-and-MAP
[Papandreou and Yuille, 2011, Corro and Titov, 2019a,b]

Reparameterize by perturbing the arc scores (inexact!):

• Sample from the standard Gumbel distribution: ε ∼ G(0,1).
• Perturb the arc scores with the Gumbel noise: η̃ = η + ε.
• Compute the MAP: argmax_{z∈Z} η̃⊤z (task-specific algorithm).
• Backward: we could use Straight-Through with Identity.

deep-spin.github.io/tutorial 71
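The inexactness can be seen on a toy problem small enough to enumerate (the three structures and their part scores below are made up): perturbing the shared part scores gives a distribution close to, but generally different from, the exact Gibbs distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny hypothetical structured set: 3 structures sharing 3 binary parts.
Z = np.array([[1, 1, 0], [1, 0, 1], [0, 1, 1]], dtype=float)
eta = np.array([0.5, 0.2, -0.3])   # part (arc) scores

# Exact Gibbs distribution p(z) ∝ exp(η⊤z), by enumeration.
scores = Z @ eta
p_exact = np.exp(scores - scores.max())
p_exact /= p_exact.sum()

# Perturb-and-MAP: Gumbel noise on the *part* scores, then MAP.
n = 20_000
counts = np.zeros(3)
for _ in range(n):
    eps = -np.log(-np.log(rng.uniform(size=eta.shape)))  # ε ~ G(0,1)
    counts[np.argmax(Z @ (eta + eps))] += 1              # MAP of perturbed scores

p_pam = counts / n  # biased: the noise is shared across structures
```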
Summary: Gradient surrogates

• Based on the Straight-Through Estimator.
• Can be used for stochastic or deterministic computation graphs.
• Forward pass: get an argmax (might be structured).
• Backpropagation: use a function which we hope is close to argmax.
• Examples:
  • Argmax for iterative structures and factorization into parts
  • Sampling from iterative structures and factorization into parts

deep-spin.github.io/tutorial 72
Gradient surrogates: Pros and cons

Pros
• Do not suffer from the high-variance problem of REINFORCE.
• Allow the flexibility to select or sample a latent structure in the middle of the computation graph.
• Efficient computation.

Cons
• Gumbel sampling with Perturb-and-MAP is an approximation.
• Bias, due to the function mismatch in backpropagation (the next section will address this problem).

deep-spin.github.io/tutorial 73
Overview

E_{πθ(z|x)}[L(z)]: • REINFORCE • Straight-Through Gumbel (Perturb & MAP) • SparseMAP

L(argmax_z πθ(z | x)): • Straight-Through • SPIGOT

L(E_{πθ(z|x)}[z]): • Structured Attn. Nets • SparseMAP

And more, after the break!

deep-spin.github.io/tutorial 74
End-to-end differentiable relaxations

1. Digging into softmax
2. Alternatives to softmax
3. Generalizing to structured prediction
4. Stochasticity and global structures
Recall: Discrete choices & differentiability

[Figure: input x → s = fθ(x) → z (argmax) → output y = gφ(z, x), with choices z = 1, 2, …, N. The Jacobian ∂z/∂s is 0 or undefined: plotted against s1, the selection probability p1 is a step function jumping at s1 = s2.]

deep-spin.github.io/tutorial 76
One solution: smooth relaxations

[Figure: input x → s = fθ(x) → p (softmax) → output y = gφ(p, x); the Jacobian ∂p/∂s is now dense. Plotted against s1, p1 is a smooth sigmoid around s1 = s2.]

p = softmax(s) = E[z], i.e., replace E[f(z)] with f(E[z])

deep-spin.github.io/tutorial 76
What is softmax?

Often defined via pi = exp si / Σj exp sj, but where does it come from?

p ∈ △: probability distribution over choices.
Expected score under p: E_{i∼p} si = p⊤s; argmax maximizes the expected score.
Shannon entropy of p: H(p) = −Σi pi log pi.
softmax maximizes expected score + entropy: argmax_{p∈△} p⊤s + H(p).

[Figure: on the simplex with vertices (1,0,0), (0,1,0), (0,0,1), the linear objective p⊤s with s = [.7, .1, 1.5] is maximized at the vertex p⋆ = [0, 0, 1]; adding H(p) moves the maximizer into the interior.]

deep-spin.github.io/tutorial 78
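A quick numeric check of this claim, using the scores s = [.7, .1, 1.5] from the figure: softmax attains a larger value of p⊤s + H(p) than the argmax vertex, and its optimal value equals log Σj exp sj.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def objective(p, s):
    # expected score + entropy: p⊤s + H(p), with the convention 0 log 0 = 0
    plogp = np.where(p > 0, p * np.log(np.clip(p, 1e-300, None)), 0.0)
    return p @ s - plogp.sum()

s = np.array([0.7, 0.1, 1.5])       # scores from the slide
p_soft = softmax(s)
p_hard = np.array([0.0, 0.0, 1.0])  # the argmax vertex

# objective(p_soft, s) > objective(p_hard, s), and
# objective(p_soft, s) == log(sum(exp(s))) (the log-partition function).
```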
Variational form of softmax

Proposition. The unique solution to argmax_{p∈△} p⊤s + H(p) is given by pj = exp sj / Σi exp si.

Explicit form of the optimization problem:
maximize Σj (pj sj − pj log pj)
subject to p ≥ 0, p⊤1 = 1

Lagrangian:
L(p, ν, τ) = −Σj (pj sj − pj log pj) − p⊤ν + τ(p⊤1 − 1)

Optimality conditions (KKT):
0 = ∇_{pi} L(p, ν, τ) = −si + log pi + 1 − νi + τ
p⊤ν = 0,  p ∈ △,  ν ≥ 0

Hence log pi = si + νi − (τ + 1); if pi = 0, the r.h.s. must be −∞, thus pi > 0, so νi = 0.
Then pi = exp(si)/exp(τ + 1) = exp(si)/Z. Must find Z such that Σj pj = 1. Answer: Z = Σj exp(sj).

So, pi = exp(si) / Σj exp(sj).

Classic result, e.g., [Boyd and Vandenberghe, 2004, Wainwright and Jordan, 2008]

deep-spin.github.io/tutorial 79
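The proposition can be sanity-checked numerically: no randomly drawn point of the simplex attains a higher objective than softmax(s), whose objective value is exactly log Σj exp sj.

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.normal(size=5)

def objective(p, s):
    p = np.clip(p, 1e-12, None)       # guard the log against tiny components
    return p @ s - (p * np.log(p)).sum()   # p⊤s + H(p)

p_star = np.exp(s) / np.exp(s).sum()  # the claimed maximizer (softmax)

# Sample 2000 points of the simplex; none should beat softmax.
trials = rng.dirichlet(np.ones(5), size=2000)
best_random = max(objective(p, s) for p in trials)
```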
Generalizing softmax: Smoothed argmaxes

pΩ(s) = argmax_{p∈△} p⊤s − Ω(p)   [Niculae and Blondel, 2017]

argmax: Ω(p) = 0
softmax: Ω(p) = Σj pj log pj
sparsemax: Ω(p) = 1/2 ∥p∥₂²   [Martins and Astudillo, 2016]
α-entmax: Ω(p) = 1/(α(α−1)) Σj pjᵅ   [Blondel et al., 2019]
  The generalized entropy interpolates in between [Tsallis, 1988].
  Used in Sparse Seq2Seq: [Peters et al., 2019] (Mon 13:50, poster session 2D)
fusedmax: Ω(p) = 1/2 ∥p∥₂² + Σj |pj − pj−1|   [Niculae and Blondel, 2017]
csparsemax: Ω(p) = 1/2 ∥p∥₂² + ι(a ≤ p ≤ b)   [Malaviya et al., 2018]
csoftmax: Ω(p) = Σj pj log pj + ι(a ≤ p ≤ b)   [Martins and Kreutzer, 2017]

[Plot: p1 as a function of s1: a step for argmax, a smooth sigmoid for softmax, a piecewise-linear ramp for sparsemax; example outputs [0, 0, 1], [.3, .2, .5], [.3, 0, .7].]

deep-spin.github.io/tutorial 80
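Sparsemax has a simple closed-form solution via a sort and a threshold; a minimal NumPy sketch of the simplex-projection algorithm of Martins and Astudillo [2016]:

```python
import numpy as np

def sparsemax(s):
    # Euclidean projection of s onto the probability simplex.
    z = np.sort(s)[::-1]                 # scores in decreasing order
    k = np.arange(1, len(s) + 1)
    cssv = np.cumsum(z)
    support = 1 + k * z > cssv           # coordinates kept in the support
    k_max = k[support][-1]
    tau = (cssv[k_max - 1] - 1) / k_max  # threshold
    return np.maximum(s - tau, 0.0)

p = sparsemax(np.array([1.5, 0.1, 0.7]))
# → [0.9, 0.0, 0.1]: sums to 1 and contains an exact zero, unlike softmax.
```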
The structured case: Marginal polytope
[Wainwright and Jordan, 2008]

• Each vertex corresponds to one such bit vector z.
• Points inside correspond to marginal distributions: convex combinations of structured objects,

μ = p1 z1 + … + pN zN (exponentially many terms), p ∈ △.

Example:
p1 = 0.2, z1 = [1,0,0,0,1,0,0,0,1]
p2 = 0.7, z2 = [0,0,1,0,0,1,1,0,0]
p3 = 0.1, z3 = [1,0,0,0,1,0,0,1,0]
⇒ μ = [.3, 0, .7, 0, .3, .7, .7, .1, .2].

[Figure: the marginal polytope M.]

deep-spin.github.io/tutorial 81
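The example above in code: μ is just the probability-weighted combination of the structures' bit vectors.

```python
import numpy as np

p = np.array([0.2, 0.7, 0.1])                # distribution over 3 structures
Z = np.array([[1, 0, 0, 0, 1, 0, 0, 0, 1],   # z1
              [0, 0, 1, 0, 0, 1, 1, 0, 0],   # z2
              [1, 0, 0, 0, 1, 0, 0, 1, 0]],  # z3
             dtype=float)
mu = p @ Z   # a point inside the marginal polytope
# → [0.3, 0.0, 0.7, 0.0, 0.3, 0.7, 0.7, 0.1, 0.2]
```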
[Niculae et al., 2018a]

argmax:    argmax_{p∈△} p⊤s
softmax:   argmax_{p∈△} p⊤s + H(p)
sparsemax: argmax_{p∈△} p⊤s − 1/2∥p∥²

MAP:       argmax_{μ∈M} μ⊤η
marginals: argmax_{μ∈M} μ⊤η + H̃(μ)
SparseMAP: argmax_{μ∈M} μ⊤η − 1/2∥μ∥²

Just like softmax relaxes argmax, marginals relax MAP differentiably!
Unlike argmax/softmax, computation is not obvious!

deep-spin.github.io/tutorial 82
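On a problem small enough to enumerate, MAP and marginals can be computed directly (the structures and scores below are made up; SparseMAP would additionally require a quadratic solver over M, not shown):

```python
import numpy as np

# Hypothetical tiny problem: 2 positions, 2 tags, unary parts only;
# the 4 structures are the tag sequences, encoded as part indicators.
Z = np.array([[1, 0, 1, 0],   # tags (1, 1)
              [1, 0, 0, 1],   # tags (1, 2)
              [0, 1, 1, 0],   # tags (2, 1)
              [0, 1, 0, 1]],  # tags (2, 2)
             dtype=float)
eta = np.array([0.3, -0.2, 0.1, 0.5])   # part scores

scores = Z @ eta                 # η⊤z for every structure
z_map = Z[np.argmax(scores)]     # MAP: a vertex of M

p = np.exp(scores - scores.max())
p /= p.sum()                     # Gibbs distribution over structures
mu = p @ Z                       # marginals: a relaxed, interior point of M
```

The marginal point μ has a lower linear score than the MAP vertex, and each position's part marginals sum to 1.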
Algorithms for specific structures

                      Best structure (MAP)                                           Marginals
Sequence tagging      Viterbi [Rabiner, 1989]                                        Forward-Backward [Rabiner, 1989]
Constituent trees     CKY [Kasami, 1966, Younger, 1967, Cocke and Schwartz, 1970]    Inside-Outside [Baker, 1979]
Temporal alignments   DTW [Sakoe and Chiba, 1978]                                    Soft-DTW [Cuturi and Blondel, 2017]
Dependency trees      Max. Spanning Arborescence [Chu and Liu, 1965, Edmonds, 1967]  Matrix-Tree [Kirchhoff, 1847]
Assignments           Kuhn-Munkres [Kuhn, 1955, Jonker and Volgenant, 1987]          #P-complete [Valiant, 1979, Taskar, 2004]

(The first three rows are dynamic programming.)

deep-spin.github.io/tutorial 83
Derivatives of marginals 1: DP

Dynamic programming: marginals by Forward-Backward, Inside-Outside, etc.

• Alg. consists of differentiable ops: PyTorch autograd can handle it! (v. bad idea)
• Better book-keeping: Li and Eisner [2009], Mensch and Blondel [2018]
• With circular dependencies, this breaks! Can get an approximation [Stoyanov et al., 2011]

Marginals in a sequence tagging model.
1  input: d tags, n tokens, η_U ∈ ℝ^{n×d}, η_V ∈ ℝ^{d×d}
2  initialize α₁ = 0, β_n = 0
3  for i ∈ 2, …, n do   # forward log-probabilities
4      α_{i,k} = log Σ_{k′} exp(α_{i−1,k′} + (η_U)_{i,k} + (η_V)_{k′,k})   for all k
5  for i ∈ n − 1, …, 1 do   # backward log-probabilities
6      β_{i,k} = log Σ_{k′} exp(β_{i+1,k′} + (η_U)_{i+1,k′} + (η_V)_{k,k′})   for all k
7  Z = Σ_k exp α_{n,k}   # partition function
8  return μ = exp(α + β − log Z)   # marginals

[Figure: tag lattice with unary scores (η_U)_{i,k} for tags NN/VB/PR at each token, and transition scores η_V between adjacent tokens.]

deep-spin.github.io/tutorial 84
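The recursion above can be sketched in a few lines of pure Python. This is a hedged illustration, not the tutorial's code: here the unary score is folded into α starting at position 1 (one common convention), and the result is checked against brute-force enumeration:

```python
import math

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def sequence_marginals(eta_u, eta_v):
    """Unary marginals mu[i][k] = P(tag_i = k) for a linear-chain model.

    eta_u: n x d unary scores; eta_v: d x d transition scores.
    """
    n, d = len(eta_u), len(eta_u[0])
    # forward: alpha[i][k] = log-sum of prefix scores ending with tag k at i
    alpha = [[0.0] * d for _ in range(n)]
    alpha[0] = list(eta_u[0])
    for i in range(1, n):
        for k in range(d):
            alpha[i][k] = eta_u[i][k] + logsumexp(
                [alpha[i - 1][kp] + eta_v[kp][k] for kp in range(d)])
    # backward: beta[i][k] = log-sum of suffix scores after tag k at i
    beta = [[0.0] * d for _ in range(n)]
    for i in range(n - 2, -1, -1):
        for k in range(d):
            beta[i][k] = logsumexp(
                [beta[i + 1][kp] + eta_u[i + 1][kp] + eta_v[k][kp]
                 for kp in range(d)])
    log_z = logsumexp(alpha[-1])  # log partition function
    mu = [[math.exp(alpha[i][k] + beta[i][k] - log_z) for k in range(d)]
          for i in range(n)]
    return mu, log_z
```

Because every operation is differentiable, autograd can in principle back-propagate through this loop — which is exactly the slide's "it works, but is a bad idea" point.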
Derivatives of marginals 2: Matrix-Tree

L(s): Laplacian of the arc score graph

Z = det L(s);   μ computed from the entries of L(s)⁻¹

∇μ via the derivative of the inverse:  ∂L⁻¹/∂η = −L⁻¹ (∂L/∂η) L⁻¹

deep-spin.github.io/tutorial 85
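A tiny numeric check of the Matrix-Tree identity Z = det L(s) (a sketch with made-up weights; for a two-word sentence the determinant is a hand-computable 2×2):

```python
# Matrix-Tree check on a 2-word sentence: words {1, 2}, root 0.
# w[(h, m)] = exp(arc score) for the arc h -> m.
w = {(0, 1): 1.5, (0, 2): 0.8, (1, 2): 2.0, (2, 1): 0.6}

# Laplacian restricted to non-root nodes (row/col i <-> word i + 1):
# diagonal = total weight into the word, off-diagonal = -w(h -> m).
L = [[w[(0, 1)] + w[(2, 1)], -w[(1, 2)]],
     [-w[(2, 1)], w[(0, 2)] + w[(1, 2)]]]

det_L = L[0][0] * L[1][1] - L[0][1] * L[1][0]

# Brute force: the three arborescences rooted at 0.
Z = (w[(0, 1)] * w[(1, 2)]      # 0->1, 1->2
     + w[(0, 2)] * w[(2, 1)]    # 0->2, 2->1
     + w[(0, 1)] * w[(0, 2)])   # 0->1, 0->2
assert abs(det_L - Z) < 1e-9
```

With the determinant as the partition function, marginals and their gradients follow from L(s)⁻¹, as on the slide.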
Structured Attention Networks [Kim et al., 2017]

input x → η → μ → output y

Sequence case:
• η(i): score of word i receiving attention
• η(i, i + 1): score of consecutive words receiving attention
• μ(i): probability of word i getting attention

Tree case (dog on wheels):
• η(dog → on): arc score (tree constraints)
• μ(dog → on): probability of arc

CRF marginals (from forward–backward) give attention weights ∈ (0, 1).
Similar idea for projective dependency trees with inside–outside,
and non-projective with the Matrix-Tree theorem [Liu and Lapata, 2018].

deep-spin.github.io/tutorial 86
Differentiable Perturb & Parse [Corro and Titov, 2019a,b]

Extending Gumbel-Softmax to structured stochastic models

• Forward pass: sample a structure z (approximately)
  z = argmax_{z ∈ Z} (η + ε)⊤z
• Backward pass: pretend we did marginal inference
  μ = argmax_{μ ∈ M} (η + ε)⊤μ + H̃(μ)
  (or some similar relaxation)

deep-spin.github.io/tutorial 87
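The forward pass can be sketched over an enumerable structure set (a hedged toy: real structured cases replace the `max` with a MAP algorithm such as Chu-Liu/Edmonds, and perturbing per-part scores only approximates exact sampling, which would need independent noise per structure):

```python
import math
import random

def sample_gumbel(rng):
    # inverse-CDF sampling of standard Gumbel noise
    return -math.log(-math.log(rng.random()))

def perturb_and_map(structures, eta, rng):
    """Approximate sampling: perturb part scores with Gumbel noise, then MAP.

    structures: list of 0/1 indicator vectors z; eta: part scores.
    """
    eps = [sample_gumbel(rng) for _ in eta]

    def score(z):
        return sum((eta[j] + eps[j]) * z[j] for j in range(len(eta)))

    return max(structures, key=score)

rng = random.Random(0)
structures = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]
eta = [2.0, -1.0, 0.5]
samples = [perturb_and_map(structures, eta, rng) for _ in range(100)]
```

High-scoring structures are sampled more often, but every structure has some probability — the behaviour Gumbel-Softmax generalizes.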
Back-propagating through marginals

Pros:
• Familiar algorithms for NLPers,
• (Structured Attention Networks:) All computations exact.

Cons:
• (Structured Attention Networks:) forward pass marginals are dense;
  (fixed by Perturb & MAP, at cost of rough approximation)
• Efficient & numerically stable back-propagation through DPs is tricky;
  (somewhat alleviated by Mensch and Blondel [2018])
• Not applicable when marginals are unavailable.
• Case-by-case algorithms required, can get tedious.

deep-spin.github.io/tutorial 88
[Niculae et al., 2018a]

argmax      argmax_{p ∈ △}  p⊤s
softmax     argmax_{p ∈ △}  p⊤s + H(p)
sparsemax   argmax_{p ∈ △}  p⊤s − 1/2 ∥p∥²

MAP         argmax_{μ ∈ M}  μ⊤η
marginals   argmax_{μ ∈ M}  μ⊤η + H̃(μ)
SparseMAP   argmax_{μ ∈ M}  μ⊤η − 1/2 ∥μ∥²

Just like softmax relaxes argmax, marginals relax MAP differentiably!
Unlike argmax/softmax, computation is not obvious!

deep-spin.github.io/tutorial 89
SparseMAP solution [Niculae et al., 2018a]

μ⋆ = argmax_{μ ∈ M}  μ⊤η − 1/2 ∥μ∥²

μ⋆ = .6 × (parse 1) + .4 × (parse 2)   [parse trees shown on the slide]

(μ⋆ is unique, but may have multiple decompositions p. Active Set recovers a sparse one.)

deep-spin.github.io/tutorial 90
Algorithms for SparseMAP

μ⋆ = argmax_{μ ∈ M}  μ⊤η − 1/2 ∥μ∥²
(quadratic objective, linear constraints — alas, exponentially many!)

Conditional Gradient [Frank and Wolfe, 1956, Lacoste-Julien and Jaggi, 2015]
• select a new corner of M:  argmax_{μ ∈ M} μ⊤(η − μ^(t−1)), i.e. a MAP call with scores η̃ = η − μ^(t−1)
• update the (sparse) coefficients of p
• Update rules: vanilla, away-step, pairwise
• Quadratic objective: Active Set, a.k.a. Min-Norm Point [Wolfe, 1976]
  [Martins et al., 2015, Nocedal and Wright, 1999, Vinyes and Obozinski, 2017]
  Active Set achieves finite & linear convergence!

Backward pass
• ∂μ/∂η is sparse
• computing (∂μ/∂η)⊤ dy takes O(dim(μ) nnz(p⋆))

Completely modular: just add MAP

deep-spin.github.io/tutorial 91
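The "just add MAP" modularity can be shown with vanilla conditional gradient (a hedged sketch: the slide recommends the Active Set variant, and here the MAP oracle is plain enumeration over a made-up structure set):

```python
def sparsemap_fw(structures, eta, iters=200):
    """Vanilla Frank-Wolfe for max_{mu in M} mu^T eta - 1/2 ||mu||^2.

    The only structure-specific piece is the MAP oracle (here: enumeration).
    Returns mu and the sparse distribution p over visited corners.
    """
    d = len(eta)
    p = {0: 1.0}                  # start at an arbitrary corner
    mu = list(structures[0])
    for t in range(1, iters + 1):
        # linear oracle on the gradient eta - mu: a MAP call
        grad = [eta[j] - mu[j] for j in range(d)]
        s = max(range(len(structures)),
                key=lambda i: sum(grad[j] * structures[i][j]
                                  for j in range(d)))
        gamma = 2.0 / (t + 2.0)   # standard open-loop step size
        p = {i: (1 - gamma) * q for i, q in p.items()}
        p[s] = p.get(s, 0.0) + gamma
        mu = [(1 - gamma) * mu[j] + gamma * structures[s][j]
              for j in range(d)]
    return mu, p

structures = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]
eta = [1.0, 0.9, 0.1]
mu, p = sparsemap_fw(structures, eta)
```

Note that p stays supported on the few corners the oracle ever returned — the sparsity SparseMAP exploits in the backward pass.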
[Figure: attention maps between "a gentleman overlooking a neighborhood situation ." and "a police officer watches a situation closely ." — dense attention [Chen et al., 2017] vs. sparse latent alignments [Niculae et al., 2018a]]
Overview

E_{πθ(z|x)} L(z)
• REINFORCE
• Straight-Through Gumbel (Perturb & MAP)
• SparseMAP

L(argmax_z πθ(z | x))
• Straight-Through
• SPIGOT

L(E_{πθ(z|x)}[z])
• Structured Attn. Nets
• SparseMAP

And more, after the break!

deep-spin.github.io/tutorial 94
Structured latent variables without sampling

E_z L(z) = Σ_{z ∈ Z} L(y_φ(z)) π_θ(z | x)

(sum over all possible trees — exponentially large!
L: e.g., a TreeLSTM defined by z;  π_θ: parsing model, using some scorer f_θ(z; x))

How to define π_θ?
idea 1: π_θ(z) ∝ exp f_θ(z)                       — softmax
idea 2: π_θ(z) = 1 if z = MAP(f_θ(·)) else 0      — argmax
idea 3: SparseMAP

All methods we've seen require sampling; hard in general.
STE / SPIGOT relax z in the backward pass.

deep-spin.github.io/tutorial 95
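Ideas 1 and 2 are easy to contrast on an enumerable toy (a hedged sketch with made-up scores and losses; in real structured cases Z is exponentially large, which is exactly the slide's point):

```python
import math

def softmax(scores):
    m = max(scores)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# toy: 3 enumerable trees, scores f_theta(z), downstream losses L(y_phi(z), y)
scores = [2.0, 1.0, -1.0]
losses = [0.3, 0.9, 0.5]

# idea 1: pi proportional to exp f -> dense, every structure contributes
pi_soft = softmax(scores)
exp_loss_soft = sum(p * l for p, l in zip(pi_soft, losses))

# idea 2: pi is one-hot at the MAP structure -> only the argmax contributes
best = max(range(len(scores)), key=scores.__getitem__)
exp_loss_map = losses[best]
```

Under idea 1 the expected loss mixes all structures; under idea 2 gradients see only one, which is why STE/SPIGOT-style relaxations are needed in the backward pass.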
Structured latent variables without sampling

• • • = .7 × • • • + .3 × • • • + 0 × • • • + …

E[L(z)] = .7 × L( • • • ) + .3 × L( • • • )

(recall our shorthand L(z) = L(y_φ(z), y))

deep-spin.github.io/tutorial 96
[Corro and Titov, 2019b]

Stanford Sentiment (Accuracy)
Bigram Naive Bayes [Socher et al.]             83.1
TreeLSTM w/ CoreNLP [Niculae et al., 2018b]    83.2
TreeLSTM w/ SparseMAP                          84.7
GCN w/ CoreNLP [Corro and Titov, 2019b]        83.8
GCN w/ Perturb-and-MAP                         84.6

Stanford Natural Language Inference (Accuracy)
Simple Attention [Kim et al., 2017]            86.2
Structured Attention                           86.8
100D SAN [Liu and Lapata, 2018]                86.8
100D RL-SPINN [Yogatama et al.]                80.5
100D ST Gumbel-Tree [Choi et al., 2018]        82.6
300D                                           85.6
600D                                           86.0
Latent Tree + 1 GCN [Corro and Titov, 2019b]   85.2
Latent Tree + 2 GCN                            86.2

deep-spin.github.io/tutorial 97
Is it syntax?!

• Unlike e.g. unsupervised parsing, the structures we learn are guided by a downstream objective (typically discriminative).
• They don't typically resemble grammatical structure (yet) [Williams et al., 2018]
  (future work: more inductive biases and constraints?)
• Common to compare latent structures with parser outputs.
  But is this always a meaningful comparison?

deep-spin.github.io/tutorial 98
Syntax vs. Composition Order [Niculae et al., 2018b]

p = 22.6%
⋆ lovely and poignant .

CoreNLP parse, p = 21.4%
⋆ lovely and poignant .

· · ·

deep-spin.github.io/tutorial 99
Syntax vs. Composition Order [Niculae et al., 2018b]

p = 15.33%
⋆ a deep and meaningful film .

p = 15.27%
⋆ a deep and meaningful film .

· · ·

CoreNLP parse, p = 0%
⋆ a deep and meaningful film .

[Latent dependency trees shown on the slides, arcs weighted 1.0]

deep-spin.github.io/tutorial 100
Overview

E_{π_θ(z|x)} L(z)
• REINFORCE (SPL)
• Straight-Through Gumbel (Perturb & MAP) (SPL, MRG)
• SparseMAP (MAP+)

L(argmax_z π_θ(z | x))
• Straight-Through (MAP, MRG)
• SPIGOT (MAP+)

L(E_{π_θ(z|x)}[z])
• Structured Attn. Nets (MRG)
• SparseMAP (MAP+)

Computation:
SPL: Sampling. (Simple in incremental/unstructured settings, hard for most global structures.)
MAP: Finding the highest-scoring structure.
MRG: Marginal inference.

And more, after the break!
deep-spin.github.io/tutorial 101
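As a concrete illustration of the sampling (SPL) route, here is a numpy sketch of the score-function (REINFORCE) estimator for a categorical latent over a toy, enumerable structure space; the logits and losses are invented for illustration. It checks the Monte Carlo gradient estimate against the exact gradient of E_π[L(z)] with respect to the logits θ:

```python
import numpy as np

def softmax(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

# Toy setup: 4 candidate structures with fixed losses (illustrative values).
theta = np.array([0.5, -0.2, 0.1, 0.0])
losses = np.array([1.0, 0.2, 0.7, 1.5])

pi = softmax(theta)
# Exact gradient (only possible because we can enumerate the structures):
# d/dtheta_j E[L] = pi_j * (L_j - E[L])
exact_grad = pi * (losses - pi @ losses)

# REINFORCE: grad E[L] = E[ L(z) * grad log pi(z) ], estimated by sampling.
rng = np.random.default_rng(0)
n = 200_000
samples = rng.choice(4, size=n, p=pi)
onehot = np.eye(4)[samples]
# grad log pi(z) at sample z: onehot(z) - pi
mc_grad = (losses[samples][:, None] * (onehot - pi)).mean(axis=0)

print(np.abs(mc_grad - exact_grad).max())  # close to zero with many samples
```

The high variance visible even at 200k samples is the usual motivation for the baselines and relaxations listed above.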
Conclusions

• Latent structure models are desirable for interpretability, structural bias, and higher predictive power with fewer parameters.
• Stochastic latent variables can be dealt with via RL or straight-through gradients.
• Deterministic argmax requires surrogate gradients (e.g. SPIGOT).
• Continuous relaxations of argmax include SANs and SparseMAP.
• Intuitively, some of these different methods are trying to do similar things or require the same building blocks (e.g. SPIGOT and SparseMAP).
• ... we didn't even get into deep generative models! These tools apply, but there are new challenges. [Corro and Titov, 2019a, Kim et al., 2019a,b, Kawakami et al., 2019]
deep-spin.github.io/tutorial 102
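A minimal numpy sketch of the straight-through idea from the conclusions (manual gradient bookkeeping, since no autograd framework is assumed): the forward pass emits a hard one-hot argmax, while the backward pass uses the Jacobian of a softmax surrogate instead of the true (zero almost everywhere) gradient:

```python
import numpy as np

def softmax(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def straight_through(theta, upstream_grad):
    """Forward: hard one-hot argmax(theta). Backward: softmax Jacobian."""
    z = np.zeros_like(theta)
    z[np.argmax(theta)] = 1.0            # discrete forward pass
    p = softmax(theta)
    jac = np.diag(p) - np.outer(p, p)    # Jacobian of the relaxed surrogate
    return z, jac @ upstream_grad        # surrogate gradient wrt theta

theta = np.array([2.0, -1.0, 0.5])
z, grad = straight_through(theta, np.array([1.0, 0.0, 0.0]))
print(z)           # [1. 0. 0.] -- hard decision
print(grad.sum())  # ~0: softmax Jacobian columns sum to zero
```

The mismatch between the hard forward pass and the relaxed backward pass is exactly the bias that SPIGOT and SparseMAP-style methods try to control.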
References I
Ryan Adams. The Gumbel-max trick for discrete distributions, 2013. URL https://lips.cs.princeton.edu/the-gumbel-max-trick-for-discrete-distributions/. Blog post.
James K Baker. Trainable grammars for speech recognition. The Journal of the Acoustical Society of America, 65(S1):S132–S132, 1979.
Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
Mathieu Blondel, André FT Martins, and Vlad Niculae. Learning classifiers with Fenchel-Young losses: Generalized entropies, margins, and algorithms. In Proc. of AISTATS, 2019.
Samuel R. Bowman, Jon Gauthier, Abhinav Rastogi, Raghav Gupta, Christopher D. Manning, and Christopher Potts. A fast unified model for parsing and sentence understanding. In Proc. of ACL, 2016. doi: 10.18653/v1/P16-1139.
Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
Peter F Brown, Vincent J Della Pietra, Stephen A Della Pietra, and Robert L Mercer. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311, 1993.
Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. Enhanced LSTM for natural language inference. In Proc. of ACL, 2017.
Jihun Choi, Kang Min Yoo, and Sang-goo Lee. Learning to compose task-specific tree structures. In Proc. of AAAI, 2018.
deep-spin.github.io/tutorial 103
References II
Yoeng-Jin Chu and Tseng-Hong Liu. On the shortest arborescence of a directed graph. Science Sinica, 14:1396–1400, 1965.
William John Cocke and Jacob T Schwartz. Programming languages and their compilers. Courant Institute of Mathematical Sciences, 1970.
Shay B Cohen, Karl Stratos, Michael Collins, Dean P Foster, and Lyle Ungar. Spectral learning of latent-variable PCFGs. In Proc. of ACL, 2012.
Caio Corro and Ivan Titov. Differentiable Perturb-and-Parse: Semi-supervised parsing with a structured variational autoencoder. In Proc. of ICLR, 2019a.
Caio Corro and Ivan Titov. Learning latent trees with stochastic perturbations and differentiable dynamic programming. In Proc. of ACL, 2019b.
Marco Cuturi and Mathieu Blondel. Soft-DTW: a differentiable loss function for time-series. In Proc. of ICML, 2017.
Jack Edmonds. Optimum branchings. J. Res. Nat. Bur. Stand., 71B:233–240, 1967.
Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Nav. Res. Log., 3(1-2):95–110, 1956.
Serhii Havrylov, Germán Kruszewski, and Armand Joulin. Cooperative learning of disjoint syntax and semantics. In Proc. of NAACL-HLT, 2019.
Geoffrey Hinton. Neural networks for machine learning. In Coursera video lectures, 2012.
Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-softmax. In Proc. of ICLR, 2017.
deep-spin.github.io/tutorial 104
References III
Roy Jonker and Anton Volgenant. A shortest augmenting path algorithm for dense and sparse linear assignment problems. Computing, 38(4):325–340, 1987.
Tadao Kasami. An efficient recognition and syntax-analysis algorithm for context-free languages. Coordinated Science Laboratory Report no. R-257, 1966.
Kazuya Kawakami, Chris Dyer, and Phil Blunsom. Learning to discover, ground and use words with segmental neural language models. In Proc. of ACL, 2019.
Yoon Kim, Carl Denton, Loung Hoang, and Alexander M Rush. Structured attention networks. In Proc. of ICLR, 2017.
Yoon Kim, Chris Dyer, and Alexander Rush. Compound probabilistic context-free grammars for grammar induction. In Proc. of ACL, 2019a.
Yoon Kim, Alexander Rush, Lei Yu, Adhiguna Kuncoro, Chris Dyer, and Gábor Melis. Unsupervised recurrent neural network grammars. In Proc. of NAACL-HLT, 2019b.
Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. 2014.
Gustav Kirchhoff. Ueber die Auflösung der Gleichungen, auf welche man bei der Untersuchung der linearen Vertheilung galvanischer Ströme geführt wird. Annalen der Physik, 148(12):497–508, 1847.
Harold W Kuhn. The Hungarian method for the assignment problem. Nav. Res. Log., 2(1-2):83–97, 1955.
Simon Lacoste-Julien and Martin Jaggi. On the global linear convergence of Frank-Wolfe optimization variants. In Proc. of NeurIPS, 2015.
deep-spin.github.io/tutorial 105
References IV
Zhifei Li and Jason Eisner. First- and second-order expectation semirings with applications to minimum-risk training on translation forests. In Proc. of EMNLP, 2009.
Yang Liu and Mirella Lapata. Learning structured text representations. TACL, 6:63–75, 2018.
Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The Concrete distribution: A continuous relaxation of discrete random variables. In Proc. of ICLR, 2016.
Jean Maillard and Stephen Clark. Latent tree learning with differentiable parsers: Shift-Reduce parsing and chart parsing. arXiv preprint arXiv:1806.00840, 2018.
Chaitanya Malaviya, Pedro Ferreira, and André FT Martins. Sparse and constrained attention for neural machine translation. In Proc. of ACL, 2018.
André FT Martins and Ramón Fernandez Astudillo. From softmax to sparsemax: A sparse model of attention and multi-label classification. In Proc. of ICML, 2016.
André FT Martins and Julia Kreutzer. Learning what's easy: Fully differentiable neural easy-first taggers. In Proc. of EMNLP, 2017.
André FT Martins and Vlad Niculae. Notes on latent structure models and SPIGOT. preprint arXiv:1907.10348, 2019.
André FT Martins, Mário AT Figueiredo, Pedro MQ Aguiar, Noah A Smith, and Eric P Xing. AD3: Alternating directions dual decomposition for MAP inference in graphical models. JMLR, 16(1):495–545, 2015.
Arthur Mensch and Mathieu Blondel. Differentiable dynamic programming for structured prediction and attention. In Proc. of ICML, 2018.
deep-spin.github.io/tutorial 106
References V
Nikita Nangia and Samuel Bowman. ListOps: A diagnostic dataset for latent tree learning. In Proc. of NAACL SRW, 2018.
Vlad Niculae and Mathieu Blondel. A regularized framework for sparse and structured neural attention. In Proc. of NeurIPS, 2017.
Vlad Niculae, André FT Martins, Mathieu Blondel, and Claire Cardie. SparseMAP: Differentiable sparse structured inference. In Proc. of ICML, 2018a.
Vlad Niculae, André FT Martins, and Claire Cardie. Towards dynamic computation graphs via sparse latent structure. In Proc. of EMNLP, 2018b.
Jorge Nocedal and Stephen Wright. Numerical Optimization. Springer New York, 1999.
George Papandreou and Alan L Yuille. Perturb-and-MAP random fields: Using discrete optimization to learn and sample from energy models. In Proc. of ICCV, 2011.
Hao Peng, Sam Thomson, and Noah A Smith. Backpropagating through structured argmax using a SPIGOT. In Proc. of ACL, 2018.
Ben Peters, Vlad Niculae, and André FT Martins. Sparse sequence-to-sequence models. In Proc. of ACL, 2019.
Slav Petrov and Dan Klein. Discriminative log-linear grammars with latent variables. In Advances in Neural Information Processing Systems, pages 1153–1160, 2008.
Ariadna Quattoni, Sybor Wang, Louis-Philippe Morency, Michael Collins, and Trevor Darrell. Hidden conditional random fields. IEEE Transactions on Pattern Analysis & Machine Intelligence, 29(10):1848–1852, 2007.
deep-spin.github.io/tutorial 107
References VI
Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. P. IEEE, 77(2):257–286, 1989.
Hiroaki Sakoe and Seibi Chiba. Dynamic programming algorithm optimization for spoken word recognition. IEEE Trans. on Acoustics, Speech, and Sig. Proc., 26:43–49, 1978.
Veselin Stoyanov, Alexander Ropson, and Jason Eisner. Empirical risk minimization of graphical model parameters given approximate inference, decoding, and model structure. In Proc. of AISTATS, 2011.
Ben Taskar. Learning structured prediction models: A large margin approach. PhD thesis, Stanford University, 2004.
Constantino Tsallis. Possible generalization of Boltzmann-Gibbs statistics. Journal of Statistical Physics, 52:479–487, 1988.
Leslie G Valiant. The complexity of computing the permanent. Theor. Comput. Sci., 8(2):189–201, 1979.
Tim Vieira. Gumbel-max trick, 2014. URL https://timvieira.github.io/blog/post/2014/07/31/gumbel-max-trick/. Blog post.
Marina Vinyes and Guillaume Obozinski. Fast column generation for atomic norm regularization. In Proc. of AISTATS, 2017.
Martin J Wainwright and Michael I Jordan. Graphical models, exponential families, and variational inference. Volume 1. Now Publishers, Inc., 2008.
Adina Williams, Andrew Drozdov, and Samuel R Bowman. Do latent tree learning models identify meaningful structure in sentences? TACL, 2018.
deep-spin.github.io/tutorial 108
References VII
Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn., 8, 1992.
Philip Wolfe. Finding the nearest point in a polytope. Mathematical Programming, 11(1):128–149, 1976.
Dani Yogatama, Phil Blunsom, Chris Dyer, Edward Grefenstette, and Wang Ling. Learning to compose words into sentences with reinforcement learning. In Proc. of ICLR, 2017.
Daniel H Younger. Recognition and parsing of context-free languages in time n³. Information and Control, 10(2):189–208, 1967.
deep-spin.github.io/tutorial 109