Direct Optimization CSC2547 Adamo Young, Dami Choi, Sepehr Abbasi Zadeh

Direct Optimization · Kingma et al 2013. Standard (Gaussian) VAE Kingma et al 2013. Standard (Gaussian) VAE Kingma et al 2013. ... Combinatorial bandits Number of trajectories searched


Direct Optimization

CSC2547

Adamo Young, Dami Choi, Sepehr Abbasi Zadeh


Direct Optimization

● A way to obtain gradient estimates that directly optimizes a non-differentiable objective.

● It first appeared in structured prediction problems.


Structured Prediction

Whenever the goal state has inter-dependency

[Figures: image from Wikipedia; image from http://dbmsnotes-ritu.blogspot.com/]


Structured Prediction

Scoring function $f(x, y; w)$, discrete $y$

Inference: $y_w(x) = \arg\max_{\hat{y}} f(x, \hat{y}; w)$

Training: $\min_w \mathbb{E}_{(x, y)}[L(y, y_w(x))]$


Gradient Estimator

● Gradient descent on discrete $y$: $\nabla_w \mathbb{E}[L(y, y_w(x))]$

● Option 1: continuous relaxation

● Option 2: estimate $\nabla_w \mathbb{E}[L(y, y_w(x))]$


Loss Gradient Theorem (McAllester et al., 2010; Song et al., 2016)

Inference: $y_w(x) = \arg\max_{\hat{y}} f(x, \hat{y}; w)$

Loss-augmented inference: $y_{\mathrm{direct}}(x) = \arg\max_{\hat{y}} \{ f(x, \hat{y}; w) \pm \epsilon L(y, \hat{y}) \}$

$\nabla_w \mathbb{E}[L(y, y_w(x))] = \pm \lim_{\epsilon \to 0} \frac{1}{\epsilon} \mathbb{E}[\nabla_w f(x, y_{\mathrm{direct}}(x); w) - \nabla_w f(x, y_w(x); w)]$

“Away from worse” ($+\epsilon$), “Towards better” ($-\epsilon$)


Limitations

● Existence of $\epsilon$

○ Bias/variance trade-off

● Solving the argmax of loss-augmented inference


Applications

● Phoneme-to-speech alignment (McAllester et al. 2010)

● Maximizing average precision for ranking (Song et al. 2016)

● Discrete structured VAE (Lorberbom et al. 2018)

● RL with discrete action spaces (Lorberbom et al. 2019)


Direct Optimization through arg max for Discrete Variational Auto-Encoder

Guy Lorberbom, Andreea Gane, Tommi Jaakkola, Tamir Hazan


Probability Background

● Gumbel Distribution

● Various Sampling “Tricks”

○ Reparameterization

○ Gumbel-Max

○ Gumbel-Softmax


Gumbel Distribution

Intuitively: the distribution of the extreme value (maximum) of a number of normally distributed samples

[Figure: Gumbel density $p(x)$ against $x$, from https://en.wikipedia.org/wiki/Gumbel_distribution]
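As a small illustration of the distribution itself, the sketch below draws standard Gumbel samples by inverting the CDF $F(x) = \exp(-\exp(-x))$. The helper name `sample_gumbel` and the sample size are illustrative choices, not taken from the slides.

```python
import numpy as np

def sample_gumbel(shape, rng, loc=0.0):
    """Draw Gumbel(loc, 1) samples by inverting F(x) = exp(-exp(-(x - loc)))."""
    u = rng.uniform(low=1e-12, high=1.0, size=shape)  # lower bound avoids log(0)
    return loc - np.log(-np.log(u))

rng = np.random.default_rng(0)
samples = sample_gumbel(200_000, rng)

# The standard Gumbel has mean equal to the Euler-Mascheroni constant (about 0.5772).
print(samples.mean(), np.euler_gamma)
```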


Gradient Estimators for Stochastic Computation Graphs

Schulman et al 2016

Dot = parameter node

Rectangle = deterministic node

Circle = stochastic node

Line = functional dependency

Red Line = gradient propagation


Reparameterization Trick

Kingma et al 2015

[Figure panels: REINFORCE/REBAR/RELAX; Reparam]

Williams 1988; Tucker et al 2016; Grathwohl et al 2017
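A minimal sketch of the reparameterization trick for a Gaussian, assuming the toy objective $\mathbb{E}[z^2]$ (an illustrative choice, not from the slides). Writing $z = \mu + \sigma \epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$ lets the gradient pass through the sample.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.8   # hypothetical parameter values

# Reparameterize z ~ N(mu, sigma^2) as z = mu + sigma * eps with eps ~ N(0, 1).
eps = rng.standard_normal(200_000)
z = mu + sigma * eps

# Objective E[z^2]. Since dz/dmu = 1 and dz/dsigma = eps, the pathwise gradients are:
grad_mu = np.mean(2.0 * z)            # exact value: 2 * mu
grad_sigma = np.mean(2.0 * z * eps)   # exact value: 2 * sigma

print(grad_mu, grad_sigma)
```

The exact values follow from $\mathbb{E}[z^2] = \mu^2 + \sigma^2$.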


Gumbel-Max Trick

[Figure panels: REINFORCE/REBAR/RELAX; Direct Optimization]
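The Gumbel-Max trick states that adding independent standard Gumbel noise to log-probabilities and taking the arg max yields an exact categorical sample. The sketch below checks this empirically; the probabilities are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
probs = np.array([0.1, 0.2, 0.3, 0.4])   # hypothetical categorical distribution
logits = np.log(probs)

n = 200_000
gumbels = rng.gumbel(size=(n, probs.size))       # independent Gumbel(0, 1) noise
draws = np.argmax(logits + gumbels, axis=1)      # arg max of perturbed logits
freqs = np.bincount(draws, minlength=probs.size) / n

print(freqs)   # should be close to probs
```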


Gumbel-Softmax Trick

[Figure panels: REINFORCE/REBAR/RELAX; CONCRETE]

Jang et al 2017; Maddison et al 2017


Gumbel-Softmax Distribution

Jang et al 2017
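A Gumbel-Softmax sample replaces the arg max of the Gumbel-Max trick with a softmax at temperature tau. The sketch below is one way to write it; the function name and the example logits are assumptions for illustration.

```python
import numpy as np

def gumbel_softmax(logits, gumbel_noise, tau):
    """Relaxed one-hot sample: softmax((logits + noise) / tau)."""
    y = (logits + gumbel_noise) / tau
    y = y - y.max()              # subtract the max for numerical stability
    e = np.exp(y)
    return e / e.sum()

rng = np.random.default_rng(0)
logits = np.log(np.array([0.1, 0.2, 0.3, 0.4]))   # hypothetical logits
noise = rng.gumbel(size=logits.shape)

soft = gumbel_softmax(logits, noise, tau=1.0)     # smooth, far from one-hot
sharp = gumbel_softmax(logits, noise, tau=0.1)    # closer to the Gumbel-Max sample
print(soft, sharp)
```

With the same noise, lowering tau moves the sample toward the one-hot vector that Gumbel-Max would return, at the cost of higher-variance gradients.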


Why discrete latent variables?

● Stronger inductive bias

● Interpretability

● Allow structural relations in encoder


Standard (Gaussian) VAE

Kingma et al 2013


Naive Categorical VAE

We can apply standard gradient estimators (REINFORCE/REBAR/RELAX)
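As an illustration of a standard estimator on a categorical variable, the sketch below applies REINFORCE (the score-function estimator) to the logits of a softmax and compares it with the exact gradient. The logits and the objective values `f` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.array([0.5, 0.0, -0.5, 1.0])   # hypothetical encoder logits
f = np.array([1.0, -0.5, 0.3, 0.0])        # hypothetical objective value per category

probs = np.exp(logits - logits.max())
probs /= probs.sum()

n = 200_000
z = rng.choice(probs.size, size=n, p=probs)
one_hot = np.eye(probs.size)[z]

# Score function for softmax logits: d/dlogits log p(z) = one_hot(z) - probs.
score = one_hot - probs
reinforce_grad = (f[z][:, None] * score).mean(axis=0)

# Exact gradient of E[f(z)] with respect to the logits, for comparison.
exact_grad = probs * (f - probs @ f)
print(reinforce_grad, exact_grad)
```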


Gumbel-Max VAE


Gumbel-Max VAE + Direct Optimization

Algorithm:

1) Sample $\gamma$ from Gumbel

2) Compute the arg max $z^{\mathrm{opt}}$ and the loss-augmented arg max $z^{\mathrm{direct}}$

3) Estimate gradient
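The three steps can be sketched for a single unstructured categorical variable as follows. This is a toy check rather than the authors' implementation: the scores `h`, the objective values `f`, and the value of `eps` are all hypothetical, and the gradient is taken with respect to the scores themselves.

```python
import numpy as np

rng = np.random.default_rng(0)
h = np.array([0.5, 0.0, -0.5, 1.0])   # hypothetical encoder scores h(x, z)
f = np.array([1.0, -0.5, 0.3, 0.0])   # hypothetical decoder objective f(x, z)
eps = 0.1                             # perturbation size (bias/variance knob)
n = 200_000

gamma = rng.gumbel(size=(n, h.size))               # 1) sample Gumbel noise
z_opt = np.argmax(h + gamma, axis=1)               # 2) arg max ...
z_direct = np.argmax(h + eps * f + gamma, axis=1)  #    ... and loss-augmented arg max

# 3) The gradient of the score h(x, z) w.r.t. the score vector is one_hot(z), so the
#    estimator is a difference of indicator vectors scaled by 1 / eps.
eye = np.eye(h.size)
direct_grad = (eye[z_direct] - eye[z_opt]).mean(axis=0) / eps

# Exact gradient of E[f(z_opt)] w.r.t. h, where z_opt ~ softmax(h).
p = np.exp(h - h.max())
p /= p.sum()
exact_grad = p * (f - p @ f)
print(direct_grad, exact_grad)
```

The same Gumbel noise is used for both arg max computations. A smaller `eps` reduces the bias but makes the two arg max values differ less often, which raises the variance.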


Structured Encoder

No structure: $h_\phi(x, z) = \sum_i h_i(x, z_i; \phi)$

Pairwise relationships: $h_\phi(x, z) = \sum_i h_i(x, z_i; \phi) + \sum_{i > j} h_{i,j}(x, z_i, z_j; \phi)$

Solve argmax with QIP (CPLEX) / Max Flow

Not practical with Gumbel-Softmax: exponential number of terms to sum over in the denominator


Structured Encoder may help


Gradient Bias-Variance Tradeoff

Direct Gumbel-Max VAE (with associated epsilon)

Gumbel-Softmax VAE (with associated tau)


Direct Gumbel-Max VAE trains faster

K = 10


VAE Comparison

Standard (Gaussian)

+ Unbiased, low variance gradients
- Continuous latent variables
- Limited structural relations

Gumbel-Softmax

+ Discrete latent variables
- Biased gradients
- Limited structural relations
- Extra parameter (tau)

Naive Categorical + standard gradient estimator

+ Discrete latent variables
+ Unbiased gradients
- Limited structural relations

Gumbel-Max + Direct

+ Discrete latent variables
+ Allows structural relations
- Biased gradients
- Extra parameter (epsilon)
- Optimization subproblem to get gradients


Direct Policy Gradients: Direct Optimization of Policies in Discrete Action Spaces

Guy Lorberbom, Chris J. Maddison, Nicolas Heess, Tamir Hazan, Daniel Tarlow


Reinforcement Learning

[Diagram: the Agent sends an action to the Environment, which returns a reward and a state]

Goal: Maximize cumulative reward


Policy Gradient Method

Goal: $\max_\theta \mathbb{E}_{a \sim \pi_\theta}[r(a)]$, where $a$ is a trajectory of actions and $r(a)$ its cumulative reward

Want: $\nabla_\theta \mathbb{E}_{a \sim \pi_\theta}[r(a)]$

REINFORCE: $\mathbb{E}_{a \sim \pi_\theta}[r(a) \nabla_\theta \log \pi_\theta(a)]$

Direct Policy Gradient: $\lim_{\epsilon \to 0} \frac{1}{\epsilon} \mathbb{E}[\nabla_\theta \log \pi_\theta(a_{\mathrm{direct}}) - \nabla_\theta \log \pi_\theta(a_{\mathrm{opt}})]$


State Reward Tree

Tree of all possible trajectories (fix the seed of the environment)

Separate environment stochasticity and policy stochasticity

Given: $\pi_\theta(a_t \mid s_t)$

Can sample trajectories: $a = (a_1, \dots, a_T)$


Reparameterize the Policy

Instead of sampling per-timestep, we sample per-trajectory.

Given action sequences $a = (a_1, \dots, a_T)$, define: $\pi_\theta(a) = \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t)$


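Gumbel-max reparameterization

Now that we have a distribution $\pi_\theta(a)$ over whole trajectories:

Let $\Gamma(a) \sim \mathrm{Gumbel}(\log \pi_\theta(a))$ for each trajectory $a$, and $a_{\mathrm{opt}} = \arg\max_a \Gamma(a)$, so that $a_{\mathrm{opt}} \sim \pi_\theta$.

Let $a_{\mathrm{direct}} = \arg\max_a \{ \Gamma(a) + \epsilon r(a) \}$.

Then under this reparameterization,

$\nabla_\theta \mathbb{E}_{a \sim \pi_\theta}[r(a)] = \lim_{\epsilon \to 0} \frac{1}{\epsilon} \mathbb{E}_\Gamma[\nabla_\theta \log \pi_\theta(a_{\mathrm{direct}}) - \nabla_\theta \log \pi_\theta(a_{\mathrm{opt}})]$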


Structured Prediction vs. RL

[Table comparing Structured Prediction and RL row by row: discrete configurations, scoring function, loss, inference, loss-augmented inference]


Direct Policy Gradient (DirPG)


Algorithm

For every training step:

1. Sample $a_{\mathrm{opt}}$

2. Find $a_{\mathrm{direct}}$ ⇐ Problem: how to obtain this?

3. Compute gradients


Solution: A* sampling (Maddison et al., 2014)

Use heuristic search to find a trajectory with direct objective better than that of $a_{\mathrm{opt}}$


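Complete Algorithm

For every training step:

1. Sample $a_{\mathrm{opt}}$ and compute its direct objective

2. While budget not exceeded:

a. Obtain a candidate trajectory from heuristic search

b. End search if the candidate's direct objective is better than that of $a_{\mathrm{opt}}$

3. Compute gradients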
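The loop above can be sketched on a tiny problem where every trajectory is enumerated and the policy is a softmax over trajectories. This is a toy stand-in under stated assumptions, not the method from the paper: the rewards, `eps`, `lr`, and `budget` are hypothetical, and visiting candidates in order of decreasing Gumbel value replaces the real heuristic search.

```python
import numpy as np

rng = np.random.default_rng(0)
rewards = np.array([0.0, 0.2, 0.5, 1.0])   # hypothetical reward r(a) of each trajectory
theta = np.zeros(rewards.size)             # logits of a softmax policy over trajectories
eps, lr, budget = 1.0, 0.1, 3              # hypothetical hyperparameters

def policy(theta):
    p = np.exp(theta - theta.max())
    return p / p.sum()

initial_return = policy(theta) @ rewards

for step in range(500):
    # 1. Sample a_opt with the Gumbel-max trick and compute its direct objective.
    gumbel = np.log(policy(theta)) + rng.gumbel(size=theta.size)
    a_opt = int(np.argmax(gumbel))
    target = gumbel[a_opt] + eps * rewards[a_opt]

    # 2. Budgeted search (stand-in heuristic: decreasing Gumbel value).
    a_direct = a_opt
    candidates = [a for a in np.argsort(-gumbel) if a != a_opt]
    for a in candidates[:budget]:
        if gumbel[a] + eps * rewards[a] > target:   # end search on first improvement
            a_direct = int(a)
            break

    # 3. For softmax logits, grad log pi(a) = one_hot(a) - pi, so the difference of
    #    the two log-probability gradients is a difference of one-hot vectors.
    grad = np.zeros_like(theta)
    grad[a_direct] += 1.0 / eps
    grad[a_opt] -= 1.0 / eps
    theta += lr * grad                      # ascent step on expected reward

final_return = policy(theta) @ rewards
print(initial_return, final_return)
```

When the search finds no improvement within the budget, `a_direct` equals `a_opt` and the update is zero.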


Limitations

● Must be able to reset environment to previously visited states.

● Termination on first improvement


Combinatorial bandits

The number of trajectories searched to find $a_{\mathrm{direct}}$ increases as training progresses for combinatorial bandits.


MiniGrid

Comparisons between different heuristics for DirPG and REINFORCE on MiniGrid.

Evidence of “pulling up” on MiniGrid.


Related Work

● Gradient Estimators

○ REINFORCE (Williams 1988)

○ REBAR (Tucker et al 2017)

○ RELAX (Grathwohl et al 2018)

○ Gumbel-Softmax (Jang et al 2017, Maddison et al 2017)

● Discrete Deep Generative Models

○ VQ-VAE (Oord et al 2017)

○ Discrete VAE (Rolfe 2017)

○ Gumbel-Sinkhorn (Mena et al 2018)

● Reinforcement Learning


Top-Down sampling using A* Sampling


Non-starters

● Compute the direct objective for all possible trajectories

● Roll out many trajectories and select the best


Gumbel Process

We know:

Therefore:
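One standard property that Gumbel-process constructions rely on is max-stability: the maximum of independent Gumbel variables with locations $\log p_i$ is itself Gumbel with location $\log \sum_i p_i$. Whether this is exactly what the slide stated is an assumption; the sketch below only checks the property numerically on made-up probabilities and a made-up subset.

```python
import numpy as np

rng = np.random.default_rng(0)
probs = np.array([0.1, 0.2, 0.3, 0.4])   # hypothetical outcome probabilities
subset = [1, 3]                          # hypothetical subset of outcomes

n = 200_000
gumbels = np.log(probs) + rng.gumbel(size=(n, probs.size))
subset_max = gumbels[:, subset].max(axis=1)

# Max-stability predicts subset_max ~ Gumbel(log(0.2 + 0.4)), whose mean is
# log(0.6) plus the Euler-Mascheroni constant.
predicted_mean = np.log(probs[subset].sum()) + np.euler_gamma
print(subset_max.mean(), predicted_mean)
```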


Gumbel Process

[Figure with regions labeled A and B]


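Trajectory Generation

● Lazily create partitions of trajectories.

● Recursion rule:

○ For the sampled action, copy parent node’s value.

○ For the remaining choices of actions, group them and compute truncated value.

[Figure: search tree with node values 1.3 at the root, 1.3 and 1.1 at the next level, then 1.3 and 0.19]

● Repeat until terminating state found.

● Yield the trajectory and its value

Recall, Goal: find a trajectory with direct objective better than that of $a_{\mathrm{opt}}$

How to prioritize which partition to expand?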
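One step of the recursion rule can be sketched as below, assuming the node values are Gumbel maxima and that the grouped remainder receives a Gumbel truncated at the parent's value, as in A* sampling. The function names and probabilities are hypothetical, and the node is assumed to have at least two children.

```python
import numpy as np

def truncated_gumbel(loc, bound, rng):
    """Sample Gumbel(loc) conditioned on being at most `bound`."""
    g = loc + rng.gumbel()
    return -np.logaddexp(-g, -bound)   # equals -log(exp(-g) + exp(-bound))

def split_node(value, logp, child_logps, rng):
    """Split a node: one child inherits the value, the rest are grouped."""
    child_probs = np.exp(child_logps - logp)   # conditional probabilities given the node
    child_probs /= child_probs.sum()
    special = int(rng.choice(child_logps.size, p=child_probs))
    rest = np.delete(child_logps, special)
    rest_logp = np.logaddexp.reduce(rest)      # log of the grouped remaining mass
    rest_value = truncated_gumbel(rest_logp, value, rng)
    return special, value, rest_value

rng = np.random.default_rng(0)
child_logps = np.log(np.array([0.1, 0.2, 0.3, 0.4]))   # hypothetical action probabilities
root_value = rng.gumbel()            # the root covers all trajectories: Gumbel(log 1)
special, special_value, rest_value = split_node(root_value, 0.0, child_logps, rng)
print(special, special_value, rest_value)
```

Repeating the split on the special child until a terminal state is reached yields one trajectory together with its value, while the grouped nodes remain available for later expansion.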


Search for a large direct objective using A* Sampling

● Lower bound of accumulated reward (L)

● Upper bound of reward-to-go (U)

● In practice: