
Forward and Reverse Gradient-Based Hyperparameter Optimization

Luca Franceschi 1,2, Michele Donini 1, Paolo Frasconi 3, Massimiliano Pontil 1,2

1 Istituto Italiano di Tecnologia, Genoa, Italy
2 Dept. of Computer Science, University College London, UK
3 Dept. of Information Engineering, Università degli Studi di Firenze, Italy

ICML 2017 / Presenter: Ji Gao


Outline

1 Motivation

2 Method: Overview and Optimization

3 Complexity analysis

4 Experiments


Motivation

Choosing the hyperparameters of an optimization procedure is hard.

Could we automatically select hyperparameters?

Hyperparameter optimization: construct a response function of the hyperparameters and explore the hyperparameter space to seek an optimum.


Related Work

Grid search: lay the hyperparameter values out on a grid and train a model for each grid point. Problem: impractical when the number of hyperparameters is large; it is even outperformed by random search.

Bayesian optimization: treat the response function as a random function and place a prior over it. Then construct an acquisition function (also referred to as an infill sampling criterion) that determines the next query point.

Gradient-based methods: optimize the hyperparameters directly by gradient descent on the response function.


Hyperparameters

$s$: the state in $\mathbb{R}^d$, comprising the model weights (the objective's variables) and any auxiliary optimizer variables; $\lambda$: the hyperparameters. Training is modeled as a dynamical system:

$$s_t = \Phi_t(s_{t-1}, \lambda)$$

An example fitting this definition: gradient descent with momentum.

$w$: weights; $J_t$: objective function; $\lambda$: hyperparameters. The state is $s_t = (v_t, w_t)$:

$$v_t = \mu v_{t-1} + \nabla J_t(w_{t-1})$$
$$w_t = w_{t-1} - \eta \left( \mu v_{t-1} + \nabla J_t(w_{t-1}) \right)$$

In this case $\lambda = (\mu, \eta)$.
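
As a concrete illustration, here is a minimal JAX sketch of this dynamics as a state-update map (our own code, not the authors'; `loss` is a hypothetical placeholder for $J_t$):

```python
import jax.numpy as jnp
from jax import grad

def loss(w):
    """Placeholder objective J_t; any differentiable loss works here."""
    return jnp.sum(w ** 2)

def phi(state, lam):
    """One step of heavy-ball momentum: s_t = Phi(s_{t-1}, lambda)."""
    v, w = state
    mu, eta = lam
    g = grad(loss)(w)
    v_new = mu * v + g         # v_t = mu * v_{t-1} + grad J_t(w_{t-1})
    w_new = w - eta * v_new    # w_t = w_{t-1} - eta * v_t
    return (v_new, w_new)
```

Note that $\lambda = (\mu, \eta)$ enters $\Phi$ explicitly, which is what lets gradients with respect to the hyperparameters flow through every training step.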


Problem formulation

Goal of hyperparameter optimization

Solve

$$\min_{\lambda} f(\lambda)$$

where the response function $f : \mathbb{R}^m \to \mathbb{R}$ is defined at $\lambda \in \mathbb{R}^m$ as

$$f(\lambda) = E(s_T(\lambda))$$

with $E$ the validation error.


Problem formulation - Optimization

Goal of hyperparameter optimization

Solve

$$\min_{\lambda, s_1, \dots, s_T} E(s_T) \quad \text{subject to } s_t = \Phi_t(s_{t-1}, \lambda)$$

Lagrangian:

$$\mathcal{L}(s, \lambda, \alpha) = E(s_T) + \sum_{t=1}^{T} \alpha_t \left( \Phi_t(s_{t-1}, \lambda) - s_t \right)$$


Problem formulation - Optimization

Lagrangian:

$$\mathcal{L}(s, \lambda, \alpha) = E(s_T) + \sum_{t=1}^{T} \alpha_t \left( \Phi_t(s_{t-1}, \lambda) - s_t \right)$$

Derivatives of the Lagrangian:

$$\frac{\partial \mathcal{L}}{\partial \alpha_t} = \Phi_t(s_{t-1}, \lambda) - s_t, \quad t = 1, \dots, T$$

$$\frac{\partial \mathcal{L}}{\partial s_t} = \alpha_{t+1} \frac{\partial \Phi_{t+1}(s_t, \lambda)}{\partial s_t} - \alpha_t, \quad t = 1, \dots, T-1$$

$$\frac{\partial \mathcal{L}}{\partial s_T} = \nabla E(s_T) - \alpha_T$$

$$\frac{\partial \mathcal{L}}{\partial \lambda} = \sum_{t=1}^{T} \alpha_t \frac{\partial \Phi_t(s_{t-1}, \lambda)}{\partial \lambda}$$


Problem formulation - Optimization

The solution is obtained by setting each derivative to zero.

Solution:

Let $A_t = \frac{\partial \Phi_t(s_{t-1}, \lambda)}{\partial s_{t-1}}$ and $B_t = \frac{\partial \Phi_t(s_{t-1}, \lambda)}{\partial \lambda}$.

Then the solution is

$$\alpha_t = \nabla E(s_T)\, A_{t+1} \cdots A_T$$

and we obtain the hypergradient

$$\frac{\partial \mathcal{L}}{\partial \lambda} = \nabla E(s_T) \sum_{t=1}^{T} (A_{t+1} \cdots A_T)\, B_t \tag{1}$$


Algorithm 1: Reverse-HG

Computes the hypergradient (1) by running the training dynamics forward, storing the trajectory $s_0, \dots, s_T$, and then propagating $\alpha_t$ backward.
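
A minimal JAX sketch of this reverse-mode computation, under simplifying assumptions (a time-invariant update map `phi`, a scalar validation error `E`; all names are ours, not the paper's):

```python
import jax
import jax.numpy as jnp

def reverse_hg(phi, E, s0, lam, T):
    """Reverse-mode hypergradient, cf. Eq. (1): sum_t alpha_t B_t."""
    # Forward pass: store the whole trajectory (this is the O(T) memory cost).
    states = [s0]
    for _ in range(T):
        states.append(phi(states[-1], lam))

    alpha = jax.grad(E)(states[-1])                    # alpha_T = grad E(s_T)
    hypergrad = jax.tree_util.tree_map(jnp.zeros_like, lam)
    # Backward pass: alpha_{t-1} = alpha_t A_t; accumulate alpha_t B_t.
    for t in range(T, 0, -1):
        _, vjp_fun = jax.vjp(phi, states[t - 1], lam)
        d_state, d_lam = vjp_fun(alpha)                # (alpha_t A_t, alpha_t B_t)
        hypergrad = jax.tree_util.tree_map(jnp.add, hypergrad, d_lam)
        alpha = d_state
    return hypergrad
```

Note that the forward pass stores every state $s_t$; this is exactly the memory cost analyzed later.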


Another way to calculate the hypergradient

We have

$$\nabla f(\lambda) = \nabla E(s_T)\, \frac{d s_T}{d \lambda}$$

Let $Z_t = \frac{d s_t}{d \lambda}$. Then

$$Z_t = A_t Z_{t-1} + B_t$$

which leads to a recursive solution that runs forward in time.
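
A minimal JAX sketch of this forward recursion, under the same assumptions as before (time-invariant `phi`, vector-valued `lam` with $m$ components; one tangent per hyperparameter, so this is meant for small $m$):

```python
import jax
import jax.numpy as jnp

def forward_hg(phi, E, s0, lam, T):
    """Forward-mode hypergradient via Z_t = A_t Z_{t-1} + B_t."""
    m = lam.shape[0]
    s = s0
    # Column j of Z_t, stored as one tangent-state per hyperparameter.
    Z = [jax.tree_util.tree_map(jnp.zeros_like, s0) for _ in range(m)]
    for _ in range(T):
        Z_next = []
        for j in range(m):
            e_j = jnp.zeros(m).at[j].set(1.0)
            # One jvp gives A_t (Z_{t-1} e_j) + B_t e_j, i.e. column j of Z_t.
            # (Recomputing s_next per j is wasteful but keeps the sketch short.)
            s_next, z_j = jax.jvp(phi, (s, lam), (Z[j], e_j))
            Z_next.append(z_j)
        s, Z = s_next, Z_next
    gE = jax.grad(E)(s)                                # grad E(s_T)
    flat = lambda tree: jnp.concatenate(
        [jnp.ravel(x) for x in jax.tree_util.tree_leaves(tree)])
    # Hypergradient component j: <grad E(s_T), Z_T e_j>.
    return jnp.array([flat(gE) @ flat(Z[j]) for j in range(m)])
```

Each step needs $m$ Jacobian-vector products but no stored trajectory, matching the complexity analysis below.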


Algorithm 2: Forward-HG

The hypergradient can be updated in real time, while the underlying training run proceeds.


Background: Algorithmic (Automatic) Differentiation

Algorithmic Differentiation: techniques to numerically evaluate the derivative of a function.

Two modes of AD: Forward mode and Reverse mode.
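
As a concrete illustration (not from the paper), JAX exposes exactly these two modes: `jax.jvp` computes Jacobian-vector products (forward mode) and `jax.vjp` computes vector-Jacobian products (reverse mode):

```python
import jax
import jax.numpy as jnp

# f : R^2 -> R^2, a toy function to differentiate.
f = lambda x: jnp.array([x[0] * x[1], jnp.sin(x[0])])
x = jnp.array([1.0, 2.0])

# Forward mode: one call evaluates a Jacobian-vector product J_f(x) r.
r = jnp.array([1.0, 0.0])
y, Jr = jax.jvp(f, (x,), (r,))

# Reverse mode: one call evaluates a vector-Jacobian product q^T J_f(x).
q = jnp.array([0.0, 1.0])
y, vjp_fun = jax.vjp(f, x)
(qJ,) = vjp_fun(q)
```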


Complexity of Algorithmic Differentiation

Complexity of calculating the Jacobian matrix (the matrix of all first-order partial derivatives): suppose $f : \mathbb{R}^n \to \mathbb{R}^p$ can be evaluated in time $c(n, p)$ and space $s(n, p)$. Then:

For any vector $r \in \mathbb{R}^n$, the Jacobian-vector product $J_f\, r$ can be evaluated in time $O(c(n, p))$ and space $O(s(n, p))$ using forward-mode AD.

For any vector $q \in \mathbb{R}^p$, the vector-Jacobian product $q^\top J_f$ can be evaluated in time $O(c(n, p))$ and space $O(s(n, p))$ using reverse-mode AD.

The full Jacobian can be computed in time $O(n\, c(n, p))$ using forward mode, and $O(p\, c(n, p))$ using reverse mode.


Complexity

Suppose $s_t = \Phi_t(s_{t-1}, \lambda)$ can be updated in time $g(d, m)$ and space $h(d, m)$.

For Algorithm 1: each step computes the products $\alpha_{t+1} A_{t+1}$ and $\alpha_t B_t$, each costing $O(g(d, m))$ time, so the total is $O(T\, g(d, m))$ time. For space, all states $s_t$ must be stored, which requires $O(T\, h(d, m))$ space.


Complexity - Algorithm 2

For Algorithm 2:

Each step computes $A_t Z_{t-1}$, which requires $m$ Jacobian-vector products and hence $O(m\, g(d, m))$ time, so the total is $O(T m\, g(d, m))$ time. For space, we only need to store the current $s_t$ and $Z_t$, which requires $O(m\, h(d, m))$ space, independent of $T$.


Experiment 1 - Data Hyper-cleaning

Task: we have a large dataset with corrupted labels and can only afford to clean a subset; train a model anyway.

Method: assign every training example a weight (a hyperparameter in $[0, 1]$) and train with the weighted loss; the weights are optimized against the error on the cleaned validation set.

Train a plain softmax regression model with weights $W$ and bias $b$.

Optimization problem:

$$\min_{\lambda} E_{\text{val}}(W_T, b_T) \quad \text{subject to } \lambda \in [0, 1]^{N_{tr}},\ \|\lambda\|_1 \le R$$

Experiment design: 5000 examples from the MNIST dataset as training data, with the labels of 2500 of them corrupted; 5000 further examples as validation data, and 10000 as the test set.
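
A small sketch of the weighted training loss (our illustration; variable names are not from the paper's code):

```python
import jax.numpy as jnp
from jax.scipy.special import logsumexp

def weighted_loss(W, b, lam, X, y):
    """Softmax regression loss with one hyperparameter lam_i in [0, 1] per example."""
    logits = X @ W + b
    log_probs = logits - logsumexp(logits, axis=1, keepdims=True)
    nll = -log_probs[jnp.arange(X.shape[0]), y]   # per-example cross-entropy
    # Driving lam_i -> 0 effectively discards a (corrupted) example.
    return jnp.sum(lam * nll)
```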


Experiment 1 result

Oracle: train with the 2500 correctly labeled samples together with the validation set.
Baseline: train with the corrupted data and the validation set.
DH-R: optimize the example weights to find a cleaned subset $D_c$ for a given value of $R$, and finally train with $D_c$ and the validation set.


Experiment 1 result

The method successfully discards the corrupted samples.


Experiment 2 - Multi-task learning

Task: learn models for several related tasks simultaneously, for example in few-shot learning.

Experiment design: both CIFAR-10 and CIFAR-100 are used, with 50 training samples for CIFAR-10 and 300 for CIFAR-100, validation sets of the same size, and all remaining examples for testing. A pretrained Inception-V3 model is used to extract features.

Use the regularizer from [Evgeniou et al., 2005]:

$$\Omega_{A,\rho}(W) = \sum_{j,k=1}^{K} A_{j,k} \|w_j - w_k\|_2^2 + \rho \sum_{k=1}^{K} \|w_k\|^2$$

Training error:

$$E_{tr}(W) = \sum_i \ell(W x_i + b,\, y_i) + \Omega_{A,\rho}(W)$$

Optimization problem:

$$\min_{\lambda} E_{\text{val}}(W_T, b_T) \quad \text{subject to } \rho \ge 0,\ A \ge 0$$

(here the hyperparameters are $\lambda = (A, \rho)$).
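
A brief sketch of this regularizer (our code; `W` stacks one column $w_k$ per task):

```python
import jax.numpy as jnp

def omega(W, A, rho):
    """Multi-task regularizer Omega_{A,rho}(W), W of shape (features, K)."""
    # Pairwise coupling: A[j, k] * ||w_j - w_k||^2 over all task pairs.
    diffs = W[:, :, None] - W[:, None, :]              # (features, K, K)
    pairwise = jnp.sum(A * jnp.sum(diffs ** 2, axis=0))
    # Ridge term: rho * sum_k ||w_k||^2 = rho * ||W||_F^2.
    return pairwise + rho * jnp.sum(W ** 2)
```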


Experiment 2 result

The HMTL-S algorithm finds the following task-relationship graph:


Experiment 3 - Phone classification

Task: Phone state classification over 183 classes.

Experiment design: data from the TIMIT phonetic recognition dataset; model: an existing multi-task learning framework [Badino, 2016].

Hyperparameters: the learning rate $\eta$, the momentum $\mu$, and $\rho$, a hyperparameter of the algorithm.


Experiment 3 result

