
  • NUS School of Computing Summer School

    Gaussian Process Methods in Machine Learning

    Jonathan Scarlett, [email protected]

    Lecture 3: Advanced Bayesian Optimization Methods

    August 2018

  • License Information

    These slides are an edited version of those for EE-620 Advanced Topics in Machine Learning at EPFL (taught by Prof. Volkan Cevher, LIONS group), with the following license information:

    - This work is released under a Creative Commons License with the following terms:
    - Attribution
      - The licensor permits others to copy, distribute, display, and perform the work. In return, licensees must give the original authors credit.
    - Non-Commercial
      - The licensor permits others to copy, distribute, display, and perform the work. In return, licensees may not use the work for commercial purposes – unless they get the licensor's permission.
    - Share Alike
      - The licensor permits others to distribute derivative works only under a license identical to the one that governs the licensor's work.
    - Full Text of the License:
      http://creativecommons.org/licenses/by-nc-sa/1.0/
      http://creativecommons.org/licenses/by-nc-sa/1.0/legalcode

  • Outline of Lectures

    • Lecture 0: Bayesian Modeling and Regression

    • Lecture 1: Gaussian Processes, Kernels, and Regression

    • Lecture 2: Optimization with Gaussian Processes

    • Lecture 3: Advanced Bayesian Optimization Methods


  • Outline: This Lecture

    - This lecture:
      1. Practical twists on Bayesian optimization
      2. Level-set estimation
      3. One-step lookahead algorithms
      4. Truncated variance reduction

  • Recap 1: Black-Box Function Optimization

    Black-box function optimization:

    $x^\star \in \arg\max_{x \in D \subseteq \mathbb{R}^d} f(x)$

    • Setting:

      - Unknown "reward" function $f$
      - Expensive evaluations of $f$
      - Noisy evaluations
      - Choose $x_t$ based on the past observations $\{(x_{t'}, y_{t'})\}_{t' < t}$

  • Recap 2: Bayesian Optimization (BO) Template

    A general BO template [Shahriari et al., 2016]

    1: for t = 1, 2, . . . , T do
    2:   choose new x_{t+1} by optimizing an acquisition function α(·):
             x_{t+1} ∈ arg max_{x ∈ D} α(x; D_t)
    3:   query the objective function to obtain y_{t+1}
    4:   augment the data: D_{t+1} = {D_t, (x_{t+1}, y_{t+1})}
    5:   update the GP model
    6: end for
    7: make final recommendation x̂

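    To make the template concrete, here is a minimal sketch of the loop on a discretized 1-D domain, with a zero-mean GP surrogate and a UCB-style acquisition. The RBF kernel, the toy objective, the grid, and the final recommendation rule are illustrative assumptions rather than part of the template itself.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=0.1, variance=1.0):
    """Squared-exponential kernel k(x, x') on 1-D inputs."""
    return variance * np.exp(-0.5 * (A[:, None] - B[None, :])**2 / lengthscale**2)

def gp_posterior(X_obs, y_obs, X_test, noise_var=1e-3):
    """Posterior mean and (marginal) variance of a zero-mean GP at X_test."""
    K = rbf_kernel(X_obs, X_obs) + noise_var * np.eye(len(X_obs))
    K_star = rbf_kernel(X_test, X_obs)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_obs))
    mu = K_star @ alpha
    v = np.linalg.solve(L, K_star.T)
    var = rbf_kernel(X_test, X_test).diagonal() - np.sum(v**2, axis=0)
    return mu, np.maximum(var, 1e-12)

# Black-box objective (hidden from the algorithm) and a grid over D = [0, 1].
f = lambda x: np.sin(6 * x) + 0.5 * np.cos(14 * x)
D = np.linspace(0, 1, 200)
noise_var, beta = 1e-3, 4.0

rng = np.random.default_rng(0)
X = [rng.uniform(0, 1)]
y = [f(X[0]) + np.sqrt(noise_var) * rng.standard_normal()]

for t in range(30):
    mu, var = gp_posterior(np.array(X), np.array(y), D, noise_var)
    acq = mu + np.sqrt(beta * var)      # acquisition alpha(x; D_t): here, GP-UCB
    x_next = D[np.argmax(acq)]          # x_{t+1} in arg max_x alpha(x; D_t)
    y_next = f(x_next) + np.sqrt(noise_var) * rng.standard_normal()
    X.append(x_next); y.append(y_next)  # augment D_{t+1}; the GP is refit next round

x_hat = X[int(np.argmax(y))]            # one simple recommendation rule
print("recommended x:", x_hat)
```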

  • Twists

    Practical variations along the same theme:

    Pointwise costs: Choosing point x incurs a cost c(x) [Snoek et al., 2012]

      - Examples: Advertising costs, sensor power consumption


  • Twists

    Practical variations along the same theme:

    Heteroscedastic noise: Choosing point $x$ incurs noise variance $\sigma^2(x)$ [Goldberg et al., 1997]

      - Example: Different sensing quality


  • Twists

    Practical variations along the same theme:

    Multi-fidelity: Alternative evaluations $f_1, \ldots, f_K$ related to $f$ [Swersky et al., 2013]

      - Example: Varying data set sizes in automated machine learning


  • Another Twist: Level-Set Estimation

    Level-set estimation: Estimate the super- and sub-level sets [Gotovos et al., 2013]

    $H(f) := \{ x : f(x) > h \}, \qquad L(f) := \{ x : f(x) < h \}$   for some threshold $h$

      - Example: Find all hotspots in environmental monitoring

    (Figure: a function with the threshold $h$ marked, showing its super-level set and sub-level set.)


  • Accommodating the BO Twists: Lookahead Algorithms

    • Mostly heuristic BO approaches

      - Entropy search (ES): [Hennig et al., 2012]

          $x_t \approx \arg\min_{x_t \in D} \mathbb{E}_{y_t}\big[ H(x^\star \mid \{x_i, y_i\}_{i=1}^t) \big]$,   $H$: entropy function

      - Minimum regret search (MRS): [Metzen, 2016]

          $x_t \approx \arg\min_{x \in D} \mathbb{E}_{y_t}\big[ \mathbb{E}_{x^\star}\big[ \mathrm{regret} \mid \{x_i, y_i\}_{i=1}^t \big] \big]$

      - Multi-step lookahead: approximation of the ideal lookahead loss function [Osborne et al., 2009, Gonzalez et al., 2016]

    Advantages: Versatility with point-wise costs, non-uniform noise, and multi-fidelity scenarios; can improve on baseline algorithms even without these twists.

    Disadvantages: Expensive to compute; no theory; no LSE


  • Note:

    Lookahead algorithms tend to be more versatile with respect to interesting twists on the optimization problem.

    • Example:
      - Minimize entropy ⇐⇒ maximize reduction in entropy
      - Extension: Maximize reduction in entropy per unit cost

  • More on Entropy Search

    • Entropy search and its variants are particularly popular:

        $x_t \approx \arg\min_{x_t \in D} \mathbb{E}_{y_t}\big[ H(x^\star \mid \{x_i, y_i\}_{i=1}^t) \big]$,   $H$: entropy function

      - Interpretation: Choose the point that makes us least uncertain (i.e., minimizes entropy) about the optimizer $x^\star$

    • Difficulty: Cannot compute $\mathbb{E}_{y_t}\big[ H(x^\star \mid \{x_i, y_i\}_{i=1}^t) \big]$ exactly

      - Need to approximate, typically using Monte Carlo methods (see the sampling sketch after this slide)
      - Particularly difficult in higher dimensions, e.g., $x \in \mathbb{R}^d$ for $d > 10$

    • Alternative: Max-value entropy search:

        $x_t \approx \arg\min_{x_t \in D} \mathbb{E}_{y_t}\big[ H(f^\star \mid \{x_i, y_i\}_{i=1}^t) \big]$

      - Intuition: Low uncertainty in $f^\star = f(x^\star)$ should mean we have found $x^\star$
      - Now approximating the entropy is easier: only one-dimensional
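    The contrast between the two variants can be visualized by drawing joint posterior samples on a grid: each sample of $f$ yields one sample of the optimizer $x^\star$ (a $d$-dimensional quantity) and one sample of the maximum value $f^\star$ (a scalar whose distribution is easy to summarize). The snippet below is a rough, self-contained illustration of this sampling step only, with an assumed RBF kernel and toy data; a full (max-value) entropy search would additionally estimate the entropy reduction caused by each candidate query.

```python
import numpy as np

def rbf(A, B, ls=0.1):
    """Squared-exponential kernel on 1-D inputs."""
    return np.exp(-0.5 * (A[:, None] - B[None, :])**2 / ls**2)

def gp_post_full(X, y, Xs, noise=1e-3):
    """Posterior mean and full covariance of a zero-mean GP on the grid Xs."""
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks, Kss = rbf(Xs, X), rbf(Xs, Xs)
    mu = Ks @ np.linalg.solve(K, y)
    cov = Kss - Ks @ np.linalg.solve(K, Ks.T)
    return mu, cov

rng = np.random.default_rng(1)
f = lambda x: np.sin(6 * x)                      # toy objective
X = rng.uniform(0, 1, 5)                         # a few observed inputs
y = f(X) + 0.03 * rng.standard_normal(5)
grid = np.linspace(0, 1, 300)

mu, cov = gp_post_full(X, y, grid)
samples = rng.multivariate_normal(mu, cov + 1e-9 * np.eye(len(grid)), size=500)

x_star_samples = grid[np.argmax(samples, axis=1)]  # samples of the optimizer x*
f_star_samples = samples.max(axis=1)               # samples of the max value f*

# f* is one-dimensional, so its distribution (and entropy) is easy to estimate,
# e.g. from a histogram; the distribution of x* lives in R^d and is much harder.
hist, edges = np.histogram(f_star_samples, bins=30, density=True)
print("first few x* samples:", np.round(x_star_samples[:5], 3))
print("first few f* samples:", np.round(f_star_samples[:5], 3))
```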


  • Experimental Example

    • Performance plots from [Metzen, 2016] for robot control task:


  • Accommodating the Twists: Level-Set Estimation

    • Limited literature

      - Confidence-bound based LSE algorithm: [Gotovos et al., 2013]

          $x_t = \arg\max_{x \in M_{t-1}} \min\{ u_t(x) - h,\; h - \ell_t(x) \}$

        $u_t/\ell_t$: upper/lower confidence bounds; $M_t$: the set of unclassified points; $h$: the level-set threshold

      - Analogous to, but distinct from, the GP-UCB algorithm for BO
      - Intuition: Resolve the uncertainty of points whose confidence interval crosses $h$

      - Straddle heuristic: [Bryan et al., 2006]

          $x_t = \arg\max_{x \in D} \; 1.96\,\sigma_{t-1}(x) - |\mu_{t-1}(x) - h|$

    Advantages: Versatility in the sense of handling level-set estimation

    Disadvantages: No theory (Straddle); lacking in other versatility (costs, non-uniform noise, multi-fidelity)
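    A minimal sketch of the two rules on a grid, assuming posterior mean/standard-deviation arrays are already available (e.g., from the GP helper sketched after the BO template). The confidence-bound classification below follows the intuition above: a point is labeled only once its interval no longer crosses $h$; the parameter beta and the toy numbers are illustrative assumptions.

```python
import numpy as np

def straddle_acquisition(mu, sigma, h):
    """Straddle score 1.96*sigma(x) - |mu(x) - h| over a grid of points."""
    return 1.96 * sigma - np.abs(mu - h)

def classify_by_confidence(mu, sigma, h, beta=3.0):
    """Label points whose confidence interval lies entirely above/below h."""
    u = mu + np.sqrt(beta) * sigma   # upper confidence bound u_t(x)
    l = mu - np.sqrt(beta) * sigma   # lower confidence bound l_t(x)
    labels = np.full(mu.shape, "unclassified", dtype=object)
    labels[l > h] = "super-level"    # confidently above the threshold
    labels[u < h] = "sub-level"      # confidently below the threshold
    return labels                    # intervals crossing h remain unclassified

# Toy usage with a made-up posterior on five grid points and threshold h = 1:
mu = np.array([0.2, 1.4, 1.6, 0.9, -0.3])
sigma = np.array([0.5, 0.1, 0.4, 0.6, 0.2])
h = 1.0
print("next query index:", np.argmax(straddle_acquisition(mu, sigma, h)))
print(classify_by_confidence(mu, sigma, h))
```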


  • Accommodating the Twists with Guarantees: TruVaR

    Truncated Variance Reduction (TruVaR) algorithm [Bogunovic et al., 2016]:

      - Unified BO and LSE
      - Versatility to handle all of the above twists
      - Theoretical guarantees

  • TruVaR Intuition (for optimization):

    • Use confidence bounds to keep track of potential maximizers

    • Choose points that shrink their uncertainty


  • Modified Template for Choosing $x_t$ Based on $\{(x_{t'}, y_{t'})\}_{t'}$

  • TruVaR: Intuition

    (Figure: GP posterior over several iterations, with the confidence target, the selected point, the set of potential maximizers, and the maximum lower bound marked.)


  • TruVaR: Acquisition Function

    • Acquisition function based on variance reduction per cost

        $x_t = \arg\max_{x \in D} \; \frac{ \sum_{\bar{x} \in M_{t-1}} \max\{ \beta_{(i)} \sigma_{t-1}^2(\bar{x}), \eta_{(i)}^2 \} \;-\; \sum_{\bar{x} \in M_{t-1}} \max\{ \beta_{(i)} \sigma_{t-1|x}^2(\bar{x}), \eta_{(i)}^2 \} }{ c(x) }$

    $\sigma_{t-1|x}^2$: posterior variance given all points up to time $t-1$, and $x$; $\beta_{(i)}$: exploration parameter

    The set of potential maximizers $M_t$:

    • BO:

        $M_t = \big\{ x \in M_{t-1} : u_t(x) \ge \max_{\bar{x} \in M_{t-1}} \ell_t(\bar{x}) \big\}$

      $u_t(x)/\ell_t(x)$: upper/lower confidence bounds

    • LSE:

        $M_t = \big\{ x \in M_{t-1} : u_t(x) \ge h \text{ and } \ell_t(x) \le h \big\}$
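    Below is a rough sketch of one such selection step on a discrete domain, assuming the kernel matrix over a grid, the current set $M_{t-1}$, pointwise costs, and the parameters $\beta_{(i)}, \eta_{(i)}$ are given. The key point is that the posterior variance after hypothetically adding $x$ depends only on the location $x$ (via a rank-one update), not on the unobserved value $y$, so the score can be evaluated for every candidate before querying.

```python
import numpy as np

def truvar_select(K, obs_idx, M_idx, cand_idx, cost, beta, eta, noise=1e-3):
    """One TruVaR step: pick the candidate maximizing truncated variance
    reduction over the set M_{t-1}, per unit cost.

    K        -- kernel matrix over all grid points
    obs_idx  -- indices of points already queried
    M_idx    -- indices of the current potential maximizers M_{t-1}
    cand_idx -- indices of candidate next points
    cost     -- pointwise costs c(x) over the grid
    """
    # Posterior covariance of f on the grid given the (noisy) observations so far.
    if len(obs_idx) > 0:
        Ko = K[np.ix_(obs_idx, obs_idx)] + noise * np.eye(len(obs_idx))
        Kxo = K[:, obs_idx]
        Sigma = K - Kxo @ np.linalg.solve(Ko, Kxo.T)
    else:
        Sigma = K.copy()
    current = np.sum(np.maximum(beta * np.diag(Sigma)[M_idx], eta**2))

    best_score, best_x = -np.inf, None
    for j in cand_idx:
        # Rank-one update: variance at each point after also observing x_j
        # (this does not depend on the value y_j that would be observed).
        var_after = np.diag(Sigma) - Sigma[:, j]**2 / (Sigma[j, j] + noise)
        after = np.sum(np.maximum(beta * var_after[M_idx], eta**2))
        score = (current - after) / cost[j]   # truncated variance reduction per cost
        if score > best_score:
            best_score, best_x = score, j
    return best_x, best_score

# Toy usage: 1-D grid, RBF kernel, two points observed so far, unit costs.
grid = np.linspace(0, 1, 50)
K = np.exp(-0.5 * (grid[:, None] - grid[None, :])**2 / 0.1**2)
x_next, gain = truvar_select(K, obs_idx=np.array([5, 40]), M_idx=np.arange(50),
                             cand_idx=np.arange(50), cost=np.ones(50),
                             beta=4.0, eta=0.1)
print("selected grid index:", x_next)
```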

  • Numerical Evidence

    • Real and synthetic data

    • Acronyms

    LSE: Level-set estimation algorithm [Gotovos et al., 2013]
    STR: Straddle heuristic [Bryan et al., 2006]
    VAR: Maximum variance rule [Gotovos et al., 2013]
    EI: Expected improvement [Mockus et al., 1978]
    GP-UCB: Gaussian process upper confidence bound [Srinivas et al., 2012]


  • Numerical Evidence 1: Level-Set Estimation (I)

    • Lake Zurich chlorophyll concentration via an autonomous vehicle:

    (Figure: inferred chlorophyll concentration over the lake cross-section, length 0–1200 m and depth 0–20 m, with level-set threshold h = 1.5.)

    • Evaluate performance with the F1 score:

        $F_1 = \dfrac{\#\text{true positives}}{\#\text{true positives} + \tfrac{1}{2}\big(\#\text{false positives} + \#\text{false negatives}\big)} \;\in\; [0, 1]$

    where "positive" means above the level-set threshold $h$.
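    For reference, the score as a function of the raw classification counts (the same formula written as code; the numbers in the example are made up):

```python
def f1_score(tp, fp, fn):
    """F1 = TP / (TP + 0.5 * (FP + FN)); 'positive' = above the threshold h."""
    return tp / (tp + 0.5 * (fp + fn))

print(f1_score(tp=80, fp=10, fn=10))  # 0.888...; perfect classification gives 1.0
```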

  • Numerical Evidence 1: Level-Set Estimation (II)

    • Classification performance (unit-cost):

    (Figure: F1 score versus time for TruVaR, LSE, STR, and VAR.)

  • Numerical Evidence 1: Level-Set Estimation (III)

    • Cost function: (i) Penalizes distance traveled; (ii) Penalizes deeper measurements

    (Figure: measurement locations chosen by the LSE algorithm [Gotovos et al., 2013] (left) and by TruVaR (right) over the lake cross-section, length 0–1200 m and depth 0–20 m.)

  • Numerical Evidence 1: Level-Set Estimation (IV)

    • Classification performance (non-unit cost):

    (Figure: F1 score versus cumulative cost (×10^4) for TruVaR and LSE.)

  • Numerical Evidence 2: Level-Set Estimation (I)

    • Twist: Choice of the noise level

      - Noise levels $\{10^{-6}, 10^{-3}, 0.05\}$
      - Corresponding costs $\{15, 10, 2\}$

    • Synthetic simulation

      - Function drawn from a GP with a squared-exponential kernel
      - True kernel used in the algorithms


  • Numerical Evidence 2: Level-Set Estimation (II)

    • Synthetic function drawn from GP:


  • Numerical Evidence 2: Level-Set Estimation (III)

    • Oracle-level classification performance:

    (Figure: F1 score versus cumulative cost (×10^4) for TruVaR and for LSE run at each fixed noise level: high, medium, and small.)

  • Numerical Evidence 2: Level-Set Estimation (IV)

    • Cost incurred for each noise level:

    • TruVaR gradually switches between the different levels:

      high noise / low cost  =⇒  medium noise / medium cost  =⇒  low noise / high cost


  • Numerical Evidence 3: Bayesian Optimization

    • Hyper-parameter tuning: SVM on grid dataset [Snoek et al., 2012]

      - Tuning 3 hyperparameters for the SVM algorithm
      - GP kernel estimated online using maximum likelihood

    • Generalization error:

    (Figure: validation error versus time for TruVaR, EI, and GP-UCB.)

  • TruVaR – Batch Setting

    • In the batch setting, we choose $B > 1$ points at each time, evaluate them in parallel, and observe the $B$ observations [Azimi et al., 2010]

      - Example 1: Equipment allows running scientific experiments in parallel
      - Example 2: $f$ is a computation, and we have multiple computing cores

    • With $B = 1$, we can interpret TruVaR as greedily minimizing
        $\sum_{\bar{x} \in M_{t-1}} \max\{ \beta_{(i)} \sigma_{t-1|x}^2(\bar{x}), \eta_{(i)}^2 \}$
      with respect to $x$

    • A simple batch extension: in each round, run $B$ steps of the greedy algorithm minimizing this function (a rough sketch follows below)

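    Because posterior variances can be updated without knowing the observed values, the greedy batch rule can simply treat the points chosen so far in the batch as if they had already been queried. A rough sketch, reusing the truvar_select helper from the acquisition-function sketch earlier (so the same assumptions apply):

```python
import numpy as np

def truvar_select_batch(K, obs_idx, M_idx, cand_idx, cost, beta, eta, B, noise=1e-3):
    """Greedy batch of size B: each step conditions on the earlier picks in the
    batch (no y-values are needed, since only variances are updated)."""
    batch, pseudo_obs = [], list(obs_idx)
    for _ in range(B):
        x_next, _ = truvar_select(K, np.array(pseudo_obs, dtype=int), M_idx,
                                  cand_idx, cost, beta, eta, noise)
        batch.append(int(x_next))
        pseudo_obs.append(int(x_next))   # "fantasize" the point before the next pick
    return batch
```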

  • Epilogue: Theoretical Performance

    Definition: Numerical $\epsilon$-accuracy

      - (BO) The reported point after $T$ rounds satisfies $f(\hat{x}_T) \ge f(x^\star) - \epsilon$

      - (LSE) The classification after $T$ rounds is correct for all points at least $\epsilon/2$-far from $h$

  • Epilogue: Theoretical Performance (I)

    • Generalize the canonical notion of rounds T to costs C to shrink variance:

        $C^*(\xi, M) = \min_S \big\{ c(S) : \max_{x \in M} \sigma_S(x) \le \xi \big\}$,

    $\sigma_S^2$: posterior variance given the points in $S$

    Theorem. For a finite domain $D$, under a submodularity assumption, if TruVaR is run until the cumulative cost reaches

        $C_\epsilon = \sum_{i \,:\, 4\eta_{(i-1)} > \epsilon} C^*\Big( \frac{\eta_{(i)}}{\beta_{(i)}^{1/2}},\, M^{(i-1)} \Big) \log\frac{|M^{(i-1)}|\,\beta_{(i)}}{\eta_{(i)}^2}$,

    for suitable $\beta_{(i)}$, then with probability at least $1 - \delta$ we have $\epsilon$-accuracy. In the cumulative cost, the outer bounds $M^{(i)}$ on $M_t$ are defined as

        $M^{(i)} := \big\{ x : f(x) \ge f(x^\star) - 4\eta_{(i)} \big\}$   (BO)

        $M^{(i)} := \big\{ x : |f(x) - h| \le 2\eta_{(i)} \big\}$   (LSE)

  • Epilogue: Theoretical Performance (II)

    Corollary. There exist $\beta_{(i)}$ such that we have $\epsilon$-accuracy with probability at least $1 - \delta$ once

        $T \ge O^*\Big( \frac{\sigma^2 \gamma_T}{\epsilon^2} + \frac{C_1 \gamma_T}{\sigma^2} \Big)$,   where $C_1 = \frac{1}{\log(1 + \sigma^{-2})}$, and

        $\gamma_T = \max_{S : |S| = T} I(f; y_S)$

    is the maximum amount of information $y_S = (y_1, \ldots, y_T)$ can reveal about $f$ upon querying the points $S = (x_1, \ldots, x_T)$

    • New: Improved dependence on the noise level in BO

      - For small $\sigma$ and $\epsilon \ll \sigma$, the existing bound (GP-UCB) is $T \ge O^*\Big( \frac{C_1 \gamma_T}{\epsilon^2} \Big)$

  • Epilogue: Theoretical Performance (III)

    • Multi-noise setup:
      - noise levels $\sigma^2(1), \ldots, \sigma^2(K)$
      - sampling costs $c(1), \ldots, c(K)$

    Corollary. For each $k = 1, \ldots, K$, let $T^*(k)$ denote the smallest value of $T$ such that

        $T \ge \Omega^*\Big( \frac{\sigma(k)^2 \gamma_T}{\epsilon^2} + \frac{C_1(k) \gamma_T}{\sigma(k)^2} \Big)$,   where $C_1(k) = \frac{1}{\log(1 + \sigma(k)^{-2})}$.

    There exist choices of $\beta_{(i)}$ such that we have $\epsilon$-accuracy with probability at least $1 - \delta$ once the cumulative cost reaches

        $\min_k \, c(k)\, T^*(k)$

    • As good as sticking to any fixed noise/cost pair a posteriori!


  • Epilogue: Theoretical Performance (IV)

    • Recall: Minimum cost required to shrink the variance:

        $C^*(\xi, M) = \min_S \big\{ c(S) : \max_{x \in M} \sigma_S(x) \le \xi \big\}$,

    $\sigma_S^2$: posterior variance given the points in $S$

    • In a single epoch, TruVaR greedily maximizes a submodular set function

        $g(S) = -\sum_{\bar{x} \in M_{t-1}} \max\{ \beta_{(i)} \sigma_{t-1|S}^2(\bar{x}), \eta_{(i)}^2 \}$

    • By submodularity, our incurred cost is within a logarithmic factor of the optimum:

        $C_{(i)} \le C^*\Big( \frac{\eta_{(i)}}{\beta_{(i)}^{1/2}},\, M^{(i-1)} \Big) \log\frac{|M^{(i-1)}|\,\beta_{(i)}}{\eta_{(i)}^2}$

    • Sum over the epochs i to obtain the theory


  • Further Reading

    • Tutorials/classes:
      - Taking the Human Out of the Loop: A Review of Bayesian Optimization (Shahriari et al., 2016)
      - A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning (Brochu et al., 2010)
      - Lectures on Gaussian Processes & Bayesian optimization by Nando de Freitas (available on YouTube)

    • Other:
      - Various papers referenced at the end of each set of slides (this & previous ones)
      - Popular GP book: Gaussian Processes for Machine Learning (Rasmussen, 2006)
      - My papers: http://www.comp.nus.edu.sg/~scarlett/


  • Useful Programming Packages

    • Useful libraries:

      - Python packages (some with other methods beyond GPs; a minimal GPy sketch follows after this list):
        - GPy and GPyOpt
        - Spearmint
        - BayesianOptimization
        - PyBo
        - HyperOpt
        - MOE
      - Packages for other languages:
        - GPML for MATLAB
        - GPFit and rBayesianOptimization for R
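    As a quick-start illustration, here is a minimal GPy-based surrogate fit feeding a simple UCB rule. This is a sketch under the assumption of a recent GPy version; call signatures can differ across versions, so consult the package documentation, and note that GPyOpt, Spearmint, etc. wrap the whole BO loop (including acquisition optimization) for you.

```python
import numpy as np
import GPy  # one of the packages listed above

# Toy 1-D data for the surrogate model.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(8, 1))
Y = np.sin(6 * X) + 0.05 * rng.standard_normal((8, 1))

# Fit a GP with an RBF kernel; hyperparameters by maximum likelihood.
kernel = GPy.kern.RBF(input_dim=1)
model = GPy.models.GPRegression(X, Y, kernel)
model.optimize()

# Posterior mean/variance on a grid, fed into a UCB-style acquisition.
grid = np.linspace(0, 1, 200)[:, None]
mu, var = model.predict(grid)
ucb = mu + 2.0 * np.sqrt(var)
print("next query:", grid[np.argmax(ucb)])
```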

  • References I

    [1] Ilija Bogunovic, Jonathan Scarlett, Andreas Krause, and Volkan Cevher. Truncated variance reduction: A unified approach to Bayesian optimization and level-set estimation. In Conf. Neur. Inf. Proc. Sys. (NIPS), 2016.

    [2] Eric Brochu, Vlad M. Cora, and Nando de Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. http://arxiv.org/abs/1012.2599, 2010.

    [3] Brent Bryan, Robert C. Nichol, Christopher R. Genovese, Jeff Schneider, Christopher J. Miller, and Larry Wasserman. Active learning for identifying function threshold boundaries. In Conf. Neur. Inf. Proc. Sys. (NIPS), pages 163–170, 2006.

    [4] Paul W. Goldberg, Christopher K. I. Williams, and Christopher M. Bishop. Regression with input-dependent noise: A Gaussian process treatment. Adv. Neur. Inf. Proc. Sys. (NIPS), 10:493–499, 1997.

    [5] Alkis Gotovos, Nathalie Casati, Gregory Hitz, and Andreas Krause. Active learning for level set estimation. In Int. Joint Conf. Art. Intel., 2013.

  • References II

    [6] Philipp Hennig and Christian J. Schuler. Entropy search for information-efficient global optimization. J. Mach. Learn. Research, 13(1):1809–1837, 2012.

    [7] Jan Hendrik Metzen. Minimum regret search for single- and multi-task optimization. In Int. Conf. Mach. Learn. (ICML), 2016.

    [8] J. Močkus, V. Tiesis, and A. Žilinskas. The application of Bayesian methods for seeking the extremum. Vol. 2, 1978.

    [9] Carl Edward Rasmussen. Gaussian Processes for Machine Learning. MIT Press, 2006.

    [10] Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P. Adams, and Nando de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proc. IEEE, 104(1):148–175, 2016.

    [11] Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian optimization of machine learning algorithms. In Adv. Neur. Inf. Proc. Sys. (NIPS), 2012.

  • References III

    [12] N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger. Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. IEEE Trans. Inf. Theory, 58(5):3250–3265, May 2012.

    [13] Kevin Swersky, Jasper Snoek, and Ryan P. Adams. Multi-task Bayesian optimization. In Adv. Neur. Inf. Proc. Sys. (NIPS), pages 2004–2012, 2013.