Optimal Learning: an Overview
Peter I. Frazier
Operations Research & Information Engineering, Cornell University
Thursday June 12, 2014
Guest Lecture, Operations Research 3 – Decision-making
Tsinghua University
Research supported by AFOSR and NSF
Frazier (Cornell University) Optimal Learning talk 1 / 45
What is optimal learning?
In many applications, we make decisions about which data to collect.
In making these decisions we trade the benefit of information (the ability to make better decisions in the future) against its cost (money, time, or opportunity cost).
Statistical learning is making predictions or decisions based on data.
Optimal learning is making decisions about which data to collect in an optimal way.
Optimal learning overlaps with other fields
Optimal learning overlaps with these fields:
Bayesian statistics, and machine learning.
Decision-making under uncertainty, and dynamic programming.
Outline
1 Example Optimal Learning Problems
2 Bayesian Selection of the Best
   Problem summary
   Bayesian inference
   The Knowledge-Gradient (KG) method
   Optimality analysis using dynamic programming
3 Conclusion
Dynamic Pricing
Our goal is to price airline tickets to maximize revenue.
We learn about demand for a flight as we sell tickets.
The information collected depends on how we price each ticket: we only observe whether the price that the customer was willing to pay was above or below the offered price.
Collecting more information now may provide the ability to improve revenues later.
[Screenshot: an airline fare-search results page for a round trip from New York/Newark, NJ (EWR) to Los Angeles, CA (LAX) on Sun., Jan. 11, 2009 — nonstop flights from $937, flights with stops from $1,182, with higher fares in other fare classes.]
AIDS Treatment and Prevention
We would like to treat and prevent AIDS in Africa.
We are uncertain about the effectiveness of experimental treatments and untried prevention methods, but we can learn about them by using them in practice, or by conducting scientific studies.
To which treatment and prevention methods should we allocate our investigative resources?
How should we balance using those methods that appear to be most effective, with those untried methods that may be very good?
Exploration vs. Exploitation in News Feeds
We would like to design an automatic document screening system that forwards documents (e.g., webpages) of interest to a human.
The screening system earns a reward if the forwarded document is of interest, and pays a penalty if not.
Even if the expected immediate reward of forwarding a particular document is negative, the system may still want to do so because human feedback may allow the system to improve future performance.
Adaptive Web Design (multi-armed bandits)
Product development (optimization of expensive functions)
We have a product whose features we are selecting based on a sequence of focus groups.
We have the time and budget for a fixed number of focus groups, through which we want to learn more about underlying consumer preferences for these features.
After conducting these focus groups, we will choose a particular set of features with which to bring our product to market and receive a reward based on the resulting sales revenue and manufacturing and development costs.
Other examples
Materials informatics / Designing novel materials
Simulation optimization
Optimization of long-running computer codes
Clinical trials (sequential hypothesis testing)
Inventory control with censored demand
Quality control (changepoint detection)
We consider an optimal learning problem called Bayesian Ranking & Selection
We consider an optimal learning problem called “Bayesian Ranking & Selection (R&S)” or “Bayesian Selection of the Best”.
In this problem, we wish to know which of a finite number of options is the best.
To figure out the quality of an option, we can sample it (try it out).
When we sample an option, we get a noisy observation of its quality.
We can take a limited number of samples.
We wish to allocate this sampling budget efficiently, so as to best support selecting the best.
Example: Drug Discovery
A pharmaceutical company has a library of millions of compounds that it would like to screen for potential cancer drugs. Robots will do the initial assay by performing a fixed test one or several times on some subset of the compounds.
Sources: http://www.paa.co.uk/img/labauto/inst highres/ssi/mini dispenser.jpg,
http://www.kalyx.com/store/images/Images SW/SW 201442-51.jpg
Example: Queuing Control
We would like to choose a nurse/doctor staffing policy in a hospital to minimize expected patient waiting time.
To figure out the patient waiting time under a particular staffing policy, we can simulate it using a discrete event simulation.
Each simulation takes about 1 minute.
We want to choose the best among 100 possible staffing policies, using at most 24 hours of simulation effort.
[Screenshot: a page from Shi, Chen, and Yucesan (1999) on optimal computing budget allocation (OCBA), formulating a stochastic resource allocation problem — buffer allocation in a 10-node supply-chain network — where the objective function has no analytical expression and is estimated through simulation.]
Source: Shi, Chen, and Yucesan (1999)
Mathematical Model
We consider k alternative options.
The underlying value of alternative x is θx ∈ R. We do not observe this, and must try to learn it through sampling. Let θ = (θ1, …, θk).
At each time n = 1, …, N, we choose an alternative to sample, xn ∈ {1, …, k}. We observe a sample,
yn | xn, θ1:k ∼ Normal(θxn, λ²).
To keep things simple, we assume that λ² is known and is the same for all options. It is also possible to allow λ² to be unknown, and to vary with x.
At time N, we select an option x̂ ∈ {1, …, k}, which we hope is the best option.
We receive a reward of θx̂, which is the true value of the selected option x̂.
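This sampling model is easy to simulate. A minimal sketch in Python, with hypothetical values for θ and λ (all names here are my own, for illustration):

```python
import random

# Hypothetical instance: k = 3 alternatives with true (unknown) values theta,
# and a common, known sampling standard deviation lam.
theta = [0.2, 0.5, 0.35]
lam = 1.0

def sample(x, rng):
    """Draw one noisy observation y ~ Normal(theta_x, lam^2)."""
    return rng.gauss(theta[x], lam)

rng = random.Random(0)
# Spend a budget of N = 6 samples, here simply cycling through the options.
observations = [(n % 3, sample(n % 3, rng)) for n in range(6)]
```

Each entry pairs the sampled alternative xn with its noisy observation yn; the question the rest of the talk addresses is how to choose the xn more cleverly than round-robin.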
Example, Times 0 through 10
[Plots: posterior estimates for alternatives x = 1, …, 5 on a common axis (about −2 to 2), updated after each measurement at times 0, 1, 2, 3, 4, 5, and 10; the final frame also overlays the true values.]
We put a Bayesian prior probability distribution on θ
The underlying value of alternative x is θx .
We do not know θx, but based on intuition, experience, and data from other similar problems, we may be able to make statements like “The expected waiting time for this nurse staffing policy is probably between 15 minutes and 4 hours.”
We formalize this by supposing that θx was drawn by nature at random from a Bayesian prior probability distribution.
Once θx is drawn by nature (at time n = 0), it stays fixed (over n = 1, 2, …).
We use a normal prior probability distribution, because it is flexible, and allows easy computation:
θx ∼ Normal(µ0,x, σ²0,x).
We can use Bayesian statistics to estimate θx, based on noisy samples.
Suppose our first sample is from option x , so x1 = x .
We observe
y1 | x1 = x, θ1:k ∼ Normal(θx, λ²).
We can use Bayes’ rule to calculate the conditional distribution of θx given this sample.
The conditional distribution given the data is called the posterior distribution.
We can use Bayesian statistics to estimate θx, based on noisy samples.
Bayes’ rule shows us that the posterior distribution on θx is
θx | x1, y1 ∼ Normal(µ1,x, σ²1,x),
where
µ1,x = (σ0,x^−2 µ0,x + λ^−2 y1) / (σ0,x^−2 + λ^−2),
σ²1,x = [σ0,x^−2 + λ^−2]^−1.
The posterior distribution on θx′, where x′ ≠ x, does not change.
There is a nice expression for the posterior distribution
In general,
θx | x1, …, xn, y1, …, yn ∼ Normal(µn,x, σ²n,x),
where µn,x and σn,x can be computed recursively. For x = xn, the posterior is updated via:
µn+1,x = (σn,x^−2 µn,x + λ^−2 yn+1) / (σn,x^−2 + λ^−2),
σ²n+1,x = [σn,x^−2 + λ^−2]^−1,
and the posterior for x ≠ xn does not change:
µn+1,x = µn,x for x ≠ xn,
σn+1,x = σn,x for x ≠ xn.
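These recursions are a few lines of code. A sketch in Python (variable names are my own):

```python
def posterior_update(mu, sigma2, x, y, lam2):
    """One conjugate normal update after observing y for alternative x.

    mu, sigma2: lists of posterior means and variances, one per alternative.
    lam2: the known sampling variance lambda^2.
    Alternatives other than x are left unchanged.
    """
    mu, sigma2 = list(mu), list(sigma2)
    prec = 1.0 / sigma2[x] + 1.0 / lam2            # posterior precision
    mu[x] = (mu[x] / sigma2[x] + y / lam2) / prec  # precision-weighted average
    sigma2[x] = 1.0 / prec
    return mu, sigma2

# Example: a Normal(0, 1) prior and one observation y = 1 with lam2 = 1
# gives posterior mean 0.5 and variance 0.5 for the sampled alternative.
mu, sigma2 = posterior_update([0.0, 0.0], [1.0, 1.0], 0, 1.0, 1.0)
```

Note that the update blends the prior mean and the observation in proportion to their precisions, which is exactly the formula on this slide.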
Example of the posterior distribution
[Animation: a sequence of frames showing the posterior distributions evolving as successive samples arrive.]
We can use the posterior distribution to choose x̂
Recall that x̂ is our selection of the best, and it is chosen at time N based on all previous samples x1, …, xN, y1, …, yN.
Based on these samples, the posterior is
θx | x1, …, xN, y1, …, yN ∼ Normal(µN,x, σ²N,x).
Recall that the reward for choosing x̂ = x is θx.
The conditional expected reward for choosing x̂ = x is
E[θx | x1, …, xN, y1, …, yN] = µN,x.
Thus, the choice that gives the biggest conditional expected reward is arg maxx µN,x, and it has value maxx µN,x.
Example of choosing x̂
[Plots: the posterior at time N, from which x̂ is chosen as the alternative with the largest posterior mean.]
How should we choose the xn?
Our ability to choose x̂ accurately depends on the choices we make for x1, …, xN.
Intuitively, a good way to choose these should spend the first part of the budget exploring the options to figure out which ones are among the best, and then focus the rest of the budget on these options.
But how precisely should we accomplish this?
One way to choose the xn is through the knowledge-gradient (KG) method for independent beliefs.
Later, in the seminar, I will talk about the knowledge-gradient method for correlated beliefs.
The knowledge-gradient factor quantifies a sample’s value
The knowledge-gradient method is created via the following thought experiment.
If we were to stop at time n, and select x̂ based on x1:n, y1:n, we would earn an expected reward of
µ*n = maxx µn,x.
If we were to take one more sample, xn+1, and observe yn+1, and then select x̂, we would earn an expected reward of
µ*n+1 = maxx µn+1,x.
The knowledge-gradient factor quantifies a sample’s value
Before the new sample, our value was µ*n. After, it was µ*n+1.
The additional sample xn+1, yn+1 has increased our solution’s value by
µ*n+1 − µ*n.
At time n, we don’t know yn+1, so we can’t compute this quantity.
We can, however, compute its expected value,
KGn(x) = En[µ*n+1 − µ*n | xn+1 = x].
We call this quantity the knowledge-gradient (KG) factor, because it measures the change in the value of our knowledge.
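This expectation can be estimated by brute-force simulation: draw a hypothetical yn+1 from its predictive distribution, update the posterior mean, and average the improvement. A sketch on a hypothetical two-alternative instance (names are my own):

```python
import math
import random

def kg_monte_carlo(mu, sigma2, x, lam2, n_samples=200000, seed=1):
    """Monte Carlo estimate of KG_n(x) = E_n[max mu_{n+1} - max mu_n | x_{n+1} = x]."""
    rng = random.Random(seed)
    best_now = max(mu)
    best_other = max(m for i, m in enumerate(mu) if i != x)
    prec = 1.0 / sigma2[x] + 1.0 / lam2
    total = 0.0
    for _ in range(n_samples):
        # y_{n+1} has predictive distribution Normal(mu_x, sigma2_x + lam2).
        y = rng.gauss(mu[x], math.sqrt(sigma2[x] + lam2))
        mu_x_new = (mu[x] / sigma2[x] + y / lam2) / prec  # updated mean for x
        total += max(mu_x_new, best_other) - best_now
    return total / n_samples

# Two alternatives with identical Normal(0, 1) beliefs and lam2 = 1; the
# closed-form KG factor (given on a later slide) is ~0.2821 here.
est = kg_monte_carlo([0.0, 0.0], [1.0, 1.0], 0, 1.0)
```

The closed-form expression on the following slides makes this simulation unnecessary in practice, but the Monte Carlo version is a useful sanity check.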
Computing the KG factor requires us to think about how the next measurement will change our posterior.
At time n, suppose we decide to measure xn+1 = x .
Before we observe yn+1, it is random.
We can calculate its conditional distribution given x1, …, xn+1, y1, …, yn:
yn+1 | x1, …, xn+1, y1, …, yn ∼ Normal(µn,x, σ²n,x + λ²).
From this, and the formula for µn+1,x in terms of µn,x, σ²n,x, and yn+1, we can calculate that
µn+1,x | x1, …, xn+1, y1, …, yn ∼ Normal(µn,x, σ̃²n,x),
where σ̃n,x = σ²n,x / √(σ²n,x + λ²).
This distribution is called the “posterior predictive distribution”.
The KG factor has a convenient formula.
The VOI / KG factor for measuring alternative x at time n is
KGn(x) = σ̃n,x f(−∆n,x / σ̃n,x),
where
∆n,x = |µn,x − maxx′≠x µn,x′|,
f(c) = cΦ(c) + ϕ(c),
Φ is the normal cdf, and ϕ is the normal pdf.
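This formula needs only the normal pdf and cdf, so it can be coded with the standard library alone (math.erf gives the cdf). A sketch, with a hypothetical three-alternative instance:

```python
import math

def kg_factor(mu, sigma2, x, lam2):
    """Closed-form KG factor for alternative x under independent normal beliefs."""
    sigma_tilde = sigma2[x] / math.sqrt(sigma2[x] + lam2)
    delta = abs(mu[x] - max(m for i, m in enumerate(mu) if i != x))
    c = -delta / sigma_tilde
    pdf = math.exp(-0.5 * c * c) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(c / math.sqrt(2.0)))
    return sigma_tilde * (c * cdf + pdf)  # sigma_tilde * f(-delta / sigma_tilde)

# The KG policy measures the alternative with the largest factor; here the
# third alternative wins because its belief is most uncertain.
mu, sigma2, lam2 = [0.1, 0.3, 0.0], [1.0, 0.5, 2.0], 1.0
x_next = max(range(len(mu)), key=lambda x: kg_factor(mu, sigma2, x, lam2))
```

Note how the factor balances exploitation (small ∆n,x) against exploration (large σ̃n,x) in a single number.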
Animation of the KG method
[Animation: frames for n = 0 through n = 20 plotting the prior, Y_x, and the observations (yhat) as the KG method sequentially chooses which alternative to measure.]
The KG method works well
[Histograms: Value(KG) − Value(Boltzmann), Value(KG) − Value(EqualAllocation), Value(KG) − Value(Exploit), Value(KG) − Value(IE), Value(KG) − Value(OCBA), and Value(KG) − Value(LL(S)).]
Histogram of the sampled difference in value for competing policies, aggregated across the 100 randomly generated problems.
The knowledge-gradient method is good, but it is not optimal in general
The KG method works well against other algorithms proposed for this problem.
The KG method is optimal if we have only one sample remaining.
But in general, multiple samples remain.
What is the best algorithm in general?
The optimal algorithm is the solution to a dynamic program
The conditional expected value we receive, given what we know at time N, is maxx µN,x.
Define VN = VN(µN, σ²N) = maxx µN,x.
At time N−1, the optimal choice of xN is the one that maximizes the expected value of this reward,
arg maxxN EN−1[VN | xN],
and this maximal expected value is
VN−1 = VN−1(µN−1, σ²N−1) = maxxN EN−1[VN | xN].
Notation: En means the conditional expectation with respect to µn and σ²n; µN = (µN,x : x = 1, …, k) and similarly for σ²N.
In principle, we can repeat this to find the optimal rule for every xn
We iterate backward over n = N, N−1, N−2, …, 1, where in each stage n:
We computed Vn+1(µn+1, σ²n+1) in the previous stage.
The optimal choice for xn+1 is
xn+1 ∈ arg maxxn+1 En[Vn+1(µn+1, σ²n+1) | xn+1].
The value of this decision is
Vn(µn, σ²n) = maxxn+1 En[Vn+1(µn+1, σ²n+1) | xn+1].
This is dynamic programming.
We can solve the DP exactly for small problems
Here is the value function for a Bayesian ranking and selection problem with Bernoulli (0/1) observations, and independent beta prior distributions.
[Plot: value of the optimal policy vs. number of measurements N (0 through 8), with one curve for each of k = 2, 3, 4 alternatives; values range from roughly 0.45 to 0.75.]
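The figure's setting can be solved exactly by backward induction over the posterior parameters. A sketch for k = 2 alternatives with Bernoulli observations and independent Beta(1, 1) priors (my own construction, mirroring the figure's setup):

```python
from functools import lru_cache

def optimal_value(N):
    """Value of the optimal sampling policy with budget N, for k = 2 Bernoulli
    alternatives with independent Beta(1, 1) priors. The DP state is the
    vector of beta parameters (a1, b1, a2, b2)."""
    @lru_cache(maxsize=None)
    def V(n, a1, b1, a2, b2):
        if n == N:
            # Terminal reward: posterior mean of the selected alternative.
            return max(a1 / (a1 + b1), a2 / (a2 + b2))
        p1 = a1 / (a1 + b1)  # predictive success probability for alternative 1
        p2 = a2 / (a2 + b2)
        v1 = p1 * V(n + 1, a1 + 1, b1, a2, b2) + (1 - p1) * V(n + 1, a1, b1 + 1, a2, b2)
        v2 = p2 * V(n + 1, a1, b1, a2 + 1, b2) + (1 - p2) * V(n + 1, a1, b1, a2, b2 + 1)
        return max(v1, v2)  # measure whichever alternative is more valuable
    return V(0, 1, 1, 1, 1)

# With no measurements the value is 1/2; one measurement raises it to 7/12.
values = [optimal_value(N) for N in range(9)]
```

The resulting values are nondecreasing in N, matching the shape of the curves in the figure.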
For large problems, this does not work because of the curse of dimensionality
To use dynamic programming, we need to compute and store Vn(µn, σ²n) for each possible value of µn and σ²n. (We need to compute Vn for every n, but at any given time we only need Vn and Vn+1 in memory.)
There are infinitely many possible values for µn. We can discretize, but it is a vector in k dimensions, and so discretizing into m pieces in each dimension allows for m^k possible values.
σ²n only takes finitely many values, since (σ²n,x)^−1/λ^−2 is the number of samples of alternative x, but there are still k^n/n! possible values.
For large values of k (say, k > 10), solving the dynamic program is computationally intractable.
For such large values of k, we recommend using the KG policy.
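Putting the pieces together, running the KG policy is a short loop: compute the KG factor for each alternative, measure the argmax, and update the posterior. A self-contained sketch on a hypothetical instance (helper names are my own):

```python
import math
import random

def kg_factor(mu, sigma2, x, lam2):
    """Closed-form KG factor under independent normal beliefs."""
    s = sigma2[x] / math.sqrt(sigma2[x] + lam2)
    c = -abs(mu[x] - max(m for i, m in enumerate(mu) if i != x)) / s
    return s * (c * 0.5 * (1 + math.erf(c / math.sqrt(2)))
                + math.exp(-0.5 * c * c) / math.sqrt(2 * math.pi))

def run_kg(theta, lam2, N, seed=0):
    """Sample N times with the KG policy, then select x_hat = argmax mu."""
    rng = random.Random(seed)
    k = len(theta)
    mu, sigma2 = [0.0] * k, [1.0] * k  # independent Normal(0, 1) priors
    for _ in range(N):
        x = max(range(k), key=lambda i: kg_factor(mu, sigma2, i, lam2))
        y = rng.gauss(theta[x], math.sqrt(lam2))
        prec = 1 / sigma2[x] + 1 / lam2
        mu[x] = (mu[x] / sigma2[x] + y / lam2) / prec
        sigma2[x] = 1 / prec
    return max(range(k), key=lambda i: mu[i]), mu, sigma2

x_hat, mu, sigma2 = run_kg(theta=[0.0, 0.3, 0.6, -0.2, 0.1], lam2=0.25, N=40)
```

Each iteration costs O(k), so unlike the dynamic program this loop scales to large k.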
The KG method has nice optimality properties
The dynamic programming equations can be used to prove certain optimality properties of the KG policy:
The knowledge-gradient policy is optimal when N = 1.
The knowledge-gradient policy is asymptotically optimal as N → ∞.
For other N, the knowledge-gradient policy’s suboptimality is bounded by
V^KG,n(S^n) ≥ V^n(S^n) − ((N−n−1)/√(2π)) maxx σ̃n,x,
where V^KG,n gives the value of the knowledge-gradient policy and V^n the value of the optimal policy, both with N−n measurements remaining.
The KG method has nice optimality properties
If there are exactly 2 alternatives (M = 2), the knowledge-gradient policy is optimal. In this case, the optimal policy reduces to
xn = arg maxx σn,x.
The KG method has nice optimality properties
If there is no measurement noise, and alternatives may be reordered so that
µ0,1 ≥ µ0,2 ≥ … ≥ µ0,M,
σ0,1 ≥ σ0,2 ≥ … ≥ σ0,M,
then the knowledge-gradient policy is optimal.
Conclusion
We gave an introduction to Bayesian ranking and selection, which is one of many optimal learning problems.
We showed how Bayesian statistics and a one-step optimality analysis can be used to derive the KG policy for this problem.
In the seminar today, we will look at another optimal learning problem: simulation optimization, with correlated Bayesian prior distributions.
Knowledge-gradient methods offer a convenient yet principled way to develop algorithms for a wide variety of optimal learning problems.
For further reading
P.I. Frazier, “Tutorial: Optimization via Simulation with Bayesian Statistics and Dynamic Programming,” Winter Simulation Conference, 2012. (available on my website)
W.B. Powell & I.O. Ryzhov, “Optimal Learning,” 2012. (textbook)
The original paper on the KG method: P.I. Frazier, W.B. Powell, and S. Dayanik, “A Knowledge-Gradient Policy for Sequential Information Collection,” SIAM Journal on Control and Optimization, 2008.
Other introductory materials available on my website,http://people.orie.cornell.edu/pfrazier/