(c) 2013 W.B. Powell
Outline
» The knowledge-gradient policy
» The S-curve effect
» Online vs. offline problems
» Learning a continuous function
» KG with parametric beliefs for drug discovery
The knowledge gradient
Basic principle:
» Assume you can make only one measurement, after which you have to make a final choice (the implementation decision).
» What choice would you make now to maximize the expected value of the implementation decision?

[Figure: five alternatives; measuring option 5 changes the estimate of its value, a change which produces a change in the decision.]
The knowledge gradient
General model
» Off-line learning – We have a measurement budget of N observations. After we do our measurements, we have to make an implementation decision.
» Notation:
  $y$ = implementation decision
  $K$ = our state of knowledge (e.g. mean and variance of our estimates of costs and other parameters)
  $F(y, K)$ = value of making decision $y$ given knowledge $K$
  $x$ = measurement decision
  $K(x)$ = updated distribution of belief about costs after measuring $x$
The knowledge gradient
The knowledge gradient
» The knowledge gradient is the expected value of a single measurement $x$, given by
  $$\nu^{KG}_x = \mathbb{E}\left[\max_y F(y, K(x))\right] - \max_y F(y, K)$$
» The challenge is a computational one: how do we compute the expectation?
» Knowledge gradient policy: $X^{KG} = \arg\max_x \nu^{KG}_x$
The knowledge gradient
Computing the knowledge gradient
» Notation:
  $\beta^n_x$ = precision (inverse variance) of our estimate of the value of $x$
  $\beta^W$ = precision of the measurement noise ($= 1/\sigma^2_W$)
  $W^{n+1}_x$ = measurement of $x$ in iteration $n+1$ (unknown at $n$)
» We update the precision using
  $$\beta^{n+1}_x = \beta^n_x + \beta^W$$
» In terms of the variance, this is the same as
  $$\sigma^{2,n+1}_x = \left(\frac{1}{\sigma^{2,n}_x} + \frac{1}{\sigma^2_W}\right)^{-1} = \frac{\sigma^{2,n}_x}{1 + \sigma^{2,n}_x/\sigma^2_W}$$
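As a numeric sketch of the update above: the precision update is exact as stated, and the mean is updated by the standard precision-weighted rule for normal beliefs (the rule for the mean is standard Bayesian updating, not shown on this slide; all numbers here are illustrative).

```python
# One Bayesian update for a single alternative x, assuming normal beliefs.
# The prior precision (1/64), noise precision (1.0), prior mean (3.0) and
# observation (4.2) are made-up illustrative values.
beta_n, beta_w = 1 / 64, 1.0    # prior precision beta^n_x, noise precision beta^W
theta_n, w = 3.0, 4.2           # prior mean theta^n_x, observation W^{n+1}_x

beta_n1 = beta_n + beta_w                              # beta^{n+1}_x = beta^n_x + beta^W
theta_n1 = (beta_n * theta_n + beta_w * w) / beta_n1   # precision-weighted mean
var_n1 = 1.0 / beta_n1                                 # same as 1/(1/sigma2_n + 1/sigma2_W)
```

Note how the posterior mean is pulled almost all the way to the observation here, because the prior precision (1/64) is tiny relative to the measurement precision (1.0).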
The knowledge gradient
Computing the knowledge gradient
» The change in variance can be found to be
  $$\tilde{\sigma}^{2,n}_x = \mathrm{Var}\left[\theta^{n+1}_x \mid S^n\right] = \sigma^{2,n}_x - \sigma^{2,n+1}_x = \frac{\sigma^{2,n}_x}{1 + \sigma^2_W/\sigma^{2,n}_x}$$
» Next compute the normalized influence:
  $$\zeta^n_x = -\left|\frac{\theta^n_x - \max_{x' \neq x}\theta^n_{x'}}{\tilde{\sigma}^n_x}\right|$$
» Let
  $$f(\zeta) = \zeta\,\Phi(\zeta) + \phi(\zeta)$$
  where $\Phi(\zeta)$ is the cumulative standard normal distribution and $\phi(\zeta)$ is the standard normal density.
» The knowledge gradient is computed using
  $$\nu^{KG,n}_x = \tilde{\sigma}^n_x\, f(\zeta^n_x)$$
The knowledge gradient
Computing the knowledge gradient

[Figure: five alternatives, illustrating the normalized distance to the best (or second best), $\zeta^n_x = -\left|(\theta^n_x - \max_{x' \neq x}\theta^n_{x'})/\tilde{\sigma}^n_x\right|$, and the resulting $\nu^{KG}_5$.]
The knowledge gradient
KG calculations illustrated
» The spreadsheet calculations:

Decision  mu^n  beta^n   beta^{n+1}  sigmatilde  max_x'  zeta     f(zeta)  nu^{KG}_x
1         3.0   0.0156   1.0156      7.9382      5       -0.2519  0.2856   2.2669
2         4.0   0.0156   1.0156      7.9382      5       -0.1260  0.3391   2.6920
3         5.0   0.0156   1.0156      7.9382      5        0.0000  0.3989   3.1669
4         5.0   0.0123   1.0123      8.9450      5        0.0000  0.3989   3.5685
5         5.0   0.0100   1.0100      9.9504      5        0.0000  0.3989   3.9696
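The table can be reproduced directly from the formulas on the previous slides. A minimal sketch follows; it assumes the displayed precisions are rounded from 1/64, 1/81 and 1/100 (an assumption, but one that matches the sigmatilde column to four decimals) and measurement precision $\beta^W = 1$.

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def knowledge_gradient(mu, beta, beta_w=1.0):
    """KG value nu^KG_x = sigmatilde_x * f(zeta_x) for each alternative x."""
    nu = []
    for x, (m, b) in enumerate(zip(mu, beta)):
        var_n = 1.0 / b                          # sigma^{2,n}_x
        var_n1 = 1.0 / (b + beta_w)              # sigma^{2,n+1}_x after one measurement
        sigma_tilde = math.sqrt(var_n - var_n1)  # predictive change in the estimate
        best_other = max(v for i, v in enumerate(mu) if i != x)
        zeta = -abs(m - best_other) / sigma_tilde
        f = zeta * norm_cdf(zeta) + norm_pdf(zeta)
        nu.append(sigma_tilde * f)
    return nu

mu = [3.0, 4.0, 5.0, 5.0, 5.0]
beta = [1 / 64, 1 / 64, 1 / 64, 1 / 81, 1 / 100]  # shown rounded as .0156, .0123, .0100
nu = knowledge_gradient(mu, beta)
```

Alternative 5 gets the largest KG value: it is tied for the best estimate but has the highest remaining uncertainty, so measuring it teaches us the most.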
The knowledge gradient
[Figure: bar chart of mu, sigma, and the KG index for choices 1–5.]
The knowledge gradient
[Figure: bar chart of mu, sigma, and the KG index for choices 1–5.]
The knowledge gradient
[Figure: bar chart of mu, sigma, and the KG index for choices 1–5.]
The knowledge gradient
The knowledge gradient policy
  $$X^{KG}(S^n) = \arg\max_x \nu^{KG,n}_x$$
Properties:
» Effectively a myopic policy, but also similar to steepest ascent for nonlinear programming.
» The best single measurement you can make (by construction).
» Asymptotically optimal (a more difficult proof): as the measurement budget grows, we get the optimal solution.
» The knowledge gradient policy is the only stationary policy with this behavior.
  • Many policies are asymptotically optimal (e.g. pure exploration, hybrid exploration/exploitation, epsilon-greedy), but they are not myopically optimal.
The knowledge gradient policy
Myopic and asymptotic optimality

[Figure: performance vs. measurement budget; curves labeled "Ideal", "Optimal solution", "Asymptotically optimal", and "Fast initial convergence, but stalls".]
The knowledge gradient policy
Myopic and asymptotic optimality

[Figure: the knowledge gradient combines myopic optimality (fast initial convergence) with asymptotic optimality, tracking the ideal curve toward the optimal solution.]
The knowledge gradient policy
KG versus Gittins indices for multiarmed bandit problems
» Gittins indices are provably optimal, but computing them is hard.
» Gittins indices computed using the Chick and Gans (2009) approximation.

[Figure: distribution of the improvement of KG over Gittins, shown for an informative prior and an uninformative prior.]
Knowledge gradient for online learning
But knowledge gradient can also handle:
» Finite horizons
» Correlated beliefs

[Figure: four panels comparing KG vs. Gittins, KG vs. interval estimation, KG vs. upper confidence bounding, and KG vs. pure exploitation.]
Knowledge gradient for online learning
KG versus interval estimation
» Recall that with IE, you choose the alternative with the highest
  $$\nu^{IE}_x = \bar{\theta}_x + z_\alpha \bar{\sigma}_x$$

[Figure: opportunity cost as a function of the IE parameter $z$, comparing IE and KG; a region is marked where IE beats KG.]
Outline
» The knowledge-gradient policy
» The S-curve effect
» Online vs. offline problems
» Learning a continuous function
» KG with parametric beliefs for drug discovery
The nonconcavity of information
We can calculate the value of $n_x$ measurements.
» The updated precision would now be
  $$\beta^n_x = \beta^0_x + n_x \beta^W$$
» Compute $\tilde{\sigma}^2_x(n_x)$ using
  $$\tilde{\sigma}^2_x(n_x) = \mathrm{Var}\left[\theta^n_x \mid S^0\right] = \sigma^{2,0}_x - \sigma^{2,n}_x = \frac{\sigma^{2,0}_x}{1 + \sigma^2_W/(n_x\,\sigma^{2,0}_x)}$$
» The calculation of the knowledge gradient is the same:
  $$\zeta^0_x = -\left|\frac{\theta^0_x - \max_{x' \neq x}\theta^0_{x'}}{\tilde{\sigma}_x(n_x)}\right|, \qquad \nu^{KG}_x(n_x) = \tilde{\sigma}_x(n_x)\, f(\zeta^0_x) \quad \text{(value of } n_x \text{ measurements)}$$
The nonconcavity of information
The value of information is often concave…
The nonconcavity of information
… but not always.» The marginal value of a single measurement can be small!
The nonconcavity of information
What influences the shape?
» Consider a baseball hitter whose "true" batting average is 0.300.
  • The variance of a single at-bat is .3 x .7 = .21 (std. dev. ≈ .46).
  • Suppose the difference in my beliefs about the batting averages of two players is .02 (e.g. I think one bats .300 and the other bats .280).
  • Assume the standard deviation of my belief about a batting average is also 50 points (.05).
» The calculations are in the accompanying spreadsheet.
» Notes:
  • The expected value of 10 at-bats, in terms of increasing the expected batting average, is .00087.
  • The expected value of 100 at-bats, in terms of increasing the expected batting average, is .0068.
The nonconcavity of information Optimal number of choices
As measurement noise increases, the optimal number of alternatives to evaluate decreases.

[Figure: distribution of the number of alternatives being evaluated, shifting toward fewer alternatives with increasing noise.]
The nonconcavity of information
Examples of problems with non-concave information:
» Finding the best hitters for a baseball team
» Finding the best stock pickers for an investment fund
» Finding the best high-value, low-volume products to put in inventory

Implications?
» Compare to the behavior we learned at the beginning of the course when we had an "S-curve" problem.
The nonconcavity of information
The KG(*) policy
» Choose the number of measurements of an alternative that maximizes the average value per measurement, $\nu^{KG}_x(n_x)/n_x$.
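The baseball example above makes the nonconcavity concrete. The sketch below computes $\nu^{KG}_x(n_x)$ for the batting-average setup (belief gap .02, prior standard deviation .05, per-at-bat variance .21) and searches for the $n$ that maximizes the average value per measurement; it reproduces the slide's values of roughly .00087 for 10 at-bats and .0068 for 100.

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Batting-average setup from the slides: belief gap .02, prior sd .05,
# single at-bat variance .3 * .7 = .21.
delta, var0, var_w = 0.02, 0.05 ** 2, 0.21
beta0, beta_w = 1.0 / var0, 1.0 / var_w

def kg_value(n):
    """Value of taking n measurements of one alternative at once."""
    sigma_tilde = math.sqrt(var0 - 1.0 / (beta0 + n * beta_w))
    zeta = -delta / sigma_tilde
    return sigma_tilde * (zeta * norm_cdf(zeta) + norm_pdf(zeta))

# KG(*): pick the n that maximizes the average value per measurement.
n_star = max(range(1, 201), key=lambda n: kg_value(n) / n)
```

A single at-bat is almost worthless here (the S-curve's flat start), while the average value per at-bat peaks somewhere in the low tens before declining again.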
Outline
» The knowledge-gradient policy
» The S-curve effect
» Online vs. offline problems
» Learning a continuous function
» KG with parametric beliefs for drug discovery
Online vs. offline learning problems
Types of learning problems
» On-line learning
  • Learn as you earn.
  • Examples:
    – Finding the best path to work
    – What is the best set of energy-saving technologies to use for your building?
    – What is the best medication to control your diabetes?
  • As you collect information, you collect rewards (or lose money). Collecting information is coincident with using the information.
  • You have to balance the value of what you earn with a choice now against the benefits of the information you will gain for future decisions.
Online vs. offline learning problems
Types of learning problems
» Off-line learning
  • There is a phase of information collection with a finite (sometimes small) budget.
  • You are allowed to make a series of measurements, after which you make an implementation decision.
  • Examples:
    – Finding the best drug compound through laboratory experiments
    – Finding the best design of a manufacturing configuration or engineering design which is evaluated using an expensive simulation
    – What is the best combination of designs for hydrogen production, storage and conversion?
  • Off-line learning separates the process (and costs) of learning from the benefits of using the information that you have gained.
Online vs. offline learning problems
For problems with a finite number of alternatives:
» On-line learning (learn as you earn)
  • This is known in the literature as the multi-armed bandit problem, where you are trying to find the slot machine with the highest payoff.
» Off-line learning
  • You have a budget for taking measurements. After your budget is exhausted, you have to make a final choice.
  • This is known as the ranking and selection problem.
Online vs. offline learning problems
Knowledge gradient policy
» For off-line problems:
  $\nu^{KG,n}_x$ = value of a measurement for a single decision.
» For finite-horizon on-line problems:
  • Assume we have made 3 measurements out of our budget of 20.
  • What is the value of learning from one more measurement?
  • $\nu^{KG,3}_x$ is the improvement in the 4th decision given what we know after the 3rd measurement. But we benefit from this decision 17 more times:
  $$\nu^{KG\text{-}OL,3}_x = \theta^3_x + (20 - 3)\,\nu^{KG,3}_x = \theta^3_x + 17\,\nu^{KG,3}_x$$
  • The more times we can use the information, the more we are willing to take a loss now for future benefits.
Online vs. offline learning problems
Knowledge gradient policy
» For finite-horizon on-line problems:
  $$\nu^{KG,OL,n}_x = \theta^n_x + (N - n)\,\nu^{KG,n}_x$$
» For infinite-horizon discounted problems:
  $$\nu^{KG,OL,n}_x = \theta^n_x + \frac{\gamma}{1-\gamma}\,\nu^{KG,n}_x$$
» Compare to Gittins indices for bandit problems:
  $$\nu^{Gittins,n}_x = \theta^n_x + \Gamma\!\left(\frac{\sigma^n_x}{\sigma_W}, \gamma\right)\sigma_W$$
» … and UCB:
  $$\nu^{UCB1,n}_x = \theta^n_x + 4\,\sigma_W\sqrt{\frac{\log n}{N^n_x}}$$
» In the KG formulas the second term is the value of information; for Gittins and UCB the corresponding term is marked "???".
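A small sketch of how the horizon shifts the decision. All estimates and KG values below are illustrative, not from the slides; the point is that the online index trades off earning ($\theta$) against learning ($\nu^{KG}$).

```python
# Online KG index: current estimate plus the value of information, weighted by
# how many more times the information will be used. All numbers are made up.
theta = [3.0, 4.0, 5.0]      # current estimates theta^n_x (assumed)
nu_kg = [0.30, 0.25, 0.02]   # offline KG values nu^{KG,n}_x (assumed)
N, n, gamma = 20, 3, 0.9

offline = nu_kg                                                    # measure only to learn
finite = [t + (N - n) * v for t, v in zip(theta, nu_kg)]           # theta + (N - n) * nu
discounted = [t + gamma / (1 - gamma) * v for t, v in zip(theta, nu_kg)]
```

With these numbers, offline KG would measure alternative 1 (largest $\nu^{KG}$), while both online indices prefer alternative 2, which balances a good current estimate against a still-substantial value of information.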
Outline
» The knowledge-gradient policy
» The S-curve effect
» Online vs. offline problems
» Learning a continuous function
» KG with parametric beliefs for drug discovery
Learning the maximum of a function
Choosing prices to maximize revenue
» Measuring a price of $80 tells us something about the response at $81.
» Initial solution
Learning the maximum of a function
Choosing prices to maximize revenue
» After three measurements (including endpoints)
Learning the maximum of a function
Choosing prices to maximize revenue
» After four measurements
Learning the maximum of a function
Choosing prices to maximize revenue
» After 10 measurements
Parametric beliefs and drug discovery
Learning a concave function
» As the number of observations increases, the policy quickly evolves to pure exploitation.
Parametric beliefs and drug discovery
Insights
» Trying what appears to be best maximizes profits given what you know, but you may be wrong.
» It is generally not a good idea to try ideas that genuinely look bad.
» It is best to try ideas that are just off center. You learn more, and you may learn that profits are even higher with different strategies.
OJ game 2009
Mom&Pop pricing (2009)
OJ game 2009
Performance of different teams
OJ game 2010
Challenge
» If you are underperforming:
  • Are your prices right?
  • Perhaps they are too high? Too low?
  • What is your level of uncertainty?
Outline
» The knowledge-gradient policy
» The S-curve effect
» Online vs. offline problems
» Learning a continuous function
» KG with parametric beliefs for drug discovery
Parametric beliefs and drug discovery
Biomedical research
» How do we find the best drug to cure cancer?
» There are millions of combinations, with laboratory budgets that cannot test everything.
» We need a method for sequencing experiments.
Parametric beliefs and drug discovery
Designing molecules
» X and Y are sites where we can hang substituents to change the behavior of the molecule.
Parametric beliefs and drug discovery
We express our belief using a linear, additive QSAR model:
  $$Y = \theta_0 + \sum_{\text{sites } i}\ \sum_{\text{substituents } j} \theta_{ij} X_{ij}$$
  where $X_{ij} = 1$ if substituent $j$ is at site $i$, and $0$ otherwise.
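The additive belief can be sketched in a few lines. The site names, substituents, and coefficients below are made up for illustration; the structure (an intercept plus one coefficient per site–substituent indicator) follows the model above.

```python
# Linear additive QSAR belief: Y = theta_0 + sum_ij theta_ij * X_ij, where
# X_ij = 1 if substituent j is placed at site i. Names and values are illustrative.
sites = ["X", "Y"]
theta_0 = 1.0
theta = {
    ("X", "H"): 0.0, ("X", "OH"): 0.4, ("X", "CH3"): -0.1,
    ("Y", "H"): 0.0, ("Y", "OH"): 0.2, ("Y", "CH3"): 0.3,
}

def predict(molecule):
    """Predicted response for a molecule given as {site: substituent}."""
    return theta_0 + sum(theta[(site, molecule[site])] for site in sites)

y = predict({"X": "OH", "Y": "CH3"})   # 1.0 + 0.4 + 0.3
```

The payoff of the parametric form is compactness: a handful of coefficients describe beliefs about every combination of substituents, so one experiment updates beliefs about many untested compounds at once.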
Parametric beliefs and drug discovery
If we sample points near the middle, we will have a difficult time estimating the function.
Parametric beliefs and drug discovery
Sampling near the endpoints produces more stable estimates. Now take this into higher dimensions.
Parametric beliefs and drug discovery
Knowledge gradient versus pure exploration for 99 compounds

[Figure: performance under the best possible vs. number of molecules tested (out of 99), comparing pure exploration and the knowledge gradient.]
Parametric beliefs and drug discovery
A more complex molecule:
» From this base molecule, we created problems with 10,000 compounds, and one with 87,120 compounds.

[Figure: base molecule with substituent sites R1–R5 and a list of potential substituents.]
Parametric beliefs and drug discovery
Compact representation on the 10,000-combination compound
» Results from 15 sample paths

[Figure: performance under the best possible vs. number of molecules tested.]
Parametric beliefs and drug discovery
Single sample path on the molecule with 87,120 combinations

[Figure: performance under the best possible vs. number of molecules tested.]
Parametric beliefs and drug discovery
Representing beliefs using linear regression has many applications:
» How do we find the optimal price of a product sold on the internet?
» Which internet ad will generate the most ad clicks?
» How will a customer, described by a set of attributes, respond to a price for a contract?
» What parameter settings produce the best results from my business simulator?
» What are the best features that I should include in a laptop?