High Confidence Off-Policy Evaluation (HCOPE)
Philip Thomas
CMU 15-899E, Real Life Reinforcement Learning, Fall 2015


Page 1:

High Confidence Off-Policy Evaluation (HCOPE)

Philip Thomas

CMU 15-899E, Real Life Reinforcement Learning, Fall 2015

Page 2:

Sequential Decision Problems

Agent → Action, 𝑎 → Environment → State, 𝑠 → Agent

(Actions and states may each be discrete or continuous; the state may be fully or partially observable.)

Page 3:

Example: Digital Marketing

Page 4:

Example: Educational Games

Page 5:

Example: Decision Support Systems

Page 6:

Example: Gridworld

Page 7:

Example: Mountain Car

Page 8:

Reinforcement Learning Algorithms

• Sarsa

• Q-learning

• LSPI

• Fitted Q Iteration

• REINFORCE

• Residual Gradient

• Continuous-Time Actor-Critic

• Value Gradient

• POWER

• PILCO

• PIPI

• Policy Gradient

• DQN

• Double Q-Learning

• Deterministic Policy Gradient

• NAC-LSTD

• INAC

• Average-Reward INAC

• Unbiased NAC

• Projected NAC

• Risk-sensitive policy gradient

• Natural Sarsa

• PGPE / PGPE-SyS

• True Online

• GTD/TDC

• ARP

• GPTD

• Auto-Actor Auto-Critic

• Approximate Value Iteration

Page 9:

If you apply an existing method, do you have confidence that it will work?

Page 10:

Notation

• 𝑠: State

• 𝑎: Action

• $S_t, A_t$: State and action at time $t$

• $\pi(a \mid s) = \Pr(A_t = a \mid S_t = s)$

• $\tau = (S_0, A_0, S_1, A_1, \ldots, S_L, A_L)$: a trajectory

• $G(\tau) \in [0,1]$: the (normalized) return of trajectory $\tau$

• $\rho(\pi) = \mathbf{E}[G(\tau) \mid \tau \sim \pi]$: the performance of policy $\pi$

Page 11:

Two Goals:

• High confidence off-policy evaluation (HCOPE)

• Safe Policy Improvement (SPI)

HCOPE: (Historical Data $\mathcal{D}$, Proposed Policy $\pi_e$, Confidence Level $\delta$) → a $1-\delta$ confidence lower bound on $\rho(\pi_e)$

SPI: (Historical Data $\mathcal{D}$, Performance baseline $\rho_-$, Confidence Level $\delta$) → an improved* policy $\pi$

*The probability that $\pi$'s performance is below $\rho_-$ is at most $\delta$.

Page 12:

High Confidence Off-Policy Evaluation

• Historical data: $\mathcal{D} = \{(\tau_i, \pi_i) : \tau_i \sim \pi_i\}_{i=1}^{n}$

• Evaluation policy, $\pi_e$

• Confidence level, $\delta$

• Compute $\text{HCOPE}(\pi_e \mid \mathcal{D}, \delta)$ such that

$\Pr\big(\rho(\pi_e) \ge \text{HCOPE}(\pi_e \mid \mathcal{D}, \delta)\big) \ge 1 - \delta$

(Historical Data $\mathcal{D}$, Proposed Policy $\pi_e$, Confidence Level $\delta$) → a $1-\delta$ confidence lower bound on $\rho(\pi_e)$

Page 13:

Importance Sampling

• We would like to estimate $\theta := \mathbf{E}[f(x) \mid x \sim p]$.

• Monte Carlo estimator: sample $X_1, \ldots, X_n$ from $p$ and set

$\hat{\theta}_n := \frac{1}{n} \sum_{i=1}^{n} f(X_i)$

• Nice properties:

  • The Monte Carlo estimator is strongly consistent: $\hat{\theta}_n \xrightarrow{a.s.} \theta$

  • The Monte Carlo estimator is unbiased for all $n \ge 1$: $\mathbf{E}[\hat{\theta}_n] = \theta$
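As a concrete illustration (a minimal sketch, not from the slides; `f` and `sample_p` are hypothetical placeholder callables), the Monte Carlo estimator is just an average over samples from $p$:

import numpy as np

def monte_carlo_estimate(f, sample_p, n, seed=0):
    """Estimate theta = E[f(x) | x ~ p] by averaging f over n samples drawn from p."""
    rng = np.random.default_rng(seed)
    xs = sample_p(rng, n)            # X_1, ..., X_n ~ p
    return float(np.mean(f(xs)))     # (1/n) * sum_i f(X_i)

# Example: estimate E[x^2] for x ~ N(0, 1), which should be close to 1.
# monte_carlo_estimate(lambda x: x**2, lambda rng, n: rng.normal(0.0, 1.0, n), 100_000)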

Page 14:

Importance Sampling

• We would like to estimate $\theta := \mathbf{E}[f(x) \mid x \sim p]$ …

• … but we can only sample from a distribution $q$, not $p$.

• Assume: if $q(x) = 0$ then $f(x)\,p(x) = 0$. Then:

$\mathbf{E}[f(x) \mid x \sim p] \;=\; \sum_{x \in \mathrm{supp}(p)} p(x) f(x) \;=\; \sum_{x \in \mathrm{supp}(q)} p(x) f(x) \;=\; \sum_{x \in \mathrm{supp}(q)} q(x)\, \frac{p(x)}{q(x)}\, f(x) \;=\; \mathbf{E}\!\left[\frac{p(x)}{q(x)} f(x) \,\middle|\, x \sim q\right]$

Page 15:

Importance Sampling

• We would like to estimate $\theta := \mathbf{E}[f(x) \mid x \sim p]$.

• Importance sampling estimator: sample $X_1, \ldots, X_n$ from $q$ and set

$\hat{\theta}_n := \frac{1}{n} \sum_{i=1}^{n} \frac{p(X_i)}{q(X_i)}\, f(X_i)$

• Nice properties (under mild assumptions):

  • The importance sampling estimator is strongly consistent: $\hat{\theta}_n \xrightarrow{a.s.} \theta$

  • The importance sampling estimator is unbiased for all $n \ge 1$: $\mathbf{E}[\hat{\theta}_n] = \theta$
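A matching sketch of the importance sampling estimator (again with hypothetical helper names; `p_pdf` and `q_pdf` are the two densities and must satisfy the support condition from the previous slide):

import numpy as np

def importance_sampling_estimate(f, p_pdf, q_pdf, sample_q, n, seed=0):
    """Estimate theta = E[f(x) | x ~ p] using samples drawn from q instead of p."""
    rng = np.random.default_rng(seed)
    xs = sample_q(rng, n)                # X_1, ..., X_n ~ q
    w = p_pdf(xs) / q_pdf(xs)            # importance weights p(X_i) / q(X_i)
    return float(np.mean(w * f(xs)))     # (1/n) * sum_i w_i * f(X_i)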

Page 16:

Importance Sampling

$\rho(\pi_e) \;=\; \mathbf{E}_{\tau \sim \pi_e}[G(\tau)] \;=\; \mathbf{E}_{\tau \sim \pi_b}\!\left[\frac{\Pr(\tau \mid \pi_e)}{\Pr(\tau \mid \pi_b)}\, G(\tau)\right]$

Here $\Pr(\tau \mid \pi)$ is the probability of trajectory $\tau$ under policy $\pi$; $\pi_e$ is the evaluation policy and $\pi_b$ is the behavior policy.

Page 17:

Importance Sampling for Reinforcement Learning

• $\rho(\pi_e) = \mathbf{E}_{\tau \sim \pi_e}[G(\tau)] = \mathbf{E}_{\tau \sim \pi_b}\!\left[\frac{\Pr(\tau \mid \pi_e)}{\Pr(\tau \mid \pi_b)}\, G(\tau)\right]$

• $\displaystyle \frac{\Pr(\tau \mid \pi_e)}{\Pr(\tau \mid \pi_b)}\, G(\tau) \;=\; \frac{\prod_{t=0}^{L} \Pr(S_t \mid \text{past})\, \Pr(A_t \mid \text{past}, \pi_e)}{\prod_{t=0}^{L} \Pr(S_t \mid \text{past})\, \Pr(A_t \mid \text{past}, \pi_b)}\, G(\tau) \;=\; \frac{\prod_{t=0}^{L} \Pr(A_t \mid \text{past}, \pi_e)}{\prod_{t=0}^{L} \Pr(A_t \mid \text{past}, \pi_b)}\, G(\tau) \;=\; \prod_{t=0}^{L} \frac{\pi_e(A_t \mid S_t)}{\pi_b(A_t \mid S_t)}\, G(\tau)$

• The importance sampling estimate from a single trajectory is therefore

$\rho(\pi_e, \tau, \pi_b) \;=\; \left(\prod_{t=0}^{L} \frac{\pi_e(A_t \mid S_t)}{\pi_b(A_t \mid S_t)}\right) G(\tau)$

(D. Precup, R. S. Sutton, and S. Dasgupta, 2001)
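In code, the per-trajectory importance sampling estimate looks like the sketch below (a hypothetical interface: `pi_e(a, s)` and `pi_b(a, s)` return the probability of taking action a in state s, and `G` is the observed return of the trajectory). Products of many probability ratios can overflow or underflow, so log-space accumulation is often used in practice.

def is_return(states, actions, G, pi_e, pi_b):
    """rho(pi_e, tau, pi_b) = (prod_t pi_e(A_t|S_t) / pi_b(A_t|S_t)) * G(tau)."""
    weight = 1.0
    for s, a in zip(states, actions):
        weight *= pi_e(a, s) / pi_b(a, s)   # one probability ratio per time step
    return weight * G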

Page 18:

Per-Decision Importance Sampling

• Use importance sampling to estimate each reward $R_t$ separately.

• Still an unbiased and strongly consistent estimator of $\rho(\pi_e)$.

• Often has lower variance than ordinary importance sampling.
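The slides do not spell the estimator out; one standard form of per-decision importance sampling (following Precup et al.) weights each reward only by the probability ratios for the actions taken up to that time step. A sketch, using the same hypothetical policy interface as above and $\gamma = 1$ for the undiscounted setting:

def pdis_return(states, actions, rewards, pi_e, pi_b, gamma=1.0):
    """Per-decision importance sampling estimate of the return of pi_e from one trajectory."""
    total, weight = 0.0, 1.0
    for t, (s, a, r) in enumerate(zip(states, actions, rewards)):
        weight *= pi_e(a, s) / pi_b(a, s)   # importance weight for the prefix up to time t
        total += (gamma ** t) * weight * r  # each R_t gets only its own prefix weight
    return total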

Page 19:

(Historical Data $\mathcal{D}$, Proposed Policy $\pi_e$, Confidence Level $\delta$) → a $1-\delta$ confidence lower bound on $\rho(\pi_e)$

Expanded: $\mathcal{D}$ and $\pi_e$ give the importance weighted returns $\{\rho(\pi_e, \tau_i, \pi_i)\}_{i=1}^{n}$, where

$\rho(\pi_e, \tau, \pi_b) \;=\; \left(\prod_{t=0}^{L} \frac{\pi_e(A_t \mid S_t)}{\pi_b(A_t \mid S_t)}\right) G(\tau)$

Then ? (together with the confidence level $\delta$) → a $1-\delta$ confidence lower bound on $\rho(\pi_e)$

Page 20:

Chernoff-Hoeffding Inequality

• Let $X_1, \ldots, X_n$ be $n$ independent and identically distributed random variables with $X_i \in [0, b]$.

• Then with probability at least $1 - \delta$:

$\mathbf{E}[X_i] \;\ge\; \frac{1}{n} \sum_{i=1}^{n} X_i \;-\; b \sqrt{\frac{\ln(1/\delta)}{2n}}$
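As a sketch (not from the slides), the bound is a one-liner given the samples, the range b, and δ:

import numpy as np

def hoeffding_lower_bound(x, b, delta):
    """1 - delta confidence lower bound on E[X_i] for i.i.d. X_i in [0, b]."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    return float(x.mean() - b * np.sqrt(np.log(1.0 / delta) / (2.0 * n)))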

Page 21:

With probability at least $1 - \delta$:

$\mathbf{E}[X_i] \;\ge\; \frac{1}{n} \sum_{i=1}^{n} X_i \;-\; b \sqrt{\frac{\ln(1/\delta)}{2n}}$

Applying this with $X_i = \rho(\pi_e, \tau_i, \pi_i)$:

$\rho(\pi_e) \;=\; \mathbf{E}\big[\rho(\pi_e, \tau_i, \pi_i)\big] \;\ge\; \frac{1}{n} \sum_{i=1}^{n} \rho(\pi_e, \tau_i, \pi_i) \;-\; b \sqrt{\frac{\ln(1/\delta)}{2n}}$
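Putting the two pieces together gives a first HCOPE routine. This is a sketch that builds on the `is_return` and `hoeffding_lower_bound` sketches above; `b` must upper-bound the importance weighted returns, which is exactly where this approach will run into trouble.

def hcope_hoeffding(data, pi_e, b, delta):
    """data: iterable of (states, actions, G, pi_b) tuples, one per trajectory in D.
    Returns a 1 - delta confidence lower bound on rho(pi_e)."""
    weighted_returns = [is_return(states, actions, G, pi_e, pi_b)
                        for (states, actions, G, pi_b) in data]
    return hoeffding_lower_bound(weighted_returns, b, delta)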


Page 23:

(Historical Data $\mathcal{D}$, Proposed Policy $\pi_e$, Confidence Level $\delta$) → a $1-\delta$ confidence lower bound on $\rho(\pi_e)$

Expanded: $\mathcal{D}$ and $\pi_e$ give the importance weighted returns $\{\rho(\pi_e, \tau_i, \pi_i)\}_{i=1}^{n}$; a concentration inequality (together with $\delta$) then gives the lower bound

$\frac{1}{n} \sum_{i=1}^{n} \rho(\pi_e, \tau_i, \pi_i) \;-\; b \sqrt{\frac{\ln(1/\delta)}{2n}}$

Page 24:

Example: Mountain Car

Page 25:

Example: Mountain Car

• Using 100,000 trajectories

• Evaluation policy’s true performance is 0.19 ∈ [0,1].

• We get a 95% confidence lower bound of:

−5,831,000

Page 26:

What went wrong?

$\rho(\pi_e, \tau, \pi_b) \;=\; \left(\prod_{t=0}^{L} \frac{\pi_e(A_t \mid S_t)}{\pi_b(A_t \mid S_t)}\right) G(\tau)$

The importance weight is a product of $L+1$ probability ratios, so its range $b$ can be astronomically large.

Page 27:

What went wrong?

$\mathbf{E}[X_i] \;\ge\; \frac{1}{n} \sum_{i=1}^{n} X_i \;-\; b \sqrt{\frac{\ln(1/\delta)}{2n}}$

$b \approx 10^{9.4}$

Largest observed importance weighted return: 316.
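To see the scale of the problem (rough arithmetic, not from the slides): with $b \approx 10^{9.4} \approx 2.5 \times 10^{9}$, $n = 100{,}000$, and $\delta = 0.05$,

$b \sqrt{\frac{\ln(1/\delta)}{2n}} \;\approx\; 2.5 \times 10^{9} \times \sqrt{\frac{3.0}{200{,}000}} \;\approx\; 10^{7},$

so the confidence term alone is on the order of ten million, which is why the bound lands millions below zero even though the quantity being bounded lies in $[0,1]$.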

Page 28:

Another problem:

$\rho(\pi_e, \tau, \pi_b) \;=\; \left(\prod_{t=0}^{L} \frac{\pi_e(A_t \mid S_t)}{\pi_b(A_t \mid S_t)}\right) G(\tau)$

• With one behavior policy, the importance weighted returns are independent and identically distributed.

• With more than one behavior policy, they are independent but not identically distributed.

Page 29:

Conservative Policy Iteration

• ≈ 1,000,000 trajectories for a single policy improvement.

(S. Kakade and J. Langford, 2002)

Page 30:

PAC-RL

• $\approx 10^{17}$ time steps to guarantee convergence to a near-optimal policy.

(T. Lattimore and M. Hutter, 2012)

Page 31:

Thesis

High Confidence Off-Policy Evaluation (HCOPE)

and

Safe Policy Improvement (SPI)

are tractable using a practical amount of data.

Page 32:

Page 33:

Recap: $\mathcal{D}$ and $\pi_e$ give the importance weighted returns $\{\rho(\pi_e, \tau_i, \pi_i)\}_{i=1}^{n}$; a concentration inequality (together with $\delta$) turns them into the $1-\delta$ confidence lower bound $\frac{1}{n} \sum_{i=1}^{n} \rho(\pi_e, \tau_i, \pi_i) - b \sqrt{\frac{\ln(1/\delta)}{2n}}$.

Pages 34–37:

Page 38:

Extending Maurer’s Inequality

• First Key Idea:

  • Generalize: random variables with different ranges.

  • Specialize: random variables with the same mean.

Page 39:

Extending Maurer’s Inequality

• Second Key Idea:

  • Removing the upper tail only decreases the expected value.

Page 40:

Page 41:

Tradeoff

Page 42:

Threshold Optimization

• Use 20% of the data to optimize 𝑐.

• Use 80% to compute lower bound with optimized 𝑐.
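The truncated-bound computation itself is not reproduced here; the sketch below only shows the 20/80 split, with `bound_fn(samples, c, delta)` standing in for whichever lower-bound routine is used (e.g., the collapsed-upper-tail bound) and `candidate_cs` a hypothetical grid of thresholds. Here the threshold is simply the one that maximizes the bound on the small split; the full method instead maximizes a prediction of the bound that the larger split will produce, as described on the next slide.

import numpy as np

def optimize_then_bound(weighted_returns, candidate_cs, delta, bound_fn, seed=0):
    """Pick the threshold c using 20% of the data, report the bound from the other 80%."""
    rng = np.random.default_rng(seed)
    x = rng.permutation(np.asarray(weighted_returns, dtype=float))
    n_opt = int(0.2 * len(x))
    opt_split, eval_split = x[:n_opt], x[n_opt:]
    best_c = max(candidate_cs, key=lambda c: bound_fn(opt_split, c, delta))
    return bound_fn(eval_split, best_c, delta)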

Page 43:

Given $n$ samples, $\mathcal{X} := \{X_i\}_{i=1}^{n}$, predict what the lower bound would be if it were computed from $m$ samples rather than $n$.

Page 44:

Page 45:

95% confidence lower bound on the mean, by concentration inequality:

  CUT: 0.145
  Chernoff-Hoeffding: −5,831,000
  Maurer: −129,703
  Anderson: 0.055
  Bubeck et al.: −0.046

Page 46:

Digital Marketing Example

• 10 real-valued features

• Two groups of campaigns to choose between

• User interactions limited to 𝐿 = 10

• Data collected from a Fortune 20 company

• Data was not used directly.


Pages 47–52:

Example: Digital Marketing

Page 53:

We can now evaluate policies proposed by RL algorithms without executing them, and in a way that gives users confidence that the new policy will actually work.

Page 54:

Two Goals:

• High confidence off-policy evaluation (HCOPE)

• Safe Policy Improvement (SPI)

HCOPE: (Historical Data $\mathcal{D}$, Proposed Policy $\pi_e$, Confidence Level $\delta$) → a $1-\delta$ confidence lower bound on $\rho(\pi_e)$

SPI: (Historical Data $\mathcal{D}$, Performance baseline $\rho_-$, Confidence Level $\delta$) → an improved* policy $\pi$

*The probability that $\pi$'s performance is below $\rho_-$ is at most $\delta$.

Page 55:

Safe Policy Improvement (SPI)

$\text{SPI}(\mathcal{D}, \rho_-, \delta) \in \{\text{NO SOLUTION FOUND}\} \cup \Pi$

$\Pr\big(\text{SPI}(\mathcal{D}, \rho_-, \delta) \in \{\pi : \rho(\pi) < \rho_-\}\big) < \delta$

Exact: $< \delta$.  Approximate: $\approx \delta$.

Page 56:

$\text{SPI}(\mathcal{D}, \rho_-, \delta)$:

1. Return NO SOLUTION FOUND.

Page 57:

$\text{SPI}(\mathcal{D}, \rho_-, \delta)$:

1. Return NO SOLUTION FOUND.

$\Pr\big(\text{SPI}(\mathcal{D}, \rho_-, \delta) \in \{\pi : \rho(\pi) < \rho_-\}\big) < \delta$

Page 58:

• Want a batch RL algorithm that:

  • Satisfies this inequality

  • Often returns a policy

$\text{SPI}(\mathcal{D}, \rho_-, \delta)$:

1. Return NO SOLUTION FOUND.

$\Pr\big(\text{SPI}(\mathcal{D}, \rho_-, \delta) \in \{\pi : \rho(\pi) < \rho_-\}\big) < \delta$

Page 59:

Thoughts?

Page 60:

Historical Data is split into a Training Set (20%), used to produce a Candidate Policy, and a Testing Set (80%), used for the “Safety” Test.
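A hedged sketch of the structure in this diagram (hypothetical helpers: `propose_candidate` trains a candidate policy on the training set, and `hcope_lower_bound` is any HCOPE routine such as the ones sketched earlier):

import random

def safe_policy_improvement(D, rho_minus, delta, propose_candidate, hcope_lower_bound, seed=0):
    """Return a candidate policy only if its 1 - delta lower bound clears the baseline."""
    D = list(D)
    random.Random(seed).shuffle(D)
    n_train = int(0.2 * len(D))
    train, test = D[:n_train], D[n_train:]
    pi_c = propose_candidate(train)                        # candidate from the 20% training set
    if hcope_lower_bound(pi_c, test, delta) >= rho_minus:  # "safety" test on the 80% testing set
        return pi_c
    return "NO SOLUTION FOUND"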

Pages 61–63:

Page 64:

Approximate Confidence Intervals

• What if we knew that the importance weighted returns were normally distributed?

  • One-sided Student’s t-test.
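A sketch of the corresponding one-sided lower bound (exact only if the importance weighted returns really are normally distributed, hence "approximate"):

import numpy as np
from scipy import stats

def t_lower_bound(x, delta):
    """Approximate 1 - delta lower bound on the mean via a one-sided Student's t interval."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    return float(x.mean() - stats.t.ppf(1.0 - delta, df=n - 1) * x.std(ddof=1) / np.sqrt(n))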

Page 65:

Page 66:

Approximate Confidence Intervals

• Bootstrap (BCa)

  • Estimate the CDF with the sample CDF:

$F_n(x) := \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{X_i \le x}$

• What if we assume that the samples come from a distribution like $F_n$?

  • Sort $X_1, \ldots, X_n$ and return $X_{\delta n}$.
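BCa adds bias and acceleration corrections on top of this basic idea; the plain percentile version below is only a sketch of the core (resample from the sample CDF $F_n$, take the $\delta$-quantile of the resampled means), not the BCa procedure itself.

import numpy as np

def percentile_bootstrap_lower_bound(x, delta, n_boot=2000, seed=0):
    """Approximate 1 - delta lower bound on the mean: delta-quantile of bootstrap means."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    boot_means = np.array([rng.choice(x, size=len(x), replace=True).mean()
                           for _ in range(n_boot)])
    return float(np.quantile(boot_means, delta))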

Page 67:

ksdensity(gamrnd(2, 50, 10000000, 1))

(MATLAB: a kernel density estimate of 10,000,000 samples drawn from a gamma distribution with shape 2 and scale 50 — a right-skewed distribution.)

Pages 68–71:

Page 72:

Example: Gridworld

Page 73:

Example: Mountain Car

Page 74:

Example: Digital Marketing

Page 75:

DAEDALUS

• Apply SPI algorithm repeatedly.

• Per-iteration guarantee.

• DAEDALUS2: Exact HCPI until the first change to the policy, then approximate.

Page 76:

Example: Gridworld

Page 77:

Example: Mountain Car

Page 78:

Example: Digital Marketing

Page 79:

Safe policy improvement is tractable

A policy improvement algorithm that has a low probability of returning a bad policy.