Counterfactual Risk Minimization: Learning from logged bandit feedback
Adith Swaminathan, Thorsten Joachims
Software: http://www.cs.cornell.edu/~adith/poem/


Page 1: Counterfactual Risk Minimization

Counterfactual Risk Minimization
Learning from logged bandit feedback

Adith Swaminathan, Thorsten Joachims
Software: http://www.cs.cornell.edu/~adith/poem/

Page 2: Counterfactual Risk Minimization

Learning frameworks (input x, prediction y)

                    Online          Batch
Full Information    Perceptron, …   SVM, …
Bandit Feedback     LinUCB, …       ?

Page 3: Counterfactual Risk Minimization

Logged bandit feedback is everywhere!

Alice: Want tech news → System shows: SpaceX launch → Alice: Cool!

x_i: Query   y_i: Prediction   δ_i: Feedback

x_1 = (Alice, tech), y_1 = (SpaceX), δ_1 = 1

Page 4: Counterfactual Risk Minimization

Goal

• Risk of h: 𝕏 ↦ 𝕐:  R(h) = 𝔼_x[δ(x, h(x))]

• Find h* ∈ ℋ with minimum risk

• Can we find h* using data 𝒟 collected from h_0?

  x_1 = (Alice, tech),   y_1 = (SpaceX), δ_1 = 1
  x_2 = (Alice, sports), y_2 = (F1),     δ_2 = 5

Page 5: Counterfactual Risk Minimization

Learning by replaying logs?

• Training/evaluation from logged data is counterfactual [Bottou et al.]

  Logged: x_1 = (Alice, tech),   y_1 = (SpaceX), δ_1 = 1
          x_2 = (Alice, sports), y_2 = (F1),     δ_2 = 5

  Replay x_1 = (Alice, tech) under a new policy that predicts y_1′ = Apple Watch: what would Alice do??

• x ∼ Pr(𝕏) as before; however, y = h_0(x), not h(x)

Page 6: Counterfactual Risk Minimization

Stochastic policies to the rescue!

• Stochastic h: 𝕏 ↦ Δ(𝕐), y ∼ h(x)

  R(h) = 𝔼_x 𝔼_{y∼h(x)}[δ(x, y)]

[Figure: distributions of h_0 and h over 𝕐, marking where an action such as the SpaceX launch is likely under one policy and unlikely under the other]
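A minimal sketch of the logging side, assuming a softmax policy over a finite candidate set (sample_action and its interface are illustrative, not from the paper); the point is that the probability of the action actually shown gets recorded:

```python
import numpy as np

def sample_action(scores, rng):
    """Draw y ~ h(x) from a softmax over candidate scores; return (y, h(y|x))."""
    scores = scores - scores.max()                 # stabilize the softmax
    probs = np.exp(scores) / np.exp(scores).sum()
    y = rng.choice(len(probs), p=probs)            # sample instead of taking argmax
    return y, probs[y]                             # log the propensity with the action

# e.g. y, p = sample_action(np.array([2.0, 0.5, 0.1]), np.random.default_rng(0))
```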

Page 7: Counterfactual Risk Minimization

Counterfactual risk estimators

• 𝒟 = {(x_1, y_1, δ_1, p_1), (x_2, y_2, δ_2, p_2), …, (x_n, y_n, δ_n, p_n)}

• p_i = h_0(y_i | x_i) … the propensity [Rosenbaum et al.]

Basic importance sampling [Owen]:

  𝔼_x 𝔼_{y∼h}[δ(x, y)] = 𝔼_x 𝔼_{y∼h_0}[δ(x, y) · h(y|x) / h_0(y|x)]

  (performance of the new system = samples from the old system × importance weight)

  R̂_𝒟(h) = (1/n) Σ_{i=1}^{n} δ_i · h(y_i|x_i) / p_i
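Read as code, the estimator is a one-liner; a numpy sketch (function name and inputs are my own framing, and the h probabilities in the example are hypothetical):

```python
import numpy as np

def ips_risk(delta, propensity, h_prob):
    """Basic importance sampling estimate of R(h) from a log collected by h0.

    delta:      observed feedback delta_i for the logged action y_i
    propensity: p_i = h0(y_i | x_i), recorded at logging time
    h_prob:     h(y_i | x_i), the new policy's probability of the same action
    """
    return np.mean(delta * h_prob / propensity)   # (1/n) sum_i delta_i h(y_i|x_i)/p_i

# The four-record Alice log from Page 9, with made-up h(y_i|x_i) values:
r_hat = ips_risk(np.array([5.0, 1.0, 2.0, 1.0]),
                 np.full(4, 0.9),
                 np.array([0.1, 0.9, 0.2, 0.9]))
```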

Page 8: Counterfactual Risk Minimization

Story so far

Logged bandit data → Fix h vs. h_0 mismatch → Control variance → Tractable bound

Page 9: Counterfactual Risk Minimization

Importance sampling causes non-uniform variance!


x_1 = (Alice, sports), y_1 = (F1),        δ_1 = 5, p_1 = 0.9
x_2 = (Alice, tech),   y_2 = (SpaceX),    δ_2 = 1, p_2 = 0.9
x_3 = (Alice, movies), y_3 = (Star Wars), δ_3 = 2, p_3 = 0.9
x_4 = (Alice, tech),   y_4 = (Tesla),     δ_4 = 1, p_4 = 0.9

h_1: R̂_𝒟(h_1) = 1        h_2: R̂_𝒟(h_2) = 1.33

Want: an error bound that captures the variance of importance sampling

Page 10: Counterfactual Risk Minimization

Counterfactual Risk Minimization

• W.h.p. over 𝒟 ∼ h_0:

  ∀h ∈ ℋ:  R(h) ≤ R̂_𝒟(h) + O(√(Var_𝒟(h) / n)) + O(𝒩_∞(ℋ) / n)

  (empirical risk + variance regularization + capacity control)

  *conditions apply; refer to [Maurer et al.]


Learning objective:

  h_CRM = argmin_{h∈ℋ}  R̂_𝒟(h) + λ √(Var_𝒟(h) / n)
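The objective translates directly to code; a short sketch in the same notation (using the sample variance with ddof=1, my choice of estimator):

```python
import numpy as np

def crm_objective(delta, propensity, h_prob, lam):
    """CRM training objective: empirical IPS risk plus a variance penalty."""
    u = delta * h_prob / propensity   # per-example terms u_i = delta_i h(y_i|x_i)/p_i
    n = len(u)
    return u.mean() + lam * np.sqrt(u.var(ddof=1) / n)
```

With λ = 0 this falls back to plain IPS; the penalty steers learning away from policies whose importance weights make the estimate unreliable.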

Page 11: Counterfactual Risk Minimization

POEM: CRM algorithm for structured prediction

• CRFs: h_w ∈ ℋ_lin;  h_w(y|x) = exp(w·ϕ(x, y)) / ℤ(x; w)

• Policy Optimizer for Exponential Models:

  w* = argmin_w  (1/n) Σ_{i=1}^{n} δ_i h_w(y_i|x_i) / p_i + λ √(Var(h_w) / n) + μ ‖w‖²

Good: gradient descent, searches over infinitely many w
Bad: not convex in w
Ugly: resists stochastic optimization

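For the exponential model, h_w(y|x) is a softmax over joint features; a small sketch of the probability POEM needs for each logged pair (function name and array layout are assumptions; for large structured output spaces ℤ(x; w) needs structure, e.g. a per-label factorization as in the multilabel experiments, to stay tractable):

```python
import numpy as np

def policy_prob(w, phi_logged, phi_candidates):
    """h_w(y_i|x_i) = exp(w . phi(x_i, y_i)) / Z(x_i; w).

    phi_logged:     feature vector phi(x_i, y_i) of the logged action
    phi_candidates: one row phi(x_i, y') per candidate output y'
    """
    log_z = np.logaddexp.reduce(phi_candidates @ w)   # log Z(x_i; w), computed stably
    return np.exp(phi_logged @ w - log_z)
```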

Page 12: Counterfactual Risk Minimization

Stochastically optimize √(Var(h_w) / n)?

• Taylor-approximate! Writing h_w^i for the per-example term δ_i h_w(y_i|x_i) / p_i:

  Var(h_w) ≤ A_{w_t} Σ_{i=1}^{n} h_w^i + B_{w_t} Σ_{i=1}^{n} (h_w^i)² + C_{w_t}

• During an epoch: Adagrad steps with ∇h_w^i + λ√n (A_{w_t} ∇h_w^i + 2 B_{w_t} h_w^i ∇h_w^i)

• After an epoch: w_{t+1} ↤ w; recompute A_{w_{t+1}}, B_{w_{t+1}}

[Figure: the majorizing bound constructed at w_t is minimized to reach w_{t+1}]

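One way to render the epoch structure in code: a self-contained sketch on toy data, where A and B come from my own tangent linearization of √(Var/n) at w_t (an upper bound, since √· is concave), so their exact bookkeeping may differ from POEM's A_{w_t}, B_{w_t}:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy logged bandit data: n examples, k candidate actions, d joint features.
n, k, d = 200, 5, 10
phi = rng.normal(size=(n, k, d))      # phi(x_i, y') for every candidate y'
logged = rng.integers(0, k, size=n)   # index of the logged action y_i
delta = rng.normal(size=n)            # logged feedback delta_i
p = np.full(n, 1.0 / k)               # propensities under a uniform h0
lam, lr = 0.1, 0.1

def u_and_grad(w, i):
    """Per-example term u_i = delta_i h_w(y_i|x_i)/p_i and its gradient in w."""
    scores = phi[i] @ w
    scores -= scores.max()                       # stabilize the softmax
    probs = np.exp(scores)
    probs /= probs.sum()                         # h_w(. | x_i)
    h = probs[logged[i]]
    # d h_w(y_i|x_i)/dw = h * (phi(x_i, y_i) - E_{y' ~ h_w} phi(x_i, y'))
    grad = (delta[i] / p[i]) * h * (phi[i, logged[i]] - probs @ phi[i])
    return delta[i] * h / p[i], grad

w = np.zeros(d)
G = np.zeros(d)                                  # Adagrad accumulator
for epoch in range(10):
    # After each epoch: recompute the Taylor coefficients at the current w_t.
    u = np.array([u_and_grad(w, i)[0] for i in range(n)])
    s1, var = u.sum(), max(u.var(ddof=1), 1e-12)
    slope = 0.5 / np.sqrt(n * var)               # d sqrt(Var/n) / d Var at w_t
    A = -2.0 * slope * s1 / (n * (n - 1))        # coefficient on sum_i u_i
    B = slope / (n - 1)                          # coefficient on sum_i u_i^2
    # During the epoch: the frozen bound decomposes over examples,
    # so ordinary per-example Adagrad steps apply.
    for i in rng.permutation(n):
        u_i, g_i = u_and_grad(w, i)
        step = g_i * (1.0 + lam * n * (A + 2.0 * B * u_i))
        G += step ** 2
        w -= lr * step / np.sqrt(G + 1e-8)
```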

Page 13: Counterfactual Risk Minimization

Experiment

• Supervised → Bandit conversion for multilabel classification [Agarwal et al.]

• δ(x, y) = Hamming(y*(x), y) (smaller is better)

• LibSVM datasets:
  • Scene (few features, labels and data)
  • Yeast (many labels)
  • LYRL (many features and data)
  • TMC (many features, labels and data)

• Validate hyper-parameters (λ, μ) using R̂_{𝒟_val}(h)

• Report expected Hamming loss on the supervised test set
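The supervised → bandit conversion is easy to sketch (h0_sample here is a hypothetical stand-in for sampling predictions from the logging CRF):

```python
import numpy as np

def supervised_to_bandit(X, Y_true, h0_sample, rng):
    """Turn a supervised multilabel dataset into logged bandit feedback.

    h0_sample(x, rng) must return (y, p) with y ~ h0(.|x) and p = h0(y|x);
    only the loss of the sampled y is revealed, as in a real log.
    """
    log = []
    for x, y_star in zip(X, Y_true):
        y, p = h0_sample(x, rng)
        delta = int(np.sum(y != y_star))   # Hamming(y*(x), y), smaller is better
        log.append((x, y, delta, p))
    return log
```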

Page 14: Counterfactual Risk Minimization

Approaches

• Baselines
  • h_0: supervised CRF trained on 5% of the training data

• Proposed
  • IPS (no variance penalty; extends [Bottou et al.])
  • POEM

• Skylines
  • Supervised CRF (independent logistic regression)

Page 15: Counterfactual Risk Minimization

(1) Does variance regularization help?

Test set expected Hamming loss δ (smaller is better):

                  Scene   Yeast   LYRL    TMC
h_0               1.543   5.547   1.463   3.445
IPS               1.519   4.614   1.118   3.023
POEM              1.143   4.517   0.996   2.522
Supervised CRF    0.659   2.822   0.222   1.189

Page 16: Counterfactual Risk Minimization

(2) Is it efficient?

Avg time (s)   Scene   Yeast   LYRL     TMC
POEM (B)       75.20   94.16   561.12   949.95
POEM (S)        4.71    5.02   120.09   276.13
CRF             4.86    3.28    62.93    99.18

• POEM recovers the same performance at a fraction of the batch L-BFGS cost (B: batch, S: stochastic)
• Scales like a supervised CRF while learning from bandit feedback

Page 17: Counterfactual Risk Minimization

(3) Does generalization improve as n → ∞?

[Figure: test set expected Hamming loss δ vs. replay count (1, 2, 4, …, 256) for CRF, h_0, and POEM]

Page 18: Counterfactual Risk Minimization

(4) Does stochasticity of h_0 affect learning?

[Figure: test set expected Hamming loss δ vs. parameter multiplier (0.5–32) for h_0 and POEM, sweeping h_0 from stochastic to deterministic]

Page 19: Counterfactual Risk Minimization

Conclusion

• CRM principle to learn from logged bandit feedback
  • Variance regularization

• POEM for structured output prediction
  • Scales like a supervised CRF, learns from bandit feedback

• Contact: [email protected]

• POEM available at http://www.cs.cornell.edu/~adith/poem/

• Long paper: Counterfactual risk minimization: Learning from logged bandit feedback, http://jmlr.org/proceedings/papers/v37/swaminathan15.html

• Thanks!

Page 20: Counterfactual Risk Minimization

References

1. Art B. Owen. 2013. Monte Carlo Theory, Methods and Examples.

2. Paul R. Rosenbaum and Donald B. Rubin. 1983. The central role of the propensity score in observational studies for causal effects. Biometrika 70(1), 41-55.

3. Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis X. Charles, D. Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. 2013. Counterfactual reasoning and learning systems: the example of computational advertising. Journal of Machine Learning Research 14(1), 3207-3260.

4. Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, and Robert Schapire. 2014. Taming the monster: A fast and simple algorithm for contextual bandits. Proceedings of the 31st International Conference on Machine Learning, 1638-1646.

5. Andreas Maurer and Massimiliano Pontil. 2009. Empirical Bernstein bounds and sample-variance penalization. Proceedings of the 22nd Conference on Learning Theory.

6. Adith Swaminathan and Thorsten Joachims. 2015. Counterfactual risk minimization: Learning from logged bandit feedback. Proceedings of the 32nd International Conference on Machine Learning.