Transcript
Page 1: Bayesian Counterfactual Risk Minimization11-16... · Bayesian Counterfactual Risk Minimization International Conference on Machine Learning Long Beach, CA, June 11, 2019 Ted Sandler

Bayesian Counterfactual Risk Minimization

International Conference on Machine Learning Long Beach, CA, June 11, 2019

Ted Sandler (sandler@) Amazon Music

Ben London (blondon@) Amazon Music

Page 2

Learning from Logged Data

Design/train new rec policy

A/B test new policy vs. old policy

Launch new policy

if unsuccessful

if successful

Pull log data

e.g., user i listened to item j

Page 3

Problem 1: Bandit Feedback

• Only observe outcomes from actions taken

• e.g., only get feedback on recommendations

Alexa, play music

Here’s a station you might like …

Page 4

Problem 2: Bias

low support → who knows?

high support → better estimate

• Logged data is biased

• Policy typically not uniform distribution

• User typically doesn’t see everything

• Bias affects inferences

• Self-fulfilling prophecies; “rich get richer”

• Miss key insights due to insufficient support

Page 5

IPS Policy Optimization

• Use inverse propensity score (IPS) estimator

logged propensity $p_i = \pi_0(a_i \mid x_i)$

• IPS is an unbiased estimator of expected reward

• Caveat: logging policy must have full support

IPS estimate of the expected reward:

$\mathbb{E}_{(x,\rho) \sim D}\, \mathbb{E}_{a \sim \pi(x)} [\rho(x, a)] \approx \frac{1}{n} \sum_{i=1}^{n} r_i \frac{\pi(a_i \mid x_i)}{p_i}$

Policy optimization objective:

$\arg\min_{\pi} \frac{1}{n} \sum_{i=1}^{n} -r_i \frac{\pi(a_i \mid x_i)}{p_i}$
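The IPS estimate can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the log format `(x, a, r, p)`, the function name `ips_estimate`, and the toy two-station data are all assumptions made for the example.

```python
# Hypothetical log format: (context x, action a, reward r, logged propensity p).
def ips_estimate(logs, target_policy):
    """Average of r_i * pi(a_i | x_i) / p_i over the logged records."""
    total = 0.0
    for x, a, r, p in logs:
        if p <= 0.0:
            # Mirrors the caveat that the logging policy needs full support.
            raise ValueError("logged propensity must be positive")
        total += r * target_policy(a, x) / p
    return total / len(logs)

# Toy data (made up): the logging policy played "rock" with probability 0.8
# and "jazz" with probability 0.2.
logs = [
    ("u1", "rock", 1.0, 0.8),
    ("u2", "rock", 0.0, 0.8),
    ("u3", "rock", 1.0, 0.8),
    ("u4", "rock", 1.0, 0.8),
    ("u5", "jazz", 1.0, 0.2),
]

def uniform_policy(a, x):
    """New policy pi(a | x): picks either station with probability 0.5."""
    return 0.5

estimate = ips_estimate(logs, uniform_policy)
# The optimization objective minimizes the negated estimate.
ips_objective = -estimate
```

The single "jazz" record carries weight 0.5 / 0.2 = 2.5, while each "rock" record carries only 0.625, which is how the estimator corrects for the logging policy's preference for "rock".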

Page 6

IPS Policy Optimization

• Use inverse propensity score (IPS) estimator

logged propensity $p_i = \pi_0(a_i \mid x_i)$

[Figure: logging policy $\pi_0(a \mid x)$ vs. new policy $\pi(a \mid x)$]

• Problem: IPS has high variance

$\arg\min_{\pi} \frac{1}{n} \sum_{i=1}^{n} -r_i \frac{\pi(a_i \mid x_i)}{p_i}$
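One way to see the variance problem is to compute, by enumeration, the exact variance of a single IPS term when the logging policy rarely plays an action the new policy favors. The sketch below assumes a single fixed context with two actions; the function name and the numbers are hypothetical.

```python
def ips_term_variance(target, logging, reward):
    """Exact variance of r(a) * pi(a) / pi0(a) when a is drawn from pi0.

    All three arguments are dicts keyed by action (one fixed context).
    """
    mean = 0.0
    second_moment = 0.0
    for a, p0 in logging.items():
        term = reward[a] * target[a] / p0
        mean += p0 * term
        second_moment += p0 * term ** 2
    return second_moment - mean ** 2

# New policy and (deterministic) rewards, both made up for the example.
target = {"rock": 0.5, "jazz": 0.5}
reward = {"rock": 1.0, "jazz": 1.0}

# Logging policy that matches the new policy: every term equals 1.
var_matched = ips_term_variance(target, {"rock": 0.5, "jazz": 0.5}, reward)

# Logging policy that almost never plays "jazz": one term has weight 50.
var_skewed = ips_term_variance(target, {"rock": 0.99, "jazz": 0.01}, reward)
```

Both logging policies give the same expected value (the estimator stays unbiased), but the skewed one has a variance of roughly 24 per term because of the rare, heavily weighted "jazz" records.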

Page 7

CRM Principle

• Counterfactual Risk Minimization (CRM) principle

• Motivated by PAC risk analysis

• Stochastic optimization of variance regularizer is tricky

• Policy optimization for exponential models (POEM) algorithm

[Swaminathan & Joachims, ICML 2015]

$\arg\min_{\pi} \frac{1}{n} \sum_{i=1}^{n} -r_i \frac{\pi(a_i \mid x_i)}{p_i} + \lambda \sqrt{\widehat{\mathrm{Var}}(\pi, S)}$

(second term: variance regularization)
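A rough sketch of the variance-regularized objective, evaluated for one fixed policy, is given here. It assumes that S is the logged sample and that the variance is the unbiased sample variance of the per-record IPS terms; any further scaling of the variance term is not legible in the transcript and is left out. The function and variable names are hypothetical, and this evaluates the objective only; it is not the POEM optimizer.

```python
import math

def crm_objective(logs, target_policy, lam):
    """Mean of -r_i * pi(a_i | x_i) / p_i plus lam * sqrt(sample variance)."""
    terms = [-r * target_policy(a, x) / p for x, a, r, p in logs]
    n = len(terms)
    mean = sum(terms) / n
    # Unbiased sample variance of the per-record terms (needs n >= 2).
    variance = sum((t - mean) ** 2 for t in terms) / (n - 1)
    return mean + lam * math.sqrt(variance)

# Toy logged records (made up): (context, action, reward, logged propensity).
logs = [
    ("u1", "rock", 1.0, 0.8),
    ("u2", "rock", 0.0, 0.8),
    ("u3", "rock", 1.0, 0.8),
    ("u4", "rock", 1.0, 0.8),
    ("u5", "jazz", 1.0, 0.2),
]

def uniform_policy(a, x):
    return 0.5

plain_ips = crm_objective(logs, uniform_policy, lam=0.0)
regularized = crm_objective(logs, uniform_policy, lam=1.0)
```

With `lam=0.0` the objective reduces to the plain IPS objective; a positive `lam` penalizes policies whose IPS terms vary widely across the log.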

Page 8

Bayesian CRM Principle

• Bayesian Counterfactual Risk Minimization (CRM) principle

• Bayesian policy: $\pi_Q(a \mid x) = \Pr_{h \sim Q}\{h(x) = a\}, \quad h : \mathcal{X} \to \mathcal{A}$

• Motivated by PAC-Bayes risk analysis

• Takeaway: posterior should stay close to the prior

• What should the prior be? How about the logging policy!

[London & Sandler, ICML 2019]

$\arg\min_{Q} \frac{1}{n} \sum_{i=1}^{n} -r_i \frac{\pi_Q(a_i \mid x_i)}{p_i} + \lambda D_{\mathrm{KL}}(Q \,\|\, P)$

(second term: KL div. from prior to posterior)
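The KL-regularized objective can be illustrated with a deliberately small case: a finite set of deterministic hypotheses, so that both the Bayesian policy and the KL divergence can be computed exactly. The finite hypothesis set, the names, and the toy data are assumptions made for this sketch; the slides do not restrict Q to this form.

```python
import math

def bayesian_policy_prob(Q, hypotheses, a, x):
    """pi_Q(a | x): total posterior mass on hypotheses h with h(x) == a."""
    return sum(q for q, h in zip(Q, hypotheses) if h(x) == a)

def kl_divergence(Q, P):
    """KL(Q || P) for two discrete distributions over the same hypotheses."""
    return sum(q * math.log(q / p) for q, p in zip(Q, P) if q > 0.0)

def bayesian_crm_objective(logs, Q, P, hypotheses, lam):
    """IPS risk of the Bayesian policy plus lam * KL(Q || P)."""
    risk = sum(
        -r * bayesian_policy_prob(Q, hypotheses, a, x) / p
        for x, a, r, p in logs
    ) / len(logs)
    return risk + lam * kl_divergence(Q, P)

def always_rock(x):
    return "rock"

def always_jazz(x):
    return "jazz"

hypotheses = [always_rock, always_jazz]

# Prior chosen to reproduce the logging policy (0.8 rock, 0.2 jazz).
P = [0.8, 0.2]
Q = [0.5, 0.5]

# Toy logged records (made up): (context, action, reward, logged propensity).
logs = [
    ("u1", "rock", 1.0, 0.8),
    ("u2", "rock", 0.0, 0.8),
    ("u3", "rock", 1.0, 0.8),
    ("u4", "rock", 1.0, 0.8),
    ("u5", "jazz", 1.0, 0.2),
]

objective_at_prior = bayesian_crm_objective(logs, P, P, hypotheses, lam=1.0)
objective_at_Q = bayesian_crm_objective(logs, Q, P, hypotheses, lam=1.0)
```

When the posterior equals the prior the KL term vanishes, so the objective is just the IPS risk of the logging policy; moving Q away from P pays a penalty that grows with `lam`.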

Page 9

Application to Mixed Logits

• Mixed logit model

• Like a softmax with normally distributed parameters

• Risk bound motivates logging policy regularization

[London & Sandler, ICML 2019]

$h_{w,\gamma}(x) = \arg\max_{a}\; w \cdot \phi(x, a) + \gamma_a$

$w \sim \mathcal{N}(\mu, \Sigma)$, $\gamma \sim \mathrm{Gumbel}(0, 1)^k$

$\pi_Q(a \mid x) = \mathbb{E}_{(w,\gamma) \sim Q}\left[\mathbb{1}\{h_{w,\gamma}(x) = a\}\right] = \mathbb{E}_{w}\left[\frac{\exp(w \cdot \phi(x, a))}{\sum_{a'} \exp(w \cdot \phi(x, a'))}\right]$

$D_{\mathrm{KL}}(Q \,\|\, P) = O(\|\mu - \mu_0\|^2)$ (L2 distance to logging policy parameters)
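The mixed logit policy can be approximated by sampling. The sketch below assumes an isotropic covariance (a single standard deviation `sigma` in place of a full covariance matrix), a toy one-hot feature map `phi`, and hypothetical function names; it shows both views of the policy (Gumbel-perturbed argmax and expected softmax) along with the squared L2 distance to the logging policy parameters `mu0`.

```python
import math
import random

def score(w, feats):
    return sum(wi * fi for wi, fi in zip(w, feats))

def softmax_probs(w, phi, x, actions):
    scores = [score(w, phi(x, a)) for a in actions]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def sample_gumbel(rng):
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u))

def sample_action(mu, sigma, phi, x, actions, rng):
    """Draw h ~ Q and return h(x): Gaussian weights plus Gumbel noise per action."""
    w = [rng.gauss(m, sigma) for m in mu]
    noisy = [score(w, phi(x, a)) + sample_gumbel(rng) for a in actions]
    return actions[noisy.index(max(noisy))]

def mixed_logit_prob(mu, sigma, phi, x, actions, a, n_samples, rng):
    """Monte Carlo estimate of E_w[softmax], i.e. pi_Q(a | x)."""
    idx = actions.index(a)
    total = 0.0
    for _ in range(n_samples):
        w = [rng.gauss(m, sigma) for m in mu]
        total += softmax_probs(w, phi, x, actions)[idx]
    return total / n_samples

def squared_l2(mu, mu0):
    """Regularizer suggested by the bound: squared distance to mu0."""
    return sum((m - m0) ** 2 for m, m0 in zip(mu, mu0))

actions = ["rock", "jazz", "folk"]

def phi(x, a):
    # Toy feature map (an assumption): one-hot encoding of the action.
    return [1.0 if a == name else 0.0 for name in actions]

rng = random.Random(0)
mu0 = [1.0, 0.0, 0.0]  # logging policy parameters (made up)
mu = [1.0, 0.5, 0.0]   # candidate new policy parameters (made up)
prob_rock = mixed_logit_prob(mu, 0.1, phi, "u1", actions, "rock", 2000, rng)
penalty = squared_l2(mu, mu0)
```

Averaging the softmax over sampled weights avoids sampling the Gumbel noise at all, since the expectation over the Gumbel noise has the closed softmax form for each fixed `w`.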

Page 10

Contributions

• PAC risk bounds for Bayesian policies

• Application to mixed logits

• Logging policy prior motivates new regularizer

• Two new learning objectives (one convex) that minimize bounds

• Experiments show proposed methods gain up to 76% more reward than POEM, while also being simpler and more efficient

Visit poster #113 in Pacific Ballroom

