Bayesian Counterfactual Risk Minimization11-16... · Bayesian Counterfactual Risk Minimization...

Preview:

Citation preview

Bayesian Counterfactual Risk Minimization

International Conference on Machine Learning Long Beach, CA, June 11, 2019

Ted Sandler (sandler@) Amazon Music

Ben London (blondon@) Amazon Music

Learning from Logged Data

Design/train new rec policy

A/B test new policy vs

old policy

Launch new policy

if unsucce

ssful

if successful

Pull log data

e.g., user i listened to item j

• Only observe outcomes from actions taken

• e.g., only get feedback on recommendations

Problem 1: Bandit Feedback

Here’s a station you might like …

Alexa, play music

Problem 2: Bias

low support → who knows?

high support → better estimate

• Logged data is biased

• Policy typically not uniform distribution

• User typically doesn’t see everything

• Bias affects inferences

• Self-fulfilling prophecies; “rich get richer”

• Miss key insights due to insufficient support

IPS Policy Optimization• Use inverse propensity score (IPS) estimator

logged propensity pi = ⇡0(ai |xi)<latexit sha1_base64="8NS4mO5kWFTfLpxFhH3ZJ5ymWCA=">AAACSXicjVDLSsNAFJ3UV62vqks3g0VQkJJEre1CKbhxWcGqkIQwmU7q0JkkzEzEEPs7/oxuFfwMXYkrp2kXKgoeuHA45744QcKoVKb5YpSmpmdm58rzlYXFpeWV6urahYxTgUkXxywWVwGShNGIdBVVjFwlgiAeMHIZDE5G/uUNEZLG0bnKEuJx1I9oSDFSWvKr7dwtljiiH3i5Wbdts9Fs7Zr1g8Zew7I1aTX3bcsaJj4dwiPoJtQ3t5FPoXsHb32641drVt0sAP8mNTBBx6++ub0Yp5xECjMkpWOZifJyJBTFjAwrbipJgvAA9YmjaYQ4kV5ePDmEW1rpwTAWuiIFC/XrRI64lBkPdCdH6lr+9Ebib56TqrDp5TRKUkUiPD4UpgyqGI5igz0qCFYs0wRhQfWvEF8jgbDS4X67wrMeCaXeAHkGuW6OZeV/IV3YdUvzs/1a+3gSVxlsgE2wDSxwCNrgFHRAF2BwDx7BE3g2HoxX4934GLeWjMnMOviG0tQna3itoA==</latexit><latexit sha1_base64="8NS4mO5kWFTfLpxFhH3ZJ5ymWCA=">AAACSXicjVDLSsNAFJ3UV62vqks3g0VQkJJEre1CKbhxWcGqkIQwmU7q0JkkzEzEEPs7/oxuFfwMXYkrp2kXKgoeuHA45744QcKoVKb5YpSmpmdm58rzlYXFpeWV6urahYxTgUkXxywWVwGShNGIdBVVjFwlgiAeMHIZDE5G/uUNEZLG0bnKEuJx1I9oSDFSWvKr7dwtljiiH3i5Wbdts9Fs7Zr1g8Zew7I1aTX3bcsaJj4dwiPoJtQ3t5FPoXsHb32641drVt0sAP8mNTBBx6++ub0Yp5xECjMkpWOZifJyJBTFjAwrbipJgvAA9YmjaYQ4kV5ePDmEW1rpwTAWuiIFC/XrRI64lBkPdCdH6lr+9Ebib56TqrDp5TRKUkUiPD4UpgyqGI5igz0qCFYs0wRhQfWvEF8jgbDS4X67wrMeCaXeAHkGuW6OZeV/IV3YdUvzs/1a+3gSVxlsgE2wDSxwCNrgFHRAF2BwDx7BE3g2HoxX4934GLeWjMnMOviG0tQna3itoA==</latexit><latexit sha1_base64="8NS4mO5kWFTfLpxFhH3ZJ5ymWCA=">AAACSXicjVDLSsNAFJ3UV62vqks3g0VQkJJEre1CKbhxWcGqkIQwmU7q0JkkzEzEEPs7/oxuFfwMXYkrp2kXKgoeuHA45744QcKoVKb5YpSmpmdm58rzlYXFpeWV6urahYxTgUkXxywWVwGShNGIdBVVjFwlgiAeMHIZDE5G/uUNEZLG0bnKEuJx1I9oSDFSWvKr7dwtljiiH3i5Wbdts9Fs7Zr1g8Zew7I1aTX3bcsaJj4dwiPoJtQ3t5FPoXsHb32641drVt0sAP8mNTBBx6++ub0Yp5xECjMkpWOZifJyJBTFjAwrbipJgvAA9YmjaYQ4kV5ePDmEW1rpwTAWuiIFC/XrRI64lBkPdCdH6lr+9Ebib56TqrDp5TRKUkUiPD4UpgyqGI5igz0qCFYs0wRhQfWvEF8jgbDS4X67wrMeCaXeAHkGuW6OZeV/IV3YdUvzs/1a+3gSVxlsgE2wDSxwCNrgFHRAF2BwDx7BE3g2HoxX4934GLeWjMnMOviG0tQna3itoA==</latexit><latexit sha1_base64="8NS4mO5kWFTfLpxFhH3ZJ5ymWCA=">AAACSXicjVDLSsNAFJ3UV62vqks3g0VQkJJEre1CKbhxWcGqkIQwmU7q0JkkzEzEEPs7/oxuFfwMXYkrp2kXKgoeuHA45744QcKoVKb5YpSmpmdm58rzlYXFpeWV6urahYxTgUkXxywWVwGShNGIdBVVjFwlgiAeMHIZDE5G/uUNEZLG0bnKEuJx1I9oSDFSWvKr7dwtljiiH3i5Wbdts9Fs7Zr1g8Zew7I1aTX3bcsaJj4dwiPoJtQ3t5FPoXsHb32641drVt0sAP8mNTBBx6++ub0Yp5xECjMkpWOZifJyJBTFjAwrbipJgvAA9YmjaYQ4kV5ePDmEW1rpwTAWuiIFC/XrRI64lBkPdCdH6lr+9Ebib56TqrDp5TRKUkUiPD4UpgyqGI5igz0qCFYs0wRhQfWvEF8jgbDS4X67wrMeCaXeAHkGuW6OZeV/IV3YdUvzs/1a+3gSVxlsgE2wDSxwCNrgFHRAF2BwDx7BE3g2HoxX4934GLeWjMnMOviG0tQna3itoA==</latexit>

• IPS is an unbiased estimator of expected reward

• Caveat: logging policy must have full support

argmin⇡

1

n

nX

i=1

�ri⇡(ai |xi)

pi<latexit sha1_base64="vq06mY9CIP/+RPZL4LrwG8KBOrc=">AAACf3icjVFdSxwxFM1Mv7arrat99CV0ERXsMBntugsVhL74aKGrws50yGQzazDJDElmcUjzQwv9Ff4Cs+uCVVrohcDJuefmHk6KmjNt4vhXEL54+er1m87b7tr6u/cbvc2tC101itAxqXilrgqsKWeSjg0znF7VimJRcHpZ3Hxd9C/nVGlWye+mrWkm8EyykhFsPJX35ilWM8FkbtOaOZgewLRUmFjkrPRX3YjcshPkfkj4SeXsUeDle3hB/IS3Odt31qZLNxM1KzIbR0kSD4ajgzj6PDgcoMSD0fAoQcjVOXMu7/VRFC8L/hv0warO895dOq1II6g0hGOtJyiuTWaxMoxw6rppo2mNyQ2e0YmHEguqM7t05OCOZ6awrJQ/0sAl++eExULrVhReKbC51s97C/JvvUljymFmmawbQyV5WFQ2HJoKLsKGU6YoMbz1ABPFvFdIrrGPz/gvebJFtFNaav8CFC0UXlzp7v+FdJFE6DBKvh31T7+s4uqAbfAR7AEEjsEpOAPnYAwI+B2EwVqwHgbhbhiF8YM0DFYzH8CTCkf3fJ69cA==</latexit>

E(x,⇢)⇠D

Ea⇠⇡(x)

[⇢(x, a)] ⇡ 1

n

nX

i=1

r

i

⇡(ai

|xi

)

p

i

<latexit sha1_base64="PSf7DjZY4iJDoPC0SXiPXT/dCmw=">AAAChnicjVFdaxNBFJ1dP5rGr9g++nIxCAmUkK1K+6AQUMHHCqYtZNdldjLbDJ0vZmYlyzj9nz77D/wFziZ5sKLghQuHc869dzhTac6sm06/J+mdu/fu7/X2+w8ePnr8ZPD04NyqxhA6J4orc1lhSzmTdO6Y4/RSG4pFxelFdf2u0y++UmOZkp9dq2kh8JVkNSPYRaochPyDLv1ofZSblRpDXrWxq/cBNjzeEpqN1uOwgM4TrXgMRf8mx1obtb6BvDaY+Cx4GadsI0rP3mbhiwRTMsiPdnq3BHfEN1iXbBy8LlkoB8NsMt0U/BsM0a7OysHPfKlII6h0hGNrF9lUu8Jj4xjhNPTzxlKNyTW+oosIJRbUFn4TU4AXkVlCrUxs6WDD/j7hsbC2FVV0CuxW9k+tI/+mLRpXnxaeSd04Ksn2UN1wcAq6zGHJDCWOtxFgYlh8K5AVjqm4+DO3roh2SWsbN4BoQUSzsv3/C+n8eJK9nBx/ejWcvdnF1UPP0HM0Qhk6QTP0EZ2hOSLoR7KfHCSHaS+dpK/Tk601TXYzh+hWpbNfrhDCKg==</latexit>

IPS Policy Optimization• Use inverse propensity score (IPS) estimator

logged propensity pi = ⇡0(ai |xi)<latexit sha1_base64="8NS4mO5kWFTfLpxFhH3ZJ5ymWCA=">AAACSXicjVDLSsNAFJ3UV62vqks3g0VQkJJEre1CKbhxWcGqkIQwmU7q0JkkzEzEEPs7/oxuFfwMXYkrp2kXKgoeuHA45744QcKoVKb5YpSmpmdm58rzlYXFpeWV6urahYxTgUkXxywWVwGShNGIdBVVjFwlgiAeMHIZDE5G/uUNEZLG0bnKEuJx1I9oSDFSWvKr7dwtljiiH3i5Wbdts9Fs7Zr1g8Zew7I1aTX3bcsaJj4dwiPoJtQ3t5FPoXsHb32641drVt0sAP8mNTBBx6++ub0Yp5xECjMkpWOZifJyJBTFjAwrbipJgvAA9YmjaYQ4kV5ePDmEW1rpwTAWuiIFC/XrRI64lBkPdCdH6lr+9Ebib56TqrDp5TRKUkUiPD4UpgyqGI5igz0qCFYs0wRhQfWvEF8jgbDS4X67wrMeCaXeAHkGuW6OZeV/IV3YdUvzs/1a+3gSVxlsgE2wDSxwCNrgFHRAF2BwDx7BE3g2HoxX4934GLeWjMnMOviG0tQna3itoA==</latexit><latexit sha1_base64="8NS4mO5kWFTfLpxFhH3ZJ5ymWCA=">AAACSXicjVDLSsNAFJ3UV62vqks3g0VQkJJEre1CKbhxWcGqkIQwmU7q0JkkzEzEEPs7/oxuFfwMXYkrp2kXKgoeuHA45744QcKoVKb5YpSmpmdm58rzlYXFpeWV6urahYxTgUkXxywWVwGShNGIdBVVjFwlgiAeMHIZDE5G/uUNEZLG0bnKEuJx1I9oSDFSWvKr7dwtljiiH3i5Wbdts9Fs7Zr1g8Zew7I1aTX3bcsaJj4dwiPoJtQ3t5FPoXsHb32641drVt0sAP8mNTBBx6++ub0Yp5xECjMkpWOZifJyJBTFjAwrbipJgvAA9YmjaYQ4kV5ePDmEW1rpwTAWuiIFC/XrRI64lBkPdCdH6lr+9Ebib56TqrDp5TRKUkUiPD4UpgyqGI5igz0qCFYs0wRhQfWvEF8jgbDS4X67wrMeCaXeAHkGuW6OZeV/IV3YdUvzs/1a+3gSVxlsgE2wDSxwCNrgFHRAF2BwDx7BE3g2HoxX4934GLeWjMnMOviG0tQna3itoA==</latexit><latexit sha1_base64="8NS4mO5kWFTfLpxFhH3ZJ5ymWCA=">AAACSXicjVDLSsNAFJ3UV62vqks3g0VQkJJEre1CKbhxWcGqkIQwmU7q0JkkzEzEEPs7/oxuFfwMXYkrp2kXKgoeuHA45744QcKoVKb5YpSmpmdm58rzlYXFpeWV6urahYxTgUkXxywWVwGShNGIdBVVjFwlgiAeMHIZDE5G/uUNEZLG0bnKEuJx1I9oSDFSWvKr7dwtljiiH3i5Wbdts9Fs7Zr1g8Zew7I1aTX3bcsaJj4dwiPoJtQ3t5FPoXsHb32641drVt0sAP8mNTBBx6++ub0Yp5xECjMkpWOZifJyJBTFjAwrbipJgvAA9YmjaYQ4kV5ePDmEW1rpwTAWuiIFC/XrRI64lBkPdCdH6lr+9Ebib56TqrDp5TRKUkUiPD4UpgyqGI5igz0qCFYs0wRhQfWvEF8jgbDS4X67wrMeCaXeAHkGuW6OZeV/IV3YdUvzs/1a+3gSVxlsgE2wDSxwCNrgFHRAF2BwDx7BE3g2HoxX4934GLeWjMnMOviG0tQna3itoA==</latexit><latexit sha1_base64="8NS4mO5kWFTfLpxFhH3ZJ5ymWCA=">AAACSXicjVDLSsNAFJ3UV62vqks3g0VQkJJEre1CKbhxWcGqkIQwmU7q0JkkzEzEEPs7/oxuFfwMXYkrp2kXKgoeuHA45744QcKoVKb5YpSmpmdm58rzlYXFpeWV6urahYxTgUkXxywWVwGShNGIdBVVjFwlgiAeMHIZDE5G/uUNEZLG0bnKEuJx1I9oSDFSWvKr7dwtljiiH3i5Wbdts9Fs7Zr1g8Zew7I1aTX3bcsaJj4dwiPoJtQ3t5FPoXsHb32641drVt0sAP8mNTBBx6++ub0Yp5xECjMkpWOZifJyJBTFjAwrbipJgvAA9YmjaYQ4kV5ePDmEW1rpwTAWuiIFC/XrRI64lBkPdCdH6lr+9Ebib56TqrDp5TRKUkUiPD4UpgyqGI5igz0qCFYs0wRhQfWvEF8jgbDS4X67wrMeCaXeAHkGuW6OZeV/IV3YdUvzs/1a+3gSVxlsgE2wDSxwCNrgFHRAF2BwDx7BE3g2HoxX4934GLeWjMnMOviG0tQna3itoA==</latexit>

!"($|&) !((|&)!"($|&) !($|&)

• Problem: IPS has high variance

argmin⇡

1

n

nX

i=1

�ri⇡(ai |xi)

pi<latexit sha1_base64="vq06mY9CIP/+RPZL4LrwG8KBOrc=">AAACf3icjVFdSxwxFM1Mv7arrat99CV0ERXsMBntugsVhL74aKGrws50yGQzazDJDElmcUjzQwv9Ff4Cs+uCVVrohcDJuefmHk6KmjNt4vhXEL54+er1m87b7tr6u/cbvc2tC101itAxqXilrgqsKWeSjg0znF7VimJRcHpZ3Hxd9C/nVGlWye+mrWkm8EyykhFsPJX35ilWM8FkbtOaOZgewLRUmFjkrPRX3YjcshPkfkj4SeXsUeDle3hB/IS3Odt31qZLNxM1KzIbR0kSD4ajgzj6PDgcoMSD0fAoQcjVOXMu7/VRFC8L/hv0warO895dOq1II6g0hGOtJyiuTWaxMoxw6rppo2mNyQ2e0YmHEguqM7t05OCOZ6awrJQ/0sAl++eExULrVhReKbC51s97C/JvvUljymFmmawbQyV5WFQ2HJoKLsKGU6YoMbz1ABPFvFdIrrGPz/gvebJFtFNaav8CFC0UXlzp7v+FdJFE6DBKvh31T7+s4uqAbfAR7AEEjsEpOAPnYAwI+B2EwVqwHgbhbhiF8YM0DFYzH8CTCkf3fJ69cA==</latexit>

CRM Principle• Counterfactual Risk Minimization (CRM) principle

• Motivated by PAC risk analysis

• Stochastic optimization of variance regularizer is tricky

• Policy optimization for exponential models (POEM) algorithm

[Swaminathan & Joachims, ICML 2015]

variance regularization

argmin⇡

1

n

nX

i=1

�ri⇡(ai |xi)

pi<latexit sha1_base64="cWvFQAa06y/zjgRhGkPYJf5mUuE=">AAACVnicjVDLSgMxFM2Mr1pfoy7dBItQQUtHBV0oFNy4ESrYKnTqkEkzNTTJDElGHOJ8lT9jt/oHfoCY1oIPFLwQODnn3HuTE6WMKl2vDx13anpmdq40X15YXFpe8VbX2irJJCYtnLBEXkdIEUYFaWmqGblOJUE8YuQqGpyO9Ks7IhVNxKXOU9LlqC9oTDHSlgq98wDJPqciNEFKCxjswCCWCBu/MMJeVcZDQ0/84kbAXRnST4O1V9GIeID3Id0uTBrSIvQqfq0+Lvg3qIBJNUPvNeglOONEaMyQUh2/nuquQVJTzEhRDjJFUoQHqE86FgrEieqa8bcLuGWZHowTaY/QcMx+7TCIK5XzyDo50rfqpzYif9M6mY6PuoaKNNNE4I9FccagTuAoQ9ijkmDNcgsQltS+FeJbZFPRNulvW3jeI7GyEyDPIbfmRJX/F1J7r+bv1/YuDiqN40lcJbABNkEV+OAQNMAZaIIWwOARDMEzeHGenDd3xp37sLrOpGcdfCvXewf7/bVW</latexit>

+ �q

V̂ar(⇡, S)<latexit sha1_base64="IUqBlo6BGJCKN8j7pt+Y8e1iNNg=">AAACNHicdVBNb9NAEF2Xj5bw0QBHLisipKIiy3ZLSCUOlbhwbAVJK8VRNF6Pm1V3bXd3jLAs9z/wZ9or/AwkbohrD/0F3aRBoghG2tXTe/Nmdl9SKmkpCL57K7du37m7unavc//Bw0fr3cdPRraojMChKFRhDhOwqGSOQ5Kk8LA0CDpReJAcv5vrB5/QWFnkH6kucaLhKJeZFECOmnY3N/npKY+Vc6TAY3tiqIln4C7Cz9SMwLTtRlzKVx9ettNuL/CjKOgPdnjgv+5v9cPIgZ3BdhSGPPSDRfXYsvam3cs4LUSlMSehwNpxGJQ0acCQFArbTlxZLEEcwxGOHcxBo500i0+1/IVjUp4Vxp2c+IL909GAtrbWievUQDP7tzYn/6WNK8oGk0bmZUWYi+tFWaU4FXyeEE+lQUGqdgCEke6tXMzAgCCX440tuk4xs24C1zXXrrmwHRfS7yT4/8Eo8sMtP9rf7u2+Xca1xp6x52yDhewN22Xv2R4bMsG+sHP2lX3zzrwf3k/v13Xrirf0PGU3yru4ApDqq4s=</latexit>

Bayesian CRM Principle• Bayesian Counterfactual Risk Minimization (CRM) principle

• Bayesian policy:

• Motivated by PAC-Bayes risk analysis

• Takeaway: posterior should stay close to the prior

• What should the prior be? How about the logging policy!

[London & Sandler, ICML 2019]

KL div. from prior to posteriorargmin

Q

1

n

nX

i=1

�ri⇡Q(ai |xi)

pi<latexit sha1_base64="FKE3v27XKHU3Af8ihAabaNut8Mk=">AAACXnicjVDLSgMxFE1HrVqtrboR3ASLUEFLpwq6UBDcuGzBPqBTh0yaqaFJZkgy4hDny/wSd7rVnV9gWgs+UPBA4OTcc+9NThAzqnS9/phz5uYX8otLy4WV1eJaqby+0VFRIjFp44hFshcgRRgVpK2pZqQXS4J4wEg3GF9M6t1bIhWNxJVOYzLgaCRoSDHSVvLLbQ/JEafCN14QtDLo7UMvlAgbNzPCXlXCfUPP3OxawAPp00+DF9NZUxVN9Ht459O9zMQ+zfxyxa3Vp4B/kwqYoemX37xhhBNOhMYMKdV367EeGCQ1xYxkBS9RJEZ4jEakb6lAnKiBmX4/g7tWGcIwkvYIDafq1w6DuFIpD6yTI32jftYm4m+1fqLDk4GhIk40EfhjUZgwqCM4yRIOqSRYs9QShCW1b4X4BtlwtE382xaeDkmo7ATIU8itOVKF/4XUadTcw1qjdVQ5P53FtQS2wQ6oAhccg3NwCZqgDTB4AM/gBbzmnpy8U3RKH1YnN+vZBN/gbL0D3m+3qw==</latexit>

+ �Dkl(QkP)<latexit sha1_base64="TZIo5oFiUTJ7hZIquoaW9C4VbFo=">AAACK3icdVDLSgMxFM3Ud31VXboJFkFQhplRawsuBDeCLirYB3RKyWQybWgyMySZwjDUtT+jW/0PV4pbf8AvMK0VrOiFhMM599ybHC9mVCrLejFyM7Nz8wuLS/nlldW19cLGZl1GicCkhiMWiaaHJGE0JDVFFSPNWBDEPUYaXv98pDcGREgahTcqjUmbo25IA4qR0lSnsLMPb2+hy7TDR9A9gO7llU8Hmet518PRXR12CkXLdByrVK5AyzwuHZZsR4NK+cixbWib1riKYFLVTuHD9SOccBIqzJCULduKVTtDQlHMyDDvJpLECPdRl7Q0DBEnsp2N/zKEu5rxYRAJfUIFx+xPR4a4lCn3dCdHqid/ayPyL62VqKDczmgYJ4qE+GtRkDCoIjgKBvpUEKxYqgHCguq3QtxDAmGl45vawlOfBFJPgDyFXDdHMq9D+k4C/g/qjmkfms71UfHsdBLXItgGO2AP2OAEnIELUAU1gMEdeACP4Mm4N56NV+PtqzVnTDxbYKqM90+lZabq</latexit>

⇡Q(a |x) = Prh⇠Q{h(x) = a}, h : X ! A<latexit sha1_base64="+t0dYAp3xN+F+e4OhaULvUtakkY=">AAACTHicdZDdShtBFMdnUz+3VWO9tBcHQyGChN2I+IFCxJteRjAayIRldjLrDs7sLjOz4rLGV/A5fI0+QL2tr1HaKxGcJCJa2j8Mc/idc+bM+YeZ4Np43oNT+TA1PTM7N+9+/LSwuFRd/nyq01xR1qGpSFU3JJoJnrCO4UawbqYYkaFgZ+HF0Sh/dsmU5mlyYoqM9SU5T3jEKTEWBdUjnPGgxGF4PKwTfH217h4AbqugjHFYjDEuIa5frcMBEMDDDffmBmLYA0y7gE1q78OgWvMau9tbu80d8BveWGCJ1Zb3SmqtL7e4/uf7bTuo/sKDlOaSJYYKonXP9zLTL4kynAo2dHGuWUboBTlnPRsmRDLdL8fLDuGrJQOIUmVPYmBM33aURGpdyNBWSmJi/XduBP+V6+Um2umXPMlywxI6GRTlAuySI+dgwBWjRhQ2IFRx+1egMVGEGuvvuymyGLBI2xdAFiBtcapda9KrN/8PTpsNf7PRPPZrrX000RxaRWuojny0jVroG2qjDqLoDt2jn+jB+eH8dh6dp0lpxXnpWUHvVJl5Bi+StK8=</latexit>

Application to Mixed Logits• Mixed logit model

• Like a softmax with normally distributed parameters

• Risk bound motivates logging policy regularization

[London & Sandler, ICML 2019]

L2 distance to logging policy parameters

hw,�(x) = argmaxa w · �(x, a) + �a<latexit sha1_base64="CEu/DqFGcvPUlqBMrX5MVLu5BM0=">AAACRnicdVBda9swFJWzj7bZR7PtcS+XhkHKQrBTSltYITAYfSm0bGkLsTHXspyISJaR5DXG5Nfsz2yv21P/RAeDstcpySjr2A4IDuecqyudpBDcWN+/8hr37j94uLa+0Xz0+MnTzdaz52dGlZqyIVVC6YsEDRM8Z0PLrWAXhWYoE8HOk+nbhX/+kWnDVf7BVgWLJI5znnGK1klx63AS15fdcIxS4rwz24ZDCFGPJc7iGucQduESQpoqC2Ex4Z1ZF7fhNazyMcattt872Ns96O9D0POXAKc47Pq3SnsAx++/T6N3J3HrR5gqWkqWWyrQmFHgFzaqUVtOBZs3w9KwAukUx2zkaI6SmahefnMOr5ySQqa0O7mFpfrnRI3SmEomLinRTszf3kL8lzcqbbYf1TwvSstyulqUlQKsgkVnkHLNqBWVI0g1d28FOkGN1Lpm72yRVcoy424AWYF0YWWarqTbbv5Pzvq9YKfXPw3agzdkhXXykmyRDgnIHhmQI3JChoSST+QL+Uq+eZ+9a+/G+7mKNrzfMy/IHTTIL/+csqQ=</latexit>

� ⇠ Gumbel(0, 1)k<latexit sha1_base64="/K5/42P25wlb+IRN4rgq1zjjmL0=">AAACJnicdVDLattAFB2lL9d9qe2yKQw1BReKkVyCE8jCkEWyTKC2A5ZqRqMrZ/CMJGauQoTwKh+Q32i2yR/kA7orIbt8QyD7ju1i6tIeGDicc+7cmRPlUhj0vBtn7cHDR4+f1J7Wnz1/8fKV+/pN32SF5tDjmcz0YcQMSJFCDwVKOMw1MBVJGESTnZk/OAZtRJZ+xTKHULFxKhLBGVpp5L4PxkwpFkRlgHCC1W6hIpDTpvfZ//RtMnIbXmurs7HV3qR+y5uDWsViw1sqje76WdC8vzrbH7l3QZzxQkGKXDJjhr6XY1gxjYJLmNaDwkDO+ISNYWhpyhSYsJp/Y0o/WiWmSabtSZHO1T8nKqaMKVVkk4rhkfnbm4n/8oYFJpthJdK8QEj5YlFSSIoZnXVCY6GBoywtYVwL+1bKj5hmHG1zK1tUGUNi7A1UlVTZcGbqtqRlN/8n/XbL/9JqH/iN7jZZoEbekQ+kSXzSIV2yR/ZJj3BySs7JBbl0vjs/nJ/O9SK65vyeeUtW4Nz+Aj24qTA=</latexit>

Dkl(QkP) = O(kµ� µ0k2)<latexit sha1_base64="rGWMk0SbIjyAt8LiF5XfihmXdQI=">AAACOXicdVDdShwxGM1o/Vv/VnvZm9BF0AuHzKjrChaE3hQqdIWuCjvrkslkdoP5GZKMMAzzGr5Me6vv4GXvxMv6As2uW1DRA/k4nPN9+ZITZ5wZi9CdNzX9YWZ2bn6htri0vLJaX1s/NSrXhHaI4kqfx9hQziTtWGY5Pc80xSLm9Cy+/Dryz66oNkzJn7bIaE/ggWQpI9g6qV9H0ffjhF2VURyfVKParuAXGMVs8GMzkkqLMhI53Iau9lF1EW716w3khyFqtg4g8veaO80gdOSgtRsGAQx8NEYDTNDu1/9GiSK5oNISjo3pBiizvRJrywinVS3KDc0wucQD2nVUYkFNrxz/rIIbTklgqrQ70sKx+nyixMKYQsSuU2A7NK+9kfiW181t2uqVTGa5pZI8LUpzDq2Co5hgwjQllheOYKKZeyskQ6wxsS7MF1tEkdDUuBugKKBwzcrUXEj/k4Dvk9PQD3b88GS3cXQ4iWsefAKfwSYIwD44At9AG3QAAdfgN7gBt94v74937z08tU55k5mP4AW8x3+DQqxN</latexit>

w ⇠ N (µ,⌃)<latexit sha1_base64="MBWDrEtOzjXyF+/Llo1h5b8E1Yw=">AAACIXicdVDLSgMxFM34rPU16kZwEyxCBSnTamkLLgpuXElF+4BOKZk004YmkyHJKEOpP6Nb/Q934k78C7/A9EGxogcCh3POzU2OFzKqtON8WAuLS8srq4m15PrG5ta2vbNbUyKSmFSxYEI2PKQIowGpaqoZaYSSIO4xUvf6FyO/fkekoiK41XFIWhx1A+pTjLSR2vb+PXS9GLpXQvK0y6MT6N7QLkfHbTvlZEqFfClXhNmMMwY0ikHemSkpMEWlbX+5HYEjTgKNGVKqmXVC3RogqSlmZJh0I0VChPuoS5qGBogT1RqMfzCER0bpQF9IcwINx+rPiQHiSsXcM0mOdE/99kbiX14z0n6xNaBBGGkS4MkiP2JQCziqA3aoJFiz2BCEJTVvhbiHJMLalDa3hccd4itzA+Qx5CYsVNKUNOvmf1LLZbKnmdz1Wap8Pq0rAQ7AIUiDLCiAMrgEFVAFGDyAJ/AMXqxH69V6s94n0QVrOrMH5mB9fgN6/aKo</latexit>

⇡Q(a |x) = E(w,�)⇠Q [1{h

w,�

(x) = a}] = Ew

hexp(w·f(x,a))Pa0 exp(w·f(x,a0

))

i

<latexit sha1_base64="B0E7zDHFhHzMQ8SqV67/w2efN3w=">AAACqXicjVHditNAFJ5EXdf6s129FGSwyCZYSrOy6oWLiyJ4uQW7W+yEcDKZpMNmkjAzsQ1jLn0PH8NbH8Nn0CufwEnrgusPeGDg8P2cM3wnrnKu9Hj8xXEvXb6ydXX7Wu/6jZu3dvq7t09UWUvKprTMSzmLQbGcF2yquc7ZrJIMRJyz0/jsZcefvmNS8bJ4o5uKhQKygqecgrZQ1P9IKh6ROJ54QN6v/N4hJq+qyHjLIclACPBJ3HR0S3KW6jkmATGLyJzTrbfyD4G0mEieLXR47l+25AXPrDyVQA1hq8pbYkKTUuPUWw3B91tDVC0iA3vW/Ae/ZwW4GxFG/UEwGq8L/7sZPP/89cO9T5Nvx1H/O0lKWgtWaJqDUvNgXOnQgNSc5qztkVqxCugZZGxu2wIEU6FZJ9niBxZJcFpK+wqN1+ivDgNCqUbEVilAL9TvXAf+jZvXOn0aGl5UtWYF3SxK6xzrEndnwQmXjOq8sQ1Qye1fMV2ADU/b413YIpqEpcpOwKLBwopL1fu/kE72R8Gj0f4kGBw9Q5vaRnfRfeShAD1BR+g1OkZTRJ0tZ+gcOI/dh+7EnblvN1LX+em5gy6US38ARQzSkg==</latexit>

Contributions• PAC Risk bounds for Bayesian policies

• Application to mixed logits

• Logging policy prior motivates new regularizer

• Two new learning objectives (one convex) that minimize bounds

• Experiments show proposed methods gain up to 76% more reward than POEM, while also simpler/more efficient

Visit poster #113 in Pacific Ballroom

Recommended