Contextual Linear Bandit Problem & Applications


Feng Bi, Joon Sik Kim, Leiya Ma
Pengchuan Zhang

Contents  

• Recap of last lecture: Multi-armed Bandit Problem / UCB1 Algorithm
• Application to News Article Recommendation
• Contextual Linear Bandit Problem
• LinUCB Algorithm
• Disjoint Model & General Model
• Derivation of LinUCB
• News Article Recommendation Revisited
• Conclusion

Last Lecture: Multi-armed Bandit Problem

• K actions (feature-free)
• Each action k has an unknown average reward μ_k
• For t = 1, …, T (T unknown):
  – Choose an action a_t from the K actions {1, …, K}
  – Observe a random reward y_t, bounded in [0, 1]
  – E[y_t] = μ_{a_t}: the expected reward of action a_t
• Minimizing regret:
  R_T = T μ* − Σ_{t=1}^T E[y_t], where μ* = max_k μ_k

Q. How do we choose actions to minimize regret?

Last Lecture: UCB1 Algorithm

• At each iteration, choose the action with the highest Upper Confidence Bound (UCB):
  a_t = argmax_k [ μ̄_k + √(2 log t / n_k) ]
  (exploitation term + exploration term)
• Regret bound: with high probability, sublinear w.r.t. T:
  R_T = O((K / Δ) · log T),
  where K = #actions, Δ = μ1 − μ2 is the gap between the best and 2nd-best actions, and T is the time horizon.
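To make the recap concrete, here is a minimal Python sketch of UCB1 as stated above. The `pull` callback and the Bernoulli arms (with the click rates 0.2/0.4/0.5/0.8 used in the running example below) are illustrative assumptions, not part of the lecture.

```python
import math
import random

def ucb1(pull, K, T):
    """Minimal UCB1 sketch: pull(k) returns a reward in [0, 1] for arm k."""
    counts = [0] * K            # n_k: number of times arm k was pulled
    means = [0.0] * K           # empirical mean reward of arm k
    for t in range(1, T + 1):
        if t <= K:
            k = t - 1           # initialization: play each arm once
        else:
            # exploitation term + exploration term
            k = max(range(K),
                    key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]))
        y = pull(k)
        counts[k] += 1
        means[k] += (y - means[k]) / counts[k]   # incremental mean update
    return means, counts

# Example: K = 4 Bernoulli arms with unknown means (the slides' 0.2/0.4/0.5/0.8).
mus = [0.2, 0.4, 0.5, 0.8]
means, counts = ucb1(lambda k: float(random.random() < mus[k]), K=4, T=10_000)
```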

Application of the Multi-armed Bandit Problem?

News Article Recommendation

• Various algorithms exist for personalized recommendation:
  – Collaborative filtering, content-based filtering, hybrid approaches
• But…
  – Web content undergoes frequent changes
  – Some users have no previous data to learn from (cold-start problem)
• Exploration vs. exploitation:
  – Gather more information about users through more trials
  – Optimize article selection using past user experience
⇒ Use the multi-armed bandit setting

Article Recommendation in a Feature-Free Bandit Setting

[Figure sequence: users u1 (age YOUNG) and u2 (age OLD) choose among four articles. Serving u1 repeatedly, the bandit's estimated payoffs for the four articles converge to 0.2, 0.4, 0.5, and 0.8, so it settles on the 0.8 article. When u2 arrives, the feature-free bandit has no way to distinguish the users (???) and recommends the same article, even though an OLD user may prefer a different one.]

Contextual-Bandit Problem

• For t = 1, …, T (T unknown):
  – Observe user u_t and a set A_t of actions a
  – For each action a, a feature vector (context) x_{t,a} summarizes both user u_t and action a
  – Based on previous observations, choose a_t from A_t
  – Receive payoff r_{t,a_t}
  – Improve the selection strategy with the new observation
• Minimizing regret:
  R_T = Σ_{t=1}^T E[r_{t,a_t*}] − Σ_{t=1}^T E[r_{t,a_t}],
  where a_t* is the action with maximum expected payoff at time t.

Difference?

• The contextual bandit problem reduces to the K-armed bandit problem when:
  – the action set A_t is unchanged and contains K actions for all t, and
  – the user u_t (i.e., the context) is the same for all t.
• The K-armed setting is therefore also called the context-free bandit problem.

Article Recommendation in a Contextual Linear Bandit Setting

[Figure: the same users u1 (age YOUNG) and u2 (age OLD), but each article a now carries an unknown coefficient vector θ_a, e.g. θ1 = [0.1, 0.6], θ2 = [0.5, 0.1], θ3 = [0.6, 0.1], θ4 = [0.9, 0.2]. Given a feature vector x for the current user/article pair, the expected payoff is linear: payoff = x^T θ, so the same articles can rank differently for YOUNG and OLD users.]

LinUCB Algorithm

• Assumption: the payoff model is linear
  – The most intuitive modeling choice: a linear model
  – Advantage: the confidence interval can be computed efficiently in closed form
• It is tempting to apply a UCB-style rule to general contextual bandit problems, since it promises:
  – asymptotic optimality
  – a strong regret bound
⇒ The resulting method is called the LinUCB algorithm.


Two Models

• For convenience of exposition, we first describe the simpler form:
  – the disjoint linear model
• Then we consider the general case:
  – the hybrid model
⇒ LinUCB is a generic contextual bandit algorithm that applies to applications beyond personalized news article recommendation.

Linear Disjoint Model

• We assume the expected payoff of an arm a is linear in its d-dimensional feature x_{t,a} with some unknown coefficient vector θ_a^*; namely, for all t:
  E[r_{t,a} | x_{t,a}] = x_{t,a}^T θ_a^*
• The model is called disjoint because the parameters are not shared among different arms.

"Disjoint" in the Article Recommendation Example

[Figure: users u1 (age YOUNG) and u2 (age OLD); each article keeps its own coefficient vector θ_a, which is not shared with the other articles.]

Algorithm

LinUCB with disjoint linear models: at each trial t, for every arm a compute the estimate θ̂_{t,a} and the upper confidence bound
  p_{t,a} = θ̂_{t,a}^T x_{t,a} + c_t √(x_{t,a}^T A_{t,a}^{-1} x_{t,a}),
then pull the arm a_t = argmax_a p_{t,a} and update A_{t,a} and b_{t,a} with the observed payoff (the update equations are derived on the following slides).
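A minimal sketch of the disjoint-model LinUCB loop just described, assuming a fixed exploration constant c in place of c_t; the choose/update interface is our own illustrative choice, not the paper's.

```python
import numpy as np

class DisjointLinUCB:
    """LinUCB with disjoint linear models (sketch, after Li et al. 2010).

    Per arm a: A_a starts at I_d, b_a at 0; theta_hat_a = A_a^{-1} b_a,
    and the UCB is x^T theta_hat_a + c * sqrt(x^T A_a^{-1} x).
    """

    def __init__(self, n_arms, d, c=1.0):
        self.c = c
        self.A = [np.eye(d) for _ in range(n_arms)]    # A_a = D_a^T D_a + I_d
        self.b = [np.zeros(d) for _ in range(n_arms)]  # b_a = D_a^T y_a

    def choose(self, xs):
        """xs[a] is the context x_{t,a}; return the arm with the largest UCB."""
        ucbs = []
        for A, b, x in zip(self.A, self.b, xs):
            A_inv = np.linalg.inv(A)
            theta_hat = A_inv @ b
            ucbs.append(x @ theta_hat + self.c * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(ucbs))

    def update(self, a, x, r):
        """Fold the observed payoff r for arm a with context x into A_a, b_a."""
        self.A[a] += np.outer(x, x)
        self.b[a] += r * x
```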

Visual Representation

Feature-free bandit vs. linear bandit

Feature-free bandit:
• E[r_{t,a} | x_{t,a}] = μ_a^*; μ_a^* is not known a priori.
• Confidence interval: C_{t,a} = { μ_a : |μ_a − μ̄_{t,a}| ≤ √(2 log t / n_{t,a}) }
• a_t = argmax_{a∈{1,…,K}} max_{μ_a∈C_{t,a}} μ_a = argmax_{a∈{1,…,K}} [ μ̄_{t,a} + √(2 log t / n_{t,a}) ]

Linear bandit:
• E[r_{t,a} | x_{t,a}] = x_{t,a}^T θ_a^*; θ_a^* is not known a priori.
• Confidence ellipsoid: C_{t,a} = { θ_a : ‖θ_a − θ̂_{t,a}‖_{A_{t,a}} ≤ c_t }, where ‖x‖_A ≡ √(x^T A x).
• a_t = argmax_{a∈{1,…,K}} max_{θ_a∈C_{t,a}} x_{t,a}^T θ_a = argmax_{a∈{1,…,K}} [ x_{t,a}^T θ̂_{t,a} + c_t √(x_{t,a}^T A_{t,a}^{-1} x_{t,a}) ]


Confidence ellipsoid: C_{t,a} = { θ_a : ‖θ_a − θ̂_{t,a}‖_{A_{t,a}} ≤ c_t }

[Figure: the confidence ellipsoid C_{t,a}, centered at the coefficient estimate θ̂_{t,a}; projecting the ellipsoid along the feature vector x_{t,a} yields the confidence interval of x_{t,a}^T θ_a.]

The UCB is the maximum of the linear payoff over the ellipsoid:
  x_{t,a}^T θ̂_{t,a} + c_t √(x_{t,a}^T A_{t,a}^{-1} x_{t,a}) = max_{θ_a} x_{t,a}^T θ_a
  s.t. (θ_a − θ̂_{t,a})^T A_{t,a} (θ_a − θ̂_{t,a}) ≤ c_t²
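For completeness, here is why the maximization over the ellipsoid has this closed form: substitute θ_a = θ̂_{t,a} + A_{t,a}^{-1/2} u, which turns the ellipsoid constraint into ‖u‖_2 ≤ c_t.

```latex
\max_{\theta_a \in C_{t,a}} x_{t,a}^T \theta_a
  = x_{t,a}^T \hat\theta_{t,a} + \max_{\|u\|_2 \le c_t} \bigl(A_{t,a}^{-1/2} x_{t,a}\bigr)^T u
  = x_{t,a}^T \hat\theta_{t,a} + c_t \bigl\|A_{t,a}^{-1/2} x_{t,a}\bigr\|_2
  = x_{t,a}^T \hat\theta_{t,a} + c_t \sqrt{x_{t,a}^T A_{t,a}^{-1} x_{t,a}} ,
% where the inner maximum is attained (Cauchy--Schwarz) at
% u = c_t \, A_{t,a}^{-1/2} x_{t,a} \,/\, \|A_{t,a}^{-1/2} x_{t,a}\|_2 .
```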

Feature-free bandit = linear bandit with x_{t,a} ≡ 1, θ_a = μ_a

Feature-free bandit, rewritten in linear-bandit form:
• E[r_{t,a} | x_{t,a}] = 1^T μ_a^*; μ_a^* is not known a priori.
• Confidence interval: C_{t,a} = { μ_a : ‖μ_a − μ̄_{t,a}‖_{n_{t,a}} ≤ √(2 log t) }, where ‖μ‖_n ≡ √(μ^T n μ).
• a_t = argmax_{a∈{1,…,K}} max_{μ_a∈C_{t,a}} 1^T μ_a = argmax_{a∈{1,…,K}} [ 1^T μ̄_{t,a} + √(2 log t / n_{t,a}) ]

Linear bandit (as above):
• E[r_{t,a} | x_{t,a}] = x_{t,a}^T θ_a^*; θ_a^* is not known a priori.
• Confidence ellipsoid: C_{t,a} = { θ_a : ‖θ_a − θ̂_{t,a}‖_{A_{t,a}} ≤ c_t }, where ‖x‖_A ≡ √(x^T A x).
• a_t = argmax_{a∈{1,…,K}} max_{θ_a∈C_{t,a}} x_{t,a}^T θ_a = argmax_{a∈{1,…,K}} [ x_{t,a}^T θ̂_{t,a} + c_t √(x_{t,a}^T A_{t,a}^{-1} x_{t,a}) ]

A Bayesian approach to derive θ̂_{t,a} and A_{t,a}^{-1}

• Gaussian prior: p_0(θ_a) ∼ N(0, λ I_d).
• n_{t,a} noisy measurements: y_{t,a} ∼ N(D_{t,a} θ_a, I_{n_{t,a}}), i.e., y_{t,a}(i) = x_{i,a}^T θ_a + η_{i,a}.
• Posterior distribution: p_{t,a}(θ_a) ∼ N(θ̂_{t,a}, A_{t,a}^{-1}), with
  θ̂_{t,a} = (D_{t,a}^T D_{t,a} + λ^{-1} I_d)^{-1} D_{t,a}^T y_{t,a}  (the term D_{t,a}^T y_{t,a} is denoted b_{t,a}),
  A_{t,a} = D_{t,a}^T D_{t,a} + λ^{-1} I_d.
• The reward x_{t,a}^T θ_a ∼ N(x_{t,a}^T θ̂_{t,a}, x_{t,a}^T A_{t,a}^{-1} x_{t,a}). The upper confidence bound (UCB) is
  x_{t,a}^T θ̂_{t,a} + c_t √(x_{t,a}^T A_{t,a}^{-1} x_{t,a}).
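A small numerical sketch of these posterior and UCB formulas on synthetic data; the dimensions, random seed, prior variance λ = 1, and the choice c_t = 2 are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, lam = 5, 200, 1.0                      # dimension, #measurements, prior variance

theta_true = rng.normal(0, np.sqrt(lam), d)  # theta_a ~ N(0, lam * I_d)
D = rng.normal(size=(n, d))                  # design matrix: rows are x_{i,a}^T
y = D @ theta_true + rng.normal(size=n)      # y(i) = x_{i,a}^T theta_a + eta_{i,a}

# Posterior N(theta_hat, A^{-1}) with A = D^T D + (1/lam) I_d and b = D^T y
A = D.T @ D + np.eye(d) / lam
b = D.T @ y
theta_hat = np.linalg.solve(A, b)

# UCB for a query context x: x^T theta_hat + c_t * sqrt(x^T A^{-1} x)
x = rng.normal(size=d)
c_t = 2.0
ucb = x @ theta_hat + c_t * np.sqrt(x @ np.linalg.solve(A, x))
print(theta_hat.round(2), round(ucb, 3))
```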


A general form of linear bandit

Disjoint linear model:
• E[r_{t,a} | x_{t,a}] = x_{t,a}^T θ_a^*
• a_t = argmax_{a∈{1,…,K}} max_{θ_a∈C_{t,a}} x_{t,a}^T θ_a

A general linear model:
• E[r_t | x_t] = x_t^T θ^*
• x_t = argmax_{x∈A_t} max_{θ∈C_t} x^T θ

The disjoint model is a special case: stack θ = (θ_1, …, θ_a, …, θ_K) and take the action set
  A_t = { (0, …, 0, x_{t,a}, 0, …, 0) : a = 1, 2, …, K },
where x_{t,a} occupies the a-th block.


A hybrid linear model

Hybrid linear model:
• E[r_{t,a} | z_{t,a}, x_{t,a}] = z_{t,a}^T β^* + x_{t,a}^T θ_a^*
• a_t = argmax_{a∈{1,…,K}} max_{(β,θ_a)∈C_t} [ z_{t,a}^T β + x_{t,a}^T θ_a ]

A general linear model:
• E[r_t | x_t] = x_t^T θ^*
• x_t = argmax_{x∈A_t} max_{θ∈C_t} x^T θ

The hybrid model is again a special case: stack θ = (β, θ_1, …, θ_a, …, θ_K) and take
  A_t = { (z_{t,a}, 0, …, 0, x_{t,a}, 0, …, 0) : a = 1, 2, …, K }.
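A short sketch of this stacking construction, showing how the hybrid model embeds into the general linear model (and, with empty z's, how the disjoint model does); the function and its interface are hypothetical.

```python
import numpy as np

def stack_hybrid(zs, xs):
    """Build the general-model action set A_t from hybrid-model features.

    zs[a] = z_{t,a} (shared features), xs[a] = x_{t,a} (arm features).
    The stacked action for arm a is (z_{t,a}, 0, ..., 0, x_{t,a}, 0, ..., 0),
    matching the stacked coefficients theta = (beta, theta_1, ..., theta_K).
    """
    K, d, m = len(xs), len(xs[0]), len(zs[0])
    actions = []
    for a in range(K):
        v = np.zeros(m + K * d)
        v[:m] = zs[a]                              # shared block, multiplies beta
        v[m + a * d : m + (a + 1) * d] = xs[a]     # arm-a block, multiplies theta_a
        actions.append(v)
    return actions   # the action set A_t of the general linear bandit
```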

A general form of linear bandit, continued

Disjoint linear model:
• a_t = argmax_{a∈{1,…,K}} max_{θ_a∈C_{t,a}} x_{t,a}^T θ_a
• C_{t,a} = { θ_a : ‖θ_a − θ̂_{t,a}‖_{A_{t,a}} ≤ c_t }
• θ̂_{t,a} = (D_{t,a}^T D_{t,a} + λ^{-1} I_d)^{-1} D_{t,a}^T y_{t,a}, A_{t,a} = D_{t,a}^T D_{t,a} + λ^{-1} I_d

A general linear model:
• x_t = argmax_{x∈A_t} max_{θ∈C_t} x^T θ
• C_t = { θ : ‖θ − θ̂_t‖_{A_t} ≤ c_t }
• θ̂_t = (D_t^T D_t + λ^{-1} I_d)^{-1} D_t^T y_t, A_t = D_t^T D_t + λ^{-1} I_d


An O(d√T) regret bound

Theorem (Theorem 2 + Theorem 3 in APS 2011). Assume that
1. the measurement noise η_t is independent of everything else and is σ-sub-Gaussian for some σ > 0, i.e., E[exp(λη_t)] ≤ exp(λ²σ²/2) for all λ ∈ R;
2. for all t and all x ∈ A_t, x^T θ^* ∈ [−1, 1].
Then, for any δ > 0, with probability at least 1 − δ, for all t ≥ 0:
1. θ^* lies in the confidence ellipsoid
   C_t = { θ : ‖θ − θ̂_t‖_{A_t} ≤ c_t }, c_t := σ √(log det A_t + d log λ + 2 log(1/δ)) + ‖θ^*‖/√λ;
2. the regret of the LinUCB algorithm satisfies
   R_t ≤ √(8t) · √(log det A_t + d log λ) · c_t
        (I)      (II)                      (III)

Lemma (Determinant–Trace Inequality, Lemma 10 in APS 2011). If ‖x_t‖_2 ≤ L for all t ≥ 0, then
   log det A_t ≤ d log(1/λ + tL²/d).


Sketch of the proof

We work on the high-probability event that θ^* ∈ C_t for all t ≥ 0, and let (x_t, θ̃_t) = argmax_{x∈A_t} max_{θ∈C_t} ⟨x, θ⟩. Then

r_t = ⟨x_t^*, θ^*⟩ − ⟨x_t, θ^*⟩
    ≤ ⟨x_t, θ̃_t⟩ − ⟨x_t, θ^*⟩            (θ^* ∈ C_t, so ⟨x_t, θ̃_t⟩ ≥ ⟨x_t^*, θ^*⟩)
    = ⟨x_t, θ̃_t − θ^*⟩
    = ⟨x_t, θ̂_t − θ^*⟩ + ⟨x_t, θ̃_t − θ̂_t⟩
    ≤ ‖x_t‖_{A_t^{-1}} ‖θ̂_t − θ^*‖_{A_t} + ‖x_t‖_{A_t^{-1}} ‖θ̃_t − θ̂_t‖_{A_t}    (Cauchy–Schwarz)
    ≤ 2 c_t ‖x_t‖_{A_t^{-1}}               (θ^*, θ̃_t ∈ C_t = { θ : ‖θ − θ̂_t‖_{A_t} ≤ c_t })

Since x^T θ^* ∈ [−1, 1] for all x ∈ A_t, we also have r_t ≤ 2. Therefore

r_t ≤ min{ 2 c_t ‖x_t‖_{A_t^{-1}}, 2 } ≤ 2 c_t min{ ‖x_t‖_{A_t^{-1}}, 1 }


Sketch of the proof, continued

Squaring the previous bound gives

r_t² ≤ 4 c_t² min{ ‖x_t‖²_{A_t^{-1}}, 1 }.   (1)

Consider the regret R_T ≡ Σ_{t=1}^T r_t:

R_T ≤ √(T Σ_{t=1}^T r_t²)                                      (Cauchy–Schwarz)
    ≤ √(T Σ_{t=1}^T 4 c_t² min{ ‖x_t‖²_{A_t^{-1}}, 1 })        (by (1))
    ≤ 2√T · c_T · √( Σ_{t=1}^T min{ ‖x_t‖²_{A_t^{-1}}, 1 } )   (c_t is monotonically increasing)

with the three factors matching terms I, III, and II of the theorem. Since x ≤ 2 log(1 + x) for x ∈ [0, 1], we have

Σ_{t=1}^T min{ ‖x_t‖²_{A_t^{-1}}, 1 } ≤ 2 Σ_{t=1}^T log(1 + ‖x_t‖²_{A_t^{-1}}) = 2 (log det A_T + d log λ).

The last equality is proved in Lemma 11 in APS 2011.
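Putting the two slides together with the determinant–trace lemma and the growth of c_t gives the advertised rate; a sketch, using the convention A_T = λ^{-1} I_d + D_T^T D_T from the earlier slides:

```latex
R_T \le 2\sqrt{T}\; c_T \sqrt{2\bigl(\log\det A_T + d\log\lambda\bigr)},
\qquad
\log\det A_T + d\log\lambda \le d\log\!\Bigl(1 + \tfrac{\lambda T L^2}{d}\Bigr).
% Both c_T and the last square-root factor are O(\sqrt{d\log T}), hence
R_T = O\bigl(d\,\sqrt{T}\,\log T\bigr) = \tilde O\bigl(d\sqrt{T}\bigr).
```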


Algorithm Evaluation

How do we evaluate the performance of a recommendation algorithm?
• Can we just run the algorithm on "live" data? Difficult logistically.
• Build a simulator to model the bandit process and evaluate the algorithm on the simulated data? May introduce bias from the simulator.
• Instead: use Yahoo! Today Module logs, collected while articles were served at random.

Algorithm Evaluation

This replay evaluation introduces no bias! With K articles served uniformly at random, about T events are accepted from TK logged trials.

 Algorithm  Evalua3on  

Yahoo  News  …  

Randomly  shoot  user  an  ar3cle  a  as  highlighted  news.  

 Algorithm  Evalua3on  (data  collec3on)  

Yahoo  News  …  

Randomly  shoot  user  an  ar3cle  a  as  highlighted  news.  

This  event  contains  ar3cle  a,  feature  vector  X,  and  response  r  =  1/0  .  

 Algorithm  Evalua3on  (data  collec3on)  

Accept  this  event  iff  algorithm  predicts  the  same  ar3cle  a.  
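A sketch of this replay evaluator, assuming the log is an iterable of (contexts, logged article, click) events recorded under uniformly random serving; the policy's choose/update interface is the same hypothetical one used in the LinUCB sketch above.

```python
def replay_evaluate(policy, logged_events):
    """Unbiased offline replay: learn and score only on matching events."""
    clicks, accepted = 0, 0
    for contexts, a_logged, r in logged_events:
        a = policy.choose(contexts)        # contexts[a] is x_{t,a}
        if a != a_logged:
            continue                       # reject: policy disagrees with the log
        accepted += 1
        clicks += r
        policy.update(a, contexts[a], r)   # update only with accepted events
    return clicks / max(accepted, 1)       # estimated CTR of the policy
```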

Algorithm Evaluation (construct features)

In Yahoo!'s database, each user and each article is described by hundreds of raw features, so we need to reduce the feature dimension.

User (raw):    Gender | Age>20 | Age>40 | Student | ...
                 1    |   1    |   0    |    1    | ...

Article (raw):  Long | Domestic | Tech | Politics | ...
                 1   |    1     |  0   |    1     | ...

Algorithm Evaluation (construct features)

Suppose there is a weight matrix W such that the probability of user u clicking on article a is
  p(click | u, a) = 1 / (1 + exp(−φ_u^T W φ_a)),
where φ_u and φ_a are the raw user and article feature vectors.

Algorithm Evaluation (construct features)

• Fit W by logistic regression on the click data.
• Project each user into the article space (φ_u^T W) and run k-means on this projected feature vector to find clusters/groups of users.

• The constructed feature vector for a user is the vector of probabilities of belonging to the different groups.
• The same procedure applied to the articles yields the constructed article feature vector.
➢ Disjoint LinUCB uses the constructed feature vectors as input data.
➢ In the hybrid model, the outer product of the constructed user and article feature vectors is also included as global features.

Algorithm Evaluation (construct features)

For example, given binary raw feature vectors for a user and an article, after clustering:

User:    Group      |  A  |  B  |  C   |  D  |  E
         Membership | 0.2 | 0.1 | 0.35 | 0.3 | 0.05

Article: Group      |  1  |  2   |  3   |  4   |  5
         Membership | 0.7 | 0.05 | 0.15 | 0.05 | 0.05

• The outer product of the two constructed feature vectors gives the shared features z_{t,a} (25-dimensional here).
• Hybrid model: E[r_{t,a}] = z_{t,a}^T β^* + x_{t,a}^T θ_a^*; removing the z_{t,a}^T β^* term recovers disjoint LinUCB.
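A sketch of the whole feature-construction pipeline; here W, the cluster centroids, and the inverse-distance membership rule are stand-in assumptions (the paper fits W by logistic regression and clusters with k-means), used only to show the shapes involved.

```python
import numpy as np

def soft_memberships(phi_proj, centroids):
    """Turn a projected feature vector into soft cluster-membership weights.
    Inverse-squared-distance weighting is an illustrative choice only."""
    d2 = ((centroids - phi_proj) ** 2).sum(axis=1)
    w = 1.0 / (d2 + 1e-9)
    return w / w.sum()

rng = np.random.default_rng(0)
phi_u = np.array([1, 1, 0, 1, 0], dtype=float)   # raw user features (illustrative)
W = rng.normal(size=(5, 5))                      # stand-in for the fitted weight matrix
user_centroids = rng.normal(size=(5, 5))         # stand-in for k-means centroids

x_user = soft_memberships(phi_u @ W, user_centroids)   # e.g. [0.2, 0.1, 0.35, 0.3, 0.05]
x_article = np.array([0.7, 0.05, 0.15, 0.05, 0.05])    # constructed article features

z = np.outer(x_user, x_article).ravel()  # 25-dim shared features z_{t,a} (hybrid model)
```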

Algorithm Evaluation (compared algorithms)

Without utilizing features:
• purely random
• ε-greedy
• UCB
• omniscient

Algorithms with features:
• ε-greedy (run on different clusters)
• ε-greedy (hybrid, epoch-greedy)
• UCB (cluster)
• LinUCB (disjoint)
• LinUCB (hybrid)

Algorithm Evaluation (results)

[Figure: CTRs of the compared algorithms, normalized by the CTR of random recommendation.]

Conclusion

• For the multi-armed bandit problem, the feature-free UCB algorithm has regret bound O((K/Δ) log T).
• LinUCB, which uses feature vectors, has regret bound Õ(d√T).
• We evaluated the algorithms on Yahoo! Front Page Today Module data.
• Introducing contextual information (features) into the recommendation algorithm improved the CTR (reward) by about 10%.

References

• Li, Lihong, et al. "A contextual-bandit approach to personalized news article recommendation." Proceedings of the 19th International Conference on World Wide Web. ACM, 2010.
• Abbasi-Yadkori, Yasin, Dávid Pál, and Csaba Szepesvári. "Improved algorithms for linear stochastic bandits." Advances in Neural Information Processing Systems. 2011.
• Auer, Peter, Nicolò Cesa-Bianchi, and Paul Fischer. "Finite-time analysis of the multiarmed bandit problem." Machine Learning 47.2-3 (2002): 235-256.
• Auer, Peter. "Using confidence bounds for exploitation-exploration trade-offs." The Journal of Machine Learning Research 3 (2003): 397-422.
