Bandits for Taxonomies: A Model-based Approach
Sandeep Pandey
Deepak Agarwal
Deepayan Chakrabarti
Vanja Josifovski
The Content Match Problem
[Diagram: Advertisers place ads into an Ads DB; ads are served on webpages]
Ad impression: showing an ad to a user
Ad click: a user click leads to revenue for the ad server and the content provider
The Content Match Problem: match ads to pages to maximize clicks
Maximizing the number of clicks means:
for each webpage, find the ad with the best Click-Through Rate (CTR),
but without wasting too many impressions in learning this.
Online Learning
Maximizing clicks requires:
Dimensionality reduction
Exploration
Exploitation
Both must occur together
Online learning is needed, since the system must continuously generate revenue
Taxonomies for dimensionality reduction
[Taxonomy diagram: Root → Apparel, Computers, Travel; each page/ad maps to a node]
• Already exist
• Actively maintained
• Existing classifiers map pages and ads to taxonomy nodes
Learn the matching from page nodes to ad nodes ⇒ dimensionality reduction
Online Learning
Maximizing clicks requires:
Dimensionality reduction
Exploration
Exploitation
Can taxonomies help in explore/exploit as well?
Outline
Problem
Background: Multi-armed bandits
Proposed Multi-level Policy
Experiments
Related Work
Conclusions
Background: Bandits
Bandit “arms”
p1, p2, p3 (unknown payoff probabilities)
Pull arms sequentially so as to maximize the total expected reward
• Estimate payoff probabilities pi
• Bias the estimation process towards better arms
Background: Bandits
[Diagram: each webpage is one bandit; its arms are the ads]
Bandit "arms" = ads
~10^6 ads, ~10^9 pages
Background: Bandits
[Diagram: webpages × ads matrix; one row = one bandit; each cell holds an unknown CTR]
Content Match = a matrix
• Each row is a bandit
• Each cell has an unknown CTR
Background: Bandits
Bandit Policy
1. Assign priority to each arm
2. “Pull” arm with max priority, and observe reward
3. Update priorities
[Diagram: arms with priorities; steps 1–2 = allocation, step 3 = estimation]
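For concreteness, a minimal sketch of one standard priority-based policy (UCB1); the talk does not commit to a particular policy, so this is illustrative only:

```python
import math

class UCB1:
    """Minimal UCB1 bandit: priority = estimated mean + exploration bonus."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms    # pulls per arm
        self.means = [0.0] * n_arms   # estimated payoff probability p_i

    def select_arm(self):
        # Allocation: try each arm once, then pull the max-priority arm.
        for arm, c in enumerate(self.counts):
            if c == 0:
                return arm
        total = sum(self.counts)
        priorities = [m + math.sqrt(2.0 * math.log(total) / c)
                      for m, c in zip(self.means, self.counts)]
        return max(range(len(priorities)), key=priorities.__getitem__)

    def update(self, arm, reward):
        # Estimation: incremental mean update; better arms get higher
        # priority and are therefore pulled more often.
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]
```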
Background: Bandits
Why not simply apply a bandit policy directly to our problem?
• Convergence is too slow: ~10^9 bandits, with ~10^6 arms per bandit
• Additional structure is available that can help: taxonomies
Outline
Problem
Background: Multi-armed bandits
Proposed Multi-level Policy
Experiments
Related Work
Conclusions
Multi-level Policy
[Diagram: the webpages × ads matrix coarsened into page classes × ad classes]
Consider only two levels
Multi-level Policy
[Diagram: page classes × ad child classes, grouped under ad parent classes (Apparel, Computers, Travel); a block = the cells of one page class under one ad parent class; one row = one bandit]
Consider only two levels
Multi-level Policy
[Same diagram: ad parent classes (Apparel, Computers, Travel) over ad child classes; one block highlighted]
Key idea: CTRs in a block are homogeneous
Multi-level Policy
CTRs in a block are homogeneous
• Used in allocation (picking the ad for each new page)
• Used in estimation (updating priorities after each observation)
Multi-level Policy (Allocation)
[Diagram: incoming webpage → page classifier → taxonomy matrix]
Classify webpage → page class, parent page class
Run bandit on ad parent classes → pick one ad parent class
Run bandit among cells → pick one ad class
In general, continue from root to leaf → final ad
Multi-level Policy (Allocation)
Bandits at higher levels
• use aggregated information
• have fewer bandit arms
⇒ quickly figure out the best ad parent class
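A minimal sketch of this two-level allocation, assuming the UCB1 class above; classify_page and the container layout (dicts of bandits keyed by class) are illustrative assumptions, not the authors' implementation:

```python
def classify_page(page):
    # Hypothetical stand-in for the page classifier from the talk:
    # maps a page to its taxonomy node and that node's parent.
    return page["page_class"], page["parent_page_class"]

def allocate(page, parent_bandits, block_bandits, ad_parents, ad_children):
    page_class, parent_page_class = classify_page(page)

    # Level 1: bandit over ad parent classes (one bandit per parent page class).
    top = parent_bandits[parent_page_class]
    i = top.select_arm()
    ad_parent = ad_parents[i]

    # Level 2: bandit over the ad child classes inside the chosen block.
    block = block_bandits[(page_class, ad_parent)]
    j = block.select_arm()
    ad_class = ad_children[ad_parent][j]
    return ad_parent, ad_class, (i, j)

def record(click, page, ad_parent, arms, parent_bandits, block_bandits):
    # Both levels learn from the same observation (click is 0 or 1).
    page_class, parent_page_class = classify_page(page)
    i, j = arms
    parent_bandits[parent_page_class].update(i, click)
    block_bandits[(page_class, ad_parent)].update(j, click)
```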
Multi-level Policy
CTRs in a block are homogeneous
• Used in allocation (picking the ad for each new page)
• Used in estimation (updating priorities after each observation)
Multi-level Policy (Estimation)
CTRs in a block are homogeneous
⇒ Observations from one cell also give information about other cells in the block
How can we model this dependence?
Multi-level Policy (Estimation)
Shrinkage Model
S_cell | CTR_cell ~ Bin(N_cell, CTR_cell)
CTR_cell ~ Beta(Params_block)
(S_cell = # clicks in the cell, N_cell = # impressions in the cell)
All cells in a block come from the same distribution
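The shrinkage form on the next slide follows from standard Beta-Binomial conjugacy; a short derivation, writing Params_block as (a, b) (an assumed parameterization):

```latex
% Prior Beta(a, b) is conjugate to the Bin(N_cell, CTR_cell) likelihood,
% so the posterior is Beta(a + S_cell, b + N_cell - S_cell), with mean:
\[
\mathbb{E}[\mathrm{CTR}_{\mathrm{cell}} \mid S_{\mathrm{cell}}]
  = \frac{a + S_{\mathrm{cell}}}{a + b + N_{\mathrm{cell}}}
  = \alpha \,\frac{a}{a+b} + (1-\alpha)\,\frac{S_{\mathrm{cell}}}{N_{\mathrm{cell}}},
\qquad
\alpha = \frac{a+b}{a+b+N_{\mathrm{cell}}} .
\]
```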
Multi-level Policy (Estimation)
Intuitively, this leads to shrinkage of cell CTRs towards the block CTR:
E[CTR_cell] = α · Prior_block + (1 − α) · S_cell / N_cell
(estimated CTR = the Beta prior, i.e. the "block CTR", mixed with the observed CTR)
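A minimal sketch of this Beta-Binomial shrinkage, assuming the block prior Beta(a, b) has already been fit from the block's cells (e.g., by the method of moments); the function name is illustrative:

```python
def shrunk_ctr(clicks, impressions, a, b):
    """Posterior-mean CTR of a cell under a Beta(a, b) block prior.

    Algebraically equal to alpha * Prior_block + (1 - alpha) * S_cell/N_cell
    with alpha = (a + b) / (a + b + impressions) and Prior_block = a / (a + b):
    cells with few impressions are pulled towards the block CTR.
    """
    return (a + clicks) / (a + b + impressions)

# Example: 2 clicks in 50 impressions under a Beta(3, 97) block prior
# (block CTR 3%); the raw 4% estimate shrinks to 5/150 ~= 3.3%.
print(shrunk_ctr(2, 50, a=3.0, b=97.0))
```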
Outline
Problem
Background: Multi-armed bandits
Proposed Multi-level Policy
Experiments
Related Work
Conclusions
Experiments
[Taxonomy structure: Root (depth 0) → 20 nodes (depth 1) → 221 nodes (depth 2) → … → ~7000 leaves (depth 7)]
We use these 2 levels (depths 1 and 2)
Experiments
• Data collected over a 1-day period
• Collected from only one server, under some other ad-matching rules (not our bandit)
• ~229M impressions
• CTR values have been linearly transformed for purposes of confidentiality
Experiments (Multi-level Policy)
Multi-level gives much higher #clicks
[Plot: #clicks vs. number of pulls]
Experiments (Multi-level Policy)
Multi-level gives much better Mean-Squared Error ⇒ it has learnt more from its explorations
[Plot: mean-squared error vs. number of pulls]
Experiments (Shrinkage)
[Plots: #clicks and mean-squared error vs. number of pulls, with and without shrinkage]
Shrinkage improved Mean-Squared Error, but no gain in #clicks
Outline
Problem
Background: Multi-armed bandits
Proposed Multi-level Policy
Experiments
Related Work
Conclusions
Related Work
• Typical multi-armed bandit problems
  – Do not consider dependencies
  – Very few arms
• Bandits with side information
  – Cannot handle dependencies among ads
• General MDP solvers
  – Do not use the structure of the bandit problem
  – Emphasis on learning the transition matrix, which is random in our problem
Conclusions
• Taxonomies exist for many datasets
• They can be used for:
  – Dimensionality reduction
  – Multi-level bandit policy ⇒ higher #clicks
  – Better estimation via shrinkage models ⇒ better MSE