
Maria-Florina (Nina) Balcan

An Online Learning Approach to Data-driven Algorithm Design

Carnegie Mellon University

Analysis and Design of Algorithms

Classic algo design: solve a worst case instance.

• Easy domains have optimal poly-time algos.

E.g., sorting, shortest paths

• Most domains are hard.

Data driven algo design: use learning & data for algo design.

• Suited when we repeatedly solve instances of the same algo problem.

E.g., clustering, partitioning, subset selection, auction design, …

Prior work: largely empirical.

• Artificial Intelligence: E.g., [Xu-Hutter-Hoos-LeytonBrown, JAIR 2008]

• Computational Biology: E.g., [DeBlasio-Kececioglu, 2018]

• Game Theory: E.g., [Likhodedov and Sandholm, 2004]

• Different methods work better in different settings.

• Large family of methods – what’s best in our application?

Data Driven Algorithm Design

Data driven algo design: use learning & data for algo design.

Our work: data-driven algos with formal guarantees.

• Several case studies of widely used algo families.

• General principles: push boundaries of algorithm design and machine learning.

Algorithm Design as Distributional Learning

Goal: given a large family of algos and a sample of typical instances from the domain, find an algo that performs well on new instances from the same domain.

Large family 𝐅 of algorithms

Sample of i.i.d. typical inputs

[Figure: example domains and samples: clustering and facility location instances Input 1, Input 2, …, Input m, and a large family of algorithms built from subroutines such as MST, greedy, dynamic programming, and farthest-location steps.]

[Gupta-Roughgarden, ITCS’16 & SICOMP’17]

Statistical Learning Approach to AAD

Challenge: “nearby” algos can have drastically different behavior.

Sample Complexity: How large should our sample of typical instances be in order to guarantee good performance on new instances?

m = O(dim(F)/ϵ) instances suffice to ensure generalizability

[Figure: performance as a function of the parameters α ∈ ℝ^s is volatile and piecewise structured: IQP objective value vs. rounding parameter s; posted-price revenue vs. price; auction revenue vs. reserve r, with breakpoints at the second-highest and highest bids.]

Prior Work: [Gupta-Roughgarden, ITCS’16 & SICOMP’17] proposed the model; analyzed greedy algos for subset selection pbs (knapsack & independent set).

Our results: New algorithm classes for a wide range of problems.

Algorithm Design as Distributional Learning

Clustering: Parametrized Linkage [Balcan-Nagarajan-Vitercik-White, COLT 2017]

[Figure: pipeline: DATA → linkage step {single linkage, complete linkage, α-weighted comb, …, Ward’s alg} → {DP for k-means, DP for k-median, DP for k-center} → CLUSTERING.]

Parametrized Lloyds [Balcan-Dick-White, NeurIPS 2018]

[Figure: pipeline: DATA → seeding step {random seeding, farthest first traversal, k-means++, …, D^α sampling} → {L2-local search, β-local search} → CLUSTERING.]

Alignment pbs (e.g., string alignment): parametrized dynamic prog.

[Balcan-DeBlasio-Dick-Kingsford-Sandholm-Vitercik, 2019]

dim(F) = O(log n); dim(F) = O(k log n) (for these families).

Our results: New algorithm classes for a wide range of problems.

Algorithm Design as Distributional Learning

• Partitioning pbs via IQPs: SDP + Rounding. E.g., Max-Cut, Max-2SAT, Correlation Clustering. [Balcan-Nagarajan-Vitercik-White, COLT 2017]

[Figure: pipeline: Integer Quadratic Programming (IQP) instance → SDP relaxation → rounding {GW rounding, 1-linear rounding, …, s-linear rounding, …} → feasible solution to IQP.]

• MIPs: Branch and Bound Techniques [Balcan-Dick-Sandholm-Vitercik, ICML’18]

MIP instance: max c ∙ x s.t. Ax = b, x_i ∈ {0,1} ∀ i ∈ I.

[Figure: branch-and-bound loop: choose a leaf of the search tree (best-bound, depth-first); fathom if possible and terminate if possible; choose a variable to branch on (most fractional, α-linear, product).]

• Automated mechanism design for revenue maximization

Parametrized VCG auctions, posted prices, lotteries.

[Balcan-Sandholm-Vitercik, EC 2018]

dim(F) = O(log n)

Algorithm Design as Distributional Learning

[Balcan-DeBlasio-Kingsford-Dick-Sandholm-Vitercik, 2019]

Our results: General sample complexity tools via the structure of the dual class.

Thm: Suppose for each cost_I(α) there are ≤ N boundary fns f_1, f_2, … ∈ F s.t. within each region they define, ∃ g ∈ G s.t. cost_I(α) = g(α). Then

Pdim({cost_α(I)}) = O((d_F* + d_G*) log(d_F* + d_G*) + d_F* log N),

where d_F* = VC-dim of the dual of F and d_G* = Pdim of the dual of G.

[Figure: the piecewise-structured performance measures from before: IQP objective value vs. s; posted-price revenue vs. price; auction revenue vs. reserve r, with breakpoints at the second-highest and highest bids.]

Online Algorithm Selection

• So far, batch setting: collection of typical instances given upfront.

• [Balcan-Dick-Vitercik, FOCS 2018], [Balcan-Dick-Pegden, 2019]: online alg. selection.

• Challenge: cannot use known techniques; the scoring fns are non-convex, with lots of discontinuities.

• Identify general properties (piecewise Lipschitz fns with dispersed discontinuities) sufficient for strong bounds.

• Show these properties hold for many alg. selection pbs.


Structure of the Talk

• Overview. Data driven algo design as batch learning.

• Data driven algo design via online learning.

• An example: clustering via parametrized linkage.

• Online learning of non-convex (piecewise Lipschitz) fns.

Example: Clustering Problems

Clustering: Given a set of objects, organize them into natural groups.

• E.g., cluster news articles, or web pages, or search results by topic.

• Or, cluster customers according to purchase history.

Often need to solve such problems repeatedly.

• E.g., clustering news articles (Google news).

• Or, cluster images by who is in them.

Example: Clustering Problems

Clustering: Given a set of objects, organize them into natural groups.

Objective based clustering. Input: set of objects S, distance d. Output: centers {c_1, c_2, …, c_k}.

• k-means: minimize Σ_p min_i d²(p, c_i).

• k-median: minimize Σ_p min_i d(p, c_i).

• k-center/facility location: minimize the maximum radius.
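
To make the objectives above concrete, here is a minimal sketch of the k-means and k-median costs of a candidate set of centers (the generic distance callback `d` and function names are illustrative, not from the talk):

```python
def kmeans_cost(points, centers, d):
    """Sum over points of the squared distance to the nearest center."""
    return sum(min(d(p, c) for c in centers) ** 2 for p in points)

def kmedian_cost(points, centers, d):
    """Sum over points of the distance to the nearest center."""
    return sum(min(d(p, c) for c in centers) for p in points)
```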

• Finding OPT is NP-hard, so no universal efficient algo that works on all domains.

Clustering: Linkage + Dynamic Programming

Family of poly time 2-stage algorithms:

1. Use a greedy linkage-based algorithm to organize data into a hierarchy (tree) of clusters.

2. Dynamic programming over this tree to identify pruning of tree corresponding to the best clustering.
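
As a rough illustration of step 2, here is a minimal DP sketch over a binary cluster tree; the node fields (.points, .left, .right) and the per-cluster `cost` callback are assumptions of this sketch, not notation from the talk:

```python
import math

def best_pruning(node, k, cost, memo=None):
    """Find the pruning of the cluster tree rooted at `node` into at most k
    clusters minimizing the sum of per-cluster costs (e.g., cost(points)
    could be the k-median cost of that cluster with its best center)."""
    if memo is None:
        memo = {}
    key = (id(node), k)
    if key in memo:
        return memo[key]
    if k <= 0:
        return math.inf, []
    # Option 1: keep this whole subtree as a single cluster.
    best = (cost(node.points), [node.points])
    # Option 2: split the cluster budget between the two children.
    if node.left is not None:
        for k_left in range(1, k):
            c_l, p_l = best_pruning(node.left, k_left, cost, memo)
            c_r, p_r = best_pruning(node.right, k - k_left, cost, memo)
            if c_l + c_r < best[0]:
                best = (c_l + c_r, p_l + p_r)
    memo[key] = best
    return best
```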

[Figure: cluster tree over points A, B, C, D, E, F, and a pruning of the tree into clusters.]

Clustering: Linkage + Dynamic Programming

1. Use a linkage-based algorithm to get a hierarchy.

2. Dynamic programming to find the best pruning.

[Figure: DATA → {single linkage, complete linkage, α-weighted comb, …, Ward’s algo} → {DP for k-means, DP for k-median, DP for k-center} → CLUSTERING.]

Both steps can be done efficiently.

Linkage Procedures for Hierarchical Clustering

Bottom-Up (agglomerative)

[Figure: example topic hierarchy: all topics → sports (soccer, tennis) and fashion (Gucci, Lacoste).]

• Start with every point in its own cluster.

• Repeatedly merge the “closest” two clusters.

Different defs of “closest” give different algorithms.

Linkage Procedures for Hierarchical Clustering

Have a distance measure on pairs of objects.

d(x,y) – distance between x and y

E.g., # keywords in common, edit distance, etc.

• Single linkage: dist(A, B) = min_{x∈A, x′∈B} dist(x, x′)

• Average linkage: dist(A, B) = avg_{x∈A, x′∈B} dist(x, x′)

• Complete linkage: dist(A, B) = max_{x∈A, x′∈B} dist(x, x′)

• Parametrized family, α-weighted linkage: dist(A, B) = α min_{x∈A, x′∈B} dist(x, x′) + (1 − α) max_{x∈A, x′∈B} dist(x, x′)
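
A minimal sketch of the α-weighted merge rule above (the distance-matrix representation and naive O(n²) scan per merge are assumptions of this sketch):

```python
def alpha_weighted_linkage(dist, n, alpha):
    """Agglomerative clustering with the alpha-weighted rule
    d(A, B) = alpha * min-linkage(A, B) + (1 - alpha) * max-linkage(A, B).
    `dist` is an n x n symmetric matrix of pairwise distances.
    Returns the sequence of merges, which determines the cluster tree."""

    def d(A, B):
        pair = [dist[x][y] for x in A for y in B]
        return alpha * min(pair) + (1 - alpha) * max(pair)

    clusters = [frozenset([i]) for i in range(n)]
    merges = []
    while len(clusters) > 1:
        # Pick the pair of current clusters that is closest under the rule.
        i, j = min(((a, b) for a in range(len(clusters))
                           for b in range(a + 1, len(clusters))),
                   key=lambda ab: d(clusters[ab[0]], clusters[ab[1]]))
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] | clusters[j]
        clusters = [c for t, c in enumerate(clusters) if t not in (i, j)]
        clusters.append(merged)
    return merges
```

Setting α = 1 recovers single linkage and α = 0 recovers complete linkage.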

Clustering: Linkage + Dynamic Programming

1. Use a linkage-based algorithm to get a hierarchy.

2. Dynamic programming to find the best pruning.

[Figure: DATA → {single linkage, complete linkage, α-weighted comb, …, Ward’s algo} → {DP for k-means, DP for k-median, DP for k-center} → CLUSTERING.]

• Used in practice. E.g., [Filippova-Gadani-Kingsford, BMC Informatics].

• Strong properties. E.g., best known algos for perturbation resilient instances of k-median, k-means, k-center.

PR (perturbation resilience): small changes to the input distances shouldn’t move the optimal solution by much.

[Balcan-Liang, SICOMP 2016] [Awasthi-Blum-Sheffet, IPL 2011] [Angelidakis-Makarychev-Makarychev, STOC 2017]

Clustering: Linkage + Dynamic Programming

Key fact: If we fix a clustering instance of n pts and vary α ∈ ℝ, there are at most O(n^8) switching points where the behavior on that instance changes. So the cost function is piecewise constant with at most O(n^8) pieces.

[Figure: the real line of α values partitioned into intervals on which the output is constant.]

Clustering: Linkage + Dynamic Programming

[Figure: four clusters 𝒩1, 𝒩2, 𝒩3, 𝒩4, with candidate closest/farthest pairs (p, q), (p′, q′) between 𝒩1 and 𝒩2, and (r, s), (r′, s′) between 𝒩3 and 𝒩4.]

• For a given α, which will merge first, 𝒩1 and 𝒩2, or 𝒩3 and 𝒩4?

• Depends on which of (1 − α)d(p, q) + αd(p′, q′) or (1 − α)d(r, s) + αd(r′, s′) is smaller.

Key idea:

• An interval boundary corresponds to an equality involving 8 points, so there are O(n^8) interval boundaries.
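
Solving the merge comparison above for α makes the boundary explicit (the analogous closed form reappears in the ρ-weighted example later in the talk):

```latex
\[
(1-\alpha)\,d(p,q) + \alpha\,d(p',q') = (1-\alpha)\,d(r,s) + \alpha\,d(r',s')
\iff
\alpha = \frac{d(p,q) - d(r,s)}
              {\bigl(d(p,q) - d(r,s)\bigr) + \bigl(d(r',s') - d(p',q')\bigr)}.
\]
```

Each choice of eight points contributes at most one such critical value, which is where the O(n^8) count comes from.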

Key fact: If we fix a clustering instance of n pts and vary α ∈ ℝ, there are at most O(n^8) switching points where the behavior on that instance changes.

Online Algorithm Selection

• [Balcan-Dick-Vitercik, FOCS 2018], [Balcan-Dick-Pegden, 2019]: online alg. selection.

• Challenge: cannot use known techniques; the scoring fns are non-convex, with lots of discontinuities.

• Identify general properties (piecewise Lipschitz fns with dispersed discontinuities) sufficient for strong bounds.

• Show these properties hold for many alg. selection pbs.


Online Algorithm Selection via Online Optimization

Online optimization of general piecewise Lipschitz functions

Goal: minimize regret: max_{ρ∈𝒞} Σ_{t=1}^T u_t(ρ) − 𝔼[Σ_{t=1}^T u_t(ρ_t)]

(the performance of the best parameter in hindsight minus our cumulative performance).

On each round t ∈ {1, …, T}:

1. The online learning algo chooses a parameter ρ_t.

2. The adversary selects a piecewise Lipschitz function u_t: 𝒞 → [0, H] (corresponds to some pb instance and its induced scoring fnc).

3. Get feedback. Payoff: the score of the parameter we selected, u_t(ρ_t). Full information: observe the function u_t(∙). Bandit feedback: observe only the payoff u_t(ρ_t).

Online Regret Guarantees

Existing techniques (for the finite, linear, or convex case): select ρ_t probabilistically based on performance so far.

• Regret guarantee: max_{ρ∈𝒞} Σ_{t=1}^T u_t(ρ) − 𝔼[Σ_{t=1}^T u_t(ρ_t)] = Õ(√T × ⋯)

• Probability exponential in performance [Cesa-Bianchi and Lugosi 2006]

No-regret: per-round regret approaches 0 at rate Õ(1/√T).

Challenge: with discontinuities, cannot in general get no-regret.

To achieve low regret, need structural condition.

• Adversary can force online algo to “play 20 questions” while hiding an arbitrary real number.

• Round 1: adversary splits parameter space in half and randomly chooses one half to perform well, other half to perform poorly.

• Round 2: repeat on parameters that performed well in round 1. Etc.

• Any algorithm does poorly half the time in expectation but ∃ perfect 𝝆.

[Figure: piecewise Lipschitz functions (Lipschitz within each piece). Dispersed: few boundaries within any interval. Not dispersed: many boundaries within an interval.]

Dispersion, Sufficient Condition for No-Regret

u_1(∙), …, u_T(∙) are (w, k)-dispersed if any ball of radius w contains boundaries for at most k of the u_i.
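
For intuition, a small 1-D sketch (an illustrative helper, not from the talk) that computes the smallest k for which a given set of discontinuity locations is (w, k)-dispersed:

```python
def dispersion_k(discontinuities, w):
    """Given, for each function u_i, the list of its discontinuity locations
    on the real line, return the largest number of functions having at least
    one discontinuity inside any single interval of width 2*w (a 1-D ball of
    radius w).  The sequence is then (w, k)-dispersed for this k."""
    # Pair every discontinuity with the index of the function it came from.
    pts = sorted((x, i) for i, xs in enumerate(discontinuities) for x in xs)
    worst = 0
    for start in range(len(pts)):
        left = pts[start][0]
        hit = {i for x, i in pts[start:] if x <= left + 2 * w}
        worst = max(worst, len(hit))
    return worst
```

For example, dispersion_k([[0.1, 0.5], [0.52], [0.9]], 0.02) returns 2, since only the boundaries at 0.5 and 0.52 fall in a common interval of width 0.04.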

Full info: exponentially weighted forecaster [Cesa-Bianchi-Lugosi 2006]. On each round t ∈ {1, …, T}, sample ρ_t from the distribution p_t(ρ) ∝ exp(λ Σ_{s=1}^{t−1} u_s(ρ)).

Our Results: for dispersed fns, regret Õ(√(Td) × (fnc of problem)).
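
A minimal sketch of the forecaster; the finite grid approximation of the parameter space and the numpy sampling are assumptions of this sketch (the talk samples from the continuous density):

```python
import numpy as np

def exp_weighted_forecaster(utility_fns, lam, grid, seed=0):
    """Full-information exponentially weighted forecaster over a 1-D
    parameter space, approximated on a finite `grid` of candidate rho values.
    `utility_fns` are u_1, ..., u_T, each mapping rho to a payoff in [0, H]."""
    rng = np.random.default_rng(seed)
    cum_utility = np.zeros(len(grid))            # U_t(rho) on the grid
    payoffs = []
    for u_t in utility_fns:
        weights = np.exp(lam * cum_utility)      # unnormalized exp(lam * U_t)
        p_t = weights / weights.sum()
        rho_t = rng.choice(grid, p=p_t)          # sample rho_t ~ p_t
        payoffs.append(u_t(rho_t))               # payoff received this round
        cum_utility += np.array([u_t(r) for r in grid])  # full-info update
    return payoffs
```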

Dispersion, Sufficient Condition for No-Regret

Full info: exponentially weighted forecaster [Cesa-Bianchi-Lugosi 2006]. On each round t ∈ {1, …, T}, sample ρ_t from the distribution p_t(ρ) = exp(λ U_t(ρ)) / W_t, where U_t(ρ) = Σ_{s=0}^{t−1} u_s(ρ) is the total payoff of ρ up to time t and W_t = ∫ exp(λ U_t(ρ)) dρ (so the density of ρ is exponential in its performance so far).

Thm: Assume u_1(∙), …, u_T(∙) are piecewise L-Lipschitz and (w, k)-dispersed. The expected regret is O(H(√(Td log(R/w)) + k) + TLw).

For most problems: set w ≈ 1/√T and k = √T × (fnc of problem).

Dispersion, Sufficient Condition for No-Regret

Key idea: Provide upper and lower bounds on W_{T+1}/W_1, where W_t = ∫ exp(λ U_t(ρ)) dρ is the normalizing factor.

• Upper bound: classic, relate the algo's performance with the magnitude of the update:

W_{t+1}/W_t ≤ exp((e^{Hλ} − 1) P_t / H), where P_t = expected alg payoff at time t,

so W_{T+1}/W_1 ≤ exp((e^{Hλ} − 1) P_ALG / H), where P_ALG = expected alg total payoff.

Dispersion, Sufficient Condition for No-Regret

Key idea: analyze W_t = ∫ exp(λ U_t(ρ)) dρ (norm. factor).

• Lower bound (uses dispersion + Lipschitzness): Let ρ* be the optimal parameter vector in hindsight. For all ρ ∈ B(ρ*, w),

Σ_{t=1}^T u_t(ρ) ≥ Σ_{t=1}^T u_t(ρ*) − Hk − TLw, i.e., U_{T+1}(ρ) ≥ U_{T+1}(ρ*) − Hk − TLw.

(At most k of the fns u_s have a discontinuity in B(ρ*, w) and their range is [0, H]; the remaining fns are L-Lipschitz.)

Dispersion, Sufficient Condition for No-Regret

Key idea: analyze W_t = ∫ exp(λ U_t(ρ)) dρ (norm. factor).

• Lower bound, continued: since U_{T+1}(ρ) ≥ U_{T+1}(ρ*) − Hk − TLw for all ρ ∈ B(ρ*, w),

W_{T+1} ≥ Vol(B(ρ*, w)) · exp(λ(U_{T+1}(ρ*) − Hk − TLw)).

We get W_{T+1}/W_1 ≥ (w/R)^d · exp(λ(U_{T+1}(ρ*) − Hk − TLw)).

Dispersion, Sufficient Condition for No-Regret

Key idea: analyze W_t = ∫ exp(λ U_t(ρ)) dρ (norm. factor).

• Combining the upper and lower bounds on W_{T+1}/W_1:

exp((e^{Hλ} − 1) P_ALG / H) ≥ (w/R)^d · exp(λ(U_{T+1}(ρ*) − Hk − TLw)).

Take logs, use the standard approximation of e^z, and the fact that P_ALG ≤ HT:

U_{T+1}(ρ*) − P_ALG ≤ H²Tλ + (d log(R/w))/λ + Hk + TLw.

Set λ = (1/H)√((d/T) log(R/w)) to get Regret = O(H(√(Td log(R/w)) + k) + TLw).
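
Plugging the stated λ into the bound above is a one-line arithmetic check:

```latex
\[
\lambda = \frac{1}{H}\sqrt{\frac{d \log(R/w)}{T}}
\;\Longrightarrow\;
H^2 T \lambda + \frac{d \log(R/w)}{\lambda} = 2H\sqrt{T d \log(R/w)},
\quad\text{so}\quad
U_{T+1}(\rho^*) - P_{\mathrm{ALG}} \le O\bigl(H\sqrt{T d \log(R/w)} + Hk + TLw\bigr).
\]
```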


Example: Clustering with ρ-weighted linkage

ρ-weighted linkage: dist(A, B) = ρ min_{x∈A, x′∈B} dist(x, x′) + (1 − ρ) max_{x∈A, x′∈B} dist(x, x′).

Theorem: If the T instances have distances in [0, B] drawn from κ-bounded densities, then for any w, with prob ≥ 1 − δ, we get (w, k)-dispersion for k = O(wn^8 κ² B² T) + O(√(T log(n/δ))).

(The first term bounds, for any given interval I, the expected # of instances with discontinuities in I; the second comes from a uniform convergence argument.)

Implies expected regret of O(H √(T log(TnκB))).

Example: Clustering with ρ-weighted linkage

ρ-weighted linkage: dist(A, B) = ρ min_{x∈A, x′∈B} dist(x, x′) + (1 − ρ) max_{x∈A, x′∈B} dist(x, x′).

Theorem: If the T instances have distances in [0, B] drawn from κ-bounded densities, then for any w, with prob ≥ 1 − δ, we get (w, k)-dispersion for k = O(wn^8 κ² B² T) + O(√(T log(n/δ))).

[Figure: clusters C1, C2, C3, C4, with distances d1, d2 between C1 and C2, and d3, d4 between C3 and C4.]

• Given ρ, which merges first, C1 and C2, or C3 and C4? Critical value when (1 − ρ)d1 + ρd2 = (1 − ρ)d3 + ρd4.

• Each instance has O(n^8) discontinuities, of the form (d1 − d3)/((d1 − d3) + (d4 − d2)).

• Discontinuities come from an O((κB)²)-bounded density.

• So, for any given interval I of width w, the expected # of instances with discontinuities in I is O(wn^8 κ² B² T). Uniform convergence on top.

Proving Dispersion

Key ideas:

• Use functional form of discontinuities to get distribution of discontinuity locations, and expected number in any given interval.

• Randomness either from input instances (smoothed model) or from randomness in the algorithms themselves

• Use uniform convergence result on the top.

Lemma: Let u_1, …, u_T: ℝ → ℝ be independent piecewise Lipschitz fns, each with ≤ N discontinuities. For an interval I, let D(I) be the # of u_i with at least one discontinuity in I. Then with prob ≥ 1 − δ:

sup_I (D(I) − E[D(I)]) ≤ O(√(T log(N/δ)))

Proof Sketch:

• For an interval I, let f_I(i) = 1 iff u_i has a discontinuity in I. Then D(I) = Σ_i f_I(i).

• Show the VC-dim of ℱ = {f_I} is O(log N), then use uniform convergence.

Proof: VC-dimension of ℱ is O(log N)

Lemma: For an interval I and a set s of ≤ N points, let f_I(s) = 1 iff s has a point inside I. The VC-dimension of the class ℱ = {f_I} is O(log N).

Proof idea: How many ways can ℱ label d instances (d sets of ≤ N pts) s_1, …, s_d?

• The d instances have at most Nd points in their union.

• For f_I, f_I′ to label the instances differently, a minimal requirement is that the intervals I and I′ don't cover exactly the same points.

• But given Nd points, there are at most (Nd)² + 1 different subsets of them that can be covered using intervals.

To shatter d instances, we must have 2^d ≤ (Nd)² + 1, so the VC-dim d = O(log N).
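
Filling in the last arithmetic step of the counting argument:

```latex
\[
2^d \le (Nd)^2 + 1 \le 2N^2 d^2
\;\Longrightarrow\;
d \le 1 + 2\log_2 N + 2\log_2 d
\;\Longrightarrow\;
d = O(\log N).
\]
```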

Example: Knapsack

A problem instance is a collection of items i, each with a value v_i and a size s_i, along with a knapsack capacity C.

Goal: select the most valuable subset of items that fits: find a subset V maximizing Σ_{i∈V} v_i subject to Σ_{i∈V} s_i ≤ C.

Greedy family of algorithms:

• Greedily pack items in order of value (highest value first), subject to feasibility.

• Greedily pack items in decreasing order of v_i/s_i^ρ, subject to feasibility.
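
A minimal sketch of this greedy family (the function name and list-based representation are illustrative):

```python
def rho_greedy_knapsack(values, sizes, capacity, rho):
    """Pack items in nonincreasing order of v_i / s_i**rho, skipping any item
    that no longer fits.  rho = 0 recovers 'highest value first'; rho = 1 is
    the classic value-density rule."""
    order = sorted(range(len(values)),
                   key=lambda i: values[i] / sizes[i] ** rho,
                   reverse=True)
    chosen, used = [], 0.0
    for i in order:
        if used + sizes[i] <= capacity:
            chosen.append(i)
            used += sizes[i]
    return chosen, sum(values[i] for i in chosen)
```

Each instance induces a utility u(ρ) = total value packed by the ρ-greedy rule, which is the piecewise constant scoring fn analyzed in the theorem below.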

Thm: The adversary chooses knapsack instances with n items, capacity C, sizes s_i in [1, C], values v_i in (0, 1], and item values drawn independently from b-bounded distributions. The utilities u_1, …, u_T are piecewise constant and, with prob ≥ 1 − δ, (w, k)-dispersed for k = O(wTn²b² log C + √(T log(n/δ))).

Online Learning Summary

• Exploit dispersion to get no-regret.

• Learning theory: techniques of independent interest beyond algorithm selection.

• Can bound Rademacher complexity as a function of dispersion.

• For partitioning pbs via IQPs, SDP + Rounding, exploit smoothness in the algorithms themselves.

• For clustering & subset selection via greedy, assume input instances smoothed to show dispersion.

• Also differential privacy bounds.

• Also bandit and semi-bandit feedback.
