TTI UChicago June09 Workshop serializedweb.cse.ohio-state.edu/mlss09/mlss09_talks/6.june-SAT/... ·...

More Data, Less Work:Runtime as a decreasing function

of data set size

Nati SrebroToyota Technological Institute—Chicago

Outline

SVM Clusteringspeculations,

other problems

we arehere

•SVM Optimization: Inverse Dependence on Training Set SizeShai Shalev-Shwartz (TTI), N Srebro

•Fast Rates for Regularized ObjectivesKarthik Sridharan (TTI), Shai Shalev-Shwartz (TTI), N Srebro, NIPS’08

•Pegasos: Primal Estimated sub-GrAdient SOlver for SVMShai Shalev-Shwartz (TTI), Yoram Singer (Google), N Srebro, ICML’07

•An Investigation of Comp. and Informational Limits in Gaussian Mixture ClusteringN Srebro, Greg Shakhnarovich (TTI), Sam Roweis (Google/Toronto), ICML’06

Large Margin Linear Classificationaka L2-regularized Linear Classification

aka Support Vector Machines

|w|=1w

<w,x> ≥ M

<w,x> ≤ -M

Error: [1-y<w,x>]+

Large Margin Linear Classificationaka L2-regularized Linear Classification

aka Support Vector Machines

Margin: M = 1/|w|w

<w,x> ≤ -1

<w,x> ≥ 1

Error: [1-y<w,x>]+

SVM Training as anOptimization Problem

• IP method on dual (standard QP solver):O(n3.5 log log(1/ε))

• Dual decomposition methods (e.g. SMO):O(n2 d log(1/ε)) [Platt 98][Joachims 98][Lin 02]

• Primal cutting plane method (SVMperf):O( nd / (λε) ) [Joachims 06][Smola et al 08]

Runtime to get f(w) ≤ min f(w) + ε

More Data ⇒ More Work?10k training examples 1 hour 2.3% error

(when usingthe predictor)

1M training examples 1 week (or more…) 2.29% error

10 minutes 2.3% error

But I really care about that 0.01% gain

Can always sample and get same runtime:

Can we leverage the excess data to reduce runtime?

1 hour 2.3% error

Study runtime increase as a function of target accuracy

Study runtime increase as a function of problem difficulty (e.g. small margin)

My problem is so hard, I have to crunch 1M examples

SVM Training

• Optimization objective:

• True objective: prediction error on future exampleserr(w) = Ex,y[error of w’x vs. y] ≈ E[ [1-y〈w,x〉]+ ]

• Would like to understand computational cost in terms of:• Increasing function of:

– Desired generalization performance (i.e. as err(w) decreases)– Hardness of problem:

margin, noise (unavoidable error)

• Decreasing function of available data set size

Error Decomposition

• Approximation error:– Best error achievable by large-margin predictor– Error of population minimizer

w0 = argmin E[f(w)] = argmin λ|w|2 + E[loss(w)]

• Estimation error:– Extra error due to replacing E[loss] with empirical loss

w* = arg min fn(w) = arg min λ |w|2 + loss(w on training set)

• Optimization error:– Extra error due to only optimizing to within finite precision

err(w0)

err(w*)

err(w)Prediction error

The Double-Edged Sword

• When data set size increases:– Estimation error decreases– Can increase optimization error,

i.e. optimize to within lesser accuracy ⇒ fewer iterations– But handling more data is expensive

e.g. runtime of each iteration increases

• PEGASOS (Primal Efficient Sub-Gradient Solver for SVMs) [Shalev-Shwartz Singer S 07]

– Fixed runtime per iteration– Runtime to get fixed accuracy does not increase with n

err(w0)

err(w*)

err(w)

data set size (n)

Error DecompositionPrediction error

PEGASOS: Stochastic (sub-)Gradient Descent

• Initialize w=0

• At each iteration t,with random data point (xi,yi):

subgradient ofλ|w|2+[1-yi<w,xi>]+

• Theorem: After at most iterations, f(wPEGASOS) ≤ minw f(w)+ε,with probability ≥ 1-δ

• With d-dimensional (or d-sparse) features, each iteration takes time O(d)

• Conclusion: Run-time required for PEGASOS to find ε accurate solution with constant probability:

• Run-time does not depend on #examples

Training Time (in seconds)

8052Physics ArXiv(62k examples, 100k features)

25,514856Covertype(581k examples, 54 features)

20,075772Reuters CCAT (800K examples, 47k features)

SVM-Light [Joachims]

SVM-Perf[Joachims06]

Pegasos

Runtime Analyzis

If there is some predictor w0 with low |w0| and low err(w0),how much time to find predictor with err(w) ≤ err(w0)+ε

large margin M=1/|w0|

Data Laden analysis: Restricted by computation, not data

λ = O(ε/|w0|2)εacc = O(ε)n = Ω(1/(λ ε)) = Ω(|w0|2/ε2)

Traditional Data Laden:f(w)<f(w*)+εacc err(w)≤ err(w0)+ε

Interior Point n3.5 log(log(1/εεεεacc)) |w0|7/ε7

SMO n2 d log(1/εεεεacc) d |w0|4/ε4

SVMPerf n d / (λλλλ εεεεacc) d |w0|4/ε4

PEGASOS d / (λλλλ εεεεacc) d |w0|2/ε2

Unlimited data available, can choose working data-set size

To get err(w) ≤ err(w0)+O(ε):

(ignoring log-factors)

Dependence on Data Set Size

Training set Size

Minimal Training Size(Stat Learning Theory)

T = Ω

d(ǫM −O

(1√n

PEGASOS guaranteed runtime to get error err(w0)+εwith n training points:

Target error

Training set Size

Runtime to get target error

Minimal Runtime(Data Laden)

approx. errapprox. err

est. errest. err

opt. erropt. err

Increases for smaller target

Increases for smaller margin

Decreases for larger data set

Training set Size

300,000 500,000 700,0000

Training Set Size

Target error

Training set Size

err<5.25% on Reuters CCAT

est. errest. err

opt. erropt. err

Training set Size

300,000 500,000 700,0000

Training Set Size

err(w) ≤ err(w0) + λ|w0|2 + O(1/(λn)) + O(d/(λT))

Increase λ as training size increases!More regularization, less predictors allowedLarger approximation error err(w0)+λ|w0|2

Faster runtime T ∝ 1/λ

Target error

Training set Size

err<5.25% on Reuters CCAT

est. errest. err

opt. erropt. err

Dependence on Data Set Size:Traditional Optimization Approaches

Training Set Size

SVM-Perf

Dependence on Data Set Size:Traditional Optimization Approaches

Training Set Size

SVM-Perf

Beyond PEGASOS

• Other machine learning problems– Kernalizes SVMs– L1-regularization, e.g. LASSO– Matrix/factor models (e.g. with trace-norm regularization)– Multilayer / deep networks…

• Can we more explicitly leverage excess data?– Playing only on the error decomposition,

const × minimum-sample-complexity is enough to get to const × minimum-data-laden-runtime

Clustering(by fitting a Gaussian mixture model)

•Find centers (µ1,…,µk) minimizing objective:–Negative log-likelihood under Gaussian mixture model:

-Σi log( Σj exp -(xi-µj)2/2 )–k-means objective ≈ negative log-likelihood of assignment:

Σi minj (xi-µj)2

Clustering(by fitting a Gaussian mixture model)

• Clustering is hard in the worst-case• Given LOTS of data and HUGE separation:

– Can efficiently recover true clustering[Dasgupta 99][Dasgupta Schulman 00][Arora Kannan 01][Vempala Wang 04] [Achliopts McSherry 05][Kannan Salmasian Vempala 05]

– EM works (empirically)

• With too little data, clustering is meaningless:– Even if we find the ML clustering, it has nothing to do with

underlying distribution

“Clustering isn’t hard—it’s either easy, or not interesting”

Effect of “Signal Strength”

Not enough data—“optimal” solution is meaningless.

Lots of data—true solution creates distinct peak.Easy to find.

Just enough data—optimal solution is meaningful, but hard to find?

~Informationallimit

Computationallimit ~

Larger data set

Smaller data set

Computational and Information Limits in Clustering

600 1000 2000 4000

What EM finds

exponential

data set size

k=16 clusters in d=512 dimensions, sep=4σ

200 400 600 800 1000

6000 k=16 s=4

500 1000 1500 20000

k=16 s=6

5 10 15 20 250

7500 d=512 s=4

5 10 15 20 25 300

3000 d=512 s=6

Dependence on dimensionality (d) and number of clusters (k)

EM finds M

Gap: EM fails, M

L better

ML no good

Dependence on thecluster separation

2.8 4 5 6 7 81/1000

separation (in units of σ)

EM finds ML

Gap: EM fails, ML better

ML no good

n = 9.7 kd/s2.2

n = 131 kd/s4.8

Conclusions from Empirical Study

• With enough samples, EM does find global ML, even with low separation

• There is an informational cost to tractability(at least when using known methods)

• Cost of tractability: PCA+EM+pruning (best known method) may require about s2 as much data as what is statistically necessary

• Cost increases when separation increases

Hardness as a Function of Dataset Size

ndataset size

“not interesting”—optimum does not

correspond to true solution provable polytime

EM**empiricallypolytime

ndataset size

polytime

ndataset size

polytimehard: provably no polytime algorithm

Informational Cost of Tractability?

• Gaussian Mixture Clustering• Learning structure of dependency networks

– Hard to find optimal (ML) structure in the worst case [Srebro 01]

– Polynomial-time algorithms for the large-sample limit [ChechetkaGuestrin 07]

• Graph partitioning (correlation clustering)– Hard in the worst case– Easy for large graphs with a “nice” partitions [McSherry 03]

• Finding cliques in random graphs• Planted Noisy MAX-SAT

• Required runtime:– increases with complexity of the answer (separation, decision boundary)

– increases with desired accuracy

– decreases with amount of available data• PEGASOS (stochastic sub-gradient descent for SVMs):

– Runtime to get fixed optimization accuracy doesn’t depend on n→ Best performance in data-laden regime→ Runtime decreases as more data is available

• Clustering

– Past informational limit, extra data is needed to make problem tractable– Cost of tractability increases quadratic with cluster seperation

More Data ⇒ Less Work

0 2 4 6 8 10data set size

data set size

TTI UChicago June09 Workshop serializedweb.cse.ohio-state.edu/mlss09/mlss09_talks/6.june-SAT/... ·...

Documents

NOSTRADAMUS LITE – SELECTED SPECULATIONS AS TO THE …

Green SpeculationS

Some Speculations About 'The Great Nicaraguan Lake

Speculations Issue4 eBook

Integrated Profile Polarization : Observations and Speculations

Speculations IV

The Strongest Presumption Challenged: Speculations on

Machine Learning and Cognitive Sciencemlg.eng.cam.ac.uk/mlss09/mlss_slides/mlss09... · – Learning about kinds of objects and their properties – Inferring causal relations –

Balakrishnan-Speculations on the Stationary

Speculations I Standard Version

Active Learning and Selective Sensingweb.cse.ohio-state.edu/mlss09/mlss09_talks/4.june-THURS/MLSS_Nowak_talk.pdfConclusions MinimaxBounds selective sampling/querying can accelerate

Kernel Methods and Support Vector Machinesweb.cse.ohio-state.edu/mlss09/mlss09_talks/1.june-MON/jst_tutorial.… · Kernel methods approach † Data embedded into a Euclidean feature

Towards a Critique of Globalcentrism: Speculations on ...maihold.org/mediapool/113/1132142/data/8-Coronil_Fernandol_2000_.pdf · Towards a Critique of Globalcentrism: Speculations

PAC-Bayes Analysis: Background and Applicationsweb.cse.ohio-state.edu/mlss09/mlss09_talks/1.june-MON/JST_pacbayes... · 2 PAC-Bayes Analysis Deﬁnitions PAC-Bayes Theorem Applications

2002 material speculations catalog

Spectral Graph Theory, Linear Solvers, and Applicationsweb.cse.ohio-state.edu/mlss09/mlss09_talks/9.june-TUE/GaryMiller.pdf · Spectral Graph Theory, Linear Solvers, and Applications

Competing Orders: speculations and interpretations

Gothenburg Workshop: Speculations Concerning Urban Form

Dossier KTH Speculations Seminar concepts

Dichotomy 14: Speculations