
Page 1:

Machine-Learning for Big Data:
Sampling and Distributed On-Line Algorithms

Stéphan Clémençon

LTCI UMR CNRS No. 5141 - Telecom ParisTech

Journée Traitement de Masses de Données du Laboratoire JL Lions UPMC

Page 2:

Goals of Statistical Learning Theory

• Statistical issues cast as M-estimation problems:
  • Classification
  • Regression
  • Density level set estimation
  • ... and their variants
• Minimal assumptions on the distribution
• Build realistic M-estimators for special criteria
• Questions:
  • Optimal elements
  • Consistency
  • Non-asymptotic excess risk bounds
  • Fast rates of convergence
  • Oracle inequalities

Page 3:

Main Example: Classification

• $(X, Y)$ random pair with unknown distribution $P$
  • $X \in \mathcal{X}$ observation vector
  • $Y \in \{-1, +1\}$ binary label/class
• A posteriori probability $\sim$ regression function:
  $\forall x \in \mathcal{X}, \quad \eta(x) = \mathbb{P}\{Y = +1 \mid X = x\}$
• $g : \mathcal{X} \to \{-1, +1\}$ classifier
• Performance measure = classification error:
  $L(g) = \mathbb{P}\{g(X) \neq Y\} \to \min_g$
• Solution: the Bayes rule
  $\forall x \in \mathcal{X}, \quad g^*(x) = 2\,\mathbb{I}\{\eta(x) > 1/2\} - 1$
• Bayes error $L^* = L(g^*)$

Page 4:

Empirical Risk Minimization

• Sample $(X_1, Y_1), \ldots, (X_n, Y_n)$ of i.i.d. copies of $(X, Y)$
• Class $\mathcal{G}$ of classifiers
• Empirical Risk Minimization principle (see the sketch below):
  $\hat{g}_n = \arg\min_{g \in \mathcal{G}} L_n(g) := \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\{g(X_i) \neq Y_i\}$
• Best classifier in the class:
  $\bar{g} = \arg\min_{g \in \mathcal{G}} L(g)$
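As an illustration, here is a minimal sketch (not from the slides) of the ERM principle over a small finite class of threshold classifiers in Python/NumPy; the data model, the class of thresholds and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D data: X ~ U(0,1), Y = +1 with probability eta(X) = X.
n = 1000
X = rng.uniform(0.0, 1.0, size=n)
Y = np.where(rng.uniform(size=n) < X, 1, -1)

# Finite class G of threshold classifiers g_t(x) = sign(x - t).
thresholds = np.linspace(0.0, 1.0, 101)

def empirical_risk(t):
    """L_n(g_t) = (1/n) * number of indices with g_t(X_i) != Y_i."""
    preds = np.where(X > t, 1, -1)
    return np.mean(preds != Y)

# ERM: pick the classifier minimizing the empirical risk over G.
risks = np.array([empirical_risk(t) for t in thresholds])
t_hat = thresholds[np.argmin(risks)]
print(f"ERM threshold: {t_hat:.2f}, empirical risk: {risks.min():.3f}")
```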

Page 5:

Empirical Processes in Classification

• Bias-variance decomposition:
  $L(\hat{g}_n) - L^* \leq \big(L(\hat{g}_n) - L_n(\hat{g}_n)\big) + \big(L_n(\bar{g}) - L(\bar{g})\big) + \big(L(\bar{g}) - L^*\big)$
  $\leq 2 \sup_{g \in \mathcal{G}} |L_n(g) - L(g)| + \Big( \inf_{g \in \mathcal{G}} L(g) - L^* \Big)$
• Concentration inequality. With probability $1 - \delta$:
  $\sup_{g \in \mathcal{G}} |L_n(g) - L(g)| \leq \mathbb{E} \sup_{g \in \mathcal{G}} |L_n(g) - L(g)| + \sqrt{\frac{2 \log(1/\delta)}{n}}$

Page 6:

Classification Theory - Main Results

1. Bayes risk consistency and rate of convergence. Complexity control:
   $\mathbb{E} \sup_{g \in \mathcal{G}} |L_n(g) - L(g)| \leq C \sqrt{\frac{V}{n}}$
   if $\mathcal{G}$ is a VC class with VC dimension $V$.
2. Fast rates of convergence. Under variance control: rate faster than $n^{-1/2}$.
3. Convex risk minimization
4. Oracle inequalities


Page 9:

Big Data? Big Challenge!

Now, it is much easier
• to collect data, massively and in real time: ubiquity of sensors (cell phones, internet, embedded systems, social networks, ...)
• to store and manage Big (and Complex) Data (distributed file systems, NoSQL)
• to implement massively parallelized and distributed computational algorithms (MapReduce, clouds)

The three features of Big Data analysis
• Velocity: process data in quasi-real time (on-line algorithms)
• Volume: scalability (parallelized, distributed algorithms)
• Variety: complex data (text, signal, image, graph)

Page 10:

How to apply ERM to Big Data?

• Suppose that $n$ is too large to evaluate the empirical risk $L_n(g)$
• Common sense: run your preferred learning algorithm using a subsample of "reasonable" size $B \ll n$, e.g. by drawing with replacement in the original training data set (see the sketch below)...
• ... but of course, statistical performance is downgraded!
  $\frac{1}{\sqrt{n}} \ll \frac{1}{\sqrt{B}}$
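A minimal sketch of this naive subsampling baseline (not part of the slides; the learner and data are placeholders): draw B points with replacement and fit on the subsample only.

```python
import numpy as np

def subsample_fit(X, Y, B, fit_fn, seed=0):
    """Fit a learner on B points drawn with replacement from (X, Y).

    fit_fn is any function (X_sub, Y_sub) -> trained model; statistically,
    the excess risk now scales like 1/sqrt(B) instead of 1/sqrt(n).
    """
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(X), size=B)   # sampling with replacement
    return fit_fn(X[idx], Y[idx])
```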


Page 12:

Survey designs: a solution to Big Data learning?

• Framework: massive original sample $(X_1, Y_1), \ldots, (X_n, Y_n)$ viewed as a superpopulation
• Survey plan $R_n$ = probability distribution on the ensemble of all nonempty subsets of $\{1, \ldots, n\}$
• Let $S \sim R_n$ and set $\epsilon_i = 1$ if $i \in S$, $\epsilon_i = 0$ otherwise.
  The vector $(\epsilon_1, \ldots, \epsilon_n)$ fully describes $S$
• First and second order inclusion probabilities:
  $\pi_i(R_n) = \mathbb{P}\{i \in S\}$ and $\pi_{i,j}(R_n) = \mathbb{P}\{(i,j) \in S^2\}$
• Do not rely on the empirical risk based on the survey sample $\{(X_i, Y_i) : i \in S\}$:
  $\frac{1}{\#S} \sum_{i \in S} \mathbb{I}\{g(X_i) \neq Y_i\}$ is a biased estimate of $L(g)$

Page 13:

Horvitz-Thompson theory

• Consider the Horvitz-Thompson estimator of the risk (see the sketch below):
  $\bar{L}_n^{R_n}(g) = \frac{1}{n} \sum_{i=1}^{n} \frac{\epsilon_i}{\pi_i} \mathbb{I}\{g(X_i) \neq Y_i\}$
• And the Horvitz-Thompson empirical risk minimizer:
  $g_n^{\epsilon} = \arg\min_{g \in \mathcal{G}} \bar{L}_n^{R_n}(g)$
• It may work if $\sup_{g \in \mathcal{G}} \left| \bar{L}_n^{R_n}(g) - L_n(g) \right|$ is small
• In general, due to the dependence structure, not much can be said about the fluctuations of this supremum
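A short sketch (not from the slides) of the Horvitz-Thompson risk estimate under a Poisson/Bernoulli survey design, where each index is kept independently with its own inclusion probability; the classifier `g` and the probabilities `pi` are placeholders.

```python
import numpy as np

def horvitz_thompson_risk(g, X, Y, pi, seed=0):
    """Horvitz-Thompson estimate of the classification risk L(g).

    pi[i] = P{i is included in the survey sample}; here the sample S is
    drawn with a Poisson design (independent Bernoulli(pi[i]) draws).
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    eps = rng.uniform(size=n) < pi          # inclusion indicators epsilon_i
    errors = (g(X) != Y).astype(float)      # indicator{g(X_i) != Y_i}
    return np.sum(eps * errors / pi) / n    # (1/n) * sum eps_i/pi_i * error_i
```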

Page 14:

The Poisson case: the $\epsilon_i$'s are independent

• In this case, $\bar{L}_n^{R_n}(g)$ is a simple average of independent r.v.'s
  $\Rightarrow$ back to empirical process theory
• One recovers the same learning rate as if all data had been used, e.g. in the finite VC dimension case:
  $\mathbb{E}\left[ L(g_n^{\epsilon}) - L^* \right] \leq \left( \kappa_n \sqrt{2} + 4 \right) \sqrt{\frac{V \log(n+1) + \log 2}{n}}$
  where $\kappa_n = \sqrt{\sum_{i=1}^{n} 1/\pi_i^2}$ (the $\pi_i$'s should not be too small...)
• The upper bound is optimal in the minimax sense.

Page 15:

The Poisson case: the $\epsilon_i$'s are independent

• Can be extended to more general sampling plans $Q_n$ provided you are able to control
  $d_{TV}(R_n, Q_n) \overset{def}{=} \sum_{S \in \mathcal{P}(\mathcal{U}_n)} |R_n(S) - Q_n(S)|$
• A coupling technique (Hajek, 1964) can be used to show that it works for rejective sampling, Rao-Sampford sampling, successive sampling, post-stratified sampling, etc.

Page 16:

Beyond Empirical Processes: U-Statistics as Performance Criteria

• In various situations, the performance criterion is no longer a basic sample mean statistic
• Examples:
  • Clustering: within-cluster point scatter related to a partition $\mathcal{P}$ (see the sketch below):
    $\frac{2}{n(n-1)} \sum_{i < j} D(X_i, X_j) \sum_{\mathcal{C} \in \mathcal{P}} \mathbb{I}\{(X_i, X_j) \in \mathcal{C}^2\}$
  • Graph inference (link prediction)
  • Ranking
  • ...
• The empirical criterion is an average over all possible $k$-tuples: a U-statistic of degree $k \geq 2$
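For concreteness, a small sketch (not from the slides) of the within-cluster point scatter as a degree-2 U-statistic, using the Euclidean distance as the dissimilarity D and a hypothetical label vector defining the partition.

```python
import numpy as np

def within_cluster_point_scatter(X, labels):
    """Degree-2 U-statistic: sum of D(X_i, X_j) over pairs i < j that fall
    in the same cell of the partition, normalized by 2/(n(n-1))."""
    n = len(X)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] == labels[j]:                 # (X_i, X_j) in C^2
                total += np.linalg.norm(X[i] - X[j])   # D(X_i, X_j)
    return 2.0 * total / (n * (n - 1))
```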

Page 17:

Example: Ranking

• Data with ordinal labels:
  $(X_1, Y_1), \ldots, (X_n, Y_n) \in \big( \mathcal{X} \times \{1, \ldots, K\} \big)^{\otimes n}$
• Goal: rank $X_1, \ldots, X_n$ through a scoring function $s : \mathcal{X} \to \mathbb{R}$ s.t.
  $s(X)$ and $Y$ tend to increase/decrease together with high probability
• Quantitative formulation: maximize the criterion
  $L(s) = \mathbb{P}\{s(X^{(1)}) < \ldots < s(X^{(K)}) \mid Y^{(1)} = 1, \ldots, Y^{(K)} = K\}$
• Observations: $n_k$ i.i.d. copies of $X$ given $Y = k$: $X_1^{(k)}, \ldots, X_{n_k}^{(k)}$, with $n = n_1 + \ldots + n_K$


Page 21:

Example: Ranking

• A natural empirical counterpart of $L(s)$ is
  $\hat{L}_n(s) = \frac{\sum_{i_1=1}^{n_1} \cdots \sum_{i_K=1}^{n_K} \mathbb{I}\left\{ s(X_{i_1}^{(1)}) < \ldots < s(X_{i_K}^{(K)}) \right\}}{n_1 \times \cdots \times n_K}$
• But the number of terms to be summed is prohibitive!
  $n_1 \times \ldots \times n_K$
• Maximization of $\hat{L}_n(s)$ is computationally infeasible... (see the sketch below)
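As an illustration (not from the slides), a brute-force evaluation of the empirical ranking criterion for K samples; its cost is the product of the sample sizes, which is exactly what makes direct maximization intractable for large samples. The scoring function `s` is a placeholder.

```python
import itertools
import numpy as np

def empirical_ranking_criterion(s, samples):
    """samples = [X1, ..., XK], where Xk holds the observations with label k.

    Returns the fraction of K-tuples (x1, ..., xK), one from each sample,
    such that s(x1) < s(x2) < ... < s(xK).  Cost: prod(n_k) evaluations.
    """
    scores = [np.asarray([s(x) for x in Xk]) for Xk in samples]
    total = 1
    for sc in scores:
        total *= len(sc)
    hits = 0
    for tup in itertools.product(*scores):
        if all(tup[k] < tup[k + 1] for k in range(len(tup) - 1)):
            hits += 1
    return hits / total
```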


Page 24:

Generalized U-statistics

• $K \geq 1$ samples and degrees $(d_1, \ldots, d_K) \in \mathbb{N}^{*K}$
• $(X_1^{(k)}, \ldots, X_{n_k}^{(k)})$, $1 \leq k \leq K$: $K$ independent i.i.d. samples drawn from $F_k(dx)$ on $\mathcal{X}_k$ respectively
• Kernel $H : \mathcal{X}_1^{d_1} \times \cdots \times \mathcal{X}_K^{d_K} \to \mathbb{R}$, square integrable w.r.t. $\mu = F_1^{\otimes d_1} \otimes \cdots \otimes F_K^{\otimes d_K}$

Page 25:

Generalized U-statistics

Definition. The K-sample U-statistic of degrees $(d_1, \ldots, d_K)$ with kernel $H$ is
  $U_n(H) = \frac{\sum_{I_1} \ldots \sum_{I_K} H(X_{I_1}^{(1)}; X_{I_2}^{(2)}; \ldots; X_{I_K}^{(K)})}{\binom{n_1}{d_1} \times \cdots \times \binom{n_K}{d_K}}$,
where $\sum_{I_k}$ refers to summation over all $\binom{n_k}{d_k}$ subsets $X_{I_k}^{(k)} = (X_{i_1}^{(k)}, \ldots, X_{i_{d_k}}^{(k)})$ related to a set $I_k$ of $d_k$ indexes $1 \leq i_1 < \ldots < i_{d_k} \leq n_k$.

It is said symmetric when $H$ is permutation symmetric within each set of $d_k$ arguments $X_{I_k}^{(k)}$.

References: Lee (1990)

Page 26:

Generalized U-statistics

• Unbiased estimator of
  $\theta(H) = \mathbb{E}\big[ H(X_1^{(1)}, \ldots, X_{d_1}^{(1)}, \ldots, X_1^{(K)}, \ldots, X_{d_K}^{(K)}) \big]$
  with minimum variance
• Asymptotically Gaussian as $n_k/n \to \lambda_k > 0$ for $k = 1, \ldots, K$
• Its computation requires the summation of $\prod_{k=1}^{K} \binom{n_k}{d_k}$ terms
• K-partite ranking: $d_k = 1$ for $1 \leq k \leq K$,
  $H_s(x_1, \ldots, x_K) = \mathbb{I}\{s(x_1) < s(x_2) < \cdots < s(x_K)\}$

Page 27:

Incomplete U-statistics

• Replace $U_n(H)$ by an incomplete version, involving far fewer terms
• Build a set $\mathcal{D}_B$ of cardinality $B$ by sampling with replacement in the set $\Lambda$ of index tuples
  $\big( (i_1^{(1)}, \ldots, i_{d_1}^{(1)}), \ldots, (i_1^{(K)}, \ldots, i_{d_K}^{(K)}) \big)$
  with $1 \leq i_1^{(k)} < \ldots < i_{d_k}^{(k)} \leq n_k$, $1 \leq k \leq K$
• Compute the Monte Carlo version based on $B$ terms (see the sketch below):
  $\tilde{U}_B(H) = \frac{1}{B} \sum_{(I_1, \ldots, I_K) \in \mathcal{D}_B} H(X_{I_1}^{(1)}, \ldots, X_{I_K}^{(K)})$
• An incomplete U-statistic is NOT a U-statistic
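A minimal sketch (not from the slides) of the incomplete version for the one-sample, degree-2 case: instead of averaging the kernel over all n(n-1)/2 pairs, average over B pairs drawn with replacement. The kernel and the example statistic are placeholders.

```python
import numpy as np

def incomplete_u_statistic(X, kernel, B, seed=0):
    """Monte Carlo approximation of a one-sample U-statistic of degree 2.

    Draws B pairs of distinct indices with replacement from the set of all
    pairs and averages kernel(X[i], X[j]) over them (B terms instead of C(n,2)).
    Assumes a symmetric kernel.
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    total = 0.0
    for _ in range(B):
        i, j = rng.choice(n, size=2, replace=False)   # one random pair
        total += kernel(X[i], X[j])
    return total / B

# Example: incomplete version of the mean pairwise distance.
X = np.random.default_rng(1).normal(size=(5000, 3))
approx = incomplete_u_statistic(X, lambda a, b: np.linalg.norm(a - b), B=5000)
```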

Page 28:

ERM based on incomplete U-statistics

• Replace the criterion by a tractable incomplete version based on $B = O(n)$ terms:
  $\min_{H \in \mathcal{H}} \tilde{U}_B(H)$
• This leads to investigating the maximal deviations
  $\sup_{H \in \mathcal{H}} \left| \tilde{U}_B(H) - U_n(H) \right|$

Page 29:

Main Result

Theorem. Let $\mathcal{H}$ be a VC major class of bounded symmetric kernels of finite VC dimension $V < +\infty$. Set $M_{\mathcal{H}} = \sup_{(H,x) \in \mathcal{H} \times \mathcal{X}} |H(x)|$. Then,

(i) $\mathbb{P}\left\{ \sup_{H \in \mathcal{H}} \left| \tilde{U}_B(H) - U_n(H) \right| > \eta \right\} \leq 2(1 + \#\Lambda)^V \times e^{-B\eta^2 / M_{\mathcal{H}}^2}$

(ii) for all $\delta \in (0,1)$, with probability at least $1 - \delta$, we have:
  $\frac{1}{M_{\mathcal{H}}} \sup_{H \in \mathcal{H}} \left| \tilde{U}_B(H) - \mathbb{E}\big[ \tilde{U}_B(H) \big] \right| \leq 2 \sqrt{\frac{2V \log(1 + \kappa)}{\kappa}} + \sqrt{\frac{\log(2/\delta)}{\kappa}} + \sqrt{\frac{V \log(1 + \#\Lambda) + \log(4/\delta)}{B}}$,
where $\kappa = \min\{\lfloor n_1/d_1 \rfloor, \ldots, \lfloor n_K/d_K \rfloor\}$

Page 30:

Consequences

• Empirical risk sampling with $B = O(n)$ yields a rate bound of the order $O(\sqrt{\log n / n})$
• One suffers no loss in terms of learning rate, while drastically reducing the computational cost

Page 31:

Example: Ranking

Empirical ranking performance for SVMrank based on 1%, 5%, 10%, 20% and 100% of the "LETOR 2007" dataset.

Page 32:

Sketch of Proof

• Set $\epsilon = ((\epsilon_k(I))_{I \in \Lambda})_{1 \leq k \leq B}$, where $\epsilon_k(I)$ is equal to 1 if the tuple $I = (I_1, \ldots, I_K)$ has been selected at the $k$-th draw and to 0 otherwise
• The $\epsilon_k$'s are i.i.d. random vectors
• For all $(k, I) \in \{1, \ldots, B\} \times \Lambda$, the r.v. $\epsilon_k(I)$ has a Bernoulli distribution with parameter $1/\#\Lambda$
• With these notations,
  $\tilde{U}_B(H) - U_n(H) = \frac{1}{B} \sum_{k=1}^{B} Z_k(H)$, where $Z_k(H) = \sum_{I \in \Lambda} \big(\epsilon_k(I) - 1/\#\Lambda\big) H(X_I)$
• Freezing the $X_I$'s, by virtue of Sauer's lemma:
  $\#\{(H(X_I))_{I \in \Lambda} : H \in \mathcal{H}\} \leq (1 + \#\Lambda)^V$

Page 33:

Sketch of Proof (continued)

• Conditioned upon the $X_I$'s, $Z_1(H), \ldots, Z_B(H)$ are independent
• The first assertion is thus obtained by applying Hoeffding's inequality combined with the union bound
• Set
  $V_H\big( X_1^{(1)}, \ldots, X_{n_1}^{(1)}, \ldots, X_1^{(K)}, \ldots, X_{n_K}^{(K)} \big) = \kappa^{-1} \Big( H\big( X_1^{(1)}, \ldots, X_{d_1}^{(1)}, \ldots, X_1^{(K)}, \ldots, X_{d_K}^{(K)} \big) + H\big( X_{d_1+1}^{(1)}, \ldots, X_{2d_1}^{(1)}, \ldots, X_{d_K+1}^{(K)}, \ldots, X_{2d_K}^{(K)} \big) + \ldots + H\big( X_{\kappa d_1 - d_1 + 1}^{(1)}, \ldots, X_{\kappa d_K - d_K + 1}^{(K)}, \ldots, X_{\kappa d_K}^{(K)} \big) \Big)$

Page 34:

Sketch of Proof (continued)

• The proof of the second assertion is based on the Hoeffding decomposition
  $U_n(H) = \frac{1}{n_1! \cdots n_K!} \sum_{\sigma_1 \in \mathfrak{S}_{n_1}, \ldots, \sigma_K \in \mathfrak{S}_{n_K}} V_H\big( X_{\sigma_1(1)}^{(1)}, \ldots, X_{\sigma_K(n_K)}^{(K)} \big)$
• The concentration result is then obtained in a classical manner:
  • Convexity (Chernoff's bound)
  • Symmetrization
  • Randomization
  • Application of McDiarmid's bounded difference inequality

Page 35:

Beyond finite VC dimension

• Challenge: develop probabilistic tools and complexity assumptions to investigate the concentration properties of collections of sums of weighted binomials
  $\tilde{U}_B(H) - U_n(H) = \frac{1}{B} \sum_{k=1}^{B} Z_k(H)$,
  with $Z_k(H) = \sum_{I \in \Lambda} \big(\epsilon_k(I) - 1/\#\Lambda\big) H(X_I)$

Page 36:

Some references

• Maximal Deviations of Incomplete U-statistics with Applications to Empirical Risk Sampling. S. Clémençon, S. Robbiano and J. Tressou (2013). In the Proceedings of the SIAM International Conference on Data Mining, Austin (USA).
• Empirical processes in survey sampling. P. Bertail, E. Chautru and S. Clémençon (2013). Submitted.
• A statistical view of clustering performance through the theory of U-processes. S. Clémençon (2014). In Journal of Multivariate Analysis.
• On Survey Sampling and Empirical Risk Minimization. P. Bertail, E. Chautru and S. Clémençon (2014). ISAIM 2014, Fort Lauderdale (USA).

Page 37:

Introduction

Investigate the binary classification problem in the statistical learning context

• Data not stored in a central unit but processed by independent agents (processors)
• Aim: not to find a consensus on a common classifier, but to find how to combine the local classifiers efficiently
• Solution: implement in an on-line and distributed manner

2/21

Page 38:

Outline

Background

Proposed algorithm

Theoretical results

Improvement of agent selection

Numerical experiments

3/21


Page 40:

Learning problem

$X \in \mathcal{X} \subset \mathbb{R}^n \;\longrightarrow\; \mathrm{sign}(H(X)) \;\longrightarrow\; Y \in \{-1, +1\}$
(r.v. observation $\to$ r.v. binary output)

Given a training dataset $(X, Y) = (X_i, Y_i)_{i=1,\ldots,n}$ in high dimension and with unknown joint distribution...

...find the best prediction rule $\mathrm{sign}(H^\star)$, i.e. the classifier function $H(x)$ such that
  $H^\star = \arg\min_H P_e(H)$, where $P_e(H) = \mathbb{P}[-YH(X) > 0] = \mathbb{E}\big[ \mathbb{I}\{-YH(X) > 0\} \big]$,
minimizes the probability of error $P_e$.

⚠ But the indicator $\mathbb{I}(\cdot)$ is not a differentiable function!

4/21

Page 41:

Learning problem

Majorize $\mathbb{E}\big[ \mathbb{I}\{-YH(X) > 0\} \big]$ by a convex function: convex surrogate
  $\mathbb{E}\big[ \mathbb{I}\{-YH(X) > 0\} \big] \leq \mathbb{E}\big[ \varphi(-YH(X)) \big]$

How? Use a cost function with appropriate properties.

Example: use the quadratic function $\varphi(u) = \frac{(u+1)^2}{2} : \mathbb{R} \to [0, +\infty)$ (see the sketch below)

4/21
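A tiny sketch (not from the slides) of this quadratic surrogate and its derivative, which is what the gradient updates later in the deck differentiate; the names are illustrative.

```python
import numpy as np

def phi(u):
    """Quadratic surrogate phi(u) = (u + 1)^2 / 2, used in the slides as a
    smooth convex stand-in for the 0-1 step u -> indicator{u > 0}."""
    return 0.5 * (u + 1.0) ** 2

def phi_prime(u):
    """Derivative phi'(u) = u + 1, used in the stochastic gradient updates."""
    return u + 1.0

def surrogate_risk(H, X, Y):
    """Empirical surrogate risk of a classifier H: mean of phi(-Y * H(X))."""
    margins = -Y * np.array([H(x) for x in X])
    return float(np.mean(phi(margins)))
```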

Page 42:

Learning problem

$X \in \mathcal{X} \subset \mathbb{R}^n \;\longrightarrow\; \mathrm{sign}(H(X)) \;\longrightarrow\; Y \in \{-1, +1\}$
(r.v. observation $\to$ r.v. binary output)

Given a training dataset $(X, Y) = (X_i, Y_i)_{i=1,\ldots,n}$ in high dimension and with unknown joint distribution...

...find the best prediction rule $\mathrm{sign}(H^\star)$, i.e. the classifier function $H(x)$ such that
  $H^\star = \arg\min_H R_\varphi(H)$, where $R_\varphi(H) = \mathbb{E}\big[ \varphi(-YH(X)) \big]$,
minimizes the risk function $R_\varphi(H)$.

✓ When $\varphi(u) = \frac{(u+1)^2}{2}$, $\mathrm{sign}(H^\star)$ coincides with the Bayes classifier!

4/21

Page 43:

Aggregation of local classifiers

Consider a classification device composed of a set $V$ of $N$ connected agents.

Each agent $v \in V$:
• disposes of $\{(X_{1,v}, Y_{1,v}), \ldots, (X_{n_v,v}, Y_{n_v,v})\}$ → $n_v$ independent copies of $(X, Y)$
• selects a local soft classifier from a parametric class $\{h_v(\cdot, \theta_v)\}$

Set $\theta_v = (a_v, b_v)$; the global soft classifier is (see the sketch below):
  $H(x, \boldsymbol{\theta}) = \sum_{v \in V} h_v(x, \theta_v)$,
where $h_v(x, \theta_v) = a_v \, h_v(x, b_v)$ and $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_N)^\top$

5/21
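A small sketch (not from the slides) of this aggregation: each agent contributes a weighted local soft classifier and the global score is their sum. The parametric family (tanh of a linear score) and all names are placeholders.

```python
import numpy as np

class Agent:
    """One agent v holding theta_v = (a_v, b_v) and a local soft classifier."""

    def __init__(self, a, b):
        self.a, self.b = a, b

    def h(self, x):
        # Local soft classifier h_v(x, theta_v) = a_v * tanh(<b_v, x>)
        # (tanh is just an illustrative choice of parametric family).
        return self.a * np.tanh(np.dot(self.b, x))

def global_classifier(agents, x):
    """H(x, theta) = sum over agents of h_v(x, theta_v); predict sign(H)."""
    H = sum(agent.h(x) for agent in agents)
    return np.sign(H), H
```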

Page 44:

Problem statement

The problem can be summarized as follows:
• given an observed datum $X$,
• obtain the best estimated label $Y$ as $\mathrm{sign}(H(X, \boldsymbol{\theta}))$,
• where $\boldsymbol{\theta}$ is computed from the optimization problem using the training data $(X, Y) = (X_i, Y_i)_{i=1,\ldots,n}$ as:
  $\min_{\boldsymbol{\theta} \in \Theta} R_\varphi(H(X, \boldsymbol{\theta}))$

6/21

Page 45:

Problem statement

Approaches

1. Agreement on a common decision rule [Tsitsiklis '84, Agarwal '10]: consensus approach
   • find an average consensus solution: $\boldsymbol{\theta} = (\theta, \ldots, \theta)$
   • each agent uses the global classifier $H(X, \boldsymbol{\theta})$
2. Mixture of experts: cooperative approach
   • find the best aggregation solution: $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_N)$
   • each agent uses its local classifier $h_v(x, \theta_v)$

6/21

Page 46:

Problem statement

Approaches

1. Agreement on a common decision rule [Tsitsiklis '84, Agarwal '10]: consensus approach
2. Mixture of experts: cooperative approach
   • find the best aggregation solution: $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_N)$
   • each agent uses its local classifier $h_v(x, \theta_v)$

✓ Example: set $b_v = 0$, $a_v \geq 0$ and $\bar{h}_v : \mathcal{X} \to \{-1, +1\}$ the weak classifier:
  $h_v(x, \theta_v) = a_v \, \bar{h}_v(x)$

6/21

Page 47:

Outline

Background

Proposed algorithm

Theoretical results

Improvement of agent selection

Numerical experiments

7/21

Page 48:

High rate distributed learning

Solve the minimization problem of the parametric risk function:
  $\min_{\boldsymbol{\theta} \in \Theta} R_\varphi(H(X, \boldsymbol{\theta}))$

8/21

Page 49:

High rate distributed learning

A standard distributed gradient descent iterative approach:
• generates a sequence of estimated parameter vectors $(\boldsymbol{\theta}_t)_{t \geq 1} = (\theta_{t,1}, \ldots, \theta_{t,N})_{t \geq 1}$
• at each agent $v$, the update step writes:
  $\theta_{t+1,v} = \theta_{t,v} + \gamma_t \, \mathbb{E}\big[ Y \, \nabla_v h_v(X, \theta_{t,v}) \, \varphi'(-Y H(X, \boldsymbol{\theta}_t)) \big]$

⚠ The joint distribution is unknown

8/21

Page 50:

High rate distributed learning

A standard distributed and on-line gradient descent iterative approach:
• generates a sequence of estimated parameter vectors $(\boldsymbol{\theta}_t)_{t \geq 1} = (\theta_{t,1}, \ldots, \theta_{t,N})_{t \geq 1}$
• each agent $v$ observes a pair $(X_{t+1,v}, Y_{t+1,v})$
• at each agent $v$, the update step writes (the expectation replaced by its empirical version):
  $\theta_{t+1,v} = \theta_{t,v} + \gamma_t \, Y_{t+1,v} \, \nabla_v h_v(X_{t+1,v}, \theta_{t,v}) \, \varphi'(-Y_{t+1,v} H(X_{t+1,v}, \boldsymbol{\theta}_t))$

⚠ Evaluating $H(X_{t+1,v}, \boldsymbol{\theta}_t)$ is required at each $t$ and $v$!

8/21

Page 51:

High rate distributed learning

Example. At iteration $t$, each agent $v \in V$ has $(X_{t,v}, \theta_{t,v})$ and evaluates its local $h_v(X_{t,v}, \theta_{t,v})$.

[Diagram: four connected agents 1-4, each holding its local pair $(X_{t,v}, \theta_{t,v})$.]

9/21

Page 52:

High rate distributed learning

Example. Each node $v$ sends its observation $X_{t,v}$ to all the other nodes.

[Diagram: node 1 broadcasts $X_{t,1}$ to nodes 2, 3 and 4.]

9/21

Page 53:

High rate distributed learning

Example. Each node $v$ obtains the evaluation of $h_w(X_{t,v}, \theta_{t,w})$ from all the other nodes.

[Diagram: nodes 1-4 each evaluate $h_w(X_{t,1}, \theta_{t,w})$ on node 1's observation.]

9/21

Page 54:

High rate distributed learning

Example. Each node $v$ obtains the evaluation of $h_w(X_{t,v}, \theta_{t,w})$ from all the other nodes...

[Diagram: node 1 collects $\{h_2(X_{t,1}, \theta_{t,2}), h_3(X_{t,1}, \theta_{t,3}), h_4(X_{t,1}, \theta_{t,4})\}$.]

...and computes the global classifier: $H(X_{t,1}, \boldsymbol{\theta}_t) = \sum_{w=1}^{4} h_w(X_{t,1}, \theta_{t,w})$

⚠ $N(N-1)$ communications per iteration: 12 for $N = 4$!

9/21

Page 55:

Proposed distributed learning: the OLGA algorithm

✓ Replace the global $H(X_{t+1,v}, \boldsymbol{\theta}_t)$ by a local estimate $Y^{(V)}_{t,v}$ at each $v \in V$ such that:
  $\mathbb{E}\big[ Y^{(V)}_{t+1,v} \mid X_{t+1,v}, \boldsymbol{\theta}_t \big] = H(X_{t+1,v}, \boldsymbol{\theta}_t)$

How? Sparse communications with sparsity ratio $p$...

On-line Learning Gossip Algorithm (OLGA)
...for each $v \in V$ at time $t$, the local gradient descent update writes (see the sketch below):
  $\theta_{t+1,v} = \theta_{t,v} + \gamma_t \, Y_{t+1,v} \, \nabla_v h_v(X_{t+1,v}, \theta_{t,v}) \, \varphi'(-Y_{t+1,v} Y^{(V)}_{t+1,v})$

10/21
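A compact sketch (not the authors' code) of one OLGA round for the weighted-weak-classifier case $b_v = 0$, $h_v(x, \theta_v) = a_v \bar{h}_v(x)$, with the quadratic surrogate: each agent receives the other agents' evaluations independently with probability p, builds the unbiased local estimate $Y^{(V)}$ by rescaling the received terms by 1/p, and takes a stochastic gradient step on its weight. All names are illustrative, and the same observed pair is reused for every agent to keep the sketch short.

```python
import numpy as np

rng = np.random.default_rng(0)

def phi_prime(u):
    return u + 1.0          # derivative of the quadratic surrogate (u+1)^2/2

def olga_round(a, weak, x, y, gamma, p):
    """One OLGA iteration for N agents with h_v(x) = a_v * weak_v(x).

    a     : array of shape (N,), current weights a_v
    weak  : list of N weak classifiers weak_v: x -> {-1, +1}
    x, y  : the pair observed at this round (shared by all agents here)
    gamma : step size gamma_t
    p     : communication sparsity ratio
    """
    N = len(a)
    h_vals = np.array([weak[v](x) for v in range(N)])    # weak_v(x)
    new_a = a.copy()
    for v in range(N):
        # Each other agent w answers independently with probability p;
        # rescaling by 1/p makes Y_v an unbiased estimate of H(x, theta).
        mask = rng.uniform(size=N) < p
        mask[v] = False
        Y_v = a[v] * h_vals[v] + np.sum(a[mask] * h_vals[mask]) / p
        # Local stochastic gradient step on a_v (gradient of a_v*weak_v(x)).
        new_a[v] = a[v] + gamma * y * h_vals[v] * phi_prime(-y * Y_v)
    return new_a
```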

Page 56:

Proposed distributed learning: the OLGA algorithm

Example. At iteration $t$, each agent $v \in V$ has $(X_{t,v}, \theta_{t,v})$ and evaluates its local $h_v(X_{t,v}, \theta_{t,v})$.

[Diagram: four connected agents 1-4, each holding its local pair $(X_{t,v}, \theta_{t,v})$.]

10/21

Page 57:

Proposed distributed learning: the OLGA algorithm

Example. Each node $v$ sends its observation $X_{t,v}$ to randomly selected nodes, each with probability $p = 1/3$.

[Diagram: node 1 sends $X_{t,1}$ to node 4 only.]

10/21

Page 58:

Proposed distributed learning: the OLGA algorithm

Example. Each node $v$ obtains the evaluation of $h_w(X_{t,v}, \theta_{t,w})$ from the randomly selected nodes.

[Diagram: node 4 returns $h_4(X_{t,1}, \theta_{t,4})$ to node 1.]

10/21

Page 59:

Proposed distributed learning: the OLGA algorithm

Example. Each node $v$ obtains the evaluation of $h_w(X_{t,v}, \theta_{t,w})$ from the chosen nodes...

[Diagram: node 1 collects $\{h_4(X_{t,1}, \theta_{t,4})\}$.]

...and computes its local estimate: $Y^{(V)}_{t,1} = h_1(X_{t,1}, \theta_{t,1}) + \frac{1}{p} h_4(X_{t,1}, \theta_{t,4})$

⚠ $p\,N(N-1)$ communications per iteration: 4 for $N = 4$, $p = 1/3$ (a reduction of 67%)!

10/21

Page 60:

Performance analysis

⚠ What is the effect of sparsification?

...study the behaviour of the vector sequence $\boldsymbol{\theta}_t$ as $t \to \infty$:
• the consistency of the final solution given by the algorithm
• quantify the excess error variance due to the sparsity

11/21

Page 61:

Outline

Background

Proposed algorithm

Theoretical results

Improvement of agent selection

Numerical experiments

12/21

Page 62:

Asymptotic behaviour of OLGA

Under suitable assumptions, we prove the following results:

1. Consistency:
   $(\boldsymbol{\theta}_t)_{t \geq 1} \xrightarrow{a.s.} \boldsymbol{\theta}^\star \in \mathcal{L} = \{\nabla R_\varphi(\boldsymbol{\theta}) = 0\}$

2. CLT: conditioned on the event $\{\lim_{t \to \infty} \boldsymbol{\theta}_t = \boldsymbol{\theta}^\star\}$,
   $\gamma_t^{-1/2} (\boldsymbol{\theta}_t - \boldsymbol{\theta}^\star) \xrightarrow{\mathcal{L}} \mathcal{N}(0, \Sigma(\Gamma^\star))$
   where:
   $\Gamma^\star = \underbrace{\mathbb{E}\big[ (H(X, \boldsymbol{\theta}^\star) - Y)^2 \, \nabla_v h_v(X, \theta_v^\star) \nabla_v^\top h_v(X, \theta_v^\star) \big]}_{\text{estimation error in a centralized case}} + \underbrace{\frac{1-p}{p} \sum_{w \neq v} \mathbb{E}\big[ h_w(X, \theta_w^\star)^2 \, \nabla_v h_v(X, \theta_v^\star) \nabla_v^\top h_v(X, \theta_v^\star) \big]}_{\text{additional noise term induced by the distributed setting}}$

13/21

Page 63:

Outline

Background

Proposed algorithm

Theoretical results

Improvement of agent selection

Numerical experiments

14/21

Page 64:

A best agents selection approach

When...
⚠ the number of agents $N$ is large → difficult to implement
⚠ there are redundant agents → avoid similar outputs
...include distributed agent selection!

How? Add an $\ell_1$-penalization term with tuning parameter $\lambda$ (see the sketch below):
  $\min_{\boldsymbol{\theta} \in \Theta} R_\varphi(H(X, \boldsymbol{\theta})) + \lambda \sum_v |a_v|$
where the weight $a_v = 0$ for an idle agent and $a_v > 0$ when it is active.

15/21
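A sketch (not the authors' implementation) of how such an ℓ1 penalty can be folded into the per-agent update from the OLGA sketch above via a proximal soft-thresholding step on the weight a_v, which drives the weights of unhelpful agents to exactly zero; all names are illustrative.

```python
import numpy as np

def soft_threshold(a, tau):
    """Proximal step for tau * |a|: shrink toward 0 and clip at 0."""
    return np.sign(a) * np.maximum(np.abs(a) - tau, 0.0)

def penalized_weight_update(a_v, grad_v, gamma, lam):
    """One proximal-gradient step on a_v for the l1-penalized risk.

    grad_v is the stochastic gradient estimate used by OLGA, i.e.
    -Y * weak_v(X) * phi'(-Y * Y_v); lam is the tuning parameter lambda.
    """
    a_half = a_v - gamma * grad_v          # plain gradient step
    return soft_threshold(a_half, gamma * lam)

def active_set(a, tol=0.0):
    """S_t: indices of agents whose weight is (strictly) nonzero."""
    return np.flatnonzero(np.abs(a) > tol)
```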

Page 65:

Including best agents selection in the OLGA algorithm

Introduce an update step at each time $t$ of OLGA to seek the time-varying set of active nodes $S_t \subset V$.

16/21

Page 66:

Including best agents selection in the OLGA algorithm

The extended algorithm is summarized as follows; at time $t$ (see the sketch below):
1. obtain the active nodes $S_t$ from the sequence of updated weights $(a_{t,1}, \ldots, a_{t,N})$
2. apply OLGA to the set of active agents $v \in S_t$:
   i) estimate the local $Y^{(S_t)}_{t+1,v}$ from a random selection among the current active nodes
   ii) update the local gradient descent:
      $\theta_{t+1,v} = \theta_{t,v} + \gamma_t \, Y_{t+1,v} \, \nabla_v h_v(X_{t+1,v}, \theta_{t,v}) \, \varphi'(-Y_{t+1,v} Y^{(S_t)}_{t+1,v})$

16/21
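Putting the two previous sketches together, an illustrative outer loop (not the authors' code) for the extended algorithm: at each round the active set S_t is read off the current weights, the sparse-gossip estimate is built over active agents only, and each active weight takes a penalized gradient step. The small helpers are restated so the sketch is self-contained; all names are assumptions.

```python
import numpy as np

phi_prime = lambda u: u + 1.0                          # derivative of (u+1)^2/2
soft_threshold = lambda a, tau: np.sign(a) * np.maximum(np.abs(a) - tau, 0.0)

def extended_olga_round(a, weak, x, y, gamma, p, lam, rng):
    """One round of OLGA with distributed agent selection (illustrative)."""
    N = len(a)
    S_t = np.flatnonzero(np.abs(a) > 0)                # step 1: active nodes
    h_vals = np.array([weak[v](x) for v in range(N)])
    new_a = a.copy()
    for v in S_t:                                      # step 2: OLGA on S_t
        mask = np.zeros(N, dtype=bool)
        mask[S_t] = rng.uniform(size=len(S_t)) < p     # random selection among active nodes
        mask[v] = False
        Y_v = a[v] * h_vals[v] + np.sum(a[mask] * h_vals[mask]) / p
        grad_v = -y * h_vals[v] * phi_prime(-y * Y_v)  # stochastic gradient on a_v
        new_a[v] = soft_threshold(a[v] - gamma * grad_v, gamma * lam)
    return new_a
```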

Page 67:

Outline

Background

Proposed algorithm

Theoretical results

Improvement of agent selection

Numerical experiments

17/21

Page 68:

Example with simulated data

Binary classification of (+) and (o) data samples with $N = 60$ agents using weak linear classifiers (-). When using distributed selection, the ensemble reduces to 25 active classifiers.

[Figure: scatter plots of the two classes with the learned decision boundaries; (a) OLGA, (b) OLGA with distributed selection.]

18/21

Page 69:

Comparison with real data

Binary classification on the benchmark banana dataset using weak linear classifiers with increasing $N$.

[Figure: error rate versus the number of weak learners (5 to 35), for OLGA (p = 0.6) and GentleBoost. Caption: Comparison between a centralized and sequential approach (GentleBoost) and our distributed and on-line algorithm (OLGA).]

19/21

Page 70:

Conclusions

• A fully distributed and on-line algorithm is proposed for binary classification on big datasets handled by $N$ processors
  ✓ the algorithm is then adapted to select the useful classifiers → $N$ decreases
• We obtain theoretical results from the asymptotic analysis of the sequence estimated by OLGA
• Numerical results are presented, showing behaviour comparable to a centralized, batch and sequential approach (GentleBoost)

20/21