
Page 1:

Machine-Learning for Big Data:
Sampling and Distributed On-Line Algorithms

Stéphan Clémençon

LTCI UMR CNRS No. 5141 - Telecom ParisTech

Journée Traitement de Masses de Données du Laboratoire JL Lions UPMC

Page 2:

Goals of Statistical Learning Theory

• Statistical issues cast as M-estimation problems:
  • Classification
  • Regression
  • Density level set estimation
  • ... and their variants
• Minimal assumptions on the distribution
• Build realistic M-estimators for special criteria
• Questions:
  • Optimal elements
  • Consistency
  • Non-asymptotic excess risk bounds
  • Fast rates of convergence
  • Oracle inequalities

Page 3:

Main Example: Classification

• $(X, Y)$ random pair with unknown distribution $P$
  • $X \in \mathcal{X}$ observation vector
  • $Y \in \{-1, +1\}$ binary label/class
• A posteriori probability $\sim$ regression function:
  $\forall x \in \mathcal{X}, \quad \eta(x) = \mathbb{P}\{Y = +1 \mid X = x\}$
• $g : \mathcal{X} \to \{-1, +1\}$ classifier
• Performance measure = classification error:
  $L(g) = \mathbb{P}\{g(X) \neq Y\} \to \min_g$
• Solution: the Bayes rule
  $\forall x \in \mathcal{X}, \quad g^*(x) = 2\,\mathbb{I}\{\eta(x) > 1/2\} - 1$
• Bayes error $L^* = L(g^*)$

Page 4:

Empirical Risk Minimization

• Sample $(X_1, Y_1), \ldots, (X_n, Y_n)$ of i.i.d. copies of $(X, Y)$
• Class $\mathcal{G}$ of classifiers
• Empirical Risk Minimization principle (see the sketch below):
  $\hat{g}_n = \arg\min_{g \in \mathcal{G}} L_n(g) := \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\{g(X_i) \neq Y_i\}$
• Best classifier in the class:
  $\bar{g} = \arg\min_{g \in \mathcal{G}} L(g)$
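As an illustration, here is a minimal sketch (not from the slides) of the ERM principle over a small finite class of threshold classifiers in Python/NumPy; the data model, the class of thresholds and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D data: X ~ U(0,1), Y = +1 with probability eta(X) = X.
n = 1000
X = rng.uniform(0.0, 1.0, size=n)
Y = np.where(rng.uniform(size=n) < X, 1, -1)

# Finite class G of threshold classifiers g_t(x) = sign(x - t).
thresholds = np.linspace(0.0, 1.0, 101)

def empirical_risk(t):
    """L_n(g_t) = (1/n) * number of indices with g_t(X_i) != Y_i."""
    preds = np.where(X > t, 1, -1)
    return np.mean(preds != Y)

# ERM: pick the classifier minimizing the empirical risk over G.
risks = np.array([empirical_risk(t) for t in thresholds])
t_hat = thresholds[np.argmin(risks)]
print(f"ERM threshold: {t_hat:.2f}, empirical risk: {risks.min():.3f}")
```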

Page 5:

Empirical Processes in Classification

• Bias-variance decomposition:
  $L(\hat{g}_n) - L^* \leq \big(L(\hat{g}_n) - L_n(\hat{g}_n)\big) + \big(L_n(\bar{g}) - L(\bar{g})\big) + \big(L(\bar{g}) - L^*\big)$
  $\leq 2 \sup_{g \in \mathcal{G}} |L_n(g) - L(g)| + \Big( \inf_{g \in \mathcal{G}} L(g) - L^* \Big)$
• Concentration inequality. With probability $1 - \delta$:
  $\sup_{g \in \mathcal{G}} |L_n(g) - L(g)| \leq \mathbb{E} \sup_{g \in \mathcal{G}} |L_n(g) - L(g)| + \sqrt{\frac{2 \log(1/\delta)}{n}}$

Page 6:

Classification Theory - Main Results

1. Bayes risk consistency and rate of convergence. Complexity control:
   $\mathbb{E} \sup_{g \in \mathcal{G}} |L_n(g) - L(g)| \leq C \sqrt{\frac{V}{n}}$
   if $\mathcal{G}$ is a VC class with VC dimension $V$.
2. Fast rates of convergence. Under variance control: rate faster than $n^{-1/2}$.
3. Convex risk minimization
4. Oracle inequalities


Page 9:

Big Data? Big Challenge!

Now, it is much easier
• to collect data, massively and in real time: ubiquity of sensors (cell phones, internet, embedded systems, social networks, ...)
• to store and manage Big (and Complex) Data (distributed file systems, NoSQL)
• to implement massively parallelized and distributed computational algorithms (MapReduce, clouds)

The three features of Big Data analysis
• Velocity: process data in quasi-real time (on-line algorithms)
• Volume: scalability (parallelized, distributed algorithms)
• Variety: complex data (text, signal, image, graph)

Page 10:

How to apply ERM to Big Data?

• Suppose that $n$ is too large to evaluate the empirical risk $L_n(g)$
• Common sense: run your preferred learning algorithm using a subsample of "reasonable" size $B \ll n$, e.g. by drawing with replacement in the original training data set (see the sketch below)...
• ... but of course, statistical performance is downgraded!
  $\frac{1}{\sqrt{n}} \ll \frac{1}{\sqrt{B}}$
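A minimal sketch of this naive subsampling baseline (not part of the slides; the learner and data are placeholders): draw B points with replacement and fit on the subsample only.

```python
import numpy as np

def subsample_fit(X, Y, B, fit_fn, seed=0):
    """Fit a learner on B points drawn with replacement from (X, Y).

    fit_fn is any function (X_sub, Y_sub) -> trained model; statistically,
    the excess risk now scales like 1/sqrt(B) instead of 1/sqrt(n).
    """
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(X), size=B)   # sampling with replacement
    return fit_fn(X[idx], Y[idx])
```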


Page 12:

Survey designs: a solution to Big Data learning?

• Framework: massive original sample $(X_1, Y_1), \ldots, (X_n, Y_n)$ viewed as a superpopulation
• Survey plan $R_n$ = probability distribution on the ensemble of all nonempty subsets of $\{1, \ldots, n\}$
• Let $S \sim R_n$ and set $\epsilon_i = 1$ if $i \in S$, $\epsilon_i = 0$ otherwise.
  The vector $(\epsilon_1, \ldots, \epsilon_n)$ fully describes $S$
• First and second order inclusion probabilities:
  $\pi_i(R_n) = \mathbb{P}\{i \in S\}$ and $\pi_{i,j}(R_n) = \mathbb{P}\{(i,j) \in S^2\}$
• Do not rely on the empirical risk based on the survey sample $\{(X_i, Y_i) : i \in S\}$:
  $\frac{1}{\#S} \sum_{i \in S} \mathbb{I}\{g(X_i) \neq Y_i\}$ is a biased estimate of $L(g)$

Page 13:

Horvitz-Thompson theory

• Consider the Horvitz-Thompson estimator of the risk (see the sketch below):
  $\bar{L}_n^{R_n}(g) = \frac{1}{n} \sum_{i=1}^{n} \frac{\epsilon_i}{\pi_i} \mathbb{I}\{g(X_i) \neq Y_i\}$
• And the Horvitz-Thompson empirical risk minimizer:
  $g_n^{\epsilon} = \arg\min_{g \in \mathcal{G}} \bar{L}_n^{R_n}(g)$
• It may work if $\sup_{g \in \mathcal{G}} \left| \bar{L}_n^{R_n}(g) - L_n(g) \right|$ is small
• In general, due to the dependence structure, not much can be said about the fluctuations of this supremum
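A short sketch (not from the slides) of the Horvitz-Thompson risk estimate under a Poisson/Bernoulli survey design, where each index is kept independently with its own inclusion probability; the classifier `g` and the probabilities `pi` are placeholders.

```python
import numpy as np

def horvitz_thompson_risk(g, X, Y, pi, seed=0):
    """Horvitz-Thompson estimate of the classification risk L(g).

    pi[i] = P{i is included in the survey sample}; here the sample S is
    drawn with a Poisson design (independent Bernoulli(pi[i]) draws).
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    eps = rng.uniform(size=n) < pi          # inclusion indicators epsilon_i
    errors = (g(X) != Y).astype(float)      # indicator{g(X_i) != Y_i}
    return np.sum(eps * errors / pi) / n    # (1/n) * sum eps_i/pi_i * error_i
```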

Page 14:

The Poisson case: the $\epsilon_i$'s are independent

• In this case, $\bar{L}_n^{R_n}(g)$ is a simple average of independent r.v.'s
  $\Rightarrow$ back to empirical process theory
• One recovers the same learning rate as if all data had been used, e.g. in the finite VC dimension case:
  $\mathbb{E}\left[ L(g_n^{\epsilon}) - L^* \right] \leq \left( \kappa_n \sqrt{2} + 4 \right) \sqrt{\frac{V \log(n+1) + \log 2}{n}}$
  where $\kappa_n = \sqrt{\sum_{i=1}^{n} 1/\pi_i^2}$ (the $\pi_i$'s should not be too small...)
• The upper bound is optimal in the minimax sense.

Page 15:

The Poisson case: the $\epsilon_i$'s are independent

• Can be extended to more general sampling plans $Q_n$ provided you are able to control
  $d_{TV}(R_n, Q_n) \overset{def}{=} \sum_{S \in \mathcal{P}(\mathcal{U}_n)} |R_n(S) - Q_n(S)|$
• A coupling technique (Hajek, 1964) can be used to show that it works for rejective sampling, Rao-Sampford sampling, successive sampling, post-stratified sampling, etc.

Page 16:

Beyond Empirical Processes: U-Statistics as Performance Criteria

• In various situations, the performance criterion is no longer a basic sample mean statistic
• Examples:
  • Clustering: within-cluster point scatter related to a partition $\mathcal{P}$ (see the sketch below):
    $\frac{2}{n(n-1)} \sum_{i < j} D(X_i, X_j) \sum_{\mathcal{C} \in \mathcal{P}} \mathbb{I}\{(X_i, X_j) \in \mathcal{C}^2\}$
  • Graph inference (link prediction)
  • Ranking
  • ...
• The empirical criterion is an average over all possible $k$-tuples: a U-statistic of degree $k \geq 2$
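For concreteness, a small sketch (not from the slides) of the within-cluster point scatter as a degree-2 U-statistic, using the Euclidean distance as the dissimilarity D and a hypothetical label vector defining the partition.

```python
import numpy as np

def within_cluster_point_scatter(X, labels):
    """Degree-2 U-statistic: sum of D(X_i, X_j) over pairs i < j that fall
    in the same cell of the partition, normalized by 2/(n(n-1))."""
    n = len(X)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] == labels[j]:                 # (X_i, X_j) in C^2
                total += np.linalg.norm(X[i] - X[j])   # D(X_i, X_j)
    return 2.0 * total / (n * (n - 1))
```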

Page 17:

Example: Ranking

• Data with ordinal labels:
  $(X_1, Y_1), \ldots, (X_n, Y_n) \in \big( \mathcal{X} \times \{1, \ldots, K\} \big)^{\otimes n}$
• Goal: rank $X_1, \ldots, X_n$ through a scoring function $s : \mathcal{X} \to \mathbb{R}$ s.t.
  $s(X)$ and $Y$ tend to increase/decrease together with high probability
• Quantitative formulation: maximize the criterion
  $L(s) = \mathbb{P}\{s(X^{(1)}) < \ldots < s(X^{(K)}) \mid Y^{(1)} = 1, \ldots, Y^{(K)} = K\}$
• Observations: $n_k$ i.i.d. copies of $X$ given $Y = k$: $X_1^{(k)}, \ldots, X_{n_k}^{(k)}$, with $n = n_1 + \ldots + n_K$


Page 21:

Example: Ranking

• A natural empirical counterpart of $L(s)$ is
  $\hat{L}_n(s) = \frac{\sum_{i_1=1}^{n_1} \cdots \sum_{i_K=1}^{n_K} \mathbb{I}\left\{ s(X_{i_1}^{(1)}) < \ldots < s(X_{i_K}^{(K)}) \right\}}{n_1 \times \cdots \times n_K}$
• But the number of terms to be summed is prohibitive!
  $n_1 \times \ldots \times n_K$
• Maximization of $\hat{L}_n(s)$ is computationally infeasible... (see the sketch below)
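As an illustration (not from the slides), a brute-force evaluation of the empirical ranking criterion for K samples; its cost is the product of the sample sizes, which is exactly what makes direct maximization intractable for large samples. The scoring function `s` is a placeholder.

```python
import itertools
import numpy as np

def empirical_ranking_criterion(s, samples):
    """samples = [X1, ..., XK], where Xk holds the observations with label k.

    Returns the fraction of K-tuples (x1, ..., xK), one from each sample,
    such that s(x1) < s(x2) < ... < s(xK).  Cost: prod(n_k) evaluations.
    """
    scores = [np.asarray([s(x) for x in Xk]) for Xk in samples]
    total = 1
    for sc in scores:
        total *= len(sc)
    hits = 0
    for tup in itertools.product(*scores):
        if all(tup[k] < tup[k + 1] for k in range(len(tup) - 1)):
            hits += 1
    return hits / total
```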


Page 24:

Generalized U-statistics

• $K \geq 1$ samples and degrees $(d_1, \ldots, d_K) \in \mathbb{N}^{*K}$
• $(X_1^{(k)}, \ldots, X_{n_k}^{(k)})$, $1 \leq k \leq K$: $K$ independent i.i.d. samples drawn from $F_k(dx)$ on $\mathcal{X}_k$ respectively
• Kernel $H : \mathcal{X}_1^{d_1} \times \cdots \times \mathcal{X}_K^{d_K} \to \mathbb{R}$, square integrable w.r.t. $\mu = F_1^{\otimes d_1} \otimes \cdots \otimes F_K^{\otimes d_K}$

Page 25:

Generalized U-statistics

Definition. The K-sample U-statistic of degrees $(d_1, \ldots, d_K)$ with kernel $H$ is
  $U_n(H) = \frac{\sum_{I_1} \ldots \sum_{I_K} H(X_{I_1}^{(1)}; X_{I_2}^{(2)}; \ldots; X_{I_K}^{(K)})}{\binom{n_1}{d_1} \times \cdots \times \binom{n_K}{d_K}}$,
where $\sum_{I_k}$ refers to summation over all $\binom{n_k}{d_k}$ subsets $X_{I_k}^{(k)} = (X_{i_1}^{(k)}, \ldots, X_{i_{d_k}}^{(k)})$ related to a set $I_k$ of $d_k$ indexes $1 \leq i_1 < \ldots < i_{d_k} \leq n_k$.

It is said symmetric when $H$ is permutation symmetric within each set of $d_k$ arguments $X_{I_k}^{(k)}$.

References: Lee (1990)

Page 26:

Generalized U-statistics

• Unbiased estimator of
  $\theta(H) = \mathbb{E}\big[ H(X_1^{(1)}, \ldots, X_{d_1}^{(1)}, \ldots, X_1^{(K)}, \ldots, X_{d_K}^{(K)}) \big]$
  with minimum variance
• Asymptotically Gaussian as $n_k/n \to \lambda_k > 0$ for $k = 1, \ldots, K$
• Its computation requires the summation of $\prod_{k=1}^{K} \binom{n_k}{d_k}$ terms
• K-partite ranking: $d_k = 1$ for $1 \leq k \leq K$,
  $H_s(x_1, \ldots, x_K) = \mathbb{I}\{s(x_1) < s(x_2) < \cdots < s(x_K)\}$

Page 27:

Incomplete U-statistics

• Replace $U_n(H)$ by an incomplete version, involving far fewer terms
• Build a set $\mathcal{D}_B$ of cardinality $B$ by sampling with replacement in the set $\Lambda$ of index tuples
  $\big( (i_1^{(1)}, \ldots, i_{d_1}^{(1)}), \ldots, (i_1^{(K)}, \ldots, i_{d_K}^{(K)}) \big)$
  with $1 \leq i_1^{(k)} < \ldots < i_{d_k}^{(k)} \leq n_k$, $1 \leq k \leq K$
• Compute the Monte Carlo version based on $B$ terms (see the sketch below):
  $\tilde{U}_B(H) = \frac{1}{B} \sum_{(I_1, \ldots, I_K) \in \mathcal{D}_B} H(X_{I_1}^{(1)}, \ldots, X_{I_K}^{(K)})$
• An incomplete U-statistic is NOT a U-statistic
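A minimal sketch (not from the slides) of the incomplete version for the one-sample, degree-2 case: instead of averaging the kernel over all n(n-1)/2 pairs, average over B pairs drawn with replacement. The kernel and the example statistic are placeholders.

```python
import numpy as np

def incomplete_u_statistic(X, kernel, B, seed=0):
    """Monte Carlo approximation of a one-sample U-statistic of degree 2.

    Draws B pairs of distinct indices with replacement from the set of all
    pairs and averages kernel(X[i], X[j]) over them (B terms instead of C(n,2)).
    Assumes a symmetric kernel.
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    total = 0.0
    for _ in range(B):
        i, j = rng.choice(n, size=2, replace=False)   # one random pair
        total += kernel(X[i], X[j])
    return total / B

# Example: incomplete version of the mean pairwise distance.
X = np.random.default_rng(1).normal(size=(5000, 3))
approx = incomplete_u_statistic(X, lambda a, b: np.linalg.norm(a - b), B=5000)
```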

Page 28:

ERM based on incomplete U-statistics

• Replace the criterion by a tractable incomplete version based on $B = O(n)$ terms:
  $\min_{H \in \mathcal{H}} \tilde{U}_B(H)$
• This leads to investigating the maximal deviations
  $\sup_{H \in \mathcal{H}} \left| \tilde{U}_B(H) - U_n(H) \right|$

Page 29:

Main Result

Theorem. Let $\mathcal{H}$ be a VC major class of bounded symmetric kernels of finite VC dimension $V < +\infty$. Set $M_{\mathcal{H}} = \sup_{(H,x) \in \mathcal{H} \times \mathcal{X}} |H(x)|$. Then,

(i) $\mathbb{P}\left\{ \sup_{H \in \mathcal{H}} \left| \tilde{U}_B(H) - U_n(H) \right| > \eta \right\} \leq 2(1 + \#\Lambda)^V \times e^{-B\eta^2 / M_{\mathcal{H}}^2}$

(ii) for all $\delta \in (0,1)$, with probability at least $1 - \delta$, we have:
  $\frac{1}{M_{\mathcal{H}}} \sup_{H \in \mathcal{H}} \left| \tilde{U}_B(H) - \mathbb{E}\big[ \tilde{U}_B(H) \big] \right| \leq 2 \sqrt{\frac{2V \log(1 + \kappa)}{\kappa}} + \sqrt{\frac{\log(2/\delta)}{\kappa}} + \sqrt{\frac{V \log(1 + \#\Lambda) + \log(4/\delta)}{B}}$,
where $\kappa = \min\{\lfloor n_1/d_1 \rfloor, \ldots, \lfloor n_K/d_K \rfloor\}$

Page 30:

Consequences

• Empirical risk sampling with $B = O(n)$ yields a rate bound of the order $O(\sqrt{\log n / n})$
• One suffers no loss in terms of learning rate, while drastically reducing the computational cost

Page 31:

Example: Ranking

Empirical ranking performance for SVMrank based on 1%, 5%, 10%, 20% and 100% of the "LETOR 2007" dataset.

Page 32:

Sketch of Proof

• Set $\epsilon = ((\epsilon_k(I))_{I \in \Lambda})_{1 \leq k \leq B}$, where $\epsilon_k(I)$ is equal to 1 if the tuple $I = (I_1, \ldots, I_K)$ has been selected at the $k$-th draw and to 0 otherwise
• The $\epsilon_k$'s are i.i.d. random vectors
• For all $(k, I) \in \{1, \ldots, B\} \times \Lambda$, the r.v. $\epsilon_k(I)$ has a Bernoulli distribution with parameter $1/\#\Lambda$
• With these notations,
  $\tilde{U}_B(H) - U_n(H) = \frac{1}{B} \sum_{k=1}^{B} Z_k(H)$, where $Z_k(H) = \sum_{I \in \Lambda} \big(\epsilon_k(I) - 1/\#\Lambda\big) H(X_I)$
• Freezing the $X_I$'s, by virtue of Sauer's lemma:
  $\#\{(H(X_I))_{I \in \Lambda} : H \in \mathcal{H}\} \leq (1 + \#\Lambda)^V$

Page 33:

Sketch of Proof (continued)

• Conditioned upon the $X_I$'s, $Z_1(H), \ldots, Z_B(H)$ are independent
• The first assertion is thus obtained by applying Hoeffding's inequality combined with the union bound
• Set
  $V_H\big( X_1^{(1)}, \ldots, X_{n_1}^{(1)}, \ldots, X_1^{(K)}, \ldots, X_{n_K}^{(K)} \big) = \kappa^{-1} \Big( H\big( X_1^{(1)}, \ldots, X_{d_1}^{(1)}, \ldots, X_1^{(K)}, \ldots, X_{d_K}^{(K)} \big) + H\big( X_{d_1+1}^{(1)}, \ldots, X_{2d_1}^{(1)}, \ldots, X_{d_K+1}^{(K)}, \ldots, X_{2d_K}^{(K)} \big) + \ldots + H\big( X_{\kappa d_1 - d_1 + 1}^{(1)}, \ldots, X_{\kappa d_K - d_K + 1}^{(K)}, \ldots, X_{\kappa d_K}^{(K)} \big) \Big)$

Page 34:

Sketch of Proof (continued)

• The proof of the second assertion is based on the Hoeffding decomposition
  $U_n(H) = \frac{1}{n_1! \cdots n_K!} \sum_{\sigma_1 \in \mathfrak{S}_{n_1}, \ldots, \sigma_K \in \mathfrak{S}_{n_K}} V_H\big( X_{\sigma_1(1)}^{(1)}, \ldots, X_{\sigma_K(n_K)}^{(K)} \big)$
• The concentration result is then obtained in a classical manner:
  • Convexity (Chernoff's bound)
  • Symmetrization
  • Randomization
  • Application of McDiarmid's bounded difference inequality

Page 35:

Beyond finite VC dimension

• Challenge: develop probabilistic tools and complexity assumptions to investigate the concentration properties of collections of sums of weighted binomials
  $\tilde{U}_B(H) - U_n(H) = \frac{1}{B} \sum_{k=1}^{B} Z_k(H)$,
  with $Z_k(H) = \sum_{I \in \Lambda} \big(\epsilon_k(I) - 1/\#\Lambda\big) H(X_I)$

Page 36:

Some references

• Maximal Deviations of Incomplete U-statistics with Applications to Empirical Risk Sampling. S. Clémençon, S. Robbiano and J. Tressou (2013). In the Proceedings of the SIAM International Conference on Data Mining, Austin (USA).
• Empirical processes in survey sampling. P. Bertail, E. Chautru and S. Clémençon (2013). Submitted.
• A statistical view of clustering performance through the theory of U-processes. S. Clémençon (2014). In Journal of Multivariate Analysis.
• On Survey Sampling and Empirical Risk Minimization. P. Bertail, E. Chautru and S. Clémençon (2014). ISAIM 2014, Fort Lauderdale (USA).

Page 37:

Introduction

Investigate the binary classification problem in the statistical learning context

• Data not stored in a central unit but processed by independent agents (processors)
• Aim: not to find a consensus on a common classifier, but to find how to combine the local classifiers efficiently
• Solution: implement in an on-line and distributed manner

2/21

Page 38:

Outline

Background

Proposed algorithm

Theoretical results

Improvement of agent selection

Numerical experiments

3/21


Page 40:

Learning problem

$X \in \mathcal{X} \subset \mathbb{R}^n \;\longrightarrow\; \mathrm{sign}(H(X)) \;\longrightarrow\; Y \in \{-1, +1\}$
(r.v. observation $\to$ r.v. binary output)

Given a training dataset $(X, Y) = (X_i, Y_i)_{i=1,\ldots,n}$ in high dimension and with unknown joint distribution...

...find the best prediction rule $\mathrm{sign}(H^\star)$, i.e. the classifier function $H(x)$ such that
  $H^\star = \arg\min_H P_e(H)$, where $P_e(H) = \mathbb{P}[-YH(X) > 0] = \mathbb{E}\big[ \mathbb{I}\{-YH(X) > 0\} \big]$,
minimizes the probability of error $P_e$.

⚠ But the indicator $\mathbb{I}(\cdot)$ is not a differentiable function!

4/21

Page 41:

Learning problem

Majorize $\mathbb{E}\big[ \mathbb{I}\{-YH(X) > 0\} \big]$ by a convex function: convex surrogate
  $\mathbb{E}\big[ \mathbb{I}\{-YH(X) > 0\} \big] \leq \mathbb{E}\big[ \varphi(-YH(X)) \big]$

How? Use a cost function with appropriate properties.

Example: use the quadratic function $\varphi(u) = \frac{(u+1)^2}{2} : \mathbb{R} \to [0, +\infty)$ (see the sketch below)

4/21
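A tiny sketch (not from the slides) of this quadratic surrogate and its derivative, which is what the gradient updates later in the deck differentiate; the names are illustrative.

```python
import numpy as np

def phi(u):
    """Quadratic surrogate phi(u) = (u + 1)^2 / 2, used in the slides as a
    smooth convex stand-in for the 0-1 step u -> indicator{u > 0}."""
    return 0.5 * (u + 1.0) ** 2

def phi_prime(u):
    """Derivative phi'(u) = u + 1, used in the stochastic gradient updates."""
    return u + 1.0

def surrogate_risk(H, X, Y):
    """Empirical surrogate risk of a classifier H: mean of phi(-Y * H(X))."""
    margins = -Y * np.array([H(x) for x in X])
    return float(np.mean(phi(margins)))
```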

Page 42:

Learning problem

$X \in \mathcal{X} \subset \mathbb{R}^n \;\longrightarrow\; \mathrm{sign}(H(X)) \;\longrightarrow\; Y \in \{-1, +1\}$
(r.v. observation $\to$ r.v. binary output)

Given a training dataset $(X, Y) = (X_i, Y_i)_{i=1,\ldots,n}$ in high dimension and with unknown joint distribution...

...find the best prediction rule $\mathrm{sign}(H^\star)$, i.e. the classifier function $H(x)$ such that
  $H^\star = \arg\min_H R_\varphi(H)$, where $R_\varphi(H) = \mathbb{E}\big[ \varphi(-YH(X)) \big]$,
minimizes the risk function $R_\varphi(H)$.

✓ When $\varphi(u) = \frac{(u+1)^2}{2}$, $\mathrm{sign}(H^\star)$ coincides with the Bayes classifier!

4/21

Page 43:

Aggregation of local classifiers

Consider a classification device composed of a set $V$ of $N$ connected agents.

Each agent $v \in V$:
• disposes of $\{(X_{1,v}, Y_{1,v}), \ldots, (X_{n_v,v}, Y_{n_v,v})\}$ → $n_v$ independent copies of $(X, Y)$
• selects a local soft classifier from a parametric class $\{h_v(\cdot, \theta_v)\}$

Set $\theta_v = (a_v, b_v)$; the global soft classifier is (see the sketch below):
  $H(x, \boldsymbol{\theta}) = \sum_{v \in V} h_v(x, \theta_v)$,
where $h_v(x, \theta_v) = a_v \, h_v(x, b_v)$ and $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_N)^\top$

5/21
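A small sketch (not from the slides) of this aggregation: each agent contributes a weighted local soft classifier and the global score is their sum. The parametric family (tanh of a linear score) and all names are placeholders.

```python
import numpy as np

class Agent:
    """One agent v holding theta_v = (a_v, b_v) and a local soft classifier."""

    def __init__(self, a, b):
        self.a, self.b = a, b

    def h(self, x):
        # Local soft classifier h_v(x, theta_v) = a_v * tanh(<b_v, x>)
        # (tanh is just an illustrative choice of parametric family).
        return self.a * np.tanh(np.dot(self.b, x))

def global_classifier(agents, x):
    """H(x, theta) = sum over agents of h_v(x, theta_v); predict sign(H)."""
    H = sum(agent.h(x) for agent in agents)
    return np.sign(H), H
```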

Page 44:

Problem statement

The problem can be summarized as follows:
• given an observed datum $X$,
• obtain the best estimated label $Y$ as $\mathrm{sign}(H(X, \boldsymbol{\theta}))$,
• where $\boldsymbol{\theta}$ is computed from the optimization problem using the training data $(X, Y) = (X_i, Y_i)_{i=1,\ldots,n}$ as:
  $\min_{\boldsymbol{\theta} \in \Theta} R_\varphi(H(X, \boldsymbol{\theta}))$

6/21

Page 45:

Problem statement

Approaches

1. Agreement on a common decision rule [Tsitsiklis '84, Agarwal '10]: consensus approach
   • find an average consensus solution: $\boldsymbol{\theta} = (\theta, \ldots, \theta)$
   • each agent uses the global classifier $H(X, \boldsymbol{\theta})$
2. Mixture of experts: cooperative approach
   • find the best aggregation solution: $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_N)$
   • each agent uses its local classifier $h_v(x, \theta_v)$

6/21

Page 46:

Problem statement

Approaches

1. Agreement on a common decision rule [Tsitsiklis '84, Agarwal '10]: consensus approach
2. Mixture of experts: cooperative approach
   • find the best aggregation solution: $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_N)$
   • each agent uses its local classifier $h_v(x, \theta_v)$

✓ Example: set $b_v = 0$, $a_v \geq 0$ and $\bar{h}_v : \mathcal{X} \to \{-1, +1\}$ the weak classifier:
  $h_v(x, \theta_v) = a_v \, \bar{h}_v(x)$

6/21

Page 47:

Outline

Background

Proposed algorithm

Theoretical results

Improvement of agent selection

Numerical experiments

7/21

Page 48:

High rate distributed learning

Solve the minimization problem of the parametric risk function:
  $\min_{\boldsymbol{\theta} \in \Theta} R_\varphi(H(X, \boldsymbol{\theta}))$

8/21

Page 49:

High rate distributed learning

A standard distributed gradient descent iterative approach:
• generates a sequence of estimated parameter vectors $(\boldsymbol{\theta}_t)_{t \geq 1} = (\theta_{t,1}, \ldots, \theta_{t,N})_{t \geq 1}$
• at each agent $v$, the update step writes:
  $\theta_{t+1,v} = \theta_{t,v} + \gamma_t \, \mathbb{E}\big[ Y \, \nabla_v h_v(X, \theta_{t,v}) \, \varphi'(-Y H(X, \boldsymbol{\theta}_t)) \big]$

⚠ The joint distribution is unknown

8/21

Page 50:

High rate distributed learning

A standard distributed and on-line gradient descent iterative approach:
• generates a sequence of estimated parameter vectors $(\boldsymbol{\theta}_t)_{t \geq 1} = (\theta_{t,1}, \ldots, \theta_{t,N})_{t \geq 1}$
• each agent $v$ observes a pair $(X_{t+1,v}, Y_{t+1,v})$
• at each agent $v$, the update step writes (the expectation replaced by its empirical version):
  $\theta_{t+1,v} = \theta_{t,v} + \gamma_t \, Y_{t+1,v} \, \nabla_v h_v(X_{t+1,v}, \theta_{t,v}) \, \varphi'(-Y_{t+1,v} H(X_{t+1,v}, \boldsymbol{\theta}_t))$

⚠ Evaluating $H(X_{t+1,v}, \boldsymbol{\theta}_t)$ is required at each $t$ and $v$!

8/21

Page 51:

High rate distributed learning

Example. At iteration $t$, each agent $v \in V$ has $(X_{t,v}, \theta_{t,v})$ and evaluates its local $h_v(X_{t,v}, \theta_{t,v})$.

[Diagram: four connected agents 1-4, each holding its local pair $(X_{t,v}, \theta_{t,v})$.]

9/21

Page 52:

High rate distributed learning

Example. Each node $v$ sends its observation $X_{t,v}$ to all the other nodes.

[Diagram: node 1 broadcasts $X_{t,1}$ to nodes 2, 3 and 4.]

9/21

Page 53:

High rate distributed learning

Example. Each node $v$ obtains the evaluation of $h_w(X_{t,v}, \theta_{t,w})$ from all the other nodes.

[Diagram: nodes 1-4 each evaluate $h_w(X_{t,1}, \theta_{t,w})$ on node 1's observation.]

9/21

Page 54:

High rate distributed learning

Example. Each node $v$ obtains the evaluation of $h_w(X_{t,v}, \theta_{t,w})$ from all the other nodes...

[Diagram: node 1 collects $\{h_2(X_{t,1}, \theta_{t,2}), h_3(X_{t,1}, \theta_{t,3}), h_4(X_{t,1}, \theta_{t,4})\}$.]

...and computes the global classifier: $H(X_{t,1}, \boldsymbol{\theta}_t) = \sum_{w=1}^{4} h_w(X_{t,1}, \theta_{t,w})$

⚠ $N(N-1)$ communications per iteration: 12 for $N = 4$!

9/21

Page 55:

Proposed distributed learning: the OLGA algorithm

✓ Replace the global $H(X_{t+1,v}, \boldsymbol{\theta}_t)$ by a local estimate $Y^{(V)}_{t,v}$ at each $v \in V$ such that:
  $\mathbb{E}\big[ Y^{(V)}_{t+1,v} \mid X_{t+1,v}, \boldsymbol{\theta}_t \big] = H(X_{t+1,v}, \boldsymbol{\theta}_t)$

How? Sparse communications with sparsity ratio $p$...

On-line Learning Gossip Algorithm (OLGA)
...for each $v \in V$ at time $t$, the local gradient descent update writes (see the sketch below):
  $\theta_{t+1,v} = \theta_{t,v} + \gamma_t \, Y_{t+1,v} \, \nabla_v h_v(X_{t+1,v}, \theta_{t,v}) \, \varphi'(-Y_{t+1,v} Y^{(V)}_{t+1,v})$

10/21
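A compact sketch (not the authors' code) of one OLGA round for the weighted-weak-classifier case $b_v = 0$, $h_v(x, \theta_v) = a_v \bar{h}_v(x)$, with the quadratic surrogate: each agent receives the other agents' evaluations independently with probability p, builds the unbiased local estimate $Y^{(V)}$ by rescaling the received terms by 1/p, and takes a stochastic gradient step on its weight. All names are illustrative, and the same observed pair is reused for every agent to keep the sketch short.

```python
import numpy as np

rng = np.random.default_rng(0)

def phi_prime(u):
    return u + 1.0          # derivative of the quadratic surrogate (u+1)^2/2

def olga_round(a, weak, x, y, gamma, p):
    """One OLGA iteration for N agents with h_v(x) = a_v * weak_v(x).

    a     : array of shape (N,), current weights a_v
    weak  : list of N weak classifiers weak_v: x -> {-1, +1}
    x, y  : the pair observed at this round (shared by all agents here)
    gamma : step size gamma_t
    p     : communication sparsity ratio
    """
    N = len(a)
    h_vals = np.array([weak[v](x) for v in range(N)])    # weak_v(x)
    new_a = a.copy()
    for v in range(N):
        # Each other agent w answers independently with probability p;
        # rescaling by 1/p makes Y_v an unbiased estimate of H(x, theta).
        mask = rng.uniform(size=N) < p
        mask[v] = False
        Y_v = a[v] * h_vals[v] + np.sum(a[mask] * h_vals[mask]) / p
        # Local stochastic gradient step on a_v (gradient of a_v*weak_v(x)).
        new_a[v] = a[v] + gamma * y * h_vals[v] * phi_prime(-y * Y_v)
    return new_a
```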

Page 56:

Proposed distributed learning: the OLGA algorithm

Example. At iteration $t$, each agent $v \in V$ has $(X_{t,v}, \theta_{t,v})$ and evaluates its local $h_v(X_{t,v}, \theta_{t,v})$.

[Diagram: four connected agents 1-4, each holding its local pair $(X_{t,v}, \theta_{t,v})$.]

10/21

Page 57:

Proposed distributed learning: the OLGA algorithm

Example. Each node $v$ sends its observation $X_{t,v}$ to randomly selected nodes, each with probability $p = 1/3$.

[Diagram: node 1 sends $X_{t,1}$ to node 4 only.]

10/21

Page 58:

Proposed distributed learning: the OLGA algorithm

Example. Each node $v$ obtains the evaluation of $h_w(X_{t,v}, \theta_{t,w})$ from the randomly selected nodes.

[Diagram: node 4 returns $h_4(X_{t,1}, \theta_{t,4})$ to node 1.]

10/21

Page 59:

Proposed distributed learning: the OLGA algorithm

Example. Each node $v$ obtains the evaluation of $h_w(X_{t,v}, \theta_{t,w})$ from the chosen nodes...

[Diagram: node 1 collects $\{h_4(X_{t,1}, \theta_{t,4})\}$.]

...and computes its local estimate: $Y^{(V)}_{t,1} = h_1(X_{t,1}, \theta_{t,1}) + \frac{1}{p} h_4(X_{t,1}, \theta_{t,4})$

⚠ $p\,N(N-1)$ communications per iteration: 4 for $N = 4$, $p = 1/3$ (a reduction of 67%)!

10/21

Page 60:

Performance analysis

⚠ What is the effect of sparsification?

...study the behaviour of the vector sequence $\boldsymbol{\theta}_t$ as $t \to \infty$:
• the consistency of the final solution given by the algorithm
• quantify the excess error variance due to the sparsity

11/21

Page 61:

Outline

Background

Proposed algorithm

Theoretical results

Improvement of agent selection

Numerical experiments

12/21

Page 62:

Asymptotic behaviour of OLGA

Under suitable assumptions, we prove the following results:

1. Consistency:
   $(\boldsymbol{\theta}_t)_{t \geq 1} \xrightarrow{a.s.} \boldsymbol{\theta}^\star \in \mathcal{L} = \{\nabla R_\varphi(\boldsymbol{\theta}) = 0\}$

2. CLT: conditioned on the event $\{\lim_{t \to \infty} \boldsymbol{\theta}_t = \boldsymbol{\theta}^\star\}$,
   $\gamma_t^{-1/2} (\boldsymbol{\theta}_t - \boldsymbol{\theta}^\star) \xrightarrow{\mathcal{L}} \mathcal{N}(0, \Sigma(\Gamma^\star))$
   where:
   $\Gamma^\star = \underbrace{\mathbb{E}\big[ (H(X, \boldsymbol{\theta}^\star) - Y)^2 \, \nabla_v h_v(X, \theta_v^\star) \nabla_v^\top h_v(X, \theta_v^\star) \big]}_{\text{estimation error in a centralized case}} + \underbrace{\frac{1-p}{p} \sum_{w \neq v} \mathbb{E}\big[ h_w(X, \theta_w^\star)^2 \, \nabla_v h_v(X, \theta_v^\star) \nabla_v^\top h_v(X, \theta_v^\star) \big]}_{\text{additional noise term induced by the distributed setting}}$

13/21

Page 63:

Outline

Background

Proposed algorithm

Theoretical results

Improvement of agent selection

Numerical experiments

14/21

Page 64:

A best agents selection approach

When...
⚠ the number of agents $N$ is large → difficult to implement
⚠ there are redundant agents → avoid similar outputs
...include distributed agent selection!

How? Add an $\ell_1$-penalization term with tuning parameter $\lambda$ (see the sketch below):
  $\min_{\boldsymbol{\theta} \in \Theta} R_\varphi(H(X, \boldsymbol{\theta})) + \lambda \sum_v |a_v|$
where the weight $a_v = 0$ for an idle agent and $a_v > 0$ when it is active.

15/21
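A sketch (not the authors' implementation) of how such an ℓ1 penalty can be folded into the per-agent update from the OLGA sketch above via a proximal soft-thresholding step on the weight a_v, which drives the weights of unhelpful agents to exactly zero; all names are illustrative.

```python
import numpy as np

def soft_threshold(a, tau):
    """Proximal step for tau * |a|: shrink toward 0 and clip at 0."""
    return np.sign(a) * np.maximum(np.abs(a) - tau, 0.0)

def penalized_weight_update(a_v, grad_v, gamma, lam):
    """One proximal-gradient step on a_v for the l1-penalized risk.

    grad_v is the stochastic gradient estimate used by OLGA, i.e.
    -Y * weak_v(X) * phi'(-Y * Y_v); lam is the tuning parameter lambda.
    """
    a_half = a_v - gamma * grad_v          # plain gradient step
    return soft_threshold(a_half, gamma * lam)

def active_set(a, tol=0.0):
    """S_t: indices of agents whose weight is (strictly) nonzero."""
    return np.flatnonzero(np.abs(a) > tol)
```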

Page 65:

Including best agents selection in the OLGA algorithm

Introduce an update step at each time $t$ of OLGA to seek the time-varying set of active nodes $S_t \subset V$.

16/21

Page 66:

Including best agents selection in the OLGA algorithm

The extended algorithm is summarized as follows; at time $t$ (see the sketch below):
1. obtain the active nodes $S_t$ from the sequence of updated weights $(a_{t,1}, \ldots, a_{t,N})$
2. apply OLGA to the set of active agents $v \in S_t$:
   i) estimate the local $Y^{(S_t)}_{t+1,v}$ from a random selection among the current active nodes
   ii) update the local gradient descent:
      $\theta_{t+1,v} = \theta_{t,v} + \gamma_t \, Y_{t+1,v} \, \nabla_v h_v(X_{t+1,v}, \theta_{t,v}) \, \varphi'(-Y_{t+1,v} Y^{(S_t)}_{t+1,v})$

16/21
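Putting the two previous sketches together, an illustrative outer loop (not the authors' code) for the extended algorithm: at each round the active set S_t is read off the current weights, the sparse-gossip estimate is built over active agents only, and each active weight takes a penalized gradient step. The small helpers are restated so the sketch is self-contained; all names are assumptions.

```python
import numpy as np

phi_prime = lambda u: u + 1.0                          # derivative of (u+1)^2/2
soft_threshold = lambda a, tau: np.sign(a) * np.maximum(np.abs(a) - tau, 0.0)

def extended_olga_round(a, weak, x, y, gamma, p, lam, rng):
    """One round of OLGA with distributed agent selection (illustrative)."""
    N = len(a)
    S_t = np.flatnonzero(np.abs(a) > 0)                # step 1: active nodes
    h_vals = np.array([weak[v](x) for v in range(N)])
    new_a = a.copy()
    for v in S_t:                                      # step 2: OLGA on S_t
        mask = np.zeros(N, dtype=bool)
        mask[S_t] = rng.uniform(size=len(S_t)) < p     # random selection among active nodes
        mask[v] = False
        Y_v = a[v] * h_vals[v] + np.sum(a[mask] * h_vals[mask]) / p
        grad_v = -y * h_vals[v] * phi_prime(-y * Y_v)  # stochastic gradient on a_v
        new_a[v] = soft_threshold(a[v] - gamma * grad_v, gamma * lam)
    return new_a
```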

Page 67:

Outline

Background

Proposed algorithm

Theoretical results

Improvement of agent selection

Numerical experiments

17/21

Page 68:

Example with simulated data

Binary classification of (+) and (o) data samples with $N = 60$ agents using weak linear classifiers (-). When using distributed selection, the ensemble reduces to 25 active classifiers.

[Figure: scatter plots of the two classes with the learned decision boundaries; (a) OLGA, (b) OLGA with distributed selection.]

18/21

Page 69:

Comparison with real data

Binary classification on the benchmark banana dataset using weak linear classifiers with increasing $N$.

[Figure: error rate versus the number of weak learners (5 to 35), for OLGA (p = 0.6) and GentleBoost. Caption: Comparison between a centralized and sequential approach (GentleBoost) and our distributed and on-line algorithm (OLGA).]

19/21

Page 70:

Conclusions

• A fully distributed and on-line algorithm is proposed for binary classification on big datasets handled by $N$ processors
  ✓ the algorithm is then adapted to select the useful classifiers → $N$ decreases
• We obtain theoretical results from the asymptotic analysis of the sequence estimated by OLGA
• Numerical results are presented, showing behaviour comparable to a centralized, batch and sequential approach (GentleBoost)

20/21