Unsupervised Ranking and Ensemble Learning

or

Making good decisions when knowing nothing

Boaz Nadler

Department of Computer Science and Applied Mathematics, The Weizmann Institute of Science

Joint work with

Fabio Parisi, Francesco Strino and Yuval Kluger (Yale Medical School), and Ariel Jaffe (Weizmann)

Nov. 2014

Problem Setup

Consider a binary classification problem over an instance space X with output Y ∈ {−1,+1}.

Goal: construct a classifier with good generalization (small risk).

Typical Supervised Case:

Labeled training set {(x_i, y_i)}_{i=1}^n

Many methods to construct classifiers, and a well-understood theory.

Supervised Ranking - Multiple Classifiers

Multiple Classifiers:

We are given m classifiers, f_1, ..., f_m.

Each classifier was constructed with its own training data, assumptions, design principles, etc.

Two Key Questions:

- Rank: find the most accurate classifier

- Combine them into a more accurate meta-classifier (ensemble learner)

Supervised Ranking and Ensemble Learning

Standard Approach:

Set aside an independent set of labeled validation data

Rank classifiers by empirical accuracies on this labeled set.

Many methods to construct ensemble learners [bagging, boosting, etc.]

Central Question in this Talk:

Given the predictions of m classifiers on a large test set,

can we rank them and construct a more accurate meta-classifier

without any labeled data?

Motivating Example I:

Consider an investor intending to trade n stocks.

He gets advice from m entities (sell/buy on each of the n stocks).

Entities = friends, professional investment houses, his mother (in-law), the WSJ, etc.

However, our investor knows nothing about the reliability of the advisors.

Questions:

Can our investor find out who is the most reliable advisor?

How should he combine the (possibly conflicting) advice of the m entities?

Motivating Example II

A biologist wishes to know where, along a given long DNA string, the protein binding sites are.

Common Approach: apply tens of different peak-detection algorithms that predict binding sites.

Each algorithm was derived by a separate lab, using proprietary data and employing different design principles and biological knowledge.

How should our biologist rank these algorithms? How should she combine them to get a more accurate prediction?

Applications

Common theme to both examples: we are given the predictions or recommendations of multiple advisers of unknown reliability.

No labeled data (difficult/expensive to obtain, or known only in the future).

This scenario appears in a broad range of applications:

- decision science

- crowdsourcing

- medicine

- grant application review panels

- etc.

Previous Works / Unsupervised Ensemble Learning

- Majority Voting (highly sub-optimal)

- Bayesian Approaches: Reaching a Consensus [DeGroot, '74]

- Maximum Likelihood Estimation via EM [Dawid and Skene, '79] [many follow-up works]

- Spectral Methods in Crowdsourcing [Karger et al., 2010]

Our Contribution

Present a novel, simple spectral analysis of this problem.

Reveal low-dimensional structure in this high-dimensional data.

Insights:

- Standard independence assumptions between classifier errors → the off-diagonal of the population covariance matrix of the classifiers has a rank-one structure

- Entries of the eigenvector of this rank-one matrix ∝ balanced accuracies of the classifiers

- Allows ranking of the classifiers

- Allows consistent estimation of their parameters (sensitivity, specificity)

- Derive a novel unsupervised ensemble learner

Statistical Formulation

Binary Classification Problem:

- instance space X (typically R^d)
- output space Y = {−1,+1}
- probability density p(x, y), with marginals p_X(x) and p_Y(y)

Binary Classifier: a function f : X → Y

Classifier Quality / Performance

Sensitivity: ψ = Pr[f(X) = 1 | Y = 1] = fraction of true positives

Specificity: η = Pr[f(X) = −1 | Y = −1] = fraction of true negatives

Balanced Accuracy:

π = (ψ + η) / 2

A common quality measure in the presence of class imbalance.

It will arise as a natural measure in our setup!
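For concreteness, a minimal sketch (my own illustration, not part of the talk) of how these quantities are computed from ±1 predictions and labels:

import numpy as np

def classifier_quality(pred, y):
    # pred, y: arrays with entries in {-1, +1}
    psi = np.mean(pred[y == 1] == 1)     # sensitivity  Pr[f(X)=1 | Y=1]
    eta = np.mean(pred[y == -1] == -1)   # specificity  Pr[f(X)=-1 | Y=-1]
    pi = 0.5 * (psi + eta)               # balanced accuracy
    return psi, eta, pi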

Problem Setup

D = {x_k} ⊂ X: unlabeled test data of n i.i.d. instances drawn from p_X(x).

f_1, ..., f_m: an ensemble of m classifiers of unknown accuracy.

Questions: Given the m × n matrix of all classifiers' predictions, {f_i(x_k)}, i = 1, ..., m, k = 1, ..., n, can one

- rank the m classifiers?

- combine their predictions into an even more accurate meta-classifier?

Key Point: perform the above without any labeled data!

Assumption

Conditionally Independent Predictors:

As in supervised ensemble methods, assume the errors made by one classifier are independent of those made by the others:

for all a_i, a_j, y ∈ {−1, 1},

Pr[f_i(X) = a_i, f_j(X) = a_j | Y = y] = Pr[f_i(X) = a_i | Y = y] · Pr[f_j(X) = a_j | Y = y]

The population covariance matrix

Let Q be the m × m population covariance matrix between classifiers,

q_ij = E[(f_i(X) − µ_i)(f_j(X) − µ_j)],   where µ_i = E[f_i(X)].

Lemma: The entries of Q are

q_ij = 1 − µ_i²                          if i = j
q_ij = (2π_i − 1)(2π_j − 1)(1 − b²)      otherwise

where b ∈ (−1, 1) is the class imbalance,

b = Pr[Y = 1] − Pr[Y = −1].
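A small numerical sanity check of the lemma (my own illustrative sketch, with made-up sensitivities and specificities):

import numpy as np

rng = np.random.default_rng(0)
m, n, b = 5, 200_000, 0.3                 # classifiers, samples, class imbalance
psi = rng.uniform(0.6, 0.9, m)            # sensitivities
eta = rng.uniform(0.6, 0.9, m)            # specificities
pi = 0.5 * (psi + eta)                    # balanced accuracies

y = np.where(rng.random(n) < (1 + b) / 2, 1, -1)   # labels with Pr[Y=1] - Pr[Y=-1] = b
F = np.empty((m, n))
for i in range(m):                        # conditionally independent predictions
    p_correct = np.where(y == 1, psi[i], eta[i])
    F[i] = np.where(rng.random(n) < p_correct, y, -y)

Q = np.cov(F)                             # sample covariance, m x m
print(Q[0, 1], (2*pi[0] - 1) * (2*pi[1] - 1) * (1 - b**2))   # should nearly agree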

Unsupervised Ranking of Classifiers

Corollary: The off-diagonal entries of Q correspond to a rank-one matrix R = λ v vᵀ, where

λ = (1 − b²) Σ_{j=1}^m (2π_j − 1)²

Importantly, up to a ±1 sign ambiguity,

v_j ∝ (2π_j − 1)

Key Result: If we can consistently estimate the rank-one matrix R, then the classifiers can be consistently ranked by the entries of this eigenvector.

Estimating Rank-One Matrix R

Sample Covariance Matrix:

q̂_ij = 1/(n − 1) Σ_{k=1}^n (f_i(x_k) − µ̂_i)(f_j(x_k) − µ̂_j)

where µ̂_j = (1/n) Σ_k f_j(x_k).

By definition, for i ≠ j, E[q̂_ij] = q_ij = r_ij.

We only need to estimate the diagonal entries of R and then perform an eigendecomposition to compute the leading eigenvector.

Estimating Rank-One Matrix R

This is a low-rank matrix-completion problem.

Several computationally efficient methods to solve it:

- Semi-definite programming
- Least-squares methods to estimate the diagonal of R

From the estimate R̂, compute the leading eigenvector v̂.

Rank the m classifiers by sorting the entries of v̂; a sketch appears below.
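A minimal sketch of this pipeline (my own illustration; an iterative diagonal-completion heuristic rather than the SDP or least-squares variants mentioned above):

import numpy as np

def rank_one_diagonal_fill(Q, iters=50):
    # Q: sample covariance of the m classifiers' +-1 predictions.
    # Iteratively fill the diagonal so the matrix is (approximately) rank one.
    R = Q.copy()
    for _ in range(iters):
        w, U = np.linalg.eigh(R)
        lam, u = w[-1], U[:, -1]          # leading eigenpair
        np.fill_diagonal(R, lam * u**2)   # diagonal consistent with lam * u u^T
    return R, lam, u

# Usage (F is the m x n matrix of +-1 predictions):
# Q = np.cov(F)
# R_hat, lam, v = rank_one_diagonal_fill(Q)
# v = v if v.sum() > 0 else -v            # fix the sign, assuming most classifiers beat random
# ranking = np.argsort(-v)                # most accurate classifier first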

Properties of Solution

Asymptotic Consistency: As the unlabeled set size n → ∞, R̂ → R and consequently v̂ → v.

If the class imbalance b is known, the method also gives consistent estimates of the π_i.

Estimating ψ_i, η_i:

µ = E[f] = (ψ − η) + b(2π − 1)

Hence, given b and v, solve a system of 2 linear equations in the 2 unknowns (ψ, η); the explicit solution is given below.
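Explicitly (a short derivation from the two relations above, using the scaling v_i = 2π_i − 1): the two equations are ψ_i + η_i = 1 + v_i and ψ_i − η_i = µ_i − b·v_i, so

ψ_i = [1 + (1 − b)·v_i + µ_i] / 2,    η_i = [1 + (1 + b)·v_i − µ_i] / 2.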

Properties of Solution

Stability for Finite Sample Size n:

Fluctuations are

Q̂ − Q = O_P(1/√n)

Via a matrix perturbation approach, similar to [N., AoS '08],

v̂ − v = O_P( (1/λ) · (1/√n) )

Remark: If all m classifiers are better than random, the spectral gap satisfies λ = O(m), and

ψ̂_i − ψ_i, η̂_i − η_i = O_P(1/√n)

Estimating Class Imbalance

Look at the 3-D joint covariance of triplets of classifiers.

Lemma: If the classifiers make independent errors in triplets, the off-diagonal part of this 3-D tensor is rank one,

E[(f_i(X) − µ_i)(f_j(X) − µ_j)(f_k(X) − µ_k)] = α(b) · v_i v_j v_k

where

α(b) = −2b / √(1 − b²)

Hence: compute the 3-D tensor; via least squares, estimate the single scalar α(b); invert to estimate b (a sketch appears below).

Asymptotically consistent, and no tensor decomposition is needed.
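A minimal sketch of this step (my own illustration; v is assumed scaled so that R ≈ v vᵀ, i.e. √λ times the unit leading eigenvector):

import numpy as np

def estimate_class_imbalance(F, v):
    # F: m x n matrix of +-1 predictions; v: eigenvector scaled so that R ~ v v^T.
    G = F - F.mean(axis=1, keepdims=True)                  # centered predictions
    T = np.einsum('in,jn,kn->ijk', G, G, G) / F.shape[1]   # empirical 3-way moment tensor
    m = F.shape[0]
    num = den = 0.0
    for i in range(m):                                     # least squares over distinct triplets
        for j in range(i + 1, m):
            for k in range(j + 1, m):
                p = v[i] * v[j] * v[k]
                num += T[i, j, k] * p
                den += p * p
    alpha = num / den                                      # fit T_ijk ~ alpha * v_i v_j v_k
    return -alpha / np.sqrt(4 + alpha**2)                  # invert alpha(b) = -2b / sqrt(1 - b^2)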

Learning Mixtures of Product Distributions

If we assume all classifiers jointly make independent errors, the problem is equivalent to learning a mixture of discrete product distributions.

Each column vector of length m comes from one of two distributions, according to the unobserved latent variable y (the class label).

[Freund and Mansour '99]; several computationally efficient methods have been developed for this; recently, via tensor decompositions [Anandkumar, Hsu and Kakade '12].

Our method provides an elegant and much simpler approach for the binary case.

PART II: How to combine the m classifiers?

Maximum Likelihood

Unknowns:
- ψ_i, η_i = sensitivities and specificities of the m classifiers
- y = (y_1, ..., y_n) = true labels of the n instances

Common Approach: look for ψ_i, η_i, y that maximize the likelihood.

Given the assumption of independence of the errors of different classifiers, for an instance x,

L(f_1(x), ..., f_m(x) | y, {ψ_i, η_i}) = ∏_{i=1}^m Pr[f_i(x) | y, ψ_i, η_i]

Assuming the instances are i.i.d.,

L({f_i(x_j)} | y, {ψ_i, η_i}) = ∏_{j=1}^n L({f_i(x_j)} | y_j, {ψ_i, η_i})

Maximum Likelihood Solution

If we knew the sensitivities and specificities of the m classifiers:

Lemma: The MLE is a weighted linear ensemble classifier,

ŷ^(ML)(x) = sign( Σ_i [ f_i(x) ln α_i + ln β_i ] )

where

α_i = ψ_i η_i / ((1 − ψ_i)(1 − η_i)),    β_i = ψ_i (1 − ψ_i) / (η_i (1 − η_i)).

Unfortunately:
i) it depends on the unknown classifiers' specificities and sensitivities!
ii) the likelihood function is not convex when ψ_i, η_i, y are all unknown
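As a sketch (my own illustration of the plug-in rule above, e.g. with spectral estimates ψ̂_i, η̂_i substituted for the true values):

import numpy as np

def weighted_ensemble(F, psi_hat, eta_hat):
    # F: m x n matrix of +-1 predictions; psi_hat, eta_hat: length-m estimates.
    eps = 1e-6
    p = np.clip(psi_hat, eps, 1 - eps)
    e = np.clip(eta_hat, eps, 1 - eps)
    log_alpha = np.log(p * e / ((1 - p) * (1 - e)))
    log_beta = np.log(p * (1 - p) / (e * (1 - e)))
    return np.sign(log_alpha @ F + log_beta.sum())   # sign of sum_i f_i(x) ln(alpha_i) + ln(beta_i)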

Iterative EM-type Solutions

[Dawid and Skene, 1979] Expectation-Maximization:

- Given a guess y of the n labels, estimate ψ_i, η_i of the m classifiers.

- Given the sensitivity and specificity estimates, construct an approximate MLE of the labels y.

Iteratively increase the likelihood until convergence; a rough sketch follows.
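A rough sketch of this iteration for the binary case (my own simplified version: soft labels and a uniform class prior; not the original Dawid-Skene implementation):

import numpy as np

def em_dawid_skene(F, iters=50):
    # F: m x n matrix of +-1 predictions.
    m, n = F.shape
    q = (F.mean(axis=0) + 1) / 2                          # init: majority vote as soft labels Pr[y_j = 1]
    for _ in range(iters):
        # M-step: re-estimate each classifier's sensitivity and specificity
        psi = ((F == 1) * q).sum(axis=1) / q.sum()
        eta = ((F == -1) * (1 - q)).sum(axis=1) / (1 - q).sum()
        psi, eta = np.clip(psi, 1e-6, 1 - 1e-6), np.clip(eta, 1e-6, 1 - 1e-6)
        # E-step: posterior label probabilities under the current parameters
        ll_pos = np.where(F == 1, np.log(psi)[:, None], np.log(1 - psi)[:, None]).sum(axis=0)
        ll_neg = np.where(F == -1, np.log(eta)[:, None], np.log(1 - eta)[:, None]).sum(axis=0)
        q = 1.0 / (1.0 + np.exp(ll_neg - ll_pos))
    return np.where(q > 0.5, 1, -1), psi, eta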

EM in practice

Widely used:

- Jin & Ghahramani, Learning with multiple labels, 2003.

- Raykar, Yu, Zhao et al., Learning from crowds, 2010.

- Whitehill, Ruvolo, Wu et al., Whose vote should count more..., 2009.

- Welinder, Branson, Belongie, Perona, Multidimensional wisdom of crowds, 2010.

- etc, etc.

Iterative EM solution

Key limitations:

Convergence guaranteed only to local maxima

Need a good initial guess of y

No performance guarantees on solution

Question: Can we do better?

Answer: Yes we can!

Simply plug our estimates ψ̂_i, η̂_i into the MLE.

Real Datasets

A novel, fully unsupervised ensemble learner.

SML = spectral meta-learner

Does the method work on real data? The independence assumptions are not likely to hold exactly.

Real Datasets

Ensemble of classifiers:

33 standard classification algorithms as implemented in Weka (kNN, decision trees, SVM, logistic regression, naive Bayes, etc.)

On various real datasets: split into training and test data. Each algorithm is trained on a separate subset of the whole training data.

Consistency Checks:
- Is λ₁(R)/Trace(R) close to one (i.e., is the matrix approximately rank one)? See the sketch after this list.
- Do the classifiers approximately make independent errors?
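A one-liner for the first check (my own sketch; R_hat denotes the diagonal-completed matrix from the earlier estimation step):

import numpy as np

def rank_one_score(R_hat):
    # Ratio of the largest eigenvalue to the trace; close to 1 => approximately rank one.
    return np.linalg.eigvalsh(R_hat)[-1] / np.trace(R_hat)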

Example:

Financial Stock Prediction: (NYSE)

x = (opening, closing, low, high price, and volume) × 9 days
y = high price on the 10th day

Prediction Goal:

(high_{day 10} − high_{day 9}) / high_{day 9} > 1.05

Examples:

[Figure: balanced accuracy (π) on the NYSE data, roughly in the range 0.45–0.60, comparing SML, iMLE and majority voting with the best inferred predictor and the ensemble median.]

Summary

- Presented a spectral analysis of unsupervised ranking and ensemble learning

- Key idea: exploit the structure of a hidden low-rank matrix

Future work / Open Questions:
- Regression and multi-class problems
- Instance difficulty
- Cartels
- Real applications
- Relation to random matrix theory
- Low-rank matrix recovery with missing data

Parisi et al., PNAS, 2014.
Jaffe, Nadler, Kluger, under review, 2014.
www.wisdom.weizmann.ac.il/∼nadler/

On the paper by Dawid and Skene

Tech. report by Dawid in 1972.

Dawid and Skene, Maximum likelihood estimation of observer error rates using the EM algorithm, JRSS-C, 1979.

Over 300 citations on Google Scholar.

Thank You / The End
