Unsupervised Ranking and Ensemble Learning

or

Making good decisions when knowing nothing

Boaz Nadler

Department of Computer Science and Applied Mathematics, The Weizmann Institute of Science

Joint work with

Fabio Parisi, Francesco Strino and Yuval Kluger (Yale Medical School), and Ariel Jaffe (Weizmann)

Nov. 2014

Problem Setup

Consider a binary classification problem over an instance space X with output Y ∈ {−1,+1}.

Goal: construct a classifier with good generalization (small risk).

Typical Supervised Case:

Labeled training set {(x_i, y_i)}_{i=1}^n

Many methods to construct classifiers, and a well-understood theory.

Supervised Ranking - Multiple Classifiers

Multiple Classifiers:

We are given m classifiers, f_1, ..., f_m.

Each classifier was constructed with its own training data, assumptions, design principles, etc.

Two Key Questions:

- Rank: find the most accurate classifier

- Combine them into a more accurate meta-classifier (ensemble learner)

Supervised Ranking and Ensemble Learning

Standard Approach:

Set aside an independent set of labeled validation data

Rank classifiers by empirical accuracies on this labeled set.

Many methods to construct ensemble learners [bagging, boosting, etc.]

Central Question in this Talk:

Given the predictions of m classifiers on a large test set,

can we rank them and construct a more accurate meta-classifier

without any labeled data?

Motivating Example I:

Consider an investor intending to trade n stocks.

He gets advice from m entities (sell/buy on each of the n stocks).

Entities = friends, professional investment houses, his mother (in-law), the WSJ, etc.

However, our investor knows nothing about the reliability of the advisors.

Questions:

Can our investor find out who is the most reliable advisor?

How should he combine the (possibly conflicting) advice of the m entities?

Motivating Example II

A biologist wishes to know where, along a given long DNA string, the protein binding sites are.

Common Approach: apply tens of different peak-detection algorithms that predict binding sites.

Each algorithm was derived by a separate lab, using proprietary data and employing different design principles and biological knowledge.

How should our biologist rank these algorithms? How should she combine them to get a more accurate prediction?

Applications

Common theme to both examples: we are given the predictions or recommendations of multiple advisers of unknown reliability.

No labeled data (difficult/expensive to obtain, or known only in the future).

This scenario appears in a broad range of applications:

- decision science

- crowdsourcing

- medicine

- grant application review panels

- etc.

Previous Works / Unsupervised Ensemble Learning

- Majority Voting (highly sub-optimal)

- Bayesian Approaches: Reaching a Consensus [DeGroot, '74]

- Maximum Likelihood Estimation via EM [Dawid and Skene, '79] [many follow-up works]

- Spectral Methods in Crowdsourcing [Karger et al., 2010]

Our Contribution

Present a novel, simple spectral analysis of this problem.

Reveal low-dimensional structure in this high-dimensional data.

Insights:

- Standard independence assumptions between classifier errors → the off-diagonal of the population covariance matrix of the classifiers has a rank-one structure

- Entries of the eigenvector of this rank-one matrix ∝ balanced accuracies of the classifiers

- Allows ranking of the classifiers

- Allows consistent estimation of their parameters (sensitivity, specificity)

- Derive a novel unsupervised ensemble learner

Statistical Formulation

Binary Classification Problem:

- instance space X (typically R^d)
- output space Y = {−1,+1}
- probability density p(x, y), with marginals p_X(x) and p_Y(y)

Binary Classifier: a function f : X → Y

Classifier Quality / Performance

Sensitivity: ψ = Pr[f(X) = 1 | Y = 1] = fraction of true positives

Specificity: η = Pr[f(X) = −1 | Y = −1] = fraction of true negatives

Balanced Accuracy:

π = (ψ + η) / 2

A common quality measure in the presence of class imbalance.

It will arise as a natural measure in our setup!
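For concreteness, a minimal sketch (my own illustration, not part of the talk) of how these quantities are computed from ±1 predictions and labels:

import numpy as np

def classifier_quality(pred, y):
    # pred, y: arrays with entries in {-1, +1}
    psi = np.mean(pred[y == 1] == 1)     # sensitivity  Pr[f(X)=1 | Y=1]
    eta = np.mean(pred[y == -1] == -1)   # specificity  Pr[f(X)=-1 | Y=-1]
    pi = 0.5 * (psi + eta)               # balanced accuracy
    return psi, eta, pi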

Problem Setup

D = {x_k} ⊂ X: unlabeled test data of n i.i.d. instances drawn from p_X(x).

f_1, ..., f_m: an ensemble of m classifiers of unknown accuracy.

Questions: Given the m × n matrix of all classifiers' predictions, {f_i(x_k)}, i = 1, ..., m, k = 1, ..., n, can one

- rank the m classifiers?

- combine their predictions into an even more accurate meta-classifier?

Key Point: perform the above without any labeled data!

Assumption

Conditionally Independent Predictors:

As in supervised ensemble methods, assume the errors made by one classifier are independent of those made by the others:

for all a_i, a_j, y ∈ {−1, 1},

Pr[f_i(X) = a_i, f_j(X) = a_j | Y = y] = Pr[f_i(X) = a_i | Y = y] · Pr[f_j(X) = a_j | Y = y]

The population covariance matrix

Let Q be the m × m population covariance matrix between classifiers,

q_ij = E[(f_i(X) − µ_i)(f_j(X) − µ_j)],   where µ_i = E[f_i(X)].

Lemma: The entries of Q are

q_ij = 1 − µ_i²                          if i = j
q_ij = (2π_i − 1)(2π_j − 1)(1 − b²)      otherwise

where b ∈ (−1, 1) is the class imbalance,

b = Pr[Y = 1] − Pr[Y = −1].
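A small numerical sanity check of the lemma (my own illustrative sketch, with made-up sensitivities and specificities):

import numpy as np

rng = np.random.default_rng(0)
m, n, b = 5, 200_000, 0.3                 # classifiers, samples, class imbalance
psi = rng.uniform(0.6, 0.9, m)            # sensitivities
eta = rng.uniform(0.6, 0.9, m)            # specificities
pi = 0.5 * (psi + eta)                    # balanced accuracies

y = np.where(rng.random(n) < (1 + b) / 2, 1, -1)   # labels with Pr[Y=1] - Pr[Y=-1] = b
F = np.empty((m, n))
for i in range(m):                        # conditionally independent predictions
    p_correct = np.where(y == 1, psi[i], eta[i])
    F[i] = np.where(rng.random(n) < p_correct, y, -y)

Q = np.cov(F)                             # sample covariance, m x m
print(Q[0, 1], (2*pi[0] - 1) * (2*pi[1] - 1) * (1 - b**2))   # should nearly agree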

Unsupervised Ranking of Classifiers

Corollary: The off-diagonal entries of Q correspond to a rank-one matrix R = λ v vᵀ, where

λ = (1 − b²) Σ_{j=1}^m (2π_j − 1)²

Importantly, up to a ±1 sign ambiguity,

v_j ∝ (2π_j − 1)

Key Result: If we can consistently estimate the rank-one matrix R, then the classifiers can be consistently ranked by the entries of this eigenvector.

Estimating Rank-One Matrix R

Sample Covariance Matrix:

q̂_ij = 1/(n − 1) Σ_{k=1}^n (f_i(x_k) − µ̂_i)(f_j(x_k) − µ̂_j)

where µ̂_j = (1/n) Σ_k f_j(x_k).

By definition, for i ≠ j, E[q̂_ij] = q_ij = r_ij.

We only need to estimate the diagonal entries of R and then perform an eigendecomposition to compute the leading eigenvector.

Estimating Rank-One Matrix R

This is a low-rank matrix-completion problem.

Several computationally efficient methods to solve it:

- Semi-definite programming
- Least-squares methods to estimate the diagonal of R

From the estimate R̂, compute the leading eigenvector v̂.

Rank the m classifiers by sorting the entries of v̂; a sketch appears below.
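A minimal sketch of this pipeline (my own illustration; an iterative diagonal-completion heuristic rather than the SDP or least-squares variants mentioned above):

import numpy as np

def rank_one_diagonal_fill(Q, iters=50):
    # Q: sample covariance of the m classifiers' +-1 predictions.
    # Iteratively fill the diagonal so the matrix is (approximately) rank one.
    R = Q.copy()
    for _ in range(iters):
        w, U = np.linalg.eigh(R)
        lam, u = w[-1], U[:, -1]          # leading eigenpair
        np.fill_diagonal(R, lam * u**2)   # diagonal consistent with lam * u u^T
    return R, lam, u

# Usage (F is the m x n matrix of +-1 predictions):
# Q = np.cov(F)
# R_hat, lam, v = rank_one_diagonal_fill(Q)
# v = v if v.sum() > 0 else -v            # fix the sign, assuming most classifiers beat random
# ranking = np.argsort(-v)                # most accurate classifier first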

Properties of Solution

Asymptotic Consistency: As the unlabeled set size n → ∞, R̂ → R and consequently v̂ → v.

If the class imbalance b is known, the method also gives consistent estimates of the π_i.

Estimating ψ_i, η_i:

µ = E[f] = (ψ − η) + b(2π − 1)

Hence, given b and v, solve a system of 2 linear equations in the 2 unknowns (ψ, η); the explicit solution is given below.
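Explicitly (a short derivation from the two relations above, using the scaling v_i = 2π_i − 1): the two equations are ψ_i + η_i = 1 + v_i and ψ_i − η_i = µ_i − b·v_i, so

ψ_i = [1 + (1 − b)·v_i + µ_i] / 2,    η_i = [1 + (1 + b)·v_i − µ_i] / 2.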

Properties of Solution

Stability for Finite Sample Size n:

Fluctuations are

Q̂ − Q = O_P(1/√n)

Via a matrix perturbation approach, similar to [N., AoS '08],

v̂ − v = O_P( (1/λ) · (1/√n) )

Remark: If all m classifiers are better than random, the spectral gap satisfies λ = O(m), and

ψ̂_i − ψ_i, η̂_i − η_i = O_P(1/√n)

Estimating Class Imbalance

Look at the 3-D joint covariance of triplets of classifiers.

Lemma: If the classifiers make independent errors in triplets, the off-diagonal part of this 3-D tensor is rank one,

E[(f_i(X) − µ_i)(f_j(X) − µ_j)(f_k(X) − µ_k)] = α(b) · v_i v_j v_k

where

α(b) = −2b / √(1 − b²)

Hence: compute the 3-D tensor; via least squares, estimate the single scalar α(b); invert to estimate b (a sketch appears below).

Asymptotically consistent, and no tensor decomposition is needed.
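A minimal sketch of this step (my own illustration; v is assumed scaled so that R ≈ v vᵀ, i.e. √λ times the unit leading eigenvector):

import numpy as np

def estimate_class_imbalance(F, v):
    # F: m x n matrix of +-1 predictions; v: eigenvector scaled so that R ~ v v^T.
    G = F - F.mean(axis=1, keepdims=True)                  # centered predictions
    T = np.einsum('in,jn,kn->ijk', G, G, G) / F.shape[1]   # empirical 3-way moment tensor
    m = F.shape[0]
    num = den = 0.0
    for i in range(m):                                     # least squares over distinct triplets
        for j in range(i + 1, m):
            for k in range(j + 1, m):
                p = v[i] * v[j] * v[k]
                num += T[i, j, k] * p
                den += p * p
    alpha = num / den                                      # fit T_ijk ~ alpha * v_i v_j v_k
    return -alpha / np.sqrt(4 + alpha**2)                  # invert alpha(b) = -2b / sqrt(1 - b^2)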

Learning Mixtures of Product Distributions

If we assume all classifiers jointly make independent errors, the problem is equivalent to learning a mixture of discrete product distributions.

Each column vector of length m comes from one of two distributions, according to the unobserved latent variable y (the class label).

[Freund and Mansour '99]; several computationally efficient methods have been developed for this; recently, via tensor decompositions [Anandkumar, Hsu and Kakade '12].

Our method provides an elegant and much simpler approach for the binary case.

PART II: How to combine the m classifiers?

Maximum Likelihood

Unknowns:
- ψ_i, η_i = sensitivities and specificities of the m classifiers
- y = (y_1, ..., y_n) = true labels of the n instances

Common Approach: look for ψ_i, η_i, y that maximize the likelihood.

Given the assumption of independence of the errors of different classifiers, for an instance x,

L(f_1(x), ..., f_m(x) | y, {ψ_i, η_i}) = ∏_{i=1}^m Pr[f_i(x) | y, ψ_i, η_i]

Assuming the instances are i.i.d.,

L({f_i(x_j)} | y, {ψ_i, η_i}) = ∏_{j=1}^n L({f_i(x_j)} | y_j, {ψ_i, η_i})

Maximum Likelihood Solution

If we knew the sensitivities and specificities of the m classifiers:

Lemma: The MLE is a weighted linear ensemble classifier,

ŷ^(ML)(x) = sign( Σ_i [ f_i(x) ln α_i + ln β_i ] )

where

α_i = ψ_i η_i / ((1 − ψ_i)(1 − η_i)),    β_i = ψ_i (1 − ψ_i) / (η_i (1 − η_i)).

Unfortunately:
i) it depends on the unknown classifiers' specificities and sensitivities!
ii) the likelihood function is not convex when ψ_i, η_i, y are all unknown
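As a sketch (my own illustration of the plug-in rule above, e.g. with spectral estimates ψ̂_i, η̂_i substituted for the true values):

import numpy as np

def weighted_ensemble(F, psi_hat, eta_hat):
    # F: m x n matrix of +-1 predictions; psi_hat, eta_hat: length-m estimates.
    eps = 1e-6
    p = np.clip(psi_hat, eps, 1 - eps)
    e = np.clip(eta_hat, eps, 1 - eps)
    log_alpha = np.log(p * e / ((1 - p) * (1 - e)))
    log_beta = np.log(p * (1 - p) / (e * (1 - e)))
    return np.sign(log_alpha @ F + log_beta.sum())   # sign of sum_i f_i(x) ln(alpha_i) + ln(beta_i)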

Iterative EM-type Solutions

[Dawid and Skene, 1979] Expectation-Maximization:

- Given a guess y of the n labels, estimate ψ_i, η_i of the m classifiers.

- Given the sensitivity and specificity estimates, construct an approximate MLE of the labels y.

Iteratively increase the likelihood until convergence; a rough sketch follows.
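A rough sketch of this iteration for the binary case (my own simplified version: soft labels and a uniform class prior; not the original Dawid-Skene implementation):

import numpy as np

def em_dawid_skene(F, iters=50):
    # F: m x n matrix of +-1 predictions.
    m, n = F.shape
    q = (F.mean(axis=0) + 1) / 2                          # init: majority vote as soft labels Pr[y_j = 1]
    for _ in range(iters):
        # M-step: re-estimate each classifier's sensitivity and specificity
        psi = ((F == 1) * q).sum(axis=1) / q.sum()
        eta = ((F == -1) * (1 - q)).sum(axis=1) / (1 - q).sum()
        psi, eta = np.clip(psi, 1e-6, 1 - 1e-6), np.clip(eta, 1e-6, 1 - 1e-6)
        # E-step: posterior label probabilities under the current parameters
        ll_pos = np.where(F == 1, np.log(psi)[:, None], np.log(1 - psi)[:, None]).sum(axis=0)
        ll_neg = np.where(F == -1, np.log(eta)[:, None], np.log(1 - eta)[:, None]).sum(axis=0)
        q = 1.0 / (1.0 + np.exp(ll_neg - ll_pos))
    return np.where(q > 0.5, 1, -1), psi, eta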

EM in practice

Widely used:

- Jin & Ghahramani, Learning with multiple labels, 2003.

- Raykar, Yu, Zhao et al., Learning from crowds, 2010.

- Whitehill, Ruvolo, Wu et al., Whose vote should count more..., 2009.

- Welinder, Branson, Belongie, Perona, Multidimensional wisdom of crowds, 2010.

- etc, etc.

Iterative EM solution

Key limitations:

Convergence guaranteed only to local maxima

Need a good initial guess of y

No performance guarantees on solution

Question: Can we do better?

Answer: Yes we can!

Simply plug our estimates ψ̂_i, η̂_i into the MLE.

Real Datasets

A novel, fully unsupervised ensemble learner.

SML = spectral meta-learner

Does the method work on real data? The independence assumptions are not likely to hold exactly.

Real Datasets

Ensemble of classifiers:

33 standard classification algorithms as implemented in Weka (kNN, decision trees, SVM, logistic regression, naive Bayes, etc.)

On various real datasets: split into training and test data. Each algorithm is trained on a separate subset of the whole training data.

Consistency Checks:
- Is λ₁(R)/Trace(R) close to one (i.e., is the matrix approximately rank one)? See the sketch after this list.
- Do the classifiers approximately make independent errors?
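A one-liner for the first check (my own sketch; R_hat denotes the diagonal-completed matrix from the earlier estimation step):

import numpy as np

def rank_one_score(R_hat):
    # Ratio of the largest eigenvalue to the trace; close to 1 => approximately rank one.
    return np.linalg.eigvalsh(R_hat)[-1] / np.trace(R_hat)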

Example:

Financial Stock Prediction: (NYSE)

x = (opening, closing, low, high price, and volume) × 9 days
y = high price on the 10th day

Prediction Goal:

(high_{day 10} − high_{day 9}) / high_{day 9} > 1.05

Examples:

[Figure: balanced accuracy (π) on the NYSE data, roughly in the range 0.45–0.60, comparing SML, iMLE and majority voting with the best inferred predictor and the ensemble median.]

Summary

- Presented a spectral analysis of unsupervised ranking and ensemble learning

- Key idea: exploit the structure of a hidden low-rank matrix

Future work / Open Questions:
- Regression and multi-class problems
- Instance difficulty
- Cartels
- Real applications
- Relation to random matrix theory
- Low-rank matrix recovery with missing data

Parisi et al., PNAS, 2014.
Jaffe, Nadler, Kluger, under review, 2014.
www.wisdom.weizmann.ac.il/∼nadler/

On the paper by Dawid and Skene

Tech. report by Dawid in 1972.

Dawid and Skene, Maximum likelihood estimation of observer error rates using the EM algorithm, JRSS-C, 1979.

Over 300 citations on Google Scholar.

Thank You / The End
