ReverseTesting: An Efficient Framework to Select Amongst Classifiers under Sample Selection Bias. Wei Fan (IBM T.J. Watson) and Ian Davidson (SUNY Albany).


Page 1:

ReverseTesting: An Efficient Framework to Select Amongst Classifiers under Sample Selection Bias

Wei Fan, IBM T.J. Watson
Ian Davidson, SUNY Albany

Page 2:

Where Does Sample Selection Bias Come From?

Universe of examples: the joint probability distribution P(x,y) = P(y|x) P(x). Data mining models this universe.

A sampling process draws the training data from the universe; an algorithm then builds a model from that training data.

Question: Is the training data a good sample of the universe?
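One common way to make "sampling process" precise, following the selection-variable notation of Zadrozny'04 (cited on Page 8), is sketched below; the variable s is not in the original slides and is added here for illustration.

    Let s = 1 when an example (x, y) is selected into the training set, so the
    training data is drawn from P(x, y | s = 1) rather than from P(x, y).

    No bias:        P(s=1 | x, y) = P(s=1)        selection independent of the example
    Feature bias:   P(s=1 | x, y) = P(s=1 | x)    selection depends on x only
    Class bias:     P(s=1 | x, y) = P(s=1 | y)    selection depends on y only
    Complete bias:  P(s=1 | x, y)                 selection depends on both x and y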

Page 3:

Universe of Examples

Two classes: red and green.

red: f2 > f1; green: f2 <= f1

Page 4:

Unbiased & Biased Samples

Rather unbiased sample: evenly distributed over the feature space.

Biased sample: less likely to sample points close to the decision boundary.
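A minimal sketch of this setup in Python/numpy follows. The uniform feature range, the sample sizes, and the distance-based rejection rule are assumptions chosen for illustration; the paper's exact data generator may differ.

    import numpy as np

    rng = np.random.default_rng(0)

    def label(X):
        # red (1) if f2 > f1, green (0) if f2 <= f1
        return (X[:, 1] > X[:, 0]).astype(int)

    # Rather unbiased sample: uniform over the feature space.
    X_unbiased = rng.uniform(0, 1, size=(1000, 2))
    y_unbiased = label(X_unbiased)

    # Biased sample: keep a candidate with probability proportional to its
    # distance from the boundary f2 = f1, so points near the boundary are
    # under-sampled (rejection sampling).
    points = []
    while len(points) < 1000:
        x = rng.uniform(0, 1, size=2)
        dist = abs(x[1] - x[0]) / np.sqrt(2)  # distance to the line f2 = f1
        if rng.uniform() < dist:
            points.append(x)
    X_biased = np.array(points)
    y_biased = label(X_biased)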

Page 5:

Single Decision Tree

Trained from unbiased sample: error = 2.9%. Trained from biased sample: error = 7.9%.

Page 6:

Random Decision Tree

Trained from unbiased sample: error = 3.1%. Trained from biased sample: error = 4.1%.

Page 7:

What can we observe? Sample selection bias does affect modeling: some techniques are more sensitive to bias than others, and model accuracy does suffer.

One important question: How do we choose the best classification algorithm, given a potentially biased dataset?

Page 8:

Ubiquitous Problem

Fundamental assumption: training data is an unbiased sample from the universe of examples. In practice this assumption is routinely violated:
• Catalogue marketing: purchase history is normally based only on each merchant's own data, which may not be representative of the population that may potentially purchase from the merchant.
• Drug testing
• Fraud detection
• Other examples (see Zadrozny'04 and Smith and Elkan'04)

Page 9:

Effect of Bias on Model Construction

• An inductive model estimates P(y|x,M), which has a non-trivial dependency on the constructed model M. Recall that P(y|x) is the true conditional probability, independent of any modeling technique. In general, P(y|x,M) != P(y|x).
• If the model M is the "correct model", i.e., P(y|x,M) = P(y|x), sample selection bias does not affect learning (Fan, Davidson, Zadrozny, and Yu'05). Otherwise, it does.
• Key issues: for real-world problems, we normally do not know the relationship between P(y|x,M) and P(y|x), and we have no exact idea where the bias comes from.
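Why the "correct model" is immune, at least under feature bias (selection depending on x only): the one-line derivation below is added for illustration and is consistent with, but not copied from, Fan et al.'05.

    P(y | x, s=1) = P(s=1 | x, y) P(y | x) / P(s=1 | x)
                  = P(y | x)    when P(s=1 | x, y) = P(s=1 | x)

So a learner that recovers the true P(y|x) reaches the same answer from the biased sample as from the unbiased universe; a learner whose P(y|x,M) != P(y|x) has no such guarantee.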

Page 10:

Re-Capping Our Focus

How do we choose the best classification algorithm, given a potentially biased dataset?
• No information on exactly how the data is biased.
• No information on whether the learners are affected by the bias.
• No information on the true model, P(y|x).

Page 11:

Failure of Traditional Methods

Given sample selection bias, cross-validation based methods are a bad indicator of which methods are the most accurate. Results come next.

Page 12:

ReverseTesting

Basic idea: use the testing data's feature vectors x to help order different models, even when their true labels y are not known.

Page 13:

Basic Procedure

1. Train algorithms A and B on the labeled training data, producing models MA and MB.
2. Use MA and MB to label the feature vectors of the test data, producing two labeled test sets, DA and DB.
3. Retrain both algorithms on each labeled test set: MAA (A trained on DA), MAB (A trained on DB), MBA (B trained on DA), and MBB (B trained on DB).
4. Estimate the relative performance of MA and MB from the order of MAA, MAB, MBA and MBB, evaluated on the labeled training data.

Page 14:

Rule

If A's labeled test data constructs more accurate models for both algorithms A and B, evaluated on the labeled training data, then A is expected to be more accurate:
• If MAA > MAB and MBA > MBB, then choose A.
• Similarly, if MAA < MAB and MBA < MBB, then choose B.
• Otherwise, undecided.
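The procedure and rule fit in a few lines of code. Below is a minimal sketch written against scikit-learn-style estimators; the function name, the use of clone and accuracy_score, and the choice of accuracy as the comparison metric are illustrative assumptions, not part of the original paper.

    from sklearn.base import clone
    from sklearn.metrics import accuracy_score

    def reverse_testing(algo_a, algo_b, X_train, y_train, X_test):
        """Return 'A', 'B', or 'undecided' per the ReverseTesting rule (sketch)."""
        # Step 1: train each algorithm on the (possibly biased) training data.
        m_a = clone(algo_a).fit(X_train, y_train)
        m_b = clone(algo_b).fit(X_train, y_train)

        # Step 2: label the test feature vectors with each model -> DA, DB.
        y_da = m_a.predict(X_test)
        y_db = m_b.predict(X_test)

        # Steps 3-4: retrain an algorithm on a labeled test set, then score
        # the resulting model on the labeled training data.
        def score(algo, y_labels):
            m = clone(algo).fit(X_test, y_labels)
            return accuracy_score(y_train, m.predict(X_train))

        s_aa = score(algo_a, y_da)  # MAA: A trained on DA
        s_ab = score(algo_a, y_db)  # MAB: A trained on DB
        s_ba = score(algo_b, y_da)  # MBA: B trained on DA
        s_bb = score(algo_b, y_db)  # MBB: B trained on DB

        # Step 5: the selection rule.
        if s_aa > s_ab and s_ba > s_bb:
            return "A"
        if s_aa < s_ab and s_ba < s_bb:
            return "B"
        return "undecided"

For example, reverse_testing(DecisionTreeClassifier(), KNeighborsClassifier(), X_train, y_train, X_test) would compare a decision tree against k-NN without ever touching the test labels.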

Page 15:

Heuristics of ReverseTesting

Assume that A is more accurate than B. Use both A's and B's labeled data to train two models. Training on A's data is then likely to produce a more accurate model than training on B's data.

Page 16:

Result Summary

[Bar chart, scale 0 to 50: for 10-fold CV, leave-one-out, and ReverseTesting, the total number of comparisons and the number of wrong selections.]

Page 17:

Why Doesn't CV Work?

[Figure: the biased training sample, annotated with a sparse region near the decision boundary.]

Page 18:

CV Under-Estimates Error in Sparse Regions

1. Examples in sparse regions are under-represented in CV's averaged results.
• Consider the examples near the decision boundary: a model that performs badly in these under-sampled regions is not accurately penalized by cross-validation.
2. CV can also create "biased folds" in these sparse regions.
• Its estimate on the biased region itself can therefore be unreliable.
3. CV gives no information on how a model behaves on feature vectors not represented in the training data.
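Point 1 can be made concrete with back-of-the-envelope numbers; the figures below are invented for illustration and do not come from the paper.

    # Assumed: a biased sample with 950 points in dense regions and only 50
    # near the boundary (the sparse region), where the model is much worse.
    n_dense, n_sparse = 950, 50
    err_dense, err_sparse = 0.02, 0.40

    # CV averages per-example errors, so a region contributes in proportion
    # to its (under-sampled) frequency in the training data:
    cv_estimate = (n_dense * err_dense + n_sparse * err_sparse) / (n_dense + n_sparse)
    print(cv_estimate)  # 0.039 -- the 40% boundary error is nearly invisible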

Page 19:

Decision Boundary of One Fold in 10-Fold CV

[Figure: the decision boundary learned from one fold vs. the boundary learned from the full training data.]

Page 20:

Desiderata in ReverseTesting

• Do not reduce the size of sparse regions, as 10-fold CV does.
• Do not use the training model, or anything close to it.
• Utilize feature vectors not present in the training dataset.

Page 21:

C4.5 Decision Boundary

[Figure: C4.5 boundaries learned from the training data, from the C4.5-labeled test data, and from the RDT-labeled test data. C4.5 can never learn such a model from the training data alone.]

Page 22:

RDT Decision Boundary

[Figure: RDT boundaries learned from the C4.5-labeled test data and from the RDT-labeled test data.]

Page 23:

Model Comparison

• The feature vectors in the testing data change the decision boundary: the model constructed by algorithm A from A's own labeled test data is not the original training model.
• A's inductive bias is represented in B's hypothesis space (and vice versa).
• The changed boundaries place more emphasis on the sparse regions, for both A and B re-trained on the two labeled test datasets.

Page 24:

Summary

• Sample selection bias is a ubiquitous problem for DM and ML in practice.
• For most applications and modeling techniques, sample selection bias does affect accuracy.
• Given sample selection bias, CV-based methods are bad at estimating the order of classifiers.
• ReverseTesting does a much better job.
• Future work: not only order the classifiers, but also estimate their accuracy.