
A Framework to Adjust Dependency Measure Estimates for Chance



Page 1: A Framework to Adjust Dependency Measure Estimates for Chance


SDM 2016 – May 6th 2016

A Framework to Adjust Dependency Measure Estimates for Chance

Simone Romano

[email protected]

@ialuronico

Nguyen Xuan Vinh, James Bailey, Karin Verspoor

(We won the Best Paper Award!)

Department of Computing and Information Systems, The University of Melbourne, Victoria, Australia

I will soon start working as an applied scientist in London, UK.


Page 2: A Framework to Adjust Dependency Measure Estimates for Chance


Motivation

Adjustment for Quantification

Adjustment for Ranking

Conclusions


Page 3: A Framework to Adjust Dependency Measure Estimates for Chance


Dependency Measures

A dependency measure D is used to assess the amount of dependency between variables:

Example 1: After collecting weight and height for many people, we can compute D(weight, height)

Example 2: assess the amount of dependency between search queries in Google

https://www.google.com/trends/correlate/

They are fundamental for a number of applications in machine learning / data mining.


Page 4: A Framework to Adjust Dependency Measure Estimates for Chance


Applications of Dependency Measures

Supervised learning

- Feature selection [Guyon and Elisseeff, 2003];

- Decision tree induction [Criminisi et al., 2012];

- Evaluation of classification accuracy [Witten et al., 2011].

Unsupervised learning

- External clustering validation [Strehl and Ghosh, 2003];

- Generation of alternative or multi-view clusterings [Muller et al., 2013, Dang and Bailey, 2015];

- Exploration of the clustering space using results from the Meta-Clustering algorithm [Caruana et al., 2006, Lei et al., 2014].

Exploratory analysis

- Inference of biological networks [Reshef et al., 2011, Villaverde et al., 2013];

- Analysis of neural time-series data [Cohen, 2014].


Page 5: A Framework to Adjust Dependency Measure Estimates for Chance


Motivation for Adjustment for Quantification

Pearson's correlation between two variables X and Y estimated on a data sample Sn = {(xk, yk)} of n data points:

r(S_n \mid X, Y) \triangleq \frac{\sum_{k=1}^{n}(x_k - \bar{x})(y_k - \bar{y})}{\sqrt{\sum_{k=1}^{n}(x_k - \bar{x})^2 \, \sum_{k=1}^{n}(y_k - \bar{y})^2}} \qquad (1)

Figure: Scatter plots and their Pearson correlation coefficients (1, 0.8, 0.4, 0, -0.4, -0.8, -1), from https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient

r²(Sn|X, Y) can be used as a proxy for the amount of noise in linear relationships:

- 1 if noiseless
- 0 if complete noise
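To make Eq. (1) concrete, here is a minimal numpy sketch of r and r²; the function name pearson_r2 and the noisy-line example are illustrative, not code from the paper.

```python
import numpy as np

def pearson_r2(x, y):
    """Squared Pearson correlation r^2(Sn | X, Y), following Eq. (1)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    r = np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2))
    return r ** 2

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
print(pearson_r2(x, 2 * x + 0.1 * rng.normal(size=100)))  # near-noiseless line -> close to 1
print(pearson_r2(x, rng.normal(size=100)))                # complete noise -> close to 0
```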


Page 6: A Framework to Adjust Dependency Measure Estimates for Chance


The Maximal Information Coefficient (MIC) was published in Science [Reshef et al., 2011] and has ≈ 570 citations to date according to Google Scholar.

MIC(X, Y) can be used as a proxy for the amount of noise in functional relationships:

Figure: From the online supplementary material of [Reshef et al., 2011]

MIC should be equal to:

- 1 if the relationship between X and Y is functional and noiseless

- 0 if there is complete noise


Page 7: A Framework to Adjust Dependency Measure Estimates for Chance


Challenge

Nonetheless, its estimation is challenging on a finite data sample Sn of n data points.

We simulate 10,000 fully noisy relationships between X and Y on 20 and 80 data points:

Figure: Distribution of MIC(S20|X, Y) and MIC(S80|X, Y) over the 10,000 simulations.

Values can be high because of chance! The user expects values close to 0 in both cases.

Challenge: Adjust the estimated MIC to better exploit the range [0, 1]
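The simulation on this slide can be reproduced with a short script; the sketch below assumes the third-party minepy package as the MIC estimator (the slides do not say which implementation was used), so treat it as illustrative.

```python
import numpy as np
from minepy import MINE  # assumption: MIC estimator from the minepy package

def mic_null_samples(n, trials, seed=0):
    """MIC(Sn | X, Y) for `trials` fully noisy (independent) relationships."""
    rng = np.random.default_rng(seed)
    mine = MINE(alpha=0.6, c=15)          # defaults from Reshef et al. [2011]
    values = np.empty(trials)
    for t in range(trials):
        x, y = rng.normal(size=n), rng.normal(size=n)
        mine.compute_score(x, y)
        values[t] = mine.mic()
    return values

# Even under complete noise, MIC is far from 0, and more so at small n
print(mic_null_samples(20, trials=1000).mean())
print(mic_null_samples(80, trials=1000).mean())
```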


Page 8: A Framework to Adjust Dependency Measure Estimates for Chance


Adjustment for Chance

We define a framework for adjustment:

Adjustment for Quantification

A_D \triangleq \frac{D - E[D_0]}{\max D - E[D_0]}

It uses the distribution D0 of the measure under independent variables:

- r²0 follows a Beta distribution;
- MIC0 can be computed using Monte Carlo permutations.

Used in κ-statistics. Its application is beneficial to other dependency measures:

- Adjusted r² ⇒ Ar²

- Adjusted MIC ⇒ AMIC
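A minimal sketch of the quantification adjustment, estimating E[D0] with Monte Carlo permutations of one variable and taking max D = 1 (as for r² and MIC); the helper names are illustrative. For r² the null expectation is also available analytically through the Beta distribution, so permutations are shown here only for generality.

```python
import numpy as np

def adjust_for_quantification(D, x, y, dep_measure, n_perm=200, d_max=1.0, seed=0):
    """A_D = (D - E[D0]) / (max D - E[D0]); E[D0] is estimated by permuting y,
    which simulates independence between X and Y."""
    rng = np.random.default_rng(seed)
    null_vals = [dep_measure(x, rng.permutation(y)) for _ in range(n_perm)]
    e_d0 = float(np.mean(null_vals))
    return (D - e_d0) / (d_max - e_d0)

def r2(x, y):
    return float(np.corrcoef(x, y)[0, 1] ** 2)

# Complete noise on a small sample: raw r^2 is inflated by chance, Ar^2 is close to 0
rng = np.random.default_rng(1)
x, y = rng.normal(size=20), rng.normal(size=20)
D = r2(x, y)
print("r2 =", D, " Ar2 =", adjust_for_quantification(D, x, y, r2))
```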


Page 9: A Framework to Adjust Dependency Measure Estimates for Chance


Adjusted measures enable better interpretability

Task: Obtain 1 for a noiseless relationship, and 0 for complete noise (on average).

Noise   0%    20%    40%    60%    80%     100%
r²      1     0.66   0.39   0.2    0.073   0.035
Ar²     1     0.65   0.37   0.17   0.044   0.00046

Figure: Ar² becomes zero on average at 100% noise: r² = 0.035 vs Ar² = 0.00046.

Noise   0%    20%    40%    60%    80%     100%
MIC     1     0.7    0.47   0.34   0.27    0.26
AMIC    1     0.6    0.29   0.11   0.021   0.0014

Figure: AMIC becomes zero on average at 100% noise: MIC = 0.26 vs AMIC = 0.0014.


Page 10: A Framework to Adjust Dependency Measure Estimates for Chance


Not biased towards small sample size n

Average value of D for different % of noise ⇒ estimates can be high because of chance at small n (e.g. because of missing values)

Figure: Average value vs. noise level (0–100%) for Raw r² and Ar² (Adjusted) with n = 10, 20, 30, 40, 100, 200, and for Raw MIC and AMIC (Adjusted) with n = 20, 40, 60, 80.



Page 12: A Framework to Adjust Dependency Measure Estimates for Chance


Motivation

Adjustment for Quantification

Adjustment for Ranking

Conclusions


Page 13: A Framework to Adjust Dependency Measure Estimates for Chance


Motivation for Adjustment for Ranking

Say that we want to predict the risk of cancer C using the equally unpredictive variables X1 and X2, defined as follows:

- X1 ≡ patient had breakfast today, X1 = {yes, no};
- X2 ≡ patient eye color, X2 = {green, blue, brown};

Figure: Data grouped by X1 (yes / no) and by X2 (green / blue / brown).

Problem: When ranking variables, dependency measures are biased towards the selection of variables with many categories.

This still happens because of finite samples!



Page 15: A Framework to Adjust Dependency Measure Estimates for Chance


Selection bias experiment

Experiment: n = 100 data points, class C with 2 categories:

- Generate a variable X1 with 2 categories (independently from C)

- Generate a variable X2 with 3 categories (independently from C)

Compute Gini(X1, C) and Gini(X2, C).

Give a win to the variable that gets the highest value.

REPEAT 10,000 times

Figure: Probability of selection for X1 and X2.

Result: X2 gets selected 70% of the time. (Bad)

Given that they are equally unpredictive, we expected 50%.

Challenge: adjust the estimated Gini gain to obtain an unbiased ranking
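A minimal sketch reproducing this experiment; the Gini gain helper follows the standard definition (class impurity minus the weighted impurity after splitting) and is not code from the paper.

```python
import numpy as np

def gini_impurity(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_gain(x, c):
    """Gini gain of splitting the class labels c on the categorical variable x."""
    gain = gini_impurity(c)
    for v in np.unique(x):
        mask = (x == v)
        gain -= mask.mean() * gini_impurity(c[mask])
    return gain

rng = np.random.default_rng(0)
n, trials, wins_x2 = 100, 10_000, 0
for _ in range(trials):
    c = rng.integers(0, 2, size=n)    # class C with 2 categories
    x1 = rng.integers(0, 2, size=n)   # X1: 2 categories, independent of C
    x2 = rng.integers(0, 3, size=n)   # X2: 3 categories, independent of C
    wins_x2 += gini_gain(x2, c) > gini_gain(x1, c)

print("X2 wins in", 100 * wins_x2 / trials, "% of runs")  # around 70%, not the expected 50%
```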



Page 20: A Framework to Adjust Dependency Measure Estimates for Chance


Adjustment for Ranking

We propose two adjustments for ranking:

Standardization

S_D \triangleq \frac{D - E[D_0]}{\sqrt{\mathrm{Var}(D_0)}}

Quantifies statistical significance like a p-value

Adjustment for Ranking

A_D(\alpha) \triangleq D - q_0(1 - \alpha)

Penalizes based on statistical significance, controlled by α

q0(1 − α) is the (1 − α) quantile of the distribution D0

(smaller α means more penalization)
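A minimal sketch of both ranking adjustments, assuming the null distribution D0 has already been sampled (e.g. by recomputing D over Monte Carlo permutations of one variable); the function name and the synthetic null values are illustrative.

```python
import numpy as np

def ranking_adjustments(D, null_samples, alpha=0.05):
    """Standardization S_D = (D - E[D0]) / sqrt(Var(D0)) and
    A_D(alpha) = D - q0(1 - alpha), where q0(1 - alpha) is the
    (1 - alpha) quantile of the null distribution D0."""
    null_samples = np.asarray(null_samples, dtype=float)
    s_d = (D - null_samples.mean()) / null_samples.std(ddof=1)
    a_d = D - np.quantile(null_samples, 1 - alpha)
    return s_d, a_d

# Made-up null distribution standing in for permutation estimates of D0
rng = np.random.default_rng(0)
null = rng.beta(2, 30, size=1000)
print(ranking_adjustments(0.25, null, alpha=0.05))  # smaller alpha penalizes D more
```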


Page 21: A Framework to Adjust Dependency Measure Estimates for Chance


Standardized Gini (SGini) corrects for selection bias

Select unpredictive features X1 with 2 categories and X2 with 3 categories.

Figure: Probability of selection for X1 and X2 using SGini.

Experiment: X1 and X2 each get selected on average almost 50% of the time. (Good)

Being similar to a p-value, this is consistent with the literature on decision trees [Frank and Witten, 1998, Dobra and Gehrke, 2001, Hothorn et al., 2006, Strobl et al., 2007].

Nonetheless, we found that this is a simplistic scenario.


Page 22: A Framework to Adjust Dependency Measure Estimates for Chance


Standardized Gini (SGini) might be biased

Fix the predictiveness of features X1 and X2 to a constant ≠ 0

Figure: Probability of selection for X1 and X2 using SGini when both are equally predictive.

Experiment: SGini becomes biased towards X1 because it is more statistically significant. (Bad)

This behavior has been overlooked in the decision tree community

Use AD(α) to penalize less or even tune the bias!

⇒ AGini(α)



Page 24: A Framework to Adjust Dependency Measure Estimates for Chance


Application to random forest

Why random forest? It is a good classifier to try first when there are "meaningful" features [Fernandez-Delgado et al., 2014].

Plug in different splitting criteria

Experiment: 19 data sets with categorical variables

Figure: Mean AUC as a function of α for AGini(α), compared with SGini and Gini, using the same α for all data sets.

And α can be tuned for each data set with cross-validation.


Page 25: A Framework to Adjust Dependency Measure Estimates for Chance


Motivation

Adjustment for Quantification

Adjustment for Ranking

Conclusions


Page 26: A Framework to Adjust Dependency Measure Estimates for Chance


Conclusion - Message

Dependency estimates can be high because of chance under finite samples.

Adjustments can help for:

Quantification, to have an interpretable value in [0, 1]

Ranking, to avoid biases towards:

- missing values

- categorical variables with more categories

Future Work: Adjust dependency measures between multiple variables D(X1, ..., Xd), because of bias towards large d.


Page 27: A Framework to Adjust Dependency Measure Estimates for Chance


Thank you.

Questions?

Simone Romano

[email protected]

@ialuronico

Code available online:

https://github.com/ialuronico


Page 28: A Framework to Adjust Dependency Measure Estimates for Chance


References I

Caruana, R., Elhawary, M., Nguyen, N., and Smith, C. (2006). Meta clustering. In Data Mining, 2006. ICDM'06. Sixth International Conference on, pages 107–118. IEEE.

Cohen, M. X. (2014). Analyzing neural time series data: theory and practice. MIT Press.

Criminisi, A., Shotton, J., and Konukoglu, E. (2012). Decision forests: A unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning. Foundations and Trends in Computer Graphics and Vision, 7(2-3):81–227.

Dang, X. H. and Bailey, J. (2015). A framework to uncover multiple alternative clusterings. Machine Learning, 98(1-2):7–30.

Dobra, A. and Gehrke, J. (2001). Bias correction in classification tree construction. In ICML, pages 90–97.

Fernandez-Delgado, M., Cernadas, E., Barro, S., and Amorim, D. (2014). Do we need hundreds of classifiers to solve real world classification problems? The Journal of Machine Learning Research, 15(1):3133–3181.


Page 29: A Framework to Adjust Dependency Measure Estimates for Chance


References II

Frank, E. and Witten, I. H. (1998). Using a permutation test for attribute selection in decision trees. In ICML, pages 152–160.

Guyon, I. and Elisseeff, A. (2003). An introduction to variable and feature selection. The Journal of Machine Learning Research, 3:1157–1182.

Hothorn, T., Hornik, K., and Zeileis, A. (2006). Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics, 15(3):651–674.

Lei, Y., Vinh, N. X., Chan, J., and Bailey, J. (2014). Filta: Better view discovery from collections of clusterings via filtering. In Machine Learning and Knowledge Discovery in Databases, pages 145–160. Springer.

Muller, E., Gunnemann, S., Farber, I., and Seidl, T. (2013). Discovering multiple clustering solutions: Grouping objects in different views of the data. Tutorial at ICML.

Reshef, D. N., Reshef, Y. A., Finucane, H. K., Grossman, S. R., McVean, G., Turnbaugh, P. J., Lander, E. S., Mitzenmacher, M., and Sabeti, P. C. (2011). Detecting novel associations in large data sets. Science, 334(6062):1518–1524.


Page 30: A Framework to Adjust Dependency Measure Estimates for Chance


References III

Strehl, A. and Ghosh, J. (2003). Cluster ensembles—a knowledge reuse framework for combining multiple partitions. The Journal of Machine Learning Research, 3:583–617.

Strobl, C., Boulesteix, A.-L., and Augustin, T. (2007). Unbiased split selection for classification trees based on the gini index. Computational Statistics & Data Analysis, 52(1):483–501.

Villaverde, A. F., Ross, J., and Banga, J. R. (2013). Reverse engineering cellular networks with information theoretic methods. Cells, 2(2):306–329.

Witten, I. H., Frank, E., and Hall, M. A. (2011). Data Mining: Practical Machine Learning Tools and Techniques. 3rd edition.
