CSI5388: Functional Elements of Statistics for Machine Learning
Part II


Contents of the Lecture

Part I (the previous set of lecture notes):
• Definition and Preliminaries
• Hypothesis Testing: Parametric Approaches

Part II (this set of lecture notes):
• Hypothesis Testing: Non-Parametric Approaches
• Power of a Test
• Statistical Tests for Comparing Multiple Classifiers

Non-parametric Approaches to Hypothesis Testing

The hypothesis testing procedures discussed in the previous lecture are called parametric.

This means that they are based on assumptions regarding the distribution of the population on which the test is run, and they rely on the estimation of parameters of these distributions.

In our case, we assumed that the distributions were either normal or followed a Student t distribution. The parameters we estimated were the mean and the variance.

The problem we now turn to is hypothesis testing that is not based on assumptions regarding the distribution and does not rely on the estimation of parameters.

The Different Types of Non-parametric Hypothesis Testing Approaches I

There are two important families of tests that do not involve distributional assumptions and parameter estimation:

• Nonparametric tests, which rely on ranking the data and performing a statistical test on the ranks.

• Resampling statistics, which consist of drawing samples repeatedly from a population and evaluating the distribution of the result. Resampling statistics will be discussed in the next lecture.

The Different Types of Non-parametric Hypothesis Testing Approaches II

Nonparametric tests are quite useful for populations in which outliers skew the distribution too much. Ranking eliminates this problem. However, they are typically less powerful (see further) than parametric tests.

Resampling statistics are useful when the statistic of interest cannot be derived analytically (e.g., statistics about the median of a population) unless we assume a normal distribution.
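As a preview of the resampling idea (covered in detail in the next lecture), here is a minimal bootstrap sketch in Python for a confidence interval on a population median; the sample data and the number of resamples are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = np.array([3.1, 2.7, 4.5, 3.3, 8.9, 2.8, 3.0, 3.6, 2.9, 4.1])  # hypothetical data

# Draw B bootstrap resamples (with replacement) and record the median of each.
B = 10_000
medians = np.array([np.median(rng.choice(sample, size=sample.size, replace=True))
                    for _ in range(B)])

# A 95% percentile confidence interval for the population median.
lo, hi = np.percentile(medians, [2.5, 97.5])
print(f"median = {np.median(sample):.2f}, 95% bootstrap CI = [{lo:.2f}, {hi:.2f}]")
```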

Non-Parametric Tests

Wilcoxon’s Rank-Sum Test
• The case of independent samples
• The case of matched pairs

Wilcoxon’s Rank-Sum Tests

Wilcoxon’s Rank-Sum Tests are equivalent to the t-test, but apply when the normality assumption of the distribution is not met.

As a result of their non-parametric nature, however, power is lost (see further for a formal discussion of power). In particular, the tests are not as specific as their parametric equivalents.

This means that, although we interpret the result of these non-parametric tests to mean one thing about a central property of the distributions under study, they could mean something else.

Wilcoxon’s Rank-Sum Test (for two independent samples)
Informal Description I

We are given two samples, with n1 observations in group 1 and n2 observations in group 2. The null hypothesis we are trying to reject is: “H0: The two samples come from identical populations (not just populations with the same mean)”.

We consider two cases:
• Case 1: The null hypothesis is false (to a substantial degree) and the scores from population 1 are generally lower than those of population 2.
• Case 2: The null hypothesis is true. This means that the two samples came from the same population.

Wilcoxon’s Rank-Sum Test (for two independent samples)
Informal Description II

In both cases, the procedure consists of ranking the scores of the two groups taken together.
• Case 1: In the first case, we expect the ranks from population 1 to be generally lower than those of population 2. Consequently, we would also expect the sum of the ranks in group 1 to be smaller than the sum of the ranks in group 2.
• Case 2: In the second case, we expect the sum of the ranks of the first group to be about equal to the sum of the ranks of the second group.

Wilcoxon’s Rank-Sum Test (for two independent samples)
n1 and n2 ≤ 25:

Consider the two groups of data of sizes n1 and n2, respectively, where n1 is the smaller sample size.
Rank their scores together from lowest to highest.
In case of an x-way tie just after rank y, assign the average rank (y+1 + y+2 + … + y+x)/x to all the tied elements.
Sum the ranks of the group containing the smaller number of samples (n1) (if both groups contain the same number of samples, choose the group with the smaller rank sum). Call this sum Ws.
Find the value V in the Wilcoxon table for n1, n2 and the required significance level s, where n1 in the table corresponds to the smaller sample size as well.
Compare Ws to V and conclude that the difference between the two groups is significant at the chosen level (L1 for a one-tailed test, or 2*L1 for a two-tailed test) only if Ws < V. If Ws ≥ V, the null hypothesis cannot be rejected.
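The ranking and rank-sum steps are mechanical and easy to get wrong by hand, so here is a minimal Python sketch of the computation of Ws with average ranks for ties (the two samples are hypothetical):

```python
import numpy as np
from scipy.stats import rankdata  # assigns tied values their average rank by default

group1 = np.array([12.0, 15.5, 9.8, 11.2, 13.4])         # hypothetical scores, n1 = 5
group2 = np.array([14.1, 16.3, 15.5, 18.0, 17.2, 19.5])  # hypothetical scores, n2 = 6

# Rank all scores together, from lowest to highest, averaging over ties.
ranks = rankdata(np.concatenate([group1, group2]))

# Ws is the sum of the ranks belonging to the smaller group (group1 here).
Ws = ranks[: len(group1)].sum()
print("Ws =", Ws)  # compare against the critical value V from a Wilcoxon table
```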

Wilcoxon’s Rank-Sum Test (for two independent samples)
n1 and n2 > 25:

Compute Ws as before.
Use the fact that Ws approaches a normal distribution as the sample sizes increase, with:
• a mean of m = n1(n1+n2+1)/2, and
• a standard error of std = sqrt(n1*n2*(n1+n2+1)/12).
Compute the z statistic: z = (Ws – m)/std.
Use the tables of the normal distribution.
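The normal approximation is a short computation; the sketch below implements the slide's formulas, using scipy's normal survival function for the one-tailed p-value (a two-tailed test would double it). The input values are hypothetical:

```python
import numpy as np
from scipy.stats import norm

def rank_sum_z(Ws: float, n1: int, n2: int) -> tuple[float, float]:
    """Normal approximation to the Wilcoxon rank-sum statistic Ws."""
    m = n1 * (n1 + n2 + 1) / 2                   # mean of Ws under H0
    std = np.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # standard error of Ws
    z = (Ws - m) / std
    return z, norm.sf(abs(z))                    # one-tailed p-value

z, p = rank_sum_z(Ws=180.0, n1=26, n2=30)        # hypothetical values
print(f"z = {z:.3f}, one-tailed p = {p:.4f}")
```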

Wilcoxon’s Matched Pairs Signed Ranks Test (for paired scores)
Informal Description

Logic of the Test:
We are given the same population tested under two different circumstances, C1 and C2.
If there is an improvement in C2, then most of the results recorded in C2 will be greater than those recorded in C1, and those that are not greater will be smaller by only a small amount.

Wilcoxon’s Matched Pairs Signed Ranks Test (for paired scores)
n ≤ 50:

We calculate the difference score for each pair of measurements.
We rank all the difference scores without paying attention to their signs (i.e., we rank their absolute values).
We assign the algebraic sign of the differences to the ranks.
We sum the positive and negative ranks separately.
We choose as the test statistic T the smaller of the absolute values of the two sums.
We compare T to a Wilcoxon T table.
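A minimal Python sketch of these steps on hypothetical paired scores (zero differences are dropped, a common convention the slides do not mention):

```python
import numpy as np
from scipy.stats import rankdata

c1 = np.array([0.81, 0.78, 0.84, 0.80, 0.76, 0.83])  # hypothetical scores under C1
c2 = np.array([0.85, 0.80, 0.83, 0.86, 0.79, 0.88])  # hypothetical scores under C2

d = c2 - c1
d = d[d != 0]                 # drop zero differences (common convention)
ranks = rankdata(np.abs(d))   # rank the absolute differences, averaging ties

pos = ranks[d > 0].sum()      # sum of positive ranks
neg = ranks[d < 0].sum()      # sum of negative ranks
T = min(pos, neg)             # test statistic: the smaller of the two sums
print("T =", T)               # compare against a Wilcoxon T table
```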

Wilcoxon’s Matched Pairs Signed Ranks Test (for paired scores)
n > 50:

Compute T as before.
Use the fact that T approaches a normal distribution as size increases, with:
• a mean of m = n(n+1)/4, and
• a standard error of std = sqrt(n(n+1)(2n+1)/24).
Compute the z statistic: z = (T – m)/std.
Use the tables of the normal distribution.
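Mirroring the rank-sum case, a sketch of this normal approximation with hypothetical inputs:

```python
import numpy as np
from scipy.stats import norm

def signed_rank_z(T: float, n: int) -> tuple[float, float]:
    """Normal approximation to Wilcoxon's signed-rank statistic T."""
    m = n * (n + 1) / 4                            # mean of T under H0
    std = np.sqrt(n * (n + 1) * (2 * n + 1) / 24)  # standard error of T
    z = (T - m) / std
    return z, norm.sf(abs(z))                      # one-tailed p-value

z, p = signed_rank_z(T=450.0, n=60)                # hypothetical values
print(f"z = {z:.3f}, one-tailed p = {p:.4f}")
```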

Power Analysis

Type I and Type II Errors

Definition: A Type I error (α) corresponds to the error of rejecting H0, the null hypothesis, when it is, in fact, true. A Type II error (β) corresponds to the error of failing to reject H0 when it is false.

Definition: The power of a test is the probability of rejecting H0 given that it is false. Power = 1 – β.
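Power is easy to estimate by simulation: generate data under a specific alternative, run the test many times, and count how often H0 is rejected. A minimal sketch, assuming a two-sample t-test at α = 0.05 against a half-standard-deviation difference in means (the sample size and effect are arbitrary illustration choices):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
alpha, n, n_sim = 0.05, 30, 5_000

# Simulate data under a specific alternative: means differ by 0.5 sd.
rejections = 0
for _ in range(n_sim):
    a = rng.normal(loc=0.0, scale=1.0, size=n)
    b = rng.normal(loc=0.5, scale=1.0, size=n)
    if ttest_ind(a, b).pvalue < alpha:
        rejections += 1

print(f"estimated power = {rejections / n_sim:.3f}")  # approximately 1 - beta
```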

Why does Power Matter? I

All the hypothesis tests described in the previous three sections are only concerned with reducing the Type I error.

That is, they try to ascertain the conditions under which we are rejecting a hypothesis rightly.

They are not at all concerned with the case where the null hypothesis is really false, but we do not reject it.

Why does Power Matter? II

In the case of Machine Learning, reducing the Type I error means reducing the probability of our saying that there is a difference in the performance of the two classifiers when, in fact, there isn’t.

Reducing the Type II error means reducing the probability of our saying that there is no difference in the performance of the two classifiers when, in fact, there is.

Power matters because we do not want to discard a classifier that shouldn’t have been discarded. If a test does not have enough power, then this kind of situation can arise.

What is the Effect Size?

The effect size measures how strong the relationship between two entities is.

In particular, if we consider a particular procedure, in addition to knowing how statistically significant the effect of that procedure is, we may want to know what the size of this effect is.

There are different measures of effect size, including:
• Pearson’s correlation coefficient
• Odds ratio
• Cohen’s d statistic

Cohen’s d statistic is appropriate in the context of a t-test on means. It is thus the effect size measure we concentrate on here.

[Wikipedia: http://en.wikipedia.org/wiki/Effect_size]

Cohen’s d-statistic

Cohen’s d-statistic is expressed as:

d = (X1 – X2)/sp

where X1 and X2 are the two sample means, and sp², the pooled variance estimate, is:

sp² = ((n1-1)*s1² + (n2-1)*s2²) / (n1+n2-2)

with sp its square root.

[Note: this is not exactly Cohen’s d measure, which was expressed in terms of population parameters. What we show above is an estimate of d.]
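A minimal sketch of this estimate in Python (ddof=1 gives the unbiased sample variances s1² and s2²; the fold scores are hypothetical):

```python
import numpy as np

def cohens_d(x1: np.ndarray, x2: np.ndarray) -> float:
    """Estimate Cohen's d from two samples using the pooled variance."""
    n1, n2 = len(x1), len(x2)
    s1_sq, s2_sq = x1.var(ddof=1), x2.var(ddof=1)  # unbiased sample variances
    sp_sq = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)
    return (x1.mean() - x2.mean()) / np.sqrt(sp_sq)

# Hypothetical accuracy scores of two classifiers over cross-validation folds.
acc1 = np.array([0.84, 0.86, 0.83, 0.88, 0.85])
acc2 = np.array([0.80, 0.82, 0.79, 0.83, 0.81])
print(f"d = {cohens_d(acc1, acc2):.2f}")
```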

Usefulness of the d statistic

d is useful in that it standardizes the difference between the two means. We can talk about deviations in terms of proportions of a standard deviation, which is more useful than actual differences, which are domain dependent.

Cohen came up with a set of guidelines concerning d:
• d = 0.2 is a small effect, but is probably meaningful;
• d = 0.5 is a medium effect that is noticeable;
• d = 0.8 shows a large effect size.

Statistical Tests for Comparing Multiple Classifiers

What is the Analysis of Variance (ANOVA)?

The analysis of variance is similar to the t-test in that it deals with differences between sample means.

However, unlike the t-test, which is restricted to the difference between two means, ANOVA allows us to assess whether the differences observed between any number of means are statistically significant.

In addition, ANOVA allows us to deal with more than one independent variable. For example, we could choose, as two independent variables, 1) the learning algorithm and 2) the domain to which the learning algorithm is applied.

Why is ANOVA useful?

One may wonder why ANOVA is useful in the context of classifier evaluation.

Very simply, if we want to answer the following common question: “How do various classifiers fare on different data sets?”, then we have two independent variables, the learning algorithm and the domain, and a lot of results.

ANOVA makes it easy to tell whether the differences observed are indeed significant.

Variations on the ANOVA Theme

There are different implementations of ANOVA:
• One-way ANOVA is a linear model that tries to assess whether the difference in the performance measures of classifiers over different datasets is statistically significant, but it does not distinguish between the performance measures’ variability within datasets and their variability between datasets.
• Two-way/Multi-way ANOVA can deal with more than one independent variable: for instance, performance measures of different classifiers over various datasets.

There are also other related tests:
• Friedman’s test, post-hoc tests, the Tukey test, etc.

How does One-Way ANOVA work? I

It considers various groups of observations and sets as the null hypothesis that all the group means are equal.

The alternative hypothesis is that they are not all equal.

The ANOVA model is as follows:

x_ij = μ_i + e_ij

where x_ij is the jth observation from group i, μ_i is the mean of group i, and e_ij is noise that is normally distributed with mean 0 and common standard deviation σ.
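To make the model concrete, here is a tiny sketch that generates data from it for three hypothetical groups (the means μ_i and σ below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = [0.80, 0.82, 0.90]  # hypothetical group means (mu_i)
sigma = 0.03             # common standard deviation of the noise e_ij
n_per_group = 8          # observations per group

# x_ij = mu_i + e_ij, with e_ij ~ N(0, sigma^2)
groups = [mu_i + rng.normal(0.0, sigma, size=n_per_group) for mu_i in mu]
```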

How does One-Way ANOVA work? II

ANOVA monitors three different kinds of variation in the data:
• within-group variation,
• between-group variation,
• total variation = within-group variation + between-group variation.

Each of these variations is represented by a sum of squares (SS).

The statistic of interest in ANOVA is F, where

F = between-group variation / within-group variation

Larger F’s demonstrate greater statistical significance than smaller ones. As for z and t, there are tables of significance levels associated with the F-ratio.
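A minimal sketch of the SS partition and the F statistic on hypothetical scores; note that in practice each SS is divided by its degrees of freedom (giving mean squares) before taking the ratio, which is what the F tables assume:

```python
import numpy as np
from scipy.stats import f as f_dist

# Hypothetical groups of scores, e.g. accuracies of three classifiers.
groups = [np.array([0.80, 0.79, 0.82, 0.81, 0.80]),
          np.array([0.82, 0.84, 0.81, 0.83, 0.82]),
          np.array([0.90, 0.88, 0.91, 0.89, 0.90])]

all_x = np.concatenate(groups)
grand_mean = all_x.mean()
k, N = len(groups), all_x.size

# Between-group SS: squared deviations of group means from the grand mean.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group SS: squared deviations of observations from their group means.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

F = (ss_between / (k - 1)) / (ss_within / (N - k))  # ratio of mean squares
p = f_dist.sf(F, k - 1, N - k)                      # tail probability of F
print(f"F = {F:.2f}, p = {p:.4f}")                  # agrees with scipy.stats.f_oneway(*groups)
```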

How does One-Way ANOVA work? III

The goal of ANOVA is to find out whether or not the differences in means between different groups are statistically significant.

To do so, ANOVA partitions the total variance into variance caused by random error (the within-group SS) and variance caused by actual differences between means (the between-group SS).

If the null hypothesis holds, then the within-group SS should be about the same as the between-group SS.

We can compare these two SSs using the F test, which checks whether their ratio is significantly greater than 1.

What is Multi-Way ANOVA?

In One-Way ANOVA, we simply considered several groups.

For example, this could correspond to comparing the performance of 10 different classifiers on one domain.

What about the case where we compare the performance of these same 10 classifiers on 5 domains? Two-Way ANOVA can help with that.

If we were to add a further dimension, such as the consideration of 6 different (but matched) threshold levels (as in AUC) for each classifier on the same 5 domains, then Three-Way ANOVA could be used, and so on.

How Does Multi-Way ANOVA work?

In our example, the difference between One-Way ANOVA and Two-Way ANOVA can be illustrated as follows (see the sketch below):
• in One-Way ANOVA, we would calculate the within-group SS by collapsing the results obtained on all the data sets together within each classifier’s results;
• in Two-Way ANOVA, we would calculate all the within-classifier, within-domain variances separately and group the results together;
• as a result, the pooled within-group SS of Two-Way ANOVA is smaller than the pooled within-group SS of One-Way ANOVA.

Multi-Way ANOVA is thus a more statistically powerful test than One-Way ANOVA, since we need fewer observations to find significant effects.
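A minimal sketch of a two-way layout (classifiers × domains, one score per cell, no interaction term), with hypothetical accuracies; the classifier effect is tested against the residual mean square, so domain-to-domain variation no longer inflates the error term:

```python
import numpy as np
from scipy.stats import f as f_dist

# Hypothetical accuracies: rows = 3 classifiers, columns = 4 domains.
scores = np.array([[0.81, 0.74, 0.90, 0.68],
                   [0.84, 0.77, 0.92, 0.71],
                   [0.79, 0.72, 0.88, 0.66]])
r, c = scores.shape
grand = scores.mean()

ss_rows = c * ((scores.mean(axis=1) - grand) ** 2).sum()  # classifier effect
ss_cols = r * ((scores.mean(axis=0) - grand) ** 2).sum()  # domain effect
ss_total = ((scores - grand) ** 2).sum()
ss_error = ss_total - ss_rows - ss_cols                   # residual variation

df_rows, df_error = r - 1, (r - 1) * (c - 1)
F = (ss_rows / df_rows) / (ss_error / df_error)
p = f_dist.sf(F, df_rows, df_error)
print(f"classifier effect: F = {F:.2f}, p = {p:.4f}")
```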