
Experimental Software Engineering – Statistical Tests


Page 1: Experimental Software Engineering – Statistical Tests

Experimental Software Engineering

– Statistical Tests –

Fernando Brito e Abreu ([email protected])
Universidade Nova de Lisboa (http://www.unl.pt)
QUASAR Research Group (http://ctp.di.fct.unl.pt/QUASAR)

ABSTRACT

Data analysis methods
Scale and role of variables revisited
Testing distribution adherence
Hypotheses testing
Statistical significance and confidence interval
Error types in interpreting results
Parametric tests
Nonparametric tests

Page 2: Experimental Software Engineering – Statistical Tests

Data analysis taxonomy

Number of independent variables (aka factors):
  One-factorial analysis – one independent variable
  Multifactorial analysis – several independent variables

Number of dependent variables:
  Univariate analysis – one dependent variable
  Multivariate analysis – several dependent variables

Data analysis methods

Proportion testing
Inference tests for categorical and continuous data
  Parametric testing
  Non-parametric testing
Regression analysis
  Linear regression modeling
  Nonlinear regression modeling (e.g. logistic regression analysis)
Multivariate data analysis
  Factor analysis
  Cluster analysis
  Discriminant analysis

Page 3: Experimental Software Engineering – Statistical Tests

More on scale types

Categorical (discrete) data
  Nominal scale
  Ordinal scale
  Absolute scale

Continuous data
  Interval scale
  Ratio scale

Independent variables (aka factors or explanatory variables)

Are those that are manipulated in experimental research

Examples:
  Programming language
  Development environment
  Design size
  Practitioner expertise

Page 4: Experimental Software Engineering – Statistical Tests

Dependent variables (aka outcome variables, measures, criteria)

Are those on which we measure the effect of the independent variables in experimental research

Examples:
  Effort to produce a given deliverable
  Project schedule
  Defects found in code inspection
  System faults in operation (e.g. MTBF, MTTR)

Exercise: Identify the independent and dependent variables …

Page 5: Experimental Software Engineering – Statistical Tests

Degree of freedom (df) of an estimate

Is the number of independent pieces of information on which the estimate is based.

Is the number of values in the final calculation of a statistic that are free to vary (df = number of different treatments – 1)

Why is the "Normal distribution" important?

… because in most cases, it approximates well the function that represents the relationship between "magnitude" and "significance" of relations between two variables, depending on the sample size

The distribution of many test statistics is normal or follows some form that can be derived from the normal distribution

Many frequently used statistical tests make the assumption that the data come from a normal distribution

Page 6: Experimental Software Engineering – Statistical Tests

Distribution adherence

The distribution type conditions the kind of statistical tests we can apply

Therefore we want to know if a variable follows (adheres to) a given statistical distribution

Often we are interested in how well the distribution can be approximated by the normal distribution

We can take several, increasingly powerful, approaches:
  Use descriptive statistics
  Use plots
  Use distribution adherence tests

Testing distribution adherence
Most common normality tests

Kolmogorov-Smirnov one-sample test
Lilliefors test (a correction of the previous)
Shapiro-Wilk's W test
Royston test (a correction of the previous)

These tests are also known as goodness-of-fit tests, since they test whether the observations could reasonably have come from the specified distribution

Page 7: Experimental Software Engineering – Statistical Tests

Testing distribution adherence
Kolmogorov-Smirnov one-sample test

The Kolmogorov-Smirnov one-sample test for normality is based on the maximum difference between the sample cumulative distribution and the hypothesized cumulative distribution.

H0: X ~ N(μ;σ)
H1: ¬ X ~ N(μ;σ)

Notes:
For many software programs, the probability values that are reported are based on those tabulated by Massey (1951); those probability values are valid when the mean and standard deviation of the normal distribution are known a priori and not estimated from the data
This test can also be used to verify goodness of fit for other distributions (e.g. uniform, Poisson, exponential)

Testing distribution adherence
Kolmogorov-Smirnov one-sample test
Interpretation:

If the Z statistic is significant, then the hypothesis that the respective distribution is normal (H0) should be rejected

"Significant" means that the statistical significance p of the result does not exceed the test significance α (required level)

Example:
Consider the test significance α = 0.05
Probability of Type I error = 0.05 * 100% = 5%
(probability of rejecting H0, the null hypothesis, when it is true)

If p ≤ α (significant Z statistic):
  Reject H0 and accept H1 (sample cannot come from a Normal population)
If p > α (not significant Z statistic):
  Accept H0 and therefore reject H1 (sample may come from a Normal population)
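As an aside, a minimal sketch of this decision rule in Python with scipy.stats (an assumption: the slides work in SPSS, and the data here are synthetic). Note the caveat above: estimating μ and σ from the same sample makes these p-values optimistic, which is what the Lilliefors correction addresses.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
effort = rng.lognormal(mean=2.0, sigma=1.0, size=500)   # skewed, non-Normal sample

# KS one-sample test against N(mu, sigma); mu and sigma estimated from the data
mu, sigma = effort.mean(), effort.std(ddof=1)
d, p = stats.kstest(effort, 'norm', args=(mu, sigma))

alpha = 0.05
if p <= alpha:
    print(f"D={d:.3f}, p={p:.4f}: reject H0 (sample cannot come from a Normal population)")
else:
    print(f"D={d:.3f}, p={p:.4f}: accept H0 (sample may come from a Normal population)")
```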

Page 8: Experimental Software Engineering – Statistical Tests


One-Sample Kolmogorov-Smirnov Test

                                       Functional Size   Normalised Work Effort
N                                      3310              4180
Normal Parameters(a,b)  Mean           444.17            5951.71
                        Std. Deviation 926.623           20567.46
Most Extreme            Absolute       .317              .386
Differences             Positive       .262              .302
                        Negative       -.317             -.386
Kolmogorov-Smirnov Z                   18.238            24.970
Asymp. Sig. (2-tailed)                 .000              .000

a. Test distribution is Normal.
b. Calculated from data.

Example:

Even for a test significance α = 0.01 (99% confidence interval), since p = 0.000 ≤ α (significant Z statistic):

We reject H0 and accept H1 (neither Size nor Effort can come from a Normal population)

SPSS: Analyze > Nonparametric Tests > 1-Sample K-S…

Testing distribution adherence
Lilliefors test

This test is basically a correction to the Kolmogorov-Smirnov test, applicable when the parameters of the hypothesized normal distribution are estimated from the sample data

Interpretation:
If the Z statistic is significant, then the hypothesis that the respective distribution is normal should be rejected (same as for the KS test)

Notes:
In a Kolmogorov-Smirnov test for normality, when the mean and standard deviation of the hypothesized normal distribution are not known (i.e., they are estimated from the sample data), the probability values tabulated by Massey (1951) are not valid. In that case, the test for normality involves a complex conditional hypothesis ("how likely is it to obtain a D statistic of this magnitude or greater, contingent upon the mean and standard deviation computed from the data"), and the Lilliefors probabilities should be used instead (Lilliefors, 1967)

Page 9: Experimental Software Engineering – Statistical Tests

Testing distribution adherence
Shapiro-Wilk's W test

This test is the preferred test of normality because of its good power properties as compared to a wide range of alternative tests

Interpretation:
If the W statistic is significant (i.e. p ≤ α), then the hypothesis that the respective distribution is normal should be rejected

Notes:
Some software programs implement an extension to the test described by Royston (1982), which allows it to be applied to large samples (with up to 5000 observations)
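The same adherence check with the Shapiro-Wilk W statistic, again as a hedged scipy sketch (synthetic data; the slides run these tests in SPSS):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
sample = rng.normal(loc=10, scale=2, size=200)   # data that really are Normal

w, p = stats.shapiro(sample)                     # W statistic and its p-value
alpha = 0.05
print(f"W={w:.4f}, p={p:.4f}:",
      "reject normality" if p <= alpha else "normality not rejected")
```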

Statistical significance (p-value) of a result

The p-value represents the probability of error that is involved in accepting our observed result as valid, that is, as "representative" of the population

A p-value of 5% (i.e., 1/20) indicates that there is a 5% probability that the relation between the variables found in our sample is a "fluke" (stroke of luck)

For adherence tests, the p-value is the probability that the observed difference between the sample cumulative distribution and the hypothesized cumulative distribution occurred by pure chance ("luck of the draw")

In other words, that in the population from which the sample was drawn, no such difference exists

Page 10: Experimental Software Engineering – Statistical Tests

Common p-values (conventions in many research areas)

Borderline statistically significant: p-value = 5% (1/20)
Statistically significant: p-value = 1% (1/100)
Highly statistically significant: p-value = 0.5% (1/200) or even 0.1% (1/1000)

Hypothesis testing

Suppose that a CIO is interested in showing that in his software house the projects have an average defect density (ADD) below 5 [KLOC⁻¹]. This question, in statistical terms: "Is ADD < 5?"

STEP 1: State as a "statistical null hypothesis" (hypothesis H0) something that is the logical opposite of what you believe.
H0: ADD ≥ 5

STEP 2: Collect data (build a sample)

STEP 3: Using statistical theory, show from the data that it is likely H0 is false, and should be rejected.
By rejecting H0, you support what you actually believe.

This kind of situation, which is typical in many fields of research, is called "Reject-Support testing" (RS testing), because rejecting the null hypothesis supports the experimenter's theory.

Page 11: Experimental Software Engineering – Statistical Tests

Hypothesis testing

Two kinds of errors:
α – Type I error rate, must be kept at or below .05
β – Type II error rate, must be kept low as well (the conventions are much more rigid with respect to α than with respect to β)

The "statistical power" (1-β) must be kept high
Ideally, power should be at least .80 to detect a reasonable departure from the null hypothesis

The null hypothesis is either true or false
The null hypothesis is either rejected or not rejected
The statistical decision should be set up so that no "ties" occur

Decision    State of the World: H0 true      State of the World: H1 true
Reject H0   Type I Error (α)                 Correct H0 Rejection (1-β)
Accept H0   Correct H0 Acceptance (1-α)      Type II Error (β)

Hypothesis testing (expanded)

DECISION               H0 is True / H1 is False                    H0 is False / H1 is True
Reject H0 / Accept H1  Type I Error: incorrect H0 rejection,       OK: correct H0 rejection,
                       incorrect H1 acceptance (Probability = α)   correct H1 acceptance (Probability = 1-β)
Accept H0 / Reject H1  OK: correct H0 acceptance,                  Type II Error: incorrect H0 acceptance,
                       correct H1 rejection (Probability = 1-α)    incorrect H1 rejection (Probability = β)

Page 12: Experimental Software Engineering – Statistical Tests

Statistical Tests

Parametric tests
  Assure stronger validity than the non-parametric counterparts
  Their statistical power is greater

Non-parametric tests
  Weaker validity than the parametric counterparts
  Their statistical power is smaller

Statistical Tests for Scales

[Decision diagram] Given the measurement scale of the variable under consideration: for Nominal or Ordinal scales, use non-parametric methods; for Interval or Ratio scales, use parametric methods if the data follow a Normal distribution, and non-parametric methods otherwise.

Page 13: Experimental Software Engineering – Statistical Tests

Parametric tests (between groups)

Name                          Factors/Treat.  Outcome scale                          Null hypotheses
t-Student (one sample)        NA              Numeric (absolute, interval or ratio)  The mean of a variable is equal to a specified constant?
t-Student (2 indep. samples)  1/2             Numeric (absolute, interval or ratio)  The means of a variable on each group (treatment) are the same?
One-Way ANOVA                 1/2+            Numeric (absolute, interval or ratio)  The means of a variable on each group (treatment) are the same?
Factorial ANOVA               2+/2+           Numeric (absolute, interval or ratio)  i) The means of a variable on each group (treatment) are the same? ii) There is no interaction among the factors?

Nonparametric tests (between groups)

Name                              Factors/Treat.  Outcome scale           Null hypotheses
Binomial test (of proportions)    1/2             NA                      The expected proportions are the ones being tested?
Chi-Square (test of proportions)  1/2+            NA                      The expected proportions in the groups are similar?
Mann-Whitney test (aka U-test)    1/2             At least ordinal scale  The two groups have similar central tendency?
Kruskal-Wallis test (aka H-test)  1/2+            At least ordinal scale  The several groups have a similar localization parameter?
Nonparametric Factorial ANOVA     2+/2+           At least ordinal scale  i) The several groups have a similar localization parameter? ii) There is no interaction among the factors?

Page 14: Experimental Software Engineering – Statistical Tests

PARAMETRIC TESTS

T-Student test (one sample)

Page 15: Experimental Software Engineering – Statistical Tests

One sample T-Student test
Applicability

This procedure tests whether the mean of a quantitative variable differs from a hypothesized test value

The test value is a specified constant

Design: N/A

Scales:
Factor (grouping) variable: none
Outcome variable: numeric (absolute, interval or ratio)

One sample T-Student test
Assumptions

This test assumes that the data from the outcome variable are normally distributed; however, it is fairly robust to departures from normality.

Page 16: Experimental Software Engineering – Statistical Tests

One sample T-Student test
Hypotheses being tested

H0: μ = k
The mean of the variable does not differ significantly from a specified k value

H1: μ ≠ k
The mean of the variable differs significantly from the specified k value

This test uses the T statistic, which has a t-Student distribution with (n-1) degrees of freedom

One sample T-Student test
Test decision

For n ≤ 30: We reject H0, for a given level of significance α, if:
|Tcalc| > t1-α/2 (n-1)
The critical values can be obtained from a t-Student table.

For n > 30: the t-Student distribution becomes ~ N(0;1)
Therefore, we can then use the test significance:

If p ≤ α (significant Z statistic):
Reject H0 and accept H1 (the mean differs from the specified constant)
If p > α (not significant Z statistic):
Accept H0 and therefore reject H1 (the mean does not differ significantly from the specified constant)

Page 17: Experimental Software Engineering – Statistical Tests

Example (1/3)

Problem:
Is the population mean effort per adjusted function point equal to 15 or 16 man-hours?

First we have to compute that effort …

SPSS: Transform > Compute Variable

Example (2/3)

Assumption: is the effort per adjusted function point Normally distributed?

One-Sample Kolmogorov-Smirnov Test

                                       Adjusted Function Points
N                                      2839
Normal Parameters(a,b)  Mean           428,94
                        Std. Deviation 829,349
Most Extreme            Absolute       ,304
Differences             Positive       ,256
                        Negative       -,304
Kolmogorov-Smirnov Z                   16,186
Asymp. Sig. (2-tailed)                 ,000

a. Test distribution is Normal.
b. Calculated from data.

The effort is not Normally distributed, but since this test is robust to non-Normal data, we will still use it!

SPSS: Analyze > Nonparametric Tests > 1-Sample K-S

Page 18: Experimental Software Engineering – Statistical Tests

Example (3/3)

SPSS: Analyze > Compare Means > One-Sample T-Test…

One-Sample Statistics

                        N     Mean     Std. Deviation  Std. Error Mean
Effort per Adjusted FP  2839  16,5820  25,00000        ,46920

One-Sample Test (Test Value = 15)

                        t      df    Sig. (2-tailed)  Mean Difference  95% CI Lower  95% CI Upper
Effort per Adjusted FP  3,372  2838  ,001             1,58204          ,6620         2,5020

H0 is rejected. The population mean value of the variable is not 15!

One-Sample Test (Test Value = 16)

                        t      df    Sig. (2-tailed)  Mean Difference  95% CI Lower  95% CI Upper
Effort per Adjusted FP  1,240  2838  ,215             ,58204           -,3380        1,5020

H0 cannot be rejected, even with a 90% confidence interval. The population mean value of the variable may well be 16!

Conclusion:
The expected value for the population mean for FP counts using the IFPUG rules is 16.

T-Student test (2 samples)

Page 19: Experimental Software Engineering – Statistical Tests

Two samples T-Student test
Applicability

This test allows inferring the equality of the means in the populations from two samples (groups)

Design: 1 factor, 2 treatments, independent samples

Scales:
Factor (grouping) variable: categorical, or cut-point defined upon a numeric variable (e.g. setting a cut point on team size of 10 persons allows splitting projects into two groups, according to that variable)
Outcome variable: numeric (absolute, interval or ratio)

Two samples T-Student test
Assumptions

The subjects should be randomly assigned to two groups, so that any difference in response is due to the treatments and not to other factors.

This test assumes that the data from the outcome variable are normally distributed; however, it is fairly robust to departures from normality.

This test uses different statistics depending on whether the outcome variable has homogeneous or non-homogeneous variances in the two groups

This homogeneity can be assessed with the Levene test

Page 20: Experimental Software Engineering – Statistical Tests

Two samples T-Student test
Hypotheses being tested

H0: μA = μB
The means of the variable on each group (treatment) are the same

H1: μA ≠ μB
The means of the variable on each group (treatment) are not the same

Two samples T-Student test
Test decision

For small samples: We reject H0, for a given level of significance α, if:
|Tcalc| > t1-α/2 (nA+nB-2)
The critical values can be obtained from a t-Student table.

For large samples (n > 30): the t-Student distribution becomes ~ N(0;1)
Therefore, we can then use the test significance:

If p ≤ α (significant Z statistic):
Reject H0 and accept H1 (means are different)
If p > α (not significant Z statistic):
Accept H0 and therefore reject H1 (means are similar)

Page 21: Experimental Software Engineering – Statistical Tests

Example (1/3)

Problem:
Is the mean number of adjusted Function Points the same when using IFPUG counting or any other counting (e.g. COSMIC, FiSMA, Feature Points)?

First we have to create a new factor variable (isIFPUG)
IFPUG projects will be coded "1" and non-IFPUG "0"

SPSS: Transform > Compute Variable

Example (2/3)

SPSS: Analyze > Compare Means > Independent Samples T-Test

Group Statistics

Is FP counting IFPUG?   N     Mean     Std. Deviation  Std. Error Mean
Effort per Adjusted FP
  1                     2839  17,9005  28,46928        ,53431
  0                     148   13,7710  14,25176        1,17149

Page 22: Experimental Software Engineering – Statistical Tests

Example (3/3)

Independent Samples Test (Effort per Adjusted FP)

Levene's Test for Equality of Variances:  F = 4,960   Sig. = ,026

t-test for Equality of Means:
                             t      df       Sig. (2-tailed)  Mean Difference  Std. Error Difference  95% CI Lower  95% CI Upper
Equal variances assumed      1,753  2985     ,080             4,12954          2,35567                -,48937       8,74845
Equal variances not assumed  3,207  214,040  ,002             4,12954          1,28758                1,59157       6,66751

Levene's test: for a confidence interval of 95% (p < α), H0 can be rejected, therefore sample variances cannot be considered homogeneous. For a confidence interval of 99% (p > α), H0 cannot be rejected, therefore sample variances can be considered homogeneous.

For a confidence interval of 95% (equal variances not assumed), we can say that the means of the two groups differ significantly! For a confidence interval of 99% (equal variances assumed), we cannot say that the means of the two groups differ significantly!

Conclusion:
The FP counting rules other than the IFPUG ones do not seem to differ significantly from the latter.
Setting the value of the confidence interval can change results interpretation in borderline situations!

One-Way ANOVA (One-factorial ANalysis Of VAriance)

Page 23: Experimental Software Engineering – Statistical Tests

One-Way ANOVA
Applicability

This procedure is used to test the hypothesis that the means among several groups (determined by a factor variable) are equal. Therefore, it allows testing if there is a variance on the outcome variable that is due to the factor. This is an extension of the two-sample t-test.

Design: 1 factor, 2+ treatments, independent samples

Scales:
Factor (grouping) variable: categorical (recoded into numeric)
Outcome variable: numeric (absolute, interval or ratio)

One-Way ANOVA
Assumptions

Each group is an independent random sample from a normal population. One-Way ANOVA is robust to departures from normality, although the data should be symmetric.

The groups should come from populations with equal (homogeneous) variances. To test this assumption, use Levene's homogeneity-of-variance test.

Page 24: Experimental Software Engineering – Statistical Tests

One-Way ANOVA
The groups

Let us consider that we have k groups (each group is a sample), each one corresponding to a given treatment (factor level):

sample 1 with n1 elements: X1 = {X11, X21, ... Xn11}
……
sample k with nk elements: Xk = {X1k, X2k, ... Xnkk}

where Xij is the value observed for subject i, belonging to sample j

One-Way ANOVA
Calculating the variance

Let SST be the sum of the squares of the deviations of observed values around the global mean:

$$SST = \sum_{j=1}^{k}\sum_{i=1}^{n_j}(X_{ij} - \bar{X})^2$$

Let SSW be the sum of the squares of the deviations within groups:

$$SSW = \sum_{j=1}^{k}\sum_{i=1}^{n_j}(X_{ij} - \bar{X}_j)^2$$

Let SSB be the sum of the squares of the deviations between groups:

$$SSB = \sum_{j=1}^{k} n_j(\bar{X}_j - \bar{X})^2$$

where SST = SSW + SSB

Page 25: Experimental Software Engineering – Statistical Tests

One-Way ANOVA
The T statistic

The ANOVA compares the sum of the squares of the deviations between groups (difference between groups) with the sum of the squares within groups.

The null hypothesis is tested using the following test statistic:

$$T = \frac{SSB/(k-1)}{SSW/(n-k)}$$

where n = number of cases and k = number of groups.

Under the null hypothesis, the T statistic follows an F (Snedecor) distribution with (k-1, n-k) degrees of freedom, i.e.,

T ∼ F(k-1, n-k)

One-Way ANOVA
Hypotheses being tested

H0: The means of the outcome variable for each group (treatment) are all the same
∀ i, j : μi = μj (i ≠ j)

H1: The means for each group are not all the same
∃ i, j : μi ≠ μj (i ≠ j)

Test decision:
We reject H0, for a given level of significance α, if:
Fcalc > F1-α (k-1, n-k)
Take the critical values from the tables in http://www.itl.nist.gov/div898/handbook/eda/section3/eda3673.htm

Note: k-1 is the numerator and n-k is the denominator (see previous slide)
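To make the formulas concrete, a sketch that computes SSB, SSW and the F statistic from their definitions and checks the result against scipy.stats.f_oneway (synthetic data; the group sizes echo the example that follows):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
groups = [rng.normal(mu, 30, size=n)                 # synthetic effort-per-FP samples
          for mu, n in [(23.5, 509), (16.0, 265), (28.8, 116), (17.3, 204)]]

all_x = np.concatenate(groups)
grand_mean = all_x.mean()
k, n = len(groups), len(all_x)
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)   # between groups
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)             # within groups
f_manual = (ssb / (k - 1)) / (ssw / (n - k))                       # T = MSB / MSW

f_scipy, p = stats.f_oneway(*groups)                               # should match
print(f"F(manual)={f_manual:.3f}  F(scipy)={f_scipy:.3f}  p={p:.5f}")
```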

Page 26: Experimental Software Engineering – Statistical Tests

One-Way ANOVA
Which groups differ?

In addition to determining that differences exist among the means, you may want to know which means differ

There are two types of tests for comparing means:
a priori contrasts are tests set up before running the experiment
post hoc tests are run after the experiment has been conducted

Example (1/4)

Problem:
Is the effort per adjusted function point the same across 4 well-known languages (Cobol, Visual Basic, C++ and Java)?

Verifying assumptions:
Is the outcome variable (the effort) normally distributed?
From previous slides we have seen this is not true, but since ANOVA is robust to departures from normality, we still use it …

Do the groups corresponding to each of the programming languages have equal variances?

Page 27: Experimental Software Engineering – Statistical Tests

Example (2/4)

We need to recode the programming languages of interest

SPSS: Transform > Recode into Different Variables

Example (3/4)

SPSS: Analyze > Compare Means > One-Way ANOVA

Test of Homogeneity of Variances (Effort per Adjusted FP)

Levene Statistic  df1  df2   Sig.
3,519             3    1090  ,015

Verifying another precondition: with a confidence interval of 99% we cannot reject the null hypothesis that the variances are homogeneous

The plot gives us a qualitative perspective of the phenomenon; the mean effort seems to depend on the language!

Page 28: Experimental Software Engineering – Statistical Tests

Example (4/4)

Descriptives (Effort per Adjusted FP)

              N     Mean     Std. Deviation  Std. Error  95% CI Lower  95% CI Upper  Minimum  Maximum
Cobol         509   23,5086  34,84305        1,54439     20,4744       26,5427       ,24      424,87
Visual Basic  265   16,0208  23,46494        1,44144     13,1826       18,8590       ,13      256,13
C++           116   28,8261  34,36263        3,19049     22,5064       35,1458       ,99      211,77
Java          204   17,2658  29,09480        2,03704     13,2493       21,2823       ,90      259,71
Total         1094  21,0945  31,57117        ,95451      19,2217       22,9674       ,13      424,87

Model: Fixed Effects: Std. Deviation 31,32726, Std. Error ,94714, 95% CI 19,2361 to 22,9530. Random Effects: Std. Error 2,85458, 95% CI 12,0100 to 30,1791, Between-Component Variance 22,57916.

ANOVA (Effort per Adjusted FP)

                Sum of Squares  df    Mean Square  F      Sig.
Between Groups  19712,581       3     6570,860     6,695  ,000
Within Groups   1069723         1090  981,397
Total           1089435         1093

n (number of cases) = 1094
k (number of groups) = 4

The upper critical value of the F distribution can be found in a table. Notice that the critical value for (k-1, n-k) = (3, 1090) can be bounded above by the critical value for (3, 100). For α = 5% we get an upper bound of 3.984. Since Fcalc = 6.695 > 3.984, we reject the null hypothesis!

Conclusion:
The average effort per function point is significantly dependent on the language

Factorial ANOVA (Multi-factorial ANalysis Of VAriance)

Page 29: Experimental Software Engineering – Statistical Tests

Factorial ANOVA
Applicability

This procedure is used to test if a given set of factors has a significant effect on a given variable

Allows determining the effect of each factor
Allows assessing the interaction among factors (aka moderation)

This is a particular case of a multivariate regression analysis methodology called "General Linear Model" (GLM)

In GLM, both balanced and unbalanced models can be tested. A design is balanced if each cell (a treatment) in the model contains the same number of cases

Factorial ANOVA
Design & Scales

Design: 2+ factors, 2+ levels per factor, independent samples
If we only have 2 factors, this is called Two-way ANOVA

In Factorial ANOVA, a treatment corresponds to a combination (tuple) of factor levels, such as (Java; Eclipse) if the factors are programming language and development environment. In Factorial ANOVA, a treatment is often called a model cell.

Scales:
Factor (grouping) variable: categorical (recoded)
Outcome variable: numeric (absolute, interval or ratio)

Page 30: Experimental Software Engineering – Statistical Tests

Factorial ANOVA
Main and interaction effects

Consider that you have three factors F1, F2 and F3

Main effects
These are the effects on the outcome variable caused by each factor alone, as we did with One-way ANOVA
These are represented by F1, F2, F3

Interaction effects
These are the cross-factor effects caused by the combined action of all combinations of factors, which are:
F1*F2, F1*F3, F2*F3, F1*F2*F3

Overall effect representation in the GLM (sketched in code below):
I + F1 + F2 + F3 + F1*F2 + F1*F3 + F2*F3 + F1*F2*F3
where I is an intercept term (similar to that used in linear regression)
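As an illustration of that overall GLM representation, a minimal two-factor sketch in Python with pandas/statsmodels (an assumption: the slides use SPSS's GLM Univariate; factor names and data here are synthetic):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(11)
n = 300
df = pd.DataFrame({
    "language": rng.choice(["Cobol", "Java"], size=n),
    "devtype": rng.choice(["New", "Enhancement"], size=n),
})
df["effort"] = (20 + 5 * (df["language"] == "Java")
                   + 3 * (df["devtype"] == "Enhancement")
                   + rng.normal(0, 10, size=n))

# I + F1 + F2 + F1*F2: the '*' in the formula expands to main effects + interaction
model = smf.ols("effort ~ C(language) * C(devtype)", data=df).fit()
print(sm.stats.anova_lm(model, typ=3))   # Type III sums of squares, as SPSS reports
```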

Factorial ANOVA
Hypotheses being tested (one for each factor)

H0: The expected means (in the population) of the outcome variable for each group (treatment) are all the same
μ1 = .... = μk

H1: At least one of the expected means is different
∃ i, j : μi ≠ μj (i ≠ j)

Test decision:
We reject H0, for a given level of significance α, if:
Fcalc > F1-α (k-1, n-k)
Take the critical values from the tables in http://www.itl.nist.gov/div898/handbook/eda/section3/eda3673.htm

Note: k-1 is the numerator and n-k is the denominator (see a previous slide)

Page 31: Experimental Software Engineering – Statistical Tests

Factorial ANOVA
Hypotheses being tested (one for each interaction)

H0: There is no interaction among the factors
∀ i, j: γi,j = 0 (i ≠ j)

H1: There is interaction between at least two factors
∃ i, j: γi,j ≠ 0 (i ≠ j)

Test decision:
We reject H0, for a given level of significance α, if:
Fcalc > F1-α (k-1, n-k)
Take the critical values from the tables in http://www.itl.nist.gov/div898/handbook/eda/section3/eda3673.htm

Note: k-1 is the numerator and n-k is the denominator (see a previous slide)

Factorial ANOVA
Assumptions

For increased test power, the populations from where each cell's data were taken should be normal and have homogeneous variances. However:

Factorial ANOVA is robust to departures from normality, although the data should be symmetric
Regarding variance, there are alternatives to Factorial ANOVA when variance homogeneity cannot be assumed

To check assumptions, we can use homogeneity-of-variance tests (e.g. Levene test) and spread-versus-level plots. We can also examine residuals and residual plots.

Page 32: Experimental Software Engineering – Statistical Tests

Factorial ANOVA
Differences among specific treatments

The overall F statistic allows testing whether at least one group, corresponding to a given treatment, has a mean on the outcome variable that is different from the other groups
If an overall F test has shown significance, we use post hoc tests to evaluate differences among specific means.

Some of those post hoc tests are applicable when equal variances are assumed and others when they are not

Use the Scheffé or the Tamhane's tests, depending on variance homogeneity, since those two tests are more conservative (safer) than others, which means that a larger difference between means is required for significance. (See more by clicking on the help button)

Factorial ANOVA
Profile (interaction) plots

If the interaction effects are not significant, we should consider each of the main effects separately, as we did for one-way ANOVA

When interaction effects are significant (rejected interaction-effect null hypotheses), we do not consider the corresponding main effects. Therefore, we should center our attention on the study of interactions

Profile plots allow visualizing the interactions among the factors!

Page 33: Experimental Software Engineering – Statistical Tests

Example (1/6)

Problem:
Is the effort per adjusted function point dependent on the language, development type and software architecture?

We already know that the effort per adjusted function point does not have a Normal distribution (see previous slides), but since Factorial ANOVA is robust to non-normality we still use it.

Between-Subjects Factors

Factor            Value  Label                          N
Language          1      Cobol                          508
                  2      Visual Basic                   264
                  3      C++                            116
                  4      Java                           204
Development type  0      New development                445
                  1      Enhancement                    622
                  2      Re-development                 25
Architecture_     1      Stand-alone                    720
                  2      Client-server                  298
                  3      Multi-tier with web interface  74

Example (2/6)

SPSS: Analyze > General Linear Model > Univariate

Levene's Test of Equality of Error Variances(a)
Dependent Variable: Effort per Adjusted FP

F      df1  df2   Sig.
5,049  20   1071  ,000

Tests the null hypothesis that the error variance of the dependent variable is equal across groups.
a. Design: Intercept + Language + DevType + Architecture_ + Language*DevType + Language*Architecture_ + DevType*Architecture_ + Language*DevType*Architecture_

Verifying variance homogeneity: with a confidence interval of 99% we reject the null hypothesis that the variances are homogeneous. Notice the interaction terms

Page 34: Experimental Software Engineering – Statistical Tests

Example (3/6)

Tests of Between-Subjects Effects
Dependent Variable: Effort per Adjusted FP

Source                              Type III SS   df    Mean Square  F       Sig.  Partial Eta²  Noncent.  Observed Power(a)
Corrected Model                     80088,097(b)  20    4004,405     3,877   ,000  ,068          77,532    1,000
Intercept                           37040,756     1     37040,756    35,859  ,000  ,032          35,859    1,000
Language                            802,606       3     267,535      ,259    ,855  ,001          ,777      ,100
DevType                             7598,601      2     3799,300     3,678   ,026  ,007          7,356     ,677
Architecture_                       8014,872      2     4007,436     3,880   ,021  ,007          7,759     ,702
Language * DevType                  265,799       4     66,450       ,064    ,992  ,000          ,257      ,063
Language * Architecture_            1294,776      3     431,592      ,418    ,740  ,001          1,253     ,134
DevType * Architecture_             8740,139      3     2913,380     2,820   ,038  ,008          8,461     ,680
Language * DevType * Architecture_  7952,667      3     2650,889     2,566   ,053  ,007          7,699     ,634
Error                               1106304,428   1071  1032,964
Total                               1741086,067   1092
Corrected Total                     1186392,525   1091

a. Computed using alpha = ,05
b. R Squared = ,068 (Adjusted R Squared = ,050)

With a confidence interval of 95%, the DevType * Architecture interaction is significant. Therefore, we should consider this interaction effect instead of the main effects

When the test power 1-β is low (below 80%), as it happens here, especially for all terms including the Language, we should be careful since the Type II error (β) is high. Recall that β is the probability of incorrectly accepting H0 (incorrectly rejecting H1) when H0 is false.

Conclusion:
The average effort per function point is significantly dependent on the combined action of development type and software architecture, although care should be taken since the test power is limited

Example (4/6)

Scenario 1: Main effects are important. Attention: this scenario is not true in our case study!

Multiple Comparisons (Tamhane)
Dependent Variable: Effort per Adjusted FP

(I) Architecture_  (J) Architecture_              Mean Diff. (I-J)  Std. Error  Sig.  95% CI Lower  95% CI Upper
Stand-alone        Client-server                  6,0732*           2,04680     ,009  1,1751        10,9714
Stand-alone        Multi-tier with web interface  19,0322*          1,37305     ,000  15,7464       22,3179
Client-server      Stand-alone                    -6,0732*          2,04680     ,009  -10,9714      -1,1751
Client-server      Multi-tier with web interface  12,9589*          1,55147     ,000  9,2343        16,6836
Multi-tier         Stand-alone                    -19,0322*         1,37305     ,000  -22,3179      -15,7464
Multi-tier         Client-server                  -12,9589*         1,55147     ,000  -16,6836      -9,2343

Based on observed means. * The mean difference is significant at the ,05 level.

With a confidence interval of 95% we can only say that there is an increasing order of magnitude in the average effort per adjusted FP, from multi-tier with web interface to stand-alone. Your interpretation?

Multiple Comparisons (Tamhane)
Dependent Variable: Effort per Adjusted FP

(I) Development type  (J) Development type  Mean Diff. (I-J)  Std. Error  Sig.  95% CI Lower  95% CI Upper
New development       Enhancement           -10,4831*         1,82443     ,000  -14,8474      -6,1188
New development       Re-development        -6,6862           3,02944     ,103  -14,3758      1,0034
Enhancement           New development       10,4831*          1,82443     ,000  6,1188        14,8474
Enhancement           Re-development        3,7969            3,33127     ,596  -4,4952       12,0891
Re-development        New development       6,6862            3,02944     ,103  -1,0034       14,3758
Re-development        Enhancement           -3,7969           3,33127     ,596  -12,0891      4,4952

Based on observed means. * The mean difference is significant at the ,05 level.

With a confidence interval of 95% we can only say that software enhancement requires, on average, around 10,5 more hours of effort per adjusted FP than new development. Your interpretation?

Page 35: Experimental Software Engineering – Statistical Tests

Example (5/6)
Significant interactions

Scenario 2: Interaction effects are important. Attention: this scenario is the correct one in our case study!

The effect of the development type on the effort is partly moderated by the architecture (and vice-versa)
This moderation effect manifests itself in crossing lines in the profile plot.

Example (6/6)
Non-significant interactions

When the interactions are not significant, the lines do not cross or only cross slightly

Page 36: Experimental Software Engineering – Statistical Tests

NON-PARAMETRIC TESTS (Between Groups)

Binomial test (test of proportions)

Page 37: Experimental Software Engineering – Statistical Tests

Binomial test
Applicability

This test allows comparing the proportion of occurrence of one of the two possible values of a dichotomous variable against the total number of cases

Design: 1 factor, 1 treatment, independent samples

Scales:
Factor (grouping) variable: categorical (dichotomous)
Outcome variable: N/A

Binomial test
The proportions to be tested

If px and py are the proportions of the two possible values of the factor, then px + py = 1

We test the hypothesis that the expected proportions in the population have a given value (p0, 1-p0), as for instance:
p0 = 25% -> px = 25%, py = 75%
p0 = 50% -> px = 50%, py = 50%
p0 = 60% -> px = 60%, py = 40%

Page 38: Experimental Software Engineering – Statistical Tests

Binomial test

The probability of occurrence of each of the possible values of the dichotomous variable follows a Binomial distribution

If the sample is large enough (n > 30), the Binomial approximates a Normal distribution. Under this circumstance we can use the Z test statistic:

Z ~ N(0; 1)

Therefore we need a table for the standardized Normal distribution N(0; 1)

[Normal distribution table N(0;1)]

Page 39: Experimental Software Engineering – Statistical Tests

Binomial test
Hypotheses being tested

H0: px = p0, py = 1 - p0
The expected proportions in the population are the ones being tested

H1: px ≠ p0, py ≠ 1 - p0
The expected proportions in the population are significantly different from the tested ones

Test decision:
The decision is based on the bilateral rejection interval, which depends on the value of α:

] -∞, -z1-α/2 ] U [ z1-α/2, +∞ [

If Z belongs to the rejection interval:
Reject H0 and accept H1 (proportions are different)
If Z is outside the rejection interval:
Accept H0 and therefore reject H1 (proportions are similar)

Equivalently, using the test significance:
If p ≤ α (significant Z statistic): reject H0 and accept H1
If p > α (not significant Z statistic): accept H0 and therefore reject H1
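A sketch of this test in Python with scipy.stats.binomtest, reusing the counts from the CASE-tool example on the next slides (646 of 1900 projects use CASE tools; scipy stands in for SPSS here):

```python
from scipy import stats

# Two-sided test: H0 says the two proportions are even (p0 = 0.5)
result = stats.binomtest(k=646, n=1900, p=0.5)
print(f"observed proportion={646/1900:.3f}, p-value={result.pvalue:.2e}")

# One-sided variant in the spirit of the second example: H0 says the
# proportion of non-users is 2/3 (i.e. twice the proportion of users)
result_1t = stats.binomtest(k=1254, n=1900, p=2/3, alternative='less')
print(f"one-sided p-value={result_1t.pvalue:.3f}")
```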

Page 40: Experimental Software Engineering – Statistical Tests

Example (1/3)

Objective: assess if the proportions of CASE tool usage / non-usage are even

CASE tool usage is represented by a dichotomous variable that splits subjects into 2 samples (groups)

One group corresponds to the projects using CASE tools (label "Yes") and the other to those projects that are not using them (label "No")

SPSS: Analyze > Nonparametric Tests > Binomial

Example (2/3)

We can enter a test proportion for the first group. The probability for the second group will be 1 minus the specified probability for the first group.

Binomial Test

Case tool usage  Category  N     Observed Prop.  Test Prop.  Asymp. Sig. (2-tailed)
Group 1          No        1254  ,66             ,50         ,000(a)
Group 2          Yes       646   ,34
Total                      1900  1,00

a. Based on Z Approximation.

With a confidence level greater than 99,99% we can reject the null hypothesis

Conclusion:
There is a statistically significant difference between the proportion of projects that use CASE tools and those that don't.

Page 41: Experimental Software Engineering – Statistical Tests

Example (3/3)

Here we are testing if the proportion of projects not using CASE tools is twice as large as the proportion of those using them.

Binomial Test

Case tool usage  Category  N     Observed Prop.  Test Prop.  Asymp. Sig. (1-tailed)
Group 1          No        1254  ,660            ,666        ,297(a,b)
Group 2          Yes       646   ,340
Total                      1900  1,000

a. Alternative hypothesis states that the proportion of cases in the first group < ,666.
b. Based on Z Approximation.

Even with a confidence level of 90% we cannot reject the null hypothesis

Conclusion:
We accept that the proportion of projects not using CASE tools is twice as large as the proportion of those that do use them!

Chi-Square test (test of proportions)

Page 42: Experimental Software Engineering – Statistical Tests

Chi-Square test
Applicability

This goodness-of-fit test compares the observed and expected frequencies in each category to test whether all categories contain the same proportion of values

Can also test if each category contains a user-specified proportion of values

It can be used to test if 2 or more independent samples (groups) differ regarding a given factor

Chi-Square test
Design & Scales

Design: 1 factor, 2 or more treatments, independent samples

Scales:
Factor (grouping) variable: categorical
Outcome variable: N/A

Page 43: Experimental Software Engineering – Statistical Tests

Chi-square test
Hypotheses being tested

H0: the expected proportions of the groups (in the population) are similar
Groups do not differ significantly in size
The effect of the factor is negligible

H1: the proportions in the groups are different
Groups differ significantly in size
The effect of the factor is not negligible

Test decision:
If p ≤ α (significant Chi-Square statistic):
Reject H0 and accept H1 (proportions are different)
If p > α (not significant Chi-Square statistic):
Accept H0 and therefore reject H1 (proportions are similar)

Chi-square test
Preconditions

The Chi-square operates on a contingency table
Rows and columns represent the categories of the two variables
Each cell contains the number of observations for a given pair of values (factor, outcome variable)

The Chi-Square preconditions are:
The sample must be large enough (n > 20)
All contingency values must be > 1
At least 80% of the contingency values must be > 5

Primary Programming Language * Case tool usage Crosstabulation (Count)

                              Case tool usage
Primary Programming Language  No   Yes  Total
Cobol                         279  121  400
Visual Basic                  229  24   253
C++                           26   42   68
Java                          61   36   97
Total                         595  223  818

This is a contingency table where all the preconditions are met
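A sketch of the test on that very contingency table, in Python with scipy.stats.chi2_contingency (the counts are from the slide; scipy is assumed as a stand-in for SPSS):

```python
import numpy as np
from scipy import stats

# Rows: Cobol, Visual Basic, C++, Java; columns: CASE tool usage (No, Yes)
observed = np.array([[279, 121],
                     [229,  24],
                     [ 26,  42],
                     [ 61,  36]])

chi2, p, dof, expected = stats.chi2_contingency(observed)
print(f"chi2={chi2:.2f}, df={dof}, p={p:.2e}")
print("all expected counts > 5:", bool((expected > 5).all()))   # precondition check
```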

Page 44: Experimental Software Engineering – Statistical Tests

Example (1/3)

Objective:
Is the adoption of CASE tools dependent on the programming language used?

If there is some sort of dependence, then the proportions in the groups will not be similar.

SPSS: Analyze > Descriptive Statistics > Crosstabs


Example (2/3)

Page 45: Experimental Software Engineering – Statistical Tests

Example (3/3)

Case Processing Summary

                                                Valid         Missing        Total
                                                N    Percent  N     Percent  N     Percent
Primary Programming Language * Case tool usage  639  22,5%    2201  77,5%    2840  100,0%

Primary Programming Language * Case tool usage Crosstabulation

                                          Case tool usage
Primary Programming Language              No     Yes    Total
Cobol          Count                      222    106    328
               Expected Count             243,3  84,7   328,0
Visual Basic   Count                      192    17     209
               Expected Count             155,0  54,0   209,0
C++            Count                      11     28     39
               Expected Count             28,9   10,1   39,0
Java           Count                      49     14     63
               Expected Count             46,7   16,3   63,0
Total          Count                      474    165    639
               Expected Count             474,0  165,0  639,0

Chi-Square Tests

                              Value      df  Asymp. Sig. (2-sided)
Pearson Chi-Square            84,822(a)  3   ,000
Likelihood Ratio              86,160     3   ,000
Linear-by-Linear Association  ,565       1   ,452
N of Valid Cases              639

a. 0 cells (,0%) have expected count less than 5. The minimum expected count is 10,07.

With a confidence level of 99% we can reject the null hypothesis

Conclusion:
The adoption of CASE tools depends on the programming language

Mann-Whitney test (aka U-test)

Page 46: Experimental Software Engineering – Statistical Tests

Mann-Whitney test
Applicability

This is the non-parametric analog of the t-test
Instead of comparing the averages of the 2 samples, it compares their central tendency to detect differences

It can be used to test if 2 samples differ regarding a given factor

Mann-Whitney test
Design & Scales

Design: 1 factor, 2 treatments, independent, random samples (between groups)

Scales:
Factor (grouping) variable: categorical
Outcome variable: at least ordinal scale

Assumptions:
The two tested samples should be similar in shape

Page 47: Experimental Software Engineering – Statistical Tests

Mann-Whitney test
Hypotheses being tested

H0: The two populations from which the samples for the two groups were taken have similar central tendency
The groups are not affected by the factor variable

H1: The two populations do not have similar central tendency
The groups are affected by the factor variable

U statistic:
This statistic is used to test the above hypotheses
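A minimal sketch with scipy.stats.mannwhitneyu (synthetic, skewed data; the slides run the same test in SPSS):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
effort_cobol = rng.lognormal(3.0, 1.0, size=77)   # synthetic effort samples
effort_java = rng.lognormal(3.3, 1.0, size=24)

u, p = stats.mannwhitneyu(effort_cobol, effort_java, alternative='two-sided')
alpha = 0.10
print(f"U={u:.1f}, p={p:.3f}:",
      "central tendencies differ" if p <= alpha else "no significant difference")
```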

Example (1/3)

Objective: assess if the effort per each development phase is different between two languages (Cobol and Java)

Each independent sample (group) corresponds to the projects (cases) that use the same programming language (PL)

Let c and j be indexes identifying Cobol and Java, respectively. Then, the underlying hypotheses for this test are the following:

H0: ∀i,j : Ec ~ Ej
H1: ¬ ∀i,j : Ec ~ Ej

If we reject the null hypothesis that the samples do not differ on the criterion (factor or grouping) variable (the PL), then we can sustain that the statistical distributions of the efforts per phase for each group of projects (corresponding to a PL) are different.

In other words, we would accept the alternative hypothesis that the PL has an influence on the effort per phase.

Notice that since we have several phases, we have to perform one test for each phase

Page 48: Experimental Software Engineering – Statistical Tests

Example (2/3)

First we must verify, for each effort kind, if the groups corresponding to different languages have distributions with similar shapes, by using the Kolmogorov-Smirnov Z test.
H0 – The statistical distribution of the effort is similar in both programming languages
H1 – The statistical distribution of the effort is significantly different in both programming languages

SPSS: Analyze > Nonparametric Tests > 2 Independent Samples

Test Statistics(a) (1 is Cobol and 4 is Java)

                        Effort Plan  Effort Specify  Effort Design  Effort Build  Effort Test  Effort Implement
Most Extreme  Absolute  ,215         ,130            ,550           ,215          ,225         ,222
Differences   Positive  ,215         ,130            ,200           ,215          ,225         ,000
              Negative  -,088        -,050           -,550          -,014         -,031        -,222
Kolmogorov-Smirnov Z    ,919         ,628            ,977           1,128         1,162        ,998
Asymp. Sig. (2-tailed)  ,367         ,825            ,295           ,157          ,134         ,272

a. Grouping Variable: Primary Programming Language

We accept that the 2 groups have similar statistical distributions (with a confidence level of 99%) for all efforts being tested

Example (3/3)

SPSS: Analyze > Nonparametric Tests > 2 Independent Samples

H0 – the two samples have similar central tendency on the effort
H1 – the two samples do not have similar central tendency on the effort

Ranks

                  Primary Programming  N    Mean Rank  Sum of Ranks
Effort Plan       Cobol                77   49,22      3790,00
                  Java                 24   56,71      1361,00
                  Total                101
Effort Specify    Cobol                108  68,27      7373,50
                  Java                 30   73,92      2217,50
                  Total                138
Effort Design     Cobol                4    8,00       32,00
                  Java                 15   10,53      158,00
                  Total                19
Effort Build      Cobol                142  85,42      12130,00
                  Java                 34   101,35     3446,00
                  Total                176
Effort Test       Cobol                160  93,84      15014,00
                  Java                 32   109,81     3514,00
                  Total                192
Effort Implement  Cobol                106  69,46      7363,00
                  Java                 25   51,32      1283,00
                  Total                131

Test Statistics(b)

                                Effort Plan  Effort Specify  Effort Design  Effort Build  Effort Test  Effort Implement
Mann-Whitney U                  787,000      1487,500        22,000         1977,000      2134,000     958,000
Wilcoxon W                      3790,000     7373,500        32,000         12130,000     15014,000    1283,000
Z                               -1,093       -,684           -,800          -1,638        -1,485       -2,150
Asymp. Sig. (2-tailed)          ,274         ,494            ,424           ,102          ,138         ,032
Exact Sig. [2*(1-tailed Sig.)]                               ,469(a)

a. Not corrected for ties.
b. Grouping Variable: Primary Programming Language

The efforts to plan, specify and design do not differ significantly between Cobol and Java.
The efforts to build and implement may be considered significantly different with a confidence level of 90%.

Page 49: Experimental Software Engineering – Statistical Tests

Kruskal-Wallis H test (one-way analysis of variance)

Kruskal-Wallis test
Applicability

Is an extension of the Mann-Whitney U test

Is the nonparametric equivalent of the One-Way ANOVA

Assesses whether several independent samples have a common localization parameter

Each sample is a group of subjects corresponding to the application of a given treatment (a level of the factor variable)

Page 50: Experimental Software Engineering – Statistical Tests

Kruskal-Wallis test
Design & Scales

Design: 1 factor with more than 2 treatments, independent, random samples (between groups)

Scales:
Factor (grouping) variable: categorical
Outcome variable: any type

Assumptions:
The tested samples should be similar in shape

Kruskal-Wallis test
Hypotheses being tested

H0: The distributions of the populations from where each group was extracted have the same localization parameter
Groups do not differ significantly
The effect of the factor is negligible

H1: At least one of the distributions has a localization parameter that is smaller or greater than the others
At least one sample (group) differs significantly
The effect of the factor is not negligible

Page 51: Experimental Software Engineering – Statistical Tests

Kruskal-Wallis test
Test decision

The calculated H test statistic is distributed approximately as chi-square
From a chi-square table, with the given df (degrees of freedom) and for a stipulated significance α (probability of a Type I error), we obtain a critical value of the chi-square to be compared with the calculated H statistic

Test decision:
We reject H0, for a given level of significance α, if:

Hcalc > H(1-α, df)
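A sketch of this decision in Python with scipy.stats.kruskal, comparing Hcalc against the chi-square critical value (synthetic data; the group sizes echo the example that follows):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(13)
cobol = rng.lognormal(2.9, 0.9, size=509)   # synthetic effort-per-FP samples
vb = rng.lognormal(2.5, 0.9, size=265)
cpp = rng.lognormal(3.1, 0.9, size=116)
java = rng.lognormal(2.6, 0.9, size=204)

h, p = stats.kruskal(cobol, vb, cpp, java)
df = 4 - 1                                  # k groups - 1
crit = stats.chi2.ppf(1 - 0.001, df)        # critical value for alpha = 0.001
print(f"H={h:.2f}, p={p:.2e}, critical value (df={df}, 99.9%)={crit:.3f}")
print("reject H0:", h > crit)
```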

Chi-square distribution table

df   90%     95%     97.5%   99%     99.5%   99.9%
1    2.706   3.841   5.024   6.635   7.879   10.827
2    4.605   5.991   7.378   9.210   10.597  13.815
3    6.251   7.815   9.348   11.345  12.838  16.268
4    7.779   9.488   11.143  13.277  14.860  18.465
5    9.236   11.070  12.832  15.086  16.750  20.517
6    10.645  12.592  14.449  16.812  18.548  22.457
7    12.017  14.067  16.013  18.475  20.278  24.322
8    13.362  15.507  17.535  20.090  21.955  26.125
9    14.684  16.919  19.023  21.666  23.589  27.877
10   15.987  18.307  20.483  23.209  25.188  29.588
11   17.275  19.675  21.920  24.725  26.757  31.264
12   18.549  21.026  23.337  26.217  28.300  32.909
13   19.812  22.362  24.736  27.688  29.819  34.528
14   21.064  23.685  26.119  29.141  31.319  36.123
15   22.307  24.996  27.488  30.578  32.801  37.697
16   23.542  26.296  28.845  32.000  34.267  39.252
17   24.769  27.587  30.191  33.409  35.718  40.790
18   25.989  28.869  31.526  34.805  37.156  42.312
19   27.204  30.144  32.852  36.191  38.582  43.820
20   28.412  31.410  34.170  37.566  39.997  45.315
21   29.615  32.671  35.479  38.932  41.401  46.797
22   30.813  33.924  36.781  40.289  42.796  48.268
23   32.007  35.172  38.076  41.638  44.181  49.728
24   33.196  36.415  39.364  42.980  45.558  51.179
25   34.382  37.652  40.646  44.314  46.928  52.620
26   35.563  38.885  41.923  45.642  48.290  54.052
27   36.741  40.113  43.194  46.963  49.645  55.476
28   37.916  41.337  44.461  48.278  50.993  56.893
29   39.087  42.557  45.722  49.588  52.336  58.302
30   40.256  43.773  46.979  50.892  53.672  59.703

Page 52: Experimental Software Engineering – Statistical Tests

Example (1/2)

Objective: assess the impact of the adopted programming language (PL) on the normalized work effort (E)

Each independent sample (group) corresponds to the projects (cases) that use the same PL

Let i and j be two different PLs. Then, the underlying hypotheses for this test are the following:

H0: ∀i,j : Ei ~ Ej
H1: ¬ ∀i,j : Ei ~ Ej

If we reject the null hypothesis that the samples do not differ on the criterion (factor or grouping) variable (the PL), then we can sustain that the statistical distributions of the groups of projects' NWE corresponding to each of the PLs are different.

In other words, we would accept the alternative hypothesis that the PL has an influence on E.

Example (2/2)

SPSS: Analyze > Nonparametric Tests > K Independent Samples

Ranks (Effort per Adjusted FP)

Language      N     Mean Rank
Cobol         509   606,70
Visual Basic  265   454,29
C++           116   646,58
Java          204   464,53
Total         1094

Test Statistics(a,b)

             Effort per Adjusted FP
Chi-Square   66,405
df           3
Asymp. Sig.  ,000

a. Kruskal Wallis Test
b. Grouping Variable: Language

Extract of the Chi-Square table:
df  90%    95%    97.5%  99%     99.5%   99.9%
3   6.251  7.815  9.348  11.345  12.838  16.268

Even for a confidence level of 99.9% we have Chi-SquareCALC > Chi-Square(3; 0.001), so we can reject the null hypothesis. Therefore the effect of the language on the effort per FP is not negligible

Ranks give us an indication of the relative influence of each language on the effort per FP. Notice that C++ is the language requiring the most effort and Visual Basic the least!

Page 53: Experimental Software Engineering – Statistical Tests

Classroom example

A CIO wants to know if using 4 different DBMS has an effect on the cost per delivered Function Point.

Factor: DBMS {DB2, Oracle, SQL Server, Access}

Outcome variable: Cost per Adjusted FP

The K-W test lets us test if the average cost per FP is the same across the 4 groups

Nonparametric Factorial ANOVA

Page 54: Experimental Software Engineering – Statistical Tests

Nonparametric Factorial ANOVA
Applicability

This procedure is used to test if a given set of factors has a significant effect on a given variable

Allows determining the effect of each factor
Allows assessing the interaction among factors (aka moderation)

This procedure is similar to the (parametric) Factorial ANOVA, but the H statistic is calculated based upon the ranks of cases within each group

Remember that in the parametric version we used the F statistic, which is calculated upon the values of the outcome variable itself

Nonparametric Factorial ANOVA
How to perform?

The basic distribution of SPSS does not support the Nonparametric Factorial ANOVA (not even the Two-Way)

There are several alternatives to perform this test (see the rank-transform sketch after this list):
Use another tool instead of SPSS (R has this procedure for free)
Get an advanced SPSS module that supports nonparametric ANOVA (may be expensive)
Program this procedure in the SPSS syntax language (VB-like), or find someone on the Internet who has done it already
Transform the outcome variable into a Normally distributed one and, if successful, use the parametric Factorial ANOVA
Use Excel to implement the test statistic H and then use a Chi-Square table to make the test decision
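One more route in the same spirit, shown here as a hedged sketch: the Conover-Iman rank transform (a technique not prescribed by the slides), where the outcome is replaced by its ranks and a parametric factorial ANOVA is run on them, in Python with pandas/statsmodels and synthetic data:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(17)
n = 240
df = pd.DataFrame({
    "language": rng.choice(["Cobol", "Java"], size=n),
    "devtype": rng.choice(["New", "Enhancement"], size=n),
})
df["effort"] = rng.lognormal(3.0, 1.0, size=n)   # skewed, non-Normal outcome

# Rank transform: rank the outcome over the whole sample, then run the
# usual factorial ANOVA (main effects + interaction) on the ranks
df["effort_rank"] = df["effort"].rank()
model = smf.ols("effort_rank ~ C(language) * C(devtype)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))           # effects of F1, F2, F1*F2
```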

Page 55: Experimental Software Engineering – Statistical Tests

Nonparametric Factorial ANOVA
Design & Scales

Design: 2+ factors, 2+ levels per factor, independent samples
If we only have 2 factors, this is called Nonparametric Two-way ANOVA

A treatment corresponds to a combination (tuple) of factor levels, such as (Java; Eclipse) if the factors are programming language and development environment. A treatment is often called a model cell.

Scales:
Factor (grouping) variable: categorical (recoded)
Outcome variable: at least ordinal

Nonparametric Factorial ANOVA
Main and interaction effects

Consider that you have three factors F1, F2 and F3

Main effects
These are the effects on the outcome variable caused by each factor alone, as we did with the Kruskal-Wallis test
These are represented by F1, F2, F3

Interaction effects
These are the cross-factor effects caused by the combined action of all combinations of factors, which are:
F1*F2, F1*F3, F2*F3, F1*F2*F3

Overall effect representation in the GLM:
I + F1 + F2 + F3 + F1*F2 + F1*F3 + F2*F3 + F1*F2*F3
where I is an intercept term (similar to that used in linear regression)

Page 56: Experimental Software Engineering – Statistical Tests

Nonparametric Factorial ANOVA
Main effects hypotheses (one for each factor)

H0: The distributions of the populations from where each group was extracted have the same localization parameter
Groups do not differ significantly
The effect of the factor is negligible

H1: At least one of the distributions has a localization parameter that is smaller or greater than the others
At least one sample (group) differs significantly
The effect of the factor is not negligible

Nonparametric Factorial ANOVA
Main effects test decision (one for each factor)

As seen in the Kruskal-Wallis test, the calculated H test statistic is distributed approximately as chi-square

Test decision:
We reject H0, for a given level of significance α, if:

Hcalc > H(1-α, df)

Get the critical value H(1-α, df) from the Chi-Square table presented in the Kruskal-Wallis test

Page 57: Experimental Software Engineering – Statistical Tests

Nonparametric Factorial ANOVA
Interaction effects hypotheses (one for each interaction)

H0: There is no interaction among the factors
∀ i,j: γi,j = 0 (i ≠ j)

H1: There is interaction between at least two factors
∃ i,j: γi,j ≠ 0 (i ≠ j)

Nonparametric Factorial ANOVA
Interaction effects test decision (one for each interaction)

We reject H0, for a given level of significance α, if:

Hcalc > H(1-α, df)

Get the critical value H(1-α, df) from the Chi-Square table presented in the Kruskal-Wallis test