HCI 510 : HCI Methods I Statistics. HCI 510: HCI Methods I Descriptive Statistics Inferential Statistics Significance T-Test

• Descriptive Statistics

• Inferential Statistics

• Significance

• T-Test




Descriptive Statistics

Descriptive vs. Inferential statistics


Descriptive statistics are used to describe the basic features of the data in a study.

They provide simple summaries about the sample and the measures.

Together with simple graphical analysis, they form the basis of virtually every quantitative analysis of data.


Descriptive statistics are typically distinguished from inferential statistics.

With descriptive statistics you are simply describing what is or what the data shows.

With inferential statistics, you are trying to reach conclusions that extend beyond the immediate data alone.


For instance, we use inferential statistics to try to infer from the sample data what the population might think.

Or, we use inferential statistics to make judgments of the probability that an observed difference between groups is a dependable one or one that might have happened by chance in this study.

Thus, we use inferential statistics to make inferences from our data to more general conditions; we use descriptive statistics simply to describe what's going on in our data.


Descriptive Statistics are used to present quantitative descriptions in a manageable form.

In a research study we may have lots of measures.

Or we may measure a large number of people on any measure.

Descriptive statistics help us to simplify large amounts of data in a sensible way.

Each descriptive statistic reduces lots of data into a simpler summary.


For instance, consider a simple number used to summarize how well a batter is performing in baseball, the batting average.

This single number is simply the number of hits divided by the number of times at bat.

A batter who is hitting .333 is getting a hit one time in every three at bats.

One batting .250 is hitting one time in four.

The single number describes a large number of discrete events.


Or, consider the scourge of many students, the Grade Point Average (GPA).

This single number describes the general performance of a student across a potentially wide range of course experiences.


Every time you try to describe a large set of observations with a single indicator you run the risk of distorting the original data or losing detail.

The batting average doesn't tell you whether the batter is hitting home runs or singles. It doesn't tell whether she's been in a slump or on a streak.

The GPA doesn't tell you whether the student was in difficult courses or easy ones, or whether they were courses in their major field or in other disciplines.

Even given these limitations, descriptive statistics provide a powerful summary that may enable comparisons across people or other units.


Univariate Analysis

Univariate analysis involves the examination across cases of one variable at a time. There are three major characteristics of a single variable that we tend to look at:

• the distribution

• the central tendency

• the dispersion

In most situations, we would describe all three of these characteristics for each of the variables in our study.


The Distribution.

The distribution is a summary of the frequency of individual values or ranges of values for a variable.

The simplest distribution would list every value of a variable and the number of persons who had each value.


For instance, a typical way to describe the distribution of college students is by year in college, listing the number or percent of students at each of the four years.

Or, we describe gender by listing the number or percent of males and females.

In these cases, the variable has few enough values that we can list each one and summarize how many sample cases had the value.


But what do we do for a variable like income or GPA?

With these variables there can be a large number of possible values, with relatively few people having each one.

In this case, we group the raw scores into categories according to ranges of values.

For instance, we might look at GPA according to the letter grade ranges. Or, we might group income into four or five ranges of income values.


One of the most common ways to describe a single variable is with a frequency distribution.

Depending on the particular variable, all of the data values may be represented, or you may group the values into categories first. With age, price, or temperature variables, for example, it would usually not be sensible to determine the frequencies for each value.

Rather, the values are grouped into ranges and the frequencies determined.
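As a rough illustration of grouping values into ranges, the sketch below counts how many cases fall into each range. The ages and the five range boundaries are invented for the example; they are not the data behind the table in these slides.

```python
from collections import Counter

# Hypothetical ages, invented for illustration only.
ages = [19, 23, 31, 45, 52, 38, 27, 61, 33, 29, 47, 22]

# Hypothetical ranges: each (low, high) pair covers low <= age <= high.
bins = [(18, 29), (30, 39), (40, 49), (50, 59), (60, 69)]


def bin_label(value, bins):
    """Return the 'low-high' label of the range that contains value."""
    for low, high in bins:
        if low <= value <= high:
            return f"{low}-{high}"
    raise ValueError(f"{value} falls outside every range")


# Count how many ages land in each range.
frequencies = Counter(bin_label(age, bins) for age in ages)

# Print the frequency table as counts and percentages.
for low, high in bins:
    label = f"{low}-{high}"
    count = frequencies[label]
    percent = 100 * count / len(ages)
    print(f"{label}: {count} ({percent:.1f}%)")
```

The same counts could be drawn as bars to give the graphical form of the distribution.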


Frequency distributions can be depicted in two ways, as a table or as a graph. The table shows an age frequency distribution with five categories of age ranges defined.


The same frequency distribution can be depicted in a graph as shown in the figure. This type of graph is often referred to as a histogram or bar chart.


Central Tendency.

The central tendency of a distribution is an estimate of the "center" of a distribution of values.

There are three major types of estimates of central tendency:

• Mean

• Median

• Mode


The Mean or average is probably the most commonly used method of describing central tendency.

To compute the mean all you do is add up all the values and divide by the number of values.

For example, the mean or average quiz score is determined by summing all the scores and dividing by the number of students taking the exam.


For example, consider the test score values:

15, 20, 21, 20, 36, 15, 25, 15

The sum of these 8 values is 167, so the mean is 167/8 = 20.875.
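The same arithmetic can be checked in a few lines of Python. This is only a sketch of the calculation described above, using the eight test scores from the example.

```python
# The eight test scores from the example.
scores = [15, 20, 21, 20, 36, 15, 25, 15]

# Add up all the values, then divide by the number of values.
total = sum(scores)
mean = total / len(scores)

print(total, mean)
```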


The Median is the score found at the exact middle of the set of values.

One way to compute the median is to list all scores in numerical order, and then locate the score in the center of the sample.


The Median

For example, if there are 500 scores in the list, the median lies halfway between score #250 and score #251. If we order the 8 scores, we would get:

15,15,15,20,20,21,25,36

There are 8 scores and score #4 and #5 represent the halfway point. Since both of these scores are 20, the median is 20.

If the two middle scores had different values, you would have to interpolate to determine the median.
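A minimal sketch of this procedure follows. It assumes the usual convention of taking the point halfway between the two middle scores when the number of scores is even; the function name is just for illustration.

```python
def median(values):
    """Return the middle score of values, averaging the two middle scores if needed."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd number of scores: a single score sits in the exact middle.
        return ordered[middle]
    # Even number of scores: take the point halfway between the two middle scores.
    return (ordered[middle - 1] + ordered[middle]) / 2


scores = [15, 20, 21, 20, 36, 15, 25, 15]
print(median(scores))
```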


The mode is the most frequently occurring value in the set of scores.

To determine the mode, you might again order the scores as shown above, and then count each one. The most frequently occurring value is the mode.

In our example (15, 20, 21, 20, 36, 15, 25, 15), the value 15 occurs three times and is the mode.

In some distributions there is more than one modal value. For instance, in a bimodal distribution there are two values that occur most frequently.
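Counting each value can be sketched as below. The helper name `modes` is hypothetical; it returns every value that ties for the highest count, so a bimodal set of scores yields two values.

```python
from collections import Counter


def modes(values):
    """Return all values that occur most frequently, in ascending order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(value for value, count in counts.items() if count == highest)


scores = [15, 20, 21, 20, 36, 15, 25, 15]
print(modes(scores))           # one modal value
print(modes([1, 1, 2, 2, 3]))  # a bimodal example with made-up scores
```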


Notice that for the same set of 8 scores we got three different values (20.875, 20, and 15) for the mean, median and mode respectively.

If the distribution is truly normal (i.e., bell-shaped), the mean, median and mode are all equal to each other.


Dispersion.

Dispersion refers to the spread of the values around the central tendency.

There are two common measures of dispersion, the range and the standard deviation.


The range is simply the highest value minus the lowest value. In our example distribution, the high value is 36 and the low is 15, so the range is 36 - 15 = 21.


The Standard Deviation is a more accurate and detailed estimate of dispersion because an outlier can greatly exaggerate the range.

Look at the set of scores: 15, 20, 21, 20, 36, 15, 25, 15. The single outlier value of 36 stands apart from the rest of the values.

The Standard Deviation shows the relation that set of scores has to the mean of the sample.
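To see how much a single outlier can drive the range, the sketch below computes the range of the example scores with and without the value 36. Dropping the outlier is done here only to illustrate the point, not as a recommended analysis step.

```python
scores = [15, 20, 21, 20, 36, 15, 25, 15]

# Range: highest value minus lowest value.
full_range = max(scores) - min(scores)

# The same calculation with the single outlier (36) left out.
without_outlier = [score for score in scores if score != 36]
trimmed_range = max(without_outlier) - min(without_outlier)

print(full_range, trimmed_range)
```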


To compute the standard deviation, we first find the distance between each value and the mean.

We know from above that the mean is 20.875. So, the differences from the mean are:

15 - 20.875 = -5.875
20 - 20.875 = -0.875
21 - 20.875 = +0.125
20 - 20.875 = -0.875
36 - 20.875 = +15.125
15 - 20.875 = -5.875
25 - 20.875 = +4.125
15 - 20.875 = -5.875


Notice that values that are below the mean have negative discrepancies and values above it have positive ones.

Next, we square each discrepancy:

-5.875 * -5.875 = 34.515625
-0.875 * -0.875 = 0.765625
+0.125 * +0.125 = 0.015625
-0.875 * -0.875 = 0.765625
+15.125 * +15.125 = 228.765625
-5.875 * -5.875 = 34.515625
+4.125 * +4.125 = 17.015625
-5.875 * -5.875 = 34.515625


Now, we take these "squares" and sum them to get the Sum of Squares (SS) value. Here, the sum is 350.875.

Next, we divide this sum by the number of scores minus 1. Here, the result is 350.875 / 7 = 50.125. This value is known as the variance.

To get the standard deviation, we take the square root of the variance (remember that we squared the deviations earlier).

This would be SQRT(50.125) = 7.079901129253.
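The steps above can be followed one at a time in code. This is a sketch of the hand calculation, not a replacement for a statistics package; the variable names are chosen for readability.

```python
import math

scores = [15, 20, 21, 20, 36, 15, 25, 15]

# Step 1: distance between each value and the mean.
mean = sum(scores) / len(scores)
deviations = [score - mean for score in scores]

# Step 2: square each discrepancy.
squares = [deviation * deviation for deviation in deviations]

# Step 3: sum the squares to get the Sum of Squares (SS).
sum_of_squares = sum(squares)

# Step 4: divide by the number of scores minus 1 to get the variance.
variance = sum_of_squares / (len(scores) - 1)

# Step 5: take the square root to get the standard deviation.
std_dev = math.sqrt(variance)

print(sum_of_squares, variance, round(std_dev, 4))
```

Python's built-in `statistics.stdev` performs the same sample calculation (dividing by the number of scores minus 1) in a single call.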


Although this computation may seem convoluted, it's actually quite simple.

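To see this, consider the formula for the standard deviation:

$$ s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}} $$

where $X$ is each score, $\bar{X}$ is the mean, and $n$ is the number of scores.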


In the top part of the ratio, the numerator, we see that each score has the mean subtracted from it, the difference is squared, and the squares are summed.

In the bottom part, we take the number of scores minus 1.

The ratio is the variance and the square root is the standard deviation. We can describe the standard deviation as:

The square root of the sum of the squared deviations from the mean divided by the number of scores minus one


Although we can calculate these univariate statistics by hand, it gets quite tedious when you have more than a few values and variables.

Every statistics program and calculator is capable of calculating them easily for you.


The standard deviation allows us to reach some conclusions about specific scores in our distribution.

Assuming that the distribution of scores is normal or bell-shaped (or close to it!), the following conclusions can be reached:

• approximately 68% of the scores in the sample fall within one standard deviation of the mean

• approximately 95% of the scores in the sample fall within two standard deviations of the mean

• approximately 99.7% of the scores in the sample fall within three standard deviations of the mean


For instance, since the mean in our example is 20.875 and the standard deviation is 7.0799,

we can from the above statement estimate that approximately 95% of the scores will fall in the range of 20.875-(2*7.0799) to 20.875+(2*7.0799) or between 6.7152 and 35.0348.

This kind of information is a critical stepping stone to enabling us to compare the performance of an individual on one variable with their performance on another, even when the variables are measured on entirely different scales.
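The interval arithmetic above can be sketched as follows. The helper `share_within` is a hypothetical name; it reports what fraction of the scores actually fall inside a given number of standard deviations. With only eight scores and one outlier the distribution is not close to normal, so the observed share should not be expected to match the 68% and 95% figures.

```python
import statistics

scores = [15, 20, 21, 20, 36, 15, 25, 15]
mean = statistics.mean(scores)
std_dev = statistics.stdev(scores)  # sample standard deviation (n - 1)


def interval(k):
    """Return the (low, high) bounds k standard deviations around the mean."""
    return mean - k * std_dev, mean + k * std_dev


def share_within(values, k):
    """Return the fraction of values inside k standard deviations of the mean."""
    low, high = interval(k)
    inside = [value for value in values if low <= value <= high]
    return len(inside) / len(values)


low, high = interval(2)
print(round(low, 4), round(high, 4))
print(share_within(scores, 2))
```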


Worksheet 01

While performing a usability test on a new computer interface, the following data was collected:

1. Create a frequency distribution for each variable.

2. Calculate the mean, median and mode for each data set.

3. Calculate the range, variance and standard deviation for each data set.

4. What conclusions, if any, can be drawn from these statistics?




Inferential Statistics


With inferential statistics, you are trying to reach conclusions that extend beyond the immediate data alone.

For instance, we use inferential statistics to try to infer from the sample data what the population might think.

Or, we use inferential statistics to make judgments of the probability that an observed difference between groups is a dependable one or one that might have happened by chance in this study.



Thus, we use inferential statistics to make inferences from our data to more general conditions; we use descriptive statistics simply to describe what's going on in our data.



Inferential statistics are useful in experimental and quasi-experimental research design or in program outcome evaluation.



Perhaps one of the simplest inferential tests is used when you want to compare the average performance of two groups on a single measure to see if there is a difference.

You might want to know whether eighth-grade boys and girls differ in math test scores or whether a program group differs on the outcome measure from a control group.

Whenever you wish to compare the average performance between two groups you should consider the t-test for differences between groups.



Most of the major inferential statistics come from a general family of statistical models known as the General Linear Model.

This includes the t-test, Analysis of Variance (ANOVA), Analysis of Covariance (ANCOVA), regression analysis, and many of the multivariate methods like factor analysis, multidimensional scaling, cluster analysis, discriminant function analysis, and so on.

Significance

"Significance level" is a misleading term that many people do not fully understand. In normal English, "significant" means important, while in Statistics "significant" means unlikely to be due to chance alone.

A research finding may be real without being important. When statisticians say a result is "highly significant" they mean that a result this extreme would be very unlikely if only chance were at work.

They do not (necessarily) mean it is highly important.

Take a look at the table below.

The chi squares at the bottom of the table show two rows of numbers. The top row numbers of 0.07 and 24.4 are the chi square statistics themselves. The second row contains the values .795 and .001. These are the significance levels (the p-values that a statistical package reports).




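Significance levels show you how likely it is that a result at least as extreme as yours would arise by chance alone.

The most common level, used to mean something is good enough to be believed, is .95.

This is often read as the finding having a 95% chance of being true. Strictly, it means that a result this extreme would turn up less than 5% of the time if chance alone were at work.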

However, this value is also used in a misleading way.

No statistical package will show you "95%" or ".95" to indicate this level.

Instead it will show you ".05," meaning that a result at least this extreme would be expected five percent (.05) of the time if there were no real effect.

To find the significance level, subtract the number shown from one. For example, a value of ".01" corresponds to the 99% level (1-.01=.99).

In this table, there is no evidence of a difference in purchases of gasoline X by people in the city center and the suburbs, because the probability is .795 (i.e., a difference at least this large would be expected about 80% of the time by chance alone).

In contrast, the very small probability for type of vehicle (.001, the 99.9% level) is strong evidence of a true difference in purchases of Brand X by owners of different vehicles in the population from which the sample was drawn.

In all cases of calculating statistical significance, the p value tells you how likely a result at least as extreme as the one observed would be if the null hypothesis were true.

If a chi square test shows a probability of .04, it means that differences at least as large as those observed between the groups would occur only 4% of the time if the groups did not really differ.

If a t-test reports a probability of .07, it means that a difference at least this large between the two means would occur 7% of the time if the means were equal in the entire population.

Significance is a statistical term that tells how sure you are that a difference or relationship exists.

To say that a significant difference or relationship exists only tells half the story. We might be very sure that a relationship exists, but is it a strong, moderate, or weak relationship?

After finding a significant relationship, it is important to evaluate its strength.

Significant relationships can be strong or weak. Significant differences can be large or small. Whether a weak relationship or a small difference reaches significance depends largely on your sample size.

For example, suppose we give 1,000 people an IQ test, and we ask if there is a significant difference between male and female scores.

The mean score for males is 98 and the mean score for females is 100.

We use an independent groups t-test and find that the difference is significant at the .001 level.

The big question is, "So what?" The difference between 98 and 100 on an IQ test is a very small difference, so small, in fact, that it's not even important.

Then why did the t-statistic come out significant?

Because there was a large sample size. When you have a large sample size, very small differences will be detected as significant.

This means that you are very sure that the difference is real (i.e., it didn't happen by fluke). It doesn't mean that the difference is large or important.

If we had only given the IQ test to 25 people instead of 1,000, the two-point difference between males and females would not have been significant.
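The effect of sample size can be sketched with the t-ratio described in the T-Test section (difference between means divided by the standard error of the difference). The example does not state the spread of the scores, so the standard deviation of 15 and the equal group sizes below are assumptions made only for illustration; the exact significance level reached depends on those values.

```python
import math


def t_for_equal_groups(mean_difference, sd, n_per_group):
    """t-ratio for two equal-sized groups that share the same standard deviation."""
    variance = sd ** 2
    # Standard error of the difference: sqrt(var/n + var/n).
    standard_error = math.sqrt(variance / n_per_group + variance / n_per_group)
    return mean_difference / standard_error


# Assumed values: a 2-point difference, SD of 15, two equal groups.
small_sample_t = t_for_equal_groups(2, 15, 12)   # roughly 25 people in total
large_sample_t = t_for_equal_groups(2, 15, 500)  # 1,000 people in total

print(round(small_sample_t, 3), round(large_sample_t, 3))
```

The difference between the means is identical in both calls; only the number of people changes, and the t-ratio grows with the square root of the group size.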

People sometimes think that the 95% level is sacred.

If a test shows a .06 probability, it falls just short of that level. The evidence is slightly weaker than at .05, but it still points toward a real difference.

The 95% level comes from academic publications, where a result usually has to reach at least the 95% level to be considered worth telling people about.

In the 'real' world, if a result reaches only the 90% level (probability = .1), it can't be considered proven, but it is often better to act as if it were true rather than false.

Many researchers use the word "significant" to describe a finding that may have decision-making utility to a client.

From a statistician's viewpoint, this is an incorrect use of the word.

However, the word "significant" has virtually universal meaning to the public.

Thus, many researchers use the word "significant" to describe a difference or relationship that may be strategically important to a client (regardless of any statistical tests).

In these situations, the word "significant" is used to advise a client to take note of a particular difference or relationship because it may be relevant to the company's strategic plan.

The word "significant" is not the exclusive domain of statisticians and either use is correct in the business world.

Thus, for the HCI expert, it may be wise to adopt a policy of always referring to "statistical significance" rather than simply "significance" when communicating with the public.

One important concept in significance testing is whether you use a one-tailed or two-tailed test of significance.

The answer is that it depends on your hypothesis.

When your research hypothesis states the direction of the difference or relationship, then you use a one-tailed probability.

For example, a one-tailed test would be used to test these null hypotheses:

• Females will not score significantly higher than males on an IQ test.

• Blue collar workers will not buy significantly more product than white collar workers.

• Superman is not significantly stronger than the average person.

In each case, the null hypothesis (indirectly) predicts the direction of the difference.

A two-tailed test would be used to test these null hypotheses:

• There will be no significant difference in IQ scores between males and females.

• There will be no significant difference in the amount of product purchased between blue collar and white collar workers.

• There is no significant difference in strength between Superman and the average person.

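For a symmetric test statistic such as t, the one-tailed probability is exactly half the value of the two-tailed probability, provided the observed difference lies in the predicted direction.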
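The halving relationship can be sketched with the standard normal distribution from Python's standard library. Using the normal curve instead of the t distribution is a simplifying assumption (reasonable for large samples), and the function name is hypothetical.

```python
from statistics import NormalDist


def tail_probabilities(z):
    """Return (one_tailed, two_tailed) probabilities for a standard normal z value.

    The one-tailed value assumes the observed difference is in the predicted direction.
    """
    one_tailed = 1 - NormalDist().cdf(abs(z))
    two_tailed = 2 * one_tailed
    return one_tailed, two_tailed


one_tailed, two_tailed = tail_probabilities(1.96)
print(round(one_tailed, 3), round(two_tailed, 3))
```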

Whenever we perform a significance test, it involves comparing a test value that we have calculated to some critical value for the statistic.

It doesn't matter what type of statistic we are calculating (e.g., a t-statistic, a chi-square statistic, an F-statistic, etc.), the procedure to test for significance is the same.

1. Decide on the critical alpha level you will use (i.e., the error rate you are willing to accept).

2. Conduct the research.

3. Calculate the statistic.

4. Compare the statistic to a critical value obtained from a table.

If your statistic is higher than the critical value from the table:

• Your finding is significant.

• You reject the null hypothesis.

• The probability is small that the difference or relationship happened by chance, and p is less than the critical alpha level (p < alpha).

If your statistic is lower than the critical value from the table:

• Your finding is not significant.

• You fail to reject the null hypothesis.

• The probability is too high to rule out that the difference or relationship happened by chance, and p is greater than the critical alpha level (p > alpha).
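The comparison step can be sketched as a small function. Comparing the absolute value of the statistic is an assumption that suits a two-tailed test of a statistic such as t; the critical value of 2.101 used in the example is the standard two-tailed table entry for alpha = .05 with 18 degrees of freedom, and the statistic values are made up.

```python
def significance_decision(statistic, critical_value):
    """Compare a calculated statistic to a critical value from a table."""
    if abs(statistic) > critical_value:
        return "significant: reject the null hypothesis"
    return "not significant: fail to reject the null hypothesis"


# Hypothetical t statistics checked against the critical value 2.101.
print(significance_decision(2.85, 2.101))
print(significance_decision(1.40, 2.101))
```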




T-Test

The t-test assesses whether the means of two groups are statistically different from each other.

This analysis is appropriate whenever you want to compare the means of two groups, and especially appropriate as the analysis for the posttest-only two-group randomized experimental design.

The figure shows the distributions for the treated (orange) and control (purple) groups in a study.

The figure indicates where the control and treatment group means are located.

The question the t-test addresses is whether the means are statistically different.

What does it mean to say that the averages for two groups are statistically different?

The first thing to notice about the three situations is that the difference between the means is the same in all three. But, you should also notice that the three situations don't look the same -- they tell very different stories.

The top example shows a case with moderate variability of scores within each group. The second situation shows the high-variability case. The third shows the case with low variability.

Clearly, we would conclude that the two groups appear most different or distinct in the bottom or low-variability case.

Why? Because there is relatively little overlap between the two bell-shaped curves.

In the high variability case, the group difference appears least striking because the two bell-shaped distributions overlap so much.

This leads us to a very important conclusion: when we are looking at the differences between scores for two groups, we have to judge the difference between their means relative to the spread or variability of their scores.

The t-test does just this.

Statistical Analysis of the t-test

The formula for the t-test is a ratio.

The top part of the ratio is just the difference between the two means or averages.

The bottom part is a measure of the variability or dispersion of the scores.

This formula is essentially another example of the signal-to-noise metaphor in research:

the difference between the means is the signal that, in this case, we think our program or treatment introduced into the data;

the bottom part of the formula is a measure of variability that is essentially noise that may make it harder to see the group difference.

The figure below shows the formula for the t-test and how the numerator and denominator are related to the distributions.

The top part of the formula is easy to compute -- just find the difference between the means.

The bottom part is called the standard error of the difference. To compute it, we take the variance for each group and divide it by the number of people in that group.

We add these two values and then take their square root.
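The standard error of the difference can be sketched directly from that description. The two groups of scores are invented for the example, and the function name is hypothetical.

```python
import math
from statistics import variance  # sample variance (divides by n - 1)


def standard_error_of_difference(group_a, group_b):
    """Square root of (variance of A / size of A + variance of B / size of B)."""
    term_a = variance(group_a) / len(group_a)
    term_b = variance(group_b) / len(group_b)
    return math.sqrt(term_a + term_b)


# Hypothetical scores, invented for illustration only.
treated = [12, 15, 14, 16, 13]
control = [10, 11, 13, 12, 9]

standard_error = standard_error_of_difference(treated, control)
print(standard_error)
```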

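The final formula for the t-test is:

$$ t = \frac{\bar{X}_T - \bar{X}_C}{\sqrt{\frac{\mathrm{var}_T}{n_T} + \frac{\mathrm{var}_C}{n_C}}} $$

where $\bar{X}_T$ and $\bar{X}_C$ are the means of the treated and control groups, $\mathrm{var}_T$ and $\mathrm{var}_C$ are their variances, and $n_T$ and $n_C$ are the number of people in each group.

Remember that the variance is simply the square of the standard deviation.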

The t-value will be positive if the first mean is larger than the second and negative if it is smaller.

Once you compute the t-value you have to look it up in a table of significance to test whether the ratio is large enough to say that the difference between the groups is not likely to have been a chance finding.

To test the significance, you need to set a risk level (called the alpha level).

You also need to determine the degrees of freedom (df) for the test.

In the t-test, the degrees of freedom is the sum of the persons in both groups minus 2.
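Putting the pieces together, a minimal sketch of the calculation might look like this. It follows the ratio as these slides describe it (difference between means over the standard error of the difference) with degrees of freedom equal to the total number of people minus 2. The scores are invented, and a statistics package may apply a slightly different variant of the test.

```python
import math
from statistics import mean, variance


def t_test(group_a, group_b):
    """Return (t_value, degrees_of_freedom) for two independent groups."""
    # Numerator: the difference between the two means.
    difference = mean(group_a) - mean(group_b)
    # Denominator: the standard error of the difference.
    standard_error = math.sqrt(
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    )
    # Degrees of freedom: persons in both groups minus 2.
    degrees_of_freedom = len(group_a) + len(group_b) - 2
    return difference / standard_error, degrees_of_freedom


# Hypothetical scores, invented for illustration only.
treated = [12, 15, 14, 16, 13]
control = [10, 11, 13, 12, 9]

t_value, df = t_test(treated, control)
print(t_value, df)
```

Swapping the two arguments flips the sign of the t-value, which matches the note that t is positive when the first mean is larger than the second.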

Given the alpha level, the df, and the t-value, you can look the t-value up in a standard table of significance to determine whether the t-value is large enough to be significant.

If it is, you can conclude that the means for the two groups are different (even given the variability).

Worksheet 2

Rosenthal and Jacobson (1968) informed classroom teachers that some of their students showed unusual potential for intellectual gains.

Eight months later the students identified to teachers as having potential for unusual intellectual gains showed significantly greater gains in performance on a test said to measure IQ than did children who were not so identified.

Conclusions

If you do a large number of tests, falsely significant results are a problem.

Remember that testing at the 95% level means accepting a 5% chance of a false positive whenever there is no real effect. This means that of every 100 tests run on effects that do not really exist, the odds are that five of them will come out significant at the 95% level.

If you took a random, meaningless set of data and did 100 significance tests, the odds are that five tests would be falsely reported significant.

As you can see, the more tests you do, the more of a problem these false positives are. You cannot tell which the false results are - you just know they are there.
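The arithmetic behind this warning can be sketched as follows. Both functions assume that no real effects exist and that the tests are independent of one another, which is a simplification; the function names are hypothetical.

```python
def expected_false_positives(number_of_tests, alpha=0.05):
    """Average number of falsely significant results when no real effects exist."""
    return number_of_tests * alpha


def chance_of_any_false_positive(number_of_tests, alpha=0.05):
    """Probability that at least one independent test comes out falsely significant."""
    return 1 - (1 - alpha) ** number_of_tests


print(expected_false_positives(100))
print(round(chance_of_any_false_positive(100), 3))
print(round(chance_of_any_false_positive(5), 3))
```

Under these assumptions, even a handful of tests carries a noticeable chance of at least one false positive, and with 100 tests one is almost guaranteed.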

Limiting the number of tests to a small group chosen before the data is collected is one way to reduce the problem.

If this isn't practical, there are other ways of solving this problem.

The best approach from a statistical point of view is to repeat the study and see if you get the same results.

If something is statistically significant in two separate studies, it is probably true.

In real life it is not usually practical to repeat a survey, but you can use the "split halves" technique of dividing your sample randomly into two halves and do the tests on each.

If something is significant in both halves, it is probably true.

The main problem with this technique is that when you halve the sample size, a difference has to be larger to be statistically significant.
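Dividing a sample randomly into two halves can be sketched as below. The function name and the optional `seed` parameter are illustrative choices; the seed only makes the random split repeatable. The same significance test would then be run on each half separately.

```python
import random


def split_halves(sample, seed=None):
    """Shuffle a copy of the sample and return it as two halves."""
    shuffled = list(sample)
    random.Random(seed).shuffle(shuffled)
    middle = len(shuffled) // 2
    return shuffled[:middle], shuffled[middle:]


# Hypothetical respondent IDs, invented for illustration only.
respondents = list(range(1, 11))
first_half, second_half = split_halves(respondents, seed=1)
print(first_half, second_half)
```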

The last common error is also important.

Most significance tests assume you have a truly random sample.

If your sample is not truly random, a significance test may overstate the accuracy of the results, because it only considers random error.

The test cannot consider biases resulting from non-random error (for example a badly selected sample).

To summarize:

• In statistical terms, significant does not necessarily mean important.

• A probability value tells you how likely a result at least as extreme as yours would be by chance alone; the corresponding significance level is 1 - p.

• Too many significance tests will turn up some falsely significant relationships.

• Check your sampling procedure to avoid bias.
