Dr Andrew Clegg
BML224: Data Analysis for Research
Geographical Techniques 2 Descriptive Statistics
© Dr Andrew Clegg
Data Analysis for Research Contents
Contents
Course Outline p. i
Section 1: Sampling and Types of Data
1.0 Introduction - Why use Statistics? p. 1-1
1.1 Sampling p. 1-1
1.2 Some Terms in Sampling p. 1-2
1.3 Avoiding Bias p. 1-2
1.4 Deciding on the Choice of Sampling Techniques p. 1-2
1.5 Summary p. 1-6
1.6 Types of Data p. 1-7
1.7 Presenting Data p. 1-11
Section 2: Descriptive Statistics
2.0 Introduction p. 2-25
2.1 Measures of Central Tendency p. 2-25
2.2 Arithmetic Mean p. 2-25
2.3 The Median p. 2-27
2.4 The Mode p. 2-31
2.5 Comparison of the Mean, Median and Mode p. 2-33
2.6 The Population Mean p. 2-34
2.7 Skew and the Relationship of the Mean, Median and Mode p. 2-36
2.8 Using SPSS to Calculate Descriptive Statistics p. 2-37
2.9 Graphically Describing Data p. 2-68
2.10 Graphically Describing Data in SPSS p. 2-74
2.11 Creating Crosstabulations in SPSS p. 2-80
Section 3: Measures of Dispersion
3.0 Introduction p. 3-89
3.1 Measures of Dispersion p. 3-91
3.2 Other Distributions p. 3-98
3.3 The Standard Normal Distribution p. 3-100
3.4 Confidence Intervals p. 3-105
3.5 The Standard Error p. 3-109
3.6 Looking at Distributions in SPSS p. 3-115
3.7 Graphically Looking at Distributions in SPSS p. 3-117
p. 210
Geographical Techniques 2 Descriptive Statistics
© Dr Andrew Clegg
Data Analysis for Research Contents
Section 4: Student T-Test, Paired Samples T-Test, Mann Whitney and Wilcoxon
4.0 Introduction p. 4-127
4.1 Null and Alternative Hypotheses p. 4-127
4.2 Hypothesis Testing p. 4-129
4.3 One and Two Tailed Tests p. 4-129
4.4 Choosing the Right Test p. 4-132
4.5 Parametric Tests p. 4-134
4.6 Using SPSS to Calculate the Student T-Test p. 4-135
4.7 Using SPSS to Calculate the T-Test for Related Samples p. 4-147
4.8 Non-Parametric Tests p. 4-152
4.9 Using SPSS to Calculate Mann Whitney p. 4-153
4.10 Using SPSS to Calculate Wilcoxon Signed Ranks Test p. 4-161
Section 5: Chi-Squared
5.0 Introduction p. 5-167
5.1 The One Sample Chi-Squared Test p. 5-169
5.2 The Chi-Squared Test for Two or More Samples p. 5-172
5.3 Yates Correction Factor p. 5-175
5.4 Conditions Necessary for Conducting a Chi-Squared Test p. 5-176
5.5 Using SPSS to Calculate Chi-Squared p. 5-180
Section 6: Correlation
6.0 Introduction p. 6-191
6.1 The Meaning of Correlation p. 6-193
6.2 Identifying Signs of Correlation in the Data p. 6-194
6.3 Correlation Analysis p. 6-194
6.4 Using SPSS to Measure Correlation:
Pearson’s Product Moment Correlation Coefficient p. 6-196
6.5 Using SPSS to Measure Correlation:
Spearman’s Rank Correlation Coefficient p. 6-206
Sampling & Types of Data
Section 1
Learning Outcomes
At the end of this session, you should be able to:
Understand the rationale for the use of statistical techniques
Discuss approaches to developing sampling frameworks and methodologies
Define key terms in the use of statistical techniques
Understand the difference between different data types
Present numerical data effectively in graphical and tabular form
Introduction to Statistical Terms and Sampling Frameworks
1.0 Introduction - Why use Statistics?
There is far more to research than measurement and analysis of quantifiable facts. A prime tool in the study of
how people exist in their environment is the very fact of the investigator’s common humanity: you know a lot
about what people do because you are also a human being. And human actions and responses are affected
by memory, prejudice and emotions which cannot be adequately quantified. Even so, there are innumerable
instances of relevant, quantified facts in geographical investigations: most questionnaire results contain some
quantitative element, even if it is only how many people said ‘yes’ and how many people said ‘no’; international
comparisons can often make use of data from the World Health Organization, World Bank, Unicef or the
United Nations Development Programme, amongst others; within Britain data from the census or health
authorities exists for a wide range of areal units; in physical geography the geology, soils, vegetation, elevation,
aspect and so on can all be quantified. You should not ignore these data.
You may feel that it is only necessary to present such information, perhaps using a table or a graph, and
sometimes that may be enough. On the other hand, statistics will enable you to go much further in the
understanding of the patterns and relationships displayed. Furthermore, they will help ascertain the quality of
the information that you are using. This last point is perhaps the most important part of using statistics: for
instance, your pie chart showing that 75% of respondents preferred Bognor to Barbados as a holiday destination
may look impressive, but statistics will soon reveal that any conclusions to be drawn from the answers given
by only four people are limited, to say the least. If the statistics can’t test your hypotheses, the fault may be in
your hypotheses, or more likely in your data collection, but it certainly isn’t in the statistics.
Before considering the statistical manipulation of data, it is necessary to consider how the data is collected
for use.
1.1 Sampling
You often have to make do with what information there is (if you are interested in the cultivation of mangel-
wurzels and the nineteenth century agricultural survey did not record them, there is not much you can do about
it), but ideally in research you can go and collect the information yourself. In such a case you can ensure that
the information you collect is as useful as possible. Sometimes you will be able to collect all the relevant
information – the census population of each ward in the county, for instance – but in many cases you will need
to collect a sample. For example, you would not practicably be able to find the opinions of all the people in a
county, but using an appropriate sampling technique you could collect information from a smaller but
representative sample of that population.
1.2 Some Terms in Sampling
A variable is a property which can vary and be measured, for example temperature.
An observation or variate is a particular measure.
Population is the complete set of counts or measurements derived from all objects possessing one or
more common characteristics. This can be infinite, as in the case of elevations in the field.
Sample - part of a population.
1.3 Avoiding Bias
An important question to ask yourself at the start of sampling is ‘What do I want my sampling to be representative
of ?’ An example of where this might be important is in studying the patterns of farming in a region. For
simplicity and clarity, let us assume that each farm only cultivates one crop. Selecting points on a map will tend
to choose the bigger farms because they occupy a larger area. On the other hand, selecting farms from a list
will tend to choose the smaller farms, because there are likely to be more of them within the same area.
Therefore the first method will give a representative sample of the land use, the second of the farms. What
can cause problems is using the first to find out about farms, or the second about land use.
1.4 Deciding on the Choice of Sampling Techniques
Before you start sampling, you need to consider whether a convenient sampling frame exists. An example
of a sampling frame may be a list of names on an electoral register or a membership directory of a particular
organisation. Even when sampling frames do exist, they are often incomplete or out of date. The integrity of
the data set will therefore influence your choice of sampling technique. However, it is often possible to construct
your own sampling framework, although this could be costly and time-consuming. For example, if investigating
the distribution of farm shops in West Sussex, you could use the farm shops listed in the yellow pages as a
provisional framework and then supplement this with fieldwork to check for any farm shops not listed in the
yellow pages. For an area it may be necessary to create a grid with x and y axes, so that the whole area under
investigation can be referred to using co-ordinates, like grid references. In this instance, you need to achieve
a balance between having too few cells to give precise or even usable results (remember that a co-ordinate
reference refers to an area rather than a point) and having so many that the sampling process becomes too
time-consuming. Such decisions must be made with specific reference to the particular investigation and the
time and resources at your disposal. Indeed, when designing a sampling strategy for a research project it is
important to ask yourself whether you can afford the time and money to carry out the sample collection.
When deciding on the sample technique, you also need to decide on the size of the sample. As a general
guideline, the larger the sample, the more confident we can be that the statistics derived from it will be similar
to the population parameters. However, a large sample with a poorly designed sampling frame may contain
less information than a smaller but more carefully designed sample.
1.4.1 Random Sampling
The word random in this context does not mean haphazard. It refers to a definite method of selection aimed
at eliminating bias as far as possible. A random sampling method should satisfy two important criteria: a)
every individual must have an equal chance of inclusion in the sample throughout the sampling procedure;
and b) the selection of a particular individual should not affect the chance of selection of any other individual. To
put these criteria in more formal probability terms: the probabilities of inclusion in the sample must be equal
and independent of each other. So, if the aim is to pick a random sample of 50 households from a population
of 200, every household should have the same 50/200 or 0.25 probability of selection. The simplest example
of pure random sampling is a raffle or lottery. Thus to take a random sample of the population of the UK, the
name of each resident would have to be written on a piece of paper, all the pieces of paper put in a giant drum
and a random selection made: obviously not a practical method. More usually it is numbers, not names, which
are used and, instead of picking these numbers out of a hat, a computer can be programmed to generate
random number sequences. Alternatively tables of random numbers can be used. Computers use the last
digits of their internal clock to ‘seed’ their random numbers (otherwise they would just keep repeating the same
sequence), and similarly when using random number tables it is worthwhile picking a starting point somewhere in the
table ‘at random’ and then sometimes reading upwards, or from right to left, rather than always from left to right.
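The procedure described above can be sketched in a few lines of Python (a hypothetical illustration: the numbered households and the sample size of 50 from a population of 200 follow the example in the text).

```python
import random

# Hypothetical population of 200 numbered households.
households = list(range(1, 201))

# Seed the generator so the run is reproducible; by default Python, like
# the computers described above, seeds from the system clock.
random.seed(42)

# Simple random sample of 50 households without replacement: every
# household has the same 50/200 = 0.25 chance of inclusion.
sample = random.sample(households, 50)

print(len(sample))       # 50
print(len(set(sample)))  # 50 -- no household is selected twice
```

Sampling without replacement keeps the probabilities equal across the whole procedure, matching criterion (a) above.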
1.4.2 Systematic Sampling
Systematic sampling is, as its name suggests, sampling according to a regular system. This involves choosing
the first item at random and then selecting every nth item where n will be determined by the size of the sample
required. For example, if a sample of 50 items is required from a population of 500 items, every 10th one
would be selected. Provided that there are no characteristics in the population which recur every 10th item,
the sample will be unbiased; indeed this may be thought of simply as a short cut (the population does not need
to be numbered) method of producing a random sample.
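As a sketch (the population of 500 and sample of 50 are the illustrative numbers used above), systematic selection amounts to a random start followed by a fixed step:

```python
import random

# A population of 500 numbered items; for a sample of 50 the sampling
# interval is k = 500 // 50 = 10.
population = list(range(1, 501))
k = len(population) // 50

random.seed(1)
start = random.randrange(k)    # random start within the first interval
sample = population[start::k]  # then every 10th item after that

print(len(sample))  # 50
```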
1.4.3 Stratified Sampling
It is possible, in some instances, to improve on simple random sampling by stratification of the population.
This is particularly true where the population is heterogeneous (i.e. made up of dissimilar groups) and the
population can be stratified into homogeneous (i.e. similar) classes. These classes should define mutually
exclusive categories.
For example suppose a bakery makes three different types of loaf: large, small and cottage. If a simple
random sample was taken of the daily output, it would be possible, although unlikely, for it to include only one
type of loaf. Stratification of the population before sampling can prevent this and, if carried out as described
below, can produce a sample which is truly representative of the population.
Assume that the bakery’s output is 50% large, 40% small and 10% cottage loaves. The different loaves divide
the population into three strata. Now if a sample of 50 loaves is required it should contain 25 large, 20 small
and 5 cottage thus ensuring that the proportions of each type of loaf in the population are reflected in the
sample. Within these constraints, however, selection should be made on a random basis.
1.4.4 Multi-Stage Sampling
Surveys covering the whole UK are frequently required but, as you can imagine, simple random sampling or
even stratified sampling will not give an easy solution. Where the population is very spread out, particularly
geographically, simple random sampling will result in a dispersed sample leading to a considerable amount of
travelling and time. Consequently some method is needed to narrow the field down to a smaller area,
with the resultant cost savings. Multi-stage sampling attempts to do this without adversely affecting the
‘randomness’ of the result.
The first step is to divide the population into manageable, convenient groups or areas, such as counties or
local authority regions. Indeed, stratification of areas such as counties or local authorities by principal
geographical regions is often introduced in order to minimise geographical bias (Clark et al, 1998, p. 84). A
number of areas are then selected at random. If the number of areas selected is still too large or dispersed,
then these areas can be broken down further to reduce the sample size to more manageable proportions. For
example, having chosen a random sample of local authorities, each one itself may be divided into political
wards or streets or households. Finally a simple random or systematic sample will be chosen.
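The staged narrowing described above can be sketched as follows (the nested frame of regions, local authorities and households is entirely hypothetical, as are the stage sample sizes):

```python
import random

random.seed(3)

# Hypothetical nested sampling frame: regions -> authorities -> households.
frame = {
    f"region-{r}": {
        f"authority-{r}-{a}": [f"household-{r}-{a}-{h}" for h in range(100)]
        for a in range(10)
    }
    for r in range(5)
}

# Stage 1: select 2 regions at random.
regions = random.sample(sorted(frame), 2)

sample = []
for region in regions:
    # Stage 2: within each chosen region, select 3 local authorities.
    authorities = random.sample(sorted(frame[region]), 3)
    for authority in authorities:
        # Stage 3: a simple random sample of 10 households in each.
        sample.extend(random.sample(frame[region][authority], 10))

print(len(sample))  # 2 regions x 3 authorities x 10 households = 60
```

Each stage restricts the geography, cutting travel, while selection remains random at every level.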
1.4.5 Cluster Sampling
Cluster sampling can often be confused with multi-stage sampling as the first step appears identical. The
important difference is that cluster sampling is used when the population has not been listed and it is the only
way to obtain a sample.
As an example, suppose that a survey is to be done on the proportion of elm trees attacked by Dutch elm
disease in the UK. Obviously there is no list of the complete population of elm trees. Neither would it be
possible to try and cover the whole population. To use cluster sampling in this case, the population could be
divided into small ‘clusters’ by drawing a grid over the map of the country and choosing, at random, a few of
these clusters for observation, each cluster being a small area. Within each area, the investigators will then be
asked to find as many elm trees as possible within that area and note how many of them are diseased.
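A sketch of the elm-tree example (the grid size, number of clusters and simulated tree counts are all invented; in a real survey the trees in a cell are unknown until the investigator visits it):

```python
import random

random.seed(11)

# Hypothetical 20 x 20 grid of map cells drawn over the country.
grid = [(x, y) for x in range(20) for y in range(20)]

# Choose a few clusters (cells) at random...
clusters = random.sample(grid, 5)

surveyed = 0
diseased = 0
for cell in clusters:
    # ...then record *every* elm tree found within each chosen cell.
    trees = random.randint(0, 30)  # trees the investigator finds here
    sick = sum(random.random() < 0.2 for _ in range(trees))
    surveyed += trees
    diseased += sick

print(surveyed, diseased)
```

The key difference from multi-stage sampling shows in the loop: within a selected cluster, all elements are observed rather than sampled.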
1.4.6 Non-Random Sampling
The previous paragraphs have been concerned with methods of random sampling, basically simple random
sampling with several variations and refinements. The methods discussed in the previous section share a
number of key elements. These include: a) the chances of obtaining an unrepresentative sample are small; b)
this chance decreases as the size of the sample increases; c) this chance can be calculated; and d) the
sampling error can be measured and therefore the results can be interpreted.
Unfortunately occasions often arise when the selection of a random sample is not feasible. This may be
because:
It would be too costly;
It would take too long; or
All the items in the population are not known.
For these reasons the following research methods of non-random sampling are used, particularly in the field
of market research.
1.4.6.1 Judgement Sampling
In this case an expert, or a team of experts, uses personal judgement to select what, in their opinion,
is a truly representative sample. It certainly cannot be called a random sample as it involves human
judgment which could involve bias. On the other hand, the sampling process does not require any
numbering of the population or random number tables. It can be done more quickly and economically
than random sampling and, if carried out sensibly, can produce very good results. For example, in an
interview situation, a researcher may pick individuals because of the nature of the response they are
likely to give, and the responses the researcher is looking for.
1.4.6.2 Quota Sampling
This is the method most often used in market research where the data is collected by enumerators
armed with questionnaires. To avoid the expense of having to ‘track down’ specific people chosen by
random sampling methods, the enumerators are given a quota of say 400 people, and are told to
interview all the people they can until their quota has been met. Such a quota is nearly always divided
up into different types of people with sub quotas for each type. For example, out of a quota of 400, the
enumerator may be told to interview 250 working wives, 100 non-working wives and 50 unmarried
women, and within each of these three classes to have 50% who smoke and 50% non-smokers.
Using this technique, the researcher has the choice of selecting certain people who might be included
in the sample, and can therefore introduce an element of bias into the sample.
The main advantage of this method is that, if a respondent refuses to answer the questions for any
reason, the interviewer will just look for another person in the same category. With true random
sampling, once a sample item has been decided upon, it must be used. Any substitution results in a
non-random sample.
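The quota mechanics above can be sketched in code (a hypothetical simulation using the 400-interview quota from the example, split 250/100/50 and then 50/50 by smoking status):

```python
import random

random.seed(5)

# Quota design from the example: sub-quotas, each split 50/50 by smoking.
quotas = {
    ("working wife", "smoker"): 125,
    ("working wife", "non-smoker"): 125,
    ("non-working wife", "smoker"): 50,
    ("non-working wife", "non-smoker"): 50,
    ("unmarried woman", "smoker"): 25,
    ("unmarried woman", "non-smoker"): 25,
}
interviewed = {key: 0 for key in quotas}

def try_interview(category, smoking):
    """Interview a passer-by only if their quota cell is still open."""
    key = (category, smoking)
    if key in interviewed and interviewed[key] < quotas[key]:
        interviewed[key] += 1
        return True
    return False  # quota full: the enumerator looks for someone else

# Simulate passers-by until every quota cell is filled.
categories = ["working wife", "non-working wife", "unmarried woman"]
while sum(interviewed.values()) < 400:
    try_interview(random.choice(categories),
                  random.choice(["smoker", "non-smoker"]))

print(sum(interviewed.values()))  # 400
```

Note how refusals cost nothing here: a rejected passer-by is simply replaced by the next one in the same category, which is exactly what makes the resulting sample non-random.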
1.4.6.3 Convenience Sampling
As the name implies, the most important factor here is the ease of selecting the sample. No effort is
made to introduce any element of randomness. An example of this is the quality controller who takes
the first 20 items off the production line as his sample, a dangerous procedure as any fault occurring
after this could remain unnoticed until the next sample is taken (maybe an hour later).
For most purposes, this sampling method is simply not good enough but for some pilot surveys the
savings in cost, time and effort outweigh the disadvantages. The aim of a pilot survey could be to
establish the most satisfactory form of questionnaire to be used in the actual survey. Since the actual
results would not be used it does not matter that the sample was not selected at random.
1.5 Summary
Sampling serves two purposes. One is the saving of time and effort in the collection of information. The
second is the collection of information so that inferences and comparisons can be drawn using statistics.
Although a simple subject, it is fundamental to much research, and needs to be done with care. Table 1
provides a summary of the key sampling methods that have been discussed.
Table 1: Sampling Methods

Judgemental (representative)
Description: Sampling elements are selected based on the interviewer’s experience that they are likely to produce the required results.
Example: Several houses for sale in Belfast, perhaps with families known to the interviewer, are chosen subjectively.

Quota (representative)
Description: Sampling elements are selected subject to a predefined quota control.
Example: The quota is the first 30 homeowners selling their houses in Belfast who are also making an intra-urban move, and are aged between 20-40 years.

Systematic (probability/random; first unit selected at random)
Description: Sampling elements in the sampling frame are numbered. The first sampling unit is selected using random number tables. All other units are selected systematically, k units away from the previous unit.
Example: Sampling frame of 600 homeowners selling their houses in Belfast. These houses for sale are ordered and numbered. A random number is selected for a start point, from which every tenth property is selected for inclusion in the sample.

Simple random (probability/random)
Description: A sample of n elements is selected from a sampling frame without replacement, such that every possible member of the population has an equal chance of being selected.
Example: All 600 houses for sale in the sampling frame are numbered 1-600. A sample of 30 units is selected using a random number table, excluding those numbers outside the range 1-600.

Stratified random (probability/random)
Description: Sampling frame divided into sub-groups (strata) which are then each sampled using the simple random method.
Example: All 600 houses for sale come from lists provided by six estate agents. These are each randomly sampled for houses to include in the sample.

Multi-stage random (probability/random)
Description: Sampling frame divided into hierarchical levels (stages). Each level is sampled using a simple random method which selects the elements to be included at the next level.
Example: All 600 houses for sale are distributed between enumeration districts within several wards. A random sample of these wards is selected, and from these, random samples of both enumeration districts and finally houses for sale are selected.

Clustered random (probability/random)
Description: Sampling frame divided into hierarchical levels (stages). Levels are selected using random sampling similar to the multi-stage random method. However, all elements are selected at the final stage.
Example: Similar to the above method, except that all the houses for sale in a given enumeration district are selected.

[Source: Kitchin, R. and Tate, N. (2000): Conducting Research into Human Geography, Prentice Hall, London, p. 55.]
1.6 Types of Data
Normally when we think of data quality, we think about reliability or accuracy. In statistics, data have quality in
terms of what they represent and how they can be manipulated. The four levels of measurement are:
nominal/categorical, ordinal, interval and ratio. Each measurement is outlined below:
An ordinal variable can be ranked in order from highest to lowest, for example a league table. Alternatively,
a questionnaire survey may ask respondents to rank satisfaction levels on a scale from ‘Strongly Agree’
to ‘Strongly Disagree’. Ordinal variables do not allow comparable measurements, for example ‘Strongly
Agree’ is not worth double ‘Slightly Agree’.
Interval and Ratio variables are concerned with quantitative data. Interval variables are in the form of a
scale which possesses a fixed but arbitrary interval and arbitrary origin. Addition or multiplication by a
constant will not alter the interval nature of the observations (e.g. 10°C, 20°C, 30°C, 40°C). For a ratio
measurement, this number is in relation to a scale of an arbitrary interval, similar to interval data, but with
a true zero origin. In these cases, where we are using numbers as we normally think of them, one value
can be twice the size of another. For example, income is a ratio variable as a person can have no income.
Ratio measurement commonly applies to metric quantities such as distance and mass, which possess a
zero origin. [When importing data into SPSS, and using the Variable window, Interval and Ratio data are
classed as Scale - see Descriptive section in this handbook].
Categorical or nominal variables are the lowest level and are variables where numerical values have
been assigned to separate categories, often viewed as unique from one another. For example, gender
(male/female), hair colour (blonde, brown, ginger, grey), or direction (north, east, south, west).
It is important to remember that data can only be converted from higher to lower quality, and data can only be
treated ‘at their own level’. For instance, the numbers ‘1,2,3,4’ could be heights in meters (ratio), temperatures
in degrees C (interval), the order of countries achieving Rostow’s ‘take off’ (ordinal) or the answer to ‘what is
your favourite number?’ (nominal): they must not be treated at a higher level than their meaning. As Mulberg
(2002) points out, ‘the thing to ask is if it makes sense to talk about one case being double another, or if there
is a highest and a lowest’ (see Figure 1). It is also important to understand the different types of data or
variables, as this will influence the kind of statistical analysis that is possible. The levels of measurement are
summarised in Table 2.
In order to use parametric and non-parametric tests successfully later in the module, it is
imperative that you understand the characteristics and differences between types of data. Please
read through these notes carefully, and learn the different data types.
Figure 1: Judging Levels of Measurement
[Source: Mulberg, 2002, p. 8]
Additional terms that you will encounter include:
A discrete variable is a variable whose numerical value varies in steps, taking only integer (whole-number)
values. Normally such variables are associated with counts; for example, you may count the number of
firms, products or employees when conducting a survey. Discrete variables do not allow for decimal
places.
A continuous variable is a variable which assumes a value that can be denoted on a continuous scale.
Examples include weights, heights and age. In reality, continuous variables relate to specific values that
lie at a point on a continuum. For example a person’s age could be recorded in discrete form as being so
many years, but in reality their age can be placed at a point on a continuum which reflects not only the
numbers of years but also the number of days, minutes and seconds which have passed since the moment
of their birth (Clark et al, 1998). Continuous variables allow for decimal places. Continuous variables can
sometimes be described as demonstrating certain statistical properties that allow them to be used in
parametric statistical tests. However, some continuous variables do not show these particular
properties, and when this happens, the variables are thought suitable to be used in non-parametric tests
(Mulberg, 2002).
Variables can also be classed as ‘dependent’ or ‘independent’. A dependent variable refers to a variable
which is identified as having a relationship with, or dependence on, the value of one or more independent
variables. For example, levels of car ownership may be directly dependent on a number of independent
variables including average household income, age and the number of persons in the household.
The decision sequence in Figure 1 runs as follows:
Start: does it make sense to talk about one number being higher or lower than another?
  No: Nominal Level.
  Yes: does it make sense to talk about one number being double another?
    No: Ordinal Level.
    Yes: Ratio Level.
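Mulberg’s two questions can be expressed as a short helper function (a sketch: the function and argument names are invented, and, like the figure itself, it skips the interval level):

```python
def level_of_measurement(is_ordered: bool, can_double: bool) -> str:
    """Mulberg's (2002) two questions, as a decision function.

    is_ordered: does it make sense to say one value is higher or lower?
    can_double: does it make sense to say one value is double another?
    """
    if not is_ordered:
        return "nominal"
    if can_double:
        return "ratio"
    return "ordinal"

print(level_of_measurement(False, False))  # nominal (e.g. hair colour)
print(level_of_measurement(True, False))   # ordinal (e.g. league position)
print(level_of_measurement(True, True))    # ratio   (e.g. income)
```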
Table 2: Data Quality
When attempting to remember types of data use the abbreviation NOIR (nominal, ordinal, interval, ratio).
When using variables in statistical analysis, a further distinction is also drawn between descriptive and
inferential statistics. Descriptive statistics refer to the sample that is created by the research/study
process and literally refers to the methods and techniques used to describe and summarise data.
Measures of central tendency (mode, median, mean) are the most basic descriptive statistics to which
we can also add basic measures of dispersion including the maximum, minimum and range of values.
Inferential statistics refer to those techniques which are adopted to draw conclusions about the
population to which the sample belongs and which enable inferences about the characteristics that
might be expected in other samples as yet to be selected from that same population. Inferential statistics
give greater analytical power and bring into play probability theory and other statistical tests and measures
that will be discussed later in this handbook.
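As an illustration of these basic descriptive statistics, Python’s standard statistics module can compute the measures of central tendency and the range mentioned above (the data values are invented):

```python
import statistics

# A small hypothetical sample of ages.
ages = [21, 23, 23, 25, 28, 31, 34]

mode = statistics.mode(ages)      # most frequent value
median = statistics.median(ages)  # middle value when ranked
mean = statistics.mean(ages)      # arithmetic average
value_range = max(ages) - min(ages)

print(mode, median, round(mean, 1), value_range)  # 23 25 26.4 13
```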
Nominal or Categorical
Description: Data assigned to discrete categories, in no natural order.
Examples: Clay, sandstone, granite; lifestyle groups such as singles or retired.

Ordinal
Description: The categories associated with a variable can be rank-ordered. Objects can be ordered in terms of a criterion from highest to lowest.
Examples: Cities in order of population size; opinions regarding service or product quality.

Interval
Description: With ‘true’ interval variables, categories associated with a variable can be rank-ordered, as with an ordinal variable, but the distances between categories are equal. Categories have no absolute zero point. Variables which strictly speaking are ordinal but which have a large number of categories, such as multiple-item questionnaire measures, are assumed to have similar properties to ‘true’ interval variables.
Examples: Temperature in degrees Celsius or Fahrenheit; goal difference.

Ratio
Description: Data with meaningful intervals and a true zero.
Examples: Age, distance.
However, as Lindsay (1997) points out the use of inferential statistics carries greater responsibility and as
such any user must be aware of the following guidelines:
Sampling must be independent. This means that the data generation method should give every
observation in the population an equal chance of selection, and the choice of any one case should not
affect the selection of value of any other case;
The statistical test chosen should be fit for its purpose and appropriate for the type of data selected;
The user must interpret the results of the exercise properly. The numerical outcome of a statistical test
is only the result of a calculation; it still has to be interpreted in the context of the research.
1.7 Presenting Data
Presenting numerical data accurately is an important element of essays, reports, presentations and posters.
The aim of the following section is to provide a few basic guidelines on how to incorporate graphs and tables
effectively, and at the same time creatively, into your work.
1.7.1 Using Graphs and Charts
Computer spreadsheets such as Excel, now allow you to produce a range of graphs and charts (bar charts,
column charts, pie charts, graphs) quickly and easily. As such, graphs can be used effectively to enhance the
quality of reports, essays, posters and presentations. Carefully thought-out graphs can bring to life data from
tables and allow comparisons to be made quickly. However, poorly designed graphs can easily fail and
weaken a piece of work. It is very common for students to rush in and produce a whole plethora of charts and
graphs without giving much thought to the data set they are using or what type of output would be most
appropriate. Therefore it is important to take your time and give careful consideration to what you actually
want to achieve.
First, ask yourself the following questions:
Is a graph or chart necessary?
Students often use diagrams as a means of ‘padding out’ work and as a result graphs not referred to
in the text become ‘window-dressing’. Therefore carefully consider whether the graph is actually
needed - ask yourself whether the graph helps the reader understand a particular point or aspect of the
data. If it does, fine - but make sure that it is integrated and referred to fully in your discussion. If not,
provide a simple verbal description.
What is the purpose/objective/outcome?
Are you producing a graph for an essay/report, poster or presentation? While the basic guidelines and
formatting options are generic, you need to consider the overall purpose and intended audience. For
example, graphs produced for a PowerPoint presentation will be different to those produced for inclusion in
an essay or report. Carefully consider the importance of visual impact and clarity,
and the type of media you are using.
What is the nature of the data set you are using?
Graphs often fail because an incorrect chart type has been used or the graph is too complicated.
Therefore before you start carefully consider the actual nature of the data set you are using. Above all
you need to distinguish between ‘continuous’ data and ‘discrete’ quantities. A continuous quantity is
that which can be chosen to any degree
of precision. Examples of continuous quantities include mass (kg), length (m), and time (s). Discrete
quantities, in contrast, can only be expressed as integers (whole numbers), for example: 3 computers, 5 cars,
4 houses. In trying to decide if something is continuous or discrete, decide whether it is like a stream (continuous)
or like people (discrete). Continuous variables are usually plotted on a line graph as this demonstrates the existence
of a causal relationship between the data points, whereas discrete data series are plotted as bar charts or
histograms.
In addition to the nature of the data set, also consider whether you are referring to absolute values or percentage
distributions. This will have a significant influence on the chart type that you use. Second, how complicated
is the data set?; is it best represented as a graph or a table?; can the data be manipulated to make it easier to
use, for example by reformatting columns or excluding columns? Be prepared to modify the data set if
necessary. However, make sure that when you do this you do not alter the accuracy or the representativeness
of the data set you are using.
The following graphs highlight the issue of using appropriate chart types.
Figure 2: Car Sales for Rover, BMW, and Jaguar 1995-2000
[Source: Believe, M., 2001]
In Figure 2, car sales for leading manufacturers have been plotted for a 5-year time period. In this instance we
are dealing with discrete data (as you cannot sell half a car!). However, the data has been plotted as a line
graph - is this correct? The answer is YES, as there is a logical year-to-year link and the 'joining the dots'
technique illustrates the causal relationship between the x-axis variables. This data could have also been
presented as a column chart. Compare this to Figure 3.
© Dr Andrew Clegg p. 1-13
Data Analysis for Research Presenting Data
Figure 3: Resident Opinions to the Development of New Housing in Greenfield Sites in West Sussex
[Source: Believe, M., 2001]
Figure 3 highlights the attitudes of residents to new housing development in West Sussex. Is this graph the
most effective form of presentation? The answer is NO. In this instance joining the dots is not appropriate as
there is no causal relationship between the x-axis variables; a column chart would have been
more effective - see Figure 4.
Figure 4: Resident Opinions to the Development of New Housing in Greenfield Sites in West Sussex
[Source: Believe, M., 2001]
While Figure 4 is a definite improvement, is there any way of making the data in Figure 4 more effective so that
it really highlights the differences in resident opinions between the different areas? Again the answer is YES.
So far we have graphed the absolute values relating to resident opinions. If we were to change this to a
percentage distribution we could present the data as a bar chart - see Figure 5.
Figure 5: Resident Opinions to the Development of New Housing in Greenfield Sites in West Sussex
[Source: Believe, M., 2001]
As you can see in Figure 5, utilising the percentage distribution really succeeds in highlighting the differences
in residents' opinions.
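The switch from absolute values to a percentage distribution, as used for Figure 5, is simple arithmetic. A minimal Python sketch (the opinion counts here are invented for illustration, not the survey data behind the figure):

```python
# Hypothetical opinion counts per area (illustrative only).
opinions = {
    "Area A": {"For": 12, "Against": 48, "Undecided": 20},
    "Area B": {"For": 35, "Against": 10, "Undecided": 5},
}

def to_percentages(counts):
    """Convert absolute counts to a percentage distribution summing to 100."""
    total = sum(counts.values())
    return {k: round(100 * v / total, 1) for k, v in counts.items()}

for area, counts in opinions.items():
    print(area, to_percentages(counts))
```

Plotting the percentage rows rather than the raw counts puts every area on the same 0-100 scale, which is what makes the differences in opinion stand out.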
Let us consider a further example. Figure 6 illustrates the mean monthly temperature and rainfall totals for
Edinburgh. Is the graph appropriate? Again the answer is YES, as there is a logical month-to-month link and the
'joining the dots' technique illustrates the causal relationship between the x-axis variables. However, although
this graph allows us to compare monthly temperature and rainfall totals, the high values for temperature have
masked the values for rainfall and a degree of accuracy has been lost. To overcome this we can change the
type of the graph and plot temperature and rainfall on separate axes - see Figure 7.
Figure 6: Mean Monthly Temperature (°C) and Rainfall (mm) for Edinburgh
[Source: Bartholomew, 1987]
Figure 7: Mean Monthly Temperature (°C) and Rainfall (mm) for Edinburgh
[Source: Bartholomew, 1987]
So far our discussion has concentrated on the use of line graphs, column and bar charts. Another type of chart
frequently used is the pie chart. The overall total number of cases represented by the pie chart should equal
the sample size, or aggregate to 100% where segments denote proportional frequencies (Riley et al, 1998, p.
172). Let us consider some specific examples.
Figure 8: The Distribution of Serviced Establishments in Torbay by Size
[Source: Clegg, 1997]
Figure 8 refers to the percentage distribution of serviced establishments in Torbay by size. When using pie
charts it is important to remember that pie charts can only graph the percentage distribution of one specific
variable and cannot be used to analyse time series data. For example, we could not use a pie chart to illustrate
the car sales for Rover, BMW and Jaguar referred to in Figure 2. However, we could use a pie chart to analyse
the market share of car sales for a specific year (see Figure 9).
Figure 9: Market Share of Car Sales for Rover, BMW and Jaguar in 1995
[Source: Believe, M., 2001]
[Figure 9 segment labels: Rover 41%, BMW 27%, Jaguar 32%]
By drawing and then combining two or more pie charts we could then compare market share for different
years (see Figure 10).
Figure 10: Market Share of Car Sales for Rover, BMW and Jaguar in 1995 and 1999
[Source: Believe, M., 2001]
Programs such as Excel will only allow you to draw one pie chart at a time - however, once drawn, you can
arrange a number of pie charts on a worksheet and print them out. Alternatively, you can cut and paste Excel
charts into Word or Publisher.
Clearly, using the most appropriate type of graph is very important to ensure that the data is presented
accurately. In addition to the type of chart it is also important to ensure that the graph is presented effectively.
[Figure 10 panel labels: 1995 and 1999; segments Rover 27%, BMW 32%, Jaguar 41%]
1.7.2 Producing Graphs
When producing graphs a number of basic rules and guidelines need to be considered. These are:
Is the graph completely self-explanatory?
Is the graph clearly titled, labelled and sourced?
The axes should be labelled, and clear indication given as to the scales being used, and the
numerical quantities being referred to;
All dates and times periods should be explicitly stated in the title, and on the appropriate axis;
In titles do not write ‘A Graph Showing....’. This is obvious - instead refer to the specific content of
the graph (see examples given in this section);
The source of the data should be included, especially if they are drawn from published material.
Are elements of the graph distinguishable?
When using charts it is important that the different data series are clearly distinguishable, otherwise
the graph will be meaningless;
Consider carefully the number of data series you intend to graph. Too much data will overcomplicate
a graph and reduce its impact;
When using pie charts it is recommended that the number of segments should not be too large.
Too many segments make charts confusing and difficult to read;
If charts are to be included in a black and white report, avoid shadings that involve colours as the
distinctions will be lost. Try to keep the use of colours to a minimum: use one colour and
different shades;
Ensure that each segment of the pie chart is clearly labelled and that the percentage values have
been added to indicate quickly which are the principal groups and by how much;
Avoid repetition; if labels and percentage values have been added to a pie chart there is no need
to include the legend.
1.7.3 2D or 3D Graph Formats
Excel and similar packages allow you to enhance the quality of graphs by making them 3D. However, the
use of 3D formatting needs to be treated with caution. If you are producing graphs on A4 for a presentation 3D
charts can work effectively. However, if you are preparing graphs for inclusion in an essay or report 3D charts
may not be appropriate and you may be better off with a standard 2D version. There are no hard and fast rules
on this issue and, ultimately, the type of chart produced and the type of formatting applied will depend on the
nature of the data set used.
Let me illustrate this by referring to examples included in this section. Below is Figure 4, showing resident
attitudes to housing development in West Sussex. At the moment this is a standard 2D column chart. Let us
convert it into a 3D chart.
2D
3D
Do you think this chart is effective? It looks good but is not quite as easy to read as the standard chart. It is noticeable that in order to create a 3D chart Excel has to shrink the original chart. This is where problems lie, as in making the graph smaller the overall impact of the graph is diminished.

Let us try another example. Below is Figure 8, which refers to the distribution of serviced accommodation in Torbay. As before, let us convert this into a 3D chart.

In this instance the 3D chart is actually quite effective and has enhanced the standard 2D chart considerably. The basic rule seems to be that simple 2D charts can be converted into 3D charts quite effectively. However, the more detailed and complicated the standard chart, the less effective it becomes when you make it 3D. Your best option is to experiment with different data sets and formatting options to find the most effective form of presentation.
2D
3D
1.7.4 Using Tables
In addition to charts, tables are also an effective way of presenting information. Again when producing tables
a number of guidelines can be followed:
Consider the purpose of presenting the data as a table as there may be better ways of presenting it;
Avoid the temptation of just photocopying tables out of textbooks and sticking them into essays. In many
cases, tables contain information superfluous to the reader. Be prepared to modify data sets so
that only relevant information is included in your table;
Make sure that tables are completely self-explanatory. Provide a table number and title for each table.
If abbreviations are used when labelling then provide a key;
Make sure that the content of the table is fully referred to in the text - make sure that tables are not
basically ‘window-dressing’;
Allow sufficient space when designing the table for all figures to be clearly written;
Make sure that the table/data is fully sourced.
Again let me illustrate with a number of examples.
Table 2 is an example of a table I created for the Arun Tourism Strategy document. Does the table meet the
guidelines highlighted above? The answer is YES. The table is clear, well laid out, titled, sourced and self-
explanatory. Shading has also been used to try and enhance the visual impact of the table.
Table 2: Visits Abroad by UK Residents 1994-1997

                        Number of Visits ('000) by Area of Destination
Year          Total     North America   Western Europe   Rest of World
1994          39,630    2,927           32,375           4,328
1995          41,345    3,120           33,821           4,404
1996          42,050    3,584           33,566           4,900
1997          45,957    3,594           37,060           5,303
% Change
1996/1997     +9        0               +10              +8

[Source: ETB, 1999]
Now consider Table 3 which refers to regional tourism spending in England in 1997. Again this is a clear table
that for the purposes of the tourism strategy had to contain a lot of detail. If you were using this table to
illustrate patterns of regional spending it could be simplified to show the most obvious or important patterns.
For example, in Table 3 it is evident that tourism spending is highest in the West Country and lowest in
Northumbria.
The table could therefore be easily modified to really reinforce this message (see Table 4). Notice that in the
amended Table 4, I have also changed the title so that the content of the new table becomes self-explanatory
and reflects the actual purpose of the table. Table 3 could have also been modified by removing specific
columns thereby emphasising the patterns of spending in particular market areas.
Table 3: The Regional Distribution of Tourism Spending in England, 1997

                      All       Holidays   Short        Long         Business   VFR
                      Tourism              Holidays     Holidays     and Work
                                           (1-3 nights) (4+ nights)
Destination           £11,665   £7,725     £2,505       £5,215       £2,055     £1,415
(England)             %         %          %            %            %          %
Cumbria               3         5          5            5            1          1
Northumbria           3         3          3            3            3          5
North West England    9         8          11           6            12         10
Yorkshire             8         8          7            8            9          10
Heart of England      11        9          14           7            15         16
East of England       13        14         11           15           14         12
London                9         6          13           2            15         17
West Country          24        30         17           37           10         10
Southern              11        10         10           11           3          9
South East England    9         8          9            7            10         12

[Source: ETB, 1998]
Table 4: Selected Regional Differentials in the Distribution of Tourism Spending in England, 1997

                      All       Holidays   Short        Long         Business   VFR
                      Tourism              Holidays     Holidays     and Work
                                           (1-3 nights) (4+ nights)
Destination           £11,665   £7,725     £2,505       £5,215       £2,055     £1,415
(England)             %         %          %            %            %          %
Northumbria           3         3          3            3            3          5
East of England       13        14         11           15           14         12
West Country          24        30         17           37           10         10
South East England    9         8          9            7            10         12

[Source: ETB, 1998]
Descriptive Statistics
Section 2
Learning Outcomes
At the end of this session, you should be able to:
Produce descriptive statistics including the mean, median and mode

Understand the features of measures of central tendency

Apply appropriate descriptive statistics to different data types

Import data into SPSS and use SPSS to produce descriptive statistics and cross-tabulations

Use SPSS to graphically describe data through the use of frequency histograms, stem and leaf plots and box plots
p. 25
Geographical Techniques 2 Descriptive Statistics
© Dr Andrew Clegg p. 2-25
Data Analysis for Research Descriptive Statistics
2.0 Introduction
The first part of the data analysis process is the production of basic descriptive statistics, such as the
mean, median, mode, standard deviation, standard error, and basic frequency and contingency tables.
The analysis of the descriptive statistics can then be used to ascertain the nature of the data, especially
in relation to its distribution, and what types of statistical tests can be used to analyse the data further.
2.1 Measures of Central Tendency
Averages, or measures of central tendency, give a simple summary of the characteristics of the data
being described. How the data is described depends upon its quality. The three measures used are the
mean, median and mode (see Table 2.1).
Table 2.1: Measures of Central Tendency

Name     Data Type           Description                Example
Mean     Ratio or interval   Total/Number of samples    'The mean July maximum in Bognor is 21°C'
Median   Ordinal             Middle in rank order       'Half of the customers travel more than 6km to Tescos'
Mode     Nominal             Most common category       'Most visitors are from London'

2.2 Arithmetic Mean

This is the figure that most people would produce if they were asked to give the average of a set of figures.
The mean is the most commonly used of all averages and is calculated by adding together all the values
in a series and dividing the total by the number of items in the series. The computation formula is:

    x̄ = (Σ xi) / n,   where i = 1, 2, ..., n

The symbols may be explained as follows:

x̄, pronounced 'x-bar', denotes the arithmetic mean of a sample;

Σ, pronounced 'sigma', means 'the sum of';
xi means all values of x where x1, x2, x3...xn represent the values of each observation in a data set.
Thus i assumes, in turn, the values of 1,2,3 and so on and;
n is the total number of observations in the data set.
Therefore, for the following data series:

8, 2, 4, 7, 3, 4, 1, 2, 2, 1

the arithmetic mean is calculated as:

x̄ = (8 + 2 + 4 + 7 + 3 + 4 + 1 + 2 + 2 + 1) / 10 = 34 / 10 = 3.4
2.2.1 Features of the Mean
When using the mean, you should consider the following points:
The mean is easy to understand and calculate and is the most commonly used of all averages;
It makes use of every value in the distribution, leading to a mathematical exactness which is useful
for further mathematical processing;
It can be determined if only the total value of the items and the number of items are known, without
knowing individual values;
It can be distorted by extreme values in the distribution;
For a discrete distribution, the mean may be an ‘impossible’ figure e.g. 17.5 cigarettes per day when
all values in the distribution are whole numbers.
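The hand calculation above can be cross-checked with a few lines of Python (a sketch for checking arithmetic, not part of the module's SPSS material):

```python
def arithmetic_mean(values):
    """x-bar: sum all x_i and divide by n, the number of observations."""
    return sum(values) / len(values)

scores = [8, 2, 4, 7, 3, 4, 1, 2, 2, 1]
print(arithmetic_mean(scores))  # 3.4
```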
2.3 The Median
There are however certain occasions when it is either not possible or not practical to use the
arithmetic mean, particularly if the values of some of the extreme items are difficult to determine or
if it is possible only to arrange the items in order without assigning numerical values to them. In
such cases the representative or average figure may be taken as the middle item when the series
is arranged in ascending or descending order.
The statistical term for this middle item in a set of data is the median. The median is a position
average or the value of the middle item of a series. For example, the median of the series
1,2,2,4,7,7,10 is the value 4 since it is the middle item. For a series with an even number of items
(e.g. 1,2,3,4), there is no middle item and yet a median may still be required. In this case the
median is conventionally taken as the arithmetic mean of the two central items, in this case, a value
of 2.5.
Therefore, to re-emphasise:
Example 1: A series with an uneven number of items
The data series in rank order is:
Example 2: A series with an even number of items
The data series in rank order is:
1 2 2 4 7 7 10
The median is the middle item which in
this case is 4.
1 2 2 4 7 7 10 11

The median is the arithmetic mean of the two central items: (4 + 7) / 2 = 5.5
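Both conventions - the middle item for an odd number of items, and the mean of the two central items for an even number - can be sketched in Python (an illustrative helper, not from the handbook):

```python
def median(values):
    """Middle item of the ranked series; for an even number of items,
    the arithmetic mean of the two central items."""
    ranked = sorted(values)
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:
        return ranked[mid]
    return (ranked[mid - 1] + ranked[mid]) / 2

print(median([1, 2, 2, 4, 7, 7, 10]))      # 4
print(median([1, 2, 2, 4, 7, 7, 10, 11]))  # 5.5
```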
2.3.1 The Median of a Grouped Distribution
Strictly speaking, it should be impossible to find the median of a grouped distribution as detailed information
is lost when data is gathered into classes. However, as with the arithmetic mean, several assumptions
are made and an answer is produced. There is also a convention to say which is the median item in a
grouped frequency distribution with either an odd or an even number of items.
If a frequency distribution contains a total of n items then the median item will be:

a) the (n + 1)/2 th item if n is odd

b) the n/2 th item if n is even

For a distribution of 401 items the median will thus be the (401 + 1)/2 = 201st item.

For a distribution of 400 items the median will be the 400/2 = 200th item.
To find the median within a grouped data set it is first necessary to construct a table showing the cumulative
frequencies. The data on the following pages highlights the annual rainfall in Kano, a popular tourist
destination in Nigeria, and it should be clear that Table 2.2 has been produced by dividing the annual
rainfall totals into ranked categories (400-499mm etc) and then counting the number of years that fall
into each of these categories. These are then added up to produce the cumulative frequency, which can
be expressed as a percentage for easier interpretation (see Figure 2.1).
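The construction of the cumulative table can be sketched in Python. The rainfall list below is deliberately truncated (the full 68 values appear in Table 2.2), so the printed figures illustrate the method rather than reproduce Table 2.3:

```python
# Stand-in sample of annual rainfall totals (mm); Table 2.2 has all 68 values.
rainfall = [930, 970, 650, 890, 1230, 850, 750, 950]

def cumulative_table(values, low=400, high=1300, width=100):
    """Bin values into classes and accumulate frequencies, as in Table 2.3."""
    rows, cumulative, n = [], 0, len(values)
    for lower in range(low, high, width):
        freq = sum(lower <= v < lower + width for v in values)
        cumulative += freq
        rows.append((lower, freq, cumulative, round(100 * cumulative / n, 1)))
    return rows

for lower, freq, cum, pct in cumulative_table(rainfall):
    print(f"{lower}-{lower + 99}mm: f={freq}, cum={cum}, cum%={pct}")
```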
Table 2.2: Rainfall for Kano, Nigeria from 1907 to 1974
Year Rainfall Year Rainfall Year Rainfall Year Rainfall
1907 930 1924 820 1941 740 1958 1070
1908 970 1925 1100 1942 1110 1959 1010
1909 650 1926 540 1943 810 1960 830
1910 890 1927 780 1944 840 1961 1020
1911 1230 1928 850 1945 620 1962 760
1912 850 1929 900 1946 790 1963 780
1913 750 1930 700 1947 480 1964 1140
1914 950 1931 770 1948 990 1965 700
1915 680 1932 890 1949 1060 1966 750
1916 1010 1933 830 1950 800 1967 900
1917 740 1934 1000 1951 700 1968 780
1918 480 1935 1180 1952 580 1969 970
1919 690 1936 1010 1953 920 1970 960
1920 820 1937 850 1954 810 1971 710
1921 990 1938 830 1955 1040 1972 660
1922 860 1939 940 1956 710 1973 410
1923 1040 1940 980 1957 1110 1974 560
Table 2.3: Cumulative Frequency of Annual Rainfall for Kano, Nigeria
Annual Rainfall in mm. Frequency Cumulative Frequency Cumulative % Frequency
400-499 3 3 4.4
500-599 3 6 8.8
600-699 5 11 16.2
700-799 15 26 38.2
800-899 15 41 60.3
900-999 12 53 77.9
1000-1099 9 62 91.2
1100-1199 5 67 98.5
1200-1299 1 68 100
Figure 2.1: Cumulative Frequency Curve for Kano, Nigeria
By reading off at 50% on the y axis (Cumulative % Frequency) to the line, and then down to the x axis
the median is calculated at about 850mm. The median is, in fact, what is quite often meant by ‘the
average’ in everyday conversation, in that half of the years tend to have more rainfall than this, and half
less.
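The value read off the curve can be checked numerically by linear interpolation within the median class. A Python sketch using the class frequencies from Table 2.3 (assuming a uniform class width of 100mm):

```python
# (class lower bound, frequency) pairs from Table 2.3
classes = [(400, 3), (500, 3), (600, 5), (700, 15), (800, 15),
           (900, 12), (1000, 9), (1100, 5), (1200, 1)]

def grouped_median(classes, width=100):
    """Locate the n/2-th item and interpolate within its class."""
    n = sum(f for _, f in classes)
    target = n / 2
    cumulative = 0
    for lower, f in classes:
        if cumulative + f >= target:
            return lower + (target - cumulative) / f * width
        cumulative += f

print(grouped_median(classes))  # about 853mm - close to the 850mm read-off
```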
2.3.2 Features of the Median
When using the median, you should consider the following points:
Half the items in the series will have a value greater than or equal to the median and
half less than or equal to the median. It is therefore a measure of rank or position;
It is easy to understand;
It is unaffected by the presence of extreme items in the distribution;
If found directly (from ungrouped data) it will be the same as an actual item in the distribution;
It may be found when the values of all the items are not known, provided that values of middle items
and the total number of items are known;
Ranking the items can be tedious;
The median cannot be used for further mathematical processing;
It may not be representative if there are few items.
2.4 The Mode
In an ungrouped, discrete distribution the mode is the value which occurs most often; that is, the value
with the highest frequency. The mode of the series 1,2,2,3,4 is the value of 2. Unlike the mean and the
median, it is not necessarily unique. For example the series 1,2,2,3,4,4 has two modes: 2 and 4.
In a continuous frequency distribution it is possible that no two values will be the same. In this sort of
situation the mode is defined as the point where there is the greatest clustering of values, or maximum
frequency density.
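For discrete data the mode (or modes, since it need not be unique) can be sketched in Python (illustrative, not from the handbook):

```python
from collections import Counter

def modes(values):
    """Return all values sharing the highest frequency of occurrence."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

print(modes([1, 2, 2, 3, 4]))     # [2]
print(modes([1, 2, 2, 3, 4, 4]))  # [2, 4] - two modes
```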
2.4.1 Mode for Grouped Data
To find the mode within a grouped data set it is first necessary to construct a histogram showing the
frequency distribution (see Figure 2.2). Having constructed the graph, first identify the modal class (the
class with the greatest frequency or frequency density). To calculate the actual value of the mode, draw
a line from the top right-hand corner of the modal rectangle to the point where the top of the adjacent
rectangle on the left meets it. Now draw a similar line from the top left-hand corner to the point where the
adjacent rectangle on the right meets it. Now draw a perpendicular from the point at which these lines
cross to the horizontal axis. This point gives the value of the mode.
Figure 2.2: The Calculation of the Mode from a Frequency Histogram
While this technique will give the specific value of the mode, it is often more useful and meaningful to
simply indicate the boundaries of the modal class. In other words, rather than attempting to calculate an
accurate value for the mode, which may not be entirely accurate or representative, it would be more
meaningful to say that more people, for example, fell within the 30 and under 40 age group than any
other group described by Figure 2.2.
[Histogram axes: Age (x-axis, 10-100); Frequency (y-axis, 0-70). The construction gives Mode = 34]
2.4.2 Features of the Mode
When using the mode, you should consider the following points:
For discrete data it is an actual single value;
For continuous data it is the point of highest frequency density;
It is easy to understand;
Extreme items do not affect its value;
It can be estimated from incomplete data;
It cannot be used for further mathematical processing;
It may not be unique or clearly defined;
It requires arrangement of the data which may be time consuming.
Activity 1:
For practice work out the mean, median and mode for the following sets of scores relating to the
number of bedspaces in serviced accommodation in Torquay.
Set 1:
4  9  16  10  16  20  20  15  32  14  10  27
Mean =
Median =
Mode =
Set 2:
16  14  15  12  8  10  8  26  14  15  30
Mean =
Median =
Mode =
2.5 Comparison of the Mean, Median and Mode
The mean, median and mode are the three most important statistical measures of location and central
tendency. Here are some guidelines to help you decide which value should be used in a particular case:
To determine what would result from an equal distribution use the mean (e.g. to determine the per
capita consumption of jelly babies);
If position or ranking is involved use the median which gives the half-way value (e.g. a student
interested in whether his exam mark places him in the upper or lower half of the class will need to
compare his mark with the median mark);
Where the most typical value is required use the mode (e.g. a shoe manufacturer may want to know
the average shoe size for ladies. For production planning it will be the mode that he requires as it will
tell him the most common shoe size).
2.5.1 Which Measure Should You Use?
The type of measure that you use will depend on the data that you are using, but ultimately whatever
measure you choose should provide a good indication of the typical score in your sample. The mean is
the most frequently used measure of central tendency, because it is calculated from the actual scores
themselves, not from ranks, as is the case with the median, and not from frequency of occurrence, as in
the case of the mode. However, as mentioned earlier, as the mean uses all the scores in the calculation
it is sensitive to extreme values.
Look at the following sets of scores:
1, 2, 3, 4, 5, 6, 7, 8, 9, 10
The mean from this set of data is 5.5 (the same as the median). If we were to change one of the scores
to make it more extreme, we would get the following:
1,2, 3, 4, 5, 6, 7, 8, 9, 20
The mean is now 6.5, although the median is still 5.5. If we were to make the final score even more
extreme we would get the following:
1, 2, 3, 4, 5, 6, 7, 8, 9, 100
The mean is now 14.5, which as you can see is not really representative of this set of scores. As we
have only changed the highest score, the median remains 5.5. In this case, the median becomes a
better measure of central tendency. Therefore, when deciding which measure to use it is always useful
to check the data for extreme values. Where extreme scores are present, use the median, as this
simply gives you the score in the middle of the other scores when they are put into ascending order. This
insensitivity to extreme values makes the median a useful alternative to the mean.
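The arithmetic in this example can be verified with Python's standard statistics module (a check on the numbers, not part of the module's SPSS material):

```python
import statistics

base = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
extreme = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

# The outlier drags the mean upwards but leaves the median untouched.
print(statistics.mean(base), statistics.median(base))        # 5.5 5.5
print(statistics.mean(extreme), statistics.median(extreme))  # 14.5 5.5
```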
The mode can be used with any type of data, as it relates to the most frequently occurring score and
does not require any calculation. The median and mode cannot be used with certain types of data. For
example if you were discussing occupation or attraction classifications it would be meaningless to rank
these in order of magnitude. Again, when using the mode it is important that it provides a good indication
of the typical score. Consider the following two sets of data:
A] 1,2,2,2,2,2,2,2,3,4,5,6,7,8
B] 1,2,2,3,4,5,6,7,8,9,10,11,12
In set A there are more 2s than any other number and the mode would provide a suitable measure of
central tendency. However, in set B, although the mode is again 2, it is not such a good indicator as its
frequency of occurrence is only just greater than that of the other scores.
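Counting the frequencies makes the difference between sets A and B explicit; a quick Python check:

```python
from collections import Counter

set_a = [1, 2, 2, 2, 2, 2, 2, 2, 3, 4, 5, 6, 7, 8]
set_b = [1, 2, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

# In set A the mode (2) occurs 7 times; in set B it occurs only twice,
# barely ahead of every other score.
print(Counter(set_a).most_common(1))  # [(2, 7)]
print(Counter(set_b).most_common(1))  # [(2, 2)]
```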
2.6 The Population Mean

The measures of central tendency outlined above are useful for giving an indication of the typical score
in a sample. However, what if you wanted to get an indication of the typical score in a population? In
theory, one could calculate the population mean (a parameter) in a similar way to the calculation of a
sample mean: obtain scores from everyone in the population, sum them and divide by the number in the
population. However, this would not usually be possible. We therefore have to estimate the population parameters
from the sample statistics. One way of estimating the population mean is to calculate the means for a
number of samples and then calculate the mean of these sample means. It has been found that this
gives a close approximation of the population mean.
So why does the mean of the sample means approximate the population mean? Imagine randomly
selecting a sample of people and measuring their IQ. It has been found that the population mean for IQ
is 100. It could be that, by chance, you have selected mainly geniuses and that the mean IQ of the
sample is 150. This is clearly above the population mean of 100. You might select another sample that
happens to have a mean IQ of 75, again not near the population mean. It is clear that the sample mean
need not be a close approximation of the population mean. However, if we calculate the mean of these
two samples, we get a much closer approximation to the population mean:
(150 + 75)/2 = 112.5
Activity 2:
Which measure of central tendency would be most suitable for each of the following sets of data:
a] 1, 23, 25, 26, 27, 23, 29, 30 ........................................
b] 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 4, 5 ........................................
c] 1, 1, 2, 3, 4, 1, 2, 6, 5, 8, 3, 4, 5, 6, 7 ........................................
d] 1, 101, 104, 106, 111, 108, 109, 200 ........................................
The mean of the sample means (112.5) is a closer approximation of the population mean (100) than
either of the individual sample means (75 and 150). If several samples of the same size are taken from
a population, some will have a mean higher than the population mean and some will have a lower mean.
If all the sample means were plotted as a frequency histogram the graph would look similar to Figure
2.3.
Figure 2.3: Distribution of Sample Means Selected from a Population with a Mean of 100
If we calculated the mean of all these sample means it would be equal to 100, which is also equal to the
population mean. This tendency of the mean of sample means to equal the population mean is known
in statistics as the Central Limit Theorem. Knowing that the mean of the sample means gives a good
approximation of the population mean is important as it helps us to generalise from our samples to our
population. This will be considered in more detail when we look at dispersion.
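The tendency described by the Central Limit Theorem can be illustrated with a small simulation. The population below is synthetic (normally distributed IQ-style scores with a mean of roughly 100), so the exact numbers will vary, but the mean of the sample means lands close to the population mean:

```python
import random
import statistics

random.seed(1)  # reproducible illustration
population = [random.gauss(100, 15) for _ in range(100_000)]

# Draw many samples of 50 and average their means.
sample_means = [statistics.mean(random.sample(population, 50))
                for _ in range(500)]

print(round(statistics.mean(population), 1))    # close to 100
print(round(statistics.mean(sample_means), 1))  # also close to 100
```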
2.7 Skew and the Relationship of the Mean, Median and Mode
Skew is the term that is used to describe the shape of the data as depicted by its frequency distribution
or frequency curve. Under a symmetrical distribution curve, or what is also called ‘Normal Distribution’
(this will be covered in more detail when we look at measures of dispersion), the data builds up slowly
from the left to a central peak or modal point and then declines to the right. In this situation, the mean,
median and the mode all coincide (see Figure 2.4). A positive skew is when the peak lies to the left
and a negative skew when it lies to the right. The further the peak lies from the centre of the horizontal
axis, the more the distribution is said to be skewed.
Figure 2.4: Symmetrical, Positively and Negatively Skewed Data Distributions
Where the distribution is positively skewed, the mean and median will be pulled to the right of the mode,
and where it is negatively skewed, the mean and median are pulled to the left. Consequently, in a
positively skewed distribution, the mean will have the greatest value, the mode the lowest value and the
median will fall between the two. Conversely, in a negatively skewed distribution, the mode will have the
highest value and the mean will have a lower value than the median and the mode.
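This ordering can be checked with a small worked example. The Python sketch below (the data values are hypothetical, chosen to produce a positive skew) confirms that the mode takes the lowest value, the mean the highest, and the median falls between the two:

```python
import statistics

# Hypothetical positively skewed data: a long tail of high values
# pulls the mean to the right of the median and mode.
scores = [1, 2, 2, 2, 3, 3, 4, 5, 8, 15]

mean = statistics.mean(scores)      # 4.5
median = statistics.median(scores)  # 3.0
mode = statistics.mode(scores)      # 2

# In a positively skewed distribution: mode < median < mean.
assert mode < median < mean
```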
2.8 Using SPSS to Calculate Descriptive Statistics
Having considered the basic calculation of the mean, median and mode by hand (and hopefully not too
painfully!), the aim of this next section is to show you how to produce basic descriptive statistics using
SPSS. You can also produce descriptive statistics in Access, and this will be demonstrated later in the
module. We first need to consider the basic elements of the SPSS operating system.
2.8.1 An Introduction to SPSS
SPSS (PASW Statistics) is a powerful statistical tool that can be used to perform a wide range of statistical
techniques. When analysing data in SPSS it is often convenient to transfer over the data you wish to
analyse from an Excel spreadsheet. The following section will show how to import an Excel
spreadsheet, and provide a basic introduction to the SPSS environment, before explaining in more
detail how to produce descriptive statistics.
To import an Excel spreadsheet, first open SPSS.
SPSS asks you what you would like to do. Move the mouse over Open an Existing Data Source and
press the left mouse button. Either choose the required files or select More Files and click OK.
The Open File dialog box appears. Move the mouse over the drive containing the file you want to open
and then press the left mouse button. The file Dataset is located in the BML224 home page on
Moodle.
SPSS must be told to look for an Excel file. Therefore in the Files of Type box make sure that Excel
is selected [move the mouse over the box and press the left mouse button; a sub menu of different file
types appears; move the mouse over Excel and press the left mouse button].
Now select the Dataset file and click Open.
The Opening File Options dialog
box appears. In the Excel
spreadsheet you are going to import,
the first row in the spreadsheet
contains the field names of the
variables you want to examine. To
assist your data analysis, you need
to ensure that SPSS recognises this.
Move the mouse over the Read Variable Names option and press the left mouse button so that the
checkbox is ticked. Move the mouse over OK and press the left mouse button.
SPSS now automatically imports the fields in the Excel spreadsheet and the data is displayed in the
Data Editor window.
You now need to save this file to your own homespace on the network. Move the mouse over File and
press the left mouse button. Move the mouse over Save As and press the left mouse button again. The
Save As Dialog box appears. Save the file as DATASET.SAV. Note that .SAV is the file extension for
data tables in SPSS. If you need to reload this file at any point, in the Open File dialog box select the
DATASET.SAV file.
Before using SPSS to perform basic frequency counts and descriptive statistics on the results of the
Interview data you first need to understand the nature of the data. For example, some variables are
based on numeric coding schemes (nominal, categorical data types) and others on specific data values
(interval or ratio data types). For those questions based on numeric coding schemes, certain descriptive
statistics are not appropriate, although in this case SPSS can be used to perform basic frequency
counts.
Details of the variables in the Dataset file are included in the Dataset guide which has been given to you
as part of the module resources. Please read through this guide carefully and become familiar with the
different types of data, as this will be central to your successful completion of this module.
2.8.2 Using the Variable View
In SPSS, we can use the variable view to check the integrity of the data and to apply additional information
to the coding schemes to aid our analysis of the data. At the bottom of the SPSS window, click on the
Variable View tab.
The Variable View window is displayed. This window provides specific information relating to the variables
that we have imported in the Dataset file. A number of key areas need to be checked at this point. First,
check the Type column. In order for SPSS to conduct statistical analysis on the variables in the Dataset
file all the variables here should be listed as Numeric.
In this instance the Greenrank06 variable is listed
as a String. This needs to be changed to Numeric.
To do this move the mouse over String and press
the left mouse button. The cell is highlighted and
a button appears.
Click the button and the Variable Type dialog box
appears.
Select Numeric and click OK.
Check the other variables to ensure that they are set as numeric.
We can also use the Variable View to check the Measurement type of
the variables. In this instance the measurement type should look like
this. Refer back to your introductory notes to check on different data
types.
If the measurement type is not correct for a specific variable, move the
mouse over the measurement cell in question and press the left mouse
button.
The cell is highlighted and a button appears.
Click on the button and a sub menu appears, offering three options: Scale, Ordinal
and Nominal. Move the mouse over the required data type and press the left
mouse button. The new data type will be presented. Note that ratio and interval
data (e.g. age/investment) are classified as Scale.
In the Variable View we can also assign more specific value labels to each of the
variables. For example if we take Area as an example of the basic coding scheme in place here, Chichester
District = 1 and Arun District =2. Any subsequent analysis that we perform will use this base coding
scheme in any output. In order to make the SPSS output more self-explanatory we can assign additional
value labels so that any output actually refers to Chichester District and Arun District.
In the Variable View move the mouse over Values for the Area variable and press the left mouse button.
The cell is highlighted and a button appears. Click the button.
The Value Labels dialog box
appears.
In the Value: box type 1.
In the Value Label: box type
Chichester District.
Click Add.
In the Value: box type 2.
In the Value Label: box type
Arun District.
Click Add.
Click OK.
The changes you have made are reflected in the Variable View.
Repeat this process to add Value Labels to the remaining variables (where appropriate!).
Return to the Data View and SAVE the file. We can now experiment with producing descriptive statistics.
By using the Value Labels in the Data View window you can switch the value labels between the
numeric coding and the full text labeling. Click the button to toggle between the different options.
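For readers who later work outside SPSS, the same idea of attaching value labels to a numeric coding scheme can be sketched in Python with pandas (the use of pandas is an assumption here; only the coding of 1 = Chichester District and 2 = Arun District comes from the handbook, and the responses below are invented):

```python
import pandas as pd

# Coding scheme from the handbook's Area variable:
# 1 = Chichester District, 2 = Arun District.
area_labels = {1: "Chichester District", 2: "Arun District"}

# Hypothetical responses stored as numeric codes.
df = pd.DataFrame({"Area": [1, 2, 1, 1, 2]})

# The analogue of assigning value labels: keep the numeric coding,
# but attach readable labels for use in output.
df["AreaLabel"] = df["Area"].map(area_labels)

print(df)
```

Like the Value Labels toggle, this keeps the underlying codes intact while making any output self-explanatory.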
2.8.3 Working with SPSS Output
Before we start producing descriptive statistics, it is worth mentioning that SPSS output can be cut and
pasted into a Word document (or equivalent package). The process is very simple.
In the output window, select the item you want to cut and paste, in this case a histogram. When the item
is selected a black border will appear. Copy the item (Edit>Copy or right mouse click>Copy).
Open Word and paste the selection into your document.
To print specific elements of the output, first select the element you wish to print. When the item is
selected a black border will appear.
Select Print from the File menu. The Print dialog box opens. Make sure that Selection is highlighted
and click OK.
The required element is printed. Please use this method to print and annotate output that will be created
during the module.
Please use the cut and paste process highlighted here to complete your log book that we will use
throughout this module.
Additional guidance notes on the different features of SPSS are available in the appendices of this
handbook. When using SPSS to analyse data, you should not be directly cutting and pasting SPSS
output into your work. Output tables should ideally be recreated in Word, and data should be transferred
into Excel to create appropriate graphs.
2.8.4 Producing Descriptive Statistics
As mentioned earlier, before using SPSS to perform basic frequency counts and descriptive statistics on
the results of the survey data you first need to understand the nature of the data (refer back to Section
1.6). In this case, we will start by exploring the categorical/nominal variable: OCC (occupation).
Remember for this variable it would not be appropriate to apply the mean, median or standard deviation.
To perform a basic frequency count, first decide on
the variable you wish to examine. In this case we
shall examine OCC.
To do so, first move the mouse over Analyse and
press the left mouse button. Move the mouse
over Descriptive Statistics and then over
Frequencies and press the left mouse button again.
The Frequencies dialog box appears.
Move the mouse over the variable you want to examine (in this case Occ) and press the left mouse
button. Move the mouse over the central arrow and press the left mouse button again. Alternatively,
select the variable you want to examine and quickly double click the left mouse button. The selected
variable moves across into the Variable(s) box. Note this procedure can be repeated for multiple
variables. Click OK.
The results of the frequency count are displayed in the output window. Notice that the frequency table
has listed the occupations as a result of you entering in data for the Value Labels. This helps to make
the table more self-explanatory.
Any statistics you generate in SPSS will also be displayed in this output window. This is very useful as
it means all your calculations are stored in one file that you can save and open at a later date. Save the
output file to your own homespace on the network. Save the file as DS-OUTPUT1.
Repeat this procedure to perform frequency counts to complete Tables 1 and 2 overleaf. Your
additional frequency counts will appear in the output window. Save the output regularly. Record your
results overleaf or alternatively print out and fully annotate your SPSS output and file in your work folder.
The information presented in the frequency chart could now be cut and pasted into Excel, where
you could create an Excel chart to show the distribution of the data.
An online simulation of how to create basic frequency statistics is available on the BML224 home page.
Please use this simulation to familiarise yourself with the basic procedures outlined here.
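Outside SPSS, the same kind of frequency table (counts plus percentages) can be sketched in Python with pandas. This is a hedged illustration only: the occupation values below are invented, as the real codes live in the Dataset guide.

```python
import pandas as pd

# Hypothetical occupation responses; SPSS would display these
# via its value labels.
occ = pd.Series(["Hotelier", "Retailer", "Hotelier",
                 "Caterer", "Hotelier"], name="Occ")

# Analogue of Analyse > Descriptive Statistics > Frequencies:
frequency_table = pd.DataFrame({
    "Frequency": occ.value_counts(),
    "Percent": (occ.value_counts(normalize=True) * 100).round(1),
})

print(frequency_table)
```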
Activity 3:
Table 1: The Distribution of Accommodation by Size

Size        Frequency    Percentage
Small
Medium
Large

Table 2: The Distribution of Accommodation by Price

Price         Frequency    Percentage
Up to £30
£31 to £50
£51 to £70
£71 to £90
£91+

Having completed Tables 1 and 2, now have a go at completing Table 3. It is exactly the same process
but you will need to perform a frequency count for each separate question in the table (the relevant
variable name is given in the brackets).
Table 3: Business Responses to Tourism Issues
You will have noticed that the frequency count produced relates to the entire sample of 300 businesses,
and there is no differentiation based on specific cases such as location. By selecting specific cases we
can use SPSS to produce more detailed frequency counts. In the following example we will produce a
frequency count showing the frequency distribution of different occupation types by area.
Activity 4:
Return to the Data View window in SPSS. Move the mouse over Data and
press the left mouse button.
Move the mouse over Split File and press the left mouse button.
The Split File dialog box opens.
Select Compare groups.
Then select Area and move it into the
Groups Based on box.
Then click OK.
The frequency table is displayed in the output window. As you can see the frequency table now gives a
breakdown of occupation type by area (our prior labelling clearly referring to the Chichester and Arun
Districts). Let us repeat this frequency count but this time instead of using Area we will use Town Code.
Return to the Data View window in SPSS. Move the mouse over Data and
press the left mouse button.
Move the mouse over Split File and press the left mouse button.
The Split File dialog box opens.
Deselect Area and then select Town Code and move it into the Groups Based on box.
Click OK.
Run the frequency count again and the frequency table is displayed in the output window. As you can
see the frequency table now gives a breakdown of occupation type by Town Code (our prior labelling
clearly referring to the actual towns).
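The Split File 'Compare groups' option is, in effect, a grouped frequency count. As a rough illustration of the same idea in Python with pandas (the data below are invented, not taken from the module's Dataset file):

```python
import pandas as pd

# Hypothetical extract: each business has an area and an occupation.
df = pd.DataFrame({
    "Area": ["Chichester District", "Arun District",
             "Chichester District", "Arun District", "Arun District"],
    "Occ":  ["Hotelier", "Hotelier", "Retailer", "Hotelier", "Caterer"],
})

# Analogue of Data > Split File > Compare groups (grouped on Area),
# followed by a frequency count on Occ.
by_area = df.groupby("Area")["Occ"].value_counts()

print(by_area)
```

Grouping on a different variable (Town Code, say) is just a matter of changing the grouping column, exactly as changing the Groups Based on box in SPSS.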
Activity 5:
Using the Split File option please complete the following tables.
Table 4: Size of Accommodation by Area

Area                   Small [No. of Ests]   Medium [No. of Ests]   Large [No. of Ests]   Total
Chichester District
% Distribution
Arun District
% Distribution

Table 5: Size of Accommodation by Town

Town                   Small [No. of Ests]   Medium [No. of Ests]   Large [No. of Ests]   Total
Chichester
% Distribution
Midhurst
% Distribution
Arundel
% Distribution
Bognor Regis
% Distribution
Activity 5 (continued):
Table 6: Business Response to Employment Opportunities by Area
Table 7: Business Response to Employment Opportunities by Town
[Chart: "The Size Structure of Accommodation in the Chichester and Arun Districts" - a percentage
stacked bar chart (0% to 100%) with Town (Chichester, Midhurst, Arundel, Bognor Regis) on the
vertical axis, Percentage on the horizontal axis, and a legend showing Small, Medium and Large]
Activity 6: Self-Directed
Cut and paste the results from Table 5 in your SPSS output into Excel. Edit the layout of the results accordingly and
produce the graph shown above. The graph should be presented on A4 in landscape format. Please copy the format of
this chart exactly.
Please print off the chart and have it checked by the module tutor. File the chart in your work folder.
Before we do any additional analysis it is important to remember to reset the
Split File dialog box, so any subsequent analysis is based on the entire
sample.
Return to the Data View window in SPSS. Move the mouse over Data and
press the left mouse button.
Move the mouse over Split File and press the left mouse button.
The Split File dialog box opens.
Select Analyze all cases, do not
create groups and then click OK.
Failure to reset the Split File dialog box can result in inaccurate statistics being created.
There are a number of ways in which you can produce Descriptive Statistics for interval or ratio
variables in SPSS.
Method 1: First decide on the variable you wish to examine. In this case we shall examine the turnover
of businesses in 2008 (Turnover08).
To do so, first move the mouse over Analyse
and press the left mouse button.
Move the left mouse button over Descriptive
Statistics and then over Frequencies and
press the left mouse button again.
The Frequencies dialog box appears.
Move the mouse over the variable you want to examine (in this case Turnover08) and press the left
mouse button. Move the mouse over the central arrow and press the left mouse button again. Alternatively,
select the variable you want to examine and quickly double click the left mouse button. The selected
variable moves across into the Variables box. Note this procedure can be repeated for multiple variables.
Move the mouse over Statistics and press the left mouse button.
The Frequencies: Statistics dialog box appears. This dialog box gives you the opportunity to select a
wide range of descriptive statistics. Select the options you want to include by moving the mouse over
the blank square and pressing the left mouse button so a tick appears. When you have completed your
selection move the mouse over Continue and press the left mouse button.
Note that SPSS also allows you to select measures of dispersion. This will be discussed in more detail
in the next session.
This will take you back to the Frequencies dialog box. Move the mouse over OK and press the left
mouse button. SPSS automatically calculates the necessary statistics and displays the results in the
Output window. This method not only produces the basic descriptive statistics for the variable but also
a frequency table (which can be deleted).
Descriptive statistics can also be produced by selecting
Descriptives instead of Frequencies in the
Descriptive Statistics sub menu.
Follow the same procedures as in the previous
example, however, in this case click Options to specify
the descriptive statistics you want SPSS to produce.
Select the options you want to include by moving
the mouse over the blank square and pressing
the left mouse button so a tick appears.
When you have completed your selection move
the mouse over Continue and press the left
mouse button.
This will take you back to the Descriptives dialog
box. Move the mouse over OK and press the left
mouse button. SPSS automatically calculates the
necessary statistics and displays the results in
the Output window.
You will have noticed that the descriptive statistics produced for Turnover08 relate to the entire sample of 300 businesses.
By using the Split File option again we can look in more detail at the characteristics of turnover in relation to specific
cases such as size of business or location. For example in the following, we can use the Split file to look at the
average turnover in the Chichester and Arun Districts.
As before open the Split File dialog box and select Compare groups. Select Area to go in the Groups Based on:
box.
Now produce descriptive statistics for Turnover08 again (using either the descriptives or frequencies option). In the
following example I have created descriptive statistics using the frequencies option and you can see in the output that
descriptive statistics have now been produced for both the Chichester and Arun Districts.
Method 2: The second (and slightly faster method) is to use the Explore function. In this example we will
again examine the turnover of businesses in 2008 (Turnover08).
To do so, first move the mouse over
Analyse and press the left mouse button.
Move the left mouse button over
Descriptive Statistics and then over
Explore and press the left mouse button
again.
The Explore dialog box appears.
Move the mouse over the variable you want to examine (in this case Turnover08) and press the left
mouse button. Move the mouse over the Dependent List arrow and press the left mouse button again.
Turnover08 appears in the Dependent List.
Make sure that Statistics is selected in the dialog box. We will come back to plots later.
Click OK. Descriptive statistics for Turnover08 are produced in the output window.
As in the previous method of producing descriptive statistics, the values given in the output relate to the
entire sample. By adding variables in the Factor List in the Explore dialog box, we can differentiate by
specific cases.
Return to the Explore dialog box.
Select Area from the variable list and click the Factor List arrow. Area will appear in the Factor List
window. This will give us separate descriptive statistics for the Arun and Chichester Districts. Remember
in the previous method, we used the Split File option to group around specific cases.
Click OK. Descriptive statistics for business turnover in the Arun and Chichester Districts are produced
in the output window.
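The Explore procedure with a Factor List is, in spirit, a grouped set of descriptives. A rough pandas equivalent might look like the sketch below (the turnover figures are invented for illustration; only the variable names Area and Turnover08 come from the handbook):

```python
import pandas as pd

# Invented turnover figures (in £000s) for six businesses.
df = pd.DataFrame({
    "Area": ["Chichester District"] * 3 + ["Arun District"] * 3,
    "Turnover08": [120, 150, 180, 90, 110, 130],
})

# Analogue of Explore with Turnover08 in the Dependent List and
# Area in the Factor List: one set of descriptives per group.
stats = df.groupby("Area")["Turnover08"].agg(
    ["mean", "median", "min", "max"])

print(stats)
```

Swapping Area for another grouping variable (such as E-Strategy below) just changes the grouping column.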
Let me illustrate another example. Return to the Explore dialog box.
Remove Area from the Factor List and replace with E-Strategy. Click OK.
Descriptive statistics for business turnover for E-Commerce Adopters and E-Commerce Non-Adopters
are produced in the output window.
Using either method, attempt to complete the following tables.
Table 8: Descriptive Statistics for Turnover08 by Town
Table 9: Descriptive Statistics for GTBS Score in 2008 [GTBS08] by Size of Business
Table 10: Descriptive Statistics for Invest by GStrategy
Activity 7:
Size of Business    Mean    Median    Mode    Standard Deviation    Range
Small
Medium
Large
2.9 Graphically Describing Data
As mentioned earlier, when using statistics it is important to understand the data that you are using.
One of the best ways of doing this is through exploratory data analysis, and investigating your data
using graphical techniques. The next section will consider three main elements: frequency histograms,
stem and leaf plots and box plots.
2.9.1 Frequency Histograms
In the above section you have used SPSS to perform basic frequency counts. The frequency histogram
is a useful way of representing a frequency count more graphically, and allowing us to inspect for any
extreme values (see Figure 2.5). Any extreme values and possible errors that have been made in
inputting the data are often easier to spot when you have graphed the data. The frequency histogram is
also useful for discovering other important characteristics of your data. For example you can easily
record the value of the mode by looking for the tallest column in the chart. In addition, the histogram
also gives you useful information about how the values are distributed. However, when interpreting the
distribution of the data, be aware that the interpretation of your histogram is dependent upon the particular
intervals that the bars represent. The way that the data is distributed will become important when we
look at normal distribution and dispersion in the next session. The distribution and character of the data
is also an important consideration in the use of inferential statistics that will be examined later in this
module.
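As a simple illustration of how binning works, and of why interpretation depends on the intervals the bars represent, the following Python sketch bins the data behind Figure 2.5 into intervals of width 10 and prints a text histogram (a sketch only; the module itself produces histograms in SPSS):

```python
from collections import Counter

# The data behind Figure 2.5.
scores = [2, 12, 12, 19, 19, 20, 20, 20, 25]

# Bin the scores into intervals of width 10 and print a text
# histogram. Note that with these intervals the 10-19 and 20-29
# bars tie, even though the mode of the raw scores is 20 -
# interpretation really does depend on the intervals chosen.
bins = Counter((score // 10) * 10 for score in scores)
for start in sorted(bins):
    print(f"{start:2d}-{start + 9:2d} | {'#' * bins[start]}")
```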
Figure 2.5: Frequency Histogram showing the Mean, Median and Mode
[Note: The frequency histogram is based on the following data: 2, 12, 12, 19, 19, 20, 20, 20, 25]
2.9.2 Stem and Leaf Plots
Stem and leaf plots are similar to frequency histograms in that they allow you to see how the scores are
distributed. They also retain the values of the individual observations. A basic example of a stem and
leaf plot is shown below:
Stem and Leaf Plot [a]
[Data set= 2, 12, 12, 19, 19, 19, 20, 20, 20, 25]
Stem Leaf
Tens Units
0 2
1 22999
2 0005
A stem and leaf plot based on a larger data set is illustrated overleaf.
[Figure annotations: the median and mode lie at the peak; the mean (16.56) is not normally shown
on histograms; the scores of 2 and 25 mark the extremes of the distribution]
Stem and Leaf Plot [b]
[Data set= 1, 1, 2, 2, 2, 5, 5, 5, 12 ,12, 12, 12, 14, 14, 14, 14, 15, 15, 15, 15, 18, 18, 24, 24, 24, 24,
24, 25, 25, 25, 25, 25, 25, 25, 28, 28, 28, 28, 28, 28, 28, 28, 32, 32, 33, 33, 33, 33, 34, 34, 34, 34, 34,
35, 35, 35, 35, 35, 42, 42, 42, 43, 43, 44 ]
Stem Leaf
0 11222555
1 22224444555588
2 44444555555588888888
3 2233334444455555
4 222334
You can see the similarities between histograms and stem and leaf plots if you turn the stem and leaf
plot on its side. When you do this you can get a good representation of the distribution of the data. In
Stem and Leaf Plot [a] the first line contains the scores 0 to 9, the next line 10 to 19 and the last line 20
to 29. Therefore in this case the stem indicates the tens and the leaf the units. You can see the score
of 2 is represented as 0 in the tens column (the stem) and 2 in the units column (the leaf), 25 is represented
as a stem of 2 and a leaf of 5. The same pattern applies to Stem and Leaf Plot [b], which highlights that
this approach is useful for presenting lots of data.
However, there are times when the system of blocking in tens is not very informative. For example look
at the following Stem and Leaf Plot.
Stem and Leaf Plot [c]
Stem Leaf
0 0000022222222333333333555555555555555777777777777799999999
1 000000033333888
2 3
6 4
This Stem and Leaf Plot is not really that informative, and only indicates that most of the values are
below 20. An alternative system is to block the scores in groups of 5 (0-4, 5-9, 10-14, 15-19 etc).
Stem and Leaf Plot [d]
Block Stem Leaf
0-4 0. 0000022222222333333333
5-9 0* 555555555555555777777777777799999999
10-14 1. 000000033333
15-19 1* 888
20-24 2. 3
60-64 6. 4
This stem and leaf plot provides a much better indication of the distribution of scores. You can see that
we use a full stop (.) following the stem to signify the first half of each block of ten scores (e.g. 0-4) and
an asterisk (*) to signify the second half of each block of ten scores (e.g. 5-9).
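The construction described above (stems as tens, leaves as units) is mechanical enough to sketch in a few lines of Python. Using the data from Stem and Leaf Plot [a], this reproduces the plot exactly:

```python
from collections import defaultdict

# Data from Stem and Leaf Plot [a].
scores = [2, 12, 12, 19, 19, 19, 20, 20, 20, 25]

# The stem is the tens digit, the leaf the units digit.
plot = defaultdict(list)
for score in sorted(scores):
    plot[score // 10].append(score % 10)

for stem in sorted(plot):
    print(f"{stem} | {''.join(str(leaf) for leaf in plot[stem])}")
# Prints:
# 0 | 2
# 1 | 22999
# 2 | 0005
```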
2.9.3 Box Plots
Extreme scores are sometimes difficult to spot in a large data set. In this instance an alternative graphical
technique is the box plot or whisker plot, which gives a clear indication of the distribution of extreme
scores, and like the stem and leaf plots and histograms discussed above, tells us how the scores are
distributed. An example of a box plot is given in Figure 2.6:
Figure 2.6: An Example of a Box Plot
Although SPSS will automatically create box plots, the following section will outline how to create them
so you understand how to interpret them.
Step 1: The box plot in Figure 2.6 is based on the following data: 2, 20, 20, 12, 12, 19, 19, 25, 20
The first step is to calculate the median score.
2, 12, 12, 19, 19, 20, 20, 20, 25
Median score = 19 [position 5]
[Figure annotations: N = 9; vertical axis from 0 to 40; the box spans the hinges; the thick line within
the box represents the median; the whiskers extend to the adjacent values]
Step 2: The next step is to calculate the hinges. These are the scores that cut the top and bottom
25% of the data (the lower and upper quartiles): thus 50% of the scores fall within the
hinges. The hinges form the outer boundaries of the box. The hinges are calculated by
adding 1 to the position of the median and then dividing by 2. In this instance the
median was in position 5, therefore: (5+1)/2 = 3
Step 3: The upper and lower hinges are therefore the third score from the top and the third score
from the bottom of the ranked list, which in this current example are 20 and 12 respectively.
Step 4: From these scores we can work out the h-spread, which is the range of the scores between
the two hinges. The score on the upper hinge is 20 and the score on the lower hinge is 12,
therefore the h-spread is 8 (20 minus 12).
Step 5: We define extreme values as those that fall one-and-a-half times the h-spread outside the
upper and lower hinges. The points one-and-a-half times the h-spread outside the upper
and lower hinges are called inner fences. One-and-a-half times the h-spread in this case
is 12, that is 1.5*8: therefore any score that falls below 0 (lower hinge, 12, minus 12) or
above 32 (upper hinge, 20, plus 12) is classed as an extreme score.
Step 6: The scores that fall between the hinges and the inner fences and which are closest to the inner
fences are called adjacent scores. In our example, these scores are 2 and 25, as 2 is the
closest score to 0 (the lower inner fence) and 25 is the closest to 32 (the upper inner
fence). These are illustrated by the cross-bars on each of the whiskers.
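If you want to check your understanding of Steps 1 to 6, the calculations can be sketched in a few lines of Python. This is illustrative only - it is not part of the SPSS workflow, and the simple median rule below assumes an odd number of scores, as in our example of nine values:

```python
# Illustrative sketch of Steps 1-6 (assumes an odd number of scores).
def box_plot_stats(scores):
    s = sorted(scores)
    n = len(s)
    median_pos = (n + 1) // 2           # Step 1: position 5 for n = 9
    median = s[median_pos - 1]
    hinge_pos = (median_pos + 1) // 2   # Step 2: (5 + 1) / 2 = 3
    lower_hinge = s[hinge_pos - 1]      # Step 3: 3rd score from the bottom
    upper_hinge = s[n - hinge_pos]      # Step 3: 3rd score from the top
    h_spread = upper_hinge - lower_hinge          # Step 4
    lower_fence = lower_hinge - 1.5 * h_spread    # Step 5: inner fences
    upper_fence = upper_hinge + 1.5 * h_spread
    inside = [x for x in s if lower_fence <= x <= upper_fence]
    return {"median": median,
            "hinges": (lower_hinge, upper_hinge),
            "h_spread": h_spread,
            "fences": (lower_fence, upper_fence),
            "adjacent": (min(inside), max(inside)),           # Step 6
            "extreme": [x for x in s if x not in inside]}

print(box_plot_stats([2, 20, 20, 12, 12, 19, 19, 25, 20]))
```

Running this on the data above reproduces the figures worked through in Steps 1 to 6: median 19, hinges 12 and 20, h-spread 8, inner fences 0 and 32, adjacent scores 2 and 25, and no extreme scores.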
Any extreme scores (those that fall outside the upper and lower fences), are shown on the box plot.
You can see from Figure 2.6 that the h-spread is indicated by the box width (12 to 20) and that there are
no extreme scores. The lines coming out from the edge of the box are called whiskers, and these
represent the range of scores that fall outside the hinges but are within the limits defined by the inner
fences. Any scores that fall outside the inner fences are classed as extreme scores (also called outliers).
As shown in Figure 2.6, there are no scores outside the inner fences, which are 0 and 32. The inner
fences are not necessarily shown on the plot. The lowest and highest scores that fall within the inner
fences (adjacent scores 2 and 25) are indicated on the plots by the cross-lines on each of the whiskers.
If we were to add a score of 33 to the data set illustrated in Figure 2.6, a revised box plot would now
indicate the presence of an extreme score (see Figure 2.7). As shown in Figure 2.7, the score is marked
as 10, indicating that the tenth score in our data set is an extreme score (in this case, 33). This value
falls outside the inner fence of 32. In this situation it may be worth examining the data set to ensure that
this extreme value has not been caused by an error in the data entry process.
Figure 2.7: Revised Box Plot Indicating an Extreme Score
2.10 Graphically Describing Data in SPSS
Creating histograms, stem and leaf plots and box plots in SPSS is very straightforward. In the following
example, we will generate graphical output relating to the Turnover08 variable in the dataset.
Move the mouse over Analyze and press the left mouse button.

Move the mouse over Descriptive Statistics and then over Explore and press the left mouse button again.
The Explore dialog box appears.
Move the mouse over the variable you want to examine (in this case Turnover08) and press the left
mouse button.
Move the mouse over the central arrow and press the left mouse button again. Alternatively, select the
variable you want to examine and quickly double click the left mouse button.
The selected variable moves across into the Dependent List.
Move the mouse over Plots and press the left mouse button.
The Explore Plots dialog box appears.
Select Stem and Leaf and Histogram.
Click Continue.
You are returned to the Explore dialog box. Click OK.
SPSS generates a histogram, stem and leaf plot and box plot in the output window.
As before, any graphical output produced refers to the entire sample of 300 businesses. Using the Factor List
option in the Explore dialog box allows us to examine specific variables in more detail. For example, the following output
has been produced by selecting Area in the Factor List box. This is an extremely useful way of visually examining the
distribution of your data, and we will come back to it when we look at dispersion and statistical testing.
I would like you now to have a go at producing graphical output for a specific variable. Choose an
appropriate variable (which must be ratio or interval in nature) and produce output for the entire
sample, and then use the Factor List option in the Explore dialog box to investigate specific cases. Record your
observations by cutting and pasting the output into your log book.
Activity 8:
2.11 Creating Cross-tabulations in SPSS
Another useful way of examining the relationship between variables is through the use of cross-tabulations.
In the following example we will create a number of cross-tabulations using data from the Dataset file.
To create a cross-tabulation in SPSS, move the mouse over Analyze and press the left mouse button. Move the mouse over Descriptive Statistics and then Crosstabs.
The Crosstabs dialog box appears. You need to think about the structure of your crosstab and decide
what variable you want as a row and what variable you want as column. Your crosstab should take the
form of a contingency table.
Move the mouse over the variable you want to assign to Rows, in this case Area, and press the left
mouse button. Move the mouse over the central arrow and press the left mouse button again. Area
appears in the
Row(s) box:
Move the mouse over the variable you want to assign to Columns, in this case Occ (Occupation), and
press the left mouse button. Move the mouse over the central arrow and press the left mouse button
again. Occ appears in the Column(s) box:
Click OK.
SPSS produces the crosstab in the output window:
The crosstab presented here is based on the absolute values of the data. We can repeat the process to
include Row and Column percentages. This is often a good idea, as it provides a more representative
overview if you have different sample sizes. In this instance we will add percentages to the rows.
Having selected the Row and Column variables, move the mouse over Cells and press the left mouse button.
The Crosstabs: Cell Display dialog box appears.

Select Row in the Percentages window and then click Continue. This will return you to the Crosstabs dialog box. Click OK.
A second crosstab is produced in the output window - this time row percentages have been included. In
this example the crosstab is showing the distribution of occupation categories within a specific District.
For example, in the Chichester District, 48.6% of businesses are run by previous managers and
administrators, compared to 25.7% who were in professional occupations. Reference to the percentage
distribution rather than the absolute values provides a more representative discussion, as it takes into
account relative sample sizes. Repeat the process, removing row percentages and adding column
percentages.
When producing crosstabs it is important that you correctly assign row and column percentages as this can influence
the accuracy of how you discuss the results. A simple rule of thumb is that row percentages should always total 100
when read across the row, and column percentages will always total 100 when read down the column. In the above
example where we have used the column total we are looking at the distribution of specific occupation categories
across the two Districts. For example, 70.2% of managers and administrators are within the Chichester District
compared to 29.8% in the Arun District. In contrast, 63.6% of plant operatives are within the Arun District compared to
36.4% in the Chichester District.
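The logic behind row percentages can also be sketched outside SPSS in a few lines of Python. The data below are invented for illustration (they are not the module Dataset):

```python
from collections import Counter

# Invented example data - NOT the module Dataset.
areas = ["Chichester", "Chichester", "Chichester", "Arun", "Arun", "Chichester"]
occs  = ["Manager", "Manager", "Professional", "Manager", "Plant", "Plant"]

# Count each (row, column) combination, as in the first crosstab.
counts = Counter(zip(areas, occs))
rows = sorted(set(areas))
cols = sorted(set(occs))

# Add row percentages, as in Crosstabs > Cells > Percentages > Row.
for r in rows:
    row_total = sum(counts[(r, c)] for c in cols)
    pcts = {c: 100 * counts[(r, c)] / row_total for c in cols}
    print(r, {c: round(p, 1) for c, p in pcts.items()})
    assert round(sum(pcts.values()), 6) == 100  # each row totals 100
```

The assertion makes the rule of thumb explicit: row percentages always total 100 when read across the row, whatever the sample sizes in each District.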
Now attempt to complete the following tables. Please give consideration to whether you should be
using row or column totals (the clue is in the table). Refer to your Dataset guide.
Table 11: Town Against G-Strategy
Table 12: E-Strategy Against Occupation
                          Occupation
E-Strategy                Managers and     Professional    Clerical and    Sales         Plant         Total
                          Administrators   Occupations     Secretarial     Operations    Operatives
E-Commerce Adopters
  % Distribution
E-Commerce Non-Adopters
  % Distribution
Activity 9:
Table 13: Perceived Value of the Internet Against E-Commerce and Marketing Course Attendance
Table 14: Town Against the Size of Business
                      Town
Size of Business      Chichester    Midhurst    Arundel    Bognor Regis
Small
  % Distribution
Medium
  % Distribution
Large
  % Distribution
Total
Activity 10:
Using the Dataset file, create 3 additional crosstabs using appropriate variables. Record your results
by cutting and pasting your output into your log book. Check your crosstabs with your module tutor to
ensure that they are correct.
Please review the online simulations to ensure that you are familiar with the basic approaches to producing descriptive statistics in SPSS.
We can make crosstabs even more specific by using the Layer command. In the following example,
the initial crosstab is GStrategy v Occupation, but we are going to use the Layer command to examine
any differences between GStrategy, Occupation and Area. In effect, the Layer command is allowing us
to use Area as an additional filter.

Select the variables to use as the basis of your crosstab. Here we are using GStrategy (row) and
Occupation (column). Select Area and put it in the Layer box.
Click OK.
In the output window you will notice that a crosstab showing GStrategy v Occupation has been provided,
but the results have also now been split by area, showing relative distributions in both the Arun and
Chichester Districts.
Dispersion
Section 3
Learning Outcomes
At the end of this session, you should be able to:
Understand the theory and assumptions relating to the distribution and variance of data

Calculate measures of dispersion, including the median, range, standard deviation and standard error, both manually and using SPSS

Use confidence levels and z scores to establish the relationship between the sample mean and population mean

Use the standard error to establish the extent to which the sample mean deviates from the population mean
p. 3-89
Data Analysis for Research Measures of Dispersion
3.0 Introduction
So far, you have been introduced to a number of different methods to graphically illustrate your data. But why
is it important to do this? It is important because the way the data is distributed will influence the types of
statistical tests that are valid, as many of the statistical tests that you will be introduced to in this module make
specific assumptions regarding the distribution of the data. One of the most important distributions that you
need to consider is the normal distribution. Under a normal distribution, the characteristic frequency curve
is bell-shaped and is symmetrical around the mean. For example, if 1,000 people were asked to estimate
the length of a room that was exactly 12 feet long, it is highly unlikely that everybody would say that the room
was exactly 12 feet long. Some may guess as low as 11 feet and others may decide on 13 feet. However, we would
expect that most of the estimates would be between 11 feet and 13 feet and very few as far out as 9 feet or 15
feet. If the frequency distribution of the measurements were plotted on a graph, the pattern would tend to be
bell-shaped because most of the values would be clustered around the 12 feet mark, while the frequency of
measurements would diminish away from this central value.
Figure 3.1: Normal Distributions
The curves illustrated in Figure 3.1 all have a normal distribution, even though they are not quite the same.
You can see that they differ in terms of how spread out the scores are and how peaked they are in the middle.
Under a normal distribution, the mean, median and mode are exactly the same. These are features of a
normal curve. Indeed, many natural phenomena, such as heights of adult males and weights of eggs, tend
to produce the ‘normal’ (or Gaussian) distribution, and more significantly, most sampling will do so as well,
regardless of the distribution of the population. This is why it, and sampling, are so important in statistics. The
requirements of a normal distribution are not always met in research, especially when you are dealing with
small sample sizes. If your sample size is less than 30, then reference to the normal distribution is not
appropriate. It is generally found that the more scores from such variables that you plot, the more like the
normal distribution they become. This can be seen in the following example. If you randomly select 10 men
and measured their height inches, the frequency histogram may be similar to Figure 3.2a. This histogram
bears little resemblance to the normal distribution curve. If we were to select an addition 10 men and
measure their height, and then plot all 20 measurements the resulting histogram (Figure 3.2b) would again
not look like a normal distribution. However, you can see that as we select more and more men and plot their
heights, the histogram becomes a closer approximation to the normal distribution (Figure 3.2c to 3.2e). By
the time we have select 100 men you can see that we have a perfectly normal distribution.
Figure 3.2: Normal Distribution and Sample Size
[Source: Dancey and Reid, 2002, p. 64]
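You can see the same effect numerically in a short Python sketch. This is illustrative only; the population mean of 70 inches and standard deviation of 3 are invented figures, not data from the module:

```python
import random

# Invented population: male heights with mean 70 inches, sd 3 inches.
random.seed(1)  # fixed seed so the sketch is repeatable

def sample_stats(n, mu=70, sigma=3):
    heights = [random.gauss(mu, sigma) for _ in range(n)]
    mean = sum(heights) / n
    sd = (sum((h - mean) ** 2 for h in heights) / n) ** 0.5
    return mean, sd

# The larger the sample, the closer the statistics sit to 70 and 3.
for n in (10, 100, 1000):
    mean, sd = sample_stats(n)
    print(f"n={n:4d}  mean={mean:5.2f}  sd={sd:4.2f}")
```

With a sample of 10 the mean and standard deviation can stray noticeably from the population values; by 1,000 they sit very close to them, mirroring Figures 3.2a to 3.2e.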
3.1 Measures of Dispersion
Although the different types of average can help to describe frequency distributions to a certain extent, they
are of limited use on their own and additional measures are often required to illustrate the full picture, and to
assess how much variation there is in our sample or population. This situation is best illustrated by a simple
example.
Two groups of 5 SEMAL students were asked to record their weekly beer consumption. The results in pints
were as follows:
Group 1: 12, 12, 12, 12, 12
Group 2: 0, 5, 10, 15, 30
Passing over the obvious comment that the 2nd group appears to contain someone who isn’t a SEMAL
student, the arithmetic mean for both groups is 12. However, this result gives no indication of the basic
differences between the two sets of values. Therefore, a measure of dispersion (or spread) can be used to
express the fact that one set of values is constant while the other ranges over a wide scale. The following
section will highlight a number of ways in which the level of variance within a sample of population can be
assessed.
3.1.1 The Range
The least sophisticated measure of dispersion is the range of a set of values. The range is simply the
difference between the highest and lowest values of a series. As such, it only tells us about two values which
may be atypical from the rest of the data set. In reference to our previous example, for the beer consumption
of the two groups of SEMAL students, the ranges are:
Group 1: Zero
Group 2: 30
Although the range tells us about the overall range of scores, it does not give any indication of what is happening
in between these scores. Ideally, we need to have an indication of the overall shape of the distribution and how
much the scores vary from the mean. Therefore, although the range gives a crude indication of the spread of
the scores, it does not really tell us much about the overall shape of the distribution of the sample of scores.
Remember, the range is calculated by subtracting the minimum value from the maximum value. In this case:

Group 1: 12 - 12 = 0
Group 2: 30 - 0 = 30
3.1.2 Quartile Deviation
The range, as a measure of dispersion, has the significant disadvantage of being susceptible to distortion by
extreme values. One way of overcoming this is to ignore items in the top and bottom quarters of the
distribution and to consider the range of the two middle quarters only. This is known as the interquartile
range since it is the difference between the first and third quartiles. The quartile deviation (semi-interquartile
range) is one half of the interquartile range. For continuous data, the lower quartile (Q1) is determined by
first ranking the data in order and then dividing the total sample number by 4. In the following example (see
Figure 3.3), the lower quartile lies between the ages of the 2nd and 3rd visitors. Thus, the lower quartile value
is 14 years (i.e. (13+15)/2). The upper quartile value is computed in a similar way, but by multiplying the
sample size by three quarters. Thus the upper quartile value lies between the ages of the 7th and 8th visitors.
Thus the value is 18 years (i.e. (18+18)/2). To summarise, we can now state that one quarter of visitors were
aged 14 years or under, while one quarter were aged 18 years or more. In addition, we can also quote the
interquartile range by stating that 50% of the visitors were aged between 14 and 18 years of age.
Figure 3.3: Age Profile of Visitors to the Arun Youth Centre
10, 13, 15, 16, 16, 17, 18, 18, 18, 20
Lower quartile value = (13 + 15) / 2 = 14

Upper quartile value = (18 + 18) / 2 = 18
Effectively, the interquartile range is a refinement of the median and is most easily calculated from the cumulative
frequency curve. In the Kano rainfall example, discussed in your descriptive statistics handout, the lower
quartile is read off by tracing a line from the 25% level to the curve, and then down to the appropriate rainfall
(about 725mm), and the upper quartile by reading across from 75% (to find about 1000mm). This means that over the
period in question, half of the years had a rainfall between 725mm and 1000mm, with the interquartile range
itself therefore being 275mm.
To calculate the quartile ranges for grouped data, it is first necessary to calculate the cumulative frequencies
as in the Kano example. When trying to calculate the quartile values of grouped data it is again necessary to
make assumptions regarding the distribution of values within the class. In this instance it is assumed that the
distribution is even and the lower quartile is calculated as follows:
Q1 = LCL(Q1) + ( (n/4 - cf(LC)) / f(Q1) ) x w(Q1)
Where:
Q1: is the lower quartile range
LCL(Q1): is the lower class limit of the class containing the lower quartile
n: is the sample size
cf(LC): is the cumulative frequency of the class immediately below that containing the lower quartile
w(Q1): is the width of the class interval containing the lower quartile
f(Q1): is the frequency of the class interval containing the lower quartile
The calculation for the upper quartile is:
Q3 = LCL(Q3) + ( (3n/4 - cf(LC)) / f(Q3) ) x w(Q3)
In this case, Q3 reflects the relevant upper quartile values and can be substituted in the description of terms
stated for calculating Q1.
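A Python sketch of the grouped-data formula is given below. The class intervals and frequencies are invented for illustration - they do not come from the Kano example:

```python
# Invented grouped frequency table: (lower class limit, upper class limit, f).
classes = [(0, 10, 4), (10, 20, 6), (20, 30, 8), (30, 40, 2)]
n = sum(f for _, _, f in classes)   # 20

def grouped_quartile(classes, target):
    # target is n/4 for Q1, or 3n/4 for Q3.
    cf = 0  # cumulative frequency of the classes below the quartile class
    for lcl, ucl, f in classes:
        if cf + f >= target:
            w = ucl - lcl                        # class width
            return lcl + ((target - cf) / f) * w
        cf += f
    raise ValueError("target lies beyond the data")

q1 = grouped_quartile(classes, n / 4)      # 10 + ((5 - 4) / 6) * 10
q3 = grouped_quartile(classes, 3 * n / 4)  # 20 + ((15 - 10) / 8) * 10
print(round(q1, 2), round(q3, 2))  # 11.67 26.25
```

As in the formula, the function first locates the class containing the quartile from the cumulative frequencies, then interpolates within that class on the assumption that values are evenly spread across it.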
3.1.3 Mean Deviation
Unlike the range, the mean deviation measures dispersion about a particular average, namely the arithmetic
mean. It is the average (arithmetic mean) of all the deviations of values from the arithmetic mean ignoring
minus signs. If deviations are considered with plus and minus signs and are measured from the mean then
their total will be zero by definition of the arithmetic mean. Basically the mean deviation tells us the average
distance by which all items in a data set differ from their mean.
For example, for the beer drinking figures of the 2nd group of SEMAL students:
Value: 0 5 10 15 30
d (deviation): -12 -7 -2 +3 +18 (Σd = 0)
|d|: 12 7 2 3 18
[|d|, pronounced mod d, is the mathematical shorthand for saying: ‘ignore minus signs’.]
The mean deviation = Σ|d| / n = (12 + 7 + 2 + 3 + 18) / 5 = 42 / 5 = 8.4
By ignoring minus signs the mean deviation ignores the fact that some items are greater than the average and
some less, consequently this measure of dispersion gives no idea of the way the items are spread around the
average.
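A quick Python sketch confirms the arithmetic:

```python
def mean_deviation(values):
    mean = sum(values) / len(values)                        # 12 pints
    return sum(abs(v - mean) for v in values) / len(values)

print(mean_deviation([0, 5, 10, 15, 30]))  # (12+7+2+3+18)/5 = 8.4
```

The `abs()` call is the |d| step: minus signs are discarded before the deviations are averaged.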
3.1.4 Standard Deviation
Standard deviation is one of the most fundamental measures of dispersion used in statistical analysis. Standard
deviation measures the dispersion around the average, but does so on the basis of the figures themselves, not
just the rank order. Like the mean deviation it is calculated from the deviations of each item from the arithmetic
mean. To ensure that these deviations do not total zero, they are squared before being added
together. This removes all minus signs, since two negative values multiplied give a positive value. Thus, by
summing the squares of the deviations, the 'sum of squares' (or sum of squared differences) is arrived at. The
mean of ‘the sum of squares’ is known as the ‘variance’. The square root of the variance is the standard
deviation.
Standard deviation is symbolized by ‘s’ for a sample and ‘σ ’ for a population.
For an ungrouped, discrete data series, the standard deviation can therefore be calculated as:

σ = √( Σ(x − x̄)² / n )

or alternatively,

σ = √( (Σx² / n) − (Σx / n)² )
The calculation of the standard deviation is illustrated in the following example:
The Calculation of the Standard Deviation
Values (x)    (x − x̄)    (x − x̄)²
3             −0.6        0.36
2             −1.6        2.56
1             −2.6        6.76
2             −1.6        2.56
3             −0.6        0.36
4              0.4        0.16
3             −0.6        0.36
7              3.4        11.56
6              2.4        5.76
5              1.4        1.96
Total                     32.4
The standard deviation figure of 1.8 is useful as it provides an indication of how closely the scores are clustered
around the mean. The value of the standard deviation is best understood when placed in the context of the normal
distribution. Generally, around 68% of all scores fall within 1 standard deviation of the mean. In this example, with a
standard deviation of 1.8, this tells us that the majority of scores in this sample are within 1.8 units above or below the
mean (3.6 +/- 1.8). The standard deviation is useful when you want to compare samples using the same scale.
For example, suppose we were to take a second sample of scores and calculate a standard deviation of 3.6. Comparing
this to the standard deviation from our first sample would indicate that scores in the first sample are
more closely clustered around the mean value than scores in the second sample.
WORKED EXAMPLE

Step 1: First calculate the mean of the sample:

x̄ = Σx / n = 36 / 10 = 3.6

Step 2: Now calculate the standard deviation:

σ = √( Σ(x − x̄)² / n ) = √( 32.4 / 10 ) = 1.8

Where:
σ: standard deviation
Σ: sum of
x: the value
x̄: the mean
n: the number of values
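The worked example can be checked with a short Python sketch:

```python
def population_sd(values):
    mean = sum(values) / len(values)                 # 36 / 10 = 3.6
    sum_sq = sum((v - mean) ** 2 for v in values)    # the 'sum of squares'
    variance = sum_sq / len(values)                  # mean of the sum of squares
    return variance ** 0.5                           # standard deviation

scores = [3, 2, 1, 2, 3, 4, 3, 7, 6, 5]
print(round(population_sd(scores), 2))  # 1.8
```

Each line mirrors a step above: deviations from the mean are squared and summed (32.4), averaged to give the variance (3.24), and square-rooted to give the standard deviation (1.8).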
In conclusion, the standard deviation is a measure of dispersion which indicates the spread of the data
values around the arithmetic mean.
‘Quoting the standard deviation of a distribution is a way of indicating a kind of ‘average’ amount by
which all the values deviate from the mean. The greater the dispersion the bigger the deviations and
the bigger the standard (average) deviation’
(Rowntree, 1981, p. 54, cited in Riley, M. et al, 1998, p. 197)
To demonstrate your familiarity with basic measures of dispersion, for the following data sets, relating to bedspace size of B&B establishments in Blackpool, calculate the mean, median, range and standard deviation.
Results:
Mean:
Median:
Range:
Standard Deviation:
Results:
Mean:
Median:
Range:
Standard Deviation:
Results:
Mean:
Median:
Range:
Standard Deviation:
Activity 11:
Sample A:
 4   9  16  10
34  20   8   6
32  14  10  27
18  12  17  10
48  12  14   6
 6  19  10   6
17  11   6   6
14  12  10   8
 4  14  16   8
18  19  34  14

Sample B:
34  72  11  10
34  20  38   6
32  14  19  27
18  12  17  10
48  12  14  34
16  19  10  32
17  11  50   6
50  12  62   8
 4  14  16   8
18  19  34  23

Sample C:
14   9  11  10
34  20   8   6
32  14  19  27
18  12  17  10
48  12  14  34
 6  19  10   6
17  11   6   6
50  12  62   8
 4  14  16   8
18  19  34  14
3.2 Other Distributions
There are of course variations on the normal distribution. Distributions can also vary depending on how flat
or peaked they are. The degree of flatness or peakedness is referred to as the kurtosis of the distribution. If
a distribution is highly peaked it is leptokurtic and if the distribution is flat it is platykurtic. Leptokurtic
distributions appear relatively thin in appearance, and somewhat pointy. In contrast, platykurtic distributions
are flatter, reflecting a greater number of scores in the tails of the distribution. A distribution between the
extremes of peakedness and flatness is classed as mesokurtic (see Figure 3.4). In a normal distribution
curve, the value of kurtosis is 0 (i.e. the distribution is mesokurtic). If a distribution has a value above or below
0 then this indicates a level of deviation from the norm. You don't need to worry about kurtosis too much at this
point, but you will notice that when you produce descriptive statistics in SPSS, a value for kurtosis is given.
Positive values of kurtosis indicate that the distribution is leptokurtic, whereas negative values suggest
that the distribution is platykurtic (Dancey and Reid, 2002).
Figure 3.4: Examples of Leptokurtic, Platykurtic and Mesokurtic Distributions
[Source: Dancey and Reid, 2002, p. 70]
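If you are curious, moment-based skewness and kurtosis can be sketched in Python. Note that this uses the simple moment definitions; SPSS applies slightly different sample-adjusted formulas, so its values will not match exactly:

```python
def moments(values):
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5      # 0 for a symmetrical distribution
    kurtosis = m4 / m2 ** 2 - 3    # 0 for a normal (mesokurtic) curve
    return skewness, kurtosis

# A small symmetrical, flat-ish data set: no skew, negative kurtosis.
skew, kurt = moments([1, 2, 3, 4, 5])
print(skew, round(kurt, 2))  # 0.0 -1.3
```

The negative kurtosis value flags the flat, platykurtic shape of this little data set, matching the sign convention described above.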
3.2.1 Skewed Distributions
Distributions can also be skewed (see Figure 3.5). A positive skew is when the peak lies to the left of the
mean and a negative skew when it lies to the right of the mean. The further the peak lies from the centre
of the horizontal axis, the more the distribution is said to be skewed. If you come across badly skewed
distributions then you need to consider whether the mean is the best measure of central tendency,
as the scores in the extended tail will be distorting your mean. As discussed in your descriptive
statistics handbook, at this point it might be more appropriate to use the median or the mode to give a more
representative indication of the typical score in your sample. The SPSS output for descriptive statistics will
also provide a measure of skewness. A positive value suggests a positively skewed distribution, whereas
a negative value suggests a negatively skewed distribution. A value of zero indicates that the distribution
is not skewed in either direction (i.e. the distribution is symmetrical).
Figure 3.5: Examples of Skewed Distributions
These refinements need not concern us here, but will need consideration when it comes to deciding which
statistical tests you wish to use to examine the data. For now, it is perhaps enough to make a distinction
between the most powerful ‘parametric’ tests which rely on the data concerned being normally distributed, and
the less powerful non-parametric ones which do not. If you have control over the collection of your data you
should do your best to collect data on which parametric tests can be conducted, but if you cannot ensure this
quality or need to use others’ information it may be better to use the less powerful tests.
3.3 The Standard Normal Distribution
The standard normal distribution (SND) is a probability distribution. The value of
probability distributions is that there is a probability associated with each particular score from the distribution.
More specifically, the area under the curve between any specified points represents the probability of obtaining
scores within these specified points. For example the probability of obtaining scores between -1 and + 1
standard deviations from the distribution is about 68% (see Figure 3.6). This means that:
68.26% of observations fall within plus or minus one standard deviation of the mean;
95.44% of observations fall within plus or minus two standard deviations of the mean;
99.7% of observations fall within plus or minus three standard deviations of the mean
These percentage values will be referred to later as ‘confidence limits’.
Figure 3.6: The Standard Normal Distribution
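These proportions can be checked in Python using the error function from the standard library (note that the figures quoted above are truncated, whereas the sketch below rounds):

```python
import math

# P(-k < Z < +k) for a standard normal variable Z equals erf(k / sqrt(2)).
def proportion_within(k):
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(100 * proportion_within(k), 2))  # 68.27, 95.45, 99.73
```

These are exactly the one, two and three standard deviation proportions listed above, derived from the curve itself rather than from a printed table.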
Let me illustrate this through a specific example. Figure 3.7 illustrates the number of tourists bungee
jumping off a bridge at an extreme sports academy in New Zealand. There were 150 tourists in total, and the brave
souls were most frequently aged between 26 and 30 (the highest bar). The graph also indicates that very few
people over the age of 60 participate in bungee jumping (thank goodness for that!). If we think about this
distribution as a probability distribution we could start asking specific questions. For example, how likely is it
that a 60 year old will undertake a bungee jump in New Zealand? A look at the distribution suggests the answer
'not very likely'. However, if you were asked how likely it is that a 30 year old went bungee jumping,
your answer would be 'quite likely'. Indeed, the distribution shows that 30 of the 150 tourists were aged
around 30 (equating to 20% of the total sample). Therefore, using this data it is possible to estimate the
probability that a particular score will occur.
Figure 3.7: Tourists Bungee Jumping in New Zealand
Using the characteristics of the SND, it is possible to calculate the probability of obtaining scores within any
section of the distribution. Statisticans (much clever than me) have calculated the probability of certain scores
occurring in a normal distribution with a mean of 1 and standard deviation of 1. If our sample shares these
values, then we can use a table of probabilities for normal distribution to assess the likelihood of a particular
score occurring. However in reality, it is likely that the data we will collect will have a mean of 0 and standard
deviation of 1. However, as Field (2003) points out any data set can be converted into a data set that has a
mean of 0 and standard deviation of 1. First to centre the zero, we take each score and subtract from it the
mean of all the scores. Then, we divide the resulting score by the standard deviation to ensure the data have
a standard deviation of 1. The resulting scores are called z scores. The z-score is expressed in standard
deviation units - the z score therefore tells us how many standard deviations above the mean our score is. A
negative z score is below the mean and a positive z score is above the mean.
Extreme z scores, for example greater than 2 and below 2, have a much smaller chance of being obtained
than scores in the middle of the distribution. That is areas of the curve above 2 and below -2 are small in
comparison with the area between -1 and 1 (see Figure 3.8).
Figure 3.8: Areas in the middle and extremes of the Standard Normal Distribution
Let us refer back to our example of bungee jumping in New Zealand, where we can now answer the question:
what is the probability of someone over 60 doing a bungee jump? First we need to convert 60 into a z score.
For the population, the mean age is 32 and the standard deviation is 11. In this instance 60 will become:

(60 - 32) / 11 = 2.55

This indicates that the score is 2.55 standard deviations above the mean.
Consider another example. The mean IQ scores for many IQ tests is 100 and the standard deviation is 15.
If you had an IQ score of 130, then your z-score would be:
(130-100)/15=2
This indicates that your score is 2 standard deviations above the mean.
Using the z-score we can also calculate the proportion of the population who would score above or below your
score - or in the case of the normal distribution the area under the normal distribution curve. Figure 3.9
illustrates that the IQ score of 130 is 2 standard deviations above the mean. The shaded area represents the
proportion of the population who would score less than you, and the unshaded area represents those who
would score more than you. To calculate the specific proportion of the population that would score more or less
than you we refer to a standard normal distribution table (see Table 3.1). The table indicates that the
proportion falling below your z-score is 0.9772 or 97.72%. In order to find the proportion above your score, you
simply subtract the above proportion (0.9772) from 1. In this case the proportion is .0228 or 2.28%. When
using statistical tables for SND you should note that only details of positive z scores are given (those that fall
above the mean). If you have a negative z score disregard the negative sign of the z score to find the relevant
areas above and below your score (Figure 3.10).
z-score = (score - mean) / standard deviation
p. 3-103
Figure 3.9: Normal Distribution showing the proportion of the population with an IQ of less than 130
Table 3.1: Z Scores for Standard Normal Distribution
p. 3-104
Figure 3.10: The proportions of the curve below positive z scores and above negative z scores
Let us now refer back to the z-score calculated when asking about the probability of people over 60 bungee
jumping in New Zealand. The calculated z-score is 2.54. Refer to the table of probability values that has been
included in the appendices. Look up the value of 2.54 in the column labelled 'smaller portion' (i.e. the area
above the value 2.54). You should find that the probability value is .00554, i.e. a 0.554% chance that a
person over 60 would bungee jump. By looking at the values of the 'bigger portion', we find that the
probability of those jumping being under the age of 60 was .99446. Alternatively, there is a 99.45% probability
that those tourists jumping were aged below 60 (.99446 = 1 - .00554).
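The z-score arithmetic and table look-ups above can be reproduced in a few lines of Python. This is a sketch using the standard library's NormalDist; the means and standard deviations are the ones quoted in the examples above:

```python
from statistics import NormalDist

def z_score(score, mean, sd):
    """Express a raw score in standard deviation units."""
    return (score - mean) / sd

snd = NormalDist(mu=0, sigma=1)  # the standard normal distribution

# IQ example: mean 100, SD 15, score 130
z_iq = z_score(130, 100, 15)
below = snd.cdf(z_iq)        # proportion scoring less than you
above = 1 - below            # proportion scoring more than you
print(z_iq, round(below, 4), round(above, 4))  # 2.0 0.9772 0.0228

# Bungee example: mean age 32, SD 11, age 60
z_age = z_score(60, 32, 11)            # approximately 2.54
smaller_portion = 1 - snd.cdf(z_age)   # roughly 0.0055, as in the table
print(round(z_age, 2), round(smaller_portion, 4))
```

The `cdf` call plays the role of the 'bigger portion' column of the printed table, so no interpolation between table rows is needed.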
Certain z-scores are particularly important, as their values cut off certain important percentages of the distribution.
As Field (2003) highlights, the first important value is 1.96, as this cuts off the top 2.5% of the distribution, and
its counterpart at the opposite end (-1.96) cuts off the bottom 2.5% of the distribution. As such, these values
together cut off 5% of scores, or put another way, 95% of z-scores lie between -1.96 and 1.96. The other
important scores are +/- 2.58 and +/- 3.29, which cut off 1% and 0.1% of scores respectively. Put another way,
99% of z-scores lie between -2.58 and 2.58, and 99.9% of z-scores lie between -3.29 and 3.29. These values will
crop up time and time again; indeed, we have already referred to these values when discussing the characteristics
of the normal distribution curve.
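These cut-off values can be recovered from the inverse of the standard normal cumulative distribution function; a quick check in Python (standard library only):

```python
from statistics import NormalDist

snd = NormalDist()  # mean 0, SD 1

# For a two-tailed 5% level, 2.5% sits in each tail,
# so we look up the 97.5th percentile, and so on.
z_95 = snd.inv_cdf(0.975)    # bounds 95% of z-scores
z_99 = snd.inv_cdf(0.995)    # bounds 99% of z-scores
z_999 = snd.inv_cdf(0.9995)  # bounds 99.9% of z-scores

for z in (z_95, z_99, z_999):
    print(round(z, 2))  # prints 1.96, 2.58, 3.29
```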
p. 3-105
3.4 Confidence Intervals
Although the sample mean is an approximation of the population mean, we are not sure how good an
approximation it is. Because the sample mean is a particular value or point along a variable, it is known as a
point estimate of the population mean. It represents one point on a variable and because of this we do not
know whether our sample mean is an underestimation or overestimation of the population mean. We can
therefore use confidence intervals to help us identify where on the variable the population mean may lie.
Confidence intervals of the mean are interval estimates of where the population mean may lie and they
provide us with a range of scores (an interval) within which we can be confident that the population mean lies
(see Figure 3.11). Because we are still only using estimates of population parameters it is not guaranteed that
the population mean will fall within this range; we therefore have to give an expression of how confident we are
that the range we calculate contains the population mean. Hence the term ‘confidence intervals’.
Figure 3.11: The role of confidence intervals in determining the position of the population mean in relation to
the sample mean
p. 3-106
We have already discussed the characteristics of the sample mean: it tends to be normally distributed,
and provides a good approximation of the population mean. Using the basic characteristics of the normal
distribution allows us to estimate how far our sample mean is from the population mean.
As shown in Figure 3.12, we know that the sample mean is going to be a certain number of standard
deviations above or below the population mean. Indeed, we can be 99.74% certain that the sample mean
will fall within -3 and +3 standard deviations. As discussed earlier, this area accounts for most of the scores in
the distribution. If we wanted to be 95% certain that a certain area of the distribution contained the sample
mean we would have to refer back to the z-scores. As highlighted earlier, 95% of the area under the SND falls
within -1.96 and +1.96 standard deviations. Thus we can be 95% certain that the sample mean will lie within
-1.96 and +1.96 standard deviations of the population mean (see Figure 3.13).
Figure 3.12: Sample mean is a certain number of S.Ds above or below the population mean
Figure 3.13: Percentage of curve (95%) falling between -1.96 and +1.96 S.Ds
p. 3-107
For illustration, assume that the sample mean is somewhere above the population mean. If we draw the
distribution around the sample mean instead of the population mean we see the situation in Figure 3.14.
Figure 3.14: Location of the population mean where distribution is drawn around the sample mean
Applying the same logic, we can be confident that the population mean falls somewhere within 1.96 standard
deviations below the sample mean. Similarly, if the sample mean is below the population mean we can be
confident that the population mean is within 1.96 standard deviations above the sample mean (Figure 3.15).
We can therefore be confident (95%) that the population mean is within the region 1.96 standard deviations
above or below the sample mean. With this information we can now calculate how far the sample mean is
from the population mean. All we need to know is the sample mean and the standard deviation of the sampling
distribution of the mean (standard error).
Figure 3.15: Distribution drawn around the sample mean when it falls below the population mean
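The 95% logic above can be checked with a short simulation: draw repeated samples from a known population and count how often the interval 'sample mean ± 1.96 standard errors' contains the population mean. This is a sketch with made-up population values (standard library only):

```python
import math
import random

random.seed(42)  # reproducible runs

POP_MEAN, POP_SD = 32, 11   # hypothetical population (e.g. ages)
N, TRIALS = 50, 1000        # sample size and number of repeated samples

hits = 0
for _ in range(TRIALS):
    sample = [random.gauss(POP_MEAN, POP_SD) for _ in range(N)]
    mean = sum(sample) / N
    se = POP_SD / math.sqrt(N)   # standard error, using the known population SD
    if mean - 1.96 * se <= POP_MEAN <= mean + 1.96 * se:
        hits += 1

coverage = hits / TRIALS
print(coverage)  # close to 0.95
```

Roughly 95% of the simulated intervals trap the population mean, which is exactly what the 'confidence' in a 95% confidence interval refers to.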
p. 3-108
Activity 12:
Use the following table to calculate the following:
a] The probability that z is less than or equal to 0.7
b] The probability that z is more than 0.7
c] The probability that z is less than or equal to 2 and equal to or more than -2
d] The probability that z is less than or equal to 3 and equal to or more than -3
Record your answers below:
p. 3-109
3.5 The Standard Error
One useful adjunct of the normal distribution is standard error, or the standard deviation of the sampling
distribution of the mean, which can be helpful in gauging the precision of your sample, and deciding how large
your eventual sample should be from a pilot study. The standard error is a measure of the degree to which the
sample means deviate from the mean of the sample means. Given that the mean of the sample means is also
a close approximation of the population mean, the standard error of the mean must also tell us the degree to
which the sample means deviate from the population mean. Consequently, once we are able to calculate the
standard error, we can use this information to find out how good an estimate our sample mean is of the
population mean. This is illustrated in Figure 3.16.
Figure 3.16: Calculating the Standard Error
[Source: Field, A, 2003, p. 16]
p. 3-110
Figure 3.16 illustrates the process of taking samples from the population. In this case Field (2003) is looking
at the ratings of lecturers. If we were to take the ratings of all lecturers the mean value would be 3. As illustrated
in Figure 3.16, each sample has a mean value, and these have been presented in a frequency chart. As you
can see, some samples have the same mean as the population, some are lower and some are higher. These
differences reflect sampling variation. As you can see, the end result is a symmetrical distribution,
known as a sampling distribution (Field, 2003). If we were to take the average of all the sample means, we would
get the same value as the population mean. But how well does any one sample mean represent the population mean?
We used the standard deviation as a measure of how representative the mean was of the observed data. If we
measure the standard deviation between the sample means, this gives us a measure of how much
variability there is between the means of the different samples. The standard deviation of the sample
means is known as the standard error of the sample mean.
The standard error is very similar to the standard deviation, but takes account of sample size. The larger the
sample size, the lower the standard error.
SE (mean) = Standard Deviation of the Sample (s) / √ Sample Size (n)
Dividing the standard deviation by the square root of the sample size takes account of the fact that the larger
the sample size, the more likely that the sample is representative, and vice versa. Any probability of the
sample mean being close to the population mean can be calculated, but for our purposes we will only examine
the population mean that we can estimate with 95% probability, which corresponds to two standard errors
away from the mean.
For example, in investigating the geography of sport in Lancashire, you might want to find out how far Warrington’s
supporters travelled to the match. From sampling the crowd you might find a mean of 23km travelled to
Wilderspool, and a Standard Error of 3km. This means that your sampling suggested (with 95% certainty) that
the mean distance which supporters of Warrington RLFC travelled to the match is 23km ± 6km. This does not
mean that 95% of supporters travel between 17km and 29km, but rather is a measure of the confidence with
which you state the mean. You can be pretty certain that if you sampled the crowd twenty times nineteen of
your answers would be within this range.
p. 3-111
The following example highlights a practical application of the standard error in attempting to assess the
mean spending of short break holidaymakers in Chichester.
The following results were obtained:
Visitor Spending (£)
In this example, the standard error has been calculated at 9.43. With reference back to the properties of the
normal distribution curve, we can conclude that it is likely that 68 times out of 100 (or approximately 2 in 3
times) the true mean of the population lies within the range 127 ± 9.43, that is between £117.57 and
£136.43 (or the Mean ± 1 x Standard Error (SE)). If we wish to predict the range with greater confidence
then the rule of plus or minus two standard errors can be applied to give a 95% confidence level. In this case
the true mean of the population would lie within the range 127 ± 18.86, that is between £108.14 and £145.86
(or the Mean ± 2 x SE).
WORKED EXAMPLE

Values (x)    (x − x̄)    (x − x̄)²
109           -18         324
97            -30         900
112           -15         225
156            29         841
86            -41        1681
94            -33        1089
176            49        2401
158            31         961
147            20         400
135                      Total: 8886

Step 1: First calculate the mean of the sample:

x̄ = Σx / n = 1270 / 10 = 127

Step 2: Now calculate the standard deviation:

σ = √(Σ(x − x̄)² / n) = √(8886 / 10) = 29.809

Step 3: Now calculate the standard error:

SE = Standard Deviation of the Sample (s) / √ Sample Size (n)

SE = 29.809 / √10 = 29.809 / 3.16 = 9.43
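The worked example above can be checked in a few lines of Python (standard library only). Note that the standard deviation here uses the population formula (dividing by n) to match the hand calculation:

```python
import math

# Visitor spending values from the worked example above
spending = [109, 97, 112, 156, 86, 94, 176, 158, 147, 135]

n = len(spending)
mean = sum(spending) / n                      # 127.0
ss = sum((x - mean) ** 2 for x in spending)   # sum of squared deviations, 8886.0
sd = math.sqrt(ss / n)                        # population formula, as in the hand calculation
se = sd / math.sqrt(n)                        # standard error

lower_95 = mean - 1.96 * se
upper_95 = mean + 1.96 * se
print(round(sd, 2), round(se, 2))              # 29.81 9.43
print(round(lower_95, 2), round(upper_95, 2))  # 108.52 145.48
```

The 95% limits printed here match the critical-z version of the confidence interval discussed on the following page.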
p. 3-112
In the above example, the selected standard errors equated to critical z values of 1.0 and 2.0. These values
help to establish and define the ‘confidence limits’. As discussed these limits are usually described in
percentage rather than absolute values, and you would therefore refer to 68.2%, 95.4% and 99.7% confidence
levels. For the 95% (0.95) and 99% (0.99) levels (the percentage values have been rounded for convenience)
the critical z values are 1.96 and 2.58 respectively. Therefore, if we were to refer back to our previous
example, we can redefine our confidence limits and expected ranges in which we would expect the mean
value of the population to lie.
For example, in the previous example, at the 95% confidence level the limits were given by:
127 ± (2 x 9.43) = £108.14 to £145.86
If we adopt the critical z values for the standard error at a 95% confidence level, the limits are now defined as:
127 ± (1.96 x 9.43) = £108.52 to £145.48
If we adopt the critical z values for the standard error at a 99% confidence level, the limits are now defined as:
127 ± (2.58 x 9.43) = £102.67 to £151.33
Effectively, higher confidence levels can only be achieved at the expense of wider confidence intervals.
Therefore, we can be 99% certain that the population mean lies between £102.67 and £151.33, but only 95
per cent confident that it lies between the narrower bounds of £108.52 and £145.48. Clearly the best way to
gain greater accuracy in sample estimates is to increase the sample size (n). As the sample size (n) increases
the standard error, or spread, of the sampling distribution is reduced and the resulting confidence intervals are
narrowed.
Referring back to our previous example, which focused on visitor spending, increasing the sample size to 100
yields the following results:
Mean: £127
Std Dev: 29.95
Standard Error: 2.99
Adopting the same confidence limits as before, we can now be 95% confident that the
population mean lies between:
127 ± (1.96 x 2.99) = £121.14 to £132.86
and at the 99% confidence level the population mean lies between:
127 ± (2.58 x 2.99) = £119.29 to £134.71
As you can clearly see, increasing the sample size has significantly reduced the width of the confidence
intervals.
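The narrowing effect of a larger sample can be sketched with a small helper function, using the standard deviations reported above:

```python
import math

def ci_95(mean, sd, n):
    """95% confidence interval for the mean, using the critical z value 1.96."""
    se = sd / math.sqrt(n)
    return mean - 1.96 * se, mean + 1.96 * se

small = ci_95(127, 29.81, 10)    # sample of 10
large = ci_95(127, 29.95, 100)   # sample of 100

print([round(x, 2) for x in small])
print([round(x, 2) for x in large])

width_small = small[1] - small[0]
width_large = large[1] - large[0]
print(round(width_small, 2), round(width_large, 2))  # the interval narrows as n grows
```

Because the standard error shrinks with √n, a tenfold increase in sample size cuts the interval width by a factor of roughly √10 ≈ 3.16.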
p. 3-113
This is graphically illustrated in Figure 3.17 below.
Figure 3.17: Confidence Intervals with Sample Sizes
a] Sample size of 10: 95% confidence interval from £108.52 to £145.48 (range = 36.96), around the sample mean of 127
b] Sample size of 100: 95% confidence interval from £121.14 to £132.86 (range = 11.72), around the sample mean of 127
As is evident in Figure 3.17, increasing the sample size results in a much narrower range of scores and gives
us a much clearer indication of where the population mean may be. This in turn underlines the importance of
sample size when trying to estimate population parameters from sample statistics. Generally the larger the
sample size, the better the estimate of the population we can get from it.
p. 3-114
Refer back to the exercise on page 3-93. This time calculate the standard error for each sample, and the standard error ranges at 95% and 99% (using z-scores).
Results:
Standard Error:
The standard error range at 95%
Lower:
Upper:
The standard error range at 99%
Lower:
Upper:

Results:
Standard Error:
The standard error range at 95%
Lower:
Upper:
The standard error range at 99%
Lower:
Upper:

Results:
Standard Error:
The standard error range at 95%
Lower:
Upper:
The standard error range at 99%
Lower:
Upper:
Activity 14:
Sample A:
4 9 16 10
34 20 8 6
32 14 10 27
18 12 17 10
48 12 14 6
6 19 10 6
17 11 6 6
14 12 10 8
4 14 16 8
18 19 34 14
Sample B:
34 72 11 10
34 20 38 6
32 14 19 27
18 12 17 10
48 12 14 34
16 19 10 32
17 11 50 6
50 12 62 8
4 14 16 8
18 19 34 23
Sample C:
14 9 11 10
34 20 8 6
32 14 19 27
18 12 17 10
48 12 14 34
6 19 10 6
17 11 6 6
50 12 62 8
4 14 16 8
18 19 34 14
p. 3-115
3.6 Looking at Distributions in SPSS
As discussed in this handbook, SPSS will produce basic descriptive statistics for dispersion in the Descriptive
dialog box. Refer back to your descriptive statistics section for guidance. Statistics for variance can also be
created via the Explore dialog box. The following example is using the Age variable in the Dataset file.
Move the mouse over Analyze and press the left mouse button.
Move the mouse over Descriptive Statistics and then over Explore and
press the left mouse button.
The Explore dialog box appears.
Select Age and click the central arrow so that Age appears in the
Dependent List.
p. 3-116
Move the mouse over Statistics and press the left mouse button.
The Explore: Statistics dialog box opens. At this point we
can assign a confidence interval for the mean (as discussed
in the previous sections). Make sure that the confidence
interval is set to 95%.
Click Continue.
This returns you to the Explore dialog box. Click OK.
A summary table is produced in the output window.
p. 3-117
This summary table provides you with basic descriptive statistics including the mean and the median, and
measures of dispersion including the range, standard deviation and standard error. The output also provides
the confidence interval at 95% (47.07 to 48.34). Note that Age is a ratio data type, and that the average would
not apply to ordinal or nominal data sets.
3.7 Graphically Looking at Distributions in SPSS
Refer back to your descriptive statistics handbook for information on how to produce basic frequency histograms,
stem and leaf plots and box plots.
SPSS will also allow you to plot the normal distribution over a frequency histogram, so you can ascertain how
the distribution of your sample relates to the normal distribution. The following example again uses the Age
variable in the Dataset file.
Move the mouse over Graphs and then Chart Builder and press the left mouse
button.
p. 3-118
The Chart Builder dialog box appears.
p. 3-119
Select Histogram in the Choose From: box. A series of charts are presented.
p. 3-120
Move the mouse over the Simple Histogram and, holding the left mouse button down, drag it into the chart window. Release
the left mouse button and a simple histogram is presented. An Element Properties dialog box also appears and we will
return to this shortly.
You will notice that the histogram presents options for the vertical Y-axis and the
horizontal X-axis. In this case we need to assign Age to the X-axis. The vertical
Y-axis will be frequency which SPSS will default to automatically.
p. 3-121
Move the mouse over Age in the Variables box and, holding down the left mouse button, drag Age over to the X-Axis box.
Release the left mouse button and Age is assigned to the X-axis of the histogram.
We now need to assign a Normal Distribution Curve to the
histogram. Shift your attention to the accompanying Element
Properties dialog box.
Select Display normal curve in the dialog box and click Apply.
Notice that a Normal Distribution curve has been superimposed on
top of the histogram in the Chart Builder window. Click OK in the
Chart Builder window.
p. 3-122
A frequency histogram is produced in the output window, with a normal distribution curve plotted on it. As you can
see from this output, the Age variable bears some resemblance to the normal distribution, although the overall shape
of the curve is influenced by a number of outlying values.
p. 3-123
As before, we can also use the Split File option
to look at specific cases. For example, here Area
has been selected and two separate
distribution curves for the Chichester and Arun
Districts have been produced.
p. 3-124
Table 15: Descriptive Statistics for GTBSscore08
GTBSscore08
Please cut and paste your histogram below and rescale accordingly.
Descriptive Statistics
Mean
Median
Mode
Standard Deviation
Standard Error
Skewness
Kurtosis
Please provide a brief summary of the distribution:
Table 16: Descriptive Statistics for GTBSscore08 - Chichester District
GTBSscore08: Chichester District
Please cut and paste your histogram below and rescale accordingly.
Descriptive Statistics
Mean
Median
Mode
Standard Deviation
Standard Error
Skewness
Kurtosis
Please provide a brief summary of the distribution:
Activity 15:
p. 3-125
GTBSscore08: Arun District Council
Please cut and paste your histogram below and rescale accordingly.
Descriptive Statistics
Mean
Median
Mode
Standard Deviation
Standard Error
Skewness
Kurtosis
Please provide a brief summary of the distribution:
Table 17: Descriptive Statistics for GTBSscore08 - Arun District
Repeat this exercise for an additional variable (which should be ratio or interval in nature). Record your results by cutting and pasting your output into your log book.
Activity 15:
p. 3-126
Notes:
Student T-Test, Paired Samples T-Test, Mann Whitney and Wilcoxon

Section 4

Learning Outcomes

At the end of this session, you should be able to:

Understand the rationale for the use of parametric and non-parametric tests

Examine the relationship between variables using parametric and non-parametric tests, constructing suitable null and alternative hypotheses

Apply the procedure for conducting parametric and non-parametric tests in SPSS in relation to the Student T-Test, the Paired Samples T-Test, Mann Whitney and Wilcoxon

Interpret computer generated SPSS output in relation to the above tests
© Dr Andrew Clegg p. 4-127
Data Analysis for Research Statistical Tests: Introduction
4.0 Introduction
Statistical tests are used to make deductions about a particular data set or relationships between different
data sets. For example, you might have interviewed a random sample of 50 households from two rural
villages in West Sussex to compare whether income levels are different. In village A, you calculate the mean
income to be £17,650 and for village B, £22,200. In this instance, a statistical test can be used to determine
whether we have a real difference or whether the difference could have occurred purely by chance. There
are a wide variety of statistical tests, each designed to take account of the different characteristics of the data
sets you may wish to examine. The choice of test to use can prove overwhelming, and indeed frightening at
first. At the most basic level, the principal distinction drawn between different statistical tests is whether they
are 'parametric' or 'non-parametric' tests. Parametric tests can only be performed where the data
conform to a normal distribution and are of an interval or ratio nature. In contrast, non-parametric tests
involve less rigorous conditions and can be used on lower-level data which do not conform to a
normal frequency distribution.
4.1 Null and Alternative Hypotheses
Before conducting a statistical test it is first necessary to establish a hypothesis or statement which the test
then challenges. These hypotheses are referred to as the null hypothesis (H0) and the alternative or
research hypothesis (H1). The null hypothesis is usually expressed as H0: μ1 = μ2, where μn is the mean
for each group, and the subscript n denotes the group. When stating a null hypothesis, the normal procedure
is to start by assuming that there is no real difference between your data sets. A statistical test effectively helps
the researcher to decide whether or not the null hypothesis is true, or more precisely, whether or not it should
be accepted. If the result of the test shows that the null hypothesis should not be accepted and that it should
be rejected, we can then go on to say, with some degree of confidence, that a difference does exist or a
change has occurred (Riley, M. et al, 1998, p. 203). It is important that you express both Ho and H1 in the
context of your own research problem before collecting your data and before starting your analysis. In
reference to the rural income example quoted above, we could formulate the following hypotheses:
H0: μa = μb  There is no significant difference between the mean income of households in village
A as compared with the mean income of households in village B; mean household income
is not influenced by geographical location.

H1: μa ≠ μb  There is a significant difference in the mean household income for households in
village A as compared with village B; mean household income is influenced by geographical
location.
To determine whether or not your sampled data sets are consistent with the null hypothesis or the alternative
hypothesis, we need to perform a probability based significance test. However, before such a test is conducted
we must determine how big any difference has to be, to be considered real beyond that expected due to
chance.
© Dr Andrew Clegg p. 4-128
4.2 Hypothesis Testing
Most tests follow the same basic logic, in that a research hypothesis (your alternative hypothesis) predicts a
difference in distributions, whereas a null hypothesis predicts that they are the same. For each significance
test, we can produce a probability distribution of a test statistic, termed a sampling distribution under the null
hypothesis, calculated on the basis that the null hypothesis is true. A simple example relating to the probability
distribution curve of the Student’s t-statistic is shown in Figure 4.1.
Figure 4.1: Rejection Region for a Probability Distribution
A large visible difference between data sets corresponds to a probability towards the tails of the distribution,
meaning that such a difference is unlikely to have occurred by chance. We can determine whether any
difference between the data sets is too large to be attributed to chance by determining whether the
difference occurs in relation to the tails of the distribution. We can define a critical or rejection region as
that part of the probability distribution beyond a critical value of a test statistic at a certain probability (see
Figure 4.1). We compare this critical value with the calculated test statistic. If the calculated statistic is
greater than the critical value and therefore falls within the rejection region, the difference in the data is
unlikely to have occurred by chance. Consequently, we can reject the null hypothesis and accept the
alternative hypothesis. However, if the calculated value does not fall within the rejection region, this does not
prove the truth of the null hypothesis; it merely fails to reject it.
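To make this logic concrete, here is a sketch of the rural income example as a pooled two-sample t-test done by hand in Python. The income figures and the critical value (2.101 for 18 degrees of freedom at the 0.05 level, two-tailed) are illustrative assumptions, not data from the text:

```python
import math

# Hypothetical annual income samples (in £000s) for two villages
village_a = [17, 18, 16, 19, 17, 18, 16, 17, 18, 17]
village_b = [22, 21, 23, 22, 24, 21, 22, 23, 22, 21]

def mean(xs):
    return sum(xs) / len(xs)

def sample_var(xs):
    """Sample variance, dividing by n - 1."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

na, nb = len(village_a), len(village_b)
# Pooled variance assumes both groups share a common variance
pooled_var = ((na - 1) * sample_var(village_a) +
              (nb - 1) * sample_var(village_b)) / (na + nb - 2)
t = (mean(village_a) - mean(village_b)) / math.sqrt(pooled_var * (1 / na + 1 / nb))

T_CRIT = 2.101  # two-tailed critical value for df = 18 at the 0.05 level
print(round(t, 2), abs(t) > T_CRIT)  # |t| falls in the rejection region -> reject H0
```

Because |t| exceeds the critical value, the calculated statistic falls in the rejection region and we would reject H0 in favour of H1.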
The size of the rejection region is determined by the significance level. Significance levels allow the researcher
to state whether or not they believe a null hypothesis to be true with a given level of confidence or significance
value. Significance levels are presented in statistical tables as a probability value normally expressed in
decimal terms, i.e. 0.05 (5% or p = 0.05 / 1 in 20) and 0.01 (1% or p = 0.01 / 1 in 100). The value 0.05 indicates
the 95 per cent confidence limit and represents the minimum limit for deciding upon whether or not a
particular result is significant and whether or not the null hypothesis should be accepted or rejected. Anything
lower than 95 per cent confidence level, that is where the level is computed to be 94 per cent or less, means
the null hypothesis is normally accepted and the result is regarded as not significant. If the significance level
is found to be higher, that is, it indicates a confidence level of 95 per cent or more has been achieved, then
we say the observed change or difference is significant. By the selection of either a 5% or a 1% significance
level, what we are saying is that we are willing to accept either a 5% or a 1% chance of making an error in
© Dr Andrew Clegg p. 4-129
rejecting the null hypothesis when it is in fact true; this is known as a Type I error. A Type II error represents the
probability of not rejecting the null hypothesis when it is in fact false.
SPSS will report the significance of the calculated test statistic in terms of a probability value p. Where p<0.05
this would indicate a significant result at a 0.05 (5%) level, and p<0.01 would indicate a significant result
at a 0.01 (1%) level, often termed 'highly significant'.
4.3 One and Two Tailed Tests
When conducting a statistical test at a given significance level it is important to consider how the hypothesis
is worded, as this will create the conditions for either a one- or a two-tailed test. Any statement including terms
such as reduces or increases, no lower or no higher, implies a specific direction in the null hypothesis and
consequently forms the basis of a one-tailed test. In contrast, any statement indicating no direction (no
difference/no effect) forms the basis of a two-tailed test. Therefore in relation to the rural income example
stated above, we would perform a two-tailed test. This is because we would have to allow for the average
income for village B to be either larger or smaller than that for village A.
We could, however, have chosen a slightly different alternative hypothesis, for example:

H1: μa > μb  The mean household income for households in Village A is significantly larger as
compared with Village B; mean household income is influenced by geographical location.
This is termed a one-tailed test as we are only interested in a difference in one direction, in this case positive
differences (larger). As a result, the rejection region must be concentrated at one end of the distribution
(hence the term one-tailed). For the sample mean to be larger than the population mean, the rejection
region must lie at the positive end of the x-axis. The choice of a two-tailed or one-tailed test will determine
the distribution of the rejection region. This will now be discussed in the following section.
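The effect of this choice on the critical value can be seen with the standard normal distribution (standard library Python; for a t-test the same idea applies with the t-distribution):

```python
from statistics import NormalDist

snd = NormalDist()
alpha = 0.05

one_tailed = snd.inv_cdf(1 - alpha)       # all 5% in one tail
two_tailed = snd.inv_cdf(1 - alpha / 2)   # 2.5% in each tail

print(round(one_tailed, 2), round(two_tailed, 2))  # 1.64 1.96
# A test statistic of 1.8 would be significant one-tailed but not two-tailed
print(1.8 > one_tailed, 1.8 > two_tailed)
```

The two-tailed cut-off is further out because the 5% risk is shared between both tails, which is exactly the argument Hinton (2004) makes in the extract below.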
© Dr Andrew Clegg p. 4-130
4.3.1 Significance Levels and One and Two-Tailed Predictions
The relationship between significance levels and one and two-tailed predictions is explained by Hinton
(2004) in the following extract:
When we undertake a one-tailed test we argue that if the test score has a probability lower than the significance
level then it falls within the tail-end of the known distribution we are interested in. We interpret this as indicating
that the score is unlikely to have come from a distribution the same as the known distribution but from a
different distribution. If the score arises anywhere outside this part of the tail cut off by the significance level we
reject the alternative hypothesis. This is shown in Figure 4.2. Notice that this shows a one-tailed prediction
that the unknown distribution is higher than the known distribution.
Figure 4.2: A One-Tailed Prediction and the Significance Level
With a two-tailed prediction, unlike the one-tailed, both tails of the known distribution are of interest, as the
unknown distribution could be at either end. However, if we set our significance level so that we take the 5 per
cent at the end of each tail we increase the risk of making an error. Recall that we are arguing that, when the
probability is less than 0.05 that a score arises from the known distribution, then we conclude that the
distributions are different. In this case the chance that we are wrong, and the distributions are the same, is
less than 5 per cent. If we take 5 per cent at either end of the distribution, as we are tempted to do in a two-
tailed test, we end up with a 10 per cent chance of an error, and we have increased the chance of making a
mistake.
We want to keep the risk of making an error down to 5 per cent overall, as otherwise there will be an increase
in our false claims of differences in distributions which can undermine our credibility with other researchers,
who might stop taking our findings seriously. When we gamble on the unknown distribution being at either tail
of the known distribution, to keep the overall error risk to 5 per cent, we must share our 5 per cent between the
two tails of the known distribution, so we set our confidence level at 2.5 per cent at each end. If the score falls
into one of the 2.5 per cent tails we then say it comes from a different distribution. Thus, when we undertake
a two-tailed prediction the result has to fall within a smaller area of the tail compared to a one-tailed prediction,
before we claim that the distributions are different, to compensate for hedging our bets in our prediction. This
is shown in Figure 4.3.
Figure 4.3: A Two-Tailed Prediction and the Significance Level
[Extract taken from Hinton, P. (2004), Statistics Explained, Routledge, London]
The changes in the critical values between one and two-tailed tests have important consequences because
it is possible for Ho to be accepted if the test is two-tailed but rejected if it is one-tailed. This happens with z
values between 1.645 and 1.96: a test statistic of, say, 1.75 falls outside the two-tailed rejection region but
within the one-tailed one. Consequently the phrasing and justification of the alternative hypothesis
should be formulated with considerable care.
Although the actual method for calculating the test statistic is not influenced by the nature of the alternative
hypothesis, the effect of stating a direction is to impose a more rigorous test, which in turn affects the significance
level that can be quoted. By stating a direction in the alternative hypothesis we are effectively establishing a more
precise test.
Table 4.1: Critical z Values for the 0.01 and 0.05 Rejection Regions for One- and Two-tailed Tests
Critical Values
Tailedness 0.05 Level 0.01 Level
One-tailed test -1.645 or +1.645 -2.33 or +2.33
Two-tailed test -1.96 or +1.96 -2.58 or +2.58
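The critical values in Table 4.1 come from the standard normal distribution, and they can be recovered with Python's standard library. This is only a sketch to show where the numbers come from; it is not part of the SPSS workflow:

```python
from statistics import NormalDist

z = NormalDist()  # the standard normal distribution

# A one-tailed test puts the whole rejection region in one tail,
# so the critical value cuts off 5% (or 1%) at that end.
one_tailed_05 = z.inv_cdf(0.95)   # approx. 1.645
one_tailed_01 = z.inv_cdf(0.99)   # approx. 2.33

# A two-tailed test splits the rejection region: 2.5% (or 0.5%) per tail.
two_tailed_05 = z.inv_cdf(0.975)  # approx. 1.96
two_tailed_01 = z.inv_cdf(0.995)  # approx. 2.58

print(round(one_tailed_05, 3), round(two_tailed_05, 2))
```

Note how halving the significance level between the two tails (0.975 rather than 0.95) pushes the critical value further out, which is exactly why a one-tailed result can be significant when the two-tailed one is not.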
4.4 Choosing the Right Test
The main motivation for choosing a statistical test to apply to a set of data has to be driven ultimately by the
objectives of your research project. Indeed your project should have been designed and data sampled with
a certain test or set of tests in mind (Kitchin and Tate, 1999). When deciding upon a particular test, you need
to consider the nature and characteristics of the data sets that you are investigating and, in particular, whether
they will allow the use of parametric or non-parametric tests. The common characteristics of both parametric
and non-parametric tests are listed in Table 4.2. Table 4.3 also provides a useful framework to help you
choose the correct test.
Table 4.2: Common Characteristics of Parametric and Non-parametric Tests
Parametric Tests
Independence of observations, except where the data are paired
Random sampling of observations from a normally distributed population
Interval scale measurement (at least) for the dependent variable
A minimum sample size of about 30 per group is recommended
Equal variances of the population from which the data is drawn
Hypotheses are usually made about the mean (μ) of the population
Non-Parametric Tests
Independence of randomly selected observations except when paired
Few assumptions concerning the distribution of the population
Ordinal or nominal scale of measurement
Ranks or frequencies of data are the focus of tests
Hypotheses are posed regarding ranks, medians or frequencies
Sample size requirements are less stringent than for parametric tests
[Kitchin and Tate, 1999, p. 113]
Table 4.3: Identifying the Right Test

Question 1: What combination of variables have you?

* Two categorical: use Chi-Square
* Two separate continuous: go to Question 2
* Two continuous (the same measure administered twice): go to Question 2
* Two continuous (the same measure administered on three or more occasions): go to Question 2
* One categorical and one continuous: go to Question 2

Question 2: Should your continuous data be used with parametric tests or non-parametric tests?

* Two separate continuous: Parametric - Pearson; Non-parametric - Spearman
* The same measure administered twice: Parametric - Related t-test; Non-parametric - Wilcoxon signed-ranks
* The same measure administered on three or more occasions: Parametric - ANOVA (within subjects); Non-parametric - Friedman test
* One categorical and one continuous: Parametric or Non-parametric - go to Question 3

Question 3: How many levels has your categorical variable?

* Parametric: 2 levels - Independent-samples t-test; 3 or more levels - ANOVA (between subjects)
* Non-parametric: 2 levels - Mann-Whitney U; 3 or more levels - Kruskal-Wallis

[Source: Maltby & Day, 2002]
4.5 Parametric Tests
4.5.1 T-Test or Student’s T-test
The t-test is most useful for testing whether or not a significant difference exists between the means of two
samples, or alternatively, whether or not two samples come from one population. There are two principal
versions of the t-test. One relates to samples involving independent data sets and the other to samples which
involve paired comparisons. In both cases, the data must be ratio or interval in nature, randomly chosen
and normally (or near normally) distributed. The variances of the two data sets should also be similar. Where
there is doubt over the frequency distribution and the values of the variances that may jeopardise the accuracy
of the test, alternative and less refined non-parametric tests should be used.
4.5.2 T-Test for Independent Samples
In this instance, the t-test compares two unrelated data sets by inspecting the amount of difference between
their means and taking into account the variability of each data set. The larger the difference in the means,
the more likely that a real, significant difference exists, and our samples come from different populations (see
Figure 4.5).
Figure 4.5: Differences in Means and Populations
The following section will illustrate how to use SPSS to conduct a student t-test using variables from the
Dataset file.
4.6 Using SPSS to Calculate the Student T-Test
The aim of the following section is to demonstrate how to use SPSS to perform the unrelated and related t-
test. As already mentioned in this section, the t-test is most useful for testing whether or not a significant
difference exists between the means of two samples, or alternatively, whether or not two samples come from
one population. There are two principal versions of the t-test. One relates to samples involving independent
data sets, and the other to samples which involve paired comparisons. In both cases, the data must be of
interval (or ratio) nature, randomly chosen and normally (or near normally) distributed. The variances of the two data
sets should also be similar. Where there is doubt over the frequency distribution and the values of the
variances that may jeopardise the accuracy of the test, alternative and less refined non-parametric tests
should be used. To begin, open SPSS and open the Dataset file that you have used in previous sessions.
We are going to use the Student T-test to examine the relationship between different variables. Let us
consider a potential research scenario to help you place the use of the student t-test in context.
Scenario: As part of the bidding process to Tourism South East for future tourism funding, local tourism
officers have to demonstrate if there is a significant difference in turnover between businesses
in the Arun and Chichester Districts.
Variables: We are therefore going to examine whether there is a significant difference in Turnover08 between Areas.
Before we start we first need to establish a Null and Alternative hypothesis.
In this case:
The Null Hypothesis:
Ho: μa = μb There is no significant difference in Turnover between Areas; business turnover is not
influenced by location

The Alternative Hypothesis:

H1: μa ≠ μb There is a significant difference in Turnover between Areas; business turnover is
influenced by location
4.6.1 T-Test for Independent Samples
To perform the unrelated t-test for two independent samples, first move the mouse over Analyse and press
the left mouse button. Move the mouse over Compare Means and then over Independent Samples T
Test.
The Independent-Samples T Test
dialog box appears.
Move the mouse over the variable Turnover08 and press the left mouse button. Move the mouse over the
centre arrow and press the left mouse button so that the variable Turnover08 appears in the Test Variable(s)
box.
Select the variable Area
and press the lower arrow
so that Area appears in
the Grouping Variable
box.
Move the mouse over Define Groups and press the left mouse button.
The Define Groups dialog box appears. In the box beside
Group 1: type 1 and in the box beside Group 2: type 2. Note
in this case the groups have been defined in terms of their two
codes (1=Chichester District and 2=Arun District). The values
can also be used as a cut-off point, at or above which all the
values constitute one group while those below form the other
group. In this instance the cut-off point is two, which would be
placed in parentheses after Area.
Move the mouse over Continue and press the left mouse button. This will return you to the Independent-
Samples T Test dialog box. Move the mouse over OK and press the left mouse button. SPSS performs the
test and displays the results in the Output window.
In this case the following output is produced:
You are now wondering what this all means. Let us start by referring back to our null and alternative
hypothesis.
In this case:
The Null Hypothesis:
Ho: μa = μb There is no significant difference in Turnover between Areas; business turnover is not
influenced by location

The Alternative Hypothesis:

H1: μa ≠ μb There is a significant difference in Turnover between Areas; business turnover is
influenced by location
The second subtable in the output provides the information we need, tabulating the value of t and its p-
value (Sig. (2-tailed)) together with the 95% Confidence Interval of the Difference for both Equal variances
assumed and Equal variances not assumed.
The key to which situation to use lies in the first two columns labelled Levene’s Test for Equality of
Variances which is a test for the homogeneity of variance assumption of a valid t-test. One of the criteria for
using a parametric t-test is the assumption that both populations have equal variances. If the test statistic F is
significant, Levene’s test has found that the two variances do differ significantly, in which case we must use
the bottom values. Provided the test is not significant (p>0.05), the variances can be assumed to be
homogenous and the Equal Variances line of values for the t-test can be used. As Kinnear and Gray (1999)
point out:
If p > 0.05, then the homogeneity of variance assumption has not been violated and the normal t-test
based on equal variances (Equal variances assumed) is used (the top line).
If p < 0.05, then the homogeneity of variance assumption has been violated and the normal t-test based
on equal variances should be replaced by one based on separate variance estimates (Equal variances
not assumed)(the bottom line).
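The two lines of the SPSS table correspond to two different versions of the t statistic. As an illustration only, using hypothetical figures rather than the Dataset file (and with `pooled_t` and `welch_t` as made-up helper names; SPSS computes both lines, and Levene's test, for you), the two estimates can be sketched as:

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(a, b):
    """Student's t with a pooled variance estimate (Equal variances assumed)."""
    na, nb = len(a), len(b)
    # Pool the two sample variances, weighted by their degrees of freedom.
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(sp2 * (1 / na + 1 / nb))

def welch_t(a, b):
    """Welch's t with separate variance estimates (Equal variances not assumed)."""
    na, nb = len(a), len(b)
    return (mean(a) - mean(b)) / sqrt(variance(a) / na + variance(b) / nb)

# Hypothetical samples; in practice these would be the Turnover08 values
# for each level of Area.
group_a = [1, 2, 3, 4, 5, 6]
group_b = [2, 4, 6, 8, 10]

print(round(pooled_t(group_a, group_b), 3))  # top line of the SPSS table
print(round(welch_t(group_a, group_b), 3))   # bottom line of the SPSS table
```

When the two sample variances are similar the two statistics are close; when they differ markedly, only the separate-variance (Welch) line remains trustworthy, which is exactly what the Levene decision rule above formalises.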
In this example, the Levene Test is significant (p = 0.041
and is therefore < 0.05), so the t value calculated with
the separate variance estimate (Equal variances not
assumed) is appropriate.
The results are relatively straightforward. The table includes the t-
statistic, the degrees of freedom, and the two-tailed probability of the
former being equalled or exceeded by chance alone (Sig.). This
form of output does not give the critical t-value that must be exceeded
for the null hypothesis to be rejected; Sig. is therefore of great
importance in this and the other tests in whose output it is commonly
listed. It allows us to dispense with tables of critical values: if this
probability value is equal to or less than the selected significance
level, the null hypothesis must be rejected.
In this case, the test produces a two-tailed p-value of 0.000; this value is significant. Remember for the p-
value not to be significant at the 0.05 level, the p-value would have to be greater than 0.05. In this case, the null
hypothesis is rejected at the 0.05 significance level. In other words we would conclude that there is a
significant difference in mean turnover between area, and that turnover is influenced by location.
It is important to write up the results clearly and fully. In this instance we could write:
A student t-test was conducted to determine whether a significant difference in turnover between areas existed. A
null hypothesis of no significant difference and an alternative hypothesis of a significant difference were established,
and a 95% confidence level was assumed. The difference was significant: t = 6.354, p (<0.0005) < 0.05.
Therefore the null hypothesis can be rejected and we can assume that there is a significant difference
in turnover between areas, and that turnover is influenced by location.
Note that in the above we have reported the probability value as <0.0005. You cannot have a probability value
of 0.000. The reported probability value has actually been rounded to three decimal places and
therefore for accuracy we would report this as p<0.0005.
A note on Significance Testing taken from Maltby and Day (2002):
‘Significance testing is a criterion, based on probability, that researchers use to decide whether two
variables are related. Remember, as researchers always use samples, and because of the possible
error, they use significance testing to decide whether the relationships observed are real, or not.
Researchers are then able to use a criterion level (significance testing) to decide whether or not their
findings are probable (confident of their findings) or not probable (not confident of their findings).
This criterion is expressed in terms of percentages, and their relationship to probability values. If we
accept that we can never be 100 per cent sure of our findings, we have to set a criterion of how certain
we want to be of our findings. Traditionally, two criteria are used. The first is that we are 95 per cent
confident of our findings; the second is that we are 99 per cent confident of our findings. This is often
expressed in another way. Rather, there is only a 5 per cent (95 per cent confidence) or 1 per cent (99
per cent confidence) probability that we have made an error. In terms of significance testing
these two criteria are often termed the 0.05 (5 per cent) and 0.01 (1 per cent) significance levels.
Throughout this handbook, you will be using a number of tests to determine whether there is a
significant association/relationship between two variables. These tests always provide a probability
statistic, in the form of a value; e.g. 0.75, 0.40, 0.15, 0.04, 0.03 and 0.002. Here, the notion of significance
testing is essential. This probability statistic is compared against the criteria of 0.05 and 0.01 to decide
whether the findings are significant. If the probability value (p) is less than 0.05 (p<0.05) or less than
0.01 (p<0.01) then we conclude that the finding is significant. If the probability value is more than 0.05
(p>0.05) then we decide that the finding is not significant. Therefore we can use this information in
relation to our research idea and we can determine whether our variables are significantly related, or
not. Therefore, for the probability values stated above:
* The probability values of 0.75, 0.40 and 0.15 are greater than 0.05 (p>0.05) and these probability
values are not significant at the 0.05 level (p>0.05).
* The probability values of 0.04, and 0.03 are less than 0.05 (p<0.05) and these probability values
are significant at the 0.05 level (p<0.05).
* The probability value of 0.002 is less than 0.01 (p<0.01), therefore this probability value is significant
at the 0.01 level (p<0.01)’
4.6.2 One or Two-Tailed Tests
The above test has been based on a two-tailed test as the null and alternative hypothesis did
not specify any specific direction. If we were going to perform a one-tailed test we would first
need to look at the mean values of the data and then rewrite our hypotheses accordingly.
Remember that when applying a one-tailed test it is first necessary to establish whether the
difference in the samples corresponds to the direction outlined in the alternative hypothesis.
For example if the alternative hypothesis is that the mean of sample Y is greater than the mean
of sample X, the null hypothesis can only be rejected if the mean of sample Y is greater than
the mean of sample X and if it is significant at the chosen level.
If we use Descriptives Statistics in SPSS to look at the mean turnovers for businesses in the
Chichester and Arun Districts we would find that the mean turnover in the Chichester District
is £43,968.47 and in the Arun District is £37,591.69. The mean turnover is higher in Chichester
which therefore suggests that turnover may be influenced by location. We can therefore
conduct a one-tailed t-test to test if there is actually a significance difference between the two
mean scores.
In this case
The Null Hypothesis:

Ho: μa = μb There is no significant difference in Turnover between Areas; business
turnover is not influenced by location.

The Alternative Hypothesis:

H1: μa > μb Business turnover is significantly higher in the Chichester District than
in the Arun District; turnover is influenced by location.
To calculate the one-tailed level of significance, divide the two-tailed significance value by 2
(0.000/2). The resultant one-tailed value (reported as p<0.0005) would still be significant
(p<0.05).
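The halving rule only applies when the observed difference lies in the predicted direction. A small sketch makes the logic explicit (`one_tailed_p` is a hypothetical helper, not SPSS output; the sample figures are the two district means quoted above):

```python
def one_tailed_p(two_tailed_p, observed_diff, predicted_diff_positive=True):
    """Convert a two-tailed p-value into a one-tailed one.

    Halve the p-value only if the observed difference in means lies in
    the direction the alternative hypothesis predicts; otherwise the
    result sits in the opposite tail and cannot be significant.
    """
    if (observed_diff > 0) == predicted_diff_positive:
        return two_tailed_p / 2
    return 1 - two_tailed_p / 2

# Chichester mean minus Arun mean: 43968.47 - 37591.69 > 0, which
# matches the prediction that Chichester turnover is higher.
print(one_tailed_p(0.04, 43968.47 - 37591.69))  # 0.02
```

Had the observed difference pointed the other way, halving the two-tailed p-value would have been invalid, which is why the section above insists on checking the sample means before writing a directional hypothesis.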
4.6.3 Choosing the Correct Data for a T-Test
SPSS will not tell you if you are using the wrong data in a test, and it is therefore imperative that you are capable
of selecting the right variables to use in a t-test. This will be central to your assessment in this module and it
is vital that you get it right.
Let us first refer back to Table 4.3 on page 4-133. This table clearly shows that a t-test is a combination of one
continuous variable and one categorical (with two levels).
In the worked example provided, Turnover08 was the continuous variable and Area was the categorical
variable. Note that Area has two levels (i.e. 1 - Chichester District and 2 - Arun District). You can only use
categorical variables that have two levels in a t-test. The Independent-Samples T Test dialog box itself
provides a clue here, as you are only able to define two groups (levels) within the Grouping Variable box.
In this case also note that the continuous variable (Turnover08) goes in the Test Variable box.
Referring to the variables in the Dataset file and your accompanying data set guide, attempt to complete the following diagram listing Test Variables and Grouping Variables that would be suitable for use in a series of t-tests.
Activity 16:
Test Variables
Grouping Variables
From the list of potential relationships that you have identified overleaf, please conduct 3 separate t-tests and record your results in the following tables. For each test, identify a research scenario that you are using the test to explore.
Table 18: Student T-Test 1
Activity 17:
Student T-Test
Research Scenario
Test Variable
Grouping Variable
Null Hypothesis
Alternative Hypothesis
SPSS Output
Record the p. value of the Levene Test
Is this significant: yes/no?
Is your test based on: Equal variances assumed
Equal variances not assumed
Record the value of p. (Sig. 2-tailed)
Is the value of p. significant: yes/no?
Your conclusions (with full reference to the null and alternative hypotheses)
Table 19: Student T-Test 2
Activity 17:
Student T-Test
Research Scenario
Test Variable
Grouping Variable
Null Hypothesis
Alternative Hypothesis
SPSS Output
Record the p. value of the Levene Test
Is this significant: yes/no?
Is your test based on: Equal variances assumed
Equal variances not assumed
Record the value of p. (Sig. 2-tailed)
Is the value of p. significant: yes/no?
Your conclusions (with full reference to the null and alternative hypotheses)
Table 20: Student T-Test 3
Activity 17:
Student T-Test
Research Scenario
Test Variable
Grouping Variable
Null Hypothesis
Alternative Hypothesis
SPSS Output
Record the p. value of the Levene Test
Is this significant: yes/no?
Is your test based on: Equal variances assumed
Equal variances not assumed
Record the value of p. (Sig. 2-tailed)
Is the value of p. significant: yes/no?
Your conclusions (with full reference to the null and alternative hypotheses)
4.7 Using SPSS to Calculate the T-Test for Related Samples
The t-test can also be used to examine means of the same participants in two conditions or at two points in
time. The advantage of using the same participants or matched participants is that the amount of error
deriving from differences between participants is reduced. The difference between a related and unrelated
t-test lies essentially in the fact that two scores from the same person are likely to vary less than two scores
from two different people. For example, if you were to weigh the same person on two occasions, the
difference between those two weights is likely to be less than the difference between the weights of two separate individuals. The
variability of the standard error for the related t-test is less than that for the unrelated one. Indeed, the variability
of the standard error of the differences in means for the related t test will depend on the extent to which the
pairs of scores are similar or related. The more similar they are, the less the variability will be of their
estimated standard error.
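Under the hood, the related t-test works on the differences between each pair of scores: t is the mean difference divided by its standard error. A minimal stdlib sketch, with made-up before/after scores rather than the GTBS data (`paired_t` is a hypothetical helper name):

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(before, after):
    """Related (paired) t statistic: mean of the pairwise differences
    divided by the standard error of those differences."""
    diffs = [b - a for a, b in zip(before, after)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1  # (t statistic, degrees of freedom)

# Hypothetical scores for the same three businesses at two points in time.
score_2008 = [10, 12, 14]
score_2010 = [12, 15, 16]

t, df = paired_t(score_2008, score_2010)
print(round(t, 2), df)  # t is positive here because every score improved
```

Because the same cases supply both scores, only the differences vary, which is why the standard error (and hence the test) is more sensitive than the unrelated version.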
In the following example we are going to look at paired data from the Dataset file. Let us consider a potential
research scenario to help you place the use of the related t-test in context.
Scenario: Between 2008 and 2010, Tourism South East ran a series of courses in conjunction with the
Green Tourism Business Scheme to help GTBS members progress to the next stage of
accreditation (e.g. bronze to silver; silver to gold). As part of the monitoring process, Tourism
South East want to establish if these courses have had any impact on GTBS scores.
Variables: We are are going to examine differences in GTBS scores in 2008 and 2010.
As always, we need to start by defining our hypotheses. In this instance, the null and alternative hypotheses
have been stated as:
Ho: μa = μb There is no significant difference between the GTBS scores in 2008 and 2010.

H1: μa ≠ μb There is a significant difference in the GTBS scores in 2008 and 2010.
To perform the related t-test, first move the mouse over Analyse and press the left mouse button.
Move the mouse over Compare Means and then over Paired-Samples T-Test.
The Paired-Samples T-Test dialog box appears.
Move the mouse over GTBS08 and press the left mouse button. GTBS08 is selected. Now move the mouse over GTBS10
and press the left mouse button. GTBS10 is selected. Move the mouse over the central button and press the left mouse
button.
GTBS08 and GTBS10 now appear in the Paired Variables box.
Click OK.
The procedure produces the following results in the output window:
The first table evident in the SPSS output is the Paired Samples Statistics which reports the descriptive
statistics. By observing the mean scores we can see that mean GTBS scores were higher in 2010 than 2008.
These differences seem to be supporting our initial hypothesis. To establish whether this result is significant
or has merely occurred by chance, we refer to the Paired Samples Test.
The key elements of the Paired Samples Test include:
(a) The test statistic - this is denoted as t; in this case the value of t=-11.386
(b) The degrees of freedom - these equal the sample size (300) minus 1, as the same set of respondents
provides both sets of scores. The degrees of freedom value is placed in brackets between the t and the = sign
(e.g. t(299)=-11.386).
(c) The Probability Value - as in all tests we also have to report the probability value. Note that the value of p
=.000 (which remember we report as p<0.0005) is less than 0.05, which means that there has been a
significant change in GTBS scores between 2008 and 2010.
Let us bring these different elements together. As can be seen from the SPSS output, the difference between
the two means is significant. This is specifically reported as:
There is a significant difference in GTBS scores between 2008 and 2010, t (299)= -11.386, p (<0.0005)<0.05.
However, we can be more specific and in our alternative hypothesis look for an improvement
in GTBS scores. As a result our alternative hypothesis would be:

H1: μa < μb There is a significant improvement in the GTBS scores between 2008
and 2010.
This therefore means we have conducted a one-tailed test, as we have specified a specific
direction in which to examine change. To alter the output here so that it complies with a one-
tailed test we merely divide the p-value by 2. The resultant value (reported as p<0.0005) is still
significant (p<0.05). As a result we can reject the null hypothesis and conclude that there has been
a significant improvement in GTBS scores between 2008 and 2010, at the 95% confidence
level. Specifically:
There has been a significant improvement in GTBS scores between 2008 and 2010, t (299)=
-11.386, p (<0.0005)<0.05.
We are now going to use the Dataset file to conduct a number of additional related t-tests. Please complete the following tables, making clear reference to the SPSS output. You have been provided with research scenarios for each table to place the test in context.
Table 21: Related T-Test: Turnover08 Against Turnover10 [Tourism South East want to establish if regional marketing strategies implemented between 2008 and 2010 have had an impact on business turnover.]
Table 22: Related T-Test: Green08 Against Green10 [Tourism South East want to establish if support given to the use of local produce has impacted on how much businesses spend on local produce.]
Note that the tests conducted here relate to the entire sample. If we used the Split File option as we have done previously, we could conduct related t-tests to provide comparisons between selected variables
such as Area, Town or G-Strategy. Attempt to apply the Split File option and repeat one of the tests
above. Cut and paste the output into your log book.
Activity 18:
Related T-Test
Null Hypothesis
Alternative Hypothesis
Comment on the SPSS Output
Related T-Test
Null Hypothesis
Alternative Hypothesis
Comment on the SPSS Output
4.8 Non Parametric Tests
4.8.1 The Mann-Whitney U Test (Independent Samples)
When comparing samples of geographical data, assumptions of normality which underpin the accuracy of
parametric tests, such as the t- test, are often quite unrealistic. In these cases, the use of a non-parametric
test, such as the Mann Whitney U Test, provides a convenient alternative. The Mann Whitney U test is the
non-parametric counterpart of the t-test for unrelated (independent) data. The test is used to determine
whether ordinal data collected in two different samples differ significantly. As a non-parametric test it is not
restricted by any assumptions regarding the nature of the population from which the sample was taken and
is applicable to ordinal (ranked) data. In addition, the sample sizes of the data sets need not be equal. The
test calculates whether there is a significant difference in the distribution (based on the median) of data by
comparing ranks of each data set.
Within the Mann Whitney U test the null hypothesis is that the two populations are taken from a common
population so that there should be no consistent difference between the two sets of values. Any observed
differences are due entirely to chance in the sampling process.
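The rank-comparison idea can be sketched in plain Python. This is an illustration of the U statistic itself, using hypothetical scores (`mann_whitney_u` is a made-up helper name); SPSS handles ties and the significance test for you:

```python
def mann_whitney_u(x, y):
    """Mann-Whitney U: rank all values together, sum the ranks of one
    group, and convert that rank sum into the U statistic."""
    combined = sorted(x + y)
    # Assign mid-ranks so that tied values share the average of their positions.
    rank_of = {}
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j] == combined[i]:
            j += 1
        rank_of[combined[i]] = (i + 1 + j) / 2  # average of ranks i+1 .. j
        i = j
    r1 = sum(rank_of[v] for v in x)          # sum of ranks in the first group
    u1 = r1 - len(x) * (len(x) + 1) / 2
    u2 = len(x) * len(y) - u1
    return min(u1, u2)                        # conventionally the smaller U

# Hypothetical attitude scores for two e-strategy groups.
adopters = [3, 4, 2, 6]
non_adopters = [9, 7, 5, 10]

print(mann_whitney_u(adopters, non_adopters))  # a small U means little rank overlap
```

If the null hypothesis were true, the ranks would be shuffled evenly between the two groups and U would sit near its midpoint; a U close to zero means one group's values almost entirely outrank the other's.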
To begin, open SPSS and open the Dataset file that you have used in previous sessions. We are going to
use Mann Whitney to examine the relationship between different variables. Let us consider a potential
research scenario to help you place the use of the Mann Whitney test in context.
Scenario: Tourism South East are developing a new e-tourism strategy and they want to establish if there
is any relationship between e-strategy (e-commerce adopters and non-adopters) and business
attitudes to the value of the internet.
Variables: We are therefore going to examine the relationship between EStrategy and the perceived
value of the internet in 2008 (Webqual08).
4.8.2 Writing Null and Alternative Hypotheses
Before we start we first need to establish a Null and Alternative hypothesis.
In this case:
The Null Hypothesis:
Ho: There is no significant difference between the two groups in terms of their perceived value of the
internet; e-strategy does not influence attitudes towards the internet
TheAlternative Hypothesis:
H1: There is a significant difference between the two groups in terms of their perceived value of the
internet; e-strategy does influence attitudes towards the internet
4.9 Using SPSS to Calculate Mann Whitney
To perform the Mann Whitney U test, first move the mouse over Analyse and press
the left mouse button.
Move the mouse over Nonparametric Tests and then over Legacy Dialogs. Select
2 Independent Samples.
The Two-Independent Samples Tests dialog box appears.
Select the variable labelled
Webqual08. Move the mouse over
the central arrow and press the left
mouse button so Webqual08
appears in the Test Variable List.
Move the mouse over Define Groups and press
the left mouse button.
Select the variable
EStrategy and press the
lower arrow so that
EStrategy appears in the
Grouping Variable box.
The Define Groups dialog box appears. In the box beside Group 1: type 1, and in the box beside Group 2: type 2. Note that in this case the groups have been defined in terms of their two codes (1 = E-Commerce Adopter and 2 = E-Commerce Non-Adopter). A value can also be used as a cut-off point, at or above which all the values constitute one group while those below form the other group. In this instance the cut-off point would be 2, which would be placed in parentheses after the grouping variable.
Move the mouse over Continue and press the left mouse button. This will return you to the Independent-
Samples Tests dialog box. Move the mouse over OK and press the left mouse button. SPSS performs the
test and displays the results in the Output window.
The first subtable, Ranks, illustrates the number of businesses in each group, and the total number of
businesses. The Mean Rank indicates the mean rank of scores within each group and the Sum of Ranks
indicates the total sum of all ranks within each group. If our null hypothesis of no significant difference was
true, then we would expect the mean rank and sum of ranks to be roughly similar across the two groups. As
we can see the mean rank for E-Commerce Adopters is 178.82 and for E-Commerce Non-Adopters is
116.35. There is a clear difference between the two, and to determine whether this difference is significant
we refer to the Test Statistics table below.
This tells us that the Mann Whitney U value is 6508.000 and that the probability value (p), ascertained by examining the Asymp. Sig. (2-tailed) row, is .000. In this case the p-value (reported as p<.0005) is less than 0.05, so we can reject the null hypothesis and conclude that there is a significant difference between EStrategy and attitudes towards the internet.
Our Mann Whitney test was two-tailed but again we could be more specific by indicating a direction in our
alternative hypothesis. In this case the alternative hypothesis would be:
H1: There is a significant difference between the two groups in terms of their perceived value of the
internet. E-commerce adopters rank the value of the internet higher than e-commerce non-adopters.
Note that an initial examination of the mean ranks would support our alternative hypothesis. As before, for a one-tailed test, the p-value needs to be halved (.000/2 = .000). In this case the test would still be significant, as the p-value (reported as p<.0005) is less than 0.05, so we can again reject the null hypothesis and conclude that there is a significant difference between EStrategy and attitudes towards the internet, and that e-commerce adopters rank the value of the internet higher than e-commerce non-adopters.
4.9.1 Choosing the Correct Data for a Mann Whitney Test
SPSS will not tell you if you are using the wrong data in a test, and it is therefore imperative that you are capable of selecting the right variables to use in a Mann Whitney test. This will be central to your assessment in this module and it is vital that you get it right.
Let us first refer back to Table 3 on page 4-133. This table clearly shows that the Mann Whitney test is non-parametric and requires a combination of one continuous variable and one categorical variable (with two levels). In the worked example provided, Webqual08 was the continuous variable and EStrategy was the categorical variable. Note that EStrategy has two levels (i.e. 1 - E-Commerce Adopter and 2 - E-Commerce Non-Adopter). You can only use categorical variables that have two levels in a Mann Whitney test. The Mann Whitney test dialog box provides a clue here, as you are only able to define two groups (levels) within the Grouping Variable.
In this case also note that the continuous variable (Webqual08) goes in the Test Variable box.
Referring to the variables in the Dataset file, attempt to complete the following diagram listing Test Variables and Grouping Variables that would be appropriate for use in a series of Mann Whitney tests.
Activity 19:
Test Variables
Grouping Variables
From the list of potential relationships that you have identified overleaf, please conduct 3 separate Mann Whitney tests and record your results in the following tables. For each test, identify a research scenario that you are using the test to explore.
Table 23: Mann Whitney Test 1
Activity 20:
Mann Whitney Test
Research Scenario
Test Variable
Grouping Variable
Null Hypothesis
Alternative Hypothesis
SPSS Output
Record the Mann Whitney U Value
Record the value of p (Asymp. Sig. (2-tailed))
Is the value of p. significant: yes/no?
Your conclusions (with full reference to the null and alternative hypotheses, and your test statistics)
Table 24: Mann Whitney Test 2
Activity 20:
Mann Whitney Test
Research Scenario
Test Variable
Grouping Variable
Null Hypothesis
Alternative Hypothesis
SPSS Output
Record the Mann Whitney U Value
Record the value of p (Asymp. Sig. (2-tailed))
Is the value of p. significant: yes/no?
Your conclusions (with full reference to the null and alternative hypotheses, and your test statistics)
Table 25: Mann Whitney Test 3
Activity 20:
Mann Whitney Test
Research Scenario
Test Variable
Grouping Variable
Null Hypothesis
Alternative Hypothesis
SPSS Output
Record the Mann Whitney U Value
Record the value of p (Asymp. Sig. (2-tailed))
Is the value of p. significant: yes/no?
Your conclusions (with full reference to the null and alternative hypotheses, and your test statistics)
4.10 Using SPSS to Calculate Wilcoxon Signed Ranks Test (Related Data Sets)
The Wilcoxon signed ranks test is the non-parametric counterpart of the t-test for related data (the paired t-test). The basic assumptions of the test are that the data are paired across conditions or times, and that the distribution of differences is symmetrical, but it need not be normal or any other particular shape. The data should also be of at least ordinal level, which makes the test very useful for analysing data based on ranked scores. The test examines the differences between data collected from the same phenomenon in two different conditions or at two different times, by examining the ranks of the differences in values over the two conditions. For example, you may want to know whether a village's fertility or mortality rate changes significantly between dates, or whether the conditions under which a questionnaire or interview is conducted significantly influence the findings of a study. In this case, the test calculates whether there is a significant difference by examining whether the ranks of individual phenomena differ between conditions or times.
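As with the Mann Whitney test, the same logic can be sketched outside SPSS. The paired scores below are hypothetical, invented purely to illustrate the shape of the call; they are not the Webqual data from the Dataset file.

```python
from scipy.stats import wilcoxon

# Hypothetical paired attitude scores for the same businesses at two dates;
# illustrative values only, not the module's Dataset file.
scores_2008 = [4, 5, 3, 6, 4, 5, 3, 4, 5, 4]
scores_2010 = [6, 7, 5, 8, 7, 8, 5, 6, 7, 6]

# Two-tailed Wilcoxon signed ranks test on the paired differences.
result = wilcoxon(scores_2008, scores_2010)
print(result.statistic, result.pvalue)
```

Because every 2010 score here exceeds its 2008 pair, all the signed ranks point the same way and the test statistic (the smaller of the two rank sums) is 0, yielding a small p-value.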
To begin, open SPSS and open the Dataset file that you have used in previous sessions. We are going to use the Wilcoxon test to examine the relationship between different variables. Let us consider a potential research scenario to help place the use of the Wilcoxon test in context.
Scenario: Between 2008 and 2010, Tourism South East have been running E-Commerce workshops
across the South East region. As part of the monitoring process, Tourism South East want to
establish if these workshops have had any impact on business attitudes to the value of the
internet.
Variables: We are therefore going to examine the relationship between Webqual08 and
Webqual10.
In this instance, the null and alternative hypotheses have been stated as:
Ho: There is no difference in business attitudes towards the value of the internet between 2008 and
2010
H1: There is a difference in business attitudes towards the value of the internet between 2008 and 2010.
The significance level has been set at 0.05 (95%). Note that this is also a two-tailed test as no direction has
been specified in the alternative hypothesis.
To perform the Wilcoxon test, first move the mouse over Analyse and press the left mouse button.
Move the mouse over Nonparametric Tests, then Legacy Dialogs, then 2 Related Samples, and press the left mouse button.
The Two-Related Samples Tests dialog box appears.
Move the mouse over Webqual08 and press the left mouse button. Webqual08 is selected. Now move the mouse over Webqual10 and press the left mouse button. Webqual10 is selected. You will notice that in the Current Selections area of the dialog box, Webqual08 now appears beside Variable 1 and Webqual10 beside Variable 2.
Move the mouse over the central button and press the left mouse button. Webqual08 and Webqual10 now appear in the
Paired Variables box.
Click OK. The procedure produces the following in the output window.
The first subtable, Ranks, shows the number of negative, positive and tied ranks, along with the Mean Rank and the Sum of Ranks. Let us explore this in additional detail.
Key observations:
Webqual10 has been entered into the equation first; the calculation is therefore based on the attitude scores in 2010 minus the attitude scores in 2008.
The Negative Ranks row indicates how many ranks of Webqual08 were larger than Webqual10. Here the value is 0, which initially suggests that attitude scores have increased.
The Positive Ranks row indicates how many ranks of Webqual08 were smaller than Webqual10. The value here is 259.
The Tied Ranks indicate how many of the rankings of Webqual08 and Webqual10 are the same.
The value here is 41.
The Total is the total number of ranks, which is equal to the number of attitude scores in the sample
(in this case 300).
From the second subtable, Test Statistics, it can be seen that the value of z = -16.093, which is significant as the value of p (.000) is less than 0.05. We can therefore reject the null hypothesis and conclude that there is a significant difference in business attitudes towards the value of the internet between 2008 and 2010. The findings of the Wilcoxon test should be reported as:
z = -16.093, p < .0005
The Wilcoxon Test was two-tailed but again we could be more specific by indicating a direction in our
alternative hypothesis. In this case the alternative hypothesis would be:
H1: There is a significant difference in attitudes towards the value of the internet between 2008 and 2010;
business attitudes have improved.
Note that an initial examination of the data in the Ranks table supports our alternative hypothesis. As before, for a one-tailed test, the p-value needs to be halved (.000/2 = .000). In this case the test would still be significant, as the p-value (reported as p<.0005) is less than 0.05, so we can again reject the null hypothesis and conclude that there is a significant difference in attitudes towards the value of the internet between 2008 and 2010, and that business attitudes have improved.
We are now going to use the Dataset file to conduct a number of additional Wilcoxon tests. Please complete the following tables, making clear reference to the SPSS output. You have been provided with research scenarios for each table to place the test in context.
Table 26: Wilcoxon Test - BLINK08-BLINK10
[Following a complete review of their business advisory services, instigated by poor industry feedback in 2007, Business Link need to establish if business attitudes towards their advisory services have improved between 2008 and 2010]
Table 27: Wilcoxon Test - WEBVALUE08-WEBVALUE10
[Tourism South East want to establish if business attitudes to destination management systems have changed following the change of DMS platform and a complete relaunch of booking systems]
Note that the tests conducted here relate to the entire sample. If we used the Split File option as we have done previously, we could conduct Wilcoxon tests to provide comparisons between selected cases, such as Area, Town or E-Strategy. Attempt to apply the Split File option and repeat one of the tests above.
Cut and paste the output into your log book.
Activity 21:
Related T-Test
Null Hypothesis
Alternative Hypothesis
Comment on the SPSS Output
Related T-Test
Null Hypothesis
Alternative Hypothesis
Comment on the SPSS Output
Notes:
Chi-Squared
Section 5
Learning Outcomes
At the end of this session, you should be able to:
Understand the rationale for the use of chi-squared
Understand the basic conditions and criteria involved in the use of chi-squared
Apply the procedure for calculating chi-squared statistics both manually and in SPSS
Interpret manually derived and computer-generated SPSS chi-squared output
5.0 Introduction
Chi-squared (χ2) is primarily employed to test a null hypothesis of ‘no difference’ between samples of frequency
measurements. The method is used widely in the fields of Business and Management and is often employed
in questionnaire analysis. In many ways it is the most flexible of such tests as:
It can be applied to frequency data on any originally collected scale of measurement (nominal,
ordinal or interval) provided that the data are grouped into independent and mutually-exclusive
categories.
It may be used to test a null hypothesis of ‘no difference’ for any number of samples.
The chi-squared test involves computing a calculated χ2 statistic and comparing this with an appropriate
tabulated χ2 statistic (or critical χ2 value) to test a null hypothesis of ‘no difference’ at a selected significance
level. Although considered less powerful than other tests, this is compensated for by its simple data requirements.
Both ordinal and ratio scale data can be converted into nominal form, although such categorisation can often
cause a loss of detail.
The chi-squared test requires that data be in the form of contingency tables, which are simply data matrices
showing the frequency of observations in different categories (h) for one or more samples (k). The following
are three examples of contingency tables.
Table 5.1: Categories of residence sampled in terms of the age of the resident
Sample a b c
Category Age: 20-29 30-39 40 and over
Owner-occupied 18 42 28
Rented 31 29 12
Council housing 24 41 35
Other 17 4 1
No. of categories (h) = 4 No. of samples (k) = 3
Note: Measurement scale: nominal
Table 5.2: Typical questionnaire responses

Category    Strongly Agree   Agree   Neither Agree nor Disagree   Disagree   Strongly Disagree
Frequency         8            11               6                    19             12

No. of categories (h) = 5   No. of samples (k) = 1
Note: Measurement scale: ordinal

Table 5.3: Total dissolved solids in groundwater, sampled by rock type

Categories (concentration in mg l-1)   0-19   20-39   40-59   60-79   80-99   100-119
Sample A: Granite (n=30)                 3      12      10      4       1        0
Sample B: Basalt (n=30)                  1       9      11      8       3        1

No. of categories (h) = 6   No. of samples (k) = 2
Note: Measurement scale: interval

The calculated χ² statistic compares the observed frequency (O) for each category and every sample against an expected frequency (E) using the general formula:

χ² = Σ [ (O - E)² / E ]

In the above equation the observed frequencies (O) are those that we measure (i.e. those that appear in the contingency tables). The expected frequencies (E) for each category are defined by our hypothesis. The null hypothesis of ‘no difference’ often involves testing for departure from a uniform distribution in the case of the single-sample test. This means that the expected value for each category is identical and equal to n/h. The chi-squared test can also be used to establish differences from a theoretical distribution, such as the normal distribution.
Although the χ2 test is primarily employed as a one-tailed test of the significance of differences, it may also be
employed to establish the significance of similarities between samples. Most of the χ2 tables contain not only
the usual values at the lower end of the significance scale for testing differences but also values at the upper
end of this scale for testing similarity. If we wish to establish similarity of two or more samples, then our
calculated χ2 statistic must be less than, for example, the appropriate figure for the 95% significance level if
we are to accept the null hypothesis of ‘no difference’ at this level.
5.1 The One-Sample Chi-Squared Test
The one-sample test is normally used to test the significance of differences between categories of a single
sample. Consider the following example.
The frequency of rock falls from a popular cliff face in Snowdonia is recorded for two weeks in the summer,
autumn, winter and spring by the local mountain rescue. The results are recorded in Table 5.4.
Table 5.4: Rockfall frequency
Sampling Period Summer Autumn Winter Spring
Frequency of Rockfalls 17 14 10 23
h=4 n=64
If there were no differences in the frequency of rockfalls in each season, then we would expect an equal
frequency of rockfalls in each season. Basically, the expected frequency for each category would be:
E = n / h = 64 / 4 = 16
As with any test, we must first formalise the null and alternative hypotheses. In this case:
H0: There is no difference in the frequency of rock falls between seasons
H1: The frequency of rock falls is significantly greater in some seasons than in others
The calculated χ² statistic can now be computed as follows:

χ² = Σ [ (O - E)² / E ]

Table 5.5: Rockfall frequency: the calculation of the χ² statistic

Category    O     E    (O-E)   (O-E)²   (O-E)²/E
Summer     17    16      1       1       0.0625
Autumn     14    16     -2       4       0.2500
Winter     10    16     -6      36       2.2500
Spring     23    16      7      49       3.0625

χ² = Σ [ (O - E)² / E ] = 5.6250

The degrees of freedom (v) for this one-sample chi-squared test are:

v = h - 1

which in this case equals:

v = h - 1 = 4 - 1 = 3

The calculated χ² statistic is then compared with a tabulated χ² statistic at a selected significance level (see Table 5.6). To reject the null hypothesis, the calculated χ² must exceed the tabulated χ². At the 0.05 significance level, the tabulated χ² statistic with three degrees of freedom is 7.82. As the calculated value is less than the tabulated value, we cannot reject the null hypothesis of ‘no difference’ at the 0.05 significance level, and we conclude that there is no significant difference in the frequency of rock falls between the seasons.
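The rockfall calculation can be checked programmatically. As an illustrative aside (the module itself uses SPSS), SciPy's chisquare function performs exactly this one-sample test, defaulting to a uniform expected distribution:

```python
from scipy.stats import chisquare

# Observed rockfall frequencies for summer, autumn, winter and spring (Table 5.4).
observed = [17, 14, 10, 23]

# chisquare defaults to a uniform expected distribution, i.e. E = n/h = 64/4 = 16.
result = chisquare(observed)
print(result.statistic)  # 5.625, matching the manual calculation
print(result.pvalue)     # > 0.05, so the null hypothesis cannot be rejected
```

The p-value it reports is the exact counterpart of comparing the calculated statistic against the tabulated critical value of 7.82.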
Table 5.6: Critical values of chi-square (at the 95% and 99% significance levels)
[Table of critical values not reproduced here.]
5.2 The Chi-Squared Test for Two or More Samples
The chi-square test can also be used to test the differences or similarities between two or more samples,
though it is always used as a test of difference. The procedure is similar to that for the one sample test, except
that the calculation of the expected frequencies (E) is slightly more complex. Consider the following example.
A researcher in Ghana studying the distribution of malaria outbreaks among international tourists obtains the
following results from a sample of 100 tourists who stayed in hotels on the river flood plain, and from 200
tourists who stayed in hotels on a plateau above the river. The results are recorded in Table 5.7.
Table 5.7: The incidence of malaria outbreaks in Ghana
Category Infected Not Infected
Sample
Flood plain (n=100) 20 80
Plateau (n=200) 25 175
In this case we have two samples (k=2) and two categories (h=2). The researcher wishes to establish whether
the two samples differ significantly in terms of the incidence of infection. The expected frequencies (E) are
thus those that would be expected if there were indeed ‘no differences’ between the plateau and the flood plain
in terms of incidence of infection. The expected frequencies are calculated for each observation using the
following formula:
E = (Column total × Row total) / Overall total
Or alternatively using notation in a contingency table format:
Table 5.7a: Calculation of expected values
Category Infected Not Infected Row Total
Sample
Flood plain (expected) Cell A= (N1 x T1)/T Cell C= (N2 x T1)/T T1
Plateau (expected) Cell B=(N1 x T2)/T Cell D= (N2 x T2)/T T2
Column Totals N1 N2 T
Therefore in the case of the malaria outbreaks, the expected values are calculated in the following manner:
Table 5.7b: Calculation of expected values
Category Infected Not Infected Row Total
Sample
Flood plain (observed) 20 80 100
Plateau (observed) 25 175 200
Column Totals 45 255 300
Hence the expected values are:
Table 5.7c: Calculation of expected values cont..
Category Infected Not Infected Row Total
Sample
Flood plain (expected) 15 (45*100)/300 85 (255*100)/300 100
Plateau (expected) 30 (45*200)/300 170 (255*200)/300 200
Column Totals 45 255 300
Note that the row and column totals for the expected values are identical to those for the observed values.
The χ2 statistic is now calculated in the following manner:
Table 5.8: Calculation of the χ² statistic

Category                      O      E    (O-E)   (O-E)²   (O-E)²/E
Flood plain: Infected        20     15      5       25       1.667
Flood plain: Not Infected    80     85     -5       25       0.294
Plateau: Infected            25     30     -5       25       0.833
Plateau: Not Infected       175    170      5       25       0.147
Total                       300    300                  χ² = 2.941
When the chi-squared test is used to test two or more samples, the number of degrees of freedom is given by:
V = (h-1)(k-1)
In this case:
V = (h-1)(k-1) = (2-1)(2-1) =1
Formally, the test of a null hypothesis of ‘no difference’ is as follows:
H0: There is no difference between the incidence of infection on the flood plain and that on the plateau
H1: There is a significant difference between the incidence of infection on the flood plain and that on the plateau
The calculated χ² statistic is then compared with a tabulated χ² statistic at a selected significance level. The tabulated value at the 0.1 significance level with 1 degree of freedom is 2.71 (see Table 5.6). As the calculated χ² statistic (2.941) exceeds the tabulated χ² statistic, we can reject the null hypothesis at the 0.1 significance level. In practical terms this means that, on the basis of the evidence of the chi-squared test, it is extremely unlikely that the observed difference between rates of infection is due only to chance in the sampling process; instead it reflects a ‘real’ difference between the rates of infection on the flood plain and the plateau.
Notice that the larger the test statistic, the stronger the evidence of association will be. This is not surprising
because the test statistic, χ2 , is based on differences between the actual, or observed frequencies and those
we would expect if there were no association. If there were association then we would anticipate large
differences between observed and expected frequencies. If there were no association we would expect small
differences.
In the above example, if a more stringent significance level had been chosen, say 0.05, then the calculated χ² statistic (2.941) would have been less than the tabulated χ² statistic (3.84) and so the null hypothesis could not have been rejected. This situation raises the issue of subjectivity: in order to reject a null hypothesis, a researcher may well be tempted to choose a less stringent significance level. The safest rule is to choose a significance level before the test is carried out and stick to it.
5.3 Yates Correction Factor
When using a χ2 test with one degree of freedom, as in the previous example, it is necessary to make a slight
adjustment to the calculations. The adjustment consists of either adding or subtracting 0.5 to the value of
each (O-E) before squaring it. The rule for deciding whether to add or subtract the 0.5 is:
a) If (O-E) is negative then add;
b) If (O-E) is positive then subtract
It is probably more easily remembered by noting that addition or subtraction should be performed with a view
to making the value of χ2 smaller.
The effect of the Yates correction can be highlighted with reference to Table 5.9, a version of Table 5.8 that has been adjusted using the Yates correction factor (corrected values in parentheses).

Table 5.9: Calculation of the χ² statistic with Yates correction

Category                      O      E    (O-E)           (O-E)²       (O-E)²/E
Flood plain: Infected        20     15     5-0.5 (4.5)    25 (20.25)   1.667 (1.350)
Flood plain: Not Infected    80     85    -5+0.5 (-4.5)   25 (20.25)   0.294 (0.238)
Plateau: Infected            25     30    -5+0.5 (-4.5)   25 (20.25)   0.833 (0.675)
Plateau: Not Infected       175    170     5-0.5 (4.5)    25 (20.25)   0.147 (0.119)
Total                       300    300                            χ² = 2.941 (2.382)

The effect of the Yates correction is to introduce greater accuracy into the calculation and evaluation of the χ² statistic. In this case, the Yates correction has reduced the value of the calculated χ² statistic to the extent that it no longer exceeds the tabulated value of χ² with one degree of freedom. As such, the null hypothesis can no longer be rejected, and the researcher would have to conclude that there is no significant difference between the incidence of malaria infection on the flood plain and the plateau.
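Both the uncorrected and Yates-corrected statistics for the malaria example can be reproduced outside SPSS. As an illustrative sketch, SciPy's chi2_contingency function applies the continuity (Yates) correction to 2×2 tables by default and also returns the expected frequencies:

```python
from scipy.stats import chi2_contingency

# Observed frequencies from Table 5.7: rows = flood plain / plateau,
# columns = infected / not infected.
observed = [[20, 80], [25, 175]]

# Without the Yates correction (as in Table 5.8):
chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(round(chi2, 3))  # 2.941
print(dof)             # 1 degree of freedom: (2-1)(2-1)

# With the Yates continuity correction (the default for 2x2 tables, as in Table 5.9):
chi2_corr, p_corr, _, _ = chi2_contingency(observed, correction=True)
print(round(chi2_corr, 3))  # 2.382

# The expected frequencies match Table 5.7c: 15, 85, 30, 170.
print(expected)
```

Note that the function derives the expected frequencies from the row and column totals exactly as in the E = (Column total × Row total) / Overall total formula above.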
5.4 Conditions Necessary for Conducting a Chi-squared Test
When using chi-square a number of guidelines must be remembered:
Contingency tables must consist of at least two categories;
Where there are only two categories, the expected frequency in each category must not be less
than 5;
Where there are more than two categories, no category should have an expected frequency of less
than 1 and not more than one category in five should have an expected frequency of less than five;
Data must be in the form of frequencies (i.e. counted data in categories). The χ² statistic is best suited to comparing frequencies within nominal categories. It can also be applied to higher-order levels of measurement if the data are grouped into categories prior to analysis; it is not applicable to ungrouped interval-scale data;
No cell is allowed to have an expected frequency of less than 1. This requirement can sometimes be met through the amalgamation of rows and columns (i.e. fewer cells with more observations in each). However, be careful, as the regrouping of data can lead to a loss of information and the subtle differences between two data sets being obscured. Regrouping should therefore be avoided if at all possible, and larger sample sizes are recommended. In addition, the way that categories are constructed may determine whether or not significant associations are detected;
Samples are assumed to be independent (not applicable to dependent variables);
Random sampling is assumed (other sampling procedures can be considered as long as they are
proved to be unbiased);
Data samples must be discrete and unambiguous;
Frequencies must be absolute and not percentages of proportional values;
The question of the ‘tailedness’ of the alternative hypothesis does not arise in the context of the chi-squared test. Because of the manner of its execution, the direction of departure is immaterial.
From an examination of destination preferences for second homes it appears that coastal counties of England
and Wales are perceived as being more desirable holiday locations than inland counties. The results are
summarised below.
Of the 19 coastal counties, 14 have preference scores of more than 30 and only 5 have preference scores
of 30 or less. Of the 34 inland counties, 15 have high preference scores and 19 have low scores. Use the
chi-square test to decide whether there is in fact a significant difference at the 0.05 level between coastal
and inland counties in terms of their destination desirability. Report your final result below:
Activity 22:
Location Residential Desirability
Low Preference High Preference Total
Coastal Counties 5 14 19
Inland Counties 19 15 34
Total 24 29 53
In a survey commissioned by a TV travel program, 135 people were asked what their favourite foreign
holiday destination was. Some of the results are summarised in the contingency table below:
Use these sample results to test for association between gender and destination preference, using a
95% confidence level. When calculating the expected frequencies, check if the data meets the
requirements of the chi-square test. How can you re-categorise the data to make it meet the criteria for
the chi-squared test? Report your final result below:
Activity 23:
Company managers at Butlins are investigating the relationship between job satisfaction and the levels of
absenteeism in the firm. They believe that satisfied individuals are less likely to be absent from work than
those who are not satisfied. The results from a survey of 30 workers are displayed in a contingency table
below.
Calculate the value of χ2 for the difference between the observed and expected numbers. Is this difference
significant at the 0.05 level? Record your final result below:
Activity 24:
Absenteeism Job Satisfaction
Dissatisfied Happy Total
Absent from work 4 11 15
Not absent from work
10 5 15
Total 14 16 30
Subject   JobSatis   Absent
1         1          2
2         1          1
3         2          1
4         1          1
5         2          1
6         2          1
7         1          2
8         1          2
9         2          2
10        1          2
11        2          2
12        2          1
13        2          1
14        2          2
15        2          1
16        1          2
17        1          1
18        2          2
19        1          2
20        2          1
21        2          2
22        1          2
23        2          1
24        2          1
25        1          2
26        1          1
27        1          2
28        2          1
29        1          2
30        1          2
5.5 Using SPSS to Calculate Chi-Squared
Having considered how to calculate chi-squared manually, the aim of the following section is to highlight how to calculate chi-squared values using SPSS.
To start, we will repeat the Butlins job satisfaction exercise (Activity 24). Load the ‘Butlins1’ exercise file into SPSS. Label the columns and values as you have done in previous sessions.
To perform a chi-squared test in SPSS, move the mouse over Analyse and press the left mouse button. Move
the mouse over Descriptive Statistics and then Crosstabs.
The Crosstabs dialog box appears.
Move the mouse over Jobsatis and press the left mouse button. Press the top
arrow button so that Jobsatis is selected in the Row(s): box.
Move the mouse over Absent and press the left mouse button. Press the middle
arrow button so that Absent is selected in the Column(s) box.
Move the mouse over Statistics and press the
left mouse button. The Crosstabs: Statistics
dialog box appears.
Select the chi-square option
and then press Continue.
This takes you back to the Crosstabs dialog box. Move the mouse over Cells and press the left mouse
button.
The Crosstabs: Cell display dialog box
appears.
Make sure that Observed and Expected
counts are selected and then press Continue.
This will take you back to the initial Crosstabs
dialog box.
Press OK and SPSS will automatically
calculate the chi-square statistics and display
the results in the output window. The output
window will display a contingency table and
the following output.
How do these results compare to your manual output? Well, first of all you should notice that the Pearson
chi-square result gives you the χ2 statistic prior to revision by the Yates Correction (4.82). Second, the
Continuity Correction chi-square gives you the χ2 statistic as adjusted by the Yates Correction (3.34).

But, from the SPSS output, how do you infer the significance level?

Although the output looks daunting, the answer is quite simple. In the output below, the significance value
(Asymp. Sig. (two-tailed)) for the corrected χ2 statistic is .067. This value is greater than 0.05, which means
it is not significant at the 0.05 confidence level. This can also be reported as p>0.05 (not significant). Notice,
however, that the value is less than 0.1, which means it is significant at the 0.1 significance level, which can
alternatively be recorded as p<0.1 (significant). Basically, these are the same results as you should have
calculated manually.
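As a cross-check on the manual and SPSS figures, the same calculation can be sketched in a few lines of Python. This is purely illustrative and not part of the SPSS workflow; only the observed counts from the worked example are assumed, and only the standard library is used.

```python
# Cross-check of the chi-square output for the 2x2 job satisfaction /
# absenteeism table, using only the Python standard library.
import math

observed = [[4, 11],
            [10, 5]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count for each cell = (row total * column total) / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

def chi_square(obs, exp, yates=False):
    """Pearson chi-square; with yates=True, apply the continuity correction."""
    total = 0.0
    for o_row, e_row in zip(obs, exp):
        for o, e in zip(o_row, e_row):
            diff = abs(o - e) - (0.5 if yates else 0.0)
            total += diff ** 2 / e
    return total

pearson = chi_square(observed, expected)            # about 4.82
corrected = chi_square(observed, expected, True)    # about 3.35

# p-value for a chi-square statistic with 1 degree of freedom only:
# P(chi2 > x) = erfc(sqrt(x / 2)) when df = 1.
p_value = math.erfc(math.sqrt(corrected / 2))       # about 0.067

print(pearson, corrected, p_value)
```

The two statistics match the Pearson and Continuity Correction rows of the SPSS output, and the p-value matches the reported .067.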
Remember:
if the significance value (p) is <0.1 then the value is significant at the 0.1 significance level
(90%)
if the significance value (p) is <0.05 then the value is significant at the 0.05 significance level
(95%)
if the significance value (p) is <0.01 then the value is significant at the 0.01 significance level
(99%).
Remember however, that you should not switch between significance levels so that the null
hypothesis can be rejected. The safest rule is to pick a significance level before you start and stick
with it all the way through the test.
Accurately Reporting the Outcomes of the Chi-Square Test
When reporting the chi-square result a number of key elements must be included:
Specify suitable hypotheses. In this case:
H0: There is no significant difference between job satisfaction and levels of absenteeism
H1: There is a significant difference between job satisfaction and levels of absenteeism
The test statistic. Therefore in your write-up you must include what χ2 equals. In this example χ2 =
3.34.
The degrees of freedom. This is the number of rows minus 1, times the number of columns minus 1.
This value is actually given in the SPSS output. The value for degrees of freedom is placed between
the χ2 and the = sign and placed in brackets. In this example the degrees of freedom = 1, therefore χ2
(1) = 3.34.
As part of the report you must also state the probability. As highlighted above, this is done in relation
to whether your probability value was below 0.05 or 0.01 (and therefore significant) or above 0.05
(and therefore not significant). Here, you use the less than (<) or greater than (>) sign relative to the
criterion level, stating whether p<0.05 (significant), p<0.01 (significant) or p>0.05 (not significant).
Assuming a 95% confidence level in the above example, as p=0.067, we would write p>0.05 and place
this after the reporting of the χ2 value. Therefore χ2 (1) = 3.34, p (0.067)>0.05.
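The reporting convention above can be assembled mechanically. The small helper below is my own illustration (the function name is made up, not part of SPSS or the module materials); it takes the test statistic, degrees of freedom, p-value and chosen significance level, and produces the reporting string in the format described.

```python
# Illustrative helper that assembles the reporting convention described above:
# test statistic, degrees of freedom in brackets, and the p value compared
# against the chosen significance level (alpha).
def report_chi_square(chi2, df, p, alpha=0.05):
    comparison = "<" if p < alpha else ">"
    return f"chi2 ({df}) = {chi2:.2f}, p ({p:.3f}){comparison}{alpha}"

# At the 95% confidence level the result is not significant:
print(report_chi_square(3.34, 1, 0.067, alpha=0.05))
# chi2 (1) = 3.34, p (0.067)>0.05

# At the 90% confidence level the same result is significant:
print(report_chi_square(3.34, 1, 0.067, alpha=0.1))
# chi2 (1) = 3.34, p (0.067)<0.1
```

Note that the significance level must still be fixed before the test is run, as stressed above; the helper only formats the comparison.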
These elements must be incorporated into your text to ensure that your results are presented succinctly but
effectively. You can also include a table. Therefore, using the findings above, we could report the following.
Table 1: Job Satisfaction v Job Absenteeism

Category            Job Satisfaction
                  Dissatisfied   Happy    Totals
Absenteeism
Yes (observed)         4           11       15
  % of Total          13%          37%      50%
No (observed)         10            5       15
  % of Total          33%          17%      50%
Totals                14           16       30
‘Table 1 shows a breakdown of the distribution of respondents in terms of levels of job satisfaction and
levels of absenteeism (with percentages in brackets). A chi-squared test was used to determine whether
there was a significant difference between the two variables. A null hypothesis of no significant difference
and an alternative hypothesis of a significant difference were established, and a 95% confidence level
was assumed. No significant difference was found between job satisfaction and absenteeism (χ2 (1) =
3.34, p (0.067)>0.05). The null hypothesis of no significant difference can therefore not be rejected.’
Note if we had assumed a 90% confidence level from the start we would write:
‘Table 1 shows a breakdown of the distribution of respondents in terms of levels of job satisfaction and
levels of absenteeism (with percentages in brackets). A chi-squared test was used to determine whether
there was a significant difference between the two variables. A null hypothesis of no significant difference
and an alternative hypothesis of a significant difference were established, and a 90% confidence level
was assumed. A significant difference was found between job satisfaction and absenteeism (χ2 (1) =
3.34, p (0.067)<0.1). The null hypothesis of no significant difference can therefore be rejected.’
Load the ‘Chi Square Exercises file’ file into SPSS. This file contains the data relating to the two additional
practical exercises that you completed by hand. Perform two chi-squared tests and compare your output
with your manual calculations. Note that the Excel file contains two spreadsheets that you will need to
access. Import into SPSS in the normal way, but select the spreadsheet you wish to use from the Opening
Excel Data Source dialog box (as below). Record the results in your log book.
The following coding schemes have been used:
Residential Desirability: Area: Coastal = 1; Inland = 2. Score: High = 1; Low = 2.
TV Survey: Gender: Female = 1; Male = 2. Location: Greece = 1; Spain = 2; Thailand = 3; Turkey = 4; USA = 5. Regroup: Europe = 1; Asia = 2; USA = 3.
Activity 25:
Referring to the variables in the Dataset file, identify a series of relationships that could be examined using the chi-squared test. Remember you need to focus on category/nominal data for this exercise.
Activity 26:
Using the Dataset file conduct 3 appropriate chi-squared tests. Please complete the following tables, making clear reference to the SPSS output. For each test, identify a research scenario that you are using the test to explore.
Table 28: Chi-Squared 1
Table 29: Chi-Squared 2
Activity 27:
Chi-Squared Test
Research Scenario
Row Variable
Column Variable
Null Hypothesis
Alternative Hypothesis
Comment on the SPSS Output
Chi-Squared Test
Research Scenario
Row Variable
Column Variable
Null Hypothesis
Alternative Hypothesis
Comment on the SPSS Output
Using the Dataset file conduct 3 appropriate chi-squared tests. Please complete the following tables, making clear reference to the SPSS output. For each test, identify a research scenario that you are using the test to explore.
Table 30: Chi-Squared 3
Activity 27:
Chi-Squared Test
Research Scenario
Row Variable
Column Variable
Null Hypothesis
Alternative Hypothesis
Comment on the SPSS Output
Notes:
Correlation
Section 6
Learning Outcomes
At the end of this session, you should be able to:
Explain the rationale for the use of correlation analysis
Understand the basic conditions and criteria involved in the use of correlation analysis
Use SPSS to calculate the correlation coefficient for both the Pearson’s Product Moment Correlation Coefficient and Spearman’s Rank Correlation Coefficient
Interpret computer-generated SPSS correlation analysis output
Data Analysis for Research Correlation
6.0 Introduction
The aim of this session is to help you understand the importance of correlation in statistical analysis. By the
end of this session you should understand the meaning of correlation, how to check if data fulfils assumptions
for parametric and non-parametric testing, and how to perform correlation statistics on SPSS.
6.1 The Meaning of Correlation
Correlation is one of the most widely used statistical techniques. It is a means to measure the degree of
association between two variables, that is, the extent to which changes in values of one variable are matched
by changes in another variable. For example, we would tend to expect that, other things being equal, the
market price of houses increases as the size of the house increases, that is bigger houses cost more. The
size and price are correlated. The amount of water flowing down a river would be expected to be closely
related to the amount of rain which has recently fallen on the catchment. The rainfall and river flow are
correlated. We may have data on crime rates and on unemployment in a number of areas. It may be that
those areas with a high crime rate also, in general, have a higher rate of unemployment. These variables
are also correlated.
Correlation may measure the extent to which higher values of one variable are matched with higher values
of the other (this is called positive correlation), or it can measure the extent to which higher values of one
variable are matched with lower values of the other (this is called negative correlation). For example,
you might find a positive correlation between the amount of beer you drank the night before and the number
of pneumatic drills you think are in your head the next day. However, there might be a negative correlation
between the number of pints and your ability to perform particular tasks.
To repeat, correlation is a measure of association; it says nothing whatsoever about cause. Although
variation in house size may cause variation in house price, and variation in amounts of rainfall may cause
variation in river flow, there has been a long, political as well as sociological, argument about whether
unemployment causes crime. It is possible to find sets of data which have absolutely nothing in common,
except that they are correlated.
Remember:
If higher values of one variable are associated with higher values of the other variable, then the
two variables are positively correlated.
If higher values of one variable are associated with lower values of the other variable, then the
two variables are negatively correlated.
There are several ways to measure correlation, using a range of different indices for different types of data.
When variables are parametric in nature (e.g. interval/ratio data), by far the most common measure of
correlation is the Pearson’s Product Moment Correlation Coefficient, often referred to as Pearson’s r.
Where data is ordinal (one or both variables are not measured on an interval scale), or is not normally
distributed, or when other assumptions of the Pearson correlation coefficient are violated, we use the Spearman
Correlation Coefficient, referred to as Spearman’s rs.
Referring to the variables in the Dataset file and your accompanying data set guide, attempt to
complete the following diagram, listing variables that could be correlated using the Pearson Product
Moment Correlation or the Spearman Rank Correlation Coefficient.
Activity 28:
Pearson Product Moment Correlation
Spearman Rank Correlation Coefficient
6.2 Identifying Signs of Correlation in the Data
Whatever type of data you are using, an important first stage in measuring correlation is to obtain some idea
of whether correlation may be present in the data. The simplest way to do this is to plot the variables and look
carefully at the graph.
Figure 6.1 shows that the two variables are clearly related in some way: they are strongly correlated. The
graph slopes up to the right, that is, there is an association between higher values, so the correlation is
positive.
Figure 6.1: Strong Positive Correlation
In the case of Figure 6.2, the graph slopes down to the right, thereby implying a negative relationship, meaning
that as one variable increases, the other decreases.
Figure 6.2: Strong Negative Correlation
In addition to positive and negative relationships we sometimes find non-linear or curvilinear relationships, in
which the shape of the relationship between the two variables is not straight, but curves at one or more points
(see Figure 6.3).
Figure 6.3: Non-linear or Curvilinear Relationship
It is important to identify if the relationship is non-linear as:
It would affect the choice of correlation measurement technique;
If the wrong technique was used there would be a spurious result.
Overall, scatter diagrams are useful aids in the preliminary steps of identifying correlation and allow three
aspects of a relationship to be discerned: whether it is linear; the direction of the relationship (positive or
negative); and the strength of the relationship. The amount of scatter is indicative of the strength of the
relationship.
6.3 Correlation Analysis
The correlation coefficient (r) measures the linear relationship between the variables. Every correlation
coefficient will lie somewhere on the scale of possible values, that is, between -1 and +1 inclusive. A coefficient
of -1 or +1 would indicate a perfect relationship, negative or positive respectively, between the two variables.
The complete absence of a relationship would produce a computed coefficient of zero. The closer the correlation
coefficient is to 1 (either positively or negatively) the stronger the relationship between the two variables. The
nearer the correlation coefficient is to zero, the weaker the relationship. These ideas are displayed in Figure
6.4.
Figure 6.4: The Strength and Direction of Correlation Coefficients
If the correlation coefficient is 0.85, this would indicate a strong positive relationship between the two
variables, whereas a correlation coefficient of 0.28 would denote a weak positive relationship. Similarly, -0.75
and -0.36 would be indicative of strong and weak negative relationships respectively.
However, what is a large correlation? Cohen and Holliday (1982) suggest the following: 0.19 and below
is very low; 0.20 to 0.39 is low; 0.40 to 0.69 is modest; 0.70 to 0.89 is high; and 0.90 to 1 is very high. However,
these measures are a rule of thumb and should not be regarded as definitive indications. Caution
is also required when comparing computed correlation coefficients. For example we can say that a computed
correlation coefficient of -0.60 is larger than one of -0.30, but we cannot say that the relationship is twice as
strong. In order to understand this more clearly, we need to refer to the coefficient of determination (R2).
This is quite simply the square of the correlation coefficient multiplied by 100. It provides us with an indication
of how far variation in one variable is due to the other. Thus if r = -0.6, then R2 = 36 per cent. This means that
36 per cent of the variance in one variable is due to the other. When r = -0.3, then R2 will be 9 per cent. Thus,
although an r of -0.6 is twice as large as one of -0.3, it cannot be said that the former relationship is twice as
strong as the latter, because four times more variance is being accounted for by an r of -0.6 than by one of
-0.3 (Bryman and Cramer, 1997). Referring to the coefficient of determination can also influence your
interpretation of r. For example, an r value of 0.75 may seem quite high, but it would only mean that 56 per
cent of the variance in y can be attributed to x. In other words, 44 per cent of the variance in y is due to
variables other than x.
[Figure 6.4 shows a scale running from -1 (perfect negative correlation) through 0 (no correlation) to +1 (perfect positive correlation), with the strength of the relationship weakening towards zero.]
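The arithmetic in the paragraph above can be captured in a one-line helper (the function name here is my own, for illustration only):

```python
# The coefficient of determination from the text: R^2 is the squared
# correlation coefficient expressed as a percentage of shared variance.
def coefficient_of_determination(r):
    return r ** 2 * 100  # per cent of variance in y accounted for by x

print(coefficient_of_determination(-0.6))   # about 36 per cent
print(coefficient_of_determination(-0.3))   # about 9 per cent
print(coefficient_of_determination(0.75))   # 56.25 per cent
```

This makes the asymmetry clear: doubling r from -0.3 to -0.6 quadruples the variance explained, which is why one correlation cannot be described as "twice as strong" as another.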
CARS [No. of Cars]   PERSONS [No. of Persons]   INCOME [Income (Thousands)]   AGE [Age]   TRAVEXP [Travel Expenditure]

CARS  PERSONS  INCOME  AGE  TRAVEXP
 0       2        9     25     10
 2       3       25     37     50
 1       1       13     23     20
 2       4       30     30     60
 2       2       50     43     70
 0       1        4     18      5
 1       3       30     27    100
 2       2       43     55     30
 1       1       10     71     15
 1       3       50     20     20
 2       2       37     41     50
 1       2       25     51     90
 1       5       30     45     40
 2       4       50     40     80
 3       2       75     54    150
 1       3       45     34     50
 1       4       50     67     30
 0       3       20     44     20
 0       4       13     34     15
 1       3       35     54     50
 2       1       40     65     50
 1       1       75     45     30
 0       2       10     34     10
 1       2       50     26     30
 2       3       30     65     70
 3       4      100     32    100
 1       3       40     46     60
 2       3       30     55     50
 1       2       30     65     20
6.4 Using SPSS to Measure Correlation: Pearson’s Correlation Coefficient
The most commonly used (and misused) measure of correlation is Pearson’s Product Moment Correlation
Coefficient. This is a powerful parametric measure, which can be used to test for significance and reliability
as long as its assumptions are satisfied. The first two assumptions are:
The relationship between the variables is linear;
The variables are interval or ratio scale measurements.
Before we use Pearson’s Correlation Coefficient to examine possible correlations in the Dataset file, let me
illustrate correlation through a simple example. Load the Excel file ‘Correlation’ into SPSS. The details of
this data file are highlighted below.
The above table refers to factors that might influence the level of car ownership in individual households. If
you wanted to examine the relationship between the different variables, the first stage would be to produce a
series of scatterplots to highlight the direction and strength of any possible relationships. Let us examine
correlation through a specific example. In this case, we will look at the relationship between the number of
persons in the household (Persons) against the number of cars (Cars).
To do so, click Graphs, move the mouse over Legacy Dialogs
and then select Scatter/Dot.
The Scatterplot dialog box appears.
Ensure that Simple is selected and then press Define.
The Simple Scatterplot dialog box appears.
Move the mouse over Cars (Number
of Cars) and press the left mouse
button. Move the mouse over the top
arrow and press the left mouse
button so that Cars is selected in the
Y Axis: box.
Move the mouse over Persons
(Number of People) and press the
left mouse button. Move the mouse
over the centre arrow and press the
left mouse button so Persons is
selected in the X Axis: box.
Press OK.
A scatterplot showing the relationship between the two variables appears.
The scatterplot shows no clear linear relationship, indicating a very weak correlation between the two
variables. This can be confirmed by calculating the correlation coefficient.
To do so, move the mouse over Analyse and press the left mouse button. Move the mouse over Correlate
and then over Bivariate and press the left mouse button again. The Bivariate Correlations dialog box
appears.
Move the mouse over Cars and press the left mouse button. Move the mouse over the top arrow so that Cars
is selected in the Variables: box.
Repeat the same procedure for Persons. Make sure that the Pearson correlation coefficient and a two-
tailed test are selected. A two-tailed test is used because we do not know in which direction the relationship
between the two variables will run, and we are looking for either a positive or a negative correlation. Press
OK.
SPSS produces a matrix of correlation coefficients in the output window. In this case the following output is
produced:
As you can see from the output, the value of r for the two variables equals 0.129, which indicates a very weak
correlation. You should also notice that the probability value (p) is also not significant (p>0.05).
As with your previous exercises, you should also provide null and alternative hypotheses. In this case:
Null Hypothesis
There is no significant association between levels of car ownership and the number of persons
in the household.
Alternative Hypothesis [Two-Tailed]
There is a significant association between levels of car ownership and the number of persons in
the household.
Note that this alternative hypothesis is two-tailed as it is not specifying a specific direction (for example a
positive or negative association). An initial scatterplot of the data would reveal any possible association
between the data, and allow you to specify a one-tailed test. In this case, a one-tailed test would look like this:
Alternative Hypothesis [One-Tailed]
There is a positive association between levels of car ownership and the number of persons in the
household.
Referring back to the SPSS output for our initial correlation:
The Pearson Correlation test statistic = .129. The output indicates that this is not significant (p=.503, >0.05).
A conventional way of reporting these figures would be as follows: r = .129, n = 29, p>0.05.
The results indicate that there is no significant association between levels of car ownership and number of
persons in the household. Note that when using correlation you are examining the level of association,
and this should be clearly reflected in your hypotheses.
Let us now repeat this procedure to examine the relationship between additional variables within the dataset.
In this case we will look at car ownership against household income.
First create a scatterplot between the car ownership and income. Your scatterplot should look similar to the
graph below:
The scatterplot clearly indicates that there is a linear relationship between the two variables, and that there is
evidence of a positive correlation: in this case, as household income increases so does the level of car
ownership. Having established the existence of a linear relationship, now calculate the correlation coefficient.
In the Bivariate Correlations dialog box, specify a one-tailed test, as in this case we are expecting a positive
correlation - thus indicating a direction. SPSS will generate the following output.
Correlations
Cars Income
Cars Pearson Correlation 1 .665(**)
Sig. (1-tailed) . .000
N 29 29
Income Pearson Correlation .665(**) 1
Sig. (1-tailed) .000 .
N 29 29
** Correlation is significant at the 0.01 level (1-tailed).
The Pearson Correlation test statistic =0.665. SPSS indicates with ** that it is significant at the 0.01 level for
a one-tailed prediction. The actual p value is shown to be 0.000. A conventional way of reporting these figures
would be as follows: r=0.665, n=29, p<0.01. The results indicate that as household income increases, car
ownership also increases, which is a positive correlation. As the r value reported is positive and p <0.01, we
can state that there is a positive correlation between our two variables and that the null hypothesis can be
rejected.
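Both coefficients can be cross-checked outside SPSS. The sketch below is illustrative rather than part of the module workflow: it recomputes Pearson's r with the standard product-moment formula, using the data transcribed from the correlation table above (n = 29); the list and function names are my own.

```python
# Recomputing the two Pearson coefficients reported by SPSS from the
# car-ownership data table (n = 29), standard library only.
import math

cars    = [0, 2, 1, 2, 2, 0, 1, 2, 1, 1, 2, 1, 1, 2, 3, 1, 1, 0, 0, 1, 2, 1, 0, 1, 2, 3, 1, 2, 1]
persons = [2, 3, 1, 4, 2, 1, 3, 2, 1, 3, 2, 2, 5, 4, 2, 3, 4, 3, 4, 3, 1, 1, 2, 2, 3, 4, 3, 3, 2]
income  = [9, 25, 13, 30, 50, 4, 30, 43, 10, 50, 37, 25, 30, 50, 75, 45, 50, 20, 13, 35,
           40, 75, 10, 50, 30, 100, 40, 30, 30]

def pearson_r(x, y):
    """Pearson Product Moment Correlation Coefficient."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    syy = sum(v * v for v in y)
    sxy = sum(a * b for a, b in zip(x, y))
    return (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

print(round(pearson_r(cars, persons), 3))  # about 0.129 - the very weak correlation
print(round(pearson_r(cars, income), 3))   # about 0.665 - the positive correlation
```

The two values agree with the r = .129 and r = 0.665 reported in the SPSS output above.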
Please cut and paste your scatterplot below
and rescale accordingly
Scatterplot (Please note any evidence of a relationship. Is it linear or non-linear? Is it positive or negative?)
Null Hypothesis:
Alternative Hypothesis (one tailed):
Value of r?
Probability Value?
Please provide a brief summary of your findings here:
Examine the remaining variables in the dataset and record your observations, using the tables below that are also in your log book.
Table 31: Number of cars against age
Activity 29:
Examine the remaining variables in the dataset and record your observations, using the tables below that are also in your log book.
Table 32: Number of cars against income
Table 33: Number of cars against monthly travel expenses
Please cut and paste your scatterplot below
and rescale accordingly
Scatterplot (Please note any evidence of a relationship. Is it linear or non-linear? Is it positive or negative?)
Null Hypothesis:
Alternative Hypothesis (one tailed):
Value of r?
Probability Value?
Please provide a brief summary of your findings here:
Activity 29:
Please cut and paste your scatterplot below
and rescale accordingly
Scatterplot (Please note any evidence of a relationship. Is it linear or non-linear? Is it positive or negative?)
Null Hypothesis:
Alternative Hypothesis (one tailed):
Value of r?
Probability Value?
Please provide a brief summary of your findings here:
Using the Dataset file, conduct two Pearson Product Moment Correlation Coefficients on appropriate variables and record your answers in the tables below, which can be found in your log book. For each test, identify a research scenario that you are using the test to explore.
Table 34: Correlation 1
Activity 30:
Research Scenario
Please cut and paste your scatterplot below
and rescale accordingly
Scatterplot (Please note any evidence of a relationship. Is it linear or non-linear? Is it positive or negative?)
Null Hypothesis:
Alternative Hypothesis (one tailed):
Value of r?
Probability Value?
Please provide a brief summary of your findings here:
Using the Dataset file, conduct two Pearson Product Moment Correlation Coefficients on appropriate variables and record your answers in the tables below, which can be found in your log book. For each test, identify a research scenario that you are using the test to explore.
Table 35: Correlation 2
Activity 30:
Research Scenario
Please cut and paste your scatterplot below
and rescale accordingly
Scatterplot (Please note any evidence of a relationship. Is it linear or non-linear? Is it positive or negative?)
Null Hypothesis:
Alternative Hypothesis (one tailed):
Value of r?
Probability Value?
Please provide a brief summary of your findings here:
6.5 Non-Parametric Correlation: Spearman’s Rank Correlation Coefficient
It is often the case that the data available do not fit the requirements for parametric testing. In this case, there
is a non-parametric correlation measure available. Spearman’s Rank Correlation Coefficient is mathematically
derived from Pearson’s coefficient, but instead of using the actual data values it uses rank or ordinal data. The
Spearman correlation coefficient is known as rs. The main assumptions for the use of Spearman’s rank
correlation are:
The relationship between the variables is monotonic, that is, y consistently increases as x
increases, or y consistently decreases as x increases. A linear relationship is monotonic, but a
monotonic relationship is not necessarily linear.
The variables are ordinal (ranks) or are ranked interval or ratio scale measurements.
To highlight the use of the Spearman’s rank correlation, type the data table
into SPSS. The data refers to a survey of workers in a London hotel. The
manager believed that the employee commitment to customer care policies
was influenced by overall job satisfaction. The data in the table is ranked
for Commitment (Commit) (1=High Commitment and 4 = Poor
Commitment) and Job Satisfaction (Satis) (1= High Satisfaction and 4=Low
Satisfaction)
Use the same procedure starting on page 6-196, to open the Bivariate
Correlations dialog box.
Select both Commit and Satis in the
Variables: box. Instead of Pearson’s
r, make sure that the Spearman
Correlation Coefficient is
selected. Make sure that the one-
tailed test is also selected. This is
because the manager believes that employee commitment increases with job satisfaction. This therefore implies a
direction in the alternative hypothesis making it a one-tailed test.
Commit  Satis
 1.00   1.00
 2.00   3.00
 1.00   2.00
 4.00   3.00
 4.00   4.00
 1.00   1.00
 1.00   2.00
 1.00   2.00
 2.00   1.00
 4.00   4.00
 3.00   4.00
 4.00   4.00
 1.00   1.00
 1.00   2.00
 1.00   2.00
 2.00   2.00
 1.00   1.00
 3.00   3.00
 4.00   4.00
 4.00   3.00
 1.00   1.00
 1.00   2.00
 2.00   1.00
 3.00   4.00
 1.00   1.00
Press OK and SPSS will automatically calculate the value of the Spearman’s rank correlation coefficient. In
this case, the following output is produced.
As you can see from the output, there is a strong positive correlation between the two variables (0.78). The
result is also significant (p<0.01) and the manager can be confident at the 99% confidence level that
commitment increases with job satisfaction. The positive correlation is also reflected in a scatterplot of the
two variables.
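With tied ranks, Spearman's rs is equivalent to Pearson's coefficient calculated on the ranks, which is the approach SPSS takes. The sketch below, using only the standard library and the hotel survey data typed in above, recomputes the coefficient this way; the function names are illustrative.

```python
# Recomputing Spearman's rs for the hotel survey data: rank each variable
# (averaging ranks across ties), then apply the Pearson formula to the ranks.
import math

commit = [1, 2, 1, 4, 4, 1, 1, 1, 2, 4, 3, 4, 1, 1, 1, 2, 1, 3, 4, 4, 1, 1, 2, 3, 1]
satis  = [1, 3, 2, 3, 4, 1, 2, 2, 1, 4, 4, 4, 1, 2, 2, 2, 1, 3, 4, 3, 1, 2, 1, 4, 1]

def rank_with_ties(values):
    """Assign 1-based ranks, sharing the average rank across tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rs(x, y):
    """Spearman's rs = Pearson's r computed on the ranks."""
    rx, ry = rank_with_ties(x), rank_with_ties(y)
    n = len(rx)
    sx, sy = sum(rx), sum(ry)
    sxx = sum(a * a for a in rx)
    syy = sum(b * b for b in ry)
    sxy = sum(a * b for a, b in zip(rx, ry))
    return (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

print(round(spearman_rs(commit, satis), 2))  # about 0.78, matching the SPSS output
```

The rank-then-Pearson approach handles the many ties in this four-point data correctly, which the simple d-squared formula for Spearman's rank does not.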
© Dr Andrew Clegg p. 6-208
Data Analysis for Research Correlation
Using the Dataset file, conduct two Spearman Rank Correlation Coefficients on appropriate variables and record your answers in the tables below, which can be found in your log book. For each test, identify a research scenario that you are using the test to explore.
Table 36: Correlation 3
Activity 31:
Research Scenario
Please cut and paste your scatterplot below
and rescale accordingly
Scatterplot (Please note any evidence of a relationship. Is it linear or non-linear? Is it positive or negative?)
Null Hypothesis:
Alternative Hypothesis (one tailed):
Value of r?
Probability Value?
Please provide a brief summary of your findings here:
Using the Dataset file, conduct two Spearman Rank Correlation Coefficients on appropriate variables and record your answers in the tables below, which can be found in your log book. For each test, identify a research scenario that you are using the test to explore.
Table 37: Correlation 4
Activity 31:
Research Scenario
Please cut and paste your scatterplot below
and rescale accordingly
Scatterplot (Please note any evidence of a relationship. Is it linear or non-linear? Is it positive or negative?)
Null Hypothesis:
Alternative Hypothesis (one tailed):
Value of r?
Probability Value?
Please provide a brief summary of your findings here:
Useful Reading
Section 7
7.0 Useful Reading
BRYMAN, A. AND CRAMER, D. (2001), Quantitative Data Analysis with SPSS Release 10 for Windows,
Routledge, London.
BUGLEAR, J. (2000), Stats to Go, Butterworth Heinemann, London.
CLARK, M., RILEY, M., WILKIE, E. AND WOOD, R. (1998), Researching and Writing Dissertations in
Hospitality and Tourism, Thomson Business Press, London.
DANCEY, C. AND REIDY, J. (2002), Statistics Without Maths for Psychology, Second Edition, Prentice
Hall, London.
EBDON, D. (1985), Statistics in Geography, Blackwell, London.
FIELD, A. (2009), Discovering Statistics Using SPSS, Third Edition, Sage, London.
FINN, M., ELLIOTT-WHITE, M. AND WALTON, M. (2000), Tourism and Leisure Research Methods,
Longman, London.
GHAURI, P. AND GRONHAUG, K. (2002), Research Methods in Business Studies, FT Prentice Hall,
London.
HINTON, P. (2004), Statistics Explained, Routledge, London.
HINTON, P., BROWNLOW, C., McMURRAY, I. AND COZENS, B. (2004), SPSS Explained, Routledge,
London.
KINNEAR, P. AND GRAY, C. (1999), SPSS for Windows Made Simple, Psychology Press, London.
KITCHIN, R. AND TATE, N. (2000), Conducting Research into Human Geography, Prentice Hall, London.
MALTBY, J. AND DAY, L. (2002), Early Success in Statistics, Prentice Hall, London.
McQUEEN, R. AND KNUSSEN, C. (2002), Research Methods for Social Science, Prentice Hall, London.
MICROSOFT PRESS (1997), Microsoft Access 97 - At a Glance, Microsoft Press, Washington.
MULBERG, J. (2002), Figuring Figures, Prentice Hall, London.
ROGERSON, P. (2001), Statistical Methods for Geography, Sage Publications, London.
SAUNDERS, M., LEWIS, P. AND THORNHILL, A. (2003), Research Methods for Business Students,
Third Edition, FT Prentice Hall, London.
Appendices
Section 8