Dr Andrew Clegg
BML224: Data Analysis for Research
Geographical Techniques 2 Descriptive Statistics
© Dr Andrew Clegg
Data Analysis for Research Contents
Contents
Course Outline p. i
Section 1: Sampling and Types of Data
1.0 Introduction - Why use Statistics? p. 1-1
1.1 Sampling p. 1-1
1.2 Some Terms in Sampling p. 1-2
1.3 Avoiding Bias p. 1-2
1.4 Deciding on the Choice of Sampling Techniques p. 1-2
1.5 Summary p. 1-6
1.6 Types of Data p. 1-7
1.7 Presenting Data p. 1-11
Section 2: Descriptive Statistics
2.0 Introduction p. 2-25
2.1 Measures of Central Tendency p. 2-25
2.2 Arithmetic Mean p. 2-25
2.3 The Median p. 2-27
2.4 The Mode p. 2-31
2.5 Comparison of the Mean, Median and Mode p. 2-33
2.6 The Population Mean p. 2-34
2.7 Skew and the Relationship of the Mean, Median and Mode p. 2-36
2.8 Using SPSS to Calculate Descriptive Statistics p. 2-37
2.9 Graphically Describing Data p. 2-68
2.10 Graphically Describing Data in SPSS p. 2-74
2.11 Creating Crosstabulations in SPSS p. 2-80
Section 3: Measures of Dispersion
3.0 Introduction p. 3-89
3.1 Measures of Dispersion p. 3-91
3.2 Other Distributions p. 3-98
3.3 The Standard Normal Distribution p. 3-100
3.4 Confidence Intervals p. 3-105
3.5 The Standard Error p. 3-109
3.6 Looking at Distributions in SPSS p. 3-115
3.7 Graphically Looking at Distributions in SPSS p. 3-117
p. 210
Geographical Techniques 2 Descriptive Statistics
© Dr Andrew Clegg
Data Analysis for Research Contents
Section 4: Student T-Test, Paired Samples T-Test, Mann Whitney and Wilcoxon
4.0 Introduction p. 4-127
4.1 Null and Alternative Hypotheses p. 4-127
4.2 Hypothesis Testing p. 4-129
4.3 One and Two Tailed Tests p. 4-129
4.4 Choosing the Right Test p. 4-132
4.5 Parametric Tests p. 4-134
4.6 Using SPSS to Calculate the Student T-Test p. 4-135
4.7 Using SPSS to Calculate the T-Test for Related Samples p. 4-147
4.8 Non-Parametric Tests p. 4-152
4.9 Using SPSS to Calculate Mann Whitney p. 4-153
4.10 Using SPSS to Calculate Wilcoxon Signed Ranks Test p. 4-161
Section 5: Chi-Squared
5.0 Introduction p. 5-167
5.1 The One Sample Chi-Squared Test p. 5-169
5.2 The Chi-Squared Test for Two or More Samples p. 5-172
5.3 Yates Correction Factor p. 5-175
5.4 Conditions Necessary for Conducting a Chi-Squared Test p. 5-176
5.5 Using SPSS to Calculate Chi-Squared p. 5-180
Section 6: Correlation
6.0 Introduction p. 6-191
6.1 The Meaning of Correlation p. 6-193
6.2 Identifying Signs of Correlation in the Data p. 6-194
6.3 Correlation Analysis p. 6-194
6.4 Using SPSS to Measure Correlation:
Pearson’s Product Moment Correlation Coefficient p. 6-196
6.5 Using SPSS to Measure Correlation:
Spearman’s Rank Correlation Coefficient p. 6-206
Sampling & Types of Data
Section 1
Learning Outcomes
At the end of this session, you should be able to:
Understand the rationale for the use of statistical techniques
Discuss approaches to developing sampling frameworks and methodologies
Define key terms in the use of statistical techniques
Understand the difference between different data types
Present numerical data effectively in graphical and tabular form
Introduction to Statistical Terms and Sampling Frameworks
1.0 Introduction - Why use Statistics?
There is far more to research than measurement and analysis of quantifiable facts. A prime tool in the study of
how people exist in their environment is the very fact of the investigator’s common humanity: you know a lot
about what people do because you are also a human being. And human actions and responses are affected
by memory, prejudice and emotions which cannot be adequately quantified. Even so, there are innumerable
instances of relevant, quantified facts in geographical investigations: most questionnaire results contain some
quantitative element, even if it is only how many people said ‘yes’ and how many people said ‘no’; international
comparisons can often make use of data from the World Health Organization, World Bank, Unicef or the
United Nations Development Programme, amongst others; within Britain data from the census or health
authorities exists for a wide range of areal units; in physical geography the geology, soils, vegetation, elevation,
aspect and so on can all be quantified. You should not ignore these data.
You may feel that it is only necessary to present such information, perhaps using a table or a graph, and
sometimes that may be enough. On the other hand, statistics will enable you to go much further in the
understanding of the patterns and relationships displayed. Furthermore, they will help ascertain the quality of
the information that you are using. This last point is perhaps the most important part of using statistics: for
instance, your pie chart showing that 75% of respondents preferred Bognor to Barbados as a holiday destination
may look impressive, but statistics will soon reveal that any conclusions to be drawn from the answers given
by only four people are limited, to say the least. If the statistics can’t test your hypotheses, the fault may be in
your hypotheses, or more likely in your data collection, but it certainly isn’t in the statistics.
Before considering the statistical manipulation of data, it is necessary to consider how the data is collected
for use.
1.1 Sampling
You often have to make do with what information there is (if you are interested in the cultivation of mangel-
wurzels and the nineteenth century agricultural survey did not record them, there is not much you can do about
it), but ideally in research you can go and collect the information yourself. In such a case you can ensure that
the information you collect is as useful as possible. Sometimes you will be able to collect all the relevant
information – the census population of each ward in the county, for instance – but in many cases you will need
to collect a sample. For example, you would not practicably be able to find the opinions of all the people in a
county, but using an appropriate sampling technique you could collect information from a smaller but
representative sample of that population.
1.2 Some Terms in Sampling
A variable is a property which can vary and be measured, for example temperature.
An observation or variate is a particular measure.
Population is the complete set of counts or measurements derived from all objects possessing one or
more common characteristics. This can be infinite, as in the case of elevations in the field.
Sample - part of a population.
1.3 Avoiding Bias
An important question to ask yourself at the start of sampling is ‘What do I want my sampling to be representative
of ?’ An example of where this might be important is in studying the patterns of farming in a region. For
simplicity and clarity, let us assume that each farm only cultivates one crop. Selecting points on a map will tend
to choose the bigger farms because they occupy a larger area. On the other hand, selecting farms from a list
will tend to choose the smaller farms, because there are likely to be more of them within the same area.
Therefore the first method will give a representative sample of the land use, the second of the farms. What
can cause problems is using the first to find out about farms, or the second about land use.
1.4 Deciding on the Choice of Sampling Techniques
Before you start sampling, you need to consider whether a convenient sampling frame exists. An example
of a sampling frame may be a list of names on an electoral register or a membership directory of a particular
organisation. Even when sampling frames do exist, they are often incomplete or out of date. The integrity of
the data set will therefore influence your choice of sampling technique. However, it is often possible to construct
your own sampling framework, although this could be costly and time-consuming. For example, if investigating
the distribution of farm shops in West Sussex, you could use the farm shops listed in the yellow pages as a
provisional framework and then supplement this with fieldwork to check for any farm shops not listed in the
yellow pages. For an area it may be necessary to create a grid with x and y axes, so that the whole area under
investigation can be referred to using co-ordinates, like grid references. In this instance, you need to achieve
a balance between having too few cells to give precise or even usable results (remember that a co-ordinate
reference refers to an area rather than a point) and having so many that the sampling process becomes too
time-consuming. Such decisions must be made with specific reference to the particular investigation and the
time and resources at your disposal. Indeed, when designing a sampling strategy for a research project it is
important to ask yourself whether you can afford the time and money to carry out the sample collection.
When deciding on the sample technique, you also need to decide on the size of the sample. As a general
guideline, the larger the sample, the more confident we can be that the statistics derived from it will be similar
to the population parameters. However, a large sample with a poorly designed sampling frame may contain
less information than a smaller but more carefully designed sample.
1.4.1 Random Sampling
The word random in this context does not mean haphazard. It refers to a definite method of selection aimed
at eliminating bias as far as possible. A random sampling method should satisfy two important criteria: a)
every individual must have an equal chance of inclusion in the sample throughout the sampling procedure;
and b) the selection of a particular individual should not affect the chance of selection of any other individual. To
put these criteria in more formal probability terms: the probabilities of inclusion in the sample must be equal
and independent of each other. So, if the aim is to pick a random sample of 50 households from a population
of 200, every household should have the same 50/200 or 0.25 probability of selection. The simplest example
of pure random sampling is a raffle or lottery. Thus to take a random sample of the population of the UK, the
name of each resident would have to be written on a piece of paper, all the pieces of paper put in a giant drum
and a random selection made: obviously not a practical method. More usually it is numbers, not names, which
are used and, instead of picking these numbers out of a hat, a computer can be programmed to generate
random number sequences. Alternatively tables of random numbers can be used. Computers use the last
digits of their internal clock to ‘seed’ their random numbers (otherwise they would just keep repeating the same
sequence), and similarly when using random number tables it is worthwhile picking a starting point somewhere in the
table ‘at random’ and then sometimes reading upwards, or from right to left, rather than always from left to right.
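The procedure described above can be sketched in a few lines of Python (a hypothetical illustration: the numbered households and the sample size of 50 from a population of 200 follow the example in the text).

```python
import random

# Hypothetical population of 200 numbered households.
households = list(range(1, 201))

# Seed the generator so the run is reproducible; by default Python, like
# the computers described above, seeds from the system clock.
random.seed(42)

# Simple random sample of 50 households without replacement: every
# household has the same 50/200 = 0.25 chance of inclusion.
sample = random.sample(households, 50)

print(len(sample))       # 50
print(len(set(sample)))  # 50 -- no household is selected twice
```

Sampling without replacement keeps the probabilities equal across the whole procedure, matching criterion (a) above.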
1.4.2 Systematic Sampling
Systematic sampling is, as its name suggests, sampling according to a regular system. This involves choosing
the first item at random and then selecting every nth item where n will be determined by the size of the sample
required. For example, if a sample of 50 items is required from a population of 500 items, every 10th one
would be selected. Provided that there are no characteristics in the population which recur every 10th item,
the sample will be unbiased; indeed this may be thought of simply as a short cut (the population does not need
to be numbered) method of producing a random sample.
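As a sketch (the population of 500 and sample of 50 are the illustrative numbers used above), systematic selection amounts to a random start followed by a fixed step:

```python
import random

# A population of 500 numbered items; for a sample of 50 the sampling
# interval is k = 500 // 50 = 10.
population = list(range(1, 501))
k = len(population) // 50

random.seed(1)
start = random.randrange(k)    # random start within the first interval
sample = population[start::k]  # then every 10th item after that

print(len(sample))  # 50
```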
1.4.3 Stratified Sampling
It is possible, in some instances, to improve on simple random sampling by stratification of the population.
This is particularly true where the population is heterogeneous (i.e. made up of dissimilar groups) and the
population can be stratified into homogeneous (i.e. similar) classes. These classes should define mutually
exclusive categories.
For example suppose a bakery makes three different types of loaf: large, small and cottage. If a simple
random sample was taken of the daily output, it would be possible, although unlikely, for it to include only one
type of loaf. Stratification of the population before sampling can prevent this and, if carried out as described
below, can produce a sample which is truly representative of the population.
Assume that the bakery’s output is 50% large, 40% small and 10% cottage loaves. The different loaves divide
the population into three strata. Now if a sample of 50 loaves is required it should contain 25 large, 20 small
and 5 cottage thus ensuring that the proportions of each type of loaf in the population are reflected in the
sample. Within these constraints, however, selection should be made on a random basis.
1.4.4 Multi-Stage Sampling
Surveys covering the whole UK are frequently required but, as you can imagine, simple random sampling or
even stratified sampling will not give an easy solution. Where the population is very spread out, particularly
geographically, simple random sampling will result in a dispersed sample leading to a considerable amount of
travelling and time. Consequently some method is needed to narrow the field down to a smaller area,
with the resultant cost savings. Multi-stage sampling attempts to do this without adversely affecting the
‘randomness’ of the result.
The first step is to divide the population into manageable, convenient groups or areas, such as counties or
local authority regions. Indeed, stratification of areas such as counties or local authorities by principal
geographical regions is often introduced in order to minimise geographical bias (Clark et al, 1998, p. 84). A
number of areas are then selected at random. If the number of areas selected is still too large or dispersed,
then these areas can be broken down further to reduce the sample size to more manageable proportions. For
example, having chosen a random sample of local authorities, each one itself may be divided into political
wards or streets or households. Finally a simple random or systematic sample will be chosen.
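The staged narrowing described above can be sketched as follows (the nested frame of regions, local authorities and households is entirely hypothetical, as are the stage sample sizes):

```python
import random

random.seed(3)

# Hypothetical nested sampling frame: regions -> authorities -> households.
frame = {
    f"region-{r}": {
        f"authority-{r}-{a}": [f"household-{r}-{a}-{h}" for h in range(100)]
        for a in range(10)
    }
    for r in range(5)
}

# Stage 1: select 2 regions at random.
regions = random.sample(sorted(frame), 2)

sample = []
for region in regions:
    # Stage 2: within each chosen region, select 3 local authorities.
    authorities = random.sample(sorted(frame[region]), 3)
    for authority in authorities:
        # Stage 3: a simple random sample of 10 households in each.
        sample.extend(random.sample(frame[region][authority], 10))

print(len(sample))  # 2 regions x 3 authorities x 10 households = 60
```

Each stage restricts the geography, cutting travel, while selection remains random at every level.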
1.4.5 Cluster Sampling
Cluster sampling can often be confused with multi-stage sampling as the first step appears identical. The
important difference is that cluster sampling is used when the population has not been listed and it is the only
way to obtain a sample.
As an example, suppose that a survey is to be done on the proportion of elm trees attacked by Dutch elm
disease in the UK. Obviously there is no list of the complete population of elm trees. Neither would it be
possible to try and cover the whole population. To use cluster sampling in this case, the population could be
divided into small ‘clusters’ by drawing a grid over the map of the country and choosing, at random, a few of
these clusters for observation, each cluster being a small area. Within each area, the investigators will then be
asked to find as many elm trees as possible within that area and note how many of them are diseased.
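A sketch of the elm-tree example (the grid size, number of clusters and simulated tree counts are all invented; in a real survey the trees in a cell are unknown until the investigator visits it):

```python
import random

random.seed(11)

# Hypothetical 20 x 20 grid of map cells drawn over the country.
grid = [(x, y) for x in range(20) for y in range(20)]

# Choose a few clusters (cells) at random...
clusters = random.sample(grid, 5)

surveyed = 0
diseased = 0
for cell in clusters:
    # ...then record *every* elm tree found within each chosen cell.
    trees = random.randint(0, 30)  # trees the investigator finds here
    sick = sum(random.random() < 0.2 for _ in range(trees))
    surveyed += trees
    diseased += sick

print(surveyed, diseased)
```

The key difference from multi-stage sampling shows in the loop: within a selected cluster, all elements are observed rather than sampled.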
1.4.6 Non-Random Sampling
The previous paragraphs have been concerned with methods of random sampling, basically simple random
sampling with several variations and refinements. The methods discussed in the previous section share a
number of key elements. These include: a) the chances of obtaining an unrepresentative sample are small; b)
this chance decreases as the size of the sample increases; c) this chance can be calculated; and d) the
sampling error can be measured and therefore the results can be interpreted.
Unfortunately occasions often arise when the selection of a random sample is not feasible. This may be
because:
It would be too costly;
It would take too long; or
All the items in the population are not known.
For these reasons the following research methods of non-random sampling are used, particularly in the field
of market research.
1.4.6.1 Judgement Sampling
In this case an expert, or a team of experts, uses personal judgement to select what, in their opinion,
is a truly representative sample. It certainly cannot be called a random sample as it involves human
judgment which could involve bias. On the other hand, the sampling process does not require any
numbering of the population or random number tables. It can be done more quickly and economically
than random sampling and, if carried out sensibly, can produce very good results. For example, in an
interview situation, a researcher may pick individuals because of the nature of the response they are
likely to give, and the responses the researcher is looking for.
1.4.6.2 Quota Sampling
This is the method most often used in market research where the data is collected by enumerators
armed with questionnaires. To avoid the expense of having to ‘track down’ specific people chosen by
random sampling methods, the enumerators are given a quota of say 400 people, and are told to
interview all the people they can until their quota has been met. Such a quota is nearly always divided
up into different types of people with sub quotas for each type. For example, out of a quota of 400, the
enumerator may be told to interview 250 working wives, 100 non-working wives and 50 unmarried
women, and within each of these three classes to have 50% who smoke and 50% non-smokers.
Using this technique, the researcher has the choice of selecting certain people who might be included
in the sample, and can therefore introduce an element of bias into the sample.
The main advantage of this method is that, if a respondent refuses to answer the questions for any
reason, the interviewer will just look for another person in the same category. With true random
sampling, once a sample item has been decided upon, it must be used. Any substitution results in a
non-random sample.
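The quota mechanics above can be sketched in code (a hypothetical simulation using the 400-interview quota from the example, split 250/100/50 and then 50/50 by smoking status):

```python
import random

random.seed(5)

# Quota design from the example: sub-quotas, each split 50/50 by smoking.
quotas = {
    ("working wife", "smoker"): 125,
    ("working wife", "non-smoker"): 125,
    ("non-working wife", "smoker"): 50,
    ("non-working wife", "non-smoker"): 50,
    ("unmarried woman", "smoker"): 25,
    ("unmarried woman", "non-smoker"): 25,
}
interviewed = {key: 0 for key in quotas}

def try_interview(category, smoking):
    """Interview a passer-by only if their quota cell is still open."""
    key = (category, smoking)
    if key in interviewed and interviewed[key] < quotas[key]:
        interviewed[key] += 1
        return True
    return False  # quota full: the enumerator looks for someone else

# Simulate passers-by until every quota cell is filled.
categories = ["working wife", "non-working wife", "unmarried woman"]
while sum(interviewed.values()) < 400:
    try_interview(random.choice(categories),
                  random.choice(["smoker", "non-smoker"]))

print(sum(interviewed.values()))  # 400
```

Note how refusals cost nothing here: a rejected passer-by is simply replaced by the next one in the same category, which is exactly what makes the resulting sample non-random.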
1.4.6.3 Convenience Sampling
As the name implies, the most important factor here is the ease of selecting the sample. No effort is
made to introduce any element of randomness. An example of this is the quality controller who takes
the first 20 items off the production line as his sample, a dangerous procedure as any fault occurring
after this could remain unnoticed until the next sample is taken (maybe an hour later).
For most purposes, this sampling method is simply not good enough but for some pilot surveys the
savings in cost, time and effort outweigh the disadvantages. The aim of a pilot survey could be to
establish the most satisfactory form of questionnaire to be used in the actual survey. Since the actual
results would not be used it does not matter that the sample was not selected at random.
1.5 Summary
Sampling serves two purposes. One is the saving of time and effort in the collection of information. The
second is the collection of information so that inferences and comparisons can be drawn using statistics.
Although a simple subject, it is fundamental to much research, and needs to be done with care. Table 1
provides a summary of the key sampling methods that have been discussed.
Table 1: Sampling Methods

Judgemental (representative)
Description: Sampling elements are selected based on the interviewer’s experience that they are likely to produce the required results.
Example: Several houses for sale in Belfast, perhaps with families known to the interviewer, are chosen subjectively.

Quota (representative)
Description: Sampling elements are selected subject to a predefined quota control.
Example: The quota is the first 30 homeowners selling their houses in Belfast who are also making an intra-urban move, and are aged between 20-40 years.

Systematic (probability/random; first unit selected at random)
Description: Sampling elements in the sampling frame are numbered. The first sampling unit is selected using random number tables. All other units are selected systematically, k units away from the previous unit.
Example: Sampling frame of 600 homeowners selling their houses in Belfast. These houses for sale are ordered and numbered. A random number is selected for a start point, from which every tenth property is selected for inclusion in the sample.

Simple random (probability/random)
Description: A sample of n elements is selected from a sampling frame without replacement, such that every possible member of the population has an equal chance of being selected.
Example: All 600 houses for sale in the sampling frame are numbered 1-600. A sample of 30 units is selected using a random number table, excluding those numbers outside the range 1-600.

Stratified random (probability/random)
Description: Sampling frame divided into sub-groups (strata) which are then each sampled using the simple random method.
Example: All 600 houses for sale come from lists provided by six estate agents. These are each randomly sampled for houses to include in the sample.

Multi-stage random (probability/random)
Description: Sampling frame divided into hierarchical levels (stages). Each level is sampled using a simple random method which selects the elements to be included at the next level.
Example: All 600 houses for sale are distributed between enumeration districts within several wards. A random sample of these wards is selected, and from these, random samples of both enumeration districts and finally houses for sale are selected.

Clustered random (probability/random)
Description: Sampling frame divided into hierarchical levels (stages). Levels are selected using random sampling similar to the multi-stage random method. However, all elements are selected at the final stage.
Example: Similar to the above method, except that all the houses for sale in a given enumeration district are selected.

[Source: Kitchin, R. and Tate, N. (2000): Conducting Research into Human Geography, Prentice Hall, London, p. 55.]
1.6 Types of Data
Normally when we think of data quality, we think about reliability or accuracy. In statistics, data have quality in
terms of what they represent and how they can be manipulated. The four levels of measurement are:
nominal/categorical, ordinal, interval and ratio. Each measurement is outlined below:
An ordinal variable can be ranked in order from highest to lowest, for example a league table. Alternatively,
a questionnaire survey may ask respondents to rank satisfaction levels on a scale from ‘Strongly Agree’
to ‘Strongly Disagree’. Ordinal variables do not allow comparable measurements, for example ‘Strongly
Agree’ is not worth double ‘Slightly Agree’.
Interval and Ratio variables are concerned with quantitative data. Interval variables are in the form of a
scale which possesses a fixed but arbitrary interval and arbitrary origin. Addition or multiplication by a
constant will not alter the interval nature of the observations (e.g. 10°C, 20°C, 30°C, 40°C). For a ratio
measurement, this number is in relation to a scale of an arbitrary interval, similar to interval data, but with
a true zero origin. In these cases, where we are using numbers as we normally think of them, one value
can be twice the size of another. For example, income is a ratio variable as a person can have no income.
Ratio measurement commonly applies to metric quantities such as distance and mass, which possess a
zero origin. [When importing data into SPSS, and using the Variable window, Interval and Ratio data are
classed as Scale - see Descriptive section in this handbook].
Categorical or nominal variables are the lowest level and are variables where numerical values have
been assigned to separate categories, often viewed as unique from one another. For example, gender
(male/female), hair colour (blonde, brown, ginger, grey), or direction (north, east, south, west).
It is important to remember that data can only be converted from higher to lower quality, and data can only be
treated ‘at their own level’. For instance, the numbers ‘1,2,3,4’ could be heights in meters (ratio), temperatures
in degrees C (interval), the order of countries achieving Rostow’s ‘take off’ (ordinal) or the answer to ‘what is
your favourite number?’ (nominal): they must not be treated at a higher level than their meaning. As Mulberg
(2002) points out, ‘the thing to ask is if it makes sense to talk about one case being double another, or if there
is a highest and a lowest’ (see Figure 1). It is also important to understand the different types of data or
variables, as this will influence the kind of statistical analysis that is possible. The levels of measurement are
summarised in Table 2.
In order to use parametric and non-parametric tests successfully later in the module, it is
imperative that you understand the characteristics and differences between types of data. Please
read through these notes carefully, and learn the different data types.
Figure 1: Judging Levels of Measurement
[Source: Mulberg, 2002, p. 8]
Additional terms that you will encounter include:
A discrete variable is a variable whose numerical value varies in steps, taking only integer (whole-number)
values. Normally such variables are associated with counts; for example, you may count the number of
firms, products or employees when conducting a survey. Discrete variables do not allow for decimal
places.
A continuous variable is a variable which assumes a value that can be denoted on a continuous scale.
Examples include weights, heights and age. In reality, continuous variables relate to specific values that
lie at a point on a continuum. For example a person’s age could be recorded in discrete form as being so
many years, but in reality their age can be placed at a point on a continuum which reflects not only the
numbers of years but also the number of days, minutes and seconds which have passed since the moment
of their birth (Clark et al, 1998). Continuous variables allow for decimal places. Continuous variables can
sometimes be described as demonstrating certain statistical properties that allow them to be used in
parametric statistical tests. However, some continuous variables do not show these particular
properties, and when this happens, the variables are thought suitable to be used in non-parametric tests
(Mulberg, 2002).
Variables can also be classed as ‘dependent’ or ‘independent’. A dependent variable refers to a variable
which is identified as having a relationship with, or dependence on, the value of one or more independent
variables. For example, levels of car ownership may be directly dependent on a number of independent
variables including average household income, age and the number of persons in the household.
The decision sequence in Figure 1 runs as follows:
Start: does it make sense to talk about one number being higher or lower than another?
  No: Nominal Level.
  Yes: does it make sense to talk about one number being double another?
    No: Ordinal Level.
    Yes: Ratio Level.
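Mulberg’s two questions can be expressed as a short helper function (a sketch: the function and argument names are invented, and, like the figure itself, it skips the interval level):

```python
def level_of_measurement(is_ordered: bool, can_double: bool) -> str:
    """Mulberg's (2002) two questions, as a decision function.

    is_ordered: does it make sense to say one value is higher or lower?
    can_double: does it make sense to say one value is double another?
    """
    if not is_ordered:
        return "nominal"
    if can_double:
        return "ratio"
    return "ordinal"

print(level_of_measurement(False, False))  # nominal (e.g. hair colour)
print(level_of_measurement(True, False))   # ordinal (e.g. league position)
print(level_of_measurement(True, True))    # ratio   (e.g. income)
```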
Table 2: Data Quality
When attempting to remember types of data use the abbreviation NOIR (nominal, ordinal, interval, ratio).
When using variables in statistical analysis, a further distinction is also drawn between descriptive and
inferential statistics. Descriptive statistics refer to the sample that is created by the research/study
process and literally refers to the methods and techniques used to describe and summarise data.
Measures of central tendency (mode, median, mean) are the most basic descriptive statistics to which
we can also add basic measures of dispersion including the maximum, minimum and range of values.
Inferential statistics refer to those techniques which are adopted to draw conclusions about the
population to which the sample belongs and which enable inferences about the characteristics that
might be expected in other samples as yet to be selected from that same population. Inferential statistics
give greater analytical power and bring into play probability theory and other statistical tests and measures
that will be discussed later in this handbook.
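As an illustration of these basic descriptive statistics, Python’s standard statistics module can compute the measures of central tendency and the range mentioned above (the data values are invented):

```python
import statistics

# A small hypothetical sample of ages.
ages = [21, 23, 23, 25, 28, 31, 34]

mode = statistics.mode(ages)      # most frequent value
median = statistics.median(ages)  # middle value when ranked
mean = statistics.mean(ages)      # arithmetic average
value_range = max(ages) - min(ages)

print(mode, median, round(mean, 1), value_range)  # 23 25 26.4 13
```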
Nominal or Categorical
Description: Data assigned to discrete categories, in no natural order.
Examples: Clay, sandstone, granite; lifestyle groups such as singles or retired.

Ordinal
Description: The categories associated with a variable can be rank-ordered. Objects can be ordered in terms of a criterion from highest to lowest.
Examples: Cities in order of population size; opinions regarding service or product quality.

Interval
Description: With ‘true’ interval variables, categories associated with a variable can be rank-ordered, as with an ordinal variable, but the distances between categories are equal. Categories have no absolute zero point. Variables which strictly speaking are ordinal but which have a large number of categories, such as multiple-item questionnaire measures, are assumed to have similar properties to ‘true’ interval variables.
Examples: Temperature in degrees Celsius or Fahrenheit; goal difference.

Ratio
Description: Data with meaningful intervals and a true zero.
Examples: Age, distance.
However, as Lindsay (1997) points out the use of inferential statistics carries greater responsibility and as
such any user must be aware of the following guidelines:
Sampling must be independent. This means that the data generation method should give every
observation in the population an equal chance of selection, and the choice of any one case should not
affect the selection of value of any other case;
The statistical test chosen should be fit for its purpose and appropriate for the type of data selected;
The user must interpret the results of the exercise properly. The numerical outcome of a statistical test
is only the result of a calculation; it still has to be interpreted in the context of the research.
1.7 Presenting Data
Presenting numerical data accurately is an important element of essays, reports, presentations and posters.
The aim of the following section is to provide a few basic guidelines on how to incorporate graphs and tables
effectively, and at the same time creatively, into your work.
1.7.1 Using Graphs and Charts
Computer spreadsheets such as Excel, now allow you to produce a range of graphs and charts (bar charts,
column charts, pie charts, graphs) quickly and easily. As such, graphs can be used effectively to enhance the
quality of reports, essays, posters and presentations. Carefully thought-out graphs can bring to life data from
tables and allow comparisons to be made quickly. However, poorly designed graphs can easily fail and
weaken a piece of work. It is very common for students to rush in and produce a whole plethora of charts and
graphs without giving much thought to the data set they are using or what type of output would be most
appropriate. Therefore it is important to take your time and give careful consideration to what you actually
want to achieve.
First, ask yourself the following questions:
Is a graph or chart necessary?
Students often use diagrams as a means of ‘padding out’ work and as a result graphs not referred to
in the text become ‘window-dressing’. Therefore carefully consider whether the graph is actually
needed - ask yourself whether the graph helps the reader understand a particular point or aspect of the
data. If it does, fine - but make sure that it is integrated and referred to fully in your discussion. If not,
provide a simple verbal description.
What is the purpose/objective/outcome?
Are you producing a graph for an essay/report, poster or presentation? While the basic guidelines and
formatting options are generic, you need to consider the overall purpose and intended audience. For
example, graphs produced for a PowerPoint presentation will be different to those produced for inclusion in
an essay or report. Carefully consider the importance of visual impact and clarity,
and the type of media you are using.
What is the nature of the data set you are using?
Graphs often fail because an incorrect chart type has been used or the graph is too complicated.
Therefore before you start carefully consider the actual nature of the data set you are using. Above all
you need to distinguish between ‘continuous’ data and ‘discrete’ quantities. A continuous quantity is
that which can be chosen to any degree
of precision. Examples of continuous quantities include mass (kg), length (m), and time (s). Discrete
quantities, in contrast, can only be expressed as integers (whole numbers), for example: 3 computers, 5 cars,
4 houses. In trying to decide if something is continuous or discrete, decide whether it is like a stream (continuous)
or like people (discrete). Continuous variables are usually plotted on a line graph as this demonstrates the existence
of a causal relationship between the data points, whereas discrete data series are plotted as bar charts or
histograms.
In addition to the nature of the data set, also consider whether you are referring to absolute values or percentage
distributions. This will have a significant influence on the chart type that you use. Second, how complicated
is the data set?; is it best represented as a graph or a table?; can the data be manipulated to make it easier to
use, for example by reformatting columns or excluding columns? Be prepared to modify the data set if
necessary. However, make sure that when you do this you do not alter the accuracy or the representativeness
of the data set you are using.
The following graphs highlight the issue of using appropriate chart types.
Figure 2: Car Sales for Rover, BMW, and Jaguar 1995-2000
[Source: Believe, M., 2001]
In Figure 2, car sales for leading manufacturers have been plotted for a 5-year time period. In this instance we
are dealing with discrete data (as you cannot sell half a car!). However, the data has been plotted as a line
graph - is this correct? The answer is YES, as there is a logical year-to-year link and the 'joining the dots'
technique illustrates the causal relationship between the x-axis variables. This data could have also been
presented as a column chart. Compare this to Figure 3.
© Dr Andrew Clegg p. 1-13
Data Analysis for Research Presenting Data
Figure 3: Resident Opinions to the Development of New Housing in Greenfield Sites in West Sussex
[Source: Believe, M., 2001]
Figure 3 highlights the attitudes of residents to new housing development in West Sussex. Is this graph the
most effective form of presentation? The answer is NO. In this instance joining the dots is not appropriate as
there is no causal relationship between the x-axis variables; a column chart would have been
more effective - see Figure 4.
Figure 4: Resident Opinions to the Development of New Housing in Greenfield Sites in West Sussex
[Source: Believe, M., 2001]
While Figure 4 is a definite improvement, is there any way of making the data in Figure 4 more effective so that
it really highlights the differences in resident opinions between the different areas? Again the answer is YES.
So far we have graphed the absolute values relating to resident opinions. If we were to change this to a
percentage distribution we could present the data as a bar chart - see Figure 5.
Figure 5: Resident Opinions to the Development of New Housing in Greenfield Sites in West Sussex
[Source: Believe, M., 2001]
As you can see in Figure 5, utilising the percentage distribution really succeeds in highlighting the differences
in residents' opinions.
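The switch from absolute values to a percentage distribution, as used for Figure 5, is simple arithmetic. A minimal Python sketch (the opinion counts here are invented for illustration, not the survey data behind the figure):

```python
# Hypothetical opinion counts per area (illustrative only).
opinions = {
    "Area A": {"For": 12, "Against": 48, "Undecided": 20},
    "Area B": {"For": 35, "Against": 10, "Undecided": 5},
}

def to_percentages(counts):
    """Convert absolute counts to a percentage distribution summing to 100."""
    total = sum(counts.values())
    return {k: round(100 * v / total, 1) for k, v in counts.items()}

for area, counts in opinions.items():
    print(area, to_percentages(counts))
```

Plotting the percentage rows rather than the raw counts puts every area on the same 0-100 scale, which is what makes the differences in opinion stand out.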
Let us consider a further example. Figure 6 illustrates the mean monthly temperature and rainfall totals for
Edinburgh. Is the graph appropriate? Again the answer is YES, as there is a logical month-to-month link and the
'joining the dots' technique illustrates the causal relationship between the x-axis variables. However, although
this graph allows us to compare monthly temperature and rainfall totals, the high values for temperature have
masked the values for rainfall and a degree of accuracy has been lost. To overcome this we can change the
type of the graph and plot temperature and rainfall on separate axes - see Figure 7.
Figure 6: Mean Monthly Temperature (°C) and Rainfall (mm) for Edinburgh
[Source: Bartholomew, 1987]
Figure 7: Mean Monthly Temperature (°C) and Rainfall (mm) for Edinburgh
[Source: Bartholomew, 1987]
So far our discussion has concentrated on the use of line graphs, column and bar charts. Another type of chart
frequently used is the pie chart. The overall total number of cases represented by the pie chart should equal
the sample size, or aggregate to 100% where segments denote proportional frequencies (Riley et al, 1998, p.
172). Let us consider some specific examples.
Figure 8: The Distribution of Serviced Establishments in Torbay by Size
[Source: Clegg, 1997]
Figure 8 refers to the percentage distribution of serviced establishments in Torbay by size. When using pie
charts it is important to remember that pie charts can only graph the percentage distribution of one specific
variable and cannot be used to analyse time series data. For example, we could not use a pie chart to illustrate
the car sales for Rover, BMW and Jaguar referred to in Figure 2. However, we could use a pie chart to analyse
the market share of car sales for a specific year (see Figure 9).
Figure 9: Market Share of Car Sales for Rover, BMW and Jaguar in 1995
[Source: Believe, M., 2001]
[Figure 9 segment labels: Rover 41%, BMW 27%, Jaguar 32%]
By drawing and then combining two or more pie charts we could then compare market share for different
years (see Figure 10).
Figure 10: Market Share of Car Sales for Rover, BMW and Jaguar in 1995 and 1999
[Source: Believe, M., 2001]
Programs such as Excel will only allow you to draw one pie chart at a time - however, once drawn, you can
arrange a number of pie charts on a worksheet and print them out. Alternatively, you can cut and paste Excel
charts into Word or Publisher.
Clearly, using the most appropriate type of graph is very important to ensure that the data is presented
accurately. In addition to the type of chart it is also important to ensure that the graph is presented effectively.
[Figure 10 panel labels: 1995 and 1999; segments Rover 27%, BMW 32%, Jaguar 41%]
1.7.2 Producing Graphs
When producing graphs a number of basic rules and guidelines need to be considered. These are:
Is the graph completely self-explanatory?
Is the graph clearly titled, labelled and sourced?
The axes should be labelled, and clear indication given as to the scales being used, and the
numerical quantities being referred to;
All dates and times periods should be explicitly stated in the title, and on the appropriate axis;
In titles do not write ‘A Graph Showing....’. This is obvious - instead refer to the specific content of
the graph (see examples given in this section);
The source of the data should be included, especially if they are drawn from published material.
Are elements of the graph distinguishable?
When using charts it is important that the different data series are clearly distinguishable, otherwise
the graph will be meaningless;
Consider carefully the number of data series you intend to graph. Too much data will overcomplicate
a graph and reduce its impact;
When using pie charts it is recommended that the number of segments should not be too large.
Too many segments make charts confusing and difficult to read;
If charts are to be included in a black and white report, avoid shadings that involve colours as the
distinctions will be lost. Try to keep the use of colours to a minimum: use one colour and
different shades;
Ensure that each segment of the pie chart is clearly labelled and that the percentage values have
been added to indicate quickly which are the principal groups and by how much;
Avoid repetition; if labels and percentage values have been added to a pie chart there is no need
to include the legend.
1.7.3 2D or 3D Graph Formats
Excel and similar packages allow you to enhance the quality of graphs by making them 3D. However, the
use of 3D formatting needs to be treated with caution. If you are producing graphs on A4 for a presentation 3D
charts can work effectively. However, if you are preparing graphs for inclusion in an essay or report 3D charts
may not be appropriate and you may be better off with a standard 2D version. There are no hard and fast rules
on this issue and, ultimately, the type of chart produced and the type of formatting applied will depend on the
nature of the data set used.
Let me illustrate this by referring to examples included in this section. Below is Figure 4, showing resident
attitudes to housing development in West Sussex. At the moment this is a standard 2D column chart. Let us
convert it into a 3D chart.
2D
3D
Do you think this chart is effective? It looks good but is not quite as easy to read as the standard chart. It is noticeable that in order to create a 3D chart Excel has to shrink the original chart. This is where problems lie, as in making the graph smaller the overall impact of the graph is diminished.

Let us try another example. Below is Figure 8, which refers to the distribution of serviced accommodation in Torbay. As before, let us convert this into a 3D chart.

In this instance the 3D chart is actually quite effective and has enhanced the standard 2D chart considerably. The basic rule seems to be that simple 2D charts can be converted into 3D charts quite effectively. However, the more detailed and complicated the standard chart, the less effective it becomes when you make it 3D. Your best option is to experiment with different data sets and formatting options to find the most effective form of presentation.
2D
3D
1.7.4 Using Tables
In addition to charts, tables are also an effective way of presenting information. Again when producing tables
a number of guidelines can be followed:
Consider the purpose of presenting the data as a table as there may be better ways of presenting it;
Avoid the temptation of just photocopying tables out of textbooks and sticking them into essays. In many
cases, tables contain information superfluous to the reader. Be prepared to modify data sets so
that only relevant information is included in your table;
Make sure that tables are completely self-explanatory. Provide a table number and title for each table.
If abbreviations are used when labelling then provide a key;
Make sure that the content of the table is fully referred to in the text - make sure that tables are not
basically ‘window-dressing’;
Allow sufficient space when designing the table for all figures to be clearly written;
Make sure that the table/data is fully sourced.
Again let me illustrate with a number of examples.
Table 2 is an example of a table I created for the Arun Tourism Strategy document. Does the table meet the
guidelines highlighted above? The answer is YES. The table is clear, well laid out, titled, sourced and self-
explanatory. Shading has also been used to try and enhance the visual impact of the table.
Table 2: Visits Abroad by UK Residents 1994-1997

                        Number of Visits ('000) by Area of Destination
Year          Total     North America   Western Europe   Rest of World
1994          39,630    2,927           32,375           4,328
1995          41,345    3,120           33,821           4,404
1996          42,050    3,584           33,566           4,900
1997          45,957    3,594           37,060           5,303
% Change
1996/1997     +9        0               +10              +8

[Source: ETB, 1999]
Now consider Table 3 which refers to regional tourism spending in England in 1997. Again this is a clear table
that for the purposes of the tourism strategy had to contain a lot of detail. If you were using this table to
illustrate patterns of regional spending it could be simplified to show the most obvious or important patterns.
For example, in Table 3 it is evident that tourism spending is highest in the West Country and lowest in
Northumbria.
The table could therefore be easily modified to really reinforce this message (see Table 4). Notice that in the
amended Table 4, I have also changed the title so that the content of the new table becomes self-explanatory
and reflects the actual purpose of the table. Table 3 could have also been modified by removing specific
columns thereby emphasising the patterns of spending in particular market areas.
Table 3: The Regional Distribution of Tourism Spending in England, 1997

                      All       Holidays   Short        Long         Business   VFR
                      Tourism              Holidays     Holidays     and Work
                                           (1-3 nights) (4+ nights)
Destination           £11,665   £7,725     £2,505       £5,215       £2,055     £1,415
(England)             %         %          %            %            %          %
Cumbria               3         5          5            5            1          1
Northumbria           3         3          3            3            3          5
North West England    9         8          11           6            12         10
Yorkshire             8         8          7            8            9          10
Heart of England      11        9          14           7            15         16
East of England       13        14         11           15           14         12
London                9         6          13           2            15         17
West Country          24        30         17           37           10         10
Southern              11        10         10           11           3          9
South East England    9         8          9            7            10         12

[Source: ETB, 1998]
Table 4: Selected Regional Differentials in the Distribution of Tourism Spending in England, 1997

                      All       Holidays   Short        Long         Business   VFR
                      Tourism              Holidays     Holidays     and Work
                                           (1-3 nights) (4+ nights)
Destination           £11,665   £7,725     £2,505       £5,215       £2,055     £1,415
(England)             %         %          %            %            %          %
Northumbria           3         3          3            3            3          5
East of England       13        14         11           15           14         12
West Country          24        30         17           37           10         10
South East England    9         8          9            7            10         12

[Source: ETB, 1998]
Descriptive Statistics
Section 2
Learning Outcomes
At the end of this session, you should be able to:
Produce descriptive statistics including the mean, median and mode

Understand the features of measures of central tendency

Apply appropriate descriptive statistics to different data types

Import data into SPSS and use SPSS to produce descriptive statistics and cross-tabulations

Use SPSS to graphically describe data through the use of frequency histograms, stem and leaf plots and box plots
p. 25
Geographical Techniques 2 Descriptive Statistics
© Dr Andrew Clegg p. 2-25
Data Analysis for Research Descriptive Statistics
2.0 Introduction
The first part of the data analysis process is the production of basic descriptive statistics, such as the
mean, median, mode, standard deviation, standard error, and basic frequency and contingency tables.
The analysis of the descriptive statistics can then be used to ascertain the nature of the data, especially
in relation to its distribution, and what types of statistical tests can be used to analyse the data further.
2.1 Measures of Central Tendency
Averages, or measures of central tendency, give a simple summary of the characteristics of the data
being described. How the data is described depends upon its quality. The three measures used are the
mean, median and mode (see Table 2.1).
Table 2.1: Measures of Central Tendency

Name     Data Type           Description                Example
Mean     Ratio or interval   Total/Number of samples    'The mean July maximum in Bognor is 21°C'
Median   Ordinal             Middle in rank order       'Half of the customers travel more than 6km to Tescos'
Mode     Nominal             Most common category       'Most visitors are from London'

2.2 Arithmetic Mean

This is the figure that most people would produce if they were asked to give the average of a set of figures.
The mean is the most commonly used of all averages and is calculated by adding together all the values
in a series and dividing the total by the number of items in the series. The computation formula is:

    x̄ = (Σ xi) / n,   where i = 1, 2, ..., n

The symbols may be explained as follows:

x̄, pronounced 'x-bar', denotes the arithmetic mean of a sample;

Σ, pronounced 'sigma', means 'the sum of';
xi means all values of x where x1, x2, x3...xn represent the values of each observation in a data set.
Thus i assumes, in turn, the values of 1,2,3 and so on and;
n is the total number of observations in the data set.
Therefore, for the following data series:

8, 2, 4, 7, 3, 4, 1, 2, 2, 1

the arithmetic mean is calculated as:

x̄ = (8 + 2 + 4 + 7 + 3 + 4 + 1 + 2 + 2 + 1) / 10 = 34 / 10 = 3.4
2.2.1 Features of the Mean
When using the mean, you should consider the following points:
The mean is easy to understand and calculate and is the most commonly used of all averages;
It makes use of every value in the distribution, leading to a mathematical exactness which is useful
for further mathematical processing;
It can be determined if only the total value of the items and the number of items are known, without
knowing individual values;
It can be distorted by extreme values in the distribution;
For a discrete distribution, the mean may be an ‘impossible’ figure e.g. 17.5 cigarettes per day when
all values in the distribution are whole numbers.
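The hand calculation above can be cross-checked with a few lines of Python (a sketch for checking arithmetic, not part of the module's SPSS material):

```python
def arithmetic_mean(values):
    """x-bar: sum all x_i and divide by n, the number of observations."""
    return sum(values) / len(values)

scores = [8, 2, 4, 7, 3, 4, 1, 2, 2, 1]
print(arithmetic_mean(scores))  # 3.4
```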
2.3 The Median
There are however certain occasions when it is either not possible or not practical to use the
arithmetic mean, particularly if the values of some of the extreme items are difficult to determine or
if it is possible only to arrange the items in order without assigning numerical values to them. In
such cases the representative or average figure may be taken as the middle item when the series
is arranged in ascending or descending order.
The statistical term for this middle item in a set of data is the median. The median is a position
average or the value of the middle item of a series. For example, the median of the series
1,2,2,4,7,7,10 is the value 4 since it is the middle item. For a series with an even number of items
(e.g. 1,2,3,4), there is no middle item and yet a median may still be required. In this case the
median is conventionally taken as the arithmetic mean of the two central items, in this case, a value
of 2.5.
Therefore, to re-emphasise:
Example 1: A series with an uneven number of items
The data series in rank order is:
Example 2: A series with an even number of items
The data series in rank order is:
1 2 2 4 7 7 10
The median is the middle item which in
this case is 4.
1 2 2 4 7 7 10 11

The median is the arithmetic mean of the two central items: (4 + 7) / 2 = 5.5
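Both conventions - the middle item for an odd number of items, and the mean of the two central items for an even number - can be sketched in Python (an illustrative helper, not from the handbook):

```python
def median(values):
    """Middle item of the ranked series; for an even number of items,
    the arithmetic mean of the two central items."""
    ranked = sorted(values)
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:
        return ranked[mid]
    return (ranked[mid - 1] + ranked[mid]) / 2

print(median([1, 2, 2, 4, 7, 7, 10]))      # 4
print(median([1, 2, 2, 4, 7, 7, 10, 11]))  # 5.5
```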
2.3.1 The Median of a Grouped Distribution
Strictly speaking, it should be impossible to find the median of a grouped distribution as detailed information
is lost when data is gathered into classes. However, as with the arithmetic mean, several assumptions
are made and an answer is produced. There is also a convention to say which is the median item in a
grouped frequency distribution with either an odd or an even number of items.
If a frequency distribution contains a total of n items then the median item will be:

a) the (n + 1)/2 th item if n is odd

b) the n/2 th item if n is even

For a distribution of 401 items the median will thus be the (401 + 1)/2 = 201st item.

For a distribution of 400 items the median will be the 400/2 = 200th item.
To find the median within a grouped data set it is first necessary to construct a table showing the cumulative
frequencies. The data on the following pages highlights the annual rainfall in Kano, a popular tourist
destination in Nigeria, and it should be clear that Table 2.2 has been produced by dividing the annual
rainfall totals into ranked categories (400-499mm etc) and then counting the number of years that fall
into each of these categories. These are then added up to produce the cumulative frequency, which can
be expressed as a percentage for easier interpretation (see Figure 2.1).
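The construction of the cumulative table can be sketched in Python. The rainfall list below is deliberately truncated (the full 68 values appear in Table 2.2), so the printed figures illustrate the method rather than reproduce Table 2.3:

```python
# Stand-in sample of annual rainfall totals (mm); Table 2.2 has all 68 values.
rainfall = [930, 970, 650, 890, 1230, 850, 750, 950]

def cumulative_table(values, low=400, high=1300, width=100):
    """Bin values into classes and accumulate frequencies, as in Table 2.3."""
    rows, cumulative, n = [], 0, len(values)
    for lower in range(low, high, width):
        freq = sum(lower <= v < lower + width for v in values)
        cumulative += freq
        rows.append((lower, freq, cumulative, round(100 * cumulative / n, 1)))
    return rows

for lower, freq, cum, pct in cumulative_table(rainfall):
    print(f"{lower}-{lower + 99}mm: f={freq}, cum={cum}, cum%={pct}")
```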
Table 2.2: Rainfall for Kano, Nigeria from 1907 to 1974
Year Rainfall Year Rainfall Year Rainfall Year Rainfall
1907 930 1924 820 1941 740 1958 1070
1908 970 1925 1100 1942 1110 1959 1010
1909 650 1926 540 1943 810 1960 830
1910 890 1927 780 1944 840 1961 1020
1911 1230 1928 850 1945 620 1962 760
1912 850 1929 900 1946 790 1963 780
1913 750 1930 700 1947 480 1964 1140
1914 950 1931 770 1948 990 1965 700
1915 680 1932 890 1949 1060 1966 750
1916 1010 1933 830 1950 800 1967 900
1917 740 1934 1000 1951 700 1968 780
1918 480 1935 1180 1952 580 1969 970
1919 690 1936 1010 1953 920 1970 960
1920 820 1937 850 1954 810 1971 710
1921 990 1938 830 1955 1040 1972 660
1922 860 1939 940 1956 710 1973 410
1923 1040 1940 980 1957 1110 1974 560
Table 2.3: Cumulative Frequency of Annual Rainfall for Kano, Nigeria
Annual Rainfall in mm. Frequency Cumulative Frequency Cumulative % Frequency
400-499 3 3 4.4
500-599 3 6 8.8
600-699 5 11 16.2
700-799 15 26 38.2
800-899 15 41 60.3
900-999 12 53 77.9
1000-1099 9 62 91.2
1100-1199 5 67 98.5
1200-1299 1 68 100
Figure 2.1: Cumulative Frequency Curve for Kano, Nigeria
By reading off at 50% on the y axis (Cumulative % Frequency) to the line, and then down to the x axis
the median is calculated at about 850mm. The median is, in fact, what is quite often meant by ‘the
average’ in everyday conversation, in that half of the years tend to have more rainfall than this, and half
less.
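The value read off the curve can be checked numerically by linear interpolation within the median class. A Python sketch using the class frequencies from Table 2.3 (assuming a uniform class width of 100mm):

```python
# (class lower bound, frequency) pairs from Table 2.3
classes = [(400, 3), (500, 3), (600, 5), (700, 15), (800, 15),
           (900, 12), (1000, 9), (1100, 5), (1200, 1)]

def grouped_median(classes, width=100):
    """Locate the n/2-th item and interpolate within its class."""
    n = sum(f for _, f in classes)
    target = n / 2
    cumulative = 0
    for lower, f in classes:
        if cumulative + f >= target:
            return lower + (target - cumulative) / f * width
        cumulative += f

print(grouped_median(classes))  # about 853mm - close to the 850mm read-off
```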
2.3.2 Features of the Median
When using the median, you should consider the following points:
Half the items in the series will have a value greater than or equal to the median and
half less than or equal to the median. It is therefore a measure of rank or position;
It is easy to understand;
It is unaffected by the presence of extreme items in the distribution;
If found directly (from ungrouped data) it will be the same as an actual item in the distribution;
It may be found when the values of all the items are not known, provided that values of middle items
and the total number of items are known;
Ranking the items can be tedious;
The median cannot be used for further mathematical processing;
It may not be representative if there are few items.
2.4 The Mode
In an ungrouped, discrete distribution the mode is the value which occurs most often; that is, the value
with the highest frequency. The mode of the series 1,2,2,3,4 is the value of 2. Unlike the mean and the
median, it is not necessarily unique. For example the series 1,2,2,3,4,4 has two modes: 2 and 4.
In a continuous frequency distribution it is possible that no two values will be the same. In this sort of
situation the mode is defined as the point where there is the greatest clustering of values, or maximum
frequency density.
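For discrete data the mode (or modes, since it need not be unique) can be sketched in Python (illustrative, not from the handbook):

```python
from collections import Counter

def modes(values):
    """Return all values sharing the highest frequency of occurrence."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

print(modes([1, 2, 2, 3, 4]))     # [2]
print(modes([1, 2, 2, 3, 4, 4]))  # [2, 4] - two modes
```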
2.4.1 Mode for Grouped Data
To find the mode within a grouped data set it is first necessary to construct a histogram showing the
frequency distribution (see Figure 2.2). Having constructed the graph, first identify the modal class (the
class with the greatest frequency or frequency density). To calculate the actual value of the mode, draw
a line from the top right-hand corner of the modal rectangle to the point where the top of the adjacent
rectangle on the left meets it. Now draw a similar line from the top left-hand corner to the point where the
adjacent rectangle on the right meets it. Now draw a perpendicular from the point at which these lines
cross to the horizontal axis. This point gives the value of the mode.
Figure 2.2: The Calculation of the Mode from a Frequency Histogram
While this technique will give the specific value of the mode, it is often more useful and meaningful to
simply indicate the boundaries of the modal class. In other words, rather than attempting to calculate an
accurate value for the mode, which may not be entirely accurate or representative, it would be more
meaningful to say that more people, for example, fell within the 30 and under 40 age group than any
other group described by Figure 2.2.
[Histogram axes: Age (x-axis, 10-100); Frequency (y-axis, 0-70). The construction gives Mode = 34]
2.4.2 Features of the Mode
When using the mode, you should consider the following points:
For discrete data it is an actual single value;
For continuous data it is the point of highest frequency density;
It is easy to understand;
Extreme items do not affect its value;
It can be estimated from incomplete data;
It cannot be used for further mathematical processing;
It may not be unique or clearly defined;
It requires arrangement of the data which may be time consuming.
Activity 1:
For practice work out the mean, median and mode for the following sets of scores relating to the
number of bedspaces in serviced accommodation in Torquay.
Set 1:
4  9  16  10  16  20  20  15  32  14  10  27
Mean =
Median =
Mode =
Set 2:
16  14  15  12  8  10  8  26  14  15  30
Mean =
Median =
Mode =
2.5 Comparison of the Mean, Median and Mode
The mean, median and mode are the three most important statistical measures of location and central
tendency. Here are some guidelines to help you decide which value should be used in a particular case:
To determine what would result from an equal distribution use the mean (e.g. to determine the per
capita consumption of jelly babies);
If position or ranking is involved use the median which gives the half-way value (e.g. a student
interested in whether his exam mark places him in the upper or lower half of the class will need to
compare his mark with the median mark);
Where the most typical value is required use the mode (e.g. a shoe manufacturer may want to know
the average shoe size for ladies. For production planning it will be the mode that he requires as it will
tell him the most common shoe size).
2.5.1 Which Measure Should You Use?
The type of measure that you use will depend on the data that you are using, but ultimately whatever
measure you choose should provide a good indication of the typical score in your sample. The mean is
the most frequently used measure of central tendency, because it is calculated from the actual scores
themselves, not from ranks, as is the case with the median, and not from frequency of occurrence, as in
the case of the mode. However, as mentioned earlier, as the mean uses all the scores in the calculation
it is sensitive to extreme values.
Look at the following sets of scores:
1, 2, 3, 4, 5, 6, 7, 8, 9, 10
The mean from this set of data is 5.5 (the same as the median). If we were to change one of the scores
to make it more extreme, we would get the following:
1,2, 3, 4, 5, 6, 7, 8, 9, 20
The mean is now 6.5, although the median is still 5.5. If we were to make the final score even more
extreme we would get the following:
1, 2, 3, 4, 5, 6, 7, 8, 9, 100
The mean is now 14.5, which as you can see is not really representative of this set of scores. As we
have only changed the highest score, the median remains 5.5. In this case, the median becomes a
better measure of central tendency. Therefore, when deciding which measure to use it is always useful
to check the data for extreme values. Where extreme scores are present, use the median, as this
simply gives you the score in the middle of the other scores when they are put into ascending order. This
insensitivity to extreme values makes the median a useful alternative to the mean.
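The arithmetic in this example can be verified with Python's standard statistics module (a check on the numbers, not part of the module's SPSS material):

```python
import statistics

base = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
extreme = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

# The outlier drags the mean upwards but leaves the median untouched.
print(statistics.mean(base), statistics.median(base))        # 5.5 5.5
print(statistics.mean(extreme), statistics.median(extreme))  # 14.5 5.5
```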
The mode can be used with any type of data, as it relates to the most frequently occurring score and
does not require any calculation. The median and mode cannot be used with certain types of data. For
example if you were discussing occupation or attraction classifications it would be meaningless to rank
these in order of magnitude. Again, when using the mode it is important that it provides a good indication
of the typical score. Consider the following two sets of data:
A] 1,2,2,2,2,2,2,2,3,4,5,6,7,8
B] 1,2,2,3,4,5,6,7,8,9,10,11,12
In set A there are more 2s than any other number and the mode would provide a suitable measure of
central tendency. However, in set B, although the mode is again 2, it is not such a good indicator as its
frequency of occurrence is only just greater than that of the other scores.
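Counting the frequencies makes the difference between sets A and B explicit; a quick Python check:

```python
from collections import Counter

set_a = [1, 2, 2, 2, 2, 2, 2, 2, 3, 4, 5, 6, 7, 8]
set_b = [1, 2, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

# In set A the mode (2) occurs 7 times; in set B it occurs only twice,
# barely ahead of every other score.
print(Counter(set_a).most_common(1))  # [(2, 7)]
print(Counter(set_b).most_common(1))  # [(2, 2)]
```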
2.6 The Population Mean

The measures of central tendency outlined above are useful for giving an indication of the typical score
in a sample. However, what if you wanted to get an indication of the typical score in a population? In
theory, one could calculate the population mean (a parameter) in a similar way to the calculation of a
sample mean: obtain scores from everyone in the population, sum them and divide by the number in the
population. However, this would not usually be possible. We therefore have to estimate the population parameters
from the sample statistics. One way of estimating the population mean is to calculate the means for a
number of samples and then calculate the mean of these sample means. It has been found that this
gives a close approximation of the population mean.
So why does the mean of the sample means approximate the population mean? Imagine randomly
selecting a sample of people and measuring their IQ. It has been found that the population mean for IQ
is 100. It could be that, by chance, you have selected mainly geniuses and that the mean IQ of the
sample is 150. This is clearly above the population mean of 100. You might select another sample that
happens to have a mean IQ of 75, again not near the population mean. It is clear that the sample mean
need not be a close approximation of the population mean. However, if we calculate the mean of these
two samples, we get a much closer approximation to the population mean:
(150 + 75)/2 = 112.5
Activity 2:
Which measure of central tendency would be most suitable for each of the following sets of data:
a] 1, 23, 25, 26, 27, 23, 29, 30 ........................................
b] 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 4, 5 ........................................
c] 1, 1, 2, 3, 4, 1, 2, 6, 5, 8, 3, 4, 5, 6, 7 ........................................
d] 1, 101, 104, 106, 111, 108, 109, 200 ........................................
The mean of the sample means (112.5) is a closer approximation of the population mean (100) than
either of the individual sample means (75 and 150). If several samples of the same size are taken from
a population, some will have a mean higher than the population mean and some will have a lower mean.
If all the sample means were plotted as a frequency histogram the graph would look similar to Figure
2.3.
Figure 2.3: Distribution of Sample Means Selected from a Population with a Mean of 100
If we calculated the mean of all these sample means it would be equal to 100, which is also equal to the
population mean. This tendency of the mean of sample means to equal the population mean is known
in statistics as the Central Limit Theorem. Knowing that the mean of the sample means gives a good
approximation of the population mean is important as it helps us to generalise from our samples to our
population. This will be considered in more detail when we look at dispersion.
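The tendency described by the Central Limit Theorem can be illustrated with a small simulation. The population below is synthetic (normally distributed IQ-style scores with a mean of roughly 100), so the exact numbers will vary, but the mean of the sample means lands close to the population mean:

```python
import random
import statistics

random.seed(1)  # reproducible illustration
population = [random.gauss(100, 15) for _ in range(100_000)]

# Draw many samples of 50 and average their means.
sample_means = [statistics.mean(random.sample(population, 50))
                for _ in range(500)]

print(round(statistics.mean(population), 1))    # close to 100
print(round(statistics.mean(sample_means), 1))  # also close to 100
```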
2.7 Skew and the Relationship of the Mean, Median and Mode
Skew is the term that is used to describe the shape of the data as depicted by its frequency distribution
or frequency curve. Under a symmetrical distribution curve, or what is also called ‘Normal Distribution’
(this will be covered in more detail when we look at measures of dispersion), the data builds up slowly
from the left to a central peak or modal point and then declines to the right. In this situation, the mean,
median and the mode all coincide (see Figure 2.4). A positive skew is when the peak lies to the left
and a negative skew when it lies to the right. The further the peak lies from the centre of the horizontal
axis, the more the distribution is said to be skewed.
Figure 2.4: Symmetrical, Positively and Negatively Skewed Data Distributions
Where the distribution is positively skewed, the mean and median will be pulled to the right of the mode,
and where it is negatively skewed, the mean and median are pulled to the left. Consequently, in a
positively skewed distribution, the mean will have the greatest value, the mode the lowest value and the
median will fall between the two. Conversely, in a negatively skewed distribution, the mode will have the
highest value and the mean will have a lower value than the median and the mode.
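This ordering can be checked with a small worked example. The Python sketch below (the data values are hypothetical, chosen to produce a positive skew) confirms that the mode takes the lowest value, the mean the highest, and the median falls between the two:

```python
import statistics

# Hypothetical positively skewed data: a long tail of high values
# pulls the mean to the right of the median and mode.
scores = [1, 2, 2, 2, 3, 3, 4, 5, 8, 15]

mean = statistics.mean(scores)      # 4.5
median = statistics.median(scores)  # 3.0
mode = statistics.mode(scores)      # 2

# In a positively skewed distribution: mode < median < mean.
assert mode < median < mean
```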
2.8 Using SPSS to Calculate Descriptive Statistics
Having considered the basic calculation of the mean, median and mode by hand (and hopefully not too
painfully!), the aim of this next section is to show you how to produce basic descriptive statistics using
SPSS. You can also produce descriptive statistics in Access, and this will be demonstrated later in the
module. We first need to consider the basic elements of the SPSS operating system.
2.8.1 An Introduction to SPSS
SPSS (PASW Statistics) is a powerful statistical tool that can be used to perform a wide range of statistical
techniques. When analysing data in SPSS it is often convenient to transfer over the data you wish to
analyse from an Excel spreadsheet. The following section will show how to import an Excel
spreadsheet, and provide a basic introduction to the SPSS environment, before explaining in more
detail how to produce descriptive statistics.
To import an Excel spreadsheet, first open SPSS.
SPSS asks you what you would like to do. Move the mouse over Open an Existing Data Source and
press the left mouse button. Either choose the required files or select More Files and click OK.
The Open File dialog box appears. Move the mouse over the drive containing the file you want to open
and then press the left mouse button. The file Dataset is located in the BML224 home page on
Moodle.
SPSS must be told to look for an Excel file. Therefore in the Files of Type box make sure that Excel
is selected [move the mouse over the box and press the left mouse button; a sub menu of different file
types appears; move the mouse over Excel and press the left mouse button].
Now select the Dataset file and click Open.
The Opening File Options dialog
box appears. In the Excel
spreadsheet you are going to import,
the first row in the spreadsheet
contains the field names of the
variables you want to examine. To
assist your data analysis, you need
to ensure that SPSS recognises this.
Move the mouse over the Read Variable Names option and press the left mouse button so that the
checkbox is ticked. Move the mouse over OK and press the left mouse button.
SPSS now automatically imports the fields in the Excel spreadsheet and the data is displayed in the
Data Editor window.
You now need to save this file to your own homespace on the network. Move the mouse over File and
press the left mouse button. Move the mouse over Save As and press the left mouse button again. The
Save As Dialog box appears. Save the file as DATASET.SAV. Note that .SAV is the file extension for
data tables in SPSS. If you need to reload this file at any point, in the Open File dialog box select the
DATASET.SAV file.
Before using SPSS to perform basic frequency counts and descriptive statistics on the results of the
Interview data you first need to understand the nature of the data. For example, some variables are
based on numeric coding schemes (nominal, categorical data types) and others on specific data values
(interval or ratio data types). For those questions based on numeric coding schemes, certain descriptive
statistics are not appropriate, although in this case SPSS can be used to perform basic frequency
counts.
Details of the variables in the Dataset file are included in the Dataset guide which has been given to you
as part of the module resources. Please read through this guide carefully and become familiar with the
different types of data, as this will be central to your successful completion of this module.
2.8.2 Using the Variable View
In SPSS, we can use the variable view to check the integrity of the data and to apply additional information
to the coding schemes to aid our analysis of the data. At the bottom of the SPSS window, click on the
Variable View tab.
The Variable View window is displayed. This window provides specific information relating to the variables
that we have imported in the Dataset file. A number of key areas need to be checked at this point. First,
check the Type column. In order for SPSS to conduct statistical analysis on the variables in the Dataset
file all the variables here should be listed as Numeric.
In this instance the Greenrank06 variable is listed
as a String. This needs to be changed to Numeric.
To do this move the mouse over String and press
the left mouse button. The cell is highlighted and
a button appears.
Click the button and the Variable Type dialog box
appears.
Select Numeric and click OK.
Check the other variables to ensure that they are set as numeric.
We can also use the Variable View to check the Measurement type of
the variables. In this instance the measurement type should look like
this. Refer back to your introductory notes to check on different data
types.
If the measurement type is not correct for a specific variable, move the
mouse over the measurement cell in question and press the left mouse
button.
The cell is highlighted and a button appears.
Click on the button and a sub menu appears, offering three options: Scale, Ordinal
and Nominal. Move the mouse over the required data type and press the left
mouse button. The new data type will be presented. Note that ratio and interval
data (e.g. age/investment) are classified as Scale.
In the Variable View we can also assign more specific value labels to each of the
variables. For example if we take Area as an example of the basic coding scheme in place here, Chichester
District = 1 and Arun District =2. Any subsequent analysis that we perform will use this base coding
scheme in any output. In order to make the SPSS output more self-explanatory we can assign additional
value labels so that any output actually refers to Chichester District and Arun District.
In the Variable View move the mouse over Values for the Area variable and press the left mouse button.
The cell is highlighted and a button appears. Click the button.
The Value Labels dialog box
appears.
In the Value: box type 1.
In the Value Label: box type
Chichester District.
Click Add.
In the Value: box type 2.
In the Value Label: box type
Arun District.
Click Add.
Click OK.
The changes you have made are reflected in the Variable View.
Repeat this process to add Value Labels to the remaining variables (where appropriate!).
Return to the Data View and SAVE the file. We can now experiment with producing descriptive statistics.
By using the Value Labels in the Data View window you can switch the value labels between the
numeric coding and the full text labeling. Click the button to toggle between the different options.
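For readers who later work outside SPSS, the same idea of attaching value labels to a numeric coding scheme can be sketched in Python with pandas (the use of pandas is an assumption here; only the coding of 1 = Chichester District and 2 = Arun District comes from the handbook, and the responses below are invented):

```python
import pandas as pd

# Coding scheme from the handbook's Area variable:
# 1 = Chichester District, 2 = Arun District.
area_labels = {1: "Chichester District", 2: "Arun District"}

# Hypothetical responses stored as numeric codes.
df = pd.DataFrame({"Area": [1, 2, 1, 1, 2]})

# The analogue of assigning value labels: keep the numeric coding,
# but attach readable labels for use in output.
df["AreaLabel"] = df["Area"].map(area_labels)

print(df)
```

Like the Value Labels toggle, this keeps the underlying codes intact while making any output self-explanatory.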
2.8.3 Working with SPSS Output
Before we start producing descriptive statistics, it is worth mentioning that SPSS output can be cut and
pasted into a Word document (or equivalent package). The process is very simple.
In the output window, select the item you want to cut and paste, in this case a histogram. When the item
is selected a black border will appear. Copy the item (Edit>Copy or right mouse click>Copy).
Open Word and paste the selection into your document.
To print specific elements of the output, first select the element you wish to print. When the item is
selected a black border will appear.
Select Print from the File menu. The Print dialog box opens. Make sure that Selection is highlighted
and click OK.
The required element is printed. Please use this method to print and annotate output that will be created
during the module.
Please use the cut and paste process highlighted here to complete your log book that we will use
throughout this module.
Additional guidance notes on the different features of SPSS are available in the appendices of this
handbook. When using SPSS to analyse data, you should not be directly cutting and pasting SPSS
output into your work. Output tables should ideally be recreated in Word, and data should be transferred
into Excel to create appropriate graphs.
2.8.4 Producing Descriptive Statistics
As mentioned earlier, before using SPSS to perform basic frequency counts and descriptive statistics on
the results of the survey data you first need to understand the nature of the data (refer back to Section
1.6). In this case, we will start by exploring the categorical/nominal variable: OCC (occupation).
Remember for this variable it would not be appropriate to apply the mean, median or standard deviation.
To perform a basic frequency count, first decide on
the variable you wish to examine. In this case we
shall examine OCC.
To do so, first move the mouse over Analyse and
press the left mouse button. Move the mouse
over Descriptive Statistics and then over
Frequencies and press the left mouse button again.
The Frequencies dialog box appears.
Move the mouse over the variable you want to examine (in this case Occ) and press the left mouse
button. Move the mouse over the central arrow and press the left mouse button again. Alternatively,
select the variable you want to examine and quickly double click the left mouse button. The selected
variable moves across into the Variable(s) box. Note this procedure can be repeated for multiple
variables. Click OK.
The results of the frequency count are displayed in the output window. Notice that the frequency table
has listed the occupations as a result of you entering in data for the Value Labels. This helps to make
the table more self-explanatory.
Any statistics you generate in SPSS will also be displayed in this output window. This is very useful as
it means all your calculations are stored in one file that you can save and open at a later date. Save the
output file to your own homespace on the network. Save the file as DS-OUTPUT1.
Repeat this procedure to perform frequency counts to complete Tables 1 and 2 overleaf. Your
additional frequency counts will appear in the output window. Save the output regularly. Record your
results overleaf or alternatively print out and fully annotate your SPSS output and file in your work folder.
The information presented in the frequency chart could now be cut and pasted into Excel, where
you could create an Excel chart to show the distribution of the data.
An online simulation of how to create basic frequency statistics is available on the BML224 home page.
Please use this simulation to familiarise yourself with the basic procedures outlined here.
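Outside SPSS, the same kind of frequency table (counts plus percentages) can be sketched in Python with pandas. This is a hedged illustration only: the occupation values below are invented, as the real codes live in the Dataset guide.

```python
import pandas as pd

# Hypothetical occupation responses; SPSS would display these
# via its value labels.
occ = pd.Series(["Hotelier", "Retailer", "Hotelier",
                 "Caterer", "Hotelier"], name="Occ")

# Analogue of Analyse > Descriptive Statistics > Frequencies:
frequency_table = pd.DataFrame({
    "Frequency": occ.value_counts(),
    "Percent": (occ.value_counts(normalize=True) * 100).round(1),
})

print(frequency_table)
```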
Activity 3:
Table 1: The Distribution of Accommodation by Size

Size        Frequency    Percentage
Small
Medium
Large

Table 2: The Distribution of Accommodation by Price

Price         Frequency    Percentage
Up to £30
£31 to £50
£51 to £70
£71 to £90
£91+

Having completed Tables 1 and 2, now have a go at completing Table 3. It is exactly the same process
but you will need to perform a frequency count for each separate question in the table (the relevant
variable name is given in the brackets).
Table 3: Business Responses to Tourism Issues
You will have noticed that the frequency count produced relates to the entire sample of 300 businesses,
and there is no differentiation based on specific cases such as location. By selecting specific cases we
can use SPSS to produce more detailed frequency counts. In the following example we will produce a
frequency count showing the frequency distribution of different occupation types by area.
Activity 4:
Return to the Data View window in SPSS. Move the mouse over Data and
press the left mouse button.
Move the mouse over Split File and press the left mouse button.
The Split File dialog box opens.
Select Compare groups.
Then select Area and move it into the
Groups Based on box.
Then click OK.
The frequency table is displayed in the output window. As you can see the frequency table now gives a
breakdown of occupation type by area (our prior labelling clearly referring to the Chichester and Arun
Districts). Let us repeat this frequency count but this time instead of using Area we will use Town Code.
Return to the Data View window in SPSS. Move the mouse over Data and
press the left mouse button.
Move the mouse over Split File and press the left mouse button.
The Split File dialog box opens.
Deselect Area and then select Town Code and move it into the Groups Based on box.
Click OK.
Run the frequency count again and the frequency table is displayed in the output window. As you can
see the frequency table now gives a breakdown of occupation type by Town Code (our prior labelling
clearly referring to the actual towns).
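The Split File 'Compare groups' option is, in effect, a grouped frequency count. As a rough illustration of the same idea in Python with pandas (the data below are invented, not taken from the module's Dataset file):

```python
import pandas as pd

# Hypothetical extract: each business has an area and an occupation.
df = pd.DataFrame({
    "Area": ["Chichester District", "Arun District",
             "Chichester District", "Arun District", "Arun District"],
    "Occ":  ["Hotelier", "Hotelier", "Retailer", "Hotelier", "Caterer"],
})

# Analogue of Data > Split File > Compare groups (grouped on Area),
# followed by a frequency count on Occ.
by_area = df.groupby("Area")["Occ"].value_counts()

print(by_area)
```

Grouping on a different variable (Town Code, say) is just a matter of changing the grouping column, exactly as changing the Groups Based on box in SPSS.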
Activity 5:
Using the Split File option please complete the following tables.
Table 4: Size of Accommodation by Area

Area                   Small [No. of Ests]   Medium [No. of Ests]   Large [No. of Ests]   Total
Chichester District
% Distribution
Arun District
% Distribution

Table 5: Size of Accommodation by Town

Town                   Small [No. of Ests]   Medium [No. of Ests]   Large [No. of Ests]   Total
Chichester
% Distribution
Midhurst
% Distribution
Arundel
% Distribution
Bognor Regis
% Distribution
Activity 5 (continued):
Table 6: Business Response to Employment Opportunities by Area
Table 7: Business Response to Employment Opportunities by Town
[Chart: "The Size Structure of Accommodation in the Chichester and Arun Districts" - a percentage
stacked bar chart (0% to 100%) with Town (Chichester, Midhurst, Arundel, Bognor Regis) on the
vertical axis, Percentage on the horizontal axis, and a legend showing Small, Medium and Large]
Activity 6: Self-Directed
Cut and paste the results from Table 5 in your SPSS output into Excel. Edit the layout of the results accordingly and
produce the graph shown above. The graph should be presented on A4 in landscape format. Please copy the format of
this chart exactly.
Please print off the chart and have it checked by the module tutor. File the chart in your work folder.
Before we do any additional analysis it is important to remember to reset the
Split File dialog box, so any subsequent analysis is based on the entire
sample.
Return to the Data View window in SPSS. Move the mouse over Data and
press the left mouse button.
Move the mouse over Split File and press the left mouse button.
The Split File dialog box opens.
Select Analyze all cases, do not
create groups and then click OK.
Failure to reset the Split File dialog box can result in inaccurate statistics being created.
There are a number of ways in which you can produce Descriptive Statistics for interval or ratio
variables in SPSS.
Method 1: First decide on the variable you wish to examine. In this case we shall examine the turnover
of businesses in 2008 (Turnover08).
To do so, first move the mouse over Analyse
and press the left mouse button.
Move the left mouse button over Descriptive
Statistics and then over Frequencies and
press the left mouse button again.
The Frequencies dialog box appears.
Move the mouse over the variable you want to examine (in this case Turnover08) and press the left
mouse button. Move the mouse over the central arrow and press the left mouse button again. Alternatively,
select the variable you want to examine and quickly double click the left mouse button. The selected
variable moves across into the Variables box. Note this procedure can be repeated for multiple variables.
Move the mouse over Statistics and press the left mouse button.
The Frequencies: Statistics dialog box appears. This dialog box gives you the opportunity to select a
wide range of descriptive statistics. Select the options you want to include by moving the mouse over
the blank square and pressing the left mouse button so a tick appears. When you have completed your
selection move the mouse over Continue and press the left mouse button.
Note that SPSS also allows you to select measures of dispersion. This will be discussed in more detail
in the next session.
This will take you back to the Frequencies dialog box. Move the mouse over OK and press the left
mouse button. SPSS automatically calculates the necessary statistics and displays the results in the
Output window. This method not only produces the basic descriptive statistics for the variable but also
a frequency table (which can be deleted).
Descriptive statistics can also be produced by selecting
Descriptives instead of Frequencies in the
Descriptive Statistics sub menu.
Follow the same procedures as in the previous
example, however, in this case click Options to specify
the descriptive statistics you want SPSS to produce.
Select the options you want to include by moving
the mouse over the blank square and pressing
the left mouse button so a tick appears.
When you have completed your selection move
the mouse over Continue and press the left
mouse button.
This will take you back to the Descriptives dialog
box. Move the mouse over OK and press the left
mouse button. SPSS automatically calculates the
necessary statistics and displays the results in
the Output window.
You will have noticed that the descriptive statistics produced for Turnover08 relate to the entire sample of 300 businesses.
By using the Split File option again we can look in more detail at the characteristics of turnover in relation to specific
cases such as size of business or location. For example in the following, we can use the Split file to look at the
average turnover in the Chichester and Arun Districts.
As before open the Split File dialog box and select Compare groups. Select Area to go in the Groups Based on:
box.
Now produce descriptive statistics for Turnover08 again (using either the descriptives or frequencies option). In the
following example I have created descriptive statistics using the frequencies option and you can see in the output that
descriptive statistics have now been produced for both the Chichester and Arun Districts.
Method 2: The second (and slightly faster method) is to use the Explore function. In this example we will
again examine the turnover of businesses in 2008 (Turnover08).
To do so, first move the mouse over
Analyse and press the left mouse button.
Move the left mouse button over
Descriptive Statistics and then over
Explore and press the left mouse button
again.
The Explore dialog box appears.
Move the mouse over the variable you want to examine (in this case Turnover08) and press the left
mouse button. Move the mouse over the Dependent List arrow and press the left mouse button again.
Turnover08 appears in the Dependent List.
Make sure that Statistics is selected in the dialog box. We will come back to plots later.
Click OK. Descriptive statistics for Turnover08 are produced in the output window.
As in the previous method of producing descriptive statistics, the values given in the output relate to the
entire sample. By adding variables in the Factor List in the Explore dialog box, we can differentiate by
specific cases.
Return to the Explore dialog box.
Select Area from the variable list and click the Factor List arrow. Area will appear in the Factor List
window. This will give us separate descriptive statistics for the Arun and Chichester Districts. Remember
in the previous method, we used the Split File option to group around specific cases.
Click OK. Descriptive statistics for business turnover in the Arun and Chichester Districts are produced
in the output window.
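The Explore procedure with a Factor List is, in spirit, a grouped set of descriptives. A rough pandas equivalent might look like the sketch below (the turnover figures are invented for illustration; only the variable names Area and Turnover08 come from the handbook):

```python
import pandas as pd

# Invented turnover figures (in £000s) for six businesses.
df = pd.DataFrame({
    "Area": ["Chichester District"] * 3 + ["Arun District"] * 3,
    "Turnover08": [120, 150, 180, 90, 110, 130],
})

# Analogue of Explore with Turnover08 in the Dependent List and
# Area in the Factor List: one set of descriptives per group.
stats = df.groupby("Area")["Turnover08"].agg(
    ["mean", "median", "min", "max"])

print(stats)
```

Swapping Area for another grouping variable (such as E-Strategy below) just changes the grouping column.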
Let me illustrate another example. Return to the Explore dialog box.
Remove Area from the Factor List and replace with E-Strategy. Click OK.
Descriptive statistics for business turnover for E-Commerce Adopters and E-Commerce Non-Adopters
are produced in the output window.
Using either method, attempt to complete the following tables.
Table 8: Descriptive Statistics for Turnover08 by Town
Table 9: Descriptive Statistics for GTBS Score in 2008 [GTBS08] by Size of Business
Table 10: Descriptive Statistics for Invest by GStrategy
Activity 7:
Size of Business    Mean    Median    Mode    Standard Deviation    Range
Small
Medium
Large
2.9 Graphically Describing Data
As mentioned earlier, when using statistics it is important to understand the data that you are using.
One of the best ways of doing this is through exploratory data analysis, and investigating your data
using graphical techniques. The next section will consider three main elements: frequency histograms,
stem and leaf plots and box plots.
2.9.1 Frequency Histograms
In the above section you have used SPSS to perform basic frequency counts. The frequency histogram
is a useful way of representing a frequency count more graphically, and allowing us to inspect for any
extreme values (see Figure 2.5). Any extreme values and possible errors that have been made in
inputting the data are often easier to spot when you have graphed the data. The frequency histogram is
also useful for discovering other important characteristics of your data. For example you can easily
record the value of the mode by looking for the tallest column in the chart. In addition, the histogram
also gives you useful information about how the values are distributed. However, when interpreting the
distribution of the data, be aware that the interpretation of your histogram is dependent upon the particular
intervals that the bars represent. The way that the data is distributed will become important when we
look at normal distribution and dispersion in the next session. The distribution and character of the data
is also an important consideration in the use of inferential statistics that will be examined later in this
module.
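As a simple illustration of how binning works, and of why interpretation depends on the intervals the bars represent, the following Python sketch bins the data behind Figure 2.5 into intervals of width 10 and prints a text histogram (a sketch only; the module itself produces histograms in SPSS):

```python
from collections import Counter

# The data behind Figure 2.5.
scores = [2, 12, 12, 19, 19, 20, 20, 20, 25]

# Bin the scores into intervals of width 10 and print a text
# histogram. Note that with these intervals the 10-19 and 20-29
# bars tie, even though the mode of the raw scores is 20 -
# interpretation really does depend on the intervals chosen.
bins = Counter((score // 10) * 10 for score in scores)
for start in sorted(bins):
    print(f"{start:2d}-{start + 9:2d} | {'#' * bins[start]}")
```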
Figure 2.5: Frequency Histogram showing the Mean, Median and Mode
[Note: The frequency histogram is based on the following data: 2, 12, 12, 19, 19, 20, 20, 20, 25]
2.9.2 Stem and Leaf Plots
Stem and leaf plots are similar to frequency histograms in that they allow you to see how the scores are
distributed. They also retain the values of the individual observations. A basic example of a stem and
leaf plot is shown below:
Stem and Leaf Plot [a]
[Data set= 2, 12, 12, 19, 19, 19, 20, 20, 20, 25]
Stem Leaf
Tens Units
0 2
1 22999
2 0005
A stem and leaf plot based on a larger data set is illustrated overleaf.
[Figure annotations: the median and mode lie at the peak; the mean (16.56) is not normally shown
on histograms; the scores of 2 and 25 mark the extremes of the distribution]
Stem and Leaf Plot [b]
[Data set= 1, 1, 2, 2, 2, 5, 5, 5, 12 ,12, 12, 12, 14, 14, 14, 14, 15, 15, 15, 15, 18, 18, 24, 24, 24, 24,
24, 25, 25, 25, 25, 25, 25, 25, 28, 28, 28, 28, 28, 28, 28, 28, 32, 32, 33, 33, 33, 33, 34, 34, 34, 34, 34,
35, 35, 35, 35, 35, 42, 42, 42, 43, 43, 44 ]
Stem Leaf
0 11222555
1 22224444555588
2 44444555555588888888
3 2233334444455555
4 222334
You can see the similarities between histograms and stem and leaf plots if you turn the stem and leaf
plot on its side. When you do this you can get a good representation of the distribution of the data. In
Stem and Leaf Plot [a] the first line contains the scores 0 to 9, the next line 10 to 19 and the last line 20
to 29. Therefore in this case the stem indicates the tens and the leaf the units. You can see the score
of 2 is represented as 0 in the tens column (the stem) and 2 in the units column (the leaf), 25 is represented
as a stem of 2 and a leaf of 5. The same pattern applies to Stem and Leaf Plot [b], which highlights that
this approach is useful for presenting lots of data.
However, there are times when the system of blocking in tens is not very informative. For example look
at the following Stem and Leaf Plot.
Stem and Leaf Plot [c]
Stem Leaf
0 0000022222222333333333555555555555555777777777777799999999
1 000000033333888
2 3
6 4
This Stem and Leaf Plot is not really that informative, and only indicates that most of the values are
below 20. An alternative system is to block the scores in groups of 5 (0-4, 5-9, 10-14, 15-19 etc).
Stem and Leaf Plot [d]
Block Stem Leaf
0-4 0. 0000022222222333333333
5-9 0* 555555555555555777777777777799999999
10-14 1. 000000033333
15-19 1* 888
20-24 2. 3
60-64 6. 4
This stem and leaf plot provides a much better indication of the distribution of scores. You can see that
we use a full stop (.) following the stem to signify the first half of each block of ten scores (e.g. 0-4) and
an asterisk (*) to signify the second half of each block of ten scores (e.g. 5-9).
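The construction described above (stems as tens, leaves as units) is mechanical enough to sketch in a few lines of Python. Using the data from Stem and Leaf Plot [a], this reproduces the plot exactly:

```python
from collections import defaultdict

# Data from Stem and Leaf Plot [a].
scores = [2, 12, 12, 19, 19, 19, 20, 20, 20, 25]

# The stem is the tens digit, the leaf the units digit.
plot = defaultdict(list)
for score in sorted(scores):
    plot[score // 10].append(score % 10)

for stem in sorted(plot):
    print(f"{stem} | {''.join(str(leaf) for leaf in plot[stem])}")
# Prints:
# 0 | 2
# 1 | 22999
# 2 | 0005
```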
2.9.3 Box Plots
Extreme scores are sometimes difficult to spot in a large data set. In this instance an alternative graphical
technique is the box plot or whisker plot, which gives a clear indication of the distribution of extreme
scores, and like the stem and leaf plots and histograms discussed above, tells us how the scores are
distributed. An example of a box plot is given in Figure 2.6:
Figure 2.6: An Example of a Box Plot
Although SPSS will automatically create box plots, the following section will outline how to create them
so you understand how to interpret them.
Step 1: The box plot in Figure 2.6 is based on the following data: 2, 20, 20, 12, 12, 19, 19, 25, 20
The first step is to calculate the median score.
2, 12, 12, 19, 19, 20, 20, 20, 25
Median score = 19 [position 5]
[Figure annotations: N = 9; vertical axis from 0 to 40; the box spans the hinges; the thick line within
the box represents the median; the whiskers extend to the adjacent values]
Step 2: The next step is to calculate the hinges. These are the scores that cut the top and bottom
25% of the data (the lower and upper quartiles): thus 50% of the scores fall within the
hinges. The hinges form the outer boundaries of the box. The hinges are calculated by
adding 1 to the position of the median and then dividing by 2. In this instance the
median was in position 5, therefore: (5+1)/2 = 3
Step 3: The upper and lower hinges are therefore the third score from the top and the third score
from the bottom of the ranked list, which in this current example are 20 and 12 respectively.
Step 4: From these scores we can work out the h-spread, which is the range of the scores between
the two hinges. The score on the upper hinge is 20 and the score on the lower hinge is 12,
therefore the h-spread is 8 (20 minus 12).
Step 5: We define extreme values as those that fall one-and-a-half times the h-spread outside the
upper and lower hinges. The points one-and-a-half times the h-spread outside the upper
and lower hinges are called inner fences. One-and-a-half times the h-spread in this case
is 12, that is 1.5*8: therefore any score that falls below 0 (lower hinge, 12, minus 12) or
above 32 (upper hinge, 20, plus 12) is classed as an extreme score.
Step 6: The scores that fall between the hinges and the inner fences and which are closest to the inner
fences are called adjacent scores. In our example, these scores are 2 and 25, as 2 is the
closest score to 0 (the lower inner fence) and 25 is the closest to 32 (the upper inner
fence). These are illustrated by the cross-bars on each of the whiskers.
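If you want to check your understanding of Steps 1 to 6, the calculations can be sketched in a few lines of Python. This is illustrative only - it is not part of the SPSS workflow, and the simple median rule below assumes an odd number of scores, as in our example of nine values:

```python
# Illustrative sketch of Steps 1-6 (assumes an odd number of scores).
def box_plot_stats(scores):
    s = sorted(scores)
    n = len(s)
    median_pos = (n + 1) // 2           # Step 1: position 5 for n = 9
    median = s[median_pos - 1]
    hinge_pos = (median_pos + 1) // 2   # Step 2: (5 + 1) / 2 = 3
    lower_hinge = s[hinge_pos - 1]      # Step 3: 3rd score from the bottom
    upper_hinge = s[n - hinge_pos]      # Step 3: 3rd score from the top
    h_spread = upper_hinge - lower_hinge          # Step 4
    lower_fence = lower_hinge - 1.5 * h_spread    # Step 5: inner fences
    upper_fence = upper_hinge + 1.5 * h_spread
    inside = [x for x in s if lower_fence <= x <= upper_fence]
    return {"median": median,
            "hinges": (lower_hinge, upper_hinge),
            "h_spread": h_spread,
            "fences": (lower_fence, upper_fence),
            "adjacent": (min(inside), max(inside)),           # Step 6
            "extreme": [x for x in s if x not in inside]}

print(box_plot_stats([2, 20, 20, 12, 12, 19, 19, 25, 20]))
```

Running this on the data above reproduces the figures worked through in Steps 1 to 6: median 19, hinges 12 and 20, h-spread 8, inner fences 0 and 32, adjacent scores 2 and 25, and no extreme scores.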
Any extreme scores (those that fall outside the upper and lower fences), are shown on the box plot.
You can see from Figure 2.6 that the h-spread is indicated by the box width (12 to 20) and that there are
no extreme scores. The lines coming out from the edge of the box are called whiskers, and these
represent the range of scores that fall outside the hinges but are within the limits defined by the inner
fences. Any scores that fall outside the inner fences are classed as extreme scores (also called outliers).
As shown in Figure 2.6, there are no scores outside the inner fences, which are 0 and 32. The inner
fences are not necessarily shown on the plot. The lowest and highest scores that fall within the inner
fences (adjacent scores 2 and 25) are indicated on the plots by the cross-lines on each of the whiskers.
If we were to add a score of 33 to the data set illustrated in Figure 2.6, a revised box plot would now
indicate the presence of an extreme score (see Figure 2.7). As shown in Figure 2.7, the score is marked
as 10, indicating that the tenth score in our data set is an extreme score (in this case, 33). This value
falls outside the inner fence of 32. In this situation it may be worth examining the data set to ensure that
this extreme value has not been caused by an error in the data entry process.
Figure 2.7: Revised Box Plot Indicating an Extreme Score
2.10 Graphically Describing Data in SPSS
Creating histograms, stem and leaf plots and box plots in SPSS is very straightforward. In the following
example, we will generate graphical output relating to the Turnover08 variable in the dataset.
Move the mouse over Analyze and press the left mouse button.

Move the mouse over Descriptive Statistics and then over Explore and press the left mouse button again.
The Explore dialog box appears.
Move the mouse over the variable you want to examine (in this case Turnover08) and press the left
mouse button.
Move the mouse over the central arrow and press the left mouse button again. Alternatively, select the
variable you want to examine and quickly double click the left mouse button.
The selected variable moves across into the Dependent List.
Move the mouse over Plots and press the left mouse button.
The Explore Plots dialog box appears.
Select Stem and Leaf and Histogram.
Click Continue.
You are returned to the Explore dialog box. Click OK.
SPSS generates a histogram, stem and leaf plot and box plot in the output window.
As before, any graphical output produced refers to the entire sample of 300 businesses. Using the Factor List
option in the Explore dialog box allows us to examine specific variables in more detail. For example, the following output
has been produced by selecting Area in the Factor List box. This is an extremely useful way of visually examining the
distribution of your data, and we will come back to it when we look at dispersion and statistical testing.
I would like you now to have a go at producing graphical output for a specific variable. Choose an
appropriate variable (which must be ratio or interval in nature) and produce output for the entire
sample, and then use the Factor List option in the Explore dialog box to investigate specific cases. Record your
observations by cutting and pasting the output into your log book.
Activity 8:
2.11 Creating Cross-tabulations in SPSS
Another useful way of examining the relationship between variables is through the use of cross-tabulations.
In the following example we will create a number of cross-tabulations using data from the Dataset file.
To create a cross-tabulation in SPSS, move the mouse over Analyze and press the left mouse button. Move the mouse over Descriptive Statistics and then Crosstabs.
The Crosstabs dialog box appears. You need to think about the structure of your crosstab and decide
what variable you want as a row and what variable you want as column. Your crosstab should take the
form of a contingency table.
Move the mouse over the variable you want to assign to Rows, in this case Area, and press the left
mouse button. Move the mouse over the central arrow and press the left mouse button again. Area
appears in the
Row(s) box:
Move the mouse over the variable you want to assign to Columns, in this case Occ (Occupation), and
press the left mouse button. Move the mouse over the central arrow and press the left mouse button
again. Occ appears in the Column(s) box:
Click OK.
SPSS produces the crosstab in the output window:
The crosstab presented here is based on the absolute values of the data. We can repeat the process to
include Row and Column percentages. This is often a good idea, as it provides a more representative
overview if you have different sample sizes. In this instance we will add percentages to the rows.
Having selected the Row and Column variables, move the mouse over Cells and press the left mouse button.
The Crosstabs: Cell Display dialog box appears.

Select Row in the Percentages window and then click Continue. This will return you to the Crosstabs dialog box. Click OK.
A second crosstab is produced in the output window - this time row percentages have been included. In
this example the crosstab is showing the distribution of occupation categories within a specific District.
For example, in the Chichester District, 48.6% of businesses are run by previous managers and
administrators, compared to 25.7% who were in professional occupations. Reference to the percentage
distribution rather than the absolute values provides a more representative discussion, as it takes into
account relative sample sizes. Repeat the process, removing row percentages and adding column
percentages.
When producing crosstabs it is important that you correctly assign row and column percentages as this can influence
the accuracy of how you discuss the results. A simple rule of thumb is that row percentages should always total 100
when read across the row, and column percentages will always total 100 when read down the column. In the above
example where we have used the column total we are looking at the distribution of specific occupation categories
across the two Districts. For example, 70.2% of managers and administrators are within the Chichester District
compared to 29.8% in the Arun District. In contrast, 63.6% of plant operatives are within the Arun District compared to
36.4% in the Chichester District.
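The logic behind row percentages can also be sketched outside SPSS in a few lines of Python. The data below are invented for illustration (they are not the module Dataset):

```python
from collections import Counter

# Invented example data - NOT the module Dataset.
areas = ["Chichester", "Chichester", "Chichester", "Arun", "Arun", "Chichester"]
occs  = ["Manager", "Manager", "Professional", "Manager", "Plant", "Plant"]

# Count each (row, column) combination, as in the first crosstab.
counts = Counter(zip(areas, occs))
rows = sorted(set(areas))
cols = sorted(set(occs))

# Add row percentages, as in Crosstabs > Cells > Percentages > Row.
for r in rows:
    row_total = sum(counts[(r, c)] for c in cols)
    pcts = {c: 100 * counts[(r, c)] / row_total for c in cols}
    print(r, {c: round(p, 1) for c, p in pcts.items()})
    assert round(sum(pcts.values()), 6) == 100  # each row totals 100
```

The assertion makes the rule of thumb explicit: row percentages always total 100 when read across the row, whatever the sample sizes in each District.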
Now attempt to complete the following tables. Please give consideration to whether you should be
using row or column totals (the clue is in the table). Refer to your Dataset guide.
Table 11: Town Against G-Strategy
Table 12: E-Strategy Against Occupation
                          Occupation
E-Strategy                Managers and     Professional    Clerical and    Sales         Plant         Total
                          Administrators   Occupations     Secretarial     Operations    Operatives
E-Commerce Adopters
  % Distribution
E-Commerce Non-Adopters
  % Distribution
Activity 9:
Table 13: Perceived Value of the Internet Against E-Commerce and Marketing Course Attendance
Table 14: Town Against the Size of Business
                      Town
Size of Business      Chichester    Midhurst    Arundel    Bognor Regis
Small
  % Distribution
Medium
  % Distribution
Large
  % Distribution
Total
Activity 10:
Using the Dataset file, create 3 additional crosstabs using appropriate variables. Record your results
by cutting and pasting your output into your log book. Check your crosstabs with your module tutor to
ensure that they are correct.
Please review the online simulations to ensure that you are familiar with the basic approaches to producing descriptive statistics in SPSS.
We can make crosstabs even more specific by using the Layer command. In the following example,
the initial crosstab is GStrategy v Occupation, but we are going to use the Layer command to examine
any differences between GStrategy, Occupation and Area. In effect, the Layer command is allowing us
to use Area as an additional filter.

Select the variables to use as the basis of your crosstab. Here we are using GStrategy (row) and
Occupation (column). Select Area and put it in the Layer box.
Click OK.
In the output window you will notice that a crosstab showing GStrategy v Occupation has been provided,
but the results have also now been split by area, showing relative distributions in both the Arun and
Chichester Districts.
Dispersion
Section 3
Learning Outcomes
At the end of this session, you should be able to:
Understand the theory and assumptions relating to the distribution and variance of data

Calculate measures of dispersion, including the median, range, standard deviation and standard error, both manually and using SPSS

Use confidence levels and z scores to establish the relationship between the sample mean and population mean

Use the standard error to establish the extent to which the sample mean deviates from the population mean
p. 3-89
Data Analysis for Research Measures of Dispersion
3.0 Introduction
So far, you have been introduced to a number of different methods to graphically illustrate your data. But why
is it important to do this? It is important because the way the data is distributed will influence the types of
statistical tests that are valid, as many of the statistical tests that you will be introduced to in this module make
specific assumptions regarding the distribution of the data. One of the most important distributions that you
need to consider is the normal distribution. Under a normal distribution, the characteristic frequency curve
is bell-shaped and is symmetrical around the mean. For example, if 1,000 people were asked to estimate
the length of a room that was exactly 12 feet long, it is highly unlikely that everybody would say that the room
was exactly 12 feet long. Some may guess as low as 11 feet and others may decide on 13 feet. However, we would
expect that most of the estimates would be between 11 feet and 13 feet and very few as far out as 9 feet or 15
feet. If the frequency distribution of the measurements were plotted on a graph, the pattern would tend to be
bell-shaped because most of the values would be clustered around the 12 feet mark, while the frequency of
measurements would diminish away from this central value.
Figure 3.1: Normal Distributions
The curves illustrated in Figure 3.1 all have a normal distribution, even though they are not quite the same.
You can see that they differ in terms of how spread out the scores are and how peaked they are in the middle.
Under a normal distribution, the mean, median and mode are exactly the same. These are features of a
normal curve. Indeed, many natural phenomena, such as heights of adult males and weights of eggs, tend
to produce the ‘normal’ (or Gaussian) distribution, and more significantly, most sampling will do so as well,
regardless of the distribution of the population. This is why it, and sampling, are so important in statistics. The
requirements of a normal distribution are not always met in research, especially when you are dealing with
small sample sizes. If your sample size is less than 30, then reference to the normal distribution is not
appropriate. It is generally found that the more scores from such variables that you plot, the more like the
normal distribution they become. This can be seen in the following example. If you randomly select 10 men
and measured their height inches, the frequency histogram may be similar to Figure 3.2a. This histogram
bears little resemblance to the normal distribution curve. If we were to select an addition 10 men and
measure their height, and then plot all 20 measurements the resulting histogram (Figure 3.2b) would again
not look like a normal distribution. However, you can see that as we select more and more men and plot their
heights, the histogram becomes a closer approximation to the normal distribution (Figure 3.2c to 3.2e). By
the time we have select 100 men you can see that we have a perfectly normal distribution.
Figure 3.2: Normal Distribution and Sample Size
[Source: Dancey and Reid, 2002, p. 64]
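You can see the same effect numerically in a short Python sketch. This is illustrative only; the population mean of 70 inches and standard deviation of 3 are invented figures, not data from the module:

```python
import random

# Invented population: male heights with mean 70 inches, sd 3 inches.
random.seed(1)  # fixed seed so the sketch is repeatable

def sample_stats(n, mu=70, sigma=3):
    heights = [random.gauss(mu, sigma) for _ in range(n)]
    mean = sum(heights) / n
    sd = (sum((h - mean) ** 2 for h in heights) / n) ** 0.5
    return mean, sd

# The larger the sample, the closer the statistics sit to 70 and 3.
for n in (10, 100, 1000):
    mean, sd = sample_stats(n)
    print(f"n={n:4d}  mean={mean:5.2f}  sd={sd:4.2f}")
```

With a sample of 10 the mean and standard deviation can stray noticeably from the population values; by 1,000 they sit very close to them, mirroring Figures 3.2a to 3.2e.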
3.1 Measures of Dispersion
Although the different types of average can help to describe frequency distributions to a certain extent, they
are of limited use on their own and additional measures are often required to illustrate the full picture, and to
assess how much variation there is in our sample or population. This situation is best illustrated by a simple
example.
Two groups of 5 SEMAL students were asked to record their weekly beer consumption. The results in pints
were as follows:
Group 1: 12, 12, 12, 12, 12
Group 2: 0, 5, 10, 15, 30
Passing over the obvious comment that the 2nd group appears to contain someone who isn’t a SEMAL
student, the arithmetic mean for both groups is 12. However, this result gives no indication of the basic
differences between the two sets of values. Therefore, a measure of dispersion (or spread) can be used to
express the fact that one set of values is constant while the other ranges over a wide scale. The following
section will highlight a number of ways in which the level of variance within a sample of population can be
assessed.
3.1.1 The Range
The least sophisticated measure of dispersion is the range of a set of values. The range is simply the
difference between the highest and lowest values of a series. As such, it only tells us about two values which
may be atypical from the rest of the data set. In reference to our previous example, for the beer consumption
of the two groups of SEMAL students, the ranges are:
Group 1: Zero
Group 2: 30
Although the range tells us about the overall range of scores, it does not give any indication of what is happening
in between these scores. Ideally, we need to have an indication of the overall shape of the distribution and how
much the scores vary from the mean. Therefore, although the range gives a crude indication of the spread of
the scores, it does not really tell us much about the overall shape of the distribution of the sample of scores.
Remember, the range is calculated by subtracting the minimum value from the maximum value. In this case:

Group 1: 12 - 12 = 0
Group 2: 30 - 0 = 30
3.1.2 Quartile Deviation
The range, as a measure of dispersion, has the significant disadvantage of being susceptible to distortion by
extreme values. One way of overcoming this is to ignore items in the top and bottom quarters of the
distribution and to consider the range of the two middle quarters only. This is known as the interquartile
range since it is the difference between the first and third quartiles. The quartile deviation (semi-interquartile
range) is one half of the interquartile range. For continuous data, the lower quartile (Q1) is determined by
first ranking the data in order and then dividing the total sample number by 4. In the following example (see
Figure 3.3), the lower quartile lies between the ages of the 2nd and 3rd visitors. Thus, the lower quartile value
is 14 years (i.e. (13+15)/2). The upper quartile value is computed in a similar way, but by multiplying the
sample size by three quarters. Thus the upper quartile value lies between the ages of the 7th and 8th visitors.
Thus the value is 18 years (i.e. (18+18)/2). To summarise, we can now state that one quarter of visitors were
aged 14 years or under, while one quarter were aged 18 years or more. In addition, we can also quote the
interquartile range by stating that 50% of the visitors were aged between 14 and 18 years of age.
Figure 3.3: Age Profile of Visitors to the Arun Youth Centre
10, 13, 15, 16, 16, 17, 18, 18, 18, 20
Lower quartile value = (13 + 15) / 2 = 14

Upper quartile value = (18 + 18) / 2 = 18
Effectively, the interquartile range is a refinement of the median and is most easily calculated from the cumulative
frequency curve. In the Kano rainfall example, discussed in your descriptive statistics handout, the lower
quartile is read off by tracing a line from the 25% level to the curve, and then down to the appropriate rainfall
(about 725mm), and the upper quartile by reading across from 75% (to find about 1000mm). This means that over the
period in question, half of the years had a rainfall between 725mm and 1000mm, with the interquartile range
itself therefore being 275mm.
To calculate the quartile ranges for grouped data, it is first necessary to calculate the cumulative frequencies
as in the Kano example. When trying to calculate the quartile values of grouped data it is again necessary to
make assumptions regarding the distribution of values within the class. In this instance it is assumed that the
distribution is even and the lower quartile is calculated as follows:
Q1 = LCL(Q1) + ( (n/4 - cf(LC)) / f(Q1) ) x w(Q1)
Where:
Q1: is the lower quartile range
LCL(Q1): is the lower class limit of the class containing the lower quartile
n: is the sample size
cf(LC): is the cumulative frequency of the class immediately below that containing the lower quartile
w(Q1): is the width of the class interval containing the lower quartile
f(Q1): is the frequency of the class interval containing the lower quartile
The calculation for the upper quartile is:
Q3 = LCL(Q3) + ( (3n/4 - cf(LC)) / f(Q3) ) x w(Q3)
In this case, Q3 reflects the relevant upper quartile values and can be substituted in the description of terms
stated for calculating Q1.
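A Python sketch of the grouped-data formula is given below. The class intervals and frequencies are invented for illustration - they do not come from the Kano example:

```python
# Invented grouped frequency table: (lower class limit, upper class limit, f).
classes = [(0, 10, 4), (10, 20, 6), (20, 30, 8), (30, 40, 2)]
n = sum(f for _, _, f in classes)   # 20

def grouped_quartile(classes, target):
    # target is n/4 for Q1, or 3n/4 for Q3.
    cf = 0  # cumulative frequency of the classes below the quartile class
    for lcl, ucl, f in classes:
        if cf + f >= target:
            w = ucl - lcl                        # class width
            return lcl + ((target - cf) / f) * w
        cf += f
    raise ValueError("target lies beyond the data")

q1 = grouped_quartile(classes, n / 4)      # 10 + ((5 - 4) / 6) * 10
q3 = grouped_quartile(classes, 3 * n / 4)  # 20 + ((15 - 10) / 8) * 10
print(round(q1, 2), round(q3, 2))  # 11.67 26.25
```

As in the formula, the function first locates the class containing the quartile from the cumulative frequencies, then interpolates within that class on the assumption that values are evenly spread across it.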
3.1.3 Mean Deviation
Unlike the range, the mean deviation measures dispersion about a particular average, namely the arithmetic
mean. It is the average (arithmetic mean) of all the deviations of values from the arithmetic mean ignoring
minus signs. If deviations are considered with plus and minus signs and are measured from the mean then
their total will be zero by definition of the arithmetic mean. Basically the mean deviation tells us the average
distance by which all items in a data set differ from their mean.
For example, for the beer drinking figures of the 2nd group of SEMAL students:
Value: 0 5 10 15 30
d (deviation): -12 -7 -2 +3 +18 (Σd = 0)
|d|: 12 7 2 3 18
[|d|, pronounced mod d, is the mathematical shorthand for saying: ‘ignore minus signs’.]
The mean deviation = Σ|d| / n = (12 + 7 + 2 + 3 + 18) / 5 = 42 / 5 = 8.4
By ignoring minus signs the mean deviation ignores the fact that some items are greater than the average and
some less, consequently this measure of dispersion gives no idea of the way the items are spread around the
average.
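A quick Python sketch confirms the arithmetic:

```python
def mean_deviation(values):
    mean = sum(values) / len(values)                        # 12 pints
    return sum(abs(v - mean) for v in values) / len(values)

print(mean_deviation([0, 5, 10, 15, 30]))  # (12+7+2+3+18)/5 = 8.4
```

The `abs()` call is the |d| step: minus signs are discarded before the deviations are averaged.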
3.1.4 Standard Deviation
Standard deviation is one of the most fundamental measures of dispersion used in statistical analysis. Standard
deviation measures the dispersion around the average, but does so on the basis of the figures themselves, not
just the rank order. Like the mean deviation it is calculated from the deviations of each item from the arithmetic
mean. To ensure that these deviations do not total zero, they are squared before being added
together. This removes all minus signs, since two negative values multiplied give a positive value. Thus, by
summing the squares of the deviations, the 'sum of squares' (or sum of squared differences) is arrived at. The
mean of ‘the sum of squares’ is known as the ‘variance’. The square root of the variance is the standard
deviation.
Standard deviation is symbolized by ‘s’ for a sample and ‘σ ’ for a population.
For an ungrouped, discrete data series, the standard deviation can therefore be calculated as:

σ = √( Σ(x − x̄)² / n )

or alternatively,

σ = √( (Σx² / n) − (Σx / n)² )
The calculation of the standard deviation is illustrated in the following example:
The Calculation of the Standard Deviation
Values (x)    (x − x̄)    (x − x̄)²
3             −0.6        0.36
2             −1.6        2.56
1             −2.6        6.76
2             −1.6        2.56
3             −0.6        0.36
4              0.4        0.16
3             −0.6        0.36
7              3.4        11.56
6              2.4        5.76
5              1.4        1.96
Total                     32.4
The standard deviation figure of 1.8 is useful as it provides an indication of how closely the scores are clustered
around the mean. The value of the standard deviation is best understood when placed in the context of the normal
distribution. Generally, around 68% of all scores fall within 1 standard deviation of the mean. In this example, with a
standard deviation of 1.8, this tells us that the majority of scores in this sample are within 1.8 units above or below the
mean (3.6 +/- 1.8). The standard deviation is useful when you want to compare samples using the same scale.
For example, suppose we were to take a second sample of scores and calculate a standard deviation of 3.6. Comparing
this to the standard deviation from our first sample would indicate that scores in the first sample are
more closely clustered around the mean value than scores in the second sample.
WORKED EXAMPLE

Step 1: First calculate the mean of the sample:

x̄ = Σx / n = 36 / 10 = 3.6

Step 2: Now calculate the standard deviation:

σ = √( Σ(x − x̄)² / n ) = √( 32.4 / 10 ) = 1.8

Where:
σ: standard deviation
Σ: sum of
x: the value
x̄: the mean
n: the number of values
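The worked example can be checked with a short Python sketch:

```python
def population_sd(values):
    mean = sum(values) / len(values)                 # 36 / 10 = 3.6
    sum_sq = sum((v - mean) ** 2 for v in values)    # the 'sum of squares'
    variance = sum_sq / len(values)                  # mean of the sum of squares
    return variance ** 0.5                           # standard deviation

scores = [3, 2, 1, 2, 3, 4, 3, 7, 6, 5]
print(round(population_sd(scores), 2))  # 1.8
```

Each line mirrors a step above: deviations from the mean are squared and summed (32.4), averaged to give the variance (3.24), and square-rooted to give the standard deviation (1.8).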
In conclusion, the standard deviation is a measure of dispersion which indicates the spread of the data
values around the arithmetic mean.
‘Quoting the standard deviation of a distribution is a way of indicating a kind of ‘average’ amount by
which all the values deviate from the mean. The greater the dispersion the bigger the deviations and
the bigger the standard (average) deviation’
(Rowntree, 1981, p. 54, cited in Riley, M. et al, 1998, p. 197)
To demonstrate your familiarity with basic measures of dispersion, for the following data sets, relating to bedspace size of B&B establishments in Blackpool, calculate the mean, median, range and standard deviation.
Results:
Mean:
Median:
Range:
Standard Deviation:
Results:
Mean:
Median:
Range:
Standard Deviation:
Results:
Mean:
Median:
Range:
Standard Deviation:
Activity 11:
Sample A:
 4   9  16  10
34  20   8   6
32  14  10  27
18  12  17  10
48  12  14   6
 6  19  10   6
17  11   6   6
14  12  10   8
 4  14  16   8
18  19  34  14

Sample B:
34  72  11  10
34  20  38   6
32  14  19  27
18  12  17  10
48  12  14  34
16  19  10  32
17  11  50   6
50  12  62   8
 4  14  16   8
18  19  34  23

Sample C:
14   9  11  10
34  20   8   6
32  14  19  27
18  12  17  10
48  12  14  34
 6  19  10   6
17  11   6   6
50  12  62   8
 4  14  16   8
18  19  34  14
3.2 Other Distributions
There are of course variations on the normal distribution. Distributions can also vary depending on how flat
or peaked they are. The degree of flatness or peakedness is referred to as the kurtosis of the distribution. If
a distribution is highly peaked it is leptokurtic and if the distribution is flat it is platykurtic. Leptokurtic
distributions appear relatively thin in appearance, and somewhat pointy. In contrast, platykurtic distributions
are flatter, reflecting a greater number of scores in the tails of the distribution. A distribution between the
extremes of peakedness and flatness is classed as mesokurtic (see Figure 3.4). In a normal distribution
curve, the value of kurtosis is 0 (i.e. the distribution is mesokurtic). If a distribution has a value above or below
0 then this indicates a level of deviation from the norm. You don't need to worry about kurtosis too much at this
point, but you will notice that when you produce descriptive statistics in SPSS, a value for kurtosis is given.
Positive values of kurtosis indicate that the distribution is leptokurtic, whereas negative values suggest
that the distribution is platykurtic (Dancey and Reid, 2002).
Figure 3.4: Examples of Leptokurtic, Platykurtic and Mesokurtic Distributions
[Source: Dancey and Reid, 2002, p. 70]
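If you are curious, moment-based skewness and kurtosis can be sketched in Python. Note that this uses the simple moment definitions; SPSS applies slightly different sample-adjusted formulas, so its values will not match exactly:

```python
def moments(values):
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5      # 0 for a symmetrical distribution
    kurtosis = m4 / m2 ** 2 - 3    # 0 for a normal (mesokurtic) curve
    return skewness, kurtosis

# A small symmetrical, flat-ish data set: no skew, negative kurtosis.
skew, kurt = moments([1, 2, 3, 4, 5])
print(skew, round(kurt, 2))  # 0.0 -1.3
```

The negative kurtosis value flags the flat, platykurtic shape of this little data set, matching the sign convention described above.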
3.2.1 Skewed Distributions
Distributions can also be skewed (see Figure 3.5). A positive skew is when the peak lies to the left of the
mean and a negative skew when it lies to the right of the mean. The further the peak lies from the centre
of the horizontal axis, the more the distribution is said to be skewed. If you come across badly skewed
distributions then you need to consider whether the mean is the best measure of central tendency,
as the scores in the extended tail will be distorting your mean. As discussed in your descriptive
statistics handbook, at this point it might be more appropriate to use the median or the mode to give a more
representative indication of the typical score in your sample. The SPSS output for descriptive statistics will
also provide a measure of skewness. A positive value suggests a positively skewed distribution, whereas
a negative value suggests a negatively skewed distribution. A value of zero indicates that the distribution
is not skewed in either direction (i.e. the distribution is symmetrical).
Figure 3.5: Examples of Skewed Distributions
These refinements need not concern us here, but will need consideration when it comes to deciding which
statistical tests you wish to use to examine the data. For now, it is perhaps enough to make a distinction
between the most powerful ‘parametric’ tests which rely on the data concerned being normally distributed, and
the less powerful non-parametric ones which do not. If you have control over the collection of your data you
should do your best to collect data on which parametric tests can be conducted, but if you cannot ensure this
quality or need to use others’ information it may be better to use the less powerful tests.
3.3 The Standard Normal Distribution
The standard normal distribution (SND) is a probability distribution. The value of
probability distributions is that there is a probability associated with each particular score from the distribution.
More specifically, the area under the curve between any specified points represents the probability of obtaining
scores within these specified points. For example the probability of obtaining scores between -1 and + 1
standard deviations from the distribution is about 68% (see Figure 3.6). This means that:
68.26% of observations fall within plus or minus one standard deviation of the mean;
95.44% of observations fall within plus or minus two standard deviations of the mean;
99.7% of observations fall within plus or minus three standard deviations of the mean
These percentage values will be referred to later as ‘confidence limits’.
Figure 3.6: The Standard Normal Distribution
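These proportions can be checked in Python using the error function from the standard library (note that the figures quoted above are truncated, whereas the sketch below rounds):

```python
import math

# P(-k < Z < +k) for a standard normal variable Z equals erf(k / sqrt(2)).
def proportion_within(k):
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(100 * proportion_within(k), 2))  # 68.27, 95.45, 99.73
```

These are exactly the one, two and three standard deviation proportions listed above, derived from the curve itself rather than from a printed table.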
Let me illustrate this through a specific example. Figure 3.7 illustrates the number of tourists bungee
jumping off a bridge at an extreme sports academy in New Zealand. There were 150 tourists in total, and the brave
souls were most frequently aged between 26 and 30 (the highest bar). The graph also indicates that very few
people over the age of 60 participate in bungee jumping (thank goodness for that!). If we think about this
distribution as a probability distribution we could start asking specific questions. For example, how likely is it
that a 60 year old will undertake a bungee jump in New Zealand? A look at the distribution suggests the answer
'not very likely'. However, if you were asked how likely it is that a 30 year old went bungee jumping,
your answer would be 'quite likely'. Indeed, the distribution shows that 30 of the 150 tourists were aged
around 30 (equating to 20% of the total sample). Therefore, using this data it is possible to estimate the
probability that a particular score will occur.
Figure 3.7: Tourists Bungee Jumping in New Zealand
Using the characteristics of the SND, it is possible to calculate the probability of obtaining scores within any
section of the distribution. Statisticans (much clever than me) have calculated the probability of certain scores
occurring in a normal distribution with a mean of 1 and standard deviation of 1. If our sample shares these
values, then we can use a table of probabilities for normal distribution to assess the likelihood of a particular
score occurring. However in reality, it is likely that the data we will collect will have a mean of 0 and standard
deviation of 1. However, as Field (2003) points out any data set can be converted into a data set that has a
mean of 0 and standard deviation of 1. First to centre the zero, we take each score and subtract from it the
mean of all the scores. Then, we divide the resulting score by the standard deviation to ensure the data have
a standard deviation of 1. The resulting scores are called z scores. The z-score is expressed in standard
deviation units - the z score therefore tells us how many standard deviations above the mean our score is. A
negative z score is below the mean and a positive z score is above the mean.
Extreme z scores, for example greater than 2 and below 2, have a much smaller chance of being obtained
than scores in the middle of the distribution. That is areas of the curve above 2 and below -2 are small in
comparison with the area between -1 and 1 (see Figure 3.8).
Figure 3.8: Areas in the middle and extremes of the Standard Normal Distribution
Let us refer back to our example of bungee jumping in New Zealand, where we can now answer the question:
what is the probability of someone over 60 doing a bungee jump? First we need to convert 60 into a z score.
For the population, the mean age is 32 and the standard deviation is 11. In this instance 60 will become:

(60 - 32) / 11 = 2.55

This indicates that the score is 2.55 standard deviations above the mean.
Consider another example. The mean IQ scores for many IQ tests is 100 and the standard deviation is 15.
If you had an IQ score of 130, then your z-score would be:
(130-100)/15=2
This indicates that your score is 2 standard deviations above the mean.
Using the z-score we can also calculate the proportion of the population who would score above or below your
score - or in the case of the normal distribution the area under the normal distribution curve. Figure 3.9
illustrates that the IQ score of 130 is 2 standard deviations above the mean. The shaded area represents the
proportion of the population who would score less than you, and the unshaded area represents those who
would score more than you. To calculate the specific proportion of the population that would score more or less
than you we refer to a standard normal distribution table (see Table 3.1). The table indicates that the
proportion falling below your z-score is 0.9772 or 97.72%. In order to find the proportion above your score, you
simply subtract the above proportion (0.9772) from 1. In this case the proportion is .0228 or 2.28%. When
using statistical tables for SND you should note that only details of positive z scores are given (those that fall
above the mean). If you have a negative z score disregard the negative sign of the z score to find the relevant
areas above and below your score (Figure 3.10).
z-score = (score - mean) / standard deviation
p. 3-103
Figure 3.9: Normal Distribution showing the proportion of the population with an IQ of less than 130
Table 3.1: Z Scores for Standard Normal Distribution
p. 3-104
Figure 3.10: The proportions of the curve below positive z scores and above negative z scores
Let us now refer back to the z-score calculated when asking about the probability of people over 60 bungee
jumping in New Zealand. The calculated z-score is 2.54. Refer to the table of probability values that has been
included in the appendices. Look up the value of 2.54 in the column labelled 'smaller portion' (i.e. the area
above the value 2.54). You should find that the probability value is .00554, i.e. a 0.554% chance that a
person over 60 would bungee jump. By looking at the values of the 'bigger portion', we find that the
probability of those jumping being under the age of 60 was .99446. Alternatively, there is a 99.45% probability
that those tourists jumping were aged below 60 (.99446 = 1 - .00554).
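The z-score arithmetic and table look-ups above can be reproduced in a few lines of Python. This is a sketch using the standard library's NormalDist; the means and standard deviations are the ones quoted in the examples above:

```python
from statistics import NormalDist

def z_score(score, mean, sd):
    """Express a raw score in standard deviation units."""
    return (score - mean) / sd

snd = NormalDist(mu=0, sigma=1)  # the standard normal distribution

# IQ example: mean 100, SD 15, score 130
z_iq = z_score(130, 100, 15)
below = snd.cdf(z_iq)        # proportion scoring less than you
above = 1 - below            # proportion scoring more than you
print(z_iq, round(below, 4), round(above, 4))  # 2.0 0.9772 0.0228

# Bungee example: mean age 32, SD 11, age 60
z_age = z_score(60, 32, 11)            # approximately 2.54
smaller_portion = 1 - snd.cdf(z_age)   # roughly 0.0055, as in the table
print(round(z_age, 2), round(smaller_portion, 4))
```

The `cdf` call plays the role of the 'bigger portion' column of the printed table, so no interpolation between table rows is needed.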
Certain z-scores are particularly important, as their values cut off certain important percentages of the distribution.
As Field (2003) highlights, the first important value is 1.96, as this cuts off the top 2.5% of the distribution, and
its counterpart at the opposite end (-1.96) cuts off the bottom 2.5% of the distribution. As such, these values
together cut off 5% of scores, or put another way, 95% of z-scores lie between -1.96 and 1.96. The other
important scores are +/- 2.58 and +/- 3.29, which cut off 1% and 0.1% of scores respectively. Put another way,
99% of z-scores lie between -2.58 and 2.58, and 99.9% of z-scores lie between -3.29 and 3.29. These values will
crop up time and time again; indeed, we have already referred to these values when discussing the characteristics
of the normal distribution curve.
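These cut-off values can be recovered from the inverse of the standard normal cumulative distribution function; a quick check in Python (standard library only):

```python
from statistics import NormalDist

snd = NormalDist()  # mean 0, SD 1

# For a two-tailed 5% level, 2.5% sits in each tail,
# so we look up the 97.5th percentile, and so on.
z_95 = snd.inv_cdf(0.975)    # bounds 95% of z-scores
z_99 = snd.inv_cdf(0.995)    # bounds 99% of z-scores
z_999 = snd.inv_cdf(0.9995)  # bounds 99.9% of z-scores

for z in (z_95, z_99, z_999):
    print(round(z, 2))  # prints 1.96, 2.58, 3.29
```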
p. 3-105
3.4 Confidence Intervals
Although the sample mean is an approximation of the population mean, we are not sure how good an
approximation it is. Because the sample mean is a particular value or point along a variable, it is known as a
point estimate of the population mean. It represents one point on a variable and because of this we do not
know whether our sample mean is an underestimation or overestimation of the population mean. We can
therefore use confidence intervals to help us identify where on the variable the population mean may lie.
Confidence intervals of the mean are interval estimates of where the population mean may lie and they
provide us with a range of scores (an interval) within which we can be confident that the population mean lies
(see Figure 3.11). Because we are still only using estimates of population parameters it is not guaranteed that
the population mean will fall within this range; we therefore have to give an expression of how confident we are
that the range we calculate contains the population mean. Hence the term ‘confidence intervals’.
Figure 3.11: The role of confidence intervals in determining the position of the population mean in relation to
the sample mean
p. 3-106
We have already discussed the characteristics of the sample mean: it tends to be normally distributed,
and provides a good approximation of the population mean. Using the basic characteristics of the normal
distribution allows us to estimate how far our sample mean is from the population mean.
As shown in Figure 3.12, we know that the sample mean is going to be a certain number of standard
deviations above or below the population mean. Indeed, we can be 99.74% certain that the sample mean
will fall within -3 and +3 standard deviations. As discussed earlier, this area accounts for most of the scores in
the distribution. If we wanted to be 95% certain that a certain area of the distribution contained the sample
mean we would have to refer back to the z-scores. As highlighted earlier, 95% of the area under the SND falls
within -1.96 and +1.96 standard deviations. Thus we can be 95% certain that the sample mean will lie within
-1.96 and +1.96 standard deviations of the population mean (see Figure 3.13).
Figure 3.12: Sample mean is a certain number of S.Ds above or below the population mean
Figure 3.13: Percentage of curve (95%) falling between -1.96 and +1.96 S.Ds
p. 3-107
For illustration, assume that the sample mean is somewhere above the population mean. If we draw the
distribution around the sample mean instead of the population mean we see the situation in Figure 3.14.
Figure 3.14: Location of the population mean where distribution is drawn around the sample mean
Applying the same logic, we can be confident that the population mean falls somewhere within 1.96 standard
deviations below the sample mean. Similarly, if the sample mean is below the population mean we can be
confident that the population mean is within 1.96 standard deviations above the sample mean (Figure 3.15).
We can therefore be confident (95%) that the population mean is within the region 1.96 standard deviations
above or below the sample mean. With this information we can now calculate how far the sample mean is
from the population mean. All we need to know is the sample mean and the standard deviation of the sampling
distribution of the mean (standard error).
Figure 3.15: Distribution drawn around the sample mean when it falls below the population mean
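The 95% logic above can be checked with a short simulation: draw repeated samples from a known population and count how often the interval 'sample mean ± 1.96 standard errors' contains the population mean. This is a sketch with made-up population values (standard library only):

```python
import math
import random

random.seed(42)  # reproducible runs

POP_MEAN, POP_SD = 32, 11   # hypothetical population (e.g. ages)
N, TRIALS = 50, 1000        # sample size and number of repeated samples

hits = 0
for _ in range(TRIALS):
    sample = [random.gauss(POP_MEAN, POP_SD) for _ in range(N)]
    mean = sum(sample) / N
    se = POP_SD / math.sqrt(N)   # standard error, using the known population SD
    if mean - 1.96 * se <= POP_MEAN <= mean + 1.96 * se:
        hits += 1

coverage = hits / TRIALS
print(coverage)  # close to 0.95
```

Roughly 95% of the simulated intervals trap the population mean, which is exactly what the 'confidence' in a 95% confidence interval refers to.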
p. 3-108
Activity 12:
Use the following table to calculate the following:
a] The probability that z is less than or equal to 0.7
b] The probability that z is more than 0.7
c] The probability that z is less than or equal to 2 and equal to or more than -2
d] The probability that z is less than or equal to 3 and equal to or more than -3
Record your answers below:
p. 3-109
3.5 The Standard Error
One useful adjunct of the normal distribution is standard error, or the standard deviation of the sampling
distribution of the mean, which can be helpful in gauging the precision of your sample, and deciding how large
your eventual sample should be from a pilot study. The standard error is a measure of the degree to which the
sample means deviate from the mean of the sample means. Given that the mean of the sample means is also
a close approximation of the population mean, the standard error of the mean must also tell us the degree to
which the sample means deviate from the population mean. Consequently, once we are able to calculate the
standard error, we can use this information to find out how good an estimate our sample mean is of the
population mean. This is illustrated in Figure 3.16.
Figure 3.16: Calculating the Standard Error
[Source: Field, A, 2003, p. 16]
p. 3-110
Figure 3.16 illustrates the process of taking samples from the population. In this case Field (2003) is looking
at the ratings of lecturers. If we were to take the ratings of all lecturers the mean value would be 3. As illustrated
in Figure 3.16, each sample has a mean value, and these have been presented in a frequency chart. As you
can see, some samples have the same mean as the population, some are lower and some are higher. These
differences reflect sampling variation. As you can see, the end result is a symmetrical distribution,
known as a sampling distribution (Field, 2003). If we were to take the average of all the sample means, we would
get the same value as the population mean. But how well does any one sample mean represent the population mean?
We used the standard deviation as a measure of how representative the mean was of the observed data. If we
measure the standard deviation between the sample means, this gives us a measure of how much
variability there is between the means of the different samples. The standard deviation of the sample
means is known as the standard error of the sample mean.
The standard error is very similar to the standard deviation, but takes account of sample size. The larger the
sample size, the lower the standard error.
SE (mean) = Standard Deviation of the Sample (s) / √ Sample Size (n)
Dividing the standard deviation by the square root of the sample size takes account of the fact that the larger
the sample size, the more likely that the sample is representative, and vice versa. Any probability of the
sample mean being close to the population mean can be calculated, but for our purposes we will only examine
the population mean that we can estimate with 95% probability, which corresponds to two standard errors
away from the mean.
For example, in investigating the geography of sport in Lancashire, you might want to find out how far Warrington’s
supporters travelled to the match. From sampling the crowd you might find a mean of 23km travelled to
Wilderspool, and a Standard Error of 3km. This means that your sampling suggested (with 95% certainty) that
the mean distance which supporters of Warrington RLFC travelled to the match is 23km ± 6km. This does not
mean that 95% of supporters travel between 17km and 29km, but rather is a measure of the confidence with
which you state the mean. You can be pretty certain that if you sampled the crowd twenty times nineteen of
your answers would be within this range.
p. 3-111
The following example highlights a practical application of the standard error in attempting to assess the
mean spending of short break holidaymakers in Chichester.
The following results were obtained:
Visitor Spending (£)
In this example, the standard error has been calculated at 9.43. With reference back to the properties of the
normal distribution curve, we can conclude that it is likely that 68 times out of 100 (or approximately 2 in 3
times) the true mean of the population lies within the range 127 ± 9.43, that is between £117.57 and
£136.43 (or the Mean ± 1 x Standard Error (SE)). If we wish to predict the range with greater confidence
then the rule of plus or minus two standard errors can be applied to give a 95% confidence level. In this case
the true mean of the population would lie within the range 127 ± 18.86, that is between £108.14 and £145.86
(or the Mean ± 2 x SE).
WORKED EXAMPLE

Values (x)    (x − x̄)    (x − x̄)²
109           -18         324
97            -30         900
112           -15         225
156            29         841
86            -41        1681
94            -33        1089
176            49        2401
158            31         961
147            20         400
135                      Total: 8886

Step 1: First calculate the mean of the sample:

x̄ = Σx / n = 1270 / 10 = 127

Step 2: Now calculate the standard deviation:

σ = √(Σ(x − x̄)² / n) = √(8886 / 10) = 29.809

Step 3: Now calculate the standard error:

SE = Standard Deviation of the Sample (s) / √ Sample Size (n)

SE = 29.809 / √10 = 29.809 / 3.16 = 9.43
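The worked example above can be checked in a few lines of Python (standard library only). Note that the standard deviation here uses the population formula (dividing by n) to match the hand calculation:

```python
import math

# Visitor spending values from the worked example above
spending = [109, 97, 112, 156, 86, 94, 176, 158, 147, 135]

n = len(spending)
mean = sum(spending) / n                      # 127.0
ss = sum((x - mean) ** 2 for x in spending)   # sum of squared deviations, 8886.0
sd = math.sqrt(ss / n)                        # population formula, as in the hand calculation
se = sd / math.sqrt(n)                        # standard error

lower_95 = mean - 1.96 * se
upper_95 = mean + 1.96 * se
print(round(sd, 2), round(se, 2))              # 29.81 9.43
print(round(lower_95, 2), round(upper_95, 2))  # 108.52 145.48
```

The 95% limits printed here match the critical-z version of the confidence interval discussed on the following page.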
p. 3-112
In the above example, the selected standard errors equated to critical z values of 1.0 and 2.0. These values
help to establish and define the ‘confidence limits’. As discussed these limits are usually described in
percentage rather than absolute values, and you would therefore refer to 68.2%, 95.4% and 99.7% confidence
levels. For the 95% (0.95) and 99% (0.99) levels (the percentage values have been rounded for convenience)
the critical z values are 1.96 and 2.58 respectively. Therefore, if we were to refer back to our previous
example, we can redefine our confidence limits and expected ranges in which we would expect the mean
value of the population to lie.
For example, in the previous example, at the 95% confidence level the limits were given by:
127 ± (2 x 9.43) = £108.14 to £145.86
If we adopt the critical z values for the standard error at a 95% confidence level, the limits are now defined as:
127 ± (1.96 x 9.43) = £108.52 to £145.48
If we adopt the critical z values for the standard error at a 99% confidence level, the limits are now defined as:
127 ± (2.58 x 9.43) = £102.67 to £151.33
Effectively, higher confidence levels can only be achieved at the expense of wider confidence intervals.
Therefore, we can be 99% certain that the population mean lies between £102.67 and £151.33, but only 95
per cent confident that it lies between the narrower bounds of £108.52 and £145.48. Clearly the best way to
gain greater accuracy in sample estimates is to increase the sample size (n). As the sample size (n) increases
the standard error, or spread, of the sampling distribution is reduced and the resulting confidence intervals are
narrowed.
Referring back to our previous example, which focused on visitor spending, increasing the sample size to 100
yields the following results:
Mean: £127
Std Dev: 29.95
Standard Error: 2.99
Adopting the same confidence limits as before, we can now be 95% confident that the
population mean lies between:
127 ± (1.96 x 2.99) = £121.14 to £132.86
and at the 99% confidence level the population mean lies between:
127 ± (2.58 x 2.99) = £119.29 to £134.71
As you can clearly see, increasing the sample size has significantly reduced the width of the confidence
intervals.
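The narrowing effect of a larger sample can be sketched with a small helper function, using the standard deviations reported above:

```python
import math

def ci_95(mean, sd, n):
    """95% confidence interval for the mean, using the critical z value 1.96."""
    se = sd / math.sqrt(n)
    return mean - 1.96 * se, mean + 1.96 * se

small = ci_95(127, 29.81, 10)    # sample of 10
large = ci_95(127, 29.95, 100)   # sample of 100

print([round(x, 2) for x in small])
print([round(x, 2) for x in large])

width_small = small[1] - small[0]
width_large = large[1] - large[0]
print(round(width_small, 2), round(width_large, 2))  # the interval narrows as n grows
```

Because the standard error shrinks with √n, a tenfold increase in sample size cuts the interval width by a factor of roughly √10 ≈ 3.16.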
p. 3-113
This is graphically illustrated in Figure 3.17 below.
Figure 3.17: Confidence Intervals with Sample Sizes
a] Sample size of 10: 95% confidence interval from £108.52 to £145.48 (range = 36.96), around the sample mean of 127
b] Sample size of 100: 95% confidence interval from £121.14 to £132.86 (range = 11.72), around the sample mean of 127
As is evident in Figure 3.17, increasing the sample size results in a much narrower range of scores and gives
us a much clearer indication of where the population mean may be. This in turn underlines the importance of
sample size when trying to estimate population parameters from sample statistics. Generally the larger the
sample size, the better the estimate of the population we can get from it.
p. 3-114
Refer back to the exercise on page 3-93. This time calculate the standard error for each sample, and the standard error ranges at 95% and 99% (using z-scores).
Results:
Standard Error:
The standard error range at 95%
Lower:
Upper:
The standard error range at 99%
Lower:
Upper:

Results:
Standard Error:
The standard error range at 95%
Lower:
Upper:
The standard error range at 99%
Lower:
Upper:

Results:
Standard Error:
The standard error range at 95%
Lower:
Upper:
The standard error range at 99%
Lower:
Upper:
Activity 14:
Sample A:
4 9 16 10
34 20 8 6
32 14 10 27
18 12 17 10
48 12 14 6
6 19 10 6
17 11 6 6
14 12 10 8
4 14 16 8
18 19 34 14
Sample B:
34 72 11 10
34 20 38 6
32 14 19 27
18 12 17 10
48 12 14 34
16 19 10 32
17 11 50 6
50 12 62 8
4 14 16 8
18 19 34 23
Sample C:
14 9 11 10
34 20 8 6
32 14 19 27
18 12 17 10
48 12 14 34
6 19 10 6
17 11 6 6
50 12 62 8
4 14 16 8
18 19 34 14
p. 3-115
3.6 Looking at Distributions in SPSS
As discussed in this handbook, SPSS will produce basic descriptive statistics for dispersion in the Descriptive
dialog box. Refer back to your descriptive statistics section for guidance. Statistics for variance can also be
created via the Explore dialog box. The following example is using the Age variable in the Dataset file.
Move the mouse over Analyze and press the left mouse button.
Move the mouse over Descriptive Statistics and then over Explore and
press the left mouse button.
The Explore dialog box appears.
Select Age and click the central arrow so that Age appears in the
Dependent List.
p. 3-116
Move the mouse over Statistics and press the left mouse button.
The Explore: Statistics dialog box opens. At this point we
can assign a confidence interval for the mean (as discussed
in the previous sections). Make sure that the confidence
interval is set to 95%.
Click Continue.
This returns you to the Explore dialog box. Click OK.
A summary table is produced in the output window.
p. 3-117
This summary table provides you with basic descriptive statistics including the mean and the median, and
measures of dispersion including the range, standard deviation and standard error. The output also provides
the confidence interval at 95% (47.07 to 48.34). Note that Age is a ratio data type, and that the average would
not apply to ordinal or nominal data sets.
3.7 Graphically Looking at Distributions in SPSS
Refer back to your descriptive statistics handbook for information on how to produce basic frequency histograms,
stem and leaf plots and box plots.
SPSS will also allow you to plot the normal distribution over a frequency histogram, so you can ascertain how
the distribution of your sample relates to the normal distribution. The following example again uses the Age
variable in the Dataset file.
Move the mouse over Graphs and then Chart Builder and press the left mouse
button.
p. 3-118
The Chart Builder dialog box appears.
p. 3-119
Select Histogram in the Choose From: box. A series of charts are presented.
p. 3-120
Move the mouse over the Simple Histogram and, holding the left mouse button down, drag it into the chart window. Release
the left mouse button and a simple histogram is presented. An Element Properties dialog box also appears and we will
return to this shortly.
You will notice that the histogram presents options for the vertical Y-axis and the
horizontal X-axis. In this case we need to assign Age to the X-axis. The vertical
Y-axis will be frequency which SPSS will default to automatically.
p. 3-121
Move the mouse over Age in the Variables box and, holding down the left mouse button, drag Age over to the X-Axis box.
Release the left mouse button and Age is assigned to the X-axis of the histogram.
We now need to assign a Normal Distribution Curve to the
histogram. Shift your attention to the accompanying Element
Properties dialog box.
Select Display normal curve in the dialog box and click Apply.
Notice that a Normal Distribution curve has been superimposed on
top of the histogram in the Chart Builder window. Click OK in the
Chart Builder window.
p. 3-122
A frequency histogram is produced in the output window, with a normal distribution curve plotted on it. As you can
see from this output, the Age variable bears some resemblance to the normal distribution, although the overall shape
of the curve is influenced by a number of outlying values.
p. 3-123
As before, we can also use the Split File option
to look at specific cases. For example, here Area
has been selected and two separate
distribution curves for the Chichester and Arun
Districts have been produced.
p. 3-124
Table 15: Descriptive Statistics for GTBSscore08
GTBSscore08
Please cut and paste your histogram below and rescale accordingly.
Descriptive Statistics
Mean
Median
Mode
Standard Deviation
Standard Error
Skewness
Kurtosis
Please provide a brief summary of the distribution:
Table 16: Descriptive Statistics for GTBSscore08 - Chichester District
GTBSscore08: Chichester District
Please cut and paste your histogram below and rescale accordingly.
Descriptive Statistics
Mean
Median
Mode
Standard Deviation
Standard Error
Skewness
Kurtosis
Please provide a brief summary of the distribution:
Activity 15:
p. 3-125
GTBSscore08: Arun District Council
Please cut and paste your histogram below and rescale accordingly.
Descriptive Statistics
Mean
Median
Mode
Standard Deviation
Standard Error
Skewness
Kurtosis
Please provide a brief summary of the distribution:
Table 17: Descriptive Statistics for GTBSscore08 - Arun District
Repeat this exercise for an additional variable (which should be ratio or interval in nature). Record your results by cutting and pasting your output into your log book.
Activity 15:
p. 3-126
Notes:
Student T-Test, Paired Samples T-Test, Mann Whitney and Wilcoxon

Section 4

Learning Outcomes

At the end of this session, you should be able to:

Understand the rationale for the use of parametric and non-parametric tests

Examine the relationship between variables using parametric and non-parametric tests, constructing suitable null and alternative hypotheses

Apply the procedure for conducting parametric and non-parametric tests in SPSS in relation to the Student T-Test, the Paired Samples T-Test, Mann Whitney and Wilcoxon

Interpret computer generated SPSS output in relation to the above tests
© Dr Andrew Clegg p. 4-127
Data Analysis for Research Statistical Tests: Introduction
4.0 Introduction
Statistical tests are used to make deductions about a particular data set or relationships between different
data sets. For example, you might have interviewed a random sample of 50 households from two rural
villages in West Sussex to compare whether income levels are different. In village A, you calculate the mean
income to be £17,650 and for village B, £22,200. In this instance, a statistical test can be used to determine
whether we have a real difference or whether the difference could have occurred purely by chance. There
are a wide variety of statistical tests, each designed to take account of the different characteristics of the data
sets you may wish to examine. The choice of test to use can prove overwhelming, and indeed frightening at
first. At the most basic level, the principal distinction drawn between different statistical tests is whether they
are 'parametric' or 'non-parametric' tests. Parametric tests can only be performed where the data
conform to a normal distribution and are of an interval or ratio nature. In contrast, non-parametric tests
involve less rigorous conditions and can be used on lower-level data which do not conform to a
normal frequency distribution.
4.1 Null and Alternative Hypotheses
Before conducting a statistical test it is first necessary to establish a hypothesis or statement which the test
then challenges. These hypotheses are referred to as the null hypothesis (H0) and the alternative or
research hypothesis (H1). The null hypothesis is usually expressed as H0: μ1 = μ2, where μn is the mean
for each group, and the subscript n denotes the group. When stating a null hypothesis, the normal procedure
is to start by assuming that there is no real difference between your data sets. A statistical test effectively helps
the researcher to decide whether or not the null hypothesis is true, or more precisely, whether or not it should
be accepted. If the result of the test shows that the null hypothesis should not be accepted and that it should
be rejected, we can then go on to say, with some degree of confidence, that a difference does exist or a
change has occurred (Riley, M. et al, 1998, p. 203). It is important that you express both Ho and H1 in the
context of your own research problem before collecting your data and before starting your analysis. In
reference to the rural income example quoted above, we could formulate the following hypotheses:
H0: μa = μb  There is no significant difference between the mean income of households in village
A as compared with the mean income of households in village B; mean household income
is not influenced by geographical location.

H1: μa ≠ μb  There is a significant difference in the mean household income for households in
village A as compared with village B; mean household income is influenced by geographical
location.
To determine whether or not your sampled data sets are consistent with the null hypothesis or the alternative
hypothesis, we need to perform a probability based significance test. However, before such a test is conducted
we must determine how big any difference has to be, to be considered real beyond that expected due to
chance.
© Dr Andrew Clegg p. 4-128
4.2 Hypothesis Testing
Most tests follow the same basic logic, in that a research hypothesis (your alternative hypothesis) predicts a
difference in distributions, whereas a null hypothesis predicts that they are the same. For each significance
test, we can produce a probability distribution of a test statistic, termed a sampling distribution under the null
hypothesis, calculated on the basis that the null hypothesis is true. A simple example relating to the probability
distribution curve of the Student’s t-statistic is shown in Figure 4.1.
Figure 4.1: Rejection Region for a Probability Distribution
A large visible difference between data sets corresponds to a probability towards the tails of the distribution,
meaning that such a difference is unlikely to have occurred by chance. We can determine whether any
difference between the data sets is too large to be attributed to chance by determining whether the
difference occurs in relation to the tails of the distribution. We can define a critical or rejection region as
that part of the probability distribution beyond a critical value of a test statistic at a certain probability (see
Figure 4.1). We compare this critical value with the calculated test statistic. If the calculated statistic is
greater than the critical value and therefore falls within the rejection region, the difference in the data is
unlikely to have occurred by chance. Consequently, we can reject the null hypothesis and accept the
alternative hypothesis. However, if the calculated value does not fall within the rejection region, this does not
prove the truth of the null hypothesis; it merely fails to reject it.
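To make this logic concrete, here is a sketch of the rural income example as a pooled two-sample t-test done by hand in Python. The income figures and the critical value (2.101 for 18 degrees of freedom at the 0.05 level, two-tailed) are illustrative assumptions, not data from the text:

```python
import math

# Hypothetical annual income samples (in £000s) for two villages
village_a = [17, 18, 16, 19, 17, 18, 16, 17, 18, 17]
village_b = [22, 21, 23, 22, 24, 21, 22, 23, 22, 21]

def mean(xs):
    return sum(xs) / len(xs)

def sample_var(xs):
    """Sample variance, dividing by n - 1."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

na, nb = len(village_a), len(village_b)
# Pooled variance assumes both groups share a common variance
pooled_var = ((na - 1) * sample_var(village_a) +
              (nb - 1) * sample_var(village_b)) / (na + nb - 2)
t = (mean(village_a) - mean(village_b)) / math.sqrt(pooled_var * (1 / na + 1 / nb))

T_CRIT = 2.101  # two-tailed critical value for df = 18 at the 0.05 level
print(round(t, 2), abs(t) > T_CRIT)  # |t| falls in the rejection region -> reject H0
```

Because |t| exceeds the critical value, the calculated statistic falls in the rejection region and we would reject H0 in favour of H1.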
The size of the rejection region is determined by the significance level. Significance levels allow the researcher
to state whether or not they believe a null hypothesis to be true with a given level of confidence or significance
value. Significance levels are presented in statistical tables as a probability value normally expressed in
decimal terms, i.e. 0.05 (5% or p = 0.05 / 1 in 20) and 0.01 (1% or p = 0.01 / 1 in 100). The value 0.05 indicates
the 95 per cent confidence limit and represents the minimum limit for deciding upon whether or not a
particular result is significant and whether or not the null hypothesis should be accepted or rejected. Anything
lower than 95 per cent confidence level, that is where the level is computed to be 94 per cent or less, means
the null hypothesis is normally accepted and the result is regarded as not significant. If the significance level
is found to be higher, that is, it indicates a confidence level of 95 per cent or more has been achieved, then
we say the observed change or difference is significant. By the selection of either a 5% or a 1% significance
level, what we are saying is that we are willing to accept either a 5% or a 1% chance of making an error in
© Dr Andrew Clegg p. 4-129
rejecting the null hypothesis when it is in fact true; this is known as a Type I error. A Type II error represents the
probability of not rejecting the null hypothesis when it is in fact false.
SPSS will report the significance of the calculated test statistic in terms of a probability value p. Where p<0.05
this would indicate a significant result at a 0.05 (5%) level, and p<0.01 would indicate a significant result
at a 0.01 (1%) level, often termed 'highly significant'.
4.3 One and Two Tailed Tests
When conducting a statistical test at a given significance level it is important to consider how the hypothesis
is worded, as this will create the conditions for either a one- or a two-tailed test. Any statement including terms
such as reduces or increases, no lower or no higher, implies a specific direction in the null hypothesis and
consequently forms the basis of a one-tailed test. In contrast, any statement indicating no direction (no
difference/no effect) forms the basis of a two-tailed test. Therefore in relation to the rural income example
stated above, we would perform a two-tailed test. This is because we would have to allow for the average
income for village B to be either larger or smaller than that for village A.
We could, however, have chosen a slightly different alternative hypothesis, for example:

H1: μa > μb  The mean household income for households in Village A is significantly larger as
compared with Village B; mean household income is influenced by geographical location.
This is termed a one-tailed test as we are only interested in a difference in one direction, in this case positive
differences (larger). As a result, the rejection region must be concentrated at one end of the distribution
(hence the term one-tailed). For the sample mean to be larger than the population mean, the rejection
region must lie at the positive end of the x-axis. The choice of a two-tailed or one-tailed test will determine
the distribution of the rejection region. This will now be discussed in the following section.
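The effect of this choice on the critical value can be seen with the standard normal distribution (standard library Python; for a t-test the same idea applies with the t-distribution):

```python
from statistics import NormalDist

snd = NormalDist()
alpha = 0.05

one_tailed = snd.inv_cdf(1 - alpha)       # all 5% in one tail
two_tailed = snd.inv_cdf(1 - alpha / 2)   # 2.5% in each tail

print(round(one_tailed, 2), round(two_tailed, 2))  # 1.64 1.96
# A test statistic of 1.8 would be significant one-tailed but not two-tailed
print(1.8 > one_tailed, 1.8 > two_tailed)
```

The two-tailed cut-off is further out because the 5% risk is shared between both tails, which is exactly the argument Hinton (2004) makes in the extract below.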
© Dr Andrew Clegg p. 4-130
4.3.1 Significance Levels and One and Two-Tailed Predictions
The relationship between significance levels and one and two-tailed predictions is explained by Hinton
(2004) in the following extract:
When we undertake a one-tailed test we argue that if the test score has a probability lower than the significance
level then it falls within the tail-end of the known distribution we are interested in. We interpret this as indicating
that the score is unlikely to have come from a distribution the same as the known distribution but from a
different distribution. If the score arises anywhere outside this part of the tail cut off by the significance level we
reject the alternative hypothesis. This is shown in Figure 4.2. Notice that this shows a one-tailed prediction
that the unknown distribution is higher than the known distribution.
Figure 4.2: A One-Tailed Prediction and the Significance Level
With a two-tailed prediction, unlike the one-tailed, both tails of the known distribution are of interest, as the
unknown distribution could be at either end. However, if we set our significance level so that we take the 5 per
cent at the end of each tail we increase the risk of making an error. Recall that we are arguing that, when the
probability is less than 0.05 that a score arises from the known distribution, then we conclude that the
distributions are different. In this case the chance that we are wrong, and the distributions are the same, is
less than 5 per cent. If we take 5 per cent at either end of the distribution, as we are tempted to do in a two-
tailed test, we end up with a 10 per cent chance of an error, and we have increased the chance of making a
mistake.
We want to keep the risk of making an error down to 5 per cent overall, as otherwise there will be an increase
in our false claims of differences in distributions which can undermine our credibility with other researchers,
who might stop taking our findings seriously. When we gamble on the unknown distribution being at either tail
of the known distribution, to keep the overall error risk to 5 per cent, we must share our 5 per cent between the
two tails of the known distribution, so we set our confidence level at 2.5 per cent at each end. If the score falls
into one of the 2.5 per cent tails we then say it comes from a different distribution. Thus, when we undertake
a two-tailed prediction the result has to fall within a smaller area of the tail compared to a one-tailed prediction,
before we claim that the distributions are different, to compensate for hedging our bets in our prediction. This
is shown in Figure 4.3.
Figure 4.3: A Two-Tailed Prediction and the Significance Level
[Extract taken from Hinton, P. (2004), Statistics Explained, Routledge, London]
The changes in the critical values between one and two-tailed tests have important consequences because
it is possible for Ho to be accepted if the test is two-tailed but rejected if it is one-tailed. This happens with z
values between 1.645 and 1.96: a test statistic of, say, 1.75 falls outside the two-tailed rejection region but
within the one-tailed one. Consequently the phrasing and justification of the alternative hypothesis
should be formulated with considerable care.
Although the actual method for calculating the test statistic is not influenced by the nature of the alternative
hypothesis, the effect of stating a direction is to impose a more rigorous test, which in turn affects the significance
level that can be quoted. By stating a direction in the alternative hypothesis we are effectively establishing a more
precise test.
Table 4.1: Critical z Values for the 0.01 and 0.05 Rejection Regions for One- and Two-tailed Tests
Critical Values
Tailedness 0.05 Level 0.01 Level
One-tailed test -1.645 or +1.645 -2.33 or +2.33
Two-tailed test -1.96 or +1.96 -2.58 or +2.58
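The critical values in Table 4.1 come from the standard normal distribution, and they can be recovered with Python's standard library. This is only a sketch to show where the numbers come from; it is not part of the SPSS workflow:

```python
from statistics import NormalDist

z = NormalDist()  # the standard normal distribution

# A one-tailed test puts the whole rejection region in one tail,
# so the critical value cuts off 5% (or 1%) at that end.
one_tailed_05 = z.inv_cdf(0.95)   # approx. 1.645
one_tailed_01 = z.inv_cdf(0.99)   # approx. 2.33

# A two-tailed test splits the rejection region: 2.5% (or 0.5%) per tail.
two_tailed_05 = z.inv_cdf(0.975)  # approx. 1.96
two_tailed_01 = z.inv_cdf(0.995)  # approx. 2.58

print(round(one_tailed_05, 3), round(two_tailed_05, 2))
```

Note how halving the significance level between the two tails (0.975 rather than 0.95) pushes the critical value further out, which is exactly why a one-tailed result can be significant when the two-tailed one is not.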
4.4 Choosing the Right Test
The main motivation for choosing a statistical test to apply to a set of data has to be driven ultimately by the
objectives of your research project. Indeed your project should have been designed and data sampled with
a certain test or set of tests in mind (Kitchin and Tate, 1999). When deciding upon a particular test, you need
to consider the nature and characteristics of the data sets that you are investigating and, in particular, whether
they will allow the use of parametric or non-parametric tests. The common characteristics of both parametric
and non-parametric tests are listed in Table 4.2. Table 4.3 also provides a useful framework to help you
choose the correct test.
Table 4.2: Common Characteristics of Parametric and Non-parametric Tests
Parametric Tests
Independence of observations, except where the data are paired
Random sampling of observations from a normally distributed population
Interval scale measurement (at least) for the dependent variable
A minimum sample size of about 30 per group is recommended
Equal variances of the population from which the data is drawn
Hypotheses are usually made about the mean (μ) of the population
Non-Parametric Tests
Independence of randomly selected observations except when paired
Few assumptions concerning the distribution of the population
Ordinal or nominal scale of measurement
Ranks or frequencies of data are the focus of tests
Hypotheses are posed regarding ranks, medians or frequencies
Sample size requirements are less stringent than for parametric tests
[Kitchin and Tate, 1999, p. 113]
Table 4.3: Identifying the Right Test

Question 1: What combination of variables have you?

* Two categorical: use Chi-Square
* Two separate continuous: go to Question 2
* Two continuous (the same measure administered twice): go to Question 2
* Two continuous (the same measure administered on three or more occasions): go to Question 2
* One categorical and one continuous: go to Question 2

Question 2: Should your continuous data be used with parametric tests or non-parametric tests?

* Two separate continuous: Parametric - Pearson; Non-parametric - Spearman
* The same measure administered twice: Parametric - Related t-test; Non-parametric - Wilcoxon signed-ranks
* The same measure administered on three or more occasions: Parametric - ANOVA (within subjects); Non-parametric - Friedman test
* One categorical and one continuous: Parametric or Non-parametric - go to Question 3

Question 3: How many levels has your categorical variable?

* Parametric: 2 levels - Independent-samples t-test; 3 or more levels - ANOVA (between subjects)
* Non-parametric: 2 levels - Mann-Whitney U; 3 or more levels - Kruskal-Wallis

[Source: Maltby & Day, 2002]
4.5 Parametric Tests
4.5.1 T-Test or Student’s T-test
The t-test is most useful for testing whether or not a significant difference exists between the means of two
samples, or alternatively, whether or not two samples come from one population. There are two principal
versions of the t-test. One relates to samples involving independent data sets and the other to samples which
involve paired comparisons. In both cases, the data must be ratio or interval in nature, randomly chosen
and normally (or near normally) distributed. The variances of the two data sets should also be similar. Where
there is doubt over the frequency distribution and the values of the variances that may jeopardise the accuracy
of the test, alternative and less refined non-parametric tests should be used.
4.5.2 T-Test for Independent Samples
In this instance, the t-test compares two unrelated data sets by inspecting the amount of difference between
their means and taking into account the variability of each data set. The larger the difference in the means,
the more likely that a real, significant difference exists, and our samples come from different populations (see
Figure 4.5).
Figure 4.5: Differences in Means and Populations
The following section will illustrate how to use SPSS to conduct a student t-test using variables from the
Dataset file.
4.6 Using SPSS to Calculate the Student T-Test
The aim of the following section is to demonstrate how to use SPSS to perform the unrelated and related t-
test. As already mentioned in this section, the t-test is most useful for testing whether or not a significant
difference exists between the means of two samples, or alternatively, whether or not two samples come from
one population. There are two principal versions of the t-test. One relates to samples involving independent
data sets, and the other to samples which involve paired comparisons. In both cases, the data must be of
interval (or ratio) nature, randomly chosen and normally (or near normally) distributed. The variances of the two data
sets should also be similar. Where there is doubt over the frequency distribution and the values of the
variances that may jeopardise the accuracy of the test, alternative and less refined non-parametric tests
should be used. To begin, open SPSS and open the Dataset file that you have used in previous sessions.
We are going to use the Student T-test to examine the relationship between different variables. Let us
consider a potential research scenario to help you place the use of the student t-test in context.
Scenario: As part of the bidding process to Tourism South East for future tourism funding, local tourism
officers have to demonstrate if there is a significant difference in turnover between businesses
in the Arun and Chichester Districts.
Variables: We are therefore going to examine whether there is a significant difference in Turnover08 between Areas.
Before we start we first need to establish a Null and Alternative hypothesis.
In this case:
The Null Hypothesis:
Ho: μa = μb There is no significant difference in Turnover between Areas; business turnover is not
influenced by location

The Alternative Hypothesis:

H1: μa ≠ μb There is a significant difference in Turnover between Areas; business turnover is
influenced by location
4.6.1 T-Test for Independent Samples
To perform the unrelated t-test for two independent samples, first move the mouse over Analyse and press
the left mouse button. Move the mouse over Compare Means and then over Independent Samples T
Test.
The Independent-Samples T Test
dialog box appears.
Move the mouse over the variable Turnover08 and press the left mouse button. Move the mouse over the
centre arrow and press the left mouse button so that the variable Turnover08 appears in the Test Variable(s)
box.
Select the variable Area
and press the lower arrow
so that Area appears in
the Grouping Variable
box.
Move the mouse over Define Groups and press the left mouse button.
The Define Groups dialog box appears. In the box beside
Group 1: type 1 and in the box beside Group 2: type 2. Note
in this case the groups have been defined in terms of their two
codes (1=Chichester District and 2=Arun District). The values
can also be used as a cut-off point, at or above which all the
values constitute one group while those below form the other
group. In this instance the cut-off point is two, which would be
placed in parentheses after Area.
Move the mouse over Continue and press the left mouse button. This will return you to the Independent-
Samples T Test dialog box. Move the mouse over OK and press the left mouse button. SPSS performs the
test and displays the results in the Output window.
In this case the following output is produced:
You are now wondering what this all means. Let us start by referring back to our null and alternative
hypothesis.
In this case:
The Null Hypothesis:
Ho: μa = μb There is no significant difference in Turnover between Areas; business turnover is not
influenced by location

The Alternative Hypothesis:

H1: μa ≠ μb There is a significant difference in Turnover between Areas; business turnover is
influenced by location
The second subtable in the output provides the information we need, tabulating the value of t and its p-
value (Sig. (2-tailed)) together with the 95% Confidence Interval of the Difference for both Equal variances
assumed and Equal variances not assumed.
The key to which situation to use lies in the first two columns labelled Levene’s Test for Equality of
Variances which is a test for the homogeneity of variance assumption of a valid t-test. One of the criteria for
using a parametric t-test is the assumption that both populations have equal variances. If the test statistic F is
significant, Levene’s test has found that the two variances do differ significantly, in which case we must use
the bottom values. Provided the test is not significant (p>0.05), the variances can be assumed to be
homogenous and the Equal Variances line of values for the t-test can be used. As Kinnear and Gray (1999)
point out:
If p > 0.05, then the homogeneity of variance assumption has not been violated and the normal t-test
based on equal variances (Equal variances assumed) is used (the top line).
If p < 0.05, then the homogeneity of variance assumption has been violated and the normal t-test based
on equal variances should be replaced by one based on separate variance estimates (Equal variances
not assumed)(the bottom line).
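The two lines of the SPSS table correspond to two different versions of the t statistic. As an illustration only, using hypothetical figures rather than the Dataset file (and with `pooled_t` and `welch_t` as made-up helper names; SPSS computes both lines, and Levene's test, for you), the two estimates can be sketched as:

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(a, b):
    """Student's t with a pooled variance estimate (Equal variances assumed)."""
    na, nb = len(a), len(b)
    # Pool the two sample variances, weighted by their degrees of freedom.
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(sp2 * (1 / na + 1 / nb))

def welch_t(a, b):
    """Welch's t with separate variance estimates (Equal variances not assumed)."""
    na, nb = len(a), len(b)
    return (mean(a) - mean(b)) / sqrt(variance(a) / na + variance(b) / nb)

# Hypothetical samples; in practice these would be the Turnover08 values
# for each level of Area.
group_a = [1, 2, 3, 4, 5, 6]
group_b = [2, 4, 6, 8, 10]

print(round(pooled_t(group_a, group_b), 3))  # top line of the SPSS table
print(round(welch_t(group_a, group_b), 3))   # bottom line of the SPSS table
```

When the two sample variances are similar the two statistics are close; when they differ markedly, only the separate-variance (Welch) line remains trustworthy, which is exactly what the Levene decision rule above formalises.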
In this example, the Levene Test is significant (p = 0.041
and is therefore < 0.05), so the t value calculated with
the separate variance estimate (Equal variances not
assumed) is appropriate.
The results are relatively straightforward. The table includes the t-
statistic, the degrees of freedom, and the two-tailed probability of the
former being equalled or exceeded by chance alone (Sig.). This
form of output does not give the critical t-value that must be exceeded
for the null hypothesis to be rejected; Sig. is therefore of great
importance in this and the other tests in whose output it is commonly
listed. It allows us to dispense with tables of critical values: if this
probability value is equal to or less than the selected significance
level, the null hypothesis must be rejected.
In this case, the test produces a two-tailed p-value of 0.000; this value is significant. Remember for the p-
value not to be significant at the 0.05 level, the p-value would have to be greater than 0.05. In this case, the null
hypothesis is rejected at the 0.05 significance level. In other words we would conclude that there is a
significant difference in mean turnover between area, and that turnover is influenced by location.
It is important to write up the results clearly and fully. In this instance we could write:
A student t-test was conducted to determine whether a significant difference in turnover between areas existed. A
null hypothesis of no significant difference and an alternative hypothesis of a significant difference were established,
and a 95% confidence level was assumed. The difference was significant: t = 6.354, p (<0.0005) < 0.05.
Therefore the null hypothesis can be rejected and we can assume that there is a significant difference
in turnover between areas, and that turnover is influenced by location.
Note that in the above we have reported the probability value as <0.0005. You cannot have a probability value
of 0.000. The reported probability value has actually been rounded to three decimal places and
therefore for accuracy we would report this as p<0.0005.
A note on Significance Testing taken from Maltby and Day (2002):
‘Significance testing is a criterion, based on probability, that researchers use to decide whether two
variables are related. Remember, as researchers always use samples, and because of the possible
error, they use significance testing to decide whether the relationships observed are real, or not.
Researchers are then able to use a criterion level (significance testing) to decide whether or not their
findings are probable (confident of their findings) or not probable (not confident of their findings).
This criterion is expressed in terms of percentages, and their relationship to probability values. If we
accept that we can never be 100 per cent sure of our findings, we have to set a criterion of how certain
we want to be of our findings. Traditionally, two criteria are used. The first is that we are 95 per cent
confident of our findings; the second is that we are 99 per cent confident of our findings. This is often
expressed in another way. Rather, there is only a 5 per cent (95 per cent confidence) or 1 per cent (99
per cent confidence) probability that we have made an error. In terms of significance testing
these two criteria are often termed the 0.05 (5 per cent) and 0.01 (1 per cent) significance levels.
Throughout this handbook, you will be using a number of tests to determine whether there is a
significant association/relationship between two variables. These tests always provide a probability
statistic, in the form of a value; e.g. 0.75, 0.40, 0.15, 0.04, 0.03 and 0.002. Here, the notion of significance
testing is essential. This probability statistic is compared against the criteria of 0.05 and 0.01 to decide
whether the findings are significant. If the probability value (p) is less than 0.05 (p<0.05) or less than
0.01 (p<0.01) then we conclude that the finding is significant. If the probability value is more than 0.05
(p>0.05) then we decide that the finding is not significant. Therefore we can use this information in
relation to our research idea and we can determine whether our variables are significantly related, or
not. Therefore, for the probability values stated above:
* The probability values of 0.75, 0.40 and 0.15 are greater than 0.05 (p>0.05) and these probability
values are not significant at the 0.05 level (p>0.05).
* The probability values of 0.04, and 0.03 are less than 0.05 (p<0.05) and these probability values
are significant at the 0.05 level (p<0.05).
* The probability value of 0.002 is less than 0.01 (p<0.01), therefore this probability value is significant
at the 0.01 level (p<0.01)’
4.6.2 One or Two-Tailed Tests
The above test has been based on a two-tailed test as the null and alternative hypothesis did
not specify any specific direction. If we were going to perform a one-tailed test we would first
need to look at the mean values of the data and then rewrite our hypotheses accordingly.
Remember that when applying a one-tailed test it is first necessary to establish whether the
difference in the samples corresponds to the direction outlined in the alternative hypothesis.
For example if the alternative hypothesis is that the mean of sample Y is greater than the mean
of sample X, the null hypothesis can only be rejected if the mean of sample Y is greater than
the mean of sample X and if it is significant at the chosen level.
If we use Descriptives Statistics in SPSS to look at the mean turnovers for businesses in the
Chichester and Arun Districts we would find that the mean turnover in the Chichester District
is £43,968.47 and in the Arun District is £37,591.69. The mean turnover is higher in Chichester
which therefore suggests that turnover may be influenced by location. We can therefore
conduct a one-tailed t-test to test if there is actually a significance difference between the two
mean scores.
In this case
The Null Hypothesis:

Ho: μa = μb There is no significant difference in Turnover between Areas; business
turnover is not influenced by location.

The Alternative Hypothesis:

H1: μa > μb Business turnover is significantly higher in the Chichester District than
in the Arun District; turnover is influenced by location.
To calculate the one-tailed level of significance, divide the two-tailed significance value by 2
(0.000/2). The resultant one-tailed value (reported as p<0.0005) would still be significant
(p<0.05).
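The halving rule only applies when the observed difference lies in the predicted direction. A small sketch makes the logic explicit (`one_tailed_p` is a hypothetical helper, not SPSS output; the sample figures are the two district means quoted above):

```python
def one_tailed_p(two_tailed_p, observed_diff, predicted_diff_positive=True):
    """Convert a two-tailed p-value into a one-tailed one.

    Halve the p-value only if the observed difference in means lies in
    the direction the alternative hypothesis predicts; otherwise the
    result sits in the opposite tail and cannot be significant.
    """
    if (observed_diff > 0) == predicted_diff_positive:
        return two_tailed_p / 2
    return 1 - two_tailed_p / 2

# Chichester mean minus Arun mean: 43968.47 - 37591.69 > 0, which
# matches the prediction that Chichester turnover is higher.
print(one_tailed_p(0.04, 43968.47 - 37591.69))  # 0.02
```

Had the observed difference pointed the other way, halving the two-tailed p-value would have been invalid, which is why the section above insists on checking the sample means before writing a directional hypothesis.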
4.6.3 Choosing the Correct Data for a T-Test
SPSS will not tell you if you are using the wrong data in a test, and it is therefore imperative that you are capable
of selecting the right variables to use in a t-test. This will be central to your assessment in this module and it
is vital that you get it right.
Let us first refer back to Table 4.3 on page 4-133. This table clearly shows that a t-test is a combination of one
continuous variable and one categorical (with two levels).
In the worked example provided, Turnover08 was the continuous variable and Area was the categorical
variable. Note that Area has two levels (i.e. 1 - Chichester District and 2 - Arun District). You can only use
categorical variables that have two levels in a t-test. The Independent-Samples T Test dialog box itself
provides a clue here, as you are only able to define two groups (levels) within the Grouping Variable box.
In this case also note that the continuous variable (Turnover08) goes in the Test Variable box.
Referring to the variables in the Dataset file and your accompanying data set guide, attempt to complete the following diagram listing Test Variables and Grouping Variables that would be suitable for use in a series of t-tests.
Activity 16:
Test Variables
Grouping Variables
From the list of potential relationships that you have identified overleaf, please conduct 3 separate t-tests and record your results in the following tables. For each test, identify a research scenario that you are using the test to explore.
Table 18: Student T-Test 1
Activity 17:
Student T-Test
Research Scenario
Test Variable
Grouping Variable
Null Hypothesis
Alternative Hypothesis
SPSS Output
Record the p. value of the Levene Test
Is this significant: yes/no?
Is your test based on: Equal variances assumed
Equal variances not assumed
Record the value of p. (Sig. 2-tailed)
Is the value of p. significant: yes/no?
Your conclusions (with full reference to the null and alternative hypotheses)
Table 19: Student T-Test 2
Activity 17:
Student T-Test
Research Scenario
Test Variable
Grouping Variable
Null Hypothesis
Alternative Hypothesis
SPSS Output
Record the p. value of the Levene Test
Is this significant: yes/no?
Is your test based on: Equal variances assumed
Equal variances not assumed
Record the value of p. (Sig. 2-tailed)
Is the value of p. significant: yes/no?
Your conclusions (with full reference to the null and alternative hypotheses)
Table 20: Student T-Test 3
Activity 17:
Student T-Test
Research Scenario
Test Variable
Grouping Variable
Null Hypothesis
Alternative Hypothesis
SPSS Output
Record the p. value of the Levene Test
Is this significant: yes/no?
Is your test based on: Equal variances assumed
Equal variances not assumed
Record the value of p. (Sig. 2-tailed)
Is the value of p. significant: yes/no?
Your conclusions (with full reference to the null and alternative hypotheses)
4.7 Using SPSS to Calculate the T-Test for Related Samples
The t-test can also be used to examine means of the same participants in two conditions or at two points in
time. The advantage of using the same participants or matched participants is that the amount of error
deriving from differences between participants is reduced. The difference between a related and unrelated
t-test lies essentially in the fact that two scores from the same person are likely to vary less than two scores
from two different people. For example, if you were to weigh the same person on two occasions, the
difference between those two weights is likely to be less than the difference between the weights of two separate individuals. The
variability of the standard error for the related t-test is less than that for the unrelated one. Indeed, the variability
of the standard error of the differences in means for the related t test will depend on the extent to which the
pairs of scores are similar or related. The more similar they are, the less the variability will be of their
estimated standard error.
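Under the hood, the related t-test works on the differences between each pair of scores: t is the mean difference divided by its standard error. A minimal stdlib sketch, with made-up before/after scores rather than the GTBS data (`paired_t` is a hypothetical helper name):

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(before, after):
    """Related (paired) t statistic: mean of the pairwise differences
    divided by the standard error of those differences."""
    diffs = [b - a for a, b in zip(before, after)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1  # (t statistic, degrees of freedom)

# Hypothetical scores for the same three businesses at two points in time.
score_2008 = [10, 12, 14]
score_2010 = [12, 15, 16]

t, df = paired_t(score_2008, score_2010)
print(round(t, 2), df)  # t is positive here because every score improved
```

Because the same cases supply both scores, only the differences vary, which is why the standard error (and hence the test) is more sensitive than the unrelated version.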
In the following example we are going to look at paired data from the Dataset file. Let us consider a potential
research scenario to help you place the use of the related t-test in context.
Scenario: Between 2008 and 2010, Tourism South East ran a series of courses in conjunction with the
Green Tourism Business Scheme to help GTBS members progress to the next stage of
accreditation (e.g. bronze to silver; silver to gold). As part of the monitoring process, Tourism
South East want to establish if these courses have had any impact on GTBS scores.
Variables: We are are going to examine differences in GTBS scores in 2008 and 2010.
As always, we need to start by defining our hypotheses. In this instance, the null and alternative hypotheses
have been stated as:
Ho: μa = μb There is no significant difference between the GTBS scores in 2008 and 2010.

H1: μa ≠ μb There is a significant difference in the GTBS scores in 2008 and 2010.
To perform the related t-test, first move the mouse over Analyse and press the left mouse button.
Move the mouse over Compare Means and then over Paired-Samples T-Test.
The Paired-Samples T-Test dialog box appears.
Move the mouse over GTBS08 and press the left mouse button. GTBS08 is selected. Now move the mouse over GTBS10
and press the left mouse button. GTBS10 is selected. Move the mouse over the central button and press the left mouse
button.
GTBS08 and GTBS10 now appear in the Paired Variables box.
Click OK.
The procedure produces the following results in the output window:
The first table evident in the SPSS output is the Paired Samples Statistics which reports the descriptive
statistics. By observing the mean scores we can see that mean GTBS scores were higher in 2010 than 2008.
These differences seem to be supporting our initial hypothesis. To establish whether this result is significant
or has merely occurred by chance, we refer to the Paired Samples Test.
The key elements of the Paired Samples Test include:
(a) The test statistic - this is denoted as t; in this case the value of t=-11.386
(b) The degrees of freedom - these equal the sample size (300) minus 1, as the same set of respondents
provides both sets of scores. The degrees of freedom value is placed in brackets between the t and the = sign
(e.g. t(299)=-11.386).
(c) The Probability Value - as in all tests we also have to report the probability value. Note that the value of p
=.000 (which remember we report as p<0.0005) is less than 0.05, which means that there has been a
significant change in GTBS scores between 2008 and 2010.
Let us bring these different elements together. As can be seen from the SPSS output, the difference between
the two means is significant. This is specifically reported as:
There is a significant difference in GTBS scores between 2008 and 2010, t (299)= -11.386, p (<0.0005)<0.05.
However, we can be more specific and in our alternative hypothesis look for an improvement
in GTBS scores. As a result our alternative hypothesis would be:

H1: μa < μb There is a significant improvement in the GTBS scores between 2008
and 2010.
This therefore means we have conducted a one-tailed test, as we have specified a specific
direction in which to examine change. To alter the output here so that it complies with a one-
tailed test we merely divide the p-value by 2. The resultant value (reported as p<0.0005) is still
significant (p<0.05). As a result we can reject the null hypothesis and conclude that there has been
a significant improvement in GTBS scores between 2008 and 2010, at the 95% confidence
level. Specifically:
There has been a significant improvement in GTBS scores between 2008 and 2010, t (299)=
-11.386, p (<0.0005)<0.05.
We are now going to use the Dataset file to conduct a number of additional related t-tests. Please complete the following tables, making clear reference to the SPSS output. You have been provided with research scenarios for each table to place the test in context.
Table 21: Related T-Test: Turnover08 Against Turnover10 [Tourism South East want to establish if regional marketing strategies implemented between 2008 and 2010 have had an impact on business turnover.]
Table 22: Related T-Test: Green08 Against Green10 [Tourism South East want to establish if support given to the use of local produce has impacted on how much businesses spend on local produce.]
Note that the tests conducted here relate to the entire sample. If we used the Split File option as we have done previously, we could conduct related t-tests to provide comparisons between selected variables
such as Area, Town or G-Strategy. Attempt to apply the Split File option and repeat one of the tests
above. Cut and paste the output into your log book.
Activity 18:
Related T-Test
Null Hypothesis
Alternative Hypothesis
Comment on the SPSS Output
Related T-Test
Null Hypothesis
Alternative Hypothesis
Comment on the SPSS Output
4.8 Non Parametric Tests
4.8.1 The Mann-Whitney U Test (Independent Samples)
When comparing samples of geographical data, assumptions of normality which underpin the accuracy of
parametric tests, such as the t- test, are often quite unrealistic. In these cases, the use of a non-parametric
test, such as the Mann Whitney U Test, provides a convenient alternative. The Mann Whitney U test is the
non-parametric counterpart of the t-test for unrelated (independent) data. The test is used to determine
whether ordinal data collected in two different samples differ significantly. As a non-parametric test it is not
restricted by any assumptions regarding the nature of the population from which the sample was taken and
is applicable to ordinal (ranked) data. In addition, the sample sizes of the data sets need not be equal. The
test calculates whether there is a significant difference in the distribution (based on the median) of data by
comparing ranks of each data set.
Within the Mann Whitney U test the null hypothesis is that the two populations are taken from a common
population so that there should be no consistent difference between the two sets of values. Any observed
differences are due entirely to chance in the sampling process.
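The rank-comparison idea can be sketched in plain Python. This is an illustration of the U statistic itself, using hypothetical scores (`mann_whitney_u` is a made-up helper name); SPSS handles ties and the significance test for you:

```python
def mann_whitney_u(x, y):
    """Mann-Whitney U: rank all values together, sum the ranks of one
    group, and convert that rank sum into the U statistic."""
    combined = sorted(x + y)
    # Assign mid-ranks so that tied values share the average of their positions.
    rank_of = {}
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j] == combined[i]:
            j += 1
        rank_of[combined[i]] = (i + 1 + j) / 2  # average of ranks i+1 .. j
        i = j
    r1 = sum(rank_of[v] for v in x)          # sum of ranks in the first group
    u1 = r1 - len(x) * (len(x) + 1) / 2
    u2 = len(x) * len(y) - u1
    return min(u1, u2)                        # conventionally the smaller U

# Hypothetical attitude scores for two e-strategy groups.
adopters = [3, 4, 2, 6]
non_adopters = [9, 7, 5, 10]

print(mann_whitney_u(adopters, non_adopters))  # a small U means little rank overlap
```

If the null hypothesis were true, the ranks would be shuffled evenly between the two groups and U would sit near its midpoint; a U close to zero means one group's values almost entirely outrank the other's.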
To begin, open SPSS and open the Dataset file that you have used in previous sessions. We are going to
use Mann Whitney to examine the relationship between different variables. Let us consider a potential
research scenario to help you place the use of the Mann Whitney test in context.
Scenario: Tourism South East are developing a new e-tourism strategy and they want to establish if there
is any relationship between e-strategy (e-commerce adopters and non-adopters) and business
attitudes to the value of the internet.
Variables: We are therefore going to examine the relationship between EStrategy and the perceived
value of the internet in 2008 (Webqual08).
4.8.2 Writing Null and Alternative Hypotheses
Before we start we first need to establish a Null and Alternative hypothesis.
In this case:
The Null Hypothesis:
Ho: There is no significant difference between the two groups in terms of their perceived value of the
internet; e-strategy does not influence attitudes towards the internet
TheAlternative Hypothesis:
H1: There is a significant difference between the two groups in terms of their perceived value of the
internet; e-strategy does influence attitudes towards the internet
4.9 Using SPSS to Calculate Mann Whitney
To perform the Mann Whitney U test, first move the mouse over Analyse and press
the left mouse button.
Move the mouse over Nonparametric Tests and then over Legacy Dialogs. Select
2 Independent Samples.
The Two-Independent Samples Tests dialog box appears.
Select the variable labelled
Webqual08. Move the mouse over
the central arrow and press the left
mouse button so Webqual08
appears in the Test Variable List.
Move the mouse over Define Groups and press
the left mouse button.
Select the variable
EStrategy and press the
lower arrow so that
EStrategy appears in the
Grouping Variable box.
The Define Groups dialog box appears. In the box beside Group 1: type 1, and in the box beside Group 2: type 2. Note that in this case the groups have been defined in terms of their two codes (1 = E-Commerce Adopter and 2 = E-Commerce Non-Adopter). A value can also be used as a cut-off point, at or above which all the values constitute one group while those below form the other group. In this instance the cut-off point would be 2, which would be placed in parentheses after the grouping variable.
Move the mouse over Continue and press the left mouse button. This will return you to the Independent-
Samples Tests dialog box. Move the mouse over OK and press the left mouse button. SPSS performs the
test and displays the results in the Output window.
The first subtable, Ranks, illustrates the number of businesses in each group, and the total number of
businesses. The Mean Rank indicates the mean rank of scores within each group and the Sum of Ranks
indicates the total sum of all ranks within each group. If our null hypothesis of no significant difference was
true, then we would expect the mean rank and sum of ranks to be roughly similar across the two groups. As
we can see the mean rank for E-Commerce Adopters is 178.82 and for E-Commerce Non-Adopters is
116.35. There is a clear difference between the two, and to determine whether this difference is significant
we refer to the Test Statistics table below.
This tells us that the Mann Whitney U value is 6508.000 and that the probability value (p), ascertained by examining the Asymp. Sig. (2-tailed) row, is .000. In this case the p-value (reported as p<.0005) is less than 0.05, so we can reject the null hypothesis and conclude that there is a significant difference between EStrategy and attitudes towards the internet.
Our Mann Whitney test was two-tailed but again we could be more specific by indicating a direction in our
alternative hypothesis. In this case the alternative hypothesis would be:
H1: There is a significant difference between the two groups in terms of their perceived value of the
internet. E-commerce adopters rank the value of the internet higher than e-commerce non-adopters.
Note that an initial examination of the mean ranks would support our alternative hypothesis. As before, for a one-tailed test, the p-value needs to be halved (.000/2 = .000). In this case the test would still be significant, as the p-value (reported as p<.0005) is less than 0.05, so we can again reject the null hypothesis and conclude that there is a significant difference between EStrategy and attitudes towards the internet, and that e-commerce adopters rank the value of the internet higher than e-commerce non-adopters.
4.9.1 Choosing the Correct Data for a Mann Whitney Test
SPSS will not tell you if you are using the wrong data in a test, and it is therefore imperative that you are capable of selecting the right variables to use in a Mann Whitney test. This will be central to your assessment in this module and it is vital that you get it right.
Let us first refer back to Table 3 on page 4-133. This table clearly shows that the Mann Whitney test is non-parametric and requires a combination of one continuous variable and one categorical variable (with two levels). In the worked example provided, Webqual08 was the continuous variable and EStrategy was the categorical variable. Note that EStrategy has two levels (i.e. 1 - E-Commerce Adopter and 2 - E-Commerce Non-Adopter). You can only use categorical variables that have two levels in a Mann Whitney test. The Mann Whitney test dialog box provides a clue here, as you are only able to define two groups (levels) within the Grouping Variable.
In this case also note that the continuous variable (Webqual08) goes in the Test Variable box.
Referring to the variables in the Dataset file, attempt to complete the following diagram listing Test Variables and Grouping Variables that would be appropriate for use in a series of Mann Whitney tests.
Activity 19:
Test Variables
Grouping Variables
From the list of potential relationships that you have identified overleaf, please conduct 3 separate Mann Whitney tests and record your results in the following tables. For each test, identify a research scenario that you are using the test to explore.
Table 23: Mann Whitney Test 1
Activity 20:
Mann Whitney Test
Research Scenario
Test Variable
Grouping Variable
Null Hypothesis
Alternative Hypothesis
SPSS Output
Record the Mann Whitney U Value
Record the value of p (Asymp. Sig. (2-tailed))
Is the value of p. significant: yes/no?
Your conclusions (with full reference to the null and alternative hypotheses, and your test statistics)
Table 24: Mann Whitney Test 2
Activity 20:
Mann Whitney Test
Research Scenario
Test Variable
Grouping Variable
Null Hypothesis
Alternative Hypothesis
SPSS Output
Record the Mann Whitney U Value
Record the value of p (Asymp. Sig. (2-tailed))
Is the value of p. significant: yes/no?
Your conclusions (with full reference to the null and alternative hypotheses, and your test statistics)
Table 25: Mann Whitney Test 3
Activity 20:
Mann Whitney Test
Research Scenario
Test Variable
Grouping Variable
Null Hypothesis
Alternative Hypothesis
SPSS Output
Record the Mann Whitney U Value
Record the value of p (Asymp. Sig. (2-tailed))
Is the value of p. significant: yes/no?
Your conclusions (with full reference to the null and alternative hypotheses, and your test statistics)
4.10 Using SPSS to Calculate Wilcoxon Signed Ranks Test (Related Data Sets)
The Wilcoxon signed ranks test is the non-parametric counterpart of the t-test for related data (the paired t-test). The basic assumptions of the test are that the data are paired across conditions or times, and that the distribution of differences is symmetrical, but it need not be normal or any other particular shape. The data should also be of at least ordinal level, which makes the test very useful for analysing data based on ranked scores. The test examines the differences between data collected from the same phenomenon in two different conditions or at two different times, by examining the ranks of the differences in values over the two conditions. For example, you may want to know whether a village's fertility or mortality rate changes significantly between dates, or whether the conditions under which a questionnaire or interview is conducted significantly influence the findings of a study. In this case, the test calculates whether there is a significant difference by examining whether the ranks of individual phenomena differ between conditions or times.
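As with the Mann Whitney test, the same logic can be sketched outside SPSS. The paired scores below are hypothetical, invented purely to illustrate the shape of the call; they are not the Webqual data from the Dataset file.

```python
from scipy.stats import wilcoxon

# Hypothetical paired attitude scores for the same businesses at two dates;
# illustrative values only, not the module's Dataset file.
scores_2008 = [4, 5, 3, 6, 4, 5, 3, 4, 5, 4]
scores_2010 = [6, 7, 5, 8, 7, 8, 5, 6, 7, 6]

# Two-tailed Wilcoxon signed ranks test on the paired differences.
result = wilcoxon(scores_2008, scores_2010)
print(result.statistic, result.pvalue)
```

Because every 2010 score here exceeds its 2008 pair, all the signed ranks point the same way and the test statistic (the smaller of the two rank sums) is 0, yielding a small p-value.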
To begin, open SPSS and open the Dataset file that you have used in previous sessions. We are going to use the Wilcoxon test to examine the relationship between different variables. Let us consider a potential research scenario to help place the use of the Wilcoxon test in context.
Scenario: Between 2008 and 2010, Tourism South East have been running E-Commerce workshops
across the South East region. As part of the monitoring process, Tourism South East want to
establish if these workshops have had any impact on business attitudes to the value of the
internet.
Variables: We are therefore going to examine the relationship between Webqual08 and
Webqual10.
In this instance, the null and alternative hypotheses have been stated as:
Ho: There is no difference in business attitudes towards the value of the internet between 2008 and
2010
H1: There is a difference in business attitudes towards the value of the internet between 2008 and 2010.
The significance level has been set at 0.05 (95%). Note that this is also a two-tailed test as no direction has
been specified in the alternative hypothesis.
To perform the Wilcoxon test, first move the mouse over Analyse and press the left mouse button.
Move the mouse over Nonparametric Tests, then Legacy Dialogs, then 2 Related Samples, and press the left mouse button.
The Two-Related Samples Tests dialog box appears.
Move the mouse over Webqual08 and press the left mouse button. Webqual08 is selected. Now move the mouse over Webqual10 and press the left mouse button. Webqual10 is selected. You will notice that in the Current Selections area of the dialog box, Webqual08 now appears beside Variable 1 and Webqual10 beside Variable 2.
Move the mouse over the central button and press the left mouse button. Webqual08 and Webqual10 now appear in the
Paired Variables box.
Click OK. The procedure produces the following in the output window.
The first subtable, Ranks, shows the number of negative, positive and tied ranks, along with the Mean Rank and the Sum of Ranks. Let us explore this in additional detail.
Key observations:
Webqual10 has been entered into the equation first; the calculation is therefore based on the attitude scores in 2010 minus the attitude scores in 2008.
The Negative Ranks row indicates how many ranks of Webqual08 were larger than Webqual10. Here the value is 0, which initially suggests that attitude scores have increased.
The Positive Ranks row indicates how many ranks of Webqual08 were smaller than Webqual10. The value here is 259.
The Tied Ranks indicate how many of the rankings of Webqual08 and Webqual10 are the same.
The value here is 41.
The Total is the total number of ranks, which is equal to the number of attitude scores in the sample
(in this case 300).
From the second subtable, Test Statistics, it can be seen that the value of z = -16.093, which is significant as the value of p (.000) is less than 0.05. We can therefore reject the null hypothesis and conclude that there is a significant difference in business attitudes towards the value of the internet between 2008 and 2010. The findings of the Wilcoxon test should be reported as:
z = -16.093, p < .0005
The Wilcoxon Test was two-tailed but again we could be more specific by indicating a direction in our
alternative hypothesis. In this case the alternative hypothesis would be:
H1: There is a significant difference in attitudes towards the value of the internet between 2008 and 2010;
business attitudes have improved.
Note that an initial examination of the data in the Ranks table supports our alternative hypothesis. As before, for a one-tailed test, the p-value needs to be halved (.000/2 = .000). In this case the test would still be significant, as the p-value (reported as p<.0005) is less than 0.05, so we can again reject the null hypothesis and conclude that there is a significant difference in attitudes towards the value of the internet between 2008 and 2010, and that business attitudes have improved.
We are now going to use the Dataset file to conduct a number of additional Wilcoxon tests. Please complete the following tables, making clear reference to the SPSS output. You have been provided with research scenarios for each table to place the test in context.
Table 26: Wilcoxon Test - BLINK08-BLINK10
[Following a complete review of their business advisory services, instigated by poor industry feedback in 2007, Business Link need to establish if business attitudes towards their advisory services have improved between 2008 and 2010]
Table 27: Wilcoxon Test - WEBVALUE08-WEBVALUE10
[Tourism South East want to establish if business attitudes to destination management systems have changed following the change of DMS platform and a complete relaunch of booking systems]
Note that the tests conducted here relate to the entire sample. If we used the Split File option as we have done previously, we could conduct Wilcoxon tests to provide comparisons between selected cases, such as Area, Town or E-Strategy. Attempt to apply the Split File option and repeat one of the tests above.
Cut and paste the output into your log book.
Activity 21:
Related T-Test
Null Hypothesis
Alternative Hypothesis
Comment on the SPSS Output
Related T-Test
Null Hypothesis
Alternative Hypothesis
Comment on the SPSS Output
Notes:
Chi-Squared
Section 5
Learning Outcomes
At the end of this session, you should be able to:
Understand the rationale for the use of chi-squared
Understand the basic conditions and criteria involved in the use of chi-squared
Apply the procedure for calculating chi-squared statistics both manually and in SPSS
Interpret manually derived and computer-generated SPSS chi-squared output
5.0 Introduction
Chi-squared (χ2) is primarily employed to test a null hypothesis of ‘no difference’ between samples of frequency
measurements. The method is used widely in the fields of Business and Management and is often employed
in questionnaire analysis. In many ways it is the most flexible of such tests as:
It can be applied to frequency data on any originally collected scale of measurement (nominal,
ordinal or interval) provided that the data are grouped into independent and mutually-exclusive
categories.
It may be used to test a null hypothesis of ‘no difference’ for any number of samples.
The chi-squared test involves computing a calculated χ2 statistic and comparing this with an appropriate
tabulated χ2 statistic (or critical χ2 value) to test a null hypothesis of ‘no difference’ at a selected significance
level. Although considered less powerful than other tests, this is compensated for by its simple data requirements.
Both ordinal and ratio scale data can be converted into nominal form, although such categorisation can often
cause a loss of detail.
The chi-squared test requires that data be in the form of contingency tables, which are simply data matrices
showing the frequency of observations in different categories (h) for one or more samples (k). The following
are three examples of contingency tables.
Table 5.1: Categories of residence sampled in terms of the age of the resident
Sample a b c
Category Age: 20-29 30-39 40 and over
Owner-occupied 18 42 28
Rented 31 29 12
Council housing 24 41 35
Other 17 4 1
No. of categories (h) = 4 No. of samples (k) = 3
Note: Measurement scale: nominal
Table 5.2: Typical questionnaire responses

Category    Strongly Agree   Agree   Neither Agree nor Disagree   Disagree   Strongly Disagree
Frequency         8            11               6                    19             12

No. of categories (h) = 5   No. of samples (k) = 1
Note: Measurement scale: ordinal

Table 5.3: Total dissolved solids in groundwater, sampled by rock type

Categories (concentration in mg l-1)   0-19   20-39   40-59   60-79   80-99   100-119
Sample A: Granite (n=30)                 3      12      10      4       1        0
Sample B: Basalt (n=30)                  1       9      11      8       3        1

No. of categories (h) = 6   No. of samples (k) = 2
Note: Measurement scale: interval

The calculated χ² statistic compares the observed frequency (O) for each category and every sample against an expected frequency (E) using the general formula:

χ² = Σ [ (O - E)² / E ]

In the above equation the observed frequencies (O) are those that we measure (i.e. those that appear in the contingency tables). The expected frequencies (E) for each category are defined by our hypothesis. The null hypothesis of ‘no difference’ often involves testing for departure from a uniform distribution in the case of the single-sample test. This means that the expected value for each category is identical and equal to n/h. The chi-squared test can also be used to establish differences from a theoretical distribution, such as the normal distribution.
Although the χ2 test is primarily employed as a one-tailed test of the significance of differences, it may also be
employed to establish the significance of similarities between samples. Most of the χ2 tables contain not only
the usual values at the lower end of the significance scale for testing differences but also values at the upper
end of this scale for testing similarity. If we wish to establish similarity of two or more samples, then our
calculated χ2 statistic must be less than, for example, the appropriate figure for the 95% significance level if
we are to accept the null hypothesis of ‘no difference’ at this level.
5.1 The One-Sample Chi-Squared Test
The one-sample test is normally used to test the significance of differences between categories of a single
sample. Consider the following example.
The frequency of rock falls from a popular cliff face in Snowdonia is recorded for two weeks in the summer,
autumn, winter and spring by the local mountain rescue. The results are recorded in Table 5.4.
Table 5.4: Rockfall frequency
Sampling Period Summer Autumn Winter Spring
Frequency of Rockfalls 17 14 10 23
h=4 n=64
If there were no differences in the frequency of rockfalls in each season, then we would expect an equal
frequency of rockfalls in each season. Basically, the expected frequency for each category would be:
E = n / h = 64 / 4 = 16
As with any test, we must first formalise the null and alternative hypotheses. In this case:
H0: There is no difference in the frequency of rock falls between seasons
H1: The frequency of rock falls is significantly greater in some seasons than in others
The calculated χ² statistic can now be computed as follows:

χ² = Σ [ (O - E)² / E ]

Table 5.5: Rockfall frequency: the calculation of the χ² statistic

Category    O     E    (O-E)   (O-E)²   (O-E)²/E
Summer     17    16      1       1       0.0625
Autumn     14    16     -2       4       0.2500
Winter     10    16     -6      36       2.2500
Spring     23    16      7      49       3.0625

χ² = Σ [ (O - E)² / E ] = 5.6250

The degrees of freedom (v) for this one-sample chi-squared test are:

v = h - 1

which in this case equals:

v = h - 1 = 4 - 1 = 3

The calculated χ² statistic is then compared with a tabulated χ² statistic at a selected significance level (see Table 5.6). To reject the null hypothesis, the calculated χ² must exceed the tabulated χ². At the 0.05 significance level, the tabulated χ² statistic with three degrees of freedom is 7.82. As the calculated value is less than the tabulated value, we cannot reject the null hypothesis of ‘no difference’ at the 0.05 significance level, and we conclude that there is no significant difference in the frequency of rock falls between the seasons.
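The rockfall calculation can be checked programmatically. As an illustrative aside (the module itself uses SPSS), SciPy's chisquare function performs exactly this one-sample test, defaulting to a uniform expected distribution:

```python
from scipy.stats import chisquare

# Observed rockfall frequencies for summer, autumn, winter and spring (Table 5.4).
observed = [17, 14, 10, 23]

# chisquare defaults to a uniform expected distribution, i.e. E = n/h = 64/4 = 16.
result = chisquare(observed)
print(result.statistic)  # 5.625, matching the manual calculation
print(result.pvalue)     # > 0.05, so the null hypothesis cannot be rejected
```

The p-value it reports is the exact counterpart of comparing the calculated statistic against the tabulated critical value of 7.82.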
Table 5.6: Critical values of chi-square (at the 95% and 99% significance levels)
[Table of critical values not reproduced here.]
5.2 The Chi-Squared Test for Two or More Samples
The chi-square test can also be used to test the differences or similarities between two or more samples,
though it is always used as a test of difference. The procedure is similar to that for the one sample test, except
that the calculation of the expected frequencies (E) is slightly more complex. Consider the following example.
A researcher in Ghana studying the distribution of malaria outbreaks among international tourists obtains the
following results from a sample of 100 tourists who stayed in hotels on the river flood plain, and from 200
tourists who stayed in hotels on a plateau above the river. The results are recorded in Table 5.7.
Table 5.7: The incidence of malaria outbreaks in Ghana
Category Infected Not Infected
Sample
Flood plain (n=100) 20 80
Plateau (n=200) 25 175
In this case we have two samples (k=2) and two categories (h=2). The researcher wishes to establish whether
the two samples differ significantly in terms of the incidence of infection. The expected frequencies (E) are
thus those that would be expected if there were indeed ‘no differences’ between the plateau and the flood plain
in terms of incidence of infection. The expected frequencies are calculated for each observation using the
following formula:
E = (Column total × Row total) / Overall total
Or alternatively using notation in a contingency table format:
Table 5.7a: Calculation of expected values
Category Infected Not Infected Row Total
Sample
Flood plain (expected) Cell A= (N1 x T1)/T Cell C= (N2 x T1)/T T1
Plateau (expected) Cell B=(N1 x T2)/T Cell D= (N2 x T2)/T T2
Column Totals N1 N2 T
Therefore in the case of the malaria outbreaks, the expected values are calculated in the following manner:
Table 5.7b: Calculation of expected values
Category Infected Not Infected Row Total
Sample
Flood plain (observed) 20 80 100
Plateau (observed) 25 175 200
Column Totals 45 255 300
Hence the expected values are:
Table 5.7c: Calculation of expected values cont..
Category Infected Not Infected Row Total
Sample
Flood plain (expected) 15 (45*100)/300 85 (255*100)/300 100
Plateau (expected) 30 (45*200)/300 170 (255*200)/300 200
Column Totals 45 255 300
Note that the row and column totals for the expected values are identical to those for the observed values.
The χ2 statistic is now calculated in the following manner:
Table 5.8: Calculation of the χ² statistic

Category                      O      E    (O-E)   (O-E)²   (O-E)²/E
Flood plain: Infected        20     15      5       25       1.667
Flood plain: Not Infected    80     85     -5       25       0.294
Plateau: Infected            25     30     -5       25       0.833
Plateau: Not Infected       175    170      5       25       0.147
Total                       300    300                  χ² = 2.941
When the chi-squared test is used to test two or more samples, the number of degrees of freedom is given by:
V = (h-1)(k-1)
In this case:
V = (h-1)(k-1) = (2-1)(2-1) =1
Formally, the test of a null hypothesis of ‘no difference’ is as follows:
H0: There is no difference between the incidence of infection on the flood plain and that on the plateau
H1: There is a significant difference between the incidence of infection on the flood plain and that on the plateau
The calculated χ² statistic is then compared with a tabulated χ² statistic at a selected significance level. The tabulated value at the 0.1 significance level with 1 degree of freedom is 2.71 (see Table 5.6). As the calculated χ² statistic (2.941) exceeds the tabulated χ² statistic, we can reject the null hypothesis at the 0.1 significance level. In practical terms this means that, on the basis of the evidence of the chi-squared test, it is extremely unlikely that the observed difference between rates of infection is due only to chance in the sampling process; instead it reflects a ‘real’ difference between the rates of infection on the flood plain and the plateau.
Notice that the larger the test statistic, the stronger the evidence of association will be. This is not surprising
because the test statistic, χ2 , is based on differences between the actual, or observed frequencies and those
we would expect if there were no association. If there were association then we would anticipate large
differences between observed and expected frequencies. If there were no association we would expect small
differences.
In the above example, if a more stringent significance level had been chosen, say 0.05, then the calculated χ² statistic (2.941) would have been less than the tabulated χ² statistic (3.84) and so the null hypothesis could not have been rejected. This situation raises the issue of subjectivity: in order to reject a null hypothesis, a researcher may well be tempted to choose a less stringent significance level. The safest rule is to choose a significance level before the test is carried out and stick to it.
5.3 Yates Correction Factor
When using a χ2 test with one degree of freedom, as in the previous example, it is necessary to make a slight
adjustment to the calculations. The adjustment consists of either adding or subtracting 0.5 to the value of
each (O-E) before squaring it. The rule for deciding whether to add or subtract the 0.5 is:
a) If (O-E) is negative then add;
b) If (O-E) is positive then subtract
It is probably more easily remembered by noting that addition or subtraction should be performed with a view
to making the value of χ2 smaller.
The effect of the Yates correction can be highlighted with reference to Table 5.9, a version of Table 5.8 that has been adjusted using the Yates correction factor (corrected values in parentheses).

Table 5.9: Calculation of the χ² statistic with Yates correction

Category                      O      E    (O-E)           (O-E)²       (O-E)²/E
Flood plain: Infected        20     15     5-0.5 (4.5)    25 (20.25)   1.667 (1.350)
Flood plain: Not Infected    80     85    -5+0.5 (-4.5)   25 (20.25)   0.294 (0.238)
Plateau: Infected            25     30    -5+0.5 (-4.5)   25 (20.25)   0.833 (0.675)
Plateau: Not Infected       175    170     5-0.5 (4.5)    25 (20.25)   0.147 (0.119)
Total                       300    300                            χ² = 2.941 (2.382)

The effect of the Yates correction is to introduce greater accuracy into the calculation and evaluation of the χ² statistic. In this case, the Yates correction has reduced the value of the calculated χ² statistic to the extent that it no longer exceeds the tabulated value of χ² with one degree of freedom. As such, the null hypothesis can no longer be rejected, and the researcher would have to conclude that there is no significant difference between the incidence of malaria infection on the flood plain and the plateau.
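Both the uncorrected and Yates-corrected statistics for the malaria example can be reproduced outside SPSS. As an illustrative sketch, SciPy's chi2_contingency function applies the continuity (Yates) correction to 2×2 tables by default and also returns the expected frequencies:

```python
from scipy.stats import chi2_contingency

# Observed frequencies from Table 5.7: rows = flood plain / plateau,
# columns = infected / not infected.
observed = [[20, 80], [25, 175]]

# Without the Yates correction (as in Table 5.8):
chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(round(chi2, 3))  # 2.941
print(dof)             # 1 degree of freedom: (2-1)(2-1)

# With the Yates continuity correction (the default for 2x2 tables, as in Table 5.9):
chi2_corr, p_corr, _, _ = chi2_contingency(observed, correction=True)
print(round(chi2_corr, 3))  # 2.382

# The expected frequencies match Table 5.7c: 15, 85, 30, 170.
print(expected)
```

Note that the function derives the expected frequencies from the row and column totals exactly as in the E = (Column total × Row total) / Overall total formula above.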
5.4 Conditions Necessary for Conducting a Chi-squared Test
When using chi-square a number of guidelines must be remembered:
Contingency tables must consist of at least two categories;
Where there are only two categories, the expected frequency in each category must not be less
than 5;
Where there are more than two categories, no category should have an expected frequency of less
than 1 and not more than one category in five should have an expected frequency of less than five;
Data must be in the form of frequencies (i.e. counted data in categories). The χ² statistic is best suited to comparing frequencies within nominal categories. It can also be applied to higher-order levels of measurement if the data are grouped into categories prior to analysis; it is not applicable to ungrouped interval-scale data;
No cell is allowed to have an expected frequency of less than 1. This requirement can sometimes be met through the amalgamation of rows and columns (i.e. fewer cells with more observations in each). However, be careful, as the regrouping of data can lead to a loss of information and the subtle differences between two data sets being obscured. Regrouping should therefore be avoided if at all possible, and larger sample sizes are recommended. In addition, the way that categories are constructed may determine whether or not significant associations are detected;
Samples are assumed to be independent (not applicable to dependent variables);
Random sampling is assumed (other sampling procedures can be considered as long as they are
proved to be unbiased);
Data samples must be discrete and unambiguous;
Frequencies must be absolute and not percentages of proportional values;
The question of the ‘tailedness’ of the alternative hypothesis does not arise in the context of the chi-squared test. Because of the manner of its execution, the direction of departure is immaterial.
From an examination of destination preferences for second homes it appears that coastal counties of England
and Wales are perceived as being more desirable holiday locations than inland counties. The results are
summarised below.
Of the 19 coastal counties, 14 have preference scores of more than 30 and only 5 have preference scores
of 30 or less. Of the 34 inland counties, 15 have high preference scores and 19 have low scores. Use the
chi-square test to decide whether there is in fact a significant difference at the 0.05 level between coastal
and inland counties in terms of their destination desirability. Report your final result below:
Activity 22:
Location Residential Desirability
Low Preference High Preference Total
Coastal Counties 5 14 19
Inland Counties 19 15 34
Total 24 29 53
In a survey commissioned by a TV travel program, 135 people were asked what their favourite foreign
holiday destination was. Some of the results are summarised in the contingency table below:
Use these sample results to test for association between gender and destination preference, using a
95% confidence level. When calculating the expected frequencies, check if the data meets the
requirements of the chi-square test. How can you re-categorise the data to make it meet the criteria for
the chi-squared test? Report your final result below:
Activity 23:
Company managers at Butlins are investigating the relationship between job satisfaction and the levels of
absenteeism in the firm. They believe that satisfied individuals are less likely to be absent from work than
those who are not satisfied. The results from a survey of 30 workers are displayed in a contingency table
below.
Calculate the value of χ2 for the difference between the observed and expected numbers. Is this difference
significant at the 0.05 level? Record your final result below:
Activity 24:
Absenteeism Job Satisfaction
Dissatisfied Happy Total
Absent from work 4 11 15
Not absent from work
10 5 15
Total 14 16 30
Subject   JobSatis   Absent
1         1          2
2         1          1
3         2          1
4         1          1
5         2          1
6         2          1
7         1          2
8         1          2
9         2          2
10        1          2
11        2          2
12        2          1
13        2          1
14        2          2
15        2          1
16        1          2
17        1          1
18        2          2
19        1          2
20        2          1
21        2          2
22        1          2
23        2          1
24        2          1
25        1          2
26        1          1
27        1          2
28        2          1
29        1          2
30        1          2
5.5 Using SPSS to Calculate Chi-Squared
Having considered how to calculate chi-squared manually, the aim of the following section is to highlight how to calculate chi-squared values using SPSS.
To start, we will repeat the Butlins job satisfaction exercise (Activity 24). Load the ‘Butlins1’ exercise file into SPSS. Label the columns and values as you have done in previous sessions.
To perform a chi-squared test in SPSS, move the mouse over Analyse and press the left mouse button. Move
the mouse over Descriptive Statistics and then Crosstabs.
The Crosstabs dialog box appears.
Move the mouse over Jobsatis and press the left mouse button. Press the top
arrow button so that Jobsatis is selected in the Row(s): box.
Move the mouse over Absent and press the left mouse button. Press the middle
arrow button so that Absent is selected in the Column(s) box.
Move the mouse over Statistics and press the
left mouse button. The Crosstabs: Statistics
dialog box appears.
Select the chi-square option
and then press Continue.
This takes you back to the Crosstabs dialog box. Move the mouse over Cells and press the left mouse
button.
The Crosstabs: Cell display dialog box
appears.
Make sure that Observed and Expected
counts are selected and then press Continue.
This will take you back to the initial Crosstabs
dialog box.
Press OK and SPSS will automatically
calculate the chi-square statistics and display
the results in the output window. The output
window will display a contingency table and
the following output.
How do these results compare to your manual output? Well, first of all you should notice that the Pearson
chi-square result gives you the χ2 statistic prior to revision by the Yates Correction (4.82). Second, the
Continuity Correction chi-square gives you the χ2 statistic as adjusted by the Yates Correction (3.34).

But, from the SPSS output, how do you infer the significance level?

Although the output looks daunting, the answer is quite simple. In the output below, the significance value
(Asymp. Sig. (two-tailed)) for the corrected χ2 statistic is .067. This value is greater than 0.05, which means
it is not significant at the 0.05 confidence level. This can also be reported as p>0.05 (not significant). Notice,
however, that the value is less than 0.1, which means it is significant at the 0.1 significance level, which can
alternatively be recorded as p<0.1 (significant). Basically, these are the same results as you should have
calculated manually.
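As a cross-check on the manual and SPSS figures, the same calculation can be sketched in a few lines of Python. This is purely illustrative and not part of the SPSS workflow; only the observed counts from the worked example are assumed, and only the standard library is used.

```python
# Cross-check of the chi-square output for the 2x2 job satisfaction /
# absenteeism table, using only the Python standard library.
import math

observed = [[4, 11],
            [10, 5]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count for each cell = (row total * column total) / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

def chi_square(obs, exp, yates=False):
    """Pearson chi-square; with yates=True, apply the continuity correction."""
    total = 0.0
    for o_row, e_row in zip(obs, exp):
        for o, e in zip(o_row, e_row):
            diff = abs(o - e) - (0.5 if yates else 0.0)
            total += diff ** 2 / e
    return total

pearson = chi_square(observed, expected)            # about 4.82
corrected = chi_square(observed, expected, True)    # about 3.35

# p-value for a chi-square statistic with 1 degree of freedom only:
# P(chi2 > x) = erfc(sqrt(x / 2)) when df = 1.
p_value = math.erfc(math.sqrt(corrected / 2))       # about 0.067

print(pearson, corrected, p_value)
```

The two statistics match the Pearson and Continuity Correction rows of the SPSS output, and the p-value matches the reported .067.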
Remember:
if the significance value (p) is <0.1 then the value is significant at the 0.1 significance level
(90%)
if the significance value (p) is <0.05 then the value is significant at the 0.05 significance level
(95%)
if the significance value (p) is <0.01 then the value is significant at the 0.01 significance level
(99%).
Remember however, that you should not switch between significance levels so that the null
hypothesis can be rejected. The safest rule is to pick a significance level before you start and stick
with it all the way through the test.
Accurately Reporting the Outcomes of the Chi-Square Test
When reporting the chi-square result a number of key elements must be included:
Specify suitable hypotheses. In this case:
H0: There is no significant difference between job satisfaction and levels of absenteeism
H1: There is a significant difference between job satisfaction and levels of absenteeism
The test statistic. Therefore in your write-up you must include what χ2 equals. In this example χ2 =
3.34.
The degrees of freedom. This is the number of rows minus 1, times the number of columns minus 1.
This value is actually given in the SPSS output. The value for degrees of freedom is placed between
the χ2 and the = sign and placed in brackets. In this example the degrees of freedom = 1, therefore χ2
(1) = 3.34.
As part of the report you must also state the probability. As highlighted above, this is done in relation
to whether your probability value was below 0.05 or 0.01 (and therefore significant) or above 0.05
(and therefore not significant). Here, you use the less than (<) or greater than (>) sign relative to the
criterion level, stating whether p<0.05 (significant), p<0.01 (significant) or p>0.05 (not significant).
Assuming a 95% confidence level in the above example, as p=0.067, we would write p>0.05 and place
this after the reporting of the χ2 value. Therefore χ2 (1) = 3.34, p (0.067)>0.05.
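The reporting convention above can be assembled mechanically. The small helper below is my own illustration (the function name is made up, not part of SPSS or the module materials); it takes the test statistic, degrees of freedom, p-value and chosen significance level, and produces the reporting string in the format described.

```python
# Illustrative helper that assembles the reporting convention described above:
# test statistic, degrees of freedom in brackets, and the p value compared
# against the chosen significance level (alpha).
def report_chi_square(chi2, df, p, alpha=0.05):
    comparison = "<" if p < alpha else ">"
    return f"chi2 ({df}) = {chi2:.2f}, p ({p:.3f}){comparison}{alpha}"

# At the 95% confidence level the result is not significant:
print(report_chi_square(3.34, 1, 0.067, alpha=0.05))
# chi2 (1) = 3.34, p (0.067)>0.05

# At the 90% confidence level the same result is significant:
print(report_chi_square(3.34, 1, 0.067, alpha=0.1))
# chi2 (1) = 3.34, p (0.067)<0.1
```

Note that the significance level must still be fixed before the test is run, as stressed above; the helper only formats the comparison.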
These elements must be incorporated into your text to ensure that your results are presented succinctly but
effectively. You can also include a table. Therefore, using the findings above, we could report the following.
Table 1: Job Satisfaction v Job Absenteeism

Category            Job Satisfaction
                  Dissatisfied   Happy    Totals
Absenteeism
Yes (observed)         4           11       15
  % of Total          13%          37%      50%
No (observed)         10            5       15
  % of Total          33%          17%      50%
Totals                14           16       30
‘Table 1 shows a breakdown of the distribution of respondents in terms of levels of job satisfaction and
levels of absenteeism (with percentages in brackets). A chi-squared test was used to determine whether
there was a significant difference between the two variables. A null hypothesis of no significant difference
and an alternative hypothesis of a significant difference were established, and a 95% confidence level
was assumed. No significant difference was found between job satisfaction and absenteeism (χ2 (1) =
3.34, p (0.067)>0.05). The null hypothesis of no significant difference can therefore not be rejected.’
Note if we had assumed a 90% confidence level from the start we would write:
‘Table 1 shows a breakdown of the distribution of respondents in terms of levels of job satisfaction and
levels of absenteeism (with percentages in brackets). A chi-squared test was used to determine whether
there was a significant difference between the two variables. A null hypothesis of no significant difference
and an alternative hypothesis of a significant difference were established, and a 90% confidence level
was assumed. A significant difference was found between job satisfaction and absenteeism (χ2 (1) =
3.34, p (0.067)<0.1). The null hypothesis of no significant difference can therefore be rejected.’
Load the ‘Chi Square Exercises file’ file into SPSS. This file contains the data relating to the two additional
practical exercises that you completed by hand. Perform two chi-squared tests and compare your output
with your manual calculations. Note that the Excel file contains two spreadsheets that you will need to
access. Import into SPSS in the normal way, but select the spreadsheet you wish to use from the Opening
Excel Data Source dialog box (as below). Record the results in your log book.
The following coding schemes have been used:
Residential Desirability: Area: Coastal = 1; Inland = 2. Score: High = 1; Low = 2.
TV Survey: Gender: Female = 1; Male = 2. Location: Greece = 1; Spain = 2; Thailand = 3; Turkey = 4; USA = 5. Regroup: Europe = 1; Asia = 2; USA = 3.
Activity 25:
Referring to the variables in the Dataset file, identify a series of relationships that could be examined using the chi-squared test. Remember you need to focus on category/nominal data for this exercise.
Activity 26:
Using the Dataset file conduct 3 appropriate chi-squared tests. Please complete the following tables, making clear reference to the SPSS output. For each test, identify a research scenario that you are using the test to explore.
Table 28: Chi-Squared 1
Table 29: Chi-Squared 2
Activity 27:
Chi-Squared Test
Research Scenario
Row Variable
Column Variable
Null Hypothesis
Alternative Hypothesis
Comment on the SPSS Output
Chi-Squared Test
Research Scenario
Row Variable
Column Variable
Null Hypothesis
Alternative Hypothesis
Comment on the SPSS Output
Using the Dataset file conduct 3 appropriate chi-squared tests. Please complete the following tables, making clear reference to the SPSS output. For each test, identify a research scenario that you are using the test to explore.
Table 30: Chi-Squared 3
Activity 27:
Chi-Squared Test
Research Scenario
Row Variable
Column Variable
Null Hypothesis
Alternative Hypothesis
Comment on the SPSS Output
Notes:
Correlation
Section 6
Learning Outcomes
At the end of this session, you should be able to:
Explain the rationale for the use of correlation analysis
Understand the basic conditions and criteria involved in the use of correlation analysis
Use SPSS to calculate the correlation coefficient for both the Pearson’s Product Moment Correlation Coefficient and Spearman’s Rank Correlation Coefficient
Interpret computer-generated SPSS correlation analysis output
Data Analysis for Research Correlation
6.0 Introduction
The aim of this session is to help you understand the importance of correlation in statistical analysis. By the
end of this session you should understand the meaning of correlation, how to check if data fulfils assumptions
for parametric and non-parametric testing, and how to perform correlation statistics on SPSS.
6.1 The Meaning of Correlation
Correlation is one of the most widely used statistical techniques. It is a means to measure the degree of
association between two variables, that is, the extent to which changes in values of one variable are matched
by changes in another variable. For example, we would tend to expect that, other things being equal, the
market price of houses increases as the size of the house increases, that is bigger houses cost more. The
size and price are correlated. The amount of water flowing down a river would be expected to be closely
related to the amount of rain which has recently fallen on the catchment. The rainfall and river flow are
correlated. We may have data on crime rates and on unemployment in a number of areas. It may be that
those areas with a high crime rate also, in general, have a higher rate of unemployment. These variables
are also correlated.
Correlation may measure the extent to which higher values of one variable are matched with higher values
of the other (this is called positive correlation), or it can measure the extent to which higher values of one
variable are matched with lower values of the other (this is called negative correlation). For example,
you might find a positive correlation between the amount of beer you drank the night before and the number
of pneumatic drills you think are in your head the next day. However, there might be a negative correlation
between the number of pints and your ability to perform particular tasks.
To repeat, correlation is a measure of association; it says nothing whatsoever about cause. Although
variation in house size may cause variation in house price, and variation in amounts of rainfall may cause
variation in river flow, there has been a long, political as well as sociological, argument about whether
unemployment causes crime. It is possible to find sets of data which have absolutely nothing in common,
except that they are correlated.
Remember:
If higher values of one variable are associated with higher values of the other variable, then the
two variables are positively correlated.
If higher values of one variable are associated with lower values of the other variable, then the
two variables are negatively correlated.
There are several ways to measure correlation, using a range of different indices for different types of data.
When variables are parametric in nature (e.g. interval/ratio data), by far the most common measure of
correlation is the Pearson’s Product Moment Correlation Coefficient, often referred to as Pearson’s r.
Where data is ordinal (one or both variables are not measured on an interval scale), or is not normally
distributed, or when other assumptions of the Pearson correlation coefficient are violated, we use the Spearman
Correlation Coefficient, referred to as Spearman’s rs.
Referring to the variables in the Dataset file and your accompanying data set guide, attempt to
complete the following diagram, listing variables that could be correlated using the Pearson Product
Moment Correlation or the Spearman Rank Correlation Coefficient.
Activity 28:
Pearson Product Moment Correlation
Spearman Rank Correlation Coefficient
6.2 Identifying Signs of Correlation in the Data
Whatever type of data you are using, an important first stage in measuring correlation is to obtain some idea
of whether correlation may be present in the data. The simplest way to do this is to plot the variables and look
carefully at the graph.
Figure 6.1 shows that the two variables are clearly related in some way: they are strongly correlated. The
graph slopes up to the right, that is, there is an association between higher values, so the correlation is
positive.
Figure 6.1: Strong Positive Correlation
In the case of Figure 6.2, the graph slopes down to the right, thereby implying a negative relationship, meaning
that as one variable increases, the other decreases.
Figure 6.2: Strong Negative Correlation
In addition to positive and negative relationships we sometimes find non-linear or curvilinear relationships, in
which the shape of the relationship between the two variables is not straight, but curves at one or more points
(see Figure 6.3).
Figure 6.3: Non-linear or Curvilinear Relationship
It is important to identify if the relationship is non-linear as:
It would affect the choice of correlation measurement technique;
If the wrong technique was used there would be a spurious result.
Overall, scatter diagrams are useful aids in the preliminary steps of identifying correlation and allow three
aspects of a relationship to be discerned: whether it is linear; the direction of the relationship (positive or
negative); and the strength of the relationship. The amount of scatter is indicative of the strength of the
relationship.
6.3 Correlation Analysis
The correlation coefficient (r) measures the linear relationship between the variables. Every correlation
coefficient will lie somewhere on the scale of possible values, that is, between -1 and +1 inclusive. A coefficient
of -1 or +1 would indicate a perfect relationship, negative or positive respectively, between the two variables.
The complete absence of a relationship would produce a computed coefficient of zero. The closer the correlation
coefficient is to 1 (either positively or negatively) the stronger the relationship between the two variables. The
nearer the correlation coefficient is to zero, the weaker the relationship. These ideas are displayed in Figure
6.4.
Figure 6.4: The Strength and Direction of Correlation Coefficients
If the correlation coefficient is 0.85, this would indicate a strong positive relationship between the two
variables, whereas a correlation coefficient of 0.28 would denote a weak positive relationship. Similarly, -0.75
and -0.36 would be indicative of strong and weak negative relationships respectively.
However, what is a large correlation? Cohen and Holliday (1982) suggest the following: 0.19 and below
is very low; 0.20 to 0.39 is low; 0.40 to 0.69 is modest; 0.70 to 0.89 is high; and 0.90 to 1 is very high. However,
these measures are a rule of thumb and should not be regarded as definitive indications. Caution
is also required when comparing computed correlation coefficients. For example we can say that a computed
correlation coefficient of -0.60 is larger than one of -0.30, but we cannot say that the relationship is twice as
strong. In order to understand this more clearly, we need to refer to the coefficient of determination (R2).
This is quite simply the square of the correlation coefficient multiplied by 100. It provides us with an indication
of how far variation in one variable is due to the other. Thus if r = -0.6, then R2 = 36 per cent. This means that
36 per cent of the variance in one variable is due to the other. When r = -0.3, then R2 will be 9 per cent. Thus,
although an r of -0.6 is twice as large as one of -0.3, it cannot be said that the former relationship is twice as
strong as the latter, because four times more variance is being accounted for by an r of -0.6 than by one of
-0.3 (Bryman and Cramer, 1997). Referring to the coefficient of determination can also influence your
interpretation of r. For example, an r value of 0.75 may seem quite high, but it would only mean that 56 per
cent of the variance in y can be attributed to x. In other words, 44 per cent of the variance in y is due to
variables other than x.
[Figure 6.4 shows a scale running from -1 (perfect negative correlation) through 0 (no correlation) to +1 (perfect positive correlation), with the strength of the relationship weakening towards zero.]
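The arithmetic in the paragraph above can be captured in a one-line helper (the function name here is my own, for illustration only):

```python
# The coefficient of determination from the text: R^2 is the squared
# correlation coefficient expressed as a percentage of shared variance.
def coefficient_of_determination(r):
    return r ** 2 * 100  # per cent of variance in y accounted for by x

print(coefficient_of_determination(-0.6))   # about 36 per cent
print(coefficient_of_determination(-0.3))   # about 9 per cent
print(coefficient_of_determination(0.75))   # 56.25 per cent
```

This makes the asymmetry clear: doubling r from -0.3 to -0.6 quadruples the variance explained, which is why one correlation cannot be described as "twice as strong" as another.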
CARS [No. of Cars]   PERSONS [No. of Persons]   INCOME [Income (Thousands)]   AGE [Age]   TRAVEXP [Travel Expenditure]

CARS  PERSONS  INCOME  AGE  TRAVEXP
 0       2        9     25     10
 2       3       25     37     50
 1       1       13     23     20
 2       4       30     30     60
 2       2       50     43     70
 0       1        4     18      5
 1       3       30     27    100
 2       2       43     55     30
 1       1       10     71     15
 1       3       50     20     20
 2       2       37     41     50
 1       2       25     51     90
 1       5       30     45     40
 2       4       50     40     80
 3       2       75     54    150
 1       3       45     34     50
 1       4       50     67     30
 0       3       20     44     20
 0       4       13     34     15
 1       3       35     54     50
 2       1       40     65     50
 1       1       75     45     30
 0       2       10     34     10
 1       2       50     26     30
 2       3       30     65     70
 3       4      100     32    100
 1       3       40     46     60
 2       3       30     55     50
 1       2       30     65     20
6.4 Using SPSS to Measure Correlation: Pearson’s Correlation Coefficient
The most commonly used (and misused) measure of correlation is Pearson’s Product Moment Correlation
Coefficient. This is a powerful parametric measure, which can be used to test for significance and reliability
as long as its assumptions are satisfied. The first two assumptions are:
The relationship between the variables is linear;
The variables are interval or ratio scale measurements.
Before we use Pearson’s Correlation Coefficient to examine possible correlations in the Dataset file, let me
illustrate correlation through a simple example. Load the Excel file ‘Correlation’ into SPSS. The details of
this data file are highlighted below.
The above table refers to factors that might influence the level of car ownership in individual households. If
you wanted to examine the relationship between the different variables, the first stage would be to produce a
series of scatterplots to highlight the direction and strength of any possible relationships. Let us examine
correlation through a specific example. In this case, we will look at the relationship between the number of
persons in the household (Persons) against the number of cars (Cars).
To do so, click Graphs, move the mouse over Legacy Dialogs
and then select Scatter/Dot.
The Scatterplot dialog box appears.
Ensure that Simple is selected and then press Define.
The Simple Scatterplot dialog box appears.
Move the mouse over Cars (Number
of Cars) and press the left mouse
button. Move the mouse over the top
arrow and press the left mouse
button so that Cars is selected in the
Y Axis: box.
Move the mouse over Persons
(Number of People) and press the
left mouse button. Move the mouse
over the centre arrow and press the
left mouse button so Persons is
selected in the X Axis: box.
Press OK.
A scatterplot showing the relationship between the two variables appears.
The scatterplot shows no clear linear relationship, indicating a very weak correlation between the two
variables. This can be confirmed by calculating the correlation coefficient.
To do so, move the mouse over Analyse and press the left mouse button. Move the mouse over Correlate
and then over Bivariate and press the left mouse button again. The Bivariate Correlations dialog box
appears.
Move the mouse over Cars and press the left mouse button. Move the mouse over the top arrow so that Cars
is selected in the Variables: box.
Repeat the same procedure for Persons. Make sure that the Pearson correlation coefficient and a two-
tailed test are selected. A two-tailed test is used because we do not know in which direction the relationship
between the two variables will run, and we are looking for either a positive or a negative correlation. Press
OK.
SPSS produces a matrix of correlation coefficients in the output window. In this case the following output is
produced:
As you can see from the output, the value of r for the two variables equals 0.129, which indicates a very weak
correlation. You should also notice that the probability value (p) is also not significant (p>0.05).
As with your previous exercises, you should also provide null and alternative hypotheses. In this case:
Null Hypothesis
There is no significant association between levels of car ownership and the number of persons
in the household.
Alternative Hypothesis [Two-Tailed]
There is a significant association between levels of car ownership and the number of persons in
the household.
Note that this alternative hypothesis is two-tailed as it is not specifying a specific direction (for example a
positive or negative association). An initial scatterplot of the data would reveal any possible association
between the data, and allow you to specify a one-tailed test. In this case, a one-tailed test would look like this:
Alternative Hypothesis [One-Tailed]
There is a positive association between levels of car ownership and the number of persons in the
household.
Referring back to the SPSS output for our initial correlation:
The Pearson Correlation test statistic = .129. The output indicates that this is not significant (p=.503, >0.05).
A conventional way of reporting these figures would be as follows: r = .129, n = 29, p>0.05.
The results indicate that there is no significant association between levels of car ownership and number of
persons in the household. Note that when using correlation you are examining the level of association,
and this should be clearly reflected in your hypotheses.
Let us now repeat this procedure to examine the relationship between additional variables within the dataset.
In this case we will look at car ownership against household income.
First create a scatterplot between the car ownership and income. Your scatterplot should look similar to the
graph below:
The scatterplot clearly indicates that there is a linear relationship between the two variables, and that there is
evidence of a positive correlation: in this case, as household income increases so does the level of car
ownership. Having established the existence of a linear relationship, now calculate the correlation coefficient.
In the Bivariate Correlations dialog box, specify a one-tailed test, as in this case we are expecting a positive
correlation - thus indicating a direction. SPSS will generate the following output.
Correlations
Cars Income
Cars Pearson Correlation 1 .665(**)
Sig. (1-tailed) . .000
N 29 29
Income Pearson Correlation .665(**) 1
Sig. (1-tailed) .000 .
N 29 29
** Correlation is significant at the 0.01 level (1-tailed).
The Pearson Correlation test statistic =0.665. SPSS indicates with ** that it is significant at the 0.01 level for
a one-tailed prediction. The actual p value is shown to be 0.000. A conventional way of reporting these figures
would be as follows: r=0.665, n=29, p<0.01. The results indicate that as household income increases, car
ownership also increases, which is a positive correlation. As the r value reported is positive and p <0.01, we
can state that there is a positive correlation between our two variables and that the null hypothesis can be
rejected.
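Both coefficients can be cross-checked outside SPSS. The sketch below is illustrative rather than part of the module workflow: it recomputes Pearson's r with the standard product-moment formula, using the data transcribed from the correlation table above (n = 29); the list and function names are my own.

```python
# Recomputing the two Pearson coefficients reported by SPSS from the
# car-ownership data table (n = 29), standard library only.
import math

cars    = [0, 2, 1, 2, 2, 0, 1, 2, 1, 1, 2, 1, 1, 2, 3, 1, 1, 0, 0, 1, 2, 1, 0, 1, 2, 3, 1, 2, 1]
persons = [2, 3, 1, 4, 2, 1, 3, 2, 1, 3, 2, 2, 5, 4, 2, 3, 4, 3, 4, 3, 1, 1, 2, 2, 3, 4, 3, 3, 2]
income  = [9, 25, 13, 30, 50, 4, 30, 43, 10, 50, 37, 25, 30, 50, 75, 45, 50, 20, 13, 35,
           40, 75, 10, 50, 30, 100, 40, 30, 30]

def pearson_r(x, y):
    """Pearson Product Moment Correlation Coefficient."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    syy = sum(v * v for v in y)
    sxy = sum(a * b for a, b in zip(x, y))
    return (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

print(round(pearson_r(cars, persons), 3))  # about 0.129 - the very weak correlation
print(round(pearson_r(cars, income), 3))   # about 0.665 - the positive correlation
```

The two values agree with the r = .129 and r = 0.665 reported in the SPSS output above.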
Please cut and paste your scatterplot below
and rescale accordingly
Scatterplot (Please note any evidence of a relationship. Is it linear or non-linear? Is it positive or negative?)
Null Hypothesis:
Alternative Hypothesis (one tailed):
Value of r?
Probability Value?
Please provide a brief summary of your findings here:
Examine the remaining variables in the dataset and record your observations, using the tables below that are also in your log book.
Table 31: Number of cars against age
Activity 29:
Examine the remaining variables in the dataset and record your observations, using the tables below that are also in your log book.
Table 32: Number of cars against income
Table 33: Number of cars against monthly travel expenses
Please cut and paste your scatterplot below
and rescale accordingly
Scatterplot (Please note any evidence of a relationship. Is it linear or non-linear? Is it positive or negative?)
Null Hypothesis:
Alternative Hypothesis (one tailed):
Value of r?
Probability Value?
Please provide a brief summary of your findings here:
Activity 29:
Please cut and paste your scatterplot below
and rescale accordingly
Scatterplot (Please note any evidence of a relationship. Is it linear or non-linear? Is it positive or negative?)
Null Hypothesis:
Alternative Hypothesis (one tailed):
Value of r?
Probability Value?
Please provide a brief summary of your findings here:
Using the Dataset file, conduct two Pearson Product Moment Correlation Coefficients on appropriate variables and record your answers in the tables below, which can be found in your log book. For each test, identify a research scenario that you are using the test to explore.
Table 34: Correlation 1
Activity 30:
Research Scenario
Please cut and paste your scatterplot below
and rescale accordingly
Scatterplot (Please note any evidence of a relationship. Is it linear or non-linear? Is it positive or negative?)
Null Hypothesis:
Alternative Hypothesis (one tailed):
Value of r?
Probability Value?
Please provide a brief summary of your findings here:
Using the Dataset file, conduct two Pearson Product Moment Correlation Coefficients on appropriate variables and record your answers in the tables below, which can be found in your log book. For each test, identify a research scenario that you are using the test to explore.
Table 35: Correlation 2
Activity 30:
Research Scenario
Please cut and paste your scatterplot below
and rescale accordingly
Scatterplot (Please note any evidence of a relationship. Is it linear or non-linear? Is it positive or negative?)
Null Hypothesis:
Alternative Hypothesis (one tailed):
Value of r?
Probability Value?
Please provide a brief summary of your findings here:
6.5 Non-Parametric Correlation: Spearman’s Rank Correlation Coefficient
It is often the case that the data available do not fit the requirements for parametric testing. In this case, there
is a non-parametric correlation measure available. Spearman’s Rank Correlation Coefficient is mathematically
derived from Pearson’s coefficient, but instead of using the actual data values it uses rank or ordinal data. The
Spearman correlation coefficient is known as rs. The main assumptions for the use of Spearman’s rank
correlation are:
The relationship between the variables is monotonic, that is, y consistently increases as x
increases, or y consistently decreases as x increases. A linear relationship is monotonic, but a
monotonic relationship is not necessarily linear.
The variables are ordinal (ranks) or are ranked interval or ratio scale measurements.
To highlight the use of the Spearman’s rank correlation, type the data table
into SPSS. The data refers to a survey of workers in a London hotel. The
manager believed that the employee commitment to customer care policies
was influenced by overall job satisfaction. The data in the table is ranked
for Commitment (Commit) (1=High Commitment and 4 = Poor
Commitment) and Job Satisfaction (Satis) (1= High Satisfaction and 4=Low
Satisfaction)
Use the same procedure starting on page 6-196, to open the Bivariate
Correlations dialog box.
Select both Commit and Satis in the
Variables: box. Instead of Pearson’s
r, make sure that the Spearman
Correlation Coefficient is
selected. Make sure that the one-
tailed test is also selected. This is
because the manager believes that employee commitment increases with job satisfaction. This therefore implies a
direction in the alternative hypothesis making it a one-tailed test.
Commit  Satis
 1.00   1.00
 2.00   3.00
 1.00   2.00
 4.00   3.00
 4.00   4.00
 1.00   1.00
 1.00   2.00
 1.00   2.00
 2.00   1.00
 4.00   4.00
 3.00   4.00
 4.00   4.00
 1.00   1.00
 1.00   2.00
 1.00   2.00
 2.00   2.00
 1.00   1.00
 3.00   3.00
 4.00   4.00
 4.00   3.00
 1.00   1.00
 1.00   2.00
 2.00   1.00
 3.00   4.00
 1.00   1.00
Press OK and SPSS will automatically calculate the value of the Spearman’s rank correlation coefficient. In
this case, the following output is produced.
As you can see from the output, there is a strong positive correlation between the two variables (0.78). The
result is also significant (p<0.01) and the manager can be confident at the 99% confidence level that
commitment increases with job satisfaction. The positive correlation is also reflected in a scatterplot of the
two variables.
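With tied ranks, Spearman's rs is equivalent to Pearson's coefficient calculated on the ranks, which is the approach SPSS takes. The sketch below, using only the standard library and the hotel survey data typed in above, recomputes the coefficient this way; the function names are illustrative.

```python
# Recomputing Spearman's rs for the hotel survey data: rank each variable
# (averaging ranks across ties), then apply the Pearson formula to the ranks.
import math

commit = [1, 2, 1, 4, 4, 1, 1, 1, 2, 4, 3, 4, 1, 1, 1, 2, 1, 3, 4, 4, 1, 1, 2, 3, 1]
satis  = [1, 3, 2, 3, 4, 1, 2, 2, 1, 4, 4, 4, 1, 2, 2, 2, 1, 3, 4, 3, 1, 2, 1, 4, 1]

def rank_with_ties(values):
    """Assign 1-based ranks, sharing the average rank across tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rs(x, y):
    """Spearman's rs = Pearson's r computed on the ranks."""
    rx, ry = rank_with_ties(x), rank_with_ties(y)
    n = len(rx)
    sx, sy = sum(rx), sum(ry)
    sxx = sum(a * a for a in rx)
    syy = sum(b * b for b in ry)
    sxy = sum(a * b for a, b in zip(rx, ry))
    return (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

print(round(spearman_rs(commit, satis), 2))  # about 0.78, matching the SPSS output
```

The rank-then-Pearson approach handles the many ties in this four-point data correctly, which the simple d-squared formula for Spearman's rank does not.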
© Dr Andrew Clegg p. 6-208
Data Analysis for Research Correlation
Using the Dataset file, conduct two Spearman Rank Correlation Coefficients on appropriate variables and record your answers in the tables below, which can be found in your log book. For each test, identify a research scenario that you are using the test to explore.
Table 36: Correlation 3
Activity 31:
Research Scenario
Please cut and paste your scatterplot below
and rescale accordingly
Scatterplot (Please note any evidence of a relationship. Is it linear or non-linear? Is it positive or negative?)
Null Hypothesis:
Alternative Hypothesis (one tailed):
Value of r?
Probability Value?
Please provide a brief summary of your findings here:
Using the Dataset file, conduct two Spearman Rank Correlation Coefficients on appropriate variables and record your answers in the tables below, which can be found in your log book. For each test, identify a research scenario that you are using the test to explore.
Table 37: Correlation 4
Activity 31:
Research Scenario
Please cut and paste your scatterplot below
and rescale accordingly
Scatterplot (Please note any evidence of a relationship. Is it linear or non-linear? Is it positive or negative?)
Null Hypothesis:
Alternative Hypothesis (one tailed):
Value of r?
Probability Value?
Please provide a brief summary of your findings here:
Useful Reading
Section 7
7.0 Useful Reading
BRYMAN, A. AND CRAMER, D. (2001), Quantitative Data Analysis with SPSS Release 10 for Windows,
Routledge, London.
BUGLEAR, J. (2000), Stats to Go, Butterworth Heinemann, London.
CLARK, M., RILEY, M., WILKIE, E. AND WOOD, R. (1998), Researching and Writing Dissertations in
Hospitality and Tourism, Thomson Business Press, London.
DANCEY, C. AND REIDY, J. (2002), Statistics Without Maths for Psychology, Second Edition, Prentice
Hall, London.
EBDON, D. (1985), Statistics in Geography, Blackwell, London.
FIELD, A. (2009), Discovering Statistics Using SPSS, Third Edition, Sage, London.
FINN, M., ELLIOTT-WHITE, M. AND WALTON, M. (2000), Tourism and Leisure Research Methods,
Longman, London.
GHAURI, P. AND GRONHAUG, K. (2002), Research Methods in Business Studies, FT Prentice Hall,
London.
HINTON, P. (2004), Statistics Explained, Routledge, London.
HINTON, P., BROWNLOW, C., McMURRAY, I. AND COZENS, B. (2004), SPSS Explained, Routledge,
London.
KINNEAR, P. AND GRAY, C. (1999), SPSS for Windows Made Simple, Psychology Press, London.
KITCHIN, R. AND TATE, N. (2000), Conducting Research into Human Geography, Prentice Hall, London.
MALTBY, J. AND DAY, L. (2002), Early Success in Statistics, Prentice Hall, London.
McQUEEN, R. AND KNUSSEN, C. (2002), Research Methods for Social Science, Prentice Hall, London.
MICROSOFT PRESS (1997), Microsoft Access 97 - At a Glance, Microsoft Press, Washington.
MULBERG, J. (2002), Figuring Figures, Prentice Hall, London.
ROGERSON, P. (2001), Statistical Methods for Geography, Sage Publications, London.
SAUNDERS, M., LEWIS, P. AND THORNHILL, A. (2003), Research Methods for Business Students,
Third Edition, FT Prentice Hall, London.
Appendices
Section 8