Introduction to Statistical Research Methods (829N1)
Maria Savona
SPRU (Science Policy Research Unit)
Lecture 2: Samples & Data
Basics on sampling
From last week’s lecture
Mr. Smith owns a company and wants to know the tastes of his customers. Would he interview all of them?
I need to test the quality of my products. Should I test all of them? How many?
A pharmaceutical company developed a revolutionary drug that is supposed to reduce patients’ fever. How do they test if it works?
Sampling
What is sampling?
In statistics and survey methodology, sampling is concerned with the selection of a subset of individuals from within a statistical population to estimate characteristics of the whole population.
[Diagram: a sample is drawn from the population; inference runs from the sample back to the population.]
Why do we need sampling? It has some advantages:
– It is faster, cheaper and at times the only feasible choice.
– It allows us to accurately estimate general features of a population using data collected from a tiny fraction of the total.
How do we sample from a population?
Sampling process
1. Define the population
2. Specify a sampling frame
3. Specify a sampling method
4. Determine the sample size
5. Collect the data
1) Defining the population
The population is the set of all possible cases of interest (e.g. firms, people, students, countries, patents, etc.).
A sample requires a clearly defined population, from which to draw the sample.
Requires conceptual clarity
e.g. consider the question: ‘How many hours of study do students have each week?’ But who are the ‘student’ population?
– Anyone who sees themselves as a student?
– Or restricted to those registered in higher education?
– Anywhere in the UK? Restricted to Sussex?
– Full time only? Home and overseas? Etc.
2) Specifying a sampling frame
Once the population of interest is defined, we have to specify how we access that population
A sampling frame is the source from which a sample is drawn. It is a list of all those within a population who can be sampled
Population | Sampling frame
Sussex University students | List of registered students at Sussex University
All the firms based in London | List of firms located in London from Companies House
All the patents owned by a company | Patent database from WIPO
The frame can be narrowed to match a more specific population:
All the firms based in London working in the pharmaceutical sector | Selecting those companies with some specific SIC (Standard Industrial Classification) codes
All the patents owned by a company which are about semiconductors | Selecting the patents which have some specific IPC (International Patent Classification) codes
[Diagram: the target population, the sampling frame, and the sample (drawn from the sampling frame).]
Sampling frame definition is very important; errors in the sampling frame affect the representativeness of the sample.
Sometimes the sampling frame does not completely match the population:
– e.g. if we are conducting household surveys on poverty in the Brighton area, our sampling frame misses homeless people and thus introduces bias, because homeless people are disproportionately poor.
Other potential issues with sampling frames:
– Missing elements: some members of the population are not included in the sampling frame
– Foreign elements: the sampling frame includes some non-members of the population
– Duplicate elements: some elements could be included more than once
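The three frame problems listed above can be checked mechanically when both the population and the frame are available as lists of identifiers. The following is a minimal sketch under that assumption; the IDs and the `audit_frame` helper are hypothetical and only illustrate the idea.

```python
from collections import Counter

# Hypothetical target population and sampling frame (IDs invented for illustration).
population = {"s01", "s02", "s03", "s04", "s05"}
frame = ["s01", "s02", "s02", "s04", "x99"]  # a list, so duplicates are possible


def audit_frame(population, frame):
    """Return the missing, foreign and duplicate elements of a sampling frame."""
    frame_ids = set(frame)
    counts = Counter(frame)
    return {
        # Members of the population that the frame does not list.
        "missing": sorted(population - frame_ids),
        # Frame entries that do not belong to the population.
        "foreign": sorted(frame_ids - population),
        # Frame entries listed more than once.
        "duplicates": sorted(u for u, c in counts.items() if c > 1),
    }


report = audit_frame(population, frame)
print(report)
```

In practice the full population is rarely known, which is why these problems are usually detected only partially, for instance by comparing the frame against a second source.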
3) Specifying a sampling method
The sampling method is the way in which the sample units are to be selected.
Probability sampling: the probability of extraction of a population unit is known. It allows inference from the sample to the population.
a) Simple random sampling
b) Systematic sampling
c) Stratified sampling
Non-probability sampling: these methods do not allow statistical inference, but they are still informative and allow some kind of generalization.
d) Quota sampling
e) Convenience sampling
f) Snowball sampling
a) Simple random sampling
A subset of individuals (a sample) chosen from a larger set (a population). Each individual is chosen randomly and entirely by chance, such that each individual has the same probability of being chosen at any stage during the sampling process.
Imagine a school with 1000 students and you want to select 100 for further study. You could select them randomly by pulling their names out of a hat. This random process means that each student has an equal chance (or probability) of being selected.
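The "names out of a hat" procedure can be sketched in a few lines of Python. The student IDs are hypothetical, and the fixed seed is there only to make the sketch reproducible.

```python
import random

# Hypothetical sampling frame: 1000 student IDs.
students = [f"student_{i:04d}" for i in range(1, 1001)]

rng = random.Random(42)  # fixed seed only so the sketch is reproducible

# random.sample draws without replacement, so no student is picked twice
# and every student has the same chance of ending up in the sample.
sample = rng.sample(students, k=100)

print(len(sample), len(set(sample)))
```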
b) Systematic sampling
The population (sampling frame) units are sorted according to some characteristic.
– The first sample unit is extracted randomly
– The other units are extracted following a step n
[Diagram: list of all postgraduates in social sciences (A, B, C, D, E, F, G, …). You start at a random point and pick every nth case from the list to form the sample.]
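A minimal sketch of systematic sampling, assuming the frame is already sorted and that the step is taken as the frame length divided by the desired sample size (a common convention, not stated in the slides). The `systematic_sample` helper is hypothetical.

```python
import random


def systematic_sample(frame, n, rng=random):
    """Pick a random start within the first step, then every step-th unit."""
    step = len(frame) // n
    if step < 1:
        raise ValueError("sample size cannot exceed the frame size")
    start = rng.randrange(step)  # random starting point in [0, step)
    # Slice from the start in jumps of `step`; trim in case the frame
    # length is not an exact multiple of the sample size.
    return frame[start::step][:n]


frame = list(range(1000))  # hypothetical sorted frame of 1000 units
sample = systematic_sample(frame, 100, random.Random(1))
print(sample[:5])
```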
c) Stratified sampling
The sampling frame is divided into sub-groups (strata) with respect to some relevant population characteristics (e.g. gender, age, etc.) so that units are relatively similar within each stratum and different across strata.
Stratified sampling consists of performing simple random sampling within each stratum, so that all the sub-groups are adequately represented.
For example, we have a population of 1000 people, 500 males and 500 females, and we would like to extract a sample of 100 people. Stratified by gender, our sample will include 50 randomly selected males and 50 randomly selected females.
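The gender example (1000 people, 500 males and 500 females, sample of 100) can be sketched as follows. The `stratified_sample` helper and the way people are represented are assumptions made for illustration.

```python
import random


def stratified_sample(units, stratum_of, sizes, rng=random):
    """Draw a simple random sample of sizes[s] units within each stratum s."""
    strata = {}
    for unit in units:
        strata.setdefault(stratum_of(unit), []).append(unit)
    sample = []
    for name, size in sizes.items():
        # Simple random sampling (without replacement) inside the stratum.
        sample.extend(rng.sample(strata[name], size))
    return sample


# Hypothetical population: 500 males and 500 females, as (gender, id) pairs.
people = [("male", i) for i in range(500)] + [("female", i) for i in range(500)]

sample = stratified_sample(
    people,
    stratum_of=lambda person: person[0],
    sizes={"male": 50, "female": 50},
    rng=random.Random(0),
)
print(len(sample))
```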
[Diagram: list of all social science postgraduates sorted by department (SPRU, Sociology, Geography, History, etc.). A separate random sample is drawn from each department (stratum).]
How many from each stratum?
Total number of students: N = 1000. Sample size: n = 50. Number of students at SPRU: 80.
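One common answer to "how many from each stratum?" is proportional allocation, where each stratum gets the same share of the sample as it has of the population. The slide does not show which rule it uses, so the sketch below assumes proportional allocation; the function name is hypothetical.

```python
def proportional_allocation(stratum_size, population_size, sample_size):
    """Sample units for a stratum under proportional allocation (an assumption)."""
    return sample_size * stratum_size / population_size


# Figures from the slide: N = 1000, n = 50, 80 students at SPRU.
spru_units = proportional_allocation(80, 1000, 50)
print(spru_units)  # 4.0
```

Under this assumption, SPRU holds 8% of the students and therefore contributes 8% of the 50 sampled students, i.e. 4.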
Probability sampling allows a precise and accurate estimation of population parameters. However, in some cases probability sampling is not possible. In these cases, non-probability sampling is common practice.
d) Quota sampling
Quota sampling is similar to stratified sampling: the population is segmented into sub-groups (strata). Quota sampling does not need a sampling frame for each stratum, since extraction does not follow a probabilistic rule.
Quota sampling only requires that the same proportions apply to the sample. For example, an interviewer is told to sample 50 females and 50 males between the ages of 18 and 30.
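A rough sketch of the interviewer example (50 females and 50 males aged 18 to 30). Respondents are accepted in the order they happen to arrive until each quota is full, so no sampling frame and no random draw is involved. The stream of passers-by and the `quota_sample` helper are invented for illustration.

```python
def quota_sample(respondents, quotas, group_of):
    """Accept respondents in arrival order until every quota is filled."""
    remaining = dict(quotas)
    sample = []
    for person in respondents:
        group = group_of(person)
        if remaining.get(group, 0) > 0:
            sample.append(person)
            remaining[group] -= 1
        if not any(remaining.values()):
            break  # all quotas filled
    return sample


def eligible_group(person):
    """Gender if the person is aged 18 to 30, otherwise None (not eligible)."""
    return person["gender"] if 18 <= person["age"] <= 30 else None


# Hypothetical stream of passers-by with alternating gender and varying age.
passers_by = [
    {"gender": "female" if i % 2 == 0 else "male", "age": 16 + i % 30}
    for i in range(400)
]

sample = quota_sample(passers_by, {"female": 50, "male": 50}, eligible_group)
print(len(sample))
```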
e) Availability (convenience) sampling
As the name suggests, the sample is merely based on those who are “easy to find”:
– a local factory to provide a sample of workers
– this class to provide a sample of students
– a bus station near my house to provide a sample of users of public transport
– university students used to provide a sample of consumers
Some studies do not need a representative sample and the aim is to show a methodology or to test a theory.
f) Snowball Sampling
Uses initial respondents to contact new respondents.
A useful option where relevant contacts are hard to identify, e.g. where the roles and responsibilities of different personnel in a company are unclear, or where you’re interested in surveying a minority population but do not know how to locate more than one or two members of that population.
Depends upon whether members of these populations know other members
Again, it does not guarantee a representative sample.
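The referral mechanism can be sketched as a walk over a contact network: each recruited respondent names the members they know, and those are contacted next. The network, the names and the `snowball_sample` helper below are all hypothetical.

```python
from collections import deque


def snowball_sample(contacts, seeds, max_size):
    """Recruit by referral, starting from the initial respondents (seeds)."""
    recruited = []
    seen = set(seeds)
    queue = deque(seeds)
    while queue and len(recruited) < max_size:
        person = queue.popleft()
        recruited.append(person)
        # Each respondent refers the members of the population they know.
        for contact in contacts.get(person, []):
            if contact not in seen:
                seen.add(contact)
                queue.append(contact)
    return recruited


# Hypothetical network: who knows whom.
contacts = {
    "ana": ["ben", "cat"],
    "ben": ["ana", "dan"],
    "cat": ["eli"],
    "dan": [],
    "eli": ["ana"],
    "fay": ["gus"],
    "gus": ["fay"],
}

sample = snowball_sample(contacts, seeds=["ana"], max_size=10)
print(sample)
```

In this toy network "fay" and "gus" are never reached because nobody in the first group knows them, which illustrates why the method depends on members knowing each other and why the result need not be representative.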
4) Determining the sample size
Size matters: more is better!
Suppose this class is my population (54 students) and I want to find out the mean amount of money in your pockets. Let’s say that on average you have £5 per person; obviously some will have more, some less.
– If I ask 2, I may well pick 2 with nothing, or 2 with £20.
– If I ask 10, the sample is more likely to balance out and give a mean closer to the true value.
– If I ask 20, even more likely.
However, working out the appropriate sample size also depends on other factors, such as the precision level required and budget constraints.
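The effect of sample size in the class example can be illustrated with a small simulation. The amounts in the students' pockets are invented (only the class size of 54 and the £5 mean come from the example), and the number of repeated draws is an arbitrary choice.

```python
import random
import statistics

# Hypothetical class of 54 students: 18 with £0, 18 with £5, 18 with £10,
# so the true mean is £5 as in the example.
pockets = [0] * 18 + [5] * 18 + [10] * 18


def spread_of_sample_means(population, n, draws=2000, seed=1):
    """Standard deviation of the sample mean over many repeated samples of size n."""
    rng = random.Random(seed)
    means = [sum(rng.sample(population, n)) / n for _ in range(draws)]
    return statistics.pstdev(means)


spreads = {n: spread_of_sample_means(pockets, n) for n in (2, 10, 20)}
for n, spread in spreads.items():
    print(f"n={n}: sample means vary by about £{spread:.2f}")
```

The spread of the sample means shrinks as n grows, which is the "more likely to balance out" argument made above.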
Household expenditure on pizza (with ‘true’ population mean = $20, standard deviation = 9, significance level 5%). Source: adapted from Mazzocchi 2008, Box 5.4, p. 116.
– With a sample of 30 units and a population of 100 units we commit an 11.75% error.
– With the same sample (30) and a population of 1000, the error increases to 16.28%.
– With a sample of 500 units and a population of 1000, the error is only 1.98%.
– With a sample of 1000 units the error is below 3% even with a population of 100 million units.
Sampling and non-sampling error
An estimate based on a sample can differ from the true population figure because of:
Sampling error: the random chance involved when sampling. Two main factors affect the amount of sampling error:
a) The size of the sample: increasing sample size reduces sampling error
b) The amount of variation in the population in the characteristic being measured (age, income, etc.): the more variation, the greater the sampling error (for any given sample size)
Non-sampling error: errors arising from all other aspects of the procedure:
a) Poorly designed sampling frame
b) Measurement errors during fieldwork
c) Systematic non-response
d) Systematic attrition
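The two factors behind sampling error can be seen in the textbook standard error of a sample mean under simple random sampling. This is a sketch under that assumption, with an optional finite population correction; it is not claimed to be the calculation used elsewhere in the lecture, and the input values are illustrative.

```python
import math


def standard_error(sd, n, population_size=None):
    """Standard error of the sample mean under simple random sampling.

    sd: standard deviation of the characteristic in the population
    n: sample size
    population_size: if given, apply the finite population correction
    """
    se = sd / math.sqrt(n)
    if population_size is not None:
        se *= math.sqrt((population_size - n) / (population_size - 1))
    return se


# a) A larger sample reduces the sampling error.
print(standard_error(sd=9, n=30), standard_error(sd=9, n=500))
# b) More variation in the population increases it, for a given sample size.
print(standard_error(sd=9, n=30), standard_error(sd=18, n=30))
```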
Data...!
What type of data?
Cross-sectional data
Longitudinal data
– Time series
– Panel data
Variables
A variable is a condition or a quality that can differ from one case to another.
– Conceptual definition: literal or general definition of the variable
– Operational definition: specifying the criteria for taking a measurement of that variable
For example:
– We want to measure a firm’s innovativeness
– We define innovativeness as the capacity to produce new inventions
– We measure innovativeness by taking the number of patents of the firm
Scales of measurement
The scale of measurement specifies a range of values that the variable can take
Discrete versus continuous
– Discrete data can take only a finite number of values within a specified range (e.g. number of children per household): 1, 2, 3, 4, 5
– Continuous data can take infinitely many values within a specified range (e.g. age): between 3 and 4 lies 3 years and 6 months, then 3 years, 6 months and 2 days, etc.
4 levels of measurement
Nominal measures use numbers simply as labels for different values (e.g., Female=1, Male=2; or Bus=1, Train=2, Car=3).
Ordinal measures are like nominal ones in that they too use numbers simply as labels, but in this case a higher number does indicate more and a lower number less (e.g., ‘How often do you smoke?’ Never=1, Sometimes=2, Frequently=3, Very often=4).
Interval/ratio scales are those that allow us to say by how much a case is better or stronger than another.
– Interval scales measure the order of data points and the size of the intervals in between data points.
– Ratio scales are interval scales with a true zero point.
Interval scales can have an arbitrary zero reference point, while ratio scales have a true zero point. For example, 0 age and 0 income mean no age and no income (i.e. the zero reference point is non-arbitrary), while 0°C only indicates the point at which water freezes; it does not mean no heat at all! However, this distinction is not relevant for the kind of analysis we will carry out.
[Arrow alongside the list, from nominal to ratio: increasing precision and meaning.]
Some more examples
*** The level of measurement of a variable is important because it determines the techniques that you can use to analyse it ***
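As a small illustration of this point, the sketch below pairs each level of measurement with a summary statistic that is conventionally considered meaningful for it (mode for nominal, median for ordinal, mean for interval/ratio). This pairing is a standard convention rather than something stated in the slides, and the data are invented.

```python
import statistics

# Nominal: labels only, so the most frequent category (mode) is meaningful.
transport = ["Bus", "Train", "Car", "Bus"]
most_common = statistics.mode(transport)

# Ordinal: codes are ordered (Never=1 ... Very often=4), so a median makes
# sense; median_low always returns one of the observed categories.
smoking = [1, 2, 2, 4, 3]
typical_smoking = statistics.median_low(smoking)

# Ratio: true zero and equal intervals, so the mean is meaningful.
ages = [21, 25, 30, 24]
mean_age = statistics.mean(ages)

print(most_common, typical_smoking, mean_age)
```

Computing a mean of the transport codes (Bus=1, Train=2, Car=3) would run without error but would not mean anything, which is why the level of measurement has to be decided before choosing a technique.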
How does SPSS classify data?
Before next week…
Read “Introduction to SPSS” available on Study Direct
Download and install SPSS by following the instructions available on this page:
http://www.sussex.ac.uk/its/services/software/owncomputer