INFO 515 Lecture #3 1
Action Research: Descriptive Statistics and Surveys
INFO 515
Glenn Booker
INFO 515 Lecture #3 2
Reliability and Validity
A measure is reliable if it consistently gives the same answer
  A key to scientific measurement is the ability to repeat an experiment reliably
A measure is valid if it actually measures the concept under investigation
  It tests what you think it tests
INFO 515 Lecture #3 3
Review Std Deviation and CV
Standard Deviation can be used to compare two (or more) groups that have the same units of measure and similar means
Coefficient of Variation can compare two (or more) groups that have different reference points (means) and different standard deviations
  See which groups are more closely distributed around their mean
INFO 515 Lecture #3 4
Z Score
The Z Score is the ‘how weird am I’ measure for a given data point*
The standardized or ‘z’ score allows you to do either of the following:
  Find where one or more individuals stand in reference to the mean of a single distribution on one unit of measure (one variable)
    Where is an individual located relative to a distribution of test scores?
    Am I better than average? If so, how much?
* This is not an official ISO definition…
INFO 515 Lecture #3 5
Z Score
  Find where one or more individuals stand in reference to the mean of two (or more) different distributions that may have different units of measure
    Where does an individual stand relative to two tests, each given in a different class (with different distributions)?
    Did I do better on the midterm in philosophy than the one in geography?
INFO 515 Lecture #3 6
Z Score
A z score tells you how far above or below the mean any given score is, in standard deviation units
Z scores are most useful when the shape of your actual distribution of scores is nearly normal (see slide 9, or Action Research handout p. 11)
What’s the “normal” distribution?
INFO 515 Lecture #3 7
Normal Distribution Example
Consider stopping a car at a traffic light
  You don’t stop at exactly the same place each time, but generally stop somewhere behind or near the big white line (I hope!)
Where you are likely to stop might be described by a “normal distribution”
INFO 515 Lecture #3 8
Normal Distribution
The normal, or Gaussian, distribution is the classic “bell curve”, which shows that most measurements are somewhere close to the mean, but a few measurements could range far above or below that mean
It is symmetric, and extends forever above and below the mean
INFO 515 Lecture #3 9
Normal Distribution
The normal distribution is described by two math functions
  The function f(x) is the probability density function, often called a PDF; it represents how likely the answer is to fall near the current value of x
  The function F(x) is the cumulative probability function; it represents the total chance of getting the current value of x or anything less
    A.k.a. a cumulative distribution function, or CDF
INFO 515 Lecture #3 10
Normal Distribution
‘f(x)’ is the probability density function (the classic bell curve)
‘F(x)’ is the cumulative probability function
[Figure: Normal Distribution; f(x) and F(x) plotted against X from -3 to 3, vertical axis from 0 to 1]
INFO 515 Lecture #3 11
Probability Density Function, f(x)
The chance you will stop (the event will occur) between any two distances ‘a’ and ‘b’ is the area under the curve f(x) between those two values
[Figure: Normal Distribution; f(x) plotted against X from -3 to 3, with points a and b marked on the X axis]
INFO 515 Lecture #3 12
Probability Density Function, f(x)
Notice that f(x) is symmetric from left to right, and that it is defined for all possible values of x (x = negative infinity to x = positive infinity)
  f(x) never reaches zero!
The total area under the curve f(x) is one
  You will eventually stop somewhere
Unfortunately, f(x) is a messy function to integrate (find the area under it)
INFO 515 Lecture #3 13
Cumulative Probability Function F(x)
Imagine you start at x equals minus infinity (x = -∞)
Then add up the area under f(x) from minus infinity to the current value of x
  This is the cumulative probability function, F(x)
That’s why F(0) (F at x=0) is exactly 0.5
  Half of all events occur left of x=0, and half occur to the right of x=0 (symmetry)
INFO 515 Lecture #3 14
Cumulative Probability Function F(x)
So the chance of getting a result between values ‘a’ and ‘b’ is also given by:
  Probability = F(b) - F(a)
An analogy might be
  The number of babies born between 1940 (a) and 1990 (b) is equal to the total number of babies ever born by 1990 (F(b)), minus the total number of babies ever born by 1940 (F(a))
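The relationship between f(x) and F(x) can be checked numerically. The sketch below is only an illustration, not part of the lecture: the function names are made up, it assumes a standard normal distribution (mean 0, standard deviation 1) by default, and it writes F(x) using the error function from Python's math module.

```python
import math

def normal_pdf(x, mean=0.0, sd=1.0):
    """Probability density function f(x) of a normal distribution."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def normal_cdf(x, mean=0.0, sd=1.0):
    """Cumulative probability function F(x), written with the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def probability_between(a, b, mean=0.0, sd=1.0):
    """Chance of a result between a and b: F(b) - F(a)."""
    return normal_cdf(b, mean, sd) - normal_cdf(a, mean, sd)

def area_under_pdf(a, b, steps=10_000):
    """Trapezoid-rule area under the standard normal f(x) between a and b."""
    width = (b - a) / steps
    total = 0.5 * (normal_pdf(a) + normal_pdf(b))
    for i in range(1, steps):
        total += normal_pdf(a + i * width)
    return total * width

print(normal_cdf(0.0))                 # F(0): half the area lies left of the mean
print(probability_between(-1.0, 1.0))  # area between x = -1 and x = +1
print(area_under_pdf(-1.0, 1.0))       # should agree closely with the line above
```

The last two printed values should match to several decimal places, which is the point of the slide: the area under f(x) between a and b equals F(b) - F(a).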
INFO 515 Lecture #3 15
Standard (Z) Scores
Back to Z scores, our motivation for discussing the normal distribution
Z Scores are standardized scores whose distribution has the following properties:
  Retains the shape of the original scores, but
  Has a mean of 0 and
  Has a variance and standard deviation of 1
INFO 515 Lecture #3 16
Calculating Z scores
Compute the “z” score by subtracting the mean from the raw score and dividing that result by the standard deviation
  z = (Xi - μ)/σ = (Score - Mean)/(Standard Dev)
  where μ is the mean and σ is the standard deviation
The z score is not just associated with the normal distribution; it can be used with any kind of distribution
INFO 515 Lecture #3 17
Interpreting Z Scores
The z score describes how many standard deviations a specific score is above or below the mean
  A negative z score means that the score is below the mean
  A positive z score means that the score is above the mean
  A z score of zero (z=0) means that the score is equal to the mean
INFO 515 Lecture #3 18
Z Score Example
I own 250 books; I want to know how I compare to other college professors
Suppose that the mean number of books owned by college professors is 150 with a standard deviation of 50
  z = (250 - 150) / 50 = 2
My z score is 2, meaning I have 2 standard deviations more books than average (‘cuz I’m a pack rat!)
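The book example can be reproduced in a few lines. This is a minimal sketch: the helper names and the list of book counts are hypothetical, and the population standard deviation (statistics.pstdev) is an assumption, since the slides do not say whether a sample or population formula is intended.

```python
import statistics

def z_score(score, mean, sd):
    """(Score - Mean) / (Standard Dev)."""
    return (score - mean) / sd

print(z_score(250, 150, 50))  # 2.0

def standardize(scores):
    """Convert every raw score to a z score using the list's own mean and sd."""
    mean = statistics.fmean(scores)
    sd = statistics.pstdev(scores)  # assumption: population standard deviation
    return [z_score(x, mean, sd) for x in scores]

book_counts = [100, 120, 150, 180, 200]  # hypothetical data, not from the lecture
z_values = standardize(book_counts)
print(z_values)
# The standardized scores should have a mean near 0 and a standard deviation near 1.
print(statistics.fmean(z_values), statistics.pstdev(z_values))
```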
INFO 515 Lecture #3 19
Z Score Tables
Are used to determine the proportion of the area under the curve that lies between the mean and a given standard score (z)
These tables are prepared using integral calculus to save you time
They show only positive ‘z’ values, since the areas for negative ‘z’ are the same as for positive ‘z’ (thanks to symmetry)
INFO 515 Lecture #3 20
Z Score Tables (Yonker p. 29-30)
[Figure: Normal Distribution; f(x) plotted against X from -3 to 3]
Col. A: z value
Col. B: Area between 0 and z
Col. C: Area beyond z
Notice that we always have Col. B + Col. C = 0.5000
INFO 515 Lecture #3 21
Use of Z Score Tables
Z score tables can be used to find the chance of a measurement (or percentage of cases) occurring between any two z values
  If the z scores are on opposite sides of the mean (one positive, one negative), add the areas from Column B for each score
  If the z scores are on the same side of the mean (both positive, or both negative), subtract the areas from Column B
    Subtract the smaller area from the larger area; otherwise you’d get negative area!
INFO 515 Lecture #3 22
Use of Z Score Table Examples
Between z scores of -1.5 and +2.2, the percent of cases is, from Column B:
  z(-1.5) is the same area as z(+1.5)
  z(+1.5) = 0.4332 and z(+2.2) = 0.4861
  Percent = 43.32 + 48.61 = 91.93%
Between z scores of +1.5 and +2.2, the percent of cases is:
  Percent = 48.61 - 43.32 = 5.29%
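The add-or-subtract rule for Column B can be sketched in code. The version below is an illustration under one assumption: instead of looking values up in a printed table, it computes the Column B area from the standard normal CDF (via math.erf). The function names are made up for this example.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative probability F(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def column_b(z):
    """Area between the mean and |z|, as in Column B of a z score table."""
    return normal_cdf(abs(z)) - 0.5

def area_between(z1, z2):
    """Proportion of cases between two z scores, using the Column B rule."""
    low, high = sorted((z1, z2))
    if low < 0 < high:
        # Opposite sides of the mean: add the two Column B areas.
        return column_b(low) + column_b(high)
    # Same side of the mean: larger area minus smaller area.
    return abs(column_b(high) - column_b(low))

print(round(100 * area_between(-1.5, 2.2), 2))  # percent between -1.5 and +2.2
print(round(100 * area_between(1.5, 2.2), 2))   # percent between +1.5 and +2.2
```

The results should agree with the table-based answers above (about 91.93% and 5.29%) to within rounding.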
INFO 515 Lecture #3 23
Cumulative Z Score
[Figure: Normal Distribution; f(x) plotted against X from -3 to 3, from p. 11 in Yonker]
Percentages shown are the total percent between the integer Z score values: between 0 and 1 has 34.13%, between 1 and 2 has 13.59%, between 2 and 3 has 2.14%, and beyond 3 has 0.13%; the same percentages apply on the negative side
INFO 515 Lecture #3 24
F(x) Values
For F(x) from minus 6 to plus 6, a distribution with mean = 0 and standard deviation of 1.0 gives:
Z     CDF               delta CDF from next value
-6    0.000000000987    0.000000286
-5    0.000000286652    0.000031385
-4    0.000031671242    0.001318227
-3    0.001349898032    0.021400234
-2    0.022750131948    0.135905122
-1    0.158655253931    0.341344746
0     0.500000000000
1     0.841344746069
2     0.977249868052
3     0.998650101968
4     0.999968328758
5     0.999999713348
6     0.999999999013
INFO 515 Lecture #3 25
Cumulative Z Score
Key values are:
  From z = -1 to +1, total area is 68.26%
  From z = -1.96 to +1.96, total area is 95%
  From z = -2 to +2, total area is 95.44%
  From z = -2.57 to +2.57, total area is 99%
  From z = -3 to +3, total area is 99.74%
INFO 515 Lecture #3 26
Transformed z, or T scores
A.k.a. Standardized scores or “T” scores
Z scores are transformed artificially
  Multiply a z score by the desired standard deviation and add the desired mean (e.g. 10 and 50)
  T = 10*z + 50 (the mean of 0 becomes 50, and the standard deviation of 1 becomes 10)
Examples
  A z score of -1.5 would give a T score of T = 10*(-1.5) + 50 = 35
  A z of +2.2 would give T = 10*(2.2) + 50 = 72
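The transformation is a single line of arithmetic. A minimal sketch, with a hypothetical function name and the desired mean of 50 and standard deviation of 10 taken from the examples above:

```python
def t_score(z, desired_mean=50.0, desired_sd=10.0):
    """Transform a z score: multiply by the desired sd, then add the desired mean."""
    return desired_sd * z + desired_mean

print(t_score(-1.5))  # 35.0
print(t_score(2.2))   # about 72
print(t_score(-5.0))  # 0.0, so only z scores below -5 give a negative T score
```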
INFO 515 Lecture #3 27
T scores
This is used in many fields of research, especially Psychology and Education (that’s where the “desired” mean and standard deviation values came from)
Benefits: gets rid of negative connotations of negative and zero scores
  Only z scores below z = -5.0 would result in a negative T score (typically less than one data point in a million)
INFO 515 Lecture #3 28
Level of Confidence
Since the normal distribution goes to positive and negative infinity, we need a way to limit the range of expected or likely values
  Otherwise any normal distribution could take on any value some of the time
Define the Level of Confidence as the acceptable limits of predictable behavior
Typically use 95% for most applications, but 99% for medical research
INFO 515 Lecture #3 29
Level of Confidence
Generally, we can say that the actual value of a parameter estimate is in the range of its mean ± twice its standard error, with a 95% level of confidence
  Use 1.96 instead of 2 for precise work
Thus the value of a parameter with mean of 6.2 and standard error of 1.9 lies between 2.4 (i.e., 6.2 - 2*1.9) and 10.0 (i.e., 6.2 + 2*1.9) with a 95% level of confidence
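The interval arithmetic above can be sketched as follows. The function name is hypothetical; the multiplier defaults to the rule-of-thumb value of 2 and can be set to 1.96 for more precise work, as the slide suggests.

```python
def confidence_interval(estimate, standard_error, z=2.0):
    """Return (low, high) = estimate -/+ z * standard error."""
    half_width = z * standard_error
    return estimate - half_width, estimate + half_width

low, high = confidence_interval(6.2, 1.9)
print(round(low, 2), round(high, 2))  # 2.4 10.0

# Same estimate with the more precise 95% multiplier of 1.96.
precise_low, precise_high = confidence_interval(6.2, 1.9, z=1.96)
print(round(precise_low, 3), round(precise_high, 3))
```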
INFO 515 Lecture #3 30
The “t” Statistic
The t-statistic is defined as
  t = (parameter estimate) / (standard error)
If |t| > 2, then the parameter estimate is significantly different from zero at the 95% level of confidence
  t = 6.2/1.9 = 3.26
  Hence because |3.26| > 2, this estimate is statistically significant
  Also means the 95% confidence interval does not include zero
Again, use 1.96 instead of 2 for precise work
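A short sketch of the rule of thumb, using the same numbers. The function names are hypothetical, and the critical value of 2 (or 1.96) is the large-sample approximation used on the slide rather than a value looked up for a specific number of degrees of freedom.

```python
def t_statistic(estimate, standard_error):
    """t = (parameter estimate) / (standard error)."""
    return estimate / standard_error

def is_significant(estimate, standard_error, critical=2.0):
    """True if |t| exceeds the critical value (2, or 1.96 for precise work)."""
    return abs(t_statistic(estimate, standard_error)) > critical

def interval_excludes_zero(estimate, standard_error, critical=2.0):
    """True if estimate -/+ critical * standard error does not contain zero."""
    low = estimate - critical * standard_error
    high = estimate + critical * standard_error
    return low > 0 or high < 0

print(round(t_statistic(6.2, 1.9), 2))   # 3.26
print(is_significant(6.2, 1.9))          # True
print(interval_excludes_zero(6.2, 1.9))  # True, consistent with the line above
```

The two checks always agree, because |estimate / SE| > c is the same condition as the interval estimate ± c*SE not containing zero.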
INFO 515 Lecture #3 31
The “t” Statistic
T = ‘t’???? No!
Notice that the T score is a completely different concept from the ‘t’ statistic
We’ll use the ‘t’ statistic to help judge SPSS output later in the course
INFO 515 Lecture #3 32
Sampling Terms
Population = the entire realm of interest, everyone, all books, all publishers, all patrons, etc.
Sample = a subgroup or subset of the population
  Accurate inference requires good samples
  Use a sample since it is often hard or impossible to measure the entire population
INFO 515 Lecture #3 33
Sampling Terms
Inferential Statistics
  Taking samples in order to infer unknown population parameters
Principle of Random Selection
  A procedure by which each member of the population has the same chance of being chosen as any other member
  Representative of the population
INFO 515 Lecture #3 34
Types of Samples
Probabilistic sample - sampling in which the probability of each element in the population being selected is known and can be specified
  Each element has the same chance
Non-probabilistic sample - each probability not known a priori (in advance)
  E.g. convenience samples, or available samples
INFO 515 Lecture #3 35
Random Sampling Techniques
Simple Random
Stratified Random
  Proportional
  Disproportional
Cluster
Systematic
INFO 515 Lecture #3 36
Simple Random Sample
Often can’t sample the entire user population
Must be a truly random sample, not just convenient
Can use random number table, or computer-generated pseudo-random numbers (Yonker, p. 31) to choose the sample
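As an illustration of the computer-generated option, the sketch below draws a simple random sample with Python's standard random module. The customer list, the function name, and the seed value are all hypothetical; the seed is there only so the draw can be repeated.

```python
import random

def simple_random_sample(population, sample_size, seed=None):
    """Draw sample_size distinct members, each equally likely to be chosen."""
    rng = random.Random(seed)  # pseudo-random number generator
    return rng.sample(population, sample_size)

customers = [f"customer-{i}" for i in range(1, 101)]  # hypothetical population
chosen = simple_random_sample(customers, 10, seed=515)
print(chosen)
```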
INFO 515 Lecture #3 37
Stratified Random Sampling
Group customers into categories (strata); get simple random samples from each category (stratum). Can be a very efficient method.
Can weight each stratum equally (proportional s.s.) or unequally (disproportional s.s.)
  For unequal weight, make the sampling fraction ~ standard deviation of stratum, and ~ 1/square root (cost of sampling)
  F ~ σ/sqrt(cost), where “sqrt” is “square root”, “~” is ‘proportional to’, and σ is the standard deviation of the stratum
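The unequal-weight rule can be sketched as below. Everything specific here is an assumption for illustration: the stratum names, their standard deviations, and their per-unit sampling costs are invented, and the weights are normalized to sum to 1 so they can be read as relative sampling fractions rather than absolute ones.

```python
import math

def relative_sampling_fractions(strata):
    """strata maps name -> (standard deviation, cost per sampled unit).

    Returns weights proportional to sd / sqrt(cost), normalized to sum to 1.
    """
    raw = {name: sd / math.sqrt(cost) for name, (sd, cost) in strata.items()}
    total = sum(raw.values())
    return {name: value / total for name, value in raw.items()}

# Hypothetical strata: (standard deviation, cost of sampling one unit)
strata = {
    "walk-in": (10.0, 1.0),
    "phone": (10.0, 4.0),   # same spread, four times the cost
    "online": (20.0, 1.0),  # twice the spread, same cost
}
fractions = relative_sampling_fractions(strata)
for name, fraction in fractions.items():
    print(name, round(fraction, 3))
```

A stratum with twice the standard deviation gets twice the relative fraction, and one that costs four times as much to sample gets half.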
INFO 515 Lecture #3 38
Proportional Stratified Random Sampling
Major             # in Population    % in Population    # in Sample
Education         50                 50                 10
Soc./Beh. Sci.    30                 30                 6
Business          15                 15                 3
Sci./Tech         5                  5                  1
For the Education row: % in Population = 50/100 X 100 = 50%, and # in Sample = 50% X 20 = 10
Data taken from Carpenter and Vasu (1978)
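The allocation in the table can be reproduced with a short sketch. The stratum sizes and the sample size of 20 come from the table; the function names and the member identifiers are hypothetical. Note that simple rounding is an assumption that works for these numbers but may not add up to the exact sample size for other inputs.

```python
import random

def proportional_allocation(stratum_sizes, sample_size):
    """Give each stratum a share of the sample equal to its share of the population."""
    total = sum(stratum_sizes.values())
    return {name: round(sample_size * size / total)
            for name, size in stratum_sizes.items()}

def stratified_sample(strata, sample_size, seed=None):
    """Draw a simple random sample of the allocated size from each stratum."""
    rng = random.Random(seed)
    sizes = {name: len(members) for name, members in strata.items()}
    allocation = proportional_allocation(sizes, sample_size)
    return {name: rng.sample(members, allocation[name])
            for name, members in strata.items()}

sizes = {"Education": 50, "Soc./Beh. Sci.": 30, "Business": 15, "Sci./Tech": 5}
allocation = proportional_allocation(sizes, 20)
print(allocation)

# Hypothetical member lists, one identifier per person in each stratum.
strata = {name: [f"{name}-{i}" for i in range(size)] for name, size in sizes.items()}
sample = stratified_sample(strata, 20, seed=1978)
print({name: len(members) for name, members in sample.items()})
```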
INFO 515 Lecture #3 39
Cluster Sampling
Divide population into (geographic) clusters, then do simple random samples within each selected cluster
Try for representative clusters
Not as efficient as simple random sampling, but cheaper
Sometimes used for in-person interviews
INFO 515 Lecture #3 40
Cluster Sampling Example
Randomly select n (certain number of) census tracts
From randomly selected census tracts, randomly select n blocks
From randomly selected blocks, randomly select addresses
Interview the family (the unit of study)
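The three stages of the example can be sketched as nested random draws. This is only an illustration: the tract, block, and address names are invented, the nested dictionary layout is an assumption, and the same number of blocks and addresses is drawn from every selected cluster for simplicity.

```python
import random

def multistage_cluster_sample(tracts, n_tracts, n_blocks, n_addresses, seed=None):
    """tracts maps tract name -> {block name -> list of addresses}."""
    rng = random.Random(seed)
    chosen = []
    # Stage 1: randomly select census tracts.
    for tract in rng.sample(sorted(tracts), n_tracts):
        blocks = tracts[tract]
        # Stage 2: randomly select blocks within each selected tract.
        for block in rng.sample(sorted(blocks), n_blocks):
            # Stage 3: randomly select addresses within each selected block.
            chosen.extend(rng.sample(blocks[block], n_addresses))
    return chosen

# Hypothetical city: 8 tracts, 5 blocks per tract, 10 addresses per block.
tracts = {
    f"tract-{t}": {
        f"block-{t}-{b}": [f"{n} Block {b} St, Tract {t}" for n in range(1, 11)]
        for b in range(1, 6)
    }
    for t in range(1, 9)
}
addresses = multistage_cluster_sample(tracts, n_tracts=3, n_blocks=2, n_addresses=4, seed=40)
print(len(addresses))  # 3 tracts x 2 blocks x 4 addresses = 24
```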
INFO 515 Lecture #3 41
Systematic Sampling
Calculate your sampling interval:
Interval = Size of population / (Size of sample)
Select your first element at random from the sampling interval
Move ahead systematically by the sampling interval (e.g. every 10th customer) until you reach your desired sample size
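The steps above can be sketched as follows. The function name and customer list are hypothetical, and the interval is rounded down to a whole number, which is an assumption for cases where the population size is not an exact multiple of the sample size.

```python
import random

def systematic_sample(population, sample_size, seed=None):
    """Pick a random start within the first interval, then every interval-th element."""
    rng = random.Random(seed)
    interval = len(population) // sample_size  # size of population / size of sample
    start = rng.randrange(interval)            # random position within the first interval
    return [population[start + i * interval] for i in range(sample_size)]

customers = list(range(1, 101))  # hypothetical customers numbered 1 to 100
picked = systematic_sample(customers, 10, seed=41)
print(picked)  # every 10th customer, starting from a random one among the first 10
```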
INFO 515 Lecture #3 42
Non-random Sampling Techniques
Quota
Accidental
Judgment
INFO 515 Lecture #3 43
Non-random techniques
Quota sampling
  Is economical
  Is a non-random version of stratified sampling
  Define desired characteristics in advance: gender, race, age, etc.
  Example: Interview 20 females and 20 males over the age of 65
INFO 515 Lecture #3 44
Non-random techniques
Accidental sampling
  Mall market studies, Internet surveys
  Often requires a choice (by the interviewee) to be sampled
Judgment sampling
  Pick people who have some special knowledge
  Seek out experts; more of an interview method
INFO 515 Lecture #3 45
What is a Survey Study (Assessment)?
To describe systematically the facts and characteristics of a given population or area of interest, factually and accurately. (Isaac and Michael)
Survey studies are used to:
  Describe what is
  Establish need
  Identify problems
  Infer possible solutions
INFO 515 Lecture #3 46
Surveys
A survey often refers to a large data collection effort:
  What it involves: personal interviews, telephone interviews, a questionnaire sent through the mail, document survey, literature survey, social area analysis (observation and description of different areas of the city)
  “Who” it involves: community, customers, users, employees, literature
  Purpose: information gathering and fact finding to
    Describe what exists (such as public library services)
    Establish need
    Identify problems
    Imply possible solutions
INFO 515 Lecture #3 47
Customer Satisfaction Surveys
Could have many opportunities to conduct surveys
  Customer call-back after x days
  Customer complaints
  Direct customer visits
  Customer user groups
  Conferences
INFO 515 Lecture #3 48
Customer Satisfaction Surveys
Want representative sample of all customers
Three main methods are used
  Personal interview
  Telephone interview
  Questionnaire by mail
INFO 515 Lecture #3 49
Personal Interview
Advantages:
1. Explore complex issues
2. Question clarification
3. Rapport
4. Higher response rate
5. Observation
INFO 515 Lecture #3 50
Personal Interview
Disadvantages:
1. Interviewer bias
2. Question uniformity
3. No anonymity
4. Difficult to analyze
5. Time consuming
INFO 515 Lecture #3 51
Telephone Interview
Advantages:
1. Some anonymity
2. Low cost
3. Rapid completion
4. Higher response rate
5. No travel time
6. Widely spread sample
INFO 515 Lecture #3 52
Telephone Interview
Disadvantages:
1. Reaching people
2. Some interview bias possible
3. Only accessible phone numbers
4. No observation
INFO 515 Lecture #3 53
Structured vs. Unstructured Interviews
In an unstructured interview, only the first question is standard for all respondents
  The remaining questions are determined by the answers of each respondent
In a semi-structured interview, the questions are open ended, but all of the respondents receive the same questions
INFO 515 Lecture #3 54
Questionnaire by Mail
Advantages:
1. Economical
2. Faster
3. Wide range of issues
4. Widely spread sample
5. Avoids interviewer bias
6. Anonymity
INFO 515 Lecture #3 55
Questionnaire by Mail
Disadvantages:
1. Question clarity
2. No probing
3. Who is answering?
4. No observation
5. Response rate
INFO 515 Lecture #3 56
Interview & Questionnaire Tips
1. Start with easy questions that the respondent will enjoy answering
  You want to prevent boredom early on while building rapport and putting the respondent at ease
2. Try for an easy and natural flow over topics
  Place like items together and give a brief explanation when a topic breaks
3. Within topics, go from the general to the specific
  For example, start with questions on use of the Internet in general, then move on to specific questions about the use of search engines
INFO 515 Lecture #3 57
Interview & Questionnaire Tips
4. Put open-ended or difficult questions (if any) at the end of the interview or questionnaire
5. Put questions on “sensitive” matters (such as age or income) at the end of the interview or questionnaire
  Otherwise, the interview may be over before it has started!
INFO 515 Lecture #3 58
The “Question Continuum”
Closed Questions
  Fixed Alternatives
  Structured
  “Your annual income is: a) 0-25K, b) 26-35K, ”
Semi Structured Questions
Open Questions
  Free form responses
  Unstructured
  “What do you like about Drexel?”
INFO 515 Lecture #3 59
Sample Size
How big is enough? Must choose:
  Confidence level (80 - 95%, to get Z)
  Margin of error (B = 3 - 5%)
For simple random sample, also need
  Estimated satisfaction level (p), which is what you’re trying to measure, and
  Total population size (N = total number of customers)
INFO 515 Lecture #3 60
Critical Z values
Confidence Level (2-sided)    Critical Z
80%                           1.28
90%                           1.645
95%                           1.96
99%                           2.57
INFO 515 Lecture #3 61
Sample Size
Sample size, n
  n = [N*Z^2*p*(1-p)] / [N*B^2 + Z^2*p*(1-p)]
The sample size depends heavily on the answer we want to obtain, the actual level of customer satisfaction (p)!
INFO 515 Lecture #3 62
Sample Size
If we choose
  80% confidence level, then Z = 1.28
  5% margin of error, then B = 5% = 0.05
  and expect 90% satisfaction, then p = 0.90
n = (N*1.28^2*0.9*0.1) / (N*0.05^2 + 1.28^2*0.9*0.1)
n = 0.1475*N / (0.0025*N + 0.1475)
INFO 515 Lecture #3 63
Sample Size
Given:
  Z = 1.28, p = 0.9, B = 0.05
Hence:
  Z^2 = 1.6384, p(1-p) = 0.09, B^2 = 0.0025
Find:
N           n
10          8.550355
20          14.93558
50          27.06052
100         37.09996
200         45.54935
500         52.75873
1000        55.69724
10000       58.63655
100000      58.94763
1000000     58.97892
Infinity    58.9824
Beware of sampling small populations!
For very large N, sample size stabilizes
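The table can be regenerated from the sample size formula n = [N*Z^2*p*(1-p)] / [N*B^2 + Z^2*p*(1-p)]. A minimal sketch, with a hypothetical function name and the same Z, p, and B values as given above:

```python
def sample_size(N, Z, p, B):
    """Sample size for a simple random sample from a population of size N."""
    numerator = N * Z**2 * p * (1 - p)
    denominator = N * B**2 + Z**2 * p * (1 - p)
    return numerator / denominator

for N in (10, 20, 50, 100, 200, 500, 1000, 10_000, 100_000, 1_000_000):
    print(N, round(sample_size(N, Z=1.28, p=0.9, B=0.05), 5))
```

In practice the result would be rounded up to a whole number of customers; the unrounded values are shown here so they can be compared with the table.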
INFO 515 Lecture #3 64
Sample Size
If you don’t know the customer satisfaction value ‘p’, use 0.5 as the worst-case estimate
Once the real value of ‘p’ is known, solve for the actual value of B (margin of error)
Key is finding a truly representative sample
For N approaching infinity, sample size simplifies to:
  n = p*(1-p)*(Z/B)^2
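Both points on this slide can be sketched in code. The large-population formula is taken directly from the slide; the margin-of-error function is an assumption in the sense that the slides do not write it out: it comes from rearranging n = [N*Z^2*p*(1-p)] / [N*B^2 + Z^2*p*(1-p)] to solve for B, which gives B = Z*sqrt(p*(1-p)*(N-n)/(N*n)). Function names are hypothetical.

```python
import math

def sample_size_infinite(Z, p, B):
    """Sample size as N approaches infinity: n = p*(1-p)*(Z/B)^2."""
    return p * (1 - p) * (Z / B) ** 2

def margin_of_error(n, N, Z, p):
    """Actual margin of error B, solved from the finite-population formula."""
    return Z * math.sqrt(p * (1 - p) * (N - n) / (N * n))

print(sample_size_infinite(Z=1.28, p=0.9, B=0.05))  # large-population limit for the worked example
print(sample_size_infinite(Z=1.28, p=0.5, B=0.05))  # worst case when p is unknown
# Check: a sample of about 37.1 from N = 100 should recover B = 0.05.
print(round(margin_of_error(n=37.09996, N=100, Z=1.28, p=0.9), 4))
```

Using p = 0.5 maximizes p*(1-p), which is why it serves as the worst-case estimate.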