INFO 515 Lecture #3: Action Research, Descriptive Statistics and Surveys
Glenn Booker



Reliability and Validity

A measure is reliable if it consistently gives the same answer
    A key to scientific measurement is the ability to repeat an experiment reliably
A measure is valid if it actually measures the concept under investigation
    It tests what you think it tests


Review: Std Deviation and CV

Standard Deviation can be used to compare two (or more) groups that have the same units of measure and similar means
Coefficient of Variation can compare two (or more) groups that have different reference points (means) and different standard deviations
    See which groups are more closely distributed around their mean
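As a minimal sketch of the comparison above, the coefficient of variation can be computed as the sample standard deviation divided by the mean. The two data sets below are hypothetical and exist only to illustrate groups with different means and spreads.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (unitless)."""
    return statistics.stdev(values) / statistics.mean(values)

# Hypothetical data: two groups with different means and standard deviations.
test_scores = [70, 75, 80, 85, 90]        # mean 80
incomes_thousands = [20, 30, 40, 50, 60]  # mean 40

cv_scores = coefficient_of_variation(test_scores)
cv_incomes = coefficient_of_variation(incomes_thousands)

# The group with the smaller CV is more tightly distributed around its mean.
tighter_group = "test_scores" if cv_scores < cv_incomes else "incomes_thousands"
```

Because the CV is a ratio, it has no units, which is what makes it usable across groups measured on different scales.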


Z Score

The Z Score is the ‘how weird am I’ measure for a given data point*
The standardized or ‘z’ score allows you to do either of the following:
    Find where one or more individuals stand in reference to the mean of a single distribution on one unit of measure (one variable)
        Where is an individual located relative to a distribution of test scores?
        Am I better than average? If so, how much?

* This is not an official ISO definition…


Z Score

    Find where one or more individuals stand in reference to the means of two (or more) different distributions that may have different units of measure
        Where does an individual stand relative to two tests, each given in a different class (with different distributions)?
        Did I do better on the midterm in philosophy than the one in geography?


Z Score

A z score tells you how far above or below the mean any given score is, in standard deviation units
Z scores are most useful when the shape of your actual distribution of scores is nearly normal (see slide 9, or Action Research handout p. 11)
What’s the “normal” distribution?


Normal Distribution Example

Consider stopping a car at a traffic light
    You don’t stop at exactly the same place each time, but generally stop somewhere behind or near the big white line (I hope!)
Where you are likely to stop might be described by a “normal distribution”


Normal Distribution

The normal, or Gaussian, distribution is the classic “bell curve”, which shows that most measurements are somewhere close to the mean, but a few measurements could range far above or below that mean
It is symmetric, and extends forever above and below the mean


Normal Distribution

The normal distribution is described by two math functions
    The function f(x) is the probability density function, often called a PDF; it represents how likely the answer is to fall near the current value of x
    The function F(x) is the cumulative probability function; it represents the total chance of getting the current value of x or anything less
        A.k.a. a cumulative distribution function, or CDF


[Figure: Normal Distribution. ‘f(x)’, the probability density function (the classic bell curve), and ‘F(x)’, the cumulative probability function, plotted for X from -3 to 3 on a vertical axis from 0 to 1.]


Probability Density Function, f(x)

The chance you will stop (the event will occur) between any two distances ‘a’ and ‘b’ is the area under the curve f(x) between those two values

[Figure: Normal Distribution. f(x) plotted for X from -3 to 3 on a vertical axis from 0 to 0.5, with the points ‘a’ and ‘b’ marked on the X axis.]


Probability Density Function, f(x)

Notice that f(x) is symmetric from left to right, and that it is defined for all possible values of x (x = negative infinity to x = positive infinity)
    f(x) never reaches zero!
The total area under the curve f(x) is one
    You will eventually stop somewhere
Unfortunately, f(x) is a messy function to integrate (find the area under it)


Cumulative Probability Function F(x)

Imagine you start at x equals minus infinity (x = -∞)
Then add up the area under f(x) from minus infinity to the current value of x
    This is the cumulative probability function, F(x)
That’s why F(0) (F at x=0) is exactly 0.5
    Half of all events occur left of x=0, and half occur to the right of x=0 (symmetry)


Cumulative Probability Function F(x)

So the chance of getting a result between values ‘a’ and ‘b’ is also given by:
    Probability = F(b) - F(a)
An analogy might be
    The number of babies born between 1940 (a) and 1990 (b) is equal to the total number of babies ever born by 1990 (F(b)), minus the total number of babies ever born by 1940 (F(a))


Standard (Z) Scores

Back to Z scores, our motivation for discussing the normal distribution
Z Scores are standardized scores whose distribution has the following properties:
    Retains the shape of the original scores, but
    Has a mean of 0 and
    Has a variance and standard deviation of 1


Calculating Z scores

Compute the “z” score by subtracting the mean from the raw score and dividing that result by the standard deviation
    z = (Xi - μ)/σ = (Score – Mean)/(Standard Dev)
The z score is not just associated with the normal distribution – it can be used with any kind of distribution


Interpreting Z Scores

The z score describes how many standard deviations a specific score is above or below the mean
    A negative z score means that the score is below the mean
    A positive z score means that the score is above the mean
    A z score of zero (z=0) means that the score is equal to the mean


Z Score Example

I own 250 books, and I want to know how I compare to other college professors
Suppose that the mean number of books owned by college professors is 150 with a standard deviation of 50
    z = (250 - 150) / 50 = 2
My z score is 2, meaning I have 2 standard deviations more books than average (‘cuz I’m a pack rat!)
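The calculation above can be sketched in a few lines. The book numbers come from the example; the philosophy and geography scores are hypothetical values added only to illustrate comparing one person across two distributions.

```python
def z_score(score, mean, std_dev):
    """Number of standard deviations a score lies above (+) or below (-) the mean."""
    return (score - mean) / std_dev

# The book example from the lecture.
books_z = z_score(250, 150, 50)

# Hypothetical midterm results for comparing two different classes.
philosophy_z = z_score(80, 70, 5)    # class mean 70, std dev 5
geography_z = z_score(85, 80, 10)    # class mean 80, std dev 10

# The higher z score marks the better result relative to each class.
did_better_in_philosophy = philosophy_z > geography_z
```

Note that the raw geography score is higher, yet the philosophy result is further above its class mean once both are put in standard deviation units.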


Z Score Tables

Are used to determine the proportion of the area under the curve that lies between the mean and a given standard score (z)
These tables are prepared using integral calculus to save you time
They show only positive ‘z’ values, since the areas for negative ‘z’ are the same as for positive ‘z’ (thanks to symmetry)


Z Score Tables (Yonker p. 29-30)

[Figure: Normal Distribution. f(x) plotted for X from -3 to 3 on a vertical axis from 0 to 0.5, with the table columns labeled on the curve.]

Col. A: z value
Col. B: Area between 0 and z
Col. C: Area beyond z

Notice that we always have Col. B + Col. C = 0.5000


Use of Z Score Tables

Z score tables can be used to find the chance of a measurement (or percentage of cases) occurring between any two z values
    If the z scores are on opposite sides of the mean (one positive, one negative), add the areas from Column B for each score
    If the z scores are on the same side of the mean (both positive, or both negative), subtract the areas from Column B
        Subtract the smaller area from the larger area; otherwise you’d get negative area!


Use of Z Score Table Examples

Between z scores of -1.5 and +2.2, the percent of cases is, from Column B:
    z(-1.5) is the same area as z(+1.5)
    z(+1.5) = 0.4332 and z(+2.2) = 0.4861
    Percent = 43.32 + 48.61 = 91.93%
Between z scores of +1.5 and +2.2, the percent of cases is:
    Percent = 48.61 – 43.32 = 5.29%
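The table lookups above can be cross-checked with F(b) - F(a). This is a sketch that assumes Python 3.8 or later, whose standard library provides `statistics.NormalDist`; the helper name `area_between` is made up for illustration.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)

def area_between(z_low, z_high):
    """Proportion of cases between two z scores, computed as F(b) - F(a)."""
    return STANDARD_NORMAL.cdf(z_high) - STANDARD_NORMAL.cdf(z_low)

opposite_sides = area_between(-1.5, 2.2)  # table method: add the Column B areas
same_side = area_between(1.5, 2.2)        # table method: subtract the Column B areas

percent_opposite = round(100 * opposite_sides, 2)
percent_same = round(100 * same_side, 2)
```

Using the CDF directly removes the need to decide whether to add or subtract: the sign of each z score takes care of it.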


Cumulative Z Score

[Figure: Normal Distribution. f(x) plotted for X from -3 to 3 on a vertical axis from 0 to 0.5, with the percent of cases marked between integer z values. From p. 11 in Yonker.]

Percentages shown are the total percent between the integer Z score values: between 0 and 1 has 34.13%, between 1 and 2 has 13.59%, between 2 and 3 has 2.14%, and beyond 3 has 0.13%. The same percentages apply on the negative side.


F(x) Values

For x from minus 6 to plus 6, a normal distribution with mean = 0 and standard deviation of 1.0 gives:

Z     CDF               Delta CDF to next value
-6    0.000000000987    0.000000286
-5    0.000000286652    0.000031385
-4    0.000031671242    0.001318227
-3    0.001349898032    0.021400234
-2    0.022750131948    0.135905122
-1    0.158655253931    0.341344746
 0    0.500000000000
 1    0.841344746069
 2    0.977249868052
 3    0.998650101968
 4    0.999968328758
 5    0.999999713348
 6    0.999999999013


Cumulative Z Score

Key values are:
    From z = -1 to +1, total area is 68.26%
    From z = -1.96 to +1.96, total area is 95%
    From z = -2 to +2, total area is 95.44%
    From z = -2.57 to +2.57, total area is 99%
    From z = -3 to +3, total area is 99.74%


Transformed z, or T scores

A.k.a. Standardized scores or “T” scores
Z scores are transformed artificially
    Multiply a z score by the desired standard deviation and add the desired mean (e.g. 10 and 50)
    T = σ*z + μ becomes T = 10*z + 50
Examples
    A z score of -1.5 would give a T score of T = 10*(-1.5) + 50 = 35
    A z of +2.2 would give T = 10*(2.2) + 50 = 72
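A small sketch of the transformation, using the desired mean of 50 and standard deviation of 10 from the examples as defaults. The function and parameter names are illustrative, not taken from any handout.

```python
def t_score(z, desired_mean=50.0, desired_std_dev=10.0):
    """Rescale a z score to a distribution with the desired mean and std dev."""
    return desired_std_dev * z + desired_mean

t_low = t_score(-1.5)   # the first example in the lecture
t_high = t_score(2.2)   # the second example in the lecture

# The T score only goes negative when z falls below -5.
t_at_minus_five = t_score(-5.0)
```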


T scores

This is used in many fields of research, especially Psychology and Education (that’s where the “desired” mean and standard deviation values came from)
Benefits: gets rid of negative connotations of negative and zero scores
    Only z scores below z = -5.0 would result in a negative T score (typically less than one data point in a million)


Level of Confidence

Since the normal distribution goes to positive and negative infinity, we need a way to limit the range of expected or likely values
    Otherwise any normal distribution could take on any value some of the time
Define the Level of Confidence as the acceptable limits of predictable behavior
Typically use 95% for most applications, but 99% for medical research


Level of Confidence

Generally, we can say that the actual value of a parameter is in the range of its estimate ± twice its standard error, with a 95% level of confidence
    Use 1.96 instead of 2 for precise work
Thus the value of a parameter with mean of 6.2 and standard error of 1.9 lies between 2.4 (i.e., 6.2 – 2*1.9) and 10.0 (i.e., 6.2 + 2*1.9) with a 95% level of confidence
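The interval above can be sketched as follows. The function name is illustrative; the multiplier defaults to 1.96, and passing 2 reproduces the rounded rule of thumb used in the example.

```python
def confidence_interval(estimate, std_error, z=1.96):
    """Return (lower, upper) bounds: estimate plus or minus z standard errors."""
    margin = z * std_error
    return estimate - margin, estimate + margin

# Rule-of-thumb version from the lecture (multiplier of 2).
rough_low, rough_high = confidence_interval(6.2, 1.9, z=2.0)

# More precise 95% interval (multiplier of 1.96).
precise_low, precise_high = confidence_interval(6.2, 1.9)
```

The precise interval is slightly narrower than the rule-of-thumb one, since 1.96 is a little less than 2.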


The “t” Statistic

The t-statistic is defined as t = (parameter estimate) / (standard error)
If |t| > 2, then the parameter estimate is significantly different from zero at the 95% level of confidence
    t = 6.2/1.9 = 3.26
    Hence because |3.26| > 2, this estimate is statistically significant
    Also means the 95% confidence interval does not include zero
Again, use 1.96 instead of 2 for precise work
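A minimal sketch of the significance check described above, under the lecture's large-sample rule of comparing |t| against 1.96. The function names are illustrative only.

```python
def t_statistic(estimate, std_error):
    """Parameter estimate divided by its standard error."""
    return estimate / std_error

def is_significant(estimate, std_error, threshold=1.96):
    """True when the estimate differs from zero at roughly the 95% level."""
    return abs(t_statistic(estimate, std_error)) > threshold

t_value = t_statistic(6.2, 1.9)
significant = is_significant(6.2, 1.9)

# An estimate smaller than about two standard errors fails the check.
not_significant = is_significant(1.0, 1.9)
```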


The “t” Statistic

T = ‘t’???? No!
Notice that the T score is a completely different concept from the ‘t’ statistic
We’ll use the ‘t’ statistic to help judge SPSS output later in the course


Sampling Terms

Population = the entire realm of interest: everyone, all books, all publishers, all patrons, etc.
Sample = a subgroup or subset of the population
    Accurate inference requires good samples
    Use a sample since it is often hard or impossible to measure the entire population


Sampling Terms

Inferential Statistics
    Taking samples in order to infer unknown population parameters
Principle of Random Selection
    A procedure by which each member of the population has the same chance of being chosen as any other member
    Representative of the population


Types of Samples

Probabilistic sample: sampling in which the probability of each element in the population being selected is known and can be specified
    Each element has the same chance
Non-probabilistic sample: each probability is not known a priori (in advance)
    E.g. convenience samples, or available samples


Random Sampling Techniques

Simple Random
Stratified Random
    Proportional
    Disproportional
Cluster
Systematic


Simple Random Sample

Often can’t sample the entire user population
Must be a truly random sample, not just convenient
Can use a random number table, or computer-generated pseudo-random numbers (Yonker, p. 31), to choose the sample
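One way to draw such a sample with computer-generated pseudo-random numbers is the standard library's `random` module. This is a sketch: the customer list and the seed value are hypothetical, and the seed is there only so the draw can be repeated.

```python
import random

def simple_random_sample(population, sample_size, seed=None):
    """Draw sample_size distinct members, each equally likely to be chosen."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(population, sample_size)

# Hypothetical population of 100 customers.
customers = [f"customer-{i}" for i in range(1, 101)]
chosen = simple_random_sample(customers, 10, seed=515)
```

`Random.sample` draws without replacement, so no customer can appear twice in the sample.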


Stratified Random Sampling

Group customers into categories (strata); get simple random samples from each category (stratum). Can be a very efficient method.
Can weight each stratum equally (proportional s.s.) or unequally (disproportional s.s.)
    For unequal weight, make the sampling fraction proportional to the standard deviation of the stratum, and to 1/square root (cost of sampling)
    F ~ σ/sqrt(cost), where “sqrt” is “square root”, “~” is ‘proportional to’, and σ is the stratum’s standard deviation
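The disproportional rule only fixes sampling fractions up to a common factor, so one way to illustrate it is to express each stratum's fraction relative to the smallest one. The strata, standard deviations, and costs below are hypothetical, and the sketch assumes every standard deviation and cost is positive.

```python
import math

def relative_sampling_fractions(strata):
    """Map each stratum to its sampling fraction relative to the smallest one.

    strata maps a name to (standard deviation, cost of sampling one unit).
    Each raw fraction is proportional to std_dev / sqrt(cost).
    """
    raw = {name: sd / math.sqrt(cost) for name, (sd, cost) in strata.items()}
    smallest = min(raw.values())
    return {name: value / smallest for name, value in raw.items()}

# Hypothetical strata: (standard deviation, cost per sampled customer).
fractions = relative_sampling_fractions({
    "walk-in": (10.0, 1.0),
    "mail-order": (10.0, 4.0),
    "corporate": (20.0, 4.0),
})
```

Here the mail-order stratum is sampled least heavily: it is no more variable than walk-in customers but costs four times as much to reach.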


Proportional Stratified Random Sampling

Major             # in Population    % in Population    # in Sample
Education         50                 50                 10
Soc./Beh. Sci.    30                 30                 6
Business          15                 15                 3
Sci./Tech         5                  5                  1

For Education: % in Population = 50/100 X 100 = 50%, and # in Sample = 50% X 20 = 10

Data taken from Carpenter and Vasu (1978)
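The allocation in the table can be reproduced with a short sketch. The function name is illustrative; the stratum sizes and the total sample of 20 are the ones shown in the table.

```python
def proportional_allocation(stratum_sizes, total_sample):
    """Give each stratum a share of the sample equal to its population share.

    Rounding can make the parts add up to slightly more or less than
    total_sample when the shares are not whole numbers.
    """
    population = sum(stratum_sizes.values())
    return {
        name: round(total_sample * size / population)
        for name, size in stratum_sizes.items()
    }

majors = {"Education": 50, "Soc./Beh. Sci.": 30, "Business": 15, "Sci./Tech": 5}
allocation = proportional_allocation(majors, total_sample=20)
```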


Cluster Sampling

Divide the population into (geographic) clusters, then do simple random samples within each selected cluster
    Try for representative clusters
Not as efficient as simple random sampling, but cheaper
Sometimes used for in-person interviews


Cluster Sampling Example

Randomly select n (a certain number of) census tracts
From the randomly selected census tracts, randomly select n blocks
From the randomly selected blocks, randomly select addresses
Interview the family (the unit of study)


Systematic Sampling

Calculate your sampling interval:
    Interval = (Size of population) / (Size of sample)
Select your first element at random from the sampling interval
Move ahead systematically by the sampling interval (e.g. every 10th customer) until you reach your desired sample size
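The three steps above might be sketched as follows. The sketch assumes the sample size is no larger than the population and uses integer division for the interval, which is a simplification when the population size is not an exact multiple of the sample size.

```python
import random

def systematic_sample(population, sample_size, seed=None):
    """Pick a random start within the first interval, then every interval-th item."""
    interval = len(population) // sample_size  # sampling interval
    rng = random.Random(seed)
    start = rng.randrange(interval)            # random first element
    return [population[start + i * interval] for i in range(sample_size)]

# Hypothetical population: customers numbered 1 to 100, sample of 10.
customer_ids = list(range(1, 101))
every_tenth = systematic_sample(customer_ids, 10, seed=3)
```

With 100 customers and a sample of 10, the interval is 10, so the result is one randomly chosen customer from the first ten followed by every 10th customer after that.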


Non-random Sampling Techniques

Quota
Accidental
Judgment


Non-random techniques

Quota sampling
    Is economical
    Is a non-random version of stratified sampling
    Define desired characteristics in advance: gender, race, age, etc.
    Example: Interview 20 females and 20 males over the age of 65


Non-random techniques

Accidental sampling
    Mall market studies, Internet surveys
    Often requires a choice (by the interviewee) to be sampled
Judgment sampling
    Pick people who have some special knowledge
    Seek out experts; more of an interview method


What is a Survey Study (Assessment)?

To describe systematically the facts and characteristics of a given population or area of interest, factually and accurately. (Isaac and Michael)
Survey studies are used to:
    Describe what is
    Establish need
    Identify problems
    Infer possible solutions


Surveys

A survey often refers to a large data collection effort:
    What it involves: personal interviews, telephone interviews, a questionnaire sent through the mail, document survey, literature survey, social area analysis (observation and description of different areas of the city)
    “Who” it involves: community, customers, users, employees, literature
    Purpose: information gathering and fact finding to
        Describe what exists (such as public library services)
        Establish need
        Identify problems
        Imply possible solutions


Customer Satisfaction Surveys

Could have many opportunities to conduct surveys
    Customer call-back after x days
    Customer complaints
    Direct customer visits
    Customer user groups
    Conferences


Customer Satisfaction Surveys

Want a representative sample of all customers
Three main methods are used
    Personal interview
    Telephone interview
    Questionnaire by mail


Personal Interview Advantages:

1. Explore complex issues
2. Question clarification
3. Rapport
4. Higher response rate
5. Observation


Personal Interview Disadvantages:

1. Interviewer bias
2. Question uniformity
3. No anonymity
4. Difficult to analyze
5. Time consuming


Telephone Interview Advantages:

1. Some anonymity
2. Low cost
3. Rapid completion
4. Higher response rate
5. No travel time
6. Widely spread sample


Telephone Interview Disadvantages:

1. Reaching people
2. Some interview bias possible
3. Only accessible phone numbers
4. No observation


Structured vs. Unstructured Interviews

In an unstructured interview, only the first question is standard for all respondents
    The remaining questions are determined by the answers of each respondent
In a semi-structured interview, the questions are open ended, but all of the respondents receive the same questions


Questionnaire by Mail Advantages:

1. Economical
2. Faster
3. Wide range of issues
4. Widely spread sample
5. Avoids interviewer bias
6. Anonymity


Questionnaire by Mail Disadvantages:

1. Question clarity
2. No probing
3. Who is answering?
4. No observation
5. Response rate


Interview & Questionnaire Tips

1. Start with easy questions that the respondent will enjoy answering
    You want to prevent boredom early on while building rapport and putting the respondent at ease
2. Try for an easy and natural flow over topics
    Place like items together and give a brief explanation when a topic breaks
3. Within topics, go from the general to the specific
    For example, start with questions on use of the Internet in general, then move on to specific questions about the use of search engines


Interview & Questionnaire Tips

4. Put open-ended or difficult questions (if any) at the end of the interview or questionnaire
5. Put questions on “sensitive” matters (such as age or income) at the end of the interview or questionnaire
    Otherwise, the interview may be over before it has started!


The “Question Continuum”

Closed Questions
    Fixed Alternatives
    Structured
    “Your annual income is: a) 0-25K, b) 26-35K, ”
Semi Structured Questions
Open Questions
    Free form responses
    Unstructured
    “What do you like about Drexel?”


Sample Size

How big is enough? Must choose:
    Confidence level (80 - 95%, to get Z)
    Margin of error (B = 3 - 5%)
For a simple random sample, also need
    Estimated satisfaction level (p), which is what you’re trying to measure, and
    Total population size (N = total number of customers)


Critical Z values

Confidence Level (2-sided)    Critical Z
80%                           1.28
90%                           1.645
95%                           1.96
99%                           2.57
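These critical values can be recovered from the inverse of the cumulative normal function. The sketch below assumes Python 3.8 or later for `statistics.NormalDist`; the helper name `critical_z` is illustrative. Computed this way, the 99% value comes out near 2.576, which the table shows as 2.57.

```python
from statistics import NormalDist

def critical_z(confidence_level):
    """Two-sided critical z: leaves (1 - confidence_level)/2 in each tail."""
    upper_tail_point = 0.5 + confidence_level / 2.0
    return NormalDist().inv_cdf(upper_tail_point)

# Recompute the table rows for the four confidence levels shown.
critical_values = {level: critical_z(level) for level in (0.80, 0.90, 0.95, 0.99)}
```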


Sample Size

Sample size, n
    n = [N*Z^2*p*(1-p)] / [N*B^2 + Z^2*p*(1-p)]
The sample size depends heavily on the answer we want to obtain, the actual level of customer satisfaction (p)!


Sample Size

If we choose
    80% confidence level, then Z = 1.28
    5% margin of error, then B = 5% = 0.05
    and expect 90% satisfaction, then p = 0.90
n = (N*1.28^2*0.9*0.1) / (N*0.05^2 + 1.28^2*0.9*0.1)
n = 0.1475*N / (0.0025*N + 0.1475)


Sample Size

Given: Z = 1.28, p = 0.9, B = 0.05
Hence: Z^2 = 1.6384, p(1-p) = 0.09, B^2 = 0.0025
Find:

N          n
10         8.550355
20         14.93558
50         27.06052
100        37.09996
200        45.54935
500        52.75873
1000       55.69724
10000      58.63655
100000     58.94763
1000000    58.97892
Infinity   58.9824

Beware of sampling small populations! (the smallest N values need nearly the whole population)
For very large N, sample size stabilizes
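The table values can be regenerated from the sample size formula n = N*Z^2*p*(1-p) / (N*B^2 + Z^2*p*(1-p)). This is a sketch with illustrative names; rounding up to a whole number of customers is an added assumption, since the table itself reports unrounded values.

```python
import math

def sample_size(population, z, margin, p):
    """Unrounded sample size for a simple random sample of a finite population."""
    variance_term = z ** 2 * p * (1 - p)
    return population * variance_term / (population * margin ** 2 + variance_term)

# Lecture values: 80% confidence, 5% margin of error, 90% expected satisfaction.
table = {n_pop: sample_size(n_pop, z=1.28, margin=0.05, p=0.9)
         for n_pop in (10, 100, 1000, 1000000)}

# Assumption: round up, because a fraction of a customer cannot be surveyed.
customers_to_survey = math.ceil(table[100])
```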


Sample Size

If you don’t know the customer satisfaction value ‘p’, use 0.5 as a worst-case estimate
Once the real value of ‘p’ is known, solve for the actual value of B (margin of error)
Key is finding a truly representative sample
For N approaching infinity, sample size simplifies to:
    n = p*(1-p)*(Z/B)^2
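A short sketch of the large-population case. The first function is the simplified formula; the second rearranges that same formula to solve for B, which applies only when N is very large. The function names and the 95% worst-case scenario are illustrative.

```python
import math

def large_population_sample_size(z, margin, p=0.5):
    """n = p*(1-p)*(Z/B)^2; p defaults to the worst case of 0.5."""
    return p * (1 - p) * (z / margin) ** 2

def margin_of_error(n, z, p):
    """Rearranged large-population formula: B = Z*sqrt(p*(1-p)/n)."""
    return z * math.sqrt(p * (1 - p) / n)

# Lecture values (Z = 1.28, B = 0.05, p = 0.9) match the 'Infinity' table row.
n_lecture = large_population_sample_size(1.28, 0.05, p=0.9)

# Hypothetical worst case at 95% confidence and a 5% margin of error.
n_worst_case = large_population_sample_size(1.96, 0.05)

# If the survey then finds p = 0.9, the achieved margin is tighter than 5%.
achieved_margin = margin_of_error(n_worst_case, 1.96, 0.9)
```

The worst case of p = 0.5 maximizes p*(1-p), so planning with it can only overestimate the sample needed.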
