Action Research Measurement Scales and Descriptive Statistics. INFO 515 Glenn Booker. Measurement Needs. Need a long set of measurements for one project, and/or many projects to examine statistical trends Could use measurements to test specific hypotheses - PowerPoint PPT Presentation
INFO 515 Lecture #2 1
Action Research
Measurement Scales and Descriptive Statistics
INFO 515
Glenn Booker
INFO 515 Lecture #2 2
Measurement Needs
Need a long set of measurements for one project, and/or many projects, to examine statistical trends
Could use measurements to test specific hypotheses
Other realistic uses of measurement are to help make decisions and track progress
Need scales to make measurements!
INFO 515 Lecture #2 3
Measurement Scales
There are four types of measurement scales
    Nominal
    Ordinal
    Interval
    Ratio
Completely optional mnemonic: to remember the sequence, I think of ‘NOIR’ as in the expression ‘film noir’ (‘noir’ is French for ‘black’)
INFO 515 Lecture #2 4
Nominal Scale
A nominal (“name”) scale groups or classifies things into categories, which:
    Must be jointly exhaustive (cover everything)
    Must be mutually exclusive (one thing can’t be in two categories at once)
    Are in any sequence (none better or worse)
So a nominal variable puts things into buckets which have no inherent order to them
INFO 515 Lecture #2 5
Nominal Scale
Examples include
    Gender (though some would dispute limitations of only male/female categories)
    Dewey decimal system
    The Library of Congress system
    Academic majors
    Makes of stuff (cars, computers, etc.)
    Parts of a system
INFO 515 Lecture #2 6
Ordinal Scale
This measurement ranks things in order
Sequence is important, but the intervals between ranks are not defined numerically
Rank is relative, such as “greater than” or “less than”
E.g. letter grades, urgency of problems, class rank, inspection ratings
So now the buckets we’re using have some sense of order or direction
INFO 515 Lecture #2 7
Interval Scale
An interval scale measures quantitative differences, not just relative rank
Addition and subtraction are allowed
E.g. common temperature scales (°F or °C), a single date (Feb 15, 1999), maybe IQ scores
    Let me know if you find any more examples
A zero point, if any, is arbitrary (90 °F is *not* six times hotter than 15 °F!)
INFO 515 Lecture #2 8
Ratio Scale
A ratio scale is an interval scale with a non-arbitrary zero point
Allows division and multiplication
The “best” type of scale to use, if possible
E.g. defect rates for software, test scores, absolute temperature (Kelvin or Rankine), the number or count of almost anything, size, speed, length, …
INFO 515 Lecture #2 9
Summary of Scales
Nominal
    Names different categories, not ordered, not ranked: Male, Female, Republican, Catholic, ...
Ordinal
    Categories are ordered: Low, High, Sometimes, Never, ...
Interval
    Fixed intervals, no absolute zero: IQ, Temperature
Ratio
    Fixed intervals with an absolute zero point: Age, Income, Years of Schooling, Hours/Week, Weight
Age could be measured as ratio (years), ordinal (young, middle, old), or nominal (baby boomer, gen X)
Scale of measurement affects (may determine) the type of statistics that you can use to analyze the data
INFO 515 Lecture #2 10
Scale Hierarchy
Measurement scales are hierarchical: ratio (best) / interval / ordinal / nominal
Lower level scales can always be derived from data which uses a higher scale
E.g. defect rates (a ratio scale) could be converted to {High, Medium, Low} or {Acceptable, Not Acceptable} (ordinal scales)
INFO 515 Lecture #2 11
Reexamine Central Tendencies
If data are nominal, only the mode is meaningful
If data are ordinal, both median and mode may be used
If data are ratio or interval (called “scale” in SPSS), you may use mean, median, and mode
INFO 515 Lecture #2 12
Reexamine Variables
Discrete variables use counting units or specific categories
    Example: makes of cars, grades, …
    Use Nominal or Ordinal scales
Continuous = Integer or Real Measurements
    Example: IQ Test scores, length of a table, your weight, etc.
    Use Ratio or Interval scales
INFO 515 Lecture #2 13
Refine Research Types
Qualitative Research tends to use Nominal and/or Ordinal scale variables
Quantitative Research tends to use Interval and/or Ratio scale variables
INFO 515 Lecture #2 14
Frequency Distributions
Frequency distributions describe how many times each value occurs in a data set
They are useful for understanding the characteristics of a data set
Frequencies are the count of how many times each possible value appears for a variable (gender = male, or operating system = Windows 2000)
INFO 515 Lecture #2 15
Frequency Distributions
They are most useful when there is a fixed and relatively small number of options for that variable
They’re harder to use for variables which are numbers (either real or integer) unless there are only a few specific options allowed (e.g. test responses 1 to 5 for a multiple choice question)
INFO 515 Lecture #2 16
Generating Frequency Distributions
Select the command Analyze / Descriptive Statistics / Frequencies…
Select one or more “Variable(s):”
Note that the Frequency (count) and percent are included by default; other outputs may be selected under the “Statistics...” button
A bar chart can be generated as well using the “Charts…” button; see another way later
INFO 515 Lecture #2 17
Sample Frequency Output
EDUCATIONAL LEVEL

Valid   Frequency   Percent   Valid Percent   Cumulative Percent
8              53      11.2            11.2                 11.2
12            190      40.1            40.1                 51.3
14              6       1.3             1.3                 52.5
15            116      24.5            24.5                 77.0
16             59      12.4            12.4                 89.5
17             11       2.3             2.3                 91.8
18              9       1.9             1.9                 93.7
19             27       5.7             5.7                 99.4
20              2        .4              .4                 99.8
21              1        .2              .2                100.0
Total         474     100.0           100.0
INFO 515 Lecture #2 18
Analysis of Frequency Output
The first, unlabeled column has the values of data; here, it first lists all Valid values (there are no Invalid ones, or it would show those too)
The Frequency column is how many times that value appears in the data set
The Percent column is the percent of cases with that value; in the fourth row, the value 15 appears 116 times, which is 24.5% of the 474 total cases (116/474*100 = 24.5%)
INFO 515 Lecture #2 19
Analysis of Frequency Output
The Valid Percent column divides each Frequency by the total number of Valid cases (= Percent column if all cases are valid)
The Cumulative Percent adds up the Valid Percent values going down the rows; so the first entry is the Valid Percent for the first row, the second entry is 11.2 + 40.1 = 51.3%, the next is 51.3 + 1.3 = 52.5%, and so on
    Round-off error: 51.3 + 1.3 is 52.6, but the table shows 52.5 because the cumulative column is built from the unrounded percentages
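The Frequency, Percent, and Cumulative Percent columns can also be rebuilt outside SPSS. The sketch below is only an illustration in Python, not the way SPSS computes its output; the function name `frequency_table` is hypothetical, and it assumes there are no missing values, so Percent and Valid Percent coincide. It accumulates the raw counts before rounding, which is why the third cumulative entry comes out as 52.5.

```python
from collections import Counter


def frequency_table(values):
    """Return rows of (value, frequency, percent, cumulative percent)."""
    counts = Counter(values)
    total = sum(counts.values())
    rows = []
    running = 0  # cumulative count, kept unrounded
    for value in sorted(counts):
        running += counts[value]
        rows.append((
            value,
            counts[value],
            round(100 * counts[value] / total, 1),
            round(100 * running / total, 1),
        ))
    return rows


# Educational level data rebuilt from the frequencies on the sample output slide
education = ([8] * 53 + [12] * 190 + [14] * 6 + [15] * 116 + [16] * 59
             + [17] * 11 + [18] * 9 + [19] * 27 + [20] * 2 + [21] * 1)

for row in frequency_table(education):
    print(row)
```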
INFO 515 Lecture #2 20
Generating Frequency Graphs
Frequency is often shown using a bar graph
Bar graphs help make small amounts of data more visible
To generate a frequency graph alone
    Click on the Charts menu and select “Bar…”
    Leave the “Simple” graph selected, and leave “Summaries are for groups of cases” selected; click the “Define” button
INFO 515 Lecture #2 21
Generating Frequency Graphs
    Let the Bars Represent remain “N of cases”
    Click on variable “Educational Level (years)” and move it into the Category Axis field
    Click “OK”
You should get the graph on the next slide. Notice that the text below the X axis is the Label for the Category Axis.
INFO 515 Lecture #2 22
Sample Frequency Output
Notice that the exact same graph can be generated from Frequencies, or just as a bar graph
INFO 515 Lecture #2 23
Frequency Distributions
A frequency distribution is a tabulation that indicates the number of times a score or group of scores occurs
Bar charts are best used to graph the frequency of nominal & ordinal data
Histograms are best used to display the shape of interval & ratio data
INFO 515 Lecture #2 24
Frequency Distribution Example
[Bar chart: Employment Category. X axis: Employment Category (Manager, Custodial, Clerical); Y axis: Frequency, 0 to 400. SPSS for Windows, Student Version]

Employment Category

Valid       Frequency   Percent   Valid Percent   Cumulative Percent
Clerical          363      76.6            76.6                 76.6
Custodial          27       5.7             5.7                 82.3
Manager            84      17.7            17.7                100.0
Total             474     100.0           100.0
INFO 515 Lecture #2 25
Basic Measures - Ratio
Used for two exclusive populations (every case fits into one OR the other)
    Ratio = (# of testers) / (# of developers)
E.g. tester to developer ratio is 1:4
INFO 515 Lecture #2 26
Proportions and Fractions
Used for multiple (> 2) populations
    Proportion = (Number of this population) / (Total number of all populations)
Sum of all proportions equals unity (one)
    E.g. survey results
Proportions are based on integer units
Fractions are based on real numbered units
INFO 515 Lecture #2 27
Percentage
A proportion or fraction multiplied by 100 becomes a percentage
Only report percentages when N (total population measured) is above ~30 to 50; and always provide N for completeness
    Why? Otherwise a percentage will imply more accuracy than the data supports
    If 2 out of 3 people like something, it’s misleading to report that 66.667% favor it
INFO 515 Lecture #2 28
Percents
Percent = the percentage of cases having a particular value.
Raw percent = divide the frequency of the value by the total number of cases (including missing values)
Valid percent = calculated as above but excluding missing values
INFO 515 Lecture #2 29
Percent Change
The percent increase in a measurement is the new value, minus the old one, divided by the old value, expressed as a percentage; negative means decrease:
    % increase = (new - old) / old * 100
The percent change is the absolute value of the percent increase or decrease:
    % change = | % increase |
INFO 515 Lecture #2 30
Percent Increase
    Percent increase = (Later Value - Earlier Value) / Earlier Value
So if a collection goes from 50,000 volumes in 1965 to 150,000 in 1975, the percent increase is:
    (150,000 - 50,000) / 50,000 = 2 = 200%
Always divide by where you started
Carpenter and Vasu, (1978)
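As a quick check of the arithmetic, here is a minimal Python sketch of the two formulas from the Percent Change and Percent Increase slides. The function names are hypothetical, chosen only for this illustration, and results are returned already multiplied by 100.

```python
def percent_increase(earlier, later):
    """(later - earlier) / earlier, expressed as a percentage.

    A negative result means a decrease.
    """
    if earlier == 0:
        raise ValueError("percent increase is undefined when the earlier value is zero")
    return (later - earlier) / earlier * 100


def percent_change(earlier, later):
    """Absolute value of the percent increase or decrease."""
    return abs(percent_increase(earlier, later))


# Library collection example: 50,000 volumes in 1965, 150,000 in 1975
print(percent_increase(50_000, 150_000))  # 200.0
```

Swapping the two arguments shows why "always divide by where you started" matters: going from 150,000 back down to 50,000 is a decrease of about 66.7%, not 200%.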
INFO 515 Lecture #2 31
Percentiles
A percentile is the point in a distribution at or below which a given percentage of scores fall.
The median is the 50th percentile
Think of the SAT scores: your percentile for verbal, math, etc. means what percent of people scored at or below you
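One simple way to read the definition is "the percent of scores at or below a given score". The Python sketch below uses that reading; other percentile conventions exist, and both the function name and the 1-to-100 score list are made up for illustration.

```python
def percentile_rank(scores, score):
    """Percent of scores that are at or below the given score."""
    if not scores:
        raise ValueError("scores must not be empty")
    at_or_below = sum(1 for s in scores if s <= score)
    return 100 * at_or_below / len(scores)


# Hypothetical data: one hundred test takers with scores 1 through 100
scores = list(range(1, 101))
print(percentile_rank(scores, 50))  # 50.0
```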
INFO 515 Lecture #2 32
Rate
Rate conveys the change in a measurement, such as over time, dx/dt.
    Rate = (# observed events) / (# of opportunities) * constant
Rate requires exposure to the risk being measured
E.g. defects per KSLOC (1000 lines of code) = (# defects) / (# of lines of code) * 1000
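A small Python sketch of the rate formula, using defects per KSLOC as the example. The function name and the sample numbers (45 defects in 30,000 lines) are hypothetical; the constant 1000 converts "per line" into "per thousand lines".

```python
def defects_per_ksloc(defects, lines_of_code):
    """Rate = observed events / opportunities * constant (here 1000)."""
    if lines_of_code <= 0:
        raise ValueError("lines_of_code must be positive")
    return defects * 1000 / lines_of_code


# Hypothetical project: 45 defects found in 30,000 source lines
print(defects_per_ksloc(45, 30_000))  # 1.5
```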
INFO 515 Lecture #2 33
Exponential Notation
You might see output of the form +2.78E-12
The ‘E’ means ‘times ten to the power of’
    This is +2.78 * 10**-12
A negative exponent, e.g. -12, makes it a very small number
    10**-12 = 0.000000000001
    10**+12 = 1,000,000,000,000
The leading number, here +2.78, controls whether it is a positive or negative number
INFO 515 Lecture #2 34
Exponential Notation
[Number line, running from positive at the top to negative at the bottom]
Pos.
    +5*10**+12 (a positive number >>1)
    +5*10**-12 (a positive number <<1)
0
    -5*10**-12 (a negative number, magnitude <<1)
    -5*10**+12 (a negative number, magnitude >>1)
Neg.
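Most programming languages read the same ‘E’ notation. As an illustration only (Python, not SPSS), the sketch below parses the example value and prints it both ways; the variable names are arbitrary.

```python
import math

# Parse the example output value written in 'E' notation
value = float("+2.78E-12")

print(value)               # shown in exponential form
print(f"{value:.14f}")     # the same number written out with 14 decimal places
print(f"{5e12:,.0f}")      # a large positive exponent written out in full

# 'E-12' really does mean "times ten to the power of -12"
print(math.isclose(value, 2.78 * 10 ** -12))

# The sign of the leading number, not the exponent, sets the sign of the value
print(-5e-12 < 0 < 5e-12)
```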
INFO 515 Lecture #2 35
Precision
Keep your final output to a consistent level of precision (significant digits)
Don’t report one value as “12” and another as “11.86257523454574123”
Pick a level of precision to match the accuracy of your inputs (or one digit more), and make sure everything is reported that way consistently (e.g. 12.0 and 11.9)
INFO 515 Lecture #2 36
Data Analysis
Raw data is collected, such as the dates a particular problem was reported and closed
Refined data is extracted from raw data, e.g. the time it took a problem to be resolved
Derived data is produced by analyzing refined data, such as the average time to resolve problems
INFO 515 Lecture #2 37
Descriptive Statistics
Descriptive statistics describe the key characteristics of one set of data (univariate)
    Mean, median, mode, range (see also last week)
    Standard deviation, variance
    Skewness
    Kurtosis
    Coefficient of variation
INFO 515 Lecture #2 38
Mean
A.k.a.: Average Score
The mean is the arithmetic average of the scores in a distribution
    Add all of the scores
    Divide by the total number of scores
The mean is greatly influenced by extreme scores; they pull it off center
INFO 515 Lecture #2 39
Mean Calculation
HOLDINGS IN 7 DIFFERENT LIBRARIES

   X
7400
6500
6200
5900
5100
4300
3800
∑X = 39200    (here, sum every data value)

Mean = ∑X / N = 39200 / 7 = 5600
INFO 515 Lecture #2 40
Mean with a Frequency Distribution

X (IQ)   F = Freq   FX = F*X
140             2        280
135             1        135
132             2        264
130             1        130
128             1        128
126             1        126
125             4        500
123             1        123
120             4        480
110             3        330
101             1        101
Total          21       2597

N = ∑F = 21
Mean = ∑FX / N = 2597 / 21 = 123.67 = 124 (round off)
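The same calculation can be sketched in a few lines of Python. This is only an illustration of Mean = ∑FX / N with N = ∑F; the names `iq_frequencies` and `mean_from_frequencies` are hypothetical, and the data are the IQ scores and frequencies from the slide.

```python
# IQ score -> how many times it occurs (F), taken from the slide
iq_frequencies = {
    140: 2, 135: 1, 132: 2, 130: 1, 128: 1, 126: 1,
    125: 4, 123: 1, 120: 4, 110: 3, 101: 1,
}


def mean_from_frequencies(freq):
    """Mean = sum(F * X) / sum(F) for a value -> frequency mapping."""
    n = sum(freq.values())                        # N = sum of F
    sum_fx = sum(x * f for x, f in freq.items())  # sum of F*X
    return sum_fx / n


print(sum(iq_frequencies.values()))                      # N
print(round(mean_from_frequencies(iq_frequencies), 2))   # mean to two decimals
```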
INFO 515 Lecture #2 41
Central Tendency Example

Staff Salaries
$ 4100
  6000
  6000
  6000
  8000
  9000
 10000
 11000
 20000

Mode = $6000
Median = (9 + 1) / 2 = 5th value = $8000
Mean = ∑X / N = 80100 / 9 = $8900
Carpenter and Vasu, (1978)
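For comparison, Python's standard `statistics` module gives the same three values for the staff salaries. The second half of the sketch is an extra illustration, not part of the original example: it drops the single $20,000 salary to show how much more the mean moves than the median.

```python
import statistics

salaries = [4100, 6000, 6000, 6000, 8000, 9000, 10000, 11000, 20000]

print(statistics.mode(salaries))    # most frequent value
print(statistics.median(salaries))  # middle (5th of 9) value
print(statistics.mean(salaries))    # sum of values / N

# Illustration: remove the one extreme salary and compare again
without_extreme = salaries[:-1]
print(statistics.median(without_extreme))
print(statistics.mean(without_extreme))
```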
INFO 515 Lecture #2 42
Handling Extreme Values
In cases where you have an extreme value (high or low) in a distribution, it is helpful to report both the median and the mean
Reporting both values gives some indication (through comparison) of a skewed distribution
INFO 515 Lecture #2 43
Measures of Variation
Measures which indicate the variation, or spread of scores in a distribution
    Range (see last week)
    Variance
    Standard Deviation
INFO 515 Lecture #2 44
Standard Deviation, Variance
Standard deviation is roughly the average amount the data differs from the mean (average), where Xbar is the mean
    SD = sqrt( ∑(Xi - Xbar)**2 / (N-1) )
    SD = sqrt( Variance )
Variance is the standard deviation squared
    Variance = ∑(Xi - Xbar)**2 / (N-1)
[per ISO 3534-1, para 2.33 and 2.34]
INFO 515 Lecture #2 45
Standard Deviation
The standard deviation is the square root of the variance. It is expressed in the same units as the original data.
Since the variance is expressed in “squared units”, it doesn’t make much practical sense on its own. For example, what are “squared books” or “squared man-hours?”
INFO 515 Lecture #2 46
Computing the Variance
    S**2 = ∑(X - Mean)**2 / N
1. Subtract the mean from each score
2. Square the result
3. Sum the squares for all data points
4. Divide by the N of cases
INFO 515 Lecture #2 47
Divide by N or N-1???
You’ll see different formulas for variance and standard deviation; some divide by N, some by N-1 (e.g. slides 44 and 46); why?
    If your data covers the entire population (you have all of the possible data to analyze), then divide by N
    If your data covers a sample from the population, divide by N-1
INFO 515 Lecture #2 48
Standard Deviation for Freq Dist.
Standard Deviation of Bookmobile Distribution

 X      F     FX    X**2   FX**2
17      2     34     289     578
16      4     64     256    1024
14      5     70     196     980
10      2     20     100     200
 9      3     27      81     243
 6      1      6      36      36
Total  17    221            3061

σ = sqrt( (∑FX**2 - (∑FX)**2 / N) / N ) = sqrt( (3061 - (221)**2 / 17) / 17 )
  = sqrt( (3061 - 2873) / 17 ) = 3.3
Notice that FX**2 is F*(X**2), not (F*X)**2
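The divide-by-N versus divide-by-(N-1) choice can be seen directly on the bookmobile data. In Python's standard `statistics` module, `pstdev` divides by N (population) and `stdev` divides by N-1 (sample). The sketch below is an illustration under the assumption that the 17 stop counts listed on the Bookmobile Data slide are the full data set.

```python
import statistics

# Number of stops for bookmobiles A through Q (Bookmobile Data slide)
stops = [6, 9, 10, 14, 16, 17, 14, 16, 14, 10, 9, 14, 14, 16, 9, 17, 16]

population_sd = statistics.pstdev(stops)  # divides by N
sample_sd = statistics.stdev(stops)       # divides by N-1

print(round(population_sd, 1))  # matches the 3.3 computed by hand
print(round(sample_sd, 2))      # matches the Std. Dev shown on the bookmobile histogram
```

The difference shrinks as N grows, which is why the choice matters most for small data sets like this one.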
INFO 515 Lecture #2 49
Std Dev Reflects Consistency

Distance from Target        Frequency
In Meters             Battery A   Battery B
 200                      2           0
 150                      4           1
 100                      5           5
  50                      7          10
   0                      9          13
 -50                      7          10
-100                      5           5
-150                      4           1
-200                      2           0

Mean                      0           0
Standard D.          102.74       65.83
Runyon and Haber (1984)
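The two standard deviations can be reproduced with the frequency-distribution formula from the bookmobile slide, σ = sqrt((∑FX**2 - (∑FX)**2 / N) / N). The Python sketch below is illustrative; the function name is hypothetical, and it divides by N, which is what yields the values quoted in the table.

```python
import math


def sd_from_frequencies(values, freqs):
    """Population standard deviation from values and their frequencies."""
    n = sum(freqs)
    sum_fx = sum(f * x for x, f in zip(values, freqs))
    sum_fx2 = sum(f * x * x for x, f in zip(values, freqs))  # F * (X**2)
    return math.sqrt((sum_fx2 - sum_fx ** 2 / n) / n)


distances = [200, 150, 100, 50, 0, -50, -100, -150, -200]
battery_a = [2, 4, 5, 7, 9, 7, 5, 4, 2]
battery_b = [0, 1, 5, 10, 13, 10, 5, 1, 0]

print(round(sd_from_frequencies(distances, battery_a), 2))
print(round(sd_from_frequencies(distances, battery_b), 2))
```

Both batteries average zero distance from the target, so only the standard deviation reveals that Battery B is the more consistent one.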
INFO 515 Lecture #2 50
Standard Deviation vs. Std. Error
To be precise, the standard error is the standard deviation of a statistic used to estimate a population parameter [per ISO 3534-1, para 2.56 and 2.50]
So standard error pertains to sample data, while standard deviation should describe the entire population
We often use them interchangeably
INFO 515 Lecture #2 51
Skewness
Skewness is a measure of the asymmetry of a distribution.
The normal distribution is symmetric, and has a skewness value of zero.
A distribution with a significant positive skewness has a long right tail
    Positive skewness means the mean and median are more positive than the mode (the peak of the distribution)
Negative skewness has a long left tail.
INFO 515 Lecture #2 52
Skewness
As a rough guide, a skewness magnitude more than two (>2 or <-2) is taken to indicate a significant departure from symmetry
[Figure: two curves, labeled “Positive skewness” and “Negative skewness”. Both curves have the same mean and standard deviation. From www.riskglossary.com]
INFO 515 Lecture #2 53
Kurtosis
Kurtosis is a measure of the extent to which data clusters around a central point
For a normal distribution, the value of the kurtosis is 3
The kurtosis excess (= kurtosis-3) is zero for a normal distribution
    Positive kurtosis excess indicates that the data have longer tails than “normal”
    Negative kurtosis excess indicates the data have shorter tails
INFO 515 Lecture #2 54
Kurtosis
[Figure: two curves, labeled “Platykurtic” (left) and “Leptokurtic” (right), with the tail marked. From www.riskglossary.com]
The curve on the right has higher kurtosis than the curve on the left. It is more peaked at the center, and it has fatter tails. If a distribution’s kurtosis is greater than 3, it is said to be leptokurtic (sharp peak). If its kurtosis is less than 3, it is said to be platykurtic (flat peak). They might have equal standard deviation.
Mesokurtic is the “normal” curve, which has kurtosis = 3.
INFO 515 Lecture #2 55
Skewness & Kurtosis Example
From the Employee data set, use Analyze / Descriptive Statistics / Descriptives, select the ‘salary’ variable; under Options…, select Skewness and Kurtosis
Skewness is 2.125, so there is significant positive skewness to the data
Kurtosis is 5.378, so the data is leptokurtic
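To make the two measures concrete, here is a Python sketch using the plain moment definitions: skewness = m3 / m2**1.5 and kurtosis = m4 / m2**2, where mk is the k-th central moment. This is an assumption made for illustration; statistics packages often apply small-sample corrections and may report kurtosis excess instead of kurtosis, so their numbers can differ from these. The function names are hypothetical.

```python
def central_moment(values, k):
    """Average of (x - mean)**k over all values."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** k for x in values) / len(values)


def skewness(values):
    """Moment-based skewness; zero for a symmetric data set."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def kurtosis(values):
    """Moment-based kurtosis; about 3 for normally distributed data."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2


def kurtosis_excess(values):
    """Kurtosis minus 3, so a normal distribution scores zero."""
    return kurtosis(values) - 3


symmetric = [1, 2, 3, 4, 5]
salaries = [4100, 6000, 6000, 6000, 8000, 9000, 10000, 11000, 20000]

print(skewness(symmetric))   # symmetric data: no skew
print(skewness(salaries))    # the 20000 salary creates a long right tail
print(kurtosis_excess(symmetric))
```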
INFO 515 Lecture #2 56
Coefficient of Variation
The coefficient of variation (CV) is the ratio of the standard deviation to the mean:
    CV = Standard Deviation / Mean    [per ISO 3534-1, para 2.35]
A smaller CV means the mean is more representative of the total distribution
Can compare means and standard deviations of two different populations
    Higher CV means more variability
INFO 515 Lecture #2 57
Coefficient of Variation
Divide the standard deviation by the mean to get CV.
    CV = Standard Deviation / Mean
The smaller the decimal fraction this produces, the more representative the mean is for the total distribution
The larger the decimal fraction, the worse job the mean does of giving us a true picture of the distribution
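A short Python sketch of the CV calculation, applied to the two sets of summary statistics that appear on the histogram slides in this lecture (current salary and bookmobile stops). The function name is hypothetical.

```python
def coefficient_of_variation(std_dev, mean):
    """CV = standard deviation / mean."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return std_dev / mean


# Summary statistics quoted on the two histogram slides
salary_cv = coefficient_of_variation(6830.26, 13767.8)
stops_cv = coefficient_of_variation(3.43, 13.0)

print(round(salary_cv, 3))
print(round(stops_cv, 3))
```

Because CV is unitless, dollars and stops can be compared directly: the salary data is relatively more spread out around its mean than the bookmobile data.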
INFO 515 Lecture #2 58
Generating a Histogram
Frequency graphs can be generated for variables which have many integer or real values (e.g. salary), by using a histogram
A histogram shows how many data points fall into various ranges of values
The closest “normal” curve can be shown for comparison
INFO 515 Lecture #2 59
Generating a Histogram
The “¾ rule” is helpful for histograms
    The tallest bar should be ¾ of the height of the Y axis
Be sure to label X and Y axes appropriately
Each bar shows how many data points fall within a range of X axis values
See How to Lie with Statistics, by Darrell Huff
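What a histogram bar counts can be sketched without any plotting library. The Python below sorts values into equal-width ranges and counts how many land in each; the function name, the starting point of 0, and the bar width of 5000 are arbitrary choices for this illustration (SPSS picks its own bar widths). The data are the nine staff salaries from the central tendency example.

```python
def histogram_counts(values, start, width, bins):
    """Count how many values fall in each range [start + i*width, start + (i+1)*width)."""
    counts = [0] * bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < bins:  # values outside the covered range are ignored
            counts[index] += 1
    return counts


salaries = [4100, 6000, 6000, 6000, 8000, 9000, 10000, 11000, 20000]

# Five bars, each 5000 wide: 0-4999, 5000-9999, 10000-14999, 15000-19999, 20000-24999
counts = histogram_counts(salaries, start=0, width=5000, bins=5)
for i, count in enumerate(counts):
    print(f"{i * 5000:>6} to {i * 5000 + 4999:>6}: {'#' * count}")
```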
INFO 515 Lecture #2 60
Histogram of Salary
[Histogram: X axis: CURRENT SALARY, bars labeled from 6000.0 to 54000.0 in steps of 4000.0; Y axis: Frequency, 0 to 140. Std. Dev = 6830.26, Mean = 13767.8, N = 474.00]
INFO 515 Lecture #2 61
Another Note on Histograms
SPSS will define its own bar widths for a histogram, e.g. how wide the range of salary values is for each bar
Later in the course, we’ll look at how you can define your own variables to make predefined histogram bars
INFO 515 Lecture #2 62
Pie Chart Histogram
A histogram can also be made in the shape of a pie
This should be limited to variables with a small number of possible values
INFO 515 Lecture #2 63
A *bad* pie chart histogram
[Pie chart: CURRENT SALARY, with a separate slice for each individual salary value; the legend lists values from 9180 to 15660]
(I had to include this one just because it’s colorful)
INFO 515 Lecture #2 64
This is a better example:
[Pie chart: EDUCATIONAL LEVEL, with one slice for each value: 8, 12, 14, 15, 16, 17, 18, 19, 20, 21]
This visually implies the percentages of data in each value.
INFO 515 Lecture #2 65
Bookmobile Data

Case/Bookmobile   Value of Var. (No. of Stops)
A                  6
B                  9
C                 10
D                 14
E                 16
F                 17
G                 14
H                 16
I                 14
J                 10
K                  9
L                 14
M                 14
N                 16
O                  9
P                 17
Q                 16

X (No. of Stops)   F (No. of Bookmobiles)
17                 2
16                 4
14                 5
10                 2
 9                 3
 6                 1
N = 17

Bookmobile examples taken from Carpenter and Vasu, (1978)
Same data as used on slides 48 & 66.
INFO 515 Lecture #2 66
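Bookmobile Distributions

Stops    f      %    CF (adding up)   CF (adding down)    C%
17       2   11.8                17                  2   100
16       4   23.5                15                  6    88
14       5   29.4                11                 11    64
10       2   11.8                 6                 13    35
 9       3   17.6                 4                 16    23
 6       1    5.8                 1                 17     6

CF (adding up): cumulative frequency, adding up from the bottom row
CF (adding down): cumulative frequency, adding down from the top row
C%: percent cumulative frequency, i.e. the “adding up” cumulative frequency as a percent of N = 17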
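Both cumulative columns are running totals of the same frequencies, taken in opposite directions. A minimal Python sketch with `itertools.accumulate`, assuming the rows are listed from 17 stops down to 6 stops as on the slide; the variable names are arbitrary.

```python
from itertools import accumulate

stops = [17, 16, 14, 10, 9, 6]  # rows, top to bottom
freq = [2, 4, 5, 2, 3, 1]       # f for each row
n = sum(freq)

# Running total from the top row downward
cf_adding_down = list(accumulate(freq))

# Running total from the bottom row upward, shown in top-to-bottom row order
cf_adding_up = list(accumulate(reversed(freq)))[::-1]

# The "adding up" cumulative frequency as a percent of N
cumulative_percent = [round(100 * cf / n, 1) for cf in cf_adding_up]

for row in zip(stops, freq, cf_adding_up, cf_adding_down, cumulative_percent):
    print(row)
```

The percentages printed here are rounded to one decimal place, so they can differ in the last digit from the whole-number C% values on the slide.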
INFO 515 Lecture #2 67
HISTOGRAM OF BOOKMOBILE STOPS
[Histogram: X axis: Number of Bookmobile Stops, bars labeled 5.0, 7.5, 10.0, 12.5, 15.0, 17.5; Y axis: F (frequency), 0 to 10. Std. Dev = 3.43, Mean = 13.0, N = 17.00]
INFO 515 Lecture #2 68
Normalizing Data
Some data sets are not very close to a normal distribution
Sometimes it helps to transform the independent variable by applying a math function to it, such as looking at log(x) (the logarithm of each x value) instead of just x
INFO 515 Lecture #2 69
Normalizing Data
In SPSS this can be done by defining a new variable, such as “log_x”
Then use Transform / Compute to calculate
    log_x = LG10(x)
assuming that ‘x’ is the original variable
Then generate a histogram showing the normal curve, to see if log_x is closer to a normal distribution
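Outside SPSS, the same base-10 log transform is a one-liner. The Python sketch below is an illustration, with a hypothetical function name and made-up sample values; it rejects zero and negative inputs, since the logarithm is not defined for them.

```python
import math


def log10_transform(values):
    """Return log10 of each value, the counterpart of log_x = LG10(x)."""
    if any(v <= 0 for v in values):
        raise ValueError("log10 is only defined for positive values")
    return [math.log10(v) for v in values]


# Hypothetical right-skewed data: each value is ten times the previous one
x = [10, 100, 1000, 10000]
log_x = log10_transform(x)
print(log_x)  # evenly spaced after the transform
```

Values that are spread over several orders of magnitude become evenly spaced after the transform, which is why it can pull in a long right tail.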
INFO 515 Lecture #2 70
Normalizing Data
Who cares if we have a normal distribution?
Many tests in statistics can only be applied to a variable which has a normal distribution, so it’s worth our while to transform the variable