Sociology 5811: Lecture 4: Other Univariate Descriptives, Quantiles, and Z-Scores
Copyright © 2005 by Evan Schofer
Do not copy or distribute without permission
Announcements
• Problem set 1 due next week!
Dispersion: The Variance
• Dispersion can be measured by adding up deviations
– We square each deviation to avoid negative values
– And, divide by “N-1” (instead of N) to get the average
• Result: The “variance”:

$$s_Y^2 = \frac{\sum_{i=1}^{N} d_i^2}{N-1} = \frac{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}{N-1}$$
Dispersion: Standard Deviation
• Result: Standard Deviation
– Simply the square root of the variance
– Denoted by lower-case s
– Most commonly used measure of dispersion
• Formula:

$$s_Y = \sqrt{s_Y^2} = \sqrt{\frac{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}{N-1}}$$
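As a quick check of the variance and standard deviation formulas above, here is a minimal Python sketch. The four CD counts are illustrative values chosen for this example, and the function names are invented for the sketch.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by N - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((y - mean) ** 2 for y in values) / (n - 1)

def sample_sd(values):
    """Standard deviation: the square root of the variance."""
    return math.sqrt(sample_variance(values))

cds = [20, 40, 0, 70]  # illustrative data, not taken from the histograms
print(round(sample_variance(cds), 2), round(sample_sd(cds), 2))
```

Python's built-in `statistics.variance` and `statistics.stdev` use the same N-1 denominator, so they should agree with these functions.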
Example 1: s = 21.72
[Figure: histogram of Number of CDs (Group 1), x-axis 0 to 200; Std. Dev = 21.72, Mean = 101, N = 23]
Example 2: s = 67.62
[Figure: histogram of Number of CDs (Group 2), x-axis 0 to 200; Std. Dev = 67.62, Mean = 100.0, N = 23]
Example 3: s = 102.15
[Figure: histogram of Number of CDs (Group 3), x-axis 0 to 200; Std. Dev = 102.15, Mean = 104, N = 23]
Thinking About Dispersion
• Suppose we observe that the standard deviation of wealth is greater in the U.S. than in Sweden…
– What can we conclude about the two countries?
• Guess which group has a higher standard deviation for income: Men or Women? Why?
• The standard deviation of a stock’s price is sometimes considered a measure of “risk”. Why?
• Suppose we polled people on two political issues and the S.D. was much higher for one
– How would you interpret that?
Other Univariate Stats: Skewness
• Is a distribution symmetrical?
• Skewness refers to the symmetry of a distribution
• A “tail” is referred to as “skewness”
• Tail on left = skewed to left = negative skew
• Tail on right = skewed to right = positive skew
• Perfectly symmetrical distributions have no skew
• Interpretation: The side of the distribution with the tail has fewer cases
• More cases are on the other side of the mean…
[Figure: histogram of Penn 56 RGDPCH 1990 (x-axis 0 to 20,000) by Frequency (y-axis 0 to 50); Std. Dev = 4915.68, Mean = 4810.4, N = 152]
Interpreting Skewness
• Skewness provides information about inequality
– Example: Economic wealth of nations
Interpreting Skewness
• Skewness provides information about inequality in your data
• Example: Economic wealth of nations…
• Which way is it skewed?
• What is the social interpretation?
• What would be the interpretation if it were skewed in the opposite direction?
• What are some other social circumstances that might generate skewed distributions? Why?
Interpreting Skewness
• Skewness may reflect “floor” or “ceiling” effects
• Example: Number of crimes committed by individuals in a sample.
• Lower bound is zero. Mode is low. Few cases are high. Variable is skewed to right.
• Example: Country school enrollment ratio.
• Cannot exceed 100% enrollment in school.
• Can anyone think of other examples?
Calculating Skewness
• Often, skewness is merely used descriptively
• But, statisticians have created a measure
• Zero = perfectly symmetrical
• Larger absolute value = increasing skew; the sign gives the direction of the tail
• Based on distance from Mean to Median
• Remember, Mean moves more if there are extreme cases, as when there is a “tail”
• Formula:

$$\text{skew} = \frac{3(\bar{Y} - \text{Mdn})}{s_Y}$$
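The skewness formula above can be sketched in a few lines of Python. The function name and the two small datasets are assumptions made for illustration: one list has a long right tail, the other is symmetric.

```python
import statistics

def median_skew(values):
    """Skewness as 3 * (mean - median) / standard deviation."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    sd = statistics.stdev(values)  # uses N - 1, like the lecture's s
    return 3 * (mean - median) / sd

right_tail = [0, 0, 9, 17, 19, 29, 46, 87, 103, 178, 202, 293]
symmetric = [1, 2, 3, 4, 5]

print(median_skew(right_tail))  # positive: the tail is on the right
print(median_skew(symmetric))   # zero: mean equals median
```

A few very large cases pull the mean above the median, which is what makes the measure positive for the first list.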
Notes on Skewness
• Skewness is often assessed informally “by eye” rather than calculated as a value.
• Look at a histogram to identify skewness
• Some statistical techniques work properly only on variables that are not skewed.
• Thus, it can be very important to identify highly skewed variables.
Other Univariate Descriptions: Modes
• Modes = Peaks
– Note: “the mode” also refers to a measure of central tendency: the value associated with the highest peak
• But, the term is also used more generally:
– Uni-modal distribution: One peak
– Bi-modal distribution: Two peaks
– Multi-modal distribution: Multiple peaks.
Interpreting Multi-Modal Distributions
• Can you think of a reason for multiple modes?
• The sample is heterogeneous (i.e., made up of more than one group)
• Height forms a bell-shaped distribution for men and for women, but the peaks are different. A combined sample has two peaks
• The sample reflects some exogenous structural ordering process
• Years of education completed is peaked at 12 (high school), 16 (college)
Example: Mode, skew
• How would you describe this variable?
More Univariate Tools
• Two other issues:
– 1. How many cases fall below or above a given value?
– 2. How can we describe a case’s value relative to other cases?
• Tools:
– Cumulative frequency lists/plots
– Quantiles (e.g., percentiles, quartiles)
– Z-scores
Cumulative Frequency List
• Cumulative Frequency: Number of cases falling in or below a given interval
• Cumulative frequency graph = “ogive”
• Cumulative Percentage: Percentage of cases falling in or below a given interval
• Cumulative frequency lists, graphs can be generated in SPSS: frequency, histogram.
Cumulative Percentage List
Years of Education (N=2904)

Value       Frequency   Percent   Cumulat %
7 or less       21        1.4        3.9
8               82        5.3        9.3
9               51        3.3       12.6
10              70        4.6       17.2
11              95        6.2       23.4
12             489       31.8       55.4
13             125        8.1       63.5
14             184       12.0       75.6
15              76        4.9       80.5
16             152        9.9       90.5
17              40        2.6       93.1
18              61        4.0       97.1
19              18        1.2       98.2
20              27        1.8      100.0

Q: How do you find the median?
Note (row for 12 years): Indicates that 55% of students have 12 years of education or less
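To make the mechanics of a cumulative percentage list concrete, here is a small Python sketch. The frequencies are hypothetical (not the GSS figures) and the function names are invented for the example. It also shows one way to answer the median question: find the first value whose cumulative percentage reaches 50.

```python
def cumulative_table(freqs):
    """freqs: list of (value, frequency) pairs sorted by value.
    Returns rows of (value, frequency, percent, cumulative percent)."""
    total = sum(f for _, f in freqs)
    rows, running = [], 0
    for value, f in freqs:
        running += f  # cases falling in or below this value
        rows.append((value, f, 100 * f / total, 100 * running / total))
    return rows

def median_from_table(rows):
    """First value whose cumulative percentage reaches 50."""
    for value, _, _, cum_pct in rows:
        if cum_pct >= 50:
            return value

# Hypothetical frequencies: (years of education, number of cases)
freqs = [(10, 2), (11, 3), (12, 10), (14, 3), (16, 2)]
rows = cumulative_table(freqs)
for row in rows:
    print(row)
print("median:", median_from_table(rows))
```

Applied to the lecture's table, the same rule points to 12 years, the first row where the cumulative percentage passes 50.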
Cumulative % Graph
[Figure: cumulative percentage graph (ogive); y-axis Cumulative Percentage (0 to 100), x-axis Years of Education (5 to 20)]
Quantiles
• Percentiles, quartiles, deciles, etc…
• General term = quantile
• Quantiles: Dividing cases up into a fixed number of equal “bunches”
– 100 chunks = percentiles
– 10 chunks = deciles
– 5 = quintiles
– 4 = quartiles
Quartile: Example
• Example: Number of CD’s owned (N=12)

First Quartile     Second Quartile    Third Quartile     Fourth Quartile
0  0  9            17  19  29         46  87  103        178  202  293
• Identifying the quartile of a case is a powerful way of describing where a case falls relative to others
– A person with 200 CDs is in the top quartile
• 75% have less
• Note: Don’t forget that quantiles are relative
– A person of average height in the US would be in the bottom quartile in a dataset of basketball players.
Quantiles
• Also: Upper and lower bounds of quantiles are useful reference points that describe your data
– The border of the 2nd and 3rd quartile is the median, the middle of your data
– The border of the top quartile (178 CDs) gives you a sense of how many are owned by people toward the upper end of the distribution
– Ex: Sometimes people report “interquartile range”
• The range of values that contains the middle 50% of cases.
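The quartile boundaries and the interquartile range can be computed with Python's standard library, as in the sketch below. It uses the twelve CD counts from the quartile example; the variable names are assumptions. Software usually interpolates between neighboring cases, so the cut points can differ a little from boundaries read directly off the sorted list.

```python
import statistics

cds = [0, 0, 9, 17, 19, 29, 46, 87, 103, 178, 202, 293]  # the N=12 example

# Three cut points divide the sorted cases into four quartiles.
q1, q2, q3 = statistics.quantiles(cds, n=4)
iqr = q3 - q1  # interquartile range: spread of the middle 50% of cases

print("cut points:", q1, q2, q3)
print("second cut point is the median:", q2 == statistics.median(cds))
print("interquartile range:", iqr)
```

Different packages use slightly different interpolation rules, so small disagreements between programs on the same data are expected.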
Quantiles
• Useful questions that Quantiles help answer:
• 1. How does a particular case compare to others in the dataset?
– Example: I scored 57 on a test… is that good?
– Strategy: Determine the percentile
– If 57 corresponds to the 22nd percentile, then the answer is NO!
• At least not compared to the others who took the test
– Note: Percentiles indicate position relative to other cases
Quantiles
• Useful questions that Quantiles help answer:
• 2. How does a case’s value on one variable compare to another variable?
– If I scored 51 on my math test and 78 on my English test, which is better?
– Converting to percentiles allows a direct comparison
• Ex: 51 on math = 95th percentile; 78 on English = 62nd
• Conclusion: Math performance was better!
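A percentile can be computed directly from raw scores. The sketch below assumes one common definition (the percentage of cases scoring strictly below a value); other definitions treat ties differently. The class scores are invented for illustration and do not reproduce the percentiles quoted above.

```python
def percentile_rank(scores, x):
    """Percentage of cases scoring strictly below x."""
    below = sum(1 for s in scores if s < x)
    return 100 * below / len(scores)

# Hypothetical class results, invented for illustration
math_scores = [20, 25, 30, 32, 35, 38, 40, 44, 51, 55]
english_scores = [60, 70, 74, 76, 78, 80, 84, 88, 90, 95]

print(percentile_rank(math_scores, 51))     # 80.0
print(percentile_rank(english_scores, 78))  # 40.0
```

Here the lower raw score (51 in math) sits higher in its own distribution than the higher raw score (78 in English), which is the point of converting to percentiles before comparing.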
Quantiles
• Useful questions that Quantiles help answer:
• 3. What values are high or low for a given variable?
– Ex: U.S. Census Income Statistics by Quintiles 2001:
– Cutoffs: $17,970; $33,314; $53,000; $83,500
• 0 to $17,970 = lowest quintile
• $17,970 to $33,314 = second quintile
• $33,314 to $53,000 = third quintile
• $53,000 to $83,500 = fourth quintile
• $83,500 to “Bill Gates” = highest quintile
– Typical starting salary of sociologist: $50,000
Computing Quantiles
• Calculating quantiles in SPSS:
• SPSS frequencies command
• Options under the statistics button specify which quantiles to report
– Or, you can rely on the Cumulative Percentage list to identify percentiles or other quantiles
• Example: Years of education completed (GSS)
• 95th percentile falls at: 18 years of education
• Interpretation: 5% are more educated. 95% are less.
Z-Score (Standardized Score)
• The Z-score: Another way to assess relative placement of cases in a distribution
• Somewhat like a deviation
• And has other uses
• You can convert any or all values of a variable to a common scale
• Running approximately from –3 to +3, with mean = 0
• Then you can easily compare across variables
• Ex: I’m a -.3 on math, a +1.2 on reading
• Negative = below mean, positive = above mean.
Formula for Z-Score
• For any case in your data, calculate:

$$Z_i = \frac{d_i}{s_Y} = \frac{(Y_i - \bar{Y})}{s_Y}$$

• Start with a case’s value (Y_i)… Then simply subtract the mean and divide by the standard deviation.
Z-Score Example
• Example: In the US, the mean level of education is 13 years, with a S.D. of 3 years
• Question 1: What is the Z-score of a person who has a high-school degree? (12 yrs)

$$Z_i = \frac{(Y_i - \bar{Y})}{s_Y} = \frac{(12 - 13)}{3} = \frac{-1}{3} = -0.333$$

• Question 2: What is the Z-score of an advanced graduate student? (22 yrs)

$$Z_i = \frac{(Y_i - \bar{Y})}{s_Y} = \frac{(22 - 13)}{3} = \frac{9}{3} = 3.0$$
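The two education questions can be checked with a one-line function. This is a minimal sketch; the function and variable names are assumptions, while the mean (13) and S.D. (3) are the values given in the example.

```python
def z_score(y, mean, sd):
    """Deviation from the mean, expressed in standard deviations."""
    return (y - mean) / sd

mean_educ, sd_educ = 13, 3  # values given in the education example

print(round(z_score(12, mean_educ, sd_educ), 3))  # -0.333
print(z_score(22, mean_educ, sd_educ))            # 3.0
```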
Properties of Z-Scores
• Z-scores are like deviations
• Cases on the mean score zero
• Positive values are above mean, negative below
• But, like quantiles, Z-scores can be compared across variables with different units or means
• Simple deviations can’t be compared if units of measurement are different: Ex: height and weight
• Units of Z-scores are “standard deviations”
• A Z-score of -1.83 indicates a case is nearly 2 standard deviations below the mean.
Z-scoring Whole Variables
• You can convert an entire variable (all cases) to Z-scores, creating a whole new variable
• With useful properties
• Converting to Z-scores preserves the shape of the distribution
• But, mean and standard deviation are altered
• Mean = zero
• Because it is based on deviations
• Standard Deviation (s_Y) = 1
• Because each distance from the mean is divided by s_Y.
Z-Score Example
• Number of CD’s: Mean = 32.5, s = 29.8
Case   Num CD’s (Y)   Mean (Y bar)   Deviation (d)   Z-score (d_i/s)
1          20             32.5           -12.5            -.42
2          40             32.5             7.5            +.25
3           0             32.5           -32.5           -1.1
4          70             32.5            37.5           +1.3
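Converting a whole variable to Z-scores is the same calculation applied to every case. The sketch below uses the four CD counts from the table; the function name is an assumption. The resulting values should agree with the table up to rounding, and the new variable should have mean 0 and standard deviation 1.

```python
import statistics

def standardize(values):
    """Convert every case to a Z-score using the sample mean and s."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # N - 1 denominator
    return [(y - mean) / sd for y in values]

cds = [20, 40, 0, 70]  # the four cases in the table
z = standardize(cds)

print([round(v, 2) for v in z])
print("mean of z:", round(statistics.mean(z), 10))
print("s.d. of z:", round(statistics.stdev(z), 10))
```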
Converting Variables to Z-scores
GSS Data, N=2904
[Figure: histogram of HIGHEST YEAR OF SCHOOL COMPLETED (x-axis 0 to 20) by Frequency (y-axis 0 to 1000)]
Converting Variables to Z-scores
GSS Data, N=2904
[Figure: histogram of Z-SCORE: HIGHEST YEAR OF EDUCATION (x-axis -4.56 to 2.27) by Frequency (y-axis 0 to 1000)]
Z-Scoring Whole Variables
• Properties of Z-scored variables
• 1. Mean = 0, S.D. = 1
– Unit of variable is literally “standard deviations”
– If a value = 1, it means the case is 1 S.D. above the mean
• 2. Z-scored variables are useful for comparing variables with very different units
• 3. However, the actual meaning of units is lost
– Ex: a variable measured in # of CDs makes sense, but a variable in # of S.D.s is harder to interpret
Z-Scores and Index Construction
• Issue: It is often useful to combine several variables to create an “index”
• Example: Suppose you ask several similar questions on a survey (all on a scale from 1-5):
• Do you approve of the President’s foreign policy?
• Do you approve of the President’s domestic policy?
• Do you approve of the President’s character?
• You can add all 3 together to make a scale that reflects “overall approval” of Bush
• For each individual, the scale goes from 3 to 15.
Z-Scores and Index Construction
• Example: Constructing an index
Case #   Foreign   Domestic   Character   Index
1           1         2           2          5
2           4         5           5         14
3           2         3           3          8
4           3         4           1          8
5           4         1           2          7

Index value is simply the sum of the three component variables
Z-Scores and Index Construction
• Suppose you wanted to make an index of the following variables:
– 1. Approval of foreign policy (measured 1-3)
– 2. Approval of domestic policy (measured 1-5)
– 3. Approval of character (measured 1-100)
• Question: What is the problem with constructing an index from these three measures?
• Answer: Value of index variable is almost wholly determined by the third variable
– It is numerically much larger, and “dominates” the index
Z-Scores and Index Construction
• Calculating Z-scores of each variable (prior to adding them) can help make a better index
• Reason: Z-scoring variables “standardizes” the dispersion of each component of the index
– All vars have same mean (0), standard deviation (1)
– Thus, each variable contributes roughly equally to the index. None disproportionately influences it.
– Final index of 3 vars: mean = 0; S.D. depends on how correlated the components are (about 1.73 if uncorrelated, at most 3 if perfectly correlated)
• Note: There are many other ways to create indexes… but this is one quick solution
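The difference between a raw index and a Z-scored index can be seen in a short Python sketch. The five respondents and their answers are hypothetical, and the function name is an assumption; the three scales (1-3, 1-5, 1-100) follow the example above.

```python
import statistics

def standardize(values):
    """Convert every case to a Z-score using the sample mean and s."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(y - mean) / sd for y in values]

# Hypothetical responses from five people, on three different scales
foreign = [1, 3, 2, 3, 1]          # measured 1-3
domestic = [2, 5, 3, 4, 1]         # measured 1-5
character = [40, 90, 55, 20, 75]   # measured 1-100

# Raw index: a plain sum, dominated by the 1-100 variable
raw_index = [f + d + c for f, d, c in zip(foreign, domestic, character)]

# Z-scored index: standardize each component first, then sum
z_index = [sum(zs) for zs in zip(standardize(foreign),
                                 standardize(domestic),
                                 standardize(character))]

print(raw_index)
print([round(v, 2) for v in z_index])
print("s.d. of z-index:", round(statistics.stdev(z_index), 2))
```

The raw index ranks people almost exactly as the character variable does, while the Z-scored index lets all three components matter. The index's mean is 0, but its standard deviation depends on how strongly the components are correlated.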
Z-Score: Final Remarks
• Z-scores help us locate cases within a distribution
– Example: We know that if Z>0, case is above the mean
• Under normal circumstances, a case’s Z-score does not tell us exactly which percentile the case falls in…
• It depends on the shape of the distribution…
• However, if a variable’s distribution takes on a predictable shape, we can make an accurate determination
• This will prove useful next week!
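A short Python sketch illustrates why a Z-score alone does not pin down a percentile. The sign of Z gives a case's position relative to the mean, and in a skewed distribution the mean and the median can be far apart. The data are twelve right-skewed CD counts used for illustration; the variable names are assumptions.

```python
import statistics

cds = [0, 0, 9, 17, 19, 29, 46, 87, 103, 178, 202, 293]  # right-skewed
mean = statistics.mean(cds)
median = statistics.median(cds)
sd = statistics.stdev(cds)

case = 46
z = (case - mean) / sd

print(z < 0)          # True: the case is below the mean
print(case > median)  # True: yet it is above the median
```

This case has a negative Z-score even though more than half of the cases fall below it, so its percentile cannot be read off from Z without knowing the shape of the distribution.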