

Sociology 5811: Lecture 4: Other Univariate Descriptives, Quantiles, and Z-Scores

Copyright © 2005 by Evan Schofer

Do not copy or distribute without permission

Announcements

• Problem set 1 due next week!

Dispersion: The Variance

• Dispersion can be measured by adding up deviations from the mean

– We square each deviation to avoid negative values

– And divide by “N-1” (instead of N) to get the average

• Result: The “variance”:

$$s_Y^2 = \frac{\sum_{i=1}^{N} d_i^2}{N-1} = \frac{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}{N-1}$$

Dispersion: Standard Deviation

• Result: Standard Deviation

– Simply the square root of the variance

– Denoted by lower-case s

– Most commonly used measure of dispersion

• Formula:

$$s_Y = \sqrt{s_Y^2} = \sqrt{\frac{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}{N-1}}$$
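The variance and standard deviation formulas can be sketched in a few lines of Python. This is only an illustration, not part of the original lecture: the function names are made up, and the four data values are borrowed from the small CD example that appears later in these notes.

```python
import math
import statistics

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by N - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((y - mean) ** 2 for y in values) / (n - 1)

def sample_sd(values):
    """Standard deviation: the square root of the variance."""
    return math.sqrt(sample_variance(values))

# Illustrative data: number of CDs owned by four people
cds = [20, 40, 0, 70]

print("variance:", sample_variance(cds))
print("std dev: ", sample_sd(cds))

# The standard library uses the same N - 1 definition
print("library: ", statistics.variance(cds), statistics.stdev(cds))
```

The hand-written functions and the `statistics` module should agree, since both divide by N - 1.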

Example 1: s = 21.72

[Histogram: Number of CDs (Group 1). Std. Dev = 21.72, Mean = 101, N = 23]

Example 2: s = 67.62

[Histogram: Number of CDs (Group 2). Std. Dev = 67.62, Mean = 100.0, N = 23]

Example 3: s = 102.15

[Histogram: Number of CDs (Group 3). Std. Dev = 102.15, Mean = 104, N = 23]

Thinking About Dispersion

• Suppose we observe that the standard deviation of wealth is greater in the U.S. than in Sweden…

– What can we conclude about the two countries?

• Guess which group has a higher standard deviation for income: Men or Women? Why?

• The standard deviation of a stock’s price is sometimes considered a measure of “risk”. Why?

• Suppose we polled people on two political issues and the S.D. was much higher for one

– How would you interpret that?

Other Univariate Stats: Skewness

• Is a distribution symmetrical?

• Skewness refers to the symmetry of a distribution

• A “tail” is referred to as “skewness”

• Tail on left = skewed to left = negative skew

• Tail on right = skewed to right = positive skew

• Perfectly symmetrical distributions have no skew

• Interpretation: The side of the distribution with the tail has fewer cases

• More cases are on the other side of the mean…

[Histogram: Penn 56 RGDPCH 1990, plotted by Frequency. Std. Dev = 4915.68, Mean = 4810.4, N = 152]

Interpreting Skewness

• Skewness provides information about inequality

– Example: Economic wealth of nations

Interpreting Skewness

• Skewness provides information about inequality in your data

• Example: Economic wealth of nations…

• Which way is it skewed?

• What is the social interpretation?

• What would be the interpretation if it were skewed in the opposite direction?

• What are some other social circumstances that might generate skewed distributions? Why?

Interpreting Skewness

• Skewness may reflect “floor” or “ceiling” effects

• Example: Number of crimes committed by individuals in a sample.

• Lower bound is zero. Mode is low. Few cases are high. Variable is skewed to right.

• Example: Country school enrollment ratio.

• Cannot exceed 100% enrollment in school.

• Can anyone think of other examples?

Calculating Skewness

• Often, skewness is merely used descriptively

• But, statisticians have created a measure

• Zero = perfectly symmetrical

• Higher number = increasing skew

• Based on distance from Mean to Median

• Remember, Mean moves more if there are extreme cases, as when there is a “tail”

• Formula:

$$\text{skew} = \frac{3(\bar{Y} - \text{Mdn})}{s_Y}$$
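A minimal sketch of this mean-versus-median skewness measure follows, assuming the formula is three times the gap between mean and median, divided by the standard deviation. The function name and the tiny datasets are hypothetical, chosen only to show the sign of the result.

```python
import statistics

def median_skew(values):
    """Skewness based on the distance from the mean to the median."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    sd = statistics.stdev(values)  # sample S.D. (divides by N - 1)
    return 3 * (mean - median) / sd

# Hypothetical data, for illustration only
symmetric = [1, 2, 3, 4, 5]      # no tail
right_tail = [1, 2, 3, 4, 50]    # one extreme high case
left_tail = [-50, 2, 3, 4, 5]    # one extreme low case

print(median_skew(symmetric))    # zero: mean equals median
print(median_skew(right_tail))   # positive: the tail pulls the mean upward
print(median_skew(left_tail))    # negative: the tail pulls the mean downward
```

Statistical packages often report a different, moment-based skewness statistic, so values from this sketch need not match package output.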

Notes on Skewness

• Skewness is often assessed informally “by eye” rather than calculated as a value.

• Look at a histogram to identify skewness

• Some statistical techniques work properly only on variables that are not skewed.

• Thus, it can be very important to identify highly skewed variables.

Other Univariate Descriptions: Modes

• Modes = Peaks

– Note: “the mode” also refers to a measure of central tendency: the value associated with the highest peak

• But, the term is also used more generally:

– Uni-modal distribution: One peak

– Bi-modal distribution: Two peaks

– Multi-modal distribution: Multiple peaks.

Interpreting Multi-Modal Distributions

• Can you think of a reason for multiple modes?

• The sample is heterogeneous (i.e., made up of more than one group)

• Height forms a bell-shaped distribution for men and for women, but the peaks are different. A combined sample has two peaks

• The sample reflects some exogenous structural ordering process

• Years of education completed is peaked at 12 (high school), 16 (college)

Example: Mode, skew

• How would you describe this variable?

Example: Mode, skew

• How would you describe this variable?

Example: Mode, Skew

• How would you describe this variable?

More Univariate Tools

• Two other issues:

– 1. How many cases fall below or above a given value?

– 2. How can we describe a case’s value relative to other cases?

• Tools:

– Cumulative frequency lists/plots

– Quantiles (e.g., percentiles, quartiles)

– Z-scores

Cumulative Frequency List

• Cumulative Frequency: Number of cases falling in or below a given interval

• Cumulative frequency graph = “ogive”

• Cumulative Percentage: Percentage of cases falling in or below a given interval

• Cumulative frequency lists, graphs can be generated in SPSS: frequency, histogram.

Cumulative Percentage List

Years of Education (N=2904)

Value        Frequency   Percent   Cumulat %
7 or less        21         1.4        3.9
8                82         5.3        9.3
9                51         3.3       12.6
10               70         4.6       17.2
11               95         6.2       23.4
12              489        31.8       55.4
13              125         8.1       63.5
14              184        12.0       75.6
15               76         4.9       80.5
16              152         9.9       90.5
17               40         2.6       93.1
18               61         4.0       97.1
19               18         1.2       98.2
20               27         1.8      100.0

Q: How do you find the median?

The cumulative % for 12 years (55.4) indicates that 55% of students have 12 years of education or less
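One way to build a cumulative percentage list, and to read the median off it, is sketched below. The frequency table here is hypothetical (it is not the GSS data above), and the function names are invented for illustration.

```python
from itertools import accumulate

# Hypothetical frequency table: years of education -> number of cases
freq = {10: 5, 11: 10, 12: 40, 13: 15, 14: 20, 16: 10}

def cumulative_percentages(freq):
    """Return (value, cumulative %) pairs, ordered from lowest value up."""
    values = sorted(freq)
    total = sum(freq.values())
    running_counts = accumulate(freq[v] for v in values)
    return [(v, 100 * count / total) for v, count in zip(values, running_counts)]

def value_at_percentile(freq, pct):
    """First value whose cumulative percentage reaches pct."""
    for value, cum_pct in cumulative_percentages(freq):
        if cum_pct >= pct:
            return value
    raise ValueError("pct must be between 0 and 100")

for value, cum_pct in cumulative_percentages(freq):
    print(value, cum_pct)

# The median is the value where the cumulative percentage first reaches 50
print("median:", value_at_percentile(freq, 50))
```

The same lookup answers other percentile questions: `value_at_percentile(freq, 95)` returns the value where the cumulative percentage first reaches 95.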

Cumulative % Graph

[Line graph: Cumulative Percentage (0 to 100) by Years of Education (5 to 20)]

Quantiles

• Percentiles, quartiles, deciles, etc…

• General term = quantile

• Quantiles: Dividing cases up into a fixed number of equal “bunches”

– 100 chunks = percentiles

– 10 chunks = deciles

– 5 = quintiles

– 4 = quartiles

Quartile: Example

• Example: Number of CD’s owned (N=12)

First Quartile: 0, 0, 9

Second Quartile: 17, 19, 29

Third Quartile: 46, 87, 103

Fourth Quartile: 178, 202, 293

• Identifying the quartile of a case is a powerful way of describing where a case falls relative to others

– A person with 200 CDs is in the top quartile

• 75% have fewer

• Note: Don’t forget that quantiles are relative

– A person of average height in the US would be in the bottom quartile in a dataset of basketball players.
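The quartile example can be reproduced with a short sketch that sorts the cases and cuts them into equal bunches. The helper names are hypothetical, and the rule used to place a new value (the first bunch whose largest case is at least that value) is one simple convention among several.

```python
def split_into_quantiles(values, k):
    """Sort the cases and cut them into k equal-sized bunches.

    Assumes the number of cases divides evenly by k.
    """
    ordered = sorted(values)
    size, remainder = divmod(len(ordered), k)
    if remainder:
        raise ValueError("number of cases must be divisible by k")
    return [ordered[i * size:(i + 1) * size] for i in range(k)]

def quantile_number(value, bunches):
    """1-based number of the first bunch whose largest case is >= value."""
    for number, bunch in enumerate(bunches, start=1):
        if value <= bunch[-1]:
            return number
    return len(bunches)  # larger than every case: top bunch

# Number of CDs owned (N=12), from the example above
cds = [0, 0, 9, 17, 19, 29, 46, 87, 103, 178, 202, 293]
quartiles = split_into_quantiles(cds, 4)

for number, bunch in enumerate(quartiles, start=1):
    print("quartile", number, bunch)

print(quantile_number(200, quartiles))  # a person with 200 CDs
```

With this convention a person with 200 CDs lands in quartile 4, matching the example.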

Quantiles

• Also: Upper and lower bounds of quantiles are useful reference points that describe your data

– The border of the 2nd and 3rd quartile is the median, the middle of your data

– The border of the top quartile (178 CDs) gives you a sense of how many are owned by people toward the upper end of the distribution

– Ex: Sometimes people report “interquartile range”

• The range of values that contains the middle 50% of cases.
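As a rough illustration, quartile borders and the interquartile range for the CD data can be computed with Python's standard library. Note an assumption: `statistics.quantiles` interpolates between neighboring cases by default, so its cut points fall between observed values and will not equal the 178 quoted above, which is simply the lowest case in the top quartile.

```python
import statistics

# Number of CDs owned (N=12), from the quartile example
cds = [0, 0, 9, 17, 19, 29, 46, 87, 103, 178, 202, 293]

# Three cut points divide the cases into four quartiles
q1, q2, q3 = statistics.quantiles(cds, n=4)

print("lower border of 2nd quartile:", q1)
print("border of 2nd and 3rd quartile:", q2)
print("lower border of top quartile:", q3)

# The middle cut point is the median
print("median:", statistics.median(cds))

# Interquartile range: width of the span holding the middle 50% of cases
iqr = q3 - q1
print("interquartile range:", iqr)
```

Different software packages use slightly different interpolation rules, so small disagreements in cut points across tools are expected.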

Quantiles

• Useful questions that Quantiles help answer:

• 1. How does a particular case compare to others in the dataset?

– Example: I scored 57 on a test… is that good?

– Strategy: Determine the percentile

– If 57 corresponds to the 22nd percentile, then the answer is NO!

• At least not compared to the others who took the test

– Note: Percentiles indicate relative position
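The "determine the percentile" strategy can be sketched as follows. The test scores are hypothetical, and the definition used here (percentage of cases at or below the score) is one common convention; others count only cases strictly below.

```python
def percentile_rank(scores, score):
    """Percentage of cases scoring at or below the given score."""
    at_or_below = sum(1 for s in scores if s <= score)
    return 100 * at_or_below / len(scores)

# Hypothetical scores for ten people who took the same test
test_scores = [45, 52, 57, 60, 63, 68, 71, 75, 80, 88]

# Three of the ten scores are at or below 57
print(percentile_rank(test_scores, 57))  # 30.0
```

The same raw score of 57 would earn a different percentile rank in a different group of test takers, which is the sense in which percentiles are relative.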

Quantiles

• Useful questions that Quantiles help answer:

• 2. How does a case’s value on one variable compare to another variable?

– If I scored 51 on my math test and 78 on my English test, which is better?

– Converting to percentiles allows a direct comparison

• Ex: 51 on math = 95th percentile; 78 on English = 62nd

• Conclusion: Math performance was better!

Quantiles

• Useful questions that Quantiles help answer:

• 3. What values are high or low for a given variable?

– Ex: U.S. Census Income Statistics by Quintiles 2001:

– Cutoffs: $17,970; $33,314; $53,000; $83,500

• 0 to $17,970 = lowest quintile

• $17,970 to $33,314 = second quintile

• $33,314 to $53,000 = third quintile

• $53,000 to $83,500 = fourth quintile

• $83,500 to “Bill Gates” = highest quintile

– Typical starting salary of sociologist: $50,000

Computing Quantiles

• Calculating quantiles in SPSS:

• SPSS frequencies command

• Options under the statistics button specify the quantiles to compute

– Or, you can rely on the Cumulative Percentage list to identify percentiles or other quantiles

• Example: Years of education completed (GSS)

• 95th percentile falls at: 18 years of education

• Interpretation: 5% are more educated. 95% are less.

Z-Score (Standardized Score)

• The Z-score: Another way to assess relative placement of cases in a distribution

• Somewhat like a deviation

• And has other uses

• You can convert any or all values of a variable to a common scale

• Running approximately from –3 to +3 , with mean = 0

• Then you can easily compare across variables

• Ex: I’m a -.3 on math, a +1.2 on reading

• Negative = below mean, positive = above mean.

Formula for Z-Score

• For any case in your data, calculate:

$$Z_i = \frac{d_i}{s_Y} = \frac{(Y_i - \bar{Y})}{s_Y}$$

• Start with a case’s value (Yi)… Then simply subtract the mean and divide by the standard deviation.

Z-Score Example

• Example: In the US, the mean level of education is 13 years, with a S.D. of 3 years

• Question 1: What is the Z-score of a person who has a high-school degree? (12 yrs)

$$Z_i = \frac{(Y_i - \bar{Y})}{s_Y} = \frac{(12 - 13)}{3} = \frac{-1}{3} = -.333$$

• Question 2: What is the Z-score of an advanced graduate student? (22 yrs)

$$Z_i = \frac{(Y_i - \bar{Y})}{s_Y} = \frac{(22 - 13)}{3} = \frac{9}{3} = 3.0$$
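The two worked questions can be checked with a one-line function. This is a small sketch; the function name is hypothetical, and the mean of 13 and S.D. of 3 are the figures given in the example.

```python
def z_score(value, mean, sd):
    """Subtract the mean, then divide by the standard deviation."""
    return (value - mean) / sd

# Education example: mean = 13 years, S.D. = 3 years
print(round(z_score(12, 13, 3), 3))  # high-school degree: -0.333
print(z_score(22, 13, 3))            # advanced graduate student: 3.0
```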

Properties of Z-Scores

• Z-scores are like deviations

• Cases on the mean score zero

• Positive values are above mean, negative below

• But, like quantiles, Z-scores can be compared across variables with different units or means

• Simple deviations can’t be compared if units of measurement are different: Ex: height and weight

• Units of Z-scores are “standard deviations”

• A Z-score of -1.83 indicates a case is nearly 2 standard deviations below the mean.

Z-scoring Whole Variables

• You can convert an entire variable (all cases) to Z-scores, creating a whole new variable

• With useful properties

• Converting to Z-scores preserves the shape of the distribution

• But, mean and standard deviation are altered

• Mean = zero

• Because it is based on deviations

• Standard Deviation (sy) = 1

• Because each distance from the mean is divided by sy.

Z-Score Example

• Number of CD’s: Mean = 32.5, s = 29.8

Case   Num CD’s (Y)   Mean (Y bar)   Deviation (d)   Z-score (di/s)
1      20             32.5           -12.5           -.42
2      40             32.5           7.5             +.25
3      0              32.5           -32.5           -1.1
4      70             32.5           37.5            +1.3
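The table above can be reproduced by converting the whole variable at once. This is a sketch with a hypothetical helper name, using the sample standard deviation (N - 1), which is what gives s of about 29.8 for these four cases.

```python
import statistics

def standardize(values):
    """Convert every case to a Z-score: (value - mean) / S.D."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample S.D. (divides by N - 1)
    return [(y - mean) / sd for y in values]

# Number of CDs for the four cases in the table
cds = [20, 40, 0, 70]
z = standardize(cds)

for y, score in zip(cds, z):
    print(y, round(score, 2))

# The new variable has mean 0 and S.D. 1 (up to floating-point error)
print(round(statistics.mean(z), 10), round(statistics.stdev(z), 10))
```

The values agree with the table once rounded to one or two decimals, and the final line illustrates the two properties of a Z-scored variable.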

Converting Variables to Z-scores

GSS Data, N=2904

[Histogram: HIGHEST YEAR OF SCHOOL COMPLETED (0 to 20 years), plotted by Frequency (0 to 1000)]

Converting Variables to Z-scores

GSS Data, N=2904

[Histogram: Z-SCORE: HIGHEST YEAR OF EDUCATION (-4.56 to 2.27), plotted by Frequency (0 to 1000)]

Z-Scoring Whole Variables

• Properties of Z-scored variables

• 1. Mean = 0, S.D. = 1

– Unit of variable is literally “standard deviations”

– If a value = 1, it means the case is 1 S.D. above the mean

• 2. Z-scored variables are useful for comparing variables with very different units

• 3. However, the actual meaning of units is lost

– Ex: a variable measured in # of CDs makes sense, but a variable in # of S.D.s is harder to interpret

Z-Scores and Index Construction

• Issue: It is often useful to combine several variables to create an “index”

• Example: Suppose you ask several similar questions on a survey (all on a scale from 1-5):

• Do you approve of the President’s foreign policy?

• Do you approve of the President’s domestic policy?

• Do you approve of the President’s character?

• You can add all 3 together to make a scale that reflects “overall approval” of Bush

• For each individual, the scale goes from 3 to 15.

Z-Scores and Index Construction

• Example: Constructing an index

Case #   Foreign   Domestic   Character   Index
1        1         2          2           5
2        4         5          5           14
3        2         3          3           8
4        3         4          1           8
5        4         1          2           7

Index value is simply the sum of the three component variables

Z-Scores and Index Construction

• Suppose you wanted to make an index of the following variables:

– 1. Approval of foreign policy (measured 1-3)

– 2. Approval of domestic policy (measured 1-5)

– 3. Approval of character (measured 1-100)

• Question: What is the problem with constructing an index from these three measures?

• Answer: Value of index variable is almost wholly determined by the third variable

– It is numerically much larger, and “dominates” the index

Z-Scores and Index Construction

• Calculating Z-scores of each variable (prior to adding them) can help make a better index

• Reason: Z-scoring variables “standardizes” the dispersion of each component of the index

– All vars have same mean (0), standard deviation (1)

– Thus, each variable contributes roughly equally to the index. None disproportionately influences it.

– Final index of 3 vars: mean = 0; S.D. depends on how strongly the components are correlated (at most 3, reached only if they are perfectly correlated)

• Note: There are many other ways to create indexes… but this is one quick solution
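The contrast between a raw-sum index and a Z-scored index can be sketched as follows. The survey responses are invented for illustration, and `standardize` is a hypothetical helper that uses the sample standard deviation.

```python
import statistics

def standardize(values):
    """Convert every case to a Z-score: (value - mean) / S.D."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(y - mean) / sd for y in values]

# Hypothetical responses from five people
foreign = [1, 3, 2, 3, 1]          # measured 1-3
domestic = [2, 5, 3, 4, 1]         # measured 1-5
character = [80, 20, 55, 90, 10]   # measured 1-100

# Raw sum: the 1-100 variable dominates
raw_index = [f + d + c for f, d, c in zip(foreign, domestic, character)]

# Z-score each component first, then add
z_index = [
    sum(parts)
    for parts in zip(standardize(foreign), standardize(domestic), standardize(character))
]

print("raw index:", raw_index)
print("z index:  ", [round(v, 2) for v in z_index])
print("mean of z index:", round(statistics.mean(z_index), 10))
print("S.D. of z index:", round(statistics.stdev(z_index), 2))
```

In this made-up data the raw index ranks people in exactly the same order as the character variable alone, while the Z-scored index lets all three components influence the ranking.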

Z-Score: Final Remarks

• Z-scores help us locate cases within a distribution

– Example: We know that if Z>0, the case is above the mean

• Under normal circumstances, a case’s Z-score does not tell us exactly which percentile the case falls in…

• It depends on the shape of the distribution…

• However, if a variable’s distribution takes on a predictable shape, we can make an accurate determination

• This will prove useful next week!
