© 2005 Thomson/South-Western


Chapter 3
Descriptive Statistics: Numerical Measures

Part B

■ Measures of Distribution Shape, Relative Location, and Detecting Outliers

■ Exploratory Data Analysis

■ Measures of Association Between Two Variables

■ The Weighted Mean and Working with Grouped Data

Measures of Distribution Shape, Relative Location, and Detecting Outliers

■ Distribution Shape

■ z-Scores

■ Chebyshev’s Theorem

■ Empirical Rule

■ Detecting Outliers

Distribution Shape: Skewness

■ An important measure of the shape of a distribution is called skewness.

■ The formula for computing skewness for a data set is somewhat complex.

■ Skewness can be easily computed using statistical software.

Distribution Shape: Skewness

■ Symmetric (not skewed)

• Skewness is zero.

• Mean and median are equal.

[Histogram: relative frequency (0 to .35) for a symmetric distribution; Skewness = 0]

Distribution Shape: Skewness

■ Moderately Skewed Left

• Skewness is negative.

• Mean will usually be less than the median.

[Histogram: relative frequency (0 to .35) for a moderately left-skewed distribution; Skewness = −.31]

Distribution Shape: Skewness

■ Moderately Skewed Right

• Skewness is positive.

• Mean will usually be more than the median.

[Histogram: relative frequency (0 to .35) for a moderately right-skewed distribution; Skewness = .31]

Distribution Shape: Skewness

■ Highly Skewed Right

• Skewness is positive (often above 1.0).

• Mean will usually be more than the median.

[Histogram: relative frequency (0 to .35) for a highly right-skewed distribution; Skewness = 1.25]

Distribution Shape: Skewness

■ Example: Apartment Rents

Seventy efficiency apartments were randomly sampled in a small college town. The monthly rent prices for these apartments are listed in ascending order below.

425 430 430 435 435 435 435 435 440 440

440 440 440 445 445 445 445 445 450 450

450 450 450 450 450 460 460 460 465 465

465 470 470 472 475 475 475 480 480 480

480 485 490 490 490 500 500 500 500 510

510 515 525 525 525 535 549 550 570 570

575 575 580 590 600 600 600 600 615 615

[Histogram of the apartment rents: relative frequency (0 to .35); Skewness = .92]
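The slides leave the skewness formula to software. The following is a minimal sketch of one way to compute it for the apartment rents. It assumes the adjusted sample skewness formula used by common spreadsheet functions; the slides do not say which formula produced the reported value, so the result may differ slightly in the last digit. The variable names are illustrative.

```python
import statistics

# Monthly rents for the 70 sampled efficiency apartments (from the slides).
rents = [
    425, 430, 430, 435, 435, 435, 435, 435, 440, 440,
    440, 440, 440, 445, 445, 445, 445, 445, 450, 450,
    450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
    465, 470, 470, 472, 475, 475, 475, 480, 480, 480,
    480, 485, 490, 490, 490, 500, 500, 500, 500, 510,
    510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
    575, 575, 580, 590, 600, 600, 600, 600, 615, 615,
]

n = len(rents)
mean = statistics.mean(rents)
s = statistics.stdev(rents)  # sample standard deviation (n - 1 divisor)

# Assumed formula: n / ((n - 1)(n - 2)) * sum(((x - mean) / s) ** 3)
skewness = n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in rents)

print(f"mean = {mean:.2f}, s = {s:.2f}, skewness = {skewness:.2f}")
```

A positive result agrees with the right-skewed shape of the rent histogram, where the mean exceeds the median.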

z-Scores

The z-score is often called the standardized value. It denotes the number of standard deviations a data value $x_i$ is from the mean.

$$z_i = \frac{x_i - \bar{x}}{s}$$

■ An observation’s z-score is a measure of the relative location of the observation in a data set.

■ A data value less than the sample mean will have a z-score less than zero.

■ A data value greater than the sample mean will have a z-score greater than zero.

■ A data value equal to the sample mean will have a z-score of zero.

z-Scores

■ z-Score of Smallest Value (425)

$$z = \frac{x_i - \bar{x}}{s} = \frac{425 - 490.80}{54.74} = -1.20$$

Standardized Values for Apartment Rents

-1.20 -1.11 -1.11 -1.02 -1.02 -1.02 -1.02 -1.02 -0.93 -0.93

-0.93 -0.93 -0.93 -0.84 -0.84 -0.84 -0.84 -0.84 -0.75 -0.75

-0.75 -0.75 -0.75 -0.75 -0.75 -0.56 -0.56 -0.56 -0.47 -0.47

-0.47 -0.38 -0.38 -0.34 -0.29 -0.29 -0.29 -0.20 -0.20 -0.20

-0.20 -0.11 -0.01 -0.01 -0.01 0.17 0.17 0.17 0.17 0.35

0.35 0.44 0.62 0.62 0.62 0.81 1.06 1.08 1.45 1.45

1.54 1.54 1.63 1.81 1.99 1.99 1.99 1.99 2.27 2.27
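The z-score calculation can be sketched in a few lines. This example uses the sample mean (490.80) and standard deviation (54.74) stated on the slides; the function name `z_score` is an illustrative choice, not something from the source.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std_dev

X_BAR = 490.80  # sample mean of the apartment rents (from the slides)
S = 54.74       # sample standard deviation (from the slides)

# A few of the rent values: smallest, one mid-range value, and largest.
for rent in (425, 472, 615):
    print(rent, round(z_score(rent, X_BAR, S), 2))
```

Values below the mean give negative z-scores, values above it give positive ones, and a value equal to the mean gives zero.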

Chebyshev’s Theorem

At least $(1 - 1/z^2)$ of the items in any data set will be within z standard deviations of the mean, where z is any value greater than 1.

Chebyshev’s Theorem

For example:

■ At least 75% of the data values must be within z = 2 standard deviations of the mean.

■ At least 89% of the data values must be within z = 3 standard deviations of the mean.

■ At least 94% of the data values must be within z = 4 standard deviations of the mean.

Chebyshev’s Theorem

Let z = 1.5 with $\bar{x} = 490.80$ and s = 54.74.

At least $1 - 1/(1.5)^2 = 1 - 0.44 = 0.56$, or 56%, of the rent values must be between

$\bar{x} - z(s) = 490.80 - 1.5(54.74) = 409$

and

$\bar{x} + z(s) = 490.80 + 1.5(54.74) = 573$

(Actually, 86% of the rent values are between 409 and 573.)
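The Chebyshev bound and the share of rents that actually fall inside the interval can be compared directly. This is a minimal sketch; the helper name `chebyshev_bound` is hypothetical, and the mean and standard deviation are recomputed from the data rather than taken from the slides.

```python
import statistics

rents = [
    425, 430, 430, 435, 435, 435, 435, 435, 440, 440,
    440, 440, 440, 445, 445, 445, 445, 445, 450, 450,
    450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
    465, 470, 470, 472, 475, 475, 475, 480, 480, 480,
    480, 485, 490, 490, 490, 500, 500, 500, 500, 510,
    510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
    575, 575, 580, 590, 600, 600, 600, 600, 615, 615,
]

def chebyshev_bound(z):
    """Minimum share of any data set within z standard deviations (z > 1)."""
    if z <= 1:
        raise ValueError("Chebyshev's theorem applies only for z > 1")
    return 1 - 1 / z**2

mean = statistics.mean(rents)
s = statistics.stdev(rents)

z = 1.5
low, high = mean - z * s, mean + z * s
inside = sum(1 for x in rents if low <= x <= high)

print(f"guaranteed: at least {chebyshev_bound(z):.0%}")
print(f"observed:   {inside / len(rents):.0%} between {low:.0f} and {high:.0f}")
```

The bound is deliberately conservative because it must hold for every possible data set, so the observed share is usually well above it.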

Empirical Rule

For data having a bell-shaped distribution:

■ 68.27% of the values of a normal random variable are within +/- 1 standard deviation of its mean.

■ 95.45% of the values of a normal random variable are within +/- 2 standard deviations of its mean.

■ 99.73% of the values of a normal random variable are within +/- 3 standard deviations of its mean.

[Figure: bell-shaped curve over x with the mean µ and the points µ ± 1σ, µ ± 2σ, µ ± 3σ marked; the three bands cover 68.27%, 95.45%, and 99.73% of the values]
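The normal-curve percentages in the empirical rule can be computed from the error function, and the same intervals can be checked against the apartment rents. This is a sketch with illustrative function names; percentages computed this way may differ in the last digit from figures read off a rounded table. The rent data are right-skewed rather than bell-shaped, so only rough agreement should be expected.

```python
import math
import statistics

rents = [
    425, 430, 430, 435, 435, 435, 435, 435, 440, 440,
    440, 440, 440, 445, 445, 445, 445, 445, 450, 450,
    450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
    465, 470, 470, 472, 475, 475, 475, 480, 480, 480,
    480, 485, 490, 490, 490, 500, 500, 500, 500, 510,
    510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
    575, 575, 580, 590, 600, 600, 600, 600, 615, 615,
]

def normal_coverage(k):
    """Share of a normal distribution within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

def observed_coverage(data, k):
    """Share of the data within k sample standard deviations of the sample mean."""
    center = statistics.mean(data)
    spread = statistics.stdev(data)
    inside = sum(1 for x in data if abs(x - center) <= k * spread)
    return inside / len(data)

for k in (1, 2, 3):
    print(k, f"normal: {normal_coverage(k):.2%}",
          f"rents: {observed_coverage(rents, k):.2%}")
```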

Detecting Outliers

■ An outlier is an unusually small or unusually large value in a data set.

■ A data value with a z-score less than -3 or greater than +3 might be considered an outlier.

■ It might be:

• an incorrectly recorded data value

• a data value that was incorrectly included in the data set

• a correctly recorded data value that belongs in the data set

Detecting Outliers

Standardized Values for Apartment Rents

-1.20 -1.11 -1.11 -1.02 -1.02 -1.02 -1.02 -1.02 -0.93 -0.93

-0.93 -0.93 -0.93 -0.84 -0.84 -0.84 -0.84 -0.84 -0.75 -0.75

-0.75 -0.75 -0.75 -0.75 -0.75 -0.56 -0.56 -0.56 -0.47 -0.47

-0.47 -0.38 -0.38 -0.34 -0.29 -0.29 -0.29 -0.20 -0.20 -0.20

-0.20 -0.11 -0.01 -0.01 -0.01 0.17 0.17 0.17 0.17 0.35

0.35 0.44 0.62 0.62 0.62 0.81 1.06 1.08 1.45 1.45

1.54 1.54 1.63 1.81 1.99 1.99 1.99 1.99 2.27 2.27

■ The most extreme z-scores are -1.20 and 2.27.

■ Using |z| > 3 as the criterion for an outlier, there are no outliers in this data set.
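The |z| > 3 screen is easy to automate. The sketch below recomputes the z-scores from the raw rents and collects any value that exceeds the threshold; the threshold of 3 comes from the slides, while the variable names are illustrative.

```python
import statistics

rents = [
    425, 430, 430, 435, 435, 435, 435, 435, 440, 440,
    440, 440, 440, 445, 445, 445, 445, 445, 450, 450,
    450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
    465, 470, 470, 472, 475, 475, 475, 480, 480, 480,
    480, 485, 490, 490, 490, 500, 500, 500, 500, 510,
    510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
    575, 575, 580, 590, 600, 600, 600, 600, 615, 615,
]

mean = statistics.mean(rents)
s = statistics.stdev(rents)
z_scores = [(x - mean) / s for x in rents]

THRESHOLD = 3  # |z| > 3 flags a possible outlier
outliers = [x for x, z in zip(rents, z_scores) if abs(z) > THRESHOLD]

print(f"most extreme z-scores: {min(z_scores):.2f} and {max(z_scores):.2f}")
print("possible outliers:", outliers)
```

A flagged value is only a candidate for review: it may be a recording error, a value that does not belong in the data set, or a legitimate extreme observation.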

Exploratory Data Analysis

■ Five-Number Summary

■ Box Plot

Five-Number Summary

1 Smallest Value

2 First Quartile

3 Median

4 Third Quartile

5 Largest Value

Five-Number Summary

425 430 430 435 435 435 435 435 440 440

440 440 440 445 445 445 445 445 450 450

450 450 450 450 450 460 460 460 465 465

465 470 470 472 475 475 475 480 480 480

480 485 490 490 490 500 500 500 500 510

510 515 525 525 525 535 549 550 570 570

575 575 580 590 600 600 600 600 615 615

Lowest Value = 425

First Quartile = 445

Median = 475

Third Quartile = 525

Largest Value = 615
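The five values above can be reproduced in code. The slides do not state which quartile rule they use, so this sketch assumes a common textbook rule: compute the position i = (p/100)n; if i is not an integer, round up and take that value; if i is an integer, average the values in positions i and i + 1. Other software may use different interpolation rules and return slightly different quartiles.

```python
import math

rents = [
    425, 430, 430, 435, 435, 435, 435, 435, 440, 440,
    440, 440, 440, 445, 445, 445, 445, 445, 450, 450,
    450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
    465, 470, 470, 472, 475, 475, 475, 480, 480, 480,
    480, 485, 490, 490, 490, 500, 500, 500, 500, 510,
    510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
    575, 575, 580, 590, 600, 600, 600, 600, 615, 615,
]

def percentile(sorted_data, p):
    """p-th percentile (integer p, 0 < p < 100) under the assumed textbook rule."""
    n = len(sorted_data)
    pos = p * n  # position scaled by 100 to stay in integer arithmetic
    if pos % 100 == 0:
        i = pos // 100  # integer position: average positions i and i + 1
        return (sorted_data[i - 1] + sorted_data[i]) / 2
    return sorted_data[math.ceil(pos / 100) - 1]  # otherwise round up

data = sorted(rents)
five_number_summary = (
    data[0],
    percentile(data, 25),
    percentile(data, 50),
    percentile(data, 75),
    data[-1],
)
print(five_number_summary)
```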

Box Plot

■ A box is drawn with its ends located at the first and third quartiles.

■ A vertical line is drawn in the box at the location of the median (second quartile).

[Box plot on an axis from 375 to 625: Q1 = 445, Q2 = 475, Q3 = 525]

Box Plot

■ Limits are located (not drawn) using the interquartile range (IQR).

■ Data outside these limits are considered outliers.

■ The location of each outlier is shown with the symbol * .

Box Plot

With IQR = Q3 - Q1 = 525 - 445 = 80:

■ The lower limit is located 1.5(IQR) below Q1.

Lower Limit: Q1 - 1.5(IQR) = 445 - 1.5(80) = 325

■ The upper limit is located 1.5(IQR) above Q3.

Upper Limit: Q3 + 1.5(IQR) = 525 + 1.5(80) = 645

■ There are no outliers (values less than 325 or greater than 645) in the apartment rent data.
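The limit calculation can be sketched as follows, taking Q1 = 445 and Q3 = 525 from the five-number summary and computing the IQR as Q3 minus Q1. The same pass also finds the whisker endpoints, which are the smallest and largest values inside the limits. Variable names are illustrative.

```python
rents = [
    425, 430, 430, 435, 435, 435, 435, 435, 440, 440,
    440, 440, 440, 445, 445, 445, 445, 445, 450, 450,
    450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
    465, 470, 470, 472, 475, 475, 475, 480, 480, 480,
    480, 485, 490, 490, 490, 500, 500, 500, 500, 510,
    510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
    575, 575, 580, 590, 600, 600, 600, 600, 615, 615,
]

q1, q3 = 445, 525  # quartiles from the five-number summary
iqr = q3 - q1      # interquartile range

lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

outliers = [x for x in rents if x < lower_limit or x > upper_limit]
inside = [x for x in rents if lower_limit <= x <= upper_limit]
whiskers = (min(inside), max(inside))  # endpoints of the box plot whiskers

print(f"IQR = {iqr}, limits = ({lower_limit}, {upper_limit})")
print("outliers:", outliers, "whiskers:", whiskers)
```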

Box Plot

■ Whiskers (dashed lines) are drawn from the ends of the box to the smallest and largest data values inside the limits.

[Box plot on an axis from 375 to 625 with whiskers: smallest value inside limits = 425, largest value inside limits = 615]

Measures of Association Between Two Variables

■ Covariance

■ Correlation Coefficient

Covariance

The covariance is a measure of the linear association between two variables.

Positive values indicate a positive relationship.

Negative values indicate a negative relationship.

Covariance

The covariance is computed as follows:

$$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$ for samples

$$\sigma_{xy} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$$ for populations

Correlation Coefficient

The coefficient can take on values between -1 and +1.

Values near -1 indicate a strong negative linear relationship.

Values near +1 indicate a strong positive linear relationship.

The correlation coefficient is computed as follows:

$$r_{xy} = \frac{s_{xy}}{s_x s_y}$$ for samples

$$\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$$ for populations

Correlation Coefficient

Correlation is a measure of linear association and not necessarily causation.

Just because two variables are highly correlated, it does not mean that one variable is the cause of the other.

Covariance and Correlation Coefficient

A golfer is interested in investigating the relationship, if any, between driving distance and 18-hole score.

Average Driving      Average
Distance (yds.)      18-Hole Score
277.6                69
259.5                71
269.1                70
267.0                70
255.6                71
272.9                69

Covariance and Correlation Coefficient

x          y       $x_i - \bar{x}$   $y_i - \bar{y}$   $(x_i - \bar{x})(y_i - \bar{y})$
277.6      69       10.65            -1.0              -10.65
259.5      71       -7.45             1.0               -7.45
269.1      70        2.15             0                  0
267.0      70        0.05             0                  0
255.6      71      -11.35             1.0              -11.35
272.9      69        5.95            -1.0               -5.95

Average    267.0    70.0             Total             -35.40
Std. Dev.  8.2192   .8944

Covariance and Correlation Coefficient

■ Sample Covariance

$$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = \frac{-35.40}{6 - 1} = -7.08$$

■ Sample Correlation Coefficient

$$r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{-7.08}{(8.2192)(.8944)} = -.9631$$
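The golf example can be reproduced from the six data pairs with the sample formulas for covariance and correlation. This is a minimal sketch in plain Python; the variable names are illustrative.

```python
import math

distance = [277.6, 259.5, 269.1, 267.0, 255.6, 272.9]  # average driving distance (yds.)
score = [69, 71, 70, 70, 71, 69]                        # average 18-hole score

n = len(distance)
x_bar = sum(distance) / n
y_bar = sum(score) / n

# Sample covariance: sum of cross-products of deviations, divided by n - 1.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(distance, score)) / (n - 1)

# Sample standard deviations of each variable.
s_x = math.sqrt(sum((x - x_bar) ** 2 for x in distance) / (n - 1))
s_y = math.sqrt(sum((y - y_bar) ** 2 for y in score) / (n - 1))

# Sample correlation coefficient.
r_xy = s_xy / (s_x * s_y)

print(f"s_xy = {s_xy:.2f}, s_x = {s_x:.4f}, s_y = {s_y:.4f}, r_xy = {r_xy:.4f}")
```

The strongly negative result reflects that, in this small sample, longer average drives go with lower (better) scores; it does not by itself show that one causes the other.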

The Weighted Mean and Working with Grouped Data

■ Weighted Mean

■ Mean for Grouped Data

■ Variance for Grouped Data

■ Standard Deviation for Grouped Data

Weighted Mean

■ When the mean is computed by giving each data value a weight that reflects its importance, it is referred to as a weighted mean.

■ In the computation of a grade point average (GPA), the weights are the number of credit hours earned for each grade.

■ When data values vary in importance, the analyst must choose the weight that best reflects the importance of each value.

Weighted Mean

$$\bar{x} = \frac{\sum w_i x_i}{\sum w_i}$$

where:

$x_i$ = value of observation i

$w_i$ = weight for observation i
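As an illustration of the weighted mean, the sketch below computes a GPA with credit hours as the weights. The grades and credit hours are hypothetical values made up for this example; they do not come from the slides.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Hypothetical transcript: grade points earned and credit hours per course.
grade_points = [4.0, 3.0, 2.0]
credit_hours = [3, 4, 3]

gpa = weighted_mean(grade_points, credit_hours)
print(f"GPA = {gpa:.2f}")
```

With equal weights the function reduces to the ordinary mean, which is one quick way to sanity-check it.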

Grouped Data

■ The weighted mean computation can be used to obtain approximations of the mean, variance, and standard deviation for the grouped data.

■ To compute the weighted mean, we treat the midpoint of each class as though it were the mean of all items in the class.

■ We compute a weighted mean of the class midpoints using the class frequencies as weights.

■ Similarly, in computing the variance and standard deviation, the class frequencies are used as weights.

Mean for Grouped Data

■ Sample Data

$$\bar{x} = \frac{\sum f_i M_i}{n}$$

■ Population Data

$$\mu = \frac{\sum f_i M_i}{N}$$

where:

$f_i$ = frequency of class i

$M_i$ = midpoint of class i

Sample Mean for Grouped Data

Given below is the previous sample of monthly rents for 70 efficiency apartments, presented here as grouped data in the form of a frequency distribution.

Rent (\$)    Frequency
420-439      8
440-459      17
460-479      12
480-499      8
500-519      7
520-539      4
540-559      2
560-579      4
580-599      2
600-619      6

Sample Mean for Grouped Data

Rent (\$)    $f_i$    $M_i$    $f_i M_i$
420-439      8        429.5    3436.0
440-459      17       449.5    7641.5
460-479      12       469.5    5634.0
480-499      8        489.5    3916.0
500-519      7        509.5    3566.5
520-539      4        529.5    2118.0
540-559      2        549.5    1099.0
560-579      4        569.5    2278.0
580-599      2        589.5    1179.0
600-619      6        609.5    3657.0
Total        70                34525.0

$$\bar{x} = \frac{34,525}{70} = 493.21$$

This approximation differs by \$2.41 from the actual sample mean of \$490.80.
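The grouped-data mean is a weighted mean of class midpoints, which can be sketched as follows. The class limits and frequencies come from the frequency distribution; the data layout and names are illustrative.

```python
# (lower class limit, upper class limit, frequency) for each rent class.
classes = [
    (420, 439, 8), (440, 459, 17), (460, 479, 12), (480, 499, 8),
    (500, 519, 7), (520, 539, 4), (540, 559, 2), (560, 579, 4),
    (580, 599, 2), (600, 619, 6),
]

midpoints = [(low + high) / 2 for low, high, _ in classes]
frequencies = [f for _, _, f in classes]

n = sum(frequencies)
total = sum(f * m for f, m in zip(frequencies, midpoints))  # sum of f_i * M_i
grouped_mean = total / n

print(f"n = {n}, sum f*M = {total}, grouped mean = {grouped_mean:.2f}")
```

The result is an approximation because every value in a class is treated as if it sat at the class midpoint.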

Variance for Grouped Data

■ For sample data

$$s^2 = \frac{\sum f_i (M_i - \bar{x})^2}{n - 1}$$

■ For population data

$$\sigma^2 = \frac{\sum f_i (M_i - \mu)^2}{N}$$

Sample Variance for Grouped Data

Rent (\$)    $f_i$    $M_i$    $M_i - \bar{x}$    $(M_i - \bar{x})^2$    $f_i (M_i - \bar{x})^2$
420-439      8        429.5    -63.7              4058.96                32471.71
440-459      17       449.5    -43.7              1910.56                32479.59
460-479      12       469.5    -23.7              562.16                 6745.97
480-499      8        489.5    -3.7               13.76                  110.11
500-519      7        509.5    16.3               265.36                 1857.55
520-539      4        529.5    36.3               1316.96                5267.86
540-559      2        549.5    56.3               3168.56                6337.13
560-579      4        569.5    76.3               5820.16                23280.66
580-599      2        589.5    96.3               9271.76                18543.53
600-619      6        609.5    116.3              13523.36               81140.18
Total        70                                                          208234.29

Sample Variance for Grouped Data

■ Sample Variance

$s^2 = 208,234.29/(70 - 1) = 3,017.89$

■ Sample Standard Deviation

$s = \sqrt{3,017.89} = 54.94$

This approximation differs by only \$.20 from the actual standard deviation of \$54.74.
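The grouped-data variance and standard deviation follow the same pattern, with squared deviations of the midpoints weighted by class frequency. A minimal sketch, using the sample (n - 1) divisor and illustrative names:

```python
import math

# (class midpoint, frequency) pairs from the grouped rent data.
grouped = [
    (429.5, 8), (449.5, 17), (469.5, 12), (489.5, 8), (509.5, 7),
    (529.5, 4), (549.5, 2), (569.5, 4), (589.5, 2), (609.5, 6),
]

n = sum(f for _, f in grouped)
x_bar = sum(f * m for m, f in grouped) / n  # grouped-data sample mean

# Frequency-weighted sum of squared deviations of midpoints from the mean.
weighted_ss = sum(f * (m - x_bar) ** 2 for m, f in grouped)

sample_variance = weighted_ss / (n - 1)
sample_std_dev = math.sqrt(sample_variance)

print(f"sum f*(M - x_bar)^2 = {weighted_ss:.2f}")
print(f"s^2 = {sample_variance:.2f}, s = {sample_std_dev:.2f}")
```

For population data, the same code would divide by N instead of n - 1 and use the population mean in place of the sample mean.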