Descriptive Statistics (Part 2): Standardized Data, Percentiles and Quartiles, Box Plots, Grouped Data


Descriptive Statistics (Part 2)

Standardized Data

Percentiles and Quartiles

Box Plots

Grouped Data

Skewness and Kurtosis (optional)

Chapter 4

Standardized Data

Chebyshev’s Theorem

• For any population with mean $\mu$ and standard deviation $\sigma$, the percentage of observations that lie within $k$ standard deviations of the mean must be at least $100[1 - 1/k^2]$.

• Developed by mathematicians Jules Bienaymé (1796-1878) and Pafnuty Chebyshev (1821-1894).

Chebyshev’s Theorem

• For $k = 2$ standard deviations, $100[1 - 1/2^2] = 75\%$. So, at least 75.0% will lie within $\mu \pm 2\sigma$.

• For $k = 3$ standard deviations, $100[1 - 1/3^2] = 88.9\%$. So, at least 88.9% will lie within $\mu \pm 3\sigma$.

• Although applicable to any data set, these limits tend to be too wide to be useful.
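The bound is easy to evaluate for any $k$. The following is a minimal sketch; the function name `chebyshev_min_percent` is illustrative and not part of the slides.

```python
def chebyshev_min_percent(k: float) -> float:
    """Minimum percent of observations within k standard deviations of the mean."""
    if k <= 1:
        # For k <= 1 the formula gives a bound of 0% or less, which says nothing.
        raise ValueError("the bound is only informative for k > 1")
    return 100 * (1 - 1 / k**2)


for k in (2, 3):
    print(f"k = {k}: at least {chebyshev_min_percent(k):.1f}%")
```

For $k = 2$ and $k = 3$ this reproduces the 75.0% and 88.9% figures above.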

The Empirical Rule

• The Empirical Rule states that for data from a normal distribution, we expect that for

$k = 1$, about 68.26% will lie within $\mu \pm 1\sigma$

$k = 2$, about 95.44% will lie within $\mu \pm 2\sigma$

$k = 3$, about 99.73% will lie within $\mu \pm 3\sigma$

• The normal or Gaussian distribution was named for Carl Friedrich Gauss (1777-1855).

• The normal distribution is symmetric and is also known as the bell-shaped curve.
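These percentages can be checked from the normal distribution itself. The sketch below uses the identity $P(|Z| < k) = \operatorname{erf}(k/\sqrt{2})$ for a standard normal variable; the function name `percent_within` is illustrative.

```python
import math


def percent_within(k: float) -> float:
    """Percent of a normal distribution lying within k standard deviations of the mean."""
    return 100 * math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"k = {k}: about {percent_within(k):.2f}%")
```

Computed this way, the first two values round to 68.27% and 95.45%. The 68.26% and 95.44% quoted on the slide are consistent with doubling four-decimal table areas (for example 2 × 0.3413), so the difference is only rounding.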

The Empirical Rule

• Distance from the mean is measured in terms of the number of standard deviations.

Note: no upper bound is given. Data values outside $\mu \pm 3\sigma$ are rare.

Example: Exam Scores

• If 80 students take an exam, how many will score within 2 standard deviations of the mean?

• Assuming exam scores follow a normal distribution, the empirical rule states that about 95.44% will lie within $\mu \pm 2\sigma$, so 95.44% × 80 ≈ 76 students will score within $2\sigma$ of $\mu$.

• How many students will score more than 2 standard deviations from the mean?
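The arithmetic for both questions can be sketched as follows, assuming the slide's 95.44% figure and a normal distribution of scores. The variable names are illustrative.

```python
n_students = 80
pct_within_2sd = 95.44  # empirical-rule figure quoted on the slide

within = n_students * pct_within_2sd / 100  # expected count within 2 standard deviations
beyond = n_students - within                # expected count beyond 2 standard deviations

print(f"within 2 standard deviations: about {within:.0f} students")
print(f"beyond 2 standard deviations: about {beyond:.0f} students")
```

So roughly 76 students fall within two standard deviations, leaving about 4 beyond.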

Unusual Observations

• Unusual observations are those that lie beyond $\mu \pm 2\sigma$.

• Outliers are observations that lie beyond $\mu \pm 3\sigma$.

Unusual Observations

• For example, the P/E ratio data contains several large data values. Are they unusual or outliers?

7 8 8 10 10 10 10 12 13 13 13 13

13 13 13 14 14 14 15 15 15 15 15 16

16 16 17 18 18 18 18 19 19 19 19 19

20 20 20 21 21 21 22 22 23 23 23 24

25 26 26 26 26 27 29 29 30 31 34 36

37 40 41 45 48 55 68 91

The Empirical Rule

• If the sample came from a normal distribution, then the Empirical Rule states

$\bar{x} \pm 1s = 22.72 \pm 1(14.08) = (8.6, 36.8)$

$\bar{x} \pm 2s = 22.72 \pm 2(14.08) = (-5.4, 50.9)$

$\bar{x} \pm 3s = 22.72 \pm 3(14.08) = (-19.5, 65.0)$

The Empirical Rule

[Figure: number line centered at 22.72 with tick marks at -19.5, -5.4, 8.6, 36.8, 50.9, and 65.0. Values beyond $\bar{x} \pm 2s$ are labeled Unusual and values beyond $\bar{x} \pm 3s$ are labeled Outliers.]

• Are there any unusual values or outliers? 7 8 . . . 48 55 68 91
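One way to answer the question is to compute the sample mean and standard deviation and compare each value with the $\bar{x} \pm 2s$ and $\bar{x} \pm 3s$ limits. This is a sketch using the 68 P/E ratios listed earlier. The names `unusual` and `outliers` are illustrative, and "unusual" here is taken to mean beyond $2s$ but not beyond $3s$.

```python
import statistics

pe = [
    7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14,
    14, 15, 15, 15, 15, 15, 16, 16, 16, 17, 18, 18, 18, 18, 19, 19, 19,
    19, 19, 20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24, 25, 26, 26,
    26, 26, 27, 29, 29, 30, 31, 34, 36, 37, 40, 41, 45, 48, 55, 68, 91,
]

x_bar = statistics.mean(pe)
s = statistics.stdev(pe)  # sample standard deviation (n - 1 in the denominator)

for k in (1, 2, 3):
    print(f"k = {k}: ({x_bar - k * s:.1f}, {x_bar + k * s:.1f})")

# Distance from the mean in standard deviations decides the label.
unusual = [x for x in pe if 2 < abs(x - x_bar) / s <= 3]
outliers = [x for x in pe if abs(x - x_bar) / s > 3]
print("unusual:", unusual)
print("outliers:", outliers)
```

With these data, 55 lies beyond $\bar{x} + 2s$, and 68 and 91 lie beyond $\bar{x} + 3s$. Nothing falls below the lower limits, which are negative.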

Defining a Standardized Variable

• A standardized variable ($Z$) redefines each observation in terms of the number of standard deviations from the mean.

Standardization formula for a population:

$z_i = \frac{x_i - \mu}{\sigma}$

Standardization formula for a sample:

$z_i = \frac{x_i - \bar{x}}{s}$

Defining a Standardized Variable

• $z_i$ tells how far away the observation is from the mean.

• For example, for the P/E data, the first value $x_1 = 7$. The associated $z$ value is

$z_1 = \frac{x_1 - \bar{x}}{s} = \frac{7 - 22.72}{14.08} = -1.12$

Defining a Standardized Variable

• A negative $z$ value means the observation is below the mean.

• Positive $z$ means the observation is above the mean. For $x_{68} = 91$,

$z_{68} = \frac{x_{68} - \bar{x}}{s} = \frac{91 - 22.72}{14.08} = 4.85$

Defining a Standardized Variable

• Here are the standardized $z$ values for the P/E data:

• What do you conclude for these four values?
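As a sketch of the calculation, the sample formula can be applied to the smallest value and the four largest values, using the rounded summary statistics quoted earlier ($\bar{x} = 22.72$, $s = 14.08$). The function name `z_score` is illustrative.

```python
X_BAR = 22.72  # sample mean of the P/E data, as quoted on the slides
S = 14.08      # sample standard deviation, as quoted on the slides


def z_score(x: float, mean: float = X_BAR, sd: float = S) -> float:
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd


for x in (7, 48, 55, 68, 91):
    print(f"x = {x:2d}  z = {z_score(x):5.2f}")
```

A value is unusual when $|z| > 2$ and an outlier when $|z| > 3$, so the printed $z$ values can be read directly against those cutoffs.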

Defining a Standardized Variable

• In Excel, use =STANDARDIZE(x, Mean, StDev) to calculate the standardized $z$ value for an observation x.

• MegaStat calculates standardized values as well as checks for outliers.

Outliers

• What do we do with outliers in a data set?

• If due to erroneous data, then discard.

• An outrageous observation (one completely outside of an expected range) is certainly invalid.

• Recognize unusual data points and outliers and their potential impact on your study.

• Research books and articles on how to handle outliers.

Estimating Sigma

• For a normal distribution, nearly all values (about 99.73%) fall within a range of $6\sigma$ (from $\mu - 3\sigma$ to $\mu + 3\sigma$).

• If you know the range $R$ (high – low), you can estimate the standard deviation as $\sigma = R/6$.

• Useful for approximating the standard deviation when only $R$ is known.

• This estimate depends on the assumption of normality.
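A minimal sketch of the $R/6$ estimate follows. The low and high values are hypothetical exam scores chosen for illustration, not data from the slides.

```python
def estimate_sigma_from_range(low: float, high: float) -> float:
    """Rough estimate of sigma as R/6, assuming approximately normal data."""
    return (high - low) / 6


# Hypothetical example: exam scores ranging from 40 to 100.
print(estimate_sigma_from_range(40, 100))
```

The estimate is only as good as the normality assumption. A single extreme value stretches $R$ and inflates the result.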

Percentiles and Quartiles

Percentiles

• Percentiles divide the data into 100 groups.

• For example, you score in the 83rd percentile on a standardized test. That means that 83% of the test-takers scored below you.

• Deciles divide the data into 10 groups.

• Quintiles divide the data into 5 groups.

• Quartiles divide the data into 4 groups.

Percentiles

• Percentiles are used to establish benchmarks for comparison purposes (e.g., health care, manufacturing and banking industries use 5, 25, 50, 75 and 90 percentiles).

• Quartiles (25, 50, and 75 percent) are commonly used to assess financial performance and stock portfolios.

• Percentiles are used in employee merit evaluation and salary benchmarking.

Quartiles

• Quartiles are scale points that divide the sorted data into four groups of approximately equal size.

• The three values that separate the four groups are called $Q_1$, $Q_2$, and $Q_3$, respectively.

Lower 25% | $Q_1$ | Second 25% | $Q_2$ | Third 25% | $Q_3$ | Upper 25%

Quartiles

• The second quartile $Q_2$ is the median, an important indicator of central tendency.

Lower 50% | $Q_2$ | Upper 50%

• $Q_1$ and $Q_3$ measure dispersion since the interquartile range $Q_3 - Q_1$ measures the degree of spread in the middle 50 percent of data values.

Lower 25% | $Q_1$ | Middle 50% | $Q_3$ | Upper 25%

Quartiles

• The first quartile $Q_1$ is the median of the data values below $Q_2$, and the third quartile $Q_3$ is the median of the data values above $Q_2$.

Lower 25% | $Q_1$ | Second 25% | $Q_2$ | Third 25% | $Q_3$ | Upper 25%

For the first half of the data, 50% are above and 50% are below $Q_1$.

For the second half of the data, 50% are above and 50% are below $Q_3$.

Quartiles

• Depending on $n$, the quartiles $Q_1$, $Q_2$, and $Q_3$ may be members of the data set or may lie between two of the sorted data values.

Method of Medians

• For small data sets, find quartiles using the method of medians:

Step 1. Sort the observations.

Step 2. Find the median $Q_2$.

Step 3. Find the median of the data values that lie below $Q_2$.

Step 4. Find the median of the data values that lie above $Q_2$.
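The four steps can be sketched as below. Two details are assumptions of this sketch, not stated on the slide: the halves are split by position in the sorted list, and when $n$ is odd the middle observation is left out of both halves. The function name and the sample values are hypothetical.

```python
import statistics


def quartiles_by_medians(data):
    """Return (Q1, Q2, Q3) using the method of medians."""
    x = sorted(data)               # Step 1: sort the observations
    n = len(x)
    q2 = statistics.median(x)      # Step 2: median of all values
    lower = x[: n // 2]            # values positioned below the median
    upper = x[(n + 1) // 2:]       # values positioned above the median
    q1 = statistics.median(lower)  # Step 3: median of the lower half
    q3 = statistics.median(upper)  # Step 4: median of the upper half
    return q1, q2, q3


# Hypothetical sample of seven values.
print(quartiles_by_medians([12, 3, 21, 7, 18, 8, 15]))
```

For the sorted sample 3, 7, 8, 12, 15, 18, 21 the median is 12, the lower half is 3, 7, 8 and the upper half is 15, 18, 21, giving quartiles 7, 12, and 18.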

Excel Quartiles

• Use Excel function =QUARTILE(Array, k) to return the $k$th quartile.

• Excel treats quartiles as a special case of percentiles. For example, to calculate $Q_3$, use either of

=QUARTILE(Array, 3)

=PERCENTILE(Array, 0.75)

• Excel calculates the quartile positions as:

Position of $Q_1$: $0.25n + 0.75$

Position of $Q_2$: $0.50n + 0.50$

Position of $Q_3$: $0.75n + 0.25$

Example: P/E Ratios and Quartiles

• Consider the following P/E ratios for 68 stocks in a portfolio.

7 8 8 10 10 10 10 12 13 13 13 13 13 13 13 14 14

14 15 15 15 15 15 16 16 16 17 18 18 18 18 19 19 19

19 19 20 20 20 21 21 21 22 22 23 23 23 24 25 26 26

26 26 27 29 29 30 31 34 36 37 40 41 45 48 55 68 91

• Use quartiles to define benchmarks for stocks that are low-priced (bottom quartile) or high-priced (top quartile).

Example: P/E Ratios and Quartiles

• Using Excel’s method of interpolation, the quartile positions are:

Quartile | Position formula | Interpolate between

$Q_1$ | 0.25(68) + 0.75 = 17.75 | $X_{17}$ and $X_{18}$

$Q_2$ | 0.50(68) + 0.50 = 34.50 | $X_{34}$ and $X_{35}$

$Q_3$ | 0.75(68) + 0.25 = 51.25 | $X_{51}$ and $X_{52}$

Example: P/E Ratios and Quartiles

• The quartiles are:

Quartile | Formula

First ($Q_1$) | $Q_1 = X_{17} + 0.75(X_{18} - X_{17}) = 14 + 0.75(14 - 14) = 14$

Second ($Q_2$) | $Q_2 = X_{34} + 0.50(X_{35} - X_{34}) = 19 + 0.50(19 - 19) = 19$

Third ($Q_3$) | $Q_3 = X_{51} + 0.25(X_{52} - X_{51}) = 26 + 0.25(26 - 26) = 26$
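The position-and-interpolate procedure can be sketched in a few lines. This is an illustration of the position formulas quoted above, not Excel's own code, and the function name `quartile_by_position` is hypothetical.

```python
import math

pe = [
    7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14,
    14, 15, 15, 15, 15, 15, 16, 16, 16, 17, 18, 18, 18, 18, 19, 19, 19,
    19, 19, 20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24, 25, 26, 26,
    26, 26, 27, 29, 29, 30, 31, 34, 36, 37, 40, 41, 45, 48, 55, 68, 91,
]


def quartile_by_position(data, q):
    """Quartile q (1, 2 or 3) by linear interpolation at position p*n + (1 - p)."""
    x = sorted(data)
    n = len(x)
    p = q / 4
    position = p * n + (1 - p)        # 1-based, e.g. 0.25n + 0.75 for Q1
    lower = math.floor(position)      # index of the observation just below
    fraction = position - lower       # how far to move toward the next one
    x_lower = x[lower - 1]
    x_upper = x[min(lower, n - 1)]    # guard against running past the end
    return x_lower + fraction * (x_upper - x_lower)


for q in (1, 2, 3):
    print(f"Q{q} = {quartile_by_position(pe, q):g}")
```

Because the neighboring values at each position are equal (14 and 14, 19 and 19, 26 and 26), the interpolation fraction has no effect here and the quartiles come out as 14, 19, and 26.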

Example: P/E Ratios and Quartiles

• So, to summarize:

Lower 25% of P/E Ratios | $Q_1 = 14$ | Second 25% of P/E Ratios | $Q_2 = 19$ | Third 25% of P/E Ratios | $Q_3 = 26$ | Upper 25% of P/E Ratios

• These quartiles express central tendency and dispersion. What is the interquartile range?

• Because of clustering of identical data values, these quartiles do not provide clean cut points between groups of observations.

Tip

Whether you use the method of medians or Excel, your quartiles will be about the same. Small differences in calculation techniques typically do not lead to different conclusions in business applications.

Caution

• Quartiles generally resist outliers.

• However, quartiles do not provide clean cut points in the sorted data, especially in small samples with repeating data values.

Data set A: 1, 2, 4, 4, 8, 8, 8, 8 | $Q_1 = 3$, $Q_2 = 6$, $Q_3 = 8$

Data set B: 0, 3, 3, 6, 6, 6, 10, 15 | $Q_1 = 3$, $Q_2 = 6$, $Q_3 = 8$

• Although they have identical quartiles, these two data sets are not similar. The quartiles do not represent either data set well.

Dispersion Using Quartiles

• Some robust measures of central tendency and dispersion using quartiles are:

Statistic | Formula | Excel | Pro | Con

Midhinge | $\frac{Q_1 + Q_3}{2}$ | =0.5*(QUARTILE(Data,1)+QUARTILE(Data,3)) | Robust to presence of extreme data values. | Less familiar to most people.

Dispersion Using Quartiles

Statistic | Formula | Excel | Pro | Con

Midspread | $Q_3 - Q_1$ | =QUARTILE(Data,3)-QUARTILE(Data,1) | Stable when extreme data values exist. | Ignores magnitude of extreme data values.

Coefficient of quartile variation (CQV) | $100 \times \frac{Q_3 - Q_1}{Q_3 + Q_1}$ | None | Relative variation in percent so we can compare data sets. | Less familiar to non-statisticians.

Midhinge

• The mean of the first and third quartiles.

Midhinge = $\frac{Q_1 + Q_3}{2}$

• For the 68 P/E ratios,

Midhinge = $\frac{Q_1 + Q_3}{2} = \frac{14 + 26}{2} = 20$

• A robust measure of central tendency since quartiles ignore extreme values.

Midspread (Interquartile Range)

• A robust measure of dispersion

Midspread = $Q_3 - Q_1$

• For the 68 P/E ratios,

Midspread = $Q_3 - Q_1 = 26 - 14 = 12$

Coefficient of Quartile Variation (CQV)

• Measures relative dispersion by expressing the midspread as a percent of $Q_3 + Q_1$ (twice the midhinge).

$CQV = 100 \times \frac{Q_3 - Q_1}{Q_3 + Q_1}$

• For the 68 P/E ratios,

$CQV = 100 \times \frac{Q_3 - Q_1}{Q_3 + Q_1} = 100 \times \frac{26 - 14}{26 + 14} = 30.0\%$

• Similar to the $CV$, $CQV$ can be used to compare data sets measured in different units or with different means.
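The three quartile-based measures can be computed together. This sketch takes $Q_1 = 14$ and $Q_3 = 26$ from the P/E example; the function names are illustrative.

```python
def midhinge(q1: float, q3: float) -> float:
    """Mean of the first and third quartiles."""
    return (q1 + q3) / 2


def midspread(q1: float, q3: float) -> float:
    """Interquartile range Q3 - Q1."""
    return q3 - q1


def cqv(q1: float, q3: float) -> float:
    """Coefficient of quartile variation, in percent."""
    return 100 * (q3 - q1) / (q3 + q1)


q1, q3 = 14, 26  # quartiles of the 68 P/E ratios
print(f"midhinge  = {midhinge(q1, q3):g}")
print(f"midspread = {midspread(q1, q3):g}")
print(f"CQV       = {cqv(q1, q3):.1f}%")
```

These reproduce the values 20, 12, and 30.0% worked out for the P/E ratios.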

Box Plots

• A useful tool of exploratory data analysis (EDA).

• Also called a box-and-whisker plot.

• Based on a five-number summary:

$X_{min}$, $Q_1$, $Q_2$, $Q_3$, $X_{max}$

• Consider the five-number summary for the 68 P/E ratios:

7, 14, 19, 26, 91
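A five-number summary can be sketched with Python's standard library. The assumption here is that `statistics.quantiles` with `method="inclusive"` interpolates at the same positions as the Excel rule described earlier (position $1 + (n - 1)p$, which equals $0.25n + 0.75$ for $Q_1$). For these data it gives the same quartiles.

```python
import statistics

pe = [
    7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14,
    14, 15, 15, 15, 15, 15, 16, 16, 16, 17, 18, 18, 18, 18, 19, 19, 19,
    19, 19, 20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24, 25, 26, 26,
    26, 26, 27, 29, 29, 30, 31, 34, 36, 37, 40, 41, 45, 48, 55, 68, 91,
]

# Three cut points that split the data into four groups.
q1, q2, q3 = statistics.quantiles(pe, n=4, method="inclusive")
five_number = (min(pe), q1, q2, q3, max(pe))
print(five_number)
```

The long stretch from $Q_3 = 26$ to the maximum of 91, compared with the short stretch from the minimum of 7 to $Q_1 = 14$, is what produces the right-skewed box plot.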

Box Plots

[Figure: box plot of the P/E ratios, right-skewed. Labels: Minimum, $Q_1$, Median ($Q_2$), $Q_3$, Maximum, Box, Whiskers. The center of the box is the midhinge.]

Fences and Unusual Data Values

• Use quartiles to detect unusual data points.

• The cutoff points are called fences and can be found using the following formulas:

Fence | Inner fences | Outer fences

Lower fence | $Q_1 - 1.5(Q_3 - Q_1)$ | $Q_1 - 3.0(Q_3 - Q_1)$

Upper fence | $Q_3 + 1.5(Q_3 - Q_1)$ | $Q_3 + 3.0(Q_3 - Q_1)$

• Values outside the inner fences are unusual while those outside the outer fences are outliers.

Fences and Unusual Data Values

• For example, consider the P/E ratio data:

Fence | Inner fences | Outer fences

Lower fence | 14 – 1.5(26 – 14) = –4 | 14 – 3.0(26 – 14) = –22

Upper fence | 26 + 1.5(26 – 14) = +44 | 26 + 3.0(26 – 14) = +62

• Ignore the lower fence since it is negative and P/E ratios are only positive.

Fences and Unusual Data Values

• Truncate the whisker at the fences and display unusual values and outliers as dots.

[Figure: box plot of the P/E ratios with the upper inner fence and outer fence marked. Points between the fences are labeled Unusual and points beyond the outer fence are labeled Outliers.]

• Based on these fences, there are three unusual P/E values and two outliers.
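The fence rules can be sketched as follows, using $Q_1 = 14$ and $Q_3 = 26$ and the seven largest P/E values from the data. The function name `fences` and the dictionary keys are illustrative.

```python
def fences(q1: float, q3: float) -> dict:
    """Inner and outer fences built from the quartiles."""
    iqr = q3 - q1
    return {
        "inner_lower": q1 - 1.5 * iqr,
        "inner_upper": q3 + 1.5 * iqr,
        "outer_lower": q1 - 3.0 * iqr,
        "outer_upper": q3 + 3.0 * iqr,
    }


f = fences(14, 26)
largest = [40, 41, 45, 48, 55, 68, 91]  # upper tail of the 68 P/E ratios

# Unusual: beyond the inner fence but not beyond the outer fence.
unusual = [x for x in largest if f["inner_upper"] < x <= f["outer_upper"]]
# Outliers: beyond the outer fence.
outliers = [x for x in largest if x > f["outer_upper"]]

print(f)
print("unusual:", unusual)
print("outliers:", outliers)
```

Only the upper tail is checked because both lower fences are negative. The result is three unusual values (45, 48, 55) and two outliers (68, 91), matching the count above.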

Grouped Data

Nature of Grouped Data

• Although some information is lost, grouped data are easier to display than raw data.

• When bin limits are given, the mean and standard deviation can be estimated.

• Accuracy of grouped estimates depends on
- the number of bins
- distribution of data within bins
- bin frequencies


Mean and Standard Deviation

• Consider the frequency distribution for prices of Lipitor® for three cities:

• Where $m_j$ = class midpoint, $f_j$ = class frequency, $k$ = number of classes, $n$ = sample size


• Estimate the mean and standard deviation by

$$\bar{x} = \frac{\sum_{j=1}^{k} f_j m_j}{n} = \frac{3427.5}{47} = 72.92552$$

$$s = \sqrt{\frac{\sum_{j=1}^{k} f_j (m_j - \bar{x})^2}{n - 1}} = \sqrt{\frac{2091.48936}{47 - 1}} = 6.74293$$

• Note: don’t round off too soon.
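The two grouped-data formulas can be sketched in Python as follows. The frequency table from the slides is not reproduced in this text, so the midpoints and frequencies below are hypothetical placeholders, and `grouped_mean_sd` is an illustrative name.

```python
import math


def grouped_mean_sd(midpoints, freqs):
    """Estimate the mean and sample standard deviation from grouped data.

    midpoints: class midpoints m_j
    freqs:     class frequencies f_j
    """
    n = sum(freqs)
    mean = sum(f * m for f, m in zip(freqs, midpoints)) / n
    # Weighted sum of squared deviations, kept unrounded until the end
    ss = sum(f * (m - mean) ** 2 for f, m in zip(freqs, midpoints))
    sd = math.sqrt(ss / (n - 1))
    return mean, sd


# Hypothetical bins (not the Lipitor table)
midpoints = [62.5, 67.5, 72.5, 77.5, 82.5]
freqs = [2, 8, 12, 6, 2]
mean, sd = grouped_mean_sd(midpoints, freqs)
print(mean, sd)
```

The function carries full precision through the intermediate sums and leaves rounding to the caller, in line with the note above.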


• How accurate are grouped estimates compared to ungrouped estimates?

• Now estimate the coefficient of variation

$$CV = 100 \left( \frac{s}{\bar{x}} \right) = 100 \left( \frac{6.74293}{72.92552} \right) = 9.2\%$$
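As a quick check of this arithmetic, here is a small Python sketch using the grouped estimates quoted above. The helper name `coefficient_of_variation` is illustrative.

```python
def coefficient_of_variation(s, mean):
    """Return the coefficient of variation as a percentage."""
    return 100 * s / mean


cv = coefficient_of_variation(6.74293, 72.92552)
print(f"{cv:.1f}%")  # 9.2%
```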

• For the previous example, we can compare the grouped data statistics to the ungrouped data statistics.


Accuracy Issues


• For this example, very little information was lost due to grouping.

• However, accuracy could be lost due to the nature of the grouping (i.e., if the data values were not evenly spaced within bins).
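To illustrate how a grouped estimate can drift from the raw value, the sketch below bins a small hypothetical data set into equal-width classes and compares the two means. The data, the bin limits, and the name `grouped_mean` are assumptions made for illustration only.

```python
def grouped_mean(values, lower, width, k):
    """Estimate the mean from k equal-width bins starting at `lower`."""
    freqs = [0] * k
    for v in values:
        # Bin index; a value on the upper limit goes into the last bin
        j = min(int((v - lower) // width), k - 1)
        freqs[j] += 1
    mids = [lower + width * (j + 0.5) for j in range(k)]
    return sum(f * m for f, m in zip(freqs, mids)) / len(values)


# Hypothetical raw data, all between 60 and 85
raw = [61, 64, 66, 67, 69, 70, 71, 72, 73, 73, 75, 76, 78, 81, 84]
raw_mean = sum(raw) / len(raw)
est = grouped_mean(raw, lower=60, width=5, k=5)
print(raw_mean, est)  # 72.0 72.5
```

Each value lies within half a bin width of its class midpoint, so the grouped mean here can differ from the raw mean by at most 2.5. The gap of 0.5 arises because the values are not centered within their bins.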


• The dot plot shows a relatively even distribution within the bins.

• Effects of uneven distributions within bins tend to average out unless there is systematic skewness.


• Accuracy tends to improve as the number of bins increases.

• If the first or last class is open-ended, there will be no class midpoint (no mean can be estimated).

• Assume a lower limit of zero for the first class when the data are nonnegative.

• You may be able to assume an upper limit for some variables (e.g., age).

• Median and quartiles may be estimated even with open-ended classes.
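One common textbook way to estimate a median from grouped data is linear interpolation within the class that contains the middle observation. The slides do not state a method, so the sketch below is an assumption: it treats observations as evenly spread within the median class. The bins and the name `grouped_median` are hypothetical.

```python
def grouped_median(bins):
    """Interpolated median from grouped data.

    bins: list of (lower, upper, freq) in ascending order;
          lower or upper may be None for an open-ended class.
    """
    n = sum(f for _, _, f in bins)
    target = n / 2
    cum = 0  # cumulative frequency below the current class
    for lower, upper, f in bins:
        if f > 0 and cum + f >= target:
            if lower is None or upper is None:
                raise ValueError("median falls in an open-ended class")
            return lower + (target - cum) / f * (upper - lower)
        cum += f
    raise ValueError("no observations")


# Hypothetical table with open-ended first and last classes
bins = [(None, 20, 5), (20, 30, 10), (30, 40, 20), (40, 50, 10), (50, None, 5)]
print(grouped_median(bins))  # 35.0
```

The open-ended classes only contribute their frequencies, which is why the median remains estimable when the mean is not. The same interpolation gives quartiles if `n / 2` is replaced by `n / 4` or `3 * n / 4`.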


Skewness and Kurtosis

Skewness

• Generally, skewness may be indicated by looking at the sample histogram or by comparing the mean and median.

• This visual indicator is imprecise and does not take into consideration sample size n.


• Skewness is a unit-free statistic.

• The coefficient compares two samples measured in different units or one sample with a known reference distribution (e.g., symmetric normal distribution).

• Calculate the sample’s skewness coefficient as:

$$\text{Skewness} = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^3$$
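The skewness formula can be written out directly in Python. This is a plain-Python sketch of the formula above rather than a library call. The function name `skewness` and the sample data are illustrative.

```python
import math


def skewness(xs):
    """Sample skewness coefficient: n/((n-1)(n-2)) * sum(((x - mean)/s)^3)."""
    n = len(xs)
    if n < 3:
        raise ValueError("need at least 3 observations")
    mean = sum(xs) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))  # sample std dev
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in xs)


data = [3, 4, 5, 2, 3, 4, 5, 6, 4, 7]  # illustrative sample
print(round(skewness(data), 4))  # 0.3595
```

A symmetric sample such as 1, 2, 3, 4, 5 gives a coefficient of zero (up to floating-point error), while a longer right tail gives a positive value.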


• In Excel, go to Tools | Data Analysis | Descriptive Statistics or use the function =SKEW(array)


• Consider the following table showing the 90% range for the sample skewness coefficient.


• Coefficients within the 90% range may be attributed to random variation.


• Coefficients outside the range suggest the sample came from a nonnormal population.


• As n increases, the range of chance variation narrows.


Kurtosis

• Kurtosis is the relative length of the tails and the degree of concentration in the center.

• Consider three kurtosis prototype shapes.


• A histogram is an unreliable guide to kurtosis since scale and axis proportions may differ.

• Excel and MINITAB calculate kurtosis as:

$$\text{Kurtosis} = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)}$$
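The kurtosis formula can be sketched the same way in plain Python. The function name `kurtosis` and the sample data are illustrative. With this formula a normal population has an expected value near zero, since the subtracted term removes the normal baseline.

```python
import math


def kurtosis(xs):
    """Sample kurtosis coefficient following the formula above."""
    n = len(xs)
    if n < 4:
        raise ValueError("need at least 4 observations")
    mean = sum(xs) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))  # sample std dev
    fourth = sum(((x - mean) / s) ** 4 for x in xs)
    lead = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return lead * fourth - correction


data = [3, 4, 5, 2, 3, 4, 5, 6, 4, 7]  # illustrative sample
print(round(kurtosis(data), 4))  # -0.1518
```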


• Consider the following table of the expected 90% range for the sample kurtosis coefficient.


• A sample coefficient within the range may be attributed to chance variation.


• Coefficients outside the range would suggest the sample differs from a normal population.


• As sample size increases, the chance range narrows.

• Inferences about kurtosis are risky for n < 50.


Applied Statistics in Business and Economics

End of Chapter 4