(8) Measures of Dispersion


  • 7/30/2019 (8) Measures of Dispersion

    1/30

    Applied Statistics and Computing Lab

    MEASURES OF DISPERSION

    Applied Statistics and Computing Lab

    Indian School of Business


    Learning goals

    - To understand the need for studying dispersion
    - To understand the idea behind measures of dispersion
    - To study different measures of dispersion

    Additional topics

    - Standardization of a variable
    - Skewness and kurtosis
    - Five-point summary


    Need to study dispersion

    Two patients are admitted to the Intensive Care Unit of a hospital. The night before their operation, the doctor makes the last visit at 9pm; blood pressure is 110/80 for Patient 1 and 120/70 for Patient 2. Although both readings are normal, as a precaution the doctor asks the nurse to check their blood pressure every 2 hours. At 7.30 the next morning, the nurse reports that the average blood pressure for both patients was normal, 120/80. The chart of their actual blood pressures was:

    Time        11pm     1am      3am      5am      7am
    Patient 1   120/80   100/80   100/60   130/80   150/100
    Patient 2   110/60   100/60   100/70   130/90   160/120
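
    A minimal sketch in R of why the average alone can mislead here. It uses only the systolic readings (the first number of each pair) from the chart above; the vector names p1_sys and p2_sys are illustrative and not part of the original slides.

```r
# Systolic readings (first number of each pair), 11pm to 7am
p1_sys <- c(120, 100, 100, 130, 150)
p2_sys <- c(110, 100, 100, 130, 160)

# Both patients share the same average systolic pressure
mean(p1_sys)          # 120
mean(p2_sys)          # 120

# The spread around that average differs
diff(range(p1_sys))   # 50
diff(range(p2_sys))   # 60
sd(p1_sys)
sd(p2_sys)
```

    Both means equal the reported 120, yet the readings of Patient 2 cover a wider interval.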


    Need to study dispersion (contd.)

    - What if the doctor decides to operate on the patients without looking at the blood pressure chart?
    - What if someone decides to visit the tourist destination next week based only on last week's average temperature, as given in our data?
    - What if I am interested in working with company X (which is visiting our campus) and I am given only the mean salary of its employees?
    - In an extreme case, the same measure of central tendency could equally describe a dataset in which every value is the same constant


    [Figure: number line from 0 to 13 with the median and the mode marked]


    Examples

    - Variability in temperature through the week
    - Scatter of the horsepower capacities among the cars available
    - Spread of the prices at which varieties of a single product (say, rice varieties) are available
    - Variability in returns on investments


    Need for measures of dispersion (contd.)

    - Help determine the reliability of the measure of central tendency
    - Facilitate comparison of two sets of data
    - Useful for building further statistical measures


    Desired properties

    - A good measure should not be greatly affected if the data change slightly
    - A good measure should be representative of the majority of the data
    - A good measure should allow us to declare an interval within which most of the values lie, with a certain degree of confidence


    Dataset

    - Body measurements on 507 individuals: 247 men and 260 women
    - Primarily in their 20s and 30s, with some exceptions
    - All individuals exercise several hours a week
    - Of the 28 variables in this dataset, we consider Gender (1 = Male, 0 = Female) and Weight (in kg)

    Data source: Measurements collected by authors Grete Heinz and Louis J. Peterson for their study


    Dataset (contd.)

                         Female   Male     Overall
    Min. weight (kg)     42       53.9     42
    Max. weight (kg)     105.2    116.4    116.4
    Mean weight (kg)     60.6     78.14    69.15
    Median weight (kg)   59       77.3     68.2


    Evaluating dispersion

    1. Consider the boundaries (measures based on selected values)
       - Report the extreme values
    2. Consider distance from a central tendency (measures based on all the values)
       - Build an absolute measure
       - Calculate a coefficient
         - High coefficient: large spread, high variability
         - Small coefficient: small spread, less variability


    1. Considering the boundaries

    - These measures consider and report only the boundaries of the data
    - They try to capture how far the values of the variable reach
    - The spread of the data is not considered relative to any central tendency
    - These measures overlook the patterns of values within the boundaries


    Minimum and maximum values

    ADVANTAGES:
    - Useful when a range of tolerance exists, i.e. if values beyond a certain limit are harmful or unacceptable
    - Easy to compute and understand

    DISADVANTAGES:
    - Ignores any pattern in the data
    - Ignores most of the data

    Range = (Maximum value) - (Minimum value)

    ADVANTAGES:
    - Easy comparison of variability across datasets
    - Easy to compute and understand

    DISADVANTAGES:
    - Ignores any pattern in the data
    - Ignores most of the data

    Inter-quartile range = (3rd quartile) - (1st quartile)

    ADVANTAGES:
    - Highlights the middle portion of the distribution of values
    - Easy to understand

    DISADVANTAGES:
    - More difficult to compute than the minimum-maximum pair and the range
    - Ignores irregularities at the extremes
    - Ignores 25% of the data on each side

                                  Female        Male            Overall
    (Min. weight, Max. weight)    (42, 105.2)   (53.9, 116.4)   (42, 116.4)
    Weight range                  63.2          62.5            74.4
    Weight inter-quartile range   11.1          14.55           20.45
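
    The three boundary-based measures can be sketched in base R as follows. The weights in w are made-up illustrative values, not the study data.

```r
# Hypothetical weights in kg (illustrative values only)
w <- c(52.0, 58.5, 61.0, 64.5, 70.0, 75.5, 82.0)

# Minimum and maximum: the two boundaries
min(w)
max(w)

# range() returns the pair c(min, max), not their difference
range(w)

# Range as a single number: maximum - minimum
diff(range(w))               # 30

# Inter-quartile range: 3rd quartile - 1st quartile
quantile(w, c(0.25, 0.75))
IQR(w)                       # 13
```

    Note that quantile() and IQR() use R's default quartile definition; other software may interpolate quartiles differently and give slightly different values.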



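    2. Considering distance from central tendency

    - Consider the deviations of the values from the measure of central tendency
    - What if we simply sum all these deviations?
    - Consider a hypothetical dataset (1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 7), for which Mean = Median = 4
    - Here $\sum_{i=1}^{n} (x_i - \bar{x}) = \sum_{i=1}^{n} (x_i - \operatorname{Median}(x)) = 0$, where $\bar{x}$ is the mean: positive and negative deviations cancel
    - Taking absolute values or squares of the deviations avoids this cancellation, so that only the magnitudes are considered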
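
    A short check in R, using the hypothetical dataset from this slide, of how the raw deviations cancel while their absolute values and squares do not:

```r
# Hypothetical dataset from the slide
x <- c(1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 7)

mean(x)                  # 4
median(x)                # 4

# Raw deviations cancel out
sum(x - mean(x))         # 0

# Magnitudes do not cancel
sum(abs(x - mean(x)))    # 24
sum((x - mean(x))^2)     # 56
```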


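    Absolute deviations

    For a dataset consisting of $n$ observations $x_1, \dots, x_n$ with mean $\bar{x}$:

    - Absolute deviations: $|x_i - \text{central tendency}|$
    - Mean absolute deviation from mean $= \frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}|$
    - Mean absolute deviation from median $= \frac{1}{n} \sum_{i=1}^{n} |x_i - \operatorname{Median}(x)|$
    - Median absolute deviation from median $= \operatorname{Median}\left(|x_i - \operatorname{Median}(x)|\right)$

                                            Female weights   Male weights
    Mean absolute deviation from mean       7.33             8.58
    Mean absolute deviation from median     7.19             8.57
    Median absolute deviation from median   5.1              7.2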
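
    The three absolute-deviation measures can be sketched in R on the hypothetical dataset (1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 7) used earlier. The comparison with mad() is an addition here: base R's mad() multiplies by a constant of 1.4826 by default, so constant = 1 is needed to obtain the plain median absolute deviation.

```r
x <- c(1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 7)

# Mean absolute deviation from the mean: 24/13, about 1.85
mean(abs(x - mean(x)))

# Mean absolute deviation from the median (same here, since mean = median)
mean(abs(x - median(x)))

# Median absolute deviation from the median
median(abs(x - median(x)))   # 2

# Built-in equivalent, with the default scaling switched off
mad(x, constant = 1)         # 2

# Default mad() is scaled by 1.4826
mad(x)
```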


    Measures based on squared deviation

    For a dataset consisting of $n$ observations $x_1, \dots, x_n$ with mean $\bar{x}$:

    Variance $= \sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$

    To obtain a measure in the same units as the original data, we take the square root:

    Standard deviation $= \sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}$

                         Weight, females   Weight, males
    Variance             92.46             110.52
    Standard deviation   9.62              10.51
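
    A brief sketch in R of the variance and standard deviation on the hypothetical dataset (1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 7). One caveat worth making explicit: R's var() and sd() divide by n - 1, whereas the definition that divides by n has to be computed by hand. The names pop_var and pop_sd are illustrative.

```r
x <- c(1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 7)
n <- length(x)

# Variance with divisor n: 56/13
pop_var <- sum((x - mean(x))^2) / n
pop_sd  <- sqrt(pop_var)

# R's built-ins use divisor n - 1: 56/12
var(x)
sd(x)

# Converting the built-in result to the divisor-n version
var(x) * (n - 1) / n
```

    For large n the two versions differ very little, but for a small dataset such as this one the gap is visible.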


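    Relative measures of dispersion

    Coefficient of range $= \frac{\text{Maximum} - \text{Minimum}}{\text{Maximum} + \text{Minimum}}$

    - Lies in [0, 1] when all values are non-negative
    - The higher the coefficient, the broader the range
    - Coefficient of range: Weight, females = 0.43; Weight, males = 0.37

    Coefficient of variation $= \frac{\text{Standard deviation}}{\text{Mean}} \times 100$

    - Computes the variability per unit mean
    - Indicates how consistent the data are with respect to their mean
    - The higher the coefficient, the more spread out the observations
    - Coefficient of variation: Weight, females = 15.87; Weight, males = 13.45

    Relative to their mean, the weights of females are more spread out than those of males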
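
    As a rough check, the coefficients quoted for the weight data can be reproduced in R from the summary values reported on the earlier slides (minimum, maximum, mean, and standard deviations of 9.62 and 10.51 as read from the variance slide). The helper names coef_range and coef_var are illustrative, not from the slides.

```r
# Coefficient of range from the two boundaries
coef_range <- function(lo, hi) (hi - lo) / (hi + lo)

# Coefficient of variation (in percent) from standard deviation and mean
coef_var <- function(s, m) 100 * s / m

round(coef_range(42, 105.2), 2)     # females: 0.43
round(coef_range(53.9, 116.4), 2)   # males:   0.37

round(coef_var(9.62, 60.6), 2)      # females: 15.87
round(coef_var(10.51, 78.14), 2)    # males:   13.45
```

    The male weights have the larger standard deviation in kg, but once divided by the larger male mean their relative spread is the smaller of the two.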


    Comparing measures of dispersion

    All the measures that consider distance from a central tendency are based on all the values.

    ADVANTAGES:
    - Absolute deviations are less affected by extreme values than squared deviations
    - Absolute deviations are easy to understand and interpret
    - Median absolute deviation is the least affected by slight changes in the data, across all measures of dispersion
    - Variance and standard deviation are the most popular measures of dispersion, because they are algebraically amenable and play an important part in building and evaluating further statistical measures
    - Standard deviation is easier to understand than variance, as it is in the same units as the original data

    DISADVANTAGES:
    - Algebraic manipulation of measures based on absolute deviations is difficult
    - Variance is the most affected by extreme values, as it is based on squared deviations
    - Standard deviation is not very easy to compute
    - Standard deviation cannot be calculated for data with open-ended classes

    COEFFICIENTS:
    - Coefficients are free of units and therefore facilitate comparison
    - Useful even when two variables are measured in different units
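
    The sensitivity claims above can be illustrated with a small experiment in R. It takes the hypothetical dataset (1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 7) and replaces its last value by an invented extreme value of 70; the helper name measures is illustrative.

```r
x     <- c(1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 7)
x_out <- c(x[-length(x)], 70)   # last value 7 replaced by an extreme 70

# Four measures of dispersion for a numeric vector v
measures <- function(v) c(
  variance       = var(v),
  sd             = sd(v),
  mean_abs_dev   = mean(abs(v - mean(v))),
  median_abs_dev = median(abs(v - median(v)))
)

# Side-by-side comparison
rbind(original = measures(x), with_outlier = measures(x_out))
```

    In this example the median absolute deviation stays at 2, while the variance grows by well over an order of magnitude.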


    Standardization

    Standardized variable of $X$: $Z = \frac{X - \bar{x}}{\sigma}$, where $\bar{x}$ is the mean and $\sigma$ the standard deviation of $X$

    - Mean of standardized variable = 0
    - Variance of standardized variable = 1
    - Standardized variables are free of units
    - Therefore measures of variation of standardized variables are comparable
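
    A minimal sketch of standardization in R, checking the two stated properties on the hypothetical dataset (1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 7). The function name standardize is illustrative; it uses sd(), which divides by n - 1, so the variance of 1 below is also in the n - 1 sense.

```r
# Subtract the mean and divide by the standard deviation
standardize <- function(x) (x - mean(x)) / sd(x)

x <- c(1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 7)
z <- standardize(x)

mean(z)   # 0, up to floating-point error
var(z)    # 1

# Base R's scale() performs the same centering and scaling
all.equal(as.numeric(scale(x)), z)
```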


    Example

    - How is the weight of a newborn affected by whether the mother smokes? Further, does smoking affect the perinatal mortality rate, which varies across birth weights?
    - Yerushalmy J. found in his 1971 paper that, although low birth weight is associated with an increase in the number of babies who die shortly after birth, the babies of smokers tended to have much lower death rates than the babies of nonsmokers.*
    - In this study, he compared perinatal death rates after grouping babies by birth weight
    - In 1986 and 1993, Wilcox & Russell and Wilcox (respectively) strongly recommended that babies be grouped by their relative (or standardized) birth weight, rather than by their absolute weight (in kg)
    - What happened then?

    [Table from Yerushalmy J. (1971)**; weights measured in grams]

    * and ** taken from Deborah Nolan and Terry Speed's Stat Labs: Mathematical Statistics Through Applications


    Example (contd.)

    [Graphs taken from Deborah Nolan and Terry Speed's Stat Labs: Mathematical Statistics Through Applications]


    Further to deviations

    - Variance $= \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$ is the sum of squares of deviations from the mean divided by $n$, or the expected value of the squared deviation of $X$ from its mean
    - Expected values of higher powers of the deviations from the mean give additional information about the distribution of the data
    - The expected value of any power (say power $r$) of the deviations of a variable $X$ from its mean is called the $r$th central moment of that variable: $\mu_r = E\left[(X - E(X))^r\right] = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^r$
    - Central moments depict the spread and shape of the data
    - Variance is the 2nd central moment
    - Measures using the 3rd and 4th central moments are useful for understanding the shape of the distribution
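
    Central moments are straightforward to compute directly. The sketch below defines a hypothetical helper, central_moment (not a base R function), and applies it to the hypothetical dataset (1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 7).

```r
# r-th central moment: the average of the r-th powers of deviations from the mean
central_moment <- function(x, r) mean((x - mean(x))^r)

x <- c(1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 7)

central_moment(x, 1)   # 0: deviations always average to zero
central_moment(x, 2)   # 56/13: the variance with divisor n
central_moment(x, 3)   # 0: this dataset is symmetric about its mean
central_moment(x, 4)   # 392/13
```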


    Skewness

    - Skewness is a measure of symmetry (or the lack of it) in a dataset
    - A distribution is right-skewed, or positively skewed, if it stretches asymmetrically to the right
    - It is left-skewed, or negatively skewed, if the asymmetric stretch is on the left
    - Measuring skewness using moments: coefficient of skewness $= \frac{\mu_3}{\mu_2^{3/2}} = \frac{\mu_3}{\sigma^3}$, where $\mu_2 = \sigma^2$ and $\mu_3$ are the 2nd and 3rd central moments
    - If a distribution is perfectly symmetric, $\mu_3 = 0$
    - The sign of the coefficient = the sign of $\mu_3$
    - The closer the coefficient of skewness is to zero, the more nearly symmetric the distribution

    [Visuals from Aczel A., Sounderpandian J., Complete Business Statistics]


    Kurtosis

    - Kurtosis is a measure of the peakedness of a dataset
    - The reference value for kurtosis is 3, the value for the normal distribution; such a curve is called mesokurtic
    - A value larger than 3 indicates a distribution that is more peaked with longer (heavier) tails; such a curve is called leptokurtic
    - A value smaller than 3 gives a flatter curve with shorter (lighter) tails; such a curve is called platykurtic
    - Measuring kurtosis using moments: kurtosis $= \frac{\mu_4}{\mu_2^2} = \frac{\mu_4}{\sigma^4}$, where $\mu_2 = \sigma^2$ and $\mu_4$ are the 2nd and 4th central moments

    [Figure: three frequency curves. The red line is a long-tailed distribution, the blue line is a short-tailed distribution, and the black line is the standard bell curve. Visual from http://whatilearned.wikia.com/wiki/File:Kurtosis.jpg]
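
    Moment-based skewness and kurtosis can be sketched in base R without extra packages. The helpers central_moment, skew and kurt are illustrative names, and x_right is an invented right-skewed sample; x_sym is the hypothetical dataset (1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 7).

```r
# r-th central moment (divisor n)
central_moment <- function(x, r) mean((x - mean(x))^r)

# Skewness: 3rd central moment over the 2nd raised to the power 3/2
skew <- function(x) central_moment(x, 3) / central_moment(x, 2)^(3 / 2)

# Kurtosis: 4th central moment over the square of the 2nd
kurt <- function(x) central_moment(x, 4) / central_moment(x, 2)^2

x_sym   <- c(1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 7)
x_right <- c(1, 1, 1, 2, 2, 3, 10)   # invented sample with a long right tail

skew(x_sym)     # 0: symmetric about the mean
kurt(x_sym)     # 1.625: below 3, so flatter than the normal curve
skew(x_right)   # positive: stretched to the right
```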


    Example

    Table of the gender-wise skewness and kurtosis of weights:

                     Skewness   Kurtosis
    Female           1.14       5.59
    Male             0.29       3.15
    Entire dataset   0.40       2.65


    Example (contd.)

    - Skewness and kurtosis give numeric measures of the information presented in a histogram
    - The histogram of weights of females is highly stretched on the right, leading to a positive and high skewness of 1.14
    - The stretch of the histogram for weights in the entire dataset is moderate and much smaller than that for females; this is reflected in the lower skewness of 0.40
    - The weights of males are stretched almost equally on both sides of the centre, giving a skewness of 0.29, the closest to zero of the three
    - Skewness and kurtosis shed light on important characteristics such as symmetry and peakedness
    - They give information about the distribution of the data beyond what the measures of central tendency and dispersion provide


    Point summary

    - A very useful and practical application of measures of central tendency and dispersion
    - 5-point summary: Minimum, 1st quartile, Median, 3rd quartile, Maximum
    - 6-point summary: Minimum, 1st quartile, Median, Mean, 3rd quartile, Maximum
    - Gives an idea of the extreme values, the values within which the middle 50% of the data lie, and the centrality of the data

    6-point summary of Weights in the body measurement data:

    Min.   1st Qu.   Median   Mean    3rd Qu.   Max.
    42     58.4      68.2     69.15   78.85     116.4
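
    Both summaries are available in base R. The sketch below applies them to the hypothetical dataset (1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 7) rather than to the body measurement data.

```r
x <- c(1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 7)

# 5-point summary: minimum, lower hinge, median, upper hinge, maximum
fivenum(x)   # 1 2 4 6 7

# 6-point summary: the same idea with the mean added
summary(x)   # 1 2 4 4 6 7
```

    fivenum() uses Tukey's hinges while summary() uses quantile(); the two agree on this dataset but can differ slightly on others.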


    R code for each measure (x denotes the variable, a numeric vector)

    Measure                                  R code
    Minimum                                  min(x)
    Maximum                                  max(x)
    Range                                    range(x)   (returns the minimum and maximum; diff(range(x)) gives their difference)
    Inter-quartile range                     IQR(x)
    Mean absolute deviation about mean       mean(abs(x - mean(x)))
    Mean absolute deviation about median     mean(abs(x - median(x)))
    Median absolute deviation about median   median(abs(x - median(x)))
    Variance                                 var(x)
    Standard deviation                       sd(x)
    Coefficient of range                     (max(x) - min(x)) / (max(x) + min(x))
    Coefficient of variation                 library(raster); cv(x)
    Standardization of a variable            function(x) {(x - mean(x)) / sqrt(var(x))}
    Skewness and kurtosis                    library(moments); skewness(x); kurtosis(x)
    6-point summary                          summary(x)

    Note: var(x) and sd(x) in R divide by n - 1 rather than n.


    Thank you