Summary of Statistics II


  • 8/12/2019 Summary of Statistics II

    1/8

    8/26/20

    Numerical Summaries of Data

    Numerical Summary (primarily for quantitative variables): Location and Variation

    Measures of Location
    Give middle or typical values, or central tendency.

    Measures of Variation
    Describe spread, scatter, or dispersion in the data.

    Measures of Location

    1. Mean
    2. Median
    3. Mode

    1. Mean

    The center of gravity of the data (histogram).

    Population mean = $\mu$; sample mean = $\bar{X}$


    Formula for the mean

    Sample mean = sum of observations divided by sample size:

    $\bar{X} = \frac{\sum X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n}$

    2. Median: the midpoint of the distribution

    At least half of the observations are less than or equal to the median, and at least half are greater than or equal to the median.

    Note: For n observations, the median is located at the $\frac{n+1}{2}$-th observation in the ordered sample.

    Example 1

    Data: 14, 18, 20, 12, 24, 15, 14 (n = 7, odd)

    Step 1: Order the data: 12, 14, 14, 15, 18, 20, 24

    Location of median: (7 + 1)/2 = 4th, so the median is 15.

    - Median is the middle value.
    - At least half the values are at or greater; at least half are at or lower.

    Example 2

    Data: 14, 18, 20, 12, 24, 15, 14 (n = 7, odd), with the largest value, 24, replaced by 94 (outlier)

    Step 1: Order the data: 12, 14, 14, 15, 18, 20, 94

    Original: $\bar{X}$ = 16.71
    With outlier: $\bar{X}$ = 26.71

    - Median is still the middle value (15).
    - Median is resistant to outliers.
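The effect of the outlier in Example 2 can be checked with a short sketch. It assumes, as the reported means of 16.71 and 26.71 imply, that 94 replaces the value 24; the variable names are illustrative only.

```python
from statistics import mean, median

original = [14, 18, 20, 12, 24, 15, 14]
with_outlier = [14, 18, 20, 12, 94, 15, 14]  # assumption: 24 replaced by 94

# The mean shifts by 10, while the median stays at the 4th ordered value.
print(round(mean(original), 2), median(original))
print(round(mean(with_outlier), 2), median(with_outlier))
```

The mean moves because every observation enters the sum; the median only depends on which value sits in the middle position.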

    Example 3

    Data: 14, 18, 20, 12, 24, 15, 14, 214 (n = 8, even, with outlier)

    Step 1: Order the data: 12, 14, 14, 15, 18, 20, 24, 214

    Location of median: (8 + 1)/2 = 4.5th, so the median is (15 + 18)/2 = 16.5.

    - Median is the average of the two middle values.
    - Exactly half the values are greater, half lower.
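The (n + 1)/2 location rule used in Examples 1 and 3 can be sketched as a small function. The name `median_by_location` is hypothetical, chosen for this illustration.

```python
def median_by_location(data):
    """Median via the (n + 1) / 2 location rule on the ordered sample."""
    xs = sorted(data)
    n = len(xs)
    location = (n + 1) / 2      # 1-based position in the ordered data
    lo = int(location)          # whole part of the position
    if location == lo:          # odd n: a single center observation
        return xs[lo - 1]
    # even n: average the two observations around the .5 position
    return (xs[lo - 1] + xs[lo]) / 2

print(median_by_location([14, 18, 20, 12, 24, 15, 14]))        # Example 1
print(median_by_location([14, 18, 20, 12, 24, 15, 14, 214]))   # Example 3
```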


    Summary for Finding the Median

    1. Order the data.
    2. For odd n, the median is the center observation.
    3. For even n, the median is the average of the two center observations.

    3. Mode: the most frequently occurring value

    In a histogram, the modal class is the one having the largest frequency, i.e., the highest bar.

    Good for a discrete quantitative variable (few values) or a categorical variable.
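As a rough illustration, the mode and the percentage of occurrences of each value can be found by counting. The function name `mode_and_shares` and the color list are hypothetical; the numeric data are from Example 1.

```python
from collections import Counter

def mode_and_shares(values):
    """Return the most frequent value(s) and each value's percentage share."""
    counts = Counter(values)
    total = len(values)
    shares = {v: 100 * c / total for v, c in counts.items()}
    top = max(counts.values())
    modes = [v for v, c in counts.items() if c == top]
    return modes, shares

# Discrete quantitative data (Example 1): 14 occurs twice.
print(mode_and_shares([14, 18, 20, 12, 24, 15, 14]))

# Hypothetical categorical data, for illustration only.
print(mode_and_shares(["red", "blue", "red", "green"]))
```

Returning a list of modes avoids hiding ties when two values share the highest frequency.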

    What type of variable is it?

    - If categorical, use the mode. An average is meaningless; look at percentages of occurrences.
    - If the variable is quantitative, first look at a graph:
        - Skewed or outliers? Use the median.
        - More or less symmetric? Use the mean.

    Numerical Summary

    Location: Mean, Median, Mode, Percentiles, Quartiles
    Variation: Range, Std. Deviation, IQR

    Mutual Fund Selection

    Two mutual funds; annual returns for the last three years from each. Which fund would you choose?

    Fund A: $\bar{X}$ = 10.0%    Fund B: $\bar{X}$ = 10.0%
       8.0                          60.0
      12.0                         -20.0
      10.0                         -10.0
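A quick sketch of the two funds shows why a location measure alone cannot separate them. The variable names are illustrative, and the spread is measured here simply as highest minus lowest return.

```python
from statistics import mean

fund_a = [8.0, 12.0, 10.0]
fund_b = [60.0, -20.0, -10.0]

for name, returns in (("Fund A", fund_a), ("Fund B", fund_b)):
    spread = max(returns) - min(returns)   # highest minus lowest return
    print(name, "mean:", mean(returns), "spread:", spread)
```

Both means equal 10.0%, yet Fund B's returns span 80 percentage points against 4 for Fund A.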

    Why does variation matter?

    Measures of Variation

    1. Range
    2. Variance & Standard Deviation
    3. Interquartile Range (IQR)


    1. Range

    Highest minus lowest value in the sample.

    Example 4: 3, 4, 1, 7, 4, 5
    Range = Hi - Lo = 7 - 1 = 6

    Example 5: 1, 1, 1, 7, 7, 7
    Range = Hi - Lo = 7 - 1 = 6

    (Each example is plotted on an axis from 1 to 7.)

    2. Variance & Standard Deviation

    How far are the data from the middle, on average?

    Notation:
    Sample Variance = $s^2$, Sample Std. Dev. = $s$
    Population Variance = $\sigma^2$, Population Std. Dev. = $\sigma$

    Example 4: 3, 4, 1, 7, 4, 5 (miles)

    $\bar{X}$ = 4.0. Deviations from the mean marked on the 1 to 7 axis: -3, -1, +1, +3 (the two observations equal to 4 have deviation 0).

    Note:
    - The average of the deviations from the mean will always be zero. We cannot let the negatives cancel out the positives.

    Avoid this by using either
    1. absolute value or
    2. squaring
    of the differences.

    Equation for Variance (for a sample):

    $s^2 = \frac{\sum (X_i - \bar{X})^2}{n - 1} = \frac{20}{6 - 1} = 4.0 \text{ miles}^2$

    Example 4 data:

    x        x - x̄     (x - x̄)²
    3         -1          1
    4          0          0
    1         -3          9
    7          3          9
    4          0          0
    5          1          1
    Total 24   0         20

    x̄ = 24/6 = 4.0

    Standard deviation:

    $s = \sqrt{s^2} = \sqrt{4.0} = 2.0$ mi.
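The Example 4 calculation can be reproduced column by column. This is a minimal sketch; the variable names are illustrative.

```python
from math import sqrt

data = [3, 4, 1, 7, 4, 5]                    # Example 4, in miles
n = len(data)
x_bar = sum(data) / n                        # sample mean
deviations = [x - x_bar for x in data]       # the "x - mean" column
squared = [d ** 2 for d in deviations]       # the squared-deviation column

s2 = sum(squared) / (n - 1)                  # sample variance, miles squared
s = sqrt(s2)                                 # sample standard deviation, miles

print(x_bar, sum(deviations), sum(squared), s2, s)
```

The deviations sum to zero, which is why they are squared before averaging.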


    Variance

    Equation for Variance:

    For a population: $\sigma^2 = \frac{\sum (X_i - \mu)^2}{N}$

    For a sample: $s^2 = \frac{\sum (X_i - \bar{X})^2}{n - 1}$

    Advantages: Good properties; uses all the data.
    Disadvantages: Units are squared. Not resistant.
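The only difference between the two variance formulas is the divisor, N versus n - 1. A short sketch using Python's standard `statistics` module on the Example 4 data, treating the same six values first as a sample and then, purely for illustration, as an entire population:

```python
from statistics import pvariance, variance

data = [3.0, 4.0, 1.0, 7.0, 4.0, 5.0]   # Example 4, in miles

sample_var = variance(data)       # divides the sum of squares by n - 1
population_var = pvariance(data)  # divides the sum of squares by N

print(sample_var, population_var)
```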

    Standard Deviation

    The square root of the variance:

    $s = \sqrt{s^2} = \sqrt{4.0} = 2.0$ miles

    Advantage: Easier to interpret than variance; units are the same as the data.

    Sample 100p-th percentile:

    If x = the 100p-th percentile, then at least 100p% of the data is ≤ x, and at least 100(1-p)% of the data is ≥ x.

    Example: You are told you scored 47; then you hear 47 is at the 82nd percentile.

    82% of the sample have scores ≤ 47, AND 18% have scores ≥ 47.
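The two-sided percentile condition can be tested directly. The sketch below uses a hypothetical helper, `satisfies_percentile`, and checks it on the Example 1 data with p = 0.5 (the median).

```python
def satisfies_percentile(data, x, p):
    """True if at least 100p% of data is <= x and at least 100(1-p)% is >= x."""
    n = len(data)
    at_or_below = sum(1 for v in data if v <= x) / n
    at_or_above = sum(1 for v in data if v >= x) / n
    return at_or_below >= p and at_or_above >= 1 - p

data = [14, 18, 20, 12, 24, 15, 14]          # Example 1
print(satisfies_percentile(data, 15, 0.5))   # 15 is the median
print(satisfies_percentile(data, 24, 0.5))   # the maximum is not
```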

    Five Number Summary

    1. Minimum
    2. 1st Quartile, Q1 = 25th percentile
    3. Median
    4. 3rd Quartile, Q3 = 75th percentile
    5. Maximum

    Quartiles:

    1st Quartile (25th percentile): 25% of the data values lie at or below it.
    3rd Quartile (75th percentile): 75% of the data values lie at or below it.


    Quartiles

    Q1: 25% of the data set is below the first quartile

    Q2: 50% of the data set is below the second quartile

    Q3: 75% of the data set is below the third quartile

    25% | Q1 | 25% | Q2 | 25% | Q3 | 25%

    Method 1: Percentile method

    Q1 located at position (n+1)*1/4
    Q2 located at position (n+1)*2/4
    Q3 located at position (n+1)*3/4

    n      Q1      Q2      Q3
    5      1.5     3       4.5
    8      2.25    4.5     6.75
    11     3       6       9
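The positions in the table follow directly from the three formulas. A minimal sketch, with the hypothetical name `quartile_positions`, that returns 1-based positions in the ordered sample (not the quartile values themselves):

```python
def quartile_positions(n):
    """Positions of Q1, Q2, Q3 in the ordered sample: (n + 1) * k / 4."""
    return tuple((n + 1) * k / 4 for k in (1, 2, 3))

for n in (5, 8, 11):
    print(n, quartile_positions(n))
```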

    Method 2: Median method

    Q1 = median of observations below the position of the median.
    Q3 = median of observations above the position of the median.

    Example 6

    Ordered data: 12, 14, 16, 18, 19, 21, 22, 25, 27

    Min = 12.0
    Q1 = 15.0
    Median = 19.0
    Q3 = 23.5
    Max = 27.0

    IQR = 8.5
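The median method can be sketched as follows and checked against Example 6. The function name `five_number_summary` is hypothetical. For even n, the sketch assumes the lower and upper halves are simply the first and last n/2 ordered observations.

```python
from statistics import median

def five_number_summary(data):
    """Min, Q1, median, Q3, max using the median method for the quartiles."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]                               # below the median position
    upper = xs[half + 1:] if n % 2 else xs[half:]   # above the median position
    return min(xs), median(lower), median(xs), median(upper), max(xs)

low, q1, med, q3, high = five_number_summary([12, 14, 16, 18, 19, 21, 22, 25, 27])
print(low, q1, med, q3, high, "IQR =", q3 - q1)
```

Other quartile conventions, including the percentile method, can give slightly different Q1 and Q3 for the same data.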

    3. Interquartile Range (IQR)

    IQR = Q3 - Q1

    - IQR is the range of the middle 50% of the data.
    - Observations more than 1.5 IQRs beyond the quartiles are considered outliers.

    Which summary statistics should I use?

    Symmetric: Use mean & std. dev.
    Skewed right: Use median & IQR.


    Boxplot

    A graphical display of the five number summary (also called a box-and-whiskers plot).



    [Boxplot of the Example 6 data on an axis from 12 to 28, with Minimum, Q1, Median, Q3, and Maximum marked]

    Note: Middle 50% of data are within the box.

    Min = 12.0
    Q1 = 15.0
    Median = 19.0
    Q3 = 23.5
    Max = 27.0

    IQR = 8.5
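The 1.5 IQR rule for outliers can be sketched with the same median-method quartiles. The function name `iqr_fences` and the term "fences" for the cutoffs are assumptions made for this illustration; the data are from Examples 6 and 3.

```python
from statistics import median

def iqr_fences(data):
    """IQR, the 1.5 * IQR cutoffs beyond the quartiles, and flagged outliers."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    q1 = median(xs[:half])                              # median method
    q3 = median(xs[half + 1:] if n % 2 else xs[half:])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [x for x in xs if x < lower or x > upper]
    return iqr, lower, upper, outliers

print(iqr_fences([12, 14, 16, 18, 19, 21, 22, 25, 27]))     # Example 6
print(iqr_fences([14, 18, 20, 12, 24, 15, 14, 214]))        # Example 3
```

No Example 6 observation falls outside its cutoffs, while the value 214 in Example 3 is flagged.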