Lecture 3+ , Descriptive Statistics (Slide)

  • 8/3/2019 Lecture 3+ , Descriptive Statistics (Slide)

    Descriptive Statistics

    Descriptive Statistics

    Reducing a complex mass of data to a manageable set of information

    Descriptive statistics: the summary and presentation of data to:

        simplify the data

        enable meaningful interpretation

        support decision making

    Numerical descriptive measures (a few numbers)

    Graphical presentations

    Descriptive Statistics

        Measures of central tendency

        Measures of dispersion

    Data Presentation

        Grouped data

    What is the Mean?

    The mean is the most common measure of central tendency.

    The arithmetic mean (or average) is defined as the sum of all the observed values, divided by the number of observations.

    The mean is a good way to describe the center of a group of data if the values have a more or less normal distribution. It may not describe a group of data well if a few values are far from the rest (the data are "skewed" or there are many "outliers").

    When is the Mean Useful?

    The mean is useful when you have a normal distribution of data.

    The mean is not very useful when you have an abnormal (for example, skewed) distribution of data.

    How is the Mean Calculated?

    Let us calculate the mean for the data set below. This sample data set represents the incubation period in days for a group of 21 people who contracted hepatitis. Look at the numbers:

    16, 20, 22, 24, 25, 28, 28, 29, 29, 30, 30, 30, 31, 31, 32, 33, 35, 36, 38, 40, 42

    Applying the steps described previously, you simply add up all the numbers and then divide by the number of observations:

    There are 21 data values (n = 21).

    The sum of the 21 data values (X1 + X2 + ... + Xn) listed above is 629.

    629 / 21 = 29.95. We can round this up to 30.0 or 30.

    The mean for the data set above, then, is 30.
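    As a quick check of the arithmetic above, the following minimal Python sketch recomputes the mean from the 21 incubation periods. The variable names are illustrative and are not taken from the slides.

```python
# Incubation periods in days for the 21 hepatitis cases listed above.
incubation_days = [16, 20, 22, 24, 25, 28, 28, 29, 29, 30, 30, 30,
                   31, 31, 32, 33, 35, 36, 38, 40, 42]

n = len(incubation_days)        # number of observations
total = sum(incubation_days)    # X1 + X2 + ... + Xn
mean = total / n                # arithmetic mean

print(f"n = {n}, sum = {total}, mean = {mean:.2f}")
```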

    Median

    What is the Median?

    The median is a measure of central tendency that is useful in representing data that are skewed. "Skewed" simply means that there are significantly more data points with values below the mean than there are above the mean (or vice versa). With skewed data, the normally centered hump on the frequency distribution curve is offset to the left or right of center. The median is the value that divides the distribution of values into two equal parts.

    When is the Median Useful?

    The median is useful when you have an abnormal (or skewed) distribution of data. A skewed distribution of data shows up clearly if you present the information in a graph.

    How is the Median Determined?

    The median is determined rather than calculated. That is, the median is based on its relationship to other data in the population rather than calculated algebraically. The median of a set of observations is the value that falls in the middle position when the observations are ranked in order from the smallest to the largest. The rules for determining the median are:

    Rank the observations from the smallest to the largest.

    If the number of observations is odd, the median is the middle number.

    If the number of observations is even, the median is the average of the two middle numbers.

    An Exercise in Determining the Median

    Let us take a sample of 8 people from this community, assuming these people are representative of the general population. Here are the numbers representing net worth for these people:

    $2,000, $10,000, $25,000, $32,000, $45,000, $50,000, $80,000, $3,000,000

    We have an even number of observations (8).

    The two middle values in the ordered list are $32,000 and $45,000.

    Therefore, the median is halfway between the values of $32,000 and $45,000.

    Calculate the average of these two numbers to determine the median: $38,500.
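    The ranking rules for the median can be sketched in Python as follows. The function name `median` and the variable `net_worth` are illustrative choices for this sketch, which assumes at least one observation.

```python
def median(values):
    """Rank the values, then take the middle one (odd n) or the
    average of the two middle ones (even n). Assumes len(values) >= 1."""
    ranked = sorted(values)
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:
        return ranked[mid]
    return (ranked[mid - 1] + ranked[mid]) / 2

# Net worth in dollars for the 8 people in the exercise above.
net_worth = [2_000, 10_000, 25_000, 32_000, 45_000, 50_000, 80_000, 3_000_000]

print("median:", median(net_worth))
print("mean:  ", sum(net_worth) / len(net_worth))
```

    For these skewed data the single $3,000,000 value pulls the mean up to $405,500, while the median stays at $38,500.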

    Mode

    What is the Mode?

    The mode is the value that occurs with the greatest frequency in a set of observations. If no value is repeated within the set of observations, then there is no mode. If two or more values are repeated at the same greatest frequency, then each of those values is a mode. In a normal or symmetrical distribution of data, the mean, median, and mode have the same values (or very close).

    When is the Mode Useful?

    The mode is useful if you are trying to focus on the most frequent value for a certain population. Although the mode is seldom used in public health statistics, it could be used to focus attention on the modal (most common) age group of a population in the outbreak of a disease, or establish some other modal characteristic for a population experiencing a disease.

    How is the Mode Determined?

    The mode of a set of observations is the value that occurs with the greatest frequency.

    To determine the mode,

    Rank the observations from the smallest to the largest.

    Evaluate the ranked data set by counting the number of times each individual value occurs, and

    determine which value(s) occur with the greatest frequency.
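    The counting procedure above can be sketched in Python as follows. The function name `modes` is an illustrative choice; the sketch returns an empty list when no value is repeated, and returns every tied value when several share the greatest frequency.

```python
from collections import Counter

def modes(values):
    """Return all values that occur with the greatest frequency,
    or an empty list when no value is repeated."""
    counts = Counter(values)          # how many times each value occurs
    if not counts:
        return []
    top = max(counts.values())        # greatest frequency
    if top == 1:                      # nothing repeated: no mode
        return []
    return sorted(v for v, c in counts.items() if c == top)

# Incubation periods in days from the mean example in these slides.
incubation_days = [16, 20, 22, 24, 25, 28, 28, 29, 29, 30, 30, 30,
                   31, 31, 32, 33, 35, 36, 38, 40, 42]

print(modes(incubation_days))
```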

    Measures of Dispersion

    Dispersion of a set of observations is the variety exhibited by the observations

    If all values are the same, there is no dispersion

    The more the values are spread, the greater the dispersion

    Many distributions are well described by a measure of location and a measure of dispersion

    Common Measures of Dispersion

    Range

    Variance

    Standard deviation

    Coefficient of variation (CV)

    Standard Deviation of the Mean (SE)

    Percentiles and Quartiles

    When are measures of dispersion useful?

    If you are evaluating the norm for a particular characteristic, like weight or height, you need to establish the extremes (lowest and highest values) in order to assess what might be outside the norm. For example, there are standards for weight in proportion to height. Some people are very heavy for their height, whereas others are much lighter, even if they are of the same height. The extremes of this range can describe how far from the norm a person's weight is when assessed with their height.

    Range

    What is the Range?

    The range is calculated as the difference between the smallest and the largest values in a set of data.

    It is heavily influenced by the two most extreme values and ignores the rest of the distribution.

    Range

    When is the Range Useful?

    The range is an adequate measure of variation for a small set of data, like class scores for a test. Think of other measures where the range might be useful: salaries for a particular job category, or indoor versus outdoor temperatures.

    Range

    How is the Range Calculated?

    The range is calculated by subtracting the smallest value in the data set from the largest value in the data set:

    Range = Largest value - Smallest value

    Data set A: 8, 9, 10, 10, 11, 12

    Data set B: 5, 6, 10, 10, 14, 15

    The range for A: 12 - 8 = 4. The range for B: 15 - 5 = 10.
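    A minimal Python sketch of the same calculation follows. The function name `value_range` is an illustrative choice. The sketch also prints each mean, since both data sets have a mean of 10 and differ only in their spread.

```python
def value_range(values):
    """Range = largest value - smallest value."""
    return max(values) - min(values)

data_set_a = [8, 9, 10, 10, 11, 12]
data_set_b = [5, 6, 10, 10, 14, 15]

for name, data in (("A", data_set_a), ("B", data_set_b)):
    mean = sum(data) / len(data)
    print(f"Data set {name}: mean = {mean}, range = {value_range(data)}")
```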

    Variance

    Variance measures the dispersion of values around their mean

    Definition of sample variance:

    $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

    Degrees of freedom

    n - 1 is used because if we know n - 1 of the deviations, the nth deviation is known

    Deviations have to sum to zero
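    The degrees-of-freedom argument can be illustrated with a short Python sketch. It uses the values of Data set A from the range example. The function name `sample_variance` is an illustrative choice, and the sketch divides the sum of squared deviations by n - 1 as in the definition of sample variance.

```python
def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

values = [8, 9, 10, 10, 11, 12]            # Data set A, mean = 10
mean = sum(values) / len(values)
deviations = [x - mean for x in values]

# Deviations sum to zero, so the last one is fixed by the first n - 1.
last_from_others = -sum(deviations[:-1])

print("deviations:", deviations)
print("last deviation recovered from the others:", last_from_others)
print("sample variance:", sample_variance(values))
```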

    Standard Deviation

    Definition of sample standard deviation:

    $s = \sqrt{s^2}$

    Standard deviation is in the same units as the mean

    Variance is in units squared (units$^2$)

    What is the Standard Deviation?

    The standard deviation of a data set is based on how much each data value deviates from the mean, and is equal to the square root of the variance. The greater the dispersion of values, the larger the standard deviation. Much of statistical theory is based on the standard deviation and the 'normal' distribution.

    When is the Standard Deviation Useful?

    It is a useful measure when your data distribution is very close to a normal curve. In this situation, the mean is the best measure of central tendency, and the standard deviation is the best measure of dispersion.

    In a normal distribution, if you measure 1 standard deviation to either side of the mean, you will find that 68.3% of the observations fall into this area; 95.4% of the observations fall within 2 standard deviations to either side of the mean; and 99.7% of observations fall within 3 standard deviations of the mean.
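    The percentages quoted for the normal distribution can be checked numerically. The sketch below relies on the standard identity P(|Z| < k) = erf(k / sqrt(2)) for a standard normal variable Z. Neither the identity nor the function name `prob_within` appears in the slides.

```python
import math

def prob_within(k):
    """Probability that a normally distributed observation lies within
    k standard deviations of the mean: P(|Z| < k) = erf(k / sqrt(2))."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {100 * prob_within(k):.2f}%")
```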

    Calculation of the Sample Standard Deviation using the Theoretical (Squared Deviation) Method

    Squared Deviation from the Mean Age for a Sample of 11 Chicken Pox Sufferers

    Child's Age (X)    Child's Age (X) Minus the           Squared Deviation
    in Years           Mean Age ($\bar{X}$) in Years       $(X - \bar{X})^2$
    X1 = 2             2 - 6 = -4                          (-4)^2 = 16
    X2 = 4             4 - 6 = -2                          (-2)^2 = 4
    X3 = 5             5 - 6 = -1                          (-1)^2 = 1
    X4 = 5             5 - 6 = -1                          (-1)^2 = 1
    X5 = 6             6 - 6 = 0                           (0)^2 = 0
    X6 = 6             6 - 6 = 0                           (0)^2 = 0
    X7 = 6             6 - 6 = 0                           (0)^2 = 0
    X8 = 7             7 - 6 = 1                           (1)^2 = 1
    X9 = 7             7 - 6 = 1                           (1)^2 = 1
    X10 = 8            8 - 6 = 2                           (2)^2 = 4
    X11 = 10           10 - 6 = 4                          (4)^2 = 16

    $\sum X = 66$ years; $\sum (X - \bar{X}) = 0$; $\sum (X - \bar{X})^2 = 44$

    $\bar{X} = \sum X / n = 6$ years; $n = 11$; $n - 1 = 10$

    Calculation of the Sample Standard Deviation Using the Data in Table 5.6 and the Theoretical Formula:

    $S = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$

    $S = \sqrt{\frac{44}{10}}$

    $S = \sqrt{4.4}$

    $S = 2.10$ years
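    The squared-deviation method can be reproduced step by step in Python. This is a minimal sketch using the 11 ages from the chicken pox sample; the variable names are illustrative.

```python
import math

# Ages in years of the 11 chicken pox sufferers in the sample.
ages = [2, 4, 5, 5, 6, 6, 6, 7, 7, 8, 10]

n = len(ages)
mean_age = sum(ages) / n                           # 66 / 11
squared_deviations = [(x - mean_age) ** 2 for x in ages]
sum_sq_dev = sum(squared_deviations)               # sum of (X - mean)^2
s = math.sqrt(sum_sq_dev / (n - 1))                # sample standard deviation

print(f"mean = {mean_age}, sum of squared deviations = {sum_sq_dev}, s = {s:.2f}")
```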

    Calculation of the Sample Standard Deviation Using the Computational (Sum of Squares) Formula:

    Child's Age (X) in Years    X^2
    2                           4
    4                           16
    5                           25
    5                           25
    6                           36
    6                           36
    6                           36
    7                           49
    7                           49
    8                           64
    10                          100

    Computation Formula:

    $S = \sqrt{\frac{\sum X^2 - \frac{(\sum X)^2}{n}}{n - 1}}$

    where $\sum X = 66$ and $\sum X^2 = 440$, so

    $S = \sqrt{\frac{440 - \frac{66^2}{11}}{10}} = \sqrt{4.4}$

    $S = 2.10$ years
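    The sum-of-squares formula needs only the running totals of X and X squared. The sketch below applies it to the same 11 ages and compares the result with `statistics.stdev` from the Python standard library, which also computes the sample (n - 1) standard deviation. The variable names are illustrative.

```python
import math
import statistics

# Ages in years of the 11 chicken pox sufferers in the sample.
ages = [2, 4, 5, 5, 6, 6, 6, 7, 7, 8, 10]

n = len(ages)
sum_x = sum(ages)                       # sum of X
sum_x2 = sum(x * x for x in ages)       # sum of X squared

# Computational formula: sqrt((sum X^2 - (sum X)^2 / n) / (n - 1))
s_computational = math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))

print(f"sum X = {sum_x}, sum X^2 = {sum_x2}, s = {s_computational:.2f}")
print(f"statistics.stdev gives {statistics.stdev(ages):.2f}")
```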

    Coefficient of Variation (CV)

    What is the Coefficient of Variation?

    The coefficient of variation measures variability in relation to the mean (or average) and is used to compare the relative dispersion in one type of data with the relative dispersion in another type of data. The data to be compared may be in the same units, in different units, with the same mean, or with different means.

    When is the Coefficient of Variation Useful?

    Suppose you want to evaluate the relative dispersion of grades for two classes of students: Class A and Class B. The coefficient of variation can be used to compare these two groups and determine how the grade dispersion in Class A compares to the grade dispersion in Class B. This is one example of how the coefficient of variation can be applied.

    Coefficient of Variation

    Relative variation rather than absolute variation such as standard deviation

    Definition of C.V.:

    $C.V. = \frac{s}{\bar{x}} (100)$

    Useful in comparing variation between two distributions

    Used particularly in comparing laboratory measures to identify those determinations with more variation

    Also used in QC analyses for comparing
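    As a sketch of how the coefficient of variation compares dispersion across data in different units, the Python below applies it to two data sets from these slides: the chicken pox ages (years) and the hepatitis incubation periods (days). The function name is illustrative, and `statistics.stdev` is assumed to be an acceptable stand-in for the sample standard deviation s.

```python
import statistics

def coefficient_of_variation(values):
    """C.V. = (s / mean) * 100, with s the sample standard deviation."""
    return statistics.stdev(values) / statistics.mean(values) * 100

ages_years = [2, 4, 5, 5, 6, 6, 6, 7, 7, 8, 10]
incubation_days = [16, 20, 22, 24, 25, 28, 28, 29, 29, 30, 30, 30,
                   31, 31, 32, 33, 35, 36, 38, 40, 42]

for label, data in (("ages (years)", ages_years),
                    ("incubation (days)", incubation_days)):
    print(f"{label}: C.V. = {coefficient_of_variation(data):.1f}%")
```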

    Standard Deviation of the Mean (SE)

    The standard deviation of the mean (often called the standard error) is a measure of the variation in means of repeated samples. It is defined as the standard deviation divided by the square root of the sample size:

    $SE = \frac{s}{\sqrt{n}}$

    To calculate the standard deviation of the mean, do the following:

    1. Calculate the standard deviation (s).

    2. Calculate the square root of the sample size (n).

    3. Divide the standard deviation by the result of step 2.
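    The three steps translate directly into Python. This is a minimal sketch applied to the chicken pox ages used in the standard deviation example; the function name `standard_error` is an illustrative choice.

```python
import math
import statistics

def standard_error(values):
    """Standard deviation of the mean: s / sqrt(n)."""
    s = statistics.stdev(values)       # step 1: sample standard deviation
    root_n = math.sqrt(len(values))    # step 2: square root of sample size
    return s / root_n                  # step 3: divide

ages = [2, 4, 5, 5, 6, 6, 6, 7, 7, 8, 10]
print(f"SE = {standard_error(ages):.3f} years")
```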

    Percentiles and Quartiles

    Definition of Percentiles

    Given a set of n observations x1, x2, ..., xn, the pth percentile P is the value of X such that p percent or less of the observations are less than P and (100 - p) percent or less are greater than P

    P10 indicates the 10th percentile, etc.

    Definition of Quartiles

    First quartile is P25

    Second quartile is the median, or P50

    Third quartile is P75

    Measures of Position

    Quartiles, Deciles, Percentiles

    Quartiles

    Q1, Q2, Q3 divide ranked scores into four equal parts

    (minimum)  25%  Q1  25%  Q2 (median)  25%  Q3  25%  (maximum)

    Finding the Percentile of a Given Score

    $\text{Percentile of score } x = \frac{\text{number of scores less than } x}{\text{total number of scores}} \cdot 100$
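    The percentile-of-a-score formula is a direct count, as the following Python sketch shows. The function name `percentile_of_score` is illustrative, and the example reuses the incubation periods from the mean example.

```python
def percentile_of_score(scores, x):
    """(number of scores less than x) / (total number of scores) * 100."""
    below = sum(1 for s in scores if s < x)
    return below / len(scores) * 100

incubation_days = [16, 20, 22, 24, 25, 28, 28, 29, 29, 30, 30, 30,
                   31, 31, 32, 33, 35, 36, 38, 40, 42]

# Nine of the 21 values are below 30 days.
print(f"{percentile_of_score(incubation_days, 30):.1f}")
```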

    Inter-quartile Range

    Better description of the distribution than the range

    Range of the middle 50 percent of the distribution

    Definition of inter-quartile range: IQR = Q3 - Q1.

    [Figure: two frequency distributions of values, each divided into the lower 25%, middle 50%, and upper 25% of values. One covers values 1 to 14 (Range = 14 - 1 = 13) and the other covers values 1 to 21 (Range = 21 - 1 = 20). Both have an inter-quartile range of 5 to 9.]

    Interquartile Range (or IQR): $Q_3 - Q_1$

    Semi-interquartile Range: $\frac{Q_3 - Q_1}{2}$

    Midquartile: $\frac{Q_1 + Q_3}{2}$

    10 - 90 Percentile Range: $P_{90} - P_{10}$
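    These quartile-based measures can be sketched in Python once Q1 and Q3 are available. The sketch uses `statistics.quantiles` (Python 3.8+) with its default "exclusive" method; that choice is an assumption of this example, not something the slides specify, and other quartile conventions can give slightly different values. The data are the incubation periods from the mean example.

```python
import statistics

incubation_days = [16, 20, 22, 24, 25, 28, 28, 29, 29, 30, 30, 30,
                   31, 31, 32, 33, 35, 36, 38, 40, 42]

# n=4 asks for the three cut points Q1, Q2, Q3.
q1, q2, q3 = statistics.quantiles(incubation_days, n=4)

iqr = q3 - q1                  # interquartile range
semi_iqr = (q3 - q1) / 2       # semi-interquartile range
midquartile = (q1 + q3) / 2    # midquartile

print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}")
print(f"IQR = {iqr}, semi-IQR = {semi_iqr}, midquartile = {midquartile}")
```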

    Percentiles

    A "percentile" shows how a single system may be compared to all other systems. Percentiles range from lowest (1) to highest (99), with the middle of the distribution (the median) at 50.

    The pth percentile (with p written as a fraction from 0 to 1) is a value such that roughly 100p% of the data are smaller and 100(1 - p)% of the data are larger. Percentiles can be computed for ordinal, interval, or ratio data.

    There are three steps for computing a percentile:

    1. Sort the data from low to high;

    2. Count the number of values (n);

    3. Select the p*(n+1)th observation.

    If p*(n+1) is not a whole number, then go halfway between the two adjacent observations.

    If p*(n+1) < 1, select the smallest observation.

    If p*(n+1) > n, select the largest observation.

    Examples

    The following data represent cotinine levels in saliva (nmol/l) after smoking. We want to compute the 50th percentile.

    73, 58, 67, 93, 33, 18, 147

    1. Sorted data: 18, 33, 58, 67, 73, 93, 147

    2. There are n = 7 observations.

    3. Select the 0.50*(7+1) = 4th observation.

    Therefore, the 50th percentile equals 67. Notice that there are three observations larger than 67 and three observations smaller than 67.

    Suppose we want to compute the 20th percentile. Notice that p*(n+1) = 0.20*(7+1) = 1.6. This is not a whole number, so we select halfway between the 1st and 2nd observations, or 25.5. (Some people see the 1.6 and think they have to go six tenths of the way to the second value. You can do this if you like, but I think life is too short to worry about such details.)
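    The p*(n+1) rule, including the "go halfway" simplification and the two edge cases, can be sketched in Python as follows. The function name `percentile` and the floating-point guard are choices made for this sketch; p is passed as a fraction between 0 and 1.

```python
import math

def percentile(values, p):
    """Percentile by the p*(n+1) rule, with p as a fraction in [0, 1]."""
    ranked = sorted(values)              # step 1: sort low to high
    n = len(ranked)                      # step 2: count the values
    position = p * (n + 1)               # step 3: which observation
    # Treat positions such as 3.0000000001 as the whole number 3.
    if math.isclose(position, round(position)):
        position = round(position)
    if position < 1:
        return ranked[0]                 # below the first observation
    if position > n:
        return ranked[-1]                # beyond the last observation
    if isinstance(position, int):
        return ranked[position - 1]      # whole number: take that observation
    lower = math.floor(position)
    # Not a whole number: go halfway between the two adjacent observations.
    return (ranked[lower - 1] + ranked[lower]) / 2

cotinine = [73, 58, 67, 93, 33, 18, 147]   # nmol/l
print(percentile(cotinine, 0.50))
print(percentile(cotinine, 0.20))
```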

    The five number summary

    A five number summary uses percentiles to describe a set of data. The five number summary consists of

    MAX - the maximum value

    75% - the 75th percentile (3rd quartile)

    50% - the 50th percentile (2nd quartile or median)

    25% - the 25th percentile (1st quartile)

    MIN - the minimum value

    The five number summary splits the data into four regions, each of which contains 25% of the data.
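    A five number summary can be assembled from the minimum, the three quartiles, and the maximum. The sketch below again assumes `statistics.quantiles` with its default "exclusive" method for the quartiles, and it returns the values from smallest to largest. The function name is illustrative, and the data are the cotinine levels from the percentile example.

```python
import statistics

def five_number_summary(values):
    """Return (MIN, 25%, 50%, 75%, MAX) for a list of numbers."""
    q1, q2, q3 = statistics.quantiles(values, n=4)   # the three quartiles
    return (min(values), q1, q2, q3, max(values))

cotinine = [73, 58, 67, 93, 33, 18, 147]   # nmol/l
print(five_number_summary(cotinine))
```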

    Summary

    In practice, descriptive statistics play a major role

        Always the first 1-2 tables/figures in a paper

        Statistician needs to know about each variable before deciding how to analyze to answer research questions

    In any analysis, 90% of the effort goes into setting up the data

        Descriptive statistics are part of that 90%