Lecture 3+ , Descriptive Statistics (Slide)

8/3/2019 Lecture 3+ , Descriptive Statistics (Slide)

1/50

Descriptive Statistics


2/50

Descriptive Statistics: reducing a complex mass of data to a manageable set of information

Descriptive Statistics: the summary and presentation of data to:

simplify the data

enable meaningful interpretation

support decision making

Numerical descriptive measures (few numbers)

Graphical presentations


3/50

Descriptive Statistics

Measures of central tendency

Measures of dispersion

Data Presentation

Grouped data




12/50

What is the Mean?

The mean is the most common measure of central tendency.

The arithmetic mean (or average) is defined as the sum of all the observed values, divided by the number of observations.

The mean is a good way to describe the center of a group of data if the values have a more or less normal distribution. It may not describe a group of data well if a few values are far from the rest (the data are "skewed" or there are many "outliers").


13/50

When is the Mean Useful?

The mean is useful when you have a normal distribution of data.

The mean is not very useful when you have an abnormal (for example, strongly skewed) distribution of data.


14/50

How is the Mean Calculated?

Let us calculate the mean for the data set below. This sample data set represents the incubation period in days for a group of 21 people who contracted hepatitis. Look at the numbers:

16, 20, 22, 24, 25, 28, 28, 29, 29, 30, 30, 30, 31, 31, 32, 33, 35, 36, 38, 40, 42

Applying the steps described previously, you simply add up all the numbers and then divide by the number of observations:

There are 21 data values (n = 21).

The sum of the 21 data values (X1 + X2 + ... + Xn) listed above is 629.

629 / 21 = 29.95. We can round this up to 30.0 or 30.

The mean for the data set above, then, is 30.
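
The same arithmetic can be checked in a few lines of Python. This is only a minimal sketch; the variable names are illustrative and are not part of the lecture.

```python
# Incubation periods (days) for the 21 hepatitis cases listed above.
incubation_days = [16, 20, 22, 24, 25, 28, 28, 29, 29, 30, 30,
                   30, 31, 31, 32, 33, 35, 36, 38, 40, 42]

n = len(incubation_days)        # number of observations
total = sum(incubation_days)    # X1 + X2 + ... + Xn
mean = total / n                # arithmetic mean

print(n, total, round(mean, 2))  # 21 629 29.95
```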


15/50

Median

What is the Median?

The median is a measure of central tendency that is useful in representing data that is skewed. "Skewed" simply means that there are significantly more data points with values below the mean than there are above the mean (or vice-versa). With skewed data, the normally centered hump on the frequency distribution curve is offset to the left or right of center. The median is the value that divides the distribution of values into two equal parts.


16/50

When is the Median Useful?

The median is useful when you have an abnormal (or skewed) distribution of data. A skewed distribution shows up clearly if you present the information in a graph.


17/50

How is the Median Determined?

The median is determined rather than calculated. That is, the median is based on its relationship to other data in the population rather than calculated algebraically. The median of a set of observations is the value that falls in the middle position when the observations are ranked in order from the smallest to the largest. The rules for determining the median are:

Rank the observations from the smallest to the largest.

If the number of observations is odd, the median is the middle number.

If the number of observations is even, the median is the average of the two middle numbers.
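
The three rules above can be sketched as a small Python function. The function name and the two short test lists are made up for illustration; they are not data from the lecture.

```python
def median(observations):
    """Median by ranking, following the three rules above."""
    ranked = sorted(observations)      # rule 1: rank smallest to largest
    n = len(ranked)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    middle = n // 2
    if n % 2 == 1:                     # rule 2: odd count, take the middle value
        return ranked[middle]
    # rule 3: even count, average the two middle values
    return (ranked[middle - 1] + ranked[middle]) / 2

print(median([7, 1, 5]))       # 5
print(median([7, 1, 5, 3]))    # 4.0
```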


18/50

An Exercise in Determining the Median

Let us take a sample of 8 people from this community, assuming these people are representative of the general population. Here are the numbers representing net worth for these people:

$2,000, $10,000, $25,000, $32,000, $45,000, $50,000, $80,000, $3,000,000

We have an even number of observations (8).

The two middle values in the ordered list are $32,000 and $45,000.

Therefore, the median is halfway between the values of $32,000 and $45,000.

Calculate the average of these two numbers to determine the median: $38,500.
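
To see why the median suits skewed data like this, it may help to compare it with the mean of the same eight values. The sketch below uses Python's standard `statistics` module; the variable names are illustrative.

```python
import statistics

# Net worth (dollars) of the 8 people in the exercise above.
net_worth = [2_000, 10_000, 25_000, 32_000, 45_000, 50_000, 80_000, 3_000_000]

median_worth = statistics.median(net_worth)    # average of the 4th and 5th ranked values
mean_worth = sum(net_worth) / len(net_worth)   # pulled upward by the $3,000,000 value

print(median_worth)   # 38500.0
print(mean_worth)     # 405500.0
```

The single $3,000,000 observation moves the mean above seven of the eight values, while the median stays near the middle of the group.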


19/50

Mode

What is the Mode?

The mode is the value that occurs with the greatest frequency in a set of observations. If no value is repeated within the set of observations, then there is no mode.

If two or more values are repeated at the same greatest frequency, then each of those observations is a mode. In a normal or symmetrical distribution of data, the mean, median, and mode have the same values (or very close).


20/50

When is the Mode Useful?

The mode is useful if you are trying to focus on the most frequent value for a certain population. Although the mode is seldom used in public health statistics, it could be used to focus attention on the modal (most common) age group of a population in the outbreak of a disease, or establish some other modal characteristic for a population experiencing a disease.


21/50

How is the Mode Determined?

The mode of a set of observations is the value that occurs with the greatest frequency.

To determine the mode,

Rank the observations from the smallest to the largest.

Evaluate the ranked data set by counting the number of times each individual value occurs, and

determine which value(s) occur with the greatest frequency.
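
The counting procedure above can be sketched in Python with `collections.Counter`. The function name is an assumption for illustration. It returns every value tied for the greatest frequency, and an empty list when no value repeats, following the definition of the mode given earlier. The first data set is the hepatitis incubation data from the mean example; the two short lists are made up.

```python
from collections import Counter

def modes(observations):
    """Return all values tied for the greatest frequency (empty list if none repeats)."""
    counts = Counter(observations)     # how many times each value occurs
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:                       # no value is repeated, so there is no mode
        return []
    return sorted(value for value, count in counts.items() if count == top)

incubation_days = [16, 20, 22, 24, 25, 28, 28, 29, 29, 30, 30,
                   30, 31, 31, 32, 33, 35, 36, 38, 40, 42]

print(modes(incubation_days))   # [30]
print(modes([1, 1, 2, 2, 3]))   # [1, 2]
print(modes([1, 2, 3]))         # []
```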


22/50

Measures of Dispersion

Dispersion of a set of observations is the variety exhibited by the observations

If all values are the same, there is no dispersion

The more the values are spread, the greater the dispersion

Many distributions are well described by measures of location and dispersion


23/50

Common Measures of Dispersion

Range

Variance

Standard deviation

Coefficient of variation (CV)

Standard Deviation of the Mean (SE)

Percentiles and Quartiles


24/50

When are measures of dispersion useful?

If you are evaluating the norm for a particular characteristic, like weight or height, you need to establish the extremes (lowest and highest values) in order to assess what might be outside the norm. For example, there are standards for weight in proportion to height. Some people are very heavy for their height, whereas others are much lighter, even if they are of the same height. The extremes of this range can describe how far from the norm a person's weight is when assessed with their height.


25/50

Range

What is the Range?

The range is calculated as the difference between the smallest and the largest values in a set of data.

It is heavily influenced by the two most extreme values and ignores the rest of the distribution.


26/50

Range

When is the Range Useful?

The range is an adequate measure of variation for a small set of data, like class scores for a test. Think of other measures where range might be useful: salaries for a particular job category, or indoor versus outdoor temperatures?


27/50

Range

How is the Range Calculated?

The range is calculated by subtracting the smallest value in the data set from the largest value in the data set:

Range = Largest value - Smallest value

Data set A: 8, 9, 10, 10, 11, 12

Data set B: 5, 6, 10, 10, 14, 15

The range for A: 12 - 8 = 4. The range for B: 15 - 5 = 10.
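
As a quick check of the two ranges, here is a minimal Python sketch. The function name `value_range` is illustrative.

```python
def value_range(data):
    """Range = largest value - smallest value."""
    return max(data) - min(data)

data_set_a = [8, 9, 10, 10, 11, 12]
data_set_b = [5, 6, 10, 10, 14, 15]

print(value_range(data_set_a))   # 4
print(value_range(data_set_b))   # 10
```

Both data sets have the same mean (10), so the range is what tells them apart here.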


28/50

Variance

Variance measures distribution of values around their mean

Definition of sample variance

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

Degrees of freedom

n-1 used because if we know n-1 deviations, the nth deviation is known

Deviations have to sum to zero


29/50

Standard Deviation

Definition of sample standard deviation

$$s = \sqrt{s^2}$$

Standard deviation in same units as mean

Variance in units squared


30/50

What is the Standard Deviation?

The standard deviation of a data set is based on how much each data value deviates from the mean, and is equal to the square root of the variance. The greater the dispersion of values, the larger the standard deviation. Much of statistical theory is based on the standard deviation and the 'normal' distribution.


31/50

When is the Standard Deviation Useful?

It is a useful measure when your data distribution is very close to a normal curve. In this situation, the mean is the best measure of central tendency, and the standard deviation is the best measure of dispersion.

In a normal distribution, if you measure 1 standard deviation to either side of the mean, you will find that 68.3% of the observations fall into this area; 95.5% of the observations fall within 2 standard deviations to either side of the mean; and 99.7% of observations fall within 3 standard deviations of the mean.
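
These percentages can be reproduced from the normal distribution itself. The sketch below uses the standard identity that the fraction of a normal distribution within k standard deviations of the mean equals erf(k / sqrt(2)). The function name is illustrative. The two-standard-deviation value comes out at about 95.45%, which the slide quotes as 95.5%.

```python
import math

def normal_coverage(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(100 * normal_coverage(k), 2))
# 1 68.27
# 2 95.45
# 3 99.73
```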


32/50

Calculation of the Sample Standard Deviation using the Theoretical (Squared Deviation) Method

Squared Deviation from the Mean Age for a Sample of 11 Chicken Pox Sufferers

Observation   Child's age X (years)   X - mean (years)   (X - mean)^2
X1            2                       2 - 6 = -4         (-4)^2 = 16
X2            4                       4 - 6 = -2         (-2)^2 = 4
X3            5                       5 - 6 = -1         (-1)^2 = 1
X4            5                       5 - 6 = -1         (-1)^2 = 1
X5            6                       6 - 6 = 0          (0)^2 = 0
X6            6                       6 - 6 = 0          (0)^2 = 0
X7            6                       6 - 6 = 0          (0)^2 = 0
X8            7                       7 - 6 = 1          (1)^2 = 1
X9            7                       7 - 6 = 1          (1)^2 = 1
X10           8                       8 - 6 = 2          (2)^2 = 4
X11           10                      10 - 6 = 4         (4)^2 = 16
Sum           66                      0                  44

Mean age = (sum of X) / n = 66 / 11 = 6 years; n = 11; n - 1 = 10


33/50

Calculation of the Sample Standard Deviation Using the Data in Table 5.6 and the Theoretical Formula:

$$S = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}} = \sqrt{\frac{44}{10}} = \sqrt{4.4} = 2.10 \text{ years}$$
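
The squared-deviation calculation can be followed step by step in Python. This is a minimal sketch using the ages of the 11 chicken pox sufferers; the variable names are illustrative.

```python
import math

# Ages (years) of the 11 children in the sample.
ages = [2, 4, 5, 5, 6, 6, 6, 7, 7, 8, 10]

n = len(ages)
mean_age = sum(ages) / n                     # 66 / 11
deviations = [x - mean_age for x in ages]    # each X minus the mean
sum_sq = sum(d ** 2 for d in deviations)     # sum of squared deviations
variance = sum_sq / (n - 1)                  # divide by n - 1, not n
s = math.sqrt(variance)                      # sample standard deviation

print(mean_age, sum(deviations), sum_sq, round(s, 2))   # 6.0 0.0 44.0 2.1
```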


34/50

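Calculation of the Sample Standard Deviation Using the Computational (Sum of Squares) Formula:

Child's age X (years)   X^2
2                       4
4                       16
5                       25
5                       25
6                       36
6                       36
6                       36
7                       49
7                       49
8                       64
10                      100
Sum: 66                 Sum: 440

Computation Formula

$$S = \sqrt{\frac{\sum X^2 - \frac{(\sum X)^2}{n}}{n - 1}} = \sqrt{\frac{440 - \frac{(66)^2}{11}}{10}} = \sqrt{4.4} = 2.10 \text{ years}$$

where the sum of X is 66 and the sum of X^2 is 440.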
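
The sum-of-squares formula needs only two running totals, which the following sketch computes for the same 11 ages. The cross-check against `statistics.stdev` from the Python standard library is an addition for illustration, not part of the lecture.

```python
import math
import statistics

ages = [2, 4, 5, 5, 6, 6, 6, 7, 7, 8, 10]

n = len(ages)
sum_x = sum(ages)                      # sum of X
sum_x_sq = sum(x * x for x in ages)    # sum of X squared

# Computational formula: sqrt((sum X^2 - (sum X)^2 / n) / (n - 1))
s = math.sqrt((sum_x_sq - sum_x ** 2 / n) / (n - 1))

print(sum_x, sum_x_sq, round(s, 2))    # 66 440 2.1

# The library's sample standard deviation should agree with the hand formula.
agrees = math.isclose(s, statistics.stdev(ages))
```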


35/50

Coefficient of Variation (CV)

What is the Coefficient of Variation?

The coefficient of variation measures variability in relation to the mean (or average) and is used to compare the relative dispersion in one type of data with the relative dispersion in another type of data. The data to be compared may be in the same units, in different units, with the same mean, or with different means.

When is the Coefficient of Variation Useful?

Suppose you want to evaluate the relative dispersion of grades for two classes of students: Class A and Class B. The coefficient of variation can be used to compare these two groups and determine how the grade dispersion in Class A compares to the grade dispersion in Class B. This is one example of how the coefficient of variation can be applied.


36/50

Coefficient of Variation

Relative variation rather than absolute variation such as standard deviation

Definition of C.V.

$$C.V. = \frac{s}{\bar{x}} (100)$$

Useful in comparing variation between two distributions

Used particularly in comparing laboratory measures to identify those determinations with more variation

Also used in QC analyses for comparing ...
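
As an illustration of the definition, the sketch below computes the C.V. for data sets A and B from the range slide earlier in this lecture. Reusing those data sets here is a choice made for this example, and the function name is illustrative. It uses the sample standard deviation from Python's `statistics` module.

```python
import statistics

def coefficient_of_variation(data):
    """C.V. = (s / mean) * 100, with s the sample standard deviation."""
    return statistics.stdev(data) / statistics.mean(data) * 100

data_set_a = [8, 9, 10, 10, 11, 12]
data_set_b = [5, 6, 10, 10, 14, 15]

print(round(coefficient_of_variation(data_set_a), 1))   # 14.1
print(round(coefficient_of_variation(data_set_b), 1))   # 40.5
```

Both sets have a mean of 10, so the larger C.V. for B reflects only its greater spread.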


37/50

Standard Deviation of the Mean (SE)

The standard deviation of the mean (often called the standard error) is a measure of the variation in means of repeated samples. It is defined as the standard deviation divided by the square root of the sample size:

$$SE = \frac{s}{\sqrt{n}}$$

To calculate the standard deviation of the mean, do the following:

1. Calculate the standard deviation (s).

2. Calculate the square root of the sample size (n).

3. Divide the standard deviation by the result of step 2.
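
The three steps can be sketched in Python. The data are the ages of the 11 chicken pox sufferers from the standard deviation example, reused here for illustration.

```python
import math
import statistics

ages = [2, 4, 5, 5, 6, 6, 6, 7, 7, 8, 10]

s = statistics.stdev(ages)       # step 1: sample standard deviation
root_n = math.sqrt(len(ages))    # step 2: square root of the sample size
se = s / root_n                  # step 3: standard error of the mean

print(round(s, 2), round(root_n, 2), round(se, 2))   # 2.1 3.32 0.63
```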


38/50

Percentiles and Quartiles

Definition of Percentiles

Given a set of n observations x1, x2, ..., xn, the pth percentile P is the value of X such that p percent or less of the observations are less than P and (100-p) percent or less are greater than P

P10 indicates the 10th percentile, etc.

Definition of Quartiles

First quartile is P25

Second quartile is median or P50

Third quartile is P75


39/50

Measures of Position

Quartiles, Deciles, Percentiles


40/50

Q1, Q2, Q3

divide ranked scores into four equal parts

(minimum) | 25% | Q1 | 25% | Q2 (median) | 25% | Q3 | 25% | (maximum)




43/50

Finding the Percentile of a Given Score

$$\text{Percentile of score } x = \frac{\text{number of scores less than } x}{\text{total number of scores}} \cdot 100$$
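
A minimal Python sketch of this formula follows. The function name is illustrative, and the example reuses the hepatitis incubation data from the mean slide to find the percentile of a score of 30 days.

```python
def percentile_of_score(scores, x):
    """(number of scores less than x) / (total number of scores) * 100."""
    if not scores:
        raise ValueError("no scores given")
    below = sum(1 for score in scores if score < x)
    return below / len(scores) * 100

incubation_days = [16, 20, 22, 24, 25, 28, 28, 29, 29, 30, 30,
                   30, 31, 31, 32, 33, 35, 36, 38, 40, 42]

# 9 of the 21 values are below 30 days.
print(round(percentile_of_score(incubation_days, 30), 1))   # 42.9
```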


44/50

Inter-quartile Range

Better description of distribution than range

Range of middle 50 percent of the distribution

Definition of Inter-quartile Range: IQR = Q3 - Q1.


45/50

[Figure: two frequency distributions of values, each with an inter-quartile range of 5 to 9 and each divided into lower 25%, middle 50%, and upper 25%. The first spans values 1 to 14 (Range = 14 - 1 = 13); the second spans values 1 to 21 (Range = 21 - 1 = 20).]


46/50

Interquartile Range (or IQR): Q3 - Q1

Semi-interquartile Range: (Q3 - Q1) / 2

Midquartile: (Q1 + Q3) / 2

10 - 90 Percentile Range: P90 - P10


47/50

Percentiles

A "percentile" shows how a single system may be compared to all other systems. Percentiles range from lowest (1) to highest (99), with the median at 50.

The pth percentile (p ranges from 0 to 1) is a value such that roughly 100p% of the data is smaller and 100(1-p)% of the data is larger. Percentiles can be computed for ordinal, interval, or ratio data.

There are three steps for computing a percentile.

1. Sort the data from low to high;

2. Count the number of values (n);

3. Select the p*(n+1)th observation.

If p*(n+1) is not a whole number, then go halfway between the two adjacent observations.

If p*(n+1) < 1, select the smallest observation.

If p*(n+1) > n, select the largest observation.


48/50

Examples

The following data represents cotinine levels in saliva (nmol/l) after smoking. We want to compute the 50th percentile.

73, 58, 67, 93, 33, 18, 147

1. Sorted data: 18, 33, 58, 67, 73, 93, 147

2. There are n=7 observations.

3. Select 0.50*(7+1) = 4th observation.

Therefore, the 50th percentile equals 67. Notice that there are three observations larger than 67 and three observations smaller than 67.

Suppose we want to compute the 20th percentile. Notice that p*(n+1) = 0.20*(7+1) = 1.6. This is not a whole number, so we select halfway between the 1st and 2nd observations, or 25.5. (Some people see the 1.6 and think they have to go six tenths of the way to the second value. You can do this if you like, but I think life is too short to worry about such details.)
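
The p*(n+1) rule, including the "go halfway" shortcut and the two boundary cases, can be sketched in Python as follows. The function name is illustrative, and p is passed as a proportion (0.50 for the 50th percentile).

```python
import math

def percentile(values, p):
    """Percentile by the p*(n+1) rule, going halfway between neighbours when needed."""
    ordered = sorted(values)            # step 1: sort low to high
    n = len(ordered)                    # step 2: count the values
    position = p * (n + 1)              # step 3: which observation to select
    if position < 1:
        return ordered[0]               # below the first observation
    if position > n:
        return ordered[-1]              # beyond the last observation
    lower = math.floor(position)
    if position == lower:               # whole number: take that observation
        return ordered[lower - 1]
    # not a whole number: halfway between the two adjacent observations
    return (ordered[lower - 1] + ordered[lower]) / 2

cotinine = [73, 58, 67, 93, 33, 18, 147]

print(percentile(cotinine, 0.50))   # 67
print(percentile(cotinine, 0.20))   # 25.5
```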


49/50

The five number summary

A five number summary uses percentiles to describe a set of data. The five number summary consists of

MAX - the maximum value

75% - the 75th percentile (3rd quartile)

50% - the 50th percentile (2nd quartile or median)

25% - the 25th percentile (1st quartile)

MIN - the minimum value

The five number summary splits the data into four regions, each of which contains 25% of the data.
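
A five number summary for the cotinine data from the percentile examples might be computed as sketched below with `statistics.quantiles` (Python 3.8 or later). Its default "exclusive" method also works from the p*(n+1) position, but it interpolates linearly rather than going halfway, so on other data it can differ slightly from the rule in this lecture. For these seven values the quartile positions (2, 4, and 6) are whole numbers, so both approaches agree.

```python
import statistics

cotinine = [73, 58, 67, 93, 33, 18, 147]

# Quartile cut points Q1, Q2 (median), Q3.
q1, q2, q3 = statistics.quantiles(cotinine, n=4)

five_number_summary = (min(cotinine), q1, q2, q3, max(cotinine))
iqr = q3 - q1                   # inter-quartile range, Q3 - Q1

print(five_number_summary)      # (18, 33.0, 67.0, 93.0, 147)
print(iqr)                      # 60.0
```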


50/50

Summary

In practice, descriptive statistics play a major role

Always the first 1-2 tables/figures in a paper

Statistician needs to know about each variable before deciding how to analyze to answer research questions

In any analysis, 90% of the effort goes into setting up the data

Descriptive statistics are part of that 90%
