8/3/2019 Lecture 3+ , Descriptive Statistics (Slide)
Descriptive Statistics
Descriptive Statistics
Reducing a complex mass of data to a manageable set of information
Descriptive statistics: the summary and presentation of data to:
  simplify the data
  enable meaningful interpretation
  support decision making
Numerical descriptive measures (few numbers)
Graphical presentations
Descriptive Statistics
  Measures of central tendency
  Measures of dispersion
Data Presentation
  Grouped data
What is the Mean?
The mean is the most common measure of central tendency.
The arithmetic mean (or average) is defined as the sum of all the observed values, divided by the number of observations.
The mean is a good way to describe the center of a group of data if the values have a more or less normal distribution.
It may not describe a group of data well if a few values are far from the rest (the data are "skewed" or there are many "outliers").
When is the Mean Useful?
The mean is useful when the data are approximately normally distributed.
The mean is less useful when the distribution is far from normal, for example when it is strongly skewed or contains outliers.
How is the Mean Calculated?
Let us calculate the mean for the data set below. This sample data set represents the incubation period in days for a group of 21 people who contracted hepatitis. Look at the numbers:
16, 20, 22, 24, 25, 28, 28, 29, 29, 30, 30, 30, 31, 31, 32, 33, 35, 36, 38, 40, 42
Applying the steps described previously, you simply add up all the numbers and then divide by the number of observations:
There are 21 data values (n = 21).
The sum of the 21 data values (X1 + X2 + ... + Xn) listed above is 629.
629 / 21 = 29.95. We can round this up to 30.0 or 30.
The mean for the data set above, then, is 30.
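The same calculation can be sketched in Python. This is only an illustration: the function name `arithmetic_mean` and the variable names are assumptions made here, not part of the lecture.

```python
# Incubation periods (days) for the 21 hepatitis cases listed on the slide.
incubation_days = [16, 20, 22, 24, 25, 28, 28, 29, 29, 30, 30, 30, 31, 31,
                   32, 33, 35, 36, 38, 40, 42]


def arithmetic_mean(values):
    """Sum of the observed values divided by the number of observations."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)


n = len(incubation_days)        # number of observations
total = sum(incubation_days)    # X1 + X2 + ... + Xn
mean_days = arithmetic_mean(incubation_days)

print(n, total, f"{mean_days:.2f}")   # 21 629 29.95
```

Rounding 29.95 to the nearest whole day gives the value of 30 quoted on the slide.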
Median
What is the Median?
The median is a measure of central tendency that is useful in representing data that is skewed. "Skewed" simply means that there are significantly more data points with values below the mean than there are above the mean (or vice versa). With skewed data, the normally centered hump on the frequency distribution curve is offset to the left or right of center. The median is the value that divides the distribution of values into two equal parts.
When is the Median Useful?
The median is useful when you have a non-normal (or skewed) distribution of data. A skewed distribution shows up clearly if you present the information in a graph.
How is the Median Determined?
The median is determined rather than calculated. That is, the median is based on its relationship to other data in the population rather than calculated algebraically. The median of a set of observations is the value that falls in the middle position when the observations are ranked in order from the smallest to the largest. The rules for determining the median are:
Rank the observations from the smallest to the largest.
If the number of observations is odd, the median is the middle number.
If the number of observations is even, the median is the average of the two middle numbers.
An Exercise in Determining the Median
Let us take a sample of 8 people from this community, assuming these people are representative of the general population. Here are the numbers representing net worth for these people:
$2,000, $10,000, $25,000, $32,000, $45,000, $50,000, $80,000, $3,000,000
We have an even number of observations (8).
The two middle values in the ordered list are $32,000 and $45,000.
Therefore, the median is halfway between the values of $32,000 and $45,000.
Calculate the average of these two numbers to determine the median: $38,500.
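The ranking rule for the median can be written as a short Python sketch and applied to the eight net worth values. The function name `median` and the variable names are illustrative assumptions; the mean is printed alongside only to show how strongly the single $3,000,000 value pulls it away from the median.

```python
def median(values):
    """Middle value of the ranked observations (average of the two
    middle values when the number of observations is even)."""
    ordered = sorted(values)          # rank from smallest to largest
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:                    # odd count: the middle number
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # even count


net_worth = [2_000, 10_000, 25_000, 32_000, 45_000, 50_000, 80_000, 3_000_000]

median_worth = median(net_worth)
mean_worth = sum(net_worth) / len(net_worth)

print(median_worth)   # 38500.0
print(mean_worth)     # 405500.0
```

Seven of the eight people have a net worth below the mean of $405,500, which is why the median is the more representative summary for this skewed sample.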
Mode
What is the Mode?
The mode is the value that occurs with the greatest frequency in a set of observations. If no value is repeated within the set of observations, then there is no mode. If two or more values share the same highest frequency, then each of those values is a mode. In a normal or symmetrical distribution of data, the mean, median, and mode have the same values (or very close).
When is the Mode Useful?
The mode is useful if you are trying to focus on the most frequent value for a certain population. Although the mode is seldom used in public health statistics, it could be used to focus attention on the modal (most common) age group of a population in the outbreak of a disease, or establish some other modal characteristic for a population experiencing a disease.
How is the Mode Determined?
The mode of a set of observations is the value that occurs with the greatest frequency.
To determine the mode:
Rank the observations from the smallest to the largest.
Evaluate the ranked data set by counting the number of times each individual value occurs, and
determine which value(s) occur with the greatest frequency.
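The counting procedure can be sketched in Python with `collections.Counter`. The function name `modes` is an assumption made for this illustration; it follows the lecture's convention that a data set with no repeated value has no mode, and it is applied to the hepatitis incubation data from the mean example.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs with the greatest frequency.

    Follows the lecture's convention: if no value is repeated,
    there is no mode and an empty list is returned.
    """
    counts = Counter(values)              # value -> number of occurrences
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:                      # nothing is repeated
        return []
    return sorted(v for v, c in counts.items() if c == highest)


incubation_days = [16, 20, 22, 24, 25, 28, 28, 29, 29, 30, 30, 30, 31, 31,
                   32, 33, 35, 36, 38, 40, 42]

print(modes(incubation_days))    # [30]
print(modes([1, 2, 3]))          # []  (no repeated value, so no mode)
print(modes([1, 1, 2, 2, 3]))    # [1, 2]  (two modes)
```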
Measures of Dispersion
Dispersion of a set of observations is the variety exhibited by the observations
If all values are the same, there is no dispersion
The more the values are spread, the greater the dispersion
Many distributions are well described by a measure of location and a measure of dispersion
Common Measures of Dispersion
Range
Variance
Standard deviation
Coefficient of variation (CV)
Standard Deviation of the Mean (SE)
Percentiles and Quartiles
When are measures of dispersion useful?
If you are evaluating the norm for a particular characteristic, like weight or height, you need to establish the extremes (lowest and highest values) in order to assess what might be outside the norm. For example, there are standards for weight in proportion to height. Some people are very heavy for their height, whereas others are much lighter even though they are of the same height. The extremes of this range can describe how far from the norm a person's weight is when assessed with their height.
Range
What is the Range?
The range is calculated as the difference between the smallest and the largest values in a set of data.
It is heavily influenced by the two most extreme values and ignores the rest of the distribution.
Range
When is the Range Useful?
The range is an adequate measure of variation for a small set of data, like class scores for a test. Think of other measures where range might be useful: salaries for a particular job category, or indoor versus outdoor temperatures.
Range
How is the Range Calculated?
The range is calculated by subtracting the smallest value in the data set from the largest value in the data set:
Range = Largest value - Smallest value
Data set A: 8, 9, 10, 10, 11, 12
Data set B: 5, 6, 10, 10, 14, 15
The range for A: 12 - 8 = 4. The range for B: 15 - 5 = 10.
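As a quick check, the two ranges can be reproduced with a few lines of Python. The function name `value_range` is an illustrative assumption.

```python
def value_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)


data_set_a = [8, 9, 10, 10, 11, 12]
data_set_b = [5, 6, 10, 10, 14, 15]

print(value_range(data_set_a))   # 4
print(value_range(data_set_b))   # 10

# Both data sets have the same mean, so the range is what tells them apart.
print(sum(data_set_a) / len(data_set_a), sum(data_set_b) / len(data_set_b))   # 10.0 10.0
```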
Variance
Variance measures the spread of values around their mean
Definition of sample variance:
$$ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} $$
Degrees of freedom
n - 1 is used because if we know n - 1 deviations, the nth deviation is known
Deviations have to sum to zero
Standard Deviation
Definition of sample standard deviation:
$$ s = \sqrt{s^2} $$
Standard deviation is in the same units as the mean
Variance is in squared units (units^2)
What is the Standard Deviation?
The standard deviation of a data set is based on how much each data value deviates from the mean, and is equal to the square root of the variance. The greater the dispersion of values, the larger the standard deviation. Much of statistical theory is based on the standard deviation and the 'normal' distribution.
When is the Standard Deviation Useful?
It is a useful measure when your data distribution is very close to a normal curve. In this situation, the mean is the best measure of central tendency, and the standard deviation is the best measure of dispersion.
In a normal distribution, if you measure 1 standard deviation to either side of the mean, you will find that about 68.3% of the observations fall into this area; about 95.4% of the observations fall within 2 standard deviations to either side of the mean; and about 99.7% of observations fall within 3 standard deviations of the mean.
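The coverage figures for 1, 2 and 3 standard deviations can be checked numerically. The sketch below relies on the standard identity that, for a normal variable, the probability of falling within k standard deviations of the mean equals erf(k / sqrt(2)); the function name `coverage_within` is an assumption made here.

```python
import math


def coverage_within(k):
    """Probability that a normally distributed value lies within
    k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(k, f"{100 * coverage_within(k):.2f}%")
# 1 68.27%
# 2 95.45%
# 3 99.73%
```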
Calculation of the Sample Standard Deviation Using the Theoretical (Squared Deviation) Method
Squared Deviation from the Mean Age for a Sample of 11 Chicken Pox Sufferers

Observation   Child's Age (X) in Years   Age Minus Mean Age (X - X̄) in Years   (X - X̄)^2
X1             2                          2 - 6 = -4                            (-4)^2 = 16
X2             4                          4 - 6 = -2                            (-2)^2 = 4
X3             5                          5 - 6 = -1                            (-1)^2 = 1
X4             5                          5 - 6 = -1                            (-1)^2 = 1
X5             6                          6 - 6 = 0                             (0)^2 = 0
X6             6                          6 - 6 = 0                             (0)^2 = 0
X7             6                          6 - 6 = 0                             (0)^2 = 0
X8             7                          7 - 6 = 1                             (1)^2 = 1
X9             7                          7 - 6 = 1                             (1)^2 = 1
X10            8                          8 - 6 = 2                             (2)^2 = 4
X11           10                          10 - 6 = 4                            (4)^2 = 16
Total         ΣX = 66 years               Σ(X - X̄) = 0                          Σ(X - X̄)^2 = 44

X̄ = ΣX / n = 66 / 11 = 6 years; n = 11; n - 1 = 10
Calculation of the Sample Standard Deviation Using the Data in Table 5.6 and the Theoretical Formula:
$$ S = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}} = \sqrt{\frac{44}{10}} = \sqrt{4.4} = 2.10 \text{ years} $$
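The squared deviation method can be followed step by step in Python using the 11 ages from the chicken pox sample. This is a sketch for checking the arithmetic; the variable names are assumptions made here.

```python
import math

# Ages (years) of the 11 chicken pox sufferers in the sample.
ages = [2, 4, 5, 5, 6, 6, 6, 7, 7, 8, 10]

n = len(ages)                                   # 11
mean_age = sum(ages) / n                        # 66 / 11 = 6.0

deviations = [x - mean_age for x in ages]       # X minus the mean
squared_deviations = [d ** 2 for d in deviations]

sum_of_deviations = sum(deviations)             # always 0 for deviations from the mean
sum_of_squares = sum(squared_deviations)        # 44.0

variance = sum_of_squares / (n - 1)             # 44 / 10 = 4.4
std_dev = math.sqrt(variance)

print(mean_age, sum_of_deviations, sum_of_squares, variance)   # 6.0 0.0 44.0 4.4
print(f"{std_dev:.2f} years")                                   # 2.10 years
```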
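Calculation of the Sample Standard Deviation Using the Computational (Sum of Squares) Formula:

Child's Age (X) in Years   X^2
 2                           4
 4                          16
 5                          25
 5                          25
 6                          36
 6                          36
 6                          36
 7                          49
 7                          49
 8                          64
10                         100
ΣX = 66                    ΣX^2 = 440

Computation Formula
$$ S = \sqrt{\frac{\sum X^2 - \frac{(\sum X)^2}{n}}{n - 1}} = \sqrt{\frac{440 - \frac{66^2}{11}}{10}} = \sqrt{4.4} = 2.10 \text{ years} $$
where ΣX = 66 and ΣX^2 = 440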
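The sum of squares formula can be checked the same way. The sketch below computes ΣX and ΣX^2 for the 11 ages and compares the result with `statistics.stdev` from the Python standard library, which also uses the n - 1 denominator. Variable names are assumptions made for illustration.

```python
import math
import statistics

ages = [2, 4, 5, 5, 6, 6, 6, 7, 7, 8, 10]

n = len(ages)
sum_x = sum(ages)                        # 66
sum_x_squared = sum(x * x for x in ages) # 440

# S = sqrt( (sum of X^2 - (sum of X)^2 / n) / (n - 1) )
std_dev = math.sqrt((sum_x_squared - sum_x ** 2 / n) / (n - 1))

print(sum_x, sum_x_squared)              # 66 440
print(f"{std_dev:.2f} years")            # 2.10 years

# The library routine gives the same sample standard deviation.
print(math.isclose(std_dev, statistics.stdev(ages)))   # True
```

Both formulas are algebraically equivalent. With floating-point data that has a large mean and a small spread, the computational form can lose precision, so the squared deviation form is usually the safer one to program.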
Coefficient of Variation (CV)
What is the Coefficient of Variation?
The coefficient of variation measures variability in relation to the mean (or average) and is used to compare the relative dispersion in one type of data with the relative dispersion in another type of data. The data to be compared may be in the same units, in different units, with the same mean, or with different means.
When is the Coefficient of Variation Useful?
Suppose you want to evaluate the relative dispersion of grades for two classes of students: Class A and Class B. The coefficient of variation can be used to compare these two groups and determine how the grade dispersion in Class A compares to the grade dispersion in Class B. This is one example of how the coefficient of variation can be applied.
Coefficient of Variation
Relative variation rather than absolute variation such as standard deviation
Definition of C.V.:
$$ \text{C.V.} = \frac{s}{\bar{x}} (100) $$
Useful in comparing variation between two distributions
Used particularly in comparing laboratory measures to identify those determinations with more variation
Also used in QC analyses for comparing
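A minimal Python sketch of the C.V. definition follows. The grades for Class A and Class B are hypothetical numbers invented purely for illustration (the lecture gives no grade data), and the function name `coefficient_of_variation` is likewise an assumption.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("C.V. is undefined when the mean is zero")
    return 100 * statistics.stdev(values) / mean


# Hypothetical grades, made up for this example only.
class_a = [72, 75, 78, 81, 84]
class_b = [50, 60, 70, 80, 90]

cv_a = coefficient_of_variation(class_a)
cv_b = coefficient_of_variation(class_b)

print(f"Class A: {cv_a:.1f}%")
print(f"Class B: {cv_b:.1f}%")
print("Class B is relatively more dispersed" if cv_b > cv_a
      else "Class A is at least as dispersed")
```

Because the C.V. is a ratio, it has no units, which is what allows comparisons between measurements recorded on different scales.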
Standard Deviation of the Mean (SE)
The standard deviation of the mean (often called the standard error) is a measure of the variation in means of repeated samples. It is defined as the standard deviation divided by the square root of the sample size: SE = s / √n. To calculate the standard deviation of the mean, do the following:
1. Calculate the standard deviation (s).
2. Calculate the square root of the sample size (n).
3. Divide the standard deviation by the result of step 2.
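The three steps translate directly into Python. The sketch reuses the 11 chicken pox ages from the standard deviation example; the function name `standard_error` is an assumption made here.

```python
import math
import statistics


def standard_error(values):
    """Standard deviation of the mean: s divided by the square root of n."""
    s = statistics.stdev(values)         # step 1: sample standard deviation
    root_n = math.sqrt(len(values))      # step 2: square root of the sample size
    return s / root_n                    # step 3: divide s by the result of step 2


ages = [2, 4, 5, 5, 6, 6, 6, 7, 7, 8, 10]

se = standard_error(ages)
print(f"{se:.2f} years")   # 0.63 years
```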
Percentiles and Quartiles
Definition of Percentiles
Given a set of n observations x1, x2, ..., xn, the pth percentile P is the value of X such that p percent or less of the observations are less than P and (100 - p) percent or less are greater than P
P10 indicates the 10th percentile, etc.
Definition of Quartiles
First quartile is P25
Second quartile is the median or P50
Third quartile is P75
Measures of Position
Quartiles, Deciles, Percentiles
Quartiles
Q1, Q2, Q3 divide ranked scores into four equal parts

              25%        25%        25%        25%
(minimum)           Q1         Q2         Q3         (maximum)
                            (median)
Finding the Percentile of a Given Score
$$ \text{Percentile of score } x = \frac{\text{number of scores less than } x}{\text{total number of scores}} \times 100 $$
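The formula can be sketched in Python as follows. The function name `percentile_of_score` is an assumption, and the hepatitis incubation data from the mean example is reused as the set of scores.

```python
def percentile_of_score(scores, x):
    """Percentage of scores that are strictly less than x."""
    if not scores:
        raise ValueError("no scores supplied")
    below = sum(1 for s in scores if s < x)   # number of scores less than x
    return 100 * below / len(scores)


incubation_days = [16, 20, 22, 24, 25, 28, 28, 29, 29, 30, 30, 30, 31, 31,
                   32, 33, 35, 36, 38, 40, 42]

# 9 of the 21 incubation periods are shorter than 30 days.
print(round(percentile_of_score(incubation_days, 30), 1))   # 42.9
```

Note that tied scores equal to x are not counted as "less than x" under this definition.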
Inter-quartile Range
Better description of distribution than range
Range of middle 50 percent of the distribution
Definition of Inter-quartile Range: IQR = Q3 - Q1
[Figure: two frequency distributions of values, each divided into lower 25%, middle 50% and upper 25%, and each with an inter-quartile range of 5 to 9. The first covers values 1 to 14 (Range = 14 - 1 = 13); the second covers values 1 to 21 (Range = 21 - 1 = 20).]
Interquartile Range (or IQR): Q3 - Q1
Semi-interquartile Range: (Q3 - Q1) / 2
Mid quartile: (Q1 + Q3) / 2
10 - 90 Percentile Range: P90 - P10
Percentiles
A "percentile" shows how a single system may be compared to all other systems. Percentiles range from lowest (1) to highest (99), with 50 corresponding to the median.
The pth percentile (with p expressed as a fraction from 0 to 1) is a value such that roughly 100p% of the data is smaller and 100(1 - p)% of the data is larger. Percentiles can be computed for ordinal, interval, or ratio data.
There are three steps for computing a percentile:
1. Sort the data from low to high;
2. Count the number of values (n);
3. Select the p*(n+1)th observation.
If p*(n+1) is not a whole number, then go halfway between the two adjacent numbers.
If p*(n+1) < 1, select the smallest observation.
If p*(n+1) > n, select the largest observation.
Examples
The following data represent cotinine levels in saliva (nmol/l) after smoking. We want to compute the 50th percentile.
73, 58, 67, 93, 33, 18, 147
1. Sorted data: 18, 33, 58, 67, 73, 93, 147
2. There are n = 7 observations.
3. Select the 0.50*(7+1) = 4th observation.
Therefore, the 50th percentile equals 67. Notice that there are three observations larger than 67 and three observations smaller than 67.
Suppose we want to compute the 20th percentile. Notice that p*(n+1) = 0.20*(7+1) = 1.6. This is not a whole number, so we select the value halfway between the 1st and 2nd observations, or 25.5. (Some people see the 1.6 and think they have to go six tenths of the way to the second value. You can do this if you like, but I think life is too short to worry about such details.)
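The p*(n+1) rule, including the "go halfway" shortcut and the two boundary cases, can be sketched in Python and checked against the cotinine example. The function name `percentile` is an assumption made here, and p is passed as a fraction between 0 and 1. Statistical packages often interpolate differently, so their results may not match this rule exactly.

```python
import math


def percentile(values, p):
    """Percentile by the lecture's p*(n+1) rule; p is a fraction in [0, 1]."""
    ordered = sorted(values)             # step 1: sort low to high
    n = len(ordered)                     # step 2: count the values
    if n == 0:
        raise ValueError("no data supplied")
    position = p * (n + 1)               # step 3: 1-based position to select
    if position < 1:
        return ordered[0]                # below the first observation
    if position > n:
        return ordered[-1]               # beyond the last observation
    nearest = round(position)
    if math.isclose(position, nearest):  # whole number: take that observation
        return ordered[nearest - 1]
    lower = math.floor(position)         # otherwise go halfway between neighbours
    return (ordered[lower - 1] + ordered[lower]) / 2


cotinine = [73, 58, 67, 93, 33, 18, 147]

print(percentile(cotinine, 0.50))   # 67
print(percentile(cotinine, 0.20))   # 25.5
```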
The five number summary
A five number summary uses percentiles to describe a set of data. The five number summary consists of
MAX - the maximum value
75% - the 75th percentile (3rd quartile)
50% - the 50th percentile (2nd quartile or median)
25% - the 25th percentile (1st quartile)
MIN - the minimum value
The five number summary splits the data into four regions, each of which contains 25% of the data.
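A five number summary for the cotinine data can be sketched in Python. The quartiles here are computed with the p*(n+1) rule described on the percentile slides, which is a choice made for this illustration; other software may use a different quartile convention and report slightly different values. The function names `percentile` and `five_number_summary` are assumptions.

```python
import math


def percentile(values, p):
    """Percentile by the p*(n+1) rule; p is a fraction in [0, 1]."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("no data supplied")
    position = p * (n + 1)
    if position < 1:
        return ordered[0]
    if position > n:
        return ordered[-1]
    nearest = round(position)
    if math.isclose(position, nearest):
        return ordered[nearest - 1]
    lower = math.floor(position)
    return (ordered[lower - 1] + ordered[lower]) / 2


def five_number_summary(values):
    """Minimum, quartiles and maximum of a data set."""
    return {
        "MIN": min(values),
        "25%": percentile(values, 0.25),
        "50%": percentile(values, 0.50),
        "75%": percentile(values, 0.75),
        "MAX": max(values),
    }


cotinine = [73, 58, 67, 93, 33, 18, 147]   # saliva cotinine levels (nmol/l)

summary = five_number_summary(cotinine)
print(summary)   # {'MIN': 18, '25%': 33, '50%': 67, '75%': 93, 'MAX': 147}

# The inter-quartile range follows directly: Q3 - Q1.
iqr = summary["75%"] - summary["25%"]
print(iqr)       # 60
```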
Summary
In practice, descriptive statistics play a major role
  Always the first 1-2 tables/figures in a paper
  A statistician needs to know about each variable before deciding how to analyze the data to answer research questions
In any analysis, 90% of the effort goes into setting up the data
  Descriptive statistics are part of that 90%