dorothy-underwood
Numerical Measures
• Measures of Central Tendency (Location)
• Measures of Non Central Location
• Measure of Variability (Dispersion, Spread)
• Measures of Shape
Measures of Central Tendency (Location)
• Mean
• Median
• Mode
[Figure: distribution curve on an axis from 0 to 25, illustrating central location]
Measures of Non-central Location
• Quartiles, Mid-Hinges
• Percentiles
[Figure: distribution curve on an axis from 0 to 25, illustrating non-central location]
Measure of Variability (Dispersion, Spread)
• Variance, standard deviation
• Range
• Inter-Quartile Range
[Figure: distribution curve on an axis from 0 to 25, illustrating variability]
Measures of Shape
• Skewness
• Kurtosis
[Figures: example distribution curves of differing shape]
Summation Notation
Let x1, x2, x3, … xn denote a set of n numbers.
Then the symbol
$$\sum_{i=1}^{n} x_i$$
denotes the sum of these n numbers
x1 + x2 + x3 + … + xn
Example
Let x1, x2, x3, x4, x5 denote the set of 5 numbers in the following table.
i 1 2 3 4 5
xi 10 15 21 7 13
Then the symbol
$$\sum_{i=1}^{5} x_i$$
denotes the sum of these 5 numbers
x1 + x2 + x3 + x4 + x5
= 10 + 15 + 21 + 7 + 13
= 66
Meaning of parts of summation notation
$$\sum_{i=m}^{n} (\text{expression in } i)$$
• i: the quantity changing in each term of the sum
• m: the starting value for i
• n: the final value for i
• expression in i: each term of the sum
Example
Again let x1, x2, x3, x4, x5 denote the set of 5 numbers in the following table.
i 1 2 3 4 5
xi 10 15 21 7 13
Then the symbol
$$\sum_{i=2}^{4} x_i^3$$
denotes the sum of these 3 numbers
$$x_2^3 + x_3^3 + x_4^3$$
= 15^3 + 21^3 + 7^3
= 3375 + 9261 + 343
= 12979
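The two summation examples can be checked with a small sketch. The helper name `sigma` and its `term` parameter are illustrative assumptions, not part of the lecture; the indices are 1-based to match the notation.

```python
def sigma(values, m, n, term=lambda x: x):
    """Sum term(x_i) for i = m, ..., n, with 1-based indices as in the notation."""
    # values[i - 1] holds x_i because Python lists are 0-based.
    return sum(term(values[i - 1]) for i in range(m, n + 1))


x = [10, 15, 21, 7, 13]

total = sigma(x, 1, 5)                          # x1 + x2 + x3 + x4 + x5
cubes = sigma(x, 2, 4, term=lambda v: v ** 3)   # x2^3 + x3^3 + x4^3

print(total, cubes)
```

The starting value, final value, and the expression being summed map directly onto the arguments `m`, `n`, and `term`.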
Measures of Central Location (Mean)
Mean
Let x1, x2, x3, … xn denote a set of n numbers.
Then the mean of the n numbers is defined as:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + x_3 + \dots + x_{n-1} + x_n}{n}$$
Example
Again let x1, x2, x3, x4, x5 denote the set of 5 numbers in the following table.
i 1 2 3 4 5
xi 10 15 21 7 13
Then the mean of the 5 numbers is:
$$\bar{x} = \frac{\sum_{i=1}^{5} x_i}{5} = \frac{x_1 + x_2 + x_3 + x_4 + x_5}{5} = \frac{10 + 15 + 21 + 7 + 13}{5} = \frac{66}{5} = 13.2$$
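As a quick check of this arithmetic, here is a minimal sketch of the mean as "sum divided by count". The function name `mean_of` is an illustrative choice, and raising an error on an empty list is an assumption the lecture does not address.

```python
def mean_of(values):
    """Arithmetic mean: the sum of the numbers divided by how many there are."""
    if not values:
        # Assumption: the mean of an empty set is treated as undefined.
        raise ValueError("mean of an empty set is undefined")
    return sum(values) / len(values)


x = [10, 15, 21, 7, 13]
x_bar = mean_of(x)
print(x_bar)
```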
Interpretation of the Mean
Let x1, x2, x3, … xn denote a set of n numbers.
Then the mean, $\bar{x}$, is the centre of gravity of the n numbers.
That is, if we drew a horizontal line and placed a weight of one at each value of xi, then the balancing point of that system of masses is at the point $\bar{x}$.
[Figure: unit weights at x1, x2, x3, x4, …, xn on a horizontal line, balancing at $\bar{x}$]
In the example, the weights at 7, 10, 13, 15 and 21 balance at $\bar{x}$ = 13.2.
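One way to see why the mean is the balancing point is that the deviations from $\bar{x}$ cancel. This short derivation is not on the slides; it uses only the definition of the mean, and then checks it on the five example values.

```latex
\sum_{i=1}^{n} (x_i - \bar{x})
  = \sum_{i=1}^{n} x_i - n\bar{x}
  = n\bar{x} - n\bar{x}
  = 0

% Example with 7, 10, 13, 15, 21 and \bar{x} = 13.2:
(7 - 13.2) + (10 - 13.2) + (13 - 13.2) + (15 - 13.2) + (21 - 13.2)
  = -6.2 - 3.2 - 0.2 + 1.8 + 7.8
  = 0
```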
The mean, $\bar{x}$, is also approximately the centre of gravity of a histogram.
[Figure: histogram with frequencies from 0 to 30 over the class intervals 60 - 70, 70 - 80, …, 140 - 150, with $\bar{x}$ marked at the balancing point]
Measures of Central Location (Median)
The Median
Let x1, x2, x3, … xn denote a set of n numbers.
Then the median of the n numbers is defined as the number that splits the numbers into two equal parts.
To evaluate the median we arrange the numbers in increasing order.
If the number of observations is odd there will be one observation in the middle.
This number is the median.
If the number of observations is even there will be two middle observations.
The median is the average of these two observations
Example
Again let x1, x2, x3, x4, x5 denote the set of 5 numbers in the following table.
i 1 2 3 4 5
xi 10 15 21 7 13
The numbers arranged in order are:
7 10 13 15 21
There is a unique “middle” observation, 13, and this is the median.
Example 2
Let x1, x2, x3, x4, x5, x6 denote the 6 numbers:
23 41 12 19 64 8
Arranged in increasing order these observations would be:
8 12 19 23 41 64
Two “middle” observations: 19 and 23
Median = average of the two “middle” observations
$$= \frac{19 + 23}{2} = \frac{42}{2} = 21$$
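The odd and even cases can be sketched as follows. The function name `median_of` is illustrative; the logic follows the rule stated above (sort, then take the middle observation or the average of the two middle observations).

```python
def median_of(values):
    """Median: middle observation if n is odd, average of the two middle ones if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        # Assumption: the median of an empty set is treated as undefined.
        raise ValueError("median of an empty set is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                      # single middle observation
    return (ordered[mid - 1] + ordered[mid]) / 2  # average of the two middle observations


odd_example = median_of([10, 15, 21, 7, 13])
even_example = median_of([23, 41, 12, 19, 64, 8])
print(odd_example, even_example)
```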
Example
The data on N = 23 students
Variables
• Verbal IQ
• Math IQ
• Initial Reading Achievement Score
• Final Reading Achievement Score
Data Set #3
The following table gives data on Verbal IQ, Math IQ, Initial Reading Achievement Score, and Final Reading Achievement Score for 23 students who have recently completed a reading improvement program.

Student   Verbal IQ   Math IQ   Initial Reading   Final Reading
                                Achievement       Achievement
1         86          94        1.1               1.7
2         104         103       1.5               1.7
3         86          92        1.5               1.9
4         105         100       2.0               2.0
5         118         115       1.9               3.5
6         96          102       1.4               2.4
7         90          87        1.5               1.8
8         95          100       1.4               2.0
9         105         96        1.7               1.7
10        84          80        1.6               1.7
11        94          87        1.6               1.7
12        119         116       1.7               3.1
13        82          91        1.2               1.8
14        80          93        1.0               1.7
15        109         124       1.8               2.5
16        111         119       1.4               3.0
17        89          94        1.6               1.8
18        99          117       1.6               2.6
19        94          93        1.4               1.4
20        99          110       1.4               2.0
21        95          97        1.5               1.3
22        102         104       1.7               3.1
23        102         93        1.6               1.9
Total     2244        2307      35.1              48.3

          Verbal IQ   Math IQ   Initial Reading   Final Reading
                                Achievement       Achievement
Means     97.57       100.30    1.526             2.100
Computing the Median
Stem leaf Diagrams
Median = middle observation =12th observation
Summary
          Verbal IQ   Math IQ   Initial Reading   Final Reading
                                Achievement       Achievement
Means     97.57       100.30    1.526             2.100
Median    96          97        1.5               1.9
Some Comments
• The mean is the centre of gravity (the balancing point) of a set of observations.
• The median splits the observations into two parts of approximately 50% each
• The median splits the area under a histogram in two parts of 50%
• The mean is the balancing point of a histogram
[Figure: skewed distribution with the median splitting the area into two parts of 50%, and the mean $\bar{x}$ marked separately]
• For symmetric distributions the mean and the median will be approximately the same value
[Figure: symmetric distribution in which the median and $\bar{x}$ coincide, with 50% of the area on each side]
[Figure: skewed distribution with the median and $\bar{x}$ marked]
• For Positively skewed distributions the mean exceeds the median
• For Negatively skewed distributions the median exceeds the mean
• An outlier is a “wild” observation in the data
• Outliers occur because
– of errors (typographical and computational)
– of extreme cases in the population
• The mean is altered to a significant degree by the presence of outliers
• Outliers have little effect on the value of the median
• This is a reason for using the median in place of the mean as a measure of central location
• Alternatively the mean is the best measure of central location when the data is Normally distributed (Bell-shaped)
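The sensitivity of the mean to outliers can be illustrated with the five example numbers. In this sketch the outlier is a made-up typographical error (21 entered as 210), chosen only for illustration; `mean` and `median` come from Python's standard `statistics` module.

```python
from statistics import mean, median

data = [10, 15, 21, 7, 13]

# Hypothetical typo, not from the lecture data: 21 entered as 210.
with_outlier = [10, 15, 210, 7, 13]

print("clean:       ", mean(data), median(data))
print("with outlier:", mean(with_outlier), median(with_outlier))
```

The mean moves from 13.2 to 51, while the median stays at 13 in both cases.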
Review
Summarizing Data
Graphical Methods
Stem-Leaf Diagram
 8 | 0 2 4 6 6 9
 9 | 0 4 4 5 5 6 9 9
10 | 2 2 4 5 5 9
11 | 1 8 9
12 |

Histogram
[Figure: histogram with frequencies from 0 to 8 over the class intervals 70 to 80, 80 to 90, …, 120 to 130]

Grouped Freq Table
Class        Verbal IQ   Math IQ
70 to 80     1           1
80 to 90     6           2
90 to 100    7           11
100 to 110   6           4
110 to 120   3           4
120 to 130   0           1
Numerical Measures
• Measures of Central Tendency (Location)
• Measures of Non Central Location
• Measure of Variability (Dispersion, Spread)
• Measures of Shape
The objective is to reduce the data to a small number of values that completely describe the data and certain aspects of the data.
Measures of Central Location (Mean)
Mean
Let x1, x2, x3, … xn denote a set of n numbers.
Then the mean of the n numbers is defined as:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + x_3 + \dots + x_{n-1} + x_n}{n}$$
Interpretation of the Mean
Let x1, x2, x3, … xn denote a set of n numbers.
Then the mean, $\bar{x}$, is the centre of gravity of the n numbers.
That is, if we drew a horizontal line and placed a weight of one at each value of xi, then the balancing point of that system of masses is at the point $\bar{x}$.
[Figure: unit weights at x1, x2, x3, x4, …, xn on a horizontal line, balancing at $\bar{x}$]
The mean, $\bar{x}$, is also approximately the centre of gravity of a histogram.
[Figure: histogram with frequencies from 0 to 30 over the class intervals 60 - 70, 70 - 80, …, 140 - 150, with $\bar{x}$ marked at the balancing point]
The Median
Let x1, x2, x3, … xn denote a set of n numbers.
Then the median of the n numbers is defined as the number that splits the numbers into two equal parts.
To evaluate the median we arrange the numbers in increasing order.
If the number of observations is odd there will be one observation in the middle.
This number is the median.
If the number of observations is even there will be two middle observations.
The median is the average of these two observations
Measures of Non-Central Location
• Percentiles
• Quartiles (Hinges, Mid-hinges)
Definition
The P×100 percentile is a point, $x_P$, underneath a distribution that has a fixed proportion P of the population (or sample) below that value.
[Figure: distribution with the area P×100% lying to the left of $x_P$]
Definition (Quartiles)
The first quartile, Q1, is the 25th percentile, $x_{0.25}$.
[Figure: distribution with 25% of the area to the left of $x_{0.25}$]
The second quartile, Q2, is the 50th percentile, $x_{0.50}$.
[Figure: distribution with 50% of the area to the left of $x_{0.50}$]
• The second quartile, Q2, is also the median.
The third quartile, Q3, is the 75th percentile, $x_{0.75}$.
[Figure: distribution with 75% of the area to the left of $x_{0.75}$]
The quartiles Q1, Q2, Q3 divide the population into 4 equal parts of 25%.
[Figure: distribution split at Q1, Q2 and Q3 into four areas of 25% each]
Computing Percentiles and Quartiles
• There are several methods used to compute percentiles and quartiles. Different computer packages will use different methods
• Sometimes for small samples these methods will agree (but not always)
• For large samples the methods will agree within a certain level of accuracy
Computing Percentiles and Quartiles – Method 1
• The first step is to order the observations in increasing order.
• We then compute the position, k, of the P×100 Percentile.
k = P × (n+1)
Where n = the number of observations
Example
The data on n = 23 students
Variables
• Verbal IQ
• Math IQ
• Initial Reading Achievement Score
• Final Reading Achievement Score
We want to compute the 75th percentile and the 90th percentile.
The position, k, of the 75th Percentile.
k = P × (n+1) = .75 × (23+1) = 18
The position, k, of the 90th Percentile.
k = P × (n+1) = .90 × (23+1) = 21.6
When the position k is an integer the percentile is the kth observation (in order of magnitude) in the data set.
For example the 75th percentile is the 18th (in size) observation
When the position k is not an integer, write it as an integer m plus a fraction f,
i.e. k = m + f.
Then the percentile is
xP = (1 - f) × (mth observation in size)
+ f × ((m+1)st observation in size)
In the example the position of the 90th percentile is:
k = 21.6
Then
x.90 = 0.4(21st observation in size)
+ 0.6(22nd observation in size)
Rearranging the interpolation formula shows where $x_P$ sits between the two observations:
$$x_P = (1-f)\,(m\text{th obs}) + f\,[(m+1)\text{st obs}] = (m\text{th obs}) + f\,[(m+1)\text{st obs} - (m\text{th obs})]$$
so that
$$f = \frac{x_P - (m\text{th obs})}{(m+1)\text{st obs} - (m\text{th obs})}$$
[Figure: number line with $x_P$ lying between the mth obs and the (m+1)st obs]
Thus the position of xp is 100f% through the interval between the mth observation and the (m +1)st observation
Example
The data Verbal IQ on n = 23 students arranged in increasing order is:
80 82 84 86 86 89 90 94
94 95 95 96 99 99 102 102
104 105 105 109 111 118 119
x0.75 = 75th percentile = 18th observation in size =105
(position k = 18)
x0.90 = 90th percentile
= 0.4(21st observation in size)
+ 0.6(22nd observation in size)
= 0.4(111)+ 0.6(118) = 115.2
(position k = 21.6)
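Method 1 can be sketched in a few lines. The function name `percentile_method1` is illustrative, and rejecting positions outside 1..n is an assumption, since the slides do not say what to do in that case.

```python
def percentile_method1(values, p):
    """P x 100 percentile using position k = P * (n + 1) with linear interpolation."""
    ordered = sorted(values)
    n = len(ordered)
    k = p * (n + 1)          # position of the percentile (1-based)
    if not 1 <= k <= n:
        # Assumption: positions outside the data are treated as undefined.
        raise ValueError("position k falls outside the data")
    m = int(k)               # integer part of the position
    f = k - m                # fractional part of the position
    if f == 0:
        return ordered[m - 1]                        # k is an integer: the kth observation
    # (1 - f) * (mth obs) + f * ((m+1)st obs); lists are 0-based, hence the index shift
    return (1 - f) * ordered[m - 1] + f * ordered[m]


verbal_iq = [80, 82, 84, 86, 86, 89, 90, 94, 94, 95, 95, 96,
             99, 99, 102, 102, 104, 105, 105, 109, 111, 118, 119]

p75 = percentile_method1(verbal_iq, 0.75)
p90 = percentile_method1(verbal_iq, 0.90)
print(p75, round(p90, 1))
```

Because k is computed in floating point, the 90th percentile is compared after rounding rather than for exact equality.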
An Alternative method for computing Quartiles – Method 2
• Sometimes this method will result in the same values for the quartiles.
• Sometimes this method will result in different values for the quartiles.
• For large samples the two methods will result in approximately the same answer.
Let x1, x2, x3, … xn denote a set of n numbers.
The first step in Method 2 is to arrange the numbers in increasing order.
From the arranged numbers we compute the median.
This is also called the Hinge
Example
Consider the 5 numbers:
10 15 21 7 13
Arranged in increasing order:
7 10 13 15 21
The median (or Hinge) splits the observations in half
Median (Hinge): 13
The lower mid-hinge (the first quartile) is the “median” of the lower half of the observations (excluding the median).
The upper mid-hinge (the third quartile) is the “median” of the upper half of the observations (excluding the median).
Consider the five numbers in increasing order:
7 10 13 15 21
Median (Hinge): 13
Lower Half: 7 10
Upper Half: 15 21
Lower Mid-Hinge (First Quartile): (7+10)/2 = 8.5
Upper Mid-Hinge (Third Quartile): (15+21)/2 = 18
Computing the median and the quartiles using the first method:
Position of the median: k = 0.5(5+1) = 3
Position of the first Quartile: k = 0.25(5+1) = 1.5
Position of the third Quartile: k = 0.75(5+1) = 4.5
7 10 13 15 21
Q1 = 8.5, Q2 = 13, Q3 = 18
• Both methods result in the same value
• This is not always true.
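Method 2 can be sketched as below. The function name `quartiles_method2` is illustrative. For an odd number of observations the median is excluded from both halves, as described above; for an even number the data are simply split into two equal halves, which is an assumption because the slides only show the odd case.

```python
from statistics import median


def quartiles_method2(values):
    """Return (Q1, Q2, Q3) as lower mid-hinge, hinge (median), upper mid-hinge."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]                    # observations below the median position
    if n % 2 == 1:
        upper = ordered[half + 1:]            # odd n: skip the median itself
    else:
        upper = ordered[half:]                # even n (assumption): split into equal halves
    return median(lower), median(ordered), median(upper)


five = quartiles_method2([10, 15, 21, 7, 13])
print(five)
```

Applied to the 23 Verbal IQ values, the same function gives Q1 = 89, Q2 = 96 and Q3 = 105.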
Example
The data Verbal IQ on n = 23 students arranged in increasing order is:
80 82 84 86 86 89 90 94 94 95 95 96 99 99 102 102 104 105 105 109 111 118 119
Median (Hinge): 96
Lower Mid-Hinge (First Quartile): 89
Upper Mid-Hinge (Third Quartile): 105
Computing the median and the quartiles using the first method:
Position of the median: k = 0.5(23+1) = 12
Position of the first Quartile: k = 0.25(23+1) = 6
Position of the third Quartile: k = 0.75(23+1) = 18
Q1 = 89, Q2 = 96, Q3 = 105
• Many programs compute percentiles, quartiles etc.
• Each may use different methods.
• It is important to know which method is being used.
• The different methods result in answers that are close when the sample size is large.
Announcement
Assignment 2 has been posted.
This assignment has to be handed in and is due Friday, January 22.
This assignment requires the use of a Statistical Package (SPSS or Minitab) available in most computer labs.
Instructions on the use of these packages will be given in the lab today
Box Plots (Box-Whisker Plots)
• A graphical method of displaying data
• An alternative to the histogram and stem-leaf diagram
To Draw a Box Plot
• Compute the Hinge (Median, Q2) and the Mid-hinges (first & third quartiles – Q1 and Q3 )
• We also compute the largest and smallest of the observations – the max and the min
• The five number summary
min, Q1, Q2, Q3, max
Example
The data Verbal IQ on n = 23 students arranged in increasing order is:
80 82 84 86 86 89 90 94 94 95 95 96 99 99 102 102 104 105 105 109 111 118 119
min = 80, Q1 = 89, Q2 = 96, Q3 = 105, max = 119
The Box Plot is then drawn
• Drawing above an axis a “box” from Q1 to Q3.
• Drawing vertical line in the box at the median, Q2
• Drawing whiskers at the lower and upper ends of the box going down to the min and up to max.
[Figure: box plot showing the lower whisker from min to Q1, the box from Q1 to Q3 with a line at Q2, and the upper whisker from Q3 to max]
Example
For the Verbal IQ data on n = 23 students:
min = 80
Q1 = 89
Q2 = 96
Q3 = 105
max = 119
This is sometimes called the five-number summary
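[Figure: Box Plot of Verbal IQ, drawn horizontally on an axis from 70 to 130]
A box plot can also be drawn vertically.
[Figure: Box Plot of Verbal IQ, drawn vertically on an axis from 70 to 130]
[Figure: Box-Whisker plots (Verbal IQ, Math IQ)]
[Figure: Box-Whisker plots (Initial RA, Final RA)]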
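Summary Information contained in the box plot
[Figure: box plot annotated to show 25% of the population in each whisker and in each half of the box, so that the box contains the middle 50% of the population]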
Advanced Box Plots
• An outlier is a “wild” observation in the data
• Outliers occur because
– of errors (typographical and computational)
– of extreme cases in the population
• We will now consider the drawing of box-plots where outliers are identified
To Draw a Box Plot we need to:
• Compute the Hinge (Median, Q2) and the Mid-hinges (first & third quartiles – Q1 and Q3 )
• The difference Q3– Q1 is called the inter-quartile range (denoted by IQR)
• To identify outliers we will compute the inner and outer fences
The fences are like the fences at a prison. We expect the entire population to be within both sets of fences.
If a member of the population is between the inner and outer fences it is a mild outlier.
If a member of the population is outside of the outer fences it is an extreme outlier.
Inner fences
Lower inner fence
f1 = Q1 - (1.5)IQR
Upper inner fence
f2 = Q3 + (1.5)IQR
Outer fences
Lower outer fence
F1 = Q1 - (3)IQR
Upper outer fence
F2 = Q3 + (3)IQR
• Observations that are between the lower and upper inner fences are considered to be non-outliers.
• Observations that are outside the inner fences but not outside the outer fences are considered to be mild outliers.
• Observations that are outside outer fences are considered to be extreme outliers.
• mild outliers are plotted individually in a box-plot using the symbol
• extreme outliers are plotted individually in a box-plot using the symbol
• non-outliers are represented with the box and whiskers with
– Max = largest observation within the fences
– Min = smallest observation within the fences
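The fence rules can be sketched as a small classifier. The function name `classify_observation` is illustrative, and treating a value that lies exactly on a fence as inside it is an assumption. The test values 140 and 160 are hypothetical and are not in the Verbal IQ data; they only show each category.

```python
def classify_observation(x, q1, q3):
    """Classify x as a non-outlier, mild outlier, or extreme outlier using the fences."""
    iqr = q3 - q1
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # f1, f2
    outer_low, outer_high = q1 - 3 * iqr, q3 + 3 * iqr       # F1, F2
    if x < outer_low or x > outer_high:
        return "extreme outlier"      # outside the outer fences
    if x < inner_low or x > inner_high:
        return "mild outlier"         # between the inner and outer fences
    return "non-outlier"              # within the inner fences


# Verbal IQ quartiles from the earlier example: Q1 = 89, Q3 = 105, so IQR = 16,
# inner fences 65 and 129, outer fences 41 and 153.
q1, q3 = 89, 105
for value in (119, 140, 160):         # 140 and 160 are hypothetical values
    print(value, classify_observation(value, q1, q3))
```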
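[Figure: box-whisker plot representing the data that are not outliers, with the inner and outer fences marked, mild outliers plotted between the inner and outer fences, and an extreme outlier plotted beyond the outer fence]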
Example
Data collected on n = 109 countries in 1995.
Data collected on k = 25 variables.
The variables
1. Population Size (in 1000s)
2. Density = Number of people/Sq kilometer
3. Urban = percentage of population living in cities
4. Religion
5. lifeexpf = Average female life expectancy
6. lifeexpm = Average male life expectancy
7. literacy = % of population who read
8. pop_inc = % increase in popn size (1995)
9. babymort = Infant mortality (deaths per 1000)
10. gdp_cap = Gross domestic product/capita
11. Region = Region or economic group
12. calories = Daily calorie intake.
13. aids = Number of aids cases
14. birth_rt = Birth rate per 1000 people
15. death_rt = death rate per 1000 people
16. aids_rt = Number of aids cases/100000 people
17. log_gdp = log10(gdp_cap)
18. log_aidsr = log10(aids_rt)
19. b_to_d =birth to death ratio
20. fertility = average number of children in family
21. log_pop = log10(population)
22. cropgrow = ??
23. lit_male = % of males who can read
24. lit_fema = % of females who can read
25. Climate = predominant climate
The data file as it appears in SPSS
Consider the data on infant mortality
Stem-Leaf diagram (stem = 10s, leaf = unit digit)
 0 | 4455555666666666777778888899
 1 | 0122223467799
 2 | 0001123555577788
 3 | 45567999
 4 | 135679
 5 | 011222347
 6 | 03678
 7 | 4556679
 8 | 5
 9 | 4
10 | 1569
11 | 0022378
12 | 46
13 | 7
14 |
15 |
16 | 8
median = Q2 = 27
Quartiles
Lower quartile = Q1 = the median of lower half
Upper quartile = Q3 = the median of upper half
Summary Statistics
$$Q_1 = \frac{12 + 12}{2} = 12, \qquad Q_3 = \frac{66 + 67}{2} = 66.5$$
Interquartile range (IQR)
IQR = Q3 - Q1 = 66.5 – 12 = 54.5
The Inner Fences
lower = Q1 – 1.5(IQR) = 12 – 1.5(54.5) = -69.75
upper = Q3 + 1.5(IQR) = 66.5 + 1.5(54.5) = 148.25
Only one observation (168, Afghanistan) is outside of the inner fences (a mild outlier)
The Outer Fences
lower = Q1 – 3(IQR) = 12 – 3(54.5) = -151.5
upper = Q3 + 3(IQR) = 66.5 + 3(54.5) = 230.0
No observations are outside of the outer fences
[Figure: Box-Whisker Plot of Infant Mortality, on an axis from 0 to 200]
Example 2
In this example we are looking at the weight gains (grams) for rats under six diets differing in level of protein (High or Low) and source of protein (Beef, Cereal, or Pork).
– Ten test animals for each diet
Table: Gains in weight (grams) for rats under six diets differing in level of protein (High or Low) and source of protein (Beef, Cereal, or Pork)
Level High Protein Low protein
Source Beef Cereal Pork Beef Cereal Pork
Diet 1 2 3 4 5 6
73 98 94 90 107 49
102 74 79 76 95 82
118 56 96 90 97 73
104 111 98 64 80 86
81 95 102 86 98 81
107 88 102 51 74 97
100 82 108 72 74 106
87 77 91 90 67 70
117 86 120 95 89 61
111 92 105 78 58 82
Median 103.0 87.0 100.0 82.0 84.5 81.5
Mean 100.0 85.9 99.5 79.2 83.9 78.7
IQR 24.0 18.0 11.0 18.0 23.0 16.0
PSD 17.78 13.33 8.15 13.33 17.04 11.05
Variance 229.11 225.66 119.17 192.84 246.77 273.79
Std. Dev. 15.14 15.02 10.92 13.89 15.71 16.55
[Figure: Box Plots: Weight Gains for Six Diets. Vertical axis: Weight Gain (40 to 130). Horizontal axis: Diet (1 to 6), labelled by level of protein (High, Low) and source of protein (Beef, Cereal, Pork). Legend: median; box from 25% to 75%; whiskers to the non-outlier min and non-outlier max.]
Conclusions
• Weight gain is higher for the high protein meat diets
• Increasing the level of protein increases weight gain, but only if the source of protein is a meat source
Next topic: Numerical Measures of Variability