Statistics I
Chapter 2: Analysis of univariate data
Numerical summary
Central tendency   Location      Spread                   Form
mean               quartiles     range                    coeff. of asymmetry
median             percentiles   interquartile range      coeff. of kurtosis
mode                             variance
                                 standard deviation
                                 coeff. of variation
Descriptive statistics
X What are they useful for?
X Can we calculate them for all types of variables?
X Which are the most useful in each case?
X How can we use the calculator or Excel?
Measures of central tendency
X The mean
X The median
X The mode
Central tendency: the (arithmetic) mean
The (arithmetic) mean
The mean is the average of all the data:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + \dots + x_n}{n}$$
I It is the most common measure of location
I It is the center of gravity of the data
I It can be calculated only for quantitative variables
The mean: example
For the experience of the 46 professionals of a computer company, what is the mean?
$$\bar{x} = \frac{1 + 1 + 1 + 1 + 1 + 2 + 2 + 2 + 2 + \dots + 17 + 20}{46} = 7.5 \text{ years}$$
How can we calculate it using the absolute frequency table? And using the relative one?
Experience, xi   absolute freq., ni   relative freq., fi
 1               5                    0.109
 2               4                    0.087
 3               4                    0.087
 4               4                    0.087
 5               3                    0.065
 6               4                    0.087
 7               1                    0.022
 8               4                    0.087
10               4                    0.087
11               2                    0.043
12               2                    0.043
13               2                    0.043
14               1                    0.022
15               1                    0.022
16               3                    0.065
17               1                    0.022
20               1                    0.022
Total            46                   1
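As a minimal sketch of the question above, the mean can be computed from the absolute frequencies as $\bar{x} = \sum x_i n_i / n$ and from the relative frequencies as $\bar{x} = \sum x_i f_i$. The variable names (`freq`, `mean_abs`, `mean_rel`) are illustrative and not part of the slides.

```python
# Frequency table for "experience" (years -> number of professionals), from the slide.
freq = {1: 5, 2: 4, 3: 4, 4: 4, 5: 3, 6: 4, 7: 1, 8: 4, 10: 4,
        11: 2, 12: 2, 13: 2, 14: 1, 15: 1, 16: 3, 17: 1, 20: 1}

n = sum(freq.values())                                  # total number of observations

# Mean from absolute frequencies: sum of x_i * n_i, divided by n.
mean_abs = sum(x * ni for x, ni in freq.items()) / n

# Mean from relative frequencies f_i = n_i / n: sum of x_i * f_i.
rel = {x: ni / n for x, ni in freq.items()}
mean_rel = sum(x * fi for x, fi in rel.items())

print(n, mean_abs, round(mean_rel, 6))
```

Both routes give the same value as the raw-data calculation, up to floating-point rounding.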
The mean with grouped data
This is the same formula, but using the center (midpoint) of each interval.
For the salary of the 46 professionals of a computer company, what is the mean?
Note: the mean salary using the raw data equals 17250.413
The mean: properties
X Linearity: if $Y = a + bX$, then $\bar{y} = a + b\bar{x}$
If the 46 professionals' salaries are increased by 2 %, how does the mean salary change?
Afterwards, each salary is reduced by 100 dollars. What is the final mean salary?
X Disadvantage: affected by extreme values (outliers)
Example: X: 3, 1, 5, 4, 2; Y: 3, 1, 50, 4, 2
$$\bar{x} = \frac{3 + 1 + 5 + 4 + 2}{5} = 3 \qquad \bar{y} = \frac{3 + 1 + 50 + 4 + 2}{5} = 12$$
Changing a single observation has multiplied the mean by 4!
When the data are skewed, a robust alternative measure of central tendency is more appropriate.
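The two properties above can be checked numerically. This is a small sketch using the standard-library `statistics.mean`; the coefficients `a = -100` and `b = 1.02` encode the 2 % raise followed by the 100-dollar reduction, and `17250.413` is the raw-data mean salary quoted in the slides.

```python
from statistics import mean

x = [3, 1, 5, 4, 2]
y = [3, 1, 50, 4, 2]        # same data with one extreme value

# Sensitivity to outliers: one changed observation moves the mean from 3 to 12.
print(mean(x), mean(y))

# Linearity: if Y = a + b*X, then mean(Y) = a + b*mean(X).
a, b = -100, 1.02
transformed = [a + b * xi for xi in x]
assert abs(mean(transformed) - (a + b * mean(x))) < 1e-9

# Applied to the mean salary: raise by 2 %, then subtract 100.
final_mean_salary = a + b * 17250.413
print(round(final_mean_salary, 2))
```

Only the mean of the original salaries is needed to answer the two questions; the individual salaries do not have to be recomputed.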
Central tendency: the median
...is the most central datum
1 1 1 3 3 5 5 7 8 8 9
1. Order the data from smallest to largest
2. Include repetitions
3. The median is the physical centre
1 1 1 3 3 5 5 7 8 8 ⇒ $M = \frac{3 + 5}{2} = 4$
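Median
Ordered list from smallest to largest: $x_{(1)}, x_{(2)}, \dots, x_{(n)}$
$$M = \begin{cases} x_{((n+1)/2)} & \text{if } n \text{ is odd} \\[4pt] \dfrac{x_{(n/2)} + x_{(n/2+1)}}{2} & \text{if } n \text{ is even} \end{cases}$$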
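The definition translates directly into code. The sketch below is one possible implementation (the function name `median` is illustrative); note that the 1-based positions in the formula become 0-based list indices.

```python
def median(values):
    """Median of a list of numbers, following the odd/even definition."""
    ordered = sorted(values)          # order from smallest to largest, keeping repetitions
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    mid = n // 2
    if n % 2 == 1:
        # n odd: position (n+1)/2 in 1-based terms is index mid in 0-based terms.
        return ordered[mid]
    # n even: average of positions n/2 and n/2 + 1 (indices mid-1 and mid).
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([1, 1, 1, 3, 3, 5, 5, 7, 8, 8, 9]))   # n = 11, odd
print(median([1, 1, 1, 3, 3, 5, 5, 7, 8, 8]))      # n = 10, even
```

The two calls reproduce the examples from the previous slide.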
The median via the table of frequencies
Experience, xi   ni   fi      Ni   Fi
 1               5    0.109    5   0.109
 2               4    0.087    9   0.196
 3               4    0.087   13   0.283
 4               4    0.087   17   0.370
 5               3    0.065   20   0.435   (< 0.5)
 6               4    0.087   24   0.522   (> 0.5)  ⇒ M = 6
 7               1    0.022   25   0.543
 8               4    0.087   29   0.630
 9               0    0       29   0.630
10               4    0.087   33   0.717
11               2    0.043   35   0.761
12               2    0.043   37   0.804
13               2    0.043   39   0.848
14               1    0.022   40   0.870
15               1    0.022   41   0.891
16               3    0.065   44   0.957
17               1    0.022   45   0.978
18               0    0       45   0.978
19               0    0       45   0.978
20               1    0.022   46   1.000
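A sketch of the table-based procedure: accumulate the absolute frequencies and stop at the first value whose cumulative relative frequency exceeds 0.5. The tie rule (when the cumulative frequency is exactly 0.5, average with the next observed value) is an assumption added here so that the result agrees with the even-n median formula; the function name `median_from_table` is illustrative.

```python
def median_from_table(freq):
    """Median from a {value: absolute frequency} table."""
    n = sum(freq.values())
    values = sorted(x for x, ni in freq.items() if ni > 0)   # skip empty rows
    cumulative = 0
    for i, x in enumerate(values):
        cumulative += freq[x]                # N_i, the cumulative absolute frequency
        if 2 * cumulative > n:               # F_i > 0.5
            return x
        if 2 * cumulative == n:              # F_i == 0.5 exactly (only possible if n is even)
            return (x + values[i + 1]) / 2
    raise ValueError("empty frequency table")

experience = {1: 5, 2: 4, 3: 4, 4: 4, 5: 3, 6: 4, 7: 1, 8: 4, 10: 4,
              11: 2, 12: 2, 13: 2, 14: 1, 15: 1, 16: 3, 17: 1, 20: 1}
print(median_from_table(experience))
```

Comparing `2 * cumulative` with `n` avoids floating-point comparisons against 0.5.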
The median: properties
X Linearity: if $Y = a + bX$, then $M_y = a + bM_x$
If the 46 professionals' salaries are increased by 2 %, how does the median salary change?
Afterwards, each salary is reduced by 100 dollars. What is the final median salary?
X Can we calculate the median with the education level data?
Can we calculate the median with the 0-1 position of responsibility variable?
X Advantage: not affected by outliers
Example: X: 3, 1, 5, 4, 2; Y: 3, 1, 50, 4, 2
$M_x = 3 \qquad M_y = 3$
When the data are skewed, it is a better measure of central tendency than the mean.
The median and the mean for asymmetric data
Annual gross salary in 2014, Encuesta de Estructura Salarial 2014, I.N.E.
"The difference between the mean salary and the median salary arises because very high salaries strongly influence the calculation of the mean, even when they correspond to only a few workers." (INE press release of 28 October 2016)
Central tendency: the mode
...is the most frequent value
The mode of the variable experience in the 46 professionals example is 1 year, with an absolute frequency of 5 employees.
The values 2,3,4,8 and 10 have an absolute frequency of 4 employees.
Central tendency: the mode
Does this definition make sense with the education level data?
Does this definition make sense with the 0-1 position of responsibility variable?
Central tendency: the mode
Does this definition make sense with continuous data? ⇒ modal interval
The mode: properties
X It can be calculated for both qualitative and quantitative variables.
Indeed, of the three measures (mean, median, mode), it is the only one that makes sense for nominal qualitative variables.
X Not affected by outliers
X There can be no mode.
X There can be more than one mode: bimodal–trimodal–plurimodal
What might this indicate?
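A short sketch of how the mode (or modes) can be found by counting, using the standard-library `collections.Counter`. The function name `modes` and the education-level labels are hypothetical, chosen only for illustration. When every value is equally frequent the function returns all of them, which corresponds to the "no mode" case.

```python
from collections import Counter

def modes(values):
    """Return every value that attains the highest absolute frequency."""
    counts = Counter(values)                 # value -> absolute frequency
    highest = max(counts.values())
    return [v for v, c in counts.items() if c == highest]

# Experience data rebuilt from the frequency table in the slides.
freq = {1: 5, 2: 4, 3: 4, 4: 4, 5: 3, 6: 4, 7: 1, 8: 4, 10: 4,
        11: 2, 12: 2, 13: 2, 14: 1, 15: 1, 16: 3, 17: 1, 20: 1}
experience = [x for x, ni in freq.items() for _ in range(ni)]
print(modes(experience))

# Counting also works for qualitative data (hypothetical labels); here the data are bimodal.
print(modes(["primary", "secondary", "secondary", "primary", "university"]))
```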
Location measures
X Quartiles
X Percentiles
Location measures: quartiles and percentiles
X Quartiles split the ranked data into four segments with an equal number of values per segment.
X Percentiles split the ranked data into a hundred segments with an equal number of values per segment.
1. Order the data from smallest to largest
2. Include repetitions
3. Select each quartile (percentile) according to:
I The first quartile $Q_1$ has position $\frac{1}{4}(n + 1)$.
I The second quartile $Q_2$ (= median) has position $\frac{1}{2}(n + 1)$.
I The third quartile $Q_3$ has position $\frac{3}{4}(n + 1)$.
I The k-th percentile $P_k$ has position $k(n + 1)/100$, $k = 1, \dots, 99$.
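The position rule above can be sketched as follows. The slides do not say what to do when the position is not a whole number; linear interpolation between the two neighbouring ordered values is assumed here, and positions outside the range 1 to n are clamped. The function names are illustrative.

```python
def value_at_position(ordered, position):
    """Value at a 1-based, possibly fractional, position in an ordered list."""
    n = len(ordered)
    position = min(max(position, 1), n)      # clamp to the valid range [1, n]
    lower = int(position)                    # whole part of the position
    fraction = position - lower
    if fraction == 0:
        return ordered[lower - 1]
    # Assumption: interpolate linearly between the two neighbouring values.
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])

def percentile(values, k):
    """k-th percentile, located at position k(n + 1)/100 in the ordered data."""
    ordered = sorted(values)
    return value_at_position(ordered, k * (len(ordered) + 1) / 100)

data = [1, 1, 1, 3, 3, 5, 5, 7, 8, 8, 9]     # n = 11, so positions are 3, 6 and 9
q1, q2, q3 = (percentile(data, k) for k in (25, 50, 75))
print(q1, q2, q3)
```

With n = 11 the three quartile positions are whole numbers, so no interpolation is needed, and the second quartile coincides with the median of the same data.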
Quartiles: example
Percentiles: example
Measures of spread
X The range and the interquartile range
X The variance and the standard deviation
X The coefficient of variation
Variation: range and interquartile range (IQR)
I The Range is the simplest measure of variation
$$R = x_{\max} - x_{\min}$$
I Ignores the way the data is distributed
I Sensitive to outliers
Example: Given observations 3, 1, 5, 4, 2, R = 5 − 1 = 4
Example: Given observations 3, 1, 5, 4, 100, R = 100 − 1 = 99
I The interquartile range (IQR) can eliminate some outlier problems.
Eliminate the high and low observations and calculate the range of the middle 50 % of the data:
$$\mathrm{IQR} = \text{3rd quartile} - \text{1st quartile} = Q_3 - Q_1$$
Variation: Interquartile range and boxplot
I Outliers are observations that fall
I below the value of $Q_1 - 1.5 \cdot \mathrm{IQR}$
I above the value of $Q_3 + 1.5 \cdot \mathrm{IQR}$
I For extreme outliers, replace 1.5 by 3 in the above definition
[Figure: boxplot with $x_{\min} = 12$, $Q_1 = 24$, median ($Q_2$) = 31, $Q_3 = 42$, $x_{\max} = 58$. Each of the four segments contains 25 % of the data; IQR = 18.]
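A small sketch of the outlier rule, applied to the quartiles of the boxplot example ($Q_1 = 24$, $Q_3 = 42$). The function name `outlier_fences` and its parameter `k` are illustrative; `k = 1.5` gives the ordinary limits and `k = 3` the limits for extreme outliers.

```python
def outlier_fences(q1, q3, k=1.5):
    """Lower and upper limits beyond which an observation counts as an outlier."""
    iqr = q3 - q1                            # interquartile range
    return q1 - k * iqr, q3 + k * iqr

lower, upper = outlier_fences(24, 42)
extreme_lower, extreme_upper = outlier_fences(24, 42, k=3)
print(lower, upper)
print(extreme_lower, extreme_upper)

# The smallest and largest observations in the example (12 and 58) lie inside the limits.
print(lower <= 12 and 58 <= upper)
```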
Measure of variation: variance
I Average of squared deviations of values from the mean
I Population variance
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
I Sample variance (divided by $n$; the second form is faster to calculate)
$$\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n} = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n}$$
I Sample quasi-variance (corrected sample variance; divided by $n - 1$)
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1}$$
I They are related via
$$\hat{\sigma}^2 = \frac{n - 1}{n}\, s^2$$
I If $a, b$ ($b \neq 0$) are real numbers and $y = a + bx$, then $s_y^2 = b^2 s_x^2$
Measure of variation: standard deviation (SD)
I The most-commonly used measure of spread
I Population standard deviation, sample standard deviation and sample quasi-standard deviation are respectively
$$\sigma = \sqrt{\sigma^2} \qquad \hat{\sigma} = \sqrt{\hat{\sigma}^2} \qquad s = \sqrt{s^2}$$
I Shows variation about the mean
I Has the same units as the original data, whilst the variance is in squared units
I Variance and SD are both affected by outliers
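Calculating variance and standard deviation
Example: X: 11, 12, 13, 16, 16, 17, 18, 21; Y: 14, 15, 15, 15, 16, 16, 16, 17; Z: 11, 11, 11, 12, 19, 20, 20, 20
$$\bar{x} = \frac{124}{8} = 15.5 \qquad \bar{y} = \frac{124}{8} = 15.5 \qquad \bar{z} = \frac{124}{8} = 15.5$$
$$\sum_{i=1}^{n} x_i^2 = 11^2 + 12^2 + \dots + 21^2 = 2000$$
$$\sum_{i=1}^{n} y_i^2 = 14^2 + 15^2 + \dots + 17^2 = 1928$$
$$\sum_{i=1}^{n} z_i^2 = 11^2 + 11^2 + \dots + 20^2 = 2068$$
$$s_x^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1} = \frac{2000 - 8(15.5)^2}{8 - 1} = \frac{78}{7} = 11.1429 \;\Rightarrow\; s_x = 3.3381$$
$$s_y^2 = \frac{1928 - 8(15.5)^2}{8 - 1} = \frac{6}{7} = 0.8571 \;\Rightarrow\; s_y = 0.9258$$
$$s_z^2 = \frac{2068 - 8(15.5)^2}{8 - 1} = \frac{146}{7} = 20.8571 \;\Rightarrow\; s_z = 4.5670$$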
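The calculation above can be reproduced with a few lines of code. This sketch uses the shortcut formula $(\sum x_i^2 - n\bar{x}^2)/(n-1)$ for the quasi-variance and the relation between the divide-by-$n$ and divide-by-$(n-1)$ versions; the function names are illustrative.

```python
from math import sqrt

def quasi_variance(values):
    """Corrected sample variance: (sum of squares - n * mean^2) / (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return (sum(v * v for v in values) - n * mean ** 2) / (n - 1)

def sample_variance(values):
    """Variance divided by n, obtained from the quasi-variance as (n - 1)/n * s^2."""
    n = len(values)
    return (n - 1) / n * quasi_variance(values)

X = [11, 12, 13, 16, 16, 17, 18, 21]
Y = [14, 15, 15, 15, 16, 16, 16, 17]
Z = [11, 11, 11, 12, 19, 20, 20, 20]

for name, data in (("x", X), ("y", Y), ("z", Z)):
    s2 = quasi_variance(data)
    print(name, round(s2, 4), round(sqrt(s2), 4), round(sample_variance(data), 4))
```

All three samples share the same mean, so the differences in the printed values come only from how the observations are spread around it.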
Comparing standard deviations
Example cont.: X: 11, 12, 13, 16, 16, 17, 18, 21; Y: 14, 15, 15, 15, 16, 16, 16, 17; Z: 11, 11, 11, 12, 19, 20, 20, 20
[Figure: dot plots of X, Y and Z on a common axis from 11 to 21. $\bar{x} = 15.5$, $s_x = 3.3$; $\bar{y} = 15.5$, $s_y = 0.9$; $\bar{z} = 15.5$, $s_z = 4.6$.]
Measure of variation: coefficient of variation (CV)
I Measures relative variation and is defined as
$$CV = \frac{s}{|\bar{x}|}$$
I Is a unitless number (sometimes given in %’s)
I Shows variation relative to mean
Example:
Stock A: average price last year = 50, standard deviation = 5
Stock B: average price last year = 100, standard deviation = 5
$$CV_A = \frac{5}{50} = 0.10 \qquad CV_B = \frac{5}{100} = 0.05$$
Both stocks have the same SD, but stock B is less variable relative to its mean price.
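The stock comparison can be written as a one-line function. This is a sketch; the function name is illustrative, and the check for a zero mean is an added safeguard because the coefficient is undefined in that case.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation relative to the absolute value of the mean."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return sd / abs(mean)

cv_a = coefficient_of_variation(5, 50)    # stock A
cv_b = coefficient_of_variation(5, 100)   # stock B
print(cv_a, cv_b, f"{cv_a:.0%}", f"{cv_b:.0%}")
```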
Numerical summaries and frequency tables. Standardization.
I If the data are discrete, then
$$\bar{x} = \frac{\sum_{i=1}^{k} x_i n_i}{n} \qquad \text{and} \qquad s^2 = \frac{\sum_{i=1}^{k} x_i^2 n_i - n\bar{x}^2}{n - 1}$$
I If the data are continuous, we replace $x_i$ in the above definition by the midpoints of the class intervals
I To standardize a variable $x$ means to calculate
$$\frac{x - \bar{x}}{s}$$
I If you apply this formula to all observations $x_1, \dots, x_n$ and call the transformed values $z_1, \dots, z_n$, then the $z$'s have mean zero and standard deviation one
I Standardization = finding the z-score
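A sketch of standardization, checking numerically that the z-scores have mean zero and quasi-standard deviation one. The data are the X sample from the variance example; the function names are illustrative.

```python
from math import sqrt

def quasi_sd(values):
    """Quasi-standard deviation (divides by n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))

def standardize(values):
    """Replace each observation by its z-score (x - mean) / s."""
    mean = sum(values) / len(values)
    s = quasi_sd(values)
    return [(v - mean) / s for v in values]

X = [11, 12, 13, 16, 16, 17, 18, 21]
z = standardize(X)
z_mean = sum(z) / len(z)
z_sd = quasi_sd(z)
print([round(v, 3) for v in z])
print(round(z_mean, 10), round(z_sd, 10))
```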
Measures of form
X Fisher’s coefficient of asymmetry
X Fisher's coefficient of kurtosis
X Empirical rule
Shape: comparing mode, mean and median
Three types of distributions:
I Skewed to the left Mean < Median < Mode
I Symmetric Mean = Median = Mode
I Skewed to the right Mode < Median < Mean
[Figure: three density curves. Left-skewed: $\bar{x} < M$. Symmetric: $\bar{x} = M$. Right-skewed: $M < \bar{x}$.]
Note: The distribution in the middle is known as bell-shaped or normal
Measures of form: Asymmetry
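I Fisher's coefficient of asymmetry:
$$\gamma_1 = \frac{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^3}{S^3}$$
The data are skewed to the right (positive skew) if $\gamma_1 > 0$, and skewed to the left (negative skew) if $\gamma_1 < 0$.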
[Figure: two histograms (vertical axis: Frequency). Right-skewed data: $\gamma_1 = 2.236$. Left-skewed data: $\gamma_1 = -1.401$.]
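A sketch of Fisher's coefficient of asymmetry. The slides do not state which standard deviation $S$ denotes; the divide-by-$n$ version is assumed here, and the function name is illustrative. The two small samples are the ones used earlier to illustrate outliers.

```python
def fisher_skewness(values):
    """Third central moment divided by S^3 (S assumed to divide by n)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # S^2
    m3 = sum((v - mean) ** 3 for v in values) / n   # third central moment
    return m3 / m2 ** 1.5

symmetric = [3, 1, 5, 4, 2]
with_outlier = [3, 1, 50, 4, 2]      # one large value creates a long right tail
mirrored = [-v for v in with_outlier]

print(fisher_skewness(symmetric))
print(fisher_skewness(with_outlier))
print(fisher_skewness(mirrored))
```

Mirroring the data flips the sign of the coefficient, which matches the right-skew and left-skew interpretation.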
Measures of form: kurtosis
I Fisher's coefficient of kurtosis:
$$\gamma_2 = \frac{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^4}{S^4} - 3$$
I For the standard normal, $\gamma_2 = 0$. If $\gamma_2 > 0$, the distribution is leptokurtic (sharper than the standard normal); if $\gamma_2 < 0$, it is platykurtic.
[Figure: two density plots (vertical axis: Density), one of a leptokurtic distribution and one of a platykurtic distribution.]
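A sketch of Fisher's coefficient of kurtosis, again assuming that $S$ is the divide-by-$n$ standard deviation. The two tiny data sets are made up for illustration only: one has all its mass at two points (flat, no peak) and the other is concentrated at zero with two far-away values (sharp peak, heavy tails).

```python
def fisher_kurtosis(values):
    """Fourth central moment divided by S^4, minus 3 (S assumed to divide by n)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # S^2
    m4 = sum((v - mean) ** 4 for v in values) / n   # fourth central moment
    return m4 / m2 ** 2 - 3

flat = [-1, 1, -1, 1]                           # illustrative platykurtic sample
peaked = [0, 0, 0, 0, 0, 0, 0, 0, -3, 3]        # illustrative leptokurtic sample

print(fisher_kurtosis(flat))
print(fisher_kurtosis(peaked))
```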
Empirical rule
If the data are bell-shaped (normal), that is, symmetric and with light tails, the following rule holds:
I 68 % of the data are in $(\bar{x} - 1s, \bar{x} + 1s)$
I 95 % of the data are in $(\bar{x} - 2s, \bar{x} + 2s)$
I 99.7 % of the data are in $(\bar{x} - 3s, \bar{x} + 3s)$
Note: This rule is also known as the 68-95-99.7 rule.
Example: We know that for a sample of 100 observations, the mean is 40 and the quasi-standard deviation is 5. Assuming that the data are bell-shaped, give the limits of an interval that captures 95 % of the observations.
95 % of the $x_i$'s are in: $(\bar{x} \pm 2s) = (40 \pm 2(5)) = (30, 50)$
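The example can be sketched as a small helper that returns the interval for each of the three coverage levels. The dictionary `COVERAGE_TO_K` and the function name are illustrative; the intervals are only meaningful under the bell-shape assumption stated in the rule.

```python
# Number of standard deviations associated with each coverage level of the rule.
COVERAGE_TO_K = {68: 1, 95: 2, 99.7: 3}

def empirical_interval(mean, s, coverage):
    """Interval (mean - k*s, mean + k*s) for coverage 68, 95 or 99.7 (per cent)."""
    k = COVERAGE_TO_K[coverage]
    return mean - k * s, mean + k * s

for coverage in COVERAGE_TO_K:
    print(coverage, empirical_interval(40, 5, coverage))
```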