
Statistics I, Chapter 2: Analysis of univariate data

Numerical summary

Central tendency: mean, median, mode
Location: quartiles, percentiles
Spread: range, interquartile range, variance, standard deviation, coeff. of variation
Form: coeff. of asymmetry, coeff. of kurtosis

Descriptive statistics

- What are they useful for?
- Can we calculate them for all types of variables?
- Which are the most useful in each case?
- How can we compute them with a calculator or Excel?

Measures of central tendency

- The mean
- The median
- The mode

Central tendency: the (arithmetic) mean

The (arithmetic) mean is the average of all the data:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + \dots + x_n}{n}$$

- It is the most common measure of location
- It is the center of gravity of the data
- It can be calculated only for quantitative variables

The mean: example

For the years of experience of the 46 professionals at a computer company, what is the mean?

$$\bar{x} = \frac{1 + 1 + 1 + 1 + 1 + 2 + 2 + 2 + 2 + \dots + 17 + 20}{46} = 7.5 \text{ years}$$

How can we calculate it using the absolute frequency table? And using the relative one?

Experience, x_i   Absolute freq., n_i   Relative freq., f_i
 1                 5                    0.109
 2                 4                    0.087
 3                 4                    0.087
 4                 4                    0.087
 5                 3                    0.065
 6                 4                    0.087
 7                 1                    0.022
 8                 4                    0.087
10                 4                    0.087
11                 2                    0.043
12                 2                    0.043
13                 2                    0.043
14                 1                    0.022
15                 1                    0.022
16                 3                    0.065
17                 1                    0.022
20                 1                    0.022
Total             46                    1
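The two questions above can be checked numerically. The following is a minimal sketch in Python (the language and all variable names are our choice, not part of the slides). The dictionary `freq` copies the absolute frequencies from the table, and the relative frequencies are recomputed as n_i / n rather than taken from the rounded column.

```python
# Experience in years -> absolute frequency n_i, copied from the table above.
freq = {1: 5, 2: 4, 3: 4, 4: 4, 5: 3, 6: 4, 7: 1, 8: 4, 10: 4,
        11: 2, 12: 2, 13: 2, 14: 1, 15: 1, 16: 3, 17: 1, 20: 1}

n = sum(freq.values())  # total number of observations

# Mean from absolute frequencies: sum(x_i * n_i) / n
mean_abs = sum(x * ni for x, ni in freq.items()) / n

# Mean from relative frequencies: sum(x_i * f_i), with f_i = n_i / n
rel = {x: ni / n for x, ni in freq.items()}
mean_rel = sum(x * fi for x, fi in rel.items())

# Mean from the expanded raw data, for comparison
raw = [x for x, ni in freq.items() for _ in range(ni)]
mean_raw = sum(raw) / len(raw)

print(n, mean_abs, round(mean_rel, 6), mean_raw)
```

All three routes should give the same mean. Using the rounded f_i printed in the table instead would introduce a small rounding error.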

The mean with grouped data

This is the same formula, but using the center of each interval. For the salary of the 46 professionals of a computer company, what is the mean?

Note: the mean salary using the raw data equals 17250.413

The mean: properties

- Linearity: if $Y = a + bX$, then $\bar{y} = a + b\bar{x}$

If the salaries of the 46 professionals are increased by 2 %, how does the mean salary change?

If each salary is then reduced by 100 dollars, what is the final mean salary?

- Disadvantage: it is affected by extreme values (outliers)

Example: X: 3, 1, 5, 4, 2 and Y: 3, 1, 50, 4, 2

$$\bar{x} = \frac{3 + 1 + 5 + 4 + 2}{5} = 3 \qquad \bar{y} = \frac{3 + 1 + 50 + 4 + 2}{5} = 12$$

A single extreme value has multiplied the mean by 4. When the data are skewed, a robust alternative measure of central tendency is more appropriate.
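Both properties of the mean can be illustrated with a short Python sketch. It assumes the 2 % rise followed by the 100-dollar reduction corresponds to b = 1.02 and a = -100. The figure 17250.413 is the raw-data mean salary quoted earlier, and the variable names are illustrative.

```python
from statistics import mean

x = [3, 1, 5, 4, 2]
y = [3, 1, 50, 4, 2]   # same data with 5 replaced by the outlier 50

mean_x = mean(x)
mean_y = mean(y)       # pulled upwards by the single extreme value

# Linearity: applying a + b*x to every observation transforms the mean the same way.
a, b = -100, 1.02      # assumed reading of the exercise: +2 %, then -100
transformed = [a + b * xi for xi in x]
mean_transformed = mean(transformed)

# Applied to the mean salary from the raw data quoted in the slides.
raw_mean_salary = 17250.413
final_mean_salary = a + b * raw_mean_salary

print(mean_x, mean_y, mean_transformed, a + b * mean_x, final_mean_salary)
```

Because of linearity, the individual salaries are not needed to answer the exercise: the old mean is enough.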

Central tendency: the median

...is the most central datum

1 1 1 3 3 5 5 7 8 8 9

1. Order the data from smallest to largest

2. Include repetitions

3. The median is the physical centre

1 1 1 3 3 5 5 7 8 8 ⇒ $M = \frac{3 + 5}{2} = 4$

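Median

Ordered list from smallest to largest: $x_{(1)}, x_{(2)}, \dots, x_{(n)}$

$$M = \begin{cases} x_{((n+1)/2)} & \text{if } n \text{ is odd} \\ \dfrac{x_{(n/2)} + x_{(n/2+1)}}{2} & \text{if } n \text{ is even} \end{cases}$$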
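The piecewise definition translates directly into code. The sketch below is one possible Python rendering (the function name `median_by_definition` is ours). The only subtlety is that the formula uses 1-based positions while Python lists are 0-based.

```python
def median_by_definition(data):
    """Median following the odd/even rule on the ordered list x_(1), ..., x_(n)."""
    ordered = sorted(data)          # step 1: order, keeping repetitions
    n = len(ordered)
    if n % 2 == 1:
        # n odd: the value at 1-based position (n + 1) / 2
        return ordered[(n + 1) // 2 - 1]
    # n even: average of the values at 1-based positions n/2 and n/2 + 1
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


odd_sample = [1, 1, 1, 3, 3, 5, 5, 7, 8, 8, 9]   # n = 11
even_sample = [1, 1, 1, 3, 3, 5, 5, 7, 8, 8]     # n = 10

print(median_by_definition(odd_sample), median_by_definition(even_sample))
```

The two samples are the ones used on the previous slide, so the second call should reproduce M = 4.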

The median via the frequency table

Experience, x_i   n_i   f_i     N_i   F_i
 1                5     0.109    5    0.109
 2                4     0.087    9    0.196
 3                4     0.087   13    0.283
 4                4     0.087   17    0.370
 5                3     0.065   20    0.435   (< 0.5)
 6                4     0.087   24    0.522   (> 0.5)  ⇒ M = 6
 7                1     0.022   25    0.543
 8                4     0.087   29    0.630
 9                0     0       29    0.630
10                4     0.087   33    0.717
11                2     0.043   35    0.761
12                2     0.043   37    0.804
13                2     0.043   39    0.848
14                1     0.022   40    0.870
15                1     0.022   41    0.891
16                3     0.065   44    0.957
17                1     0.022   45    0.978
18                0     0       45    0.978
19                0     0       45    0.978
20                1     0.022   46    1.000
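The table lookup can be sketched as follows: accumulate the absolute frequencies and take the first value whose cumulative relative frequency exceeds 0.5. This is a hedged Python sketch. The tie rule for F_i exactly equal to 0.5 (average with the next observed value) is our reading of the even-n definition, and the function name is illustrative.

```python
from itertools import accumulate

# Values with zero frequency (9, 18, 19) are left out; the tie rule below relies on that.
freq = {1: 5, 2: 4, 3: 4, 4: 4, 5: 3, 6: 4, 7: 1, 8: 4, 10: 4,
        11: 2, 12: 2, 13: 2, 14: 1, 15: 1, 16: 3, 17: 1, 20: 1}

values = sorted(freq)
n = sum(freq.values())
cumulative = list(accumulate(freq[v] for v in values))   # the N_i column


def median_from_table(values, cumulative, n):
    for i, (value, big_n) in enumerate(zip(values, cumulative)):
        if 2 * big_n > n:       # F_i > 0.5, compared in integers to avoid rounding
            return value
        if 2 * big_n == n:      # F_i == 0.5: the two central data straddle two values
            return (value + values[i + 1]) / 2


print(median_from_table(values, cumulative, n))
```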

The median: properties

- Linearity: if $Y = a + bX$, then $M_y = a + bM_x$

If the salaries of the 46 professionals are increased by 2 %, how does the median salary change?

If each salary is then reduced by 100 dollars, what is the final median salary?

- Can we calculate the median with the education level data?

Can we calculate the median with the 0-1 position-of-responsibility variable?

- Advantage: not affected by outliers

Example: X: 3, 1, 5, 4, 2 and Y: 3, 1, 50, 4, 2

$M_x = 3 \qquad M_y = 3$

When the data are skewed, the median is a better measure of central tendency than the mean.

The median and the mean for asymmetric data

Annual gross salary in 2014, Encuesta de Estructura Salarial 2014, I.N.E.

"The difference between the mean salary and the median salary arises because the mean is strongly influenced by very high salaries, even when they correspond to only a few workers." (INE press release of 28 October 2016)

Central tendency: the mode

...is the most frequent value

The mode of the variable experience in the 46 professionals example is 1 year, with an absolute frequency of 5 employees.

The values 2, 3, 4, 6, 8 and 10 each have an absolute frequency of 4 employees.
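Finding the mode amounts to counting occurrences and keeping the value or values with the highest count. A minimal Python sketch using `collections.Counter` follows. The frequencies are those of the experience table, while the education labels at the end are hypothetical and only show that counting also works for qualitative data.

```python
from collections import Counter

freq = {1: 5, 2: 4, 3: 4, 4: 4, 5: 3, 6: 4, 7: 1, 8: 4, 10: 4,
        11: 2, 12: 2, 13: 2, 14: 1, 15: 1, 16: 3, 17: 1, 20: 1}
raw = [x for x, ni in freq.items() for _ in range(ni)]

counts = Counter(raw)
top = max(counts.values())                                    # highest absolute frequency
modes = sorted(v for v, c in counts.items() if c == top)      # may hold several values
freq_four = sorted(v for v, c in counts.items() if c == 4)    # values observed 4 times

# Hypothetical nominal data: the mode is still well defined.
education = ["secondary", "university", "secondary", "primary"]
education_mode = Counter(education).most_common(1)[0][0]

print(modes, top, freq_four, education_mode)
```

Returning a list of modes rather than a single value keeps the bimodal and plurimodal cases visible.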

Central tendency: the mode

Does this definition make sense with the education level data?

Does this definition make sense with the 0-1 position-of-responsibility variable?

Central tendency: the mode

Does this definition make sense with continuous data? ⇒ modal interval

The mode: properties

- It can be calculated for both qualitative and quantitative variables. Indeed, among the mean, median and mode, it is the only descriptive measure that makes sense for nominal qualitative variables.
- Not affected by outliers
- There may be no mode.
- There can be more than one mode: bimodal, trimodal, plurimodal

What can this indicate?

Location measures

- Quartiles
- Percentiles

Location measures: quartiles and percentiles

- Quartiles split the ranked data into four segments with an equal number of values per segment.
- Percentiles split the ranked data into a hundred segments with an equal number of values per segment.

1. Order the data from smallest to largest
2. Include repetitions
3. Select each quartile (percentile) according to:
   - The first quartile $Q_1$ has position $\frac{1}{4}(n + 1)$.
   - The second quartile $Q_2$ (= median) has position $\frac{1}{2}(n + 1)$.
   - The third quartile $Q_3$ has position $\frac{3}{4}(n + 1)$.
   - The k-th percentile $P_k$ has position $k(n + 1)/100$, $k = 1, \dots, 99$.
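The position rule can be turned into a small Python routine. The slides do not say what to do when the position is not a whole number, so the sketch below assumes linear interpolation between the two neighbouring ordered values. That choice, and the function names, are ours.

```python
def value_at_position(sorted_data, pos):
    """Value at 1-based position `pos` in an ordered list.

    Assumption (not stated in the slides): a fractional position is resolved
    by linear interpolation between the two neighbouring ordered values.
    """
    n = len(sorted_data)
    lower = int(pos)            # integer part of the position
    fraction = pos - lower      # fractional part
    if lower < 1:               # position before the first observation
        return sorted_data[0]
    if lower >= n:              # position at or beyond the last observation
        return sorted_data[-1]
    left, right = sorted_data[lower - 1], sorted_data[lower]
    return left + fraction * (right - left)


def percentile(data, k):
    """k-th percentile using position k(n + 1)/100."""
    ordered = sorted(data)
    return value_at_position(ordered, k * (len(ordered) + 1) / 100)


# Experience data of the 46 professionals, expanded from the frequency table.
freq = {1: 5, 2: 4, 3: 4, 4: 4, 5: 3, 6: 4, 7: 1, 8: 4, 10: 4,
        11: 2, 12: 2, 13: 2, 14: 1, 15: 1, 16: 3, 17: 1, 20: 1}
raw = [x for x, ni in freq.items() for _ in range(ni)]

q1, q2, q3 = (percentile(raw, k) for k in (25, 50, 75))
print(q1, q2, q3)
```

Other software may use a different position rule or interpolation scheme, so small discrepancies with a calculator or spreadsheet are to be expected.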

Quartiles: example

Percentiles: example

Measures of spread

- The range and the interquartile range
- The variance and the standard deviation
- The coefficient of variation

Variation: range and interquartile range (IQR)

- The range is the simplest measure of variation:

$$R = x_{\max} - x_{\min}$$

- It ignores the way the data are distributed
- It is sensitive to outliers

Example: given observations 3, 1, 5, 4, 2, R = 5 − 1 = 4
Example: given observations 3, 1, 5, 4, 100, R = 100 − 1 = 99

- The interquartile range (IQR) can eliminate some outlier problems: discard the high and low observations and calculate the range of the middle 50 % of the data

$$\text{IQR} = \text{3rd quartile} - \text{1st quartile} = Q_3 - Q_1$$

Variation: Interquartile range and boxplot

- Outliers are observations that fall
   - below the value of $Q_1 - 1.5 \cdot \text{IQR}$
   - above the value of $Q_3 + 1.5 \cdot \text{IQR}$
- For extreme outliers, replace 1.5 by 3 in the above definition

[Figure: boxplot with x_min = 12, Q1 = 24, Q2 (median) = 31, Q3 = 42, x_max = 58; each of the four segments contains 25 % of the data; IQR = 42 − 24 = 18]
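The outlier rule can be sketched in a few lines of Python, here applied to the quartiles shown in the boxplot (Q1 = 24, Q3 = 42). The function name `outlier_fences` and the parameter `k` are illustrative.

```python
def outlier_fences(q1, q3, k=1.5):
    """Lower and upper limits beyond which an observation counts as an outlier."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


q1, q3 = 24, 42
mild_low, mild_high = outlier_fences(q1, q3)               # k = 1.5
extreme_low, extreme_high = outlier_fences(q1, q3, k=3)    # k = 3 for extreme outliers

# The smallest and largest observations in the boxplot lie inside the fences.
x_min, x_max = 12, 58
has_outliers = x_min < mild_low or x_max > mild_high

print(mild_low, mild_high, extreme_low, extreme_high, has_outliers)
```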

Measure of variation: variance

- Average of squared deviations of values from the mean
- Population variance

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

- Sample variance (divided by $n$; the second form is faster to calculate)

$$\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n} = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n}$$

- Sample quasi-variance, or corrected sample variance (divided by $n - 1$)

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1}$$

- They are related via

$$\hat{\sigma}^2 = \frac{n - 1}{n} s^2$$

- If $a, b$ ($b \neq 0$) are real numbers and $y = a + bx$, then $s_y^2 = b^2 s_x^2$
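The relations between these formulas can be checked numerically. The sketch below is a plain-Python illustration on the small sample 3, 1, 5, 4, 2 used earlier. The function names and the constants a = 10, b = -2 are arbitrary choices for the example.

```python
def sample_variance(data):
    """Variance divided by n."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / n


def quasi_variance(data):
    """Corrected sample variance, divided by n - 1."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)


def quasi_variance_shortcut(data):
    """Same quantity via the faster form (sum of squares minus n * mean^2)."""
    n = len(data)
    mean = sum(data) / n
    return (sum(x * x for x in data) - n * mean ** 2) / (n - 1)


x = [3, 1, 5, 4, 2]
n = len(x)
var_n = sample_variance(x)
s2 = quasi_variance(x)
s2_short = quasi_variance_shortcut(x)

# Linear transformation y = a + b*x: the shift a drops out, the scale enters squared.
a, b = 10, -2
y = [a + b * xi for xi in x]
s2_y = quasi_variance(y)

print(var_n, s2, s2_short, (n - 1) / n * s2, s2_y, b ** 2 * s2)
```

The shortcut form is convenient by hand, but in floating-point arithmetic it can lose precision when the mean is large relative to the spread, which is why the deviation form is usually preferred in code.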

Measure of variation: standard deviation (SD)

- The most commonly used measure of spread
- Population standard deviation, sample standard deviation and sample quasi-standard deviation are, respectively,

$$\sigma = \sqrt{\sigma^2} \qquad \hat{\sigma} = \sqrt{\hat{\sigma}^2} \qquad s = \sqrt{s^2}$$

- Shows variation about the mean
- Has the same units as the original data, whilst the variance is in squared units
- Variance and SD are both affected by outliers

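Calculating variance and standard deviation

Example: X: 11, 12, 13, 16, 16, 17, 18, 21; Y: 14, 15, 15, 15, 16, 16, 16, 17; Z: 11, 11, 11, 12, 19, 20, 20, 20

$$\bar{x} = \frac{124}{8} = 15.5 \qquad \bar{y} = \frac{124}{8} = 15.5 \qquad \bar{z} = \frac{124}{8} = 15.5$$

$$\sum_{i=1}^{n} x_i^2 = 11^2 + 12^2 + \dots + 21^2 = 2000$$

$$\sum_{i=1}^{n} y_i^2 = 14^2 + 15^2 + \dots + 17^2 = 1928$$

$$\sum_{i=1}^{n} z_i^2 = 11^2 + 11^2 + \dots + 20^2 = 2068$$

$$s_x^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1} = \frac{2000 - 8(15.5)^2}{8 - 1} = \frac{78}{7} = 11.1429 \;\Rightarrow\; s_x = 3.3381$$

$$s_y^2 = \frac{1928 - 8(15.5)^2}{8 - 1} = \frac{6}{7} = 0.8571 \;\Rightarrow\; s_y = 0.9258$$

$$s_z^2 = \frac{2068 - 8(15.5)^2}{8 - 1} = \frac{146}{7} = 20.8571 \;\Rightarrow\; s_z = 4.5670$$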
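The hand calculation can be cross-checked with Python's standard `statistics` module, whose `variance` and `stdev` functions divide by n − 1 and therefore correspond to the quasi-variance and quasi-standard deviation. This is only a verification sketch; the dictionary layout is our choice.

```python
from statistics import mean, stdev, variance

samples = {
    "X": [11, 12, 13, 16, 16, 17, 18, 21],
    "Y": [14, 15, 15, 15, 16, 16, 16, 17],
    "Z": [11, 11, 11, 12, 19, 20, 20, 20],
}

# For each sample: (mean, quasi-variance s^2, quasi-standard deviation s)
summary = {name: (mean(d), variance(d), stdev(d)) for name, d in samples.items()}

for name, (m, s2, s) in summary.items():
    print(f"{name}: mean = {m}, s^2 = {s2:.4f}, s = {s:.4f}")
```

All three samples share the same mean, so the differences in s reflect only how tightly the values cluster around 15.5.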

Comparing standard deviations

Example cont.: X: 11, 12, 13, 16, 16, 17, 18, 21; Y: 14, 15, 15, 15, 16, 16, 16, 17; Z: 11, 11, 11, 12, 19, 20, 20, 20

[Figure: dot plots of X, Y and Z, each on an axis from 11 to 21]

$\bar{x} = 15.5$, $s_x = 3.3$
$\bar{y} = 15.5$, $s_y = 0.9$
$\bar{z} = 15.5$, $s_z = 4.6$

Measure of variation: coefficient of variation (CV)

- Measures relative variation and is defined as

$$CV = \frac{s}{|\bar{x}|}$$

- Is a unitless number (sometimes given as a percentage)
- Shows variation relative to the mean

Example: Stock A: average price last year = 50, standard deviation = 5. Stock B: average price last year = 100, standard deviation = 5.

$$CV_A = \frac{5}{50} = 0.10 \qquad CV_B = \frac{5}{100} = 0.05$$

Both stocks have the same SD, but stock B is less variable relative to its mean price.
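The stock comparison can be written as a one-line function. The sketch below is illustrative Python. The guard against a zero mean is our addition, since the ratio is undefined in that case.

```python
def coefficient_of_variation(s, mean):
    """CV = s / |mean|; undefined when the mean is zero."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return s / abs(mean)


cv_a = coefficient_of_variation(5, 50)    # stock A
cv_b = coefficient_of_variation(5, 100)   # stock B

print(f"CV_A = {cv_a:.0%}, CV_B = {cv_b:.0%}")
```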

Numerical summaries and frequency tables. Standardization.

- If the data are discrete, then

$$\bar{x} = \frac{\sum_{i=1}^{k} x_i n_i}{n} \qquad \text{and} \qquad s^2 = \frac{\sum_{i=1}^{k} x_i^2 n_i - n\bar{x}^2}{n - 1}$$

- If the data are continuous, we replace $x_i$ in the above definition by the mid-points of the class intervals
- To standardize variable $x$ means to calculate

$$\frac{x - \bar{x}}{s}$$

- If you apply this formula to all observations $x_1, \dots, x_n$ and call the transformed values $z_1, \dots, z_n$, then the $z$'s have mean zero and standard deviation one
- Standardization = finding the z-score
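Both ideas on this slide can be sketched in Python: the quasi-variance computed straight from a frequency table, and the z-scores of a sample. The frequency table is the experience data from earlier in the chapter, the sample X comes from the variance example, and the function names are illustrative.

```python
from statistics import mean, stdev, variance

freq = {1: 5, 2: 4, 3: 4, 4: 4, 5: 3, 6: 4, 7: 1, 8: 4, 10: 4,
        11: 2, 12: 2, 13: 2, 14: 1, 15: 1, 16: 3, 17: 1, 20: 1}


def table_mean_and_quasi_variance(freq):
    """Mean and s^2 from values x_i with absolute frequencies n_i."""
    n = sum(freq.values())
    x_bar = sum(x * ni for x, ni in freq.items()) / n
    s2 = (sum(x * x * ni for x, ni in freq.items()) - n * x_bar ** 2) / (n - 1)
    return x_bar, s2


def standardize(data):
    """z-scores using the quasi-standard deviation s (divisor n - 1)."""
    m, s = mean(data), stdev(data)
    return [(x - m) / s for x in data]


x_bar, s2 = table_mean_and_quasi_variance(freq)

# Cross-check against the expanded raw data.
raw = [x for x, ni in freq.items() for _ in range(ni)]
s2_raw = variance(raw)

z = standardize([11, 12, 13, 16, 16, 17, 18, 21])
print(x_bar, round(s2, 4), round(mean(z), 10), round(stdev(z), 10))
```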

Measures of form

- Fisher's coefficient of asymmetry
- Fisher's coefficient of kurtosis
- Empirical rule

Shape: comparing mode, mean and median

Three types of distributions:

- Skewed to the left: Mean < Median < Mode
- Symmetric: Mean = Median = Mode
- Skewed to the right: Mode < Median < Mean

[Figure: three density curves. Left-skewed: $\bar{x} < M$. Symmetric: $\bar{x} = M$. Right-skewed: $M < \bar{x}$]

Note: The distribution in the middle is known as bell-shaped or normal

Measures of form: Asymmetry

- Fisher's coefficient of asymmetry:

$$\gamma_1 = \frac{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^3}{S^3}$$

The data are skewed to the right (positive) if $\gamma_1 > 0$, and to the left (negative) if $\gamma_1 < 0$.
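A hedged Python sketch of the coefficient follows. The slide does not define S, so the code assumes it is the standard deviation computed with divisor n. Using the quasi-standard deviation instead would rescale the value but not change its sign. The function name is ours.

```python
from math import sqrt


def fisher_asymmetry(data):
    """gamma_1 = (mean cubed deviation) / S^3, with S assumed to use divisor n."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n   # second central moment
    m3 = sum((x - mean) ** 3 for x in data) / n   # third central moment
    S = sqrt(m2)                                  # assumption: divisor n
    return m3 / S ** 3


symmetric = [3, 1, 5, 4, 2]
right_tail = [3, 1, 50, 4, 2]              # one large value stretches the right tail
left_tail = [-v for v in right_tail]       # mirror image

print(fisher_asymmetry(symmetric), fisher_asymmetry(right_tail), fisher_asymmetry(left_tail))
```

Mirroring the data flips the sign of the coefficient, which matches the right-skewed versus left-skewed reading.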

[Figure: two histograms with frequency on the vertical axis. Right skewness: $\gamma_1 = 2.236$. Left skewness: $\gamma_1 = -1.401$]

Measures of form: kurtosis

- Fisher's coefficient of kurtosis:

$$\gamma_2 = \frac{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^4}{S^4} - 3$$

- For the standard normal, $\gamma_2 = 0$. The distribution is leptokurtic (sharper than the standard normal) if $\gamma_2 > 0$, and platykurtic if $\gamma_2 < 0$
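The kurtosis coefficient can be sketched in the same way. As before, S is not defined on the slide, so the code assumes the standard deviation with divisor n. Both samples are made up for illustration: one is flat, the other has most values at the centre and two far-out observations.

```python
def fisher_kurtosis(data):
    """gamma_2 = (mean fourth-power deviation) / S^4 - 3, with S^2 assumed to use divisor n."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n   # S^2 under the divisor-n assumption
    m4 = sum((x - mean) ** 4 for x in data) / n   # fourth central moment
    return m4 / m2 ** 2 - 3


flat = [1, 2, 3, 4, 5]                            # evenly spread values
peaked = [-10, 0, 0, 0, 0, 0, 0, 0, 0, 10]        # sharp centre, heavy tails

gamma2_flat = fisher_kurtosis(flat)
gamma2_peaked = fisher_kurtosis(peaked)

print(gamma2_flat, gamma2_peaked)
```

Under these assumptions the flat sample comes out platykurtic (negative coefficient) and the peaked one leptokurtic (positive coefficient).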

[Figure: two density plots. Leptokurtic distribution and platykurtic distribution]

Empirical rule

If the data are bell-shaped (normal), that is, symmetric and with light tails, the following rule holds:

- 68 % of the data are in $(\bar{x} - 1s, \bar{x} + 1s)$
- 95 % of the data are in $(\bar{x} - 2s, \bar{x} + 2s)$
- 99.7 % of the data are in $(\bar{x} - 3s, \bar{x} + 3s)$

Note: This rule is also known as the 68-95-99.7 rule.

Example: We know that for a sample of 100 observations, the mean is 40 and the quasi-standard deviation is 5. Assuming that the data are bell-shaped, give the limits of an interval that captures 95 % of the observations.

95 % of the $x_i$'s are in $(\bar{x} \pm 2s) = (40 \pm 2(5)) = (30, 50)$
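The example can be reproduced in Python, and the three percentages checked against the normal distribution using `statistics.NormalDist` from the standard library (Python 3.8 or later). This is only a sketch with illustrative function names. The percentages in the rule are rounded values of the exact normal probabilities.

```python
from statistics import NormalDist


def empirical_interval(mean, s, k):
    """Interval (mean - k*s, mean + k*s)."""
    return mean - k * s, mean + k * s


def normal_coverage(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    z = NormalDist()                 # standard normal: mean 0, standard deviation 1
    return z.cdf(k) - z.cdf(-k)


interval_95 = empirical_interval(40, 5, 2)
coverage = {k: normal_coverage(k) for k in (1, 2, 3)}

print(interval_95, {k: round(p, 4) for k, p in coverage.items()})
```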
