DESCRIPTIVE STATISTICS


1. Measures of location

i. Measures of central tendency

ii. Measures of position

2. Measures of dispersion (variation)

DESCRIPTIVE STATISTICS

Descriptive statistics are summary measures that describe some important characteristics of data.


Measures of Location

i. Measures of central tendency

Measures of central tendency are numerical values that tend to locate, in some sense, the middle of a set of data.

• Arithmetic Mean

• Median

• Mode

• Geometric Mean

• Harmonic Mean

• Proportion


Measures of central tendency

Arithmetic Mean (Mean)

The arithmetic mean is the most common measure of central tendency and is commonly used for symmetrical distributions. It is used to summarize quantitative data.

It is the sum of all the observations ($\sum_{i=1}^{n} x_i$) divided by the number of observations ($n$):

$$\bar{X} = \frac{\sum_{i=1}^{n} x_i}{n}$$


Example

The age distribution of seven children attending a children's clinic is given below:

{1, 3, 6, 7, 2, 3, 5}

$$\bar{X} = \frac{\sum_{i=1}^{7} x_i}{7} = \frac{1+3+6+7+2+3+5}{7} = \frac{27}{7} \approx 3.9 \text{ years}$$
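The calculation can be checked with a few lines of Python. This is a minimal sketch; the function name `arithmetic_mean` is an illustrative choice, not something taken from the slides.

```python
def arithmetic_mean(values):
    """Sum of the observations divided by the number of observations."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)


# Ages (in years) of the seven children from the example.
ages = [1, 3, 6, 7, 2, 3, 5]
mean_age = arithmetic_mean(ages)  # 27 / 7
print(round(mean_age, 1))
```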


Median

The median is the middle value of the set of data when the data are ranked in order according to magnitude.

When the data are put in order, 50% of the observations are less than or equal to the median, and the rest are greater than or equal to the median.

The median is the $\left(\frac{n+1}{2}\right)$th observation.


Example

n is odd: 5, 28, 8, 10, 9

Ordered data: 5, 8, 9, 10, 28

i = (5+1)/2 = 3

Median is the 3rd value, which is 9.

n is even: 19, 20, 17, 27, 6, 21

Ordered data: 6, 17, 19, 20, 21, 27

i = (6+1)/2 = 3.5

Median is halfway between the 3rd and 4th values, which is (19+20)/2 = 19.5.
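The (n+1)/2 position rule translates directly into code. The sketch below is one possible implementation; the helper name `median` and the use of 1-based positions are illustrative choices.

```python
def median(values):
    """Median via the (n + 1) / 2 position rule on the ordered data."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    position = (n + 1) / 2              # 1-based, may end in .5
    lower = ordered[int(position) - 1]  # observation at the whole-number part
    if position == int(position):       # n is odd: a single middle value
        return lower
    upper = ordered[int(position)]      # n is even: next observation up
    return (lower + upper) / 2          # halfway between the two


print(median([5, 28, 8, 10, 9]))        # odd n
print(median([19, 20, 17, 27, 6, 21]))  # even n
```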


Mode

The mode is the value of x that occurs most frequently.

• Data {1, 3, 7, 3, 2, 3, 6, 7}. Mode: 3

• Data {1, 3, 7, 3, 2, 3, 6, 7, 1, 1}. Mode: 1 and 3

• Data {1, 3, 7, 0, 2, -3, 6, 5, -1}. Mode: no mode
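The three cases above (one mode, two modes, no mode) can be reproduced with a frequency count. In this sketch the function name `modes` is hypothetical, and returning an empty list when every value occurs only once is an assumption made to match the third example.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s), or [] if no value repeats."""
    if not values:
        raise ValueError("the mode of an empty data set is undefined")
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # every value occurs exactly once: no mode
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([1, 3, 7, 3, 2, 3, 6, 7]))
print(modes([1, 3, 7, 3, 2, 3, 6, 7, 1, 1]))
print(modes([1, 3, 7, 0, 2, -3, 6, 5, -1]))
```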


Example

Suppose the ages in years of the first 10 subjects enrolled in your study are: 34, 24, 56, 52, 21, 44, 64, 44, 42, 46

Then the mean age of this group is 42.7 years.

To find the median, first order the data: 21, 24, 34, 42, 44, 44, 46, 52, 56, 64

The median is (44+44)/2 = 44 years.

The mode is 44 years.


Suppose the next patient enrolls and her age is 97 years.

How do the mean, median and mode change?

Ordered data: 21, 24, 34, 42, 44, 44, 46, 52, 56, 64, 97

Mean is 47.6 (was 42.7)

Median is 44 (was 44)

Mode is 44 (was 44)


Comparison of Mean and Median

• Mean is sensitive to "outliers" (a few very large or small values), so sometimes the mean does not reflect the quantity desired. For the data 20, 21, 22, 23, 24, 25, 26, 90 the mean is $\bar{x} = 31.38$, which is larger than 87.5% of the observations (7 of 8).

• Median is "resistant" to outliers. For the same data, Median = 23.5.

• Mean is attractive mathematically.

• 50% of the sample is above the median, 50% of the sample is below the median.
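A short check of this comparison, using Python's standard `statistics` module on the eight values from the slide. The variable names are illustrative.

```python
import statistics

data = [20, 21, 22, 23, 24, 25, 26, 90]  # 90 is the outlier

mean_value = statistics.mean(data)
median_value = statistics.median(data)

# Share of observations that fall below the mean.
share_below_mean = sum(x < mean_value for x in data) / len(data)

# Dropping the outlier moves the mean a lot and the median only slightly.
without_outlier = data[:-1]

print(mean_value, median_value, share_below_mean)
print(statistics.mean(without_outlier), statistics.median(without_outlier))
```

With the outlier removed, the mean and the median of the remaining seven values coincide, which illustrates how a single extreme value pulls the mean but barely moves the median.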


Geometric mean is a summary statistic useful when the measurement scale is not linear. It is computed as

$$G = \sqrt[n]{x_1 x_2 \cdots x_n} \quad \text{or} \quad \log(G) = \frac{\sum_{i=1}^{n} \log(x_i)}{n}$$

For example, in the area of psychometrics it is well known that the rated intensity of a stimulus (e.g., brightness of a light) is often a logarithmic function of the actual intensity of the stimulus (brightness measured in units of Lux). In this instance, the geometric mean is a better "summary" of ratings than the simple mean.


For example, suppose you have an investment which earns 10% the first year, 50% the second year, and 30% the third year. What is its average rate of return? It is not the arithmetic mean, because what these numbers signify is that in the first year your investment was multiplied (not added to) by 1.10, in the second year it was multiplied by 1.50, and in the third year it was multiplied by 1.30. The relevant quantity is the geometric mean of these three numbers, which is about 1.28966, or about 29% annual return.


For example, if a stock rose 10% in the first year, 20% in the second year and fell 15% in the third year, then we compute the geometric mean of the factors 1.10, 1.20 and 0.85 as $(1.10 \times 1.20 \times 0.85)^{1/3} = 1.0391\ldots$ and we conclude that the stock rose 3.91 percent per year, on average.
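Both growth-factor examples can be verified with the logarithmic form of the geometric mean. This is a sketch under the assumption that all values are strictly positive; the function name `geometric_mean` is chosen here for illustration.

```python
import math


def geometric_mean(values):
    """n-th root of the product, computed as exp(mean of the logs)."""
    if not values:
        raise ValueError("the geometric mean of an empty data set is undefined")
    if any(v <= 0 for v in values):
        raise ValueError("all values must be positive")
    return math.exp(sum(math.log(v) for v in values) / len(values))


investment = geometric_mean([1.10, 1.50, 1.30])  # yearly growth factors
stock = geometric_mean([1.10, 1.20, 0.85])

print(round(investment, 5), round(stock, 4))
```

Working through logarithms avoids overflow when many factors are multiplied, which is one reason the second form of the formula is often preferred in practice.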


Harmonic Mean is a "summary" statistic used in analyses of frequency data. The harmonic mean is sometimes used to average values that change in time.

If a variable contains a zero (0) as a valid score, then the harmonic mean cannot be calculated (since it implies division by zero).

$$HM = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}$$


In certain situations, the harmonic mean provides the correct notion of "average". For instance, if for half the distance of a trip you travel at 40 miles per hour and for the other half of the distance you travel at 60 miles per hour, then your average speed for the trip is given by the harmonic mean of 40 and 60, which is 48; that is, the total amount of time for the trip is the same as if you traveled the entire trip at 48 miles per hour. (Note however that if you had traveled for half the time at one speed and the other half at another, the arithmetic mean, 50 miles per hour, would provide the correct notion of "average".)
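The trip example can be checked numerically. The sketch below computes the harmonic mean of the two speeds and compares it with total distance divided by total time; the 120-mile trip length is an arbitrary assumption, since the result does not depend on it.

```python
def harmonic_mean(values):
    """Number of values divided by the sum of their reciprocals."""
    if not values:
        raise ValueError("the harmonic mean of an empty data set is undefined")
    if any(v == 0 for v in values):
        raise ValueError("the harmonic mean is undefined when a value is zero")
    return len(values) / sum(1 / v for v in values)


average_speed = harmonic_mean([40, 60])

# Cross-check: half the distance at 40 mph, the other half at 60 mph.
trip_miles = 120  # hypothetical trip length
hours = (trip_miles / 2) / 40 + (trip_miles / 2) / 60
speed_from_trip = trip_miles / hours

print(average_speed, speed_from_trip)
```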


Proportion is a fraction in which the numerator is included within the denominator. A proportion is often expressed as a percentage.

For example, we might describe the overall number of smokers in a population as the proportion of people in the population who smoke.

Suppose that out of 125 people 15 smoke cigarettes; then the proportion of people who smoke is

$$P = \frac{15}{125} = 0.12$$


ii. Measures of position

Measures of position are used to describe the location of a specific piece of data in relation to the rest of the sample. Quartiles and percentiles are the two most popular measures of position.


Quartiles

Quartiles are numeric values of the variable that divide the ordered data into quarters; each set of data has three quartiles.

The first quartile, Q1, is a number such that at most one-fourth of the data are smaller in value than Q1 and at most three-fourths are larger.

$$Q_1 = \left(\frac{n+1}{4}\right)\text{th observation}$$

The second quartile, Q2, is the median.

$$Q_2 = \left(\frac{2(n+1)}{4}\right)\text{th} = \left(\frac{n+1}{2}\right)\text{th observation}$$


The third quartile, Q3, is a number such that at most three-fourths of the data are smaller in value than Q3 and at most one-fourth are larger.

$$Q_3 = \left(\frac{3(n+1)}{4}\right)\text{th observation}$$


Example: Birthweights (in grams) of 24 infants are as follows:

Obs.  Weight    Obs.  Weight    Obs.  Weight    Obs.  Weight
 1    2850       7    3150      13    3250      19    3700
 2    2900       8    3200      14    3400      20    3800
 3    2930       9    3200      15    3450      21    3900
 4    2980      10    3200      16    3500      22    4100
 5    3000      11    3250      17    3500      23    4400
 6    3100      12    3250      18    3600      24    4500


First quartile = Q1: the $\frac{25}{4} = 6.25$th observation is the first quartile.

6th obs. = 3100 g, 7th obs. = 3150 g

50 × 0.25 = 12.5 g

Q1 = 3100 + 12.5 = 3112.5 g

Second quartile = Median = Q2: the $\frac{25}{2} = 12.5$th observation is the second quartile or median.

12th obs. = 3250 g, 13th obs. = 3250 g

Q2 = 3250 g

Third quartile = Q3: the $\frac{3 \times 25}{4} = 18.75$th observation is the third quartile.


18th obs. = 3600 g, 19th obs. = 3700 g

100 × 0.75 = 75 g

Q3 = 3600 + 75 = 3675 g

25% of the infants have birthweights less than 3112.5 g.

Half of the infants have birthweights less than 3250 g.

75% of the infants have birthweights less than 3675 g.
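The quartile calculations for the birthweight data can be reproduced with the q(n+1)/4 position rule and linear interpolation between neighbouring observations. This is a sketch of that particular rule only; statistical packages often use other quartile conventions and may return slightly different values. The function names are illustrative.

```python
def value_at_position(ordered, position):
    """Value at a 1-based, possibly fractional, position in ordered data."""
    if not 1 <= position <= len(ordered):
        raise ValueError("position falls outside the data")
    whole = int(position)              # whole-number part of the position
    fraction = position - whole        # e.g. 0.25 for the 6.25th observation
    lower = ordered[whole - 1]
    if fraction == 0:
        return lower
    upper = ordered[whole]
    return lower + fraction * (upper - lower)


def quartile(values, q):
    """q-th quartile (q = 1, 2, 3) at position q * (n + 1) / 4."""
    ordered = sorted(values)
    return value_at_position(ordered, q * (len(ordered) + 1) / 4)


birthweights = [
    2850, 2900, 2930, 2980, 3000, 3100, 3150, 3200, 3200, 3200, 3250, 3250,
    3250, 3400, 3450, 3500, 3500, 3600, 3700, 3800, 3900, 4100, 4400, 4500,
]
print(quartile(birthweights, 1), quartile(birthweights, 2), quartile(birthweights, 3))
```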


Percentiles are numerical values of the variable that divide a set of ordered data into 100 equal parts; each set of data has 99 percentiles.

The procedure for determining the value of any kth percentile involves three basic steps.

1. The data must be ordered.

2. The position number i for the percentile in question must be determined. It is found by first calculating the value of $\frac{k(n+1)}{100}$.

3. The value at position i in the ordered data must be located; when i is not a whole number, the percentile lies between the two neighbouring observations.


Example: Birth weights of 24 infants.

i. Find the 30th percentile.

When the data are ordered, the $\frac{30(24+1)}{100} = 7.5$th observation, which is halfway between the 7th (3150 g) and 8th (3200 g) observations, is the 30th percentile. The 30th percentile is therefore 3175 g, which means that 30% of the observations lie below 3175 g.

ii. Find the 50th percentile (median).

The $\frac{50(24+1)}{100} = 12.5$th observation, which is halfway between the 12th (3250 g) and 13th (3250 g) observations, is the 50th percentile. 3250 g is the 50th percentile. Half of the infants have birth weights less than 3250 g.
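The percentile procedure (order the data, compute the position k(n+1)/100, read off or interpolate the value) can be sketched as follows. The function name `percentile` is hypothetical, and linear interpolation for fractional positions is an assumption that agrees with the "halfway between" reading used in the example.

```python
def percentile(values, k):
    """k-th percentile at 1-based position k * (n + 1) / 100."""
    ordered = sorted(values)                 # step 1: order the data
    position = k * (len(ordered) + 1) / 100  # step 2: position number
    if not 1 <= position <= len(ordered):
        raise ValueError("position falls outside the data")
    whole = int(position)
    fraction = position - whole
    lower = ordered[whole - 1]               # step 3: locate the value
    if fraction == 0:
        return lower
    upper = ordered[whole]
    return lower + fraction * (upper - lower)


birthweights = [
    2850, 2900, 2930, 2980, 3000, 3100, 3150, 3200, 3200, 3200, 3250, 3250,
    3250, 3400, 3450, 3500, 3500, 3600, 3700, 3800, 3900, 4100, 4400, 4500,
]
p30 = percentile(birthweights, 30)
p50 = percentile(birthweights, 50)
print(p30, p50)
```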


2. Measures of dispersion

Range

The range is the simplest measure of dispersion. It is the difference between the highest value (H) and the lowest value (L) of the observations.

Range = H − L


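Measures of dispersion

Variance is the square of the standard deviation.

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} \quad \text{or} \quad s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n-1}$$

Standard deviation can be read as a typical (roughly average) distance of the observations from the arithmetic mean.

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} \quad \text{or} \quad s = \sqrt{\frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n-1}}$$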


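Example: variance and standard deviation of two samples of size n = 5.

First sample

x       x − x̄    (x − x̄)²
6        1         1
3       −2         4
8        3         9
5        0         0
3       −2         4
Total   25   0    18

Step 1: $\sum x = 25$

Step 2: $\bar{x} = \frac{\sum x}{n} = \frac{25}{5} = 5$

Step 3: the deviations $(x - \bar{x})$

Step 4: the squared deviations $(x - \bar{x})^2$, with sum 18

Step 5: $s^2 = \frac{\sum (x - \bar{x})^2}{n-1} = \frac{18}{4} = 4.5$

$s = \sqrt{s^2} = \sqrt{4.5} = 2.12$

Second sample

x       x − x̄    (x − x̄)²
1       −4        16
3       −2         4
5        0         0
6        1         1
10       5        25
Total   25   0    46

$\bar{x} = \frac{\sum x}{n} = \frac{25}{5} = 5$

$s^2 = \frac{\sum (x - \bar{x})^2}{n-1} = \frac{46}{4} = 11.5$

$s = \sqrt{s^2} = \sqrt{11.5} = 3.39$

NOTE: The sum of the deviations, $\sum_{i=1}^{n} (x_i - \bar{x})$, is always zero.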


First sample (ordered): 3, 3, 5, 6, 8, with s² = 4.5

Second sample (ordered): 1, 3, 5, 6, 10, with s² = 11.5

The last set of data is more dispersed than the previous set, and therefore its variance is larger.
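The two sample variances can be recomputed in code, both from the squared deviations about the mean and from the shortcut form that uses the sum of squares and the squared sum. This is a minimal sketch; the function names are illustrative and both functions divide by n − 1.

```python
import math


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two observations are required")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def sample_variance_shortcut(values):
    """Same quantity from sum(x^2) and (sum(x))^2 / n."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two observations are required")
    total = sum(values)
    total_of_squares = sum(x * x for x in values)
    return (total_of_squares - total ** 2 / n) / (n - 1)


first_sample = [6, 3, 8, 5, 3]
second_sample = [1, 3, 5, 6, 10]

for sample in (first_sample, second_sample):
    variance = sample_variance(sample)
    print(variance, round(math.sqrt(variance), 2))
```

The shortcut form needs only one pass over the data, but with large values it can lose precision to cancellation, so the deviation form is usually the safer default.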


Coefficient of variation (CV) expresses the standard deviation as a percentage of the arithmetic mean.

$$CV = \frac{s}{\bar{x}} \times 100$$


Example: Heights of ten subjects are measured in cm and m. Data are listed below.

            Height (cm)   Height (m)
            160           1.60
            180           1.80
            165           1.65
            174           1.74
            190           1.90
            182           1.82
            155           1.55
            165           1.65
            171           1.71
            160           1.60
Mean (x̄)    170.2         1.702
s           11.23         0.1123
CV          0.066         0.066

Although the standard deviations are different, the coefficients of variation are the same (0.066, that is, 6.6%).
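The unit-independence of the CV can be confirmed on the height data with Python's standard `statistics` module (`statistics.stdev` uses the n − 1 divisor). The helper name `coefficient_of_variation` is an illustrative choice, and the result is returned as a percentage.

```python
import statistics


def coefficient_of_variation(values):
    """Standard deviation as a percentage of the arithmetic mean."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("the CV is undefined when the mean is zero")
    return statistics.stdev(values) / mean * 100


heights_cm = [160, 180, 165, 174, 190, 182, 155, 165, 171, 160]
heights_m = [h / 100 for h in heights_cm]  # same heights in metres

cv_cm = coefficient_of_variation(heights_cm)
cv_m = coefficient_of_variation(heights_m)

print(round(statistics.stdev(heights_cm), 2), round(statistics.stdev(heights_m), 4))
print(round(cv_cm, 1), round(cv_m, 1))
```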


Interquartile Range is a measure of dispersion for non-symmetric data. IQR = Q3 − Q1

Semi-Interquartile Range is used instead of the standard deviation when the distribution of the data is non-symmetric. SIQR = (Q3 − Q1)/2

Ranked data, increasing order:

L ---25%--- Q1 ---25%--- Q2 ---25%--- Q3 ---25%--- H
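As an illustration, both quantities can be computed from the quartiles found earlier for the birthweight data (Q1 = 3112.5 g, Q3 = 3675 g). The function names are hypothetical; the slides themselves do not work this example.

```python
def interquartile_range(q1, q3):
    """Width of the middle 50% of the ordered data."""
    return q3 - q1


def semi_interquartile_range(q1, q3):
    """Half of the interquartile range."""
    return (q3 - q1) / 2


# Quartiles of the 24 birthweights, in grams.
q1, q3 = 3112.5, 3675.0
iqr = interquartile_range(q1, q3)
siqr = semi_interquartile_range(q1, q3)
print(iqr, siqr)
```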