Prepared by: Dr. Anees AlSaadi, Community Medicine Department, December 2013. Data Summarization (Descriptive statistics)

Descriptive statistics


Page 1: Descriptive statistics

Prepared by: Dr. Anees AlSaadi, Community Medicine Department, December 2013

Data Summarization (Descriptive statistics)

Page 2: Descriptive statistics

Data Summarization: Descriptive statistics

• Continuous data description:
  – Measures of data center:
    • Mean, median and mode / definition.
    • Practical exercise.
  – Measures of data variability:
    • Standard deviation (variance) / range.
    • Practical exercise.
  – Normal distribution curve.

Page 3: Descriptive statistics

Measures of Center:

• Synonyms:
  – Measures of central tendency.
  – Measures of location.

• They identify the center of the distribution of observations, that is, the middle, average or typical value.

Page 4: Descriptive statistics

Measures of Center:

• Mean: the arithmetic average of all observations.

• Median: the middle observation of the ordered data.

• Mode: the most frequently observed value(s).

Page 5: Descriptive statistics

Measures of Center: Sample Mean

• The most commonly used measure of location.

• Also called the arithmetic average.

Page 6: Descriptive statistics

Measures of Center: How to Calculate Sample Mean

• Add up the data, then divide by the sample size (n).

• (n) is the number of observations.

Page 7: Descriptive statistics

Measures of Center: How to Calculate Sample Mean (Example)

These are systolic blood pressure readings (in mmHg):

120, 80, 90, 110, 95.

X1 = 120, X2 = 80 … X5 = 95

The mean is calculated by adding up the five values and dividing by 5.

Page 8: Descriptive statistics

Measures of Center: How to Calculate Sample Mean

X̄ = (120 + 80 + 90 + 110 + 95) / 5 = 495 / 5 = 99 mmHg.
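The calculation above can be sketched in Python. The function name `sample_mean` is only an illustrative choice, not part of the original slides.

```python
# Systolic blood pressure readings (mmHg) from the slide example.
readings = [120, 80, 90, 110, 95]


def sample_mean(values):
    """Add up the observations and divide by the sample size n."""
    n = len(values)  # n is the number of observations
    return sum(values) / n


mean_bp = sample_mean(readings)
print(mean_bp)  # 99.0
```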

Page 9: Descriptive statistics

Measures of Center: Sample Mean Example

Calculate the sample mean for the number of open heart surgeries done by 7 cardiothoracic surgeons in Hamad hospital during the last month, where Dr. A did 4, Dr. B 3, Dr. C 6, Dr. D 5, Dr. E 4, Dr. F 3 and Dr. G 5.

Mean = (4 + 3 + 6 + 5 + 4 + 3 + 5) / 7 = 30 / 7 ≈ 4.29 surgeries.

Page 10: Descriptive statistics

Measures of Center: Sample Mean Feature

The most important feature of the mean is its sensitivity to extreme values (outliers).
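A minimal sketch of this sensitivity, using Python's standard `statistics` module. The readings are hypothetical values chosen for illustration: adding one extreme value (200) moves the mean from 93.75 to 115, while the median only moves from 92.5 to 95.

```python
import statistics

# Hypothetical readings without an extreme value.
typical = [80, 90, 95, 110]
# The same readings plus one extreme value (outlier).
with_outlier = typical + [200]

mean_before = statistics.mean(typical)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(typical)
median_after = statistics.median(with_outlier)

# The mean shifts far more than the median when the outlier is added.
print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```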

Page 11: Descriptive statistics

Measures of Center: Sample Median

The median is the middle number; it is also called the 50th percentile.

Page 12: Descriptive statistics

Measures of Center: How to Identify Sample Median

• Order the observations from smallest to largest.

• Find the observation in the middle of the data.

• The median is the observation in the middle. When the number of observations is even, it is the average of the two middle observations.

Page 13: Descriptive statistics

Measures of Center: How to Identify Sample Median

Sample Median Example:

Identify the median for the following set of observations:

– 90, 80, 200, 95, 110.

Ordered: 80, 90, 95, 110, 200. Median = 95

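Page 14: Descriptive statistics

Measures of Center: How to Identify Sample Median

Sample Median Example:

• Identify the median for the following set of observations:

– 90, 80, 120, 95, 125, 110.

Ordered: 80, 90, 95, 110, 120, 125. Median = (95 + 110) / 2 = 102.5

Position of the median: (n + 1) / 2 = (6 + 1) / 2 = 3.5, i.e. midway between the 3rd and 4th ordered observations.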
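The two median examples (odd and even number of observations) can be checked with a short Python sketch. `sample_median` is an illustrative name, not something defined in the slides.

```python
def sample_median(values):
    """Return the middle value of the ordered data.

    For an even number of observations, return the average of the
    two middle values.
    """
    ordered = sorted(values)  # order from smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_median = sample_median([90, 80, 200, 95, 110])
even_median = sample_median([90, 80, 120, 95, 125, 110])
print(odd_median)   # 95
print(even_median)  # 102.5
```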

Page 15: Descriptive statistics

Measures of Center: Sample Median Features

• Not affected by extreme values.

• Statistically less efficient than the mean as a summary of the data.

Page 16: Descriptive statistics

Measures of Center: Sample Mode

• The most commonly occurring value in the dataset.

• Not all datasets have a mode.

• Unimodal distribution: one mode in the dataset.

• Bimodal distribution: two modes in the dataset.

Page 17: Descriptive statistics

Measures of Center: How to Identify Sample Mode

• Arrange the data from the smallest to the largest value.

• The most frequently repeated value is the sample mode.

Page 18: Descriptive statistics

Measures of Center: How to Identify Sample Mode

Sample Mode Example:

Original data: {15, 33, 65, 32, 78, 94, 33, 110, 11, 46, 33}

Ordered data: {11, 15, 32, 33, 33, 33, 46, 65, 78, 94, 110}

The mode is 33 (it occurs three times).
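A possible Python sketch for finding the mode(s) with `collections.Counter`. The function name `sample_modes` and the convention of returning an empty list when no value repeats are assumptions made for this illustration.

```python
from collections import Counter


def sample_modes(values):
    """Return the most frequently occurring value(s), smallest first."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        # Assumption: if no value repeats, the dataset has no mode.
        return []
    return sorted(v for v, c in counts.items() if c == highest)


data = [15, 33, 65, 32, 78, 94, 33, 110, 11, 46, 33]
modes = sample_modes(data)
print(modes)  # [33]
```

A bimodal dataset such as `[1, 1, 2, 2, 3]` would give two modes, `[1, 2]`.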

Page 19: Descriptive statistics

Measures of Center: Sample Mode Features

• Not affected by extreme values.

• Statistically less efficient than the mean as a summary of the data.

Page 20: Descriptive statistics

Practical Exercise

This dataset is the number of hysterectomies performed by female doctors in HMC:

{44, 37, 86, 50, 20, 25, 28, 25, 31, 33, 85, 59, 27, 34, 36}

Find the mean, median and mode.

Page 21: Descriptive statistics

Data Summarization: Descriptive statistics

• Continuous data description:
  – Measures of data center:
    • Mean, median and mode / definition.
    • Practical exercise.
  – Measures of data variability (dispersion):
    • Standard deviation (variance) / range / interquartile range.
    • Practical exercise.
  – Normal distribution curve.

Page 22: Descriptive statistics

Measures of Data Dispersion

• Data dispersion = data spread.

• Measures of data dispersion:
  – Range.
  – Interquartile range.
  – Variance.
  – Standard deviation.

Page 23: Descriptive statistics

Measures of Data Dispersion: Range

• Equal to the largest (maximum) value minus the smallest (minimum) value.

• Easy to calculate, but it says nothing about the values between the maximum and the minimum.

Page 24: Descriptive statistics

Measures of Data Dispersion: Range

Range Example:

Calculate the range for the following dataset: {40, 28, 42, 30, 31, 38, 100, 20, 48, 50, 51, 30}

Range = 100 - 20 = 80
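The range example can be reproduced in a few lines of Python. The name `data_range` is illustrative only.

```python
def data_range(values):
    """Largest (maximum) value minus smallest (minimum) value."""
    return max(values) - min(values)


dataset = [40, 28, 42, 30, 31, 38, 100, 20, 48, 50, 51, 30]
print(data_range(dataset))  # 80

# Dropping the single extreme value (100) shrinks the range considerably.
without_extreme = [x for x in dataset if x != 100]
print(data_range(without_extreme))  # 31
```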

Page 25: Descriptive statistics

Measures of Data Dispersion: Range Feature

The range is affected by extreme values.

Page 26: Descriptive statistics

Measures of Data Dispersion: Interquartile Range

• Quartiles: the 25th, 50th and 75th percentiles of the data.

• The interquartile range is the distance between the 25th and 75th percentiles.

Page 27: Descriptive statistics

Measures of Data Dispersion: Interquartile Range

• The maximum, minimum, 1st and 3rd quartiles and the median are used to draw a box plot (five-number summary).

[Figure: box plot labelled Max, 75th percentile, Median (50th percentile), 25th percentile, Min]

Page 28: Descriptive statistics

Measures of Data Dispersion: Interquartile Range

• Quartiles are numbers that divide the dataset into four quarters, with 25% of the observations in each quarter.

• Q1 (lower quartile): 25% of observations lie below it and 75% above it.

• Q2 (median): 50% of observations lie on each side of it.

• Q3 (upper quartile): 25% of observations lie above it and 75% below it.

[Figure labels: Q1, Q2, Q3]

Page 29: Descriptive statistics

Measures of Data Dispersion: How to Find Interquartile Range

• Arrange the data from the smallest to the largest.

• Divide the data into two halves (lower and upper).

• Define Q1 as the median of the lower half of the data.

• Define Q3 as the median of the upper half of the data.

• The interquartile range is Q3 - Q1.

Page 30: Descriptive statistics

Measures of Data Dispersion: How to Find Interquartile Range

Interquartile Range Example:

{20, 28, 30, 30, 31, 38, 40, 42, 48, 50, 51, 100}

Lower half: {20, 28, 30, 30, 31, 38}. Upper half: {40, 42, 48, 50, 51, 100}.

Q1 = 25th percentile = (30 + 30) / 2 = 30

Q3 = 75th percentile = (48 + 50) / 2 = 49

Interquartile Range (IQR) = Q3 - Q1 = 49 - 30 = 19
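The "median of each half" method can be sketched in Python as follows. The helper names are illustrative. One assumption goes beyond the slides: for an odd number of observations the middle value is left out of both halves. Library percentile functions may use other interpolation conventions and can give slightly different quartiles.

```python
def median_of_sorted(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    if len(values) < 2:
        raise ValueError("need at least two observations")
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]                  # lower half of the data
    upper = ordered[len(ordered) - half:]   # upper half (middle value excluded if n is odd)
    return median_of_sorted(lower), median_of_sorted(upper)


data = [20, 28, 30, 30, 31, 38, 40, 42, 48, 50, 51, 100]
q1, q3 = quartiles(data)
iqr = q3 - q1
print(q1, q3, iqr)  # 30.0 49.0 19.0
```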

Page 31: Descriptive statistics

Practical Exercise

Page 32: Descriptive statistics

Measures of Data Dispersion: Variance

• The average squared deviation from the mean.

• Its units of measurement are those of the original data, squared.

• Notation: s² (sample variance) or σ² (population variance).

Page 33: Descriptive statistics

Measures of Data Dispersion: Variance
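No formula appears in the text of this slide. Assuming the standard textbook definitions (an assumption, not confirmed by the source), the sample variance divides the sum of squared deviations by n - 1, and the population variance divides by N:

```latex
% Sample variance (left) and population variance (right).
% \bar{x} is the sample mean, \mu the population mean.
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2
\qquad
\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2
```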

Page 34: Descriptive statistics

Measures of Data Dispersion: Standard Deviation

• The square root of the variance (s or σ).
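A minimal Python sketch of variance and standard deviation, reusing the systolic blood pressure readings from the mean example. It assumes the usual convention that the sample variance divides by n - 1 and the population variance by n. The slides do not say which divisor they intend, so both are shown.

```python
import math
import statistics

readings = [120, 80, 90, 110, 95]  # systolic blood pressure, mmHg
n = len(readings)
mean_bp = sum(readings) / n        # 99.0

# Squared deviation of each observation from the mean (units: mmHg squared).
squared_deviations = [(x - mean_bp) ** 2 for x in readings]

sample_variance = sum(squared_deviations) / (n - 1)  # divides by n - 1
population_variance = sum(squared_deviations) / n    # divides by n

# The standard deviation is the square root of the variance (units: mmHg).
sample_sd = math.sqrt(sample_variance)
population_sd = math.sqrt(population_variance)

print(sample_variance, population_variance)  # 255.0 204.0
print(round(sample_sd, 2), round(population_sd, 2))
```

The standard library functions `statistics.variance` / `statistics.stdev` (sample) and `statistics.pvariance` / `statistics.pstdev` (population) compute the same quantities.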

Page 35: Descriptive statistics

Practical Exercise

Page 36: Descriptive statistics

Measures of Data Dispersion: Standard Deviation

• Best used when the mean is used as the measure of center.

• A standard deviation of 0 indicates no spread: all the data have the same value.

• Affected by extreme observations.

Page 37: Descriptive statistics

Measures of Data Dispersion: Standard Deviation

Page 38: Descriptive statistics

Choosing Measures of Center and Spread

If the distribution is normal or symmetrical:

• Use the mean and standard deviation.

If the distribution is skewed or has large outliers:

• Use the median and the range or IQR.

If the distribution is bimodal:

• Use the mode and range, or find out whether the two modes represent two different groups and separate them.

Page 39: Descriptive statistics

Characteristics of Measures of Spread

Range:

• Simple.

• Non-resistant (affected by extreme values).

IQR:

• Resistant (not affected by extreme values).

• Used with the median.

• IQR = 0 does not mean there is no spread.

Standard Deviation:

• Non-resistant (affected by extreme values).

• Used with the mean.

• Good for symmetrical distributions with no outliers.

• A standard deviation of 0 means there is no spread.

Page 40: Descriptive statistics

Practical Exercise

Page 41: Descriptive statistics

Thank You