EXPLORATORY DATA ANALYSIS

Jesús Piedrafita

Departament de Ciència Animal i dels Aliments, UAB Barcelona

Training Course for Research Staff Using Animals in Experimentation (Curso de Formación de Personal Investigador Usuario de Animales para Experimentación)

Learning objectives

Distinguish between causation and association studies.

State the concept of Exploratory Data Analysis and its utility.

Distinguish between population and sample (data).

Distinguish between quantitative and qualitative variables.

List and define the main descriptive statistics: central tendency, variability, shape of the distribution and relative position.

Import data from Excel to R Commander and calculate descriptive statistics.

Define the Normal distribution and use graphical tests to assess normality.


Two types of studies


• EXPERIMENT: allows us to infer whether a particular treatment causes changes in a variable of interest (t-test, ANOVA)

• DESCRIPTIVE (“Survey”): information on a situation or process that cannot be controlled is collected and analysed to establish associations (correlation and regression)

Central dogma of Statistics

Source: http://bolt.mph.ufl.edu/6050-6052/unit-1/

Exploratory data analysis (EDA)

• EDA consists of:

– organizing and summarizing the raw data,

– discovering important features and patterns in the data and any striking deviations from those patterns, and then

– interpreting our findings in the context of the problem

• And can be useful for:

– describing the distribution of a single variable (center, spread, shape, outliers)

– checking data (for errors or other problems)

– checking the assumptions of more complex statistical analyses

– investigating relationships between variables


Variables

• Quantitative (numerical)

– Continuous (adult weight, percent of a fatty acid, …)

– Discrete, i.e. countable, finite or infinite (number of colonies, litter size)

• Qualitative (categorical or classification)

– Ordinal (calving ease score, panel score)

– Nominal (gender, coat colour, …)

• Prefer bar charts to pie charts

Variable: Set of observations of a particular character

Data: Values of a variable


Summary of numerical methods for presenting data

Descriptive statistics

Measures of central tendency: arithmetic mean, median, mode

Measures of variability: range, variance, standard deviation, coefficient of variation

Measures of the shape of a distribution: skewness, kurtosis

Measures of relative position: percentiles, quartiles (Q1, Q2, Q3), z-values

Descriptive Statistics attempts to describe the distribution of the data


Measures of central tendency

Arithmetic mean:

$$\bar{y} = \frac{\sum_i y_i}{n} \quad \text{or} \quad \bar{y} = \sum_i f_i y_i$$

The second formula is for grouped data, $f_i$ being the proportion of each value $y_i$.

Median: value that is in the middle when observations are sorted from the smallest to the largest. Robust to the presence of extreme values (in this it differs from the mean).

Mode: value among the observations that has the highest frequency
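As a minimal sketch of the three measures above, the following Python snippet uses only the standard library. The litter sizes are hypothetical values chosen for illustration (they are not course data), with one extreme observation included to show how the mean and the median react differently.

```python
from collections import Counter
from statistics import median, multimode

# Hypothetical litter sizes (illustrative values, not from the course data);
# the value 30 plays the role of an extreme observation.
litter_sizes = [8, 10, 9, 10, 12, 10, 9, 30]

n = len(litter_sizes)

# Arithmetic mean: sum of the observations divided by n
arithmetic_mean = sum(litter_sizes) / n

# Grouped-data form: sum of f_i * y_i, with f_i the proportion of each distinct value
proportions = {value: count / n for value, count in Counter(litter_sizes).items()}
grouped_mean = sum(f * y for y, f in proportions.items())

# Median: middle value of the sorted observations
middle_value = median(litter_sizes)

# Mode: most frequent value(s); multimode returns a list in case of ties
most_frequent = multimode(litter_sizes)

print(arithmetic_mean, grouped_mean, middle_value, most_frequent)
```

Both forms of the mean give the same result, and the extreme value pulls the mean above the median, which stays at the centre of the bulk of the data.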


Measures of variability

Range: difference between the maximum and the minimum values in a set of observations. Strongly affected by extreme values.

Sample variance of n observations:

$$s^2 = \frac{\sum_i (y_i - \bar{y})^2}{n - 1}$$

where the sum of squares is

$$SS = \sum_i (y_i - \bar{y})^2 = \sum_i y_i^2 - n\bar{y}^2$$

More variance indicates more dispersion.

Sample standard deviation:

$$s = \sqrt{s^2}$$

It keeps the unit of measurement of the raw data. Both the variance and the standard deviation are affected by extreme values.

Coefficient of variation: a relative, dimensionless measure of variability.

$$CV = \frac{s}{\bar{y}} \, (100\%)$$
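The measures of variability can be sketched in a few lines of Python. The weights below are hypothetical, and the shortcut form of the sum of squares (sum of squared values minus n times the squared mean) is included only to show that it agrees with the definition.

```python
from math import sqrt

# Hypothetical body weights in kg (illustrative values only)
weights = [2.0, 2.3, 2.7, 3.0, 3.5]

n = len(weights)
y_bar = sum(weights) / n

# Range: maximum minus minimum
value_range = max(weights) - min(weights)

# Sum of squares, by definition and by the computational shortcut
ss = sum((y - y_bar) ** 2 for y in weights)
ss_shortcut = sum(y ** 2 for y in weights) - n * y_bar ** 2

# Sample variance (divisor n - 1) and sample standard deviation
variance = ss / (n - 1)
sd = sqrt(variance)

# Coefficient of variation, expressed as a percentage
cv_percent = 100 * sd / y_bar

print(value_range, variance, sd, cv_percent)
```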


Measures of the shape of a distribution

Skewness: measure of asymmetry of a frequency distribution. It is 0 for a symmetric distribution.

$$sk = \frac{n}{(n-1)(n-2)} \sum_i \left( \frac{y_i - \bar{y}}{s} \right)^3$$

Kurtosis: measure of flatness or steepness of a distribution, or a measure of the heaviness of the tails of a distribution. It is 0 for a normal distribution.

$$kt = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_i \left( \frac{y_i - \bar{y}}{s} \right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)}$$

[Figure: examples of positive (+) and negative (-) skewness. Source: https://en.wikipedia.org/wiki/Skewness]
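A minimal Python sketch of the two shape measures follows, assuming the bias-adjusted sample formulas: skewness as n/((n-1)(n-2)) times the sum of cubed standardized deviations, and kurtosis as n(n+1)/((n-1)(n-2)(n-3)) times the sum of fourth powers minus 3(n-1)²/((n-2)(n-3)). The function names and the example data are hypothetical.

```python
from math import sqrt

def _standardized(values):
    """Return the deviations from the mean in sample standard deviation units."""
    n = len(values)
    y_bar = sum(values) / n
    s = sqrt(sum((y - y_bar) ** 2 for y in values) / (n - 1))
    return [(y - y_bar) / s for y in values]

def sample_skewness(values):
    """Bias-adjusted sample skewness; needs at least 3 observations."""
    n = len(values)
    z = _standardized(values)
    return n / ((n - 1) * (n - 2)) * sum(zi ** 3 for zi in z)

def sample_kurtosis(values):
    """Bias-adjusted sample excess kurtosis; needs at least 4 observations."""
    n = len(values)
    z = _standardized(values)
    first = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum(zi ** 4 for zi in z)
    return first - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))

symmetric = [1, 2, 3, 4, 5]           # hypothetical symmetric data
right_skewed = [1, 1, 2, 2, 3, 10]    # hypothetical data with a long right tail

print(sample_skewness(symmetric), sample_kurtosis(symmetric))
print(sample_skewness(right_skewed))
```

The symmetric example gives a skewness of 0 and a negative kurtosis (flatter than a Normal distribution), while the long right tail in the second example gives a positive skewness.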

Measures of the relative position

Percentiles: an observation $y_i$ is at percentile p if 100p% of the observations in the data set are smaller than $y_i$ and 100(1-p)% are greater than $y_i$.

Quartiles: Percentiles 25% (Q1 or lower quartile), 50% (Q2 or median) and 75% (Q3 or upper quartile).

z-value: Deviation of an observation from the mean expressed in standard deviation units:

$$z_i = \frac{y_i - \bar{y}}{s}$$

IQR: Interquartile range, Q3 - Q1. Little affected by extreme values (outliers).
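The measures of relative position can be sketched with Python's `statistics` module. The data are hypothetical, and the `"inclusive"` interpolation method is an assumption made here for illustration; other software may use a different quantile definition and return slightly different quartiles.

```python
from statistics import mean, quantiles, stdev

# Hypothetical body weights in kg (illustrative values only)
weights = [2.0, 2.1, 2.3, 2.6, 2.7, 2.9, 3.0, 3.3, 3.9]

# Quartiles Q1, Q2 (median) and Q3; the interpolation method is an assumption
q1, q2, q3 = quantiles(weights, n=4, method="inclusive")

# Interquartile range
iqr = q3 - q1

# z-values: deviations from the mean in standard deviation units
y_bar = mean(weights)
s = stdev(weights)
z_values = [(y - y_bar) / s for y in weights]

print(q1, q2, q3, iqr)
print([round(z, 2) for z in z_values])
```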


A first dataset

Records of body weight (kg) of cats, both females and males (in MASS package of R).

We have saved the data in an Excel file. In this file there are also two more variables: Sex (M, F) and heart weight (g).

We want to carry out an exploratory analysis using the R Commander software.

Importing a dataset from Excel


Descriptive statistics


Data: catsWeight

Statistics > Summaries > Numerical summaries

Variable: BODYwt; then select the statistics we want to compute

• Note that R Commander right-aligns the output.

• The coefficient of variation (cv) must be multiplied by 100 to be expressed as a percentage.

• There is some positive skewness and negative kurtosis.

• The lowest and highest observations (range) are in the 0% and 100% percentiles.

    mean      sd         IQR    cv        skewness
    2.723611  0.4853066  0.725  0.178185  0.4786244

    kurtosis    0%  25%  50%  75%    100%  n
    -0.6738393  2   2.3  2.7  3.025  3.9   144
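The reported summaries are related to one another, which gives a quick consistency check. The sketch below takes the mean, standard deviation and percentiles from the output above and recomputes the coefficient of variation, the IQR and the range; the z-value of the heaviest cat is an extra illustration, not part of the R Commander output.

```python
# Summary values copied from the R Commander output for BODYwt
mean_wt = 2.723611
sd_wt = 0.4853066
p0, p25, p50, p75, p100 = 2.0, 2.3, 2.7, 3.025, 3.9

# Coefficient of variation: sd / mean (reported as 0.178185), then as a percentage
cv = sd_wt / mean_wt
cv_percent = 100 * cv

# Interquartile range: Q3 - Q1 (reported as 0.725)
iqr = p75 - p25

# Range: 100% percentile minus 0% percentile
value_range = p100 - p0

# z-value of the heaviest cat (illustrative extra)
z_max = (p100 - mean_wt) / sd_wt

print(round(cv, 6), round(cv_percent, 1), round(iqr, 3), round(value_range, 1), round(z_max, 2))
```

The coefficient of variation is close to 18%, and the heaviest cat lies about 2.4 standard deviations above the mean.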

Normal distribution

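A variable Y follows a Normal distribution, $Y \sim N(\mu, \sigma^2)$, if its p.d.f. is

$$f(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(y - \mu)^2}{2\sigma^2}}$$

[Figures: portrait of Gauss; Normal p.d.f. curves; standard normal curve with standard deviation intervals]

Sources:

http://en.wikipedia.org/wiki/Carl_Friedrich_Gauss

http://en.wikipedia.org/wiki/Normal_distribution

http://en.wikipedia.org/wiki/File:Normal_Distribution_PDF.svg

http://en.wikipedia.org/wiki/File:Standard_deviation_diagram.svg

http://en.wikipedia.org/wiki/File:Carl_Friedrich_Gauss.jpg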
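As a small sketch, the Normal p.d.f. with mean mu and standard deviation sigma can be written directly in Python and compared with the standard library's `NormalDist`. The function name `normal_pdf` is hypothetical; the probabilities within one and two standard deviations of the mean are computed from the standard normal distribution.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

def normal_pdf(y, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at y."""
    return exp(-((y - mu) ** 2) / (2 * sigma ** 2)) / sqrt(2 * pi * sigma ** 2)

# Standard normal distribution: mean 0, standard deviation 1
standard = NormalDist(mu=0, sigma=1)

# Height of the curve at the mean
peak = normal_pdf(0.0, 0.0, 1.0)

# Probability of falling within 1 and 2 standard deviations of the mean
within_one_sd = standard.cdf(1) - standard.cdf(-1)
within_two_sd = standard.cdf(2) - standard.cdf(-2)

print(round(peak, 4), round(within_one_sd, 4), round(within_two_sd, 4))
```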

Testing normality

Normality is assumed both for variables and for the distribution of parameter estimates. The latter is derived by statistical reasoning, but the fit of real data to a given distribution must be checked. In the case of the normal distribution:

1. Graphical tests:

i. Density

ii. Box-plot

iii. QQ-plot

2. Numerical tests (in addition to skewness and kurtosis):

i. Shapiro-Wilk (for sample sizes 7 – 2000).

ii. Kolmogorov-Smirnov, Cramér-von Mises, Anderson-Darling, etc., not described in this course.

Studying normality. Density

Graphs > Density estimate

Variable: BODYwt
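To show what a density estimate does, here is a minimal Gaussian kernel density sketch in Python: each observation contributes a small Normal curve, and the estimate is their average. The data, the function name `gaussian_kde` and the rule-of-thumb bandwidth are all assumptions for illustration; R Commander may use a different kernel or bandwidth.

```python
from statistics import NormalDist, stdev

def gaussian_kde(data, bandwidth):
    """Return a function giving the kernel density estimate at a point x."""
    kernels = [NormalDist(mu=y, sigma=bandwidth) for y in data]

    def density(x):
        # Average of the Normal kernels centred on each observation
        return sum(k.pdf(x) for k in kernels) / len(kernels)

    return density

# Hypothetical body weights in kg (illustrative values only)
weights = [2.0, 2.1, 2.3, 2.6, 2.7, 2.9, 3.0, 3.3, 3.9]

# Rule-of-thumb bandwidth (an assumption, not necessarily the software default)
bandwidth = 1.06 * stdev(weights) * len(weights) ** (-1 / 5)
density = gaussian_kde(weights, bandwidth)

# A density must integrate to about 1: check with a simple Riemann sum on 0 to 6 kg
step = 0.01
grid = [i * step for i in range(601)]
area = sum(density(x) for x in grid) * step

print(round(bandwidth, 3), round(area, 3))
```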

Some more theory on the normal distribution

(Logan, 2010)

Studying normality. Box-and-whisker plot

Whisker: maximum length = 1.5 IQR

Box: spans the IQR, with the median marked inside


Graphs > Boxplot

No outliers in our dataset

The distribution is skewed to the right

Extreme observations (outliers) would have been marked beyond the end of one or both whiskers with a small circle (o)
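The 1.5 IQR rule can be checked numerically with the summary values reported earlier for BODYwt. This is a sketch under one assumption: the reported 25% and 75% percentiles are used as the edges of the box, whereas plotting software may compute the box edges slightly differently.

```python
# Quartiles and extremes of BODYwt taken from the numerical summary
q1, q3 = 2.3, 3.025
minimum, maximum = 2.0, 3.9

iqr = q3 - q1

# Whiskers may extend at most 1.5 IQR beyond the box
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# An observation outside the fences would be drawn as an outlier
has_outliers = minimum < lower_fence or maximum > upper_fence

print(round(lower_fence, 4), round(upper_fence, 4), has_outliers)
```

Both the lightest and the heaviest cat fall inside the fences, in agreement with the absence of outliers in the plot.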

Studying normality. Q-Q plots


Graphs > Quantile-Comparison (QQ) plots

Q-Q stands for quantile-quantile: theoretical Normal quantiles are plotted against observed quantiles

Points are expected to fall along a straight line if the data come from a Normal distribution, whatever its mean and standard deviation

Several observations do not follow the straight line
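The coordinates of a Q-Q plot can be sketched as follows: the sorted observations are paired with the standard normal quantiles at the same cumulative positions. The plotting positions (i - 0.5)/n, the function name `qq_points` and the data are assumptions for illustration; statistical software may use slightly different positions.

```python
from statistics import NormalDist

def qq_points(data):
    """Pair each sorted observation with its theoretical standard normal quantile."""
    n = len(data)
    std_normal = NormalDist()
    # Plotting positions (i - 0.5) / n are an assumption of this sketch
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, sorted(data)))

# Hypothetical body weights in kg (illustrative values only)
weights = [2.7, 2.0, 3.9, 2.3, 3.0, 2.1, 2.9, 3.3, 2.6]

points = qq_points(weights)
for theoretical_q, observed_q in points:
    print(round(theoretical_q, 3), observed_q)
```

If the observations were exactly Normal, the observed quantiles would be a linear function of the theoretical ones, with the intercept estimating the mean and the slope estimating the standard deviation.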

Descriptive statistics by Sex


Statistics > Summaries > Numerical summaries

Variable: BODYwt; Summarize by: Sex; then select the statistics you want to compute

• The distributions differ in mean, variability and shape.

• The mean is higher in males than in females.

• Males have more variability (sd and IQR) than females.

• The distribution of females is skewed to the right.

• The distribution of males presents some negative kurtosis (flatter than a Normal distribution).

• The joint distribution of both sexes is a mixture of two distributions (one for each sex).

       mean      sd         IQR   skewness   kurtosis     BODYwt:n
    F  2.359574  0.2739879  0.35  0.9282730   0.03983634  47
    M  2.900000  0.4674844  0.70  0.1330474  -0.73189071  97
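The statement that the joint distribution is a mixture of the two sexes can be checked from the table alone. The sketch below recovers the overall mean as a weighted mean of the group means (the grouped-data formula, with each sex's proportion as weight) and the overall standard deviation by adding the within-sex and between-sex sums of squares. Only the values in the table are used; the variable names are hypothetical.

```python
from math import sqrt

# Group summaries copied from the R Commander output by Sex
groups = {
    "F": {"n": 47, "mean": 2.359574, "sd": 0.2739879},
    "M": {"n": 97, "mean": 2.900000, "sd": 0.4674844},
}

n_total = sum(g["n"] for g in groups.values())

# Overall mean: group means weighted by the proportion of each sex
overall_mean = sum(g["n"] / n_total * g["mean"] for g in groups.values())

# Total sum of squares = within-group part + between-group part
ss_within = sum((g["n"] - 1) * g["sd"] ** 2 for g in groups.values())
ss_between = sum(g["n"] * (g["mean"] - overall_mean) ** 2 for g in groups.values())
overall_sd = sqrt((ss_within + ss_between) / (n_total - 1))

print(n_total, round(overall_mean, 6), round(overall_sd, 6))
```

The results agree with the overall summary (n = 144, mean 2.723611, sd 0.4853066) up to rounding, and show that part of the overall variability comes from the difference between the sex means.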

Studying normality. Density by Sex

Graphs > Density estimate

Variable: BODYwt; Plot by groups: Sex

This plot shows how the distribution of females is skewed to the right and suggests that the means are different

Studying normality. Boxplot by Sex

The distribution in females is skewed to the right (long upper whisker)

Graphs > Boxplot

Variable: BODYwt; Plot by groups: Sex

Studying normality. Q-Q plots by Sex

Graphs > Quantile-Comparison (QQ) plots

Variable: BODYwt; Plot by groups: Sex

The distribution of males seems to fit a Normal distribution better than that of females