EXPLORATORY DATA ANALYSIS
Jesús Piedrafita
Departament de Ciència Animal i dels Aliments
Curso de Formación de Personal Investigador Usuario de Animales para Experimentación
Learning objectives
Distinguish between causation and association studies.
State the concept of Exploratory Data Analysis and its utility.
Distinguish between population and sample (data).
Distinguish between quantitative and qualitative variables.
List and define the main descriptive statistics: central tendency, variability, shape of the distribution and relative position.
Import data from Excel to R Commander and calculate descriptive statistics.
Define the Normal distribution and use graphical tests to check normality.
Two types of studies
• EXPERIMENT: allows us to infer whether a particular treatment causes changes in a variable of interest (t-test, ANOVA)
• DESCRIPTIVE ("survey"): information on a situation or process that cannot be controlled is collected and analysed to establish associations (correlation and regression)
Central dogma of Statistics
http://bolt.mph.ufl.edu/6050-6052/unit-1/
Exploratory data analysis (EDA)
• EDA consists of:
– organizing and summarizing the raw data,
– discovering important features and patterns in the data and any striking deviations from those patterns, and then
– interpreting our findings in the context of the problem
• It can be useful for:
– describing the distribution of a single variable (center, spread, shape, outliers)
– checking data (for errors or other problems)
– checking the assumptions of more complex statistical analyses
– investigating relationships between variables
Variables
• Quantitative (numerical)
– Continuous (adult weight, percent of a fatty acid, … )
– Discrete: countable, finite or infinite (number of colonies, litter size)
• Qualitative (categorical or classification)
– Ordinal (calving ease score, panel score)
– Nominal (gender, coat colour, …)
• Bar charts are preferable to pie charts
Variable: Set of observations of a particular character
Data: Values of a variable
Summary of numerical methods for presenting data
Descriptive statistics
• Measures of central tendency: arithmetic mean, median, mode
• Measures of variability: range, variance, standard deviation, coefficient of variation
• Measures of the shape of a distribution: skewness, kurtosis
• Measures of relative position: percentiles, quartiles (Q1, Q2, Q3), z-values
Descriptive Statistics attempts to describe the distribution of the data
Measures of central tendency
Arithmetic mean:
$$\bar{y} = \frac{\sum_i y_i}{n} \quad \text{or} \quad \bar{y} = \sum_i f_i y_i$$
The second formula is for grouped data, $f_i$ being the proportion of each value $y_i$.
Median: value that is in the middle when observations are sorted from the smallest to the largest. Unlike the mean, it is robust to the presence of extreme values.
Mode: value among the observations that has the highest frequency
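A minimal sketch of the three measures, written in Python only for illustration (the course itself uses R Commander). The litter sizes are hypothetical values, not course data; one extreme value is included to show how it pulls the mean but not the median.

```python
from collections import Counter
from statistics import median, mode

# Hypothetical litter sizes (a discrete variable); not data from the course.
litter_sizes = [8, 10, 10, 11, 12, 10, 9, 30]
n = len(litter_sizes)

# First formula: sum of the observations divided by n.
mean_raw = sum(litter_sizes) / n

# Second formula (grouped data): sum of f_i * y_i,
# where f_i is the proportion of observations equal to y_i.
proportions = {value: count / n for value, count in Counter(litter_sizes).items()}
mean_grouped = sum(f * y for y, f in proportions.items())

middle_value = median(litter_sizes)   # robust to the extreme value 30
most_frequent = mode(litter_sizes)    # value with the highest frequency

print(mean_raw, mean_grouped, middle_value, most_frequent)
```

Both formulas return the same mean, which lies above the median here because of the single extreme observation.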
Measures of variability
Sample variance of n observations:
$$s^2 = \frac{\sum_i (y_i - \bar{y})^2}{n - 1}$$
where the sum of squares can be computed as
$$SS = \sum_i (y_i - \bar{y})^2 = \sum_i y_i^2 - \frac{\left(\sum_i y_i\right)^2}{n}$$
A larger variance indicates more dispersion.
Range: difference between the maximum and the minimum values in a set of observations. Strongly affected by extreme values.
Sample standard deviation:
$$s = \sqrt{s^2}$$
It keeps the unit of measurement of the raw data. Both the variance and the standard deviation are affected by extreme values.
Coefficient of variation: a relative, dimensionless measure of variability.
$$CV = \frac{s}{\bar{y}} \, (100\%)$$
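The measures of variability can be sketched as follows, again in Python purely as an illustration. The body weights are hypothetical, and the two ways of computing the sum of squares (squared deviations from the mean, and sum of squares minus the squared total over n) are checked against each other.

```python
from math import sqrt

# Hypothetical body weights in kg; not data from the course.
weights = [2.0, 2.3, 2.7, 3.0, 3.5]
n = len(weights)
y_bar = sum(weights) / n

# Sum of squares, definitional form: squared deviations from the mean.
ss_definition = sum((y - y_bar) ** 2 for y in weights)
# Sum of squares, computational form: sum(y^2) - (sum(y))^2 / n.
ss_shortcut = sum(y ** 2 for y in weights) - sum(weights) ** 2 / n

variance = ss_definition / (n - 1)         # s^2, in kg^2
std_dev = sqrt(variance)                   # s, in kg like the raw data
cv_percent = 100 * std_dev / y_bar         # dimensionless, as a percentage
value_range = max(weights) - min(weights)  # maximum minus minimum

print(variance, std_dev, cv_percent, value_range)
```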
Measures of the shape of a distribution
Skewness: measure of asymmetry of a frequency distribution. It is 0 for a symmetric distribution, positive (+) when the right tail is longer and negative (-) when the left tail is longer.
$$sk = \frac{n}{(n-1)(n-2)} \sum_i \left(\frac{y_i - \bar{y}}{s}\right)^3$$
Kurtosis: measure of flatness or steepness of a distribution, or a measure of the heaviness of the tails of a distribution. It is 0 for a normal distribution.
$$kt = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_i \left(\frac{y_i - \bar{y}}{s}\right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)}$$
https://en.wikipedia.org/wiki/Skewness
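A minimal sketch of both shape statistics, assuming the bias-corrected sample formulas with the factors n/((n-1)(n-2)) for skewness and n(n+1)/((n-1)(n-2)(n-3)) for kurtosis. The function names and the two small data sets are hypothetical, and Python is used only for illustration.

```python
from math import sqrt


def sample_sd(values):
    """Sample standard deviation with n - 1 in the denominator."""
    n = len(values)
    y_bar = sum(values) / n
    return sqrt(sum((y - y_bar) ** 2 for y in values) / (n - 1))


def skewness(values):
    """n / ((n-1)(n-2)) * sum(((y_i - y_bar) / s)^3); needs n >= 3."""
    n = len(values)
    y_bar = sum(values) / n
    s = sample_sd(values)
    return n / ((n - 1) * (n - 2)) * sum(((y - y_bar) / s) ** 3 for y in values)


def kurtosis(values):
    """Excess kurtosis (0 expected for a normal distribution); needs n >= 4."""
    n = len(values)
    y_bar = sum(values) / n
    s = sample_sd(values)
    fourth_powers = sum(((y - y_bar) / s) ** 4 for y in values)
    leading = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * fourth_powers
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return leading - correction


symmetric = [1, 2, 3, 4, 5]       # hypothetical, symmetric around 3
right_skewed = [1, 2, 2, 3, 10]   # hypothetical, long right tail

print(skewness(symmetric), kurtosis(symmetric), skewness(right_skewed))
```

The symmetric sample gives a skewness of zero and a negative kurtosis (flatter than a normal distribution), while the sample with a long right tail gives a positive skewness.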
Measures of the relative position
Percentiles: the percentile value (p) of an observation $y_i$ in a data set means that 100p% of the observations are smaller than $y_i$ and 100(1-p)% are greater than $y_i$.
Quartiles: percentiles 25% (Q1 or lower quartile), 50% (Q2 or median) and 75% (Q3 or upper quartile).
z-value: deviation of an observation from the mean, expressed in standard deviation units:
$$z_i = \frac{y_i - \bar{y}}{s}$$
IQR: Interquartile range. Q3-Q1. Little affected by extreme values (outliers).
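The measures of relative position can be sketched with Python's standard library, used here only as an illustration. The weights are hypothetical. The choice `method="inclusive"` is an assumption: it interpolates between sorted observations, and other software may use a slightly different quantile definition.

```python
from statistics import mean, quantiles, stdev

# Hypothetical body weights in kg; not data from the course.
weights = [2.0, 2.1, 2.3, 2.6, 2.7, 2.9, 3.0, 3.3, 3.9]

# Quartiles: the three cut points that split the sorted data into four parts.
q1, q2, q3 = quantiles(weights, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range, little affected by extreme values

# z-values: deviations from the mean in standard deviation units.
y_bar = mean(weights)
s = stdev(weights)
z_values = [(y - y_bar) / s for y in weights]

print(q1, q2, q3, iqr)
print([round(z, 2) for z in z_values])
```

By construction the z-values sum to zero, and the heaviest observation has the largest positive z-value.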
A first dataset
Records of body weight (kg) of cats, both females and males (from the MASS package of R).
We have saved the data in an Excel file, which also contains two more variables: Sex (M, F) and heart weight (g).
We want to carry out an exploratory analysis using the R Commander software.
Importing a dataset from Excel
Descriptive statistics
Data: catsWeight
Statistics > Summaries > Numerical summaries
Variable: BODYwt; then select the statistics we want to compute
• Note that R Commander right-aligns the output columns.
• The coefficient of variation (cv) must be multiplied by 100 to be expressed as a percentage.
• There is some positive skewness and negative kurtosis.
• The lowest and highest observations (which define the range) are the 0% and 100% percentiles.
    mean        sd   IQR       cv  skewness
2.723611 0.4853066 0.725 0.178185 0.4786244
  kurtosis 0% 25% 50%   75% 100%   n
-0.6738393  2 2.3 2.7 3.025  3.9 144
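As a quick check of how the figures in this output relate to each other, the following sketch recomputes the coefficient of variation (as a percentage), the IQR and the range from the reported values. Python is used only for illustration and the variable names are hypothetical.

```python
# Values copied from the R Commander output for BODYwt.
mean_bw, sd_bw = 2.723611, 0.4853066
minimum, q1, q2, q3, maximum = 2.0, 2.3, 2.7, 3.025, 3.9

cv_percent = 100 * sd_bw / mean_bw  # cv multiplied by 100, as a percentage
iqr = q3 - q1                       # should match the reported IQR (0.725)
value_range = maximum - minimum     # 100% percentile minus 0% percentile

print(round(cv_percent, 2), round(iqr, 3), round(value_range, 1))
```

The coefficient of variation comes out at about 17.8%, and the mean is slightly above the median (50% percentile), in line with the positive skewness.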
Normal distribution
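A random variable Y follows a Normal distribution, $Y \sim N(\mu, \sigma^2)$, if its p.d.f. is
$$f(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(y-\mu)^2}{2\sigma^2}}$$
Figures: Gauss; standard normal distribution
http://en.wikipedia.org/wiki/Carl_Friedrich_Gauss
http://en.wikipedia.org/wiki/Normal_distribution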
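A minimal sketch of the Normal p.d.f., f(y) = exp(-(y - mu)^2 / (2 sigma^2)) / sqrt(2 pi sigma^2), written in Python for illustration only. The function name `normal_pdf` and the parameter values are hypothetical; the result is cross-checked against `statistics.NormalDist` from the standard library.

```python
from math import exp, pi, sqrt
from statistics import NormalDist


def normal_pdf(y, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at y."""
    return exp(-((y - mu) ** 2) / (2 * sigma ** 2)) / sqrt(2 * pi * sigma ** 2)


# Standard normal (mu = 0, sigma = 1): the density peaks at the mean.
peak = normal_pdf(0.0, 0.0, 1.0)

# Cross-check with the standard library for arbitrary (hypothetical) parameters.
own_value = normal_pdf(3.0, 2.7, 0.5)
library_value = NormalDist(mu=2.7, sigma=0.5).pdf(3.0)

print(peak, own_value, library_value)
```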
Testing normality
Normality is assumed both for the variables and for the distribution of parameter estimates. The latter is determined by statistical reasoning, but the fit of real data to a given distribution must be tested. In the case of the normal distribution:
1. Graphical tests:
i. Density
ii. Box-plot
iii. QQ-plot
2. Numerical tests (in addition to skewness and kurtosis):
i. Shapiro-Wilk (for sample sizes 7 – 2000).
ii. Kolmogorov-Smirnov, Cramer-Von Mises, Anderson-Darling, etc., not described in this course.
Studying normality. Density
Graphs > Density estimate
Variable: BODYwt
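To give an idea of what a density estimate computes, here is a minimal sketch of a Gaussian kernel density estimator in Python. It is an illustration under stated assumptions, not R Commander's implementation: the data are hypothetical, the function name `kernel_density` is made up, and the bandwidth follows a common normal-reference rule of thumb that may differ from the software's default.

```python
from math import exp, pi, sqrt
from statistics import stdev


def kernel_density(data, bandwidth):
    """Return a function that averages one Gaussian bump per observation."""
    n = len(data)

    def density(x):
        bumps = sum(exp(-0.5 * ((x - y) / bandwidth) ** 2) for y in data)
        return bumps / (n * bandwidth * sqrt(2 * pi))

    return density


# Hypothetical body weights in kg; not data from the course.
weights = [2.0, 2.1, 2.3, 2.6, 2.7, 2.9, 3.0, 3.3, 3.9]

# Rule-of-thumb bandwidth (an assumption): 1.06 * s * n^(-1/5).
bandwidth = 1.06 * stdev(weights) * len(weights) ** (-1 / 5)
density = kernel_density(weights, bandwidth)

# The estimated curve should enclose an area close to 1 (Riemann sum, 0 to 6 kg).
step = 0.01
area = sum(density(step * k) for k in range(601)) * step
print(round(area, 3))
```

A smaller bandwidth gives a more irregular curve and a larger one a smoother curve, so the apparent shape of the distribution depends partly on this choice.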
Some more theory on the normal distribution
(Logan, 2010)
Studying normality. Box (and whiskers)-plot
Figure labels: median, IQR (box) and whiskers (maximum length = 1.5 IQR)
Graphs > Boxplot
No outliers in our dataset
The distribution is skewed to the right
Extreme observations (outliers) would have been marked beyond the end of one or both whiskers with a small circle (o)
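The rule behind the whiskers can be checked by hand with the quartiles from the numerical summary of BODYwt. This is a sketch in Python for illustration only; it assumes that those quartiles are close enough to the box limits actually drawn by the software, which may compute them slightly differently.

```python
# Quartiles and extremes copied from the numerical summary of BODYwt.
minimum, q1, q3, maximum = 2.0, 2.3, 3.025, 3.9

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr  # a whisker cannot extend below this value
upper_fence = q3 + 1.5 * iqr  # a whisker cannot extend above this value

# An observation is flagged as an outlier only if it lies beyond a fence.
has_outliers = minimum < lower_fence or maximum > upper_fence
print(round(lower_fence, 4), round(upper_fence, 4), has_outliers)
```

Both the minimum and the maximum fall inside the fences, which agrees with the absence of outliers in the plot.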
Studying normality. Q-Q plots
Graphs > Quantile-Comparison (QQ) plots
Q-Q stands for quantile-quantile: theoretical normal quantiles are plotted against observed quantiles
The points are expected to follow a straight line if the data come from a Normal distribution, whatever its mean and standard deviation
Several observations do not follow the straight line
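The coordinates of a Q-Q plot can be computed as in the following sketch, written in Python for illustration. The data are hypothetical, the plotting positions (i - 0.5)/n are one common convention and an assumption here, and the reference line through mean + sd × quantile is a simple choice that may differ from the line the software draws.

```python
from statistics import NormalDist, mean, stdev

# Hypothetical body weights in kg; not data from the course.
weights = [2.0, 2.1, 2.3, 2.6, 2.7, 2.9, 3.0, 3.3, 3.9]
n = len(weights)

# Observed quantiles: the sorted observations.
observed = sorted(weights)

# Theoretical quantiles of the standard normal at positions (i - 0.5) / n.
probabilities = [(i - 0.5) / n for i in range(1, n + 1)]
theoretical = [NormalDist().inv_cdf(p) for p in probabilities]

# Under normality, observed values should lie close to mean + sd * quantile.
reference = [mean(weights) + stdev(weights) * q for q in theoretical]

for q, y, ref in zip(theoretical, observed, reference):
    print(f"{q:6.2f} {y:5.2f} {ref:5.2f}")
```

Plotting `observed` against `theoretical` gives the Q-Q plot; systematic departures from the reference values at one end suggest skewness.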
Descriptive statistics by Sex
Statistics > Summaries > Numerical summaries
Variable: BODYwt; Summarize by: Sex; then select the statistics you want to compute
• The distributions differ in mean, variability and shape.
• The mean is higher in males than in females.
• Males have more variability (sd and IQR) than females.
• The distribution of females is skewed to the right.
• The distribution of males presents some negative kurtosis (flatter than a Normal distribution).
• The joint distribution of both sexes is a mixture of two distributions (one for each sex).
      mean        sd  IQR  skewness    kurtosis BODYwt:n
F 2.359574 0.2739879 0.35 0.9282730  0.03983634       47
M 2.900000 0.4674844 0.70 0.1330474 -0.73189071       97
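The link between the summaries by sex and the overall summary can be verified with a short calculation, sketched here in Python for illustration (variable names are hypothetical). The overall mean is the average of the two group means weighted by group size, which is one way to see the joint distribution as a mixture of the two.

```python
# Values copied from the numerical summaries by Sex and for the whole sample.
n_f, mean_f, sd_f = 47, 2.359574, 0.2739879
n_m, mean_m, sd_m = 97, 2.900000, 0.4674844
overall_mean = 2.723611

# Weighted average of the group means, with group sizes as weights.
pooled_mean = (n_f * mean_f + n_m * mean_m) / (n_f + n_m)

# Relative variability within each sex, as a percentage.
cv_f = 100 * sd_f / mean_f
cv_m = 100 * sd_m / mean_m

print(round(pooled_mean, 6), round(cv_f, 1), round(cv_m, 1))
```

The weighted mean reproduces the overall mean of BODYwt, and males remain more variable than females even in relative terms.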
Studying normality. Density by Sex
Graphs > Density estimate
Variable: BODYwt; Plot by groups: Sex
This plot shows how the distribution of females is skewed to the right and suggests that the means are different
Studying normality. Boxplot by Sex
The distribution in females is skewed to the right (large upper whisker)
Graphs > Boxplot
Variable: BODYwt; Plot by groups: Sex
Studying normality. Q-Q plots by Sex
Graphs > Quantile-Comparison (QQ) plots
Variable: BODYwt; Plot by groups: Sex
The distribution of males seems to fit a Normal distribution better than that of females