Data Management and Statistical Analysis - Descriptive Statistics
Leilani A. Nora
Assistant Scientist
Descriptive Statistics
Introduction to R:
Data Manipulation and Statistical
Analysis
DATA FRAME : data.serial
• Consider serialized data with 3 Sites, 3 Treatments, 4 reps and a response variable Y. Site C has no rows for Treatment 3, and NA marks a missing value.
Site Trt Rep Y
A 1 1 3
A 1 2 6
A 1 3 8
A 1 4 5
A 2 1 4
A 2 2 4
A 2 3 6
A 2 4 9
A 3 1 7
A 3 2 4
A 3 3 2
A 3 4 4
B 1 1 3
B 1 2 6
B 1 3 5
B 1 4 NA
B 2 1 7
B 2 2 0
B 2 3 8
B 2 4 2
B 3 1 5
B 3 2 7
B 3 3 4
B 3 4 4
C 1 1 8
C 1 2 NA
C 1 3 8
C 1 4 6
C 2 1 5
C 2 2 4
C 2 3 4
C 2 4 7
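To follow the examples interactively, the table above can be entered as a data frame. This is a minimal sketch, assuming Site should be a factor (the summary() output later in the slides reports level counts for it) and that Trt, Rep and Y are numeric:

```r
# Rebuild data.serial from the table: 12 rows each for Sites A and B, 8 for Site C
data.serial <- data.frame(
  Site = factor(rep(c("A", "B", "C"), times = c(12, 12, 8))),
  Trt  = c(rep(1:3, each = 4), rep(1:3, each = 4), rep(1:2, each = 4)),
  Rep  = rep(1:4, times = 8),
  Y    = c(3, 6, 8, 5, 4, 4, 6, 9, 7, 4, 2, 4,    # Site A
           3, 6, 5, NA, 7, 0, 8, 2, 5, 7, 4, 4,   # Site B
           8, NA, 8, 6, 5, 4, 4, 7)               # Site C
)

str(data.serial)    # 32 observations of 4 variables
head(data.serial)
```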
SUMMARY STATISTICS
• R contains all the basic tools for calculating summary
statistics.
• cor() and cov() calculate correlations and covariances
• mean(), median(), sum(), var(), min(), max() and range() are all self-explanatory
• mad() calculates the median absolute deviation
• quantile() computes various quantiles of the data
• summary() is discussed on the next slide
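A short sketch of a few of the functions listed above, applied to the Y column of the example data (entered here as a plain vector so the block runs on its own). Most of these functions return NA when the data contain missing values unless na.rm=TRUE is given:

```r
# Y values of the example data, in row order (Sites A, B, C)
Y <- c(3, 6, 8, 5, 4, 4, 6, 9, 7, 4, 2, 4,
       3, 6, 5, NA, 7, 0, 8, 2, 5, 7, 4, 4,
       8, NA, 8, 6, 5, 4, 4, 7)

mean(Y)                          # NA, because Y contains missing values
mean(Y, na.rm = TRUE)            # 5.166667
median(Y, na.rm = TRUE)          # 5
range(Y, na.rm = TRUE)           # 0 9
quantile(Y, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)   # 4 5 7
mad(Y, na.rm = TRUE)             # 1.4826 (median absolute deviation, scaled by 1.4826)
```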
SUMMARY STATISTICS : summary()
• Used to obtain descriptive statistics of a data frame or of a specific variable.
• Output is the min, quartiles, median, mean, max and the count of NA's.
• Ex1. To obtain summary statistics for the variable Y
> summary(data.serial$Y)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  0.000   4.000   5.000   5.167   7.000   9.000   2.000
• Ex2. To obtain summary statistics for all the columns of a data frame
> summary(data.serial)
 Site        Trt             Rep              Y
 A:12   Min.   :1.000   Min.   :1.00   Min.   :0.000
 B:12   1st Qu.:1.000   1st Qu.:1.75   1st Qu.:4.000
 C: 8   Median :2.000   Median :2.50   Median :5.000
        Mean   :1.875   Mean   :2.50   Mean   :5.167
        3rd Qu.:2.250   3rd Qu.:3.25   3rd Qu.:7.000
        Max.   :3.000   Max.   :4.00   Max.   :9.000
                                       NA's   :2.000
SUMMARY STATISTICS : length()
• Used to obtain the number of data points of a variable, say Y. Missing values (NA) are included in the count.
> length(data.serial$Y)
[1] 32
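Because length() counts every element, it can help to count the missing and non-missing values separately. A small sketch using the Y values of the example data as a plain vector:

```r
# Y values of the example data, in row order (Sites A, B, C)
Y <- c(3, 6, 8, 5, 4, 4, 6, 9, 7, 4, 2, 4,
       3, 6, 5, NA, 7, 0, 8, 2, 5, 7, 4, 4,
       8, NA, 8, 6, 5, 4, 4, 7)

length(Y)         # 32: every element, NA's included
sum(is.na(Y))     # 2:  number of missing values
sum(!is.na(Y))    # 30: number of values actually used when na.rm=TRUE
```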
SUMMARY STATISTICS : var() and sd()
• var() is used to obtain the variance of Y
> Y.VAR <- var(data.serial$Y, na.rm=TRUE)
> Y.VAR
[1] 4.488506
• sd() is used to obtain the standard deviation of Y
> Y.STD <- sd(data.serial$Y, na.rm=TRUE)
> Y.STD
[1] 2.118609
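As a quick check on how the two quantities relate, the sketch below recomputes the sample variance by hand (divisor n - 1) and confirms that the standard deviation is its square root. Y is entered as a plain vector, and the helper names y.obs and n are only for illustration:

```r
# Y values of the example data, in row order (Sites A, B, C)
Y <- c(3, 6, 8, 5, 4, 4, 6, 9, 7, 4, 2, 4,
       3, 6, 5, NA, 7, 0, 8, 2, 5, 7, 4, 4,
       8, NA, 8, 6, 5, 4, 4, 7)

Y.VAR <- var(Y, na.rm = TRUE)
Y.STD <- sd(Y, na.rm = TRUE)

# Sample variance by hand: sum of squared deviations divided by (n - 1)
y.obs <- Y[!is.na(Y)]
n     <- length(y.obs)
sum((y.obs - mean(y.obs))^2) / (n - 1)   # same value as Y.VAR

all.equal(sqrt(Y.VAR), Y.STD)            # TRUE
var(Y)                                   # NA, because na.rm defaults to FALSE
```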
SUMMARY STATISTICS : tapply()
• tapply() applies a function to a variable separately within each (non-empty) group
> tapply(X, INDEX, FUN)
# X – an object, typically a vector
# INDEX – list of factors, each of the same length as X
# FUN – function to be applied
• Ex1. To obtain a separate summary of Y for each Site
> tapply(data.serial$Y, data.serial$Site,
summary)
$A
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 4.000 4.500 5.167 6.250 9.000
$B
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.000 3.500 5.000 4.636 6.500 8.000 1.000
$C
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
4.0 4.5 6.0 6.0 7.5 8.0 1.0
SUMMARY STATISTICS : tapply()
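• Ex2. To obtain a separate standard deviation of Y for each Site. na.rm=TRUE is passed on to sd(); without it, Sites B and C would return NA because they contain missing values.
> tapply(data.serial$Y, data.serial$Site,
sd, na.rm=TRUE)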
A B C
2.081666 2.377929 1.732051
SUMMARY STATISTICS : tapply()
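• Ex3. To obtain a separate mean of Y for each Site x Trt combination. na.rm=TRUE is passed on to mean(); the Site C x Trt 3 cell is NA because that combination has no data.
> tapply(data.serial$Y,
list(data.serial$Site,
data.serial$Trt), mean, na.rm=TRUE)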
1 2 3
A 5.500000 5.75 4.25
B 4.666667 4.25 5.00
C 7.333333 5.00 NA
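FUN in tapply() does not have to be a built-in function. As an illustration, the sketch below counts the non-missing Y values in each Site x Trt cell with an anonymous function. The data are entered as plain vectors so the block runs on its own, and the name n.obs is only for illustration:

```r
# Example data as plain vectors (Site C has no Trt 3)
Site <- factor(rep(c("A", "B", "C"), times = c(12, 12, 8)))
Trt  <- c(rep(1:3, each = 4), rep(1:3, each = 4), rep(1:2, each = 4))
Y    <- c(3, 6, 8, 5, 4, 4, 6, 9, 7, 4, 2, 4,
          3, 6, 5, NA, 7, 0, 8, 2, 5, 7, 4, 4,
          8, NA, 8, 6, 5, 4, 4, 7)

# Number of non-missing observations behind each cell mean
n.obs <- tapply(Y, list(Site, Trt), function(v) sum(!is.na(v)))
n.obs
# Cells B x 1 and C x 1 have 3 observations, the other filled cells have 4,
# and the empty C x 3 cell is NA
```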
SUMMARY STATISTICS : doBy Package
• The doBy package is used to calculate groupwise summary statistics in a simple way, much in the spirit of PROC SUMMARY of the SAS system.
SUMMARY STATISTICS : summaryBy()
• Used for calculating quantities like the "mean and variance" of a variable for each combination of two or more factors.
• Usage
> summaryBy(formula, data, FUN=mean,
keep.names=FALSE, order=TRUE, na.rm=TRUE, ...)
# formula – a formula object, say Y~Site
# data – a data frame
# FUN – a function or a list of functions to be applied
# keep.names – logical, if TRUE and if there is only ONE function in FUN, then the variables in the output will have the same names as the variables in the input
# order – logical, if TRUE the resulting data frame is ordered according to the variables on the right hand side of the formula
# na.rm – not an argument of summaryBy() itself; it is passed on (through ...) to the functions in FUN
• Ex1. To obtain Site x Trt summary of means for Y
> library(doBy)
> summaryBy(Y~Site+Trt, data=data.serial,
na.rm=TRUE)
Site Trt Y.mean
1 A 1 5.500000
2 A 2 5.750000
3 A 3 4.250000
4 B 1 4.666667
5 B 2 4.250000
6 B 3 5.000000
7 C 1 7.333333
8 C 2 5.000000
SUMMARY STATISTICS : summaryBy()
• Ex2. To obtain Site x Trt summary of minimum, mean,
maximum, variance and standard deviation of Y using
predefined functions.
> summaryBy(Y~Site+Trt, data=data.serial,
FUN=c(min, mean, max, var, sd), na.rm=TRUE)
Site Trt Y.min Y.mean Y.max Y.var Y.sd
1 A 1 3 5.500000 8 4.333333 2.081666
2 A 2 4 5.750000 9 5.583333 2.362908
3 A 3 2 4.250000 7 4.250000 2.061553
4 B 1 3 4.666667 6 2.333333 1.527525
5 B 2 0 4.250000 8 14.916667 3.862210
6 B 3 4 5.000000 7 2.000000 1.414214
7 C 1 6 7.333333 8 1.333333 1.154701
8 C 2 4 5.000000 7 2.000000 1.414214
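If the doBy package is not installed, similar groupwise means can be obtained with aggregate() from base R. This is a different function swapped in for comparison, not part of doBy; the sketch assumes the example data entered as below, and the name agg is only for illustration:

```r
# Example data (Site C has no Trt 3)
data.serial <- data.frame(
  Site = factor(rep(c("A", "B", "C"), times = c(12, 12, 8))),
  Trt  = c(rep(1:3, each = 4), rep(1:3, each = 4), rep(1:2, each = 4)),
  Y    = c(3, 6, 8, 5, 4, 4, 6, 9, 7, 4, 2, 4,
           3, 6, 5, NA, 7, 0, 8, 2, 5, 7, 4, 4,
           8, NA, 8, 6, 5, 4, 4, 7)
)

# The formula interface drops rows with NA by default (na.action = na.omit),
# so no na.rm argument is needed here
agg <- aggregate(Y ~ Site + Trt, data = data.serial, FUN = mean)
agg    # one row per non-empty Site x Trt combination (8 rows)
```

Note that the rows of the aggregate() result are ordered with Site varying fastest, so the row order differs from the summaryBy() output above.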
HISTOGRAM
> hist(data.serial$Y, main='Histogram of Y',
col='yellow2', border='tomato1',
freq=FALSE, xlab="Y Class",
ylab="Probability", xlim=c(0, 20))
# freq – logical, if FALSE probability densities are plotted so that the histogram has a total area of one.
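The "total area of one" property can be checked numerically. hist() returns its bin breaks and densities, so the sketch below computes the histogram without drawing it (plot=FALSE) and sums bar height times bar width. Y is entered as a plain vector, and the names h and area are only for illustration:

```r
# Y values of the example data, in row order (Sites A, B, C)
Y <- c(3, 6, 8, 5, 4, 4, 6, 9, 7, 4, 2, 4,
       3, 6, 5, NA, 7, 0, 8, 2, 5, 7, 4, 4,
       8, NA, 8, 6, 5, 4, 4, 7)

# Compute the histogram without plotting; missing values are dropped by hist()
h <- hist(Y, plot = FALSE)

h$breaks     # bin boundaries
h$density    # bar heights used when freq=FALSE

# Area of the histogram: sum over bins of height x width
area <- sum(h$density * diff(h$breaks))
all.equal(area, 1)    # TRUE
```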
DENSITY PLOT: seq()
• seq(from, to, length) generates a regular sequence; here, 100 equally spaced values from 0 to 20.
> x <- seq(from=0, to=20, length=100)
> x
[1] 0.0000000 0.2020202 0.4040404 0.6060606 0.8080808 1.0101010
[7] 1.2121212 1.4141414 1.6161616 1.8181818 2.0202020 2.2222222
. . .
[97] 19.3939394 19.5959596 19.7979798 20.0000000
dnorm(x, mean, sd)
• dnorm() is used to obtain the normal probability density at x, given the values of mean and sd.
> y <- dnorm(x,
mean(data.serial$Y, na.rm=TRUE),
sd(data.serial$Y, na.rm=TRUE))
> y
DENSITY PLOT : lines()
> lines(x, y)
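The values returned by dnorm() are heights of the normal curve (densities), not probabilities; probabilities come from areas under the curve. A small sketch, assuming the normal curve is fitted with the sample mean and standard deviation of Y as in the slides (the names m and s are only for illustration):

```r
# Y values of the example data, in row order (Sites A, B, C)
Y <- c(3, 6, 8, 5, 4, 4, 6, 9, 7, 4, 2, 4,
       3, 6, 5, NA, 7, 0, 8, 2, 5, 7, 4, 4,
       8, NA, 8, 6, 5, 4, 4, 7)

m <- mean(Y, na.rm = TRUE)
s <- sd(Y, na.rm = TRUE)

x <- seq(from = 0, to = 20, length.out = 100)
y <- dnorm(x, m, s)          # curve heights at each x

# The curve is highest at the mean, where its height is 1 / (s * sqrt(2 * pi))
all.equal(dnorm(m, m, s), 1 / (s * sqrt(2 * pi)))   # TRUE

# The total area under the curve is 1
integrate(dnorm, lower = -Inf, upper = Inf, mean = m, sd = s)$value

# A probability is an area, obtained with pnorm(): P(4 < Y <= 7) under the fitted normal
pnorm(7, m, s) - pnorm(4, m, s)
```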
HISTOGRAM WITH DENSITY PLOT: mtext()
• mtext(text, side=3, ...) displays text in the margin of the plot; with side=3 (the default) the text appears on top of the plot
> mtext("Fitting to a normal distribution")
# text – a character expression specifying the text to be written
# side – on which side of the plot you want to display the text
1 – bottom, 2 – left, 3 – top, 4 – right
CASE1. HISTOGRAM WITH DENSITY PLOT
# RF is a separate data frame with a column RLD0; it is not listed in these slides
> hist(RF$RLD0, main='Histogram of RLD0',
col='plum4', border='black', br=5,
xlab="RLD0 Class",
ylab="Probability",
freq=FALSE,
xlim=c(0, 20))
> x <- seq(from=0, to=20, length=100)
> x
> y <- dnorm(x,
mean(data.serial$Y, na.rm=TRUE),
sd(data.serial$Y, na.rm=TRUE))
> lines(x, y)
> mtext("Fitting to a normal distribution")
HISTOGRAM WITH DENSITY PLOT: lines(), dnorm(), and mtext()
[Figure: "Histogram of Y with Density plot"; x-axis: Y class (0 to 10); y-axis: Probability (0.00 to 0.20); margin text: "Fitting to a normal distribution"]
BOXPLOT
• Ex1. To obtain a boxplot of Y with other graphics parameters
> boxplot(data.serial$Y,
boxwex=0.35,
main="Boxplot of Y",
xlab="Y",
horizontal=TRUE)
# boxwex – controls the width of the boxplot
# horizontal – logical, if TRUE, the boxplot is plotted horizontally
[Figure: horizontal boxplot titled "Boxplot of Y"; x-axis: Y (0 to 8)]
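BOXPLOT : boxplot()
• To obtain a separate boxplot of Y for each Site, either of the following two calls can be used
> boxplot(split(data.serial$Y, data.serial$Site))
> boxplot(Y~Site, data=data.serial)
[Figure: boxplots of Y for Sites A, B and C; y-axis: 0 to 8]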
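The numbers behind the per-Site boxplots can also be inspected directly. A minimal sketch: with plot=FALSE, boxplot() returns the statistics it would draw instead of producing a plot. The example data are re-entered so the block runs on its own, and the name b is only for illustration:

```r
# Example data (only the columns needed here)
data.serial <- data.frame(
  Site = factor(rep(c("A", "B", "C"), times = c(12, 12, 8))),
  Y    = c(3, 6, 8, 5, 4, 4, 6, 9, 7, 4, 2, 4,
           3, 6, 5, NA, 7, 0, 8, 2, 5, 7, 4, 4,
           8, NA, 8, 6, 5, 4, 4, 7)
)

# Compute the boxplot statistics without drawing
b <- boxplot(Y ~ Site, data = data.serial, plot = FALSE)

b$names        # "A" "B" "C"
b$n            # non-missing observations per Site: 12 11 7
b$stats        # one column per Site: lower whisker, lower hinge, median,
               # upper hinge, upper whisker
b$stats[3, ]   # medians per Site: 4.5 5.0 6.0
```

The hinges reported in b$stats are computed as in fivenum() and can differ slightly from the quartiles that summary() reports.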
THANK YOU! ☺☺☺☺
Please do Exercise C