Summary of Remainder
Harry R. Erwin, PhD
School of Computing and Technology
University of Sunderland
Resources
• Crawley, MJ (2005) Statistics: An Introduction Using R. Wiley.
• Freund, RJ, and WJ Wilson (1998) Regression Analysis, Academic Press.
• Gentle, JE (2002) Elements of Computational Statistics. Springer.
• Gonick, L., and Woollcott Smith (1993) A Cartoon Guide to Statistics. HarperResource (for fun).
Topics
• Multiple Regression
• Contrasts
• Count Data
• Proportion Data
• Survival Data
• Binary Response
• Course Summary
Multiple Regression
• Two or more continuous explanatory variables
• Your problems are not restricted to order. You often lack enough data to examine all the potential interactions and higher-order effects.
– To explore the possibility of a third-order interaction term with three explanatory variables (A:B:C) requires about 3 × 8 = 24 data values.
– If there’s potential for curvature, you need 3 × 3 = 9 more data values to pin that down.
• Be selective. If you are considering an interaction term, you have to consider all the lower-order interactions and the individual explanatory variables in it.
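The parameter counts behind those data requirements can be checked directly in R. The following is a minimal sketch on simulated data; the variable names (A, B, C, y) and all numeric values are illustrative assumptions, not course data.

```r
# Simulated data: every name and value here is an illustrative assumption.
set.seed(1)
n <- 24
d <- data.frame(A = rnorm(n), B = rnorm(n), C = rnorm(n))
d$y <- 1 + 0.5 * d$A - 0.3 * d$B + rnorm(n)

# A * B * C expands to the three main effects, the three two-way
# interactions and the three-way interaction A:B:C.
full <- lm(y ~ A * B * C, data = d)
n_params <- length(coef(full))      # intercept + 3 + 3 + 1

# Allowing for curvature adds one quadratic term per explanatory variable.
curved <- update(full, . ~ . + I(A^2) + I(B^2) + I(C^2))
n_curved <- length(coef(curved))

print(c(full = n_params, with_curvature = n_curved))
```

Writing the model as A * B * C keeps every lower-order term in the model automatically, which is the "be selective, but include the lower-order terms" rule in practice.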
Issues to Consider
• Which explanatory variables to include.
• Curvature in the response to explanatory variables.
• Interactions between explanatory variables. (High order interactions tend to be rare.)
• Correlation between explanatory variables.
• Over-parameterization. (Avoid!)
Contrasts
• Contrasts are the basis of hypothesis testing and model simplification in ANOVA
• When you have more than two levels in a categorical variable, you need to know which levels are meaningful and which can be combined.
• Sometimes you know which ones to combine and sometimes not.
• First do the basic ANOVA to determine whether there are significant differences to be investigated.
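As a rough illustration of that first step, the sketch below fits a one-way ANOVA and then inspects the default contrasts. The data are simulated and the level names (control, t1, t2, t3) are hypothetical.

```r
# Simulated one-way layout: level names and means are assumptions.
set.seed(2)
d <- data.frame(
  treatment = factor(rep(c("control", "t1", "t2", "t3"), each = 6))
)
d$response <- rnorm(24, mean = rep(c(10, 12, 12.5, 15), each = 6))

# Step 1: the basic ANOVA, to see whether there are differences at all.
fit <- aov(response ~ treatment, data = d)
print(summary(fit))

# Step 2: look at the contrasts R uses by default (treatment contrasts:
# each level is compared with the first level, here "control").
contrast_matrix <- contrasts(d$treatment)
print(contrast_matrix)

# summary.lm() reports the individual contrast estimates.
print(summary.lm(fit))
```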
Model Reduction in ANOVA
• In ANOVA, you reduce a model mainly by combining factor levels.
• Define your contrasts based on the science:
– Treatment versus control
– Similar treatments versus other treatments.
– Treatment differences within similar treatments.
• You can also aggregate factor levels in steps.
• See me if you need to do this. R can automate the process.
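One way to combine factor levels by hand is sketched below. This is only an illustration on simulated data, with hypothetical level names (control, lowA, highA, B); it is not necessarily the automated procedure referred to above.

```r
# Simulated data: level names and means are illustrative assumptions.
set.seed(3)
d <- data.frame(
  treatment = factor(rep(c("control", "lowA", "highA", "B"), each = 8))
)
d$response <- rnorm(32, mean = rep(c(10, 13, 13, 16), each = 8))

# Full model: one mean per original factor level.
full <- lm(response ~ treatment, data = d)

# Reduced model: merge the two similar treatments into a single level "A".
d$reduced <- d$treatment
levels(d$reduced)[levels(d$reduced) %in% c("lowA", "highA")] <- "A"
reduced <- lm(response ~ reduced, data = d)

# Compare the two nested models. A non-significant F test suggests that
# the simpler model with the combined level is adequate.
comparison <- anova(reduced, full)
print(comparison)
```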
Count Data
• With frequency data, we know how often something happened, but not how often it didn’t happen.
• Linear regression assumes constant variance and normal errors. This is not appropriate for count data:
1. Counts are non-negative.
2. Response variance usually increases with the mean.
3. Errors are not normally distributed.
4. Zeros are hard to transform.
Handling Count Data in R
• Use a glm model with family=poisson.
– This sets errors to Poisson, so variance is proportional to the mean.
– This sets link to log, so fitted values are positive.
• If you have overdispersion (residual deviance greater than residual degrees of freedom), use family=quasipoisson instead.
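A minimal sketch of this workflow on simulated counts follows. The variable names (x, counts) and the coefficients used to generate the data are assumptions made for illustration.

```r
# Simulated count data: names and generating values are assumptions.
set.seed(4)
d <- data.frame(x = runif(50, 0, 2))
d$counts <- rpois(50, lambda = exp(0.5 + 0.8 * d$x))

# Poisson errors with the default log link.
model <- glm(counts ~ x, family = poisson, data = d)
print(summary(model))

# Rough overdispersion check: residual deviance / residual df.
# Values well above 1 suggest overdispersion.
dispersion <- deviance(model) / df.residual(model)
print(dispersion)

# Quasi-Poisson refit: same coefficients, but the standard errors are
# scaled by an estimated dispersion parameter.
model_q <- glm(counts ~ x, family = quasipoisson, data = d)
print(summary(model_q)$dispersion)
```

The quasi-Poisson fit does not change the estimates, only their standard errors, so the refit matters for inference rather than for the fitted values.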
Contingency Tables
• There is a risk of data aggregation over important explanatory variables (nuisance variables).
• So check the significance of the real part of the model before you eliminate nuisance variables.
Frequencies and Proportions
• With frequency data, you know how often something happened, but not how often it didn’t happen.
• With proportion data, you know both.
• Applied to:
– Mortality and infection rates
– Response to clinical treatment
– Voting
– Sex ratios
– Proportional response to experimental treatments
Working With Proportions
• Traditionally, proportion data was modelled by using the percentage as the response variable.
• This is bad for four reasons:
1. Errors are not normally distributed.
2. Non-constant variance.
3. Response is bounded by 0.0 and 1.0.
4. The size of the sample, n, is lost.
Testing Proportions
• To compare a single binomial proportion to a constant, use binom.test.
– y<-c(15,5)
– binom.test(y,0.5)
– y<-c(14,6)
– binom.test(y,0.5)
• To compare two samples, use prop.test. The first vector gives the success counts and the second the sample sizes, so no count may exceed its sample size.
– prop.test(c(14,6),c(20,20))
• Only use glm methods for complex models:
– Regression tables
– Contingency tables
GLM Models for Proportions
• Start with a generalized linear model (glm).
• family = binomial (i.e., unfair coin flip)
• Use two vectors, one of the success counts and the other of the failure counts.
• number of failures + number of successes = binomial denominator, n
• y<-cbind(successes, failures)
• model<-glm(y~whatever,binomial)
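The recipe above can be sketched end to end on simulated dose-response data. The names dose and n, and the coefficients used to generate the data, are assumptions for illustration; "whatever" in the slide is replaced by log(dose) here.

```r
# Simulated proportion data: names and generating values are assumptions.
set.seed(5)
dose <- rep(c(1, 2, 4, 8, 16), each = 2)
n <- rep(20, length(dose))                      # binomial denominators
successes <- rbinom(length(dose), size = n,
                    prob = plogis(-2 + 1.2 * log(dose)))
failures <- n - successes

# Two-column response: successes first, failures second.
y <- cbind(successes, failures)

# Binomial errors with the default logit link.
model <- glm(y ~ log(dose), family = binomial)
print(summary(model))

# Fitted values are proportions on the (0, 1) scale.
print(round(fitted(model), 3))
```

Supplying both columns keeps the sample size n in the analysis, which is what the percentage-as-response approach loses.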
How R Handles Proportions
• Weighted regression (weighted by the individual sample sizes).
• logit link to ensure linearity
• If percentage cover data (e.g., survey data)
– Do an arc-sine transformation, followed by conventional modelling (normal errors, constant variance).
• If percentage change in a continuous measurement (e.g. growth)
– ANCOVA with final weight as the response and initial weight as a covariate, or
– Use the relative growth rate (log(final/initial)) as response.
– Both produce normal errors.
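The two options for percentage change can be sketched as follows, on simulated growth data. The column names (initial, final, treatment) and the growth rates are hypothetical.

```r
# Simulated growth data: names and generating values are assumptions.
set.seed(6)
d <- data.frame(
  treatment = factor(rep(c("control", "fed"), each = 10)),
  initial   = runif(20, 5, 10)
)
growth <- ifelse(d$treatment == "fed", 0.5, 0.3)
d$final <- d$initial * exp(growth + rnorm(20, sd = 0.05))

# Option 1: ANCOVA with final weight as the response and
# initial weight as a covariate.
ancova <- lm(final ~ initial + treatment, data = d)
print(summary(ancova))

# Option 2: relative growth rate as the response.
d$rgr <- log(d$final / d$initial)
rgr_model <- lm(rgr ~ treatment, data = d)
print(summary(rgr_model))
```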
Count Data in Proportions
• R supports the traditional arcsine and probit transformations:
– arcsine makes the error distribution normal
– probit linearises the relationship between percentage mortality and log(dose)
• It is usually better to use the logit transformation and assume you have binomial data.
Death and Failure Data
• Applications include:
– Time to death
– Time to failure
– Time to event
• This is a useful way to analyse performance when the process leading to a goal is complex, for example when a robot is performing a task.
Problems with Survival Data
• Non-constant variance, so standard methods are inappropriate.
• If errors are gamma distributed, the variance is proportional to the square of the mean.
• Use a glm with Gamma errors.
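A minimal sketch of such a fit on simulated, uncensored times follows. The names group and time and the group means are assumptions; the family's default link (inverse) is used because the slide does not name one.

```r
# Simulated times to event: names and generating values are assumptions.
set.seed(7)
group <- factor(rep(c("a", "b"), each = 25))
mu <- ifelse(group == "a", 10, 20)

# Gamma with shape 2 and rate 2/mu has mean mu and variance mu^2 / 2,
# so the variance is proportional to the square of the mean.
time <- rgamma(50, shape = 2, rate = 2 / mu)

# Gamma errors; R's default link for this family is the inverse link.
model <- glm(time ~ group, family = Gamma)
print(summary(model))

# Fitted group means, back on the original time scale.
print(tapply(fitted(model), group, unique))
```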
How do we deal with events that don’t happen during the study?
• In those trials, we don’t know when the event would occur. We just know the time would be greater than the end of the trial. Those trials are censored.
• The methods for handling censored data make up the field of survival analysis.
• (I used survival analysis in my PhD work. My wife does survival analysis for cancer data.)
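Censoring can be illustrated with the survival package, which ships with standard R distributions as a recommended package. The sketch below uses simulated data; the study length, group names and hazard rates are assumptions, and the exponential model is just one possible choice.

```r
library(survival)

# Simulated trial: all names and values are illustrative assumptions.
set.seed(8)
group <- factor(rep(c("a", "b"), each = 30))
true_time <- rexp(60, rate = ifelse(group == "a", 0.10, 0.05))

# The study stops at time 20. Events after that are not observed.
study_end <- 20
time <- pmin(true_time, study_end)
status <- as.numeric(true_time <= study_end)   # 1 = event seen, 0 = censored

# Kaplan-Meier survival curves for each group, respecting censoring.
km <- survfit(Surv(time, status) ~ group)
print(km)

# A parametric survival regression that uses the censored cases too.
model <- survreg(Surv(time, status) ~ group, dist = "exponential")
print(summary(model))
```

Dropping the censored trials, or treating their times as event times, would bias the estimated survival downwards; Surv() keeps the information that the event time exceeds the end of the study.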
Binary Response
• Very common:
– dead or alive
– occupied or empty
– male or female
– employed or unemployed
• Response variable is 0 or 1.
• R assumes a binomial trial with sample size 1.
When to use Binary Response Data
• Do a binary response analysis only when you have unique values of one or more explanatory variables for each and every possible individual case.
• Otherwise lump: aggregate to the point where you have unique values. Either:
– Analyse the data as a contingency table using Poisson errors, or
– Decide which explanatory variable is key, express the data as proportions, recode as a count of a two-level factor, and assume binomial errors.
Modelling Binary Response
• Single vector with the response variable
• Use glm with family = binomial
• Think about a complementary log-log link (cloglog in R) instead of logit. Use the one that gives less deviance.
• Fit the usual way.
• Test significance using χ².
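These steps can be sketched on simulated 0/1 data as below. The names x and y are hypothetical, and using R's "cloglog" (complementary log-log) link as the alternative to logit is an assumption about which link is meant.

```r
# Simulated binary response: names and generating values are assumptions.
set.seed(9)
x <- runif(200, -2, 2)
y <- rbinom(200, size = 1, prob = plogis(0.5 + 1.5 * x))

# Single 0/1 response vector, binomial errors, default logit link.
logit_model <- glm(y ~ x, family = binomial)

# Alternative link: complementary log-log.
cloglog_model <- glm(y ~ x, family = binomial(link = "cloglog"))

# Prefer the link that gives the smaller residual deviance.
print(c(logit = deviance(logit_model), cloglog = deviance(cloglog_model)))

# Significance by deletion: drop x and compare the change in deviance
# with a chi-squared distribution.
null_model <- update(logit_model, . ~ . - x)
test <- anova(null_model, logit_model, test = "Chisq")
print(test)
```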
Course Summary
• We’ve had an introduction to thinking critically about data.
• We’ve seen how to use a typical statistical analysis system (R).
• We’ve looked at our projects critically.
• We’ve discussed hypothesis testing.
• We’ve looked at statistical modelling.
Statistical Activities
• Data collection (ideally the statistician has a say on how they are collected)
• Description of a dataset
– Averages
– Spreads
– Extreme points
• Inference within a model or collection of models
• Model selection
Why Model?
• Usually you do statistics to explore the structure of data. The questions you might ask are rather open-ended. Your understanding is facilitated by a model.
• A model embodies what you currently know about the data. You can formulate it either as a data-generating process or a set of rules for processing the data.
Structure-in-the-data
• Of most interest…, for example:
– Modes
– Gaps
– Clusters
– Symmetry
– Shape
– Deviations from normality
• Plot the data to understand this.
Visualization
• Multiple views are necessary.
• Be able to zoom in on the data as a few points can obscure the interesting structure.
• Scaling of the axes may be necessary, since our eyes are not perfect tools for detecting structure.
• Watch out for time-ordered or location-ordered data, particularly if time or location are not explicitly reported.
Plots
• Use simple plots to start with.
• Watch for rounded data—shown by horizontal strata in the data. That often signals other problems.
Bottom Line
• I am available for consulting (free).
• E-mail: [email protected]
• Phone: 515-3227 or extension 3227 from university phones.
• Plan on about an hour meeting to allow time to think intelligently about your data.