Statistical Data Analysis
2010/2011
M. de Gunst
Lecture 10
Statistical Data Analysis: Introduction
Topics
- Summarizing data
- Investigating distributions
- Bootstrap
- Robust methods
- Nonparametric tests
- Analysis of categorical data
- Multiple linear regression (continued)
Multiple linear regression (Reader: Chapter 8)
Relationship between one response variable and one or more explanatory variables
Last time:
- Statistical model
- Parameter estimation
- Selection of explanatory variables (determination coefficient, F- and t-tests)
- Model quality: global methods/diagnostics (plots)
This week: further investigation of model quality
- deviating observation points: outlier, leverage point/potential, influence point
  - plots, numerical measures and tests
  - test for outliers, hat matrix, Cook's distance
- explanatory variables that are themselves linearly related (collinearity)
  - plots, numerical measures
  - variance inflation factors, condition indices, variance decomposition
Statistical model
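Multiple linear regression model
$Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + e_i$, $i = 1, \ldots, n$,
with $e_1, \ldots, e_n$ independent and normally distributed, $e_i \sim N(0, \sigma^2)$
In matrix notation: $Y = X\beta + e$
Issues:
1) estimate $\beta$ and $\sigma^2$
2) select explanatory variables
3) assess model quality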
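A minimal sketch of issue 1 (estimation), under assumptions that are not in the slides: the data are simulated, the coefficient values are made up for illustration, and NumPy's `lstsq` is used for the least squares fit.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

n, p = 50, 2                    # hypothetical sample size and number of explanatory variables
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # design matrix with intercept column
beta_true = np.array([1.0, 2.0, -0.5])                      # made-up coefficients (illustration only)
y = X @ beta_true + rng.normal(scale=0.3, size=n)           # independent normal errors

# Least squares estimate of beta (lstsq avoids forming the inverse of X^T X explicitly)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

residuals = y - X @ beta_hat
sigma2_hat = residuals @ residuals / (n - p - 1)            # estimate of sigma^2 on n - p - 1 degrees of freedom

print(beta_hat, sigma2_hat)
```

The residuals of a least squares fit are orthogonal to the columns of X, which is a quick sanity check on any implementation.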
3) Assessment of model quality – deviating points
Consider observation point $(y_i, x_{i1}, \ldots, x_{ip})$
Types of deviating observation points
- deviating response: outlier
- deviating explanatory variable: potential or leverage point
- if point has influence: influence point
How to detect
- outlier: test for outliers
- leverage point: hat matrix
- influence point: Cook's distance
Example outlier
Forbes’ data: boiling temperature for different pressure
A small deviation in the response may have large effects
Generally easy to detect in plots
3) Assessment of model quality – outliers
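Outlier: deviating response
How to detect? Make plots - which ones?
If possible outliers are detected, do a formal test
Idea: if the k-th point is an outlier, then it fits the regression model up to a shift δ,
i.e. it fits the mean shift outlier model
$Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \delta\,1\{i = k\} + e_i$
for sufficiently large |δ|, or in matrix notation
$Y = X\beta + \delta u + e$ with $u \in \mathbb{R}^n$ s.t. $u_k = 1$ and $u_i = 0$ for $i \neq k$
When is the k-th point an outlier in terms of δ?
How to test?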
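If δ is significantly different from 0, then the k-th point is an outlier
Test for outlier
H0: δ = 0, β arbitrary
H1: δ ≠ 0, β arbitrary (note: in Reader one-sided)
Test statistic: $T = \hat\delta / \widehat{\mathrm{se}}(\hat\delta) \sim t_{n-p-2}$ under H0, with $\hat\delta$ the least squares estimate of δ in the mean shift outlier model (which has p + 2 regression parameters: intercept, p slopes and δ)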
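The test can be sketched by fitting the mean shift outlier model directly: add an indicator column for the k-th point to the design matrix and compute the t statistic of its coefficient. This is a sketch under assumptions, not the Reader's procedure: the function name `mean_shift_t_statistic` is hypothetical, and the data, including the planted outlier, are simulated.

```python
import numpy as np

def mean_shift_t_statistic(X, y, k):
    """t statistic for H0: delta = 0 in the mean shift outlier model for point k."""
    n = X.shape[0]
    u = np.zeros(n)
    u[k] = 1.0                       # indicator vector: 1 at position k, 0 elsewhere
    Xa = np.column_stack([X, u])     # augmented design matrix [X, u]
    coef, *_ = np.linalg.lstsq(Xa, y, rcond=None)
    resid = y - Xa @ coef
    df = n - Xa.shape[1]             # residual degrees of freedom of the augmented model
    s2 = resid @ resid / df
    cov = s2 * np.linalg.inv(Xa.T @ Xa)
    t = coef[-1] / np.sqrt(cov[-1, -1])   # delta_hat divided by its standard error
    return t, df

rng = np.random.default_rng(1)
n = 30
x = np.linspace(0.0, 10.0, n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 0.5 * x + rng.normal(scale=0.2, size=n)
y[12] += 3.0                         # plant an artificial outlier at index 12

t_out, df = mean_shift_t_statistic(X, y, 12)   # planted outlier
t_ok, _ = mean_shift_t_statistic(X, y, 5)      # ordinary point
print(t_out, t_ok, df)
```

To complete the test, |t| would be compared with a quantile of the t distribution with `df` degrees of freedom, for example via `scipy.stats.t`; that step is left out here to keep the sketch NumPy-only.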
Example leverage point
Huber’s data:
A small deviation in an explanatory variable may have a large effect
Often difficult to detect in plots: the point lies on the edge of the range of values, and its residual is often not large
3) Assessment of model quality – leverage points
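Potential or leverage point: deviating explanatory variable
How to detect?
With the hat matrix $H = X(X^TX)^{-1}X^T$
Name stems from $\hat Y = X\hat\beta = HY$
Properties of H: $H^T = H$ and $H^2 = H$, so $h_{ii} = \sum_j h_{ij}^2$ and $0 \le h_{ii} \le 1$; if $h_{ii}$ is large, then the other $h_{ij}$ are small
We see $\hat Y_i = h_{ii} Y_i + \sum_{j \neq i} h_{ij} Y_j$ and $\mathrm{Var}(Y_i - \hat Y_i) = \sigma^2 (1 - h_{ii})$
Hence, if $h_{ii}$ is large, then the i-th point has potential influence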
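As an illustration, the diagonal of the hat matrix can be computed directly from the design matrix. This is a sketch under assumptions: the data are simulated, and one x value is deliberately placed far outside the range of the others.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20
x = rng.uniform(0.0, 5.0, size=n)
x[-1] = 15.0                              # hypothetical point far outside the range of the other x values
X = np.column_stack([np.ones(n), x])      # design matrix with intercept column

H = X @ np.linalg.inv(X.T @ X) @ X.T      # hat matrix H = X (X^T X)^{-1} X^T
h = np.diag(H)                            # leverages h_ii

print(np.round(h, 3))
print("largest leverage at index", int(np.argmax(h)))
```

Note that only the explanatory variables enter H; the response plays no role, which is why a leverage point has potential rather than actual influence.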
3) Assessment of model quality – influence points
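Influence point: point that has a large influence on the fitted model
How to detect?
Check whether the point is an outlier or a leverage point
If yes, then fit the model with and without this point
If the results are very different: the point is an influence point
Measure based on the difference between the estimated β's:
Cook's distance for the i-th point:
$D_i = \frac{(\hat\beta_{(i)} - \hat\beta)^T X^T X (\hat\beta_{(i)} - \hat\beta)}{(p+1)\,\hat\sigma^2}$
with $\hat\beta_{(i)}$ the parameter estimate without the i-th point
If $D_i$ is larger than 1 (roughly), then the i-th point is an influence point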
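Explanation: the set
$\{ b : (b - \hat\beta)^T X^T X (b - \hat\beta) / ((p+1)\,\hat\sigma^2) \le F_{p+1,\,n-p-1;\,1-\alpha} \}$
is a confidence region with confidence $1 - \alpha$ for the parameter vector β
Thus $(b - \hat\beta)^T X^T X (b - \hat\beta) / ((p+1)\,\hat\sigma^2)$ defines a measure of distance of b from $\hat\beta$
For choices of α around 0.5 the values of b outside this set lie "far away" from $\hat\beta$
For choices of α around 0.5 the boundary of the set, $F_{p+1,\,n-p-1;\,1-\alpha}$, has a value around 1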
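Cook's distance can be computed literally as described, by refitting the model without each point in turn. The sketch below does this under assumptions: the helper name `cooks_distances` is hypothetical, and the data are simulated with one artificial point that has high leverage and lies off the line.

```python
import numpy as np

def cooks_distances(X, y):
    """Cook's distance for each point, computed by refitting without that point."""
    n, q = X.shape                                # q = p + 1 columns (intercept included)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    s2 = resid @ resid / (n - q)                  # estimate of sigma^2 from the full fit
    XtX = X.T @ X
    D = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)  # estimate without point i
        d = beta_i - beta_hat
        D[i] = d @ XtX @ d / (q * s2)
    return D

rng = np.random.default_rng(3)
n = 25
x = rng.uniform(0.0, 5.0, size=n)
y = 2.0 + 1.0 * x + rng.normal(scale=0.3, size=n)
x[0], y[0] = 12.0, 2.0                            # artificial point: high leverage and off the line
X = np.column_stack([np.ones(n), x])

D = cooks_distances(X, y)
print("points with D_i > 1:", np.where(D > 1)[0])
```

The explicit refit loop mirrors the definition; the same values can be obtained without refitting from the residuals and the leverages $h_{ii}$, which is cheaper for large n.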
Example influence points
Cook’s distances for different data sets:
3) Assessment of model quality – collinearity
Explanatory variables that are themselves linearly related: collinearity
- numerical measures: variance inflation factors, condition indices, variance decomposition
When is it a problem?
- if the variance of one or more estimators $\hat\beta_j$ is large, then the estimate(s) are not reliable
How to detect?
- known methods: scatter plots, correlation coefficients (between pairs of variables), determination coefficient of $X_j$ on the others (= squared multiple linear correlation coefficient between $X_j$ and the others)
- plus several new numerical measures
3) Assessment of model quality – collinearity
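Columns $X_1, \ldots, X_q$ of a matrix X are exactly collinear if $c_1 X_1 + \cdots + c_q X_q = 0$ for some constants $c_1, \ldots, c_q$ not all equal to 0
If there are one or more collinearities in a (general) matrix X, then rank(X) is not maximal and $(X^TX)^{-1}$ does not exist
With approximate collinearities $(X^TX)^{-1}$ is difficult to compute
In a design matrix X one or more (approximate) collinearities can exist between its columns
In that case $\hat\beta = (X^TX)^{-1}X^TY$ is difficult to compute and/or one or more $\mathrm{Var}(\hat\beta_j)$ may be large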
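The difference between exact and approximate collinearity can be seen numerically. The sketch below uses simulated columns (an assumption, not data from the course) in which the third column is, exactly or almost exactly, a linear combination of the first two.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 40
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

# Exact collinearity: third column is a linear combination of the first two
X_exact = np.column_stack([x1, x2, 2.0 * x1 - x2])
rank_exact = np.linalg.matrix_rank(X_exact)       # rank is not maximal

# Approximate collinearity: same combination plus a little noise
X_approx = np.column_stack([x1, x2, 2.0 * x1 - x2 + 1e-4 * rng.normal(size=n)])
rank_approx = np.linalg.matrix_rank(X_approx)     # full rank, but nearly singular
cond_approx = np.linalg.cond(X_approx.T @ X_approx)

# Diagonal of (X^T X)^{-1}: proportional to Var(beta_hat_j), very large here
var_factors = np.diag(np.linalg.inv(X_approx.T @ X_approx))

print(rank_exact, rank_approx, cond_approx)
print(var_factors)
```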
3) Assessment of model quality – collinearity
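How to detect collinearity
- scatter plots, correlation coefficients (between pairs of variables), determination coefficient $R_j^2$ of $X_j$ on all others (= squared multiple linear correlation coefficient between $X_j$ and all others)
- 4 new numerical measures
i) variance inflation factors
$\mathrm{VIF}_j = 1 / (1 - R_j^2)$
because $\mathrm{Var}(\hat\beta_j) = \sigma^2\,\mathrm{VIF}_j / \sum_i (x_{ij} - \bar x_j)^2$:
$\mathrm{VIF}_j$ is the amount of increase in the variance of $\hat\beta_j$ due to the relationship between $X_j$ and all others
If $\mathrm{VIF}_j$ is large, then the estimate $\hat\beta_j$ is unreliable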
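A minimal sketch of the variance inflation factors, computed from their usual definition $\mathrm{VIF}_j = 1/(1 - R_j^2)$ by regressing each explanatory variable on all others. The function name `variance_inflation_factors` is hypothetical and the data are simulated, with two variables that are nearly copies of each other.

```python
import numpy as np

def variance_inflation_factors(Z):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the other columns.

    Z holds the explanatory variables only (no intercept column).
    """
    n, p = Z.shape
    vif = np.empty(p)
    for j in range(p):
        target = Z[:, j]
        others = np.column_stack([np.ones(n), np.delete(Z, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        vif[j] = 1.0 / (1.0 - r2)
    return vif

rng = np.random.default_rng(5)
n = 100
z1 = rng.normal(size=n)
z2 = z1 + 0.1 * rng.normal(size=n)    # nearly a copy of z1
z3 = rng.normal(size=n)               # unrelated to the others
Z = np.column_stack([z1, z2, z3])

vif = variance_inflation_factors(Z)
print(np.round(vif, 1))
```

In this setup the first two factors should come out large and the third close to 1, since only z1 and z2 are related.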
3) Assessment of model quality – collinearity
How to detect collinearity
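ii) condition number (read in Reader)
iii) condition indices
Make use of the singular value decomposition $X = UDV^T$
with $U^TU = V^TV = I$ and $D = \mathrm{diagonal}(\mu_1, \ldots, \mu_{p+1})$, where $\mu_k \ge 0$ are the singular values of X
k-th condition index: $\kappa_k = \mu_{\max} / \mu_k$
If $\mu_k$ is small, thus $\kappa_k$ large → collinearity
because then $X v_k = \mu_k u_k \approx 0$, with $u_k$ and $v_k$ the k-th columns of U and V
If $v_{jk}$ is not too small, then $X_j$ is involved in the collinearity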
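The condition indices follow directly from a singular value decomposition. The sketch below assumes simulated data in which one column is nearly the sum of two others, and applies `numpy.linalg.svd` to the unscaled design matrix; whether columns should be scaled first is a separate choice that is not made here.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 60
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + 0.01 * rng.normal(size=n)           # nearly x1 + x2
X = np.column_stack([np.ones(n), x1, x2, x3])      # design matrix with intercept column

U, mu, Vt = np.linalg.svd(X, full_matrices=False)  # X = U diag(mu) V^T, mu sorted in descending order
kappa = mu.max() / mu                              # condition indices

k = int(np.argmax(kappa))                          # index of the smallest singular value
v_k = Vt[k]                                        # k-th column of V

print(np.round(kappa, 1))
print(np.round(v_k, 3))                            # large entries mark the columns involved
```

Here the entries of `v_k` for x1, x2 and x3 should be clearly non-zero and the entry for the intercept close to zero, matching the way the collinearity was constructed.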
3) Assessment of model quality – collinearity
How to detect collinearity
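iv) variance decomposition proportions
$\pi_{kj} = \frac{v_{jk}^2 / \mu_k^2}{\sum_l v_{jl}^2 / \mu_l^2}$
with $\mu_k$ the singular values and $v_{jk}$ the entries of V from the s.v.d. $X = UDV^T$, and $\kappa_k = \mu_{\max} / \mu_k$ the k-th condition index
because (from s.v.d.) $\mathrm{Var}(\hat\beta_j) = \sigma^2 \sum_k v_{jk}^2 / \mu_k^2$
If $\kappa_k$ is large, then investigate which terms are involved via the $\pi_{kj}$
Write the $\pi_{kj}$ in a matrix and look in the row of a large $\kappa_k$ (= small $\mu_k$) for the $\pi_{kj}$ that are close to 1
The corresponding $X_j$ are involved in the collinearity
Easier to see than with method (iii)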
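The matrix of variance decomposition proportions can be sketched as follows. The function name `variance_decomposition` is hypothetical, the proportions are computed as the share of $\sum_k v_{jk}^2/\mu_k^2$ contributed by each singular value, and the data are simulated with one column nearly equal to the sum of two others.

```python
import numpy as np

def variance_decomposition(X):
    """Condition indices kappa[k] and proportions pi[k, j] for design matrix X."""
    _, mu, Vt = np.linalg.svd(X, full_matrices=False)
    kappa = mu.max() / mu
    phi = (Vt.T ** 2) / mu ** 2                     # phi[j, k] = v_jk^2 / mu_k^2
    pi = (phi / phi.sum(axis=1, keepdims=True)).T   # pi[k, j]: share of Var(beta_hat_j) due to mu_k
    return kappa, pi

rng = np.random.default_rng(7)
n = 60
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + 0.01 * rng.normal(size=n)            # nearly x1 + x2
X = np.column_stack([np.ones(n), x1, x2, x3])

kappa, pi = variance_decomposition(X)
k = int(np.argmax(kappa))                           # row of the largest condition index
print(np.round(kappa, 1))
print(np.round(pi[k], 3))                           # proportions close to 1 mark the columns involved
```

Each column of `pi` sums to 1, so reading along the row of the largest condition index shows at a glance which coefficient variances are dominated by that near-dependency.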
3) Assessment of model quality – collinearity
Solutions for collinearity
No general guideline exists
Sometimes:
- leave out one or more explanatory variables
- scale explanatory variables
- center explanatory variables
(a variable may lose its meaning)
Always:
- try to find an explanation; this may lead to the right choice
3) Assessment of model quality – example
Now: example with body fat data (separate document)