1
Linear Regression
2
Simple Linear Regression
● First, we consider only one dimension X_1.
● Regression: we predict a numeric goal Y.
● Linear: we assume a linear relation Y = f(X),
● with intercept and slope: Y ≈ β_0 + β_1 X.
● We have training data (x_1, y_1), …, (x_N, y_N).
● We minimize the least squares criterion (sketch below).
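A minimal sketch in R (synthetic data; the variable names and the true coefficients are illustrative, not from the lecture):

```r
# Simple linear regression on synthetic data
set.seed(1)
x <- runif(100, 0, 10)               # one-dimensional predictor X_1
y <- 2 + 3 * x + rnorm(100, sd = 2)  # true intercept 2, slope 3, plus noise
fit <- lm(y ~ x)                     # least squares fit
coef(fit)                            # estimated intercept and slope
```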
3
Residual Sum of Squares RSS
● Residual: the difference between the true and the predicted value, e_i = y_i − ŷ_i,
● where i is the observation index, i = 1, …, N.
● We minimize RSS = Σ_i e_i² = Σ_i (y_i − ŷ_i)²,
● equivalently MSE(train.data) = RSS/N.
4
Lin. Reg. Coefficient Estimates
● Simple linear regression:
  β̂_1 = Σ_i (x_i − x̄)(y_i − ȳ) / Σ_i (x_i − x̄)²,  β̂_0 = ȳ − β̂_1 x̄.
● Multivariate linear regression:
  β̂ = (XᵀX)⁻¹ Xᵀ y,
● where X denotes the N×(p+1) training matrix ⟨1, x⟩,
● and y the N-vector of the training goal variable (sketch below).
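A sketch checking that the closed-form simple-regression estimates match lm(), on synthetic data:

```r
set.seed(1)
x <- runif(100, 0, 10)
y <- 2 + 3 * x + rnorm(100, sd = 2)
# Closed-form estimates for the one-dimensional case
beta1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0 <- mean(y) - beta1 * mean(x)
c(beta0, beta1)
coef(lm(y ~ x))  # same values
```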
5
Assessing the Accuracy of the Coefficient Estimates
● Different training data lead to different estimates (red – true model, blue – estimated models).
● The dispersion is characterized by the variance: the true variance σ² vs. the sample variance s².
6
Standard Error, Variance
● For data x_1, …, x_n,
● the (sample) variance (Czech: rozptyl) is: s² = (1/n) Σ_i (x_i − x̄)²,
● the (sample) standard deviation (Czech: směrodatná odchylka) is s = √s²;
● it is our estimate of the true value σ.
● The variance of the mean estimate x̄ is Var(x̄) = σ²/n, so SE(x̄) = σ/√n.
● Unbiased estimate: s² = (1/(n−1)) Σ_i (x_i − x̄)² (sketch below).
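A short R illustration of these quantities; note that R's var() and sd() already use the unbiased 1/(n−1) version:

```r
set.seed(1)
x <- rnorm(1000, mean = 5, sd = 2)
var(x)                   # unbiased sample variance (the 1/(n-1) version)
sd(x)                    # sample standard deviation
sd(x) / sqrt(length(x))  # standard error of the mean estimate
```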
7
Standard Error of Parameters
● The standard errors of the parameters are:
  SE(β̂_1)² = σ² / Σ_i (x_i − x̄)²,  SE(β̂_0)² = σ² [1/n + x̄² / Σ_i (x_i − x̄)²],
● where σ² = Var(ε). We estimate σ by the residual standard error RSE = √(RSS/(n−2)).
● Notice that SE(β̂_1) is smaller for x_i more spread out (more leverage).
8
Hypothesis Testing, Confidence Intervals
● There is an approx. 95% chance that the interval β̂_1 ± 2·SE(β̂_1) will contain the true value of β_1.
● Similarly for β_0.
● Hypothesis test:
● We assume the null hypothesis H_0: β_1 = 0 versus the alternative H_a: β_1 ≠ 0.
● What is the probability of the measured t = (β̂_1 − 0) / SE(β̂_1) or higher? – the p-value of the t-test
● with (n−2) degrees of freedom.
● If sufficiently low (< 5%), we reject the null hypothesis (sketch below).
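A sketch of where these quantities appear in R's standard output (synthetic data; summary() reports the t-statistics and p-values, confint() the confidence intervals):

```r
set.seed(1)
x <- runif(100, 0, 10)
y <- 2 + 3 * x + rnorm(100, sd = 2)
fit <- lm(y ~ x)
summary(fit)$coefficients   # Estimate, Std. Error, t value, Pr(>|t|)
confint(fit, level = 0.95)  # approx. beta_hat ± 2*SE intervals
```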
9
Importance of Features
● If Pr(>|t|) is low, the parameter is significant.
● Usually, the significance level 0.05 is taken;
● to be 'really' sure (e.g., in medicine), 0.001.
● A parameter with a p-value higher than 0.05 can be non-zero due to chance.
10
Assessing the Accuracy of the Model
● Residual standard error:
  RSE = √(RSS/(n−2)),
● the average amount that the response will deviate from the true regression line.
● RSE depends on the scale of Y.
● mean(wage) = 111.7036, RSE = 41.64581
● pred.y$fit[7] − pred.y$fit[1] = 8.099244
11
R² Statistics
● The proportion of variance explained:
  R² = (TSS − RSS) / TSS = 1 − RSS/TSS,
● scale independent, always in [0, 1],
● where TSS = Σ_i (y_i − ȳ)² (the total sum of squares) relates to the trivial model – the mean.
● 'Our' wage R² = 0.0043 is very low (sketch below).
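A sketch computing R² from RSS and TSS by hand, against summary()'s value (synthetic data):

```r
set.seed(1)
x <- runif(100)
y <- 1 + 2 * x + rnorm(100)
fit <- lm(y ~ x)
rss <- sum(resid(fit)^2)      # residual sum of squares
tss <- sum((y - mean(y))^2)   # total SS of the trivial (mean) model
1 - rss / tss
summary(fit)$r.squared        # same value
```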
12
Multiple Linear Regression
● Model: Y = β_0 + β_1 X_1 + … + β_p X_p + ε
● p – the number of variables (features).
● Minimizing RSS we get the coefficients β̂_0, …, β̂_p.
● One-dimensional fits vs. the joint fit:
● Is advertisement in newspaper important? (sketch below)
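A sketch mimicking the Advertising example with purely synthetic data (TV, radio, newspaper are simulated here, not the real dataset): a predictor that merely correlates with a useful one can look unimportant in the joint fit.

```r
set.seed(2)
n <- 200
TV        <- runif(n, 0, 300)
radio     <- runif(n, 0, 50)
newspaper <- radio + rnorm(n, sd = 10)           # correlated with radio
sales     <- 3 + 0.05 * TV + 0.2 * radio + rnorm(n)  # no newspaper effect
summary(lm(sales ~ TV + radio + newspaper))      # newspaper: high Pr(>|t|)
```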
13
Linear Regression – Matrix Form
● We look for a function f in the form:
  f(X) = Xβ = β_0 + Σ_{j=1}^{p} X_j β_j,
● that minimizes RSS:
  RSS(β) = (y − Xβ)ᵀ (y − Xβ).
14
Linear Regression - Derivation
● We take the derivative of RSS:
  ∂RSS/∂β = −2 Xᵀ (y − Xβ),
● set it to 0:
  Xᵀ (y − Xβ) = 0,
● and get the solution β̂ = (XᵀX)⁻¹ Xᵀ y
● and the prediction ŷ = X β̂ = X (XᵀX)⁻¹ Xᵀ y (sketch below).
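The same solution written directly via the normal equations, as a sketch in R (synthetic X and y):

```r
set.seed(3)
n <- 100; p <- 2
X <- cbind(1, matrix(rnorm(n * p), n, p))   # N x (p+1) matrix <1, x>
beta_true <- c(1, 2, -1)
y <- X %*% beta_true + rnorm(n)
beta_hat <- solve(t(X) %*% X, t(X) %*% y)   # (X^T X)^{-1} X^T y
y_hat <- X %*% beta_hat                     # the prediction
beta_hat
```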
15
Collinearity
● Extreme collinearity: a non-invertible XᵀX.
16
Correlation of Variables
● Remark 2: with too high a number of predictors p, some are
correlated and show a good F-statistic due to chance.
● Feature selection: Chapter 6, it is on the schedule.
17
Patterns in Residuals – Nonlinearity
18
Qualitative (Discrete) Predictors
● Encoding by 0/1; for a predictor with more values we code each
value (except one) separately.
● Example: ethnicity (sketch below).
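A sketch of the 0/1 encoding in R (hypothetical ethnicity values; R builds the dummy columns automatically for a factor):

```r
ethnicity <- factor(c("African American", "Asian", "Caucasian", "Asian"))
# k levels become k-1 dummy 0/1 columns; the omitted (baseline) level
# is absorbed into the intercept
model.matrix(~ ethnicity)
```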
19
The Estimated Slope is Fixed
20
Non-linear Models
● too many combinations to check,
● if you know what, ADD IT – log, exp, product, … (sketch below)
● simplified ideas of nonlinear models:
● splines – piecewise polynomial functions
● SVM – a trick to check higher-degree polynomials
● basis functions, trees – piecewise 'kernel, constant'
● stacking – LR on trained models
● and others.
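A sketch of adding known nonlinearities by hand while staying inside linear regression (synthetic data; the chosen terms are illustrative):

```r
set.seed(4)
x1 <- runif(100, 1, 10)
x2 <- runif(100)
y <- 1 + 2 * log(x1) + 3 * x1 * x2 + rnorm(100, sd = 0.5)
# log term and a product (interaction): still linear in the coefficients
fit <- lm(y ~ log(x1) + x1:x2)
coef(fit)
```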
21
Non-linear Model
22
Correlated Observations (residuals)
● typical with time series,
● usually it leads to an underestimate of the error.
23
Non-constant Variance of Error Terms
● Remedies: a log transformation of Y, or weighted least squares (sketch below).
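A sketch of weighted least squares in R when the noise grows with x, with weights chosen as 1/variance (synthetic data):

```r
set.seed(5)
x <- runif(200, 1, 10)
y <- 1 + 2 * x + rnorm(200, sd = x)      # error variance grows with x
fit_wls <- lm(y ~ x, weights = 1 / x^2)  # downweight the noisy observations
coef(fit_wls)
```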
24
Outliers (Czech: odlehlá pozorování)
● Error in the dataset or missing predictor?
25
High Leverage (Czech: vzdálená X – distant x-values)
● Leverage statistic: the diagonal of the hat matrix H = X(XᵀX)⁻¹Xᵀ.
● One-dimensional: h_i = 1/n + (x_i − x̄)² / Σ_j (x_j − x̄)² (sketch below).
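A sketch verifying that the diagonal of H equals R's hatvalues(), with one deliberately distant x (synthetic data):

```r
set.seed(6)
x <- c(runif(50, 0, 1), 5)               # one far-away, high-leverage point
X <- cbind(1, x)
H <- X %*% solve(t(X) %*% X) %*% t(X)    # the hat matrix
y <- 2 + x + rnorm(51)
fit <- lm(y ~ x)
diag(H)[51]                              # large leverage for the distant x
hatvalues(fit)[51]                       # same value from R's helper
```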
26
k-NN Regression
27
Comparison of Lin. Reg. and k-NN
● almost linear relation – the linear model is better,
● highly nonlinear relation – k-NN is better (sketch below).
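A sketch comparing a hand-rolled k-NN regression with lm on a clearly nonlinear target (synthetic data; k = 9 is an arbitrary choice):

```r
set.seed(7)
x <- runif(200, 0, 2 * pi)
y <- sin(x) + rnorm(200, sd = 0.2)
# k-NN regression: average the y values of the k nearest training points
knn_predict <- function(x_new, k = 9) {
  sapply(x_new, function(x0) mean(y[order(abs(x - x0))[1:k]]))
}
grid <- seq(0, 2 * pi, length.out = 5)
knn_predict(grid)                         # follows the sine shape
predict(lm(y ~ x), data.frame(x = grid))  # a straight line: poor fit here
```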