Introduction to Linear Regression Analysis. Dr. Tuan V. Nguyen, Garvan Institute of Medical Research, Sydney


Lecture on linear regression analysis



  • Give a person three weapons, correlation, regression and a pen, and he will use all three (Anon, 1978)

  • Example

    Age and cholesterol concentration measured in 18 people:

    ID   Age   Chol (mg/ml)
     1    46   3.5
     2    20   1.9
     3    52   4.0
     4    30   2.6
     5    57   4.5
     6    25   3.0
     7    28   2.9
     8    36   3.8
     9    22   2.1
    10    43   3.8
    11    57   4.1
    12    33   3.0
    13    22   2.5
    14    63   4.6
    15    40   3.2
    16    48   4.2
    17    28   2.3
    18    49   4.0

  • Entering the data into R

    id
  • Relationship between age and cholesterol concentration
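
The R code on the data-entry slide did not survive, so the following is only a sketch of how the 18 observations from the example table could be entered and plotted. The data frame name `dat` and the axis labels are choices made here for illustration, not taken from the slides.

```r
# Data transcribed from the example table of 18 subjects
id   <- 1:18
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1,
          3.8, 4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)

# 'dat' is a hypothetical name for the combined data frame
dat <- data.frame(id, age, chol)

# Scatter plot of cholesterol against age
plot(chol ~ age, data = dat, pch = 16,
     xlab = "Age (years)", ylab = "Cholesterol")
```

A quick check such as `colMeans(dat[, c("age", "chol")])` can be compared against the means reported later in the slides (38.83 and 3.33).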

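  • Research questions

    Is there a relationship between age and cholesterol concentration?

    How strong is the relationship?

    Can cholesterol concentration be predicted for a given age?

    These questions are addressed by correlation and regression analysis.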

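  • Variance and covariance: algebra

    Let x and y be two random variables observed in a sample of n subjects.

    The variability of x and of y is measured by the variance; the joint variability of x and y is measured by the covariance.

    If x and y are independent: $\mathrm{var}(x + y) = \mathrm{var}(x) + \mathrm{var}(y)$

    In general: $\mathrm{var}(x + y) = \mathrm{var}(x) + \mathrm{var}(y) + 2\,\mathrm{cov}(x, y)$

    Where:

    $\mathrm{var}(x) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$

    $\mathrm{cov}(x, y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$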
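
As a small numerical check of the identity for var(x + y), the sketch below computes the variance and covariance with the usual n − 1 divisor and compares them with R's built-in `var()` and `cov()`. It assumes the age and cholesterol values from the example table; the names `var_x`, `var_y` and `cov_xy` are chosen here.

```r
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1,
          3.8, 4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)
n <- length(age)

# Sample variance and covariance from their definitions (divisor n - 1)
var_x  <- sum((age - mean(age))^2) / (n - 1)
var_y  <- sum((chol - mean(chol))^2) / (n - 1)
cov_xy <- sum((age - mean(age)) * (chol - mean(chol))) / (n - 1)

# Agreement with the built-in functions
all.equal(c(var_x, var_y, cov_xy), c(var(age), var(chol), cov(age, chol)))

# Variance of a sum: var(x + y) = var(x) + var(y) + 2 cov(x, y)
all.equal(var(age + chol), var_x + var_y + 2 * cov_xy)
```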

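  • Variance and covariance: geometry

    Independence and dependence between x and y can be represented geometrically, with x and y as two sides of a triangle, h as the third side, and H as the angle between x and y.

    Independent (right angle): $h^2 = x^2 + y^2$

    Dependent: $h^2 = x^2 + y^2 - 2xy\cos(H)$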

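  • Meaning of variance and covariance

    The variance is never negative (it is zero only when all values are identical).

    If x and y are independent, their covariance is 0. A covariance of 0 means there is no linear association; it does not by itself guarantee independence.

    The covariance is a sum of cross-products, so it can be negative or positive.

    Negative covariance: the two variables deviate from their means in opposite directions.

    Positive covariance: the two variables deviate from their means in the same direction.

    The covariance is therefore a measure of the strength of association.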

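  • Covariance and correlation

    The covariance depends on the units of measurement.

    The correlation coefficient (r) between x and y is a standardised covariance.

    r is defined by:

    $r = \frac{\mathrm{cov}(x, y)}{\sqrt{\mathrm{var}(x)\,\mathrm{var}(y)}} = \frac{\mathrm{cov}(x, y)}{SD_x \times SD_y}$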
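
To illustrate that r is simply a standardised covariance, the following sketch (assuming the example age and cholesterol data; `r_manual` is a name chosen here) computes r from the covariance and standard deviations and compares it with `cor()`.

```r
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1,
          3.8, 4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)

# Correlation as covariance divided by the product of standard deviations
r_manual <- cov(age, chol) / (sd(age) * sd(chol))
r_manual

# Built-in Pearson correlation for comparison
cor(age, chol)

# Unlike the covariance, r does not change when the units change
cor(age * 12, chol * 100)   # age in months, cholesterol rescaled
```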

  • Positive and negative correlation

    [Figure: two scatter plots, one with r = 0.9 and one with r = -0.9]

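  • Testing the correlation hypothesis

    Hypotheses: H0: ρ = 0 versus H1: ρ ≠ 0, where ρ is the population correlation estimated by r.

    Standard error of r: $SE(r) = \sqrt{\frac{1 - r^2}{n - 2}}$

    The t-statistic: $t = \frac{r}{SE(r)} = r \sqrt{\frac{n - 2}{1 - r^2}}$

    This statistic follows a t distribution with n − 2 degrees of freedom.

    Fisher's z-transformation: $z = \frac{1}{2} \ln\left(\frac{1 + r}{1 - r}\right)$

    Standard error of z: $SE(z) = \frac{1}{\sqrt{n - 3}}$

    The 95% CI of z can therefore be calculated as: $z \pm 1.96 \times SE(z)$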

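  • Illustration of correlation analysis

    ID    Age (x)   Cholesterol (y; mg/100ml)
     1    46        3.5
     2    20        1.9
     3    52        4.0
     4    30        2.6
     5    57        4.5
     6    25        3.0
     7    28        2.9
     8    36        3.8
     9    22        2.1
    10    43        3.8
    11    57        4.1
    12    33        3.0
    13    22        2.5
    14    63        4.6
    15    40        3.2
    16    48        4.2
    17    28        2.3
    18    49        4.0
    Mean  38.83     3.33
    SD    13.60     0.84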

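    Cov(x, y) = 10.68, so r = Cov(x, y) / (SD_x × SD_y) = 0.9367 (from the unrounded SDs).

    SE(r) = sqrt((1 − 0.9367²) / 16) = 0.0875

    t-statistic = 0.9367 / 0.0875 = 10.7

    Critical t-value with 16 df (n − 2) and alpha = 5% is 2.12.

    Conclusion: there is a statistically significant correlation between age and cholesterol concentration.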
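
The test of zero correlation and the Fisher z interval can be sketched in R as follows. This assumes the example data and the formulas SE(r) = sqrt((1 − r²)/(n − 2)), z = ½ ln((1 + r)/(1 − r)) and SE(z) = 1/sqrt(n − 3); the variable names are chosen here. The built-in `cor.test()` is used as a cross-check.

```r
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1,
          3.8, 4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)
n <- length(age)
r <- cor(age, chol)

# t-test of zero correlation, with n - 2 degrees of freedom
se_r   <- sqrt((1 - r^2) / (n - 2))
t_stat <- r / se_r
p_val  <- 2 * pt(-abs(t_stat), df = n - 2)

# Fisher's z-transformation and its 95% confidence interval
z    <- 0.5 * log((1 + r) / (1 - r))      # identical to atanh(r)
se_z <- 1 / sqrt(n - 3)
ci_z <- z + c(-1, 1) * qnorm(0.975) * se_z
ci_r <- tanh(ci_z)                         # back-transform to the r scale

# Cross-check against the built-in test
ct <- cor.test(age, chol)
c(manual_t = t_stat, builtin_t = unname(ct$statistic))
rbind(manual = ci_r, builtin = as.numeric(ct$conf.int))
```

Back-transforming with `tanh()` is what turns the interval for z into an interval for r itself.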

  • Simple linear regression analysis

    Assessment: quantify the relationship between two variables.

    Prediction: build and evaluate a predictive model.

    Control: adjust for confounding factors (in multivariable analysis).

    Simple linear regression considers only two variables: a response variable and a predictor variable.

    There is no adjustment for confounders or other covariates.

  • Relationship between age and cholesterol concentration

  • The linear regression model

    Y: a random variable, the response.

    X: a random variable, the predictor or risk factor.

    Both Y and X can be categorical (e.g., yes / no) or continuous (e.g., age). If Y is categorical, use a logistic regression model; if Y is continuous, use a simple linear regression model.

    Model: Y = α + βX + ε

    α: intercept

    β: slope / gradient

    ε: random error (the variation in y between subjects when x is held constant, e.g., the variation in cholesterol among people of the same age)

  • Assumptions of the linear model

    The relationship between X and Y is linear (a straight line);

    X is measured without error;

    The values of Y are independent of each other (e.g., Y1 is not correlated with Y2);

    The random error (ε) is normally distributed with mean 0 and constant variance.

  • Expected value and variance

    If the assumptions hold:

    The expected value of Y is: E(Y | x) = α + βx

    The variance of Y at a given x is: var(Y | x) = var(ε) = σ²

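  • Estimating the parameters of the linear regression model

    Given two points A(x1, y1) and B(x2, y2) in a two-dimensional plane, there is exactly one straight line joining them.

    Slope: $m = \frac{dy}{dx} = \frac{y_2 - y_1}{x_2 - x_1}$

    Equation: y = mx + a

    [Figure: line through A(x1, y1) and B(x2, y2), with intercept a on the y axis and the rise dy over the run dx]

    So what if we have more than 2 points?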

  • Estimating α and β

    We have a series of pairs: (x1, y1), (x2, y2), (x3, y3), ..., (xn, yn).

    Let a and b be the estimates of the parameters α and β. The equation for the study sample is: Y* = a + bx

    Aim: find the values of a and b such that the differences (Y − Y*) are as small as possible.

    Let SSE = sum of (Yi − a − bxi)².

    The values of a and b that minimise SSE are called the least squares estimates.

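  • Estimation criterion

    [Figure: scatter plot of Chol against Age with the fitted line; d is the vertical distance from an observed yi to the line]

    The aim of least squares estimation is to find the values of a and b such that the sum of d² is as small as possible.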

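  • Estimating α and β

    After some calculation, we obtain:

    $b = \frac{S_{xy}}{S_{xx}}, \qquad a = \bar{y} - b\bar{x}$

    Where:

    $S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$

    If the regression assumptions hold, the estimates of α and β are unbiased and have minimum variance (that is, they are efficient).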
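
A minimal sketch of the least squares calculation, assuming the example data and the closed-form solution b = Sxy / Sxx, a = ȳ − b x̄. The result should agree with `lm()`, and with the coefficients in the R output shown later in the slides (about 1.089 and 0.0578).

```r
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1,
          3.8, 4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)

# Corrected sums of squares and cross-products
Sxx <- sum((age - mean(age))^2)
Sxy <- sum((age - mean(age)) * (chol - mean(chol)))

# Least squares estimates of slope and intercept
b <- Sxy / Sxx
a <- mean(chol) - b * mean(age)
c(intercept = a, slope = b)

# The same estimates from lm()
reg <- lm(chol ~ age)
coef(reg)

# SSE as a function of (a, b); any other line gives a larger value
sse <- function(a, b) sum((chol - a - b * age)^2)
sse(a, b) < sse(a + 0.1, b)
```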

  • Goodness-of-fit

    We now have the equation: Y = a + bX + e

    Question: how well does this equation describe the data?

    Answer: the coefficient of determination (R²), the proportion of the variation in Y that can be explained by the variation in X.

  • Partitioning of variation: concept

    SST = sum of squared differences between each value yi and the mean of y.

    SSR = sum of squared differences between the predicted values of y and the mean of y.

    SSE = sum of squared differences between the observed values and the predicted values of y.

    SST = SSR + SSE

    The coefficient of determination is then: R² = SSR / SST

  • Partitioning of variation: geometric illustration

    [Figure: Chol (Y) against Age (X), showing the mean of Y and the distances that make up SSR, SSE and SST]

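  • Partitioning of variation: algebra

    Some statistics:

    Total variation: $SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$

    Attributed to the model: $SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$

    Residual sum of squares: $SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

    SST = SSR + SSE

    SSR = SST − SSE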
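
The partition of variation can be checked numerically. The sketch below assumes the example data and a model fitted with `lm(chol ~ age)`; it computes SST, SSR and SSE from their definitions and derives R² from them.

```r
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1,
          3.8, 4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)

reg  <- lm(chol ~ age)
yhat <- fitted(reg)            # predicted values of y

SST <- sum((chol - mean(chol))^2)   # total variation
SSR <- sum((yhat - mean(chol))^2)   # variation attributed to the model
SSE <- sum((chol - yhat)^2)         # residual variation

# The partition SST = SSR + SSE
all.equal(SST, SSR + SSE)

# Coefficient of determination; in simple regression it equals r^2
R2 <- SSR / SST
c(R2 = R2, r_squared = cor(age, chol)^2, from_lm = summary(reg)$r.squared)
```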

  • Analysis of variance

    SS increases with the sample size (n).

    Mean squares (MS): SS standardised by the degrees of freedom (df).

    MSR = SSR / p (p = regression degrees of freedom, i.e. the number of predictors)

    MSE = SSE / (n − p − 1)

    MST = SST / (n − 1)

    Summary table for the analysis of variance (ANOVA):

    Source       d.f.        Sum of squares (SS)   Mean squares (MS)   F-test
    Regression   p           SSR                   MSR                 MSR/MSE
    Residual     n − p − 1   SSE                   MSE
    Total        n − 1       SST
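
As an illustration of how the ANOVA table is assembled, this sketch (assuming the example data, with p = 1 predictor; the names `F_stat` and `p_value` are chosen here) computes the mean squares and the F statistic by hand and compares them with `anova()`.

```r
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1,
          3.8, 4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)
n <- length(chol)
p <- 1                          # one predictor (age)

reg <- lm(chol ~ age)
SSR <- sum((fitted(reg) - mean(chol))^2)
SSE <- sum(residuals(reg)^2)

MSR <- SSR / p                  # regression mean square
MSE <- SSE / (n - p - 1)        # residual mean square
F_stat  <- MSR / MSE
p_value <- pf(F_stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)
c(MSR = MSR, MSE = MSE, F = F_stat, p = p_value)

# The same quantities from the built-in ANOVA table
tab <- anova(reg)
tab
```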

  • Hypothesis testing in regression analysis

    We now have:

    Sample data: Y = a + bX + e

    Population: Y = α + βX + ε

    H0: β = 0. There is no linear relationship between the outcome and the predictor (risk factor).

    In plain language: if there were truly no relationship (the null hypothesis), what is the probability of observing a sample at least as inconsistent with the null hypothesis as the one we actually obtained?

  • Inference about the slope (parameter β)

    Recall that ε is assumed to be normally distributed with mean 0 and variance σ².

    σ² is estimated by MSE (or s²).

    It can also be shown that the expected value of b is β, i.e. E(b) = β.

    The standard error of b is: $SE(b) = \frac{s}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$

    The test of whether β = 0 is therefore: t = b / SE(b), which follows a t distribution with n − 2 degrees of freedom.
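
A short sketch of the slope test, assuming the example data and SE(b) = s / sqrt(Sxx) with s² = MSE. The hand calculation is compared with the coefficient table from `summary()`; variable names are chosen here.

```r
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1,
          3.8, 4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)
n <- length(age)

reg <- lm(chol ~ age)
b   <- unname(coef(reg)["age"])

# Residual standard deviation s, with n - 2 degrees of freedom
s   <- sqrt(sum(residuals(reg)^2) / (n - 2))
Sxx <- sum((age - mean(age))^2)

se_b <- s / sqrt(Sxx)                        # standard error of the slope
t_b  <- b / se_b                             # t statistic for H0: beta = 0
p_b  <- 2 * pt(-abs(t_b), df = n - 2)        # two-sided p-value
c(se = se_b, t = t_b, p = p_b)

# Coefficient table from summary() for comparison
coefs <- summary(reg)$coefficients
coefs
```

In simple linear regression this t statistic is identical to the t statistic for testing zero correlation.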

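  • Confidence interval around a predicted value

    The observed value is Yi. The predicted value is: $\hat{Y}_i = a + b x_i$

    The standard error of the predicted value for an individual observation is:

    $SE(\hat{Y}_i) = s \sqrt{1 + \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n} (x_j - \bar{x})^2}}$

    Interval estimate for the Yi values: $\hat{Y}_i \pm t_{n-2,\,1-\alpha/2} \times SE(\hat{Y}_i)$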
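
The interval for a new individual observation can be sketched as below. This assumes the example data and the standard prediction-interval formula s·sqrt(1 + 1/n + (x0 − x̄)²/Sxx); the age x0 = 40 is chosen here purely for illustration. The result is compared with `predict()` using `interval = "prediction"`.

```r
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1,
          3.8, 4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)
n   <- length(age)
reg <- lm(chol ~ age)

x0  <- 40                                   # illustrative age, not from the slides
s   <- summary(reg)$sigma                   # residual standard error
Sxx <- sum((age - mean(age))^2)

# Predicted value and its standard error for a single new observation
fit0    <- sum(coef(reg) * c(1, x0))
se_pred <- s * sqrt(1 + 1 / n + (x0 - mean(age))^2 / Sxx)
manual  <- fit0 + c(-1, 1) * qt(0.975, df = n - 2) * se_pred

# Built-in equivalents: interval for an individual, and for the mean response
pred_int <- predict(reg, newdata = data.frame(age = x0), interval = "prediction")
conf_int <- predict(reg, newdata = data.frame(age = x0), interval = "confidence")
rbind(manual = c(fit0, manual), predict = pred_int[1, ])
```

The `"confidence"` interval, which drops the leading 1 under the square root, describes the mean response at x0 and is always narrower.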

  • Checking assumptions

    Constant variance

    Normality

    Correct functional form

    Model stability

    All of these can be assessed graphically. The residuals play an important role in every step of model diagnostics.

  • Checking assumptions

    Constant variance: plot the studentized residuals against the predicted values. Check whether the spread of the residuals is roughly constant across the range of fitted values.

    Normality: plot the residuals against their expected values under normality (the normal probability plot). If the residuals are normally distributed, the points should lie along the 45° line.

    Correct functional form: plot the residuals against the fitted values. Check whether the residuals show any non-linear pattern across the range of fitted values.

    Model stability: check whether one or a few observations are unduly influential. Use Cook's distance.

  • Checking assumptions (cont.)

    Cook's distance (D) measures how much the fitted values of the regression model change when the ith observation is removed from the data.

    Leverage measures how extreme the value xi is relative to the remaining x values.

    The studentized residual measures how extreme the value yi is relative to the remaining y values.
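
These diagnostics are all available in base R. The sketch below assumes the example data; `diag_tab` is a name chosen here. Note that `rstudent()` returns externally studentized residuals, while `rstandard()` gives the internally studentized version.

```r
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1,
          3.8, 4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)
reg <- lm(chol ~ age)

# Per-observation diagnostics
diag_tab <- data.frame(
  fitted   = fitted(reg),
  rstudent = rstudent(reg),        # studentized residuals
  leverage = hatvalues(reg),       # leverage of each x value
  cook     = cooks.distance(reg)   # Cook's distance
)

# The three observations with the largest Cook's distance
head(diag_tab[order(-diag_tab$cook), ], 3)

# Standard diagnostic plots: residuals vs fitted, normal Q-Q,
# scale-location, and residuals vs leverage
op <- par(mfrow = c(2, 2))
plot(reg)
par(op)
```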

  • Remedial measures

    Non-constant variance: transforming the response (y) to another scale (e.g., the logarithm) is often helpful. If a transformation does not resolve the non-constant variance, use a more robust estimator, such as iterative weighted least squares.

    Non-normality: non-normality and non-constant variance often go hand in hand.

    Outliers: check whether the data are correct, and use a robust estimation method.
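
As a rough sketch of two such remedies, the code below refits the example data with a log-transformed response and with a robust fit. It assumes the MASS package is installed; `rlm()` from MASS is used here as one readily available robust estimator (an M-estimator fitted by iterated re-weighted least squares), not necessarily the method the slides had in mind.

```r
library(MASS)   # provides rlm(); assumed to be installed

age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1,
          3.8, 4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)

# Ordinary least squares on the original scale
reg <- lm(chol ~ age)

# Remedy 1: model the logarithm of the response
reg_log <- lm(log(chol) ~ age)

# Remedy 2: robust regression, which down-weights outlying observations
reg_rob <- rlm(chol ~ age)

# Compare the slopes (the log model is on a different scale)
rbind(ols = coef(reg), robust = coef(reg_rob), log_scale = coef(reg_log))
```

With well-behaved data such as these, the robust and ordinary fits should be close; a large difference would point to influential observations.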

  • Regression analysis using R

    id
  • Regression analysis

    summary(reg)

    Call:
    lm(formula = chol ~ age)

    Residuals:
         Min       1Q   Median       3Q      Max
    -0.40729 -0.24133 -0.04522  0.17939  0.63040

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept) 1.089218   0.221466   4.918 0.000154 ***
    age         0.057788   0.005399  10.704 1.06e-08 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 0.3027 on 16 degrees of freedom
    Multiple R-Squared: 0.8775,  Adjusted R-squared: 0.8698
    F-statistic: 114.6 on 1 and 16 DF,  p-value: 1.058e-08

  • ANOVA

    anova(reg)

    Analysis of Variance Table

    Response: chol
              Df  Sum Sq Mean Sq F value    Pr(>F)
    age        1 10.4944 10.4944  114.57 1.058e-08 ***
    Residuals 16  1.4656  0.0916
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

  • Diagnostics: influential data

    p
  • A non-linear illustration: BMI and sexual attractiveness

    A study of 44 university students.

    Body mass index (BMI) was measured.

    Sexual attractiveness (SA) was scored.

    id
  • Linear regression analysis of BMI and SA

    reg

                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  4.92512    0.64489   7.637 1.81e-09 ***
    bmi         -0.05967    0.02862  -2.084   0.0432 *
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 1.354 on 42 degrees of freedom
    Multiple R-Squared: 0.09376,  Adjusted R-squared: 0.07218
    F-statistic: 4.345 on 1 and 42 DF,  p-value: 0.04323

  • BMI and SA: analysis of residuals

    plot(reg)

  • BMI and SA: scatter plot

    reg
  • Re-analysis of these data

    # Fit 3 regression models
    linear
  • Some comments: interpreting correlation

    The correlation coefficient lies between −1 and +1. A very small correlation coefficient does not mean that there is no relationship between the two variables; the relationship may be non-linear.

    For curved but monotonic relationships, a rank correlation is more appropriate than Pearson's correlation.

    A small correlation (e.g., 0.1) can be statistically significant without being clinically meaningful.

    R² is another measure of the strength of the relationship. r = 0.7 looks impressive, but the corresponding R² is only 0.49!

    Correlation does not imply causation.

  • Some comments: interpreting correlation

    Be careful with multiple correlations. With p variables there are p(p − 1)/2 pairwise correlations, and the problem of false positives (spurious correlations) arises.

    Correlation cannot be inferred from other correlations. r(age, weight) = 0.05 and r(weight, fat) = 0.03 do not imply that r(age, fat) is close to zero. In fact r(age, fat) = 0.79.
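
A small worked example of the multiplicity problem, with p = 10 variables and a 5% significance level chosen here for illustration. If none of the variables were truly correlated, roughly this many pairs would still be expected to come out "significant" by chance.

```r
p     <- 10      # number of variables (illustrative)
alpha <- 0.05    # significance level for each test

# Number of distinct pairwise correlations
n_pairs <- p * (p - 1) / 2
n_pairs == choose(p, 2)

# Expected number of false positives if no true correlation exists
expected_false_pos <- n_pairs * alpha
c(pairs = n_pairs, expected_false_positives = expected_false_pos)
```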

  • Some comments: interpreting correlation

    The correlation (regression) line is only an estimate of the relationship between these variables in the population.

    There is uncertainty associated with the estimated parameters.

    The regression line should not be used to make predictions for x values outside the observed range (extrapolation).

    A statistical model is an approximation; the true relationship may be non-linear, but a linear relationship is often a reasonable approximation.

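  • Some comments: reporting results

    The results of a correlation or regression analysis should be described in full: the nature of the response variable (outcome) and of the predictor variables (risk factors), any transformations used, the checks of the assumptions, and so on.

    The regression coefficients (a, b), together with their standard errors, and R² should also be reported.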

  • Some final comments

    Equations are the foundation on which scientific ideas take hold and flourish.

    Equations can be as beautiful as poems, but they can also be like onions.

    So be very vigilant and careful when building equations!

  • Acknowledgements

    We sincerely thank the pharmaceutical company Bridge Healthcare, Australia, for sponsoring the trip.