Linear Regression Analysis

Lecture on linear regression analysis

  • Introduction to linear regression analysis. Dr. Tuan V. Nguyen, Garvan Institute of Medical Research, Sydney

  • Give a man three weapons (correlation, regression and a pen) and he will use all three (Anon, 1978)

  • Example: age and cholesterol concentration measured in 18 people

    ID   Age   Chol (mg/ml)
     1    46   3.5
     2    20   1.9
     3    52   4.0
     4    30   2.6
     5    57   4.5
     6    25   3.0
     7    28   2.9
     8    36   3.8
     9    22   2.1
    10    43   3.8
    11    57   4.1
    12    33   3.0
    13    22   2.5
    14    63   4.6
    15    40   3.2
    16    48   4.2
    17    28   2.3
    18    49   4.0

  • Entering the data into R: id
  • Relationship between age and cholesterol concentration
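
The data-entry code on this slide is cut off in the source after `id`. A minimal sketch of what such a step could look like, assuming the variable names `id`, `age` and `chol` (`age` and `chol` appear in the model formula later in the lecture; `dat` is a name chosen here) and the values from the example table:

```r
# Age (years) and cholesterol concentration for the 18 subjects, in ID order
id   <- 1:18
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0, 2.5, 4.6,
          3.2, 4.2, 2.3, 4.0)
dat  <- data.frame(id, age, chol)

# Means and standard deviations, for comparison with the summary rows
# of the correlation example
round(colMeans(dat[, c("age", "chol")]), 2)
round(sapply(dat[, c("age", "chol")], sd), 2)
```

Calling `plot(age, chol)` on these vectors would draw the scatter plot of cholesterol against age.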

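  • Research questions

    - Is there a relationship between age and cholesterol concentration?
    - How strong is that relationship?
    - What cholesterol concentration is predicted for a given age?
    - These questions are addressed by correlation and regression analysis.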

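  • Variance and covariance: algebra

    Let x and y be two random variables observed in a sample of n subjects.

    The variability of x and of y is measured by the variance: $\mathrm{var}(x) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$ and $\mathrm{var}(y) = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2$.

    The covariance between x and y appears in the variance of their sum. When x and y are uncorrelated, $\mathrm{var}(x + y) = \mathrm{var}(x) + \mathrm{var}(y)$. In general, $\mathrm{var}(x + y) = \mathrm{var}(x) + \mathrm{var}(y) + 2\,\mathrm{cov}(x, y)$, where $\mathrm{cov}(x, y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$.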
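
As a quick numerical check of these identities, the R sketch below uses the age and cholesterol values from the example (the variable names are assumptions). Adding an age to a cholesterol value has no physical meaning; the sum is used only to illustrate the algebra.

```r
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0, 2.5, 4.6,
          3.2, 4.2, 2.3, 4.0)
n <- length(age)

# Variance and covariance from their definitions (divisor n - 1)
var_age <- sum((age - mean(age))^2) / (n - 1)
cov_xy  <- sum((age - mean(age)) * (chol - mean(chol))) / (n - 1)

# Both sides of var(x + y) = var(x) + var(y) + 2 cov(x, y)
lhs <- var(age + chol)
rhs <- var(age) + var(chol) + 2 * cov(age, chol)

round(c(var_age = var_age, cov_xy = cov_xy, lhs = lhs, rhs = rhs), 3)
```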

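  • Variance and covariance: geometry

    Independence and dependence between x and y can be represented geometrically. When x and y are independent, they form the two legs of a right-angled triangle with hypotenuse h, so $h^2 = x^2 + y^2$. When they are dependent, the sides x and y meet at an angle H and the third side satisfies $h^2 = x^2 + y^2 - 2xy\cos(H)$.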

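  • Meaning of variance and covariance

    - Variance is never negative.
    - If x and y are independent, their covariance is 0. The converse does not always hold: a covariance of 0 only rules out a linear relationship.
    - Covariance is a sum of cross-products, so it can be negative or positive.
    - Negative covariance: deviations in the two distributions run in opposite directions.
    - Positive covariance: deviations in the two distributions run in the same direction.
    - Covariance is a measure of the strength of the relationship.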

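  • Covariance and correlation

    Covariance depends on the units of measurement. The correlation coefficient (r) between x and y is a standardized covariance, defined by $r = \frac{\mathrm{cov}(x, y)}{\sqrt{\mathrm{var}(x)\,\mathrm{var}(y)}} = \frac{\mathrm{cov}(x, y)}{SD_x \times SD_y}$.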
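
The unit dependence of covariance, and its removal by standardization, can be illustrated with a short R sketch. It assumes the example data and the variable names `age` and `chol`; expressing age in months is a hypothetical rescaling used only for illustration.

```r
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0, 2.5, 4.6,
          3.2, 4.2, 2.3, 4.0)

# Correlation as a standardized covariance
r_manual  <- cov(age, chol) / (sd(age) * sd(chol))
r_builtin <- cor(age, chol)

# Rescaling age from years to months multiplies the covariance by 12
# but leaves the correlation unchanged
age_months <- age * 12
cov_years  <- cov(age, chol)
cov_months <- cov(age_months, chol)
r_months   <- cor(age_months, chol)

round(c(r_manual = r_manual, r_builtin = r_builtin,
        cov_years = cov_years, cov_months = cov_months, r_months = r_months), 3)
```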

  • Positive and negative correlation (two scatter plots: r = 0.9 and r = -0.9)

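  • Hypothesis test for correlation

    Hypotheses: $H_0: r = 0$ versus $H_1: r \neq 0$.

    Standard error of r: $SE(r) = \sqrt{\frac{1 - r^2}{n - 2}}$

    The t-statistic: $t = \frac{r}{SE(r)} = r\sqrt{\frac{n - 2}{1 - r^2}}$. This statistic follows a t distribution with n - 2 degrees of freedom.

    Fisher's z-transformation: $z = \frac{1}{2}\ln\frac{1 + r}{1 - r}$

    Standard error of z: $SE(z) = \frac{1}{\sqrt{n - 3}}$

    The 95% CI of z can therefore be computed as $z \pm \frac{1.96}{\sqrt{n - 3}}$.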

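  • Illustration of correlation analysis

    ID     Age (x)   Cholesterol (y; mg/100ml)
     1     46        3.5
     2     20        1.9
     3     52        4.0
     4     30        2.6
     5     57        4.5
     6     25        3.0
     7     28        2.9
     8     36        3.8
     9     22        2.1
    10     43        3.8
    11     57        4.1
    12     33        3.0
    13     22        2.5
    14     63        4.6
    15     40        3.2
    16     48        4.2
    17     28        2.3
    18     49        4.0
    Mean   38.83     3.33
    SD     13.60     0.84

    Cov(x, y) = 10.68
    r = 10.68 / (13.60 × 0.84) ≈ 0.94 (0.937 from the unrounded data)
    SE(r) = sqrt((1 - 0.937^2) / 16) ≈ 0.087
    t-statistic = 0.937 / 0.087 ≈ 10.7
    The critical t-value with n - 2 = 16 df and alpha = 5% is 2.12.
    Conclusion: there is a statistically significant correlation between age and cholesterol concentration.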
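
The test of the correlation coefficient can be sketched in R as follows. This is an illustration under stated assumptions, not the lecturer's own code: the variable names are chosen here, the standard error and Fisher transformation are the textbook forms, and `cor.test()` from base R is used only as a cross-check.

```r
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0, 2.5, 4.6,
          3.2, 4.2, 2.3, 4.0)

n <- length(age)
r <- cor(age, chol)

# t-test of H0: r = 0, with n - 2 degrees of freedom
se_r   <- sqrt((1 - r^2) / (n - 2))
t_stat <- r / se_r
p_val  <- 2 * pt(-abs(t_stat), df = n - 2)

# Fisher's z-transformation, 95% CI on the z scale, back-transformed to r
z    <- 0.5 * log((1 + r) / (1 - r))
se_z <- 1 / sqrt(n - 3)
ci_z <- z + c(-1, 1) * qnorm(0.975) * se_z
ci_r <- tanh(ci_z)

# Built-in equivalent, for comparison
ct <- cor.test(age, chol)

round(c(r = r, t = t_stat, lower = ci_r[1], upper = ci_r[2]), 3)
```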

  • Simple linear regression analysis

    - Assessment: quantify the relationship between two variables.
    - Prediction: build and evaluate a predictive model.
    - Control: adjust for confounding factors (in multivariable analysis).
    - Simple linear regression examines only two variables: a response variable and a predictor variable.
    - It makes no adjustment for confounders or other covariates.

  • Relationship between age and cholesterol concentration

  • Linear regression model

    Y: a random variable, the response.
    X: a random variable, the predictor or risk factor.
    Both Y and X can be categorical (e.g., yes / no) or continuous (e.g., age). If Y is categorical, use a logistic regression model; if Y is continuous, use a simple linear regression model.

    Model: $Y = \alpha + \beta X + \varepsilon$
    $\alpha$: intercept
    $\beta$: slope / gradient
    $\varepsilon$: random error (the variation in y between subjects even when x is held constant, e.g., the variation in cholesterol among people of the same age)

  • Assumptions of the linear model

    - The relationship between Y and X is linear (a straight line) in the parameters;
    - X is measured without error;
    - The values of Y are independent of each other (e.g., Y1 is not correlated with Y2);
    - The random error ($\varepsilon$) is normally distributed with mean 0 and constant variance.

  • Expected value and variance

    If the assumptions hold:
    The expected value of Y is $E(Y \mid x) = \alpha + \beta x$.
    The variance of Y is $\mathrm{var}(Y) = \mathrm{var}(\varepsilon) = \sigma^2$.

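  • Estimating the parameters of the linear regression model

    Given two points A(x1, y1) and B(x2, y2) in a two-dimensional plane, there is a straight line joining them.

    Slope: $m = \frac{dy}{dx} = \frac{y_2 - y_1}{x_2 - x_1}$
    Equation: $y = mx + a$

    [Figure: the line through A(x1, y1) and B(x2, y2), with intercept a on the y axis and slope dy/dx]

    What if we have more than two points?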

  • Estimating $\alpha$ and $\beta$

    We have a series of pairs: (x1, y1), (x2, y2), (x3, y3), ..., (xn, yn).

    Let a and b be the estimates of the parameters $\alpha$ and $\beta$. The equation for the study sample is $Y^* = a + bx$.

    Aim: find the values of a and b that make the differences $(Y - Y^*)$ as small as possible.

    Let $SSE = \sum_{i=1}^{n}(Y_i - a - bx_i)^2$. The values of a and b that minimise SSE are called the least squares estimates.

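  • Estimation criterion

    [Figure: Chol plotted against Age, showing the vertical distance d between an observed value yi and the fitted line]

    The aim of least squares estimation is to find the values of a and b for which the sum of $d^2$ is smallest.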

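  • Estimating $\alpha$ and $\beta$

    After some algebra, we obtain $b = \frac{S_{xy}}{S_{xx}}$ and $a = \bar{y} - b\bar{x}$.

    Where $S_{xx} = \sum_{i=1}^{n}(x_i - \bar{x})^2$ and $S_{xy} = \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$.

    If the regression assumptions hold, the estimators of $\alpha$ and $\beta$ are unbiased and have minimum variance (that is, they are efficient).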
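
The least squares estimates can be computed directly from the sums of squares and cross-products. The R sketch below does this for the age and cholesterol example; the names `Sxx`, `Sxy` and the helper `sse()` are assumptions introduced for illustration.

```r
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0, 2.5, 4.6,
          3.2, 4.2, 2.3, 4.0)

# Sum of squares of x and sum of cross-products of x and y
Sxx <- sum((age - mean(age))^2)
Sxy <- sum((age - mean(age)) * (chol - mean(chol)))

# Least squares estimates of the slope and intercept
b <- Sxy / Sxx
a <- mean(chol) - b * mean(age)

# Sum of squared errors for any candidate pair (a, b)
sse <- function(a, b) sum((chol - a - b * age)^2)

# SSE at the least squares solution and at two nearby alternatives
c(at_estimates = sse(a, b),
  shifted_a    = sse(a + 0.1, b),
  shifted_b    = sse(a, b + 0.01))
```

Moving either estimate away from the least squares solution increases the SSE, which is the defining property of the estimates.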

  • Goodness-of-fit

    We now have the equation $Y = a + bX + e$.

    Question: how well does this equation describe the data?

    Answer: the coefficient of determination ($R^2$), the proportion of the variation in Y that is explained by the variation in X.

  • Partitioning of variation: concept

    SST = the sum of squared differences between each value yi and the mean of y.

    SSR = the sum of squared differences between the predicted values of y and the mean of y.

    SSE = the sum of squared differences between the observed values and the predicted values of y.

    SST = SSR + SSE

    The coefficient of determination is then $R^2 = SSR / SST$.

  • Partitioning of variation: geometric illustration

    [Figure: Chol (Y) plotted against Age (X), showing the mean of Y and the SSR, SSE and SST components]

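  • Partitioning of variation: algebra

    Some statistics:
    Total variation: $SST = \sum_{i=1}^{n}(y_i - \bar{y})^2$
    Attributed to the model: $SSR = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2$
    Residual sum of squares: $SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$

    SST = SSR + SSE
    SSR = SST - SSE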
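
A short R sketch of this decomposition for the age and cholesterol example follows. It assumes the model is fitted with `lm(chol ~ age)`, the formula shown in the lecture's R output; the other names are chosen here.

```r
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0, 2.5, 4.6,
          3.2, 4.2, 2.3, 4.0)

reg  <- lm(chol ~ age)
yhat <- fitted(reg)           # predicted cholesterol for each subject

SST <- sum((chol - mean(chol))^2)   # total variation
SSR <- sum((yhat - mean(chol))^2)   # variation attributed to the model
SSE <- sum((chol - yhat)^2)         # residual variation

R2 <- SSR / SST

round(c(SST = SST, SSR = SSR, SSE = SSE, R2 = R2), 4)
```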

  • Analysis of variance

    Sums of squares (SS) increase with the sample size (n).
    Mean squares (MS) are sums of squares standardized by their degrees of freedom (df):
    MSR = SSR / p (p = degrees of freedom of the regression, i.e. the number of predictors)
    MSE = SSE / (n - p - 1)
    MST = SST / (n - 1)

    Summary table of the analysis of variance (ANOVA):

    Source       d.f.        Sum of squares (SS)   Mean squares (MS)   F-test
    Regression   p           SSR                   MSR                 MSR/MSE
    Residual     n - p - 1   SSE                   MSE
    Total        n - 1       SST
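
The ANOVA table can be assembled by hand from these definitions. The R sketch below does so for the example data with p = 1 predictor; the object names (`anova_tab`, `F_stat`) are assumptions, and the built-in `anova()` is used only to confirm the F statistic.

```r
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0, 2.5, 4.6,
          3.2, 4.2, 2.3, 4.0)

reg <- lm(chol ~ age)
n   <- length(chol)
p   <- 1                       # one predictor (age)

SSR <- sum((fitted(reg) - mean(chol))^2)
SSE <- sum(resid(reg)^2)
SST <- SSR + SSE

# Mean squares: sums of squares divided by their degrees of freedom
MSR <- SSR / p
MSE <- SSE / (n - p - 1)

# F test of the regression
F_stat <- MSR / MSE
p_val  <- pf(F_stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)

anova_tab <- data.frame(
  source = c("Regression", "Residual", "Total"),
  df     = c(p, n - p - 1, n - 1),
  SS     = c(SSR, SSE, SST),
  MS     = c(MSR, MSE, NA),
  F      = c(F_stat, NA, NA)
)
anova_tab
```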

  • Hypothesis tests in regression analysis

    We now have:
    Sample data: $Y = a + bX + e$
    Population: $Y = \alpha + \beta X + \varepsilon$

    $H_0: \beta = 0$. There is no linear relationship between the outcome and the predictor (risk factor).

    In plain language: if there were truly no relationship (the null hypothesis), what is the probability of observing a sample at least as inconsistent with that hypothesis as the one we obtained?

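  • Inference about the slope (parameter $\beta$)

    Recall that $\varepsilon$ is assumed to be normally distributed with mean 0 and variance $\sigma^2$. $\sigma^2$ is estimated by MSE (or $s^2$).

    It can also be shown that the expected value of b is $\beta$, i.e. $E(b) = \beta$. The standard error of b is $SE(b) = \frac{s}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$.

    The test of whether $\beta = 0$ is therefore $t = b / SE(b)$, which follows a t distribution with n - 2 degrees of freedom.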
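
The slope test can be reproduced step by step in R. This sketch assumes the example data and the model `lm(chol ~ age)`; with n = 18 observations and two estimated coefficients, the residual degrees of freedom are n - 2 = 16, as in the lecture's R output.

```r
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0, 2.5, 4.6,
          3.2, 4.2, 2.3, 4.0)

reg <- lm(chol ~ age)
n   <- length(age)

# Residual standard error s = sqrt(MSE)
s <- sqrt(sum(resid(reg)^2) / (n - 2))

# Standard error of the slope and the t test of beta = 0
b    <- unname(coef(reg)["age"])
se_b <- s / sqrt(sum((age - mean(age))^2))
t_b  <- b / se_b
p_b  <- 2 * pt(-abs(t_b), df = n - 2)

round(c(s = s, b = b, se_b = se_b, t = t_b), 4)
```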

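  • Confidence interval around a predicted value

    The observed value is $Y_i$. The predicted value is $\hat{Y}_i = a + bx_i$.

    The standard error of the predicted value for an individual with $x = x_i$ is $SE(\hat{Y}_i) = s\sqrt{1 + \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n}(x_j - \bar{x})^2}}$. For the mean response at $x_i$, rather than an individual value, the leading 1 under the square root is omitted.

    Interval estimate for the values $Y_i$: $\hat{Y}_i \pm t_{n-2,\,0.975} \times SE(\hat{Y}_i)$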
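
A sketch of an interval for an individual predicted value in R, assuming the example data. The age of 50 years is a hypothetical value chosen for illustration, and the hand calculation uses the textbook prediction-interval formula (with the leading 1 under the square root), which is what `predict(..., interval = "prediction")` computes.

```r
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0, 2.5, 4.6,
          3.2, 4.2, 2.3, 4.0)

reg <- lm(chol ~ age)
n   <- length(age)
s   <- summary(reg)$sigma     # residual standard error

x0 <- 50                      # hypothetical age at which to predict
y0 <- unname(coef(reg)[1] + coef(reg)[2] * x0)

# Standard error of an individual predicted value at x0
se_pred <- s * sqrt(1 + 1 / n + (x0 - mean(age))^2 / sum((age - mean(age))^2))
manual  <- y0 + c(-1, 1) * qt(0.975, df = n - 2) * se_pred

# The same interval from predict()
builtin <- predict(reg, newdata = data.frame(age = x0),
                   interval = "prediction", level = 0.95)

round(c(fit = y0, lower = manual[1], upper = manual[2]), 3)
builtin
```

Replacing `interval = "prediction"` with `interval = "confidence"` gives the narrower interval for the mean cholesterol at that age.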

  • Checking assumptions

    - Constant variance
    - Normality
    - Correct functional form
    - Model stability

    All of these can be examined graphically. The residuals of the model play an important role at every step of model diagnostics.

  • Checking assumptions

    Constant variance: plot the studentized residuals against the predicted (fitted) values. Check whether the spread of the residuals is roughly constant across the range of fitted values.

    Normality: plot the residuals against their expected values under a normal distribution (the normal probability plot). If the residuals follow a normal distribution, the points should lie on the 45° diagonal line.

    Correct functional form: plot the residuals against the fitted values. Check whether the residuals show a non-linear trend across the range of fitted values.

    Model stability: check whether one or more observations are influential, using Cook's distance.

  • Checking assumptions (continued)

    Cook's distance (D) measures how much the fitted values of the regression model change when the i-th observation is removed from the data set.

    Leverage measures how extreme the value xi is relative to the remaining x values.

    The studentized residual measures how extreme the value yi is relative to the remaining y values.
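
These three diagnostics are available in base R through `cooks.distance()`, `hatvalues()` and `rstudent()`. The sketch below tabulates them for the example data; the data frame name `infl` and its column names are assumptions.

```r
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0, 2.5, 4.6,
          3.2, 4.2, 2.3, 4.0)

reg <- lm(chol ~ age)

infl <- data.frame(
  id       = seq_along(age),
  age      = age,
  chol     = chol,
  leverage = hatvalues(reg),        # how extreme each x value is
  student  = rstudent(reg),         # studentized residual for each y value
  cook     = cooks.distance(reg)    # change in the fit when the case is dropped
)

# The three observations with the largest Cook's distance
head(infl[order(-infl$cook), ], 3)
```

Calling `plot(reg)` draws the corresponding diagnostic plots (residuals against fitted values, normal probability plot, and residuals against leverage).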

  • Remedial measures

    Non-constant variance:
    - Transforming the response (y) to another scale (e.g., logarithm) is often helpful.
    - If transformation does not resolve the non-constant variance, use a more robust estimator, such as iterative weighted least squares.

    Non-normality:
    - Non-normality and non-constant variance often go together.

    Outliers:
    - Check that the data are correct.
    - Use a robust estimation method.

  • Regression analysis using R: id
  • Regression analysis

    summary(reg)

    Call:
    lm(formula = chol ~ age)

    Residuals:
         Min       1Q   Median       3Q      Max
    -0.40729 -0.24133 -0.04522  0.17939  0.63040

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept) 1.089218   0.221466   4.918 0.000154 ***
    age         0.057788   0.005399  10.704 1.06e-08 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 0.3027 on 16 degrees of freedom
    Multiple R-Squared: 0.8775, Adjusted R-squared: 0.8698
    F-statistic: 114.6 on 1 and 16 DF, p-value: 1.058e-08

  • ANOVA

    anova(reg)

    Analysis of Variance Table

    Response: chol
              Df  Sum Sq Mean Sq F value    Pr(>F)
    age        1 10.4944 10.4944  114.57 1.058e-08 ***
    Residuals 16  1.4656  0.0916
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
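
The commands that produce output of this kind are truncated in the source. A minimal sketch, assuming the data are held in vectors named `age` and `chol` and the fitted model is stored as `reg` (the name used in the `summary(reg)` and `anova(reg)` calls shown):

```r
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0, 2.5, 4.6,
          3.2, 4.2, 2.3, 4.0)

# Fit the simple linear regression of cholesterol on age
reg <- lm(chol ~ age)

# Coefficients, standard errors, t tests, R-squared and overall F test
summary(reg)

# Analysis of variance table for the same model
anova(reg)
```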

  • Diagnostics: influence of individual observations: p
  • A non-linear illustration: BMI and sexual attractiveness

    - A study of 44 university students
    - Body mass index (BMI) was measured
    - Sexual attractiveness (SA) was scored

    id
  • Linear regression analysis of BMI and SA

    reg

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  4.92512    0.64489   7.637 1.81e-09 ***
    bmi         -0.05967    0.02862  -2.084   0.0432 *
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 1.354 on 42 degrees of freedom
    Multiple R-Squared: 0.09376, Adjusted R-squared: 0.07218
    F-statistic: 4.345 on 1 and 42 DF, p-value: 0.04323

  • BMI and SA: residual analysis

    plot(reg)

  • BMI and SA: scatter plot

    reg
  • Re-analysis of these data

    # Fit 3 regression models
    linear
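
The code for the three models is cut off in the source after `linear`, and the BMI data are not included. The sketch below is therefore only a guess at the general approach: it assumes the three models are linear, quadratic and cubic polynomials in BMI, and it uses simulated stand-in data (not the study's data) with hypothetical variable names `bmi` and `sa`.

```r
# Simulated stand-in data for 44 subjects; NOT the data from the study
set.seed(1)
bmi <- seq(16, 34, length.out = 44)
sa  <- 3.5 - 0.02 * (bmi - 22)^2 + rnorm(44, sd = 0.5)

# Fit three nested regression models of increasing flexibility
linear    <- lm(sa ~ bmi)
quadratic <- lm(sa ~ poly(bmi, 2))
cubic     <- lm(sa ~ poly(bmi, 3))

# R-squared for each model
r2 <- sapply(list(linear = linear, quadratic = quadratic, cubic = cubic),
             function(m) summary(m)$r.squared)
round(r2, 3)

# Sequential F tests: does each added term improve the fit?
anova(linear, quadratic, cubic)
```

Because the models are nested, R-squared can only increase as terms are added; the F tests indicate whether the increase is more than chance would produce.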
  • Some comments: interpreting correlation

    The correlation coefficient lies between -1 and +1. A very small correlation coefficient does not mean that there is no relationship between the two variables; the relationship may be non-linear.

    For curvilinear (monotone) relationships, a rank correlation is preferable to Pearson's correlation.

    A low correlation coefficient (e.g., 0.1) can be statistically significant without being clinically meaningful.

    $R^2$ is another measure of the strength of the relationship. r = 0.7 looks impressive, but $R^2$ is only 0.49!

    Correlation does not imply causation.
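
The difference between Pearson's and rank correlation on a curved relationship can be seen in a small R sketch. The data are artificial (an exponential curve chosen for illustration), and Spearman's coefficient is used here as one example of a rank correlation.

```r
# Artificial, perfectly monotone but strongly curved relationship
x <- 1:20
y <- exp(x / 4)

pearson  <- cor(x, y)                       # measures linear association
spearman <- cor(x, y, method = "spearman")  # measures monotone association

round(c(pearson = pearson, spearman = spearman), 3)

# r = 0.7 corresponds to only 49% of variation explained
0.7^2
```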

  • Some comments: interpreting correlation

    Be careful with multiple correlations. With p variables there are p(p - 1)/2 pairwise correlations, and with that many tests some false positives (spurious correlations) are to be expected.

    A correlation cannot be inferred from other correlations. r(age, weight) = 0.05 and r(weight, fat) = 0.03 do not mean that r(age, fat) is close to zero. In fact, r(age, fat) = 0.79.
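
To give a sense of scale for the multiple-correlation problem, the R sketch below counts the pairs for a hypothetical p = 10 variables and computes the chance of at least one false positive at alpha = 0.05. The calculation assumes the tests are independent, which is a simplification, since pairwise correlations share variables.

```r
p     <- 10      # hypothetical number of variables
alpha <- 0.05    # significance level per test

# Number of distinct pairwise correlations: p(p - 1)/2
m <- choose(p, 2)

# Expected number of false positives if no true correlation exists
expected_fp <- m * alpha

# Probability of at least one false positive, assuming independent tests
prob_any_fp <- 1 - (1 - alpha)^m

c(pairs = m, expected_fp = expected_fp, prob_any_fp = round(prob_any_fp, 3))
```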

  • Some comments: interpreting the relationship

    The fitted (regression) line is only an estimate of the relationship between these variables in the population.

    There is uncertainty associated with the estimated parameters.

    The regression line should not be used to make predictions for x values outside the observed range (extrapolation).

    A statistical model is an approximation; the true relationship may be non-linear, but a linear relationship is often a reasonable approximation.

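  • Some comments: reporting results

    The results of a correlation or regression analysis should be described in full: the nature of the response variable (outcome), the predictor variables (risk factors), any transformations, the checks of assumptions, and so on.

    The regression coefficients (a, b), together with their standard errors, and $R^2$ should also be reported.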

  • Some final comments

    Equations are the foundation on which scientific ideas take hold and flourish.

    Equations can be as beautiful as poems, but they can also be like onions.

    So be vigilant and careful when building equations!

  • Acknowledgements

    We sincerely thank the pharmaceutical company Bridge Healthcare, Australia, for sponsoring the trip.