
AMMBR II

Gerrit Rooks

Today

• Introduction to Stata
  – Files / directories
  – Stata syntax
  – Useful commands / functions

• Logistic regression analysis with Stata
  – Estimation
  – GOF
  – Coefficients
  – Checking assumptions

Stata file types

• .ado – Programs that add commands to Stata

• .do – Batch files that execute a set of Stata commands

• .dta – Data file in Stata's format

• .log – Output saved as plain text by the log using command

The working directory

• The working directory is the default directory for any file operations such as using & saving data, or logging output

• cd "d:\my work\"

Saving output to log files

• Syntax for the log command
  – log using filename [, append replace [smcl|text]]

• To close a log file
  – log close

Using and saving datasets

• Load a Stata dataset
  – use d:\myproject\data.dta, clear

• Save
  – save d:\myproject\data, replace

• Using change directory
  – cd d:\myproject
  – use data, clear
  – save data, replace

Entering data

• Data in other formats
  – You can use SPSS to convert data
  – You can use the infile and insheet commands to import data in ASCII format

• Entering data by hand
  – Type edit or just click on the data-editor button

Do-files

• You can create a text file that contains a series of commands

• Use the do-editor to work with do-files

• Example I

Adding comments

• // or * denote comments that Stata should ignore

• Stata ignores whatever follows after /// and treats the next line as a continuation

• Example II
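The comment markers can be tried in a small do-file such as the sketch below. It assumes the auto dataset that ships with Stata as stand-in data; the variables price, mpg, weight and foreign belong to that dataset, not to the course data. Note that // and /// are meant for do-files, not for commands typed interactively.

```stata
* a whole-line comment starts with an asterisk
sysuse auto, clear        // stand-in data shipped with Stata (an assumption)

// a long command continued on the next line with ///
summarize price mpg weight ///
    if foreign == 1
```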

A recommended structure

// if a log file is open, close it
capture log close
// don't pause when output scrolls off the page
set more off
// change directory to your working directory
cd d:\myproject
// log results to file myfile.log
log using myfile, replace text
// myfile.do - written 7 feb 2010 to illustrate do-files
//
// your commands here
//
// close the log file
log close

Serious data analysis

• Ensure replicability: use do-files and log files

• Document your do-files
  – What is obvious today is baffling in six months

• Keep a research log
  – A diary that includes a description of every program you run

• Develop a system for naming files

Serious data analysis

• New variables should be given new names
• Use labels and notes
• Double-check every new variable
• ARCHIVE

The Stata syntax

• regress y x1 x2 if x3 < 20, cluster(x4)

1. regress = Command
   – What action do you want to perform?

2. y x1 x2 = Names of variables, files or other objects
   – On what things is the command performed?

3. if x3 < 20 = Qualifier on observations
   – On which observations should the command be performed?

4. , cluster(x4) = Options
   – What special things should be done in executing the command?

Examples

• tabulate smoking race if agemother > 30, row

• Example of the if qualifier
  – sum agemother if smoking == 1 & weightmother < 100

Elements used for logical statements

Operator   Definition                 Example
==         Equal to                   if male == 1
!=         Not equal to               if male != 1
>          Greater than               if age > 20
>=         Greater than or equal to   if age >= 21
<          Less than                  if age < 66
<=         Less than or equal to      if age <= 65
&          And                        if age == 21 & male == 1
|          Or                         if age <= 21 | age >= 65
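A few of these operators combined in if qualifiers, as a minimal sketch. It assumes the auto dataset that ships with Stata as stand-in data (foreign, mpg, price and make are its variables); the cut-off values are arbitrary.

```stata
sysuse auto, clear
* cars that are foreign AND do at least 25 miles per gallon
count if foreign == 1 & mpg >= 25
* cars that are very cheap OR very expensive
list make price if price < 4000 | price > 12000
* parentheses make the grouping of & and | explicit
summarize price if foreign != 1 & (mpg < 15 | mpg > 30)
```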

Missing values

• Missing values are automatically excluded when Stata fits models; they are stored as the largest positive values

• Beware
  – The expression 'age > 65' is therefore also true for missing values
  – To be sure, type: 'age > 65 & age != .'
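The pitfall can be seen with any variable that contains missing values. The sketch below assumes the auto dataset shipped with Stata, where rep78 is missing for some cars; comparing the three counts shows how many missing values slip through the first condition.

```stata
sysuse auto, clear
* rep78 contains missing values in this stand-in dataset
count if rep78 > 4                     // also counts the missing values
count if rep78 > 4 & rep78 != .        // excludes the system missing value .
count if rep78 > 4 & !missing(rep78)   // also excludes extended missings (.a, .b, ...)
```

The last form is the safest when a dataset may use extended missing values, since those are larger than the plain `.`.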

Selecting observations

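• drop varlist – removes the listed variables
• keep varlist – keeps only the listed variables

• drop if age < 65 – removes the observations that meet the condition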

Creating new variables

• generate command
  – generate age2 = age * age
  – generate – see help functions
  – Sometimes the command egen is a useful alternative, for instance:
  – egen meanage = mean(age)
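The difference between generate and egen is easiest to see side by side. This is a sketch on the auto dataset shipped with Stata (an assumption; the new variable names are made up for illustration).

```stata
sysuse auto, clear
generate weight2 = weight * weight      // row-by-row arithmetic
generate lnprice = ln(price)            // a function applied to each row
egen meanprice = mean(price)            // the overall mean, repeated in every row
bysort foreign: egen meanprice_grp = mean(price)   // a separate mean per group
summarize weight2 lnprice meanprice meanprice_grp
```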

Useful functions

Function   Definition        Example
+          Addition          gen y = a+b
-          Subtraction       gen y = a-b
/          Division          gen density = population/area
*          Multiplication    gen y = a*b
^          Take to a power   gen y = a^3
ln         Natural log       gen lnwage = ln(wage)
exp        Exponential       gen y = exp(b)
sqrt       Square root       gen agesqrt = sqrt(age)

Replace command

• replace has the same syntax as generate but is used to change values of a variable that already exists

• gen age_dum = .
• replace age_dum = 0 if age < 5
• replace age_dum = 1 if age >= 5 & age != .

Recode

• Change values of existing variables
  – Change 1 to 2 and 3 to 4:
    recode origvar (1=2) (3=4), gen(myvar1)
  – Change missings to 1:
    recode origvar (.=1), gen(myvar2)
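Both ways of deriving a new variable, sketched on the auto dataset shipped with Stata (an assumption; heavy and rep3 are hypothetical names and the cut-offs are arbitrary). In both cases the original variable is left untouched, so the result can be cross-checked.

```stata
sysuse auto, clear
* dummy built with generate + replace, leaving missing values missing
generate heavy = .
replace heavy = 0 if weight < 3000
replace heavy = 1 if weight >= 3000 & weight != .
* recode into a new variable; rep78 itself is not changed
recode rep78 (1 2 = 1) (3 = 2) (4 5 = 3), generate(rep3)
* cross-tabulate old against new to double-check the recoding
tabulate rep78 rep3, missing
```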

Logistic regression

• Let's use a set of data collected by the state of California from 1200 high schools measuring academic achievement.

• Our dependent variable is called hiqual.

• Our predictor variable is avg_ed, a continuous measure of the average education (ranging from 1 to 5) of the parents of the students in the participating high schools.

OLS in Stata

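. use "D:\Onderwijs\AMMBR\apilog.dta", clear

. regress hiqual avg_ed

      Source |       SS       df       MS              Number of obs =    1158
-------------+------------------------------           F(  1,  1156) = 1135.65
       Model |  126.002822     1  126.002822           Prob > F      =  0.0000
    Residual |  128.260563  1156  .110952044           R-squared     =  0.4956
-------------+------------------------------           Adj R-squared =  0.4951
       Total |  254.263385  1157  .219760921           Root MSE      =  .33309

------------------------------------------------------------------------------
      hiqual |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      avg_ed |   .4287064   .0127215    33.70   0.000     .4037467    .4536662
       _cons |   -.855187   .0363792   -23.51   0.000    -.9265637   -.7838102
------------------------------------------------------------------------------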

. predict yhat
(option xb assumed; fitted values)
(42 missing values generated)

. twoway scatter yhat hiqual avg_ed, connect(l) ylabel(0 1)

[Figure: fitted values and hiqual (Hi Quality School, Hi vs Not) plotted against avg parent ed (1 to 5); y-axis from 0 to 1]

Logistic regression in Stata

. logit hiqual avg_ed

Iteration 0:   log likelihood = -730.68708
Iteration 1:   log likelihood = -386.86717
Iteration 2:   log likelihood = -355.09635
Iteration 3:   log likelihood = -353.94368
Iteration 4:   log likelihood = -353.94352
Iteration 5:   log likelihood = -353.94352

Logistic regression                               Number of obs   =       1158
                                                  LR chi2(1)      =     753.49
                                                  Prob > chi2     =     0.0000
Log likelihood = -353.94352                       Pseudo R2       =     0.5156

------------------------------------------------------------------------------
      hiqual |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      avg_ed |   3.910475   .2383352    16.41   0.000     3.443347    4.377603
       _cons |  -12.30333    .731532   -16.82   0.000    -13.73711   -10.86956
------------------------------------------------------------------------------

. predict yhat1
(option pr assumed; Pr(hiqual))
(42 missing values generated)

. twoway scatter yhat1 hiqual avg_ed, connect(l i) msymbol(i O) sort ylabel(0 1)

[Figure: Pr(hiqual) and hiqual (Hi Quality School, Hi vs Not) plotted against avg parent ed (1 to 5); y-axis from 0 to 1]

$$E(Y \mid X) = \frac{1}{1 + e^{-(-12 + 3.9 X_1)}}$$

Multiple predictors

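. logit hiqual yr_rnd avg_ed

------------------------------------------------------------------------------
      hiqual |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      yr_rnd |  -1.091038   .3425665    -3.18   0.001    -1.762456   -.4196197
      avg_ed |    3.86531   .2411152    16.03   0.000     3.392733    4.337887
       _cons |  -12.05417    .739755   -16.29   0.000    -13.50407   -10.60428
------------------------------------------------------------------------------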

Model fit: the likelihood ratio test

$$\chi^2 = 2\,[LL(\text{New}) - LL(\text{baseline})]$$

Model fit: LR test





. di 2*(-348.2462+730.68708)
764.88176

Pseudo R2: proportional change in LL





. di (730.68708-348.2462)/730.68708
.52339899
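The two hand calculations above do not require copying log likelihoods from the output, because logit stores them. A minimal sketch, assuming the auto dataset shipped with Stata as stand-in data (foreign, mpg and weight are its variables, not the course data):

```stata
sysuse auto, clear
logit foreign mpg weight
* e(ll) is the fitted model's log likelihood, e(ll_0) that of the constant-only model
display "LR chi2   = " 2*(e(ll) - e(ll_0))
display "Pseudo R2 = " (e(ll_0) - e(ll))/e(ll_0)
* the same two quantities as stored by logit itself
display "LR chi2   = " e(chi2)
display "Pseudo R2 = " e(r2_p)
```

Each pair of display lines should print the same value, which is also what appears in the header of the logit output.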

Classification Table

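. estat class

Logistic model for hiqual

              -------- True --------
Classified |         D            ~D  |      Total
-----------+--------------------------+-----------
     +     |         0             0  |          0
     -     |       391           809  |       1200
-----------+--------------------------+-----------
   Total   |       391           809  |       1200

Classified + if predicted Pr(D) >= .5
True D defined as hiqual != 0
--------------------------------------------------
Sensitivity                     Pr( +| D)    0.00%
Specificity                     Pr( -|~D)  100.00%
Positive predictive value       Pr( D| +)       .%
Negative predictive value       Pr(~D| -)   67.42%
--------------------------------------------------
False + rate for true ~D        Pr( +|~D)    0.00%
False - rate for true D         Pr( -| D)  100.00%
False + rate for classified +   Pr(~D| +)       .%
False - rate for classified -   Pr( D| -)   32.58%
--------------------------------------------------
Correctly classified                        67.42%
--------------------------------------------------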

Classification Table

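. estat class

Logistic model for hiqual

              -------- True --------
Classified |         D            ~D  |      Total
-----------+--------------------------+-----------
     +     |       288            58  |        346
     -     |        89           723  |        812
-----------+--------------------------+-----------
   Total   |       377           781  |       1158

Classified + if predicted Pr(D) >= .5
True D defined as hiqual != 0
--------------------------------------------------
Sensitivity                     Pr( +| D)   76.39%
Specificity                     Pr( -|~D)   92.57%
Positive predictive value       Pr( D| +)   83.24%
Negative predictive value       Pr(~D| -)   89.04%
--------------------------------------------------
False + rate for true ~D        Pr( +|~D)    7.43%
False - rate for true D         Pr( -| D)   23.61%
False + rate for classified +   Pr(~D| +)   16.76%
False - rate for classified -   Pr( D| -)   10.96%
--------------------------------------------------
Correctly classified                        87.31%
--------------------------------------------------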
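The summary percentages follow directly from the four cells of the table. A short check with display, using the cell counts of the second classification table (288, 58, 89 and 723):

```stata
* cell counts copied from the classification table with 1158 observations
display (288 + 723)/1158     // share correctly classified
display 288/(288 + 89)       // sensitivity, Pr(+|D)
display 723/(723 + 58)       // specificity, Pr(-|~D)
```

These reproduce the 87.31%, 76.39% and 92.57% reported by estat class.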

Interpreting coefficients: significance



. logit hiqual yr_rnd avg_ed, nolog

$$\text{Wald} = \frac{b}{SE_b}$$

Comparing models





After the full model and storage, estimate nested model

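. est store full_model

. logit hiqual avg_ed if e(sample)

------------------------------------------------------------------------------
      hiqual |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      avg_ed |   3.910475   .2383352    16.41   0.000     3.443347    4.377603
       _cons |  -12.30333    .731532   -16.82   0.000    -13.73711   -10.86956
------------------------------------------------------------------------------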

Likelihood ratio test

. lrtest full_model

Likelihood-ratio test                                  LR chi2(1)  =     11.39
(Assumption: . nested in full_model)                   Prob > chi2 =    0.0007
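The whole sequence (full model, store, nested model on the same observations, test) can be sketched as follows. It assumes the auto dataset shipped with Stata as stand-in data; only the order of the commands mirrors the slides.

```stata
sysuse auto, clear
logit foreign mpg weight            // full model
estimates store full_model
logit foreign weight if e(sample)   // nested model, same observations
lrtest full_model                   // stored model against the current one
* the p-value recomputed from the results lrtest leaves behind
display chi2tail(r(df), r(chi2))
```

Restricting the nested model with if e(sample) matters when the dropped predictor has missing values; otherwise the two models would be fitted to different samples and the test would not be valid.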

Interpretation of coefficients: direction

. listcoef

logit (N=1158): Factor Change in Odds

  Odds of: high vs not_high

------------------------------------------------------------------
  hiqual |      b         z     P>|z|    e^b    e^bStdX     SDofX
---------+--------------------------------------------------------
  yr_rnd | -1.09104   -3.185   0.001   0.3359    0.6593    0.3819
  avg_ed |  3.86531   16.031   0.000  47.7180   19.5978    0.7698
------------------------------------------------------------------

$$\text{logit} = \ln\left(\frac{p(y)}{1 - p(y)}\right) = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$$

Interpretation of coefficients: direction

$$\text{Odds} = \frac{p(y)}{1 - p(y)} = e^{b_0}\, e^{b_1 x_1}\, e^{b_2 x_2} \cdots e^{b_n x_n}$$
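The factor changes in the listcoef table are just exponentiated coefficients, which can be checked with display. The numbers below are the b and SDofX values for yr_rnd and avg_ed from that table; the last line is left as a comment because it needs apilog.dta in memory.

```stata
* factor change in the odds for a one-unit increase (column e^b)
display exp(-1.09104)            // yr_rnd
display exp(3.86531)             // avg_ed
* factor change for a one-standard-deviation increase (column e^bStdX)
display exp(-1.09104 * 0.3819)   // yr_rnd: b times SDofX
* with the data loaded, the or option makes logit report odds ratios directly:
* logit hiqual yr_rnd avg_ed, or
```

A value below 1 (yr_rnd) means the odds of being a high-quality school go down as the predictor increases; a value above 1 (avg_ed) means they go up.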

Interpretation of coefficients: Magnitude



. logit hiqual yr_rnd avg_ed, nolog

$$E(Y \mid X) = \frac{1}{1 + e^{-(-12 + 3.9\,\text{avg\_ed} - 1.1\,\text{yr\_rnd})}}$$

Interpretation of coefficients: Magnitude

. summ avg_ed yr_rnd

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      avg_ed |      1158    2.754212    .7697744          1          5
      yr_rnd |      1200         .18    .3843476          0          1

. di 1/(1+exp(12-3.9*2.75))
.21840254

. di 1/(1+exp(12-3.9*2.75+1.1))
.08509905
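The same two probabilities can be written with Stata's invlogit() function, which avoids flipping the signs by hand. The sketch uses the rounded coefficients from the slides (-12, 3.9 and -1.1) at avg_ed = 2.75; the margins line is left as a comment because it needs the fitted model in memory.

```stata
* predicted probability at the mean of avg_ed, rounded coefficients
display invlogit(-12 + 3.9*2.75)          // yr_rnd = 0
display invlogit(-12 + 3.9*2.75 - 1.1)    // yr_rnd = 1
* after the logit command, margins does this with the unrounded coefficients:
* margins, at(avg_ed = 2.75 yr_rnd = (0 1))
```

So for a school with average parental education, being a year-round school lowers the predicted probability of high quality from about .22 to about .09.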

The assumptions of logistic regression

• The true conditional probabilities are a logistic function of the independent variables.
• No important variables are omitted.
• No extraneous variables are included.
• The independent variables are measured without error.
• The observations are independent.
• The independent variables are not linear combinations of each other.

Hosmer & Lemeshow

The test divides the sample into subgroups and checks whether the observed and predicted numbers of cases are about equal within these groups.

The test should not be significant (a non-significant result indicates no difference between observed and predicted).

Hosmer & Lemeshow

[Formula slide; annotation: average probability in the j-th group]
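For reference, a common textbook form of the Hosmer-Lemeshow statistic is given below. The formula on the slide itself did not survive, so the notation used here ($g$, $O_j$, $n_j$, $\bar{\pi}_j$) is an assumption and may differ from the one used in the lecture.

```latex
\hat{C} = \sum_{j=1}^{g} \frac{(O_j - n_j \bar{\pi}_j)^2}{n_j \bar{\pi}_j (1 - \bar{\pi}_j)}
```

Here $g$ is the number of groups, $O_j$ the observed number of cases with $y = 1$ in group $j$, $n_j$ the size of group $j$, and $\bar{\pi}_j$ the average predicted probability in the j-th group. Under the null hypothesis of good fit the statistic is compared with a chi-square distribution with $g - 2$ degrees of freedom, which is why 10 groups give chi2(8) in the Stata output.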

First logistic regression

. logit hiqual yr_rnd meals cred_ml

------------------------------------------------------------------------------
      hiqual |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      yr_rnd |  -1.189537   .5022235    -2.37   0.018    -2.173877   -.2051967
       meals |     -.0936   .0084587   -11.07   0.000    -.1101786   -.0770213
     cred_ml |   .7406536   .3152647     2.35   0.019     .1227463    1.358561
       _cons |   2.425635   .3995025     6.07   0.000     1.642624    3.208645
------------------------------------------------------------------------------

Then postestimation command

. estat gof, table group(10)

Logistic model for hiqual, goodness-of-fit test

  (Table collapsed on quantiles of estimated probabilities)

  Group     Prob   Obs_1   Exp_1   Obs_0   Exp_0   Total
      1   0.0008       1     0.0      71    72.0      72
      2   0.0019       1     0.1      71    71.9      72
      3   0.0037       0     0.2      71    70.8      71
      4   0.0078       0     0.4      68    67.6      68
      5   0.0208       1     0.9      71    71.1      72
      6   0.0560       2     2.4      68    67.6      70
      7   0.1554       4     7.4      68    64.6      72
      8   0.4960      23    22.0      47    48.0      70
      9   0.7531      44    43.5      26    26.5      70
     10   0.9595      62    61.1       8     8.9      70

       number of observations =       707
             number of groups =        10
      Hosmer-Lemeshow chi2(8) =     40.45
                  Prob > chi2 =    0.0000

Specification error

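. logit hiqual yr_rnd meals cred_ml, nolog

------------------------------------------------------------------------------
      hiqual |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      yr_rnd |  -1.189537   .5022235    -2.37   0.018    -2.173877   -.2051967
       meals |     -.0936   .0084587   -11.07   0.000    -.1101786   -.0770213
     cred_ml |   .7406536   .3152647     2.35   0.019     .1227463    1.358561
       _cons |   2.425635   .3995025     6.07   0.000     1.642624    3.208645
------------------------------------------------------------------------------

. linktest, nolog

------------------------------------------------------------------------------
      hiqual |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        _hat |   1.215465   .1283978     9.47   0.000     .9638102     1.46712
      _hatsq |   .0748928   .0263911     2.84   0.005     .0231673    .1266184
       _cons |  -.1408008   .1637332    -0.86   0.390    -.4617121    .1801105
------------------------------------------------------------------------------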
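What linktest does can be reproduced by hand: refit the model using only the linear predictor and its square. If the model is well specified, the squared term should add nothing, so a significant _hatsq (as above, p = 0.005) points to a specification problem. The sketch below assumes the auto dataset shipped with Stata as stand-in data; hat and hatsq are hypothetical variable names.

```stata
sysuse auto, clear
logit foreign mpg weight
predict double hat, xb            // linear predictor of the fitted model
generate double hatsq = hat^2     // its square
logit foreign hat hatsq           // the z test on hatsq is the specification check
```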

Including interaction term helps

. gen ym=yr_rnd*meals

. logit hiqual yr_rnd meals cred_ml ym , nolog

------------------------------------------------------------------------------
      hiqual |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      yr_rnd |  -2.834458   .8630901    -3.28   0.001    -4.526083   -1.142832
       meals |  -.1019211   .0098691   -10.33   0.000    -.1212641   -.0825781
     cred_ml |   .7789823   .3206881     2.43   0.015     .1504452    1.407519
          ym |   .0463257   .0188326     2.46   0.014     .0094145    .0832368
       _cons |   2.686005   .4307661     6.24   0.000     1.841719    3.530291
------------------------------------------------------------------------------

. estat gof, table group(10)

Logistic model for hiqual, goodness-of-fit test

  (Table collapsed on quantiles of estimated probabilities)

  Group     Prob   Obs_1   Exp_1   Obs_0   Exp_0   Total
      1   0.0015       0     0.1      71    70.9      71
      2   0.0033       1     0.2      73    73.8      74
      3   0.0054       0     0.3      74    73.7      74
      4   0.0095       1     0.5      63    63.5      64
      5   0.0204       1     1.0      70    70.0      71
      6   0.0620       4     2.5      69    70.5      73
      7   0.1420       2     6.5      66    61.5      68
      8   0.4745      24    22.0      50    52.0      74
      9   0.7725      44    43.4      25    25.6      69
     10   0.9697      61    61.5       8     7.5      69

       number of observations =       707
             number of groups =        10
      Hosmer-Lemeshow chi2(8) =      9.25
                  Prob > chi2 =    0.3215
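In Stata 11 and later the interaction does not have to be generated by hand; factor-variable notation builds it inside the command. This sketch is not self-contained: it assumes apilog.dta is in memory and that yr_rnd is coded 0/1, as the summary statistics in the slides suggest.

```stata
* assumes apilog.dta is loaded (not shipped with Stata)
* i.yr_rnd##c.meals expands to both main effects plus their product
logit hiqual i.yr_rnd##c.meals cred_ml, nolog
estat gof, table group(10)
```

The coefficients should match those of the model with the hand-made ym term; the advantage is that postestimation commands such as margins then know that the two variables interact.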

Ok now

. linktest

Iteration 0:   log likelihood = -349.01971
Iteration 1:   log likelihood = -174.14403
Iteration 2:   log likelihood = -156.07793
Iteration 3:   log likelihood = -153.49407
Iteration 4:   log likelihood = -153.36857
Iteration 5:   log likelihood = -153.36794
Iteration 6:   log likelihood = -153.36794

------------------------------------------------------------------------------
      hiqual |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        _hat |   1.067861   .1160715     9.20   0.000     .8403653    1.295357
      _hatsq |   .0297354   .0317399     0.94   0.349    -.0324737    .0919445
       _cons |  -.0644637   .1684527    -0.38   0.702    -.3946249    .2656976
------------------------------------------------------------------------------

Multicollinearity

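. reg hiqual avg_ed yr_rnd meals

      Source |       SS       df       MS              Number of obs =    1158
-------------+------------------------------           F(  3,  1154) =  518.61
       Model |  145.983509     3  48.6611696           Prob > F      =  0.0000
    Residual |  108.279876  1154  .093830049           R-squared     =  0.5741
-------------+------------------------------           Adj R-squared =  0.5730
       Total |  254.263385  1157  .219760921           Root MSE      =  .30632

------------------------------------------------------------------------------
      hiqual |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      avg_ed |   .1729601    .021089     8.20   0.000     .1315831    .2143371
      yr_rnd |  -.0008586   .0248112    -0.03   0.972    -.0495386    .0478215
       meals |  -.0076084    .000527   -14.44   0.000    -.0086423   -.0065744
       _cons |   .2445202   .0824989     2.96   0.003     .0826554    .4063849
------------------------------------------------------------------------------

. vif

    Variable |       VIF       1/VIF
-------------+----------------------
       meals |      3.31    0.301982
      avg_ed |      3.25    0.307731
      yr_rnd |      1.11    0.903460
-------------+----------------------
    Mean VIF |      2.56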

Influential observations

. predict p
(option pr assumed; Pr(hiqual))
(42 missing values generated)

. predict stdres, rstand
(42 missing values generated)

. scatter stdres p, mlabel(snum)

[Figure: standardized Pearson residual plotted against Pr(hiqual) (0 to 1); each point is labelled with its school number snum]

. list if snum==1403

 458.   snum   dnum   schqual   hiqual   yr_rnd   meals   enroll   cred   cred_ml
        1403    315      high     high       nd     100      497    low       low

        cred_hl    pared   pared_ml   pared_hl   api00   api99   full   some_col
            low   medium     medium          .     808     824     59         28

        awards   ell   avg_ed   hicred    ym
            No    27     2.19        0   100

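. logit hiqual yr_rnd meals avg_ed, nolog

------------------------------------------------------------------------------
      hiqual |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      yr_rnd |  -.9913148   .3743452    -2.65   0.008    -1.725018   -.2576117
       meals |  -.0758864   .0074453   -10.19   0.000     -.090479   -.0612938
      avg_ed |    1.98805   .2884154     6.89   0.000     1.422766    2.553334
       _cons |  -3.566451    1.01715    -3.51   0.000    -5.560028   -1.572874
------------------------------------------------------------------------------

. logit hiqual yr_rnd meals avg_ed if snum != 1403

------------------------------------------------------------------------------
      hiqual |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      yr_rnd |    -1.1328   .3842377    -2.95   0.003    -1.885892   -.3797077
       meals |  -.0790397   .0076984   -10.27   0.000    -.0941283   -.0639511
      avg_ed |   2.010791   .2947269     6.82   0.000     1.433137    2.588445
       _cons |  -3.528875   1.037345    -3.40   0.001    -5.562035   -1.495716
------------------------------------------------------------------------------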
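Instead of refitting the model without each suspicious case, the influence of every observation can be requested from predict after logit. A minimal sketch, assuming the auto dataset shipped with Stata as stand-in data; db is a hypothetical variable name and the cut-off of 2 for the residuals is only a rule of thumb.

```stata
sysuse auto, clear
logit foreign mpg weight
predict p                      // predicted probability
predict stdres, rstandard      // standardized Pearson residual
predict db, dbeta              // Pregibon's delta-beta influence statistic
* cases with large residuals, the most influential ones listed last
sort db
list make foreign p stdres db if abs(stdres) > 2 & !missing(stdres)
```

A case with both a large residual and a large dbeta is worth inspecting with list, as was done for school 1403 above.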

If we have enough time left

• Perform a logistic regression analysis
• Use apilog.dta
• awards = dependent variable
