COMPARISONS OF TWO-PART MODELS WITH COMPETITORS
Peter A. Lachenbruch, Oregon State University, Department of Public Health


Slide 2: Clumping at 0
- Some subjects show no response; others have a continuous, or at least ordered, response.
- Examples: hospitalization expense in an HMO, cell growth on plates, urinary output in shock patients.
- The usual normal theory doesn't apply.

Slide 3: Urinary Output (Afifi & Azen)
[Figure of the urinary output data; only the slide title survives in this transcript.]

Slide 4: UO Analysis
- Survivors: 27/70 had UO = 0; mean = 127.9, s = 148.13, skewness = 1.13.
- Deaths: 22/43 had UO = 0; mean = 31.0, s = 71.76, skewness = 3.37.
- For these data: t = 3.01 (p = 0.0032); Wilcoxon z = 2.794 (p = 0.0052); Kolmogorov-Smirnov p = 0.001; two-part X^2 = 15.86 (p = 0.00036).

Slide 5: Statistical Model
- Model: f_i(x, d) = p_i^(1-d) * {(1 - p_i) h_i(x)}^d, where d indicates a non-zero response and p_i is the probability of a zero in group i.
- H0: p_1 = p_2 and h_1 = h_2.
- Tests: t-test on the full data set; Wilcoxon rank-sum test; Kolmogorov-Smirnov.
- Two-part models: Bin+Z, Bin+W, Bin+KS.

Slide 6: What are the relative properties?
- Right size? Is alpha = 0.05 when it is supposed to be? Are the null distributions correct?
- What is the power of these procedures under various alternatives? (A log-normal model is used; a data-generation sketch follows below.)
  - Difference only in proportions
  - Difference only in means
  - Difference in both
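A minimal sketch, in Stata, of how one simulated data set under this log-normal two-part alternative could be generated. The group sizes, zero-probabilities, and mean shift below are illustrative values taken from the slides; this is not the author's simulation code.

* One simulated data set: group 0 has P(zero) = 0.1, group 1 has P(zero) = 0.2,
* and the non-zero part is log-normal with a shift of 0.5 on the log scale.
clear
set obs 200
set seed 12345
generate byte   group = _n > 100
generate double p0    = cond(group, 0.2, 0.1)        // probability of a zero
generate byte   d     = runiform() >= p0             // 1 = non-zero response
generate double y     = d * exp(rnormal(cond(group, 0.5, 0), 1))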
Slide 7: Tests
[Definitions of the test statistics; only the slide title survives in this transcript.]

Slide 8: Two-part Tests
Define B^2 as the 1 d.f. chi-squared statistic comparing the two groups' proportions of zeros, and Z^2 and W^2 as the squared t (z) and Wilcoxon statistics computed on the non-zero observations. Then the two-part tests are B^2 + Z^2 (denoted BZ), B^2 + W^2 (denoted BW), and B^2 + K^2 (denoted BK), where K^2 is the chi-squared value corresponding to the p-value of the KS statistic. Since these components are independent, under the null each two-part statistic is the sum of two 1 d.f. (central) chi-squared statistics.
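A sketch of how the BW statistic could be computed in Stata for data laid out like the simulated set above (the variable names y, d, and group are the illustrative ones from that sketch, not from the papers):

* Two-part BW statistic: chi-squared for the zero/non-zero proportions plus the
* squared Wilcoxon z on the non-zero values, referred to a chi-squared(2).
quietly tabulate d group, chi2
scalar B2 = r(chi2)
quietly ranksum y if d == 1, by(group)
scalar BW = B2 + r(z)^2
display "BW = " BW "   p = " chi2tail(2, BW)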
Slide 9: Size of Tests (n1 = n2 = 50, equal means)
Slide 10: [results table/figure]
Slide 11: Power: n = 50, 100; p1 = 0.1, p2 = 0.2; mean difference = 0
Slide 12: Power: n = 50, 100; differ only in means; p = 0.1, 0.2; mean = 0.5
Slide 13: Power: n = 100; p1 = 0.1, p2 = 0.2; mean = 0.3, 0.5; proportion and mean are consonant
Slide 14: Power: n = 100; p1 = 0.2, p2 = 0.1; mean = 0.3, 0.5; proportion and mean are dissonant
[Slides 9-14 presented tables/figures of empirical size and power; only the slide titles survive in this transcript.]

Slide 15: Conclusions
- These results are similar to those for other sample sizes and parameter combinations.
- Size is appropriate.
- Distributions match expectations, except for the largest values.
- For differences only in proportions (low proportions), the BZ, BW, and BK methods did well; Z did poorly.

Slide 16: Conclusions (2)
- For differences only in means, W, K, Z, BW, and BK did well.
- For consonant differences (mean and proportion in the same direction), W, K, BW, and BK did well; Z and BZ did poorly.
- For dissonant differences, BW, BK, and BZ were far superior to the others.

Slide 17: Conclusions (3)
- Theoretical results indicate that computing sample size or power with the non-central chi-squared distribution gives excellent agreement with the simulated powers.
- Papers: Comparisons, Statistics in Medicine 2001, p. 1215; Non-central, Statistics in Medicine 2001, p. 1235.

Slide 18: (title slide for the second part of the talk)
Peter A. Lachenbruch and John Molitor, Oregon State University

Slide 19: The Two-part Model
- Some data have an excess of zero values. These aren't easily modeled because of the spike at 0.
- A mixture model can be used if one cannot distinguish a sampling zero from a structural zero. Example: telephone calls in a short period of time; if the phone is turned on, some time periods may have no calls, and if the phone is turned off, no calls are registered.
- A two-part model can be used if all zeros are structural. Examples: hospitalization cost when an insured person was not hospitalized; size of growth on an agar plate if all activity is inhibited.

Slide 20: An equation or two
- Let y be the response; it is zero if there is no response and non-zero otherwise.
- Let h(y) be the conditional distribution of y given y > 0.
- Let d be an indicator of a non-zero response and p the probability that d = 1.
- For a two-part model, f(y, d) = (1 - p)^(1-d) * {p h(y)}^d, so the log-likelihood is sum_i [(1 - d_i) ln(1 - p) + d_i {ln p + ln h(y_i)}].
- The log-likelihood is easy to compute, and the solution is simply the likelihood estimate for p and for the mean (regression) of y.

Slide 21: Inference
- One estimates parameters using the individual components of the likelihood; these are standard estimates. For the zero/non-zero part we use a logistic regression, and for the non-zero values we use a multiple regression.
- An issue is how to select variables for inclusion in a model: select variables separately for each part of the model, or select variables for the model as a whole, using the 0s as if they were regular observations.
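A minimal sketch of this two-part fit in Stata, assuming a response y, a non-zero indicator d, and illustrative covariates x1 and x2 (placeholder names, not variables from the talk's data set):

* Part 1: logistic regression for the probability of a non-zero response.
logit d x1 x2
* Part 2: linear regression on the non-zero responses only
* (the talk's example models the log of the non-zero values).
regress y x1 x2 if d == 1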
Slide 22: Variable selection criteria
- What criterion?
  R^2 = 1 - RSS/SST
  R^2_adj = 1 - (n-1)/(n-k-1) * RSS/SST
  AIC = n*ln(RSS/n) + 2k + n + n*ln(2*pi)
  BIC = n*ln(RSS/n) + k*ln(n) + n*ln(2*pi)
  (these are for normal distribution models)
- Use forward or backward stepping: p to enter 0.15 or 0.05; p to remove 0.15 or 0.05.
- Best subsets models?
- For generalized linear models, the deviance is proposed.

Slide 23: Variable Selection
- For the multiple regression, we can use stepwise regression; there are the usual concerns about stepwise.
- We can use AIC, BIC, or R^2 to select the best model. AIC and BIC penalize the selection based on the number of variables in the model.
- For normal distributions, AIC = n*ln(RSS/n) + 2k + n + n*ln(2*pi) and BIC = n*ln(RSS/n) + k*ln(n) + n*ln(2*pi).
- Bias-adjusted versions of R^2 and AIC are also available.
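As a check on these formulas, AIC and BIC can be computed by hand after a regression in Stata and compared with estat ic. This is a sketch using the placeholder variables from the earlier sketches; how k is counted (with or without the intercept and the residual variance) only shifts each criterion by a constant, so model orderings are unaffected.

* Compute the slide's AIC and BIC from the stored results e(rss), e(N), e(df_m).
quietly regress y x1 x2 if d == 1
scalar k   = e(df_m) + 1                     // coefficients plus intercept
scalar AIC = e(N)*ln(e(rss)/e(N)) + 2*k + e(N) + e(N)*ln(2*_pi)
scalar BIC = e(N)*ln(e(rss)/e(N)) + k*ln(e(N)) + e(N)*ln(2*_pi)
display "AIC = " AIC "   BIC = " BIC
estat ic                                     // Stata's likelihood-based AIC and BIC for comparison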
Slide 24: More on selection
- For the logistic part of the model, we use stepwise logistic regression and specify a p(enter) or p(remove); this is based on the test of the odds ratio for each candidate variable.
- For variable selection, most programs use a stepwise routine that selects on the basis of the test on the odds ratio (basically a normal-theory test).

Slide 25: Single model methods
There are two single-model methods we consider:
- Include the 0 values in a multiple regression. This is obviously inappropriate, but users have often done this. In practice, it selects more variables and includes the ones that have been selected by the logistic and multiple regression models.
- Conduct a Bayesian analysis of the variable selection problem. This is work in progress.

Slide 26: Computing - Stata
- We use Stata for computing because it has some convenient selection commands.
- The recently developed command vselect, due to Lindsay and Sheather, allows one to do variable selection using AIC, BIC, R^2, and forward or backward stepping, as well as finding the best set of variables for each number of variables.
- The best-subsets option uses the leaps-and-bounds algorithm, due to Furnival and Wilson, which vastly reduces the amount of computation.

Slide 27: More on selection
- Unfortunately, at present, vselect works only for multiple regression and not for logistic regression. Thus, we considered two strategies: use stepwise logistic regression directly, or regress the 0-1 variable with ordinary regression and perform the variable selection on those results.
- The vselect command first computes a multiple regression on all variables, then computes the stepwise variable selection from the X'X matrix.
- It allows the use of R^2, AIC, BIC, Mallows' Cp, and best-subsets regression. In the example, we use the best option, which gives all of the above.
- The Bayesian methods will be presented separately.

Slide 28: Example data
We use a data set courtesy of Lisa Rider. The variables are:
- laldo = ln(aldosterone) (response)
- aldind: 0-1 indicator for the response
- dx2: polymyositis (1) or dermatomyositis (2)
- agedx: age at diagnosis
- yeardx: year of diagnosis
- gender: male (0), female (1)
- ild: interstitial lung disease (Y/N)
- arthritis: Y/N
- fever: >100 degrees (Y/N)
- raynaud: Raynaud's sign (Y/N)
- mechhand: mechanic's hands (Y/N)
- palpita: palpitations (Y/N)
- dysphag: dysphagia (Y/N)
- proxweak: proximal weakness (Y/N)
- racewnw: race (W/NW)
- realonspeed: onset speed

Slide 29: The prediction problem
- We wish to predict laldo. However, 72 out of 420 values are 0, which leads to a clump of zero values.
- We may wish to have a single set of predictors for laldo, or we may wish to have a set of predictors for the non-zero values and a (possibly distinct) set of predictors for the 0 values.
- A related question is how we can evaluate the prediction ability of the resulting equations.

Slide 30: Example of vselect

. regress laldo agedx yeardx dx2 gender ild arthritis fever raynaud mechhand palpita dysphag proxweak racewnw realonspeed

      Source |       SS       df       MS              Number of obs =     347
-------------+------------------------------           F( 14,   332) =    4.45
       Model |  44.1754461    14  3.15538901           Prob > F      =  0.0000
    Residual |   235.26075   332  .708616718           R-squared     =  0.1581
-------------+------------------------------           Adj R-squared =  0.1226
       Total |  279.436196   346  .807619065           Root MSE      =  .84179

------------------------------------------------------------------------------
       laldo |      Coef.   Std. Err.      t     P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
       agedx |     0.0061      0.0120     0.51   6.1e-01    -0.0176     0.0298
      yeardx |    -0.0015      0.0086    -0.18   8.6e-01    -0.0185     0.0154
         dx2 |    -0.7198      0.1617    -4.45   1.2e-05    -1.0379    -0.4016
      gender |    -0.1017      0.1016    -1.00   3.2e-01    -0.3015     0.0982
         ild |    -0.0200      0.1802    -0.11   9.1e-01    -0.3744     0.3345
   arthritis |     0.0548      0.0957     0.57   5.7e-01    -0.1334     0.2430
       fever |    -0.0830      0.1000    -0.83   4.1e-01    -0.2798     0.1138
     raynaud |     0.3457      0.1490     2.32   2.1e-02     0.0526     0.6389
    mechhand |    -0.0275      0.1822    -0.15   8.8e-01    -0.3859     0.3310
     palpita |    -0.2085      0.1973    -1.06   2.9e-01    -0.5966     0.1797
     dysphag |     0.2590      0.0983     2.63   8.8e-03     0.0656     0.4525
    proxweak |     0.4575      0.8487     0.54   5.9e-01    -1.2119     2.1270
     racewnw |    -0.0937      0.0991    -0.95   3.4e-01    -0.2887     0.1012
 realonspeed |    -0.1849      0.0445    -4.16   4.1e-05    -0.2723    -0.0974
       _cons |     6.6862     17.2356     0.39   7.0e-01   -27.2186    40.5910
------------------------------------------------------------------------------

The next slide gives the vselect command and output. Note the restriction to laldo > 0 and u80 (an indicator variable that the patient was first diagnosed after 1980).

Slide 31: vselect output
This is the vselect output on the non-zero values. We truncated the display at five selected variables; the actual output includes all 14 variables.

. vselect laldo agedx yeardx dx2 gender ild arthritis fever raynaud mechhand palpita dysphag proxweak racewnw realonspeed, best

1 Observations Containing Missing Predictor Values
Response : laldo
Fixed Predictors :
Selected Predictors: dx2 realonspeed dysphag raynaud palpita gender racewnw fever arthritis proxweak agedx yeardx mechhand ild
Actual Regressions   37
Possible Regressions 16384
Optimal Models Highlighted:

  # Preds      R2ADJ          C        AIC       AICC        BIC
        1   .0663986   24.09272    888.755   1873.568   896.4537
        2   .1044985   10.09118   875.2897    1860.15   886.8377
        3   .1207073   4.734216   869.9412   1854.861   885.3385
        4   .1356839  -.1055272   864.9669   1849.957   884.2135
        5   .1361631   .7231399   865.7583   1850.832   888.8543
        6   .1365321   1.595634    866.591    1851.76   893.5363

Selected Predictors
1 : dx2
2 : dx2 realonspeed
3 : dx2 realonspeed raynaud
4 : dx2 realonspeed dysphag raynaud
5 : dx2 realonspeed dysphag raynaud racewnw
6 : dx2 realonspeed dysphag raynaud palpita racewnw

In this case, the program computed 37 regressions out of the 16384 (= 2^14) possible regressions.
Slide 32: Selecting predictors for the 0 indicator
For the logistic regressions we use stepwise logistic regression, which selects variables based on odds ratios. We use forward stepping with a p-to-enter of 0.15.

. stepwise, pe(.15): logistic aldind agedx yeardx dx2 gender ild arthritis fever raynaud mechhand palpita dysphag proxweak racewnw realonspeed if u80

note: proxweak dropped because of estimability
note: 1 obs. dropped because of estimability
begin with empty model
p = 0.0036 < 0.1500  adding palpita
p = 0.0322 < 0.1500  adding arthritis
p = 0.0340 < 0.1500  adding gender

Logistic regression                               Number of obs =        418
                                                  LR chi2(3)    =      17.40
                                                  Prob > chi2   =     0.0006
Log likelihood = -183.34326                       Pseudo R2     =     0.0453

------------------------------------------------------------------------------
      aldind | Odds Ratio   Std. Err.      z     P>|z|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
     palpita |     0.3060      0.1217    -2.98   2.9e-03     0.1403     0.6674
   arthritis |     1.8598      0.5150     2.24   2.5e-02     1.0809     3.2000
      gender |     0.4839      0.1657    -2.12   3.4e-02     0.2474     0.9466
------------------------------------------------------------------------------

. estat ic

-----------------------------------------------------------------------------
       Model |    Obs    ll(null)   ll(model)     df        AIC         BIC
-------------+---------------------------------------------------------------
           . |    418   -192.0435   -183.3433      4    374.6865    390.8284
-----------------------------------------------------------------------------
Note: N=Obs used in calculating BIC; see [R] BIC note

We see that the dx2 and onset speed variables did not enter, so somewhat different variables predict 0-ness than predict the magnitude of the response.

Slide 33: Selecting predictors for 0 with regression, ignoring the binomial form
We display only the results for the first five selected variables.

. regress aldind agedx yeardx dx2 gender ild arthritis fever raynaud mechhand palpita dysphag proxweak racewnw realonspeed if u80

      Source |       SS       df       MS              Number of obs =     419
-------------+------------------------------           F( 14,   404) =    1.84
       Model |  3.56544676    14  .254674768           Prob > F      =  0.0319
    Residual |  56.0622382   404  .138767916           R-squared     =  0.0598
-------------+------------------------------           Adj R-squared =  0.0272
       Total |   59.627685   418  .142649964           Root MSE      =  .37252

------------------------------------------------------------------------------
      aldind |      Coef.   Std. Err.      t     P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
       agedx |    -0.0053      0.0047    -1.14   2.5e-01    -0.0145     0.0038
      yeardx |     0.0017      0.0035     0.50   6.2e-01    -0.0051     0.0085
         dx2 |    -0.0281      0.0646    -0.43   6.6e-01    -0.1550     0.0988
      gender |    -0.0857      0.0416    -2.06   4.0e-02    -0.1675    -0.0039
         ild |    -0.0459      0.0714    -0.64   5.2e-01    -0.1862     0.0944
   arthritis |     0.0789      0.0380     2.08   3.8e-02     0.0043     0.1535
       fever |     0.0636      0.0396     1.61   1.1e-01    -0.0143     0.1414
     raynaud |     0.0049      0.0599     0.08   9.4e-01    -0.1129     0.1226
    mechhand |     0.0803      0.0765     1.05   2.9e-01    -0.0701     0.2306
     palpita |    -0.2003      0.0701    -2.86   4.5e-03    -0.3382    -0.0624
     dysphag |    -0.0360      0.0390    -0.92   3.6e-01    -0.1127     0.0407
    proxweak |    -0.2055      0.3751    -0.55   5.8e-01    -0.9429     0.5319
     racewnw |     0.0280      0.0395     0.71   4.8e-01    -0.0496     0.1057
 realonspeed |    -0.0053      0.0178    -0.30   7.6e-01    -0.0404     0.0297
       _cons |    -2.2499      6.9270    -0.32   7.5e-01   -15.8673    11.3676
------------------------------------------------------------------------------

Slide 34: Selecting predictors for 0 with regression, ignoring the binomial form (2)

. vselect aldind agedx yeardx dx2 gender ild arthritis fever raynaud mechhand palpita dysphag proxweak racewnw realonspeed if u80, best

2 Observations Containing Missing Predictor Values
Response : aldind
Fixed Predictors :
Selected Predictors: palpita arthritis gender fever agedx mechhand dysphag racewnw ild proxweak yeardx dx2 realonspeed raynaud
Actual Regressions   62
Possible Regressions 16384
Optimal Models Highlighted:

  # Preds      R2ADJ          C        AIC       AICC        BIC
        1   .0197545   5.197552   366.7613    1555.89    374.837
        2    .028156   2.597088   364.1486   1553.316   376.2622
        3   .0365444   .0194683   361.5079   1550.724   377.6594
        4   .0389249   .0159628   361.4605   1550.735   381.6499
        5   .0403595   .4189426   361.8213   1551.164   386.0485

Selected Predictors
1 : palpita
2 : palpita arthritis
3 : palpita arthritis gender
4 : palpita arthritis gender fever
5 : palpita arthritis gender fever agedx

Note that the selected variables are identical to those from the stepwise logistic regression.

Slide 35: Multiple regression with 0 in the data set
We now consider the model including 0 as part of the data. This may be made a bit easier by having taken logs of the non-zero values, so the 0s aren't quite so obviously different.

. regress laldo agedx yeardx dx2 gender ild arthritis fever raynaud mechhand palpita dysphag proxweak racewnw realonspeed if u80

      Source |       SS       df       MS              Number of obs =     419
-------------+------------------------------           F( 14,   404) =    2.84
       Model |    62.68539    14  4.47752786           Prob > F      =  0.0004
    Residual |  638.017201   404   1.5792505           R-squared     =  0.0895
-------------+------------------------------           Adj R-squared =  0.0579
       Total |  700.702591   418  1.67632199           Root MSE      =  1.2567

------------------------------------------------------------------------------
       laldo |      Coef.   Std. Err.      t     P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
       agedx |    -0.0075      0.0157    -0.48   6.4e-01    -0.0383     0.0234
      yeardx |     0.0024      0.0117     0.21   8.4e-01    -0.0206     0.0254
         dx2 |    -0.6763      0.2178    -3.11   2.0e-03    -1.1044    -0.2482
      gender |    -0.3182      0.1404    -2.27   2.4e-02    -0.5941    -0.0423
         ild |    -0.1800      0.2408    -0.75   4.6e-01    -0.6533     0.2933
   arthritis |     0.2548      0.1280     1.99   4.7e-02     0.0031     0.5065
       fever |     0.1069      0.1336     0.80   4.2e-01    -0.1557     0.3695
     raynaud |     0.3104      0.2021     1.54   1.3e-01    -0.0868     0.7076
    mechhand |     0.2043      0.2580     0.79   4.3e-01    -0.3029     0.7115
     palpita |    -0.7101      0.2366    -3.00   2.9e-03    -1.1753    -0.2449
     dysphag |     0.1165      0.1315     0.89   3.8e-01    -0.1422     0.3751
    proxweak |    -0.0250      1.2653    -0.02   9.8e-01    -2.5124     2.4625
     racewnw |    -0.0079      0.1332    -0.06   9.5e-01    -0.2698     0.2541
 realonspeed |    -0.1742      0.0601    -2.90   4.0e-03    -0.2924    -0.0560
       _cons |    -0.8421     23.3682    -0.04   9.7e-01   -46.7806    45.0964
------------------------------------------------------------------------------

Slide 36: Using vselect on the full data set
Displaying the best five.

. vselect laldo agedx yeardx dx2 gender ild arthritis fever raynaud mechhand palpita dysphag proxweak racewnw realonspeed if u80, best

2 Observations Containing Missing Predictor Values
Response : laldo
Fixed Predictors :
Selected Predictors: dx2 palpita realonspeed gender arthritis raynaud dysphag fever mechhand ild agedx yeardx racewnw proxweak
Actual Regressions   47
Possible Regressions 16384
Optimal Models Highlighted:

  # Preds      R2ADJ          C        AIC       AICC        BIC
        1   .0154376   20.79848   1401.003   2590.131   1409.079
        2   .0322276   14.33945    1394.79   2583.957   1406.904
        3    .048014   8.358132   1388.891   2578.106   1405.042
        4   .0580737   4.926931   1385.429   2574.703   1405.618
        5   .0673386   1.865516   1382.274   2571.617   1406.501
        6   .0695667   1.901132   1382.256   2571.677   1410.521
        7   .0699354   2.752656   1383.071   2572.582   1415.374

Selected Predictors
1 : dx2
2 : dx2 palpita
3 : dx2 palpita realonspeed
4 : dx2 palpita realonspeed arthritis
5 : dx2 palpita realonspeed gender arthritis
6 : dx2 palpita realonspeed gender arthritis raynaud
7 : dx2 palpita realonspeed gender arthritis raynaud dysphag

There are some differences between the variables selected by the logistic regression and by the multiple regression: Raynaud's and dysphagia were selected in the multiple regression.

Slide 37: Future Steps
- Develop a full Bayesian analysis/model. This may include a model that involves selection of variables with 0 values in the variable selection set, or it may involve a Bayesian model on the non-zero values and a model for the indicator of zero versus non-zero values.
- Develop a model using a bootstrap and select based on Wald statistics.
- Stay tuned.
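One possible shape for the bootstrap step, sketched in Stata. This only illustrates bootstrapping the coefficients of the non-zero part for a handful of the example's predictors; it is not the authors' planned procedure.

* Bootstrap the non-zero-part coefficients; the replicate-based standard errors
* give Wald statistics for each candidate variable. Predictor subset chosen
* purely for illustration.
bootstrap _b, reps(500) seed(20231): ///
    regress laldo dx2 realonspeed raynaud dysphag if laldo > 0 & u80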