
STATISTICS

FAQ: What are the differences between one-tailed and two-tailed tests?

When you conduct a test of statistical significance, whether it is from a correlation, an ANOVA, a regression or some other kind of test, you are given a p-value somewhere in the output.  If your test statistic is symmetrically distributed, you can select one of three alternative hypotheses. Two of these correspond to one-tailed tests and one corresponds to a two-tailed test.  However, the p-value presented is (almost always) for a two-tailed test.  But how do you choose which test?  Is the p-value appropriate for your test? And, if it is not, how can you calculate the correct p-value for your test given the p-value in your output?  

What is a two-tailed test?

First let's start with the meaning of a two-tailed test.  If you are using a significance level of 0.05, a two-tailed test allots half of your alpha to testing the statistical significance in one direction and half of your alpha to testing statistical significance in the other direction.  This means that .025 is in each tail of the distribution of your test statistic.  When using a two-tailed test, regardless of the direction of the relationship you hypothesize, you are testing for the possibility of the relationship in both directions.  For example, we may wish to compare the mean of a sample to a given value x using a t-test.  Our null hypothesis is that the mean is equal to x.  A two-tailed test will test both if the mean is significantly greater than x and if the mean is significantly less than x.  The mean is considered significantly different from x if the test statistic is in the top 2.5% or bottom 2.5% of its probability distribution, resulting in a p-value less than 0.05.


What is a one-tailed test?

Next, let's discuss the meaning of a one-tailed test.  If you are using a significance level of .05, a one-tailed test allots all of your alpha to testing the statistical significance in the one direction of interest.  This means that .05 is in one tail of the distribution of your test statistic. When using a one-tailed test, you are testing for the possibility of the relationship in one direction and completely disregarding the possibility of a relationship in the other direction.  Let's return to our example comparing the mean of a sample to a given value x using a t-test.  Our null hypothesis is that the mean is equal to x. A one-tailed test will test either if the mean is significantly greater than x or if the mean is significantly less than x, but not both. Then, depending on the chosen tail, the mean is significantly greater than or less than x if the test statistic is in the top 5% of its probability distribution or bottom 5% of its probability distribution, resulting in a p-value less than 0.05.  The one-tailed test provides more power to detect an effect in one direction by not testing the effect in the other direction. A discussion of when this is an appropriate option follows.   

 


When is a one-tailed test appropriate?

Because the one-tailed test provides more power to detect an effect, you may be tempted to use a one-tailed test whenever you have a hypothesis about the direction of an effect.  Before doing so, consider the consequences of missing an effect in the other direction.  Imagine you have developed a new drug that you believe is an improvement over an existing drug.  You wish to maximize your ability to detect the improvement, so you opt for a one-tailed test.  In doing so, you fail to test for the possibility that the new drug is less effective than the existing drug.  The consequences in this example are extreme, but they illustrate a danger of inappropriate use of a one-tailed test.

So when is a one-tailed test appropriate?  If you consider the consequences of missing an effect in the untested direction and conclude that they are negligible and in no way irresponsible or unethical, then you can proceed with a one-tailed test.  For example, imagine again that you have developed a new drug.  It is cheaper than the existing drug and, you believe, no less effective.  In testing this drug, you are only interested in testing if it is less effective than the existing drug.  You do not care if it is significantly more effective.  You only wish to show that it is not less effective.  In this scenario, a one-tailed test would be appropriate.

When is a one-tailed test NOT appropriate?

Choosing a one-tailed test for the sole purpose of attaining significance is not appropriate.  Choosing a one-tailed test after running a two-tailed test that failed to reject the null hypothesis is not appropriate, no matter how "close" to significant the two-tailed test was.  Using statistical tests inappropriately can lead to invalid results that are not replicable and highly questionable--a steep price to pay for a significance star in your results table!   

Deriving a one-tailed test from two-tailed output

The default among statistical packages performing tests is to report two-tailed p-values.  Because the most commonly used test statistic distributions (standard normal, Student's t) are symmetric about zero, most one-tailed p-values can be derived from the two-tailed p-values.   

Below, we have the output from a two-sample t-test in Stata.  The test is comparing the mean male score to the mean female score.  The null hypothesis is that the difference in means is zero.  The two-sided alternative is that the difference in means is not zero.  There are two one-sided alternatives that one could opt to test instead: that the male score is higher than the female score (diff  > 0) or that the female score is higher than the male score (diff < 0).  In this instance, Stata presents results for all three alternatives.  Under the headings Ha: diff < 0 and Ha: diff > 0 are the results for the one-tailed tests. In the middle, under the heading Ha: diff != 0 (which means that the difference is not equal to 0), are the results for the two-tailed test. 

Two-sample t test with equal variances

------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
    male |      91    50.12088    1.080274    10.30516    47.97473    52.26703
  female |     109    54.99083    .7790686    8.133715    53.44658    56.53507
---------+--------------------------------------------------------------------
combined |     200      52.775    .6702372    9.478586    51.45332    54.09668
---------+--------------------------------------------------------------------
    diff |           -4.869947    1.304191               -7.441835   -2.298059
------------------------------------------------------------------------------
Degrees of freedom: 198

                 Ho: mean(male) - mean(female) = diff = 0

     Ha: diff < 0              Ha: diff != 0             Ha: diff > 0
       t =  -3.7341              t =  -3.7341              t =  -3.7341
   P < t =   0.0001          P > |t| =   0.0002          P > t =   0.9999

Note that the test statistic, -3.7341, is the same for all of these tests.  The two-tailed p-value is P > |t|.  This can be rewritten as P(>3.7341) + P(< -3.7341).  Because the t-distribution is symmetric about zero, these two probabilities are equal: P > |t| = 2 * P(< -3.7341).  Thus, we can see that the two-tailed p-value is twice the one-tailed p-value for the alternative hypothesis that (diff < 0).  The other one-tailed alternative hypothesis has a p-value of P(>-3.7341) = 1-(P<-3.7341) = 1-0.0001 = 0.9999.  So, depending on the direction of the one-tailed hypothesis, its p-value is either 0.5*(two-tailed p-value) or 1-0.5*(two-tailed p-value) if the test statistic is symmetrically distributed about zero.

In this example, the two-tailed p-value suggests rejecting the null hypothesis of no difference. Had we opted for the one-tailed test of (diff > 0), we would fail to reject the null because of our choice of tails. 
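If your software reports only the two-tailed p-value, the conversion is just the arithmetic described above.  A minimal SAS data step sketch (the data step and variable names here are our own; the values are taken from the Stata output above):

data onetail;
  p_two  = 0.0002;          /* two-tailed p-value from the output above        */
  p_same = p_two / 2;       /* one-tailed p in the direction the statistic
                               actually went: 0.0001                           */
  p_opp  = 1 - p_two / 2;   /* one-tailed p in the opposite direction: 0.9999  */
  put p_same= p_opp=;
run;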

The output below is from a regression analysis in Stata.  Unlike the example above, only the two-sided p-values are presented in this output.

 

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  2,   197) =   46.58
       Model |  7363.62077     2  3681.81039           Prob > F      =  0.0000
    Residual |  15572.5742   197  79.0486001           R-squared     =  0.3210
-------------+------------------------------           Adj R-squared =  0.3142
       Total |   22936.195   199  115.257261           Root MSE      =  8.8909

------------------------------------------------------------------------------
       socst |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     science |   .2191144   .0820323     2.67   0.008     .0573403    .3808885
        math |   .4778911   .0866945     5.51   0.000     .3069228    .6488594
       _cons |   15.88534   3.850786     4.13   0.000     8.291287    23.47939
------------------------------------------------------------------------------

For each regression coefficient, the tested null hypothesis is that the coefficient is equal to zero.  Thus, the one-tailed alternatives are that the coefficient is greater than zero and that the coefficient is less than zero.  To get the p-value for the one-tailed test of the variable science having a coefficient greater than zero, you would divide the .008 by 2, yielding .004, because the effect is going in the predicted direction.  This is P(>2.67).  If you had made your prediction in the other direction (the opposite direction of the model effect), the p-value would have been 1 - .004 = .996.  This is P(<2.67).  For all three p-values, the test statistic is 2.67.
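These one-tailed p-values can also be computed directly from the t statistic with SAS's probt function, which returns the lower-tail probability P(T <= x) of the t distribution.  A sketch (the data step and variable names are ours; t and df are taken from the regression output above):

data onetailreg;
  t  = 2.67;                      /* t statistic for science                 */
  df = 197;                       /* residual degrees of freedom             */
  p_greater = 1 - probt(t, df);   /* P(>2.67), approximately .004            */
  p_less    = probt(t, df);       /* P(<2.67), approximately .996            */
  put p_greater= p_less=;
run;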

See also

Introduction to Power Analysis

FAQ: What's with the different formulas for kurtosis?

In describing the shape of statistical distributions, kurtosis refers to the peakedness or flatness of the distribution.  Different statistical packages compute somewhat different values for kurtosis.  What are the different formulas used and which packages use which formula?

We will begin by defining two different sums of powered deviation scores. The first one, s2, is the sum of squared deviation scores while s4 is the sum of deviation scores raised to the fourth power.
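In symbols, for n observations x_1, ..., x_n with mean x-bar, these are presumably the usual powered-deviation sums:

$$ s_2 = \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad s_4 = \sum_{i=1}^{n} (x_i - \bar{x})^4 $$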

Next, we will define m2 to be the second moment about the mean of x and m4 to be the fourth moment. Additionally, V(x) will be the unbiased estimate of the population variance.
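In the same notation, presumably:

$$ m_2 = \frac{s_2}{n}, \qquad m_4 = \frac{s_4}{n}, \qquad V(x) = \frac{s_2}{n-1} $$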

Now we can go ahead and start looking at some formulas for kurtosis. The first formula is one that can be found in many statistics books including Snedecor and Cochran (1967). It is used by SAS in proc means when specifying the option vardef=n. This formula is the one most commonly found in general statistics texts. With this definition a perfect normal distribution would have a kurtosis of zero.
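This is presumably the moment-based estimator with the normal-distribution benchmark of 3 subtracted; it reproduces the SAS vardef=n value of -0.2320107 in the example below:

$$ \text{kurtosis}_1 = \frac{m_4}{m_2^2} - 3 $$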


The second formula is the one used by Stata with the summarize command. This definition of kurtosis can be found in Bock (1975). The only difference between formula 1 and formula 2 is the -3 in formula 1. Thus, with this formula a perfect normal distribution would have a kurtosis of three.
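Presumably, then:

$$ \text{kurtosis}_2 = \frac{m_4}{m_2^2} $$

which reproduces Stata's value of 2.767989 in the example below.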

The third formula, below, can be found in Sheskin (2000) and is used by SPSS and SAS proc means when specifying the option vardef=df or by default if the vardef option is omitted.  This formula uses the unbiased estimates of variance and of the fourth moment about the mean.  The expected value for kurtosis with a normal distribution is zero.
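This is presumably the bias-adjusted estimator discussed by Joanes and Gill (1998); it reproduces the value of 0.4466489 in the examples below:

$$ \text{kurtosis}_3 = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \cdot \frac{s_4}{V(x)^2} \;-\; \frac{3(n-1)^2}{(n-2)(n-3)} $$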

Examples

Formula 1 -- SAS

data test;
  input x;
  cards;
1987
1987
1991
1992
1992
1992
1992
1993
1994
1994
1995
;
run;

proc means data=test kurtosis vardef=n;
run;

Analysis Variable : x


      Kurtosis
--------------
    -0.2320107
--------------

Formula 2 -- Stata

input x
1987
1987
1991
1992
1992
1992
1992
1993
1994
1994
1995
end

summ x, detail

                              x
-------------------------------------------------------------
      Percentiles      Smallest
 1%         1987           1987
 5%         1987           1987
10%         1987           1991       Obs                  11
25%         1991           1992       Sum of Wgt.          11

50%         1992                      Mean           1991.727
                        Largest       Std. Dev.      2.611165
75%         1994           1993
90%         1994           1994       Variance       6.818182
95%         1995           1994       Skewness      -.8895014
99%         1995           1995       Kurtosis       2.767989

Formula 3 -- SAS

data test;
  input x;
  cards;
1987
1987
1991
1992
1992
1992
1992
1993
1994
1994
1995
;
run;

proc means data=test kurtosis vardef=df;
run;

Analysis Variable : x


      Kurtosis
--------------
     0.4466489
--------------

proc means data=test kurtosis;
run;

Analysis Variable : x

      Kurtosis
--------------
     0.4466489
--------------

Formula 3 -- SPSS

data list list / yr.
begin data.
1987 1987 1991 1992 1992 1992 1992 1993 1994 1994 1995
end data.

desc /var=all /stat=kurtosis.

References

Bock, R.D. (1975) Multivariate Statistical Methods in Behavioral Research. New York: McGraw-Hill.

Joanes, D.N. and Gill, C.A. (1998) Comparing measures of sample skewness and kurtosis. The Statistician, 47, pp 183-189.

Sheskin, D.J. (2000) Handbook of Parametric and Nonparametric Statistical Procedures, Second Edition. Boca Raton, Florida: Chapman & Hall/CRC.


Snedecor, G.W. and Cochran, W.G. (1967) Statistical Methods, Sixth Edition. Ames, Iowa: Iowa State University Press.

SAS FAQ
How can I do tests of simple main effects?

Let's use an example data set called crf24.

data crf24;
  input y a b;
  cards;
3 1 1
4 1 2
7 1 3
7 1 4
1 2 1
2 2 2
5 2 3
10 2 4
6 1 1
5 1 2
8 1 3
8 1 4
2 2 1
3 2 2
6 2 3
10 2 4
3 1 1
4 1 2
7 1 3
9 1 4
2 2 1
4 2 2
5 2 3
9 2 4
3 1 1
3 1 2
6 1 3
8 1 4
2 2 1
3 2 2
6 2 3
11 2 4
;
run;

These are data from a 2 by 4 factorial design.  The variable y is the dependent variable.  The variable a is an independent variable with two levels while b is an independent variable with four levels.  Let's look at a table of cell means and standard deviations.

proc means data=crf24 mean std;
  class a b;
  var y;
run;

The MEANS Procedure


Analysis Variable : y

                N
   a    b     Obs            Mean         Std Dev
-------------------------------------------------------------------
   1    1       4       3.7500000       1.5000000
        2       4       4.0000000       0.8164966
        3       4       7.0000000       0.8164966
        4       4       8.0000000       0.8164966
   2    1       4       1.7500000       0.5000000
        2       4       3.0000000       0.8164966
        3       4       5.5000000       0.5773503
        4       4      10.0000000       0.8164966
-------------------------------------------------------------------

Now let's run the ANOVA. We will get the predicted values, call them yhat, and save them in a temporary data file called crf24p.  We will use these predicted values in a moment when we create a graph of the cell means.

proc glm data=crf24;
  class a b;
  model y = a b a*b;
  output out=crf24p p=yhat;
run;
quit;

The GLM Procedure

Class Level Information

Class    Levels    Values
a             2    1 2
b             4    1 2 3 4

Number of observations    32

Dependent Variable: y

                                    Sum of
Source                 DF          Squares     Mean Square    F Value    Pr > F

Model                   7      217.0000000      31.0000000      40.22    <.0001
Error                  24       18.5000000       0.7708333
Corrected Total        31      235.5000000

R-Square     Coeff Var      Root MSE        y Mean
0.921444      16.33435      0.877971      5.375000

Source        DF       Type I SS     Mean Square    F Value    Pr > F

a              1       3.1250000       3.1250000       4.05    0.0554
b              3     194.5000000      64.8333333      84.11    <.0001
a*b            3      19.3750000       6.4583333       8.38    0.0006

Source        DF     Type III SS     Mean Square    F Value    Pr > F

a              1       3.1250000       3.1250000       4.05    0.0554
b              3     194.5000000      64.8333333      84.11    <.0001
a*b            3      19.3750000       6.4583333       8.38    0.0006

We see that in addition to a significant main effect for b there is a significant a*b interaction effect.  Before we do any of the tests of simple main effects, let's graph the cell means to get an idea of what the interaction looks like.  The following sequence of commands will produce a graph of the cell means.  Note that in order to make a graph with the predicted values for each level of a, a data step is necessary to separate the predicted values into two new variables, which we call yhat1 and yhat2.  We then plot both yhat1 and yhat2 against b and overlay the two graphs.

data crf24q;
  set crf24p;
  if a = 1 then yhat1 = yhat;
  if a = 2 then yhat2 = yhat;
run;

proc sort data=crf24q;
  by b;
run;

symbol1 i=join;
symbol2 i=join line=3;
proc gplot data=crf24q;
  plot yhat1*b = 1 yhat2*b = 2 / overlay;
run;
quit;


The interaction is clearly shown where the two lines cross over between levels b3 and b4.  We will now do a test of simple main effects looking at differences in a at each level of b.

proc glm data=crf24;
  class a b;
  model y = a b a*b;
  lsmeans a*b / slice = b;
run;
quit;

The GLM Procedure

Class Level Information

Class    Levels    Values
a             2    1 2
b             4    1 2 3 4

Number of observations    32

Dependent Variable: y

                                    Sum of
Source                 DF          Squares     Mean Square    F Value    Pr > F

Model                   7      217.0000000      31.0000000      40.22    <.0001
Error                  24       18.5000000       0.7708333
Corrected Total        31      235.5000000

R-Square     Coeff Var      Root MSE        y Mean
0.921444      16.33435      0.877971      5.375000

Source        DF       Type I SS     Mean Square    F Value    Pr > F

a              1       3.1250000       3.1250000       4.05    0.0554
b              3     194.5000000      64.8333333      84.11    <.0001
a*b            3      19.3750000       6.4583333       8.38    0.0006

Source        DF     Type III SS     Mean Square    F Value    Pr > F

a              1       3.1250000       3.1250000       4.05    0.0554
b              3     194.5000000      64.8333333      84.11    <.0001
a*b            3      19.3750000       6.4583333       8.38    0.0006

Least Squares Means

a    b      y LSMEAN

1    1     3.7500000
1    2     4.0000000
1    3     7.0000000
1    4     8.0000000
2    1     1.7500000
2    2     3.0000000
2    3     5.5000000
2    4    10.0000000

a*b Effect Sliced by b for y

                    Sum of
b      DF          Squares     Mean Square    F Value    Pr > F

1       1         8.000000        8.000000      10.38    0.0036
2       1         2.000000        2.000000       2.59    0.1203
3       1         4.500000        4.500000       5.84    0.0237
4       1         8.000000        8.000000      10.38    0.0036


There is a statistically significant effect for each level of b except for level 2.  However, one may want to consider the effect of performing multiple tests on the family-wise error rate and perhaps adjust the critical alpha level accordingly.  Using a Bonferroni correction, the critical alpha level would be .0125 instead of .05 (.05/4).  Using the Bonferroni criterion, comparisons one and four would be considered statistically significant.

Note: Statisticians do not universally approve of the use of tests of simple main effects. In particular, there are concerns over the conceptual error rate. Tests of simple main effects are one tool that can be useful in interpreting interactions.  In general, the results of tests of simple main effects should be considered suggestive and not definitive.

SAS FAQ
How can I do ANOVA contrasts?

Let's use an example data set called crf24.

data crf24;
  input y a b;
  cards;
3 1 1
4 1 2
7 1 3
7 1 4
1 2 1
2 2 2
5 2 3
10 2 4
6 1 1
5 1 2
8 1 3
8 1 4
2 2 1
3 2 2
6 2 3
10 2 4
3 1 1
4 1 2
7 1 3
9 1 4
2 2 1
4 2 2
5 2 3
9 2 4
3 1 1
3 1 2
6 1 3
8 1 4
2 2 1
3 2 2
6 2 3
11 2 4
;
run;

These are data from a 2 by 4 factorial design.  The variable y is the dependent variable.  The variable a is an independent variable with two levels while b is an independent variable with four levels.

Using the contrast statement in a one-way ANOVA

proc glm data=crf24;
  class b;
  model y = b;
run;
quit;

The GLM Procedure

Class Level Information

Class Levels Values

b 4 1 2 3 4

Number of observations 32

Dependent Variable: y

                                    Sum of
Source                 DF          Squares     Mean Square    F Value    Pr > F

Model                   3      194.5000000      64.8333333      44.28    <.0001
Error                  28       41.0000000       1.4642857
Corrected Total        31      235.5000000

R-Square     Coeff Var      Root MSE        y Mean
0.825902      22.51306      1.210077      5.375000

Source        DF       Type I SS     Mean Square    F Value    Pr > F

b              3     194.5000000      64.8333333      44.28    <.0001

Source        DF     Type III SS     Mean Square    F Value    Pr > F

b              3     194.5000000      64.8333333      44.28    <.0001

proc means data=crf24 mean;
  class b;
  var y;
run;


The MEANS Procedure

Analysis Variable : y

           N
b        Obs            Mean
-----------------------------------
1          8       2.7500000
2          8       3.5000000
3          8       6.2500000
4          8       9.0000000
-----------------------------------

It is quite clear that there is a significant overall F for the independent variable b.  Now let's devise some contrasts that we can test:

1) group 3 versus group 4
2) the average of groups 1 and 2 versus the average of groups 3 and 4
3) the average of groups 1, 2 and 3 versus group 4

proc glm data=crf24;
  class b;
  model y = b;
  means b / deponly;
  contrast 'Compare 3rd & 4th grp' b 0 0 1 -1;
  contrast 'Compare 1st & 2nd with 3rd & 4th grp' b 1 1 -1 -1;
  contrast 'Compare 1st, 2nd & 3rd grps with 4th grp' b 1 1 1 -3;
run;
quit;

The GLM Procedure

Class Level Information

Class Levels Values

b 4 1 2 3 4

Number of observations 32

Dependent Variable: y

                                    Sum of
Source                 DF          Squares     Mean Square    F Value    Pr > F

Model                   3      194.5000000      64.8333333      44.28    <.0001
Error                  28       41.0000000       1.4642857
Corrected Total        31      235.5000000

R-Square     Coeff Var      Root MSE        y Mean
0.825902      22.51306      1.210077      5.375000


Source        DF       Type I SS     Mean Square    F Value    Pr > F

b              3     194.5000000      64.8333333      44.28    <.0001

Source        DF     Type III SS     Mean Square    F Value    Pr > F

b              3     194.5000000      64.8333333      44.28    <.0001

Level of    --------------y--------------
b       N          Mean         Std Dev

1       8    2.75000000      1.48804762
2       8    3.50000000      0.92582010
3       8    6.25000000      1.03509834
4       8    9.00000000      1.30930734

Dependent Variable: y

Contrast                                     DF     Contrast SS     Mean Square    F Value    Pr > F

Compare 3rd & 4th grp                         1      30.2500000      30.2500000      20.66    <.0001
Compare 1st & 2nd with 3rd & 4th grp          1     162.0000000     162.0000000     110.63    <.0001
Compare 1st, 2nd & 3rd grps with 4th grp      1     140.1666667     140.1666667      95.72    <.0001
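As a check on where these sums of squares come from: with equal cell sizes n per group and contrast coefficients c_j applied to the group means, the contrast sum of squares is n times the squared contrast value divided by the sum of squared coefficients.  For the third contrast above, using the group means from the proc means output:

$$ SS = \frac{n\left(\sum_j c_j \bar{y}_j\right)^2}{\sum_j c_j^2} = \frac{8\,(2.75 + 3.50 + 6.25 - 3 \times 9.00)^2}{1^2 + 1^2 + 1^2 + (-3)^2} = \frac{8\,(-14.5)^2}{12} = 140.1667 $$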

Using the contrast statement in a two-way ANOVA

Now let's try the same contrasts on b but in a two-way ANOVA.

proc glm data=crf24;
  class a b;
  model y = a b a*b;
  contrast 'Compare 3rd & 4th grp' b 0 0 1 -1;
  contrast 'Compare 1st & 2nd with 3rd & 4th grp' b 1 1 -1 -1;
  contrast 'Compare 1st, 2nd & 3rd grps with 4th grp' b 1 1 1 -3;
run;
quit;

The GLM Procedure

Class Level Information

Class Levels Values

a 2 1 2

b 4 1 2 3 4


Number of observations    32

Dependent Variable: y

                                    Sum of
Source                 DF          Squares     Mean Square    F Value    Pr > F

Model                   7      217.0000000      31.0000000      40.22    <.0001
Error                  24       18.5000000       0.7708333
Corrected Total        31      235.5000000

R-Square     Coeff Var      Root MSE        y Mean
0.921444      16.33435      0.877971      5.375000

Source        DF       Type I SS     Mean Square    F Value    Pr > F

a              1       3.1250000       3.1250000       4.05    0.0554
b              3     194.5000000      64.8333333      84.11    <.0001
a*b            3      19.3750000       6.4583333       8.38    0.0006

Source        DF     Type III SS     Mean Square    F Value    Pr > F

a              1       3.1250000       3.1250000       4.05    0.0554
b              3     194.5000000      64.8333333      84.11    <.0001
a*b            3      19.3750000       6.4583333       8.38    0.0006

Contrast                                     DF     Contrast SS     Mean Square    F Value    Pr > F

Compare 3rd & 4th grp                         1      30.2500000      30.2500000      39.24    <.0001
Compare 1st & 2nd with 3rd & 4th grp          1     162.0000000     162.0000000     210.16    <.0001
Compare 1st, 2nd & 3rd grps with 4th grp      1     140.1666667     140.1666667     181.84    <.0001

Note that the F-ratios in these contrasts are larger than the F-ratios in the one-way ANOVA example. This is because the two-way ANOVA has a smaller mean square residual than the one-way ANOVA.
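Numerically, each contrast F is its contrast mean square divided by the model's mean square error, so the change in F follows directly from the change in MSE (1.4642857 in the one-way model versus 0.7708333 in the two-way model).  For the first contrast:

$$ F_{\text{one-way}} = \frac{30.25}{1.4642857} \approx 20.66, \qquad F_{\text{two-way}} = \frac{30.25}{0.7708333} \approx 39.24 $$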

SAS FAQ
How can I perform a repeated measures ANOVA with proc mixed?


SAS proc mixed is a very powerful procedure for a wide variety of statistical analyses, including repeated measures analysis of variance. We will illustrate how you can perform a repeated measures ANOVA using a standard type of analysis using proc glm and then show how you can perform the same analysis using proc mixed. We use an example from Design and Analysis by G. Keppel, pages 414-416. This example contains eight subjects (sub) with one between-subjects IV with two levels (group) and one within-subjects IV with four levels (indicated by the variables dv1-dv4). These data are read into a temporary SAS data file called wide below.

DATA wide;
  INPUT sub group dv1 dv2 dv3 dv4;
CARDS;
1 1 3 4 7 3
2 1 6 8 12 9
3 1 7 13 11 11
4 1 0 3 6 6
5 2 5 6 11 7
6 2 10 12 18 15
7 2 10 15 15 14
8 2 5 7 11 9
;
RUN;

PROC PRINT DATA=wide;
RUN;

OBS    SUB    GROUP    DV1    DV2    DV3    DV4

  1      1        1      3      4      7      3
  2      2        1      6      8     12      9
  3      3        1      7     13     11     11
  4      4        1      0      3      6      6
  5      5        2      5      6     11      7
  6      6        2     10     12     18     15
  7      7        2     10     15     15     14
  8      8        2      5      7     11      9

We start by showing how to perform a standard 2 by 4 (between / within) ANOVA using proc glm.

PROC GLM DATA=wide;
  CLASS group;
  MODEL dv1-dv4 = group / NOUNI;
  REPEATED trial 4;
RUN;

The results of this analysis are shown below.

General Linear Models Procedure
Class Level Information

Class Levels Values


GROUP 2 1 2

Number of observations in data set = 8

General Linear Models Procedure
Repeated Measures Analysis of Variance
Repeated Measures Level Information

Dependent Variable    DV1    DV2    DV3    DV4
Level of TRIAL          1      2      3      4

Manova Test Criteria and Exact F Statistics for
the Hypothesis of no TRIAL Effect
H = Type III SS&CP Matrix for TRIAL   E = Error SS&CP Matrix

S=1    M=0.5    N=1

Statistic                        Value           F    Num DF    Den DF    Pr > F
Wilks' Lambda               0.00829577    159.3911         3         4    0.0001
Pillai's Trace              0.99170423    159.3911         3         4    0.0001
Hotelling-Lawley Trace    119.54335260    159.3911         3         4    0.0001
Roy's Greatest Root       119.54335260    159.3911         3         4    0.0001

Manova Test Criteria and Exact F Statistics for
the Hypothesis of no TRIAL*GROUP Effect
H = Type III SS&CP Matrix for TRIAL*GROUP   E = Error SS&CP Matrix

S=1    M=0.5    N=1

Statistic                        Value         F    Num DF    Den DF    Pr > F
Wilks' Lambda               0.60915493    0.8555         3         4    0.5324
Pillai's Trace              0.39084507    0.8555         3         4    0.5324
Hotelling-Lawley Trace      0.64161850    0.8555         3         4    0.5324
Roy's Greatest Root         0.64161850    0.8555         3         4    0.5324

General Linear Models Procedure
Repeated Measures Analysis of Variance
Tests of Hypotheses for Between Subjects Effects

Source    DF     Type III SS     Mean Square    F Value    Pr > F
GROUP      1    116.28125000    116.28125000       2.51    0.1645
Error      6    278.43750000     46.40625000

General Linear Models Procedure
Repeated Measures Analysis of Variance
Univariate Tests of Hypotheses for Within Subject Effects

Source: TRIAL
                                                            Adj Pr > F
   DF     Type III SS    Mean Square   F Value   Pr > F    G - G    H - F
    3    129.59375000    43.19791667     22.34   0.0001   0.0001   0.0001

Source: TRIAL*GROUP
                                                            Adj Pr > F
   DF     Type III SS    Mean Square   F Value   Pr > F    G - G    H - F
    3      3.34375000     1.11458333      0.58   0.6380   0.5693   0.6380

Source: Error(TRIAL)
   DF     Type III SS    Mean Square
   18     34.81250000     1.93402778

Greenhouse-Geisser Epsilon = 0.6337
Huynh-Feldt Epsilon = 1.0742

Now, we will illustrate how you can perform this same analysis in proc mixed. First, we need to reshape the data so it is in the shape expected by proc mixed. proc glm expects the data to be in a wide format, where each observation corresponds to a subject. By contrast, proc mixed expects the data to be in a long format where each observation corresponds to a trial. In this case, proc mixed expects that there would be four observations per subject and that each observation would correspond to the measurements on the four different trials. Below we show how you can reshape the data for analysis in proc mixed.

DATA long;
  SET wide;
  dv = dv1; trial = 1; OUTPUT;
  dv = dv2; trial = 2; OUTPUT;
  dv = dv3; trial = 3; OUTPUT;
  dv = dv4; trial = 4; OUTPUT;
  DROP dv1 - dv4;
RUN;

PROC PRINT DATA=long;
RUN;

You can compare the proc print for wide with the proc print for long to verify that the data were properly reshaped.

OBS SUB GROUP DV TRIAL

  1      1        1     3        1
  2      1        1     4        2
  3      1        1     7        3
  4      1        1     3        4
  5      2        1     6        1
  6      2        1     8        2
  7      2        1    12        3
  8      2        1     9        4
  9      3        1     7        1
 10      3        1    13        2
 11      3        1    11        3
 12      3        1    11        4
 13      4        1     0        1
 14      4        1     3        2
 15      4        1     6        3
 16      4        1     6        4
 17      5        2     5        1
 18      5        2     6        2
 19      5        2    11        3
 20      5        2     7        4
 21      6        2    10        1
 22      6        2    12        2
 23      6        2    18        3
 24      6        2    15        4
 25      7        2    10        1
 26      7        2    15        2
 27      7        2    15        3
 28      7        2    14        4
 29      8        2     5        1
 30      8        2     7        2
 31      8        2    11        3
 32      8        2     9        4

Now that the data are in the proper shape, we can analyze it with proc mixed.

The class and model statements are used much the same as with proc glm. However, the repeated statement is different. The repeated statement is used to indicate the within-subjects (repeated) variable, but note that trial is on the class statement, unlike in proc glm. This is because the data are in long format, so there is indeed a separate variable indicating the trials.

We also use the repeated statement to indicate which variable identifies the different subjects (via subject=sub), and we can specify the covariance structure among the repeated measures (in this case we choose compound symmetry via type=cs, which is the same structure that proc glm uses). Unlike proc glm, proc mixed has a wide variety of covariance structures you can choose from, so you can choose one that matches your data (see the proc mixed manual for more information on this).

PROC MIXED DATA=long;
  CLASS sub group trial;
  MODEL dv = group trial group*trial;
  REPEATED trial / SUBJECT=sub TYPE=CS;
RUN;

As you see below, the results correspond to those produced by proc glm. Note that proc mixed does not produce Sums of Squares or Mean Squares. This is because proc mixed uses maximum likelihood estimation instead of a sums of squares style of computation.

The MIXED Procedure

Class Level Information


Class     Levels    Values
SUB            8    1 2 3 4 5 6 7 8
GROUP          2    1 2
TRIAL          4    1 2 3 4

REML Estimation Iteration History

Iteration    Evaluations      Objective      Criterion
        0              1    96.74510121
        1              1    69.98784546    0.00000000

Convergence criteria met.

Covariance Parameter Estimates (REML)

Cov Parm    Subject       Estimate
CS          SUB        11.11805556
Residual                1.93402778

Model Fitting Information for DV

Description                           Value
Observations                        32.0000
Res Log Likelihood                 -57.0484
Akaike's Information Criterion     -59.0484
Schwarz's Bayesian Criterion       -60.2265
-2 Res Log Likelihood              114.0969
Null Model LRT Chi-Square           26.7573
Null Model LRT DF                    1.0000
Null Model LRT P-Value               0.0000

Tests of Fixed Effects

Source          NDF    DDF    Type III F    Pr > F
GROUP             1      6          2.51    0.1645
TRIAL             3     18         22.34    0.0001
GROUP*TRIAL       3     18          0.58    0.6380
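One practical payoff of proc mixed is the choice of covariance structures mentioned above: only the TYPE= option changes.  As a sketch (our own variation, not part of the original analysis), the same model with an unstructured covariance matrix instead of compound symmetry would be:

PROC MIXED DATA=long;
  CLASS sub group trial;
  MODEL dv = group trial group*trial;
  REPEATED trial / SUBJECT=sub TYPE=UN;  /* unstructured covariance; other
                                            structures such as TYPE=AR(1)
                                            are requested the same way     */
RUN;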

Proc mixed is much more powerful than proc glm. Because it is more powerful, it is more complex to use. This FAQ just scratches the surface in the use of proc mixed.

SAS FAQ
How can I minimize loss of data due to missing observations in a repeated measures ANOVA?

Loss of subjects in a repeated measures ANOVA due to missing data can be a serious problem. If you use proc glm to perform your analysis, it will omit observations listwise, meaning that if any of the observations for a subject are missing, the entire subject will be omitted from the analysis. Consider the data file below, based on an example from Design and Analysis by G. Keppel, pages 414-416. This example contains 8 subjects (sub) with one between-subjects IV with 2 levels (group) and 1 within-subjects IV with 4 levels. We have inserted 4 missing values to illustrate the impact of missing data in this kind of design.

DATA wide;
  INPUT sub group dv1 dv2 dv3 dv4;
CARDS;
1 1 3 4 7 3
2 1 6 . 12 9
3 1 7 13 11 11
4 1 0 3 . 6
5 2 5 6 11 7
6 2 10 12 18 .
7 2 10 15 15 14
8 2 5 . 11 9
;
RUN;

PROC PRINT DATA=wide;
RUN;

OBS    SUB    GROUP    DV1    DV2    DV3    DV4

  1      1        1      3      4      7      3
  2      2        1      6      .     12      9
  3      3        1      7     13     11     11
  4      4        1      0      3      .      6
  5      5        2      5      6     11      7
  6      6        2     10     12     18      .
  7      7        2     10     15     15     14
  8      8        2      5      .     11      9

We start by showing how to perform a standard 2 by 4 (between / within) ANOVA using proc glm.

PROC GLM DATA=wide;
  CLASS group;
  MODEL dv1-dv4 = group / NOUNI;
  REPEATED trial 4;
RUN;

Note the number of observations available for analysis is only four, and that four have been omitted due to missing data. The results of this analysis are shown below.

General Linear Models Procedure
Class Level Information

Class Levels Values

GROUP 2 1 2

Number of observations in data set = 8

NOTE: Observations with missing values will not be included in this analysis. Thus, only 4 observations can be used in this analysis.

General Linear Models Procedure


Repeated Measures Analysis of Variance
Repeated Measures Level Information

Dependent Variable    DV1    DV2    DV3    DV4
Level of TRIAL          1      2      3      4

General Linear Models Procedure
Repeated Measures Analysis of Variance
Tests of Hypotheses for Between Subjects Effects

Source    DF     Type III SS     Mean Square    F Value    Pr > F
GROUP      1     36.00000000     36.00000000       0.46    0.5673
Error      2    156.25000000     78.12500000

General Linear Models Procedure
Repeated Measures Analysis of Variance
Univariate Tests of Hypotheses for Within Subject Effects

Source: TRIAL
                                                           Adj Pr > F
   DF    Type III SS    Mean Square   F Value   Pr > F    G - G    H - F
    3    47.25000000    15.75000000      5.32   0.0397   0.1430   0.0629

Source: TRIAL*GROUP
                                                           Adj Pr > F
   DF    Type III SS    Mean Square   F Value   Pr > F    G - G    H - F
    3     2.50000000     0.83333333      0.28   0.8371   0.6556   0.7898

Source: Error(TRIAL)
   DF    Type III SS    Mean Square
    6    17.75000000     2.95833333

Greenhouse-Geisser Epsilon = 0.3474
Huynh-Feldt Epsilon = 0.7547

Now, we will illustrate how you can perform this same analysis in proc mixed. First, we need to reshape the data so it is in the shape expected by proc mixed. proc glm expects the data to be in a wide format, where each observation corresponds to a subject. By contrast, proc mixed expects the data to be in a long format where each observation corresponds to a trial. In this case, proc mixed expects that there would be four observations per subject and that each observation would correspond to the measurements on the four different trials. Below we show how you can reshape the data for analysis in proc mixed.

DATA long;
  SET wide;
  dv = dv1; trial = 1; OUTPUT;
  dv = dv2; trial = 2; OUTPUT;
  dv = dv3; trial = 3; OUTPUT;
  dv = dv4; trial = 4; OUTPUT;
  DROP dv1 - dv4;
RUN;

PROC PRINT DATA=long;
RUN;

You can compare the proc print for wide with the proc print for long to verify that the data were properly reshaped.

OBS SUB GROUP DV TRIAL

  1      1        1     3        1
  2      1        1     4        2
  3      1        1     7        3
  4      1        1     3        4
  5      2        1     6        1
  6      2        1     .        2
  7      2        1    12        3
  8      2        1     9        4
  9      3        1     7        1
 10      3        1    13        2
 11      3        1    11        3
 12      3        1    11        4
 13      4        1     0        1
 14      4        1     3        2
 15      4        1     .        3
 16      4        1     6        4
 17      5        2     5        1
 18      5        2     6        2
 19      5        2    11        3
 20      5        2     7        4
 21      6        2    10        1
 22      6        2    12        2
 23      6        2    18        3
 24      6        2     .        4
 25      7        2    10        1
 26      7        2    15        2
 27      7        2    15        3
 28      7        2    14        4
 29      8        2     5        1
 30      8        2     .        2
 31      8        2    11        3
 32      8        2     9        4

Now that the data are in the proper shape, we can analyze it with proc mixed. Proc mixed does not delete missing data listwise. It analyzes all of the data that are present. For the analysis to be valid, it is assumed that the data are missing at random. Rarely, however, are data truly missing at random. To the extent that there are systematic factors that led to the data being missing, the analysis will not be valid. In using this kind of analysis, we recommend that you assess and present information regarding the reasons for missing data and an assessment of the extent to which it was non-random.


PROC MIXED DATA=long;
  CLASS sub group trial;
  MODEL dv = group trial group*trial;
  REPEATED trial / SUBJECT=sub TYPE=CS;
RUN;

As you see below, proc mixed analyzed all eight of the subjects and had far less missing data than the analysis with proc glm.

The MIXED Procedure

Class Level Information

Class     Levels    Values
SUB            8    1 2 3 4 5 6 7 8
GROUP          2    1 2
TRIAL          4    1 2 3 4

REML Estimation Iteration History

Iteration    Evaluations      Objective      Criterion
        0              1    81.93159646
        1              3    63.43970119    0.00138808
        2              1    63.39025490    0.00006552
        3              1    63.38810898    0.00000018
        4              1    63.38810333    0.00000000

Convergence criteria met.

Covariance Parameter Estimates (REML)

Cov Parm    Subject       Estimate
CS          SUB        10.83244625
Residual                2.29522110

Model Fitting Information for DV

Description                           Value
Observations                        28.0000
Res Log Likelihood                 -50.0728
Akaike's Information Criterion     -52.0728
Schwarz's Bayesian Criterion       -53.0686
-2 Res Log Likelihood              100.1456
Null Model LRT Chi-Square           18.5435
Null Model LRT DF                    1.0000
Null Model LRT P-Value               0.0000

Tests of Fixed Effects

Source          NDF    DDF    Type III F    Pr > F
GROUP             1      6          2.37    0.1748
TRIAL             3     14         17.04    0.0001
GROUP*TRIAL       3     14          0.40    0.7556


Proc mixed is much more powerful than proc glm. Because it is more powerful, it is more complex to use. This FAQ just scratches the surface in the use of proc mixed.

SAS FAQ
How can I test contrasts and interaction contrasts using the estimate statement?

It can be rather tricky to program the estimate statement when there are higher order interactions (e.g., three-way interactions, four-way interactions, etc.) included in the mixed model. Let's look at an example where we are using proc mixed in a repeated measures model. The data set exercise was used in our seminar on repeated measures. The data consist of people who were randomly assigned to two different diets (low-fat and not low-fat) and three different types of exercise (at rest, walking leisurely and running). Their pulse rate was measured at three different time points during their assigned exercise: at 1 minute, 15 minutes and 30 minutes. We included all three variables in our mixed model: diet, which has two levels; exertype, which has three levels; and time, which also has three levels.  Even though time is a repeated factor, we can treat it in the same manner as the other variables when we want to test the various contrasts and interaction contrasts that may be of interest.  Finally, we will present an example of how to program the estimate statement for interaction contrasts involving a four-way interaction.

proc mixed data=exercise;
  class diet exertype time;
  model pulse = exertype|diet|time;
  repeated time / subject=id type=arh(1);
run;
quit;

<output omitted>

Type 3 Tests of Fixed Effects

                      Num    Den
Effect                 DF     DF    F Value    Pr > F
exertype                2     24      52.17    <.0001
diet                    1     24      15.81    0.0006
diet*exertype           2     24       5.11    0.0142
time                    2     48      30.82    <.0001
exertype*time           4     48      20.25    <.0001
diet*time               2     48       2.80    0.0709
diet*exertype*time      4     48       4.45    0.0039

Contrasts involving only one variable

The graph of the data indicates that there might be a difference between exertype level 3 and the other two levels. Therefore, we will use a reverse Helmert coding for exertype in the estimate statement in order to test this particular contrast. For more information on reverse Helmert coding and other contrast coding systems please refer to chapter 5 in our webbook on regression.

proc mixed data=exercise;
  class diet exertype time;
  model pulse = exertype|diet|time;
  repeated time / subject=id type=arh(1);
  estimate 'Exertype 12v3' exertype -.5 -.5 1;
run;
quit;

<output omitted>

Estimates

                               Standard
Label             Estimate        Error     DF    t Value    Pr > |t|
Exertype 12v3      20.0500       1.9975     24      10.04      <.0001

Contrast involving a two-way interaction

The output from the model indicates that there is a significant interaction between exertype and diet which is reflected in the graphs of exertype by time with one graph per level of diet. In the graphs we see that the pulse rate of the third level of exertype increases much faster in the non-low fat diet group than in the low-fat diet group. We want to see if this difference is significant. In order to do this we will test the interaction of exertype contrasting level 3 versus levels 1 and 2 and diet contrasting levels 1 versus 2. 


We want to apply a reverse Helmert coding to exertype in order to compare exertype level 3 with the average of levels one and two; at the same time we want to compare the two levels of diet.  In order to make it easier to see the coding systems we will present them in a table format.

diet level 1          1
diet level 2         -1

exertype level 1    -.5
exertype level 2    -.5
exertype level 3      1

The interaction is coded as d1e1 d1e2 d1e3 d2e1 d2e2 d2e3, where d# = coding for diet level #, e# = coding for exertype level # and d#e# is the product of the two.  The order of the factors is determined by the order in which they appear in the class statement.  In this particular case diet appears before exertype in the class statement, and thus with our coding system for diet and exertype we would have the following coding for the interaction:

d1e1 = 1*-.5  = -.5
d1e2 = 1*-.5  = -.5
d1e3 = 1*1    =  1
d2e1 = -1*-.5 =  .5
d2e2 = -1*-.5 =  .5
d2e3 = -1*1   = -1

In order to more easily understand the coding for the interaction it might help to visualize it as a matrix which equals the product of the contrast code for diet as a column matrix and the contrast coding of exertype as a row matrix.

                      exertype level 1 = -.5    exertype level 2 = -.5    exertype level 3 = 1
diet level 1 =  1     1*-.5  = -.5              1*-.5  = -.5              1*1  =  1
diet level 2 = -1     -1*-.5 =  .5              -1*-.5 =  .5              -1*1 = -1

When writing the estimate statement it does not matter whether the numbers are written as a matrix or as one long stream of numbers; the results will be the same either way. Thus, the following two estimate statements are equivalent:

estimate 'exertype 12v3 by diet 1v2' diet*exertype -.5 -.5  1
                                                    .5  .5 -1;

estimate 'exertype 12v3 by diet 1v2' diet*exertype -.5 -.5 1 .5 .5 -1;

proc mixed data=exercise;
  class diet exertype time;
  model pulse = exertype|diet|time;
  repeated time / subject=id type=arh(1);
  estimate 'exertype 12v3 by diet 1v2' diet*exertype -.5 -.5 1 .5 .5 -1;
run;
quit;

<output omitted>

Estimates

                                          Standard
Label                        Estimate        Error     DF    t Value    Pr > |t|
exertype 12v3 by diet 1v2    -12.7667       3.9951     24      -3.20      0.0039

Contrast involving a three-way interaction

The graphs also make us suspect that there is a difference between the pulse rate of the third level of exertype at the last time point and the other levels of exertype at the other time points and that this difference depends on which diet you follow. So, we want to test the three-way interaction where each variable has a specific contrast coding. For exertype we are contrasting level 3 versus the other two levels, likewise for time we are contrasting level 3 versus the two other time points and for diet we are contrasting the two diets.  Let's look at the contrast coding for each variable in a table format.

diet level 1          1
diet level 2         -1

exertype level 1    -.5
exertype level 2    -.5
exertype level 3      1

time level 1        -.5
time level 2        -.5
time level 3          1

The coding for the interaction is determined by the order in which the factors appear in the class statement, and in this example the order is: diet, exertype and time.  Therefore, the interaction is coded as: d1e1t1 d1e1t2 d1e1t3 d1e2t1 d1e2t2 d1e2t3 d1e3t1 d1e3t2 d1e3t3 d2e1t1 d2e1t2 d2e1t3 d2e2t1 d2e2t2 d2e2t3 d2e3t1 d2e3t2 d2e3t3.  Furthermore, d# = coding for diet level #, e# = coding for exertype level #, t# = coding for time level # and d#e#t# is the product of the three.  In this case the coding for the interaction is:

d1e1t1 = 1*-.5*-.5 = .25
d1e1t2 = 1*-.5*-.5 = .25
d1e1t3 = 1*-.5*1   = -.5
d1e2t1 = 1*-.5*-.5 = .25
d1e2t2 = 1*-.5*-.5 = .25
d1e2t3 = 1*-.5*1   = -.5
d1e3t1 = 1*1*-.5   = -.5
d1e3t2 = 1*1*-.5   = -.5
d1e3t3 = 1*1*1     =  1

d2e1t1 = -1*-.5*-.5 = -.25
d2e1t2 = -1*-.5*-.5 = -.25
d2e1t3 = -1*-.5*1   =  .5
d2e2t1 = -1*-.5*-.5 = -.25
d2e2t2 = -1*-.5*-.5 = -.25
d2e2t3 = -1*-.5*1   =  .5
d2e3t1 = -1*1*-.5   =  .5
d2e3t2 = -1*1*-.5   =  .5
d2e3t3 = -1*1*1     = -1

This can be more conveniently visualized as matrices.  The coding for time, which is the last factor in the class statement, can be thought of as the row matrix that is multiplied by the coding for exertype, which is the second to last factor in the class statement and which can be thought of as the column matrix.  The matrix which is the product is then multiplied by the coding for each level of diet, which appears before exertype and time in the class statement.

For diet level 1 = 1:

                          time level 1 = -.5    time level 2 = -.5    time level 3 = 1
exertype level 1 = -.5    1*-.5*-.5 = .25       1*-.5*-.5 = .25       1*-.5*1 = -.5
exertype level 2 = -.5    1*-.5*-.5 = .25       1*-.5*-.5 = .25       1*-.5*1 = -.5
exertype level 3 =  1     1*1*-.5   = -.5       1*1*-.5   = -.5       1*1*1   =  1

For diet level 2 = -1:

                          time level 1 = -.5      time level 2 = -.5      time level 3 = 1
exertype level 1 = -.5    -1*-.5*-.5 = -.25       -1*-.5*-.5 = -.25       -1*-.5*1 = .5
exertype level 2 = -.5    -1*-.5*-.5 = -.25       -1*-.5*-.5 = -.25       -1*-.5*1 = .5
exertype level 3 =  1     -1*1*-.5   =  .5        -1*1*-.5   =  .5        -1*1*1   = -1

proc mixed data=exercise;
  class diet exertype time;
  model pulse = exertype|diet|time;
  repeated time / subject=id type=arh(1);
  estimate 'ex 12v3 by diet 1v2 by time 12v3' diet*exertype*time
      .25  .25  -.5
      .25  .25  -.5
     -.5  -.5     1
     -.25 -.25   .5
     -.25 -.25   .5
      .5   .5    -1;
run;
quit;

<output omitted>

Estimates

                                                 Standard
Label                               Estimate        Error     DF    t Value    Pr > |t|
ex 12v3 by diet 1v2 by time 12v3    -21.2000       5.7463     48      -3.69      0.0006

Contrast involving a four-way interaction

Our model does not include any four-way interactions, but suppose that we had another categorical variable in our model, call it shoetype, and that the subjects in the study were randomly assigned to one of two types of athletic shoes: aerobic shoes and running shoes. We would like to test the interaction where we contrast the diets, the two types of shoes, time point 3 versus the other two time points and exertype level 3 (running) versus the other two levels.  Let's write the contrast coding in tables to get a clearer picture.

shoe level 1          1
shoe level 2         -1

diet level 1          1
diet level 2         -1

exertype level 1    -.5
exertype level 2    -.5
exertype level 3      1

time level 1        -.5
time level 2        -.5
time level 3          1


The coding for the interaction is determined by the order in which the factors appear in the class statement, and in this example the order is: shoetype, diet, exertype and time.  Therefore, the interaction is coded as: s1d1e1t1 s1d1e1t2 s1d1e1t3 s1d1e2t1 s1d1e2t2 s1d1e2t3 s1d1e3t1 s1d1e3t2 s1d1e3t3 s1d2e1t1 s1d2e1t2 s1d2e1t3 s1d2e2t1 s1d2e2t2 s1d2e2t3 s1d2e3t1 s1d2e3t2 s1d2e3t3 s2d1e1t1 s2d1e1t2 s2d1e1t3 s2d1e2t1 s2d1e2t2 s2d1e2t3 s2d1e3t1 s2d1e3t2 s2d1e3t3 s2d2e1t1 s2d2e1t2 s2d2e1t3 s2d2e2t1 s2d2e2t2 s2d2e2t3 s2d2e3t1 s2d2e3t2 s2d2e3t3.  Furthermore, s# = coding for shoetype level #, d# = coding for diet level #, e# = coding for exertype level #, t# = coding for time level # and s#d#e#t# is the product of the four.  In this case the coding for the interaction is:

s1d1e1t1 = 1*1*-.5*-.5 = .25
s1d1e1t2 = 1*1*-.5*-.5 = .25
s1d1e1t3 = 1*1*-.5*1   = -.5
s1d1e2t1 = 1*1*-.5*-.5 = .25
s1d1e2t2 = 1*1*-.5*-.5 = .25
s1d1e2t3 = 1*1*-.5*1   = -.5
s1d1e3t1 = 1*1*1*-.5   = -.5
s1d1e3t2 = 1*1*1*-.5   = -.5
s1d1e3t3 = 1*1*1*1     =  1
s1d2e1t1 = 1*-1*-.5*-.5 = -.25
s1d2e1t2 = 1*-1*-.5*-.5 = -.25
s1d2e1t3 = 1*-1*-.5*1   =  .5
s1d2e2t1 = 1*-1*-.5*-.5 = -.25
s1d2e2t2 = 1*-1*-.5*-.5 = -.25
s1d2e2t3 = 1*-1*-.5*1   =  .5
s1d2e3t1 = 1*-1*1*-.5   =  .5
s1d2e3t2 = 1*-1*1*-.5   =  .5
s1d2e3t3 = 1*-1*1*1     = -1

s2d1e1t1 = -1*1*-.5*-.5 = -.25
s2d1e1t2 = -1*1*-.5*-.5 = -.25
s2d1e1t3 = -1*1*-.5*1   =  .5
s2d1e2t1 = -1*1*-.5*-.5 = -.25
s2d1e2t2 = -1*1*-.5*-.5 = -.25
s2d1e2t3 = -1*1*-.5*1   =  .5
s2d1e3t1 = -1*1*1*-.5   =  .5
s2d1e3t2 = -1*1*1*-.5   =  .5
s2d1e3t3 = -1*1*1*1     = -1
s2d2e1t1 = -1*-1*-.5*-.5 = .25
s2d2e1t2 = -1*-1*-.5*-.5 = .25
s2d2e1t3 = -1*-1*-.5*1   = -.5
s2d2e2t1 = -1*-1*-.5*-.5 = .25
s2d2e2t2 = -1*-1*-.5*-.5 = .25
s2d2e2t3 = -1*-1*-.5*1   = -.5
s2d2e3t1 = -1*-1*1*-.5   = -.5
s2d2e3t2 = -1*-1*1*-.5   = -.5
s2d2e3t3 = -1*-1*1*1     =  1

This can also be more conveniently visualized as matrices.  The coding for time, which is the last factor in the class statement, can be thought of as the row matrix that is multiplied by the coding for exertype, which is the second to last factor in the class statement and which can be thought of as the column matrix.  The matrix which is the product is then multiplied by the coding for each combination of shoetype and diet, which appear before exertype and time in the class statement.

For shoetype level 1 = 1 and diet level 1 = 1:

                          time level 1 = -.5    time level 2 = -.5    time level 3 = 1
exertype level 1 = -.5    1*1*-.5*-.5 = .25     1*1*-.5*-.5 = .25     1*1*-.5*1 = -.5
exertype level 2 = -.5    1*1*-.5*-.5 = .25     1*1*-.5*-.5 = .25     1*1*-.5*1 = -.5
exertype level 3 =  1     1*1*1*-.5   = -.5     1*1*1*-.5   = -.5     1*1*1*1   =  1

For shoetype level 1 = 1 and diet level 2 = -1:

                          time level 1 = -.5      time level 2 = -.5      time level 3 = 1
exertype level 1 = -.5    1*-1*-.5*-.5 = -.25     1*-1*-.5*-.5 = -.25     1*-1*-.5*1 = .5
exertype level 2 = -.5    1*-1*-.5*-.5 = -.25     1*-1*-.5*-.5 = -.25     1*-1*-.5*1 = .5
exertype level 3 =  1     1*-1*1*-.5   =  .5      1*-1*1*-.5   =  .5      1*-1*1*1   = -1

For shoetype level 2 = -1 and diet level 1 = 1:

                          time level 1 = -.5      time level 2 = -.5      time level 3 = 1
exertype level 1 = -.5    -1*1*-.5*-.5 = -.25     -1*1*-.5*-.5 = -.25     -1*1*-.5*1 = .5
exertype level 2 = -.5    -1*1*-.5*-.5 = -.25     -1*1*-.5*-.5 = -.25     -1*1*-.5*1 = .5
exertype level 3 =  1     -1*1*1*-.5   =  .5      -1*1*1*-.5   =  .5      -1*1*1*1   = -1

For shoetype level 2 = -1 and diet level 2 = -1:

                          time level 1 = -.5      time level 2 = -.5      time level 3 = 1
exertype level 1 = -.5    -1*-1*-.5*-.5 = .25     -1*-1*-.5*-.5 = .25     -1*-1*-.5*1 = -.5
exertype level 2 = -.5    -1*-1*-.5*-.5 = .25     -1*-1*-.5*-.5 = .25     -1*-1*-.5*1 = -.5
exertype level 3 =  1     -1*-1*1*-.5   = -.5     -1*-1*1*-.5   = -.5     -1*-1*1*1   =  1

The code for this interaction would be:

proc mixed data=exercise;
  class shoetype diet exertype time;
  model pulse = shoetype|exertype|diet|time;
  repeated time / subject=id type=arh(1);
  estimate 'sh 1v2 & d 1v2 & ex 12v3 & t 12v3' shoetype*diet*exertype*time
      .25  .25  -.5
      .25  .25  -.5
     -.5  -.5     1
     -.25 -.25   .5
     -.25 -.25   .5
      .5   .5    -1
     -.25 -.25   .5
     -.25 -.25   .5
      .5   .5    -1
      .25  .25  -.5
      .25  .25  -.5
     -.5  -.5     1;
run;
quit;

Linear regression

SAS FAQ
How do I interpret the parameter estimates for dummy variables in proc reg or proc glm?

Consider this simple data file having nine subjects (sub) in three groups (iv) with a score on the dv (dv).


DATA dummy;
  INPUT sub iv dv;
CARDS;
1 1 48
2 1 49
3 1 50
4 2 17
5 2 20
6 2 23
7 3 28
8 3 30
9 3 32
;
RUN;

Below we do a proc means to find the overall mean, and another proc means to find the means for the three groups.

PROC MEANS DATA=dummy;
  VAR dv;
RUN;

PROC MEANS DATA=dummy;
  CLASS iv;
  VAR dv;
RUN;

As we see below, the overall mean is 33, and the means for groups 1, 2 and 3 are 49, 20 and 30 respectively.

Analysis Variable : DV

 N          Mean       Std Dev       Minimum       Maximum
---------------------------------------------------------
 9    33.0000000    12.8937970    17.0000000    50.0000000
---------------------------------------------------------

Analysis Variable : DV

IV    N Obs    N          Mean      Std Dev       Minimum       Maximum
------------------------------------------------------------------------------
 1        3    3    49.0000000    1.0000000    48.0000000    50.0000000
 2        3    3    20.0000000    3.0000000    17.0000000    23.0000000
 3        3    3    30.0000000    2.0000000    28.0000000    32.0000000
------------------------------------------------------------------------------

Let's run a standard ANOVA on this data using proc glm.

PROC GLM DATA=dummy;
  CLASS iv;
  MODEL dv = iv;
RUN;


The results of the ANOVA are shown below.

General Linear Models Procedure
Class Level Information

Class    Levels    Values
IV            3    1 2 3

Number of observations in data set = 9

General Linear Models Procedure

Dependent Variable: DV

                                Sum of            Mean
Source             DF          Squares          Square    F Value    Pr > F
Model               2     1302.0000000     651.0000000     139.50    0.0001
Error               6       28.0000000       4.6666667
Corrected Total     8     1330.0000000

R-Square         C.V.     Root MSE      DV Mean
0.978947     6.546203    2.1602469    33.000000

Source    DF       Type I SS     Mean Square    F Value    Pr > F
IV         2    1302.0000000     651.0000000     139.50    0.0001

Source    DF     Type III SS     Mean Square    F Value    Pr > F
IV         2    1302.0000000     651.0000000     139.50    0.0001

Now, let's take this information we have found, and relate it to the results that we get when we run a similar analysis using dummy coding. Let's make a data file called dummy2 that has dummy variables called iv1 (1 if iv=1), iv2 (1 if iv=2) and iv3 (1 if iv=3).  Note that iv3 is not really necessary, but it could be useful for further exploring the meaning of dummy variables. We will then use proc reg to predict dv from iv1 and iv2.

DATA dummy2;
  SET dummy;
  IF (iv = 1) THEN iv1 = 1; ELSE iv1 = 0;
  IF (iv = 2) THEN iv2 = 1; ELSE iv2 = 0;
  IF (iv = 3) THEN iv3 = 1; ELSE iv3 = 0;
RUN;

PROC REG DATA=dummy2;
  MODEL dv = iv1 iv2;
RUN;

The output is shown below.


Model: MODEL1
Dependent Variable: DV

Analysis of Variance

                      Sum of           Mean
Source       DF      Squares         Square    F Value    Prob>F

Model         2   1302.00000      651.00000    139.500    0.0001
Error         6     28.00000        4.66667
C Total       8   1330.00000

Root MSE     2.16025     R-square    0.9789
Dep Mean    33.00000     Adj R-sq    0.9719
C.V.         6.54620

Parameter Estimates

                 Parameter      Standard    T for H0:
Variable    DF    Estimate         Error    Parameter=0    Prob > |T|

INTERCEP     1   30.000000    1.24721913         24.054        0.0001
IV1          1   19.000000    1.76383421         10.772        0.0001
IV2          1  -10.000000    1.76383421         -5.669        0.0013

First, note that for the ANOVA using proc glm the F value was 139.5, and for the regression using proc reg the F value (for the model) is also 139.5. This illustrates that the overall test of the model using regression is really the same as doing an ANOVA.

After the Analysis of Variance section, there is a section titled Parameter Estimates. What is the interpretation of the values listed there, the 30, 19 and -10? Notice how we have iv1 and iv2 that refer to group 1 and group 2, but we did not include any dummy variable referring to group 3. Group 3 is often called the omitted group or reference group. Recall that the means of the 3 groups were 49, 20 and 30 respectively. The intercept term is the mean of the omitted group, and indeed the parameter estimate from the output is the mean of group 3, 30. The parameter estimate for iv1 is the mean of group 1 minus the mean of group 3, 49 - 30 = 19, and indeed that is the parameter estimate for iv1. Likewise, the parameter estimate for iv2 is the mean of group 2 minus the mean of group 3, 20 - 30 = -10, the parameter estimate for iv2.

So, in summary:

Intercept    mean of group 3 (mean of the omitted group)
iv1          mean of group 1 - mean of group 3 (omitted group)
iv2          mean of group 2 - mean of group 3 (omitted group)

Try running this example, but use iv2 and iv3 in proc reg (making group 1 the omitted group) and see what happens.


Finally, consider how the parameter estimates can be used in the regression model to obtain the means for the groups (the predicted values). The regression model is:

Ypredicted = 30 + iv1*19 + iv2*(-10)

For group 1: Ypredicted = 30 + 1*19 + 0*(-10) = 49
For group 2: Ypredicted = 30 + 0*19 + 1*(-10) = 20
For group 3: Ypredicted = 30 + 0*19 + 0*(-10) = 30

As you see, the regression formula predicts that each group will have the mean value of its group.

SAS FAQ
How can I compare regression coefficients between two groups?

Sometimes your research may predict that the size of a regression coefficient should be bigger for one group than for another. For example, you might believe that the regression coefficient of height predicting weight would be higher for men than for women. Below, we have a data file with 10 fictional females and 10 fictional males, along with their height in inches and their weight in pounds.

DATA htwt;
  INPUT id Gender $ height weight;
CARDS;
1 F 56 117
2 F 60 125
3 F 64 133
4 F 68 141
5 F 72 149
6 F 54 109
7 F 62 128
8 F 65 131
9 F 65 131
10 F 70 145
11 M 64 211
12 M 68 223
13 M 72 235
14 M 76 247
15 M 80 259
16 M 62 201
17 M 69 228
18 M 74 245
19 M 75 241
20 M 82 269
;
RUN;

We analyzed their data separately using the proc reg below.

PROC REG DATA=htwt;
  BY gender;
  MODEL weight = height;
RUN;

The parameter estimates (coefficients) for females and males are shown below, and the results do seem to suggest that height is a stronger predictor of weight for males (3.18) than for females (2.09).

GENDER=F

                             T for H0:                   Std Error of
Parameter        Estimate    Parameter=0    Pr > |T|     Estimate
INTERCEPT    -2.397470040          -0.34      0.7427     7.05327189
HEIGHT        2.095872170          18.97      0.0001     0.11049098

GENDER=M

                             T for H0:                   Std Error of
Parameter        Estimate    Parameter=0    Pr > |T|     Estimate
INTERCEPT     5.601677149           0.63      0.5480     8.93019669
HEIGHT        3.189727463          25.88      0.0001     0.12323669

We can compare the regression coefficients of males with females to test the null hypothesis Ho: Bf = Bm, where Bf is the regression coefficient for females, and Bm is the regression coefficient for males. To do this analysis, we first make a dummy variable called female that is coded 1 for female and 0 for male, and a variable femht that is the product of female and height. We then use female, height and femht as predictors in the regression equation.

data htwt2;
  set htwt;
  female = .;
  IF gender = "F" then female = 1;
  IF gender = "M" then female = 0;
  femht = female*height;
RUN;

PROC REG DATA=htwt2;
  MODEL weight = female height femht;
RUN;

The output is shown below.

Model: MODEL1
Dependent Variable: WEIGHT

Analysis of Variance

                           Sum of          Mean
Source         DF         Squares        Square      F Value    Prob>F

Model           3     60327.09739   20109.03246     4250.111    0.0001
Error          16        75.70261       4.73141
C Total        19     60402.80000

     Root MSE      2.17518     R-square    0.9987
     Dep Mean    183.40000     Adj R-sq    0.9985
     C.V.          1.18603

Parameter Estimates

                 Parameter       Standard     T for H0:
Variable   DF     Estimate          Error    Parameter=0    Prob > |T|

INTERCEP    1     5.601677     8.06886167        0.694        0.4975
FEMALE      1    -7.999147    11.37054598       -0.703        0.4919
HEIGHT      1     3.189727     0.11135027       28.646        0.0001
FEMHT       1    -1.093855     0.16777741       -6.520        0.0001

The term femht tests the null hypothesis Ho: Bf = Bm. The T value is -6.52 and is significant, indicating that the regression coefficient Bf is significantly different from Bm.

Let's look at the parameter estimates to get a better understanding of what they mean and how they are interpreted.  First, recall that our dummy variable female is 1 if female and 0 if male; therefore, males are the omitted group. This is needed for proper interpretation of the estimates.

INTERCEP    5.601677 : the intercept for the males (omitted group). This corresponds to the intercept for males in the separate-groups analysis.
FEMALE     -7.999147 : intercept for females - intercept for males. This corresponds to the difference of the intercepts from the separate-groups analysis, and is indeed -2.397470040 - 5.601677149.
HEIGHT      3.189727 : slope for males (omitted group), i.e., Bm.
FEMHT      -1.093855 : slope for females - slope for males (i.e., Bf - Bm). From the separate groups, this is indeed 2.095872170 - 3.189727463.
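A minimal data step sketch that verifies these relationships numerically from the separate-groups estimates above:

data _null_;
  int_f = -2.397470040;   /* female intercept from the BY gender analysis */
  int_m =  5.601677149;   /* male intercept                               */
  b_f   =  2.095872170;   /* female slope                                 */
  b_m   =  3.189727463;   /* male slope                                   */
  female_coef = int_f - int_m;   /* = -7.999147, the FEMALE estimate */
  femht_coef  = b_f - b_m;       /* = -1.093855, the FEMHT estimate  */
  put female_coef= femht_coef=;
run;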

It is also possible to run such an analysis in proc glm, using syntax like that below.

PROC GLM DATA=htwt2;
  CLASS gender;
  MODEL weight = gender height gender*height / SOLUTION;
RUN;

As you see, the proc glm output corresponds to the output obtained by proc reg.

General Linear Models Procedure
Class Level Information


Class Levels Values

GENDER 2 F M

Number of observations in data set = 20

General Linear Models Procedure

Dependent Variable: WEIGHT

                             Sum of           Mean
Source            DF        Squares         Square     F Value    Pr > F

Model              3   60327.097387   20109.032462     4250.11    0.0001
Error             16      75.702613       4.731413
Corrected Total   19   60402.800000

     R-Square         C.V.      Root MSE    WEIGHT Mean
     0.998747     1.186031     2.1751812      183.40000

Source DF Type I SS Mean Square F Value Pr > F

GENDER             1   55125.000000   55125.000000    11650.85    0.0001
HEIGHT             1    5000.982757    5000.982757     1056.97    0.0001
HEIGHT*GENDER      1     201.114630     201.114630       42.51    0.0001

Source DF Type III SS Mean Square F Value Pr > F

GENDER             1      2.3416157      2.3416157        0.49    0.4919
HEIGHT             1   4695.8308766   4695.8308766      992.48    0.0001
HEIGHT*GENDER      1    201.1146303    201.1146303       42.51    0.0001

                                      T for H0:              Std Error of
Parameter              Estimate      Parameter=0   Pr > |T|    Estimate

INTERCEPT           5.601677149 B       0.69        0.4975    8.06886167
GENDER        F    -7.999147189 B      -0.70        0.4919   11.37054598
              M     0.000000000 B        .           .         .
HEIGHT              3.189727463 B      28.65        0.0001    0.11135027
HEIGHT*GENDER F    -1.093855293 B      -6.52        0.0001    0.16777741
              M     0.000000000 B        .           .         .

NOTE: The X'X matrix has been found to be singular and a generalized inverse
      was used to solve the normal equations. Estimates followed by the letter
      'B' are biased, and are not unique estimators of the parameters.

The parameter estimates appear at the end of the proc glm output. They correspond to the output from proc reg and from the separate analyses, that is:

INTERCEPT          5.601677149 : the intercept for the males
GENDER F          -7.999147189 : intercept for females - intercept for males
HEIGHT             3.189727463 : slope for males
HEIGHT*GENDER F   -1.093855293 : slope for females - slope for males

SAS FAQ: How can I compare regression coefficients across three (or more) groups?

Sometimes your research may predict that the size of a regression coefficient may vary across groups. For example, you might believe that the regression coefficient of height predicting weight would differ across three age groups (young, middle age, senior citizen). Below, we have a data file with 10 fictional young people, 10 fictional middle age people, and 10 fictional senior citizens, along with their height in inches and their weight in pounds. The variable age indicates the age group and is coded 1 for young people, 2 for middle aged, and 3 for senior citizens.

DATA htwt;
  INPUT id age height weight;
CARDS;
 1 1 56 140
 2 1 60 155
 3 1 64 143
 4 1 68 161
 5 1 72 139
 6 1 54 159
 7 1 62 138
 8 1 65 121
 9 1 65 161
10 1 70 145
11 2 56 117
12 2 60 125
13 2 64 133
14 2 68 141
15 2 72 149
16 2 54 109
17 2 62 128
18 2 65 131
19 2 65 131
20 2 70 145
21 3 64 211
22 3 68 223
23 3 72 235
24 3 76 247
25 3 80 259
26 3 62 201
27 3 69 228
28 3 74 245
29 3 75 241
30 3 82 269
;
RUN;

We analyze their data separately using the proc reg below.

PROC REG DATA=htwt;
  BY age;
  MODEL weight = height;
RUN;

The parameter estimates (coefficients) for the young, middle age, and senior citizens are shown below, and the results do seem to suggest that height is a stronger predictor of weight for seniors (3.18) than for the middle aged (2.09). The results also seem to suggest that height does not predict weight as strongly for the young (-.37) as for the middle aged and seniors. However, we would need to perform specific significance tests to be able to make claims about the differences among these regression coefficients.

AGE=1

                 Parameter      Standard    T for H0:
Variable   DF     Estimate         Error    Parameter=0   Prob > |T|
INTERCEP    1   170.166445   49.43018216       3.443        0.0088
HEIGHT      1    -0.376831    0.77433413      -0.487        0.6396

AGE=2

                 Parameter      Standard    T for H0:
Variable   DF     Estimate         Error    Parameter=0   Prob > |T|
INTERCEP    1    -2.397470    7.05327189      -0.340        0.7427
HEIGHT      1     2.095872    0.11049098      18.969        0.0001

AGE=3

                 Parameter      Standard    T for H0:
Variable   DF     Estimate         Error    Parameter=0   Prob > |T|
INTERCEP    1     5.601677    8.93019669       0.627        0.5480
HEIGHT      1     3.189727    0.12323669      25.883        0.0001

We can compare the regression coefficients among these three age groups to test the null hypothesis

Ho: B1 = B2 = B3

where B1 is the regression coefficient for the young, B2 is the regression coefficient for the middle aged, and B3 is the regression coefficient for senior citizens. To do this analysis, we first make a dummy variable called age1 that is coded 1 if young (age=1), 0 otherwise, and age2 that is coded 1 if middle aged (age=2), 0 otherwise. We also create age1ht that is age1 times height, and age2ht that is age2 times height.


data htwt2;
  set htwt;
  age1 = .;
  age2 = .;
  IF age = 1 then age1 = 1; ELSE age1 = 0;
  IF age = 2 then age2 = 1; ELSE age2 = 0;
  age1ht = age1*height;
  age2ht = age2*height;
RUN;

We can now use age1, age2, height, age1ht and age2ht as predictors in the regression equation in proc reg below. In the proc reg we use the

TEST age1ht=0, age2ht=0;

statement to test the null hypothesis

Ho: B1 = B2 = B3

This test will have two degrees of freedom because it compares among three regression coefficients.

PROC REG DATA=htwt2;
  MODEL weight = age1 age2 height age1ht age2ht;
  TEST age1ht=0, age2ht=0;
RUN;

The output below shows that the null hypothesis

Ho: B1 = B2 = B3

can be rejected (F=17.29, p=0.0001). This means that the regression coefficients between height and weight do indeed significantly differ across the 3 age groups (young, middle age, senior citizen).

Model: MODEL1
Dependent Variable: WEIGHT

Analysis of Variance

                           Sum of          Mean
Source         DF         Squares        Square      F Value    Prob>F

Model           5     69595.35464   13919.07093      220.261    0.0001
Error          24      1516.64536      63.19356
C Total        29     71112.00000

     Root MSE      7.94944     R-square    0.9787
     Dep Mean    171.00000     Adj R-sq    0.9742
     C.V.          4.64879

Parameter Estimates

                 Parameter       Standard     T for H0:
Variable   DF     Estimate          Error    Parameter=0    Prob > |T|

INTERCEP    1     5.601677    29.48853690        0.190        0.8509
AGE1        1   164.564768    41.55490307        3.960        0.0006
AGE2        1    -7.999147    41.55490307       -0.192        0.8490
HEIGHT      1     3.189727     0.40694172        7.838        0.0001
AGE1HT      1    -3.566558     0.61316088       -5.817        0.0001
AGE2HT      1    -1.093855     0.61316088       -1.784        0.0871

Dependent Variable: WEIGHT
Test:  Numerator:   1092.7718   DF:  2   F value:  17.2925
       Denominator:   63.19356  DF: 24   Prob>F:    0.0001

It is also possible to run such an analysis in proc glm, using syntax as shown below. Instead of using a test statement, the contrast statement is used to test the null hypothesis

Ho: B1 = B2 = B3

The contrast statement uses the comma to join together what would have been two separate one degree of freedom tests into a single two degree of freedom test that tests the null hypothesis above.

PROC GLM DATA=htwt2;
  CLASS age;
  MODEL weight = age height age*height / SOLUTION;
  CONTRAST 'test equal slopes' age*height 1 -1 0,
                               age*height 0 1 -1;
RUN;
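As a sketch of what the comma is doing, here are the same two comparisons written instead as two separate one-degree-of-freedom contrasts (each contrast below is its own 1-df test, rather than the single joint 2-df test above):

PROC GLM DATA=htwt2;
  CLASS age;
  MODEL weight = age height age*height / SOLUTION;
  CONTRAST 'B1 = B2' age*height 1 -1  0;   /* young vs. middle aged  */
  CONTRAST 'B2 = B3' age*height 0  1 -1;   /* middle aged vs. senior */
RUN;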

If you compare the contrast output from proc glm (labeled test equal slopes, found below) with the output from the test statement in proc reg above, you will see that the F values and p values are the same. This is because these two tests are equivalent.

General Linear Models Procedure
Class Level Information

Class Levels Values

AGE 3 1 2 3

Number of observations in data set = 30

General Linear Models Procedure

Dependent Variable: WEIGHT

                             Sum of           Mean
Source            DF        Squares         Square     F Value    Pr > F

Model              5   69595.354644   13919.070929      220.26    0.0001
Error             24    1516.645356      63.193557
Corrected Total   29   71112.000000

     R-Square         C.V.      Root MSE    WEIGHT Mean
     0.978672     4.648794     7.9494375      171.00000

Source DF Type I SS Mean Square F Value Pr > F

AGE                2   64350.600000   32175.300000      509.15    0.0001
HEIGHT             1    3059.211075    3059.211075       48.41    0.0001
HEIGHT*AGE         2    2185.543569    1092.771784       17.29    0.0001

Source DF Type III SS Mean Square F Value Pr > F

AGE                2   1395.9046778    697.9523389       11.04    0.0004
HEIGHT             1   2597.0189017   2597.0189017       41.10    0.0001
HEIGHT*AGE         2   2185.5435689   1092.7717845       17.29    0.0001

Contrast DF Contrast SS Mean Square F Value Pr > F

test equal slopes 2 2185.5435689 1092.7717845 17.29 0.0001

                                  T for H0:              Std Error of
Parameter          Estimate      Parameter=0   Pr > |T|    Estimate

INTERCEPT         5.6016771 B       0.19        0.8509   29.48853690
AGE        1    164.5647676 B       3.96        0.0006   41.55490307
           2     -7.9991472 B      -0.19        0.8490   41.55490307
           3      0.0000000 B        .           .         .
HEIGHT            3.1897275 B       7.84        0.0001    0.40694172
HEIGHT*AGE 1     -3.5665584 B      -5.82        0.0001    0.61316088
           2     -1.0938553 B      -1.78        0.0871    0.61316088
           3      0.0000000 B        .           .         .

NOTE: The X'X matrix has been found to be singular and a generalized inverse
      was used to solve the normal equations. Estimates followed by the letter
      'B' are biased, and are not unique estimators of the parameters.

You might notice that the null hypothesis that we are testing

Ho: B1 = B2 = B3

is similar to the null hypothesis that you might test using ANOVA to compare the means of the three groups,

Ho: Mu1 = Mu2 = Mu3

In ANOVA, you can get an overall F test testing the null hypothesis. In addition to that overall test, you could perform planned comparisons among the three groups. So far we have seen how to do an overall test of the equality of the three regression coefficients, and now we will test planned comparisons among the regression coefficients. Below, we show how you can perform two such tests using the contrast statement in proc glm. The first contrast compares the regression coefficients of the middle aged vs. seniors.

Ho: B2 = B3

The second contrast compares the regression coefficients of the young vs. middle aged and seniors.

Ho: B1 = (B2 + B3)/2

PROC GLM DATA=htwt2;
  CLASS age;
  MODEL weight = age height age*height;
  CONTRAST 'Mid Age vs. Sen. '   age*height  0 1 -1;
  CONTRAST 'Yng vs (Mid & Sen)'  age*height -2 1  1;
RUN;

The output from contrast indicates that the regression coefficients for the middle aged and seniors do not significantly differ (F=3.18, p=0.0871). The second contrast was significant (F=29.96, p=0.0001), indicating that the regression coefficients for the young differ from those of the middle aged and seniors combined.

General Linear Models Procedure
Class Level Information

Class Levels Values

AGE 3 1 2 3

Number of observations in data set = 30

General Linear Models Procedure

Dependent Variable: WEIGHT


                             Sum of           Mean
Source            DF        Squares         Square     F Value    Pr > F

Model              5   69595.354644   13919.070929      220.26    0.0001
Error             24    1516.645356      63.193557
Corrected Total   29   71112.000000

     R-Square         C.V.      Root MSE    WEIGHT Mean
     0.978672     4.648794     7.9494375      171.00000

Source DF Type I SS Mean Square F Value Pr > F

AGE                2   64350.600000   32175.300000      509.15    0.0001
HEIGHT             1    3059.211075    3059.211075       48.41    0.0001
HEIGHT*AGE         2    2185.543569    1092.771784       17.29    0.0001

Source DF Type III SS Mean Square F Value Pr > F

AGE                2   1395.9046778    697.9523389       11.04    0.0004
HEIGHT             1   2597.0189017   2597.0189017       41.10    0.0001
HEIGHT*AGE         2   2185.5435689   1092.7717845       17.29    0.0001

Contrast DF Contrast SS Mean Square F Value Pr > F

Mid Age vs. Sen.      1    201.1146303    201.1146303        3.18    0.0871
Yng vs (Mid & Sen)    1   1893.2074903   1893.2074903       29.96    0.0001

We can do the exact same analysis in proc reg by coding age1 and age2 like the coding shown in the contrast statements above. We will create age1 that will be:

0 for young

1 for middle age

-1 for senior

and we will create age2 that will be:


-2 for young

1 for middle age

1 for senior

The significance tests in proc reg below for age1ht and age2ht will correspond to the contrast statements we used in proc glm above.

data htwt3;
  set htwt;
  age1 = .;
  age2 = .;
  IF age = 1 then age1 = 0;
  IF age = 2 then age1 = 1;
  IF age = 3 then age1 = -1;
  IF age = 1 then age2 = -2;
  IF age = 2 then age2 = 1;
  IF age = 3 then age2 = 1;
  age1ht = age1*height;
  age2ht = age2*height;
RUN;

PROC REG DATA=htwt3;
  MODEL weight = age1 age2 height age1ht age2ht;
RUN;

The results below correspond to the proc glm results above, except that the proc glm results are reported as F values and the proc reg results are reported as t values. We can square the t values to make them comparable to the F values. Indeed, for the comparison of middle aged vs. seniors, the t value of -1.784 when squared becomes 3.183, the same as the F value from proc glm. Likewise, for the comparison of young vs. middle aged & seniors, the t value from proc reg is 5.473, which when squared becomes 29.954, the same as the F value from proc glm.

Model: MODEL1
Dependent Variable: WEIGHT

Analysis of Variance

                           Sum of          Mean
Source         DF         Squares        Square      F Value    Prob>F

Model           5     69595.35464   13919.07093      220.261    0.0001
Error          24      1516.64536      63.19356
C Total        29     71112.00000

     Root MSE      7.94944     R-square    0.9787
     Dep Mean    171.00000     Adj R-sq    0.9742
     C.V.          4.64879


Parameter Estimates

                 Parameter       Standard     T for H0:
Variable   DF     Estimate          Error    Parameter=0    Prob > |T|

INTERCEP    1    57.790217    16.94450462        3.411        0.0023
AGE1        1    -3.999574    20.77745154       -0.192        0.8490
AGE2        1   -56.188114    11.96726393       -4.695        0.0001
HEIGHT      1     1.636256     0.25524084        6.411        0.0001
AGE1HT      1    -0.546928     0.30658044       -1.784        0.0871
AGE2HT      1     1.006544     0.18389498        5.473        0.0001
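As a quick arithmetic check of the t-to-F correspondence described above, a small data step squaring the t values just shown:

data _null_;
  t_mid_sen  = -1.784;   /* AGE1HT: middle aged vs. seniors        */
  t_yng_rest =  5.473;   /* AGE2HT: young vs. middle aged/seniors  */
  f_mid_sen  = t_mid_sen**2;    /* = 3.183, the proc glm F value   */
  f_yng_rest = t_yng_rest**2;   /* = 29.954, the proc glm F value  */
  put f_mid_sen= f_yng_rest=;
run;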

SAS FAQ: How can I write an estimate statement in proc glm using a cell means model?

We will use a data set called elemapi2.sas7bdat to demonstrate. Variables mealcat and collcat are two categorical variables, both with three levels. The dependent variable is the school's API index. We want to look at a simple comparison: comparing group 1 of collcat with groups 2 and above when mealcat = 1. One way of doing this using proc glm with an estimate statement is the following:

proc glm data = elemapi2;
  class collcat mealcat;
  model api00 = collcat mealcat collcat*mealcat / ss3;
  estimate 'collcat 1 vs 2+ within mealcat = 1'
    collcat 1 -.5 -.5
    collcat*mealcat 1 0 0 -.5 0 0 -.5 0 0;
run;
quit;

Another way of accomplishing the same thing, and possibly an easier one, is to use a cell means model. A cell means model estimates one parameter for each cell and sets the intercept to 0. The cell means model is not used in general to produce an overall test of model fit, but is often used to write simpler estimate or contrast statements. So in practice, we need to write the proc glm code twice: once for the model fit and once for the estimates or contrasts. In the code shown below, the first proc glm is for the model fit and the second one, with the estimate statement, is for the estimate of the simple comparison. We use the noint option in the second proc glm to specify that we are not going to estimate the intercept, and will therefore estimate one parameter per cell.

proc glm data = in.elemapi2;
  class collcat mealcat;
  model api00 = collcat mealcat collcat*mealcat / ss3;
run;
quit;

proc glm data = in.elemapi2;
  class collcat mealcat;
  model api00 = collcat*mealcat / noint ss3;
  estimate 'collcat 1 vs 2+ within mealcat = 1'
    collcat*mealcat 2 0 0 -1 0 0 -1 0 0 / divisor=2;
quit;

Notice that the order of the categorical variables in the class statement decides which variable is the row variable and which is the column variable. For example, in the code above, collcat will be the row variable and mealcat will be the column variable. Therefore, the simple comparison we are interested in can be formulated as the table below. Writing the numbers in the table one row at a time, we can write our estimate statement as

estimate 'simple comparison' collcat*mealcat 1 0 0 -.5 0 0 -.5 0 0 ;

or equivalently, we can make use of the option divisor = to rewrite the statement in terms of whole numbers as shown above. 

collcat \ mealcat    mealcat = 1    mealcat = 2    mealcat = 3

collcat = 1               1              0              0
collcat = 2              -.5             0              0
collcat = 3              -.5             0              0

If we switch the order of the variables in the class statement, we will have to rewrite our estimate statement accordingly. For example, we can rewrite the above proc glm statement as follows, and it produces exactly the same result from the estimate statement, since the corresponding table is simply transposed:

mealcat \ collcat    collcat = 1    collcat = 2    collcat = 3

mealcat = 1               1             -.5            -.5
mealcat = 2               0              0              0
mealcat = 3               0              0              0

proc glm data = in.elemapi2;
  class mealcat collcat;
  model api00 = mealcat*collcat / noint ss3;
  estimate 'collcat 1 vs 2+ within mealcat = 1'
    collcat*mealcat 2 -1 -1 0 0 0 0 0 0 / divisor=2 e;
quit;

SAS FAQ: How can I compute omega squared in SAS after proc glm?

After you perform an ANOVA using proc glm, it is useful to be able to report omega squared as a measure of the strength of the effect of the independent variable. Proc glm currently does not have an option that computes this. Here is an example that shows how to compute ω2. The formula for ω2 given below is based on the formula on page 178 of Kirk's Experimental Design, using the F-statistic.

omega^2 = df*(F-1)/(df*(F-1)+N),

where F is the F-statistic, df is the degrees of freedom of the model and N is the total number of observations.
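As a quick check of the formula, here is a sketch that plugs in the rounded values reported in the output shown below (F = 6.10, df = 5, N = 200); the result differs only slightly from the value computed later from the unrounded F statistic:

data _null_;
  F  = 6.10;   /* model F statistic (rounded)              */
  df = 5;      /* model degrees of freedom                 */
  N  = 200;    /* total observations (corrected total + 1) */
  omega2 = df*(F-1) / (df*(F-1) + N);   /* approx. 0.1131  */
  put omega2=;
run;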

proc glm data = in.hsb2;
  class race ses;
  model write = race ses / ss3;
  ods output overallanova = atable;
run;
quit;

Dependent Variable: WRITE   writing score

                            Sum of
Source            DF       Squares    Mean Square   F Value   Pr > F
Model              5    2429.84904      485.96981      6.10   <.0001
Error            194   15449.02596       79.63415
Corrected Total  199   17878.87500

We used an ODS output statement to output the ANOVA table to a data set called atable.

proc print data = atable noobs;
run;

Dependent
Variable   Source            DF            SS           MS   FValue    ProbF
WRITE      Model              5    2429.84904    485.96981     6.10   <.0001
WRITE      Error            194   15449.02596     79.63415        _        _
WRITE      Corrected Total  199   17878.87500            _        _        _

The F-statistic is 6.10 and there are five degrees of freedom. The total number of observations is the corrected total + 1. The calculation of omega squared is performed in the data step below. We also calculate f-hat, another measure of effect size, which is related to omega squared as follows:

f-hat = sqrt(omega^2/(1-omega^2)).

data _omega2_;
  set atable nobs = last;
  retain fv p;
  if source = "Model" then do;
    p = df;
    fv = fvalue - 1;
  end;
  if source = "Corrected Total" then do;
    omega2 = p*fv/(p*fv + df + 1);
    esize = sqrt(omega2/(1-omega2));
  end;
  if _n_ = last;
  keep omega2 esize;
run;

proc print data = _omega2_ noobs;
run;

 omega2      esize
0.11313    0.35716

SAS FAQ: How can I visualize interactions of continuous variables in multiple regression?

It is difficult to picture what it means for there to be an interaction of continuous variables in multiple regression. These examples show how you can use SAS to produce three-dimensional spin plots to help visualize such interactions. We use two strategies to help these animated spin plots show you the interactions: 1) allowing you to view the regression plane from a variety of angles to picture the regression surface, and 2) varying certain terms in the regression equation and showing an animated graph illustrating what happens when that term in the regression model varies. For more background information on understanding such regression analyses, including more technical detail on calculations and interpretations, we recommend Multiple Regression: Testing and Interpreting Interactions and Interaction Effects in Multiple Regression, both available from our Statistics Books for Loan.

First, you need to download the macros stored in spplot.zip. We will assume that you unzip them and store them in c:\spinplots .

Second, we include all of the macros so we can use them.

%include "c:\spinplots\sp_plot.sas";%include "c:\spinplots\sp_plota.sas";%include "c:\spinplots\fixlen.sas";%include "c:\spinplots\sp_plotbx1.sas";%include "c:\spinplots\sp_plotbx1x2.sas";%include "c:\spinplots\sp_plotbx1x1x2.sas";

We then call the %sp_plota macro to make an animated spin plot where the coefficient for x1, called bx1, is 0, bx2 is 0, and the interaction of these is 0. We store the result in a file called c:\spinplots\reg_int_cont1.gif, but you can call it anything you like. You can then bring this up in your web browser to view it (it is not viewable within SAS). As you see in the plot, this is quite a boring plot since it is a completely flat plane.

%sp_plota(outfile="c:\spinplots\reg_int_cont1.gif", bx1=0,bx2=0,bx1x2=0);


Now let's show what happens when you vary bx1, the coefficient for x1.  We use the %sp_plotbx1 macro (which varies bx1) and indicate that we wish to vary bx1 from 0 to 3 incrementing by 0.05.  We set the angle for viewing this at 190 and we make the graph go very quickly by setting the delay to only 10.  As you can see, when bx1 increases, the plane tilts such that increasing values of x1 lead to increased values of y.  

%sp_plotbx1(outfile="c:\spinplots\reg_int_cont2.gif", bx1=0,bx2=0,bx1_lo=0,bx1_hi=3,bx1_by=.05, slen=5, angle=190,gopt= delay=10);

Now, let's see what happens when we introduce an interaction between x1 and x2 (i.e., bx1x2). We use the macro named %sp_plotbx1x2 (which varies the bx1x2 term) to show what happens when you increase the value of bx1x2 from 0 to .5 by .01. As you see, as bx1x2 increases, the regression plane goes from being flat to being curved, and it curves such that y increases in the corners where x1 and x2 are both high or both low, and y decreases in the other 2 corners.


* when b1 b2=0, add bx1x2 from 0 to .5, fast ;
%sp_plotbx1x2(outfile="c:\spinplots\reg_int_cont4.gif", bx1=0, bx2=0, bx1x2_lo=0, bx1x2_hi=.5, bx1x2_by=.01, slen=5, angle=190, gopt= delay=10);

Let's show the same graph, but slow it down.  Sometimes the fast graph (like above) helps to understand the plot, but sometimes a slower plot (like the one below) is more helpful.  For example, in the plot below you can see the values for the coefficient bx1x2 changing and see how that corresponds with the shape of the graph.

%sp_plotbx1x2(outfile="c:\spinplots\reg_int_cont5.gif", bx1=0, bx2=0, bx1x2_lo=0, bx1x2_hi=.5, bx1x2_by=.05, slen=5, angle=190,gopt= delay=50);

The bx1x2 term could be negative instead of positive. The graph below shows a fast graph varying bx1x2 from -.5 to 0. As you see, it looks much like the graph above, except that the y values flare down when x1 and x2 are both high, or both low.


%sp_plotbx1x2(outfile="c:\spinplots\reg_int_cont6.gif", bx1=0, bx2=0, bx1x2_lo=-.5, bx1x2_hi=0, bx1x2_by=.01, slen=5, angle=190, gopt= delay=10);

We now show the same graph more slowly.

%sp_plotbx1x2(outfile="c:\spinplots\reg_int_cont7.gif", bx1=0,bx2=0, bx1x2_lo=-.5, bx1x2_hi=0, bx1x2_by=.05, slen=5, angle=190,gopt= delay=50);

The graph below varies bx1x2 from -.5 to .5 so you can see the full spectrum of the changes as you go from having a negative interaction, to no interaction, to having a positive interaction.  

%sp_plotbx1x2(outfile="c:\spinplots\reg_int_cont8.gif", bx1=0, bx2=0, bx1x2_lo=-.5, bx1x2_hi=.5, bx1x2_by=.01, slen=5, angle=190,gopt= delay=10);


And we show the above graph more slowly below.

%sp_plotbx1x2(outfile="c:\spinplots\reg_int_cont9.gif", bx1=0, bx2=0, bx1x2_lo=-.5, bx1x2_hi=.5, bx1x2_by=.05, slen=5,angle=190,gopt= delay=50);

Below we show a spin plot for the graph where bx1 is 0, bx2 is 0, and bx1x2 is .5.  This allows you to see the graph from all angles.

%sp_plota(outfile="c:\spinplots\reg_int_cont10.gif", bx1=0,bx2=0,bx1x2=.5, title=" ");


We showed what happens when you vary bx1 in the regression, but that was when there was no interaction. Let's vary bx1 from 0 to 2; you can see that, although the plane is twisted, the effect of varying bx1 is much the same as in our previous example: the regression plane takes on greater tilt (with respect to x1) as bx1 increases. Because of the increased tilt, we needed to increase the minimum and maximum for the y values via the plot=zmin=-70 zmax=70 option (we are calling the vertical axis y, but SAS thinks of it as Z, so that is why the options are zmin and zmax).

%sp_plotbx1(outfile="c:\spinplots\reg_int_cont11.gif", bx1=0, bx2=0, bx1x2=.5, bx1_lo=0, bx1_hi=2, bx1_by=.1, angle=190, slen=5, gopt=delay=50, plot=zmin=-70 zmax=70);

As an aside, you might be tempted to look at these curved planes and think that the bx1x2 term creates regression lines that are curved. Surprisingly, these graphs are formed entirely from straight lines. The graph below tries to illustrate this by graphing the plane as a series of lines. As you can see, each of the lines is perfectly straight, but they twist in such a way as to form the curved plane. (The cmd=scatter option is used to show a scatterplot rather than a 3d plane to help see the separate regression lines.)

%sp_plotbx1x2(outfile="c:\spinplots\reg_int_cont12.gif", bx1=3, bx2=0, bx1x2_lo=0, bx1x2_hi=.5, bx1x2_by=.05, slen=5, angle=190, gopt=delay=50, plot=zmin=-70 zmax=70 noneedle shape='balloon' size=.5, cmd=scatter);

Let's see what happens when we add a quadratic term for x1 (which we are calling bx1x1). In the graph below, we show a graph where bx2 is 3, giving tilt with respect to x2, where bx1 is 0, but bx1x1 is -.5. This gives the plane an upside-down U shape with respect to x1.

%sp_plota(outfile="c:\spinplots\reg_int_cont13.gif", cons=10, bx1=0, bx2=3, bx1x1=-.5, title="Spinplot", title2="y=&cons+&bx1*x1+&bx2*x2+&bx1x1*x1*x1", plot= zmin=-80 zmax=50);


Now, let's see what happens when we add an interaction of bx1x1 by x2 (a term we call bx1x1x2). The %sp_plotbx1x1x2 macro allows us to vary this interaction to see what effect it has on the regression plane. The graph below varies this interaction from 0 to 0.07 by 0.01. We increased the size (via the vsize and hsize options) to make the graph easier to view. As you can see, as this interaction term increases, the upside-down U shape gets more extreme when x2 is small and gets flatter, even slightly reversing itself, when x2 is large. As you see, the strength of the x1-squared effect depends on the level of x2.

%sp_plotbx1x1x2(outfile="c:\spinplots\reg_int_cont14.gif", cons=30, bx1=0, bx1x1=-.4, bx2=3, bx1x1x2_lo=0, bx1x1x2_hi=.07, bx1x1x2_by=.01, slen=5, angle=10, gopt=delay=10 vsize=4 hsize=6, plot= zmin=-100 zmax=100);

Given that bx1x1x2 is .07, we can vary bx1x2 and see how this influences the shape of the graph.  As you see below, when bx1x2 is 0, the curve of x1 is symmetrical, but as the bx1x2 term increases the values increase when x1 and x2 are both small or both large, lifting the graph in these two corners. 

%sp_plotbx1x2(outfile="c:\spinplots\reg_int_cont15.gif", cons=30, bx1=0, bx1x1=-.4, bx2=3, bx1x1x2=.07, bx1x2_lo=0, bx1x2_hi=.4, bx1x2_by=.01, title="Spinplot",

Page 65: STAT_in SAS

title2=h=4 "y=&cons+&bx1*x1+&bx2*x2+&bx1x1*x1*x1+&bx1x1x2*x1*x1*x2&bf*x1*x2" , angle=10, slen=5, gopt=delay=10 vsize=4 hsize=6, plot=zmin=-120 zmax=120);

By being able to visualize these higher order terms, you can see how they might make sense in your research, and if you find such relationships, you can use these macros to make graphs that help you visualize and interpret the results of your analyses. For further information on analyses with interactions of continuous variables, please see Multiple Regression: Testing and Interpreting Interactions and Interaction Effects in Multiple Regression, both available from our Statistics Books for Loan.

The entire SAS program is stored in spinplots.sas.

SAS FAQ: Why are Type III p-values different from the estimate p-values in PROC GLM?

When running a model in PROC GLM with an interaction term, if you indicate the ss3 option you will likely see p-values for the same variable in the Type III Sum of Squares output that are different from the p-values in the Estimate output.  The code below uses the elemapi2 dataset. 

proc glm data="c:\sasreg\elemapi2"; class mealcat;

Page 66: STAT_in SAS

model api00=some_col mealcat some_col*mealcat /solution ss3;run;quit;

Source                DF    Type III SS    Mean Square   F Value   Pr > F
some_col               1      36366.366      36366.366      7.70   0.0058
mealcat                2    2012065.492    1006032.746    212.95   <.0001
some_col*mealcat       2      97468.169      48734.084     10.32   <.0001

                                         Standard
Parameter               Estimate            Error    t Value    Pr > |t|
Intercept            480.9461176 B    12.13062708      39.65      <.0001
some_col               1.6599700 B     0.75190859       2.21      0.0278
mealcat 1            344.9475807 B    17.05743173      20.22      <.0001
mealcat 2            105.9176024 B    18.75449819       5.65      <.0001
mealcat 3              0.0000000 B     .                 .         .
some_col*mealcat 1    -2.6073085 B     0.89604354      -2.91      0.0038
some_col*mealcat 2     0.5336362 B     0.92720142       0.58      0.5653
some_col*mealcat 3     0.0000000 B     .                 .         .

We can see that the p-value for some_col in the Type III SS section is 0.0058, while the same variable has a p-value of 0.0278 in the Estimate section. The reason that these two differ is that they correspond to two different tests due to the interaction term. The Type III SS section tests the overall effect of some_col while the Estimate section tests the simple effect of some_col when mealcat is at the level of the reference group.

To make this clear, we can look at the SAS code that reproduces the Type III SS test using the estimate statement:

proc glm data=ats.elemapi2;
  class mealcat;
  model api00 = mealcat some_col mealcat*some_col / solution ss3;
  estimate "overall effect" some_col 3 some_col*mealcat 1 1 1 / divisor=3;
run;
quit;

Source DF Type III SS Mean Square F Value Pr > F

mealcat                2    2012065.492    1006032.746    212.95   <.0001
some_col               1      36366.366      36366.366      7.70   0.0058
some_col*mealcat       2      97468.169      48734.084     10.32   <.0001

                                  Standard
Parameter          Estimate          Error    t Value    Pr > |t|
overall effect   0.96874585     0.34916249       2.77      0.0058

                                         Standard
Parameter               Estimate            Error    t Value    Pr > |t|

Intercept            480.9461176 B    12.13062708      39.65      <.0001
mealcat 1            344.9475807 B    17.05743173      20.22      <.0001
mealcat 2            105.9176024 B    18.75449819       5.65      <.0001
mealcat 3              0.0000000 B     .                 .         .
some_col               1.6599700 B     0.75190859       2.21      0.0278
some_col*mealcat 1    -2.6073085 B     0.89604354      -2.91      0.0038
some_col*mealcat 2     0.5336362 B     0.92720142       0.58      0.5653
some_col*mealcat 3     0.0000000 B     .                 .         .

Here is a more mathematical way of looking at this.  Our model has the following structure:

api00 = b_0 + b_1*mealcat_1 + b_2*mealcat_2 + b_3*some_col + b_4*mealcat_1*some_col + b_5*mealcat_2*some_col

The p-value of .0278 corresponds to the test of b_3 = 0. The null hypothesis here is that the effect of some_col is zero when mealcat = 3. The p-value of .0058 corresponds to the test of (b_3 + (b_3 + b_4) + (b_3 + b_5))/3 = 0, that is, the overall (average) effect of some_col across all levels of mealcat.
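A minimal data step sketch that verifies this average from the parameter estimates shown above:

data _null_;
  b3 =  1.6599700;   /* some_col           */
  b4 = -2.6073085;   /* some_col*mealcat 1 */
  b5 =  0.5336362;   /* some_col*mealcat 2 */
  overall = (b3 + (b3 + b4) + (b3 + b5)) / 3;   /* = 0.9687459 */
  put overall=;   /* matches the 'overall effect' estimate of 0.96874585 */
run;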

SAS FAQ: How do I interpret odds ratios in logistic regression?

You may also want to check out, FAQ: How do I use odds ratio to interpret logistic regression?, on our General FAQ page.

Introduction

Let's begin with probability. Let's say that the probability of success is .8, thus

p = .8

Then the probability of failure is

q = 1 - p = .2

The odds of success are defined as

odds(success) = p/q = .8/.2 = 4,

that is, the odds of success are 4 to 1. The odds of failure would be

odds(failure) = q/p = .2/.8 = .25.

This looks a little strange but it is really saying that the odds of failure are 1 to 4.  The odds of success and the odds of failure are just reciprocals of one another, i.e., 1/4 = .25 and 1/.25 = 4.  Next, we will add another variable to the equation so that we can compute an odds ratio.
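A small data step sketch that reproduces these numbers:

data _null_;
  p = .8;             /* probability of success          */
  q = 1 - p;          /* probability of failure          */
  odds_s = p/q;       /* = 4                             */
  odds_f = q/p;       /* = .25, the reciprocal of odds_s */
  put odds_s= odds_f=;
run;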

Another example


This example is adapted from Pedhazur (1997).  Suppose that seven out of 10 males are admitted to an engineering school while three of 10 females are admitted. The probabilities for admitting a male are,

p = 7/10 = .7       q = 1 - .7 = .3

Here are the same probabilities for females,

p = 3/10 = .3       q = 1 - .3 = .7

Now we can use the probabilities to compute the admission odds for both males and females,

odds(male) = .7/.3 = 2.33333
odds(female) = .3/.7 = .42857

Next, we compute the odds ratio for admission,

OR = 2.3333/.42857 = 5.44

Thus, for a male, the odds of being admitted are 5.44 times as large as the odds for a female being admitted.

Logistic regression in SAS

Here are the SAS logistic regression command and output for the example above. In this example admit is coded 1 for yes and 0 for no, and gender is coded 1 for male and 0 for female. In the call to proc logistic, we use the desc option (which is short for descending) to indicate that SAS should model the 1s in the outcome variable and not the 0s (which is the default). Also, we use the expb option on the model statement to have SAS display the odds ratios in the output.

data temp;
  input admit gender freq;
cards;
1 1 7
1 0 3
0 1 3
0 0 7
;
run;

proc logistic data = temp desc;
  weight freq;
  model admit = gender / expb;
run;

The LOGISTIC Procedure

Analysis of Maximum Likelihood Estimates

                              Standard          Wald
Parameter   DF   Estimate        Error     Chi-Square   Pr > ChiSq   Exp(Est)

Intercept    1    -0.8473       0.6901         1.5076       0.2195      0.429
gender       1     1.6946       0.9759         3.0152       0.0825      5.444


Note that Wald = 3.0152 for both the coefficient for gender and for the odds ratio for gender (because the coefficient and the odds ratio are two ways of saying the same thing).

About logits

There is a direct relationship between the coefficients and the odds ratios. First, let's define what is meant by a logit: a logit is the natural log (log base e) of the odds,

[1]     logit(p) = log(odds) = log(p/q)

Logistic regression is in reality ordinary regression using the logit as the response variable,

[2]     logit(p) = a + bX

or

[3]     log(p/q) = a + bX

This means that the coefficients in logistic regression are in terms of the log odds; that is, the coefficient 1.6946 implies that a one-unit change in gender results in a 1.6946-unit change in the log of the odds. Equation [3] can be expressed in odds by getting rid of the log. This is done by raising e to the power of both sides of the equation.

[4]     p/q = e^(a + bX)

The end result of all the mathematical manipulations is that the odds ratio can be computed by raising e to the power of the logistic coefficient,

[5]     OR = e^b = e^1.694596 = 5.444
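You can verify equation [5] with a one-line data step sketch, using the coefficient from the output above:

data _null_;
  b = 1.694596;            /* logistic coefficient for gender   */
  odds_ratio = exp(b);     /* = 5.444, matching the expb output */
  put odds_ratio=;
run;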

SAS FAQ: Why are my logistic results reversed?

Suppose you run a logistic regression in SAS and the results seem to be the reverse of what you expected.  You might have even run the analysis in another package and found that the signs of the parameter estimates were reversed as compared to your SAS output.  If your outcome variable is coded such that 1 is the event of interest, then you must remember to use the descending option on proc logistic. If you omit the descending option, then SAS will predict the event of 0, and the results will be reversed (e.g., the parameter estimates will have a negative sign instead of a positive sign, and vice versa).

Here is a brief example based on the SAS Class Notes, Analyzing Data. We will take the logistic example from that page and intentionally omit the descending option.

PROC LOGISTIC DATA=hsbstat;
  MODEL honor = sex public read math science;
RUN;

Analysis of Maximum Likelihood Estimates


             Parameter   Standard         Wald        Pr >   Standardized    Odds
Variable DF   Estimate      Error   Chi-Square  Chi-Square       Estimate   Ratio
INTERCPT  1    13.9945     2.1519      42.2917      0.0001      .             .
SEX       1     1.2467     0.4660       7.1562      0.0075      0.342144    3.479
PUBLIC    1    -0.2431     0.5684       0.1829      0.6689     -0.048480    0.784
READ      1    -0.0643     0.0284       5.1485      0.0233     -0.360667    0.938
MATH      1    -0.1221     0.0350      12.1307      0.0005     -0.618577    0.885
SCIENCE   1    -0.0553     0.0328       2.8489      0.0914     -0.300945    0.946

The results do seem odd: higher scores on read or math appear to be associated with being less likely to be in honors classes (honor). If we look at the log file, we see

NOTE: Proc logistic is modeling the probability that honor=0.

One way to change this to model the probability that honor=1 is to specify the descending option on the proc statement. Refer to Technical Report P-229 or the SAS System Help Files for details.

This message is telling us that the results are reversed because we forgot the descending option.  We include the descending option and the results seem more like we would expect.

PROC LOGISTIC DATA=hsbstat DESCENDING;
  MODEL honor = sex public read math science;
RUN;

Analysis of Maximum Likelihood Estimates

             Parameter   Standard         Wald        Pr >   Standardized    Odds
Variable DF   Estimate      Error   Chi-Square  Chi-Square       Estimate   Ratio
INTERCPT  1   -13.9945     2.1519      42.2917      0.0001      .             .
SEX       1    -1.2467     0.4660       7.1562      0.0075     -0.342144    0.287
PUBLIC    1     0.2431     0.5684       0.1829      0.6689      0.048480    1.275
READ      1     0.0643     0.0284       5.1485      0.0233      0.360667    1.066
MATH      1     0.1221     0.0350      12.1307      0.0005      0.618577    1.130
SCIENCE   1     0.0553     0.0328       2.8489      0.0914      0.300945    1.057


As you see, the signs of the parameter estimates from these two analyses are the reverse of each other, and the odds ratios are the reciprocal of each other (e.g., 1/3.479 is .287).
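A quick data step sketch confirming the reciprocal relationship:

data _null_;
  or_sex_nodesc = 3.479;               /* odds ratio for SEX without descending      */
  or_sex_desc = 1 / or_sex_nodesc;     /* = 0.287, the odds ratio with descending on */
  put or_sex_desc=;
run;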

SAS FAQ: How do I do a conditional logit model analysis in SAS 9.1?

PROC LOGISTIC has been improved in SAS 9.1. It does a lot more than just logistic regression on binary outcome variables. On this page, we show two examples of using proc logistic for conditional logit models. For conditional logit models, proc logistic is very easy to use, and it handles all kinds of matching: 1-1 and 1-M matching, and in fact M-N matching.

Example 1: 1-1 Matching

This example is adapted from Chapter 7 of Applied Logistic Regression by Hosmer & Lemeshow (2000). You can download the SAS data file lbwt11.sas7bdat here.

The first 20 observations are listed below. Notice that variable pairid indicates that the observations are paired.

pairid  lbwt  age  lastwt  race  smoke  ptd  ht  ui  race1  race2  race3
     1     0   14     135     1      0    0   0   0      1      0      0
     1     1   14     101     3      1    1   0   0      0      0      1
     2     0   15      98     2      0    0   0   0      0      1      0
     2     1   15     115     3      0    0   0   1      0      0      1
     3     0   16      95     3      0    0   0   0      0      0      1
     3     1   16     130     3      0    0   0   0      0      0      1
     4     0   17     103     3      0    0   0   0      0      0      1
     4     1   17     130     3      1    1   0   1      0      0      1
     5     0   17     122     1      1    0   0   0      1      0      0
     5     1   17     110     1      1    0   0   0      1      0      0
     6     0   17     113     2      0    0   0   0      0      1      0
     6     1   17     120     1      1    0   0   0      1      0      0
     7     0   17     113     2      0    0   0   0      0      1      0
     7     1   17     120     2      0    0   0   0      0      1      0
     8     0   17     119     3      0    0   0   0      0      0      1
     8     1   17     142     2      0    0   1   0      0      1      0
     9     0   18     100     1      1    0   0   0      1      0      0
     9     1   18     148     3      0    0   0   0      0      0      1
    10     0   18      90     1      1    0   0   1      1      0      0
    10     1   18     110     2      1    1   0   0      0      1      0

proc logistic data = lbwt11 descending;
  model lbwt = lastwt smoke race2 race3 ptd ht ui;
  strata pairid;
run;

The LOGISTIC Procedure

Conditional Analysis

Model Information

Data Set                    ATS.LBWT11
Response Variable           lbwt
Number of Response Levels   2
Number of Strata            56
Model                       binary logit
Optimization Technique      Newton-Raphson ridge


low brth wt < 2500g

Number of Observations Read   112
Number of Observations Used   112

Response Profile

Ordered              Total
  Value   lbwt   Frequency

      1      1          56
      2      0          56

Probability modeled is lbwt=1.

Strata Summary

                 lbwt
Response        ------      Number of
 Pattern        1    0         Strata      Frequency

       1        1    1             56            112

Newton-Raphson Ridge Optimization
Without Parameter Scaling

Convergence criterion (GCONV=1E-8) satisfied.

Model Fit Statistics

                 Without          With
Criterion     Covariates    Covariates

AIC               77.632        65.589
SC                77.632        84.618
-2 Log L          77.632        51.589

Testing Global Null Hypothesis: BETA=0

Test Chi-Square DF Pr > ChiSq

Likelihood Ratio      26.0439    7    0.0005
Score                 20.2669    7    0.0050
Wald                  12.7208    7    0.0792

Analysis of Maximum Likelihood Estimates

                           Standard          Wald
Parameter   DF   Estimate     Error    Chi-Square    Pr > ChiSq

lastwt       1    -0.0184    0.0101        3.3229        0.0683
smoke        1     1.4007    0.6278        4.9770        0.0257
race2        1     0.5714    0.6896        0.6864        0.4074
race3        1    -0.0253    0.6992        0.0013        0.9711
ptd          1     1.8080    0.7887        5.2557        0.0219
ht           1     2.3612    1.0861        4.7259        0.0297
ui           1     1.4019    0.6962        4.0554        0.0440

Odds Ratio Estimates

            Point        95% Wald
Effect   Estimate   Confidence Limits

lastwt      0.982      0.963     1.001
smoke       4.058      1.185    13.890
race2       1.771      0.458     6.842
race3       0.975      0.248     3.839
ptd         6.098      1.300    28.609
ht         10.603      1.262    89.115
ui          4.063      1.038    15.901

Example 2: 1-M matching

This example is adapted from Chapter 7 of Applied Logistic Regression by Hosmer & Lemeshow (2000). You can download the SAS data file bbdm13.sas7bdat here.


The first 20 observations are listed below. Notice that variable str indicates that there are four choices for each subject.

str  obs  fndx  chk  agmn   wt  mod  wid  nvmr
  1    1     1    1    13  118   55    0     0
  1    2     0    2    11  175    1    0     0
  1    3     0    2    12  135    1    0     0
  1    4     0    1    11  125   55    0     0
  2    1     1    1    14  118   55    0     0
  2    2     0    2    15  183   55    0     0
  2    3     0    2    11  218   55    0     0
  2    4     0    1    13  192   55    0     0
  3    1     1    1    15  125   55    0     0
  3    2     0    2    14  123   55    0     0
  3    3     0    1    13  140   55    0     0
  3    4     0    1    13  160   55    0     0
  4    1     1    1    14  150   55    0     1
  4    2     0    1    13  130    1    0     0
  4    3     0    2    14  140   55    0     0
  4    4     0    1    16  130   55    0     0
  5    1     1    1    17  150    1    0     0
  5    2     0    2    12  148   55    0     0
  5    3     0    1    13  134   55    0     0
  5    4     0    1    14  138   55    1     0

proc logistic data = bbdm13 descending;
  model fndx = chk agmn wt mod wid nvmr;
  strata str;
run;

The LOGISTIC Procedure

Conditional Analysis

Model Information

Data Set                    ATS.BBDM13
Response Variable           fndx
Number of Response Levels   2
Number of Strata            50
Model                       binary logit
Optimization Technique      Newton-Raphson ridge

final diagnosis

Number of Observations Read   200
Number of Observations Used   200

Response Profile

Ordered              Total
  Value   fndx   Frequency

      1      1          50
      2      0         150

Probability modeled is fndx=1.

Strata Summary

                 fndx
Response        ------      Number of
 Pattern        1    0         Strata      Frequency

       1        1    3             50            200

Newton-Raphson Ridge Optimization
Without Parameter Scaling

Convergence criterion (GCONV=1E-8) satisfied.


Model Fit Statistics

                 Without          With
Criterion     Covariates    Covariates

AIC              138.629       102.430
SC               138.629       122.220
-2 Log L         138.629        90.430

Testing Global Null Hypothesis: BETA=0

Test                Chi-Square   DF   Pr > ChiSq

Likelihood Ratio       48.1998    6       <.0001
Score                  39.9247    6       <.0001
Wald                   25.2218    6       0.0003

Analysis of Maximum Likelihood Estimates

                           Standard          Wald
Parameter   DF   Estimate     Error    Chi-Square    Pr > ChiSq

chk          1    -1.1218    0.4474        6.2862        0.0122
agmn         1     0.3561    0.1292        7.6013        0.0058
wt           1    -0.0284   0.00998        8.0771        0.0045
mod          1    0.00376    0.0120        0.0984        0.7538
wid          1    -0.4916    0.8173        0.3618        0.5475
nvmr         1     1.4722    0.7582        3.7701        0.0522

Odds Ratio Estimates

            Point        95% Wald
Effect   Estimate   Confidence Limits

chk         0.326      0.135     0.783
agmn        1.428      1.108     1.839
wt          0.972      0.953     0.991
mod         1.004      0.980     1.028
wid         0.612      0.123     3.035
nvmr        4.359      0.986    19.264

SAS FAQ: How can I estimate relative risk in SAS using proc genmod for common outcomes in cohort studies?

Credits

This page was developed and written by Karla Lindquist, Senior Statistician in the Division of Geriatrics at UCSF.  We are very grateful to Karla for taking the time to develop this page and giving us permission to post it on our site.

Introduction

Binary outcomes in cohort studies are commonly analyzed by applying a logistic regression model to the data to obtain odds ratios for comparing groups with different sets of characteristics. Although this is often appropriate, there may be situations in which it is more desirable to estimate a relative risk or risk ratio (RR) instead of an odds ratio (OR). Several articles in recent medical and public health literature point out that when the outcome event is common (incidence of 10% or more), it is often more desirable to estimate an RR, since there is an increasing differential between the RR and OR with increasing incidence rates, and there is a tendency for some to interpret ORs as if they are RRs ([1]-[3]). There are some who hold the opinion that the OR should be used even when the outcome is common, however ([4]). Here the purpose is to demonstrate methods for calculating the RR, assuming that it is the appropriate thing to do. There are several options for how to estimate RRs directly in SAS, which have been demonstrated to be reliable in simulated and real data sets of various sizes and outcome incidence rates ([1],[2]). Two of these methods will be demonstrated here using hypothetical data created for this purpose. Both methods use proc genmod. One estimates the RR with a log-binomial regression model, and the other uses a Poisson regression model with a robust error variance.

Example Data: Odds ratio versus relative risk

A hypothetical data set was created to illustrate two methods of estimating relative risks using SAS. The outcome generated is called lenses, to indicate if the hypothetical study participants require corrective lenses by the time they are 30 years old. Assume all participants do not need them at a baseline assessment when they are 10 years old. Assume none of them have had serious head injuries or had brain tumors or other major health problems during the 20 years between assessments. Suppose we wanted to know if requiring corrective lenses is associated with having a gene which causes one to have a lifelong love and craving for carrots (assume not having this gene results in the opposite), and that we screened everyone for this carrot gene at baseline (carrot = 1 if they have it, = 0 if not). We also noted their gender (= 1 if female, = 2 if male), and what latitude of the continental US they lived on the longest (24 to 48 degrees north). All values (N=100) were assigned using a random number generator. The data are in eyestudy.sas7bdat.

Here's a quick description of the variables:

proc means data = eyestudy maxdec = 2;
  var carrot gender latitude lenses;
run;

The MEANS Procedure

Variable     N      Mean   Std Dev   Minimum   Maximum
-------------------------------------------------------
carrot     100      0.51      0.50      0.00      1.00
gender     100      1.48      0.50      1.00      2.00
latitude   100     35.97      7.51     24.00     48.00
lenses     100      0.53      0.50      0.00      1.00
-------------------------------------------------------


We have an overall outcome rate of 53%. So if we want to talk about whether the carrot-loving gene, gender, or latitude is associated with the risk of requiring corrective lenses by the age of 30, then relative risk is a more appropriate measure than the odds ratio. Here is a simple crosstab of carrot and lenses, which will allow us to calculate the unadjusted OR and RR by hand.

proc freq data = eyestudy;
  tables carrot*lenses / nopercent nocol;
run;

Table of carrot by lenses

carrot     lenses

Frequency|
Row Pct  |       0|       1|  Total
---------+--------+--------+
       0 |     17 |     32 |     49
         |  34.69 |  65.31 |
---------+--------+--------+
       1 |     30 |     21 |     51
         |  58.82 |  41.18 |
---------+--------+--------+
Total          47       53      100

It is interesting that fewer people with the carrot-loving gene needed corrective lenses (especially since these are fake data!). The OR and RR for those without the carrot gene versus those with it are:

OR = (32/17)/(21/30) = 2.69
RR = (32/49)/(21/51) = 1.59

We could use either proc logistic or proc genmod to calculate the OR. Since proc genmod will be used to calculate the RR, it will also be used to calculate the OR for comparison purposes (and it gives the same results as proc logistic). Here is the logistic regression with just carrot as the predictor:

proc genmod data = eyestudy descending;
  class carrot;
  model lenses = carrot / dist = binomial link = logit;
  estimate 'Beta' carrot 1 -1 / exp;
run;

The GENMOD Procedure

Model Information

Data Set              EYESTUDY
Distribution          Binomial
Link Function         Logit
Dependent Variable    lenses
Observations Used     100

Class Level Information


Class Levels Values

carrot 2 0 1

Response Profile

Ordered                Total
  Value   lenses   Frequency

      1        1          53
      2        0          47

PROC GENMOD is modeling the probability that lenses='1'.

Parameter Information

Parameter Effect carrot

Prm1   Intercept
Prm2   carrot      0
Prm3   carrot      1

Criteria For Assessing Goodness Of Fit

Criterion DF Value Value/DF

Deviance             98    132.3665     1.3507
Scaled Deviance      98    132.3665     1.3507
Pearson Chi-Square   98    100.0000     1.0204
Scaled Pearson X2    98    100.0000     1.0204
Log Likelihood             -66.1832

Algorithm converged.

Analysis Of Parameter Estimates

                           Standard    Wald 95% Confidence      Chi-
Parameter   DF  Estimate      Error          Limits           Square   Pr > ChiSq

Intercept    1   -0.3567     0.2845    -0.9143     0.2010       1.57       0.2100
carrot 0     1    0.9892     0.4136     0.1786     1.7997       5.72       0.0168
carrot 1     0    0.0000     0.0000     0.0000     0.0000        .          .
Scale        0    1.0000     0.0000     1.0000     1.0000

NOTE: The scale parameter was held fixed.

Contrast Estimate Results

Standard Chi-Label Estimate Error Alpha Confidence Limits Square Pr > ChiSq

Page 79: STAT_in SAS

Beta 0.9892 0.4136 0.05 0.1786 1.7997 5.72 0.0168Exp(Beta) 2.6891 1.1121 0.05 1.1956 6.0481

The estimate statement with the exp option gives us the same OR we calculated by hand above for those without the carrot gene versus those with it. Now this can be contrasted with the two methods of calculating the RR described below.

Relative risk estimation by log-binomial regression

With a very minor modification of the statements used above for the logistic regression, a log-binomial model can be run to get the RR instead of the OR. All that needs to be changed is the link function between the covariate(s) and outcome. Here it is specified as log instead of logit:

proc genmod data = eyestudy descending;
  class carrot;
  model lenses = carrot / dist = binomial link = log;
  estimate 'Beta' carrot 1 -1 / exp;
run;

The GENMOD Procedure

Model Information

Data Set              EYESTUDY
Distribution          Binomial
Link Function         Log
Dependent Variable    lenses
Observations Used     100

Class Level Information

Class Levels Values

carrot 2 0 1

Response Profile

Ordered                Total
  Value   lenses   Frequency

      1        1          53
      2        0          47

PROC GENMOD is modeling the probability that lenses='1'.

Parameter Information

Parameter Effect carrot

Prm1   Intercept
Prm2   carrot      0
Prm3   carrot      1


Criteria For Assessing Goodness Of Fit

Criterion DF Value Value/DF

Deviance             98    132.3665     1.3507
Scaled Deviance      98    132.3665     1.3507
Pearson Chi-Square   98    100.0000     1.0204
Scaled Pearson X2    98    100.0000     1.0204
Log Likelihood             -66.1832

Algorithm converged.

Analysis Of Parameter Estimates

                           Standard    Wald 95% Confidence      Chi-
Parameter   DF  Estimate      Error          Limits           Square   Pr > ChiSq

Intercept    1   -0.8873     0.1674    -1.2153    -0.5593      28.11       <.0001
carrot 0     1    0.4612     0.1971     0.0749     0.8476       5.48       0.0193
carrot 1     0    0.0000     0.0000     0.0000     0.0000        .          .
Scale        0    1.0000     0.0000     1.0000     1.0000

NOTE: The scale parameter was held fixed.

Contrast Estimate Results

                     Standard                                  Chi-
Label     Estimate      Error   Alpha   Confidence Limits    Square   Pr > ChiSq

Beta        0.4612     0.1971    0.05    0.0749    0.8476      5.48       0.0193
Exp(Beta)   1.5860     0.3126    0.05    1.0778    2.3339

Now the exp option on the estimate statement gives us the estimated RR instead of the OR, and it also matches what was calculated by hand above for the RR. Notice that the standard error (SE) for the beta estimate calculated here is much smaller than that calculated in the logistic regression above (SE = 0.414), but so is the estimate itself (logistic regression beta estimate = 0.989), so the significance level is very similar (logistic regression p = 0.017) in this case. One of the criticisms of using the log-binomial model for the RR is that it produces confidence intervals that are narrower than they should be, and another is that there can be convergence problems ([1],[2]). This is why the second approach is also presented here.

Relative risk estimation by Poisson regression with robust error variance


Zou ([2]) suggests using a “modified Poisson” approach to estimate the relative risk and confidence intervals by using robust error variances. Using a Poisson model without robust error variances will result in a confidence interval that is too wide. The robust error variances can be estimated by using the repeated statement and the subject identifier (here id), even if there is only one observation per subject, as Zou cleverly points out. Here is how it is done:

proc genmod data = eyestudy;
  class carrot id;
  model lenses = carrot / dist = poisson link = log;
  repeated subject = id / type = unstr;
  estimate 'Beta' carrot 1 -1 / exp;
run;

Notice that id, the individual subject identifier, has been added to the class statement and is also on the repeated statement (with an unstructured correlation matrix), telling proc genmod to calculate the robust errors. Also notice that the distribution has been changed to Poisson, but the link function remains log.

The GENMOD Procedure

Model Information

Data Set              EYESTUDY
Distribution          Poisson
Link Function         Log
Dependent Variable    lenses
Observations Used     100

Class Level Information

Class Levels Values

carrot     2   0 1
id       100   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
               24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43
               44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
               64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83
               84 85 86 87 ...

Parameter Information

Parameter Effect carrot

Prm1   Intercept
Prm2   carrot      0
Prm3   carrot      1

Criteria For Assessing Goodness Of Fit

Criterion            DF       Value    Value/DF

Deviance             98     64.5361      0.6585
Scaled Deviance      98     64.5361      0.6585
Pearson Chi-Square   98     47.0000      0.4796
Scaled Pearson X2    98     47.0000      0.4796
Log Likelihood             -85.2681

Algorithm converged.

Analysis Of Initial Parameter Estimates

                           Standard    Wald 95% Confidence      Chi-
Parameter   DF  Estimate      Error          Limits           Square   Pr > ChiSq

Intercept    1   -0.8873     0.2182    -1.3150    -0.4596      16.53       <.0001
carrot 0     1    0.4612     0.2808    -0.0892     1.0116       2.70       0.1005
carrot 1     0    0.0000     0.0000     0.0000     0.0000        .          .
Scale        0    1.0000     0.0000     1.0000     1.0000

NOTE: The scale parameter was held fixed.

GEE Model Information

Correlation Structure           Unstructured
Subject Effect                  id (100 levels)
Number of Clusters              100
Correlation Matrix Dimension    1
Maximum Cluster Size            1
Minimum Cluster Size            1

Algorithm converged.

Analysis Of GEE Parameter Estimates
Empirical Standard Error Estimates

                        Standard    95% Confidence
Parameter   Estimate       Error        Limits            Z    Pr > |Z|

Intercept    -0.8873      0.1674   -1.2153   -0.5593   -5.30     <.0001
carrot 0      0.4612      0.1971    0.0749    0.8476    2.34     0.0193
carrot 1      0.0000      0.0000    0.0000    0.0000      .       .

Contrast Estimate Results

                     Standard                                  Chi-
Label     Estimate      Error   Alpha   Confidence Limits    Square   Pr > ChiSq

Beta        0.4612     0.1971    0.05    0.0749    0.8476      5.48       0.0193
Exp(Beta)   1.5860     0.3126    0.05    1.0778    2.3339


Again, the exp option on the estimate statement gives us the estimated RR, and it matches exactly what was calculated by the log-binomial method. In this case, the SE for the beta estimate and the p-value are also exactly the same as in the log-binomial model. This may not always be the case, but they should be similar. The SE calculated without the repeated statement (i.e., not using robust error variances) is 0.281, and the p-value is 0.101, so the robust method is quite different.
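For reference, the non-robust fit mentioned here is simply the same model with the repeated statement removed; a sketch:

proc genmod data = eyestudy;
  class carrot;
  model lenses = carrot / dist = poisson link = log;
  /* no repeated statement: model-based (non-robust) errors; this yields the
     SE of 0.281 and p-value of 0.101 for the carrot effect noted above */
  estimate 'Beta' carrot 1 -1 / exp;
run;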

Adjusting the relative risk for continuous or categorical covariates

Adjusting the RR for other predictors or potential confounders is simply done by adding them to the model statement as you would in any other procedure. Here gender and latitude will be added to the model:

proc genmod data = eyestudy;
  class carrot gender id;
  model lenses = carrot gender latitude / dist = poisson link = log;
  repeated subject = id / type = unstr;
  estimate 'Beta Carrot'   carrot   1 -1 / exp;
  estimate 'Beta Gender'   gender   1 -1 / exp;
  estimate 'Beta Latitude' latitude 1    / exp;
run;

The GENMOD Procedure

Model Information

Data Set              EYESTUDY
Distribution          Poisson
Link Function         Log
Dependent Variable    lenses
Observations Used     100

Class Level Information

Class Levels Values

carrot     2   0 1
gender     2   1 2
id       100   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
               24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43
               44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
               64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83
               84 85 86 87 ...

Parameter Information

Parameter Effect carrot gender

Prm1   Intercept
Prm2   carrot   0
Prm3   carrot   1
Prm4   gender   1
Prm5   gender   2
Prm6   latitude

Criteria For Assessing Goodness Of Fit

Criterion DF Value Value/DF

Deviance             96   63.7618    0.6642
Scaled Deviance      96   63.7618    0.6642
Pearson Chi-Square   96   46.7434    0.4869
Scaled Pearson X2    96   46.7434    0.4869
Log Likelihood            -84.8809

Algorithm converged.

Analysis Of Initial Parameter Estimates

Parameter   DF   Estimate   Standard Error   Wald 95% Confidence Limits   Chi-Square   Pr > ChiSq

Intercept   1    -0.6521    0.6982           -2.0206   0.7163             0.87         0.3503
carrot 0    1     0.4832    0.2831           -0.0716   1.0381             2.91         0.0878
carrot 1    0     0.0000    0.0000            0.0000   0.0000             .            .
gender 1    1     0.2052    0.2781           -0.3398   0.7502             0.54         0.4605
gender 2    0     0.0000    0.0000            0.0000   0.0000             .            .
latitude    1    -0.0100    0.0190           -0.0472   0.0272             0.28         0.5980
Scale       0     1.0000    0.0000            1.0000   1.0000

NOTE: The scale parameter was held fixed.

GEE Model Information

Correlation Structure          Unstructured
Subject Effect                 id (100 levels)
Number of Clusters             100
Correlation Matrix Dimension   1
Maximum Cluster Size           1
Minimum Cluster Size           1

Algorithm converged.

Analysis Of GEE Parameter Estimates
Empirical Standard Error Estimates

Parameter   Estimate   Standard Error   95% Confidence Limits   Z       Pr > |Z|

Intercept   -0.6521    0.4904           -1.6134    0.3091       -1.33   0.1836
carrot 0     0.4832    0.1954            0.1003    0.8662        2.47   0.0134
carrot 1     0.0000    0.0000            0.0000    0.0000        .      .
gender 1     0.2052    0.1848           -0.1570    0.5674        1.11   0.2669
gender 2     0.0000    0.0000            0.0000    0.0000        .      .
latitude    -0.0100    0.0127           -0.0350    0.0150       -0.79   0.4324

Contrast Estimate Results

Label                Estimate   Standard Error   Alpha   Confidence Limits    Chi-Square   Pr > ChiSq

Beta Carrot           0.4832    0.1954           0.05     0.1003    0.8662    6.12         0.0134
Exp(Beta Carrot)      1.6213    0.3168           0.05     1.1055    2.3777
Beta Gender           0.2052    0.1848           0.05    -0.1570    0.5674    1.23         0.2669
Exp(Beta Gender)      1.2278    0.2269           0.05     0.8547    1.7637
Beta Latitude        -0.0100    0.0127           0.05    -0.0350    0.0150    0.62         0.4324
Exp(Beta Latitude)    0.9900    0.0126           0.05     0.9656    1.0151

We have also requested the RRs for gender and latitude in the estimate statement. In this case, adjusting for them does not reduce the association between having the carrot-loving gene and risk of needing corrective lenses by age 30.

One should always pay attention to goodness of fit statistics and perform other diagnostic tests. Refer to Categorical Data Analysis Using the SAS System, by M. Stokes, C. Davis and G. Koch, for standard methods of checking whichever type of model you use.

References

1. McNutt LA, Wu C, Xue X, Hafner JP. Estimating the relative risk in cohort studies and clinical trials of common outcomes. Am J Epidemiol 2003;157(10):940-3.
2. Zou G. A modified Poisson regression approach to prospective studies with binary data. Am J Epidemiol 2004;159(7):702-6.
3. Greenland S. Model-based estimation of relative risks and other epidemiologic measures in studies of common outcomes and in case-control studies. Am J Epidemiol 2004;160:301-5.
4. Cook TD. Up with odds ratios! A case for odds ratios when outcomes are common. Acad Emerg Med 2002;9:1430-4.
5. Spiegelman D, Hertzmark E. Easy SAS calculations for risk or prevalence ratios and differences. Am J Epidemiol 2005;162:199-205.

SAS FAQ: In PROC LOGISTIC why aren't the coefficients consistent with the odds ratios?

We start with a logistic regression model predicting the binary variable hiread from the variables write and ses. The variable write is continuous, and the variable ses is categorical with three categories (1 = low, 2 = middle, 3 = high). In the code below, the class statement is used to specify that ses is a categorical variable and should be treated as such.

proc logistic data = mydir.hsb2m descending;
  class ses;
  model hiread = write ses;
run;

The "Class Level Information" section of the SAS output shows the coding used by SAS in estimating the model. This coding scheme is what is known as effect coding. (For more information see our FAQ page What is effect coding?)

Class Level Information

Class   Value   Design Variables

SES     1        1    0
        2        0    1
        3       -1   -1

Analysis of Maximum Likelihood Estimates

Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq

Intercept   1    -8.1220    1.3216           37.7697           <.0001
WRITE       1     0.1438    0.0236           37.0981           <.0001
SES 1       1    -0.4856    0.2823            2.9594           0.0854
SES 2       1     0.0508    0.2290            0.0493           0.8243

Further down in the output, we find the table containing the rest of the coefficient estimates. For the variable ses there are two coefficients, one for each of the effect-coded variables in the model (SES 1 and SES 2). The coefficients are -0.4856 and 0.0508. If we exponentiate these coefficients we get exp(-0.4856) = 0.61533 and exp(0.0508) = 1.0521 for SES 1 and SES 2 respectively, but the odds ratios listed in the table with the heading "Odds Ratio Estimates" are 0.398 and 0.681. Why aren't the odds ratios consistent with the coefficients? The answer is that SAS uses effect coding for the coefficients, but uses dummy variable coding when calculating the odds ratios. Because they are not making the same comparisons, it is possible for the coefficients in the table of estimates to be non-significant while the confidence interval around the odds ratios does not include one (or vice versa). (For more information see our FAQ What is dummy coding?)

Odds Ratio Estimates

Effect       Point Estimate   95% Wald Confidence Limits

WRITE        1.155            1.102   1.209
SES 1 vs 3   0.398            0.153   1.040
SES 2 vs 3   0.681            0.313   1.485
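To see the mismatch numerically, we can exponentiate both sets of coefficients in a short data step (all values are taken from the output on this page):

data _null_;
  /* exponentiated effect-coded coefficients */
  exp_ses1 = exp(-0.4856);   /* 0.615, not the 0.398 above */
  exp_ses2 = exp(0.0508);    /* 1.052, not the 0.681 above */
  /* exponentiated dummy-coded coefficients (from the model below) */
  or_ses1  = exp(-0.9204);   /* 0.398, matches SES 1 vs 3 */
  or_ses2  = exp(-0.3839);   /* 0.681, matches SES 2 vs 3 */
  put exp_ses1= exp_ses2= or_ses1= or_ses2=;
run;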

If we run the same analysis but use dummy variable coding for both the parameter estimates and the odds ratios, we can get coefficients that are consistent with the odds ratios. There are several methods that can be used to estimate a model using dummy coding for nominal level variables. In the first example below we add (ref='3') / param = ref to the class statement. This instructs SAS that for the variable ses the desired reference category is 3 (we could also use category 1 or 2 as the reference group), and the param = ref option requests reference (dummy) coding for the parameters.

proc logistic data = mydir.hsb2m descending;
  class ses (ref='3') / param = ref;
  model hiread = write ses;
run;

Looking at the output (below), the coding system shown in the "Class Level Information" section of the output is for two dummy variables, one for category 1 versus 3, and one for category 2 versus 3. Note two other things in the output below. First, that the coefficients in this model are consistent with the odds ratios. That is, exp(-0.9204) = 0.398 and  exp(-0.3839) = 0.681. The second thing to notice is that the odds ratios from this model are the same as the odds ratios above, only the coefficient estimates have changed. This is expected, since, SAS always uses dummy coding to compute odds ratios, all that has changed is how the coefficients are modeled.

Class Level Information

Class   Value   Design Variables

SES     1       1   0
        2       0   1
        3       0   0

Analysis of Maximum Likelihood Estimates

Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq

Intercept   1    -7.6872    1.3697           31.4984           <.0001
WRITE       1     0.1438    0.0236           37.0981           <.0001
SES 1       1    -0.9204    0.4897            3.5328           0.0602
SES 2       1    -0.3839    0.3975            0.9330           0.3341

Odds Ratio Estimates

Effect       Point Estimate   95% Wald Confidence Limits

WRITE        1.155            1.102   1.209
SES 1 vs 3   0.398            0.153   1.040
SES 2 vs 3   0.681            0.313   1.485

Another way to use dummy coding is to create the dummy variables manually and use them in the model statement, bypassing the class statement entirely. The code below does this: first we create two dummy variables, ses_d1 and ses_d2, which code for category 1 versus 3 and category 2 versus 3 respectively. Then we include ses_d1 and ses_d2 in the model statement. There is no need for the class statement here. The output generated by this code will not include the "Class Level Information" since the class statement was not used; however, the output will otherwise be identical to the last model.

data mydir.hsb2m;
  set 'D:\data\hsb2';
  if ses = 1 then ses_d1 = 1;
  if ses = 2 then ses_d1 = 0;
  if ses = 3 then ses_d1 = 0;

  if ses = 1 then ses_d2 = 0;
  if ses = 2 then ses_d2 = 1;
  if ses = 3 then ses_d2 = 0;
run;

proc logistic data = mydir.hsb2m descending;
  model hiread = write ses_d1 ses_d2;
run;

As a final exercise, we can run the model using the effect coded variables we created, and check that the coefficients from this model match the coefficients from the first model. This will confirm that SAS is in fact using effect coding in the first model. The first step is to create the variables for the effect coding; below we have called them ses_e1 and ses_e2, for the differences between category 1 and the grand mean (when all other covariates equal zero) and between category 2 and the grand mean, respectively. Then we run the model with ses_e1 and ses_e2 in the model statement, and the class statement omitted entirely (since we have done the work normally done by the class statement).

data mydir.hsb2m;
  set 'D:\data\hsb2';
  if ses = 1 then ses_e1 = 1;
  if ses = 2 then ses_e1 = 0;
  if ses = 3 then ses_e1 = -1;

  if ses = 1 then ses_e2 = 0;
  if ses = 2 then ses_e2 = 1;
  if ses = 3 then ses_e2 = -1;
run;

proc logistic data = mydir.hsb2m descending;
  model hiread = write ses_e1 ses_e2;
run;

Comparing the table of coefficients below to the coefficients in the very first table of estimates, we see that the coefficients are in fact the same. This confirms that the model in the first table was estimated using effect coded variables to estimate the effect of ses. Note that the odds ratios below do not match the odds ratios in the first model, because when we use the class statement, SAS uses dummy coding to generate the odds ratios, while in this case, the odds ratios are computed directly from the estimated coefficients.

Analysis of Maximum Likelihood Estimates

Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq

Intercept   1    -8.1220    1.3216           37.7697           <.0001
WRITE       1     0.1438    0.0236           37.0981           <.0001
ses_e1      1    -0.4856    0.2823            2.9594           0.0854
ses_e2      1     0.0508    0.2290            0.0493           0.8243

Odds Ratio Estimates

Effect   Point Estimate   95% Wald Confidence Limits

WRITE    1.155            1.102   1.209
ses_e1   0.615            0.354   1.070
ses_e2   1.052            0.672   1.648

SAS FAQ: How can I run simple linear and nonlinear models using nlmixed?

This FAQ page will show how a number of simple linear and nonlinear models can be coded using SAS proc nlmixed. What is meant by "simple" here is that all of the models are fixed effects only, with no random effects. All of the models shown can be estimated using specific procedures in SAS; for example, the binary logistic model can be estimated using proc logistic or proc genmod, and in fact it is much easier to run these models using the specific procedures. The purpose of this page is to allow you to become acquainted with these simpler nlmixed models, to see how they work and how they are parameterized. With that knowledge you will have a good foundation for building more complex nonlinear mixed models.

All of the models use the same dataset, hsbdemo.sas7bdat, and the same two predictor variables, read and female. Models will use different response variables depending upon the type of response variable that is appropriate. For each of the models we will give only partial output, primarily estimates, standard errors, Wald tests and p-values, so that you will be able to compare the results with specific procedures in SAS.

A short note about starting values: you give starting values using the parms statement. Some models are very sensitive to starting values and will not converge unless given good values. Other models are very tolerant and will work properly with starting values for all parameters set to zero. There are even models that don't require you to set starting values at all, i.e., no parms statement; they get starting values automatically from the nlmixed procedure.

Ordinary least squares regression

We begin with an ordinary least squares regression predicting write from read and female. The setup for nlmixed is very straightforward, with xb being the linear predictor with parameters b0 (the intercept), b1 and b2 as the regression coefficients. There is one additional parameter, s2e, which captures the residual variability and would be equivalent to the mean square error in a traditional OLS regression. By using nlmixed we will obtain an iterated maximum likelihood solution for the model.

proc nlmixed data='D:\data\hsbdemo.sas7bdat';
  xb = b0 + b1*read + b2*female;
  model write ~ normal(xb,s2e);
run;

/* partial output */
Parameter Estimates

Parameter   Estimate   Standard Error   DF    t Value   Pr > |t|   Alpha   Lower     Upper     Gradient
b0          20.2284    2.6933           200    7.51     <.0001     0.05    14.9174   25.5393   -1.76E-6
b1          0.5659     0.04901          200   11.55     <.0001     0.05    0.4692    0.6625    -0.00011
b2          5.4869     1.0066           200    5.45     <.0001     0.05    3.5019    7.4719    -1.04E-6
s2e         50.1128    5.0113           200   10.00     <.0001     0.05    40.2310   59.9945   1.12E-8

Thus, our model looks like this, yhat = 20.2284 + .5659*read + 5.4869*female.
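Since this is just OLS, the fixed-effect estimates can be cross-checked with proc reg; a minimal sketch. (Note that nlmixed's maximum likelihood estimate of s2e divides the error sum of squares by n, so it will differ slightly from proc reg's mean square error, which divides by n minus the number of coefficients.)

proc reg data='D:\data\hsbdemo.sas7bdat';
  /* same model: write regressed on read and female */
  model write = read female;
run;
quit;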

Binary logistic regression

Our second model is a binary logistic regression predicting honors from read and female. The variable honors means that the student has been selected to participate in the honors English program. We will estimate this model two different ways; first, computing the expected probability and using the binary distribution as shown below.

proc nlmixed data='D:\data\hsbdemo.sas7bdat';
  parms b0=5 b1=0 b2=0;
  xb = b0 + b1*read + b2*female;
  prob = exp(xb)/(1+exp(xb));
  model honors ~ binary(prob);
run;

/* partial output */
Parameter Estimates

Parameter   Estimate   Standard Error   DF    t Value   Pr > |t|   Alpha   Lower      Upper     Gradient
b0          -9.6034    1.4264           200   -6.73     <.0001     0.05    -12.4162   -6.7906   -2.71E-9
b1          0.1444     0.02333          200    6.19     <.0001     0.05    0.09835    0.1904    -1.16E-6
b2          1.1209     0.4081           200    2.75     0.0066     0.05    0.3162     1.9257    4.077E-8

The second approach is to compute a likelihood from the probability, take the log of the likelihood, and then compute a general log-likelihood function using general.

proc nlmixed data='D:\data\hsbdemo.sas7bdat';
  parms b0=0 b1=0 b2=0;
  xb = b0 + b1*read + b2*female;
  prob = exp(xb)/(1+exp(xb));
  liklhd = (prob**honors)*((1-prob)**(1-honors));
  ll = log(liklhd);
  model honors ~ general(ll);
run;

[partial output]
Parameter Estimates

Parameter   Estimate   Standard Error   DF    t Value   Pr > |t|   Alpha   Lower      Upper     Gradient
b0          -9.6034    1.4264           200   -6.73     <.0001     0.05    -12.4162   -6.7906   2.854E-7
b1          0.1444     0.02333          200    6.19     <.0001     0.05    0.09835    0.1904    0.000016
b2          1.1209     0.4081           200    2.75     0.0066     0.05    0.3162     1.9257    -8.69E-8

Probit regression

For the probit regression model we will use the same response variable, honors, as in the binary logit model above.

proc nlmixed data='D:\data\hsbdemo.sas7bdat';
  parms b0=0 b1=0 b2=0;
  xb = b0 + b1*read + b2*female;
  prob = probnorm(xb);
  if honors=0 then liklhd = 1-prob;
  else liklhd = prob;
  ll = log(liklhd);
  model honors ~ general(ll);
run;

[partial output]
Parameter Estimates

Parameter   Estimate   Standard Error   DF    t Value   Pr > |t|   Alpha   Lower     Upper     Gradient
b0          -5.6720    0.7798           200   -7.27     <.0001     0.05    -7.2098   -4.1343   2.66E-6
b1          0.08560    0.01301          200    6.58     <.0001     0.05    0.05996   0.1113    0.000128
b2          0.6340     0.2301           200    2.76     0.0064     0.05    0.1803    1.0877    1.713E-6
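As a cross-check, the same probit model can be fit with proc logistic by adding link = probit to the model statement; a minimal sketch:

proc logistic data='D:\data\hsbdemo.sas7bdat' descending;
  /* descending models P(honors = 1), matching the nlmixed setup */
  model honors = read female / link = probit;
run;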

Ordered logistic regression

We will use ses as the response variable for the ordered logistic regression. The variable ses is ordered with values 1, 2 and 3. The model given below estimates a proportional odds ordered logistic model. Some statistical software refers to the parameters a1 and a2 as cut points or thresholds, while other packages parameterize the model with intercepts.

proc nlmixed data='D:\data\hsbdemo.sas7bdat';
  parms b1=0 b2=0 a1=1 a2=1;
  xb = b1*read + b2*female;
  if ses=1 then liklhd = 1/(1+exp(-a1+xb));
  if ses=2 then liklhd = 1/(1+exp(-a2+xb)) - 1/(1+exp(-a1+xb));
  if ses=3 then liklhd = 1 - 1/(1+exp(-a2+xb));
  ll = log(liklhd);
  model ses ~ general(ll);
run;

[partial output]
Parameter Estimates

Parameter   Estimate   Standard Error   DF    t Value   Pr > |t|   Alpha   Lower     Upper     Gradient
b1          0.05714    0.01386          200    4.12     <.0001     0.05    0.02980   0.08448   -0.00637
b2          -0.4195    0.2715           200   -1.55     0.1239     0.05    -0.9549   0.1159    -0.00004
a1          1.4774     0.7343           200    2.01     0.0456     0.05    0.02942   2.9254    0.000131
a2          3.7306     0.7791           200    4.79     <.0001     0.05    2.1942    5.2669    -4.58E-
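For comparison, proc logistic fits the same proportional odds model; a minimal sketch. Because proc logistic models the cumulative probability of the lower-numbered categories with a +xb rather than -xb parameterization, its slope estimates should be the negatives of b1 and b2 above:

proc logistic data='D:\data\hsbdemo.sas7bdat';
  /* default cumulative logit; slopes are -b1 and -b2
     relative to the nlmixed parameterization above */
  model ses = read female;
run;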

Generalized ordered logistic regression

The generalized ordered logit model estimates separate intercepts and coefficients for each equation and therefore does not make the proportional odds assumption. In this example we use the same variables as in the ordered logistic regression above; however, we have to estimate several more parameters than for that model.

proc nlmixed data='D:\data\hsbdemo.sas7bdat';
  parms b01=-1 b02=-2 b11=0 b21=0 b12=0 b22=0;
  xb1 = b01 + b11*read + b12*female;
  xb2 = b02 + b21*read + b22*female;
  if ses=1 then liklhd = 1/(1+exp(xb1));
  if ses=2 then liklhd = 1/(1+exp(xb2)) - 1/(1+exp(xb1));
  if ses=3 then liklhd = 1 - 1/(1+exp(xb2));
  ll = log(liklhd);
  model ses ~ general(ll);
  estimate 'b11-b12' b01-b12;
  estimate 'b21-b22' b21-b22;
run;

[partial output]
Parameter Estimates

Parameter   Estimate   Standard Error   DF    t Value   Pr > |t|   Alpha   Lower     Upper      Gradient
b01         -1.1132    0.9807           200   -1.14     0.2577     0.05    -3.0470   0.8206     0.000022
b02         -3.9737    0.9294           200   -4.28     <.0001     0.05    -5.8064   -2.1410    0.000032
b11         0.05335    0.01887          200    2.83     0.0052     0.05    0.01614   0.09057    0.000953
b21         0.05963    0.01636          200    3.64     0.0003     0.05    0.02736   0.09189    0.001598
b12         -0.7005    0.3591           200   -1.95     0.0525     0.05    -1.4085   0.007610   0.000026
b22         -0.2037    0.3232           200   -0.63     0.5292     0.05    -0.8410   0.4336     0.00001

Additional Estimates

Label     Estimate   Standard Error   DF    t Value   Pr > |t|   Alpha   Lower     Upper
b11-b12   -0.4127    1.1275           200   -0.37     0.7147     0.05    -2.6361   1.8106
b21-b22   0.2633     0.3229           200    0.82     0.4158     0.05    -0.3735   0.900

Traditional statistical procedures will often organize the output by equation. The example output below shows one way the output may be displayed.

Parameter   Estimate   Standard Error   DF    t Value   Pr > |t|
equation 1
b01         -1.1132    0.9807           200   -1.14     0.2577
b11         0.05335    0.01887          200    2.83     0.0052
b12         -0.7005    0.3591           200   -1.95     0.0525
equation 2
b02         -3.9737    0.9294           200   -4.28     <.0001
b21         0.05963    0.01636          200    3.64     0.0003
b22         -0.2037    0.3232           200   -0.63     0.5292

Multinomial logistic regression

For the multinomial logistic regression, prog (program type) is the response variable. If there are k levels for the response variable there will be k-1 equations in the multinomial logistic regression model. Since there are three levels of prog there will be two equations in our model. The first equation is for prog=1 and the second equation is for prog=3; thus, prog=2 is our reference or base category. Each of the equations will have an intercept and coefficients for read and female. The two intercepts are b01 and b03, the coefficients for read are b11 and b31, and the coefficients for female are b12 and b32.

proc nlmixed data='D:\data\hsbdemo.sas7bdat';
  parms b01=0 b03=0 b11=0 b31=0 b12=0 b32=0;
  xb1 = b01 + b11*read + b12*female;
  xb3 = b03 + b31*read + b32*female;
  expxb1 = exp(xb1);
  expxb2 = 1;
  expxb3 = exp(xb3);
  den = expxb1 + expxb2 + expxb3;
  if prog=1 then liklhd = expxb1/den;
  if prog=2 then liklhd = expxb2/den;
  if prog=3 then liklhd = expxb3/den;
  ll = log(liklhd);
  model prog ~ general(ll);
run;

[partial output]
Parameter Estimates

Parameter   Estimate   Standard Error   DF    t Value   Pr > |t|   Alpha   Lower     Upper      Gradient
b01         3.0193     1.0963           200    2.75     0.0064     0.05    0.8575    5.1810     3.695E-6
b03         5.3382     1.1564           200    4.62     <.0001     0.05    3.0579    7.6185     -5.42E-6
b11         -0.07121   0.02021          200   -3.52     0.0005     0.05    -0.1111   -0.03137   0.000099
b31         -0.1173    0.02244          200   -5.23     <.0001     0.05    -0.1615   -0.07304   -0.00024
b12         -0.1835    0.3727           200   -0.49     0.6230     0.05    -0.9184   0.5514     5.62E-6
b32         -0.1938    0.3801           200   -0.51     0.6106     0.05    -0.9433   0.5556     -6.56E-7

As with the generalized ordered logistic regression above, you will often see the output for multinomial logistic regression from a traditional statistical procedure organized by groups, similar to what is shown below.

Parameter   Estimate   Standard Error   DF    t Value   Pr > |t|
prog=1
b01         3.0193     1.0963           200    2.75     0.0064
b11         -0.07121   0.02021          200   -3.52     0.0005
b12         -0.1835    0.3727           200   -0.49     0.6230
prog=3
b03         5.3382     1.1564           200    4.62     <.0001
b31         -0.1173    0.02244          200   -5.23     <.0001
b32         -0.1938    0.3801           200   -0.51     0.6106
prog=2 is the base category
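For comparison, the same multinomial model can be fit with proc logistic using the generalized logit link; a minimal sketch, with prog=2 set as the reference category to match the parameterization above:

proc logistic data='D:\data\hsbdemo.sas7bdat';
  /* ref='2' makes prog=2 the base category, as in the nlmixed model */
  model prog(ref='2') = read female / link = glogit;
run;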

Poisson regression

Our next example is a count model using a Poisson distribution. The response variable for this model is awards, which is a count of the number of awards received by a student during high school. As with the binary logistic model above, there are two ways to parameterize this model; the first method uses the poisson distribution option.

proc nlmixed data='D:\data\hsbdemo.sas7bdat';
  parms b0=0 b1=0 b2=0;
  xb = b0 + b1*read + b2*female;
  mu = exp(xb);
  model awards ~ poisson(mu);
run;

[partial output]
Parameter Estimates

Parameter   Estimate   Standard Error   DF    t Value   Pr > |t|   Alpha   Lower     Upper     Gradient
b0          -3.0922    0.3325           200   -9.30     <.0001     0.05    -3.7479   -2.4364   -0.00012
b1          0.06009    0.005404         200   11.12     <.0001     0.05    0.04944   0.07075   -0.01074
b2          0.4690     0.1142           200    4.11     <.0001     0.05    0.2438    0.6941    -0.00015

The second method for parameterizing this model is to compute the log-likelihood and use the general log-likelihood function. The two approaches yield the same results.

proc nlmixed data='D:\data\hsbdemo.sas7bdat';
  parms b0=0 b1=0 b2=0;
  xb = b0 + b1*read + b2*female;
  mu = exp(xb);
  ll = awards*log(mu) - mu - lgamma(awards+1);
  model awards ~ general(ll);
run;

[partial output]
Parameter Estimates

Parameter   Estimate   Standard Error   DF    t Value   Pr > |t|   Alpha   Lower     Upper     Gradient
b0          -3.0922    0.3325           200   -9.30     <.0001     0.05    -3.7479   -2.4364   -0.00012
b1          0.06009    0.005404         200   11.12     <.0001     0.05    0.04944   0.07075   -0.01074
b2          0.4690     0.1142           200    4.11     <.0001     0.05    0.2438    0.6941    -0.00015
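The same model can also be cross-checked with proc genmod; a minimal sketch:

proc genmod data='D:\data\hsbdemo.sas7bdat';
  /* Poisson regression of awards on read and female */
  model awards = read female / dist = poisson link = log;
run;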

Negative binomial regression

Our final example is a negative binomial regression. We will use the same response variable, awards, as was used in the Poisson example. One way of conceptualizing a negative binomial model is to think of it as a Poisson model with overdispersion, that is, excess variance. In a true Poisson model the mean and the variance are equal. However, variables distributed as a negative binomial have a variance that is greater than the mean. The parameterization of this model is similar to that of the Poisson with one additional parameter, alpha, which measures the degree of overdispersion.

proc nlmixed data='D:\data\hsbdemo.sas7bdat';
  parms b0=0 b1=0 b2=0 alpha=.1;
  xb = b0 + b1*read + b2*female;
  mu = exp(xb);
  m = 1/alpha;
  ll = lgamma(awards+m) - lgamma(awards+1) - lgamma(m)
       + awards*log(alpha*mu) - (awards+m)*log(1+alpha*mu);
  model awards ~ general(ll);
run;

[partial output]
Parameter Estimates

Parameter   Estimate   Standard Error   DF    t Value   Pr > |t|   Alpha   Lower     Upper    Gradient
b0          -3.3266    0.4111           200   -8.09     <.0001     0.05    -4.1372   -2.5160  0.000113
b1          0.06392    0.006809         200    9.39     <.0001     0.05    0.05050   0.07735  0.010318
b2          0.5065     0.1354           200    3.74     0.0002     0.05    0.2396    0.7735   0.000062
alpha       0.1828     0.08455          200    2.16     0.0318     0.05    0.01606   0.3495   -0.00016
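As a cross-check, proc genmod fits the same model with dist = negbin; a minimal sketch. genmod reports the overdispersion estimate as "Dispersion", which corresponds to alpha in the parameterization above:

proc genmod data='D:\data\hsbdemo.sas7bdat';
  /* the dispersion estimate should be close to alpha = 0.1828 above */
  model awards = read female / dist = negbin link = log;
run;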

SAS FAQ: From an OLS model to full mixed models using proc nlmixed

In order to help show the relationships among OLS, random intercept, and random slope models, this page shows a series of models, each of which builds on the previous models. Model 1 is a standard ordinary least squares (OLS) regression model. Model 2 adds a random intercept to more appropriately model clustered (multi-level) data. Model 3 adds a random slope. Model 4 adds the covariance between the random intercept and slope. Finally, model 5 includes a cross-level interaction.

We use proc nlmixed to fit all of these models because its syntax structure, and the progression across the models, allows us to clearly demonstrate the differences, and similarities, between these models. SAS proc nlmixed is a highly flexible procedure that can be used to run a large variety of models. We do not, however, intend to suggest that you should run these models using nlmixed. In many cases it would be easier to run the first model in proc reg and the subsequent models in proc mixed. An additional advantage of using mixed over nlmixed is that mixed allows the use of both maximum likelihood estimation and restricted maximum likelihood estimation, while nlmixed only uses maximum likelihood estimation.

The dataset for this example includes data on 7185 students in 160 schools. It comes from the HLM manual. The outcome variable mathach is a measure of each student's math achievement, the predictor variable female is a binary variable equal to one if the student is female and zero otherwise, and the predictor variable pracad is the proportion of students at that school who are on the academic track. The variable id is the school identifier. Note that female varies within school (often called a level 1 variable) while pracad is constant within a school but varies across schools (often called a level 2 variable). The data used in this example are in hsb.sas7bdat.

Model 1: An OLS regression

The first model we will run is an ordinary least squares (OLS) regression model where female and pracad predict mathach. In equation form the model is:

mathach = b0 + b1*female + b2*pracad + e

And we assume:

e ~ N(0,s2)

Below is the proc nlmixed syntax corresponding to this specification. Here we define pred as a linear function of female and pracad, mathach is distributed (~) normally with mean equal to pred and variance s2, both of which we wish to estimate (pred is estimated through the coefficients, s2 is estimated based on the model residuals).

proc nlmixed data="d:\data\hsb"; pred = b0 + b1*female + b2*pracad; model mathach ~ normal(pred,s2);run;

Towards the end of the output, we see the parameter estimates shown below.

Parameter Estimates

Parameter   Estimate   Standard Error   DF     t Value   Pr > |t|   Alpha   Lower     Upper     Gradient

b0          9.3439     0.2023           7185   46.20     <.0001     0.05    8.9474    9.7404    -0.1721
b1          -1.5037    0.1546           7185   -9.72     <.0001     0.05    -1.8068   -1.2006   -0.16423
b2          7.8527     0.3073           7185   25.55     <.0001     0.05    7.2503    8.4551    -0.08668
s2          42.7040    0.7124           7185   59.94     <.0001     0.05    41.3074   44.1006   -0.00432

Based on the above estimates, the model is:

mathach = 9.3439 - 1.5037*female + 7.8527*pracad + e

Where:

e ~ N(0,42.7)

This model is a standard OLS regression model, and the coefficients are interpreted as usual for a regression model. Another way to describe this model is to say that all of the coefficients (b0, b1, and b2) are fixed, only the error term (e) has a variance (s2).  This model has totally ignored the nesting structure, assuming all observations are independent of each other. Of course this will not be a valid model for this data, nevertheless, it gives us a starting point.

Model 2: A random intercept model

The problem with the OLS model is that it fails to account for the fact that the observations are not independent, that is, that students are nested within schools. There is often good reason to believe that observations within a cluster are more alike

than observations across clusters. In our case, students at one school might, on average, have higher or lower math achievement scores, even after controlling for gender and the percent of students on an academic track. As a result, the residuals for students at that school would be systematically higher or lower than at other schools, violating one of the assumptions of OLS regression. To accommodate this, we can model the intercept as having a mean and a variance. This is the basic multi-level model. In equation form:

mathach = b0 + u + b1*female + b2*pracad + e

Where we assume:

e ~ N(0,s2)
u ~ N(0,s2u)

With the addition of u, we can say that the intercept has two parts, a fixed part, represented by b0, and a random component, represented by u. The parameter b0 is said to be fixed because it does not vary, while the u is said to be random, because it is assumed to take on different values for different level 2 units (i.e. schools). Substantively, b0 is the mean intercept across all schools, while u allows for variation around that mean. Another way to think about this model is that there are two sources of error in our predictions, error that is a result of a school having a mean score that is higher or lower than other schools (u), and error that is a result of individual variation (e). In the OLS model above, we combine these into a single term, e, while in a random intercept model we estimate both the variance of e and the variance of u. The model can also be written in the two level formulation (the models, and the assumptions about e and u, are the same):

Level 1:
mathach = b0 + b1*female + e

Level 2:
b0 = g00 + g01*pracad + u
b1 = g10

The code below fits the random intercept model. Before we can run a random effects model with nlmixed, we need to sort the data by the grouping variable, in this case by id, which identifies which school a student was in. The nlmixed command below differs from the nlmixed command for the OLS model in two ways. First, the definition of pred, the predicted value of mathach, now includes the parameter u, the random coefficient for the intercept. This means that the expected mean of mathach now depends on which school the student attends. The second change is the addition of the random statement, which indicates that we wish to add a random effect to our model. The random statement defines u as distributed normally with a mean of zero and a constant variance we wish to estimate, denoted s2u, using the code u ~ normal(0, s2u). The groups across which we wish to estimate the random effect are identified by subject=id (note that id is the variable that identifies schools in this dataset).

proc sort data="d:\data\hsb"; by id;run;

proc nlmixed data="d:\data\hsb"; pred = b0 + u + b1*female + b2*pracad; model mathach ~ normal(pred,s2); random u ~ normal(0,s2u) subject=id;run;

Parameter Estimates

Parameter   Estimate   Standard Error   DF    t Value   Pr > |t|   Alpha   Lower     Upper     Gradient

b0          9.1800     0.4084           159   22.48     <.0001     0.05    8.3734    9.9867    0.000236
b1          -1.3450    0.1691           159   -7.96     <.0001     0.05    -1.6789   -1.0111   0.000183
b2          8.0494     0.6866           159   11.72     <.0001     0.05    6.6934    9.4054    0.000201
s2          38.8414    0.6554           159   59.26     <.0001     0.05    37.5470   40.1358   0.000831
s2u         3.9315     0.5444           159    7.22     <.0001     0.05    2.8563    5.0066    -0.00056

Based on the above estimates, the model can be written:

mathach = 9.18 + u - 1.35*female + 8.05*pracad + e

Where:

e ~ N(0,38.84)
u ~ N(0,3.93)

You may have noticed that the term u remains in the equation for our predicted values of mathach, rather than being replaced with an estimate. This is because we have not modeled the intercepts for each school individually (as one would in a so called fixed effects model) but instead we have modeled their distribution, so that the actual intercept (b0+u) for each school can be estimated based on the model, but is not actually part of the model.

In this model, b0, the intercept, is interpreted as the average (mean) intercept across all schools, while s2u is the estimated variance of the individual schools around that mean. Another way to say this is that s2u is the variance between schools. Larger values of s2u indicate greater differences in intercepts across schools (keeping in mind the size of the intercept itself). We also might want to look at the standard deviation of u, that is, the square root of s2u, rather than the variance, if we want to get a better sense of how much the intercepts vary across schools. The standard deviation of u is 1.98 (= sqrt(3.93)). The coefficients for the independent variables, that is, b1 and b2, are interpreted in the same manner as before. The estimates for b1 and b2 are slightly different between the two models. This is because this model accounts for the relationship between the mathach scores of students at the same school.

You may have noticed that the variance of e is smaller in this model than in model 1 (38.84 vs. 42.70). This is because, as discussed above, in model 1 the error due to individual variation was combined with the error due to variation across schools. In this model (model 2), we include both sources of error in the model, allowing us to distinguish between error at the individual and group levels. Based on the variances of e and u in model 2, we can calculate the proportion of variance that is due to differences across schools, often called the intra-class correlation (icc). The formula for this is s2u/(s2u+s2e), in other words, the variance accounted for by schools over the total variance. For this model the icc is 3.9315/(38.8414+3.9315) = .0919, so we can say that about 9% of the total variance is accounted for by differences in the intercepts across schools.
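The icc calculation can be reproduced in a short data step (values taken from the model 2 output above):

data _null_;
  s2u = 3.9315;              /* between-school variance */
  s2e = 38.8414;             /* within-school (residual) variance */
  icc = s2u / (s2u + s2e);   /* intra-class correlation */
  put icc=;                  /* prints 0.0919... */
run;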

Model 3: A mixed model with independent covariance structure for the random effects

In addition to allowing the intercept of mathach to vary by school, it is possible to allow the relationship between mathach and a predictor variable to vary by school; that is, the effect of the predictor variable might be weaker or stronger at different schools. To allow for this, we estimate a model with a random coefficient for the variable whose effect we believe may vary by school. Below we estimate a random coefficient for the variable female. Note that it does not make sense to include a random coefficient for the variable pracad in this model, since pracad does not vary within school. The equation for the model looks like this:

mathach = b0 + u0 + b1*female + u1*female + b2*pracad + e
        = b0 + u0 + (b1+u1)*female + b2*pracad + e

Where we assume:

e ~ N(0,s2)
u0 ~ N(0,s2u0)
u1 ~ N(0,s2u1)
(u0,u1) = 0

The random coefficients are u0 (for the intercept) and u1 (for female). In addition to modeling the variances themselves, we can model the relationship between u0 and u1.

We can specify three different relationships between u0 and u1: we can assume that they are unrelated and fix their covariance to zero; we can assume their covariance is equal to some non-zero value and fix the covariance to that value (this is rarely done); or we can estimate the covariance of u0 and u1 as part of the model. Note that with more than two random effects, additional types of relationships can be modeled. In this model we will assume that u0 and u1 have a covariance of zero; this is sometimes called an independent covariance structure for the random effects.

As before we can also write the model in its two-level form (the models, and the assumptions about e and u, are the same):

Level 1:
mathach = b0 + b1*female + e

Level 2:
b0 = g00 + g01*pracad + u0
b1 = g10 + u1

The code for this model is different from the code for the random intercept model in several ways. First, the equation for pred now includes the term u1. Second, the random statement now includes two random effects (u0 and u1). As in the previous models, the random coefficients are assumed to be normally distributed, their mean and variance are just specified a little differently in the code. Instead of a single value, we now have the vector of means [0,0] one for each of the random effects. Instead of a single parameter for variance we now have a variance-covariance matrix that is shown as the vector [s2u0, 0, s2u1]. The vector [s2u0, 0, s2u1] is equivalent to the lower triangular variance-covariance matrix:

s2u0
0      s2u1

proc nlmixed data="d:\data\hsb";
  pred = b0 + u0 + (b1+u1)*female + b2*pracad;
  model mathach ~ normal(pred,s2);
  random u0 u1 ~ normal([0,0],[s2u0,0,s2u1]) subject=id;
run;

Parameter Estimates

Parameter   Estimate   Standard Error   DF    t Value   Pr > |t|   Alpha   Lower     Upper     Gradient

b0          9.1840     0.4080           158   22.51     <.0001     0.05    8.3781    9.9899    0.00042
b1          -1.3494    0.1742           158   -7.75     <.0001     0.05    -1.6934   -1.0054   0.004663
b2          8.0484     0.6870           158   11.72     <.0001     0.05    6.6915    9.4053    -0.0018
s2          38.8006    0.6587           158   58.91     <.0001     0.05    37.4997   40.1016   0.002626
s2u0        3.8666     0.5564           158    6.95     <.0001     0.05    2.7676    4.9657    -0.00124
s2u1        0.2195     0.4019           158    0.55     0.5858     0.05    -0.5743   1.0133    -0.00635

These estimates can be used to write the equation:

mathach = 9.18 + u0 + (-1.35+u1)*female + 8.05*pracad + e

Where:

e ~ N(0,38.8)
u0 ~ N(0,3.87)
u1 ~ N(0,0.22)
(u0,u1) = 0

In this model, the interpretation of the coefficient for pracad (b2) remains unchanged. The interpretation of the estimate of the intercept (b0), as well as its variance s2u0, remain the same as in the interpretation in the random intercept model above. The coefficient for female in the fixed part of the model (b1) can now be interpreted as the average effect of the variable female across schools. Since the effect of the variable female is the difference in averages between males and females (controlling for other variables in the model), b1 represents the mean of these average differences between males and females at each school. The coefficient s2u1, the variance of u1 (the random effect for female), is the estimated variance of the effect of female across schools. Larger values of s2u1 indicate greater differences in the effect of female across schools (keeping in mind the size of the coefficient itself and that s2u1 is a variance rather than a standard deviation). Note that if the predictor variable were continuous, b1 would represent the mean change in the outcome for a one unit change in the predictor.
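For reference, roughly the same model can be fit with proc mixed, which is usually the easier route for linear mixed models; a minimal sketch, assuming the same hsb data (method=ml is needed to match nlmixed, since proc mixed defaults to restricted maximum likelihood):

proc mixed data="d:\data\hsb" method=ml covtest;
  class id;
  model mathach = female pracad / solution;
  /* type=vc gives independent random effects, matching the zero
     covariance between u0 and u1 in the nlmixed model above */
  random intercept female / subject=id type=vc;
run;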

Model 4: A mixed model with an unstructured level 2 covariance structure

In the model above, we specified that there was no relationship between the random coefficient for the intercept and the random coefficient for the slope of the variable female. Depending on the situation, this may or may not be a reasonable assumption. For example, one common situation when studying individual change over time is that individuals who start out with higher values of the outcome may improve more slowly, that is, have a flatter slope, than individuals who started out with lower values on the outcome. This implies a negative relationship between the random effects for the slope and intercept. An unstructured level 2 covariance matrix allows us to estimate this type of relationship.

This model is the same as model 3 above except that this model (model 4) includes an estimate of the covariance between the two random coefficients (u0 and u1); this parameter is denoted c01. Accordingly, the vector for the variances of the random effects is now [s2u0, c01, s2u1], representing the variance-covariance matrix:

s2u0
c01    s2u1

The equations for this model are the same as those for model 3, so we will not rewrite them here.

There are two differences between the code for this model (shown below) and the previous model. Unlike the code for the previous models, the code below includes the parms statement. By default the nlmixed procedure uses 1 as the starting value for all parameters. For some models, especially more complex models, use of the default starting values may lead to slow convergence, non-convergence, or other estimation problems. The model below encounters problems when the default starting values are used, so we use some of the parameter estimates from a simpler model (in this case the random intercept model) as starting values. (Note that SAS does not require the parms statement for models with an unstructured level 2 covariance matrix, and that the parms statement may be used with any nlmixed model.) The second difference between this model and the previous model is in the random statement: the variances of the random effects include a parameter (c01) for the covariance of u0 and u1, where the previous model included a 0 to fix this parameter to zero.

proc nlmixed data="d:\data\hsb"; parms b0=9.18 b1=-1.35 b2=8.05 s2=38.84 s2u0=3.93; pred = b0 + u0 + (b1+u1)*female + b2*pracad; model mathach ~ normal(pred,s2); random u0 u1 ~ normal([0,0],[s2u0,c01,s2u1]) subject=id;run;

Parameter Estimates

Parameter   Estimate   Standard Error   DF    t Value   Pr > |t|   Alpha   Lower     Upper     Gradient

b0          9.1813     0.4133           158   22.21     <.0001     0.05    8.3650    9.9976    0.000242
b1          -1.3523    0.1831           158   -7.38     <.0001     0.05    -1.7140   -0.9906   -0.00028
b2          8.0489     0.6875           158   11.71     <.0001     0.05    6.6910    9.4068    0.000108
s2          38.7238    0.6584           158   58.81     <.0001     0.05    37.4233   40.0243   -0.00022
s2u0        4.4576     0.7591           158    5.87     <.0001     0.05    2.9584    5.9568    0.000044
c01         -0.7000    0.5154           158   -1.36     0.1764     0.05    -1.7179   0.3180    0.000316
s2u1        0.6381     0.5478           158    1.16     0.2458     0.05    -0.4438   1.7200    0.000099

The estimates of the covariances of the random effects are sometimes substantively interesting, as in the discussion of individual change over time above. In other models such a covariance may exist, and hence need to be included in the model, but may not be of any particular substantive interest. A negative estimate for c01 implies that at schools with higher intercepts (i.e., b0+u0) the difference between males and females (i.e., b1+u1), that is, the effect of gender, will be smaller. Substantively this means that at schools with higher overall math achievement scores (controlling for the effect of gender and proportion of students on an academic track), the difference between males and females tends to be smaller than at schools with lower overall math achievement scores.

Model 5: A mixed model with a cross level interaction

We may also wish to specify a model in which the effect of a level 1 variable depends on the value of a level 2 predictor, often called a cross-level interaction. In our example, this would mean that the effect of being female (female) depends on the proportion of students on an academic track (pracad). We might, for example, expect to see that although on average female students have lower math achievement scores than male students (as evidenced by the negative coefficients for female in previous models), the difference between male and female students becomes smaller as the proportion of students on an academic track increases.

In models 1 and 2, the effect of the variable female was described by a single, fixed coefficient b1. In models 3 and 4 we added a random effect for female, so that the effect of female is described by both a fixed component (b1) and a random component (u1). In this model we add a third term to the effect of female, b3*pracad, where b3 is a fixed coefficient estimated by the model and pracad is a level 2 variable. The model can be written:

mathach = b0 + u0 + b1*female + b3*pracad*female + u1*female + b2*pracad + e
        = b0 + u0 + (b1 + b3*pracad + u1)*female + b2*pracad + e

Where we assume:

e ~ N(0,s2)
u0 ~ N(0,s2u0)
u1 ~ N(0,s2u1)
(u0,u1) = 0

Alternatively, we can write this model in the two-level form:

Level 1:
mathach = b0 + b1*female + e

Level 2:
b0 = g00 + g01*pracad + u0
b1 = g10 + g11*pracad + u1

The code for this model is the same as for model 3 (i.e., we specify an independent covariance structure), except that we have added a new coefficient b3 to the equation for pred; b3 is the coefficient for the cross-level interaction, so that the total effect of female is equal to b1 + b3*pracad + u1.

proc nlmixed data="d:\data\hsb"; pred = b0 + u0 + (b1+b3*pracad+u1)*female + b2*pracad; model mathach ~ normal(pred,s2); random u0 u1 ~ normal([0,0],[s2u0,0,s2u1]) subject=id;run;

Parameter Estimates

Parameter   Estimate   Standard Error   DF    t Value   Pr > |t|   Alpha   Lower     Upper     Gradient

b0          9.2802     0.4443           158   20.89     <.0001     0.05    8.4027    10.1577   0.002315
b1          -1.5489    0.3969           158   -3.90     0.0001     0.05    -2.3328   -0.7650   0.003042
b3          0.4147     0.7416           158    0.56     0.5769     0.05    -1.0502   1.8795    0.001542
b2          7.8543     0.7728           158   10.16     <.0001     0.05    6.3279    9.3807    0.001594
s2          38.7984    0.6586           158   58.91     <.0001     0.05    37.4975   40.0992   0.000105
s2u0        3.8957     0.5616           158    6.94     <.0001     0.05    2.7866    5.0049    -0.00049
s2u1        0.1944     0.4021           158    0.48     0.6295     0.05    -0.5998   0.9885    0.000371

Based on the above estimates, we can write the regression equation as:

mathach = 9.28 + u0 + (-1.55 + 0.41*pracad + u1)*female + 7.85*pracad + e

Where:

e ~ N(0,38.8)
u0 ~ N(0,3.9)
u1 ~ N(0,0.19)
(u0,u1) = 0

The coefficient b1 can now be interpreted as the average effect of female when pracad is equal to zero. The coefficient for the cross-level interaction between female and pracad, that is b3, can be interpreted as the change in the effect of female for a one unit change in pracad. For a school where half of the students are on an academic track (pracad=.5), the effect of female is:

-1.55 + 0.41*.5 + u1 = -1.345 + u1

Another way to say this is that the average effect of female for a school where half of the students are on an academic track is -1.345. Substantively, this means that the higher the proportion of students on an academic track, the less difference in mean math achievement scores between males and females.
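If we wanted SAS to compute this average effect and its standard error directly, an estimate statement could be added to the model 5 nlmixed call; a sketch (the estimate statement was not part of the original run):

proc nlmixed data="d:\data\hsb";
  pred = b0 + u0 + (b1+b3*pracad+u1)*female + b2*pracad;
  model mathach ~ normal(pred,s2);
  random u0 u1 ~ normal([0,0],[s2u0,0,s2u1]) subject=id;
  /* average effect of female at pracad = .5 */
  estimate 'female effect at pracad=.5' b1 + 0.5*b3;
run;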

See Also

Raudenbush, Stephen W., & Anthony S. Bryk (2001). Hierarchical Linear Models: Applications and Data Analysis Methods. 2nd ed. Thousand Oaks, CA: Sage Publications.

Raudenbush, Stephen, Anthony Bryk, Yuk Fai Cheong, & Richard Congdon (2001). HLM 5: Hierarchical Linear and Nonlinear Modeling. Lincolnwood, IL: Scientific Software International.

Skrondal, Anders, & Sophia Rabe-Hesketh (2004). Generalized Latent Variable Modeling: Multilevel, Longitudinal, and Structural Equation Models. New York, NY: Chapman & Hall/CRC.

SAS FAQ: How can I do zero-truncated count models using nlmixed?

This FAQ page will show how to code zero-truncated count models using SAS proc nlmixed. We will cover zero-truncated Poisson and zero-truncated negative binomial regression models.

Both models use the same dataset, medpar.sas7bdat, and the same predictor variables, died, hmo and the dummy variables type2 and type3. For each of the models we will give only partial output, primarily estimates, standard errors, Wald tests and p-values, so that you will be able to compare the results with ordinary Poisson and negative binomial models.

Zero-truncated poisson regression

We will begin with the zero-truncated Poisson regression. The setup for nlmixed is very straightforward, with xb being the linear predictor with parameters b0 (the intercept) and b1 through b4 as the regression coefficients. By using nlmixed we will obtain an iterated maximum likelihood solution for the model.

options nocenter;

proc nlmixed data="D:data\medpar.sas7bdat"; xb = b0 + b1*died + b2*hmo + b3*type2 + b4*type3; ll = los*xb - exp(xb) - lgamma(los + 1) - log(1-exp(-exp(xb))); model los ~ general(ll);run;

/* partial output */

Fit Statistics

-2 Log Likelihood           9475.1
AIC (smaller is better)     9487.1
AICC (smaller is better)    9487.1
BIC (smaller is better)     9518.9

Parameter Estimates

Parameter   Estimate   Standard Error   DF     t Value   Pr > |t|   Alpha   Lower     Upper      Gradient

b0          2.2645     0.01182          1495   191.51    <.0001     0.05    2.2413    2.2877     -0.00649
b1          -0.2487    0.01812          1495   -13.73    <.0001     0.05    -0.2842   -0.2131    -0.00345
b2          -0.07551   0.02394          1495    -3.15    0.0016     0.05    -0.1225   -0.02856   -0.0016
b3          0.2501     0.02099          1495    11.91    <.0001     0.05    0.2089    0.2912     -0.00285
b4          0.7504     0.02624          1495    28.59    <.0001     0.05    0.6989    0.8019     0.001308

Thus, our model looks like this, xb = 2.2645 - .2487*died - .07551*hmo + .2501*type2 + .7504*type3.

Zero-truncated negative binomial regression

Our second model is a zero-truncated negative binomial regression, which has one additional parameter, alpha. alpha is a measure of overdispersion; if alpha is zero then the negative binomial model is equivalent to a Poisson model.

proc nlmixed data="D:data\medpar.sas7bdat"; xb = b0 + b1*died + b2*hmo + b3*type2 + b4*type3; mu = exp(xb); m = 1/alpha; ll = lgamma(los+m)-lgamma(los+1)-lgamma(m) + los*log(alpha*mu)-(los+m)*log(1+alpha*mu) - log(1 -( 1 + alpha*mu)**(-m)); model los ~ general(ll);

Page 109: STAT_in SAS

run;

/* partial output */

Fit Statistics

-2 Log Likelihood           13693
AIC (smaller is better)     13703
AICC (smaller is better)    13703
BIC (smaller is better)     13730

Parameter Estimates

Parameter   Estimate   Standard Error   DF     t Value   Pr > |t|   Alpha   Lower     Upper     Gradient

b0          2.2240     0.03002          1495   74.08     <.0001     0.05    2.1651    2.2829    -0.00029
b1          -0.2522    0.04471          1495   -5.64     <.0001     0.05    -0.3399   -0.1645   -0.00088
b2          -0.07542   0.05823          1495   -1.30     0.1955     0.05    -0.1897   0.03881   -0.00099
b3          0.2685     0.05500          1495    4.88     <.0001     0.05    0.1606    0.3764    0.000382
b4          0.7668     0.08304          1495    9.23     <.0001     0.05    0.6039    0.9297    -0.00067
alpha       0.5325     0.02928          1495   18.19     <.0001     0.05    0.4751    0.5900    -0.0032

The way to test whether alpha is different from zero is to compute the difference in the -2 log likelihood values for the zero-truncated Poisson and the zero-truncated negative binomial models. In this example the computation is (13693 - 9475.1) = 4217.9, which is distributed as a chi-square with one degree of freedom. Clearly, there is overdispersion in this example.
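The corresponding p-value can be computed with the probchi function in a short data step:

data _null_;
  chisq = 13693 - 9475.1;       /* difference in -2 log likelihoods */
  p = 1 - probchi(chisq, 1);    /* upper tail, chi-square with 1 df */
  put chisq= p=;
run;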

SAS FAQ: How can I do path analysis in SAS?

It is possible to estimate recursive path models using ordinary least squares regression, but using SAS proc tcalis can make the process easier and will also provide estimates of direct and indirect effects.
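For instance, the two regressions that make up the path model fitted below could be run with proc reg; a minimal sketch (the tcalis approach is used instead because it also decomposes the effects into direct and indirect parts):

proc reg data='C:\data\hsb2';
  /* first equation: math predicted by read and write */
  model math = read write;
  /* second equation: science predicted by math, read and write */
  model science = math read write;
run;
quit;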

Let's say that we want to estimate a path model using the hsb2 (hsb2.sas7bdat) dataset, in which math is predicted by read and write, and science is predicted by math, read and write.

We will begin by computing the correlation between the two exogenous variables, read and write. We assume that the data file, hsb2.sas7bdat, is located in the data directory on the C: drive; you may need to change these values for your particular computer configuration.

proc corr data='C:\data\hsb2';
  var read write;
run;

The CORR Procedure

2 Variables: READ WRITE

Simple Statistics

Variable   N     Mean       Std Dev    Sum     Minimum    Maximum    Label
READ       200   52.23000   10.25294   10446   28.00000   76.00000   reading score
WRITE      200   52.77500   9.47859    10555   31.00000   67.00000   writing score

Pearson Correlation Coefficients, N = 200
Prob > |r| under H0: Rho=0

                READ      WRITE

READ            1.00000   0.59678
reading score             <.0001

WRITE           0.59678   1.00000
writing score   <.0001

This path analysis is really just two regression models. The first model is math = constant + read + write, while the second model is science = constant + math + read + write. In proc tcalis we set up the model by entering the response variable with each predictor variable and the name of the parameter being estimated in the path part of the command. In the effpart part of the command we list the paths for direct and indirect effects.

proc tcalis data='C:\data\hsb2';
  path  /* specification of path model */
    science <- math beta1,
    science <- read beta2,
    science <- write beta3,
    math <- read beta4,
    math <- write beta4;
  effpart  /* for direct and indirect effects */
    science <- read write;
run;

We can now run the proc tcalis command, which produces the output shown below. There is a lot of output, but we will focus on the standardized results given near the end.

The TCALIS Procedure
Covariance Structure Analysis: Model and Initial Values

Modeling Information

Data Set         WC000001.HSB2
N Records Read   200
N Records Used   200
N Obs            200
Model Type       PATH

Variables in the Model

Endogenous   Manifest   MATH SCIENCE
             Latent
Exogenous    Manifest   READ WRITE
             Latent

Number of Endogenous Variables = 2
Number of Exogenous Variables = 2

Initial Estimates for PATH List

---------Path--------- Parameter Estimate

SCIENCE <- MATH    beta1   .
SCIENCE <- READ    beta2   .
SCIENCE <- WRITE   beta3   .
MATH <- READ       beta4   .
MATH <- WRITE      beta4   .

Initial Estimates for Variance Parameters

Variance Type   Variable   Parameter   Estimate

Exogenous       READ       _Add1       .
                WRITE      _Add2       .
Error           MATH       _Add3       .
                SCIENCE    _Add4       .

NOTE: Parameters with prefix '_Add' are added by PROC TCALIS.

Initial Estimates for Covariances Among Exogenous Variables

Var1 Var2 Parameter Estimate

WRITE READ _Add5 .

NOTE: Parameters with prefix '_Add' are added by PROC TCALIS.

Simple Statistics

Variable                    Mean       Std Dev

READ      reading score     52.23000   10.25294
WRITE     writing score     52.77500   9.47859
MATH      math score        52.64500   9.36845
SCIENCE   science score     51.85000   9.90089

Initial Estimation Methods

1   Observed Moments of Variables
2   McDonald Method

Optimization Start Parameter Estimates

N Parameter Estimate Gradient

N   Parameter   Estimate    Gradient

1   beta1       0.31901     1.4539E-15
2   beta2       0.30153     -6.995E-16
3   beta3       0.20653     -5.578E-16
4   beta4       0.38137     0.00682
5   _Add1       105.12271   1.3207E-19
6   _Add2       89.84359    5.1911E-20
7   _Add3       42.65279    7.0238E-18
8   _Add4       49.01931    -2.963E-18
9   _Add5       57.99673    -5.525E-20

Value of Objective Function = 0.0026412093

Levenberg-Marquardt Optimization

Scaling Update of More (1978)

Parameter Estimates        9
Functions (Observations)   10

Optimization Start

Active Constraints         0
Objective Function         0.0026412093
Max Abs Gradient Element   0.0068163427
Radius                     1

                                                                                                          Ratio Between
       Restarts   Function   Active        Objective   Objective   Max Abs              Lambda           Actual and
Iter              Calls      Constraints   Function    Function    Gradient                              Predicted
                                                       Change      Element                               Change
1      0          4          0             0.00264     1.593E-6    3.735E-8             0                1.000

Optimization Results

Iterations                 1
Function Calls             7
Jacobian Calls             3
Active Constraints         0
Objective Function         0.002639616
Max Abs Gradient Element   3.735413E-8
Lambda                     0
Actual Over Pred Change    0.9999999993
Radius                     0.0035701627

Convergence criterion (ABSGCONV=0.00001) satisfied.

Fit Summary

Modeling Info       N Observations                       200
                    N Variables                          4
                    N Moments                            10
                    N Parameters                         9
                    N Active Constraints                 0
                    Independence Model Chi-Square        369.6536
                    Independence Model Chi-Square DF     6
Absolute Index      Fit Function                         0.0026
                    Chi-Square                           0.5253
                    Chi-Square DF                        1
                    Pr > Chi-Square                      0.4686
                    Z-Test of Wilson & Hilferty          0.0617
                    Hoelter Critical N                   1457
                    Root Mean Square Residual (RMSR)     0.6981
                    Standardized RMSR (SRMSR)            0.0075
                    Goodness of Fit Index (GFI)          0.9987
Parsimony Index     Adjusted GFI (AGFI)                  0.9868
                    Parsimonious GFI                     0.1664
                    RMSEA Estimate                       0.0000
                    RMSEA Lower 90% Confidence Limit     .
                    RMSEA Upper 90% Confidence Limit     0.1673
                    Probability of Close Fit             0.5686
                    ECVI Estimate                        0.0954
                    ECVI Lower 90% Confidence Limit      .
                    ECVI Upper 90% Confidence Limit      0.1264
                    Akaike Information Criterion         -1.4747
                    Bozdogan CAIC                        -5.7730
                    Schwarz Bayesian Criterion           -4.7730
                    McDonald Centrality                  1.0012
Incremental Index   Bentler Comparative Fit Index        1.0000
                    Bentler-Bonett NFI                   0.9986
                    Bentler-Bonett Non-normed Index      1.0078
                    Bollen Normed Index Rho1             0.9915
                    Bollen Non-normed Index Delta2       1.0013
                    James et al. Parsimonious NFI        0.1664

PATH List

---------Path---------   Parameter   Estimate   Standard Error   t Value

SCIENCE <- MATH          beta1       0.31901    0.07599           4.19778
SCIENCE <- READ          beta2       0.30153    0.06691           4.50637
SCIENCE <- WRITE         beta3       0.20653    0.07139           2.89302
MATH <- READ             beta4       0.38090    0.02625          14.50821
MATH <- WRITE            beta4       0.38090    0.02625          14.50821

Variance Parameters

Variance Type   Variable   Parameter   Estimate    Standard Error   t Value

Exogenous       READ       _Add1       105.12271   10.53865         9.97497
                WRITE      _Add2       89.84359    9.00690          9.97497
Error           MATH       _Add3       42.65279    4.27598          9.97497
                SCIENCE    _Add4       49.01931    4.91423          9.97497

Covariances Among Exogenous Variables

Var1    Var2   Parameter   Estimate   Standard Error   t Value
WRITE   READ   _Add5       57.99673   8.02265          7.22912

Squared Multiple Correlations

Variable   Error Variance   Total Variance   R-Square

MATH       42.65279         87.76788         0.5140
SCIENCE    49.01931         97.93776         0.4995

The TCALIS Procedure
Covariance Structure Analysis: Maximum Likelihood Estimation

Stability Coefficient of Reciprocal Causation = 0

Stability Coefficient < 1

Total and Indirect Effects Converge

Effects on SCIENCE
(each cell: Effect / Std Error / t Value / p Value)

        Total    Direct     Indirect

READ    0.4230   0.3015     0.1215
        0.0609   0.0669     0.0301
        6.9458   4.5064     4.0324
        0        0          0

WRITE   0.3280   0.2065     0.1215
        0.0658   0.0714     0.0301
        4.9860   2.8930     4.0324
        0        0.003816   0

Standardized Results for PATH List

---------Path---------   Parameter   Estimate   Standard Error   t Value

SCIENCE <- MATH          beta1       0.30199    0.07066           4.27365
SCIENCE <- READ          beta2       0.31240    0.06792           4.59947
SCIENCE <- WRITE         beta3       0.19781    0.06789           2.91347
MATH <- READ             beta4       0.41686    0.02202          18.93443
MATH <- WRITE            beta4       0.38538    0.02065          18.66302

Standardized Results for Variance Parameters

Variance Type   Variable   Parameter   Estimate   Standard Error   t Value

Exogenous       READ       _Add1       1.00000
                WRITE      _Add2       1.00000
Error           MATH       _Add3       0.48597    0.04940          9.83792
                SCIENCE    _Add4       0.50051    0.05015          9.98091

Standardized Results for Covariances Among Exogenous Variables

Var1    Var2   Parameter   Estimate   Standard Error   t Value

WRITE   READ   _Add5       0.59678    0.04564          13.07520

Standardized Effects on SCIENCE
(each cell: Effect / Std Error / t Value / p Value)

        Total    Direct     Indirect

READ    0.4383   0.3124     0.1259
        0.0594   0.0679     0.0304
        7.3808   4.5995     4.1472
        0        0          0

WRITE   0.3142   0.1978     0.1164
        0.0613   0.0679     0.0281
        5.1272   2.9135     4.1365
        0        0.003574   0

We will focus our attention on the standardized parts of the output above: the standardized results for the PATH list, the standardized results for the variance parameters, and the standardized effects on science. We will use the standardized estimates as our path coefficients and the square roots of the error variance estimates for the errors. The error values are sqrt(0.48597) = 0.6971 (approximately 0.7) for math and sqrt(0.50051) = 0.70747 (approximately 0.7) for science. Now we can add the path coefficients and errors to the path diagram.
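The square roots used for the error terms can be verified in a short data step:

data _null_;
  e_math    = sqrt(0.48597);   /* 0.6971, approximately 0.7 */
  e_science = sqrt(0.50051);   /* 0.70747, approximately 0.7 */
  put e_math= e_science=;
run;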

Proc tcalis also provides estimates of the direct, indirect, and total effects for the two exogenous variables because we included the effpart statement in our model.


From these results we see that the indirect effect of read is about one third the size of its direct effect, while for write the indirect effect is a bit more than half the size of the direct effect. For this example, the estimates of all of the direct and indirect effects were statistically significant, which is not necessarily a common occurrence.
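As a quick check (our addition, not part of the original output), a standardized indirect effect is the product of the standardized coefficients along the path; for read, the path runs read -> math -> science:

data _null_;
  /* standardized path coefficients from the output above */
  b_math_read = 0.41686;   /* MATH <- READ */
  b_sci_math  = 0.30199;   /* SCIENCE <- MATH */
  indirect_read = b_math_read * b_sci_math;
  put indirect_read= 6.4;  /* prints 0.1259, matching the Indirect column */
run;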

SAS FAQ: How can I compute the Durbin-Watson statistic and 1st-order autocorrelation in time series data?

When the data set of interest is a time series, we may want to compute the 1st-order autocorrelation for the variables of interest and test whether the autocorrelation is zero. One common test is the Durbin-Watson test. The Durbin-Watson test statistic can be computed in proc reg by using the dw option after the slash in the model statement.

Here are two examples using data set sp500.sas7bdat. The variables of interest are open, close, high, low and volume.

Example 1: Computing the Durbin-Watson statistic for a variable.

proc reg data = sp500;
  model open = / dw;
run;
quit;

Dependent Variable: open

                        Analysis of Variance

                                Sum of          Mean
Source             DF          Squares        Square    F Value    Pr > F

Model               0                0             .          .         .
Error             247          1875052    7591.30215
Corrected Total   247          1875052

Root MSE           87.12808    R-Square    0.0000
Dependent Mean   1194.88379    Adj R-Sq    0.0000
Coeff Var           7.29176

                        Parameter Estimates

                      Parameter     Standard
Variable     DF        Estimate        Error    t Value    Pr > |t|

Intercept     1      1194.88379      5.53264     215.97      <.0001

Durbin-Watson D                  0.034
Number of Observations             248
1st Order Autocorrelation        0.979

The value of the Durbin-Watson statistic is close to 2 if the errors are uncorrelated. In our example it is 0.034, which is strong evidence that the variable open has high first-order autocorrelation.
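For readers who want to see where the statistic comes from, here is a minimal sketch (our addition; the data set and variable names follow the example above) that computes the Durbin-Watson statistic by hand for this intercept-only model, where the residuals are simply deviations of open from its mean:

proc means data = sp500 noprint;
  var open;
  output out = m (keep = mopen) mean = mopen;
run;

data resid;
  if _n_ = 1 then set m;    /* carry the mean into every row */
  set sp500;
  e = open - mopen;         /* residual from the intercept-only model */
  elag = lag(e);            /* e(t-1) */
  d2 = (e - elag)**2;       /* squared successive difference */
run;

proc sql;
  /* DW = sum((e(t) - e(t-1))**2) / sum(e(t)**2) */
  select sum(d2) / sum(e*e) as dw
  from resid;
quit;

The approximation DW = 2(1 - r1) also ties the two numbers in the output together: 2 x (1 - 0.979) = 0.042, close to the 0.034 reported by proc reg.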

Example 2: Output 1st-order autocorrelation of multiple variables into a data set


Let's say that we want to compute the 1st-order autocorrelation for all the variables of interest. We can make use of the ODS facility to output the 1st-order autocorrelation for each variable to a data set called auto_corr.

proc reg data = sp500;
  model open high low close volume = / dw;
  ods output dwstatistic = auto_corr
    (where = (label1 = "1st Order Autocorrelation"));
run;
quit;

proc print data = auto_corr noobs;
  var dependent label1 cvalue1;
run;

Dependent    Label1                       cValue1

open         1st Order Autocorrelation    0.979
high         1st Order Autocorrelation    0.984
low          1st Order Autocorrelation    0.983
close        1st Order Autocorrelation    0.981
volume       1st Order Autocorrelation    0.545

SAS FAQ: How can I perform a bivariate probit analysis using proc qlim in SAS 9.1?

SAS proc qlim is a procedure in SAS/ETS, released with SAS 9, that analyzes discrete univariate and multivariate models. We will illustrate how to perform a bivariate probit analysis using proc qlim. The data set used is hsb2.sas7bdat, which can be downloaded following the link. We created two binary variables, hiwrite and himath, for the purpose of demonstration. The way to specify our model as a bivariate probit model is very similar to the way to specify a multivariate regression model. The only thing that we need to add is the endogenous statement, where we specify that the two outcome variables are discrete. The only type of model that proc qlim fits for two discrete outcome variables is a bivariate probit model, so it is sufficient to specify just that the two outcome variables are discrete.

options nocenter nodate nofmterr;
libname in 'd:\data';
data hsb2;
  set in.hsb2;
  hiwrite = (write >= 60);
  himath = (math >= 60);
run;

proc qlim data = hsb2;
  model hiwrite himath = female read;
  endogenous hiwrite himath ~ discrete;
run;


The QLIM Procedure

Discrete Response Profile of hiwrite

Index Value Frequency Percent

    1        0          147      73.50
    2        1           53      26.50

Discrete Response Profile of himath

Index Value Frequency Percent

    1        0          151      75.50
    2        1           49      24.50

Model Fit Summary

Number of Endogenous Variables                 2
Endogenous Variable               hiwrite himath
Number of Observations                       200
Log Likelihood                        -157.57872
Maximum Absolute Gradient              0.0001420
Number of Iterations                          21
AIC                                    329.15744
Schwarz Criterion                      352.24567

Algorithm converged.

Parameter Estimates

                                  Standard                 Approx
Parameter             Estimate       Error    t Value    Pr > |t|

hiwrite.Intercept    -5.638784    0.769722      -7.33      <.0001
hiwrite.FEMALE        0.608516    0.227431       2.68      0.0075
hiwrite.READ          0.085227    0.012885       6.61      <.0001
himath.Intercept     -5.543897    0.766776      -7.23      <.0001
himath.FEMALE         0.019750    0.223054       0.09      0.9294
himath.READ           0.087772    0.013117       6.69      <.0001
_Rho                  0.598763    0.109035       5.49      <.0001
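Note that _Rho is the estimated correlation between the error terms of the two probit equations, and its t value is simply the estimate divided by its standard error, 0.598763 / 0.109035 = 5.49. Since this is highly significant, the errors of the two equations are correlated, and fitting them jointly as a bivariate probit is preferable to fitting two separate univariate probits. (This interpretive note is our addition.)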

SAS FAQ: How do I perform a Chow test in SAS using proc autoreg?

The Chow test is an F-ratio test for structural change in regression analysis in large samples. It is used mostly in time-series models. Here we show an example using hsb2.sas7bdat.

Our data set hsb2 consists of high school students' scores on various tests and their demographic information. Let's say our model is a regression of writing scores on math and reading scores. Furthermore, we want to test whether the regression model is different for male and female students. In other words, we want to test


whether the same regression coefficients apply to both male and female students in the data set, or whether there are two subsets with different intercepts and slopes. We will use the Chow test for this purpose.
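For reference (our addition; the linked example page at the end of this FAQ gives the details), the Chow statistic is an F ratio built from the pooled and subset error sums of squares:

  F = [ (SSE_c - (SSE_1 + SSE_2)) / k ] / [ (SSE_1 + SSE_2) / (n_1 + n_2 - 2k) ]

where SSE_c comes from the pooled regression, SSE_1 and SSE_2 come from the two subset regressions, and k is the number of parameters; here k = 3 (the intercept and the coefficients of math and read), which gives the 3 and 194 = 200 - 2(3) degrees of freedom seen in the output below.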

Since the Chow test is mostly used in time series, SAS includes it in proc autoreg. The two subsets are specified by giving the breakpoint in terms of the position of the observations. In this example, we use proc freq to identify the position of the breakpoint, and we then have to sort the data accordingly.

proc freq data = hsb2;
  tables female;
run;

The FREQ Procedure

                                   Cumulative    Cumulative
FEMALE    Frequency    Percent      Frequency       Percent
-----------------------------------------------------------
     0          91      45.50             91         45.50
     1         109      54.50            200        100.00

proc sort data = hsb2;
  by female;
run;

proc autoreg data = hsb2;
  model write = math read / chow = 91;
run;

Dependent Variable    WRITE

              Ordinary Least Squares Estimates

SSE               9938.81034    DFE                    197
MSE                 50.45081    Root MSE           7.10287
SBC               1364.64741    AIC             1354.75246
Regress R-Square      0.4441    Total R-Square      0.4441
Durbin-Watson         1.6662

                 Structural Change Test

          Break
Test      Point    Num DF    Den DF    F Value    Pr > F

Chow         91         3       194      11.84    <.0001

                               Standard                 Approx
Variable     DF    Estimate       Error    t Value    Pr > |t|

Intercept     1     15.5339      3.0180       5.15      <.0001
MATH          1      0.4005      0.0717       5.58      <.0001
READ          1      0.3094      0.0655       4.72      <.0001

The middle section of the output above gives the Chow test; the rest is the regression model for the entire sample, including both male and female students. The Chow test indicates that there is a structural difference between male and female students. Now let's run the regression models separately.

proc reg data = hsb2;
  by female;
  model write = math read;
run;
quit;

FEMALE=0


                        Parameter Estimates

                    Parameter    Standard
Variable     DF      Estimate       Error    t Value    Pr > |t|

Intercept     1       7.33165     4.60342       1.59      0.1148
MATH          1       0.39321     0.10066       3.91      0.0002
READ          1       0.41592     0.09259       4.49      <.0001

FEMALE=1

                        Parameter Estimates

                    Parameter    Standard
Variable     DF      Estimate       Error    t Value    Pr > |t|

Intercept     1      21.07310     3.37071       6.25      <.0001
MATH          1       0.41966     0.08719       4.81      <.0001
READ          1       0.23061     0.07933       2.91      0.0044

Here is the link to a SAS example page on the Chow test, which explains in some detail the assumptions for the Chow test and the formula for the Chow statistic: SAS Examples: Chow Test for Structural Breaks.

SAS FAQ: How can I bootstrap estimates in SAS?

Bootstrapping allows for estimation of statistics through the repeated resampling of data. In this page, we will demonstrate several methods of bootstrapping a confidence interval about an R-squared statistic in SAS. We will be using the hsb2 dataset that can be found here. We will begin by running an OLS regression, predicting read with female, math, write, and ses, and saving the R-squared value in a dataset called t0. The R-squared value in this regression is 0.5189. 

ods output FitStatistics = t0;
proc reg data = hsb2;
  model read = female math write ses;
run;
quit;

The REG Procedure
Model: MODEL1
Dependent Variable: read reading score

Number of Observations Read    200
Number of Observations Used    200

                         Analysis of Variance

                                Sum of          Mean
Source             DF          Squares        Square    F Value    Pr > F

Model               4            10855    2713.73294      52.58    <.0001
Error             195            10064      51.61276
Corrected Total   199            20919

Root MSE           7.18420    R-Square    0.5189
Dependent Mean    52.23000    Adj R-Sq    0.5090
Coeff Var         13.75493


                          Parameter Estimates

                                  Parameter    Standard
Variable     Label           DF    Estimate       Error    t Value    Pr > |t|

Intercept    Intercept        1     6.83342     3.27937       2.08      0.0385
female                        1    -2.45017     1.10152      -2.22      0.0273
math         math score       1     0.45656     0.07211       6.33      <.0001
write        writing score    1     0.37936     0.07327       5.18      <.0001
ses                           1     1.30198     0.74007       1.76      0.0801

* store the estimated r-square;
data _null_;
  set t0;
  if label2 = "R-Square" then call symput('r2bar', cvalue2);
run;

To bootstrap a confidence interval about this R-squared value, we will first need to resample.  This step involves sampling with replacement from our original dataset to generate a new dataset the same size as our original dataset.  For each of these samples, we will be running the same regression as above and saving the R-squared value.  proc surveyselect allows us to do this resampling in one step. 

Before carrying out this step, let's outline the assumptions we are making about our data when we use this method. We are assuming that the observations in our dataset are independent. We are also assuming that the statistic we are estimating is asymptotically normally distributed.   

We indicate an output dataset, a seed, a sampling method, and the number of replicates.  The sampling method indicated, urs, is unrestricted random sampling, or sampling with replacement.  The samprate option indicates how large each sample should be relative to the input dataset.  A samprate of 1 means that each sampled dataset should be the same size as the input dataset.  So in this example we will generate 500 datasets of 200 observations each, and our output dataset bootsample will have 100,000 observations.

%let rep = 500;
proc surveyselect data = hsb2 out = bootsample seed = 1347
  method = urs samprate = 1 outhits rep = &rep;
run;
ods listing close;

The SURVEYSELECT Procedure

Selection Method Unrestricted Random Sampling


Input Data Set                    HSB2
Random Number Seed                1347
Sampling Rate                        1
Sample Size                        200
Expected Number of Hits              1
Sampling Weight                      1
Number of Replicates               500
Total Sample Size               100000
Output Data Set             BOOTSAMPLE

With this dataset, we will now run our regression model, specifying by replicate so that the model will be run separately for each of the 500 sample datasets. After that, we use a data step to convert the R-squared values to numeric. 

ods output FitStatistics = t (where = (label2 = "R-Square"));
proc reg data = bootsample;
  by replicate;
  model read = female math write ses;
run;
quit;

* converting character type to numeric type;
data t1;
  set t;
  r2 = cvalue2 + 0;
run;

Method 1: Normal Distribution Confidence Interval

We will first create a confidence interval using normal distribution theory.  This assumes that the R-squared values follow a t distribution, so we can generate a 95% confidence interval about the mean of the R-squared values based on quantiles from a t distribution with 499 degrees of freedom.  We find the critical t values for our confidence interval and multiply them by the standard deviation of the R-squared values that arose in our 500 replications.  The confidence interval from this method is symmetric about the R-squared value from our original regression.  We can see that the 95% confidence interval using this method is (0.432787, 0.605013).  We have also calculated the bias in our original value of R-squared as the difference between that value and the mean of the 500 R-squareds in our bootstrap sample.

* creating confidence interval, normal distribution theory method;
* using the t-distribution;
%let alpha = .05;
ods listing;
proc sql;
  select &r2bar as r2,
         mean(r2) - &r2bar as bias,
         std(r2) as std_err,
         &r2bar - tinv(1-&alpha/2, &rep-1)*std(r2) as lb,
         &r2bar + tinv(1-&alpha/2, &rep-1)*std(r2) as hb
  from t1;
quit;

     r2        bias     std_err          lb          hb

 0.5189    0.006616    0.043829    0.432787    0.605013
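As a check on the arithmetic (our addition): the critical value tinv(0.975, 499) is approximately 1.9647, so the bounds are 0.5189 - 1.9647 x 0.043829 = 0.4328 and 0.5189 + 1.9647 x 0.043829 = 0.6050, matching lb and hb above.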

Method 2: Percentile Confidence Interval

Another way to generate a bootstrap 95% confidence interval from the sample of 500 R-squared values is to look at the 2.5th and 97.5th percentiles of this distribution.  This approach has two advantages over the normal approximation used above: the interval is not forced to be symmetric about the original estimate of the R-squared, and the method is unaffected by monotonic transformations of the estimated statistic.  The first advantage is relevant because our original estimate is subject to bias.  The second advantage is less relevant in this example than in an instance where the estimate might be subject to a transformation; the bootstrap estimates that form the bounds of the interval can be transformed in the same way to create the bootstrap interval of the transformed estimate.

We can easily generate a percentile confidence interval in SAS using proc univariate after creating some macro variables for the percentiles of interest and using them in the output statement. We can see that the confidence interval from this method is (0.436, 0.6017). Since we have put the information of interest into a new dataset, pmethod, we have omitted the standard output from the proc univariate. 

%let alpha = .05;
%let a1 = %sysevalf(&alpha/2*100);
%let a2 = %sysevalf((1 - &alpha/2)*100);
* creating confidence interval, percentile method;
proc univariate data = t1 alpha = .05;
  var r2;
  output out = pmethod mean = r2hat
    pctlpts = &a1 &a2 pctlpre = p pctlname = _lb _ub;
run;

<... output omitted ... >

data t2;
  set pmethod;
  bias = r2hat - &r2bar;
  r2 = &r2bar;
run;
ods listing;
proc print data = t2;
  var r2 bias p_lb p_ub;
run;

Obs        r2        bias     p_lb      p_ub

  1    0.5189    .0066164    0.436    0.6017


Method 3: Bias-Corrected Confidence Interval

We can also correct for bias in calculating our confidence interval. We calculated the bias in the previous method as the difference between the R-squared we observed in our initial regression and the mean of the 500 R-squared values from the bootstrap samples; the latter is assumed to be an unbiased estimate of the true R-squared.  If we wish to correct for the bias in our original value in calculating our confidence interval, we can go through the steps below.  These are described by Cameron and Trivedi in Microeconometrics Using Stata.

We first calculate the proportion of the bootstrap R-squareds that are less than our original value.  We will adjust the percentiles used to define our confidence interval based on how this proportion differs from 0.5.  We then find the probit of this proportion (z0) and the probit of the proportion associated with our alpha level (zalpha). Next, we calculate the two percentiles that will be used to find our confidence interval, p1 and p2, from these values.  We then calculate our interval with proc univariate.  From this method, our interval is (0.40575, 0.5936).
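To make the adjustment concrete (our addition, using the proportion 0.44 produced by the proc sql step below): z0 = probit(0.44) is approximately -0.151 and zalpha = probit(0.975) is approximately 1.960, so p1 = 100 x probnorm(2(-0.151) - 1.960) is about 1.2 and p2 = 100 x probnorm(2(-0.151) + 1.960) is about 95.1. After rounding, the interval is formed from the 1st and 95th percentiles of the bootstrap distribution rather than the symmetric 2.5th and 97.5th percentiles used in Method 2.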

%let alpha = .05;
%let alpha1 = %sysevalf(1 - &alpha/2);
%put &alpha1;
proc sql;
  select sum(r2 <= &r2bar)/count(r2) into :z0bar from t1;
quit;

0.44

data _null_;
  z0 = probit(&z0bar);
  zalpha = probit(&alpha1);
  p1 = put(probnorm(2*z0 - zalpha)*100, 3.0);
  p2 = put(probnorm(2*z0 + zalpha)*100, 3.0);
  output;
  call symput('a1', p1);
  call symput('a2', p2);
run;

* creating confidence interval, bias-corrected method;
proc univariate data = t1 alpha = .05;
  var r2;
  output out = pmethod mean = r2hat
    pctlpts = &a1 &a2 pctlpre = p pctlname = _lb _ub;
run;

<... output omitted ...>

data t2;
  set pmethod;
  bias = r2hat - &r2bar;
  r2 = &r2bar;
run;

ods listing;

proc print data = t2;
  var r2 bias p_lb p_ub;
run;

Obs        r2        bias       p_lb      p_ub

  1    0.5189    .0066164    0.40575    0.5936

References

Cameron, A. C., and Trivedi, P. K. (2009). Microeconometrics Using Stata. College Station, TX: Stata Press.

Efron, B., and Tibshirani, R. (1998). An Introduction to the Bootstrap. Boca Raton, FL: Chapman and Hall.

Cassell, D. L. Don't Be Loopy: Re-Sampling and Simulation the SAS Way.

SAS FAQ: How do I analyze survey data with a simple random sample design?

This example is taken from Levy and Lemeshow's Sampling of Populations, page 53, simple random sampling.

NOTE:  The n = 773 on the proc surveymeans statement indicates that there are 773 primary sampling units (PSUs).  Because this is simple random sampling, the elements and the PSUs are the same thing.  Hence, 773 is the population total from which the sample was drawn. This example uses the momsag data set.

proc surveymeans data = momsag n = 773 mean sum std;
  weight weight1;
  var momsag;
run;

The SURVEYMEANS Procedure

Data Summary

Number of Observations             25
Sum of Weights             773.000002

Statistics

                         Std Error
Variable        Mean       of Mean           Sum      Std Dev
------------------------------------------------------------------------
MOMSAG      0.920000      0.054475    711.160002    42.108894
------------------------------------------------------------------------

This example is taken from Lehtonen and Pahkinen's Practical Methods for Design and Analysis of Complex Surveys.

page 29, Table 2.4: Estimates from a simple random sample drawn without replacement (n = 8); the Province'91 population.

data page29;
  input id cluster ue91 lab91;
  fpc = 32;
  wt = 4;
  strata = 1;
  cards;
1  1  4123  33786
2  4   760   5919
3  5   721   4930
4 15   142    675
5 18   187   1448
6 26   331   2543
7 30   127   1084
8 31   219   1330
;
run;

The code below gets the total and the standard error of the total for the variables ue91 and lab91, as shown in the first line of the table.  You can calculate the ratio by hand using the information in the output (see the sketch after the output below).  We know of no way to get the median from proc surveymeans.

proc surveymeans data = page29 mean sum std cv r = .25;
  var ue91 lab91;
  weight wt;
  strata strata;
  cluster cluster;
run;

The SURVEYMEANS Procedure

Data Summary

Number of Strata            1
Number of Clusters          8
Number of Observations      8
Sum of Weights             32

Statistics

                            Std Error      Coeff of
Variable           Mean       of Mean     Variation       Sum      Std Dev
----------------------------------------------------------------------------------------
ue91         826.250000    415.070586      0.502355     26440        13282
lab91       6464.375000   3430.088578      0.530614    206860       109763
----------------------------------------------------------------------------------------
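As noted above, the ratio estimate of ue91 to lab91 can be computed by hand from the weighted totals in the output. A minimal sketch (our addition, reusing the page29 data set created above):

proc sql;
  /* ratio of the weighted totals: 26440 / 206860 = 0.1278 */
  select sum(wt*ue91) / sum(wt*lab91) as ratio
  from page29;
quit;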

SAS FAQ: How do I analyze survey data with a one-stage cluster design?

This example is taken from Levy and Lemeshow's Sampling of Populations.

page 250, simple one-stage cluster sampling. This example uses the tab9_1c data set.

NOTE:  The n = 5 in the proc surveymeans statement indicates that there were 5 PSUs from which the sample could be drawn.  You can use this option in any non-stratified design or in a stratified design in which the total number is equal in all strata, e.g. each stratum has 20 elements from which the sample can be drawn.  The total is used to calculate the fpc; hence, if the total is omitted, an fpc will not be calculated.  The SAS keywords sum and mean are used to modify the output.

proc surveymeans data = tab9_1c n = 5 sum mean;
  weight wt1;
  cluster devlpmnt;
  var nge65 nvstnrs hhneedvn;
run;

The SURVEYMEANS Procedure

Data Summary

Number of Clusters          2
Number of Observations     40
Sum of Weights            100

Statistics

                          Std Error
Variable        Mean        of Mean           Sum      Std Dev
------------------------------------------------------------------------
NGE65       1.675000       0.019365    167.500000     1.936492
NVSTNRS     0.575000       0.019365     57.500000     1.936492
HHNEEDVN    0.525000       0.019365     52.500000     1.936492
------------------------------------------------------------------------

This example is taken from Lehtonen and Pahkinen's Practical Methods for Design and Analysis of Complex Surveys.

page 83, Table 3.6: Estimates from a one-stage CLU sample (n = 8); the Province'91 population.

NOTE:  The r = .25 in the proc surveymeans statement indicates that the sampling rate was .25.  You can use this option in any non-stratified design or in a stratified design in which the sampling rate was the same in each stratum.  The rate is used to calculate the fpc; hence, if the rate is omitted, an fpc will not be calculated.  The SAS keywords sum and std are used to modify the output.
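For reference (our addition, following the standard finite population correction): with sampling rate f, proc surveymeans scales the variance of an estimated total by (1 - f). Here the weighted cluster totals for ue91 are 4 x 2141 = 8564 and 4 x 1156 = 4624, with mean 6594, so

  var(total) = (1 - 0.25) x [2/(2 - 1)] x [(8564 - 6594)^2 + (4624 - 6594)^2] = 11,642,700,

and sqrt(11,642,700) = 3412.14, which matches the standard deviation of the ue91 total reported below.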

data page83;
  input id str clu wt ue91 lab91;
  fpc = 32;
  cards;
1 1 2 4 666 6016
2 1 2 4 528 3818
3 1 2 4 760 5919
4 1 2 4 187 1448
5 1 8 4 129  927
6 1 8 4 128  819
7 1 8 4 331 2543
8 1 8 4 568 4011
;
run;

proc surveymeans data = page83 r = .25 sum std;
  weight wt;
  strata str;
  cluster clu;
  var ue91 lab91;
run;

The SURVEYMEANS Procedure

Data Summary

Number of Strata            1
Number of Clusters          2
Number of Observations      8
Sum of Weights             32

Statistics

Variable        Sum        Std Dev
----------------------------------------
ue91          13188    3412.140091
lab91        102004          30834
----------------------------------------