Macros and ODS - University of New Mexico · james/STAT579-F18/SAS12.pdf

Macros and ODS

The first part of these slides overlaps with last week a fair bit, but it doesn’t hurt to review as this code might be a little harder to follow.

SAS Programming November 6, 2014 1 / 89

ODS in SAS studio

We’re now ready to try combining macros with ODS. Here is an example of a simulation study that we can do with what we have.

Suppose X1, ..., X10 are i.i.d. (independent and identically distributed) exponential random variables with rate 1. If you perform a t-test at the α = 0.05 level for whether or not the mean is 1, what is the type 1 error rate?

If X1, ..., X10 are normal with mean 1 and standard deviation σ, then you expect to reject H0 5% of the time when H0 is true. In this case, H0 is true (the mean is 1), but the assumption of normality in the t-test is violated, and this might affect the type 1 error rate.


ODS in SAS studio

First we’ll do one data set and run PROC TTEST once with TRACE ON to figure out what table we want to save.
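A minimal sketch of this step, assuming the sample is generated with RANEXP as in the later slides. The data set name one and the seed 2014 are placeholders rather than values taken from the original program.

```sas
/* One sample of 10 exponential(rate 1) observations.
   Data set name and seed are illustrative assumptions. */
data one;
  do j = 1 to 10;
    x = ranexp(2014);
    output;
  end;
run;

/* ODS TRACE ON writes the name of every output object to the log */
ods trace on;
proc ttest data=one h0=1;   /* H0: mean = 1 */
  var x;
run;
ods trace off;
```

The log should then list each table by name, which is what you need for an ODS OUTPUT statement.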


ODS in SAS studio

It looks like we want the third table, which was called TTests.


ODS in SAS studio

If you look at the data set in the Work folder, the name of the variable with the p-value is again Probt, although when PROC TTEST runs, it labels the variable by Pr > |t|.
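Here is a sketch of how the TTests table might be saved and inspected. The data set names one and pvalue1 are assumptions made for illustration.

```sas
/* Illustrative data: 10 exponential(rate 1) observations */
data one;
  do j = 1 to 10;
    x = ranexp(2014);
    output;
  end;
run;

/* Save the table named TTests into a data set in the Work library */
ods output TTests=pvalue1;
proc ttest data=one h0=1;
  var x;
run;

/* PROC CONTENTS shows the variable names (e.g. Probt) next to their labels */
proc contents data=pvalue1;
run;

proc print data=pvalue1;
run;
```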


ODS in a macro

Here’s a start: I run PROC TTEST 3 times on 3 generated data sets. This creates 3 small data sets with p-values.


ODS in a macro: using concatenation to put all p-values in one data set
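The two steps (repeated PROC TTEST runs, then concatenation) could be sketched as follows. The macro name simpvalues, its parameter reps, and the seeds are hypothetical and are not taken from the original slides.

```sas
/* Hypothetical macro: run PROC TTEST on &reps generated data sets,
   saving each TTests table, then stack the results. */
%macro simpvalues(reps);
  %local i;
  %do i = 1 %to &reps;
    data sim&i;
      do j = 1 to 10;
        x = ranexp(2014 + &i);   /* a different seed for each data set */
        output;
      end;
    run;

    ods output TTests=pvalue&i;
    proc ttest data=sim&i h0=1;
      var x;
    run;
  %end;

  /* Concatenate pvalue1, ..., pvalue&reps into one data set */
  data pvalues;
    set pvalue1-pvalue&reps;
  run;
%mend simpvalues;

%simpvalues(3)
```

The SET statement with several data sets stacks them one after another, so pvalues ends up with one row per simulated data set.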


ODS dataset of p-values

You should now be able to use your dataset of p-values to analyze your p-values. Particular questions of interest would be (1) what proportion of the p-values are below 0.05 (this estimates the type I error rate), and (2) the distribution of the p-values.



In my case, I had trouble doing anything directly with the dataset pvalues. Nothing seemed to print in my output (for example PROC MEANS). So, in my case, I just output my dataset pvalues to an external file and then read it in again using a new SAS program. This is slightly inelegant, but it means I can start from scratch in case any of my ODS settings changed things and caused problems.

This approach could also be useful in case I wanted to generate p-values in SAS and analyze them or plot them in another program like R, or if I just want to save those p-values for later reference.





Here I analyze the output of a SAS program using a second SAS program. pvalues2.txt is a cleaned up version of pvalues.txt that removed header information and so forth.



Note that the MEANS procedure calculated the mean of the 0s and 1s and got 0.084. This means there were 84 (out of 1000) observations that had a p-value less than 0.05. How did it get the standard deviation? Note that each observation is a Bernoulli trial, and the standard deviation of a Bernoulli trial is $\sqrt{p(1-p)}$, so we would estimate this to be $\sqrt{0.084(1-0.084)} = 0.2773878$. Why is this (slightly) different from the standard deviation reported by PROC MEANS?

Also, is there evidence that there is an inflated type I error? Is 0.084 significantly higher than α = 0.05?


Statistical inference and simulations

Sometimes we find ourselves using statistical inference just to interpret our own simulations, rather than for interpreting data.

Some scientists have the attitude that if a phenomenon is real, then you shouldn’t need statistics to see it in the data. I don’t share this point of view, because a lot of data is too complicated to interpret by “eye”, but I do sort of feel this way about simulations.

If you’re not sure whether 0.084 is significantly higher than 0.05 (meaning there really is inflated type 1 error), you could either get a confidence interval around 0.084, or you could just do a larger simulation so that you could be really sure without having to construct confidence intervals. In this case, the confidence interval does exclude 0.05, so there is evidence to think that the type 1 error rate is somewhat inflated due to violation of the assumption of normally distributed samples.
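For reference, here is one way the interval could be computed. The normal-approximation (Wald) interval is an assumption on my part, since the slides do not say which interval was used.

```latex
\begin{align*}
\hat{p} \pm 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}
  &= 0.084 \pm 1.96\sqrt{\frac{(0.084)(0.916)}{1000}} \\
  &\approx 0.084 \pm 1.96\,(0.00877) \\
  &\approx 0.084 \pm 0.017,
\end{align*}
```

This gives an interval of roughly (0.067, 0.101), which lies entirely above 0.05.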


What about the distribution of p-values?

The p-value is a random variable. It is a function of your data, much like $\bar{X}$ and $\hat{\sigma}^2$, so it is a sample statistic. What should the distribution of the p-value be under the null hypothesis for a test statistic? If you use α = 0.05, this means that 5% of the time the p-value should be below α. More generally, for any α,

$$P(\text{p-value} < \alpha) = \alpha.$$

The p-value therefore has the same CDF as a uniform random variable. So p-values should be uniformly distributed for appropriate statistical tests when the null hypothesis and all assumptions are true. This is true for tests based on continuous test statistics. For discrete problems, it might not be possible to have $P(\text{p-value} < \alpha) = \alpha$ for every α.


The distribution of p-values

Here is a more technical explanation of why p-values are uniformly distributed for continuous test statistics, when the null is true, and for hypotheses $H_0: \mu = \mu_0$, $H_A: \mu > \mu_0$ (i.e., I’ll just consider a one-sided test). For this one-sided test, the p-value is $P(T \ge t)$, where T is the test statistic, t is its observed value, and F is the CDF of T under the null.

$$1 - F(t) = P(T > t) = P(F(T) > F(t)) = 1 - P(F(T) \le F(t))$$

$$\Rightarrow P(F(T) \le F(t)) = F(t)$$

Because $0 \le F(t) \le 1$, this means that $F(T)$ has a uniform distribution (since it has the same CDF). If U is uniform(0,1), then so is $1 - U$, so $1 - F(T)$ is also uniform(0,1). But note that $1 - F(t) = P(T \ge t)$ is the p-value for an observed value t, so the p-value, viewed as the random variable $1 - F(T)$, is uniform(0,1).



Here is the distribution of the p-values represented by a histogram. Typically uniform distributions have flatter looking histograms with 1000 observations, so the p-values here do not look uniformly distributed. Again, this would be clearer if we did more than just 1000 simulations.

[Figure: histogram of the simulated p-values; x-axis: P-value]


A different way to simulate this problem

Instead of simulating 1000 data sets of 10 observations, I could have just simulated all of the data at once and indexed the 1000 sets of 10 observations (similar to what I did for the Central Limit Theorem example). In this case, I would want to use PROC TTEST separately on each of the 1000 experiments.

Moral: There’s more than one way to do things.


PROC TTEST using a BY statement
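A sketch of the BY-statement version, assuming the index variable is called i and the saved table is called pvalues (both names are illustrative). The seed is arbitrary.

```sas
/* 1000 samples of size 10 from an exponential with rate 1, indexed by i */
data sim;
  do i = 1 to 1000;
    do j = 1 to 10;
      x = ranexp(2014);   /* arbitrary seed */
      output;
    end;
  end;
run;

ods graphics off;             /* avoid one set of plots per BY group */
ods output TTests=pvalues;    /* one row per value of i */
proc ttest data=sim h0=1;
  by i;
  var x;
run;

/* Indicator of rejection at the 0.05 level */
data pvalues;
  set pvalues;
  reject = (Probt < 0.05);
run;

/* The mean of the 0/1 indicator estimates the type I error rate */
proc means data=pvalues mean std;
  var reject;
run;

proc sgplot data=pvalues;
  histogram Probt;
run;
```

Because sim is created in order of i, no PROC SORT is needed before the BY statement.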

[Figure: histogram of the p-values from PROC TTEST with a BY statement; x-axis: P-value]

How to do this in R?

It’s a little easier in R. Here’s how I would do it:

    x <- rexp(10000)
    x <- matrix(x, ncol=1000)   # 10 rows; each column is one sample of size 10
    pvalue <- numeric(1000)
    for(j in 1:1000) {
      pvalue[j] <- t.test(x[,j], mu=1)$p.value
    }
    sum(pvalue <= .05)/1000   # this is the type I error rate
    hist(pvalue)


Another example is a test for homogeneity of variances.

Textbooks often warn that Bartlett’s test can be used to test for equality of variances, but that it is extremely sensitive to the assumption of normality, even though many procedures, such as t-tests and ANOVA, are reasonably robust to assumptions of normality.

It is instructive to do a simulation to find out just how sensitive Bartlett’s test is to the assumption of normality. In this example, we’ll again create samples from two independent exponential distributions with rate λ = 1, so that both have equal variances of size $1/\lambda^2 = 1$. This time, we’ll let the sample size vary over n = 10, 20, 30, ..., 100 and see how the test does with increasing sample sizes. Again we’ll look at the type I error rate for the test. For t-tests, the Central Limit Theorem tells us that as the sample size increases, the distribution of $\bar{X}_n$ becomes increasingly similar to a normal distribution, so we expect the type I error rate to improve (get closer to α) as n increases. That argument doesn’t carry over in the same way to $S^2$, the estimate of the variance, so the result of increasing n isn’t as clear here.

Testing type 1 error for Bartlett’s test

There are different ways to test for homogeneity of variance in SAS depending on the procedure that you are using. To get a statistic for Bartlett’s test, you can use PROC GLM, which handles the two-sample case as a special case, although PROC GLM is much more general.

PROC GLM can also be used for doing ANOVA, including with unbalanced designs (PROC ANOVA is for balanced designs), MANOVA (multivariate ANOVA with multiple response variables as well as multiple independent variables), polynomial regression, random effects models, repeated measures, etc.



First, we’ll generate example data, keeping in mind that we want to generalize our parameters.



Note that PROC GLM wants data in the narrow style, NOT two columns, one for group A, one for group B. The data doesn’t have to be sorted by group.
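A sketch of what the narrow data set and the PROC GLM call could look like. The macro variables n and iter match the names used in the code on these slides, but their values here, the seed, and the alternating A/B layout are assumptions.

```sas
%let n = 10;     /* sample size per group (assumed value) */
%let iter = 1;   /* number of simulated experiments (assumed value) */

/* Narrow format: one observation per row, with a group label */
data sim;
  do i = 1 to &iter;
    do j = 1 to &n;
      group = "A"; x = ranexp(2014); output;
      group = "B"; x = ranexp(2014); output;
    end;
  end;
  keep group x i;
run;

/* HOVTEST=BARTLETT on the MEANS statement requests Bartlett's test */
proc glm data=sim;
  class group;
  model x = group;
  means group / hovtest=bartlett;
run;
quit;
```

With two groups, the F test for the group effect in PROC GLM is the two-sample comparison as a special case of one-way ANOVA.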



Note that we could have generated the data in a wide format or with group A generated first followed by group B. To generate in a wide format, we could have done this with only one output statement and different variables for the two groups.

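    data sim;
      do i=1 to &iter;
        do j=1 to &n;
          /* wide format: one row holds an observation from each group */
          x = ranexp(2014*&n + &iter);
          y = ranexp(2013*&n + &iter);
          output;
        end;
      end;
      /* no group variable exists in the wide layout, so it is not kept */
      keep x y i;
    run;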



To generate the As first then the Bs, we could have done this instead with an extra do loop, which still generates a narrow data set.

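    data sim;
      /* generate two exponentials for each
         combination of i and j */
      do i=1 to &iter;
        do j=1 to &n;
          /* the assignment of x was lost from the slide; this call is
             assumed to match the wide-format version above */
          x = ranexp(2014*&n + &iter);
          group = "A";
          output;
        end;
        do j=1 to &n;
          x = ranexp(2014*&n + &iter);   /* assumed, as above */
          group = "B";
          output;
        end;
      end;
      /* i is the iteration */
      keep group x i;
    run;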



Back to the original data. We’ll look at the output from PROC GLM and PROC TTEST to compare them.



Here’s output from PROC GLM.





By default, PROC TTEST does a test for equal variances using the folded F-test, whereas PROC GLM does not. Note that the p-value from PROC GLM matches the p-value from PROC TTEST when equal variances are assumed.



Put the trace on to figure out how to save the right table.



Look in the log file for the table.





Now we can extend to more iterations and use BY to get them in one data set.





We can scale this up to as many iterations as we want. Then we want to keep track of the number of p-values below 0.05 to get the type 1 error rate.
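One possible version of the scaled-up simulation for a single sample size. The ODS table name Bartlett and the p-value column ProbChiSq are what I recall PROC GLM producing, so treat both as assumptions and confirm them with ODS TRACE ON and PROC CONTENTS. The data set names are illustrative.

```sas
%let n = 10;        /* sample size per group */
%let iter = 1000;   /* number of simulated experiments */

data sim;
  do i = 1 to &iter;
    do j = 1 to &n;
      group = "A"; x = ranexp(2014); output;
      group = "B"; x = ranexp(2014); output;
    end;
  end;
  keep i group x;
run;

ods graphics off;
ods exclude all;              /* suppress displayed output */
ods output Bartlett=bart;     /* assumed table name; check with ODS TRACE */
proc glm data=sim;
  by i;
  class group;
  model x = group;
  means group / hovtest=bartlett;
run;
quit;
ods exclude none;

data reject;
  set bart;
  reject = (ProbChiSq < 0.05);   /* assumed name of the p-value column */
run;

/* The mean of the indicator is the estimated type 1 error rate */
proc means data=reject mean n;
  var reject;
run;
```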





Now we want to repeat this same idea but for different sample sizes n. Of course we could just repeat the code over and over and over again, changing the value of n. Or we can loop over different values of n. This creates a 3-level loop instead of 2 levels.






DO loops versus Macros

Instead of having an additional DO loop, I could have created a macro, say %macro bartlett(n,iter), and then run the macro multiple times for different values of n:

%bartlett(10,1000)

%bartlett(20,1000)

%bartlett(30,1000)

And so on. If your data step is getting too complicated, then this might be reasonable. The macro approach is also more flexible if you want different combinations of parameters, for example 1 million iterations for n = 10 but only 1000 iterations for n = 100 due to time constraints.
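A sketch of what such a macro might contain. Only the signature %macro bartlett(n,iter) comes from the slides. The body, the data set names, and the ProbChiSq column name are assumptions.

```sas
%macro bartlett(n, iter);
  /* &iter experiments, each with two exponential samples of size &n */
  data sim;
    do i = 1 to &iter;
      do j = 1 to &n;
        group = "A"; x = ranexp(2014); output;
        group = "B"; x = ranexp(2014); output;
      end;
    end;
    keep i group x;
  run;

  ods graphics off;
  ods exclude all;
  ods output Bartlett=bart&n;    /* assumed ODS table name */
  proc glm data=sim;
    by i;
    class group;
    model x = group;
    means group / hovtest=bartlett;
  run;
  quit;
  ods exclude none;

  data reject&n;
    set bart&n;
    n = &n;
    reject = (ProbChiSq < 0.05);   /* assumed p-value column name */
  run;

  title "Bartlett rejection rate, n = &n";
  proc means data=reject&n mean;
    var reject;
  run;
  title;
%mend bartlett;

%bartlett(10, 1000)
%bartlett(20, 1000)
```

Each call leaves behind its own reject&n data set, so results for different n can be stacked afterwards with a SET statement.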


Suppressing output?

Unfortunately, these simulations create a lot of output as they are. If you want to run a procedure to create an output data set, but not generate output, then you can do this for most procedures, for example by putting the NOPRINT option on the same line as the procedure while still creating an output data set, such as with OUTPUT out= in PROC MEANS. Unfortunately, the NOPRINT option in the procedure means that nothing is printed for ODS to use. As a result, when using ODS, you end up with lots of output. I’m not sure of a good way around this, but it is pretty annoying and slows SAS down to do a lot of I/O. You can reduce the output by only selecting what you will need to save:
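A sketch of selecting only the needed table, using the one-sample t-test simulation as the example. The data set names and the number of iterations are illustrative. An alternative that I believe also works is ODS EXCLUDE ALL before the procedure and ODS EXCLUDE NONE afterwards, since ODS OUTPUT keeps its own selection list; treat that as an assumption to check on your installation.

```sas
data sim;
  do i = 1 to 100;
    do j = 1 to 10;
      x = ranexp(2014);
      output;
    end;
  end;
run;

ods graphics off;             /* no plots for each BY group */
ods select TTests;            /* display only the table being saved */
ods output TTests=pvalues;
proc ttest data=sim h0=1;
  by i;
  var x;
run;
ods select all;               /* restore the default selection */
```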




Results


Results: interpretation

These results are based on 10,000 iterations. Since we are generating essentially Bernoulli random variables (reject or don’t reject H0), we can think of the mean of the rejection indicators as the proportion of rejections. This should be α = 0.05, but is higher, with 23% for n = 10, 27% for n = 20, and 29% for n = 50.

For 10,000 iterations, a confidence interval for these proportions has a margin of error of roughly $2\sqrt{(.25)(.75)/10000} = 0.009$. A handy rule of thumb is that for a binomial, the 95% confidence interval has margin of error less than or equal to $1/\sqrt{n}$, which is 1/100 for n = 10,000 (here n is the number of iterations, not the sample size). The point is that 29% appears to be significantly larger than 27%, so that type 1 error rates are increasing as the sample size increases.
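To make the arithmetic explicit, here is the margin-of-error calculation, the bound behind the rule of thumb, and a rough two-proportion comparison of 29% and 27%. The z statistic is an added check that assumes the two estimates come from independent simulations of 10,000 iterations each; it is not part of the original slides.

```latex
\begin{align*}
2\sqrt{\frac{(0.25)(0.75)}{10000}} &\approx 2\,(0.00433) \approx 0.009, \\
2\sqrt{\frac{p(1-p)}{n}} &\le 2\sqrt{\frac{0.25}{n}} = \frac{1}{\sqrt{n}}, \\
z = \frac{0.29 - 0.27}{\sqrt{\dfrac{(0.29)(0.71)}{10000} + \dfrac{(0.27)(0.73)}{10000}}}
  &\approx \frac{0.02}{0.00635} \approx 3.15 .
\end{align*}
```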


The moral of the story

The moral of the story is that increasing your sample size doesn’t always improve your inferences. In this case, the method is sufficiently non-robust that increasing the sample size makes it perform worse. So when textbooks say that Bartlett’s test isn’t very good for testing equality of variances, they really mean it, although they rarely explain why it is so bad.

So what exactly is Bartlett’s test testing? The usual description is that it tests $H_0: \sigma_1^2 = \sigma_2^2$ assuming that the two samples are from normally distributed populations. However, considering that the test is likely to reject H0 when $\sigma_1^2 = \sigma_2^2$ but the data are not normal, you could instead think of the null hypothesis as

$$H_0: X_1, \dots, X_n \overset{iid}{\sim} N(\mu_1, \sigma^2), \quad Y_1, \dots, Y_m \overset{iid}{\sim} N(\mu_2, \sigma^2)$$

i.e., the normality is part of what is being tested. In this case a rejection of H0 could mean either that the data are not normal or that the variances are unequal.


Statistical inconsistency

A related issue regarding increasing sample sizes is statistical inconsistency. An estimator $\hat{\theta}_n$ for a parameter θ (which might be an ordered tuple of parameters) is said to be statistically consistent if for any ε > 0 and for any θ ∈ Θ (the parameter space),

$$\lim_{n \to \infty} P(|\hat{\theta}_n - \theta| > \varepsilon) = 0$$

where n is the sample size. In other words, the estimator gets close to the actual parameter with high probability. You can have estimators that don’t have this property, so that increasing the sample size doesn’t increase the probability of your estimate being close to the true value.

If you work with a discrete parameter space, then statistical consistency requires that the probability of making the correct inference for the parameter approaches 1.


Statistical power

So far, we’ve focused on type 1 error. How about power? Power is nearly identical from the point of view of simulation. In this case, you simulate for some values for which H0 is false, and again count the number of times the null is rejected. In this case, the more frequently H0 is rejected, the better the method (assuming it has good type 1 error as well).

As an example, we’ll consider the power for testing $H_0: \mu_1 = \mu_2$ in a t-test when $X_1, \dots, X_n \overset{iid}{\sim} N(0, 1)$ and $Y_1, \dots, Y_n \overset{iid}{\sim} N(1, 1)$. In this case the variances are equal, but the means are different.
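A sketch of how this power simulation could be set up, following the same BY-group pattern as the type 1 error simulations. The range of sample sizes, the seed, and the data set names are assumptions. The pooled (equal-variance) row of the TTests table is used because the variances are equal here.

```sas
%let iter = 1000;   /* iterations per sample size (assumed value) */

data sim;
  do n = 10 to 50 by 10;
    do i = 1 to &iter;
      do j = 1 to n;
        group = "A"; x = rannor(2014);     output;   /* N(0,1) */
        group = "B"; x = rannor(2014) + 1; output;   /* N(1,1) */
      end;
    end;
  end;
  keep n i group x;
run;

ods graphics off;
ods exclude all;
ods output TTests=pvalues;   /* two rows per BY group: Pooled and Satterthwaite */
proc ttest data=sim;
  by n i;
  class group;
  var x;
run;
ods exclude none;

data reject;
  set pvalues;
  where Method = "Pooled";
  reject = (Probt < 0.05);
run;

/* Estimated power for each sample size */
proc means data=reject mean;
  class n;
  var reject;
run;
```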


Statistical Power

[Figure: power of the t-test for rejecting H0 when the Xs are i.i.d. N(0,1) and the Ys are i.i.d. N(1,1).]


Uses of Power Analyses

Why is it useful to study power?

The main reasons for studying power for a particular problem are

- sample size determination for study design
- determining the effect size detectable for a given sample size
- choosing between different methods
- investigating robustness of methods to model violation


Power: sample size determination

A power analysis can be useful for determining the sample size you need to have a good chance of rejecting H0 when H0 is false. This is useful for initially trying to decide what sort of sample size to aim for when designing a study, and can be useful in grant applications.

In the previous t-test example, if we believe that treatment 2 results in values an average of 1 point higher than treatment 1, then we can estimate that we’d need a sample size of roughly 20 people per group to have about 80% power to detect a difference. If you wanted a better than 80% chance of being able to reject H0, you’d want larger samples.



Often for grant proposals, you might try to justify your grant budget based on an estimated effect size (i.e., µ1 − µ2) based on preliminary data. The idea is that if you think the effect size might be some particular value, then you want the sample size to be large enough to have a reasonable chance (80% is often used) to reject the null hypothesis. Since larger samples require more money, this can be used to justify how much money you need to request for your study. Similarly, if you don’t justify your proposed sample size, then a reviewer might complain that your study is likely to be “underpowered”, meaning it is unlikely to detect anything if there is a difference (e.g., if a new drug is more effective).



Sample sizes for studies aren’t usually completely under the researcher’s control, but they are analyzed as though they are fixed parameters rather than random variables.

If you recruit people to be in a study, for example using flyers around campus, the hospital, etc., then you might have historical data to predict what a typical sample size would be based on how long and widely you advertise. Study designers can therefore often indirectly control the sample size.



Random sample sizes might be worth considering, however. For the t-test example, you might have better power to reject the null hypothesis if your sample sizes are equal for the two groups than if they are unequal. For example, suppose you are recruiting for testing whether a drug reduces headaches, and you recruit both men and women. Suppose you suspect that the drug is more effective for men than women.

If you recruit people for the study, you might not be in direct control of how many men versus women volunteer to be in the study. Suppose 55 women volunteer to be in the study and 45 men volunteer. You could make the sample sizes equal by randomly dropping data from 10 of the women, but this would be throwing away information. It is better to use information from all 100 study participants, although you might have less power with 45 men versus 55 women than with 50 of each sex.



On the other hand, if for your study you are collecting expensive information, such as doing MRIs for each participant, you might decide to accept the first n women volunteers and the first n men volunteers. A power analysis could help you decide whether it was important to have a balanced design or not.


Power: effect of unbalanced designs

How could we simulate the effect of unbalanced versus balanced designs? Assuming we knew that there were a fixed number of participants (say n = 100), we could compare the effect of a particular unbalanced design (for example 45 versus 55) versus the balanced design (50 per group). We could also let the number of men versus women in each iteration of a simulation be a binomial random variable, so that the degree of imbalance is random.


Power: determining effect size

In addition to graphing power as a function of sample size, it is common to plot power as a function of the effect size for a fixed sample size. Ultimately, power depends on three variables: α, n, and the effect size, such as µ1 − µ2 for the two-sample t-test example. We usually fix two of these variables and plot power as a function of the other variable.

The t-test example is easy to modify to plot power as a function of the effect size for a given sample size (say, n = 20).
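A sketch of that modification, assuming effect sizes from 0 to 3 in steps of 0.5, which is consistent with the axis range in the plots. The step size, seed, and data set names are assumptions.

```sas
%let n = 20;        /* fixed sample size per group */
%let iter = 1000;   /* iterations per effect size (assumed value) */

data sim;
  do effect = 0 to 3 by 0.5;
    do i = 1 to &iter;
      do j = 1 to &n;
        group = "A"; x = rannor(2014);          output;   /* N(0,1) */
        group = "B"; x = rannor(2014) + effect; output;   /* N(effect,1) */
      end;
    end;
  end;
  keep effect i group x;
run;

ods graphics off;
ods exclude all;
ods output TTests=pvalues;
proc ttest data=sim;
  by effect i;
  class group;
  var x;
run;
ods exclude none;

data reject;
  set pvalues;
  where Method = "Pooled";
  reject = (Probt < 0.05);
run;

/* One row per effect size; NWAY drops the overall summary row */
proc means data=reject noprint nway;
  class effect;
  var reject;
  output out=power mean=power;
run;

proc sgplot data=power;
  series x=effect y=power;
run;
```

At effect = 0 the null hypothesis is true, so the first point on the curve estimates the type 1 error rate rather than power.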









Power: plotting both sample size and effect size










Note that the data set sim that has all of my simulated data has 840,000 observations. SAS is still reasonably fast, and the log file gives information about how long it took.

NOTE: SAS Institute Inc., SAS Campus Drive, Cary, NC USA 27513-2414

NOTE: The SAS System used:

real time 22.28 seconds

cpu time 9.43 seconds

We could make the plots smoother by incrementing the effect size by a smaller value (say .01), although this will generate 50 times as many observations. When simulations get this big, you start having to plan them: how long will they take (instead of 30 seconds, will it take 25 minutes? 25 days?), how much memory will they use, and so on, even though this is a very simple simulation.


Length of simulations

The log file also breaks down how long each procedure took. Much of the time was actually due to generating the PDF file with ODS. From the log file:

NOTE: The data set WORK.SIM has 840000 observations and 5 variables.

NOTE: DATA statement used (Total process time):



NOTE: The data set WORK.PVALUES has 42000 observations and 9 variables.

NOTE: The PROCEDURE TTEST printed pages 1-21000.

NOTE: PROCEDURE TTEST used (Total process time):



...

NOTE: PROCEDURE SGPLOT used (Total process time):




Length of simulations

When designing simulations, there are usually tradeoffs. For example, suppose I don’t want my simulation to take any longer than it already has. If I want smoother curves, I could double the number of effect sizes I used, but then to keep the simulation the same length of time, I might have to use fewer iterations (say 500 instead of 1000). This would increase the number of data points at the expense of possibly making the curve more jittery, or even not monotonically increasing. There will usually be a tradeoff between the number of iterations and the number of parameters you can try in your simulation.


Length of simulations for R

If you want to time R doing simulations, the easiest way is to run R in batch mode. In Linux or Mac OS X, you can go to a terminal, and at the shell prompt, type

time R CMD BATCH myprogram.r

and it will give a similar printout of real time versus cpu time for your R run.



[Figure: power versus effect size; x-axis: mu1−mu2 (0.0 to 3.0), y-axis: Power (0.0 to 1.0), curves for n = 10, 20, 30]






Power: tradeoff between number of parameters and number of iterations (500 vs 100 iterations)

[Figure: two panels of power versus effect size; x-axis: mu1−mu2 (0.0 to 3.0), y-axis: Power (0.0 to 1.0), curves for n = 10, 20, 30 in each panel]


Using Power to select methods

As mentioned before, power analyses are useful for determining which method is preferable when there are multiple methods available to analyze data.

As an example, consider the two-sample t-test again when we have exponential data. Suppose we wish to test H0 : µ = 2 when λ = 1, so that the null hypothesis is false. Since the assumptions of the test are false, researchers might prefer using a nonparametric test.


Using Power to select methods

As an alternative, you can use a permutation test or other nonparametric test. Here we might wish to see which method is most powerful. If you can live with the inflated type 1 error for the t-test (or adjust for it by using a smaller α-level), then you might prefer it if it is more powerful.

A number of nonparametric procedures are implemented in PROC NPAR1WAY, as well as PROC MULTTEST. In addition, there are macros floating around the web that can do permutation tests without using these procedures.


Using power to select methods

Here we’ll try PROC NPAR1WAY and just one nonparametric method, the Wilcoxon rank-sum test (also called the Mann-Whitney test). The idea is to pool all of the data, then rank them. Then calculate the sum of the ranks for group A versus group B. With equal group sizes, the two sums should be approximately equal, with greater differences in the sums of the ranks being evidence that the mean for one group is larger than the mean for the other group.
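A sketch of how the Wilcoxon rejection rate could be estimated with PROC NPAR1WAY. It uses the procedure’s OUTPUT statement rather than ODS, so NOPRINT can be used. The variable P2_WIL is, as far as I recall from the documentation, the two-sided normal-approximation p-value, so verify the name with PROC CONTENTS. The sample sizes, the seed, and the choice of exponential means 1 and 2 for the two groups are assumptions.

```sas
%let iter = 1000;   /* iterations per sample size (assumed value) */

data sim;
  do n = 10 to 30 by 10;
    do i = 1 to &iter;
      do j = 1 to n;
        group = "A"; x = ranexp(2014);     output;   /* exponential, mean 1 */
        group = "B"; x = 2 * ranexp(2014); output;   /* exponential, mean 2 */
      end;
    end;
  end;
  keep n i group x;
run;

/* One row of Wilcoxon statistics per BY group */
proc npar1way data=sim wilcoxon noprint;
  by n i;
  class group;
  var x;
  output out=wtest wilcoxon;
run;

data wtest;
  set wtest;
  reject = (P2_WIL < 0.05);   /* assumed name of the two-sided p-value */
run;

/* Estimated power of the Wilcoxon test for each sample size */
proc means data=wtest mean;
  class n;
  var reject;
run;
```

Running PROC TTEST on the same sim data set with the same BY variables would give the matching power estimates for the t-test.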


Using power to select methods

Note that there are many other methods we could have selected, such as a median test or a permutation test. This is just to illustrate, and we are not necessarily finding the most powerful method.


Power: comparing methods









For these parameter values (exponentials with means of 1 and 2), the t-test was more powerful than the Wilcoxon test at all sample sizes. The Wikipedia article on the Mann-Whitney test says: “It [The Wilcoxon or Mann-Whitney test] has greater efficiency than the t-test on non-normal distributions, such as a mixture of normal distributions, and it is nearly as efficient as the t-test on normal distributions.”

Given our limited simulation, we have some reason to be a little bit skeptical of this claim. Still, we only tried one combination of parameters. It is possible that for other parameters or other distributions, the t-test is less powerful. Also, the t-test has inflated type 1 error, so the comparison might be a little unfair. We could re-run the experiment using α = .01 for the t-test and α = .05 for the Wilcoxon to make sure that both had controlled type 1 error rates.



Here’s an example from an empirical paper,




Speed: comparing methods

For large analyses, speed and/or memory might be an issue for choosing between methods and/or algorithms. This paper compared using different methods within SAS based on speed for doing permutation tests.


Use of macros for simulations

The author of the previous paper provides an appendix with lengthy macros to use as more efficient replacements for SAS procedures such as PROC NPAR1WAY and PROC MULTTEST, which for his data could crash or not terminate in a reasonable time.

In addition to developing your own macros, a common use of macros is to use macros written by someone else that have not been incorporated into the SAS language. You might just copy and paste the macro into your code, possibly with some modification, and you can use the macro even if you cannot understand it. Popular macros might eventually get replaced by new PROCs or new functionality within SAS. This is sort of the SAS alternative to user-defined packages in R.


From Macro to PROC

An example of an evolution from macros to PROCs is bootstrapping. For several years, to perform bootstrapping, SAS users relied on macros, often written by others. In bootstrapping, you sample your data (or the rows of your data set) with replacement and get a new dataset with the same sample size but with some of the values repeated and others omitted. For example, if your data is

-3 -2 0 1 2 5 6 9

bootstrap replicate data sets might be

-2 -2 1 5 6 9 9 9
-3 0 1 1 2 5 5 6

etc.


From Macro to Proc

Basically, to generate the bootstrap data set, you generate n random integers from 1 to n, with replacement, and extract those values from your data. This was done using macros, but now can be done with PROC SURVEYSELECT. If you search on the web for bootstrapping, you still might run into one of those old macros.
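A minimal sketch of bootstrap resampling with PROC SURVEYSELECT, using the eight data values from the previous slide. The data set names, the seed, the number of replicates, and the choice of the mean as the statistic are assumptions made for illustration.

```sas
data mydata;
  input x @@;
  datalines;
-3 -2 0 1 2 5 6 9
;
run;

/* METHOD=URS samples with replacement; SAMPRATE=1 keeps the sample size
   equal to n; OUTHITS writes one row for each time a value is selected */
proc surveyselect data=mydata out=boot seed=2014
     method=urs samprate=1 outhits reps=1000;
run;

/* Compute the statistic of interest (here the mean) in each replicate */
proc means data=boot noprint;
  by Replicate;
  var x;
  output out=bootmeans mean=xbar;
run;

/* Percentiles of the bootstrap means give a rough interval estimate */
proc univariate data=bootmeans noprint;
  var xbar;
  output out=bootci pctlpts=2.5 97.5 pctlpre=p;
run;

proc print data=bootci;
run;
```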

Newer methods might still be implemented using macros. A webpage from 2012 has a macro for bootstrap bagging, a method of averaging results from multiple classification algorithms. http://statcompute.wordpress.com/2012/07/14/a-sas-macro-for-bootstrap-aggregating-bagging/

There are also macros for searching the web to download movie reviews or extract data from social media. Try searching on "SAS macro 2013" for interesting examples.

