
STAT579: SAS programming


Conditional processing

There are several places where conditional processing can be used and is useful. Last week, we talked about (1) using IF statements to conditionally write things to a file in order to generate a report, and (2) using IF and IF-ELSE statements to create ranges for a variable, such as low, medium, and high.

There are lots of other ways that conditional processing can be useful, and we’ll go over some of them.

SAS Programming


Uses of IF statements

A common way to use an IF statement is to conditionally read in data. This can be useful for large datasets where you only want a subset of the data.

In this case, there is no need to read in all of the data initially and create a huge dataset which will take a lot of memory. Instead, you can create a smaller dataset to begin with.


Dataset

To work with a new and slightly larger dataset, we’ll use some crime data from a book called Crowding and Behavior by J. L. Freedman (1975). An example from this book uses crime data on 110 cities in the US and lists their percentage of nonwhite residents, population density, and crime rate per 100,000 residents.

The book argues that there is not a significant correlation between population density and crime rate. For now, we’ll be more interested in manipulating the data than using it for statistical analysis. It is often a good idea to look at the text file for the data to see if there is anything weird or problematic to begin with, and also to see how variables are delimited.


Dataset


City crime example

In this case, all variables appear to be numeric except for city, but there are some missing values. The city names can be longer than 8 characters, such as ALBUQUERQUE. Here you might think of two strategies:

(1) worry about the longest city name, or
(2) (maybe a slightly lazy way) just use the first 8 characters as an abbreviation for the city name.

Will this work?


Dataset

A more thorough look at the data reveals that some more care might be needed.


Dataset

A look at the next few cities reveals that the first 8 characters do not uniquely determine the name of the city, since there are two cities whose first 8 characters are identical:

CHARLESTON.SC

CHARLESTON.W.VA

One nice feature of the data is that spaces in city names are replaced by periods, e.g.,

LOS.ANGELES

But you might need to be cautious in assuming that this convention is applied throughout.


Dataset

The main question of interest for this data is the relationship between population density and crime rate, but first we will do some processing just to describe the data. Here are some questions of interest. For some we will use conditional processing, while others will require more ideas later on.

1. What proportion of observations have at least one missing value? What proportion are missing for each variable?
2. Is there a relationship between missing values and the values of any other variables? (Does the data appear to be missing at random?)
3. Create a dataset that only has cities with crime rates above 4000 per 100,000 people.
4. How does Albuquerque rank on the list?
5. How do we create a variable that ranks the crime rate for each city?
6. How can we print out the city in each state with the highest (or lowest) crime rate?


Dataset

Some of these questions will be addressed a bit later. For now, we can address the first few questions.

First, how can we count how many observations have missing values?


Describing missing values

There are a couple of ways of counting missing values, partly depending on whether they are missing for character versus numeric values. The most straightforward is to use an IF statement:
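The original slide's code is not preserved here, so the following is only a minimal sketch of the idea. It assumes a dataset crimeRates with a numeric variable density; the output dataset name crimeMissing and the indicator name missing are hypothetical.

```sas
/* Sketch under assumptions: crimeRates and density exist;        */
/* crimeMissing and missing are illustrative names only.          */
data crimeMissing;
    set crimeRates;
    /* missing is 1 when density is missing; otherwise it is left as . */
    if density = . then missing = 1;
run;
```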


Describing missing values

The previous approach sets the missing variable to missing for all nonmissing observations. It might be better to code these as 0. Then the proportion of 1s in the missing column is the proportion of observations that are missing.
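A possible way to write the 0/1 coding, again assuming a dataset crimeRates with a numeric variable density (the names crimeMissing and missing are hypothetical). Since the mean of a 0/1 indicator equals the proportion of 1s, proc means can report the proportion directly.

```sas
/* Sketch under assumptions: crimeRates and density exist. */
data crimeMissing;
    set crimeRates;
    if density = . then missing = 1;
    else missing = 0;
run;

/* The mean of a 0/1 indicator is the proportion of 1s. */
proc means data=crimeMissing n mean;
    var missing;
run;
```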


Describing missing values

You could also just read in the values that are non-missing, but then you lose the ability to find patterns in the missingness.


Conditional reading-in of the data

Note that the syntax on the previous slide is a little unusual. Instead of writing

if condition then input .....

you put the IF statement after the action, the INPUT statement.
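As an illustration of this pattern (not the slide's original code, which is not preserved), a subsetting IF placed after INPUT might look like the sketch below. The file name crime.txt, the column order, the variable name pop, and the $20. width for city are all assumptions, as is the convention that missing numeric values appear as a period in the file.

```sas
/* Sketch under assumptions: file name, column order, pop, and $20. are hypothetical. */
data crimeComplete;
    infile "crime.txt";
    input city :$20. nonwhite pop density crime;
    /* Subsetting IF: the observation is kept only when the condition is true. */
    if density ne .;
run;
```

Every record is still read by INPUT; the subsetting IF only decides whether the observation is written to the output dataset.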


Counting missing values

An alternative way of counting missing values is to use the cmiss function in SAS.
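A minimal sketch of how cmiss might be used here. CMISS accepts both character and numeric arguments and returns the number that are missing. The variable names other than city, nonwhite, density, and crime (that is, pop, nMissing, and anyMissing) are hypothetical.

```sas
/* Sketch under assumptions: pop, nMissing, anyMissing are illustrative names. */
data crimeMissing;
    set crimeRates;
    /* number of missing values in this observation */
    nMissing = cmiss(city, nonwhite, pop, density, crime);
    /* 1 if at least one value is missing, 0 otherwise */
    anyMissing = (nMissing > 0);
run;
```

Listing the variables explicitly avoids counting the newly created variable itself, which can happen with a list such as of _all_ on the right-hand side of an assignment.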


Using proc means and proc freq to count missing data

The procedures proc means and proc freq can also be used to count missing data for numeric and character variables, respectively. These provide summaries, which aren’t enough to find patterns in the missing data.
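One way this could look, as a sketch rather than the slide's original code. The variable name pop and the format name $missfmt are hypothetical; the format simply groups city into missing and non-missing so that proc freq prints two rows instead of one row per city.

```sas
/* Numeric variables: N and NMISS give nonmissing and missing counts. */
proc means data=crimeRates n nmiss;
    var nonwhite pop density crime;   /* pop is an assumed variable name */
run;

/* Character variables: group values into Missing / Not missing. */
proc format;
    value $missfmt ' ' = 'Missing' other = 'Not missing';
run;

proc freq data=crimeRates;
    format city $missfmt.;
    tables city / missing;
run;
```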


Using proc means and proc freq to count missing data


Patterns in the missing data

To look at patterns in the missing data, you could create a subset of observations that only has missing data. This will be useful if there isn’t too much missing data and not too many variables. In this case, it is revealing.
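A sketch of such a subset, assuming the dataset is called crimeRates (the output name onlyMissing is hypothetical). Here cmiss(of _all_) counts missing values across all variables in the observation, and the subsetting IF keeps only observations with at least one.

```sas
/* Sketch under assumptions: onlyMissing is an illustrative name. */
data onlyMissing;
    set crimeRates;
    if cmiss(of _all_) > 0;
run;

proc print data=onlyMissing;
run;
```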


Patterns in the missing data using WHERE

So a clear pattern in the data is that population size and density are always missing together. We might also be concerned with whether missing density values are associated with either the percentage of nonwhite residents or crime rates of the cities. In other words, do cities with missing density values tend to have higher crime rates?

Similarly to the subsetting IF, you can run proc means or other procedures conditional on certain values in the data set. However, instead of using IF, you use WHERE.


Running proc means conditional on a particular value
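The code from this slide is not preserved, so here is a hedged sketch of what running proc means with WHERE could look like. It assumes that missing3 is a 0/1 indicator in crimeRates with 1 meaning density is missing; that coding is an assumption.

```sas
/* Summary statistics for cities with missing density (assumed coding: missing3 = 1). */
proc means data=crimeRates;
    where missing3 = 1;
    var crime nonwhite;
run;

/* The same summary for cities with density present. */
proc means data=crimeRates;
    where missing3 = 0;
    var crime nonwhite;
run;
```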


Running proc means conditional on a particular value


Looking for correlations

To see if there is an association between missingness and either the crime or nonwhite variables, we can use the proc corr procedure, which generates a correlation matrix.
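A minimal sketch, assuming missing3 is a numeric 0/1 missingness indicator already stored in crimeRates:

```sas
/* Correlation matrix (with p-values) for the indicator and the two fully observed variables. */
proc corr data=crimeRates;
    var missing3 crime nonwhite;
run;
```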


Plotting the data

It’s always a good idea to plot your data as well. Associations between variables might show up if the two groups of observations (with missing data versus no missing data) have different distributions. The correlations are pretty weak and nonsignificant, but it is still good to look at the plots.

The older SAS graphics were purely text based, which has a certain charm. It also means you can create plots that take very little memory, since they can be saved as plain text files. The text-based procedure is proc plot (older); proc gplot produces graphical output instead.

/* also try gplot instead of plot */
proc plot data=crimeRates;
    /* proc plot uses a PLOT statement (vertical*horizontal), not VAR */
    plot crime*missing3;
run;


Results of proc plot, saved as pdf

[Text plot of crime*missing3, output dated Tuesday, September 2, 2014. Legend: A = 1 obs, B = 2 obs, etc. Vertical axis: crime, 0 to 6000. Horizontal axis: missing3, with values 0 and 1.]


Results of proc gplot, saved as pdf

[Scatterplot of crime against missing3, output dated Tuesday, September 2, 2014. Vertical axis: crime, 0 to 6000. Horizontal axis: missing3, 0.0 to 1.0.]


Questions about the data

From the questions about the data earlier on in the slides, we have found that there were 10 cities with missing density and population values, and that these values were always missing together when either was missing. The crime and nonwhite variables were never missing. Based on correlations and plots, there is no obvious relationship between missingness and the nonwhite or crime variables. There is no way to assess whether missingness is related to population density or population size, however, based on the data.


Creating a subset of the data

To create a subset of the data with cities that have crime rates above 4,000 per 100,000 people, we can do the following:

data crimeHigh;

set crimeRates;

if crime > 4000;

run;

proc print;

run;


High crime cities


Sorting by crime

From looking at the previous subset of the data with the highest crime cities, it is easy to pick out Albuquerque’s rank among the 110 cities. Albuquerque has the second highest crime rate, with San Francisco being the highest.

To answer the general question of how cities rank, you would need to sort them by crime rate. This can be done in SAS using proc sort:

proc sort data=crimeRates out=crimeRanked;

by descending crime;

run;


High crime cities (top 10)
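The slide's output is not preserved. As a sketch, a rank variable and a top-10 listing could be produced as follows, assuming crimeRanked is the dataset sorted by descending crime. The variable name rank and the output dataset name crimeWithRank are hypothetical. Ties are not handled specially here; proc rank would be an alternative.

```sas
/* After sorting by descending crime, the observation number _n_ is the rank. */
data crimeWithRank;
    set crimeRanked;
    rank = _n_;
run;

/* The OBS= dataset option limits printing to the first 10 observations. */
proc print data=crimeWithRank (obs=10);
    var rank city crime;
run;
```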


Sorting data for proc means BY processing

Sorting data is also needed to analyze data separately for different levels of a categorical variable. For example, instead of running proc means separately for different values of missing3, I could create a categorical variable indicating missingness and run proc means on both values within a single procedure.


Using BY in proc means
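The slide's code is not preserved. A minimal sketch of BY processing, assuming missing3 is the missingness indicator in crimeRates (the sorted dataset name crimeByMissing is hypothetical):

```sas
/* BY processing requires the data to be sorted by the BY variable first. */
proc sort data=crimeRates out=crimeByMissing;
    by missing3;
run;

/* One set of summary statistics is produced for each value of missing3. */
proc means data=crimeByMissing;
    by missing3;
    var crime nonwhite;
run;
```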


BY command

Using the BY command is a powerful feature of SAS. It allows you to generate reports, tables, graphics, or data analysis on different subsets of your data automatically. Here we did it for just levels of a missingness variable, but you could also do separate analyses for different states, different cities, etc. depending on the data.

We also just used the BY command in proc means, but it can be used in many other procedures as well.

Unfortunately, to use the BY command, you have to sort the data first based on the variables you are using BY with. Other procedures, such as the histogram, also require data to be sorted first.


Sorting issues

In the example of the sort we did earlier, a SAS dataset was sorted and saved as a second SAS dataset, which is a common way to sort data, so that the original order of the data is preserved. If you do not specify a new dataset for the sort procedure to output to, it will sort your original data, which you might or might not want.


Sorting speed

SAS is fast at sorting. I attempted a little experiment. I generated 4 million uniform random variables (in R, I have to admit, it is so easy there!) and saved them all to a file. This was about as much as I could do to stay within my 50Mb quota on my UNM linux account. Nevertheless I transferred as much of the file as I could using

scp uniform.txt [email protected]

To test how fast SAS can sort this, I use the time command in linux (also works in OS X):


Sorting speed: SAS on UNM computer

[jamdeg@polaris SAS]$ cat sort.sas

data uniform;

infile "uniform.txt";

input u;

run;

proc sort data=uniform;

by u;

run;

[jamdeg@polaris SAS]$ time sas sort.sas

Starting SAS ....

The http://Fastinfo.unm.edu entry for SAS is:

http://unm.custhelp.com/cgi-bin/unm.cfg/php/enduser/std_adp.php?p_faqid=6341

real 0m0.002s

user 0m0.001s

sys 0m0.001s


Sorting speed: R on my laptop

To run R code in batch mode, you type R CMD BATCH myfile.r at a linux/OS X prompt. This is EXTREMELY useful for simulations. To see how long it takes, you can put the linux command time in front. Here’s how long it took on my MacBook Pro to read in and sort the same data set:

Jamess-MacBook-Pro:superjames$ time R CMD BATCH sort.r

real 0m56.772s

user 0m56.084s

sys 0m0.603s

This isn’t a fair comparison because I’ve used different computers. Unfortunately SAS is available on linux.unm.edu but R isn’t, and R is available on my laptop, but SAS isn’t. C’est la vie!


sorting code

SAS:

data uniform;

infile "uniform.txt";

input u;

run;

proc sort data=uniform;

by u;

run;

R:

Jamess-MacBook-Pro:Music superjames$ cat sort.r

x <- read.table("uniform.txt")

y <- sort(x$V1)  # read.table returns a data frame; sort its single column (default name V1)


SAS boxplot using sorting on missing

Note that the boxplot command requires that the data already be sorted by the missing2 variable.
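The slide's code is not preserved. A sketch of the sort-then-boxplot sequence, assuming missing2 is a missingness indicator in crimeRates (its exact definition is not given in these slides, and the dataset name crimeSorted is hypothetical):

```sas
/* proc boxplot expects the data to be sorted by the group variable. */
proc sort data=crimeRates out=crimeSorted;
    by missing2;
run;

/* PLOT analysis-variable * group-variable: one box per value of missing2. */
proc boxplot data=crimeSorted;
    plot crime*missing2;
run;
```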


Finding the longest character string

Here I used the length function to determine the length of the string, created a new variable based on this, and sorted by length. So that I didn’t have to print the entire data set, I used an option in proc print to only print the first 10 observations.
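The steps described above can be sketched as follows. This is a reconstruction under assumptions, not the slide's original code; the dataset name cityLength and variable name len are hypothetical.

```sas
/* length() returns the position of the last non-blank character. */
data cityLength;
    set crimeRates;
    len = length(city);
run;

/* Longest names first. */
proc sort data=cityLength;
    by descending len;
run;

/* OBS=10 prints only the first 10 observations. */
proc print data=cityLength (obs=10);
    var city len;
run;
```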


String functions in SAS

SAS has MANY string functions which are useful for manipulating text. I think of data as fundamentally being strings rather than numbers. This is more a point of view than anything else. Obviously, we can think of it all as 0s and 1s. Often strings can be interpreted as numbers, but often it is unnatural to force them to be. Data can be any kind of information: image files, text files, html files, lists of email addresses, etc. So it is useful to have string functions that can manipulate strings in different ways.

Example. Suppose you have a list of 100,000 email addresses, and your job is to determine how many addresses are .edu versus .com versus .org, etc. What would you need to determine this? You’d have to be careful about weird email addresses [email protected], [email protected], and so forth.
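One possible approach to the email example, as a sketch only. It uses scan with a negative word count to take the last period-delimited piece of each address, and lowcase so that .EDU and .edu are counted together. The addresses in the datalines block are made up for illustration.

```sas
/* Sketch with made-up addresses; dataset and variable names are illustrative. */
data emails;
    length address $60;
    input address $;
    /* scan(..., -1, ".") returns the last piece after splitting on periods */
    domain = lowcase(scan(address, -1, "."));
    datalines;
alice@example.edu
bob@mail.example.com
carol@example.org
dave@EXAMPLE.EDU
;
run;

/* Count how many addresses fall in each top-level domain. */
proc freq data=emails;
    tables domain;
run;
```

Because only the final piece is used, an address with several periods (such as a subdomain) is still classified by its ending.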


Ranking cities by state

The last question is the hardest. Some states have only one city on the list (e.g., New Mexico), while others have several cities (California, Texas, Ohio, etc.). Suppose we want to create a variable that has the state for each city. Then we might want to sort by state, then sort cities within states.

Furthermore, we could sort alphabetically by state, then cities alphabetically within states; or alphabetically by state, then by crime rate within each state.


Ranking cities by state

First, how do we get the states for each city? Often we might have separate data files that when combined have the information we need. We might have one file with cities with demographic and crime statistics, and another file with states for each city.

For the current dataset, you could probably manually enter the state for each city either by modifying the data file or by having a long IF statement like this

data crime;

...

if city = "ALBUQUERQUE" then state = "NM";

else if city = "BOSTON" then state = "MA";

and so on.


Ranking cities by state

Although this would work for this data set (it doesn’t take too long to enter the states for 110 cities, even if you have to look most of them up), it would be better to be able to MERGE the data from a list of cities in another data set. If you had more cities, or county level data (there are about 3000 counties in the US), then doing it by hand would not be feasible.
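A minimal sketch of such a merge, assuming a second dataset (here given the hypothetical name cityStates) that contains the variables city and state, with city spelled exactly as in crimeRates. The other dataset names are also illustrative.

```sas
/* Both datasets must be sorted by the merge key. */
proc sort data=crimeRates out=crimeSorted;
    by city;
run;

proc sort data=cityStates out=statesSorted;   /* cityStates is hypothetical */
    by city;
run;

/* Match-merge: each city picks up its state from statesSorted. */
data crimeWithState;
    merge crimeSorted statesSorted;
    by city;
run;
```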


Sorting on two variables

Supposing you do have a variable that lists the state for every city, you can then sort on two variables.
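The slide's code is not preserved. As a sketch, assuming crimeRates already contains a state variable (the output dataset names are hypothetical), sorting on two variables and then picking the highest-crime city in each state could look like this:

```sas
/* Sort alphabetically by state, then by decreasing crime rate within each state. */
proc sort data=crimeRates out=crimeByState;
    by state descending crime;
run;

/* first.state is 1 for the first observation in each state, */
/* which after the sort is the city with the highest crime rate. */
data topByState;
    set crimeByState;
    by state;
    if first.state;
run;

proc print data=topByState;
    var state city crime;
run;
```

Using last.state instead of first.state would give the city with the lowest crime rate in each state.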


The IN operator

For now we’ll assume that we have the state variable in our data set as well. We could also create variables to indicate the region of a certain state. For example, suppose we want a list of cities for states on the East Coast. We’ll define the East Coast as being the following states:

- Florida (FL)
- South Carolina (SC)
- North Carolina (NC)
- Virginia (VA)
- Delaware (DE)
- Pennsylvania (PA)
- New Jersey (NJ)
- New York (NY)
- Connecticut (CT)
- Rhode Island (RI)
- Massachusetts (MA)
- New Hampshire (NH)
- Maine (ME)


The IN operator

Clearly you could do something like this:

data crimeRegions;

set crimeRates;

if state="FL" or state="SC" or state="NC" ... or state = "ME"

then region = "East Coast";

run;

where the ... is filled in.


The IN operator

A slightly more compact way is to write

data crimeRegions;

set crimeRates;

if state in

("FL","SC","NC","VA","DE","PA","NJ","NY","CT","RI","MA","NH","ME")

then region = "East Coast";

run;


The IN operator

For the IN operator, SAS checks whether the character variable matches one of the items in a list.

The IN operator can also be used with numeric variables, such as

proc print data=crimeRates;

where crime in (0:2000,4000:6000);

run;

which will print the observations with the lowest and highest crime rates.


IN operator

There are usually many ways to do things. Usually, you will find ways that you like and stick with them, which will become part of your coding style. The previous example could also be achieved using

proc print data=crimeRates;

where 0 <= crime <= 2000 | 4000 <= crime <= 6000;

run;


IN operator

The IN operator on numeric data could also be used to pick out unusual observations in a data set called mydata. For example, suppose observations 7, 518, 1009, and 2112 have been flagged as being unusual. It is difficult to look at them simultaneously in the original file because they are widely separated. You could then do this:

data _null_;

set mydata;

file "outliers.txt";

if _n_ in (7,518,1009,2112) then put _all_;

run;

This code prints the weird observations to a separate file so that they can be examined more easily together. The _all_ keyword means that SAS will also print out the _ERROR_ variable (which is 0 if there is no error) and the _N_ variable itself along with the user-defined variables. (Try this yourself on your own data.)


WHERE statement

The WHERE statement works much like the subsetting IF but can be used in procedures (IF cannot be used in procedures). WHERE specifies a condition that tells SAS to only apply the procedure to those observations meeting the condition.

As an example, you could print only those cities where crime was greater than 4000:

proc print data=crimeRates;

where crime>4000;

var city crime;

run;


WHERE statement

WHERE has many built-in operators that make it behave in special ways:

Operator       Description                   Example
IS MISSING     Matches a missing value       where Subj is missing
IS NULL        Equivalent to IS MISSING      where Subj is null
BETWEEN AND    An inclusive range            where age between 20 and 40
CONTAINS       Matches a substring           where Name contains "Mac"
LIKE           Matching with wildcards       where Name like "R_n%"
=*             Phonetic matching             where Name =* "Nick"


WHERE statement with operators
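The slide's example is not preserved. As an illustration only, a few of these operators applied to the crime data might look like the sketch below, assuming the dataset crimeRates with variables city, density, and crime.

```sas
/* IS MISSING: cities with no density value. */
proc print data=crimeRates;
    where density is missing;
    var city crime;
run;

/* BETWEEN ... AND: an inclusive range of crime rates. */
proc print data=crimeRates;
    where crime between 2000 and 3000;
    var city crime;
run;

/* CONTAINS: substring match, here picking up both CHARLESTON entries. */
proc print data=crimeRates;
    where city contains "CHARLESTON";
    var city crime;
run;
```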


WHERE statement with wildcards

Wildcards work with _ meaning one character, and % being a string of any length (including an empty string). You can have multiple wildcards in one statement. For example, if you were printing files that were either .dat or .txt, and name is a variable in your program, you might use

where name like "%.__t";

Here I used two underscores in a row. In this case you can also use

where name like "%.txt" or name like "%.dat";

so you can apply logical operators to the condition of a WHERE statement, similar to IF statements.


Wildcards in linux/MAC OS X/MS-DOS (Windows)

Wildcards in operating systems work much the same way as in SAS except that * is used for a string of any length and ? is used for a single character. For example

ls *.sas

lists all files ending with extension .sas in the current directory. In MS-DOS, use dir *.sas instead.

To list the contents of all subdirectories, you can do this

ls */*

ls */*/*

etc


Wildcards for listing files

Large data projects can generate directories with thousands of files, and this can crash your computer if you try to search the directory using Finder (Mac) or Explorer (Windows). Directories with lots of files can also be hard to search, so using wildcards can narrow down the possibilities. If there are too many files, you can’t even list them directly. For example, if I want to list all my mp3 files in my iTunes folder, I try the following from the music folder (in which each directory is an artist/band/composer).

Jamess-MacBook-Pro:Music superjames$ ls */*/*.mp3

-bash: /bin/ls: Argument list too long

It is too long because I probably have close to 10,000 mp3 files, and the shell expands the wildcard into one argument per matching file before ls ever runs. So instead I settle for listing a few of them:

Wildcards for listing files

Jamess-MacBook-Pro:Music superjames$ ls A*/*/*.mp3

AC_DC/Back In Black/01 ’’’’’’’.mp3

AC_DC/Back In Black/02 Shoot To Thrill.mp3

AC_DC/Back In Black/03 What Do You Do For Money Honey.mp3

AC_DC/Back In Black/04 Givin’ The Dog A Bone.mp3

AC_DC/Back In Black/05 Let Me Put My Love Into You.mp3

AC_DC/Back In Black/06 Back In Black.mp3

AC_DC/Back In Black/07 You Shook Me All Night Long.mp3

AC_DC/Back In Black/08 Have A Drink On Me.mp3

AC_DC/Back In Black/09 Shake A Leg.mp3

AC_DC/Back In Black/10 Rock And Roll Ain’t Noise Pollution.mp3

Al Di Meola/Elegant Gypsy/01 Track 1.mp3

Al Di Meola/Elegant Gypsy/02 Track 2.mp3

Al Di Meola/Elegant Gypsy/03 Track 3.mp3

Al Di Meola/Elegant Gypsy/04 Track 4.mp3

Al Di Meola/Elegant Gypsy/05 Track 5.mp3

Al Di Meola/Elegant Gypsy/06 Track 6.mp3

WHERE statement with operators
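
As an illustration of operators that can appear in a WHERE expression, here is a small sketch on made-up data (the data set scores and its variables are hypothetical). It uses BETWEEN-AND, IN, CONTAINS, and IS MISSING, which are among the operators SAS documents for WHERE expressions. It is not meant as a complete list.

```sas
/* Made-up data for illustration; a period marks a missing value */
data scores;
    length name $ 12;
    input name $ grade $ score;
    datalines;
Alice A 93
Bob C 71
Carol B 85
Dave . .
;
run;

/* BETWEEN-AND: an inclusive numeric range */
proc print data=scores;
    where score between 80 and 95;
run;

/* IN: membership in a list of values, combined with CONTAINS (substring match) */
proc print data=scores;
    where grade in ("A", "B") and name contains "ar";
run;

/* IS MISSING: select observations with a missing value */
proc print data=scores;
    where score is missing;
run;
```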

WHERE vs IF

To process a subset of the observations in a PROC, you should use the WHERE statement. The IF statement is only available in the DATA step.

However, in the DATA step, you could use WHERE instead of IF in many cases. For many cases, the difference will be transparent to the user. Technically, a difference is that with the IF statement, data is read into SAS, into what is called the Program Data Vector (PDV), then tested. For a WHERE statement, the data is tested first, then read into the PDV.

WHERE vs IF

Because the data is read in first with the IF statement, it can be less efficient than WHERE, which does not have to (temporarily) store the data in the PDV for testing. This can make more of a difference on larger data sets.

However, there are cases where the IF statement is still needed and a WHERE statement doesn’t work. The subsetting IF is required instead of WHERE if

- the selection is based on anything other than what already exists in a SAS dataset, such as if you are reading in data from an external file

- the selection is made later in the data step rather than at the beginning

- the selection is made conditionally (e.g., nested IF statements)
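
The distinction can be sketched with a small example. The data set patients, its variables, and the derived variable decade are all made up for illustration. The first subset uses WHERE because age already exists in the input data set. The second needs a subsetting IF because the selection depends on a variable created later in the same step.

```sas
/* Made-up data for illustration */
data patients;
    input id $ age;
    datalines;
A01 72
A02 34
A03 65
;
run;

/* WHERE works: age already exists in the data set PATIENTS */
data seniors;
    set patients;
    where age >= 65;
run;

/* Subsetting IF is needed: decade is created inside this step,
   so a WHERE statement cannot refer to it */
data seventies;
    set patients;
    decade = 10 * floor(age / 10);
    if decade = 70;
run;

/* In a PROC, only WHERE is available */
proc print data=patients;
    where age >= 65;
run;
```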

Student questions

I’ll show a couple of slides that answer some student questions that came up and that I think are instructive.

First, what happens if one variable is set equal to another variable that has a format?

Format question

[jamdeg@mizar SAS]$ cat date.txt
09/03/2014
[jamdeg@mizar SAS]$ cat date.sas
data date;
  informat time mmddyy10.;
  infile "date.txt";
  input time;
  format time mmddyy8.;
  time2 = time;
run;

proc print data=date;
run;

Format question

[jamdeg@mizar SAS]$ cat date.lst
The SAS System          16:43 Wednesday, September 3, 2014   1

Obs        time    time2
  1    09/03/14    19969
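
The output suggests that a format belongs to a variable rather than to its value. The assignment copies the underlying number into time2 (SAS stores dates as days since January 1, 1960, and 19969 corresponds to September 3, 2014), but it does not copy the format. A minimal sketch of one fix is to name both variables in the FORMAT statement. To keep the example runnable on its own, it reads the date from in-stream data instead of date.txt.

```sas
data date;
    input time mmddyy10.;          /* read the date with the same informat */
    time2 = time;                  /* copies the value, not the format */
    format time time2 mmddyy8.;    /* attach the format to both variables */
    datalines;
09/03/2014
;
run;

proc print data=date;
run;
```

With this change both columns should print as 09/03/14.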

infile "/home/james/data.txt"; versus infile "~/home/james/data.txt";

Several students have gotten errors by trying to put the tilde symbol at the beginning of their path. When using SAS Studio, your files are uploaded to the directory /home/[name]/. If you specify the full path for the file, you don’t need to use the tilde. The tilde is shorthand for whatever your home directory is.

If you do use the tilde successfully, the log in SAS will show you the full path it used to find the file.

infile "/home/james/data.txt"; versus infile "~/home/james/data.txt";

As an example, if I’m in SAS Studio and I type

infile "~/home/james/data.txt";

then I might get an error in my log file like this:

58 data address;

59 infile "~/home/james/data.txt";

60 input name :$41. address :$41. city :$41.; run;

ERROR: Physical file does not exist,

/home/james/home/james/data.txt

What has happened is that SAS substituted /home/james/ for the tilde symbol, but this is redundant: you DON’T need both the tilde and the full path. The tilde approach is mainly useful on Linux-type systems, so it might be safer to use the full path (but then the path might change depending on the computer you are on).
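
Either of the following forms should point at the same file, assuming (as in the example above) that the home directory is /home/james and that data.txt sits directly inside it. This is a sketch only. The tilde form relies on SAS running on a Unix-type host such as the one behind SAS Studio.

```sas
/* Option 1: full path, no tilde */
data address;
    infile "/home/james/data.txt";
    input name :$41. address :$41. city :$41.;
run;

/* Option 2: the tilde stands in for /home/james, so the rest of the path starts after it */
data address;
    infile "~/data.txt";
    input name :$41. address :$41. city :$41.;
run;
```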

Finding the path for your file

To find the path for a file, look up the file in your system (for example using My Computer in Windows) and click on Properties, and hopefully you’ll be able to find the path. For more details try googling “finding the path to a file windows”. There are slightly different ways depending on the version of Windows you are using.

In Mac OS X, you can go to a file in the Finder, click on the file while holding down the Control key, and click on “Get Info”, and this will show information about the file including the path (under Where:). Alternatively, if you are in the terminal program on Mac OS X or Linux, you can type pwd, which shows the path to the current directory.

Question 3

What if you have multiple observations per row? This can happen when saving a data file from Mac’s program Pages. E.g., you have two variables and they are saved like this in the data file.

patient time

24358 09/03/2001 24661 09/01/2003

25663 08/02/2002 27779 08/31/2002

Multiple observations per row

The previous data can be read in using a special bit of SAS syntax, @@:

data times;
  informat time mmddyy10.;
  infile "patientTimes.txt";
  input ID $ time @@;
  format time mmddyy10.;
run;

proc print data=times;
run;

The double at sign tells SAS that after it reads in two values for the two variables, it should keep reading from the same line without skipping to the next line.
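
The same idea can be tried without an external file. The sketch below puts the four patient records from the earlier slide in-stream (without the header line) and reads the date with the mmddyy10. informat directly in the INPUT statement. That is a small variation on the code above rather than a copy of it.

```sas
data times;
    input ID $ time :mmddyy10. @@;   /* @@ holds the line so the next iteration keeps reading it */
    format time mmddyy10.;
    datalines;
24358 09/03/2001 24661 09/01/2003
25663 08/02/2002 27779 08/31/2002
;
run;

proc print data=times;
run;
```

Each pass through the data step reads one ID and one date, so the two input lines produce four observations. Without @@, only the first pair on each line would be read.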

The @@ approach

A neat thing that you can do with the @@ approach is read in arbitrary text, one word at a time. So, I tried this with Romeo and Juliet, the Shakespeare play (older literature like this is often available online in plain text format). After very little searching, I found the plain text file at

http://sydney.edu.au/engineering/it/ matty/Shakespeare/texts/tragedies/romeoandjuliet

Romeo and Juliet as data

To read in the play as data, a first stab is just to read in every word and see what comes up. After reading in every word, we could make a frequency distribution of the words to see which words were the most common or which words occurred uniquely (occurred once but were never repeated). Here is code for reading in every word this way
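
A minimal sketch of this first stab follows. It assumes the play has been saved as Romeoandjuliet.txt in the current directory and that no word is longer than 30 characters. The data set name words and the length 30 are choices made for this illustration.

```sas
data words;
    infile "Romeoandjuliet.txt";
    length word $ 30;     /* assumed maximum word length; longer tokens are truncated */
    input word $ @@;      /* one blank-delimited word per observation */
run;

/* Frequency table of the words, most common first */
proc freq data=words order=freq;
    tables word;
run;
```

Because nothing is cleaned at this stage, tokens such as "Romeo," and "Romeo" count as different words.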

Romeo and Juliet as data: word frequencies

Romeo and Juliet as data: the log file

Romeo and Juliet as data: the log file

Romeo and Juliet as data: sorted word frequencies

The original file Romeoandjuliet.txt was modified to remove punctuation and tabs using several global search and replaces. The new file is Romeoandjuliet2.txt.
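
The punctuation part of this cleanup could also be sketched inside SAS. This is a swapped-in technique, not what was done to produce Romeoandjuliet2.txt. The example uses COMPRESS with the 'p' modifier to strip punctuation and LOWCASE so that capitalization does not split counts. It reads one in-stream line from the play to stay self-contained, and it does not deal with tabs.

```sas
data words2;
    length word $ 30;
    input word $ @@;
    word = lowcase(compress(word, , 'p'));   /* 'p' removes punctuation characters */
    if word ne '';                           /* drop tokens that were only punctuation */
    datalines;
O Romeo, Romeo! wherefore art thou Romeo?
;
run;

proc freq data=words2 order=freq;
    tables word;
run;
```

After cleaning, the three spellings "Romeo,", "Romeo!" and "Romeo?" all become romeo, so that word gets a count of 3.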

Romeo and Juliet as data: the log file

Romeo and Juliet as data: sorted word frequencies

You can also output the word frequencies to a new dataset for further analyses.
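
One way to do this is the OUT= option on the TABLES statement of PROC FREQ. The sketch below uses a few made-up words in place of the play. The data set names words and wordfreq are hypothetical.

```sas
/* Made-up words for illustration */
data words;
    length word $ 30;
    input word $ @@;
    datalines;
love death love night
death love
;
run;

/* NOPRINT suppresses the table; OUT= saves it with the variables word, COUNT, and PERCENT */
proc freq data=words noprint;
    tables word / out=wordfreq;
run;

/* Sort so the most frequent words come first */
proc sort data=wordfreq;
    by descending count;
run;

proc print data=wordfreq;
run;
```

Once the counts are in a data set, they can be sorted, filtered (for example, where count = 1 for words used only once), or merged with counts from another play.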

Romeo and Juliet as data: word frequencies

You can also output the word frequencies to a new dataset for further analyses.

Romeo and Juliet as data: word frequencies

Romeo and Juliet as data: sorted word frequencies

Romeo and Juliet as data: reading in sentences

Based on previous output, it looks like sentences are tab-delimited, so I tried using tabs as a delimiter on the original input.
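
A sketch of what such an attempt might look like, assuming the original file is Romeoandjuliet.txt and that 200 characters is long enough for one chunk of text (both the length and the variable names are assumptions). The DLM= option sets the delimiter to the tab character, written in hexadecimal as '09'x.

```sas
data sentences;
    infile "Romeoandjuliet.txt" dlm='09'x;   /* tab is the only delimiter */
    length sentence $ 200;                   /* assumed maximum length */
    input sentence $ @@;
    nwords = countw(sentence, ' ');          /* number of blank-separated words */
run;

proc print data=sentences (obs=10);
run;
```

Each value is the text between two tabs on a line, which may or may not be a full sentence. Looking at the first few observations is a quick way to check.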

Romeo and Juliet as data: reading in sentences

Romeo and Juliet as data: reading in sentences

Romeo and Juliet as data

What can you do with this type of information?

There are many questions you could ask with this type of data. Here are some examples:

- What is the correlation between word length and word frequency?

- If we read in two of Shakespeare’s plays, we could compare distributions of words and/or sentence lengths to see if there are differences. This could be used to investigate stylistic changes in Shakespeare’s plays over time (especially if we read in several of his plays).

- It has sometimes been hypothesized that some of Shakespeare’s plays weren’t written by Shakespeare, but by someone else. These distributions could be used to look for outliers in Shakespeare’s plays that suggest stylistic inconsistency.
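
The first question can be sketched in a few steps: count the words, compute each word's length, and pass both to PROC CORR. The words below are made up and the data set names are hypothetical. With the real play, the words data set would come from the file instead.

```sas
/* Made-up words for illustration */
data words;
    length word $ 30;
    input word $ @@;
    datalines;
a the a night the a tomorrow love the a
;
run;

/* One row per distinct word, with its count */
proc freq data=words noprint;
    tables word / out=wordfreq;
run;

/* Add the length of each word */
data wordfreq2;
    set wordfreq;
    wordlen = length(word);
run;

/* Pearson correlation between word length and frequency */
proc corr data=wordfreq2;
    var wordlen count;
run;
```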

Romeo and Juliet as data

Other questions:

- We could look for stylistic differences between types of plays, categorizing plays as comedies, histories, and tragedies. How often do certain words, such as “love” and “death”, occur in tragedies versus comedies?

- We could define pairwise distances between plays based on Euclidean distances between word frequencies. Let $w_{i1}, \dots, w_{in}$ be the frequencies of words $1, \dots, n$ in play $i$, where the $n$ words are the list of all words occurring in all of Shakespeare’s plays. Then define

$$d(\text{play}_i, \text{play}_j) = \sqrt{\sum_{k=1}^{n} (w_{ik} - w_{jk})^2}$$

This distance can be used to cluster Shakespeare’s plays using multidimensional scaling to see if there are clusters based on style (comedy, tragedy, history) or historical period.
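
As a small worked example of the distance, suppose two plays have the made-up word counts below (the data set freqs and the columns w1 and w2 are hypothetical). The sketch computes the square root of the summed squared differences with PROC SQL.

```sas
/* Made-up counts: w1 = frequency in play 1, w2 = frequency in play 2 */
data freqs;
    input word $ w1 w2;
    datalines;
love 3 1
death 2 2
night 1 3
;
run;

/* Euclidean distance between the two frequency vectors */
proc sql;
    select sqrt(sum((w1 - w2) * (w1 - w2))) as distance
    from freqs;
quit;
```

Here the squared differences are 4, 0, and 4, so the distance is the square root of 8, about 2.83. In practice the counts would probably be converted to relative frequencies first, so that a longer play is not automatically far from a shorter one.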

Textual analysis

Other types of questions and data you could try to address by analyzing word or sentence frequencies include the following:

- Can you distinguish between male and female authors in a particular genre (or say in e-mail, Facebook posts, etc.)?

- You could try to track the decline or increase in usage of certain words, such as “thee”, if you have documents from different time periods.

- Can you identify something about an author’s geographic information or age based on their use of particular words, slang, etc.? (e.g., in NM, people say “soda”, but in the Midwest, they often say “pop” or “soda pop”.)

- Instead of looking at individual words, you could look at pairs of adjacent words, so that certain pairs might be common for some authors but not others.
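
The last idea, pairs of adjacent words, can be sketched with the LAG function, which returns the value from the previous observation. The sentence below is made up, and the names words, pairs, prev, and pair are hypothetical.

```sas
/* Made-up text for illustration */
data words;
    length word $ 30;
    input word $ @@;
    datalines;
love is blind and love is kind
;
run;

data pairs;
    set words;
    length pair $ 61;
    prev = lag(word);               /* previous word; missing on the first observation */
    if _n_ > 1;                     /* the first word has no predecessor */
    pair = catx(' ', prev, word);   /* join the two words with one blank */
    keep pair;
run;

proc freq data=pairs order=freq;
    tables pair;
run;
```

In this example the pair "love is" occurs twice and every other pair occurs once. LAG is called on every observation, before the subsetting IF, which is needed for it to return the immediately preceding word.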
