
EPIB 698C Lecture 4

Raul Cruz-Cano

Summer 2012


Sorting, Printing and Summarizing Your Data

• SAS procedures (PROCs) perform a specific analysis or function and produce results or reports

• Eg: PROC PRINT DATA = new; RUN;

• All procedures have required statements, and most have optional statements

• All procedures start with the keyword PROC, followed by the name of the procedure, such as PRINT or CONTENTS

• Options, if there are any, follow the procedure name

• The DATA=data_name option tells SAS which data set to use as input for this procedure. NOTE: if you skip it, SAS will use the most recently created data set, which is not necessarily the same as the most recently used data set.


BY statement

• The BY statement is required for only one procedure, PROC SORT

PROC SORT DATA = new;
  BY gender;
RUN;

• For all other procedures, BY is an optional statement that tells SAS to perform the analysis separately for each level of the BY variable, instead of treating all subjects as one group

PROC PRINT DATA = new;
  BY gender;
RUN;

• All procedures except PROC SORT assume your data are already sorted by the variables in your BY statement
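
Because every procedure other than PROC SORT expects BY-sorted input, the two steps are usually run as a pair. The following is a minimal sketch, assuming a small made-up data set named new with a gender variable; the id values and rows are hypothetical and only there to make the example self-contained.

```sas
* Hypothetical data, for illustration only;
DATA new;
  INPUT id gender $;
  DATALINES;
1 male
2 female
3 male
4 female
;
RUN;

* Sort first so that each BY group is contiguous;
PROC SORT DATA = new;
  BY gender;
RUN;

* Print one listing per level of gender;
PROC PRINT DATA = new;
  BY gender;
RUN;
```

If the PROC SORT step is skipped and the data are not already in gender order, the PROC PRINT step stops with an error saying the data set is not sorted.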


PROC Sort

• Syntax

PROC SORT DATA = input_data_name OUT = out_data_name;
  BY variable-1 … variable-n;

• The variables in the BY statement are called BY variables.

• With one BY variable, SAS sorts the data based on the values of that variable

• With more than one BY variable, SAS sorts observations by the first variable, then by the second variable within the categories of the first variable, and so on

• The DATA= and OUT= options specify the input and output data sets. Without the DATA= option, SAS will use the most recently created data set. Without the OUT= option, SAS will replace the original data set with the newly sorted version


PROC Sort

• By default, SAS sorts data in ascending order, from the lowest to the highest value or from A to Z. To reverse the order (highest to lowest, or Z to A), add the keyword DESCENDING before each variable you want sorted that way

• The NODUPKEY option tells SAS to eliminate any duplicate observations that have the same values for the BY variables


PROC Sort

• Example: The file sealife.txt contains information on the average length in feet of selected whales and sharks. We want to sort the data by family and length

Name Family Length
beluga whale 15
whale shark 40
basking shark 30
gray whale 50
mako shark 12
sperm whale 60
dwarf shark .5
whale shark 40
humpback . 50
blue whale 100
killer whale 30



PROC Sort

* Read the data;
DATA marine;
  INFILE 'F:\SAS\lecture4\Sealife.txt';
  INPUT Name $ Family $ Length;
RUN;

* Sort the data;
PROC SORT DATA = marine OUT = seasort NODUPKEY;
  BY Family DESCENDING Length;
RUN;


Title and Footnote statement

• Title and Footnote statements are global statements, and are not technically part of any step.

• You can put them anywhere in your program, but since they apply to procedure output, it usually makes sense to put them with the procedure

• Syntax

TITLE 'This is a title for this procedure';
FOOTNOTE 'This is the footnote for this procedure';

• To cancel the current title or footnote, use the following null statements: TITLE; FOOTNOTE;


Label Statement

• The label statement can create descriptive labels, up to 256 characters long, for each variable

• Eg (one LABEL statement can label several variables; the semicolon comes only at the end):

LABEL Shipdate = 'Date merchandise was shipped'
      ID = 'Identification number of subject';

• When a label statement is used in a data step, the labels become part of the data set; but when used in a PROC step, the labels stay in effect only for the duration of that step
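
The difference between the two placements can be sketched as follows. The data set name shipments and its two rows are hypothetical; only the variable names and label text come from the slide. Note that PROC PRINT shows labels only when its LABEL option is specified.

```sas
* Hypothetical data set, for illustration only;
DATA shipments;
  INPUT ID Shipdate : MMDDYY10.;
  FORMAT Shipdate MMDDYY10.;
  * This label is stored in the data set itself;
  LABEL Shipdate = 'Date merchandise was shipped';
  DATALINES;
101 05/04/2001
102 05/14/2001
;
RUN;

PROC PRINT DATA = shipments LABEL;
  * This label is in effect only for this PROC step;
  LABEL ID = 'Identification number of subject';
RUN;
```

A later procedure run on shipments would still see the Shipdate label, but not the ID label.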


PROC Format statement

• The FORMAT procedure allows you to create your own formats. It is useful when you use coded data.

• PROC FORMAT creates formats that will later be associated with variables in a FORMAT statement

• Syntax of PROC FORMAT:

PROC FORMAT;
  VALUE name range-1 = 'formatted-text-1'
             range-2 = 'formatted-text-2'
             range-n = 'formatted-text-n';

• Name is the name of the format you are creating; if the format is for character data, then you need to use $name instead of name. In addition, the name cannot be the name of an existing format


PROC Format statement

• Each range is the value (or set of values) of the variable that is assigned to the text given in the quotation marks

• The text can be up to 32,767 characters long, but some procedures print only the first 8 to 16 characters

• The following are some examples of valid range specifications:

'A' = 'Asian': character values must be put in quotation marks
1, 3, 5, 7, 9 = 'ODD': with more than one value in the range, separate the values with commas, or use a hyphen (-) for a continuous range
5000 - HIGH = 'high price': the keywords LOW and HIGH can be used in ranges to indicate the lowest and highest non-missing values for the variable


PROC Format statement

• Here is a survey about subjects' car color preferences. The data contain each subject's age, sex (coded as 1 for male and 2 for female), annual income, and preferred car color (Y = yellow, G = green, B = blue, W = white). Here are the data:

age sex income color

19 1 14000 Y

45 1 65000 G

72 2 35000 B

31 1 44000 Y

58 2 83000 W


* Define the formats;
PROC FORMAT;
  VALUE gender 1 = 'Male'
               2 = 'Female';
  VALUE agegroup 13 -< 20 = 'Teen'
                 20 -< 65 = 'Adult'
                 65 - HIGH = 'Senior';
  VALUE $col 'W' = 'Moon White'
             'B' = 'Sky Blue'
             'Y' = 'Sunburst Yellow'
             'G' = 'Green';
RUN;

* Apply the formats when printing;
PROC PRINT DATA = carsurvey;
  FORMAT Sex gender. Age agegroup. Color $col. Income DOLLAR8.;
RUN;
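
The PROC PRINT step above assumes a data set named carsurvey already exists; the slides do not show how it was read in. One way to create it, sketched here with the five survey rows typed inline rather than read from a file (an assumption made for self-containment), is to run this DATA step first:

```sas
* Inline copy of the survey data shown on the previous slide;
DATA carsurvey;
  INPUT Age Sex Income Color $;
  DATALINES;
19 1 14000 Y
45 1 65000 G
72 2 35000 B
31 1 44000 Y
58 2 83000 W
;
RUN;
```

The variable names Age, Sex, Income, and Color are chosen to match the FORMAT statement in the PROC PRINT step.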


Subsetting in procedures with a where statement

• The WHERE statement tells a procedure to use a subset of data

• It is an optional statement for any PROC step

• Unlike subsetting in the DATA step, using a WHERE statement in a procedure does not create a new data set

• The basic form is

WHERE condition; (eg: WHERE gender = 'female';)
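
The contrast with DATA step subsetting can be sketched as follows. The data set new, its rows, and the output data set name females are hypothetical and used only for illustration.

```sas
* Hypothetical data, for illustration only;
DATA new;
  INPUT id gender $;
  DATALINES;
1 male
2 female
3 female
;
RUN;

* WHERE in a procedure: only the listing is restricted, no data set is created;
PROC PRINT DATA = new;
  WHERE gender = 'female';
RUN;

* Subsetting in a DATA step: creates the new data set females;
DATA females;
  SET new;
  IF gender = 'female';
RUN;
```

The comparison is case-sensitive, so 'female' and 'Female' are different values.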


Subsetting in procedures with a where statement

• A data set contains information about well-known painters:

Name Style Nation of origin

Mary Cassatt Impressionism U

Paul Cezanne Post-impressionism F

Edgar Degas Impressionism F

Paul Gauguin Post-impressionism F

Claude Monet Impressionism F

Pierre Auguste Renoir Impressionism F

Vincent van Gogh Post-impressionism N

• Goal: we want a list of impressionist painters


* Read the data using column input;
DATA style;
  INFILE 'F:\SAS\lecture4\style.txt';
  INPUT Name $ 1-21 style $ 23-40 Origin $ 42;
RUN;

* Print only the impressionist painters;
PROC PRINT DATA = style;
  WHERE style = 'Impressionism';
  TITLE 'Major Impressionist Painters';
  FOOTNOTE 'F = France N = Netherlands U = US';
RUN;


Summarizing your data with PROC MEANS

• The MEANS procedure provides simple statistics on numeric variables. Syntax: PROC MEANS options;

• Simple statistics that PROC MEANS can produce:

MAX: the maximum value
MIN: the minimum value
MEAN: the mean
N: number of non-missing values
STDDEV: the standard deviation
NMISS: number of missing values
RANGE: the range of the data
SUM: the sum
MEDIAN: the median

• DEFAULT: if no statistics are requested, PROC MEANS prints N, MEAN, STDDEV, MIN, and MAX


Proc means

• Optional statements of PROC MEANS:

BY variable-list: performs the analysis for each level of the variables in the list. Data need to be sorted first

CLASS variable-list: performs the analysis for each level of the variables in the list. Data do not need to be sorted

VAR variable-list: specifies which variables to use in the analysis
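
The practical difference between BY and CLASS can be sketched with a small example. The data set new and its values are hypothetical; the rows are deliberately left unsorted.

```sas
* Hypothetical, unsorted data, for illustration only;
DATA new;
  INPUT gender $ age;
  DATALINES;
male 34
female 29
male 41
female 52
;
RUN;

* CLASS works on unsorted data and prints one combined table;
PROC MEANS DATA = new;
  CLASS gender;
  VAR age;
RUN;

* BY needs sorted data and prints one table per group;
PROC SORT DATA = new;
  BY gender;
RUN;

PROC MEANS DATA = new;
  BY gender;
  VAR age;
RUN;
```

Both PROC MEANS steps report the same statistics for age within each gender; only the sorting requirement and the layout of the output differ.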


Proc means

• A wholesale nursery sells garden flowers and wants to summarize its sales figures by month. The data are as follows:

ID Date Lily SnapDragon Marigold
756-01 05/04/2001 120 80 110
756-01 05/14/2001 130 90 120
834-01 05/12/2001 90 160 60
834-01 05/14/2001 80 60 70
901-02 05/18/2001 50 100 75
834-01 06/01/2001 80 60 100
756-01 06/11/2001 100 160 75
901-02 06/19/2001 60 60 60
756-01 06/25/2001 85 110 100


* Read the data and derive the month of each sale;
DATA sales;
  INFILE 'C:\teaching\SAS\lecture4\Flowers.txt';
  INPUT CustomerID $ @9 SaleDate MMDDYY10. Lily SnapDragon Marigold;
  Month = MONTH(SaleDate);
RUN;

PROC SORT DATA = sales;
  BY Month;
RUN;

* Calculate means by Month for flower sales;
PROC MEANS DATA = sales;
  BY Month;
  VAR Lily SnapDragon Marigold;
  TITLE 'Summary of Flower Sales by Month';
RUN;


OUTPUT statement

• We can use the OUTPUT statement to write summary statistics in a SAS data set

• Syntax

OUTPUT OUT = data_name output-statistic-list;

• Eg:

PROC MEANS DATA = new;
  VAR age BMI;
  OUTPUT OUT = new1 MEAN(age BMI) = mean_age mean_BMI;
RUN;

• The output data set new1 contains the means of age and BMI, stored in the variables mean_age and mean_BMI respectively.



* Save monthly means and sums in the data set new1;
PROC MEANS DATA = sales;
  BY Month;
  VAR Lily SnapDragon Marigold;
  OUTPUT OUT = new1
    MEAN(Lily SnapDragon Marigold) = mean_lily mean_SnapDragon mean_Marigold
    SUM(Lily SnapDragon Marigold) = sum_lily sum_SnapDragon sum_Marigold;
  TITLE 'Summary of Flower Sales by Month';
RUN;


OUTPUT statement

• The SAS data set created by the OUTPUT statement will contain all the variables defined in the output statistic list, any variables in a BY or CLASS statement, plus two new variables: _TYPE_ and _FREQ_

• Without a BY or CLASS statement, the data set will have just one observation

• If there is a BY statement, the data set will have one observation for each level of the BY group

• CLASS statements produce one observation for each level of interaction of the class variables

• The value of _TYPE_ depends on the level of interaction of the variables in the CLASS statement.

• _TYPE_ = 0 is the grand total
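
A small sketch of what the output data set looks like with one CLASS variable. The data set new and its values are hypothetical; NOPRINT is a PROC MEANS option that suppresses the printed report so that only the output data set is produced.

```sas
* Hypothetical data, for illustration only;
DATA new;
  INPUT gender $ age;
  DATALINES;
male 34
female 29
male 41
female 52
;
RUN;

* Write the mean of age to new1, overall and per gender;
PROC MEANS DATA = new NOPRINT;
  CLASS gender;
  VAR age;
  OUTPUT OUT = new1 MEAN(age) = mean_age;
RUN;

* Inspect gender, _TYPE_, _FREQ_ and mean_age;
PROC PRINT DATA = new1;
RUN;
```

With the two gender levels above, new1 has three observations: one with _TYPE_ = 0 (all subjects together) and one with _TYPE_ = 1 for each gender. _FREQ_ holds the number of observations behind each row.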


Proc Freq

• PROC FREQ can be used to count frequencies of both character and numeric variables

• When you have counts for one variable, they are called one-way frequencies

• When you have two or more variables, the counts are called two-way, three-way, and so on up to n-way frequencies, or simply cross-tabulations

• Syntax:

PROC FREQ;
  TABLES variable-combinations;

• To produce one-way frequencies, just put the variable name after TABLES; to produce cross-tabulations, put an asterisk (*) between the variables


Proc Freq

• The blood.txt data contain information on 1000 subjects. The variables include: subject ID, gender, blood type, age group, white blood cell count, red blood cell count, and cholesterol.

• Here are the data for the first few subjects (a period denotes a missing value):

1 Female AB Young 7710 7.4 258
2 Male AB Old 6560 4.7 .
3 Male A Young 5690 7.53 184
4 Male B Old 6680 6.85 .
5 Male A Young . 7.72 187

• We want to derive frequencies of gender, age group, and blood type.


Proc Freq

proc freq data=blood;
  tables Gender Blood_Type;
  tables Gender * Blood_Type / chisq;
  tables Gender * Age_Group * Blood_Type / nocol norow nopercent;
run;


PROC FREQ options

• Nocol: Suppress the column percentage for each cell

• Norow: Suppress the row percentage for each cell

• Nopercent: Suppress the percentages in crosstabulation tables, or percentages and cumulative percentages in one-way frequency tables and in list format


PROC FREQ options

• Missprint: Display missing value frequencies

• Missing: Treat missing values as nonmissing


PROC FREQ output

• PROC FREQ can create an output data set with frequencies, percentages, and expected cell frequencies

• Out=: Specify (on the TABLES statement) an output data set to contain variable values and frequency counts
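
A minimal sketch of the OUT= option, using only the first five subjects and the first four columns of the blood data typed inline (an assumption made for self-containment; the output data set name freq_out is hypothetical). OUTEXPECT is a TABLES option that adds expected cell frequencies to the output data set.

```sas
* First five subjects of the blood data, subset of columns;
DATA blood;
  INPUT Subject Gender $ Blood_Type $ Age_Group $;
  DATALINES;
1 Female AB Young
2 Male AB Old
3 Male A Young
4 Male B Old
5 Male A Young
;
RUN;

* Save the cross-tabulation counts in freq_out;
PROC FREQ DATA = blood;
  TABLES Gender * Blood_Type / OUT = freq_out OUTEXPECT;
RUN;

* Each row is one Gender by Blood_Type cell;
PROC PRINT DATA = freq_out;
RUN;
```

The data set freq_out holds the table variables together with COUNT and PERCENT, plus EXPECTED because OUTEXPECT was requested.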