
7. SAS Procedures for Descriptive Statistics - Hanyangaesl.hanyang.ac.kr/class/are6031/PDF-ENG/are6031(ENG)-08.pdf · SAS Procedures for Descriptive Statistics ... The UNIVARIATE


ARE6031(SAS Data Management) Class Note 08


7. SAS Procedures for Descriptive Statistics

7.1. The FREQ Procedure

- The FREQ procedure produces one-way to n-way frequency and crosstabulation tables.
- For two-way tables, PROC FREQ computes tests and measures of association.
- For n-way tables, PROC FREQ does stratified analysis, computing statistics within as well as across strata.
- Frequencies can also be output to a SAS data set.

One-Way Frequency Tables (for one variable)
    PROC FREQ; TABLES A;

Two-Way Crosstabulation Tables (for two variables)
    PROC FREQ; TABLES A*B;

N-Way Crosstabulation Tables (for n variables)
    PROC FREQ; TABLES A*B*C*D;
Produces k tables, where k is the number of different combinations of values for the variables A and B. Each table has the values of C down the side and the values of D across the top.

Note: Multi-way tables can generate a great deal of printed output. For example, if the variables A, B, C, D, and E each have ten levels, five-way tables of A*B*C*D*E could generate 4000 or more pages of output.
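One way to keep multi-way output manageable can be sketched as follows. The data set DEMO and its values are hypothetical and serve only as an illustration. The LIST option prints one row per observed combination instead of a separate table for each A*B level, and NOPRINT with OUT= writes the counts to a SAS data set instead of printing the tables.

```sas
/* Hypothetical data: four categorical variables (illustration only) */
DATA DEMO;
  INPUT A B C D;
  CARDS;
1 1 1 1
1 1 2 1
1 2 1 2
2 1 1 1
2 2 2 2
2 2 1 2
;
/* LIST: one line per observed combination of A, B, C, and D */
PROC FREQ DATA=DEMO;
  TABLES A*B*C*D / LIST;
  TITLE 'FOUR-WAY TABLE IN LIST FORMAT';
RUN;

/* NOPRINT suppresses the tables; OUT= saves the counts to COUNTS */
PROC FREQ DATA=DEMO;
  TABLES A*B*C*D / NOPRINT OUT=COUNTS;
RUN;

/* COUNTS holds A, B, C, D plus the variables COUNT and PERCENT */
PROC PRINT DATA=COUNTS;
  TITLE 'FREQUENCIES SAVED TO A DATA SET';
RUN;
```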


Specifications
    PROC FREQ options;
    TABLES requests / options;
    WEIGHT variable;
    BY variables;

1) PROC FREQ options; (refer to p.165)
    DATA=SAS dataset
    ORDER=FREQ
    ORDER=DATA
    ORDER=INTERNAL
    ORDER=FORMATTED
    FORMCHAR(1,2,7)='string'

2) TABLES requests / options;
    TABLES A*(B C);       =>  TABLES A*B A*C;
    TABLES (A B)*(C D);   =>  TABLES A*C A*D B*C B*D;
    TABLES (A B C)*D;     =>  TABLES A*D B*D C*D;
    TABLES A--C;          =>  TABLES A B C;
    TABLES (A--C)*D;      =>  TABLES A*D B*D C*D;
    TABLES A--C*D;        =>  illegal
Here A--C denotes all variables from A through C in data set order (A, B, and C in this example). Please refer to p.166 for options.
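The grouping shorthand above can be checked with a small sketch. The data set DEMO2 and its values are hypothetical. Under the equivalences listed above, both PROC FREQ steps should request the same four two-way tables.

```sas
/* Hypothetical data (illustration only) */
DATA DEMO2;
  INPUT A B C D;
  CARDS;
1 1 1 2
1 2 2 1
2 1 1 1
2 2 2 2
;
/* Shorthand request using parentheses */
PROC FREQ DATA=DEMO2;
  TABLES (A B)*(C D);
  TITLE 'GROUPED REQUEST';
RUN;

/* The same request written out in full */
PROC FREQ DATA=DEMO2;
  TABLES A*C A*D B*C B*D;
  TITLE 'EXPANDED REQUEST';
RUN;
```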

3) WEIGHT variable;

Normally, each observation contributes a value of 1 to the frequency counts. In other words, each observation represents one subject. However, when a WEIGHT statement appears, each observation contributes the weighting variable’s value for that observation.


Example 1: various options (PROC_FREQ1.SAS)
    OPTIONS NODATE;
    DATA TEST;
      INPUT A B;
      CARDS;
    1 2
    2 1
    . 2
    . .
    1 1
    2 1
    ;
    PROC FREQ;
      TITLE 'NO TABLES STATEMENT';

    PROC FREQ;
      TABLES A / MISSPRINT;
      TITLE '1-WAY FREQUENCY TABLE WITH MISSPRINT OPTION';

    PROC FREQ;
      TABLES A*B;
      TITLE '2-WAY CONTINGENCY TABLE';


    PROC FREQ;
      TABLES A*B / MISSPRINT;
      TITLE '2-WAY CONTINGENCY TABLE WITH MISSPRINT OPTION';


    PROC FREQ;
      TABLES A*B / MISSING;
      TITLE '2-WAY CONTINGENCY TABLE WITH MISSING OPTION';


    PROC FREQ;
      TABLES A*B / CHISQ;
      TITLE '2-WAY CONTINGENCY TABLE WITH CHISQ OPTION';


    PROC FREQ;
      TABLES A*B / LIST;
      TITLE '2-WAY FREQUENCY TABLE';

    PROC FREQ;
      TABLES A*B / LIST MISSING;
      TITLE '2-WAY FREQUENCY TABLE WITH MISSING OPTION';

    PROC FREQ;
      TABLES A*B / LIST SPARSE;
      TITLE '2-WAY FREQUENCY TABLE WITH SPARSE OPTION';


    PROC FREQ ORDER=DATA;
      TABLES A*B / LIST;
      TITLE '2-WAY FREQUENCY TABLE WITH ORDER=DATA';
    RUN;

Example 2: PROC FREQ with WEIGHT statement (PROC_FREQ2.SAS)
    OPTIONS NODATE;
    DATA HRSWORK;
      INPUT COLLEGE $ YR HRSWORK;
      CARDS;
    A 1 20
    A 1 25
    A 1 19
    A 2 20
    A 2 25
    A 2 20
    B 1 20
    B 1 19
    B 1 20
    B 1 19
    B 2 19
    B 2 20
    ;
    PROC FREQ;
      TABLES COLLEGE*YR;
      TITLE '2-WAY FREQ TABLE';
    PROC FREQ;
      TABLES COLLEGE*YR;
      WEIGHT HRSWORK;
      TITLE '2-WAY FREQ TABLE WITH WEIGHT VARIABLE';
    RUN;
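As a cross-check on Example 2, the weighted cell frequencies should equal the sum of HRSWORK within each COLLEGE*YR cell, because each observation contributes its weight rather than 1. A minimal sketch, assuming the HRSWORK data set from Example 2 has already been created, computes those sums with PROC MEANS:

```sas
/* Assumes the HRSWORK data set from Example 2 exists.           */
/* N   = unweighted cell count (what PROC FREQ shows by default) */
/* SUM = total of HRSWORK per cell (the weighted cell frequency) */
PROC MEANS DATA=HRSWORK N SUM;
  CLASS COLLEGE YR;
  VAR HRSWORK;
  TITLE 'CELL COUNTS AND SUMS OF THE WEIGHT VARIABLE';
RUN;
```

Working through the data by hand, the unweighted counts are 3, 3, 4, and 2 for the cells (A,1), (A,2), (B,1), and (B,2), for a total of 12. The corresponding weighted frequencies are 64, 65, 78, and 39, for a total of 246.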


7.2. The TABULATE Procedure

PROC TABULATE constructs tables of descriptive statistics from compositions of classification variables, analysis variables, and statistics keywords. Tables can have up to three dimensions: column, row, and page.

PROC TABULATE displays descriptive statistics in hierarchical tables. Each table cell belongs to a particular category of observations formed by crossing variable names. The statistic associated with each cell is calculated on values from all observations in that category. PROC TABULATE computes many of the same statistics as other descriptive procedures such as MEANS, FREQ, and SUMMARY.

PROC TABULATE provides:
- simple but powerful methods to create user-defined tables
- a great degree of flexibility in classification hierarchies
- a variety of mechanisms for titling and formatting variables and procedure-generated statistics.

Specifications
    PROC TABULATE [options];
    CLASS variable …;
    VAR variable …;
    FREQ variable;
    WEIGHT variable;
    FORMAT variables format. …;
    LABEL variable = label…;
    BY variable …;
    TABLE [expression,][ expression,]expression[/options];
    KEYLABEL keyword = 'text …';

Fundamentals
An understanding of the TABLE statement is fundamental to using TABULATE. The following are important elements of the TABLE statement:
- types of variables: classification and analysis
- TABLE statement operators: comma, blank space, asterisk, brackets, parentheses
- table dimensions: pages, rows, columns
(Please read the textbook for the various statements and options, pp. 260-305.)
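The TABLE statement elements listed above can be illustrated with a minimal sketch. The data set SALES and its variables are hypothetical, and this is not the course file PROC_TABULATE1.SAS. The comma separates the row dimension from the column dimension, the asterisk crosses elements, a blank space concatenates them, and parentheses group them.

```sas
/* Hypothetical data (illustration only) */
DATA SALES;
  INPUT REGION $ QUARTER AMOUNT;
  CARDS;
EAST 1 100
EAST 2 120
WEST 1 90
WEST 2 110
;
PROC TABULATE DATA=SALES;
  CLASS REGION QUARTER;             /* classification variables */
  VAR AMOUNT;                       /* analysis variable        */
  TABLE REGION,                     /* row dimension            */
        QUARTER*AMOUNT*(SUM MEAN);  /* column dimension         */
  TITLE 'AMOUNT BY REGION AND QUARTER';
RUN;
```

Adding a third expression in front, for example a classification variable followed by a comma, would define the page dimension.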


Example (PROC_TABULATE1.SAS)


7.3. The MEANS Procedure

PROC MEANS produces simple univariate descriptive statistics for numeric variables.

Specifications
    PROC MEANS options;
    VAR variables;
    BY variables;
    CLASS variable …;
    FREQ variable;
    WEIGHT variable;
    ID variables;
    OUTPUT options;

VAR Statement: Statistics are calculated for each numeric variable listed in the VAR statement. If a VAR statement is not used, all numeric variables in the input data set are analyzed, except for those listed in a BY, CLASS, ID, FREQ, or WEIGHT statement.

BY Statement: A BY statement can be used with PROC MEANS to obtain separate analyses on observations in groups defined by the BY variables. When a BY statement appears, the procedure expects the input data set to be sorted in order of the BY variables.

CLASS Statement: The CLASS statement names the variables used to form subgroups. Its effect on the statistics computed is similar to that of the BY statement. The differences are in the format of the printed output and in the sorting requirement, which applies only to the BY statement.

FREQ Statement: When a FREQ statement appears with PROC MEANS, each observation in the input data set is assumed to represent n observations in the calculation of statistics, where n is the value of the FREQ variable.

WEIGHT Statement: The WEIGHT statement specifies a numeric variable in the input SAS data set whose values are used to weight each observation. Only one variable can be specified. The WEIGHT variable values can be nonintegers and are used to calculate a weighted mean and a weighted variance.

ID Statement: An ID statement can be used with PROC MEANS to include additional variables in the output data set.

OUTPUT Statement: The OUTPUT statement requests that MEANS output statistics to a new SAS data set.
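The difference between the FREQ and WEIGHT statements can be sketched with a tiny example. The data set SCORES and its values are hypothetical. With FREQ, each row stands for COUNT observations, so N is the sum of the counts. With WEIGHT, N stays at the number of rows and COUNT only weights the mean.

```sas
/* Hypothetical data: a score and how many subjects obtained it */
DATA SCORES;
  INPUT SCORE COUNT;
  CARDS;
10 2
20 3
;
/* FREQ: N = 2 + 3 = 5, mean = (10*2 + 20*3)/5 = 16 */
PROC MEANS DATA=SCORES N MEAN;
  VAR SCORE;
  FREQ COUNT;
  TITLE 'FREQ STATEMENT';
RUN;

/* WEIGHT: N = 2 rows, weighted mean = (10*2 + 20*3)/(2 + 3) = 16 */
PROC MEANS DATA=SCORES N MEAN;
  VAR SCORE;
  WEIGHT COUNT;
  TITLE 'WEIGHT STATEMENT';
RUN;
```

The mean is the same in both runs, but statistics that depend on N, such as the standard error, differ between the two statements.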


e.g.
    PROC MEANS;
      VAR X1 X2;
      BY GROUP;
      OUTPUT OUT=STATS MEAN=MA MB STD=SA;

Example: proc_means.sas
    OPTIONS NODATE;
    DATA A;
      INPUT RATING EXCESS PLACE $ DAY;
      CARDS;
    04 54 S 1
    07 70 N 1
    10 69 N 2
    04 52 S 1
    07 70 S 2
    08 74 N 1
    04 60 S 1
    07 62 S 2
    07 80 N 1
    06 61 S 2
    06 77 N 2
    08 75 N 2
    ;
    PROC MEANS;
      TITLE 'Statistics for All Numeric Variables';
    PROC MEANS DATA=A MAXDEC=3 NMISS RANGE USS CSS T PRT SUMWGT;
      VAR RATING EXCESS;
      TITLE 'Requesting Assorted Statistics';
    PROC MEANS MAXDEC=3 FW=10;
      CLASS PLACE DAY;
      VAR RATING EXCESS;
      TITLE 'Statistics with Two Class Variables';
    PROC SORT;
      BY PLACE DAY;
    PROC MEANS MAXDEC=3 FW=10;
      BY PLACE DAY;
      VAR RATING EXCESS;
      OUTPUT OUT=NEW MEAN=RMEAN EMEAN STDERR=RSE ESE;
      TITLE 'Statistics with Two BY Variables';
    PROC PRINT;
      TITLE 'NEW DATA SET';
    RUN;


7.4. The UNIVARIATE Procedure

PROC UNIVARIATE produces simple descriptive statistics (including quantiles) for numeric variables. The UNIVARIATE procedure differs from other SAS procedures that produce descriptive statistics because it provides greater detail on the distribution of a variable. Features in PROC UNIVARIATE include:
- detail on the extreme values of a variable
- quantiles, such as the median
- several plots to picture the distribution
- frequency tables
- a test to determine whether the data are normally distributed.

Specifications
    PROC UNIVARIATE options;
    VAR variables;
    BY variables;
    CLASS variable …;
    FREQ variable;
    WEIGHT variable;
    ID variables;
    OUTPUT OUT=SAS_data_set keyword=names;

(Refer to the textbook for descriptions of the statements.)

Example: proc_univariate.sas
    ******************************************;
    * PROC_UNIVARIATE.SAS                     ;
    *                                         ;
    * DESCRIPTIVE STATISTICS FOR              ;
    * HEIGHTS AND WEIGHT OF FEMALES           ;
    ******************************************;
    OPTIONS NODATE;
    DATA GIRLS;
      INPUT SCHOOL $ HEIGHT WEIGHT;
      CARDS;
    A 169.6 71.2
    A 166.8 58.2
    A 157.1 56.0
    A 181.1 64.5
    A 158.4 53.0
    A 165.6 52.4
    A 166.7 56.8
    A 156.5 49.2
    A 168.1 55.6
    A 165.3 77.8
    A 164.0 45.6
    A 155.4 44.5
    A 160.0 50.0
    A 168.6 70.1
    A 165.8 59.1
    A 154.1 57.0
    A 183.1 63.5
    A 154.4 55.0
    A 167.6 56.4
    A 162.7 58.8
    A 158.5 47.2
    A 167.1 53.6
    B 163.3 70.8
    B 169.0 55.6
    B 153.4 54.5
    B 169.0 51.0
    B 159.6 61.2
    B 156.8 53.2
    B 167.1 59.0
    B 171.1 63.5
    B 159.4 53.0
    B 165.6 52.4
    B 167.7 56.8
    B 158.5 49.2
    B 164.1 55.6
    C 163.3 77.8
    C 163.0 45.6
    C 157.4 44.5
    C 169.0 50.0
    C 163.6 70.1
    C 163.8 59.1
    C 155.1 57.0
    C 163.1 63.5
    C 154.4 55.0
    C 165.6 56.4
    C 167.7 58.8
    C 155.5 47.2
    C 168.1 53.6
    C 165.3 70.8
    C 165.0 55.6
    C 156.4 53.5
    C 159.0 52.0
    ;
    PROC UNIVARIATE DATA=GIRLS FREQ PLOT NORMAL;
      VAR HEIGHT WEIGHT;
      TITLE 'DESCRIPTIVE STATISTICS FOR ALL DATA';
    PROC UNIVARIATE DATA=GIRLS FREQ PLOT NORMAL;
      VAR HEIGHT WEIGHT;
      BY SCHOOL;
      TITLE 'DESCRIPTIVE STATISTICS WITH BY STATEMENT';
    RUN;
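The OUTPUT statement from the specifications is not exercised in the example above. A minimal sketch, assuming the GIRLS data set has already been created, saves a few statistics per school to a data set. The output data set name GSTATS and the statistic variable names are hypothetical choices.

```sas
/* Assumes the GIRLS data set from the example above exists */
/* and is already grouped by SCHOOL (A, then B, then C).    */
PROC UNIVARIATE DATA=GIRLS NOPRINT;
  VAR HEIGHT WEIGHT;
  BY SCHOOL;
  /* One name per VAR variable after each statistic keyword */
  OUTPUT OUT=GSTATS MEAN=HMEAN WMEAN
                    MEDIAN=HMED WMED
                    Q1=HQ1 WQ1
                    Q3=HQ3 WQ3;
RUN;

/* GSTATS has one observation per school */
PROC PRINT DATA=GSTATS;
  TITLE 'SELECTED STATISTICS BY SCHOOL';
RUN;
```

NOPRINT suppresses the printed report, so only the saved statistics are shown by PROC PRINT.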


    PROC UNIVARIATE DATA=GIRLS FREQ PLOT NORMAL;
      VAR HEIGHT WEIGHT;
      BY SCHOOL;
      TITLE 'DESCRIPTIVE STATISTICS WITH BY STATEMENT';

The above procedure produces descriptive statistics for the height and weight data of each school, followed by side-by-side box plots.
