daniel-watkins
Chapter 9
Producing Descriptive Statistics
PROC MEANS: Summarize descriptive statistics for continuous numeric variables.
PROC FREQ: Summarize frequency tables for discrete numeric variables or categorical variables.
Objectives
• Compute statistical summaries such as mean, median, std, min, max, and so on for continuous numeric variables
• Control the number of decimal places when reporting the summary statistics
• Understand the difference between the PROC MEANS and PROC SUMMARY procedures
• Create a one-way frequency table
• Create two-way and n-way cross frequency tables
PROC MEANS Output
Salary by Job Code
The MEANS Procedure
Analysis Variable : Salary

Job       N
Code    Obs    N        Mean     Std Dev     Minimum      Maximum
------------------------------------------------------------------
FLTAT1   14   14    25642.86     2951.07    21000.00     30000.00
FLTAT2   18   18    35111.11     1906.30    32000.00     38000.00
FLTAT3   12   12    44250.00     2301.19    41000.00     48000.00
PILOT1    8    8    69500.00     2976.10    65000.00     73000.00
PILOT2    9    9    80111.11     3756.48    75000.00     86000.00
PILOT3    8    8    99875.00     7623.98    92000.00    112000.00
------------------------------------------------------------------
Calculating Summary Statistics for Numeric Variables
The MEANS procedure displays simple descriptive statistics for the numeric variables in a SAS data set.
General form of a simple PROC MEANS step:
PROC MEANS DATA=SAS-data-set;
RUN;
Example:
proc means data=mylib.crew;
   title 'Salary Analysis';
run;
Calculating Summary Statistics
Salary Analysis
The MEANS Procedure

Variable     N        Mean     Std Dev     Minimum      Maximum
----------------------------------------------------------------
HireDate    69     9812.78     1615.44     7318.00     12690.00
Salary      69    52144.93    25521.78    21000.00    112000.00
----------------------------------------------------------------
NOTE: PROC MEANS computes summary statistics for any numeric variable we ask for. However, the statistics are not meaningful for some variables, such as HireDate, which is a date stored as a number.
Calculating Summary Statistics
By default, PROC MEANS
– analyzes every numeric variable in the SAS data set
– prints the statistics N, MEAN, STD, MIN, and MAX
– excludes missing values before calculating statistics.
Specifying summary statistics to be computed
PROC MEANS data = mylib.crew mean median range std ;
To specify the summary statistics to be computed, add them to the PROC MEANS statement as options.
Limiting Decimal Places
By default, PROC MEANS uses the BEST. format to display values in the report. This can show many decimal places, such as 52.000000.
To limit the number of decimal places to k:
PROC MEANS DATA=mylib.crew MAXDEC=k;
MAXDEC=2 results in two decimal places in the report.
MAXDEC=0 results in no decimal places in the report.
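As a minimal sketch, the statistic keywords and the MAXDEC= option can be combined in one PROC MEANS statement. It uses the mylib.crew data set from the earlier slides; the choice of statistics is only an illustration.

```sas
/* Request specific statistics and round the report to two decimals. */
/* With no VAR statement, every numeric variable is analyzed.        */
proc means data=mylib.crew mean median range std maxdec=2;
run;
```

Only the listed statistics appear in the report; the default set (N, MEAN, STD, MIN, MAX) is replaced, not extended.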
Selecting Variables
The VAR statement restricts the variables processed by PROC MEANS. General form of the VAR statement:
VAR SAS-variable(s);
Selecting Variables
Mylib.crew (partial listing):
HireDate   LastName    FirstName    Location   Phone  EmpID   JobCode  Salary
07NOV1992  BEAUMONT    SALLY T.     LONDON     1132   E00525  PILOT1    72000
12MAY1985  BERGAMASCO  CHRISTOPHER  CARY       1151   E02466  FLTAT3    41000
04AUG1988  BETHEA      BARBARA ANN  FRANKFURT  1163   E00802  PILOT2    81000

proc means data=Mylib.crew;
   var Salary;
   title 'Salary Analysis';
run;

Salary Analysis
The MEANS Procedure
Analysis Variable : Salary

 N        Mean     Std Dev     Minimum      Maximum
----------------------------------------------------
69    52144.93    25521.78    21000.00    112000.00
----------------------------------------------------
Grouping Observations Using the CLASS Statement
The CLASS statement in the MEANS procedure groups the observations of the SAS data set for analysis.
General form of the CLASS statement:
CLASS SAS-variable(s);
Grouping Observations
proc means data=mylib.crew maxdec=2;
   var Salary;
   class JobCode;
   title 'Salary by Job Code';
run;
NOTE: The MAXDEC= option controls the number of decimal places displayed in the output.
Grouping Observations using CLASS statement
Salary by Job Code
The MEANS Procedure
Analysis Variable : Salary

Job       N
Code    Obs    N        Mean     Std Dev     Minimum      Maximum
------------------------------------------------------------------
FLTAT1   14   14    25642.86     2951.07    21000.00     30000.00
FLTAT2   18   18    35111.11     1906.30    32000.00     38000.00
FLTAT3   12   12    44250.00     2301.19    41000.00     48000.00
PILOT1    8    8    69500.00     2976.10    65000.00     73000.00
PILOT2    9    9    80111.11     3756.48    75000.00     86000.00
PILOT3    8    8    99875.00     7623.98    92000.00    112000.00
------------------------------------------------------------------
Results due to CLASS Statement
• The summary is displayed in the order of the categories of the CLASS variable.
• Variables in the CLASS statement can be character or numeric. Make sure you do not use a continuous numeric variable in the CLASS statement, because each distinct value would form its own group.
• If there are two or more variables in the CLASS statement, the order of the variables in the CLASS statement determines the order in the output report.
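The effect of listing two CLASS variables can be sketched as follows. It assumes that Location and JobCode in mylib.crew (both shown in the data listing earlier) are suitable grouping variables; the title text is made up for illustration.

```sas
/* Group Salary statistics by Location first, then JobCode within Location. */
proc means data=mylib.crew maxdec=2;
   var Salary;
   class Location JobCode;
   title 'Salary by Location and Job Code';
run;
```

Reversing the order to CLASS JobCode Location would list the locations within each job code instead.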
PROC MEANS Procedure Using the BY Statement
PROC MEANS;
   VAR variable-list;
   BY variable;
• When using the BY statement, the data set MUST first be sorted in ascending order by the variables in the BY statement, using PROC SORT.
• The result of using the BY statement is displayed as separate tables, one for each category of the variable in the BY statement.
• If there are two or more variables in the BY statement, their order determines the order of the displayed tables in the report.
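A minimal sketch of the sort-then-BY pattern, using the mylib.crew data set from the earlier slides. The output data set name work.crew_sorted is an assumption chosen for illustration so that the original data set is left unchanged.

```sas
/* Step 1: sort by the BY variable (ascending by default). */
proc sort data=mylib.crew out=work.crew_sorted;
   by JobCode;
run;

/* Step 2: one separate table of Salary statistics per JobCode value. */
proc means data=work.crew_sorted maxdec=2;
   var Salary;
   by JobCode;
   title 'Salary by Job Code (BY statement)';
run;
```

Compared with the CLASS statement, which prints one combined table, the BY statement prints one table per group.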
Exercise
Write a program to read the diabetes data set and use PROC MEANS to produce summary statistics. Run the program after each step and review the results.
• Produce summary statistics for the variables Age, Height, and Weight.
• Produce the summary statistics N, mean, median, max, min, std, and range, and set the number of decimal places to two with MAXDEC=2.
• Add the CLASS statement to produce summary results for each sex.
• Practice using the BY statement for each sex. Before you add the BY SEX statement, make sure you sort the data by SEX.
• Add a WHERE statement to the program to select cases with AGE > 30.
Create data set for summary statistics in PROC MEANS
On many occasions, we may want to create a SAS data set consisting of the summary statistics calculated by PROC MEANS. Use the OUTPUT statement:
OUTPUT OUT=sas-data-set summary-keyword(s)=variable-name(s);
NOTE: Summary keywords include MEAN, MIN, MAX, RANGE, STD, and so on.
The variable names are the names you assign to each summary statistic for each analysis variable, listed in the same order as the variables in the VAR statement.
Create Summary Data Set using PROC MEANS
Example:
PROC MEANS data=mylib.crew;
   VAR Hiredate salary;
   OUTPUT OUT=mylib.discrip
          mean=avghiredate avgsalary
          median=medhiredate medsalary;
RUN;
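To see what the OUTPUT statement stores, the new data set can be printed. The sketch below is an illustration only: the data set name work.salary_stats and the variable names AvgSalary and StdSalary are assumptions, not part of the original example.

```sas
/* NOPRINT suppresses the report; only the output data set is created. */
proc means data=mylib.crew noprint;
   var Salary;
   class JobCode;
   output out=work.salary_stats
          mean=AvgSalary
          std=StdSalary;
run;

/* List the summary data set to inspect its contents. */
proc print data=work.salary_stats;
   title 'Contents of the Summary Data Set';
run;
```

Besides the requested statistics, the output data set contains the automatic variables _TYPE_ and _FREQ_. With a CLASS statement it also holds one overall row (_TYPE_=0) in addition to one row per class level.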
Exercise
Revise the following program to do the following task: use the OUTPUT OUT= statement to save the summary statistics mean, median, and std to a SAS data set named dia_summary, then print this data set to see what it contains.
PROC MEANS data=mylib.diabetes maxdec=2;
   var age height weight;
   class sex;
run;
PROC SUMMARY Procedure
PROC SUMMARY uses the same syntax as PROC MEANS.
PROC SUMMARY does not produce a report by default. To produce the report, add PRINT as an option:
PROC SUMMARY data=sasdataset PRINT;
When do we use PROC SUMMARY?
If you only want to produce the summary and save it to a SAS data set, you can use PROC SUMMARY. Alternatively, you can use the NOPRINT option in PROC MEANS.
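The two typical uses of PROC SUMMARY can be sketched side by side. The sketch uses the mylib.crew data set from the earlier slides; work.job_summary and AvgSalary are hypothetical names chosen for illustration.

```sas
/* PROC SUMMARY prints a report only when PRINT is specified. */
proc summary data=mylib.crew print maxdec=2;
   var Salary;
   class JobCode;
run;

/* Without PRINT, write the statistics to a data set instead. */
proc summary data=mylib.crew;
   var Salary;
   class JobCode;
   output out=work.job_summary mean=AvgSalary;
run;
```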
PROC FREQ procedure Objectives
– Generate simple descriptive statistics using the MEANS procedure.
– Group observations of a SAS data set for analysis using the CLASS statement in the MEANS procedure.
– Create one-way and two-way frequency tables using the FREQ procedure.
– Restrict the variables processed by the FREQ procedure.
PROC FREQ Output
Distribution of Job Code Values
The FREQ Procedure

 Job                              Cumulative    Cumulative
 Code     Frequency    Percent     Frequency       Percent
-----------------------------------------------------------
FLTAT1          14      20.29            14         20.29
FLTAT2          18      26.09            32         46.38
FLTAT3          12      17.39            44         63.77
PILOT1           8      11.59            52         75.36
PILOT2           9      13.04            61         88.41
PILOT3           8      11.59            69        100.00
Goal Report 1
International Airlines wants to know how many employees are in each job code.
Goal Report 2
Categorize job code and salary values to determine how many employees fall into each group.

Salary Distribution by Job Codes
The FREQ Procedure
Table of JobCode by Salary

JobCode           Salary
Frequency        |
Percent          |
Row Pct          |
Col Pct          |Less tha|25,000 t|More tha|  Total
                 |n 25,000|o 50,000|n 50,000|
-----------------+--------+--------+--------+
Flight Attendant |      5 |     39 |      0 |     44
                 |   7.25 |  56.52 |   0.00 |  63.77
                 |  11.36 |  88.64 |   0.00 |
                 | 100.00 | 100.00 |   0.00 |
-----------------+--------+--------+--------+
Pilot            |      0 |      0 |     25 |     25
                 |   0.00 |   0.00 |  36.23 |  36.23
                 |   0.00 |   0.00 | 100.00 |
                 |   0.00 |   0.00 | 100.00 |
-----------------+--------+--------+--------+
Total                   5       39       25       69
                     7.25    56.52    36.23   100.00
Creating a Frequency Report
• PROC FREQ displays frequency counts of the data values in a SAS data set.
• General form of a simple PROC FREQ step:
PROC FREQ DATA=SAS-data-set;
RUN;
Example:
proc freq data=mylib.crew;
run;
Creating a Frequency Report
• By default, PROC FREQ
– analyzes every variable in the SAS data set
– displays each distinct data value
– calculates the number of observations in which each data value appears (and the corresponding percentage)
– indicates for each variable how many observations have missing values.
Default Frequency Reports
proc freq data=mylib.crew;
run;
With no TABLES statement, this step produces a separate frequency table (distribution) for every variable in mylib.crew: HireDate, LastName, FirstName, Location, Phone, EmpID, JobCode, and Salary.
One-Way Frequency Report
• Use the TABLES statement to limit the variables included in the frequency counts. These are typically variables that have a limited number of distinct values.
• General form of a PROC FREQ step with a TABLES statement:
PROC FREQ DATA=SAS-data-set;
   TABLES SAS-variables / NOCUM;
RUN;
The NOCUM option in the TABLES statement suppresses the cumulative frequency and cumulative percentage columns.
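As a short sketch of the general form above, the following step requests a one-way table for JobCode in mylib.crew and drops the cumulative columns. The title text is made up for illustration.

```sas
/* One-way frequency table for JobCode without cumulative statistics. */
proc freq data=mylib.crew;
   tables JobCode / nocum;
   title 'Job Code Counts without Cumulative Columns';
run;
```

The report keeps only the Frequency and Percent columns.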
Creating a Frequency Report
proc freq data=mylib.crew;
   tables JobCode;
   title 'Distribution of Job Code Values';
run;

Distribution of Job Code Values
The FREQ Procedure

 Job                              Cumulative    Cumulative
 Code     Frequency    Percent     Frequency       Percent
-----------------------------------------------------------
FLTAT1          14      20.29            14         20.29
FLTAT2          18      26.09            32         46.38
FLTAT3          12      17.39            44         63.77
PILOT1           8      11.59            52         75.36
PILOT2           9      13.04            61         88.41
PILOT3           8      11.59            69        100.00
Using PROC FORMAT to Redefine Categories of Values in the TABLES Statement
International Airlines wants to use formats to categorize the flight crew by job code.

Stored values              Formatted values
FLTAT1, FLTAT2, FLTAT3     Flight Attendant
PILOT1, PILOT2, PILOT3     Pilot
Analyzing Categories of Values
proc format;
   value $codefmt 'FLTAT1'-'FLTAT3'='Flight Attendant'
                  'PILOT1'-'PILOT3'='Pilot';
run;
proc freq data=mylib.crew;
   format JobCode $codefmt.;
   tables JobCode;
run;
NOTE: The original data values for JobCode are not changed. They are still FLTAT1, FLTAT2, and so on.
Analyzing Categories of Values
Distribution of Job Code Values
The FREQ Procedure

                                          Cumulative    Cumulative
JobCode             Frequency    Percent   Frequency       Percent
-------------------------------------------------------------------
Flight Attendant          44      63.77          44         63.77
Pilot                     25      36.23          69        100.00
Crosstabular Frequency Reports
• A two-way, or crosstabular, frequency report analyzes all possible combinations of the distinct values of two variables.
• The asterisk (*) operator in the TABLES statement is used to cross variables.
• General form of the FREQ procedure to create a crosstabular report:
PROC FREQ DATA=SAS-data-set;
   TABLES variable1*variable2;
RUN;
Variable1 forms the rows and variable2 forms the columns.
Crosstabular Frequency Reports
proc format;
   value $codefmt 'FLTAT1'-'FLTAT3'='Flight Attendant'
                  'PILOT1'-'PILOT3'='Pilot';
   value money low-<25000 ='Less than 25,000'
               25000-50000='25,000 to 50,000'
               50000<-high='More than 50,000';
run;
proc freq data=mylib.crew;
   tables JobCode*Salary;
   format JobCode $codefmt. Salary money.;
   title 'Salary Distribution by Job Codes';
run;
Crosstabular Frequency Reports
Salary Distribution by Job Codes
The FREQ Procedure
Table of JobCode by Salary

JobCode           Salary
Frequency        |
Percent          |
Row Pct          |
Col Pct          |Less tha|25,000 t|More tha|  Total
                 |n 25,000|o 50,000|n 50,000|
-----------------+--------+--------+--------+
Flight Attendant |      5 |     39 |      0 |     44
                 |   7.25 |  56.52 |   0.00 |  63.77
                 |  11.36 |  88.64 |   0.00 |
                 | 100.00 | 100.00 |   0.00 |
-----------------+--------+--------+--------+
Pilot            |      0 |      0 |     25 |     25
                 |   0.00 |   0.00 |  36.23 |  36.23
                 |   0.00 |   0.00 | 100.00 |
                 |   0.00 |   0.00 | 100.00 |
-----------------+--------+--------+--------+
Total                   5       39       25       69
                     7.25    56.52    36.23   100.00
Additional Syntax for the TABLES Statement in PROC FREQ

Syntax                   Equivalent to
tables A*(B C);          tables A*B A*C;
tables (A B)*(C D);      tables A*C B*C A*D B*D;
tables (A B C)*D;        tables A*D B*D C*D;
tables A -- C;           tables A B C;
tables (A -- C)*D;       tables A*D B*D C*D;
tables A*B*C;            Produces separate two-way tables of B*C for each value of A.

NOTE: The A -- C rows assume that A, B, and C are stored in that order in the data set.
To Suppress Some Columns in the PROC FREQ Summary Report
PROC FREQ;
   TABLES var1*var2 / <options>;
Options:
NOFREQ suppresses the cell frequency
NOPERCENT suppresses the cell percent
NOROW suppresses the row percent
NOCOL suppresses the column percent
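A minimal sketch of these options applied to the JobCode by Salary crosstabulation from the earlier slides. The format definitions are repeated so the step stands on its own, and the choice of NOROW and NOCOL is only an illustration.

```sas
/* Formats copied from the crosstabulation example. */
proc format;
   value $codefmt 'FLTAT1'-'FLTAT3'='Flight Attendant'
                  'PILOT1'-'PILOT3'='Pilot';
   value money low-<25000 ='Less than 25,000'
               25000-50000='25,000 to 50,000'
               50000<-high='More than 50,000';
run;

/* Keep cell frequency and percent; drop row and column percentages. */
proc freq data=mylib.crew;
   tables JobCode*Salary / norow nocol;
   format JobCode $codefmt. Salary money.;
run;
```

Each cell of the resulting table shows only two values, the frequency and the overall percent.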
Additional Uses of PROC FREQ
In addition to reporting tables, PROC FREQ can also conduct many statistical tests and compute measures for analyzing categorical data, such as the chi-square test, the Cochran-Mantel-Haenszel test, Fisher's exact test, the kappa coefficient, risk, the odds ratio, and so on.
These topics are beyond the scope of this programming course.
Exercise
The Diabetes data set consists of Sex, Age, Height, Weight, Pulse, FastGluc, and PostGluc for 20 patients. Revise the following program, using PROC FREQ, to perform the following tasks. Run the program after each task and review the results.
1. Use an IF statement to create the variable AGE_G: if AGE > 45 then AGE_G = 'Senior'; otherwise AGE_G = 'Young'. Create one-way tables for the variables SEX, AGE_G, and Pulse using the user-defined format.
2. Create the crosstabular tables sex*(Age_G Pulse), making sure the user-defined format is applied to the Pulse variable.
3. Suppress the row percent and column percent.

proc format;
   value pulft LOW-70 = 'Low'
               71-High = 'High';
run;
data diab;
   set mylib.diabetes;
run;