Chapter 2 - Describing Data
1. Summary Statistics - Proc Means
(a) Var
(b) Title
(c) Class
(d) By
(e) Output
2. More Statistics and Plots - Univariate
3. Proc Sort
This covers sections: 2.A-H. You should also read section
19I.
Creating a SAS data set: Example
/* Population, population density, births and deaths for
Western European countries, 1995 */
DATA EUROPE_W; /* this creates a SAS Data Set called EUROPE_W */
/* Source: Organisation for Economic Co-op. and Devel. Labour
Force Stat., 1976-1996, Paris, 1997 Ed.*/
INPUT COUNTRY :$11. POP DENSITY BRATE DRATE;
/* :$11. reads country names of up to 11 characters; a bare $
would truncate them to 8 */
/* POP = population in 1000's, DENSITY = residents/km^2,
BRATE, DRATE = birth, death rate per 1000 */
DATALINES;
Austria 8047 95.9 . .
Belgium 10137 332.4 . .
Denmark 5228 121.3 13.4 12.0
Finland 5108 15.1 12.3 9.6
France 58143 105.9 12.5 9.1
Germany 81661 228.8 9.4 10.8
Greece 10454 79.2 9.7 9.6
Iceland 267 2.61 7.2 6.0
Ireland 3598 51.2 . .
Italy 57283 190.2 . .
Luxembourg 413 158.8 13.2 9.3
Netherlands 15459 378.91 2.3 8.8
Norway 4348 13.4 13.8 10.3
Portugal 9918 107.3 10.8 10.5
Spain 39210 77.7 9.2 8.7
Sweden 8847 19.7 11.6 10.6
Switzerland 7062 171.0 11.6 8.9
UK 58606 239.4 12.5 11.0
;
Questions of interest:
1. How many missing birth rates are in our sample?
2. What is the mean population density?
3. How variable is population density from country to country?
4. What is the distribution of population? Of population density?
Another SAS Data Set: Infile and Input
• The file snails.txt contains data from an experiment
in which groups of 20 snails were held for periods of
1, 2, 3 or 4 weeks in carefully controlled conditions of
temperature and relative humidity.
• There were two species of snail: A and B.
• At the end of the exposure time the snails were tested
to see if they had survived; the process itself is fatal for
the animals.
• Using the INFILE and INPUT statements, the data can be
read into a SAS data set called SNAILS.
Species Time Humidity Temperature Fatalities N
A 1 60.0 10 0 20
A 1 60.0 15 0 20
...........................................
B 4 75.8 20 7 20
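A minimal sketch of the read step described above. It assumes that the first line of snails.txt is the header row shown in the listing, which is why FIRSTOBS=2 is used; if the file holds only data lines, that option should be dropped. The variable names TEMP and FATALITY are shortened forms chosen for illustration.

```sas
/* Sketch only: FIRSTOBS=2 assumes line 1 of snails.txt is a header row. */
DATA SNAILS;
  INFILE 'snails.txt' FIRSTOBS=2;
  INPUT SPECIES $ TIME HUMIDITY TEMP FATALITY N;
RUN;

/* Print the first few observations to confirm the values were read sensibly. */
PROC PRINT DATA=SNAILS (OBS=5);
RUN;
```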
Questions of interest:
1. What are the mean and standard deviation of the number
of fatalities of species B for each level of exposure
(TIME)?
2. What is the distribution of the number of fatalities?
3. What is an approximate 95% confidence interval for the
mean number of fatalities?
4. How many times did 0 fatalities occur?
Proc Means
• Syntax:
PROC MEANS DATA = SASdata options;
(optional statements)
• Explanation:
– the DATA option specifies a SAS data set. If this
option is not used, SAS uses the most recently
created SAS data set.
– Examples:
PROC MEANS DATA = EUROPE_W;
PROC MEANS DATA = SNAILS;
PROC MEANS;
Optional Statements for Proc Means
• To request specific statistics, list their keywords in the
PROC MEANS statement, e.g. N, NMISS, MEAN, STD, STDERR, CLM,
MIN, MAX, SUM, VAR, CV, SKEWNESS, KURTOSIS and T. The option
MAXDEC=n sets the number of decimal places printed.
• An additional option is NOPRINT, which suppresses printing
of output in the Output window.
PROC MEANS DATA = EUROPE_W NMISS MEAN STD
VAR MAXDEC=4;
gives the number of missing observations for each variable
in the SAS data set EUROPE_W, as well as the mean, standard
deviation and variance. The MAXDEC option restricts the
number of decimal places to 4.
• A number of optional statements can be used, including the
TITLE, VAR, CLASS, BY and OUTPUT statements.
Subcommand statements for Proc Means
• The TITLE statement is useful for preparing reports.
• The VAR statement specifies the variables for which the
summary statistics should be computed.
Example:
PROC MEANS DATA = EUROPE_W NMISS MEAN STD VAR;
TITLE 'Demographic Statistics for Western Europe';
VAR DENSITY BRATE DRATE;
Subgrouping with the Class Statement
• The CLASS statement is used when we require computation
of the various summary statistics for different subgroups
or classes. For example, to estimate the mean number of
fatalities for each of the two species of snail, we use
SPECIES as a class variable:
• Example:
DATA SNAILS;
INFILE 'snails.txt';
INPUT SPECIES $ TIME HUMIDITY TEMP FATALITY N;
PROC MEANS DATA=SNAILS MEAN;
TITLE 'Mean Fatalities For Each Species of Snail';
VAR FATALITY;
CLASS SPECIES;
RUN;
QUIT;
Subgrouping with Class
• After execution, the Output window contains the two
averages:
Mean Fatalities For Each Species of Snail
A 0.708333
B 4.020833
• We are actually interested in the mean number of
fatalities for each type of snail at each level of exposure
(TIME). Thus, TIME is a second classification variable,
nested within the first classification variable SPECIES.
• We can obtain all of the required averages, as well as
95% confidence limits for the true mean in each case,
by employing the following:
PROC MEANS DATA=SNAILS MEAN CLM;
TITLE 'Mean Fatalities For Each Species of Snail';
VAR FATALITY;
CLASS SPECIES TIME;
Subgrouping with the By Statement
• The BY statement is almost interchangeable with the
CLASS statement. However, it will only work when the
data set is sorted according to the BY variable. The
CLASS statement does not have this restriction.
• Example:
PROC MEANS DATA=SNAILS MEAN CLM;
TITLE 'Mean Fatalities For Each Species of Snail';
VAR FATALITY;
BY SPECIES TIME;
• This works because SNAILS is already sorted by SPECIES
and, within each value of SPECIES, by TIME. The CLASS
statement uses more memory than BY, but BY tends to be
slower overall when the data must first be sorted, since
sorting is a slow operation. These differences are only
noticeable for large data sets.
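If the data were not already in order, the BY statement would need a sort first. The following is a minimal sketch under that assumption; SNSORT is a hypothetical name for the sorted copy, and OUT= is used so that the original SNAILS data set is left untouched.

```sas
/* Sort a copy of SNAILS by the BY variables (SNSORT is an illustrative name). */
PROC SORT DATA=SNAILS OUT=SNSORT;
  BY SPECIES TIME;
RUN;

/* BY-group processing now works because SNSORT is ordered by SPECIES, TIME. */
PROC MEANS DATA=SNSORT MEAN CLM;
  VAR FATALITY;
  BY SPECIES TIME;
RUN;
```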
Using Output from Proc Means
• The OUTPUT statement is used to create a new SAS data
set consisting of the summary statistics computed by
PROC MEANS.
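• Example 1: The following creates a new SAS data set
called SNAILSUM which will contain 2 observations (one
for each species) on the 3 variables M_FATAL, S_FATAL,
and V_FATAL, together with SPECIES and the automatic
variables _TYPE_ and _FREQ_. The NWAY option drops the
extra overall (all-species) observation that the CLASS
statement would otherwise add.
PROC MEANS DATA=SNAILS MEAN STD VAR NWAY NOPRINT;
VAR FATALITY;
CLASS SPECIES;
OUTPUT OUT=SNAILSUM
MEAN=M_FATAL
STD =S_FATAL
VAR =V_FATAL;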
Output: Another Example
• The following creates a SAS data set consisting of a single
observation on the two variables M_BRATE and M_DRATE.
The names after MEAN= are assigned, in order, to the
variables in the VAR statement, so list one name per VAR
variable for each statistic requested in the OUTPUT statement.
PROC MEANS DATA=EUROPE_W MEAN;
VAR BRATE DRATE;
OUTPUT OUT=EUROPSUM
MEAN=M_BRATE M_DRATE;
• These new SAS data sets can later be used by SAS
procedures, if desired.
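As an illustration of reusing such a data set, the sketch below reads EUROPSUM in a DATA step and prints the result. The data set name EUROPINC and the derived variable M_INCR (mean birth rate minus mean death rate) are hypothetical names introduced only for this example.

```sas
/* Build on the summary data set created by the OUTPUT statement. */
DATA EUROPINC;
  SET EUROPSUM;
  M_INCR = M_BRATE - M_DRATE;   /* illustrative derived variable */
RUN;

/* Inspect the single summary observation. */
PROC PRINT DATA=EUROPINC;
  VAR M_BRATE M_DRATE M_INCR;
RUN;
```

Printing EUROPSUM itself without a VAR statement would also show the automatic variables _TYPE_ and _FREQ_ that PROC MEANS adds.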
Proc Means: Example
• Here we plot a histogram of the averages of the numbers
of fatalities. Note that we have used the NOPRINT option
here to suppress output to the Output window.
PROC MEANS DATA=SNAILS MEAN NWAY NOPRINT;
TITLE 'Mean Fatalities For Each Species of Snail';
VAR FATALITY;
CLASS SPECIES TIME;
/* NWAY keeps only the SPECIES-by-TIME cell means in SNAILSUM */
OUTPUT OUT = SNAILSUM
MEAN = M_FATAL;
PROC CHART DATA=SNAILSUM;
VBAR M_FATAL;
RUN;
PROC UNIVARIATE
• Syntax:
PROC UNIVARIATE DATA = SASdata options;
statements;
• Many of the options are the same as for PROC MEANS.
Some additional ones are available: see page 27 of the
textbook. The default output is quite extensive and
includes the median and quartiles, the extreme percentiles,
and the lowest and highest 5 observations. These last are
useful for ensuring that the data have been read in sensibly.
• The NORMAL option requests formal tests of normality; the
PLOT option adds a crude normal QQ (probability) plot.
– the plot is an informal, yet useful, check of normality.
– it is a plot of the ordered observations versus the
expected values of ordered normal observations.
– If the plot is close to a straight line, then the data
are approximately normally distributed. Otherwise,
the data are likely non-normal.
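To make the construction of the plot concrete, the sketch below computes approximate expected normal order statistics with PROC RANK and plots the observations against them. It uses Blom's approximation, in which the score for the observation of rank r out of n is the standard normal quantile of (r - 3/8)/(n + 1/4); this is one common choice and not necessarily the one PROC UNIVARIATE uses internally. DENSQQ and NSCORE are hypothetical names.

```sas
/* Approximate expected normal scores for DENSITY (Blom's formula). */
PROC RANK DATA=EUROPE_W NORMAL=BLOM OUT=DENSQQ;
  VAR DENSITY;
  RANKS NSCORE;          /* NSCORE holds the normal score of each observation */
RUN;

/* Observed values (vertical axis) against normal scores (horizontal axis). */
PROC PLOT DATA=DENSQQ;
  PLOT DENSITY*NSCORE;
RUN;
```

A roughly straight-line pattern in this plot corresponds to approximate normality.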
Normal QQ Plot: Example
• This checks whether the distribution of Western European
population densities is approximately normal.
PROC UNIVARIATE DATA=EUROPE_W NORMAL PLOT;
/* PLOT draws the normal probability plot; NORMAL adds normality tests */
VAR DENSITY;
• To train your eye to recognize typical departures from
normality, it helps to simulate normal and non-normal data
sets of various sample sizes:
DATA _NULL_;
FILE 'normal.dat';
N = 20;
DO I=1 TO N;
X = RANNOR(0);
PUT X;
END;
RUN; QUIT;
Normal QQ Plotting
• Now, construct the normal QQ plot:
DATA NORTEST;
INFILE 'normal.dat';
INPUT X;
PROC UNIVARIATE NORMAL PLOT;
VAR X;
RUN; QUIT;
• Repeating this for a number of different simulation runs
will give you a good notion as to what the normal QQ
plot should look like.
Normal QQ Plotting of Non-Normal Data
• To see what a normal QQ plot shouldn’t look like, try
something like the following:
DATA _NULL_;
FILE 'normal.dat';
N = 20;
DO I=1 TO N;
U = UNIFORM(0);
IF U < .8 THEN X = RANNOR(0);
ELSE X = 5*RANNOR(0);
PUT X;
END;
RUN; QUIT;
Normal QQ Plots of Non-Normal Data
• Or draw the observations from an exponential distribution:
DATA _NULL_;
FILE 'normal.dat';
N = 20;
DO I=1 TO N;
X = RANEXP(0);
PUT X;
END;
RUN; QUIT;
• In each case, create the normal QQ plot to see what
happens when the data is really not normally distributed.
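The same experiment can be run without an intermediate text file by writing the simulated values straight into a SAS data set. This is a minimal sketch of that alternative; SIMEXP is a hypothetical data set name, and the sample size of 20 simply mirrors the examples above.

```sas
/* Simulate 20 exponential observations directly into a data set. */
DATA SIMEXP;
  DO I = 1 TO 20;
    X = RANEXP(0);   /* seed 0 takes the seed from the system clock */
    OUTPUT;          /* write one observation per loop iteration */
  END;
RUN;

/* PLOT requests the normal probability plot; NORMAL adds normality tests. */
PROC UNIVARIATE DATA=SIMEXP NORMAL PLOT;
  VAR X;
RUN;
```

Because the seed comes from the clock, each run gives a different sample, which is what is wanted when comparing many plots.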
The Plot options and Proc Means
• Crude stem-and-leaf plots and boxplots can be produced
using the PLOT option.
• Most of the statements that can be used with PROC MEANS
can be used with PROC UNIVARIATE. In older SAS releases the
exception is the CLASS statement: there you must make sure
the data are sorted properly and use the BY statement
instead.
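The BY approach for PROC UNIVARIATE might look as follows. This is a sketch only, assuming the SNAILS data set read earlier; SNBYSP is a hypothetical name for the sorted copy.

```sas
/* Sort by the grouping variable before using BY. */
PROC SORT DATA=SNAILS OUT=SNBYSP;
  BY SPECIES;
RUN;

/* One set of UNIVARIATE output, with crude plots, per species. */
PROC UNIVARIATE DATA=SNBYSP PLOT;
  VAR FATALITY;
  BY SPECIES;
RUN;
```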
PROC SORT
• Syntax
PROC SORT DATA=SASdata;
BY var1 var2 ... ;
Example 1:
PROC SORT DATA = EUROPE_W;
BY DENSITY;
The SAS data set then becomes
Country       POP  DENSITY  BRATE  DRATE
Iceland       267     2.61    7.2    6.0
Norway       4348    13.40   13.8   10.3
Finland      5108    15.10   12.3    9.6
........................................
Netherlands 15459   378.91    2.3    8.8
The following sorts the data set so that DENSITY appears
in reverse order.
PROC SORT DATA = EUROPE_W;
BY DESCENDING DENSITY;
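PROC SORT replaces the input data set with the sorted version unless an OUT= data set is named. The sketch below keeps EUROPE_W in its original order and writes the sorted copy elsewhere; EURSORT is a hypothetical name.

```sas
/* Sort into a new data set so that EUROPE_W itself is left unchanged. */
PROC SORT DATA=EUROPE_W OUT=EURSORT;
  BY DESCENDING DENSITY;
RUN;

/* List the countries from most to least densely populated. */
PROC PRINT DATA=EURSORT;
  VAR COUNTRY DENSITY;
RUN;
```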