Humaira Khair
A Closer Look at Proc Compare
How does PROC COMPARE make my work easier?
Proc Compare is a procedure that allows two datasets to be compared for properties, number of observations and number of variables.
For a dataset, we can find differences in:
date of creation, last modification of the datasets,
number of variables and observations of the datasets.
For matching variables, we can get output about differences in:
Values, type, length, formats, informats and labels.
For observations, we can get a comparison of the values of matching observations. We can also set a tolerance that controls how far apart two values may be before they are reported as unequal.
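As a rough illustration of controlling how different two values may be, the sketch below uses the METHOD= and CRITERION= options of PROC COMPARE. The data sets tol_base and tol_comp, the variable names and all values are made up for this example.

```sas
/* Hypothetical toy data: weights differ by small amounts */
data tol_base;
  input subj weight;
  datalines;
1 70.10
2 82.50
3 64.00
;
run;

data tol_comp;
  input subj weight;
  datalines;
1 70.11
2 82.50
3 64.30
;
run;

/* Judge values unequal only when the absolute difference exceeds 0.05 */
proc compare base=tol_base compare=tol_comp
             method=absolute criterion=0.05;
  id subj;
run;
```

With this toy data, only the weight for subj 3 differs by more than the criterion, so it is the only value expected to be reported as unequal.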
PROC COMPARE produces lengthy output. With well-chosen
options and statements, we can compare pairs of SAS
data sets at multiple levels without needing DATA step
MERGEs or SQL JOINs.
proc compare BASE=th_old COMPARE=th_new ;
title 'Proc Compare with no options' ;
run ;
BASE=     Specify the base data set
COMPARE=  Specify the comparison data set
Proc Compare with no options
The COMPARE Procedure
Comparison of WORK.TH_OLD with WORK.TH_NEW
(Method=EXACT)
Data Set Summary
Dataset Created Modified NVar NObs
WORK.TH_OLD 23OCT13:11:56:06 23OCT13:11:56:06 10 13
WORK.TH_NEW 23OCT13:11:56:06 23OCT13:11:56:06 10 14
Variables Summary
Number of Variables in Common: 10.
Number of Variables with Differing Attributes: 10.
Number of BY Variables: 2.
Listing of Common Variables with Differing Attributes
Variable Dataset Type Length Format Informat
crnum WORK.TH_OLD Char 7
WORK.TH_NEW Char 7 $7.
dxdt WORK.TH_OLD Num 8 YYMMDD10.
WORK.TH_NEW Num 8 YYMMDD10.
rectype WORK.TH_OLD Char 11 $11. $11.
WORK.TH_NEW Char 8 $8.
recloc WORK.TH_OLD Char 10 $10. $10.
WORK.TH_NEW Char 7 $7.
vitalst WORK.TH_OLD Char 8 $8. $8.
WORK.TH_NEW Char 1 $1.
t7 WORK.TH_OLD Char 15 $15. $15.
WORK.TH_NEW Char 12 $12.
cci3 WORK.TH_OLD Char 11 $11. $11.
WORK.TH_NEW Char 8
cci4 WORK.TH_OLD Char 11 $11. $11.
WORK.TH_NEW Char 8
txdt3 WORK.TH_OLD Num 8 DATE9.
WORK.TH_NEW Num 8 DDMMYY10.
txdt4 WORK.TH_OLD Num 8 DATE9.
WORK.TH_NEW Num 8 DDMMYY10.
Observation Summary
Observation Base Compare
First Obs 1 1
First Unequal 1 1
Last Unequal 13 13
Last Match 13 13
Last Obs . 14
Number of Observations in Common: 13.
Number of Observations in WORK.TH_NEW but not in WORK.TH_OLD: 1.
Total Number of Observations Read from WORK.TH_OLD: 13.
Total Number of Observations Read from WORK.TH_NEW: 14.
Number of Observations with Some Compared Variables Unequal: 13.
Number of Observations with All Compared Variables Equal: 0.
Values Comparison Summary
Number of Variables Compared with All Observations Equal: 1.
Number of Variables Compared with Some Observations Unequal: 9.
Number of Variables with Missing Value Differences: 4.
Total Number of Values which Compare Unequal: 65.
Maximum Difference: 138.
Variables with Unequal Values
Variable Type Len1 Len2 Label Ndif MaxDif MissDif
crnum CHAR 7 7 10 0
dxdt NUM 8 8 dxdt 9 138 0
recloc CHAR 10 7 recloc 4 0
vitalst CHAR 8 1 vitalst 1 0
t7 CHAR 15 12 t7 13 0
cci3 CHAR 11 8 cci3 11 11
cci4 CHAR 11 8 cci4 3 3
txdt3 NUM 8 8 txdt3 11 0 11
txdt4 NUM 8 8 txdt4 3 0 3
Value Comparison Results for Variables
__________________________________________________________
|| ($) CancerCare registry number
|| Base Value Compare Value
Obs || crnum crnum
________ || _______ _______
||
4 || 866 723
5 || 488 866
6 || 834 488
<skip>
__________________________________________________________
__________________________________________________________
|| dxdt
|| (N) diagnosis date (SAS)
|| Base Compare
Obs || dxdt dxdt Diff. % Diff
________ || _________ _________ _________ _________
||
4 || 2010-12-14 2010-07-29 -138.0000 -0.7415
6 || 2010-11-26 2010-12-14 18.0000 0.0968
7 || 2010-12-13 2010-11-26 -17.0000 -0.0914
<skip>
________________________________________________________
__________________________________________________________
|| recloc
|| ($) record location
|| Base Value Compare Value
Obs || recloc recloc
________ || __________ _______
||
2 || mcc sbu
3 || sbu mcc
4 || mcc sbu
8 || sbu mcc
__________________________________________________________
_________________________________________________________
|| vitalst
|| ($) vital status
|| Base Value Compare Value
Obs || vitalst vitalst
________ || ________ _
||
13 || a d
__________________________________________________________
__________________________________________________________
|| t7
|| ($) Tumour stage according to AJCC 7th edition
|| Base Value Compare Value
Obs || t7 t7
________ || _______________ ____________
||
1 || T1b T1b(m)
2 || T1b T1b(s)
3 || T1b T1b(s)
4 || T1b T2(m)
5 || T2 T1b(s)
6 || T1a T2(m)
7 || T3 T1a(s)
8 || T1b T3(s)
9 || T2 T1b(m)
10 || T3 T2(s)
11 || T3 T3(s)
12 || T3 T3(m)
13 || T1b T3(s)
__________________________________________________________
__________________________________________________________
|| cci3 ($) Treatment code3
|| Base Value Compare Value
Obs || cci3 cci3
________ || ___________ ________
2 || 1FU59HAV
4 || 1FU59CAV
5 || 1FU59HAV
6 || 1FU59CAV
7 || 1FU59CAV
8 || 1FU59CAV
9 || 1FU59CAV
10 || 1FU59CAV
11 || 1MC87LA
12 || 1MC87LA
13 || 1FU59CAV
__________________________________________________________
__________________________________________________________
|| cci4 ($) Treatment code4
|| Base Value Compare Value
Obs || cci4 cci4
________ || ___________ ________
8 || 1MC87LA
11 || 1FU59HAV
12 || 1FU59CAV
__________________________________________________________
__________________________________________________________
|| txdt3
|| Base Compare
Obs || txdt3 txdt3 Diff. % Diff
________ || _________ _________ _________ _________
2 || . 02/12/11 . .
4 || . 18/11/11 . .
5 || 22JUN2011 . . .
6 || . 28/06/11 . .
7 || 24MAY2011 . . .
8 || . 24/05/11 . .
9 || 25FEB2011 . . .
10 || . 25/02/11 . .
__________________________________________________________
|| txdt3
|| Base Compare
Obs || txdt3 txdt3 Diff. % Diff
________ || _________ _________ _________ _________
||
11 || 10DEC2010 . . .
12 || . 10/12/10 . .
13 || 01APR2011 . . .
__________________________________________________________
__________________________________________________________
|| txdt4
|| Base Compare
Obs || txdt4 txdt4 Diff. % Diff
________ || _________ _________ _________ _________
||
8 || . 15/06/12 . .
11 || 09MAY2011 . . .
12 || . 09/05/11 . .
__________________________________________________________
Now let’s start adding some options:
proc compare BASE=th_old COMPARE=th_new NOVALUES LISTVAR ;
title 'Proc Compare: If we want to compare the contents of the data sets' ;
run ;
Here we add two options. NOVALUES suppresses the “Value Comparison Results for Variables” section of the report, and LISTVAR lists the variables that are found in only one of the two data sets.
proc compare BASE=th_old COMPARE=th_new
NOVALUES WARNING NOPRINT ;
title 'PROC COMPARE with NOVALUES, WARNING & NOPRINT options' ;
run ;
Here the NOPRINT option suppresses all printed output, and the
WARNING option writes warning messages to the LOG when differences are found.
WARNING: 10 variables have conflicting attributes in the two data sets.
WARNING: Data set WORK.TH_NEW contains 1 observations not in WORK.TH_OLD.
WARNING: Values of the following 9 variables compare unequal: crnum dxdt recloc vitalst t7 cci3 cci4 txdt3 txdt4
WARNING: The data sets WORK.TH_OLD and WORK.TH_NEW contain unequal values.
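In a batch job, the outcome of the comparison can also be picked up without reading the log. The following is only a minimal sketch: PROC COMPARE stores a return code in the automatic macro variable SYSINFO, where 0 means no differences were found. The data sets rc_base and rc_comp and the macro variable cmp_rc are hypothetical names used for illustration.

```sas
/* Hypothetical one-row data sets that differ in x */
data rc_base;
  x = 1;
run;

data rc_comp;
  x = 2;
run;

proc compare base=rc_base compare=rc_comp noprint;
run;

/* Save SYSINFO right away; later steps overwrite it */
%let cmp_rc = &sysinfo;

data _null_;
  if &cmp_rc = 0 then put 'Data sets compare equal.';
  else put 'Differences found; SYSINFO = ' "&cmp_rc";
run;
```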
In most cases, the main goal is to compare the values of
variables for matching observations in two data sets, using
one or more ID variables. Before using the ID statement we need to
sort both data sets by the ID variable(s). The default report lists
differences for every ID value and can be very long, so we add some other options.
proc compare BASE=th_old COMPARE=th_new OUT=th_change
NOSUMMARY OUTBASE OUTCOMPARE OUTDIF OUTNOEQUAL
NOPRINT ;
id crnum ;
title 'PROC COMPARE using ID statement' ;
run ;
I have found this particular way of using PROC COMPARE
to be very helpful to check that expected updates to a data
set have been made.
Control the output data set
OUT=          Create an output data set
OUTDIF        Write an observation that contains the differences for each pair of matching observations
OUTNOEQUAL    Suppress the writing of observations when all values are equal
Control the details in the default report
BRIEFSUMMARY  Print only a short comparison summary
NOPRINT       Suppress all printed output
_TYPE_ _OBS_ crnum dxdt recloc vitalst t7 cci3 cci4 txdt3 txdt4
BASE 2 138 2010-12-14 mcc a T1b . .
COMPARE 2 138 2010-12-14 sbu a T1b(s) 1FU59HAV 2011-12-02 .
DIF 2 138 E XXX....... ........ ...XXX......... XXXXXXXX... ........... . E
BASE 4 866 2010-12-14 mcc a T1b . .
COMPARE 5 866 2010-12-14 mcc a T1b(s) . .
DIF 5 866 E .......... ........ ...XXX......... ........... ........... E E
BASE 5 488 2010-12-14 mcc a T2 1FU59HAV 2011-06-22 .
COMPARE 6 488 2010-12-14 mcc a T2(m) 1FU59CAV 2011-06-28 .
DIF 6 488 E .......... ........ ..XXX.......... .....X..... ........... 1960-01-07 E
BASE 6 834 2010-11-26 mcc a T1a . .
COMPARE 7 834 2010-11-26 mcc a T1a(s) . .
DIF 7 834 E .......... ........ ...XXX......... ........... ........... E E
BASE 7 201 2010-12-13 mcc a T3 1FU59CAV 2011-05-24 .
COMPARE 8 201 2010-12-13 mcc a T3(s) 1FU59CAV 1MC87LA 2011-05-24 2012-06-15
DIF 8 201 E .......... ........ ..XXX.......... ........... XXXXXXX.... E .
BASE 8 389 2010-11-29 sbu a T1b . .
COMPARE 9 389 2010-11-29 sbu a T1b(m) . .
DIF 9 389 E .......... ........ ...XXX......... ........... ........... E E
BASE 9 818 2010-09-29 sbu a T2 1FU59CAV 2011-02-25 .
COMPARE 10 818 2010-09-29 sbu a T2(s) 1FU59CAV 2011-02-25 .
DIF 10 818 E .......... ........ ..XXX.......... ........... ........... E E
BASE 10 676 2010-12-13 sbu a T3 . .
COMPARE 11 676 2010-12-13 sbu a T3(s) . .
DIF 11 676 E .......... ........ ..XXX.......... ........... ........... E E
BASE 11 693 2010-09-30 sbu a T3 1MC87LA 1FU59HAV 2010-12-10 2011-05-09
COMPARE 12 693 2010-09-30 sbu a T3(m) 1MC87LA 1FU59CAV 2010-12-10 2011-05-09
DIF 12 693 E .......... ........ ..XXX.......... ........... .....X..... E E
BASE 12 358 2010-11-25 sbu a T3 . .
COMPARE 13 358 2010-11-25 sbu d T3(s) . .
DIF 13 358 E .......... X....... ..XXX.......... ........... ........... E E
BASE 13 161 2010-07-29 sbu a T1b 1FU59CAV 2011-04-01 .
COMPARE 14 161 2010-07-29 sbu a T1b(m) 1FU59CAV 2011-04-01 .
DIF 14 161 E .......... ........ ...XXX......... ........... ........... E E
• _TYPE_ (Type of Observation) is a character variable. Its value indicates the source of the values for the matching variables in that observation. In this example it takes the values BASE, COMPARE and DIF because the OUTBASE, OUTCOMPARE and OUTDIF options were specified.
• _OBS_ (Observation Number) is a numeric variable containing a number further identifying the source of the OUT= observations. For observations with _TYPE_ equal to DIF, _OBS_ is a sequence number that counts the matching observations in the BY group.
• For numeric variables, E indicates that the two values are equal for that variable in that observation; otherwise the difference between the values is shown.
• For character variables, a period (.) is included for each position that is the same between the two data sets and an X is used to designate unequal characters.
• The OUTBASE and OUTCOMPARE options also ensure that non-matching observations (i.e. the ID value is in one data set and not the other) will be included in the output data set.
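The whole sequence (sort, compare with an ID variable, then inspect the OUT= data set) can be sketched as follows. This is an illustration only: demo_old, demo_new and demo_change are small made-up data sets, not the th_old and th_new data shown above.

```sas
/* Hypothetical stand-ins for the old and new versions of a data set */
data demo_old;
  input crnum $ vitalst $ age;
  datalines;
866 a 61
138 a 54
488 a 70
;
run;

data demo_new;
  input crnum $ vitalst $ age;
  datalines;
488 a 70
138 d 55
866 a 61
723 a 48
;
run;

/* The ID statement expects both data sets sorted by the ID variable */
proc sort data=demo_old; by crnum; run;
proc sort data=demo_new; by crnum; run;

proc compare base=demo_old compare=demo_new out=demo_change
             outbase outcompare outdif outnoequal noprint;
  id crnum;
run;

/* Keep only the difference rows to see which values changed */
proc print data=demo_change noobs;
  where _type_ = 'DIF';
run;
```

With this toy data, the only matched pair that differs is crnum 138, so the listing should contain a single DIF row. Dropping the WHERE statement also shows the BASE and COMPARE rows, including the unmatched crnum 723 from demo_new.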
No need to Rename
The WITH statement lets us compare variables that have different names in the two data sets. Variables in the VAR and WITH statements are matched one-to-one, in the order listed.
proc compare BASE=th_old COMPARE=th_new ;
id crnum ;
var sex age height weight ;
with gender age_yrs ht wt ;
run ;
Check for Formatted Values
PROC COMPARE compares unformatted values. If two matching variables are formatted differently, PROC COMPARE lists the formats of the variables.
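A small sketch of this behaviour, using two made-up one-row data sets (fmt_base and fmt_comp) that hold the same date value under different formats:

```sas
/* Same underlying date value, displayed with two different formats */
data fmt_base;
  txdt = '01APR2011'd;
  format txdt date9.;
run;

data fmt_comp;
  txdt = '01APR2011'd;
  format txdt ddmmyy10.;
run;

/* The format difference is listed among the differing attributes,
   while the stored value itself compares equal */
proc compare base=fmt_base compare=fmt_comp;
run;
```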
PROC COMPARE is a validation tool that is worth
getting to know.
There are many more options available to customize the
PROC COMPARE output. Which ones are worth exploring
depends on the purpose of your work.
Questions?