Oneway Frequency Tabulations and Twoway Cross-tabulations Using SAS


  • 8/14/2019 Oneway Frequency Tabulations and Twoway Cross-tabulations Using SAS

    SAS Oneway Frequency Tabulations and Twoway Contingency Tables (Crosstabs)

    /***********************************************************

    This example illustrates:

    How to create user-defined formats

    How to recode continuous variables into ordinal categories

    How to generate oneway and twoway tables and basic tests

    The following tests are illustrated:

    Chi-square goodness of fit test

    Binomial test of proportion for a two-level variable

    Exact Binomial test

    Pearson Chi-square test

    Fisher's exact test

    Cochran-Armitage test for trend

    Procs used:

    Proc Format

    Proc Means

    Proc Freq

    Proc Contents

    Filename: frequencies.sas

    ************************************************************/

    OPTIONS FORMCHAR="|----|+|---+=|-/\<>*";

    OPTIONS NODATE PAGENO=1 FORMDLIM=" ";

    PROC FORMAT;

    VALUE AGEFMT 1 = "1:19-29"

    2 = "2:30-39"

    3 = "3:>39";

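    VALUE HIAGEFMT 1 = "1:AGE > 39"

    2 = "2:AGE <=39";

    /* Labels containing "<" were damaged in extraction. The HICHOLFMT and WTFMT labels

    below are reconstructed from the cut points used in the data step. */

    VALUE HICHOLFMT 1 = "1:>=240"

    2 = "2:<240";

    VALUE PILLFMT 1 = "1:PILL"

    2 = "2:NO PILL";

    VALUE WTFMT 1 = "1:<120"

    2 = "2:120-139"

    3 = "3:>=140";

    VALUE HIBMIFMT 1 = "1:BMI>23"

    2 = "2:BMI<=23";

    RUN;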


    The log that results from these Proc Format commands is shown below. These formats are stored in the Work library, and are therefore temporary. In the sections that follow, the formats are applied within each procedure by using a format statement. They are not automatically attached to variables, and have to be specified for each procedure.

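    4 PROC FORMAT;
    5 VALUE AGEFMT 1 = "1: Age 19-29"
    6 2 = "2: Age 30-39"
    7 3 = "3: Age >39";
    NOTE: Format AGEFMT has been output.
    8
    9 VALUE HIAGEFMT 1 = "1: Age >39"
    10 2 = "2: Age <=39";
    [...]
    12 VALUE HICHOLFMT 1 = "1: Chol >=240"
    13 2 = "2: Chol <240";
    [...]
    26 VALUE HIBMIFMT 1 = "1: BMI >23"
    27 2 = "2: BMI <=23";
    [...]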
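As a minimal sketch of what the log above describes, the step below defines a temporary format in the Work library and then applies it inside one procedure with a FORMAT statement. The data set name work.demo and its four made-up values are assumptions for illustration only; PILLFMT mirrors the format defined earlier.

```sas
/* Define a temporary format (stored in WORK.FORMATS for this session only). */
proc format;
  value pillfmt 1 = "1:PILL"
                2 = "2:NO PILL";
run;

/* Hypothetical toy data, not the Werner data set. */
data work.demo;
  input pill @@;
  datalines;
1 2 2 1
;
run;

/* The format is applied only for this procedure; the data set itself
   still stores the raw codes 1 and 2 with no format attached. */
proc freq data=work.demo;
  tables pill;
  format pill pillfmt.;
run;

/* Without a FORMAT statement the raw codes are displayed. */
proc freq data=work.demo;
  tables pill;
run;
```

Comparing the two PROC FREQ listings shows why the format statement has to be repeated in every procedure that should display the labels.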


    BMI = WTKG/(HTCM/100)**2;

    IF BMI > 23 then HIBMI = 1;

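    IF 0 < BMI <= 23 THEN HIBMI = 2;

    /* Text between "IF 0" and ">= 240" was lost in extraction; the statements on

    either side of this comment are reconstructed and others may be missing. */

    IF CHOL NOT=. THEN DO;

    IF CHOL >= 240 THEN HICHOL=1;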

    IF CHOL < 240 THEN HICHOL=2;

    IF CHOL < 200 THEN CHOLCAT=1;

    IF CHOL >= 200 AND CHOL < 240 THEN CHOLCAT=2;

    IF CHOL >= 240 THEN CHOLCAT=3;

    END;

    IF WT NOT=. THEN DO;

    IF WT < 120 THEN WTCAT=1;

    IF WT >= 120 AND WT < 140 THEN WTCAT=2;

    IF WT >= 140 THEN WTCAT=3;

    END;

    DROP WTKG HTCM;

    RUN;
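The IF WT NOT=. THEN DO guard in the step above matters because SAS treats a numeric missing value as smaller than any number, so an unguarded comparison such as WT < 120 is true when WT is missing. The sketch below illustrates this on made-up values; the data set work.recode_demo and the variable WTCAT_BAD are hypothetical names used only for this illustration.

```sas
/* Hypothetical toy data: three weights and one missing value. */
data work.recode_demo;
  input wt @@;

  /* Unguarded recode: the missing WT satisfies WT < 120,
     so it is wrongly placed in category 1. */
  if wt < 120 then wtcat_bad = 1;
  else if wt < 140 then wtcat_bad = 2;
  else wtcat_bad = 3;

  /* Guarded recode, as in the handout: WTCAT stays missing
     when WT is missing. */
  if wt not=. then do;
    if wt < 120 then wtcat = 1;
    else if wt < 140 then wtcat = 2;
    else wtcat = 3;
  end;

  datalines;
100 130 150 .
;
run;

/* N for WTCAT should equal N for WT; N for WTCAT_BAD will be one larger. */
proc means data=work.recode_demo n min max;
  var wt wtcat wtcat_bad;
run;
```

This is the situation the sample-size check with Proc Means is designed to catch: a recoded variable with more non-missing cases than its source variable.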

    We use two methods for checking the newly created variables. The simplest one is Proc Means. Most importantly, this tells us whether we have included all cases in our new variables, and whether we have avoided adding data where there should be none. We will carefully examine the sample size for each original variable, and for each new variable that was created, to be sure they match. This simple check should always be done first!

    TITLE "DESCRIPTIVE STATISTICS";

    PROC MEANS;

    RUN;


    DESCRIPTIVE STATISTICS

    The MEANS Procedure

    Variable      N           Mean        Std Dev        Minimum        Maximum
    -------------------------------------------------------------------------------
    ID          188        1598.96        1057.09      3.0000000        3519.00
    AGE         188     33.8191489     10.1126942     19.0000000     55.0000000
    HT          186     64.5107527      2.4850673     57.0000000     71.0000000
    WT          186    131.6720430     20.6605767     94.0000000    215.0000000
    PILL        188      1.5000000      0.5013351      1.0000000      2.0000000
    CHOL        186    236.1505376     42.5555145    155.0000000    390.0000000
    ALB         186      4.1112903      0.3579694      3.2000000      5.0000000
    CALC        185      9.9621622      0.4795556      8.6000000     11.1000000
    URIC        187      4.7705882      1.1572312      2.2000000      9.9000000
    PAIR        188     47.5000000     27.2063810      1.0000000     94.0000000
    BMI         184     19.0736235      2.6285786     15.2305671     29.6996059
    HIBMI       184      1.9021739      0.2978899      1.0000000      2.0000000
    AGEGROUP    188      1.9255319      0.8432096      1.0000000      3.0000000
    HIAGE       188      1.6808511      0.4673916      1.0000000      2.0000000
    HICHOL      186      1.5322581      0.5003051      1.0000000      2.0000000
    CHOLCAT     186      2.2634409      0.7783954      1.0000000      3.0000000
    WTCAT       186      2.0322581      0.7490767      1.0000000      3.0000000
    -------------------------------------------------------------------------------

    A second way to check recodes of continuous variables into categories is illustrated below. Check the minimum and maximum value of the original variable within each category of the new categorical variable, to be sure the range of values is specified as you wanted it to be. Do this only after you have checked the sample sizes by using a simple Proc Means statement, as illustrated above.

    TITLE "CHECKING RECODE OF WT INTO WTCAT";

    PROC MEANS DATA=B510.WERNER;

    CLASS WTCAT;

    VAR WT;

    FORMAT WTCAT WTFMT.;

    RUN;

    CHECKING RECODE OF WT INTO WTCAT

    The MEANS Procedure

    Analysis Variable : WT

                       N
    WTCAT            Obs     N           Mean        Std Dev        Minimum        Maximum
    ---------------------------------------------------------------------------------------------
    1: Wt [...]
    [...] >=140kg     55    55    156.0363636     17.2969315    140.0000000    215.0000000
    ---------------------------------------------------------------------------------------------

    TITLE "CHECKING RECODE OF AGE INTO AGEGROUP";

    PROC MEANS DATA=B510.WERNER;

    CLASS AGEGROUP;

    VAR AGE;

    FORMAT AGEGROUP AGEFMT.;

    RUN;

    CHECKING RECODE OF AGE INTO AGEGROUP

    The MEANS Procedure

    Analysis Variable : AGE

                       N
    AGEGROUP         Obs     N           Mean        Std Dev        Minimum        Maximum
    ------------------------------------------------------------------------------------------
    1: Age 19-29      74    74     23.8378378      2.7846302     19.0000000     29.0000000
    2: Age 30-39      54    54     33.5925926      3.0376165     30.0000000     39.0000000
    3: Age >39        60    60     46.3333333      4.6892111     40.0000000     55.0000000
    ------------------------------------------------------------------------------------------

    TITLE "CHECKING RECODE OF CHOL INTO HICHOL";

    PROC MEANS DATA=B510.WERNER;

    CLASS HICHOL;

    VAR CHOL;

    FORMAT HICHOL HICHOLFMT.;

    RUN;

    CHECKING RECODE OF CHOL INTO HICHOL

    The MEANS Procedure

    Analysis Variable : CHOL

                       N
    HICHOL           Obs     N           Mean        Std Dev        Minimum        Maximum
    -------------------------------------------------------------------------------------------
    1: Chol >=240     87    87    272.4712644     29.0159696    240.0000000    390.0000000
    2: Chol [...]


    Frequency Missing = 2

                                             Cumulative    Cumulative
    AGEGROUP        Frequency     Percent     Frequency       Percent
    -----------------------------------------------------------------
    1: Age 19-29           74       39.36            74         39.36
    2: Age 30-39           54       28.72           128         68.09
    3: Age >39             60       31.91           188        100.00

                                             Cumulative    Cumulative
    HIAGE           Frequency     Percent     Frequency       Percent
    ----------------------------------------------------------------
    1: Age >39             60       31.91            60         31.91
    2: Age [...]

    [...] >=240            87       46.77            87         46.77
    2: Chol [...]

    [...] 23               18        9.78            18          9.78


    2:BMI [...] 23

    -------------------------------------
    Proportion (P)              0.0978
    ASE                         0.0219
    95% Lower Conf Limit        0.0549
    95% Upper Conf Limit        0.1408

    Exact Conf Limits
    95% Lower Conf Limit        0.0590
    95% Upper Conf Limit        0.1502

    Test of H0: Proportion = 0.2
    ASE under H0                0.0295
    Z                          -3.4649
    One-sided Pr <  Z           0.0003
    Two-sided Pr > |Z|          0.0005
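The asymptotic test above can be checked by hand. The sketch below assumes the counts shown in the output (18 of 184 cases with BMI > 23) and the null proportion 0.20 from the BINOMIAL option, and uses only the standard normal CDF function PROBNORM. The variable names are hypothetical. The values written to the log should agree with the ASE under H0, Z, and the two p-values in the output above, up to rounding.

```sas
/* Hand check of the asymptotic binomial test (no data set is created). */
data _null_;
  x  = 18;     /* cases in the HIBMI = 1 category, from the frequency table */
  n  = 184;    /* non-missing cases for HIBMI */
  p0 = 0.20;   /* null proportion given in BINOMIAL (P=.20) */

  p_hat = x / n;                      /* observed proportion */
  ase0  = sqrt(p0 * (1 - p0) / n);    /* standard error under H0 */
  z     = (p_hat - p0) / ase0;        /* test statistic */

  p_one = probnorm(z);                /* one-sided Pr < Z (z is negative here) */
  p_two = 2 * probnorm(-abs(z));      /* two-sided Pr > |Z| */

  put p_hat= 8.4 ase0= 8.4 z= 8.4 p_one= 8.4 p_two= 8.4;
run;
```

Note that the standard error in the test uses the null proportion 0.20, whereas the ASE reported with the confidence limits uses the observed proportion.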

    If you wish to obtain an exact binomial test of the null hypothesis, use the exact statement.

    PROC FREQ DATA=B510.WERNER ORDER=INTERNAL;

    TABLES HIBMI / BINOMIAL (P=.20);

    EXACT BINOMIAL;

    FORMAT HIBMI HIBMIFMT.;

    RUN;

    This results in an exact test of the null hypothesis, in addition to the default asymptotic test.

    Exact Test

    One-sided Pr [...]


    1: Chol [...]
    [...] >=240        87      46.77      50.00        186     100.00

    Frequency Missing = 2

    Chi-Square Test
    for Specified Proportions
    -------------------------
    Chi-Square         0.8889
    DF                      2
    Pr > ChiSq         0.6412

    Effective Sample Size = 186
    Frequency Missing = 2

    Two-Sample Tests for Categorical Variables:

    Chi-Square test of Independence

    Two by Two Table:

    If you wish to examine the relationship between two categorical variables, you can use Proc Freq. Use the chisq option to obtain the Pearson chi-square test of independence (or of homogeneity), and use the expected option to get the expected value in each cell. The commands below can be used to get a cross-tabulation. In this case, we have a 2 by 2 table, because each categorical variable has two levels. We test:

    H0: HIAGE is independent of HICHOL status
    HA: HIAGE is not independent of HICHOL status

    Note that Fisher's exact test is produced by default for a 2 x 2 table when the chisq option is specified. Read either the one-sided or the two-sided p-value for Fisher's exact test; both appear at the bottom of the respective panel of output.

    TITLE "2x2 TABLE";

    PROC FREQ DATA=B510.WERNER ORDER=INTERNAL;

    TABLES HIAGE*HICHOL / CHISQ EXPECTED;

    FORMAT HIAGE HIAGEFMT. HICHOL HICHOLFMT.;

    RUN;

    2x2 TABLE

    Table of HIAGE by HICHOL

    HIAGE       HICHOL

    Frequency  |
    Expected   |
    Percent    |
    Row Pct    |
    Col Pct    |1: Chol |2: Chol |  Total
               |>=240   |[...]   |
    -----------+--------+--------+
    [...] 39   |     42 |     18 |     60
               | 28.065 | 31.935 |
               |  22.58 |   9.68 |  32.26
               |  70.00 |  30.00 |
               |  48.28 |  18.18 |
    -----------+--------+--------+
    2: Age [...]
                  46.77    53.23   100.00

    Frequency Missing = 2

    Statistics for Table of HIAGE by HICHOL

    Statistic                     DF       Value      Prob
    ------------------------------------------------------
    Chi-Square                     1     19.1914    [...]
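The Expected entries in a cross-tabulation are the row total times the column total, divided by the table total. As a small worked check, the sketch below recomputes the two expected counts for the Age >39 row, assuming the row total of 60 and the column totals 87 and 99 (table total 186) that appear in the output in this document. The variable names are hypothetical.

```sas
/* Expected cell count under independence = row total * column total / n. */
data _null_;
  row_total  = 60;    /* Age >39 row, non-missing HICHOL */
  col1_total = 87;    /* Chol >=240 column */
  col2_total = 99;    /* remaining column */
  n          = 186;   /* table total (2 cases have missing values) */

  exp11 = row_total * col1_total / n;
  exp12 = row_total * col2_total / n;

  /* These should match the Expected line for the Age >39 row. */
  put exp11= 8.3 exp12= 8.3;
run;
```

The Pearson chi-square statistic sums (observed - expected)**2 / expected over all cells, so large gaps such as 42 observed against about 28 expected drive the large test statistic.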


    [...]
                 |  37.04 |  62.96 |
    -------------+--------+--------+
    3: Age >39   |     42 |     18 |     60
                 |  70.00 |  30.00 |
    -------------+--------+--------+
    Total              87       99      186

    Frequency Missing = 2

    Statistics for Table of AGEGROUP by HICHOL

    Statistic                     DF       Value      Prob
    ------------------------------------------------------
    Chi-Square                     2     19.2578    [...]


    Statistics for Table of AGEGROUP by WTCAT

    Statistic                     DF       Value      Prob
    ------------------------------------------------------
    Chi-Square                     4     11.7418    0.0194
    Likelihood Ratio Chi-Square    4     11.4638    0.0218
    Mantel-Haenszel Chi-Square     1      8.7820    0.0030
    Phi Coefficient                       0.2513
    Contingency Coefficient               0.2437
    Cramer's V                            0.1777

    We now look at some examples using a different data set.

    We first use Proc Contents to learn what variables are in the data set, and the types of all the variables.

    title;

    proc contents data=b510.cars;

    run;

    The CONTENTS Procedure

    Data Set Name        B510.CARS                              Observations           406
    Member Type          DATA                                   Variables              8
    Engine               V9                                     Indexes                0
    Created              Monday, August 21, 2006 09:41:24 PM    Observation Length     64
    Last Modified        Monday, August 21, 2006 09:41:24 PM    Deleted Observations   0
    Protection                                                  Compressed             NO
    Data Set Type                                               Sorted                 NO
    Label
    Data Representation  WINDOWS_32
    Encoding             wlatin1  Western (Windows)

    Alphabetic List of Variables and Attributes

    #    Variable    Type    Len    Format    Label
    5    ACCEL       Num       8    4.        Time to Accelerate from 0 to 60 mph (sec)
    8    CYLINDER    Num       8    1.        Number of Cylinders
    2    ENGINE      Num       8    5.        Engine Displacement (cu. inches)
    3    HORSE       Num       8    5.        Horsepower
    1    MPG         Num       8    4.        Miles per Gallon
    7    ORIGIN      Num       8    1.        Country of Origin
    4    WEIGHT      Num       8    4.        Vehicle Weight (lbs.)
    6    YEAR        Num       8    2.        Model Year (modulo 100)

    proc format;

    value originfmt 1="USA"

    2="Europe"

    3="Japan";

    run;

    Output from the SAS log is shown below. Because this format had already been defined in the current run of SAS, there is a note in the log stating that it is already on the library. If this format were to be resubmitted with new values, the new values would overwrite the old values.

    142 proc format;
    143 value originfmt 1="USA"
    144 2="Europe"
    145 3="Japan";
    NOTE: Format ORIGINFMT is already on the library.
    NOTE: Format ORIGINFMT has been output.
    146 run;


    We now get simple descriptive statistics for each level of the variable ORIGIN, using a class statement.

    proc means data=b510.cars;

    class origin;

    format origin originfmt.;

    run;

    /*To get less output use options nolabel;*/

    options nolabel;

    proc means data=b510.cars;

    class origin;

    format origin originfmt.;

    run;

    The MEANS Procedure

    Country
    of          N
    Origin    Obs  Variable  Label                                         N            Mean
    --------------------------------------------------------------------------------------
    USA       253  MPG       Miles per Gallon                            248      20.1282258
                   ENGINE    Engine Displacement (cu. inches)            253     247.7134387
                   HORSE     Horsepower                                  249     119.6064257
                   WEIGHT    Vehicle Weight (lbs.)                       253         3367.33
                   ACCEL     Time to Accelerate from 0 to 60 mph (sec)   253      14.9284585
                   YEAR      Model Year (modulo 100)                     253      75.5217391
                   CYLINDER  Number of Cylinders                         253       6.2766798

    Europe     73  MPG       Miles per Gallon                             70      27.8914286
                   ENGINE    Engine Displacement (cu. inches)             73     109.4657534
                   HORSE     Horsepower                                   71      81.0000000
                   WEIGHT    Vehicle Weight (lbs.)                        73         2431.49
                   ACCEL     Time to Accelerate from 0 to 60 mph (sec)    73      16.8219178
                   YEAR      Model Year (modulo 100)                      73      75.7397260
                   CYLINDER  Number of Cylinders                          73       4.1506849

    Japan      79  MPG       Miles per Gallon                             79      30.4506329
                   ENGINE    Engine Displacement (cu. inches)             79     102.7088608
                   HORSE     Horsepower                                   79      79.8354430
                   WEIGHT    Vehicle Weight (lbs.)                        79         2221.23
                   ACCEL     Time to Accelerate from 0 to 60 mph (sec)    79      16.1721519
                   YEAR      Model Year (modulo 100)                      79      77.4430380
                   CYLINDER  Number of Cylinders                          79       4.1012658
    --------------------------------------------------------------------------------------

    Country
    of          N
    Origin    Obs  Variable  Label                                            Std Dev        Minimum
    ---------------------------------------------------------------------------------------------
    USA       253  MPG       Miles per Gallon                               6.3768059     10.0000000
                   ENGINE    Engine Displacement (cu. inches)              98.7799678     85.0000000
                   HORSE     Horsepower                                    39.7991647     52.0000000
                   WEIGHT    Vehicle Weight (lbs.)                        788.6117392        1800.00
                   ACCEL     Time to Accelerate from 0 to 60 mph (sec)      2.8011159      8.0000000
                   YEAR      Model Year (modulo 100)                        3.7145843     70.0000000
                   CYLINDER  Number of Cylinders                            1.6626528      4.0000000

    Europe     73  MPG       Miles per Gallon                               6.7239296     16.2000000
                   ENGINE    Engine Displacement (cu. inches)              22.3719083     68.0000000
                   HORSE     Horsepower                                    20.8134572     46.0000000
                   WEIGHT    Vehicle Weight (lbs.)                        490.8836172        1825.00
                   ACCEL     Time to Accelerate from 0 to 60 mph (sec)      3.0109175     12.2000000
                   YEAR      Model Year (modulo 100)                        3.5630332     70.0000000
                   CYLINDER  Number of Cylinders                            0.4907826      4.0000000

    Japan      79  MPG       Miles per Gallon                               6.0900481     18.0000000
                   ENGINE    Engine Displacement (cu. inches)              23.1401260     70.0000000
                   HORSE     Horsepower                                    17.8191991     52.0000000
                   WEIGHT    Vehicle Weight (lbs.)                        320.4972479        1613.00
                   ACCEL     Time to Accelerate from 0 to 60 mph (sec)      1.9549370     11.4000000
                   YEAR      Model Year (modulo 100)                        3.6505947     70.0000000
                   CYLINDER  Number of Cylinders                            0.5904135      3.0000000
    ---------------------------------------------------------------------------------------------

    Country
    of          N
    Origin    Obs  Variable  Label                                            Maximum
    -------------------------------------------------------------------------------
    USA       253  MPG       Miles per Gallon                              39.0000000
                   ENGINE    Engine Displacement (cu. inches)             455.0000000
                   HORSE     Horsepower                                   230.0000000
                   WEIGHT    Vehicle Weight (lbs.)                            5140.00
                   ACCEL     Time to Accelerate from 0 to 60 mph (sec)     22.2000000
                   YEAR      Model Year (modulo 100)                       82.0000000
                   CYLINDER  Number of Cylinders                            8.0000000

    Europe     73  MPG       Miles per Gallon                              44.3000000
                   ENGINE    Engine Displacement (cu. inches)             183.0000000
                   HORSE     Horsepower                                   133.0000000
                   WEIGHT    Vehicle Weight (lbs.)                            3820.00
                   ACCEL     Time to Accelerate from 0 to 60 mph (sec)     24.8000000
                   YEAR      Model Year (modulo 100)                       82.0000000
                   CYLINDER  Number of Cylinders                            6.0000000

    Japan      79  MPG       Miles per Gallon                              46.6000000
                   ENGINE    Engine Displacement (cu. inches)             168.0000000
                   HORSE     Horsepower                                   132.0000000
                   WEIGHT    Vehicle Weight (lbs.)                            2930.00
                   ACCEL     Time to Accelerate from 0 to 60 mph (sec)     21.0000000
                   YEAR      Model Year (modulo 100)                       82.0000000
                   CYLINDER  Number of Cylinders                            6.0000000
    -------------------------------------------------------------------------------

    We now take a look at a 3 by 5 table (the row variable has 3 levels and the column variable has 5 levels) to see if there is any association between Country of Origin and Number of Cylinders. The Pearson chi-square test is perhaps appropriate here, but let's see.

    title "Row variable is nominal, column variable is ordinal";

    proc freq data = b510.cars;

    tables origin*cylinder / chisq expected;

    format origin originfmt.;

    run;


    Row variable is nominal, column variable is ordinal

    Table of ORIGIN by CYLINDER

    ORIGIN(Country of Origin)     CYLINDER(Number of Cylinders)

    Frequency|
    Expected |
    Percent  |
    Row Pct  |
    Col Pct  |       3|       4|       5|       6|       8|  Total
    ---------+--------+--------+--------+--------+--------+
    USA      |      0 |     72 |      0 |     74 |    107 |    253
             | 2.4988 | 129.31 | 1.8741 | 52.474 | 66.842 |
             |   0.00 |  17.78 |   0.00 |  18.27 |  26.42 |  62.47
             |   0.00 |  28.46 |   0.00 |  29.25 |  42.29 |
             |   0.00 |  34.78 |   0.00 |  88.10 | 100.00 |
    ---------+--------+--------+--------+--------+--------+
    Europe   |      0 |     66 |      3 |      4 |      0 |     73
             |  0.721 | 37.311 | 0.5407 | 15.141 | 19.286 |
             |   0.00 |  16.30 |   0.74 |   0.99 |   0.00 |  18.02
             |   0.00 |  90.41 |   4.11 |   5.48 |   0.00 |
             |   0.00 |  31.88 | 100.00 |   4.76 |   0.00 |
    ---------+--------+--------+--------+--------+--------+
    Japan    |      4 |     69 |      0 |      6 |      0 |     79
             | 0.7802 | 40.378 | 0.5852 | 16.385 | 20.872 |
             |   0.99 |  17.04 |   0.00 |   1.48 |   0.00 |  19.51
             |   5.06 |  87.34 |   0.00 |   7.59 |   0.00 |
             | 100.00 |  33.33 |   0.00 |   7.14 |   0.00 |
    ---------+--------+--------+--------+--------+--------+
    Total           4      207        3       84      107      405
                 0.99    51.11     0.74    20.74    26.42   100.00

    Frequency Missing = 1

    Statistics for Table of ORIGIN by CYLINDER

    Statistic                     DF       Value      Prob
    ------------------------------------------------------
    Chi-Square                     8    185.7937    [...]


    WARNING: Computing exact p-values for this problem may require much time and memory. Press the
    system interrupt key to terminate exact computations.
    NOTE: There were 406 observations read from the data set B510.CARS.
    NOTE: PROCEDURE FREQ used (Total process time):
          real time           31.02 seconds
          cpu time            23.54 seconds

    We now resubmit the commands, using instead the Monte Carlo option in SAS (mc). This gives a good approximation to the Fisher's exact test p-value, based by default on 10,000 randomly sampled tables rather than on a full enumeration.

    title "Row variable is nominal, column variable is ordinal";

    title2 "Try Fisher's Exact test";

    proc freq data = b510.cars;

    tables origin*cylinder / chisq expected;

    exact fisher / mc;

    format origin originfmt.;

    run;
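Because the Monte Carlo estimate depends on random sampling, the p-value changes slightly from run to run unless a seed is fixed. The sketch below is a self-contained variant of the step above: it rebuilds the ORIGIN by CYLINDER table from the non-zero cell counts shown earlier (the data set work.cars_counts and the variable COUNT are hypothetical), and sets the number of samples and the seed explicitly with the N= and SEED= options of the EXACT statement. The seed value is arbitrary.

```sas
/* Non-zero cell counts copied from the ORIGIN by CYLINDER table above.
   ORIGIN codes: 1 = USA, 2 = Europe, 3 = Japan. */
data work.cars_counts;
  input origin cylinder count;
  datalines;
1 4 72
1 6 74
1 8 107
2 4 66
2 5 3
2 6 4
3 3 4
3 4 69
3 6 6
;
run;

proc freq data=work.cars_counts;
  weight count;                            /* each record stands for COUNT cars */
  tables origin*cylinder / chisq expected;
  exact fisher / mc n=10000 seed=12345;    /* fixed seed gives a repeatable estimate */
run;
```

Increasing N= narrows the confidence interval that SAS reports around the Monte Carlo estimate, at the cost of more computing time.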

    The output for these tests is shown below. The appropriate p-value is the portion labeled Pr