BIOSTATISTICS LABORATORY PART 2: …web2.facs.org/ORC2014Flashdrive/MATERIALS/OBJECTIVES_OUTLINES...BIOSTATISTICS LABORATORY PART 2: UNIVARIATE ANALYSIS: COMPARING MEANS, MEDIANS,

ACS Outcomes Research Course Biostatistics Laboratory #2

BIOSTATISTICS LABORATORY PART 2: UNIVARIATE ANALYSIS: COMPARING MEANS, MEDIANS, AND PROPORTIONS

Learning objectives:

1) Understand that the type of variable (continuous vs. dichotomous) determines which statistical test should be used.

2) Learn the commands for comparing proportions using the Chi-square test.

3) Learn the commands to compare two means with a t-test.

4) Learn the commands to compare two medians using the Wilcoxon rank-sum test.

COMPARING PROPORTIONS: Chi-Square Test:

Proportions are used to summarize dichotomous variables. For example, the proportion of cardiac surgery patients with diabetes is the number with diabetes (numerator) divided by the total number having surgery (denominator).

When comparing two proportions, the chi-square test (or χ² test) is most often used. This test can compare proportions across just two groups or across more than two groups.

The generic command structure for the chi-square test is a modification of the tab command (where varname1 and varname2 are two dichotomous variables):

tab varname1 varname2, chi

Note the ", chi" tagged onto the end. This command will produce a 2x2 table and give the results of the χ² test, including a P-value: the probability of observing a difference at least as large as the one seen if the two proportions were truly the same (the so-called null hypothesis).

For the exercises in this laboratory, we will continue to use our dataset of CABG patients in Maryland. The data should already be located on your hard drive; locate it and open it in STATA.

Using this dataset, compare the mortality rate for patients aged 80 years or older with that for patients younger than 80 years. First, you will need to create a dichotomous variable for this analysis (1 = age 80 years or older, 0 = age younger than 80 years). The commands for generating and labeling this new variable are written below (we will call the variable "age80").


Type the commands:

generate age80=1 if age>=80

replace age80=0 if age<80

label variable age80 "Age greater than 80 years"

label define lage80 1 "Yes" 0 "No"

label values age80 lage80

Task: Perform a tabulation of this new variable to make sure it was created correctly. Type the command:

tab age80

STATA Output:

Age greater |
    than 80 |
      years |      Freq.     Percent        Cum.
------------+-----------------------------------
         No |      4,248       91.00       91.00
        Yes |        420        9.00      100.00
------------+-----------------------------------
      Total |      4,668      100.00

Now we will use the χ² test to determine whether the mortality rate in patients 80 years or older is statistically different from the mortality rate in patients younger than 80 years. Type the command:

tab died age80, chi

STATA output:

      Died |
    during |  Age greater than 80
hospitaliz |         years
     ation |        No        Yes |     Total
-----------+----------------------+----------
     Alive |     4,132        398 |     4,530
      Dead |       110         21 |       131
-----------+----------------------+----------
     Total |     4,242        419 |     4,661

          Pearson chi2(1) =   8.1677   Pr = 0.004
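The Pearson statistic reported by this command can be reproduced by hand from the four cell counts. The following is a minimal Python sketch (Python is used here only as a calculator and is not part of the lab); the counts are copied from the STATA table for "tab died age80, chi", and the variable names are illustrative.

```python
import math

# Cell counts copied from the "tab died age80, chi" table.
alive_no, alive_yes = 4132, 398   # alive: age80 = No, Yes
dead_no, dead_yes = 110, 21       # dead:  age80 = No, Yes

n = alive_no + alive_yes + dead_no + dead_yes
row_alive = alive_no + alive_yes
row_dead = dead_no + dead_yes
col_no = alive_no + dead_no
col_yes = alive_yes + dead_yes

# Shortcut formula for the Pearson chi-square statistic of a 2x2 table.
chi2 = n * (alive_no * dead_yes - alive_yes * dead_no) ** 2 / (
    row_alive * row_dead * col_no * col_yes
)

# With 1 degree of freedom, the upper-tail probability is erfc(sqrt(chi2 / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))

print(f"chi2(1) = {chi2:.4f}, P = {p_value:.3f}")
```

The shortcut formula applies only to 2x2 tables; for larger tables the statistic is the sum of (observed - expected)² / expected over all cells.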

We can modify the table in STATA's output to include the mortality rate (proportion who died) in each age group by adding "col" (for column percentage) after the comma in the command line, as shown in the following example.


Type the command:

tab died age80, col chi

STATA output:

Died during|  Age greater than 80
hospitaliz |         years
     ation |        No        Yes |     Total
-----------+----------------------+----------
     Alive |     4,132        398 |     4,530
           |     97.41      94.99 |     97.19
-----------+----------------------+----------
      Dead |       110         21 |       131
           |      2.59       5.01 |      2.81
-----------+----------------------+----------
     Total |     4,242        419 |     4,661
           |    100.00     100.00 |    100.00

          Pearson chi2(1) =   8.1677   Pr = 0.004
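Each column percentage in this table is simply a cell count divided by its column total. As a quick check, here is a short Python sketch (again used only as a calculator; the dictionary names are illustrative) that recomputes the mortality rate in each age group from the counts in the table.

```python
# Counts copied from the "tab died age80, col chi" table.
dead = {"No": 110, "Yes": 21}        # deaths by age80 group
total = {"No": 4242, "Yes": 419}     # column totals by age80 group

# Mortality rate (column percentage of deaths) in each age group.
mortality_pct = {group: 100 * dead[group] / total[group] for group in dead}

for group, pct in mortality_pct.items():
    print(f"age80 = {group}: {pct:.2f}% died")
```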

From the STATA output, we can see the mortality rate for those younger than 80 is 2.59% compared to 5.01% for those 80 or older, a difference that is statistically significant (P=0.004 by the chi-square test).

Note of Caution: When sample sizes are very small (fewer than 5 expected observations in any of the cells of the 2x2 table), the Chi-square test may not give the correct answer. In this setting, Fisher's exact test is most commonly used. The commands can be found in the STATA help or reference guides. (But when using a large administrative database, small samples such as this are rarely encountered.)

DETERMINING THE DISTRIBUTION OF A CONTINUOUS VARIABLE:

Knowing the distribution of a variable (whether or not it has a symmetric, bell-shaped distribution) is important because it may dictate the type of statistical test used. As a brief review, we will again use a frequency histogram to determine whether a variable has a symmetric distribution. Continuous variables with a symmetric distribution are usually described using means, and we can compare two means using the Student's t-test. When a variable is skewed to the left or right, however, you may want to describe it using the median (50th percentile), and we can compare two medians using the Wilcoxon rank-sum test.

The general command structure to create a histogram:

histogram varname, frequency normal

Task: Using the above command, create a frequency histogram for the two variables we will be using in this exercise: patient age (variable "age") and length of stay (variable "los").

You will once again find that age ("age") has a normal distribution and length of stay ("los") does not. With that in mind, you can describe each variable using the appropriate statistics (the mean or median for central tendency and the standard deviation or inter-quartile range for spread). Remember, both of these sets of statistics can be found using a single command:

summarize varname, detail

Type the command:

summarize age, detail

STATA Output:

                  Age in years at admission
-------------------------------------------------------------
      Percentiles      Smallest
 1%           40             16
 5%           47             22
10%           51             32       Obs                4668
25%           58             32       Sum of Wgt.        4668

50%           67                      Mean           65.78535
                        Largest       Std. Dev.      10.73579
75%           74             91
90%           79             92       Variance       115.2571
95%           81             93       Skewness      -.3413035
99%           85             94       Kurtosis       2.609806

(The median age is the 50% row, 67 years; the mean age is shown at the right, 65.8 years.)

Note that the median and mean are very similar; this is what you would expect when the data have a symmetric distribution (no right or left skew).

Task: Use the same command to determine the mean and median for "los". Are they the same? Does this make sense given the histogram of "los"?

COMPARING MEANS:

Comparing Means of Two Groups: Student's t-test

When comparing the mean values of a continuous variable (e.g., age) between two groups, the Student's t-test is used. The command for the t-test is as follows (where varname1 is the continuous variable and varname2 is a dichotomous variable that divides observations into two groups):

ttest varname1, by(varname2)

This command will compare the mean value of observations where varname2=1 with the mean value of observations where varname2=0. As an example, compare the mean age of patients who live with those who die.

Type the command:

ttest age, by(died)
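The t statistic that this command reports can be reproduced from each group's size, mean, and standard deviation. The following is a minimal Python sketch (Python is used here only as a calculator and is not part of the lab); the numbers are copied from the STATA output for this command, and the variable names are illustrative. It assumes equal variances in the two groups, which matches the "Two-sample t test with equal variances" heading in that output.

```python
import math

# Summary statistics copied from the "ttest age, by(died)" output.
n_alive, mean_alive, sd_alive = 4530, 65.62009, 10.73013   # died = 0
n_dead, mean_dead, sd_dead = 131, 71.23664, 9.5449          # died = 1

df = n_alive + n_dead - 2

# Pooled variance: each group's variance weighted by its degrees of freedom.
pooled_var = ((n_alive - 1) * sd_alive**2 + (n_dead - 1) * sd_dead**2) / df

# Standard error of the difference in means under equal variances.
se_diff = math.sqrt(pooled_var * (1 / n_alive + 1 / n_dead))

diff = mean_alive - mean_dead
t_stat = diff / se_diff

print(f"diff = {diff:.4f}, SE = {se_diff:.4f}, t = {t_stat:.4f}, df = {df}")
```

The printed difference, standard error, t statistic, and degrees of freedom can be compared against the "diff" row and the lines beneath the table in the STATA output.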

STATA output:

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       0 |    4530    65.62009    .1594247    10.73013    65.30754    65.93264
       1 |     131    71.23664    .8339418      9.5449    69.58679     72.8865
---------+--------------------------------------------------------------------
combined |    4661    65.77794    .1572822     10.7379     65.4696    66.08629
---------+--------------------------------------------------------------------
    diff |           -5.616553    .9481812               -7.475437   -3.757669
------------------------------------------------------------------------------
Degrees of freedom: 4659

                  Ho: mean(0) - mean(1) = diff = 0

     Ha: diff < 0               Ha: diff != 0              Ha: diff > 0
       t =  -5.9235                t =  -5.9235              t =  -5.9235
   P < t =   0.0000          P > |t| =   0.0000          P > t =   1.0000

(The Mean column gives the mean ages for those who lived and died. The middle P-value, P > |t|, tests whether the two means are the same.)

The mean age of those who died (died=1) is 71.2 years compared to 65.6 years for those who lived. There are three p-values shown at the bottom, but we are most interested in the middle one: the p-value associated with the alternative hypothesis (Ha) that the difference in means does not equal zero. The p-value of the test is <.001.

Note of Caution: The t-test above assumes that the individuals in the two groups are independent (not the same people). If you are comparing mean values of some variable (e.g., blood pressure) measured in the same person twice, you should use a paired t-test (see STATA help or reference manual for the commands).

COMPARING MEDIANS:

When comparing medians of two groups, the Wilcoxon rank-sum test is used. The first step is to determine the median value in each group. We have previously seen how the median and inter-quartile range can be determined using the command summarize varname, detail, but we will now introduce a second method. We will demonstrate how to compare the median length of stay (los) for patients who lived and those who died. Here is the new command for getting the median:

centile los, centile(25,50,75)

(The numbers in the parentheses tell STATA which percentiles you want in the output.)


STATA output:

                                                       -- Binom. Interp. --
    Variable |     Obs  Percentile      Centile        [95% Conf. Interval]
-------------+-------------------------------------------------------------
         los |    4668         25             5               5           5
             |                 50             6               6           7
             |                 75             9               9           9

In this case, the median (50th percentile) is 6 days and the inter-quartile range (25th to 75th percentile) is from 5 to 9 days. This output, however, gives you the median for the overall group. To estimate the median for only a subset (those who lived or those who died), you can limit the group analyzed using an "if" qualifier:

centile los if died==1, centile (25,50,75)

(Note that a double equal sign must be used with the "if" qualifier.) This command will give you the median and inter-quartile range for only those who died:

                                                       -- Binom. Interp. --
    Variable |     Obs  Percentile      Centile        [95% Conf. Interval]
-------------+-------------------------------------------------------------
         los |     131         25             5               4           6
             |                 50            10               7          11
             |                 75            17              13    25.23522

Likewise, the following command will yield the median for those who lived:

centile los if died==0, centile (25,50,75)

                                                       -- Binom. Interp. --
    Variable |     Obs  Percentile      Centile        [95% Conf. Interval]
-------------+-------------------------------------------------------------
         los |    4530         25             5               5           5
             |                 50             6               6           7
             |                 75             9               9           9

Now, looking at the output from the past two commands, we can see the median length of stay for those who died was 10 days compared to 6 days for those who lived. Those appear different, but we can test this difference for statistical significance using the Wilcoxon rank-sum test.

Type the command:

ranksum los, by(died)


STATA output:

Two-sample Wilcoxon rank-sum (Mann-Whitney) test

        died |      obs    rank sum    expected
-------------+---------------------------------
           0 |     4530    10498664    10559430
           1 |      131      366127      305361
-------------+---------------------------------
    combined |     4661    10864791    10864791

unadjusted variance    2.305e+08
adjustment for ties   -2812539.7
                      ----------
adjusted variance      2.277e+08

Ho: los(died==0) = los(died==1)
             z =  -4.027
    Prob > |z| =   0.0001

(Prob > |z| is the p-value for the comparison of medians. You may also hear this called the Mann-Whitney test.)

The p-value for the comparison is shown at the very bottom (p<.001). The difference in medians (10 days for those who died vs. 6 days for those who lived) is "statistically significant".
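The z statistic in the rank-sum output follows from the observed rank sum, the rank sum expected under the null hypothesis, and the variance adjusted for ties. The following is a minimal Python sketch (used only as a calculator; variable names are illustrative) that rebuilds these quantities from the numbers STATA reports for "ranksum los, by(died)". The ties adjustment is copied from the STATA output rather than recomputed, because recomputing it would require the raw data.

```python
import math

# Values copied from the "ranksum los, by(died)" output.
n_alive, n_dead = 4530, 131
rank_sum_alive = 10498664
ties_adjustment = -2812539.7

n_total = n_alive + n_dead

# Rank sum expected for the died = 0 group if both groups had the same distribution.
expected_alive = n_alive * (n_total + 1) / 2

# Variance of the rank sum before and after the correction for tied values.
unadjusted_var = n_alive * n_dead * (n_total + 1) / 12
adjusted_var = unadjusted_var + ties_adjustment

z = (rank_sum_alive - expected_alive) / math.sqrt(adjusted_var)

# Two-sided p-value from the standard normal distribution.
p_value = math.erfc(abs(z) / math.sqrt(2))

print(f"expected = {expected_alive:.0f}, z = {z:.3f}, P = {p_value:.4f}")
```

The negative z reflects that patients who lived (died = 0) have a lower rank sum than expected, i.e., shorter stays on average than those who died.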


EXTRA EXERCISES:

Continue on and perform these extra exercises if you finish the lab early, or perform them on your own after the course.

Comparing Means for More than Two Groups: Analysis of Variance (ANOVA)

The Student's t-test is appropriate when comparing two groups. When you wish to compare the means from more than two groups, the analysis of variance (ANOVA) is the test of choice. The test is different from the t-test because it does not tell you which group is different from the others. Instead, it just tells you that there are statistically significant differences between the group means. If this is found, you must look at the data and perform directed t-tests, which can give you more specific information about the differences between two groups.

The general command structure for ANOVA is as follows (where varname1 is the continuous variable and varname2 is a categorical variable with more than two groups):

oneway varname1 varname2

(Note: if varname2 is also a continuous variable, analysis of covariance or linear

regression should be used to study the relationship. Consult a statistician if you aren’t familiar with these tests.)

As an example, we will compare the mean age for patients with elective, urgent, and emergent admission type. Type the command:

oneway age atype, tab

STATA output:

            |    Summary of Age in years at
  Admission |             admission
       type |        Mean   Std. Dev.       Freq.
------------+------------------------------------
   Emergent |   66.126266   10.927197        1481
     Urgent |   64.947484    11.06317         914
   Elective |   65.903384   10.462491        2246
      Other |   64.666667   10.111874           9
------------+------------------------------------
      Total |   65.784086   10.736317        4650

                        Analysis of Variance
    Source              SS         df      MS            F     Prob > F
------------------------------------------------------------------------
Between groups      856.320619      3   285.440206      2.48     0.0594
 Within groups      535026.902   4646    115.15861
------------------------------------------------------------------------
    Total           535883.222   4649   115.268493

(The Mean column gives the mean ages of patients with each admission type. Prob > F is the p-value for whether the means are significantly different.)

The mean ages appear similar across admission types and the p-value is 0.06, so there is no statistically significant variation in age across admission types.
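The F statistic in the ANOVA table is the ratio of the between-group mean square to the within-group mean square, where each mean square is a sum of squares divided by its degrees of freedom. The following is a minimal Python sketch (used only as a calculator; variable names are illustrative) that recomputes it from the values STATA reports for "oneway age atype, tab". The p-value itself comes from the F distribution and is not recomputed here.

```python
# Values copied from the "oneway age atype, tab" output.
n_groups, n_obs = 4, 4650          # Emergent, Urgent, Elective, Other
ss_between = 856.320619
ss_within = 535026.902

# Degrees of freedom: groups minus 1, and observations minus groups.
df_between = n_groups - 1
df_within = n_obs - n_groups

# Mean squares and their ratio, the F statistic.
ms_between = ss_between / df_between
ms_within = ss_within / df_within
f_stat = ms_between / ms_within

print(f"MS between = {ms_between:.4f}, MS within = {ms_within:.4f}, F = {f_stat:.2f}")
```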


Task: Now use this command to determine if there are differences in the mean ages of CABG patients based on payer type (variable for payer type is "pay1").

When interpreting the results of the ANOVA, remember that the test is different from the t-test because it does not tell you which group is different from others. In fact, two groups may appear identical, while others appear quite different. Instead, ANOVA tells you that there is more between group variation than within group variation. If this is found to be the case, you must look at the data and perform directed t-tests, which can give you more specific information about the differences between any two groups.

Comparing Medians for More than Two Groups: Kruskal-Wallis Test

When you need to compare the medians for more than two groups, the Kruskal-Wallis test is used. The generic command structure (where varname1 is the continuous variable and varname2 is the categorical variable):

kwallis varname1, by(varname2)

As an example, consider the difference in median length of stay (variable "los") between patients with different admission types. First, we need to find the median length of stay for each value of "atype", i.e., the median for emergent, urgent, and elective admissions. Remember that "atype" is a categorical variable, so we will need to use the "if" qualifier to obtain the median of each group. To review what number goes with each admission type, use the "codebook" command.

Type the command:

codebook atype

STATA output:

-------------------------------------------------------------------------------
atype                                                            Admission type
-------------------------------------------------------------------------------

                  type:  numeric (float)
                 label:  latype

                 range:  [1,6]                        units:  1
         unique values:  4                        missing .:  18/4668

            tabulation:  Freq.   Numeric  Label
                          1481         1  Emergent
                           914         2  Urgent
                          2246         3  Elective
                             9         6  Other

Now we will use the "centile" command to determine the median within each group.

Type the command (where "1" means emergent):

centile los if atype==1, centile(25,50,75)


STATA output:

                                                       -- Binom. Interp. --
    Variable |     Obs  Percentile      Centile        [95% Conf. Interval]
-------------+-------------------------------------------------------------
         los |    1481         25             6               6           6
             |                 50             8               8           8
             |                 75            12              11          12

Task: Modify this command to determine the median length of stay for the other two categories, urgent and elective.

Now, we will see if these differences are statistically significant. Type the command:

kwallis los, by(atype)

STATA output:

Test: Equality of populations (Kruskal-Wallis test)

  +----------------------------+
  |    atype |  Obs | Rank Sum |
  |----------+------+----------|
  | Emergent | 1481 | 4.37e+06 |
  |   Urgent |  914 | 2.25e+06 |
  | Elective | 2246 | 4.18e+06 |
  |    Other |    9 | 11788.50 |
  +----------------------------+

chi-squared =   603.922 with 3 d.f.
probability =     0.0001

chi-squared with ties =   611.343 with 3 d.f.

(The probability line gives the p-value for the comparison of medians.)

The differences in medians are statistically significant. Similar to ANOVA, this doesn't mean that all the medians are different from each other. Instead, it means that the medians are not all equal.

It is worth noting that t-tests and ANOVA (parametric tests) can be used in most cases. We have introduced the Wilcoxon rank-sum test and Kruskal-Wallis test (non-parametric tests) just so you know they exist. You should also appreciate that when variables aren't normally distributed you may need to use them. If unsure, you should consult an epidemiologist or statistician.
