ACS Outcomes Research Course Biostatistics Laboratory #2
BIOSTATISTICS LABORATORY PART 2: UNIVARIATE ANALYSIS: COMPARING MEANS, MEDIANS, AND PROPORTIONS
Learning objectives:
1) Understand that the type of variable (continuous vs. dichotomous) determines which statistical test should be used.
2) Learn the commands for comparing proportions using the Chi-square test.
3) Learn the commands to compare two means with a t-test.
4) Learn the commands to compare two medians using the Wilcoxon rank-sum test.
COMPARING PROPORTIONS: Chi-Square Test:
Proportions are used to summarize dichotomous variables. For example, the proportion of cardiac surgery patients who have diabetes is the number with diabetes (numerator) divided by the total number having surgery (denominator).
When comparing two proportions, the chi-square (χ²) test is most often used.
This test can be used to compare proportions for just two groups or more than two groups.
The generic command structure for the chi-square test is a modification of the tab
command (where varname1 and varname2 are two dichotomous variables):
tab varname1 varname2, chi
Note the ", chi" added to the end. This command produces a 2x2 table and the results of the χ² test, including a P-value: the probability of seeing a difference between the two proportions at least as large as the one observed if the proportions were truly the same (the so-called null hypothesis).
For the exercises in this laboratory, we will continue to use our dataset of CABG patients in Maryland. The data should already be located on your hard drive; locate it and open it in STATA.
Using this dataset, compare the mortality rate for patients aged 80 years or older with that for patients younger than 80 years. First, you will need to create a dichotomous variable for this analysis (1 = age 80 years or older, 0 = age younger than 80 years). The commands for generating and labeling this new variable are written below (we will call the variable "age80").
Type the commands:
generate age80=1 if age>=80
replace age80=0 if age<80
label variable age80 "Age greater than 80 years"
label define lage80 1 "Yes" 0 "No"
label values age80 lage80
Task: Perform a tabulation of this new variable to make sure it was created correctly. Type the command:
tab age80
STATA output:
Age greater |
    than 80 |
      years |      Freq.     Percent        Cum.
------------+-----------------------------------
         No |      4,248       91.00       91.00
        Yes |        420        9.00      100.00
------------+-----------------------------------
      Total |      4,668      100.00
Now we will use the χ² test to determine whether the mortality rate in patients aged 80 years or older is statistically different from the mortality rate in patients younger than 80 years. Type the command:
tab died age80, chi
STATA output:
      Died |
    during |  Age greater than 80
hospitaliz |         years
     ation |        No        Yes |     Total
-----------+----------------------+----------
     Alive |     4,132        398 |     4,530
      Dead |       110         21 |       131
-----------+----------------------+----------
     Total |     4,242        419 |     4,661

          Pearson chi2(1) =   8.1677   Pr = 0.004
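The Pearson statistic and P-value reported above can be checked by hand from the four cell counts. The following is a minimal Python sketch (Python is not part of this lab, and the function name pearson_chi2_2x2 is ours, not a STATA command). It computes the expected count in each cell under the null hypothesis and sums (observed - expected)^2 / expected over the four cells. The P-value line relies on the fact that a chi-square variable with 1 degree of freedom is the square of a standard normal variable, so it applies only to a 2x2 table.

```python
import math

def pearson_chi2_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the 2x2 table
        a  b
        c  d
    Returns (statistic, p_value, expected_counts)."""
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    # Expected count under the null hypothesis: row total * column total / n
    expected = [[row_totals[i] * col_totals[j] / n for j in range(2)]
                for i in range(2)]
    chi2 = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
               for i in range(2) for j in range(2))
    # Upper tail of chi-square with 1 df: P(|Z| > sqrt(chi2)) for standard normal Z
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value, expected

# Cell counts from the table: Alive (No, Yes) and Dead (No, Yes)
chi2, p_value, expected = pearson_chi2_2x2(4132, 398, 110, 21)
print(f"chi2(1) = {chi2:.4f}, P = {p_value:.3f}")
print("smallest expected count:", round(min(min(row) for row in expected), 1))
```

With these counts the sketch agrees with STATA's chi2(1) and Pr values to rounding, and the smallest expected count is well above 5, so the chi-square test is appropriate for this table.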
We can modify the table in STATA’s output to include the mortality rate (proportion who died) in each age group by adding "col" (for column percentage) after the comma in the command line, as shown in the following example.
Type the command:
tab died age80, col chi
STATA output:
Died during| Age greater than 80
hospitaliz | years
ation | No Yes | Total
-----------+----------------------+----------
Alive | 4,132 398 | 4,530
| 97.41 94.99 | 97.19
-----------+----------------------+----------
Dead | 110 21 | 131
| 2.59 5.01 | 2.81
-----------+----------------------+----------
Total | 4,242 419 | 4,661
| 100.00 100.00 | 100.00
Pearson chi2(1) = 8.1677 Pr = 0.004
From the output, we can see that the mortality rate for those younger than 80 is 2.59% compared to 5.01% for those 80 or older, a statistically significant difference (P=0.004 by the chi-square test).
Note of Caution: When sample sizes are very small (as a rule of thumb, an expected count of less than 5 in any cell of the 2x2 table), the Chi-square test may not give the correct answer. In this setting, Fisher’s exact test is most commonly used. The commands can be found in the STATA help or reference guides. (But when using a large administrative database, samples this small are rarely encountered.)
DETERMINING THE DISTRIBUTION OF A CONTINUOUS VARIABLE:
Knowing the distribution of a variable (whether or not it has a symmetric, bell-shaped distribution) is important because it might dictate the type of statistical test used. As a brief review, we will again use a frequency histogram to determine whether a variable has a symmetric distribution. Continuous variables with a symmetric distribution are usually described using means, and we can compare two means using the Student’s t-test. When a variable is skewed to the left or right, however, you may want to describe it using the median (50th percentile), and we can compare two medians using the Wilcoxon rank-sum test.
The general command structure to create a histogram:
histogram varname, frequency normal
Task: Using the above command, create a frequency histogram for each of the two variables we will be using in this exercise: patient age (variable "age") and length of stay (variable "los").
You will once again find that age ("age") has a normal distribution and length of stay ("los") does not. With that in mind, you can describe each variable using the appropriate statistics (the mean or median for central tendency and the standard deviation or inter-quartile range for spread). Remember, both of these sets of statistics can be found using a single command:
summarize varname, detail
Type the command:
summarize age, detail
STATA output:
                  Age in years at admission
-------------------------------------------------------------
      Percentiles      Smallest
 1%           40             16
 5%           47             22
10%           51             32       Obs                4668
25%           58             32       Sum of Wgt.        4668

50%           67                      Mean           65.78535
                        Largest       Std. Dev.      10.73579
75%           74             91
90%           79             92       Variance       115.2571
95%           81             93       Skewness      -.3413035
99%           85             94       Kurtosis       2.609806
The median age (the 50th percentile) is 67 years and the mean age is 65.8 years. Note that the median and mean are very similar; this is what you would expect when the data have a symmetric distribution (no right or left skew).
Task: Use the same command to determine the mean and median for "los". Are they the same? Does this make sense given the histogram of "los"?
COMPARING MEANS:
Comparing Means of Two Groups: Student’s t-test
When comparing the mean values of a continuous variable (e.g., age) within two groups, the Student’s t-test is used. The command for the t-test is as follows (where varname1 is the continuous variable and varname2 is a dichotomous variable that divides observations into two groups):
ttest varname1, by(varname2)
This command will compare the mean value of observations where varname2=1 with the mean value of observations where varname2=0. As an example, compare the mean age of patients who live with that of patients who die.
Type the command:
ttest age, by(died)
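To see what this command computes, the equal-variance t statistic can be reproduced from each group's size, mean, and standard deviation. The Python sketch below is only an illustration (Python is not part of this lab, and the function name pooled_t is ours). The summary values plugged in are the ones STATA reports for age by died in this dataset.

```python
import math

def pooled_t(n0, mean0, sd0, n1, mean1, sd1):
    """Two-sample t statistic assuming equal variances in the two groups.
    Returns (difference in means, standard error, t, degrees of freedom)."""
    df = n0 + n1 - 2
    # Pooled variance: weighted average of the two group variances
    pooled_var = ((n0 - 1) * sd0 ** 2 + (n1 - 1) * sd1 ** 2) / df
    se = math.sqrt(pooled_var * (1 / n0 + 1 / n1))
    diff = mean0 - mean1
    return diff, se, diff / se, df

# Group 0 = lived, group 1 = died (values taken from STATA's ttest output)
diff, se, t, df = pooled_t(4530, 65.62009, 10.73013, 131, 71.23664, 9.5449)
print(f"diff = {diff:.3f}, SE = {se:.4f}, t = {t:.3f}, df = {df}")
```

The difference, standard error, t statistic, and degrees of freedom agree with STATA's output for this command to rounding. The p-values additionally require the t distribution with 4659 degrees of freedom, which this sketch does not compute.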
STATA output:
Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       0 |    4530    65.62009    .1594247    10.73013    65.30754    65.93264
       1 |     131    71.23664    .8339418      9.5449    69.58679     72.8865
---------+--------------------------------------------------------------------
combined |    4661    65.77794    .1572822     10.7379     65.4696    66.08629
---------+--------------------------------------------------------------------
    diff |           -5.616553    .9481812               -7.475437   -3.757669
------------------------------------------------------------------------------
Degrees of freedom: 4659

                  Ho: mean(0) - mean(1) = diff = 0

     Ha: diff < 0               Ha: diff != 0              Ha: diff > 0
       t =  -5.9235                t =  -5.9235              t =  -5.9235
   P < t =   0.0000          P > |t| =   0.0000          P > t =   1.0000
The mean ages for those who lived and those who died appear in the "Mean" column: the mean age of those who died (died=1) is 71.2 years compared to 65.6 years for those who lived (died=0). There are three p-values shown at the bottom, but we are most interested in the middle one: the p-value associated with the alternative hypothesis (Ha) that the difference in means does not equal zero. It is the probability of seeing a difference in means at least this large if the true means were the same. The p-value of the test is <.001.
Note of Caution: The t-test above assumes that the individuals in the two groups are independent (not the same people). If you are comparing mean values of some variable (e.g., blood pressure) measured in the same person twice, you should use a paired t-test (see STATA help or reference manual for the commands).
COMPARING MEDIANS:
When comparing medians of two groups, the Wilcoxon rank-sum test is used. The first step is to determine the median value in each group. We have previously seen how the median and inter-quartile range can be determined using the command summarize varname, detail, but we will now introduce a second method. We will demonstrate how to compare the median length of stay (los) for patients who lived and those who died. Here is the new command for getting the median:
centile los, centile(25,50,75)
(The numbers in the parentheses tell STATA what percentiles you want in the output.)
STATA output:
                                                       -- Binom. Interp. --
    Variable |     Obs  Percentile      Centile        [95% Conf. Interval]
-------------+-------------------------------------------------------------
         los |    4668         25            5               5           5
             |                 50            6               6           7
             |                 75            9               9           9
In this case, the median (50th percentile) is 6 days and the inter-quartile range (25th to 75th percentile) is from 5 to 9 days. This output, however, gives you the median for the overall group. To estimate the median for only a subset (those who lived or died) you can limit the group analyzed using an "if" qualifier:
centile los if died==1, centile(25,50,75)
(Note that a double equal sign must be used with "if".) This command will give you the median and inter-quartile range for only those who died:
                                                       -- Binom. Interp. --
    Variable |     Obs  Percentile      Centile        [95% Conf. Interval]
-------------+-------------------------------------------------------------
         los |     131         25            5               4           6
             |                 50           10               7          11
             |                 75           17              13    25.23522
Likewise, the following command will yield the median for those who lived:
centile los if died==0, centile(25,50,75)
                                                       -- Binom. Interp. --
    Variable |     Obs  Percentile      Centile        [95% Conf. Interval]
-------------+-------------------------------------------------------------
         los |    4530         25            5               5           5
             |                 50            6               6           7
             |                 75            9               9           9
Now, looking at the output from the past two commands, we can see the median length of stay for those who died was 10 days compared to a median length of stay of 6 days for those who lived. Those appear different, but we can test this difference for statistical significance using the Wilcoxon rank sum test: Type the command:
ranksum los, by(died)
STATA output:
Two-sample Wilcoxon rank-sum (Mann-Whitney) test

        died |      obs    rank sum    expected
-------------+---------------------------------
           0 |     4530    10498664    10559430
           1 |      131      366127      305361
-------------+---------------------------------
    combined |     4661    10864791    10864791

unadjusted variance    2.305e+08
adjustment for ties   -2812539.7
                      ----------
adjusted variance      2.277e+08

Ho: los(died==0) = los(died==1)
             z =  -4.027
    Prob > |z| =   0.0001
The p-value for the comparison is shown at the very bottom (Prob > |z| = 0.0001, i.e., p<.001). The difference in medians (10 days for those who died vs. 6 days for those who lived) is "statistically significant". You may also hear this test called the Mann-Whitney test.
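The z statistic in this output can be rebuilt from the rank sums. The following is a minimal Python sketch (an illustration only; Python is not part of this lab). Under the null hypothesis, the expected rank sum for a group of size n0 out of N observations is n0(N+1)/2, the variance without ties is n0*n1*(N+1)/12, and z is the observed minus the expected rank sum divided by the square root of the tie-adjusted variance. The tie adjustment is copied from STATA's output rather than recomputed, because it depends on the individual observations.

```python
import math

n0, n1 = 4530, 131            # lived, died
rank_sum0 = 10498664          # observed rank sum for those who lived
tie_adjustment = -2812539.7   # copied from the STATA output, not recomputed

N = n0 + n1
expected0 = n0 * (N + 1) / 2              # expected rank sum under Ho
unadjusted_var = n0 * n1 * (N + 1) / 12   # variance if there were no ties
adjusted_var = unadjusted_var + tie_adjustment

z = (rank_sum0 - expected0) / math.sqrt(adjusted_var)
p_two_sided = math.erfc(abs(z) / math.sqrt(2))   # normal approximation

print(f"expected = {expected0:.0f}, unadjusted variance = {unadjusted_var:.4g}")
print(f"z = {z:.3f}, two-sided P = {p_two_sided:.5f}")
```

The expected rank sum, unadjusted variance, and z match the STATA output to rounding, and the two-sided P-value is below 0.001.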
EXTRA EXERCISES:
Continue on and perform these extra exercises if you finish the lab early, or perform them on your own after the course.
Comparing Means for More than Two Groups: Analysis of Variance (ANOVA)
The Student’s t-test is appropriate when comparing two groups. When you wish to compare the means from more than two groups, the analysis of variance (ANOVA) is the test of choice. The test is different from the t-test because it does not tell you which group is different from the others. Instead, it just tells you whether there are statistically significant differences among the group means. If such differences are found, you must look at the data and perform directed t-tests, which can give you more specific information about the differences between two groups.
The general command structure for ANOVA is as follows (where varname1 is the
continuous variable and varname2 is a categorical variable with more than two groups):
oneway varname1 varname2
(Note: if varname2 is also a continuous variable, analysis of covariance or linear
regression should be used to study the relationship. Consult a statistician if you aren’t familiar with these tests.)
As an example, we will compare the mean age for patients with elective, urgent, and emergent admission type. Type the command:
oneway age atype, tab
STATA output:
            |   Summary of Age in years at
  Admission |             admission
       type |        Mean   Std. Dev.       Freq.
------------+------------------------------------
   Emergent |   66.126266   10.927197        1481
     Urgent |   64.947484    11.06317         914
   Elective |   65.903384   10.462491        2246
      Other |   64.666667   10.111874           9
------------+------------------------------------
      Total |   65.784086   10.736317        4650

                        Analysis of Variance
    Source              SS         df      MS            F     Prob > F
------------------------------------------------------------------------
Between groups      856.320619      3   285.440206      2.48     0.0594
 Within groups      535026.902   4646    115.15861
------------------------------------------------------------------------
    Total           535883.222   4649   115.268493
The first table shows the mean age of patients with each admission type; the p-value is in the "Prob > F" column of the second table. The mean ages appear similar across admission types and the p-value is 0.06, so there is no statistically significant variation in mean age across admission types.
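The F statistic in the ANOVA table is the between-group mean square divided by the within-group mean square, where each mean square (MS) is a sum of squares (SS) divided by its degrees of freedom (df). The following is a short Python sketch of that arithmetic, using the SS and df values from the STATA output (an illustration only; Python is not part of this lab, and the P-value is not recomputed because it requires the F distribution).

```python
# Values copied from the "Analysis of Variance" table
ss_between, df_between = 856.320619, 3     # df = 4 admission types - 1
ss_within, df_within = 535026.902, 4646    # df = 4650 patients - 4 groups

ms_between = ss_between / df_between   # variation of group means around the overall mean
ms_within = ss_within / df_within      # variation of patients around their own group mean
f_stat = ms_between / ms_within

print(f"MS between = {ms_between:.3f}, MS within = {ms_within:.3f}, F = {f_stat:.2f}")
```

An F statistic near 1 means the group means vary about as much as expected from the patient-to-patient variation alone. Here F is about 2.5, which falls just short of statistical significance at the 0.05 level.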
Task: Now use this command to determine if there are differences in the mean ages of CABG patients based on payer type (the variable for payer type is "pay1").
When interpreting the results of the ANOVA, remember that the test is different from the t-test because it does not tell you which group is different from the others. In fact, two groups may appear identical, while others appear quite different. Instead, ANOVA tells you that there is more between-group variation than within-group variation. If this is found to be the case, you must look at the data and perform directed t-tests, which can give you more specific information about the differences between any two groups.
Comparing Medians for More than Two Groups: Kruskal-Wallis Test
When you need to compare the medians for more than two groups, the Kruskal-Wallis test is used. The generic command structure (where varname1 is the continuous variable and varname2 is the categorical variable):
kwallis varname1, by(varname2)
As an example, consider the difference in median length of stay (variable "los") between patients with different admission types. First, we need to find out the median length of stay for each value of "atype", i.e., the median for emergent, urgent, and elective admissions. Remember that "atype" is a categorical variable and we will need to use the "if" qualifier to obtain the median of each group. To review what number goes with each admission type, use the "codebook" command. Type the command:
codebook atype
STATA output:
-------------------------------------------------------------------------------
atype                                                            Admission type
-------------------------------------------------------------------------------
                  type:  numeric (float)
                 label:  latype

                 range:  [1,6]                        units:  1
         unique values:  4                        missing .:  18/4668

            tabulation:  Freq.   Numeric  Label
                          1481         1  Emergent
                           914         2  Urgent
                          2246         3  Elective
                             9         6  Other
Now we will use the "centile" command to determine the median within each group.
Type the command (where "1" means emergent):
centile los if atype==1, centile(25,50,75)
STATA output:
                                                       -- Binom. Interp. --
    Variable |     Obs  Percentile      Centile        [95% Conf. Interval]
-------------+-------------------------------------------------------------
         los |    1481         25            6               6           6
             |                 50            8               8           8
             |                 75           12              11          12
Task: Modify this command to determine the median length of stay for the other two categories, urgent and elective. Now, we will see if these differences are statistically significant. Type the command:
kwallis los, by(atype)
STATA output:
Test: Equality of populations (Kruskal-Wallis test)

  +----------------------------+
  |    atype |  Obs | Rank Sum |
  |----------+------+----------|
  | Emergent | 1481 | 4.37e+06 |
  |   Urgent |  914 | 2.25e+06 |
  | Elective | 2246 | 4.18e+06 |
  |    Other |    9 | 11788.50 |
  +----------------------------+

chi-squared =   603.922 with 3 d.f.
probability =     0.0001

chi-squared with ties =   611.343 with 3 d.f.
probability =     0.0001
The p-value for the comparison of medians is the "probability" line (0.0001). The differences in medians are statistically significant. Similar to ANOVA, this doesn’t mean that all the medians are different from each other. Instead, it means that the medians are not all equal.
It is worth noting that t-tests and ANOVA (parametric tests) can be used in most cases. We have introduced the Wilcoxon rank-sum test and Kruskal-Wallis test (non-parametric tests) just so you know they exist. You should also appreciate that when variables aren’t normally distributed you may need to use them. If unsure, you should consult an epidemiologist or statistician.