How radiotherapy influences the survival of lung cancer patients
Luqi Kong
2016/05/05
Contents
1 Introduction
2 Data description
3 Check for balance
3.1 Chi-square test of independence
3.2 Standardized Mean Difference
4 Matching
5 Test for causal effect
6 Further implications
7 Bibliography
Appendix
1 Introduction
Lung cancer is the leading cause of cancer death in both men and women in the U.S. and
worldwide. Treatment can involve a combination of surgery, chemotherapy, and radiation
therapy, as well as newer experimental methods. None of these treatments is guaranteed to
benefit the patient. Radiation therapy, for example, may help to control the cancer and prolong
the patient's life, but it also carries a risk that the tumor spreads to other sites of the body.
This paper examines how radiation therapy influences the survival of lung and bronchus
cancer patients: 1) whether radiation has a causal effect on patients' survival; 2) if so, the
direction of the effect; 3) if so, the magnitude of the effect.
The first part of the paper describes the data extracted from the SEER database. The second
part checks the balance of the covariates using the chi-square test of independence and the
standardized mean difference. The third part matches on the covariates that are not well
balanced, using nearest neighbor matching, optimal matching and full matching. The fourth part
estimates the treatment effect from the matched data and draws conclusions on the research
question. The final part discusses further implications, including unobserved biases and
instrumental variables.
2 Data description
The data come from the SEER database, 1973-2013. The Surveillance, Epidemiology, and End
Results (SEER) Program of the National Cancer Institute provides information on cancer
statistics in an effort to reduce the burden of cancer among the U.S. population.
1,321 cases were extracted from the database using the following criteria: 1) the cancer site
is lung and bronchus; 2) the year of diagnosis is 2005; 3) the SEER registry is one of San
Francisco-Oakland SMSA-1973+, San Jose-Monterey-1992+, Los Angeles-1992+, or California
excluding SF/SJM/LA-2000+; 4) the patient underwent surgery; 5) the patient died in or before
2013; 6) the patient was over 50 years old.
The outcome of this study is survival in months. The treatment is beam radiation therapy, and
the control condition is no radiation therapy. The covariates are age, gender, registry site,
cancer stage, tumor size, lymph node involvement and metastasis (see Table 1 for a detailed
description).
Patients in the treatment group lived 27 months on average (SD: 23.34), while patients in the
control group lived 34 months on average (SD: 27.19). Patients who are older, have a higher
cancer stage, or have lymph node involvement or metastasis tend to have shorter survival.
Women survive longer than men. Patients in the LA registry have shorter average survival than
patients in the other registries. (See Table 2 and Figure 1. The violin plots in Figure 1 show
the distribution of survival months at each value of the covariates; a wider violin indicates a
larger sample size.)
3 Check for balance
3.1 Chi-square test of independence
The chi-square test of independence takes as its null hypothesis that the treatment a patient
receives is independent of a given covariate. Let $f_{ij}$ be the observed frequency count of
events belonging to both the i-th category of x and the j-th category of y, and let $e_{ij}$ be
the corresponding expected count if x and y are independent.
The null hypothesis of independence is rejected if the p-value of the following chi-square
test statistic is less than a given significance level α:
$$\chi^2 = \sum_{i,j} \frac{(f_{ij} - e_{ij})^2}{e_{ij}}$$
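For example, if treatment is independent of lymph node involvement, the expected count is
$e_{11} = e(\text{treatment}=0, \text{lymph nodes}=0) = 816 \times 0.809 = 660$, while the
observed count is $f_{11} = f(\text{treatment}=0, \text{lymph nodes}=0) = 731$. The contribution
of this cell is $\chi^2_{11} = (731-660)^2/660 = 7.638$ (slightly different from the 7.562 given
by R, due to rounding in my calculation).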
             | Lymph nodes involvement
   Treatment |         0 |         1 | Row Total |
-------------|-----------|-----------|-----------|
           0 |       731 |       338 |      1069 |
             |     7.562 |    12.219 |           |
             |     0.684 |     0.316 |     0.809 |
             |     0.896 |     0.669 |           |
             |     0.553 |     0.256 |           |
-------------|-----------|-----------|-----------|
           1 |        85 |       167 |       252 |
             |    32.078 |    51.833 |           |
             |     0.337 |     0.663 |     0.191 |
             |     0.104 |     0.331 |           |
             |     0.064 |     0.126 |           |
-------------|-----------|-----------|-----------|
Column Total |       816 |       505 |      1321 |
             |     0.618 |     0.382 |           |
See Table 3 for a summary of this test and Table 4 for details.
Among the 7 covariates, only gender (p-value: 0.9123) and site (p-value: 0.2313) pass the
chi-square test. For the other 5 covariates, the null hypothesis of independence is rejected at
the 0.05 significance level, which indicates that these covariates are imbalanced between the
treatment and control groups.
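The cell contributions in the lymph nodes example can be recomputed from the observed counts alone. The following is a minimal Python sketch (the paper itself used R); the variable names are illustrative. It also computes the statistic with Yates' continuity correction, on the assumption that this is why the summed contributions differ from the value reported in Table 3. The paper does not state which variant was used.

```python
# Observed counts from the lymph nodes crosstable:
# rows = treatment (0, 1), columns = lymph nodes involvement (0, 1).
observed = [[731, 338],
            [85, 167]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: e_ij = (row i total) * (column j total) / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Per-cell contribution (f_ij - e_ij)^2 / e_ij, as printed in the crosstable.
contributions = [[(f - e) ** 2 / e for f, e in zip(f_row, e_row)]
                 for f_row, e_row in zip(observed, expected)]

# Pearson statistic without continuity correction: sum of all contributions.
chi_square = sum(sum(row) for row in contributions)

# Statistic with Yates' continuity correction (assumption: applied to 2x2 tables).
chi_square_yates = sum((abs(f - e) - 0.5) ** 2 / e
                       for f_row, e_row in zip(observed, expected)
                       for f, e in zip(f_row, e_row))

print([[round(c, 3) for c in row] for row in contributions])
print(round(chi_square, 2), round(chi_square_yates, 2))
```

With unrounded expected counts, the first cell's contribution comes out at 7.562 as in the R output, rather than the hand-rounded 7.638.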
3.2 Standardized Mean Difference
Another way to check balance is the standardized mean difference (SMD), which is the
difference in the mean of a covariate between the treatment and control groups divided by the
pooled standard deviation.
A threshold of 0.2 is used for balance checking: if the absolute standardized mean difference
of a covariate is greater than 0.2, the treatment and control groups are judged to be unbalanced
on that covariate. Except for gender (SMD: 0.013) and site (SMD: -0.025), all 5 other covariates
have an absolute standardized mean difference larger than 0.2. This method gives the same result
as the chi-square test. (See Table 5 and Figure 3.)
Therefore, the covariates to match on in the next part are age, stage, tumor size, lymph
nodes and metastasis. Gender and site are relatively balanced and are not included in the
matching process.
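As an illustration of the balance check, the standardized mean difference for lymph node involvement can be recomputed from the crosstable counts (167 of 252 treated and 338 of 1069 control patients). This Python sketch assumes the pooled standard deviation weights each group's sample variance by its degrees of freedom; the paper does not spell out the pooling formula, and the function names are hypothetical.

```python
from math import sqrt

def sample_variance_binary(ones, n):
    """Sample variance (n - 1 denominator) of a 0/1 variable with `ones` ones."""
    p = ones / n
    return p * (1 - p) * n / (n - 1)

def standardized_mean_difference(mean_t, var_t, n_t, mean_c, var_c, n_c):
    """Mean difference divided by the pooled standard deviation.

    Assumption: the pooled variance is the degrees-of-freedom-weighted
    average of the two group variances.
    """
    pooled_var = ((n_t - 1) * var_t + (n_c - 1) * var_c) / (n_t + n_c - 2)
    return (mean_t - mean_c) / sqrt(pooled_var)

n_t, n_c = 252, 1069        # treated and control group sizes
ones_t, ones_c = 167, 338   # patients with lymph node involvement

smd = standardized_mean_difference(
    ones_t / n_t, sample_variance_binary(ones_t, n_t), n_t,
    ones_c / n_c, sample_variance_binary(ones_c, n_c), n_c,
)
unbalanced = abs(smd) > 0.2  # threshold used in this paper

print(round(smd, 3), unbalanced)
```

Under this pooling assumption the result agrees with the value of 0.742 listed for lymph nodes in Table 5.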
4 Matching
Three matching methods are applied using the MatchIt package in R: nearest neighbor matching,
optimal matching (ratio = 1:1) and full matching.
Nearest neighbor matching selects the r (default = 1) best control matches for each individual
in the treatment group (excluding those discarded using the discard option). Matches are chosen
for each treated unit one at a time, in the order specified by the m.order option (default =
largest to smallest). At each matching step, the control unit that is not yet matched but is
closest to the treated unit on the distance measure is chosen.
The default nearest neighbor matching method in MatchIt is "greedy" matching, where the
closest control match for each treated unit is chosen one at a time, without trying to minimize
a global distance measure. In contrast, "optimal" matching finds the matched samples with the
smallest average absolute distance across all the matched pairs. The ratio (the number of
control subjects per treated subject in each matched subclass) can be chosen; in this paper, a
1:1 ratio is used.
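The difference between greedy and optimal 1:1 matching can be seen on a toy example. The sketch below is written in Python with made-up propensity scores and is not the MatchIt implementation: the greedy matcher follows the "largest to smallest" order described above, and the optimal matcher is a brute-force search over all assignments, which is only feasible for very small samples.

```python
from itertools import permutations

def total_distance(pairs, treated, controls):
    """Sum of absolute distances over (treated index, control index) pairs."""
    return sum(abs(treated[t] - controls[c]) for t, c in pairs)

def greedy_match(treated, controls):
    """1:1 nearest neighbor matching without replacement, one treated unit
    at a time, visiting treated units from largest to smallest score."""
    unmatched = set(range(len(controls)))
    pairs = []
    order = sorted(range(len(treated)), key=lambda t: treated[t], reverse=True)
    for t in order:
        c = min(unmatched, key=lambda j: abs(treated[t] - controls[j]))
        unmatched.remove(c)
        pairs.append((t, c))
    return pairs

def optimal_match(treated, controls):
    """1:1 matching that minimizes the total distance, by exhaustive search."""
    candidates = (list(enumerate(perm))
                  for perm in permutations(range(len(controls)), len(treated)))
    return min(candidates, key=lambda pairs: total_distance(pairs, treated, controls))

# Hypothetical propensity scores, chosen so that the two methods disagree.
treated = [0.60, 0.52]
controls = [0.55, 0.80, 0.10]

greedy = greedy_match(treated, controls)
optimal = optimal_match(treated, controls)
print(greedy, round(total_distance(greedy, treated, controls), 2))
print(optimal, round(total_distance(optimal, treated, controls), 2))
```

Here the greedy matcher gives the first treated unit its closest control (0.55), which forces a poor match for the second treated unit. The optimal matcher accepts a slightly worse first pair to obtain a smaller total distance.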
In full matching, each matched set contains one treated unit and one or more controls (or one
control unit and one or more treated units). The only units not placed into a subclass are those
discarded (if a discard option is specified) because they are outside the range of common
support. Full matching is optimal in the sense of minimizing a weighted average of the estimated
distance measure between each treated subject and each control subject within each subclass.
With full matching, the 252 treated units and 1069 control units are grouped into 242 matched
sets.
Full matching gives the largest improvement in balance on the propensity score (99.73%),
although it gives the smallest improvement on the covariate metastasis. Overall, full matching
achieves the best balance, so the data matched by full matching are used to calculate the
treatment effect. (See Table 6 for a detailed comparison of the 3 matching methods, and Figures
4 and 5 for balance before and after matching.)
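The percentage improvement in balance can be approximated from the group means reported in Table 6. This Python sketch assumes the improvement is defined as 100(|a| - |b|)/|a|, where a and b are the treated-minus-control mean differences before and after matching; the function name is illustrative. Because the table values are rounded to four decimals, the result only approximately reproduces the reported figure.

```python
def balance_improvement(diff_before, diff_after):
    """Percent reduction in the absolute mean difference after matching."""
    return 100 * (abs(diff_before) - abs(diff_after)) / abs(diff_before)

# Metastasis under nearest neighbor matching (group means from Table 6).
diff_before = 0.2381 - 0.1076   # treated mean - control mean, before matching
diff_after = 0.2381 - 0.2302    # treated mean - control mean, after matching

improvement = balance_improvement(diff_before, diff_after)
print(round(improvement, 1))
```

The result is close to the 93.9192% listed in Table 6 for this covariate and method.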
5 Test for causal effect
After matching, every subject receives a weight. Each treated unit has weight 1, while each
control unit receives a weight of $\frac{n_2}{n_1} \cdot \frac{1}{m_i}$, where $n_2$ is the
control group size, $n_1$ is the treated group size, and $m_i$ is the ratio of control units to
treated units in the i-th matched subclass.
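The weighting scheme can be sketched on a small hypothetical data set. The Python example below uses two made-up subclasses with invented survival times; only the weight formula comes from the text. With these weights, the control weights sum to the control group size.

```python
# Hypothetical matched subclasses with survival months for each unit.
subclasses = [
    {"treated": [20.0], "control": [30.0, 40.0]},   # 1 treated, 2 controls
    {"treated": [10.0, 30.0], "control": [25.0]},   # 2 treated, 1 control
]

n1 = sum(len(s["treated"]) for s in subclasses)  # treated group size
n2 = sum(len(s["control"]) for s in subclasses)  # control group size

control_weights = []
control_outcomes = []
treated_outcomes = []
for s in subclasses:
    m_i = len(s["control"]) / len(s["treated"])  # controls per treated unit
    weight = (n2 / n1) * (1 / m_i)               # weight of each control here
    for y in s["control"]:
        control_weights.append(weight)
        control_outcomes.append(y)
    treated_outcomes.extend(s["treated"])        # each treated unit has weight 1

treated_mean = sum(treated_outcomes) / n1
control_mean = (sum(w * y for w, y in zip(control_weights, control_outcomes))
                / sum(control_weights))
weighted_difference = treated_mean - control_mean

print(control_weights, round(weighted_difference, 2))
```

In this toy example the single control matched to two treated units receives weight 2, while the two controls sharing one treated unit receive weight 0.5 each.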
The weighted mean of survival months after matching is 27.08 in the treatment group and 28.40
in the control group, a weighted mean difference of -1.32. Compared with the raw mean difference
before matching (-7.37), the magnitude of the estimated treatment effect decreases. This is
probably due to imbalance in the original data: patients in worse condition are more likely to
receive the treatment, which lowers survival in the treatment group and exaggerates the negative
treatment effect.
To test whether the weighted mean difference is significantly different from zero, a two-sample
t-test is used. The t statistic is -0.7616 with 482 degrees of freedom, the p-value is 0.4467,
and the 95% confidence interval is [-5.6778, 2.5059]. At the 5% significance level, we fail to
reject the null hypothesis that the true difference in means is 0. In conclusion, the matched
data provide no statistically significant evidence that radiation has a causal effect on
survival months. (See Figure 6 for a visualization.)
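The reported p-value can be checked approximately from the t statistic. With 482 degrees of freedom the t distribution is very close to the standard normal, so this Python sketch uses a normal approximation from the standard library. This is an assumption made for convenience and not how R computes the exact value.

```python
from statistics import NormalDist

t_score = -0.7616  # t statistic reported in the text

# Two-sided p-value under the standard normal approximation.
p_normal = 2 * (1 - NormalDist().cdf(abs(t_score)))
reject_null = p_normal < 0.05

print(round(p_normal, 3), reject_null)
```

The approximation lands within a few ten-thousandths of the reported 0.4467 and leads to the same decision at the 5% level.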
Sensitivity analysis is not included in this paper because, as gamma increases from 1, the
p-value only becomes larger, so the original conclusion of no significant effect is strengthened
rather than overturned. A sensitivity analysis is therefore unnecessary here.
6 Further implications
Only 7 covariates are included in this paper. Unobserved covariates such as psychological
factors, economic status, emotional support and surgery type all influence patients' chance of
survival, and the conclusion of this paper may change once these factors are considered. Further
research should incorporate more information about the patients. For example, information about
chemotherapy should be included, and records of the surgery should be more detailed, including
the surgery type and the hospital. In addition, if a patient's mental outlook could be measured
in some way, it would be a good source of information.
An instrumental variable (IV) could also be included in the analysis to distinguish the units
within the matched groups. Doctors' treatment recommendations may serve as an IV: some patients
comply with the recommendations while others do not, and these different responses to the
encouragement (IV) may allow a more accurate analysis.
In addition, machine learning techniques such as gradient boosted models could be used to
estimate the propensity score. This paper uses logistic regression, which is a parametric model.
Non-parametric models rely less on modeling assumptions.
Last but not least, multiple treatment groups could be formed, such as "radiation before
surgery", "radiation during surgery" and "radiation after surgery". These different treatments
may have different effects on the survival of the patients.
7 Bibliography
[1] D. Ho, K. Imai, G. King and E. A. Stuart, "Matching as Nonparametric Preprocessing for
Reducing Model Dependence in Parametric Causal Inference," Political Analysis, vol. 15, no.
3, pp. 199-236, 2007.
[2] D. F. McCaffrey, G. Ridgeway and A. R. Morral, "Propensity Score Estimation With Boosted
Regression for Evaluating Causal Effects in Observational Studies," Psychological Methods,
vol. 9, no. 4, pp. 403-425, 2004.
[3] B. B. Hansen and J. Bowers, "Covariate balance in simple, stratified and clustered
comparative studies," Statistical Science, vol. 23, no. 2, pp. 219-236, 2008.
[4] P. R. Rosenbaum, Design of Observational Studies, Springer-Verlag New York, 2010.
[5] B. B. Hansen and S. O. Klopfer, "Optimal full matching and related designs," JCGS, vol. 15,
pp. 609-627, 2006.
[6] H. Wickham, ggplot2: Elegant Graphics for Data Analysis, Springer-Verlag New York, 2009.
[7] D. Meyer, A. Zeileis and K. Hornik, "The Strucplot Framework: Visualizing Multi-Way
Contingency Tables with vcd," Journal of Statistical Software, vol. 17, no. 3, pp. 1-48, 2006.
Appendix
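Table 1: Variable Description
| Variable | Type of Variable | Value | Recode |
|-----------------|---------------------|--------------------|--------|
| Survival Months | Continuous | >=0 | |
| Treatment | Binary | No Radiation | 0 |
| | | Beam Radiation | 1 |
| Age | Ordered Categorical | 50-60 years old | 1 |
| | | 60-70 years old | 2 |
| | | 70-80 years old | 3 |
| | | >80 years old | 4 |
| Sex | Categorical | Male | 1 |
| | | Female | 2 |
| Registry Site | Categorical | CA(Not SF/SJM/LA) | 1 |
| | | Los Angeles | 2 |
| | | San Francisco | 3 |
| | | San Jose-Monterey | 4 |
| Stage | Ordered Categorical | Localized | 1 |
| | | Regional | 2 |
| | | Distant | 3 |
| Tumor Size | Ordered Categorical | 0-20mm | 1 |
| | | 20-40mm | 2 |
| | | 40-100mm | 3 |
| Lymph Nodes | Binary | No involvement | 0 |
| | | Involvement | 1 |
| Metastasis | Binary | No involvement | 0 |
| | | Involvement | 1 |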
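Table 2: Survival by different covariates (Numbers are rounded to integer)
| Variable | Value | Mean | Standard Deviation | 1st Quartile | Median | 3rd Quartile |
|-------------|---------------------------|------|--------------------|--------------|--------|--------------|
| Treatment | Radiation | 27 | 23 | 10 | 20 | 37 |
| | No Radiation | 34 | 27 | 12 | 27 | 55 |
| Age | 50-60 | 33 | 26 | 12 | 26 | 50 |
| | 60-70 | 36 | 27 | 14 | 28 | 55 |
| | 70-80 | 34 | 27 | 11 | 26 | 52 |
| | 80-90 | 26 | 25 | 6 | 16 | 41 |
| Gender | Male | 30 | 26 | 10 | 21 | 47 |
| | Female | 36 | 27 | 13 | 30 | 56 |
| Site | CA(Not the 3 sites below) | 32 | 27 | 11 | 24 | 50 |
| | LA | 31 | 26 | 10 | 22 | 50 |
| | SF | 38 | 28 | 15 | 33 | 61 |
| | SJM | 37 | 27 | 15 | 34 | 57 |
| Stage | Localized | 40 | 28 | 16 | 36 | 62 |
| | Regional | 31 | 25 | 11 | 23 | 46 |
| | Distant | 22 | 23 | 5 | 15 | 31 |
| Tumor Size | 0-20mm | 39 | 27 | 16 | 35 | 60 |
| | 20-40mm | 35 | 27 | 13 | 27 | 54 |
| | 40-100mm | 27 | 25 | 9 | 18 | 40 |
| Lymph nodes | No | 36 | 28 | 13 | 30 | 58 |
| | Yes | 27 | 24 | 9 | 20 | 41 |
| Metastasis | No | 35 | 27 | 13 | 27 | 54 |
| | Yes | 23 | 23 | 6 | 14 | 32 |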
Figure 1: Survival by different covariates
Figure 2: Treatment by different covariates
Table 3: Pearson's chi-square test for independence between treatment and covariates
| Covariates | χ² | Degrees of freedom | P value |
|-------------|--------|--------------------|-----------|
| Age | 43.33 | 3 | 2.094e-09 |
| Sex | 0.01 | 1 | 0.9123 |
| Site | 4.30 | 3 | 0.2313 |
| Stage | 94.74 | 2 | < 2.2e-16 |
| Tumor Size | 40.71 | 2 | 1.442e-09 |
| Lymph Nodes | 102.23 | 1 | < 2.2e-16 |
| Metastasis | 29.12 | 1 | 6.865e-08 |
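Table 4: Crosstable for each covariate and treatment
Each cell lists, from top to bottom: count, chi-square contribution, row proportion, column
proportion and proportion of the table total.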
Age and treatment
             | Age
   Treatment |         1 |         2 |         3 |         4 | Row Total |
-------------|-----------|-----------|-----------|-----------|-----------|
           0 |       117 |       318 |       436 |       198 |      1069 |
             |     3.778 |     1.121 |     1.314 |     2.052 |           |
             |     0.109 |     0.297 |     0.408 |     0.185 |     0.809 |
             |     0.676 |     0.763 |     0.855 |     0.896 |           |
             |     0.089 |     0.241 |     0.330 |     0.150 |           |
-------------|-----------|-----------|-----------|-----------|-----------|
           1 |        56 |        99 |        74 |        23 |       252 |
             |    16.026 |     4.756 |     5.575 |     8.707 |           |
             |     0.222 |     0.393 |     0.294 |     0.091 |     0.191 |
             |     0.324 |     0.237 |     0.145 |     0.104 |           |
             |     0.042 |     0.075 |     0.056 |     0.017 |           |
-------------|-----------|-----------|-----------|-----------|-----------|
Column Total |       173 |       417 |       510 |       221 |      1321 |
             |     0.131 |     0.316 |     0.386 |     0.167 |           |
Gender and treatment
             | Gender
   Treatment |         1 |         2 | Row Total |
-------------|-----------|-----------|-----------|
           0 |       537 |       532 |      1069 |
             |     0.003 |     0.003 |           |
             |     0.502 |     0.498 |     0.809 |
             |     0.811 |     0.807 |           |
             |     0.407 |     0.403 |           |
-------------|-----------|-----------|-----------|
           1 |       125 |       127 |       252 |
             |     0.013 |     0.013 |           |
             |     0.496 |     0.504 |     0.191 |
             |     0.189 |     0.193 |           |
             |     0.095 |     0.096 |           |
-------------|-----------|-----------|-----------|
Column Total |       662 |       659 |      1321 |
Site and treatment
             | Site
   Treatment |         1 |         2 |         3 |         4 | Row Total |
-------------|-----------|-----------|-----------|-----------|-----------|
           0 |       120 |        73 |       252 |       624 |      1069 |
             |     0.309 |     0.045 |     0.416 |     0.050 |           |
             |     0.112 |     0.068 |     0.236 |     0.584 |     0.809 |
             |     0.769 |     0.830 |     0.843 |     0.802 |           |
             |     0.091 |     0.055 |     0.191 |     0.472 |           |
-------------|-----------|-----------|-----------|-----------|-----------|
           1 |        36 |        15 |        47 |       154 |       252 |
             |     1.309 |     0.190 |     1.767 |     0.210 |           |
             |     0.143 |     0.060 |     0.187 |     0.611 |     0.191 |
             |     0.231 |     0.170 |     0.157 |     0.198 |           |
             |     0.027 |     0.011 |     0.036 |     0.117 |           |
-------------|-----------|-----------|-----------|-----------|-----------|
Column Total |       156 |        88 |       299 |       778 |      1321 |
             |     0.118 |     0.067 |     0.226 |     0.589 |           |
Stage and treatment
             | Stage
   Treatment |         1 |         2 |         3 | Row Total |
-------------|-----------|-----------|-----------|-----------|
           0 |       466 |       453 |       150 |      1069 |
             |    10.687 |     2.734 |     4.651 |           |
             |     0.436 |     0.424 |     0.140 |     0.809 |
             |     0.941 |     0.749 |     0.679 |           |
             |     0.353 |     0.343 |     0.114 |           |
-------------|-----------|-----------|-----------|-----------|
           1 |        29 |       152 |        71 |       252 |
             |    45.335 |    11.599 |    19.730 |           |
             |     0.115 |     0.603 |     0.282 |     0.191 |
             |     0.059 |     0.251 |     0.321 |           |
             |     0.022 |     0.115 |     0.054 |           |
-------------|-----------|-----------|-----------|-----------|
Column Total |       495 |       605 |       221 |      1321 |
             |     0.375 |     0.458 |     0.167 |           |
Tumor Size and treatment
             | Size
   Treatment |         1 |         2 |         3 | Row Total |
-------------|-----------|-----------|-----------|-----------|
           0 |       240 |       497 |       332 |      1069 |
             |     0.794 |     1.937 |     5.036 |           |
             |     0.225 |     0.465 |     0.311 |     0.809 |
             |     0.857 |     0.861 |     0.716 |           |
             |     0.182 |     0.376 |     0.251 |           |
-------------|-----------|-----------|-----------|-----------|
           1 |        40 |        80 |       132 |       252 |
             |     3.369 |     8.215 |    21.363 |           |
             |     0.159 |     0.317 |     0.524 |     0.191 |
             |     0.143 |     0.139 |     0.284 |           |
             |     0.030 |     0.061 |     0.100 |           |
-------------|-----------|-----------|-----------|-----------|
Column Total |       280 |       577 |       464 |      1321 |
             |     0.212 |     0.437 |     0.351 |           |
Lymph Nodes Involvement and treatment
             | Lymph Nodes Involvement
   Treatment |         0 |         1 | Row Total |
-------------|-----------|-----------|-----------|
           0 |       731 |       338 |      1069 |
             |     7.562 |    12.219 |           |
             |     0.684 |     0.316 |     0.809 |
             |     0.896 |     0.669 |           |
             |     0.553 |     0.256 |           |
-------------|-----------|-----------|-----------|
           1 |        85 |       167 |       252 |
             |    32.078 |    51.833 |           |
             |     0.337 |     0.663 |     0.191 |
             |     0.104 |     0.331 |           |
             |     0.064 |     0.126 |           |
-------------|-----------|-----------|-----------|
Column Total |       816 |       505 |      1321 |
             |     0.618 |     0.382 |           |
Metastasis and treatment
             | Metastasis
   Treatment |         0 |         1 | Row Total |
-------------|-----------|-----------|-----------|
           0 |       954 |       115 |      1069 |
             |     0.764 |     5.002 |           |
             |     0.892 |     0.108 |     0.809 |
             |     0.832 |     0.657 |           |
             |     0.722 |     0.087 |           |
-------------|-----------|-----------|-----------|
           1 |       192 |        60 |       252 |
             |     3.240 |    21.221 |           |
             |     0.762 |     0.238 |     0.191 |
             |     0.168 |     0.343 |           |
             |     0.145 |     0.045 |           |
-------------|-----------|-----------|-----------|
Column Total |      1146 |       175 |      1321 |
             |     0.868 |     0.132 |           |
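Table 5: Check for balance-Standardized mean difference
| Variable | Treatment | Control | Mean difference | Standard Deviation | Standardized Mean Difference | Z score | P value |
|------------|-----------|---------|-----------------|--------------------|------------------------------|---------|----------|
| Age | 2.250 | 2.670 | -0.415 | 0.064 | -0.460 | -6.470 | 9.95e-11 |
| Gender | 1.500 | 1.500 | 0.006 | 0.035 | 0.013 | 0.180 | 8.57e-01 |
| Site | 3.270 | 3.290 | -0.025 | 0.072 | -0.025 | -0.350 | 7.27e-01 |
| Stage | 2.170 | 1.700 | 0.462 | 0.050 | 0.677 | 9.340 | 9.49e-21 |
| Size | 2.370 | 2.090 | 0.279 | 0.052 | 0.382 | 5.400 | 6.62e-08 |
| Lymph | 0.663 | 0.316 | 0.347 | 0.034 | 0.742 | 10.200 | 2.46e-24 |
| Metastasis | 0.238 | 0.108 | 0.131 | 0.024 | 0.389 | 5.500 | 3.89e-08 |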
Figure 3: Check for balance-Standardized mean difference (axes: Standardized Mean Difference, Covariates)
Table 6: Comparison of different matching methods
Note: for each feature, the table gives the mean in the treated group, the mean in the control
group and the standardized mean difference (SMD), first before matching and then after
matching. The last column is the percentage improvement in balance after matching.
| Matching Method | Feature | Treated mean (before) | Control mean (before) | SMD (before) | Treated mean (after) | Control mean (after) | SMD (after) | Improvement (%) |
|---------------------------------------------|------------------|--------|--------|---------|--------|--------|---------|---------|
| Nearest Neighbor (252 treated, 252 control) | Propensity Score | 0.2972 | 0.1657 | 0.8595 | 0.2972 | 0.2948 | 0.0152 | 98.2354 |
| | Age | 2.2540 | 2.6688 | -0.4582 | 2.2540 | 2.2698 | -0.0175 | 96.1741 |
| | Stage | 2.1667 | 1.7044 | 0.7594 | 2.1667 | 2.1746 | -0.0130 | 98.2831 |
| | Tumor Size | 2.3651 | 2.0861 | 0.3757 | 2.3651 | 2.3611 | 0.0053 | 98.5778 |
| | Lymph Nodes | 0.6627 | 0.3162 | 0.7315 | 0.6627 | 0.6548 | 0.0168 | 97.7096 |
| | Metastasis | 0.2381 | 0.1076 | 0.3058 | 0.2381 | 0.2302 | 0.0186 | 93.9192 |
| Optimal (1:1) (252 treated, 252 control) | Propensity Score | 0.2972 | 0.1657 | 0.8595 | 0.2972 | 0.2948 | 0.0154 | 98.2109 |
| | Age | 2.2540 | 2.6688 | -0.4582 | 2.2540 | 2.2857 | -0.0351 | 92.3482 |
| | Stage | 2.1667 | 1.7044 | 0.7594 | 2.1667 | 2.1865 | -0.0326 | 95.7079 |
| | Tumor Size | 2.3651 | 2.0861 | 0.3757 | 2.3651 | 2.3492 | 0.0214 | 94.3111 |
| | Lymph Nodes | 0.6627 | 0.3162 | 0.7315 | 0.6627 | 0.6587 | 0.0084 | 98.8548 |
| | Metastasis | 0.2381 | 0.1076 | 0.3058 | 0.2381 | 0.2421 | -0.0093 | 96.9596 |
| Full (252 treated, 1069 control) | Propensity Score | 0.2972 | 0.1657 | 0.8595 | 0.2972 | 0.2975 | -0.0023 | 99.7299 |
| | Age | 2.2540 | 2.6688 | -0.4582 | 2.2540 | 2.2647 | -0.0118 | 97.4221 |
| | Stage | 2.1667 | 1.7044 | 0.7594 | 2.1667 | 2.1756 | -0.0147 | 98.0610 |
| | Tumor Size | 2.3651 | 2.0861 | 0.3757 | 2.3651 | 2.3657 | -0.0008 | 99.7888 |
| | Lymph Nodes | 0.6627 | 0.3162 | 0.7315 | 0.6627 | 0.6605 | 0.0047 | 99.3513 |
| | Metastasis | 0.2381 | 0.1076 | 0.3058 | 0.2381 | 0.2174 | 0.0485 | 84.1393 |
Figure 4: Covariate balance before and after matching
Figure 5: Overall balance (Propensity Score) before and after matching
The values on the Y-axis result from "scaling" in R.
Figure 6: Treatment Effect