Survival Estimation Using Bootstrap, Jackknife and K

Survival Estimation Using Bootstrap, Jackknife and K-Repeated Jackknife Methods11-2014
Ugochukwu A. Mbata University of Lagos, Akoka-Lagos, Nigeria, [email protected]
Follow this and additional works at: http://digitalcommons.wayne.edu/jmasm
Part of the Applied Statistics Commons, Social and Behavioral Sciences Commons, and the Statistical Theory Commons
This Regular Article is brought to you for free and open access by the Open Access Journals at DigitalCommons@WayneState. It has been accepted for inclusion in Journal of Modern Applied Statistical Methods by an authorized editor of DigitalCommons@WayneState.
Recommended Citation Adewara, Johnson A. and Mbata, Ugochukwu A. (2014) "Survival Estimation Using Bootstrap, Jackknife and K-Repeated Jackknife Methods," Journal of Modern Applied Statistical Methods: Vol. 13 : Iss. 2 , Article 15. DOI: 10.22237/jmasm/1414815240 Available at: http://digitalcommons.wayne.edu/jmasm/vol13/iss2/15
November 2014, Vol. 13, No. 2, 287-306.
Copyright © 2014 JMASM, Inc.
ISSN 1538 − 9472
Johnson Ademola Adewara is a Senior Lecturer at the Distance Learning Institute at the University of Lagos. Email at [email protected] U.A. Mbata is a Graduate Fellow and Consulting Statistician in the Mathematics Department at the University of Lagos. Email at [email protected]
287
Johnson Ademola Adewara University of Lagos
Akoka-Lagos, Nigeria
Akoka-Lagos, Nigeria
Three re-sampling techniques are used to estimate the survival probabilities from an exponential life-time distribution. The aim is to employ a technique to obtain a parameter estimate for a two-parameter exponential distribution. The re-sampling methods considered are: Bootstrap estimation method (BE), Jackknife estimation method (JE) and the k-repeated Jackknife estimation method (KJE). The methods were computed to obtain the mean square error (MSE) and mean percentage error (MPE) based on simulated data. The estimates of the two-parameter exponential distribution were substituted to estimate
survival probabilities. Results show that the MSE value is reduced when the K–repeated jackknife method is used.
Introduction
Modern statistics is anchored in the use of statistics and hypothesis tests that only
have desirable and well-known properties when computed from populations that
are normally distributed. While it is claimed that many such statistics and
hypothesis tests are generally robust with respect to non-normality, other
approaches that require an empirical investigation of the underlying population
distribution or of the distribution of the statistic are possible and in some instances
preferable. In instances when the distribution of a statistic, conceivably a very
complicated statistic, is unknown, no recourse to a normal theory approach is
available and alternative approaches are required. Statistical models and methods
for estimating survival data and other time-to-event data are extensively used in
many fields, including the biomedical sciences, engineering, the environmental
sciences, economics, actuarial sciences, management, and the social sciences.
288
Survival analysis refers to techniques for studying the occurrence and timing of
events. It is concerned with studying the random variable T, representing the time
between entry to a study and some event of interest, such as: death, the onset of
disease, time until equipment failures, earthquakes, automobile accidents, time-to-
promotions, time until stock market crashes, revolutions, job terminations, births,
marriages, divorces, retirements or arrests. There are many different models for
survival data, and what often distinguishes one model from another is the
probability distribution for T. Resampling statistics refer to the use of the observed
data or of a data generating mechanism (such as a die) to produce new hypothetical
samples (resamples) that mimic an underlying population, the results of which can
then be analyzed. With numerous cross-disciplinary applications especially in the
sub-disciplines of the life science, resampling methods are widely used because
they are options when parametric approaches are difficult to employ or otherwise
do not apply.
Resampled data is derived using a manual mechanism to simulate many
pseudo-trials. These approaches were difficult to utilize prior to 1980s because
these methods require many repetitions. With the incorporation of computers, the
trials can be simulated in a few minutes and is why these methods have become
widely used. The methods that will be discussed are used to make many statistical
inferences about the underlying population. The most practical use of resampling
methods is to derive confidence intervals and test hypotheses. This is accomplished
by drawing simulated samples from the data themselves (resamples) or from a
reference distribution based on the data; afterwards, you are able to observe how
the statistic of interest in these resamples behaves. Resampling approaches can be
used to substitute for traditional statistical (formulaic) approaches or when a
traditional approach is difficult to apply. These methods are widely used because
their ease of use. They generally require minimal mathematical formulas, needing
a small amount of mathematical (algebraic) knowledge. These methods are easy to
understand and stray away from choosing an incorrect formula in your diagnostics.
Two general approaches considered here are: Jackknife approach and
Bootstrap approach. The aim of this study is to employ a technique to obtain an
estimate of the parameter of the two-parameter exponential distribution. The
methods considered in this paper are: Bootstrap estimation method (BE), Jackknife
estimation method (JE) and the k-repeated Jackknife estimation method (KJE). The
estimates of the two-parameter exponential distributions are used to estimate the
survival probability. Methodology under Bootstrap, Jackknife and the proposed K-
repeated Jackknife is presented, followed by data analysis, results, discussion and
a conclusion.
ADEWARA & MBATA
Re-sampling methods are becoming increasingly popular as statistical tools. These
methods involve sampling or scrambling the original data numerous times. Two
general approaches are considered here. They are: Jackknife approach and
Bootstrap approach. The two-parameter exponential distribution is adopted when
failure will never occur prior to some specified time, t0. The parameter t0 is a
location parameter that shifts the distribution an amount equal to t0 towards the right
on the time line. When t t0, the probability density function of exponential
distribution becomes:
0 0

; t
(2)
Bootstrapping is a modern, computer-intensive, general purpose approach to
statistical inference, falling within a broader class of re-sampling methods to
simplify the often intricate calculations of traditional statistical theory. A
parametric bootstrap method is considered in this article.
The general theory (see Rizzo, 2008) is as follows. Suppose 1 2, , , nt t t is a
random sample from the distribution of T. An estimator for a parameter is an
n variate function 1 2, ,ˆ ˆ , nt t t of the sample. Functions of the estimator
are therefore n–variate functions of the data, also. For simplicity, let
1 2, , , , T n
nt t t t R and 1 2 , ,t t denote a sequence of independent random
samples generated from the distribution of T. Random variables from the sampling
distribution of can be generated by repeatedly drawing independent random
samples j
the replicates is given as
290
ˆ ˆMSE E
random samples 1 2 , , ,
m t t t are generated from the distribution of T then
estimate of the MSE of 1 2, ,ˆ ˆ , nt t t is
2
1
nt t t .
Estimate of the standard error of the bootstrap estimate, ˆ B is given by
2
1
2

(7)
If the bootstrap estimator ˆ B is known from (3) then the estimate of survival
function is given as

Jackknife Estimation
The jackknife is a more orderly version of the bootstrap. As opposed to re-sampling
randomly from the entire sample like the bootstrap does, the jackknife takes the
entire sample except for 1 value, and then calculates the test statistic of interest. It
repeats the process, each time leaving out a different value, and each time
recalculating the test statistic. This method was introduced by Quenouille (1949)
and further modification in Quenouille (1956). The theory is as follows:
Let be an estimator of the parameter θ based on the complete sample of
size n with g subgroups. Let ˆ i be the corresponding estimator based on the
sample at the ith deletion. Define
ˆ 1 1,2ˆ , ,i ig g i g (9)
The ith deletion of the total could be one individual observation or several
observation. The latter case is called group- or block-based jackknife if one
replication or one block observations are deleted. In equation (1) estimation i is
called the ith pseudo value and the estimator in equation (2) is the jackknife
estimator for the parameter θ, where θ can be a variance component, covariance
component, correlation coefficient, or any other parameter of interest.
1 1
(10)
In equation (10), is called a pseudo jackknife estimate. A t-test can then be
used to test significant deviation from a given parameter value, 0 with degrees of
freedom 1g (Miller, 1974a, b). The equation (9) can be rewritten as
1 1 1,ˆ ˆ 2ˆ ˆ , ,ˆ i i ig g g i g . (11)
Thus, it is obvious that pseudo value i in equation (11) is related to choices
for g . When g is large, a slight difference between and ˆ i
will cause
unfavorable values. More importantly, it will potentially cause a large standard
error for an estimate and thus decrease the power for the parameter being tested. If
it is assumed that the estimate ˆ i in equation (9) for the ith deletion is unbiased,
SURVIVAL ESTIMATION USING THREE METHODS
292
then it is easy to prove that in equation (12) is unbiased too. It is often true if
is an unbiased estimate of θ, then ˆ i will be unbiased after a few individuals in the
original data are deleted.
. (12)
In equation (12), is called a non-pseudo jackknife estimate of the parameter
θ. For each non-normally distributed variable, based on the Central Limit Theorem,
is approximately normally distributed when g is large. Thus, an approximate z-
test can be used when g is large or t-test can be used to test significant deviation
from a given parameter value, 0 , with the degrees of freedom 1g . An estimate
of the mean square error (MSE) of the jackknife estimate, ˆ jack is given by
2
(13)
Estimate of the standard error of the jackknife estimate, ˆ jack is given by
2
1
2, 1

293
If the jackknife estimator ˆ jack is known from (12) then the estimate of survival
function is given as

K – Repeated Jackknife Estimation Method
The K – Repeated jackknife procedure is a re-sampling iterative scheme for mean
square error (MSE) reduction. This involves jackknifing the observed data k-time,
where k equals the sample size of the observed data. The procedure is conveniently
applied when the sample size is small. The stopping rule for the repeated jackknife
replications depends on the sample size of the original data. The procedure
converges before or at kth time, where the estimate from the jackknife replications
is the same as estimator of the parameter θ based on the complete sample of size n.
At the Kth time, the kth – repeated jackknife estimate of bias is highly negligible.
The method involves the following steps from the usual jackknife procedure:
Step 1. Observe a random sample T = (t1, t2, . . . ,tn)
Step 2. Compute ˆ t a function of the data which estimates the parameter
of the model.
Step 3. For i up to n
generate a jackknife sample 1 1 1, , , ,i i i nT t t t t by leaving out the
ith observation
calculate ˆ i from each of the Jackknife sample iT by
1
294
Step 4. Repeat step 3 using the estimates from ˆ i to form pseudo samples.
The new pseudo samples are used to generate another set of jackknife estimates;
this is continued until the kth time. This implies that the process is repeated k times,
and at any given stage the preceding jackknife estimates are used as new samples
in the next stage until the kth time.
Step 5. At the kth time the K-repeated Jackknife estimate is calculated as
1
1
(19)

(20)

, 1
2
1
1
is known from (19), then the estimate
of survival function is

. (24)

where K = n (sample size) indicates the stopping rule. Other estimators such as
variance, standard error and confidence interval can be estimated as in (20), (21),
(22) and (23).
This study described three types of parameter estimation methods based on
re-sampling technique: the bootstrap method, the jackknife method and the k-
repeated jackknife method. However, the intention of this study is to use Monte
Carlo simulated data to compare the three methods based on mean squared error
(MSE) and mean percentage error (MPE), hence survival estimation.
296
Exploratory data analysis approach using simulated data generated by R-statistical
program is adopted in this research work. This is to validate the statistical
assumptions of an exponential distribution. In statistics, every statistical model has
its own assumptions that have to be verified and met, to provide valid results. In
the case of exponential distribution, the confidence interval for the mean life of an
event requires two major assumptions: the time-to-occurrence of events of interest
are independent, and the time for occurrence of event is exponentially distributed.
These two statistical assumptions must be satisfied for the corresponding
confidence interval to cover the true mean with the prescribed probability. The
simulated data is based on random generation of values which satisfies both the
assumption of independence and exponentially identical distribution. Some
properties of the exponential distribution are as follows: the theoretical mean and
standard deviation are equal. Hence, (1) the sample values of mean and standard
deviation should be close. (2) Histogram should show that the distribution is right
skewed (Median < Mean). (3) A plot of Cumulative-Failure vs. Cumulative-Time
should be close to linear. (4) The regression slope of Cum-Failure vs. Cum-Time is
close to the failure rate. (5) A plot of Cum-Rate vs. Cum-Failure should
decrease/stabilize at the failure rate level. (6) Plots of the Exponential probability
and its scores should also be close to linear. Some of these properties are explained
by the exploratory data analysis displayed in Figures 1, i - xii.
ADEWARA & MBATA
Table 1. Generation Parameters
Sample 1 (Sample size = 10, = 0.5) Sample 7 (Sample size = 20, = 1.5)
Figure 1. Histogram for each of the Randomly Generated Sample (i – xii)
298
Sample (i) N λ Mean Median Std.Dev.
Sample 1 10 0.5 2.6006960 1.4973290 3.7961510
Sample 2 10 1.0 0.9211933 0.7875227 0.7792381
Sample 3 10 1.5 0.7320056 0.5990053 0.5961606
Sample 4 10 2.0 0.8510537 0.7067248 0.8785794
Sample 5 20 0.5 1.8608660 1.1843510 1.8059580
Sample 6 20 1.0 1.2379810 0.9922479 1.0967930
Sample 7 20 1.5 0.5363318 0.2470020 0.6540290
Sample 8 20 2.0 0.7281072 0.2824484 1.0112260
Sample 9 30 0.5 1.5960620 1.0194960 1.6278560
Sample 10 30 1.0 0.8559385 0.5422463 0.8296031
Sample 11 30 1.5 0.6353941 0.3864003 0.5880883
Sample 12 30 2.0 0.4639596 0.2700518 0.5150784
Results
The results of descriptive statistics show that as the sample sizes 10, 20 and 30
increase the mean and standard deviation are decreasing which satisfied one of the
properties of that the theoretical mean and standard deviation are equal. The sample
mean and standard deviation obtained are very close also as the value of λ increases
the median values obtained get smaller. Figure 1 above shows that the observed
distribution agrees with the exponential distribution property 1 and property 2
described in the data above. Figures 1, i–xii show right-skewness, which supported
the attribute of an exponential distribution.
Table 3 shows the parameter estimation of the three methods. The results
reveal that the estimation of the bootstrap approach is better than the other two
methods that is the jackknifing and K repeated jackknifing. Table 4 is the result of
the mean square error (MSE) of the analysis which is about the variance of the three
methods. Results reveal that, as λ values increase, the results of jackknifing and K
repeated jackknifing are better than the bootstrapped approach. Table 5 is the
computation of the mean percentage error (MPE) the result shows that estimation
of the bootstrap approach is better than the other two methods.
ADEWARA & MBATA
299
Table 3. Estimation Using the Three Methods Bootstrap, Jackknifing and K repeated
jackknifing
0.5 0.527441921 0.527445588 0.527445588
1 0.491819455 0.491882638 0.491882638
1.5 0.491085203 0.491099760 0.491099760
2.0 0.528125037 0.528118624 0.528118624
Table 4. Estimation to the Bootstrap, Jackknifing and K repeated jackknifing using MSE
methods
300
Table 5. Bootstrap, Jackknifing and K-repeated jackknifing using MPE methods
ˆ BS t ˆ
jackS t ˆKS t
0.5 0.054883842 0.054891176 0.054891176
1 0.016361090 0.016234724 0.016234724
1.5 0.017829594 0.017800480 0.017800480
2.0 0.056250074 0.056237248 0.056237248
Table 6. Survival Estimation Using the Three Methods with Respect to MSE and MPE
BOOTSTRAP METHOD (1)
10
20
30
ADEWARA & MBATA
301
Table 6, cont’d. Survival Estimation Using the Three Methods with Respect to MSE and
MPE
10
20
30
K-REPEATED JACKKNIFE METHOD (3)
10
20
30
302
Discussion
The comparison of parametric estimators of exponential distribution using
Bootstrap, Jackknife, and K-Repeated Jackknife methods indicates that the
estimates of the population parameter are very close which implies that the
estimators are unbiased. A comparison of the mean square error (MSE) and mean
percentage error (MPE) of the estimators shows that K-Repeated Jackknife method
has a minimum variance unbiased estimator (MVUE); irrespective of the sample
size whether it is small or large at any given values of lambda (λ). The three
methods are used to estimate the survival function for exponential distribution and
its mean square error (MSE) and mean percentage error (MPE). The results can be
deduced that the performance of the two jackknife procedures over the bootstrap
procedure is 66.67% to 33.33%. This result has been able to show the effect or
influence of jackknife method, especially the k-repeated procedure on error
reduction in estimating population parameter.
Conclusion
This study demonstrates that both methods of re-sampling technique are very
efficient in estimating the population parameters and their mean square errors
(MSE), as viewed by Efron (1998). These methods were used to find the best
minimum variance unbiased estimator, using mean square error (MSE) and mean
percentage error (MPE). The estimates of the two-parameter exponential
distribution are used to estimate the survival probability. The attractiveness of
jackknifing and bootstrapping is that they provide investigators with an important
and unattainable type of information. Jackknifing and bootstrapping have their
limitations and inherent assumptions as all statistical procedures do. The three
methods are computationally intensive. However, these techniques represent an
important step in refining the process of data analysis more especially the k-
repeated procedure. Hence, it can be deduced that bootstrapping is a method for
evaluating the variance of an estimator while jackknife is a method for reducing the
bias of an estimator, and evaluating the variance of an estimator. This is clearly
shown in the MSE results. The MSE value is reduced using the K–repeated
jackknife method.
ADEWARA & MBATA
NJ: Prentice-Hall.
Collett, D. (1994). Modeling survival data in medical research. London:
Chapman and Hall.
Conover, W. J. (1980). Practical nonparametric statistics. New York: John
Wiley and Sons.
Cox, D. R., & Oakes, D. (1984). Analysis of survival data. New York:
Chapman and Hall.
Crowley, P. (1992). Resampling methods for computation-intensive data
analysis in ecology and evolution. Annual Review of Ecological Systems, 23, 405-
447.
Efron, B., & Gong, G. (1983). A leisurely look at the bootstrap, the
jackknife, and cross-validation. The American Statistician, 37, 36-48.
Efron, B., & Tibshirani, R. (1998). An introduction to the bootstrap. Boca
Raton, FL: Chapman & Hall/CRC.
Fernández, A. J. (2000). Estimation and Hypothesis Testing for Exponential
Lifetime Models with Double Censoring and Prior Information. Journal of
Economic and Social Research, 2(2), 1-17.
Good, P. L. (2005). Resampling methods (3rd ed.). Berlin: Birkhauser.
Lawless, J. F. (1982). Statistical models and methods for lifetime data. New
York: John Wiley and Sons.
Miller, R. G. (1974a). An unbalanced jackknife. Annals of Statistics, 2, 880-
891.
Miller, R. G. (1974b). The jackknife-a review. Biometrika, 61, 1-15.
Quenouille, M. H. (1949). Approximate tests of correlation in time-series.
Journal of the Royal Statistical Society, Series B, 11, 68-84.
Quenouille, M. H. (1956). Notes on bias in estimation. Biometrika, 43, 353-
360.
Rizzo, M. L. (2008). Statistical computing with R. Boca Raton, FL:
Chapman and Hall/CRC.
Suzuki, K. (1985a). Nonparametric estimation of lifetime distributions from
a record of failures and follow-ups. Journal of the American Statistical
Association, 80, 68-72.
304
Suzuki, K. (1985b). Estimation of lifetime parameters from incomplete field
data. Technometrics, 27, 263-271.
Table A1. Sample Size = 10
S/N Sample 1(λ= 0.5) Sample 2(λ= 1.0) Sample 3(λ= 1.5) Sample 4(λ= 2.0)
1 0.38672500 2.27994293 0.20192160 0.01058359
2 0.11665480 0.71411938 0.48301090 2.51951031
3 0.12661390 0.75298901 0.12473380 1.94495769
4 4.42286640 2.19289233 1.99662440 0.67531853
5 2.00704600 0.04442099 0.67780240 1.05647689
6 2.34689070 0.35021824 0.75981590 0.01373747
7 0.86211020 1.01674933 0.87373650 0.08979746
8 1.03118540 0.82205630 0.21055390 0.73813115
9 12.74339800 0.02251542 1.47164790 0.07333057
10 1.96347260 1.01602885 0.52020820 1.38869338
1 2.62129072 0.13858938 0.61025422 0.20816883
2 3.66009415 0.23677409 1.39562321 0.10508682
3 0.28671927 0.65882621 1.64585069 0.16343073
4 0.53614812 2.03913598 1.11779841 4.47429656
5 1.67262059 3.07310425 0.64918675 0.31910786
6 0.08521250 0.85047816 0.05778008 0.21707275
7 0.84341716 0.93587816 0.04898416 0.04360368
8 1.87988871 3.97283307 0.12098188 0.11505109
9 3.13213741 1.91406287 0.55509004 0.24578897
10 6.82508233 1.04861761 0.19238185 1.05875641
11 0.02900937 0.03350444 0.26953211 0.00355591
12 1.74900498 1.30175357 0.01133644 0.05922794
13 1.24431388 1.52538027 0.28231025 0.20815442
14 1.12438751 0.06504098 2.50193518 1.48214568
15 1.02901637 1.90192968 0.52524965 0.89694421
16 2.19818977 0.71117769 0.20890800 1.53327300
17 5.75705560 2.66119925 0.12742349 1.11055824
18 0.67605113 1.38839741 0.09933566 0.81368075
19 0.79778248 0.30239189 0.08220231 0.46919296
20 1.06989471 0.00054309 0.22447181 1.03504680
306
1 1.09871091 0.17085314 0.24591320 0.03006484
2 1.30597223 0.31266839 0.78554402 0.01316790
3 0.43742079 0.21960578 0.26193004 0.08580695
4 0.57034193 1.09603105 0.07651001 0.06684230
5 1.02475744 0.54796524 0.29620540 0.75349995
6 1.53172531 0.13561858 0.58560088 0.56383185
7 0.57173448 0.46805576 0.56401687 0.10014385
8 3.18835121 0.58675535 0.41054238 0.39468972
9 1.01423546 0.41116558 1.42498655 0.26076246
10 0.05104055 3.42353290 0.03212381 0.38563026
11 2.99399940 1.09817464 0.19768464 0.96558979
12 2.95112802 0.35962685 0.24295821 0.27934118
13 4.74244122 1.98862082 0.16056365 0.01547041
14 0.04628853 1.11964839 0.24140637 0.34343548
15 5.35809191 0.48539163 0.06956623 0.80092480
16 4.26185504 1.09365222 1.49188159 1.49780414
17 0.04367701 0.73949713 0.22345808 0.90349970
18 0.05851474 0.34345758 1.56937093 0.18701462
19 0.46455153 0.29243694 0.59195204 0.00288255
20 0.09371455 0.25904558 0.42176436 0.74330912
21 2.96970220 1.79180600 1.11983745 0.25455636
22 0.54133977 0.01809066 0.28416911 0.81341579
23 1.29204462 2.94010580 0.65638599 0.10230827
24 0.02923826 0.79327459 0.12515747 2.24807427
25 2.28724168 1.13577062 2.26479793 0.08718112
26 3.62406597 0.05051664 1.73361763 1.15865739
27 0.55831489 0.53652729 0.36225826 0.02484439
28 4.24494133 1.08361205 1.41035274 0.48756216
29 0.13192788 0.25488727 0.93471466 0.25596331
30 0.39449119 1.92176066 0.27655395 0.09251206
Journal of Modern Applied Statistical Methods
11-2014
Johnson A. Adewara
Ugochukwu A. Mbata

Documents

Survival Estimation Using Bootstrap, Jackknife and K