
Chapter 7: Hypothesis testing

Hypothesis testing is typically based on the cumulative hazard function. Here we'll use the Nelson-Aalen estimate of the cumulative hazard. A weight function (which may involve the survival function) is used to weight differences between the observed and expected cumulative hazard.

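Recall that the Nelson-Aalen estimate of the cumulative hazard is

$$\hat{H}(t) = \sum_{t_i \le t} \frac{d_i}{Y_i}$$

where $d_i$ is the number of deaths at time $t_i$ and $Y_i$ is the number at risk at time $t_i$.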
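As a minimal sketch of the Nelson-Aalen sum, the base R function below tabulates the deaths and the number at risk at each distinct death time and accumulates their ratio. The function name `nelson_aalen` and the toy data are hypothetical and only illustrate the formula.

```r
# Nelson-Aalen estimate of the cumulative hazard (illustrative sketch).
# time:   observed follow-up times
# status: 1 = death observed, 0 = censored
nelson_aalen <- function(time, status) {
  death_times <- sort(unique(time[status == 1]))
  # d_i: number of deaths at each distinct death time
  d <- sapply(death_times, function(t) sum(time == t & status == 1))
  # Y_i: number still at risk just before each death time
  Y <- sapply(death_times, function(t) sum(time >= t))
  data.frame(time = death_times, d = d, Y = Y, H = cumsum(d / Y))
}

# Hypothetical toy data: five subjects, two of them censored
na <- nelson_aalen(time = c(2, 3, 3, 5, 8), status = c(1, 1, 0, 1, 0))
print(na)
```

In this toy example the increments are 1/5, 1/4 and 1/2, so the estimate steps up at times 2, 3 and 5.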

In a one-sample problem, you test whether the hazard rate $h(t)$ is equal to some reference hazard, $h_0(t)$. The null hypothesis is $H_0: h(t) = h_0(t)$. Under the null hypothesis, the expected hazard rate at time $t_i$ is $h_0(t_i)$.

SAS Programming March 6, 2015 1 / 43

Hypothesis testing: one sample

The idea is then to compare the observed and expected cumulative hazards at time $\tau$, the largest time in the study ($\tau = t_D$ if the largest time is a death time). The test statistic is then

$$Z(\tau) = O(\tau) - E(\tau) = \sum_{i=1}^{D} W(t_i)\,\frac{d_i}{Y_i} - \int_0^{\tau} W(s)\,h_0(s)\,ds$$

where $W(\cdot)$ is a weight function.

The variance is

$$V[Z(\tau)] = \int_0^{\tau} W^2(s)\,\frac{h_0(s)}{Y(s)}\,ds$$


Hypothesis testing

Under the null hypothesis the expected value of $Z(\tau)$ is 0, so if we take a z-score of $Z(\tau)$ (subtracting the mean and dividing by the standard deviation), we get

$$\frac{Z(\tau)}{\sqrt{V[Z(\tau)]}}$$

which has an approximate standard normal distribution. This can be used for either a two-sided or a one-sided test. For example, a one-sided test would use $H_1: h(t) > h_0(t)$, and you would reject only for large values of $Z(\tau)/\sqrt{V[Z(\tau)]}$.


Hypothesis testing

The most popular choice for a weighting function is $W(t) = Y(t)$, which (since $Y(t_i) = Y_i$) leads to

$$O(\tau) = \sum_{i=1}^{D} Y(t_i)\,\frac{d_i}{Y_i} = \sum_{i=1}^{D} d_i$$

This is also called the log-rank test (not sure why).
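To make the one-sample statistic concrete, here is a hedged sketch for the special case of a constant reference hazard, an assumption made only for this illustration. With $W(t) = Y(t)$ and $h_0(t) = \lambda_0$, the expected count is $E(\tau) = \lambda_0 \int_0^{\tau} Y(s)\,ds$, which is $\lambda_0$ times the total follow-up time when everyone enters at time 0, and the variance formula gives $V[Z(\tau)] = E(\tau)$. The function name and the data are hypothetical.

```r
# One-sample log-rank test against a constant reference hazard lambda0.
# Assumes all subjects enter at time 0 (no left truncation).
one_sample_logrank <- function(time, status, lambda0) {
  O <- sum(status)            # observed deaths: sum of d_i
  E <- lambda0 * sum(time)    # expected deaths: integral of Y(s) * lambda0
  z <- (O - E) / sqrt(E)      # for W = Y the variance equals E
  c(O = O, E = E, z = z, p_upper = pnorm(z, lower.tail = FALSE))
}

# Hypothetical data: 4 deaths in 40 person-years, reference hazard 0.05 per year
res <- one_sample_logrank(time = c(5, 8, 12, 15), status = c(1, 1, 1, 1),
                          lambda0 = 0.05)
print(res)
```

The upper-tail p-value corresponds to the one-sided alternative that the study population has a higher hazard than the reference.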

Other weight functions are possible. For example

$$W(t) = Y(t)\,S_0(t)^p\,[1 - S_0(t)]^q$$

with $0 \le p, q \le 1$ (you don't necessarily need $q = 1 - p$ here). The choice of $p$ affects whether you care more about the hazard not matching the hypothesized hazard for small $t$ or large $t$. For example, if $p$ is large, then more emphasis is placed on the estimated hazard matching the null hazard for small values of $t$, because $S_0(t)^p$ is largest when $t$ is small.

$S_0(t)$ can be obtained from $S_0(t) = \exp[-H_0(t)]$.

Hypothesis testing

An example where you would use the one-sided hypothesis test is in testing whether some population has a higher hazard than a reference population, such as the psychiatric patients from Iowa. Recall that for this example, we looked at excess mortality previously.


Hypothesis testing: two or more samples

If you have two or more samples (e.g., mortality for three different treatments or three different risk groups), then the null and alternative hypotheses are similar to those for ANOVA:

$$H_0: h_1(t) = h_2(t) = \cdots = h_K(t), \quad \text{for all } t \le \tau$$

$$H_A: h_i(t) \ne h_j(t) \quad \text{for some } i \ne j \text{ and some } t \le \tau$$

where $\tau$ is the largest time at which all of the groups have at least one subject at risk.



We now define $t_i$ as the unique death times for the pooled data (i.e., ignoring the group that each observation comes from), and again $t_D$ is the largest death time.

We observe $d_{ij}$ deaths at time $t_i$ in sample $j$, and there are $Y_{ij}$ individuals at risk at time $t_i$ in sample $j$. We let $d_i = \sum_{j=1}^{K} d_{ij}$ be the total number of deaths at time $t_i$ and $Y_i = \sum_{j=1}^{K} Y_{ij}$ be the total number of individuals at risk (available for death?) at time $t_i$.



The idea for testing the hypothesis is that under the null hypothesis, the estimate of the hazard (and cumulative hazard) should be the same (in expectation) using the pooled data (ignoring the group the samples are from) and for the individual samples. We can think of the pooled data as providing a more precise estimate of the hazard for the $j$th sample than the $j$th sample itself, so using the idea of observed minus expected, we can write

$$Z_j(\tau) = \sum_{i=1}^{D} W_j(t_i)\left(\frac{d_{ij}}{Y_{ij}} - \frac{d_i}{Y_i}\right), \quad j = 1, \ldots, K$$

If all of the $Z_j(\tau)$ terms are close to 0, then all of the sample estimated cumulative hazards are close to the pooled cumulative hazard, so they all must be close to each other, and this supports the null hypothesis.



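The typical weight function used is $W_j(t_i) = Y_{ij}\,W(t_i)$, where $W(t_i)$ is a common weight shared by each group. For this weighting scheme,

$$Z_j(\tau) = \sum_{i=1}^{D} W(t_i)\left[d_{ij} - Y_{ij}\left(\frac{d_i}{Y_i}\right)\right]$$

$$V[Z_j(\tau)] = \sigma_{jj} = \sum_{i=1}^{D} W(t_i)^2\,\frac{Y_{ij}}{Y_i}\left(1 - \frac{Y_{ij}}{Y_i}\right)\left(\frac{Y_i - d_i}{Y_i - 1}\right) d_i, \quad j = 1, \ldots, K$$

$$\mathrm{cov}(Z_j(\tau), Z_k(\tau)) = \sigma_{jk} = -\sum_{i=1}^{D} W(t_i)^2\,\frac{Y_{ij}}{Y_i}\,\frac{Y_{ik}}{Y_i}\left(\frac{Y_i - d_i}{Y_i - 1}\right) d_i, \quad j \ne k$$

The covariances are negative: because the $Z_j(\tau)$ sum to 0, each row of the covariance matrix must also sum to 0.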



Based on the second formula for $Z_j(\tau)$, the sum $\sum_{j=1}^{K} Z_j(\tau)$ is equal to 0, meaning that the $Z_j(\tau)$ are not independent of one another. In particular $Z_K(\tau)$ is a linear combination of $Z_1(\tau), \ldots, Z_{K-1}(\tau)$. Consequently, we construct a test statistic based on just the first $K - 1$ of the $Z_j(\tau)$ terms:

$$\chi^2 = (Z_1(\tau), \ldots, Z_{K-1}(\tau))\,\Sigma^{-1}\,(Z_1(\tau), \ldots, Z_{K-1}(\tau))'$$

where $(Z_1(\tau), \ldots, Z_{K-1}(\tau))$ is interpreted as a row vector of length $K - 1$ and $\Sigma$ is the $(K - 1) \times (K - 1)$ covariance matrix. (If you had made a $K \times K$ matrix using all the variables, it wouldn't be full rank, and therefore would not be invertible.) Under the null hypothesis the statistic has an approximate $\chi^2$ distribution with $K - 1$ degrees of freedom, and you can base the test on this distribution.
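The K-sample construction can be sketched in base R as follows. This is an illustrative implementation of the formulas for $Z_j(\tau)$ and the covariance matrix, not the code behind any particular package. The function name `k_sample_test`, its argument `w` (a function returning the common weight $W(t_i)$ from $Y_i$, defaulting to 1), and the toy data are all assumptions made for the example.

```r
# Weighted K-sample test built from the pooled death times (sketch).
k_sample_test <- function(time, status, group,
                          w = function(Y) rep(1, length(Y))) {
  group <- factor(group)
  K  <- nlevels(group)
  tt <- sort(unique(time[status == 1]))       # pooled death times t_i
  dij <- matrix(0, length(tt), K)             # deaths at t_i in sample j
  Yij <- matrix(0, length(tt), K)             # at risk at t_i in sample j
  for (j in seq_len(K)) {
    in_g <- group == levels(group)[j]
    dij[, j] <- sapply(tt, function(t) sum(time[in_g] == t & status[in_g] == 1))
    Yij[, j] <- sapply(tt, function(t) sum(time[in_g] >= t))
  }
  di <- rowSums(dij); Yi <- rowSums(Yij)
  W  <- w(Yi)
  Z  <- colSums(W * (dij - Yij * di / Yi))    # observed minus expected
  # common factor W^2 * d_i * (Y_i - d_i) / (Y_i - 1); taken as 0 when Y_i = 1
  f  <- ifelse(Yi > 1, W^2 * di * (Yi - di) / (Yi - 1), 0)
  P  <- Yij / Yi
  Sigma <- -t(P) %*% (f * P)                  # off-diagonal covariances
  diag(Sigma) <- colSums(f * P * (1 - P))     # variances
  idx <- seq_len(K - 1)                       # drop the last sample
  chisq <- drop(t(Z[idx]) %*% solve(Sigma[idx, idx, drop = FALSE]) %*% Z[idx])
  list(Z = Z, Sigma = Sigma, chisq = chisq, df = K - 1,
       p = pchisq(chisq, df = K - 1, lower.tail = FALSE))
}

# Hypothetical toy data: two groups of three subjects, one censored
fit <- k_sample_test(time   = c(1, 3, 5, 2, 4, 6),
                     status = c(1, 1, 0, 1, 1, 1),
                     group  = c("A", "A", "A", "B", "B", "B"))
print(fit$Z)
print(fit$chisq)
```

The returned `Z` sums to zero and each row of `Sigma` sums to zero, which is why only the first K - 1 components enter the quadratic form.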


Hypothesis testing: two samples

Several weight functions are possible. $W(t) = 1$ for all $t$ leads to the two-sample log-rank test. $W(t_i) = Y_i$ and $W(t_i) = \sqrt{Y_i}$ have also been used.

In the case of $K = 2$ samples, the test statistic can be written as

$$Z = \frac{\sum_{i=1}^{D} W(t_i)\left[d_{i1} - Y_{i1}\left(\frac{d_i}{Y_i}\right)\right]}{\sqrt{\sum_{i=1}^{D} W(t_i)^2\,\frac{Y_{i1}}{Y_i}\left(1 - \frac{Y_{i1}}{Y_i}\right)\left(\frac{Y_i - d_i}{Y_i - 1}\right) d_i}}$$

Since we don't have to square in this case, we can do one-sided as well as two-sided hypothesis tests based on a standard normal distribution instead of a $\chi^2$, or you can square the statistic and use a $\chi^2_1$ distribution.





This example compared kidney dialysis patients with surgically implanted catheters versus percutaneous (needle-puncture) placement of the catheter. Even though the survival curves look fairly different after 1 year or so, the differences are not statistically significant. Note that there are also very few observations for the percutaneous sample.

Actually the number of observations is fairly small for both samples, so the confidence intervals would be fairly wide.







Different choices for the weight function affect the p-value. It is reassuring if a lot of weighting schemes give the same conclusion. The cases where the p-value was low were those where the weighting scheme gave a lot of weight to differences in the hazard for large values of $t_i$, which of course is where the curves appear different. Such tests can also be sensitive to differences in censoring patterns in the two samples, so they should be used cautiously.

A problem with trying lots of weighting schemes arises when different weights give conflicting results and you report only the schemes that give the results you want. This would be dishonest, so you should either pick a weighting scheme in advance and stick to it, or report the results of all the weighting schemes that you used.




Hypothesis testing: weight functions

The most common weight functions are either flat, $W(t_i) = 1$, or decreasing, with $W(t_i) = Y_i$. A weight function that is increasing might be used to compare longer-term survival when early differences in survival might be due to complications rather than the long-term effectiveness of a treatment.

An example is in comparing autologous transplants versus allogenic transplants of bone marrow for leukemia. Allogenic transplant patients (receiving bone marrow from a sibling) tend to have more complications early on, reducing early survival rates (and increasing early hazard rates), but if interest is in long-term survival, then a weight function could be used that emphasizes later times.


Hypothesis testing in R

To test the difference in survival curves in R, you can use survdiff() from the survival library. An example is with the allo- versus auto- patients in the leukemia data.

> library(survival)
> x <- read.table("leukemia2.txt")
> a <- survdiff(Surv(x$V1,x$V2)~factor(x$V3))
> a
Call:
survdiff(formula = Surv(x$V1, x$V2) ~ factor(x$V3))

                N Observed Expected (O-E)^2/E (O-E)^2/V
factor(x$V3)=1 51       28     25.8     0.182     0.382
factor(x$V3)=2 50       22     24.2     0.195     0.382

 Chisq= 0.4  on 1 degrees of freedom, p= 0.537

The results suggest that the two groups had survival experiences that were not statistically significantly different from each other.



To plot the two survival curves together you can use

> library(survival)
> x <- read.table("leukemia2.txt")
> # separate Kaplan-Meier fits for groups 1 and 2
> a <- survfit(Surv(x$V1[x$V3==1],x$V2[x$V3==1])~1)
> b <- survfit(Surv(x$V1[x$V3==2],x$V2[x$V3==2])~1)
> plot(a,conf.int=FALSE)
> # overlay the second curve as a step function
> points(b$time,b$surv,type="s",col="red",lwd=3)
> legend(20,1,legend=c("auto","allo"),col=c("black","red"),
+        lty=c(1,1),lwd=c(1,3),cex=1.3)


Hypothesis testing in R



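The survdiff() function in R has an optional parameter rho whose default is 0, which results in the log-rank test. Each death is weighted by $S(t)^{\rho}$, where $S(t)$ is the pooled Kaplan-Meier estimate, so larger values of rho put more weight on earlier times (rho = 1 gives the Peto and Peto modification of the Gehan-Wilcoxon test). The choice of rho can have a big impact on the p-value.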
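As a hedged illustration of the rho argument, the sketch below runs survdiff() with rho = 0 and rho = 1 on `aml`, a small data set that ships with the survival package. It is not the leukemia2.txt file used in these notes. The two values of rho weight the death times differently, so the statistics and p-values generally differ.

```r
library(survival)

# 'aml' has columns time, status (1 = event) and treatment arm x
fit_rho0 <- survdiff(Surv(time, status) ~ x, data = aml, rho = 0)  # log-rank
fit_rho1 <- survdiff(Surv(time, status) ~ x, data = aml, rho = 1)

# Chi-square statistics and p-values (1 degree of freedom for two groups)
chisq <- c(rho0 = fit_rho0$chisq, rho1 = fit_rho1$chisq)
pval  <- pchisq(chisq, df = 1, lower.tail = FALSE)
print(rbind(chisq, pval))
```

If both choices are tried, both results should be reported, in line with the earlier warning about selectively reporting weighting schemes.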


Hypothesis testing in SAS

You can use PROC LIFETEST in SAS to do hypothesis testing. We'll take a look at examples after the break.


Tests of trend

For multiple samples ($K > 2$), a different alternative hypothesis is the following:

$$H_A: h_1(t) \le h_2(t) \le \cdots \le h_K(t), \quad \text{for } t \le \tau,$$

where at least one inequality is strict. This ordering of the hazards implies the corresponding ordering of the survival functions:

$$S_1(t) \ge \cdots \ge S_K(t)$$


Tests of trend

We construct the $Z_j(\tau)$ as before, using any of the weight functions $W_j(t_i)$. We also pick a new set of weights $a_j$, $j = 1, \ldots, K$, where $a_j = j$ is often used.

The test statistic is now

$$Z = \frac{\sum_{j=1}^{K} a_j Z_j(\tau)}{\sqrt{\sum_{j=1}^{K}\sum_{k=1}^{K} a_j a_k\,\sigma_{jk}}}$$

where $\Sigma = (\sigma_{jk})$ is the $K \times K$ covariance matrix. (It isn't full rank, but we don't need the inverse.) The test statistic can be compared to a standard normal.
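The trend statistic is a single ratio, so a short sketch is enough to show the arithmetic. The function name `trend_test` and the numbers for `Z` and `Sigma` below are hypothetical. In practice they would come from the K-sample calculation, with all K components kept. With increasing scores $a_j$, large positive values of the statistic favor hazards that increase across groups, so the sketch reports an upper-tail p-value.

```r
# Trend test from the K-sample Z_j and their K x K covariance matrix.
trend_test <- function(Z, Sigma, a = seq_along(Z)) {
  z <- sum(a * Z) / sqrt(drop(t(a) %*% Sigma %*% a))
  c(z = z, p_upper = pnorm(z, lower.tail = FALSE))
}

# Hypothetical values for K = 3 groups; rows of Sigma sum to 0, as they must
Z <- c(-3, 1, 2)
Sigma <- matrix(c( 4, -2, -2,
                  -2,  4, -2,
                  -2, -2,  4), nrow = 3, byrow = TRUE)
res <- trend_test(Z, Sigma)
print(res)
```

Here the numerator is $-3 + 2 + 6 = 5$ and the quadratic form is 12, so the statistic is $5/\sqrt{12}$.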




Stratified tests

If different populations have different covariates (age, sex, etc.), then ideally, you could use a regression approach to survival analysis to adjust for covariates before comparing survival curves or hazard rates. This is done in Chapter 8.

If there are a small number of levels for a predictor, then you can use a stratified test instead.

Let

$$H_0: h_{1s}(t) = h_{2s}(t) = \cdots = h_{Ks}(t), \quad s = 1, \ldots, M, \quad t \le \tau$$

The idea is that within each level of the covariate (indexed by $s$), the hazard rate should be the same across the $K$ samples. Typically, $M$ is small.


Stratified tests

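For the stratified test, let

$$Z_{j\cdot}(\tau) = \sum_{s=1}^{M} Z_{js}(\tau)$$

$$\sigma_{jk\cdot} = \sum_{s=1}^{M} \sigma_{jks}$$

Then the test statistic is as before with multiple samples, with $\Sigma = (\sigma_{jk\cdot})$:

$$(Z_{1\cdot}(\tau), \ldots, Z_{K-1,\cdot}(\tau))\,\Sigma^{-1}\,(Z_{1\cdot}(\tau), \ldots, Z_{K-1,\cdot}(\tau))'$$

which is approximately $\chi^2$ with $K - 1$ degrees of freedom. Here we have $K$ samples and $M$ strata within each sample.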


Renyi type tests

For a two-sample problem, if the hazard functions cross, then the previous tests might not detect much overall difference in the hazard rates. The overall survival experience might be similar, but it could differ in one direction in the short term and in the other direction in the long term. If one group is more at risk in the short term, and the other in the long term, these changes of direction could cancel out, leading one to not reject the hypothesis that the hazards are the same.

Renyi-type tests are based on the maximum absolute value of the differences between cumulative hazard rates rather than the summed differences.

The idea is similar to the Kolmogorov-Smirnov test for comparing two distributions, which uses the largest absolute value of the difference between the two empirical CDFs, but Renyi tests allow for censoring.


Renyi type tests

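To construct this test, let

$$Z(t_i) = \sum_{t_k \le t_i} W(t_k)\left[d_{k1} - Y_{k1}\left(\frac{d_k}{Y_k}\right)\right], \quad i = 1, \ldots, D$$

where as usual $d_k = d_{k1} + d_{k2}$ and $Y_k = Y_{k1} + Y_{k2}$ (i.e., $d_k$ and $Y_k$ are the pooled number of deaths and number at risk at time $t_k$ over both samples). The variance of $Z(\tau)$ is

$$\sigma^2(\tau) = \sum_{t_k \le \tau} W(t_k)^2\left(\frac{Y_{k1}}{Y_k}\right)\left(\frac{Y_{k2}}{Y_k}\right)\left(\frac{Y_k - d_k}{Y_k - 1}\right) d_k$$

where $\tau$ is the largest death time $t_k$ with $Y_{k1}, Y_{k2} > 0$.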


Renyi type tests

The test statistic is

$$Q = \frac{\sup\{|Z(t)|,\ t \le \tau\}}{\sigma(\tau)}$$

You can think of the supremum here as just the maximum of the absolute values of the $Z(t_j)$ values. Critical values are given in the Appendix, Table C.5, and are based on the theory of Brownian motion.
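The running sum $Z(t_i)$ and the statistic $Q$ can be sketched in base R as below. This is an illustration of the formulas rather than a packaged implementation. The function name `renyi_q`, the weight argument `w` (default 1, the log-rank weight), and the toy data are assumptions. The sketch returns only $Q$; it does not reproduce the critical values from the table.

```r
# Two-sample Renyi-type statistic Q = max |Z(t_i)| / sigma(tau) (sketch).
renyi_q <- function(time, status, group, w = function(Y) rep(1, length(Y))) {
  g1 <- group == unique(group)[1]              # indicator for sample 1
  tt <- sort(unique(time[status == 1]))        # pooled death times
  d1 <- sapply(tt, function(t) sum(time == t & status == 1 & g1))
  d  <- sapply(tt, function(t) sum(time == t & status == 1))
  Y1 <- sapply(tt, function(t) sum(time >= t & g1))
  Y  <- sapply(tt, function(t) sum(time >= t))
  keep <- Y1 > 0 & (Y - Y1) > 0                # death times up to tau
  d1 <- d1[keep]; d <- d[keep]; Y1 <- Y1[keep]; Y <- Y[keep]
  W  <- w(Y)
  Zt <- cumsum(W * (d1 - Y1 * d / Y))          # Z(t_i) as a running sum
  sigma2 <- sum(W^2 * (Y1 / Y) * ((Y - Y1) / Y) * ((Y - d) / (Y - 1)) * d)
  list(Z = Zt, sigma = sqrt(sigma2), Q = max(abs(Zt)) / sqrt(sigma2))
}

# Hypothetical toy data: two groups of three subjects, one censored
r <- renyi_q(time   = c(1, 3, 5, 2, 4, 6),
             status = c(1, 1, 0, 1, 1, 1),
             group  = c("A", "A", "A", "B", "B", "B"))
print(r$Z)
print(r$Q)
```

In the toy data the running sum peaks in absolute value at the third death time even though its final value is smaller, which is exactly the situation the Renyi-type statistic is designed to pick up.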




Renyi type tests: finding the maximum |Z (tj)|



Testing based on a fixed point in time

Instead of testing survival and hazard rates over all time points, you might be interested in the 1-yr survival rate. Note that the time being tested should be chosen before doing the test. If you look at two survival curves and say, "Wow, they look really different at year 3, is that significant?" then the p-value will be biased too low.

It is similar to testing at many time points but then not adjusting for multiple comparisons. In practice, this is what happens all the time though. People look at a graph of the data, which is maybe meant to be descriptive, something jumps out at them as being unusual, and they say, "Wow, is that significant?" It's extremely difficult to answer this type of question. A better approach in this type of case might be the Renyi type of test, because it accounts for the fact that you are looking at maximum differences over the entire time frame.



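Here we want to test

$$H_0: S_1(t_0) = S_2(t_0)$$

against

$$H_A: S_1(t_0) \ne S_2(t_0)$$

for two survival curves. (The method can be generalized to more survival curves.) The test statistic is

$$Z = \frac{\hat{S}_1(t_0) - \hat{S}_2(t_0)}{\sqrt{\hat{V}[\hat{S}_1(t_0)] + \hat{V}[\hat{S}_2(t_0)]}}$$

where $\hat{S}_j(t_0)$ is the estimated survival function for sample $j$ at $t_0$ (e.g., the Kaplan-Meier estimate) and $\hat{V}$ is its estimated variance. The statistic has an approximate standard normal distribution for large samples.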
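The fixed-time comparison reduces to a two-sample z-test on the survival estimates. The sketch below assumes you already have the two estimates at $t_0$ and their variance estimates. The function name `fixed_point_test` and the numbers are hypothetical.

```r
# z-test comparing two survival estimates at a pre-specified time t0.
# s1, s2: estimated survival probabilities at t0
# v1, v2: their estimated variances (squared standard errors)
fixed_point_test <- function(s1, v1, s2, v2) {
  z <- (s1 - s2) / sqrt(v1 + v2)
  c(z = z, p_two_sided = 2 * pnorm(-abs(z)))
}

# Hypothetical 1-year estimates: 0.80 (SE 0.04) versus 0.65 (SE about 0.045)
res <- fixed_point_test(s1 = 0.80, v1 = 0.0016, s2 = 0.65, v2 = 0.0020)
print(res)
```

With the survival package, one way to obtain the inputs is `summary(survfit(...), times = t0)`, which reports `surv` and `std.err` at the requested time; squaring `std.err` gives the variance estimate.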



If you want to test multiple fixed time points, such as the 1-yr and 5-yr survival rates, then you should adjust for multiple comparisons. For testing two time points, a Bonferroni adjustment could be made, meaning that you reject each hypothesis only if the p-value is less than $\alpha/2$. The more time points you check, the less power you will have to find significant differences.


Bonferroni adjustments

Probably the most popular and simplest adjustment for multiple testing is the Bonferroni adjustment. The idea is that to run $k$ tests at overall level $\alpha$ (meaning that if the null hypotheses are true for all $k$ tests, there is at most an $\alpha$ chance, e.g. 5%, of making an error on any of them), you use a level of $\alpha/k$ for each test.

What is the rationale for doing this?



There are several ways to justify Bonferroni adjustments. One is to look at the expected number of false positives under the null. Let $X_i = 1$ if you make an incorrect decision on test $i$ (a false positive), and otherwise $X_i = 0$. What type of variable is $X_i$? What is the probability that $X_i = 1$ if the null hypothesis (for experiment $i$) is true? What is the expected value of $X_i$?

$X_i$ as defined previously is Bernoulli with $p = \alpha$ if testing using level $\alpha$. The expected value of a Bernoulli($p$) random variable is $p$ (why?), so the expected value of $X_i$ is $\alpha$.

If you do $k$ experiments, the expected number of false positives is

$$E\left[\sum_{i=1}^{k} X_i\right] = k\alpha$$

However, if you test at the $\alpha/k$ level, then the expected number of false positives is $k \cdot \alpha/k = \alpha$. Thus, the Bonferroni adjustment controls the expected number of false positives.



Another approach is to use something called Bonferroni's inequality. Let $A_i$ be the event that you don't reject the null hypothesis on test $i$. Suppose we set $P(A_i) = 1 - \alpha/k$ when the null is true. From the Inclusion-Exclusion formula

$$P(A_1 A_2) = P(A_1) + P(A_2) - P(A_1 \cup A_2) \ge P(A_1) + P(A_2) - 1$$

If we apply the formula again, setting $B = A_1 A_2$, we get

$$P(A_1 A_2 A_3) = P(B A_3) \ge P(B) + P(A_3) - 1 \ge [P(A_1) + P(A_2) - 1] + P(A_3) - 1 = P(A_1) + P(A_2) + P(A_3) - 2$$

In general for $k$ events

$$P(A_1 \cdots A_k) \ge \sum_{i=1}^{k} P(A_i) - (k - 1)$$



If $P(A_i) = 1 - \alpha/k$, then we get

$$P(A_1 \cdots A_k) \ge k\left(1 - \frac{\alpha}{k}\right) - k + 1 = 1 - \alpha$$

Thus, the probability of all decisions being correct is at least $1 - \alpha$, and the probability of making any wrong decision is at most $\alpha$.
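In R the Bonferroni rule can be applied either by comparing each p-value to $\alpha/k$ or, equivalently, by multiplying the p-values by $k$ with `p.adjust()` from the stats package and comparing to $\alpha$. The p-values below are hypothetical and serve only to show that the two routes agree.

```r
alpha <- 0.05
p <- c(0.010, 0.030, 0.040)   # hypothetical p-values from k = 3 tests
k <- length(p)

reject_raw  <- p < alpha      # unadjusted decisions
reject_bonf <- p < alpha / k  # Bonferroni: compare each p-value to alpha / k

# Equivalent view: inflate the p-values by k (capped at 1), compare to alpha
p_adj <- p.adjust(p, method = "bonferroni")

print(data.frame(p, p_adj, reject_raw, reject_bonf))
```

All three tests would be rejected without adjustment, but only the first survives the Bonferroni threshold of $0.05/3$.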



Bonferroni's inequality can be useful in other probabilistic arguments as well.

