Upload
hoanghanh
View
217
Download
0
Embed Size (px)
Citation preview
APPROXIMATE LIKELIHOOD INFERENCE FOR HAPLOTYPE
RISKS IN CASE-CONTROL STUDIES OF A RARE DISEASE
by
Zhijian Chen
B.Sc. in Statistics, Peking University, 2003.
a project submitted in partial fulfillment
of the requirements for the degree of
Master of Science
in the Department
of
Statistics and Actuarial Science
c© Zhijian Chen 2006
SIMON FRASER UNIVERSITY
Fall 2006
All rights reserved. This work may not be
reproduced in whole or in part, by photocopy
or other means, without the permission of the author.
APPROVAL
Name: Zhijian Chen
Degree: Master of Science
Title of project: Approximate Likelihood Inference for Haplotype Risks in Case-
Control Studies of a Rare Disease
Examining Committee: Dr. Gary Parker
Chair
Dr. Brad McNeneySenior SupervisorSimon Fraser University
Dr. Jinko GrahamSenior SupervisorSimon Fraser University
Dr. Xiaoqiong Joan HuExternal ExaminerSimon Fraser University
Date Approved:
ii
Abstract
The standard study design to study risk factors for rare diseases is the case-control design.
Genetic association case-control studies often include haplotypes as risk factors. Haplotypes
are not always observed, though observable single-locus genotypes contain partial haplotype
information. Missing haplotypes lead to analysis of data with missing covariates. Maximum
likelihood (ML) inference is then based on solving a set of weighted score equations. How-
ever, the weights cannot be calculated exactly. We describe three methods that approximate
ML by approximating the weights: i) naive application of prospective ML (PML), which
ignores the case-control sampling design, ii) an estimating equations (EE) approach and iii)
a hybrid approach which is based on PML, but with improved weights suggested by EE. We
investigate the statistical properties of the three methods by simulation. In our simulations
the hybrid approach gave more accurate estimates of statistical interactions than PML and
more accurate standard errors than EE.
iii
Acknowledgements
I am greatly indebted to my co-supervisors Dr. Jinko Graham and Dr. Brad McNeney
for their support and guidance throughout my 2 years at SFU, and for their influence and
inspiration to me. I feel very fortunate to have been part of the statistical genetics group.
I also want to thank them, as well as Dr. Richard Lockhart, Dr. Joan Hu and Dr. Boxin
Tang, for their encouragement in my studies. I would like to express my gratitude to the
faculty and staff of the Department of Statistics and Actuarial Science at Simon Fraser
University for their devotion to the graduate programs. Many thanks to my friends and
fellow students: Lihui, Li, Celes, Linda, Cindy, Lucy, Yunfeng, Tony, Pritam, Matthew,
David, Kelly, Linnea, Dean, Gurbakhshash, Ryan, Wendell, Saman, Crystal, Eric, John,
Mark, Darcy, Simon, Darby, Jason and many more. Especial thanks to Chunfang and
Wilson for helping me settle down, and to Ji-Hyung for her patience with my questions
in the completion of my thesis. I also wish to acknowledge Dr. Keith Walley and the
iCAPTURE Center at St. Paul’s Hospital for offering me the opportunity of joining them
for a summer term. Finally, I would like to express my deep gratitude to my parents and
my sister for their love, understanding and support all the way.
iv
Contents
Approval ii
Abstract iii
Acknowledgements iv
Contents v
List of Figures vii
List of Tables viii
1 Introduction 1
1.1 Genetic Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Genetic Association Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Overview of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Methods 6
2.1 PML for Cohort Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 RML for Case-Control Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.1 A Variant Sampling Scheme . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.2 MLEs via EM algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.3 MLEs by direct solution of score equations . . . . . . . . . . . . . . 15
2.3 Approximate Score Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3.1 PML as an Approximate Score Method . . . . . . . . . . . . . . . . 16
2.3.2 MPSE as an Approximate Score Method . . . . . . . . . . . . . . . . 16
v
2.3.3 EE as an Approximate Score Method . . . . . . . . . . . . . . . . . 20
2.3.4 A PML/MPSE Hybrid Approach . . . . . . . . . . . . . . . . . . . . 21
3 Simulation Study 23
3.1 Design of Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 Statistical Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3.1 Overview of Simulation Conclusions . . . . . . . . . . . . . . . . . . 28
3.3.2 Results for Simulation Scenarios i) and ii) . . . . . . . . . . . . . . . 29
3.3.3 Results for Simulation Scenarios iii) and iv) . . . . . . . . . . . . . . 30
4 Conclusions and Future Work 41
Appendices 42
A Variable Probability Sampling 43
A.1 Overview of VPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
A.2 Equivalence of Probabilities Under VPS and VSS . . . . . . . . . . . . . . . 44
B Derivation of prv(H | X ; γv) 46
C Simulation Results 48
Bibliography 55
vi
List of Figures
3.1 Results for βHX from simulation scenario i) . . . . . . . . . . . . . . . . . . 32
3.2 Results for βHX from simulation scenario ii) . . . . . . . . . . . . . . . . . . 33
3.3 Boxplots of bias in estimation of βHX (upper plot) and bias in estimation of
associated standard error (lower plot) from simulation scenario i) . . . . . . 34
3.4 Boxplots of bias in estimation of βHX (upper plot) and bias in estimation of
associated standard error (lower plot) from simulation scenario ii) . . . . . . 35
3.5 Results for βHX from simulation scenario iii) . . . . . . . . . . . . . . . . . 36
3.6 Results for βHX from simulation scenario iv) . . . . . . . . . . . . . . . . . 37
3.7 Boxplots of bias in estimation of βHX (upper plot) and bias in estimation of
associated standard error (lower plot) from simulation scenario iii) . . . . . 38
3.8 Boxplots of bias in estimation of βHX (upper plot) and bias in estimation of
associated standard error (lower plot) from simulation scenario iv) . . . . . 39
3.9 Estimated bias of standard error for GLM, PML and HYBRID after exclud-
ing EE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
vii
List of Tables
3.1 Haplotype frequencies used in the simulations . . . . . . . . . . . . . . . . . 26
3.2 The four simulation scenarios . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3 Adjusted intercept β0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
C.1 Simulation results for scenario i) . . . . . . . . . . . . . . . . . . . . . . . . 49
C.2 Simulation results for scenario ii) . . . . . . . . . . . . . . . . . . . . . . . . 50
C.3 Simulation results for scenario iii) . . . . . . . . . . . . . . . . . . . . . . . . 51
C.4 Simulation results for scenario iv) . . . . . . . . . . . . . . . . . . . . . . . . 52
viii
Chapter 1
Introduction
1.1 Genetic Background
Genetic material is stored on chromosomes in the nucleus of every human cell. Each
chromosome contains a single molecule of duplex DNA along its length, with protein tightly
and complexly coiled. The units of each DNA strand are nucleotides, each of which
contains one of the four chemical bases: Adenine (A), Guanine (G), Thymine (T) and
Cytosine (C). In the duplex DNA, A and T are paired and C and G are paired. A gene
is a sequence of nucleotides along a DNA molecule that influences one or more hereditary
traits (or phenotypes). The physical position of a gene in a chromosome is called its locus
(plural loci). Homologous chromosomes are a pair of chromosomes inherited separately
from parents that contain the same genetic loci in the same order. Every normal person
has 22 pairs of non-sex chromosomes (autosomes) and a pair of sex chromosomes, with
XX for females and XY for males.
Genetic variations exist in most natural populations of organisms, and are the source
of diversity in the population. We call such genetic differences between individuals at iden-
tifiable loci DNA markers. The variant forms at a locus are called alleles. When the
alleles an individual carries at a locus are different from each other, the genotype of this
individual at this locus is said to be heterozygous, otherwise, it is said to be homozy-
gous. In a large random-mating population with no selection, mutation, or migration, the
allele frequencies and the genotype frequencies are constant from generation to generation;
1
CHAPTER 1. INTRODUCTION 2
furthermore, there is a simple relationship between allele frequencies and genotype frequen-
cies. This principle, known as Hardy-Weinberg Equilibrium (HWE), is as follows: if
the allele frequencies are p for the common allele (often denoted by 0) and q for the rare
allele (often denoted by 1) among the parent population, then the genotype frequencies for
0/0, 0/1 (or 1/0) and 1/1 among the next generation are p2, 2pq and q2, respectively, where
“/” separates alleles on the maternally inherited chromosome from those on the paternally
inherited chromosome.
The most common type of DNA marker is the single-nucleotide polymorphism
(SNP), which involves two variants at a single base pair, each of which is observed in the
general population at a frequency greater than 1%. SNPs are abundant and are distributed
approximately uniformly throughout the genome. The terms “allele” and “locus” can also
be referred to the variant forms and physical position of a SNP. Since the two DNA strands
are complementary, only the nucleotide in a single strand is read during SNP genotyping,
and single letters are used to record the alleles. At a diallelic SNP site, for example,
the genotype of an individual who has T-A base pair in one chromosome and C-G base
pair in the homologous chromosome will be recorded as heterozygote T/C (or C/T); while
the genotype of an individual who has T-A base pair at the corresponding site in both
homologous chromosomes will be homozygote T/T. The genetic constitution of an individual
chromosome is called a haplotype. A haplotype can also refer to the combination of alleles
over a sequence of loci on the same chromosome and can be treated as a “super allele”.
Meiosis is a special type of cell division that produces gametes (sperm and egg cells).
During meiosis, homologous chromosomes may exchange genetic material in a process called
crossing-over. When crossing-over events lead to gametes with alleles of different parental
origin at two loci, the gametes are said to be recombinant at these loci. The ratio of the
expected number of recombination events between two loci to the total number of gametes is
the recombination rate between these two loci. When two loci are so close on a chromosome
that recombination is almost impossible, the recombination rate between these two loci is
0. When two loci are far from each other on a chromosome, the expectation is that half
of the gametes will be recombinant, and the recombination rate between these two loci is
0.5. The tendency of nearby loci to co-segregate to the next generation leads to correlation
CHAPTER 1. INTRODUCTION 3
between alleles at nearby loci, a phenomenon called linkage disequilibrium (LD).
1.2 Genetic Association Studies
Many complex traits, such as height, weight, and susceptibility to disease, are strongly in-
fluenced by both environmental factors and genetic factors. Genetic association studies are
procedures to detect correlations between genetic variants and disease phenotypes on a pop-
ulation scale, relying on LD between genotyped markers and unknown disease loci. Due to
their high density throughout the genome and the development of genotyping technologies,
SNPs have become popular tools in mapping complex disease genes. Single SNP-based
methods are powerful approaches in detecting disease associations with genetic variants,
provided that LD between the genotyped markers and an unknown disease locus is strong.
When LD decreases, however, the power of single marker association tests may suffer be-
cause the information contained in flanking markers is ignored. There is evidence that in
some cases the combination of closely linked SNP markers on the same chromosome will be
in stronger LD with a disease pre-disposing locus, in which case haplotype based methods
may be better able to capture information on disease associations than single SNP-based
tests (Akey et al. 2001). Therefore, there has been great interest in recent years in using
haplotypes as risk factors to identify the genetic basis of complex diseases.
The determination of the pair of haplotypes (or haplogenotype) in a subject is called
haplotype phasing. Although it is possible to infer haplogenotypes through molecular meth-
ods or through genotyping additional family members, such methods are too costly and
laborious to be practical in large-scale population studies. The polymerase chain reaction
method, the current standard genotyping technology, only allows experimenters to deter-
mine the two alleles at a single locus for a subject, without specifying which chromosome
each allele is from. Therefore, for a subject who is heterozygous at more than one locus,
there is more than one possible haplotype pair consistent with the observed single-locus
genotypes, and haplotype phase for this subject is said to be ambiguous. For example, if
the observed single-locus genotypes of a subject at three SNP loci are 0/0, 0/1 and 0/1,
then there are two possible haplogenotypes this subject may carry: 000/011 or 010/001.
CHAPTER 1. INTRODUCTION 4
Recently, statistical methods have been developed to infer haplogenotypes of subjects and
estimate haplotype frequencies from single-locus genotype data on a population sample (e.g.
Excoffier and Slatkin 1995, Stephens et al. 2001). These algorithms reconstruct haplotypes
by estimating posterior probabilities for the haplogenotype of each subject, conditional on
observed single-locus genotypes. Though not strictly correct, investigators often evaluate
differences in haplotype frequencies between diseased and disease-free subjects by estimating
haplotype frequencies and performing standard chi-squared tests of association. This prac-
tice incorrectly assumes that diseased and disease-free subjects are randomly sampled from
separate randomly-mating populations and ignores the variation in the estimated haplotype
frequencies due to missing phase. Epstein and Satten (2003) developed maximum-likelihood
methods that overcome these difficulties. However, they do not consider the effects of other
risk factors (besides haplotypes) in their models of disease risk. For complex diseases with
more than one risk factor and possible interactions between risk factors, statistically sound
methods are required for more detailed models of risk.
There are three commonly used observational study designs in epidemiology: cohort
studies, cross-sectional studies and case-control studies. A cohort study is a study in
which each subject presently has certain exposures (covariates) or receives a particular treat-
ment, and is followed up over time for the outcome of interest. In cross-sectional studies,
covariates and the outcome are measured at the same point in time. Cohort and cross-
sectional studies are often referred to as prospective designs. An important requirement for
prospective studies is that the disease outcome of interest should be common; otherwise,
the number of outcomes observed will be too small for reliable statistical inference. For a
rare disease, a widely accepted design is the case-control study, which looks backwards in
time to measure exposures on subjects of known disease status. This retrospective study
design is often inexpensive and convenient.
1.3 Overview of the Thesis
In this thesis we consider the problem of inference of haplotype risks in case-control studies
of a rare disease when haplotype phase is not always observed. Missing haplotype phase
CHAPTER 1. INTRODUCTION 5
leads to logistic regression analysis of haplotype risks with missing covariate data. Maximum
likelihood (ML) inference is then based on the solution to a set of weighted complete-data
score equations. However, as discussed later in the thesis, the weights can not be calculated
exactly. We therefore describe methods that approximate ML by approximating the weights.
Ghadessi (2005) discussed the application of prospective maximum likelihood (PML; e.g.
Burkett 2002) to case-control data, and how PML approximates the correct retrospective
maximum likelihood (RML) approach. She also compared PML to an estimating equation
(EE) approach (Zhao et al. 2003) developed specifically for case-control data. However, her
simulation study involved disease risk models with haplotype effects only, and no nongenetic
effects or interactions. We extend her work in two ways. First, we discuss new methodology
(Spinka et al. 2005) published in the interim. Second, we conduct a simulation study in
which the risk model includes nongenetic covariates and haplotype-nongenetic interactions.
Our simulations extend those of Spinka et al. (2005) to consider less extreme interaction
effects that we consider to be more plausible and rare disease probabilities.
An outline of the thesis is as follows. In Chapter 2 we discuss RML and approximate
RML methods. We include a review of the material from Ghadessi that describes PML for
cohort or cross-sectional data with missing haplotype phase, RML for case-control studies
with missing haplotype phase, and the justification of PML as an approximation to RML
that uses the correct complete-data scores but approximate weights. We then describe the
modified prospective score equation (MPSE) approach of Spinka et al. and show that it is
also an approximate RML method, with a different approximation to the weights than that
of PML. We next connect MPSE with the earlier EE approach of Zhao et al. Finally, we
show how a simple modification of PML yields the MPSE estimator of regression parameters.
However, since this modification to PML does not change the way the variance is estimated,
our variance estimator is different from that suggested by Spinka et al. We therefore call our
modified PML/MPSE approach the hybrid approach. In Chapter 3 we present simulations
to compare statistical properties of PML, EE and the hybrid approach in case-control studies
of a rare disease. We could not obtain reliable software implementing MPSE and hence this
approach could not be included in our simulation study. Chapter 4 summarizes the main
conclusions and discusses directions for future research.
Chapter 2
Methods
A variety of statistical methods have been proposed for haplotype-disease association studies
using unphased genotype data on unrelated subjects. These methods are often developed
within the generalized linear model (GLM) framework, in which the distribution of the
dependent variable Y , given covariate vector Z, is in the exponential family:
f(Y | Z ; η, φ) = exp{
Y η − b(η)φ
+ c(Y, φ)}
,
where b(·) and c(·) are known functions, and η and φ are the canonical parameter and
the dispersion parameter, respectively. A link function g(µ) = η, which is monotonic and
differentiable, relates µ ≡ E[Y | Z] to the linear predictor η = β0+Zβ1, where β1 is a vector
of coefficients associated with the effects of Z. For a binary disease response, the logit link
g(µ) = log µ1−µ is the canonical link, and the dispersion parameter φ is 1. Let D denote the
status of a disease with
D =
1 presence of disease
0 absence of disease.
Let H and X respectively denote the haplogenotype factor and non-genetic (environmental)
factor. The risk associated with haplogenotype H is modeled in terms of the joint effect of
the pair of haplotypes constituting H. Thus in genetic association studies of haplotype and
6
CHAPTER 2. METHODS 7
environmental risk factors, the standard disease risk model is the logistic regression model:
logpr(D = 1 | H, X)pr(D = 0 | H, X)
= β0 + z(H, X)β1, (2.1)
where z(H, X) is a row covariate vector that codes the effects of X and H and their possible
interaction, and where β1 is a vector of log odds-ratios. For example, if only a single
risk haplotype hr and a continuous non-genetic covariate X are considered, and hr has a
multiplicative effect on the disease risk, then
z(H, X)β1 = βXX + βHNhr(H) + βHXNhr(H)X, (2.2)
where Nhr(H) denotes the number of copies of the risk haplotype hr in H, and where
β1 = (βX , βH , βHX)T is the vector of regression parameters associated with the main effect
of X, the main effect of hr and the interaction effect of X with hr, respectively.
When single-locus genotypes, instead of haplogenotypes, are observed on subjects, miss-
ing haplotype phase leads to logistic regression analysis with incomplete data. Burkett
(2002) developed a PML method for inference of trait associations with SNP-haplotypes
and non-genetic covariates, using unphased genotype data of unrelated subjects, for cohort
or cross-sectional studies. Independently, Lake et al. (2003) proposed a similar prospective
method that built on the score test approach of Schaid et al. (2002), with calculation of
standard errors slightly different from that of Burkett (2002). The prospective methods of
Burkett (2002) and Lake et al. (2003) are respectively implemented in the hapassoc package
and the haplo.stats package for the R programming environment. In this chapter, we review
material from Ghadessi (2005) which describes PML and RML, and justifies PML as an
approximation to RML that uses the correct complete-data scores but approximate weights.
We then describe the MPSE approach of Spinka et al. (2005) as another approximate RML
method, which uses a different approximation to the RML weights than PML. Next, we
connect MPSE with the earlier EE approach of Zhao et al. (2003). Finally, we describe our
modified PML/MPSE hybrid approach.
CHAPTER 2. METHODS 8
2.1 PML for Cohort Studies
Burkett (2002), Lake et al. (2003) and Burkett et al. (2004) maximize the prospective
likelihood for cohort or cross-sectional data by an EM algorithm. In this section we give a
summary of the notation and details of the EM algorithm in this context.
Index subjects in a cohort sample of size n by j = 1, . . . , n. Note that indexing co-
hort subjects by j differs from Ghadessi (2005), who used i, but is more consistent with
the indexing of case-control subjects by j in Section 2.2. Throughout, in a slight abuse of
notation, we use capital letters to denote random variables or observed values, as appropri-
ate. Let Dj , Gj , Hj and Xj respectively denote the disease status, single-locus genotypes,
haplogenotype and non-genetic covariates for the jth subject. For population-based studies
with independent subjects, the log-likelihood is the summation of individual log-likelihood
contributions from each subject. Let β = (β0, β1) denote the logistic regression parameters,
and let γ parameterize the joint distribution of H and X. Then θ = (β, γ) is a vector of
parameters that completely describes the complete-data log-likelihood
lc(θ;D,H,X ) =n∑
j=1
lcj(θ;Dj ,Hj , Xj) =n∑
j=1
log pr(Dj ,Hj , Xj ; θ),
where D is all data on the disease status of all subjects, H is the haplogenotype data
on all subjects and X is the non-genetic covariate data on all subjects. When haplotype
phase is not directly observable, the likelihood is based on the observed data (D,G,X ),
where G is the single-locus genotype data on all subjects. The missing haplotype phase
can be accounted for by the use of the EM algorithm, in which the ML estimates θ are
obtained by iteratively maximizing the conditional expectation of the complete-data log-
likelihood, given the observed data (D,G,X ) and parameter estimates θ(t) from the previous
step. Ibrahim (1990) showed that, in the GLM framework, estimation of the parameters
with missing information on categorical covariates can be reduced to iterative weighted
regression, in which the posterior weights of all consistent complete covariate values given
the observed data and θ(t) are calculated in the E-step of the EM algorithm. This method
is sometimes referred to as the EM algorithm by the method of weights. In the context
of haplotype risk inference with missing haplotype phase, the method of weights involves
CHAPTER 2. METHODS 9
extending each subject in the original sample into “pseudo-individuals” that have the same
disease status and non-genetic covariates, but different haplogenotypes, all of which are
consistent with the subject’s observed single-locus genotypes. In such an extended sample,
complete covariate information is observed on all pseudo-individuals. By properly weighting
all pseudo-individuals, the estimates of θ can be obtained iteratively, as described in detail
in Burkett (2002).
A complicating feature of logistic regression inference of haplotype risks with missing
haplotype phase is that the distribution of haplogenotypes can not be identified from data on
single-locus genotypes (e.g. Epstein and Satten 2003). The problem arises from the fact that
certain haplogenotypes are never observed on their own unambiguously. PML approaches
developed to date (Burkett 2002, Burkett et al. 2004, Lake et al. 2003), solve this iden-
tifiability problem by imposing Hardy-Weinberg proportions (HWP) for haplogenotypes.
Under independence of H and X it can be shown (e.g. Burkett 2002) that the marginal
distribution of X need not be estimated. HWP of haplogenotype frequencies means that
the probability a subject carries haplogenotype H = (hk, hk′) is given by
pr(H = (hk, hk′)) =
γ2hk
if hk = hk′
2γhkγhk′ if hk 6= hk′ ,
(2.3)
where γhkis the frequency of haplotype hk. Haplotype frequencies can be estimated from
data on single-locus genotypes, based on information provided by haplotypes within hap-
logenotypes that are observed unambiguously.
Under the assumptions of H and X independence and HWP, the parameter θ is redefined
to be θ = (β, γh) where γh is the vector of haplotype frequencies. For the jth subject,
j = 1, . . . , n, let HGj ={
Hkj ; k = 1, . . . , Kj
}denote the set of haplogenotypes consistent
with the single-locus genotypes Gj . Briefly, the E-step of the algorithm involves calculating
the conditional expectation of the complete-data log-likelihood,
Q(θ | θ(t)) =n∑
j=1
Eθ(t) [lcj(θ;Dj ,Hj , Xj) | Dj , Gj , Xj ]
=n∑
j=1
Kj∑
k=1
wjk(θ(t)) log pr(θ;Dj ,Hkj , Xj)
CHAPTER 2. METHODS 10
where the wjk(θ(t)) are the weights for each of the Kj pseudo-individuals:
wjk(θ(t)) = pr(Hkj | Dj , Gj , Xj ; θ(t))
=pr(Dj | Hk
j , Xj ;β(t))pr(Hkj ; γ(t)
h )∑Kj
k′=1 pr(Dj | Hk′j , Xj ;β(t))pr(Hk′
j ; γ(t)h )
.
The M-step is to maximize the conditional expectation of the complete-data log-likelihood,
Q(θ | θ(t)) which is, up to a constant term,
Q(θ | θ(t)) =n∑
j=1
Kj∑
k=1
wjk(θ(t)) log pr(Dj | Hkj , Xj ;β)
+n∑
j=1
Kj∑
k=1
wjk(θ(t)) log pr(Hkj ; γh). (2.4)
Standard errors of θ are based on the inverse of the observed information evaluated at θ.
The observed information can be calculated by Louis’ formula (Louis, 1982), which expresses
the observed information as the conditional expectation of the complete-data information
minus the conditional variance of the complete-data score given the observed data. In our
context, the conditional expectations are weighted sums with the same weights as those
available from the final iteration of the EM algorithm.
2.2 RML for Case-Control Data
In a case-control study, information on covariates is collected retrospectively given disease
status. Therefore, a retrospective sampling model, rather than a prospective sampling
model, describes the data. Let (H0, X0) be baseline values of haplogenotypes H and en-
vironmental covariates X, and let D = i denote the disease status within the ith disease
group; i = 0, 1. The conditional probability of (H, X) given disease status D can be shown
to be
pr(H, X | D = i) = ci(ξ, β1) exp{ξ(H, X) + iz(H, X)β1},
CHAPTER 2. METHODS 11
where ξ(H, X) = log{pr(H, X | D = 0)/pr(H0, X0 | D = 0)} is a nuisance function of H
and X, and ci(ξ, β1) is a normalizing constant (Prentice and Pyke 1979, Shin et al. 2006).
Let ϑ = (β1, ξ) parametrize the retrospective likelihood initially. It can be shown that
fitting a standard logistic regression model of disease risk is equivalent to RML analysis
for inference of log odds-ratio parameters β1, provided that covariates are fully observed
and the distribution of covariates is treated completely non-parametrically (Prentice and
Pyke 1979). However, when haplotypes are among the risk factors of interest and haplotype
phase is ambiguous, the result from Prentice and Pyke (1979) is not applicable.
In order to tackle the problem of missing haplotype phase, we first consider a variant
sampling scheme (Prentice and Pyke 1979) which is asymptotically equivalent to case-
control sampling. This scheme leads to a parametrization of the complete-data likelihood
that is useful for our problem. We next note that maximum-likelihood estimates (MLEs) for
the regression parameters may be obtained indirectly through the EM algorithm or directly
as the solutions to a set of score equations. This is important because, in later sections,
we will motivate the EM algorithm for PML as an approximation to the EM algorithm for
RML. We will also motivate the MPSE and EE estimating equations as approximations to
the RML score equations.
Slight changes in the notation are required for RML inference on case-control data with
missing haplotype phase. Let Dij , Gij , Hij and Xij respectively denote the disease status,
single-locus genotype data, haplogenotype and non-genetic covariate data for the jth subject
within the ith disease group; i = 0, 1 and j = 1, ..., ni, where ni is the number of subjects
in the ith disease group, and n = n0 + n1 is the total number of subjects in the sample.
As before, D, G and X represent disease status, single-locus genotype data and non-genetic
covariate data, respectively, of all subjects; and H is the latent haplogenotype data of all
subjects.
2.2.1 A Variant Sampling Scheme
The variant sampling scheme (VSS), is a two-stage hypothetical sampling design:
stage 1: Independently sample n disease status variates from a large hypothetical popula-
tion. In each of the n binomial trials, the probability of sampling a case is n1/n and
CHAPTER 2. METHODS 12
the probability of sampling a control is n0/n. Let N1 be the number of cases sampled
and N0 = n−N1 be the controls sampled. In a study with n subjects, the expected
number of cases and controls sampled are n1 and n0, respectively.
stage 2: Sample N0 covariate vectors from the appropriate conditional distribution of
(H, X) given D = 0 and N1 covariate vectors from the conditional distribution of
(H, X) given D = 1.
Since cases and controls respectively represent the diseased class and the disease-free class
in the underlying population, the conditional distribution of (H, X) in the second stage of
VSS is the same conditional distribution as in the true case-control sampling given disease
status. Under VSS, disease status and covariates are both random. The VSS hypothetical
population has a relative frequency n1/n of cases and n0/n of controls.
Let “prv” and “pr” denote probability density functions or mass functions, as appropri-
ate, under VSS and under true case-control sampling, respectively. From the description
above, the conditional distribution of (H, X) given disease status D under true case-control
sampling is the same as under VSS; that is pr(H, X | D ;ϑ) = prv(H, X | D ;ϑ). As shown
in Ghadessi (2005; page 18), prv(H, X | D ;ϑ) can be reparametrized as
prv(H, X | D = i ;ϑv) =prv(D = i | H, X ;βv0, β1)prv(H, X ; γv)
prv(D = i), (2.5)
where prv(D = i | H, X ;βv0, β1) is a logistic regression model with the same log odds-
ratio parameters β1 as the logistic model for a population sample, but with a different
intercept βv0 appropriate to the VSS hypothetical population; γv parameterizes the joint
distribution of H and X under VSS; and prv(D = i) is the disease risk under VSS, which
is ni/n by definition. The new parameter ϑv ≡ (β1, γv) reparametrizes ϑ = (β1, ξ), and βv0
is a function of β1 and γv. The complete-data likelihood would be a product of terms in
equation (2.5).
2.2.2 MLEs via EM algorithm
Write lc(ϑv;H,X ) = log prv(H,X | D ;ϑv) for the complete-data log-likelihood. For conve-
nience, let βv = (βv0, βT1 )T .
CHAPTER 2. METHODS 13
The Expectation Step
The conditional expectation of the complete-data log-likelihood given the observed data,
disease status and the estimates of ϑv in the tth iteration is now given by
Q(ϑv | ϑ(t)v ) = E
ϑ(t)v
[lc(ϑv;H,X ) | D,G,X ]
=1∑
i=0
ni∑
j=1
Eϑ
(t)v
[lcij(ϑv;Hij , Xij) | Gij , Xij , Dij ],
where lcij(ϑv ;Hij , Xij) = log prv(Hij , Xij | Dij ;ϑv) is the complete-data log-likelihood for
the jth subject within the ith disease group. Let HGij = {Hkij ; k = 1, . . . , Kij} be the set of
haplogenotypes that are consistent with the observed single-locus genotype data Gij . Then
Q(ϑv | ϑ(t)v ) =
1∑
i=0
ni∑
j=1
Kij∑
k=1
wijk(ϑ(t)v ) log prv(H
kij , Xij | Dij ;ϑv)
where
wijk(ϑ(t)v ) = prv(H
kij , Xij | Gij , Xij , Dij ;ϑ(t)
v ) = prv(Hkij | Gij , Xij , Dij ;ϑ(t)
v )
is the posterior weight of the pseudo-individual with complete covariate vector (Hkij , Xij),
given the observed data (Gij , Xij), disease status Dij and parameter estimates ϑ(t)v . Since
H implies G, one can write the RML weights in terms of the conditional probability of
covariates given disease status as:
wijk(ϑ(t)v ) =
prv(Hkij , Gij , Xij | Dij ;ϑ
(t)v )
prv(Gij , Xij | Dij ;ϑ(t)v )
=prv(Hk
ij , Xij | Dij ;ϑ(t)v )
prv(Gij , Xij | Dij ;ϑ(t)v )
=prv(Hk
ij , Xij | Dij ;ϑ(t)v )
∑Kij
k′=1 prv(Hk′ij , Xij | Dij ;ϑ
(t)v )
(2.5)=
prv(Dij | Hkij , Xij ;β
(t)v )prv(Hk
ij , Xij ; γ(t)v )
∑Kij
k′=1 prv(Dij | Hk′ij , Xij ;β
(t)v )prv(Hk′
ij , Xij ; γ(t)v )
.
CHAPTER 2. METHODS 14
Now write prv(H, X; γv) = prv(H | X; γv) prv(X; γvx), where γvx parameterizes the mar-
ginal distribution of X in the VSS hypothetical population. Replacing the joint distribution
of H and X in the weights by this product yields
wijk(ϑ(t)v ) =
prv(Dij | Hkij , Xij ;β
(t)v )prv(Hk
ij | Xij ; γ(t)v )
∑Kij
k′=1 prv(Dij | Hk′ij , Xij ;β
(t)v )prv(Hk′
ij | Xij ; γ(t)v )
, (2.6)
in which the marginal probability of X is canceled out. Calculation of prv(Dij | Hkij , Xij ;β
(t)v )
is straightforward using the disease risk model and current estimates of βv. However, as is
the case with population (prospective) sampling, the conditional distribution of H given X
under VSS sampling can not be estimated from the observed data without further assump-
tions because certain haplogenotypes are never observed unambiguously.
The Maximization Step
The weighted retrospective log-likelihood, with given RML weights wijk(ϑ(t)v ), is
Q(ϑv | ϑ(t)v ) =
1∑
i=0
ni∑
j=1
Kij∑
k=1
wijk(ϑ(t)v ) log prv(H
kij , Xij | Dij ;ϑv),
where ϑv = (β1, γv). By equation (2.5), Q(ϑv | ϑ(t)v ) is, up to a constant term, equal to
Q(ϑv | ϑ(t)v ) =
1∑
i=0
ni∑
j=1
Kij∑
k=1
wijk(ϑ(t)v ) log prv(Dij | Hk
ij , Xij ;βv)
+1∑
i=0
ni∑
j=1
Kij∑
k=1
wijk(ϑ(t)v ) log prv(H
kij , Xij ; γv), (2.7)
where βv = (βv0, β1). Maximization of Q would appear to be completely analogous to the
prospective case (see equation 2.4). However, recall that βv0 is a function of both β1 and
γv and so is not a free parameter. By contrast, in the prospective likelihood the intercept
term is a free parameter. Nevertheless, as shown in Appendix C of Ghadessi (2005), Q
may be maximized by treating βv0 as a free parameter. Hence, the M-step for RML is the
same as the M-step for PML.
CHAPTER 2. METHODS 15
2.2.3 MLEs by direct solution of score equations
The observed-data score function can be expressed as a weighted sum of complete-data
score functions over complete-data configurations consistent with the observed data, where
the weights are the posterior probabilities of the complete-data configurations given the
observed data (Louis 1982). The M-step of the EM algorithm solves the same weighted
sum as the score equations, but with the weights fixed at values determined by parameter
estimates from the previous iteration of the algorithm. At convergence, however, the weights
used in the final M-step will be the MLEs. Therefore, at convergence, the estimating
equations for the M-step will have the same solution as the score equations. The weighted
estimating equations in the M-step are obtained by differentiating (2.7) for fixed weights
with respect to βv to obtain
1∑
i=0
ni∑
j=1
Kij∑
k=1
wijk(ϑ(t)v )
∂
∂βvlog pr(Dij | Hk
ij , Xij ;βv).
At convergence, the M-step involves solving (for βv)
0 =1∑
i=0
ni∑
j=1
Kij∑
k=1
wijk(ϑv)∂
∂βvlog pr(Dij | Hk
ij , Xij ;βv).
The RML score equations for the regression parameters must therefore be
0 =1∑
i=0
ni∑
j=1
Kij∑
k=1
wijk(ϑv)∂
∂βvlog pr(Dij | Hk
ij , Xij ;βv). (2.8)
In other words, the weights in the E-step of the EM algorithm are of the same functional
form as the weights in the score equations.
2.3 Approximate Score Methods
As with the prospective likelihood, the retrospective likelihood is a function of parameters
that can not be identified from the observed data, unless assumptions are made about the
distribution of haplogenotypes. Unfortunately, approaches which formulate the likelihood
CHAPTER 2. METHODS 16
after making these assumptions appear to lack robustness (Spinka et al. 2005). Another
strategy is to derive the score equations for the regression parameters of primary interest
without making assumptions, and then, where necessary, make approximations that depend
only on identifiable parameters. We show below that PML applied to case-control data,
MPSE and EE are examples of this approximate score approach, which make different
approximations to the weights in the RML score equations (2.8). Our motivation of MPSE
is novel; Spinka et al. present it only as an alternate set of estimating equations to an
assumption-based RML approach they develop.
2.3.1 PML as an Approximate Score Method
PML is derived for prospective data assuming a random sample and assuming HWP and
independence of H and X in the population from which the random sample is drawn. As
argued previously, the case-control sample can be viewed as a random sample from the VSS
hypothetical population. Therefore, applying PML to case-control data has the effect of
approximating prv(H | X ; γv) by prv(H ; γvh), the haplogenotype frequencies in the VSS
hypothetical population that would obtain if HWP held. Here γvh is a vector of haplotype
frequencies in the VSS hypothetical population. The PML weights approximate the correct
RML weights in equation (2.6) by
wijk(ϑv) =prv(Dij | Hk
ij , Xij ;βv)prv(Hkij ; γvh)
∑Kij
k′=1 prv(Dij | Hk′ij , Xij ;βv)prv(Hk′
ij ; γvh). (2.9)
As previously noted, the M-steps of PML and RML are the same.
2.3.2 MPSE as an Approximate Score Method
MPSE was developed for diseases of any frequency in the population, with emphasis on more
common diseases. In general, the estimating equations are parametrized in terms of β0, the
population log-odds of disease in the baseline group. For rare diseases, this parametrization
would appear to be problematic, since β0 can not be estimated from case-control data unless
the population probability of disease, pr(D = 1), is known (Chatterjee and Carroll 2005).
However, as shown below, for rare diseases it turns out that MPSE does not depend on β0.
CHAPTER 2. METHODS 17
Spinka et al. (2005) describe MPSE in terms of another approximation to case-control
sampling called variable probability sampling (VPS). VPS is motivated by nested case-
control sampling. A detailed description is given in Appendix B, where we also show that
the joint probability distribution of complete data D, H and X is the same under VPS and
VSS. Therefore, we use prv to denote probability density functions or mass functions under
VPS as well as VSS.
We discuss the MPSE estimating equations for the regression parameters first, and for
population haplotype frequencies second. We also review the form of the variance-covariance
matrix of the resulting parameter estimators.
Estimation of Regression Parameters
The estimating equations for the regression parameters βv = (βv0, β1) are of the same form
as the RML score equation (2.8)
0 =1∑
i=0
ni∑
j=1
Kij∑
k=1
˜wijk(β0, βv, γh)∂
∂βvlog prv(Dij | Hk
ij , Xij ;βv), (2.10)
but with approximate weights
˜wijk(β0, βv, γh) =prv(Dij | Hk
ij , Xij ;βv)rβ0,βv(Hkij , Xij)pr(Hk
ij ; γh)∑Kij
k′=1 prv(Dij | Hk′ij , Xij ;βv)rβ0,βv(Hk′
ij , Xij)pr(Hk′ij ; γh)
, (2.11)
where
rβ0,βv(H, X) =1 + exp{βv0 + z(H, X)β1)}1 + exp{β0 + z(H, X)β1)} .
For the RML weights in the equation (2.6), prv(H | X; γv) is approximated assuming HWP
and independence of H and X in the general population, as described in Appendix B. In-
serting the approximation to prv(H | X; γv) into the RML weights in equation (2.6) gives
the MPSE weights in equation (2.11). The MPSE weights are likely to be a better approxi-
mation than the PML weights since they can be derived under assumptions (independence
of H and X and HWP in the population) which seem more reasonable than those needed
to justify PML applied to case-control data (independence of H and X and HWP in the
pooled case-control sample).
CHAPTER 2. METHODS 18
The estimating equations depend on β0 only through the weights, and the weights
˜wijk(β0, βv, γh) depend on β0 only through rβ0,βv . For a rare disease with probability of
disease that is small for all covariate values likely to be observed, exp{β0+x(H, X)β1} << 1,
so that
rβ0,βv(H, X) ≈ 1 + exp{βv0 + z(H, X)β1} ≡ rβv(H, X)
Thus, under a rare disease assumption, an estimate of β0 is not needed to estimate r, and
therefore is not needed to estimate the weights or solve the estimating equations. The
MPSE weights simplify to
˜wijk(β1, γh) =exp{Dijz(Hk
ij , Xij)β1}pr(Hkij ; γh)
∑Kij
l=1 exp{Dijz(H lij , Xij)β1}pr(H l
ij ; γh). (2.12)
In particular, the weights for the pseudo-individuals in the control group (i.e. Dij = 0)
depend on haplotype frequencies only, so that
˜w0jk(β1, γh) ≡ ˜w0jk(γh). (2.13)
Estimation of Haplotype Frequencies
In general, the weights ˜wijk(β0, βv, γh) also depend on the marginal distribution of hap-
logenotypes in the population. Under the assumption of population HWP, the distribution
of haplogenotypes is specified by haplotype frequencies γh = (γh1 , . . . , γhK). The estimating
equations for the haplotype frequencies are
0 =1∑
i=0
ni∑
j=1
Kij∑
k′=1
˜wijk′(β0, βv, γh)Nhk(Hk′
ij )
− γhk
1∑
i=0
ni∑
j=1
∑Kl=1 2γhl
rβ0,βv(H = (hk, hl), Xij)∑H′∈H pr(H ′; γh)rβ0,βv(H ′, Xij)
, k = 1, . . . , K, (2.14)
where H is the set of all haplogenotypes. The first term of equation (2.14) is the expected
count for haplotype hk in the pooled sample. These estimating equations are the score
equations for γh that result from the retrospective likelihood under the assumptions of
HWP and independence of H and X in the population (Spinka et al. 2005). Furthermore,
CHAPTER 2. METHODS 19
assuming a rare disease, we can invoke equations (2.12) and (2.13) to obtain
0 =1∑
i=0
ni∑
j=1
Kij∑
k′=1
˜wijk′(β1, γh)Nhk(Hk′
ij )
− γhk
1∑
i=0
ni∑
j=1
∑Kl=1 2γhl
rβv(H = (hk, hl), Xij)∑H′∈H pr(H ′; γh)rβv(H ′, Xij)
, k = 1, . . . , K,
=n0∑
j=1
K0j∑
k′=1
˜w0jk′(γh)Nhk(Hk′
0j) +n1∑
j=1
K1j∑
k′=1
˜w1jk′(β1, γh)Nhk(Hk′
1j)
− γhk
1∑
i=0
ni∑
j=1
∑Kl=1 2γhl
rβv(H = (hk, hl), Xij)∑H′∈H pr(H ′; γh)rβv(H ′, Xij)
, k = 1, . . . , K. (2.15)
Clearly, these estimating equations for the haplotype frequencies depend on the regression
parameters. We shall revisit this issue later in comparing MPSE to EE.
Variance Estimator for MPSE
Since the MPSE estimating equations are not score equations, the estimator of the variance
of the resulting parameter estimators is not simply the inverse of the matrix of derivatives
of the estimating equations. Following standard estimating equations theory, Spinka et al.
(2005) present a “sandwich” variance estimator that correctly accounts for the case-control
sampling. Let U(βv, γh) denote the collection of MPSE estimating equations. For a sample
of n independent individuals, U(βv, γh) is a sum over contributions from each subject. Index
subjects by the subscript j within disease state i. Then U(βv, γh) =∑1
i=0
∑nij=1 Uij(βv, γh).
The estimator is the solution (βv, γh) to U(βv, γh) = 0. Let U denote the matrix of partial
derivatives of U with respect to (βv, γh). Then the variance estimator from the MPSE
approach is
U(βv, γh)−1Var[U(βv, γh)]U(βv, γh)−1, (2.16)
where the middle term is an estimate of the variance of the estimating equations evaluated
at (βv, γh):
Var[U(βv, γh)] =1∑
i=0
ni∑
j=1
Uij(βv, γh)Uij(βv, γh)T −1∑
i=0
ni Ui(βv, γh) Ui(βv, γh)T,
CHAPTER 2. METHODS 20
with Ui(βv, γh) = 1ni
∑nij=1 Uij(βv, γh).
Case-control samples are comprised of two independent and identically distributed (iid)
samples, cases and controls, and so are not identically distributed. For future reference,
we note here that the variance estimator in equation (2.16) for case-control data differs
from its counterpart for data arising from an iid sample. The expression for Var[U(βv, γh)]
involves a weighted sum of mean-score terms∑1
i=0 ni Ui(βv, γh) Ui(βv, γh)T. In con-
trast, the analogous expression for an iid sample would involve n U(βv, γh) U(βv, γh)T,
where U(βv, γh) is the mean score. One of the regularity conditions assumed in deriving
asymptotic distributions for estimators that solve estimating equations is that the esti-
mating equation tends in probability to zero as the sample size tends to infinity (Zhao
et al. 2003); that is, n U(βv, γh) P→ 0. The EE approach ignores the mean-score term∑1
i=0 ni Ui(βv, γh) Ui(βv, γh)T
when constructing variance estimators. Ignoring this term
would be justified if the sample were iid. However, since case-control samples are not iid,
the EE variance estimators have the potential to be asymptotically conservative.
2.3.3 EE as an Approximate Score Method
Under the rare disease assumption that exp{β0 + z(H, X)β1} ≈ 0, it is straightforward
to show that the EE estimating equations for the regression parameters are the same as
those of MPSE. However, we emphasize that even though the estimating equations for the
regression parameters are the same, the estimating equations for the haplotype frequencies
are different. Since the estimators of regression parameters and haplotype frequencies jointly
solve the full set of estimating equations, the regression parameter estimators for the two
approaches will be different. Consequently, the methods are not equivalent, contrary to the
assertions of Spinka et al. (2005).
In deriving their estimating equations for the haploytype frequencies, Zhao et al. (2003)
pointed out that, under the rare disease assumption, the controls in the sample could
be treated as representative of the general population and approximated the population
haplotype frequencies needed in their weights by haplotype frequencies estimated from the
CHAPTER 2. METHODS 21
control data only. Their estimating equations for haplotype frequencies can be written as
0 = γhk− 1
2n0
n0∑
j=1
K0j∑
k′=1
˜w0jk′(γh)Nhk(Hk′
0j), k = 1, . . . , K, (2.17)
which do not depend on the regression parameters β1, unlike the corresponding estimating
equations (2.15) of Spinka et al. (2005).
Finally, as already noted, the sandwich variance matrix is computed in slightly different
ways in the two approaches. Zhao et al. (2003) appear to be incorrectly using a form of
variance estimator that would only be appropriate for an iid sample. Spinka et al. (2005)
did not mention this difference between MPSE and EE. The error in the calculation of the
EE variance estimator suggests that the EE standard errors will be biased upward.
2.3.4 A PML/MPSE Hybrid Approach
The MPSE estimating equations for regression parameters are of the same form as those
for PML except that the weights are different. Therefore we can implement the MPSE
estimating equations for regression parameters by modifying a PML approach, such as
the one implemented in the R package hapassoc. At the E-step, the MPSE weights are
calculated according to formula (2.12), assuming a rare disease so that the intercept β0 is not
required. At the M-step, regression parameter estimates are updated by solving the weighted
complete-data estimating equations for PML with given weights. However, the MPSE
weights require estimates of population haplotype frequencies rather than of haplotype
frequencies in the pooled sample, and so we must also implement the MPSE estimating
equations for population haplotype frequencies in equation (2.15). Since we’ve changed the
estimating equations for the haplotype frequencies from those implemented in the original
hapassoc, the variance calculation from hapassoc for all the parameter estimates will be
incorrect in general. However, the variance estimates for the regression parameters may be
approximately correct, provided that the observed information matrix for the observed-data
likelihood is approximately block-diagonal (i.e. βv and γh are approximately independent).
To see this, partition the information matrix of the observed data Io into four blocks Ioij ;
i = 1, 2, j = 1, 2, with Io11 corresponding to the regression parameters and information
CHAPTER 2. METHODS 22
matrix returned by hapassoc, Io22 corresponding to the haplotype frequencies and Io
12 = IoT21
corresponding to the cross-terms. To say the information matrix is approximately block
diagonal is to say that Io12 ≈ 0. In general, the variance estimator of the regression parameter
estimator is (Io−1)11. In the special case of Io12 ≈ 0, we obtain (Io−1)11 ≈ (Io
11)−1. The
variance estimator for the regression parameters should then be correct if hapassoc returns
a valid estimator of Io11.
The observed information for the regression parameters returned by hapassoc should
approximate Io11. From Louis’ equations,
Io11 = E(Ic
11 | D,G,X )− V (U c1 | D,G,X ),
where Ic11 is the complete data information for the regression parameters, U c
1 is the complete-
data score vector for the regression parameters and the expectation and variance are condi-
tional on the observed data (D,G,X ) for all subjects. The hybrid version of hapassoc uses
the correct complete-data scores and information and uses approximately correct weights
for calculating conditional expectations and variances. Therefore, the observed information
for the regression parameters returned by hapassoc should approximate Io11. However, the
calculation of Io22 in hapassoc is incorrect, as it does not reflect the estimating equations
for the haplotype frequencies that are used in the modified code. Therefore, the variance
estimates for the haplotype frequencies, which are approximately (Io22)
−1 if the information
is approximately block-diagonal, will be incorrect.
Chapter 3
Simulation Study
There have been previous simulation studies to investigate the performance of PML in
haplotype-disease association analysis of cross-sectional data (Burkett et al. 2004) and
of case-control data (Lake et al. 2003, Ghadessi 2005). Ghadessi (2005) compared PML
to EE in case-control studies of a rare disease (with pr(D = 1) = 0.0009), through two
sets of simulations with underlying disease-risk models having haplotype effects only. The
EE estimators and standard errors were more biased than those from PML in both sets of
simulations. Spinka et al. (2005) conducted simulation studies to compare PML to MPSE in
case-control studies of a relatively common disease (with pr(D=1)=0.107). Their underlying
disease-risk model included a single risk haplotype, an “environmental” factor, and large
statistical interaction between the risk haplotype and environmental factor. They showed
that, when haplotypes and the environmental factor are independent in the population,
MPSE was unbiased for the log odds-ratio parameters in the logistic model while PML was
biased.
Of all previous simulation studies, those of Spinka et al. (2005) and Ghadessi (2005)
are most relevant to the focus of this thesis. However, there are limitations of each of
these two studies. For example, the simulations of Spinka et al., involving a relatively
common disease, are less relevant than if they involved a rare disease, since the case-control
design is primarily for rare diseases. In addition, Spinka et al. considered only one value
of the statistical interaction effect, and this value was very large relative to the sizes of the
main effects of the environmental factor and risk haplotype. By contrast Ghadessi did not
23
CHAPTER 3. SIMULATION STUDY 24
include interactions at all in her disease risk model. Spinka et al. presumably selected a
large interaction effect for their study to illustrate the bias of PML applied to case-control
data. It can be shown that interaction between haplotypes H and non-genetic factors X
creates departures from independence of H and X in the cases (Shin et al. 2006), and hence
departures from independence in the case-control sample. When interaction effects are large,
so is the dependence of H and X in the case-control sample, making the PML weights poor
approximations to the true RML weights. However, for more modest interaction effects
likely to be encountered in practice, we would expect less bias in the risk estimates.
In this project we present simulations to compare the statistical properties of PML,
EE and the hybrid approach in case-control studies of a rare disease, for risk models that
include haplotype effects, non-genetic effects and statistical interactions between the two.
We could not obtain reliable software implementing MPSE and so could not assess the
performance of this approach in our simulation study. The goals of our simulations are to
compare the bias and precision of the estimators of log odds-ratios, the bias of estimators of
the standard errors, and the power for detecting haplotype-environment interactions, for the
three approaches. The PML approach and the EE approach are respectively implemented
in the package hapassoc for the R programming environment and in the software package
Hplus, both available for public use. The hybrid approach, which we denote HYBRID, was
implemented by modifying the hapassoc code. Due to limitations of Hplus, we were only
able to compare the approaches under a multiplicative risk model, in which the odds of
being affected is increased by a multiplicative factor for each copy of the risk haplotype
that replaces a baseline haplotype.
Large samples (e.g. 1000 cases and 1000 controls) are always desired in case-control
studies of rare diseases, particularly when gene-environment interactions are of primary
interest, because logistic regression has low power to detect interactions (Smith and Day
1984). Also, it is reasonable to conduct the simulations using large samples in order to
investigate the asymptotic properties of the three approaches. However, large sample sizes
can not always be achieved, due to cost and time considerations. Therefore, we are also
interested in investigating the performance of PML, EE and HYBRID for data sets of more
realistic size (e.g. 500 cases and 500 controls).
CHAPTER 3. SIMULATION STUDY 25
3.1 Design of Simulations
We conducted our simulations in a setting similar to that of Ghadessi (2005), in which hap-
lotypes were comprised of three SNPs and two sets of haplotype frequencies were considered
(Table 3.1). The ability of single-locus genotypes to predict haplotypes, as measured by R2h
(Stram et al. 2003), is 78.95% for the first set of frequencies and 59.78% for the second set.
We refer to these two levels of haplotype ambiguity as moderate and extreme, respectively.
Haplotype h1, which consisted of allele 0 at all three loci, was chosen to be a risk haplotype
that was positively associated with the disease. It is necessary to choose the frequency of h1
to be smaller than at least one other haplotype, because Hplus automatically chooses the
most frequent haplotype in the sample to be the baseline haplotype in fitting the logistic
regression model. We performed our simulation studies with varying sample sizes of 1000
cases and 1000 controls, and 500 cases and 500 controls. The resulting four simulation
scenarios are summarized in Table 3.2. The data were simulated from the logistic disease
model
logit{pr(D = 1 | H, X)} = β0 + βXX + βHNh1(H) + βHXNh1(H)X, (3.1)
where βX , βH and βHX were the regression coefficients associated with a main effect of an
“environmental factor” X, described below, a main effect of h1 and an interaction effect,
and where Nh1(H) counts the number of copies of haplotype h1 contained in haplogenotype
H. Throughout our simulation studies, βX and βH were fixed at 0.1 and 0.7, respectively,
but four values of βHX were considered: 0.1, 0.3, 0.5 and 0.7, representing different levels of
interactions. The intercept term β0 was adjusted for different βHX and different haplotype
frequencies (Table 3.3), so that the probability of the disease in the general population
would be around 0.0009. This disease probability is consistent with a two-year incidence
study of type 1 diabetes in Scandinavians.
In a single simulation replicate, we first generated haplogenotypes for an underlying
population, under the assumption of HWP. The population size was 1500000 for the first
and the third simulation scenarios and 750000 for the second and the fourth ones. We
then generated a continuous environmental covariate X for each subject in the population,
independent of the subject’s genetic status, using a normal distribution with mean 0 and
CHAPTER 3. SIMULATION STUDY 26
Table 3.1: Haplotype frequencies used inthe simulations
Haplotype FrequencySet 1 Set 2
h1 = 000 .23 .07h2 = 001 .27 .93/7h3 = 010 .15 .93/7h4 = 011 .10 .93/7h5 = 100 .10 .93/7h6 = 101 .05 .93/7h7 = 110 .05 .93/7h8 = 111 .05 .93/7
Table 3.2: The four simulation scenarios
Scenario Haplotype Sample sizefrequencies (cases/controls)
i Set 1 1000/1000ii Set 1 500/500iii Set 2 1000/1000iv Set 2 500/500
Table 3.3: Adjusted intercept β0
βHX Haplotype frequenciesSet 1 Set 2
0.1 -7.45 -7.160.3 -7.50 -7.180.5 -7.65 -7.220.7 -7.80 -7.26
variance 1. For each subject, the binary disease status D was simulated according to the
penetrance model in equation (3.1). Once the population was simulated, a subset of cases
and controls of specified size was randomly selected, and the single-locus genotypes and
environmental covariates of the selected subjects were collected and recorded as data. The
data set was then analyzed using PML, EE and HYBRID, respectively. We also wished to
compare the finite-sample bias in the regression parameter and standard error estimators
of PML, EE and HYBRID to the finite-sample bias of maximum likelihood when there
is no missing haplotype phase; i.e. logistic regression. The finite-sample bias of logistic
regression analysis of the phase-known data provides a baseline against which to judge the
bias of methods that analyze the data with phase ambiguity. Hence we also recorded the
haplogenotypes of all sampled subjects and obtained another data set which was analyzed
by logistic regression using the glm function of the R programming environment. We use the
notation GLM to denote maximum likelihood applied to the complete (i.e. phase-known)
data.
CHAPTER 3. SIMULATION STUDY 27
Let “h0” denote the baseline haplotype in the sample. The model fit to the data is
logit{pr(D = 1 | H, X)} = β0 + βXX +∑
hj 6=h0
βhjNhj
(H) +∑
hj 6=h0
βhjXNhj(H)X, (3.2)
where βhjand βhjX are regression coefficients associated with main effects of haplotype hj
and effects of interaction between hj and X, and where Nhj(H) counts the number of copies
of haplotype hj contained in haplogenotype H. In terms of the model used to simulate data,
given in equation (3.1), we have that βH = βh1 and βHX = βh1X ; to simplify notation we
use βH and βHX throughout. The estimates of βX , βH and βHX from GLM, PML, EE and
HYBRID, as well as their associated standard errors, were recorded.
For each value of βHX and each simulation scenario, statistical properties of the ap-
proaches were estimated from 10000 simulation replicates. In up to about a half of the
simulated data sets, either Hplus estimated h1 to be the most frequent haplotype, or one
or more of the approaches failed to converge while fitting the risk model. Such data sets
were discarded until the desired number of 10000 replicates was obtained.
3.2 Statistical Properties
We computed four commonly used measures to evaluate the performance of GLM, PML,
EE and HYBRID. Let b be the true value of a regression parameter associated with an effect
(e.g. βHX for the interaction effect). Let b be an estimator (e.g. the PML estimator) of b
and br be its realization in the rth simulation replicate. The first measure is the estimated
bias of b:
Bias(b) = ¯b− b =
1R
R∑
r=1
(br − b),
in which the summation is over all R replicates. In our current simulation study, R =
10000. The estimated bias is compared to the corresponding simulation error (described in
Appendix A of Ghadessi (2005)), and the estimator is said to be unbiased if the estimated
bias is within simulation error of zero. Let SE be the standard error associated with b,
and SEr be the standard error of br. Let SD =√
1R
∑Rr=1(br − ¯
b)2 denote the empirical
standard deviation of b. In our simulation study, SD is considered to be the nominal (true)
CHAPTER 3. SIMULATION STUDY 28
value of the standard error of b. The second measure is the estimated bias of SE:
Bias(SE) =1R
R∑
r=1
(SEr − SD),
which quantifies the accuracy of the standard error estimator of b. The standard error
estimator is said to be unbiased if the estimated bias of SE is within the corresponding
simulation error of zero. The third measure is the estimated coverage probability of the
confidence interval, (b−Zα/2SE, b + Zα/2SE) at significance level α, that includes the true
value b:
CP =1R
R∑
r=1
δ{|br − b| < Zα/2SEr},
where δ is the indicator function. In our simulation study, α = 0.05 and Zα/2 is approx-
imately 2. An acceptable estimated coverage probability of the 95% confidence interval
should be within simulation error of the nominal 95%. The fourth measure is the estimated
power of the approach to detect the presence of the effect, quantified as the probability of
rejecting the null hypothesis b = 0:
P =1R
R∑
r=1
δ{|br| > Zα/2SEr}.
3.3 Simulation Results
We first present an overview of conclusions from the simulation study, with particular
attention to results that address methodological issues raised in Chapter 2 and to results
that confirm previous studies. More detailed simulation results related to estimation of
βHX are presented next, which support the general conclusions of the simulation study.
Full simulation results for βH , βX and βHX appear in Tables C.1 - C.4 of Appendix C.
3.3.1 Overview of Simulation Conclusions
Overall, HYBRID performed the best of the three approximate score methods, with approx-
imately correct inference and good power to detect the interaction effect in all simulation
configurations.
CHAPTER 3. SIMULATION STUDY 29
We next discuss the bias and variance of risk parameter estimators, and the bias in the
standard error estimators.
Our simulations, under more moderate interaction effects than those of Spinka et al.,
show that, when haplotype ambiguity is moderate, bias in all estimators of the regression
parameters, including PML, is comparable to the finite-sample bias of logistic regression
with known haplotypes (GLM). However, when haplotype ambiguity is extreme, the PML
and EE estimators are biased relative to HYBRID. The bias of PML is likely due to its in-
correct approximation of the RML weights in these simulations. EE and HYBRID (MPSE),
on the other hand, use the correct weights because the data are simulated under popula-
tion HWP and independence of H and X. The bias of EE is likely due to the estimating
equations for the haplotype frequencies which differ from those for MPSE.
In contrast, under extreme haplotype ambiguity, the EE regression estimators were less
variable than those of HYBRID (MPSE). Recall that, unlike MPSE, the EE estimating
equations for the haplotype frequencies involve the controls only and do not depend on the
regression parameters. If regression parameter estimators are imprecisely determined, the
EE estimator of haplotype frequencies might be less variable than the MPSE estimator,
even though the MPSE estimator uses data from both cases and controls. Such decreased
variability in estimators of haplotype frequency might then translate into decreased variance
for the regression parameter estimators of EE, relative to those of MPSE.
The most striking simulation results regarding standard errors were those for EE. The
conservative standard errors for EE are almost certainly due to an error in the EE variance
calculation noted previously. Figure 3.9 shows the bias in standard errors, after excluding
those of EE, for the remaining methods. The HYBRID standard errors perform the best of
the three approximate score methods, even though the variance calculation from hapassoc
is incorrect.
3.3.2 Results for Simulation Scenarios i) and ii)
Figure 3.1 summarizes the results for the first simulation scenario, in which the first set of
haplotype frequencies was used. Based on the estimated bias of βHX , we make the following
key observations. First, the bias of estimators from PML, EE and HYBRID was upward
CHAPTER 3. SIMULATION STUDY 30
(anti-conservative) in general. Second, the biases increased as βHX increased. Third, EE
and HYBRID performed slightly better than PML, as the biases of the EE and HYBRID
estimators were within simulation error of zero when βHX = 0.1 and 0.3. The boxplots in
Figure 3.3 show that the variability in βHX estimates appeared to be smaller for EE and
HYBRID than for PML.
EE performed poorly in calculating the standard errors of the βHX estimates, as the
standard error of βHX was upward biased (Figure 3.1). The standard errors from PML
and HYBRID showed a slightly downward bias (anti-conservative) in general and exceeded
simulation errors (Figure 3.9). However, the magnitudes of the biases were small compared
to the bias of the EE estimator. Figure 3.3 also shows that standard errors from EE are
more spread-out than those from PML and HYBRID.
The estimated coverage probabilities of the 95% confidence intervals of PML and HY-
BRID were approximately 95% and within simulation errors of the nominal 95%. By con-
trast, the estimated coverage probabilities of EE were larger than 99% in general and
exceeded simulation errors. The inflation of the standard errors from EE resulted in low
power to detect the interaction effects. The estimated power to detect weak interactions was
low for all three approaches but improved for PML and HYBRID as the level of interaction
increased.
Figures 3.2, 3.4 and 3.9 show the simulation results for the second simulation scenario,
which used the same haplotype frequencies configuration as in the first simulation scenario
but with smaller sample size. Similar patterns of biases in estimation of βHX and standard
errors from PML, EE and HYBRID were observed. The magnitude of bias was bigger and
the variability in the estimates was greater than in the first simulation scenario.
3.3.3 Results for Simulation Scenarios iii) and iv)
The second set of haplotype frequencies was used in the third and fourth simulation sce-
narios. The simulation results for βHX are summarized in Figures 3.5 and 3.6, in the same
formats as those for the first two simulation scenarios.
Both EE and HYBRID showed downward bias in estimating weak interaction effects
and upward bias in estimating moderate and strong interaction effects, while PML showed
CHAPTER 3. SIMULATION STUDY 31
upward bias in estimating weak and moderate interactions and downward bias in estimating
strong interactions. The estimated bias was within simulation error for PML when βHX =
0.1 and 0.5, for EE when βHX = 0.1 and 0.3, and for HYBRID when βHX = 0.1, 0.3 and 0.5.
The variability in the estimates appeared to be smaller for EE than for PML and HYBRID,
as shown in Figures 3.7 and 3.8.
The results for the standard error showed that the estimated bias of standard errors was
downward for PML and upward for EE and HYBRID in general, with all estimated biases
exceeding simulation errors. Figure 3.7 shows similar inflation and spread of standard errors
from EE as observed in the first two simulation scenarios, and standard errors from PML
and HYBRID that are more concentrated than EE.
The 95% confidence interval from EE gave estimated coverage probabilities of around
99%, due to the highly-conservative standard errors. The coverage probabilities were slightly
below 95% for PML and were slightly above 95% for HYBRID. As expected, the estimated
power to detect interactions was much lower for EE than for PML and HYBRID. The power
for all approaches improved as βHX or sample size increased. Recall that the ability of the
single-locus genotypes to predict the number of copies of risk haplotype h1 is lower and
phase ambiguity is higher in the second set of haplotype frequencies than in the first set.
Thus, it is not surprising that the power for PML and HYBRID were much lower than for
the GLM using phase-known data, even for high levels of interaction and large sample sizes.
CHAPTER 3. SIMULATION STUDY 32
Fig
ure
3.1:
Res
ults
for
βH
Xfr
omsi
mul
atio
nsc
enar
ioi)
0.1
0.3
0.5
0.7
0.00
0
0.00
5
0.01
0
0.01
5G
LMP
ML
EE
HY
BR
ID
unbi
ased
beta
_{H
X}
Bias of beta_{HX}
0.1
0.3
0.5
0.7
−0.0
5
0.00
0.05
0.10
0.15
beta
_{H
X}
Bias of Standard Error
0.1
0.3
0.5
0.7
0.94
0.95
0.96
0.97
0.98
0.99
1.00
beta
_{H
X}
Coverage Probability
0.1
0.3
0.5
0.7
0.0
0.2
0.4
0.6
0.8
1.0
beta
_{H
X}
Power
CHAPTER 3. SIMULATION STUDY 33
Fig
ure
3.2:
Res
ults
for
βH
Xfr
omsi
mul
atio
nsc
enar
ioii)
0.1
0.3
0.5
0.7
0.00
0
0.00
8
0.01
6
0.02
4
0.03
2G
LMP
ML
EE
HY
BR
ID
unbi
ased
beta
_{H
X}
Bias of beta_{HX}
0.1
0.3
0.5
0.7
−0.0
5
0.00
0.05
0.10
0.15
0.20
beta
_{H
X}
Bias of Standard Error
0.1
0.3
0.5
0.7
0.93
0.94
0.95
0.96
0.97
0.98
0.99
1.00
beta
_{H
X}
Coverage Probability
0.1
0.3
0.5
0.7
0.0
0.2
0.4
0.6
0.8
1.0
beta
_{H
X}
Power
CHAPTER 3. SIMULATION STUDY 34
Figure 3.3: Boxplots of bias in estimation of βHX (upper plot) and bias in estimationof associated standard error (lower plot) from simulation scenario i)
0.1 0.3 0.5 0.7
−0.
6−
0.2
0.0
0.2
0.4
0.6
Boxplots of Bias in Estimation of Beta_{HX}
Bia
s
Beta_{HX}
0.1 0.3 0.5 0.7
−0.
6−
0.2
0.0
0.2
0.4
0.6
0.1 0.3 0.5 0.7
−0.
6−
0.2
0.0
0.2
0.4
0.6
0.1 0.3 0.5 0.7
−0.
6−
0.2
0.0
0.2
0.4
0.6 GLM
PMLEEHYBRID
0.1 0.3 0.5 0.7
−0.
04−
0.02
0.00
0.02
0.04
Boxplots of Bias in Estimation of SE
Bia
s
Beta_{HX}
0.1 0.3 0.5 0.7
−0.
04−
0.02
0.00
0.02
0.04
0.1 0.3 0.5 0.7
−0.
04−
0.02
0.00
0.02
0.04
0.1 0.3 0.5 0.7
−0.
04−
0.02
0.00
0.02
0.04
GLMPMLEEHYBRID
CHAPTER 3. SIMULATION STUDY 35
Figure 3.4: Boxplots of bias in estimation of βHX (upper plot) and bias in estimationof associated standard error (lower plot) from simulation scenario ii)
0.1 0.3 0.5 0.7
−1.
0−
0.5
0.0
0.5
1.0
Boxplots of Bias in Estimation of Beta_{HX}
Bia
s
Beta_{HX}
0.1 0.3 0.5 0.7
−1.
0−
0.5
0.0
0.5
1.0
0.1 0.3 0.5 0.7
−1.
0−
0.5
0.0
0.5
1.0
0.1 0.3 0.5 0.7
−1.
0−
0.5
0.0
0.5
1.0
GLMPMLEEHYBRID
0.1 0.3 0.5 0.7
−0.
15−
0.05
0.05
0.15
Boxplots of Bias in Estimation of SE
Bia
s
Beta_{HX}
0.1 0.3 0.5 0.7
−0.
15−
0.05
0.05
0.15
0.1 0.3 0.5 0.7
−0.
15−
0.05
0.05
0.15
0.1 0.3 0.5 0.7
−0.
15−
0.05
0.05
0.15
GLMPMLEEHYBRID
CHAPTER 3. SIMULATION STUDY 36
Fig
ure
3.5:
Res
ults
for
βH
Xfr
omsi
mul
atio
nsc
enar
ioiii
)
0.1
0.3
0.5
0.7
−0.0
1
0.00
0.01
0.02
0.03
GLM
PM
LE
EH
YB
RID
unbi
ased
beta
_{H
X}
Bias of beta_{HX}
0.1
0.3
0.5
0.7
−0.0
5
0.00
0.05
0.10
0.15
0.20
beta
_{H
X}
Bias of Standard Error
0.1
0.3
0.5
0.7
0.94
0.95
0.96
0.97
0.98
0.99
1.00
beta
_{H
X}
Coverage Probability
0.1
0.3
0.5
0.7
0.0
0.2
0.4
0.6
0.8
1.0
beta
_{H
X}
Power
CHAPTER 3. SIMULATION STUDY 37
Fig
ure
3.6:
Res
ults
for
βH
Xfr
omsi
mul
atio
nsc
enar
ioiv
)
0.1
0.3
0.5
0.7
0.00
0.01
0.02
0.03
0.04
GLM
PM
LE
EH
YB
RID
unbi
ased
beta
_{H
X}
Bias of beta_{HX}
0.1
0.3
0.5
0.7
−0.0
5
0.00
0.05
0.10
0.15
0.20
0.25
beta
_{H
X}
Bias of Standard Error
0.1
0.3
0.5
0.7
0.93
0.94
0.95
0.96
0.97
0.98
0.99
1.00
beta
_{H
X}
Coverage Probability
0.1
0.3
0.5
0.7
0.0
0.2
0.4
0.6
0.8
1.0
beta
_{H
X}
Power
CHAPTER 3. SIMULATION STUDY 38
Figure 3.7: Boxplots of bias in estimation of βHX (upper plot) and bias in estimationof associated standard error (lower plot) from simulation scenario iii)
0.1 0.3 0.5 0.7
−1.
0−
0.5
0.0
0.5
1.0
1.5
Boxplots of Bias in Estimation of Beta_{HX}
Bia
s
Beta_{HX}
0.1 0.3 0.5 0.7
−1.
0−
0.5
0.0
0.5
1.0
1.5
0.1 0.3 0.5 0.7
−1.
0−
0.5
0.0
0.5
1.0
1.5
0.1 0.3 0.5 0.7
−1.
0−
0.5
0.0
0.5
1.0
1.5
GLMPMLEEHYBRID
0.1 0.3 0.5 0.7
−0.
10−
0.05
0.00
0.05
0.10
0.15
Boxplots of Bias in Estimation of SE
Bia
s
Beta_{HX}
0.1 0.3 0.5 0.7
−0.
10−
0.05
0.00
0.05
0.10
0.15
0.1 0.3 0.5 0.7
−0.
10−
0.05
0.00
0.05
0.10
0.15
0.1 0.3 0.5 0.7
−0.
10−
0.05
0.00
0.05
0.10
0.15
GLMPMLEEHYBRID
CHAPTER 3. SIMULATION STUDY 39
Figure 3.8: Boxplots of bias in estimation of βHX (upper plot) and bias in estimationof associated standard error (lower plot) from simulation scenario iv)
0.1 0.3 0.5 0.7
−1.
5−
0.5
0.0
0.5
1.0
1.5
2.0
Boxplots of Bias in Estimation of Beta_{HX}
Bia
s
Beta_{HX}
0.1 0.3 0.5 0.7
−1.
5−
0.5
0.0
0.5
1.0
1.5
2.0
0.1 0.3 0.5 0.7
−1.
5−
0.5
0.0
0.5
1.0
1.5
2.0
0.1 0.3 0.5 0.7
−1.
5−
0.5
0.0
0.5
1.0
1.5
2.0
GLMPMLEEHYBRID
0.1 0.3 0.5 0.7
−0.
20.
00.
10.
20.
30.
40.
5
Boxplots of Bias in Estimation of SE
Bia
s
Beta_{HX}
0.1 0.3 0.5 0.7
−0.
20.
00.
10.
20.
30.
40.
5
0.1 0.3 0.5 0.7
−0.
20.
00.
10.
20.
30.
40.
5
0.1 0.3 0.5 0.7
−0.
20.
00.
10.
20.
30.
40.
5
GLMPMLEEHYBRID
CHAPTER 3. SIMULATION STUDY 40
Fig
ure
3.9:
Est
imat
edbi
asof
stan
dard
erro
rfo
rG
LM
,P
ML
and
HY
BR
IDaf
ter
excl
udin
gE
E
0.1
0.3
0.5
0.7
−0.0
04
−0.0
02
0.00
0
0.00
2
0.00
4G
LMP
ML
HY
BR
ID
unbi
ased
beta
_{H
X}
Bias of Standard Error
Sce
nario
i)
0.1
0.3
0.5
0.7
−0.0
15
−0.0
10
−0.0
05
0.00
0
0.00
5
beta
_{H
X}
Bias of Standard Error
Sce
nario
ii)
0.1
0.3
0.5
0.7
−0.0
15
−0.0
10
−0.0
05
0.00
0
0.00
5
0.01
0
beta
_{H
X}
Bias of Standard Error
Sce
nario
iii)
0.1
0.3
0.5
0.7
−0.0
4
−0.0
3
−0.0
2
−0.0
1
0.00
0.01
beta
_{H
X}
Bias of Standard Error
Sce
nario
iv)
Chapter 4
Conclusions and Future Work
The development of genotyping technologies makes the identification of most of the nu-
cleotide variations in the human genome possible and provides abundant data for disease
gene mapping. However, current widely used PCR-based genotyping techniques only al-
low experimenters to observe genotypes at one specific locus at a time. Therefore, when
haplotypes are among the risk factors in association studies of a disease, missing data on
genetic factors could arise due to phase ambiguity. A variety of statistical methods have
been developed within the GLM framework to relate haplotypes and non-genetic covariates
to the disease phenotype based on observed data for single-locus genotypes.
We have considered haplotype risk inference in case-control studies of a rare disease
in the presence of haplotype phase ambiguity and information on non-genetic risk factors.
We reviewed RML and compared three approximate score methods (PML, EE and MPSE)
that use approximate weights in the weighted RML score equations. We also proposed a
hybrid approach, which uses the MPSE parameter estimator and a PML variance estimator.
Our simulations adopted the two haplotype frequency configurations described in Ghadessi
(2005), which yield moderate and extreme levels of haplotype ambiguity, respectively. We
varied the sample size and considered four relatively modest levels of statistical interaction
between haplotype and non-genetic risk factors. Our simulation results were in general
agreement with those of Spinka et al., in that we showed PML is more biased than EE
or HYBRID in estimating the interaction effect. Such bias is likely due to the incorrect
PML weights, derived assuming HWP and independence of haplotypes and non-genetic
41
CHAPTER 4. CONCLUSIONS AND FUTURE WORK 42
factors in the pooled case-control sample. However, under moderate haplotype ambiguity,
the PML bias was comparable to the bias of logistic regression analysis of phase-known
data. A drawback of the EE approach is its conservative variance estimator, which leads to
conservative coverage of confidence intervals and low power to detect statistical interaction.
Overall, the hybrid approach performed best of the three approximate score methods in our
simulations.
There are several areas for future work. First, the hybird approach showed promise in
our simulations, but its variance estimator lacks justification. Implementing the correct
variance estimator would result in MPSE for rare diseases, which we could compare to HY-
BRID. Second, in simulations so far, haplotypes and the non-genetic factor were simulated
independently. Such independence is assumed in deriving the MPSE and EE weights, and
so these weights were correct for the simulated data. However, theoretical and empirical
results to date do not address the statistical properties of MPSE or EE when haplotypes
and nongenetic factors are dependent. We have begun simulations under such dependence,
but more work is required. Finally, our simulations only considered SNPs at three loci
and two sets of haplotype frequencies, and data were simulated and analyzed under a mul-
tiplicative disease risk model only. There are many other simulation configurations and
disease risk models (e.g. dominant or recessive models) that could be used to compare
these approaches.
Appendix A
Variable Probability Sampling
The sampling scheme given in Spinka et al. (2005) is called variable probability sampling
(VPS), motivated by nested case-control sampling, where a case-control sample is drawn
from a cohort of subjects. As the sampled cohort gets large, the cohort becomes a good
approximation to the general population so that the nested case-control study becomes a
good approximation to the population-based case-control study we’re trying to approximate.
Here we give an overview of VPS based on the description of Lawless et al. (1999), and
show that the conditional probability of H given X is the same under VPS and VSS.
A.1 Overview of VPS
Variable probability sampling is done in three stages:
stage 1: Sample a cohort of size nc from a population. Index subjects in the cohort by
j = 1, . . . , nc. Measure disease status D on all nc subjects in the cohort. Let M1
be the number of cases in the cohort and M0 be the number of controls. Then
Mi ∼ binomial(nc,pr(D = i)).
stage 2: Examine all nc of the subjects in the cohort and decide whether they will be
included in the case-control sample, conditional on their disease status. Subject j
with disease status i is included in the case-control sample with probability µi =
(ni/nc)/pr(D = i). Let Rj be an indicator variable with value 1 if subject j is
43
APPENDIX A. VARIABLE PROBABILITY SAMPLING 44
included in the sample and 0 otherwise. Then prvps(R = 1 | D = i) = µi, where prvps
denotes probability under VPS. In VPS, inclusion status R depends only on D and
not on covariates (H, X), so that prvps(R = 1 | H, X,D) = prvps(R = 1 | D). Hence
R and (H, X) are conditionally independent given D.
stage 3: Measure covariates (H, X) on those in the case-control sample; that is, measure
(Hj , Xj) on subjects with Rj = 1.
Let N1 be the number of cases included in the case-control sample and N0 be the number
of controls. Then Ni | Mi ∼ binomial(Mi, µi) and hence
Evps(Ni) = Evps(Evps(Ni | Mi))
= Evps (Miµi) = µiEvps(Mi) =ni/nc
pr(D = i)ncpr(D = i) = ni.
Under VPS the Ni are random and so is their sum N = N0 + N1. By contrast, under VSS
the size of the case-control sample is fixed at n.
The observed variables on the cohort members can be written as (Dj , Rj , RjHj , RjXj); j =
1, . . . , nc, to reflect the fact that (H, X) are not observed on those with R = 0. For those
of the cohort in the case-control sample, we observe (D, R = 1, RH = H, RX = X) and for
others in the cohort not in the case-control sample (D, R = 0, RH = 0, RX = 0).
Under VPS, the cohort is an iid sample from prvps(D, R,RH,RX). We then focus on D
and (RH, RX) in the subsample for which R = 1, giving an iid sample from prvps(D, RH, RX |R = 1) = prvps(D, H,X | R = 1).
A.2 Equivalence of Probabilities Under VPS and VSS
We first show that the joint distributions prvps(D, H,X | R = 1) and prv(D, H,X) are
equal. Recall that the conditional distribution of H and X given disease status D under
VSS is the same as under the true case-control sampling, which implies that prv(D =
i,H, X) = pr(H, X | D = i) ni/n. We now establish that
prvps(D = i,H, X | R = 1) = pr(H, X | D = i) ni/n, (A.1)
APPENDIX A. VARIABLE PROBABILITY SAMPLING 45
by showing
1. prvps(H, X | D = i, R = 1) = pr(H, X | D = i) and
2. prvps(D = i | R = 1) = ni/n.
Since the right-hand side of equation (A.1) is prv(D = i,H,X), it follows that prvps(D, H,X |R = 1) = prv(D, H,X) as desired.
Showing that prvps(H, X | D = i, R = 1) = pr(H, X | D = i):
First recall that, in VPS, (H, X) and R are conditionally independent given D. Hence
prvps(H, X | D, R = 1) = prvps(H, X | D) and all that remains to show is prvps(H, X |D) = pr(H, X | D). Now prvps(H, X | D = 0) and prvps(H, X | D = 1) describe the
covariate distributions in controls and in cases of the sampled cohort, respectively. Since
this cohort is drawn randomly from the population, prvps(H, X | D = 0) = pr(H, X | D = 0)
and prvps(H, X | D = 1) = pr(H, X | D = 1). In summary, prvps(H, X | D) = pr(H, X | D)
and hence prvps(H, X | D, R = 1) = pr(H, X | D).
Showing that prvps(D = i | R = 1) = ni/n:
We have
prvps(D = i, R = 1) = prvps(R = 1 | D = i)prvps(D = i) =ni/nc
pr(D = i)pr(D = i) =
ni
nc.
Hence prvps(R = 1) = n0/nc + n1/nc = n/nc, so that
prvps(D = i | R = 1) = prvps(D = i, R = 1)/prvps(R = 1) = ni/n.
The same reasoning can be used to show that prvps(D = i | R = 1) = ni/n whenever
prvps(R = 1 | D = i) = k ni/pr(D = i) for some constant k. However, only the choice
k = 1/nc leads to Evps(Ni) = ni.
Appendix B
Derivation of prv(H | X ; γv)
Recall that the conditional joint distribution of H and X given disease status is the same
under the true case-control sampling and under the variant sampling scheme:
pr(H, X | D;ϑ) = prv(H, X | D;ϑv).
Therefore, the joint distribution of H and X in the hypothetical population is given by
prv(H, X; γv) =1∑
i=0
prv(H, X | D = i;ϑv)prv(D = i)
=1∑
i=0
pr(H, X | D = i;ϑ)prv(D = i)
=1∑
i=0
pr(D = i | H, X;β0, β1)pr(H, X; γ)pr(D = i)
prv(D = i)
= pr(H, X; γ)1∑
i=0
pr(D = i | H, X;β0, β1)pr(D = i)
prv(D = i)
= pr(H, X; γ)1∑
i=0
{exp{i(β0 + z(H, X)β1)}1 + exp{β0 + z(H, X)β1}
prv(D = i)pr(D = i)
}
= pr(H, X; γ)prv(D = 0)pr(D = 0)
{1
1 + exp{β0 + z(H, X)β1}+
exp{β0 + z(H, X)β1}1 + exp{β0 + z(H, X)β1}
prv(D = 1)pr(D = 1)
pr(D = 0)prv(D = 0)
}(B.1)
The intercept term βv0 and β0 are related through βv0 = β0 + log{prv(D = 1)pr(D =
46
APPENDIX B. DERIVATION OF PRV (H | X ; γV ) 47
0)/{pr(D = 1)prv(D = 0)}} (Spinka et al. 2005). Thus
prv(H, X; γv) = pr(H, X; γ)prv(D = 0)pr(D = 0)
{1
1 + exp{β0 + z(H, X)β1}+
exp{β0 + z(H, X)β1}1 + exp{β0 + z(H, X)β1} exp(βv0 − β0)
}
=1 + exp{βv0 + z(H, X)β1}1 + exp{β0 + z(H, X)β1}
prv(D = 0)pr(D = 0)
pr(H, X; γ), (B.2)
which is dependent on the disease probability and the joint distribution of H and X in the
general population. The marginal distribution of X in the hypothetical population is the
sum of prv(H, X; γv) over all haplogenotypes, hence
prv(H | X; γv) =prv(H, X; γv)prv(X; γvx)
=rβ0,βv0,β1(H, X)pr(H, X; γ)∑
H′∈H rβ0,βv0,β1(H ′, X)pr(H ′, X; γ), (B.3)
where
rβ0,βv0,β1(H, X) =1 + exp{βv0 + z(H, X)β1}1 + exp{β0 + z(H, X)β1} .
Under the assumption of gene-environment independence in the population, we have
prv(H | X; γv) =rβ0,βv0,β1(H, X)pr(H; γh)∑
H′∈H rβ0,βv0,β1(H ′, X)pr(H ′; γh). (B.4)
Substituting equation (B.4) into the RML weights in equation (2.6) gives the MPSE weights
in equation (2.11).
Appendix C
Simulation Results
The full simulation results for βX , βH and βHX from the four scenarios are summarized in
four tables. The “Bias” item in the Property column indicates the estimated bias of an esti-
mator of a log odds-ratio b from an approach, and the “Bias(SE)” item is the estimated bias
of the associated standard error. “CP” and “Power” are the estimated coverage probability
of the 95% confidence interval and the estimated power for detecting an effect, respectively.
Only the model with main effects and interactions, specified in equation (3.2), was fit to
the data. This model includes main effects and interactions, and hence significance tests of
main effects would be in the presence of an interaction. Since such tests are not typically
meaningful, we present estimated power for tests of interaction (βHX) only. The simulation
error for each estimate is given in brackets after the estimate. Estimates within simulation
error of the nominal value are indicated with asterisks.
48
APPENDIX C. SIMULATION RESULTS 49
Table C.1: Simulation results for scenario i)
βHX Risk Property GLM PML EE HYBRID0.1 βX Bias 0.00033 (0.00251)∗ 0.00333 (0.00304) 0.00193 (0.00303)∗ 0.00183 (0.00303)∗
Bias(SE) -0.00118 (0.00009) -0.00402 (0.00016) 0.08309 (0.00162) -0.00532 (0.00015)CP 0.94820 (0.00443)∗ 0.94850 (0.00442)∗ 0.99080 (0.00191) 0.94640 (0.00450)∗
βH Bias 0.00790 (0.00179) 0.01066 (0.00225) 0.00938 (0.00224) 0.00925 (0.00223)Bias(SE) 0.00077 (0.00003) 0.00212 (0.00007) 0.01030 (0.00033) 0.00236 (0.00007)CP 0.95220 (0.00427)∗ 0.95820 (0.00400) 0.96500 (0.00368) 0.95750 (0.00403)
βHX Bias 0.00141 (0.00183)∗ 0.00081 (0.00239)∗ -0.00002 (0.00235)∗ 0.00021 (0.00236)∗
Bias(SE) -0.00062 (0.00005) -0.00359 (0.00011) 0.06816 (0.00130) -0.00324 (0.00011)CP 0.95090 (0.00432)∗ 0.94630 (0.00451)∗ 0.99050 (0.00194) 0.94670 (0.00449)∗
Power 0.19840 (0.00798) 0.14250 (0.00699) 0.02850 (0.00333) 0.14490 (0.00704)0.3 βX Bias -0.00012 (0.00255)∗ 0.00353 (0.00308) 0.00152 (0.00306)∗ 0.00132 (0.00306)∗
Bias(SE) -0.00061 (0.00010) -0.00250 (0.00017) 0.11068 (0.00187) -0.00374 (0.00016)CP 0.94870 (0.00441)∗ 0.94840 (0.00442)∗ 0.99500 (0.00141) 0.94660 (0.00450)∗
βH Bias 0.00397 (0.00184) 0.00522 (0.00228) 0.00579 (0.00227) 0.00578 (0.00226)Bias(SE) 0.00091 (0.00003) 0.00370 (0.00007) 0.01377 (0.00035) 0.00371 (0.00007)CP 0.95480 (0.00415) 0.95770 (0.00403) 0.96840 (0.00350) 0.95810 (0.00401)
βHX Bias 0.00329 (0.00188) 0.00412 (0.00243) 0.00144 (0.00239)∗ 0.00187 (0.00239)∗
Bias(SE) -0.00022 (0.00006) -0.00204 (0.00012) 0.08979 (0.00150) -0.00140 (0.00012)CP 0.94960 (0.00438)∗ 0.94860 (0.00442)∗ 0.99390 (0.00156) 0.94950 (0.00438)∗
Power 0.89750 (0.00607) 0.72730 (0.00891) 0.28670 (0.00904) 0.72820 (0.00890)0.5 βX Bias -0.00441 (0.00264) 0.00017 (0.00321)∗ -0.00123 (0.00315)∗ -0.00153 (0.00314)∗
Bias(SE) -0.00076 (0.00010) -0.00268 (0.00019) 0.15368 (0.00229) -0.00309 (0.00018)CP 0.95110 (0.00431)∗ 0.95080 (0.00433)∗ 0.99610 (0.00125) 0.94930 (0.00439)∗
βH Bias 0.00458 (0.00187) -0.00035 (0.00236)∗ 0.00555 (0.00234) 0.00586 (0.00233)Bias(SE) 0.00309 (0.00004) 0.00492 (0.00008) 0.01917 (0.00039) 0.00494 (0.00008)CP 0.95810 (0.00401) 0.95850 (0.00399) 0.97540 (0.00310) 0.95880 (0.00398)
βHX Bias 0.00805 (0.00198) 0.00924 (0.00257) 0.00487 (0.00250) 0.00546 (0.00249)Bias(SE) -0.00086 (0.00007) -0.00309 (0.00014) 0.12125 (0.00183) -0.00114 (0.00015)CP 0.94850 (0.00442)∗ 0.94560 (0.00454)∗ 0.99620 (0.00123) 0.94950 (0.00438)∗
Power 0.99910 (0.00060) 0.98140 (0.00270) 0.61360 (0.00974) 0.98210 (0.00265)0.7 βX Bias -0.00483 (0.00280) -0.00398 (0.00342) -0.00286 (0.00328)∗ -0.00307 (0.00325)∗
Bias(SE) -0.00055 (0.00012) -0.00253 (0.00022) 0.20046 (0.00258) -0.00058 (0.00019)CP 0.95220 (0.00427)∗ 0.95230 (0.00426)∗ 0.99870 (0.00072) 0.95340 (0.00422)∗
βH Bias 0.00352 (0.00201) -0.01720 (0.00249) 0.00059 (0.00248)∗ 0.00122 (0.00248)∗
Bias(SE) 0.00326 (0.00004) 0.00600 (0.00009) 0.02586 (0.00048) 0.00569 (0.00009)CP 0.95790 (0.00402) 0.96020 (0.00391) 0.98050 (0.00277) 0.96470 (0.00369)
βHX Bias 0.00985 (0.00213) 0.01377 (0.00275) 0.00853 (0.00261) 0.00897 (0.00260)Bias(SE) -0.00106 (0.00008) -0.00297 (0.00017) 0.15517 (0.00204) 0.00127 (0.00014)CP 0.94740 (0.00446)∗ 0.94700 (0.00448)∗ 0.99800 (0.00089) 0.95250 (0.00425)∗
Power 1.00000 (0.00000) 0.99960 (0.00040) 0.76000 (0.00854) 0.99940 (0.00049)∗ Bias, Bias(SE) or CP is within simulation error of the nominal value
APPENDIX C. SIMULATION RESULTS 50
Table C.2: Simulation results for scenario ii)
βHX Risk Property GLM PML EE HYBRID0.1 βX Bias 0.00005 (0.00363)∗ 0.00039 (0.00445)∗ -0.00009 (0.00437)∗ -0.00042 (0.00438)∗
Bias(SE) -0.00396 (0.00019) -0.01036 (0.00033) 0.14237 (0.00331) -0.00991 (0.00031)CP 0.94730 (0.00447)∗ 0.94530 (0.00455) 0.98910 (0.00208) 0.94610 (0.00452)∗
βH Bias 0.01997 (0.00248) 0.02868 (0.00307) 0.02660 (0.00304) 0.02632 (0.00304)Bias(SE) 0.00485 (0.00006) 0.01041 (0.00014) 0.03773 (0.00098) 0.01082 (0.00014)CP 0.95640 (0.00408) 0.96490 (0.00368) 0.97680 (0.00301) 0.96670 (0.00359)
βHX Bias 0.00178 (0.00265)∗ 0.00399 (0.00346) 0.00186 (0.00336)∗ 0.00237 (0.00337)∗
Bias(SE) -0.00200 (0.00011) -0.00690 (0.00023) 0.11630 (0.00263) -0.00408 (0.00022)CP 0.94650 (0.00450)∗ 0.94630 (0.00451)∗ 0.98960 (0.00203) 0.94740 (0.00446)∗
Power 0.11960 (0.00649) 0.09430 (0.00584) 0.01860 (0.00270) 0.09160 (0.00577)0.3 βX Bias -0.00031 (0.00368)∗ 0.00205 (0.00448)∗ 0.00317 (0.00440)∗ 0.00276 (0.00440)∗
Bias(SE) -0.00279 (0.00020) -0.00741 (0.00034) 0.17497 (0.00387) -0.00732 (0.00033)CP 0.94980 (0.00437)∗ 0.94900 (0.00440)∗ 0.99140 (0.00185) 0.95020 (0.00435)∗
βH Bias 0.01891 (0.00254) 0.02414 (0.00314) 0.02411 (0.00311) 0.02390 (0.00311)Bias(SE) 0.00490 (0.00007) 0.01117 (0.00015) 0.04094 (0.00107) 0.01120 (0.00015)CP 0.95610 (0.00410) 0.96410 (0.00372) 0.97720 (0.00299) 0.96460 (0.00370)
βHX Bias 0.00651 (0.00274) 0.00949 (0.00357) 0.00286 (0.00346)∗ 0.00363 (0.00347)Bias(SE) -0.00247 (0.00012) -0.00717 (0.00025) 0.13899 (0.00308) -0.00423 (0.00024)CP 0.95100 (0.00432)∗ 0.94540 (0.00454) 0.99190 (0.00179) 0.94600 (0.00452)∗
Power 0.62610 (0.00968) 0.44170 (0.00993) 0.12480 (0.00661) 0.44280 (0.00993)0.5 βX Bias -0.00390 (0.00377) -0.00202 (0.00462)∗ 0.00337 (0.00449)∗ 0.00258 (0.00447)∗
Bias(SE) -0.00067 (0.00021) -0.00532 (0.00037) 0.22053 (0.00427) -0.00396 (0.00038)CP 0.95040 (0.00434)∗ 0.95110 (0.00431)∗ 0.99440 (0.00149) 0.95130 (0.00430)∗
βH Bias 0.01519 (0.00270) 0.01215 (0.00337) 0.01779 (0.00332) 0.01791 (0.00332)Bias(SE) 0.00308 (0.00007) 0.00716 (0.00017) 0.04270 (0.00119) 0.00757 (0.00016)CP 0.95480 (0.00415) 0.96160 (0.00384) 0.97690 (0.00300) 0.96130 (0.00386)
βHX Bias 0.01251 (0.00284) 0.01818 (0.00373) 0.00604 (0.00358) 0.00722 (0.00358)Bias(SE) -0.00105 (0.00013) -0.00686 (0.00028) 0.17203 (0.00340) -0.00227 (0.00032)CP 0.94990 (0.00436)∗ 0.94420 (0.00459) 0.99160 (0.00183) 0.94810 (0.00444)∗
Power 0.95720 (0.00405) 0.83200 (0.00748) 0.33520 (0.00944) 0.82820 (0.00754)0.7 βX Bias -0.01069 (0.00398) -0.01409 (0.00500) -0.00232 (0.00480)∗ -0.00324 (0.00476)∗
Bias(SE) 0.00003 (0.00024)∗ -0.00946 (0.00043) 0.27108 (0.00473) -0.00640 (0.00040)CP 0.95160 (0.00429)∗ 0.94650 (0.00450)∗ 0.99470 (0.00145) 0.94940 (0.00438)∗
βH Bias 0.01052 (0.00285) -0.00764 (0.00353) 0.01232 (0.00350) 0.01295 (0.00349)Bias(SE) 0.00545 (0.00009) 0.00943 (0.00019) 0.05259 (0.00122) 0.00993 (0.00019)CP 0.95870 (0.00398) 0.96370 (0.00374) 0.98460 (0.00246) 0.96320 (0.00377)
βHX Bias 0.02092 (0.00308) 0.03136 (0.00410) 0.01379 (0.00386) 0.01479 (0.00385)Bias(SE) -0.00301 (0.00016) -0.01221 (0.00033) 0.20727 (0.00372) -0.00431 (0.00033)CP 0.94930 (0.00439)∗ 0.93750 (0.00484) 0.99150 (0.00184) 0.95030 (0.00435)∗
Power 0.99800 (0.00089) 0.96600 (0.00362) 0.49910 (0.01000) 0.96580 (0.00363)∗ Bias, Bias(SE) or CP is within simulation error of the nominal value
APPENDIX C. SIMULATION RESULTS 51
Table C.3: Simulation results for scenario iii)
βHX Risk Property GLM PML EE HYBRID0.1 βX Bias 0.00266 (0.00353)∗ 0.01217 (0.00497) 0.00532 (0.00474) 0.00673 (0.00497)
Bias(SE) -0.00128 (0.00016) -0.01021 (0.00038) 0.06581 (0.00293) -0.01373 (0.00036)CP 0.94990 (0.00436)∗ 0.94720 (0.00447)∗ 0.95040 (0.00434)∗ 0.94110 (0.00471)
βH Bias 0.00252 (0.00283)∗ 0.00233 (0.00430)∗ 0.10597 (0.00358) 0.00001 (0.00427)∗
Bias(SE) -0.00161 (0.00006) 0.00238 (0.00020) -0.00429 (0.00059) 0.00339 (0.00020)CP 0.94710 (0.00448)∗ 0.95380 (0.00420)∗ 0.89020 (0.00625) 0.95410 (0.00419)∗
βHX Bias -0.00040 (0.00286)∗ 0.00130 (0.00463)∗ -0.00276 (0.00386)∗ -0.00221 (0.00442)∗
Bias(SE) -0.00203 (0.00011) -0.01062 (0.00031) 0.10739 (0.00293) -0.00038 (0.00033)CP 0.94860 (0.00442)∗ 0.94360 (0.00461) 0.98490 (0.00244) 0.95250 (0.00425)∗
Power 0.10520 (0.00614) 0.07030 (0.00511) 0.02430 (0.00308) 0.06330 (0.00487)0.3 βX Bias 0.00060 (0.00352)∗ 0.02051 (0.00486) -0.00037 (0.00471)∗ 0.00743 (0.00487)
Bias(SE) -0.00014 (0.00016)∗ -0.00413 (0.00038) 0.10020 (0.00336) -0.00799 (0.00037)CP 0.95180 (0.00428)∗ 0.95330 (0.00422)∗ 0.96240 (0.00380) 0.94860 (0.00442)∗
βH Bias 0.00282 (0.00280) -0.00543 (0.00425) 0.10520 (0.00359) -0.00176 (0.00423)∗
Bias(SE) 0.00122 (0.00006) 0.00633 (0.00020) 0.00868 (0.00072) 0.00734 (0.00020)CP 0.95210 (0.00427)∗ 0.96010 (0.00391) 0.90670 (0.00582) 0.95990 (0.00392)
βHX Bias 0.00524 (0.00285) 0.00922 (0.00448) 0.00302 (0.00389)∗ 0.00204 (0.00433)∗
Bias(SE) 0.00053 (0.00011) -0.00203 (0.00032) 0.13068 (0.00336) 0.00697 (0.00036)CP 0.95340 (0.00422)∗ 0.95110 (0.00431)∗ 0.98760 (0.00221) 0.95770 (0.00403)Power 0.56800 (0.00991) 0.27820 (0.00896) 0.13530 (0.00684) 0.26000 (0.00877)
0.5 βX Bias 0.00441 (0.00353) 0.04084 (0.00492) -0.00999 (0.00473) 0.01493 (0.00490)Bias(SE) 0.00126 (0.00016) -0.00419 (0.00039) 0.15706 (0.00393) -0.00820 (0.00037)CP 0.95230 (0.00426)∗ 0.95080 (0.00433)∗ 0.97540 (0.00310) 0.94570 (0.00453)∗
βH Bias -0.00156 (0.00287)∗ -0.02588 (0.00431) 0.09173 (0.00364) -0.01256 (0.00434)Bias(SE) 0.00068 (0.00007) 0.00514 (0.00020) 0.03227 (0.00093) 0.00553 (0.00020)CP 0.95390 (0.00419)∗ 0.95650 (0.00408) 0.94700 (0.00448)∗ 0.95710 (0.00405)
βHX Bias 0.00655 (0.00296) 0.00247 (0.00457)∗ 0.01035 (0.00394) 0.00279 (0.00446)∗
Bias(SE) -0.00160 (0.00013) -0.00468 (0.00034) 0.16268 (0.00372) 0.00459 (0.00039)CP 0.94890 (0.00440)∗ 0.94590 (0.00452)∗ 0.99080 (0.00191) 0.95200 (0.00428)∗
Power 0.93910 (0.00478) 0.62040 (0.00971) 0.35470 (0.00957) 0.61840 (0.00972)0.7 βX Bias 0.00478 (0.00371) 0.05732 (0.00512) -0.03260 (0.00471) 0.01857 (0.00502)
Bias(SE) -0.00491 (0.00017) -0.00848 (0.00041) 0.22289 (0.00435) -0.01140 (0.00039)CP 0.94470 (0.00457) 0.94490 (0.00456) 0.98320 (0.00257) 0.94740 (0.00446)∗
βH Bias -0.00333 (0.00297) -0.04869 (0.00445) 0.08578 (0.00381) -0.01693 (0.00451)Bias(SE) -0.00060 (0.00007) -0.00003 (0.00021)∗ 0.05940 (0.00114) 0.00223 (0.00021)CP 0.95000 (0.00436)∗ 0.94480 (0.00457) 0.96140 (0.00385) 0.95180 (0.00428)∗
βHX Bias 0.01060 (0.00313) -0.00782 (0.00467) 0.02696 (0.00396) 0.00990 (0.00459)Bias(SE) -0.00495 (0.00014) -0.00620 (0.00038) 0.19813 (0.00419) 0.00514 (0.00044)CP 0.94390 (0.00460) 0.94660 (0.00450)∗ 0.99210 (0.00177) 0.95500 (0.00415)Power 0.99850 (0.00077) 0.87320 (0.00665) 0.55620 (0.00994) 0.87890 (0.00652)
∗ Bias, Bias(SE) or CP is within simulation error of the nominal value
APPENDIX C. SIMULATION RESULTS 52
Table C.4: Simulation results for scenario iv)
βHX Risk Property GLM PML EE HYBRID0.1 βX Bias 0.00266 (0.00514)∗ 0.00954 (0.00737) -0.00113 (0.00679)∗ 0.00605 (0.00714)∗
Bias(SE) -0.00588 (0.00033) -0.02287 (0.00079) 0.13934 (0.00495) -0.01869 (0.00074)CP 0.94800 (0.00444)∗ 0.94690 (0.00448)∗ 0.96380 (0.00374) 0.94550 (0.00454)∗
βH Bias 0.01093 (0.00399) 0.01364 (0.00622) 0.15348 (0.00515) 0.00754 (0.00613)Bias(SE) 0.00001 (0.00013)∗ 0.00262 (0.00041) 0.01857 (0.00149) 0.00622 (0.00042)CP 0.95380 (0.00420)∗ 0.95420 (0.00418) 0.91140 (0.00568) 0.95750 (0.00403)
βHX Bias 0.00317 (0.00419)∗ 0.01122 (0.00702) 0.00408 (0.00566)∗ 0.00336 (0.00645)∗
Bias(SE) -0.00694 (0.00022) -0.02952 (0.00065) 0.17586 (0.00484) -0.00123 (0.00073)CP 0.94410 (0.00459) 0.93920 (0.00478) 0.98530 (0.00241) 0.95600 (0.00410)Power 0.07780 (0.00536) 0.05830 (0.00469) 0.01590 (0.00250) 0.04340 (0.00408)
0.3 βX Bias 0.00290 (0.00513)∗ 0.01840 (0.00739) -0.00329 (0.00686)∗ 0.01248 (0.00712)Bias(SE) -0.00353 (0.00033) -0.02180 (0.00081) 0.16555 (0.00531) -0.01577 (0.00075)CP 0.95180 (0.00428)∗ 0.94840 (0.00442)∗ 0.96330 (0.00376) 0.94760 (0.00446)∗
βH Bias 0.00454 (0.00401) -0.00332 (0.00617)∗ 0.14524 (0.00514) -0.00043 (0.00613)∗
Bias(SE) 0.00113 (0.00013) 0.00653 (0.00041) 0.03476 (0.00164) 0.00863 (0.00041)CP 0.95220 (0.00427)∗ 0.95790 (0.00402) 0.92640 (0.00522) 0.95890 (0.00397)
βHX Bias 0.01088 (0.00422) 0.02668 (0.00690) 0.01062 (0.00567) 0.00874 (0.00641)Bias(SE) -0.00527 (0.00024) -0.02263 (0.00066) 0.19426 (0.00529) 0.00444 (0.00080)CP 0.94990 (0.00436)∗ 0.94230 (0.00466) 0.98630 (0.00232) 0.95660 (0.00408)Power 0.32410 (0.00936) 0.16950 (0.00750) 0.06450 (0.00491) 0.13860 (0.00691)
0.5 βX Bias 0.00587 (0.00524) 0.03295 (0.00759) -0.01407 (0.00687) 0.02095 (0.00727)Bias(SE) -0.00690 (0.00034) -0.02757 (0.00082) 0.21657 (0.00603) -0.02158 (0.00081)CP 0.95130 (0.00430)∗ 0.94720 (0.00447)∗ 0.97270 (0.00326) 0.94630 (0.00451)∗
βH Bias 0.00366 (0.00407)∗ -0.02054 (0.00638) 0.13758 (0.00531) -0.00539 (0.00635)∗
Bias(SE) 0.00203 (0.00014) -0.00097 (0.00042) 0.05635 (0.00201) 0.00262 (0.00046)CP 0.95350 (0.00421)∗ 0.95350 (0.00421)∗ 0.94360 (0.00461) 0.95480 (0.00415)
βHX Bias 0.01367 (0.00434) 0.03298 (0.00705) 0.02134 (0.00583) 0.01455 (0.00663)Bias(SE) -0.00666 (0.00026) -0.02696 (0.00070) 0.21995 (0.00595) 0.00136 (0.00131)CP 0.94480 (0.00457) 0.93730 (0.00485) 0.98540 (0.00240) 0.95270 (0.00425)∗
Power 0.69010 (0.00925) 0.37840 (0.00970) 0.18060 (0.00769) 0.33450 (0.00944)0.7 βX Bias 0.00719 (0.00534) 0.05515 (0.00780) -0.03772 (0.00692) 0.03493 (0.00744)
Bias(SE) -0.00777 (0.00035) -0.03033 (0.00087) 0.26294 (0.00604) -0.02584 (0.00119)CP 0.94870 (0.00441)∗ 0.94600 (0.00452)∗ 0.97570 (0.00308) 0.94330 (0.00463)
βH Bias -0.00436 (0.00419) -0.05427 (0.00640) 0.11930 (0.00540) -0.01986 (0.00642)Bias(SE) 0.00149 (0.00014) -0.00045 (0.00042) 0.09126 (0.00232) 0.00497 (0.00045)CP 0.95430 (0.00418) 0.95120 (0.00431)∗ 0.95870 (0.00398) 0.95570 (0.00412)
βHX Bias 0.01775 (0.00450) 0.01673 (0.00723) 0.03438 (0.00597) 0.01418 (0.00698)Bias(SE) -0.00772 (0.00029) -0.03328 (0.00078) 0.24170 (0.00602) -0.00733 (0.00182)CP 0.94670 (0.00449) 0.93360 (0.00498) 0.98680 (0.00228) 0.94800 (0.00444)∗
Power 0.92010 (0.00542) 0.59720 (0.00981) 0.32600 (0.00937) 0.57510 (0.00989)∗ Bias, Bias(SE) or CP is within simulation error of the nominal value
Bibliography
Akey J, Jin L, Xiong M (2001). Haplotypes vs single marker linkage disequilibrium tests:
what do we gain? European Journal of Human Genetics, 9:291–300.
Burkett K (2002). Logistic Regression with Missing Haplotypes. Master’s thesis, Simon
Fraser University, Burnaby, BC.
Burkett K, Graham J, McNeney B (2006). hapassoc: Software for Likelihood Inference of
Trait Associations with SNP Haplotypes and Other Attributes. Journal of Statistical
Software, 16(2).
Burkett K, McNeney B, Graham J (2004). A note on inference of trait associations with
SNP haplotypes and other attributes in generalized linear models. Human Heredity,
57:200–206.
Chatterjee N, Carroll RJ (2005). Semiparametric maximum-likelihood estimation exploiting
gene-environment independence in case-control studies. Biometrika, 92:399–418.
Epstein MP, Satten GA (2003). Inference on haplotype effects in case-control studies using
unphased genotype data. American Journal of Human Genetics, 73:1316–1329.
Excoffier L, Slatkin M (1995). Maximum-likelihood estimation of molecular haplotype fre-
quencies in a diploid population. Molecular Biology and Evolution, 12:921–927.
Ghadessi M (2005). A comparison of two logistic regression approaches for case-control
data with missing haplotypes. Master’s thesis, Simon Fraser University, Burnaby, BC.
Ibrahim JG (1990). Incomplete data in generalized linear models. Journal of the American
Statistical Association, 85:765–769.
53
BIBLIOGRAPHY 54
Lake SL, Lyon H, Tantisira K, Silverman EK, Weiss ST, Laird NM, Schaid DJ (2003).
Estimation and tests of haplotype-environment interaction when linkage phase is am-
biguous. Human Heredity, 55:56–65.
Lawless JF, Kalbfleisch JD, Wild CJ (1999). Semiparametric methods for response-selective
and missing data problems in regression. Journal of the Royal Statistical Society, B,
61:413–438.
Louis TA (1982). Finding the observed information matrix when using the EM algorithm.
Journal of the Royal Statistical Society, B, 44:226–233.
Prentice RL, Pyke R (1979). Logistic disease incidence models and case-control studies.
Biometrika, 66:403–411.
Schaid DJ, Rowland CM, Tines DE, Jacobson RM, Poland GA (2002). Score tests for
association between traits and haplotypes when linkage phase is ambiguous. American
Journal of Human Genetics, 70:425–434.
Shin J, McNeney B, Graham J (2006). Likelihood inference in case-control studies of
a rare disease under independence of genetic and continuous non-genetic covariates.
Submitted.
Smith PG, Day NE (1984). The design of case-control studies: the influence of confounding
and interaction effects. International Journal of Epidemiology, 13:356–365.
Spinka C, Carroll RJ, Chatterjee N (2005). Analysis of case-control studies of genetic and
environmental factors with missing genetic information and haplotype-phase ambiguity.
Genetic Epidemiology, 29:108–127.
Stephens M, Smith NJ, Donnelly P (2001). A new statistical method for haplotype recon-
struction from population data. American Journal of Human Genetics, 68:978–989.
Stram DO, Haiman CA, Hirschhorn JN, Altshuler D, Kolonel LN, Henderson BE, Pike MC
(2003). Choosing haplotype-tagging SNPS based on unphased genotype data using a
preliminary sample of unrelated subjects with an example from the Multiethnic Cohort
Study. Human Heredity, 55:27–36.
BIBLIOGRAPHY 55
Zhao LP, Li SS, Khalid N (2003). A method for the assessment of disease associations with
single-nucleotide polymorphism haplotypes and environmental variables in case-control
studies. American Journal of Human Genetics, 72:1231–1250.