TWO PAPERS ON MONTE CARLO ESTIMATION OF
MODELS FOR COMPLEX GENETIC TRAITS
by
Sun Wei Guo  Elizabeth A. Thompson
TECHNICAL REPORT No. 229
April 1992
Department of Statistics, GN-22
University of Washington
Seattle, Washington 98195 USA
Two Papers on Monte Carlo Estimation of
Models for Complex Genetic Traits *
Sun Wei Guo Elizabeth A. Thompson
April, 1992
Abstract
In human quantitative genetics, computational complexity restricts the current
methods for estimation of models for complex genetic traits. The two papers in
this technical report continue the development of Markov chain Monte Carlo
methods to accomplish this estimation. The papers here have been submitted for
publication. They are based on work developed in Sun Wei Guo's Ph.D.
dissertation (Guo, 1991), and continuing under a grant to E. A. Thompson from
the National Institutes of Health (Thompson and Wijsman, 1990). Two previous
papers have been published: one on the Monte Carlo estimation of likelihood
ratios (Thompson and Guo, 1991), the second on Monte Carlo estimation of
variance component models for quantitative traits (Guo and Thompson, 1991).
Contents
Monte Carlo Estimation of Mixed Models for Large Complex Pedigrees
A Monte Carlo Method for Combined Segregation and Linkage Analysis
References
Guo, S. W. (1991) Monte Carlo methods in the quantitative genetics. Unpublished
Ph.D. dissertation. Dept. of Biostatistics, University of Washington.
Guo, S. W. and Thompson, E. A. (1991) Monte Carlo estimation of variance
component models. IMA J. Math. Appl. Med. Biol. 8: 171-189.
Thompson, E. A. and Guo, S. W. (1991) Evaluation of likelihood ratios for complex
genetic models. IMA J. Math. Appl. Med. Biol. 8: 149-169.
Thompson, E. A. and Wijsman, E. M. (1990) A Gibbs sampler approach to the
likelihood analysis of complex genetic models. Technical report 193, Department
of Statistics, University of Washington.
Monte Carlo Estimation of Mixed Models for
Large Complex Pedigrees *
Sun Wei Guo¹  Elizabeth A. Thompson¹,²
¹Department of Biostatistics, SC-32
²Department of Statistics, GN-22
University of Washington
Seattle, Washington 98195
U. S. A.
Abstract
In human quantitative genetics, computational complexity restricts the current
methods for estimation of mixed models which include major gene effects to
data on small pedigrees. However, large complex pedigrees are not uncommon
in practice. Also, large pedigrees tend to provide more information on genetic
transmission and are more genetically homogeneous than a pooled sample of
many nuclear families. ... Gibbs sampler, for estimation ... EM algorithm and
... approach also provides a ... mixed models. The ... asymptotic
variance-covariance ... methods are conceptually ..., easy to implement and can
handle multiple heritable/non-heritable random components. A numerical example
... to illustrate ...
Key words: EM algorithm, ...
1 Introduction
The mixed model has a long history in human genetics (Elston and Stewart, 1971;
Morton and MacLean, 1974). This model partitions the variation in a quantitative
trait into three sources: the effect of a single major gene of large effect, residual
additive heritable effects of polygenic loci, and the independent random effects of
the environment. Numerous applications of the mixed model in human genetics have
been published (see, for example, Leppert et al. 1986). In the field of plant/animal
breeding, the mixed model can be used to study the genetic basis of quantitative
traits such as disease resistance.
Sample sizes being equal, a single large pedigree tends to provide more information
on major gene transmission than a pooled sample of many nuclear families. For those
traits that involve mitochondrial (matrilineal) inheritance or other effects that provide
long-range dependence, a single large pedigree is more suitable for study, not only
because it provides more information on the transmission but also because these
effects can be obscured by other familial correlations in nuclear families. Moreover,
large pedigrees tend to be more genetically and environmentally homogeneous than
a pooled sample of nuclear families. However, most of the methods proposed
so far for estimating mixed models are restricted mainly to data on nuclear families
or small pedigrees. This is due to the formidable computational burden in
the evaluation of the likelihood. Morton and MacLean (1974) proposed a numerical
method of ... is not ... model to incorporate ... similar to ... is an important
... likelihoods, a formidable ...
Hasstedt (1982) proposed an approximation method of ... the
likelihood ... for ... by ...
the likelihood surface. Although Hasstedt (1982) showed that the approximation
works on small pedigrees, Thompson and Wijsman (1990) pointed out that the
approximation can be sensitive to parameters of the procedure and thus warrants
further investigation. Furthermore, the approximation is not general, in the sense
that it can only deal with a mixed model with major gene(s) and additive polygenic
effects. Since most complex quantitative traits are believed to be influenced by a
number of heritable/non-heritable random effects, a more general method is needed.
In a recent paper, Thompson and Guo (1991) proposed Monte Carlo evaluation of
the likelihood ratios of mixed models on complex pedigrees by using the Gibbs sampler
(Geman and Geman, 1984). Their method provides a tractable and efficient approach
to likelihood ratio evaluation for mixed models and other complex genetic models.
In this paper, we shall show how the Gibbs sampler, in conjunction with the EM
algorithm, can be used to estimate the parameters of mixed models and their standard
errors. This method, coupled with the likelihood evaluation of Thompson and
Guo (1991), provides an integrated approach to estimation and hypothesis testing for
mixed models.
A simulated data set on a large pedigree is described in section 3. In section 4
we provide a summary discussion.
2 The Method
2.1 Notation and assumptions
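$$P_\theta(G) = \prod_{\text{founders } i} P_\theta(G_i) \prod_{\text{non-founders } j} P(G_j \mid G_{f_j}, G_{m_j})$$
... parameter $p$ is involved, although ... is ... or ... depending on whether
individual ... $\mathbf{1}_F$ is an indicator function, taking the value 1 or 0
depending on whether $j$ is a founder or non-founder, ... the probability of a
genotypic configuration $G$ can be written as
$$P_\theta(G) = h(G)\, p^{\,2\mathbf{1}_F'\mathbf{1}_1 + \mathbf{1}_F'\mathbf{1}_2}\,(1-p)^{\,\mathbf{1}_F'\mathbf{1}_2 + 2\mathbf{1}_F'\mathbf{1}_3} = h(G)\exp\Big[(2\mathbf{1}_F'\mathbf{1}_1 + \mathbf{1}_F'\mathbf{1}_2)\log\frac{p}{1-p} + 2\,\mathbf{1}_F'\mathbf{1}\log(1-p)\Big] \qquad (1)$$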
where $\mathbf{1}$ is a vector of ones, and $h(G)$ does not depend on $p$. The vectors
$\mathbf{1}_i$ ($i = 1, 2, 3$) are (of course) functions of $G$, but for notational
convenience we leave this dependence implicit.
For given $G$, the simple mixed model can be specified, in vector notation, as
$$y = \sum_{i=1}^{3} \mu_i \mathbf{1}_i + a + e \qquad (2)$$
where $a$ is a vector of additive genetic effects, normally distributed as $N_n(0, \sigma_a^2 A)$,
where $A$ is the numerator relationship matrix (Henderson, 1976), and $e$ is a vector
of error effects (or individual environmental effects), normally distributed as $N_n(0, \sigma_e^2 I)$.
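As an illustration of the matrix $A$ in model (2), the following is a minimal sketch of the standard tabular recursion for the numerator relationship matrix. It is not taken from the report: the function name `relationship_matrix`, the parent-index encoding (`-1` for an unknown parent) and the requirement that parents precede offspring are all assumptions made for this example.

```python
import numpy as np


def relationship_matrix(father, mother):
    """Numerator relationship matrix A via the tabular recursion.

    father[i], mother[i] are the indices of i's parents, or -1 for a
    founder. Assumption of this sketch: parents are listed before
    their offspring.
    """
    n = len(father)
    A = np.zeros((n, n))
    for i in range(n):
        f, m = father[i], mother[i]
        # Diagonal: 1 plus half the relationship between the parents.
        A[i, i] = 1.0 + (0.5 * A[f, m] if f >= 0 and m >= 0 else 0.0)
        for j in range(i):
            # Off-diagonal: average of j's relationship to i's parents.
            val = 0.0
            if f >= 0:
                val += 0.5 * A[j, f]
            if m >= 0:
                val += 0.5 * A[j, m]
            A[i, j] = A[j, i] = val
    return A


# Two founders (0, 1) and their two offspring (2, 3).
A = relationship_matrix([-1, -1, 0, 0], [-1, -1, 1, 1])
```

In this small nuclear family the entry for the two full siblings is 1/2 and each parent-offspring entry is 1/2, which is the covariance structure (up to the factor $\sigma_a^2$) assumed for $a$.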
2.2 The EM equations
The parameters to be estimated are the gene frequency $p$, the major gene effects
$\mu_i$ ($i = 1, 2, 3$), and the polygenic and environmental variances, $\sigma_a^2$ and $\sigma_e^2$.
The likelihood for model (2) is ... where $G$ is a major-gene genotypic configuration
on ... and the sum is over ... Throughout, $f$ ... and $P$ a discrete probability
distribution. ... $G$, ... error components ... (Ott, ...).
Estimation of $\theta$ is a "missing ... $G$ ...
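... based on the "complete data" $(y, G, a)$ is:
$$f_\theta(y \mid G, a)\,P_\theta(G)\,f_\theta(a) = c(\theta)\,h(G)\exp\Big[(2\mathbf{1}_F'\mathbf{1}_1 + \mathbf{1}_F'\mathbf{1}_2)\log\frac{p}{1-p} - \frac{1}{2\sigma_a^2}a'A^{-1}a - \frac{1}{2\sigma_e^2}e'e\Big], \qquad e = y - \sum_{i=1}^{3}\mu_i\mathbf{1}_i - a,$$
where $c(\theta)$ is a function of $\theta$ but not $G$.
Thus the natural sufficient statistics for $\theta$ are $2\mathbf{1}_F'\mathbf{1}_1 + \mathbf{1}_F'\mathbf{1}_2$, $\mathbf{1}_i'(y - a)$, $\mathbf{1}_i'\mathbf{1}$
($i = 1, 2, 3$), $a'A^{-1}a/n$ and $e'e/n$, whose unconditional expectations are $2n_F p$, $n_i\mu_i$,
$n_i$ ($i = 1, 2, 3$), $\sigma_a^2$ and $\sigma_e^2$, respectively, where $n_F$ is the total number of founders in
the pedigree, and $n_i$ is the expected number of individuals of genotype $i$ ($i = 1, 2, 3$).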
Hence, if we denote new values of parameters by *, we obtain the following EM
equations for estimating $\theta$:
p*
(i = 1,
eXl)ec~tat;lOIls on
the
fo(G,
there is no practical way to evaluate the denominator of the above equation for
a pedigree of more than about ten individuals (Ott, 1979).
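To make the role of the sufficient statistic $2\mathbf{1}_F'\mathbf{1}_1 + \mathbf{1}_F'\mathbf{1}_2$ concrete, here is a hedged sketch of a Monte Carlo update for the gene frequency: since the statistic has expectation $2 n_F p$, a natural estimate averages (allele count)/(2 × number of founders) over sampled founder genotype configurations. The function name and the genotype coding (1 = two copies of the allele with frequency $p$, 2 = one copy, 3 = none) are assumptions of this example, and the samples would in practice come from the sampler described in the next subsection.

```python
def update_gene_frequency(founder_genotype_samples):
    """Average of (2*n1 + n2) / (2*n_F) over sampled founder genotypes.

    Each sample is a sequence of founder genotype codes:
    1 = two copies of the allele, 2 = one copy, 3 = no copies
    (this coding is an assumption of the sketch).
    """
    total = 0.0
    for genotypes in founder_genotype_samples:
        n_founders = len(genotypes)
        copies = sum(2 if g == 1 else 1 if g == 2 else 0 for g in genotypes)
        total += copies / (2.0 * n_founders)
    return total / len(founder_genotype_samples)


# Two sampled configurations for three founders.
p_new = update_gene_frequency([[1, 2, 3], [1, 1, 3]])
```

For the two configurations shown, the per-sample allele fractions are 3/6 and 4/6, so the update is their average.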
2.3 Monte Carlo estimation
We propose sampling the major genotypes and polygenotypes given the data and
estimating the conditional expectations required in the EM equations by a Monte
Carlo method. However, a classical Monte Carlo method, providing independent
realizations of major genotypes and polygenotypes given the data, is precluded because
the conditional distribution is intractable and because there is no known efficient
algorithm to generate independent realizations from the distribution (8).
To sample unknown major genotypes and polygenotypes from the conditional
distribution, we use the Gibbs sampler, an iterative procedure that generates
multiple dependent realizations of the unobserved variables conditional on observed data
(Hastings, 1970; Geman and Geman, 1984; Gelfand and Smith, 1990). Beginning from
any realization of major genotypes and polygenotypes, the genotype and polygenotype
... current estimate of parameters, and current configuration of ... one ...
a Markov ... $t = 1, \ldots$ is a stationary distribution ...
the major genotypes can move ... polygenotypes can move to ...
probability ... one step by positivity of the distribution.
Any irreducible Hastings algorithm is also Harris recurrent (Tierney, 1991). Hence
$P_\theta(G, a \mid y)$ is the unique invariant distribution of the Markov chain and averages over
the Markov chain ... For any integrable function $V(G, a)$,
$$\bar V_t \to E(V), \qquad \text{where } \bar V_t = \frac{1}{t}\sum_{l=1}^{t} V(G^{(l)}, a^{(l)}).$$
The chain is also geometrically ergodic by an argument of Chan (1991). This is
because the major genotype configurations themselves form a Markov chain $\{G^{(t)}\}$,
which has a finite state space and so is geometrically ergodic (Chung, 1967), but
then so is the joint chain because the rate for the joint chain is dominated by the
rate for the major genotype chain (equation 2.2 in Chan, 1991). Geometric ergodicity
implies a central limit theorem
where the asymptotic variance $\sigma_V^2$ depends on the function $V$ and on the
autocorrelation ... are correlated. To reduce autocorrelation, one can sample ...
required to ... a ... computational efficiency is discussed ... one runs ...
realizations are ... to the conditional expectations and new $\theta^{(l+1)}$.
This completes one iteration ... until the likelihood
of the model ... no trend. With a reasonably large sample
size ($N = 200$, ...) ... in equations ... can
be accurately estimated ...
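To implement the Gibbs sampler, we need, for each individual, the conditional
distribution of his genotype given his trait value $y_i$, polygenotype $a_i$ and the genotypes
and polygenotypes of other members in the pedigree, and the conditional distribution
of his polygenotype given his trait value $y_i$, his genotype $G_i$, and the genotypes
and polygenotypes of other members in the pedigree. For individual $i$, we specify
a neighborhood consisting of (if present in the pedigree) his parents, his spouse(s)
and his offspring. Given $y_i$, polygenotype $a_i$, and the genotypes of the neighborhood,
the genotypes and polygenotypes on other pedigree members do not provide
information about $G_i$. Thus, if we let $G_f$, $G_m$ denote the genotypes of his father and
mother, $\{G_j\}$ those of his spouse(s), and $\{G_{jl}\}$ those of his offspring, then
$$P_\theta(G_i \mid G_{-i}, a, y) = P_\theta(G_i \mid G_f, G_m, \{G_j\}, \{G_{jl}\}, y_i, a_i) \propto \Big[\prod_{j,l} P(G_{jl} \mid G_i, G_j)\Big] P(G_i \mid G_f, G_m)\, f(y_i \mid G_i, a_i) \qquad (9)$$
If $y_i$ is missing, $f(y_i \mid G_i, a_i)$ is taken to be 1. If $i$ is a founder, the
segregation probability $P(G_i \mid G_f, G_m)$ is just the genotype frequency.
Similarly, if we denote by $a_f$, $a_m$, $\{a_j\}$ and $\{a_{jl}\}$ the polygenotypes of his father,
mother, spouse(s) and offspring, then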
$$f(a_i \mid a_{-i}, y, G) = f(a_i \mid a_{-i}, y_i, G_i) \propto \Big[\prod_{j,l} f(a_{jl} \mid a_i, a_j)\Big] f(a_i \mid a_f, a_m)\, f(y_i \mid G_i, a_i) \qquad (10)$$
where $a_{-i}$ denotes the breeding values of all the members in the pedigree except $i$,
and $f(a_i \mid a_f, a_m)$ is the polygenic segregation probability density; that is, given $a_f$, $a_m$,
$a_i \sim N((a_f + a_m)/2, \sigma_a^2/2)$. If $i$ is a founder, $a_i \sim N(0, \sigma_a^2)$.
After some algebra, we find that the conditional distribution of $a_i$ given $y_i$, $G_i$, $a_f$, $a_m$, $\{a_j\}$
and $\{a_{jl}\}$ is normal with mean
(11)
and variance
(12)
where ... is ... or ..., and ... is the number of offspring. Equations (11) and (12)
are similar to ... $a_i$ given $y_i$, $a_f$, $a_m$, $\{a_j\}$ and $\{a_{jl}\}$ for the polygenic model
(Thompson and Shaw, 1990; Guo and Thompson, 1991). Thus, generation of realizations from
the (local) conditional distributions (9) and (10) is straightforward.
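Because the printed forms of (11) and (12) are not legible in this copy, the following sketch derives the normal conditional of $a_i$ directly from the stated ingredients of (10): the segregation density $N((a_f + a_m)/2, \sigma_a^2/2)$, the founder density $N(0, \sigma_a^2)$, and a trait density assumed here to be $y_i \mid G_i, a_i \sim N(\mu_{G_i} + a_i, \sigma_e^2)$. The function name and argument layout are hypothetical, and the result should be read as a precision-weighted derivation under those assumptions rather than as the authors' equations.

```python
def polygenic_conditional(y_minus_mu, parents, offspring, sigma_a2, sigma_e2):
    """Mean and variance of the normal conditional of a_i.

    y_minus_mu : y_i - mu_{G_i}, or None if the trait is unobserved
    parents    : None for a founder, else (a_f, a_m)
    offspring  : list of (a_child, a_spouse) pairs, one per offspring
    """
    if parents is None:
        prior_mean, prior_var = 0.0, sigma_a2
    else:
        prior_mean = 0.5 * (parents[0] + parents[1])
        prior_var = 0.5 * sigma_a2
    precision = 1.0 / prior_var
    weighted = prior_mean / prior_var
    for a_child, a_spouse in offspring:
        # a_child ~ N((a_i + a_spouse)/2, sigma_a2/2), viewed as a function of a_i.
        precision += 1.0 / (2.0 * sigma_a2)
        weighted += (a_child - 0.5 * a_spouse) / sigma_a2
    if y_minus_mu is not None:
        precision += 1.0 / sigma_e2
        weighted += y_minus_mu / sigma_e2
    variance = 1.0 / precision
    return weighted * variance, variance


# Founder with one offspring and no observed trait value.
mean, var = polygenic_conditional(None, None, [(1.0, 0.0)], 1.0, 1.0)
```

A Gibbs update for $a_i$ would then draw from a normal distribution with this mean and variance, for example with `random.gauss(mean, var ** 0.5)`.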
2.4 Choice of starting genotypic configurations (G, a)
Although the ergodic theorem ensures that the realizations generated via the Gibbs
sampler will converge in distribution to the true joint distribution, it is important to
choose a good starting genotypic configuration to avoid an unnecessarily prolonged
period before realizations can be collected. Since the observed data contain information
on the genetic effects, we use an approach we refer to as "posterior gene-dropping",
as opposed to the simple gene-dropping method (MacCluer et al., 1986). Basically,
we drop the major genes from the top of the pedigree down to the bottom, using
the current estimate of gene frequency and also the data. To each founder $i$ in the
pedigree, we assign a genotype ... with the probability ...
For each non-founder $i$ ... a genotype is randomly sampled ...
$$\ldots \propto P(G_i = k \mid G_f, G_m)\exp[\ldots] \qquad (14)$$
Once the major genes have been dropped through the pedigree, we drop the
polygenes. We first randomly draw a polygenotype for each founder $i$ according to
the following distribution:
(15)
Once all the founders are assigned breeding values, the breeding value for each
non-founder $i$ is randomly drawn from the following distribution:
$$f_\theta(a_i \mid y_i, G_i, a_f, a_m) \propto f_\theta(y_i \mid G_i, a_i)\, f_\theta(a_i \mid a_f, a_m) \propto \exp\Big[-\frac{(y_i - \mu_{G_i} - a_i)^2}{2\sigma_e^2}\Big]\exp\Big\{-\frac{(a_i - (a_f + a_m)/2)^2}{\sigma_a^2}\Big\} \qquad (16)$$
2.5 Extension to include multiple random effects
The algorithm can ... can ... dominance effects: ...
where $d$ is a vector of dominance effects, distributed as $N(0, \sigma_d^2 D)$, and the other
notation is as before. On a zero-loop pedigree, $D = \{d_{ij}\}$, where $d_{ij} = 1$ if $i = j$,
$d_{ij} = 1/4$ if $i$ and $j$ are full-siblings, and $d_{ij} = 0$ otherwise.
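The rule for $D$ on a zero-loop pedigree can be sketched in a few lines. This is an illustrative helper only: the name `dominance_matrix` and the parent-index encoding (`-1` for a founder's unknown parent) are assumptions, and the rule is applied exactly as stated in the text (1 on the diagonal, 1/4 for full siblings, 0 otherwise).

```python
import numpy as np


def dominance_matrix(father, mother):
    """Dominance relationship matrix for a zero-loop pedigree.

    father[i], mother[i] are parent indices, or -1 for founders.
    Two individuals are full siblings when both parents are known
    and shared.
    """
    n = len(father)
    D = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            full_sibs = (
                father[i] >= 0 and mother[i] >= 0
                and father[i] == father[j] and mother[i] == mother[j]
            )
            if full_sibs:
                D[i, j] = D[j, i] = 0.25
    return D


# Founders 0 and 1 with two full-sibling offspring 2 and 3.
D = dominance_matrix([-1, -1, 0, 0], [-1, -1, 1, 1])
```

Founders are never treated as siblings of one another here, because their parent indices are `-1`.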
To estimate this model, EM equations similar to (4-7) are needed. In addition, the
EM equation for $\sigma_d^2$ is
(17)
To evaluate the expectations in the EM equations, we need samples from the joint
conditional distribution $f_\theta(G, a, d \mid y)$, but
$$f_\theta(G, a, d \mid y) = \frac{f_\theta(y \mid G, a, d)\,P_\theta(G)\,f_\theta(a)\,f_\theta(d)}{\sum_G \int_a \int_d f_\theta(y \mid G, a, d)\,P_\theta(G)\,f_\theta(a)\,f_\theta(d)\; dd\; da},$$
a distribution more complicated than equation (8). Using the Gibbs sampler, however,
we can generate realizations from this conditional distribution. It is sufficient to
calculate the conditional distributions:
$$f(a_i \mid a_{-i}, y, G, d) = f(a_i \mid a_f, a_m, \{a_{jl}\}, \{a_j\}, y_i, \ldots) \propto \ldots$$
... neighbors are now his ... two as before. ... mean and variance similar to (11)
and (12), respectively, except that $y_i$ is now replaced by $y_i - d_i$.
If we denote by $S_i$ the set of $i$'s full siblings, excluding himself, and $s = |S_i| + 1$,
then distribution (20) is a normal distribution: ...
More generally, a complex mixed model has the following form:
$$y = \sum_{j=1}^{3}\mu_j\mathbf{1}_j + z_1 + \cdots + z_k + e \qquad (21)$$
where the $z_i$'s are $k$ random components, with $z_i \sim N_n(0, \sigma_i^2 E_i)$, and $e$ is the
environmental effect, with $e \sim N_n(0, \sigma_e^2 I)$. $z_1, \ldots, z_k$, $e$ and the major gene are
assumed mutually independent. The $E_i$'s are known and are positive semi-definite.
Without loss of generality, we can assume they are invertible, since we can always
re-parametrize the model so that, ... enough ... equations for estimating ...
$c$, where $c > 0$ is ..., which is invertible. ... and ... so ...
$i = 1, \ldots, k$.
In genetic models, direct inversion of matrices can be avoided ... taking ...
inverses are often sparse, facilitating efficient storage and computation.
To obtain realizations from the conditional distribution given data, again, the
Gibbs sampler can be used, which requires only the conditional distributions:
$$P_\theta(G_i \mid \{G_j\}, \{G_{jl}\}, G_f, G_m, y_i, z_{1i}, \ldots, z_{ki}) \propto \Big[\prod_{j,l} P(G_{jl} \mid G_i, G_j)\Big] P(G_i \mid G_f, G_m)\exp\Big[-\frac{(y_i - \mu_{G_i} - z_{1i} - \cdots - z_{ki})^2}{2\sigma_e^2}\Big] \qquad (26)$$
and
$$f_\theta(z_{ji} \mid z_1, \ldots, z_k, (z_j)_{-i}, G, y) \propto f_\theta(z_{ji} \mid z_1, \ldots, z_k, (z_j)_{-i}, G_i, y_i) \propto \ldots \qquad (27)$$
... and variance are ... $G$, ... $p$ ...
Since generation of variables from the conditional distributions is straightforward,
... implementation to estimate parameters of model (2).
2.6 Estimation of the information matrix
As in other settings, the reason for using the EM algorithm is that the likelihood
function is difficult or impractical to
evaluate, but, if the data are viewed as a function of some missing random variables,
the evaluation of MLEs based on the "complete data" is relatively straightforward.
However, the EM algorithm does not immediately yield asymptotic standard errors
of parameter estimates. Yet, it is often of practical interest to know the
variance-covariance matrix, or to construct confidence intervals for estimated parameters. In
this section, we present a Monte Carlo method for estimating the observed information
matrix. For notational convenience, let $u = (G, z_1, z_2, \ldots, z_k)$ (i.e. $u$ is the "missing
data" vector). Then
$$l(\theta; \theta_0) = \frac{L(\theta)}{L(\theta_0)} = \int_u \frac{P_\theta(u)\, f_\theta(y \mid u)}{P_{\theta_0}(u)\, f_{\theta_0}(y \mid u)}\, dP_{\theta_0}(u \mid y) = \int_u \frac{P_\theta(u, y)}{P_{\theta_0}(u, y)}\, dP_{\theta_0}(u \mid y)$$
(Thompson and Guo, 1991), which ... is ... on ... is ...
second is ... conditional distribution of $u$ given $y$.
... equation can ... as shown in the previous section, $u$ can be sampled ...
a Monte ... $P_\theta(u \mid y)$ via the Gibbs sampler. ... the conditional
distribution $P_\theta(u \mid y)$,
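$$\int_u \frac{\partial^2 \log P_\theta(u, y)}{\partial\theta_i\,\partial\theta_j}\, dP_\theta(u \mid y) \approx \frac{1}{N}\sum_{k=1}^{N} \frac{\partial^2 \log P_\theta(u^{(k)}, y)}{\partial\theta_i\,\partial\theta_j} \qquad (29)$$
Similarly,
$$E_\theta\Big(\frac{\partial \log P_\theta(u, y)}{\partial\theta_i}\,\frac{\partial \log P_\theta(u, y)}{\partial\theta_j}\Big) \approx \frac{1}{N}\sum_{k=1}^{N} \frac{\partial \log P_\theta(u^{(k)}, y)}{\partial\theta_i}\,\frac{\partial \log P_\theta(u^{(k)}, y)}{\partial\theta_j} \qquad (30)$$
$$E_\theta\Big(\frac{\partial \log P_\theta(u, y)}{\partial\theta_i}\Big) \approx \frac{1}{N}\sum_{k=1}^{N} \frac{\partial \log P_\theta(u^{(k)}, y)}{\partial\theta_i} \qquad (31)$$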
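The three Monte Carlo averages in (29)-(31) are the ingredients of the usual missing-information identity, in which the observed information equals minus the conditional mean of the complete-data Hessian minus the conditional covariance of the complete-data score. The equation that combines them is not legible in this copy, so the sketch below assumes that identity; the function name and array layout are hypothetical.

```python
import numpy as np


def observed_information(scores, hessians):
    """Monte Carlo estimate of the observed information matrix.

    scores   : array of shape (N, p), complete-data score per sample
    hessians : array of shape (N, p, p), complete-data Hessian per sample
    Assumed combination: -mean(H) - mean(s s') + mean(s) mean(s)'.
    """
    scores = np.asarray(scores, dtype=float)
    hessians = np.asarray(hessians, dtype=float)
    n_samples = scores.shape[0]
    mean_hessian = hessians.mean(axis=0)
    mean_outer = np.einsum("ki,kj->ij", scores, scores) / n_samples
    mean_score = scores.mean(axis=0)
    return -mean_hessian - mean_outer + np.outer(mean_score, mean_score)


# One-parameter toy input: two samples with scores +1 and -1.
info = observed_information([[1.0], [-1.0]], [[[-3.0]], [[-3.0]]])
```

When the samples are drawn at the MLE, the average score is close to zero and the last term is negligible; inverting the resulting matrix gives an estimate of the asymptotic variance-covariance matrix.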
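If the information is to be evaluated at the MLE $\hat\theta$, then
$$\int_u \frac{\partial \log P_\theta(u, y)}{\partial\theta_i}\, dP_\theta(u \mid y)\Big|_{\theta=\hat\theta} = \frac{\partial \log P_\theta(y)}{\partial\theta_i}\Big|_{\theta=\hat\theta} = 0,$$
so at $\theta = \hat\theta$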
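The log-likelihood ... the effects are assumed to ... term ...
Hence the ... $P_\theta(u, y)$ are ... and
$$P_\theta(u, y) = P_\theta(G, a, y) = f_\theta(y \mid G, a)\, f_\theta(a)\, P_\theta(G),$$
so that
$$\log P_\theta(G, a, y) = \log f_\theta(y \mid G, a) + \log f_\theta(a) + \log P_\theta(G),$$
where
$$\log f_\theta(y \mid G, a) = -\frac{k}{2}\log 2\pi - \frac{k}{2}\log\sigma_e^2 - \sum_{i \in O}\frac{(y_i - \mu_{G_i} - a_i)^2}{2\sigma_e^2} \qquad (33)$$
(34)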
where $O$ is the index set of individuals who are observed and $k = |O|$, $n_{lF}$ is the
number of founders with genotype $l$ ($l = 1, 2, 3$), and $C$ is some constant. Therefore,
the score vector is simply ... non-zero components ... index ... genotype $l$;
($l = 1, 2, 3$) (38)
and
$$\frac{\partial^2 \log f_\theta(a)}{\partial(\sigma_a^2)^2} = \frac{n}{2\sigma_a^4} - \frac{a'A^{-1}a}{\sigma_a^6} \qquad (39)$$
(40)
$$\frac{\partial^2 \log P_\theta(G)}{\partial p^2} = -\frac{2n_{1F} + n_{2F}}{p^2} - \frac{n_{2F} + 2n_{3F}}{(1-p)^2} \qquad (41)$$
2.7 Assessing the Monte Carlo variance and optimal sample spacing
Since a Monte Carlo method is used to estimate the mixed model, Monte Carlo
variation is introduced. The asymptotic variance is
$$\sigma^2 = \sum_{t=-\infty}^{\infty} \gamma_t$$
where $\gamma_t$ is the autocovariance at lag $t$, assuming ... can ...
the empirical autocovariance at lag $t$, and can be estimated by
$$\hat\sigma^2 = \sum_{t=-\infty}^{\infty} w(t)\,\hat\gamma_t \qquad (43)$$
where $w$ is some suitable weight function. For example, $w(t) = 1$ for small $t$,
$w(t) = 0$ for large $t$, and $w(t)$ makes a smooth monotone transition between these
values (Geyer, 1991). Once the Monte Carlo variances are estimated, one can also
estimate the optimal spacing, $k$, in the Gibbs sampler. The Central Limit Theorem
variance for a function $V$ of the chain sampled at spacing $k$ is
$$s_k = \sum_{t=-\infty}^{\infty} \gamma_{kt}$$
and the variance of the average of $V$ over $N$ consecutive samples
is approximately $s_k/N$. To achieve a given accuracy, $N$ must be proportional to $s_k$.
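The window estimator (43) can be sketched as follows. The cosine taper used for the transition region is only one possible choice of a smooth monotone weight function and is an assumption of this example, as are the function names and the integer cut-offs `lo` and `hi`.

```python
import math

import numpy as np


def window_weight(t, lo, hi):
    """Weight 1 up to lag lo, 0 from lag hi on, cosine taper in between."""
    if t <= lo:
        return 1.0
    if t >= hi:
        return 0.0
    return 0.5 * (1.0 + math.cos(math.pi * (t - lo) / (hi - lo)))


def windowed_variance(values, lo, hi):
    """Estimate sum_t w(t) * gamma_t from one realization of the chain.

    values : sequence of V evaluated along the chain
    lo, hi : integer lags bounding the taper (lo < hi)
    """
    x = np.asarray(values, dtype=float)
    n = len(x)
    centered = x - x.mean()
    total = float(np.dot(centered, centered)) / n  # lag-0 autocovariance
    for t in range(1, min(hi, n)):
        gamma_t = float(np.dot(centered[:-t], centered[t:])) / n
        # Lags t and -t contribute equally, hence the factor 2.
        total += 2.0 * window_weight(t, lo, hi) * gamma_t
    return total


# With lo=0 and hi=1 only the lag-0 term survives.
estimate = windowed_variance([1.0, -1.0, 1.0, -1.0], 0, 1)
```

To compare spacings, the same function can be applied to the thinned sequence `values[::k]`, which gives an estimate of $s_k$ for each candidate $k$.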
(This applies to any function $V$, and ... the Monte Carlo estimates of ...
conditional expectations in ... to ...) ... cost of ... or proportional ...
proportional to ... + ... can ... can ... $k$. ... cost ... depends on ...
on the autocovariance structure of ... even different statistics used to estimate
the parameters may have different optimal spacings.
3 Numerical examples
In this section we illustrate the method proposed above. The
programs implementing the method are written in C. All computations were
carried out on a DEC3100 UNIX workstation. The random number generator used
was the run-time library function drand48. The program psdraw (Geyer, 1988) was used to
draw the pedigree.
Example. Simulated data. We consider the simulated data on a 230-member,
six-generation pedigree. There are 67 founders in the pedigree. This
pedigree is similar in form to the plant pedigrees of Dr. Mitchell-Olds (personal
communication) ... to study ... of ... in
Brassica campestris. The model we considered is (1) ... 2. The simulation
values are shown in Table 2. With ... 0.5, $\mu_1 = \mu_2 = 1.0$,
$\mu_3 = $ ... we performed
... iteration a sample ... 400 Gibbs samples were drawn ...
ten iterations, ... samples were drawn. ... 20 ...
estimates are obtained, ... samples are generated, ... we ... 7. ... model ...
standard errors and 95% confidence intervals are shown in Table 2. The estimated ...
Table 3. In essence, ... estimates of $\sigma_a^2$ and ...
asymptotic variance-covariance matrix is ...
correctly the major gene effects. Note that ...
asymptotically, a substantial negative correlation.
Figure 2 shows ... and ... over ... It can be
seen from the figures that all the estimates approached the vicinity of their
MLEs, and the log-likelihood ratio of the mixed model versus the polygenic model
quickly increases. The estimates of major gene effects stabilize quickly. Other
parameter estimates, and the log-likelihood, continue to vary within limits, as is to be
expected of a Monte Carlo procedure; each EM step was based on only 400
(dependent) samples. Table 4 shows the estimates of log-likelihood ratios with respect to
various models. It can be seen from the table that the data strongly support a mixed
model with codominant expression of the major gene. The substantial difference in
log-likelihood ratios strongly rejects a ... model, a pure additive
model, ... polygenic model, ... effect
... estimated log-likelihood ... model with major ...
plus additive polygenic component, ...
surprising, if we ... standard error ...
estimated parameters are ... deviations are ...
errors ... magnitude are negligible.
... we explore the likelihood surface for
the two variance components $\sigma_a^2$ and $\sigma_e^2$. We considered eight points surrounding
the putative MLEs of $\sigma_a^2$ and $\sigma_e^2$ (Table 5), ... unchanged.
From Table 5 it can be seen that the estimated parameters do indeed provide higher
likelihood ... surface ...
between $\sigma_a^2$ and $\sigma_e^2$ is also noticeable, as expected from the
negative asymptotic correlation between the two parameters. The small
differences in the likelihood ratios suggest that the likelihood surface with respect to
$\sigma_a^2$ and $\sigma_e^2$ is fairly flat, as is also evidenced by the relatively wide confidence intervals
for these two parameters.
4 Discussion
We have presented a new method for estimating mixed models for large complex
pedigrees. The method can easily handle ... with multiple heritable/non-heritable
... pedigrees do not pose ...
While it is true that, with the increasing availability of polymorphic DNA markers,
... absence or existence ... In particular, mixed models can ... models can ...
once ... are imputed ... members, one can do ... of a potential linkage ...
a ... of informative meioses ... informative heterozygotes ... can ...
the Monte Carlo method of this paper ... an imputed genetic configuration on ...
our method can be extended to include genetic marker data, combining segregation
and linkage analysis (Guo and Thompson, in preparation).
Implementation of the Monte Carlo EM algorithm requires specification of three
operational parameters: the number of EM iterations, the Monte Carlo sample size
used to estimate the conditional expectations at each iteration, and the number of Gibbs
cycles between samples. The number of iterations can be determined by monitoring
the estimates and log-likelihood values (Figure 2). The Monte Carlo sample size is
largely determined by the desired accuracy of the parameter estimates, while the
optimal number of cycles for each sample can be investigated by the methods of section
2.7. In general it is the result of a compromise between two conflicting goals: more
accuracy in Monte Carlo estimation and less computing cost. More specifically, if the
Markov chain constructed via the Gibbs sampler has high autocorrelation between
consecutive values of a function on the chain, a larger Monte Carlo sample size is
needed in order to achieve the desired accuracy. Increasing the number of cycles would
reduce the autocorrelation between successive samples and thus the required
sample size. For the example ... satisfactory for ...
... are more random effects.
... is inherent to ... as ... to ... errors ...
"accelerated" version of Monte ...
sample size, as well as the number ... cycles.
... any eigenvalues of ... One ... matrix are ... problem can be more acute ...
remedy ...
If estimation of the current Hessian is inexpensive, ...
Newton-Raphson method is combined with ...
Although we have assumed $\mu_i$ ... for
all individuals, this is for simplicity only. In general, mixed models can incorporate
other covariates such as age and sex. The effects of these covariates can be estimated
by appropriate EM equations as described in Thompson and Shaw (1990).
Alternatively, one can sample the major genes and polygenic effects based on the current
estimate of covariate effects and, once ... one can ... the major ... poly- ...
additional covariates and use standard regression methods to
estimate covariate effects. The latter method is based on the fact that, given major
genes and polygenic effects, observations on different individuals are independent, as
is seen from ...
... area of ...
more complex genetic models ...
Acknowledgement
... models on ...
Charles ... for sharing ...
... was supported in part ...
USDA contract ...
and ...
References
... a quantitative trait: ...
Brockwell, P. J. and Davis, R. A. (1987) Time Series: Theory and Methods.
Springer-Verlag, New York.
Cannings, C., Thompson, E. A. and Skolnick, M. H. (1978) Probability functions on
complex pedigrees. Adv. Appl. Prob. ...
Chan, K. S. (1991) Asymptotic behavior of the Gibbs sampler. Technical Report
No. 294, Department of Statistics, University of Chicago.
Chung, K. L. (1967) Markov Chains with Stationary Transition Probabilities, 2nd ed.
Berlin: Springer-Verlag.
Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977) Maximum likelihood from
incomplete data via the EM algorithm. J. R. Stat. Soc. B 39:1-38.
Elston, R. C. and Stewart, J. (1971) A general model for the genetic analysis of
pedigree data. Human Heredity 21:523-542.
Gelfand, A. E. and Smith, A. F. M. (1990) Sampling-based approaches to calculating
marginal densities. Journal of the American Statistical Association ...
Geyer, C. J. (1988) Software for calculating ... Department of Statistics,
University of Washington.
Geyer, C. J. (1991) Markov chain Monte Carlo maximum likelihood. Computer
Science and Statistics: Proceedings of the 23rd Symposium on the Interface. Pp.
156-163. Interface Foundation of North America.
Guo, S. W. (1991) Monte Carlo methods in the quantitative genetics. Unpublished
Ph.D. dissertation. Dept. of Biostatistics, University of Washington.
Guo, S. W. and Thompson, E. A. (1991) Monte Carlo estimation of variance
component models. IMA J. Math. Appl. Med. Biol. 8: 171-189.
Hastings, W. K. (1970) Monte Carlo sampling methods using Markov chains and
their applications. Biometrika 57: 97-109.
Hasstedt, S. J. (1982) A mixed-model likelihood approximation on large pedigrees.
Computers and Biomedical Research 15:295-307.
Hasstedt, S. J. and Cartwright, P. (1979) PAP-Pedigree Analysis Package. Technical
Report No. 13, Department of Medical Biophysics and Computing, University of
Utah.
Henderson, C. R. (1976) A simple method for computing the inverse of a numerator
relationship matrix used in prediction of breeding values. Biometrics 32:69-83.
... a Markov ... defined on a ... configurations by a sampling scheme. Biometrics:
in press.
Sundberg, R. (1974) Maximum likelihood theory for incomplete data from an
exponential family. Scand. J. Statist. ...
Tanner, M. A. and Wong, W. H. (1987) The calculation of posterior distributions by data
augmentation. Journal of the American Statistical Association 82:528-540.
Thompson, E. A. and Shaw, R. G. (1990) Pedigree analysis for quantitative traits:
variance components without matrix inversion. Biometrics 46:399-413.
Thompson, E. A. (1986) Pedigree Analysis in Human Genetics. The Johns Hopkins
University Press.
Thompson, E. A. and Guo, S. W. (1991) Evaluation of likelihood ratios for complex
genetic models. IMA J. Math. Appl. Med. Biol. 8: 149-169.
Tierney, L. (1991) Markov chains for exploring posterior distributions. Technical ...
Table 1: Algorithm to estimate variance component models.
3. Compute ..., i = 1, 2, ..., k;
4. Set initial parameter estimates, p, μ_j (j = 1, 2, 3); σ_i², i = 1, 2, ..., k, and σ_e²;
5. Posterior gene-dropping:
   Drop G; Drop z_1, ..., z_k;
6. De-memorization: Gibbs sample (z_1, z_2, ..., z_k, G | y) for a certain number of times;
7. Next EM iteration step
   Set p* = μ_j* = σ_i*² = σ_e*² = 0, i = 1, 2, ..., k; j = 1, 2, 3;
   For j = 1 to N (the Monte Carlo sample size)
      Gibbs sample (z_1, z_2, ..., z_k, G | y);
         for l = 1 to k (the chosen spacing)
            Randomly permute all individuals in the pedigree;
            In the order indicated by the permutation,
            update genotypes and random effects;
         next l
      After the kth cycle, we have configuration (z_1, z_2, ..., z_k, G);
      compute p* = p* + (2·1_F'1_1 + 1_F'1_2)/(2·1_F'1)/N;
      compute ...
      compute μ_i* = ... + [1_i'(y - z_1 - ... - z_k)]/1_i'1; ... i = 1, ...
   next j
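The inner loop of step 7 (a fresh random permutation of individuals on every cycle, with a fixed number of cycles between retained samples) can be sketched generically. The function `gibbs_samples`, its `update_site` callback and the list-based state are hypothetical names chosen for this illustration; in the setting of the report, `update_site` would draw a genotype or random effect from the local conditional distributions.

```python
import random


def gibbs_samples(state, update_site, n_samples, spacing, rng=None):
    """Random-permutation-scan Gibbs sampler with thinning.

    state       : mutable list holding the current value at each site
    update_site : callable (i, state, rng) -> new value for site i
    n_samples   : number of retained samples (N in the text)
    spacing     : number of full cycles between retained samples (k)
    """
    rng = rng or random.Random(0)
    order = list(range(len(state)))
    samples = []
    for _ in range(n_samples):
        for _ in range(spacing):
            rng.shuffle(order)  # new visiting order on every cycle
            for i in order:
                state[i] = update_site(i, state, rng)
        samples.append(list(state))  # keep a copy after the last cycle
    return samples


# Toy update that just counts how many times each site was visited.
draws = gibbs_samples([0, 0, 0], lambda i, s, rng: s[i] + 1, n_samples=2, spacing=3)
```

With the counting update above, every site has been visited `spacing` times when the first sample is stored and twice that when the second is stored, which makes the thinning structure easy to check before plugging in real conditional draws.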
-I..v":"",,,,2.7914)
0.8855)
(0.0633, 0.2890)
2.3169
0.1762 0.0576
2.0
p
0.6
0.2
fi2 0.0
fiI
Estimated asymptotic variance-covariance matrix × 10³.
p
-2.2621 58.6270
Table 5: Log-likelihood ratios in the neighborhood of the MLE. The Gibbs sampler
is run at $\theta_0$ set to ... See text for explanation.
Figure 1: Pedigree structure of the simulated data. Individuals with grey color
constitute a ... and grey colors constitute a ... black constitute a 230-member
six-generation pedigree.
Figure 2: Monte Carlo ... Log- ... estimate of ... over iterations; ... additive polygenic
variance vs. the EM iterations; (d) Estimate of error variance vs. the iterations;
(e) Estimates of major gene effects. (Panels (a), (c), (d) and (e) are plotted against
iteration, 0 to 200.)
A Monte Carlo Method for Combined
Segregation and Linkage Analysis
Sun Wei Guo¹  Elizabeth A. Thompson²
¹Department of Biostatistics
University of Michigan
Ann Arbor, MI 48109-2029
2Department of Statistics, GN-22
University of Washington
Seattle, Washington 98195
U. S. A.
Running Head: A Monte Carlo method for genetic ...
(1), (2): This work is based on research completed while SWG was a student ...
Department of Biostatistics, ...
Correspondence to:
Dr. Elizabeth A. Thompson
Department of Statistics, GN-22
Seattle, Washington 98195
Phone: (206) 685-0108
Fax: (206) 685-7419
Abstract
... Carlo ... to ...
of a quantitative trait observed on an ...
In conjunction with the Monte Carlo method of likelihood
ratio evaluation proposed by Thompson and Guo (1991), the method
provides for estimation and hypothesis testing. The greatest attraction
of this approach is its ability to handle complex genetic models and large
pedigrees. Two examples are presented. One is of simulated data
on a large pedigree; the other is a reanalysis of published data previously
analyzed by other methods. These examples illustrate the practicality of
the method.
Introduction
The past decade has seen enormous success ...
successes are ... of ... for Huntington's disease, cystic fibrosis and
Duchenne's muscular dystrophy. In contrast, progress in mapping quantitative traits
has been very slow, despite the fact that many relevant measures of diseases are
clinical, physiological and biological traits that vary continuously among individuals.
There is no shortage of data. In fact, advances in biology and molecular genetics
have generated so much data that the availability of statistical techniques has
become a bottleneck in the process of the mapping of quantitative traits.
The current available techniques for mapping quantitative traits can be grouped
into four categories: 1) sib-pair methods (Haseman and Elston, 1972), 2) discrete-type
linkage analysis (Ott, 1991; Thomas and Cortessis, 1991), 3) mixed models (Hasstedt,
1982), and 4) regressive models (Bonney, 1984; Bonney et al., 1988). Although sib-pair
methods are fairly robust and have the advantage of no need to make ascertainment
corrections, their statistical power is low, especially when linkage is loose. In
addition, they ignore the interdependency among sib-pairs from the same nuclear
family. At best, they can only tell whether there is a linkage, and are thus primarily a ...
mapping a quantitative ... to dichotomize ... were discrete ...
alternative ... is to assume ...
but suffers loss of information. Additionally, penetrances must ...
arbitrary cutoff. ...
penetrance functions for a quantitative ...
frequency, the mean and ... for each genotype (Ott, 1974). Since quantitative
traits are probably typically controlled by a number of loci acting in concert with
environmental effects, the adequacy of these models is questionable.
The regressive models proposed by Bonney (1984, 1988) represent a new
development. The model handles the residual variation unaccounted for by the major
gene effects as if it were "noise", without specifying its origin. Furthermore, the
model assumes a Markovian dependence structure with regard to the residuals among
first-degree relatives. By doing so, the model provides flexibility in incorporation
of covariates and efficiency in computation. However, while simple Markovian ...
a ... Morton, 1981; Ott, ... variation in a quantitative ...
and/or other ... of non-heritable effects, ... of
the environment. Although the model is biologically ..., computational
difficulties have limited its use mainly to segregation analysis rather than in
conjunction with linkage analysis, and primarily to data on nuclear families or small
pedigrees (Ott, 1979; Hasstedt, 1982).
Traditionally, segregation and linkage analyses have been performed separately
(Ott, 1991). Historically, with limited marker data and computing power,
most linkage analyses were carried out only after sufficient information had been
gathered to infer a mode of inheritance for the trait. However, segregation analysis
can only, at best, demonstrate the presence of major gene(s). It cannot localize
them, and often lacks power to estimate genetic parameters correctly in the presence
of multiallelic trait loci or genetic heterogeneity (..., 1984; Ott, 1990). Violation of
the distributional assumptions of the mixed model can lead to spurious support for a
major gene (MacLean et al., 1975; Go et al., 1978; Eaves, 1983). Incorporation of linked
markers might potentially improve the robustness of the mixed model. Moreover,
linkage of a trait to a genetic marker provides evidence of the
existence of a ... the ... to combine linkage ... et ...
complex, genetic heterogeneity and ... both segregation ...
pedigrees, which ... are more homogeneous than a pooled sample of
nuclear families. It is also useful to be able to analyze ... on ...
consider more realistic yet more complicated genetic models that can incorporate
various heritable/non-heritable random and fixed effects and to develop practical
... computation.
In this paper, we propose a Monte Carlo approach to combined segregation and
linkage analysis for quantitative traits, which extends our previous work on the
Monte Carlo estimation of variance component models and mixed models (Guo and
Thompson, 1991, 1992). The greatest attraction of the approach is that it can handle
complex genetic models and data on large pedigrees. In the next section, we describe
the method and computational algorithm. The practicality of the approach is then
illustrated by two examples. Finally, we discuss the proposed method in relation to
[...] work in this area, and indicate directions for future research.
Methods
Notation and assumptions
Consider an n-member pedigree [...] a continuous [...]
data [...] not be available for [...] individuals. Suppose [...]
[...] of exposition, we consider a mixed [...]
a major autosomal [...] an additive polygenic effect, and
an independent [...] effect, without fixed or covariate effects. Extension to
include fixed covariate effects, and dominance or [...] or non-heritable
random effects is straightforward (Guo and Thompson, 1991, 1992), but in this paper
we focus on the inclusion of marker data rather than on complexities of the trait
model.
For technical reasons (see Discussion section) we consider a diallelic marker
locus. To fix notation, let the two alleles of the major gene trait locus be D and d,
with gene frequencies p and 1 - p respectively. Let the two alleles at the marker locus
be B and b, with gene frequencies q and 1 - q. Let $G_i$ denote the ith individual's
combined genotype at trait and marker loci. For a given genotypic configuration G
on the pedigree, let $\mathbf{1}_i$ be an indicator vector, with jth entry equal to 1 or 0 depending
on whether the jth individual has genotype i. Similarly, we let $\mathbf{1}_F$ be an indicator vector
with entry i equal to 1 or 0, depending on whether the ith individual is a founder.
It is assumed that each of the three genotypes DD, Dd, and dd, denoted as 1, 2 and
3, makes a specific contribution $\mu_i$ ($i = 1, 2, 3$) to the phenotype. It is also assumed
[...] trait and [...] loci are in [...] equilibrium, [...] locus
[...]
$$ y = \sum_{i=1}^{3} \mu_i \mathbf{1}_i + a + e \qquad (1) $$
where a is a vector of additive genetic effects, and e is a vector of individual
environmental effects. Each of the vectors a and e is assumed Normally distributed
with mean 0, e having variance-covariance matrix $\sigma_e^2 I$, a having variance $\sigma_a^2 A$
where A is the numerator relationship matrix (Henderson, 1976).
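
As an illustration of the matrix A in model (1), the following minimal Python sketch builds a numerator relationship matrix by the standard tabular recursion. The function name `numerator_relationship`, the parent-index encoding, and the five-member example pedigree are assumptions made for this illustration; they are not taken from the authors' programs.

```python
def numerator_relationship(parents):
    """Return the numerator relationship matrix A as a list of lists.

    parents[i] is a (father, mother) pair of indices, or (None, None)
    for a founder. Individuals must be ordered so that parents precede
    their offspring (an assumption of this sketch).
    """
    n = len(parents)
    A = [[0.0] * n for _ in range(n)]
    for i, (f, m) in enumerate(parents):
        # Off-diagonal: average of the relationships of j with i's parents.
        for j in range(i):
            a_ij = 0.0
            if f is not None:
                a_ij += 0.5 * A[j][f]
            if m is not None:
                a_ij += 0.5 * A[j][m]
            A[i][j] = A[j][i] = a_ij
        # Diagonal: 1 plus the inbreeding coefficient of i.
        inbreeding = 0.5 * A[f][m] if f is not None and m is not None else 0.0
        A[i][i] = 1.0 + inbreeding
    return A


# Hypothetical pedigree: founders 0 and 1, full sibs 2 and 3,
# and individual 4, the offspring of the two sibs.
pedigree = [(None, None), (None, None), (0, 1), (0, 1), (2, 3)]
A = numerator_relationship(pedigree)
```

In this example the two full sibs have relationship 0.5, and the inbred individual 4 has a diagonal entry greater than 1, so its polygenic variance under model (1) exceeds $\sigma_a^2$.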
Monte Carlo estimation
There are a total of eight parameters to be estimated: the allele frequencies p and q,
the recombination fraction r, the major gene effects $\mu_i$, $i = 1, 2, 3$, and the polygenic and
residual variances, $\sigma_a^2$ and $\sigma_e^2$. However, estimation of q within the pedigree analysis
is often of secondary interest, as considerable information on the marker may have
accumulated. Besides, if the marker is co-dominant, as is usually the case, q can be
easily estimated from observed marker phenotypes. Therefore, we assume q is known
and let $\theta = \{p, \mu_1, \mu_2, \mu_3, \sigma_a^2, \sigma_e^2, r\}$ denote the vector of parameters to be estimated.
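The likelihood for model (1) is:
$$ L(\theta) = P_\theta(y, M) = \sum_G \int_a f_\theta(y \mid a, G)\, P(M \mid G)\, P_\theta(G)\, dP_\theta(a)
 = \sum_G f_\theta(y \mid G)\, P(M \mid G)\, P_\theta(G) \qquad (2) $$
where G is the combined two-locus genotypic configuration on the pedigree and
the sum is over all [...] on marker [...]
1 or 0, [...] on whether or not [...]
G, is an [...]
1979). Also,
$$ P_\theta(G) = \prod_{\text{founders } i} P_\theta(G_i) \prod_{\text{non-founders } j} P_\theta(G_j \mid G_{j_m}, G_{j_f}) \qquad (3) $$
where $j_m$ and $j_f$ are the parents of j, $P_\theta(G_i)$ is the genotypic frequency and is a
function of p and q, and $P_\theta(G_j \mid G_{j_m}, G_{j_f})$ is the two-locus transmission probability and
is, in general, a function of the recombination fraction r.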
A framework for estimation of model (1) is as a "missing data problem", with G
and a missing. Thus formulation of an EM algorithm is appropriate. The form of the
EM equations for p, $\sigma_a^2$, $\sigma_e^2$ and $\mu_i$ ($i = 1, 2, 3$) is given by Guo and Thompson (1992).
The added feature here is the inclusion of the linked marker, and the estimation of the
recombination fraction r. To obtain the EM equation for r, suppose that (a, G) were
observed for all n individuals of the pedigree. Then, estimation of r is just a matter
of counting. Of course, we can restrict attention to those parent pairs in which at
least one parent is doubly heterozygous; only these are informative for linkage (see,
for example, Ott, 1991). Let $H_i$ ($H_i = 0, 1, 2$) be the number of doubly heterozygous
parents in the ith parent-offspring trio, and $R_i$ the number of recombinant events in
segregation from the doubly heterozygous parents to the offspring ($R_i = 0, 1, 2$, [...]
the expected number of recombination events (Thomas and Cortessis, 1991)). Table
1 provides values of $H_i$ and $R_i$ for [...] informative matings.
[...] genetic configuration G on [...] two [...]
recombination fraction r [...]
the [...] equation [...] r is [...]
[...] one [...]
Sex-specific recombination can be estimated with minor modification, by
counting separately segregations in males and in females.
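
Since estimation of r from complete data is a matter of counting, a Monte Carlo version of the update can be sketched as below. This is a minimal sketch that assumes the update is the ratio of the total count of recombinant events to the total count of doubly heterozygous parents over the sampled realizations; the function name and the layout of the (H, R) input are hypothetical.

```python
def update_recombination_fraction(realizations):
    """Counting estimate of r from sampled genotype configurations.

    realizations: one list per sampled configuration G, each holding an
    (H_i, R_i) pair for every parent-offspring trio, where H_i is the
    number of doubly heterozygous parents and R_i the (possibly expected)
    number of recombinant events in that trio.
    """
    total_h = sum(h for trios in realizations for h, _ in trios)
    total_r = sum(r for trios in realizations for _, r in trios)
    if total_h == 0:
        raise ValueError("no informative segregations in the sample")
    # The common factor 1/N in the two Monte Carlo averages cancels.
    return total_r / total_h


# Two hypothetical realizations, each with two trios.
sample = [[(1, 0), (2, 1)], [(1, 1), (2, 0)]]
r_new = update_recombination_fraction(sample)
```

Sex-specific estimates would follow by passing separate (H, R) lists for paternal and maternal segregations.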
Despite the simplicity of the framework, implementation is not immediate,
since there is no way to evaluate explicitly the conditional expectations such as those
in (4): the distribution of major genotypes and polygenic values, given
the observed data, is intractable. Guo and Thompson (1991, 1992) have proposed
Monte Carlo estimation of the required conditional expectations, using the Gibbs
sampler to obtain realizations of the major genotypes and polygenic values given the
data. The Gibbs sampler (Geman and Geman, 1984; Gelfand and Smith, 1990)
is an iterative procedure for drawing multiple (but dependent) realizations from
the unknown conditional distribution. In our case, it works as follows. Beginning
from any realization of polygenic values and combined major genotypes, (a, G),
that is consistent with phenotypic and marker observations, the polygenic values
and genotypes are updated, for each [...] in [...] pedigree in turn (in random
[...] observed data (if
any) and the polygenic values and genotypes of all other members in the pedigree.
[...] is [...] (which in our
case [...]
as a realization [...]
G [...] y, [...] realizations at successive
cycles are dependent, it is not [...]
In practice, we collect realizations [...]
efficient to use [...]
major genotypes and polygenic values for [...]
n [...] of 20 [...]
polygenic values are stored and used as (dependent) realizations from the conditional
distribution $f_\theta(a, G \mid y, M)$. By the ergodic theorem, the mean of any function of
(a, G) over the realizations is a consistent estimate of the expectation of that function.
Thus, we have estimates of expectations such as those in (4).
To implement the Gibbs sampler we need only, for each individual, the conditional
distribution of his combined major genotype $G_i$ given $y_i$, $M_i$, $a_i$, and the major
genotypes of other members in the pedigree, and the conditional distribution of his
polygenic value, $a_i$, given $y_i$, his combined genotype $G_i$, and the polygenic values of
other members in the pedigree. For individual i, conditioning on the major genotypes
of all other pedigree members involves only his immediate neighbors: his parents
(if in the pedigree), his spouse(s) (if any), and his offspring (if any). The
genotypes of other pedigree members do not contribute further information. Hence,
if we let $G_{(i)}$ denote the genotypes of all pedigree members except i, $G_{i_f}$, $G_{i_m}$ the
genotypes of the parents of the ith individual, $\{G_j\}$ his spouses', $\{G_{jl}\}$ his offspring's,
then
[...] $\propto$ [...]
[...]
[...] a founder [...]
the combined two-locus segregation [...]
population genotype frequency [...]
$P_\theta(M_i \mid G_i) = 1$ for all possible $G_i$. [...] updating $G_i$ is straightforward.
Similarly, the value $a_i$ can [...] straightforwardly, given [...]
observed [...] polygenic values of [...]
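
The single-site update of a polygenic value can be sketched as follows. This is a minimal sketch, not the authors' implementation: it assumes model (1) with $a \sim N(0, \sigma_a^2 A)$ and independent residuals of variance $\sigma_e^2$, in which case the full conditional of $a_i$ is Normal with moments determined by row i of $A^{-1}$. The function names and argument layout are hypothetical.

```python
import random


def polygenic_conditional(i, a, A_inv, y_i, mu_g, var_a, var_e):
    """Mean and variance of a_i given all other a_j, y_i and G_i.

    a     : current polygenic values of all pedigree members
    A_inv : inverse of the numerator relationship matrix (list of lists)
    y_i   : trait value of individual i, or None if unobserved
    mu_g  : major gene effect of the current trait genotype of i
    """
    # Contribution of the polygenic prior, exp(-a' A^{-1} a / (2 var_a)).
    precision = A_inv[i][i] / var_a
    linear = -sum(A_inv[i][j] * a[j] for j in range(len(a)) if j != i) / var_a
    # Contribution of the trait observation, if there is one.
    if y_i is not None:
        precision += 1.0 / var_e
        linear += (y_i - mu_g) / var_e
    return linear / precision, 1.0 / precision


def sample_polygenic_value(i, a, A_inv, y_i, mu_g, var_a, var_e, rng=random):
    """Draw a new value of a_i from its full conditional distribution."""
    mean, variance = polygenic_conditional(i, a, A_inv, y_i, mu_g, var_a, var_e)
    return rng.gauss(mean, variance ** 0.5)


# A single unrelated individual with y = 3, genotype mean 1, unit variances.
mean_0, var_0 = polygenic_conditional(0, [0.0], [[1.0]], 3.0, 1.0, 1.0, 1.0)
```

Because $A^{-1}$ is sparse for a pedigree, only parents, spouses and offspring contribute non-zero terms to the sum, which matches the neighborhood structure described for the genotype update.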
Quantitative traits are often affected by covariates such as age and sex. The
effects of these covariates can be estimated by appropriate EM equations as described
in Thompson and Shaw (1990). Alternatively, one can sample major genotypes and
polygenic effects based on current estimates of covariate effects and then use standard
regression methods to estimate covariate effects. The latter method is based on
the fact [...] major genotypes, polygenic and heritable random effects,
[...] major [...]
observations on different individuals are [...]. Hence, for each realization of
[...] and other heritable random [...] one can treat these [...]
Choice of starting realizations
[...] to choose a good starting genotypic configuration in order to
[...] prolonged iteration. [...] observed data on each individual [...] conditionally on
[...] provides partial information on major gene effects and recombination fraction.
This local information can be used in a "posterior gene-dropping" method to provide
a sensible starting point, given the current parameter estimates (Guo and Thompson,
1992). Here the procedure is adapted for marker data.
First, the major genes are simulated, from the top of the pedigree down to the
bottom, using the information of current estimates of gene frequency, recombination
fraction, major gene effects, and data. For each founder i in the pedigree and each
possible two-locus genotype g, we calculate the probability
$$ P_\theta(G_i = g \mid y_i, M_i) \propto P_\theta(G_i = g)\, P_\theta(M_i \mid G_i)\, f_\theta(y_i \mid G_i = g)
 \propto P_\theta(G_i = g)\, P_\theta(M_i \mid G_i)\, \exp\left[ -\frac{(y_i - \mu_{G_i})^2}{2(\sigma_a^2 + \sigma_e^2)} \right] $$
normalizing (for each i) the sum over g to 1. Here $P_\theta(G_i = g)$ is just the frequency
of the combined genotype, calculated on the assumption of linkage and Hardy-Weinberg
equilibria. If $y_i$ is missing, we let $f_\theta(y_i \mid G_i = g) = 1$ for all g. If $M_i$ is missing,
$P(M_i \mid G_i)$ is set to 1 for all $G_i$. A genotype is then randomly selected according to
the calculated probability. Once all founders are assigned combined genotypes, we
can drop the [...] to non-founders. For each non-founder i, a genotype is generated
from the following probability distribution:
[...] $\propto$ [...]
$\propto$ [...]
where [...] are already assigned.
With linked markers, however, this "posterior gene-dropping" procedure [...] not
[...] assigned combined genotypes [...]
[...] problem; if [...]
able to carry through because it is [...] $P_\theta(G_i = g \mid [...]) = 0$ for all
possible g. This is [...] some combined genotypes assigned to the parents of
the ith [...] not be consistent [...] $M_i$. In practice [...]
"posterior gene-dropping" [...] until [...]
compatible with their [...]
this [...]
all the individuals [...]
observed marker phenotypes.
Once the major genes at marker and trait loci have been dropped down the
pedigree, we drop the polygenic values similarly, conditioning on individual trait
values, major genotypes, and already assigned parental polygenic values (Guo and
Thompson, 1992).
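
The founder step of posterior gene-dropping can be sketched as below. This is a minimal sketch under stated assumptions: the marker is codominant and coded as the observed number of B alleles, combined genotypes are represented as phase-known haplotype pairs, and the numerical parameter values in the example are purely illustrative. All function and variable names are hypothetical.

```python
import math
import random
from itertools import combinations_with_replacement

HAPLOTYPES = ("DB", "Db", "dB", "db")  # trait allele followed by marker allele


def founder_genotype_posterior(y, n_B, p, q, mu, var_a, var_e):
    """Posterior over the 10 phase-known two-locus genotypes of one founder.

    y   : trait value, or None if missing
    n_B : observed number of B alleles (0, 1 or 2), or None if missing
    mu  : dict mapping trait genotype 1 (DD), 2 (Dd), 3 (dd) to its effect
    """
    # Linkage equilibrium: haplotype frequency is a product of allele frequencies.
    hap_freq = {"DB": p * q, "Db": p * (1 - q),
                "dB": (1 - p) * q, "db": (1 - p) * (1 - q)}
    weights = {}
    for h1, h2 in combinations_with_replacement(HAPLOTYPES, 2):
        # Hardy-Weinberg frequency of the unordered haplotype pair.
        prior = hap_freq[h1] * hap_freq[h2] * (1 if h1 == h2 else 2)
        count_B = (h1[1] == "B") + (h2[1] == "B")
        marker_ok = 1.0 if n_B is None or count_B == n_B else 0.0
        trait = 3 - ((h1[0] == "D") + (h2[0] == "D"))  # 1, 2, 3 for DD, Dd, dd
        if y is None:
            penetrance = 1.0
        else:
            penetrance = math.exp(-(y - mu[trait]) ** 2 / (2 * (var_a + var_e)))
        weights[(h1, h2)] = prior * marker_ok * penetrance
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}


def draw_founder_genotype(posterior, rng=random):
    """Randomly select one genotype according to the posterior weights."""
    genotypes = list(posterior)
    return rng.choices(genotypes, weights=[posterior[g] for g in genotypes])[0]


effects = {1: 2.0, 2: 0.0, 3: -2.0}  # illustrative major gene effects
no_data = founder_genotype_posterior(None, None, 0.5, 0.5, effects, 0.6, 0.2)
marker_BB = founder_genotype_posterior(1.8, 2, 0.5, 0.5, effects, 0.6, 0.2)
```

With no data the posterior reduces to the population genotype frequencies; with marker phenotype BB, only genotypes carrying two B alleles keep positive probability.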
Estimation of variance-covariance matrix
It is important to estimate the [...] errors of estimated parameters [...]
construct confidence intervals [...]
or to [...]
[...] a Monte Carlo [...]
on a [...]
[...] are x [...] are u [...]
[...]
where $\theta_i$ and $\theta_j$ are components of $\theta$. In our [...] u = (a, G) is the "missing data"
[...] (y, M). Each [...] can be estimated [...]
consists of conditional expectations of simple functions of
u = (a, G) given x = (y, M). For example, if N realizations $u^{(1)}, u^{(2)}, \ldots, u^{(N)}$ are
drawn from $P_\theta(u \mid x)$, the first term on the RHS can be estimated by
[...] $\frac{\partial^2 \log P_\theta(u^{(m)}, x)}{\partial \theta_i\, \partial \theta_j}$ [...]
[...] terms are estimated similarly, using the same realizations.
The first and second derivatives of the "complete data" log-likelihood,
$\log P_\theta(u, x) = \log P_\theta(a, G, y, M)$, are easy to evaluate since
$$ P_\theta(a, G, y, M) = f_\theta(y \mid a, G)\, f_\theta(a)\, P_\theta(M \mid G)\, P_\theta(G) $$
and each term has a simple structure, with typically only a subset of the parameters
involved in any one component of the model. For example, the recombination fraction
r appears only in $P_\theta(G)$ and
$$ \frac{\partial \log P_\theta(G)}{\partial r} = \sum_{j} \frac{\partial \log P_\theta(G_j \mid G_{j_m}, G_{j_f})}{\partial r} \qquad (7) $$
where the sum is over all non-founders j, [...] the combined-genotype
configuration G. [...] 3 [...]
simple formulae can [...]
an estimate [...] information matrix is obtained, [...] can
[...] to a nominal [...]
can [...]
Likelihood ratio evaluation
A general method for Monte Carlo estimation of likelihood ratios was given by
Thompson and Guo (1991). For the model (1), the likelihood (2) takes the form
$$ L(\theta) = \sum_G f_\theta(y \mid G)\, P_\theta(M \mid G)\, P_\theta(G) $$
where the sum is over all possible [...] in the pedigree.
Direct evaluation of the likelihood is impossible on a large pedigree due to the
prohibitively large number of terms in the summation (Ott, 1979). However, it can
be shown that the likelihood ratio between two parameter values $\theta$ and $\theta_0$ can be
written in the form
[...] (8)
(Thompson and Guo, 1991). Thus a Monte Carlo estimate is
obtained by sampling [...] M).
[...] that M [...]
not depend [...] is a [...]
a [...] be generated alongside those of G. $f_\theta(y \mid G)$ is a
polygenic likelihood involving integration over unobserved a values (equation [...]
replaced by Monte [...] sampling, but for a
simple polygenic model on a simple pedigree exact evaluation is possible. Moreover,
any evaluation may in fact be unnecessary. If $\theta$ and $\theta_0$ differ only in the recombination
fraction, $f_\theta(y \mid G) = f_{\theta_0}(y \mid G)$ for all G and these terms also cancel from the likelihood
ratio estimator (8). For linkage analysis, one is often interested in computing the LOD
score, a log likelihood ratio at given values of the other genetic parameters. If r in $\theta_0$
is the recombination parameter, while r in $\theta$ is 0.5, then the estimated LOD score is:
[...] (9)
with no evaluation of the polygenic likelihood required.
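
A LOD score estimate of this kind can be sketched as follows. The sketch assumes the identity $L(\theta)/L(\theta_0) = E_{\theta_0}[P_\theta(y, M, G)/P_{\theta_0}(y, M, G) \mid y, M]$, and that $\theta$ and $\theta_0$ differ only in r, so that each realization contributes only the ratio of its genotype probabilities at the two recombination fractions. The function name and inputs are hypothetical; computing $\log P(G)$ for a configuration is left to the caller.

```python
import math


def estimate_lod(log_prob_at_r, log_prob_at_half):
    """Monte Carlo LOD score for recombination fraction r against 0.5.

    log_prob_at_r[m]    : natural log of P(G^(m)) evaluated at r
    log_prob_at_half[m] : natural log of P(G^(m)) evaluated at 0.5
    The configurations G^(m) are assumed to be sampled at r, given (y, M).
    """
    n = len(log_prob_at_r)
    log_ratios = [h - r for r, h in zip(log_prob_at_r, log_prob_at_half)]
    # Log of the sample mean of the ratios, computed stably.
    shift = max(log_ratios)
    log_mean = shift + math.log(sum(math.exp(v - shift) for v in log_ratios) / n)
    # The mean estimates L(0.5) / L(r); the LOD score is log10 of its inverse.
    return -log_mean / math.log(10.0)


# Two hypothetical realizations, each 100 times less probable at r = 0.5.
lod = estimate_lod([0.0, 0.0], [math.log(0.01), math.log(0.01)])
```

Only the transmission terms of non-founders with a doubly heterozygous parent change between the two recombination fractions, so in practice the log ratios can be accumulated from those terms alone.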
In the Monte Carlo EM algorithm described above, given the current parameter
estimate $\theta^{(k)}$, realizations are obtained from $f_{\theta^{(k)}}(a, G \mid y, M)$ and used to obtain the
next parameter estimates $\theta^{(k+1)}$, say. The realized major genotypes G are realizations
from the marginal conditional distribution $P_{\theta^{(k)}}(G \mid y, M)$. The same realizations can
thus be used to estimate the LOD score at $\theta^{(k)}$; no additional realizations are needed.
However, when satisfactory estimates of other parameters are obtained, and the
[...] run [...]
to provide [...] score curve. A [...]
r provides [...] between r [...] same
r' [...].
All the methods above can be used on more than one pedigree; [...]
of (a, G) conditional on (y, M) are simply obtained for [...] and [...] required
conditional expectations combined in [...] equations. For estimation [...]
information matrix, since pedigrees are unrelated, the total observed information
is simply the sum of the values for the individual pedigrees. The inverse of the
observed information matrix is then an estimate of the asymptotic variance-covariance
matrix on the total data set. Likewise, the overall LOD score is the sum of the LOD
scores on individual pedigrees.
Results
In this section we provide two examples to illustrate the method proposed in the previous
section. The programs implementing the method proposed in this paper are written in
C. All the computations were carried out on a DECstation 3100. The random number
generator used was the run-time library drand48. The program psdraw (Geyer, 1988)
was used to draw pedigrees.
Example 1. Simulated data.
We consider simulated data on a 230-member, six-generation pedigree (Figure 1).
[...] are 67 [...] was [...] to [...] data;
simulation values are shown [...]
is 0.5.
[...] same [...]
1.0, $\mu_3$ = [...] r = [...] p = [...]
iterations of Monte [...] iteration 200
realizations (a, G) were [...] 20 [...] of [...] of pedigree
between each sampling. For the [...] EM iterations, 1000 Gibbs realizations
were sampled, with 20 [...] between each sampling. Once the final estimates were
obtained, 8000 realizations, with 30 cycles between two realizations, were
drawn [...] asymptotic variance-covariance matrix and the
LOD scores at various recombination fractions.
Figure 2 shows the LOD score and the parameter estimates against the EM
iterations. The Monte Carlo samples used in the EM iterations are not large; figure 2
reflects the continuing random variation in the conditional expectations used for the
EM procedure. However, larger samples are unnecessary. Even for this case where
the data provide substantial information, the statistical standard errors (Table 4)
are much greater than the standard errors in the Monte Carlo sampling. The final
estimates, with their estimated standard errors and nominal 95% confidence intervals,
are shown in Table 4. [...] 5 [...] the asymptotic [...] matrix of
[...] score curve: estimated maximum LOD
score is [...] is evident,
not seem to [...]
effects are correctly inferred.
[...] errors,
score at [...]
variance component estimates have higher relative standard
[...] addition, the well [...]
95% confidence interval includes
the true parameters; in all cases [...]
simulation value.
nominal [...]
Example 2. Hypercholesterolemia and the LDL receptor
gene.
We re-analyze the data on LDL cholesterol levels and LDL receptor genotypes on a
60-member, five-generation pedigree (Leppert et al., 1986). The pedigree is shown
in figure 4. This data set has been extensively studied by several workers; the
following analysis is presented to illustrate the methods of this paper, and not to
draw conclusions about the genetic mechanisms of the disease.
Using the Pedigree Analysis Package, PAP, Leppert et al (1986) carried out a
segregation analysis under the assumption of a [...] model. Then, they performed a
linkage analysis [...] the parameters obtained from the [...] analysis. They
[...] of [...] at r = [...]
[...]
Bonney et al (1988) performed a combined segregation and linkage analysis using a regressive
[...] To [...]
assumed a dominant [...]
0. [...]
leading to [...]
a maximum [...] score [...] at [...]
Thomas and Cortessis [...] a [...]
no [...]
found that the posterior mean of [...] ranged from 0.065 to [...]
recombination fraction from 0.076 to 0.318.
We performed a combined segregation and linkage analysis using the methods of
this paper. Since individual 7 is unobserved and does not have offspring, and thus
contributes no [...] she [...] the [...] evident
that the genotypes of individuals 8, 18 and 23 can be [...] the existing data.
Following Leppert et al (1986), we used the model (1), but made no ascertainment
correction nor any assumption of dominance. Starting values p = 0.4, $\sigma_a^2$ = 718.0,
$\sigma_e^2$ = 3797.0, $\mu_1$ = 375.6, $\mu_2$ = 139.7, and $\mu_3$ = 95.3 were obtained by a Monte Carlo
EM of the mixed model without marker data. Then we performed Monte Carlo EM
for 200 iterations. At each EM iteration a sample of 400 Gibbs realizations was
drawn, with 10 cycles between each sampling. For the last ten iterations, 1000 Gibbs
realizations were sampled, with 20 cycles between each sampling. Based on the final
estimates, 12,000 realizations were drawn, with 20 cycles between two consecutive
samplings. The final estimates, standard errors and confidence [...]
6. The estimated maximum [...]
iterations are shown [...]
score is 7.13 (Figure [...]
no [...]
ascertainment correction was [...], it cannot [...]
Second, because of the number of founders ($n_F$ = 16),
a [...] since [...]
is k [...]
[...] information in these data is not great; the likelihood surface is flat. Although
the presence of the major gene is clear, the magnitudes of the major gene effects have
wide confidence intervals. As usual, the estimates of additive and error variances are
even less precise. In fact, the wide confidence interval for $\sigma_a^2$ means that for these
data there is no [...] of any polygenic [...] (Table 7).
The results of our analysis of these data are (not surprisingly) consistent with those
of previous authors. The current approach provides maximum likelihood estimates of
all the parameters in the model, together with standard errors or other measures of
precision. The procedure for likelihood ratio evaluation provides a LOD score curve,
and also permits exploration of the multiparameter likelihood surface.
Discussion
Almost every function in human biology exhibits continuous variation. Aspects of
diabetes and hypertension, predisposition to cancer, drug and alcohol sensitivity, can
be measured as quantitative [...] complex behavioral [...] psychological
[...] are
[...] are
quantitative ones focusing on [...] instead
of affectation status.
Our approach provides a practical Monte [...]
loci is [...]
approach to combined [...]
to handle complex models and large [...] It [...] simple,
numerically stable and computationally [...] In a [...] this
approach works quite well. [...] a Monte Carlo EM approach, combined segregation
and linkage analysis does not substantially increase the computing time, compared
with segregation analysis [...]
Due to the formidable computational [...] of complex segregation analysis
and increasing computing power, there is greatly increased interest in employing
Monte Carlo methods in pedigree analysis. Ott (1989), Ploughman and Boehnke
(1989) and Kong et al. (1991) have independently proposed Monte Carlo methods for
sampling from the pedigree genotype distribution conditioned on the trait phenotypes
observed in the pedigree. Unlike the current approach, those methods require
probability computations at the trait locus in order to simulate data at a linked
marker. Thus they are not feasible for complex models or complex pedigrees. Closer
to this paper is the work of Lange and Matthysse (1989) and Lange and Sobel (1991),
who proposed using a Metropolis [...] to calculate LOD scores [...] location
scores. [...] and [...] a [...]
for two-point linkage analysis combines a Bayesian perspective [...]
information on parameters [...] is no consensus [...]
likelihood surface [...]
methods of [...]
penetrance functions are known [...]
Gibbs sampler.
we [...]
are [...]
a complex [...]
by estimating [...]
and coworkers [...]
on choice of the [...]
and LOD or location scores are [...] generally, all previous [...] have
been restricted to relatively simple genetic models for the trait. By contrast, the
Monte Carlo EM approach permits estimation of the parameters of complex models,
in conjunction with linkage analysis, and exploration of a multi-parameter likelihood
surface.
The efficiency and validity of alternative methods of Markov chain Monte Carlo are
currently an active research area in the statistical literature (Tierney, 1991; Gelfand
and Smith, 1990). For validity, the technical requirement is that of irreducibility
(hence ergodicity) of the Markov chain. For the Gibbs sampler employed in this
paper, as also in Thomas and Cortessis (1991), irreducibility is only assured for a
diallelic marker locus. Lange and Sobel (1991) point to the same requirement for
their Metropolis algorithm. However, irreducibility is not the main barrier in practice.
Depending on marker phenotypes observed on the pedigree, it may in fact obtain
for multi-allelic markers. Further, it can always be assured by modification of the
sampling procedure; one modification is the rejection sampling method proposed
by Sheehan and [...]
The greater practical problem is computational efficiency [...]
are as [...]
the Metropolis [...] can [...] "sticky").
[...] same [...] true of [...] sampler [...] to rare
recessives on a complex [...] (Thompson, [...] a [...] reason.
By contrast, the [...] of genotype-phenotype correspondence [...] a model [...] a
complex quantitative trait results in less "stickiness" for the Markov chain of genotypic
configurations. However, marker information, with not all individuals observed,
and/or with tight linkage, is likely to create problems of computational efficiency.
The occurrence of multiple alleles at marker loci, and the consequent necessity of
using rejection sampling or some other method to ensure ergodicity, can only increase
these problems. Computational efficiency is an important issue that warrants further
investigation.
Finally, it should be pointed out that the approach of this paper is not limited to
the mapping of a single quantitative trait in the framework of mixed models. The
same approach can be applied to a variety of gene mapping problems, such as power
calculation for linkage analysis, and combined segregation and linkage analysis for
multivariate traits. It can be developed to incorporate genetic heterogeneity among
different pedigrees, and to handle multiple trait and marker loci. It opens up new
ways to tackle complicated models for which analytical methods are often
lacking.
[...] fruitful discussions and [...]
his [...]
and comments, [...]
for helpful [...]
Sheehan for providing her pedigree neighborhood
programs.
References
Bonney GE (1984) On [...]
continuous human traits: Regressive models. American Journal of Medical [...]
Bonney GE, Lathrop GM, Lalouel J-M (1988) Combined linkage and segregation
analysis using regressive models. American Journal of Human Genetics 43:29-37
Eaves LJ (1983) Errors of inference in the detection of major gene effects on
psychological test scores. American Journal of Human Genetics 35:1179-1189
Elston RC (1984) Genetic Analysis Workshop II: Sib pair screening tests for linkage.
Genetic Epidemiology 1:175-178
Elston RC, MacCluer JW, Hodge SE, Spence MA, King RH (1989) Genetic Analysis
Workshop 6: Linkage analysis based on affected pedigree members. In Multipoint
Mapping and Linkage Based on Affected Pedigree Members. RC Elston et al
(eds). Alan R. Liss, New York. pp 93-103
Gelfand AE, Smith AFM (1990) Sampling based approaches to calculating marginal
[...]
[...] Transactions on [...]
Machine Intelligence 6: [...]
Geyer CJ [...]
of pedigree [...]
and Statistics: Proceedings of the 23rd Symposium on the Interface, pp 156-163.
Interface Foundation of North America.
Go RCP, Elston RC, Kaplan EB [...] Efficiency [...]
segregation analysis. American Journal of Human Genetics 30:28-37
Guo SW, Thompson EA (1991) Monte Carlo estimation of variance component
models. IMA J Math Appl in Med & Biol 8:171-189
Guo SW, Thompson EA (1992) Monte Carlo estimation of mixed models for large
complex pedigrees. Submitted.
Haseman JK, Elston RC (1972) The investigation of linkage between a quantitative
trait and a marker locus. Behav Genet 2:3-19
Hasstedt SJ (1982) A mixed-model likelihood approximation on large pedigrees.
Computers and Biomedical Research 15:295-307
[...] A simple method [...] a numerator [...]
[...] A method combining [...]
Gene Mapping [...]
[...] A, Cox D, [...] Bale SJ, and [...]
of [...]
quantitative traits [...] nuclear families: Comparison of
[...] program packages.
Genetic Epidemiology 6:713-726
Lalouel J-M, Morton NE (1981) Complex segregation analysis with pointers. Human
Heredity 31:312-321
Lange K, Matthysse S (1989) Simulation of pedigree genotypes by random walks.
American Journal of Human Genetics 45:959-970
Lange K, Sobel E (1991) A random walk method for computing genetic location
scores. American Journal of Human Genetics 49:1320-1334
Leppert MF, Hasstedt S et al (1986) A DNA probe for the LDL receptor gene is
tightly linked to hypercholesterolemia in a pedigree with early coronary disease.
American Journal of Human Genetics 39:300-306
MacLean CJ, Morton NE, Lew R (1975) Analysis of family resemblance, IV.
Operational characteristics of segregation analysis. [...] Human Genetics [...]
Ott J [...]
human linkage studies. American Journal of
Human Genetics 26:588-[...]
Ott J [...]
and mixed models in human pedigrees. American Journal of Human Genetics
31:161-175
Ott J (1989) Computer simulation methods in human linkage analysis. Proc Nat Acad
Sci USA 86:4175-4178
Ott J (1990) Cutting a Gordian knot in the linkage analysis of complex human traits.
American Journal of Human Genetics 46:219-221
Ott J (1991) Analysis of Human Genetic Linkage. Revised edition. The Johns
Hopkins University Press. Baltimore.
Ploughman LM, Boehnke M (1989) Estimating the power of a proposed linkage study
for a complex genetic trait. American Journal of Human Genetics 44:543-551
Risch N (1984) Segregation analysis incorporating linkage markers. I. Single-locus
models with an application to type I diabetes. American Journal of Human
Genetics 36:363-386
[...] defined on [...]
[...] press
Thomas [...] Cortessis V (1991) A [...] sampling approach to linkage analysis.
[...] J [...] Statist [...]
[...] for data [...]
Thompson EA (1991) Probabilities on complex pedigrees; the Gibbs sampler
approach. Computer Science and Statistics: Proceedings of the 23rd Symposium
on the Interface, pp 321-328. Interface Foundation of North America.
Thompson EA, Shaw RG (1990) Pedigree analysis for quantitative traits: variance
components without matrix inversion. Biometrics 46:399-413
Thompson EA, Guo SW (1991) Evaluation of likelihood ratios for complex genetic
models. IMA J of Math Appl in Med & Biol 8:149-169
Tierney, L. (1991) Markov chains for exploring posterior distributions. Technical
Report No. 560, School of Statistics, University of Minnesota.
Table 1: [...] for estimation of the recombination fraction: number of double-heterozygous [...] (H) [...] (R). Only informative matings are listed; - denotes impossible combinations. $\Phi$ [...]
[...] Cortessis (1991).
Number of
db/db db/dB dB/dB db/DB dB/Db
1 0 1 1 0
1 1 0 - 0 1
1 0 l' 1 1 0 1 0
1 1 l' 0 0 1 0 1
1 - 0 1 - - 1 0
1 1 0 ........., - 0 1
1 0 1 - l' 0 1 - 1 0
1 1 0 - l' 1 0 - 0 1
2 0 1 2 1 0 2 1 2 1 0
2 1 <I> 1 <I> 1 1 <I> 1 <I> 1
1 0 1 0 1 l' 1 0
1 - - 0 - 1 1 0
1 - - - 0 0 1 1 1 l' 0
1 - - - 0 - 1 1 0
2 2 1 0 1 2 0 1 0 1 2
1 - 1 0 1 0 l' - 0 1
1 1 0 0 1
1 - - - 1 1 0 0 0 'r 1
1 - - 1 0 - 0 1
[Table 2: ...] estimate
1. [...] pedigree neighborhood structure; Record the number of individuals [...];
2. [...] data; [...] the number of observed individuals, k;
3. Compute $A^{-1}$;
4. Set initial estimate of $\theta$; i.e. of p, r, $\mu_j$ (j = 1, 2, 3), $\sigma_a^2$, and $\sigma_e^2$;
5. Posterior gene-dropping:
     Drop G; Drop a;
6. De-memorization: Gibbs sample (a, G | y, M) for specified number of times,
     at current parameter values p, r, $\mu_j$ (j = 1, 2, 3), $\sigma_a^2$, and $\sigma_e^2$;
7. EM iteration
     Set p* = r* = $\sigma_a^{*2}$ = $\sigma_e^{*2}$ = 0, $\mu_j^*$ = 0, j = 1, 2, 3;
     For j = 1 to given number of sample size N
       Gibbs sample (a, G | y, M) at current parameter values,
       p, r, $\mu_j$ (j = 1, 2, 3), $\sigma_a^2$, and $\sigma_e^2$;
       for l = 1 to given number of cycles C
         Randomly permute all individuals in the pedigree;
         In order indicated above, update genotypes and random effects;
       next l
       After Cth cycle, we obtain one realization from $P_\theta(a, G \mid y, M)$;
       increment p* by $(2\,\mathbf{1}_F'\mathbf{1}_1 + \mathbf{1}_F'\mathbf{1}_2)/(2\,\mathbf{1}_F'\mathbf{1})$;
     next j
     [...]
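
The increment of p* in the listing above is the proportion of D alleles among the founders in one sampled configuration. A minimal sketch of that single step, and of its average over realizations, is given below. It assumes that p* is divided by N after the sampling loop; the function names and the list encoding of genotypes and founder indicators are hypothetical.

```python
def founder_allele_proportion(trait_genotypes, is_founder):
    """Proportion of D alleles among founders in one realization.

    trait_genotypes[i] is 1 (DD), 2 (Dd) or 3 (dd); is_founder[i] is 1 or 0.
    This equals (2 * #DD founders + #Dd founders) / (2 * #founders).
    """
    n_founders = sum(is_founder)
    n_DD = sum(1 for g, f in zip(trait_genotypes, is_founder) if f and g == 1)
    n_Dd = sum(1 for g, f in zip(trait_genotypes, is_founder) if f and g == 2)
    return (2 * n_DD + n_Dd) / (2 * n_founders)


def update_allele_frequency(realizations, is_founder):
    """Average the founder allele proportion over the sampled realizations."""
    p_star = 0.0
    for trait_genotypes in realizations:
        p_star += founder_allele_proportion(trait_genotypes, is_founder)
    return p_star / len(realizations)


# Hypothetical four-member pedigree: three founders and one non-founder.
founders = [1, 1, 1, 0]
p_one = founder_allele_proportion([1, 2, 3, 2], founders)
p_new = update_allele_frequency([[1, 2, 3, 2], [2, 2, 3, 3]], founders)
```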
Table 3: Example of linkage segregation probabilities $P(G_i \mid G_j, G_k)$ and the first order
derivatives of the logarithm of the segregation probabilities. The parental genotypes
are db/DB x dB/Db. The recombination fraction is r. The derivative does not exist
at r = 0.

Offspring    Segregation            First order derivative of
genotype     probability            log segregation probability
db/db        r(1 - r)/4             1/r - 1/(1 - r)
db/dB        [r^2 + (1 - r)^2]/4    2(2r - 1)/[r^2 + (1 - r)^2]
dB/dB        r(1 - r)/4             1/r - 1/(1 - r)
db/Db        [r^2 + (1 - r)^2]/4    2(2r - 1)/[r^2 + (1 - r)^2]
dB/DB        [r^2 + (1 - r)^2]/4    2(2r - 1)/[r^2 + (1 - r)^2]
Db/Db        r(1 - r)/4             1/r - 1/(1 - r)
Db/DB        [r^2 + (1 - r)^2]/4    2(2r - 1)/[r^2 + (1 - r)^2]
DB/DB        r(1 - r)/4             1/r - 1/(1 - r)
db/DB        r(1 - r)/2             1/r - 1/(1 - r)
dB/Db        r(1 - r)/2             1/r - 1/(1 - r)
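
The entries of Table 3 can be checked by enumerating gametes. The following minimal sketch assumes phase-known parental genotypes written as two haplotype strings (trait allele followed by marker allele) and no interference beyond a single recombination fraction r; the function names are hypothetical.

```python
from collections import defaultdict


def gametes(hap1, hap2, r):
    """Gamete distribution of a parent with haplotypes hap1 and hap2."""
    recombinant1 = hap1[0] + hap2[1]
    recombinant2 = hap2[0] + hap1[1]
    dist = defaultdict(float)
    for gamete, prob in ((hap1, (1 - r) / 2), (hap2, (1 - r) / 2),
                         (recombinant1, r / 2), (recombinant2, r / 2)):
        dist[gamete] += prob
    return dist


def offspring_distribution(parent1, parent2, r):
    """Probabilities of phase-known offspring genotypes, keyed like 'db/DB'."""
    dist = defaultdict(float)
    for g1, p1 in gametes(parent1[0], parent1[1], r).items():
        for g2, p2 in gametes(parent2[0], parent2[1], r).items():
            # Order the two haplotypes so that each genotype has one label.
            key = "/".join(sorted((g1, g2), reverse=True))
            dist[key] += p1 * p2
    return dict(dist)


# The mating of Table 3, db/DB x dB/Db, at r = 0.1.
dist = offspring_distribution(("db", "DB"), ("dB", "Db"), 0.1)
```

The ten probabilities sum to 1 for any r, and a finite-difference derivative of their logarithms reproduces the third column of the table.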
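[Table 4: ...] standard errors [...]

Parameter   Simulation value   Estimate   S.E.     Nominal 95% confidence interval
$\mu_1$     2.0                2.3400     0.2190   (1.9109, 2.7692)
$\mu_2$     0.0                0.[...]    0.1805   (-0.1936, 0.5141)
$\mu_3$     -2.0               -2.1133    0.1832   (-2.4724, -1.7541)
[...]       0.6                0.6280     0.1557   ([...], 0.9332)
[...]       0.2                [...]      0.0694   (0.0174, 0.2895)
r           0.1                0.0385     0.0325   (0.0000, 0.1021)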
Table 5: Estimated variance-covariance matrix for simulated data. The actual value
of each element in the matrix is the shown value times a factor of $10^{-6}$.
p [...]
-1794.3 47940
-2863.1 17731 32591
-2506.2 9524.0 25539 33578
[...] 5102.5
Table 6: Estimated parameters, with standard errors and confidence intervals,
for hypercholesterolemia data.

Parameter      Estimate    S.E.       Confidence interval
p              0.3266      0.1066     (0.118, 0.536)
$\mu_1$        378.880     27.133     (325.700, 432.061)
$\mu_2$        157.220     21.851     (1[...], 200.049)
$\mu_3$        94.980      21.106     (53.613, [...])
$\sigma_a^2$   862.150     847.456    (0.0, 2523.163)
$\sigma_e^2$   2933.538    1122.495   (733.449, 5133.627)
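
The fully legible rows of this table appear consistent with nominal intervals of the form estimate plus or minus 1.96 standard errors, with the lower limit of a variance truncated at zero. This is an inference from the tabulated numbers rather than a statement in the text, and the small sketch below (with a hypothetical function name) only illustrates that arithmetic.

```python
def nominal_interval(estimate, std_error, z=1.96, lower_bound=None):
    """Normal-approximation interval, optionally truncated from below."""
    low = estimate - z * std_error
    high = estimate + z * std_error
    if lower_bound is not None:
        low = max(low, lower_bound)
    return low, high


p_interval = nominal_interval(0.3266, 0.1066)
mu1_interval = nominal_interval(378.880, 27.133)
var_a_interval = nominal_interval(862.150, 847.456, lower_bound=0.0)
var_e_interval = nominal_interval(2933.538, 1122.495, lower_bound=0.0)
```

Small discrepancies in the last digit are to be expected, since the tabulated estimates and standard errors are themselves rounded.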
Table 7: Estimated variance-covariance matrix for hypercholesterolemia data.
p   $\mu_1$   $\mu_2$   $\mu_3$   $\sigma_a^2$   $\sigma_e^2$
1.136 X 102
-6.159 X 101 X 102
7.725 X 101 3.006 X 102 4.775 X 102
-9.013 X 102 X X 101 X 102
7.067 X 10° X 102 X 7.
Figure 1: [pedigree for the simulated data; caption only partly recoverable: [...] constitute a [...]-member, four-generation [...]; [...] a [...]-member, five-generation pedigree; [...] constitute a [...]-member [...]]
Figure 2: [panels (a) to (f); horizontal axis: EM iteration, 0 to 200] [...] of the [...] LOD score, [...] error variance, [...]
Figure 3: [...] score curve [horizontal axis: recombination fraction, 0.0 to 0.5]
[Figure 4: pedigree with marker genotypes (BB, Bb); caption not recoverable] [...] are [...]
Figure 5: Monte [...] score curve [horizontal axis: recombination fraction, 0.0 to 0.5]
[Figure 6: panels; horizontal axis: EM iteration, 0 to 200] Combined segregation and [...] LDL data. Estimate of [...] error variance, [...]