Gene Mapping: Linkage Analysis
Eric Sobel
2010 Apr 19–26 HG 236B Page 1
Overview
• Key Concepts
– Conceptually, how to map genes
– IBS & IBD
– Linkage vs. Association
• Parametric Linkage
• Kinship Coefficients
• Non-Parametric Linkage
• Mendel Software Package
How to Map Genes
• We will start from simple ideas, and build to complex strategies.
• What is the simplest genetic architecture for gene mapping?
• Conceptually, without practical limits, how would you find the genetic etiology of this trait?
• Why is this simple model not appropriate for many traits?
How to Map Genes
• How can one carry the simple strategies forward to more complex trait architectures?
• In a word, statistics.
• How strong is the genetic difference observed between affecteds and normals? Or, more generally, how significant is the correlation between genotype and phenotype?
How to Map Genes
Reference for previous figure
Sajid Malik, Jörg Schott, Julia Schiller, Anna Junge, Erika Baum, Manuela Koch (2008) Fifth finger camptodactyly maps to chromosome 3q11.2–q13.12 in a large German kindred. European Journal of Human Genetics 16:265–269
IBD & IBS
Two alleles at a locus are identical by descent (IBD) if and only if they are both descendants of a common ancestral allele. Two alleles at a locus are identical by state (IBS) if and only if they have the same value, i.e., are in the same state.
A/C A/C
A/A A/C
A|A A|C A|C
A/C T/A
A/C A/C T/A
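The IBS/IBD distinction can be sketched in a few lines of Python. This is only an illustration: it assumes each allele is written as a (state, founder_label) pair, where the hypothetical founder label records which ancestral copy the allele descends from. Real data give only the state; IBD has to be inferred from the pedigree.

```python
def shared_count(a, b):
    """Count how many items of a can be matched one-to-one with equal items of b."""
    remaining = list(b)
    shared = 0
    for item in a:
        if item in remaining:
            remaining.remove(item)
            shared += 1
    return shared


def ibs_count(g1, g2):
    """Alleles shared identical by state: compare only the allele values."""
    return shared_count([state for state, _ in g1], [state for state, _ in g2])


def ibd_count(g1, g2):
    """Alleles shared identical by descent: compare the (hypothetical) founder labels."""
    return shared_count([label for _, label in g1], [label for _, label in g2])


# Two sibs who both type as A/C. Their A alleles are copies of different
# founder alleles (labels 1 and 2); their C alleles are copies of the same one (label 3).
sib1 = [("A", 1), ("C", 3)]
sib2 = [("A", 2), ("C", 3)]
print(ibs_count(sib1, sib2), ibd_count(sib1, sib2))  # prints: 2 1
```

The sibs share two alleles IBS but only one IBD, which is why IBS is only a stand-in for IBD.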
• The data is the most important component of statistical genetic analysis!
• You need “lots of good data” to have enough power to find genes for complex traits.
• The amount of genetic data is exploding, which forces new, professional management and analysis techniques.
• Do not underestimate the considerable time needed to manage the data. If you have several projects, consider a dedicated data manager.
Parametric Linkage Analysis
• “Linkage” – we are trying to find if two loci are linked, i.e., close together on a chromosome.
• “Parametric” – the input must include parameters defining how we think the genotypes at the trait locus influence the trait phenotype, a.k.a. the mode or model of inheritance, or the penetrance values.
• Mathematical model for familial transmission and genetic susceptibility.
• Many linkage methods originated in 1950–1970 to analyze genetic data from populations with rare traits (e.g., Huntington’s disease or ataxia-telangiectasia).
• Rare traits are usually influenced by a major gene (although there are usually modifier genes and/or other genes with minor effects as well).
• Very successful at finding major genes (where the signal-to-noise ratio is high)!
Linkage Analysis Overview
• Linkage analysis is good at finding rare variants that segregate through families.
• A linkage signal can be detected at a marker up to 20 Mb away from the trait locus. Thus few markers are needed to cover the genome.
• Parametric Linkage Analysis is good at taking a genome scan of data on (extended) pedigrees and estimating a region of linkage.
• For complex traits, the localized region will usually be > 2 Mb wide, perhaps > 10 Mb, depending on the amount of data.
Linkage Analysis Overview
• Must have genotypes on related individuals.
– May be any structure from parent-child trios to large, extended pedigrees.
• Must know the pedigree structure with high confidence.
• Trait may be qualitative or quantitative, but must be similarly defined across all pedigrees; the more precise the definition, the better.
Linkage Analysis Overview
• Causes of the trait should be rare enough that most of the affected individuals within a single pedigree have the trait due to the same set of conditions.
– Differing causes across pedigrees are OK.
• Reasonably strong genetic effect of the susceptibility loci.
Linkage Analysis Overview
Parametric Linkage (and NPL) is a pedigree-based analysis, i.e., all results are due to relationships and IBD status (determined from the recombination events) within the pedigrees; no analysis is done between pedigrees. However, one can combine analysis from within many pedigrees to get an overall result.
If the region of interest is smaller than ~2 Mb, then there will be very few recombination events in this region within any one pedigree. Since Linkage only uses results within pedigrees, it will be less useful in this small region.
Comparison to Association
Association (and Haplotyping) is a population-based analysis, i.e., results take into account IBS status across all pedigrees. Association analysis can give significant results in small regions. However, the signal usually extends only over short distances, so markers must lie close to the trait locus.
Association, by considering extant families as bottom pieces of a very large unknown pedigree, can use IBS as a more convenient stand-in for IBD.
Comparing Association (IBS) to Linkage (IBD)
from A tutorial on statistical methods for population association studies by DJ Balding Nature Reviews Genetics 7, 781-791 (October 2006)
Linkage and Association are Complementary Methods
• Genetics is technology-driven, and the newest technology is designed for Association testing, not Linkage.
• Lots of “low-hanging fruit” can be found with this new technology. New susceptibility genes for complex traits are finally being identified.
• Linkage and Association are complementary. Some genes will be much easier to find with one method, some with the other.
Linkage and Association are Complementary Methods
from Patterns of linkage disequilibrium in the human genome by Ardlie, Kruglyak, and Seielstad, Nature Reviews Genetics 3, 299-309 (April 2002)
Linkage and Association are Complementary Methods
Lobo, I. (2008) Multifactorial inheritance and genetic disease. Nature Education 1(1)
Common Disease, Common Variant (CDCV) Hypothesis
• CDCV: the allelic variants causing common diseases will themselves be common.
• There may be many common gene variants (oligogenic), all with relatively minor effects on the trait, as well as possibly an overall background genetic effect (polygenic).
• Due to their minor effects, the variants have escaped selection pressure, allowing the trait to remain common.
Common Disease, Common Variant (CDCV) Hypothesis
• The CDCV hypothesis has been popular since the late 1990s.
• CDCV drove the push for genome-wide association technologies and studies.
• No question that new, replicated, common, susceptibility variants have been found based on these new technologies and studies.
Published Genome-Wide Associations through 12/2009: 658 published GWA at p ≤ 5 × 10⁻⁸
NHGRI GWA Catalog www.genome.gov/GWAStudies
• In practice, the effect sizes of the variants found in these GWAS are often very small. Often the largest accounts for < 3% of the population-attributable genetic risk. Even all together, the variants often account for < 10% of the total genetic effect.
• Where is all the “dark matter,” the rest of the genetic heritability?
– Gene-gene interactions (epistasis)
– Epigenetics
– Gene-environment interactions
– Rare variants
– ???
Summary
Linkage                     Association
Within-pedigree analysis    Across-pedigree (or unrelateds) analysis
IBD based                   IBS based
Tracks regions of DNA       Tracks specific values at markers
Signal spread ~2 Mb         Signal spread ~100 Kb
Harmed by LD                Needs LD
Still More Overview
• There are Many Programs but Few Methods. Don’t just try every program you can get your hands on.
• If only one program gives you good results, be suspicious. If results are not robust to small perturbations in your data (including the model), be suspicious.
• In fact, always be suspicious. Don’t treat the programs as impenetrable black boxes. Know their assumptions and shortcomings.
This Week
• We will go through an example of simple Parametric Linkage calculations and then discuss more general, realistic, and complicated computations.
• We will cover Non-Parametric Linkage Analysis.
• We will then discuss software packages and their algorithms.
• Some mathematical detail will be presented, but also practical advice and overviews.
• For more theoretical detail, consider HG 207A (Biomath 207A); for more applied detail, consider HG 207B (Biomath 207B). Also covered in HG 224 (CS 224).
Output from Parametric Linkage is an estimate for the Recombination Fraction θ between the loci
• Theta is related to the expected number of perceived recombination events between two loci in one meiosis. Theta is the expected proportion of non-parental gametes produced at these two loci in one meiosis.
• 0 ≤ θ ≤ 0.5 (Why ≤ 0.5?)
• θ is the parameter in the output of Parametric Linkage Analysis
Recombination versus Genetic versus Physical distance
• Genetic distance is measured in Morgans ≡ expected number of crossovers between the two loci per gamete
• Genetic distance is related to recombination fraction via a map function, e.g., Haldane map (assumes no interference) or Kosambi map (allows for interference)
• Internally, software assumes no interference and uses recombination fractions
• Be aware of which values the software input should be: Haldane cM, Kosambi cM, or recombination fraction
• For small distances, genetic distance (M) is similar to recombination fraction (θ)
• Genetic distance is additive, recombination fractions are not
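The relation between genetic distance and recombination fraction can be sketched in Python. The slides name the Haldane and Kosambi map functions without writing them out; the formulas below are the standard closed forms, with d measured in Morgans, and the function names are illustrative only.

```python
import math


def haldane_theta(d):
    """Recombination fraction for genetic distance d (Morgans), no interference."""
    return 0.5 * (1.0 - math.exp(-2.0 * abs(d)))


def haldane_distance(theta):
    """Inverse Haldane map: distance in Morgans for 0 <= theta < 0.5."""
    return -0.5 * math.log(1.0 - 2.0 * theta)


def kosambi_theta(d):
    """Recombination fraction for genetic distance d (Morgans), with interference."""
    return 0.5 * math.tanh(2.0 * abs(d))


def kosambi_distance(theta):
    """Inverse Kosambi map: distance in Morgans for 0 <= theta < 0.5."""
    return 0.25 * math.log((1.0 + 2.0 * theta) / (1.0 - 2.0 * theta))


# Distances add across adjacent intervals; recombination fractions do not.
t12, t23 = haldane_theta(0.10), haldane_theta(0.20)
t13 = haldane_theta(0.30)
print(t12 + t23, t13)  # the plain sum overshoots the true theta for the whole interval
```

Under either map function θ approaches but never exceeds 0.5 as d grows, and for small d the two quantities nearly coincide. Under Haldane the fractions for adjacent intervals combine as θ13 = θ12 + θ23 − 2·θ12·θ23, not as a plain sum.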
Recombination versus Genetic versus Physical distance
• Very complex relation between genetic distance and physical distance (bp)
• The female genetic map is generally longer than the male map in humans (in some small regions ~10:1; but also, although rarely, in some small regions ~1:10, e.g., the pseudo-autosomal region on X, which is 19 cM in males but 2.7 cM in females); obviously, the physical maps are identical
• Very rough rule of thumb in humans: 1 cM ≈ 1 Mbp
• Map functions are a topic of continuing research
• Use sex-specific maps as input to the software, if possible
Example Location Score Graph
[Figure: location score (vertical axis, -1.0 to 6.0) plotted against distance in Morgans (horizontal axis, -0.20 to 0.50) for selected pedigrees.]
Mendel’s Second Law: Independent Assortment
• “Genes controlling different traits segregate independently”
• So, suppose a person is heterozygous at two loci: A/a at one and C/c at the other
• Four possible gametes: AC, Ac, aC, ac
• Under independent assortment they are all equally likely
Pairs of Homologous Chromosomes
Tight Linkage Violates Mendel’s Second Law
• However, if we know the two loci are tightly linked and that the person has AB on one homologous chromosome and ab on the other (i.e. we know phase), then the most common gametes will be AB and ab, i.e., the non-recombinant (NR) or parental gametes
• Ab and aB, the recombinant (R) or non-parental gametes, will be rare
Gamete Probabilities
If θ is the recombination fraction between locus A and B, then for a person with AB|ab phased genotype, the gamete probabilities are:
Gamete   Probability   Type
AB       (1 − θ)/2     NR
ab       (1 − θ)/2     NR
Ab       θ/2           R
aB       θ/2           R
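As a quick check on the gamete table, here is a minimal Python sketch for a person with phased genotype AB|ab. The function name is illustrative. Setting θ = 0.5 recovers independent assortment, where all four gametes are equally likely.

```python
def gamete_probabilities(theta):
    """Gamete probabilities for a phased AB|ab double heterozygote."""
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie in [0, 0.5]")
    return {
        "AB": (1.0 - theta) / 2.0,  # non-recombinant (parental)
        "ab": (1.0 - theta) / 2.0,  # non-recombinant (parental)
        "Ab": theta / 2.0,          # recombinant (non-parental)
        "aB": theta / 2.0,          # recombinant (non-parental)
    }


print(gamete_probabilities(0.5))   # unlinked: every gamete has probability 0.25
print(gamete_probabilities(0.05))  # tight linkage: recombinant gametes are rare
```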
Pedigree 1
A/A B/b
a/a b/b
A/a B/b
a/a b/b
A/a B/b
a/a b/b
a/a B/b
A/a B/b
A/a B/b
A/a b/b
A/a B/b
NR NR NR NR NR R R
Likelihood of Pedigree 1
• $L(\text{data} \mid \theta = \theta_1) = (1-\theta_1)^5\,\theta_1^2$, which can be evaluated for all $\theta_1$
• Null hypothesis is that loci A and B are unlinked, i.e., that $\theta_0 = 0.5$; $L(\text{data} \mid \theta = 0.5) = (0.5)^7$
• We form a Likelihood Ratio Test (LRT) which compares a general $\theta_1$ against the null $\theta_0$: $L(\text{data} \mid \theta_1) / L(\text{data} \mid 0.5)$. This is the odds for $\theta_1$ compared to the Null
Maximum Likelihood Estimate
Now one can calculate this likelihood ratio (the odds) at various values of θ. The value that maximizes the odds is the Maximum Likelihood Estimate (MLE) of θ, written $\hat{\theta}$. This is the best estimate for θ, given the data.
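A minimal Python sketch of the maximization, assuming the phase-known counts from Pedigree 1 (5 non-recombinants, 2 recombinants). The grid search is only for illustration; with all phases known the maximum has the closed form #R / (#R + #NR), truncated at 0.5.

```python
def likelihood(theta, n_nr=5, n_r=2):
    """Phase-known likelihood: (1 - theta)^NR * theta^R."""
    return (1.0 - theta) ** n_nr * theta ** n_r


def odds(theta, n_nr=5, n_r=2):
    """Likelihood ratio of theta against the null of no linkage (theta = 0.5)."""
    return likelihood(theta, n_nr, n_r) / likelihood(0.5, n_nr, n_r)


def mle_theta(n_nr=5, n_r=2, steps=5000):
    """Grid search for the theta in [0, 0.5] that maximizes the odds."""
    grid = [0.5 * k / steps for k in range(steps + 1)]
    return max(grid, key=lambda t: odds(t, n_nr, n_r))


theta_hat = mle_theta()
print(theta_hat, odds(theta_hat))  # theta_hat is close to 2/7
```

When recombinants outnumber non-recombinants, the search stops at the boundary θ = 0.5, which matches the constraint 0 ≤ θ ≤ 0.5.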
LOD
• LOD is the logarithm base 10 of the odds: LOD(θ1) = log10[L(θ1) / L(0.5)]
• If LOD(θ1) > 3, then we can reject the Null and conclude there is linkage at θ1 (for a genome scan, use 3.6 rather than 3)
• If LOD(θ1) < -2, then we say linkage is excluded at θ1, but this isn’t too useful for complex traits
• Otherwise, we are inconclusive and need more data, i.e., more families
• LODs can be added across independent families at a given position, under identical locus definitions
LOD scores for Pedigree 1
In this case, in which all phases are known:
$\hat{\theta} = \frac{\#R}{\#R + \#NR} = \frac{2}{7} \approx 0.286$, with $\mathrm{LOD}(\hat{\theta}) \approx 0.288$
$\mathrm{LOD}(\theta) = \log_{10}[L(\theta) / L(0.5)] = \log_{10}[(1-\theta)^5\,\theta^2 / (0.5)^7]$
θ      LOD(θ)
0.05   -0.606
0.1    -0.122
0.2     0.225
0.3     0.287
0.4     0.202
0.5     0.0
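The tabulated values can be recomputed with a short Python sketch of the phase-known LOD formula, assuming the 5 non-recombinant and 2 recombinant meioses of Pedigree 1. The function name is illustrative.

```python
import math


def lod(theta, n_nr=5, n_r=2):
    """Phase-known LOD: log10 of (1 - theta)^NR * theta^R over (0.5)^(NR + R)."""
    like = (1.0 - theta) ** n_nr * theta ** n_r
    null = 0.5 ** (n_nr + n_r)
    # A likelihood of zero (e.g. theta = 0 with observed recombinants) gives -infinity.
    return math.log10(like / null) if like > 0.0 else float("-inf")


for theta in (0.05, 0.1, 0.2, 0.3, 0.4, 0.5):
    print(f"LOD({theta}) = {lod(theta):.3f}")
print(f"LOD(2/7) = {lod(2.0 / 7.0):.3f}")
```

Note that θ = 0 is excluded outright here: a single observed recombinant makes the likelihood zero.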
Mapping Markers
• To choose between alternate maps for a set of markers, we compare the max LOD score for each ordering (n markers ⇒ n!/2 possible maps).
• Here θ is now a vector (θ1, …, θn−1), where θi = recombination fraction between markers i and i+1.
• Also, each LOD is now the log10 of the ratio of the likelihood using the chosen θi to the likelihood where all θi = 0.5.
• The ordering with the largest max LOD is the most likely.
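The count of n!/2 candidate maps follows because an order and its reverse describe the same map. A small Python sketch, assuming distinct marker names (the names themselves are placeholders), enumerates the distinct orders; scoring each order by its maximum LOD is left out.

```python
from itertools import permutations
from math import factorial


def distinct_orders(markers):
    """All marker orders, treating an order and its reversal as the same map."""
    seen = set()
    orders = []
    for perm in permutations(markers):
        if perm[::-1] in seen:
            continue  # the mirror image of this order was already listed
        seen.add(perm)
        orders.append(perm)
    return orders


orders = distinct_orders(["M1", "M2", "M3", "M4"])
print(len(orders), factorial(4) // 2)  # prints: 12 12
```

The factorial growth is the practical point: exhaustive comparison is feasible only for small marker sets.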
Traits
• Parametric linkage analysis tests linkage between traits and markers by turning the trait phenotype into a genotype (or more accurately, summing over the possible genotypes at the trait locus, each weighted according to the penetrance function).
• The simplest case is a rare, completely dominant (i.e., fully penetrant) phenotype:
affected phenotype ⇔ +/– genotype
normal phenotype ⇔ –/– genotype
Pedigree 2A
b/b b/a
b/a a/a
b/a a/a a/a b/a b/a b/a b/a
Pedigree 2A
b/b +/-
b/a -/-
b/a +/-
a/a -/-
b/a +/-
a/a -/-
a/a +/-
b/a +/-
b/a +/-
b/a -/-
b/a +/-
NR NR NR NR NR R R
Another Example: Pedigree 2B
b/b
b/b
b/a b/b b/a b/b
1 2
3 4
5 6 7 8
Another Example: Pedigree 2B
Person   Possible Genotype   Probability
1        a+ | ?-
2        b- | b-
3        a+ | b-
4        b- | b-
5        b- | a+             1-θ
6        b- | b-             1-θ
7        b- | a-             θ
8        b- | b-             1-θ
a/? +/-
b/b -/-
b/b -/-
b/a +/-
b/b -/-
b/a -/-
b/b -/-
1 2
3 4
5 6 7 8
a/b +/-
LOD scores for Pedigree 2B
$\mathrm{LOD}(\theta) = \log_{10}[L(\theta) / L(0.5)] = \log_{10}[(1-\theta)^3\,\theta^1 / (0.5)^4]$
Total LODs for Pedigree 2 A&B
Pedigree   LOD(θ) at θ =
Name       0.05    0.1     0.2    0.3    0.4
2A         -0.61   -0.12   0.22   0.29   0.20
2B         -0.16    0.07   0.21   0.22   0.14
Total      -0.77   -0.05   0.43   0.51   0.34
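Because LODs add across independent families, the totals can be recomputed with a short Python sketch. It assumes the phase-known counts read from the slides: Pedigree 2A has 5 non-recombinants and 2 recombinants, Pedigree 2B has 3 and 1. The tabulated totals appear to sum the already rounded entries, so an unrounded sum can differ by about 0.01.

```python
import math


def phase_known_lod(theta, n_nr, n_r):
    """LOD for a pedigree with known phase and counted (non-)recombinants."""
    return math.log10((1.0 - theta) ** n_nr * theta ** n_r / 0.5 ** (n_nr + n_r))


# (non-recombinants, recombinants) per pedigree, as counted on the slides
PEDIGREES = {"2A": (5, 2), "2B": (3, 1)}
THETAS = (0.05, 0.1, 0.2, 0.3, 0.4)


def total_lod(theta):
    """Sum of per-pedigree LODs at one theta (families assumed independent)."""
    return sum(phase_known_lod(theta, nr, r) for nr, r in PEDIGREES.values())


for name, (nr, r) in PEDIGREES.items():
    print(name, [round(phase_known_lod(t, nr, r), 2) for t in THETAS])
print("Total", [round(total_lod(t), 2) for t in THETAS])
```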
Informativeness
Meioses are informative for linkage if the gamete can be identified as R or NR. In these pedigrees, assume a rare, dominant trait (affected = +/-).
A/a +/-
A/a -/-
A/A +/-
a/a -/-
A/a +/-
A/a +/-
a/a -/-
A/a +/-
A/a -/-
A/a +/-
A/a +/-
a/a -/-
A/a +/-
A/a -/-
A/A +/-
A/a +/-
a/a -/-
A/a +/-
B/b -/-
A/b +/-
Uninformative Uninformative
Informative Informative
Another Example: Pedigree 2C
Person   Possible Genotype          Probability
                                    I      II
1        (I) b+|c- OR (II) b-|c+
2        b-|c-
3        b+|b-                      1-θ    θ
4        uninformative
5        c-|c-                      1-θ    θ
6        uninformative
3 4 5 6
2 1
b/c -/-
b/b +/-
b/c +/-
c/c -/-
c/b -/-
b/c +/-
LOD scores for Pedigree 2C
$\mathrm{LOD}(\theta) = \log_{10}[L(\theta) / L(0.5)] = \log_{10}\left[ \dfrac{\tfrac{1}{2}\theta^0(1-\theta)^2 + \tfrac{1}{2}\theta^2(1-\theta)^0}{\left(\tfrac{1}{2}\right)^2} \right]$
Total LODs for Pedigree 2 A,B&C
Pedigree   LOD(θ) at θ =
Name       0.05    0.1     0.2    0.3    0.4
2A         -0.61   -0.12   0.22   0.29   0.20
2B         -0.16    0.07   0.21   0.22   0.14
2C          0.26    0.21   0.13   0.06   0.02
Total      -0.51    0.16   0.56   0.57   0.36
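When phase is unknown, the likelihood averages over the two possible phases. The Python sketch below assumes the Pedigree 2C setup, where the two informative meioses are both non-recombinant under phase I and both recombinant under phase II, with each phase given prior probability 1/2.

```python
import math


def phase_unknown_lod(theta, n_meioses=2):
    """LOD when all informative meioses are NR under phase I and R under phase II."""
    phase_1 = (1.0 - theta) ** n_meioses  # every meiosis non-recombinant
    phase_2 = theta ** n_meioses          # every meiosis recombinant
    likelihood = 0.5 * phase_1 + 0.5 * phase_2
    return math.log10(likelihood / 0.5 ** n_meioses)


for theta in (0.05, 0.1, 0.2, 0.3, 0.4):
    print(f"LOD({theta}) = {phase_unknown_lod(theta):.2f}")
```

Unlike the phase-known case, this LOD stays finite at θ = 0, where it reaches log10(2) ≈ 0.30, because phase I explains the data without any recombinant.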
Complex Traits Require a Penetrance Function (a.k.a. Mode of Inheritance)
• The input parameter in Parametric Linkage is the model of how the genotype at the trait locus influences the trait phenotype.
• For complex traits this clearly needs to be more complicated than: affected phenotype ⇔ +/– genotype.
• The model is specified as a penetrance function: Pr(phenotype | genotype).
• For example,
           1/1   1/2   2/2
normal     1.0   0.0   0.0
affected   0.0   1.0   1.0
Penetrance Functions
• Another example,
           1/1   1/2   2/2
normal     1.0   1.0   0.0
affected   0.0   0.0   1.0
• More generally,
           1/1    1/2   2/2
normal     0.99   0.2   0.1
affected   0.01   0.8   0.9
• You can also have liability classes, for example, decade of life, smoking, value at second trait, etc. Some software is even more flexible and allows different values at each individual.
• Several conditions can increase the phenocopy rate, but its value should probably still be small. If you are using a multipoint analysis, it should definitely be positive (nonzero).
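A penetrance function is just a lookup table of Pr(phenotype | genotype). The Python sketch below encodes the "more general" table and shows why the phenocopy rate, Pr(affected | 1/1), matters. The posterior calculation is an added illustration, not from the slides: it assumes Hardy-Weinberg proportions and a hypothetical frequency q for allele 2.

```python
# Pr(phenotype | genotype), taken from the "more general" penetrance table
PENETRANCE = {
    "1/1": {"normal": 0.99, "affected": 0.01},  # 0.01 is the phenocopy rate
    "1/2": {"normal": 0.20, "affected": 0.80},
    "2/2": {"normal": 0.10, "affected": 0.90},
}


def pen(phenotype, genotype):
    """Penetrance lookup; an unknown phenotype (None) contributes a factor of 1."""
    if phenotype is None:
        return 1.0
    return PENETRANCE[genotype][phenotype]


def genotype_posterior(phenotype, q):
    """Pr(genotype | phenotype) by Bayes' rule, assuming HWE with allele-2 frequency q."""
    p = 1.0 - q
    prior = {"1/1": p * p, "1/2": 2.0 * p * q, "2/2": q * q}
    joint = {g: prior[g] * pen(phenotype, g) for g in prior}
    total = sum(joint.values())
    return {g: joint[g] / total for g in joint}


print(genotype_posterior("affected", q=0.01))
```

With q = 0.01, roughly 38% of affecteds are phenocopies (genotype 1/1) even though the phenocopy rate is only 1%, because the 1/1 genotype is so common.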
Standard Pedigree Likelihood Function
• In $\sum_{G_p}$, $G_p$ runs over all possible multilocus genotypes at individual p
• j runs over all founders; {c, m, f} runs over all parent-offspring triples; i runs over all individuals
• $X_i$ is the phenotype (or observed genotype) at individual i at all loci
$L = \sum_{G_1} \cdots \sum_{G_n} \Big[ \prod_j \mathrm{Prior}(G_j) \times \prod_{\{c,m,f\}} \mathrm{Trans}(G_c \mid G_m, G_f) \times \prod_i \mathrm{Pen}(X_i \mid G_i) \Big]$
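The structure of the pedigree likelihood can be seen in a brute-force Python sketch for the smallest case: one mother-father-child trio at a single diallelic trait locus. Everything here is a simplifying assumption for illustration: genotypes are coded as the number of copies of allele 2, founder priors use Hardy-Weinberg proportions with a hypothetical allele frequency q, and no marker locus is included, so θ does not appear. Real programs avoid this exhaustive sum.

```python
from itertools import product

GENOTYPES = (0, 1, 2)  # copies of allele 2 (hypothetical coding)


def prior(g, q):
    """Founder genotype probability under Hardy-Weinberg proportions."""
    p = 1.0 - q
    return (p * p, 2.0 * p * q, q * q)[g]


def trans(child_g, mother_g, father_g):
    """Pr(child genotype | parental genotypes) under Mendelian segregation."""
    tm, tf = mother_g / 2.0, father_g / 2.0  # chance each parent passes allele 2
    return ((1 - tm) * (1 - tf), tm * (1 - tf) + (1 - tm) * tf, tm * tf)[child_g]


def pen(phenotype, g, penetrance):
    """Pr(phenotype | genotype); penetrance[g] = Pr(affected | g); None = unknown."""
    if phenotype is None:
        return 1.0
    return penetrance[g] if phenotype == "affected" else 1.0 - penetrance[g]


def trio_likelihood(phenotypes, q, penetrance):
    """Sum over all genotype assignments of Prior x Trans x Pen for (mother, father, child)."""
    total = 0.0
    for gm, gf, gc in product(GENOTYPES, repeat=3):
        total += (prior(gm, q) * prior(gf, q) * trans(gc, gm, gf)
                  * pen(phenotypes[0], gm, penetrance)
                  * pen(phenotypes[1], gf, penetrance)
                  * pen(phenotypes[2], gc, penetrance))
    return total


print(trio_likelihood(("normal", "affected", "affected"), q=0.01, penetrance=(0.01, 0.8, 0.9)))
```

The sum has 3^n terms for n individuals at one diallelic locus, which is why the choice of algorithm matters for larger pedigrees and more loci.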
Locus Heterogeneity
• When more than one trait locus is suspected, Pr( any pedigree is segregating a disease gene linked to the current position, θ ) is called α.
• One can simultaneously estimate θ and α in an unbiased fashion using Parametric Linkage Analysis.
• Moreover, at each θ one can obtain the Heterogeneous LOD (HLOD) which is maximized over all α. This is the parametric score one should use for complex traits.
• Also, for a given θ and α one can obtain the posterior probability for each pedigree of whether that pedigree is segregating a disease gene in that vicinity.
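The slides do not write out the HLOD formula. The Python sketch below assumes the commonly used admixture form, in which each pedigree is linked with probability α: HLOD(θ) = max over α of Σ log10[α·10^LOD_i(θ) + (1 − α)]. The per-family LOD values are made up for illustration, and the grid search is a simple stand-in for a proper optimizer.

```python
import math


def hlod(family_lods, steps=1000):
    """Maximize the admixture LOD over alpha in [0, 1]; returns (HLOD, best alpha)."""
    best_value, best_alpha = 0.0, 0.0  # alpha = 0 always gives a score of 0
    for k in range(1, steps + 1):
        alpha = k / steps
        value = sum(math.log10(alpha * 10.0 ** lod + (1.0 - alpha))
                    for lod in family_lods)
        if value > best_value:
            best_value, best_alpha = value, alpha
    return best_value, best_alpha


# Hypothetical per-family LODs at one position: two linked-looking, two not.
lods = [1.5, 1.2, -2.0, -1.5]
value, alpha = hlod(lods)
print(sum(lods), value, alpha)  # the plain total is negative; the HLOD is not
```

Since α = 1 recovers the ordinary summed LOD and α = 0 gives zero, the HLOD is never smaller than either, which is why it holds up better when only some pedigrees are linked.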
Multipoint Analysis
• Parametric Linkage has been extended to multiple markers and almost any size pedigree, using the method of Location Scores.
• In Location Score computations, the positions of the markers are fixed and only the location of the trait locus varies (examples).
• Multipoint has the advantage of using all your data simultaneously; it can make otherwise uninformative markers informative.
• As with all multipoint analysis, it is crucial that the map order be correct. The marker positions may be approximate but their order must be right. Choose markers accordingly.
[Figure: Standardized location scores (in log10 units) resulting from the Monte Carlo analysis of 169 pedigrees in the Consortium which exhibit an ataxia-telangiectasia locus linked to 11q22-23. Horizontal axis: position of the A-T gene relative to marker loci (in cM).]
Multipoint versus Single Point (a.k.a. Two Point)
• Single Point analysis is more flexible in allowing the data to fit the model, since θ is not constrained by neighboring markers. That is, for models of inheritance that are not precise (and none are), the estimate $\hat{\theta}$ will be less accurate but the LOD score will still be accurate.
• Multipoint has more places that data error can enter the problem; but it can also help you find these errors.
Power Considerations for Linkage Analysis
Power to detect linkage depends strongly on the magnitude of the contribution that the disease locus makes to the genetic variation of the trait.
Figure adapted from Joe Terwilliger
[Diagram labels: True Marker Genotypes; Putative Trait-Locus Genotype; True Phenotype; Linkage (IBD) or LD (IBS); Incorrect Model; Tested Correlation; Other Loci; Environment; Observed Genotypes; Observed Phenotype; Mistype; Misdiagnose; Polygenic Effect; Small Genetic Effect; Alternate Etiology; Too Little Data]
Factors Influencing Power of Linkage Analysis
• Pedigree size and structure
• Total sample size
• Marker informativeness
• Distance of marker from disease locus
• Phenocopy rate
• Genetic heterogeneity
• Magnitude of genetic effect
Design Issues for Linkage Analysis
• For Parametric: Pre-specify a few trait models, e.g., reduced penetrance dominant and recessive, and an additive model. The model doesn’t need to be exactly right
• Simplify problem with either specific phenotype or isolated population, or both
• After positive result, test for model robustness
• Replication is vital
• SNP sets are available for linkage analysis: 6000 SNPs can cover the human genome, since a linkage signal can be detected far from the trait locus
Assumptions in Linkage Analysis
• Hardy-Weinberg Equilibrium
• Linkage Equilibrium!
• Random Mating
• No Chiasma Interference (Haldane map function)
• No Epistasis, i.e., no interaction between alleles at different loci
Disadvantages of Linkage Analysis
• For Parametric: Explicitly Model-based (although other methods are often implicitly model-based)
• Bilinearity sensitivity
• Analysis only within pedigrees, not across pedigrees
• Originally designed for simple traits, so it is difficult to include multiple trait loci simultaneously
• Localized region will usually be greater than 2 Mb wide
Advantages of Linkage Analysis
• Not sensitive to across-pedigree allelic heterogeneity nor population history
• No candidate genes to guess
• Can detect a signal up to 20 cM away
• Good at finding rare variants
• P-value (LOD score) is accurate given the data (including the model); for other methods estimated p-values may not be accurate, either too conservative or anti-conservative
• Genome-wide level of significance is well-defined; not true for other methods
Overview of current general-pedigree Linkage Analysis software
Algorithm                  Programs                              Solution   Size Restriction
Elston-Stewart             FastLink, Linkage, Mendel, Vitesse    exact      varies: ~8 loci, less with loops (larger for VITESSE)
Lander-Green               Allegro, GeneHunter, Mendel, Merlin   exact      ~20 people ( 2n – f ≤ 20 )
Markov chain Monte Carlo   Loki, SimWalk                         estimate   much larger ( > 1000 individuals, > 1000 loci )

Increase in computational time with increase in:
Algorithm                  people        markers       missing data
Elston-Stewart             linear        exponential   severe
Lander-Green               exponential   linear        modest
Markov chain Monte Carlo   linear        linear        mild
Non-Parametric Linkage Analysis (NPL)
Identity By Descent (IBD)
Recall that two alleles at a locus are identical by descent (IBD) if and only if the two alleles are both descendants of a common ancestral allele.
A/C C/B
B/A B/A C/B C/A
Simple IBD
Simple IBD has 3 possible states:
[Figure: the two alleles (Allele1, Allele2) of Individual i compared with those of Individual j.]
Z0: Alleles IBD = 0
Z1: Alleles IBD = 1
Z2: Alleles IBD = 2
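For a concrete case, the probabilities of the three simple IBD states for a pair of full sibs can be found by enumerating all equally likely transmissions. This Python sketch assumes non-inbred parents carrying four distinct founder alleles, labeled 1 to 4 for illustration. The kinship coefficient line uses the standard non-inbred relation Φ = Z1/4 + Z2/2, which the slides do not state.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def sib_pair_ibd_distribution():
    """Exact Pr(Z0), Pr(Z1), Pr(Z2) for full sibs, by enumerating transmissions."""
    father, mother = (1, 2), (3, 4)  # four distinct founder alleles (hypothetical labels)
    counts = Counter()
    # Each sib receives one paternal and one maternal allele; all 16 outcomes are equally likely.
    for p1, m1, p2, m2 in product(father, mother, father, mother):
        counts[int(p1 == p2) + int(m1 == m2)] += 1
    total = sum(counts.values())
    return {z: Fraction(counts[z], total) for z in (0, 1, 2)}


z = sib_pair_ibd_distribution()
kinship = z[1] / 4 + z[2] / 2
print(z, kinship)  # Z0 = 1/4, Z1 = 1/2, Z2 = 1/4; kinship = 1/4
```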
Condensed IBD for Inbred Pedigrees
Condensed IBD has 9 possible states:
[Figure: the nine condensed identity states, S1–S9, for the two alleles (Allele1, Allele2) of Individual i and Individual j.]
Detailed IBD
Detailed IBD has 15 possible states:
[Figure: the fifteen detailed identity states, S*1–S*15, for the maternal and paternal alleles of Individual i and Individual j.]
Descent States
Descent Graphs (Inheritance Vectors)
Non-Parametric Linkage (NPL) Analysis
• If we assume that many of the affecteds within a pedigree are affected because they share particular disease alleles at a trait locus, then it is reasonable to predict that those disease alleles are IBD.
• Moreover, in those affecteds, the alleles at loci linked to the trait locus will also often be IBD.
• NPL analysis tests for more sharing in the affecteds than one would expect when there is no linkage.
NPL Example
Design Issues for NPL
• Unaffecteds are used to help determine IBD relationships of the alleles in the affecteds.
• Always try to genotype at least three individuals per family; if parents unavailable, use unaffected siblings.
• Usually NPL is a multipoint analysis to help infer IBD status.
Measuring Significance for NPL
• To measure the significance of the observed NPL statistic for linkage, one should compare it to the null distribution of the same statistic for similar data sets where there is no linkage.
• This results in a p-value ≡ Pr( observing a value as extreme or more extreme than the actual statistic | no linkage ).
• In practice, accurate p-values are time-consuming to compute, so many reported p-values are conservative.
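One way to picture the null distribution is by simulation. The Python sketch below is a toy version under stated assumptions, not the procedure of any particular program: the data set is a collection of independent affected sib pairs, the statistic is the total number of alleles shared IBD, and under no linkage each pair shares its paternal and its maternal allele independently with probability 1/2. The "+1" in the p-value is a common convention that keeps a simulated p-value from being exactly zero.

```python
import random


def simulate_null_sharing(n_sib_pairs, n_replicates, seed=0):
    """Total IBD sharing over sib pairs, simulated under no linkage."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_replicates):
        total = 0
        for _ in range(n_sib_pairs):
            # paternal allele shared or not, maternal allele shared or not
            total += rng.randint(0, 1) + rng.randint(0, 1)
        stats.append(total)
    return stats


def empirical_p_value(observed, null_stats):
    """Pr(statistic >= observed | no linkage), estimated from simulated replicates."""
    exceed = sum(1 for s in null_stats if s >= observed)
    return (exceed + 1) / (len(null_stats) + 1)


null_stats = simulate_null_sharing(n_sib_pairs=50, n_replicates=2000)
print(empirical_p_value(62, null_stats))  # 62 shared alleles where about 50 are expected
```

The cost is visible in the design: resolving a p-value near 10^-4 needs on the order of 10^4 or more replicates, each as expensive as the original analysis.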
NPL_Pairs Statistic
• NPL statistics measure the degree of sharing, among the affecteds, of alleles IBD at a specific position.
• For example, at a given locus R, $S_{\text{pairs}} = \sum_{I,J} [\, \mathrm{IBD}(I_M, J_M) + \mathrm{IBD}(I_M, J_P) + \mathrm{IBD}(I_P, J_M) + \mathrm{IBD}(I_P, J_P) \,]$, where I & J run over all pairs of affecteds, IBD(x,y) is the Pr(x and y are IBD), $I_M$ is the maternal allele at R in I, $I_P$ is the paternal allele at R in I, and similarly for $J_M$ and $J_P$.
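A minimal Python sketch of Spairs, under a simplifying assumption: the descent of every allele is known, so each affected is given as a (maternal, paternal) pair of founder-allele labels and IBD(x, y) is simply 1 or 0. In practice descent is uncertain and IBD(x, y) is a probability obtained by averaging over the possible descent patterns.

```python
from itertools import combinations


def s_pairs(affecteds):
    """Sum, over all pairs of affecteds, of the four allele-by-allele IBD indicators.

    affecteds: list of (maternal_label, paternal_label) founder-allele labels.
    """
    total = 0
    for (i_m, i_p), (j_m, j_p) in combinations(affecteds, 2):
        total += (i_m == j_m) + (i_m == j_p) + (i_p == j_m) + (i_p == j_p)
    return total


# Three affecteds; labels 1-4 are hypothetical founder alleles.
print(s_pairs([(3, 1), (3, 2), (4, 1)]))  # prints: 2
```

In the example, the first two affecteds share maternal allele 3 and the first and third share paternal allele 1, giving a total of 2.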
NPL_All Statistic
• Sall is another well known NPL statistic.
• Sall measures the degree of sharing of alleles in all subsets of the affecteds at once (not just in pairs of affecteds).
• $S_{\text{all}} = \big[ \sum_h \prod_{B(h)} |B(h)|! \big] / 2^t$, where {h} is the set of all possible vectors containing one allele at locus R from each affected; {B(h)} is the set of IBD blocks contained in h; and t is the number of affecteds.
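Sall can be evaluated directly from its definition when descent is known. This Python sketch makes that simplifying assumption: each affected is a pair of founder-allele labels, so the IBD blocks in a vector h are the groups of equal labels. The labels are hypothetical.

```python
from collections import Counter
from itertools import product
from math import factorial


def s_all(affecteds):
    """Average over all 2^t allele choices h of the product of |B|! over IBD blocks B."""
    total = 0
    for h in product(*affecteds):  # one allele picked from each affected
        term = 1
        for block_size in Counter(h).values():
            term *= factorial(block_size)
        total += term
    return total / 2 ** len(affecteds)


print(s_all([(1, 2), (3, 4)]))  # no sharing: 1.0
print(s_all([(1, 2), (1, 3)]))  # one allele shared: 1.25
print(s_all([(1, 2), (1, 2)]))  # both alleles shared: 1.5
```

The factorials reward large blocks, so one allele shared by many affecteds counts for more than the same amount of sharing spread over separate pairs.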
Recessive Statistic
• Spairs and Sall are the most commonly cited NPL statistics.
• For recessive traits, one expects the alleles at the trait locus in the affecteds will come from just a few founders.
• So, the number of IBD blocks contained in the set of all alleles at the trait locus in the affecteds, Sblocks, is a good statistic to detect recessive traits.
Dominant Statistic
• For a dominant trait, one expects that the affecteds share one allele IBD.
• So, the size of the largest IBD block contained in the set of all alleles at the trait locus in the affecteds, Smax-tree, is a good statistic to detect dominant traits.
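Both block statistics come from the same tally. The Python sketch below assumes descent is known, with each affected given as a pair of hypothetical founder-allele labels: it pools all alleles of the affecteds, groups equal labels into IBD blocks, and reports the number of blocks (Sblocks) and the size of the largest block (Smax-tree).

```python
from collections import Counter


def ibd_blocks(affecteds):
    """Sizes of the IBD blocks among all alleles carried by the affecteds."""
    return Counter(label for person in affecteds for label in person)


def s_blocks(affecteds):
    """Number of IBD blocks; few blocks suggests alleles from just a few founders."""
    return len(ibd_blocks(affecteds))


def s_max_tree(affecteds):
    """Size of the largest IBD block; a large block suggests one widely shared allele."""
    return max(ibd_blocks(affecteds).values())


dominant_like = [(1, 2), (1, 3), (1, 4)]   # everyone carries founder allele 1
recessive_like = [(1, 3), (1, 3), (3, 1)]  # all alleles trace to two founder alleles
print(s_blocks(dominant_like), s_max_tree(dominant_like))    # prints: 4 3
print(s_blocks(recessive_like), s_max_tree(recessive_like))  # prints: 2 3
```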
Advantages of NPL
• No explicit model of inheritance! (Thus NPL is also known as Model-Free Linkage Analysis.)
• Not sensitive to population history nor allelic heterogeneity.
• Signal can be seen up to 20 cM away (similar to Parametric Linkage).
Disadvantages of NPL
• No measure of specific position for trait locus
• P-values may be conservative
• Model is not explicit but it may be implicit in the statistic
• Not good for fine mapping below ~2 cM (similar to Parametric Linkage)
• Still computationally complex for large pedigrees, particularly for accurate p-values
Where does Parametric Linkage Analysis fit in a genetic study using (extended) pedigrees?
1) Mistyping Analysis (there are mistypings in all interesting genetic studies)
2) Genome scan (~1 cM intervals) analyzed using either Parametric Linkage (PL) or Non-Parametric Linkage (NPL) Analysis
3) Fine mapping of supported regions analyzed using either PL or NPL (2 – 0.1 cM intervals)
4) Below ~2 Mb move to Association Analyses. Alternatively jump directly to Genome-wide Association Analysis!
Mendel 10 Analysis Options
Number Analysis Name Number Analysis Name
1 Mapping Markers 14 Penetrances
2 Linkage Analysis 15 Gaussian Penetrances
3 Haplotyping 16 Combining Alleles
4 NPL 17 Gene Dropping
5 Mistyping 18 Combining SNPs
6 Allele Frequencies 19 Polygenic QTL
7 Genetic Counseling 20 QTL Association
8 Gamete Competition 21 Trimming Pedigrees
9 Pedigree Selection 22 Association given Linkage
10 Kinship Matrices 23 SNP Imputation
11 Genetic Equilibrium 24 SNP Association (GWAS)
12 Cases and Controls 25 File Conversion
13 TDT