Gene Mapping: Linkage Analysis - UCLA Human Genetics

Gene Mapping: Linkage Analysis

Eric Sobel

2010 Apr 19–26 HG 236B Page 1

Overview

•  Key Concepts
–  Conceptually, how to map genes
–  IBS & IBD
–  Linkage vs. Association

•  Parametric Linkage

•  Kinship Coefficients

•  Non-Parametric Linkage

•  Mendel Software Package


How to Map Genes


•  We will start from simple ideas, and build to complex strategies.

•  What is the simplest genetic architecture for gene mapping?

•  Conceptually, without practical limits, how would you find the genetic etiology of this trait?

•  Why is this simple model not appropriate for many traits?

How to Map Genes


•  How can one carry the simple strategies forward to more complex trait architectures?

•  In a word, statistics.

•  How strong is the genetic difference observed between affecteds and normals? Or, more generally, how significant is the correlation between genotype and phenotype?

How to Map Genes

[Figures not reproduced.]

Reference for previous figure

Sajid Malik, Jörg Schott, Julia Schiller, Anna Junge, Erika Baum, Manuela Koch. Fifth finger camptodactyly maps to chromosome 3q11.2–q13.12 in a large German kindred. European Journal of Human Genetics (2008) 16:265–269.


IBD & IBS

Two alleles at a locus are identical by descent (IBD) if and only if they are both descendants of a common ancestral allele.

Two alleles at a locus are identical by state (IBS) if and only if they have the same value, i.e., are in the same state.

A/C A/C

A/A A/C

A|A A|C A|C

A/C T/A

A/C A/C T/A

•  The data is the most important component of statistical genetic analysis!

•  You need “lots of good data” to have enough power to find genes for complex traits.

•  The amount of genetic data is exploding, which forces new, professional management and analysis techniques.

•  Do not underestimate the considerable time needed to manage the data. If you have several projects, consider a dedicated data manager.


Parametric Linkage Analysis

•  “Linkage” – we are trying to find if two loci are linked, i.e., close together on a chromosome.

•  “Parametric” – the input must include parameters defining how we think the genotypes at the trait locus influence the trait phenotype, a.k.a. the mode or model of inheritance, or the penetrance values.

•  Mathematical model for familial transmission and genetic susceptibility.

•  Many linkage methods originated in 1950–1970 to analyze genetic data from populations with rare traits (e.g., Huntington’s or ataxia-telangiectasia).

•  Rare traits are usually influenced by a major gene (although there are usually modifier genes and/or other genes with minor effect as well).

•  Very successful at finding major genes (where the signal to noise ratio is high)!


Linkage Analysis Overview

•  Linkage analysis is good at finding rare variants that segregate through families.

•  Linkage signal can be detected at a marker up to 20 Mb away from the trait locus. Thus few markers are needed to cover the genome.

•  Parametric Linkage Analysis is good at taking a genome scan of data on (extended) pedigrees and estimating a region of linkage.

•  For complex traits, the localized region will usually be > 2 Mb wide, perhaps > 10 Mb, depending on the amount of data.


•  Must have genotypes on related individuals.
–  May be any structure from parent-child trios to large, extended pedigrees.

•  Must know the pedigree structure with high confidence.

•  Trait may be qualitative or quantitative, but must be similarly defined across all pedigrees; the more precise the definition, the better.



•  Causes of trait should be rare enough that most of the affected individuals within a single pedigree have the trait due to the same set of conditions.
–  Differing causes across pedigrees is OK

•  Reasonably strong genetic effect of the susceptibility loci.



Parametric Linkage (and NPL) is a pedigree-based analysis, i.e., all results are due to relationships and IBD status (determined from the recombination events) within the pedigrees; no analysis is done between pedigrees. However, one can combine analysis from within many pedigrees to get an overall result.

If the region of interest is smaller than ~2 Mb, then there will be very few recombination events in this region within any one pedigree. Since Linkage only uses results within pedigrees, it will be less useful in this small region.


Comparison to Association

Association (and Haplotyping) is a population-based analysis, i.e., results take into account IBS status across all pedigrees. Association analysis can give significant results in small regions. However, these significant results can usually only be found over smaller distances.

Association, by considering extant families as bottom pieces of a very large unknown pedigree, can use IBS as a more convenient stand-in for IBD.


Comparing Association (IBS) to Linkage (IBD)

from A tutorial on statistical methods for population association studies by DJ Balding Nature Reviews Genetics 7, 781-791 (October 2006)

Linkage and Association are Complementary Methods

•  Genetics is technology driven and newest technology is designed for Association testing not Linkage.

•  Lots of “low-hanging fruit” can be found with this new technology. New susceptibility genes for complex traits finally identified.

•  Linkage and Association are complementary. Some genes will be much easier to find with one method, some with the other.


from Patterns of linkage disequilibrium in the human genome by Ardlie, Kruglyak, and Seielstad, Nature Reviews Genetics 3, 299-309 (April 2002)



Lobo, I. (2008) Multifactorial inheritance and genetic disease. Nature Education 1(1)

Common Disease, Common Variant (CDCV) Hypothesis

•  CDCV: the allelic variants causing common diseases will themselves be common.

•  There may be many common gene variants (oligogenic), all with relatively minor effect on the trait, as well as possibly an overall background genetic effect (polygenic).

•  Due to their minor effect the variants have escaped selection pressure, allowing the trait to remain common.


Common Disease, Common Variant (CDCV) Hypothesis

•  CDCV hypothesis has been popular since late 1990s.

•  CDCV drove the push for genome-wide association technologies and studies.

•  No question that new, replicated, common, susceptibility variants have been found based on these new technologies and studies.


Published Genome-Wide Associations through 12/2009: 658 published GWA at p ≤ 5 × 10⁻⁸

NHGRI GWA Catalog www.genome.gov/GWAStudies


•  In practice, in these GWAS the effect size of the variants is often very small. Often the largest single effect accounts for < 3% of the population-attributable genetic risk. Even all together, the variants often account for < 10% of the total genetic effect size.

•  Where is all the “dark matter,” the rest of the genetic heritability?
–  Gene-gene interactions (epistasis)
–  Epigenetics
–  Gene-environment interactions
–  Rare variants
–  ???


Summary

Linkage                       Association
Within pedigree analysis      Across pedigree (or unrelateds) analysis
IBD based                     IBS based
Tracks regions of DNA         Tracks specific values at markers
Signal spread ~2 Mb           Signal spread ~100 Kb
Harmed by LD                  Needs LD


Still More Overview

•  There are Many Programs but Few Methods. Don’t just try every program you can get your hands on.

•  If only one program gives you good results, be suspicious. If results are not robust to small perturbations in your data (including the model), be suspicious.

•  In fact, always be suspicious. Don’t treat the programs as impenetrable black boxes. Know their assumptions and shortcomings.


This Week

•  We will go through an example of simple Parametric Linkage calculations and then discuss more general, realistic, and complicated, computations.

•  We will cover Non-Parametric Linkage Analysis

•  We will then discuss software and their algorithms.

•  Some mathematical detail will be presented, but also practical advice and overviews.

•  For more theoretical detail, consider HG 207A (Biomath 207A); for more applied detail, consider HG 207B (Biomath 207B). Also covered in HG 224 (CS 224).


Output from Parametric Linkage is an estimate for the Recombination Fraction θ between the loci

•  Theta is related to the expected number of perceived recombination events between two loci in one meiosis. Theta is the expected proportion of non-parental gametes produced at these two loci in one meiosis.

•  0 ≤ θ ≤ 0.5 (Why ≤ 0.5?)

•  θ is the parameter in the output of Parametric Linkage Analysis


Recombination versus Genetic versus Physical distance

•  Genetic distance is measured in Morgans ≡ expected number of crossovers between the two loci per gamete

•  Genetic distance is related to recombination fraction via a map function, e.g., Haldane map (assumes no interference) or Kosambi map (allows for interference)

•  Internally, software assumes no interference and uses recombination fractions

•  Be aware of which values the software input should be: Haldane cM, Kosambi cM, or recombination fraction

•  For small distances, genetic distance (M) is similar to recombination fraction (θ)

•  Genetic distance is additive, recombination fractions are not
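The two map functions named above can be sketched in a few lines. This is a minimal illustration using the standard textbook forms of the Haldane and Kosambi maps, with distances in Morgans; the function names are hypothetical and are not taken from any linkage package.

```python
import math

def haldane_theta(d):
    """Recombination fraction for genetic distance d (Morgans), no interference."""
    return 0.5 * (1.0 - math.exp(-2.0 * d))

def haldane_distance(theta):
    """Inverse Haldane map: distance in Morgans for 0 <= theta < 0.5."""
    return -0.5 * math.log(1.0 - 2.0 * theta)

def kosambi_theta(d):
    """Recombination fraction for genetic distance d (Morgans), with interference."""
    return 0.5 * math.tanh(2.0 * d)

def kosambi_distance(theta):
    """Inverse Kosambi map: distance in Morgans for 0 <= theta < 0.5."""
    return 0.25 * math.log((1.0 + 2.0 * theta) / (1.0 - 2.0 * theta))

def combine_thetas(theta_ab, theta_bc):
    """Theta between A and C when B lies between them, assuming no interference.

    Recombination fractions do not simply add: a double recombinant
    looks like a non-recombinant for the outer pair of loci.
    """
    return theta_ab + theta_bc - 2.0 * theta_ab * theta_bc

if __name__ == "__main__":
    for cm in (1, 10, 50, 200):
        d = cm / 100.0  # centiMorgans to Morgans
        print(f"{cm:>4} cM  Haldane theta = {haldane_theta(d):.4f}  "
              f"Kosambi theta = {kosambi_theta(d):.4f}")
```

For small distances both functions return a θ close to the distance in Morgans, and as the distance grows θ approaches but never exceeds 0.5, which is one way to see the bound on θ.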


Recombination versus Genetic versus Physical distance

•  Very complex relation between genetic distance and physical distance (bp)

•  Female genetic map is generally longer than male’s in humans (on some small regions ~10:1; but also, although rare, on some small regions ~1:10, e.g., the pseudo-autosomal region on X that is 19 cM in males but 2.7 cM in females); obviously, the physical maps are identical

•  Very rough rule of thumb in humans: 1 cM = 1 Mbp

•  Map functions are a topic of continuing research

•  Use sex-specific maps as input to the software, if possible


Example Location Score Graph

[Figure: location score (vertical axis, -1.0 to 6.0) plotted against distance in Morgans (horizontal axis, -0.20 to 0.50) for selected pedigrees.]

Mendel’s Second Law: Independent Assortment

•  “Genes controlling different traits segregate independently”

•  So, suppose a person is heterozygous at two loci: A/a at one and C/c at the other

•  Four possible gametes: AC::Ac::aC::ac

•  Under independent assortment they are all equally likely


Pairs of Homologous Chromosomes


Tight Linkage Violates Mendel’s Second Law

•  However, if we know the two loci are tightly linked and that the person has AB on one homologous chromosome and ab on the other (i.e. we know phase), then the most common gametes will be AB and ab, i.e., the non-recombinant (NR) or parental gametes

•  Ab and aB, the recombinant (R) or non-parental gametes, will be rare


Gamete Probabilities

If θ is the recombination fraction between locus A and B, then for a person with AB|ab phased genotype, the gamete probabilities are:

Gamete   Probability   Type
AB       (1 − θ)/2     NR
ab       (1 − θ)/2     NR
Ab       θ/2           R
aB       θ/2           R
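The gamete probabilities for a person with phased genotype AB|ab can be written as a small function. This is only a sketch; the function name and the dictionary layout are assumptions made for illustration.

```python
def gamete_probabilities(theta):
    """Gamete probabilities for a person with phased genotype AB|ab.

    theta is the recombination fraction between the two loci (0 <= theta <= 0.5).
    """
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie in [0, 0.5]")
    return {
        "AB": (1.0 - theta) / 2.0,  # non-recombinant (parental)
        "ab": (1.0 - theta) / 2.0,  # non-recombinant (parental)
        "Ab": theta / 2.0,          # recombinant (non-parental)
        "aB": theta / 2.0,          # recombinant (non-parental)
    }

if __name__ == "__main__":
    for theta in (0.0, 0.1, 0.5):
        print(theta, gamete_probabilities(theta))
```

At θ = 0.5 all four gametes have probability 1/4, which recovers independent assortment; at θ = 0 only the parental gametes occur.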


Pedigree 1

A/A B/b

a/a b/b

A/a B/b

a/a b/b

A/a B/b

a/a b/b

a/a B/b

A/a B/b

A/a B/b

A/a b/b

A/a B/b

NR NR NR NR NR R R


Likelihood of Pedigree 1

•  $L(\text{data} :: \theta = \theta_1) = (1-\theta_1)^5\,\theta_1^2$, which can be evaluated for all $\theta_1$

•  Null hypothesis is that loci A and B are unlinked, i.e., that $\theta_0 = 0.5$; $L(\text{data} :: \theta = 0.5) = (0.5)^7$

•  We form a Likelihood Ratio Test (LRT) which compares a general $\theta_1$ against the null $\theta_0$: $L(\text{data} :: \theta_1) / L(\text{data} :: 0.5)$. This is the odds for $\theta_1$ compared to the Null


Maximum Likelihood Estimate

Now one can calculate this Likelihood Ratio Test (the odds) at various values for θ. The value that maximizes the odds is the Maximum Likelihood Estimate (MLE) of θ, written θ̂. This is the best estimate for θ, given the data.

LOD

•  LOD is the logarithm base 10 of the odds: LOD(θ1) = log10[L(θ1) / L(0.5)]

•  If LOD(θ1) > 3, then we can reject the Null and conclude there is linkage at θ1 (for genome scan use 3.6 rather than 3)

•  If LOD (θ1) < -2, then we say linkage is excluded at θ1, but this isn’t too useful for complex traits

•  Otherwise, we are inconclusive and need more data, i.e., more families

•  LODs can be added across independent families at a given position, under identical locus definitions


LOD scores for Pedigree 1

In this case, in which all phases are known:

$$\hat{\theta} = \frac{\#R}{\#R + \#NR} = \frac{2}{7} \approx 0.286, \qquad \mathrm{LOD}(\hat{\theta}) \approx 0.288$$

$$\mathrm{LOD}(\theta) = \log_{10}\frac{L(\theta)}{L(0.5)} = \log_{10}\frac{(1-\theta)^5\,\theta^2}{(0.5)^7}$$

θ         0.05     0.1      0.2      0.3      0.4      0.5
LOD(θ)   -0.606   -0.122    0.225    0.287    0.202    0.0
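The phase-known LOD score for Pedigree 1 (5 non-recombinant and 2 recombinant meioses) can be checked numerically. The following is a minimal sketch; the function names are hypothetical, and it covers only the fully phase-known case, not general pedigrees.

```python
import math

def lod_phase_known(theta, n_rec, n_nonrec):
    """LOD(theta) when every meiosis is scored as R or NR (requires 0 < theta < 1)."""
    log_l_theta = n_nonrec * math.log10(1.0 - theta) + n_rec * math.log10(theta)
    log_l_null = (n_rec + n_nonrec) * math.log10(0.5)
    return log_l_theta - log_l_null

def mle_theta(n_rec, n_nonrec):
    """Closed-form estimate #R / (#R + #NR), capped at 0.5 since theta <= 0.5."""
    return min(n_rec / (n_rec + n_nonrec), 0.5)

if __name__ == "__main__":
    n_rec, n_nonrec = 2, 5  # Pedigree 1
    for theta in (0.05, 0.1, 0.2, 0.3, 0.4, 0.5):
        print(f"LOD({theta}) = {lod_phase_known(theta, n_rec, n_nonrec):.3f}")
    theta_hat = mle_theta(n_rec, n_nonrec)
    print(f"theta_hat = {theta_hat:.3f}, "
          f"LOD(theta_hat) = {lod_phase_known(theta_hat, n_rec, n_nonrec):.3f}")
```

Working in log10 throughout avoids forming very small likelihoods directly, which matters once many meioses are combined.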


Mapping Markers

•  To choose between alternate maps for a set of markers, we compare the max LOD score for each ordering (n markers ⇒ n!/2 possible maps).

•  Here θ is now a vector $(\theta_1, \ldots, \theta_{n-1})$ where $\theta_i$ = recombination fraction between markers i and i+1.

•  Also, each LOD is now the log10 of the ratio of the likelihood using the chosen $\theta_i$ to the likelihood where all $\theta_i = 0.5$.

•  The ordering with the largest max LOD is the most likely.
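The count of n!/2 candidate maps follows because an order and its reverse describe the same map. The sketch below only enumerates the distinct orders; scoring each order by its maximum LOD requires a pedigree likelihood and is not shown. The function name is hypothetical.

```python
from itertools import permutations
from math import factorial

def distinct_orders(markers):
    """All marker orders, treating an order and its reverse as the same map."""
    seen = set()
    orders = []
    for p in permutations(markers):
        if p[::-1] in seen:
            continue  # the mirror image of this order is already listed
        seen.add(p)
        orders.append(p)
    return orders

if __name__ == "__main__":
    markers = ["M1", "M2", "M3", "M4"]
    orders = distinct_orders(markers)
    print(len(orders), "distinct orders; n!/2 =", factorial(len(markers)) // 2)
    for order in orders[:3]:
        print(" - ".join(order))
```

The factorial growth is why exhaustive comparison is only practical for a handful of markers at a time.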


Traits

•  Parametric linkage analysis tests linkage between traits and markers by turning the trait phenotype into a genotype (or more accurately, summing over the possible genotypes at the trait locus, each weighted according to the penetrance function).

•  The simplest case is a rare, completely dominant (i.e., fully penetrant) phenotype:
–  affected phenotype ⇔ +/– genotype
–  normal phenotype ⇔ –/– genotype


Pedigree 2A

b/b b/a

b/a a/a

b/a a/a a/a b/a b/a b/a b/a


Pedigree 2A

b/b +/-

b/a -/-

b/a +/-

a/a -/-

b/a +/-

a/a -/-

a/a +/-

b/a +/-

b/a +/-

b/a -/-

b/a +/-

NR NR NR NR NR R R


Another Example: Pedigree 2B

b/b

b/b

b/a b/b b/a b/b

1 2

3 4

5 6 7 8


Another Example: Pedigree 2B

Person   Possible Genotype   Probability
1        a+ | ?-
2        b- | b-
3        a+ | b-
4        b- | b-
5        b- | a+             1-θ
6        b- | b-             1-θ
7        b- | a-             θ
8        b- | b-             1-θ

a/? +/-

b/b -/-

b/b -/-

b/a +/-

b/b -/-

b/a -/-

b/b -/-

1 2

3 4

5 6 7 8

a/b +/-


LOD scores for Pedigree 2B

$$\mathrm{LOD}(\theta) = \log_{10}\frac{L(\theta)}{L(0.5)} = \log_{10}\frac{(1-\theta)^3\,\theta^1}{(0.5)^4}$$

Total LODs for Pedigree 2 A & B

Pedigree   LOD(θ) at θ =
Name        0.05     0.1      0.2      0.3      0.4
2A         -0.61    -0.12     0.22     0.29     0.20
2B         -0.16     0.07     0.21     0.22     0.14
Total      -0.77    -0.05     0.43     0.51     0.34


Informativeness

Meioses are informative for linkage if the gamete can be identified as R or NR. In these pedigrees, assume a rare, dominant trait (affected = +/-).

A/a +/-

A/a -/-

A/A +/-

a/a -/-

A/a +/-

A/a +/-

a/a -/-

A/a +/-

A/a -/-

A/a +/-

A/a +/-

a/a -/-

A/a +/-

A/a -/-

A/A +/-

A/a +/-

a/a -/-

A/a +/-

B/b -/-

A/b +/-

Uninformative Uninformative

Informative Informative


Another Example: Pedigree 2C

Person   Possible Genotype             Probability
                                       I       II
1        (I) b+|c- OR (II) b-|c+
2        b-|c-
3        b+|b-                         1-θ     θ
4        uninformative
5        c-|c-                         1-θ     θ
6        uninformative

3 4 5 6

2 1

b/c -/-

b/b +/-

b/c +/-

c/c -/-

c/b -/-

b/c +/-


LOD scores for Pedigree 2C

$$\mathrm{LOD}(\theta) = \log_{10}\frac{L(\theta)}{L(0.5)} = \log_{10}\frac{\tfrac{1}{2}\,\theta^0(1-\theta)^2 + \tfrac{1}{2}\,\theta^2(1-\theta)^0}{\left(\tfrac{1}{2}\right)^2}$$

Total LODs for Pedigree 2 A, B & C

Pedigree   LOD(θ) at θ =
Name        0.05     0.1      0.2      0.3      0.4
2A         -0.61    -0.12     0.22     0.29     0.20
2B         -0.16     0.07     0.21     0.22     0.14
2C          0.26     0.21     0.13     0.06     0.02
Total      -0.51     0.16     0.56     0.57     0.36
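When phase is unknown, as in Pedigree 2C, the likelihood averages over the equally likely phases, and LODs from independent pedigrees are then added at each θ. The sketch below encodes each pedigree as a list of (recombinant, non-recombinant) counts, one pair per possible phase. That encoding and the names are assumptions made for illustration; it only handles pedigrees that reduce to such counts.

```python
import math

def likelihood(theta, phases):
    """Average over equally likely phases of theta^R * (1 - theta)^NR."""
    return sum(theta ** r * (1.0 - theta) ** nr for r, nr in phases) / len(phases)

def lod(theta, phases):
    """log10 of the likelihood at theta over the likelihood at theta = 0.5."""
    return math.log10(likelihood(theta, phases) / likelihood(0.5, phases))

# (R, NR) counts per possible phase; phase-known pedigrees have one entry.
PEDIGREES = {
    "2A": [(2, 5)],
    "2B": [(1, 3)],
    "2C": [(0, 2), (2, 0)],  # phase I: two NR; phase II: two R
}

def total_lod(theta, pedigrees=PEDIGREES):
    """Sum of LODs over independent pedigrees at a common theta."""
    return sum(lod(theta, phases) for phases in pedigrees.values())

if __name__ == "__main__":
    for theta in (0.05, 0.1, 0.2, 0.3, 0.4):
        row = "  ".join(f"{name}: {lod(theta, ph):6.2f}" for name, ph in PEDIGREES.items())
        print(f"theta = {theta:<4}  {row}  total: {total_lod(theta):6.2f}")
```

Totals computed from unrounded LODs can differ by 0.01 from totals obtained by adding entries already rounded to two decimals.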


Complex Traits Require a Penetrance Function (a.k.a. Mode of Inheritance)

•  The input parameter in Parametric Linkage is the model of how the genotype at the trait locus influences the trait phenotype.

•  For complex traits this clearly needs to be more complicated than: affected phenotype ⇔ +/– genotype.

•  The model is specified as a penetrance function: Pr(phenotype | genotype).

•  For example,

           1/1    1/2    2/2
normal     1.0    0.0    0.0
affected   0.0    1.0    1.0


Penetrance Functions

•  Another example,

           1/1    1/2    2/2
normal     1.0    1.0    0.0
affected   0.0    0.0    1.0

•  More generally,

           1/1    1/2    2/2
normal     0.99   0.2    0.1
affected   0.01   0.8    0.9

•  You can also have liability classes, for example, decade of life, smoking, value at second trait, etc. Some software is even more flexible and allows different values at each individual.

•  Several conditions can increase the phenocopy rate; but the value here should still probably be small (but if using a multipoint analysis, then it should definitely be positive).
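A penetrance function is just a lookup of Pr(phenotype | genotype), and it is what lets the analysis weight each possible trait genotype. The sketch below uses the general example table (0.01, 0.8, 0.9) and, as an added assumption not taken from the slides, a Hardy-Weinberg prior with a hypothetical disease-allele frequency q to show how an observed phenotype is converted into genotype weights.

```python
# Pr(affected | genotype), keyed by the number of copies of allele 2.
# Values are the "more generally" example table; 0.01 is the phenocopy rate.
PR_AFFECTED = {0: 0.01, 1: 0.8, 2: 0.9}

def penetrance(phenotype, n_copies):
    """Pr(phenotype | genotype); an unknown phenotype carries no information."""
    p = PR_AFFECTED[n_copies]
    if phenotype == "affected":
        return p
    if phenotype == "normal":
        return 1.0 - p
    return 1.0

def genotype_posterior(phenotype, q):
    """Posterior over trait genotypes given the phenotype.

    Assumes a Hardy-Weinberg prior with allele-2 frequency q (hypothetical value).
    """
    prior = {0: (1.0 - q) ** 2, 1: 2.0 * q * (1.0 - q), 2: q * q}
    weights = {g: prior[g] * penetrance(phenotype, g) for g in prior}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}

if __name__ == "__main__":
    for phenotype in ("affected", "normal", "unknown"):
        post = genotype_posterior(phenotype, q=0.01)
        print(phenotype, {g: round(p, 4) for g, p in post.items()})
```

With a rare disease allele, even a small phenocopy rate leaves a noticeable share of affecteds with the 1/1 genotype, which is one reason the phenocopy value should be chosen with care.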


Standard Pedigree Likelihood Function

$$L = \sum_{G_1} \cdots \sum_{G_n} \Big[ \prod_{j} \mathrm{Prior}(G_j) \times \prod_{\{c,m,f\}} \mathrm{Trans}(G_c \mid G_m, G_f) \times \prod_{i} \mathrm{Pen}(X_i \mid G_i) \Big]$$

•  In $\sum_{G_p}$, $G_p$ runs over all possible multilocus genotypes at individual p

•  j runs over all founders; {c, m, f} runs over all parent-offspring triples; i runs over all individuals

•  $X_i$ is the phenotype (or observed genotype) at individual i at all loci
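The structure of the pedigree likelihood (founder priors, times Mendelian transmission, times penetrance, summed over all genotype assignments) can be shown by brute force on a tiny pedigree. This is only a sketch under strong simplifying assumptions: a single diallelic trait locus, Hardy-Weinberg founder priors with a hypothetical allele frequency q, and exhaustive enumeration, which is exactly what the Elston-Stewart and Lander-Green algorithms avoid on real data.

```python
from itertools import product

GENOTYPES = [(1, 1), (1, 2), (2, 2)]          # unordered genotypes at one locus
PR_AFFECTED = {0: 0.01, 1: 0.8, 2: 0.9}       # Pr(affected | copies of allele 2)

def prior(g, q):
    """Hardy-Weinberg founder prior; q is the frequency of allele 2."""
    return [(1 - q) ** 2, 2 * q * (1 - q), q * q][g.count(2)]

def transmit(allele, parent):
    """Probability that a parent passes on the given allele."""
    return parent.count(allele) / 2

def trans(child, mother, father):
    """Mendelian Pr(child genotype | parental genotypes)."""
    a, b = child
    p = transmit(a, mother) * transmit(b, father)
    if a != b:
        p += transmit(b, mother) * transmit(a, father)
    return p

def pen(phenotype, g):
    """Pr(phenotype | genotype); 'unknown' contributes a factor of 1."""
    p = PR_AFFECTED[g.count(2)]
    return {"affected": p, "normal": 1 - p}.get(phenotype, 1.0)

def pedigree_likelihood(pedigree, phenotypes, q):
    """pedigree maps person -> None (founder) or (mother, father)."""
    people = list(pedigree)
    total = 0.0
    for assignment in product(GENOTYPES, repeat=len(people)):
        g = dict(zip(people, assignment))
        term = 1.0
        for person, parents in pedigree.items():
            if parents is None:
                term *= prior(g[person], q)
            else:
                term *= trans(g[person], g[parents[0]], g[parents[1]])
            term *= pen(phenotypes[person], g[person])
        total += term
    return total

if __name__ == "__main__":
    trio = {"mother": None, "father": None, "child": ("mother", "father")}
    phen = {"mother": "normal", "father": "affected", "child": "affected"}
    print(pedigree_likelihood(trio, phen, q=0.01))
```

The number of terms grows as 3 to the power of the pedigree size even for one diallelic locus, which is why the exact algorithms reorganize the sums rather than enumerate them.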


Locus Heterogeneity

•  When more than one trait locus is suspected, the Pr( any pedigree is segregating a disease gene linked to the current position, θ ) is called α.

•  One can simultaneously estimate θ and α in an unbiased fashion using Parametric Linkage Analysis.

•  Moreover, at each θ one can obtain the Heterogeneous LOD (HLOD) which is maximized over all α. This is the parametric score one should use for complex traits.

•  Also, for a given θ and α one can obtain the posterior probability for each pedigree of whether that pedigree is segregating a disease gene in that vicinity.
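One common way to compute the HLOD is the admixture form, in which each pedigree is linked with probability α and unlinked otherwise. The slides do not spell out a formula, so the expression below is the standard admixture model used here as an assumption, and the simple grid search over α and the function names are illustrative choices only.

```python
import math

def hlod_at_alpha(lods, alpha):
    """Admixture LOD at a fixed theta.

    lods holds each pedigree's LOD at that theta; alpha is the assumed
    proportion of linked pedigrees (0 <= alpha <= 1).
    """
    return sum(math.log10(alpha * 10.0 ** z + (1.0 - alpha)) for z in lods)

def max_hlod(lods, steps=100):
    """Maximize over a grid of alpha values; returns (HLOD, alpha)."""
    return max((hlod_at_alpha(lods, i / steps), i / steps) for i in range(steps + 1))

def posterior_linked(lod, alpha):
    """Posterior probability that one pedigree is of the linked type."""
    odds = alpha * 10.0 ** lod
    return odds / (odds + (1.0 - alpha))

if __name__ == "__main__":
    family_lods = [1.2, 0.8, -1.5, 1.0]   # hypothetical per-pedigree LODs at one theta
    best, alpha = max_hlod(family_lods)
    print(f"homogeneous LOD = {sum(family_lods):.2f}")
    print(f"HLOD = {best:.2f} at alpha = {alpha:.2f}")
    for z in family_lods:
        print(f"LOD {z:5.2f} -> posterior linked = {posterior_linked(z, alpha):.2f}")
```

At α = 1 the expression reduces to the ordinary summed LOD, and at α = 0 it is zero, so the maximized HLOD is never below either of those values.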


Multipoint Analysis

•  Parametric Linkage has been extended to multiple markers and almost any size pedigree, using the method of Location Scores.

•  In Location Score computations, the positions of the markers are fixed and only the location of the trait locus varies (examples).

•  Multipoint has the advantage of using all your data simultaneously; it can make uninformative markers informative.

•  As with all multipoint analysis, it is crucial that the map order be correct. The marker positions may be approximate but their order must be right. Choose markers accordingly.

Figure: Standardized location scores resulting from the Monte Carlo analysis of 169 pedigrees in the Consortium which exhibit an ataxia-telangiectasia locus linked to 11q22-23. Horizontal axis: Position of A-T Gene Relative to Marker Loci (in cM). Vertical axis: Standardized Location Score (in log10 units). Marker labels include DRD2, S132, S144, CJ77, S84, S35, CJ193, S611, STMY, GL4, S927, A1, S1343, A4, J12.8, S1294, A2, and Y12.8.

Multipoint versus Single Point (a.k.a. Two Point)

•  Single Point analysis is more flexible in allowing the data to fit the model, since θ is not constrained by neighboring markers. That is, for models of inheritance that are not precise (and none are), the estimate θ̂ will be less accurate but the LOD score will still be accurate.

•  Multipoint has more places that data error can enter the problem; but it can also help you find these errors.


Power Considerations for Linkage Analysis

Power to detect linkage depends strongly on the magnitude of the contribution that the disease locus makes to the genetic variation of the trait.

[Figure adapted from Joe Terwilliger. Diagram labels: True Marker Genotypes; Putative Trait-Locus Genotype; True Phenotype; Linkage (IBD) or LD (IBS); Tested Correlation; Observed Genotypes; Observed Phenotype; Mistype; Misdiagnose; Incorrect Model; Too Little Data; Alternate Etiology, such as: Other Loci, Environment, Polygenic Effect, Small Genetic Effect.]


Factors Influencing Power of Linkage Analysis

•  Pedigree size and structure

•  Total sample size

•  Marker informativeness

•  Distance of marker from disease locus

•  Phenocopy rate

•  Genetic heterogeneity

•  Magnitude of genetic effect


Design Issues for Linkage Analysis

•  For Parametric: Pre-specify a few trait models, e.g., reduced penetrance dominant and recessive, and an additive model. The model doesn’t need to be exactly right

•  Simplify problem with either specific phenotype or isolated population, or both

•  After positive result, test for model robustness

•  Replication is vital

•  SNP sets are available for linkage analysis: 6000 SNPs can cover the human genome, since linkage signal can be detected far from the trait locus


Assumptions in Linkage Analysis

•  Hardy-Weinberg Equilibrium

•  Linkage Equilibrium!

•  Random Mating

•  No Chiasma Interference (Haldane map function)

•  No Epistasis, i.e., no interaction between alleles at different loci


Disadvantages of Linkage Analysis

•  For Parametric: Explicitly Model-based (although other methods are often implicitly model-based)

•  Bilinearity sensitivity

•  Analysis only within pedigrees, not across pedigrees

•  Originally designed for simple traits, so it is difficult to include multiple trait loci simultaneously

•  Localized region will usually be greater than 2 Mb wide


Advantages of Linkage Analysis

•  Not sensitive to across-pedigree allelic heterogeneity nor population history

•  No candidate genes to guess

•  Can detect a signal up to 20 cM away

•  Good at finding rare variants

•  P-value (LOD score) is accurate given the data (including the model); for other methods estimated p-values may not be accurate, either too conservative or anti-conservative

•  Genome-wide level of significance is well-defined; not true for other methods


Overview of current general-pedigree Linkage Analysis software

Algorithm                  Programs                               Solution   Size Restriction
Elston-Stewart             FastLink, Linkage, Mendel, Vitesse     exact      varies: ~8 loci, less with loops (larger for VITESSE)
Lander-Green               Allegro, GeneHunter, Mendel, Merlin    exact      ~20 people ( 2n – f ≤ 20 )
Markov chain Monte Carlo   Loki, SimWalk                          estimate   much larger ( > 1000 individuals, > 1000 loci )

Increase in computational time with increase in:

Algorithm                  people        markers       missing data
Elston-Stewart             linear        exponential   severe
Lander-Green               exponential   linear        modest
Markov chain Monte Carlo   linear        linear        mild


Non-Parametric Linkage Analysis (NPL)


Identity By Descent (IBD)

Recall that two alleles at a locus are identical by descent (IBD) if and only if the two alleles are both descendants of a common ancestral allele

A/C C/B

B/A B/A C/B C/A


Simple IBD

Simple IBD has 3 possible states for the two alleles (Allele1, Allele2) of individual i and of individual j:

Z0: Alleles IBD = 0

Z1: Alleles IBD = 1

Z2: Alleles IBD = 2
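The three simple IBD states connect directly to the kinship coefficient. Writing z0, z1, z2 for the probabilities of states Z0, Z1, Z2 (a notation assumed here), the kinship coefficient of two non-inbred individuals is z1/4 + z2/2, a standard result that the slides do not state explicitly. The sketch below applies it to a few textbook relationships; the table of values is an illustration, not output from any program.

```python
def kinship_from_ibd(z0, z1, z2):
    """Kinship coefficient of two non-inbred individuals from IBD state probabilities.

    A randomly drawn allele from each person is IBD with probability
    1/4 when they share one allele IBD and 1/2 when they share two.
    """
    if abs(z0 + z1 + z2 - 1.0) > 1e-9:
        raise ValueError("state probabilities must sum to 1")
    return z1 / 4.0 + z2 / 2.0

# (z0, z1, z2) for some standard non-inbred relationships.
IBD_PROBS = {
    "unrelated": (1.0, 0.0, 0.0),
    "parent-child": (0.0, 1.0, 0.0),
    "full sibs": (0.25, 0.5, 0.25),
    "half sibs": (0.5, 0.5, 0.0),
    "monozygotic twins": (0.0, 0.0, 1.0),
}

if __name__ == "__main__":
    for name, probs in IBD_PROBS.items():
        print(f"{name:18s} kinship = {kinship_from_ibd(*probs):.4f}")
```

Parent-child pairs and full sibs have the same kinship coefficient but different state probabilities, which is why the full (z0, z1, z2) vector carries more information than kinship alone.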


Condensed IBD for Inbred Pedigrees

Condensed IBD has 9 possible states, S1 through S9, for the two alleles (Allele1, Allele2) of individual i and of individual j.

[Figure: diagrams of the nine condensed identity states S1–S9.]

Detailed IBD

Detailed IBD has 15 possible states, S*1 through S*15, for the maternal and paternal alleles of individual i and of individual j.

[Figure: diagrams of the fifteen detailed identity states S*1–S*15.]


Descent States


Descent Graphs (Inheritance Vectors)


Non-Parametric Linkage (NPL) Analysis

•  If we assume that many of the affecteds within a pedigree are affected because they share particular disease alleles at a trait locus, then it is reasonable to predict that those disease alleles are IBD.

•  Moreover, in those affecteds, the alleles at loci linked to the trait locus will also often be IBD.

•  NPL analysis tests for more sharing in the affecteds than one would expect when there is no linkage.


NPL Example


Design Issues for NPL

•  Unaffecteds are used to help determine IBD relationships of the alleles in the affecteds.

•  Always try to genotype at least three individuals per family; if parents unavailable, use unaffected siblings.

•  Usually NPL is a multipoint analysis to help infer IBD status.


Measuring Significance for NPL

•  To measure the significance of the observed NPL statistic for linkage, one should compare it to the null distribution of the same statistic for similar data sets where there is no linkage.

•  This results in a p-value ≡ Pr( observing a value as extreme or more extreme than the actual statistic | no linkage ).

•  In practice, accurate p-values are time-consuming to compute, so many reported p-values are conservative.
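An empirical p-value of this kind can be sketched by simulation in the simplest setting. The example below assumes the data are independent affected sib pairs with fully known IBD sharing and uses the total number of alleles shared IBD as the statistic; under no linkage each parent passes the same allele to both sibs with probability 1/2. The function names, the choice of statistic, and the simulation size are all illustrative assumptions; real NPL software simulates whole pedigrees with the actual marker map.

```python
import random

def simulate_null_sharing(n_pairs, rng):
    """Total alleles shared IBD across n_pairs sib pairs when there is no linkage.

    Each of the two parents of a pair transmits the same allele to both
    sibs with probability 1/2, independently.
    """
    return sum(rng.random() < 0.5 for _ in range(2 * n_pairs))

def empirical_p_value(observed, n_pairs, n_sims=10000, seed=0):
    """Pr(statistic >= observed | no linkage), estimated by Monte Carlo.

    The +1 in numerator and denominator keeps the estimate away from zero.
    """
    rng = random.Random(seed)
    hits = sum(simulate_null_sharing(n_pairs, rng) >= observed for _ in range(n_sims))
    return (hits + 1) / (n_sims + 1)

if __name__ == "__main__":
    n_pairs = 20
    for observed in (20, 26, 30):
        print(f"observed sharing {observed}: p ~ {empirical_p_value(observed, n_pairs):.4f}")
```

The smallest p-value this estimator can report is 1/(n_sims + 1), so small p-values need many simulations, which is the computational cost the slide refers to.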


NPL_Pairs Statistic

•  NPL statistics measure the degree of sharing, among the affecteds, of alleles IBD at a specific position.

•  For example, at a given locus R,

$$S_{\text{pairs}} = \sum_{I,J} \big[\, \mathrm{IBD}(I_M, J_M) + \mathrm{IBD}(I_M, J_P) + \mathrm{IBD}(I_P, J_M) + \mathrm{IBD}(I_P, J_P) \,\big]$$

where I & J run over all pairs of affecteds, IBD(x, y) is the Pr(x and y are IBD), $I_M$ is the maternal allele at R in I, $I_P$ is the paternal allele at R in I, and similarly for $J_M$ and $J_P$.
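When the descent of every allele is known, the pairs statistic reduces to counting. The sketch below makes that simplifying assumption: each affected is represented by the founder-allele labels of the maternal and paternal alleles, so IBD(x, y) is 1 when the labels match and 0 otherwise. In practice IBD is a probability inferred from marker data; the labels and names here are hypothetical.

```python
from itertools import combinations

def s_pairs(affecteds):
    """Pairs statistic from known descent.

    affecteds maps a person to (maternal_label, paternal_label), where each
    label identifies the founder allele that was inherited at the locus.
    """
    total = 0
    for (i_m, i_p), (j_m, j_p) in combinations(affecteds.values(), 2):
        total += (i_m == j_m) + (i_m == j_p) + (i_p == j_m) + (i_p == j_p)
    return total

if __name__ == "__main__":
    # Mother carries founder alleles 1 and 2; father carries 3 and 4.
    sibs = {
        "sib1": (1, 3),
        "sib2": (1, 4),
        "sib3": (2, 3),
    }
    print("S_pairs =", s_pairs(sibs))
```

With probabilistic IBD, the equality tests would be replaced by the corresponding IBD probabilities and the sum would no longer be an integer.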


NPL_All Statistic

•  Sall is another well known NPL statistic.

•  Sall measures the degree of sharing of alleles in all subsets of the affecteds at once (not just in pairs of affecteds).

•  $S_{\text{all}} = \big[ \sum_{h} \prod_{B(h)} |B(h)|! \big] / 2^t$ where {h} is the set of all possible vectors containing one allele at locus R from each affected; {B(h)} is the set of IBD blocks contained in h; and t is the number of affecteds.
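The all-affecteds statistic can be computed by direct enumeration when the descent of each allele is known. The sketch below assumes, for illustration only, that each affected is given as a pair of founder-allele labels, so an IBD block in a vector h is the group of entries carrying the same label. The names are hypothetical, and the enumeration over 2^t vectors is practical only for small numbers of affecteds.

```python
from collections import Counter
from itertools import product
from math import factorial, prod

def s_all(affecteds):
    """All-affecteds sharing statistic from known descent.

    affecteds is a list of (maternal_label, paternal_label) tuples, one per
    affected. Each vector h picks one allele from every affected; its score
    is the product of block_size! over the founder alleles present in h.
    """
    t = len(affecteds)
    total = 0
    for h in product(*affecteds):
        block_sizes = Counter(h).values()
        total += prod(factorial(size) for size in block_sizes)
    return total / 2 ** t

if __name__ == "__main__":
    print("no sharing:      ", s_all([(1, 3), (2, 4)]))
    print("sibs sharing two:", s_all([(1, 3), (1, 3)]))
    print("three affecteds: ", s_all([(1, 3), (1, 4), (1, 5)]))
```

Because of the factorial, one allele shared by many affecteds scores much more than the same number of pairwise matches spread over different alleles.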


Recessive Statistic

•  Spairs and Sall are the most commonly cited NPL statistics.

•  For recessive traits, one expects the alleles at the trait locus in the affecteds will come from just a few founders.

•  So, the number of IBD blocks contained in the set of all alleles at the trait locus in the affecteds, Sblocks, is a good statistic to detect recessive traits.


Dominant Statistic

•  For a dominant trait, one expects that the affecteds share one allele IBD.

•  So, the size of the largest IBD block contained in the set of all alleles at the trait locus in the affecteds, Smax-tree, is a good statistic to detect dominant traits.


Advantages of NPL

•  No explicit model of inheritance! (Thus NPL is also known as Model-Free Linkage Analysis.)

•  Not sensitive to population history nor allelic heterogeneity.

•  Signal can be seen up to 20 cM away (similar to Parametric Linkage).


Disadvantages of NPL

•  No measure of specific position for trait locus

•  P-values may be conservative

•  Model is not explicit but it may be implicit in the statistic

•  Not good for fine mapping below ~2 cM (similar to Parametric Linkage)

•  Still computationally complex for large pedigrees, particularly for accurate p-values


Where does Parametric Linkage Analysis fit in a genetic study using (extended) pedigrees?

1)  Mistyping Analysis (there are mistypings in all interesting genetic studies)

2)  Genome scan (~1 cM intervals) analyzed using either Parametric Linkage (PL) or Non-Parametric Linkage (NPL) Analysis

3)  Fine mapping of supported regions analyzed using either PL or NPL (2 – 0.1 cM intervals)

4)  Below ~2 Mb move to Association Analyses. Alternatively jump directly to Genome-wide Association Analysis!


Mendel 10 Analysis Options

Number   Analysis Name          Number   Analysis Name
1        Mapping Markers        14       Penetrances
2        Linkage Analysis       15       Gaussian Penetrances
3        Haplotyping            16       Combining Alleles
4        NPL                    17       Gene Dropping
5        Mistyping              18       Combining SNPs
6        Allele Frequencies     19       Polygenic QTL
7        Genetic Counseling     20       QTL Association
8        Gamete Competition     21       Trimming Pedigrees
9        Pedigree Selection     22       Association given Linkage
10       Kinship Matrices       23       SNP Imputation
11       Genetic Equilibrium    24       SNP Association (GWAS)
12       Cases and Controls     25       File Conversion
13       TDT
