AARHUS UNIVERSITET
Faculty of Agricultural Sciences

Analysis of whole genome association studies in pedigreed populations

Goutam Sahana

Genetics and Biotechnology, Faculty of Agricultural Sciences
Aarhus University, 8830 Tjele, Denmark

Concept of mapping

Identification of genetic variant underlying disease susceptibility or a trait value

= Causal variant

Evidence for the location of the gene

Approaches to Mapping

1. Candidate gene studies
• Association
• Resequencing approaches

2. Genome-wide studies
• Linkage analysis
• Genome-wide association studies (linkage disequilibrium, LD mapping)

Linkage mapping

Look for marker alleles that are correlated with the phenotype within a pedigree

Different alleles can be associated with the trait in different pedigrees

Association mapping

Marker alleles are correlated with a trait on a population level

Can detect association by looking at unrelated individuals from a population

Does not necessarily imply that markers are linked to (are close to) genes influencing the trait.

Linkage vs. association

[Figure: effect size plotted against frequency of the causal variant, with regions labeled "Linkage analysis", "Association study", "Very difficult" and "Unlikely to exist". Modified from D. Altschuler.]

Linkage vs. association

Potential advantage                                                     Linkage   Association
No prior information regarding gene function required                  +         +
Localization to small genomic region                                    -         +
Not susceptible to effects of stratification                            +         -/+
Sufficient power to detect common alleles of modest effect (MAFs>5%)    -/+       +
Ability to detect rare allele (MAFs<1%)                                 +         -
Tools for analysis available                                            +         +/-

Hirschhorn & Daly, Nature Rev. Genet. 2005

Allelic Association

• Direct association: the allele of interest is itself involved in the phenotype
• Indirect association: the allele itself is not involved, but it is in LD with the functional variant
• Spurious association: caused by confounding factors (e.g., population stratification)

Linkage disequilibrium

Non random association between alleles at different loci. Loci are in LD if alleles are present on haplotypes in different proportions than expected based on allele frequencies

Two alleles that are in LD are occurring together more often than would be expected by chance

Linkage disequilibrium

Locus A: alleles A and a, with frequencies p_A and p_a

Locus B: alleles B and b, with frequencies p_B and p_b

Possible haplotypes:     AB         Ab         aB         ab
Expected frequencies:    p_A p_B    p_A p_b    p_a p_B    p_a p_b
Observed frequencies:    p_AB       p_Ab       p_aB       p_ab

$D = p_{AB} - p_A p_B \neq 0$
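The definition of D can be tried out on numbers. The sketch below computes D from one observed haplotype frequency and the two allele frequencies, and also returns the normalized measures D' and r², which are standard rescalings of D and are an addition here rather than something defined on this slide. The input frequencies are hypothetical and chosen only for illustration.

```python
def ld_measures(p_AB, p_A, p_B):
    """Return D, D' and r^2 for two biallelic loci.

    p_AB is the observed frequency of haplotype AB; p_A and p_B are the
    frequencies of alleles A and B.
    """
    D = p_AB - p_A * p_B
    p_a, p_b = 1.0 - p_A, 1.0 - p_B
    # D' divides D by the largest magnitude it can reach for these allele frequencies.
    if D >= 0:
        d_max = min(p_A * p_b, p_a * p_B)
    else:
        d_max = min(p_A * p_B, p_a * p_b)
    d_prime = D / d_max if d_max > 0 else 0.0
    # r^2 is the squared correlation between alleles at the two loci.
    denom = p_A * p_a * p_B * p_b
    r2 = D * D / denom if denom > 0 else 0.0
    return D, d_prime, r2


# Hypothetical frequencies, for illustration only.
D, d_prime, r2 = ld_measures(p_AB=0.4, p_A=0.5, p_B=0.5)
print(f"D = {D:.3f}, D' = {d_prime:.3f}, r2 = {r2:.3f}")
```

With p_AB equal to p_A p_B the function returns D = 0, which is the linkage equilibrium case.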

LD variation across genome

The extent of LD is highly variable across the genome

The determinants of LD are not fully understood.

Factors that are believed to influence LD
• Genetic drift
• Population growth
• Admixture or migration
• Selection
• Variable recombination rates

Haplotype

Genotypes
Locus 1:  2 4
Locus 2:  1 3
Locus 3:  3 2
Locus 4:  4 1
Locus 5:  2 3
Locus 6:  1 2

Identification of phase (PHASE, BEAGLE)

Haplotypes
4 1 3 1 2 2
2 3 2 4 3 1
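To see why phase has to be inferred, it helps to count how many haplotype pairs are compatible with the unphased genotypes. The sketch below enumerates them by brute force. It is only an illustration of the problem, not how PHASE or BEAGLE work, since those programs use statistical models of the population. The function name is hypothetical.

```python
from itertools import product


def possible_haplotype_pairs(genotypes):
    """Enumerate every unordered haplotype pair consistent with unphased genotypes.

    genotypes is a list of (allele, allele) tuples, one tuple per locus.
    """
    pairs = set()
    # At each locus, either allele may sit on the first haplotype.
    for choice in product((0, 1), repeat=len(genotypes)):
        hap1 = tuple(g[c] for g, c in zip(genotypes, choice))
        hap2 = tuple(g[1 - c] for g, c in zip(genotypes, choice))
        # A pair is unordered, so store it in a canonical order.
        pairs.add(tuple(sorted((hap1, hap2))))
    return pairs


# Genotypes at the six loci from the slide.
genotypes = [(2, 4), (1, 3), (3, 2), (4, 1), (2, 3), (1, 2)]
pairs = possible_haplotype_pairs(genotypes)
print(len(pairs))  # 32 candidate pairs for six heterozygous loci
```

The haplotype pair shown on the slide, 413122 and 232431, is one of these 32 candidates.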

Haplotype-based analysis

Increased ability to identify regions that are shared identical by descent among affected individuals

Haplotypes may be the causative ‘composite allele’ rather than a particular nucleotide at a particular SNP

Haplotype analysis is meaningful only if the SNPs are themselves in LD

Monogenic versus complex traits

Monogenic trait

Mutation in single gene is both necessary and sufficient to produce the phenotype or to cause the disease

The impact of the gene on genetic risk is the same in all families

Follow clear segregation pattern in families

Typically rare in population

Complex trait

Multiple genes lead to genetic predisposition to a phenotype

Pedigree reveals no Mendelian pattern

Any particular gene mutation is neither sufficient nor necessary to explain the phenotype

Environment has major contribution

We study the relative impact of individual genes on the phenotype

Some examples

Disease               Mendelian (M) / Complex (C)   No. of genes   Incidence (in 100,000)
Cystic fibrosis       M                             1              40
Huntington disease    M                             1              5-10
Diabetes, type 2      C                             ?              10,000 – 20,000
Alzheimer             C                             ?              20,000
Schizophrenia         C                             ?              1000

Quantitative Trait

A biological trait that shows continuous variation rather than falling into distinct categories

Quantitative trait locus (QTL) - Genetic locus that is associated with variation in such quantitative trait

Assessing genetic contributions to complex traits

• Continuous characters (weight, blood pressure)
  Heritability: proportion of observed variance in phenotype explained by genetic factors

• Discrete characters (disease)
  Relative risk ratio: λ = risk to relative of an affected individual / risk in general population
  λ encompasses all genetic and environmental effects, not just those due to any single locus

Factors that influence identification of allelic association

• Effect size
• Linkage disequilibrium
• Disease and marker allele frequencies
• Sample size

Reviewed by Zondervan & Cardon, Nature Rev. Genet. 2004

Odds ratio and sample size

Zondervan & Cardon (Nature Rev. Genet. 2004)

                                               Sample size at odds ratio
Disease allele freq.   Marker allele freq.     3.0      2.0      1.3
0.2                    0.2                     150      360      2,900
0.2                    0.5                     430      1,250    11,000
0.05                   0.2                     1,170    4,150    40,000
0.05                   0.5                     4,200    15,000   160,000

No. of cases = no. of controls; D' = 0.7; power 80%; α = 0.001

Population stratification

Consider two case/control samples, genotyped at a marker with alleles M and m

Sample A
              M      m      Freq.
Affected      50     50     0.10
Unaffected    450    450    0.90
Freq.         0.50   0.50
χ²: NS

Sample B
              M      m      Freq.
Affected      1      9      0.01
Unaffected    99     891    0.99
Freq.         0.10   0.90
χ²: NS

Population stratification

Sample A and Sample B combined
              M      m      Freq.
Affected      51     59     0.055
Unaffected    549    1341   0.945
Freq.         0.30   0.70

χ² = 14.8
P < 0.001
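The stratification example can be reproduced numerically. The sketch below runs a Pearson chi-square test without continuity correction on Sample A, Sample B and their pooled counts. The helper name is hypothetical, and the p-value uses the closed form for 1 degree of freedom.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square (no continuity correction) for a 2x2 table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    # Upper-tail probability of a chi-square variable with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Rows: affected, unaffected. Columns: allele M, allele m.
sample_a = [[50, 50], [450, 450]]
sample_b = [[1, 9], [99, 891]]
combined = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(sample_a, sample_b)]

for name, table in [("A", sample_a), ("B", sample_b), ("A+B", combined)]:
    chi2, p = chi_square_2x2(table)
    print(f"Sample {name}: chi2 = {chi2:.1f}, p = {p:.2g}")
```

Neither sample shows any association on its own, yet pooling them produces a significant statistic because disease prevalence and allele frequency both differ between the two samples.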

Dealing with population structure

Genomic control (Devlin and Roeder, 1999)
• Stratification inflates the distribution of the test statistic by a factor λ
• λ is estimated from the data

[Figure: χ² compared with E(χ²) at unlinked ‘null’ markers and at the test locus, in two panels: "No stratification" and "Stratification" (adjust test statistics).]
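A minimal sketch of the adjustment, assuming the median-based estimator that is commonly used for genomic control: λ is the median of the 1-df chi-square statistics at the null markers divided by the median of a chi-square distribution with 1 df (about 0.455), and each test statistic is divided by λ. The function name and the input values are hypothetical.

```python
import statistics

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_control(null_chi2, test_chi2):
    """Estimate lambda from null-marker statistics and deflate the test statistics."""
    lam = statistics.median(null_chi2) / CHI2_1DF_MEDIAN
    # By convention lambda is not allowed to fall below 1 (no deflation of the null).
    lam = max(lam, 1.0)
    return lam, [x / lam for x in test_chi2]


# Hypothetical 1-df chi-square values at unlinked null markers, for illustration only.
null_stats = [0.2, 0.5, 0.9, 1.4, 3.0]
lam, adjusted = genomic_control(null_stats, [14.8])
print(f"lambda = {lam:.2f}, adjusted chi2 = {adjusted[0]:.2f}")
```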

Dealing with population structure

Structured association (Pritchard et al., 2000)
• Discover structure from a set of unlinked markers, i.e. assign probabilities of ancestry from k populations to each individual, and then control for it.

Association analysis approaches

• Case–control studies
  Marker allele frequencies are determined in a group of affected individuals and compared with allele frequencies in a control population

• Family-based methods
  Based on unequal transmission of alleles from parents to a single affected child in each family. Associations are summed over many unrelated families

Case-control studies: χ² test

Genotypes (2x3 contingency table)
         11     12     22     Total
Case     n11    n12    n22    N
Ctrl     m11    m12    m22    M
Total    T11    T12    T22    N+M

Alleles (2x2 contingency table)
         1      2      Total
Case     n1     n2     2N
Ctrl     m1     m2     2M
Total    T1     T2     2(N+M)

Test of independence:

$\chi^2 = \sum (O - E)^2 / E$ with 2 or 1 df
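The two tables are linked: each genotype contributes two alleles, so the 2x2 allele table can be derived from the 2x3 genotype table. The sketch below does that conversion and computes the test of independence for both tables. The function names and the counts are hypothetical.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = sum(
        (observed - row_totals[i] * col_totals[j] / total) ** 2
        / (row_totals[i] * col_totals[j] / total)
        for i, row in enumerate(table)
        for j, observed in enumerate(row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df


def allele_table(genotype_table):
    """Collapse genotype counts (11, 12, 22) per row into allele counts (1, 2)."""
    return [[2 * n11 + n12, n12 + 2 * n22] for n11, n12, n22 in genotype_table]


# Hypothetical counts. Rows: cases, controls. Columns: genotypes 11, 12, 22.
genotypes = [[30, 50, 20], [20, 50, 30]]
alleles = allele_table(genotypes)

print(chi_square(genotypes))  # genotype test, 2 df
print(chi_square(alleles))    # allele test, 1 df
```

Each row of the allele table sums to twice the number of individuals in that row, matching the 2N and 2M totals above.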

Family based tests

Genotypes from independent family trios where the child is affected

Use the non-transmitted genotypes or alleles as internal controls to the transmitted ones

Family-based association studies

[Figure: family trio with parental alleles 1, 2, 3 and 4. Alleles 1 and 4 are transmitted to the affected child; alleles 2 and 3 are non-transmitted.]

Is an allele transmitted to affected offspring more often than it is not transmitted?

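TDT: Transmission Disequilibrium Test

[Figure: trio with parents G/G and G/g and affected child G/g]

                      Non-transmitted
                      G       g
Transmitted    G      a       b
               g      c       d

$TDT_G = (T_G - NT_G)^2 / (T_G + NT_G) = (b - c)^2 / (b + c) \sim \chi^2_1$

Only heterozygous parents are informative, so T_G and NT_G count transmissions from G/g parents (cells b and c).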
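The statistic above needs only the two off-diagonal cells. A minimal sketch, with hypothetical counts and a hypothetical function name, using the closed-form p-value for a chi-square with 1 degree of freedom:

```python
import math


def tdt(b, c):
    """TDT statistic from transmissions by heterozygous (G/g) parents.

    b: G transmitted and g not transmitted.
    c: g transmitted and G not transmitted.
    """
    if b + c == 0:
        raise ValueError("no informative transmissions")
    chi2 = (b - c) ** 2 / (b + c)
    # Upper-tail probability of a chi-square variable with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts, for illustration only.
chi2, p = tdt(b=60, c=40)
print(f"TDT = {chi2:.2f}, p = {p:.4f}")
```

Under the null hypothesis each allele of a heterozygous parent is transmitted with probability 1/2, so b and c are expected to be equal.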

TDT: Transmission Disequilibrium Test

Multiallelic markers        ETDT (Sham & Curtis, 1995)
Missing parent genotypes    TRANSMIT (Clayton, 1999)
Haplotypes                  TDTHAP (Clayton & Jones, 1999)
Sibs                        TDT/STDT (Spielman & Ewens, 1998)
Pedigrees                   PBAT (Martin et al., 2000)
Quantitative traits         QTDT (Abecasis et al., 2000)

Some limitations

• Subjects – random or structured family
• Parents not available
• Difficult when there are very many genes individually of small effect
• Environmental influence may obscure genetic effects
• Genetic heterogeneity underlying disease phenotype
• Hidden (unaccounted) relationship

Rare allele

Single family is segregating

[Figure: alleles A/a and B/b, with offspring split into "Offspring group I" and "Offspring group II".]

Complex pedigree & quantitative traits

Complex pedigree

• Non-independence among pedigree members
• The polygenic relationship alone is not sufficient
• Association analysis should account for the point-wise relationship among individuals
• Identical-by-descent probabilities

Methods

• Combined linkage and LD
• Generalized linear models
• Mixed-model (Yu et al. 2006)
• Bayesian approach

Combined linkage and LD

• Polygene – the whole relationship in the pedigree is used

• Identical-by-descent coefficients were estimated for the point-wise relationship

Phenotype = Fixed factors + Polygene + Haplotype

Phase determination – GDQTL
QTL mapping – DMU

QTL for Clinical Mastitis in cattle

[Figure: LRT (y-axis, 0 to 16) against position in Morgan (x-axis, 0.0 to 1.0), shown in three slide builds: LA alone; LA and LD; LA, LD and LD/LA.]

Simulation

• 100 half-sib families (dairy cattle pedigree)
• 2000 progeny
• 5 chromosomes – 100 cM each
• SNP – 5000
• 15 QTL (1 QTL – 10%, 4 QTL – 5%, 10 QTL – 2%), together 50% of the genetic variance
• Heritability – 30%
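The variance figures in the simulation design can be checked with a few lines of arithmetic. The sketch below assumes the per-QTL percentages are shares of the genetic variance, which is consistent with their total of 50%. Multiplying by the heritability then gives each QTL's share of the phenotypic variance. The variable names are illustrative.

```python
# (number of QTL, share of genetic variance per QTL), as listed in the design.
qtl_classes = [(1, 0.10), (4, 0.05), (10, 0.02)]
heritability = 0.30

n_qtl = sum(n for n, _ in qtl_classes)
genetic_share = sum(n * share for n, share in qtl_classes)

# Share of phenotypic variance = heritability x share of genetic variance.
phenotypic_share = heritability * genetic_share
per_qtl_phenotypic = [heritability * share for _, share in qtl_classes]

print(n_qtl, round(genetic_share, 3), round(phenotypic_share, 3))
print([round(x, 3) for x in per_qtl_phenotypic])
```

Under that reading, the largest QTL accounts for about 3% of the phenotypic variance and the smallest ones for about 0.6% each.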

Generalized linear models

Phenotype = Sire-family + genotype

Software – TASSEL
http://www2.maizegenetics.net/index.php?page=bioinformatics/tassel

[Figure: -ln(p) (y-axis, 0 to 120) plotted over the range 0 to 5 (x-axis), shown in three slide builds.]

Mixed-model (Yu et al. 2006)

Phenotype = Fixed factors + SNP + Population + polygene

SAS mixed model (Gael Pressoir)

[Diagram labels attached to the model terms: Relationship; 0 1 2; STRUCTURE]

[Figure: -ln(p) (y-axis, 0 to 120) plotted over the range 0 to 5 (x-axis), shown in three slide builds.]

Bayesian approach

Phenotype = Fixed factors + Polygene + Allele or Haplotype

• All markers are fitted simultaneously; search for the marker combination that explains the trait variation

• Avoids multiple testing

Software – iBays (Janss LLG, 2007)

[Figure: posterior probability (y-axis, 0 to 1) plotted over the range 0 to 5 (x-axis), shown in two slide builds.]

Multiple testing

Performing one test at an alpha level of 0.05 implies a 5% chance of rejecting a true null hypothesis (false positive, FP)

When 100 tests are performed at α = 0.05 and all 100 H0 are true, we expect 5 of the tests to give FP results

$\Pr(\text{at least one FP}) = 1 - \Pr(\text{no FP}) = 1 - (0.95)^{100} = 0.994$ (if the tests are independent)
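The same calculation for other numbers of tests shows how quickly the problem grows. A small sketch, assuming independent tests as stated above; the function name is hypothetical, and 5000 is used because it is the number of SNPs in the simulation.

```python
def prob_at_least_one_false_positive(alpha, n_tests):
    """Probability of one or more false positives among independent true-null tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


for m in (1, 10, 100, 5000):
    print(m, round(prob_at_least_one_false_positive(0.05, m), 3))
```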

Multiple testing

• Bonferroni correction
  Rejection level of each test is α/m, where m is the number of tests

• Permutation test

• False discovery rate (FDR)
  What proportion of rejections occur when H0 is true?
  Of all the times you reject H0, how often is H0 true?
  q value (Storey et al. PNAS 2003)
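The difference between controlling the chance of any false positive and controlling the false discovery rate can be seen on a small set of p-values. The sketch below applies the Bonferroni threshold α/m and, as a stand-in for FDR control, the Benjamini-Hochberg step-up procedure. That procedure is not named on the slide and is not the q-value method of Storey et al.; it is used here only because it is short. The p-values are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject each test whose p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure controlling the FDR at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that the k-th smallest p-value is <= k * alpha / m.
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            cutoff = rank
    rejected = [False] * m
    for i in order[:cutoff]:
        rejected[i] = True
    return rejected


# Hypothetical p-values, for illustration only.
p_values = [0.001, 0.012, 0.025, 0.039, 0.60]
print(bonferroni(p_values))
print(benjamini_hochberg(p_values))
```

On these values Bonferroni rejects only the smallest p-value, while the FDR procedure rejects the four smallest, which illustrates why FDR control is less conservative when many tests are run.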

Summary

4 methods
• LD and linkage
• GLM
• Mixed-model
• Bayesian approach

Project team

Goutam Sahana

Bernt Guldbrandtsen

Luc Janss

Mogens Sandø Lund