Seminar on GENOME WIDE ASSOCIATION STUDIES
By: Varsha Gaitonde, ID: PALB2235


Contents

• Introduction
• Terminologies
• Comparison of AM v/s BM
• GWAS: Introduction
• Methodology
• Challenges in conducting GWAS
• Case studies
• Advantages
• Disadvantages
• Future of GWAS
• Revision
• Conclusion

Terminologies

• False negative: the declaration of an outcome as statistically non-significant when the effect is actually genuine.

• False positive: the declaration of an outcome as statistically significant when there is no true effect.

• Linkage: the co-inheritance of different loci that lie within a given genetic distance of each other on a chromosome.

• Linkage equilibrium (LE): a random association of alleles at different loci; each haplotype frequency equals the product of the frequencies of its alleles.

• Linkage disequilibrium (LD): a non-random association of alleles at different loci; haplotype frequencies in the population differ from the product of the allele frequencies.

• Minor allele frequency (MAF): the frequency of the less common allele at a polymorphic locus. Its value lies between 0 and 0.5 and can vary between populations.

• Odds ratio (OR): a measure of association commonly used in case-control studies. It is defined as the odds of exposure to the susceptible genetic variant in cases compared with the odds in controls. If the OR is significantly greater than 1, the genetic variant is associated with the disease.
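The MAF and LD definitions above can be illustrated with a minimal Python sketch. The function names (`minor_allele_frequency`, `ld_stats`) and the allele labels ("A"/"a" at the first locus, "B"/"b" at the second) are assumptions made for this example and do not come from the slides. The sketch uses the standard coefficient D = P(AB) - P(A)P(B) and r² = D² / (P(A)P(a)P(B)P(b)), and it assumes phased, biallelic data.

```python
from collections import Counter


def minor_allele_frequency(alleles):
    """Frequency of the less common allele at a biallelic locus (0 to 0.5)."""
    counts = Counter(alleles)
    if len(counts) < 2:
        return 0.0  # monomorphic locus: no minor allele observed
    return min(counts.values()) / len(alleles)


def ld_stats(haplotypes):
    """Return (D, r2) for two biallelic loci.

    `haplotypes` is a list of 2-character strings such as "AB" or "ab".
    Labels are hypothetical: "A"/"a" for locus 1, "B"/"b" for locus 2.
    """
    n = len(haplotypes)
    p_a = sum(h[0] == "A" for h in haplotypes) / n
    p_b = sum(h[1] == "B" for h in haplotypes) / n
    p_ab = sum(h == "AB" for h in haplotypes) / n
    d = p_ab - p_a * p_b  # zero under linkage equilibrium
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, r2


if __name__ == "__main__":
    print(minor_allele_frequency(list("AAAT")))
    print(ld_stats(["AB"] * 5 + ["ab"] * 5))        # complete LD
    print(ld_stats(["AB", "Ab", "aB", "ab"]))       # linkage equilibrium
```

In the first haplotype sample the two loci always travel together, so r² reaches its maximum; in the second every haplotype is equally frequent, so D and r² are zero.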

Mapping

• Mapping is the representation of information using spatial relationships.

• Mapping methods:

– Linkage mapping: measures recombination between markers and the unknown gene (linkage).

– Association mapping: measures the correlation between marker alleles and trait alleles in a population (linkage disequilibrium).

Difference

Bi-parental mapping
• Experimental cross required.
• Phenotypes have to be collected.
• Limited mapping resolution.
• Essentially 2 alleles are tested.
• Constrained to loci that segregate between the parental lines.
• High detection power.

Association mapping
• No cross required; works with existing germplasm.
• Phenotypic data may already be available.
• High resolution.
• More than 2 alleles are tested.
• Many loci for a single trait are analyzed concurrently.
• Comparatively low detection power.

Genome wide association studies (GWAS)

• Aim: to identify which regions (or SNPs) in the genome are associated with a disease or a certain phenotype.

• Design:
– Identify population structure
– Select case subjects (those with the disease)
– Select control subjects (healthy)
– Genotype a million SNPs for each subject
– Determine which SNPs are associated.
• Encoded data
• Ranking SNPs

Linkage vs Association

Linkage

1. Family-based

2. Matching/ethnicity generally unimportant

3. Few markers for genome coverage (300-400 STRs)

4. Can be weak design

5. Good for initial detection; poor for fine-mapping

6. Powerful for rare variants

Association

1. Families or unrelateds

2. Matching/ethnicity crucial

3. Many markers required for genome coverage (10⁵ – 10⁶ SNPs)

4. Powerful design

5. Poor for initial detection; good for fine-mapping

6. Powerful for common variants; rare variants generally impossible

History of GWAS

• A successful GWAS was published in 2005, investigating patients with age-related macular degeneration.

• Prior to GWAS, around 2000: inheritance studies based on linkage in families.

• HapMap: 2003

Building Haplotype Maps for Gene-finding

1. Human Genome Project: good for consensus, not good for individual differences.

2. Identify genetic variants: anonymous with respect to traits.

3. Assay genetic variants: verify polymorphisms, catalogue correlations amongst sites; anonymous with respect to traits.

[Timeline figure; labels: Sept 01, Feb 02, April 04, Oct 04; April 1999 – Dec 01; Oct 2002 – present]

HapMap

• Multi-country effort to identify and catalogue common human genetic variants.

• Developed to better understand and catalogue LD patterns across the genome in several populations.

• Genotyped ~4 million SNPs on samples of African, East Asian and European ancestry.

• All genotype data are in a publicly available database.

• The genotype data can be downloaded, which makes it possible to:
– examine LD patterns across the genome
– estimate the approximate coverage of a given SNP chip

• 80-90% of common SNPs can be represented with:
– ~300,000 tag SNPs for European or Asian samples
– ~500,000 tag SNPs for African samples


Why genome wide association mapping?


Why are they possible now?

How is it done?

[Figure: study designs drawn as SNPs × samples grids. One-stage design; two-stage design (Stage 1, Stage 2) with replication-based analysis; two-stage design (Stage 1, Stage 2) with joint analysis.]

Multistage Designs

• Joint analysis has more power than replication.

• The p-value threshold in Stage 1 must be liberal.

• CaTS power calculator.

• Here, signals from an initial first-stage GWA are used to define a subset of SNPs that are retyped in additional second-stage samples.

• Lower cost, but no gain in power.

http://www.sph.umich.edu/csg/abecasis/CaTS/index.html

Analysis of GWAS

• Most common approach: look at each SNP one at a time.
• Possibly add in multi-marker information.
• Further investigate / report top SNPs only. Or backwards replication…
• Most commonly a trend test.
• Log-additive model, logistic regression.
• Adjust for potential population stratification.

Basics for GWAS

Calculate the odds ratio
• If 2 events A and B are considered, the odds ratio of A and B is
$$\mathrm{OR} = \frac{\mathrm{Odds}(A)}{\mathrm{Odds}(B)} = \frac{P(A)/P(\sim A)}{P(B)/P(\sim B)}$$
• Symmetry in the odds ratio:
$$\mathrm{OR} = \frac{\mathrm{Odds}(D \mid G=1)}{\mathrm{Odds}(D \mid G=0)} = \frac{\mathrm{Odds}(G \mid D=1)}{\mathrm{Odds}(G \mid D=0)}$$
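As a small worked illustration of the odds ratio and its symmetry, the sketch below computes the OR from a 2×2 case-control table in both directions. The function names and the example counts are hypothetical, chosen only for illustration, and the sketch assumes that no cell of the table is zero.

```python
import math


def odds_ratio_exposure(case_g1, case_g0, ctrl_g1, ctrl_g0):
    """Odds(G | D=1) / Odds(G | D=0): odds of carrying the variant in cases vs controls."""
    return (case_g1 / case_g0) / (ctrl_g1 / ctrl_g0)


def odds_ratio_disease(case_g1, case_g0, ctrl_g1, ctrl_g0):
    """Odds(D | G=1) / Odds(D | G=0): odds of disease in carriers vs non-carriers."""
    return (case_g1 / ctrl_g1) / (case_g0 / ctrl_g0)


if __name__ == "__main__":
    # Hypothetical counts: 60 carrier cases, 40 non-carrier cases,
    # 30 carrier controls, 70 non-carrier controls.
    table = (60, 40, 30, 70)
    or_exposure = odds_ratio_exposure(*table)
    or_disease = odds_ratio_disease(*table)
    print(or_exposure, or_disease)
    # Both directions agree, which is the symmetry property stated above.
    print(math.isclose(or_exposure, or_disease))
```

Both expressions reduce to the cross-product (case_g1 × ctrl_g0) / (case_g0 × ctrl_g1), which is why a case-control design can estimate the disease odds ratio even though it samples on disease status.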

Significance?

• Chi-square test (statistical test of association).
• Search for SNPs that deviate from the independence assumption.
• Rank SNPs by p-value.
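The test-and-rank step can be sketched as follows, assuming each SNP is summarized as a 2×2 table of counts (cases with/without the allele, controls with/without the allele). The function names, SNP identifiers and counts are hypothetical. The p-value uses the fact that, for 1 degree of freedom, the chi-square tail probability equals erfc(sqrt(chi2 / 2)), so only the standard library is needed.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 df, no continuity correction) for the table [[a, b], [c, d]].

    Returns (chi2, p_value). A table with an empty row or column gives (0.0, 1.0).
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 0.0, 1.0
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    p_value = math.erfc(math.sqrt(chi2 / 2))  # exact tail for 1 df
    return chi2, p_value


def rank_snps(tables):
    """Return [(snp, chi2, p_value), ...] sorted from smallest to largest p-value."""
    results = [(snp, *chi_square_2x2(*counts)) for snp, counts in tables.items()]
    return sorted(results, key=lambda r: r[2])


if __name__ == "__main__":
    # Hypothetical counts per SNP: (case+, case-, control+, control-)
    tables = {
        "snp_1": (50, 50, 50, 50),
        "snp_2": (60, 40, 30, 70),
        "snp_3": (55, 45, 48, 52),
    }
    for snp, chi2, p in rank_snps(tables):
        print(f"{snp}\tchi2={chi2:.3f}\tp={p:.3g}")
```

A SNP whose counts are independent of case/control status has a statistic near zero and ends up at the bottom of the ranking.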

Linkage disequilibrium

Concept of LD

(Abdurakhmonov and Abdukarimov, 2008)

Software used for calculation of LD

• “Graphical Overview of Linkage Disequilibrium” (GOLD), to depict the structure and pattern of LD.

• “Trait Analysis by aSSociation, Evolution and Linkage” (TASSEL) and PowerMarker.

LD in animals

LD in plant system

• Maize (Zea mays ssp. mays): studies have been conducted to investigate LD over a wide range of populations and marker types.

Factors affecting LD

The factors which lead to an increase in LD include:
• Inbreeding
• Small population size
• Genetic isolation between lineages
• Population subdivision
• Low recombination rate
• Population admixture
• Natural and artificial selection
• Balancing selection, etc.

The factors which lead to a decrease/disruption in LD include:
• Outcrossing
• High recombination rate
• High mutation rate, etc.

Use of GWAS in plant systems

Basic biology
• Understand the makeup of molecular pathways.
• Dissect the genetic component of molecular variation.
• Genotype × environment interaction.

Breeding
• Mining of markers causal for a phenotype, to assist in breeding decisions.
• Maximization of yield, pathogen resistance, etc.

(Brown and Brown, 2008)

Preprocessing

Genotype
• Imputation of missing values.
• Hidden Markov models and related approaches, e.g. Beagle, IMPUTE.
• In GWAS based on full sequencing data, some alleles may be rare or even private.

Phenotype
• Most parametric models are based on Gaussian assumptions.
• Phenotypic residuals are often non-Gaussian.
• Transform the phenotype to a suitable scale.
• Use prior knowledge, e.g. growth rate, generation doubling time, etc.
• Variance stabilization.

Challenges we are going to address while conducting GWAS

Multiple hypothesis testing
• In GWAS the number of statistical tests is commonly on the order of 10⁶.
• At a significance level of 0.01 we would expect 10,000 false positives. Thus individual p-values < 0.01 are no longer significant.
• Correction for multiple hypothesis testing is critical.

Population structure
• Confounding structure leads to false positives.

Statistical power and resolution
• Small samples, large number of hypotheses.
• Increased power
• Testing compound hypotheses.
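The multiple-testing arithmetic above can be checked with a short sketch. It also shows the Bonferroni correction, which is one common correction and is an assumption of this example, since the slides do not name a specific method. The function names are hypothetical.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of false positives if every tested null hypothesis is true."""
    return n_tests * alpha


def bonferroni_threshold(n_tests, family_alpha=0.05):
    """Per-test p-value threshold that keeps the family-wise error rate at family_alpha."""
    return family_alpha / n_tests


if __name__ == "__main__":
    n_tests = 10**6
    print(expected_false_positives(n_tests, 0.01))  # about 10,000, as stated above
    print(bonferroni_threshold(n_tests))            # about 5e-8
```

With 10⁶ tests and a family-wise level of 0.05, the per-test threshold is about 5 × 10⁻⁸, which is the order of magnitude usually quoted for genome-wide significance.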

Model checking: Q-Q plots

• Compare the quantiles of the empirical test statistic distribution to the assumed null distribution.
• Sort the test statistics.
• Plot the sorted test statistics (y-axis) against the quantiles of the theoretical null distribution.
• If the plot is close to the diagonal, the empirical distribution matches the null distribution.
• Deviation from the diagonal indicates inflation or deflation of test statistics.
• Repair the plot by checking Hardy-Weinberg (HW) equilibrium.
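The Q-Q comparison can be sketched on the -log10(p) scale, assuming that under the null hypothesis p-values are uniform on (0, 1), so the i-th smallest of n p-values is expected near (i - 0.5)/n. The function names are hypothetical. The genomic inflation factor (lambda) is a commonly used one-number summary of inflation or deflation and is an addition of this sketch, not something named on the slide. The constant 0.4549364 is the median of a chi-square distribution with 1 degree of freedom.

```python
import math
from statistics import NormalDist, median


def qq_points(p_values):
    """Return (expected, observed) pairs of -log10(p); p-values must lie in (0, 1]."""
    n = len(p_values)
    observed = sorted(-math.log10(p) for p in p_values)
    expected = sorted(-math.log10((i - 0.5) / n) for i in range(1, n + 1))
    return list(zip(expected, observed))


def genomic_inflation(p_values):
    """Median observed 1-df chi-square statistic divided by its null median."""
    norm = NormalDist()
    # Convert each two-sided p-value back to a 1-df chi-square statistic.
    chi2 = [norm.inv_cdf(p / 2) ** 2 for p in p_values]
    return median(chi2) / 0.4549364


if __name__ == "__main__":
    n = 1001
    null_like = [(i - 0.5) / n for i in range(1, n + 1)]  # idealized null p-values
    inflated = [p / 10 for p in null_like]                # systematically too small
    print(genomic_inflation(null_like))  # close to 1: points lie on the diagonal
    print(genomic_inflation(inflated))   # well above 1: inflation
```

The pairs returned by `qq_points` can be passed to any plotting tool; points above the diagonal correspond to p-values smaller than expected under the null.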

Checking for HWE
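A minimal sketch of an HWE check for one biallelic SNP, assuming a chi-square goodness-of-fit test with 1 degree of freedom; the slide does not say which test was used, and exact tests are also common in practice. The function name and the genotype counts are hypothetical.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test of Hardy-Weinberg equilibrium from genotype counts.

    Returns (chi2, p_value), using 1 degree of freedom.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    p_value = math.erfc(math.sqrt(chi2 / 2))  # exact tail for 1 df
    return chi2, p_value


if __name__ == "__main__":
    print(hwe_chi_square(25, 50, 25))  # counts match HWE expectations exactly
    print(hwe_chi_square(40, 20, 40))  # strong heterozygote deficit
```

SNPs with very small HWE p-values in controls are often treated as possible genotyping errors and filtered before the association analysis.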

Imputation of SNP genotype

Identify Match with Reference

Observed Genotypes

. . . . A . . . . . . . A . . . . A . . .
. . . . G . . . . . . . C . . . . A . . .

Reference Haplotypes

C G A G A T C T C C T T C T T C T G T G C
C G A G A T C T C C C G A C C T C A T G G
C C A A G C T C T T T T C T T C T G T G C
C G A A G C T C T T T T C T T C T G T G C
C G A G A C T C T C C G A C C T T A T G C
T G G G A T C T C C C G A C C T C A T G G
C G A G A T C T C C C G A C C T T G T G C
C G A G A C T C T T T T C T T T T G T A C
C G A G A C T C T C C G A C C T C G T G C
C G A A G C T C T T T T C T T C T G T G C

Phase chromosomes, impute missing genotypes

Observed Genotypes

c g a g A t c t c c c g A c c t c A t g g
c g a a G c t c t t t t C t t t c A t g g

Reference Haplotypes

C G A G A T C T C C T T C T T C T G T G C
C G A G A T C T C C C G A C C T C A T G G
C C A A G C T C T T T T C T T C T G T G C
C G A A G C T C T T T T C T T C T G T G C
C G A G A C T C T C C G A C C T T A T G C
T G G G A T C T C C C G A C C T C A T G G
C G A G A T C T C C C G A C C T T G T G C
C G A G A C T C T T T T C T T T T G T A C
C G A G A C T C T C C G A C C T C G T G C
C G A A G C T C T T T T C T T C T G T G C

Gonçalo Abecasis

• Why in Arabidopsis?

• Hermaphrodite, so the species occurs widely.
• Behaves as a naturally existing inbred population.
• LD is more extensive.
• False positive rates differed strongly between the traits.

• They considered 4 phenotypes for which the major loci were known:

• Vernalization response locus: FRI.

• 3 pathogen resistance loci: Rpm1, Rps2, Rps5.

• Biparental and QTL approaches are not scalable enough to investigate the genetic potential of 12,000 accessions.

• GWAS simultaneously screens a large number of accessions.

• Genotype once and phenotype repeatedly.

• Took a global collection of 413 accessions of sativa.

• Collected from 82 countries; designed 440,000 oligonucleotides.

Results
• Correlation analysis between different phenotypes.
• Correlations ranged from -0.41 to 0.9 (seed width and length).

Analysis of naive and mixed model approaches

• Smut (Sphacelotheca reiliana) study using the Illumina maize SNP50 array.
• 45,868 SNPs in a panel of 144 inbred lines.
• Classified candidate genes as resistance genes, disease response genes and other genes.

Outcome
• 50K SNP offers highest array.
• Chromosomal regions and specific SNPs that affect the resistance level in a commercial build-up population.
• Assessed the extent of LD in the target population.
• Identified genes or QTLs that significantly affect smut resistance.
• Characterized those genes based on known function and co-location.

Research in other plants
• Barley
• Lettuce
• Tomato
• Sorghum
• Wheat
• Foxtail millet

Advantages

• The biological pathway of the trait does not have to be known.

• Discovers novel candidate genes.

• Encourages collaborative consortia.

• Rules out specific genetic associations.

• Provides more robust data.

• Identifies mutations that explain a few percent of the phenotypic variance.

Limitations

• Results need replication in independent samples from different populations.

• A large study population is required; detects association, not causation.

• Identifies a specific location, not the complete gene.

• Focuses on common variants, and many associated variants are not causal.

• Detects only variants with a frequency above 5% in a population.

• Cost of each DNA sample and of pooling them.

• Unavailability of funding agencies.

• Not predictive and explains little heritability.

Will GWAS results explain more heritability?

Current association study challenges

Candidate gene
• Hypothesis-driven
• Low-cost: small genotyping requirements
• Multiple testing less important
– Possibly many misses, fewer false positives

Still, why is GWAS popular?

• Genotyping costs are dropping, and it covers the whole genome.

• Resequencing of all the individuals in a population captures large structural variation such as copy number variation.

• E.g. resequencing of Arabidopsis lyrata.

• In future this will help to include RNA-seq data in eQTL mapping in GWAS studies.

• Population choice will no longer be restricted to model organisms and will slowly become more focused on the species that are most relevant for answering biological questions.

• The accuracy depends on one-time genotyping and repeated phenotyping.

Output of GWAS

• All phenotype and genotype data should be made public and deposited in public databases.

• File formats and minimum information standards should therefore be established.

• Priority should be given to the storage and dissemination of phenotypic and genotypic data.

Conclusion

Moving beyond Genomics

Systems Biology

Thank you

“The more we find, the more we see, the more we come to learn. The more that we explore, the more we shall return.”

Sir Tim Rice, Aida, 2000