53
Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015

Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015

Embed Size (px)

DESCRIPTION

Genetic Spectrum of Complex Diseases GWAS Sequencing Linkage

Citation preview

Page 1: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015

Analysis of Next Generation Sequence Data

BIOST 205504/06/2015

Page 2: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015

Last Lecture

• Genome-wide association study has identified thousands of disease-associated loci

• Large consortium performs meta-analysis to further increase the sample size (power) to detect additional loci

• GWAS is limited by the chip design and rare variants are rarely explored

Page 3: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015

Genetic Spectrum of Complex Diseases

GWASSequencing

Linkage

Page 4: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015

Outline• Background

• From sequence data to genotype

• Rare Variant Tests

Page 5: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015

Human Genome and Single Nucleotide Polymorphisms (SNPs)

• 23 chromosome pairs• 3 billion bases

• A single nucleotide change between pairs of chromosomes

• E.g.

• A/A or G/G homozygote • A/G heterozygote

Haplotype1: AAGGGATCCACHaplotype2: AAGGAATCCAC

Page 6: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015

Association Study in Case Control Samples

CAGATCGCTGGATGAATCGCATCCGGATTGCTGCATGGATCGCATC

CAGATCGCTGGATGAATCGCATCCAGATCGCTGGATGAATCCCATC

CGGATTGCTGCATGGATCCCATCCGGATTGCTGCATGGATCCCATC  

SNP2↓

SNP3↓

SNP4↓

SNP5↓

SNP1↓

Disease

Page 7: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015

– Only subset of functional elements include common variants– Rare variants are more numerous and thus will point to additional loci

Page 8: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015

History of DNA Sequencing

Page 9: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015

Sequencing Cost

http://www.genome.gov/

Page 10: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015

A Road to Discover Human Genome

1990-2003 2002 - 2008 -

Page 11: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015

Current Genome Scale Approaches

• Deep whole genome sequencing– Expensive, only can be applied to limited samples currently– Most complete ascertainment of all variations

• Low coverage whole genome sequencing– Modest cost, typically 100-1000 samples– Complete ascertainment of common variations– Less complete ascertainment of rare variants

• Exome capture and targeted region sequencing– Modest cost, high coverage– Most interesting part of the genome

Page 12: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015

Next Generation Sequencing• Commercial platforms produce gigabases of sequence rapidly

and inexpensively – ABI SOLiD, Illumina Solexa, Roche 454, Complete

Genomics, and others…

• Sequence data consist of thousands or millions of short sequence reads with moderate accuracy 0.5 – 1.0% error rates per base may be typical

• High-throughput but hard to assemble

Page 13: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015

A Typical PipelineShotgun Sequencing Reads

Single Marker Caller

Haplotype-basedCaller

Mapped Reads

Polymorphic Sites

Individual Genotypes

ReadAlignment Software

Page 14: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015

Short read alignment

Sequencer

Reads from new sequencing machines are short: 30-400 bp

Human source

Page 15: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015

Short read alignment

Sequencing machine

And you get MILLIONS of them

Page 16: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015

Short read alignment

Need to map them back to human reference

Page 17: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015

AlignmentReference sequence:

actgtagattagccgagtagctagctagtcgat

ccgagaagctag

Find best match for each read in a reference sequence

• Hashing is time and memory consuming for millions of reads and billion-base long reference

• Errors in reads• Each read may be mapped to multiple positions• Individual polymorphisms

Page 18: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015

Existing Alignment by Category• Hashing reference genome

– SOAP1, MOSAIK, PASS, BFAST, …

• Hashing short reads– Eland, MAQ, SHRiMP, …

• Merge-sorting reference together with reads– Slider

• Based on Burrows-Wheeler Transform– BWA, SOAP2, Bowtie, …

Li and Durbin (2009), Bioinformatics 25 (14): 1754-60

Page 19: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015

After Alignment

• Each read is mapped to reference genome with tolerated number of mismatches– Mismatches allow us to discover the individual

variation

• Each site of reference genome is covered by multiple un-evenly distributed reads– Some sites might not be covered

Page 20: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015

Genome

Genome 1

Genome 2

Genome 3

Genome 4

Reads

Coverage (High vs Low)

VS

• Which one has more power to detect variations?

Page 21: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015

Genotype Calling from Sequence Data

5’-ACTGGTCGATGCTAGCTGATAGCTAGCTAGCTGATGAGCCCGATCGCTGCTAGCTCGACG-3’Reference Genome

GCTAGCTGATAGCTAGCTAGCTGATGAGCCCGA

AGCTGATAGCTAGCTAGCTGATGAGCCCGATCGCTGATGCTAGCTGATAGCTAGCTAGCTGATGAGCC

ATAGCTAGATAGCTGATGAGCCCGATCGCTGCTAGCTCTAGCTGATAGCTAGATAGCTGATGAGCCCGAT

Sequence Reads

Predicted GenotypeA/C or A/A or C/C

Observed Data2A and 3C

Page 22: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015

A Simple ModelAt one site, Na reads carry A, Nb reads carry B

Page 23: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015

Inference with no reads

Reference Genome

Sequence Reads

Possible Genotypes

P(reads|A/A)= 1.0

P(reads|A/C)= 1.0

P(reads|C/C)= 1.0

5’-ACTGGTCGATGCTAGCTGATAGCTAGCTAGCTGATGAGCCCGATCGCTGCTAGCTCGACG-3’

Page 24: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015

Inference with short read data

Reference Genome

Sequence Reads

Possible Genotypes

P(reads|A/A)= P(C observed, read maps |A/A)

P(reads|A/C)= P(C observed, read maps |A/C)

P(reads|C/C)= P(C observed, read maps |C/C)

5’-ACTGGTCGATGCTAGCTGATAGCTAGCTAGCTGATGAGCCCGATCGCTGCTAGCTCGACG-3’

GCTAGCTGATAGCTAGCTAGCTGATGAGCCCGA

Page 25: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015

Inference assuming error of 1%

Reference Genome

Possible Genotypes

P(reads|A/A)= 0.01

P(reads|A/C)= 0.50

P(reads|C/C)= 0.99

Sequence Reads

5’-ACTGGTCGATGCTAGCTGATAGCTAGCTAGCTGATGAGCCCGATCGCTGCTAGCTCGACG-3’

GCTAGCTGATAGCTAGCTAGCTGATGAGCCCGA

Page 26: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015

As data accumulate …

Reference Genome

Possible Genotypes

P(reads|A/A)= 0.0001

P(reads|A/C)= 0.25

P(reads|C/C)= 0.98

Sequence Reads

5’-ACTGGTCGATGCTAGCTGATAGCTAGCTAGCTGATGAGCCCGATCGCTGCTAGCTCGACG-3’

GCTAGCTGATAGCTAGCTAGCTGATGAGCCCGA

AGCTGATAGCTAGCTAGCTGATGAGCCCGATCGCTG

Page 27: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015

As data accumulate …

Reference Genome

Possible Genotypes

P(reads|A/A)= 0.000001

P(reads|A/C)= 0.125

P(reads|C/C)= 0.97

Sequence Reads

5’-ACTGGTCGATGCTAGCTGATAGCTAGCTAGCTGATGAGCCCGATCGCTGCTAGCTCGACG-3’

GCTAGCTGATAGCTAGCTAGCTGATGAGCCCGA

AGCTGATAGCTAGCTAGCTGATGAGCCCGATCGCTGATGCTAGCTGATAGCTAGCTAGCTGATGAGCC

Page 28: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015

As data accumulate …

Reference Genome

Possible Genotypes

P(reads|A/A)= 0.00000099

P(reads|A/C)= 0.0625

P(reads|C/C)= 0.0097

Sequence Reads

5’-ACTGGTCGATGCTAGCTGATAGCTAGCTAGCTGATGAGCCCGATCGCTGCTAGCTCGACG-3’

GCTAGCTGATAGCTAGCTAGCTGATGAGCCCGA

AGCTGATAGCTAGCTAGCTGATGAGCCCGATCGCTGATGCTAGCTGATAGCTAGCTAGCTGATGAGCC

TAGCTGATAGCTAGATAGCTGATGAGCCCGAT

Page 29: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015

In the “end”

Reference GenomeP(reads|A/A)= 0.00000098

P(reads|A/C)= 0.03125

P(reads|C/C)= 0.000097

Sequence Reads

5’-ACTGGTCGATGCTAGCTGATAGCTAGCTAGCTGATGAGCCCGATCGCTGCTAGCTCGACG-3’

GCTAGCTGATAGCTAGCTAGCTGATGAGCCCGAAGCTGATAGCTAGCTAGCTGATGAGCCCGATCGCTG

ATGCTAGCTGATAGCTAGCTAGCTGATGAGCC

TAGCTGATAGCTAGATAGCTGATGAGCCCGATATAGCTAGATAGCTGATGAGCCCGATCGCTGCTAGCTC

Page 30: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015

Not the “end” yet

Reference GenomeP(reads|A/A) = 0.00000098

P(reads|A/C) = 0.03125P(reads|C/C) = 0.000097

Sequence Reads

5’-ACTGGTCGATGCTAGCTGATAGCTAGCTAGCTGATGAGCCCGATCGCTGCTAGCTCGACG-3’

GCTAGCTGATAGCTAGCTAGCTGATGAGCCCGAAGCTGATAGCTAGCTAGCTGATGAGCCCGATCGCTG

ATGCTAGCTGATAGCTAGCTAGCTGATGAGCC

TAGCTGATAGCTAGATAGCTGATGAGCCCGAT

Making a genotype call requires combining sequence data with prior information

ATAGCTAGATAGCTGATGAGCCCGATCGCTGCTAGCTC

Page 31: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015

Not the “end” yet

Reference Genome

P(reads|A/A)= 0.00000098 Prior(A/A) = 0.00034 P(A/A|reads) < 0.01P(reads|A/C)= 0.03125 Prior(A/C) = 0.00066 P(A/C|reads) = 0.175P(reads|C/C)= 0.000097 Prior(C/C) = 0.99900 P(C/C|reads) = 0.825

Sequence Reads

5’-ACTGGTCGATGCTAGCTGATAGCTAGCTAGCTGATGAGCCCGATCGCTGCTAGCTCGACG-3’

GCTAGCTGATAGCTAGCTAGCTGATGAGCCCGA

AGCTGATAGCTAGCTAGCTGATGAGCCCGATCGCTGATGCTAGCTGATAGCTAGCTAGCTGATGAGCC

TAGCTGATAGCTAGATAGCTGATGAGCCCGAT

Base Prior: every site has 1/1000 probability of varying

ATAGCTAGATAGCTGATGAGCCCGATCGCTGCTAGCTC

Page 32: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015

Population Based Prior

Reference Genome

P(reads|A/A)= 0.00000098 Prior(A/A) = 0.04 P(A/A|reads) < .001P(reads|A/C)= 0.03125 Prior(A/C) = 0.32 P(A/C|reads) = 0.999P(reads|C/C)= 0.000097 Prior(C/C) = 0.64 P(C/C|reads) = <.001

Sequence Reads

5’-ACTGGTCGATGCTAGCTGATAGCTAGCTAGCTGATGAGCCCGATCGCTGCTAGCTCGACG-3’

GCTAGCTGATAGCTAGCTAGCTGATGAGCCCGA

AGCTGATAGCTAGCTAGCTGATGAGCCCGATCGCTGATGCTAGCTGATAGCTAGCTAGCTGATGAGCC

TAGCTGATAGCTAGATAGCTGATGAGCCCGAT

Population Based Prior: Use frequency information from examining others at the same site. E.g. P(A) = 0.2

ATAGCTAGATAGCTGATGAGCCCGATCGCTGCTAGCTC

Page 33: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015

Prior Information• Individual based prior

– Equal probability of showing polymorphism– 1/1000 bases different from reference– Error Free and Poisson distribution– Single sample, single site

• Population based prior– Estimate frequency from many individuals– Multiple sample, single site

• Haplotype/Imputation based prior– Jointly model flanking SNPs, use haplotype information– Important for low coverage sequence data– Multiple samples, multiple sites

Page 34: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015

Comparisons of Different Genotype Calling Methods

Page 35: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015

Rare Variant Tests

• Genotype calling is the first step of the journey

• Identify SNPs/genes associated with phenotype

• Sequencing provides more comprehensive way to study the genome– Discover more rare variants

Page 36: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015

– Only subset of functional elements include common variants– Rare variants are more numerous and thus will point to additional loci

Page 37: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015

Genetic Spectrum of Complex Diseases

GWASSequencing

McCarthy MI et al. Nat Rev Genet. 2008

Page 38: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015

Several Approaches to Study Rare Variants

• Deep whole genome sequencing – Can only be applied to limited numbers of samples – Most complete ascertainment of variation

• Exome capture and targeted sequencing – Can be applied to moderate numbers of samples – SNPs and indels in the most interesting 1% of the genome

• Low coverage whole genome sequencing – Can be applied to moderate numbers of samples – Very complete ascertainment of shared variation

• New Genotyping Arrays and/or Genotype Imputation – Examine low frequency coding variants in 100,000s of samples – Current catalogs include 97-98% of sites detectable by sequencing an individual

Page 39: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015

Single SNP Test for Rare Variant

• Rare variants are hard to detect

• Power/sample size depends on both frequency and effect size

• Rare causal SNPs are hard to identify even with large effect size

Page 40: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015

Single SNP Test for Rare Variant

• Disease prevalence ~10%• Type I error 5x10-6

• To achieve 80% power • Equal number of cases and controls

• Minor Allele Frequency (MAF) = 0.1, 0.01, 0.001

• Required sample size = 486, 3545, 34322,

Page 41: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015

Alternatives to Single Variant Test Collapsing Method (Burden Test)

• Group rare variants in the same gene/region

• Score each individual– Presence or absence of rare copy– Weight each variant

• Use individual score as a new “genotype”

• Test in a regression framework

Page 42: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015

Challenges

• Disease is caused by multiple rare variants in an additive manner

• It is hard to separate causal and null SNPs– Including all rare variants will dilute the true signals

• The effect size of each rare variant varies

Page 43: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015

Power of Burden Test

• Power tabulated in collections of simulated data

• Combining variants can greatly increase power

• Currently, appropriately combining variants is expected to be key feature of rare variant studies.

Page 44: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015

Impact of Null Variants

• Including non-disease variants reduces power

• Power loss is manageable, combined test remains preferable to single marker tests

Page 45: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015

Impact of Missing Disease Alleles

• Missing disease alleles loses power

• Still better than single variant test

Page 46: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015

1. yi: quantitative or binary phenotypes;

2. α'Xi: fixed effects of covariates;

3. β'Gi: genetic effects from one gene consisted of SNPs;

4. εi: random error.

Sequence Kernel Association Test (SKAT)

0τ:H0 0:,,0~ Assume 0 HWN IE2,0~

Page 47: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015

Sequence Kernel Association Test (SKAT)

• Regression based method

• Score statistic

• Kernel

'K GWG

Page 48: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015
Page 49: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015

Maximizing the Power• Power depends summed frequency

– Choose threshold for defining rare carefully

• Enriched functional variants in cases increase power – Focus on loss of function variants only

• Use more efficient design– For quantitative traits, focus on individuals with extreme trait values– For binary traits, focus on individuals with family history of disease

Page 50: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015
Page 51: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015

Discussion

• Analysis of rare variants is an active research area

• Weight for each SNP is the key

• What to do if the samples are related

• Most tests reply on permutation– Computationally intensive

Page 52: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015

Reference• The 1000 Genomes Project (2010) A map of human genome vairation from

population-scale sequencing. Nature 467:1061-73

• Nielsen R, Paul JS et al. (2011) Genotype and SNP calling from next-generation sequencing data. Nat Rev Genet

• Li Y, Chen W et al. (2012) Single Nucleotide Polymorphism (SNP) Detection and Genotype Calling from Massively Parallel Sequencing (MPS) Data. Statistics in Biosciences.

• Li Y et al (2011) Low-coverage sequencing: Implication for design of complex trait association studies. Genome Research 21: 940-951

• Chen W, Li B et al. (2013) Genotype calling and haplotyping in parent-offspring trios. Genome Research.

Page 53: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015

Reference• http://genome.sph.umich.edu/wiki/Rare_variant_tests

• Raychaudhuri S. Mapping rare and common causal alleles for complex human diseases. Cell. 2011 Sep 30;147(1):57-69.

• Li and Leal (2008) Am J Hum Genet 83:311-321

• Madsen BE, Browning SR (2009) A Groupwise Association Test for Rare Mutations Using a Weighted Sum Statistic. PLoS Genet 5(2)

• Zawistowski M, Gopalakrishnan S, Ding J, Li Y, Grimm S, Zöllner S (2010) Am J Hum Genet 87:604-617

• Wu M, Lee S, et al. (2011) Am J Hum Genet