Hidden Markov Models of Haplotype Diversity and Applications in Genetic Epidemiology Ion Mandoiu...

Hidden Markov Models of Haplotype

Diversity and Applications in

Genetic Epidemiology

Ion Mandoiu

University of Connecticut

HMM model of haplotype diversityApplications

- Phasing- Error detection- Imputation- Genotype calling from low-coverage

sequencing dataConclusions

Outline

Main form of variation between individual genomes: single nucleotide polymorphisms (SNPs)

High density in the human genome: 1 107 SNPs out of total 3 109 base pairs

Single Nucleotide Polymorphisms

… ataggtccCtatttcgcgcCgtatacacgggActata …… ataggtccGtatttcgcgcCgtatacacgggTctata …… ataggtccCtatttcgcgcCgtatacacgggTctata …

Haplotypes and Genotypes

Diploids: two homologous copies of each autosomal chromosome

One inherited from mother and one from father

Haplotype: description of SNP alleles on a chromosome 0/1 vector: 0 for major allele, 1 for minor

Genotype: description of alleles on both chromosomes 0/1/2 vector: 0 (1) - both chromosomes contain the major (minor)

allele; 2 - the chromosomes contain different alleles

011100110001000010021200210

+two haplotypes per individual

genotype

Sources of Haplotype Diversity: Mutation

The International HapMap Consortium. A Haplotype Map of the Human Genome. Nature 437, 1299-1320. 2005.

Sources of Haplotype Diversity: Recombination

Haplotype Structure in Human Populations

Fi = founder haplotype at locus i, Hi = observed allele at locus i

P(Fi), P(Fi | Fi-1) and P(Hi | Fi) estimated from reference genotype or haplotype data

For given haplotype h, P(H=h|M) can be computed in O(nK2) using forward algorithm

Similar models proposed in [Schwartz 04, Rastas et al. 05, Kimmel&Shamir 05, Scheet&Stephens 06]

HMM Model of Haplotype Frequencies

F1 F2 Fn…

H1 H2 Hn

Outline

Genotype Phasing

g: 0010212 ?

h1:0010111

h2:0010010

h3:0010011

h4:0010110

Maximum Likelihood Genotype Phasing

Maximum likelihood genotype phasing: given g, find (h1,h2) = argmaxh1+h2=g P(h1|M)P(h2|M)

F1 F2 Fn…

H1 H2 Hn

G1 G2 Gn

F'1 F'2 F'n…

H'1 H'2 H'n

Computational Complexity

• [KMP08] Cannot approximate maxh1+h2=g P(h1|M)P(h2|M) within a factor of O(n1/2 -), unless ZPP=NP

• [Rastas et al.] give Viterbi and randam sampling based heuristics that yield phasing accuracy comparable to best existing methods (PHASE)

Outline

Genotyping Errors

A real problem despite advances in technology & typing algorithms

1.1% of 20 million dbSNP genotypes typed multiple times are inconsistent [Zaitlen et al. 2005]

Systematic errors (e.g., assay failure) typically detected by departure from HWE [Hosking et al. 2004]

In pedigrees, some errors detected as Mendelian Inconsistencies (MIs)

Many errors remain undetected As much as 70% of errors are Mendelian consistent for

mother/father/child trios [Gordon et al. 1999]

0 1 2 1 0 2

0 2 2 1 0 2

Mother Father

Likelihood of best phasing for original trio T

0 1 1 1 0 0 h1

0 0 0 1 0 1 h3

0 1 1 1 0 0 h1

0 1 0 1 0 1 h2

0 0 0 1 0 1 h3

0 1 1 1 0 0 h4

)()()()( MAX)( 4321 hphphphpTL

Likelihood Sensitivity Approach to Error Detection in Trios

0 1 2 1 0 2

0 2 2 1 0 2

Mother Father

Likelihood of best phasing for original trio T

)()()()( MAX)( 4321 hphphphpTL

? 0 1 0 1 0 1 h’ 1 0 0 0 1 0 0 h’ 3

0 1 0 1 0 1 h’1

0 1 1 1 0 0 h’2

0 0 0 1 0 0 h’ 3

0 1 1 1 0 1 h’ 4

Likelihood of best phasing for modified trio T’

)'()'()'()'( MAX)'( 4321 hphphphpTL

0 1 2 1 0 2

0 2 2 1 0 2

Mother Father

Large change in likelihood suggests likely error Flag genotype as an error if L(T’)/L(T) > R, where R is the detection threshold (e.g., R=104)

Alternate Likelihood Functions

• Efficiently Computable Likelihood Functions

- Viterbi probability

- Probability of Viterbi Haplotypes

- Total Trio Probability

• [KMP08] Cannot approximate L(T) within O(n1/4 -), unless ZPP=NP

Comparison with FAMHAP (Children)

0 0.005 0.01 0.015

FP rate

TotalProb-UNO

TotalProb-DUO

TotalProb-TRIO

TotalProb-COMBINED

FAMHAP-1

FAMHAP-3

Comparison with FAMHAP (Parents)

0 0.005 0.01 0.015

FP rate

TotalProb-UNO

TotalProb-DUO

TotalProb-TRIO

TotalProb-COMBINED

FAMHAP-1

FAMHAP-3

Outline

Genome-Wide Association Studies

Powerful method for finding genes associated with complex human diseases

Large number of markers (SNPs) typed in cases and controls

Disease causal SNPs unlikely to be typed directly Significant statistical power gained by performing

imputation of untyped Hapmap genotypes [WTCCC’07]

HMM Based Genotype Imputation Train HMM using the haplotypes from related

Hapmap or small cohor typed at high density

Probability of missing genotypes given the typed genotype data

gi is imputed as )|,(argmax }2,1,0{ MxggPx iix

)|,(),|(

MxggPMgxgP

Experimental Results

Estimates of the allele 0 frequency based on Imputation vs. Illumina 15k

Experimental Results

Accuracy and missing data rate for imputed genotypes at different thresholds

Outline

Illumina / Solexa Genetic Analyzer 1G1000 Mb/run, 35bp reads

Roche / 454 Genome Sequencer FLX100 Mb/run, 400bp reads

Applied BiosystemsSOLiD3000 Mb/run, 25-35bp reads

New massively parallel sequencing technologies deliver orders of magnitude higher throughput compared to Sanger sequencing

Ultra-High Throughput Sequencing

F1 F2 Fn…

H1 H2 Hn

G1 G2 Gn

…R1,1 R2,1

F'1 F'2 F'n…

H'1 H'2 H'n

R1,c … R2,c …Rn,1 Rn,c1 2 n

Probabilistic Model

Initial founder probabilities P(f1), P(f’1), transition probabilities P(fi+1|fi), P(f’i+1|f’i), and emission probabilities P(hi|fi), P(h’i|f’i) trained using the Baum-Welch algorithm from haplotypes inferred from the populations of origin for mother/father

P(gi|hi,h’i) set to 1 if h+h’i=gi and to 0 otherwise

where is the probability that read r has an error at locus I

Conditional probabilities for sets of reads are given by:

Model Training

)(1)(, 1

2)|( ir

gggGrRP

)( )1()0|r(

)( )1()2|r(

1)1|r(

Multilocus Genotyping Problem

GIVEN:

• Shotgun read sets r=(r1, r2, … , rn)

• Base quality scores

• HMMs for populations of origin for mother/father

• Multilocus genotype g*=(g*1,g*2,…,g*n) with maximum posterior probability, i.e., g*=argmaxg P(g | r)

Joint probabilities can be computed using a forward-backward algorithm:

Direct implementation gives O(m+nK4) time, where m = number of reads n = number of SNPs K = number of founder haplotypes in HMMs

Runtime reduced to O(m+nK3) using speed-up idea similar to [Rastas et al. 08, Kennedy et al. 08]

)()|r()r,( '' ''1 ,1 ,, i

fiii ggPgPiii iiiii

Posterior Decoding Algorithm1. For each i = 1..n, compute

2. Return *)*,...,(* 1 nggg

)r,(maxarg)r|(maxarg* igigi gPgPgii

Homozygous Watson SNPs (Affy 500k)

97.490.5

98.5 96.3 99.1

Heterozygous Watson SNPs (Affy 500k)

Genotyping Accuracy on Watson Reads

Outline

Conclusions

HMM model of haplotype diversity provides a powerful framework for addressing central problems in population genetics & genetic epidemiology

Enables significant improvements in accuracy by exploiting the high amount of linkage disequilibrium in human populations

Despite hardness results, heuristics such as posterior or Viterbi decoding perform well in practice

Highly scalable runtime (linear in #SNPs and #individuals/reads)

Software available at http://www.engr.uconn.edu/~ion/SOFT/

Acknowledgements

Sanjiv Dinakar, Jorge Duitama, Yözen Hernández, Justin Kennedy, Bogdan Pasaniuc

NSF funding (awards IIS-0546457 and DBI-0543365)

Hidden Markov Models of Haplotype Diversity and Applications in Genetic Epidemiology Ion Mandoiu...

Documents

L6: Haplotype phasing

Combinatorial Algorithms for Haplotype Inference

Haplotype Based Association Tests

The Relationship Between Haplotype-Based Haplotype Lengthweb.stanford.edu/group/rosenberglab/papers/MehtaEtAl2019...HIGHLIGHTED ARTICLE | INVESTIGATION The Relationship Between Haplotype-Based

Haplotype inference and haplotype-based transmission disequilibrium test (Hap-TDT)

Haplotype Discovery and Modeling

ION I. MANDOIU { Curriculum Vitae August 22, 2019ion/CV/cv.pdf · 2019. 8. 22. · ION I. MANDOIU { Curriculum Vitae August 22, 2019 CONTACT INFORMATION University of Connecticut

Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity Justin Kennedy Dissertation Defense for the Degree

Research Article Association, Haplotype, and Gene-Gene

Genotype Error Detection using Hidden Markov Models of Haplotype Diversity Ion Mandoiu CSE Department, University of Connecticut Joint work with Justin

Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

High-Performance Haplotype Assemblycalvados.di.unipi.it/storage/paper_files/2014_pHaplo_cibb.pdf · High-Performance Haplotype Assembly Marco Aldinucci3, Andrea Bracciali1, Tobias

Haplotype-resolved genome sequencing: experimental methods ...€¦ · 07.05.2015 · Haplotype-resolved genome sequencing: experimental methods and applications Matthew W. Snyder

· PDF fileAnaliza si aprobare Bilant Contabil APRIL 2012 2. Propunere, ... Albalact: Irina Mandoiu, Eustatiu Mandoiu, Raul Ciurtin Almera: Grigore Lingulescu

A.B. Kahng, I. Mandoiu and S. Muddu -- UCSD

LD-Based Genotype and Haplotype Inference from Low-Coverage Short Sequencing Reads Ion Mandoiu Computer Science and Engineering Department University of

METHODS FOR HAPLOTYPE RECONSTRUCTION

Introduction to Haplotype Estimation

Mellitus: Application of the Haplotype Method

Reconstruction of Haplotype Spectra from NGS Data