1
Imputation 2
Presenter: Ka-Kit Lam
2
Outline
• Big Picture and Motivation
• IMPUTE
• IMPUTE2
• Experiments
• Conclusion and Discussion
• Supplementary:
– GWAS
– Estimate on mutation rate
3
Big Picture and Motivation
4
Background
• Genome-wide association study (GWAS):
– Identify common genetic factors that influence health/disease
5
Background
• Important to know the SNPs
• However, not all SNPs are genotyped for all individuals in the case-control study in GWAS.
• How can we guess the missing parts?
Individual 1: ACCCAATTACCAGTATTTA…
Individual 2: CCCCATTTACCACTATTTA…
Individual 3: ACCCATTTACCACTATTTA…
Individual 4: CCCCATTTACCAGTATTTA…
[Figure: "?" markers indicate the missing positions in these sequences]
6
Information known
• Luckily, we now have references for human DNA:
• But, how can we use the reference genomes?
7
Main Question
• Objective:
– Design algorithms to impute the missing genotypes of the individuals being studied
– Criteria for algorithms:
• Scalable
• Accurate
8
Big Picture on Algorithm Design
[Diagram: input (SNPs in study, reference haplotype/genotype) → Algorithms → output (imputed genotype, associated confidence)]
In theory, it makes sense: 1. Scalability 2. Accuracy
In practice, it works: 1. Experimental validation 2. Application
9
IMPUTE
10
Notations and Setting
Reference haplotypes (N haplotypes × L SNPs):
0 1 1 1 0 0 1 1 0 0 0 1 0 1 0
0 1 1 1 0 1 0 1 1 1 0 0 0 1 0
1 1 1 1 0 0 0 0 0 0 0 1 1 1 1
Genotypes in the study sample (K individuals × L SNPs, ? = not typed):
0 ? ? ? ? ? 2 ? ? ? ? ? 0 ? ?
1 ? ? ? ? ? 1 ? ? ? ? ? ? ? ?
2 ? ? ? ? ? 0 ? ? ? ? ? 1 ? ?
1 ? ? ? ? ? 0 ? ? ? ? ? 1 ? ?
(Rmk: genotype 0 = 00, 1 = 01, 2 = 11)
11
Formulation
• Observed genotype and missing genotype
• Classical inference problem:– A reasonable estimate:
– Confidence:
12
Modeling (HMM model):Relationship btw (H,G)
• Assumptions:
– Study individuals are independent
– The copying process, in which each haplotype is a mosaic of the reference haplotypes, is captured by a Hidden Markov Model
– Mutations at different sites are conditionally independent given the copied haplotype
13
Modeling (HMM model):Relationship btw (H,G)
Reference haplotypes (N haplotypes × L SNPs):
0 1 1 1 0 0 1 1 0 0 0 1 0 1 0
0 1 1 1 0 1 0 1 1 1 0 0 0 1 0
1 1 1 1 0 0 0 0 0 0 0 1 1 1 1
Study individual:
0 ? ? ? ? ? 2 ? ? ? ? ? 0 ? ?
0 2 2 2 0 0 2 2 0 0 0 1 0 2 1
14
Modeling (HMM model):Relationship btw (H,G)
0 1 1 1 0 0 1 1 0 0 0 1 0 1 0
0 1 1 1 0 1 0 1 1 1 0 0 0 1 0
1 1 1 1 0 0 0 0 0 0 0 1 1 1 1
L
N
……
0 2 2 2 0 0 2 2 0 0 0 1 0 2 1
15
Modeling (Transition Probability)
• States• Transition
• What is the intuition?
16
Modeling :relationship btw transition Probability and Recombination
• Recombination Process:
17
Modeling :relationship btw transition Probability and Recombination
• Recombination Process:
– The more reference haplotypes, the longer the copy length
– Copy length in our model depends on the genetic distance between SNPs
[Figure: Ref panel 1 and Ref panel 2 above a study individual; the annotation "more likely to have longer copy length here" marks one of the two panels]
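The dependence of copy length on panel size and genetic distance can be sketched in a few lines. This is a minimal sketch, assuming a Li and Stephens-style parameterization in which the switch probability between adjacent SNPs is 1 - exp(-4 Ne r / N); the function names are hypothetical, and the default Ne = 11400 is simply the value passed with -Ne in the IMPUTE demo command.

```python
import math


def switch_probability(genetic_distance, n_ref, ne=11400):
    """Probability that the copied reference haplotype is re-drawn between
    two adjacent SNPs (assumed form: 1 - exp(-4 * Ne * r / N))."""
    rho = 4.0 * ne * genetic_distance  # genetic_distance r in Morgans (assumption)
    return 1.0 - math.exp(-rho / n_ref)


def transition_row(current, genetic_distance, n_ref, ne=11400):
    """Transition probabilities from reference `current` to every reference.

    With probability 1 - p_switch the same reference keeps being copied;
    otherwise a reference is re-drawn uniformly (possibly the same one).
    """
    p_switch = switch_probability(genetic_distance, n_ref, ne)
    row = [p_switch / n_ref] * n_ref
    row[current] += 1.0 - p_switch
    return row


if __name__ == "__main__":
    # A larger panel gives a smaller switch probability, i.e. longer copies.
    for n_ref in (20, 200):
        print(n_ref, switch_probability(1e-5, n_ref))
```

Under this assumed form, a larger N or a smaller genetic distance both lower the switch probability, which matches the two bullets above.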
18
Modeling (Transition Probability)
• States• Transition
19
Modeling (Emission Probability)
• Emission probability
– Define mutation rate: λ
– Since mutation is assumed independent across sites:
Copied pair | Observed 0 (00) | Observed 1 (01) | Observed 2 (11)
00 | (1-λ)^2 | 2λ(1-λ) | λ^2
01 | λ(1-λ) | λ^2 + (1-λ)^2 | λ(1-λ)
11 | λ^2 | 2λ(1-λ) | (1-λ)^2
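The emission table follows from flipping each copied allele independently with probability λ. A minimal sketch of that derivation (the function name is hypothetical):

```python
from itertools import product


def emission_row(copied_pair, lam):
    """P(observed genotype = 0, 1, 2 | copied allele pair), when each of the
    two copied alleles mutates (flips) independently with probability lam."""
    row = [0.0, 0.0, 0.0]
    for flips in product((0, 1), repeat=2):
        prob = 1.0
        genotype = 0
        for allele, flipped in zip(copied_pair, flips):
            prob *= lam if flipped else 1.0 - lam
            genotype += allele ^ flipped  # allele after a possible flip
        row[genotype] += prob
    return row


if __name__ == "__main__":
    for pair in ((0, 0), (0, 1), (1, 1)):
        print(pair, emission_row(pair, 0.1))
```

Each row sums to 1, and the row for the copied pair 01 reproduces λ(1-λ), λ^2 + (1-λ)^2, λ(1-λ).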
20
Extension (completely missing)
• Problem:
– Missing genotype across all references and study samples. How to impute?
• What can we expect?
– Generate information from no information?
– We cannot expect to know the genotype
– But we can guess the relationship between them
– Our friend: population genetics may help!
0 0 1 0 1 1 0 ? 1 1 0 0 0 0 0
0 0 1 0 1 1 0 ? 1 1 0 0 0 0 1
0 0 1 0 1 1 0 ? 1 1 0 0 0 1 1
21
Imputation on Reference
• Illustration
H(1) 1 1 1 0 0 1 ? 0 0 0 1 0 1 0
H(2) 1 1 1 0 1 0 ? 1 1 0 0 0 1 0
H(3) 1 1 1 0 0 0 ? 0 0 0 1 1 1 1
H(4) 1 1 1 1 0 0 ? 0 0 0 0 0 0 0
H(N) 1 1 1 0 1 1 ? 0 0 1 1 1 0 0
[Figure: values shown for the missing site, one per row above: 0, 0, 1, 0, 1]
22
Imputation on Reference
Algorithm:
1. Randomly select an ordering
2. Sample the first mutation according to
3. Treat the previous haplotypes as references and impute
4. Repeat several times to get a stable output
5. Use the imputed reference to impute the study sample
23
Computational Complexity:Imputation
……
O(N^2 L) for each individual
24
Computational Complexity:Imputation
O(N^2 L) for each individual
25
Computational Complexity:Forward-Backward Algorithm
• Forward Equations:
• Naïve application takes O(N^4)
26
Computational Complexity:Forward-Backward Algorithm
• Q: How to compute the following in O(N^2)?
• A: (suggested in fastPhase)
27
Computational Complexity:Forward-Backward Algorithm
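• Finally, we have
[Equation annotations: one partial sum costs O(N) for each j, O(N^2) in total; another costs O(N) for each i, O(N^2) in total; combining them costs O(N^2) in total, so the whole update is O(N^2)]
• Similarly for the backward part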
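The reduction from O(N^4) to O(N^2) per SNP can be sketched as follows. This is a minimal sketch, assuming the hidden state is an ordered pair (i, j) of reference haplotypes and that each chromosome switches independently with probability rho and then lands uniformly on one of the N references; all names are hypothetical. The fast version precomputes row sums, column sums and the grand total of the forward table, and the naive version is included only as a check.

```python
def forward_step_naive(alpha, rho, emit):
    """One forward update over pair states, O(N^4): sum over all (i', j')."""
    n = len(alpha)

    def t(a, b):
        # Per-chromosome transition a -> b: stay, or re-draw uniformly.
        return (1.0 - rho) * (a == b) + rho / n

    new = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            total = 0.0
            for ip in range(n):
                for jp in range(n):
                    total += alpha[ip][jp] * t(ip, i) * t(jp, j)
            new[i][j] = emit[i][j] * total
    return new


def forward_step_fast(alpha, rho, emit):
    """Same update in O(N^2), using precomputed marginal sums."""
    n = len(alpha)
    row_sums = [sum(row) for row in alpha]                 # O(N) for each i
    col_sums = [sum(alpha[i][j] for i in range(n)) for j in range(n)]  # O(N) for each j
    grand_total = sum(row_sums)
    stay, jump = 1.0 - rho, rho / n
    return [
        [
            emit[i][j] * (
                stay * stay * alpha[i][j]
                + stay * jump * (row_sums[i] + col_sums[j])
                + jump * jump * grand_total
            )
            for j in range(n)
        ]
        for i in range(n)
    ]
```

Expanding the product of the two per-chromosome transitions gives exactly the four terms used in the fast version, which is why both functions agree.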
28
Demo
./impute -h example/haplo.txt -l example/legend.txt -g example/geno.txt -m example/map.txt -s example/strand.txt -Ne 11400 -int 62000000 63000000
29
Demo
30
IMPUTE2
31
Motivation
• Accuracy:
– Not all information is used during imputation (e.g. other study individuals)
• Complexity:
– Need to scale well if we incorporate all information (e.g. previously it was O(L N^2))
• New data type:
– Diploid reference (1000 Genomes Project)
• Q: How to design algorithms to handle this?
32
Description of Setting(Scenario A)
Reference haplotypes (Nhap haplotypes × L SNPs):
0 1 1 1 0 0 1 1 0 0 0 1 0 1 0
0 1 1 1 0 1 0 1 1 1 0 0 0 1 0
1 1 1 1 0 0 0 0 0 0 0 1 1 1 1
Genotypes in the inference panel (Ninf individuals × L SNPs):
0 ? ? ? ? ? 2 ? ? ? ? ? 0 ? ?
1 ? ? ? ? ? 1 ? ? ? ? ? ? ? ?
2 ? ? ? ? ? 0 ? ? ? ? ? 1 ? ?
1 ? ? ? ? ? 0 ? ? ? ? ? 1 ? ?
(Rmk: genotype 0 = 00, 1 = 01, 2 = 11)
Legend: T, U (Rmk: sets of SNP indices)
33
Description of Setting (Scenario B)
Reference haplotypes (Nhap haplotypes × L SNPs):
0 1 1 1 0 0 1 1 0 0 0 1 0 1 0
0 1 1 1 0 1 0 1 1 1 0 0 0 1 0
1 1 1 1 0 0 0 0 0 0 0 1 1 1 1
Diploid reference panel (Ndip individuals × L SNPs):
? ? ? ? ? 2 ? ? ? ? ? 0 ? ?
1 ? ? ? ? ? 1 ? ? ? ? ? ? ? ?
[Figure: additional typed genotypes overlaid on the diploid reference rows: 1 1 1 2 2 1 1 0 2 1]
Inference panel (Ninf individuals × L SNPs):
2 ? ? ? ? ? 0 ? ? ? ? ? 1 ? ?
1 ? ? ? ? ? 0 ? ? ? ? ? 1 ? ?
(Rmk: genotype 0 = 00, 1 = 01, 2 = 11)
Legend: T, U1, U2 (Rmk: sets of SNP indices)
34
Algorithm for Scenario A
• Illustration:
Reference haplotypes:
0 1 1 1 0 0 1 1 0 0 0 1 0 1 0
0 1 1 1 0 1 0 1 1 1 0 0 0 1 0
1 1 1 1 0 0 0 0 0 0 0 1 1 1 1
Genotypes in the inference panel:
0 ? ? ? ? ? 2 ? ? ? ? ? 0 ? ?
1 ? ? ? ? ? 1 ? ? ? ? ? ? ? ?
2 ? ? ? ? ? 0 ? ? ? ? ? 1 ? ?
1 ? ? ? ? ? 0 ? ? ? ? ? 1 ? ?
35
Algorithm for Scenario A
• Illustration (burn-in)
[Figure: each study genotype at the typed SNPs is written as an initial haplotype pair (0 as 00, 1 as 10, 2 as 11); untyped SNPs stay "?"]
Reference haplotypes:
0 1 1 1 0 0 1 1 0 0 0 1 0 1 0
0 1 1 1 0 1 0 1 1 1 0 0 0 1 0
1 1 1 1 0 0 0 0 0 0 0 1 1 1 1
36
Algorithm for Scenario A
• Illustration (phasing)
[Figure: the haplotype pairs of the other study individuals are kept; the pair of individual i is reset to "??" at the typed SNPs and re-sampled]
Reference haplotypes:
0 1 1 1 0 0 1 1 0 0 0 1 0 1 0
0 1 1 1 0 1 0 1 1 1 0 0 0 1 0
1 1 1 1 0 0 0 0 0 0 0 1 1 1 1
Update i, genotype (1) (0) (1) at the typed SNPs:
1 ? ? ? ? ? 0 ? ? ? ? ? 1 ? ?
0 ? ? ? ? ? 0 ? ? ? ? ? 0 ? ?
37
Algorithm for Scenario A
• Illustration (imputing)
[Figure: the haplotype pairs of the other study individuals are kept; for individual i the phased pair at the typed SNPs (10, 00, 10) is fixed and the untyped SNPs are filled in]
Reference haplotypes:
0 1 1 1 0 0 1 1 0 0 0 1 0 1 0
0 1 1 1 0 1 0 1 1 1 0 0 0 1 0
1 1 1 1 0 0 0 0 0 0 0 1 1 1 1
Update i, genotype (1) (0) (1) at the typed SNPs:
1 1 0 1 0 0 0 0 0 0 0 1 1 1 1
0 1 1 1 0 1 0 1 1 1 0 0 0 1 0
38
Phasing Step: Path Sampling
• How to sample a path?
39
Imputation Step: Extract Posterior Probability
• After many rounds, we can get:
– For each individual and for each missing site
– Assuming independence in sampling the haplotype pair
Hap 1: P(0) P(1) | Hap 2: P(0) P(1) | Genotype: P(0) P(1) P(2)
0.3 0.7 | 0.1 0.9 | 0.03 0.34 0.63
0.2 0.8 | 0.4 0.6 | 0.08 0.44 0.48
… … | … … | … … …
Take the average, then combine the two haplotypes into the genotype
Algorithm for Scenario A:Complexity Analysis
• A) Burn-in phase
• B) MCMC iterations, m times:
– For each individual i:
• i) phase(i, T, hap+inf): O((Nhap + Ninf)^2 L_T)
• ii) impute(i, T+U, hap): O(Nhap L_{T+U})
• iii) record(posterior probability): O(L_{T+U})
• C) Average over different runs of MCMC to get the genotype and confidence
41
Benefits of the Algorithm
• Faster:
– Reducing the load in the imputation step
• More accurate:
– Uses the available information to guess
42
Algorithm for Scenario B
• Illustration:
Reference haplotypes (Nhap):
0 1 1 1 0 0 1 1 0 0 0 1 0 1 0
0 1 1 1 0 1 0 1 1 1 0 0 0 1 0
1 1 1 1 0 0 0 0 0 0 0 1 1 1 1
Diploid reference panel genotypes (Ndip):
0 ? ? 2 ? ? 2 2 ? ? ? ? 0 ? ?
1 ? ? 2 ? ? 1 1 ? ? ? ? 0 ? ?
Inference panel genotypes (Ninf):
2 ? ? ? ? ? 0 ? ? ? ? ? 1 ? ?
1 ? ? ? ? ? 0 ? ? ? ? ? 1 ? ?
Legend: T, U1, U2
43
Algorithm for Scenario B
• Illustration (burn-in):
Reference haplotypes (Nhap):
0 1 1 1 0 0 1 1 0 0 0 1 0 1 0
0 1 1 1 0 1 0 1 1 1 0 0 0 1 0
1 1 1 1 0 0 0 0 0 0 0 1 1 1 1
[Figure: the typed genotypes of the diploid reference panel (Ndip) and of the inference panel (Ninf) are written as initial haplotype pairs; untyped SNPs stay "?"]
Legend: T, U1, U2
44
Algorithm for Scenario B
• Illustration (phase T and U2 in the diploid reference panel):
Reference haplotypes (Nhap):
0 1 1 1 0 0 1 1 0 0 0 1 0 1 0
0 1 1 1 0 1 0 1 1 1 0 0 0 1 0
1 1 1 1 0 0 0 0 0 0 0 1 1 1 1
[Figure: Update i. The haplotype pair of diploid reference individual i is reset to "??" at its typed SNPs and re-sampled; the other diploid reference individual (Ndip) and the inference panel (Ninf) keep their current pairs]
Legend: T, U1, U2
45
Algorithm for Scenario B
• Illustration (impute U1 in the diploid reference panel):
Reference haplotypes (Nhap):
0 1 1 1 0 0 1 1 0 0 0 1 0 1 0
0 1 1 1 0 1 0 1 1 1 0 0 0 1 0
1 1 1 1 0 0 0 0 0 0 0 1 1 1 1
[Figure: Update i. The remaining "?" entries of diploid reference individual i are filled in, giving a complete haplotype pair; the other diploid reference individual (Ndip) and the inference panel (Ninf) are unchanged]
Legend: T, U1, U2
46
Algorithm for Scenario B
• Illustration (phase T in the inference panel):
Reference haplotypes (Nhap):
0 1 1 1 0 0 1 1 0 0 0 1 0 1 0
0 1 1 1 0 1 0 1 1 1 0 0 0 1 0
1 1 1 1 0 0 0 0 0 0 0 1 1 1 1
[Figure: Update i. The haplotype pair of inference individual i is reset to "??" at T and re-sampled; the diploid reference panel (Ndip) and the other inference individual (Ninf) keep their current pairs]
Legend: T, U1, U2
47
Algorithm for Scenario B
• Illustration (impute U2 in the inference panel):
Reference haplotypes (Nhap):
0 1 1 1 0 0 1 1 0 0 0 1 0 1 0
0 1 1 1 0 1 0 1 1 1 0 0 0 1 0
1 1 1 1 0 0 0 0 0 0 0 1 1 1 1
[Figure: Update i. The U2 entries of inference individual i are filled in; its U1 entries stay "??". The diploid reference panel (Ndip) and the other inference individual (Ninf) are unchanged]
Legend: T, U1, U2
48
Algorithm for Scenario B
• Illustration (impute U1 in the inference panel):
Reference haplotypes (Nhap):
0 1 1 1 0 0 1 1 0 0 0 1 0 1 0
0 1 1 1 0 1 0 1 1 1 0 0 0 1 0
1 1 1 1 0 0 0 0 0 0 0 1 1 1 1
[Figure: Update i. The remaining U1 entries of inference individual i are filled in, giving a complete haplotype pair; the diploid reference panel (Ndip) and the other inference individual (Ninf) are unchanged]
Legend: T, U1, U2
49
Algorithm for Scenario B:Complexity Analysis
• A) Burn-in phase
• B) MCMC iterations, m times:
– For each individual i in dip:
• i) phase(i, T+U2, hap+dip): O((Nhap + Ndip)^2 L_{T+U2})
• ii) impute(i, T+U1, hap): O(Nhap L_{T+U1})
• iii) record(posterior probability): O(L_{T+U1})
– For each individual i in inference:
• i) phase(i, T, hap+dip+inf): O((Nhap + Ndip + Ninf)^2 L_T)
• ii) impute(i, T+U2, hap+dip): O(N_{hap+dip} L_{T+U2})
• iii) impute(i, U1, hap): O(Nhap L_{U1})
• iv) record(posterior probability): O(L_{T+U1+U2})
• C) Average over different runs of MCMC to get the genotype and confidence
50
Benefits of the Algorithm
• Able to handle new data type
• Faster and more accurate
51
Further Speeding Up
• Choose the k closest neighbours in phasing
• Need to compute Hamming distance
• O(k^2 L) for HMM but O(NL) for Hamming distance computation (better than O(N^2 L) in the previous HMM calculation)
• Choose the khap closest neighbours in imputation
• khap >> k is also good (because the cost is O(k^2) in phasing but O(k) in imputation)
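The neighbour selection step can be sketched as below. This is a minimal sketch, assuming haplotypes are equal-length 0/1 lists compared over the same SNPs; the function names are hypothetical, and the panel in the usage example is the three reference haplotypes shown on the earlier slides.

```python
def hamming(a, b):
    """Number of positions at which two equal-length haplotypes differ."""
    return sum(x != y for x, y in zip(a, b))


def k_closest(target, panel, k):
    """Indices of the k panel haplotypes closest to `target`.

    Computing the distances costs O(N L) for N panel haplotypes of length L;
    ties are broken by panel order because Python's sort is stable.
    """
    order = sorted(range(len(panel)), key=lambda idx: hamming(target, panel[idx]))
    return order[:k]


if __name__ == "__main__":
    panel = [
        [0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0],
        [0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0],
        [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
    ]
    print(k_closest(panel[0], panel, 2))
```

Only the selected k (or khap) haplotypes are then passed to the HMM, which is where the O(k^2 L) and O(k L) costs come from.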
52
Comparison with Beagle
• Weakness of BEAGLE:
– Full joint modeling of all individuals
– Accuracy decreases when the population or the number of SNPs increases in the experiments
– Less accurate in rare SNPs than IMPUTE2
– More memory efficient
• Strength of BEAGLE:
– Faster
– Better accommodates trios and duos
53
Demo
./impute2 \
 -m ./Example/example.chr22.map \
 -h ./Example/example.chr22.1kG.haps \
 -l ./Example/example.chr22.1kG.legend \
 -g ./Example/example.chr22.study.gens \
 -strand_g ./Example/example.chr22.study.strand \
 -int 20.4e6 20.5e6 \
 -Ne 20000 \
 -o ./Example/example.chr22.one.phased.impute2
54
Experiments
55
Experiment plans
• Evaluation of the performance of imputation:
– Accuracy
– Time and space complexity
– Comparison with other methods
• Application of imputation:
– Identification of associated SNPs in GWAS
• Optimizing performance:
– Effect of multiple reference panels
56
Accuracy and Calibration
• Setting:
– Mask the known genotype
– Impute using IMPUTE
– Compare called base with ground truth
– Calling threshold:
• by genotype
• by SNPs
– Measure % missing and % mismatch for different thresholds
– Compare the estimated confidence with the experimental confidence
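The masking experiment can be sketched as follows for calling by genotype. This is a minimal sketch with hypothetical names and made-up toy numbers; it assumes a genotype is called only when its largest posterior reaches the threshold, and that % mismatch is measured among the called genotypes only.

```python
def call_and_score(posteriors, truth, threshold):
    """Return (% missing, % mismatch) for one calling threshold.

    posteriors: list of (P(0), P(1), P(2)) for each masked genotype
    truth:      list of the true genotypes (0, 1 or 2)
    """
    missing = 0
    called = 0
    mismatch = 0
    for probs, true_genotype in zip(posteriors, truth):
        best = max(range(3), key=lambda g: probs[g])
        if probs[best] < threshold:
            missing += 1          # not confident enough to call
            continue
        called += 1
        if best != true_genotype:
            mismatch += 1
    pct_missing = 100.0 * missing / len(truth)
    pct_mismatch = 100.0 * mismatch / called if called else 0.0
    return pct_missing, pct_mismatch


if __name__ == "__main__":
    toy_posteriors = [(0.03, 0.34, 0.63), (0.08, 0.44, 0.48),
                      (0.90, 0.10, 0.00), (0.05, 0.90, 0.05)]
    toy_truth = [2, 1, 0, 2]
    for threshold in (0.4, 0.6, 0.95):
        print(threshold, call_and_score(toy_posteriors, toy_truth, threshold))
```

Sweeping the threshold traces out the trade-off between % missing and % mismatch.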
57
Accuracy and Calibration
Message: IMPUTE is reasonably accurate and is well calibrated
[Figure: plot with axes % missing and % mismatch]
58
Comparison: Accuracy (in general and rare allele)
Message: IMPUTE2 is accurate, especially for rare alleles
[Figure: the closer to the lower left, the better]
59
Comparison: Algorithm Complexity(Time and Space Complexity)
Message: IMPUTE2 is not too bad in terms of time and space complexity
• Phasing step: shorter L
• Imputation step: linear in N
• Multiple MCMC iterations increase time
60
Application 1: Identification of associated SNPs
• Setting:
– Use case and control sets to identify the gene associated with Type II Diabetes
– Use filtered genotypes with minor allele frequency (MAF) > 1%
– Evaluate the P-value and plot it against chromosome position to identify the causal gene
• Useful in:
1. Identifying SNPs to follow up
2. Assessing strength of signal
61
Application 1: Identification of associated SNPs
Message: IMPUTE helps identify SNPs associated with the phenotype
[Figure legend: red = imputed SNPs, black = typed SNPs]
62
Application 2: Validation of missing data
• Setting:
– Some collected genotypes are not very reliable
– Use imputation to impute the genotype by assuming it is missing
– Call and compare to the original genotype
63
Application 2: Validation of missing data
Message: IMPUTE helps confirm confidence in the data
[Figure: genotype groups labelled AA, BB and AB?]
64
Effect of Reference Set
65
Effect of reference set
• Motivation:
– Capture low-frequency variants by incorporating data across populations
– Remain computationally efficient
• Setting:
– Pearson correlation for accuracy
– Varying khap
– Adding more references
66
Effect of Reference Set
Message: more reference sets improve accuracy, and IMPUTE2 facilitates this
• Improvement saturates when khap reaches a certain threshold
• Improvement saturates when we have enough references
67
Summary
• IMPUTE, IMPUTE2 and their extensions
• They attempt to design algorithms for imputation based on:
– Population genetics model
– HMM computation
• Extensive experiments suggest that IMPUTE2 is reasonably accurate and can make good use of the reference data sets available for GWAS.
68
Discussion
• Parameters in HMM:
– Can they learn the parameters of the copying process from the study data through the EM algorithm?
• Completely missing SNPs:
– Can they use a clustering algorithm in imputing completely missing data?
• Trios:
– Can they use different panels to do the imputation?
• Speed:
– Can they preprocess the reference to speed up the computation?
– Can BEAGLE's idea of merging be applied at some part of the pre-HMM computation?
69
Supplementary : GWAS
70
Genetic Architecture
• Why are we interested in imputation?– For GWAS.
• Domain of interest:
71
Case-Control Study and Bayes Factor
Genotype | 0 | 1 | 2
Cases | s0 | s1 | s2
Controls | r0 | r1 | r2
The prior distribution of theta is known
72
Supplementary : Reverse Engineering the per site mutation
probability
73
Review of Population Genetics
• Wright-Fisher model for coalescence:
– 2M individuals
– Generate the next generation by choosing at random, with replacement, from the last generation and copying
• Infinite-sites model for mutation:
– At every inheritance, there is a probability u of mutation, and each mutation occurs at a distinct site that has never mutated before.
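The Wright-Fisher resampling step can be sketched as below. This is a minimal sketch with hypothetical names, tracking only which founder each individual descends from; mutation is not modelled here.

```python
import random


def next_generation(population, rng):
    """Each child copies a parent chosen uniformly at random, with
    replacement, from the previous generation."""
    return [rng.choice(population) for _ in population]


def simulate(size, generations, seed=0):
    """Start with `size` distinct founder labels (size plays the role of 2M)
    and return the labels carried after the given number of generations."""
    rng = random.Random(seed)
    population = list(range(size))
    for _ in range(generations):
        population = next_generation(population, rng)
    return population


if __name__ == "__main__":
    final = simulate(size=20, generations=200)
    # Few founder labels survive: lineages coalesce going back in time.
    print(sorted(set(final)))
```

The shrinking number of surviving founder labels is the forward-in-time view of coalescence.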
74
Relationship btw Coalescent Theory and Imputation
• Our question:
– Having a sample of N individuals as references
– What is the mutation rate (per site) λ between the study sample and its nearest neighbor in the N references?
[Figure: the whole population (2M) containing the N references, the study sample, and the nearest neighbor in the references]
75
Estimation of Mutation Rate λ
• Pr(no coalescence between the study and all references in last t generations)
• Average time to coalescence
• Thus, mutation rate is λ
[Figure: genealogy of the N references and the study sample, with points A and B and time t marked]
76
Estimation of Mutation Rate λ
• Estimate u
• Estimate λ
[Figure: coalescent tree of the N references with time t and the times t2, t3, t4 marked, and λ indicated]
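One closed-form estimate of this kind can be sketched as below. This is a minimal sketch, assuming the estimator θ = 1 / (1 + 1/2 + ... + 1/(N-1)) and λ = θ / (2(θ + N)) described for the Li and Stephens (2003) model and used in Marchini et al. (2007); whether this is exactly what these slides derive is an assumption, and the function name is hypothetical.

```python
def per_site_mutation_rate(n_ref):
    """Per-site mutation probability lambda for a panel of n_ref reference
    haplotypes, under the assumed estimator:
        theta  = 1 / sum_{m=1}^{n_ref-1} 1/m
        lambda = theta / (2 * (theta + n_ref))
    """
    if n_ref < 2:
        raise ValueError("need at least two reference haplotypes")
    theta = 1.0 / sum(1.0 / m for m in range(1, n_ref))
    return theta / (2.0 * (theta + n_ref))


if __name__ == "__main__":
    for n_ref in (2, 60, 120):
        print(n_ref, per_site_mutation_rate(n_ref))
```

Under this form λ decreases as N grows: with more references, the nearest neighbor is closer to the study haplotype.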
77
References
• Marchini et al. A new multipoint method for genome-wide association studies by imputation of genotypes. Nat Genet (2007) vol. 39 (7) pp. 906-13
• Howie et al. A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet (2009) vol. 5 (6) pp. e1000529
• Howie et al. Genotype imputation with thousands of genomes. G3 (Bethesda) (2011) vol. 1 (6) pp. 457-70
• Marchini and Howie. Genotype imputation for genome-wide association studies. Nat Rev Genet (2010) vol. 11 (7) pp. 499-511
• R. Durrett. Probability Models for DNA Sequence Evolution. Springer, 2nd ed., 2008
• N. Li and M. Stephens. Modelling linkage disequilibrium, and identifying recombination hotspots using SNP data. Genetics, 165:2213–2233, 2003.
78
Thank you