L6: Haplotype phasing


Page 1: L6: Haplotype phasing

L6: Haplotype phasing

Page 2: L6: Haplotype phasing

Genotypes and Haplotypes

• Each individual has two “copies” of each chromosome.

• At each site, each chromosome has one of two alleles.

• Current genotyping technology doesn’t give phase.

0 1 1 1 0 0 1 1 0   Haplotype 1

1 1 0 1 0 0 1 0 0   Haplotype 2

2 1 2 1 0 0 1 2 0   Genotype for the individual (2 = heterozygous site)
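The 0/1/2 encoding in this example can be sketched in a few lines of Python. The function name `genotype_from_haplotypes` is an illustrative assumption, not something from the lecture.

```python
def genotype_from_haplotypes(h1, h2):
    """Combine two 0/1 haplotypes into a 0/1/2 genotype.

    A site where both haplotypes agree keeps that allele (0 or 1);
    a site where they differ is heterozygous and is written as 2.
    """
    return [a if a == b else 2 for a, b in zip(h1, h2)]


# The two haplotypes from the slide.
h1 = [0, 1, 1, 1, 0, 0, 1, 1, 0]
h2 = [1, 1, 0, 1, 0, 0, 1, 0, 0]
print(genotype_from_haplotypes(h1, h2))
```

The phase is lost in this direction: swapping the alleles at any heterozygous site between `h1` and `h2` yields the same genotype.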

Page 3: L6: Haplotype phasing

Haplotype Phasing

Haplotype phasing is the resolution of a genotype into the two haplotypes.

Haplotypes increase the power of an association between marker loci and phenotypic traits.

Current approaches to haplotyping:
  • Via technological innovations (expensive)
  • Statistical methods (ML, Phase, PL)

Combinatorial approach to the phasing problem:
  • Efficient, provable quality of solution
  • Not completely generalizable (as yet)

Page 4: L6: Haplotype phasing

Clark’s idea

Using the HWE principle, infer phase using homozygous sites.

Not described as an algorithm, but as a methodology to infer phase.

0 1 1 1 0 0 1 1 0

1 1 0 2 0 0 2 0 0

2 1 2 0 0 0 0 0 0
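A minimal sketch of the inference step usually associated with Clark's method: if an already-known haplotype is compatible with an ambiguous genotype, the second haplotype is determined as its complement at the heterozygous sites. The function names are hypothetical, and the example reuses the genotype from the "Genotypes and Haplotypes" slide rather than the figure on this slide.

```python
def is_compatible(genotype, haplotype):
    """A haplotype can explain a genotype if it matches every homozygous site."""
    return all(g == 2 or g == h for g, h in zip(genotype, haplotype))


def infer_mate(genotype, haplotype):
    """Return the haplotype that pairs with `haplotype` to give `genotype`.

    Returns None when the known haplotype is incompatible with the genotype.
    """
    if not is_compatible(genotype, haplotype):
        return None
    # At heterozygous sites (2) the mate carries the other allele.
    return [1 - h if g == 2 else g for g, h in zip(genotype, haplotype)]


genotype = [2, 1, 2, 1, 0, 0, 1, 2, 0]
known = [0, 1, 1, 1, 0, 0, 1, 1, 0]
print(infer_mate(genotype, known))
```

The newly inferred haplotype can then be added to the set of known haplotypes and used to resolve further genotypes.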

Page 5: L6: Haplotype phasing

Maximum likelihood estimation of phase

Input: Genotypes 1…m with counts n_1, n_2, …

Output: Haplotype frequencies (also individual haplotype assignments)

Define (unknown) genotype probabilities P_1, P_2, P_3, …

Likelihood function (based on genotype probabilities)
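The formula on this slide did not survive extraction. A standard form, assuming the genotype counts follow a multinomial distribution with total sample size n (the multinomial assumption and the symbol n are not recovered from the slide), would be:

```latex
L(P_1, \dots, P_m) = \frac{n!}{n_1!\, n_2! \cdots n_m!} \prod_{j=1}^{m} P_j^{\,n_j},
\qquad n = \sum_{j=1}^{m} n_j
```

The leading coefficient does not depend on the P_j, so maximizing L amounts to maximizing the product term.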

Page 6: L6: Haplotype phasing

Genotypes and Haplotypes

Let c_j be the number of haplotype pairings that will give us genotype j. Then

$P_j = \sum_{i=1}^{c_j} P(h_k h_l)$

where the sum runs over the c_j pairings (h_k, h_l) that give genotype j.

Use HWE to compute Pr(h_k, h_l): p_k^2 if k = l, and 2 p_k p_l otherwise, where p_k is the frequency of haplotype h_k.

Page 7: L6: Haplotype phasing

Likelihood using haplotype frequencies

Page 8: L6: Haplotype phasing

The Expectation Step

Q: Given haplotype frequencies, what are the paired haplotype frequencies?

A: Initially

Subsequently (g-th iteration)

Page 9: L6: Haplotype phasing

The M Step

• δ_it is 0, 1, or 2 (# of times haplotype t occurs in paired haplotype i)
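The E-step and M-step formulas are missing from the extracted slides, so the following is only a sketch of a standard EM scheme for haplotype frequencies under HWE, not necessarily the exact formulation used in the lecture. All function names, the uniform starting point, and the fixed iteration count are assumptions.

```python
from collections import defaultdict
from itertools import product


def compatible_pairs(genotype):
    """All unordered haplotype pairs that explain a 0/1/2 genotype."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    base = [0 if g == 2 else g for g in genotype]
    pairs = set()
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(base), list(base)
        for site, b in zip(het, bits):
            h1[site], h2[site] = b, 1 - b
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)


def em_haplotype_frequencies(genotypes, iterations=50):
    """Estimate haplotype frequencies from unphased genotypes."""
    expansions = [compatible_pairs(g) for g in genotypes]
    haplotypes = sorted({h for pairs in expansions for pair in pairs for h in pair})
    # Assumed starting point: all candidate haplotypes equally frequent.
    freq = {h: 1.0 / len(haplotypes) for h in haplotypes}
    n_chromosomes = 2 * len(genotypes)
    for _ in range(iterations):
        expected = defaultdict(float)
        for pairs in expansions:
            # E step: HWE probability of each pair, normalized per genotype.
            weights = [freq[a] * freq[b] * (1 if a == b else 2) for a, b in pairs]
            total = sum(weights)
            for (a, b), w in zip(pairs, weights):
                expected[a] += w / total
                expected[b] += w / total
        # M step: new frequency = expected haplotype count / number of chromosomes.
        freq = {h: expected[h] / n_chromosomes for h in haplotypes}
    return freq


# Two unambiguous individuals (00 and 11) and one double heterozygote.
print(em_haplotype_frequencies([[0, 0], [1, 1], [2, 2]]))
```

In this toy input the double heterozygote is resolved towards the pair 00/11, because those haplotypes are already supported by the two homozygous individuals. Enumerating all compatible pairs is exponential in the number of heterozygous sites, so the sketch is only practical for short blocks.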

Page 10: L6: Haplotype phasing

Bayesian approach to phasing

Idea: Small variants of common haplotypes should also be considered common even though they have low frequency.

Page 11: L6: Haplotype phasing

Phase

Page 12: L6: Haplotype phasing

Phase

As described, each haplotype arises from the prior set only through mutations. Recombination is not considered.

In subsequent versions, recombination is explicitly considered in the equation.

Page 13: L6: Haplotype phasing

Phase results

Phase versus EM versus Clark

Error rate: Proportion of individuals incorrectly predicted

Page 14: L6: Haplotype phasing

Combinatorial Approach to Haplotyping

Page 15: L6: Haplotype phasing

The Perfect Phylogeny Model

We assume that the evolution of extant haplotypes can be displayed on a rooted, directed tree, with the all-0 haplotype at the root, where each site changes from 0 to 1 on exactly one edge, and each extant haplotype is created by accumulating the changes on a path from the root to a leaf, where that haplotype is displayed.

In other words, the extant haplotypes evolved along a perfect phylogeny with all-0 root.

[Figure: perfect phylogeny rooted at 00000, edges labeled with sites 1 to 5; extant haplotypes at the leaves: 10100, 10000, 01011, 00010, 01010]

Page 16: L6: Haplotype phasing

Haplotyping via Perfect Phylogeny

PPH: Given a set of genotypes, find an explaining set of haplotypes that fits a perfect phylogeny

Genotype matrix:

     1  2
a    2  2
b    0  2
c    1  0

Haplotype matrix:

     1  2
a    1  0
a    0  1
b    0  0
b    0  1
c    1  0
c    1  0

[Figure: tree with root 00, edges labeled 1 and 2, and leaves c, c, a (10), a, b (01), b (00)]

Page 17: L6: Haplotype phasing

The Alternative Explanation

Genotype matrix:

     1  2
a    2  2
b    0  2
c    1  0

Haplotype matrix:

     1  2
a    1  1
a    0  0
b    0  0
b    0  1
c    1  0
c    1  0

No tree possible for this explanation

Page 18: L6: Haplotype phasing

The 4 Gamete Test for Perfect Phylogeny

Arrange the haplotypes in a matrix, two haplotypes for each individual.

Then (with no duplicate columns), the haplotypes fit a unique perfect phylogeny if and only if no two columns contain all four pairs (Buneman):

0,0 and 0,1 and 1,0 and 1,1
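The test can be sketched directly from its statement: look at every pair of columns and reject if all four pairs occur. The function name and the two small matrices (taken from the three-individual example a, b, c in this deck) are set up here for illustration.

```python
from itertools import combinations


def passes_four_gamete_test(haplotypes):
    """Return True if no pair of columns contains 00, 01, 10 and 11."""
    n_sites = len(haplotypes[0])
    for i, j in combinations(range(n_sites), 2):
        gametes = {(row[i], row[j]) for row in haplotypes}
        if len(gametes) == 4:
            return False
    return True


# Rows are the haplotypes a, a, b, b, c, c over sites 1 and 2.
tree_explanation = [(1, 0), (0, 1), (0, 0), (0, 1), (1, 0), (1, 0)]
alternative_explanation = [(1, 1), (0, 0), (0, 0), (0, 1), (1, 0), (1, 0)]

print(passes_four_gamete_test(tree_explanation))
print(passes_four_gamete_test(alternative_explanation))
```

This naive version costs O(n s^2) for n haplotypes and s sites, since each of the s(s-1)/2 column pairs scans all rows.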

Page 19: L6: Haplotype phasing

The Alternative Explanation

Genotype matrix:

     1  2
a    2  2
b    0  2
c    1  0

Haplotype matrix:

     1  2
a    1  1
a    0  0
b    0  0
b    0  1
c    1  0
c    1  0

No tree possible for this explanation

Page 20: L6: Haplotype phasing

The Tree Explanation Again

Genotype matrix:

     1  2
a    2  2
b    0  2
c    1  0

Haplotype matrix:

     1  2
a    1  0
a    0  1
b    0  0
b    0  1
c    1  0
c    1  0

[Figure: tree with root 0 0, edges labeled 1 and 2, and leaves c, c, a, a, b, b]

Page 21: L6: Haplotype phasing

The Combinatorial Problem

Input: A ternary matrix (0,1,2) M with N rows

Output: A binary matrix M’ created from M by replacing each 2 in M with a 0 and 1, such that M’ passes the 4 gamete test

Gusfield (RECOMB 2002) proposed a solution which used a reduction to matroids.

We present a (slightly inefficient) solution using elementary techniques.

Independently by (Eskin, Halperin, Karp ’02)
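To make the input/output specification concrete, here is a brute-force sketch that enumerates every way of expanding the 2s and keeps the expansions that pass the 4 gamete test. It is exponential and is not the algorithm presented in the lecture; the function names are assumptions, and the input is the three-individual example (a, b, c) from this deck.

```python
from itertools import combinations, product


def passes_four_gamete_test(rows):
    """True if no pair of columns contains all of 00, 01, 10, 11."""
    for i, j in combinations(range(len(rows[0])), 2):
        if len({(r[i], r[j]) for r in rows}) == 4:
            return False
    return True


def expansions(genotype_row):
    """Yield each distinct haplotype pair explaining one 0/1/2 row."""
    het = [i for i, g in enumerate(genotype_row) if g == 2]
    # Fix the first heterozygous site to (0, 1) so each unordered pair appears once.
    for bits in product((0, 1), repeat=max(len(het) - 1, 0)):
        h1, h2 = list(genotype_row), list(genotype_row)
        for site, b in zip(het, (0,) + bits):
            h1[site], h2[site] = b, 1 - b
        yield tuple(h1), tuple(h2)


def pph_solutions(m):
    """All binary matrices M' (two rows per row of M) passing the test."""
    solutions = []
    for choice in product(*(list(expansions(row)) for row in m)):
        m_prime = [h for pair in choice for h in pair]
        if passes_four_gamete_test(m_prime):
            solutions.append(m_prime)
    return solutions


small = [(2, 2), (0, 2), (1, 0)]  # rows a, b, c
for solution in pph_solutions(small):
    print(solution)
```

For this input only the out-of-phase expansion of row a survives, which matches the tree explanation; the in-phase expansion is the one for which no tree is possible.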

Page 22: L6: Haplotype phasing

Initial Observations

Forced Expansions:

EX 1: If two columns (sites) of M contain the following rows

2 0
0 2

then M’ will contain a row with 1 0 and a row with 0 1 in those columns.

EX 2: Similarly, if two columns of M contain the rows

2 1
2 0

then M’ will contain rows with 1 1 and 0 0 in those columns.

Page 23: L6: Haplotype phasing

Initial Observations

If a forced expansion of two columns creates rows 0 1 and 1 0 in those columns, then any 2 2 in those columns must be set to be

0 1
1 0

We say that two columns are forced out-of-phase.

If a forced expansion of two columns creates 1 1 and 0 0 in those columns, then any 2 2 in those columns must be set to be

1 1
0 0

We say that two columns are forced in-phase.

Page 24: L6: Haplotype phasing

Immediate Failure

It can happen that the forced expansion of cells creates a 4x2 submatrix that fails the 4-Gamete Test. In that case, there is no PPH solution for M.

Example:

2 0
1 2
0 2

Will fail the 4-Gamete Test
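The forced-expansion rules and the immediate-failure case can be combined into one small check on a pair of columns. This is a sketch under the reading that a row other than 2 2 contributes every 0/1 pair it can expand to; the function names and return labels are illustrative assumptions.

```python
def forced_gametes(a, b):
    """Pairs that one row must contribute in two columns; a 2 2 row forces nothing."""
    if a == 2 and b == 2:
        return set()
    left = (0, 1) if a == 2 else (a,)
    right = (0, 1) if b == 2 else (b,)
    return {(x, y) for x in left for y in right}


def classify_column_pair(rows):
    """Classify two columns of M, given as a list of (value, value) rows."""
    forced = set()
    for a, b in rows:
        forced |= forced_gametes(a, b)
    out_forced = {(0, 1), (1, 0)} <= forced  # 0 1 and 1 0 already present
    in_forced = {(0, 0), (1, 1)} <= forced   # 0 0 and 1 1 already present
    if out_forced and in_forced:
        return "fail"  # all four pairs are forced: no PPH solution
    if out_forced:
        return "out-of-phase"
    if in_forced:
        return "in-phase"
    return "unforced"


print(classify_column_pair([(2, 0), (0, 2)]))          # rows of EX 1
print(classify_column_pair([(2, 0), (1, 2), (0, 2)]))  # immediate-failure example
```

Running this check over all column pairs is the pairwise step that dominates the running time, since each of the O(s^2) pairs scans all n rows.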

Page 25: L6: Haplotype phasing

An O(ns^2)-time Algorithm

Find all the forced phase relationships by considering columns in pairs.

Find all the inferred, invariant, phase relationships.

Find a set of column pairs whose phase relationship can be arbitrarily set, so that all the remaining phase relationships can be inferred.

Result: An implicit representation of all solutions to the PPH problem.

Page 26: L6: Haplotype phasing

A Running Example

     1  2  3  4  5  6  7
A    1  2  2  2  0  0  0
B    2  0  2  0  0  0  2
C    1  2  2  2  0  2  0
D    1  2  2  0  2  0  0
E    2  2  0  0  0  2  0
F    0  0  0  0  0  0  0

Page 27: L6: Haplotype phasing

Companion Graph G_c

• Each node represents a column in M, and each edge indicates that the pair of columns has a row with 2’s in both columns.

• The algorithm builds this graph, and then checks whether any pair of nodes is forced in or out of phase.

[Figure: companion graph G_c on nodes 1 to 7, shown next to the running-example matrix M (rows A to F, columns 1 to 7)]
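Building the companion graph from its definition takes only a few lines. The sketch below uses the running-example matrix as read from the slide; the names `M` and `companion_graph` are assumptions.

```python
from itertools import combinations

# Running example: rows A to F, columns 1 to 7.
M = [
    (1, 2, 2, 2, 0, 0, 0),  # A
    (2, 0, 2, 0, 0, 0, 2),  # B
    (1, 2, 2, 2, 0, 2, 0),  # C
    (1, 2, 2, 0, 2, 0, 0),  # D
    (2, 2, 0, 0, 0, 2, 0),  # E
    (0, 0, 0, 0, 0, 0, 0),  # F
]


def companion_graph(m):
    """Edges (i, j), i < j, for column pairs that share a row with 2s in both."""
    edges = set()
    for row in m:
        het = [c + 1 for c, g in enumerate(row) if g == 2]  # 1-based column labels
        edges.update(combinations(het, 2))
    return edges


print(sorted(companion_graph(M)))
```

Each edge is then examined with the forced-expansion rules to decide whether it is forced in-phase, forced out-of-phase, or left undetermined.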

Page 28: L6: Haplotype phasing

Phasing Edges in G_c

• Each Red edge indicates that the columns are forced in-phase.

• Each Blue edge indicates that the columns are forced out-of-phase.

Let G_f be the sub-graph of G_c defined by the red and blue edges.

[Figure: G_c on nodes 1 to 7 with forced edges colored red or blue]

Page 29: L6: Haplotype phasing

Connected Components in G_f

Graph G_f has three connected components.

[Figure: graph G_f on nodes 1 to 7]

Page 30: L6: Haplotype phasing

Phase-parity Lemma

That’s nice, but how do we assign the colors?

Lemma 1: There is a solution to the PPH problem for M if and only if there is a coloring of the black (unforced) edges of G_c with the following property:

For any triangle in G_c containing at least one black edge, the coloring makes either 0 or 2 of the edges blue (i.e., out of phase).

Page 31: L6: Haplotype phasing

A Weak Triangulation Rule

Theorem 1: If there are any black edges whose ends are in the same connected component of G_f, at least one edge is in a triangle where the other edges are not black.

In every PPH solution, it must be colored so that the triangle has an even number of Blue (out of phase) edges.

This is an “inferred” coloring.

[Figure: Graph G_f]

Page 32: L6: Haplotype phasing

[Figure: Graph G_f]

Page 33: L6: Haplotype phasing

[Figure: Graph G_f]

Page 34: L6: Haplotype phasing

[Figure: Graph G_f]

Page 35: L6: Haplotype phasing

[Figure: Graph G_f]

Page 36: L6: Haplotype phasing

Corollary

Inside any connected component of G_f, ALL the phase relationships on edges (columns of M) are uniquely determined, either as forced relationships based on pair-wise column comparisons, or by triangle-based inferred colorings.

Hence, the phase relationships of all the columns in a connected component of G_f are INVARIANT over all the solutions to the PPH problem.

The black edges in G_f can be ordered so that the inferred colorings can be done in linear time (modification of DFS).

Page 37: L6: Haplotype phasing

Phase Parity Lemma: Proof

2  X
Y  2
2  2

If X ≠ 2 and Y ≠ 2, then the two columns are forced.

Page 38: L6: Haplotype phasing

Phase Parity Lemma: Proof

A  B  C
2  2  y
x  2  2
2  z  2

Lemma: If a triangle contains a black edge, then a PPH solution exists only if there are 0 or 2 blue edges in the final coloring.

Proof:
  • No black edge unless x == 2, or y == 2, or z == 2 (previous lemma).
  • If there is a row with all 2s, then there must be an even number of blue edges.

[Figure: triangle on nodes A, B, C]

Page 39: L6: Haplotype phasing

Proof of Weak Triangulation Theorem

Arbitrary chordless cycles are possible in the graph, with forced edges. See example. The pattern 0,2; 2,0; and 2,2 implies a blue (out of phase) edge.

A single unforced edge changes the picture.

A  B  C  D  E
2  2  0  0  0
0  2  2  0  0
0  0  2  2  0
0  0  0  2  2
2  0  0  0  2

[Figure: chordless cycle on nodes A, B, C, D, E]

Page 40: L6: Haplotype phasing

Proof of Weak Triangulation Theorem

Let (J,J’) be a black edge connecting a ‘long’ path J,K,…K’,J’ of forced edges.

In the matrix, x ≠ 2, otherwise there is a chord. Likewise y ≠ 2.

By the previous lemma, (J,J’) is forced.

K  J  J’ K’
2  2  x
   y  2  2
   2  2

[Figure: cycle through nodes J, K, K’, J’]

Page 41: L6: Haplotype phasing

Finishing the Solution

Problem: A connected component C of G_c may contain several connected components of G_f, so any edge crossing two components of G_f will still be black. How should they be colored?

Page 42: L6: Haplotype phasing

How should we color the remaining black edges in a connected component C of G_c?

[Figure: graph on nodes 1 to 7 with the remaining black edges]

Page 43: L6: Haplotype phasing

Answer

• For a connected component C of G_c with k connected components of G_f, select any subset S of k-1 black edges in C, so that S together with the red and blue edges span all the nodes of C.

• Arbitrarily, color each edge in S either red or blue.

• Infer the color of any remaining black edges by successive use of the triangle rule.

[Figure: example graph]

Page 44: L6: Haplotype phasing

[Figure: example graph]

Page 45: L6: Haplotype phasing

Theorem 2

Any selected S works (allows the triangle rule to work) and any coloring of the edges in S determines the colors of any remaining black edges.

Different colorings of S determine different colorings of the remaining black edges.

Each different coloring of S determines a different solution to the PPH problem.

All PPH solutions can be obtained in this way, i.e. using just one selected S set, but coloring it in all 2^(k-1) ways.

Page 46: L6: Haplotype phasing

Corollary

In a single connected component C of G_c with k connected components in G_f, there are exactly 2^(k-1) different solutions to the PPH problem in the columns of M represented by C.

If G_c has r connected components and G_f has t connected components, then there are exactly 2^(t-r) solutions to the PPH problem.

There is one unique PPH solution if and only if each connected component in G_c is a connected component in G_f.
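The counting formula can be sketched with a small union-find over the column nodes. The two edge lists below were worked out by hand from the running-example matrix as read from the slides, so they are assumptions to re-check against the original figures; the function names are also hypothetical. The formula applies only when a PPH solution exists at all.

```python
def count_components(nodes, edges):
    """Number of connected components, via union-find with path halving."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for a, b in edges:
        parent[find(a)] = find(b)
    return len({find(v) for v in nodes})


def count_pph_solutions(nodes, g_c_edges, g_f_edges):
    """2^(t - r): t components of G_f, r components of G_c."""
    r = count_components(nodes, g_c_edges)
    t = count_components(nodes, g_f_edges)
    return 2 ** (t - r)


columns = range(1, 8)
# Companion graph G_c of the running example (hand-derived).
g_c = [(1, 2), (1, 3), (1, 6), (1, 7), (2, 3), (2, 4), (2, 5),
       (2, 6), (3, 4), (3, 5), (3, 6), (3, 7), (4, 6)]
# Forced (red or blue) edges forming G_f (hand-derived).
g_f = [(1, 2), (1, 3), (1, 6), (2, 3), (3, 6), (4, 6)]

print(count_pph_solutions(columns, g_c, g_f))
```

With these edge lists G_c is connected and G_f has three components, in line with the slide stating that G_f has three connected components.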

Page 47: L6: Haplotype phasing

Conclusion

In the special case of blocks with no recombination, and no recurrent mutations, the haplotypes satisfy a perfect phylogeny.

Given a set of genotypes, there is an efficient (O(ns^2)) algorithm for representing all possible haplotype solutions that satisfy a perfect phylogeny.

Efficiency:
  • Input is size O(ns).
  • All operations except building the graph are O(ns+s^2).
  • Valid PPH only if s = O(n). Is O(ns) possible?
  • Current best solution is O(ns+n^(1-e) s^2) using Matrix Multiplication idea.

Future work involves combining this with some heuristics to deal with general cases (lo recombination/hi recombination).

Page 48: L6: Haplotype phasing

Simulated Data

Coalescent model (Hudson)

No Recombination
  • 400 chromosomes, 100 sites
  • Infinite sites

Recombination
  • 100 chromosomes
  • Infinite sites
  • R=4.0 2501
  • Pr(Recombination) = 4*10^(-9) between adjacent bases

Page 49: L6: Haplotype phasing

Error Measurement

Discrepancy = 1 (Num Haplotypes incorrectly predicted)

Switch Error = 2

0010101010

0000011111

0101000101

0101010101

02222

22222
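The two error measures can be sketched as follows. Switch error is taken here in its usual sense (the number of times the predicted haplotype changes which true haplotype it follows, scanning the heterozygous sites), and discrepancy is counted per individual; both definitions, the function names, and the toy haplotypes are assumptions, since the example figure on this slide is not recoverable.

```python
def switch_error(true_pair, predicted_pair):
    """Number of phase switches between a true and a predicted haplotype pair."""
    t1, t2 = true_pair
    p1, _ = predicted_pair
    het = [i for i in range(len(t1)) if t1[i] != t2[i]]
    # At each heterozygous site: does predicted haplotype 1 follow true haplotype 1?
    follows = [p1[i] == t1[i] for i in het]
    return sum(1 for a, b in zip(follows, follows[1:]) if a != b)


def discrepancy(true_pairs, predicted_pairs):
    """Number of individuals whose predicted pair differs from the true pair."""
    return sum(1 for t, p in zip(true_pairs, predicted_pairs) if set(t) != set(p))


true_pair = ("00000", "11111")
print(switch_error(true_pair, ("00111", "11000")))  # one switch after site 2
print(switch_error(true_pair, ("01010", "10101")))  # a switch at every step
print(discrepancy([("00", "11"), ("01", "10")], [("11", "00"), ("00", "11")]))
```

A single switch makes the whole pair count as a discrepancy, which is why switch error is the finer-grained of the two measures.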

Page 50: L6: Haplotype phasing

No Recombination

[Figure: histogram of haplotype discrepancy; x-axis: % Error (0 to 1.4), y-axis: # results (0 to 50)]

Page 51: L6: Haplotype phasing

No Recombination

[Figure: histogram of switch error; x-axis: % Error (0 to 0.4), y-axis: % Results (0 to 40)]

Page 52: L6: Haplotype phasing

Choosing between solutions

[Figure: log(likelihood) (-1600 to -1250) and discrepancy (0 to 6), plotted over solutions 1 to 121]

Page 53: L6: Haplotype phasing

Choosing between solutions

[Figure: Entropy vs Discrepancy; x-axis: Equivalent Solutions (1 to 127), left axis: Entropy (2.69 to 2.75), right axis: Discrepancy (0 to 6)]

Page 54: L6: Haplotype phasing

Choosing between solutions

[Figure: Parsimony vs. Discrepancy; x-axis: Equivalent Solutions (1 to 127), left axis: # Haplotypes (31 to 34.5), right axis: Discrepancy (0 to 6)]

Page 55: L6: Haplotype phasing

Conclusion

Extremely low error rates (< 1% discrepancy) if no recombination

Randomly choosing between equivalent solutions is sufficient

Other measures (Parsimony, Likelihood, Entropy) do not improve the quality of solution

Page 56: L6: Haplotype phasing

With Recombination

[Figure: Haplotype Discrepancy (R=4.0); x-axis: % Error (1 to 29), y-axis: % predictions (0 to 14)]

Page 57: L6: Haplotype phasing

Problems

Many of the earlier problems (structure/recombination rate) etc. correspond to phased data.

Can they be resolved for unphased data?