Inferring Evolutionary History with Network Models in Population Genomics: Challenges and Progress Yufeng Wu Dept. of Computer Science and Engineering University of Connecticut, USA Dagstuhl Seminar, 2010



Page 1: Yufeng  Wu Dept. of Computer Science and Engineering University of Connecticut, USA

Inferring Evolutionary History with Network Models in Population Genomics: Challenges and Progress

Yufeng Wu

Dept. of Computer Science and Engineering

University of Connecticut, USA

Dagstuhl Seminar, 2010

Page 2: Yufeng  Wu Dept. of Computer Science and Engineering University of Connecticut, USA


Recombination

• One of the principal genetic forces shaping sequence variation within species
• Two equal-length sequences generate a third, new equal-length sequence in the genealogy
• Spatial order is important: different parts of the genome are inherited from different ancestors.

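[Figure: two parent sequences, 110001111111001 and 000110000001111, and a recombinant, 1100 00000001111, formed from a prefix of one parent and a suffix of the other, joined at the breakpoint.]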
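The prefix/suffix/breakpoint picture can be sketched in a few lines of Python. This is only an illustration of a single-crossover recombination: the function name `recombine` and the two parent sequences are hypothetical and are not taken from the slides.

```python
def recombine(prefix_parent: str, suffix_parent: str, breakpoint: int) -> str:
    """Return the single-crossover recombinant of two equal-length sequences.

    Sites before `breakpoint` are copied from `prefix_parent`,
    sites from `breakpoint` onward are copied from `suffix_parent`.
    """
    if len(prefix_parent) != len(suffix_parent):
        raise ValueError("parents must have equal length")
    if not 0 <= breakpoint <= len(prefix_parent):
        raise ValueError("breakpoint out of range")
    return prefix_parent[:breakpoint] + suffix_parent[breakpoint:]


# Hypothetical parents (not the sequences on the slide).
child = recombine("11110000", "00001111", 3)
print(child)  # the child has the same length as both parents
```

Because the breakpoint is a position shared by both parents, the spatial order of sites is preserved, which is the point the slide makes about different parts of the genome having different ancestors.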

Page 3: Yufeng  Wu Dept. of Computer Science and Engineering University of Connecticut, USA

Ancestral Recombination Graph (ARG)

[Figure: the sequences S1 = 00, S2 = 01, S3 = 10, S4 = 10 explained by mutations alone, and the sequences S1 = 00, S2 = 01, S3 = 10, S4 = 11 explained by an ARG with mutations and recombination.]

Network model: beyond tree model

Assumption: At most one mutation per site

Page 4: Yufeng  Wu Dept. of Computer Science and Engineering University of Connecticut, USA


Reconstruction of Network-based Evolutionary History

Input: DNA sequences (haplotypes) or phylogenetic trees

Biology: meiotic recombination in populations, or reticulate evolutionary processes: horizontal gene transfer or hybrid speciation

Different formulations

Same objective: reconstruct the network-based evolutionary history (and related problems)
• Efficiency
• Accuracy

Page 5: Yufeng  Wu Dept. of Computer Science and Engineering University of Connecticut, USA

Reconstructing ARGs by Parsimony

• Input: a set of binary sequences M
• Goal: reconstruct ARGs deriving M
• Parsimony formulation
  – minARG: minimize the number of recombination events
  – NP-complete (Wang, et al.)

Kreitman’s data for the Adh locus of D. melanogaster (1983)

Page 6: Yufeng  Wu Dept. of Computer Science and Engineering University of Connecticut, USA

The minARG Problem

Uniform sampling of minARGs by treating each minARG as equally likely (Wu)

Estimating the range of minARGs: lower and upper bounds

Structurally constrained ARGs, e.g. galled trees (Wang, et al., Gusfield, et al.)
• Simplified ARG topology

Heuristic methods, e.g. program MARGARITA (Durbin, et al.), Song, et al., Parida, et al.

Exact minARG by branch and bound (Lyngso, Song and Hein)

Page 7: Yufeng  Wu Dept. of Computer Science and Engineering University of Connecticut, USA

minARG for Kreitman’s data

Challenge: accurate inference of ARGs

Rmin: minimum number of recombinations for M.

L(M): lower bound on Rmin

U(M): upper bound on Rmin

Several lower bounds give L(M)=7.

U(M)=7 for Kreitman’s data (Song, Wu and Gusfield). Thus, Rmin(M)=7
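The slides do not say which lower bounds were used. As one classical example of a lower bound L(M), the Hudson-Kaplan (HK) bound can be sketched as follows: every pair of incompatible sites (all four gametes present) forces at least one recombination between them, so the largest number of such site intervals that do not overlap is a lower bound on Rmin. The function names and the small test matrix are illustrative assumptions, not the talk's code or data.

```python
from itertools import combinations


def incompatible(rows, p, q):
    """Four-gamete test: True if columns p and q show all of 00, 01, 10, 11."""
    return len({(row[p], row[q]) for row in rows}) == 4


def hk_bound(rows):
    """Hudson-Kaplan lower bound on the minimum number of recombinations.

    Each incompatible pair (p, q), p < q, defines an interval that must
    contain a breakpoint. Greedy selection by right endpoint finds the
    maximum number of intervals that share no gap between adjacent sites.
    """
    n_sites = len(rows[0])
    intervals = [
        (p, q)
        for p, q in combinations(range(n_sites), 2)
        if incompatible(rows, p, q)
    ]
    intervals.sort(key=lambda interval: interval[1])
    count, last_right = 0, 0
    for left, right in intervals:
        if left >= last_right:  # intervals may touch at a site, not overlap
            count += 1
            last_right = right
    return count


# Hypothetical data: sites 0 and 1 contain all four gametes.
print(hk_bound(["00", "01", "10", "11"]))
```

When a lower bound of this kind meets an upper bound from a constructed ARG, as with L(M) = U(M) = 7 above, the value of Rmin is settled without exhaustive search.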

Page 8: Yufeng  Wu Dept. of Computer Science and Engineering University of Connecticut, USA


ARG Induces Local Trees

Local trees: evolutionary history at a genomic position.

Trace backwards in time. At a recombination node, pick the branch passing alleles to the recombinant at this location.

[Figure: an ARG with mutations and recombination for the data 0000, 0101, 0110, 1110, 1010, alongside the local tree near site 3.]

Page 9: Yufeng  Wu Dept. of Computer Science and Engineering University of Connecticut, USA

Local Trees Change Across the Genome

Local trees change when moving across recombination breakpoints.

Spatial property: nearby local trees tend to be more similar.

How good are the inferred ARGs? Compare the inferred local tree topologies with the simulated trees.

[Figure: an ARG for the data 0000, 0101, 0110, 1110, 1010, alongside the local tree near site 2.]

Page 10: Yufeng  Wu Dept. of Computer Science and Engineering University of Connecticut, USA

Inferring Local Trees

Problem: given binary sequences, infer local tree topologies (one tree for each site, ignoring branch lengths)

Parsimony-based approaches
• Hein (1990, 1993), Song and Hein (2005)
• Wu (2010): shared topological features in nearby trees

Key: local trees have different topologies due to recombination

Trees or network? Do not reconstruct the full network; local trees are very informative

Challenge: how to improve the accuracy?

Accuracy: Robinson-Foulds distances between the inferred trees and the simulated trees
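As a rough sketch of the accuracy measure, the Robinson-Foulds distance between two rooted trees on the same leaves can be computed as the number of clusters (leaf sets below internal nodes) found in exactly one of the two trees. The nested-tuple tree encoding and the function names are assumptions made for this example; the slides do not specify whether a rooted, unrooted, or normalized variant was used.

```python
def clusters(tree):
    """Return the non-trivial clusters of a rooted tree given as nested tuples.

    A leaf is any non-tuple value; an internal node is a tuple of children.
    The cluster containing every leaf is dropped because all trees share it.
    """
    found = set()

    def leaves(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        found.add(below)
        return below

    all_leaves = leaves(tree)
    found.discard(all_leaves)
    return found


def rf_distance(tree1, tree2):
    """Size of the symmetric difference of the two cluster sets."""
    return len(clusters(tree1) ^ clusters(tree2))


inferred = ((1, 2), 3, 4)      # non-binary tree: only the cluster {1, 2}
simulated = ((1, 2), (3, 4))   # binary tree: clusters {1, 2} and {3, 4}
print(rf_distance(inferred, simulated))
```

Non-binary trees are handled naturally, which matters here because partially resolved local trees are allowed.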

Page 11: Yufeng  Wu Dept. of Computer Science and Engineering University of Connecticut, USA

RENT: REfining Neighboring Trees

• Maintain for each SNP site a (possibly non-binary) tree topology
  – Initialize to a tree containing the split induced by the SNP
• Gradually refine the trees by adding new splits to them
  – Splits found by a set of rules (later)
  – Splits added early may be more reliable
• Stop when the trees are binary or enough information is recovered

Page 12: Yufeng  Wu Dept. of Computer Science and Engineering University of Connecticut, USA

A Little Background: Compatibility

• Two sites (columns) p, q are incompatible if columns p, q contain all four ordered pairs (gametes): 00, 01, 10, 11. Otherwise, p and q are compatible.
• Easily extended to splits.

M    A  B  C
a    0  0  0
b    1  0  0
c    0  0  1
d    1  0  1
e    0  1  1

Sites A and B are compatible, but A and C are incompatible.
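The compatibility test is short enough to write out. The sketch below reads the slide's matrix M row by row as sequences a to e over sites A, B, C; that reading of the flattened matrix is an assumption, although it does reproduce the slide's statement about sites A, B and C. The names `M`, `SITES` and `compatible` are illustrative.

```python
# Rows a..e of M over the sites A, B, C (assumed layout of the slide's matrix).
M = {"a": "000", "b": "100", "c": "001", "d": "101", "e": "011"}
SITES = {"A": 0, "B": 1, "C": 2}


def gametes(rows, p, q):
    """Set of ordered pairs observed at columns p and q."""
    return {(row[p], row[q]) for row in rows}


def compatible(rows, p, q):
    """Two binary sites are compatible unless all four gametes occur."""
    return len(gametes(rows, p, q)) < 4


rows = list(M.values())
print(compatible(rows, SITES["A"], SITES["B"]))  # A versus B
print(compatible(rows, SITES["A"], SITES["C"]))  # A versus C
```

For binary data, four distinct pairs can only be 00, 01, 10 and 11, so counting distinct pairs is enough.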

Page 13: Yufeng  Wu Dept. of Computer Science and Engineering University of Connecticut, USA

Fully-Compatible Region: Simple Case

• A region of consecutive SNP sites where these SNPs are pairwise compatible.
  – May indicate that no topology-altering recombination occurred within the region
• Rule: for site s, add any such split to the tree at s.
  – Compatibility: a very strong property, unlikely to arise by chance.

Page 14: Yufeng  Wu Dept. of Computer Science and Engineering University of Connecticut, USA

Split Propagation: More General Rule

• Three consecutive sites A, B and C. Sites A and B are incompatible. Does site C matter for the tree at site A?
  – Trees at sites A and B are different.
  – Suppose site C is compatible with sites A and B. Then?
  – Site C may indicate a shared subtree in both trees at sites A and B.
• Rule: a split propagates in both directions until reaching an incompatible tree.
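A much-simplified sketch of the propagation rule follows. It treats each site's own SNP split as a stand-in for the tree at that site and extends a split left and right until the first incompatible site. The actual method compares a split against the current (partially refined) tree, so this is an assumption-laden illustration; the function names and data are hypothetical.

```python
def compatible(rows, p, q):
    """Four-gamete test on columns p and q of binary sequences."""
    return len({(row[p], row[q]) for row in rows}) < 4


def propagation_range(rows, site):
    """Return (lo, hi): the sites reached by the split at `site`.

    The split moves outward in each direction and stops just before the
    first site whose own split is incompatible with it.
    """
    n_sites = len(rows[0])
    lo = site
    while lo > 0 and compatible(rows, site, lo - 1):
        lo -= 1
    hi = site
    while hi < n_sites - 1 and compatible(rows, site, hi + 1):
        hi += 1
    return lo, hi


# Hypothetical data over sites A, B, C (columns 0, 1, 2):
# A and B are incompatible, C is compatible with both.
rows = ["000", "010", "100", "110", "001", "001"]
print(propagation_range(rows, 2))  # split of site C
print(propagation_range(rows, 0))  # split of site A
```

In this toy input the split from site C reaches both A and B, matching the slide's scenario of a subtree shared by two different local trees, while the split from site A stops immediately at B.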

Page 15: Yufeng  Wu Dept. of Computer Science and Engineering University of Connecticut, USA

Reticulate Networks

Gene trees: phylogenetic trees from gene sequences
- Assume: binary and rooted
- Different topologies at different genes

Reticulate evolution: one explanation
- Hybrid speciation, horizontal gene transfer

Gene A      Gene B
1: 0 0 0    1: 0 0 0
2: 0 0 1    2: 1 0 1
3: 1 1 0    3: 0 1 0
4: 1 0 0    4: 0 0 1

Reticulate network: a directed acyclic graph displaying each of the gene trees

Hybridization event: nodes with in-degree two or more

[Figure: gene trees T and T’ with leaf orders 1 2 3 4 and 1 3 2 4, and a reticulate network with root ρ on leaves 1 2 3 4. Keeping the two red edges yields one gene tree; keeping the two black edges yields the other.]

Page 16: Yufeng  Wu Dept. of Computer Science and Engineering University of Connecticut, USA

The Minimum Reticulation Problem

Given: a set of K gene trees G.

Problem: reconstruct a reticulate network that displays each gene tree using Rmin(G), the minimum number of reticulation events.

NP-complete: even for K=2

Current approaches:
• exact methods for the K=2 case (see Semple, et al.)
• impose topological constraints (e.g. galled networks, see Huson, et al.)

[Figure: gene trees T1, T2, T3 and a reticulate network N on taxa 1 to 4.]

Challenge: efficient and accurate reconstruction of reticulate networks for multiple trees.

Close lower and upper bounds for an arbitrary number of trees (Wu, 2010)

Page 17: Yufeng  Wu Dept. of Computer Science and Engineering University of Connecticut, USA

Performance of PIRN: Optimal Solution

• Lower and upper bounds often match on many data sets

[Figure: horizontal axis: number of taxa; vertical axis: % of data sets with LB = UB. K: number of trees; r: level of reticulation.]

Page 18: Yufeng  Wu Dept. of Computer Science and Engineering University of Connecticut, USA

Performance of PIRN: Gap of Bounds

• Gap between the lower and upper bounds is often small on many data sets

[Figure: horizontal axis: number of taxa; vertical axis: gap between lower and upper bounds. K: number of trees; r: level of reticulation.]

Page 19: Yufeng  Wu Dept. of Computer Science and Engineering University of Connecticut, USA

Reticulate Network for Five Poaceae Trees

Gene trees: rpoC2, phyB, rbcL, ndhF, ITS

Lower bound: 11

Upper bound: 13

Page 20: Yufeng  Wu Dept. of Computer Science and Engineering University of Connecticut, USA

Reticulate Network for Five Poaceae Trees


Upper bound: 13 used in this network

Page 21: Yufeng  Wu Dept. of Computer Science and Engineering University of Connecticut, USA


Acknowledgement

• More information available at: http://www.engr.uconn.edu/~ywu

• Research supported by National Science Foundation and UConn Research Foundation

Page 22: Yufeng  Wu Dept. of Computer Science and Engineering University of Connecticut, USA

Coalescent with Recombination

Coalescent theory: defines a probability distribution over genealogies

Likelihood computation for the coalescent with recombination

Probability of ARGs under certain parameters

Likelihood: summation of the probabilities of all the ARGs
• Challenging: too many ARGs (Lyngso, Song and Hein)

Importance sampling approach: draw samples (ARGs) with respect to some probabilistic distribution
• Works well with no recombination
• Does not work well with recombination
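In symbols, the likelihood and its importance sampling estimate can be written roughly as below. The notation is an assumption introduced here, not the talk's: D is the data, θ the model parameters, G ranges over ARGs, Q is the proposal distribution, and N is the number of sampled ARGs.

```latex
L(\theta) \;=\; P(D \mid \theta)
  \;=\; \sum_{G} P(D \mid G)\, P(G \mid \theta)
  \;\approx\; \frac{1}{N} \sum_{i=1}^{N}
      \frac{P(D \mid G_i)\, P(G_i \mid \theta)}{Q(G_i)},
  \qquad G_i \sim Q .
```

The estimate is only as good as the proposal: if Q rarely produces ARGs with substantial weight, the sum is dominated by a few samples, which is one way to read the remark that the approach does not work well with recombination.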

Page 23: Yufeng  Wu Dept. of Computer Science and Engineering University of Connecticut, USA

Coalescent-based ARG Sampling

Uniform sampling of minARGs (Wu, 2007)
• Treat each minARG as equally likely.
• Algorithm for generating a minARG uniformly at random (exponential time for setup, but polynomial time per sample)

Probability of ARGs under certain parameters

Challenge: develop a more general ARG sampling method that can efficiently sample ARGs approximately according to coalescent probabilities.


A related problem: compute the coalescent likelihood with recombination efficiently.

Recent work: exact computation of the coalescent likelihood under the infinite sites model with no recombination (Wu, 2009)

Page 24: Yufeng  Wu Dept. of Computer Science and Engineering University of Connecticut, USA

The Mosaic Model

M: input sequences

Assumption: input sequences are descendants of K founder sequences (unknown)

Extant sequences: concatenations of exact copies of founder segments (no shift of position)

• Coloring: assign which position of a sequence is from which founder (color); need consistency

[Figure: example M with K=2 (rows 0000, 0101, 0111, 1111, 1110), colored by founder with breakpoints marked. Total: 5 breakpoints.]

Page 25: Yufeng  Wu Dept. of Computer Science and Engineering University of Connecticut, USA

The Minimum Mosaic Problem

• Problem: given a set of binary sequences and the number of founders K, find a K-coloring of these sequences that minimizes the number of color changes (recombination breakpoints)
• Also find the K founder sequences (not part of the input)
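The hard part of the problem is choosing the founders. Once candidate founders are fixed, the best coloring of each sequence is a simple dynamic program, sketched below. This covers only that easier subproblem, not the full minimum mosaic problem; the function names are hypothetical, the founders 0000 and 1111 are assumed for illustration, and the five rows are one reading of the Mosaic Model example.

```python
def min_breakpoints(sequence, founders):
    """Fewest founder switches needed to copy `sequence` from `founders`.

    cost[k] is the fewest breakpoints so far when the current site is
    copied from founder k. Returns None if some site matches no founder.
    """
    inf = float("inf")
    cost = [0 if f[0] == sequence[0] else inf for f in founders]
    for pos in range(1, len(sequence)):
        best_prev = min(cost)
        cost = [
            # stay on founder k, or switch to k from the best other choice
            min(cost[k], best_prev + 1) if f[pos] == sequence[pos] else inf
            for k, f in enumerate(founders)
        ]
    best = min(cost)
    return None if best == inf else best


founders = ["0000", "1111"]                      # assumed founders, K = 2
rows = ["0000", "0101", "0111", "1111", "1110"]  # assumed input sequences
per_row = [min_breakpoints(row, founders) for row in rows]
print(per_row, sum(per_row))
```

With these assumed founders the rows need 5 breakpoints in total, which agrees with the total stated on the Mosaic Model slide, although the founders used there cannot be recovered from the extracted text.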

Inferred founders

Data from Rastas and Ukkonen

20 sequences, 40 sites

55 breakpoints: minimum number of breakpoints

Page 26: Yufeng  Wu Dept. of Computer Science and Engineering University of Connecticut, USA

The Minimum Mosaic Problem
• Introduced by Ukkonen (2002)
• Simple and easier to visualize
• Main known results
  – An exponential-time algorithm that runs in polynomial time for K=2 (Ukkonen 2002)
  – An exact method that works for relatively small K and modest-sized data (Wu and Gusfield, 2007)
  – Haplovisual program and other extensions by Rastas and Ukkonen (2007)
  – Heuristic algorithm by Roli and Blum (2009)
  – Lower bounds for the minimum number of breakpoints needed (Wu, 2010)
• Challenges
  – Polynomial-time algorithm for K ≥ 3?
  – Concrete applications in biology?