RECOMBINOMICS: Myth or Reality?

Laxmi Parida

IBM Watson Research New York, USA

IBM Computational Biology Center

RoadMap

1. Motivation

2. Reconstructability (Random Graphs Framework)

3. Reconstruction Algorithm (DSR Algorithm)

4. Conclusion

www.nationalgeographic.com/genographic

www.ibm.com/genographic

Five-year study, launched in April 2005, to address anthropological questions on a global scale using genetics as a tool

Although fossil records fix human origins in Africa, little is known about the great journey that took Homo sapiens to the far reaches of the earth. How did we, each of us, end up where we are?

Samples from all around the world are being collected, and the mtDNA and Y chromosome are being sequenced and analyzed

phylogeographic question

DNA material in use under unilinear transmission

58 million bp

0.38%

16,000 bp

Missing information in unilinear transmissions

[Figure: time axis running from past to present]

Table Mountain, Cape Town, South Africa

Paradigm Shift in Locus & Analysis

Using recombining DNA sequences. Why?

Nonrecombining DNA gives a partial story:

1. represents only a small part of the genome

2. behaves as a single locus

3. unilinear (exclusively male or female) transmission

Recombining DNA: towards more complete information

Challenges:

Computationally very complex

How to comprehend complex reticulations?

L Parida, Pedigree History: A Reconstructability Perspective using Random-Graphs Framework, under preparation.

The Random Graphs Framework

GRAPH DEFINITION:

1. Infinite number of vertices arranged in finite-sized rows

2. Edges introduced via a random process across immediate rows

PROPERTIES: Address some topological questions

1. First, identify a Probability Space

2. Then, pose and address specific questions (such as the expected depth of the LCA)

The Random Graphs Framework

1. Infinite number of vertices with a specific organization

2. Edges introduced via a random process satisfying specific rules

3. Address some topological questions: define a Probability Space, then pose and answer specific questions (such as the expected depth of the LCA)

Wright-Fisher Model

1. Constant population

2. Non-overlapping generations

3. Panmictic
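The three Wright-Fisher assumptions translate directly into a sampling rule, and a minimal sketch may make that concrete. The function name `sample_pedigree` is hypothetical, and the sketch assumes a colorless model in which every vertex draws two distinct parents uniformly from the row above; the slides do not specify the sampling rule at this level of detail.

```python
import random

def sample_pedigree(n_units, depth, rng):
    """Sample `depth` generations of a Wright-Fisher-style pedigree graph.

    parents[d][v] is the pair of parents (in row d + 1) of vertex v in row d.
    Every row has n_units vertices (constant population), edges only join
    adjacent rows (non-overlapping generations), and parents are drawn
    uniformly at random (panmictic). Ignoring vertex color is an assumption.
    """
    return [
        [tuple(rng.sample(range(n_units), 2)) for _ in range(n_units)]
        for _ in range(depth)
    ]

rng = random.Random(0)
N = 4                      # 2N vertices per generation
depth = 5
pedigree = sample_pedigree(2 * N, depth, rng)

num_vertices = 2 * N * (depth + 1)
num_edges = sum(len(pair) for row in pedigree for pair in row)
print(num_vertices, num_edges)
```

Each vertex contributes exactly two edges, so any finite fragment sampled this way has a number of edges linear in its number of vertices.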

Properties of this Pedigree Graph

1. DAG (Directed Acyclic Graph)

2. |E| = O(|V|) for any finite fragment: a sparse graph, hence a vertex-centric view

3. Focus on the flow of genetic material: relevant pedigree graph

Pedigree Graph: GPG(K,N)

K: number of extant units; 2N: population size per generation

Can the model ignore color of vertex?

Forbidden Structure

Probability Space

Space is non-enumerable

Uniform probability measure? (WF population)

Compute the probability of some event F(h) for a fixed depth h, then take the limit as h → ∞

Topological Property of GPG(K,N)

Least Common Ancestor (LCA) of all K extant vertices (TMRCA or GMRCA)

How many LCAs?

Expected Depth of the shallowest LCA
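The expected depth of the shallowest LCA can be estimated empirically. The sketch below is a Monte Carlo estimate, not the analytical result of the talk; it assumes the same colorless two-parent sampling as a plain Wright-Fisher pedigree, and the name `shallowest_lca_depth` is hypothetical.

```python
import random

def shallowest_lca_depth(k, n_units, rng, max_depth=10_000):
    """Depth of the first row holding a common ancestor of k extant vertices.

    Rows are generated lazily: each vertex draws two distinct parents
    uniformly from the n_units vertices of the next row (an assumption).
    """
    # ancestors[i] is the set of ancestors of extant unit i in the current row
    ancestors = [{v} for v in range(k)]
    for depth in range(1, max_depth + 1):
        parents = [tuple(rng.sample(range(n_units), 2)) for _ in range(n_units)]
        ancestors = [{p for v in anc for p in parents[v]} for anc in ancestors]
        if set.intersection(*ancestors):
            return depth
    raise RuntimeError("no common ancestor found within max_depth")

rng = random.Random(1)
depths = [shallowest_lca_depth(4, 16, rng) for _ in range(200)]
print(sum(depths) / len(depths))   # empirical mean depth for K = 4, 2N = 16
```

Varying `k` and `n_units` gives a rough feel for how the depth scales before turning to the closed-form analysis.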

Infinite number of LCAs in a GPG(4,3) instance…

In fact, there exist infinitely many such instances!

Expected Depth of the shallowest “LCA”: a MEASURE OF RECONSTRUCTABILITY

(Genetic Exchange) Sexual Reproduction vs Graph Model

Ancestor without ancestry

Graph Theory vis-à-vis Population Genetics

1. Graph Theoretic (topological):

CA: common ancestor

LCA: Least CA or Shallowest CA

MRCA: Most Recent CA

TMRCA: The MRCA

2. Graph Theoretic + Biology (Genetic Exchange):

CAA: common ancestor-&-ancestry

LCAA: Least CAA

GMRCA: Grand MRCA

Unilinear Transmission

Different Models as Subgraphs

mtDNA Tree, NRY Tree

Genetic Exchange Model (ARG)

Pedigree Graph GPG(K,N): each vertex has 2 parents

1. Red Subgraph GPTX(K,N), Blue Subgraph GPTY(K,N): each vertex has 1 parent

2. Mixed Subgraph GPGE(K,N,M): each vertex has 1 OR 2 parents; no more than KM vertices per row, where M is the number of completely linked segments in each extant unit
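When every vertex has exactly one parent, the K extant lineages can only merge, so the structure is a tree and the TMRCA is the row where a single lineage remains. The sketch below illustrates this for a single-parent model; it assumes the parent is drawn uniformly from all `n_units` vertices of the next row (the colored GPTX/GPTY models would restrict the choice to one sex), and `tmrca_depth` is a hypothetical name.

```python
import random

def tmrca_depth(k, n_units, rng, max_depth=1_000_000):
    """Depth at which k extant lineages coalesce into one in a single-parent tree."""
    lineages = set(range(k))
    for depth in range(1, max_depth + 1):
        # one parent per vertex, drawn uniformly (assumption)
        parent = [rng.randrange(n_units) for _ in range(n_units)]
        lineages = {parent[v] for v in lineages}
        if len(lineages) == 1:
            return depth
    raise RuntimeError("lineages did not coalesce within max_depth")

rng = random.Random(2)
depths = [tmrca_depth(4, 16, rng) for _ in range(200)]
print(sum(depths) / len(depths))   # empirical mean TMRCA depth for K = 4
```

Unlike the two-parent pedigree, the set of tracked lineages here never grows, which is why the tree models are so much easier to reason about.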

Different Models

GPG(4,8), GPTY(4,8), GPGE(4,8,2)

Different Models as Subgraphs

Pedigree Graph GPG(K,N): LCA, GMRCA

1. Red Subgraph GPTX(K,N), Blue Subgraph GPTY(K,N): LCA, TMRCA

2. Mixed Subgraph GPGE(K,N,M): LCA, GMRCA

GPGE(K,N,M) ARG

Ancestral Recombinations Graph (Griffiths & Marjoram ’97)

Embellish GPGE(K,N,M) with Genetic Exchanges (GE):

Each extant unit has M segments

No vertex with zero ancestral segments (to extant units)

Mixed Subgraph GPGE(K,N,M)

1. Plausible GE assignment?

2. Can GPGE(K,N,M) go colorless?

Yes, through algorithmic subsampling…

Algorithm: Embellish GPGE(K,N,M)

1. Assign a sequence, s, to an instance, e.g. s = K, (2K), (2K-7), (2K-15), …

2. Construct M sequences s_i: each s_i is monotonically decreasing, with s_i[j] no bigger than s[j]

3. Associate each s_i with a segment, and each element s_i[j] = k with k randomly selected vertices at depth j
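The three steps above can be sketched as follows. This is one possible reading, not the author's procedure: it treats "monotonically decreasing" as non-increasing, assumes s_i[0] = K because every extant unit carries every segment, draws each later count uniformly, and does not enforce the "no vertex with zero ancestral segments" condition. The name `construct_segment_sequences` is hypothetical.

```python
import random

def construct_segment_sequences(s, m, rng):
    """Build m sequences s_i bounded by s and pick carrier vertices per depth.

    s[j] is the number of vertices at depth j. Returns a list of
    (counts, carriers) pairs, one per segment, where counts is s_i and
    carriers[j] lists the counts[j] vertices chosen at depth j.
    """
    assignment = []
    for _ in range(m):
        counts = [s[0]]                      # all extant units carry the segment
        for j in range(1, len(s)):
            upper = min(counts[-1], s[j])    # non-increasing and at most s[j]
            counts.append(rng.randint(1, upper))
        carriers = [sorted(rng.sample(range(s[j]), counts[j]))
                    for j in range(len(s))]
        assignment.append((counts, carriers))
    return assignment

K = 8
s = [K, 2 * K, 2 * K - 7, 2 * K - 15]        # the example sequence from step 1
for counts, carriers in construct_segment_sequences(s, 3, random.Random(0)):
    print(counts)
```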

Algorithm: Constructing sequences…

“Topological” Definition of LCAA in GPGE(K,N,M)

Input: GPGE(K,N,M) with GE embellishment

LCAA:

1. CA in all M subgraphs (trees)

2. Least such CA
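The two-part definition can be illustrated with a small sketch. It assumes each of the M segment trees is given as a child-to-parent dictionary over vertices written as (depth, index) tuples; that representation and the names `ancestors` and `lcaa` are hypothetical, chosen only for the example.

```python
def ancestors(parent, v):
    """All proper ancestors of v along a single-parent map."""
    chain = set()
    while v in parent:
        v = parent[v]
        chain.add(v)
    return chain

def lcaa(trees, extant):
    """Shallowest vertex that is a common ancestor of all extant units in every tree."""
    common = None
    for parent in trees:
        for leaf in extant:
            chain = ancestors(parent, leaf)
            common = chain if common is None else common & chain
    if not common:
        return None
    return min(common)   # (depth, index) tuples: min picks the least depth

extant = [(0, 0), (0, 1)]
# Segment 1: both units coalesce at depth 1
tree1 = {(0, 0): (1, 0), (0, 1): (1, 0), (1, 0): (2, 0)}
# Segment 2: the units coalesce only at depth 2
tree2 = {(0, 0): (1, 0), (0, 1): (1, 1), (1, 0): (2, 0), (1, 1): (2, 0)}

print(lcaa([tree1], extant))         # (1, 0)
print(lcaa([tree1, tree2], extant))  # (2, 0)
```

The second call shows why the LCAA can be deeper than the LCA of any single segment tree: vertex (1, 0) is a common ancestor for segment 1 but not for segment 2.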

Different Models as Subgraphs

Pedigree Graph GPG(K,N): LCAA, GMRCA

1. Red Subgraph GPTX(K,N), Blue Subgraph GPTY(K,N): LCA, TMRCA

2. Mixed Subgraph GPGE(K,N,M): LCAA, GMRCA

Probability of Instances with Unique LCA/LCAA

Pedigree Graph GPG(K,N)

1. Red Subgraph GPTX(K,N), Blue Subgraph GPTY(K,N)

2. Mixed Subgraph GPGE(K,N,M)

“Topological” Definitions of LCAA

Pedigree Graph GPG(K,N): GMRCA, LCAA, LCA & lone pair

1. Red Subgraph GPTX(K,N), Blue Subgraph GPTY(K,N): TMRCA, LCA

2. Mixed Subgraph GPGE(K,N,M): GMRCA, LCAA, LCA & lone node

Expected Depth E(D) of LCA/LCAA

O(N2)

O(K)

O(KM)

Pedigree Graph GPG(K,N)

1. Red Subgraph GPTX(K,N), Blue Subgraph GPTY(K,N)

2. Mixed Subgraph GPGE(K,N,M)

RECONSTRUCTABILITY

O(N2)

O(K)

O(KM)

Pedigree Graph GPG(K,N)

1. Red Subgraph GPTX(K,N), Blue Subgraph GPTY(K,N)

2. Mixed Subgraph GPGE(K,N,M)

Summary: History Reconstruction?

1. The Mixed Subgraph models recombinations: only fragments of the chromosome

2. In reality, only a minimal structure (HUD) of the GPGE(K,N,M) or ARG can be estimated: forbidden structures…

L Parida, M Mele, F Calafell, J Bertranpetit and Genographic Consortium, Estimating the Ancestral Recombinations Graph (ARG) as Compatible Networks of SNP Patterns, Journal of Computational Biology, vol 15(9), pp 1-22, 2008

L Parida, A Javed, M Mele, F Calafell, J Bertranpetit and Genographic Consortium, Minimizing Recombinations in Consensus Networks for Phylogeographic Studies, BMC Bioinformatics 2009

INPUT: Chromosomes (haplotypes)

OUTPUT: Recombinational Landscape (Recotypes)

Our Approach

[Flowchart: Granularity g → IRiS → Acceptable p-value? → YES: Analyze Results; NO: return to Granularity g. Stage labels: statistical, statistical, combinatorial.]

M Mele, A Javed, F Calafell, L Parida, J Bertranpetit and Genographic Consortium, Recombination-based genomics: a genetic variation analysis in human populations, under submission.

Preprocess: Dimension reduction via Clustering

[Figure: clustering diagram with nodes numbered 0 to 24]

Analysis Flow

[Flowchart: Granularity g → IRiS → Acceptable p-value? → YES: Analyze Results; NO: return to Granularity g. Stage labels: statistical, statistical, combinatorial.]

p-value Estimation

Comparison of the Randomization Schemes

SNP Blocks (granularity g=3)
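A small sketch may help picture what SNP blocks at granularity g = 3 look like. It assumes that granularity means the number of consecutive SNPs per block and that blocks do not overlap; the slides do not state either, and the function name `snp_block_patterns` is hypothetical.

```python
def snp_block_patterns(haplotypes, g):
    """Recode haplotype strings as sequences of per-block pattern ids.

    Each haplotype is cut into consecutive, non-overlapping blocks of g SNPs
    (an assumption). Within each block position, distinct patterns receive
    integer ids in order of first appearance.
    """
    n_blocks = len(haplotypes[0]) // g   # trailing SNPs that do not fill a block are dropped
    encoded = [[] for _ in haplotypes]
    for b in range(n_blocks):
        ids = {}
        for row, hap in zip(encoded, haplotypes):
            pattern = hap[b * g:(b + 1) * g]
            row.append(ids.setdefault(pattern, len(ids)))
    return encoded

haps = ["000111", "000101", "110111"]
print(snp_block_patterns(haps, 3))   # [[0, 0], [0, 1], [1, 0]]
```

After recoding, each haplotype is a short vector of block patterns rather than a long vector of individual SNPs.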

IRiS (Identifying Recombinations in Sequences)

Stage Haplotypes: use SNP block patterns

Segment along the length: infer trees

Infer network (ARG)

biological insights

computational insights

L Parida, M Mele, F Calafell, J Bertranpetit and Genographic Consortium, Estimating the Ancestral Recombinations Graph (ARG) as Compatible Networks of SNP Patterns, Journal of Computational Biology, vol 15(9), pp 1-22, 2008

Segmentation

[Figure: a ruler of column positions above a row that assigns each column to a segment, numbered 1 to 5, with the final columns marked "-"]

Consensus of Trees

Algorithm Design

1. Ensure compatibility of component trees

2. Parsimony model: minimize the no. of recombinations

Theorem: The problem is NP-Hard.

“It is impossible to design an efficient algorithm that guarantees optimality (unless P = NP).”
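As a rough illustration of what compatibility between component trees can mean, the sketch below uses the textbook criterion for rooted trees: two leaf clusters fit in one tree only if they are disjoint or nested. This is a generic test swapped in for illustration, not the compatibility definition used by IRiS, and the function names are hypothetical.

```python
from itertools import combinations

def clusters_compatible(a, b):
    """Two leaf clusters can coexist in one rooted tree iff disjoint or nested."""
    a, b = set(a), set(b)
    return a.isdisjoint(b) or a <= b or b <= a

def all_compatible(clusters):
    """True when every pair of clusters passes the pairwise test."""
    return all(clusters_compatible(a, b) for a, b in combinations(clusters, 2))

tree_a = [{1, 2}, {1, 2, 3}]     # clusters of the tree ((1,2),3)
tree_b = [{2, 3}, {1, 2, 3}]     # clusters of the tree (1,(2,3))

print(all_compatible(tree_a))            # True
print(all_compatible(tree_a + tree_b))   # False
```

The clusters {1, 2} and {2, 3} overlap without nesting, so no single tree displays both; a network joining the two trees would need at least one reticulation, which is the quantity the parsimony model tries to minimize.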

DSR Scheme (Dominant, Subdominant, Recombinant)

DSR Scheme: Level 1

DSR Assignment Rules

1. At most one D per row and column; if no D, at most one S per row and column

2. At most one non-R in the row and column, but not both

DSR Assignment Rules

1. Each row and each column has at most one D; else, it has at most one S

2. A non-R can have other non-Rs either in its row or its column but NOT both
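The two assignment rules can be checked mechanically. The sketch below is one reading of the rules as stated on the slide, applied to a matrix whose cells are "D", "S" or "R"; the matrix representation and the name `dsr_violations` are hypothetical.

```python
def dsr_violations(matrix):
    """List the ways a D/S/R matrix breaks the two assignment rules.

    Rule 1: every row and column has at most one D; if it has no D,
            it has at most one S.
    Rule 2: a non-R cell may share its row or its column with other
            non-R cells, but not both.
    """
    rows = [list(r) for r in matrix]
    cols = [list(c) for c in zip(*rows)]
    problems = []
    for kind, lines in (("row", rows), ("column", cols)):
        for i, line in enumerate(lines):
            if line.count("D") > 1:
                problems.append(f"{kind} {i}: more than one D")
            elif line.count("D") == 0 and line.count("S") > 1:
                problems.append(f"{kind} {i}: no D but more than one S")
    for i, row in enumerate(rows):
        for j, cell in enumerate(row):
            if cell == "R":
                continue
            others_in_row = sum(c != "R" for c in row) - 1
            others_in_col = sum(c != "R" for c in cols[j]) - 1
            if others_in_row and others_in_col:
                problems.append(f"cell ({i}, {j}): non-R cells in both row and column")
    return problems

print(dsr_violations(["DSR", "RRR", "RRS"]))   # []
print(dsr_violations(["DS", "SR"]))
```

The second matrix satisfies rule 1 but fails rule 2, because the D at (0, 0) has a non-R neighbour in its row and another in its column.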

DSR Scheme: Level 2

DSR Scheme: Level 3

DSR Scheme: Level 4

DSR Scheme: Level 5

Mathematical Analysis: Approximation Factor

Greedy DSR Scheme: Z and Y are computable functions of the input

L Parida, A Javed, M Mele, F Calafell, J Bertranpetit and Genographic Consortium, Minimizing Recombinations in Consensus Networks for Phylogeographic Studies, BMC Bioinformatics 2009

IRiS Output: RECOTYPE

Recombination vectors:

    R1  R2  R3  R4  R5  R6  R7  R8  R9 R10 R11 R12 R13 R14 …
s1   1   0   0   0   1   1   1   1   0   0   0   0   1   0 …
s2   0   1   0   1   1   1   0   1   0   0   1   0   0   0 …
…
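Because each recotype is a binary vector over recombination events, two samples can be compared directly. The sketch below uses the first 14 entries of s1 and s2 shown above and counts the events on which they differ; using the Hamming distance as the dissimilarity is an assumption, since the slides do not name a metric.

```python
s1 = [1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
s2 = [0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

def hamming(u, v):
    """Number of recombination events present in exactly one of the two samples."""
    if len(u) != len(v):
        raise ValueError("recotypes must have the same length")
    return sum(a != b for a, b in zip(u, v))

def shared(u, v):
    """Number of recombination events carried by both samples."""
    return sum(a and b for a, b in zip(u, v))

print(hamming(s1, s2))   # 6
print(shared(s1, s2))    # 3
```

A matrix of such pairwise distances is the kind of input a distance-based tree or network method would take.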

Quick Sanity Check: Ultrametric Network on RECOTYPES

IRiS software will be released by the end of summer ’09

Asif Javed

What’s in a name?

1. Allele-frequency variations between populations are also reflected in the purely recombination-based variations

2. Detects subcontinental divide from short segments based on population-level analysis

3. Detects populations from short segments based on recombination-event analysis

RECOMBIN-OMICS (Jaume Bertranpetit)

RECOMBIN-OMETRICS (Robert Elston)

Are we ready for the OMICS / OMETRICS?

o population-specific signals?

o other critical signals?

o anything we didn’t already know?

Thank you!!
