RECOMBINOMICS: Myth or Reality? Laxmi Parida IBM Watson Research New York, USA

Page 1: RECOMBINOMICS : Myth or Reality?

RECOMBINOMICS: Myth or Reality?

Laxmi Parida

IBM Watson Research New York, USA

Page 2: RECOMBINOMICS : Myth or Reality?

IBM Computational Biology Center

RoadMap

1. Motivation

2. Reconstructability (Random Graphs Framework)

3. Reconstruction Algorithm (DSR Algorithm)

4. Conclusion

Page 3: RECOMBINOMICS : Myth or Reality?

Page 4: RECOMBINOMICS : Myth or Reality?

www.nationalgeographic.com/genographic

Page 5: RECOMBINOMICS : Myth or Reality?

www.ibm.com/genographic

Page 6: RECOMBINOMICS : Myth or Reality?

Five-year study, launched in April 2005, to address anthropological questions on a global scale using genetics as a tool

Although fossil records fix human origins in Africa, little is known about the great journey that took Homo sapiens to the far reaches of the earth. How did we, each of us, end up where we are?

Samples are being collected from all around the world, and the mtDNA and Y chromosome are being sequenced and analyzed

phylogeographic question

Page 7: RECOMBINOMICS : Myth or Reality?

DNA material in use under unilinear transmission

58 million bp

0.38%

16,000 bp

Page 8: RECOMBINOMICS : Myth or Reality?

Missing information in unilinear transmissions

past

present

Page 9: RECOMBINOMICS : Myth or Reality?

Table Mountain, Cape Town, South Africa

Page 10: RECOMBINOMICS : Myth or Reality?

Paradigm Shift in Locus & Analysis

Using recombining DNA sequences. Why?

Nonrecombining DNA gives a partial story:

1. represents only a small part of the genome

2. behaves as a single locus

3. unilinear (exclusively male or female) transmission

Recombining DNA: towards more complete information

Challenges:

Computationally very complex

How to comprehend complex reticulations?

Page 11: RECOMBINOMICS : Myth or Reality?

RoadMap

1. Motivation

2. Reconstructability (Random Graphs Framework)

3. Reconstruction Algorithm (DSR Algorithm)

4. Conclusion

L Parida, Pedigree History: A Reconstructability Perspective using Random-Graphs Framework, under preparation.

Page 12: RECOMBINOMICS : Myth or Reality?

The Random Graphs Framework

GRAPH DEF:

1. Infinite number of vertices arranged in finite-sized rows

2. Edges introduced via a random process across immediate rows

PROPERTIES: Address some topological questions

1. First, identify a probability space

2. Then, pose and address specific questions (such as the expected depth of the LCA)

Page 13: RECOMBINOMICS : Myth or Reality?

The Random Graphs Framework

1. Infinite number of vertices with a specific organization

2. Edges introduced via a random process satisfying specific rules

3. Address some topological questions: (a) define a probability space; (b) pose and answer specific questions (such as the expected depth of the LCA)

Wright-Fisher Model

1. Constant population

2. Non-overlapping generations

3. Panmictic
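
The framework above can be illustrated with a minimal sketch that samples a finite fragment of such a graph. The function name `sample_pedigree_rows`, the indexing of rows from present to past, and the rule that every vertex picks two distinct parents uniformly at random from the row above are assumptions made for illustration; the sketch also ignores vertex colors.

```python
import random

def sample_pedigree_rows(pop_size, depth, seed=0):
    """Sample a finite fragment of a Wright-Fisher style pedigree graph.

    Assumptions (not taken from the slides): rows are indexed 0 (present)
    to depth (past), every row has pop_size vertices (constant population),
    generations do not overlap (edges only join adjacent rows), and each
    vertex picks two distinct parents uniformly at random (panmixia).
    Returns parents[d][v] = (p1, p2), indices into row d + 1.
    """
    rng = random.Random(seed)
    parents = []
    for _ in range(depth):
        row = [tuple(rng.sample(range(pop_size), 2)) for _ in range(pop_size)]
        parents.append(row)
    return parents

rows = sample_pedigree_rows(pop_size=6, depth=4)
num_vertices = 6 * (4 + 1)
num_edges = sum(len(pair) for row in rows for pair in row)
print(num_vertices, num_edges)
```

Because edges only run from a row to the row above it, any fragment sampled this way is acyclic, and the edge count grows linearly with the vertex count.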

Page 14: RECOMBINOMICS : Myth or Reality?

The Random Graphs Framework

Page 15: RECOMBINOMICS : Myth or Reality?

Properties of this Pedigree Graph

1. DAG (Directed Acyclic Graph)

2. |E| = O(|V|) for any finite fragment (sparse graph); vertex-centric view

3. Focus on the flow of genetic material: relevant pedigree graph

Page 16: RECOMBINOMICS : Myth or Reality?

Pedigree Graph: GPG(K,N)

K: number of extant units; 2N: population size per generation

Can the model ignore color of vertex?

Page 17: RECOMBINOMICS : Myth or Reality?

Pedigree Graph: GPG(K,N)

K: number of extant units; 2N: population size per generation

Can the model ignore color of vertex?

Forbidden Structure

Page 18: RECOMBINOMICS : Myth or Reality?

Probability Space

Space is non-enumerable

Uniform probability measure? (WF population)

Compute the probability of an event F(h) for a fixed depth h, then take the limit:

Page 19: RECOMBINOMICS : Myth or Reality?

Topological Property of GPG(K,N)

Least Common Ancestor (LCA) of all K extant vertices (TMRCA or GMRCA)

How many LCAs?

Expected Depth of the shallowest LCA

Page 20: RECOMBINOMICS : Myth or Reality?

Infinite number of LCAs in a GPG(4,3) instance

In fact, there exist infinitely many such instances!

Page 21: RECOMBINOMICS : Myth or Reality?

Topological Property of GPG(K,N)

Least Common Ancestor (LCA) (TMRCA or GMRCA)

How many LCAs?

Expected depth of the shallowest “LCA”

MEASURE OF RECONSTRUCTABILITY
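
The depth of the shallowest LCA can be explored numerically. The sketch below is a simulation under stated assumptions, not the analysis behind the talk: rows have a constant size `pop_size`, each vertex picks two distinct parents uniformly at random from the row above, and colors are ignored. The function name `shallowest_lca_depth` is hypothetical. Only vertices that carry extant descendants are followed, in the spirit of the "relevant pedigree graph" mentioned earlier.

```python
import random

def shallowest_lca_depth(k, pop_size, max_depth, rng):
    """Depth of the first row holding a common ancestor of all k extant vertices.

    Assumes k <= pop_size and pop_size >= 2. Returns None if no common
    ancestor appears within max_depth rows.
    """
    # desc[v] = set of extant vertices that vertex v of the current row is an ancestor of
    desc = [{v} if v < k else set() for v in range(pop_size)]
    for depth in range(1, max_depth + 1):
        above = [set() for _ in range(pop_size)]
        for v in range(pop_size):
            if not desc[v]:
                continue  # vertex carries no extant descendants; irrelevant here
            for p in rng.sample(range(pop_size), 2):
                above[p] |= desc[v]
        if any(len(d) == k for d in above):
            return depth
        desc = above
    return None

rng = random.Random(1)
depths = [shallowest_lca_depth(4, 16, 200, rng) for _ in range(200)]
found = [d for d in depths if d is not None]
print("mean depth of shallowest LCA:", sum(found) / len(found))
```

Averaging over many sampled instances gives an empirical estimate of the expected depth for the chosen K and row size.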

Page 22: RECOMBINOMICS : Myth or Reality?

(Genetic Exchange) Sexual Reproduction vs Graph Model

Ancestor without ancestry

Page 23: RECOMBINOMICS : Myth or Reality?

Graph Theory vis-à-vis Population Genetics

1. Graph Theoretic (topological):

CA: common ancestor

LCA: Least CA or Shallowest CA

MRCA: Most Recent CA

TMRCA: The MRCA

2. Graph Theoretic + Biology (Genetic Exchange):

CAA: common ancestor-&-ancestry

LCAA: Least CAA

GMRCA: Grand MRCA

Unilinear Transmission

Page 24: RECOMBINOMICS : Myth or Reality?

Different Models as Subgraphs

mtDNA Tree; NRY Tree

Genetic Exchange Model (ARG)

Pedigree Graph GPG(K,N): each vertex has 2 parents

1. Red Subgraph GPTX(K,N); Blue Subgraph GPTY(K,N): each vertex has 1 parent

2. Mixed Subgraph GPGE(K,N,M): number of vertices per row is no more than KM; each vertex has 1 or 2 parents; M is the number of completely linked segments in each extant unit

Page 25: RECOMBINOMICS : Myth or Reality?

Different Models

GPG(4,8) GPTY(4,8) GPGE(4,8,2)

Page 26: RECOMBINOMICS : Myth or Reality?

Different Models as Subgraphs

LCA GMRCA

LCA TMRCA

LCA GMRCA

Pedigree Graph GPG(K,N)

1. Red Subgraph GPTX(K,N); Blue Subgraph GPTY(K,N)

2. Mixed Subgraph GPGE(K,N,M)

Page 27: RECOMBINOMICS : Myth or Reality?

GPGE(K,N,M) ARG

Ancestral Recombinations Graph (Griffiths & Marjoram ’97)

Embellish GPGE(K,N,M) with Genetic Exchanges (GE)

Each extant unit has M segments

No vertex with zero ancestral segments (to extant units)

Page 28: RECOMBINOMICS : Myth or Reality?

Mixed Subgraph GPGE(K,N,M)

1. Plausible GE assignment?

2. Can GPGE(K,N,M) go colorless?

Yes, through algorithmic subsampling.

Page 29: RECOMBINOMICS : Myth or Reality?

Algorithm: Embellish GPGE(K,N,M)

1. Assign a sequence s to an instance, e.g. s = K, (2K), (2K-7), (2K-15), …

2. Construct M sequences s_i

Each s_i is monotonically decreasing, with s_i[j] no bigger than s[j]

3. Associate each s_i with a segment, and assign each element s_i[j] = k to k randomly selected vertices at depth j
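
The three steps above can be sketched as follows. This is a rough illustration under assumptions that go beyond the slide: s[j] is read as the number of vertices at depth j, "monotonically decreasing" is relaxed to non-increasing, each s_i starts at K because every extant unit carries every segment, and the values are drawn uniformly. The helper names and the example sequence are hypothetical, and the sketch does not enforce that every vertex ends up with at least one ancestral segment.

```python
import random

def build_segment_sequences(s, num_segments, rng):
    """Draw num_segments non-increasing sequences s_i with s_i[j] <= s[j]."""
    seqs = []
    for _ in range(num_segments):
        si = [s[0]]  # each segment is present in all K extant units
        for j in range(1, len(s)):
            # stay at or below both the previous value and the row bound s[j]
            si.append(rng.randint(1, min(si[-1], s[j])))
        seqs.append(si)
    return seqs

def assign_vertices(s, seqs, rng):
    """For each segment and depth j, choose s_i[j] of the s[j] vertices at that depth."""
    return [
        [sorted(rng.sample(range(s[j]), si[j])) for j in range(len(s))]
        for si in seqs
    ]

K = 4
s = [K, 2 * K, 2 * K - 1, 2 * K - 3]  # hypothetical row sizes, not the slide's values
rng = random.Random(0)
seqs = build_segment_sequences(s, num_segments=3, rng=rng)
carriers = assign_vertices(s, seqs, rng)
for si, rows in zip(seqs, carriers):
    print(si, rows)
```

Each printed line shows one segment: its sequence s_i and, per depth, which vertices were chosen to carry it.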

Page 30: RECOMBINOMICS : Myth or Reality?

Algorithm: Constructing seqs…

Page 31: RECOMBINOMICS : Myth or Reality?

“Topological” Definition of LCAA in GPGE(K,N,M)

Input: GPGE(K,N,M) with GE embellishment

LCAA:

1. CA in all M subgraphs (trees)

2. Least such CA
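
A minimal sketch of this definition, assuming each of the M segment trees is given as a child-to-parent dictionary over shared vertex names and that a `depth` map records each vertex's row. The representation, the function names, and the toy example are all hypothetical.

```python
def common_ancestors(parent, leaves):
    """Vertices that lie on the root path of every leaf in one tree."""
    shared = None
    for leaf in leaves:
        path = set()
        v = leaf
        while v in parent:
            v = parent[v]
            path.add(v)
        shared = path if shared is None else shared & path
    return shared or set()

def lcaa(trees, leaves, depth):
    """Shallowest vertex that is a common ancestor in all trees, or None."""
    candidates = None
    for parent in trees:
        ca = common_ancestors(parent, leaves)
        candidates = ca if candidates is None else candidates & ca
    if not candidates:
        return None
    return min(candidates, key=lambda v: depth[v])

# Toy example with two segments over extant vertices a and b.
tree1 = {"a": "x", "b": "x", "x": "r"}             # a and b coalesce at x
tree2 = {"a": "y", "b": "z", "y": "r", "z": "r"}   # a and b coalesce at r
depth = {"a": 0, "b": 0, "x": 1, "y": 1, "z": 1, "r": 2}

print(lcaa([tree1], ["a", "b"], depth))         # LCA in the first tree alone
print(lcaa([tree1, tree2], ["a", "b"], depth))  # LCAA across both trees
```

In the toy example the LCA of the first tree alone is shallower than the LCAA, because the second segment does not coalesce until the deeper vertex.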

Page 32: RECOMBINOMICS : Myth or Reality?

Different Models as Subgraphs

LCAA GMRCA

LCA TMRCA

LCAA GMRCA

Pedigree Graph GPG(K,N)

1. Red Subgraph GPTX(K,N); Blue Subgraph GPTY(K,N)

2. Mixed Subgraph GPGE(K,N,M)

Page 33: RECOMBINOMICS : Myth or Reality?

Probability of Instances with Unique LCA/LCAA

Pedigree Graph GPG(K,N)

1. Red Subgraph GPTX(K,N); Blue Subgraph GPTY(K,N)

2. Mixed Subgraph GPGE(K,N,M)

Page 34: RECOMBINOMICS : Myth or Reality?

GMRCA LCAA LCA & lone pair

TMRCA LCA

GMRCA LCAA LCA & lone node

Pedigree Graph GPG(K,N)

1. Red Subgraph GPTX(K,N); Blue Subgraph GPTY(K,N)

2. Mixed Subgraph GPGE(K,N,M)

“Topological” Defns of LCAA

Page 35: RECOMBINOMICS : Myth or Reality?

Expected Depth E(D) of LCA/LCAA

O(N2)

O(K)

O(KM)

Pedigree Graph GPG(K,N)

1. Red Subgraph GPTX(K,N); Blue Subgraph GPTY(K,N)

2. Mixed Subgraph GPGE(K,N,M)

Page 36: RECOMBINOMICS : Myth or Reality?

RECONSTRUCTABILITY

O(N2)

O(K)

O(KM)

Pedigree Graph GPG(K,N)

1. Red Subgraph GPTX(K,N); Blue Subgraph GPTY(K,N)

2. Mixed Subgraph GPGE(K,N,M)

Page 37: RECOMBINOMICS : Myth or Reality?

Summary: History Reconstruction?

1. Mixed Subgraph models recombinations

Only fragments of the chromosome

2. In reality, only a minimal structure (HUD) of the GPGE(K,N,M) or ARG can be estimated

Forbidden structures …

Page 38: RECOMBINOMICS : Myth or Reality?

RoadMap

1. Motivation

2. Reconstructability (Random Graphs Framework)

3. Reconstruction Algorithm (DSR Algorithm)

4. Conclusion

L Parida, M Mele, F Calafell, J Bertranpetit and Genographic Consortium, Estimating the Ancestral Recombinations Graph (ARG) as Compatible Networks of SNP Patterns, Journal of Computational Biology, vol 15(9), pp 1-22, 2008

L Parida, A Javed, M Mele, F Calafell, J Bertranpetit and Genographic Consortium, Minimizing Recombinations in Consensus Networks for Phylogeographic Studies, BMC Bioinformatics 2009

Page 39: RECOMBINOMICS : Myth or Reality?

INPUT: Chromosomes (haplotypes)

OUTPUT: Recombinational Landscape (Recotypes)

Page 40: RECOMBINOMICS : Myth or Reality?

Our Approach

[Flowchart: Granularity g; IRiS; Acceptable p-value? (YES / NO); Analyze Results. Stage labels: statistical, statistical, combinatorial.]

M Mele, A Javed, F Calafell, L Parida, J Bertranpetit and Genographic Consortium, Recombination-based genomics: a genetic variation analysis in human populations, under submission.

Page 41: RECOMBINOMICS : Myth or Reality?

Preprocess: Dimension reduction via Clustering

[Figure: clustering diagram with node labels 0 to 24]

Page 42: RECOMBINOMICS : Myth or Reality?

Analysis Flow

[Flowchart: Granularity g; IRiS; Acceptable p-value? (YES / NO); Analyze Results. Stage labels: statistical, statistical, combinatorial.]

Page 43: RECOMBINOMICS : Myth or Reality?

p-value Estimation

Page 44: RECOMBINOMICS : Myth or Reality?

Comparison of the Randomization Schemes

Page 45: RECOMBINOMICS : Myth or Reality?

SNP Blocks (granularity g=3)

Page 46: RECOMBINOMICS : Myth or Reality?

Analysis Flow

[Flowchart: Granularity g; IRiS; Acceptable p-value? (YES / NO); Analyze Results. Stage labels: statistical, statistical, combinatorial.]

Page 47: RECOMBINOMICS : Myth or Reality?

Stage Haplotypes: use SNP block patterns

Segment along the length: infer trees

Infer network (ARG)

biological insights

computational insights

IRiS (Identifying Recombinations in Sequences)

L Parida, M Mele, F Calafell, J Bertranpetit and Genographic Consortium, Estimating the Ancestral Recombinations Graph (ARG) as Compatible Networks of SNP Patterns, Journal of Computational Biology, vol 15(9), pp 1-22, 2008

Page 48: RECOMBINOMICS : Myth or Reality?

Segmentation

12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345
11111111111111111111111111111111111111112222222222222222222222222222222222233333333344444444455555555555555----
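
One reading of the rows above is that the first is a position ruler and the second labels each position with a segment number, with "-" marking positions left unassigned. Under that assumption, a label string can be turned into intervals with a short sketch like the one below. The function name `label_runs` and the example string are hypothetical; the example is a shortened stand-in, not the slide's data.

```python
from itertools import groupby

def label_runs(labels):
    """Return (label, start, end) for each run of equal labels.

    Positions are 1-based and end is inclusive.
    """
    runs = []
    pos = 1
    for label, group in groupby(labels):
        length = len(list(group))
        runs.append((label, pos, pos + length - 1))
        pos += length
    return runs

example = "1111222335----"
for label, start, end in label_runs(example):
    print(label, start, end)
```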

Page 49: RECOMBINOMICS : Myth or Reality?

Segmentation

Page 50: RECOMBINOMICS : Myth or Reality?

Consensus of Trees

Page 51: RECOMBINOMICS : Myth or Reality?

Algorithm Design

1. Ensure compatibility of component trees

2. Parsimony model: minimize the no. of recombinations

Page 52: RECOMBINOMICS : Myth or Reality?

Algorithm Design

1. Ensure compatibility of component trees

2. Parsimony model: minimize the no. of recombinations

Theorem: The problem is NP-Hard.

“No polynomial-time algorithm that guarantees optimality is known (and none exists unless P = NP).”

Page 53: RECOMBINOMICS : Myth or Reality?

DSR Scheme (Dominant, Subdominant, Recombinant)

Page 54: RECOMBINOMICS : Myth or Reality?

DSR Scheme: Level 1

Page 55: RECOMBINOMICS : Myth or Reality?

DSR Assignment Rules

1. At most one D per row and column; if no D, at most one S per row and column

2. At most one non-R in the row and column, but not both

Page 56: RECOMBINOMICS : Myth or Reality?

DSR Assignment Rules

1. Each row and each column has at most one D, else has at most one S

2. A non-R can have other non-Rs either in its row or its column but NOT both
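
The two rules can be checked mechanically. The sketch below is one reading of them, not the talk's algorithm: rule 1 is taken to mean that a row or column has at most one D and, when it has no D, at most one S; rule 2 is taken to mean that a non-R cell may share its row or its column with other non-R cells, but not both. The matrix encoding (a list of strings over D, S, R) and the function name `check_dsr_rules` are assumptions.

```python
def check_dsr_rules(grid):
    """Return True if a matrix of 'D', 'S', 'R' labels satisfies both rules."""
    rows = [list(r) for r in grid]
    cols = [list(c) for c in zip(*rows)]

    # Rule 1: each row/column has at most one D; with no D, at most one S.
    for line in rows + cols:
        if line.count("D") > 1:
            return False
        if line.count("D") == 0 and line.count("S") > 1:
            return False

    # Rule 2: a non-R cell may have other non-Rs in its row or in its
    # column, but not in both.
    for i, row in enumerate(rows):
        for j, cell in enumerate(row):
            if cell == "R":
                continue
            others_in_row = sum(1 for x in row if x != "R") - 1
            others_in_col = sum(1 for x in cols[j] if x != "R") - 1
            if others_in_row > 0 and others_in_col > 0:
                return False
    return True

print(check_dsr_rules(["DRR", "RSR", "RRD"]))
print(check_dsr_rules(["DSR", "SRR", "RRR"]))
```

The first matrix passes both rules; the second fails rule 2 because its top-left D has a non-R neighbor in both its row and its column.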

Page 57: RECOMBINOMICS : Myth or Reality?

DSR Scheme: Level 1

Page 58: RECOMBINOMICS : Myth or Reality?

DSR Scheme: Level 2

Page 59: RECOMBINOMICS : Myth or Reality?

DSR Scheme: Level 2

Page 60: RECOMBINOMICS : Myth or Reality?

DSR Scheme: Level 3

Page 61: RECOMBINOMICS : Myth or Reality?

DSR Scheme: Level 3

Page 62: RECOMBINOMICS : Myth or Reality?

DSR Scheme: Level 4

Page 63: RECOMBINOMICS : Myth or Reality?

DSR Scheme: Level 5

Page 64: RECOMBINOMICS : Myth or Reality?

Mathematical Analysis: Approximation Factor

Greedy DSR Scheme

Z and Y are computable functions of the input

L Parida, A Javed, M Mele, F Calafell, J Bertranpetit and Genographic Consortium, Minimizing Recombinations in Consensus Networks for Phylogeographic Studies, BMC Bioinformatics 2009

Page 65: RECOMBINOMICS : Myth or Reality?

Analysis Flow

[Flowchart: Granularity g; IRiS; Acceptable p-value? (YES / NO); Analyze Results. Stage labels: statistical, statistical, combinatorial.]

Page 66: RECOMBINOMICS : Myth or Reality?

IRiS Output: RECOTYPE

Recombination vectors

     R1  R2  R3  R4  R5  R6  R7  R8  R9  R10 R11 R12 R13 R14 ...
s1   1   0   0   0   1   1   1   1   0   0   0   0   1   0   ...
s2   0   1   0   1   1   1   0   1   0   0   1   0   0   0   ...
...
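
As a small illustration of working with recotypes, the sketch below compares the two rows shown on the slide. It assumes that a 1 in column Rk means the sample carries recombination event Rk. Only the first 14 entries are used, since the slide truncates both rows. The helper names are hypothetical.

```python
# First 14 entries of the two recombination vectors shown on the slide;
# both rows are truncated there, so these are prefixes only.
recotypes = {
    "s1": [1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0],
    "s2": [0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0],
}

def hamming(u, v):
    """Number of recombination events on which two recotypes differ."""
    if len(u) != len(v):
        raise ValueError("recotypes must have the same length")
    return sum(a != b for a, b in zip(u, v))

def shared_events(u, v):
    """Names of the events carried by both recotypes."""
    return [f"R{i + 1}" for i, (a, b) in enumerate(zip(u, v)) if a == 1 and b == 1]

print(hamming(recotypes["s1"], recotypes["s2"]))        # 6
print(shared_events(recotypes["s1"], recotypes["s2"]))  # ['R5', 'R6', 'R8']
```

A pairwise distance of this kind is one possible input to a clustering of samples; the slides do not say which distance was actually used.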

Page 67: RECOMBINOMICS : Myth or Reality?

Quick Sanity Check: Ultrametric Network on RECOTYPES

Page 68: RECOMBINOMICS : Myth or Reality?

Stage Haplotypes: use SNP block patterns

Segment along the length: infer trees

Infer network (ARG)

biological insights

computational insights

IRiS (Identifying Recombinations in Sequences)

L Parida, M Mele, F Calafell, J Bertranpetit and Genographic Consortium, Estimating the Ancestral Recombinations Graph (ARG) as Compatible Networks of SNP Patterns, Journal of Computational Biology, vol 15(9), pp 1-22, 2008

IRiS software will be released by the end of summer ’09

Asif Javed

Page 69: RECOMBINOMICS : Myth or Reality?

What’s in a name?

1. Allele-frequency variation between populations is also reflected in purely recombination-based variation

2. Detects the subcontinental divide from short segments, based on population-level analysis

3. Detects populations from short segments, based on analysis of recombination events

RECOMBIN-OMICS (Jaume Bertranpetit)

RECOMBIN-OMETRICS (Robert Elston)

Page 70: RECOMBINOMICS : Myth or Reality?

1. Allele-frequency variation between populations is also reflected in purely recombination-based variation

2. Detects the subcontinental divide from short segments, based on population-level analysis

3. Detects populations from short segments, based on analysis of recombination events

Are we ready for the OMICS / OMETRICS?

o population-specific signals?

o other critical signals?

o anything we didn’t already know?

Page 71: RECOMBINOMICS : Myth or Reality?

Thank you!!
