Scaffolding Large Genomes Using Integer Linear Programming
JAMES LINDSAY*, HAMED SALOOTI, ALEX ZELIKOVSKI, ION MANDOIU*
University of Connecticut*, Georgia State University




De-novo Assembly Paradigm

[Figure: The Genome → (Sequencing) → The Reads → (Assembly) → The Contigs → (Scaffolding) → The Scaffolds]


Why Scaffolding?

• Annotation
• Comparative biology
• Re-sequencing and gap filling
• Structural variation!

[Figure: gene XYZ with its 5' UTR and 3' UTR, shown with a scaffold and with no scaffold]


[Figure: Sanger sequencing of gene XYZ with its 5' UTR and 3' UTR]

Biologist: There are holes in my genes!



Read Pairs

Paired Read Construction

[Figure: paired reads R1 and R2; labels: 2kb, 2kb, same strand and orientation]

Informative Reads
• Align each read against the contigs
• Only accept uniquely mapped reads (use the non-unique reads later)
• Both reads in a pair must map to different contigs
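The filtering rules for informative reads can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `Alignment` record and the `pair/1`, `pair/2` read-naming convention are assumptions introduced only for this example.

```python
from collections import Counter
from typing import NamedTuple


class Alignment(NamedTuple):
    # Hypothetical record for one read-to-contig hit (not from the slides).
    read_id: str  # assumed naming: "<pair>/1" for R1, "<pair>/2" for R2
    contig: str
    strand: str   # "+" or "-"
    pos: int


def informative_pairs(alignments):
    """Return {pair_id: (R1 hit, R2 hit)} for pairs that pass all three rules."""
    hits = Counter(a.read_id for a in alignments)
    # Rule: only accept uniquely mapped reads.
    unique = {a.read_id: a for a in alignments if hits[a.read_id] == 1}
    pairs = {}
    for read_id, r1 in unique.items():
        pair_id, _, mate = read_id.rpartition("/")
        if mate != "1":
            continue
        r2 = unique.get(pair_id + "/2")
        # Rule: both reads must map, and to different contigs.
        if r2 is not None and r1.contig != r2.contig:
            pairs[pair_id] = (r1, r2)
    return pairs
```

Non-unique reads are simply dropped here; the slide notes they are used later, which this sketch does not attempt.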


Linkage Information

Possible States

[Figure: contig i and contig j, each drawn 5' to 3', with reads R1 and R2 in the four states A, B, C, D]

Two contigs are adjacent if:
• A read pair spans the contigs

State (A, B, C, D)
• Depends on orientation of the read
• Order of contigs is arbitrary

Each read pair can be "consistent" with one of the four states
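One way to picture the four states is as a lookup on the strands of the two mapped reads. The slides do not spell out which strand combination corresponds to which letter, so the table below is an assumption chosen only to agree with the constraints slide, which groups A with D (same relative orientation) and B with C (opposite relative orientation).

```python
# Assumed mapping from (strand of R1 on contig i, strand of R2 on contig j)
# to a state letter; the slides only name the states A, B, C, D.
STATE_BY_STRANDS = {
    ("+", "+"): "A",
    ("+", "-"): "B",
    ("-", "+"): "C",
    ("-", "-"): "D",
}


def pair_state(strand_r1, strand_r2):
    """State a read pair is consistent with, under the assumed mapping."""
    return STATE_BY_STRANDS[(strand_r1, strand_r2)]


def implied_relative_orientation(state):
    """0 if the state implies both contigs share an orientation (A, D), else 1 (B, C)."""
    return 0 if state in ("A", "D") else 1
```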


The Scaffolding Problem

Given
• Contigs
• Paired reads

Find
• Orientation
• Ordering
• Relative Distance

Goal
• Recreate true scaffolds

Possible Objectives
• Un-weighted
  • Max number of consistent read pairs
• Weighted
  • Each state is weighted: $W_{ij}^{A}, W_{ij}^{B}, W_{ij}^{C}, W_{ij}^{D}$
    • Overlap with repeat
    • Deviation of expected distance
    • …


Graph Representation

Using the input we can define a scaffolding graph:

$G = (V, E)$
• $V$, set of all contigs
• $E$, set of …

This is an undirected multi-graph

Assume it is connected

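A minimal sketch of building such a multi-graph from classified read pairs is shown below. It assumes each link arrives as a `(contig_i, contig_j, state)` triple with the contig pair always reported in the same order, and it uses the un-weighted objective (each consistent read pair counts 1); the function name and input layout are illustrative, not taken from the slides.

```python
from collections import defaultdict


def build_scaffolding_graph(links):
    """links: iterable of (contig_i, contig_j, state), state in "ABCD".

    Returns (vertices, edges, weights):
      vertices - set of contigs
      edges    - list with one entry per read pair (parallel edges kept)
      weights  - {(contig_i, contig_j): {"A": n, "B": n, "C": n, "D": n}}
    """
    vertices = set()
    edges = []
    weights = defaultdict(lambda: dict.fromkeys("ABCD", 0))
    for i, j, state in links:
        vertices.update((i, j))
        edges.append((i, j, state))
        # Un-weighted objective: count the read pairs consistent with each state.
        weights[(i, j)][state] += 1
    return vertices, edges, dict(weights)
```

Collapsing parallel edges into per-state counts gives one weight per contig pair and state, which is the shape the weights $W_{ij}^{A}, \dots, W_{ij}^{D}$ take in the objective.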

Integer Linear Program Formulation

Variables
• Contig Orientation: $S_i \in \{0,1\}$
• Pairwise Contig Consistency: $S_{ij} \in \{0,1\}$
• Contig Pair State: $A_{ij}, B_{ij}, C_{ij}, D_{ij} \in \{0,1\}$

Objective: Maximize weight of consistent pairs

$$\max \sum_{(i,j) \in E} \left( W_{ij}^{A} A_{ij} + W_{ij}^{B} B_{ij} + W_{ij}^{C} C_{ij} + W_{ij}^{D} D_{ij} \right)$$


Constraints

Pairwise Orientation

$$S_{ij} \le S_j + S_i$$
$$S_{ij} \le 2 - S_i - S_j$$
$$S_{ij} \ge S_j - S_i$$
$$S_{ij} \ge S_i - S_j$$

Mutual Exclusivity

$$A_{ij} + D_{ij} \le 1 - S_{ij}$$
$$B_{ij} + C_{ij} \le S_{ij}$$

Forbid 2 and 3 Cycles Explicitly
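The four pairwise-orientation inequalities are the standard linearization of an exclusive-or: for binary $S_i$ and $S_j$ they leave exactly one feasible value, $S_{ij} = S_i \oplus S_j$. The short check below enumerates all cases to confirm this, and also confirms that the mutual-exclusivity constraints allow at most one of the four states per contig pair. The helper names are illustrative.

```python
from itertools import product


def feasible_s_ij(s_i, s_j):
    """Values of S_ij in {0, 1} that satisfy the four pairwise-orientation constraints."""
    return [
        s_ij
        for s_ij in (0, 1)
        if s_ij <= s_j + s_i
        and s_ij <= 2 - s_i - s_j
        and s_ij >= s_j - s_i
        and s_ij >= s_i - s_j
    ]


def feasible_states(s_ij):
    """Binary (A, B, C, D) assignments allowed by the mutual-exclusivity constraints."""
    return [
        (a, b, c, d)
        for a, b, c, d in product((0, 1), repeat=4)
        if a + d <= 1 - s_ij and b + c <= s_ij
    ]
```

With $S_{ij} = 0$ only A or D can be selected, and with $S_{ij} = 1$ only B or C, so the selected state always agrees with the relative orientation of the two contigs.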


Graph Decomposition: Articulation Points

[Figure: graph split at an articulation point; labels: "solve", "Articulation point"]
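Articulation points can be found with a single depth-first search using discovery times and low-link values. The sketch below is the textbook algorithm, not necessarily the authors' implementation. It assumes the multi-graph has been collapsed to a simple adjacency map, which does not change the articulation points. It is recursive, so a graph with hundreds of thousands of contigs would need an iterative version.

```python
def articulation_points(adj):
    """adj: {vertex: set of neighbours} for an undirected simple graph."""
    disc, low, points = {}, {}, set()
    counter = [0]

    def dfs(u, parent):
        disc[u] = low[u] = counter[0]
        counter[0] += 1
        children = 0
        for v in adj[u]:
            if v == parent:
                continue
            if v in disc:
                # Back edge: u can reach an earlier vertex.
                low[u] = min(low[u], disc[v])
            else:
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                # Non-root u separates v's subtree if that subtree cannot climb above u.
                if parent is not None and low[v] >= disc[u]:
                    points.add(u)
        # The DFS root is an articulation point only if it has several DFS children.
        if parent is None and children > 1:
            points.add(u)

    for start in adj:
        if start not in disc:
            dfs(start, None)
    return points
```

Removing each articulation point splits the graph into pieces that can be handled separately, which is the decomposition the slide illustrates.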


Graph Decomposition: 2-cuts

[Figure: graph split at a 2-cut; labels: "2-cut" and the four sign combinations + +, + -, - +, - -]


Non-Serial Dynamic Programming

• SPQR-tree to schedule decomposition

• Traverse tree using DFS

• NSDP utilizes solutions of previous stage in current stage


Largest Connected Component


Largest Biconnected Component


Largest Triconnected Component


Post Processing ILP Solution

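ILP Solution
• May have cycles
• Not a total ordering

For each connected component:

[Figure: ILP solution graph on contigs A, B, C, D, E, F, and the bipartite graph built from it with an "outgoing" copy and an "incoming" copy of each contig]

Bipartite matching

Objectives:
• Max weight
• Max cardinality
• Max cardinality / Max weight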
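The bipartite step can be sketched with a plain augmenting-path matching. Each contig appears once on the "outgoing" side and once on the "incoming" side, and a matched edge u → v means v is chosen as the successor of u. The code below covers only the max-cardinality objective and assumes the directed edges kept by the ILP are given as a successor map; the weighted variants, and how any remaining cycle is broken, are not shown in the slides and are not attempted here.

```python
def max_cardinality_matching(successors):
    """successors: {contig: iterable of candidate next contigs}.

    Returns {contig: chosen successor}, where every contig has at most one
    successor and at most one predecessor.
    """
    match_in = {}  # incoming copy v -> outgoing copy u currently matched to it

    def try_assign(u, seen):
        for v in successors.get(u, ()):
            if v in seen:
                continue
            seen.add(v)
            # Take v if it is free, or if its current partner can be re-routed.
            if v not in match_in or try_assign(match_in[v], seen):
                match_in[v] = u
                return True
        return False

    for u in successors:
        try_assign(u, set())
    return {u: v for v, u in match_in.items()}
```

Because every contig keeps at most one incoming and one outgoing edge, the result decomposes into simple paths and possibly cycles; the paths can be read off directly as scaffolds.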


Testing Framework

Venter Genome

Read Type    Total Reads  Total Bases  Avg length  Coverage
Sanger       31,861,976   2.79E+10     875         9.930637
SOLiD pairs  4.85E+08     2.42E+10     50          8.623028

4x Assembly

# Reads     # Bases in reads  # Contigs  # Bases in contigs  N50
112,00,000  1.1E+10           422,837    2.26E+09            7704


Testing Metrics

Computer Scientists
• Finding Scaffold = Binary Classification Test
• n contigs, try to predict n-1 adjacencies
• TP, FP, TN, FN, Sensitivity, PPV

Biologists (main focus)
• N50 (roughly a length-weighted median scaffold size; ignore gaps)
• TP50: break scaffold at incorrect edges, then find N50
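These metrics are short enough to write out. The sketch below uses the usual definitions: N50 is the largest length L such that scaffolds of length at least L hold at least half of the total bases, sensitivity is TP / (TP + FN), and PPV is TP / (TP + FP). The scaffold representation (a list of `(contig, length)` tuples) and the `is_correct` callback are assumptions made for illustration; gaps are ignored, as on the slide.

```python
def n50(lengths):
    """Largest length L such that pieces of length >= L cover half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def tp50(scaffolds, is_correct):
    """Break each scaffold at incorrect adjacencies, then take N50 of the pieces.

    scaffolds  - list of scaffolds, each a list of (contig, length) tuples
    is_correct - function (contig_a, contig_b) -> bool for an adjacency
    """
    pieces = []
    for scaffold in scaffolds:
        current, prev = 0, None
        for contig, length in scaffold:
            if prev is not None and not is_correct(prev, contig):
                pieces.append(current)
                current = 0
            current += length
            prev = contig
        if current:
            pieces.append(current)
    return n50(pieces)


def sensitivity(tp, fn):
    return tp / (tp + fn)


def ppv(tp, fp):
    return tp / (tp + fp)
```

A scaffolder that joins contigs aggressively can raise N50 while lowering TP50, which is why the two are reported side by side.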


Results

test case  method  bundle size  sensitivity  ppv     N50     TP50
10%        opera   2            81.13%       99.26%  27,567  27,327
10%        mip     2            59.01%       98.94%  19,988  19,755
10%        ilp     1            79.86%       98.58%  26,814  26,459
25%        opera   2            80.44%       98.27%  27,296  26,849
25%        mip     2            58.95%       97.56%  19,842  19,518
25%        ilp     1            79.30%       96.93%  26,684  26,079
100%       opera   3            pending      …       …       …
100%       mip     3            failed       n/a     n/a     n/a
100%       ilp     1            68.25%       89.90%  20,538  19,006


Conclusions

Success
• ILP solves scaffolding problem!
• NSDP works.

Improvements
• Finalize large test cases (then publish?!)
• Practical considerations (read style, multi-libraries, merge ctgs)

Future Work
• Where else can I apply NSDP?
• Scaffold before assembly??
• Structural Variation??


Questions?