IACR, Kolkata: 6-9 February 2011 © Asoke K Talukder

CNVMiner: Pipeline to Mine CNV & Structural Variation in Hierarchical Fashion

30th Annual Convention of Indian Association for Cancer Research & International Symposium on "Signalling Network and Cancer"
Indian Institute of Chemical Biology (IICB), Kolkata, 6-9 February 2011

Asoke K Talukder, Ph.D
Indian Institute of Information Technology & Management, Gwalior, India


Acknowledgement

• Indian Association for Cancer Research
• Dr Susanta Roy Choudhury & Dr Chitra Mandal
• Indian Institute of Chemical Biology
• Prof Dr Nitaipada Bhattacharyya
• Open Source Software/Foundation
• Authors of Open Source & Open Domain software
• Authors & Publishers making various articles available free on the Web

Hope & Opportunities

• For Cancer Therapeutics, Time is the essence – Speed is the Mantra

• We need in-Silico Algorithms to
  – Make Speedy Diagnosis
  – Make it Reliable
  – Make it Repeatable
  – Make it Scalable
  – Make it Economic

Challenges in Computing

• Biology needs similarity & not identity
• Computers are efficient in discovering identity but not similarity
• All Biology problems are different & unique
• Huge data generated by Next Generation Sequencers with many errors
• Eliminate Noise from Information
• Minimize False Positive and False Negative

Most Biology Solutions are NP-Hard

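• If the data volume increases by a factor of x, the time needed for an exact solution grows much faster than x. NP stands for nondeterministic polynomial time; no polynomial-time exact algorithm is known for NP-hard problems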

• Getting exact solutions may not be possible for some problems on some inputs, without spending a great deal of time

• You may not know when you have an optimal solution, if you use a heuristic

• Arriving at an exact solution is often impractical; however, once a candidate solution is obtained, it can be verified that it is the right solution

• Sometimes exact solutions may not be necessary, and approximate solutions may suffice. But, how good an approximation does the solution need?

Biology + Computing + Mathematics

• Better Predictability
• Higher Accuracy
• Less time to market

CNVMiner: Pipeline to Mine CNV & Structural Variation

• Functions
  – Uses Library in Hierarchical Order
  – Uses Mate-Pair/Paired-end data
  – Determines Links & Structural Variations
  – Calculates Digital Gene Expression

Structural Variation with NGS (Nature Methods, November 2009)

Paired End Mapping (PEM)

Paul Medvedev, Monica Stanciu & Michael Brudno, Computational methods for discovering structural variation with next-generation sequencing, Nature Methods Supplement, Vol. 6 No. 11s, November 2009

NGS Data Types

• Fixed Length short reads (NGS)
  – All sequence reads are short
  – MAQ supports 63 bases (made 100 by us)
  – Bowtie supports 1024 bases
• Variable Length long reads (NGS & Classic)
  – All sequence reads are of variable size
  – Goes even > 1024 bases

NGS Data Formats

• Single-end
• Paired-end
• Mate-pair

[Figure: read layouts, labelled Insert Size, Library Size and Sequence]

• FASTA
• FASTQ
• …

NO ORDER OR ORIENTATION

Method-1 (Simple Variations)

• Align the Paired-end/Mate-pair reads (donor) as Single-end Sequence-reads to the Reference
  – BLAST for long & variable length sequences
  – BOWTIE for short fixed length sequences
• Establish the Link by locating the mates
• Measure the distance between mates
• Establish agreement with Library Inserts
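The "locate the mates" step above can be illustrated with a small sketch. It assumes that the two mates of a pair share a name stem and differ only in a suffix such as `_1FW` / `_1RV` (the naming seen in the PEM output later in the deck); the function name `pair_mates` is hypothetical and not part of CNVMiner.

```python
from collections import defaultdict

def pair_mates(read_names, fw_suffix="_1FW", rv_suffix="_1RV"):
    """Group read names into (forward, reverse) pairs that share a name stem.

    Assumption: mates differ only in their suffix; reads whose mate is
    missing are dropped from the result.
    """
    stems = defaultdict(dict)
    for name in read_names:
        if name.endswith(fw_suffix):
            stems[name[: -len(fw_suffix)]]["fw"] = name
        elif name.endswith(rv_suffix):
            stems[name[: -len(rv_suffix)]]["rv"] = name
    # Keep only stems for which both mates were seen.
    return sorted(
        (m["fw"], m["rv"]) for m in stems.values() if "fw" in m and "rv" in m
    )

names = ["FMC01_D10_1FW", "FMC01_D10_1RV", "FMC01_E03_1FW", "FMC03_B08_1RV"]
pairs = pair_mates(names)
```

Only `FMC01_D10` has both mates in this toy input, so it is the only link that could go on to the distance check.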

Method-2 (Complex Variations)

• Take Unmatched Sequence-reads
• Split them using Sliding-window and do alignment as Single-end read
• Identify the Cluster
• Measure the distance between Mates Clusters
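The sliding-window split can be sketched as follows. The window and step sizes are illustrative assumptions (the slides do not state the values CNVMiner uses), and `sliding_windows` is a hypothetical helper name.

```python
def sliding_windows(read, window, step=1):
    """Return (offset, fragment) pairs of fixed-size windows over `read`.

    Each fragment can then be aligned on its own as a single-end read.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    if len(read) < window:
        # Read shorter than one window: keep it whole.
        return [(0, read)]
    return [
        (i, read[i:i + window])
        for i in range(0, len(read) - window + 1, step)
    ]

fragments = sliding_windows("TCTTCGCCTTCGGCCTTCTT", window=8, step=4)
```

Keeping the offset with each fragment makes it possible to map a fragment hit back to its position in the original unmatched read.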

Method-3 (DGE)

• Identify the Cluster
• Calculate the number of reads
• Calculate the Breadth of the Aligned set
• Calculate the Depth of the Aligned set
• Calculate the DGE (Digital Gene Expression)
  – FPKM (Fragments Per Kilobase of exon Per Million mapped reads)
  – Coverage (depth)
  – Aligned Reference (breadth)
  – Reads Aligned (total number in a cluster)
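A minimal sketch of the per-cluster quantities listed above. The definitions used here are assumptions, since the slides do not define them: breadth is taken as the number of reference bases covered by at least one read, and depth as the mean number of reads over those covered bases. `breadth_and_depth` is a hypothetical name.

```python
def breadth_and_depth(alignments):
    """alignments: list of (start, end) reference intervals, inclusive.

    Returns (reads_aligned, breadth, depth) under the assumed definitions:
    breadth = bases covered at least once, depth = mean coverage over them.
    """
    covered = {}
    for start, end in alignments:
        for pos in range(start, end + 1):
            covered[pos] = covered.get(pos, 0) + 1
    breadth = len(covered)
    total_bases = sum(covered.values())
    depth = total_bases / breadth if breadth else 0.0
    return len(alignments), breadth, depth

cluster = [(100, 109), (105, 114), (110, 119)]
reads_aligned, breadth, depth = breadth_and_depth(cluster)
```

The per-base dictionary keeps the sketch short; for genome-scale clusters an interval sweep would be the more economical choice.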

Hierarchical Libraries

priority 1
library Large 25000 35000
pair _1FW _1RV
priority 2
library Moderate 15000 25000
pair _3FW _3RV
priority 4
library Small 8000 15000
pair _4FW _4RV
priority 5
library Tiny 3000 6000
pair _5FW _5RV
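The library description above could be read into a priority-ordered list along these lines. This is a sketch under assumptions: the two numbers on a `library` line are taken to be the minimum and maximum insert size, a record is taken to end at its `pair` line, and `parse_libraries` is a hypothetical name.

```python
def parse_libraries(text):
    """Parse 'priority / library / pair' records into dicts sorted by priority."""
    libraries = []
    current = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        key = fields[0]
        if key == "priority":
            current = {"priority": int(fields[1])}
        elif key == "library":
            # Assumed meaning: name, minimum insert size, maximum insert size.
            current.update(
                name=fields[1],
                min_insert=int(fields[2]),
                max_insert=int(fields[3]),
            )
        elif key == "pair":
            # Assumed meaning: name suffixes of the forward and reverse mates.
            current.update(fw_suffix=fields[1], rv_suffix=fields[2])
            libraries.append(current)
            current = {}
    return sorted(libraries, key=lambda lib: lib["priority"])

config_text = """priority 2
library Moderate 15000 25000
pair _3FW _3RV
priority 1
library Large 25000 35000
pair _1FW _1RV
"""
libs = parse_libraries(config_text)
```

Sorting by priority lets the pipeline walk the libraries in hierarchical order regardless of how the file is laid out.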

Alignment – Blast (for Variable Length Data)

# BLASTN 2.2.23+
# Query: FMC01.F_A01_length_948
# Database: mciceri_29cont_454_illumina
# Fields: query id, subject id, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
# 1 hits found
FMC01.F_A01_length_948 chr_14_length_1427689 98.18 933 8 9 17 948 1131593 1130669 0.0 1635
# BLASTN 2.2.23+
# Query: FMC01.F_A02_length_992
# Database: mciceri_29cont_454_illumina
# Fields: query id, subject id, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
# 1 hits found
FMC01.F_A02_length_992 chr_19_length_679487 97.42 968 10 15 19 986 381039 381991 0.0 1650
# BLASTN 2.2.23+
# Query: FMC01.F_A03_length_1164
# Database: mciceri_29cont_454_illumina
# Fields: query id, subject id, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
# 1 hits found
FMC01.F_A03_length_1164 chr_19_length_679487 97.86 1076 4 19 16 1090 508930 507873 0.0 1847
# BLASTN 2.2.23+
# Query: FMC01.R_A04_length_1192
# Database: mciceri_29cont_454_illumina
# Fields: query id, subject id, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
# 2 hits found
FMC01.R_A04_length_1192 chr_22_length_757631 97.98 1141 8 15 7 1142 705343 704213 0.0 1975
FMC01.R_A04_length_1192 chr_12_length_43706 97.98 1141 8 15 7 1142 29181 28051 0.0 1975
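A sketch of reading hit lines like those above into typed records. The twelve columns follow the `# Fields:` comment in the output; the `BlastHit` type, the `parse_blast_tabular` name and the strand rule (minus strand when s. start exceeds s. end) are illustrative choices, not CNVMiner code.

```python
from typing import NamedTuple

class BlastHit(NamedTuple):
    query: str
    subject: str
    identity: float
    length: int
    mismatches: int
    gap_opens: int
    q_start: int
    q_end: int
    s_start: int
    s_end: int
    evalue: float
    bit_score: float

    @property
    def strand(self):
        # Subject coordinates run backwards for a reverse-strand hit.
        return "-" if self.s_start > self.s_end else "+"

def parse_blast_tabular(text):
    """Parse commented tabular BLAST output, skipping '#' comment lines."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        f = line.split()
        hits.append(BlastHit(
            f[0], f[1], float(f[2]), int(f[3]), int(f[4]), int(f[5]),
            int(f[6]), int(f[7]), int(f[8]), int(f[9]),
            float(f[10]), float(f[11]),
        ))
    return hits

sample = """# BLASTN 2.2.23+
# 1 hits found
FMC01.F_A01_length_948 chr_14_length_1427689 98.18 933 8 9 17 948 1131593 1130669 0.0 1635
"""
hits = parse_blast_tabular(sample)
```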

Alignment – Bowtie (for Fixed Length Data)

HWUSI-EAS705_9146:3:24:828:1109/1 0 chr1_length_4160774 1374500 255 100M * 0 0 TCTTCGCCTTCGGCCTTCTTGTCGCGGGCGATTTCCTTGCCGGTGGCCTGGTCGACGACCTTCATCGACAGGCGGACCTTGCCGCGCTCGTCGAAGCCCA %%%%%%%%%%%%%%%41213:/=;555440323113=44;;>1=;?=1>>=53>;?A/>8=?;===A;?A5AA9A4?B?AAAB@BA>AAA<ABAB@@A@< XA:i:0 MD:Z:100 NM:i:0

HWUSI-EAS705_9146:3:98:1103:366/1 0 chr1_length_4160774 1374501 255 100M * 0 0 CTTCGCCTTCGGCCTTCTTGTCGCGGGCGATTTCCTTGCCGGTGGCCTGGTCGACGACCTTCATCGACAGGCGGACCTTGCCGCGCTCGTCGAAGCCCAT 444454313355544455544433244445661493/3;;565=;491=;5;54==3=;;>;5;;;95>><:==53=?2>??=>A;A=A?A>?AB>AA>A XA:i:0 MD:Z:100 NM:i:0

HWUSI-EAS705_9146:3:20:433:1834/1 0 chr1_length_4160774 1374502 255 100M * 0 0 TTCGCCTTCGGCCTTCTTGTCGCGGGCGATTTCCTTGCCGGTGGCCTGGTCGACGACCTTCATCGACAGGCGGACCTTGCCGCGCTCGTCGAAGCCCATC BAA<AB=?A30@A?AAA>?9=B<=>5;8;=>?4:=919;3555/554533;35;55555;5554%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% XA:i:0 MD:Z:100 NM:i:0
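The Bowtie records above are in SAM format, whose eleven mandatory columns are tab-separated and followed by optional `TAG:TYPE:VALUE` fields. A minimal sketch of reading one record; the example line is a shortened, hypothetical 10-base record built for illustration, and `parse_sam_line` is not a CNVMiner function.

```python
def parse_sam_line(line):
    """Parse one tab-separated SAM alignment line into a dict."""
    f = line.rstrip("\n").split("\t")
    record = {
        "qname": f[0],
        "flag": int(f[1]),
        "rname": f[2],
        "pos": int(f[3]),       # 1-based leftmost reference position
        "mapq": int(f[4]),
        "cigar": f[5],
        "seq": f[9],
        "qual": f[10],
        "tags": {},
    }
    # Bit 0x10 of FLAG marks a read aligned to the reverse strand.
    record["reverse"] = bool(record["flag"] & 0x10)
    for tag in f[11:]:
        name, typ, value = tag.split(":", 2)
        record["tags"][name] = int(value) if typ == "i" else value
    return record

# Shortened, hypothetical record (real reads in the slide are 100 bases long).
line = "\t".join([
    "HWUSI-EAS705_9146:3:24:828:1109/1", "0", "chr1_length_4160774",
    "1374500", "255", "10M", "*", "0", "0",
    "TCTTCGCCTT", "%%%%%41213", "XA:i:0", "MD:Z:10", "NM:i:0",
])
rec = parse_sam_line(line)
```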

Grouping (chr/contig wise) (.contig – TIGR/AMOS Format)

##gi_317165637_gb_CP002447.1_length_6353983 100 6353983 bases
#FMC01_A01_xFW(3763405) [] 948 bases, {392 948} <3763405 3763971>
#FMC01_A02_xFW(4429399) [] 992 bases, {20 899} <4429399 4430278>
#FMC01_B08_xFW(2257526) [RC] 1140 bases, {1112 302} <2257526 2256713>
#FMC01_D03_xFW(3775037) [] 1130 bases, {12 1119} <3775037 3776153>
#FMC01_F01_xFW(2444650) [RC] 1017 bases, {444 270} <2444650 2444473>
#FMC01_F03_xFW(438175) [] 1061 bases, {15 990} <438175 439151>
#FMC01_F12_xFW(196934) [RC] 680 bases, {371 8} <196934 196568>
#FMC01_G05_xFW(3663438) [] 1159 bases, {13 308} <3663438 3663734>
#FMC01_H08_xFW(4782980) [] 935 bases, {21 935} <4782980 4783894>
#FMC01_A08_xRV(3555569) [] 1174 bases, {655 1164} <3555569 3556080>
#FMC01_A08_xRV(3555463) [] 1174 bases, {89 165} <3555463 3555539>
#FMC01_C04_xRV(1933307) [] 1134 bases, {5 1083} <1933307 1934392>
#FMC01_D02_xRV(1039634) [RC] 1163 bases, {1112 8} <1039634 1038528>
#FMC01_D10_xRV(927106) [] 1203 bases, {84 447} <927106 927469>
#FMC01_E03_xRV(5326284) [] 1150 bases, {5 1059} <5326284 5327343>
#FMC01_E10_xRV(622907) [] 1175 bases, {67 1073} <622907 623932>
#FMC01_E11_xRV(1634970) [] 1176 bases, {520 1121} <1634970 1635571>
#FMC01_F08_xRV(3554606) [RC] 1168 bases, {756 552} <3554606 3554382>
#FMC01_F10_xRV(3812335) [] 1207 bases, {9 978} <3812335 3813307>
#FMC01_F11_xRV(6125371) [RC] 1180 bases, {118 28} <6125371 6125281>
#FMC01_F12_xRV(152024) [] 1146 bases, {12 1034} <152024 153047>
#FMC03_C06_xFW(5850746) [] 1154 bases, {12 311} <5850746 5851051>
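Read lines in the grouping file above follow a regular pattern, so they can be picked apart with a regular expression. The field interpretations in this sketch are assumptions drawn from the layout alone: the braces are taken as the aligned range within the read, the angle brackets as the range on the reference, and `[RC]` as a reverse-complement marker. `parse_contig_read` is a hypothetical name.

```python
import re

READ_LINE = re.compile(
    r"^#(?P<name>[^(\s]+)\((?P<offset>\d+)\) "
    r"\[(?P<rc>RC)?\] (?P<length>\d+) bases, "
    r"\{(?P<r1>\d+) (?P<r2>\d+)\} <(?P<g1>\d+) (?P<g2>\d+)>$"
)

def parse_contig_read(line):
    """Parse one '#name(offset) [RC] N bases, {a b} <c d>' read line."""
    m = READ_LINE.match(line.strip())
    if m is None:
        return None  # header line ('##...') or unrecognised layout
    return {
        "name": m["name"],
        "offset": int(m["offset"]),
        "reverse": m["rc"] is not None,
        "length": int(m["length"]),
        "read_range": (int(m["r1"]), int(m["r2"])),   # assumed: within the read
        "ref_range": (int(m["g1"]), int(m["g2"])),    # assumed: on the reference
    }

entry = parse_contig_read(
    "#FMC01_B08_xFW(2257526) [RC] 1140 bases, {1112 302} <2257526 2256713>"
)
header = parse_contig_read(
    "##gi_317165637_gb_CP002447.1_length_6353983 100 6353983 bases"
)
```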

Intermediate XML File

<?xml version="1.0" ?>
<EVIDENCE ID="project_1" DATE="Mon Feb 7 15:25:13 IST 2011" PROJECT="MyProject" PARAMETERS="">
  <LIBRARY ID="lib_large" NAME="large" MIN="2000" MAX="60000">
    <INSERT ID="ins_200" NAME="PL9_E9">
      <SEQUENCE ID="seq_124" NAME="PL9_E9_FWR"/>
      <SEQUENCE ID="seq_125" NAME="PL9_E9_RVR"/>
    </INSERT>
    <INSERT ID="ins_201" NAME="PL9_H3">
      <SEQUENCE ID="seq_156" NAME="PL9_H3_FWR"/>
      <SEQUENCE ID="seq_157" NAME="PL9_H3_RVR"/>

. . .

<SEQUENCE ID="seq_124" ORI="EB" ASM_LEND="853973" ASM_REND="853707"/>
<DIFF_R_TO_L ID="seq_124" ORI="EB" DIFF = 266/>
<SEQUENCE ID="seq_125" ORI="BE" ASM_LEND="853707" ASM_REND="853973"/>
<DIFF_L_TO_R ID="seq_125" ORI="BE" DIFF = 266/>

Unmatched Data(Inversion & Fuse)

CGCCTTCGGCCTTCTTGTCGCGGGCGATTTCCTTGCCGGTGGCCTGGTCGACGAC CGCCTTCGGCCTTCTTGTCGCGGGCGATTTCCTTGCCGGTGGCCTGGTCGACGAC GCCTTCGGCCTTCTTGTCGCGGGCGATTTCCTTGCCGGTGGCCTGGTCGACGACC GCCTTCGGCCTTCTTGTCGCGGGCGATTTCCTTGCCGGTGGCCTGGTCGACGACCT GCCTTCGGCCTTCTTGTCGCGGGCGATTTCCTTGCCGGTGGCCTGGTCGACGACCTT CCTTCGGCCTTCTTGTCGCGGGCGATTTCCTTGCCGGTGGCCTGGTCGACGACCTT CCTTCGGCCTTCTTGTCGCGGGCGATTTCCTTGCCGGTGGCCTGGTCGACGACCTTC CTTCGGCCTTCTTGTCGCGGGCGATTTCCTTGCCGGTGGCCTGGTCGACGACCTTC TTCGGCCTTCTTGTCGCGGGCGATTTCCTTGCCGGTGGCCTGGTCGACGACCTTCA CGGCCTTCTTGTCGCGGGCGATTTCCTTGCCGGTGGCCTGGTCGACGACCTTCATC GGCCTTCTTGTCGCGGGCGATTTCCTTGCCGGTGGCCTGGTCGACGACCTTCATCGAC GGCCTTCTTGTCGCGGGCGATTTCCTTGCCGGTGGCCTGGTCGACGACCTTCATCGAC GCCTTCTTGTCGCGGGCGATTTCCTTGCCGGTGGCCTGGTCGACGACCTTCATCGACT CCTTCTTGTCGCGGGCGATTTCCTTGCCGGTGGCCTGGTCGACGACCTTCATCGACTA

TCTTCGCCTTCGGCCTTCTTGTCGCGGGCGATTTCCTTGCCGGTGGCCTGGTCGACGACCTTCATCGACAGGC

Alignment in Genome Viewer

Mining the Variation (PEM)

Sequence #FMC01_D10_1FW (2617431 <--> 2616304) <====> #FMC01_D10_1RV (3555790 <--> 3555424) in Library "Large" Priority 1

Invalid Link: INSERTION : Effective Gap 937993 bases

Sequence #FMC01_E03_1FW (6000798 <--> 6001079) <====> #FMC01_E03_1RV (6025479 <--> 6024437) in Library "Large" Priority 1

Invalid Link: DELETION : Effective Gap 23358 bases

Sequence #FMC03_B08_1FW (3405991 <--> 3406934) <====> #FMC03_B08_1RV (3449204 <--> 3448283) in Library "Large" Priority 1

Invalid Link: INSERTION : Effective Gap 41349 bases

Sequence #FMC03_B10_1FW (4009642 <--> 4009447) <====> #FMC03_B10_1RV (4043713 <--> 4042741) in Library "Large" Priority 1

Valid Link: Insert Size 33099 bases

Sequence #FMC03_D10_1FW (4002883 <--> 4002973) <====> #FMC03_D10_1RV (4049290 <--> 4048293) in Library "Large" Priority 1

Invalid Link: INSERTION : Effective Gap 45320 bases
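The five links reported above are all reproduced by one simple rule, sketched here. The rule is inferred from these examples rather than taken from CNVMiner's source: the gap is the distance from the outermost forward-mate coordinate to the nearest reverse-mate coordinate, and it is compared with the "Large" library range of 25000 to 35000 given on the Hierarchical Libraries slide. The labels follow the slide's own usage, and `classify_link` is a hypothetical name. The sketch also assumes the forward mate lies upstream of the reverse mate on the reference.

```python
def classify_link(fw, rv, min_insert, max_insert):
    """Classify a mate pair by the reference gap between its two mates.

    fw, rv: (coordinate, coordinate) of each mate on the reference, in
    either order. Returns (label, gap).
    """
    gap = min(rv) - max(fw)
    if gap < min_insert:
        return "DELETION", gap
    if gap > max_insert:
        return "INSERTION", gap
    return "VALID", gap

LARGE = (25000, 35000)  # 'library Large 25000 35000'
e03 = classify_link((6000798, 6001079), (6025479, 6024437), *LARGE)
b10 = classify_link((4009642, 4009447), (4043713, 4042741), *LARGE)
d10 = classify_link((2617431, 2616304), (3555790, 3555424), *LARGE)
```

With these inputs the computed gaps are 23358, 33099 and 937993 bases, matching the figures printed for the same three pairs.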

CNVMine (Sequence Data)

donEnd   donStart  donDiff  refEnd   refStart  refDiff  Chr/Contig  SV_Info
-------------------------------------------------------------------------------------------
1484649  1520201   35552    121547   160092    38545    chr_15      Delete(2993)
1760942  1763068   2126     485407   486677    1270     chr_15      Insert(856)
1834755  1946223   111468   556660   574404    17744    chr_15      Insert(93724)
2296143  2304884   8741     1029365  1037561   8196     chr_15      Insert(545)
2467331  2494711   27380    1182894  1212071   29177    chr_15      Delete(1797)
2497348  2505581   8233     1215390  1222895   7505     chr_15      Insert(728)
2669853  4111343   1441490  1409178  1416675   7497     chr_15      Insert(1433993)
2898970  2912959   13989    1653470  1670528   17058    chr_15      Delete(3069)
2918147  2928707   10560    1675614  1686668   11054    chr_15      Delete(494)
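The SV_Info column above is consistent with a simple difference of the two span columns: when the donor span exceeds the reference span the row is labelled Insert, and when it is shorter the row is labelled Delete. A sketch of that arithmetic, with `structural_variant` as a hypothetical name and the rule inferred from the table rather than from CNVMiner's source:

```python
def structural_variant(don_diff, ref_diff):
    """Label a segment by comparing the donor span with the reference span."""
    delta = don_diff - ref_diff
    if delta > 0:
        return f"Insert({delta})"
    if delta < 0:
        return f"Delete({-delta})"
    return "None"

# (donDiff, refDiff) pairs copied from the first, second and seventh rows.
labels = [
    structural_variant(35552, 38545),
    structural_variant(2126, 1270),
    structural_variant(1441490, 7497),
]
```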

Digital Gene Expression

refStart  refEnd   donStart  donEnd   refBsize  donBSize
----------------------------------------------------------------
9448      9457     4643011   4643020  109       109
12635     12649    4645735   4645749  114       114
38249     38263    4654405   4654419  114       114
342068    73135    4687372   4700079  269033    12707
87020     87029    4700707   4700716  109       109
91302     91303    4702965   4702966  101       101
1608380   1608379  4728865   4728866  101       101
1607063   1607021  4730161   4730203  142       142
377588    377581   4760359   4760366  107       107
1578176   377406   4760494   4907831  1200870   147337
377302    376991   4760645   4760956  411       411
376767    375860   4761180   4762087  1007      1007

FPKM

refBStart  refEnd   donStart  donEnd  Alignments  FPKM
------------------------------------------------------------------
1374500    1374654  14041     14095   41          8982.2458243511
1374752    1374931  14293     14372   29          5465.9640075694
1375022    1375167  14563     14608   62          14425.9853878729
1376391    1376524  15932     15965   29          7356.4477996611
1377079    1377344  16491     16656   138         17569.3224352609
1381298    1381405  20747     20754   12          3783.7224261228
1384517    1384622  23975     23980   6           1927.8966647388
1384875    1385026  24333     24384   25          5585.7933167100
1417360    1417469  56951     56960   7           2227.9937870802
1423415    1423524  63402     63411   21          6500.0185714816
1427353    1427472  65583     65602   15          4252.7132310414
1462473    1462598  99400     99425   4           1079.6221322537
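The standard FPKM definition is fragments × 10^9 / (region length in bases × total mapped fragments). The sketch below applies it to rows of the table above. Two values are assumptions: the region length is taken as refEnd minus refBStart, and the total of 29640 mapped reads is not stated anywhere in the slides but was back-calculated because it reproduces the printed figures.

```python
def fpkm(fragments, length_bp, total_mapped):
    """Fragments Per Kilobase per Million mapped fragments."""
    if length_bp <= 0 or total_mapped <= 0:
        raise ValueError("length_bp and total_mapped must be positive")
    return fragments * 1e9 / (length_bp * total_mapped)

TOTAL_MAPPED = 29640  # assumption: inferred from the table, not stated in the slides

row1 = fpkm(41, 1374654 - 1374500, TOTAL_MAPPED)
row3 = fpkm(62, 1375167 - 1375022, TOTAL_MAPPED)
```

Under these assumptions `row1` and `row3` agree with the first and third FPKM entries of the table to within rounding.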

Next Version

• Define these Structural Variation Loci as Biomarkers
• Use Structural Variation Loci along with SNP in GWAS
• Make this Cloud Computing Enabled

[Figure: "Human Architecture! Growth Performance", with axes labelled Growth and Age (5, 10, 15, 20, 25, 30, 35, 40, 45, . . .). Source: Rajkumar Buyya]

[Figure: "Computational Power Improvement" against Number of Processors (1, 2, . . .), with labels Multiprocessor (fat, fatter) and Uniprocessor Supercomputers (tall, taller). Source: Rajkumar Buyya]

Biology Solutions – Hype or Hope

• As Computers are becoming fatter (multiple cores) and clusters are becoming cheaper, it is slowly becoming possible to attempt solving NP-hard problems in Biology

• All computing algorithms to solve biology problems must be parallel & distributed

• HPC (High Performance Computing) and Parallel Programming will play a significant role in this attempt

Cloud Computing

• If you need milk, you need not buy a Cow
• Cloud computing is an emerging computing paradigm where data and applications reside in cyberspace; scientists and clinicians access their data and information through any web-connected device, be it fixed or mobile
• A biologist need not be constrained by the capability of the tool or the computing resources

Conclusion & Way Forward

• Combine and leverage the advancements of Computing Technologies like Cheaper Hardware & the Cloud

• Efficient and Optimized Algorithms

• Interdisciplinary team of Computer Scientists, Biologists, Mathematicians, Statisticians and HPC experts

• Pave the way for Affordable & Personalized Medicine

Thank You

Toolsmith: Asoke K Talukder, Ph.D
Email: “asoke” dot “talukder” (at) “geschcikten” dot “com”

Workshop: