Comparison of Genomic DNA to cDNA Alignment Methods Miguel Galves and Zanoni Dias Institute of Computing – Unicamp – Campinas – SP – Brazil {miguel.galves,zanoni}@ic.unicamp.br Scylla Bioinformatics – Campinas – SP – Brazil {miguel,zanoni}@scylla.com.br


Page 1: Comparison of Genomic DNA to cDNA Alignment Methods

Comparison of Genomic DNA to cDNA Alignment Methods

Miguel Galves and Zanoni Dias

Institute of Computing – Unicamp – Campinas – SP – Brazil

{miguel.galves,zanoni}@ic.unicamp.br

Scylla Bioinformatics – Campinas – SP – Brazil

{miguel,zanoni}@scylla.com.br

Page 2: Comparison of Genomic DNA to cDNA Alignment Methods

Agenda

– Introduction
– Problem
– Aligners
– Data set
– Subsets
– Evaluation Methods
– Results: Exact Alignments
– Results: EST Alignments
– Running Time Comparison
– Conclusions

Page 3: Comparison of Genomic DNA to cDNA Alignment Methods

Introduction

Identifying genes in uncharacterized DNA sequences is one of the greatest challenges in genomics

EST-to-DNA alignment is one of the most common methods for doing so

ESTs are key to understanding the inner workings of an organism
– Humans have between 30000 and 35000 genes
– Alternative splicing plays an important role in diversity

Page 4: Comparison of Genomic DNA to cDNA Alignment Methods

Problem

[Figure: an mRNA sequence divided into exons and introns, and the mature mRNA obtained from it. Labels: mRNA, Mature mRNA, Intron, Exon.]

Page 5: Comparison of Genomic DNA to cDNA Alignment Methods

Problem: How to solve it?

Classic algorithms
– Dynamic programming

Heuristic-based algorithms
– Multi-step
– Based on other tools such as BLAST and local alignments

Page 6: Comparison of Genomic DNA to cDNA Alignment Methods

Aligners

Java implementations of global and semi-global aligners
– Affine gap penalty function
– Linear space
– Global algorithm by Miller and Myers (1988)
– Semi-global algorithm based on the global algorithm

Heuristic-based algorithms
– sim4, Spidey and est_genome
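The classic approach listed on this slide, semi-global alignment with an affine gap penalty, can be illustrated with a minimal Python sketch. It is not the authors' Java implementation: it computes only the best score (no traceback) and does not reproduce the Miller and Myers linear-space recursion. The function name `semi_global_score` is hypothetical, and reading a score system such as (1,-2,-1,0) as (match, mismatch, gap open, gap extend), with a gap of length k costing `gap_open + k * gap_extend`, is an assumption.

```python
NEG = float("-inf")


def semi_global_score(genomic, cdna, match=1, mismatch=-2,
                      gap_open=-1, gap_extend=0):
    """Best score for aligning all of `cdna` against a stretch of `genomic`.

    Unaligned genomic bases before and after the cDNA are free
    (the semi-global part). Assumed gap cost: gap_open + k * gap_extend.
    """
    m = len(cdna)
    first_gap = gap_open + gap_extend  # cost of the first base of a gap

    # Three states per cell (Gotoh-style affine gaps):
    #   M: column pairs a genomic base with a cDNA base
    #   D: genomic base against a gap (an intron, seen from the cDNA)
    #   I: cDNA base against a gap
    prev_m = [0] + [NEG] * m
    prev_d = [NEG] * (m + 1)
    prev_i = [NEG] + [gap_open + j * gap_extend for j in range(1, m + 1)]
    best = max(prev_m[m], prev_d[m], prev_i[m])

    for g in genomic:
        # Column 0 stays at 0: skipping a genomic prefix costs nothing.
        cur_m = [0] + [NEG] * m
        cur_d = [NEG] * (m + 1)
        cur_i = [NEG] * (m + 1)
        for j in range(1, m + 1):
            s = match if g == cdna[j - 1] else mismatch
            cur_m[j] = max(prev_m[j - 1], prev_d[j - 1], prev_i[j - 1]) + s
            cur_d[j] = max(prev_m[j] + first_gap,
                           prev_i[j] + first_gap,
                           prev_d[j] + gap_extend)
            cur_i[j] = max(cur_m[j - 1] + first_gap,
                           cur_d[j - 1] + first_gap,
                           cur_i[j - 1] + gap_extend)
        # Taking the maximum over every row makes the genomic suffix free.
        best = max(best, cur_m[m], cur_d[m], cur_i[m])
        prev_m, prev_d, prev_i = cur_m, cur_d, cur_i
    return best


if __name__ == "__main__":
    # Two 4-base "exons" separated by an 8-base "intron", with 2-base flanks.
    genomic = "TT" + "ACGT" + "GTAAAAAG" + "TGCA" + "CC"
    print(semi_global_score(genomic, "ACGTTGCA"))
```

Under these assumptions a gap extension cost of 0 means an intron of any length costs the same as a one-base gap, which is one plausible reason such score systems suit cDNA-to-genomic alignment.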

Page 7: Comparison of Genomic DNA to cDNA Alignment Methods

Data Set

Human genome database
– Based on FASTA and GenBank flat-format files from the NCBI repository

Filtering criteria
– Genes, mRNAs and CDS with the /pseudo tag
– mRNAs without any CDS
– Genes without any mRNA
– CDS matching wrong patterns

23124 genes and 27448 mRNAs stored in database
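As a rough illustration of how the filtering criteria above could be applied, here is a small Python sketch. The record layout (nested dicts with `pseudo`, `mrnas` and `cds` keys), the function name `filter_gene` and the `cds_is_valid` callback are all hypothetical; the slides do not say how the database was structured or what counts as a "wrong pattern", so that check is left to the caller. Dropping only the offending CDS (rather than its whole mRNA) is likewise an assumption.

```python
def filter_gene(gene, cds_is_valid):
    """Return a cleaned copy of `gene`, or None if it should be discarded.

    Hypothetical layout:
      gene = {"pseudo": bool, "mrnas": [mrna, ...]}
      mrna = {"pseudo": bool, "cds": [cds, ...]}
      cds  = {"pseudo": bool, ...}
    `cds_is_valid` is a caller-supplied predicate standing in for the
    unspecified "wrong pattern" check.
    """
    if gene.get("pseudo"):
        return None  # gene tagged /pseudo

    kept_mrnas = []
    for mrna in gene.get("mrnas", []):
        if mrna.get("pseudo"):
            continue  # mRNA tagged /pseudo
        kept_cds = [c for c in mrna.get("cds", [])
                    if not c.get("pseudo") and cds_is_valid(c)]
        if kept_cds:  # mRNAs without any CDS are dropped
            kept_mrnas.append({**mrna, "cds": kept_cds})

    if not kept_mrnas:
        return None  # genes without any mRNA are dropped
    return {**gene, "mrnas": kept_mrnas}


if __name__ == "__main__":
    gene = {"pseudo": False,
            "mrnas": [{"pseudo": False, "cds": [{"pseudo": False}]},
                      {"pseudo": False, "cds": []}]}
    cleaned = filter_gene(gene, cds_is_valid=lambda cds: True)
    print(len(cleaned["mrnas"]))
```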

Page 8: Comparison of Genomic DNA to cDNA Alignment Methods

Subsets

Subset 1: 66 genes from chromosome Y with fewer than 100000 bases

Subset 2: 50 complete genes from chromosome Y with fewer than 100000 bases

Subset 3: 8056 complete genes from all chromosomes with fewer than 100000 bases

Subset 4: 493 artificial ESTs based on complete genes from chromosome 6 with fewer than 100000 bases

Page 9: Comparison of Genomic DNA to cDNA Alignment Methods

Evaluation methods

– Number of gaps introduced in the aligned gene sequence
– Delta exons
– Base similarity percentage
– Mismatch percentage
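The slides name these measures without defining them, so the following Python sketch only shows one plausible way to compute three of them from a gapped pairwise alignment. Every definition here is an assumption: gaps are counted as `-` characters in the genomic row, the mismatch percentage is taken over columns where both rows hold a base, and delta exons is the number of cDNA blocks found minus the annotated exon count. Base similarity is not attempted because its definition cannot be inferred from the slides. The function name `alignment_metrics` is hypothetical.

```python
def alignment_metrics(genomic_row, cdna_row, annotated_exons):
    """Compute assumed versions of three evaluation measures.

    Both rows are equal-length strings of bases and '-' gap characters.
    """
    if len(genomic_row) != len(cdna_row):
        raise ValueError("alignment rows must have the same length")

    # Gaps introduced in the aligned gene (genomic) sequence.
    gaps_in_gene = genomic_row.count("-")

    # Mismatches among columns that pair two bases.
    paired = [(g, c) for g, c in zip(genomic_row, cdna_row)
              if g != "-" and c != "-"]
    mismatches = sum(1 for g, c in paired if g != c)
    mismatch_pct = 100.0 * mismatches / len(paired) if paired else 0.0

    # Exons found: maximal runs of cDNA bases, ignoring the free flanks.
    blocks = [b for b in cdna_row.strip("-").split("-") if b]
    delta_exons = len(blocks) - annotated_exons

    return {"gaps_in_gene": gaps_in_gene,
            "mismatch_pct": mismatch_pct,
            "delta_exons": delta_exons}


if __name__ == "__main__":
    genomic_row = "TTACGTGTAAAAAGTGCACC"
    cdna_row = "--ACGT--------TGCA--"
    print(alignment_metrics(genomic_row, cdna_row, annotated_exons=2))
```

With these definitions, an alignment that recovers every annotated exon with no extra gaps and no mismatches scores 0 on all three measures, which is consistent with the "%Score 0" columns in the result tables.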

Page 10: Comparison of Genomic DNA to cDNA Alignment Methods

Experimental method

Two score systems (out of 15 previously defined) and one alignment strategy were chosen using subsets 1 and 2:
– Semi-global aligner
– (1,-2,-1,0) and (1,-2,-10,0) score systems

The classic semi-global aligner was then compared to sim4, Spidey and est_genome on both subsets 3 and 4

Page 11: Comparison of Genomic DNA to cDNA Alignment Methods

Results: Exact Alignments

Extra Gap

Strategy             Avg      SD       %Score 0
SG(1, -2, -1, 0)     0.00     0.00     100.00%
SG(1, -2, -10, 0)    0.00     0.00     100.00%
sim4                 1.11     1.63     54.56%
est_genome           16.99    21.49    27.84%
Spidey               0.15     1.39     97.43%

Page 12: Comparison of Genomic DNA to cDNA Alignment Methods

Results: Exact Alignments

Delta Exons

Strategy             Avg      SD       %Score 0
SG(1, -2, -1, 0)     0.00     0.00     100.00%
SG(1, -2, -10, 0)    0.01     0.07     99.91%
sim4                 -0.01    0.20     97.46%
est_genome           -0.14    0.30     76.79%
Spidey               -4.04    3.10     0.00%

Page 13: Comparison of Genomic DNA to cDNA Alignment Methods

Results: Exact Alignments

Base Similarity

Strategy             Avg       SD        %Scr. 100%
SG(1, -2, -1, 0)     99.89%    0.49%     53.56%
SG(1, -2, -10, 0)    99.89%    0.49%     53.49%
sim4                 99.39%    1.34%     22.79%
est_genome           53.83%    35.00%    18.11%
Spidey               80.34%    36.49%    44.25%

Page 14: Comparison of Genomic DNA to cDNA Alignment Methods

Results: Exact Alignments

Mismatch Percentage

Strategy             Avg      SD       %Scr. 100%
SG(1, -2, -1, 0)     0.00%    0.00%    100.00%
SG(1, -2, -10, 0)    0.01%    0.03%    99.47%
sim4                 0.17%    0.21%    36.68%
est_genome           1.19%    1.26%    21.55%
Spidey               0.15%    0.98%    90.65%

Page 15: Comparison of Genomic DNA to cDNA Alignment Methods

Results: EST Alignments

Page 16: Comparison of Genomic DNA to cDNA Alignment Methods

Results: EST Alignments

Page 17: Comparison of Genomic DNA to cDNA Alignment Methods

Running Time Comparison

Aligner        EST-to-DNA (sec/alignment)    mRNA-to-DNA (sec/alignment)
sim4           0.013                         0.170
Spidey         0.066                         0.140
est_genome     0.640                         3.400
Semi-global    0.670                         5.170
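To put the timings above in relative terms, the short Python sketch below divides each aligner's time by sim4's. The numbers are copied from the table; choosing sim4 as the baseline and the helper name `slowdown_vs` are illustrative choices, not part of the original study.

```python
# Seconds per alignment, copied from the table: (EST-to-DNA, mRNA-to-DNA).
SECONDS = {
    "sim4": (0.013, 0.170),
    "Spidey": (0.066, 0.140),
    "est_genome": (0.640, 3.400),
    "Semi-global": (0.670, 5.170),
}


def slowdown_vs(baseline, timings):
    """Ratio of each aligner's time to the baseline's, per column."""
    base = timings[baseline]
    return {name: tuple(t / b for t, b in zip(times, base))
            for name, times in timings.items()}


if __name__ == "__main__":
    for name, (est, mrna) in slowdown_vs("sim4", SECONDS).items():
        print(f"{name:12s} EST: {est:6.1f}x   mRNA: {mrna:6.1f}x")
```

By this measure the semi-global aligner is roughly 50 times slower than sim4 on ESTs and roughly 30 times slower on mRNAs, while Spidey is slightly faster than sim4 on mRNAs.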

Page 18: Comparison of Genomic DNA to cDNA Alignment Methods

Conclusions

The classic semi-global algorithm produces good results
– Running time is a problem, although it can be improved

sim4 produces the best results among the external tools tested

Page 19: Comparison of Genomic DNA to cDNA Alignment Methods

Thanks