14
Compressed Pattern Matching in DNA Sequences BARNA SAHA

Compressed Pattern Matching in DNA Sequences BARNA SAHA

Embed Size (px)

Citation preview

Page 1: Compressed Pattern Matching in DNA Sequences BARNA SAHA

Compressed Pattern Matching in DNA

Sequences

BARNA SAHA

Page 2: Compressed Pattern Matching in DNA Sequences BARNA SAHA

Motivation

Large size of DNA sequence Human genome: three billions characters

over twenty-three pairs of chromosomes

Secondary Storage

Bring to memory

Main Memory

Decompress

Search

Search

1. Saves main memory space

2. Reduces Time Complexity

Page 3: Compressed Pattern Matching in DNA Sequences BARNA SAHA

Works on Compressed Pattern Matching BB (Boyer Moore on byte pair encoding)

[SHIBATA:CPM00]

SASO (Super-alphabet shift-or) [FRED:IPL03]

d-BM (Derivative Boyer Moore) [Chen:CSB04]

LZ-Grep (Boyer Moore on LZ compressed text) [NAVARRO:SPE05]

FED (Fast matching with encoded DNA sequences) [KIM:COMB07]

Project Goal: To implement Derivative Boyer Moore and compare it with Original Boyer Moore and AGREP

Page 4: Compressed Pattern Matching in DNA Sequences BARNA SAHA

Project Goals

Explore d-BM algorithm, by implementing it

and comparing time complexity of it with

BM, AGREP (fastest algorithm so far) and

LZ-Grep.

If Possible extend the method for matching

multiple patterns and for inexact matching.

Page 5: Compressed Pattern Matching in DNA Sequences BARNA SAHA

Two Steps of Implementation Compression of DNA string (Text, Pattern) Searching the compressed Pattern in the

compressed Text

Page 6: Compressed Pattern Matching in DNA Sequences BARNA SAHA

Compression Technique for Text A=00, C=01, T=10, G=11 Four characters represented by 1 byte Don’t care symbol appended to make the

sequence divisible by 4 TACCGT =10 00 01 01 11 10 xx xx S=TACCGT, S’= 10 00 01 01 11 10 , A=2 Replace S by (S’,A)

Page 7: Compressed Pattern Matching in DNA Sequences BARNA SAHA

Compression Technique for Pattern Text: TACT GATx Pattern: ACTG Axxx Boundary of Bytes don’t match

4 Patterns: 1. ACTG Axxx2. xACT GAxx3. xxAC TGAx4. xxxA CTGA

Page 8: Compressed Pattern Matching in DNA Sequences BARNA SAHA

Searching Adapt Boyer Moore Algorithm Needs to handle Don’t Care Symbols

Page 9: Compressed Pattern Matching in DNA Sequences BARNA SAHA

Handling Don't Care

Search (P', A1, A2) in (T',B1): Four cases

Matching P'[1] to T'[j], 1<=j<=m-1: mask

A1 cells of P'[1]

Matching P'[i], 2<=i<=n-1, to T'[j],

1<=j<=m-1: normal search

Matching P'[n] to T'[j], 1<=j<m-1: mask A2

end cells of P'[n]

Matching P'[n] to T'[m]: mask A2 end cells

of P'[n] and B1 cells of T'[m]

Page 10: Compressed Pattern Matching in DNA Sequences BARNA SAHA

Handling Don't Care

Significant overhead if need to consider

these four cases separately while searching

and preprocessing

P'=<Pf, Pm, Pe>, T'=<Tf,Te>

Search Pm in Tf

If FOUND, extend search with Pf, Pe in T'

using case 1,3,4.

Page 11: Compressed Pattern Matching in DNA Sequences BARNA SAHA

Example

Pattern=TACTTTGGA

Text=GCTACTTTGGATGCT

P'3=xx xx TA CTTT GGA xx

T'= GCTA CTTT GGAT GCTxx

Page 12: Compressed Pattern Matching in DNA Sequences BARNA SAHA

Two Reason For Speed-Up Shortened length of text and pattern Number of different symbols=256, hence

larger shift

Page 13: Compressed Pattern Matching in DNA Sequences BARNA SAHA

References. [Chen:CSB04] Lei Chen and Shiyong Lu and Jeffrey Ram, Compressed Pattern

Matching in DNA Sequences, CSB '04: Proceedings of the 2004 IEEE Computational Systems Bioinformatics Conference

[FRED:IPL03] Fredriksson, K.: Shift-Or String Matching with Super-Alphabets. Information Processing Letters 87(4), 201–204 (2003).

[SHIBATA:CPM00] Shibata, Y., Matsumoto, T., Takeda, M., Shinohara, A., Arikawa, S.: A Boyer-Moore Type Algorithm for Compressed Pattern Matching. In: Giancarlo, R., Sankoff, D. (eds.) CPM 2000. LNCS, vol. 1848, pp. 181–194. Springer, Heidelberg (2000).

[KIM:COMB07] Jin Wook Kim, Eunsang Kim, and Kunsoo Park, “Fast Matching Method for DNA Sequences”, Book of Combinatorics, Algorithms, Probabilistic and Experimental Methodologies, 271-281, 2007.

[NAVARRO:SPE05] Gonzalo Navarro and Jorma Tarhio, “LZgrep: a BoyerMoore string matching tool for ZivLempel compressed text: Research Articles”, Softw. Pract. Exper., 35, 2005.

Page 14: Compressed Pattern Matching in DNA Sequences BARNA SAHA

THANK YOU !