Compressed Pattern Matching in DNA Sequences BARNA SAHA

Compressed Pattern Matching in DNA

Sequences

BARNA SAHA

Motivation

Large size of DNA sequence Human genome: three billions characters

over twenty-three pairs of chromosomes

Secondary Storage

Bring to memory

Main Memory

Decompress

Search

Search

1. Saves main memory space

2. Reduces Time Complexity

Works on Compressed Pattern Matching BB (Boyer Moore on byte pair encoding)

[SHIBATA:CPM00]

SASO (Super-alphabet shift-or) [FRED:IPL03]

d-BM (Derivative Boyer Moore) [Chen:CSB04]

LZ-Grep (Boyer Moore on LZ compressed text) [NAVARRO:SPE05]

FED (Fast matching with encoded DNA sequences) [KIM:COMB07]

Project Goal: To implement Derivative Boyer Moore and compare it with Original Boyer Moore and AGREP

Project Goals

Explore d-BM algorithm, by implementing it

and comparing time complexity of it with

BM, AGREP (fastest algorithm so far) and

LZ-Grep.

If Possible extend the method for matching

multiple patterns and for inexact matching.

Two Steps of Implementation Compression of DNA string (Text, Pattern) Searching the compressed Pattern in the

compressed Text

Compression Technique for Text A=00, C=01, T=10, G=11 Four characters represented by 1 byte Don’t care symbol appended to make the

sequence divisible by 4 TACCGT =10 00 01 01 11 10 xx xx S=TACCGT, S’= 10 00 01 01 11 10 , A=2 Replace S by (S’,A)

Compression Technique for Pattern Text: TACT GATx Pattern: ACTG Axxx Boundary of Bytes don’t match

4 Patterns: 1. ACTG Axxx2. xACT GAxx3. xxAC TGAx4. xxxA CTGA

Searching Adapt Boyer Moore Algorithm Needs to handle Don’t Care Symbols

Handling Don't Care

Search (P', A1, A2) in (T',B1): Four cases

Matching P'[1] to T'[j], 1<=j<=m-1: mask

A1 cells of P'[1]

Matching P'[i], 2<=i<=n-1, to T'[j],

1<=j<=m-1: normal search

Matching P'[n] to T'[j], 1<=j<m-1: mask A2

end cells of P'[n]

Matching P'[n] to T'[m]: mask A2 end cells

of P'[n] and B1 cells of T'[m]

Handling Don't Care

Significant overhead if need to consider

these four cases separately while searching

and preprocessing

P'=<Pf, Pm, Pe>, T'=<Tf,Te>

Search Pm in Tf

If FOUND, extend search with Pf, Pe in T'

using case 1,3,4.

Example

Pattern=TACTTTGGA

Text=GCTACTTTGGATGCT

P'3=xx xx TA CTTT GGA xx

T'= GCTA CTTT GGAT GCTxx

Two Reason For Speed-Up Shortened length of text and pattern Number of different symbols=256, hence

larger shift

References. [Chen:CSB04] Lei Chen and Shiyong Lu and Jeffrey Ram, Compressed Pattern

Matching in DNA Sequences, CSB '04: Proceedings of the 2004 IEEE Computational Systems Bioinformatics Conference

[FRED:IPL03] Fredriksson, K.: Shift-Or String Matching with Super-Alphabets. Information Processing Letters 87(4), 201–204 (2003).

[SHIBATA:CPM00] Shibata, Y., Matsumoto, T., Takeda, M., Shinohara, A., Arikawa, S.: A Boyer-Moore Type Algorithm for Compressed Pattern Matching. In: Giancarlo, R., Sankoff, D. (eds.) CPM 2000. LNCS, vol. 1848, pp. 181–194. Springer, Heidelberg (2000).

[KIM:COMB07] Jin Wook Kim, Eunsang Kim, and Kunsoo Park, “Fast Matching Method for DNA Sequences”, Book of Combinatorics, Algorithms, Probabilistic and Experimental Methodologies, 271-281, 2007.

[NAVARRO:SPE05] Gonzalo Navarro and Jorma Tarhio, “LZgrep: a BoyerMoore string matching tool for ZivLempel compressed text: Research Articles”, Softw. Pract. Exper., 35, 2005.

THANK YOU !

Documents

Compressed Pattern Matching in DNA Sequences BARNA SAHA