Upload
erick-gray
View
215
Download
2
Embed Size (px)
Citation preview
Compressed Pattern Matching in DNA
Sequences
BARNA SAHA
Motivation
Large size of DNA sequence Human genome: three billions characters
over twenty-three pairs of chromosomes
Secondary Storage
Bring to memory
Main Memory
Decompress
Search
Search
1. Saves main memory space
2. Reduces Time Complexity
Works on Compressed Pattern Matching BB (Boyer Moore on byte pair encoding)
[SHIBATA:CPM00]
SASO (Super-alphabet shift-or) [FRED:IPL03]
d-BM (Derivative Boyer Moore) [Chen:CSB04]
LZ-Grep (Boyer Moore on LZ compressed text) [NAVARRO:SPE05]
FED (Fast matching with encoded DNA sequences) [KIM:COMB07]
Project Goal: To implement Derivative Boyer Moore and compare it with Original Boyer Moore and AGREP
Project Goals
Explore d-BM algorithm, by implementing it
and comparing time complexity of it with
BM, AGREP (fastest algorithm so far) and
LZ-Grep.
If Possible extend the method for matching
multiple patterns and for inexact matching.
Two Steps of Implementation Compression of DNA string (Text, Pattern) Searching the compressed Pattern in the
compressed Text
Compression Technique for Text A=00, C=01, T=10, G=11 Four characters represented by 1 byte Don’t care symbol appended to make the
sequence divisible by 4 TACCGT =10 00 01 01 11 10 xx xx S=TACCGT, S’= 10 00 01 01 11 10 , A=2 Replace S by (S’,A)
Compression Technique for Pattern Text: TACT GATx Pattern: ACTG Axxx Boundary of Bytes don’t match
4 Patterns: 1. ACTG Axxx2. xACT GAxx3. xxAC TGAx4. xxxA CTGA
Searching Adapt Boyer Moore Algorithm Needs to handle Don’t Care Symbols
Handling Don't Care
Search (P', A1, A2) in (T',B1): Four cases
Matching P'[1] to T'[j], 1<=j<=m-1: mask
A1 cells of P'[1]
Matching P'[i], 2<=i<=n-1, to T'[j],
1<=j<=m-1: normal search
Matching P'[n] to T'[j], 1<=j<m-1: mask A2
end cells of P'[n]
Matching P'[n] to T'[m]: mask A2 end cells
of P'[n] and B1 cells of T'[m]
Handling Don't Care
Significant overhead if need to consider
these four cases separately while searching
and preprocessing
P'=<Pf, Pm, Pe>, T'=<Tf,Te>
Search Pm in Tf
If FOUND, extend search with Pf, Pe in T'
using case 1,3,4.
Example
Pattern=TACTTTGGA
Text=GCTACTTTGGATGCT
P'3=xx xx TA CTTT GGA xx
T'= GCTA CTTT GGAT GCTxx
Two Reason For Speed-Up Shortened length of text and pattern Number of different symbols=256, hence
larger shift
References. [Chen:CSB04] Lei Chen and Shiyong Lu and Jeffrey Ram, Compressed Pattern
Matching in DNA Sequences, CSB '04: Proceedings of the 2004 IEEE Computational Systems Bioinformatics Conference
[FRED:IPL03] Fredriksson, K.: Shift-Or String Matching with Super-Alphabets. Information Processing Letters 87(4), 201–204 (2003).
[SHIBATA:CPM00] Shibata, Y., Matsumoto, T., Takeda, M., Shinohara, A., Arikawa, S.: A Boyer-Moore Type Algorithm for Compressed Pattern Matching. In: Giancarlo, R., Sankoff, D. (eds.) CPM 2000. LNCS, vol. 1848, pp. 181–194. Springer, Heidelberg (2000).
[KIM:COMB07] Jin Wook Kim, Eunsang Kim, and Kunsoo Park, “Fast Matching Method for DNA Sequences”, Book of Combinatorics, Algorithms, Probabilistic and Experimental Methodologies, 271-281, 2007.
[NAVARRO:SPE05] Gonzalo Navarro and Jorma Tarhio, “LZgrep: a BoyerMoore string matching tool for ZivLempel compressed text: Research Articles”, Softw. Pract. Exper., 35, 2005.
THANK YOU !