Combinational Pattern Matching Shashank Kadaveru

Combinational Pattern Matching

Shashank Kadaveru

IntroductionIn Motif Finding problem, no

particular pattern is given to search for. We infer it from the sample.

In Combinational Pattern Matching, we look for exact or appropriate occurrences of given patterns in a long text.

Pattern MatchingMight seem simpler than Motif

finding because we know what we are looking for.

But due to the large size of genomes, in practice it is difficult.

Here, we develop number of ways to make pattern matching in a long string.

Repeat FindingGenetic diseases are associated with

deletions, duplications and rearrangements of long chromosomal regions.

For example, DiGeorge syndrome which results in impaired immune system and heart defects, is associated with a 3MB deletion on human chromosome 22.

The deletion removes important genes which leads to a disease.

Repeat FindingRepeats in DNA hold many

evolutionary secrets. A striking and still unexplained

phenomenon is the large number of repeats in many genomes.

For example, repeats account for about 50% of the human genome.

Duplicate Removal ProblemFind all unique entries in a list of

integers. For example, the list (8, 1, 5, 1,

0, 4, 5, 10, 1) has the elements 1 and 5 repeated multiple times, so the resulting list would be (8, 1, 5, 0, 4, 10)

Duplicate Removal AlgorithmDUPLICATEREMOVAL(a, n)1 m <- largest element of a2 for i <- 0 to m3 bi <- 04 for i <- 0 to n5 bai <- 16 for i <- 0 to m7 if bi = 18 output i9 return

Duplicate Removal AlgorithmThis approach requires O(n log n)

time to sort the list of size n.Difficult to use when the entries

are arbitrarily large.Creating a list with so many

values is not efficient and may not even be possible.

Duplicate Removal AlgorithmIf we can map any number of

entries to some number between 1 and 1000 then we can use binning strategy efficiently.

The function used for mapping is called Hash Function.

The second list is called Hash Table.

Hash FunctionShould be easy to calculate and

integer valued.A simple hash function takes h(x)

to integer remainder of x/|b|.We want different integers in the

array to map to different integers in b.

But this isn’t possible when the size of array a is larger than b.

Hash FunctionExample:

◦If h(x) = x/1000◦7, 1007, 2007 collide and fall into the

same bin.This is called a collision.To deal with collisions, we use

Chaining.

Chaining

Elements collapsing to the same bin are organized into a linked list.

Also, elements falling into the same bins reveal duplicate elements and hence solves the duplicate removal problem.

ChainingThis concept can apply to many

other problems.Ex: to find exact repeats of l-mer

in a genome.If we have to find exact repeats

for large values like l = 40, then we need not construct 440 entries.

We can design a suitable hashing function and use a table of much smaller size.

Exact Pattern MatchingIf we have a pattern p = p1 p2 . .

pn and a string t = t1 t2 . . tm Pattern matching problem is used

to find any or all occurrences of pattern p in the text t.

Brute force algorithm is used for solving Pattern Matching problem.

Exact Pattern Matching

PATTERNMATCHING(p, t)1 n <- length of pattern p2 m <- length of text t3 for i <- 1 to m− n + 14 if ti = p5 output i

Exact Pattern MatchingWorst case running time is O (mn)Such a worst-case scenario can be

seen by searching for p = AAAAT in t =AAAA. . .AAAAA.

Peter Weiner invented a data structure called Suffix Tree that solves Pattern Matching problem in linear time O(m) for any text or pattern.

To understand this, we need to learn about Keyword Trees.

Keyword TreesMultiple Pattern Matching

Problem:◦Given a set of patterns and a text,

find all occurrences of any of the patterns in the text.

◦For example, if t = ATGGTCGGT and p1 = GGT, p2 = GGG, p3 = ATG, and p4 = CG, then the result of an algorithm that solves the Multiple Pattern Matching problem would be positions 1, 3, 6, and 7.

Keyword TreesMultiple Pattern Matching

problem with k patterns can be reduced to k individual Pattern Matching problems and solved in O(knm) time.

However, there exists an even faster way to solve this problem in O(n + m) time where n is the total length of patterns p1, p2, . . . , pk

Keyword Trees

The keyword tree for the set of patterns apple, apropos, banana, bandana, and orange is shown in figure.

Keyword TreesThe keyword tree has at most N

vertices where N is the total length of all patterns, but may have fewer.

One can construct the keyword tree in O(N) time by progressively extending the keyword tree.

Suffix TreesSuffix trees allow one to

preprocess a text in such a way that for any pattern of length n, one can answer whether or not it occurs in the text, using only O(n) time.

It doesn’t depend on the length of the text.

Suffix Trees

Patterns: proud, perfect, muggle, ugly, rivet

t = “mr and mrs dursley of number 4 privet drive were proud to say that they were perfectly normal thank you very much”

Suffix Trees

Suffix Trees

Suffix TreesA suffix tree for a text of length

m can be constructed in O(m) time, which leads immediately to a linear O(n + m) algorithm for the Pattern Matching problem

Construct the suffix tree for t, in O(m) time, and then check whether p occurs in the tree, which requires O(n) time.

Suffix TreesThe suffix tree for a text t = t1 ...tm is a

rooted labeled tree with m leaves (numbered from 1 to m) satisfying the following conditions ◦Each edge is labeled with a substring of the

text◦Each internal vertex (except possibly the

root) has atleast 2 children,◦ Any two edges out of the same vertex start

with a different letter,◦Every suffix of text t is spelled out on a

path from the root to some leaf.

Suffix Trees

Suffix TreesSUFFIXTREEPATTERNMATCHING(p,

t) ◦Build the suffix tree for text t◦Thread pattern p through the suffix

tree. ◦if threading is complete

output positions of every p-matching leaf in the tree

◦ else output “pattern does not appear anywhere

in the text”

Heuristic Similarity Search Algorithms The suffix tree algorithm above, while

fast, can only find exact, rather than approximate, occurrences of a gene in a database.

When we are trying to find an approximate match to a gene of length 103 in a database of size 1010, the programming algorithms (like the Smith- Waterman local alignment algorithm) may be too slow.

Approximate Pattern MatchingFinds all approximate

occurrences of a pattern in a text. The naive brute force algorithm

for approximate pattern matching runs in O(nm) time.

The following algorithm will output all locations in t where p occurs with no more than k mismatches.

Approximate Pattern MatchingAPPROXIMATEPATTERNMATCHING(p, t,

k) n ← length of pattern p m ← length of text t for i←1tom−n+1

◦dist ← 0◦ for j←1to n

if ti+j−1 != pjdist ← dist + 1

if dist ≤ k ◦output i

Approximate Pattern MatchingIn 1985 Gadi Landau and Udi

Vishkin found an algorithm for approximate string matching with O(km) worst-case running time.

Although this algorithm yields the best known worst-case performance, it is not necessarily the best in practice.

Heuristic Similarity Search Algorithms Many heuristics for fast database

search in molecular biology use the same filtration idea.

In 1973 Donald Knuth suggested a method for pattern matching with one mismatch based on the observation that strings differing by a single mismatch must match exactly in either the first or second half.

Heuristic Similarity Search Algorithms In 1985 the idea of filtration in

computational molecular biology was used by David Lipman and Bill Pearson, in their FASTA algorithm.

It was further developed in BLAST, now the dominant database search tool in molecular biology.

Heuristic Similarity Search Algorithms Biologists frequently depict

similarities between two sequences in the form of dot matrices.

A dot matrix is simply a matrix with each entry either 0 or 1, where a 1 at position (i, j) indicates some similarity between the ith position of the first sequence and the jth position of the second sequence.

Heuristic Similarity Search Algorithms

BLAST: Comparing a Sequence against a Database Using shared l-mers for finding similarities,

as FASTA does, has some disadvantages. For example, two proteins can have different

amino acid sequences but still be biologically similar.

A common construct in many heuristic similarity search algorithms, including BLAST, is that of a scoring matrix similar to the scoring matrices introduced in chapter 6.

These scoring matrices reveal similarities between proteins even if they do not share a single l-mer.

BLAST: Comparing a Sequence against a Database BLAST, uses scoring matrices to

improve the efficiency of filtration and to introduce more accurate rules for locating potential matches.

Another powerful feature of BLAST is the use of Altschul-Dembo-Karlin statistics for estimating the statistical significance of found matches.

BLAST: Comparing a Sequence against a Database

A segment pair is just a pair of l-mers, one from each sequence.

The maximal segment pair is a segment pair with the best score over all segment pairs in the two sequences.

A segment pair is locally maximal if its score cannot be improved either by extending or by shortening both segments.

BLAST: Comparing a Sequence against a Database BLAST attempts to find all locally

maximal segment pairs in the query and database sequences with scores above some set threshold.

A fast algorithm for finding such l-mers is the key ingredient of BLAST.

BLAST: Comparing a Sequence against a Database An important observation is that if the

threshold is high enough, then the set of all l-mers that have scores above a threshold is not too large.

In this case the database can be searched for exact occurrences of the strings from this set using Multiple Pattern Matching Algorithm.

After the potential matches are located, BLAST attempts to extend them to see whether the resulting score is above the threshold.

BLAST: Comparing a Sequence against a Database In recent years BLAST has been

further improved by allowing insertions and deletions and combining matches on the same and close diagonals.

BLAST reports matches to sequences that either have one segment score above the threshold or that have several closely located segment pairs that are statistically significant when combined.

BLAST: Comparing a Sequence against a Database According to the Altschul-Dembo-

Karlin statistics, the number of matches with scores above θ is approximately Poisson-distributed, with mean

where m and n are the lengths of compared sequences, and K is a constant. Parameter λ is a positive root of the equation

BLAST: Comparing a Sequence against a Database

where px and py are frequencies of amino acids x and y from the twenty- letter alphabet A and δ is the scoring matrix.

The probability that there is a match of score greater that θ between two “random” sequences of length n and m is computed as 1 − eE(θ).

Any Questions?Thank You

Documents

Combinational Pattern Matching Shashank Kadaveru