Advanced Algorithms / T. Shibuya Advanced Algorithms: Text Algorithms Tetsuo Shibuya Human Genome Center, Institute of Medical Science (Adjunct at Department of Computer Science) University of Tokyo http://www.hgc.jp/~tshibuya


Advanced Algorithms:

Text Algorithms

Tetsuo Shibuya

Human Genome Center, Institute of Medical Science

(Adjunct at Department of Computer Science)

University of Tokyo

http://www.hgc.jp/~tshibuya

Self Introduction

Affiliation: Laboratory of Sequence Analysis, Human Genome Center,

Institute of Medical Science

Research interest: Bioinformatics algorithms

Our lab is located on the 4th floor

The topics of this lecture

Today (July 2nd)

Text searching algorithms: Knuth-Morris-Pratt / Boyer-Moore / etc.

Next week (July 9th)

Text indexing algorithms: Suffix arrays and their applications

The final week (July 16th)

Text compression algorithms: LZ77 / LZ78 / LZW / Arithmetic coding / Block sorting / etc.

Reports

Please submit a report for the homework that I will give on the last day

or

Please submit scribe notes for one of my three lectures

In TeX format as for the previous lectures

One volunteer (if any) for one lecture

The submitted notes will be put on the web page

Textbooks

D. Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge University Press, 1997.

The most famous book on text processing algorithms, but many parts are out of date.

W. Sung, Algorithms in Bioinformatics, CRC Press, 2009.

Good introduction to bioinformatics algorithms (mainly on text processing).

D. Salomon, Data Compression, 3rd Edition, Springer, 2004.

Related to the topic on the last day. (Very heavy book!)

Today's topic

Text processing algorithms

Brute-force algorithm

Knuth-Morris-Pratt algorithm

Colussi algorithm

Aho-Corasick algorithm

Boyer-Moore algorithm

Horspool algorithm

Turbo-BM algorithm

Rabin-Karp algorithm

Shift-Or method

etc.

Text matching

Problem

Given: a text string T and a pattern (query) P

Output: all substrings of T that are exactly the same as P, if any.

Exact matching: no insertion / deletion / modification (mutation)

Two approaches

Preprocess only the query pattern (today)

Preprocess the text beforehand (next week)

Text:

GGTGAGAAGTTATGATACAGGGTAGTTG
TGTCCTTAAGGTGTATAACGATGACATC
ACAGGCAGCTCTAATCTCTTGCTATGAG
TGATGTAAGATTTATAAGTACGCAAATT

Pattern (Query): TATAA

Two types of text matching algorithms

Skipping positions unnecessary to compare

Check from left: Knuth-Morris-Pratt, Aho-Corasick (for multiple queries)

Check from right: Boyer-Moore, Horspool, Turbo-BM

Brute-force: Naive algorithm

Fingerprinting (hash-based) algorithm: Rabin-Karp

Bitwise computation-based algorithm: Shift-Or (Shift-And)

Naive algorithm

Just check one by one at each position

O(nm) in the worst case, but linear time on average!

Not so bad for cases when you have no time to implement anything better :-)

But in practice it is still much slower than the more sophisticated algorithms.

Average length to check (for a random DNA sequence): 1 + 1/4 + (1/4)^2 + ... = 4/3 (constant!)

Text     GGGACCAAGTTCCGCACATGCCGGATAGAAT
Pattern  CCGTATG

[Figure: the pattern is checked one by one at each text position. Each row shows the characters compared at one position, e.g. "c" (failed at once), "CCg", "CCGt"; uppercase = matched, lowercase = failed.]
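The naive procedure can be sketched as follows. This is only an illustration: the function name `naive_search` and the comparison counter are assumptions added here so that the O(nm) worst case can be observed directly.

```python
def naive_search(text, pattern):
    """Return (match_positions, number_of_character_comparisons)."""
    n, m = len(text), len(pattern)
    hits, comparisons = [], 0
    for start in range(n - m + 1):
        k = 0
        while k < m:
            comparisons += 1
            if text[start + k] != pattern[k]:
                break                  # failed: move on to the next position
            k += 1
        if k == m:                     # all m characters matched
            hits.append(start)
    return hits, comparisons


if __name__ == "__main__":
    print(naive_search("GGGACCAAGTTCCGCACATGCCGGATAGAAT", "CCGTATG"))
    # Worst case: every position compares all m characters.
    print(naive_search("A" * 20, "AAAAB"))
```

On a highly repetitive input such as the second call, the counter reaches (n - m + 1) * m, while on random DNA it stays close to 4/3 per position.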

Knuth-Morris-Pratt(1)

Improvement of the brute-force algorithm

The brute-force algorithm sometimes checks the same position more than once, which could be a waste of time

→ Knuth-Morris-Pratt Algorithm

Pattern  TAGTAGC   (check from left)

Text     AATACTAGTAGGCATGCCGGAT
         t
          t
           TAg
            t
             t
              TAGTAGc
               t          skip
                t         skip
                 TAGt
                 ...

Having read "TAGTAG" in the text, we already know before any comparison that the pattern cannot match at the skipped positions.

Knuth-Morris-Pratt (2)

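If P[0..i] matches the text but P[i+1] does not, then

FailureLink[i+1] = max j such that P[0..j] = P[i-j..i], P[j+1] ≠ P[i+1], and j < i

If no such j exists, let FailureLink[i+1] = -2 if P[i+1] = P[0]; otherwise let FailureLink[i+1] = -1.

FailureLink[] can be computed before searching the text!

We can shift the pattern by i - FailureLink[i+1] positions

[Figure: the comparison failed at P[i+1]. The failure link points to the end of the longest prefix of the pattern that matches a suffix of the matched part; the character following that prefix should be different from P[i+1] (← Knuth). The pattern skips ahead, and you don't have to check the matched positions again!]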
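A minimal sketch of searching with these failure links, computed naively straight from the definition. The names `failure_links` and `kmp_search` are hypothetical. Two conventions are assumptions of this sketch rather than part of the slides: a failure at the first pattern position simply advances the text, and an extra slot at index m (behaving like a character that differs from everything) is used to continue after a full match.

```python
def failure_links(p):
    """FailureLink[0..m] computed directly from the definition (naive, roughly O(m^3)).

    Slot m is an assumed extra: it is used after a full match and behaves as if
    a character that differs from every other character had failed there.
    """
    m = len(p)
    q = list(p) + [None]               # None acts as the virtual character at position m
    fl = [-1] * (m + 1)                # FailureLink[0] stays -1; the search handles it
    for k in range(1, m + 1):          # k = i + 1 is the position where matching failed
        i = k - 1
        for j in range(i - 1, -1, -1): # try the largest j < i first
            if q[:j + 1] == q[i - j:i + 1] and q[j + 1] != q[k]:
                fl[k] = j
                break
        else:                          # no valid j exists
            fl[k] = -2 if q[k] == q[0] else -1
    return fl


def kmp_search(text, p):
    """Return the start positions of all occurrences of p in text."""
    m = len(p)
    if m == 0:
        return []
    fl = failure_links(p)
    hits = []
    t = k = 0                          # t: text position, k: pattern position
    while t < len(text):
        if text[t] == p[k]:
            t += 1
            k += 1
            if k == m:                 # full match: fall back through the extra slot
                hits.append(t - m)
                k = fl[m] + 1
        elif k == 0 or fl[k] == -2:
            t += 1                     # this text character cannot start a match
            k = 0
        else:
            k = fl[k] + 1              # shift the pattern, keep the text position
    return hits


if __name__ == "__main__":
    print(kmp_search("AATACTAGTAGGCATGCCGGAT", "TAG"))
```

The text position `t` never moves backward, which is where the linear search time comes from; only the preprocessing in this sketch is deliberately naive.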

Knuth-Morris-Pratt (3)

Text     CTACTGATCTGATCGCTAGATGC
Pattern  CTGATCTGC
           CTGATCTGC
            CTGATCTGC
                 CTGATCTGC
                       CTGATCTGC

1st → 2nd alignment: skip 1 position

2nd → 3rd alignment: failed at the first position, so just proceed

3rd → 4th alignment: overlap of "CTG"

4th → 5th alignment: no overlap; MP skips only 4 positions, KMP skips 5 positions

Knuth-Morris-Pratt (4)

Preprocessing

A naive algorithm requires O(m^2) or even O(m^3) time

Linear time algorithms exist:

Use the KMP itself

Z algorithm [Gusfield 97]: not faster than using the KMP, but easier to understand

Z Algorithm (1)

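Z_i: the longest common prefix length of S[0..n-1] and S[i..n-1]

Compute it for all i

right_i: the max value of x + Z_x - 1 (x < i)

left_i: the x that attains the maximum value of x + Z_x - 1 (x < i)

Initialization: Z_0 = right_0 = left_0 = 0

[Figure: the Z box. S[left_i..right_i] is the same string as the prefix S[0..Z_{left_i}-1], and position i lies inside it.]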

Z Algorithm (2)

Computation of Z_{i+1}

In case i+1 ≤ right_i

The text up to the position right_i has already been compared

Let i' = i - left_i. In case Z_{i'+1} < right_i - i, we can copy the answer in O(1): Z_{i+1} = Z_{i'+1}

Otherwise, compare naively after the position right_i (①)

In case i+1 > right_i

Compare naively (②)

① + ② can be done in linear time in total, because every successful naive comparison moves right_i forward

The Z algorithm itself is also a text matching algorithm

Compute Z_i against P$T

P: pattern, T: text, $: some character that appears in neither P nor T

[Figure: position i+1 inside the Z box S[left_i..right_i] corresponds to position i'+1 inside the prefix S[0..right_i-left_i], so Z_{i+1} = Z_{i'+1}.]
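The two cases can be sketched in code as follows. The names `z_array` and `z_search` are hypothetical, Z_0 is set to 0 by convention, and the separator is assumed to occur in neither the pattern nor the text.

```python
def z_array(s):
    """z[i] = length of the longest common prefix of s and s[i:] (z[0] = 0 by convention)."""
    n = len(s)
    z = [0] * n
    left = right = 0                   # current Z box is s[left..right], inclusive
    for i in range(1, n):
        length = 0
        if i <= right:                 # i lies inside the current Z box
            remaining = right - i + 1
            if z[i - left] < remaining:
                z[i] = z[i - left]     # copy the answer in O(1)
                continue
            length = remaining         # known to match up to right; go on from there
        while i + length < n and s[i + length] == s[length]:
            length += 1                # naive comparison beyond the box
        z[i] = length
        if i + length - 1 > right:     # the box has been extended to the right
            left, right = i, i + length - 1
    return z


def z_search(text, pattern, sep="$"):
    """Exact matching by computing Z on pattern + sep + text."""
    m = len(pattern)
    z = z_array(pattern + sep + text)
    return [i - m - 1 for i in range(m + 1, len(z)) if z[i] == m]


if __name__ == "__main__":
    print(z_array("ATGCGCATAATGCGC"))
    print(z_search("TTTACGTATTATTACGTCC", "TTATTACG"))
```

Each successful comparison in the `while` loop pushes `right` forward, so the total work is linear in the length of the string.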

Z Algorithm (3)

Example

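Text  ATGCGCATAATGCGCTGAATGGCCATAATCTGAA
Z_i   0000002016000000013000002012000011

[Figure: the Z_i values have been computed up to some position inside a Z box (left, right), and the value for the next position is being computed. The text inside the box is the same as the prefix, so just copy the numbers if the numbers are smaller than 3.]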

Z_i & Failure Link (F[])

Failure links can be obtained by just scanning the Z_i table

Initialize FailureLink[] with -1

For each i, in increasing order: if (FailureLink[i+Z_i] = -1) FailureLink[i+Z_i] = Z_i - 1

Knuth's rule (post processing): for i > 0, change each remaining -1 to -2 if P[i] = P[0]

pattern  GTAGGCATGTAGCGTAGG
i        0123456789........
Z_i      000110004001050011
Flink    AAAB00AABAAB3BAAB0    (A: -1, B: -2)
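The scan over the Z table can be sketched as below. The function names are hypothetical, and two details are assumptions of this sketch: the array has one extra slot at index m (used after a full match), and the post-processing step that turns a remaining -1 into -2 when P[i] = P[0] is applied only for i > 0.

```python
def z_array(s):
    """z[i] = length of the longest common prefix of s and s[i:] (z[0] = 0 by convention)."""
    n = len(s)
    z = [0] * n
    left = right = 0
    for i in range(1, n):
        length = 0
        if i <= right:
            remaining = right - i + 1
            if z[i - left] < remaining:
                z[i] = z[i - left]
                continue
            length = remaining
        while i + length < n and s[i + length] == s[length]:
            length += 1
        z[i] = length
        if i + length - 1 > right:
            left, right = i, i + length - 1
    return z


def failure_links_from_z(p):
    """Failure links (last index of the matching prefix, or -1 / -2) from the Z table."""
    m = len(p)
    z = z_array(p)
    fl = [-1] * (m + 1)                # slot m is an assumed extra for "after a full match"
    for i in range(1, m):              # increasing i: the first writer has the largest Z_i
        if z[i] > 0 and fl[i + z[i]] == -1:
            fl[i + z[i]] = z[i] - 1
    for k in range(1, m):              # post-processing for positions with no valid j
        if fl[k] == -1 and p[k] == p[0]:
            fl[k] = -2
    return fl


if __name__ == "__main__":
    print(failure_links_from_z("GTAGGCATGTAGCGTAGG"))
```

A prefix match that stops at position i + Z_i stops exactly because the next characters differ, so the condition that the following character must be different is satisfied without any extra check.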

Computational Time Complexity of KMP

O(m+n)

n: text length, m: pattern length

Worst-case time complexity: #comparisons ≤ 2n-1

In practice, it is usually not faster than the Boyer-Moore or Shift-Or algorithms,

though these algorithms do not achieve the worst-case linear time

Colussi Algorithm (A Variation of the KMP)

#comparisons < 3n/2

Check the positions with the KMP strong rule later

Skip lengths are different from KMP

Preprocessing is also in linear time

In practice it is not much faster, though

cf. The Galil-Giancarlo algorithm achieves #comparisons < 4n/3

[Figure: the checking order on the example pattern "G a t G c t c a t G A T G t c c G A T G C c G t", annotated with its FailureLink[i]+1 values. Step 1 and Step 2 show which pattern positions are checked, under the label "Check in this order"; the positions governed by the strong rule are marked in Step 2.]

KMP and an automaton

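KMP can be described by an automaton

[Figure: automaton for the pattern A T A T T G, with one forward edge per pattern character and failure links pointing backward.]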

Aho-Corasick (1)

The automaton can be extended to deal with multiple queries!

Linear time construction!

Linear time searching!

[Figure: keyword tree over several DNA queries (edges labeled A, C, G, T) with failure links. A failure link goes to the root if not specified.]

Aho-Corasick (2)

Construction of the keyword tree: O(M) time

M: Sum of query string lengths

Can be used for dictionary searching

[Figure: the same keyword tree over the DNA queries, without failure links.]

Aho-Corasick (3)

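Compute the failure links in breadth-first order

Start from the root

No failure link at the root

FailureLink(v)

Follow the failure links starting from v's parent to find a node that has a child w with the same edge label as v, and let the first such (nearest) w be FailureLink(v)

If no such node exists, let FailureLink(v) = root

[Figure: a keyword tree with edges labeled a, b, c, showing a node v and the node w that becomes FailureLink(v).]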

Aho-Corasick (4)

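Why is it linear time?

[Figure: the failure links to be made along the path of some pattern. All the suffixes of the pattern are tried against the existing paths from the root in the tree; each failure link leads to a suffix that is at least 1 shorter, or to the root.]

For a pattern of length m, the traversal visits at most O(m) nodes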

Aho-Corasick (5)

OutLink(v)

Pointer to the node of another pattern that must also be output at v

Computation of OutLink()

Traverse the failure links to find the end node of a pattern (a leaf), if any

If there is no such node, there is no need to set the outlink

Also in linear time

Patterns: 1 together, 2 ether, 3 get, 4 her, 5 he

[Figure: keyword tree of the five patterns, with the end node of each pattern numbered 1 to 5 and the outlinks drawn between them.]
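The construction of the keyword tree, the failure links, and the outlinks can be sketched as follows. The list-based node layout and the names `build_automaton` and `ac_search` are assumptions made for illustration, and patterns are assumed to be non-empty.

```python
from collections import deque


def build_automaton(patterns):
    children = [{}]        # children[node][char] -> child node; node 0 is the root
    fail = [0]             # failure link of each node
    ends = [[]]            # indices of the patterns that end exactly at this node
    outlink = [None]       # nearest pattern-end node reachable through failure links
    for idx, pat in enumerate(patterns):
        node = 0
        for ch in pat:
            if ch not in children[node]:
                children.append({})
                fail.append(0)
                ends.append([])
                outlink.append(None)
                children[node][ch] = len(children) - 1
            node = children[node][ch]
        ends[node].append(idx)
    queue = deque(children[0].values())      # depth-1 nodes link to the root
    while queue:                             # breadth-first order
        v = queue.popleft()
        for ch, child in children[v].items():
            f = fail[v]                      # start from the failure link of the parent
            while f and ch not in children[f]:
                f = fail[f]
            fail[child] = children[f].get(ch, 0)
            target = fail[child]
            outlink[child] = target if ends[target] else outlink[target]
            queue.append(child)
    return children, fail, ends, outlink


def ac_search(text, patterns):
    """Return (start_position, pattern) for every occurrence of every pattern."""
    children, fail, ends, outlink = build_automaton(patterns)
    hits = []
    node = 0
    for pos, ch in enumerate(text):
        while node and ch not in children[node]:
            node = fail[node]
        node = children[node].get(ch, 0)
        v = node
        while v is not None:                 # report this node, then follow outlinks
            for idx in ends[v]:
                hits.append((pos - len(patterns[idx]) + 1, patterns[idx]))
            v = outlink[v]
    return hits


if __name__ == "__main__":
    print(ac_search("together", ["together", "ether", "get", "her", "he"]))
```

In this sketch the pattern "he" ends at an internal node because it is a prefix of "her", so pattern ends are stored per node rather than only at leaves.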

Regular expression search based on automata (1)

Regular expression

Concatenation: A, B → AB

Or: A, B → A+B

Repeat: A → A*

Extension of Aho-Corasick

Pattern: AB(A+B)(AB+CD)*B

Text: ABABABBBABAABBABACDBABBABBABBCDBABAABABBABAABCDB...

Regular expression search based on automata (2)

Construct the automaton for a regular expression

[Figure: building blocks for AB, A+B, and A*, connected by ε edges to the "Next" part, and the resulting automaton for (A*B+AC)D, running from Start to End.]

Regular expression search based on automata (3)

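Search in O(nm) time by DP over the reachable nodes

You can start anywhere in the text, so the Start node is reachable at every position

Text: CDAABCAAABDDACDAAC

[Figure: the automaton for (A*B+AC)D with its nodes numbered 0 to 8, and a table that lists, for each text position, the reachable nodes (not including ε states). When the End node becomes reachable: Found!]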
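The reachable-node DP can be sketched on a small hand-built automaton for (A*B+AC)D. The state numbering below is hypothetical and does not follow the numbering in the figure; the transition table `TRANS` and the function names are likewise assumptions.

```python
# Hand-built automaton for (A*B+AC)D. An empty string as the label marks an ε edge.
START, ACCEPT = 0, 5
TRANS = {
    (0, ""): {1, 2},     # choose the A*B branch (state 1) or the AC branch (state 2)
    (1, "A"): {1},       # A*
    (1, "B"): {3},       # ... followed by B
    (2, "A"): {4},       # A
    (4, "C"): {3},       # ... followed by C
    (3, "D"): {5},       # final D
}


def closure(states):
    """Add every state reachable through ε edges."""
    seen = set(states)
    stack = list(states)
    while stack:
        s = stack.pop()
        for t in TRANS.get((s, ""), ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen


def find_match_ends(text):
    """Return the text positions at which some match of the expression ends."""
    start_set = closure({START})
    current = set(start_set)
    ends = []
    for pos, ch in enumerate(text):
        moved = set()
        for s in current:
            moved |= TRANS.get((s, ch), set())
        # A match may start anywhere, so the start states are always reachable.
        current = closure(moved) | start_set
        if ACCEPT in current:
            ends.append(pos)
    return ends


if __name__ == "__main__":
    print(find_match_ends("CDAABCAAABDDACDAAC"))
```

Each text character touches at most every state once, which gives the O(nm) bound when m is taken as the size of the automaton.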

Boyer-Moore (1)

Idea

Almost the same as KMP, but check from right!

Practically faster than KMP

Good average-case time complexity

Bad worst-case time complexity

Pattern  GTTCGTT

Text     AATTGTTCCGGCCATGCCGGAT
         ......T
         .....TT
         ....GTT
         ...cGTT                   failed: skip based on the information of "GTT"
             gtt...t               failed: skip based on the information of "G"
               ....g.t             failed

Boyer-Moore (2)

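Two rules

Bad character rule

If the text character at the failed position is x, we can shift the pattern so that its last x is aligned with that position

The algorithm that uses only this rule is called the Horspool algorithm

(Strong) Good suffix rule

Shift the pattern so that another occurrence of the matched suffix is aligned with the matched text

Strong: the character before that other occurrence must be different from the one that failed

This constraint was not used in the original BM algorithm

cf. Knuth's rule in KMP

Take the larger of the two shifts above

[Figure: for each rule, the failed position and the successfully matched part of the pattern before and after the shift. In the good suffix case the characters before the two occurrences are different (= strong).]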

Boyer-Moore (3)

Bad character rule example

Pattern  TTCCAAGTCGCC   (do not consider the last character)

Text     CCCTGTCCATGCCGTCAGCCC

[Figure: the comparison fails at a text character T, and the pattern is shifted so that its last T is aligned with that text position.]
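A minimal sketch of the Horspool variant, which uses only a bad-character table. Note one assumption that differs slightly from the wording of the rule above: Horspool keys the shift on the text character under the last pattern position rather than on the character at the failed position. The function name `horspool_search` is hypothetical.

```python
def horspool_search(text, pattern):
    """Return the start positions of all occurrences of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # Distance from the last occurrence of each character to the end of the
    # pattern. The last pattern character itself is not considered.
    shift = {}
    for k in range(m - 1):
        shift[pattern[k]] = m - 1 - k
    hits = []
    pos = 0
    while pos <= n - m:
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:   # check from right
            j -= 1
        if j < 0:
            hits.append(pos)
        # Characters that do not occur in the pattern allow a full shift of m.
        pos += shift.get(text[pos + m - 1], m)
    return hits


if __name__ == "__main__":
    print(horspool_search("TGTCCTTAAGGTGTATAACGATGACATC", "TATAA"))
```

Leaving the last pattern character out of the table is what keeps every shift at least 1.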

Boyer-Moore (4)

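(Strong) Good suffix rule example

Pattern  CGTATATCCAATATC

Text     AGTCCCTCGGTCCGATATCGACCCTCCCG

[Figure: the suffix ATATC matches the text and the comparison then fails. The pattern is shifted so that its earlier occurrence of ATATC, which is preceded by a different character, is aligned with the matched text.]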

Boyer-Moore (5)

Preprocess

Bad character rule: very easy

Good suffix rule: linear time by using the Z algorithm from backward

Boyer-Moore (6)

Computational time complexity

Average-case: O(n / min(m, alphabet size))

i.e., the average-case skip length is O(min(m, alphabet size))

The Horspool algorithm has the same time complexity

Worst-case: O(nm)

Bad for cases with:

Many repeats (KMP is faster)

Small alphabet size (Shift-Or is faster)

Linear time for finding only 1 occurrence

Good for grep in editors

Worst-case O(n) algorithms based on BM

Turbo-BM (Crochemore et al. '92), Galil (1979), Smyth (2000), Apostolico-Giancarlo (1986), etc.

Turbo-BM

Turbo-shift

An additional rule that can be applied for a new shift after a shift based on the strong good suffix rule

In addition, consider the bad character rule too.

[Figure: three alignments of the pattern against the text (Previous, Current, Next), with the two failed positions and the matched segments A and B marked. The previous and current shifts come from the strong good suffix rule and the next one is the turbo-shift. Annotations: ① = z, ② = ¬z; "A = B, but ① ≠ ②, so B cannot overlap with ..."]

Rabin-Karp (1)

Based on fingerprinting (i.e., hashing)

e.g., hash(x[0..n-1]) = (x[0] d^{n-1} + x[1] d^{n-2} + x[2] d^{n-3} + … + x[n-1]) mod q

d: the base of the representation, q: some prime number

Pattern p → hash(p)

Text → hash(T[0..|P|-1]), hash(T[1..|P|]), hash(T[2..|P|+1]), ...

Compare each of them with hash(p) first

O(1) computation for each

Rabin-Karp (2)

Pattern  10111                hash: (16+4+2+1) mod 5 = 3

Text     11001101110100101...

Window   Hash
11001    (16+8+1) mod 5 = 0
10011    ((0-1·16)·2+1) mod 5 = 4
00110    ((4-1·16)·2+0) mod 5 = 1
01101    ((1-0·16)·2+1) mod 5 = 3     check → NO
11011    ((3-0·16)·2+1) mod 5 = 2
10111    ((2-1·16)·2+1) mod 5 = 3     check → YES!

Each update takes O(1)
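The rolling update can be sketched as follows. The function name `rabin_karp`, the default base and modulus, and the `code` parameter (which maps a character to its numeric value) are assumptions for illustration; with d=2, q=5 and code=int the sketch follows the binary example.

```python
def rabin_karp(text, pattern, d=256, q=1_000_003, code=ord):
    """Return the start positions of all occurrences of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(d, m - 1, q)            # weight d^(m-1) of the leading character
    hp = ht = 0
    for k in range(m):                 # hashes of the pattern and the first window
        hp = (hp * d + code(pattern[k])) % q
        ht = (ht * d + code(text[k])) % q
    hits = []
    for s in range(n - m + 1):
        # Equal hashes only suggest a match, so verify the characters.
        if hp == ht and text[s:s + m] == pattern:
            hits.append(s)
        if s + m < n:                  # O(1) update: drop text[s], append text[s + m]
            ht = ((ht - code(text[s]) * high) * d + code(text[s + m])) % q
    return hits


if __name__ == "__main__":
    print(rabin_karp("11001101110100101", "10111", d=2, q=5, code=int))
```

Python's `%` always returns a non-negative value for a positive modulus, so the subtraction in the update needs no extra correction here; in a language where the remainder can be negative, q would have to be added back.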

Shift-And Method

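Bit-parallel (32 or 64 bits) computation!

Efficient when the alphabet size is small

Bit representation of each pattern character (columns: A C G T)

   ACGT
T  0001
T  0001
A  1000
T  0001
T  0001
A  1000
C  0100
G  0010

Update for each text character c (X starts from 0):

X ← ((X shifted by 1 bit) OR 1) AND B_c, where B_c is the column of the table above for c

State bits: 1 if the pattern prefix ending at that row matches the text ending at that column (the first column is the start, before any text)

Text      TTTACGTATTATTACGTCC..
Pattern
T        01110001011011000100..
T        00110000001001000000..
A        00001000000100100000..
T        00000000000010000000..
T        00000000000001000000..
A        00000000000000100000..
C        00000000000000010000..
G        00000000000000001000..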
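The update rule can be sketched as follows, with bit k of the state standing for "the first k+1 pattern characters match the text ending here". The function name `shift_and_search` is hypothetical. Python integers are unbounded, so this sketch ignores the 32 or 64 bit word limit that a real implementation would have to respect.

```python
def shift_and_search(text, pattern):
    """Return the start positions of all occurrences of pattern in text."""
    m = len(pattern)
    if m == 0:
        return []
    masks = {}                          # masks[c] has bit k set iff pattern[k] == c
    for k, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << k)
    accept = 1 << (m - 1)               # bit of the full pattern
    state = 0                           # start from 0
    hits = []
    for pos, ch in enumerate(text):
        # Extend every matching prefix by one character and start a new one.
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & accept:
            hits.append(pos - m + 1)
    return hits


if __name__ == "__main__":
    print(shift_and_search("TTTACGTATTATTACGTCC", "TTATTACG"))
```

Only one mask per alphabet character is stored, which is why the method suits small alphabets such as DNA.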

Shift-Or Method

Just invert (complement) the bits!

((001001 << 1) OR 000001) AND 010010

vs.

(110110 << 1) OR 101101

1.5 times faster?!
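A minimal sketch of the complemented version, in which a 0 bit means "matched" so that the shift itself brings in the new starting bit. The function name `shift_or_search` is hypothetical. The final `& full` is needed only because Python integers are unbounded; with a fixed-width machine word the overflowing bits are simply dropped.

```python
def shift_or_search(text, pattern):
    """Return the start positions of all occurrences of pattern in text."""
    m = len(pattern)
    if m == 0:
        return []
    full = (1 << m) - 1                 # m one-bits: nothing matches yet
    masks = {}                          # masks[c] has bit k cleared iff pattern[k] == c
    for k, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, full) & ~(1 << k)
    accept = 1 << (m - 1)
    state = full
    hits = []
    for pos, ch in enumerate(text):
        # The shift inserts a 0 at bit 0, so no separate "OR 1" step is needed.
        state = ((state << 1) | masks.get(ch, full)) & full
        if state & accept == 0:
            hits.append(pos - m + 1)
    return hits


if __name__ == "__main__":
    print(shift_or_search("TTTACGTATTATTACGTCC", "TTATTACG"))
```

Per text character the fixed-width version needs a shift and an OR, compared with a shift, an OR, and an AND in Shift-And.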

Summary

String searching algorithms

Brute-force: Naive, Rabin-Karp, Shift-Or

From left: KMP, AC

From right: BM, Horspool, Turbo-BM

Next week

Suffix arrays