45
Design & Analysis of Algorithms COMP 482 / ELEC 420 John Greiner

Design & Analysis of Algorithms COMP 482 / ELEC 420 John Greiner

  • Upload
    molimo

  • View
    38

  • Download
    0

Embed Size (px)

DESCRIPTION

Design & Analysis of Algorithms COMP 482 / ELEC 420 John Greiner. String Matching. Pattern: CCATT Text: ACTGCCATTCCTTAGGGCCATGTG Brute force & variant Automata-like strategies Suffix trees. To do: [CLRS] 32 Supplements #5. Some Applications. Text & p rogramming languages - PowerPoint PPT Presentation

Citation preview

Page 1: Design & Analysis of Algorithms  COMP 482 / ELEC 420 John Greiner

Design & Analysis of Algorithms COMP 482 / ELEC 420

John Greiner

Page 2: Design & Analysis of Algorithms  COMP 482 / ELEC 420 John Greiner

2

String Matching

Pattern: CCATT

Text: ACTGCCATTCCTTAGGGCCATGTG

• Brute force & variant• Automata-like strategies• Suffix trees

To do:[CLRS] 32Supplements#5

Page 3: Design & Analysis of Algorithms  COMP 482 / ELEC 420 John Greiner

3

Some Applications• Text & programming languages

– Spell-checking– Linguistic analysis– Tokenization– Virus scanning– Spam filtering– Database querying

• DNA sequence analysis• Music identification & analysis

Page 4: Design & Analysis of Algorithms  COMP 482 / ELEC 420 John Greiner

4

Brute Force Exact Match

Pattern: CCATT

Text: ACTGCCATTCCTTAGGGCCATGTG

Algorithm?Example of worst case?

Running time?

Page 5: Design & Analysis of Algorithms  COMP 482 / ELEC 420 John Greiner

5

Rabin-Karp, 1981

Pattern: CCATT

Text: ACTGCCATTCCTTAGGGCCATGTG

Hash pattern & -substrings– Compare hashes.– Compare strings only when hashes match.– easily computed from ,

Best and worst cases?

Page 6: Design & Analysis of Algorithms  COMP 482 / ELEC 420 John Greiner

6

Intuition for Better Algorithms

Pattern: CCATT

Text: CCAGCCATTCCTTAGGGCCATGTG

After failing match at 1st position, what should we do?

Page 7: Design & Analysis of Algorithms  COMP 482 / ELEC 420 John Greiner

7

Quick Overview of Two Algorithms

Knuth-Morris-Pratt, 1977– Use previous intuition.– Preprocessing builds shift table based upon pattern

prefixes. – Each text character compared once or twice.

Boyer-Moore, 1977 & Turbo Boyer-Moore, 1992– Use previous intuition, along with similar heuristics.

Complicated.– Match from right end of pattern.– Can often skip some text characters. , with best case.

Page 8: Design & Analysis of Algorithms  COMP 482 / ELEC 420 John Greiner

8

Finite Automata Matching Example

Pattern: CCATT

Text: CCAGCCATTCCTTAGGGCCATGTG

CCA CCAT CCATT

CCCC C

A

C

T

C

TC

C

to build

Page 9: Design & Analysis of Algorithms  COMP 482 / ELEC 420 John Greiner

9

Regular Expressions

Syntax: , , , , ,

Page 10: Design & Analysis of Algorithms  COMP 482 / ELEC 420 John Greiner

10

Regular Expression to Finite Automaton

𝑎 𝑎

𝜖 𝜖

Page 11: Design & Analysis of Algorithms  COMP 482 / ELEC 420 John Greiner

11

Regular Expression to Finite Automaton

𝑟+𝑠𝑟𝑠𝜖

𝜖 𝜖

𝜖

𝑟𝑠 𝑟 𝑠𝜖 𝜖𝜖

𝑟∗ 𝑟𝜖 𝜖

𝜖

𝜖

Page 12: Design & Analysis of Algorithms  COMP 482 / ELEC 420 John Greiner

12

Eliminating

𝑎 𝜖 𝜖

𝑎 𝑎𝑎

Page 13: Design & Analysis of Algorithms  COMP 482 / ELEC 420 John Greiner

13

Eliminating Non-Determinism

Each state is a set of the old states.

0 1

a

a

b

0 1

0,1

a

b a

b

a

b

a,b

Page 14: Design & Analysis of Algorithms  COMP 482 / ELEC 420 John Greiner

14

Can Minimize

States unreachable or equivalent?

0 1

a

a

b

0

0,1

a

b

1

a

b

a

b

a,b

Page 15: Design & Analysis of Algorithms  COMP 482 / ELEC 420 John Greiner

15

Finite Automaton Approach Summary

Preprocessing expensive for REs.Matching is flexible and still linear.

Page 16: Design & Analysis of Algorithms  COMP 482 / ELEC 420 John Greiner

16

Suffix Tree (Trie)

Text: CCAGAT

How many leaves, nodes, edges? Total space?How to search for a pattern? Running time?

T A

T GAT AGAT

C

CAGAT

5

4

3 2 1

6

GAT

Each edge labeled with two indices.

Page 17: Design & Analysis of Algorithms  COMP 482 / ELEC 420 John Greiner

17

Require Suffixes to End at Leaves

Text: CCAGAT

Reason: Simplicity. Distinguish nodes & leaves.Example text that would break that?

T A

T GAT AGAT

C

CAGAT

5

4

3 2 1

6

GAT

Page 18: Design & Analysis of Algorithms  COMP 482 / ELEC 420 John Greiner

18

Forcing Suffixes to End at Leaves

Text: CCAGAC$

$ A

C$ GAC$ AGAC$

C

CAGAC$

5

4

3 2 1

7

GAC$

6

$

Page 19: Design & Analysis of Algorithms  COMP 482 / ELEC 420 John Greiner

19

How Would You Create the Tree?

Text: CCAGAC$

$ A

C$ GAC$

C

5

4

3 2 1

7

GAC$

6

$ AGAC$ CAGAC$

Page 20: Design & Analysis of Algorithms  COMP 482 / ELEC 420 John Greiner

20

Suffix Tree Construction Example

Text: nananabanana$

nananabanana$

Page 21: Design & Analysis of Algorithms  COMP 482 / ELEC 420 John Greiner

21

Suffix Tree Construction Example

Text: nananabanana$

nananabanana$ananabanana$

Page 22: Design & Analysis of Algorithms  COMP 482 / ELEC 420 John Greiner

22

Suffix Tree Construction Example

Text: nananabanana$

Match nanabanana$, nananabanana$: 4 comparisons

nana

banana$nabanana$

ananabanana$

Page 23: Design & Analysis of Algorithms  COMP 482 / ELEC 420 John Greiner

23

Suffix Tree Construction Example

Text: nananabanana$

Match anabanana$, ananabanana$: 3 redundant compsNext …Match nabanana$, nanabanana$: 2 redundant compsMatch abanana$, anabanana$: 1 redundant comp

nana

banana$nabanana$

nabanana$

banana$

ana

Page 24: Design & Analysis of Algorithms  COMP 482 / ELEC 420 John Greiner

24

First Algorithm Improvement: No Redundant Matching

When inserting , nanabanana$If we discover is a prefix in tree, nanaThen will also be a prefix in tree. anaSo, don’t bother matching to verify.

Page 25: Design & Analysis of Algorithms  COMP 482 / ELEC 420 John Greiner

25

Suffix Tree Construction Example

Text: nananabanana$

Match nanabanana$, nananabanana$: 4 comparisons

nana

banana$nabanana$

ananabanana$

Page 26: Design & Analysis of Algorithms  COMP 482 / ELEC 420 John Greiner

26

Suffix Tree Construction Example

Text: nananabanana$

No matching. Just form new node and adjust edge string indices.

nana

banana$nabanana$

nabanana$

banana$

ana

Page 27: Design & Analysis of Algorithms  COMP 482 / ELEC 420 John Greiner

27

Suffix Tree Construction Example

Text: nananabanana$

No matching. Just form new node and adjust edge string indices.

na

banana$nabanana$

nabanana$

banana$

ana

na

banana$

Page 28: Design & Analysis of Algorithms  COMP 482 / ELEC 420 John Greiner

28

Suffix Tree Construction Example

Text: nananabanana$

No matching. Just form new node and adjust edge string indices.

na

banana$nabanana$

nabanana$

banana$

na

na

banana$

a

banana$

Page 29: Design & Analysis of Algorithms  COMP 482 / ELEC 420 John Greiner

29

Suffix Tree Construction Example

Text: nananabanana$

na

banana$nabanana$

nabanana$

banana$

na

na

banana$

a

banana$

banana$

Page 30: Design & Analysis of Algorithms  COMP 482 / ELEC 420 John Greiner

30

Suffix Tree Construction Example

Text: nananabanana$

anana is there, but it takes work to match & follow links.

na

banana$nabanana$banana$

na

na

banana$

a

banana$

banana$

$banana$

na

Page 31: Design & Analysis of Algorithms  COMP 482 / ELEC 420 John Greiner

31

Suffix Tree Construction Example

Text: nananabanana$

nana is there, but it takes work to match & follow links.

na

banana$nabanana$banana$

na

na

banana$

a

banana$

banana$

$banana$

na $

Page 32: Design & Analysis of Algorithms  COMP 482 / ELEC 420 John Greiner

32

Second Algorithm Improvement: Suffix Links

Text: nananabanana$

Each internal node for has pointer to node for .

banana$a

banana$ na

na

banana$$ na

banana$$ nabanana$banana$ $na

banana$ $

$

Page 33: Design & Analysis of Algorithms  COMP 482 / ELEC 420 John Greiner

33

Suffix Tree Construction Example

Text: nananabanana$

Have to follow links for match on first time.Need to create suffix link for new node.

na

banana$nabanana$banana$

na

na

banana$

a

banana$

banana$

$banana$

na

Page 34: Design & Analysis of Algorithms  COMP 482 / ELEC 420 John Greiner

34

Suffix Tree Construction Example

Text: nananabanana$

Need to create suffix link for new node.Use parent’s suffix link to get close.

na

banana$nabanana$banana$

na

na

banana$

a

banana$

banana$

$banana$

na

Page 35: Design & Analysis of Algorithms  COMP 482 / ELEC 420 John Greiner

35

Suffix Tree Construction Example

Text: nananabanana$

na

banana$nabanana$banana$

na

na

banana$

a

banana$

banana$

$banana$

na

Page 36: Design & Analysis of Algorithms  COMP 482 / ELEC 420 John Greiner

36

Suffix Tree Construction Example

Text: nananabanana$

Already found node. No searching or matching.

na

banana$nabanana$banana$

na

na

banana$

a

banana$

banana$

$banana$

na $

Page 37: Design & Analysis of Algorithms  COMP 482 / ELEC 420 John Greiner

37

Suffix Tree Construction Example

Text: nananabanana$

Follow suffix link.

na

banana$nabanana$banana$

na

na

banana$

a

banana$

banana$

$banana$

na $$

Page 38: Design & Analysis of Algorithms  COMP 482 / ELEC 420 John Greiner

38

Suffix Tree Construction Example

Text: nananabanana$

Follow suffix link.

na

banana$nabanana$banana$

na

na

banana$

a

banana$

banana$

$banana$

na $$

$

Page 39: Design & Analysis of Algorithms  COMP 482 / ELEC 420 John Greiner

39

Suffix Tree Construction Example

Text: nananabanana$

Follow suffix link.

na

banana$nabanana$banana$

na

na

banana$

a

banana$

banana$

$banana$

na $$

$$

Page 40: Design & Analysis of Algorithms  COMP 482 / ELEC 420 John Greiner

40

Suffix Tree Construction Example

Text: nananabanana$

Done., but roughly how big is ?

na

banana$nabanana$banana$

na

na

banana$

a

banana$

banana$

$banana$

na $$

$$

$

Page 41: Design & Analysis of Algorithms  COMP 482 / ELEC 420 John Greiner

41

Almost Correct Analysis of Construction

Two indices:

: Each increment takes time.– Just search for one more character.

: Each increment takes time.– Follow suffix link, or from root, get to suffix match.– Possibly split node & create suffix link.– Add one edge & leaf.

, each incremented times total.

Page 42: Design & Analysis of Algorithms  COMP 482 / ELEC 420 John Greiner

42

Creating & Following Suffix Links

When creating new node, need new suffix link.Follow parent’s suffix link to get close, then search down.

Text:

𝑎𝐵 𝐵

𝐸𝐺𝐷

𝐶𝐶𝐷𝐸𝐺 𝐹 = index of next

suffix link

Page 43: Design & Analysis of Algorithms  COMP 482 / ELEC 420 John Greiner

43

Correct Analysis of Construction

Three indices: ( not part of alg.)

: Each increment takes time.: Follow some number of links in time. Increments by at least .: Each increment takes time in addition to that considered for .

, , each incremented at most times total.

Page 44: Design & Analysis of Algorithms  COMP 482 / ELEC 420 John Greiner

44

Some Applications of Suffix TreesSearch for fixed patternsSearch for regular expressionsFind longest common substrings, longest repeated substringsFind most commonly repeated substringsFind maximal palindromesFind Lempel-Ziv decomposition (for text compression)

As used in

BioinformaticsData compressionData clustering