1

Efficient String Matching : An Aid to Bibliographic Search

Alfred V. Aho and Margaret J. Corasick

Bell Laboratories

2

Virus Definition

Each virus has its own characteristic signature. Example in ClamAV:

_0017_0001_000=21b8004233c999cd218bd6b90300b440cd218b4c198b541bb80157cd21b43ecd2132ed

_0017_0001_000 is the virus index. The signature is written in hexadecimal, e.g. Hex(21)=Dec(33)=‘!’.

A virus is detected by matching its signature against the data being scanned.
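As a minimal sketch of what matching a signature means, the Python below decodes the hex body of the ClamAV example above into bytes and looks for it in a buffer with a plain substring test. The names `SIGNATURE_HEX`, `contains_signature` and the sample buffers are illustrative assumptions, not part of ClamAV; a real scanner would match many signatures at once rather than run one test per signature.

```python
# Hex body of the example signature from the slide (virus index _0017_0001_000).
SIGNATURE_HEX = (
    "21b8004233c999cd218bd6b90300b440cd218b4c198b541b"
    "b80157cd21b43ecd2132ed"
)

# Each pair of hex digits encodes one byte, e.g. "21" -> 0x21 -> 33 -> '!'.
SIGNATURE = bytes.fromhex(SIGNATURE_HEX)


def contains_signature(data: bytes, signature: bytes = SIGNATURE) -> bool:
    """Return True if the signature bytes occur anywhere in data (naive check)."""
    return signature in data


if __name__ == "__main__":
    # Hypothetical buffers standing in for file contents.
    infected = b"\x00\x01" + SIGNATURE + b"\xff"
    clean = b"just some harmless bytes"
    print(chr(SIGNATURE[0]), contains_signature(infected), contains_signature(clean))
```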

3

Regular Expression

Use regular expressions (RE) to describe the signature.

? matches any single character. Example:

W32.Hybris.C (Clam)=4000?????????????83??????75f2e9????ffff00000000

* matches any sequence of characters (including none). Example: Oror-fam

(Clam)=495243*56697275*53455859330f5455*4b617a61*536e617073686f

{n1-n2} means that between n1 and n2 characters separate the two parts. Example: Worm.Bagle.AG-empty

(Clam)=6e74656e742d547970653a206170706c69636174696f6e2f6f637465742d73747265616d3b{40-130}2d2d2d2d2d2d2d2d

4

Introduction

Problem: locate all occurrences of any of a finite number of keywords in a string of text.

The algorithm consists of two parts:

1. Constructing a finite state pattern matching machine from the keywords.

2. Using the pattern matching machine to process the text string in a single pass.

5

Pattern Matching Machine(1)

Our problem is to locate and identify all substrings of x which are keywords in K.

K : K={y1,y2,…,yk} is a finite set of strings which we shall call keywords.

x : x is an arbitrary string which we shall call the text string.

The behavior of the pattern matching machine is dictated by three functions: a goto function g, a failure function f, and an output function output.

6

Pattern Matching Machine(2)

g (s,a) = s’ or fail : maps a pair consisting of a state and an input symbol into a state or the message fail.

f (s) = s’ : maps a state into a state, and is consulted whenever the goto function reports fail.

output (s) = keywords : associates a set of keywords (possibly empty) with every state.

7

Pattern Matching Machine Example with keywords {he,she,his,hers}

8

9

The start state is state 0.

Let s be the current state and a the current symbol of the input string x.

Operating cycle:

1. If g(s,a)=s’, the machine makes a goto transition: it enters state s’, and the next symbol of x becomes the current input symbol. If output(s’) is not empty, the machine also emits output(s’) together with the position of the current input symbol.

2. If g(s,a)=fail, the machine makes a failure transition: it consults the failure function f, and if f(s)=s’, it repeats the cycle with s’ as the current state and a as the current input symbol.

10

Example

Text: u s h e r s

State: 0 0 3 4 5 (2) 8 9

The state in parentheses is entered by a failure transition, which consumes no input symbol.

In state 4, since g(4,e)=5, the machine enters state 5, finds the keywords “she” and “he” ending at position four of the text string, and emits output(5).

11

Example Cont’d

In state 5 on input symbol r, the machine makes two state transitions in its operating cycle.

Since g(5,r)=fail, M enters state 2=f(5) . Then since g(2,r)=8, M enters state 8 and advances to the next input symbol.

No output is generated in this operating cycle.

12

Algorithm 1. Pattern matching machine.

Input. A text string x = a1 a2 … an where each ai is an input symbol, and a pattern matching machine M with goto function g, failure function f, and output function output, as described above.

Output. Locations at which keywords occur in x.

Method.

  begin
    state ← 0
    for i ← 1 until n do
      begin
        while g(state, ai) = fail do state ← f(state)
        state ← g(state, ai)
        if output(state) ≠ empty then
          begin
            print i
            print output(state)
          end
      end
  end
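Algorithm 1 can be sketched in Python as follows. The tables are hand-coded for the example machine with keywords {he, she, his, hers}; the names `GOTO`, `FAILURE`, `OUTPUT`, `g` and `search` are illustrative choices, and `None` is assumed to stand for the message fail.

```python
# Hand-coded tables for the example machine with keywords {he, she, his, hers}.
GOTO = {
    (0, "h"): 1, (0, "s"): 3,
    (1, "e"): 2, (1, "i"): 6,
    (2, "r"): 8, (3, "h"): 4,
    (4, "e"): 5, (6, "s"): 7,
    (8, "s"): 9,
}
FAILURE = {1: 0, 2: 0, 3: 0, 4: 1, 5: 2, 6: 0, 7: 3, 8: 0, 9: 3}
OUTPUT = {2: {"he"}, 5: {"she", "he"}, 7: {"his"}, 9: {"hers"}}

FAIL = None  # stands for the message "fail"


def g(state, symbol):
    """Goto function: state 0 loops to itself on symbols with no outgoing edge."""
    nxt = GOTO.get((state, symbol), FAIL)
    if nxt is FAIL and state == 0:
        return 0
    return nxt


def search(text):
    """Algorithm 1: return (position, keywords) pairs, positions counted from 1."""
    matches = []
    state = 0
    for i, a in enumerate(text, start=1):
        # Follow failure transitions until a goto transition exists.
        while g(state, a) is FAIL:
            state = FAILURE[state]
        state = g(state, a)
        if OUTPUT.get(state):
            matches.append((i, sorted(OUTPUT[state])))
    return matches


if __name__ == "__main__":
    print(search("ushers"))  # [(4, ['he', 'she']), (6, ['hers'])]
```

The inner loop always terminates because every chain of failure transitions ends in state 0, where the goto function never reports fail.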

13

Construction of the functions

The construction has two parts:

First : Determine the states and the goto function.

Second : Compute the failure function.

The computation of the output function begins in the first part and is completed in the second.

14

Construction of Goto function

Construct a goto graph like the one on the next slide.

Add new vertices and edges to the graph, starting at the start state.

Add new edges only when necessary.

Add a loop from state 0 to state 0 on all input symbols other than those that begin a keyword.

15

Construction of Goto function with keywords {he,she,his,hers}

16

Algorithm 2. Construction of the goto function.

Input. Set of keywords K = {y1, y2, …, yk}.

Output. Goto function g and a partially computed output function output.

Method. We assume output(s) is empty when state s is first created, and g(s, a) = fail if a is undefined or if g(s, a) has not yet been defined. The procedure enter(y) inserts into the goto graph a path that spells out y.

  begin
    newstate ← 0
    for i ← 1 until k do enter(yi)
    for all a such that g(0, a) = fail do g(0, a) ← 0
  end

Algorithm 2

17

  procedure enter(a1 a2 … am):
  begin
    state ← 0; j ← 1
    while g(state, aj) ≠ fail do
      begin
        state ← g(state, aj)
        j ← j + 1
      end
    for p ← j until m do
      begin
        newstate ← newstate + 1
        g(state, ap) ← newstate
        state ← newstate
      end
    output(state) ← {a1 a2 … am}
  end
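A possible Python rendering of Algorithm 2 and its enter procedure is sketched below. It assumes a dictionary keyed by (state, symbol) for the goto function, with a missing key standing for fail; `build_goto` and `g` are illustrative names. Two small liberties are taken: the loop on state 0 is kept implicit in the lookup helper instead of being stored per symbol, and the path-following loop stops at the end of the keyword so that a keyword which is a prefix of an earlier one is handled safely.

```python
def build_goto(keywords):
    """Algorithm 2: return (goto, output) for the given keywords.

    goto maps (state, symbol) -> state; a missing key means "fail".
    """
    goto = {}
    output = {}
    newstate = 0

    for word in keywords:
        # enter(word): follow the existing path as far as possible ...
        state, j = 0, 0
        while j < len(word) and (state, word[j]) in goto:
            state = goto[(state, word[j])]
            j += 1
        # ... then extend it with new states for the remaining symbols.
        for symbol in word[j:]:
            newstate += 1
            goto[(state, symbol)] = newstate
            state = newstate
        output.setdefault(state, set()).add(word)

    return goto, output


def g(goto, state, symbol):
    """Goto lookup with the loop on state 0; None stands for fail."""
    nxt = goto.get((state, symbol))
    return 0 if nxt is None and state == 0 else nxt


if __name__ == "__main__":
    goto, output = build_goto(["he", "she", "his", "hers"])
    print(goto)
    print(output)
```

With the keywords entered in the order he, she, his, hers, the states are numbered exactly as in the example goto graph.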

18

Construction of Failure function

Depth of s : the length of the shortest path from the start state to state s.

The states of depth d can be determined from the states of depth d-1.

Make f(s)=0 for all states s of depth 1.

19

Construction of Failure function Cont’d

To compute the failure function for the states of depth d, consider each state r of depth d-1 and perform the following actions:

1. If g(r,a)=fail for all a, do nothing.

2. Otherwise, for each a such that g(r,a)=s, do the following:

a. Set state=f(r).

b. Execute state ← f(state) zero or more times, until a value for state is obtained such that g(state,a)≠fail. (Such a state is always found, since g(0,a)≠fail for all a.)

c. Set f(s)=g(state,a).

20

Algorithm 3. Construction of the failure function.

Input. Goto function g and output function output from Algorithm 2.

Output. Failure function f and output function output.

Method.

  begin
    queue ← empty
    for each a such that g(0, a) = s ≠ 0 do
      begin
        queue ← queue ∪ {s}
        f(s) ← 0
      end

Algorithm 3

21

    while queue ≠ empty do
      begin
        let r be the next state in queue
        queue ← queue - {r}
        for each a such that g(r, a) = s ≠ fail do
          begin
            queue ← queue ∪ {s}
            state ← f(r)
            while g(state, a) = fail do state ← f(state)
            f(s) ← g(state, a)
            output(s) ← output(s) ∪ output(f(s))
          end
      end
  end
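Algorithm 3 can be sketched in Python as a breadth-first traversal of the goto graph. The input tables below are the goto graph and partial output function for {he, she, his, hers}, written out by hand; `build_failure`, `EXAMPLE_GOTO` and `EXAMPLE_PARTIAL_OUTPUT` are illustrative names, and a missing dictionary key is assumed to mean fail.

```python
from collections import deque

# Goto graph and partial output for {he, she, his, hers}, as Algorithm 2 builds them.
EXAMPLE_GOTO = {
    (0, "h"): 1, (0, "s"): 3, (1, "e"): 2, (1, "i"): 6, (2, "r"): 8,
    (3, "h"): 4, (4, "e"): 5, (6, "s"): 7, (8, "s"): 9,
}
EXAMPLE_PARTIAL_OUTPUT = {2: {"he"}, 5: {"she"}, 7: {"his"}, 9: {"hers"}}


def build_failure(goto, partial_output):
    """Algorithm 3: return (failure, output) with the output sets completed."""

    def g(state, symbol):
        # State 0 loops to itself on every symbol without an outgoing edge.
        nxt = goto.get((state, symbol))
        return 0 if nxt is None and state == 0 else nxt

    # Group outgoing edges by source state.
    edges = {}
    for (r, a), s in goto.items():
        edges.setdefault(r, []).append((a, s))

    failure = {}
    output = {s: set(words) for s, words in partial_output.items()}  # copy
    queue = deque()

    # States of depth 1 fail back to the start state.
    for _, s in edges.get(0, []):
        failure[s] = 0
        queue.append(s)

    # Breadth-first: states of depth d are derived from states of depth d-1.
    while queue:
        r = queue.popleft()
        for a, s in edges.get(r, []):
            queue.append(s)
            state = failure[r]
            while g(state, a) is None:
                state = failure[state]
            failure[s] = g(state, a)
            inherited = output.get(failure[s])
            if inherited:
                output.setdefault(s, set()).update(inherited)
    return failure, output


if __name__ == "__main__":
    failure, output = build_failure(EXAMPLE_GOTO, EXAMPLE_PARTIAL_OUTPUT)
    print(failure)
    print(output)
```

Copying a Python set takes time proportional to its size; the constant-time merge mentioned in the complexity analysis relies on linked lists instead.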

22

About construction

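When we determine f(s)=s’, we merge the outputs of state s with the outputs of state s’.

The failure function produced by Algorithm 3 is not optimal. For example, g(4,e)=5, so when the machine is in state 4 and the current input symbol is not e, it enters state f(4)=1 even though it already knows that the symbol is not e. In fact, if the keyword “his” were not present, the machine could go directly from state 4 to state 0, skipping an unnecessary intermediate transition to state 1.

To avoid such unnecessary failure transitions, we can use a deterministic finite automaton, which is discussed later.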

23

Properties of Algorithms 1,2,3

Lemma 1: Suppose that in the goto graph state s is represented by the string u and state t is represented by the string v. Then f(s)=t iff v is the longest proper suffix of u that is also a prefix of some keyword.

Proof (by induction on the length of u) : Suppose u=a1a2…aj, and a1a2…aj-1 represents state r. Let r1,r2,…,rn be the sequence of states such that:

1. r1=f(r);

2. ri+1=f(ri) for 1 ≤ i < n;

3. g(ri,aj)=fail for 1 ≤ i < n;

4. g(rn,aj)=t.

Suppose vi represents state ri. By the inductive hypothesis, v1 is the longest proper suffix of a1a2…aj-1 that is a prefix of some keyword; v2 is the longest proper suffix of v1 that is a prefix of some keyword, and so on.

Thus vn is the longest proper suffix of a1a2…aj-1 such that vnaj is a prefix of some keyword. Since Algorithm 3 sets f(s)=g(rn,aj)=t, the lemma follows.
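Lemma 1 also gives a direct, if slow, way to check a failure function: name each state by the string that leads to it and look for the longest proper suffix that is a prefix of some keyword. The sketch below does this by brute force for {he, she, his, hers}; `PREFIXES` and `failure_string` are illustrative names, and this is a check of the characterization, not a replacement for Algorithm 3.

```python
KEYWORDS = ["he", "she", "his", "hers"]

# Every prefix of a keyword names a state; "" names the start state.
PREFIXES = {word[:i] for word in KEYWORDS for i in range(len(word) + 1)}


def failure_string(u):
    """Longest proper suffix of u that is a prefix of some keyword (Lemma 1)."""
    # start=1 skips u itself; earlier start positions give longer suffixes.
    for start in range(1, len(u) + 1):
        if u[start:] in PREFIXES:
            return u[start:]
    return ""  # only reached for the empty string


if __name__ == "__main__":
    for u in sorted(PREFIXES - {""}, key=len):
        print(repr(u), "->", repr(failure_string(u)))
```

For example, the state for "she" fails to the state for "he", and the states for "his" and "hers" both fail to the state for "s", in agreement with f(5)=2, f(7)=3 and f(9)=3.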

24

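Properties of Algorithms 1,2,3

Lemma 2 : The set output(s) contains y if and only if y is a keyword that is a suffix of the string representing state s.

Proof : Let u be the string representing state s, and consider a string y in output(s).

If y is added to output(s) by Algorithm 2, then y=u and y is a keyword.

If y is added to output(s) by Algorithm 3, then y is in output(f(s)), so by the inductive hypothesis y is a keyword that is a suffix of the string representing f(s), and hence, by Lemma 1, a suffix of u.

Conversely, if y is a keyword that is a proper suffix of u, then from the inductive hypothesis and Lemma 1 we know that output(f(s)) contains y, so Algorithm 3 adds y to output(s).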

25

Properties of Algorithms 1,2,3

Lemma 3 : After the jth operating cycle, Algorithm 1 will be in state s iff s is represented by the longest suffix of a1a2…aj that is a prefix of some keyword.

Proof : Similar to Lemma 1.

THEOREM 1 : Algorithms 2 and 3 produce valid goto, failure, and output functions.

Proof : By Lemmas 2 and 3.

26

Time Complexity of Algorithms 1, 2, and 3

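THEOREM 2 : Using the goto, failure and output functions created by Algorithms 2 and 3, Algorithm 1 makes fewer than 2n state transitions in processing a text string of length n.

Proof :

In each operating cycle Algorithm 1 makes zero or more failure transitions followed by exactly one goto transition.

From a state s of depth d, Algorithm 1 makes at most d failure transitions in one operating cycle, since every failure transition leads to a state of smaller depth.

A goto transition increases the depth by at most one, so the total number of failure transitions is smaller than the total number of goto transitions by at least one.

In processing an input of length n, Algorithm 1 makes exactly n goto transitions.

Therefore the total number of state transitions is less than 2n.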
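The bound can be observed on a small machine by counting the two kinds of transitions separately. The sketch below uses a hypothetical single-keyword machine for "aaab", hand-coded and chosen because a mismatch deep in the graph triggers several failure transitions in one operating cycle; `count_transitions` is an illustrative name.

```python
# Hypothetical machine for the single keyword "aaab" (tables written by hand).
GOTO = {(0, "a"): 1, (1, "a"): 2, (2, "a"): 3, (3, "b"): 4}
FAILURE = {1: 0, 2: 1, 3: 2, 4: 0}


def g(state, symbol):
    """Goto lookup; state 0 loops on symbols without an edge, None means fail."""
    nxt = GOTO.get((state, symbol))
    return 0 if nxt is None and state == 0 else nxt


def count_transitions(text):
    """Run Algorithm 1 on text and return (goto_count, failure_count)."""
    state = 0
    goto_count = failure_count = 0
    for a in text:
        while g(state, a) is None:
            state = FAILURE[state]
            failure_count += 1
        state = g(state, a)
        goto_count += 1
    return goto_count, failure_count


if __name__ == "__main__":
    for text in ["aaac", "aaaa", "aaacaaac"]:
        gotos, failures = count_transitions(text)
        print(text, gotos, failures, gotos + failures < 2 * len(text))
```

On "aaac" the last symbol causes three failure transitions (3 to 2 to 1 to 0) before the goto transition in state 0, for 4 + 3 = 7 transitions on a text of length 4.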

27

Time Complexity of Algorithms 1, 2, and 3

THEOREM 3 : Algorithm 2 requires time linearly proportional to the sum of the lengths of the keywords.

Proof : Straightforward.

THEOREM 4 : Algorithm 3 can be implemented to run in time proportional to the sum of the lengths of the keywords.

Proof :

The total number of executions of state ← f(state) is bounded by the sum of the lengths of the keywords.

Using linked lists to represent the output set of a state, we can execute the statement output(s) ← output(s) ∪ output(f(s)) in constant time.


30

Eliminating Failure Transitions

In Algorithm 1, use in place of the goto function g a next move function δ such that δ(s, a) is a state for each state s and input symbol a.

By using the next move function δ, we can dispense with all failure transitions, and make exactly one state transition per input character.

31

Algorithm 4. Construction of a deterministic finite automaton.

Input. Goto function g from Algorithm 2 and failure function f from Algorithm 3.

Output. Next move function δ.

Method.

  begin
    queue ← empty
    for each symbol a do
      begin
        δ(0, a) ← g(0, a)
        if g(0, a) ≠ 0 then queue ← queue ∪ {g(0, a)}
      end
    while queue ≠ empty do
      begin
        let r be the next state in queue
        queue ← queue - {r}
        for each symbol a do
          if g(r, a) = s ≠ fail then
            begin
              queue ← queue ∪ {s}
              δ(r, a) ← s
            end
          else δ(r, a) ← δ(f(r), a)
      end
  end
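Algorithm 4 can be sketched in Python as below, again for the example machine with hand-coded goto and failure tables. The alphabet is assumed to be the five letters used by the keywords plus "." as a stand-in for any other symbol; `build_delta`, `run` and `OTHER` are illustrative names.

```python
from collections import deque

GOTO = {
    (0, "h"): 1, (0, "s"): 3, (1, "e"): 2, (1, "i"): 6, (2, "r"): 8,
    (3, "h"): 4, (4, "e"): 5, (6, "s"): 7, (8, "s"): 9,
}
FAILURE = {1: 0, 2: 0, 3: 0, 4: 1, 5: 2, 6: 0, 7: 3, 8: 0, 9: 3}
OTHER = "."                 # assumed placeholder for symbols outside the keywords
ALPHABET = "ehirs" + OTHER


def build_delta(goto, failure, alphabet):
    """Algorithm 4: return delta mapping (state, symbol) -> state."""
    delta = {}
    queue = deque()
    for a in alphabet:
        s = goto.get((0, a), 0)       # state 0 loops on symbols without an edge
        delta[(0, a)] = s
        if s != 0:
            queue.append(s)
    while queue:
        r = queue.popleft()
        for a in alphabet:
            s = goto.get((r, a))
            if s is not None:
                queue.append(s)
                delta[(r, a)] = s
            else:
                # f(r) is shallower than r, so its row of delta is already filled.
                delta[(r, a)] = delta[(failure[r], a)]
    return delta


def run(delta, text):
    """Return the states visited: exactly one transition per input symbol."""
    state, states = 0, [0]
    for a in text:
        symbol = a if (state, a) in delta else OTHER
        state = delta[(state, symbol)]
        states.append(state)
    return states


if __name__ == "__main__":
    delta = build_delta(GOTO, FAILURE, ALPHABET)
    print(run(delta, "ushers"))  # [0, 0, 3, 4, 5, 8, 9]
```

On the text ushers the automaton moves from state 5 straight to state 8 on r, without the intermediate visit to state 2 that the failure function requires.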

32

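Fig. 3. Next move function. Each row lists input symbol → next state; “.” stands for any input symbol not listed for that state.

state 0 : h → 1, s → 3, . → 0

state 1 : e → 2, i → 6, h → 1, s → 3, . → 0

states 9, 7, 3 : h → 4, s → 3, . → 0

states 5, 2 : r → 8, h → 1, s → 3, . → 0

state 6 : s → 7, h → 1, . → 0

state 4 : e → 5, i → 6, h → 1, s → 3, . → 0

state 8 : s → 9, h → 1, . → 0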

33

Conclusion

The approach is attractive for large numbers of keywords, since all keywords can be matched simultaneously in one pass.

Using the next move function can potentially reduce the number of state transitions by 50%, but it requires more memory.

In practice the machine spends most of its time in state 0, from which there are no failure transitions, so the saving is smaller.

34

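[Figure: goto graph for the keywords {he,she,his,hers}. Paths from state 0: 0 -h-> 1 -e-> 2 -r-> 8 -s-> 9; 1 -i-> 6 -s-> 7; 0 -s-> 3 -h-> 4 -e-> 5. State 0 loops to itself on every input symbol other than h and s.]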