
A Polynomial Time Matching Algorithm of Ordered Tree Patterns having Height-Constrained Variables

Kazuhide Aikou1, Yusuke Suzuki1,2, Takayoshi Shoudai1,

Tomoyuki Uchida2, Tetsuhiro Miyahara2

1. Department of Informatics, Kyushu University, Japan

2. Faculty of Information Sciences, Hiroshima City University, Japan

Contents

1. Backgrounds and Motivations

2. Preliminaries

- Ordered Term Trees

- Height-Constrained Variables

3. A Matching Algorithm of Ordered Term Trees having Height-Constrained Variables

4. Conclusions and Future Works

Increase of Tree-structured Data (Web Documents, HTML/XML, etc.)

Discovery of Tree-structured Patterns Common to Tree-structured Data

Application: Knowledge Discovery from Web Documents

<Salesperiod>
  <Quarter>Winter1998</Quarter>
  <Design>
    <Designnumber>C365</Designnumber>
    <Description>North Star Polo</Description>
    <Unitssold>35500</Unitssold>
  </Design>
</Salesperiod>

[Figure: the XML data above drawn as an ordered tree, and an HTML page drawn as an ordered tree with the tags <HTML>, <Head>, <Body>, <Title>, <Table> and the text Text_university.]

Ordered Term Trees

Our Works:

• COLT for Term Trees

• Web Mining Systems Using Learning Algorithms for Term Trees

Backgrounds

Ordered trees express semi-structured data (HTML, XML, etc.).

<HTML>

  <HEAD>text1</HEAD>

  <BODY>

   <DIV>text2</DIV>

   <FONT>text3</FONT>

   <FONT>text4</FONT>

  </BODY>

</HTML>

[Figure: the HTML data above represented in the Object Exchange Model as an ordered tree. Legend: TAG, TEXT. The numbers 1, 2, 3 on the edges give the order of the children.]

Preliminaries

[Figure: the ordered tree for the HTML data, with the labels <HTML>, <HEAD>, <BODY>, <DIV>, <FONT>, <FONT>, text1, text2, text3, text4.]

Ordered Trees with Edge Labels

x,y,...: variable labels

Variable h2

An ordered term tree t = (V, E, H)

V: a vertex set

E: an edge set

H: a variable set
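A minimal Python sketch of how the triple t = (V, E, H) could be stored. The class name `TermTree`, its fields, and the choice to write a variable as a (parent port, child ports) pair are assumptions made for illustration; the slides do not prescribe a representation.

```python
from dataclasses import dataclass, field


@dataclass
class TermTree:
    """Hypothetical container for an ordered term tree t = (V, E, H)."""
    # vertex -> ordered list of its children (reached by an edge or a variable)
    children: dict = field(default_factory=dict)
    # variable label -> (parent port, ordered list of child ports)
    variables: dict = field(default_factory=dict)

    def vertices(self):
        """Collect V from the parent/child lists."""
        found = set(self.children)
        for kids in self.children.values():
            found.update(kids)
        return found

    def is_single_child_port(self, label):
        """True when the variable has exactly one child port."""
        _parent_port, child_ports = self.variables[label]
        return len(child_ports) == 1


# u1 has the children u2 (reached through variable x) and u3 (reached by an edge).
t = TermTree(children={"u1": ["u2", "u3"]}, variables={"x": ("u1", ["u2"])})
```

Because the dictionary of variables is keyed by label, this sketch can only hold term trees whose variable labels are mutually distinct.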

Ordered Tree Patterns with Internal Structured Variables

[Figure: an ordered term tree with the vertices u1 to u8 and two variables, h1 and h2, carrying the variable labels x and y.]

Variable h2: the parent port of h2 and the child ports of h2

Multi-child port variables: variables with at least one child port

Variable h1: the parent port of h1 and the child port of h1

Single-child port variables: variables with exactly one child port

A variable can be substituted with an arbitrary ordered tree.

Ordered Term Trees with Multi-Child Port Variables

[Figure: an ordered term tree t (vertices u1 to u7, variables labeled x and y), an ordered tree T1, an ordered tree T2, and the new ordered tree T obtained by replacing the variables of t with T1 and T2.]

Replacement of the multi-child port variable with T1:

- Identify the root of T1 with the parent port.

- Identify the two leaves with the two child ports.

Replacement of the single-child port variable with T2:

- Identify the root of T2 with the parent port.

- Choose one of the leaves of T2 and identify it with the child port.
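The replacement of a single-child port variable can be sketched in Python as follows. Trees are written as dictionaries from a vertex to its ordered list of children; the function name `substitute` and the vertex names in the example are assumptions made for illustration, and the sketch expects the vertex names of the two trees to be disjoint.

```python
def substitute(children, parent_port, child_port, tree, tree_root, tree_leaf):
    """Replace the variable (parent_port, child_port) by `tree`.

    The root of `tree` is identified with the parent port and the chosen
    leaf `tree_leaf` with the child port. Assumes tree_root != tree_leaf.
    """
    rename = {tree_root: parent_port, tree_leaf: child_port}
    result = {v: list(kids) for v, kids in children.items()}

    # The children of the tree's root take the place the variable occupied.
    slot = result[parent_port].index(child_port)
    result[parent_port][slot:slot + 1] = [rename.get(w, w) for w in tree[tree_root]]

    # Copy every other internal vertex of the tree, renaming the chosen leaf.
    for v, kids in tree.items():
        if v == tree_root or not kids:
            continue
        result[v] = [rename.get(w, w) for w in kids]
    return result


# Term tree: u1 has the children u2 and u3; the variable sits between u1 and u3.
term_tree = {"u1": ["u2", "u3"], "u3": ["u4"]}
# Replacement tree with root v1; the leaf v4 is chosen for the child port.
T2 = {"v1": ["v2", "v3"], "v3": ["v4"]}
new_tree = substitute(term_tree, "u1", "u3", T2, "v1", "v4")
```

In the result, u1 has the children u2, v2, v3, and the old child port u3 hangs below v3 together with its subtree.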

Substitutions

[Figure: a linear ordered term tree with the variables x and y, a substitution, and the ordered tree that the term tree matches.]

Linear Ordered Term Trees:

- All variables have mutually distinct variable labels.

- All variable replacements are decided independently.

INPUT T: an ordered tree; t: a linear ordered term tree with multi-child port variables.

PROBLEM Does t match T?

This matching problem can be solved in O(nN) time, where n is the number of vertices in t and N is the number of vertices in T [Suzuki et al., ILP 02].

Matching Problem for Linear Ordered Term Trees with Multi-Child Port Variables

[Figure: the HTML file shown earlier and its ordered tree, annotated with the height of the tree.]

Observation: Most ordered trees obtained from HTML files have low height.

A tree of large height is rare, so a long branch becomes a distinguishing feature.

[Figure: scatter plot of Height (axis 0 to 40) against Size (axis 0 to 2000), where Size is the number of vertices in a tree. Caption: Relationships between the size of the tree representing an HTML file and its height.]

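Height-constrained single-child port variables

A height-constrained variable carries a pair (i, j) with 0 < i ≤ j: the trunk length i and the height j.

Trunk Length: The path length between the root and the leaf which are identified with the ports.

[Figure: a variable labeled (i, j) and a replacement tree whose trunk length and height are marked (i’, j’).]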
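The two quantities can be computed as in the Python sketch below. The function `fits` additionally assumes that i is a lower bound on the trunk length and j an upper bound on the height; this reading is suggested by the phrase "lowest trunk lengths" in the Main Theorem but is not spelled out on this slide, and all function names are hypothetical.

```python
def height(tree, v):
    """Number of edges on the longest downward path starting at v."""
    kids = tree.get(v, [])
    return 0 if not kids else 1 + max(height(tree, c) for c in kids)


def trunk_length(tree, root, leaf):
    """Number of edges from root down to leaf, or None if leaf is not below root."""
    if root == leaf:
        return 0
    for c in tree.get(root, []):
        d = trunk_length(tree, c, leaf)
        if d is not None:
            return d + 1
    return None


def fits(tree, root, leaf, i, j):
    """Assumed reading of (i, j): trunk length >= i and height <= j."""
    length = trunk_length(tree, root, leaf)
    return length is not None and length >= i and height(tree, root) <= j


# r has the children a and b; below a hangs the path a, c, d.
example = {"r": ["a", "b"], "a": ["c"], "c": ["d"]}
```

With this example, choosing the leaf d gives trunk length 3 and height 3, so the tree fits (2,4) but not (2,2); choosing the leaf b gives trunk length 1, which is too short for (2,4).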

Example.

[Figure: an ordered term tree t with variables constrained by (2,2) and (2,4), and an ordered tree T. One replacement is marked O.K. and another is marked N.G.]

MATCHING PROBLEM for Linear Ordered Term Trees with Height-Constrained Single-Child Port Variables

INPUT: A linear ordered term tree t (in the figure, with variables constrained by (1,2) and (4,6)) and an ordered tree T.

PROBLEM: Does t match T?

Main Theorem

MATCHING PROBLEM for Linear Ordered Term Trees with Height-Constrained Single-Child Port Variables can be solved in O(N max{nDmax, S}) time, where

n: the number of vertices of t,

N: the number of vertices of T,

S: the total amount of the lowest trunk lengths of all variables of t,

Dmax: the maximum number of children of a vertex of T.

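Sub Term Tree and Subtree

[Figure: a linear ordered term tree t with variables constrained by (4,6), (1,1), and (1,2), with the sub term tree t[u’] highlighted; and an ordered tree T with T[u] and T[u]-T[v] highlighted.]

t[u’]: the sub term tree of t rooted at u’.

T[u]: the subtree of T consisting of u and all descendants of u.

T[u]-T[v]: the tree consisting of u and all descendants of u which are not proper descendants of v.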
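Since T[u]-T[v] keeps v itself but drops everything below it, its height can be computed by treating v as a leaf. A minimal Python sketch, with the hypothetical helper name `height_without` and trees written as dictionaries from a vertex to its ordered children:

```python
def height_without(tree, u, v):
    """Height of T[u]-T[v]: v stays as a leaf, its proper descendants are dropped."""
    if u == v:
        return 0
    kids = tree.get(u, [])
    if not kids:
        return 0
    return 1 + max(height_without(tree, c, v) for c in kids)


# u has the children a and v; the deepest branch of T[u] runs through v.
T = {"u": ["a", "v"], "a": ["b"], "v": ["c"], "c": ["d", "e"], "e": ["f"]}
```

Here T[u] has height 4 (along u, v, c, e, f), while T[u]-T[v] has height 2 (along u, a, b).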

Idea: Corresponding Sets CS(u)

t = (Vt, Et, Ht): a term tree, T = (VT, ET): a tree.

CS(u) ⊆ Vt × N × N: a corresponding set of a vertex u ∈ VT.

(v’,i,j) ∈ CS(u) shows that there is a descendant v of u such that

(1) t[v’] matches T[v],

(2) the length between u and v is i (if i < i’-1), and

(3) the height of T[u]-T[v] is j.

Here (i’,j’) is the constraint of the variable whose child port is v’.

[Figure: the term tree t with the sub term tree t[v’] below a variable constrained by (i’,j’), and the tree T in which v is a descendant of u at distance i, T[u]-T[v] has height j, and t[v’] matches T[v].]

Therefore, (v’,0,0) ∈ CS(u) if and only if t[v’] matches T[u].

(the root of t,0,0) ∈ CS(the root of T) if and only if t matches T.

Algorithm Matching(t,T)

  Initialization;

  while there is an unmarked vertex u of T do begin

    Mark u;

    VID-Inheriting(u);

    C-Set-Attaching(u)

  end
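The control flow of the loop can be sketched in Python as below. The sketch assumes that an unmarked vertex is only chosen once all of its children are marked, which is how the slides describe the computation ("from leaves to the root"). The three callbacks stand in for Initialization, VID-Inheriting, and C-Set-Attaching; the function name and the callback signatures are hypothetical.

```python
def matching_skeleton(tree, root, initialize, vid_inheriting, c_set_attaching):
    """Visit the vertices of T bottom-up and return them in marking order."""
    order = []
    stack = [(root, False)]
    while stack:
        u, children_done = stack.pop()
        kids = tree.get(u, [])
        if not kids:
            initialize(u)            # leaves are handled by Initialization
            order.append(u)
        elif children_done:
            vid_inheriting(u)        # all children of u are marked by now
            c_set_attaching(u)
            order.append(u)
        else:
            stack.append((u, True))  # come back to u after its children
            stack.extend((c, False) for c in reversed(kids))
    return order


calls = []
T = {"A": ["B", "C"], "B": ["D", "E"]}
order = matching_skeleton(
    T, "A",
    initialize=lambda u: calls.append(("init", u)),
    vid_inheriting=lambda u: calls.append(("inherit", u)),
    c_set_attaching=lambda u: calls.append(("attach", u)),
)
```

Unlike the slide, this sketch initializes each leaf when it is first reached instead of initializing all leaves up front; the resulting order still processes every vertex after its children.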


Initialization: Vertex Identifiers

[Figure: a linear ordered term tree t with variables constrained by (1,2) and (2,2); its vertices carry the vertex identifiers 1 to 9.]

Vertex identifiers are assigned in breadth-first search order.

The children of an internal vertex have consecutive vertex identifiers. This saves computation time in the main processes.
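A short Python sketch of the numbering. The tree shape below is an assumption chosen to be consistent with identifiers 1 to 9; the helper names `bfs_identifiers` and `child_range` are hypothetical.

```python
from collections import deque


def bfs_identifiers(tree, root):
    """Number the vertices 1, 2, ... in breadth-first search order."""
    ids = {}
    queue = deque([root])
    while queue:
        v = queue.popleft()
        ids[v] = len(ids) + 1
        queue.extend(tree.get(v, []))
    return ids


def child_range(tree, ids, v):
    """Children of v as an identifier range (first, last); None for a leaf."""
    kids = tree.get(v, [])
    return (ids[kids[0]], ids[kids[-1]]) if kids else None


t = {"r": ["a", "b"], "a": ["c", "d", "e"], "b": ["f"], "d": ["g", "h"]}
ids = bfs_identifiers(t, "r")
```

Because siblings leave the queue one after another, the children of a vertex always receive consecutive identifiers, so a pair (first, last) is enough to describe them.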

Compute the corresponding set of each vertex from leaves to the root.

[Figure: the term tree t (vertex identifiers 1 to 9, variables constrained by (1,2) and (3,6)) and an ordered tree T whose vertices are labeled A to Q.]

Initialization: For all leaves u of T,

  Mark u;

  CS(u) := {(u’,0,0) | u’ is a leaf of t};

  height(u) := 0;

The leaves of t are 4, 6, 7, 8, 9. For each leaf u of T (D, K, F, L, M, H, Q, J):

CS(u) = {(4,0,0), (6,0,0), (7,0,0), (8,0,0), (9,0,0)}, height(u) = 0
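The initialization step can be sketched in Python as follows. The shape of t below is an assumption: it is one tree whose breadth-first identifiers are 1 to 9 and whose leaves are 4, 6, 7, 8, 9, as on the slide. The tree T is a small made-up example, and the function names are hypothetical.

```python
def leaves(tree):
    """All vertices that have no children."""
    vertices = set(tree) | {c for kids in tree.values() for c in kids}
    return {v for v in vertices if not tree.get(v)}


def initialization(T, t):
    """Give every leaf u of T the set {(u', 0, 0) | u' is a leaf of t} and height 0."""
    leaf_triplets = {(w, 0, 0) for w in leaves(t)}
    CS = {u: set(leaf_triplets) for u in leaves(T)}
    height = {u: 0 for u in leaves(T)}
    return CS, height


# Assumed shape of t: identifiers in breadth-first order, leaves 4, 6, 7, 8, 9.
t = {1: [2, 3], 2: [4, 5, 6], 3: [7], 5: [8, 9]}
# A small made-up tree T with the leaves C and D.
T = {"A": ["B", "C"], "B": ["D"]}
CS, height = initialization(T, t)
```

Every leaf of T starts with the same corresponding set, because a single vertex of t matches a single vertex of T.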


VID-Inheriting (1/3): Let v’ be the child port of an (i,j)-height-constrained variable whose parent port is u’. For an internal vertex u of a tree, if there is an element (v’,i’,j’) in the CS of a child of u, add (v’, min{i’+1,i-1}, *) to CS(u).

If i’=i-1 then the parent of u can match the parent port u’.

Example, for the variable constrained by (3,6) between the vertices 3 and 7:

(7,0,0) ∈ CS(Q)

(7,0,0) ∈ CS(J)

Add (7,1,1) to CS(P)

Add (7,2,2) to CS(O)

Add (7,2,3) to CS(N)

Add (7,2,4) to CS(I)

N can match the vertex 3.

VID-Inheriting (2/3): Case: At least two children have (v’,i’,*) for a vertex v’ and an integer i’.

Example, for a variable constrained by (4,6) between the vertices 3 and 7, and a vertex a of T with the children b and c:

(7,1,1) ∈ CS(b), height(b)=4

(7,1,3) ∈ CS(c), height(c)=3

Candidates for CS(a): (7,2,4), (7,2,5)

Choose the smallest height: (7,2,4) ∈ CS(a)

VID-Inheriting (3/3): Case: A child has (v’,i’,*) and another child has (v’,i’’,*) for distinct integers i’ and i’’.

Example, for a variable constrained by (4,6) between the vertices 3 and 7, and a vertex a of T with the children b and c:

(7,1,3) ∈ CS(b), height(b)=4

(7,2,2) ∈ CS(c), height(c)=3

(7,2,4), (7,3,5) ∈ CS(a)

Add all triplets to CS(u) (at most i triplets)

• CS(a) contains at most S triplets.

• Then the total time complexity of Inheriting of a vertex a is O(S ma), where ma is the number of the children of a.
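The three cases can be sketched in one Python function. The rule for the third component is not stated in words on the slides; the sketch assumes it is max{j’+1, 1 + the largest height among the other children of u}, which reproduces the triplets in the examples above. The function name and argument layout are hypothetical, the upper bound j of the variable is not checked, and no attempt is made to reach the O(S ma) bound.

```python
def vid_inheriting(child_cs, child_height, min_trunk):
    """Return the triplets that the children of u pass up to CS(u).

    child_cs:     one set of triplets (v', i', j') per child of u, in order
    child_height: the height of each child of u, in the same order
    min_trunk:    child port v' -> i of its (i, j)-height-constrained variable
    """
    best = {}  # (v', new i) -> smallest new j found so far
    for k, triplets in enumerate(child_cs):
        # Height contributed by the siblings of the k-th child (assumed rule).
        beside = max((h + 1 for m, h in enumerate(child_height) if m != k), default=0)
        for v, i1, j1 in triplets:
            if v not in min_trunk:
                continue  # v' is not the child port of a variable
            i_new = min(i1 + 1, min_trunk[v] - 1)
            j_new = max(j1 + 1, beside)
            key = (v, i_new)
            if key not in best or j_new < best[key]:
                best[key] = j_new  # case 2/3: keep the smallest height
    return {(v, i, j) for (v, i), j in best.items()}


# Case 2/3: both children carry (7, 1, *); the variable is constrained by (4, 6).
case2 = vid_inheriting([{(7, 1, 1)}, {(7, 1, 3)}], [4, 3], {7: 4})
# Case 3/3: the children carry (7, 1, *) and (7, 2, *).
case3 = vid_inheriting([{(7, 1, 3)}, {(7, 2, 2)}], [4, 3], {7: 4})
```

Triplets with the same v’ and the same first number are merged by keeping the smallest height, while triplets with different first numbers are all kept.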


C-Set-Attaching (Small Examples)

In every example, the term tree t has a vertex 2 with the children 4, 5, 6, and the vertex 5 is the child port of a variable constrained by (1,2).

Example 1 (children of B: D, F, H):

(4,0,0) ∈ CS(D), (5,0,0) ∈ CS(F), (6,0,0) ∈ CS(H)

(2,0,0) should be added to CS(B).

Example 2 (children of B: D, E, F, G, H):

(4,0,0) ∈ CS(D), (5,0,0) ∈ CS(G), (6,0,0) ∈ CS(H)

height(F)=2, height(E)=1

(5,0,0) ∈ CS(G) covers [E,G].

(2,0,0) is added to CS(B).

Example 3 (children of B: D, E, F, G, H):

(4,0,0) ∈ CS(D), (5,1,1) ∈ CS(F), (6,0,0) ∈ CS(H)

height(G)=2, height(E)=1

(5,1,1) ∈ CS(F) covers [E,G].

(2,0,0) is added to CS(B).

Example 4 (children of B: D, E, F, G, H):

(4,0,0) ∈ CS(D), (5,1,1) ∈ CS(F), (6,0,0) ∈ CS(H)

height(G)=2, height(E)=3

(5,1,1) ∈ CS(F) covers [F,G] but cannot cover E.

(2,0,0) must not be added to CS(B).

C-Set-Attaching (A Big Example)

An ordered term tree t: the vertex 11 has the children 1, 2, ..., 10. Four variables are constrained by (4,8), (3,4), (5,5), and (4,7).

An ordered tree: the vertex O has the children A, B, ..., N, with the following heights and corresponding sets.

A: height(A)=9, CS(A) = {(1,0,0)}

B: height(B)=5, CS(B) = {(2,0,0), (4,0,0)}

C: height(C)=4, CS(C) = {(5,0,0)}

D: height(D)=5, CS(D) = {(3,3,4), (6,0,0)}

E: height(E)=3, CS(E) = {(3,3,3)}

F: height(F)=2, CS(F) = {(1,0,0), (4,0,0), (7,2,3)}

G: height(G)=5, CS(G) = {(2,0,0), (4,0,0), (5,0,0), (8,4,4)}

H: height(H)=6, CS(H) = {(5,0,0), (6,0,0), (8,4,4), (9,0,0)}

I: height(I)=5, CS(I) = {(3,3,5), (6,0,0)}

J: height(J)=7, CS(J) = {(7,2,3), (10,3,3)}

K: height(K)=1, CS(K) = ∅

L: height(L)=9, CS(L) = {(4,0,0), (8,4,4)}

M: height(M)=4, CS(M) = {(5,0,0), (9,0,0)}

N: height(N)=4, CS(N) = {(6,0,0), (10,3,4)}

First, we prepare a virtual table for a new graph. Rows and columns represent vertices of T and t, respectively.

[Table: rows A to N, columns 1 to 10.]

[Figure: the children D to I of O with their heights and corresponding sets, and the variable of t constrained by (3,4) whose child port is 7.]

(7,2,3) ∈ CS(F) covers [E,F].

Add a vertex labeled with [E,F] to F7 in the table.

[Figure: the same part of the ordered tree, with the variables of t constrained by (3,4) and (5,5) whose child ports are 7 and 8.]

(8,4,4) ∈ CS(G) covers [E,G].

Add a vertex labeled with [E,G] to G8 in the table.

[Figure: the same part of the ordered tree and the same two variables, with the table vertices [E,F], [E,G], and [H,H].]

(8,4,4) ∈ CS(H) covers [H,H].

Add a vertex labeled with [H,H] to H8 in the table.

Add a directed edge from [E,F] at F7 to [E,G] at G8, because two consecutive variables cover all vertices from E to G.

[Figure: the completed table with rows A to N and columns 1 to 10. Its vertices include [B,K], [J,K], [K,N], [E,F], [E,G], [H,H], and [M,N], and directed edges connect them between vstart and vgoal.]

• If there is a directed path from vstart to vgoal, (11,0,0) is added to CS(O).

• The total time complexity of C-Set-Attaching of a vertex u of T and a vertex u’ of t is O(mu² m’u’), where mu and m’u’ are the numbers of the children of u and u’, respectively.
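The path test on the table can be sketched in Python as below. This is a simplified reading reconstructed from the example, not the authors' procedure: each table vertex is given as (row, column, lo, hi), meaning that the entry sits at child `row` of u and child `column` of u’ and can cover the children lo to hi of u. How the intervals are derived from the heights is left out, and the function name and the edge condition are assumptions.

```python
def has_path(num_rows, num_cols, nodes):
    """Decide whether vgoal can be reached from vstart in the table graph.

    nodes: tuples (row, col, lo, hi) with lo <= row <= hi, all 0-based.
    Requires num_cols >= 1.
    """
    by_col = [[] for _ in range(num_cols)]
    for node in nodes:
        by_col[node[1]].append(node)

    # vstart reaches first-column entries that cover the leftmost child of u.
    reachable = [n for n in by_col[0] if n[2] == 0]
    for col in range(1, num_cols):
        # An edge needs increasing rows and no uncovered child in between.
        reachable = [n for n in by_col[col]
                     if any(p[0] < n[0] and p[3] + 1 >= n[2] for p in reachable)]
    # vgoal is reached from last-column entries that cover the rightmost child.
    return any(n[3] == num_rows - 1 for n in reachable)


# Rows 0..3 stand for four children of u, columns 0..1 for two children of u'.
gap = [(1, 0, 0, 1), (2, 1, 0, 2), (3, 1, 3, 3)]      # child 2 stays uncovered on the way to row 3
covered = [(1, 0, 0, 1), (3, 1, 2, 3)]                # the two intervals meet
```

Comparing every entry of a column with every entry of the previous column takes O(mu²) steps per column, which is in line with the O(mu² m’u’) bound stated above.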

Total Time Complexity

VID-Inheriting(u): O(S mu)

C-Set-Attaching(u): O(mu² m’u’)

mu: the number of children of a vertex u of T,

m’u’: the number of children of a vertex u’ of t.

Total: O(N max{nDmax, S})

n: the number of vertices of t,

N: the number of vertices of T,

S: the total amount of the lowest trunk lengths of all variables of t,

Dmax: the maximum number of children of a vertex of T.
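One way to see how the per-vertex bounds add up to the total is sketched below. The derivation is not taken from the slides; it assumes that C-Set-Attaching is run once for every pair (u, u’) and uses only the facts that the child counts of a tree sum to its number of edges and that mu ≤ Dmax.

```latex
\begin{align*}
\sum_{u \in V_T} O(S\, m_u)
  &= O\Bigl(S \sum_{u \in V_T} m_u\Bigr) = O(S N), \\
\sum_{u \in V_T} \sum_{u' \in V_t} O\bigl(m_u^{2}\, m'_{u'}\bigr)
  &\le O\Bigl(D_{\max} \sum_{u \in V_T} m_u \sum_{u' \in V_t} m'_{u'}\Bigr)
   = O(D_{\max}\, N\, n), \\
O(S N) + O(n N D_{\max})
  &= O\bigl(N \max\{\, n D_{\max},\; S \,\}\bigr).
\end{align*}
```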

Conclusions

• An O(N max{nDmax,S}) Time Matching Algorithm for Ordered Term Trees with Height-Constrained Variables.

• [Our Related Works] Polynomial-Time Learning Algorithms for Ordered Term Trees with Height-Constrained Variables [Suzuki et al., PRICAI'04], [Matsumoto and Shoudai, ALT'04].

Future Works:

• An Efficient Matching Algorithm for Ordered Term Trees with Height-Constrained Multi-Child Port Variables.

• Polynomial-Time Learning Algorithms for Ordered Term Trees with Height-Constrained Multi-Child Port Variables.

Thank you for your attention.
