
CIKM 2010 Demo - SEQUEL: query completion via pattern mining on multi-column structural data


SEQUEL: Query Completion via Pattern Mining on Multi-Column Structural Data

Chuancong Gao, Qingyan Yang, Jianyong Wang
Tsinghua University, Beijing, China

Structural Data Description

Mined Pattern Structure

Suggestion Process

STEP 1: Search the index of each column and find at least one combination (matching order) of columns that matches the input query. E.g., the query “www da” is matched as illustrated in the example on columns Title Phrase and Venue (with the indexes on the right side).
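The matching in STEP 1 can be sketched as follows. This is a minimal illustration, not the system's implementation: it assumes each column index can be reduced to a set of indexed phrases, that a query term matches a column when it is a prefix of some phrase there, and that each term goes to a distinct column. The names `COLUMN_INDEX`, `column_matches`, and `matching_orders` and the toy index contents are hypothetical. It also enumerates assignments exhaustively rather than matching greedily as the poster's workflow describes.

```python
from itertools import permutations

# Toy per-column indexes (hypothetical contents; the poster uses Trie trees).
COLUMN_INDEX = {
    "title_phrase": {"data", "database"},
    "venue": {"www", "icde"},
}


def column_matches(column, term):
    """True if `term` is a prefix of some indexed phrase in `column`."""
    return any(phrase.startswith(term) for phrase in COLUMN_INDEX[column])


def matching_orders(query):
    """Return every assignment of the query terms to distinct columns.

    Each result is a tuple of (column, term) pairs in query order; the
    last pair names the column on which suggestions would be generated.
    """
    terms = query.lower().split()
    orders = []
    for columns in permutations(COLUMN_INDEX, len(terms)):
        if all(column_matches(c, t) for c, t in zip(columns, terms)):
            orders.append(tuple(zip(columns, terms)))
    return orders


print(matching_orders("www da"))
```

For the query “www da”, the only matching order in this toy setup assigns “www” to Venue and “da” to Title Phrase, so Title Phrase is the last matched column.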

Advantages Compared to Other Systems

Pattern Index Structure – Trie Tree

Example on Columns Title Phrase and Venue

[Figure: system workflow. Offline Part: Structural Data → Formalize → Mine & Index → Mined Patterns and Indexes for Each Column. Online Part: Query → Preprocess → Try to Match Greedily on Each Column Index → Patterns for m Match Combinations → Top-k Selection on Last-Matched Column for m Combinations → Top-k Selection from m×k Candidates → Output. Legend: ≥ denotes Ranking Score Comparison.]

The DBLP Computer Science Bibliography (DBLP)
• > 1,400,000 Publication Entries
• Four Attributes for each Publication Entry:
  • Authors (e.g. Jiawei Han, Guozhu Dong, Yiwen Yin)
  • Title (e.g. Efficient Mining of Partial Periodic Patterns in Time Series Database)
  • Venue (e.g. ICDE)
  • Year (e.g. 1999)

1. Title Phrase “frequent patterns” appears 17 times in Venue “icdm”
2. Title Phrase “pattern” appears 14 times for Authors “jian pei” and “jiawei han”
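One way to picture such mined patterns is as small records that pair per-column values with a support count. The sketch below is only an illustration: the `Pattern` class, its field names, and the `describe` helper are hypothetical and are not taken from the system. The two instances and the threshold of 10 come from the poster.

```python
from dataclasses import dataclass

MIN_SUPPORT = 10  # minimum support (frequency) threshold stated on the poster


@dataclass(frozen=True)
class Pattern:
    """One mined pattern: values per column plus its support (frequency)."""
    values: tuple   # ((column name, (value, ...)), ...)
    support: int

    def describe(self):
        parts = []
        for column, values in self.values:
            quoted = " and ".join('"' + v + '"' for v in values)
            parts.append(column + " " + quoted)
        return ", ".join(parts) + " (support " + str(self.support) + ")"


# The two example patterns listed above.
p1 = Pattern((("Title Phrase", ("frequent patterns",)), ("Venue", ("icdm",))), 17)
p2 = Pattern((("Title Phrase", ("pattern",)), ("Authors", ("jian pei", "jiawei han"))), 14)

for p in (p1, p2):
    # Both supports are at least MIN_SUPPORT, so both count as frequent.
    print(p.describe(), p.support >= MIN_SUPPORT)
```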

• Suggests Patterns mined from the underlying Data instead of Query Logs
  • More Accurate and Meaningful
  • Low Amount and Quality of Query Logs on Structural Data
• No need to Explicitly Specify Different Columns in the Query
• Suggests Phrases instead of Single Terms
• Fast for both Offline Pattern Mining and Online Suggestion

[Figure: example on columns Title Phrase and Venue. A Title Phrase Index and a Venue Index, each a Trie tree made of Blank Nodes, Normal Nodes, and Phrase-end Nodes, point by pattern id into a shared table of Some Selected Patterns with columns id, Title Phrase, Venue, and sup. Patterns shown include “data”, “data icde”, “data www”, “data web www”, “database icde”, “icde”, and “www”.]

http://dbgroup.cs.tsinghua.edu.cn/chuancong/sequel

STEP 2: Suggest on the last matched column of each matching order.
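The two-stage selection behind STEP 2 (top-k on the last matched column for each of the m matching orders, then top-k from the resulting m×k candidates) can be sketched as below. This is an illustration under assumptions: the poster does not define the ranking score here, so the sketch takes a numeric score as given; the function name `top_k_suggestions` and the sample candidates and scores are hypothetical; duplicate suggestions across matching orders are not merged.

```python
import heapq


def top_k_suggestions(candidates_per_order, k):
    """Two-stage top-k selection.

    `candidates_per_order` holds, for each of the m matching orders, the
    (suggestion, score) pairs found on its last matched column.
    """
    pooled = []
    for candidates in candidates_per_order:
        # Stage 1: keep the k best candidates of this matching order.
        pooled.extend(heapq.nlargest(k, candidates, key=lambda c: c[1]))
    # Stage 2: keep the k best among the (at most) m*k pooled candidates.
    return heapq.nlargest(k, pooled, key=lambda c: c[1])


# Hypothetical candidates for m = 2 matching orders.
order_a = [("www data", 17), ("www database", 9), ("www data mining", 5)]
order_b = [("www day", 3), ("www dag", 12)]

print(top_k_suggestions([order_a, order_b], k=2))
```

Selecting per matching order first bounds the final pool at m×k entries, so the second stage stays cheap even when individual columns yield many candidates.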

Pattern Mining Algorithm

Based on the Frequent Sequential Pattern Mining algorithm PrefixSpan:
• Treat Authors as Itemset
• Treat Title as Sequence
• Treat Venue & Year as Single-Item
• Concatenate all the columns together as a new Sequence
• Mine and Index

Used Minimum Support (Frequency) Threshold: 10
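The formalize-and-mine steps above can be sketched with a compact PrefixSpan over single-item sequences. This is a simplified illustration, not the system's code: `formalize` and `prefixspan` are hypothetical names, and the authors itemset is approximated by sorting author names so that any subset of authors appears as a subsequence. The sketch mines subsequences with gaps; whether the system restricts title phrases to contiguous words is not stated on the poster. The first toy record is the poster's example entry, the second is made up, and the threshold is lowered from 10 to 2 to suit two records.

```python
from collections import defaultdict


def formalize(record):
    """Concatenate all columns of one record into a single item sequence."""
    seq = [("author", a.lower()) for a in sorted(record["authors"])]
    seq += [("title", w) for w in record["title"].lower().split()]
    seq.append(("venue", record["venue"].lower()))
    seq.append(("year", str(record["year"])))
    return seq


def prefixspan(sequences, min_support):
    """Return {pattern (tuple of items): support} for all frequent patterns."""
    results = {}

    def mine(prefix, projected):
        # projected: (sequence index, start position) pairs, one per sequence.
        occurrences = defaultdict(dict)
        for sid, start in projected:
            seq = sequences[sid]
            for pos in range(start, len(seq)):
                # Keep only the first occurrence of each item per sequence.
                occurrences[seq[pos]].setdefault(sid, pos + 1)
        for item, where in occurrences.items():
            if len(where) >= min_support:
                pattern = prefix + (item,)
                results[pattern] = len(where)
                mine(pattern, list(where.items()))

    mine((), [(sid, 0) for sid in range(len(sequences))])
    return results


records = [
    {"authors": ["Jiawei Han", "Guozhu Dong", "Yiwen Yin"],
     "title": "Efficient Mining of Partial Periodic Patterns in Time Series Database",
     "venue": "ICDE", "year": 1999},
    # Hypothetical record, made up for illustration only.
    {"authors": ["A. Author"], "title": "Mining Periodic Patterns",
     "venue": "ICDE", "year": 2000},
]

patterns = prefixspan([formalize(r) for r in records], min_support=2)
for pattern, support in sorted(patterns.items(), key=lambda kv: len(kv[0])):
    print(support, pattern)
```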

Trie tree pattern index:
• Used for fast column text matching
• Every column has one corresponding Trie tree
• All the indexes share a global table storing all the patterns
• Close to 2GB in memory in total
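A minimal sketch of this layout, assuming a plain character-level trie per column whose phrase-end nodes store ids into one shared pattern table. The class names, the table contents, and the ids are hypothetical, and the poster's distinction between Blank, Normal, and Phrase-end Nodes is reduced here to whether a node carries pattern ids.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.pattern_ids = []  # non-empty only on phrase-end nodes


class ColumnTrie:
    """Trie over the phrases of one column; leaves point into PATTERNS."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, phrase, pattern_id):
        node = self.root
        for ch in phrase:
            node = node.children.setdefault(ch, TrieNode())
        node.pattern_ids.append(pattern_id)

    def search_prefix(self, prefix):
        """Return ids of all patterns whose phrase starts with `prefix`."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        ids, stack = [], [node]
        while stack:
            current = stack.pop()
            ids.extend(current.pattern_ids)
            stack.extend(current.children.values())
        return sorted(ids)


# Global pattern table shared by all column indexes (toy contents and ids).
PATTERNS = {
    1: {"title_phrase": "data", "venue": None},
    2: {"title_phrase": "data", "venue": "icde"},
    3: {"title_phrase": "data", "venue": "www"},
    4: {"title_phrase": "database", "venue": "icde"},
    5: {"title_phrase": None, "venue": "icde"},
    6: {"title_phrase": None, "venue": "www"},
}

# One trie per column, as described above.
tries = {"title_phrase": ColumnTrie(), "venue": ColumnTrie()}
for pid, pattern in PATTERNS.items():
    for column, phrase in pattern.items():
        if phrase is not None:
            tries[column].insert(phrase, pid)

title_ids = tries["title_phrase"].search_prefix("da")
venue_ids = tries["venue"].search_prefix("www")
print(title_ids, venue_ids, sorted(set(title_ids) & set(venue_ids)))
```

Because both tries return ids into the same table, patterns matching several columns can be found by intersecting id lists instead of comparing phrase text.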