Page 1: Computer Science and Engineering

Computer Science and Engineering

Inverted Linear Quadtree: Efficient Top K Spatial Keyword Search

Chengyuan Zhang 1, Ying Zhang 1, Wenjie Zhang 1, Xuemin Lin 2,1

1 The University of New South Wales, Australia

2 East China Normal University

Page 2: Computer Science and Engineering

Background

An enormous number of spatio-textual objects are available in many applications

online local search, e.g., online yellow pages

social network services, e.g., Facebook, Flickr

Page 3: Computer Science and Engineering

Example objects and their keywords:

p1 (pizza, coffee, sushi)

p2 (pizza, coffee, steak)

p3 (pizza, sushi)

p4 (coffee, sushi)

p5 (pizza, steak, seafood)

Query keywords: pizza, coffee

Page 4: Computer Science and Engineering

Top k spatial keyword search (TOPK-SK)

Data: A set of spatio-textual objects

Each object is represented by a location and a set of keywords

Query: A query location (q.loc)

A set of query keywords (q.T)

Answer: The closest k objects, each of which contains all query keywords
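The definition above can be illustrated with a minimal brute-force sketch in Python. It is not the indexing method proposed in these slides; it only pins down what a correct answer looks like. The class name `SpatioTextualObject`, the function name `topk_sk`, the coordinates, and the use of Euclidean distance are all assumptions made for illustration.

```python
import heapq
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SpatioTextualObject:
    oid: str
    x: float
    y: float
    keywords: frozenset


def topk_sk(objects, q_loc, q_terms, k):
    """Return the k closest objects that contain every keyword in q_terms."""
    q_terms = frozenset(q_terms)
    # AND semantics: keep only objects whose keyword set covers q.T.
    candidates = (o for o in objects if q_terms <= o.keywords)
    # Rank the surviving objects by Euclidean distance to q.loc.
    return heapq.nsmallest(
        k, candidates, key=lambda o: math.dist(q_loc, (o.x, o.y))
    )


# Hypothetical coordinates; the keyword sets follow the restaurant example.
objects = [
    SpatioTextualObject("p1", 1.0, 1.0, frozenset({"pizza", "coffee", "sushi"})),
    SpatioTextualObject("p2", 4.0, 4.0, frozenset({"pizza", "coffee", "steak"})),
    SpatioTextualObject("p3", 0.5, 0.5, frozenset({"pizza", "sushi"})),
    SpatioTextualObject("p4", 2.0, 0.0, frozenset({"coffee", "sushi"})),
    SpatioTextualObject("p5", 3.0, 1.0, frozenset({"pizza", "steak", "seafood"})),
]

result = topk_sk(objects, (0.0, 0.0), {"pizza", "coffee"}, k=1)
print([o.oid for o in result])
```

With these made-up coordinates, p3 is the nearest object but lacks "coffee", so it cannot be part of the answer.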

Page 5: Computer Science and Engineering

Naïve Approach

Running Example

11 spatio-textual objects

Vocabulary {t1, t2, t3}

Query q with q.T = {t1, t2} and k = 1

Objects and keywords: p1 (t1, t2), p2 (t1, t2), p3 (t1, t3), p4 (t1), p5 (t2, t3), p6 (t2, t3), p7 (t3), p8 (t3), p9 (t2), p10 (t1), p11 (t2)

Distance order: P3, P4, P7, P8, P5, P1, P10, P9, P6, P2, P11
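A short Python sketch can walk through the running example. It assumes the naïve approach visits objects in increasing distance from q and tests each one against q.T; the slides do not spell out the procedure, so this reading and the function name `scan_in_distance_order` are assumptions.

```python
# Keyword sets and distance order as listed in the running example.
KEYWORDS = {
    "p1": {"t1", "t2"}, "p2": {"t1", "t2"}, "p3": {"t1", "t3"},
    "p4": {"t1"}, "p5": {"t2", "t3"}, "p6": {"t2", "t3"},
    "p7": {"t3"}, "p8": {"t3"}, "p9": {"t2"}, "p10": {"t1"}, "p11": {"t2"},
}
DISTANCE_ORDER = ["p3", "p4", "p7", "p8", "p5", "p1",
                  "p10", "p9", "p6", "p2", "p11"]


def scan_in_distance_order(order, keywords, q_terms, k):
    """Visit objects nearest-first; stop once k objects contain all of q_terms."""
    results, accessed = [], 0
    for oid in order:
        accessed += 1
        if q_terms <= keywords[oid]:
            results.append(oid)
            if len(results) == k:
                break
    return results, accessed


results, accessed = scan_in_distance_order(
    DISTANCE_ORDER, KEYWORDS, {"t1", "t2"}, k=1
)
print(results, accessed)  # ['p1'] 6
```

Under this reading, five objects that do not contain both keywords are accessed before p1 is found.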

Page 6: Computer Science and Engineering

Inverted R-tree [Y. Zhou, et al., CIKM 2005]

For each keyword t, construct an R-tree for the objects containing t

[Figure: one R-tree per keyword over the running example, each with node entries E1 and E2. R1 (t1): P4, P10, P1, P2, P3. R2 (t2): P2, P5, P1, P6, P11. R3 (t3): P6, P7, P9, P3, P5, P8.]

Page 7: Computer Science and Engineering

IR2-tree [I. D. Felipe, et al., ICDE 2008]

Index Structure

Combination of an R-tree and the signature technique

Each node contains a rectangle and a signature (a fixed-length bitmap)

Each word is hashed to a particular bit

The signature of a node is the bitwise OR of all the signatures of its child nodes
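The signature mechanism can be sketched in a few lines of Python. The bit assignment below (t1 to 10, t2 to 01, t3 to 01) is read off the example slide that follows; the function names and the containment test `may_contain_all` are assumptions for illustration, not the IR2-tree authors' code.

```python
# Word-to-bit hashing with a 2-bit signature; t2 and t3 collide on the same bit.
BIT_OF = {"t1": 0b10, "t2": 0b01, "t3": 0b01}


def signature(words):
    """Signature of an object: bitwise OR of the bits of its words."""
    sig = 0
    for w in words:
        sig |= BIT_OF[w]
    return sig


def node_signature(child_signatures):
    """Signature of a node: bitwise OR of the signatures of its children."""
    sig = 0
    for s in child_signatures:
        sig |= s
    return sig


def may_contain_all(sig, query_sig):
    """False means the entry certainly misses a query word; True may be a false positive."""
    return (sig & query_sig) == query_sig


q_sig = signature({"t1", "t2"})                          # 0b11
print(may_contain_all(signature({"t1"}), q_sig))         # p4: pruned
print(may_contain_all(signature({"t1", "t3"}), q_sig))   # p3: passes the filter
```

Because t2 and t3 hash to the same bit, p3 (t1, t3) passes the filter for q.T = {t1, t2} even though it does not contain t2. Such false positives must be removed by checking the actual keywords.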

Page 8: Computer Science and Engineering

Example

Word-to-bit hashing: t1 → 10, t2 → 01, t3 → 01

Object signatures: p1 11, p2 11, p3 11, p4 10, p5 01, p6 01, p7 01, p8 01, p9 10, p10 10, p11 01

Node signatures: E1 11, E2 01, E3 11, E4 01, E5 11, E6 11, E7 11, E8 11, E9 11, E10 11, E11 11, E12 11

[Figure: IR2-tree with nodes E1 to E12 over the running example]

Page 9: Computer Science and Engineering

Observations

Naïve approach

Disadvantages: all objects in the search region are accessed (large s and p = 1)

Inverted R-tree

Advantages: excludes unrelated objects (small s)

Disadvantages: cannot take advantage of the AND semantics (p = 1)

IR2-tree

Advantages: has a filtering technique to reduce p

Disadvantages: large s, and p is affected by unrelated objects

Other single augmented R-trees

Other spatial keyword search: KR*-tree [R. Hariharan, et al., SSDBM 2007], WIR-tree [D. Wu, et al., TKDE 2011]

Spatial keyword ranking query: IR-tree [G. Cong, et al., PVLDB 2009], CM-CDIR-tree [D. Wu, et al., VLDBJ 2012]

Their shortcomings: same as the IR2-tree

Page 10: Computer Science and Engineering

Motivation

The index structure should

have a small number of objects within the search region

be able to prune objects within the search region

Properties

falls in the category of inverted index

exploits the AND semantics

adapts to the distribution of the objects for each keyword

Page 11: Computer Science and Engineering

Motivation

[Figure: space partition in which each cell is marked 1 if it is non-empty and 0 if it is empty]

Page 12: Computer Science and Engineering

Linear Quadtree Structure

Regular space partition based indexing

Each node can be identified by its split sequence (Morton code, a.k.a. Z-order)

A circle denotes a non-leaf node and a square denotes a leaf node

A leaf node is black if it is not empty; otherwise it is a white leaf node

Only the black leaf nodes are kept (in a B+ tree)

Example codes: SW, SE → 0001; NE → 1100
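A small Python sketch shows how a split sequence turns into a Morton code. The slide only shows the codes for SW, SE (0001) and NE (1100); the mapping NW = 10 and the zero-padding to a fixed length are inferred from those two labels and should be read as assumptions. The function names are hypothetical.

```python
# Two bits per split: first bit is north/south, second bit is east/west.
# SW, SE and NE match the slide; NW = "10" is an assumption.
QUADRANT_BITS = {"SW": "00", "SE": "01", "NW": "10", "NE": "11"}


def morton_code(split_sequence, max_depth):
    """Concatenate the quadrant bits and pad with zeros to 2 * max_depth bits."""
    if len(split_sequence) > max_depth:
        raise ValueError("split sequence is deeper than max_depth")
    bits = "".join(QUADRANT_BITS[q] for q in split_sequence)
    return bits.ljust(2 * max_depth, "0")


def cell_of_point(x, y, depth):
    """Split sequence of the depth-level cell of the unit square containing (x, y)."""
    sequence = []
    x0, y0, size = 0.0, 0.0, 1.0
    for _ in range(depth):
        size /= 2
        east = x >= x0 + size
        north = y >= y0 + size
        sequence.append(("N" if north else "S") + ("E" if east else "W"))
        if east:
            x0 += size
        if north:
            y0 += size
    return sequence


print(morton_code(["SW", "SE"], max_depth=2))   # 0001
print(morton_code(["NE"], max_depth=2))         # 1100
print(cell_of_point(0.3, 0.1, depth=2))         # ['SW', 'SE']
```

Sorting nodes by these codes visits them in Z-order, which is what lets a one-dimensional structure such as a B+ tree store them. A padded code alone does not tell a node from its first descendant, so an implementation would typically store the node level alongside the code.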

Page 13: Computer Science and Engineering

IL-Quadtree

For each keyword ti ∈ V we build a linear quadtree, denoted by LQi, for the objects which contain the keyword ti

Besides the black leaf nodes we also keep the quadtree node information (signature)

The signature bit is 1 for black leaf nodes and non-leaf nodes and 0 otherwise
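The following Python sketch illustrates how per-keyword node signatures support the AND semantics. It is a simplification under stated assumptions: space is the unit square, every quadtree is split to a fixed depth rather than by a split threshold, and Python sets stand in for the B+ tree and the bitmap. The names `cell_prefixes`, `build_il_quadtree`, and `node_survives` are hypothetical.

```python
from collections import defaultdict


def cell_prefixes(x, y, depth):
    """Morton-code prefixes, one per level, of the cells that contain (x, y)."""
    prefixes, code = [], ""
    x0, y0, size = 0.0, 0.0, 1.0
    for _ in range(depth):
        size /= 2
        east = x >= x0 + size
        north = y >= y0 + size
        code += ("1" if north else "0") + ("1" if east else "0")
        prefixes.append(code)
        if east:
            x0 += size
        if north:
            y0 += size
    return prefixes


def build_il_quadtree(objects, depth):
    """Map each keyword to the node codes whose signature bit would be 1."""
    index = defaultdict(set)
    for x, y, keywords in objects:
        for prefix in cell_prefixes(x, y, depth):
            for t in keywords:
                index[t].add(prefix)
    return index


def node_survives(index, node_code, q_terms):
    """A node is kept only if its bit is 1 in the quadtree of every query keyword."""
    return all(node_code in index.get(t, ()) for t in q_terms)


# Hypothetical objects: (x, y, keywords).
objects = [
    (0.1, 0.1, {"t1", "t2"}),
    (0.9, 0.9, {"t1"}),
    (0.6, 0.1, {"t2"}),
]
index = build_il_quadtree(objects, depth=2)
for node in ("00", "01", "11"):
    print(node, node_survives(index, node, {"t1", "t2"}))
```

In this toy data only the SW quadrant ("00") holds objects for both t1 and t2, so the other quadrants can be discarded without touching their objects.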

Page 14: Computer Science and Engineering

Search Algorithm

[Figure: search on the running example, with objects in distance order P3, P4, P7, P8, P5, P1, P10, P9, P6, P2, P11]

Page 15: Computer Science and Engineering

Direction-aware spatial keyword search [G. Li, et al., ICDE 2012]

Data: A set of spatio-textual objects

Each object has a location and a set of keywords

Query: A location (q.loc)

A set of query keywords (q.T)

A direction [α, β]

Answer: The closest k objects, each of which contains all keywords in q.T and lies in the search direction
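A brute-force Python sketch makes the direction constraint concrete. It assumes the direction is an angular interval given by two angles in radians, measured counter-clockwise from the positive x-axis, with the second angle no smaller than the first. The parameter names `alpha` and `beta`, the function names, and the coordinates are assumptions for illustration; this is not the DESKS algorithm.

```python
import math

TWO_PI = 2 * math.pi


def in_direction(q_loc, p_loc, alpha, beta):
    """True if the direction from q to p lies within [alpha, beta] (alpha <= beta)."""
    width = beta - alpha
    if width >= TWO_PI:
        return True  # The interval covers the full circle.
    angle = math.atan2(p_loc[1] - q_loc[1], p_loc[0] - q_loc[0])
    # Measure the angle relative to alpha so intervals crossing 0 also work.
    return (angle - alpha) % TWO_PI <= width


def direction_aware_topk(objects, q_loc, q_terms, alpha, beta, k):
    """Closest k objects that contain all query keywords and lie in the direction."""
    matches = [
        (math.dist(q_loc, loc), oid)
        for oid, loc, keywords in objects
        if q_terms <= keywords and in_direction(q_loc, loc, alpha, beta)
    ]
    return [oid for _, oid in sorted(matches)[:k]]


# Hypothetical objects: (id, location, keywords).
objects = [
    ("a", (1.0, 0.1), {"t1", "t2"}),
    ("b", (-0.5, 0.0), {"t1", "t2"}),
    ("c", (2.0, 1.0), {"t1", "t2"}),
    ("d", (1.0, 0.5), {"t1"}),
]
print(direction_aware_topk(objects, (0.0, 0.0), {"t1", "t2"}, 0.0, math.pi / 2, k=2))
```

Here object b is the closest match on keywords but lies behind the query point, and d lacks t2, so neither can appear in the answer.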

Page 16: Computer Science and Engineering

Spatial Keyword Based Ranking [G. Cong, et al., PVLDB 2009, VLDBJ 2012]

Query

– Spatial location

– Query keywords

Returns the k best objects ranked by

– Spatial distance to the query location

– Textual relevance to the query keywords

Spatio-textual ranking score

The spatial proximity (δ) is the normalized Euclidean distance between p and q.

The textual relevance (θ) is the tf-idf based textual similarity between the description of p and the query keywords.

Our Solution

The maximal keyword weight replaces the bit signature (aggregate inverted linear quadtree)

The spatial distance ranking function is replaced by the spatio-textual ranking score function

Score-based pruning uses the weight and region of the quadtree node

$f(p, q) = \alpha \cdot \delta(p, q) + (1 - \alpha) \cdot \theta(p, q)$

Page 17: Computer Science and Engineering

Experimental Setting

Implemented in Java

Debian Linux

o Intel Xeon 2.40 GHz dual CPU

o 4 GB memory

Datasets

GN: US Board on Geographic Names

Tigers, Cars:

o Spatial datasets from Rtree-Portal

o Textual content from 20 Newsgroups

SYN: synthetic dataset

Queries (1000): a location and l query keywords

Evaluation metrics: response time and number of I/Os

Page 18: Computer Science and Engineering

Important Statistics

Parameters evaluated

Definition                     Notation   Default Value
Number of required results     k          10
Number of query keywords       l          3
Term frequency of vocabulary   z          1.1
Number of objects              n          1,000,000
Vocabulary size                v          100,000
Avg. keywords per object       m          15

Page 19: Computer Science and Engineering

Tuning

w’: minimal depth of the black leaf node

c: the split threshold

Best performance: w’ = 8 and c = 64

Page 20: Computer Science and Engineering

l: The number of query keywords

Grid: [M. Christoforaki, et al., CIKM 2011]

Grid+SIG: the extension of Grid, utilizing the signature technique

Page 21: Computer Science and Engineering

Algorithms Evaluated

ILQ – Inverted linear quadtree based techniques

IVR – Inverted R-tree [Y. Zhou, et al., CIKM 2005]

MIR2 – [I. D. Felipe, et al., ICDE 2008]

KR – [R. Hariharan, et al., SSDBM 2007]

WIR – [D. Wu, et al., TKDE 2011]

IR – [G. Cong, et al., PVLDB 2009]

CM-CDIR – [D. Wu, et al., VLDBJ 2012]

Page 22: Computer Science and Engineering

Evaluation on different datasets

Page 23: Computer Science and Engineering

Comparison – Varying l

Page 24: Computer Science and Engineering

Comparison – Varying k

Page 25: Computer Science and Engineering

Comparison – Varying Parameters

Page 26: Computer Science and Engineering

Conclusion

Identify important properties of indexing techniques to support top k spatial keyword search

Propose the inverted linear quadtree structure to efficiently support top k spatial keyword search

Extensive experiments on both real and synthetic data

Future work

Enhance the region based signature technique: group objects to reduce false positives

Support top k spatial keyword search on other metric spaces

Page 27: Computer Science and Engineering

Page 28: Computer Science and Engineering

Spatial Keyword Ranking Query

Our algorithm: Aggregate ILQ

Compared with

IR [G. Cong, et al., PVLDB 2009]

CM-CDIR [D. Wu, et al., VLDBJ 2012]

Dataset: Tiger

Page 29: Computer Science and Engineering

Direction-Aware TOPK-SK Query

Our algorithm: ILQ

Compared with DESKS [G. Li, et al., ICDE 2012]

Page 30: Computer Science and Engineering

Comparison – Varying k

Page 31: Computer Science and Engineering

IR-Tree

Page 32: Computer Science and Engineering

KR* Tree