Computer Science and Engineering
Inverted Linear Quadtree: Efficient Top K Spatial Keyword Search
Chengyuan Zhang1, Ying Zhang1, Wenjie Zhang1, Xuemin Lin2,1
1 The University of New South Wales, Australia
2 East China Normal University
Background
An enormous number of spatio-textual objects are available in many applications
online local search, e.g., online yellow pages
social network services, e.g., Facebook, Flickr
Example objects:
p1 (pizza, coffee, sushi)
p2 (pizza, coffee, steak)
p3 (pizza, sushi)
p4 (coffee, sushi)
p5 (pizza, steak, seafood)
Example query keywords: pizza, coffee
Top k spatial keyword search (TOPK-SK)
Data: A set of spatio-textual objects
Each object is represented by a location and a set of keywords
Query: Query location (q.loc)
A set of query keywords (q.T)
Answer: The closest k objects, each of which contains all query keywords
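The definition above can be sketched as a brute-force reference implementation. This is only an illustration of the query semantics, not an index structure; the function name `topk_sk`, the dictionary layout, and the coordinates are hypothetical (the slides give keywords for p1 to p5 but no locations).

```python
import heapq
import math


def topk_sk(objects, q_loc, q_keywords, k):
    """Return the names of the k closest objects containing all query keywords.

    objects: dict mapping name -> ((x, y), set_of_keywords)
    q_loc: query location (q.loc); q_keywords: set of query keywords (q.T)
    """
    matches = (
        (math.dist(loc, q_loc), name)
        for name, (loc, keywords) in objects.items()
        if q_keywords <= keywords  # AND semantics: every query keyword is present
    )
    return [name for _, name in heapq.nsmallest(k, matches)]


# Keywords follow the Background example; the coordinates are made up.
objects = {
    "p1": ((1.0, 1.0), {"pizza", "coffee", "sushi"}),
    "p2": ((4.0, 0.0), {"pizza", "coffee", "steak"}),
    "p3": ((0.5, 0.0), {"pizza", "sushi"}),
    "p4": ((0.0, 1.0), {"coffee", "sushi"}),
    "p5": ((2.0, 2.0), {"pizza", "steak", "seafood"}),
}
print(topk_sk(objects, (0.0, 0.0), {"pizza", "coffee"}, 1))  # ['p1']
```

Note that p3 and p4 are closer to the query than p1 in this made-up layout, but each lacks one of the query keywords and is therefore not a valid answer.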
Naïve Approach
Running example: 11 spatio-textual objects
Vocabulary {t1, t2, t3}
Query q with q.T = {t1, t2} and k = 1
Objects: p1 (t1,t2), p2 (t1,t2), p3 (t1,t3), p4 (t1), p5 (t2,t3), p6 (t2,t3), p7 (t3), p8 (t3), p9 (t2), p10 (t1), p11 (t2)
Distance order: p3, p4, p7, p8, p5, p1, p10, p9, p6, p2, p11
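A small worked example on the running example, assuming the naïve approach simply visits objects in increasing distance from q (for instance through a plain spatial index) and checks the keywords of each one. The function name `naive_scan` and the access counter are illustrative and not taken from the slides.

```python
# Keyword sets and distance order as listed in the running example.
KEYWORDS = {
    "p1": {"t1", "t2"}, "p2": {"t1", "t2"}, "p3": {"t1", "t3"},
    "p4": {"t1"}, "p5": {"t2", "t3"}, "p6": {"t2", "t3"},
    "p7": {"t3"}, "p8": {"t3"}, "p9": {"t2"},
    "p10": {"t1"}, "p11": {"t2"},
}
DISTANCE_ORDER = ["p3", "p4", "p7", "p8", "p5", "p1",
                  "p10", "p9", "p6", "p2", "p11"]


def naive_scan(order, keywords, q_keywords, k):
    """Visit objects nearest-first; return (results, number of objects accessed)."""
    results, accessed = [], 0
    for name in order:
        accessed += 1  # every visited object is loaded and checked
        if q_keywords <= keywords[name]:
            results.append(name)
            if len(results) == k:
                break
    return results, accessed


print(naive_scan(DISTANCE_ORDER, KEYWORDS, {"t1", "t2"}, 1))  # (['p1'], 6)
```

Under this assumption, six objects are accessed before the first match p1 is found, including p7 and p8, which contain neither query keyword.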
Inverted R-tree [Y. Zhou, et al., CIKM 2005]
For each keyword t, construct an R-tree for the objects containing t
[Figure: one R-tree per keyword on the running example, R1 (t1), R2 (t2) and R3 (t3), each with entries E1 and E2]
IR2-tree [I. D. Felipe, et al., ICDE 2008]
Index Structure
Combination of an R-tree and the signature technique
Each node contains a rectangle and a signature (a fixed-length bitmap)
Each word is hashed to a particular bit
The signature of a node is the "bitwise OR" of all the signatures of its child nodes
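The signature technique described above can be sketched as follows. This is a minimal illustration, not the IR2-tree implementation: the fixed mapping `BIT` (t1 to bit 10, t2 and t3 both to bit 01) stands in for a hash function and follows the 2-bit values shown in the slide example, and the function names are hypothetical.

```python
# Stand-in for the hash function: each keyword maps to one bit of a 2-bit signature.
BIT = {"t1": 0b10, "t2": 0b01, "t3": 0b01}


def signature(keywords):
    """Signature of an object: OR of the bits of its keywords."""
    sig = 0
    for keyword in keywords:
        sig |= BIT[keyword]
    return sig


def node_signature(child_signatures):
    """Signature of a node: bitwise OR of the signatures of its children."""
    sig = 0
    for child in child_signatures:
        sig |= child
    return sig


def may_contain(sig, query_keywords):
    """True if the signature does not rule out the presence of all query keywords."""
    query_sig = signature(query_keywords)
    return (sig & query_sig) == query_sig


query = {"t1", "t2"}
print(may_contain(signature({"t1"}), query))        # False: p4 (t1) is filtered out
print(may_contain(signature({"t1", "t3"}), query))  # True: p3 (t1,t3) is a false positive
```

Because t2 and t3 share a bit, p3 passes the signature test even though it lacks t2, so the object itself still has to be checked. The same effect carries up the tree, since a node's signature only accumulates bits.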
Example
Keyword signatures: t1 = 10, t2 = 01, t3 = 01
Object signatures: p1 = 11, p2 = 11, p3 = 11, p4 = 10, p5 = 01, p6 = 01, p7 = 01, p8 = 01, p9 = 01, p10 = 10, p11 = 01
Node signatures: E1 = 11, E2 = 01, E3 = 11, E4 = 01, E5 = 11, E6 = 11, E7 = 11, E8 = 11, E9 = 11, E10 = 11, E11 = 11, E12 = 11
[Figure: IR2-tree over the running example with node entries E1 to E12]
Observations
Naïve approach
Disadvantages: all objects in the search region are accessed (large s and p = 1)
Inverted R-tree
Advantages: excludes unrelated objects (small s)
Disadvantages: cannot take advantage of the AND semantics (p = 1)
IR2-tree
Advantages: has a filtering technique to reduce p
Disadvantages: large s, and p is affected by unrelated objects
Other single augmented R-trees
Other spatial keyword search: KR tree [R. Hariharan, et al., SSDBM 2007], WIR tree [D. Wu, et al., TKDE 2011]
Spatial keyword ranking query: IR tree [G. Cong, et al., PVLDB 2009], CM-CDIR tree [D. Wu, et al., VLDBJ 2012]
Their shortcomings: same as IR2-tree
Motivation
The index structure should
have a small number of objects within the search region
be able to prune objects within the search region
Properties
falls in the category of inverted indexes
exploits the AND semantics
adapts to the distribution of the objects for each keyword
Linear Quadtree Structure
Regular space partition based indexing
Each node can be identified by its split sequence (Morton code, a.k.a. Z-order)
A circle denotes a non-leaf node and a square denotes a leaf node
A leaf node is black if it is not empty; otherwise it is a white leaf node
Only the black leaf nodes are kept (in a B+ tree)
[Figure: example quadtree with node labels SW, SE, 0001, NE, 1100 and a legend marking nodes as non-empty (1) or empty (0)]
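A minimal sketch of the linear quadtree idea, under several assumptions not stated in the slides: points lie in the unit square [0, 1) x [0, 1), quadrants are coded SW = 00, SE = 01, NW = 10, NE = 11 (chosen so that the split sequence SW, SE gives 0001), and a dictionary sorted by code stands in for the B+ tree. The names `morton_code` and `build_linear_quadtree` and the parameters `capacity` and `max_depth` are illustrative.

```python
def morton_code(x, y, depth):
    """Split sequence of the depth-level cell containing (x, y), 2 bits per level."""
    code, x0, y0, size = "", 0.0, 0.0, 1.0
    for _ in range(depth):
        size /= 2
        east, north = x >= x0 + size, y >= y0 + size
        code += f"{int(north)}{int(east)}"  # SW=00, SE=01, NW=10, NE=11
        x0 += size if east else 0.0
        y0 += size if north else 0.0
    return code


def build_linear_quadtree(points, capacity=1, max_depth=4):
    """Return {code: points} for the black (non-empty) leaves, ordered by code."""
    leaves = {}

    def split(code, x0, y0, size, pts):
        if not pts:
            return  # white leaf: not stored
        if len(pts) <= capacity or len(code) // 2 >= max_depth:
            leaves[code] = pts  # black leaf
            return
        half = size / 2
        for north in (0, 1):
            for east in (0, 1):
                cx, cy = x0 + east * half, y0 + north * half
                inside = [p for p in pts
                          if cx <= p[0] < cx + half and cy <= p[1] < cy + half]
                split(code + f"{north}{east}", cx, cy, half, inside)

    split("", 0.0, 0.0, 1.0, list(points))
    return dict(sorted(leaves.items()))


print(morton_code(0.3, 0.1, 2))  # 0001 (SW, then SE)
tree = build_linear_quadtree([(0.3, 0.1), (0.1, 0.1), (0.9, 0.9)])
print(list(tree))  # ['0000', '0001', '11']
```

Only three black leaves are stored for the three points; the empty SE and NW quadrants of the root leave no trace in the structure.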
IL-Quadtree
For each keyword ti ∈ V, we build a linear quadtree, denoted by LQi, for the objects which contain the keyword ti
Besides the black leaf nodes, we also keep the quadtree node information (signature)
The signature bit is 1 for black leaf nodes and non-leaf nodes, and 0 otherwise
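The way per-keyword signatures support AND semantics can be sketched as below. This is a simplified illustration rather than the authors' structure: it assumes a fixed-depth partition of the unit square instead of an adaptive split, stores each keyword's signature as the set of codes of non-empty nodes, and uses hypothetical names (`cell_code`, `build_signatures`, `candidate_cells`) and made-up objects.

```python
from collections import defaultdict


def cell_code(x, y, depth):
    """Code of the depth-level cell containing (x, y); SW=00, SE=01, NW=10, NE=11."""
    code, x0, y0, size = "", 0.0, 0.0, 1.0
    for _ in range(depth):
        size /= 2
        east, north = x >= x0 + size, y >= y0 + size
        code += f"{int(north)}{int(east)}"
        x0 += size if east else 0.0
        y0 += size if north else 0.0
    return code


def build_signatures(objects, depth):
    """keyword -> set of node codes whose signature bit is 1 for that keyword."""
    signatures = defaultdict(set)
    for (x, y), keywords in objects.values():
        code = cell_code(x, y, depth)
        for keyword in keywords:
            # The leaf cell and all of its ancestors are non-empty for this keyword.
            for end in range(0, len(code) + 1, 2):
                signatures[keyword].add(code[:end])
    return signatures


def candidate_cells(signatures, query_keywords, depth):
    """Leaf cells that are non-empty in the quadtree of every query keyword."""
    frontier = [""]
    for _ in range(depth):
        frontier = [
            code + child
            for code in frontier
            for child in ("00", "01", "10", "11")
            if all(code + child in signatures.get(kw, set()) for kw in query_keywords)
        ]
    return frontier


objects = {
    "a": ((0.1, 0.1), {"pizza", "coffee"}),
    "b": ((0.9, 0.9), {"pizza"}),
    "c": ((0.6, 0.1), {"coffee"}),
}
signatures = build_signatures(objects, depth=2)
print(candidate_cells(signatures, {"pizza", "coffee"}, depth=2))  # ['0000']
```

The NE quadrant is pruned at the first level because no object there contains coffee, and the SE quadrant because none contains pizza, so no object in those regions is ever loaded.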
Search Algorithm
[Figure: search algorithm illustrated on the running example, with the distance order of the objects]
Direction-aware spatial keyword search [G. Li, et al., ICDE 2012]
Data: A set of spatio-textual objects
Each object has a location and a set of keywords
Query: A location (q.loc)
A set of query keywords (q.T)
A direction [α, β]
Answer: The closest k objects, each of which contains all keywords in q.T and lies in the search direction
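The direction-aware query can be sketched as a brute-force filter, assuming the direction is an angular interval given in radians and measured counter-clockwise from the positive x-axis. That convention, the parameter names `alpha` and `beta`, the function names, and the sample objects are assumptions for illustration; this is not the DESKS algorithm.

```python
import math


def in_direction(q_loc, p_loc, alpha, beta):
    """True if the bearing from q_loc to p_loc lies in [alpha, beta] (radians)."""
    angle = math.atan2(p_loc[1] - q_loc[1], p_loc[0] - q_loc[0]) % (2 * math.pi)
    if alpha <= beta:
        return alpha <= angle <= beta
    return angle >= alpha or angle <= beta  # interval wraps around 0


def direction_topk(objects, q_loc, q_keywords, alpha, beta, k):
    """k closest objects that contain all query keywords and lie in the direction."""
    matches = [
        (math.dist(loc, q_loc), name)
        for name, (loc, keywords) in objects.items()
        if q_keywords <= keywords and in_direction(q_loc, loc, alpha, beta)
    ]
    return [name for _, name in sorted(matches)[:k]]


objects = {
    "a": ((1.0, 1.0), {"pizza", "coffee"}),   # bearing pi/4
    "b": ((-1.0, 1.0), {"pizza", "coffee"}),  # bearing 3*pi/4, outside the direction
    "c": ((3.0, 3.0), {"pizza", "coffee"}),   # bearing pi/4, farther away
    "d": ((2.0, 1.0), {"pizza"}),             # in the direction but lacks "coffee"
}
print(direction_topk(objects, (0.0, 0.0), {"pizza", "coffee"}, 0.0, math.pi / 2, 2))
# ['a', 'c']
```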
Spatial Keyword Based Ranking [G. Cong, et al., PVLDB 2009, VLDBJ 2012]
Query
– Spatial location
– Query keywords
Returns the k best objects ranked by
– Spatial distance to the query location
– Textual relevance to the query keywords
Spatio-textual ranking score
The spatial proximity (δ) is the normalized Euclidean distance between p and q
The textual relevance (θ) is the tf-idf based textual similarity between the description of p and the query keywords
Our Solution
The maximal keyword weight replaces the bit signature – aggregate inverted linear quadtree
The spatial distance ranking function is replaced by the spatio-textual ranking score function
Score based pruning based on the weight and region of the quadtree node
score(p, q) = α · δ(p, q) + (1 − α) · θ(p, q)
Experimental Setting
Implemented in Java
Debian Linux
o Intel Xeon 2.40GHz dual CPU
o 4 GB memory
Dataset
GN: US Board on Geographic Names
Tigers, Cars:
o Spatial datasets from Rtree-Portal
o Textual content from 20 Newsgroups
SYN: synthetic dataset
Query (1000): location, l query keywords
Evaluate response time and # I/O
Parameters evaluated
Definition                      Notation   Default Value
Number of required results      k          10
Number of query keywords        l          3
Term frequency of vocabulary    z          1.1
Number of objects               n          1,000,000
Vocabulary size                 v          100,000
Avg. keywords per object        m          15
Important Statistics
Tuning
w’ : Minimal depth of the black leaf node
c: The split threshold
Best performance:
– w’ = 8 and c = 64
l: The number of query keywords
Grid: [M. Christoforaki, et al., CIKM 2011]
Grid+SIG: the extension of Grid, utilizing the signature technique
Algorithms Evaluated
ILQ – Inverted Linear Quadtree based techniques
IVR – Inverted R-tree [Y. Zhou, et al., CIKM 2005]
MIR2 – [I. D. Felipe, et al., ICDE 2008]
KR – [R. Hariharan, et al., SSDBM 2007]
WIR – [D. Wu, et al., TKDE 2011]
IR – [G. Cong, et al., PVLDB 2009]
CM-CDIR – [D. Wu, et al., VLDBJ 2012]
Evaluation on different datasets
Comparison – Varying l
Comparison – Varying k
Comparison – Varying Parameters
Conclusion
Important properties of indexing techniques to support top k spatial keyword search
Propose the inverted linear quadtree structure to efficiently support top k spatial keyword search
Extensive experiments on both real and synthetic data
Future work
Enhance the region based signature technique – group objects to reduce false positives
Support top k spatial keyword search on other metric spaces
Spatial Keyword Ranking Query
Our algorithm: Aggregate ILQ
Compare with
IR [G. Cong, et al., PVLDB 2009]
CM-CDIR [D. Wu, et al., VLDBJ 2012]
Dataset: Tiger
Direction-Aware TOPK-SK Query
Our algorithm: ILQ
Compare with: DESKS [G. Li, et al., ICDE 2012]
Comparison – Varying k
IR-Tree
KR* Tree