Preferential top-k search over local data
Dissertation thesis
RNDr. Martin Šumák
Supervisor: doc. RNDr. Stanislav Krajči, PhD.
Consultant: RNDr. Peter Gurský, PhD.
Outline
• Top-k search
  – motivation and example
  – restrictions and assumptions
• R-tree-based solution
  – normalization of data
  – R++-tree
• Grid file-based solution
• Experiments
  – comparison with the B+-tree-based solution, table scan, etc.
2013-08-05 Preferential top-k search over local data, Dissertation thesis, RNDr. Martin Šumák
Top-k search
• Example
  – find the top 20 apartments with 3 or 4 rooms, not on the first floor, with a price of about 60,000 and not exceeding 70,000 euro
  – moreover, price is the most important attribute and floor is the least important one
Top-k query
• k = 20
• preferences on attribute values – fuzzy functions
• importance of attributes – weights: w_price = 3, w_rooms = 2, w_floor = 1
Top-k query
• The overall value of object O is 3*f_price(O_price) + 2*f_rooms(O_rooms) + 1*f_floor(O_floor)
• In general: c(f_price(O_price), f_rooms(O_rooms), f_floor(O_floor))
Function c has to be monotone!
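The weighted-sum form of the overall value can be sketched in Python for the apartment example. The shapes of the fuzzy functions are assumptions made for illustration: the slides only say "about 60000, not exceeding 70000", so the triangular price preference and its left foot at 50000 are hypothetical, as are the crisp 0/1 preferences for rooms and floor.

```python
def triangular(left, peak, right):
    """Fuzzy preference: 1.0 at `peak`, falling linearly to 0.0 at `left` and `right`."""
    def f(x):
        if x <= left or x >= right:
            return 0.0
        if x <= peak:
            return (x - left) / (peak - left)
        return (right - x) / (right - peak)
    return f


# Assumed preference shapes (not given in the slides).
f_price = triangular(50000, 60000, 70000)   # "about 60000, not exceeding 70000"


def f_rooms(rooms):
    return 1.0 if rooms in (3, 4) else 0.0   # "3 or 4 rooms"


def f_floor(floor):
    return 0.0 if floor == 1 else 1.0        # "not on the first floor"


# Weights taken from the slides: price 3, rooms 2, floor 1.
WEIGHTS = {"price": 3, "rooms": 2, "floor": 1}


def overall_value(apartment):
    """Weighted sum of preference degrees; monotone because all weights are non-negative."""
    return (WEIGHTS["price"] * f_price(apartment["price"])
            + WEIGHTS["rooms"] * f_rooms(apartment["rooms"])
            + WEIGHTS["floor"] * f_floor(apartment["floor"]))
```

A weighted sum with non-negative weights is one monotone choice of c: raising any single preference degree can never lower the overall value.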
The goal of top-k search
• to find the top-k objects efficiently
  – by processing the minimum amount of data
• restrictions and assumptions
  – all the data is accessible locally
  – all attributes are numerical
R-tree-based solution
• object
  – a vector of n numbers
  – a point in n-dimensional space
  – R-tree, R*-tree, R+-tree, R++-tree
From kNN to top-k search
• k nearest neighbours (kNN)
  – known incremental algorithm
  – distance from the "query point Z" is the measure of "closeness"
From kNN to top-k search
• top-k search
  – overall value (h) is the measure of "goodness"
  – by replacing distance with overall value and reversing the order, we turn the kNN result into a top-k result
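The step from kNN to top-k can be sketched as a best-first traversal with a priority queue: where incremental kNN pops the entry with the smallest distance, this version pops the entry with the largest (upper bound on the) overall value. This is only an illustration under stated assumptions, not the thesis implementation: nodes are plain dictionaries with hypothetical keys `lo`, `hi`, `children` and `points`, the aggregation is a weighted sum, and each preference function is assumed to be unimodal with a known peak, so its maximum over an interval is reached at the peak clamped into that interval.

```python
import heapq
from itertools import count


def top_k(root, k, prefs, weights):
    """Best-first top-k search over a tree of bounding boxes.

    prefs   -- list of (f, peak) pairs; f is assumed unimodal with maximum at peak
    weights -- non-negative attribute weights (weighted sum is the assumed c)
    """
    def score(point):
        return sum(w * f(x) for w, (f, _), x in zip(weights, prefs, point))

    def upper_bound(lo, hi):
        # Best achievable value inside the box: evaluate each f at its peak
        # clamped into [lo_i, hi_i].
        return sum(w * f(min(max(peak, l), h))
                   for w, (f, peak), l, h in zip(weights, prefs, lo, hi))

    tie = count()  # tie-breaker so the heap never compares nodes with points
    heap = [(-upper_bound(root["lo"], root["hi"]), next(tie), root)]
    result = []
    while heap and len(result) < k:
        neg_value, _, item = heapq.heappop(heap)
        if isinstance(item, tuple):          # a data point: its value is exact
            result.append((-neg_value, item))
        elif "points" in item:               # a leaf: enqueue its points
            for p in item["points"]:
                heapq.heappush(heap, (-score(p), next(tie), p))
        else:                                # an inner node: enqueue children
            for child in item["children"]:
                bound = upper_bound(child["lo"], child["hi"])
                heapq.heappush(heap, (-bound, next(tie), child))
    return result


def tri(peak, width):
    """Hypothetical triangular preference centred at `peak`."""
    return lambda x: max(0.0, 1.0 - abs(x - peak) / width)


# Tiny hand-built tree over normalized data (illustrative values only).
leaf_a = {"lo": (0.0, 0.0), "hi": (0.4, 0.4), "points": [(0.1, 0.1), (0.4, 0.3)]}
leaf_b = {"lo": (0.5, 0.5), "hi": (1.0, 1.0), "points": [(0.5, 0.9), (0.9, 0.6)]}
tree = {"lo": (0.0, 0.0), "hi": (1.0, 1.0), "children": [leaf_a, leaf_b]}

best = top_k(tree, 2, prefs=[(tri(0.5, 0.5), 0.5), (tri(1.0, 1.0), 1.0)], weights=[2, 1])
```

A point is reported only when it is popped, which happens only once no remaining node has a larger upper bound; this is where the monotonicity of c matters, since it guarantees that the bound of a box is never below the value of any point inside it.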
Analogy of kNN and top-k search
• Correctness
• Efficiency
[Figure: panels labelled "top-k" and "kNN"]
Disproportion of attribute values
• floor, area, price – very different ranges
  – solution: normalization – a linear transformation of attribute values to the interval [0; 1]
• Another disproportion comes from the weights
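The normalization mentioned above, a linear transformation of each attribute to [0; 1], can be sketched as a min-max rescaling. The function name and the handling of constant columns (mapped to 0.0) are assumptions made for this example.

```python
def normalize(rows):
    """Linearly rescale every attribute (column) of `rows` to the interval [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]

    def scale(x, lo, hi):
        # A constant column carries no information; map it to 0.0 (assumed convention).
        return 0.0 if hi == lo else (x - lo) / (hi - lo)

    return [tuple(scale(x, lo, hi) for x, lo, hi in zip(row, lows, highs))
            for row in rows]


# (floor, area, price): three attributes with very different ranges.
apartments = [(1, 40, 50000), (3, 60, 70000), (5, 80, 90000)]
normalized = normalize(apartments)
```

After rescaling, every attribute spans the same range, so no single dimension dominates the shape of the bounding rectangles merely because of its units.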
Normalization applicability
• Useful for
  – R*-tree
• Meaningless for
  – R-tree (proven for the quadratic split method)
  – R+-tree, R++-tree
  – Grid file
Why the R++-tree
• Zero overlaps & minimum bounding rectangles may cause a problem when adding a new object
• R+-tree avoids overlaps at the cost of rectangle size
The R++-tree idea
• Zero overlaps & minimum bounding rectangles may cause a problem when adding a new object
• R++-tree keeps two rectangles for each node – the minimum one and the parent covering one
The R++-tree properties
• Height-balanced
• Zero overlaps
• Overflow nodes at leaf level only
• Minimum node occupancy is 1
• For the top-k search purposes, attribute values can be strings or any other comparable values (not just numbers)
Top-k search over Grid file
• Grid file is a spatial index for point data
• We used a static Grid file without an extra directory
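As a rough illustration of how a grid file locates point data, the sketch below maps a point to its cell by binary search over per-dimension split points (the "linear scales"). It is a simplification under assumptions: the names `cell_of` and `build_grid` are hypothetical, the split points are fixed up front (a static grid), and a Python dictionary keyed by cell coordinates stands in for the direct cell-to-bucket mapping.

```python
from bisect import bisect_right
from collections import defaultdict


def cell_of(point, scales):
    """Return the grid cell (one index per dimension) that contains `point`.

    scales -- for each dimension, a sorted list of interior split points
    """
    return tuple(bisect_right(splits, x) for splits, x in zip(scales, point))


def build_grid(points, scales):
    """Assign every point to the bucket of its cell (static grid, no splitting)."""
    buckets = defaultdict(list)
    for p in points:
        buckets[cell_of(p, scales)].append(p)
    return dict(buckets)


# Two dimensions: the first is cut into 4 intervals, the second into 2.
scales = [[0.25, 0.5, 0.75], [0.5]]
grid = build_grid([(0.6, 0.2), (0.7, 0.1), (0.1, 0.9)], scales)
```

Because each cell is an axis-aligned box, an upper bound on the overall value can be computed per cell in the same way as for a bounding rectangle in a tree.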
Top-k search over Grid file
• We have proven correctness and efficiency as well