Preferential top-k search over local data

dissertation thesis
RNDr. Martin Šumák
supervisor: doc. RNDr. Stanislav Krajči, PhD.
consultant: RNDr. Peter Gurský, PhD.

Outline

• Top-k search
  – motivation and example
  – restrictions and assumptions
• R-tree-based solution
  – normalization of data
  – R++-tree
• Grid file-based solution
• Experiments
  – comparison with B+-trees-based solution, table scan, etc.

2013-08-05 Preferential top-k search over local data, Dissertation thesis, RNDr. Martin Šumák

Top-k search

• Example
  – find the top 20 apartments with 3 or 4 rooms, not on the first floor, with a price of about 60000 euro and not exceeding 70000 euro
  – moreover, price is the most important attribute and floor is the least important attribute

Top-k query

• k = 20
• preferences for attribute values – fuzzy functions
• importance of attributes – weights: $w_{\text{price}} = 3$, $w_{\text{rooms}} = 2$, $w_{\text{floor}} = 1$

Top-k query

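• Overall value of object $O$ is $3 \cdot f_{\text{price}}(O_{\text{price}}) + 2 \cdot f_{\text{rooms}}(O_{\text{rooms}}) + 1 \cdot f_{\text{floor}}(O_{\text{floor}})$
• In general: $c(f_{\text{price}}(O_{\text{price}}), f_{\text{rooms}}(O_{\text{rooms}}), f_{\text{floor}}(O_{\text{floor}}))$

Function $c$ has to be monotone!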
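
The weighted sum can be sketched in Python as follows. The slides do not give the shapes of the fuzzy functions, so `f_price`, `f_rooms` and `f_floor` below are hypothetical stand-ins chosen to fit the apartment example. Only the weights 3, 2 and 1 come from the slides.

```python
def f_price(price):
    """Hypothetical preference: 1.0 at 60000, falling linearly to 0.0 at 50000 and 70000."""
    return max(0.0, 1.0 - abs(price - 60000) / 10000)


def f_rooms(rooms):
    """Hypothetical preference: only 3 or 4 rooms are acceptable."""
    return 1.0 if rooms in (3, 4) else 0.0


def f_floor(floor):
    """Hypothetical preference: anything except the first floor is acceptable."""
    return 0.0 if floor == 1 else 1.0


# Weights taken from the slides: price is the most important, floor the least.
WEIGHTS = {"price": 3, "rooms": 2, "floor": 1}


def overall_value(apartment):
    """Weighted sum of local preferences; monotone because all weights are non-negative."""
    return (WEIGHTS["price"] * f_price(apartment["price"])
            + WEIGHTS["rooms"] * f_rooms(apartment["rooms"])
            + WEIGHTS["floor"] * f_floor(apartment["floor"]))
```

A weighted sum with non-negative weights is one example of a monotone combining function: raising any local preference value never lowers the overall value.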

The goal of top-k search

• to find the top-k objects efficiently
  – by processing the minimum amount of data
• restrictions and assumptions
  – all the data is accessible locally
  – all attributes are numerical

R-tree-based solution

• object
  – a vector of n numbers
  – a point of n-dimensional space
  – R-tree, R*-tree, R+-tree, R++-tree

From kNN to top-k search

• k nearest neighbours
  – known incremental algorithm
  – distance from the “query point Z” is the measure of “closeness”

From kNN to top-k search

• top-k search
  – overall value (h) is the measure of “goodness”
  – by replacing distance with overall value and reversing the order, we change the result from kNN to top-k
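
The idea can be illustrated with a best-first search driven by a priority queue, in the spirit of incremental kNN. This is a minimal sketch under stated assumptions, not the algorithm from the thesis. The `Node` class, the triangular preferences built by `make_pref`, and the weighted-sum combining function are all hypothetical. Where kNN orders queue entries by the smallest possible distance to a rectangle, this sketch orders them by the largest possible overall value inside a rectangle.

```python
import heapq
from itertools import count


def make_pref(ideal, width):
    """Hypothetical triangular preference with its maximum at `ideal`."""
    def f(x):
        return max(0.0, 1.0 - abs(x - ideal) / width)

    def f_max(lo, hi):
        # f decreases with distance from `ideal`, so its maximum on [lo, hi]
        # is attained at the point of the interval closest to `ideal`.
        return f(min(max(ideal, lo), hi))

    return f, f_max


class Node:
    """Hypothetical tree node: inner (children) or leaf (points), with its bounding rectangle."""
    def __init__(self, children=(), points=()):
        self.children, self.points = list(children), list(points)
        corners = ([c.lo for c in self.children] + [c.hi for c in self.children]
                   + self.points)
        self.lo = tuple(map(min, zip(*corners)))
        self.hi = tuple(map(max, zip(*corners)))


def top_k(root, prefs, weights, k):
    """Return up to k (point, overall value) pairs, best first."""
    tie = count()  # tie-breaker so the heap never compares nodes with points

    def h(point):
        return sum(w * f(x) for w, (f, _), x in zip(weights, prefs, point))

    def bound(node):
        # Upper bound of h over the node's rectangle; valid because the
        # weighted sum is monotone in each local preference value.
        return sum(w * f_max(lo, hi)
                   for w, (_, f_max), lo, hi in zip(weights, prefs, node.lo, node.hi))

    heap = [(-bound(root), next(tie), root)]  # max-heap via negated values
    result = []
    while heap and len(result) < k:
        neg_value, _, item = heapq.heappop(heap)
        if isinstance(item, Node):
            for child in item.children:
                heapq.heappush(heap, (-bound(child), next(tie), child))
            for point in item.points:
                heapq.heappush(heap, (-h(point), next(tie), point))
        else:
            # No remaining entry can beat this point, so it is the next best.
            result.append((item, -neg_value))
    return result
```

A point is reported only when it reaches the front of the queue. At that moment every unexplored node has an upper bound no greater than the point's value, so subtrees that cannot contribute to the top k are never opened.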

Analogy of kNN and top-k search

• Correctness
• Efficiency

[Figure: panels labelled “top-k” and “kNN”]

Disproportion of attribute values

• floor, area, price – very different ranges
  – solution: normalization – linear transformation of attribute values to the interval [0; 1]
• Another disproportion comes from weights
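
The linear transformation to [0; 1] can be sketched as min-max scaling. Taking the minimum and maximum from the data itself, and mapping a constant column to 0.0, are assumptions of this sketch and are not stated on the slides.

```python
def normalize(column):
    """Linearly map the values of one attribute to the interval [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # All values are equal; map them to 0.0 to avoid dividing by zero.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]
```

For example, `normalize([50000, 60000, 70000])` and `normalize([1, 3, 5])` both give `[0.0, 0.5, 1.0]`, so price and floor end up on the same scale.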

Normalization applicability

• Useful for
  – R*-tree
• Meaningless for
  – R-tree (proven for the quadratic split method)
  – R+-tree, R++-tree
  – Grid file

Why the R++-tree

• Zero overlaps & minimum bounding rectangles may cause a problem when adding a new object
• R+-tree avoids overlaps at the price of rectangle size

The R++-tree idea

• Zero overlaps & minimum bounding rectangles may cause a problem when adding a new object

• R++-tree keeps two rectangles for each node – the minimum one and the parent covering one
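
One way to picture the two rectangles is the sketch below. The slides do not show how the thesis maintains them during insertion and splitting, so the class names, the field names and `add_point` are hypothetical. The sketch also assumes that the covering rectangles of sibling nodes do not overlap.

```python
from dataclasses import dataclass


@dataclass
class Rect:
    lo: tuple
    hi: tuple

    def contains(self, point):
        return all(l <= x <= h for l, x, h in zip(self.lo, point, self.hi))

    def extended(self, point):
        """Smallest rectangle containing both this rectangle and `point`."""
        return Rect(tuple(min(l, x) for l, x in zip(self.lo, point)),
                    tuple(max(h, x) for h, x in zip(self.hi, point)))


@dataclass
class RppNode:
    minimum: Rect   # minimum bounding rectangle of the stored objects
    covering: Rect  # region assigned to this node within its parent (assumed)

    def add_point(self, point):
        """Grow the minimum rectangle; it can never leave the covering one."""
        if not self.covering.contains(point):
            raise ValueError("point lies outside this node's covering rectangle")
        self.minimum = self.minimum.extended(point)
```

Under that assumption, a new point always falls into some node's covering rectangle. Growing that node's minimum rectangle to include the point then cannot make it overlap a sibling's minimum rectangle.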

The R++-tree properties

• Height-balanced
• Zero overlaps
• Overflow nodes at leaf level only
• Minimum node occupancy is 1
• For the top-k search purposes, attribute values can be strings or any other comparable values (not just numbers)

Top-k search over Grid file

• Grid file is a spatial index for point data
• We used a static Grid file without an extra directory

Top-k search over Grid file

• We have proven correctness and efficiency as well
