Mutlidimensional Indices Instructor: Randal Burns Lecture for 29 November 2005 Computer Science 600.416 Johns Hopkins University

Mutlidimensional Indices

Instructor: Randal Burns

Lecture for 29 November 2005

Computer Science 600.416

Johns Hopkins University

1 and 2D Indexing

• Index structures we know so far are one dimensional

– Event when indexing on multiple attributes

– The attributes are either:• ordered – a 1 dimensional binary relation

• hashed – placed into a 1-d hash space

• Not all data are 1 dimensional– Substructure within data items

– Need to be looked up on several fields

Dimensionality in Std. DBs• What is the dimensionality of data in the relational model?• The arity of a relational, data is inherently multi-dimensional in

DBs• Do relational DBs support mutli-dimensional queries using only

1-d indices?• Several techniques

– Mutliple indices – can look up on different attributes, but still only one at a time

– General queries – can conduct any query, this is the power of the relational model!

• What is the outstanding problem?• Indices optimize queries. While there is support for multi-d data,

the indices are not “tuned” for these queries

Overview of Techniques• Multidimensional hash tables

– Grid files– Partitioned hash functions

• Hierarchical indices– Multiple-key indices

• Multidimensional trees– Kd-trees– Quad trees– R-trees

Applications: Geographic Data• Geographic information systems – map• Circuit design – placment of components• Queries

– Partial match – match some dimensions, find all objects in others. Equality on some dimensions.

– Range – find objects in ranges in dimensions– Nearest neighbor – find objects close the a point or

specified object– Where-am-I queries – reverse mapping of a point to an

object, e.g. mouse click to button

Applications: Data Cubes• View all data as high-dimensional

– Consider a sale• day and time• store• item• cost

– Creates a 4-d grid

• Information in this grid can be clustered– Decision support– Data mining

• Look for trends in data– Example: determine what products sell in what stores and bind it to

demographic/political/cultural data

Multdimensional Queries in SQL

• SQL support for a nearest-neighbor query– Relation POINTS { float x, float y }

• Find the nearest point to point (10.0, 20.0)SELECT *

FROM POINTS p

WHERE NOT EXISTS (

SELECT *

FROM POINTS q

WHERE (q.x-10.0)*(q.x-10.0) + (q.y-20.0)*(q.y-20.0) < (p.x-

10.0)*(p.x-10.0) + (p.y-20.0)*(p.y-20.0)

Multdimensional Queries in SQL

• SQL support for a point in rectangle query– Relation RECT { id, xll, yll, xur, yur }

SELECT id

FROM RECT

WHERE xll <= 10.0 AND yll <= 20.0 AND xur >= 10 AND yur >= 20.0

Grid Files• Partition each dimension into ranges

– Create a bucket (block) for each combination of dimensions

– Buckets are in n-dimensions now

• Lookup – index in both dimension

• Insert – reverse lookup and insert– Complexities come if out of space

• Chain blocks in a grid bucket

• Reorganize grid lines/add new grid lines

– Same skew problems as with range partitioning, just in mutliple dimensions now

0

50

100

150

200

250

300

350

400

450

0 10 20 30 40 50 60 70 80 90

Age

Salary

Grid File Support for Queries

• Partial match: scoped to buckets in the specified dimensions

• Range: scoped to buckets in ranges• Nearest neighbor: need to consider grid boundaries

(draw), but scoped to feasible buckets• Where-am-I: no, this represents data points not

data objects

Partitioned Hashing

• For a series of hash attributes A1,A2,A3,…,An compute a function h=h1(A1),h2(A2),…,hn(An)

• Queries that only specify some dimensions, are scoped to suitable buckets

• If one specified all dimension except A2 and A4 with 3 bits per bucket

• Look in buckets 101XXX010XXX1100…..

Part. Hash. Query Support• Partial match: scoped to buckets in the specified

dimensions• Range: useless• Nearest neighbor: useless• Where-am-I: useless

• Relation between Part. Hash and Grids is similar to that between range and hash partitioning

– Skew and generality

Mutliple Key Indexes• Tree like multi-dimensional structure• Figure 5.11• Partial match: scoped when the higher dimensions

are specified, otherwise bad news• Range: very effective• Nearest neighbor: reasonably efficient when built

on top of a range query– E.g. find all neighbors less than distance d and compute

their distance

Kd-trees

• B-Tree in which each level alternates attribute– Leaves occur when only a block’s worth of tuples are

specified

• Figure 5.13 – specifies with block size of 2

0

50

100

150

200

250

300

350

400

450

0 10 20 30 40 50 60 70 80 90

Age

Salary

Kd–trees Query Support• Partial-match: only on specified attributes

• Range queries – when a range straddles a branch, must explore both sides

– But this is what they are good for

• Nearest neighbor, same approach as muliple-key indexes

• Compared– Kd-trees might (depending on data) provide better scoping by

alternating between dimensions

– Gains are specious for increased complexity

Quad Trees

• Each interior node divides the tree into another dimension of square regions

• Figure 5.17

0

50

100

150

200

250

300

350

400

0 10 20 30 40 50 60 70 80 90 100

Age

Salary

Quad Trees Query Support

• Partial-match: on all attributes• Range queries – yes, but all overlapping quads• Nearest neighbor – only in so far as range queries• Has more in common with grid files• Problems

– with knowing domains a priori

– skew in data leads to different dimensionality of regions and many empty regions

Region Trees (R-Trees)• Partial-match: on all attributes, but complicated by overlap

• Range queries – yes, but complicated by overlap

• Nearest neighbor – only in so far as range queries

• Where am I – yes, can represent objects in R-tree regions

• Complexities– Managing shapes, limiting overlap, preserving containment

property

• Overlap is required for containment property to server the where-am-I query

R-Trees Query Support• Represent objects, not just points

• Good for spatial data

• Capture the spirit of B-Trees for multi-dimensional data– B-tree divides a line (1-d space) into intervals

– R-tree divides a space (n-d space) into regions• generally use simple shapes, like rectangles

– Regions may overlap, but should do so minimally

– Each object should be contained entirely within a single region

• Develop 5.20 and 5.21– Add another house 5.22

Documents

Mutlidimensional Indices Instructor: Randal Burns Lecture for 29 November 2005 Computer Science 600.416 Johns Hopkins University