Upload
anthony-claud-alexander
View
213
Download
0
Embed Size (px)
Citation preview
Mutlidimensional Indices
Instructor: Randal Burns
Lecture for 29 November 2005
Computer Science 600.416
Johns Hopkins University
1 and 2D Indexing
• Index structures we know so far are one dimensional
– Event when indexing on multiple attributes
– The attributes are either:• ordered – a 1 dimensional binary relation
• hashed – placed into a 1-d hash space
• Not all data are 1 dimensional– Substructure within data items
– Need to be looked up on several fields
Dimensionality in Std. DBs• What is the dimensionality of data in the relational model?• The arity of a relational, data is inherently multi-dimensional in
DBs• Do relational DBs support mutli-dimensional queries using only
1-d indices?• Several techniques
– Mutliple indices – can look up on different attributes, but still only one at a time
– General queries – can conduct any query, this is the power of the relational model!
• What is the outstanding problem?• Indices optimize queries. While there is support for multi-d data,
the indices are not “tuned” for these queries
Overview of Techniques• Multidimensional hash tables
– Grid files– Partitioned hash functions
• Hierarchical indices– Multiple-key indices
• Multidimensional trees– Kd-trees– Quad trees– R-trees
Applications: Geographic Data• Geographic information systems – map• Circuit design – placment of components• Queries
– Partial match – match some dimensions, find all objects in others. Equality on some dimensions.
– Range – find objects in ranges in dimensions– Nearest neighbor – find objects close the a point or
specified object– Where-am-I queries – reverse mapping of a point to an
object, e.g. mouse click to button
Applications: Data Cubes• View all data as high-dimensional
– Consider a sale• day and time• store• item• cost
– Creates a 4-d grid
• Information in this grid can be clustered– Decision support– Data mining
• Look for trends in data– Example: determine what products sell in what stores and bind it to
demographic/political/cultural data
Multdimensional Queries in SQL
• SQL support for a nearest-neighbor query– Relation POINTS { float x, float y }
• Find the nearest point to point (10.0, 20.0)SELECT *
FROM POINTS p
WHERE NOT EXISTS (
SELECT *
FROM POINTS q
WHERE (q.x-10.0)*(q.x-10.0) + (q.y-20.0)*(q.y-20.0) < (p.x-
10.0)*(p.x-10.0) + (p.y-20.0)*(p.y-20.0)
Multdimensional Queries in SQL
• SQL support for a point in rectangle query– Relation RECT { id, xll, yll, xur, yur }
SELECT id
FROM RECT
WHERE xll <= 10.0 AND yll <= 20.0 AND xur >= 10 AND yur >= 20.0
Grid Files• Partition each dimension into ranges
– Create a bucket (block) for each combination of dimensions
– Buckets are in n-dimensions now
• Lookup – index in both dimension
• Insert – reverse lookup and insert– Complexities come if out of space
• Chain blocks in a grid bucket
• Reorganize grid lines/add new grid lines
– Same skew problems as with range partitioning, just in mutliple dimensions now
0
50
100
150
200
250
300
350
400
450
0 10 20 30 40 50 60 70 80 90
Age
Salary
Grid File Support for Queries
• Partial match: scoped to buckets in the specified dimensions
• Range: scoped to buckets in ranges• Nearest neighbor: need to consider grid boundaries
(draw), but scoped to feasible buckets• Where-am-I: no, this represents data points not
data objects
Partitioned Hashing
• For a series of hash attributes A1,A2,A3,…,An compute a function h=h1(A1),h2(A2),…,hn(An)
• Queries that only specify some dimensions, are scoped to suitable buckets
• If one specified all dimension except A2 and A4 with 3 bits per bucket
• Look in buckets 101XXX010XXX1100…..
Part. Hash. Query Support• Partial match: scoped to buckets in the specified
dimensions• Range: useless• Nearest neighbor: useless• Where-am-I: useless
• Relation between Part. Hash and Grids is similar to that between range and hash partitioning
– Skew and generality
Mutliple Key Indexes• Tree like multi-dimensional structure• Figure 5.11• Partial match: scoped when the higher dimensions
are specified, otherwise bad news• Range: very effective• Nearest neighbor: reasonably efficient when built
on top of a range query– E.g. find all neighbors less than distance d and compute
their distance
Kd-trees
• B-Tree in which each level alternates attribute– Leaves occur when only a block’s worth of tuples are
specified
• Figure 5.13 – specifies with block size of 2
0
50
100
150
200
250
300
350
400
450
0 10 20 30 40 50 60 70 80 90
Age
Salary
Kd–trees Query Support• Partial-match: only on specified attributes
• Range queries – when a range straddles a branch, must explore both sides
– But this is what they are good for
• Nearest neighbor, same approach as muliple-key indexes
• Compared– Kd-trees might (depending on data) provide better scoping by
alternating between dimensions
– Gains are specious for increased complexity
Quad Trees
• Each interior node divides the tree into another dimension of square regions
• Figure 5.17
0
50
100
150
200
250
300
350
400
0 10 20 30 40 50 60 70 80 90 100
Age
Salary
Quad Trees Query Support
• Partial-match: on all attributes• Range queries – yes, but all overlapping quads• Nearest neighbor – only in so far as range queries• Has more in common with grid files• Problems
– with knowing domains a priori
– skew in data leads to different dimensionality of regions and many empty regions
Region Trees (R-Trees)• Partial-match: on all attributes, but complicated by overlap
• Range queries – yes, but complicated by overlap
• Nearest neighbor – only in so far as range queries
• Where am I – yes, can represent objects in R-tree regions
• Complexities– Managing shapes, limiting overlap, preserving containment
property
• Overlap is required for containment property to server the where-am-I query
R-Trees Query Support• Represent objects, not just points
• Good for spatial data
• Capture the spirit of B-Trees for multi-dimensional data– B-tree divides a line (1-d space) into intervals
– R-tree divides a space (n-d space) into regions• generally use simple shapes, like rectangles
– Regions may overlap, but should do so minimally
– Each object should be contained entirely within a single region
• Develop 5.20 and 5.21– Add another house 5.22