
CS232A: Database System Principles

INDEXING

Given a condition on an attribute, find the qualifying records.

• Equality condition: Attr = value
• The condition may also be a range, e.g. Attr > value or Attr >= value

[Diagram: a value goes into the index, the qualified records come out]

Indexing

• Data structures used for quickly locating tuples that meet a specific type of condition
  – Equality condition: find Movie tuples where Director = X
  – Other conditions possible, e.g. range conditions: find Employee tuples where Salary > 40 AND Salary < 50
• Many types of indexes. Evaluate them on
  – Access time
  – Insertion time
  – Deletion time
  – Disk space needed (esp. as it affects access time)

Topics

• Conventional indexes
• B-trees
• Hashing schemes

Terms and Distinctions

• Primary index
  – the index on the attribute (a.k.a. search key) that determines the sequencing of the table
• Secondary index
  – index on any other attribute
• Dense index
  – every value of the indexed attribute appears in the index
• Sparse index
  – many values do not appear

[Figure: A Dense Primary Index over a sequential file — one index entry (10, 20, 30, 40, 50, ..., 150) per record, each pointing to its record in the sorted data blocks]

Dense and Sparse Primary Indexes

[Figure: Dense Primary Index (one entry per record: 10, 20, 30, 40, ...) vs. Sparse Primary Index (one entry per data block: 10, 30, 50, 80, 100, 140, 160, 200) over the same sequential file]

Sparse index lookup: find the index record with the largest key that is less than or equal to the value we are looking for, then search the block it points to (a small lookup sketch follows below).

Dense index:
+ can tell if a value exists without accessing the file (consider projection)
+ better access to overflow records

Sparse index:
+ less index space

(more pros and cons in a while)
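The lookup rule above can be written as a small routine. This is a minimal sketch, assuming the sparse index is a sorted in-memory list of (key, block_no) pairs and that read_block(block_no) returns the sorted keys stored in that data block; both are assumptions made here for illustration, not the slides' exact representation.

```python
import bisect

def sparse_lookup(index, search_key, read_block):
    keys = [k for k, _ in index]
    # largest index key that is <= search_key
    pos = bisect.bisect_right(keys, search_key) - 1
    if pos < 0:
        return False                    # smaller than every indexed key
    _, block_no = index[pos]
    return search_key in read_block(block_no)   # search only that one block

index = [(10, 0), (30, 1), (50, 2), (80, 3)]
blocks = {0: [10, 20], 1: [30, 40], 2: [50, 70], 3: [80, 90]}
print(sparse_lookup(index, 40, blocks.__getitem__))   # True: found in block 1
```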


Sparse vs. Dense Tradeoff

• Sparse: less index space per record, can keep more of the index in memory
• Dense: can tell if any record exists without accessing the file

(Later:
 – sparse better for insertions
 – dense needed for secondary indexes)

Multi-Level Indexes

• Treat the index as a file and build an index on it

• “Two levels are usually sufficient. More than three levels are rare.”

• Q: Can we build a dense second-level index for a dense index?

[Figure: a two-level index — a sparse second-level index (10, 100, 250, 400, ...) built over the first-level index (10, 30, 50, 80, 100, 140, 160, 200, 250, 270, 300, 350, 400, 460, 500, 550, 600, 750, 920, 1000), which in turn points to the data blocks]

A Note on Pointers

• Record pointers consist of block pointer and position of record in the block

• Using the block pointer only saves space at no extra disk-access cost

Representation of Duplicate Values in Primary Indexes

• Index may point to the first instance of each value only

[Figure: index entries 10, 40, 70, 100 point to the first occurrence of each value in data blocks (10, 10 | 10, 40 | 40, 70 | 70, 70 | 100, 120)]

Deletion from Dense Index

Delete 40, 80

[Figure: dense index and data file after deleting 40 and 80 — the freed index entries and record slots are linked into lists of available entries reachable from a header]

• Deletion from a dense primary index with no duplicate values is handled in the same way as deletion from a sequential file
• Q: What about deletion from a dense primary index with duplicates?

Deletion from Sparse Index

• if the deleted entry does not appear in the index, do nothing

Delete 40

[Figure: the sparse index (10, 30, 50, 80, 100, 140, 160, 200) is unchanged after deleting 40 from its data block]

Deletion from Sparse Index (cont’d)

• if the deleted entry does not appear in the index, do nothing
• if the deleted entry appears in the index, replace it with the next search-key value
  – comment: we could instead leave the deleted value in the index, provided no part of the system assumes it still exists without checking the block

Delete 30

[Figure: after deleting 30, its index entry is replaced by 40, the next search-key value in that block]

Deletion from Sparse Index (cont’d)

• if the deleted entry does not appear in the index, do nothing
• if the deleted entry appears in the index, replace it with the next search-key value
• unless the next search-key value has its own index entry; in this case delete the entry

Delete 40, then 30

[Figure: after deleting 40 and then 30, the index entry for 30 is removed, because the next value, 50, already has its own entry; freed slots are linked from the header]

Insertion in Sparse Index

• if no new block is created, then do nothing

Insert 35

[Figure: 35 fits in the block that already contains 30, so the sparse index is unchanged]

Insertion in Sparse Index

• if no new block is created, then do nothing
• else create an overflow record
  – Reorganize periodically
  – Could we claim space of the next block?
  – How often do we reorganize, and how expensive is it?
  – B-trees offer convincing answers

Insert 15

[Figure: 15 does not fit in the first block (10, 20), so it goes to an overflow record; the sparse index (10, 30, 50, 80, 100, 140, 160, 200) is unchanged]

Secondary Indexes

[Figure: data file blocks (50, 30 | 70, 20 | 40, 80 | 10, 100 | 60, 90), with the sequence field labeled — the file is not sorted on the secondary search key]

Secondary Indexes

• A sparse index on the secondary key does not make sense!

[Figure: a sparse index (30, 20, 80, 100, 90, ...) built over the unsorted file — useless, since consecutive index entries say nothing about which values a block contains]

Secondary Indexes

• Dense index on the secondary key, with a sparse higher level

[Figure: dense first-level index (10, 20, 30, 40, 50, 60, 70, ...) pointing to records scattered in the file; a sparse second level (10, 50, 90, ...) on top]

First level has to be dense, next levels are sparse (as usual)

Duplicate values & secondary indexes

[Figure: data file with duplicate secondary-key values — blocks (10, 20 | 40, 20 | 40, 10 | 40, 10 | 40, 30)]

Duplicate values & secondary indexes

one option...

[Figure: a dense index with one entry per record (10, 10, 10, 20, 20, 30, 40, 40, 40, 40, ...), repeating duplicate keys]

Problem: excess overhead!
• disk space
• search time

Duplicate values & secondary indexes

another option: lists of pointers

[Figure: one index entry per distinct key (10, 20, 30, 40), each followed by a list of pointers to all matching records]

Problem: variable-size records in the index!

Duplicate values & secondary indexes

Yet another idea: chain records with the same key?

[Figure: index entries (10, 20, 30, 40, 50, 60, ...) point to the first record with each key; records with the same key are chained together]

Problems:
• Need to add fields to records, messes up maintenance
• Need to follow the chain to find all the records

Duplicate values & secondary indexes

buckets

[Figure: index entries (10, 20, 30, 40, 50, 60, ...) point into buckets of record pointers; each bucket holds the pointers to all records with that key]

Why “bucket” idea is useful

Indexes on EMP(name, dept, year, ...):
• Name: primary
• Dept: secondary
• Year: secondary

• Enables the processing of queries working with pointers only.

• Very common technique in Information Retrieval

Advantage of Buckets: Process Queries Using Pointers Only

Find employees of the Toys dept with 4 years in the company:

SELECT Name
FROM Employee
WHERE Dept = 'Toys' AND Year = 4

[Figure: Dept index with buckets for Toys, PCs, Pens, Suits and Year index with buckets for 1, 2, 3, 4, over employee records (Aaron Suits 4, Helen Pens 3, Jack PCs 4, Jim Toys 4, Joe Toys 3, Nick PCs 2, Walt Toys 5, Yannis Pens 1)]

Intersect the Toys bucket and the Year = 4 bucket to get the set of matching EMPs (sketched below).
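A minimal sketch of the pointer-only evaluation: each bucket is modeled as a set of record ids, and the query intersects the Dept bucket with the Year bucket before touching any data block. The bucket contents below are made up for illustration, not the slide's exact file.

```python
# Buckets as sets of record ids (illustrative contents).
dept_buckets = {"Toys": {4, 5, 7}, "PCs": {3, 6}, "Pens": {2, 8}, "Suits": {1}}
year_buckets = {1: {8}, 2: {6}, 3: {2, 5}, 4: {1, 3, 4}}

# Dept = 'Toys' AND Year = 4: intersect the two buckets of pointers,
# then fetch only the matching records from the data file.
matching = dept_buckets["Toys"] & year_buckets[4]
print(matching)   # only these record ids need to be read
```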


This idea is used in text information retrieval

[Figure: documents ("...the cat is fat...", "...my cat and my dog like each other...", "...Fido the dog...") with a bucket for "cat" and a bucket for "dog", each pointing to the documents that contain the term]

Buckets are known as inverted lists.

Information Retrieval (IR) Queries

• Find articles with “cat” and “dog”
  – Intersect inverted lists
• Find articles with “cat” or “dog”
  – Union inverted lists
• Find articles with “cat” and not “dog”
  – Subtract the list of dog pointers from the list of cat pointers
• Find articles with “cat” in the title
• Find articles with “cat” and “dog” within 5 words

(a merge-based sketch of the first three operations follows below)
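A minimal sketch of the AND / OR / AND-NOT operations on inverted lists, assuming each list is kept sorted by document id (a common representation; the lists below are made up for illustration).

```python
cat = [1, 2, 5, 8, 9]
dog = [2, 3, 8]

def intersect(a, b):           # "cat" AND "dog": merge two sorted lists
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def union(a, b):               # "cat" OR "dog"
    return sorted(set(a) | set(b))

def subtract(a, b):            # "cat" AND NOT "dog"
    return [x for x in a if x not in set(b)]

print(intersect(cat, dog), union(cat, dog), subtract(cat, dog))
```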


Common technique: keep more info in the inverted list

[Figure: each posting for cat and dog records the document (d1, d2, d3), the location type (Title, Author, Abstract), and the position, e.g. Title 5, Title 100, Author 10, Abstract 57, Title 12]

Posting: an entry in an inverted list. Represents an occurrence of a term in an article.

Size of a list (in postings):
• 1 — rare words or misspellings
• 10^6 — common words

Size of a posting: 10-15 bits (compressed)

Vector space model

        w1 w2 w3 w4 w5 w6 w7 …
DOC   = <1  0  0  1  1  0  0 …>
Query = <0  0  1  1  0  0  0 …>

PRODUCT = 1 + … = score
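A minimal sketch of the score computation shown above: the document and the query are 0/1 vectors over the vocabulary, and the score is their dot product.

```python
doc   = [1, 0, 0, 1, 1, 0, 0]    # DOC   = <1 0 0 1 1 0 0>
query = [0, 0, 1, 1, 0, 0, 0]    # Query = <0 0 1 1 0 0 0>

score = sum(d * q for d, q in zip(doc, query))
print(score)   # 1: the only term shared by DOC and Query is w4
```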


• Tricks to weight the scores and normalize
  e.g.: a match on a common word is not as useful as a match on rare words...

Summary of Indexing So Far

• Basic topics in conventional indexes
  – multiple levels
  – sparse/dense
  – duplicate keys and buckets
  – deletion/insertion similar to sequential files
• Advantages
  – simple algorithms
  – the index is a sequential file
• Disadvantages
  – eventually sequentiality is lost because of overflows, and reorganizations are needed

Example Index (sequential)

[Figure: a sequential index (10, 20, 30 | 40, 50, 60 | 70, 80, 90) with continuous free space, plus an overflow area (not sequential) holding later insertions (39, 31, 35, 36, 32, 38, 34, 33)]

Outline:

• Conventional indexes
• B-trees ← NEXT
• Hashing schemes

• NEXT: another type of index
  – Give up on sequentiality of the index
  – Try to get “balance”

B+Tree Example, n = 3

[Figure: root with key 100; below it, non-leaf nodes (30) and (120, 150, 180); leaves (3, 5, 11), (30, 35), (100, 101, 110), (120, 130), (150, 156, 179), (180, 200)]

Sample non-leaf node:

[Figure: keys 57, 81, 95 with four child pointers — to keys < 57, to keys 57 ≤ k < 81, to keys 81 ≤ k < 95, and to keys ≥ 95]

Sample leaf node:

[Figure: a leaf with keys 57, 81, 95 reached from a non-leaf node; each key has a pointer to the record with that key, plus a sequence pointer to the next leaf in key order]

In the textbook’s notation, n = 3

[Figure: a leaf holding keys 30, 35 and a non-leaf holding key 30, drawn both in this course’s notation and in the textbook’s notation]

Size of nodes: n + 1 pointers, n keys (fixed)

Non-root nodes have to be at least half full

• Use at least
  – Non-leaf: ⌈(n+1)/2⌉ pointers
  – Leaf: ⌊(n+1)/2⌋ pointers to data

Full node vs. minimum node, n = 3

[Figure: full non-leaf (keys 120, 150, 180) vs. minimum non-leaf (key 30); full leaf (keys 3, 5, 11) vs. minimum leaf (keys 30, 35)]

B+tree rules (tree of order n)

(1) All leaves at the same, lowest level (balanced tree)
(2) Pointers in leaves point to records, except for the “sequence pointer”

(3) Number of pointers/keys for a B+tree

                     Max ptrs   Max keys   Min ptrs (to data)   Min keys
Non-leaf (non-root)    n+1         n         ⌈(n+1)/2⌉          ⌈(n+1)/2⌉ − 1
Leaf (non-root)        n+1         n         ⌊(n+1)/2⌋          ⌊(n+1)/2⌋
Root                   n+1         n             1                  1

Insert into B+tree

(a) simple case: space available in leaf
(b) leaf overflow
(c) non-leaf overflow
(d) new root

(a sketch of the leaf-split step follows below)
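A minimal sketch of the leaf-overflow step (case (b)), assuming a leaf is just a sorted list of at most n keys; on overflow the left leaf keeps ⌈(n+1)/2⌉ keys, the new right leaf gets the rest, and the first key of the right leaf is the separator handed up to the parent. Record pointers and the parent update itself are omitted here.

```python
import bisect, math

def insert_into_leaf(leaf, key, n):
    bisect.insort(leaf, key)
    if len(leaf) <= n:
        return None                        # case (a): it simply fits
    split = math.ceil((n + 1) / 2)
    right = leaf[split:]                   # new sibling leaf
    del leaf[split:]
    return right, right[0]                 # sibling and separator for parent

leaf = [3, 5, 11]
print(insert_into_leaf(leaf, 7, n=3))      # ([7, 11], 7), as in the n = 3 example
print(leaf)                                # [3, 5]
```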


(a) Insert key = 32, n = 3

[Figure: parent node with keys 30, 100 over leaves (3, 5, 11) and (30, 31); 32 fits in the leaf (30, 31), which becomes (30, 31, 32)]

(b) Insert key = 7, n = 3

[Figure: the leaf (3, 5, 11) overflows; it splits into (3, 5) and (7, 11), and key 7 is inserted into the parent, which had keys 30, 100]

(c) Insert key = 160, n = 3

[Figure: inserting 160 splits the leaf (150, 156, 179) into (150, 156) and (160, 179); the parent (120, 150, 180) overflows and splits as well, keeping (120, 150), moving 180 to a new non-leaf node, and pushing 160 up to the node above]

(d) New root: insert 45, n = 3

[Figure: root (10, 20, 30) over leaves (1, 2, 3), (10, 12), (20, 25), (30, 32, 40); inserting 45 splits the leaf (30, 32, 40) into (30, 32) and (40, 45); adding 40 to the root overflows it, so the root splits and a new root with key 30 is created]

Deletion from B+tree

(a) Simple case: no example
(b) Coalesce with neighbor (sibling)
(c) Redistribute keys
(d) Cases (b) or (c) at a non-leaf

(b) Coalesce with sibling: delete 50, n = 4

[Figure: parent (10, 40, 100) over leaves (10, 20, 30) and (40, 50); deleting 50 leaves (40) below the minimum, so it coalesces with its sibling into (10, 20, 30, 40) and the separator 40 is removed from the parent]

(c) Redistribute keys: delete 50, n = 4

[Figure: parent (10, 40, 100) over leaves (10, 20, 30, 35) and (40, 50); deleting 50 leaves (40) below the minimum, so key 35 is borrowed from the sibling and the separator in the parent becomes 35]

(d) Non-leaf coalesce: delete 37, n = 4

[Figure: root (25) over non-leaf nodes (10, 20) and (30, 40), with leaves (1, 3), (10, 14), (20, 22), (25, 26), (30, 37), (40, 45); deleting 37 makes a leaf underflow and coalesce with its neighbor, the non-leaf level then underflows and coalesces as well, and the tree gets a new, shallower root]

B+tree deletions in practice

– Often, coalescing is not implemented
– Too hard and not worth it!

Comparison: B-trees vs. static indexed sequential file

Ref #1: Held & Stonebraker, “B-Trees Re-examined”, CACM, Feb. 1978

Ref #1 claims:
- Concurrency control is harder in B-trees
- B-trees consume more space

For their comparison:
- block = 512 bytes
- key = pointer = 4 bytes
- 4 data records per block

Example: 1-block static index

127 keys: (127 + 1) × 4 = 512 bytes
→ pointers in the index are implicit! Covers up to 127 contiguous data blocks.

[Figure: an index block of keys k1, k2, k3, ... mapping to 127 contiguous data blocks, one key per data block]

Example: 1-block B-tree node

63 keys: 63 × (4 + 4) + 8 = 512 bytes
→ explicit pointers are needed in the B-tree; covers up to 63 data blocks, because index and data blocks are not contiguous.

[Figure: a B-tree node with keys k1, k2, ..., k63 and explicit block pointers (plus a next pointer) to 63 data blocks]

Size comparison (Ref. #1)

Static index                       B-tree
# data blocks         height       # data blocks            height
2 -> 127                2          2 -> 63                    2
128 -> 16,129           3          64 -> 3,968                3
16,130 -> 2,048,383     4          3,969 -> 250,047           4
                                   250,048 -> 15,752,961      5

Ref. #1 analysis claims:

• For an 8,000-block file, after 32,000 inserts and 16,000 lookups, the static index saves enough accesses to allow for reorganization

Ref. #1 conclusion: static index better!!

Ref #2: M. Stonebraker, “Retrospective on a database system,” TODS, June 1980

Ref. #2 conclusion: B-trees better!!

• DBA does not know when to reorganize
  – Self-administration is an important target
• DBA does not know how full to load the pages of a new index

Ref. #2 conclusion: B-trees better!!

• Buffering
  – B-tree: has fixed buffer requirements
  – Static index: large and variable-size buffers needed due to overflow

Ref. #2 conclusion: B-trees better!!

• Speaking of buffering… Is LRU a good policy for B+tree buffers?
  Of course not! We should try to keep the root in memory at all times (and perhaps some nodes from the second level).

Interesting problem:

For B+tree, how large should n be?

n is number of keys / node

Assumptions

• You have the right to set the disk page size for the disk where a B-tree will reside.
• Compute the optimum page size n assuming that
  – The items are 4 bytes long and the pointers are also 4 bytes long
  – Time to read a node from disk is 12 + .003n
  – Time to process a block in memory is unimportant
  – The B+tree is full (i.e., every page has the maximum number of items and pointers)

Can get: f(n) = time to find a record

[Figure: f(n) plotted against n, with a minimum at n_opt]

Find n_opt by setting f′(n) = 0.

The answer should be n_opt = “a few hundred”.
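A minimal numeric sketch of this exercise under the stated assumptions (node read time 12 + .003n, full tree). The number of records N and the use of a log base (n+1) tree height are extra assumptions made here for illustration; note that the constant factor log N does not change where the minimum lies.

```python
import math

N = 1_000_000            # assumed number of records in the file

def f(n):
    height = math.log(N) / math.log(n + 1)   # nodes touched on a lookup
    return (12 + 0.003 * n) * height         # cost per node * number of nodes

n_opt = min(range(2, 2001), key=f)
print(n_opt, round(f(n_opt), 2))   # n_opt comes out in the "few hundred" range
```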

What happens to n_opt as
• the disk gets faster?
• the CPU gets faster?

Variation on B+tree: B-tree (no +)

• Idea:
  – Avoid duplicate keys
  – Have record pointers in non-leaf nodes

[Figure: a B-tree non-leaf node K1 P1 K2 P2 K3 P3; each Pi points to the record with key Ki, and the child pointers lead to keys < K1, K1 < x < K2, K2 < x < K3, and > K3]

B-tree example, n = 2

[Figure: root (65, 125) with record pointers; second-level nodes (25, 45), (85, 105), (145, 165); leaves (10, 20), (30, 40), (50, 60), (70, 80), (90, 100), (110, 120), (130, 140), (150, 160), (170, 180)]

• sequence pointers are not useful now! (but keep space for simplicity)

So, for B-trees:

                        MAX                        MIN
                 Tree   Rec    Keys      Tree      Rec         Keys
                 ptrs   ptrs             ptrs      ptrs
Non-leaf
(non-root)       n+1     n      n      (n+1)/2   (n+1)/2 − 1  (n+1)/2 − 1
Leaf
(non-root)        1      n      n         1       (n+1)/2      (n+1)/2
Root (non-leaf)  n+1     n      n         2          1            1
Root (leaf)       1      n      n         1          1            1

Tradeoffs:

+ B-trees have marginally faster average lookup than B+trees (assuming the height does not change)
- in a B-tree, non-leaf and leaf nodes have different sizes
- smaller fan-out in a B-tree
- deletion is more complicated

B+trees preferred!

Example:
- Pointers: 4 bytes
- Keys: 4 bytes
- Blocks: 100 bytes (just an example)
- Look at a full 2-level tree

B-tree:

Root has 8 keys + 8 record pointers + 9 son pointers
= 8×4 + 8×4 + 9×4 = 100 bytes

Each of the 9 sons: 12 record pointers (+ 12 keys)
= 12×(4+4) + 4 = 100 bytes

2-level B-tree, max # records = 12×9 + 8 = 116

B+tree:

Root has 12 keys + 13 son pointers
= 12×4 + 13×4 = 100 bytes

Each of the 13 sons: 12 record pointers (+ 12 keys)
= 12×(4+4) + 4 = 100 bytes

2-level B+tree, max # records = 13×12 = 156

So...

[Figure: B+tree — 13 leaves × 12 = 156 records; B-tree — 9 leaves × 12 = 108 records plus 8 records in the root, total = 116]

• Conclusion:
  – For a fixed block size,
  – the B+tree is better because it is bushier

Outline/summary

• Conventional indexes
  – Sparse vs. dense
  – Primary vs. secondary
• B-trees
  – B+trees vs. B-trees
  – B+trees vs. indexed sequential
• Hashing schemes → Next

Hashing

• the hash function h(key) returns the address of a bucket or record
• for a secondary index, buckets are required
• if the keys for a specific hash value do not fit into one page, the bucket is a linked list of pages

[Figure: h(key) maps a key either directly to the record or to a bucket of pointers to records]

Example hash function

• Key = ‘x1 x2 … xn’, an n-byte character string
• Have b buckets
• h: add x1 + x2 + … + xn
  – compute the sum modulo b (a small sketch follows below)
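A minimal sketch of this example hash function: sum the bytes of the key and take the result modulo the number of buckets. The choice b = 16 below is an arbitrary illustration, not a value from the slides.

```python
def h(key: str, b: int = 16) -> int:
    # Sum the byte values of the key string, then reduce modulo b.
    return sum(key.encode()) % b

print(h("Smith"), h("Jones"))   # bucket numbers in the range [0, b)
```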


This may not be the best function… Read Knuth Vol. 3 if you really need to select a good function.

Good hash function: the expected number of keys per bucket is the same for all buckets.

Within a bucket:

• Do we keep keys sorted?

• Yes, if CPU time is critical and inserts/deletes are not too frequent

Next: an example to illustrate inserts, overflows, and deletes

EXAMPLE: 2 records/bucket

INSERT: h(a) = 1, h(b) = 2, h(c) = 1, h(d) = 0

[Figure: buckets 0–3 after the inserts — bucket 0: d; bucket 1: a, c; bucket 2: b; bucket 3: empty]

h(e) = 1: bucket 1 is full, so e goes to an overflow block chained to bucket 1.

EXAMPLE: deletion

Delete e, f

[Figure: buckets 0–3 holding records a, b, c, d, e, f, g, with an overflow block; after deleting e and f, maybe move “g” up into the freed main-bucket slot, and similarly compact c, d out of the overflow block]

Rule of thumb:

• Try to keep space utilization between 50% and 80%

  Utilization = (# keys used) / (total # keys that fit)

• If < 50%, we are wasting space
• If > 80%, overflows become significant
  – depends on how good the hash function is and on # keys/bucket

How do we cope with growth?

• Overflows and reorganizations
• Dynamic hashing
  – Extensible
  – Linear

Extensible hashing: two ideas

(a) Use i of the b bits output by the hash function

[Figure: h(K) = 00110101 (b bits); only the first i bits are used, and i grows over time]

(b) Use a directory

[Figure: the first i bits of h(K) index a directory entry, which points to the bucket]

Example: h(k) is 4 bits; 2 keys/bucket

[Figure: initially i = 1, with directory entries 0 and 1; the 0-bucket holds (0001) and the 1-bucket holds (1001, 1100). Insert 1010: the 1-bucket overflows, so it splits on the second bit into (1001, 1010) and (1100), and the directory doubles to i = 2 with entries 00, 01, 10, 11]

Example continued

Insert 0111, 0000:

[Figure: with i = 2, both new keys go to the 0-prefix bucket (local depth 1), which overflows and splits into a 00-bucket (0000, 0001) and a 01-bucket (0111); the buckets (1001, 1010) and (1100) are unchanged, and the directory stays at i = 2]

Example continued

Insert 1001:

[Figure: the 10-bucket (1001, 1010) overflows; it splits on the third bit into (1001, 1001) and (1010), its local depth becomes 3, and the directory doubles to i = 3 with entries 000–111. The whole insert sequence is replayed in code below.]
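A minimal sketch of extensible-hashing insertion following the example above: the directory is indexed by the first i bits of the 4-bit hash value, each bucket holds at most 2 keys and remembers its local depth. Keys are given directly as bit strings, an assumption made here purely for illustration.

```python
BUCKET_SIZE = 2

class Bucket:
    def __init__(self, depth):
        self.depth = depth            # how many leading bits this bucket uses
        self.keys = []

def new_table():
    return {"i": 0, "dir": [Bucket(0)]}     # global depth i, directory

def dir_index(table, key):
    return int(key[:table["i"]], 2) if table["i"] else 0

def insert(table, key):
    while True:
        bucket = table["dir"][dir_index(table, key)]
        if len(bucket.keys) < BUCKET_SIZE:
            bucket.keys.append(key)
            return
        split(table, bucket)                # make room, then retry

def split(table, bucket):
    if bucket.depth == table["i"]:          # directory must double first
        table["dir"] = [table["dir"][j >> 1] for j in range(2 * len(table["dir"]))]
        table["i"] += 1
    bucket.depth += 1
    sibling = Bucket(bucket.depth)
    # Directory entries whose extra prefix bit is 1 now point to the sibling.
    for j, b in enumerate(table["dir"]):
        if b is bucket and (j >> (table["i"] - bucket.depth)) & 1:
            table["dir"][j] = sibling
    # Redistribute the bucket's keys on that extra bit.
    old, bucket.keys = bucket.keys, []
    for k in old:
        (sibling if k[bucket.depth - 1] == "1" else bucket).keys.append(k)

# Replaying the slides' inserts ends with global depth i = 3, as in the example.
t = new_table()
for k in ["0001", "1001", "1100", "1010", "0111", "0000", "1001"]:
    insert(t, k)
print(t["i"], [b.keys for b in dict.fromkeys(t["dir"])])
```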

Extensible hashing: deletion

• Option 1: no merging of blocks
• Option 2: merge blocks and cut the directory if possible (reverse of the insert procedure)

Deletion example:

• Run through the insert example in reverse!

Extensible hashing: summary

+ Can handle growing files
  - with less wasted space
  - with no full reorganizations
- Indirection (not bad if the directory fits in memory)
- Directory doubles in size (now it fits in memory, now it does not)

Linear hashing

• Another dynamic hashing scheme

Two ideas:
(a) Use the i low-order bits of the hash

[Figure: h(K) = 01110101 (b bits); the i low-order bits are used, and i grows over time]

(b) The file grows linearly

Example: b = 4 bits, i = 2, 2 keys/bucket

[Figure: bucket 00 holds (0000, 1010) and bucket 01 holds (0101, 1111); m = 01 (max used block); buckets 10 and 11 are future growth buckets]

Rule (sketched in code below):
  if h(k)[i] ≤ m, then look at bucket h(k)[i]
  else, look at bucket h(k)[i] − 2^(i−1)

• Insert 0101: bucket 01 is already full → can have overflow chains!
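A minimal sketch of the lookup rule above, assuming 4-bit hash values given as bit strings and using the i low-order bits; m is the highest bucket number currently in use.

```python
def bucket_for(hash_bits: str, i: int, m: int) -> int:
    b = int(hash_bits[-i:], 2)            # the i low-order bits
    # If that bucket does not exist yet, drop the top of those bits,
    # i.e. subtract 2^(i-1).
    return b if b <= m else b - (1 << (i - 1))

# With i = 2 and m = 01 (binary), as in the figure:
for h in ["0000", "0101", "1010", "1111"]:
    print(h, "->", format(bucket_for(h, i=2, m=0b01), "02b"))
# 0000 and 1010 map to bucket 00; 0101 and 1111 map to bucket 01
```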

Example continued: b = 4 bits, i = 2, 2 keys/bucket

[Figure: after the insert of 0101 the file grows — bucket 10 is allocated and 1010 moves there from bucket 00, then bucket 11 is allocated and 1111 moves there from bucket 01; m advances as each new bucket is added]

Example continued: how to grow beyond this?

[Figure: with i = 2 and m = 11, all four buckets 00, 01, 10, 11 are in use (0000 | 0101, 0101 | 1010 | 1111); to grow further, i becomes 3 and the bucket addresses are extended to 000, ..., 111 — e.g. the keys 0101 now belong to bucket 101]

When do we expand the file?

• Keep track of U = (# used slots) / (total # of slots)
• If U > threshold, then increase m (and maybe i)

Linear hashing: summary

+ Can handle growing files
  - with less wasted space
  - with no full reorganizations
+ No indirection like extensible hashing
- Can still have overflow chains

Example: BAD CASE

[Figure: one region of buckets is very full (long overflow chains) while another is very empty; to split the full buckets we would need to move m past the empty ones, which would waste space]

Summary: Hashing
- How it works
- Dynamic hashing
  - Extensible
  - Linear

Next:

• Indexing vs. hashing
• Index definition in SQL
• Multiple-key access

Indexing vs. Hashing

• Hashing is good for probes given a key, e.g.:

  SELECT …
  FROM R
  WHERE R.A = 5

Indexing vs. Hashing

• Indexing (including B-trees) is good for range searches, e.g.:

  SELECT …
  FROM R
  WHERE R.A > 5

Index definition in SQL

• CREATE INDEX name ON rel (attr)
• CREATE UNIQUE INDEX name ON rel (attr)
  – defines a candidate key
• DROP INDEX name

Note

CANNOT SPECIFY THE TYPE OF INDEX (e.g. B-tree, hashing, …)
OR PARAMETERS (e.g. load factor, size of hash, ...)

... at least in SQL ...

Note

ATTRIBUTE LIST → MULTI-KEY INDEX (next)
e.g., CREATE INDEX foo ON R(A, B, C)

Multi-key Index

Motivation: find records where DEPT = “Toy” AND SAL > 50k

Strategy I:

• Use one index, say the Dept index I1
• Get all Dept = “Toy” records and check their salary

Strategy II:

• Use 2 indexes; manipulate pointers

[Figure: intersect the pointer set for Dept = “Toy” with the pointer set for Sal > 50k]

Strategy III:

• Multiple-key index

One idea:
[Figure: a first-level index I1 on one attribute whose entries point to separate indexes I2, I3, ... on the second attribute]

Example

[Figure: a Dept index with entries Art, Sales, Toy; each entry points to its own Salary index (e.g. 10k, 15k, 17k, 21k and 12k, 15k, 15k, 19k), whose entries point to records such as (Name = Joe, DEPT = Sales, SAL = 15k)]

For which queries is this index good?

• Find RECs with Dept = “Sales” AND SAL = 20k
• Find RECs with Dept = “Sales” AND SAL > 20k
• Find RECs with Dept = “Sales”
• Find RECs with SAL = 20k

Interesting application: geographic data

DATA: <X1, Y1, Attributes>, <X2, Y2, Attributes>, ...

[Figure: points scattered in the x–y plane]

Queries:

• What city is at <Xi, Yi>?
• What is within 5 miles of <Xi, Yi>?
• Which is the closest point to <Xi, Yi>?

Example

[Figure: points a–o plotted on a grid (x and y up to 40, gridlines every 10); the plane is divided into regions and each region's bucket lists the points falling in it, e.g. one bucket holds h, i, a, b, c, d, e, f, g and another holds n, o, m, l, j, k]

• Search for points near f
• Search for points near b

Queries

• Find points with Yi > 20
• Find points with Xi < 5
• Find points “close” to i = <12, 38>
• Find points “close” to b = <7, 24>

• Many types of geographic index structures have been suggested

  – Quad trees
  – R-trees

Two more types of multi-key indexes
• Grid
• Partitioned hash

Grid Index

[Figure: a two-dimensional grid — Key 1 values V1, V2, ..., Vn label the rows and Key 2 values X1, X2, ..., Xn label the columns; the cell (V3, X2) leads to the records with key1 = V3 and key2 = X2]

CLAIM

• Can quickly find records with
  – key 1 = Vi AND key 2 = Xj
  – key 1 = Vi
  – key 2 = Xj
• And also ranges…
  – e.g., key 1 ≥ Vi AND key 2 < Xj

But there is a catch with grid indexes!

• How is the grid index stored on disk?
  Like an array: the cells for V1 (X1, X2, X3, X4), then the cells for V2, then V3, ...

Problem:
• Need regularity so we can compute the position of the <Vi, Xj> entry (a small sketch follows below)
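A minimal sketch of the "regular grid stored like an array" idea: if every row has the same number of cells, the position of the <Vi, Xj> entry can be computed directly from the two subscripts. The grid width and the cell size below are illustrative assumptions.

```python
NUM_X = 4            # number of key-2 partitions (X1..X4)
CELL_BYTES = 8       # assumed bytes per grid cell (e.g. one bucket pointer)

def cell_offset(i: int, j: int) -> int:
    """Byte offset of the cell for (Vi, Xj), with i and j starting at 0."""
    return (i * NUM_X + j) * CELL_BYTES   # row-major order

print(cell_offset(2, 1))   # position of the <V3, X2> entry
```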


Solution: use indirection

[Figure: the grid cells for (V1..V4) × (X1..X3) contain only pointers to buckets; the buckets hold the actual record pointers]

* The grid only contains pointers to buckets

With indirection:

• Grid can be regular without wasting space

• We do pay the price of indirection

Can also index the grid on value ranges

[Figure: a Salary × Dept grid; a linear scale maps Dept values (Toy, Sales, Personnel) to columns 1, 2, 3 and Salary ranges (0-20K, 20K-50K, 50K-) to rows 1, 2, 3]

Grid files

+ Good for multiple-key search
- Space, management overhead (nothing is free)
- Need partitioning ranges that evenly split keys

Partitioned hash function

Idea:
[Figure: the bucket address is formed by concatenating h1(Key1) = 010110 with h2(Key2) = 1110010]
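A minimal sketch of a partitioned hash function: h1 hashes Key1 to the high bits of the bucket number and h2 hashes Key2 to the low bits, so a probe that fixes only one key still narrows the search to a slice of the buckets. The bit widths and the toy hash functions below are illustrative assumptions, not the slides' exact values.

```python
BITS1, BITS2 = 1, 2            # bits contributed by each key (3-bit buckets)

def h1(dept: str) -> int:
    return sum(dept.encode()) % (1 << BITS1)

def h2(salary: int) -> int:
    return (salary // 10_000) % (1 << BITS2)

def bucket(dept: str, salary: int) -> int:
    # Concatenate the two partial hashes into one bucket number.
    return (h1(dept) << BITS2) | h2(salary)

def buckets_for_dept(dept: str):
    # A probe on Dept alone fixes the high bits and scans all low-bit buckets.
    return [(h1(dept) << BITS2) | low for low in range(1 << BITS2)]

print(bucket("sales", 40_000), buckets_for_dept("sales"))
```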

EX: Insert <Fred, toy, 10k>, <Joe, sales, 10k>, <Sally, art, 30k>

h1(toy) = 0      h2(10k) = 01
h1(sales) = 1    h2(20k) = 11
h1(art) = 1      h2(30k) = 01
                 h2(40k) = 00

[Figure: buckets 000–111; <Fred> goes to bucket 001 (0 | 01), while <Joe> and <Sally> go to bucket 101 (1 | 01)]

• Find Emp. with Dept. = Sales AND Sal = 40k

h1(sales) = 1 and h2(40k) = 00 → look only in bucket 100

[Figure: buckets 000–111 holding <Fred>, <Joe>, <Jan>, <Mary>, <Sally>, <Tom>, <Bill>, <Andy>; only bucket 100 needs to be examined]

• Find Emp. with Sal = 30k

h2(30k) = 01 → look in every bucket whose low bits are 01, i.e. buckets 001 and 101

[Figure: "look here" marks buckets 001 and 101]

• Find Emp. with Dept. = Sales

h1(sales) = 1 → look in every bucket whose high bit is 1, i.e. buckets 100, 101, 110, 111

[Figure: "look here" marks buckets 100 through 111]

Summary: post-hashing discussion
- Indexing vs. hashing
- SQL index definition
- Multiple-key access
  - Multi-key index variations: grid, geo data
  - Partitioned hash

Reading: Chapter 5

• Skim the following sections:
  – 5.3.6, 5.3.7, 5.3.8
  – 5.4.2, 5.4.3, 5.4.4
• Read the rest

The BIG picture….

• Chapters 2 & 3: storage, records, blocks...
• Chapters 4 & 5: access mechanisms
  - Indexes
  - B-trees
  - Hashing
  - Multi-key
• Chapters 6 & 7: query processing ← NEXT