

bdbms: A Database System for Scientific Data Management

Mohamed Y. Eltabakh, Mourad Ouzzani, Walid G. Aref, Ahmed Elmagarmid, Yasin Silva, Umer Arshad, David Salt, Ivan Baxter

Purdue University: Department of Computer Science, Cyber Center, Department of Horticulture and Landscape Architecture

Annotation Management
• Annotations at multiple granularities (tuple, column, and cell)
• Annotating data and operations
• Provenance (lineage) is handled as a special type of annotation

Annotations
• Data copied from Database D1 (Table level)
• Attach articles about this entry (Tuple level)
• This column is computed using a prediction tool (Column level)
• Experimentally verified (Cell level)

Provenance (lineage)

[Figure: provenance of table values, with operations labeled S1 copy, S2 copy, local insert, P1 update, and S3 overwrite.]

• Q1: Where do these values come from?
• Q2: What is the source of this value at time T?
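Question Q2 can be illustrated with a small sketch. It assumes, purely for illustration, that the provenance of one cell is kept as a time-ordered log of (time, source, operation) records; the log contents, the timestamps, and the function name `source_at` are hypothetical and not taken from bdbms.

```python
from bisect import bisect_right

# Hypothetical provenance log for one cell: (time, source, operation),
# kept sorted by time. Sources and operations mirror the figure labels.
cell_log = [
    (1, "S1", "copy"),
    (3, "local", "insert"),
    (5, "P1", "update"),
    (8, "S3", "overwrite"),
]

def source_at(log, t):
    """Return the provenance record in effect at time t, or None."""
    times = [entry[0] for entry in log]
    i = bisect_right(times, t)  # number of records with time <= t
    return log[i - 1] if i > 0 else None

print(source_at(cell_log, 6))  # (5, 'P1', 'update')
```

Answering Q1 for a whole result set would amount to running the same lookup for every output cell at the current time.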

Adding Annotations at Various Granularities

ADD ANNOTATION [AS VIEW]
TO <annotation_table_names>
VALUE <annotation_body>
[ON UPDATE PROPAGATE]
[ON AGGREGATION PROPAGATE]
ON <SELECT_statement>

Storage Optimization Techniques

CREATE ANNOTATION TABLE <annotation_table_names>
ON <user_table_name>

[Figure: relation Gene with annotation tables Gene_lab, Gene_provenance, and Gene_public; entries (B1, T1) to (B5, T5) drawn over Columns, Tuples, and Time axes.]

• Compression: annotation tables store annotations in a compressed form
• Indexing: spatial index structures are built on annotations for efficient retrieval
• Categorization: annotation tables allow categorization of annotations

Archiving/Restoring Annotations

ARCHIVE ANNOTATION
FROM <annotation_table_names>
WHERE <conditions>
ON <SELECT_statement>

[Figure: entries (A1, T1) to (A4, T4) drawn over Columns, Tuples, and Time axes; one entry is crossed out and marked as archived.]

• Archived annotations are not propagated along with query results

Propagating/Filtering Annotations

SELECT [DISTINCT] Ci [PROMOTE (Cj, Ck, …)], …
FROM Relation_name [ANNOTATION (S1, S2, …)], …
[WHERE <data_annotation_conditions>]
[GROUP BY <data_columns>
[HAVING <data_annotation_condition>]]

• ANNOTATION: qualifier to specify the propagated annotations
• PROMOTE: carries the annotations from un-projected attributes
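The propagation rules above (the ANNOTATION qualifier, PROMOTE, and the exclusion of archived annotations) can be sketched in a few lines of Python. This is an in-memory illustration under stated assumptions, not the bdbms implementation: the column names `GID`, `GName`, and `GSeq`, the dictionary layout, and the function `propagate` are all hypothetical.

```python
# Hypothetical in-memory model: each annotation covers a set of columns of
# one tuple and lives in a named annotation table (e.g. Gene_lab).
annotations = [
    {"table": "Gene_lab", "row": 1, "cols": {"GSeq"},
     "body": "Experimentally verified", "archived": False},
    {"table": "Gene_public", "row": 1, "cols": {"GName"},
     "body": "Attach articles about this entry", "archived": False},
    {"table": "Gene_lab", "row": 1, "cols": {"GID"},
     "body": "Old note", "archived": True},
]

def propagate(row, projected, annotation_tables, promote=None):
    """Collect the annotation bodies attached to each projected column.

    annotation_tables plays the role of ANNOTATION (S1, S2, ...);
    promote maps a projected column to un-projected columns whose
    annotations it should carry, in the spirit of PROMOTE (Cj, Ck, ...).
    """
    promote = promote or {}
    result = {}
    for col in projected:
        wanted = {col} | set(promote.get(col, ()))
        result[col] = [
            a["body"]
            for a in annotations
            if a["row"] == row
            and a["table"] in annotation_tables
            and not a["archived"]      # archived annotations stay behind
            and a["cols"] & wanted
        ]
    return result

out = propagate(1, ["GName"], {"Gene_lab", "Gene_public"},
                promote={"GName": ["GSeq"]})
print(out)
```

Here `GName` is the only projected column, yet it also carries the annotation on the un-projected `GSeq` because of the promote mapping, while the archived note on `GID` is never returned.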

ADD ANNOTATION Query Processing
• Execute the SELECT statement
• Identify the output rows and columns
• Map the rows and columns to an ordered domain
  Which mapping is more efficient?
    • Storage_Order Mapping
    • Correlated_Columns Mapping
    • Correlated_Rows Mapping
• Map the target table cells to be annotated to rectangles
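The last step can be sketched as follows, assuming rows and columns have already been mapped to integer positions in an ordered domain. The greedy strategy and the name `cells_to_rectangles` are illustrative assumptions; the poster does not say how bdbms builds its rectangles.

```python
def cells_to_rectangles(cells):
    """Cover a set of (row, col) cells with axis-aligned rectangles.

    Returns a sorted list of (row_lo, row_hi, col_lo, col_hi), bounds
    inclusive. Greedy: build maximal column runs per row, then stack
    identical runs from consecutive rows. The cover is exact but not
    guaranteed to use the fewest possible rectangles.
    """
    runs_by_row = {}
    for row, col in sorted(set(cells)):
        runs = runs_by_row.setdefault(row, [])
        if runs and runs[-1][1] == col - 1:
            runs[-1][1] = col            # extend the current run
        else:
            runs.append([col, col])      # start a new run

    rects = []
    open_rects = {}                      # (col_lo, col_hi) -> [row_lo, row_hi]
    for row in sorted(runs_by_row):
        current = {tuple(r) for r in runs_by_row[row]}
        for key in list(open_rects):
            lo, hi = open_rects[key]
            if key in current and hi == row - 1:
                open_rects[key][1] = row  # same run continues in this row
                current.discard(key)
            else:
                rects.append((lo, hi, key[0], key[1]))
                del open_rects[key]
        for key in current:
            open_rects[key] = [row, row]
    rects.extend((lo, hi, k[0], k[1]) for k, (lo, hi) in open_rects.items())
    return sorted(rects)

# A 2x2 block of cells plus one isolated cell.
print(cells_to_rectangles({(0, 0), (0, 1), (1, 0), (1, 1), (3, 2)}))
# A column-level annotation over four tuples collapses to one rectangle.
print(cells_to_rectangles({(r, 2) for r in range(4)}))
```

Under this sketch, a mapping that places co-annotated rows or columns next to each other in the ordered domain yields fewer rectangles, which is one way to read the question of which mapping is more efficient.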

Snapshot versus View Annotations
• Snapshot Annotations: the command is evaluated once and the annotation is attached to the current query results
• View Annotations: the command is evaluated on the current database snapshot and continuously applied over new tuples
    • Eager Approach: apply the annotation command at insertion time
    • Lazy Approach: apply the annotation command at query time
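The difference between the eager and lazy approaches can be made concrete with a toy model. Everything here is a hypothetical sketch: a view annotation is modeled as a predicate plus a body, and the classes `ViewAnnotation` and `Table` are invented for the illustration.

```python
class ViewAnnotation:
    """A hypothetical view annotation: a predicate over tuples plus a body."""
    def __init__(self, predicate, body):
        self.predicate = predicate
        self.body = body

class Table:
    def __init__(self, eager):
        self.eager = eager
        self.rows = []     # list of (tuple_dict, [annotation bodies])
        self.views = []

    def add_view_annotation(self, view):
        self.views.append(view)
        if self.eager:
            # Evaluate on the current snapshot right away.
            for data, notes in self.rows:
                if view.predicate(data):
                    notes.append(view.body)

    def insert(self, data):
        notes = []
        if self.eager:
            # Eager: apply every view annotation at insertion time.
            notes = [v.body for v in self.views if v.predicate(data)]
        self.rows.append((data, notes))

    def select(self):
        for data, notes in self.rows:
            if self.eager:
                yield data, list(notes)
            else:
                # Lazy: evaluate the view annotations at query time.
                yield data, [v.body for v in self.views if v.predicate(data)]

def run(eager):
    t = Table(eager)
    t.insert({"gene": "g1", "predicted": True})
    t.add_view_annotation(
        ViewAnnotation(lambda r: r["predicted"], "computed by a prediction tool"))
    t.insert({"gene": "g2", "predicted": True})
    t.insert({"gene": "g3", "predicted": False})
    return list(t.select())

# Both strategies return the same annotated result; they differ only in
# when the work is done.
print(run(eager=True) == run(eager=False))
```

A snapshot annotation, by contrast, would have annotated only `g1`, the single tuple present when the command was issued.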

Archiving Annotations

• SELECT statement
• Query processing
• Identify the cells on which annotations are archived
• Map the cells to rectangles

Representation of Archived Annotations
• A single annotation rectangle may be divided into smaller ones
• How to divide an annotation rectangle?
    • Row-oriented division
    • Column-oriented division

[Figure: panels (1a) to (3a) and (1b) to (3b) showing regions labeled Q and A over tuples t1, t2, t3, with the annotation rectangle split into numbered pieces under row-oriented and column-oriented division.]
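One way to picture the two division strategies is to remove an archived region from an annotation rectangle and cut what remains into smaller rectangles. The reading below is an assumption: "row-oriented" is taken to mean cutting full-width row bands first, and "column-oriented" to mean cutting full-height column bands first. The function `divide` is hypothetical.

```python
def divide(rect, hole, row_oriented=True):
    """Split rect minus hole into rectangles.

    Rectangles are (row_lo, row_hi, col_lo, col_hi) with inclusive bounds.
    hole is assumed to lie entirely inside rect.
    """
    r0, r1, c0, c1 = rect
    h_r0, h_r1, h_c0, h_c1 = hole
    pieces = []
    if row_oriented:
        if h_r0 > r0:
            pieces.append((r0, h_r0 - 1, c0, c1))      # full-width band above
        if h_c0 > c0:
            pieces.append((h_r0, h_r1, c0, h_c0 - 1))  # left of the hole
        if h_c1 < c1:
            pieces.append((h_r0, h_r1, h_c1 + 1, c1))  # right of the hole
        if h_r1 < r1:
            pieces.append((h_r1 + 1, r1, c0, c1))      # full-width band below
    else:
        if h_c0 > c0:
            pieces.append((r0, r1, c0, h_c0 - 1))      # full-height band left
        if h_r0 > r0:
            pieces.append((r0, h_r0 - 1, h_c0, h_c1))  # above the hole
        if h_r1 < r1:
            pieces.append((h_r1 + 1, r1, h_c0, h_c1))  # below the hole
        if h_c1 < c1:
            pieces.append((r0, r1, h_c1 + 1, c1))      # full-height band right
    return pieces

# Annotation over 3 tuples x 4 columns; the archived part is 1 tuple x 2 columns.
annotation = (0, 2, 0, 3)
archived = (1, 1, 1, 2)
print(divide(annotation, archived, row_oriented=True))
print(divide(annotation, archived, row_oriented=False))
```

Both calls cover the same ten remaining cells, but the pieces have different shapes, so the choice affects what later queries over rows or over columns have to touch.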

Non-traditional and Novel Access Methods
• Efficient indexing structures
• New operators to support complex search operations
• Efficient query processing

Indexing compressed sequences
• Biological sequences are very large
• Data compression techniques
• Compressed sequences
• New index structures for compressed sequences

Indexing Compressed Sequences (SBC-Tree)

[Figure: SBC-Tree example. Index nodes labeled PT; a sorted list of values from 5 to 300 with assigned tags (NULL, G2, B4, A5, A4, B5, E3, S1, B7, B6); queries Q1: (245, B0), (160, A2); Q2: (160, NULL), (245, NULL); Q3: (20, NULL), (100, NULL), delimited by min_tag and max_tag; axis label: Preceding RLE-character.]

Compression techniques gain significant importance:
• Significant storage reduction
• Reduced buffer requirements
• Reduced number of I/Os
• Enhanced overall system performance
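The figure label "Preceding RLE-character" suggests that the sequences are run-length encoded. As a reminder of what that encoding looks like, here is a minimal sketch of RLE on a short nucleotide-like string; it illustrates the compression step only and says nothing about how the SBC-Tree itself is built. The function names and the sample string are made up for the example.

```python
from itertools import groupby

def rle_encode(seq):
    """Run-length encode a string into (character, run_length) pairs."""
    return [(ch, len(list(run))) for ch, run in groupby(seq)]

def rle_decode(pairs):
    """Rebuild the original string from (character, run_length) pairs."""
    return "".join(ch * n for ch, n in pairs)

encoded = rle_encode("AAAGGGGTTC")
print(encoded)              # [('A', 3), ('G', 4), ('T', 2), ('C', 1)]
print(rle_decode(encoded))  # AAAGGGGTTC
```

The ten-character input becomes four pairs; sequences with long runs shrink accordingly, which is where the storage and I/O savings listed above would come from.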

Spatial Data Indexing (SP-GiST Framework)

[Figure: SP-GiST inside PostgreSQL. Components shown: PostgreSQL Function Manager, PostgreSQL Engine, PostgreSQL Storage Manager, storage interface, SP-GiST Internal Methods, SP-GiST kd-tree, SP-GiST trie. Example trie over the words "star", "space", and "spade". Panel labels: Trie variants, Quadtree variants.]

Implementing non-traditional indexes involves significant overhead
• Functionalities (insertion, deletion, searching)
• Storage management
• Integration
• Recovery and concurrency control

Extensible indexing frameworks
• Software engineering solution
• One-time core development
• Repeated low-cost instantiation of a variety of index structures
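The "one-time core, many instantiations" idea can be sketched in miniature. The class below is a hypothetical toy, not the SP-GiST API: insertion and search are written once, and each index type only supplies a routing function. The trie reuses the words from the figure; the quadtree-style routing, its fixed depth of 3, and the assumption that points lie in the unit square are choices made for this example.

```python
class SpacePartitioningTree:
    """Hypothetical generic core: insertion and search are written once;
    an instantiation only supplies how a key is routed at each level."""

    def __init__(self, route):
        self.route = route    # route(key, level) -> child label, or None to stop
        self.children = {}
        self.items = []

    def insert(self, key, level=0):
        label = self.route(key, level)
        if label is None:
            self.items.append(key)
            return
        if label not in self.children:
            self.children[label] = SpacePartitioningTree(self.route)
        self.children[label].insert(key, level + 1)

    def search(self, key, level=0):
        label = self.route(key, level)
        if label is None:
            return key in self.items
        child = self.children.get(label)
        return child is not None and child.search(key, level + 1)

# Instantiation 1: a trie routes on the character at the current level.
def trie_route(word, level):
    return word[level] if level < len(word) else None

# Instantiation 2: a fixed-depth quadtree over points in [0, 1) x [0, 1)
# routes on the quadrant that contains the point at the current level.
def quad_route(point, level, depth=3):
    if level >= depth:
        return None
    x, y = point
    cells = 2 ** (level + 1)
    return (int(x * cells) % 2, int(y * cells) % 2)

trie = SpacePartitioningTree(trie_route)
for w in ("star", "space", "spade"):
    trie.insert(w)

quad = SpacePartitioningTree(quad_route)
quad.insert((0.3, 0.8))

print(trie.search("space"), trie.search("spa"))         # True False
print(quad.search((0.3, 0.8)), quad.search((0.9, 0.1)))  # True False
```

Adding another index type means writing one more routing function; the storage, recovery, and concurrency concerns listed above would live in the shared core.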