Projection Indexes for HDF5 Datasets

Page 1: Projection Indexes for HDF5 Datasets

Projection Indexes in HDF5

Rishi Rakesh Sinha

The HDF Group

Page 2: Projection Indexes for HDF5 Datasets

Science Produces Large Datasets

Observation/experiment driven

Simulation driven

Information driven

144 MB/hr

200 GB/run

> 7GB/expt

Page 3: Projection Indexes for HDF5 Datasets

Why Not Commercial DBMSs?

Proprietary format
Lack of portability
Low scalability
Lack of desirable access modes
Presence of expensive concurrency control and logging mechanisms
Expensive parallel versions

Page 4: Projection Indexes for HDF5 Datasets

State of the Art Not Enough

Scientific file formats and associated I/O APIs
  Concentrating on HDF5
Data recovery is navigational
Subsetting only on a small set of attributes

Page 5: Projection Indexes for HDF5 Datasets

Why Indexes?

Easy

Not So Easy

Page 6: Projection Indexes for HDF5 Datasets

Previous Indexing Efforts

Implicit indexing in HDF5
JPL use of HDF Vdatas
HDF-EOS point data
PyTables
HDF5 internal B-Tree structures

Page 7: Projection Indexes for HDF5 Datasets

Why a Standard Indexing API?

Avoid duplication of effort
  PyTables
Standardize indexing in HDF5
  A standard API can be implemented in different ways
Make indexes portable
  Store indexes in HDF5 files

Page 8: Projection Indexes for HDF5 Datasets

H5IN API

Create_index
  Parameters: location of index, location of data, binning information, memory limits
  Returns: location of the index
Query
  Parameters: dataset to query, query string
  Returns: selection representing the subset of the data corresponding to the query
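
As a rough illustration of the two calls listed on this slide, the Python sketch below mimics them over an in-memory dictionary instead of an HDF5 file. Everything in it is an assumption made for illustration: the function signatures, the `store` dictionary, the `bin_width` parameter standing in for "binning information", the `low < x < high` query-string syntax, the extra `index_path` argument to `query`, and the `/DAY3/...` paths. It is not the actual H5IN interface, and memory limits are left out.

```python
import re

# Hypothetical stand-in for an HDF5 file: maps object paths to Python objects.
def create_index(store, index_path, data_path, bin_width):
    """Build a binned index over store[data_path]; return where it was stored."""
    bins = {}
    for position, value in enumerate(store[data_path]):
        # Group element positions by the bin their value falls into.
        bins.setdefault(int(value // bin_width), []).append(position)
    store[index_path] = {"bin_width": bin_width, "bins": bins}
    return index_path

# Assumed query syntax: "low < x < high" with numeric bounds.
_QUERY = re.compile(r"\s*(-?\d+(?:\.\d+)?)\s*<\s*x\s*<\s*(-?\d+(?:\.\d+)?)\s*")

def query(store, data_path, index_path, query_string):
    """Return a selection (sorted element positions) satisfying the query."""
    match = _QUERY.fullmatch(query_string)
    if match is None:
        raise ValueError("expected a query of the form 'low < x < high'")
    low, high = float(match.group(1)), float(match.group(2))
    data, index = store[data_path], store[index_path]
    width = index["bin_width"]
    selection = []
    # Visit only the bins that can overlap the requested range.
    for bin_id in range(int(low // width), int(high // width) + 1):
        for position in index["bins"].get(bin_id, []):
            # Edge bins may hold values outside the range, so check the data.
            if low < data[position] < high:
                selection.append(position)
    return sorted(selection)

store = {"/DAY3/Temperature": [52, 42, 57, 22, 67]}
index_location = create_index(store, "/DAY3/T_IN", "/DAY3/Temperature", 10)
selection = query(store, "/DAY3/Temperature", index_location, "40 < x < 60")
print(index_location, selection)
```

The point of the sketch is the shape of the interface: index creation hands back a location, and a query hands back a selection rather than a copy of the data.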

Page 9: Projection Indexes for HDF5 Datasets

Design Decisions

Limited scope of the prototype
Index stored in a separate dataset
Returns a selection
Projection index
Support for simple Boolean queries

Page 10: Projection Indexes for HDF5 Datasets

Limited Scope

1st indexing prototype in HDF5
  Presence of implicit indexing
Index on single datasets
Query over single datasets
  Conditions should be over a single dataset
  Result could be mapped to a separate dataset

Page 11: Projection Indexes for HDF5 Datasets

Index Storage

Root Group: /

DAY1 DAY2 DAY3 DAY4

F3 F2 F1

Location Data, Pressure, Temperature

Page 12: Projection Indexes for HDF5 Datasets

Index Storage

Root Group: /

DAY3

F3 F2 F1

Location Data

LD_INDEX

F1 F2

Page 13: Projection Indexes for HDF5 Datasets

Index Storage

Root Group: /

DAY3

Pressure, Temperature

T_IN P_IN

Pressure, Temperature

Page 14: Projection Indexes for HDF5 Datasets

Returns a Selection

Temperature Pressure

Concise storage
Efficient Boolean operations

FIND PRESSURE WHERE TEMP IN [100, 200]
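
One way to picture why a selection is concise and can be reused across datasets is sketched below in Python. It is only an assumption about how such a selection might look: the run-based `(start, stop)` representation, the helper names, and the sample temperature and pressure values are all made up for illustration. The selection is computed from the temperature values and then applied to the pressure values, as in the FIND PRESSURE query on this slide.

```python
def select_in_range(values, low, high):
    """Return runs [(start, stop), ...] of consecutive positions
    whose value lies in the closed interval [low, high]."""
    runs = []
    start = None
    for i, v in enumerate(values):
        if low <= v <= high:
            if start is None:
                start = i          # a new run begins here
        elif start is not None:
            runs.append((start, i))  # the run ended just before i
            start = None
    if start is not None:
        runs.append((start, len(values)))
    return runs

def apply_selection(values, runs):
    """Read the elements of another dataset covered by the runs."""
    return [v for start, stop in runs for v in values[start:stop]]

# Hypothetical sample data; positions line up across the two datasets.
temperature = [90, 120, 150, 210, 180, 100, 95]
pressure = [30, 31, 29, 35, 33, 32, 28]

runs = select_in_range(temperature, 100, 200)
print(runs)
print(apply_selection(pressure, runs))
```

Storing runs instead of individual positions keeps the selection small whenever matching elements are contiguous.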

Page 15: Projection Indexes for HDF5 Datasets

Projection Index

Temp  Category  Pressure
52    A         32
42    D         34
57    F         21
22    A         22
67    D         27

A  D  F  A  F  D
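
A projection index, in the usual sense of the term, keeps the values of one column in row order so that a query on that column does not need to read the full rows. The Python sketch below illustrates this with the five-row table from this slide; the helper names are hypothetical and the index is held in memory rather than in an HDF5 dataset.

```python
# Rows of the example table: (Temp, Category, Pressure).
rows = [
    (52, "A", 32),
    (42, "D", 34),
    (57, "F", 21),
    (22, "A", 22),
    (67, "D", 27),
]

def build_projection_index(table, column):
    """Copy one column out of the table, keeping the row order."""
    return [row[column] for row in table]

def matching_rows(index, predicate):
    """Scan only the projected column and return matching row positions."""
    return [i for i, value in enumerate(index) if predicate(value)]

category_index = build_projection_index(rows, 1)
hits = matching_rows(category_index, lambda c: c == "D")

# Only the matching rows are then read from the full table.
pressures = [rows[i][2] for i in hits]
print(category_index, hits, pressures)
```

Because position i in the index corresponds to row i in the table, no row identifiers need to be stored alongside the values.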

Page 16: Projection Indexes for HDF5 Datasets

Binning

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

1-3 4-6 7-9 10-12 13-15
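
The grouping on this slide, values 1 through 15 placed into bins of width 3, can be reproduced with a few lines of Python. This is a minimal sketch under the assumption of fixed-width bins starting at 1; the function names and the dictionary layout are illustrative only.

```python
def bin_bounds(value, first=1, width=3):
    """Return the inclusive (low, high) bounds of the bin containing value."""
    offset = (value - first) // width
    low = first + offset * width
    return (low, low + width - 1)

def build_bins(values, first=1, width=3):
    """Map each bin's bounds to the positions of the values that fall in it."""
    bins = {}
    for position, value in enumerate(values):
        bins.setdefault(bin_bounds(value, first, width), []).append(position)
    return bins

values = list(range(1, 16))
bins = build_bins(values)
for bounds in sorted(bins):
    print(bounds, bins[bounds])
```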

Page 17: Projection Indexes for HDF5 Datasets

Projection Index

Figure: Temp axis with ticks 40, 50, 60; Pressure axis with ticks 29, 30, 31

Page 18: Projection Indexes for HDF5 Datasets

Why Projection Index?

Data is read only
  In most cases a dataset is not changed once written
  Index does not need to be updated
Projection indexes are well suited
  Number of disk accesses is the same as for a B-Tree
  Multidimensional queries are not being considered

Page 19: Projection Indexes for HDF5 Datasets

Only Simple Boolean Queries

Query format:
  SELECT SELECTION
  WHERE c11 < Attribute1 < c12
  AND c21 < Attribute2 < c22
  …
Because results are selections, Boolean operations can be done inside the library
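
To show how a conjunction of range conditions reduces to a Boolean operation on selections, the Python sketch below evaluates each condition separately and intersects the results. It is an illustration under stated assumptions: selections are modeled as Python sets of positions, each condition is evaluated by a plain scan rather than through an index, and the sample values and bounds are chosen for the example.

```python
def range_selection(values, low, high):
    """Return the set of positions whose value satisfies low < value < high."""
    return {i for i, v in enumerate(values) if low < v < high}

# Two attributes of equal length; position i refers to the same element.
temp = [52, 42, 57, 22, 67]
pressure = [32, 34, 21, 22, 27]

# WHERE 40 < Temp < 60 AND 30 < Pressure < 35
temp_hits = range_selection(temp, 40, 60)
pressure_hits = range_selection(pressure, 30, 35)

both = temp_hits & pressure_hits      # AND is set intersection
either = temp_hits | pressure_hits    # OR would be set union

print(sorted(both), sorted(either))
```

Each condition yields its own selection, so adding another AND clause only adds one more intersection.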

Page 20: Projection Indexes for HDF5 Datasets

Conclusion

Developing a standard indexing API in HDF5
Creating a proof-of-concept prototype using projection indexes
Taking a first step towards developing a query language for HDF5

Page 21: Projection Indexes for HDF5 Datasets

Future Work

Multi-dimensionality
Multiple datasets in same file
Multiple datasets across files
Indexes on attributes
Allow user to index subset of datasets