28
SciDB A DBMS for Analytic Applications by Michael Stonebraker

SciDB A DBMS for Analytic Applications by Michael Stonebraker · Building SciDB e-Bay (partial FTE) LSST (2 FTE working on project) Persistence Software (committed 3 FTE) M.I.T (1

  • Upload
    others

  • View
    11

  • Download
    0

Embed Size (px)

Citation preview

Page 1: SciDB A DBMS for Analytic Applications by Michael Stonebraker · Building SciDB e-Bay (partial FTE) LSST (2 FTE working on project) Persistence Software (committed 3 FTE) M.I.T (1

SciDBA DBMS for Analytic Applications

by

Michael Stonebraker

Page 2: SciDB A DBMS for Analytic Applications by Michael Stonebraker · Building SciDB e-Bay (partial FTE) LSST (2 FTE working on project) Persistence Software (committed 3 FTE) M.I.T (1

Outline Outline

ContextApplication areasWhy RDBMS Doesn’t WorkOur partnershipStatus and future

Page 3: SciDB A DBMS for Analytic Applications by Michael Stonebraker · Building SciDB e-Bay (partial FTE) LSST (2 FTE working on project) Persistence Software (committed 3 FTE) M.I.T (1

Context Context –– One Size Does Not Fit All One Size Does Not Fit All

Vertica – column store for warehouses50X the elephants

VoltDB - main memory, single threaded for(nearly partitionable) OLTP

30-40X the elephantsAt least one more vertical market template

Page 4: SciDB A DBMS for Analytic Applications by Michael Stonebraker · Building SciDB e-Bay (partial FTE) LSST (2 FTE working on project) Persistence Software (committed 3 FTE) M.I.T (1

Serious Analysis Applications Serious Analysis Applications

Science users (Astronomy, Earth Science,…)Web log analysisMedical imagingDrug DiscoverySpooks

Page 5: SciDB A DBMS for Analytic Applications by Michael Stonebraker · Building SciDB e-Bay (partial FTE) LSST (2 FTE working on project) Persistence Software (committed 3 FTE) M.I.T (1

Three Three Lighthouse Lighthouse CustomersCustomers

e-BayLSSTRussian astronomy project

Page 6: SciDB A DBMS for Analytic Applications by Michael Stonebraker · Building SciDB e-Bay (partial FTE) LSST (2 FTE working on project) Persistence Software (committed 3 FTE) M.I.T (1

e-Bay Application e-Bay Application

Web log data (petabytes)SessionizationClustering (in N-space)MiningPredictive analysis

Page 7: SciDB A DBMS for Analytic Applications by Michael Stonebraker · Building SciDB e-Bay (partial FTE) LSST (2 FTE working on project) Persistence Software (committed 3 FTE) M.I.T (1

LSST Data

Page 8: SciDB A DBMS for Analytic Applications by Michael Stonebraker · Building SciDB e-Bay (partial FTE) LSST (2 FTE working on project) Persistence Software (committed 3 FTE) M.I.T (1

LSST Application LSST Application

Raw telescope imagery (big arrays)“Cooked” into features (geographic data)

Data clustering algorithmGrouped together into observations of thesame feature

Similarity metrics

Page 9: SciDB A DBMS for Analytic Applications by Michael Stonebraker · Building SciDB e-Bay (partial FTE) LSST (2 FTE working on project) Persistence Software (committed 3 FTE) M.I.T (1

LSST Queries LSST Queries

Recook portions of the imageryWith different algorithm

Trajectory queriesNearest neighbor queries

Page 10: SciDB A DBMS for Analytic Applications by Michael Stonebraker · Building SciDB e-Bay (partial FTE) LSST (2 FTE working on project) Persistence Software (committed 3 FTE) M.I.T (1

Why SciDB? Why SciDB?

RDBMS has the wrong data modelArrays not tables

Data clustering is natural in Ndimensional space!!

Tables impossibly slow at simulating arrays(x 100)

Page 11: SciDB A DBMS for Analytic Applications by Michael Stonebraker · Building SciDB e-Bay (partial FTE) LSST (2 FTE working on project) Persistence Software (committed 3 FTE) M.I.T (1

Why SciDB? Why SciDB?

RDBMS has wrong operationsRegrid, cluster, not join (can’t even wrapyour mind around data clustering)Parallel, user-defined functions arequirement

My-new-clustering-technique

Page 12: SciDB A DBMS for Analytic Applications by Michael Stonebraker · Building SciDB e-Bay (partial FTE) LSST (2 FTE working on project) Persistence Software (committed 3 FTE) M.I.T (1

Why SciDB? Why SciDB?

RDBMS has features missingNamed versions

Recluster just the tech stocks using myfancy algorithm

ProvenanceWhat clustering technique was used?

UncertaintyWhat is the error in my clustering?

Page 13: SciDB A DBMS for Analytic Applications by Michael Stonebraker · Building SciDB e-Bay (partial FTE) LSST (2 FTE working on project) Persistence Software (committed 3 FTE) M.I.T (1

Net Result Net Result

Roll-your-own on the bare metalOr put up with a horrible kludge on RDBMS

With mountains of app logicAnd copying the world to app space

Page 14: SciDB A DBMS for Analytic Applications by Michael Stonebraker · Building SciDB e-Bay (partial FTE) LSST (2 FTE working on project) Persistence Software (committed 3 FTE) M.I.T (1

Design Team Design Team

Mike StonebrakerStan ZdonikDave MaierSam MaddenMagda BalazinskaDave DeWitt

Page 15: SciDB A DBMS for Analytic Applications by Michael Stonebraker · Building SciDB e-Bay (partial FTE) LSST (2 FTE working on project) Persistence Software (committed 3 FTE) M.I.T (1

Building SciDB Building SciDB

e-Bay (partial FTE)LSST (2 FTE working on project)Persistence Software (committed 3 FTE)M.I.T (1 postdoc)University of Moscow (3 FTE)Paul Brown (world’s best programmer)Washington (1 grad student)Brown (1 grad student)

Page 16: SciDB A DBMS for Analytic Applications by Michael Stonebraker · Building SciDB e-Bay (partial FTE) LSST (2 FTE working on project) Persistence Software (committed 3 FTE) M.I.T (1

What is SciDB? What is SciDB?

Nested array data modelWith bells and whistles

Non-uniform dimensionsBoundaries and holes

Science specific operations (e.g. regrid)Array UDFs

Page 17: SciDB A DBMS for Analytic Applications by Michael Stonebraker · Building SciDB e-Bay (partial FTE) LSST (2 FTE working on project) Persistence Software (committed 3 FTE) M.I.T (1

What is SciDB? What is SciDB?

No overwriteTime is another array dimension.New values written here

Partitioning across nodesSharding in multiple (one or moredimensions)With overlap!!!!

Page 18: SciDB A DBMS for Analytic Applications by Michael Stonebraker · Building SciDB e-Bay (partial FTE) LSST (2 FTE working on project) Persistence Software (committed 3 FTE) M.I.T (1

Current Status Current Status

“Sharded” multisite array systemWhich does a collection of interesting LSSTqueriesMissing bunches of stuff (optimizer, parser,bulk loader, …)

I.e. PoC demoware

Page 19: SciDB A DBMS for Analytic Applications by Michael Stonebraker · Building SciDB e-Bay (partial FTE) LSST (2 FTE working on project) Persistence Software (committed 3 FTE) M.I.T (1

Issues We are Noodling Issues We are Noodling

Storage managerBig blocks (chunks)Refining the node partitioningVertica-style compressionReplication by multiple copies with differentpartitioningEach attribute stored in a separate physicalarray

Page 20: SciDB A DBMS for Analytic Applications by Michael Stonebraker · Building SciDB e-Bay (partial FTE) LSST (2 FTE working on project) Persistence Software (committed 3 FTE) M.I.T (1

Issues We are Noodling Issues We are Noodling

Fixed strideEasy to indexBut blocks may be highly variable in size

Or variable strideNeed an R-treeBut packing can be more uniform

Page 21: SciDB A DBMS for Analytic Applications by Michael Stonebraker · Building SciDB e-Bay (partial FTE) LSST (2 FTE working on project) Persistence Software (committed 3 FTE) M.I.T (1

Issues We are Noodling Issues We are Noodling

Optimizer designMany “blocking” nodes in the plan (e.g.regrid)Opportunity to repartition arraysWhat cost function?Deal with replicationNot 2 phase

Page 22: SciDB A DBMS for Analytic Applications by Michael Stonebraker · Building SciDB e-Bay (partial FTE) LSST (2 FTE working on project) Persistence Software (committed 3 FTE) M.I.T (1

Issues We are Noodling Issues We are Noodling

Array UDFsNo cell level UDFsExperience of Postgres

Page 23: SciDB A DBMS for Analytic Applications by Michael Stonebraker · Building SciDB e-Bay (partial FTE) LSST (2 FTE working on project) Persistence Software (committed 3 FTE) M.I.T (1

Issues We are Noodling Issues We are Noodling

UncertaintyAdditional cell attributeUniform distribution

FastOr something more complex

Can be arbitrarily slow

Page 24: SciDB A DBMS for Analytic Applications by Michael Stonebraker · Building SciDB e-Bay (partial FTE) LSST (2 FTE working on project) Persistence Software (committed 3 FTE) M.I.T (1

Issues We are Noodling Issues We are Noodling

ProvenanceWant to trace backward from “bad” values toidentify “patient zero”Want to trace forward from patient zero

To fix all propagated errorsImplementation?

Trio a non-starter

Page 25: SciDB A DBMS for Analytic Applications by Michael Stonebraker · Building SciDB e-Bay (partial FTE) LSST (2 FTE working on project) Persistence Software (committed 3 FTE) M.I.T (1

Development Plan Development Plan

Complete system at end of Q1/10But no transactions or recoveryKludgy messaging system

Currently being horribly under-managedNo formal QANo formal doc

Page 26: SciDB A DBMS for Analytic Applications by Michael Stonebraker · Building SciDB e-Bay (partial FTE) LSST (2 FTE working on project) Persistence Software (committed 3 FTE) M.I.T (1

SciDBSciDB will be 100X RDBMS will be 100X RDBMS

Optimized parallel UDFsRedistribution with overlapMulti-dimensional storage (not a column store;not a row store)Correct operations

Page 27: SciDB A DBMS for Analytic Applications by Michael Stonebraker · Building SciDB e-Bay (partial FTE) LSST (2 FTE working on project) Persistence Software (committed 3 FTE) M.I.T (1

What Else Have We Done? What Else Have We Done?

Science benchmarkNearly done

A bunch of use casesSee our web site (scidb.org)

Page 28: SciDB A DBMS for Analytic Applications by Michael Stonebraker · Building SciDB e-Bay (partial FTE) LSST (2 FTE working on project) Persistence Software (committed 3 FTE) M.I.T (1

What We NeedWhat We Need

MoneyNSF does not like the word “infrastructure”VCs worry about the size of the market