SciDB Tutorial Technical Overview, and Best Practices


Page 1: SciDB Tutorial Technical Overview, and Best Practices

SciDB Tutorial

Technical Overview, and Best Practices

Page 2: SciDB Tutorial Technical Overview, and Best Practices

Overview

• What is SciDB?
  • Historical context, project goals and motives, current status

• Architecture, Installation
  • How to install the software

• Application Development
  • Basic schemas, queries, data loading, client options

• Advanced Schemas, Plugins, Math
  • Managing dimensions
  • User-defined types, functions, operators, etc.

• General Advice and Best Practice
  • Will be scattered throughout the tutorial

• Conclusions and Closing

Running Time: 1 hour, with 30 minutes for discussion / Q&A.

Page 3: SciDB Tutorial Technical Overview, and Best Practices

Background, Motivation and Status

• XLDB - Survey of Scientific Data Management, 2008
  • Who they were: Astronomy, Remote Sensing, Geology
  • What they wanted: Provenance, Dense Arrays, Legacy Data Format support
  • What they didn't want: SQL was a non-starter in this problem domain
  • What was difficult: unwieldy "big data" file explosion, and parallel data processing

• SciDB and Paradigm4
  • SciDB is an open-source (GPL-3) platform, available through http://www.scidb.org/forum
  • Paradigm4 is a commercial (venture-backed) company that sponsors SciDB development and …
  • … makes a living selling "Enterprise" features to customers who can pay for them.

• What We Learned Since 2010
  • Little real-world enthusiasm for Provenance
  • About half of our use cases emphasize sparse arrays
  • Most data arrives in .csv files or UTF-8 triples
  • Commercial demand for: time series, statistical/numeric analysis, funky-flavored OLAP
  • Industries: Bio-IT, Industrial Sensor Data, Financial Analytics
  • Notable science successes: NIH 1000 Genomes (400 TB), NERSC (128 instances, 100+ TB)

Page 4: SciDB Tutorial Technical Overview, and Best Practices

Why SciDB?

[Slide graphic] An MPP database built on an array data model, providing complex analytics on commodity clusters or the cloud, with connectors for R, Python, Matlab, Julia, … "Big analytics without big hassles."

Page 5: SciDB Tutorial Technical Overview, and Best Practices

SciDB Architectural Overview

[Architecture diagram] A SciDB client (iquery, 'R', Java, Python) connects to one or more SciDB coordinator node(s). The coordinator and each SciDB worker node run a SciDB engine over a local store. A PostgreSQL connection links the coordinator to the persistent system catalog service, and SciDB inter-node communication moves data among the instances.

Page 6: SciDB Tutorial Technical Overview, and Best Practices

SciDB Architectural Overview

• Massively Parallel Data Management
  • Installs onto a cluster of physical nodes
  • You can install N SciDB instances per 1 physical node
  • (Optional) Data redundancy for reliability

• Orthodox Query Processing (see the example below)
  • Client / server connections
  • RESTful API using mid-tier shim (mostly for 'R' and Python)
  • Parsing and plan generation on Coordinator(s)
  • Limited query optimization
  • Physical plan distribution to Worker(s)
  • Run-time coordination of data movement
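For instance (a minimal sketch: the query is illustrative, but list('instances') and the iquery flags shown are standard), the same request can be submitted from the command line or the interactive AFL prompt; either way it is parsed and planned on a coordinator and executed on every instance:

$ iquery -aq "list('instances')"     # -a selects AFL, -q passes a single query string

AFL% list('instances');              -- the same call from the interactive AFL prompt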

Page 7: SciDB Tutorial Technical Overview, and Best Practices

Installation and Configuration

• Download the user guide PDF from the forum
  • http://www.scidb.org/forum
  • Also at: http://www.paradigm4.com/HTMLmanual/14.8/scidb_ug/
    • See Section 2 for script instructions
    • See Appendix A for non-scripted steps (yum / apt-get)

• Cluster install script:
  • http://github.com/Paradigm4/deployment
  • Non-root option for RHEL or CentOS

• How-to video:
  • http://www.paradigm4.com/scidb-installation-video/

Page 8: SciDB Tutorial Technical Overview, and Best Practices

SciDB Configuration Guide

• Basic set of questions:
  • How many physical compute nodes?
  • How many physical disks per node?
  • How many cores (CPU cores) per node?
  • How many concurrently connected users?
  • How much DRAM per node?

• Use the config.ini generator:
  http://htmlpreview.github.io/?https://raw.github.com/Paradigm4/configurator/master/config.14.8.html

Page 9: SciDB Tutorial Technical Overview, and Best Practices

SciDB Client Options

• C/C++ client library, libscidbclient.so
  • Used inside iquery (examples below)
  • Low level, Array-API based

• 'shim': a 3-tier web server model
  • 'R' and Python clients speak https to shim, which in turn uses libscidbclient to talk to SciDB

• JDBC driver implemented in Java
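A few hedged iquery examples (the array name Foo and the host name are placeholders; the flags shown are the standard ones):

$ iquery -aq "list('arrays')"                              # run one AFL query and print the result
$ iquery -anq "remove ( Foo )"                             # -n suppresses fetching the result
$ iquery -c coordinator_host -p 1239 -aq "show ( Foo )"    # connect to a remote coordinator (default port 1239)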

Page 10: SciDB Tutorial Technical Overview, and Best Practices

SciDB: Data Model Description

• 'Arrays' instead of 'Relations'
  • Multi-dimensional (up to 99-D)
  • Multi-attribute (theoretically limited to 2^64, but we test arrays to 1,000)
  • Extensible (i.e. user-definable) types / functions / aggregates / operators

• Straightforward Theoretical Mapping
  • Array.dimensions === Relation.key

• Constraints and Data Integrity Rules
  • Arrays can be dense or sparse
  • Dimension lengths can be constrained or unbounded
  • Sophisticated (maybe too sophisticated?) missing-information management

• Subtle but Significant Differences between the Array model and the Relational model
  • Dimensions implicitly order cells (not true of SQL tuples)
  • The underlying algebra is more explicit in the query language (AFL)

Page 11: SciDB Tutorial Technical Overview, and Best Practices

For Example

CREATE ARRAY CALLS
  < bytes : int32 DEFAULT 16 >
  [ CALLING=0:*,168000,0, CALLED=0:*,168000,0, WHEN=0:*,100000,0 ];

CREATE ARRAY CALL_SUMMARY
  < total_bytes : int32 NULL, total_calls : int64 >
  [ CALLING=0:*,168000,0, CALLED=0:*,168000,0 ];

CREATE ARRAY MODIS
  < probe : double, rgb : int64, q : double >
  [ LAT=-90000000:90000000,10000,100,
    LONG=-180000000:180000000,10000,100,
    WHEN=0:*,100000,0 ];
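Once the DDL has run, the catalog can be read back; a small usage note (output omitted):

AFL% show ( CALLS );          -- returns the stored schema as a one-cell array
AFL% dimensions ( CALLS );    -- one row per dimension: name, bounds, chunk length, overlap
AFL% attributes ( CALLS );    -- one row per attribute: name, type, nullability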

Page 12: SciDB Tutorial Technical Overview, and Best Practices

Queries: Composable Array Algebra

• Analogs from Relational / OLAP
  • project, filter, join, group-by, union (merge)
  • window, cross join
  • theta-join

• Non-Relational Operators
  • regrid, multi-dimensional window, cumulate

• Pure Numerical / Mathematical Operators
  • multiply, transpose, reverse, gaussian-dc, gesvd, tsvd, gemm, spgemm

• Exotics
  • Operators are extensible (hard to do!)
  • Users have added their own gaussian smooth, feature detection, fourier transforms, histogram, … (a composed example follows below)
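As an illustration of how these compose (a sketch only: the predicate, grid sizes, and aggregate are invented, and MODIS is the array declared on the schema slide):

AFL% regrid (
       filter ( MODIS, probe > 0.5 ),    -- relational-style selection
       100, 100, 1,                      -- collapse each 100 x 100 x 1 block of logical cells ...
       avg ( probe ) AS mean_probe       -- ... into one cell holding the average probe value
     );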

Page 13: SciDB Tutorial Technical Overview, and Best Practices

AQL Examples

-- Straightforward SQL-like queries.
SELECT SUM ( C.bytes )  AS total_bytes,
       COUNT ( * )      AS total_calls
INTO   CALL_SUMMARY
FROM   CALLS AS C
GROUP BY C.CALLING, C.CALLED;

-- More exotic: n-dimensional windowing.
SELECT MEDIAN ( M.probe ) AS M_Probe
FROM   MODIS AS M
WINDOW AS ( PARTITION BY LAT  50 PRECEDING AND 50 FOLLOWING,
            LONG 50 PRECEDING AND 50 FOLLOWING );

Page 14: SciDB Tutorial Technical Overview, and Best Practices

AFL Query Language

-- "Query" to compute a single iteration of Conway's Game of Life over a 2D array.
filter (
  apply (
    apply (
      join (
        window (                      -- Compute the number of "live" cells in each
          Life, 1, 1, 1, 1,           -- 3x3 neighborhood of the Life array
          sum ( alive ) AS sum_n
        ) AS N_Step,
        Life AS P_Step                -- Join neighbor counts (N_Step) with the previous
      ),                              -- state of the game (Life, aliased as P_Step)
      neighbor_count,
      N_Step.sum_n - P_Step.alive     -- Trim out the cell itself if it is alive
    ),
    next_alive,                       -- Apply Conway's Life rules to neighbor_count:
    iif ( P_Step.alive = 1,           -- a live cell survives only with 2 or 3 neighbors,
          iif ( ( neighbor_count < 2 OR neighbor_count > 3 ), 0, 1 ),
          iif ( ( 3 = neighbor_count ), 1, 0 ) )    -- a dead cell with exactly 3 is born
  ),
  next_alive = 1
)
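For context, a minimal sketch of the Life array the query reads; the extent, chunk length, and overlap of 1 are illustrative assumptions, not part of the original slide:

AFL% CREATE ARRAY Life
     < alive : int8 >
     [ x=0:999,1000,1, y=0:999,1000,1 ];   -- an overlap of 1 suits the 3x3 window above
-- Seed Life (e.g. with store(build(…), Life)), then iterate the query, writing each
-- generation's next_alive values back as the new state.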

Page 15: SciDB Tutorial Technical Overview, and Best Practices

Data Loading into SciDB ( 1 / 2 )

• Lengthy Tutorial with Scripting at: http://www.scidb.org/forum/viewtopic.php?f=11&t=1308#p2724

• Support for multiple file format options
  – Text, binary, OPAQUE and, in 14.12, tsv
  – Performance: OPAQUE x 100 ~= Binary x 10 ~= text

• Simplest method (a concrete sketch follows below):
  1. load ( Load_Array, file, format ) into a flat 1-D array:
     < X, Y, data1, data2, … datan > [ Row ]
  2. store ( redimension ( Load_Array, Target ), Target ) to pivot it into:
     < data1, data2, … datan > [ X, Y ]
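Making the two steps concrete with the CALLS array from the schema slide (a hedged sketch: the staging-array name, chunk length, file path, and use of the default text format are all assumptions):

AFL% CREATE ARRAY CALLS_LOAD                    -- flat 1-D staging array: coordinates held as attributes
     < CALLING : int64, CALLED : int64, WHEN : int64, bytes : int32 >
     [ Row=0:*,500000,0 ];
AFL% load ( CALLS_LOAD, '/tmp/calls.scidb' );   -- step 1: parse the file into the staging array
AFL% store ( redimension ( CALLS_LOAD, CALLS ), CALLS );   -- step 2: pivot attributes into dimensions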

Page 16: SciDB Tutorial Technical Overview, and Best Practices

Data Loading into SciDB ( 2 / 2 )

• load ( array, file, format ) is shorthand for store ( input ( array, file, format ), array )

• redimension(…) is expensive (it sorts the data)
• insert(…) can substitute for store(…) (see the sketch below)
  – Difference: insert appends, it doesn't overwrite
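A one-line sketch of the append pattern, reusing the hypothetical CALLS_LOAD staging array from the previous page:

AFL% insert ( redimension ( CALLS_LOAD, CALLS ), CALLS );   -- appends the new cells; existing cells
                                                            -- at other coordinates are left in place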

Page 17: SciDB Tutorial Technical Overview, and Best Practices

Chunk Sizing ( 1 / 2 )

• Really: a list of per-dimension chunk lengths
  • Easy when the data is completely dense
  • Harder when the data is sparse (and skewed)

Brief review: the per-dimension chunk length is part of the CREATE ARRAY dimension specification.

CREATE ARRAY …
  < data > [ dimension = 0 : * , length, overlap ];

Chunk length is a measure in logical space.

Set the per-dimension chunk lengths of your array so that the average number of cells per chunk ~= 1,000,000 (a worked example follows below).
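A worked illustration (the sizes are invented): for a dense 2-D array of doubles, per-dimension chunk lengths of 1,000 give 1,000 x 1,000 = 1,000,000 cells, roughly 8 MB of raw attribute data, per chunk:

AFL% CREATE ARRAY DENSE_EXAMPLE
     < val : double >
     [ i=0:99999,1000,0, j=0:99999,1000,0 ];   -- 100,000 x 100,000 logical cells; 1,000,000 per chunk

For sparse or skewed data the same target is harder to hit by hand, which is what the tool on the next slide helps with.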

Page 18: SciDB Tutorial Technical Overview, and Best Practices

Chunk Sizing ( 2 / 2 )

• calculate_chunk_length.py
  – Ships with 14.8
  – Soon to be internalized (see below)

• Give it:
  1. The name of an array that has data to be placed into your eventual target array.
  2. A specification of the target's dimensional "shape", with "?" indicating the values you want it to compute.

$ calculate_chunk_length.py modis_load "latitude=?:?,?,?, longitude=?:?,?,?"
latitude=180937:393179,37916,0, longitude=-1206600:-921726,37969,0

• Internalized version (14.12):
AFL%> CREATE ARRAY Target < data > [ latitude=?:?,?,?, longitude=?:?,?,? ] AS modis_load;

Best Practice Tip: Give these tools as much data as you can to seed them.

Page 19: SciDB Tutorial Technical Overview, and Best Practices

SciDB Query Writing ( 1 / 3 )

• AFL queries are trees of operators
• Every operator:
  1. Accepts one or more arrays as inputs
     – Plus additional parameters
  2. Returns one array to the operator above it
     – Or to the client, if it is the root operator of the query

Best Practice Tip: Use the built-in help.
AFL% help('filter');
{i} help
{0} 'Operator: filter
Usage: filter(<input>, <expression>)'
AFL%

Page 20: SciDB Tutorial Technical Overview, and Best Practices

SciDB Query Writing ( 2 / 3 )

• SciDB is an example of a "meta-data driven" system
• Built-in tools to help you discover what's possible:
  • list('operators') – list of operators callable from AFL
  • list('functions') – list of functions usable in AFL apply(…) or filter(…), and in AQL
  • list('arrays') – list of arrays in the SciDB database
  • list('queries') – list of currently running queries in the installation
  • list('chunk map') – list of all the chunks in the installation
  • show ( array_name ) – returns the shape of the named array
  • show ( 'query string', 'afl|aql' ) – returns the shape of the array produced by the query

• These operators return arrays, like any other data, and can be used in queries (the AFL form follows below):

SELECT *
FROM   list ('functions')
WHERE  regex ( signature, 'bool(.*)string,string(.*)' );
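The same lookup phrased directly in AFL, underlining that list(…) is just another array-producing operator (a sketch reusing the slide's regex):

AFL% filter ( list ('functions'), regex ( signature, 'bool(.*)string,string(.*)' ) );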

Page 21: SciDB Tutorial Technical Overview, and Best Practices

SciDB Query Writing ( 3 / 3 )

aggregate (
  filter (
    cross_join (
      filter ( list ('arrays'), name = 'Foo' ) AS A,
      list ('chunk map') AS C
    ),
    A.uaid = C.uaid
  ),
  min ( name )   AS Name_Of_Array,
  COUNT ( * )    AS Number_of_Physical_Chunks,
  avg ( nelem )  AS Avg_Number_of_Cells_Per_Chunk
);

The query illustrates several ideas:
• Composability of operators
• Meta-data as a data source
• Useful details about physical design

Page 22: SciDB Tutorial Technical Overview, and Best Practices

Plugins and Extensions

• Implement your plugin in C/C++
  – Compile it into a shared object library file (say, plugin.so)
  – Copy the plugin .so file to the pluginsdir on each instance
  – Load the library using the SciDB command:

AFL% load_library('dense_linear_algebra')
AFL% unload_library('dense_linear_algebra')   -- takes effect after a SciDB restart
AFL% list('libraries')

• Example implementations of all flavors of C/C++ extensions can be found in the examples directory: functions, aggregates, types, operators and macros. (A quick check of what a loaded library registered follows below.)
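A small sketch of that check, assuming the 'dense_linear_algebra' library loaded above and that your build's list('operators') output includes a library attribute:

AFL% filter ( list ('operators'), library = 'dense_linear_algebra' );   -- operators this plugin registered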

Page 24: SciDB Tutorial Technical Overview, and Best Practices

Notes on Macros ( 1 / 2 )

• Macros are basic "functional" extensions
  – Not written in C/C++; closer to AFL syntax
  – First stab at a much more elaborate scheme

• How are they implemented?
  – Currently, implement them in a (central) file and load it:
    AFL%> load_module ('macro_file.txt');
  – Examples in lib/scidb/modules/prelude.txt
  – Limited: a naïve "macro expansion" approach

Page 25: SciDB Tutorial Technical Overview, and Best Practices

Notes on Macros ( 2 / 2 )

• Example: in a file named /tmp/macro.txt

array_chunk_details ( __BAR__ ) =
aggregate (
  filter (
    cross_join (
      filter ( list ('arrays'), name = __BAR__ ) AS A,
      list ('chunk map') AS C
    ),
    A.uaid = C.uaid
  ),
  count(*)        AS num_chunks,
  min ( C.nelem ) AS min_cells_per_chunk,
  max ( C.nelem ) AS max_cells_per_chunk,
  avg ( C.nelem ) AS avg_cells_per_chunk
);

• Load the macro using load_module ('/tmp/macro.txt')
• Invoke it like any other operator:

AFL%> array_chunk_details ( 'Foo' );

Page 26: SciDB Tutorial Technical Overview, and Best Practices

Conclusions

Too much to pack into one slide!