
The Gamma Operator for Big Data Summarization on an Array DBMS

Carlos Ordonez


Acknowledgments

• Michael Stonebraker, MIT
• My PhD students: Yiqun Zhang, Wellington Cabrera
• SciDB team: Paul Brown, Bryan Lewis, Alex Polyakov


Why SciDB?

• Large matrices beyond RAM size
• Storage by row or column is not good enough
• Matrices are natural in statistics, engineering, and science
• Multidimensional arrays -> matrices, but not the same thing
• Parallel shared-nothing is best for big data analytics
• Closer to DBMS technology, but with some similarity to Hadoop
• Feasible to create array operators, with matrices as input and a matrix as output
• Combine processing with the R package and LAPACK


Old: separate sufficient statistics
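For context, these are the classical statistics that were maintained as separate aggregations (a reconstruction from the definitions used in this deck, with X a d x n matrix of points x_i):

\[
n = |X|, \qquad
L = \sum_{i=1}^{n} x_i, \qquad
Q = X X^T = \sum_{i=1}^{n} x_i x_i^T
\]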


New: Generalizing and unifying Sufficient Statistics: Z=[1,X,Y]
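Spelling out the unification (a reconstruction from the definition Z = [1, X, Y], with augmented points z_i = (1, x_i^T, y_i)^T): a single matrix product packs all the classical statistics into one summarization matrix,

\[
\Gamma = Z Z^T = \sum_{i=1}^{n} z_i z_i^T =
\begin{bmatrix}
n & L^T & \sum_i y_i \\
L & Q & X Y^T \\
\sum_i y_i & (X Y^T)^T & \sum_i y_i^2
\end{bmatrix}.
\]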


Equivalent equations with projections from Γ
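Two representative projections (illustrative examples; the full list on the original slide is not recoverable from this text): the mean and covariance of X, and the least-squares coefficients for linear regression, each read directly off sub-blocks of Γ:

\[
\mu = \frac{L}{n}, \qquad
V = \frac{Q}{n} - \mu \mu^T, \qquad
\beta = Q^{-1} (X Y^T).
\]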


Properties of Γ

Further properties in detail: non-commutative and distributive
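Concretely, distributivity over a partition of the points is what enables the parallel scan: if the data is split across workers as Z = Z_1 ∪ Z_2, then

\[
\Gamma(Z_1 \cup Z_2) = \Gamma(Z_1) + \Gamma(Z_2),
\]

whereas the underlying matrix product is non-commutative: Z Z^T is the small (d+2) x (d+2) Gamma matrix, while Z^T Z is a large n x n matrix.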


Storage in array chunks


SCAN

In SciDB we store the points in X as a 2D array.

[Diagram: a worker scanning its local chunks of the 2D array X.]

Array storage and processing in SciDB

• Assuming d << n, it is natural to hash-partition X by i = 1..n
• Gamma computation is fully parallel, maintaining local Gamma versions in RAM (see the sketch after this list)
• X can be read with a fully parallel scan
• No need to write Gamma from RAM to disk during the scan, unless fault tolerance is required
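A minimal sketch of the per-worker accumulation in plain C++ (the function name and in-memory layout are illustrative assumptions, not SciDB's actual operator API):

#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Each worker scans its local part of X once and accumulates a local Gamma
// in RAM as a sum of vector outer products z_i * z_i^T, z_i = (1, x_i, y_i).
Matrix localGamma(const std::vector<std::vector<double>>& Z) {
    const std::size_t m = Z.empty() ? 0 : Z[0].size();  // m = d + 2
    Matrix G(m, std::vector<double>(m, 0.0));
    for (const auto& z : Z)                      // one pass over local points
        for (std::size_t a = 0; a < m; ++a)
            for (std::size_t b = 0; b < m; ++b)
                G[a][b] += z[a] * z[b];          // Gamma += z * z^T
    return G;
}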


[Diagram: coordinator and worker chunk layouts. A point stored entirely within one chunk: OK. A point split across chunks: NO!]

A point must fit in one chunk. Otherwise, a join is needed (slow).

Parallel computation

[Diagram: Worker 1 and Worker 2 each compute a local Gamma and send it to the coordinator for merging.]
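Because Gamma is distributive over a partition of the points, the coordinator's merge step is just an elementwise sum of the local Gammas; a sketch under the same illustrative layout as the worker code above:

#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Global Gamma = elementwise sum of the local Gammas sent by the workers.
// Assumes at least one worker; parts.at(0) throws otherwise.
Matrix mergeGamma(const std::vector<Matrix>& parts) {
    Matrix G = parts.at(0);
    for (std::size_t w = 1; w < parts.size(); ++w)
        for (std::size_t a = 0; a < G.size(); ++a)
            for (std::size_t b = 0; b < G[a].size(); ++b)
                G[a][b] += parts[w][a][b];
    return G;
}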

Dense matrix operator: O(d^2 n)
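The bound follows directly from the outer-product formulation: each of the n points contributes one (d+2) x (d+2) outer product, so the total work is

\[
\sum_{i=1}^{n} (d+2)^2 = n\,(d+2)^2 = O(d^2 n).
\]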


Sparse matrix operator: O(d n) for hyper-sparse matrix
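A sketch of the idea behind the sparse variant (the data structures here are illustrative assumptions, not the operator's actual ones): each point stores only its nonzero (dimension, value) pairs, so the outer product touches only nonzero-by-nonzero cells, which is what makes the O(d n) bound attainable for hyper-sparse matrices.

#include <map>
#include <utility>
#include <vector>

// A point as a list of (dimension, value) pairs for its nonzero entries.
using SparseVec = std::vector<std::pair<int, double>>;
// Gamma cells materialized only when they receive a contribution.
using SparseMat = std::map<std::pair<int, int>, double>;

// Accumulate one point: Gamma += z * z^T, restricted to nonzero entries.
void accumulate(SparseMat& G, const SparseVec& z) {
    for (const auto& [a, va] : z)
        for (const auto& [b, vb] : z)
            G[{a, b}] += va * vb;
}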


Pros: algorithm evaluation with physical array operators
• Since x_i fits in one chunk, joins are avoided (at least 2X the I/O with a hash or merge join)
• Since x_i * x_i^T can be computed in RAM, we avoid an aggregation that would require sorting points by i
• No need to store X twice (X and X^T): half the I/O, half the RAM space
• No need to transpose X: a costly reorganization even in RAM, especially if X spans several RAM segments
• Operator works in C++ compiled code: fast; each vector accessed once; direct assignment (bypassing C++ function calls)

System issues and limitations

• Gamma is not efficiently computable in AQL or AFL: hence an operator is required
• Arrays of tuples in SciDB are more general, but cumbersome for matrix manipulation: use arrays of a single attribute (double)
• Points must be stored completely inside a chunk: wide rectangular chunks may not be I/O optimal
• Slow: arrays must be pre-processed to SciDB load format, loaded to a 1D array, and re-dimensioned => optimize the load
• Multiple SciDB instances per node improve I/O speed: interleaving CPU
• Larger chunks are better: 8 MB, especially for dense matrices; avoid shuffling; avoid joins
• Dense (alpha) and sparse (beta) versions

Benchmark: scale-up emphasis
• Small: cluster with 2 Intel Quad-core servers, 4 GB RAM, 3 TB disk
• Large: Amazon cloud


Why is Gamma faster than SciDB+LAPACK?


Gamma operator (times in seconds):

d    | Gamma op | Scan | mem alloc |   CPU | merge
100  |      3.5 |  0.7 |       0.1 |   2.2 |   0.0
200  |     10.9 |  1.0 |       0.1 |   8.6 |   0.0
400  |     38.8 |  2.2 |       0.1 |  33.9 |   0.1
800  |    145.0 |  4.6 |       0.1 | 134.7 |   0.4
1600 |    599.8 | 11.4 |       0.1 | 575.5 |   1.0

SciDB and LAPACK (crossprod() call in SciDB; times in seconds; one row per d as above):

 TOTAL | transpose | subarray 1 | repart 1 | subarray 2 | repart 2 | build 0s |  gemm | ScaLAPACK |  MKL
  77.3 |       0.1 |        0.3 |     41.7 |        0.1 |     25.9 |      0.0 |   8.0 |       0.8 |  0.2
 163.0 |       0.1 |        0.2 |     84.9 |        0.1 |     55.7 |      0.0 |  17.2 |       1.8 |  0.6
 373.1 |       0.1 |        0.3 |    172.6 |        0.5 |    120.6 |      0.3 |  39.4 |       5.4 |  2.1
1497.3 |       0.1 |        0.1 |    553.6 |        0.8 |    537.6 |      0.5 | 169.8 |      21.2 |  8.1
     * |         * |          * |        * |          * |        * |        * |     * |         * | 33.4

Combination: SciDB + R


Can Gamma operator beat LAPACK?


Gamma versus Open BLAS LAPACK (Open BLAS reaches ~90% of MKL performance)

Gamma: scan, sparse/dense, 2 threads; disk+RAM+CPU
LAPACK: Open BLAS ~= MKL; 2 threads; RAM+CPU

Times in seconds; for each d, columns are Gamma dense, Gamma sparse, and Open BLAS (OB):

n    density |       d=100        |       d=200        |        d=400        |        d=800
             | dense sparse   OB  | dense sparse   OB  | dense  sparse   OB  |  dense  sparse   OB
100k   0.1%  |   3.3    0.1   0.4 |  11.3    0.1   1.0 |  38.9     0.2   3.1 |  145.0     0.6  10.7
100k   1.0%  |   3.3    0.1   0.4 |  11.3    0.2   1.0 |  38.9     0.4   3.1 |  145.0     1.0  10.7
100k  10.0%  |   3.3    0.5   0.4 |  11.3    0.9   1.0 |  38.9     2.2   3.1 |  145.0     6.2  10.7
100k 100.0%  |   3.3    4.5   0.4 |  11.3   15.4   1.0 |  38.9    55.9   3.1 |  145.0   201.0  10.7
1M     0.1%  |  31.1    0.2   3.8 | 103.5    0.2  10.0 | 316.5     0.4 423.2 | 1475.7     0.9  fail
1M     1.0%  |  31.1    0.5   3.8 | 103.5    1.1  10.0 | 316.5     3.8 423.2 | 1475.7     4.0  fail
1M    10.0%  |  31.1    4.0   3.8 | 103.5    7.0  10.0 | 316.5    16.3 423.2 | 1475.7    46.4  fail
1M   100.0%  |  31.1   44.0   3.8 | 103.5  148.8  10.0 | 316.5   542.3 423.2 | 1475.7  2159.6  fail

SciDB in the Cloud: massive parallelism


Conclusions

• One-pass summarization matrix operator: parallel and scalable
• Optimization of the outer matrix multiplication as a sum (aggregation) of vector outer products
• Dense and sparse matrix versions are required
• Operator compatible with any parallel shared-nothing system, but better suited to arrays
• The Gamma matrix must fit in RAM, but n is unlimited
• The summarization matrix can be exploited in many intermediate computations (with appropriate projections) in linear models
• Simplifies many methods to two phases:
  1. Summarization
  2. Computing model parameters
• Requires arrays, but can work with SQL or MapReduce


Future work: Theory

• Use Gamma in other models like logistic regression, clustering, factor analysis, HMMs
• Connection to frequent itemsets
• Sampling
• Higher expected moments, co-variates
• Unlikely: numeric stability with unnormalized sorted data


Future work: Systems

• DONE: sparse matrices: layout, compression
• DONE: beat LAPACK on high d
• Online model learning (a cursor interface is needed, incompatible with a DBMS)
• Unlimited d (currently d > 8000); is a join required for high d? Parallel processing of high d is more complicated, chunked
• Interface with BLAS and MKL: not worth it?
• Faster than a column DBMS for sparse data?
