Coupling Databases and Advanced Analytical Tools (R)

Master Thesis Presentation [IT4BI]

Student: Sedem Seakomo, FIB-UPC (BarcelonaTech)
saviour.sedem.kofi.seakomo@est.fib.upc.edu

Supervisor: Prof. Alberto Abelló (PhD), FIB-UPC (BarcelonaTech)
[email protected]

Outline

• Introduction
• The Problem
• State of the art (Existing systems review)
• Methodology
• Empirical Work
• The Results
• Conclusions

Introduction

• Introduction
  - Background & Motivation
  - Research Questions
  - Scope (Delimitations & Limitations)
  - Importance & Contribution
  - Related Work
• The Problem
• State of the art (Existing systems review)
• Methodology
• Empirical Work
• The Results
• Conclusions

Introduction

• SQL/relational DBMSs are powerful systems!
  - Managing, querying, and aggregating data
• But what about complex analytics? Not really!
  - Inferences, predictions, subtle relationships in data
• In spite of this, organizations still house large amounts of data in various SQL/RDBMSs
• So what do we do for complex analytics?

Introduction

• Objective:
  - Examine the level of development of R+DBMS integration
  - Assess the performance, scalability and completeness of R+DBMS integration
• Motivation:
  - New industry (analytics industry)
    - Development on the analytics front: gleaning information and insights from data is now an industry in itself
  - Data mining (complex analytics)
    - Increasing relevance of data mining (to be driven by complex analytics) for revealing valuable insights from data

Research Questions

• What is the current level of development (completeness) of the integration of R with DBMSs?
• How does the performance of coupling databases with an advanced analytical tool (R) compare to that of the stand-alone analytical tool (R)?
• How does the scalability of coupling databases with an advanced analytical tool (R) compare to that of the stand-alone analytical tool (R)?
• What are the inherent implications of R integration architectures that impact performance?
• Are there any lessons to be learnt on the way forward?

Scope (delimitations & limitations)

• Focused on benchmarking the performance, scalability and completeness of selected DBMS+R couplings
• Benchmarks cover mainly the matrix operations that form the core of advanced analytics
• Benchmarking of intra-command parallelism was not covered
• Focused on coupling R with RDBMSs (Oracle, Postgres, DB2 and SQL Server); non-relational or NoSQL databases were not covered
• Focused on directly coupling R at the data layer (not at the analytics layer and/or presentation layer)

Introduction

• Contributions:
  - Better performance is achievable by coupling databases with advanced analytical tools (R)
  - This approach is recommended for complex analytics involving significant amounts of data
  - Architectures in which more analytic functions have equivalent native SQL counterparts executable in-database produce the best performance results
  - Caveat: data used in the analytic process must be efficiently retrieved and passed to the analytic functions, otherwise performance worsens

Introduction

• Related work:
  - Analytics and databases:
    - Database Analytics Acceleration using FPGAs [10]: for evaluating expensive analytics queries while saving CPU resources
    - The MADlib Analytics Library or MAD Skills, the SQL [11]: introduces an open-source library of in-database analytic methods, with SQL-based algorithms for machine learning, data mining and statistics inside the database engine
    - Towards a Unified Architecture for in-RDBMS Analytics [12]: presents a unified architecture for in-RDBMS analytics, with emphasis on faster implementation of new statistical techniques in the RDBMS
  - Performance benchmark studies w.r.t. R:
    - By Philippe Grosjean [3], Stefan Steinhaus [13], Donald Knuth [14]
    - Centered on comparing the performance of versions of R implementations, of R implementations with and without certain packages, and of R against other analytical tools
  - But our work:
    - A performance study of R+DBMS vs. stand-alone R

The Problem

• Introduction
• The Problem
  - Advanced analytical tools
  - Database Management Systems
  - Bringing the “two worlds” together
  - Thesis statement (Hypothesis) declaration
• State of the art (Existing systems review)
• Methodology
• Empirical Work
• The Results
• Conclusions

Advanced Analytical Tools

• Inclined towards linear algebra
• Up-side:
  - Statistical software provides rich and very advanced analytical functionality for data analysis and modelling
• Down-side:
  - Can handle only limited amounts of data
  - Example: some packages (base R and IBM SPSS) operate entirely in main memory

Database Management Systems

• Founded on relational algebra (RDBMS)
• Up-side:
  - DBMSs can store and process large amounts of data
• Down-side:
  - But they provide insufficient analytical functionality
  - SQL simulations of linear algebra operations
    - will often result in abysmal I/O and CPU performance
    - are knotty for linear algebra operations with iterations
    - are hard to fathom, which makes code maintenance expensive
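To make the "SQL simulation of linear algebra" point concrete, here is a minimal sketch of a matrix product expressed in plain SQL. It is an illustration only, not the thesis code: it uses Python's built-in sqlite3 module as a stand-in DBMS, and the table names `a` and `b`, their (row, column, value) layout, and the helper `matmul_sql` are all assumptions made for this example.

```python
import sqlite3


def matmul_sql(a, b):
    """Multiply two dense matrices (lists of rows) using only SQL.

    Each matrix is stored as (row index, column index, value) triples,
    which is the usual way to simulate a matrix in a relational table.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE a (i INTEGER, k INTEGER, val REAL)")
    conn.execute("CREATE TABLE b (k INTEGER, j INTEGER, val REAL)")
    conn.executemany(
        "INSERT INTO a VALUES (?, ?, ?)",
        [(i, k, v) for i, row in enumerate(a) for k, v in enumerate(row)],
    )
    conn.executemany(
        "INSERT INTO b VALUES (?, ?, ?)",
        [(k, j, v) for k, row in enumerate(b) for j, v in enumerate(row)],
    )
    # The product needs a join on the shared index plus a grouped sum:
    # one output row per cell of the result matrix.
    rows = conn.execute(
        "SELECT a.i, b.j, SUM(a.val * b.val) "
        "FROM a JOIN b ON a.k = b.k "
        "GROUP BY a.i, b.j ORDER BY a.i, b.j"
    ).fetchall()
    conn.close()
    result = [[0.0] * len(b[0]) for _ in a]
    for i, j, v in rows:
        result[i][j] = v
    return result


product = matmul_sql([[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]])
print(product)
```

Even this single operation needs a join and a grouped aggregate over one row per matrix cell; an iterative algorithm would have to repeat such statements in a loop, which is the readability and maintenance cost the slide refers to.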

Bringing the “two worlds” together

• We have a case at hand!
• So, how do we bridge the gap?

[Diagram: Advanced Statistical Packages (linear algebra) | Database Management Systems (relational algebra)]

• Solution: synergy!
  - Employ extended RDBMS features to power the embedded/integrated/coupled execution of R.
• This has the following advantages:
  - Averts the performance problems associated with the abusive use of SQL (relational algebra operations) for advanced analytics (linear algebra operations)
  - Synergizes the robust data management capabilities of DBMSs with the rich statistical functionality of analytical tools
  - Brings the benefits (performance and security) of taking algorithms (processing logic) to the data rather than data to the algorithms

Thesis Statement (Hypothesis)

Coupling databases and advanced analytical tools (R) leads to better and enhanced analytic performance than stand-alone advanced analytical tools (R)

State of the art

• Introduction
• The Problem
• State of the art (Existing systems review)
  - Advanced analytical tool R
  - Different DBMS architectures of R implementation
  - Choice of DBMS for empirical study
• Methodology
• Empirical Work
• The Results
• Conclusions

Integration with R

• At three layers within the analytic stack
  - Data Layer (e.g. Oracle R Enterprise, Sybase RAP, SAP HANA, IBM Netezza)
  - Analytics Layer (e.g. SAS, IBM SPSS, RStudio, Matlab, Zementis)
  - Presentation Layer (e.g. Tableau, Jaspersoft BI Software, TIBCO Spotfire's BI Dashboard)

Integration with R at Data Layer

• Alternative ways of integrating R with the database
  - Outside-in: R connects to the database using JDBC/ODBC and retrieves (pulls) the data to be analyzed from the database
  - Inside-out: data is transferred (pushed) to R from within the database, and the aggregated and/or analyzed result sets are sent back from R to the database
  - Embedded: the R environment (components) and/or execution is made an integral part of the core DBMS
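The difference in data movement between pulling data out and keeping the work next to the data can be sketched as follows. This is an analogy rather than the R integrations themselves: a Python client plays the role of R, the built-in sqlite3 module stands in for the DBMS, and the `measurements` table and its contents are hypothetical.

```python
import sqlite3
from statistics import mean

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (grp TEXT, value REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [("a", 1.0), ("a", 3.0), ("b", 10.0), ("b", 20.0)],
)

# Outside-in style: the client pulls every row over the connection
# and performs the analysis in its own memory.
pulled = conn.execute("SELECT grp, value FROM measurements").fetchall()
groups = {}
for grp, value in pulled:
    groups.setdefault(grp, []).append(value)
outside_in = {grp: mean(vals) for grp, vals in groups.items()}

# In-database style: the computation runs where the data lives and
# only the aggregated result crosses the connection.
in_db = dict(
    conn.execute("SELECT grp, AVG(value) FROM measurements GROUP BY grp")
)
conn.close()

print(len(pulled), "rows moved outside-in;", len(in_db), "rows moved in-database")
```

Both routes give the same answer; what differs is how many rows leave the database, which grows with the table in the first case and with the size of the result in the second.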

Different DBMS Architectures w.r.t. R

• Integrations/Architectural Arrangements

DBMS             Embedded                                           Outside-in/Inside-out
Oracle           YES: ORE Server                                    YES: ROracle, JDBC
PostgreSQL       YES: PLR                                           YES: RPostgres, RODBC
Sybase RAP       YES: RAP Store - UDF (C, C++)                      YES: RJDBC
SQL Server       NO: But CLR, Ext Proc                              YES: RODBC
DB2              NO: But CLR, Ext Proc UDF (C, C++, Java, COBOL)    YES: RJDBC, RODBC
Cloudera Impala  NO                                                 YES: ODBC, JDBC
SAP HANA         NO                                                 YES: RODBC, RJDBC, RHANA

Methodology

• Introduction
• The Problem
• State of the art (Existing systems review)
• Methodology
  - Research Approach
  - Research Design
  - Data Used
• Empirical Work
• The Results
• Conclusions

Methodology

• Research Approach
  - Quantitative (experimental) research approach
  - Need to collect numeric performance data
  - Carry out various kinds of numeric-based analyses
• Research Design
  - Adopted and adapted R Benchmark 2.5 [3] and Revolution RevoR Enterprise Benchmark [2]
  - Tests designed for stand-alone R and for R + Oracle, PostgreSQL, DB2, SQL Server
  - Tests: 3 categories of performance tests: Matrix Calculation, Matrix Functions and Program Control

Data Used

• Input data generated in R; the data set consists of
  - a two-dimensional array of floating-point numbers
  - 1,000 observations (cols) by 16,000 variables (rows)
• Used a stochastic process, Brownian Motion
• Where:
  - Xi (the series) then stands in for the Brownian Motion
  - Yn is a sequence of k normally distributed elements

Layout of MatrixX (variables as rows, observations as columns):

MatrixX    obs1  obs2  ...  obs1000
var1
var2
var3
...
var16000

Brownian motion series:

$$X_i = \sum_{n=1}^{i} Y_n$$
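A minimal sketch of how such a dataset could be generated is shown below. The thesis generated its input in R; this Python version only illustrates the idea of building each series as a running sum of normally distributed increments. The function name `brownian_matrix`, the seed, the use of standard-normal increments, the choice of one independent series per variable (row), and the reduced dimensions are all assumptions made for the example.

```python
import random
from itertools import accumulate


def brownian_matrix(n_vars, n_obs, seed=0):
    """Return an n_vars x n_obs matrix whose rows are discrete Brownian paths.

    Each row is the cumulative sum of independent N(0, 1) increments,
    so entry [v][i] is the sum of the first i + 1 increments of row v.
    """
    rng = random.Random(seed)  # fixed seed so the data is reproducible
    matrix = []
    for _ in range(n_vars):
        increments = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
        matrix.append(list(accumulate(increments)))
    return matrix


# Small dimensions for illustration; the slides use 16,000 x 1,000.
matrix_x = brownian_matrix(n_vars=16, n_obs=10)
print(len(matrix_x), "variables x", len(matrix_x[0]), "observations")
```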

Empirical Work

• Introduction
• The Problem
• State of the art (Existing systems review)
• Methodology
• Empirical Work
  - Benchmark tests
  - Experimental design
  - Measurements & controls
• The Results
• Conclusions

Experimental Setup: R+SQL Server

• Traditional (RODBC) integration
  - Installed SQL Server 2012 (64-bit)
  - Installed open-source R 2.13.2 (64-bit client)
  - Installed the RODBC package from RGui
• Common Language Runtime (CLR) integration: CLR stored procedures are .NET objects which run in database memory
  - Created the usual R script files
  - Developed a C# CLR procedure with embedded R; compiled it to get a DLL
  - Enabled the CLR integration feature of SQL Server
  - Created an assembly from the DLL, and stored procedures (SPs) which reference the assembly
  - Ran the stored procedures with the R script files as input

The Results

• Introduction
• The Problem
• State of the art (Existing systems review)
• Methodology
• Empirical Work
• The Results
  - MC, MF, PC, Overall benchmarks
  - Implications of the results and findings
  - Which integration architecture works well?
• Conclusions

Empirical Results

• Average overall benchmark results:

[Figure: OVERALL Performance. Bar chart of normalised run-time (x-axis, 0 to 180) per System/+R (y-axis): Stand-Alone R, PostgreSQL, Oracle, DB2, SQL Server.]

Empirical Results

• Matrix Calculation (MC) benchmark results

[Figure: Matrix Calculation Performance (MC). Bar chart (x-axis, 0 to 60) per System/+R (y-axis): Stand-Alone R, PostgreSQL, Oracle, DB2, SQL Server.]

Empirical Results

• Matrix Function (MF) benchmark results

[Figure: Matrix Function Performance (MF). Bar chart (x-axis, 0 to 90) per System/+R (y-axis): Stand-Alone R, PostgreSQL, Oracle, DB2, SQL Server.]

Empirical Results

• Program Control (PC) benchmark results

Benchmark  Stand-Alone R  PostgreSQL  Oracle  DB2   SQL Server
PC01       2.71           2.78        2.77    2.70  2.77
PC02       0.30           0.40        0.31    0.28  0.29
PC03       0.63           0.60        0.43    0.65  0.66
PC04       0.51           0.50        0.51    0.53  0.51
PC05       0.38           0.38        0.36    0.38  0.38

KEY:
PC01: Fibonacci numbers; ctrl flow
PC02: Hilbert matrix; ctrl flow
PC03: gcd2; recursive
PC04: Toeplitz matrix; looping
PC05: Escoufier’s method on matrix

Empirical Results

• Paired t-test on PC results (PostgreSQL, Oracle)

                              Variable 1    Variable 2
Mean                          0.932         0.876
Variance                      1.07492       1.12668
Observations                  5             5
Pearson Correlation           0.997786692
Hypothesized Mean Difference  0
df                            4
t Stat                        1.691541861
P(T<=t) one-tail              0.082996265
t Critical one-tail           2.131846782
P(T<=t) two-tail              0.16599253
t Critical two-tail           2.776445105

The mean performance difference (M = 0.06, SD = 0.074, N = 5) was not significantly greater than zero, t(4) = 1.69, two-tail p = 0.166, providing evidence that there is no considerable difference in the performance of the two DBMSs.
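The test statistic can be recomputed directly from the PostgreSQL and Oracle columns of the PC table. The sketch below uses only the Python standard library; the helper name `paired_t` is an assumption for this example, and the critical value is the two-tail figure reported in the table rather than one computed here.

```python
from math import sqrt
from statistics import mean, stdev

# PC01..PC05 run-times taken from the Program Control results table.
postgresql = [2.78, 0.40, 0.60, 0.50, 0.38]
oracle = [2.77, 0.31, 0.43, 0.51, 0.36]


def paired_t(x, y):
    """Return (mean difference, SD of differences, t statistic, df)."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    d_mean = mean(diffs)
    d_sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    t_stat = d_mean / (d_sd / sqrt(n))
    return d_mean, d_sd, t_stat, n - 1


d_mean, d_sd, t_stat, df = paired_t(postgresql, oracle)
T_CRITICAL_TWO_TAIL = 2.776445105  # as reported in the table for df = 4

print(f"M = {d_mean:.3f}, SD = {d_sd:.3f}, t({df}) = {t_stat:.2f}")
print("significant at 5%:", abs(t_stat) > T_CRITICAL_TWO_TAIL)
```

Because |t| stays below the reported critical value, the null hypothesis of zero mean difference is not rejected, in line with the slide's conclusion.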

Empirical Results

• Scalability benchmark results

[Figure: run-times for benchmarks MC1 to MC5 and MF1 to MF8; series dTimes1-4m-r and dTimes1-4mc-ore; y-axis 0 to 30.]
Oracle shows a slight scalability edge over stand-alone R for small datasets.

[Figure: run-times for benchmarks MC1 to MC5 and MF1 to MF8; series dTimes4-16mc-r and dTimes4-16mc-ore; y-axis 0 to 200.]
Stand-alone R is overwhelmed by large datasets; that is when the edge of R+DBMS shows.

Empirical Results Reliability

• Average vs. minimum results: about the same

[Figure: Avg OVERALL Performance. Bar chart (x-axis, 0 to 200) per System/+R (y-axis): Stand-Alone R, PostgreSQL, Oracle, DB2, SQL Server.]

[Figure: Min OVERALL Performance. Bar chart (x-axis, 0 to 200) per System/+R (y-axis): Stand-Alone R, PostgreSQL, Oracle, DB2, SQL Server.]

• Performance patterns observed remain exactly the same
• No significant variations in actual recorded values

Why does R+PostgreSQL perform poorly?

• PostgreSQL performed well in tests with less data
  - Timing the retrieval of database-resident data as a matrix:

Retrieving DB data  Run 1  Run 2  Run 3  Run 4  Run 5  Run 6  Run 7  Total   Average
Oracle              0.15   0.14   0.14   0.14   0.14   0.15   0.14   1.00    0.14
PostgreSQL          20.12  19.05  19.12  19.12  19.03  19.11  19.12  134.67  19.24

• Direct row fetch (SELECT * FROM stockHist)
  - Oracle (4.51 sec) is 2.66 times faster than PostgreSQL (12.02 sec)
• Implication:
  - The poor performance of PostgreSQL-coupled R is not exclusively a consequence of the implementation but also of the database itself (data retrieval/fetching)
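The totals, averages and ratios in this comparison follow from the seven recorded runs. The short sketch below reproduces that arithmetic; the values are copied from the table and the direct-fetch bullet, while the helper name `summarise` is an assumption for this example.

```python
from statistics import mean

# Seven retrieval timings per system, as recorded in the table.
oracle_runs = [0.15, 0.14, 0.14, 0.14, 0.14, 0.15, 0.14]
postgresql_runs = [20.12, 19.05, 19.12, 19.12, 19.03, 19.11, 19.12]


def summarise(runs):
    """Return (total, average) of the runs, rounded to two decimals."""
    return round(sum(runs), 2), round(mean(runs), 2)


oracle_total, oracle_avg = summarise(oracle_runs)
pg_total, pg_avg = summarise(postgresql_runs)

# Ratio of average matrix-retrieval times, and of the direct row fetch.
matrix_ratio = mean(postgresql_runs) / mean(oracle_runs)
direct_fetch_ratio = 12.02 / 4.51

print(f"Oracle: total {oracle_total}, average {oracle_avg}")
print(f"PostgreSQL: total {pg_total}, average {pg_avg}")
print(f"matrix retrieval ratio: {matrix_ratio:.1f}")
print(f"direct fetch ratio: {direct_fetch_ratio:.2f}")
```

The gap is far larger when the data is retrieved as a matrix than for the plain row fetch, which is what points to the retrieval path rather than only the R coupling.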

Why does R+Oracle perform well?

• In-database statistics engine
• Storing of R scripts in the database
• Capability of spawning multiple R engine instances
• Efficient data retrieval and passing (rqTableEval, rqRowEval)

[Figure: Architecture of Oracle R Enterprise. Adapted from [9]]

Conclusions

• Introduction
• The Problem
• State of the art (Existing systems review)
• Methodology
• Empirical Work
• The Results
• Conclusions
  - Lessons learnt
  - Future Studies
  - Final Words

Implications for the Research Questions

• Growing level of development of R+DBMS
  - Most capabilities of stand-alone R are obtainable in R+DBMS
• Better performance with R+DBMS
  - Provided data is efficiently retrieved and passed
  - R is still competitive in less data-intensive analytics
• Architectural implications
  - Positioning of the analytic engine w.r.t. the database
  - Existence of native SQL equivalents of analytic functions
  - Extent of exploitation of database parallelism and scalability
• Lessons, going forward
  - The technique for retrieving and passing data makes a huge impact, regardless of how fast the substantive analytics runs

Future Studies

• Benchmark in-memory, column-oriented and document-oriented databases
• Study effective and efficient data access by analytic functions
• Compare performance gains: R+RDBMS vs. R+NoSQL databases
• Maximum gain from parallelism and scalability of R+DBMSs
• Benchmark on different operating systems and varying data amounts
• Benchmark on datasets with varied attribute properties

Conclusions (final words)

• In recommending R+DBMS, architectures must
  - Facilitate efficient retrieval and passing of data from the database objects (tables, views, procedures, etc.) to the analytic functions;
  - Lessen or eliminate data movement and reduce other run-time overheads;
  - Maintain data security (the C.I.A. of data);
    - C.I.A. = Confidentiality, Integrity and Availability
  - Reduce development overheads for introducing new and maintaining existing analytics techniques in the DBMSs

References

1. Robert Kabacoff. R in Action: Data analysis and graphics with R. Manning Publications Co., 2011.

2. Revolution Analytics. Revolution RevoR Enterprise Benchmark Details: Benchmark Scripts. URL: http://www.revolutionanalytics.com/revolution-revor-enterprise-benchmark-details.

3. Philippe Grosjean. R Benchmark 2.5. 2008. URL: http://r.research.att.com/benchmarks/R-benchmark-25.R.

4. Edwin Grappin. Generate stock option prices - How to simulate a Brownian motion. Blog. 2012. URL: http://probaperception.blogspot.com.es/2012/10/generate-stock-option-prices-how-to.html.

5. Mark Hornick and Tim Vlamis. Oracle R Enterprise Hands-on Lab. [Online; accessed 12-March-2014]. ORACLE, 2014. URL: http://www.vlamis.com/storage/papers/Hornick%20-%20ORE%20Hands-On%20Lab.pdf.

6. David Smith. R integrated throughout the enterprise analytics stack. Blog. 2012. URL: http://blog.revolutionanalytics.com/2012/02/r-in-the-enterprise.html/.

7. Elaine Chen. Using R and Tableau. Tech. rep. 2013. Also available as http://www.tableausoftware.com/sites/default/files/media/using-r-and-tableau-software_0.pdf.

8. CodePlex (Microsoft). R.NET. URL: http://rdotnet.codeplex.com/.

9. Mark Hornick and Tim Vlamis. Oracle R Enterprise Hands-on Lab. [Online; accessed 12-March-2014]. ORACLE, 2014. URL: http://www.vlamis.com/storage/papers/Hornick%20-%20ORE%20Hands-On%20Lab.pdf.

10. Bharat Sukhwani, Hong Min, Mathew Thoennes, Parijat Dube, Balakrishna Iyer, Bernard Brezzo, Donna Dillenberger, and Sameh Asaad. “Database analytics acceleration using FPGAs”. In: Proceedings of the 21st international conference on Parallel architectures and compilation techniques. ACM. 2012, pp. 411–420.

11. Joseph M Hellerstein, Christopher Ré, Florian Schoppmann, Daisy Zhe Wang, Eugene Fratkin, Aleksander Gorajek, Kee Siong Ng, Caleb Welton, Xixuan Feng, Kun Li, and Arun Kumar. “The MADlib analytics library: or MAD skills, the SQL”. In: Proceedings of the VLDB Endowment 5.12 (2012), pp. 1700–1711.

12. Xixuan Feng, Arun Kumar, Benjamin Recht, and Christopher Ré. “Towards a unified architecture for in-RDBMS analytics”. In: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. ACM. 2012, pp. 325–336.

13. Stefan Steinhaus. Comparison of mathematical programs for data analysis. 1999.

14. Donald Knuth. Comparison of mathematical programs for data analysis. 2008. URL: http://www.scientificweb.com/ncrunch/ncrunch5.pdf.

Thank you
