Coupling Databases and Advanced Analytical Tools (R)
Master Thesis Presentation [IT4BI]
Student: Sedem Seakomo, FIB-UPC (BarcelonaTech)
saviour.sedem.kofi.seakomo@est.fib.upc.edu
Supervisor: Prof. Alberto Abelló (PhD), FIB-UPC (BarcelonaTech)
Outline
• Introduction
• The Problem
• State of the art (Existing systems review)
• Methodology
• Empirical Work
• The Results
• Conclusions
Introduction
• Introduction
  - Background & Motivation
  - Research Questions
  - Scope (Delimitations & Limitations)
  - Importance & Contribution
  - Related Work
• The Problem
• State of the art (Existing systems review)
• Methodology
• Empirical Work
• The Results
• Conclusions
Introduction
• SQL/relational DBMSs are powerful systems!
  - Managing, querying, and aggregating data
• But what about complex analytics? Not really!
  - Inferences, predictions, subtle relationships in data
• In spite of this, organizations still house large amounts of data in various SQL/RDBMSs
• So what do we do for complex analytics?
Introduction
• Objective:
  - Examine the level of development of R+DBMS integration
  - Assess the performance, scalability and completeness of R+DBMS integration
• Motivation:
  - New industry (analytics industry)
    - Developments on the analytics front: gleaning information and insights from data is now an industry in itself
  - Data mining (complex analytics)
    - Increasing relevance of data mining (to be driven by complex analytics) for revealing valuable insights from data
Research Questions
• What is the current level of development (completeness) of the integration of R with DBMSs?
• How does the performance of coupling databases with an advanced analytical tool (R) compare with that of the stand-alone analytical tool (R)?
• How does the scalability of coupling databases with an advanced analytical tool (R) compare with that of the stand-alone analytical tool (R)?
• Which inherent properties of the R integration architectures impact performance?
• Are there any lessons to be learnt on the way forward?
Scope (delimitations & limitations)
• Focused on benchmarking the performance, scalability and completeness of selected DBMS+R integrations
• Benchmarks mainly cover the matrix operations that form the core of advanced analytics
• Benchmarking of intra-command parallelism was not covered
• Focused on coupling R with RDBMSs (Oracle, Postgres, DB2 and SQL Server); non-relational (NoSQL) databases were not covered
• Focused on directly coupling R at the data layer (not at the analytics layer and/or presentation layer)
Introduction
• Contributions:
  - Better performance is achievable by coupling databases with advanced analytical tools (R)
  - This approach is recommended for complex analytics involving significant amounts of data
  - Architectures in which more analytic functions have equivalent native SQL counterparts executable in-database produce the best performance results
  - Caveat: data used in the analytic process must be efficiently retrieved and passed to the analytic functions; otherwise performance worsens
Introduction
• Related work:
  - Analytics and databases:
    - Database Analytics Acceleration using FPGAs [10]: evaluates expensive analytics queries while saving CPU resources
    - The MADlib Analytics Library or MAD skills, the SQL [11]: introduces an open-source library of in-database analytic methods, with SQL-based algorithms for machine learning, data mining and statistics inside the database engine
    - Towards a Unified Architecture for in-RDBMS Analytics [12]: presents a unified architecture for in-RDBMS analytics, with emphasis on faster implementation of new statistical techniques in the RDBMS
  - Performance benchmark studies w.r.t. R:
    - By Philippe Grosjean [3], Stefan Steinhaus [13], Donald Knuth [14]
    - Centered on comparing the performance of versions of R implementations, of R with and without certain packages, and of R against other analytical tools
  - But our work:
    - A performance study of R+DBMS vs. stand-alone R
The Problem
• Introduction
• The Problem
  - Advanced analytical tools
  - Database Management Systems
  - Bringing the “two worlds” together
  - Thesis statement (Hypothesis) declaration
• State of the art (Existing systems review)
• Methodology
• Empirical Work
• The Results
• Conclusions
Advanced Analytical Tools
• Inclined towards linear algebra
• Up-side:
  - Statistical software provides rich and very advanced analytical functionality for data analysis and modelling
• Down-side:
  - Can handle only limited amounts of data
  - Example: some packages (base R and IBM SPSS) operate entirely in main memory
Database Management Systems
• Founded on relational algebra (RDBMS)
• Up-side:
  - DBMSs can store and process large amounts of data
• Down-side:
  - But they provide insufficient analytical functionality
  - SQL simulations of linear algebra operations
    - will often result in abysmal I/O and CPU performance
    - are cumbersome for linear algebra operations with iterations
    - are hard to follow and make code maintenance expensive
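To make the point about SQL simulations of linear algebra concrete, here is a minimal sketch of a matrix product written as a join plus aggregation. It uses Python's built-in sqlite3 module as a stand-in for an RDBMS; the table names `a` and `b`, the (i, j, val) layout and the helper names are assumptions made for illustration, not taken from the thesis.

```python
import sqlite3


def matmul_sql(a, b):
    """Multiply two matrices by simulating the product in SQL."""
    conn = sqlite3.connect(":memory:")
    # Each matrix is stored relationally as (row index, column index, value).
    conn.execute("CREATE TABLE a (i INTEGER, j INTEGER, val REAL)")
    conn.execute("CREATE TABLE b (i INTEGER, j INTEGER, val REAL)")
    for name, matrix in (("a", a), ("b", b)):
        conn.executemany(
            f"INSERT INTO {name} VALUES (?, ?, ?)",
            [(i, j, v) for i, row in enumerate(matrix) for j, v in enumerate(row)],
        )
    # The product needs a join on the shared dimension plus a GROUP BY.
    rows = conn.execute(
        "SELECT a.i, b.j, SUM(a.val * b.val) "
        "FROM a JOIN b ON a.j = b.i "
        "GROUP BY a.i, b.j ORDER BY a.i, b.j"
    ).fetchall()
    conn.close()
    out = [[0.0] * len(b[0]) for _ in a]
    for i, j, v in rows:
        out[i][j] = v
    return out


def matmul_direct(a, b):
    """The same product written directly as linear algebra."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
product_sql = matmul_sql(A, B)
product_direct = matmul_direct(A, B)
```

Both functions return the same matrix, but the SQL version has to reshape the data into triples, join, aggregate and reshape back, which hints at why iterative linear algebra written this way becomes hard to maintain.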
Bringing the “two worlds” together
• We have a case at hand!
• So, how do we bridge the gap?
[Figure: Advanced Statistical Packages (linear algebra) on one side, Database Management Systems (relational algebra) on the other]
Bringing the “two worlds” together
• Solution: synergy!
• Employ extended RDBMS features to power the embedded/integrated/coupled execution of R.
Bringing the “two worlds” together
• Has the following advantages:
  - Averts the performance problems associated with abusing SQL (relational algebra operations) for advanced analytics (linear algebra operations)
  - Synergizes the robust data management capabilities of DBMSs with the rich statistical functionality of analytical tools
  - Gains the benefits (performance + security) of taking the algorithms (processing logic) to the data rather than the data to the algorithms
Thesis Statement (Hypothesis)
Coupling databases and advanced analytical tools (R) leads to better and enhanced analytic performance than stand-alone advanced analytical tools (R)
State of the art
• Introduction
• The Problem
• State of the art (Existing systems review)
  - Advanced analytical tool R
  - Different DBMS architectures of R implementation
  - Choice of DBMS for empirical study
• Methodology
• Empirical Work
• The Results
• Conclusions
Integration with R
• At three layers within the analytic stack
  - Data Layer (e.g. Oracle R Enterprise, Sybase RAP, SAP HANA, IBM Netezza)
  - Analytics Layer (e.g. SAS, IBM SPSS, RStudio, Matlab, Zementis)
  - Presentation Layer (e.g. Tableau, Jaspersoft BI Software, TIBCO Spotfire's BI Dashboard)
Integration with R at Data Layer
• Alternative ways of integrating R with a database
  - Outside-in: R connects to the database using JDBC/ODBC and retrieves (pulls) the data to be analyzed from the database
  - Inside-out: data is transferred (pushed) to R from within the database, and the aggregated and/or analyzed result sets are sent back from R to the database
  - Embedded: the R environment (components) and/or its execution is made an integral part of the core DBMS
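The difference in data movement between pulling data out and computing next to the data can be sketched as follows. This is only an analogy under stated assumptions: Python's sqlite3 stands in for both the client tool and the DBMS, the table `obs` and the function names are hypothetical, and a built-in SQL aggregate stands in for analytics executed inside the database.

```python
import sqlite3


def make_db(values):
    """Create an in-memory database holding one numeric column."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE obs (val REAL)")
    conn.executemany("INSERT INTO obs VALUES (?)", [(v,) for v in values])
    return conn


def mean_outside_in(conn):
    """Outside-in style: pull every row to the client, then analyze there."""
    rows = conn.execute("SELECT val FROM obs").fetchall()
    vals = [r[0] for r in rows]
    return sum(vals) / len(vals), len(rows)


def mean_in_database(conn):
    """In-database style: the engine computes, only the result moves."""
    rows = conn.execute("SELECT AVG(val) FROM obs").fetchall()
    return rows[0][0], len(rows)


conn = make_db([1.0, 2.0, 3.0, 4.0])
pulled_mean, pulled_rows = mean_outside_in(conn)
pushed_mean, pushed_rows = mean_in_database(conn)
```

Both calls give the same mean, but the first transfers one row per observation while the second transfers a single row, which is the overhead the embedded architectures try to avoid.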
Different DBMS Architectures w.r.t. R
• Integrations/Architectural Arrangements
DBMS            | Embedded                                             | Outside-in/Inside-out
Oracle          | YES: ORE Server                                      | YES: ROracle, JDBC
PostgreSQL      | YES: PL/R                                            | YES: RPostgres, RODBC
Sybase RAP      | YES: RAP Store, UDF (C, C++)                         | YES: RJDBC
SQL Server      | NO: but CLR, Ext Proc                                | YES: RODBC
DB2             | NO: but CLR, Ext Proc, UDF (C, C++, Java, COBOL)     | YES: RJDBC, RODBC
Cloudera Impala | NO                                                   | YES: ODBC, JDBC
SAP HANA        | NO                                                   | YES: RODBC, RJDBC, RHANA
Methodology
• Introduction
• The Problem
• State of the art (Existing systems review)
• Methodology
  - Research Approach
  - Research Design
  - Data Used
• Empirical Work
• The Results
• Conclusions
Methodology
• Research Approach
  - Quantitative (experimental) research approach
  - Need to collect numeric performance data
  - Carry out various kinds of numeric analyses
• Research Design
  - Adopted and adapted R Benchmark 2.5 [3] and Revolution RevoR Enterprise Benchmark [2]
  - Tests designed for stand-alone R and for R + Oracle, PostgreSQL, DB2 and SQL Server
  - Tests: 3 categories of performance tests: Matrix Calculation, Matrix Functions and Program Control
Data Used
• Input data generated in R; the data set consists of
  - a two-dimensional array of floating-point numbers
  - 1,000 observations (columns) by 16,000 variables (rows)
• Used a stochastic process, Brownian motion
• Where:
  - X_i (the series) stands in for the Brownian motion
  - Y_n is a sequence of k normally distributed variables
Matrix X layout: rows var1, var2, var3, ..., var16000; columns obs1, obs2, ..., obs1000

$$X_i^{(k)} = \sum_{n=1}^{i} Y_n^{(k)}$$
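The data-generation idea on this slide, a series built up from normally distributed elements, can be sketched as a cumulative sum of Gaussian increments. The thesis generated its data in R; this Python version is only an illustration, and the standard-normal increments, the seed, the function name `brownian_matrix` and the reduced 16 x 10 size are assumptions.

```python
import random
from itertools import accumulate


def brownian_matrix(n_vars, n_obs, seed=0):
    """Build an n_vars x n_obs matrix whose rows are random-walk series.

    Each row is the running sum of independent N(0, 1) increments
    (an assumed parameterization of the Brownian-motion stand-in).
    """
    rng = random.Random(seed)
    increments = [
        [rng.gauss(0.0, 1.0) for _ in range(n_obs)] for _ in range(n_vars)
    ]
    series = [list(accumulate(row)) for row in increments]
    return increments, series


# Small stand-in for the 16,000 x 1,000 matrix used in the benchmarks.
increments, X = brownian_matrix(n_vars=16, n_obs=10)
```

Calling `brownian_matrix(16000, 1000)` would give the dimensions quoted on the slide, at the cost of holding 16 million floats in memory.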
Empirical Work
• Introduction
• The Problem
• State of the art (Existing systems review)
• Methodology
• Empirical Work
  - Benchmark tests
  - Experimental design
  - Measurements & controls
• The Results
• Conclusions
Experimental Setup: R + SQL Server
• Traditional (RODBC) integration
  - Installed SQL Server 2012 (64-bit)
  - Installed open-source R 2.13.2 (64-bit client)
  - Installed the RODBC package from RGui
• Common Language Runtime (CLR) integration: CLR stored procedures are .NET objects which run in database memory
  - Created the usual R script files
  - Developed a C# CLR procedure with embedded R; compiled it to get a DLL
  - Enabled the CLR integration feature of SQL Server
  - Created an assembly from the DLL, and stored procedures which reference the assembly
  - Ran the stored procedures with the R script files as input
The Results
• Introduction
• The Problem
• State of the art (Existing systems review)
• Methodology
• Empirical Work
• The Results
  - MC, MF, PC, Overall benchmarks
  - Implications of the results and findings
  - Which integration architecture works well?
• Conclusions
Empirical Results
• Average overall benchmark results:
[Figure: OVERALL Performance. Bar chart of run-time (normalised, axis 0 to 180) per system: Stand-Alone R, PostgreSQL, Oracle, DB2, SQL Server]
Empirical Results
• Matrix Calculation (MC) benchmark results
[Figure: Matrix Calculation Performance (MC). Bar chart of run-time (normalised, axis 0 to 60) per system: Stand-Alone R, PostgreSQL, Oracle, DB2, SQL Server]
Empirical Results
• Matrix Function (MF) benchmark results
[Figure: Matrix Function Performance (MF). Bar chart of run-time (normalised, axis 0 to 90) per system: Stand-Alone R, PostgreSQL, Oracle, DB2, SQL Server]
Empirical Results
• Program Control (PC) benchmark results
Benchmark | Stand-Alone R | PostgreSQL | Oracle | DB2  | SQL Server
PC01      | 2.71          | 2.78       | 2.77   | 2.70 | 2.77
PC02      | 0.30          | 0.40       | 0.31   | 0.28 | 0.29
PC03      | 0.63          | 0.60       | 0.43   | 0.65 | 0.66
PC04      | 0.51          | 0.50       | 0.51   | 0.53 | 0.51
PC05      | 0.38          | 0.38       | 0.36   | 0.38 | 0.38
KEY:
PC01: Fibonacci numbers; control flow
PC02: Hilbert matrix; control flow
PC03: gcd2; recursive
PC04: Toeplitz matrix; looping
PC05: Escoufier’s method on matrix
Empirical Results
• Paired t-test on PC results (Variable 1 = PostgreSQL, Variable 2 = Oracle)
Statistic                    | Variable 1  | Variable 2
Mean                         | 0.932       | 0.876
Variance                     | 1.07492     | 1.12668
Observations                 | 5           | 5
Pearson Correlation          | 0.997786692 |
Hypothesized Mean Difference | 0           |
df                           | 4           |
t Stat                       | 1.691541861 |
P(T<=t) one-tail             | 0.082996265 |
t Critical one-tail          | 2.131846782 |
P(T<=t) two-tail             | 0.16599253  |
t Critical two-tail          | 2.776445105 |
The mean performance difference (M = 0.06, SD = 0.074, N = 5) was not significantly different from zero, t(4) = 1.69, two-tailed p = 0.166, so the test gives no evidence of a considerable difference between the performances of the two DBMSs.
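The paired t-test above can be reproduced from the PostgreSQL and Oracle columns of the Program Control results. The sketch below uses only the Python standard library; the function name `paired_t` is an illustrative choice, and the critical value is copied from the slide rather than computed.

```python
import math
from statistics import mean, stdev

# PC01..PC05 run-times from the Program Control results table.
postgresql = [2.78, 0.40, 0.60, 0.50, 0.38]
oracle = [2.77, 0.31, 0.43, 0.51, 0.36]


def paired_t(x, y):
    """Return mean difference, SD of differences, t statistic and df."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    m = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    t = m / (sd / math.sqrt(n))
    return m, sd, t, n - 1


mean_diff, sd_diff, t_stat, df = paired_t(postgresql, oracle)

# Two-tailed critical value at the 5% level for df = 4, as reported on the slide.
T_CRIT_TWO_TAIL = 2.776445105
significant = abs(t_stat) > T_CRIT_TWO_TAIL
```

With these inputs the t statistic falls below the critical value, matching the reported conclusion that the difference is not significant.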
Empirical Results
• Scalability benchmark results
[Figure: run-times per benchmark (MC1 to MC5, MF1 to MF8), axis 0 to 30; series dTimes1-4m-r and dTimes1-4mc-ore]
  - Oracle shows a slight scalability edge over stand-alone R for small datasets
[Figure: run-times per benchmark (MC1 to MC5, MF1 to MF8), axis 0 to 200; series dTimes4-16mc-r and dTimes4-16mc-ore]
  - Stand-alone R is overwhelmed by large datasets; that is where the edge of R+DBMS shows
Empirical Results Reliability
• Average vs. minimum results: about the same
[Figure: Avg OVERALL Performance and Min OVERALL Performance. Two bar charts of run-time (normalised, axis 0 to 200) per system: Stand-Alone R, PostgreSQL, Oracle, DB2, SQL Server]
• Performance patterns observed remain exactly the same
• No significant variations in actual recorded values
Why does R+PostgreSQL perform badly?
• Postgres performed well in tests with less data
  - Timing the retrieval of database-resident data as a matrix:
Retrieving DB data | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Run 6 | Run 7 | Total  | Average
Oracle             | 0.15  | 0.14  | 0.14  | 0.14  | 0.14  | 0.15  | 0.14  | 1.00   | 0.14
PostgreSQL         | 20.12 | 19.05 | 19.12 | 19.12 | 19.03 | 19.11 | 19.12 | 134.67 | 19.24
• Direct row fetch (SELECT * FROM stockHist)
  - Oracle (4.51 sec) is about 2.7 times faster than PostgreSQL (12.02 sec)
• Implication:
  - The poor performance of PostgreSQL-coupled R is not exclusively a consequence of the integration implementation but also of the database itself (data retrieval/fetching)
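The retrieval measurement reported above (seven timed runs, then total and average) can be sketched with a small timing harness. Everything concrete here is an assumption for illustration: sqlite3 stands in for the benchmarked DBMSs, and the table `stock_hist`, its columns, its 50 x 20 size and the helper names are hypothetical.

```python
import sqlite3
import time

N_VARS, N_OBS = 50, 20  # small stand-in dimensions


def time_runs(fn, n_runs=7):
    """Time fn n_runs times; return per-run times, total and average."""
    runs = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        runs.append(time.perf_counter() - start)
    total = sum(runs)
    return runs, total, total / n_runs


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stock_hist (var_id INTEGER, obs_id INTEGER, val REAL)")
conn.executemany(
    "INSERT INTO stock_hist VALUES (?, ?, ?)",
    [(v, o, float(v * o)) for v in range(N_VARS) for o in range(N_OBS)],
)


def fetch_as_matrix():
    """Retrieve the database-resident data and reshape it into a matrix."""
    rows = conn.execute("SELECT var_id, obs_id, val FROM stock_hist").fetchall()
    matrix = [[0.0] * N_OBS for _ in range(N_VARS)]
    for v, o, val in rows:
        matrix[v][o] = val
    return matrix


runs, total, average = time_runs(fetch_as_matrix)
```

Timing the fetch-and-reshape step on its own, separately from the analytics, is what lets retrieval cost be distinguished from computation cost.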
Why does R+Oracle work well?
• In-database statistics engine
• Storing of R scripts in the database
• Capability of spawning multiple R engine instances
• Efficient data retrieval and passing (rqTableEval, rqRowEval)
[Figure: Architecture of Oracle R Enterprise. Adapted from [9]]
Conclusions
• Introduction
• The Problem
• State of the art (Existing systems review)
• Methodology
• Empirical Work
• The Results
• Conclusions
  - Lessons learnt
  - Future Studies
  - Final Words
Implications to Research Questions
• Growing level of development of R+DBMS
  - Most capabilities of stand-alone R are obtainable in R+DBMS
• Better performance with R+DBMS
  - Provided data is efficiently retrieved and passed
  - R is still competitive in less data-intensive analytics
• Architectural implications
  - Positioning of the analytic engine w.r.t. the database
  - Existence of native SQL equivalents of analytic functions
  - Extent of exploitation of database parallelism and scalability
• Lessons, going forward
  - The technique for retrieving and passing data has a huge impact, regardless of how fast the substantive analytics run
Future Studies
• Benchmark in-memory, column-oriented and document-oriented databases
• Study effective and efficient data access by analytic functions
• Compare performance gains: R+RDBMS vs. R+NoSQL databases
• Maximum gain from parallelism and scalability of R+DBMSs
• Benchmark on different operating systems and varying data amounts
• Benchmark on datasets with varied attribute properties
Conclusions (final words)
• In recommending R+DBMS, architectures must
  - Facilitate efficient retrieval and passing of data from the database objects (tables, views, procedures, etc.) to the analytic functions;
  - Lessen or eliminate data movement and reduce other run-time overheads;
  - Maintain data security (the C.I.A. of data: Confidentiality, Integrity and Availability);
  - Reduce development overheads for introducing new analytics techniques and maintaining existing ones in the DBMSs
References
1. Robert Kabacoff. R in Action: Data analysis and graphics with R. Manning Publications Co., 2011.
2. Revolution Analytics. Revolution RevoR Enterprise Benchmark Details: Benchmark Scripts. URL: http://www.revolutionanalytics.com/revolution-revor-enterprise-benchmark-details.
3. Philippe Grosjean. R Benchmark 2.5. 2008. URL: http://r.research.att.com/benchmarks/R-benchmark-25.R.
4. Edwin Grappin. Generate stock option prices - How to simulate a Brownian motion. Blog. 2012. URL: http://probaperception.blogspot.com.es/2012/10/generate-stock-option-prices-how-to.html.
5. Mark Hornick and Tim Vlamis. Oracle R Enterprise Hands-on Lab. [Online; accessed 12-March-2014]. ORACLE, 2014. URL: http://www.vlamis.com/storage/papers/Hornick%20-%20ORE%20Hands-On%20Lab.pdf.
6. David Smith. R integrated throughout the enterprise analytics stack. Blog. 2012. URL: http://blog.revolutionanalytics.com/2012/02/r-in-the-enterprise.html/.
7. Elaine Chen. Using R and Tableau. Tech. rep. 2013. URL: http://www.tableausoftware.com/sites/default/files/media/using-r-and-tableau-software_0.pdf.
8. CodePlex (Microsoft). R.NET. URL: http://rdotnet.codeplex.com/.
9. Mark Hornick and Tim Vlamis. Oracle R Enterprise Hands-on Lab. [Online; accessed 12-March-2014]. ORACLE, 2014. URL: http://www.vlamis.com/storage/papers/Hornick%20-%20ORE%20Hands-On%20Lab.pdf.
10. Bharat Sukhwani, Hong Min, Mathew Thoennes, Parijat Dube, Balakrishna Iyer, Bernard Brezzo, Donna Dillenberger, and Sameh Asaad. “Database analytics acceleration using FPGAs”. In: Proceedings of the 21st international conference on Parallel architectures and compilation techniques. ACM. 2012, pp. 411–420.
11. Joseph M Hellerstein, Christopher Ré, Florian Schoppmann, Daisy Zhe Wang, Eugene Fratkin, Aleksander Gorajek, Kee Siong Ng, Caleb Welton, Xixuan Feng, Kun Li, and Arun Kumar. “The MADlib analytics library: or MAD skills, the SQL”. In: Proceedings of the VLDB Endowment 5.12 (2012), pp. 1700–1711.
12. Xixuan Feng, Arun Kumar, Benjamin Recht, and Christopher Ré. “Towards a unified architecture for in-RDBMS analytics”. In: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. ACM. 2012, pp. 325–336.
13. Stefan Steinhaus. Comparison of mathematical programs for data analysis. 1999.
14. Donald Knuth. Comparison of mathematical programs for data analysis. 2008. URL: http://www.scientificweb.com/ncrunch/ncrunch5.pdf.
Thank you