© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.1 SEIZE THE DATA. 2015
Extending HP Vertica
External Procedures, SQL Functions, UDx
Mark Draper, Vertica Professional Services
August 10, 2015
Extending Vertica
HP Vertica Extension Features
External procedures
− Execute external scripts or programs that are installed on a host in your database cluster.
User-Defined SQL Functions
− Store frequently-used SQL expressions; help you simplify and standardize your SQL scripts.
User Defined Extensions (UDxs)
− Develop your own analytic or data-loading tools using C++, Java, and R programming languages; useful when the type of data processing you want to perform is difficult or slow using SQL.
External Procedures
External Procedures
Definition
What are external procedures?
• An external procedure is a procedure external to HP Vertica that you create, maintain, and store on the server.
• External procedures are simply executable files such as shell scripts, compiled code, code interpreters, and so on.
Where are external procedures?
• A procedure file must be owned by the database administrator (OS account) or by a user in the same group as the administrator. (The procedure file owner cannot be root.) The procedure file must also have the set UID attribute enabled, and allow read and execute permission for the group.
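The ownership and permission rules above can be checked from the shell before a file is installed. The following is a minimal sketch, assuming GNU coreutils stat on Linux; the function name check_proc_file is hypothetical and not part of HP Vertica. It checks only the rules that need no knowledge of the administrator account: the owner is not root, the set UID attribute is enabled, and the group can read and execute the file.

```shell
#!/bin/bash
# Hypothetical helper (not part of HP Vertica): checks a candidate external
# procedure file against the ownership and permission rules.
# Assumes GNU coreutils stat (Linux).
check_proc_file() {
    local file="${1:-}"
    local owner mode group_bits

    if [ ! -f "$file" ]; then
        echo "not a regular file: $file"
        return 1
    fi

    # The procedure file owner cannot be root.
    owner=$(stat -c '%U' "$file")
    if [ "$owner" = "root" ]; then
        echo "owner must not be root"
        return 1
    fi

    # The set UID attribute must be enabled.
    if [ ! -u "$file" ]; then
        echo "set UID attribute is not enabled"
        return 1
    fi

    # The group needs read (4) and execute (1) permission.
    mode=$(stat -c '%a' "$file")            # octal mode, for example 4750
    group_bits=$(( (0$mode >> 3) & 7 ))     # leading 0 marks the value as octal
    if [ $(( group_bits & 5 )) -ne 5 ]; then
        echo "group lacks read or execute permission"
        return 1
    fi

    echo "ok"
}
```

For example, check_proc_file /home/dbadmin/extproc1.sh (a placeholder path) prints ok and returns 0 only when all three checks pass.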
External Procedures
Resource Usage
The HP Vertica resource manager is unaware of resources used by external procedures. Additionally, HP Vertica is intended to be the only major process running on your system. If your external procedure is resource intensive, it could affect the performance and stability of HP Vertica. Consider the types of external procedures you create and when you run them. For example, you might run a resource-intensive procedure during off hours.
External Procedures
Definition
Once you have installed an external procedure, you need to make HP Vertica aware of it. To do so, use the CREATE PROCEDURE statement.
By default, only a superuser can create and execute a procedure. However, a superuser can grant the right to execute a procedure to other database users or roles.
To execute an external procedure, the database user needs:
• EXECUTE privilege on procedure
• USAGE privilege on schema that contains the procedure
Once created, a procedure is listed in the V_CATALOG.USER_PROCEDURES system table. Users can see only those procedures that they have been granted the privilege to execute.
External Procedures
Execution
Once you define a procedure through the CREATE PROCEDURE statement, you can use it as a meta command through a simple SELECT statement. HP Vertica does not support using procedures in more complex statements or in expressions.
Procedures are executed on the initiating node. HP Vertica runs the procedure by forking and executing the program. Each procedure argument is passed to the executable file as a string. The parent fork process waits until the child process ends.
To stop execution, cancel the process by sending a cancel command (for example, CTRL+C) through the client. If the procedure program exits with an error, an error message with the exit status is returned.
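Because every argument arrives as a string and a non-zero exit status is reported back to the caller, it helps to validate arguments inside the program. The following is a minimal sketch of such a body, not a definitive implementation; the function name run_proc, the EXTPROC_LOG variable, the default log path, and the "days" argument are all illustrative assumptions.

```shell
#!/bin/bash
# Sketch of an external procedure body. Every procedure argument is passed
# to the executable as a string, so it is validated before use.
# run_proc, EXTPROC_LOG and the log path are illustrative assumptions.
run_proc() {
    local days="${1:-}"
    local log="${EXTPROC_LOG:-/tmp/extproc_sketch.log}"

    # Reject a missing or non-numeric argument with a distinct status.
    case "$days" in
        ''|*[!0-9]*)
            echo "invalid argument: '$days'" >> "$log"
            return 2
            ;;
    esac

    echo "processing last $days days" >> "$log"
    return 0
}
```

A real script would end by calling run_proc "$@" and exiting with its return status, so that a bad argument surfaces in the client as an error with exit status 2.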
External Procedures
Implementing External Procedures
To implement an external procedure:
• Create an external procedure executable file.
#!/bin/bash
# ... processing here ...
echo "extproc1 argument: $1" >> /tmp/extproc1.log
exit 0
• Enable the SUID attribute (set owner user ID upon execution) for the file and allow read and execute permission for the group (if the owner is not the database administrator).
$ chmod 4777 <proc-name>
External Procedures
Implementing External Procedures
To implement an external procedure:
• Install the external procedure executable file.
$ admintools -t install_procedure -d <database> -f <full-path-to-procedure> -p <db-password>
• Create the external procedure in HP Vertica.
=> create procedure <db-proc-name>(arg1 varchar) as '<os-proc-name>' language 'external' user '<run-as-os-user>';
• Grant the execute privilege on the procedure as needed.
=> grant execute on procedure <db-proc-name>(arg1 varchar) to <user|role|PUBLIC>;
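The install, create, and grant steps can be collected into one reviewable script. The sketch below only prints the commands (a dry run) and never executes them; the function name and its arguments are hypothetical, the executable name is assumed to be the file name of the installed path, and writing the executable name and OS user as quoted string literals is an assumption about the CREATE PROCEDURE syntax.

```shell
#!/bin/bash
# Dry-run sketch: prints (does not execute) the commands that install and
# register an external procedure. Function name and arguments are hypothetical.
print_registration_steps() {
    local database="$1" proc_path="$2" db_proc="$3" os_user="$4" grantee="$5"
    local exec_name
    exec_name=$(basename "$proc_path")   # assumed: procedure is registered by file name

    # Step 1: install the executable file (password left as a placeholder).
    printf '%s\n' "admintools -t install_procedure -d $database -f $proc_path -p <db-password>"
    # Step 2: make HP Vertica aware of the procedure.
    printf '%s\n' "create procedure $db_proc(arg1 varchar) as '$exec_name' language 'external' user '$os_user';"
    # Step 3: let other users or roles execute it.
    printf '%s\n' "grant execute on procedure $db_proc(arg1 varchar) to $grantee;"
}
```

For example, print_registration_steps mydb /home/dbadmin/extproc1.sh extproc1 dbadmin PUBLIC prints three lines, one per step, which can be reviewed and then run by hand (the database name and path here are placeholders).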
External Procedures
Usage
To execute an external procedure:
• Invoke the procedure from vsql.
=> select <db-proc-name>(arg1, …);
External Procedures
Dropping Procedures
Only a superuser can drop an external procedure. To drop the definition for an external procedure from HP Vertica, use the DROP PROCEDURE statement. Only the reference to the procedure is removed. The external file remains in the <database_catalog_path>/procedures directory on each node in the database.
Note: The definition HP Vertica uses for a procedure cannot be altered; it can only be dropped.
• Drop procedure command.
=> drop procedure <db-proc-name>(arg1 varchar, …);
External Procedures
Use Cases
Populate an external table.
• Run an external job which populates an external table.
Run ETL scripts.
• Run an ETL script from a cluster host; this lets a database user run the script without having access to the cluster host.
Callback.
• Run a script which connects to the database (or uses admintools).
User Defined SQL
User Defined SQL
Definition
User-Defined SQL Functions let you define and store commonly-used SQL expressions as a function. User-Defined SQL Functions are useful for executing complex queries and combining HP Vertica built-in functions. You simply call the function name you assigned in your query.
A User-Defined SQL Function can be used anywhere in a query where an ordinary SQL expression can be used, except in the table partition clause or the projection segmentation clause.
User Defined SQL
Permission
CREATE
• The user must have CREATE privileges on the schema.
USE
• To use a SQL function, the user must have USAGE privileges on the schema and EXECUTE privileges on the defined function.
ALTER
• Vertica allows multiple functions to share the same name with different argument types; therefore you must specify the argument data type.
DROP
• As with ALTER FUNCTION, you must specify the argument data type.
User Defined SQL
Examples
=> create or replace function ucase (x varchar) return varchar
as
begin
return upper(x);
end;
=> create function store.modulus(x int, y int) return boolean
as
begin
return (
case mod(x,y+1)
when 0 then true
else false end);
end;
User Defined SQL
Use Cases
Migrating Built-In SQL Functions
• If you have built-in SQL functions from another RDBMS that do not map to an HP Vertica-supported function, you can migrate them into your HP Vertica database by using a user-defined SQL function.
Wrapper
• Functional interface for storage of commonly-used SQL expressions.
User Defined Extensions (UDx)
UDx
Definition
A User Defined Extension (abbreviated as UDx) is a component that adds new abilities to the HP Vertica Analytics Platform. UDxs provide features such as new types of data analysis and the ability to parse and load new types of data.
UDxs can be developed in three programming languages: C++, Java, and R.
UDx
Strengths
Can be used anywhere an internal function can be used.
Take full advantage of HP Vertica's distributed computing features. The extensions usually execute in parallel on each node in the cluster.
HP Vertica handles the distribution of the UDx library to the individual nodes. You only need to copy the library to the initiator node.
Your main programming task is to read in data, process it, and then write it out using the HP Vertica SDK APIs. All of the complicated aspects of developing a distributed piece of analytic code are handled for you by HP Vertica.
UDx
Implementation
User Defined Extensions (UDxs) are contained in libraries. A library can contain multiple UDxs. You can load multiple libraries into HP Vertica. You load a library by:
• Copying the library file to a location on the initiator node.
• Connecting to the initiator node using vsql.
• Using the CREATE LIBRARY statement, passing it the path where you saved the library file.
The initiator node takes care of distributing the library file to the rest of the nodes in the cluster.
Once the library is loaded, you define individual User Defined Functions or User Defined Loads using SQL statements such as CREATE FUNCTION and CREATE SOURCE. These statements assign SQL function names to the extension classes in the library.
UDx
Fenced Mode
UDxs in fenced mode run their code outside of the main HP Vertica process, in a separate zygote process. UDx code that crashes while running in fenced mode does not impact the core HP Vertica process. There is a small performance impact when running UDx code in fenced mode. On average, using fenced mode adds about 10% more time to execution compared to unfenced mode.
All UDxs developed in the R and Java programming languages must run in fenced mode, since the R and Java runtimes cannot be directly run within the HP Vertica process. Fenced mode is currently available for all C++ UDxs with the exception of User Defined Aggregates and User Defined Load.
UDx
Unfenced Mode
User Defined Extensions (UDxs) written in the C++ programming language have the option of running in unfenced mode, which means running directly within the HP Vertica process. Since they run within HP Vertica, unfenced UDxs have little overhead, and can perform almost as fast as HP Vertica's own built-in functions. However, since they run within HP Vertica directly, any bugs in their code (memory leaks, for example) can destabilize the main HP Vertica process and bring one or more database nodes down.
UDx
Updates
There are two cases where you need to update libraries that you have already deployed:
• When you have upgraded HP Vertica to a new version that contains changes to the SDK API. For your libraries to work with the new server version, you need to recompile them with the new version of the SDK.
• When you have made changes to your UDxs and you want to deploy these changes. Before updating your UDx library, you need to determine if you have changed the signature of any of the functions contained in the library. If you have, you need to drop the functions from the HP Vertica catalog before you update the library.
UDx
Types
There are five different types of user defined extensions:
• User Defined Scalar Functions (UDSFs) take in a single row of data and return a single value. These functions can be used anywhere a native HP Vertica function can be used, except CREATE TABLE PARTITION BY and SEGMENTED BY expressions.
• User Defined Transform Functions (UDTFs) operate on table segments and return zero or more rows of data. The data they return can be an entirely new table, unrelated to the schema of the input table, including having its own ordering and segmentation expressions. They can only be used in the SELECT list of a query.
• User Defined Aggregate Functions (UDAFs) allow you to create custom aggregate functions specific to your needs. They read one column of data, and return one output column.
• User Defined Analytic Functions (UDAnFs) are similar to UDSFs, in that they read a row of data and return a single row. However, the function can read input rows independently of outputting rows, so that the output values can be calculated over several input rows.
• The User Defined Load (UDL) feature allows you to create custom routines to load your data into HP Vertica. You create custom libraries using the HP Vertica SDK to handle various steps in the loading process.
UDx
Loading
The following statement adds a library containing User Defined Extensions (UDxs) to the HP Vertica catalog.
• CREATE [ OR REPLACE ] LIBRARY [[db-name.]schema.]library_name AS 'library_path' [ DEPENDS 'support_path' ] [ LANGUAGE 'language' ]
The following statements add User Defined Functions (UDFs) and User Defined Load (UDL) components to the catalog.
• CREATE [ OR REPLACE ] AGGREGATE FUNCTION [[db-name.]schema.]function-name AS LANGUAGE 'language' NAME 'factory' LIBRARY library_name;
• CREATE [ OR REPLACE ] ANALYTIC FUNCTION [[db-name.]schema.]function-name AS LANGUAGE 'language' NAME 'factory' LIBRARY library_name;
• CREATE [ OR REPLACE ] FILTER [[db-name.]schema.]function-name AS LANGUAGE 'language' NAME 'factory' LIBRARY library_name;
• CREATE [ OR REPLACE ] FUNCTION [[db-name.]schema.]function-name AS LANGUAGE 'language' NAME 'factory' LIBRARY library_name;
• CREATE [ OR REPLACE ] SOURCE [[db-name.]schema.]function-name AS LANGUAGE 'language' NAME 'factory' LIBRARY library_name;
• CREATE [ OR REPLACE ] TRANSFORM FUNCTION [[db-name.]schema.]function-name AS LANGUAGE 'language' NAME 'factory' LIBRARY library_name;
UDx
Logging
UDx code that runs in fenced mode is logged in UDxZygote.log, which is stored in the UDxLogs directory in the catalog directory of HP Vertica. Log entries for the side process are denoted by the UDx language, node, zygote process ID, and the UDxSideProcess ID.
UDx
Zygotes
dbadmin=> select * from UDX_FENCED_PROCESSES;
   node_name    |   process_type   | session_id | language | max_memory_java_kb | pid  | port  | status
----------------+------------------+------------+----------+--------------------+------+-------+--------
 v_dev_node0001 | UDxZygoteProcess |            |          |    140664675237920 | 3612 | 57868 | UP
 v_dev_node0002 | UDxZygoteProcess |            |          |    140307924516896 | 2754 | 47316 | UP
 v_dev_node0004 | UDxZygoteProcess |            |          |    140379328348192 | 6536 | 51902 | UP
 v_dev_node0003 | UDxZygoteProcess |            |          |    140342888235040 | 2467 | 56394 | UP
(4 rows)
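A listing like this one can also be checked from a script. The sketch below filters pipe-delimited rows in the format shown and prints the nodes whose zygote process is not reported as UP. It is only an illustration: the function name is hypothetical, the rows are assumed to be piped in from a query tool, and any status value other than UP is an assumption, since UP is the only value that appears in the listing.

```shell
#!/bin/bash
# Hypothetical helper: reads UDX_FENCED_PROCESSES rows on stdin
# (pipe-delimited, as in the listing) and prints the node name of every
# zygote process whose status column is not "UP".
zygotes_not_up() {
    awk -F'|' '
        # Data rows have 8 fields and name the zygote process type;
        # the header, separator and row-count lines do not match.
        NF == 8 && $2 ~ /UDxZygoteProcess/ {
            gsub(/[ \t]/, "", $1)   # trim node_name
            gsub(/[ \t]/, "", $8)   # trim status
            if ($8 != "UP") print $1
        }'
}
```

An empty result means every listed zygote process reported UP.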
UDx
Configuration Parameters
Fenced mode supports two configuration parameters:
• FencedUDxMemoryLimitMB - The maximum memory size, in MB, to use for fenced mode processes. The default is -1 (no limit). The side process is killed if this limit is exceeded.
• ForceUDxFencedMode - When set to 1, forces all UDxs that support fenced mode to run in fenced mode even if their definition specified NOT FENCED (C++ only). The default is 0 (disabled).
UDx
Examples
Directory structure for sample code on every cluster node.
/opt/vertica/sdk/examples
|
|-- PloadFunctions
|-- data
|-- RFunctions
|-- FilterFunctions
|-- ScalarFunctions
|-- HelperLibraries
|-- ApportionLoadFunctions
|-- TransformFunctions
|-- AnalyticFunctions
|-- ParserFunctions
|-- JavaUDx
|-- AggregateFunctions
|-- SourceFunctions
UDx in R
UDx in R
Included Packages
The HP Vertica R Language Pack includes the following R packages in addition to the default packages bundled with R:
• Rcpp
• RInside
• lpSolve
• lpSolveAPI
UDx in R
Installing Packages
You can install additional R packages not included in the HP Vertica R Language Pack by using one of two methods. You must install the same packages on all nodes.
• By using the R Language Pack R binary at the command line and calling the install.packages() R command. For example:
$ /opt/vertica/R/bin/R
> install.packages("<package-name>");
• By running the following command:
$ /opt/vertica/R/bin/R CMD INSTALL <path-to-package-tgz>
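Because the same packages must be installed on all nodes, the second method lends itself to a small loop. The sketch below is a dry run that only prints one command per node; the function name and host names are hypothetical, and it assumes the package archive already exists at the same path on every node and that the nodes are reachable over ssh.

```shell
#!/bin/bash
# Dry-run sketch: prints one install command per node so that every node
# ends up with the same R package. Nothing is executed. The function name
# and host names are placeholders; the archive is assumed to be present at
# the same path on each node.
print_r_install_commands() {
    local package_tgz="$1"
    shift
    local host
    for host in "$@"; do
        printf 'ssh %s /opt/vertica/R/bin/R CMD INSTALL %s\n' "$host" "$package_tgz"
    done
}
```

For example, print_r_install_commands /tmp/pkg.tar.gz node01 node02 prints two ssh commands that can be reviewed before being run.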
UDx in R
K-means (From Wikipedia, the free encyclopedia)
k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.
The problem is computationally difficult (NP-hard); however, there are efficient heuristic algorithms that are commonly employed and converge quickly to a local optimum. These are usually similar to the expectation-maximization algorithm for mixtures of Gaussian distributions via an iterative refinement approach employed by both algorithms. Additionally, they both use cluster centers to model the data; however, k-means clustering tends to find clusters of comparable spatial extent, while the expectation-maximization mechanism allows clusters to have different shapes.
The algorithm has nothing to do with and should not be confused with k-nearest neighbor, another popular machine learning technique.
UDx in R
Function Source
###
# @brief Runs K-means clustering algorithm (with K=2) on the input data frame.
#
# @param x input data frame with two float columns, representing
# two-dimension points: (x float, y float).
# @return a data frame with three columns (the point coordinates plus
# their assigned cluster {1..k}): (x float, y float, cluster int).
###
kmeansClu <- function(x)
{
# Fix initial centroids to get predictable clustering.
cx <- c(1.5, 2.5)
cy <- c(3.5, 4.5)
centroids <- data.frame(cx,cy)
cl <- kmeans(x[,1:2], centroids)
res <- data.frame(x[,1:2], cl$cluster)
res
}
UDx in R
Factory Source
kmeansCluFactory <- function()
{
list(name=kmeansClu, udxtype=c("transform"),intype=c("float","float"),
outtype=c("float","float","int"), outnames=c("x","y","cluster"))
}
UDx in R
Load UDx
Create library.
dbadmin=> create library rlib as '/opt/vertica/sdk/examples/RFunctions/RFunctions.R' language 'R';
CREATE LIBRARY
Create function.
dbadmin=> create transform function kmeans as language 'R' name 'kmeansCluFactory' library rlib;
CREATE TRANSFORM FUNCTION
UDx in R
Explain
create table point_data(x float, y float) unsegmented all nodes;
explain select kmeans(x, y) over() from point_data;
Access Path:
+-ANALYTICAL [Cost: 1K, Rows: 10K (NO STATISTICS)] (PATH ID: 1)
| Analytic Group
| Functions: kmeans()
| Execute on: Query Initiator
| +---> STORAGE ACCESS for point_data [Cost: 1K, Rows: 10K (NO STATISTICS)] (PATH ID: 2)
| | Projection: public.point_data_node0001
| | Materialize: point_data.x, point_data.y
| | Execute on: Query Initiator
UDx in R
Explain
create table point_data_seg(n int, x float, y float) segmented by hash(n) all nodes;
explain select n,kmeans(x, y) over(partition by n) from point_data_seg;
Access Path:
+-ANALYTICAL [Cost: 8, Rows: 21 (NO STATISTICS)] (PATH ID: 1)
| Analytic Group
| Functions: kmeans()
| Execute on: All Nodes
| +---> STORAGE ACCESS for point_data_seg [Cost: 7, Rows: 21 (NO STATISTICS)] (PATH ID: 2)
| | Projection: public.point_data_seg_b0
| | Materialize: point_data_seg.n, point_data_seg.x, point_data_seg.y
| | Execute on: All Nodes
QUESTIONS?
Please attend our Q&A with HP Big Data experts today
Marina Ballroom, Lobby level
10:15 am - 10:30 am
12:00 pm - 1:00 pm
2:30 pm - 3:00 pm
4:30 pm - 5:00 pm