© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

SEIZE THE DATA. 2015

Extending HP Vertica

External Procedures, SQL Functions, UDx

Mark Draper, Vertica Professional Services

August 10, 2015

Extending Vertica

HP Vertica Extension Features

External Procedures

− Execute external scripts or programs that are installed on a host in your database cluster.

User-Defined SQL Functions

− Store frequently-used SQL expressions; help you simplify and standardize your SQL scripts.

User-Defined Extensions (UDx)

− Develop your own analytic or data-loading tools using C++, Java, and R programming languages; useful when the type of data processing you want to perform is difficult or slow using SQL.

External Procedures

Definition

What are external procedures?

• An external procedure is a procedure external to HP Vertica that you create, maintain, and store on the server.

• External procedures are simply executable files such as shell scripts, compiled code, code interpreters, and so on.

Where are external procedures?

• A procedure file must be owned by the database administrator (OS account) or by a user in the same group as the administrator. (The procedure file owner cannot be root.) The procedure file must also have the set UID attribute enabled, and allow read and execute permission for the group.
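The ownership and permission rules above can be checked before a file is installed. The following is a minimal sketch, not part of HP Vertica: the function names mode_meets_requirements and check_proc_file are hypothetical, and the stat -c options assume GNU coreutils on Linux.

```shell
#!/bin/bash
# Hypothetical helper: succeed if an octal mode string (for example 4750)
# has the set-UID bit plus group read and execute permission.
mode_meets_requirements() {
    mode_bits=$(( 0$1 ))                             # leading 0 forces octal parsing
    [ $(( mode_bits & 04000 )) -ne 0 ] || return 1   # set-UID bit must be on
    [ $(( mode_bits & 00050 )) -eq 40 ] || return 1  # group r-x (octal 050 is decimal 40)
}

# Hypothetical wrapper: inspect a real file. Assumes GNU coreutils `stat`.
check_proc_file() {
    [ -f "$1" ] || { echo "missing file"; return 1; }
    [ "$(stat -c '%U' "$1")" != "root" ] || { echo "owned by root"; return 1; }
    mode_meets_requirements "$(stat -c '%a' "$1")" || { echo "bad mode"; return 1; }
    echo "ok"
}
```

Calling check_proc_file <full-path-to-procedure> prints ok only when the file exists, is not owned by root, and carries the required mode bits. It does not check group membership of the owner.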

External Procedures

Resource Usage

The HP Vertica resource manager is unaware of resources used by external procedures. Additionally, HP Vertica is intended to be the only major process running on your system. If your external procedure is resource intensive, it could affect the performance and stability of HP Vertica. Consider the types of external procedures you create and when you run them. For example, you might run a resource-intensive procedure during off hours.
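One way to act on the off-hours advice is to have the procedure script refuse to start heavy work during the day. This is only a sketch: the function names are hypothetical, and the 08:00 to 18:00 business window is an assumption to adjust for your site.

```shell
#!/bin/bash
# Hypothetical guard: succeed only outside business hours.
# Takes a two-digit hour such as the output of `date +%H`.
is_off_hours() {
    hour=${1#0}                    # drop one leading zero ("08" becomes "8")
    [ "$hour" -lt 8 ] || [ "$hour" -ge 18 ]
}

# Hypothetical entry point for a resource-intensive procedure.
guard_heavy_job() {
    if is_off_hours "$(date +%H)"; then
        echo "starting job"
    else
        echo "refusing to run during business hours" >&2
        return 1
    fi
}
```

Because the HP Vertica resource manager cannot see the script's resource use, a guard like this has to live in the script itself.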

External Procedures

Definition

Once you have installed an external procedure, you need to make HP Vertica aware of it. To do so, use the CREATE PROCEDURE statement.

By default, only a superuser can create and execute a procedure. However, a superuser can grant the right to execute a stored procedure to a user on the operating system.

To execute an external procedure, the database user needs:

• EXECUTE privilege on procedure

• USAGE privilege on schema that contains the procedure

Once created, a procedure is listed in the V_CATALOG.USER_PROCEDURES system table. Users can see only those procedures that they have been granted the privilege to execute.

External Procedures

Execution

Once you define a procedure through the CREATE PROCEDURE statement, you can use it as a meta command through a simple SELECT statement. HP Vertica does not support using procedures in more complex statements or in expressions.

Procedures are executed on the initiating node. HP Vertica runs the procedure by forking and executing the program. Each procedure argument is passed to the executable file as a string. The parent fork process waits until the child process ends.

To stop execution, cancel the process by sending a cancel command (for example, CTRL+C) through the client. If the procedure program exits with an error, an error message with the exit status is returned.
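Because every argument arrives as a string and the exit status is what reports failure, a procedure script usually validates its input and returns a non-zero status on error. A minimal sketch under those assumptions follows; run_proc and the "batch id" argument are hypothetical names used only for illustration.

```shell
#!/bin/bash
# Hypothetical procedure body. Each argument is passed as a string,
# so the script validates the value itself before using it.
run_proc() {
    case "$1" in
        ''|*[!0-9]*)
            # Empty or non-numeric input: report it and fail.
            echo "extproc: invalid batch id '$1'" >&2
            return 2
            ;;
    esac
    echo "extproc: processing batch $1"
    return 0
}
```

A procedure file built this way would end by calling run_proc "$@" and exiting with its status, so that a failure surfaces as the error message with exit status described above.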

External Procedures

Implementing External Procedures

To implement an external procedure:

• Create an external procedure executable file. For example:

#!/bin/bash
# … processing here …
echo "extproc1 argument: $1" >> /tmp/extproc1.log
exit 0

• Enable the SUID attribute (set owner user ID on execution) for the file and allow read and execute permission for the group (if the owner is not the database administrator).

$ chmod 4777 <proc-name>

• Install the external procedure executable file.

$ admintools -t install_procedure -d <database> -f <full-path-to-procedure> -p <db-password>

• Create the external procedure in HP Vertica, then grant the privilege to execute it.

=> create procedure <db-proc-name>(arg1 varchar) as '<os-proc-name>' language 'external' user '<run-as-os-user>';

=> grant execute on procedure <db-proc-name>(arg1 varchar) to <user|role|PUBLIC>;

External Procedures

Usage

To execute an external procedure:

• Invoke the procedure from vsql.

=> select <db-proc-name>(arg1, …);

External Procedures

Dropping Procedures

Only a superuser can drop an external procedure. To drop the definition for an external procedure from HP Vertica, use the DROP PROCEDURE statement. Only the reference to the procedure is removed. The external file remains in the <database_catalog_path>/procedures directory on each node in the database.

Note: The definition HP Vertica uses for a procedure cannot be altered; it can only be dropped.

• Drop procedure command.

=> drop procedure <db-proc-name>(arg1 varchar, …);

External Procedures

Use Cases

Populate an external table.

• Run an external job which populates an external table.

Run ETL scripts.

• Run an ETL script on a cluster host; this lets a database user run the script without having access to the cluster host.

Callback.

• Run a script which connects to the database (or uses admintools).
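The callback use case might look like the sketch below, in which the procedure script calls back into the database through vsql and its -c option. Everything else here is an assumption made for illustration: the proc_log table, the log_event function, and the VSQL override variable are hypothetical, and connection settings are assumed to come from vsql defaults.

```shell
#!/bin/bash
# Hypothetical callback body: record the procedure argument in a log table.
# VSQL names the client program; setting VSQL=echo prints the arguments
# instead of connecting, which is handy for a dry run.
VSQL="${VSQL:-vsql}"

log_event() {
    # Double any single quotes so the argument is a valid SQL string literal.
    escaped=$(printf '%s' "$1" | sed "s/'/''/g")
    "$VSQL" -c "INSERT INTO proc_log(message) VALUES ('$escaped');"
}
```

Escaping the argument matters because the value comes straight from the database user who invoked the procedure.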

User Defined SQL

Definition

User-Defined SQL Functions let you define and store commonly-used SQL expressions as a function. User-Defined SQL Functions are useful for executing complex queries and combining HP Vertica built-in functions. You simply call the function name you assigned in your query.

A User-Defined SQL Function can be used anywhere in a query where an ordinary SQL expression can be used, except in the table partition clause or the projection segmentation clause.

User Defined SQL

Permission

CREATE

• The user must have CREATE privileges on the schema.

USE

• To use a SQL function, the user must have USAGE privileges on the schema and EXECUTE privileges on the defined function.

ALTER

• Vertica allows multiple functions to share the same name with different argument types; therefore you must specify the argument data type.

DROP

• As with ALTER FUNCTION, you must specify the argument data type.

User Defined SQL

Examples

=> create or replace function ucase (x varchar) return varchar
   as
   begin
     return upper(x);
   end;

=> create function store.modulus(x int, y int) return boolean
   as
   begin
     return (case mod(x,y+1) when 0 then true else false end);
   end;

User Defined SQL

Use Cases

Migrating Built-In SQL Functions

• If you have built-in SQL functions from another RDBMS that do not map to an HP Vertica-supported function, you can migrate them into your HP Vertica database by using a user-defined SQL function.

Wrapper

• Functional interface for storage of commonly-used SQL expressions.

User Defined Extensions (UDx)

UDx

Definition

A User Defined Extension (abbreviated as UDx) is a component that adds new abilities to the HP Vertica Analytics Platform. UDxs provide features such as new types of data analysis and the ability to parse and load new types of data.

UDxs can be developed in three programming languages: C++, Java, and R.

UDx

Strengths

Can be used anywhere an internal function can be used.

Take full advantage of HP Vertica's distributed computing features. The extensions usually execute in parallel on each node in the cluster.

HP Vertica handles the distribution of the UDx library to the individual nodes. You only need to copy the library to the initiator node.

Your main programming task is to read in data, process it, and then write it out using the HP Vertica SDK APIs. All of the complicated aspects of developing a distributed piece of analytic code are handled for you by HP Vertica.

UDx

Implementation

User Defined Extensions (UDxs) are contained in libraries. A library can contain multiple UDxs. You can load multiple libraries into HP Vertica. You load a library by:

• Copying the library file to a location on the initiator node.

• Connecting to the initiator node using vsql.

• Using the CREATE LIBRARY statement, passing it the path where you saved the library file.

The initiator node takes care of distributing the library file to the rest of the nodes in the cluster.

Once the library is loaded, you define individual User Defined Functions or User Defined Loads using SQL statements such as CREATE FUNCTION and CREATE SOURCE. These statements assign SQL function names to the extension classes in the library.

UDx

Fenced Mode

UDxs in fenced mode run outside of the main HP Vertica process, in a separate zygote process. UDx code that crashes while running in fenced mode does not impact the core HP Vertica process. There is a small performance impact when running UDx code in fenced mode: on average, fenced mode adds about 10% to execution time compared to unfenced mode.

All UDxs developed in the R and Java programming languages must run in fenced mode, since the R and Java runtimes cannot be directly run within the HP Vertica process. Fenced mode is currently available for all C++ UDxs with the exception of User Defined Aggregates and User Defined Load.

UDx

Unfenced Mode

User Defined Extensions (UDxs) written in the C++ programming language have the option of running in unfenced mode, which means running directly within the HP Vertica process. Since they run within HP Vertica, unfenced UDxs have little overhead, and can perform almost as fast as HP Vertica's own built-in functions. However, since they run within HP Vertica directly, any bugs in their code (memory leaks, for example) can destabilize the main HP Vertica process and bring one or more database nodes down.

UDx

Updates

There are two cases where you need to update libraries that you have already deployed:

• When you have upgraded HP Vertica to a new version that contains changes to the SDK API. For your libraries to work with the new server version, you need to recompile them with the new version of the SDK.

• When you have made changes to your UDxs and you want to deploy these changes. Before updating your UDx library, you need to determine if you have changed the signature of any of the functions contained in the library. If you have, you need to drop the functions from the HP Vertica catalog before you update the library.

UDx

TYPES

There are five different types of user defined extensions:

• User Defined Scalar Functions (UDSFs) take in a single row of data and return a single value. These functions can be used anywhere a native HP Vertica function can be used, except in CREATE TABLE PARTITION BY and SEGMENTED BY expressions.

• User Defined Transform Functions (UDTFs) operate on table segments and return zero or more rows of data. The data they return can be an entirely new table, unrelated to the schema of the input table, including having its own ordering and segmentation expressions. They can only be used in the SELECT list of a query.

• User Defined Aggregate Functions (UDAF) allow you to create custom Aggregate Functions specific to your needs. They read one column of data, and return one output column.

• User Defined Analytic Functions (UDAnF) are similar to UDSFs, in that they read a row of data and return a single row. However, the function can read input rows independently of outputting rows, so that the output values can be calculated over several input rows.

• The User Defined Load (UDL) feature allows you to create custom routines to load your data into HP Vertica. You create custom libraries using the HP Vertica SDK to handle various steps in the loading process.

UDx

Loading

The following statement adds a library entry containing User Defined Extensions (UDxs) into the HP Vertica catalog.

• CREATE [ OR REPLACE ] LIBRARY [[db-name.]schema.]library_name AS 'library_path' [ DEPENDS 'support_path' ] [ LANGUAGE 'language' ]

The following statements add a User Defined Function (UDF) to the catalog.

• CREATE [ OR REPLACE ] AGGREGATE FUNCTION [[db-name.]schema.]function-name AS LANGUAGE 'language' NAME 'factory' LIBRARY library_name;

• CREATE [ OR REPLACE ] ANALYTIC FUNCTION [[db-name.]schema.]function-name AS LANGUAGE 'language' NAME 'factory' LIBRARY library_name;

• CREATE [ OR REPLACE ] FILTER FUNCTION [[db-name.]schema.]function-name AS LANGUAGE 'language' NAME 'factory' LIBRARY library_name;

• CREATE [ OR REPLACE ] FUNCTION [[db-name.]schema.]function-name AS LANGUAGE 'language' NAME 'factory' LIBRARY library_name;

• CREATE [ OR REPLACE ] SOURCE [[db-name.]schema.]function-name AS LANGUAGE 'language' NAME 'factory' LIBRARY library_name;

• CREATE [ OR REPLACE ] TRANSFORM FUNCTION [[db-name.]schema.]function-name AS LANGUAGE 'language' NAME 'factory' LIBRARY library_name;

UDx

Logging

UDx code that runs in fenced mode is logged in UDxZygote.log, which is stored in the UDxLogs directory in the HP Vertica catalog directory. Log entries for the side process are denoted by the UDx language, node, zygote process ID, and the UDxSideProcess ID.

UDx

Zygotes

dbadmin=> select * from UDX_FENCED_PROCESSES;
 node_name      | process_type     | session_id | language | max_memory_java_kb | pid  | port  | status
----------------+------------------+------------+----------+--------------------+------+-------+--------
 v_dev_node0001 | UDxZygoteProcess |            |          |    140664675237920 | 3612 | 57868 | UP
 v_dev_node0002 | UDxZygoteProcess |            |          |    140307924516896 | 2754 | 47316 | UP
 v_dev_node0004 | UDxZygoteProcess |            |          |    140379328348192 | 6536 | 51902 | UP
 v_dev_node0003 | UDxZygoteProcess |            |          |    140342888235040 | 2467 | 56394 | UP
(4 rows)

UDx

Configuration Parameters

Fenced mode supports two configuration parameters:

• FencedUDxMemoryLimitMB - The maximum memory size, in MB, to use for Fenced Mode processes. The default is -1 (no limit). The side process is killed if this limit is exceeded.

• ForceUDxFencedMode - When set to 1, forces all UDxs that support fenced mode to run in fenced mode even if their definition specifies NOT FENCED (C++ only). The default is 0 (disabled).

UDx

EXAMPLES

Directory structure for sample code on every cluster node.

/opt/vertica/sdk/examples
|
|-- PloadFunctions
|-- data
|-- RFunctions
|-- FilterFunctions
|-- ScalarFunctions
|-- HelperLibraries
|-- ApportionLoadFunctions
|-- TransformFunctions
|-- AnalyticFunctions
|-- ParserFunctions
|-- JavaUDx
|-- AggregateFunctions
|-- SourceFunctions

UDx in R

UDx in R

Included Packages

The HP Vertica R Language Pack includes the following R packages in addition to the default packages bundled with R:

• Rcpp

• RInside

• lpSolve

• lpSolveAPI

UDx in R

Installing Packages

You can install additional R packages not included in the HP Vertica R Language Pack by using one of two methods. You must install the same packages on all nodes.

• By using the R Language Pack R binary at the command line and using the install.packages() R command. For example:

$ /opt/vertica/R/bin/R
> install.packages("<package-name>");

• By running the following command:

$ /opt/vertica/R/bin/R CMD INSTALL <path-to-package-tgz>
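Since the same packages must be installed on every node, the second method lends itself to a small loop. The sketch below only prints the commands it would run. The function name plan_r_install, the use of scp and ssh, and the /tmp staging directory are assumptions made for illustration; only the R binary path comes from the text above.

```shell
#!/bin/bash
# Hypothetical helper: print, for each node, the commands that copy a package
# archive to /tmp and install it with the R Language Pack binary.
plan_r_install() {
    pkg="$1"
    shift
    for host in "$@"; do
        echo "scp $pkg $host:/tmp/"
        echo "ssh $host /opt/vertica/R/bin/R CMD INSTALL /tmp/$(basename "$pkg")"
    done
}
```

Reviewing the printed plan first, and only then piping it to sh, keeps a mistyped host list from half-installing a package across the cluster. Running the plan assumes passwordless ssh between nodes.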

UDx in R

K-means (From Wikipedia, the free encyclopedia)

k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.

The problem is computationally difficult (NP-hard); however, there are efficient heuristic algorithms that are commonly employed and converge quickly to a local optimum. These are usually similar to the expectation-maximization algorithm for mixtures of Gaussian distributions via an iterative refinement approach employed by both algorithms. Additionally, they both use cluster centers to model the data; however, k-means clustering tends to find clusters of comparable spatial extent, while the expectation-maximization mechanism allows clusters to have different shapes.

The algorithm has nothing to do with and should not be confused with k-nearest neighbor, another popular machine learning technique.

UDx in R

Function Source

###
# @brief Runs K-means clustering algorithm (with K=2) on the input data frame.
#
# @param x input data frame with two float columns, representing
#          two-dimension points: (x float, y float).
# @return a data frame with three columns (the point coordinates plus
#         their assigned cluster {1..k}): (x float, y float, cluster int).
###
kmeansClu <- function(x)
{
  # Fix initial centroids to get predictable clustering.
  cx <- c(1.5, 2.5)
  cy <- c(3.5, 4.5)
  centroids <- data.frame(cx,cy)
  cl <- kmeans(x[,1:2], centroids)
  res <- data.frame(x[,1:2], cl$cluster)
  res
}

UDx in R

Factory Source

kmeansCluFactory <- function()
{
  list(name=kmeansClu, udxtype=c("transform"), intype=c("float","float"),
       outtype=c("float","float","int"), outnames=c("x","y","cluster"))
}

UDx in R

Load UDx

Create library.

dbadmin=> create library rlib as '/opt/vertica/sdk/examples/RFunctions/RFunctions.R' language 'R';
CREATE LIBRARY

Create function.

dbadmin=> create transform function kmeans as language 'R' name 'kmeansCluFactory' library rlib;
CREATE TRANSFORM FUNCTION

UDx in R

Explain

create table point_data(x float, y float) unsegmented all nodes;

explain select kmeans(x, y) over() from point_data;

Access Path:
+-ANALYTICAL [Cost: 1K, Rows: 10K (NO STATISTICS)] (PATH ID: 1)
| Analytic Group
| Functions: kmeans()
| Execute on: Query Initiator
| +---> STORAGE ACCESS for point_data [Cost: 1K, Rows: 10K (NO STATISTICS)] (PATH ID: 2)
| | Projection: public.point_data_node0001
| | Materialize: point_data.x, point_data.y
| | Execute on: Query Initiator

UDx in R

Explain

create table point_data_seg(n int, x float, y float) segmented by hash(n) all nodes;

explain select n,kmeans(x, y) over(partition by n) from point_data_seg;

Access Path:
+-ANALYTICAL [Cost: 8, Rows: 21 (NO STATISTICS)] (PATH ID: 1)
| Analytic Group
| Functions: kmeans()
| Execute on: All Nodes
| +---> STORAGE ACCESS for point_data_seg [Cost: 7, Rows: 21 (NO STATISTICS)] (PATH ID: 2)
| | Projection: public.point_data_seg_b0
| | Materialize: point_data_seg.n, point_data_seg.x, point_data_seg.y
| | Execute on: All Nodes

QUESTIONS?

Please attend our Q&A with HP Big Data experts today

Marina Ballroom, Lobby level

10:15 am - 10:30 am

12:00 pm - 1:00 pm

2:30 pm - 3:00 pm

4:30 pm - 5:00 pm
