© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

SEIZE THE DATA. 2015

Extending HP Vertica

External Procedures, SQL Functions, UDx

Mark Draper, Vertica Professional Services

August 10, 2015

Extending Vertica

HP Vertica Extension Features

External procedures

− Execute external scripts or programs that are installed on a host in your database cluster.

User-Defined SQL Functions

− Store frequently-used SQL expressions; help you simplify and standardize your SQL scripts.

User-Defined Extensions (UDxs)

− Develop your own analytic or data-loading tools using the C++, Java, and R programming languages; useful when the type of data processing you want to perform is difficult or slow using SQL.

External Procedures

External Procedures

Definition

What are external procedures?

• An external procedure is a procedure external to HP Vertica that you create, maintain, and store on the server.

• External procedures are simply executable files such as shell scripts, compiled code, code interpreters, and so on.

Where are external procedures?

• A procedure file must be owned by the database administrator (OS account) or by a user in the same group as the administrator. (The procedure file owner cannot be root.) The procedure file must also have the set UID attribute enabled, and allow read and execute permission for the group.
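The mode requirements above (set UID enabled, group read and execute) can be checked mechanically. The following is a minimal sketch, not part of the original slides: the function name `mode_meets_requirements` is hypothetical, and it only inspects an octal mode string (such as the one GNU `stat -c %a` prints). Ownership still has to be verified separately.

```shell
#!/bin/bash
# Hypothetical helper (not from the slides): succeeds when an octal file mode
# includes the setuid bit (4000) plus group read (0040) and group execute (0010).
mode_meets_requirements() {
    local mode=$((8#$1))        # interpret the argument as an octal number
    local required=$((8#4050))  # setuid + group read + group execute
    [ $((mode & required)) -eq "$required" ]
}

# Example: 4750 qualifies; 0750 lacks the setuid bit.
for m in 4750 0750; do
    if mode_meets_requirements "$m"; then
        echo "$m: ok"
    else
        echo "$m: missing required bits"
    fi
done
```

A mode such as 4750 is the narrowest one that satisfies the stated requirements when the group needs access; wider modes also pass this check.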

External Procedures

Resource Usage

The HP Vertica resource manager is unaware of resources used by external procedures. Additionally, HP Vertica is intended to be the only major process running on your system. If your external procedure is resource intensive, it could affect the performance and stability of HP Vertica. Consider the types of external procedures you create and when you run them. For example, you might run a resource-intensive procedure during off hours.
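One way to follow the off-hours advice is to have a resource-intensive procedure check the clock before it starts. This is only a sketch under assumptions: the 22:00 to 05:59 window and the function name `is_off_hours` are illustrative choices, not something the slides specify.

```shell
#!/bin/bash
# Hypothetical guard: treat 22:00-05:59 as "off hours" (an assumed window).
is_off_hours() {
    local hour=$((10#$1))   # force base 10 so "08" and "09" are not read as octal
    [ "$hour" -ge 22 ] || [ "$hour" -lt 6 ]
}

# A resource-intensive procedure could bail out early during business hours.
if is_off_hours "$(date +%H)"; then
    echo "off hours: safe to start heavy work"
else
    echo "business hours: skipping heavy work"
fi
```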

External Procedures

Definition

Once you have installed an external procedure, you need to make HP Vertica aware of it. To do so, use the CREATE PROCEDURE statement.

By default, only a superuser can create and execute a procedure. However, a superuser can grant the right to execute a stored procedure to a user on the operating system.

To execute an external procedure, the database user needs:

• EXECUTE privilege on the procedure

• USAGE privilege on the schema that contains the procedure

Once created, a procedure is listed in the V_CATALOG.USER_PROCEDURES system table. Users can see only those procedures that they have been granted the privilege to execute.

External Procedures

Execution

Once you define a procedure through the CREATE PROCEDURE statement, you can use it as a meta command through a simple SELECT statement. HP Vertica does not support using procedures in more complex statements or in expressions.

Procedures are executed on the initiating node. HP Vertica runs the procedure by forking and executing the program. Each procedure argument is passed to the executable file as a string. The parent fork process waits until the child process ends.

To stop execution, cancel the process by sending a cancel command (for example, CTRL+C) through the client. If the procedure program exits with an error, an error message with the exit status is returned.
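Because every argument arrives as a string and a failing exit status is reported back to the caller, a procedure script benefits from validating its input and returning distinct exit codes. The sketch below illustrates the idea; the function name `run_proc`, the single table-name argument, and the exit codes 2 and 3 are all assumptions made for the example.

```shell
#!/bin/bash
# Hypothetical procedure body: expects exactly one argument, a table name.
# Arguments arrive as plain strings, so validate before using them.
run_proc() {
    if [ "$#" -ne 1 ] || [ -z "$1" ]; then
        echo "usage: extproc <table-name>" >&2
        return 2                      # wrong number of arguments
    fi
    case "$1" in
        *[!A-Za-z0-9_]*)
            echo "invalid table name: $1" >&2
            return 3                  # unexpected characters
            ;;
    esac
    echo "processing $1"
}

# In a real script the last line would be: run_proc "$@"; exit $?
run_proc "sales_2015"
```

Distinct non-zero codes make the exit status in the returned error message easier to interpret.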

External Procedures

Implementing External Procedures

To implement an external procedure:

• Create an external procedure executable file.

#!/bin/bash
# … processing here …
echo "extproc1 argument: $1" >> /tmp/extproc1.log
exit 0

• Enable the SUID attribute (set owner user ID on execution) for the file and allow read and execute permission for the group (if the owner is not the database administrator).

$ chmod 4777 <proc-name>

External Procedures

Implementing External Procedures

To implement an external procedure:

• Install the external procedure executable file.

$ admintools -t install_procedure -d <database> -f <full-path-to-procedure> -p <db-password>

• Create the external procedure in HP Vertica.

=> create procedure <db-proc-name>(arg1 varchar) as <os-proc-name> language 'external' user <run-as-os-user>;

=> grant execute on <procedure> to <user|role|PUBLIC>;

External Procedures

Usage

To execute an external procedure:

• Invoke the procedure from vsql.

=> select <db-proc-name>(arg1, …);

External Procedures

Dropping Procedures

Only a superuser can drop an external procedure. To drop the definition for an external procedure from HP Vertica, use the DROP PROCEDURE statement. Only the reference to the procedure is removed. The external file remains in the <database_catalog_path>/procedures directory on each node in the database.

Note: The definition HP Vertica uses for a procedure cannot be altered; it can only be dropped.

• Drop procedure command.

=> drop procedure <db-proc-name>(arg1 varchar, …);
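Since DROP PROCEDURE leaves the executable behind, it can be useful to see what is still sitting in the procedures directory on a node. A minimal sketch follows; the function name `list_procedure_files` is hypothetical, and the catalog path is passed in as an argument because the slides do not give a concrete value.

```shell
#!/bin/bash
# Hypothetical helper: list regular files left in <catalog>/procedures.
# The catalog path is an argument; no real path is assumed here.
list_procedure_files() {
    local dir="$1/procedures"
    [ -d "$dir" ] || { echo "no such directory: $dir" >&2; return 1; }
    local f
    for f in "$dir"/*; do
        if [ -f "$f" ]; then
            basename "$f"
        fi
    done
}

# Usage (placeholder path): list_procedure_files "<database_catalog_path>"
```

The listing would have to be run on each node, and any cleanup of leftover files remains a manual decision.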

External Procedures

Use Cases

Populate an external table.

• Run an external job which populates an external table.

Run ETL scripts.

• Run an ETL script from a cluster host; this allows a database user to run the script without having access to the cluster host.

Callback.

• Run a script which connects to the database (or uses admintools).

User Defined SQL

User Defined SQL

Definition

User-Defined SQL Functions let you define and store commonly-used SQL expressions as a function. User-Defined SQL Functions are useful for executing complex queries and combining HP Vertica built-in functions. You simply call the function name you assigned in your query.

A User-Defined SQL Function can be used anywhere in a query where an ordinary SQL expression can be used, except in the table partition clause or the projection segmentation clause.

User Defined SQL

Permission

CREATE

• The user must have CREATE privileges on the schema.

USE

• To use a SQL function, the user must have USAGE privileges on the schema and EXECUTE privileges on the defined function.

ALTER

• Vertica allows multiple functions to share the same name with different argument types; therefore you must specify the argument data type.

DROP

• As with ALTER FUNCTION, you must specify the argument data type.

User Defined SQL

Examples

=> create or replace function ucase (x varchar) return varchar
   as
   begin
     return upper(x);
   end;

=> create function store.modulus(x int, y int) return boolean
   as
   begin
     return (
       case mod(x,y+1)
         when 0 then true
         else false end);
   end;

User Defined SQL

Use Cases

Migrating Built-In SQL Functions

• If you have built-in SQL functions from another RDBMS that do not map to an HP Vertica-supported function, you can migrate them into your HP Vertica database by using a user-defined SQL function.

Wrapper

• Functional interface for storage of commonly-used SQL expressions.

User Defined Extensions (UDx)

UDx

Definition

A User Defined Extension (abbreviated as UDx) is a component that adds new abilities to the HP Vertica Analytics Platform. UDxs provide features such as new types of data analysis and the ability to parse and load new types of data.

UDxs can be developed in three programming languages: C++, Java, and R.

UDx

Strengths

Can be used anywhere an internal function can be used.

Take full advantage of HP Vertica's distributed computing features. The extensions usually execute in parallel on each node in the cluster.

HP Vertica handles the distribution of the UDx library to the individual nodes. You only need to copy the library to the initiator node.

Your main programming task is to read in data, process it, and then write it out using the HP Vertica SDK APIs. All of the complicated aspects of developing a distributed piece of analytic code are handled for you by HP Vertica.

UDx

Implementation

User Defined Extensions (UDxs) are contained in libraries. A library can contain multiple UDxs. You can load multiple libraries into HP Vertica. You load a library by:

• Copying the library file to a location on the initiator node.

• Connecting to the initiator node using vsql.

• Using the CREATE LIBRARY statement, passing it the path where you saved the library file.

The initiator node takes care of distributing the library file to the rest of the nodes in the cluster.

Once the library is loaded, you define individual User Defined Functions or User Defined Loads using SQL statements such as CREATE FUNCTION and CREATE SOURCE. These statements assign SQL function names to the extension classes in the library.

UDx

Fenced Mode

UDxs in fenced mode run outside of the main HP Vertica process, in a separate zygote process. UDx code that crashes while running in fenced mode does not impact the core HP Vertica process. There is a small performance impact when running UDx code in fenced mode. On average, using fenced mode adds about 10% more time to execution compared to unfenced mode.

All UDxs developed in the R and Java programming languages must run in fenced mode, since the R and Java runtimes cannot be directly run within the HP Vertica process. Fenced mode is currently available for all C++ UDxs with the exception of User Defined Aggregates and User Defined Load.

UDx

Unfenced Mode

User Defined Extensions (UDxs) written in the C++ programming language have the option of running in unfenced mode, which means running directly within the HP Vertica process. Since they run within HP Vertica, unfenced UDxs have little overhead and can perform almost as fast as HP Vertica's own built-in functions. However, since they run within HP Vertica directly, any bugs in their code (memory leaks, for example) can destabilize the main HP Vertica process and bring one or more database nodes down.

UDx

Updates

There are two cases where you need to update libraries that you have already deployed:

• When you have upgraded HP Vertica to a new version that contains changes to the SDK API. For your libraries to work with the new server version, you need to recompile them with the new version of the SDK.

• When you have made changes to your UDxs and you want to deploy these changes. Before updating your UDx library, you need to determine if you have changed the signature of any of the functions contained in the library. If you have, you need to drop the functions from the HP Vertica catalog before you update the library.

UDx

Types

There are five different types of user defined extensions:

• User Defined Scalar Functions (UDSFs) take in a single row of data and return a single value. These functions can be used anywhere a native HP Vertica function can be used, except in the PARTITION BY and SEGMENTED BY expressions of CREATE TABLE.

• User Defined Transform Functions (UDTFs) operate on table segments and return zero or more rows of data. The data they return can be an entirely new table, unrelated to the schema of the input table, including having its own ordering and segmentation expressions. They can only be used in the SELECT list of a query.

• User Defined Aggregate Functions (UDAFs) allow you to create custom aggregate functions specific to your needs. They read one column of data and return one output column.

• User Defined Analytic Functions (UDAnFs) are similar to UDSFs, in that they read a row of data and return a single row. However, the function can read input rows independently of outputting rows, so that the output values can be calculated over several input rows.

• The User Defined Load (UDL) feature allows you to create custom routines to load your data into HP Vertica. You create custom libraries using the HP Vertica SDK to handle various steps in the loading process.

UDx

Loading

The following statement adds a library entry containing User Defined Extensions (UDxs) into the HP Vertica catalog.

• CREATE [ OR REPLACE ] LIBRARY [[db-name.]schema.]library_name AS 'library_path' [ DEPENDS 'support_path' ] [ LANGUAGE 'language' ]

The following statements add a User Defined Function (UDF) to the catalog.

• CREATE [ OR REPLACE ] AGGREGATE FUNCTION [[db-name.]schema.]function-name AS LANGUAGE 'language' NAME 'factory' LIBRARY library_name;

• CREATE [ OR REPLACE ] ANALYTIC FUNCTION [[db-name.]schema.]function-name AS LANGUAGE 'language' NAME 'factory' LIBRARY library_name;

• CREATE [ OR REPLACE ] FILTER [[db-name.]schema.]function-name AS LANGUAGE 'language' NAME 'factory' LIBRARY library_name;

• CREATE [ OR REPLACE ] FUNCTION [[db-name.]schema.]function-name AS LANGUAGE 'language' NAME 'factory' LIBRARY library_name;

• CREATE [ OR REPLACE ] SOURCE [[db-name.]schema.]function-name AS LANGUAGE 'language' NAME 'factory' LIBRARY library_name;

• CREATE [ OR REPLACE ] TRANSFORM FUNCTION [[db-name.]schema.]function-name AS LANGUAGE 'language' NAME 'factory' LIBRARY library_name;

UDx

Logging

UDx code that runs in fenced mode is logged in UDxZygote.log, which is stored in the UDxLogs directory in the catalog directory of HP Vertica. Log entries for the side process are denoted by the UDx language, node, zygote process ID, and the UDxSideProcess ID.

UDx

Zygotes

dbadmin=> select * from UDX_FENCED_PROCESSES;
   node_name    |   process_type   | session_id | language | max_memory_java_kb | pid  | port  | status
----------------+------------------+------------+----------+--------------------+------+-------+--------
 v_dev_node0001 | UDxZygoteProcess |            |          |    140664675237920 | 3612 | 57868 | UP
 v_dev_node0002 | UDxZygoteProcess |            |          |    140307924516896 | 2754 | 47316 | UP
 v_dev_node0004 | UDxZygoteProcess |            |          |    140379328348192 | 6536 | 51902 | UP
 v_dev_node0003 | UDxZygoteProcess |            |          |    140342888235040 | 2467 | 56394 | UP
(4 rows)

UDx

Configuration Parameters

Fenced mode supports two configuration parameters:

• FencedUDxMemoryLimitMB - The maximum memory size, in MB, to use for Fenced Mode processes. The default is -1 (no limit). The side process is killed if this limit is exceeded.

• ForceUDxFencedMode - When set to 1, forces all UDxs that support fenced mode to run in fenced mode even if their definition specified NOT FENCED (C++ only). The default is 0 (disabled).

UDx

Examples

Directory structure for sample code on every cluster node:

/opt/vertica/sdk/examples
|
|-- PloadFunctions
|-- data
|-- RFunctions
|-- FilterFunctions
|-- ScalarFunctions
|-- HelperLibraries
|-- ApportionLoadFunctions
|-- TransformFunctions
|-- AnalyticFunctions
|-- ParserFunctions
|-- JavaUDx
|-- AggregateFunctions
|-- SourceFunctions

UDx in R

UDx in R

Included Packages

The HP Vertica R Language Pack includes the following R packages in addition to the default packages bundled with R:

• Rcpp

• RInside

• lpSolve

• lpSolveAPI

UDx in R

Installing Packages

You can install additional R packages not included in the HP Vertica R Language Pack by using one of two methods. You must install the same packages on all nodes.

• By using the R Language Pack R binary at the command line and the install.packages() R command. For example:

$ /opt/vertica/R/bin/R

> install.packages("<package-name>");

• By running the following command:

$ /opt/vertica/R/bin/R CMD INSTALL <path-to-package-tgz>
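Because the same packages must be installed on all nodes, it helps to generate the per-node commands in one place. The sketch below only prints the commands (a dry run). The function name `print_install_cmds` and the host names are hypothetical, and it assumes SSH access to each node and that the package file already exists at the same path on every node.

```shell
#!/bin/bash
# Hypothetical helper: print one install command per node so the same
# package file is installed everywhere. Host names are placeholders.
print_install_cmds() {
    local pkg="$1"; shift
    local host
    for host in "$@"; do
        printf 'ssh %s /opt/vertica/R/bin/R CMD INSTALL %s\n' "$host" "$pkg"
    done
}

# Dry run: review the output before executing any of it.
print_install_cmds /tmp/mypkg.tar.gz node01 node02
```

Printing the commands first makes it easy to confirm that no node is missing from the list.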

UDx in R

K-means (From Wikipedia, the free encyclopedia)

k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.

The problem is computationally difficult (NP-hard); however, there are efficient heuristic algorithms that are commonly employed and converge quickly to a local optimum. These are usually similar to the expectation-maximization algorithm for mixtures of Gaussian distributions via an iterative refinement approach employed by both algorithms. Additionally, they both use cluster centers to model the data; however, k-means clustering tends to find clusters of comparable spatial extent, while the expectation-maximization mechanism allows clusters to have different shapes.

The algorithm has nothing to do with and should not be confused with k-nearest neighbor, another popular machine learning technique.

UDx in R

Function Source

###
# @brief Runs K-means clustering algorithm (with K=2) on the input data frame.
#
# @param x input data frame with two float columns, representing
#          two-dimension points: (x float, y float).
# @return a data frame with three columns (the point coordinates plus
#         their assigned cluster {1..k}): (x float, y float, cluster int).
###
kmeansClu <- function(x)
{
  # Fix initial centroids to get predictable clustering.
  cx <- c(1.5, 2.5)
  cy <- c(3.5, 4.5)
  centroids <- data.frame(cx,cy)
  cl <- kmeans(x[,1:2], centroids)
  res <- data.frame(x[,1:2], cl$cluster)
  res
}

UDx in R

Factory Source

kmeansCluFactory <- function()
{
  list(name=kmeansClu, udxtype=c("transform"), intype=c("float","float"),
       outtype=c("float","float","int"), outnames=c("x","y","cluster"))
}

UDx in R

Load UDx

Create library.

dbadmin=> create library rlib as '/opt/vertica/sdk/examples/RFunctions/RFunctions.R' language 'R';
CREATE LIBRARY

Create function.

dbadmin=> create transform function kmeans as language 'R' name 'kmeansCluFactory' library rlib;
CREATE TRANSFORM FUNCTION

UDx in R

Explain

create table point_data(x float, y float) unsegmented all nodes;

explain select kmeans(x, y) over() from point_data;

Access Path:
+-ANALYTICAL [Cost: 1K, Rows: 10K (NO STATISTICS)] (PATH ID: 1)
|  Analytic Group
|   Functions: kmeans()
|  Execute on: Query Initiator
| +---> STORAGE ACCESS for point_data [Cost: 1K, Rows: 10K (NO STATISTICS)] (PATH ID: 2)
| |      Projection: public.point_data_node0001
| |      Materialize: point_data.x, point_data.y
| |      Execute on: Query Initiator

UDx in R

Explain

create table point_data_seg(n int, x float, y float) segmented by hash(n) all nodes;

explain select n,kmeans(x, y) over(partition by n) from point_data_seg;

Access Path:
+-ANALYTICAL [Cost: 8, Rows: 21 (NO STATISTICS)] (PATH ID: 1)
|  Analytic Group
|   Functions: kmeans()
|  Execute on: All Nodes
| +---> STORAGE ACCESS for point_data_seg [Cost: 7, Rows: 21 (NO STATISTICS)] (PATH ID: 2)
| |      Projection: public.point_data_seg_b0
| |      Materialize: point_data_seg.n, point_data_seg.x, point_data_seg.y
| |      Execute on: All Nodes

QUESTIONS?

Please attend our Q&A with HP Big Data experts today

Marina Ballroom, Lobby level

10:15 am - 10:30 am

12:00 pm - 1:00 pm

2:30 pm - 3:00 pm

4:30 pm - 5:00 pm