Limited Availability Release 4.30.15 – Page 1 of 12
Using Hive-based Hadoop data sources with IBM Campaign
IBM® Campaign version 9.1.1.0.1 (9.1.1.0 Interim Fix 1) provides the ability to use Hive™-based
implementations of Apache Hadoop® with IBM Campaign. This functionality lets you
integrate your Hive-based Hadoop big data system with IBM Campaign in the following ways:
- Bring data into Campaign: Use your Hive-based Hadoop big data system as a user data
source for IBM Campaign. For example, create a marketing campaign flowchart in IBM
Campaign that uses customer account data from your big data instance to target
customers with specific account types and balances.

- Export data from Campaign: Send content from IBM Campaign to your Hive-based
Hadoop big data system. You can create a marketing campaign flowchart that pulls user
data from other data sources, such as DB2 or Oracle databases. Use the Campaign
flowchart to create a specific market segment, then use the Snapshot process in the
flowchart to export that segment back to your big data instance.
The ability to create temp tables for in-database optimization is also supported. Using
the IBM Campaign in-database optimization feature can improve flowchart performance.
When in-database optimization is on, processing is done on the database server and
output is stored in temporary tables on the database server whenever possible. For more
information, search for useInDbOptimization in the IBM Knowledge Center.
Requirements and Restrictions

This limited release has the following requirements and restrictions, in addition to any
statements specified in the accompanying readme file.
- This limited availability release is available only to specific customers who have
partnered with IBM and who have obtained the necessary download key.
- The following Hadoop distributions are supported, with Apache Hive as the connection
point: Cloudera, Hortonworks, IBM BigInsights™, and MapR.
- The minimum supported Hive version is 0.13.
- The integration is currently supported on Linux RHEL 6.3 or higher.
- Hive-based Hadoop is supported as a user data source only. It is not supported for
Campaign system tables.
- The DataDirect Apache Hive ODBC driver from Progress.com is required: DataDirect
Connect64(R) for ODBC Release 7.1.5. Note: The customer is responsible for purchasing
the driver.
- The integration does not currently support the IBM Campaign Cube, Optimize, or
Interact List process boxes, or eMessage Landing Pages in an Extract process box.
- You can use a Hive-based Hadoop user data source on an IBM Campaign system that is
integrated with IBM SPSS-MA Marketing Edition, and also with an IBM Campaign and
Digital Analytics integration.
Resources
Hive ODBC driver: https://www.progress.com/products/data-sources/apache-hadoop-hive
Hive: https://cwiki.apache.org/confluence/display/Hive/Home
HiveQL: https://cwiki.apache.org/confluence/display/Hive/LanguageManual
Hive HBase integration: https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration
Hue and Hadoop: http://gethue.com
IBM Campaign:
http://www.ibm.com/support/knowledgecenter/SSCVKV/product_welcome_kc_campaign.dita
Integration Architecture
Definition of Terms

Apache Hadoop® is an open-source software framework written in Java for distributed storage
and distributed processing of very large data sets on computer clusters built from commodity
hardware.
Apache Hive™ is a data warehouse infrastructure built on top of Hadoop to facilitate querying
and managing large datasets that reside in distributed storage. Hive provides a mechanism to
project structure onto this data and query the data using an SQL-like language called HiveQL.
Apache HBase™ is an open source, non-relational, distributed database written in Java. It runs
on top of HDFS, providing BigTable-like capabilities for Hadoop.
Hadoop Distributed File System (HDFS™) is a distributed file system that runs on commodity
hardware. It is designed to reliably store very large files across machines in a large cluster.
Hue is a Web interface for analyzing data with Apache Hadoop.
HiveQL (or HQL) is the Hive query language.
MapReduce is a programming model and an associated implementation for processing and
generating large data sets with a parallel, distributed algorithm on a cluster. MapReduce is the
heart of Hadoop®. It is this programming paradigm that allows for massive scalability across
hundreds or thousands of servers in a Hadoop cluster.
Big data distributions of Apache Hadoop: A number of vendors have developed their own
distributions of Hadoop, including Cloudera, Hortonworks, IBM BigInsights, and MapR.
User tables is an IBM Campaign term that indicates any data source that contains an
organization’s marketing data for access by IBM Campaign flowcharts. Typically, user tables
contain data about customers, prospects, and products. For example, customer account data
pulled from user tables could be used in a flowchart to target customers with specific account
types and balances.
Apache Hive overview

The Apache Hive data warehouse software facilitates querying and managing large datasets that
reside in distributed storage. Built on top of Apache Hadoop, Hive provides:
- Tools to enable easy data extract/transform/load (ETL)
- A mechanism to impose structure on a variety of data formats
- Access to files stored either directly in Apache HDFS or in other data storage systems
such as Apache HBase
- Query execution via MapReduce
Hive defines a simple SQL-like query language, called HiveQL (or HQL), that enables users
familiar with SQL to query the data.
You can use the Hue editor (Hadoop UI) to work with your big data instance (for example:
connect, view, create tables and databases).
For information about Hue and Hadoop: http://gethue.com.
For information about Hive: https://cwiki.apache.org/confluence/display/Hive/Home
For information about HiveQL: https://cwiki.apache.org/confluence/display/Hive/LanguageManual
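As a sketch of the kind of HiveQL you might run in the Hue Query Editor, the following statements create a database and a simple table and preview a few rows. The database, table, and column names here are hypothetical examples, not names used by the integration:

```sql
-- Hypothetical example for the Hue Hive Query Editor:
-- create a database and table, then preview rows.
CREATE DATABASE IF NOT EXISTS campaign_demo;

CREATE TABLE IF NOT EXISTS campaign_demo.customer_accounts (
  acct_id INT,
  acct_type_code STRING,
  acct_balance DOUBLE
);

SELECT acct_id, acct_type_code, acct_balance
FROM campaign_demo.customer_accounts
LIMIT 10;
```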
Installation and configuration

Follow these steps to install and configure the integration of Hive-based Apache Hadoop with
IBM Campaign.
A. Install the DataDirect Hive ODBC driver
The DataDirect driver for Apache Hive is a fully-compliant ODBC driver that supports multiple
Hadoop distributions out-of-the-box. IBM Campaign requires this driver for the integration.
Note: The customer is responsible for purchasing the driver.
Prerequisite: KornShell (ksh) must be installed on the IBM Campaign listener (analytic) server.
1. Obtain the Progress DataDirect Connect ODBC driver (Progress DataDirect
Connect64(R) for ODBC Release 7.1.5) for Apache Hadoop Hive:
https://www.progress.com/products/data-sources/apache-hadoop-hive
2. Download and install the DataDirect Hive driver on the machine where the IBM
Campaign listener (analytic) server is installed:
PROGRESS_DATADIRECT_CONNECT64_ODBC_7.1.5_LINUX_64.tar.Z
[DataDirectNew]# gunzip PROGRESS_DATADIRECT_CONNECT64_ODBC_7.1.5_LINUX_64.tar.Z
[DataDirectNew]# tar -xvf PROGRESS_DATADIRECT_CONNECT64_ODBC_7.1.5_LINUX_64.tar
3. Run the following command to start the installation:
>> ksh ./unixmi.ksh
4. Follow the prompts to complete the installation.
5. Perform basic testing of the driver:
>> ./ddtestlib /opt/Progress/DataDirect/Connect64_for_ODBC_71/lib/ddhive27.so
Next Step: Configure the DataDirect Hive ODBC driver.
B. Configure the DataDirect Hive ODBC driver
Follow the instructions below to configure the Progress DataDirect Hive ODBC driver after
installing it.
1. Modify the odbc.ini file to specify the Hive server information, as shown in the
example below. Customize the values shown in angle brackets to match your own
configuration; the values shown are only examples. Note that
RemoveColumnQualifiers must be set to 1.
[MapRHive]
Driver=/opt/Progress/DataDirect/Connect64_for_ODBC_71/lib/ddhive27.so
Description=DataDirect 7.1 Apache Hive Wire Protocol
ArraySize=16384
Database=<database-name>
DefaultLongDataBuffLen=1024
EnableDescribeParam=0
HostName=<hostname or ip of Hive server on Hadoop Distribution machine>
LoginTimeout=30
LogonID=<username of Hadoop Distribution machine>
MaxVarcharSize=2147483647
Password=<password of Hadoop Distribution machine>
PortNumber=<port number of Hive server on Hadoop Distribution machine>
RemoveColumnQualifiers=1
StringDescribeType=12
TransactionMode=0
UseCurrentSchema=0
WireProtocolVersion=0
2. Assuming that your ODBC driver is installed in the following location:
/opt/Progress/DataDirect/Connect64_for_ODBC_71
set the following environment variables.
Ensure that your LD_LIBRARY_PATH includes the following path:
/opt/Progress/DataDirect/Connect64_for_ODBC_71/lib
Ensure that your PATH includes the following:
/opt/Progress/DataDirect/Connect64_for_ODBC_71/tools
Set your ODBCINI variable to point to the correct INI file. For example:
ODBCINI=/opt/Progress/DataDirect/Connect64_for_ODBC_71/odbc.ini; export ODBCINI
Set your ODBCINST variable to point to the correct INI file. For example:
ODBCINST=/opt/Progress/DataDirect/Connect64_for_ODBC_71/odbcinst.ini; export ODBCINST
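The variable settings above can be collected into a single profile snippet. This is only a sketch that assumes the default installation path shown in this document; adjust DD_HOME if your driver is installed elsewhere:

```shell
# Sketch: environment setup for the DataDirect Hive ODBC driver.
# Assumes the default install location used in this document.
DD_HOME=/opt/Progress/DataDirect/Connect64_for_ODBC_71

# Prepend the driver library and tools directories.
export LD_LIBRARY_PATH=$DD_HOME/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
export PATH=$DD_HOME/tools:$PATH

# Point the ODBC driver manager at the INI files.
export ODBCINI=$DD_HOME/odbc.ini
export ODBCINST=$DD_HOME/odbcinst.ini
```

Placing these lines in the profile of the user who runs the IBM Campaign listener makes the settings take effect for every listener session.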
3. Verify the connectivity of the DataDirect ODBC driver and your Hive-based Hadoop big
data system:
cd /opt/Progress/DataDirect/Connect64_for_ODBC_71/samples/example
>> ./example
4. For IBM Campaign, set the ODBCINI and CAMPAIGN_HOME environment variables,
then run the IBM Campaign odbctest utility to verify connectivity to IBM Campaign:
cd <Campaign_Home>/bin
>> ./odbctest
Next Step: Map existing HBase tables to Hive
C. Map existing HBase tables to Hive
This step is required only if you have existing tables that were created in Apache HBase. In this
case, you must make the existing HBase tables available to Apache Hive by running the
CREATE EXTERNAL TABLE query. After you expose the HBase tables to Hive, the tables are
then available to IBM Campaign for table mapping within Campaign.
External Tables
The EXTERNAL keyword lets you create a table and provide a LOCATION so that Hive does
not use a default location for the table. This is useful if you already have generated data. An
EXTERNAL table points to any HDFS location for its storage, rather than being stored in the
warehouse folder specified by the hive.metastore.warehouse.dir configuration property. When
you drop an EXTERNAL table, the data in the table is NOT deleted from the file system.
Example
1. Open the Hue editor and open the Hive Query Editor.
2. Create and execute the CREATE EXTERNAL TABLE command. Use the following
query as an example, substituting your own table name, field names, and other
parameters. This example uses 'CampaignAccounts' as the HBase table name and 'f' as
the column family name.
CREATE EXTERNAL TABLE HiveExt_CampaignAccounts (
    Acct_ID INT, Indiv_ID INT, HHold_ID INT, Acct_Type_Code STRING,
    Acct_Status_Code INT, Acct_Open_Date INT, Acct_Balance STRING,
    Acct_Balance_Last_Month STRING, Acct_Balance_Avg_6Month STRING,
    Credit_Limit STRING, Acct_Number STRING, Last_Contact_Date STRING,
    Due_Date STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' =
':key,f:Indiv_ID,f:HHold_ID,f:Acct_Type_Code,f:Acct_Status_Code,f:Acct_Open_Date,f:Acct_Balance,f:Acct_Balance_Last_Month,f:Acct_Balance_Avg_6Month,f:Credit_Limit,f:Acct_Number,f:Last_Contact_Date,f:Due_Date')
TBLPROPERTIES ('hbase.table.name' = 'CampaignAccounts');
For information about Hive HBase integration:
https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration
Next Step: Configure the BigDataODBCHive data source template in IBM Campaign
D. Configure the BigDataODBCHive data source template in IBM Campaign
To communicate with your Hive-based Hadoop system, IBM Campaign requires that you
configure a data source based on the BigDataODBCHive template. This template is supplied
with the integration as BigDataODBCHive.xml.
The following procedure explains how to do two things:

- Import the required template into IBM Campaign. Importing a template makes it
available for creating data sources for big data based on the template.
- Use the imported template to create a separate data source for each Hive implementation
that you will use with IBM Campaign (MapR, Cloudera, Hortonworks, BigInsights).
Follow the procedure below to import the template and then configure a data source for each big
data instance.
1. Use the configTool utility to import the BigDataODBCHive.xml template into Campaign.
configTool is in <Marketing_Platform_Home>/tools/bin. For more
information, see the IBM Marketing Platform Administrator’s Guide in the IBM
Knowledge Center.
BigDataODBCHive.xml is in <Campaign_Home>/conf.
The following example imports the template into the default Campaign partition,
partition1. Replace <Campaign_Home> with the complete path to the IBM Campaign
installation directory.
./configTool -i -p "Affinium|Campaign|partitions|partition1|dataSources" -f
<Campaign_Home>/conf/BigDataODBCHive.xml
2. Use IBM Campaign to create a data source based on the BigDataODBCHive datasource
template. Create a separate data source for each Hive implementation that you will use
with IBM Campaign (MapR, Cloudera, Hortonworks, BigInsights):
a. In IBM Campaign, choose Settings > Configuration.
b. Go to Campaign | partitions | partition[n] | dataSources.
c. Select the BigDataODBCHive template.
d. Supply a New category name that identifies the Hive dataSource, for example
Hive_MapR or Hive_Cloudera or Hive_HortonWorks or Hive_BigInsights.
e. Fill in the fields to set the properties for the new data source, then save your
changes.
Important! Pay special attention to the properties described below. This is only a
partial list of the properties included in this template. For guidance on properties
not listed below, see the IBM Campaign Administrator’s Guide in the IBM
Knowledge Center.
Property: Description

ASMUserForDBCredentials: The Campaign system user.

DSN: The DSN name as specified in the Progress DataDirect odbc.ini file for the Hive-based
Hadoop big data instance.

JndiName: A JNDI name is not required.

SystemTableSchema: The user of the database that you will connect to.

OwnerForTableDisplay: The user of the database that you will connect to.

LoaderPreLoadDataFileCopyCmd: SCP is used to copy data from IBM Campaign to a temp
folder on the Hive-based Hadoop system. This value can either specify the SCP command or
call a script that specifies the command. See "Exporting data from IBM Campaign to a
Hive-based Hadoop system."

LoaderPostLoadDataFileRemoveCmd: Data files are copied from IBM Campaign to a temp
folder on the Hive-based Hadoop system. You must use the SSH "rm" command to remove the
temporary data file. See "Exporting data from IBM Campaign to a Hive-based Hadoop
system."

LoaderDelimiter: The delimiter, such as comma (,) or semicolon (;), that separates fields in
the temporary data files that are loaded into the big data instance. Tab (\t) is not supported.

SuffixOnTempTableCreation, SuffixOnSegmentTableCreation, SuffixOnSnapshotTableCreation,
SuffixOnExtractTableCreation, SuffixOnUserBaseTableCreation, SuffixOnUserTableCreation:
Use the same delimiter character as specified for LoaderDelimiter.

UseExceptForMerge: Set to FALSE. Hive does not support the EXCEPT clause, so a setting of
TRUE can result in process failures.

DateFormat, DateTimeFormat, DateTimeOutputFormatString: All date strings must use the
dash (-) character to format dates. Hive does not support any other separator characters for
dates. Example: %Y-%m-%d %H:%M:%S

Type: Set to BigDataODBC_Hive.

UseSQLToRetrieveSchema: Set to FALSE.
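To make the properties above concrete, a data source named Hive_MapR might end up with values like the following. Every value here is an illustrative example, not a recommendation; the user names and DSN are assumptions, and this is not a complete property list:

```
ASMUserForDBCredentials = asm_admin
DSN = MapRHive
SystemTableSchema = mapr
OwnerForTableDisplay = mapr
LoaderDelimiter = ,
UseExceptForMerge = FALSE
DateFormat = %Y-%m-%d
DateTimeFormat = %Y-%m-%d %H:%M:%S
Type = BigDataODBC_Hive
UseSQLToRetrieveSchema = FALSE
```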
Next step: Map the Hive data source in IBM Campaign
E. Map the Hive data source in IBM Campaign
Mapping user tables is the process of making external data sources accessible in Campaign. A
typical user table contains information about your company's customers, prospects, or products,
for use in marketing campaigns. You must map any data sources that you configured to make
that data accessible to processes in flowcharts.
Prerequisite
You must define audience levels in IBM Campaign before you map user tables. For instructions,
see the IBM Campaign Administrator’s Guide in the IBM Knowledge Center.
Procedure
1. Select Settings > Campaign Settings > Manage Table Mappings.
2. In the Table Mappings dialog, click Show User Tables.
3. Click New Table. The New Table Definition dialog opens.
4. Click Next.
5. Select Map to Existing Table in Selected Database.
6. Select the BigDataODBCHive datasource, then click Next.
7. Follow the prompts to map the table.
For complete instructions on how to map user tables, see the IBM Campaign Administrator’s
Guide in the IBM Knowledge Center.
F. Configure SSH on the Campaign listener server
To enable data file transfers between the IBM Campaign listener (analytic) server and the
Hive-based Hadoop big data instance, you must configure SCP and SSH seamless login. SSH
allows a secure connection between two computers; most systems use the OpenSSH client.
Procedure
1. On the machine that is running the IBM Campaign listener (analytic) server, configure
SSH for no password prompt for authentication. Log in as the user who is running the
IBM Campaign listener server and run the following commands, substituting the
username@IP address of your Hive server (in this example, the MapR machine):
>> ssh-keygen -t rsa
>> ssh <username>@<hive-server-IP> mkdir -p .ssh
>> cat .ssh/id_rsa.pub | ssh <username>@<hive-server-IP> 'cat >> .ssh/authorized_keys'
>> ssh <username>@<hive-server-IP> "chmod 700 .ssh; chmod 640 .ssh/authorized_keys"
2. Verify password-less authentication using RSA-based authorized keys. Run each
command, substituting the actual username@IP address, and verify that it works. You
need a local file called test1 for this test to work:
>> ssh <username>@<hive-server-IP>
>> scp test1 <username>@<hive-server-IP>:/tmp
>> ssh <username>@<hive-server-IP> "rm /tmp/test1"
Exporting data from IBM Campaign to a Hive-based Hadoop system

You can send data from IBM Campaign to your Hive-based Hadoop big data system. For
example, create a flowchart in IBM Campaign that pulls user data from one or more data
sources, such as DB2, Oracle, and Hive databases. Configure the Snapshot process in a flowchart
to export the data to your big data instance. When you run the flowchart, the snapshot data is
exported to the Hive database.
The IBM Campaign configuration settings for the Hive dataSource determine how the data file is
transferred from Campaign to Hive.
Procedure
There are two main things that you need to do:
1. Configure the Hive data source (in Campaign | Partitions | Partition[n] | dataSources) to
specify the required SCP and SSH commands:
The LoaderPreLoadDataFileCopyCmd value copies the data from IBM
Campaign to Hive. This value can either specify the SCP command or call a script
that specifies the SCP command. See the two examples provided below.
The LoaderPostLoadDataFileRemoveCmd value must specify the SSH "rm"
command to remove the temporary file after it has been loaded into Hive.
Note that SSH must be configured on the Campaign listener server to support this
functionality (see “F. Configure SSH on the Campaign listener server”).
2. Configure the Snapshot process in a flowchart to obtain input data from one or more data
sources and export the data to your Hive database. Design the flowchart as you normally
would, including any desired processes such as Select and Merge.
When you run the flowchart, the following actions occur:
1. The entire dataset is exported to a temporary data file at
<Campaign_Home>/partitions/partition[n]/tmp.
2. Using the command specified for LoaderPreLoadDataFileCopyCmd, the temporary
data file is copied to the Hive server.
3. The data is loaded into a Hive table using the HiveQL LOAD DATA command. This is
handled internally; no manual action is required.
4. Using the command specified for LoaderPostLoadDataFileRemoveCmd, the temporary
data file is removed from the Hive server.
Example 1: Configuring export to MapR
This example shows IBM Campaign configured for export to MapR (Campaign | Partitions |
Partition[n] | dataSources | Hive_MapR). The LoaderPreLoadDataFileCopyCmd uses SCP to
copy the data file from the local machine running IBM Campaign to a temp directory on the
remote machine running the Hive server (the MapR machine). The
LoaderPostLoadDataFileRemoveCmd uses SSH rm to remove the file.
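The configuration screen itself is not reproduced here, but the two loader properties for such a setup might look like the following sketch. The <DATAFILE> token, user name, and IP address are assumptions used for illustration; substitute the values for your environment:

```
LoaderPreLoadDataFileCopyCmd = scp <DATAFILE> mapr@192.168.72.128:/tmp
LoaderPostLoadDataFileRemoveCmd = ssh mapr@192.168.72.128 "rm /tmp/<DATAFILE>"
```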
Example 2: Configuring export to Cloudera using a script
Using a script can be useful to avoid file permission issues. If there are any issues related to file
permissions, the LOAD command will not be able to access the data file and the command will
fail. To avoid this type of issue, you can write your own shell or command-line script to SCP the
data file to Hive and update the file permissions of the data file.
This example shows IBM Campaign configured to use a script for export to Cloudera (Campaign
| Partitions | Partition[n] | dataSources | Hive_Cloudera). LoaderPreLoadDataFileCopyCmd
calls a script that uses SCP to copy the data file from the local machine running IBM Campaign
to a temp directory on the remote Cloudera machine. LoaderPostLoadDataFileRemoveCmd
removes the file.
Here is the script that is called by LoaderPreLoadDataFileCopyCmd in this example. The
script is on the IBM Campaign listener (analytic) server machine. The script executes the SCP
command as the user “cloudera” on the destination server (my.company.com) to copy the file to
the tmp directory. The SSH command connects as the same user to make sure that the
permissions are correct for the load and removal processes that will follow.
copyToHadoop.sh:
#!/bin/sh
scp $1 cloudera@my.company.com:/tmp
ssh cloudera@my.company.com "chmod 0666 /tmp/`basename $1`"
Hive Query Language conformance

Apache Hive has its own query language called HiveQL (or HQL). While based on SQL,
HiveQL does not strictly follow the full SQL-92 standard. HiveQL offers extensions that are
not in SQL, including multi-table inserts and CREATE TABLE AS SELECT, but it offers only
basic support for indexes. HiveQL also lacks support for transactions and materialized views,
and offers only limited subquery support.
Therefore, when IBM Campaign is integrated with Hive, adhere to the following guidelines:
- The integration supports only SQL that conforms to HiveQL.
- If you are writing raw queries for use in IBM Campaign, confirm that the queries work
on Hive.
- If you use raw SQL in IBM Campaign process boxes, custom macros, or derived fields
for pre- and post-processing, you may need to modify those queries to work with Hive.
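For example, because Hive 0.13 does not support the EXCEPT clause, a raw SQL query that uses it would typically be rewritten with a LEFT JOIN anti-join (or NOT EXISTS). The table and column names below are hypothetical:

```sql
-- Standard SQL (fails on Hive 0.13):
-- SELECT Acct_ID FROM all_accounts
-- EXCEPT
-- SELECT Acct_ID FROM closed_accounts;

-- Equivalent HiveQL using a LEFT JOIN anti-join:
SELECT a.Acct_ID
FROM all_accounts a
LEFT OUTER JOIN closed_accounts c ON a.Acct_ID = c.Acct_ID
WHERE c.Acct_ID IS NULL;
```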
The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you.
This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice.
Any references in this information to non-IBM websites are provided for convenience only and do not in any manner serve as an endorsement of those websites. The materials at those websites are not part of the materials for this IBM product and use of those websites is at your own risk.
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the web at www.ibm.com/legal/copytrade.shtml.
© Copyright International Business Machines Corporation 2015. US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.