

IBM Security Guardium® V10.1 Deployment Guide for Hadoop Systems July 27, 2016

Revisions

12/16/2015 - Kerberos configuration IS required if either HBase or Hive is used. The guide previously said it was not needed for Hive, but this was incorrect.

01/12/2016 - Added Solr port (8983) and IE type (HTTP).

06/08/2016 - Updated the sizing rules of thumb. Added GPFS information for BigInsights. Removed the requirement to customize the HBase report, as this has been fixed in 10.1.

6/27/2016 - Added reference to the deployment guide for Hortonworks for Ranger integration.


Hadoop Deployment Guide

Objectives of this Guide
What’s new in V10 for Hadoop
What’s new in V10.1 for Hadoop
Planning
    Sizing (capacity planning)
    What version and distribution of Hadoop?
    What components of Hadoop are you running?
    What are the business requirements for auditing? Considerations for policies and reporting
    Considerations for policy rules
    Considerations for reporting
        What is an object in Hadoop?
        What is a verb for HDFS?
        What are verbs for HBase?
        Big SQL, Impala, and Hive verbs and objects
    Restrictions on Monitoring and other operational considerations
    Are you using Kerberos?
Deploying S-TAPS (and GIM clients)
    Configuring inspection engines
    Configuring inspection engines using Guardium API
Default Hadoop policy
Simple production policy
        Rule: Privileged user activity: Log full details
        Rule: Privileged user access to sensitive data: log policy violation
Built-in reports
    Hadoop Permissions
    Privileged users accessing sensitive data
    Access denied exception report
    Users on the Hadoop cluster
Policies that support redaction and blocking (Advanced)
    Prerequisites for blocking
    Blocking rules
    Prerequisites for redaction
    Redaction rules
Deployment Recommendations
Resources
Appendixes
    Appendix A. Kerberos setup instructions when HBASE or Hive is used
        Step 1: Creating a keytab for use with Guardium®
        Step 2. Configure Guardium
        Step 3: Basic operational testing of the configuration
        Alternate approach to creating Kerberos keytabs (for Cloudera)
    Appendix B. Using computed attributes to pull out db user from SOLR, Impala Hue, or Hive Hue/Beeswax
        Impala: computed attribute to get user name from Hue
        Hive: computed attribute to get user name from Hue/Beeswax
    Appendix C: Considerations for IBM InfoSphere BigInsights and Big SQL
        Hadoop on GPFS (IBM Spectrum Scale)
        Big SQL
    Appendix D. Supported Hadoop components (Hadoop 2)
        Notices


Objectives of this Guide

This guide is to help customers, IBM Business Partners, and IBM technical staff plan for and validate Guardium for Hadoop in a test or sandbox environment. Over time, we hope this guide can be augmented with more information about best practices and performance, but it will not and cannot replace the IBM Redbook, Deployment Guide for InfoSphere Guardium, which is currently the best source of information for overall planning, performance tuning, and troubleshooting.

Assumptions:

• A client has already set up the environment on their own or worked with technical sales, lab services, or a knowledgeable Business Partner to do initial sizing and to set up the environment, including network connectivity for the Guardium appliances.
• The person doing the implementation has Guardium knowledge and has involved the relevant people who understand the client’s Hadoop architecture.
• The client has a clear set of use cases to test based on an understanding of known capabilities in the product for Hadoop (as described in this document, for example). If a requirement is not addressed by the product, the client can use the Request for Enhancement process to ensure that IBM product management is aware of the request. https://www.ibm.com/developerworks/rfe/
• This space is changing frequently and Guardium is evolving constantly to keep up with the changes. Be sure to work with IBM to ensure that you have the latest information or to check if there is a later version of this guide.

What’s new in V10 for Hadoop

The major new enhancement in V10 is support for blocking and redaction. There were also many changes to improve overall parsing, collection of relevant information, and reporting. This section will continue to be updated as further improvements roll out through the maintenance stream.

• Blocking and redaction for Hive and Impala (this was already supported for Big SQL). For more information, see Policies that support redaction and blocking (Advanced).
• New inspection engines for Hive, Hue, Impala, and WebHDFS. For more information, see Table 1.
• Removed restriction on Hue Metastore. Previously only MySQL was supported. Now PostgreSQL and Oracle are also supported.
• Guardium now captures failed logins from Hue for MySQL, Oracle, and PostgreSQL datastores.
• Support for ODBC traffic (delivered originally in V9p500).
• Improved built-in reports that are more focused on security, such as a permissions report, privileged users accessing sensitive data, and so forth. For more information, see Built-in reports.
• Dropped support for CDH3 and for BigInsights 1.x.
• BigInsights: Added support for 4.0 and 4.1, and dropped GuardiumProxy support.
• If you are NOT using HBase or Hive, you no longer need to specifically configure Guardium for activity that uses Kerberos authentication. (This is also true in V9.)

What’s new in V10.1 for Hadoop

The major enhancement in 10.1 is focused on integration with Apache Ranger for Hortonworks distributions. If you are using SSL encryption with your Hortonworks Hadoop cluster, the S-TAP approach described in this guide will not work. Instead, refer to another guide: Guardium Activity Monitoring (and blocking) for Hortonworks Hadoop using Apache Ranger Integration.

Planning

Sizing (capacity planning)

This section refers to capacity planning, not sizing for pricing. Pricing for Hadoop is per

node.

The current rule of thumb is based on deployments that are not high volume in terms of what is being audited:

• 10 management/server nodes per collector
• 20+ data nodes per collector, assuming S-TAPs are needed for the data nodes (they are not needed for all components)
• Possibly even more nodes per collector if physical appliances are used

Your sizing may vary, of course.

The other option is to size by the PVUs of the nodes. This may result in oversizing if you

are not auditing significant amounts of traffic.

The capacity sizing guideline for Version 10 is 4000 PVU per collector.

http://www-01.ibm.com/support/docview.wss?uid=swg27046184

http://www.ibm.com/support/docview.wss?uid=swg21987893
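The two sizing approaches above can be compared with a few lines of arithmetic. The following is a minimal sketch only: the ratios are the rules of thumb quoted in this section, while the function names and the assumption that management-node and data-node loads simply add together are ours, not an IBM sizing method.

```python
# Rule-of-thumb ratios quoted in this guide; adjust them for your environment.
MGMT_NODES_PER_COLLECTOR = 10
DATA_NODES_PER_COLLECTOR = 20
PVU_PER_COLLECTOR = 4000


def collectors_by_node_count(mgmt_nodes, data_nodes_with_stap):
    """Estimate collectors from node counts.

    Assumption (ours): management-node load and data-node load are additive.
    Integer arithmetic is used so the result is an exact ceiling.
    """
    numerator = (mgmt_nodes * DATA_NODES_PER_COLLECTOR
                 + data_nodes_with_stap * MGMT_NODES_PER_COLLECTOR)
    denominator = MGMT_NODES_PER_COLLECTOR * DATA_NODES_PER_COLLECTOR
    return max(1, -(-numerator // denominator))


def collectors_by_pvu(total_pvu):
    """Estimate collectors from the total PVUs of the monitored nodes."""
    return max(1, -(-total_pvu // PVU_PER_COLLECTOR))


if __name__ == "__main__":
    # Hypothetical cluster: 5 management nodes, 40 data nodes, 10000 PVUs.
    print(collectors_by_node_count(5, 40))
    print(collectors_by_pvu(10000))
```

If the two estimates differ widely, the PVU-based figure is the one this guide warns may be oversized when little traffic is actually audited.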




What version and distribution of Hadoop?

Although Guardium supports multiple versions and distributions of Hadoop, it is important to record the exact distribution being used in your environment. Different levels of Apache Hadoop itself affect Guardium processing and thus could require additional patches, and different distributions support or include different add-on components, either open source or proprietary. Make sure that you know exactly what is and isn’t covered by Guardium from a monitoring perspective, and that you have all the correct Guardium prerequisites and patches installed for your version.

This is a space that is changing frequently, so double check with IBM if you are unsure or to get the latest information.

As of the writing of this guide:

Hadoop 1.x is used with the following distributions:

• Cloudera 4.x
• Hortonworks 1.x
• IBM BigInsights 2.1
• Pivotal (Greenplum) HD 1.2

Hadoop 2.x is used with the following distributions:

• Cloudera 5.x
• Hortonworks 2.x
• BigInsights 2.1.2, 3.0, 4.x
• Pivotal 1.5

Record your Hadoop distribution and release here:

_______________________________________________________________________

What components of Hadoop are you running?

The basics in Hadoop include the file system (HDFS), where the data is stored, and

MapReduce (or MapReduce 2, YARN), which is the framework for accessing and

analyzing data. From a monitoring perspective, if you capture activity on these two

components, you are covering basic auditing requirements because at the end of the day

everything (except management console traffic) goes through HDFS.


Figure 1. Hadoop architecture

However, HDFS activity is not the most auditor-friendly; it is somewhat like monitoring file accesses in a relational database. You may also want to monitor activity from other components that your organization is probably using, such as Hive, Big SQL, or Impala, whose activity is more akin to what one might expect from database access.

Example report outputs from some of these components are included in this guide.

You can record which components you are running in Table 1 as well as whether you

require monitoring above and beyond HDFS monitoring.

What are the business requirements for auditing? Considerations for policies and reporting

For monitoring purposes, you must think about the user, the data object being monitored

and what actions/commands are being done. In Guardium terminology, these are,

respectively, the DB User, the Object, and the Verb (the “command”). Those of you

familiar with Guardium will remember that these entities can be used in policy rules to

trigger particular actions, such as a real time alert.

So, as with any other auditing exercise, a key step in setting up your security policies is to

inventory your assets and map your inventory of assets to users and servers.

Considerations for policy rules

Guardium policy rule actions allow you not just to alert on or log policy violations, but also to filter certain traffic for performance.


For Hadoop traffic, you cannot use session-level filtering actions, such as Ignore S-TAP session. This is because Hadoop does not do session management in the same way as relational databases, where you log in to the database, which establishes a session, run a series of SQL statements within that session, and then log out again. With Hadoop, each command is its own session and can spawn many more sessions as work gets distributed throughout the cluster (see footnote 1).

In most cases, Guardium cannot catch failed logins for command line components.

Guardium can see failed logins from Hue and through IBM BigSQL.

You will get permission exceptions at the file system level, so you can report on those using the exceptions domain.

Considerations for reporting

This section lists the objects and commands (verbs) that are specific to Hadoop. For the commands, you can cut and paste these into a group in Guardium if you like, using the Group Builder tool. You will also need to create groups of users and objects based on your own environment.

What is an object in Hadoop?

An object is one of the following:

• HDFS files/directories
• MapReduce job name (YARN only). Prior to MapReduce 2, the MapReduce job name was not logged as a separate object, but you could obtain it by using the built-in MapReduce report, which used computed attributes to pull the job name out of the full message.
• IBM Big SQL, Impala, Hive, and HBase table and view names

What is a verb for HDFS?

Read verbs for HDFS:

getFileInfo

getBlockLocations

getFileLocation

getListing

Write verbs for HDFS:

addBlock

complete

create

delete

mkdirs

rename

1 Note that Big SQL traffic in BigInsights does have session information, even if the underlying HDFS does not.

What are verbs for HBase?

Read verbs:

list

scan

Write verbs:

createTable

disableTable

deleteTable

multi (this is an insert/update) (With the Ranger integration deployment option,

this is ‘put’)

drop
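To make the read/write split above easier to reuse when you build command groups or post-process exported report data, the verb lists can be captured in a small lookup. This is only an illustrative sketch: the verbs are copied from this section, but the dictionary layout and the function name are hypothetical and are not part of Guardium.

```python
# Verbs copied from this guide. "put" is what the guide says replaces "multi"
# when the Ranger integration deployment option is used.
VERB_GROUPS = {
    ("HDFS", "read"): {"getFileInfo", "getBlockLocations",
                       "getFileLocation", "getListing"},
    ("HDFS", "write"): {"addBlock", "complete", "create",
                        "delete", "mkdirs", "rename"},
    ("HBASE", "read"): {"list", "scan"},
    ("HBASE", "write"): {"createTable", "disableTable", "deleteTable",
                         "multi", "put", "drop"},
}


def classify_verb(component, verb):
    """Return "read", "write", or None if the verb is not in the lists above."""
    for (group_component, kind), verbs in VERB_GROUPS.items():
        if group_component == component.upper() and verb in verbs:
            return kind
    return None


if __name__ == "__main__":
    print(classify_verb("HDFS", "getListing"))
    print(classify_verb("HBase", "multi"))
```

Verbs that return None are a useful prompt to review your traffic and decide whether they belong in one of your own groups.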

Big SQL, Impala, and Hive verbs and objects

The Big SQL, Impala, and Hive query languages are like SQL, and thus normal parsing and logging rules apply as with most other relational databases in Guardium. Many of those commands are already included in Guardium command groups, such as ALTER commands, CREATE commands, and administrative commands. The extent of SQL syntax support varies greatly among these, with Big SQL having the most extensive support.

Restrictions on Monitoring and other operational considerations

• SSL encryption is not supported. The one exception to this is for Hortonworks deployments that use Ranger. Guardium can leverage Ranger auditing to capture traffic after decryption. This integration is covered in another deployment guide.
• UID chaining is not supported.
• Blocking and redaction are supported only for Big SQL, Hive, and Impala.
• Configuration Audit System and sensitive data discovery are not supported at this time.
• Guardium currently does not support auditing of administration commands (stopping and starting services, and so on).
• Guardium load balancing and failover options are not supported when Kerberos is used. (An F5 or other load balancer in which a virtual IP is used may be an option.)


Are you using Kerberos?

Guardium supports the use of Kerberos secure clusters with some restrictions (such as load balancing not being supported). In order to decrypt Kerberos user IDs, Guardium requires that keytab files be generated and placed in a specific location. Instructions are included in Appendix A. Kerberos setup instructions when HBASE or Hive is used.

If you are NOT using HBase or Hive, you do not need to configure Guardium for Kerberos.

Deploying S-TAPS (and GIM clients)

Only S-TAP and GIM clients are needed since Guardium does not yet support CAS and database discovery for the Hadoop platforms.

As with any S-TAP deployment, be sure to download the correct S-TAP for your

operating system and kernel level.

Figure 2 provides a high level overview of where S-TAPs should be installed depending

on what you want to monitor. Note that the graphic does not necessarily reflect physical

servers.

Figure 2. S-TAPs in Hadoop

Edge nodes: An S-TAP is recommended for edge nodes as well, particularly if you are

using them as a landing zone for data.


Configuring inspection engines

After S-TAPs are deployed, the appropriate inspection engines must be defined from the Guardium appliance. Inspection engines specify what traffic is to be monitored from a particular S-TAP host. For example, the figure below shows that on this particular S-TAP host, Guardium should monitor traffic on ports 8032 and 60000. Inspection engines are also where you define the protocol, such as Hadoop or HTTP.

Figure 3. S-TAP detailed architecture

Use the table below to record the ports and inspection engine protocols required for each node. Combine this information into a spreadsheet with the server IP (S-TAP host IP) and you will have everything you need to create grdapi commands if you prefer to use those instead of configuring each inspection engine in the Guardium UI.


Table 1. Indicate which services require monitoring and their associated ports

Required (Y/N)        | Hadoop Node      | Service                                                                                             | Default Ports                  | Your ports | IE Protocol
----------------------|------------------|-----------------------------------------------------------------------------------------------------|--------------------------------|------------|------------
                      | Namenode         | HDFS Name Node                                                                                      | 8020                           |            | Hadoop
                      | Namenode         | HTTP port (for WebHDFS)                                                                             | 50070                          |            | WEBHDFS
                      | Namenode         | Resource Manager (YARN only)                                                                        | 8032                           |            | Hadoop
Only for MapReduce 1  | Job Tracker      | MapReduce Job Tracker                                                                               | 8021, 9290, 50030              |            | Hadoop
                      | HBase Master     | HBase Master                                                                                        | 60000                          |            | Hadoop
                      | HBase Region     | HBase Region                                                                                        | 60020                          |            | Hadoop
                      | Hive Server 2    | Thrift protocol messages                                                                            | 10000                          |            | HIVE
                      | Hive Metastore   | Thrift protocol message, used to get Impala and Hive DB user from Hue (requires computed attribute) | 9083                           |            | HADOOP
                      | Impala daemons   | Impala                                                                                              | 21000                          |            | IMPALA
                      | Impala           | Impala from Hue                                                                                     | 21050                          |            | HIVE (because Impala from Hue uses HiveServer2)
                      | Management node  | BigSQL Server                                                                                       | 51000; 32051 (changed in 4.1)  |            | DB2
                      | Compute node     | BigSQL Server                                                                                       | 51000; 32051 (changed in 4.1)  |            | DB2
                      | Hue node         | Hue UI (Oracle backend)                                                                             | 1521                           |            | HUE
                      | Hue node         | Hue UI (MySQL backend)                                                                              | 3306                           |            | HUE
                      | Hue node         | Hue UI (PGSQL backend)                                                                              | 5432                           |            | HUE
                      | Solr Search node | Solr search                                                                                         | 8983                           |            | HTTP


Notes:

• Hive CLI: This is deprecated in Hadoop distributions and is not supported by Guardium.
• Impala: You must set up inspection engines on all nodes that run Impala daemons.
• HBase: S-TAPs are needed on all data nodes as well as the Master.
• Big SQL: If you are using Kerberos or GPFS, you must configure the S-TAP with the DB2_Exit, which is a safe, efficient way to capture Big SQL/DB2 encrypted traffic and/or GPFS. This means, however, that blocking and redaction are not supported. See the developerWorks article for more details on configuration and support for Big SQL: http://www.ibm.com/developerworks/data/library/techarticle/dm-1411hadoop-biginsights-guardium/index.html

Configuring inspection engines using Guardium API

These examples use port ranges to reduce the number of inspection engines that must be configured, but in general it is a best practice to limit the number of ports that Guardium is listening on. Create your own grdapi scripts to match your own configuration.

(This can also be done through the Guardium user interface under Manage > Activity Monitoring > S-TAP Control.)

/* Master or NameNode or…

/*YARN

grdapi create_stap_inspection_engine client=0.0.0.0/0.0.0.0 protocol=HADOOP ktapDbPort=8032 portMax=8050 portMin=8032 connectToIp=127.0.0.0 stapHost=10.19.232.21

/*HDFS



stapHost=10.19.232.21

/*Hive metastore to capture impala and hive db user through Hue



stapHost=10.19.232.21

/* WEBHDFS

grdapi create_stap_inspection_engine client=0.0.0.0/0.0.0.0 protocol=WEBHDFS


stapHost=10.19.232.21

/*HBASE Master



stapHost=10.19.232.21

/*Impala daemon

grdapi create_stap_inspection_engine client=0.0.0.0/0.0.0.0 protocol=IMPALA


stapHost=10.19.232.21

/*Impala from hue




grdapi create_stap_inspection_engine client=0.0.0.0/0.0.0.0 protocol=HIVE


stapHost=10.19.232.21

/*Hive

grdapi create_stap_inspection_engine client=0.0.0.0/0.0.0.0 protocol=HIVE


stapHost=10.19.232.21

/* BigSQL prior to 4.1

grdapi create_stap_inspection_engine client=0.0.0.0/0.0.0.0 protocol=DB2

dbInstallDir=/home/bigsql procName=/home/bigsql/sqllib/adm/db2sysc


stapHost=10.19.232.21

/* BigSQL 4.1 or later




stapHost=10.19.232.21

/*Hue – Oracle backend

grdapi create_stap_inspection_engine stapHost=10.19.232.21 protocol=HUE portMin=1521 portMax=1521 ktapDbPort=1521 connectToIp=127.0.0.0 client=0.0.0.0/0.0.0.0 dbInstallDir=/home/oracle11 procName=/home/oracle11/product/11.1.0/db_1/bin/oracle

/*Hue – MySQL backend

grdapi create_stap_inspection_engine stapHost=10.19.232.21 protocol=HUE portMin=3306 portMax=3306 ktapDbPort=3306 connectToIp=127.0.0.0 client=0.0.0.0/0.0.0.0 procName=MySQL

/*Hue – Postgres backend

grdapi create_stap_inspection_engine stapHost=10.19.232.21 protocol=HUE portMin=5432 portMax=5432 ktapDbPort=5432 connectToIp=127.0.0.0 client=0.0.0.0/0.0.0.0 procName=PGSQL

/* Solr search

grdapi create_stap_inspection_engine client=0.0.0.0/0.0.0.0 protocol=HTTP ktapDbPort=8983 portMax=8983 portMin=8983 stapHost=10.19.232.21

/* data nodes

/* HBASE Region



stapHost=10.19.232.21

/*Impala daemon

grdapi create_stap_inspection_engine client=0.0.0.0/0.0.0.0 protocol=IMPALA


stapHost=10.19.232.21

/*Big sql server prior to 4.1




stapHost=10.19.232.21


/* BigSQL 4.1 and later




stapHost=10.19.232.21
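If you have recorded your nodes, ports, and protocols in a spreadsheet as suggested for Table 1, a short script can turn each row into a grdapi command. The sketch below is an assumption-laden illustration: it uses only the parameter names that appear in the examples in this guide (client, protocol, ktapDbPort, portMin, portMax, connectToIp, stapHost), the function name and row layout are hypothetical, and you should check the generated commands against the grdapi documentation for your Guardium version before running them.

```python
def build_inspection_engine_command(stap_host, protocol, port_min, port_max=None,
                                    client="0.0.0.0/0.0.0.0",
                                    connect_to_ip="127.0.0.0"):
    """Build one grdapi create_stap_inspection_engine command string.

    The default client and connectToIp values mirror the examples in this
    guide; they are not recommendations.
    """
    if port_max is None:
        port_max = port_min  # single port rather than a range
    if port_max < port_min:
        raise ValueError("port_max must not be lower than port_min")
    params = [
        ("client", client),
        ("protocol", protocol),
        ("ktapDbPort", port_min),
        ("portMax", port_max),
        ("portMin", port_min),
        ("connectToIp", connect_to_ip),
        ("stapHost", stap_host),
    ]
    args = " ".join(f"{name}={value}" for name, value in params)
    return "grdapi create_stap_inspection_engine " + args


if __name__ == "__main__":
    # Hypothetical spreadsheet rows: (S-TAP host IP, IE protocol, first port, last port)
    rows = [
        ("10.19.232.21", "HADOOP", 8032, 8050),
        ("10.19.232.21", "HTTP", 8983, None),
    ]
    for host, protocol, first_port, last_port in rows:
        print(build_inspection_engine_command(host, protocol, first_port, last_port))
```

Components that need extra parameters, such as dbInstallDir and procName in the Big SQL and Hue examples, are deliberately left out of this sketch and would have to be added for those rows.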

Default Hadoop policy

Use the built-in Hadoop policy first (shown in Figure 4) to ensure traffic is being captured. It’s recommended that you first try this in a low-traffic test environment, and you may even want to add one more access rule to restrict traffic to just one server type, such as Hive, to reduce the amount of noise you see.

After you are comfortable that traffic is flowing to the collector, you can clone the default policy and create one that aligns with your security and compliance requirements, as described in Simple production policy.

Figure 4. Default Hadoop policy

There is a lot of noise with Hadoop internal communications, and the more background

noise you can filter out, the better. Rule 1 will filter out (not log) activity in which the

object is one of the items in the Hadoop Skip Objects group. You can edit this group to

add objects that you observe in your traffic.

The second rule filters out noisy commands that reflect internal communications.


The third rule is mostly used in a test environment where there may be non-Hadoop

servers associated with this appliance – it filters out traffic from any unrelated servers

based on the server IPs you specify.

Tip: You must put something in the Not Hadoop Servers group, even if it’s a dummy IP

or you will not collect traffic. If you don’t have any such servers, make sure you remove

this group altogether and uncheck the Not checkbox.

This rule also specifies LOG FULL DETAILS action for all nonfiltered traffic, which

may be handy for a small environment, but it is probably not what you want to do in

production. This may overload the collector because each command is logged in full.

Thus, you will likely modify or delete it after doing initial validation in a test

environment.

Recommendation: If you are not familiar with the way Guardium policies impact data collection and reporting, familiarize yourself with that before moving ahead with policy definitions. Some recommended resources on the Guardium community on developerWorks (bit.ly/guardwiki) include:

• 4-part video series on policies (https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/Wf32fc3a2c8cb_4b9c_83e4_09b3c6f60e46/page/InfoSphere Guardium Policies Deep Dive)

• Tech Talk: Reporting 101 (https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/Wf32fc3a2c8cb_4b9c_83e4_09b3c6f60e46/page/Reporting 101)

The Deployment Guide for InfoSphere Guardium also includes a good introduction. (http://www.redbooks.ibm.com/abstracts/sg248129.html?Open)

Simple production policy

Figure 5 shows a simple production policy. It uses the default logging of constructs only for most traffic and logs full details only for privileged user activity. With the default construct logging, each unique message construct is logged, and the number of times that construct is executed is aggregated once per hour. In general, Log Full Details is required only when exact timestamps are critical.

Figure 5. Simple Hadoop Production Policy

We also deleted the server IP filter rule that was present in the default policy.
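To illustrate why construct-only logging is lighter than Log Full Details, the sketch below aggregates executions of the same construct per clock hour, as described above. It is a conceptual illustration only, not how Guardium stores data internally; the function name and the sample events are hypothetical.

```python
from collections import Counter
from datetime import datetime


def aggregate_constructs(events):
    """Count executions of each construct per clock hour.

    events: iterable of (timestamp, construct) pairs. With construct-only
    logging, one counter per construct and hour survives; with full details,
    every event would be kept with its exact timestamp.
    """
    counts = Counter()
    for timestamp, construct in events:
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        counts[(construct, hour)] += 1
    return counts


# Hypothetical traffic: the same statement executed three times.
events = [
    (datetime(2016, 7, 27, 9, 5), "select * from customer"),
    (datetime(2016, 7, 27, 9, 40), "select * from customer"),
    (datetime(2016, 7, 27, 10, 1), "select * from customer"),
]
summary = aggregate_constructs(events)

if __name__ == "__main__":
    for (construct, hour), count in sorted(summary.items(), key=lambda item: item[0][1]):
        print(hour.isoformat(), count, construct)
```

Three executions collapse into two aggregated rows here, and the exact minutes are lost, which is why privileged user activity that needs exact timestamps gets its own Log Full Details rule.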

Rule: Privileged user activity: Log full details

This rule is an example of a business-oriented rule that might be needed if you need

detailed, exact time recording of all privileged user activity, as is required by some

compliance regulations.

Figure 6. Log full details of privileged user activity


Tip: Your Log Full Details rule must be before any rule that does not log full details (that

is, that uses normal logging of constructs only).

Rule: Privileged user access to sensitive data: log policy violation

In this case, a policy violation of medium severity will be logged whenever someone in

the privileged user group accesses any object (HDFS file, HBase Table, etc.) with the

string ‘customer’ in its name. (Most likely you will be creating a group of sensitive

objects.)

Figure 7. Log policy violation of access to sensitive objects

The violation will appear in the Policy Violations / Incident Management report.

(Comply>Reports>Incident Management).

Figure 8. Log policy violation for access to sensitive data


Built-in reports

This section includes a selection of prebuilt reports. To see the complete list of built-in reports for Hadoop, go to My Dashboards > Create a new dashboard, then click Add Report. Start typing ‘Hadoop’ and you will see the full list. A partial list is shown below.

Figure 9. Partial list of Hadoop reports in Guardium

Some of these reports are component-based, which is probably of most use when validating your configuration and confirming that you are capturing traffic from each component. We’ll go into a little more detail on the following reports, which are more focused on security and compliance:

• Hadoop - Permissions
• Privileged Users Accessing Sensitive Objects
• Exception report
• Hadoop logged in users

Hadoop Permissions

This report shows when permissions are changed on any Hadoop file system object.


Figure 10. Hadoop permissions report

This report uses a built-in group called Hadoop Permissions, shown below. You could also choose to include Hive, BigSQL, or Impala grant/revoke statements in this report by adding those commands as well, or catch those using another report (for example, the built-in Execution of Grant Commands report).

Figure 11. Group of Hadoop permission commands

Privileged users accessing sensitive data

This report relies on two groups: privileged users (Figure 12) and sensitive objects

(Figure 13). For users, include the complete user name. For Kerberos systems, include

the user name and the domain (or use wild card if appropriate).


Figure 12. Hadoop privileged users

For objects, you can use full file directory paths (for HDFS), wild cards, or a combination

of both. Note that if you are also specifically monitoring HBase, BigSQL, or Hive and

they also use ‘customer’ in their names, those will also match.

Figure 13. Group of sensitive objects
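The way the two groups combine can be sketched in a few lines. This is a hedged illustration, not Guardium's matching engine: it assumes that '%' is the wildcard character in group members and that matching is case-insensitive, and the function names, user names, and paths are hypothetical.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a group member such as %customer% into a compiled regex.

    Assumptions (ours): '%' matches any run of characters, and matching
    ignores case.
    """
    literal_parts = [re.escape(part) for part in pattern.split("%")]
    return re.compile(".*".join(literal_parts), re.IGNORECASE)


def is_privileged_sensitive_access(db_user, object_name,
                                   privileged_users, sensitive_patterns):
    """True when a privileged user touches an object in the sensitive group."""
    if db_user not in privileged_users:
        return False
    return any(wildcard_to_regex(pattern).fullmatch(object_name)
               for pattern in sensitive_patterns)


if __name__ == "__main__":
    privileged = {"hdfs", "svoruga@EXAMPLE.COM"}  # hypothetical group members
    sensitive = ["%customer%"]
    # An HDFS path and a Hive table name both match the same wildcard.
    print(is_privileged_sensitive_access("hdfs", "/user/data/customer/part-0000",
                                         privileged, sensitive))
    print(is_privileged_sensitive_access("hdfs", "customer_orders",
                                         privileged, sensitive))
```

The second call shows the effect noted above: a wildcard written with HDFS paths in mind also matches HBase, BigSQL, or Hive objects whose names contain the same string.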

And here is the report.


Figure 14. Privileged users accessing sensitive data

Access denied exception report

File permission exceptions are indicated by error code 101, which is used in the query conditions section of the Exception Report query builder shown in Figure 15.

Figure 15. Hadoop Permission Exception Report

Users on the Hadoop cluster

This report can help you understand which users (IDs) are accessing the Hadoop cluster. The Session Start attribute shows the latest date and time that the particular DB User, with the corresponding attributes of Client IP, Server IP, and Server Type, was active on the system.

Figure 16. Hadoop Users

Policies that support redaction and blocking (Advanced)

Guardium V10 supports redaction (using extrusion rules) and blocking (S-GATE Terminate) for Hive and Impala. (Blocking for BigSQL was supported in V9.x when the S-TAP is used.)

Here is a policy that includes both blocking and redaction rules. We’ll examine these

rules and their prerequisites in more detail in this section.

Figure 17. Policy with blocking and redaction rules added


Prerequisites for blocking

As with blocking on other databases, the S-TAP must be configured with firewall_installed=1 on the following nodes:

• BigSQL: on all nodes where BigSQL is running. Important: If you are using the communications exit to facilitate BigSQL auditing, then blocking is not supported.
• Impala: on all nodes where Impala is running
• Hive: on the node where HiveServer2 is running

For more information about other firewall parameters, see http://www-01.ibm.com/support/knowledgecenter/SSMPHH_10.0.0/com.ibm.guardium.doc.stap/stap/r_stapparmsw_firewall.html

For more information about the blocking actions (S-GATE Terminate), see http://www-01.ibm.com/support/knowledgecenter/SSMPHH_10.0.0/com.ibm.guardium.doc/protect/rule_actions.html

Blocking rules

Although there are additional nuances that are covered in other sources, blocking requires

a minimum of two actions:

1. S-GATE Attach: specify the conditions under which S-TAP must start watching the session traffic for possible blocking (which requires checking all actions against the policy on the collector).

2. S-GATE Terminate: when this condition is met, terminate the connection.

Important: Blocking has performance implications because S-TAP must hold the

command and check with the policy on the appliance to see if this command should be

allowed through. Thus, it is important that you limit which conditions you use for attach

to those that are not performance sensitive, such as privileged user access.

Also, because of the way Hive and Impala traffic is processed in Hadoop, you must do the following in the blocking policy rule:

1. Specify the DBTYPE (either Impala or Hive) in all S-GATE ATTACH and S-GATE TERMINATE policy rules.

2. Ensure that ATTACH happens on a combination of user and object and/or command.

The rules shown below in Figure 18 and Figure 19 can be translated as follows:

1. Whenever there is a connection from svoruga to any Hive table that includes

‘customer’ in the name, start “watching” this for possible blocking.

2. If svoruga issues a SELECT command (or any command in that group) on a

customer table, block the connection.





Figure 18. Attach rule specifies DB Type, User and Object


Figure 19. Block this user when they issue a SELECT command on customer objects

Figure 20 shows a SELECT from customer in Hive, issued using beeline, and how it was blocked. The policy violation report shows that the rule was triggered.

Figure 20. Hive select was blocked


Prerequisites for redaction

As a reminder, you must enable inspection of returned data in the “master” inspection

engine configuration. Go to Manage > Activity Monitoring > Inspection Engines and

select the Inspect Returned Data checkbox as shown here:

Figure 21. Required configuration to enable redaction

Redaction rules

Figure 22 below is one of the extrusion rules in our policy to inspect the returned data for

a pattern that matches social security numbers and then redact data in the pattern. Figure

23 has the same rules except for credit cards.

For information about the special pattern tests, such as those for credit cards and social

security numbers, see the Knowledge Center here: http://www-01.ibm.com/support/knowledgecenter/SSMPHH_10.0.0/com.ibm.guardium.doc/protect/r_patterns.html
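To make the idea of pattern-based redaction concrete, here is a minimal shell sketch that masks anything shaped like a social security number in a stream of returned data. It is an illustration only: the regular expression, the masking character, and the function name redact_ssn are assumptions made for this example and are not Guardium's built-in pattern tests or extrusion rule syntax.

```bash
# Illustration only: mask anything shaped like NNN-NN-NNNN read from stdin.
# The pattern and the '*' masking character are assumptions for this sketch;
# they are not Guardium's special pattern tests.
redact_ssn() {
    sed -E 's/[0-9]{3}-[0-9]{2}-[0-9]{4}/***-**-****/g'
}

# Hypothetical row of returned data.
printf 'name=Ann ssn=123-45-6789\n' | redact_ssn
```

In Guardium the equivalent matching and masking is done by the extrusion rule on the returned data, so nothing like this needs to run on the Hadoop nodes.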





Figure 22. Redact social security numbers (Hive, Impala or BigSQL)


Figure 23. Redact credit card data (Hive, Impala or BigSQL)

Figure 24 shows the redaction on Hive data whether the query was issued in the UI (Hue)

or in the beeline command line.

Figure 24. Redacted data for Hive


Deployment Recommendations

To avoid flooding the collector and to make problem diagnosis simpler, consider tactics

to reduce the amount and types of traffic that has to be processed by the Guardium

collector.

To limit data that must flow across the network to the appliance, restrict the

number of inspection engines you configure.

To limit the amount of data that is logged on the collector, put in conditions on

the policy.

One strategy might be to just configure for Hive command line queries and try that before

adding additional inspection engines and opening up the policy to more types of traffic

such as HDFS, which will generate a much higher volume of traffic.

For each new inspection engine that is configured, you must restart S-TAP.

Monitor the appliance as more services generate more traffic. The Guardium deployment Redbook includes details on how to monitor the appliance and make sure that the traffic is not excessive for the collector.


Resources

IBM Redbook: Information Governance Principles and Practices for a Big Data Landscape, http://www.redbooks.ibm.com/abstracts/sg248165.html?Open

IBM Redbook: Deployment Guide for InfoSphere Guardium

Guardium Activity Monitoring (and blocking) for Hortonworks Hadoop using Apache Ranger Integration, http://www.ibm.com/support/docview.wss?uid=swg21987893

IBM developerWorks article: “Protect sensitive Hadoop data using InfoSphere BigInsights Big SQL and InfoSphere Guardium”

IBM Security Guardium product documentation: http://www-01.ibm.com/support/knowledgecenter/SSMPHH/SSMPHH_welcome.html








Appendixes

Appendix A. Kerberos setup instructions when HBASE or Hive is used

This appendix provides the procedure to configure Guardium so that it can properly decrypt user names when Kerberos is used as the authentication mechanism for Hadoop and HBase or Hive is also used.

The configuration requires that each node in the cluster that is running Guardium S-TAP has a keytab that includes the Hadoop services that are running on the node. Guardium will use those keys to decrypt the user name for services running on the node.

Important: These instructions use Cloudera as the Hadoop distribution. The same basic

instructions can be used for other distributions, but the process to obtain the Kerberos

principals will vary.

A keytab file must be created/updated each time a principal’s encryption key has

been changed or when a service is added or deleted on a node.

Overview of the procedure:

Export keys from Kerberos and create a keytab on each node. This uses the option

-norandkey to export the principals. If you cannot use this option, see heading

below entitled Alternate approach to creating Kerberos keytabs (for Cloudera) for

a different set of instructions to do this step.

Configure Guardium to look for that keytab.

Sanity check the configuration.

Step 1: Creating a keytab for use with Guardium®

Important: If your Kerberos does not support use of -norandkey, use the alternate instructions below for this step.

1. Identify the principals needed from the Cloudera Manager interface.

From the Cloudera Administration menu, select Kerberos. This will provide a list of

principals that are used by each node. Make a copy of this list for your reference.

The principals are defined in the following manner: <service>/<nodename>@<kerberos domain>

Example: hdfs/[email protected]


2. On the Kerberos server as root or kinit with a user that has kadmin privileges, use the

kadmin.local command to access the Kerberos server. Verify that the principals

identified in step one are available.

Example:

kadmin.local: listprincs
HTTP/[email protected]
HTTP/[email protected]
HTTP/[email protected]
HTTP/[email protected]
HTTP/[email protected]
hbase/[email protected]
hbase/[email protected]
hbase/[email protected]
hbase/[email protected]
hbase/[email protected]
... etc.

3. Use the xst command in kadmin.local to export every principal for a particular node

into the same keytab file:

xst -k /tmp/krb5.keytab-nodename -norandkey <service>/<nodename>@<kerberos domain>

IMPORTANT: -norandkey is an important flag that prevents the invalidation of previous keytabs that included the principal being exported. For example, if you are using Cloudera, be sure to use -norandkey to avoid invalidating Cloudera keytabs.

NOTE: Each node may have a different service depending on what that node is running.

Example: Exporting all the services for one node (node01) to a single keytab file:

kadmin.local: xst -k /tmp/krb5.keytab-node01 -norandkey HTTP/[email protected]
kadmin.local: xst -k /tmp/krb5.keytab-node01 -norandkey hdfs/[email protected]
kadmin.local: xst -k /tmp/krb5.keytab-node01 -norandkey hbase/[email protected]
kadmin.local: xst -k /tmp/krb5.keytab-node01 -norandkey hive/[email protected]
kadmin.local: xst -k /tmp/krb5.keytab-node01 -norandkey hue/[email protected]
kadmin.local: xst -k /tmp/krb5.keytab-node01 -norandkey zookeeper/[email protected]

This will create a file named krb5.keytab-node01 in the /tmp directory on the server.

Create a keytab file for each node in your cluster.

4. Copy each keytab to its respective node as /etc/krb5.keytab.

NOTE: The name of the keytab should be the same on each node.

scp /tmp/krb5.keytab-node01 root@node01:/etc/krb5.keytab
scp /tmp/krb5.keytab-node02 root@node02:/etc/krb5.keytab
scp /tmp/krb5.keytab-node03 root@node03:/etc/krb5.keytab

5. Verify your keytab principals by running the klist command on the node:

klist -k <keytab>

Example:

[root@rh6-cl-01:]$ klist -k /etc/krb5.keytab
Keytab name: FILE:/etc/krb5.keytab
KVNO Principal
---- --------------------------------------------------------------------------
2 hbase/[email protected]
2 hbase/[email protected]
2 hbase/[email protected]
.... etc

6. Verify that you can authenticate with the keytab:

kinit -k -t /etc/krb5.keytab <service>/<nodename>@<kerberos domain>

Use the klist command to verify the authentication. In the example below, klist shows that no credentials are in use. A kinit is then done using the keytab file. A klist is issued to show that the credentials are in use.

Example:

[root@rh6-cl-01 ~]# klist
klist: No credentials cache found (ticket cache FILE:/tmp/krb5cc_0)
[root@rh6-cl-01 ~]# kinit -k -t /etc/krb5.keytab hdfs/[email protected]
[root@rh6-cl-01 ~]# klist
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: hdfs/[email protected]
Valid starting Expires Service principal
06/11/14 11:31:25 06/12/14 11:31:25 krbtgt/[email protected]
renew until 06/12/14 11:31:25
[root@rh6-cl-01 ~]#


7. Restart the UTAP so that it re-reads the keytab, using stop and start:

stop utap
start utap

Example:

[root@rh6-cl-01 ~]# stop utap
utap stop/waiting
[root@rh6-cl-01 ~]# start utap
utap start/running, process 366
[root@rh6-cl-01 ~]#
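Because every node needs one xst command per service, it can help to generate the commands rather than type them. The following is a minimal sketch that only prints the commands for review; it does not run them. The function name print_xst_commands, the node name node01.example.com, the realm EXAMPLE.COM, and the service list are hypothetical values for illustration. The sketch assumes the keytab is named after the short host name, as in the examples above, and that your kadmin.local accepts a query through its -q option.

```bash
# Print (do not run) the kadmin.local commands that export each service
# principal for one node into a single per-node keytab.
# Usage: print_xst_commands <nodename> <realm> <service>...
print_xst_commands() {
    local node="$1" realm="$2"
    shift 2
    # Name the keytab after the short host name, e.g. /tmp/krb5.keytab-node01
    local keytab="/tmp/krb5.keytab-${node%%.*}"
    local service
    for service in "$@"; do
        printf 'kadmin.local -q "xst -k %s -norandkey %s/%s@%s"\n' \
            "$keytab" "$service" "$node" "$realm"
    done
}

# Hypothetical node, realm, and services; substitute the principals that
# Cloudera Manager lists for the node.
print_xst_commands node01.example.com EXAMPLE.COM HTTP hdfs hbase hive
```

Review the printed commands against the principal list from step 1 before running them on the Kerberos server.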

Step 2. Configure Guardium

Many enterprise deployments will use the Guardium Installation Manager to push out

server-side updates, such as S-TAPs. This step has two subsections: one if GIM is

installed on the server node and one if it is not.

Use these steps if GIM is not installed on the server:

1. Stop STAP temporarily:

$ stop utap

2. Create a new directory: /usr/local/guardium/kerberos

To that new directory, copy the following files from directory /usr/local/guardium/lib64:

guard_stap_runner
guardkerbplugin.conf
libguardkerbplugin.so
utap.conf

Important: Make sure that all the files are readable by root, and that guard_stap_runner and libguardkerbplugin.so are executable by root.

3. Make sure the kerberos libraries (libkrb5.so, libk5crypto.so, etc) are in

one of /lib64, /usr/lib64, /lib, /usr/lib. If they are not there, then edit

guard_stap_runner so that the LD_LIBRARY_PATH includes their location.

4. Make sure that the kerberos configuration file is at /etc/krb5.conf. If not, then

edit the file /usr/local/guardium/kerberos/guardkerbplugin.conf

appropriately.

5. Make sure that the kerberos keytab file is at /etc/krb5.keytab. If not, then edit the

file /usr/local/guardium/kerberos/guardkerbplugin.conf appropriately.

6. Make sure that the guard_stap in the guard_stap_runner points to the executable

file and not a directory.


7. Make sure the kerberos configuration file (/etc/krb5.conf) includes the following

line in the [libdefaults] section:

clockskew=600

8. Configure the /usr/local/guardium/guard_stap/guard_tap.ini file as

needed, such as adding inspection engines and making sure the SQLGuard

section(s) point to the appropriate Guardium appliance(s).

Edit the kerberos_plugin_dir line to be this: kerberos_plugin_dir=/usr/local/guardium/kerberos

9. Replace the file /etc/init/utap.conf with the one in this directory:

$ mv /etc/init/utap.conf /etc/init/utap.conf.O
$ cp utap.conf /etc/init

Make sure the new file has the same ownership/permissions as the old one.

10. Restart STAP: $ start utap
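The file location and clockskew checks in steps 4, 5, and 7 above can be scripted before you restart S-TAP. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the default paths are the ones named in the steps, and the check only looks for a line beginning with clockskew= inside the [libdefaults] section without validating its value.

```bash
# Succeeds if <file> sets clockskew inside its [libdefaults] section.
has_libdefaults_clockskew() {
    awk '
        /^[[:space:]]*\[/ { in_lib = ($0 ~ /^[[:space:]]*\[libdefaults\]/); next }
        in_lib && /^[[:space:]]*clockskew[[:space:]]*=/ { found = 1 }
        END { exit(found ? 0 : 1) }
    ' "$1"
}

# Report each missing prerequisite; returns non-zero if any check fails.
# Usage: check_kerberos_prereqs [krb5.conf path] [keytab path]
check_kerberos_prereqs() {
    local conf="${1:-/etc/krb5.conf}" keytab="${2:-/etc/krb5.keytab}"
    local status=0
    [ -r "$conf" ]   || { echo "missing or unreadable: $conf"; status=1; }
    [ -r "$keytab" ] || { echo "missing or unreadable: $keytab"; status=1; }
    if [ -r "$conf" ] && ! has_libdefaults_clockskew "$conf"; then
        echo "no clockskew setting in [libdefaults] of $conf"
        status=1
    fi
    return "$status"
}

# Run as root on the node:
#   check_kerberos_prereqs && echo "Kerberos prerequisites look OK"
```

If your files are in other locations, pass those paths as arguments and make the matching change in guardkerbplugin.conf as described in the steps.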

Use these steps if GIM is installed on the server:

These steps assume the following:

The server has an S-TAP of Version 9 or later

Guardium Installation Manager (GIM) is installed.

All the relevant nodes are configured with the same directory structures

Create a new directory: /usr/local/guardium/kerberos

To that new directory, copy the following files from directory /usr/local/guardium/lib64:

libguardkerbplugin.so
guardkerbplugin.conf

Alternatively, create a file with the following lines:

#comment
KRB5RCACHETYPE=none
KRB5_KTNAME=/etc/krb5.keytab
KRB5_CONFIG=/etc/krb5.conf

Important: Make sure the files are readable by root, and libguardkerbplugin.so is

executable by root.

Make sure the kerberos libraries (libkrb5.so, libk5crypto.so, etc) are in one of /lib64, /usr/lib64, /lib, /usr/lib.

Make sure that the kerberos configuration file is at /etc/krb5.conf and that the

kerberos keytab file is at /etc/krb5.keytab. If not, then edit the file

/usr/local/guardium/kerberos/guardkerbplugin.conf appropriately.

Create a tar file of the kerberos/ directory.

Copy the tar file to the server node and extract it with the -C option to specify the destination directory.

Edit the /usr/local/guardium/guard_stap/guard_tap.ini to add the

Kerberos directory setting: kerberos_plugin_dir=/usr/local/guardium/kerberos

Stop S-TAP:

$ ps -ef | grep stap
$ kill <stap_pid>

The GIM (actually, the GIM supervisor process) restarts the S-TAP after it is killed.

Step 3: Basic operational testing of the configuration

1. When the STAP started above, it printed a PID for the STAP process. Make sure the STAP continues to run in that PID, and does not continually restart, by running this command several times:

$ ps -ef | grep guard_stap | grep -v grep

The STAP puts any error messages in the file /tmp/guard_stap.stderr.txt. Make sure that file is not growing in size. If it is, or if the STAP is constantly restarting, check the file contents for error messages.

2. Make sure the kerberos plugin is loaded. Using the PID from above, do the following (this example assumes the pid is 12345):

$ lsof -p 12345

Make sure that libguardkerbplugin.so is one of the open files listed.
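The two checks above can be wrapped in small helper functions. This is a minimal sketch: the function names are hypothetical, the PID 12345 is the same placeholder used above, and the ten-second default wait is an arbitrary choice for illustration.

```bash
# Succeeds if `lsof -p <pid>` output (read from stdin) lists the plug-in.
kerberos_plugin_loaded() {
    grep -q 'libguardkerbplugin\.so'
}

# Succeeds if <file> grew during the wait. A growing stderr file suggests
# that the S-TAP is logging errors or restarting.
# Usage: file_grew <file> [seconds]
file_grew() {
    local file="$1" wait_secs="${2:-10}"
    local before after
    before=$(( $(wc -c < "$file") ))
    sleep "$wait_secs"
    after=$(( $(wc -c < "$file") ))
    [ "$after" -gt "$before" ]
}

# Example use on a node (not run here):
#   lsof -p 12345 | kerberos_plugin_loaded && echo "plug-in loaded"
#   file_grew /tmp/guard_stap.stderr.txt 10 && echo "check the file for errors"
```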

Alternate approach to creating Kerberos keytabs (for Cloudera)

These instructions are to be used in a Cloudera deployment when the -norandkey option described in Step 1 above cannot be used. After you have the keytab, continue with the instructions above for configuring Guardium S-TAP.


Overview

In a Cloudera deployment where Kerberos is used for authentication, the following

guidelines can be used to create a single keytab file for use with Guardium S-TAP for

monitoring.

The instructions below utilize the ktutil that is found in the krb5-workstation for Linux

environments. Similar tools for Microsoft Windows Active Directory environments can

also be used.

In general, each node in the Cloudera deployment will contain a keytab for each service.

These services will vary depending on your cluster and the services installed/running on

each host. For each node, each service keytab will be read into ktutil. Once all the

required keytabs are read, then a single keytab is written. This resultant keytab can then

be used by Guardium S-TAP for monitoring.

Identifying keytabs

On each node, identify the service that is running. The services files are typically found

in /var/run/cloudera-scm-agent/process/.

Search for the latest process number for the service. In the example below, use the ls -l

command to list the possible folders for the hdfs processes. In the example, you’d find

that the newest folders contain the correct keytabs.

[root@cloudera-cl1-01 process]# ls -ld *hdfs*

drwxr-x--x 3 hdfs hdfs 460 Oct 31 13:08 2697-hdfs-NAMENODE

drwxr-x--x 4 httpfs httpfs 240 Oct 31 13:08 2704-hdfs-HTTPFS





drwxr-x--x 3 hdfs hdfs 460 Nov 5 11:24 3077-hdfs-NAMENODE

drwxr-x--x 4 httpfs httpfs 240 Nov 5 11:24 3084-hdfs-HTTPFS


drwxr-x--x 3 hdfs hdfs 460 Nov 5 13:40 3161-hdfs-NAMENODE

drwxr-x--x 4 httpfs httpfs 240 Nov 5 13:40 3168-hdfs-HTTPFS

drwxr-x--x 3 hdfs hdfs 460 Dec 3 11:49 3712-hdfs-NAMENODE

drwxr-x--x 4 httpfs httpfs 240 Dec 3 11:49 3719-hdfs-HTTPFS

Inside these folders, you will find the service keytab. For example, for HDFS, the

corresponding keytab is named hdfs.keytab.

Identify all keytabs that each node uses and note their locations.
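Finding the newest process folder for each service by eye is error prone, so the search can be scripted. The following is a minimal sketch that assumes, as in the listing above, that folder names have the form <number>-<service>-<ROLE> and that the highest number is the most recent. The function name latest_process_dir is hypothetical.

```bash
# Print the process directory with the highest numeric prefix for a role,
# for example "hdfs-NAMENODE". Returns non-zero if none is found.
# Usage: latest_process_dir <base directory> <service-ROLE>
latest_process_dir() {
    local base="$1" role="$2"
    local dir best="" best_num=-1 num
    for dir in "$base"/*-"$role"; do
        [ -d "$dir" ] || continue
        num="${dir##*/}"     # drop the leading path
        num="${num%%-*}"     # keep only the numeric prefix
        case "$num" in ''|*[!0-9]*) continue ;; esac
        if [ "$num" -gt "$best_num" ]; then
            best_num="$num"
            best="$dir"
        fi
    done
    [ -n "$best" ] && printf '%s\n' "$best"
}

# On a Cloudera node this prints the newest NAMENODE folder, if one exists.
latest_process_dir /var/run/cloudera-scm-agent/process hdfs-NAMENODE \
    || echo "no matching process directory found"
```

The service keytab (for example hdfs.keytab) is then expected inside the directory that is printed.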

You can test each keytab to verify that it is a valid keytab using kinit:

kinit -k -t <keytab> <service>/<nodename>@<domain>

Use klist to verify that you have obtained a ticket. For example, the following example

tests an HDFS keytab:

[root@cloudera-cl1-01 ~]# kdestroy

[root@cloudera-cl1-01 ~]# klist

klist: No credentials cache found (ticket cache FILE:/tmp/krb5cc_0)

[root@cloudera-cl1-01 ~]# kinit -k -t /var/run/cloudera-scm-agent/process/3712-hdfs-NAMENODE/hdfs.keytab hdfs/cloudera-cl1-01.guard.swg.usma.ibm.com@CLOUDERA

[root@cloudera-cl1-01 ~]# klist

Ticket cache: FILE:/tmp/krb5cc_0

Default principal: hdfs/cloudera-cl1-01.guard.swg.usma.ibm.com@CLOUDERA

Valid starting Expires Service principal

12/04/14 16:47:49 12/05/14 16:47:49 krbtgt/CLOUDERA@CLOUDERA

renew until 12/09/14 16:47:49

Once all the keytabs on all nodes are identified, the keytabs can be merged.

Merging the keytabs

Once the keytabs are identified, they will need to be merged together for use with

Guardium S-TAP.

Use ktutil to read in the keytabs and write keytabs. On each node, merge the identified

keytabs together.

The following example uses ktutil to read in the keytab and list the principal for HDFS.


# ktutil

ktutil: read_kt /var/run/cloudera-scm-agent/process/3712-hdfs-NAMENODE/hdfs.keytab

ktutil: list

slot KVNO Principal

---- ---- ---------------------------------------------------------------------

1 11 hdfs/cloudera-cl1-01.guard.swg.usma.ibm.com@CLOUDERA






7 10 HTTP/cloudera-cl1-01.guard.swg.usma.ibm.com@CLOUDERA






The following example uses ktutil to read in the keytab and list the principal for hive.

# ktutil

ktutil: read_kt /var/run/cloudera-scm-agent/process/3737-hive-HIVEMETASTORE/hive.keytab

ktutil: list

slot KVNO Principal

---- ---- ---------------------------------------------------------------------

1 9 hive/cloudera-cl1-01.guard.swg.usma.ibm.com@CLOUDERA






The following example uses ktutil to read in the HDFS and Hive keytabs, then writes out

a single krb5.keytab. Once the krb5.keytab is written, it is then read back into ktutil and

the principals listed.

# ktutil

ktutil: read_kt /var/run/cloudera-scm-agent/process/3712-hdfs-NAMENODE/hdfs.keytab

ktutil: read_kt /var/run/cloudera-scm-agent/process/3737-hive-HIVEMETASTORE/hive.keytab

ktutil: write_kt /tmp/krb5.keytab

ktutil: read_kt /tmp/krb5.keytab

ktutil: list

slot KVNO Principal

---- ---- ---------------------------------------------------------------------




















You can then use kinit as shown in previous examples to test the keytab and the different

principals.
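When a node has many service keytabs, the ktutil session can be generated instead of typed interactively. The following is a minimal sketch that prints the read_kt and write_kt commands used above, followed by quit, so that the output can be reviewed and then piped to ktutil. The function name ktutil_merge_script is hypothetical, and feeding ktutil its commands on standard input is an assumption about how you choose to run it.

```bash
# Print a ktutil command script that reads every input keytab and writes
# one merged keytab.
# Usage: ktutil_merge_script <output keytab> <input keytab>...
ktutil_merge_script() {
    local out="$1"
    shift
    local kt
    for kt in "$@"; do
        printf 'read_kt %s\n' "$kt"
    done
    printf 'write_kt %s\n' "$out"
    printf 'quit\n'
}

# Print the script for the HDFS and Hive keytabs from the example above.
ktutil_merge_script /tmp/krb5.keytab \
    /var/run/cloudera-scm-agent/process/3712-hdfs-NAMENODE/hdfs.keytab \
    /var/run/cloudera-scm-agent/process/3737-hive-HIVEMETASTORE/hive.keytab

# To apply it, pipe the same output to ktutil:
#   ktutil_merge_script /tmp/krb5.keytab <keytab>... | ktutil
```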

Once this single keytab is created (krb5.keytab in the example above), it can be copied or moved to its final location. Then the Guardium S-TAP can be configured to use this keytab. Use the steps described in Step 2. Configure Guardium to do this.


Appendix B. Using computed attributes to pull out db user from SOLR, Impala Hue, or Hive Hue/Beeswax

Here are the steps to create a computed attribute for SOLR. Use the same basic procedure

for Impala Hue and Hive Hue/Beeswax. The SQL expressions for those are included

below.

Navigate to Reports > Guardium Configuration Items > Query Entities & Attributes.

Right-click any row in the report and select Invoke… create_computed-attribute as shown below.


In the UI, enter information as shown below. (The attributeLabel can be anything you want.) The SQL used in the expression is shown here. You can copy and paste this into the expression field.

if( (LOCATE('user.name=hue&doAs=',FULL_SQL)>0) ,
substring_index(substring(GDM_CONSTRUCT_TEXT.FULL_SQL,(instr(GDM_CONSTRUCT_TEXT.FULL_SQL,'doAs=')+0)),'&',1),' ')


Click Invoke now. The attribute will now be available in the FULL SQL entity of the

access domain to be used in reports as shown here.


Figure 25. Computed attribute for Solr users now appears in Query Builder

Here is an example of what SOLR message traffic looks like before and after the computed attribute is applied.

GET /solr/yelp_demo/select?user.name=hue&doAs=svoruga&q=%2A%3A%2A&wt=json&rows=10&start=0&facet=true&facet.mincount=0&facet.limit=10&facet.field={%21ex%3Dstars}stars&f.stars.facet.limit=16&f.stars.facet.mincount=0&facet.field={%21ex%3Dbusiness_id}business_id&f.business_id.facet.limit=21&f.business_id.facet.mincount=0&facet.field={%21ex%3Dfull_address}full_address&f.full_address.facet.limit=11&f.full_address.facet.mincount=0&facet.range={%21ex%3Duseful}useful&f.useful.facet.range.start=0&f.useful.facet.range.end=0&f.useful.facet.range.gap=1&f.useful.facet.mincount=0&fq={%21tag%3Duseful}useful%3A[0+TO+0}&fq={%21tag%3Dfull_address}{%21field+f%3Dfull_address}az&fl=date%2Cid&hl=true&hl.fl=%2A&hl.snippets=3 HTTP/1.1
Host: cloudera-cl1-06.guard.swg.usma.ibm.com:8983
Accept-Encoding: gzip, deflate, compress
Accept: */*
User-Agent: python-requests/2.0.0 CPython/2.6.6 Linux/2.6.32-431.11.2.el6.x86_64

Result after the computed attribute is applied: doAs=svoruga
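To check what the Solr expression returns for a given request before you create the attribute, the same logic can be mimicked outside Guardium. The following is a minimal shell sketch of that logic and not a Guardium feature: if the text contains user.name=hue&doAs=, it returns everything from the first doAs= up to the next &; otherwise it returns a single space, as the SQL expression does. The function name extract_doas is hypothetical.

```bash
# Mimic the Solr computed-attribute expression on one request string.
# Usage: extract_doas <request text>
extract_doas() {
    local sql="$1" rest
    case "$sql" in
        *'user.name=hue&doAs='*) ;;
        *) printf ' \n'; return 0 ;;    # no Hue doAs marker: single space
    esac
    rest="doAs=${sql#*doAs=}"           # from the first 'doAs=' onward
    printf '%s\n' "${rest%%&*}"         # up to, not including, the next '&'
}

# Shortened form of the request shown above.
extract_doas 'GET /solr/yelp_demo/select?user.name=hue&doAs=svoruga&q=%2A%3A%2A&wt=json'
```

For the shortened request in the sketch, the function prints doAs=svoruga, which matches the result shown above.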

Impala: computed attribute to get user name from Hue

This SQL assumes you are gathering information based on a log full details policy rule:

IF (instr(GDM_CONSTRUCT_TEXT.FULL_SQL,'__THRIFT message={method=''query'',struct:1=')>0 ,
substring_index(substring(GDM_CONSTRUCT_TEXT.FULL_SQL,(instr(GDM_CONSTRUCT_TEXT.FULL_SQL,'string:4')+10)),'''',1), '')


This SQL is if you are not using log full details:

IF (instr(GDM_CONSTRUCT.ORIGINAL_SQL,'__THRIFT message={method=''query'',struct:1=')>0 ,
substring_index(substring(GDM_CONSTRUCT.ORIGINAL_SQL,(instr(GDM_CONSTRUCT.ORIGINAL_SQL,'string:4')+10)),'''',1), '')

Hive: computed attribute to get user name from Hue/Beeswax

This SQL assumes you are gathering information based on a log full details policy rule:


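IF (instr(GDM_CONSTRUCT_TEXT.FULL_SQL,'__THRIFT message={method=''get_table'',struct:0=')>0 ,
substring_index(substring(GDM_CONSTRUCT_TEXT.FULL_SQL,(instr(GDM_CONSTRUCT_TEXT.FULL_SQL,'string:3')+10)),'''',1), '')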

This SQL is if you are not using log full details:


IF (instr(GDM_CONSTRUCT.ORIGINAL_SQL,'__THRIFT message={method=''get_table'',struct:0=')>0 ,
substring_index(substring(GDM_CONSTRUCT.ORIGINAL_SQL,(instr(GDM_CONSTRUCT.ORIGINAL_SQL,'string:3')+10)),'''',1), '')


Appendix C: Considerations for IBM InfoSphere BigInsights and Big SQL

For most Hadoop activity, the recommendations in this guide apply to BigInsights just as they do to other Hadoop distributions, with the following exceptions.

Hadoop on GPFS (IBM Spectrum Scale)

As of Version 10.1 of Guardium, you can use the GPFS deployment of BigInsights by

configuring the HDFS Transparency Connector. You can find out more about the

connector here:

https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20%28GPFS%29/page/2nd%20generation%20HDFS%20Transparency%20Protocol

Big SQL

An S-TAP must be installed on all nodes in which a Big SQL engine is installed. The

support for Big SQL is quite comprehensive and is similar to what Guardium already

supports for DB2. For more details on configuration and capabilities, see the

developerWorks article on this topic here:



If Kerberos and/or GPFS is used, then you must configure a special communications exit

on each Big SQL node. Guardium provides a dynamically loaded shared library that

interacts with Big SQL. Big SQL will invoke functions within that library at run time

when it performs SQL and utility requests. Directions for this are included in the

developerWorks article.

Restrictions: Only monitoring and auditing are supported using the exit methodology.

Redaction and blocking are advanced features that are only supported using S-TAP.







Appendix D. Supported Hadoop components (Hadoop 2)

The table summarizes the degree to which particular Hadoop services traffic is parsed and logged. For example, a complete level of logging means you get user name, object names, etc, for that specific message type. Other components can only be captured at the lower level of MapReduce/YARN and/or HDFS traffic.

Component | Level of parsing/logging | Computed attributes required?

Hue | Traffic through Hue is captured except for Impala (bug opened). Failed logins for Hue are not currently captured (bug). | N/A

HDFS | Complete | No

MapReduce | Complete | Yes (included in prebuilt report)

YARN | Complete | No

HBase | Complete | No

Hive | Complete | Yes, if you use Hue, to return the user name for THRIFT messages into the DB User field. See "Hive: computed attribute to get user name from Hue/Beeswax" in Appendix B.

WebHDFS | Complete | No

Solr | Complete | Yes, to return DB User. Note: you will need a log full details policy rule for Solr traffic to get the computed attribute.

Impala | Complete | Yes, if you use Hue, to return the user name for THRIFT messages into the DB User field. See "Impala: computed attribute to get user name from Hue" in Appendix B.

SPARK | Not supported for in-memory usage (open requirement). You will catch HDFS and MapReduce as data is brought into memory, but interactions in memory will not be captured by Guardium. |

Shark | Not supported | N/A

Sqoop | Returned as HDFS and YARN | No

Pig | Returned as HDFS and YARN traffic | No

Zookeeper | Returned as HDFS and YARN traffic | No

Avro | Returned as HDFS and YARN | No

Flume | Returned as HDFS and YARN | No

Cascading | Returned as HDFS and YARN | No

Slider | Returned as HDFS and YARN | No

Storm | Returned as HDFS and YARN | No

Knox | Guardium catches the WebHDFS and other resulting traffic. | N/A

Ambari | Guardium catches resulting traffic issued from Ambari. | N/A

Tez | Returned as HDFS and YARN | No

Falcon | Returned as HDFS and YARN | No

NFS | Not supported | N/A

Java/Scala | Returned as HDFS and YARN | No

Notices

© Copyright IBM Corp. 2016. U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp. IBM, the IBM logo, Guardium, and ibm.com® are trademarks or registered trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” (www.ibm.com/legal/copytrade.shtml) Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both. Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates.
