

© 2014 - 2016 Waterline Data, Inc. All rights reserved.

Waterline Data Inventory

Installation and Administration Guide

Product Version 2.5.0 Document Version 4.19.2016

Table of Contents

Preface

Part 1  Overview & Architecture
o Product overview
o Architecture
o MapReduce
o Hive
o Access to data
o User access and roles

Part 2  Setup and Installation
o System requirements
o Checklist for installing Waterline Data Inventory
o Plan tasks that impact cluster administration
o Validate Hadoop configuration
o Validate security requirements
o Configure a service user
o Configure a MySQL repository database
o Download Waterline Data Inventory
o Run the installer
o Update Hive Server with Waterline Data Inventory JARs

Part 3  Administration
o Starting Waterline Data Inventory services
o Running Waterline Data Inventory jobs
o Monitoring Waterline Data Inventory jobs
o Setting up users to access Waterline Data Inventory
o Backing up Waterline Data Inventory metadata

Part 4  Tuning
o Configuration controls
o Data source configuration
o Port configuration
o Repository configuration
o Authentication configuration
o Security configurations
o Metadata and staging file locations
o Debugging configuration
o Format discovery tuning
o Profiling tuning
o Search functionality
o Discovery functionality
o Browser app functionality


Preface

Waterline Data Inventory reveals information about the metadata and data quality of files in a Hadoop distributed file system (HDFS) so the users of the data can identify the files they need for analysis and downstream processing. The application installs on an edge node in the cluster and runs MapReduce jobs to collect data and metadata from files in HDFS and Hive. It then discovers relationships and patterns in the profiled data and stores the results in its metadata repository. A browser application lets users search, browse, and tag HDFS files and Hive tables using the collected metadata and the relationships Waterline Data Inventory discovers.

This document describes the process of installing Waterline Data Inventory on a node in a Hadoop cluster.

Related Documents

Waterline Data Inventory Sandbox, available for CDH, HDP, and MapR as images for VirtualBox and VMware.

Waterline Data Inventory User Guide, available from the menu in the browser application and in the /docs directory in the installation.

Waterline Data Inventory REST API Reference Guide, available at apidocs.waterlinedata.com.

For the most recent documentation and product tutorials, sign in to the Waterline Data community support site, support.waterlinedata.com.

For more information on configuring Hive with Kerberos in an enterprise environment, see "Multi-User Scenarios and Programmatic Login to Kerberos KDC":

cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients#HiveServer2Clients-Multi-UserScenariosandProgrammaticLogintoKerberosKDC


Part 1

Overview & Architecture

o Product overview
o Architecture
o MapReduce
o Hive
o Access to data
o User access and roles


Product overview

The Waterline Data Inventory software builds an inventory of data assets in HDFS. It profiles the data assets to produce field-level data quality statistics and identify representative data so users can understand the content and quality of the data quickly.

Data Catalog. Waterline Data Inventory gives users access to the file and field-level metadata available for the entire catalog of data assets. In addition, for the assets each user has authorization to view, Waterline Data Inventory displays a rich view of data-based details such as minimum, maximum, and most frequent values.

Tags. Waterline Data Inventory provides an interface for users to label data with information about its business value. It distributes these labels or “tags” to similar data across the cluster, producing a powerful index for business language searches. It enables business users to find the right data quickly and to understand the meaning and quality of the data at a glance.

Lineage. Waterline Data Inventory uses the detailed metadata it collects about each file and table to identify lineage relationships among the data assets. With labels you add to identify the landing locations of data brought into the cluster, the lineage helps users trace the data they are interested in to reliable sources.

User roles. Roles assigned to Waterline Data Inventory users let administrators control which users can create annotation tags, which users can apply the tags to data, and which users can approve or reject the metadata suggested by Waterline Data Inventory's discovery operations.


Architecture

Waterline Data Inventory comprises a profiling and discovery engine, a metadata repository, a series of search indexes, and an application server. These components can be installed together on a server that can access the name node of the Hadoop cluster (an edge node). If needed, they can also be distributed across more than one server to provide additional storage or compute resources.

The profiling and discovery engine runs most of its operations as MapReduce jobs; in addition, it performs some operations on the local server. Most of the data used to calculate profiling statistics (selectivity, min/max, etc.) is held in HDFS files; the portion of this data that's accessible through the browser application is held in the metadata repository on the edge node.

Figure: Waterline Data Inventory runs on one or more edge nodes

MapReduce

Waterline Data Inventory runs profiling jobs against HDFS and Hive data using MapReduce. The application code is transferred to each cluster node to be executed against the data that resides on that node. The results are accumulated in HDFS files. Waterline Data Inventory runs MapReduce jobs against the resulting metadata to determine tag association suggestions; the tag association results are also stored in HDFS. Waterline Data Inventory moves a portion of this HDFS data to the repository on the edge node to support viewing the data through the browser.

It is important to understand that Waterline Data Inventory jobs are standard MapReduce jobs: the expertise your organization already has for tuning cluster operations applies to running and tuning Waterline Data Inventory jobs.

Like any other MapReduce jobs, the Waterline Data Inventory jobs can be tuned to fit the cluster resources. Waterline Data Inventory configuration files provide controls for the number and scope of map and reduce tasks. In addition, you can specify overrides to cluster settings as options on the Waterline Data Inventory job command line.
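Such overrides follow the standard Hadoop convention of -D options passed on the job command line. As an illustration of that convention only (the JAR, class, and values below are placeholders, not the actual Waterline Data Inventory command line, which is described in Running Waterline Data Inventory jobs):

$ hadoop jar <job jar> <driver class> -D mapreduce.map.memory.mb=2048 -D mapreduce.job.reduces=4 <job arguments>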


Because these jobs read all the data included in the inventory, the user running the jobs needs to have Read access to the data; make sure to configure the Waterline Data Inventory service user with security in mind to ensure this broad access is appropriately controlled.

Hive

Waterline Data Inventory can include Hive tables in the cluster catalog. In addition, given the appropriate access to data, users can generate Hive tables from HDFS files. This level of interaction with Hive requires some manual configuration and privileges.

Figure: Configuration points between Waterline Data Inventory and Hive

Hive authorization

Profiling Hive tables requires that the Waterline Data Inventory service user have read access to the Hive database and table.

Browsing Hive tables requires that the active user be authorized to access the database and table and the Waterline service user have read access to the backing file.

Creating Hive tables requires that the active user can create tables in at least one Hive database and have write access to the folder where the source files reside.


Hive authorization on Kerberized cluster

During profiling, Waterline Data Inventory interacts with Hive through the metastore. In a Kerberized environment, typically the only user allowed to access Hive data through the metastore is the Hive superuser; to perform this operation, Waterline Data Inventory needs the following configurations:

The Hive access URL must include the Hive superuser principal name.

The Waterline Data Inventory service user must be configured as a proxy user in Hadoop so it can perform profiling operations as the Hive superuser.

These configuration requirements are described in detail in the installation steps in this document.

Profiling tables and their backing files

When Waterline Data Inventory profiles Hive tables, it uses the Hive metadata to determine the backing directory for each table and includes that directory and constituent files in the inventory, whether the HDFS files have been profiled or not. By default, it does not profile the backing HDFS files. If you choose to independently profile the backing files, it is possible that Waterline Data Inventory will show different views for the same data based on the input formats and parsing used for the HDFS file itself and for the Hive table. For example, the Hive table may have different column names, a subset of columns or rows, and may use a different delimiter to determine the fields within each row of data.

Creating Hive tables from Waterline Data Inventory

Creating Hive tables from inside Waterline Data Inventory requires that the active user have:

Authorization for at least one Hive database available through Waterline Data Inventory.

Write access to the parent directory of the source file; alternatively to avoid giving write access to data directories, you can configure Waterline Data Inventory to copy files to a separate location before creating the Hive table.

When users create a Hive table, the Hive table shows up in Waterline Data Inventory immediately; however, detailed metadata and data aren't available until a Hive profiling job runs for that database.

Access to data

For Waterline Data Inventory to produce an inventory of HDFS, it needs read access to all the files that are included in the inventory. In addition, it needs read access to Hive tables. Waterline Data Inventory uses HDFS to store the profiling information it collects from HDFS and Hive tables: for the metadata, Waterline Data Inventory needs write access into a storage location in HDFS.


Profiling HDFS files

To profile HDFS files, Waterline Data Inventory connects to:

1. HDFS Root Node: Waterline Data Inventory’s connection to HDFS for profiling includes:
   o Read access for all HDFS files
   o Write access to areas to collect profiling results

2. Repository: The Waterline Data Inventory engine writes profiling and discovery results to a repository on the edge node using the Waterline Data Inventory service user credentials.

Figure: Configure the HDFS and repository connections to profile HDFS files


Browsing HDFS files

When data scientists and analysts access HDFS files through Waterline Data Inventory, they see only files and tables that they have permission to view: all file system operations are performed as the signed-in user. The user access permissions are established through the operating system permissions or through a Hadoop authorization system such as Ranger or Sentry. Waterline Data Inventory uses its status as a proxy user on HDFS to perform operations with the authority available to the current user.

For end-users to browse HDFS files, Waterline Data Inventory connects to:

1. HDFS Root Node

2. Repository

Users connect to the Waterline Data Inventory web application:

3. Through a URL pointing to the Waterline Data Inventory application server (host and port), using their authentication credentials, whether through explicit login or through authentication configured for the browser, such as when single sign-on is configured using Kerberos.

Figure: Configure the App Server connection for user access


Profiling Hive tables

Waterline Data Inventory uses its read access to include Hive tables in the inventory. In addition to access to Hive databases, Waterline Data Inventory uses read/write access to a staging directory in HDFS where it holds profiling information for Hive tables. This can be the same staging area used for profiling HDFS files.

To profile Hive tables, Waterline Data Inventory connects to:

1. HDFS Root Node, including write access to an HDFS staging area for profiling results.

2. Repository.

3. Hive database access: for Waterline Data Inventory to include Hive tables, it needs read access to each Hive database to be included.

Figure: Waterline Data Inventory uses MapReduce to profile Hive tables


Browsing and Creating Hive tables

Users can create new Hive tables from HDFS files they identify in Waterline Data Inventory.

For end-users to create and browse Hive tables, Waterline Data Inventory connects to:

1. Repository

2. Hive database access. Waterline Data Inventory’s connection to Hive for browsing includes read access to all Hive databases. To create new Hive tables from HDFS files, Waterline Data Inventory needs write access to the databases where users would expect new tables to appear.

End-users log into Waterline Data Inventory with:

3. Browser URL pointing to the Waterline Data Inventory application server combined with user credentials, whether through explicit login or authentication configured for the browser.

Figure: Users see the Hive tables they have access to and can create new Hive tables from HDFS files


User access and roles

Many Waterline Data Inventory user actions can be controlled by assigning roles to users.

Data access is controlled by your cluster's authorization system, whether that is Linux and HDFS file permissions, ACLs defined on the cluster, or authorization management systems such as Ranger or Sentry.

Waterline Data Inventory has a primary administrator created when the services are first run. That administrator can assign roles to additional users.

To allow authorized users to access Waterline Data Inventory, an administrator can configure the application to accept logins from any authorized user (Waterline Data Inventory automatically creates a profile with default roles), or can configure the application so that user profiles must be manually entered and configured with roles.


Part 2 Setup and Installation

o System requirements
o Checklist for installing Waterline Data Inventory
o Plan tasks that impact cluster administration
o Validate Hadoop configuration
o Validate security requirements
o Configure a service user
o Configure a MySQL repository database
o Download Waterline Data Inventory
o Run the installer
o Update Hive Server with Waterline Data Inventory JARs


System requirements

Waterline Data Inventory runs on an edge node in a Hadoop cluster. The following specifications describe the platform compatibilities and the minimum requirements for the edge node.

Hadoop compatibility

Cloudera CDH 5.2, 5.3, 5.4, 5.5/5.6

Hortonworks HDP 2.2, 2.3, 2.4

The edge node on which Waterline Data Inventory is installed needs to have the Hadoop and Hive clients required to access the Hadoop namenode and HiveServer2.

Hive compatibility

Reading Hive tables created in Waterline Data Inventory requires Hive 0.13 or later. All of the supported distributions have this support.

Oozie compatibility

Use Oozie 4.2 or later to schedule and monitor Waterline Data Inventory MapReduce jobs. There are variations in Oozie support across distributions: for example, HDP 2.2 with Oozie 4.1 is not compatible, while CDH 5.4 with Oozie 4.1 is compatible.

Edge node minimum requirements

Optimizing input/output operations per second (IOPS) on the edge node is the most important factor in providing the best performance for Waterline Data Inventory operations. Provisioning a higher-IOPS disk can reduce the overall profiling time significantly. For example, going from 3,000 IOPS to 10,000 IOPS can improve performance by a factor of 1.5.

Two to four 500 GB disks, the faster the disks the better

2 quad-core CPUs, running at least 2-2.5 GHz

32 GB of RAM

Bonded Gigabit Ethernet or 10 Gigabit Ethernet

Linux operating system compatible with the configured Hadoop distribution

JDK version 1.7.x or 1.8.x

Be sure that you have installed the Java Cryptography Extension (JCE) Unlimited Strength policy files available from Oracle:

www.oracle.com/technetwork/java/javase/downloads/index.html
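One common way to verify that the unlimited-strength policy is active is this JDK one-liner, which prints a value greater than 128 (typically 2147483647) when the policy files are installed:

$ jrunscript -e 'print(javax.crypto.Cipher.getMaxAllowedKeyLength("AES"))'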


Database configuration

One of the following:

Apache Derby (provided)

MySQL configured to use case-sensitive collation and to use the InnoDB Storage Engine (Waterline Data Inventory does not have a dependency on the version of MySQL)

The speed of the repository database is an important component of the overall performance of Waterline Data Inventory operations.

Waterline Data Inventory is shipped with Embedded Derby by default. This document provides instructions to configure Waterline Data Inventory to work with MySQL. To configure Waterline Data Inventory to work with other relational databases that support JDBC connectivity, contact [email protected].
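To verify that an existing MySQL instance meets these requirements, you can check the server collation and the available storage engines with standard MySQL commands (the case-sensitive collation itself is set when you create the repository database, as described in Configure a MySQL repository database):

mysql> SHOW VARIABLES LIKE 'collation_server';
mysql> SHOW ENGINES;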

Kerberos compatibility

This release is compatible with Kerberos version 5.

Browser compatibility

Waterline Data Inventory supports the following browsers. If your cluster uses Kerberos, be sure to configure Kerberos support in end-users' browsers:

Microsoft Internet Explorer 9 and later (not supported on Mac OS)

Chrome 36 or later

Firefox 31 or later

Multi-byte support

Waterline Data Inventory handles cluster data transparently: assuming the data is stored in formats that Waterline Data Inventory reads, the application doesn't enforce any additional limitations beyond what Hadoop and its components enforce. That said, there are places where the configuration of your Hadoop environment needs to align with what data you are managing, such as:

Operating system locale

Character set supported by Hive client and server

Character set supported by Waterline Data Inventory repository database client and server (Derby uses the system defaults on the node where it is installed)

The Waterline Data Inventory browser application allows users to enter multi-byte characters to annotate HDFS data. Again, where Waterline Data Inventory interfaces with other applications, such as Hive, it enforces the requirements of the integrated application.


Checklist for installing Waterline Data Inventory

Plan for cluster changes such as setting up storage locations in HDFS, configuring the Waterline Data user with trusted delegation privileges in HDFS, and updating the Hive installation. > Details

Validate Hadoop. Make sure services are running, you have Hive access, and you know the IP address or host name of the HDFS root. > Details

Validate security requirements. Make sure you know how your Hadoop system is authenticated. > Details

Configure a service user. Make sure this user can authenticate in your Hadoop environment, has write access to locations on the local node and in a storage location on HDFS, and has authorization to read data in HDFS and Hive, and is configured for trusted delegation with "proxyuser" entries in the HDFS configuration. > Details

(Optional) Create a MySQL repository database if you choose not to use the default Derby repository. > Details

Download and extract the new version into the install location. > Details

Run the installer, which walks through steps for configuring the user authentication method, a repository database, and connection information for HDFS and Hive data sources. > Details

Update JAR files in Hive. Move Waterline Data Inventory custom functions to a directory where HiveServer2 can locate them. Restart HiveServer2 after this change (the restart is not needed for basic validation of the Waterline Data Inventory installation). > Details


Plan tasks that impact cluster administration < Back to Checklist

The following list identifies the changes that need to happen outside of the Waterline Data Inventory environment or on other cluster nodes. If you do not have control over or access to these components, you may need to plan in advance to perform these changes:

HDFS metadata storage

Waterline Data Inventory requires permanent storage on HDFS. This location is used by all profiling jobs and the application server. The Waterline Data service user needs read/write/execute access to this location.
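For example, one way to create such a location and grant the service user access, assuming the storage location sits under the service user's HDFS home directory (the path is a placeholder; run as an HDFS superuser):

$ sudo -u hdfs hdfs dfs -mkdir -p /user/waterlinesvc
$ sudo -u hdfs hdfs dfs -chown -R waterlinesvc /user/waterlinesvc
$ sudo -u hdfs hdfs dfs -chmod 700 /user/waterlinesvc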

HiveServer2 restart

Waterline Data Inventory installation inserts JAR files into the HiveServer2 auxlib directory. There are symbolic links from these files to the HiveServer2 directory. These files and links will be automatically updated if Hive is co-located with Waterline Data Inventory; if Hive is installed on a separate node, these files need to be manually copied to the Hive installation.

To allow other applications to read Hive tables created from inside Waterline Data Inventory, HiveServer2 needs to be restarted after replacing these files.

Trusted delegation on HDFS

Setting up the Waterline Data Inventory service user with trusted delegation requires adding the Waterline Data Inventory service user as a proxy user for HDFS. Applying these configuration changes can require restarting cluster components.

Repository database instance

Waterline Data provides a Derby database manager where the Waterline Data Inventory repository can reside. You may choose to configure a MySQL database instead; if so, plan the location, storage requirements, and backup processes.

Installation node setup

On the installation node itself, you'll need to identify locations for the software, search indexes, and log files. If you plan to use the provided Derby instance for the repository, it too will need a location on the edge node.

Validate Hadoop configuration < Back to Checklist

Hadoop is a complex system with many overlapping configurations and controls. You can ensure that Waterline Data Inventory will install smoothly if you first validate that the existing Hadoop components are running and communicating properly among themselves. The following steps prepare for Waterline Data Inventory installation by exercising each of the places where Waterline Data Inventory interacts with Hadoop.

1. File system URI. Identify the host name for the file system, referred to in this document as <HDFS file system host>.

Typically, this is the fs.defaultFS parameter in Hadoop's core-site.xml file. You can find the host name for your cluster using the following command:

$ hdfs getconf -confKey fs.defaultFS
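Typical output looks like the following (the host name is illustrative):

hdfs://namenode01.acme.com:8020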

For MapR, use:

$ cat /opt/mapr/conf/mapr-clusters.conf

2. Cluster status. Verify that Hadoop components are running.

You can use a cluster management tool (Ambari, Cloudera Manager, or MapR Control System) to verify service status. If the cluster is not managed using one of these tools, check individual services by running the command line for the component. For example:

$ hdfs dfsadmin -report

$ yarn version

$ beeline (!quit to exit)

Before installing Waterline Data Inventory, make sure that HDFS, MapReduce, and YARN are running; if Hive is configured for your cluster, Hive and its constituent components (Hive Metastore, HiveServer2, MySQL Server, WebHCat Server) must be running.

3. HDFS access. Check that users have access to HDFS files and Hive tables.

Waterline Data Inventory depends on the cluster authorization system to manage user access to HDFS resources. Verify that you have access to some HDFS files and Hive tables so that when you use Waterline Data Inventory to access the same files, you can validate that the proper access is available. You'll need access to these files as an end-user, not just as the Waterline Data Inventory service user.

To verify that you have access to these files and tables, you can, for example:

Use Hue or Ambari to navigate to existing data in HDFS or to load new data.

Verify that you can access files you own as well as files for which you have access through group membership. If you can't sign into Hue or Ambari or can't access HDFS files from inside one of these tools or from the command line, ask your Hadoop administrator for appropriate credentials.

Use Beeswax (accessible through Hue and Ambari) or Beeline (Hive command line) to verify that you can access existing databases and tables. If you can't sign into Beeline or can't access Hive tables, ask your Hadoop administrator for appropriate credentials.
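For example, you can spot-check access from a terminal with standard HDFS and Beeline commands (paths and names are placeholders; on a Kerberized cluster the Beeline URL also needs a principal parameter):

$ hdfs dfs -ls /user/$(whoami)
$ hdfs dfs -ls /data/shared
$ beeline -u jdbc:hive2://<HiveServer2 host>:10000
0: jdbc:hive2://...> SHOW DATABASES;
0: jdbc:hive2://...> SELECT * FROM <database>.<table> LIMIT 5;
0: jdbc:hive2://...> !quit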


4. MapReduce Configuration. Run a sample MapReduce job.

All of the Hadoop distributions provide sample code that you can run directly from the provided JAR file:

hadoop-mapreduce-examples-<version>.jar

where the version may be specific to the distribution and version of Hadoop. Run an example MapReduce job as follows:

Use "locate" or "find" to determine where the examples JAR file is. Run the sample job "pi" with values for the number of map tasks (10) and

samples (1000) to run:

hadoop jar <full path>/hadoop-mapreduce-examples-*.jar pi 10 1000
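If you need to find the examples JAR first, a standard search works (the location varies by distribution):

$ find / -name "hadoop-mapreduce-examples*.jar" 2>/dev/null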

If the example runs successfully, you'll see output that shows the MapReduce job running:

Number of Maps = 10
Samples per Map = 1000
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Starting Job
15/06/01 04:48:41 INFO impl.TimelineClientImpl: Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/
15/06/01 04:48:41 INFO client.RMProxy: Connecting to ResourceManager at sandbox.hortonworks.com/10.0.2.15:8050
15/06/01 04:48:42 INFO input.FileInputFormat: Total input paths to process : 10
15/06/01 04:48:42 INFO mapreduce.JobSubmitter: number of splits:10
15/06/01 04:48:42 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1432835905062_0003
15/06/01 04:48:43 INFO impl.YarnClientImpl: Submitted application application_1432835905062_0003
15/06/01 04:48:43 INFO mapreduce.Job: The url to track the job: http://sandbox.hortonworks.com:8088/proxy/application_1432835905062_0003/
15/06/01 04:48:43 INFO mapreduce.Job: Running job: job_1432835905062_0003
15/06/01 04:48:59 INFO mapreduce.Job: Job job_1432835905062_0003 running in uber mode : false
15/06/01 04:48:59 INFO mapreduce.Job:  map 0% reduce 0%
15/06/01 04:49:57 INFO mapreduce.Job:  map 10% reduce 0%
15/06/01 04:49:58 INFO mapreduce.Job:  map 70% reduce 0%
15/06/01 04:49:59 INFO mapreduce.Job:  map 80% reduce 0%
15/06/01 04:50:30 INFO mapreduce.Job:  map 90% reduce 0%
15/06/01 04:50:32 INFO mapreduce.Job:  map 100% reduce 0%
15/06/01 04:50:34 INFO mapreduce.Job:  map 100% reduce 100%
15/06/01 04:50:35 INFO mapreduce.Job: Job job_1432835905062_0003 completed successfully
Job Finished in 116.303 seconds
...
Estimated value of Pi is 3.14080000000000000000

You'll see a similar output pattern when Waterline Data Inventory MapReduce jobs run.


Validate security requirements < Back to Checklist

Before installing Waterline Data Inventory on a secure cluster, you need to be sure that you know what security measures your cluster employs so you can properly integrate the application to use the sanctioned security channels.

1. Authentication. Make sure the computer on which you will install Waterline Data Inventory is configured to access the service used for authentication. On that computer, validate that you can access your authentication system.

Operating system authentication through SSH: You've probably already confirmed this in previous steps: validate that you can access the edge node and, from the edge node, that you can access HDFS.

Kerberos authentication: Check that the Kerberos client is installed and configured to contact the KDC:

$ kinit

This command should prompt for the current user's password. If it does, type anything and exit the command. If it doesn't, this computer is not yet configured with Kerberos. Work with your Kerberos administrator to install Kerberos, add this computer to the Kerberos database, and generate a keytab for this computer as a Kerberos application server.

If kinit works as expected, make sure that the Kerberos configuration file (/etc/krb5.conf) includes a description of the realm in which Waterline Data Inventory resides. For example, for a server in a company called "Acme":

[libdefaults]
  default_realm = ACME.COM
  dns_lookup_realm = false
  dns_lookup_kdc = false
  ticket_lifetime = 24h
  renew_lifetime = 7d
  forwardable = true

[realms]
  ACME.COM = {
    kdc = server1.acme.com:88
    admin_server = server1.acme.com:88
  }

[domain_realm]
  .acme.com = ACME.COM
  acme.com = ACME.COM

MapR authentication (with or without Kerberos): Check that the MapR ticket generation utility is installed and configured to access the CLDB node:

$ maprlogin

If this command prompts for the current user’s password, this node is configured as part of a secure MapR cluster.

2. Secure browser connectivity. Make sure you have access to the appropriate Hadoop components (including Waterline Data Inventory) through your browser. For example, you'll need access to the cluster host from the remote computer. If your cluster is configured to use Kerberos, you'll need to configure the browser with a Kerberos plug-in and make sure you have valid user credentials.

From a browser running on a computer that is not the edge node where you are installing Waterline Data Inventory, verify that you can sign into cluster components, such as one of the following:

Hue (CDH, MapR)              http://<HDFS file system host>:8888
Hue (HDP)                    http://<HDFS file system host>:8000
Ambari (HDP)                 http://<HDFS file system host>:8080
Cloudera Manager (CDH)       http://<HDFS file system host>:7180
MapR Control System (MapR)   http://<HDFS file system host>:8443

If you are not able to sign in, check that:

o The Hadoop service is running.
o The active user has access to the Hadoop application.
o For a Kerberos environment:
  - The current user has a valid ticket (run klist from a terminal on the client computer).
  - The browser is configured to use Kerberos when accessing secure sites.
  - A Kerberos KDC is accessible from this computer.

3. Authorization. Identify the components your cluster uses to control access to cluster objects and actions. For example, Cloudera distributions support Sentry for managing file-level access. You'll need to make sure that the Waterline Data Inventory service user is properly configured in your cluster authorization system.


Configure a service user < Back to Checklist

We recommend that you configure a dedicated service user to own the installation directory and to run Waterline Data Inventory jobs and services. This document refers to the service user as "waterlinesvc". If you choose not to create a "waterlinesvc" user, choose another user that will be dedicated to running Waterline Data Inventory jobs and services.

Because of the extensive access privileges that Waterline Data Inventory needs to produce an inventory of HDFS files, it is critical that the account that runs Waterline Data Inventory jobs be created to adhere to all enterprise security requirements.

The service user needs to be configured for enterprise authentication and authorization. This section covers:

o Authentication
o Edge-node authorization
o Hue access
o HDFS and Hive authorization
o Proxy user setup

Authentication

The Waterline Data Inventory service user (waterlinesvc) needs to be a valid user in the system used by your enterprise to authenticate cluster users.

Kerberos principal and keytab. If your cluster is Kerberized, ask your Kerberos administrator to configure a principal name for the Waterline Data Inventory service user and create a corresponding keytab file. You'll need this information to configure the Waterline Data Inventory application server and to run Waterline Data Inventory jobs.

In a Kerberos-controlled environment, if you expect to use the service user to sign into the web application in addition to running services, consider configuring the user's keytab so it allows both keytab-based and password-based login.

To generate this keytab (assuming the Waterline Data service user is “waterlinesvc”):

$ kadmin.local
> add_principal waterlinesvc
  (prompts for a password)
> ktadd -k <full path to keytab file> -norandkey waterlinesvc
> quit

Here the -norandkey argument is necessary; without it, the waterlinesvc user's password is invalidated when the keytab is created, and the user will not be able to sign in to the Waterline Data Inventory web application.
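To confirm that both login paths work after generating the keytab (the keytab path is a placeholder):

$ kinit -kt <full path to keytab file> waterlinesvc    # keytab-based login
$ klist                                                # shows the resulting ticket
$ kdestroy
$ kinit waterlinesvc                                   # password-based login should still prompt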


MapR ticket. For a MapR deployment, use the maprlogin utility and a username/password pair to generate a valid maprticket.

Edge-node authorization

Edge node directories. Root access may be required to configure the directories needed for installing and running Waterline Data Inventory.

The Waterline Data service user requires full access to the following locations on the edge node:

Directory                                         Typical Location
Software installation location                    /opt/waterlinedata
Search indexes (and repository if using Derby)    /var/lib/waterlinedata
Logs                                              /var/log/waterlinedata
Temporary storage                                 /tmp

Waterline Data Inventory doesn't have any specific requirement for where it is installed. We recommend that you install Waterline Data Inventory in the same way other Hadoop cluster edge node applications are installed. Some clusters use /usr/lib; others /opt. It can also be installed in other locations, such as the Waterline Data service user's home directory (/home/waterlinesvc/waterlinedata).

If appropriate, you can nest these artifacts inside the same directory structure.
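For example, one way to prepare the typical locations shown above and give the service user ownership (run as root; adjust the paths to your layout):

# mkdir -p /opt/waterlinedata /var/lib/waterlinedata /var/log/waterlinedata
# chown -R waterlinesvc /opt/waterlinedata /var/lib/waterlinedata /var/log/waterlinedata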

Shared folder access. If the installation is on a VirtualBox virtual machine image, it is convenient to include the waterlinesvc user as a member of the group created for the VM to share folders between the host and the VM (the vboxsf group).
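For example (run as root on the VM):

# usermod -aG vboxsf waterlinesvc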

Hue access

Hue user. As a convenience, if you plan to use Hue to manage HDFS or MapR-FS files and monitor MapReduce jobs, create a corresponding user account for the Waterline Data Inventory service user on Hue. Alternatively, to identify jobs run by the service user, use that username to filter the job lists.

HDFS and Hive authorization

Access permission. The service user needs access permission to be able to read the data it will profile for the inventory.

One way to provide the service user with enough access to perform these operations is to include it in the file system group, such as hdfs or mapr.
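For example, on a cluster where membership in the hdfs group grants the needed read access (run as root):

# usermod -aG hdfs waterlinesvc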

If your cluster security is ensured using Apache Ranger or Apache Sentry or with Access Control Lists (ACL), here's how to set access permissions to make sure that both the Waterline Data Inventory service user and end-users of the browser application have the access they need to HDFS files and Hive tables.


HDFS

User and Area of access                                               Read  Write  Execute
Waterline Data Inventory service user "waterlinesvc"
  HDFS directories and files included in inventory                     X             X
  HDFS staging area for profiling results
  (.wld_hdfs_metadata directory)                                       X      X      X
Hive metastore user
  Waterline HDFS home directory                                        X
End-users
  HDFS directories and files this user needs access to                 X             X
  HDFS directories that include files for which this user may
  create Hive tables                                                          X

Hive

User and Area of access                                               Hive Operation
Waterline Data Inventory service user "waterlinesvc"
  Profile existing tables                                              SELECT
  Browse existing tables                                               SHOW DATABASE
  Create new tables                                                    CREATE*, ALTER‡
End-users
  Hive databases and tables this user needs read access to             SELECT
  Hive databases in which this user can create Hive tables             CREATE, ALTER‡

* CREATE privileges on the Hive database need to be paired with WRITE access to the HDFS directory containing the file or files that contribute to the Hive table.

‡ ALTER privileges are required only for creating Hive tables from Waterline Data Inventory collections. If the user creating the table does not have ALTER privileges on tables, the Hive table is created without partitions.

Proxy user setup

Waterline Data Inventory has two modes of operation: batch jobs that run under the service user and web application processes. The users signed into the web application can see only the data they have specific authorization for, which may be a different level of access from other users and from the service user. To ensure that users see and do only what they have access for, the Waterline Data user performs the web application processes as if it were the signed-in user. To configure "trusted delegation", or the ability of the service user to act as the end-users, make the service user a proxy user for the hosts and groups involved in accessing the inventory.

In a Kerberos-controlled environment, trusted delegation has another value. The Hive metastore is typically accessible only through the dedicated Hive user. Waterline Data Inventory uses trusted delegation to perform operations against the Hive metastore. Because the Waterline Data Inventory service user has trusted delegation privileges, Hive performs the requests.


Applying changes to cluster configuration files may require restarting cluster components. You may need to arrange with cluster administrators to include this change in their maintenance schedule.

To configure Waterline Data Inventory to use HDFS trusted delegation:

1. Add the Waterline Data Inventory service user (typically waterlinesvc) to the HDFS or MapR-FS superuser group.

2. Enable the secure impersonation properties for the Waterline Data Inventory superuser in the core-site.xml file on your Hadoop nodes.

These are the hosts and groups that represent users who will use the application: the Waterline Data Inventory service user needs to be able to act on behalf of these users.

Include all hosts or all groups using an asterisk (*) as the property value. Alternatively, you can specify a comma-separated list of fully qualified hostnames or group names. Change "waterlinesvc" to the name you are using for the Waterline Data Inventory service user.

For example:

<property>
  <name>hadoop.proxyuser.waterlinesvc.groups</name>
  <value>*</value>
  <description>Allow the superuser 'waterlinesvc' to impersonate any user</description>
</property>

<property>
  <name>hadoop.proxyuser.waterlinesvc.hosts</name>
  <value>*</value>
  <description>The superuser 'waterlinesvc' can connect from any host to impersonate a user</description>
</property>

3. Restart components as necessary to apply the changes on the cluster.

If running on a Kerberized system, be sure to use the same username in the same format when configuring the Waterline Data Inventory property for logging in through Kerberos.


Configure a MySQL repository database < Back to Checklist

Out of the box, Waterline Data Inventory runs an embedded version of Apache Derby database as its repository. If your cluster uses MySQL for other applications and you want to standardize on that database, you can configure Waterline Data Inventory to use MySQL instead of Derby.

If you are installing Waterline Data Inventory using the provided Derby database for the metadata repository, you can skip this section.

The following instructions describe the requirements for configuring a MySQL database to persist Waterline Data Inventory metadata.

These instructions assume that you have an installed instance of MySQL already running on your cluster, such as the instance used by Hive metastore, and that you have access to a MySQL database administrator.

1. In the MySQL instance, create a user dedicated to Waterline Data Inventory operations.

For example, create the "waterlinesvc" user with password "wdipass":

mysql> CREATE USER 'waterlinesvc' IDENTIFIED BY 'wdipass';

We recommend that you use an encrypted password when you create the user account. The password can be retrieved with SELECT user, password FROM mysql.user;

2. Create a MySQL database "waterlinedatastore".

mysql> CREATE DATABASE waterlinedatastore CHARACTER SET latin1 COLLATE latin1_bin;

where the character set parameters allow the repository to accommodate case-sensitive HDFS paths.

3. Switch to the newly created waterlinedatastore database and execute the following grants, where <MySQL host name> is replaced with the host name for the node where MySQL is running and the MySQL username is waterlinesvc.

mysql> USE waterlinedatastore;
mysql> GRANT USAGE ON waterlinedatastore.* TO 'waterlinesvc'@'%' IDENTIFIED BY 'wdipass';
mysql> GRANT USAGE ON waterlinedatastore.* TO 'waterlinesvc'@'<MySQL host name>' IDENTIFIED BY 'wdipass';
mysql> FLUSH PRIVILEGES;
mysql> GRANT ALL ON waterlinedatastore.* TO 'waterlinesvc'@'<MySQL host name>' IDENTIFIED BY 'wdipass';
mysql> GRANT ALL ON waterlinedatastore.* TO 'waterlinesvc'@'%' IDENTIFIED BY 'wdipass';
mysql> FLUSH PRIVILEGES;

4. Make a note of the MySQL configuration parameters you used here to incorporate into the Waterline Data Inventory properties:
   o User
   o Password
   o Database name
   o MySQL host name or address


In the Waterline Data Inventory installation, you'll be prompted for:

o Username and password.

o The JDBC connection URL for the database. Create it from the values you collected above:

  jdbc:mysql://<host name>:<port>/<databasename>?createDatabaseIfNotExist=true

  where by default the port is 3306.

o The location of the MySQL driver JAR (mysql-connector-java-<version>-bin.jar) on the edge node where Waterline Data Inventory is installed. If it isn't already available on that node, copy it from the MySQL installation node or download it from dev.mysql.com/downloads/connector/j/
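For example, with MySQL running on the edge node itself (the host name here is hypothetical):

jdbc:mysql://edgenode05:3306/waterlinedatastore?createDatabaseIfNotExist=true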

Download Waterline Data Inventory < Back to Checklist

If you haven't already, download the Waterline Data Inventory distribution from the location provided by Waterline Data. If your organization has subscribed to support, you can find the location through the Waterline Data Support portal, support.waterlinedata.com.


Run the installer < Back to Checklist

Upgrading from an existing version?

The following instructions apply to a fresh installation of Waterline Data Inventory. If you have an existing installation and are ready to upgrade to the most recent version, contact [email protected] for instructions to migrate your configuration to the new system.

Run this script when you are installing Waterline Data Inventory for the first time or if you don't want to keep metadata profiled using a previous version of the application. This script clears the directories that you specify for the new installation! If you want to save any material from a previous installation, make sure to make backups before running the installer.

To run the installer:

Command Line Portion

1. As the Waterline Data service user, extract the installer from the package and run it:

$ tar xf /path/to/wld-installer-2.5.0-GENERIC.tar.gz

$ ./wld-installer-2.5.0-GENERIC.run

If another application on the cluster is using one of the ports that Waterline Data uses (or if a previous version of Waterline Data Inventory is running on the edge node), the installer will warn you that the expected ports have conflicts. Follow the instructions provided for resolving port conflicts; more information is available at Configuring Waterline Data Inventory ports.

2. Choose "Custom install".

3. Set locations for the software, indexes (and database), and log locations.

The locations you specify should exist and be writeable by the Waterline Data service user.

The installer creates a "waterlinedata" directory inside the locations you specify; if that directory already exists, the installer deletes any existing content.

4. Review the Hive client location.

The installer reviews the environment and attempts to detect a Hive client. If Hive isn't running, the installer will fail to detect Hive; in that case, either restart Hive or specify the Hive client location on this node.
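To confirm that a Hive client is available on this node before running the installer:

$ which hive
$ hive --version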


5. Indicate the database to be used for the repository.

If you do not want to use the default Derby database for the repository, enter "n" at the prompt. The installer then asks you to confirm or specify the location of the MySQL driver JAR file. This is the driver you added to the edge node in the previous section, Configure a MySQL repository database.

6. Review and accept the installation choices.

Browser Portion

The installation continues in a browser pointing to the host where you are installing Waterline Data Inventory, at port 8082. For example:

http://edgenode05:8082

1. When the installation page appears, read and agree to the Waterline Data Inventory license terms.

2. Select an authentication method.

This is the authentication method that allows Waterline Data Inventory to validate users who log into the web application. Choose from SSH, Kerberos, and Local Database. Local Database allows you to create users independent of an existing authentication system.

For AWS EMR deployments, only Local Database authentication is supported.

3. Enter a valid user as the Admin User.

This user is granted all Waterline Data Inventory roles, including the administrator role. Sign in with this user to assign administrator privileges to additional users.

If you are using Kerberos authentication and there is more than one realm configured for this user, specify the user name and realm, such as:

<user name>@<REALM>


4. Configure a database to use as a repository.

These options specify how the Waterline Data Inventory service user connects to the repository database. For Derby, the installer provides the required settings. For MySQL, you need to review and correct the default values based on how you configured the database previously (in Configure a MySQL repository database).

Check the host and database name in the JDBC URL, then specify the username and password that you set when you created the database.
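As a quick sanity check, you can confirm the host, port, and credentials from the edge node with the mysql command-line client (a sketch; all values are placeholders for the ones you configured):

$ mysql -h <host name> -P 3306 -u <repository user> -p <database name>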

5. Add HDFS as a data source.

These settings indicate how the Waterline Data Inventory service user will connect to HDFS on this cluster. The name and description fields are for your benefit; the URL and authentication credentials determine the connection. For AWS EMR, specify the EMRFS name of the S3 bucket containing your data. For example, "s3://acme-impressions-data/". For Kerberos, specify the keytab and principal created for the Waterline Data Inventory service user.

Make sure to test your entries!


6. Add Hive as a data source.

These settings indicate how the Waterline Data Inventory service user will connect to Hive on this cluster. The name and description fields are for your benefit; the URL and authentication credentials determine the connection.

The Hive URL is the connection string used by the Waterline Data service user to access Hive:

jdbc:hive2://<HiveServer2 host>:10000

For Kerberos, as you add values for the first section, the installer builds a URL for you:

Hostname and port: the HiveServer2 host. By default, the port is 10000.

Domain: a domain corresponding to the realm configured for the cluster. Use a domain_realm entry from the krb5.conf file on the Hive host. This value is used to construct the principal parameter of the URL.

Specify the keytab and principal created for the Waterline Data Inventory service user.

Make sure to test your entries!

7. Review your entries and restart the application server.


Setup Portion

After the application server restarts, you'll be ready to validate the installation and prepare for users.

1. Sign into Waterline Data Inventory as the Admin User set in step 3 of the browser installation steps.

2. Configure settings that depend on HDFS.

From Manage > Configuration, set the following properties that correspond to your HDFS organization:

HDFS metadata store location: Waterline Data Inventory storage that supplements the data in the repository. The service user must have write access to this location. By default, /user/waterlinesvc/.wld_hdfs_metadata.

HDFS browse home location: The HDFS location users see when they click Browse. By default, this location is /user/<username>; if the user does not have a corresponding directory in HDFS, clicking Browse goes to the top of HDFS.

Copy files when creating Hive tables: The behavior for how files are managed when users create Hive tables from inside Waterline Data Inventory. Hive requires that the user creating the table has write access to the parent directory for the source files. If that permission isn't available (or if your data includes ORC or RC formats), set this option to true and set Hive backing file copy location to the HDFS location where files can be copied and users have write access.

Restart the Application Server after these changes.

3. Create one or more tag domains.

Even if you aren't sure of the tag domains you will use, you'll want to create at least one temporary tag domain. Without a new tag domain, no users will be able to create tags.

To create a tag domain:

a. Open Manage > Glossary.
b. Click Create and select New Tag Domain.
c. Enter a tag domain name and description and click Create.
d. Repeat if you want to include additional tag domains.

4. Configure additional users.

Create additional user profiles and assign the administrator role to at least one of those users.

To configure user profiles:

a. In the toolbar, choose Manage > User Profiles.

If you don’t see this option, make sure you are signed in as the Admin User.

b. Click Add User.
c. Enter the name of an existing user to whom you want to give administrator privilege.
d. On the left, select the username.
e. On the right, select Administrator and other roles as needed.

If you choose to give this user the Data Steward role, also select the tag domains that this user can manage.

f. Click Save.

5. Profile an HDFS directory.

For this first run, select a single directory with a small number of files to validate the installation.


Run the following command on the edge node where Waterline Data Inventory is installed and as the Waterline Data service user (shown here as waterlinesvc):

$ su waterlinesvc

$ bin/waterline profile <full path to HDFS directory with data to profile>

For example:

$ bin/waterline profile /user/waterlinedata/Landing/data.gov

The console fills with status messages for each stage of the profiling sequence. These messages are also written to /var/log/waterlinedata/wd-jobs.log. When this command completes, you can repeat it with additional directories or move on to viewing the profiled data.
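To follow the same messages from a second terminal, tail the job log:

$ tail -f /var/log/waterlinedata/wd-jobs.log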

6. Verify that there is field-level information for the files in the directory you profiled in step 5.

Files should appear with their content type and status of "Profiled".


If files show content type "Unknown" and a status indicating that they were not profiled ("Unprocessed" or "Profile Failed"), review the console output from step 5 to determine the failure.



Update Hive Server with Waterline Data Inventory JARs < Back to Checklist

When creating Hive tables, Waterline Data Inventory uses some third-party and custom functions and serdes contained in JAR files. These JARs need to be identified to HiveServer2 so that applications reading these tables have access to the same JARs.

To configure HiveServer2 for Waterline Data Inventory:

1. Point HiveServer2 to the location of the custom functions.

There are three ways to achieve this configuration; review your cluster configuration to determine the appropriate option:

Place the JARs in the hive/auxlib directory on the HiveServer2 node.

Place the JARs in the hive/auxlib directory or some other location on the HiveServer2 node and set the contents of this directory in the hive.aux.jars.path property in hive-site.xml. For example:

<property>
  <name>hive.aux.jars.path</name>
  <value>/usr/hdp/2.4.0.0-169/hive/auxlib/*</value>
  <description>The location of the plugin jars that contain implementations of user defined functions and serdes.</description>
</property>

Place the JARs in the hive/auxlib directory or some other location on the HiveServer2 node and set the contents of this directory in the HIVE_AUX_JARS_PATH in the hive-env.sh. For example:

export HIVE_AUX_JARS_PATH=/usr/hdp/2.4.0.0-169/hive/auxlib/*

2. Move the Waterline Data Inventory JAR files to the chosen location on the HiveServer2 node.

The JAR files are provided in a tar file in the installation:

waterlinedata/lib/hive/waterlinedata-hive-dependencies.tar.gz

Move this file to the destination directory on the HiveServer2 host and untar it:

$ tar xf waterlinedata-hive-dependencies.tar.gz
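For example, to stage the archive from the edge node, a minimal sketch (the host and auxlib path are illustrative; substitute your HiveServer2 host and the directory you chose in step 1):

$ scp waterlinedata/lib/hive/waterlinedata-hive-dependencies.tar.gz <HiveServer2 host>:/usr/hdp/2.4.0.0-169/hive/auxlib/

$ ssh <HiveServer2 host> 'cd /usr/hdp/2.4.0.0-169/hive/auxlib && tar xf waterlinedata-hive-dependencies.tar.gz'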

3. Restart HiveServer2.


Part 3 Administration

o Starting Waterline Data Inventory services
o Running Waterline Data Inventory jobs
o Monitoring Waterline Data Inventory jobs
o Setting up users to access Waterline Data Inventory
o Backing up Waterline Data Inventory metadata


Starting Waterline Data Inventory services

These steps assume you have access to the Linux computer (edge node) where Waterline Data Inventory is installed and you can sign in as the Waterline Data Inventory service user.

1. Use SSH to access the computer where Waterline Data Inventory is installed and sign in as the Waterline Data service user.

2. Navigate to the Waterline Data Inventory installation directory.

For example:

$ cd /opt/waterlinedata

3. Start the Waterline Data Inventory services.

$ bin/waterline serviceStart

If you are running the Derby repository, right away you should see a response that ends with "...started and ready to accept connections on port 4444", which indicates that Derby started successfully.

As the Jetty application server starts, you'll see many logging messages on the console. These messages are also written to /var/log/waterlinedata/wd-ui.log for later reference.

See also Commands to control Waterline Data Inventory services.

4. Open a browser and navigate to:

http://<edgenode>:8082

If the Waterline Data Inventory login screen doesn't appear, look in the console output to see if any error occurred. The output is also available at /var/log/waterlinedata/wd-ui.log. Typically, errors at this point are similar to the following:

Port forwarding. If you are accessing the application remotely, make sure that the connection between hosts allows forwarding of port 8082.

User permissions. Make sure you started the services as the Waterline Data service user. If the service user does not have the correct permissions, you may see errors in the Jetty output. Review the user access requirements and make sure the user has the correct access.

Kerberos ticket cache disabled. In a Kerberos-controlled environment, if you see the following error in the Jetty console and log, the ticket cache may be disabled for the user starting the Jetty process:

WARN |2015-03-22 13:52:57,827 org.apache.hadoop.security.UserGroupInformation -
PriviledgedActionException as:waterlinesvc (auth:KERBEROS) cause:java.io.IOException:
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No
valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]

To resolve this error, make sure the Waterline Data Inventory service user running the Jetty process has read access to HDFS and has a valid Kerberos ticket (run kinit) on the computer where Waterline Data Inventory is installed. Then check that the Kerberos ticket cache is available for the user (run klist).
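For example (a sketch; the keytab path and principal are placeholders for the ones created for your service user):

$ kinit -kt /etc/security/keytabs/waterlinesvc.keytab waterlinesvc@<REALM>

$ klist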

5. Sign into Waterline Data Inventory using the credentials for the Waterline Data Inventory superuser.

This is the Admin User configured in step 3 of the browser portion of the installation process.

For Kerberos, depending on how your system is configured, you may need to qualify the username with the realm name.

Commands to control Waterline Data Inventory services

Waterline Data Inventory runs three processes on the edge node:

Derby Repository database (disregard if using MySQL)

Jetty application server

Web server daemon

The primary commands to control these services are:

bin/waterline serviceStart

bin/waterline serviceStop

Control the Waterline Data Inventory Derby repository database and application server. If you are using a different database, this script will skip the database controls and only start the application server and web daemon.

When Derby is running, the stop command prompts for the username and password used by Waterline Data Inventory components to access the repository. By default, these are "waterlinedata" and "waterlinedata". You can set these values as described in Repository username and password.

In addition, you can start and stop the embedded Derby database or Jetty application server individually:

bin/derbyStart

bin/derbyStop

Controls only the Waterline Data Inventory Derby repository database. Typically, you wouldn't want to stop Derby without stopping the Jetty application server first.

The stop command prompts for the username and password used by Waterline Data Inventory components to access the repository. By default, these are "waterlinedata" and "waterlinedata". You can set these values as described in Repository username and password.

bin/jettyStart

bin/jettyStop

Controls the Waterline Data Inventory application server.


bin/daemonStart

bin/daemonStop

Controls the Waterline Data Inventory web daemon, which runs in the background to allow admin users to remotely restart the application server through the browser.

Use this process command to validate that the Waterline Data Inventory services are running:

ps -ef | grep /opt/water

Including part of the installation location ("/opt/water" in this example) allows you to see all the processes but excludes other processes such as Hive that refer to Waterline Data Inventory JARs in their class paths.

The output should show the active Derby, Jetty, and web daemon processes.


Running Waterline Data Inventory jobs

Waterline Data Inventory format discovery and profiling jobs are MapReduce jobs run in Hadoop. These jobs populate the Waterline Data Inventory repository with file format and schema information, sample data, and data quality metrics for files in HDFS and Hive. Waterline Data Inventory can process HDFS files formatted as delimited text files, JSON, Avro, XML, ORC, RC, Parquet, and Apache log files. Individual files in these formats compressed as sequence files are also profiled. Individual files in delimited text format, Apache log format, or JSON compressed as gzip (GNU zip) are also profiled. Waterline Data Inventory recognizes many other file types (such as PDF and image files) but provides only basic file information for these files.

Tag discovery and lineage discovery jobs run as MapReduce jobs and use the HDFS profiling information to suggest tag associations and lineage relationships for files in the inventory. This information is stored in the repository.

Collection discovery and origin propagation jobs are jobs run on the edge node where Waterline Data Inventory is installed. These jobs use data from the repository to suggest relationships among files and to propagate origin information. The results are stored in the repository.

Waterline Data Inventory jobs are run on a command line on the edge node on which Waterline Data Inventory is installed. The jobs are started using scripts located in the bin subdirectory in the installation location. They can also be run as workflows in Apache Oozie.


Running jobs in parallel

Waterline Data Inventory profiling jobs can be run at the same time. For best results, we recommend that concurrent jobs profile different sections of data. Running multiple profiling jobs in parallel lets administrators who manage distinct data sets profile their data independently.

Waterline Data Inventory discovery jobs often run across all resources in the repository. Two discovery jobs cannot be run at the same time. If a discovery job is started when there is a conflicting job running already, you'll see a warning on the console and the job will be delayed until the conflicting job completes.

Discovery jobs may be run at the same time as profiling jobs; keep in mind that the results from the profiling jobs will not be reflected in the results of the discovery jobs running concurrently.

To automate job execution, use the job return codes to ensure one job finishes before the next job starts.
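A minimal sketch of such a sequence, run from the installation directory (the directory path is illustrative):

# Run discovery only if profiling exits successfully (exit code 0)
bin/waterline profile /landing/raw
if [ $? -eq 0 ]; then
  bin/waterline collections /landing/raw
  bin/waterline lineage
  bin/waterline tag
fi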

Command summary

Run Waterline Data Inventory commands as options to the waterline script found in the bin directory of the installation:

$ bin/waterline <command option> <parameters> <overrides>

The command options and parameters are described in the following table. Overrides allow you to specify MapReduce and Waterline Data Inventory properties for the job that override the currently configured setting.

Job configuration values are pulled from the following places in order of precedence:

Job command line. All properties including MapReduce and Waterline Data Inventory properties can be specified on the command line prefixed with "-D". These values are written to the Hadoop configuration object for the job.

Waterline Data Inventory configuration. Set properties through Manage > Configuration. Any properties that are available in this list (as described in Configuration controls) are written to the Hadoop configuration object for the job.

Hadoop configuration. Any properties that are not explicitly set in the previous locations will be pulled from the cluster configuration.
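For example, a property passed on the command line wins over the same property set through Manage > Configuration or in the cluster configuration:

$ bin/waterline profile /landing/raw -Dwaterlinedata.incremental=false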

Command option Summary

profile <HDFS directories>

Profiling of the files in the indicated HDFS directories. Indicate more than one directory with a comma-separated list. Configuration parameters can be passed through to override configured properties for this job.

profileHive <Hive databases>

Profiling of the tables in the indicated Hive databases. Indicate more than one database with a comma-separated list. To specify individual tables, use the property waterlinedata.profile.hivenamefilter as an override. Waterline Data Inventory uses the location configured for waterlinedata.profile.hivedir to copy HDFS files as needed when users choose to create Hive tables. Configuration parameters can be passed through to override configured properties for this job.

lineage

Discover lineage relationships among all profiled files and tables and calculate file and table origins.

collections [HDFS directories]

Discover collections among all profiled files. If you are running discovery tasks individually, be sure to discover collections before discovering tag associations. You can improve the performance of the collection discovery process by reducing the number of possible directories to consider: specify HDFS directories on the command line that will never have collections that span across them. For example, if you have data organized by layers such as "raw" and "processed", you can identify those directories in the command: waterline collections /layer/raw,/layer/processed.

origin

Calculate file and table origins using all lineage relationships. This command is run as part of the lineage command.

tag

Propagate tag associations across all profiled files and tables. This operation uses metadata and sample data collected during profiling. If you are experimenting with tag associations based on regular expressions, consider reprofiling data to get a complete picture of how tag associations from regular expressions will perform.

evaluateRegex

Reapply tag associations based on regular expressions using existing metadata and sample data collected during profiling.

serviceStart serviceStop serviceRestart

Start, stop, and restart Waterline Data Inventory services, including the Jetty application server and the daemon that controls restarting the application server. If the configured repository is running in Derby, these commands control the Derby instance also.

showDebug setDebug resetDebug

Set the logging level for MapReduce job logs. These logs are found in the cluster history server, accessible through Hue or Ambari.

krenewStart krenewStop

Ensure that the Waterline Data Inventory service user has a valid Kerberos ticket for long-running jobs.

showVersion

Display Waterline Data Inventory version information.


HDFS file profiling

$ <install location>/bin/waterline profile <HDFS dir>,[<HDFS dir>] [<overrides>]

This command recursively profiles new and updated files in the directory or directories indicated. Separate multiple HDFS directory paths with a comma and no spaces.

When run for the first time, this command profiles all files in the indicated directories. Subsequent runs identify changed, deleted, and new files in the cluster and perform profiling only on those files. For new files added to existing collections, Waterline Data Inventory profiles the new files and aggregates the new information into the existing profiling information for the collection.

You can force Waterline Data Inventory to re-profile files it has already profiled by appending an override option (-Dwaterlinedata.incremental=false) to the command. Consider reprofiling files when you change the profiling criteria, such as when adding new delimiters or changing profiling properties. An additional override option (-Dwaterline.reprofile_files_with_error) allows you to reprofile failed files along with any new or updated files.

If you specify a valid HDFS file instead of a directory, Waterline Data Inventory will profile just the file.

Example:

$ bin/waterline profile /user/waterlinedata/Landing,/user/waterlinedata/finance

-Dmapreduce.map.memory.mb=4096 -Dmapreduce.map.java.opts=-Xmx3072m

The profile command triggers the following individual operations:

Format discovery (one MapReduce job, with map tasks allocated 4 GB of memory)

Profiling "crawl" (one or more MapReduce jobs per file format type, with map tasks allocated 4 GB of memory)

Origin propagation for any new files found in a folder with an existing Landing marking.

The progress of each job is indicated by messages on the console and logged in the wd-jobs.log found in /var/log/waterlinedata. To see details for the MapReduce jobs, follow the job link provided in the console messages, for example:

INFO | 2016-02-19 20:12:55,430 | Job [main] - The url to track the job:

http://myhost:8088/proxy/application_1455926257075_0007/

After profiling HDFS, run the collection discovery, lineage discovery, and tag propagation commands in that order, described next.


Collection discovery

$ <install location>/bin/waterline collections [-d] [-r] [list of HDFS dirs]

This command reviews repository data to determine if any folders contain files that can be considered a collection. Run this command when you've added files to the cluster that are likely to be members of a new collection.

The command options are:

-d Delete existing collections, including collections created manually.

-r Rerun collection discovery, first deleting all existing collections, including collections created manually. Consider using the -r option when you have removed files from data sets in the cluster to update collections (removed files are not removed from the collection during normal profiling processes).

list of HDFS dirs Identify a list of HDFS directories to discover collections within. The list is a comma-separated list of HDFS paths, in double quotation marks. If no directories are specified, collection discovery runs against the entire inventory.

The progress of the job is indicated by messages on the console and logged in the wd-jobs.log found in /var/log/waterlinedata.

Example:

$ bin/waterline collections /user/waterlinedata/Landing,/user/waterlinedata/finance

Supplying a list of HDFS directories can improve collection discovery performance by limiting the scope of the discovery: in this example, Waterline Data Inventory does not look for collections that span between the Landing and finance directories.

To discover collections across the entire inventory, we highly recommend dividing the cluster data into directories. For example, if your data is organized in layers such as “raw” and “processed”, indicating these high-level divisions on the collections command line can improve performance significantly:

./waterline collections /raw,/processed

Lineage discovery

$ <install location>/bin/waterline lineage [<overrides>] [-r]

This command runs a MapReduce job to discover lineage relationships among files and a local job to propagate origin information. If the lineage job fails (returns status code of -1), the origin job does not run.

This command operates on data in the Waterline Data Inventory repository; if new files are added to the cluster, you must run a profile command to collect data into the repository before you will see information for the new files reflected in lineage relationships. For performance reasons, consider profiling all data on the cluster before running lineage discovery; during regular maintenance, run lineage discovery after significant numbers of files are added rather than running it for each incremental change.

This command allows a -r option, which will drop existing suggested lineage relationships and rediscover lineage for all files in the cluster, not just new files. The -r option does not affect accepted or rejected lineage relationships.

Example:

$ bin/waterline lineage -Dmapreduce.map.memory.mb=8192

-Dmapreduce.map.java.opts=-Xmx7168m -r

The lineage command triggers lineage discovery across the entire repository, re-evaluating any existing suggested lineage relationships (-r) and allocating 8 GB of memory per map task.

The progress of the job is indicated by messages on the console and logged in the wd-jobs.log found in /var/log/waterlinedata.

Origin propagation

$ <install location>/bin/waterline origin [-r]

This command propagates origins across the files in the cluster that have lineage relationships. You can use this command to propagate landing information across a cluster that has already been profiled and has lineage information discovered. This command allows a -r option, which propagates all origins, not just new origins.

Note that the profile job propagates origins for new files that are added to folders that are marked with a landing value already.

The progress of the job is indicated by messages on the console and logged in the wd-jobs.log found in /var/log/waterlinedata.

Tag discovery

$ <install location>/bin/waterline tag [<overrides>] [-r]

This command discovers tag associations for new and changed tags across the fields in the cluster. Use this command when users have added tags and tag associations that you want Waterline Data Inventory to consider for propagation. Typically, this job would run at least daily and potentially many times a day depending on tagging activity.

This command allows a -r option, which re-evaluates tag association suggestions for all tags, not just new tags.

You should also run the evaluateRegex command when you run tag discovery; otherwise, regular expression tags are not included in tag discovery. It is not necessary to run evaluateRegex when you use the -r option.
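A minimal sketch of a routine tagging pass reflecting this guidance (skip the evaluateRegex step if you run tag with -r):

bin/waterline tag
bin/waterline evaluateRegex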

The progress of the job (one MapReduce job and one edge node job) is indicated by messages on the console and logged in the wd-jobs.log found in /var/log/waterlinedata.


To see details for the MapReduce jobs, follow the job link provided in the console messages or use Hue or Ambari to show the MapReduce jobs for the Waterline Data Inventory service user.

Regular expression tag rule evaluation

$ <install location>/bin/waterline evaluateRegex

This command uses data from the repository to apply tag association regular expression rules. Use this command to apply tag associations for new and updated regular expression tags against the existing profiling metadata in the repository. Note that this method may not give you the best possible tag association results; for best results for regular expression tag associations, files should be profiled after the rule is defined so that the profiling process collects metadata specific to the requirements of the regular expression.

The progress of the job is indicated by messages on the console and logged in the wd-jobs.log found in /var/log/waterlinedata.

Hive table profiling

$ <install location>/bin/waterline profileHive <Hive database> [overrides]

This command profiles new and updated tables in the Hive database or databases indicated. You can include the following override to indicate the table or tables from the specified databases to profile:

-Dwaterlinedata.profile.hivenamefilter=table-list

where "table-list" is a regular expression that resolves to describe one or more tables in the database. Here are some examples of table-list expressions:

Single table by the table name: tableA

More than one table by name: 'tableA|tableB'

All tables that begin with the prefix "abc": 'abc.*'

All tables that end with 2015: '.*2015'

To profile all tables in the "default" database, use:

$ bin/waterline profileHive default

To profile more than one database at a time, include multiple databases in the command, separated by commas with no space between names:

$ bin/waterline profileHive dbRaw,dbTransformed

To profile tables with the prefix "abc" from database dbRaw, include an override to indicate the tables to profile from the dbRaw database:

$ bin/waterline profileHive dbRaw -Dwaterlinedata.profile.hivenamefilter="abc.*"

Note that because the table-list is a regular expression, “.*” matches any additional characters; it is not a filename-style wildcard appended to “abc.”.


When run for the first time, the profileHive command profiles all tables in the indicated database or table-list. Subsequent runs identify changed, deleted, and new tables in the database or table-list and perform profiling on those tables. Updated partitions in tables are profiled and combined with the existing profile data for the table.

You can force Waterline Data Inventory to re-profile tables it has already profiled by appending an override option (-Dwaterlinedata.incremental=false) to the command. Consider reprofiling tables when you change the profiling criteria such as when adding tags with regular expression rules.

Specifically, the profileHive command triggers the following individual operations:

Format discovery (one MapReduce job)

Profiling "crawl" (one or more MapReduce jobs depending on the size of data in each table and the format of the underlying data)

The progress of each job is indicated by messages on the console and logged in the wd-jobs.log found in /var/log/waterlinedata. To see details for the MapReduce jobs, follow the job link provided in the console messages.

After profiling all the databases in the cluster, run the lineage discovery and tag propagation commands.

Example:

$ bin/waterline profileHive default,finance

To profile only the table “web-sales” from database dbRaw, include an override to indicate the table to profile:

$ bin/waterline profileHive dbRaw -Dwaterlinedata.profile.hivenamefilter="web-sales"

Starting and stopping services

$ bin/waterline serviceStart

$ bin/waterline serviceStop

$ bin/waterline serviceRestart

During normal operation, Waterline Data Inventory runs the following services to support user access to the browser application:

Application server on port 8082. This is the Jetty server that serves the browser application.

Web daemon on port 8084. This daemon makes it possible to start and stop the application server remotely.

Derby server on port 4444. This service only runs if the system is configured to use Derby as the repository database.

If needed, you can start, stop, or restart these processes from the command line. Users with the admin role can restart the application server from inside the browser application. The only reasons to use these command-line controls are when applying software upgrades, making changes to the default ports used by these services, or when you need to stop all services rather than just the application server.

Logging levels for MapReduce jobs

$ bin/waterline showDebug

$ bin/waterline setDebug

$ bin/waterline resetDebug

These commands change the logging levels for the MapReduce jobs triggered by Waterline Data Inventory operations. showDebug indicates the current setting for the logging levels. setDebug changes the level to DEBUG. resetDebug changes the level to the default INFO setting. The Waterline Data Inventory logs are not affected; see Debugging information.
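For example, a typical sequence when gathering detailed MapReduce logs for a problem job (the directory path is illustrative):

$ bin/waterline setDebug

$ bin/waterline profile /landing/raw

$ bin/waterline resetDebug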

Version information

$ bin/waterline showVersion

This command displays the Waterline Data Inventory version installed. The output also shows a Hadoop distribution; if the distribution listed here is different from the distribution running on the cluster, you may have configuration problems.

You can also find Waterline Data Inventory version information at the beginning of the output of each job. For example:

INFO | 2015-12-08 05:55:51,411 | SequenceOfJobsCrawler [main] - Waterline Data Inventory(GENERIC) Version: 2.1.1 Generic Build: 227

Specifying MapReduce and Waterline Data Inventory properties

The Waterline Data Inventory job command line allows you to specify job-specific values for MapReduce and Waterline Data Inventory properties. Prefix "-D" to the property name; separate additional properties with a space. For example, to set the memory requirements for map tasks and turn off incremental profiling (on one line):

$ bin/waterline profileHive default,finance -Dmapreduce.map.memory.mb=8192
-Dmapreduce.map.java.opts=-Xmx7168m -Dwaterlinedata.incremental=false

The Waterline Data Inventory properties you specify on commands override any values set in Waterline Data Inventory properties files. However, for MapReduce properties to take effect on the command line, remove any duplicate option from the Waterline Data Inventory properties files.

The Waterline Data Inventory properties are described in Configuration controls.


Ensuring valid Kerberos tickets throughout long-running jobs

$ bin/waterline krenewStart

$ bin/waterline krenewStop

In a Kerberos-controlled environment, it’s important to ensure that the user running Waterline Data Inventory jobs has a valid Kerberos ticket for the duration of the job. For the Hadoop jobs started by Waterline Data Inventory, Hadoop manages renewal of the tickets during the job. To ensure tickets are valid for the duration of Waterline Data Inventory jobs, run the job command after starting a ticket renewal script. This routine polls hourly to see if the current ticket issued for the Waterline Data Inventory service user is close to expiring; if so, it renews the ticket.

To run a long-running Waterline Data Inventory job in a Kerberos-controlled environment:

1. Make sure the Waterline Data Inventory service user has a valid Kerberos ticket.

$ su waterlinesvc

$ kinit

2. From the Waterline Data Inventory bin directory, start the ticket renewal routine.

$ cd /opt/waterlinedata/bin

$ ./waterline krenewStart

3. Run the Waterline Data Inventory job.

$ ./waterline profile /user

4. After the job completes, stop the ticket renewal routine.

$ ./waterline krenewStop

To run the job as part of a batch process, the script might look like the following:

echo "Starting WDI HDFS profiling for /landing/raw"

bin/waterline krenewStart

bin/waterline profile /landing/raw -Dmapreduce.map.memory.mb=8196

-Dmapreduce.map.java.opts=1Xmx7172m > log/landing-raw-latest.log

if [ "$?" -ne "0" ]; then

echo "bin/waterline profile command failed."

echo "See the log in /log/landing-raw-latest.log"

exit 1

fi

bin/waterline krenewStop

echo "HDFS profiling for /landing/raw complete"


Monitoring Waterline Data Inventory jobs

Waterline Data Inventory provides a record of job history in the Dashboard of the browser application.

In addition, you can follow detailed progress of each job on the console where you run the command.

Monitoring Hadoop jobs

When you run the “profile” command, you’ll see an initial job for format discovery followed by one or more profiling jobs. There will be at least one profiling job for each file type Waterline Data Inventory identified in the format discovery pass.

The console output includes a link to the job log for the running job. For example:

2014-09-20 18:17:27,048 INFO [WaterlineData Format Discovery Workflow V2] mapreduce.Job (Job.java:submit(1289)) - The url to track the job: http://sandbox.hortonworks.com:8088/proxy/application_1913847052944_0004/

While the job is running, you can follow this link to see the progress of the MapReduce activity.


Alternatively, you can monitor the progress of these jobs using Hue in a browser:

http://<cluster IP address>:8888/jobbrowser

or Ambari's HDFS file browser:

http://<cluster IP address>:8080/#/main/views/FILES/1.0.0/Files

You’ll need to specify the Waterline Data Inventory service user or, if the service user has a corresponding account, sign in using that user.

When reviewing the results for an individual file, you can follow a link in the single-file view to the specific MapReduce log that corresponds to the profiling job for that file.

In addition, you can set the debugging level for these logs using an option on the waterline command; see Logging levels for MapReduce jobs.

Monitoring local jobs

After the Hadoop jobs complete, Waterline Data Inventory runs local jobs to process the data collected in the repository. You can follow the progress of these jobs by watching console output in the command window in which you started the job.


Profiling results

After Waterline Data Inventory jobs run successfully, there may still be individual files that are not profiled or are not profiled completely. The following chart describes the status values that folders, files, and tables can have before and after being processed by Waterline Data Inventory:

Status values for folders, files, and tables after processing

There are three places to look to understand the results of a profiling job:

Dashboard. From inside the Waterline Data Inventory browser application, click Dashboard in the toolbar. This page lists the current and past jobs. If files in a job produced errors and were not processed or were not fully processed, the job status indicates the errors.


Advanced Search. From inside the Waterline Data Inventory browser application, click Advanced Search in the toolbar. Open the Profile Status facet and select one or more of the profile statuses to list the files with a given status. File status values include:

Deleted — Files and tables that were previously processed, but when processed again they were not found in HDFS or Hive. These files will not appear in searches or elsewhere in the application.

Crawled — Files and tables that were processed but only format discovery completed. This status applies only to repositories created in Waterline Data Inventory 1.2.5 and earlier.

Processed — Directories that were included in a profiling job.

Unprocessed — Directories, files, and tables that have never been part of a profiling job. Rarely, this status can include files or tables that were not processed because a formatting job failed. Note that the only reason a directory or file would end up as “unprocessed” is because a user browsed to the location in HDFS before the location was included in a profiling job.

Recognized — Files that Waterline Data Inventory identified by format but the format is not one that Waterline Data Inventory can profile. No profiling was attempted for these files.

Unrecognized — Files that Waterline Data Inventory could not identify by format. No profiling was attempted for these files.

Profiled — An HDFS file or Hive table that profiled successfully. This status also applies if sampling is turned on or if profiling succeeded for a significant portion of the file.

Profile Failed — Profiling encountered too many errors in this HDFS file or Hive table to produce profiling output. Look for specific errors in the output of the profiling job.

To find all files or tables that are not processed successfully, select facets for Unprocessed, Unrecognized, and Profile Failed.

Browse, Search, and Single File views. The file information for each profiled file or table includes the profile status for the file. From inside the Waterline Data Inventory web application, you can see the profile status with other file-level information.

Sampled profiling results

You can configure Waterline Data Inventory jobs to profile only a sample of the data in files or tables (waterlinedata.profile.sampled=true). You can identify the files or tables that are sampled using the Profile Sampled facet in the Advanced Search. Sample status values include:

Not Applicable. Returns folders, files, and tables that are either unprocessed or unprofiled.

Full. Returns files and tables that were successfully profiled using all the available data.

Sampled. Returns files and tables that were successfully profiled using only a sample of the available data.

The file information for each profiled file or table includes the sample status for the file. From inside the Waterline Data Inventory browser application, you can see the sample status with other file-level information.

Debugging information

There are multiple sources of debugging information available for Waterline Data Inventory. If you encounter a problem, collect the following information for Waterline Data support.

Job messages

Waterline Data Inventory generates console output for jobs run at the command prompt. If a job encounters problems, review the job output for clues. These messages appear on the console and are collected, at debug logging level, in a log file:

/var/log/waterlinedata/wd-jobs.log

See also Setting logging levels.

MapReduce job messages

Many Waterline Data Inventory jobs trigger MapReduce operations. These jobs produce output that is accessible on the cluster jobhistory server and also through Hue’s job browser or the Ambari MapReduce job history. You can open the MapReduce job log that corresponds to the profiling operation for a given file or table by opening that resource in the browser and following the link in the profile status.


See also Logging levels for MapReduce jobs.

Application server messages

The embedded application server, Jetty, produces output corresponding to user interactions with the browser application including the installation process. These messages appear on the console and are collected in a log file:

/var/log/waterlinedata/wd-ui.log

Use tail to see the most recent entries in the log:

$ tail -f /var/log/waterlinedata/wd-ui.log

See also Setting logging levels.

Lucene search indexes

In some cases, it may be useful to examine the search indexes produced by the product. These indexes are found in the following directory:

/var/lib/waterlinedata/index

Waterline Data Inventory repository

In some cases it may be useful to examine the actual repository files produced by the product. The repository datastore is found in the following directory:

/var/lib/waterlinedata/db/waterlinedatastore


Setting up users to access Waterline Data Inventory

For your end users, there are three things to take care of:

Authentication. Only authenticated users can sign into the Waterline Data Inventory web application. You can configure Waterline Data Inventory to accept any authenticated user or to require that users be pre-configured before signing in. See Controlling user access to the web application.

Authorization. Authorization itself happens outside Waterline Data Inventory, but its results are highly visible there: once your users start accessing your cluster through Waterline Data Inventory, they'll be very aware of what they do and don't have access to. If you don't have a process for deciding who gets access to what, put one in place as part of your roll-out of the catalog.

Waterline Data Inventory user profiles. You have the option of allowing any authenticated user to sign into the web application or creating user profiles in advance and restricting access to the users who have profiles. If you allow any authenticated user access, each user is automatically given the “End User” role, which grants view access to Waterline Data Inventory metadata. To give users more privileges, modify their user profiles. See Managing Waterline Data Inventory user profiles.

Supporting self-service users

Waterline Data Inventory is designed to enhance the ability of users of Hadoop data to find the right data in Hadoop. It endeavors to open Hadoop to these users while reducing the burden on IT to provide that access and while maintaining control over secure and sensitive data.

To achieve this balance of better data tools for end-users and a secure and controlled data environment, administrators configure end-user access to Waterline Data Inventory in the following ways:

Secure access. Users of the Waterline Data Inventory browser application need to have accounts that can access the cluster, whether through the operating system or through an authentication system such as Kerberos.

HDFS and MapR-FS navigation. If users have a matching account in HDFS, the users’ browsing home in Waterline Data Inventory is their HDFS home directory. If the end-users of your organization’s cluster data do not have accounts in HDFS, you can configure Waterline Data Inventory to open at a set location in HDFS. See waterlinedata.defaultdirectory in Self-service browsing.

Hive table creation. Waterline Data Inventory integrates with Hive to include Hive tables in the catalog. It accesses Hive databases and tables as the current user to securely display profiling information. It allows users to create Hive tables from HDFS files. This third integration point provides a gateway for users to act on files they identify using Waterline Data Inventory: users can request a file be copied into a Hive table, then access the Hive database from visualization, reporting, and analytic tools outside the cluster.

Managing Waterline Data Inventory user profiles

Waterline Data Inventory provides roles to organize end-user access to operations such as creating tags, associating tags, and managing tag associations and lineage relationships.

Waterline Data Inventory provides four roles that organize Waterline Data Inventory operations. For example, for a user to be able to create tags, assign the Data Steward role to the user profile and then indicate in which tag domain the user can manage tags. To configure a user to be able to create tags and assign them to files and fields, assign the Data Steward and Annotator roles, making sure to set the appropriate domain for the Data Steward role. You can assign any or all of the roles to any user profile. Users are automatically configured with the “end-user” role, which allows them to search and browse data and metadata they are authorized to view and to create Hive tables from HDFS files.

Create your first set of tag domains before setting up Data Steward roles.

The roles authorize operations as follows:

Administrator
  User profiles: Create, remove, and update user profiles; add and remove roles
  Tag domains: Create, remove, and update tag domains

Data Steward
  Tags: Create, remove, and update tags within the authorized domain(s)

Annotator
  Tag associations: Create, approve, and reject tag associations (any tag domain)
  Origins: Create, update, and remove origins
  Lineage relationships: Create, approve, and reject lineage relationships
  Collections: Create, approve, and remove collections

End User
  Data: View authorized data
  Metadata: View authorized metadata
  Hive tables: Create Hive tables


To configure user profiles:

1. Click Manage in the toolbar and open User Profiles.

2. Create a new user (click Create) or select an existing user.

3. Select the role or roles to assign to this user.

4. If you selected the Data Steward role, select one or more tag domains over which this user has management privileges.

5. Click Save.

Backing up Waterline Data Inventory metadata

The important consideration for backing up Waterline Data Inventory software and metadata is that the three locations where metadata is stored should be backed up together so that the metadata is synchronized across all three locations. The metadata locations are:

Metadata repository

HDFS metadata files

Search indexes

If your Waterline Data Inventory configuration includes making copies of data to back Hive tables, account for the location of those copies in your backup plan; because these files are located in HDFS, your standard backup process may already cover them. These files are not sensitive to the concurrency requirement of the metadata storage locations.

See also the Waterline Data technical note Backup, failover, and recovery.


Backup

Plan the backup process to occur such that each metadata storage location is backed up at the same time.

1. Stop Waterline Data Inventory processes, including the Jetty application server and any running jobs.

2. Archive the software. For example, copy the installation directory to another location:

$ cp -r /opt/waterlinedata /opt/waterlinedata_v25

3. Archive the repository.

Back up the Derby or MySQL database used as the Waterline Data Inventory repository.

For Derby, this data is located in /var/lib/waterlinedata/db on the node where Waterline Data Inventory is installed.
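For MySQL, a minimal sketch using mysqldump (the database name, user, and backup path shown are assumptions; substitute the values from your repository configuration):

$ mysqldump -u waterlinedata -p waterlinedatastore > /backup/waterlinedatastore.sql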

4. Archive HDFS metadata files.

These files are located in HDFS as specified by the configuration property waterlinedata.profile.processingdirectory.
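For example, assuming the processing directory is configured as /user/waterlinedata/.wld_hdfs_metadata (a placeholder path; check your configured value), you might copy it out of HDFS with:

$ hdfs dfs -copyToLocal /user/waterlinedata/.wld_hdfs_metadata /backup/wld_hdfs_metadata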

5. Archive search indexes.

Search indexes are located in /var/lib/waterlinedata/index by default; the location is specified by the configuration property waterlinedata.metadata.search.index.rootDir.
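For example, assuming the default index location, you might archive it with:

$ tar -czf /backup/waterlinedata_index.tar.gz /var/lib/waterlinedata/index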

6. If applicable, validate that the HDFS copies of Hive backing files are archived.

If configured, copies of data for Hive tables are located in HDFS as specified by the configuration property waterlinedata.profile.hivedir.

Restore

The key to restoring Waterline Data Inventory metadata is restoring all three locations before resuming normal operation.

1. Restore Waterline Data Inventory software distribution.

2. Restore content of repository.

3. Restore content of search index data.

4. Restore content of HDFS metadata files.

5. Verify that data copies for Hive tables are in place.

6. Run bin/waterline upgradeInstall.

7. Start Waterline Data Inventory services.


Part 4 Tuning

o Configuration controls
o Data source configuration
o Port configuration
o Repository configuration
o Authentication configuration
o Security configurations
o Metadata and staging file locations
o Debugging configuration
o Format discovery tuning
o Profiling tuning
o Search functionality
o Discovery functionality
o Browser app functionality


Configuration controls

Waterline Data Inventory provides a number of configuration settings and integration interfaces to enable extended functionality. The configuration controls reside in property and configuration files within the application and as part of Hadoop.

Waterline Data configuration controls

The configuration controls within the Waterline Data Inventory application are available from the Manage tab, including:

User profiles (with authentication settings)

Configuration properties

Data sources (including HDFS and Hive)

There are additional configuration points for rarely changed items:

waterlinedata/lib/resources/install.properties

waterlinedata/lib/resources/dbconf.properties

waterlinedata/lib/resources/derby.properties

waterlinedata/lib/resources/log4j.xml

waterlinedata/bin/waterline

External configuration settings

The Waterline Data Inventory functionality that is controlled in Hadoop and Hive configuration files is listed in the following table. Typically, you'll make such changes using the management tools in use on the cluster (such as Cloudera Manager, Ambari, or MapR Control System).

Function: Impersonation
Configuration file location: Hadoop core-site.xml
Description: If you need to configure impersonation as the strategy to allow the Waterline Data Inventory service user to access files and Hive tables using the active user's credentials, you'll need to modify properties in this file to include the service user as a proxy holder for HDFS. See Configure cluster-level impersonation.

Function: Memory management
Configuration file location: Hadoop mapred-site.xml
Description: If you need to adjust the memory allocations for map and reduce tasks to make profile processes more efficient, you'll need to modify properties in this file. See MapReduce job performance controls.

Function: Waterline Data JARs needed during MapReduce jobs
Configuration file location: Hive lib and/or auxlib
Description: Waterline Data Inventory places files in the class path for Hive so that applications that open Hive tables created from Waterline Data Inventory have access to the input formats needed to access the Hive tables. If you need to modify this location or find other ways to manage Waterline Data Inventory JAR files, see Making Waterline Data Inventory files available for using Hive.

Configuration control summary

The following table summarizes the configuration properties that control Waterline Data Inventory behavior. They are listed in alphabetical order by property name. Links are provided when more information is available for a given property.

Property Description

mapreduce.map.output.compress Enable compression for temporary files produced between map and reduce tasks

mapreduce.map.output.compress.codec Codec for compression for temporary files produced between map and reduce tasks

mapreduce.output.fileoutputformat.compress Enable compression for output files from reduce tasks

mapreduce.output.fileoutputformat.compress.codec Codec for compression of output files from reduce tasks

mapreduce.output.fileoutputformat.compress.type Type of compression for output files from reduce tasks

waterlinedata.auditing.enabled Enables auditing of metadata operations within Waterline Data Inventory

waterlinedata.cleanorphanedhivetables Remove Hive tables from the repository that no longer have backing HDFS files

waterlinedata.compression Enable any compression of map or reduce output

waterlinedata.datetime.maxlength DateTime length limit

waterlinedata.defaultdirectory Browse home location if not the user's HDFS home directory

waterlinedata.discovery.allow_field_name_matching Use of field names in tag association suggestions

waterlinedata.discovery.lineage.batch_size Batch size for candidate child resources

waterlinedata.discovery.lineage.diff_same_directory_sec Time window for lineage candidates in the same directory

waterlinedata.discovery.lineage.field_name_match Minimum level of field-name similarity to be considered a match in lineage discovery

waterlinedata.discovery.lineage.mapred.reduce.tasks Max number of reduce tasks for lineage

waterlinedata.discovery.lineage.max_splits Max combined splits per map for lineage

waterlinedata.discovery.lineage.min_lineage_field_count Min matching fields for lineage

waterlinedata.discovery.lineage.min_lineage_percent_field_count Minimum rate of matching fields count required to consider two resources as candidates for a lineage relationship

waterlinedata.discovery.lineage.min_max_cardinality Minimum cardinality to consider a field for lineage


waterlinedata.discovery.lineage.overlap Minimum level of field overlap to be considered a match in lineage discovery

waterlinedata.discovery.min.value_overlap Min % of values required to match for value tags

waterlinedata.discovery.regex.tolerance.weight Min % of values required to match for regular expression tags

waterlinedata.discovery.smallest.collection.size Minimum number of items considered for successful collection discovery

waterlinedata.discovery.tag_propagation.mapred.reduce.tasks Number of reduce tasks used in tag discovery

waterlinedata.discovery.tags.max_suggested Limit to the number of tags suggested for the same field; highest weight tags win out

waterlinedata.discovery.tags.min_cardinality.partial_match Minimum cardinality value used for matching

waterlinedata.discovery.tags.null_values Field values synonymous with null values

waterlinedata.discovery.tags.value_hit_diff Maximum gap allowed between tag association weights for suggested tags

waterlinedata.discovery.tolerance.name_match Value tag name-matching tolerance

waterlinedata.discovery.tolerance.name_match_pre_defined Pre-defined tag name-matching tolerance

waterlinedata.discovery.tolerance.name_match_reg_ex Regular expression tag name-matching tolerance

waterlinedata.discovery.tolerance.temporal_sec Tolerance for temporal data comparison (in seconds)

waterlinedata.discovery.tolerance.weight Minimum weight for tag associations to be suggested

waterlinedata.formatdiscovery.deep Flatten map columns into a column for each key-value pair within the map column

waterlinedata.formatdiscovery.text_type_patterns Mime types that are considered for format detection for log, XML, CSV, and JSON file types in text files

waterlinedata.hive.create_table_in_place Copy of backing files for Hive tables created from Waterline Data Inventory

waterlinedata.ignore.orc Disable checking for ORC file format when discovering format types

waterlinedata.incremental Profile only new and updated files and tables

waterlinedata.max.numeric.digits Maximum number of digits for a string to be considered as a number

waterlinedata.max.top_k_length Maximum length of data stored for most-frequent values

waterlinedata.metadata.search.maxHits.dataResource Sets the cap for search results

waterlinedata.pattern.top_k_capacity Number of the most-frequent value patterns stored for each field in each file or table.

waterlinedata.profile.cardinality_precision Precision of cardinality calculation for relatively high cardinality values

waterlinedata.profile.cardinality_sparse_precision Precision of cardinality calculation for relatively low cardinality values

waterlinedata.profile.combined.max_mappers_per_job Maximum number of map tasks to use for a given MapReduce job

waterlinedata.profile.combinedmapper Combine profile processing of files into a single map task based on the size of the files

waterlinedata.profile.create_hive_lineage Create lineage relationship between a Hive table and resources referenced as backing files


waterlinedata.profile.data_format_discovery Turn off default data type discovery for data with no defined types

waterlinedata.profile.datetime.formats Patterns identified as dates

waterlinedata.profile.format.discovery.consider_separators Letters to allow as delimiters during format discovery

waterlinedata.profile.format.discovery.non_separators Characters to avoid as delimiters during format discovery

waterlinedata.profile.high_cardinality.optimization Optimize profiling at the expense of accurate sample data for high-cardinality values

waterlinedata.profile.hivedir Location of copies of HDFS files used to create Hive tables

waterlinedata.profilehivebackingfiles Profile HDFS backing files in addition to profiling the data through the Hive table

waterlinedata.profile.maximum.field.count Maximum number of fields per record

waterlinedata.profile.numeric.formats Specialized Java numeric formats to consider strings as numbers during profiling

waterlinedata.profile.parquetcombined Combine resources in map tasks for parquet file types

waterlinedata.profile.processingdirectory Location of HDFS metadata store

waterlinedata.profile.regex_evaluation Scope of regular expression evaluation during profiling

waterlinedata.profile.sampled Enable profiling based on a sample of file or table contents

waterlinedata.profile.sampled.fraction Fraction of data per file used for sampling (used if file or table is larger than 3 blocks)

waterlinedata.profile.top_k Number of the most frequent values used in search indexes and shown in the web applications

waterlinedata.profile.top_k_capacity Number of the most frequent values collected during profiling

waterlinedata.profile.top_k_tokens Number of the most frequent values used to perform tag association matches

waterlinedata.reprofile_files_with_error Force reprofiling of files that have failed profiling in a previous job

waterlinedata.security.administrator.username Default superuser username

waterlinedata.temproot Location of temporary staging area on local computer

waterlinedata.userprofile.autocreate User profiles created at login

Data source configuration

Waterline Data Inventory supports configuring two kinds of data sources for your inventory: HDFS and Hive. Use the Manage > Data Sources page to create or update the connection and authentication information for data sources.

Include the following information for connecting to a data source:

Name: an arbitrary name of the data source. This name appears on the right edge of the list of data sources.

Description: optional text to describe the data source.


URL: the connection string used by both the application server and Waterline Data Inventory jobs to locate inventory resources. Be sure to test this connection before saving it.

HDFS data source

Waterline Data Inventory can read data from and write data to HDFS. To configure HDFS as a source of data for the inventory, add HDFS as a data source using Manage > Data Sources. This is typically done during the install process.

The name and description fields are for documenting the configuration settings. The URL determines the location where Waterline Data Inventory expects to find the root of HDFS. It is used by both the Waterline Data Inventory profiling engine and the application server to access HDFS. Different Hadoop distributions and deployment types have requirements for the format of the URL:

For high-availability clusters (HA):

Specify the logical name for the cluster root. This value is defined by dfs.nameservices in the hdfs-site.xml configuration file.

For AWS EMR:

Specify the EMRFS name of the S3 bucket containing your data. For example, "s3://acme-impressions-data/".

For MapR:

Typically you would identify the root of the MapR file system with "maprfs:///".

For Kerberos:

Specify the keytab and principal created for the Waterline Data Inventory service user in addition to the URL.
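For illustration, assuming a high-availability cluster whose dfs.nameservices value is acme-cluster (a placeholder name), the URL field for the deployment types above might look like:

hdfs://acme-cluster (HA cluster)
s3://acme-impressions-data/ (AWS EMR)
maprfs:/// (MapR)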

Hive data source

Waterline Data Inventory can read data from and write data to Hive. To configure a Hive metastore as a source of data for the inventory, add Hive as a data source using Manage > Data Sources.

The name and description fields are for documenting the configuration settings. All Hive connections include a URL as the connection string used by the Waterline Data service user to access Hive:

jdbc:hive2://<HiveServer2 host>:10000

The URL format differs based on the authentication required to connect to Hive.

For SSH authentication:

waterlinedata.hiveurl=jdbc:hive2://localhost:10000

waterlinedata.hivedatabasename=<default database>


For MapR (secure and non-secure):

The Hive connection URL includes an additional parameter to explicitly set the authentication mode to NOSASL:

waterlinedata.hiveurl=jdbc:hive2://mapr-host:10000/default;auth=noSasl

For Kerberos:

When the Waterline Data Inventory application server service uses Kerberos authentication, the hiveurl needs to include the principal for the Hive user account and authentication parameters:

jdbc:hive2://<Hive host name>:<Hive port>/<Hive database>;principal=<Hive service principal name>;auth=kerberos;kerberosAuthType=fromSubject
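As a concrete sketch, for a hypothetical HiveServer2 host hive.example.com in the realm EXAMPLE.COM, the resulting URL would be (on one line):

jdbc:hive2://hive.example.com:10000/default;principal=hive/hive.example.com@EXAMPLE.COM;auth=kerberos;kerberosAuthType=fromSubject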

In the Data Sources configuration page, as you add values for the first section, the installer builds a URL for you:

Hostname and port: the HiveServer2 host. By default, the port is 10000.

Domain: a domain corresponding to the realm configured for the cluster. Use a domain_realm entry from the krb5.conf file on the Hive host. This value will be used to construct the principal parameter of the URL.

Specify the keytab and principal created for the Waterline Data Inventory service user.

Make sure to test your entries!


Port configuration

Waterline Data Inventory processes listen on ports 4444, 8082, 8084, and 8482 by default. If you need to change which ports are used, do the following after running the command-line portion of the installer:

1. Be sure that the Waterline Data Inventory services are stopped.

$ ps -ef | grep water

If they are still running, stop them using:

$ waterlinedata/bin/waterline serviceStop

2. Change the conflicting port numbers in the locations listed in the following table.

3. Restart services.

$ waterlinedata/bin/waterline serviceStart

The services, default ports, and configuration locations are as follows:

Service: Derby repository database
Default port: 4444
Configuration locations (three files in the waterlinedata/lib/resources directory):
install.properties: DERBY_PORT=4444
dbconf.properties: javax.persistence.jdbc.url=jdbc:derby://localhost:4444/waterlinedatastore;create=true
derby.properties: derby.drda.portNumber=4444

Service: Jetty application server
Default ports: 8082 (HTTP), 8482 (HTTPS)
Configuration location: waterlinedata/lib/resources/install.properties: JETTY_HTTP_PORT=8082, JETTY_HTTPS_PORT=8482

Service: Web server daemon
Default port: 8084
Configuration location: waterlinedata/lib/resources/install.properties: WEB_DAEMON_PORT=8084

Repository configuration

Repository connection credentials

Regardless of the repository database type, there is a single user configured to access it from Waterline Data Inventory services. Configure the username and password in the following properties:

[dbconf.properties file]

javax.persistence.jdbc.user=waterlinedata (default)

javax.persistence.jdbc.password=<encrypted password>

Shut down all Waterline Data Inventory services before making changes to repository properties.


When security is not a factor, you can insert repository credentials in plain text; however, Waterline Data Inventory provides a utility to obfuscate stored passwords, as described in Obscuring passwords in Waterline Data Inventory configuration files.

Derby repository database connection

Waterline Data Inventory includes embedded Derby as its repository database. Both Waterline Data Inventory jobs and the application server access Derby using the following connection information. You won't need to change this information unless you are replacing Derby with another database or you need to change the default port selection.

[dbconf.properties file]

javax.persistence.jdbc.driver=org.apache.derby.jdbc.ClientDriver

javax.persistence.jdbc.url=jdbc:derby://<local host name>:4444/waterlinedatastore;create=true

Shut down all Waterline Data Inventory services before making changes to repository properties.

Changing the default Derby communication port

By default, Waterline Data Inventory's instance of Derby communicates on port 4444. If you need to change that port number to avoid a conflict with another Hadoop process, stop Waterline Data Inventory services, then update the port number as follows:

1. lib/resources/install.properties:

DERBY_PORT=4444

2. lib/resources/derby.properties:

derby.drda.portNumber=4444

3. lib/resources/dbconf.properties (on one line):

javax.persistence.jdbc.url=jdbc:derby://localhost:4444/waterlinedatastore;create=true

Shut down all Waterline Data Inventory services before making changes to repository properties.


Firewall configuration

If you expect administrators or end-users to access Waterline Data Inventory across a firewall, consider allowing access to the following ports at the cluster IP address:

Port  Application Component  Use

8082  Waterline Data Inventory browser application  End-users: Access to the Waterline Data Inventory browser application.

8482  Waterline Data Inventory browser application with HTTPS  End-users: Access to the Waterline Data Inventory browser application with HTTPS.

10000  Hive  End-users: Access to Hive tables.

19888  Hadoop job history  Administrators: Access to troubleshooting information.

4444  Derby  Administrators: Access to troubleshooting information.

8888  Hue  Administrators: Access to HDFS files and to MapReduce job status and logs.

8080  Ambari  Administrators: Access to HDFS files and to MapReduce job status and logs.

Authentication configuration

Waterline Data Inventory integrates with your existing Linux and Hadoop authentication mechanisms, such as SSH-based authentication or through systems such as Kerberos.

By default, it uses SSH authentication, meaning users configured for HDFS are assumed to have a corresponding Linux account; they would sign in to Waterline Data Inventory using their network credentials. To configure different authentication, an administrator would configure the Waterline Data Inventory application server to accept the other authentication system.

To switch authentication systems after installation:

1. Sign into the Waterline Data Inventory web application as a user with administrator privileges.

2. Go to Manage > User Profiles.

3. Click the current authentication type at the top of the page.


4. Change the authentication type and configure the new system, including specifying a valid user for the admin user.

5. Click Save.

6. Restart the application server.

SSH configuration

SSH is a reliable security mechanism that has one limitation: it assumes the password authentication mechanism is available to the application server. As such, it will not work on systems that use Amazon AWS or Google Compute clouds.

When configured to use SSH for user authentication, Waterline Data Inventory application server communicates with the host system on the listen address and port defined in /etc/ssh/sshd_config. By default, the port is set to 22. If your organization uses a different convention, update the port (authPort) setting for the sshd service as described in Port configuration.

User access configuration for public cloud clusters

Amazon Web Services (AWS) and Google Cloud Platform do not support a password authentication mechanism for managing users; instead, they use SSH key-based authentication. Currently, Waterline Data Inventory does not support SSH keys for authentication on cloud deployments; it uses a local database to determine user credentials. Note that this method does not supersede the cloud provider's security, nor does it override the operating system's security concepts. The user list grants access to the Waterline Data Inventory web application only. Waterline Data Inventory respects the access privileges granted by the file system: the user list can include user names configured in the operating system, and listed users that are not mirrored in the operating system see only files that can be read by all users.


To switch to local database authentication after installation:

1. Sign into the Waterline Data Inventory web application as a user with administrator privileges.

2. Go to Manage > User Profiles.

3. Click the current authentication type at the top of the page.

4. Change the authentication used to "Local Database".

5. Specify a username for the initial Admin User.

This user does not have to exist on the operating system or in the cluster.

6. Click Save.

7. Restart the application server.

Security configurations

Waterline Data Inventory is designed to leverage the security infrastructure configured for your cluster, whether that involves Kerberos, Kerberos and Sentry, Ranger, or systems developed outside of these standards.

Secure communication between browser and application server (SSL)

You can configure Waterline Data Inventory to use SSL to communicate between the client where the browser is running and the application server. This setup requires:

Server X.509 certificate for the external application server address. This can be a certificate issued by a commercial certificate authority (such as RSA or VeriSign) or a self-signed certificate.

Secure keystore inside Waterline Data Inventory's Jetty application server distribution.


The Jetty documentation provides instructions for generating a self-signed certificate and for creating and loading keystore values:

www.eclipse.org/jetty/documentation/current/configuring-ssl.html#generating-key-pairs-and-certificates

The Waterline Data Inventory Jetty configuration is included in the following directory:

<install location>/waterlinedata/jetty-distribution-*/waterlinedata-base

Configuration files include:

Component Configuration File Location

Keystore etc/keystore

HTTPS start.d/https.ini

SSL start.d/ssl.ini
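As a sketch of generating a self-signed certificate into this keystore with the JDK's keytool (the alias, key size, and validity shown are illustrative assumptions; the Jetty documentation above covers the details):

$ keytool -genkeypair -alias jetty -keyalg RSA -keysize 2048 -validity 365 -keystore <install location>/waterlinedata/jetty-distribution-*/waterlinedata-base/etc/keystore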

Obscuring passwords in Waterline Data Inventory configuration files

To convert passwords to obfuscated values, run the following command, provide the password when prompted, and then insert the output into the appropriate resource file.

<install location>/waterlinedata/bin/internal/obfuscate

The output is also saved as obfuscate.out.
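A minimal sketch of the workflow, assuming you are obfuscating the repository password (the property shown is from dbconf.properties):

$ <install location>/waterlinedata/bin/internal/obfuscate

Then copy the value from the console output (or from obfuscate.out) into the resource file, for example:

javax.persistence.jdbc.password=<obfuscated value>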

Metadata and staging file locations

The following configuration properties allow you to specify the location of staging files that Waterline Data Inventory creates when it collects profiling information from HDFS files (and Hive tables) and when it runs discovery processes on the edge node.

Metadata files for HDFS and Hive profiling

Use this configuration property to identify the HDFS (or MapR-FS) directory Waterline Data Inventory uses to store metadata it generates during profiling HDFS files and Hive tables. Some of this data is temporary staging data; some of this data persists to augment the data maintained in the repository. Make sure that the Waterline Data Inventory service user has write access to the configured location. If you change this property, we recommend you keep the directory name even if you change the HDFS path.

When profiling Hive tables, you can override this value by specifying a staging location on the command line.

waterlinedata.profile.processingdirectory=<HDFS Path>/.wld_hdfs_metadata


Hive table backing file location

When users create Hive tables from ORC, RC, and Sequence files, Waterline Data Inventory creates a copy of the data in a new location in HDFS and creates the Hive table from the copied file or files. The browser application includes links between the backing files and the Hive table. If users create Hive tables from text, JSON, or log files or collections, Waterline Data Inventory does not create a copy of the file before creating the Hive table. You can change this behavior so that Waterline Data Inventory always makes copies of the files before creating the Hive table.

Because Hive requires write access to the directory where the source files reside, you may want to have Waterline Data Inventory make copies of the files so that you do not have to give write access to users who may create Hive tables.

Note that Hive tables made from files that form a partitioned collection are never copied from their original location. Therefore, to make a Hive table from a partitioned collection, the user needs to have write access to the parent directory.

There are two properties that control the behavior for creating Hive tables:

waterlinedata.profile.hivedir=<HDFS Directory>

waterlinedata.hive.create_table_in_place=true

By default, Hive tables are created in place; if copies of files are needed, set the property to “false” and set the location where Waterline Data Inventory will create file copies.

If the hivedir property is not set, file copies are placed in a directory named “HiveFiles” in the active user’s home directory. If the active user does not have a home directory in HDFS, files are copied to the location specified by waterlinedata.defaultdirectory. If none of these locations is writeable by the active user, the user won't be able to create Hive tables from Waterline Data Inventory.
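For example, to have Waterline Data Inventory always copy backing files into a shared HDFS directory (the path shown is an assumption; choose one the service user can write to), you might set:

waterlinedata.hive.create_table_in_place=false
waterlinedata.profile.hivedir=/user/waterlinedata/HiveFiles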

Create table with files in place (waterlinedata.hive.create_table_in_place=true, the default):

User has Write access in the parent directory — Text, JSON, Log, Avro: Succeeds; ORC, RC, Sequence: N/A; Collections*: N/A
User does not have Write access in the parent directory — Text, JSON, Log, Avro: Fails; ORC, RC, Sequence: N/A; Collections*: N/A

Create tables from file copies (waterlinedata.hive.create_table_in_place=false):

User has Write access in the directory set by waterlinedata.profile.hivedir — Text, JSON, Log, Avro: Succeeds; ORC, RC, Sequence: Succeeds; Collections*: N/A
Property hivedir is not set and the user has Write access to /user/<current-user> or the location described by waterlinedata.defaultdirectory — Text, JSON, Log, Avro: Succeeds; ORC, RC, Sequence: Succeeds; Collections*: N/A
Property hivedir is not set and the user does not have Write access to /user/<current-user> or the location described by waterlinedata.defaultdirectory — Text, JSON, Log, Avro: Fails; ORC, RC, Sequence: Fails; Collections*: N/A

* Files are never copied when Hive tables are created from collections.

Note that you need to restart the application server to have these options take effect.


Hive requires that users creating Hive tables have WRITE access to the directory containing the backing file or files contributing to the Hive table. If the source HDFS files are related with one or more partitions, Hive also requires that users have ALTER privileges for the table. For more information, see Cluster access permissions.

Staging area for discovery tasks

This property indicates the local file system directory Waterline Data Inventory uses to store temporary files created during lineage and collection discovery processing. Make sure that the Waterline Data Inventory service user has write access to the configured location. By default, this value is set to /tmp.

waterlinedata.temproot=<existing local directory>

Search index location

This property indicates the location of the search indexes. Consider changing this location if you want to distribute storage to a different disk on the same node as Waterline Data Inventory or if you want to include the index as part of the overall software location.

waterlinedata.metadata.search.index.rootDir=/var/lib/waterlinedata/index (default)

If you change this parameter, be sure to move any existing indexes to the new location; stop and restart the application server to have this change take effect.
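A minimal sketch of the relocation, assuming the new disk is mounted at /data2 (a placeholder path):

$ waterlinedata/bin/waterline serviceStop
$ mv /var/lib/waterlinedata/index /data2/waterlinedata/index

Then set waterlinedata.metadata.search.index.rootDir=/data2/waterlinedata/index and run waterlinedata/bin/waterline serviceStart.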

Debugging configuration

Setting logging levels

Waterline Data Inventory jobs use the Apache log4j logging API and its conventions for identifying the level of messages reported by the application. Waterline Data Inventory produces messages with the levels ERROR, WARN, INFO, and DEBUG. By default, console output displays messages at the INFO level and more severe, while logs are set to the DEBUG level. You can increase or reduce the severity of the messages recorded in a given log by adjusting the levels in the logging control files.

Operation: Jetty app server
Output location: console, wd-ui.log
Logging controls: <install-location>/waterlinedata/jetty-distribution-9.2.1.v20140609/waterlinedata-base/resources/logback.xml

Operation: Waterline Data Inventory jobs
Output location: console, wd-jobs.log
Logging controls: <install-location>/waterlinedata/lib/resources/log4j.xml

Operation: MapReduce jobs
Output location: cluster jobhistory server, accessible through Hue or Ambari
Logging controls: command-line options (as described in Logging levels for MapReduce jobs): waterline setDebug, waterline resetDebug, waterline showDebug


Format discovery tuning

Format discovery is a required step before Waterline Data Inventory can profile files or tables. During format discovery, Waterline Data Inventory reviews all the files in the processing directory (or tables in the Hive database list) to determine whether the resource has been profiled already (assuming the system is set for incremental profiling) and how the data is formatted. The data format determines what profiling processes are applied.

Identifying field separators

Waterline Data Inventory parses flat files such as comma-separated or log files to determine field separators, looking for characters (not letters or numbers) that are repeated within each row of the file. If it finds more than one candidate for a field delimiter, it ranks the choices based on the number of occurrences of the character in the file and uses the highest ranked candidate.

You can specify characters (including digits or letters) that Waterline Data Inventory should accept as field separators by adding them to the following property. Include additional characters in the list as their Unicode values without spaces or commas.

waterlinedata.profile.format.discovery.consider_separators="\u00FE\u00E7\u00C7"

By default, the product finds thorn (þ), small letter c with cedilla (ç), and capital letter C with cedilla (Ç) as delimiters.
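For example, to also accept the broken bar (¦, Unicode \u00A6) as a field separator, you might append its Unicode value to the default list (an illustrative addition; confirm the character does not appear inside legitimate field values):

waterlinedata.profile.format.discovery.consider_separators="\u00FE\u00E7\u00C7\u00A6"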

In addition you can specify characters that Waterline Data Inventory does not consider as field delimiters. There are a number of characters not considered as delimiters by default; you may find that you need to remove characters from this configuration to correctly parse your data.

waterlinedata.profile.format.discovery.non_separators="+-.\\/\"`()[]{}<>'"

To include special characters such as tabs, follow the Java conventions for escape sequences described here: docs.oracle.com/javase/tutorial/java/data/characters.html

If Waterline Data Inventory cannot determine the appropriate delimiter for a file, it may count a very large number of fields for the file. To avoid producing profiling results for a badly parsed file, a field-count limit marks the point at which too many fields signals that format discovery is producing inappropriate results.

waterlinedata.profile.maximum.field.count=5000

If a file is determined to have more than 5000 fields, Waterline Data Inventory stops processing the file and marks the file as "Profile Failed". There is a message in the profiling log to indicate the problem.

If your inventory data includes files that legitimately contain more than 5000 fields, reset this property to a value above the maximum number of fields that are likely to be included in a given file. For example, XML files may include very large numbers of fields as each element in the file is recorded as a separate field.
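For example, for inventory data dominated by wide XML files, you might raise the limit (the value shown is illustrative; size it to your widest expected file):

waterlinedata.profile.maximum.field.count=20000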

Custom format identification

You can extend the list of file types that Waterline Data Inventory recognizes, as well as describe additional flat-file delimiters, by including additional mime type definitions in the following configuration file:

{install_dir}/waterlinedata/lib/resources/org/apache/tika/mime/custom-mimetypes.xml

Waterline Data Inventory recognizes the standard mime types supported by Apache Tika without modification; see tika-mimetypes.xml published by the Apache Tika project.

To have Waterline Data Inventory recognize by its extension a file type that would otherwise appear as "Unknown" and "Unrecognized", include the following mime-type definition in the configuration file. The specified files will be marked as "Recognized" and as the file type you indicate; the files will not be profiled.

<mime-type type="<mime-type>">
  <glob pattern="*.<file extension>"/>
</mime-type>

Replace <mime-type> with the major type, a forward slash, and the minor type where the minor type is typically your custom file type extension. This value will appear in Waterline Data Inventory to identify files of this type. Replace <file extension> with the lower-case letters of the file extension. For more information, see Add your MIME-type in the Tika documentation.

For example, to have Waterline Data Inventory recognize Hive Query Language (HQL) files (named with the .HQL extension) and mark their content type, use the following mime-type definition:

<mime-type type="text/HQL">

<magic priority="50">

<glob pattern="*.hql"/>

</magic>

</mime-type>

To have Waterline Data Inventory find a (non-letter or digit) character as a delimiter that it does not identify by default, include the delimiter in a mime-type definition such as:

<mime-type type="text/plain">

<magic priority="50">

<match value="<regular expression indicating the delimiter among other

characters>" type="regex" />

</magic>

</mime-type>


For example, to have Waterline Data Inventory profile flat files that use an ASCII control code as a delimiter (such as Control A), you would include the following mime-type definition:

<mime-type type="text/plain">

<magic priority="50">

<match value="^.*(\\cA).*" type="regex" />

</magic>

</mime-type>

To have Waterline Data Inventory find a letter or digit character as a delimiter that it does not identify by default, add the character in the white list of delimiters described in Identifying field separators.

Filename limitations

Waterline Data Inventory supports file and directory names to the maximum length allowed in HDFS, 2048 characters.

Files will fail format discovery if filenames include special characters. Generally, files that would trigger a problem with Waterline Data Inventory will not work for HDFS either. Specifically:

Files with curly or square brackets ({ } or [ ]) in their names appear in Hue but cannot be profiled by Waterline Data Inventory.

Semicolons (;) are not recognized by HDFS. All characters after a semicolon in a filename are ignored (including the file extension).

Operating systems do not support colons (:) or slashes (/) in filenames.

Double quotation marks (") are not recognized in filenames. HDFS replaces double quotation marks in filenames with %22.

Recognized and profile-able formats

Waterline Data Inventory processes tables and files in a two-step process: format discovery followed by profiling. Format discovery includes the use of Apache Tika™ to identify file formats. Tika recognizes a large number of file formats, many more than contain data for profiling. Files that are recognized by Tika but not profiled are included in the inventory: they are searchable, can be tagged, and are included in lineage relationships. You can find these files by searching on the Profile Status “Recognized”; the recognized file type appears in the Content Type facet. If files are not recognized in this process, they are marked as “Unrecognized” and the content type will be set to “Unknown”.

By default, the format discovery process considers the file names and extensions in determining the file format. For example, files with extension *.java will be marked as Java source files. It is possible that cluster files are named such that this assumption produces incorrect file format identification. If so, set the following property to false to disable the use of file names in format determination:

waterlinedata.formatdiscovery.usefilename=false


For information on customizing how files recognized by Tika appear in Waterline Data Inventory, see Custom format identification.

Data type discovery

When Waterline Data Inventory profiles data that does not have type information, it reads field values to determine data types. Use this property to disable data type discovery (-1), to use all field values to determine data types (0), or to limit data type discovery to the most frequent values (1 - default) as identified by the profiling property waterlinedata.profile.top_k_capacity.

waterlinedata.profile.data_format_discovery=1

XML format discovery

Waterline Data Inventory successfully identifies and profiles XML files with the following characteristics:

Elements
Attribute in non-root element
Attribute in root element
Multiple attributes
Multiple roots
Doc type comments
Comments
Default namespace
User-defined namespace
XSI namespace, for example:

<note xmlns="http://www.w3schools.com"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.w3schools.com note.xsd">

XSD, for example:

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            targetNamespace="http://www.library.org" xmlns="http://www.library.org"
            elementFormDefault="qualified"><xsd:include schemaLocation="BookCatalogue.xsd"/>

Waterline Data Inventory does not profile XML files with xsi:nil="true".

Profiling tuning

In terms of performance optimization, profiling breaks into two areas to consider: MapReduce operations that occur on the cluster’s data nodes and writing profiling data to the Waterline Data Inventory repository on the edge node. Performance in these areas is dependent less on the size of the cluster data than on the number of columns in the cluster data. That is, a 2 GB file with 30 columns profiles faster and takes up less space in the repository than a 2 GB file with 300 columns.


MapReduce job performance controls

The important factors in MapReduce performance are the number of CPUs available across the cluster and the amount of memory available on each node. In both cases, more is better.

Tuning Waterline Data Inventory to run on your cluster is like any other MapReduce operation: you want to make sure that the volume of data being processed and the number of processes running at one time fits within the resources available on the cluster. As part of the cluster configuration (outside Waterline Data Inventory), configure Hadoop parameters based on the data node configuration:

Memory allocated for map tasks
Memory allocated for reduce tasks
Java heap space available
Maximum number of split locations
Application manager container size

If you find that the values configured for your cluster are not optimal for your Waterline Data Inventory MapReduce jobs, you can pass specific values as parameters to the Waterline Data Inventory command line to set these values for the job. For example, your command might include multiple cluster memory settings:

./waterline profile <target HDFS directory> -Dmapreduce.map.memory.mb=8192

-Dmapreduce.map.java.opts=-Xmx7168 -Dmapreduce.job.max.split.locations=100

-Dwaterlinedata.profile.combined.max_mappers_per_job=250

-Dyarn.app.mapreduce.am.resource.mb=18216 -Dyarn.app.mapreduce.am.command-opts=-Xmx17400m

Once these parameters are in place, Waterline Data Inventory gives you the ability to control the number of map and reduce tasks started by Waterline Data Inventory MapReduce jobs. These numbers are bound by the number of CPUs available for processing. Within that limit, choose the number of map and reduce tasks to keep the size of data each task processes more or less constant: increase the maximum number of map or reduce tasks when processing many small files (more columns overall); decrease it when processing fewer, larger files (fewer columns overall). Waterline Data Inventory triggers MapReduce jobs sequentially, and the configured number of map and reduce tasks applies to each job, so you may need to change the number of map or reduce tasks to stay within your cluster's resources. The maximum number of split locations is typically set for the entire cluster; the value should be the same as or greater than the number of data nodes in the cluster. If for some reason this value is not set appropriately on the cluster, you can override it when running Waterline Data Inventory jobs.

For more information on how to determine the appropriate number of map and reduce tasks to run on your cluster configuration, see the tech note Best Practices for Memory Management.


Repository writing performance controls

The most important factor in optimizing the performance of writing profiling results to the Waterline Data Inventory repository is the number of input and output operations per second (IOPS) on the disk on the edge node where Waterline Data Inventory is installed. Profiling results increase in size based on the number of columns profiled and this produces a lot of data to move from HDFS to the edge node.

The second most important factor is the efficiency of the repository database itself. While Waterline Data Inventory ships with Embedded Derby configured as the repository, you can improve performance in this area by upgrading to a multithreaded database such as MySQL.

There are two additional and related parameters to consider to ensure you get the best possible performance during the write to the repository. If processes are running out of memory while writing to the repository (post processing operations after the MapReduce jobs have completed), you can adjust these parameters.

Heap available for reading from HDFS (Client operation)

Restricted by the amount of memory available on the edge node. This is set to 8 GB by default in the waterline script's HADOOP_HEAPSIZE setting:

<install location>/waterlinedata/bin/waterline

Number of reducers.

If you adjust the client operation memory and still run out of memory writing to the repository, you can increase the maximum number of reduce tasks available to Waterline Data Inventory jobs so that the volume of data produced by each reduce task is smaller. See Controlling the number of reduce tasks used per MapReduce job.


Using samples to calculate data metrics

By default, Waterline Data Inventory uses all data in files or tables to calculate field-level metrics such as the minimum and maximum values, the cardinality and density of the values, and the most frequent values. You can reduce profiling duration in very large files by sampling data for profiling operations. When sampling is enabled, Waterline Data Inventory reads the first and last blocks of data and enough other blocks to reach the sample fraction you specify. For example, profiling a 20 GB file normally requires reading 157 blocks of data, assuming a block size of 128 MB. With a sample fraction of 10%, Waterline Data Inventory will read 16 blocks of the 20 GB file, including the first block, the last block, and 14 additional blocks chosen at random.

waterlinedata.profile.sampled=false (by default)

waterlinedata.profile.sampled.fraction=0.1 (by default)
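Assuming these properties can be overridden per job on the command line like other waterlinedata properties (a sketch; the directory and fraction shown are illustrative):

$ bin/waterline profile /user/finance/Landing -Dwaterlinedata.profile.sampled=true -Dwaterlinedata.profile.sampled.fraction=0.1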

The block size in your cluster is configurable. If your block size is large relative to the size of your data files, it may not make sense for you to enable sampling. To determine your cluster's block size, see the following configurations:

Distribution Configuration Parameter Location Default Value

CDH 5.x dfs.blocksize hdfs-site.xml 128 MB

HDP 2.x dfs.blocksize hdfs-site.xml 128 MB

MapR 4.x ChunkSize .dfs_attributes 256 MB

Re-profiling existing files versus profiling only new and changed files

By default, Waterline Data Inventory only profiles new files or files that have changed since the last profiling job. Change the following property to false to reprofile all files in the target directory. You might choose to do this if you add data formats (see Configuring additional date formats) or change other parameters that affect the profiling data collected. Use this property on the job command line if you want to change the behavior only for the one job. For example, to reprofile all files in the Landing directory (on one line):

$ bin/waterline profile /user/finance/Landing -Dwaterlinedata.incremental=false

waterlinedata.incremental=true (by default)

By default, Waterline Data Inventory does not reprofile files that previously failed profiling. If you want to force reprofiling of failed files, set this property to true or include it on the command line for the job.

waterlinedata.reprofile_files_with_error=false (default)

With this property enabled, Waterline Data Inventory will profile previously failed files and new or updated files. For example, to reprofile failed files in the Landing directory (on one line):

$ bin/waterline profile /user/finance -Dwaterlinedata.reprofile_files_with_error=true


Controlling the number of map tasks used per MapReduce job

You can limit the number of map tasks Waterline Data Inventory generates per format or profiling job. You might consider setting a map task limit when you are profiling many small files; by default, the ability to combine multiple files into a single map task is enabled and set to limit map tasks to 80.

To control the number of map tasks per job, set the following properties:

waterlinedata.profile.combinedmapper=true (default)

waterlinedata.profile.parquetcombined=true (default)

waterlinedata.profile.combined.max_mappers_per_job=<limit>
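For example, to cap a single profiling job at 40 map tasks (the directory and limit shown are illustrative):

$ bin/waterline profile /user/finance/Landing -Dwaterlinedata.profile.combined.max_mappers_per_job=40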

Controlling the number of reduce tasks used per MapReduce job

Waterline Data Inventory allows you to configure the maximum number of reduce tasks used by MapReduce profiling jobs from the job command line only. Consider adjusting the number of reduce tasks if jobs are running out of memory during the reduce tasks for Waterline Data Inventory MapReduce jobs.

We recommend that this control be set to 20 or larger. Set it to approximately 1 reduce task for each 10 map tasks.

By default, Waterline Data Inventory uses the value specified in the Yarn configuration. To use a different value, specify the MapReduce property on the Waterline Data Inventory command line. For example:

$ bin/waterline profile /user/finance/Landing -Dmapred.reduce.tasks=40

Controlling the number of nodes used to supply input data (split locations)

“Splits” are the package of data that a single map task can operate against. In Hadoop, the property mapreduce.job.max.split.locations determines how many locations (typically individual data nodes) can be used to stage the splits for processing. If this value is set too low, it can cause map tasks to wait to receive splits to operate on, slowing profiling performance.

Waterline Data Inventory uses the value set for the cluster either through the value in mapred-site.xml or on the Waterline Data Inventory command line (using -Dmapreduce.job.max.split.locations). If no value is set at the cluster level or provided on the command line, Waterline Data Inventory reads the value set in the configuration property:

mapreduce.job.max.split.locations=<integer greater than the number of nodes>

If Waterline Data Inventory does not find a value for this setting, it uses 100 as the number of split locations.

For example, to force reprofiling of two Hive databases and set the number of split locations (on one line):

$ bin/waterline profileHive default,finance
-Dmapreduce.job.max.split.locations=108 -Dwaterlinedata.incremental=false

Using compression

By default, Waterline Data Inventory uses compression at two points in its management of metadata: when it generates temporary files during profiling (between map and reduce tasks) and when it generates the metadata that it persists. It uses Snappy compression for the temporary files and gzip for the persisted files. For each stage where compression is used, you can set the compression format and enable or disable compression. If you need to change one of the following properties, make sure to uncomment it to override the default.

Temporary files

Enable compression overall:

waterlinedata.compression=true

Enable compression for the output of map tasks:

mapreduce.map.output.compress=true

Set compression format for the output of map tasks:

mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec

To change this format to LZO, use:

mapreduce.map.output.compress.codec=com.hadoop.compression.lzo.LzoCodec

Persisted files

Enable compression:

mapreduce.output.fileoutputformat.compress=true

Set compression format:

mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec

Set the type of compression to NONE, RECORD, or BLOCK, where RECORD is the default:

mapreduce.output.fileoutputformat.compress.type=RECORD
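
As an illustration (not a required configuration), the persisted-file overrides above might be uncommented together to switch to BLOCK-type gzip compression:

mapreduce.output.fileoutputformat.compress=true
mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec
mapreduce.output.fileoutputformat.compress.type=BLOCK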

Configuring additional date formats

When Waterline Data Inventory profiles string data, such as in delimited files where no type information is available, it examines the data to infer likely data types. It uses the format conventions described by the International Components for Unicode (ICU) libraries for dates and numeric values. You can add your own date formats using the conventions described here:

icu-project.org/apiref/icu4j/com/ibm/icu/text/SimpleDateFormat.html

The pre-defined formats are as follows:

waterlinedata.profile.datetime.formats=EE MMM dd HH:mm:ss ZZZ yyyy, M/d/yy HH:mm,
EEE MMM d h:m:s z yy, yy-MM-dd hh:mm:ss ZZZZZ, yy-MM-dd,yy-MM-dd HH:mm:ss,yy/M/dd,
M/d/yy hh:mm:ss a, YYYY-MM-dd'T'HH:mm:ss.SSSSSSSxxx
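
For example, to accept day-first European dates such as 31.12.2015, you could append an ICU pattern to the end of the comma-separated list (the dd.MM.yyyy pattern here is an illustrative addition, not a predefined format):

waterlinedata.profile.datetime.formats=EE MMM dd HH:mm:ss ZZZ yyyy, M/d/yy HH:mm,
EEE MMM d h:m:s z yy, yy-MM-dd hh:mm:ss ZZZZZ, yy-MM-dd,yy-MM-dd HH:mm:ss,yy/M/dd,
M/d/yy hh:mm:ss a, YYYY-MM-dd'T'HH:mm:ss.SSSSSSSxxx, dd.MM.yyyy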

Controlling most frequent data value collection

Waterline Data Inventory collects the 2000 most frequent values in each field of each file. You can change the number of values collected, control how many characters are included in each sample, and control how many of these values are used in search indexes and to propagate tags.

Number of most frequent values collected

waterlinedata.profile.top_k_capacity=2000 (by default)

Maximum length of the strings maintained (any additional characters are not available in the Waterline Data Inventory repository)

waterlinedata.max.top_k_length=128 (by default)

Number of most frequent values used in search indexes and UI lists

waterlinedata.profile.top_k=50 (by default)

Number of most frequent values used to determine tag association matches

waterlinedata.profile.top_k_tokens=750 (by default)
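
For example (values illustrative), to collect more values with longer samples for a single profiling job, you might pass overrides on the command line following the -D convention used elsewhere in this guide:

$ bin/waterline profile /user/finance/Landing -Dwaterlinedata.profile.top_k_capacity=5000 -Dwaterlinedata.max.top_k_length=256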

Controlling cardinality calculation

Waterline Data Inventory calculates cardinality for fields using an implementation of HyperLogLog, which manages the challenge of counting unique values in a very large data set without requiring an enormous amount of memory. The two properties that control cardinality calculation handle fields with mostly unique values (“normal data set”) and fields with null or repeated values (“sparse data set”). Set these property values higher based on the general size of your data.

Note that changes to either of these properties take effect only after you reprofile the data.

IMPORTANT: Metadata collected with one set of values for these properties is not compatible with metadata collected with another set. If you change these property values, you must reprofile all resources in the repository.

waterlinedata.profile.cardinality_precision=11

waterlinedata.profile.cardinality_sparse_precision=25
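
For example (an illustrative sketch, assuming these precision properties can be passed with -D like the other waterlinedata properties in this guide), raising the precision and forcing the required full reprofile might look like this (on one line):

$ bin/waterline profile /user/finance/Landing -Dwaterlinedata.profile.cardinality_precision=14 -Dwaterlinedata.incremental=false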

Another way to control cardinality calculation is to increase the number of values for which Waterline Data Inventory calculates actual cardinality rather than using an estimate. Increasing this value increases the size of the HDFS metadata maintained by Waterline Data Inventory and may affect profiling performance.

waterlinedata.profile.top_k_capacity=2000

Profiling Hive tables

The following properties control interaction with Hive. For Hive connection information, see Communication between Waterline Data Inventory and Hive.

Making Waterline Data Inventory files available for use with Hive

Waterline Data Inventory uses its own input formats when creating Hive tables; when other applications read these tables, they require that the Waterline Data JARs be available in the Hive class path. The Waterline Data Inventory install process places JAR files in the Hive auxlib directory if it can locate Hive on the local computer. If the Hive server is installed on a different node from Waterline Data Inventory, you can ensure the JARs are available to applications reading the Hive tables by placing the JAR files in the Hive server class path. See Update Hive Server with Waterline Data Inventory JARs.

Profile backing files

When you run a job to profile Hive tables, Waterline Data Inventory does not profile the HDFS files that correspond to the profiled tables. The interface shows profiling results for the Hive tables and a link from each table to its backing file or directory; however, unless the corresponding HDFS files were profiled independently, there are no profiling results for the files.

This behavior improves profiling performance and suits cases where end-users are interested in the Hive tables rather than the backing files. It is also an advantage where Waterline Data Inventory can profile the data through Hive even when it can't directly profile the backing files. If you need to, you can run a separate job to profile the HDFS files.

If you want to profile both the Hive tables and their backing HDFS files in the Hive profiling job, set the following property to true:

waterlinedata.profilehivebackingfiles=true
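
For example (the database name is illustrative), to profile the finance database together with its backing files:

$ bin/waterline profileHive finance -Dwaterlinedata.profilehivebackingfiles=true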

Clear deleted Hive tables

By default, when profiling Hive tables, Waterline Data Inventory reviews the tables in the database to ensure that the data they are based on still exists. If the backing files for a given table have been deleted, Waterline Data Inventory clears out the table. You can turn off this check; turning it off reduces the overall profiling time by a small amount.

waterlinedata.cleanorphanedhivetables=true

Profiling Hive map columns

The default behavior for profiling Hive tables includes flattening map columns into a column for each key in the key-value pairs within the map column (referred to as “deep map”). The Waterline Data Inventory representation includes an additional column for each entry in the map column. If this representation does not provide end-users with the best view of the data, or if the map keys contain characters that cannot be used as column names, consider changing the default behavior so that the map column is not flattened.

Default Behavior (deep map: one column per map key):

event_time
uuid
traits
site_id
cookie_pattern-c_referrer
cookie_pattern-c_purchaseID
cookie_pattern-c_products
cookie_pattern-c_prop17
cookie_pattern-d_cb
cookie_pattern-d_ld
url
segments
ip_address

Deep Map Disabled (single map column):

event_time
uuid
traits
site_id
cookie_pattern (values: c_referrer:http://..., c_purchaseID:0.15..., c_products:%3B..., c_prop17:293..., d_cb:demo..., d_ld:part...)
url
segments
ip_address

To disable flattening of map columns in Hive tables, set the following property:

waterlinedata.profile.deepmap=false

Hive limitations

In Hive versions newer than Hive 0.14, field names cannot contain commas (,) or colons (:). Waterline Data Inventory may be able to profile the backing files for such data, but users will not be able to create Hive tables from those files.

Search functionality

The following properties control how Waterline Data Inventory prepares and returns search results in the web application.

Limits on total results returned

Waterline Data Inventory maintains at least six indexes to support searching, where each index includes entries for a different type of object. For example, there is a search index for tags, one for folder, file, and table names (“resources”), and another for the sample values collected for fields (“top-K”). For a large inventory, a given set of search criteria can produce hundreds of thousands of results across multiple indexes. Waterline Data Inventory determines which results match the search criteria, then returns the results that correspond to the top file-level results. That is, if the search includes criteria to match field tags, the field-level results are compiled, then up to the maximum number of corresponding files are returned, along with the matching fields found in those files.

If needed, you can change the following parameter to scale the number of results returned:

waterlinedata.metadata.search.maxHits.dataResource=100000

Consider increasing this property only if all of the following conditions exist:

There are significantly more than 100,000 resources (files, tables, folders) in the inventory.

Users typically perform searches that return more than 100,000 resources in a single search.

Users are seeing cases where even this large number of search results does not include all of the results they expect.

Discovery functionality

The following properties control how Waterline Data Inventory makes suggestions for lineage relationships and for tag associations. Note that from the tag glossary you can disable tag propagation for individual tags, including built-in tags.

Balancing profiling performance against data quality calculations

Waterline Data Inventory calculates cardinality and selectivity for each field in each file profiled. Cardinality is the number of unique values in the field; selectivity is the cardinality divided by the number of records that do not have null values.

In addition to selectivity and cardinality, Waterline Data Inventory collects a sample of the most frequent values in the field. Enable the following optimization to reduce the amount of time Waterline Data Inventory spends during profiling making these sample lists accurate. By default, this optimization is disabled.

waterlinedata.profile.high_cardinality.optimization=false (default)

Controlling tag association discovery

Tag discovery performance

Tag association discovery operations occur in MapReduce jobs. As with profiling jobs, you can pass MapReduce memory settings through with the Waterline Data Inventory tag command. The following properties can be set only on the command line to control MapReduce operations. For example, for a 100-node cluster with significant RAM available per node, you can provide 8 GB per map task, 18 GB per application master container, and account for at least one split location per node (on one line):

$ ./waterline tag -Dmapreduce.map.memory.mb=8192
-Dmapreduce.map.java.opts=-Xmx7168m -Dmapreduce.job.max.split.locations=100
-Dyarn.app.mapreduce.am.resource.mb=18216 -Dyarn.app.mapreduce.am.command-opts=-Xmx17400m

Consider setting these parameters on the command line when you see errors in the tag job indicating that the map tasks exceeded their Java heap space (Exception in thread "main" java.lang.OutOfMemoryError: Java heap space).

Thresholds for what tag suggestions are exposed

Waterline Data Inventory has default values set for all field-level tag propagation as follows. Some of these values can be configured individually for each tag from the tag’s rule definition (in the web application, go to Manage > Glossary).

Waterline Data Inventory gives a weight to its suggestions for matching tag associations. You can choose to expose more or fewer of these suggestions by configuring the cutoff weight. Tag associations whose calculated weight is below this value are not exposed to users. You can set this value per tag from the Glossary.

waterlinedata.discovery.tolerance.weight=40.0 (by default)

Limit on the number of tags that will be suggested for a given field.

waterlinedata.discovery.tags.max_suggested=3

Eliminating weak associations. If more than one tag is suggested for a field, the tag with the highest weight is suggested; for other tags to be suggested, their weights must be within this value of the top tag's weight. That is, for an additional tag to be suggested, its weight must be greater than the weight calculated for the top tag minus value_hit_diff:

weight(additional tag) > weight(top tag) - value_hit_diff

waterlinedata.discovery.tags.value_hit_diff=20.0
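
For example, with the default value_hit_diff of 20, if the top tag's weight is calculated as 85, additional tags are suggested only if their weights are greater than 65.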

Tag association for low-cardinality data

When fields have low cardinality (the same values appear many times in the field for the file), tag propagation can be skewed toward making connections that are not representative of the data. Waterline Data Inventory provides some tools to help you avoid false positive tag associations among fields with low cardinality.

Conventions for indicating missing values

One common case where low cardinality values cause unexpected tag associations is when the data includes one or more values to indicate that there isn’t a value. For example, if data uses a convention of “not available” or “NA” in the file to identify places where values are not provided, this value may be mistakenly considered to be related to other data that also uses “not available” or “NA” even though other values in the data are unrelated.

Waterline Data Inventory provides a blacklist of values that should be ignored when making low cardinality matches. You can modify this comma-separated list to meet the requirements of your data, including providing localized versions of these indicators. Note that you should include values in lower case as all field values are changed to lower case when matches are calculated.

waterlinedata.discovery.tags.null_values=na,n/a,unspecified,not available,null,empty,blank,missing

Tag association discovery among low cardinality values

For low cardinality values (few distinct values among all field values), Waterline Data Inventory requires 100% of the values in the candidate field to match for a tag to be associated with the candidate field. By default, “low cardinality” fields are fields with two or fewer distinct values. To require more values from a candidate field to match before a tag is suggested for an association, change the following option to a larger number.

waterlinedata.discovery.tags.min_cardinality.partial_match=2

Tag association using tag rules

Tags can have rules that use regular expressions to identify the field data that should be associated with the tag. Use this property to control how Waterline Data Inventory handles these tag rules:

Disable evaluating tagging rules (-1)

Use all field values to identify matches with tagging rules (0)

Limit matching tagging rule evaluation to the most frequent values (1) (default)

The quantity of most frequent values is identified by the profiling property waterlinedata.profile.top_k_capacity (see Controlling most frequent data value collection).

waterlinedata.profile.regex_evaluation=1

Tag association using field names

By default, Waterline Data Inventory uses field names when making tag association suggestions. If considering field names in the matching produces too many false positive tag associations, consider disabling field name matching.

waterlinedata.discovery.allow_field_name_matching=true (default)

Tag association for credit and bank cards

Waterline Data Inventory uses the Luhn algorithm to improve tag association suggestions for bank and credit cards. This algorithm validates that the 12- to 16-digit numbers found in profiling are valid card numbers. This functionality is not configurable; if you want to turn it off, disable automatic tag discovery for the built-in tag “Major Credit Card Number.”

Controlling collections discovery

When you have a growing set of data that extends across multiple files, Waterline Data Inventory allows users to view the data as a single resource, a “collection.” When a file is added to one of the directories identified as part of the collection, Waterline Data Inventory knows to include that data in the collection.

At this time, Waterline Data Inventory does not re-evaluate the collection if files are removed from the data set. It also does not update any Hive tables generated from the collection when the collection is updated.

By default, Waterline Data Inventory considers only folders containing 3 or more files as candidates for a collection. You can control this value to better reflect the organization of your cluster. Note that there are other qualifications that must be met before the files in a folder are marked as a collection.

waterlinedata.discovery.smallest.collection.size=3 (by default)

Folders with fewer files than the waterlinedata.discovery.smallest.collection.size threshold are not considered when generating collections, except when they are part of a larger set of directories that together form a higher-level collection. Folders within folders are rolled up as a collection at the highest level that doesn't break the collection requirement of including files with matching schemas.

Controlling lineage discovery

Lineage discovery performance

Lineage discovery operations occur in MapReduce jobs. As with profiling jobs, you have a number of controls to make sure the jobs run efficiently given the available resources on the cluster.

As with profiling jobs, you can pass MapReduce memory settings through with the Waterline Data Inventory lineage command. The following properties can be set only on the command line to control MapReduce operations. For example, for a 100-node cluster with significant RAM available per node, you can provide 8 GB per map task, 18 GB per application master container, and account for at least one split location per node (on one line):

$ ./waterline lineage -Dmapreduce.map.memory.mb=8192
-Dmapreduce.map.java.opts=-Xmx7168m -Dmapreduce.job.max.split.locations=100
-Dyarn.app.mapreduce.am.resource.mb=18216 -Dyarn.app.mapreduce.am.command-opts=-Xmx17400m

Consider setting these parameters in the command line when you see errors in the lineage job that indicate that the map tasks exceeded their Java heap space (Exception in thread "main" java.lang.OutOfMemoryError: Java heap space).

The following additional properties allow you to further control lineage discovery; these properties can be set in properties files or on the command line.

The volume of lineage resources handled in memory. This value is optimized for an estimate of the number of fields in each file or table, and assumes that each map task has 8 GB of memory available. Reduce this property if your data nodes do not have 8 GB of memory available to map tasks or if you have exceptionally “wide” resources. In these cases, you might see an error indicating that the map tasks exceeded their Java heap space (Exception in thread "main" java.lang.OutOfMemoryError: Java heap space).

waterlinedata.discovery.lineage.batch_size=240

The number of data “splits” that a map task running in a lineage job can process. Consider increasing this number if you are processing a large volume of data and your lineage map tasks are not using the memory allocated to them.

waterlinedata.discovery.lineage.max_splits=64

The number of reducers used in lineage discovery jobs. Increase the number of reducers if you find that the performance of lineage jobs is bound by the number of reducers.

waterlinedata.discovery.lineage.mapred.reduce.tasks=5
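
For example (values illustrative), on a cluster whose data nodes give map tasks less than 8 GB, you might lower the batch size and raise the split count for one job:

$ bin/waterline lineage -Dwaterlinedata.discovery.lineage.batch_size=120 -Dwaterlinedata.discovery.lineage.max_splits=128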

Lineage discovery behavior

When reviewing files for lineage relationships, Waterline Data Inventory is able to tolerate a number of changes to file schemas and data and still find a connection among files. These properties control the parameters used to determine a lineage relationship.

The amount of overlapping data between fields to consider the fields matching.

waterlinedata.discovery.lineage.overlap=0.8 (by default)

If multiple fields from the same resource match the fields from another resource, Waterline Data Inventory uses field names to determine if the fields match. This mechanism is used only if field names are similar within the percentage indicated by this property, 0.8 (80%) by default.

waterlinedata.discovery.lineage.field_name_match=0.8

By default, Waterline Data Inventory avoids making lineage relationships out of copies of files in the same directory. To do this, it does not identify lineage relationships among files with modification dates (HDFS equivalent of creation dates) that differ by less than 30 seconds when the files are in the same directory. If you have a transformation process that runs on files to create similar or “refined” copies of data in the same directory, consider reducing this limit to ensure that Waterline Data Inventory finds lineage relationships between the original file and the modified file.

waterlinedata.discovery.lineage.diff_same_directory_sec=30
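
For example (an illustrative value), if a transformation process writes refined copies within a few seconds of the original files, you might shrink the window:

waterlinedata.discovery.lineage.diff_same_directory_sec=5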

Browser app functionality

The following sections describe the properties used to control aspects of the Waterline Data Inventory browser application.

Controlling user access to the web application

By default, any authenticated user can sign in to the Waterline Data Inventory web application; signed-in users are automatically configured with the “end-user” role, which allows them to search and browse data and metadata they are authorized to view and to create Hive tables. See Managing Waterline Data Inventory user profiles.

To require that users be pre-configured in Waterline Data Inventory before they can sign in, set the following property to false:

waterlinedata.userprofile.autocreate=true (default)

Identifying the superuser for user management

By default, a user named waterlinedata is created as the superuser for managing user profiles and other administration tasks. Ideally, use the Waterline Data Inventory service user for this purpose. If that user isn't an option (perhaps because the service user has no ability to authenticate with a password), choose another user account that is permanent, secure, and perhaps rarely used. If necessary, you can remove roles from this user to reduce its reach within Waterline Data Inventory.

The superuser account is determined using the following property:

waterlinedata.security.administrator.username=waterlinedata

If you need to change the username for the superuser, contact [email protected] for instructions.

User interface performance controls

There are two parameters that have a significant impact on the performance of the Waterline Data Inventory browser application. If the application server or repository database runs out of memory while serving data to users running the web application, you can adjust these parameters.

Application server heap

The application server heap is restricted by the amount of memory available on the edge node. It is set to 6 GB by default in the waterlinedata/bin/jettyStart script, in the java_opts -Xmx setting:

java_opts="-Xss2m -Xms32m -Xmx6144m -XX:PermSize=32m -XX:MaxPermSize=512m"
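
For example (an illustrative edit, not a recommended size), raising the application server heap to 8 GB means changing only the -Xmx value in that line:

java_opts="-Xss2m -Xms32m -Xmx8192m -XX:PermSize=32m -XX:MaxPermSize=512m"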

Derby database heap

The Derby database heap is restricted by the amount of memory available on the edge node. It is set to 4 GB by default in the waterlinedata/bin/derbyStart script, in the -Xmx parameter of the command that launches Derby:

${WATERLINEDATA_JAVA} -Dderby.system.home=${DERBY_SYSTEM_HOME}
-Dderby.install.url=${DERBY_INSTALL_URL} -Djava.security.manager
-Djava.security.policy=${DERBY_SECURITY_POLICY} -Xmx4096M
-jar ${DERBY_HOME}/lib/derbyrun.jar server start -h ${DERBY_HOST} -p ${DERBY_PORT} &

Self-service browsing

If your end-users have accounts in HDFS or MapR-FS and corresponding home directories, Waterline Data Inventory uses the directories as the users’ home in the browser application: click Browse in Waterline Data Inventory to open the HDFS directory corresponding to the current user.

If your end-users do not have accounts in HDFS, Waterline Data Inventory defaults to the HDFS root directory. To improve end-users' experience, consider setting the home directory each user sees when they open Waterline Data Inventory. Set the HDFS directory path in the following property:

waterlinedata.defaultdirectory=<HDFS directory>

For example:

waterlinedata.defaultdirectory=/user/waterlinedata/Landing

SSH authentication for remote host access

Waterline Data Inventory uses SSH infrastructure to validate end-users against the configured authenticating service. This SSH security requires confirmation between the application server host and the authentication service to establish mutual trust.

To avoid interactive SSH trusted host confirmation for each user signing into Waterline Data Inventory, the installation process calculates and caches the trusted host public key fingerprints.

There is no need to manually alter the SSH authentication fingerprints unless you experience problems with user access through SSH.

The installer determines the 16 octet fingerprints as follows:

$ ssh-keygen -l -f /etc/ssh/ssh_host_key.pub

$ ssh-keygen -l -f /etc/ssh/ssh_host_rsa_key

$ ssh-keygen -l -f /etc/ssh/ssh_host_dsa_key

The output from each command shows the length of the key, the key fingerprint, and the name of the public key file that matches the key file provided in the command. Here’s an example of these commands and their output:

$ ssh-keygen -l -f ssh_host_dsa_key

1024 b8:3a:2c:5f:23:78:6c:ff:34:44:5b:63:a6:f4:26:15 ssh_host_dsa_key.pub (DSA)

$ ssh-keygen -l -f ssh_host_key.pub

2048 0f:52:05:19:17:c4:58:41:2f:d7:de:1e:58:14:c4:ad ssh_host_key.pub (RSA1)

$ ssh-keygen -l -f ssh_host_rsa_key

2048 30:c0:38:c1:84:50:6e:0b:f6:3f:85:f8:3e:8f:33:34 ssh_host_rsa_key.pub (RSA)

The resulting configuration for SSH will look similar to the following:

authHostFingerprints="b8:3a:2c:5f:23:78:6c:ff:34:44:5b:63:a6:f4:26:15
0f:52:05:19:17:c4:58:41:2f:d7:de:1e:58:14:c4:ad
30:c0:38:c1:84:50:6e:0b:f6:3f:85:f8:3e:8f:33:34"

Event auditing

For folders, files, tags, lineage relationships, origins, and user profiles, Waterline Data Inventory collects the events that occur to each object. For example, Waterline Data Inventory records when a tag was created and when and by whom it was associated with a file or a field. Collecting this information has a small performance impact on the browser application and increases the size of the repository.

You can keep Waterline Data Inventory from collecting new events by setting the following property to false:

waterlinedata.auditing.enabled=true (default)

Authorization caching

Waterline Data Inventory caches authorization information to optimize browsing performance. If you make changes to users' authorization, such as by changing policies in Ranger, you'll need to wait about 3 minutes for those changes to propagate through the system. This time may be longer if users access the resources during that time; authorization information is considered stale if it is older than 10 minutes and is refreshed.

Browser timeout

Waterline Data Inventory automatically signs users out of the browser application after two hours (120 minutes). To change this default, edit <install location>/jetty-distribution*/waterlinedata-base/etc/webdefault.xml and add or update the following section:

<session-config>
  <session-timeout>120</session-timeout>
</session-config>

To remove any timeout, change this setting to -1.

Service user as admin

When you first install Waterline Data Inventory, you set a user account for a Waterline Data Inventory administrator. This user is pre-configured with privileges to create and configure additional Waterline Data Inventory users. If appropriate, you can use the Waterline Data service user as the superuser.

Note that in a Kerberos-controlled environment, if you configure the Waterline Data service user as the default admin, this user needs to authenticate both as a service (using a keytab) and as a user (using a username and password). If that configuration is not possible in your environment, choose another username to set as the Waterline Data Inventory default admin.

To generate a keytab that keeps the principal's password valid and allows both keytab-based and password-based login at the same time, assuming the Waterline Data service user is “waterlinesvc”:

$ kadmin.local
> add_principal waterlinesvc
(prompts for password)
> ktadd -k <full path to keytab file> -norandkey waterlinesvc
> quit

Here the -norandkey argument is necessary; without it, the waterlinesvc user's password is invalidated when the keytab is created and the user will not be able to sign in to the Waterline Data Inventory web application.