PENTAHO DATA INTEGRATION SUITE


Page 1: Pentaho  Data  Integration  Suite

PENTAHO DATA INTEGRATION SUITE

Page 2: Pentaho  Data  Integration  Suite

PENTAHO DATA INTEGRATION SUITE Kettle is an acronym for "Kettle E.T.T.L. Environment"

Extraction, Transformation, Transportation and Loading of data.

Spoon is a graphical user interface that allows you to design transformations and jobs that can be run with the Kettle tools — Pan and Kitchen

Pan is a data transformation engine that performs a multitude of functions such as reading, manipulating, and writing data to and from various data sources

Kitchen is a program that executes jobs designed by Spoon, stored in XML or in a database repository

Jobs are usually scheduled in batch mode to be run automatically at regular intervals.
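The command-line tools described above are typically invoked like this. This is a hedged sketch: the install path, repository name, credentials, and job/transformation names below are placeholders, not taken from the slides.

```shell
# Hypothetical invocation of Pan and Kitchen; all names are placeholders.
KETTLE_DIR=/opt/kettle   # assumed install directory

# Pan runs a transformation, here loaded from a .ktr file:
PAN_CMD="$KETTLE_DIR/pan.sh -file=/etc/kettle/load_customers.ktr -level=Basic"

# Kitchen runs a job, here loaded from a repository; a line like this is
# what a cron entry would call for the batch-mode scheduling mentioned above:
KITCHEN_CMD="$KETTLE_DIR/kitchen.sh -rep=my_repo -user=admin -pass=admin -job=daily_load"

echo "$PAN_CMD"
echo "$KITCHEN_CMD"
```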

Page 3: Pentaho  Data  Integration  Suite

GETTING STARTED

Follow the instructions below to install Spoon:

Install the Sun Microsystems Java Runtime Environment version 1.5 or higher. You can download a JRE for free at http://www.javasoft.com/.

Unzip the binary distribution zip-file in a directory of your choice.

Under Unix-like environments (Solaris, Linux, MacOS, for example), you must make the shell scripts executable. Execute these commands to make all shell scripts in the Kettle directory executable:

cd Kettle
chmod +x *.sh
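The effect of the chmod step can be illustrated in a throwaway directory (the real commands run inside the unzipped Kettle directory; the path and file names below are placeholders for illustration only):

```shell
# Demonstration of making the Kettle shell scripts executable,
# using a scratch directory with empty stand-in scripts.
mkdir -p /tmp/kettle-demo
touch /tmp/kettle-demo/spoon.sh /tmp/kettle-demo/pan.sh /tmp/kettle-demo/kitchen.sh
cd /tmp/kettle-demo
chmod +x *.sh   # the same command used in the Kettle directory
ls -l *.sh      # each script now shows the executable bit
```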

Page 4: Pentaho  Data  Integration  Suite

LAUNCHING SPOON

Spoon.bat: launch Spoon on the Windows platform.

spoon.sh: launch Spoon on a Unix-like platform such as Linux, Apple OS X, or Solaris.

To create a shortcut on the Windows platform, an icon is provided. Point the shortcut to the spoon.bat file and use "spoon.ico" to set the correct icon.

Page 5: Pentaho  Data  Integration  Suite

USER INTERFACE OVERVIEW

The Main tree in the upper-left panel of Spoon allows you to browse the connections associated with the jobs and transformations you have open.

When designing a transformation, the Core Objects palette in the lower-left panel contains the available steps used to build your transformation, including input, output, lookup, transform, join, scripting steps, and more.

When designing a job, the Core Objects palette contains the available job entries.

The Core Objects bar contains a variety of job entry types.

Page 6: Pentaho  Data  Integration  Suite

REPOSITORY

Spoon allows you to store transformation and job files to the local file system or in the Kettle repository

The Kettle repository can be housed in any common relational database

Page 7: Pentaho  Data  Integration  Suite
Page 8: Pentaho  Data  Integration  Suite

Transformation Definitions

The table below contains a list of transformation definitions:

Value: Values are part of a row and can contain any type of data: strings, floating-point numbers, unlimited-precision BigNumbers, integers, dates, or boolean values.

Row: A row consists of 0 or more values that are processed together as a single entry.

Input Stream: A stack of rows that enters a step.

Hop: A graphical representation of one or more data streams between two steps. A hop always represents the output stream for one step and the input stream for another; the number of streams is equal to the number of copies of the destination step (one or more).

Note: Descriptive text that can be added to a transformation.

Page 9: Pentaho  Data  Integration  Suite

Job Definitions

The table below contains a list of job definitions:

Job Entry: A part of a job that performs a specific task.

Hop: A graphical representation of one or more data streams between two steps. A hop always represents the output stream for one step and the input stream for another; the number of streams is equal to the number of copies of the destination step (one or more).

Note: Descriptive text that can be added to a job.

Page 10: Pentaho  Data  Integration  Suite

TOOLBAR ICONS

The icons on the toolbar of the main screen are, from left to right:

Create a new job or transformation.

Open a transformation/job from file if you're not connected to a repository, or from the repository if you are connected to one.

Save the transformation/job to a file or to the repository.

Save the transformation/job under a different name or filename.

Open the print dialog.

Run transformation/job: runs the current transformation from the XML file or repository.

Preview transformation: runs the current transformation from memory; you can preview the rows that are produced by selected steps.

Run the transformation in debug mode, allowing you to troubleshoot execution errors.

Replay the processing of a transformation for a certain date and time. This causes certain steps (Text File Input and Excel Input) to process only the rows that failed to be interpreted correctly during the run at that particular date and time.

Verify transformation: Spoon runs a number of checks on every step to see whether everything will run as it should.

Run an impact analysis: what impact does the transformation have on the databases used?

Generate the SQL needed to run the loaded transformation.

Launch the database explorer, allowing you to preview data, run SQL queries, generate DDL, and more.

Page 11: Pentaho  Data  Integration  Suite

General Tab

Note: Spoon automatically clears the database cache when you launch DDL (Data Definition Language) statements against a database connection; however, when using third-party tools, clearing the database cache manually may be necessary.

Maximum Undo Level: Sets the maximum number of steps that can be undone (or redone) by Spoon.

Default number of lines in preview dialog: Allows you to change the default number of rows that are requested from a step during transformation previews.

Maximum nr of lines in the logging windows: Specifies the maximum number of rows to display in the logging window.

Show tips at startup?: Sets the display of tips at startup.

Show welcome page at startup?: Controls whether or not to display the Welcome page when launching Spoon.

Use database cache?: Spoon caches information that is stored on the source and target databases. In some instances, caching causes incorrect results when you are making database changes. To prevent errors, you can disable the cache altogether instead of clearing the cache every time.

Page 12: Pentaho  Data  Integration  Suite

OTHER

Search Metadata

This option searches all available fields, connectors, and notes of all loaded jobs and transformations for the string specified in the Filter field. The metadata search returns a detailed result set showing the location of any search hits. This feature is accessed by choosing Edit|Search metadata from the menu bar.

Set Environment Variable

The Set Environment Variable feature allows you to create and set environment variables for the current user session explicitly. This is useful when designing transformations, for testing variable substitutions that are normally set dynamically by another job or transformation. This feature is accessible by choosing Edit|Set Environment Variable from the menu bar.

Note: This page also displays when you run a transformation that uses undefined variables, allowing you to define them right before execution time.

Show environment variables

This feature displays the current list of environment variables and their values. It is accessed by selecting Edit|Show environment variables from the menu bar.

Execution Log history

If you have configured your job or transformation to store log information in a database table, you can view the log information from previous executions by right-clicking the job or transformation in the Main Tree and selecting 'Open History View'.

Note: The log history for a job or transformation also opens by default each subsequent time you execute the file.

Page 13: Pentaho  Data  Integration  Suite

Generate Mapping Against Target Step

In cases where you have a fixed target table, map the fields from the stream to their corresponding fields in the target output table using a Select Values step in your transformation. The 'Generate mapping against target' option provides an easy-to-use dialog for defining these mappings and automatically creates the resulting Select Values step, which can be dropped into your transformation flow prior to the Table output step. To access the 'Generate mapping against target' option, right-click the Table output step.

Page 14: Pentaho  Data  Integration  Suite

Generate mappings example

Below is an example of a simple transformation in which we want to generate mappings to our target output table:

1. Begin by right-clicking the Table output step and selecting 'Generate mappings against target'.

2. Add all necessary mappings using the Generate Mapping dialog and click OK. A Table output mapping step is added to the canvas.

3. Drag the generated Table output mapping step into your transformation flow prior to the Table output step.

Page 15: Pentaho  Data  Integration  Suite

Creating a Transformation or Job

You create a new Transformation in one of three ways:

By clicking the New Transformation button on the main toolbar
By clicking New, then Transformation
By using the CTRL-N hot key

Any one of these actions opens a new Transformation tab for you to begin designing your transformation.

You create a new Job in one of three ways:

By clicking the New Job button on the main toolbar
By clicking New, then Job
By using the CTRL-ALT-N hot key

Any one of these actions opens a new Job tab for you to begin designing your job.

Page 16: Pentaho  Data  Integration  Suite

Creating a New Database Connection

This section describes how to create a new database connection and includes a detailed description of each connection property available in the Connection information dialog box.

To create a new connection, right-click Database Connections in the tree and select New or New Connection Wizard. You can also double-click Database Connections, or press F3. The Connection information dialog box appears. The topics that follow describe the configuration options available on each tab of the Connection information dialog box.

Page 17: Pentaho  Data  Integration  Suite

Database Explorer

The Database Explorer provides the ability to explore configured database connections. It supports tables, views, and synonyms, along with the catalog and/or schema to which each table belongs. The buttons to the right provide quick access to the following features for the selected table:

Preview first 100 rows of...: Returns the first 100 rows from the selected table.

Preview first ... rows of...: Prompts the user for the number of rows to return from the selected table.

Number of rows...: Returns the number of rows in the selected table.

Show Layout of...: Displays a list of column names, data types, and so on from the selected table.

Generate DDL: Generates the DDL to create the selected table based on the current connection type.

Generate DDL for other connection: Prompts the user for another connection, then generates the DDL to create the selected table based on the user-selected connection type.

Open SQL for...: Launches the Simple SQL Editor for the selected table.

Truncate table...: Generates a TRUNCATE TABLE statement for the current table. Note: The statement is commented out by default to prevent users from accidentally deleting the table data.

Page 18: Pentaho  Data  Integration  Suite

KETTLE AND SPOON

The first lesson of our Kettle ETL tutorial explains how to create a simple transformation using the Spoon application, which is part of the Pentaho Data Integration suite. The transformation in our example will read records from a table in an Oracle database, filter them, and write the output to two separate text files. The records that pass the validation rule will be spooled into one text file, and the ones that don't will be redirected to the rejects link, which places them in a different text file.

Assuming that the Spoon application is installed correctly, the first thing to do after running it is to configure a repository. Once the 'Select a repository' window appears, it is necessary to create or choose one. A repository is a place where all Kettle objects are stored; in this tutorial it will be an Oracle database.

To create a new repository, click the 'New' button and type the connection parameters in the 'Connection information' window. There are some very useful options on this screen: 'Test' allows users to test the new connection, and 'Explore' lets users browse the database schema and explore the database objects. After clicking 'Create or Upgrade', a new repository is created. By default, a user with administrator rights is created; its login name is admin and the password is also admin. It is highly recommended to change the password after the first login.

Database connection in Spoon - a part of Kettle ETL:

Page 19: Pentaho  Data  Integration  Suite
Page 20: Pentaho  Data  Integration  Suite

If a connection with the repository is established successfully, the Spoon main application window appears. To design a new transformation that performs the tasks described above, take the following steps:

Click the 'New transformation' icon and enter its name (in our tutorial it will be trsfCountry).

Define a database connection. It is located in the 'Main tree' area of the left-hand menu, in the Database connections field.

Drag and drop the following elements from the 'Core Objects' menu to the transformation design area in the center of the screen: Table Input (menu Input), Filter Rows (menu Transform), and two Text File Output objects (menu Output).

Edit the Table Input: choose a source database and define an SQL query that will return records to the transform flow. The 'Preview' option is usually very useful here, as it shows a preview of the records returned from the database.

Oracle table input data in Spoon:

Page 21: Pentaho  Data  Integration  Suite
Page 22: Pentaho  Data  Integration  Suite

HOPS

The next thing to do is to link the objects together. The links between elements are called hops, and they indicate the direction in which the transform flow goes. Hop elements can be found, created, and edited in the Main Tree section.

The easiest way to create a Hop is to drag and drop a link between two objects with left SHIFT pressed. 

Once the hops are defined, it is time to define the validation criteria in the 'Filter Rows' object. There we define the data flow and its direction based on a validation rule.

Page 23: Pentaho  Data  Integration  Suite
Page 24: Pentaho  Data  Integration  Suite

RUN

The last thing to do is to configure the text file outputs. Enter the names of the files and their extensions in the properties window and, if needed, adjust other text-file-specific options.

Save and run the transform (menu -> Transformation -> Run or just press the F9 key).
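Once saved to the repository, the same transformation can also be run outside Spoon with Pan, as described at the start of this deck. This is a hedged sketch of what that command line might look like; the repository name and credentials are placeholders (only trsfCountry comes from this tutorial).

```shell
# Hypothetical command-line equivalent of Transformation -> Run, using Pan
# against the repository created earlier. Repository name and credentials
# are placeholders.
RUN_CMD="./pan.sh -rep=my_repo -user=admin -pass=admin -trans=trsfCountry -level=Basic"
echo "$RUN_CMD"
```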

Page 25: Pentaho  Data  Integration  Suite