Pd i 2000 Lectures

Embed Size (px)

Citation preview

  • 8/17/2019 Pd i 2000 Lectures

    1/623

     

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com.

    July 2011

    en a o a a n egra onforDatabase Developers

  • 8/17/2019 Pd i 2000 Lectures

    2/623

    Welcome Agenda

    ● Audience and prerequisites

    ● Learning objectives

    ● Class process

    ● Course outline

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 2

  • 8/17/2019 Pd i 2000 Lectures

    3/623

    Audience and Course Prerequisites

    ● Intended audience

    — Course targeted for database administrators and databasedevelopers.

    — Portions of the course assumes knowledge of SQL and relationaldatabase concepts.

     

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 3

     

    — There are no Pentaho Training prerequisites for this course.

  • 8/17/2019 Pd i 2000 Lectures

    4/623

    Learning Objectives

    ● At the end of the course, you should understand:

    — The basic architecture and features of Pentaho Data Integration.

    — The concept and features of the advanced Pentaho Data Integration

    Enterprise Edition.

    — How PDI supports you in the Agile BI approach.

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 4

  • 8/17/2019 Pd i 2000 Lectures

    5/623

    Learning Objectives

    ● At the end of the course, you should be able to

    — Load and write data from and to different data sources

    — Join data from different sources

    — Use PDI and ETL design patterns (like restartable solutions)

    — Influence the performance aspects of databases and transformations

     

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 5

    —  

    — Schedule jobs and transformations

    — Use logging, monitoring and error handling features of PDI

    — Load, transform, and create complex XML structures

    — Use scripting (JavaScript, Formula, Java) in transformations

    — Apply clustering and partitioning solutions for high volumes

  • 8/17/2019 Pd i 2000 Lectures

    6/623

    Learning Objectives

    ● What are your objectives?

    ● What are you expectations?

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 6

  • 8/17/2019 Pd i 2000 Lectures

    7/623

    Course Process● Daily schedule:

    — 9:00 am – 5:00 pm

    — 1 hour lunch break -noon

    — 15 minute morning break -10:30— 15 minute afternoon break -3:30

     

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 7

    ● e course s a com na on o ec ure, emo an a s.

    ● Feel free to ask questions or to seek clarification!

    ● Online survey will be provided for feedback and suggestions.

  • 8/17/2019 Pd i 2000 Lectures

    8/623

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com.

    Pentaho Data IntegrationOverview

  • 8/17/2019 Pd i 2000 Lectures

    9/623

    Pentaho Data Integration (PDI) Introduction

    ● PDI is the product associated with the

    KETTLE open source project:

    — KETTLE is open source software that

    makes up the core of PDI EnterpriseEdition.

    — PDI Enterprise Edition is ‘whole

    roduct’.

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 9

    • Professional technical support

    • Maintenance releases

    • EE-only features including

    enterprise security integration,scheduling, and more

    • Documentation

    ● Member of the Pentaho BI Suite

  • 8/17/2019 Pd i 2000 Lectures

    10/623

    Use as a BI Platform Component

    ● PDI Jobs and Transformations can be

    run in Pentaho BI Platform:

    — PDI 3.X Jobs/Transforms execute inPentaho BI Platform 1.7.X andbeyond.

    — PDI 4.X Jobs/Transforms execute in

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 10

    en a o a orm . . anbeyond.

    — For example: Fill a Pentaho Reportwith data from a Transformation.

    — Details in a separate module.

  • 8/17/2019 Pd i 2000 Lectures

    11/623

    Enterprise Edition (EE) Data Integration Server

    Standalone without the BI Platform

    PDI Enterprise Edition Architecture:

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 11

  • 8/17/2019 Pd i 2000 Lectures

    12/623

    Enterprise Edition (EE) Data Integration Server

    Primary features and functions:

    ● Execution: Executes ETL jobs and transformations using the Pentaho Data

    Integration Engine.

    ● Security: Allows you to manage users and roles (default security) or integrate

    security to your existing security provider such as LDAP or Active Directory.

     

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 12

     

    ETL jobs and transformations. This includes full revision history on content and

    features such as sharing and locking for collaborative development

    environments.

    ● Scheduling: Provides the services allowing you to schedule and monitorscheduled activities on the Data Integration Server from within the Spoon design

    environment.

  • 8/17/2019 Pd i 2000 Lectures

    13/623

    The Enterprise Console

    ● Provides a thin client for managing deployments of Pentaho Data

    Integration Enterprise Edition including:

    — Management of Enterprise Edition licenses.

    — Monitoring and controlling activity on a remote Pentaho Data

    Integration server.

     

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 13

    —  

    Transformations.

  • 8/17/2019 Pd i 2000 Lectures

    14/623

    The KETTLE Project

    ● What is KETTLE?

    — Recursive acronym much like GNU (GNU’s Not Unix).

    — K ettle Extraction Transformation Transportation & Loading

    Environment.● Created out of frustration with other solutions

    — Custom built PL/SQL, C-SQL (embedded SQL), hacked VB solutions 

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 14

    — ommerc a pro uc s:• Oracle Warehouse Builder• Information Builders iWay• SQL Server DTS• Data Mirror

  • 8/17/2019 Pd i 2000 Lectures

    15/623

    PDI (KETTLE) History

    ● Early Years: 2001 – 2005

    — KETTLE project started by 2001 by Matt Casters

    — Years focused on easy-to-use, maintainability and deployability

    ● KETTLE 2.3: December 2005

    — First LGPL-licensed open source version

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 15

    ● Project acquired by Pentaho: April 2006

    — Provided paid staff of developers

    — Offered support and services to customers

  • 8/17/2019 Pd i 2000 Lectures

    16/623

    PDI (KETTLE) History

    ● Pentaho Data Integration 2.4: February 2007

    — Parallel processing support

    — Multi-tab interface for developers editing multiple transformations

    — Integration of transformation design and job execution userinterfaces

     

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 16

      .

    — Enhanced MySQL database support

    — Transformation Explorer for organizing and accessing transformations

  • 8/17/2019 Pd i 2000 Lectures

    17/623

    PDI (KETTLE) History

    ● Pentaho Data Integration 3.0: Nov 2007

    — ETL Developer productivity— Rapid, community-fueled evolution

    — Performance and scalability• Clean separation of data and

    metadata Reduced Java object creation

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 17

    • Clustering improvements:Support for multiple step copiesData re-partitioningDynamic cluster schemas

    • Faster flat file reading:

    Use of non-blocking I/O (NIO) to read large blocks at a timeParallel file readingSupport for lazy conversionSimplified algorithms

  • 8/17/2019 Pd i 2000 Lectures

    18/623

    PDI (KETTLE) History

    ● Pentaho Data Integration 3.1: September 2008— Enterprise Console for remote monitoring and performance trend

    analysis

    — Ease of use improvements including a consolidated log, executionhistory, step performance graph results panel

    — Numerous new steps, job entries and expanded data source support

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 18

    ● Pentaho Data Integration 3.2: May 2009— Dynamic Clustering - dynamically distribute execution to available

    cluster/cloud nodes

    — Named Parameters

    — Usability improvements - annotate hops, more visual feedback— Over 20 new Transformation Steps/Job Entries with numerous

    updates to existing steps

  • 8/17/2019 Pd i 2000 Lectures

    19/623

    PDI (KETTLE) History

    ● Pentaho Data Integration 4.0: June 2010— Data Modeling and Visualization Perspectives (Pentaho Agile BI)

    — EE Data Integration Server with scheduling, revision history andmore

    — Usability improvements

    — New steps/Job entries

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 19

    ● Pentaho Data Integration 4.1: November 2010— Hadoop Integration (Enterprise Edition only)

    — One click disable/enable of steps downstream of a hop

    — Metadata Injection (experimental first steps)

    — Many new steps and job entries

  • 8/17/2019 Pd i 2000 Lectures

    20/623

    PDI Version 4.2 (July 2011)

    ● Graphical performance and progress feedback for transformations● Metadata Injection

    ● Report bursting by the Pentaho Reporting Output step

    ● Automatic Documentation step● Talend Job Execution job entry

    ● Single Threader step for parallel performance tuning of large

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 20

    transformations

    ● Allow a job to be started at a job entry of your choice

    ● The XML Input Stream (StAX) step to read huge XML files at optimal

    performance

    ● The Get ID from Slave Server step allows multi-host or clustered

    transformations to get globally unique integer IDs from a slave server… 4.2 continued on next page …

  • 8/17/2019 Pd i 2000 Lectures

    21/623

    PDI Version 4.2

    ● Carte improvements1. reserve next value range from a slave sequence service

    2. allow parallel (simultaneous) runs of clustered transformations

    3. list (reserved and free) socket reservations service

    4. new options in XML for configuring slave sequences

    5. allow time-out of stale objects using environment variableKETTLE_CARTE_OBJECT_TIMEOUT_MINUTES

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 21

    ● Memory tuning of logging back-end with1. KETTLE_MAX_LOGGING_REGISTRY_SIZE

    2. KETTLE_MAX_JOB_ENTRIES_LOGGED

    3. KETTLE_MAX_JOB_TRACKER_SIZE allowing for flat memory usage fornever ending ETL in general and jobs specifically.

    … 4.2 continued on next page …

  • 8/17/2019 Pd i 2000 Lectures

    22/623

    PDI Version 4.2

    ● Repository Import/Export1. Export at the repository folder level

    2. Export and Import with optional rule-based validations

    3. Import command line utility allow for rule-based (optional)import of lists of transformations, jobs and repository exportfiles: http://wiki.pentaho.com/display/EAI/Import+User+Documentation

    ● ETL Metadata Injection

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 22

    1. Retrieval of rows of data from a step to the “metadatainjection” step

    2. Support for injection into the “Excel Input” step

    3. Support for injection into the “Row normaliser” step

    4. Support for injection into the “Row Denormaliser” step

    ● And many more new steps and job entries

  • 8/17/2019 Pd i 2000 Lectures

    23/623

    Why Pentaho Data Integration?

    ● Ease of use:

    — 100% metadata driven (define WHAT you want to do, not HOW to doit)

    — No extra code generation means lower complexity— Simple setup, intuitive graphical designers and easy to maintain

    ● Flexibility:

     

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 23

    — Never forces a certain path on the user— Pluggable architecture for extending functionality

  • 8/17/2019 Pd i 2000 Lectures

    24/623

    Why Pentaho Data Integration?

    ● Modern Standards-based Architecture

    — 100% Java with broad, cross platform support— Over 100 out-of-the-box mapping objects (steps and job entries)

    — Enterprise class performance and scalability

    ● Lower total cost of ownership (TCO)

    — No license fees

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 24

     

    — Short implementation cycles— Reduced maintenance costs

  • 8/17/2019 Pd i 2000 Lectures

    25/623

    Pentaho Data Integration Adoption

    ● Wide range of production deployments:

    — Small and medium-sized companies

    — Large enterprises

    ● Rapid product evolution

    — Driven by Pentaho investment

    — Includes significant community

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 25

     

    contributions• ‘Contribution-friendly’architecture

    • Natural fit for additional data

    sources, targets andtransformations

  • 8/17/2019 Pd i 2000 Lectures

    26/623

    Common Uses

    ● Data warehouse population:

    — Built-in support for slowly changing

    dimensions, junk dimensions and other data

    warehouse concepts.

    ● Export of database(s) to text-file(s) or other

    databases.

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 26

    ● Import of data into databases, ranging fromtext-files to Excel© spreadsheets.

    ● Data migration between database applications.

  • 8/17/2019 Pd i 2000 Lectures

    27/623

    Common Uses

    ● Exploration of data in existing database

    (tables, views, synonyms, …).

    ● Information enrichment by looking up data

    in various information stores (databases,

    text-files, Excel© spreadsheets, …).

     

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 27

     

    conditions in data transformations.

    ● Application integration.

  • 8/17/2019 Pd i 2000 Lectures

    28/623

    Example

    21

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 28

    2 … or DB2, Teradata, Microsoft, Sybase, MySQL, PostgreSQL, Ingres, etc.

    1 … or PeopleSoft, Siebel, JD Edwards, Axapta, Navision, SugarCRM, Compiere, and others

  • 8/17/2019 Pd i 2000 Lectures

    29/623

    Agile BI

    ● Modeling and Visualization perspectives

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 29

  • 8/17/2019 Pd i 2000 Lectures

    30/623

    Agile BI

    ● Modeling and Visualization perspectives.

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 30

  • 8/17/2019 Pd i 2000 Lectures

    31/623

    EE Data Integration Server

    ● Enterprise Repository and Content Management:— New repository based on JCR (Content Repository API for Java)— Improved Repository Browser— Enterprise security:

    • Configurable Authentication including support for LDAP and MSAD• Task permissions to control what actions a user/role can performsuch as read/execute content, create content and administersecurity

    —  

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 31

     

    previous revisions of a Job or Transformation.— Ability to lock Transformations/Jobs for editing.— ‘Recycling bin’ concept for working with deleted files.

  • 8/17/2019 Pd i 2000 Lectures

    32/623

    EE Data Integration Server

    ● Scheduling

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 32

  • 8/17/2019 Pd i 2000 Lectures

    33/623

    Since PDI 4.0 & 4.2: Usability

    ● Hover-over menus & simplify the

    connection of steps

    ● Graphical indicators on hops representing

    the flow of information between steps.

    ● New activity indicators on jobs and steps (since 4.2) help highlight current

    activity and bottlenecks during execution.

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 33

  • 8/17/2019 Pd i 2000 Lectures

    34/623

    Since PDI 4.0: Improved Logging

    ● Internal object IDs (small API change)

    ● Logging channels (GUIDs)

    ● Step logging

    ● Sniffing (debugging, data lineage)

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 34

  • 8/17/2019 Pd i 2000 Lectures

    35/623

    Pentaho Data Integration Components

    ● Spoon

    — Graphical environment for modeling— Transformations are metadata models

    describing the flow of data— Jobs are workflow-like models for

    coordinating resources, execution anddependencies of ETL activities

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 35

    ● Pan

    — Command line tool for executingtransformations modeled in Spoon

    ● Kitchen

    — Command line tool for executing jobsmodeled in Spoon

    Spoon Interface – Designing a Transformation

    Job Example

  • 8/17/2019 Pd i 2000 Lectures

    36/623

    Pentaho Data Integration Components

    ● Carte

    — Lightweight web\HTTP server for remotelyexecuting Jobs and Transformations.

    — Carte accepts XML containing the transformation toexecute and the execution configuration.

    — Enables remote monitoring

    —  

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 36

     

    — Remote servers running Carte are referred to asSlave Servers.

  • 8/17/2019 Pd i 2000 Lectures

    37/623

    Enterprise Edition (EE) Data Integration Server

    ● Enterprise Edition alternative to Carte providing

    — Execution and remote monitoring (can act as master/slave similar to Carte)

    — Integrated scheduling

    — Enterprise Security options— Enhanced content management including revision history and locking

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 37

  • 8/17/2019 Pd i 2000 Lectures

    38/623

    Repository: the Metadata Store

    ● Kettle can store metadata in

    — XML files

    — RDBMS repository

    — Enterprise repository

    ● Objects stored in repository

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 38

    — Transformations

    — Jobs

    — Schemas

    — User and profile definitions● Repository supports

    collaborative development

  • 8/17/2019 Pd i 2000 Lectures

    39/623

    How to start the user interface?

    ● Start Spoon.bat (Windows) or Spoon.sh (Linux, MacOS) in the Kettle

    folder.

    ● The command launch-designer.bat /.sh is also possible in the archive

    based installation.

    ● For our trainings we do not use the repository, all training data will be

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 39

    s ore n e e sys em as - rans orma ons or - es o s n

    XML format.

    ● Another option would be to start via Java Web Start:

    — Latest version loaded via the Internet (usable for configuration

    management).— JNLP-files (Java Network Launching Protocol) located in the

    KETTLE/webstart folder.

  • 8/17/2019 Pd i 2000 Lectures

    40/623

    Transformations

    ● Transformations are a network of logical tasks (Steps):

    — Read a flat file

    — Filter it

    — Sort it— Load it into MySQL

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 40

  • 8/17/2019 Pd i 2000 Lectures

    41/623

    Steps – Job Entries

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 41

  • 8/17/2019 Pd i 2000 Lectures

    42/623

    Hops - within Transformations

    ● Are data pathways that connect steps together

    ● Allow schema metadata to be passed from step to step

    ● Determine the flow of data through the steps

    ● Example: The pathway for all data and the true and false path from a Filter

    rows step.

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 42

  • 8/17/2019 Pd i 2000 Lectures

    43/623

    Hops – Data Movement: Copy or Distribute?

    ● Specify if data can either be copied or distributed between multiple hops

    leaving a step (right click on a step and select ‘Data Movement’).

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 43

    H O h T i T f i

  • 8/17/2019 Pd i 2000 Lectures

    44/623

    Hops – Other Types in Transformations

    ● Info Steps: When data is retrieved (pulled) from another step.

    ● Error handling steps: When error handling is enabled.

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 44

    H ithi J b

  • 8/17/2019 Pd i 2000 Lectures

    45/623

    Hops - within Jobs

    ● Are defining the execution sequence for job entries

    ● There are three types of hops within Jobs (right click on the hop):

    — Unconditional

    — Follow when result is true

    — Follow when result is false

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 45

    D t Fl Th di h i

  • 8/17/2019 Pd i 2000 Lectures

    46/623

    Data Flow, Threading mechanism

    ● All Steps run are started and run in parallel:— Initialization sequence is not predictable

    ● PDI takes care of the correct data flow

    — ‘Pulling and pushing’ data from step to step● PDI capable to process an unlimited number of rows:

    — Steps vary on execution speed and memory consumption

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 46

    — Set the threshold on number of rows will wait to be processed bynext step:

    • If this number of waiting rows is reached, the source step waitsfor room.

    • When there is room to process, more rows are put into the datastream.

    Transformation Properties / Miscellaneous / Number of rows in

    rowset

    Th di h i

  • 8/17/2019 Pd i 2000 Lectures

    47/623

    Threading mechanism

    ● There are additional options that influence the threading mechanism inTransformation Properties/Miscellaneous

    1. Manage thread priorities: If there

    is not much to do for a thread(step), put the thread to sleep for

    some milliseconds. This reduces

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 47

    2. (Serial) Single Threaded: This allows to build thread pools in combination

    with sub-transformations. It will not work if any step is getting or putting rowsfrom/to more than one step (e.g. the Stream lookup step).

    3. KETTLE properties: KETTLE_BATCHING_ROWSET,

    KETTLE_ROWSET_GET/PUT_TIMEOUT

     

    and is enabled by default.

    Values Metadata and Data

  • 8/17/2019 Pd i 2000 Lectures

    48/623

    Values, Metadata and Data

    ● Values are like columns in data rows:

    — Composed of the metadata and the data.

    ● PDI version 3.0 separated metadata and data

    — Metadata is only transported with the first data row.— All subsequent data rows reference to this metadata.

    ● PDI ma s database JDBC data t es to PDI data

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 48

     

    — Implementation can (and often is) different from database todatabase.

    Values Metadata and Data: Formatting etc

  • 8/17/2019 Pd i 2000 Lectures

    49/623

    Values, Metadata and Data: Formatting etc.

    ● Metadata is used for formatting when:

    — Data is presented e.g. in a preview.

    — Data is written to the outside world, e.g. to a text- or XML-file.

    ● Metadata is NOT used for formatting when:— Data is just loaded from one table and written to another table.

    ● Metadata is used to create SQL-statements to know the field types, length etc.

     

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 49

    ● Metadata is used to check the data for the right data type.

    ● Note: A change in metadata does not change the data, e.g.

    — A modification of length does not change/truncate the data.

    — A new formatting does not change the data.

    ● But, when you modify the type in the ‘Select Values’ step, the data isconverted to a new data type (e.g. from a String to a Number with the given

    formatting).

    Numeric Data Types

  • 8/17/2019 Pd i 2000 Lectures

    50/623

    Numeric Data Types

    ● Number (double, floating point)

    — Double precision can handle only 15 significant digits

    — Is sometimes not exact

    To avoid this, use BigNumber but will have performance drop● Integer

    — Is optimal for storing and processing data from a performance

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 50

     viewpoint.

    ● BigNumber

    — Offers a extremely high precision level but needs more memory andCPU then the other numerical values.

    ● Note: Formatting Number Data Types is done by default with the

    pattern #.#;-#.# that means with only one digit precision.

  • 8/17/2019 Pd i 2000 Lectures

    51/623

    Other Data Types

  • 8/17/2019 Pd i 2000 Lectures

    52/623

    Other Data Types

    ● Date

    — Includes date and time— Note: Formatting Date Data Types is done by default with the

    pattern yyyy/MM/dd HH:mm:ss.SSS

    ● Boolean

    — True or false, representation in the database depends on databaseboolean su ort.

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 52

     

    — Note: Please see the following option to turn this on when yourdatabase supports this: Database connection/Advanced/Supportsboolean data type

    ● Binary

    — Can hold any binary data like pictures, used mainly for BLOB data.● Serializable

    — An object to transfer from/to specific steps (internally only).

    Data Types and Silent Conversions

  • 8/17/2019 Pd i 2000 Lectures

    53/623

    Data Types and Silent Conversions

    ● Until version 2.5.x, data types mostly converted silently.

    — Can lead to some undiscovered type mismatches.

    ● PDI 3.0

    — Data types are strictly checked

    — No silent conversions take place (with some exceptions forcompatibility)

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 53

    ● Example for a type conflict:— Samples directory: Denormalizer - Simple example.ktr

    Lazy Conversion

  • 8/17/2019 Pd i 2000 Lectures

    54/623

    Lazy Conversion

    ● Lazy conversion is a delayed conversion of data types

    — Provides a performance boost.

    ● Conversion takes place only where it is really needed

    — Ideally at output steps

    — Sometimes not at all—reading from text-file and writing back totext-file

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 54

    — If output format is the same as the input format, no conversion

    ● Steps support lazy conversion

    — Specifically: CSV File Input, Fixed File Input and Table Input

    — Other steps support it transparently

    Lazy Conversion

  • 8/17/2019 Pd i 2000 Lectures

    55/623

    Lazy Conversion

    ● Binary form of data can cause issues:

    — For example: sorting on binary form ignores character set sortingsequence.

    ● New feature in the ‘Select Values’ step

    — Covered in more detail in a separate module

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 55

    — Converts binary data to and from binary character data

    Handling the User Interface

  • 8/17/2019 Pd i 2000 Lectures

    56/623

    Handling the User Interface

    ● Main tree

    — Lists all open Transformations and Jobs and their contents.

    ● Core objects and favorite Steps/Job entries

    — Core is toolbox with all the available Steps/Job entries (plug-ins arein bold).

    — Favorites are static ‘most’ used steps.

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 56

    ● Notes— Can appear anywhere on graphical view.

    — Right clicking on the canvas and selecting Add Note.

    ● Options and settings

    — Options are valid for the entire PDI environement.

    — Settings are valid for a particular transformation or job.

    Handling the User Interface

  • 8/17/2019 Pd i 2000 Lectures

    57/623

    Handling the User Interface

    ● Draw new hops

    — Other methods:• Middle or scroll wheel button, click on the first ste and dra onto the

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 57

     second.

    • Use SHIFT+Click and drag from one step to another• Select 2 steps, right click on one of them and select ‘New hop’.• Drag ‘Hops’ onto the canvas.

    ● Inserting a step (or job entry) between others:

    — Move the step over the arrow until the arrow becomes drawn in bold— Release the mouse-button

    ● Window sizes:

    — May have to resize some dialogs to all parameters

    Handling the User Interface

  • 8/17/2019 Pd i 2000 Lectures

    58/623

    g

    ● Click right on the first column in any dialog table (grid) for a list of allthe options.

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 58

    Handling the User Interface

  • 8/17/2019 Pd i 2000 Lectures

    59/623

    g

    ● Click right on a Step for a list of all the options in the context menu.

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 59

    Run and Preview

  • 8/17/2019 Pd i 2000 Lectures

    60/623

    ● Can execute the entire Transformation or just preview a particular step.— Preview also possible by selecting a step and pressing the F10 key.

    ● Need at least two steps connected with a hop to run or preview

    ● Closing a preview (vs. choosing the stop/get more rows options) willleave the Transformation in a paused execution state. If you attempt to

    restart the Transformation, it will tell you it can not be started twice.

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 60

     

    ● Note: Preview may be destructive.— Subsequent steps are also initiated (could cause truncation of target

    table).— Rows passed to subsequent steps; not stopped at the previewed step.

    — You may temporarily disable the hop after the step to preview.

    Log View

  • 8/17/2019 Pd i 2000 Lectures

    61/623

    ● Shows statistics associated with execution of a Transformation.

    ● Used to understand performance and to check the results.

    ● Logging can be very granular – down to the row level if needed.

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 61

  • 8/17/2019 Pd i 2000 Lectures

    62/623

    Usage of ‘Safe Mode’ at design time

  • 8/17/2019 Pd i 2000 Lectures

    63/623

    ● Performing a union on data with different layouts generates warning

    — The name of field number 1 is not the same as in the first rowreceived: you're mixing rows with different layout. Field[ThisIsAStringValue String(10)] does not have the same name as field[ThisIsANumberValue Number].

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 63

    Analyzing Errors

  • 8/17/2019 Pd i 2000 Lectures

    64/623

    2010/05/18 16:38:00 - Generate Rows.0 - ERROR (version 4.0.0-…) : Couldn't parse Integer

    field [WrongType] with value [abc] -->org.pentaho.di.core.exception.KettleValueException:

    ● Log entries include

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 64

     

    — Step (Generate Rows) generating the error— Detailed information with the PDI version (useful when submittingcases)

    — Any stack traces (useful for bug tracking due to program errors)

    — Error lines are in red since 4.0 (easier to find)● Use Show Error Lines to find errors easier

    Debugging

  • 8/17/2019 Pd i 2000 Lectures

    65/623

    ● Introduced in PDI 3.0

    ● Provides condition break points

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 65

  • 8/17/2019 Pd i 2000 Lectures

    66/623

    Special PDI Application Files

  • 8/17/2019 Pd i 2000 Lectures

    67/623

    ● User-specific files found in .kettle directory of user’s home directory

    — kettle.properties: Default properties file for variables

    — shared.xml: Default shared objects file

    — db.cache: The database cache for meta data— repositories.xml: The local repositories file

    — .spoonrc: User interface settings, last opened transformation/job …

     

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 67

    — .languageChoice: User language (delete to revert language)

    ● Some temporary logs stored also in temp folder :

    — Usually cleaned by Java Virtual Machine (VM).

    — May need to be cleaned (deleted) due to defects in VM.— Directory of temporary logs determined by OS and VM

    (for example: typically C:\Windows\Temp on Windows©).

    Special PDI Application Files – KETTLE_HOME

  • 8/17/2019 Pd i 2000 Lectures

    68/623

    ● The HOME directory may change depending on the user who is logged on. As aresult, the configuration files that control the behavior of PDI Jobs and

    Transformations are different from user to user. Setting the KETTLE_HOME

    variable can be performed system wide by the operating system or just before

    the start of PDI using a shell script or batch. For example: use the SET

    command.

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 68

    ● Point the KETTLE_HOME to the directory that contains the .kettle directory. The

    .kettle gets appended by PDI. For example: when you have stored the common

    files in C:\Pentaho\Kettle\common\.kettle you need to set the KETTLE_HOME

    variable to C:\Pentaho\Kettle\common.

    ● When running PDI from the Pentaho BI Platform, please see the Knowledge Basefor setting the variable.

    Pentaho Data Integration

  • 8/17/2019 Pd i 2000 Lectures

    69/623

    The ‘Conceptual’ Model – any questions?

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 69

    Pentaho Data Integration - Conceptual model

  • 8/17/2019 Pd i 2000 Lectures

    70/623

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 70

  • 8/17/2019 Pd i 2000 Lectures

    71/623

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com.

    Connections, Inputs and Outputs

    Database Connections

  • 8/17/2019 Pd i 2000 Lectures

    72/623

    ● Multiple database connections to different databases can be created.

    ● With a PDI repository:

    — Defined connections readily available to transformations and jobs.

    — Connection information for the repository itself is stored inrepositories.xml.

    ● Without a PDI repository:

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 72

    — Connection definition contained in a single Transformation or Job.

    — Can share connection definitions in subsequent Transformations andJobs.

    Database Connections

  • 8/17/2019 Pd i 2000 Lectures

    73/623

    ● Available database connections appear in the Main Tree.

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 73

    ● Choose Share in the context menu of any connection to share it.

    — Shared connections appear in bold.

    Database Connections

  • 8/17/2019 Pd i 2000 Lectures

    74/623

    ● General database connection options

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 74

    Access Via JDBC

  • 8/17/2019 Pd i 2000 Lectures

    75/623

    ● Pentaho Data Integration ships with the most suitable JDBC drivers for

    the listed databases.

    ● Additional drivers can by added to the lib/libext directory.

    — Use Generic tab of connection dialog to use unlisted drivers.

    — Permits connections to non-listed databases.

     

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 75

    ● Existing drivers can be replaced in the lib/libext directory.

    ● Special database issues and experiences with different JDBC drivers can

    be found in the Pentaho Data Integration Wiki:

    — http://wiki.pentaho.com/display/EAI/Special+database+issues+and+

    experiences

    Other Access Methods

  • 8/17/2019 Pd i 2000 Lectures

    76/623

    ● ODBC connections are possible:

    — ODBC connections must be defined in Windows.

    — ODBC connections made via ODBC-JDBC-Bridge.

    • Some limitations of the SQL syntax

    • Generally slower than JDBC due to additional layer.

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 76

    ● Use a JNDI connection to connect to a datasource defined in an

    application server like JBoss or Webshere.

    ● Plugin specific access methods are supplied by a specific database

    driver (like SAP R/3 or PALO connections).

    Advanced Database Connections

  • 8/17/2019 Pd i 2000 Lectures

    77/623

    ● Pooling— Useful for performance tuning

    — Can limit number of connections for database client licensing reasons

    ● Driver Specific Options (Options tab)

    — Pass additional parameters to the drivers

     

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 77

    — Allows driver specific tuning for performance

    — Database vendor documentation is available by clicking help button

    Advanced Database Connections

  • 8/17/2019 Pd i 2000 Lectures

    78/623

    ● Clustering— Enables clustering for the database connection and creation

    connections to data partitions.

    — Requires partition ID, name and port of host, and user name and

    password● Identifiers (Advanced tab)

     

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 78

    —  

    — Post Connection SQL

    Quoting

  • 8/17/2019 Pd i 2000 Lectures

    79/623

    ● Quoting is used when reserved names or special characters are used.

    — For example: Field names, ‘sum’, ‘V.A.T.’, ‘overall sales’.

    ● PDI has an internal list of reserved names for most of the supported

    database types.

    ● PDI’s automatic quoting can be overridden.

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 79

    — Feedback on quoting is always welcome to improve quoting

    algorithms.

    Database Explorer

  • 8/17/2019 Pd i 2000 Lectures

    80/623

    ● In the toolbar:

    ● In the connection context menu:

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 80

    DB Cache: The Metadata Cache

  • 8/17/2019 Pd i 2000 Lectures

    81/623

    ● Metadata for fields each connection and SQL statement is cached in

    internal file ‘db.cache’.

    ● Metadata cache must be refreshed.

    — Refreshed automatically when the table is changed in the PDI

    context (from the SQL statement window).

     

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 81

    ● Manual refreshing of cache maybe necessary.

    — Typically detected by missing fields or mismatches in ‘Show

    input/output fields’ of a step.

    Impact Analysis

  • 8/17/2019 Pd i 2000 Lectures

    82/623

    ● What impact does the Transformation have on the used databases?

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 82

    SQL Editor

  • 8/17/2019 Pd i 2000 Lectures

    83/623

    ● Creates the needed DDL for the output steps related to a databasetable, often CREATE statements for tables or indices.

    ● SQL button in the toolbar creates all needed DDL for tables.

    ● No automatic mechanism to alter tables when the layout changed.

    — For example: A field type from a source table is changed

     

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 83

    —  can e eas y an manua y c ange .

    Steps Covered In This Section

  • 8/17/2019 Pd i 2000 Lectures

    84/623

    ● Input Steps Output Steps Special Dimension Steps

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 84

    Text File Input

  • 8/17/2019 Pd i 2000 Lectures

    85/623

    ● Reads a large number of different text files, including CSV files generated by

    spreadsheets.

    ● Text File Input Options:

    — Filename specification options: Files may be added to ‘selected files’ list.

    — Accept filenames from previous step: Filename can come from any source.

     

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 85

    —   .

    — Error handling: Allows user to define how to react to errors.— Filters: Allows user to specify the lines to be skipped in the text file.

    — Fields: Allows user to define characteristics of the fields.

    — Formats: Includes formatting of number and date fields.

    CSV File Input

  • 8/17/2019 Pd i 2000 Lectures

    86/623

    ● Reads a CSV file format only.

    ● Due to the internal processing, this step is much faster.

    ● Options are a subset of Text File Input

    — NIO buffer size: Set the buffer size used by the Java I/O classes(NIO, improved performance in the areas of buffer management).

    —  

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 86

      .

    Fixed File Input

  • 8/17/2019 Pd i 2000 Lectures

    87/623

    ● Reads a fixed file format only.

    ● Due to the internal processing, this step is much faster.

    ● Options are a subset of Text File Input

    — NIO buffer size and lazy conversion options are identical to the CSVfile input options.

    —  

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 87

     cluster with many workers processing on large file.

    Table Input

  • 8/17/2019 Pd i 2000 Lectures

    88/623

    ● Reads information from a database, using a connection and SQL.

    ● Table Input Options

    — Step Name: The name has to be unique in a single Transformation.

    — Connection: The database connection used to read data from.

    — S L: The statement used to read information from the database

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 88

     

    connection; may be any query.

    — Insert data from step: The input step name where parameters for

    the SQL come from, if appropriate.

    — Limit: Sets the number of lines that are read from the database.

    Excel Input

    R d i f i f E l fil

  • 8/17/2019 Pd i 2000 Lectures

    89/623

    ● Reads information from one or more Excel files.

    ● The options provided by PDI GUI for accepting excel inputs include:

    — Step Name: Name of the step.

    — File Tab: To define the filenames, with variable support.

    — Sheet Tab: To define the sheet(s) to import.

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 89

    — Fields Tab: To specify the fields that need to be read from the

    Excel files.

    — Error handling Tab: Allows user to define how to react when error

    is encountered.

    — Content Tab: Includes the sub options of Header, No empty rows,Stop on empty rows, Field name, Sheet name field, Row number

    field and Limit.

    Access Input

  • 8/17/2019 Pd i 2000 Lectures

    90/623

    ● Reads information from one or more Access® files.

    ● No ODBC connection necessary.

    — Allows Access® files to be read on non-Windows platforms.

    ● Access Input Options:

    — Ste Name: Name of the ste .

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 90

     

    — File Tab: To define the filenames, with variable support.— Content Tab: Specify the table name and the inclusion of the file

    name, table name, row number and limit.

    — Fields Tab: Specify the fields that need to be read from the Accessfiles.

    XBase Input

    R d d t f t t f DBF fil d i t ll d th XB

  • 8/17/2019 Pd i 2000 Lectures

    91/623

    ● Reads data from most types of DBF file derivates called the XBasefamily. (dBase III/IV, Foxpro, Clipper, ...)

    ● Options:

    — Step name: Unique name (in transformation) of the step.— Filename: Name of XBase file with variable support.

     

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 91

    — Limit size: Only read this number of rows; zero means unlimited.

    — Add rownr?: Adds a field to the output with the specified name that

    contains the row number.

    Apache Virtual File System (VFS) support

  • 8/17/2019 Pd i 2000 Lectures

    92/623

    ● Apache VFS allows references to files from virtual any location.

    — Apache VFS support available in the Pentaho Platform and Pentaho

    Analysis.

    ● All file names are treated as URIs.

    — file:///somedir/somefile.txt

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 92

    — zip:http://somehost/downloads/somefile.zip

    — http://myusername@somehost/index.html— sftp://myusername:mypassword@somehost/pub/dl/somefile.tgz— webdav://somehost:8080/dist

    ● Further information: http://commons.apache.org/vfs/filesystems.html

    Generate Rows

    O b f d f l b ll

  • 8/17/2019 Pd i 2000 Lectures

    93/623

    ● Outputs a number of rows, default is empty but optionally containing a

    number of static fields.

    ● Options

    — Step name: Unique name (in Transformation) of the step.

    — Limit: The number of rows user wants to out ut.

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 93

     

    Fields: Static fields user might want to include in the output row.

  • 8/17/2019 Pd i 2000 Lectures

    94/623

    LDAP Input

    R d d f LDAP

  • 8/17/2019 Pd i 2000 Lectures

    95/623

    ● Reads data from a LDAP server.

    ● Options

    — Host: Hostname or IP address of the LDAP server.

    — Port: The TCP port to use, typically 389.

    — User Authentication: Enable to pass authentication credentials to

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 95

    .

    — Username/Password: For authenticating with the LDAP server.

    — Search base: Location in the directory from which the LDAP searchbegins.

    — Filter String: The filter string for filtering the results.

    — Fields: Define the return fields and type.

    De-serialize/Serialize to/from File

    R d d it d t d PDI t d t t th f d t fil

  • 8/17/2019 Pd i 2000 Lectures

    96/623

    ● Read and write data and PDI metadata together from and to a file.

    ● Use Cases

    — Transfer data from one Transformation to another where the memory

    is not sufficient to hold the amount of data.— Save data and metadata to be processed at another time.— Transfer data to another user with no need to reanal ze the

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 96

     metadata (instead of text files).

    Text File Output

    ● Exports data to a variety of different text file formats including CSV

  • 8/17/2019 Pd i 2000 Lectures

    97/623

    ● Exports data to a variety of different text file formats, including CSV.

    ● Options

    — Extension

    — Append— Separator— Enclosure

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 97

    — Header/Footer

    — Zipped— Include step number/date/time in file name— Encoding— Right pad fields

    — Split every or ‘n’ row(s)

    Table Output

    ● Insert (only) information in a database table

  • 8/17/2019 Pd i 2000 Lectures

    98/623

    ● Insert (only ) information in a database table.

    ● Options

    — Target table

    — Commit size— Truncate table— I nore insert errors

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 98

     — Partition data over tables

    — Use batch update for inserts— Return auto-generated key— Name auto-generated key field— Is the name of the table defined in a field

    Insert / Update

    ● Automates simple merge processing:

  • 8/17/2019 Pd i 2000 Lectures

    99/623

    ● p g p g— Look up a row using one or more lookup keys.— If a row is not found, insert the new row.— If found and targeted fields are the identical, do nothing.

    — If found and targeted fields are not identical, update the row.● Options

     

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 99

    —  

    — Connection— Target table— Commit size— Keys

    — Update fields— Do not perform any updates (If used, operates like Table Output,but without any Insert errors caused by duplicate keys).

    Update

    ● Same as the Insert / Update step except no insert is performed in thedatabase table

  • 8/17/2019 Pd i 2000 Lectures

    100/623

    Same as the p step except no insert is performed in thedatabase table.— ONLY updates are performed.

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 100

  • 8/17/2019 Pd i 2000 Lectures

    101/623

    Excel Output

    ● Exports data to an Excel® file

  • 8/17/2019 Pd i 2000 Lectures

    102/623

    p● Options

    — Sheet name— Protect sheet with a password— Use a template (e.g. with a preformatted sheet)— Append or override the contents of the template

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 102

    Access Output

    ● Exports data to an Access® file

  • 8/17/2019 Pd i 2000 Lectures

    103/623

    p

    — ODBC not required

    — Can be used on non-Windows platforms

    ● Options

    — Database filename (.mdb) 

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 103

    —  — Target table— Create table— Commit size

    XML Output

    ● Writes rows from any source to one or more XML files

  • 8/17/2019 Pd i 2000 Lectures

    104/623

    ● Options

    — File name

    — Extension— Include stepnr in file name— Include date in file name

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 104

     — Include time in file name

    — Split every N rows— Parent XML element— Row XML element— Fields

    — Zipped— Encoding

    Dimension Update / Lookup

    ● Implements slowly changing dimensions (Type I and Type II)

  • 8/17/2019 Pd i 2000 Lectures

    105/623

    ● Can be used for updating a dimension table and for looking up values in

    a dimension. (Lookup, if not found, then update/insert)

    ● Each entry in the dimension table has the following fields— Technical key: The primary (surrogate) key of the dimension—  

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 105

     

    — Start of date range: Field containing validity starting date

    — End of date range: Field containing the validity ending date— Keys: Business keys used in source systems; used for lookup

    functionality.— Fields: Actual information of a dimension and can be set individually

    to update all versions or to add a new version when a new valueappears.

  • 8/17/2019 Pd i 2000 Lectures

    106/623

  • 8/17/2019 Pd i 2000 Lectures

    107/623

     

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com.

    Introduction to the Training Data

    Introduction to the Training Data

    ● Represents fictitious company Steel WheelsBuys collectable model cars trains trucks etc

  • 8/17/2019 Pd i 2000 Lectures

    108/623

    — Buys collectable model cars, trains, trucks, etc.from manufacturers

    — Sells to distributors across the globe

    ● Data adapted from the sample data provided by

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 108

    Eclipse’s BIRT project.

    ● pentaho_oltp database has many tables:

    — Offices, Employees, Customers, Products,

    Orders, Orderdetails, Payments

    pentaho_oltp

    Pentaho Training Data: Tables● Offices

    — 7 offices worldwide (San Francisco, Boston, NYC, Paris, Tokyo,

  • 8/17/2019 Pd i 2000 Lectures

    109/623

    ( , , , , y ,

    Sydney, London)

    — Headquartered in San Francisco, CA

    — Each office is assigned to a sales territory (APAC, NA, EMEA or

    JAPAN)

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 109

    ● Employees

    — 23 employees: 6 Execs and 17 Sales Reps

    — Each assigned to one of the seven offices

    — Sales Reps also assigned to a number of customers (distributors)— New Sales Reps (that are still in training) don’t have assigned

    customers

    Pentaho Training Data: Tables● Customers

    — Steel Wheels has 122 customers worldwide

  • 8/17/2019 Pd i 2000 Lectures

    110/623

    — Approximately 20 of those new customers without a sales rep any orders

    — Each has a credit limit which determines maximum outstanding balance

    ● Products

    — 110 unique models purchased from 13 vendors

     

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 110

    —   , , ,

    Trucks and Buses, Planes, Ships, Trains.

    — Additionally models are classified based on their scale (e.g. 1:18, 1:72 etc.)

    — Cost paid and MSRP (suggested retail price)

    ● Payments

    — Customers make payments on average 2-3 weeks after they place an order.— In some cases one payment covers more than 1 order.

    Pentaho Training Data: Tables

    ● Orders

    — 2560 orders, which span the period from 1/1/2000 to 12/31/2007

  • 8/17/2019 Pd i 2000 Lectures

    111/623

    , p p

    — Each in a given state: In Process, Shipped, Cancelled, Disputed,Resolved, or On Hold.

    ● OrderDetails

     

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 111

    — r er ne ems re ec nego a e pr ce an quan y per pro uc

    — Training data has 23,640 OrderDetails

    pentaho_oltp

    Database Schema OLTP (Source Database)Database schema is in $PENTAHO_TRAINING/data/models

  • 8/17/2019 Pd i 2000 Lectures

    112/623

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 112

    Database Schema OLAP (Target Database)Database schema is in $PENTAHO_TRAINING/data/models

  • 8/17/2019 Pd i 2000 Lectures

    113/623

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 113

  • 8/17/2019 Pd i 2000 Lectures

    114/623

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com.

    Data Warehouse Steps

    Target Database Design (your Data Warehouse)

    ● Prior to defining mappings from sources, the target database must bedesigned.

  • 8/17/2019 Pd i 2000 Lectures

    115/623

    g

    ● Staging tables and/or file format designs are often identical to source

    format to simplify extract processing.● Target star schemas must be designed to meet analytical processing

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 115

     nee s as we as eas ty o oa rom ent e source ata.

    Source to Target Mapping

    ● Identifies how each target table/column will be populated from the

    sources.

  • 8/17/2019 Pd i 2000 Lectures

    116/623

    ● Includes details of the following:

    — Source table/column or how the value is otherwise derived

    — Data types/lengths and any format Transformation

    — Special cleansing or Transformation logic

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 116

    — Exception handling

    ● This document will aid in creation of the actual programmingspecifications for the ETL developers or in creation of instructions

    (technical meta data) for the ETL tool.

    Dimensional Design in your Data WarehouseBest Practice Analytical Database Design (Kimball, et al.)

    EmployeeProduct

    Geography

  • 8/17/2019 Pd i 2000 Lectures

    117/623

    Product

    Sales Facts

    Customer  Time

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 117

    Fact Table Dimension Table

    ● A fact table contains items that you wantto measure. For example:

    — Revenue

    — Amount sold

    — Average price

    ● Metrics are the values you are trying to

    report.

    ● Dimensions are the ways that you want tolook at the data. For example:

    — By customer

    — By date

    — By product

    ● Dimensions provide context in reports

    (grouping, labels, filters, etc.).

    Dimensional models are often called Star-Schemas.

    Why Fact- and Dimension tables?

    ● The idea behind this separation is that you have a large fact table with

    much smaller dimension tables joined to it.

  • 8/17/2019 Pd i 2000 Lectures

    118/623

    ● Fact tables are huge compared with the dimensions. They are usually

    huge compared with anything you have had in an OLTP database (e.g.

    when you store historical data or add data from external sources).

     

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 118

     

    with a speedy access.

    ● Fact and dimension tables should be joined by integers.

    Slowly Changing Dimensions● In a dimension table the value of an attribute may change over time.

    ● You may need to reference the old and new value.

  • 8/17/2019 Pd i 2000 Lectures

    119/623

    ● Examples:

    — You have a product dimension with the price as an attribute. The

    price changes over time. You may want to calculate the profit forcertain periods by the product price for a certain time. Then you

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 119

      .

    — You have a customer dimension with the location (e.g. defined by

    the zip code) as an attribute. Customers can relocate and getanother zip code. The geographic organized sales team madebudgets for their customers and want to compare this with theactual. What happens when a customer relocates into another sales

    region?● You can solve this easy by storing your dimensions with the concept of

    Slowly Changing Dimensions.

    Slowly Changing Dimensions

    ● ‘Type 1’ Dimension:

    — New information overwrites the old information.— Old information is not saved, it is lost.

  • 8/17/2019 Pd i 2000 Lectures

    120/623

    O ,— Can only be used in applications where maintaining a chronicle of data is not

    essential; used for update only.

    ● ‘Type 2’ Dimension:— New information is appended to the old information.—  

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 120

      , .— Can be used in applications where maintaining a chronicle of data is

    required so that changes in a data warehouse can be tracked.● ‘Type 3’ Dimension:

    — New information is saved alongside the old information.— Old information is partially saved.

    — Additional columns are created to show the time from which the newinformation has taken effect.— Enables view of facts in both current state and “what-if” past states of

    dimensional values

    Dimension Update / Lookup

    ● Implements slowly changing dimensions: Type 1 and Type 2

    ● Can be used for updating a dimension table and for looking up values in a

    di i L k if f d h d /i

  • 8/17/2019 Pd i 2000 Lectures

    121/623

    dimension. Lookup, if not found, then update/insert.

    ● In dimension implementation each entry in the dimension table has the

    following fields:

    — Technical key: This is the primary (surrogate) key of the dimension.

     

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 121

    — Version field: Shows the version of the dimension entry (a revision number).

    — Start of date range: This is the fieldname containing validity starting date.

    — End of date range: This is the fieldname containing the validity ending date.

    — Keys: These are business keys used in source systems such as customer no,product id, etc. These are used for lookup functionality.

    — Fields: These fields contain the actual information of a dimension and can be

    set individually to update all versions or to add a new version when a newvalue appears.

    Example for Slowly Changing Dimensions

    ● We create a new dimension for the product from the pentaho_oltp

    database into the pentaho_olap database.

  • 8/17/2019 Pd i 2000 Lectures

    122/623

    OLAPOLTP

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 122

    Example for Slowly Changing Dimensions

    ● The dimension step added the fields:

    — productid: The technical key

    version: The version of the dimension row as a reference

  • 8/17/2019 Pd i 2000 Lectures

    123/623

    — version: The version of the dimension row as a reference

    — date_from, date_to: The valid date range for the actual dimension

    row.

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 123

  • 8/17/2019 Pd i 2000 Lectures

    124/623

    Example for Slowly Changing Dimensions

    ● The data within our dim_product table looks like this:

  • 8/17/2019 Pd i 2000 Lectures

    125/623

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 125

    ● Look at the first line: This is the default row returned when you lookup

    a dimension and the key is not found. This row (for not found entries) is

    created automatically by PDI with null values.

    Example for Slowly Changing Dimensions

    ● Now we have a price change for two products:

  • 8/17/2019 Pd i 2000 Lectures

    126/623

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 126

    Example for Slowly Changing Dimensions

    ● The price change came from a price file where the other information

    like productname failed, so they are missing in the new versions:

  • 8/17/2019 Pd i 2000 Lectures

    127/623

     

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 127

    ● We can look it up before the dimension load.

    Example for Slowly Changing Dimensions

    ● You have to uncheck the ‘Update the dimension?’ box.

  • 8/17/2019 Pd i 2000 Lectures

    128/623

    ● And enter the fields you want to retrieve (mind the change of the

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 128

    columns, especially ‘Type of return field’ instead of ‘Type of dimension

    update’):

    Example for Slowly Changing Dimensions

    ● The handling of the Stream Datefield.

  • 8/17/2019 Pd i 2000 Lectures

    129/623

    ● When the date field is empty, the actual date and time for lookup and

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 129

    new inserts is used.

    ● When you have a date field with a ‘valid from’ date, you can use thishere.

    Example for Slowly Changing Dimensions

    ● Example for a different date used.

  • 8/17/2019 Pd i 2000 Lectures

    130/623

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 130

    Example for Slowly Changing Dimensions

    ● If you do a lookup of a dimension and want to include the effective

    version and the date ranges of your found row, add them to fields:

  • 8/17/2019 Pd i 2000 Lectures

    131/623

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 131

  • 8/17/2019 Pd i 2000 Lectures

    132/623

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com.

    Lookups

    Lookups

    ● The Lookup feature of PDI accesses a data source to find values per adefined matching criteria, i.e. key.

    ● The following steps have lookup functionality in PDI:

  • 8/17/2019 Pd i 2000 Lectures

    133/623

    — Commonly Used

    • Database Lookup• Stream Lookup

     

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 133

    • Merge Join

    — Others• Database Join

    • Call database procedure

    • Dimension Update/Lookup

    • Combination Update/Lookup• HTTP Lookup

    Database Lookup

    ● Lookup attributes from a single table based on a key-matching criteria

    ● Options for performing database lookup include:

    — Lookup table: The name of the table where the lookup is done

  • 8/17/2019 Pd i 2000 Lectures

    134/623

    Lookup table: The name of the table where the lookup is done.

    — Enable cache: This option caches database lookups for the duration ofthe Transformation.

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 134

    • Ena ing t is option can increase per ormance.

    • Danger: If other processes are changing values in the table do notset this option.

    — Load all data from table: Preload the complete data in memory at the

    initialization phase. This can replace a Stream Lookup step incombination with a Table Input step and is faster.

    Database Lookup (cont)

    SELECT

  • 8/17/2019 Pd i 2000 Lectures

    135/623

    ATTRIB1 as ‘FullName’

    FROM

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 135

     _

    WHERE

    ID =

  • 8/17/2019 Pd i 2000 Lectures

    136/623

    Stream Lookup (Example)

    ● The following Transformation add information coming from a text-file todata coming from a database table:

  • 8/17/2019 Pd i 2000 Lectures

    137/623

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. •US and Worldwide: +1 (866) 660-7555 | Slide 137

    ● ‘B’ is the source step. It is where the in-memory lookup stream resides.

    Merge Join

    ● Takes TWO sorted streams and performs a traditional JOIN on EQUALITY

    — INNER = Only output a row when the key is in both streams

    — LEFT OUTER = Output a row even if there is no matching key in 2nd

    Step

  • 8/17/2019 Pd i 2000 Lectures

    138/623

    Step

    — RIGHT OUTER = Output a row even if there is no matching key in 1

    st

    Step

    — FULL OUTER = Output a row regardless of matching

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 138

     

    ● Options provided by PDI GUI for merge join include:

    — First Step: Step to refer to as the 1st

    — Second Step: Step to refer to as the 2nd

    — Keys for 1st: The key fields from the 1st Stream

    — Keys for 2nd: The key fields from the 1st Stream

    — Join Type: The key fields from the 1st Stream

    Merge Join (cont)

  • 8/17/2019 Pd i 2000 Lectures

    139/623

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 139

    FULLOUTER

    Database Join

    ● Options provided by Kettle GUI for database join procedure include:

    — SQL: The SQL query to launch towards the database.

    — Number of rows to return: 0 means all, any other number limits the

  • 8/17/2019 Pd i 2000 Lectures

    140/623

    number of rows.

    — Outer join?: When checked, will always return a single record for eachinput stream record, even if the query did not return a result.

     

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 140

    ● The parameters to use in the query.

    —Parameters noted as ? in the query—Order of fields in parameter list must match the order of the ? in the

    query.

    Call Database Procedure

    ● Executes a database procedure (or function) and gets the result(s) back.

    ● Options for call database procedure include:

    — Proc-name: Name of the procedure or function to call.

  • 8/17/2019 Pd i 2000 Lectures

    141/623

    — Enable auto-commit: This can be used to perform updates in the

    database using a specified procedure. The user can either have thechanges done using auto-commit or by disabling this. If auto-commit is

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 141

    disabled, a single commit is performed after the last row is received

    by this step.

    — Result name:

    • When calling a database function this field needed.

    • When calling a database procedure this field must not be

    entered.

    Call Database Procedure (cont)

    ● Result type: Type of result of function call. Not used in case of a

    procedure.

    ● Parameters: List of parameters that the procedure or function needs:

  • 8/17/2019 Pd i 2000 Lectures

    142/623

    p p

    ● Field name: Name of the field.● Direction: Can be either IN (input only), OUT (output only), INOUT (value

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 142

    s c ange on e a a ase .

    ● Type: Used for output parameters so that Kettle knows what comes back.

    Call Database Procedure (cont)

    ● Other buttons available in this step are:

    —“Find it...” button: Searches the specified database connection for

    available procedures and functions.

  • 8/17/2019 Pd i 2000 Lectures

    143/623

    —“Get fields” button: Fills in all the fields in the input streams to make

    the process easier by deleting the lines that are not needed and re-

     

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 143

      .

    Dimension Lookups

    ● Uses the same dimension step used for updating● Same fields, same setup

    ● Stream Datefield – stream_date between EFFECT and EXPIRE

  • 8/17/2019 Pd i 2000 Lectures

    144/623

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 144

    HTTP Lookup

    ● Covered in Web Service module

  • 8/17/2019 Pd i 2000 Lectures

    145/623

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 145

  • 8/17/2019 Pd i 2000 Lectures

    146/623

     

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com.

    Field Transformations Part 1

    Field Transformations

    ● Field Transformations are steps that operate at the field level within a

    stream record.● The step types covered in this section include:

    — Select ValuesCalculator

  • 8/17/2019 Pd i 2000 Lectures

    147/623

    — Calculator

    — Add Constants— Null If

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 147

    Select Values

    ● This step type is used to:

    — Select/remove fields from the process stream.— Rename fields— Specify/change the length and/or precision of fields.

    ● 3 Tabs are provided:

  • 8/17/2019 Pd i 2000 Lectures

    148/623

    ● 3 Tabs are provided:

    — Select and Alter: Specify the exact order and name in which the fieldshave to be placed in the output rows.

    —  

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 148

     rows.

    — Meta-data: Change the name, type, length and precision (the meta-data) of one or more fields.

    Select Values (cont)

    ● Options provided for this step include:

    — Step name: Name of the step. This name has to be unique in a singleTransformation.

    — Attributes that can be changed for a given field:• Fieldname: The fieldname to select or change

  • 8/17/2019 Pd i 2000 Lectures

    149/623

    • Rename to: To be left blank if rename not required.• Length: Number has to be entered to specify the length (-1: no length

    specified). 

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 149

      -precision.

    Calculator

    ● Provides a list of functions that can be executed on field values.

    ● An important advantage Calculator has over custom JavaScript scripts is

    that the execution speed of Calculator is many times that of a script.

    B id th t (Fi ld A Fi ld B d Fi ld C) th l d

  • 8/17/2019 Pd i 2000 Lectures

    150/623

    ● Besides the arguments (Field A, Field B and Field C) the user also needs

    to specify the return type of the function.

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 150

     

    values were calculated. This is useful for removing temporary values.

    Calculator (cont)

    ● The list of functions supported by

    the calculator includes commonlyused mathematical and datefunctions.

  • 8/17/2019 Pd i 2000 Lectures

    151/623

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 151

    Add Constants

    ● Adds constants to a stream.

    ● The use is very simple:

    — Specify the name

    — Enter value in the form of a string

  • 8/17/2019 Pd i 2000 Lectures

    152/623

    Enter value in the form of a string

    — Specify the formats to convert the value into the chosen data type.

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 152

    Null If

    ● If the string representation of a field is equal to a specified value, then

    the output value is set the null (empty).

  • 8/17/2019 Pd i 2000 Lectures

    153/623

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 153

  • 8/17/2019 Pd i 2000 Lectures

    154/623

     

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com.

    Set Transformations

    Set Transformations

    ● Set Transformations are steps that operate on the entire set of data

    within a stream.

    ● The operations operate across all rows and not strictly within a row

    ● The steps covered in the section include:

  • 8/17/2019 Pd i 2000 Lectures

    155/623

    ● The steps covered in the section include:

    — Filter Rows

    —  

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 155

     

    — Join Rows

    — Merge Rows— Unique Rows

    — Aggregate Rows

    — Group By

    Filter Rows● Filter rows based upon conditions and comparisons with full boolean logic

    supported.

    ● Output can be diverted into 2 streams: Records which pass (true) the condition

    and records which fail (false).

  • 8/17/2019 Pd i 2000 Lectures

    156/623

    ● Often used to:— Identify exceptions that must be written to a bad file

     

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 156

    — Branc trans ormation ogic i sing e source as two interpretations

    ● The options provided for this step include:

    — Send ‘true’ data to step: Which step receives those rows which pass the

    condition.

    — Send ‘false’ data to step: Which step receives those rows which fail the

    condition.

    Sort Rows

    ● Sort rows based upon specified fields, including sub sorts, in ascending

    or descending order.

    ● The options provided for this step include:

    — A list of fields and whether they should be sorted ascending or not.

  • 8/17/2019 Pd i 2000 Lectures

    157/623

    A list of fields and whether they should be sorted ascending or not.

    — Sort directory: This is the directory in which the temporary files are

    stored when needed. The default is the standard tem orar director

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 157

     

    for system.

    — Sort size: The more rows you can store in memory, the faster thesort. Eliminating need for temp files reduced costly disk I/O.

    — The TMP-file prefix: Choose a recognizable prefix to identify the

    files when they show up in the temp directory.

    Join Rows

    ● Produces combinations of all rows on the input streams.

    The options provided by PDI on this feature include:

    — Step name: Name of the step; name has to be unique.

    — Main step to read from: Specifies the step to read the most data

  • 8/17/2019 Pd i 2000 Lectures

    158/623

    Ma step to ead o : Spec es t e step to ead t e ost data

    from. This step is not cached or spooled to disk, the others are.

    — The condition: User can enter a com lex condition to limit the

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 158

     

    number of output rows. If empty, the result is a cartesian product.

    — Temp directory: Specify the name of the directory where the systemstores temporary files.

    — Temporary file prefix: This is the prefix of the temporary files that

    will be generated.

    — Max. cache size: The number of rows to cache before the systems

    reads data from temporary files.

    Merge Join

    ● The ‘Merge Join’ step performs a classic merge join between data sets

    with data coming from two different input steps. Join options include

    INNER, LEFT OUTER, RIGHT OUTER, and FULL OUTER.

    The options provided by PDI on this feature include:

  • 8/17/2019 Pd i 2000 Lectures

    159/623

    p p y

    — Step name: Name of the step; name has to be unique.

    —  

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 159

      .

    — Second Step: Specify the second input step to the merge join.

    — Join Type: INNER, LEFT OUTER, RIGHT OUTER, or FULL OUTER

    — Keys for 1st step: Specify the key fields on which the incoming data

    is sorted.

    — Keys for 2nd step: Specify the key fields on which the incoming datais sorted.

    Sorted Merge

    ● The ‘Sorted Merge’ step merges rows coming from multiple input steps

    providing these rows are sorted themselves on the given key fields.

    The options provided by PDI on this feature include:

    — Fields: Specify the fieldname and sort direction

  • 8/17/2019 Pd i 2000 Lectures

    160/623

    p y

    (ascending/descending).

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 160

    Merge Rows

    ● Compares and merges two streams of data

    — Reference Stream— Compare Stream

    ● Mostly used to identify deltas in source data when no timestamp is

  • 8/17/2019 Pd i 2000 Lectures

    161/623

    available— Reference Stream = The previously loaded data 

    • © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 161

    —  

    ● Usage note: Ensure streams are sorted by comparison key fields

    ● The output row is marked as follows:

    — “identical”: The key was found in both streams and the values tocompare were identical.

    — “changed”: The key was found in both streams but one or morevalues is different.

    — “new”: The key was not found in the reference stream.— “deleted”: The key was not found in the compare stream.

    Unique Rows● Removes duplicates from the input stream.

    ● Usage Note: Only consecutive records will be compared for duplicates, thus thestream must be sorted by comparison fields.

    ● The options provided for this step include:

  • 8/17/2019 Pd i 2000 Lectures

    162/623