DataStage Enterprise Edition

Page 1: DataStage EE

DataStage Enterprise Edition

Page 2: DataStage EE

Proposed Course Agenda

Day 1

– Review of EE Concepts

– Sequential Access

– Best Practices

– DBMS as Source

Day 2

– EE Architecture

– Transforming Data

– DBMS as Target

– Sorting Data

Day 3

– Combining Data

– Configuration Files

– Extending EE

– Meta Data in EE

Day 4

– Job Sequencing

– Testing and Debugging

Page 3: DataStage EE

The Course Material

Course Manual

Online Help

Exercise Files and Exercise Guide

Page 4: DataStage EE

Using the Course Material

Suggestions for learning

– Take notes

– Review previous material

– Practice

– Learn from errors

Page 5: DataStage EE

Intro Part 1

Introduction to DataStage EE

Page 6: DataStage EE

What is DataStage?

Design jobs for Extraction, Transformation, and Loading (ETL)

Ideal tool for data integration projects – such as, data warehouses, data marts, and system migrations

Import, export, create, and manage metadata for use within jobs

Schedule, run, and monitor jobs all within DataStage

Administer your DataStage development and execution environments

Page 7: DataStage EE

DataStage Server and Clients

Page 8: DataStage EE

DataStage Administrator

Page 9: DataStage EE

Client Logon

Page 10: DataStage EE

DataStage Designer

Page 11: DataStage EE

DataStage Director

Page 12: DataStage EE

Developing in DataStage

Define global and project properties in Administrator

Import meta data into Designer

Build job in Designer

Compile in Designer

Validate, run, and monitor in Director

Page 13: DataStage EE

DataStage Projects

Page 14: DataStage EE

Quiz: True or False

DataStage Designer is used to build and compile your ETL jobs

Designer is used to execute your jobs after you build them

Director is used to execute your jobs after you build them

Administrator is used to set global and project properties

Page 15: DataStage EE

Intro Part 2

Configuring Projects

Page 16: DataStage EE

Module Objectives

After this module you will be able to:

– Explain how to create and delete projects

– Set project properties in Administrator

– Set EE global properties in Administrator

Page 17: DataStage EE

Project Properties

Projects can be created and deleted in Administrator

Project properties and defaults are set in Administrator

Page 18: DataStage EE

Setting Project Properties

To set project properties, log onto Administrator, select your project, and then click “Properties”

Page 19: DataStage EE

Licensing Tab

Page 20: DataStage EE

Projects General Tab

Page 21: DataStage EE

Environment Variables

Page 22: DataStage EE

Permissions Tab

Page 23: DataStage EE

Tracing Tab

Page 24: DataStage EE

Tunables Tab

Page 25: DataStage EE

Parallel Tab

Page 26: DataStage EE

Intro Part 3

Managing Meta Data

Page 27: DataStage EE

Module Objectives

After this module you will be able to:

– Describe the DataStage Designer components and functionality

– Import and export DataStage objects

– Import metadata for a sequential file

Page 28: DataStage EE

What Is Metadata?

[Diagram labels: Source, Transform, Target, Data, Meta Data, Meta Data Repository]

Page 29: DataStage EE

Repository Contents

Metadata describing sources and targets: Table definitions

DataStage objects: jobs, routines, table definitions, etc.

Page 30: DataStage EE

Import and Export

Any object in Repository can be exported to a file

Can export whole projects

Use for backup

Sometimes used for version control

Can be used to move DataStage objects from one project to another

Use to share DataStage jobs and projects with other developers

Page 31: DataStage EE

Export Procedure

In Designer, click “Export>DataStage Components”

Select DataStage objects for export

Specify type of export: DSX, XML

Specify file path on client machine

Page 32: DataStage EE

Quiz: True or False?

You can export DataStage objects such as jobs, but you can’t export metadata, such as field definitions of a sequential file.

Page 33: DataStage EE

Quiz: True or False?

The directory to which you export is on the DataStage client machine, not on the DataStage server machine.

Page 34: DataStage EE

Exporting DataStage Objects

Page 35: DataStage EE

Exporting DataStage Objects

Page 36: DataStage EE

Import Procedure

In Designer, click “Import>DataStage Components”

Select DataStage objects for import

Page 37: DataStage EE

Importing DataStage Objects

Page 38: DataStage EE

Import Options

Page 39: DataStage EE

Exercise

Import DataStage Component (table definition)

Page 40: DataStage EE

Metadata Import

Import format and column definitions from sequential files

Import relational table column definitions

Imported as “Table Definitions”

Table definitions can be loaded into job stages

Page 41: DataStage EE

Sequential File Import Procedure

In Designer, click Import>Table Definitions>Sequential File Definitions

Select directory containing sequential file and then the file

Select Manager category

Examine format and column definitions and edit if necessary

Page 42: DataStage EE

Repository Table Definition

Page 43: DataStage EE

Importing Sequential Metadata

Page 44: DataStage EE

Intro Part 4

Designing and Documenting Jobs

Page 45: DataStage EE

Module Objectives

After this module you will be able to:

– Describe what a DataStage job is

– List the steps involved in creating a job

– Describe links and stages

– Identify the different types of stages

– Design a simple extraction and load job

– Compile your job

– Create parameters to make your job flexible

– Document your job

Page 46: DataStage EE

What Is a Job?

Executable DataStage program

Created in DataStage Designer, but can use components from Repository

Built using a graphical user interface

Compiles into Orchestrate shell language (OSH)

Page 47: DataStage EE

Job Development Overview

In Designer, import metadata defining sources and targets

In Designer, add stages defining data extractions and loads

Add Transformers and other stages to define data transformations

Add links defining the flow of data from sources to targets

Compile the job

In Director, validate, run, and monitor your job

Page 48: DataStage EE

Designer Work Area

Page 49: DataStage EE

Designer Toolbar

Provides quick access to the main functions of Designer

Job properties

Compile

Show/hide metadata markers

Page 50: DataStage EE

Tools Palette

Page 51: DataStage EE

Adding Stages and Links

Stages can be dragged from the tools palette or from the stage type branch of the repository view

Links can be drawn from the tools palette or by right clicking and dragging from one stage to another

Page 52: DataStage EE

Sequential File Stage

Used to extract data from, or load data to, a sequential file

Specify full path to the file

Specify a file format: fixed width or delimited

Specify column definitions

Specify write action

Page 53: DataStage EE

Job Creation Example Sequence

Brief walkthrough of procedure

Presumes meta data already loaded in repository

Page 54: DataStage EE

Designer - Create New Job

Page 55: DataStage EE

Drag Stages and Links Using Palette

Page 56: DataStage EE

Assign Meta Data

Page 57: DataStage EE

Editing a Sequential Source Stage

Page 58: DataStage EE

Editing a Sequential Target

Page 59: DataStage EE

Transformer Stage

Used to define constraints, derivations, and column mappings

A column mapping maps an input column to an output column

In this module we will just define column mappings (no derivations)

Page 60: DataStage EE

Transformer Stage Elements

Page 61: DataStage EE

Create Column Mappings

Page 62: DataStage EE

Creating Stage Variables

Page 63: DataStage EE

Result

Page 64: DataStage EE

Adding Job Parameters

Makes the job more flexible

Parameters can be:

– Used in constraints and derivations

– Used in directory and file names

Parameter values are determined at run time

Page 65: DataStage EE

Adding Job Documentation

Job Properties

– Short and long descriptions

– Shows in Manager

Annotation stage

– Is a stage on the tool palette

– Shows on the job GUI (work area)

Page 66: DataStage EE

Job Properties Documentation

Page 67: DataStage EE

Annotation Stage on the Palette

Page 68: DataStage EE

Annotation Stage Properties

Page 69: DataStage EE

Final Job Work Area with Documentation

Page 70: DataStage EE

Compiling a Job

Page 71: DataStage EE

Errors or Successful Message

Page 72: DataStage EE

Intro Part 5

Running Jobs

Page 73: DataStage EE

Module Objectives

After this module you will be able to:

– Validate your job

– Use DataStage Director to run your job

– Set run options

– Monitor your job’s progress

– View job log messages

Page 74: DataStage EE

Prerequisite to Job Execution

Result from Designer compile

Page 75: DataStage EE

DataStage Director

Can schedule, validate, and run jobs

Can be invoked from DataStage Manager or Designer

– Tools > Run Director

Page 76: DataStage EE

Running Your Job

Page 77: DataStage EE

Run Options – Parameters and Limits

Page 78: DataStage EE

Director Log View

Page 79: DataStage EE

Message Details are Available

Page 80: DataStage EE

Other Director Functions

Schedule job to run on a particular date/time

Clear job log

Set Director options

– Row limits

– Abort after x warnings

Page 81: DataStage EE

Module 1

DSEE – DataStage EE

Review

Page 82: DataStage EE

Ascential’s Enterprise Data Integration Platform

[Diagram: ANY SOURCE (CRM, ERP, SCM, RDBMS, legacy, real-time, client-server, Web services, data warehouse, other apps) to ANY TARGET (CRM, ERP, SCM, BI/Analytics, RDBMS, real-time, client-server, Web services, data warehouse, other apps)]

DISCOVER (Data Profiling): Gather relevant information for target enterprise applications

PREPARE (Data Quality): Cleanse, correct and match input data

TRANSFORM (Extract, Transform, Load): Standardize and enrich data and load to targets

Platform services: Command & Control, Meta Data Management, Parallel Execution

Page 83: DataStage EE

Course Objectives

You will learn to:

– Build DataStage EE jobs using complex logic

– Utilize parallel processing techniques to increase job performance

– Build custom stages based on application needs

Course emphasis is:

– Advanced usage of DataStage EE

– Application job development

– Best practices techniques

Page 84: DataStage EE

Course Agenda

Day 1

– Review of EE Concepts

– Sequential Access

– Standards

– DBMS Access

Day 2

– EE Architecture

– Transforming Data

– Sorting Data

Day 3

– Combining Data

– Configuration Files

Day 4

– Extending EE

– Meta Data Usage

– Job Control

– Testing

Page 85: DataStage EE

Module Objectives

Provide a background for completing work in the DSEE course

Tasks

– Review concepts covered in DSEE Essentials course

Skip this module if you recently completed the DataStage EE essentials modules

Page 86: DataStage EE

Review Topics

DataStage architecture

DataStage client review

– Administrator

– Manager

– Designer

– Director

Parallel processing paradigm

DataStage Enterprise Edition (DSEE)

Page 87: DataStage EE

Client-Server Architecture

[Diagram: the clients (Designer, Director, Manager, Administrator) run on Microsoft® Windows NT/2000/XP; the server and its Repository run on Microsoft® Windows NT or UNIX. Platform functions shown: Discover, Prepare, Transform, Extend (Extract, Cleanse, Transform, Integrate), supported by Parallel Execution, Meta Data Management, and Command & Control, connecting ANY SOURCE to ANY TARGET (CRM, ERP, SCM, BI/Analytics, RDBMS, real-time, client-server, Web services, data warehouse, other apps)]

Page 88: DataStage EE

Process Flow

Administrator – add/delete projects, set defaults

Manager – import meta data, backup projects

Designer – assemble jobs, compile, and execute

Director – execute jobs, examine job run logs

Page 89: DataStage EE

Administrator – Licensing and Timeout

Page 90: DataStage EE

Administrator – Project Creation/Removal

Functions specific to a project.

Page 91: DataStage EE

Administrator – Project Properties

RCP for parallel jobs should be enabled

Variables for parallel processing

Page 92: DataStage EE

Administrator – Environment Variables

Variables are category specific

Page 93: DataStage EE

OSH is what is run by the EE Framework

Page 94: DataStage EE

DataStage Manager

Page 95: DataStage EE

Export Objects to MetaStage

Push meta data to MetaStage

Page 96: DataStage EE

Designer Workspace

Can execute the job from Designer

Page 97: DataStage EE

DataStage Generated OSH

The EE Framework runs OSH

Page 98: DataStage EE

Director – Executing Jobs

Messages from previous run in different color

Page 99: DataStage EE

Stages

Can now customize the Designer’s palette

Select desired stages and drag to favorites

Page 100: DataStage EE

Popular Developer Stages

Row generator

Peek

Page 101: DataStage EE

Row Generator

Can build test data

Repeatable property

Edit row in column tab

Page 102: DataStage EE

Peek

Displays field values

– Will be displayed in job log or sent to a file

– Skip records option

– Can control number of records to be displayed

Can be used as stub stage for iterative development (more later)

Page 103: DataStage EE

Why EE is so Effective

Parallel processing paradigm

– More hardware, faster processing

– Level of parallelization is determined by a configuration file read at runtime

Emphasis on memory

– Data read into memory and lookups performed like hash table

Page 104: DataStage EE

Parallel Processing Systems

DataStage EE enables parallel processing: executing your application on multiple CPUs simultaneously

– If you add more resources (CPUs, RAM, and disks) you increase system performance

• Example system containing 6 CPUs (or processing nodes) and disks

[Diagram: six processing nodes, numbered 1 to 6]

Page 105: DataStage EE

Scalable Systems: Examples

Three main types of scalable systems

Symmetric Multiprocessors (SMP): shared memory and disk

Clusters: UNIX systems connected via networks

MPP: Massively Parallel Processing

Page 106: DataStage EE

SMP: Shared Everything

• Multiple CPUs with a single operating system

• Programs communicate using shared memory

• All CPUs share system resources (OS, memory with single linear address space, disks, I/O)

When used with Enterprise Edition:

• Data transport uses shared memory

• Simplified startup

[Diagram: four CPUs]

Enterprise Edition treats NUMA (Non-Uniform Memory Access) as plain SMP

Page 107: DataStage EE

Traditional Batch Processing

[Diagram: Source (Operational Data, Archived Data), Transform, Clean, Load, Target (Data Warehouse), with data landed to Disk between steps]

Traditional approach to batch processing:

• Write to disk and read from disk before each processing operation

• Sub-optimal utilization of resources

– a 10 GB stream leads to 70 GB of I/O

– processing resources can sit idle during I/O

• Very complex to manage (lots and lots of small jobs)

• Becomes impractical with big data volumes

– disk I/O consumes the processing

– terabytes of disk required for temporary staging

Page 108: DataStage EE

Pipeline Multiprocessing

Data Pipelining

• Transform, clean and load processes are executing simultaneously on the same processor

• Rows are moving forward through the flow

[Diagram: Source (Operational Data, Archived Data), Transform, Clean, Load, Target (Data Warehouse)]

• Start a downstream process while an upstream process is still running.

• This eliminates intermediate storing to disk, which is critical for big data.

• This also keeps the processors busy.

• Still has limits on scalability

Think of a conveyor belt moving the rows from process to process!
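
The conveyor-belt idea can be sketched with Python generators. This is only an illustration, not how the EE Framework is implemented: the stage functions and the sample rows are hypothetical, and generators interleave work on a single thread rather than running stages on separate processes. What the sketch does show is that each row moves through transform, clean, and load without an intermediate copy of the whole data set being staged.

```python
def read_source(rows):
    """Source stage: hand rows downstream one at a time."""
    for row in rows:
        yield row


def transform(rows):
    """Transform stage: upper-case the name as each row arrives."""
    for row in rows:
        yield {**row, "name": row["name"].upper()}


def clean(rows):
    """Clean stage: drop rows whose name is blank."""
    for row in rows:
        if row["name"].strip():
            yield row


def load(rows):
    """Target stage: collect rows (stands in for a warehouse load)."""
    return list(rows)


# Hypothetical sample data.
source = [{"name": "abbott"}, {"name": " "}, {"name": "zorn"}]

# Stages are chained; no stage waits for the previous one to finish the
# whole input, and nothing is written to disk in between.
loaded = load(clean(transform(read_source(source))))
print(loaded)
```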

Page 109: DataStage EE

Partition Parallelism

Data Partitioning

[Diagram: Source Data is split by last name into four partitions (A-F, G-M, N-T, U-Z); each partition goes through the same Transform on its own node, Node 1 to Node 4]

• Break up big data into partitions

• Run one partition on each processor

• 4 times faster on 4 processors; with data big enough, 100 times faster on 100 processors

• This is exactly how the parallel databases work!

• Data partitioning requires the same transform on all partitions: Aaron Abbott and Zygmund Zorn undergo the same transform
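
A minimal sketch of the A-F / G-M / N-T / U-Z split, assuming a hypothetical `last_name` column and a made-up transform. It runs the partitions one after another in a single process, so it shows only the routing and the "same transform on every partition" rule, not the speed-up.

```python
# Letter ranges taken from the slide; one range per node.
RANGES = [("A", "F"), ("G", "M"), ("N", "T"), ("U", "Z")]


def partition_by_last_name(rows, ranges=RANGES):
    """Route each row to the partition whose letter range covers its initial."""
    partitions = [[] for _ in ranges]
    for row in rows:
        initial = row["last_name"][0].upper()
        for node, (low, high) in enumerate(ranges):
            if low <= initial <= high:
                partitions[node].append(row)
                break
        else:
            raise ValueError(f"no partition for {row['last_name']!r}")
    return partitions


def transform(row):
    """The one transform applied identically on every partition."""
    return {**row, "last_name": row["last_name"].title()}


# Hypothetical sample rows.
rows = [{"last_name": n} for n in ["abbott", "zorn", "garcia", "nguyen", "baker"]]

partitions = partition_by_last_name(rows)
results = [[transform(r) for r in part] for part in partitions]
names_per_node = [[r["last_name"] for r in part] for part in results]
print(names_per_node)
```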

Page 110: DataStage EE

Combining Parallelism Types

Putting It All Together: Parallel Dataflow

[Diagram: Source Data is partitioned, and each partition is pipelined through Transform, Clean, and Load into the Data Warehouse (Target)]

Page 111: DataStage EE

Repartitioning

Putting It All Together: Parallel Dataflow with Repartitioning on-the-fly

Without Landing To Disk!

[Diagram: Source Data is partitioned and pipelined through Transform, Clean, and Load into the Data Warehouse (Target), repartitioning between stages: by customer last name (A-F, G-M, N-T, U-Z), then by customer zip code, then by credit card number]
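
Repartitioning can be pictured as re-routing rows from one set of in-memory partitions to another on a different key. The sketch below assumes a hash method and hypothetical column names (`last_name`, `zip`); the Framework's own partitioning methods are not shown here. A CRC is used instead of Python's built-in `hash()` so the routing is stable between runs.

```python
import zlib


def hash_partition(rows, key, node_count):
    """Send each row to a node chosen by a stable hash of its key column."""
    partitions = [[] for _ in range(node_count)]
    for row in rows:
        node = zlib.crc32(str(row[key]).encode("utf-8")) % node_count
        partitions[node].append(row)
    return partitions


def repartition(partitions, key, node_count):
    """Re-route rows from existing partitions by a new key, all in memory."""
    flowing_rows = (row for part in partitions for row in part)
    return hash_partition(flowing_rows, key, node_count)


# Hypothetical rows, currently partitioned by last-name range.
by_name = [
    [{"last_name": "Abbott", "zip": "02139"}, {"last_name": "Baker", "zip": "94105"}],
    [{"last_name": "Garcia", "zip": "02139"}],
    [{"last_name": "Nguyen", "zip": "94105"}],
    [{"last_name": "Zorn", "zip": "02139"}],
]

# The next stage needs all rows for one zip code on the same node.
by_zip = repartition(by_name, "zip", 4)
```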

Page 112: DataStage EE

EE Program Elements

• Dataset: uniform set of rows in the Framework's internal representation

– Three flavors:

1. file sets (*.fs): stored on multiple Unix files as flat files

2. persistent (*.ds): stored on multiple Unix files in Framework format; read and written using the DataSet Stage

3. virtual (*.v): links, in Framework format, NOT stored on disk

– The Framework processes only datasets, hence the possible need for Import

– Different datasets typically have different schemas

– Convention: "dataset" = Framework data set.

• Partition: subset of rows in a dataset earmarked for processing by the same node (virtual CPU, declared in a configuration file).

– All the partitions of a dataset follow the same schema: that of the dataset

Page 113: DataStage EE

DataStage EE Architecture

DataStage: Provides data integration platform

Orchestrate Framework: Provides application scalability

DataStage Enterprise Edition: Best-of-breed scalable data integration platform. No limitations on data volumes or throughput

[Diagram: an Orchestrate Program (a sequential data flow: Import, Clean 1, Clean 2, Merge, Analyze) reading Flat Files and Relational Data is combined with a Configuration File by the Orchestrate Application Framework and Runtime System, which runs the same flow in parallel. The Framework provides: Centralized Error Handling and Event Logging; Parallel access to data in files; Parallel access to data in RDBMS; Inter-node communications; Parallel pipelining; Parallelization of operations; Performance Visualization]

Page 114: DataStage EE

Introduction to DataStage EE

DSEE:

– Automatically scales to fit the machine

– Handles data flow among multiple CPUs and disks

With DSEE you can:

– Create applications for SMPs, clusters and MPPs…

Enterprise Edition is architecture-neutral

– Access relational databases in parallel

– Execute external applications in parallel

– Store data across multiple disks and nodes

Page 115: DataStage EE

Job Design vs. Execution

Developer assembles data flow using the Designer

…and gets: parallel access, propagation, transformation, and load.

The design is good for 1 node, 4 nodes, or N nodes. To change the number of nodes, just swap the configuration file.

No need to modify or recompile the design
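
The separation between design and execution can be sketched as a function that takes its degree of parallelism from a configuration object. The `config` dictionaries here are hypothetical stand-ins and do not reflect the real configuration file format; the point is only that `run_job` itself is unchanged between the 1-node and 4-node runs.

```python
def run_job(rows, config):
    """One job design; the node count comes only from the configuration."""
    node_count = len(config["nodes"])
    # Deal rows out to the configured nodes.
    partitions = [rows[i::node_count] for i in range(node_count)]
    # Same logic on every partition, then combine the partial results.
    partial_sums = [sum(r["amount"] for r in part) for part in partitions]
    return sum(partial_sums)


# Hypothetical configurations: swapping one for the other changes the
# degree of parallelism without touching run_job.
one_node = {"nodes": ["node1"]}
four_nodes = {"nodes": ["node1", "node2", "node3", "node4"]}

rows = [{"amount": n} for n in range(1, 11)]
total_on_one = run_job(rows, one_node)
total_on_four = run_job(rows, four_nodes)
```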

Page 116: DataStage EE

Partitioners and Collectors

Partitioners distribute rows into partitions

– implement data-partition parallelism

Collectors = inverse partitioners

Live on input links of stages running

– in parallel (partitioners)

– sequentially (collectors)

Use a choice of methods
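
To make "collectors = inverse partitioners" concrete, here is a small sketch using round robin as the illustrative method on both sides. The function names are hypothetical; other methods would route and gather rows differently.

```python
from itertools import zip_longest

_MISSING = object()  # marks the gaps left by shorter partitions


def round_robin_partition(rows, node_count):
    """Partitioner: deal rows out to the nodes in turn."""
    partitions = [[] for _ in range(node_count)]
    for i, row in enumerate(rows):
        partitions[i % node_count].append(row)
    return partitions


def round_robin_collect(partitions):
    """Collector: take one row from each partition in turn, into one stream."""
    collected = []
    for group in zip_longest(*partitions, fillvalue=_MISSING):
        collected.extend(row for row in group if row is not _MISSING)
    return collected


rows = list(range(10))
parts = round_robin_partition(rows, 3)   # feeds a stage running in parallel
stream = round_robin_collect(parts)      # feeds a stage running sequentially
```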

Page 117: DataStage EE

Example Partitioning Icons

partitioner

Page 118: DataStage EE

Exercise

Complete exercises 1-1 and 1-2, and 1-3

Page 119: DataStage EE

Module 2

DSEE Sequential Access

Page 120: DataStage EE

Module Objectives

You will learn to:

– Import sequential files into the EE Framework

– Utilize parallel processing techniques to increase sequential file access

– Understand usage of the Sequential, DataSet, FileSet, and LookupFileSet stages

– Manage partitioned data stored by the Framework

Page 121: DataStage EE

Types of Sequential Data Stages

Sequential

– Fixed or variable length

File Set

Lookup File Set

Data Set

Page 122: DataStage EE

The EE Framework processes only datasets

For files other than datasets, such as flat files, Enterprise Edition must perform import and export operations – this is performed by import and export OSH operators generated by Sequential or FileSet stages

During import or export DataStage performs format translations – into, or out of, the EE internal format

Data is described to the Framework in a schema

Sequential Stage Introduction

Page 123: DataStage EE

How the Sequential Stage Works

Generates Import/Export operators, depending on whether stage is source or target

Performs direct C++ file I/O streams

Page 124: DataStage EE

Using the Sequential File Stage

Importing/Exporting Data

Both import and export of general files (text, binary) are performed by the Sequential File Stage.

– Data import: from file into EE internal format

– Data export: from EE internal format out to file

Page 125: DataStage EE

Working With Flat Files

Sequential File Stage

– Normally will execute in sequential mode

– Can be parallel if reading multiple files (file pattern option)

– Can use multiple readers within a node

– DSEE needs to know:

• How file is divided into rows

• How row is divided into columns

Page 126: DataStage EE

Processes Needed to Import Data

Recordization

– Divides input stream into records

– Set on the format tab

Columnization

– Divides the record into columns

– Default set on the format tab but can be overridden on the columns tab

– Can be “incomplete” if using a schema or not even specified in the stage if using RCP
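
The two import steps can be sketched as two small functions. This is a simplified illustration for delimited text only; the function names, the sample stream, and the column names are hypothetical, and quoting, fixed-width layouts, and type conversion are left out.

```python
def recordize(stream, record_delimiter="\n"):
    """Recordization: divide the input stream into records."""
    return [rec for rec in stream.split(record_delimiter) if rec]


def columnize(record, column_names, field_delimiter=",", final_delimiter="end"):
    """Columnization: divide one record into named columns.

    final_delimiter="comma" means the last field is also followed by the
    field delimiter, which is stripped before splitting.
    """
    if final_delimiter == "comma":
        if not record.endswith(field_delimiter):
            raise ValueError("expected a trailing field delimiter")
        record = record[: -len(field_delimiter)]
    values = record.split(field_delimiter)
    if len(values) != len(column_names):
        raise ValueError("record does not match the column definitions")
    return dict(zip(column_names, values))


# Hypothetical comma-delimited file content with newline record delimiters.
stream = "101,widget,4\n102,gadget,7\n"
columns = ["partno", "description", "qty"]
rows = [columnize(rec, columns) for rec in recordize(stream)]
```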

Page 127: DataStage EE

File Format Example

[Diagram: a record is a sequence of fields separated by the field delimiter (a comma) and closed by the record delimiter (nl). With Final Delimiter = end, the record finishes "Last field nl"; with Final Delimiter = comma, it finishes "Last field , nl"]

Page 128: DataStage EE

Sequential File Stage

To set the properties, use stage editor

– Page (general, input/output)

– Tabs (format, columns)

Sequential stage link rules

– One input link

– One output link (except for reject link definition)

– One reject link: will reject any records not matching meta data in the column definitions

Page 129: DataStage EE

Job Design Using Sequential Stages

Stage categories

Page 130: DataStage EE

General Tab – Sequential Source

Multiple output links

Show records

Page 131: DataStage EE

Properties – Multiple Files

Click to add more files having the same meta data.

Page 132: DataStage EE

Properties - Multiple Readers

Multiple readers option allows you to set number of readers

Page 133: DataStage EE

Format Tab

File into records Record into columns

Page 134: DataStage EE

Read Methods

Page 135: DataStage EE

Reject Link

Reject mode = output

Source

– All records not matching the meta data (the column definitions)

Target

– All records that are rejected for any reason

Meta data – one column, datatype = raw
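
A rough sketch of a source-side reject link, assuming comma-delimited input and hypothetical column definitions. Records that do not match the definitions are diverted untouched as raw bytes, mirroring the single raw column carried by the reject link.

```python
def read_with_reject(lines, columns):
    """Split input into an output stream and a reject stream.

    columns is a list of (name, converter) pairs standing in for the
    stage's column definitions.
    """
    output, rejects = [], []
    for line in lines:
        values = line.split(",")
        try:
            if len(values) != len(columns):
                raise ValueError("wrong number of columns")
            row = {name: convert(v) for (name, convert), v in zip(columns, values)}
        except ValueError:
            # Reject link: the whole record, unparsed, as raw bytes.
            rejects.append(line.encode("utf-8"))
        else:
            output.append(row)
    return output, rejects


# Hypothetical column definitions and input records.
columns = [("partno", int), ("description", str)]
lines = ["101,widget", "abc,gadget", "103"]
output, rejects = read_with_reject(lines, columns)
```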

Page 136: DataStage EE

File Set Stage

Can read or write file sets

Files suffixed by .fs

File set consists of:

1. Descriptor file – contains location of raw data files + meta data

2. Individual raw data files

Can be processed in parallel

Page 137: DataStage EE

File Set Stage Example

Descriptor file

Page 138: DataStage EE

File Set Usage

Why use a file set?

– 2 GB limit on some file systems

– Need to distribute data among nodes to prevent overruns

– If used in parallel, runs faster than a sequential file

Page 139: DataStage EE

Lookup File Set Stage

Can create file sets

Usually used in conjunction with Lookup stages

Page 140: DataStage EE

Lookup File Set > Properties

Key column specified

Key column dropped in descriptor file

Page 141: DataStage EE

Data Set

Operating system (Framework) file

Suffixed by .ds

Referred to by a control file

Managed by Data Set Management utility from GUI (Manager, Designer, Director)

Represents persistent data

Key to good performance in set of linked jobs

Page 142: DataStage EE

Persistent Datasets

Accessed from/to disk with DataSet Stage.

Two parts:

– Descriptor file: contains metadata, data location, but NOT the data itself

– Data file(s): contain the data; multiple Unix files (one per node), accessible in parallel

[Diagram: the descriptor file input.ds holds the schema, record ( partno: int32; description: string; ), and points to the data files at node1:/local/disk1/… and node2:/local/disk2/…]
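
The descriptor-plus-data-files layout can be imitated in a few lines. This is purely an illustration of the two-part idea: the JSON encoding, the file names, and both functions are assumptions made for the sketch, and the real `.ds` files use the Framework's own internal format.

```python
import json
import os
import tempfile


def write_dataset(descriptor_path, schema, partitions, data_dir):
    """Write one data file per partition, then a descriptor that holds the
    schema and the data file locations but none of the rows."""
    data_files = []
    for node, rows in enumerate(partitions, start=1):
        path = os.path.join(data_dir, f"part.node{node}.jsonl")
        with open(path, "w", encoding="utf-8") as f:
            for row in rows:
                f.write(json.dumps(row) + "\n")
        data_files.append(path)
    with open(descriptor_path, "w", encoding="utf-8") as f:
        json.dump({"schema": schema, "data_files": data_files}, f)


def read_dataset(descriptor_path):
    """Follow the descriptor to the data files; partitioning is preserved."""
    with open(descriptor_path, encoding="utf-8") as f:
        descriptor = json.load(f)
    partitions = []
    for path in descriptor["data_files"]:
        with open(path, encoding="utf-8") as f:
            partitions.append([json.loads(line) for line in f])
    return descriptor["schema"], partitions


with tempfile.TemporaryDirectory() as tmp:
    schema = {"partno": "int32", "description": "string"}
    parts = [
        [{"partno": 1, "description": "bolt"}],
        [{"partno": 2, "description": "nut"}],
    ]
    descriptor = os.path.join(tmp, "input.ds")
    write_dataset(descriptor, schema, parts, tmp)
    read_schema, read_parts = read_dataset(descriptor)
```

Because the rows come back already split per node, a downstream job can pick them up without repartitioning, which is one reason persistent data sets suit sets of linked jobs.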

Page 143: DataStage EE

Quiz!

• True or False? Everything that has been data-partitioned must be collected in same job

Page 144: DataStage EE

Data Set Stage

Is the data partitioned?

Page 145: DataStage EE

Engine Data Translation

Occurs on import

– From sequential files or file sets

– From RDBMS

Occurs on export

– From datasets to file sets or sequential files

– From datasets to RDBMS

Engine most efficient when processing internally formatted records (i.e. data contained in datasets)

Page 146: DataStage EE

Managing DataSets

GUI (Manager, Designer, Director) – tools > data set management

Alternative methods

– Orchadmin: Unix command line utility

• List records

• Remove data sets (will remove all components)

– Dsrecords: lists number of records in a dataset

Page 147: DataStage EE

Data Set Management

Display data

Schema

Page 148: DataStage EE

Data Set Management From Unix

Alternative method of managing file sets and data sets

– Dsrecords: gives record count

• Unix command-line utility

• $ dsrecords ds_name

• E.g. $ dsrecords myDS.ds

156999 records

– Orchadmin: manages EE persistent data sets

• Unix command-line utility

• E.g. $ orchadmin rm myDataSet.ds

Page 149: DataStage EE

Exercise

Complete exercises 2-1, 2-2, 2-3, and 2-4.

Page 150: DataStage EE

Module 3

Standards and Techniques

Page 151: DataStage EE

Objectives

Establish standard techniques for DSEE development

Will cover:

– Job documentation

– Naming conventions for jobs, links, and stages

– Iterative job design

– Useful stages for job development

– Using configuration files for development

– Using environmental variables

– Job parameters

Page 152: DataStage EE

Job Presentation

Document using the annotation stage

Page 153: DataStage EE

Job Properties Documentation

Description shows in DS Manager and MetaStage

Organize jobs into categories

Page 154: DataStage EE

Naming conventions

Stages named after the

– Data they access

– Function they perform

– DO NOT leave defaulted stage names like Sequential_File_0

Links named for the data they carry

– DO NOT leave defaulted link names like DSLink3

Page 155: DataStage EE

Stage and Link Names

Stages and links renamed to data they handle

Page 156: DataStage EE

Create Reusable Job Components

Use Enterprise Edition shared containers when feasible

Container

Page 157: DataStage EE

Use Iterative Job Design

Use copy or peek stage as stub

Test job in phases – small first, then increasing in complexity

Use Peek stage to examine records

Page 158: DataStage EE

Copy or Peek Stage Stub

Copy stage

Page 159: DataStage EE

Transformer Stage Techniques

Suggestions:

– Always include reject link.

– Always test for null value before using a column in a function.

– Try to use RCP and only map columns that have a derivation other than a copy. More on RCP later.

– Be aware of Column and Stage variable Data Types. Often user does not pay attention to Stage Variable type.

– Avoid type conversions. Try to maintain the data type as imported.

Page 160: DataStage EE

The Copy Stage

With 1 link in, 1 link out:

the Copy Stage is the ultimate "no-op" (place-holder):

– Partitioners

– Sort / Remove Duplicates

– Rename, Drop column

… can be inserted on:

– input link (Partitioning): Partitioners, Sort, Remove Duplicates

– output link (Mapping page): Rename, Drop.

Sometimes replace the transformer:

– Rename,

– Drop,

– Implicit type Conversions

– Link Constraint – break up schema

Page 161: DataStage EE

Developing Jobs

1. Keep it simple

• Jobs with many stages are hard to debug and maintain.

2. Start small and build to final solution

• Use view data, copy, and peek.

• Start from source and work out.

• Develop with a 1 node configuration file.

3. Solve the business problem before the performance problem.

• Don’t worry too much about partitioning until the sequential flow works as expected.

4. If you have to write to disk, use a persistent data set.

Page 162: DataStage EE

Final Result

Page 163: DataStage EE

Good Things to Have in each Job

Use job parameters

Some helpful environmental variables to add to job parameters

– $APT_DUMP_SCORE: Report OSH to message log

– $APT_CONFIG_FILE: Establishes runtime parameters to EE engine; i.e. degree of parallelization

Page 164: DataStage EE

Setting Job Parameters

Click to add environment variables

Page 165: DataStage EE

DUMP SCORE Output

Setting APT_DUMP_SCORE yields:

[Screenshot callouts: Double-click; Partitioner and Collector; Mapping Node --> partition]

Page 166: DataStage EE

Use Multiple Configuration Files

Make a set for 1X, 2X,….

Use different ones for test versus production

Include as a parameter in each job

Page 167: DataStage EE

Exercise

Complete exercise 3-1

Page 168: DataStage EE

Module 4

DBMS Access

Page 169: DataStage EE

Objectives

Understand how DSEE reads and writes records to an RDBMS

Understand how to handle nulls on DBMS lookup

Utilize this knowledge to:

– Read and write database tables

– Use database tables to lookup data

– Use null handling options to clean data

Page 170: DataStage EE

Parallel Database Connectivity

[Diagram: in the traditional client-server model, each Client holds a single connection to the Parallel RDBMS; in the Enterprise Edition model, parallel application processes such as Sort and Load each connect to the Parallel RDBMS]

Traditional Client-Server:

– Only RDBMS is running in parallel

– Each application has only one connection

– Suitable only for small data volumes

Enterprise Edition:

– Parallel server runs APPLICATIONS

– Application has parallel connections to RDBMS

– Suitable for large data volumes

– Higher levels of integration possible

Page 171: DataStage EE

RDBMS Access: Supported Databases

Enterprise Edition provides high performance / scalable interfaces for:

DB2

Informix

Oracle

Teradata

Page 172: DataStage EE

RDBMS Access

Automatically convert RDBMS table layouts to/from Enterprise Edition Table Definitions

RDBMS nulls converted to/from nullable field values

Support for standard SQL syntax for specifying:

– field list for SELECT statement

– filter for WHERE clause

Can write an explicit SQL query to access RDBMS

EE supplies additional information in the SQL query

Page 173: DataStage EE

RDBMS Stages

DB2/UDB Enterprise

Informix Enterprise

Oracle Enterprise

Teradata Enterprise

Page 174: DataStage EE

RDBMS Usage

As a source

– Extract data from table (stream link)

• Extract as table, generated SQL, or user-defined SQL

• User-defined can perform joins, access views

– Lookup (reference link)

• Normal lookup is memory-based (all table data read into memory)

• Can perform one lookup at a time in DBMS (sparse option)

• Continue/drop/fail options

As a target

– Inserts

– Upserts (Inserts and updates)

– Loader
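
The difference between a normal (memory-based) lookup and a sparse lookup, and the continue/drop/fail choices on a miss, can be sketched against an in-memory SQLite table. Everything here is an assumption made for illustration: the `customers` table, the function names, and the use of SQLite in place of an enterprise RDBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (custid INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Abbott"), (2, "Zorn")])


def _resolve(row, name, on_miss):
    """Apply the continue/drop/fail choice when no reference row was found."""
    if name is not None:
        return {**row, "name": name}
    if on_miss == "continue":
        return {**row, "name": None}   # row passes on with a null
    if on_miss == "drop":
        return None                    # row is discarded
    raise LookupError(f"no reference row for custid {row['custid']}")


def normal_lookup(rows, conn, on_miss="continue"):
    """Read the whole reference table into memory once, then probe it."""
    table = dict(conn.execute("SELECT custid, name FROM customers"))
    resolved = (_resolve(r, table.get(r["custid"]), on_miss) for r in rows)
    return [r for r in resolved if r is not None]


def sparse_lookup(rows, conn, on_miss="continue"):
    """Issue one query to the database per input row."""
    out = []
    for r in rows:
        hit = conn.execute(
            "SELECT name FROM customers WHERE custid = ?", (r["custid"],)
        ).fetchone()
        resolved = _resolve(r, hit[0] if hit else None, on_miss)
        if resolved is not None:
            out.append(resolved)
    return out


rows = [{"custid": 1}, {"custid": 3}]
normal_result = normal_lookup(rows, conn)
sparse_result = sparse_lookup(rows, conn)
```

Both functions return the same rows; the trade-off is memory for the full table against one database round trip per input row.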

Page 175: DataStage EE

RDBMS Source – Stream Link

Stream link

Page 176: DataStage EE

DBMS Source - User-defined SQL

Columns in SQL statement must match the meta data in columns tab

Page 177: DataStage EE

Exercise

User-defined SQL

– Exercise 4-1

Page 178: DataStage EE

DBMS Source – Reference Link

Reject link

Page 179: DataStage EE

Lookup Reject Link

“Output” option automatically creates the reject link

Page 180: DataStage EE

Null Handling

Must handle null condition if lookup record is not found and “continue” option is chosen

Can be done in a transformer stage

Page 181: DataStage EE

Lookup Stage Mapping

Link name

Page 182: DataStage EE

Lookup Stage Properties

Reference link

Must have same column name in input and reference links.

You will get the results of the lookup in the output column.

Page 183: DataStage EE

DBMS as a Target

Page 184: DataStage EE

DBMS As Target

Write Methods

– Delete

– Load

– Upsert

– Write (DB2)

Write mode for load method

– Truncate

– Create

– Replace

– Append
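
As a rough picture of what an upsert does, the sketch below tries an update first and inserts only when no row was updated. The update-then-insert order, the `customers` table, and the use of SQLite are assumptions for illustration; the SQL a stage actually generates depends on the chosen upsert mode and the target database.

```python
import sqlite3


def upsert(conn, rows):
    """Update each row if its key exists, otherwise insert it."""
    for custid, name in rows:
        cur = conn.execute(
            "UPDATE customers SET name = ? WHERE custid = ?", (name, custid)
        )
        if cur.rowcount == 0:  # no existing row matched the key
            conn.execute(
                "INSERT INTO customers (custid, name) VALUES (?, ?)",
                (custid, name),
            )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (custid INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Abbott"), (2, "Zorn")])

# custid 1 already exists (update); custid 3 does not (insert).
upsert(conn, [(1, "Abbott-Smith"), (3, "Garcia")])
final_rows = conn.execute(
    "SELECT custid, name FROM customers ORDER BY custid"
).fetchall()
```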

Page 185: DataStage EE

Target Properties

Upsert mode determines options

Generated code can be copied

Page 186: DataStage EE

Checking for Nulls

Use Transformer stage to test for fields with null values (Use IsNull functions)

In Transformer, can reject or load default value

Page 187: DataStage EE

Exercise

Complete exercise 4-2

Page 188: DataStage EE

Module 5

Platform Architecture

Page 189: DataStage EE

Objectives

Understand how Enterprise Edition Framework processes data

You will be able to:– Read and understand OSH

– Perform troubleshooting

Page 190: DataStage EE

Concepts

The Enterprise Edition Platform

– Script language - OSH (generated by DataStage Parallel Canvas, and run by DataStage Director)

– Communication - conductor, section leaders, players

– Configuration files (only one active at a time, describes H/W)

– Meta data - schemas/tables

– Schema propagation - RCP

– EE extensibility - Buildop, Wrapper

– Datasets (data in Framework's internal representation)

Page 191: DataStage EE

Output Data Set schema:prov_num:int16;member_num:int8;custid:int32;

Input Data Set schema:prov_num:int16;member_num:int8;custid:int32;

EE Stages Involve A Series Of Processing Steps

Input Interface

Partitioner

Business Logic

Output Interface

EE Stage

• Piece of Application Logic Running Against Individual Records

• Parallel or Sequential

DS-EE Stage Elements

Page 192: DataStage EE

• EE Delivers Parallelism in Two Ways

– Pipeline

– Partition

• Block Buffering Between Components

– Eliminates Need for Program Load Balancing

– Maintains Orderly Data Flow

Pipeline

Partition

Dual Parallelism Eliminates Bottlenecks!

Producer

Consumer

DSEE Stage Execution

Page 193: DataStage EE

Stages Control Partition Parallelism

Execution Mode (sequential/parallel) is controlled by Stage

– default = parallel for most Ascential-supplied Stages

– Developer can override default mode

– Parallel Stage inserts the default partitioner (Auto) on its input links

– Sequential Stage inserts the default collector (Auto) on its input links

– Developer can override default:

execution mode (parallel/sequential) of Stage > Advanced tab

choice of partitioner/collector on Input > Partitioning tab

Page 194: DataStage EE

How Parallel Is It?

Degree of parallelism is determined by the configuration file

– Total number of logical nodes in default pool, or a subset if using "constraints".

Constraints are assigned to specific pools as defined in configuration file and can be referenced in the stage

Page 195: DataStage EE

OSH

DataStage EE GUI generates OSH scripts– Ability to view OSH turned on in Administrator

– OSH can be viewed in Designer using job properties

The Framework executes OSH

What is OSH?– Orchestrate shell

– Has a UNIX command-line interface

Page 196: DataStage EE

OSH Script

An osh script is a quoted string which specifies:

– The operators and connections of a single Orchestrate step

– In its simplest form, it is: osh "op < in.ds > out.ds"

Where:

– op is an Orchestrate operator

– in.ds is the input data set

– out.ds is the output data set

Page 197: DataStage EE

OSH Operators

OSH Operator is an instance of a C++ class inheriting from APT_Operator

Developers can create new operators

Examples of existing operators:– Import

– Export

– RemoveDups

Page 198: DataStage EE

Enable Visible OSH in Administrator

Will be enabled for all projects

Page 199: DataStage EE

View OSH in Designer

Schema

Operator

Page 200: DataStage EE

OSH Practice

Exercise 5-1 – Instructor demo (optional)

Page 201: DataStage EE

• Operators

• Datasets: set of rows processed by Framework

– Orchestrate data sets:

– persistent (terminal) *.ds, and

– virtual (internal) *.v.

– Also: flat “file sets” *.fs

• Schema: data description (metadata) for datasets and links.

Elements of a Framework Program

Page 202: DataStage EE

• Consist of Partitioned Data and Schema

• Can be Persistent (*.ds) or Virtual (*.v, Link)

• Overcome 2 GB File Limit

[Diagram: what you program (GUI) = what gets generated (OSH): $ osh "operator_A > x.ds" = what gets processed: Operator A running on Node 1, Node 2, Node 3, and Node 4, each writing data files of x.ds]

Multiple files per partition

Each file up to 2 GBytes (or larger)

Datasets

Page 203: DataStage EE

Computing Architectures: Definition

Uniprocessor (Dedicated Disk)

• PC

• Workstation

• Single processor server

SMP System (Symmetric Multiprocessor) (Shared Memory)

• IBM, Sun, HP, Compaq

• 2 to 64 processors

• Majority of installations

Clusters and MPP Systems (Shared Disk, Shared Nothing)

• 2 to hundreds of processors

• MPP: IBM and NCR Teradata

• each node is a uniprocessor or SMP

[Diagrams: CPU, memory, and disk layout for each architecture]

Page 204: DataStage EE

Job Execution: Orchestrate

[Diagram: Conductor Node (C) and two Processing Nodes, each with a Section Leader (SL) and Player processes (P)]

• Conductor - initial DS/EE process

– Step Composer

– Creates Section Leader processes (one/node)

– Consolidates messages, outputs them

– Manages orderly shutdown

• Section Leader

– Forks Player processes (one/Stage)

– Manages up/down communication

• Players

– The actual processes associated with Stages

– Combined players: one process only

– Send stderr to SL

– Establish connections to other players for data flow

– Clean up upon completion

• Communication:

– SMP: Shared Memory

– MPP: TCP

Page 205: DataStage EE

Working with Configuration Files

You can easily switch between config files:

'1-node' file - for sequential execution, lighter reports; handy for testing

'MedN-nodes' file - aims at a mix of pipeline and data-partitioned parallelism

'BigN-nodes' file - aims at full data-partitioned parallelism

Only one file is active while a step is running

The Framework queries (first) the environment variable: $APT_CONFIG_FILE

The number of nodes declared in the config file need not match the number of CPUs

The same configuration file can be used on development and target machines

Page 206: DataStage EE

Scheduling: Nodes, Processes, and CPUs

DS/EE does not:

– know how many CPUs are available

– schedule

Who does what?

– DS/EE creates (Nodes*Ops) Unix processes

– The O/S schedules these processes on the CPUs

Who knows what?

Nodes = # logical nodes declared in config. file

Ops = # ops. (approx. # blue boxes in V.O.)

Processes = # Unix processes

CPUs = # available CPUs

Nodes Ops Processes CPUs

User Y N

Orchestrate Y Y Nodes * Ops N

O/S " Y
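
The Nodes * Ops rule can be made concrete with a small calculation. This is a sketch under the slide's own simplification: it counts only player processes, and it assumes every operator runs in parallel on every logical node and that no operators are combined. The function name `process_count` is made up for the example.

```python
def process_count(logical_nodes, operators):
    """Approximate number of player processes: one per operator per logical node."""
    return logical_nodes * operators

# Example: a 4-node configuration file and a job with 5 operators.
players = process_count(4, 5)
```

With these numbers the framework would ask the operating system for about 20 processes, whether the machine has 2 CPUs or 32; placing them on CPUs is left to the O/S.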

Page 207: DataStage EE

{
  node "n1" {
    fastname "s1"
    pool "" "n1" "s1" "app2" "sort"
    resource disk "/orch/n1/d1" {}
    resource disk "/orch/n1/d2" {}
    resource scratchdisk "/temp" {"sort"}
  }
  node "n2" {
    fastname "s2"
    pool "" "n2" "s2" "app1"
    resource disk "/orch/n2/d1" {}
    resource disk "/orch/n2/d2" {}
    resource scratchdisk "/temp" {}
  }
  node "n3" {
    fastname "s3"
    pool "" "n3" "s3" "app1"
    resource disk "/orch/n3/d1" {}
    resource scratchdisk "/temp" {}
  }
  node "n4" {
    fastname "s4"
    pool "" "n4" "s4" "app1"
    resource disk "/orch/n4/d1" {}
    resource scratchdisk "/temp" {}
  }
}


Configuring DSEE – Node Pools
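
To see how node pools limit the degree of parallelism, the pool membership from the configuration file on this slide can be copied into a small Python model. This is a sketch for illustration only; the dictionary and the helper `nodes_in_pool` are not part of DataStage.

```python
# Pool membership copied from the slide's configuration file.
NODE_POOLS = {
    "n1": {"", "n1", "s1", "app2", "sort"},
    "n2": {"", "n2", "s2", "app1"},
    "n3": {"", "n3", "s3", "app1"},
    "n4": {"", "n4", "s4", "app1"},
}

def nodes_in_pool(pool=""):
    """Return the logical nodes that belong to the given pool ("" = default pool)."""
    return sorted(n for n, pools in NODE_POOLS.items() if pool in pools)

default_degree = len(nodes_in_pool())   # all four nodes are in the default pool
app1_nodes = nodes_in_pool("app1")      # nodes available under an "app1" constraint
sort_nodes = nodes_in_pool("sort")      # nodes available under a "sort" constraint
```

A stage with no constraint would run 4-way here, a stage constrained to "app1" would run on n2, n3, and n4, and a stage constrained to "sort" would run on n1 only.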

Page 208: DataStage EE

{
  node "n1" {
    fastname "s1"
    pool "" "n1" "s1" "app2" "sort"
    resource disk "/orch/n1/d1" {}
    resource disk "/orch/n1/d2" {"bigdata"}
    resource scratchdisk "/temp" {"sort"}
  }
  node "n2" {
    fastname "s2"
    pool "" "n2" "s2" "app1"
    resource disk "/orch/n2/d1" {}
    resource disk "/orch/n2/d2" {"bigdata"}
    resource scratchdisk "/temp" {}
  }
  node "n3" {
    fastname "s3"
    pool "" "n3" "s3" "app1"
    resource disk "/orch/n3/d1" {}
    resource scratchdisk "/temp" {}
  }
  node "n4" {
    fastname "s4"
    pool "" "n4" "s4" "app1"
    resource disk "/orch/n4/d1" {}
    resource scratchdisk "/temp" {}
  }
}


Configuring DSEE – Disk Pools

Page 209: DataStage EE

Parallel to parallel flow may incur reshuffling: records may jump between nodes

[Diagram: a partitioner between node 1 and node 2]

Re-Partitioning

Page 210: DataStage EE

Partitioning Methods

Auto

Hash

Entire

Range

Range Map
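
Two of these methods are easy to illustrate in Python. The sketch below is not the framework's implementation: `zlib.crc32` is used only because it is deterministic, and the function names and sample rows are invented for the example.

```python
import zlib

def hash_partition(rows, key, n_partitions):
    """Hash: rows with equal key values always land in the same partition."""
    parts = [[] for _ in range(n_partitions)]
    for row in rows:
        # crc32 is an assumption of this sketch, chosen for repeatable results.
        idx = zlib.crc32(str(row[key]).encode()) % n_partitions
        parts[idx].append(row)
    return parts

def entire_partition(rows, n_partitions):
    """Entire: every partition receives a full copy of the data."""
    return [list(rows) for _ in range(n_partitions)]

rows = [{"custid": c, "amt": a} for c, a in [(1, 5), (2, 7), (1, 9), (3, 4)]]
hashed = hash_partition(rows, "custid", 2)
copied = entire_partition(rows, 2)
```

Hash keeps all rows for one key together, which is what key-based stages need; Entire trades memory for the guarantee that every partition sees every row.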

Page 211: DataStage EE

• Collectors combine partitions of a dataset into a single input stream to a sequential Stage

data partitions

collector

sequential Stage

...

–Collectors do NOT synchronize data

Collectors

Page 212: DataStage EE

Partitioning and Repartitioning Are Visible On Job Design

Page 213: DataStage EE

Partitioning and Collecting Icons

Partitioner Collector

Page 214: DataStage EE

Setting a Node Constraint in the GUI

Page 215: DataStage EE

Reading Messages in Director

Set APT_DUMP_SCORE to true

Can be specified as job parameter

Messages sent to Director log

If set, parallel job will produce a report showing the operators, processes, and datasets in the running job

Page 216: DataStage EE

Messages With APT_DUMP_SCORE = True

Page 217: DataStage EE

Exercise

Complete exercise 5-2

Page 218: DataStage EE

Module 6

Transforming Data

Page 219: DataStage EE

Module Objectives

Understand ways DataStage allows you to transform data

Use this understanding to:

– Create column derivations using user-defined code or system functions

– Filter records based on business criteria

– Control data flow based on data conditions

Page 220: DataStage EE

Transformed Data

Transformed data is:

– An outgoing column whose derivation may, or may not, include incoming fields or parts of incoming fields

– May be composed of system variables

Frequently uses functions performed on something (e.g., incoming columns)

– Divided into categories, e.g.:

Date and time

Mathematical

Logical

Null handling

More

Page 221: DataStage EE

Stages Review

Stages that can transform data– Transformer

Parallel

Basic (from Parallel palette)

– Aggregator (discussed in later module)

Sample stages that do not transform data– Sequential

– FileSet

– DataSet

– DBMS

Page 222: DataStage EE

Transformer Stage Functions

Control data flow

Create derivations

Page 223: DataStage EE

Flow Control

Separate records flow down links based on data condition – specified in Transformer stage constraints

Transformer stage can filter records

Other stages can filter records but do not exhibit advanced flow control– Sequential can send bad records down reject link

– Lookup can reject records based on lookup failure

– Filter can select records based on data value

Page 224: DataStage EE

Rejecting Data

Reject option on sequential stage

– Data does not agree with meta data

– Output consists of one column with binary data type

Reject links (from Lookup stage) result from the drop option of the property “If Not Found”

– Lookup “failed”

– All columns on reject link (no column mapping option)

Reject constraints are controlled from the constraint editor of the transformer

– Can control column mapping

– Use the “Other/Log” checkbox

Page 225: DataStage EE

Rejecting Data Example

“If Not Found” property

Constraint – Other/log option

Property Reject Mode = Output

Page 226: DataStage EE

Transformer Stage Properties

Page 227: DataStage EE

Transformer Stage Variables

First of transformer stage entities to execute

Execute in order from top to bottom

– Can write a program by using one stage variable to point to the results of a previous stage variable

Multi-purpose

– Counters

– Hold values for previous rows to make comparison

– Hold derivations to be used in multiple field derivations

– Can be used to control execution of constraints
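
The behavior of stage variables can be sketched in Python. This is an analogy rather than Transformer code: ordinary local variables play the role of stage variables, and the names `run_transformer`, `is_new_key`, `counter`, and `prev_key` are invented. The input is assumed to be sorted on the key.

```python
def run_transformer(rows):
    """Emit each row with a per-key running count, as stage variables might."""
    # Stage variables keep their values from one row to the next.
    prev_key = None
    counter = 0
    out = []
    for row in rows:  # input assumed sorted by "key"
        # Evaluated top to bottom, before the output column derivations.
        is_new_key = row["key"] != prev_key          # compares with previous row
        counter = 1 if is_new_key else counter + 1   # uses the variable above
        prev_key = row["key"]                        # updated last, for the next row
        out.append({**row, "seq_in_group": counter, "first_in_group": is_new_key})
    return out

result = run_transformer([{"key": "A"}, {"key": "A"}, {"key": "B"}])
```

The order matters: if `prev_key` were updated before the comparison, every row would look like a repeat of itself.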

Page 228: DataStage EE

Stage Variables

Show/Hide button

Page 229: DataStage EE

Transforming Data

Derivations

– Using expressions

– Using functions

Date/time

Transformer Stage Issues

– Sometimes require sorting before the transformer stage, e.g., when using a stage variable as an accumulator and needing to break on change of column value

Checking for nulls

Page 230: DataStage EE

Checking for Nulls

Nulls can get introduced into the dataflow because of failed lookups and the way in which you chose to handle this condition

Can be handled in constraints, derivations, stage variables, or a combination of these

Page 231: DataStage EE

Transformer - Handling Rejects

Constraint Rejects

– All expressions are false and reject row is checked
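
The rule above can be sketched as a routing function. This is a simplified Python model, not Transformer code; the link names, the constraint expressions, and the function `route` are assumptions made for the example.

```python
def route(row, constraints):
    """Return the names of the output links that receive this row."""
    links = [name for name, test in constraints if test(row)]
    # The reject link fires only when every constraint expression is false.
    return links or ["reject"]

constraints = [
    ("high_value", lambda r: r["amount"] >= 1000),
    ("domestic", lambda r: r["country"] == "US"),
]

both = route({"amount": 1500, "country": "US"}, constraints)
neither = route({"amount": 10, "country": "FR"}, constraints)
```

A row can satisfy several constraints and flow down several links, but it reaches the reject link only when it satisfies none of them.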

Page 232: DataStage EE

Transformer: Execution Order

• Derivations in stage variables are executed first

• Constraints are executed before derivations

• Column derivations in earlier links are executed before later links

• Derivations in higher columns are executed before lower columns

Page 233: DataStage EE

Parallel Palette - Two Transformers

All > Processing >

Transformer

Is the non-Universe transformer

Has a specific set of functions

No DS routines available

Parallel > Processing

Basic Transformer

Makes server style transforms available on the parallel palette

Can use DS routines

• Program in Basic for both transformers

Page 234: DataStage EE

Transformer Functions From Derivation Editor

Date & Time

Logical

Null Handling

Number

String

Type Conversion

Page 235: DataStage EE

Exercise

Complete exercises 6-1, 6-2, and 6-3

Page 236: DataStage EE

Module 7

Sorting Data

Page 237: DataStage EE

Objectives

Understand DataStage EE sorting options

Use this understanding to create sorted list of data to enable functionality within a transformer stage

Page 238: DataStage EE

Sorting Data

Important because

– Some stages require sorted input

– Some stages may run faster, e.g., Aggregator

Can be performed

– As an option within stages (use Input > Partitioning tab and set partitioning to anything other than Auto)

– As a separate stage (more complex sorts)

Page 239: DataStage EE

Sorting Alternatives

• Alternative representation of same flow:

Page 240: DataStage EE

Sort Option on Stage Link

Page 241: DataStage EE

Sort Stage

Page 242: DataStage EE

Sort Utility

DataStage – the default

UNIX

Page 243: DataStage EE

Sort Stage - Outputs

Specifies how the output is derived

Page 244: DataStage EE

Sort Specification Options

Input Link Property– Limited functionality

– Max memory/partition is 20 MB, then spills to scratch

Sort Stage

– Tunable to use more memory before spilling to scratch

Note: Spread I/O by adding more scratch file systems to each node of the APT_CONFIG_FILE

Page 245: DataStage EE

Removing Duplicates

Can be done by Sort stage – Use unique option

OR

Remove Duplicates stage– Has more sophisticated ways to remove duplicates

Page 246: DataStage EE

Exercise

Complete exercise 7-1

Page 247: DataStage EE

Module 8

Combining Data

Page 248: DataStage EE

Objectives

Understand how DataStage can combine data using the Join, Lookup, Merge, and Aggregator stages

Use this understanding to create jobs that will– Combine data from separate input streams

– Aggregate data to form summary totals

Page 249: DataStage EE

Combining Data

There are two ways to combine data:

– Horizontally: Several input links; one output link (+ optional rejects) made of columns from different input links. E.g.,

Joins

Lookup

Merge

– Vertically: One input link, one output link with column combining values from all input rows. E.g.,

Aggregator

Page 250: DataStage EE

Join, Lookup & Merge Stages

These "three Stages" combine two or more input links according to values of user-designated "key" column(s).

They differ mainly in:– Memory usage

– Treatment of rows with unmatched key values

– Input requirements (sorted, de-duplicated)

Page 251: DataStage EE

Not all Links are Created Equal

• Enterprise Edition distinguishes between:

- The Primary Input (Framework port 0)

- Secondary - in some cases "Reference" (other ports)

• Naming convention:

                                 Joins   Lookup        Merge
Primary Input: port 0            Left    Source        Master
Secondary Input(s): ports 1,…    Right   LU Table(s)   Update(s)

Tip: Check "Input Ordering" tab to make sure intended Primary is listed first

Page 252: DataStage EE

Join Stage Editor

One of four variants:

– Inner

– Left Outer

– Right Outer

– Full Outer

Several key columns allowed

Link Order immaterial for Inner and Full Outer Joins (but VERY important for Left/Right Outer and Lookup and Merge)

Page 253: DataStage EE

1. The Join Stage

Four types:

• Inner

• Left Outer

• Right Outer

• Full Outer

2 sorted input links, 1 output link

– "left outer" on primary input, "right outer" on secondary input

– Pre-sort makes joins "lightweight": few rows need to be in RAM
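
Why pre-sorted inputs keep a join "lightweight" can be shown with a sort-merge inner join in Python. This is a generic textbook sketch, not the Join stage's actual code; `inner_join_sorted` and the sample rows are invented, and both inputs are assumed to be sorted on the key already.

```python
from itertools import groupby
from operator import itemgetter

def inner_join_sorted(left, right, key):
    """Inner-join two inputs that are already sorted on `key`."""
    get = itemgetter(key)
    lgroups = groupby(left, key=get)
    rgroups = groupby(right, key=get)
    lk, lg = next(lgroups, (None, None))
    rk, rg = next(rgroups, (None, None))
    out = []
    while lg is not None and rg is not None:
        if lk < rk:
            lk, lg = next(lgroups, (None, None))   # left key has no partner
        elif lk > rk:
            rk, rg = next(rgroups, (None, None))   # right key has no partner
        else:
            # Only the rows of the current key group are held in memory.
            rrows = list(rg)
            for lrow in lg:
                for rrow in rrows:                 # duplicates give a cross product
                    out.append({**lrow, **rrow})
            lk, lg = next(lgroups, (None, None))
            rk, rg = next(rgroups, (None, None))
    return out

left = [{"id": 1, "l": "a"}, {"id": 2, "l": "b"}, {"id": 2, "l": "c"}, {"id": 4, "l": "d"}]
right = [{"id": 2, "r": "x"}, {"id": 3, "r": "y"}, {"id": 4, "r": "z"}]
joined = inner_join_sorted(left, right, "id")
```

Both inputs are read once, front to back, and memory use depends on the size of one key group rather than on the size of either input.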

Page 254: DataStage EE

2. The Lookup Stage

Combines:

– one source link with

– one or more duplicate-free table links

no pre-sort necessary

allows multiple keys LUTs

flexible exception handling for source input rows with no match

[Diagram: Lookup stage with a source input (port 0) and one or more tables (LUTs) on ports 1, 2; outputs are Output (0) and Reject (1)]

Page 255: DataStage EE

The Lookup Stage

Lookup Tables should be small enough to fit into physical memory (otherwise, performance hit due to paging)

On an MPP you should partition the lookup tables using entire partitioning method, or partition them the same way you partition the source link

On an SMP, no physical duplication of a Lookup Table occurs

Page 256: DataStage EE

The Lookup Stage

Lookup File Set

– Like a persistent data set, only it contains metadata about the key

– Useful for staging lookup tables

RDBMS LOOKUP

– NORMAL: loads to an in-memory hash table first

– SPARSE: select for each row. Might become a performance bottleneck.
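
The difference between the two RDBMS lookup types can be sketched with Python's built-in `sqlite3` module. The in-memory database stands in for the real DBMS, and the table `region`, its columns, and the helper `sparse_lookup` are all made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE region (region_id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO region VALUES (?, ?)", [(10, "East"), (20, "West")])

source = [10, 20, 10, 99]

# Normal: read the whole reference table once into an in-memory hash table.
table = dict(conn.execute("SELECT region_id, name FROM region"))
normal = [table.get(k) for k in source]

# Sparse: issue one SELECT per source row.
def sparse_lookup(key):
    row = conn.execute(
        "SELECT name FROM region WHERE region_id = ?", (key,)
    ).fetchone()
    return row[0] if row else None

sparse = [sparse_lookup(k) for k in source]
```

Both approaches return the same answers. Normal costs memory proportional to the table; sparse costs one database round trip per source row, which is why it can become the bottleneck on large inputs.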

Page 257: DataStage EE

3. The Merge Stage

Combines

– one sorted, duplicate-free master (primary) link with

– one or more sorted update (secondary) links.

– Pre-sort makes merge "lightweight": few rows need to be in RAM (as with joins, but opposite to lookup).

Follows the Master-Update model:

– Master row and one or more update rows are merged if they have the same value in user-specified key column(s).

– A non-key column occurs in several inputs? The lowest input port number prevails (e.g., master over update; update values are ignored)

– Unmatched ("Bad") master rows can be either kept or dropped

– Unmatched ("Bad") update rows in input link can be captured in a "reject" link

– Matched update rows are consumed.
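
The Master-Update model can be sketched in Python for a single update link. Two simplifications are assumptions of this sketch: it uses a dictionary for brevity instead of streaming over sorted inputs, and it assumes the update link is duplicate-free. The function name `merge` and the sample columns are invented.

```python
def merge(master, updates, key, keep_unmatched_masters=True):
    """Master-update merge for one update link; inputs are lists of dicts."""
    pending = {u[key]: u for u in updates}    # assumes duplicate-free updates
    out = []
    for m in master:
        u = pending.pop(m[key], None)         # a matched update row is consumed
        if u is not None:
            out.append({**u, **m})            # master (port 0) wins on shared columns
        elif keep_unmatched_masters:
            out.append(dict(m))               # unmatched master: keep (or drop)
    rejects = list(pending.values())          # unmatched updates go to the reject link
    return out, rejects

master = [{"id": 1, "name": "Ann", "city": "X"}, {"id": 2, "name": "Bob", "city": "Y"}]
updates = [{"id": 1, "city": "Z", "phone": "555"}, {"id": 3, "city": "Q", "phone": "777"}]
merged, rejected = merge(master, updates, "id")
```

For id 1 the output keeps the master's city and gains the update's phone column; the update for id 3 matches no master row and is captured as a reject.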

Page 258: DataStage EE

The Merge Stage

Allows composite keys

Multiple update links

Matched update rows are consumed

Unmatched updates can be captured

Lightweight

Space/time tradeoff: presorts vs. in-RAM table

[Diagram: Merge stage with Master (port 0) and one or more updates (ports 1, 2) as inputs; outputs are Output (0) and Rejects (1, 2)]

Page 259: DataStage EE

Synopsis: Joins, Lookup, & Merge

                                   Joins                        Lookup                              Merge
Model                              RDBMS-style relational       Source - in RAM LU Table            Master - Update(s)
Memory usage                       light                        heavy                               light
# and names of Inputs              exactly 2: 1 left, 1 right   1 Source, N LU Tables               1 Master, N Update(s)
Mandatory Input Sort               both inputs                  no                                  all inputs
Duplicates in primary input        OK (x-product)               OK                                  Warning!
Duplicates in secondary input(s)   OK (x-product)               Warning!                            OK only when N = 1
Options on unmatched primary       NONE                         [fail] | continue | drop | reject   [keep] | drop
Options on unmatched secondary     NONE                         NONE                                capture in reject set(s)
On match, secondary entries are    reusable                     reusable                            consumed
# Outputs                          1                            1 out, (1 reject)                   1 out, (N rejects)
Captured in reject set(s)          Nothing (N/A)                unmatched primary entries           unmatched secondary entries

In this table:

• , <comma> = separator between primary and secondary input links (out and reject links)

Page 260: DataStage EE

The Aggregator Stage

Purpose: Perform data aggregations

Specify:

Zero or more key columns that define the aggregation units (or groups)

Columns to be aggregated

Aggregation functions: count (nulls/non-nulls), sum, max/min/range

The grouping method (hash table or pre-sort) is a performance issue

Page 261: DataStage EE

Grouping Methods

Hash: results for each aggregation group are stored in a hash table, and the table is written out after all input has been processed

– doesn’t require sorted data

– good when number of unique groups is small. The running tally for each group’s aggregate calculations needs to fit easily into memory. Requires about 1 KB of RAM per group.

– Example: average family income by state, requires .05MB of RAM

Sort: results for only a single aggregation group are kept in memory; when new group is seen (key value changes), current group written out.

– requires input sorted by grouping keys

– can handle unlimited numbers of groups

– Example: average daily balance by credit card
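
The two grouping methods can be contrasted in a short Python sketch. This is illustrative only: the function names and sample rows are invented, and the last line simply applies the slide's "about 1 KB per group" rule to 50 states.

```python
from collections import defaultdict
from itertools import groupby

def hash_aggregate(rows, key, value):
    """Hash method: one running tally per group stays in memory until the end."""
    totals = defaultdict(lambda: [0, 0.0])     # key -> [count, sum]
    for r in rows:
        totals[r[key]][0] += 1
        totals[r[key]][1] += r[value]
    return {k: s / n for k, (n, s) in totals.items()}

def sort_aggregate(sorted_rows, key, value):
    """Sort method: input sorted on key; only the current group's tally is kept."""
    for k, grp in groupby(sorted_rows, key=lambda r: r[key]):
        n, s = 0, 0.0
        for r in grp:
            n += 1
            s += r[value]
        yield k, s / n                         # emitted when the key changes

rows = [{"state": "NY", "income": 60.0}, {"state": "CA", "income": 80.0},
        {"state": "NY", "income": 40.0}]
by_hash = hash_aggregate(rows, "state", "income")
by_sort = dict(sort_aggregate(sorted(rows, key=lambda r: r["state"]), "state", "income"))

# Sizing rule from the slide: about 1 KB per group, so 50 states need roughly 50 KB.
approx_mb = 50 * 1 / 1024
```

Both methods give the same averages. The hash version's memory grows with the number of distinct groups, while the sort version's memory stays constant but depends on the input arriving in key order.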

Page 262: DataStage EE

Aggregator Functions

Sum

Min, max

Mean

Missing value count

Non-missing value count

Percent coefficient of variation

Page 263: DataStage EE

Aggregator Properties

Page 264: DataStage EE

Aggregation Types

Aggregation types

Page 265: DataStage EE

Containers

Two varieties– Local

– Shared

Local– Simplifies a large, complex diagram

Shared– Creates reusable object that many jobs can include

Page 266: DataStage EE

Creating a Container

Create a job

Select (loop) portions to containerize

Edit > Construct container > local or shared

Page 267: DataStage EE

Using a Container

Select as though it were a stage

Page 268: DataStage EE

Exercise

Complete exercise 8-1

Page 269: DataStage EE

Module 9

Configuration Files

Page 270: DataStage EE

Objectives

Understand how DataStage EE uses configuration files to determine parallel behavior

Use this understanding to

– Build an EE configuration file for a computer system

– Change node configurations to support adding resources to processes that need them

– Create a job that will change resource allocations at the stage level

Page 271: DataStage EE

Configuration File Concepts

Determine the processing nodes and disk space connected to each node

When system changes, need only change the configuration file – no need to recompile jobs

When DataStage job runs, platform reads configuration file

– Platform automatically scales the application to fit the system

Page 272: DataStage EE

Processing Nodes Are

Locations on which the framework runs applications

Logical rather than physical construct

Do not necessarily correspond to the number of CPUs in your system– Typically one node for two CPUs

Can define one processing node for multiple physical nodes or multiple processing nodes for one physical node

Page 273: DataStage EE

Optimizing Parallelism

Degree of parallelism determined by number of nodes defined

Parallelism should be optimized, not maximized

– Increasing parallelism distributes work load but also increases Framework overhead

Hardware influences degree of parallelism possible

System hardware partially determines configuration

Page 274: DataStage EE

More Factors to Consider

Communication amongst operators

– Should be optimized by your configuration

– Operators exchanging large amounts of data should be assigned to nodes communicating by shared memory or high-speed link

SMP – leave some processors for operating system

Desirable to equalize partitioning of data

Use an experimental approach

– Start with small data sets

– Try different parallelism while scaling up data set sizes

Page 275: DataStage EE

Factors Affecting Optimal Degree of Parallelism

CPU intensive applications

– Benefit from the greatest possible parallelism

Applications that are disk intensive

– Number of logical nodes equals the number of disk spindles being accessed

Page 276: DataStage EE

Configuration File

Text file containing string data that is passed to the Framework

– Sits on server side

– Can be displayed and edited

Name and location found in environment variable APT_CONFIG_FILE

Components

– Node

– Fast name

– Pools

– Resource

Page 277: DataStage EE

Node Options

Node name – name of a processing node used by EE

– Typically the network name

– Use command uname -n to obtain network name

Fastname

– Name of node as referred to by fastest network in the system

– Operators use physical node name to open connections

– NOTE: for SMP, all CPUs share single connection to network

Pools

– Names of pools to which this node is assigned

– Used to logically group nodes

– Can also be used to group resources

Resource

– Disk

– Scratchdisk

Page 278: DataStage EE

Sample Configuration File

{

node "Node1"

{

fastname "BlackHole"

pools "" "node1"

resource disk "/usr/dsadm/Ascential/DataStage/Datasets" {pools "" }

resource scratchdisk "/usr/dsadm/Ascential/DataStage/Scratch" {pools "" }

}

}
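
A file in this format is easy to generate for several logical nodes on one host. The Python sketch below simply repeats the layout of the sample above; the function `make_config`, its parameters, and the generated "NodeN" and "nodeN" names are assumptions for illustration, and the output should be checked against your own installation before use.

```python
def make_config(host, n_nodes, data_dir, scratch_dir):
    """Render a configuration file with n_nodes logical nodes on one host."""
    blocks = []
    for i in range(1, n_nodes + 1):
        blocks.append(
            f'  node "Node{i}"\n'
            "  {\n"
            f'    fastname "{host}"\n'
            f'    pools "" "node{i}"\n'
            f'    resource disk "{data_dir}" {{pools ""}}\n'
            f'    resource scratchdisk "{scratch_dir}" {{pools ""}}\n'
            "  }\n"
        )
    return "{\n" + "".join(blocks) + "}\n"

text = make_config(
    "BlackHole", 2,
    "/usr/dsadm/Ascential/DataStage/Datasets",
    "/usr/dsadm/Ascential/DataStage/Scratch",
)
```

Writing `text` to a file and pointing APT_CONFIG_FILE at it would switch a job from 1-way to 2-way execution without recompiling.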

Page 279: DataStage EE

Disk Pools

Disk pools allocate storage

By default, EE uses the default pool, specified by “”

pool "bigdata"

Page 280: DataStage EE

Sorting Requirements

Resource pools can also be specified for sorting:

The Sort stage looks first for scratch disk resources in a “sort” pool, and then in the default disk pool
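
The search order described above can be written as a small fallback rule. This Python sketch models each scratch disk as a (path, pools) pair in which an empty set means "default pool only"; the function name and the paths are hypothetical.

```python
def sort_scratch_disks(scratchdisks):
    """Pick scratch disks for sorting: prefer the "sort" pool, else the default pool."""
    in_sort_pool = [path for path, pools in scratchdisks if "sort" in pools]
    if in_sort_pool:
        return in_sort_pool
    # Fall back to scratch disks in the default pool.
    return [path for path, pools in scratchdisks if not pools or "" in pools]

with_sort_pool = sort_scratch_disks([("/scratch_sort", {"sort"}), ("/scratch", set())])
default_only = sort_scratch_disks([("/scratch", set())])
```

Reserving a fast file system for the "sort" pool therefore affects only sorting; other scratch usage stays on the default pool.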

Page 281: DataStage EE

{
  node "n1" {
    fastname "s1"
    pool "" "n1" "s1" "sort"
    resource disk "/data/n1/d1" {}
    resource disk "/data/n1/d2" {}
    resource scratchdisk "/scratch" {"sort"}
  }
  node "n2" {
    fastname "s2"
    pool "" "n2" "s2" "app1"
    resource disk "/data/n2/d1" {}
    resource scratchdisk "/scratch" {}
  }
  node "n3" {
    fastname "s3"
    pool "" "n3" "s3" "app1"
    resource disk "/data/n3/d1" {}
    resource scratchdisk "/scratch" {}
  }
  node "n4" {
    fastname "s4"
    pool "" "n4" "s4" "app1"
    resource disk "/data/n4/d1" {}
    resource scratchdisk "/scratch" {}
  }
  ...
}


Another Configuration File Example

Page 282: DataStage EE

Resource Types

Disk

Scratchdisk

DB2

Oracle

Saswork

Sortwork

Can exist in a pool– Groups resources together

Page 283: DataStage EE

Using Different Configurations

Lookup stage where DBMS is using a sparse lookup type

Page 284: DataStage EE

Building a Configuration File

Scoping the hardware:

– Is the hardware configuration SMP, Cluster, or MPP?

– Define each node structure (an SMP would be single node):

Number of CPUs

CPU speed

Available memory

Available page/swap space

Connectivity (network/back-panel speed)

– Is the machine dedicated to EE? If not, what other applications are running on it?

– Get a breakdown of the resource usage (vmstat, mpstat, iostat)

– Are there other configuration restrictions? E.g. DB only runs on certain nodes and ETL cannot run on them?

Page 285: DataStage EE

Exercise

Complete exercise 9-1 and 9-2

Page 286: DataStage EE

Module 10

Extending DataStage EE

Page 287: DataStage EE

Objectives

Understand the methods by which you can add functionality to EE

Use this understanding to:

– Build a DataStage EE stage that handles special processing needs not supplied with the vanilla stages

– Build a DataStage EE job that uses the new stage

Page 288: DataStage EE

EE Extensibility Overview

Sometimes it will be to your advantage to leverage EE’s extensibility. This extensibility includes:

Wrappers

Buildops

Custom Stages

Page 289: DataStage EE

When To Leverage EE Extensibility

Types of situations:

Complex business logic, not easily accomplished using standard EE stages

Reuse of existing C, C++, Java, COBOL, etc.

Page 290: DataStage EE

Wrappers vs. Buildop vs. Custom

Wrappers are good if you cannot or do not want to modify the application and performance is not critical.

Buildops: good if you need custom coding but do not need dynamic (runtime-based) input and output interfaces.

Custom (C++ coding using framework API): good if you need custom coding and need dynamic input and output interfaces.

Page 291: DataStage EE

Building “Wrapped” Stages

You can “wrapper” a legacy executable:

Binary

Unix command

Shell script

… and turn it into an Enterprise Edition stage capable, among other things, of parallel execution…

As long as the legacy executable is:

amenable to data-partition parallelism

no dependencies between rows

pipe-safe

can read rows sequentially

no random access to data

Page 292: DataStage EE

Wrappers (Cont’d)

Wrappers are treated as a black box

EE has no knowledge of contents

EE has no means of managing anything that occurs inside the wrapper

EE only knows how to export data to and import data from the wrapper

User must know at design time the intended behavior of the wrapper and its schema interface

If the wrappered application needs to see all records prior to processing, it cannot run in parallel.

Page 293: DataStage EE

LS Example

Can this command be wrappered?

Page 294: DataStage EE

Creating a Wrapper

Used in this job ---

To create the “ls” stage

Page 295: DataStage EE

Creating Wrapped Stages

From Manager: Right-Click on Stage Type > New Parallel Stage > Wrapped

We will “wrapper” an existing Unix executable – the ls command

Wrapper Starting Point

Page 296: DataStage EE

Wrapper - General Page

Unix command to be wrapped

Name of stage

Page 297: DataStage EE

Conscientiously maintaining the Creator page for all your wrapped stages will eventually earn you the thanks of others.

The "Creator" Page

Page 298: DataStage EE

Wrapper – Properties Page

If your stage will have properties appear, complete the Properties page

This will be the name of the property as it

appears in your stage

Page 299: DataStage EE

Wrapper - Wrapped Page

Interfaces – input and output columns - these should first be entered into the table definitions meta data (DS Manager); let’s do that now.

Page 300: DataStage EE

• Layout interfaces describe what columns the stage:

– Needs for its inputs (if any)

– Creates for its outputs (if any)

– Should be created as tables with columns in Manager

Interface schemas

Page 301: DataStage EE

Column Definition for Wrapper Interface

Page 302: DataStage EE

How Does the Wrapping Work?

– Define the schema for export and import

Schemas become interface schemas of the operator and allow for by-name column access

[Diagram: input schema → export → stdin or named pipe → UNIX executable → stdout or named pipe → import → output schema]

QUIZ: Why does export precede import?
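
The export, executable, import sequence can be imitated in Python with `subprocess`. This is a sketch of the data flow only, not of how the framework implements wrappers: the "legacy executable" is a stand-in one-line Python filter so the example runs anywhere, and `run_wrapped` and the single `name` column are assumptions.

```python
import subprocess
import sys

def run_wrapped(rows, command):
    """Export rows as text to the command's stdin, import rows from its stdout."""
    exported = "".join(f"{r['name']}\n" for r in rows)           # export step
    done = subprocess.run(command, input=exported, capture_output=True,
                          text=True, check=True)
    return [{"name": line} for line in done.stdout.splitlines()]  # import step

# Stand-in for a legacy executable: a tiny filter that upper-cases each line.
legacy = [sys.executable, "-c",
          "import sys\nfor line in sys.stdin: sys.stdout.write(line.upper())"]
result = run_wrapped([{"name": "alpha"}, {"name": "beta"}], legacy)
```

The rows must leave the framework's internal format before the executable can read them, and its text output must be parsed back into rows afterwards.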

Page 303: DataStage EE

Update the Wrapper Interfaces

This wrapper will have no input interface – i.e. no input link. The location will come as a job parameter that will be passed to the appropriate stage property. Therefore, only the Output tab entry is needed.

Page 304: DataStage EE

Resulting Job

Wrapped stage

Page 305: DataStage EE

Job Run

Show file from Designer palette

Page 306: DataStage EE

Wrapper Story: Cobol Application

Hardware Environment:

– IBM SP2, 2 nodes with 4 CPUs per node.

Software:

– DB2/EEE, COBOL, EE

Original COBOL Application:

– Extracted source table, performed lookup against table in DB2, and loaded results to target table.

– 4 hours 20 minutes sequential execution

Enterprise Edition Solution:

– Used EE to perform Parallel DB2 Extracts and Loads

– Used EE to execute COBOL application in Parallel

– EE Framework handled data transfer between DB2/EEE and COBOL application

– 30 minutes 8-way parallel execution

Page 307: DataStage EE

Buildops

Buildop provides a simple means of extending beyond the functionality provided by EE, but does not use an existing executable (like the wrapper)

Reasons to use Buildop include:

Speed / Performance

Complex business logic that cannot be easily represented using existing stages

– Lookups across a range of values

– Surrogate key generation

– Rolling aggregates

Build once and reusable everywhere within project, no shared container necessary

Can combine functionality from different stages into one

Page 308: DataStage EE

BuildOps

– The DataStage programmer encapsulates the business logic

– The Enterprise Edition interface called “buildop” automatically performs the tedious, error-prone tasks: invoke needed header files, build the necessary “plumbing” for a correct and efficient parallel execution.

– Exploits extensibility of EE Framework

Page 309: DataStage EE

From Manager (or Designer):

Repository pane: Right-Click on Stage Type > New Parallel Stage > {Custom | Build | Wrapped}

• "Build" stages from within Enterprise Edition

• "Wrapping” existing “Unix” executables

BuildOp Process Overview

Page 310: DataStage EE

General Page

Identical to Wrappers, except: Under the Build Tab, your program!

Page 311: DataStage EE

Logic Tab for Business Logic

Enter Business C/C++ logic and arithmetic in four pages under the Logic tab

Main code section goes in the Per-Record page; it will be applied to all rows

NOTE: Code will need to be ANSI C/C++ compliant. If code does not compile outside of EE, it won’t compile within EE either!

Page 312: DataStage EE

Code Sections under Logic Tab

Temporary variables declared [and initialized] here

Logic here is executed once BEFORE processing the FIRST row

Logic here is executed once AFTER processing the LAST row
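The order in which the code sections run can be pictured with a small standalone C++ sketch. It is not buildop code: the `Row` struct, `RunResult`, and `runStage` are hypothetical names, and the loop over rows is written out by hand here only to make visible what the framework does on the stage's behalf. The example computes a rolling total, one of the Buildop use cases named earlier.

```cpp
#include <vector>

// Hypothetical stand-in for one input row; a real buildop reads its
// columns from the Table Definition on the Interface tab.
struct Row { int amount; };

// What the simulated stage produced.
struct RunResult {
    std::vector<int> runningTotals;  // one value "written" per input row
    int rowsSeen;                    // filled in by the Post-Loop step
};

RunResult runStage(const std::vector<Row>& input) {
    // Definitions: temporary variables are declared here.
    int runningTotal;
    int rowCount;

    // Pre-Loop: executed once, before the first row is processed.
    runningTotal = 0;
    rowCount = 0;

    RunResult result{{}, 0};

    // Per-Record: executed for every row. The temporaries keep their
    // values from one row to the next.
    for (const Row& row : input) {
        runningTotal += row.amount;
        ++rowCount;
        result.runningTotals.push_back(runningTotal);
    }

    // Post-Loop: executed once, after the last row is processed.
    result.rowsSeen = rowCount;
    return result;
}
```

Calling `runStage` on rows with amounts 10, 20, and 5 yields running totals 10, 30, and 35 and a row count of 3.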

Page 313: DataStage EE

I/O and Transfer

Under Interface tab: Input, Output & Transfer pages

Optional renaming of output port from default "out0"

Write row

Input page: 'Auto Read'

Read next row

In-Repository Table Definition

'False' setting, not to interfere with Transfer page

First line: output 0

Page 314: DataStage EE

I/O and Transfer

• Transfer all columns from input to output.

• If page left blank or Auto Transfer = "False" (and RCP = "False"), only columns in output Table Definition are written

First line: Transfer of index 0

Page 315: DataStage EE

BuildOp Simple Example

Example - sumNoTransfer

– Add input columns "a" and "b"; ignore other columns that might be present in input

– Produce a new "sum" column

– Do not transfer input columns

Input: a:int32; b:int32

Stage: sumNoTransfer

Output: sum:int32

Page 316: DataStage EE

NO TRANSFER

- RCP set to "False" in stage definition and

- Transfer page left blank, or Auto Transfer = "False"

• Effects:

- input columns "a" and "b" are not transferred

- only new column "sum" is transferred

Compare with transfer ON…

From Peek:

No Transfer

Page 317: DataStage EE

Transfer

TRANSFER

- RCP set to "True" in stage definition, or

- Auto Transfer set to "True"

• Effects:

- new column "sum" is transferred, as well as

- input columns "a" and "b" and

- input column "ignored" (present in input, but not mentioned in stage)
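The difference between the two settings can be modeled in a few lines of standalone C++. This is a simulation under stated assumptions, not the code entered in the stage: `Record` and `sumStage` are hypothetical names, a record is reduced to a map of integer columns, and the single `transfer` flag stands in for RCP or Auto Transfer being "True".

```cpp
#include <map>
#include <string>

// Hypothetical model of a record: column name -> integer value.
using Record = std::map<std::string, int>;

// Simulates the sum example. With transfer off, only the new "sum"
// column is written; with transfer on, every input column is carried
// across too, including columns the stage never mentions.
Record sumStage(const Record& in, bool transfer) {
    Record out;
    if (transfer) {
        out = in;  // copy all input columns to the output
    }
    out["sum"] = in.at("a") + in.at("b");  // the per-record logic
    return out;
}
```

For an input record with a = 1, b = 2, and ignored = 9, the no-transfer call returns only sum = 3, while the transfer call returns sum, a, b, and ignored.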

Page 318: DataStage EE

Columns vs. Temporary C++ Variables

Columns

– DS-EE type

– Defined in Table Definitions

– Value refreshed from row to row

Temporary C++ variables

– C/C++ type

– Need declaration (in Definitions or Pre-Loop page)

– Value persistent throughout "loop" over rows, unless modified in code

Page 319: DataStage EE

Exercise

Complete exercises 10-1 and 10-2

Page 320: DataStage EE

Exercise

Complete exercises 10-3 and 10-4

Page 321: DataStage EE

Custom Stage

Reasons for a custom stage:

– Add EE operator not already in DataStage EE

– Build your own Operator and add to DataStage EE

Use EE API

Use Custom Stage to add new operator to EE canvas

Page 322: DataStage EE

Custom Stage

DataStage Manager > select Stage Types branch > right click

Page 323: DataStage EE

Custom Stage

Name of Orchestrate operator to be used

Number of input and output links allowed

Page 324: DataStage EE

Custom Stage – Properties Tab

Page 325: DataStage EE

The Result

Page 326: DataStage EE

Module 11

Meta Data in DataStage EE

Page 327: DataStage EE

Objectives

Understand how EE uses meta data, particularly schemas and runtime column propagation

Use this understanding to:

– Build schema definition files to be invoked in DataStage jobs

– Use RCP to manage meta data usage in EE jobs

Page 328: DataStage EE

Establishing Meta Data

Data definitions

– Recordization and columnization

– Fields have properties that can be set at individual field level

Data types in GUI are translated to types used by EE

– Described as properties on the format/columns tab (outputs or inputs pages) OR

– Using a schema file (can be full or partial)

Schemas

– Can be imported into Manager

– Can be pointed to by some job stages (e.g., Sequential)

Page 329: DataStage EE

Data Formatting – Record Level

Format tab

Meta data described on a record basis

Record level properties

Page 330: DataStage EE

Data Formatting – Column Level

Defaults for all columns

Page 331: DataStage EE

Column Overrides

Edit row from within the columns tab

Set individual column properties

Page 332: DataStage EE

Extended Column Properties

Field and string settings

Page 333: DataStage EE

Extended Properties – String Type

Note the ability to convert ASCII to EBCDIC

Page 334: DataStage EE

Editing Columns

Properties depend on the data type

Page 335: DataStage EE

Schema

Alternative way to specify column definitions for data used in EE jobs

Written in a plain text file

Can be written as a partial record definition

Can be imported into the DataStage repository

Page 336: DataStage EE

Creating a Schema

Using a text editor

– Follow correct syntax for definitions

– OR

Import from an existing data set or file set

– On DataStage Manager: Import > Table Definitions > Orchestrate Schema Definitions

– Select checkbox for a file with .fs or .ds

Page 337: DataStage EE

Importing a Schema

Schema location can be on the server or local workstation

Page 338: DataStage EE

Data Types

Date

Decimal

Floating point

Integer

String

Time

Timestamp

Vector

Subrecord

Raw

Tagged

Page 339: DataStage EE

Runtime Column Propagation

DataStage EE is flexible about meta data. It can cope with the situation where meta data isn’t fully defined. You can define part of your schema and specify that, if your job encounters extra columns that are not defined in the meta data when it actually runs, it will adopt these extra columns and propagate them through the rest of the job. This is known as runtime column propagation (RCP).

RCP is always on at runtime.

Design and compile time column mapping enforcement.

– RCP is off by default.

– Enable first at project level. (Administrator project properties)

– Enable at job level. (job properties General tab)

– Enable at Stage. (Link Output Column tab)
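The effect of propagation on the column list can be sketched in standalone C++. This is only a model of the behavior described above, not anything DataStage exposes: `propagatedColumns` is a hypothetical function, and columns are reduced to their names.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical model: returns the columns that leave a stage.
// `defined` holds the columns in the design-time meta data;
// `incoming` holds the columns actually present at run time.
std::vector<std::string> propagatedColumns(
        const std::vector<std::string>& defined,
        const std::vector<std::string>& incoming,
        bool rcpEnabled) {
    std::vector<std::string> out = defined;
    if (!rcpEnabled) {
        return out;  // only the defined columns are passed on
    }
    // With RCP enabled, extra incoming columns are adopted and
    // propagated after the defined ones, in their arrival order.
    for (const std::string& col : incoming) {
        if (std::find(out.begin(), out.end(), col) == out.end()) {
            out.push_back(col);
        }
    }
    return out;
}
```

With `id` and `name` defined and `id`, `name`, `region`, `score` arriving, the RCP-enabled call returns all four columns and the RCP-disabled call returns only the two defined ones.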

Page 340: DataStage EE

Enabling RCP at Project Level

Page 341: DataStage EE

Enabling RCP at Job Level

Page 342: DataStage EE

Enabling RCP at Stage Level

Go to output link’s columns tab

For a Transformer, you can find the output link's columns tab by first going to stage properties

Page 343: DataStage EE

Using RCP with Sequential Stages

To utilize runtime column propagation in the sequential stage you must use the “use schema” option

Stages with this restriction:

– Sequential

– File Set

– External Source

– External Target

Page 344: DataStage EE

Runtime Column Propagation

When RCP is Disabled

– DataStage Designer will enforce Stage Input Column to Output Column mappings.

– At job compile time, modify operators are inserted on output links in the generated osh.

Page 345: DataStage EE

Runtime Column Propagation

When RCP is Enabled

– DataStage Designer will not enforce mapping rules.

– No Modify operator inserted at compile time.

– Danger of runtime error if incoming column names do not match the column names on the outgoing link (column names are case sensitive).
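Since Designer no longer enforces mappings when RCP is enabled, a name that differs only in case is easy to miss. The standalone C++ sketch below shows one way such a check could look outside the tool; `toLower` and `caseMismatches` are hypothetical helpers, not part of DataStage, and the comparison assumes plain ASCII column names.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Lower-cases an ASCII string.
std::string toLower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

// Returns the outgoing column names that have no exact match among the
// incoming names but do match one when case is ignored. These are the
// likely case-sensitivity mistakes.
std::vector<std::string> caseMismatches(
        const std::vector<std::string>& incoming,
        const std::vector<std::string>& outgoing) {
    std::vector<std::string> suspects;
    for (const std::string& want : outgoing) {
        bool exact = std::find(incoming.begin(), incoming.end(), want)
                     != incoming.end();
        if (exact) {
            continue;  // name matches exactly, nothing to report
        }
        for (const std::string& have : incoming) {
            if (toLower(have) == toLower(want)) {
                suspects.push_back(want);
                break;
            }
        }
    }
    return suspects;
}
```

Given incoming columns `CustID` and `amount` and outgoing columns `custid`, `amount`, and `region`, only `custid` is reported: `amount` matches exactly, and `region` has no counterpart at all.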

Page 346: DataStage EE

Exercise

Complete exercises 11-1 and 11-2

Page 347: DataStage EE

Module 12

Job Control Using the Job Sequencer

Page 348: DataStage EE

Objectives

Understand how the DataStage job sequencer works

Use this understanding to build a control job to run a sequence of DataStage jobs

Page 349: DataStage EE

Job Control Options

Manually write job control

– Code generated in Basic

– Use the job control tab on the job properties page

– Generates basic code which you can modify

Job Sequencer

– Build a controlling job much the same way you build other jobs

– Comprised of stages and links

– No basic coding

Page 350: DataStage EE

Job Sequencer

Build like a regular job

Type “Job Sequence”

Has stages and links

Job Activity stage represents a DataStage job

Links represent passing control

Stages

Page 351: DataStage EE

Example

Job Activity stage – contains conditional triggers

Page 352: DataStage EE

Job Activity Properties

Job parameters to be passed

Job to be executed – select from dropdown

Page 353: DataStage EE

Job Activity Trigger

Trigger appears as a link in the diagram

Custom options let you define the code

Page 354: DataStage EE

Options

Use custom option for conditionals

– Execute if job run successful or warnings only

Can add “wait for file” to execute

Add “execute command” stage to drop real tables and rename new tables to current tables
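The "successful or warnings only" condition is a simple predicate on the upstream job's finishing status. The C++ sketch below models that logic only; in the Sequencer the condition is entered in the trigger itself, and `JobStatus` and `runNext` are hypothetical names introduced for illustration.

```cpp
// Hypothetical finishing states of an upstream job.
enum class JobStatus { Ok, Warnings, Failed };

// Follow the trigger link only if the upstream job finished cleanly
// or finished with warnings only.
bool runNext(JobStatus upstream) {
    return upstream == JobStatus::Ok || upstream == JobStatus::Warnings;
}
```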

Page 355: DataStage EE

Job Activity With Multiple Links

Different links having different triggers

Page 356: DataStage EE

Sequencer Stage

Can be set to all or any

Build job sequencer to control job for the collections application
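The "all or any" setting decides when the Sequencer stage passes control on. A minimal C++ model of that rule follows, assuming each input link is reduced to a flag saying whether its trigger has fired; `SequencerMode` and `sequencerFires` are hypothetical names, and treating a stage with no inputs as never firing is an assumption of this sketch.

```cpp
#include <algorithm>
#include <vector>

// The two settings of the Sequencer stage.
enum class SequencerMode { All, Any };

// Returns true when the stage should pass control to its output links.
// Each element of `inputsTriggered` says whether that input link fired.
bool sequencerFires(SequencerMode mode,
                    const std::vector<bool>& inputsTriggered) {
    if (inputsTriggered.empty()) {
        return false;  // assumption: nothing to wait on, nothing fires
    }
    auto fired = [](bool b) { return b; };
    if (mode == SequencerMode::All) {
        return std::all_of(inputsTriggered.begin(),
                           inputsTriggered.end(), fired);
    }
    return std::any_of(inputsTriggered.begin(),
                       inputsTriggered.end(), fired);
}
```

With one of two inputs fired, the stage proceeds in "any" mode and keeps waiting in "all" mode.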

Page 357: DataStage EE

Notification

Notification Stage

Page 358: DataStage EE

Notification Activity

Page 359: DataStage EE

Sample DataStage log from Mail Notification

Page 360: DataStage EE

E-Mail Message

Notification Activity Message

Page 361: DataStage EE

Exercise

Complete exercise 12-1

Page 362: DataStage EE

Module 13

Testing and Debugging

Page 363: DataStage EE

Objectives

Understand spectrum of tools to perform testing and debugging

Use this understanding to troubleshoot a DataStage job

Page 364: DataStage EE

Environment Variables

Page 365: DataStage EE

Parallel Environment Variables

Page 366: DataStage EE

Environment Variables – Stage Specific

Page 367: DataStage EE

Environment Variables

Page 368: DataStage EE

Environment Variables – Compiler

Page 369: DataStage EE

Typical Job Log Messages:

Environment variables

Configuration File information

Framework Info/Warning/Error messages

Output from the Peek Stage

Additional info with "Reporting" environments

Tracing/Debug output

– Must compile job in trace mode

– Adds overhead

The Director

Page 370: DataStage EE

Job Level Environmental Variables

• Job Properties, from Menu Bar of Designer

• Director will prompt you before each run

Page 371: DataStage EE

Troubleshooting

If you get an error during compile, check the following:

Compilation problems

– If Transformer used, check C++ compiler, LD_LIBRARY_PATH

– If Buildop errors, try buildop from command line

– Some stages may not support RCP, which can cause column mismatch

– Use the Show Error and More buttons

– Examine Generated OSH

– Check environment variables settings

Very little integrity checking is done during compile, so you should run Validate from Director.

Highlights source of error

Page 372: DataStage EE

Generating Test Data

Row Generator stage can be used

– Column definitions

– Data type dependent

Row Generator plus lookup stages provide a good way to create robust test data from pattern files