Partitioning In Datastage

7/30/2019 Partitioning In Datastage


Partitioning



2002. Infosys Technologies Ltd. 2

Agenda

Introduction

Why do we need partitioning

Types of partitioning




Introduction

The strength of DataStage Parallel Extender lies in the parallel processing capability it brings to your data extraction and transformation applications.

DataStage PX can slice the data into chunks and process them simultaneously.

Parallelism in DataStage PX is of two types.

Pipeline parallelism.

Partition parallelism.




Types of Parallelism

Parallelism in PX jobs is of two types.

Pipeline

The output of a producer operator is processed by a consumer operator before the producer operator completes processing of the input.

Partition

Data is broken into partitions, and each partition is processed by a separate instance of the operator at the same time.




Pipeline parallelism

If a Parallel Extender job ran sequentially, each stage would process a single row of data and then pass it to the next stage, which would process that row and pass it on in turn.

When the same job runs in parallel, the reading stage starts on one node and begins filling a pipeline with the data it has read. The next stage starts running on another node as soon as there is data in the pipeline, processes it, and starts filling another pipeline.
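The behaviour described above can be illustrated with a minimal Python sketch. This is not DataStage code: threads stand in for stages running on separate nodes, and bounded queues stand in for the pipelines between them. The stage names (`reader`, `transformer`, `writer`) and the multiply-by-ten transformation are hypothetical.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the data in a pipeline


def reader(rows, out_q):
    # First stage: starts filling the pipeline as soon as it has data.
    for row in rows:
        out_q.put(row)
    out_q.put(SENTINEL)


def transformer(in_q, out_q):
    # Second stage: processes rows as they arrive, without waiting
    # for the reader to finish.
    while True:
        row = in_q.get()
        if row is SENTINEL:
            out_q.put(SENTINEL)
            break
        out_q.put(row * 10)


def writer(in_q, sink):
    # Last stage: drains the second pipeline into the target.
    while True:
        row = in_q.get()
        if row is SENTINEL:
            break
        sink.append(row)


def run_pipeline(rows):
    # Small bounded queues keep the stages working on the data concurrently.
    q1, q2 = queue.Queue(maxsize=2), queue.Queue(maxsize=2)
    sink = []
    stages = [
        threading.Thread(target=reader, args=(rows, q1)),
        threading.Thread(target=transformer, args=(q1, q2)),
        threading.Thread(target=writer, args=(q2, sink)),
    ]
    for stage in stages:
        stage.start()
    for stage in stages:
        stage.join()
    return sink
```

Calling `run_pipeline([1, 2, 3])` returns `[10, 20, 30]`; row order is preserved because each stage is a single consumer of a first-in, first-out queue.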







Partition parallelism

When the same job processes a huge volume of data, pipelining alone would take time. We can use the parallel processing power of DataStage by partitioning the data into separate sets.

Each of these sets is then processed by a separate node.
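As a rough illustration only, the sketch below splits the input into separate sets and hands each set to its own worker. Worker threads stand in for processing nodes, and the function names and the doubling transformation are hypothetical, not part of DataStage.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(rows, num_partitions):
    # Cut the input into contiguous sets of (almost) equal size.
    size = -(-len(rows) // num_partitions)  # ceiling division
    return [rows[i * size:(i + 1) * size] for i in range(num_partitions)]


def process_partition(partition):
    # The same stage logic runs on every partition.
    return [row * 2 for row in partition]


def run_partitioned(rows, num_partitions):
    partitions = split_into_partitions(rows, num_partitions)
    # One worker per partition, all running at the same time.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        return list(pool.map(process_partition, partitions))
```

For example, `run_partitioned([1, 2, 3, 4, 5], 2)` returns `[[2, 4, 6], [8, 10]]`, one result list per partition.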



Partition and Pipeline

When enough processors are available, pipeline and partition parallelism can be used together to achieve better performance.



Why do we need partitioning

Data must be partitioned to introduce parallel processing into a job.

Partitioning the data gives greater performance.

Each node works on a different partition.



Types of partitioning

The following partitioning methods are available:

Round Robin

Random

Same

Entire

Hash

Modulus

Range

DB2

Auto






Round Robin

The first record goes to the first processing node, the second record goes to the second processing node, and so on. Once the last processing node is reached, the next record goes to the first processing node again.

Used to resize partitions that are not equal in size.

This method is used to create equal-sized partitions.

This method is used to create sequences.
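The assignment rule above can be written down in a few lines. This is a minimal sketch under the assumption that records arrive as a simple list; the function name `round_robin_partition` is hypothetical.

```python
def round_robin_partition(records, num_nodes):
    # Record 0 goes to node 0, record 1 to node 1, and so on;
    # after the last node the assignment wraps around to node 0.
    partitions = [[] for _ in range(num_nodes)]
    for index, record in enumerate(records):
        partitions[index % num_nodes].append(record)
    return partitions


# Rebalancing two unequal partitions into equal-sized ones.
uneven = [[1, 2, 3, 4, 5], [6]]
flattened = [record for partition in uneven for record in partition]
balanced = round_robin_partition(flattened, 2)
# balanced is [[1, 3, 5], [2, 4, 6]]
```

Partition sizes differ by at most one record, whatever the record values are, which is why the method suits resizing.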






Same

Fastest method of partitioning.

Records are processed by the same processing node they are already on.

The operator does not repartition the output of the preceding stage.






Entire

Every processing node of the stage gets the entire set of data.

Used when the data is small enough to fit into memory and access to the entire data set is needed.

Generally used in lookups to create the hash table.
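A small sketch of the idea, assuming a reference data set that is tiny compared with the main stream. The function names and the `code`/`name` columns are hypothetical, and a Python dict stands in for the lookup hash table.

```python
def entire_partition(records, num_nodes):
    # Every node receives its own full copy of the data.
    return [list(records) for _ in range(num_nodes)]


def lookup_on_node(stream_rows, reference_copy):
    # Each node builds an in-memory hash table from its copy,
    # so any stream row can be resolved locally.
    table = {ref["code"]: ref["name"] for ref in reference_copy}
    return [(code, table.get(code)) for code in stream_rows]


reference = [
    {"code": "IN", "name": "India"},
    {"code": "US", "name": "United States"},
]
copies = entire_partition(reference, 2)
node0_result = lookup_on_node(["US", "IN"], copies[0])
node1_result = lookup_on_node(["IN"], copies[1])
```

Because each node holds the full reference set, the main stream can be partitioned by any method without losing matches; the cost is memory proportional to the number of nodes.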






Hash

Partitioning is based on a function of columns chosen as hash keys.

This method is used when related records need to be kept in same partition.

It does not ensure that records are evenly distributed across partitions.

This partitioning method is used in join, sort, merge and lookup Stages.
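The following sketch shows the general technique, not the hash function DataStage actually uses: `zlib.crc32` is swapped in here only because it is deterministic across runs. The function name and the `cust`/`amt` columns are hypothetical.

```python
import zlib


def hash_partition(records, key_columns, num_partitions):
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        # Combine the chosen key columns into one value and hash it.
        key = "|".join(str(record[column]) for column in key_columns)
        target = zlib.crc32(key.encode("utf-8")) % num_partitions
        partitions[target].append(record)
    return partitions


rows = [
    {"cust": customer, "amt": amount}
    for amount, customer in enumerate(["a", "b", "a", "c", "b", "a"])
]
parts = hash_partition(rows, ["cust"], 3)
```

All rows for a given `cust` value land in the same partition, but nothing forces the partitions to be the same size: a key that occurs very often makes its partition correspondingly large.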






Modulus

Partitioning is based on the value of a key column modulo the number of partitions.

This method is similar to hash by field, but involves simpler computation.
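A minimal sketch of the computation, assuming the key column holds non-negative integers. The function name and the `id` column are hypothetical.

```python
def modulus_partition(records, key_column, num_partitions):
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        # Partition number = key value modulo the number of partitions.
        target = record[key_column] % num_partitions
        partitions[target].append(record)
    return partitions


rows = [{"id": n} for n in range(7)]
parts = modulus_partition(rows, "id", 3)
ids_per_partition = [[r["id"] for r in p] for p in parts]
# ids_per_partition is [[0, 3, 6], [1, 4], [2, 5]]
```

No hash function is evaluated, only one remainder per record, which is the simpler computation the slide refers to.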



Range

Divides a data set into approximately equal-sized partitions, each of which contains records with key columns within a specified range.

This method is also useful for ensuring that related records are in the same partition.

This method needs a range map to be created, which decides which records go to which processing node.
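One way to picture a range map is as a sorted list of boundary values taken from a sample of the keys. The sketch below is an illustration under that assumption, not the format DataStage stores; `build_range_map`, `range_partition`, and the `key` column are hypothetical names, and the sample is assumed to be non-empty.

```python
import bisect


def build_range_map(sample_keys, num_partitions):
    # Pick boundaries at evenly spaced positions in the sorted sample,
    # so each range should hold roughly the same number of records.
    ordered = sorted(sample_keys)
    return [
        ordered[(len(ordered) * i) // num_partitions]
        for i in range(1, num_partitions)
    ]


def range_partition(records, key_column, range_map):
    partitions = [[] for _ in range(len(range_map) + 1)]
    for record in records:
        # Binary search finds the range the key value falls into.
        target = bisect.bisect_right(range_map, record[key_column])
        partitions[target].append(record)
    return partitions


rows = [{"key": n} for n in [5, 1, 9, 3, 7, 2, 8, 4, 6]]
range_map = build_range_map([r["key"] for r in rows], 3)
parts = range_partition(rows, "key", range_map)
sizes = [len(p) for p in parts]
```

Here the range map is `[4, 7]`: keys below 4 go to the first partition, keys from 4 up to but not including 7 to the second, and the rest to the third.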






Range map




DB2

Data is partitioned in the same way as the DB2 table.

Used when writing to a DB2 table.

Default partitioning method for DB2 stages.








Degree of parallelism

The degree of parallelism is determined by the configuration file.

It is the total number of logical nodes in the default pool, or a subset of them if "constraints" are used.

Constraints are assigned to specific pools as defined in the configuration file and can be referenced in the stage.

Job performance can be tuned by choosing the best configuration for the job.
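The counting rule can be sketched with a simplified model of a configuration. The list of dicts below is not the configuration file format; it is a hypothetical stand-in in which each logical node lists the pools it belongs to. The empty string used for the default pool and the pool name `sort` are assumptions made for illustration.

```python
def degree_of_parallelism(nodes, pool=""):
    # Count the logical nodes that belong to the requested pool.
    # With no constraint, the default pool (modelled as "") is used.
    return sum(1 for node in nodes if pool in node["pools"])


# Hypothetical four-node configuration; two nodes also sit in a "sort" pool.
nodes = [
    {"name": "node1", "pools": ["", "sort"]},
    {"name": "node2", "pools": ["", "sort"]},
    {"name": "node3", "pools": [""]},
    {"name": "node4", "pools": [""]},
]

unconstrained = degree_of_parallelism(nodes)        # all default-pool nodes
constrained = degree_of_parallelism(nodes, "sort")  # subset named by a constraint
```

In this model an unconstrained stage runs four ways parallel, while a stage constrained to the `sort` pool runs two ways parallel.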



Partitioning and Collecting Icons

Partitioner Collector
