vamsi-karthik
7/30/2019 Partitioning In Datastage
Partitioning
2002. Infosys Technologies Ltd.
Agenda
Introduction
Why do we need partitioning
Types of partitioning
Introduction
The strength of DataStage Parallel Extender (PX) lies in the parallel processing capability it brings to your data extraction and transformation applications.
DataStage PX can slice the data into chunks and process them simultaneously.
Parallelism in DataStage PX is of two types.
Pipeline parallelism.
Partition parallelism.
Types of Parallelism
Parallelism in PX jobs is of two types.
Pipeline
The output of a producer operator is processed by a consumer operator before the producer operator completes processing of the input.
Partition
Data is broken into packets, each of which is processed by one of the producer operators at the same time.
Pipeline parallelism
When a Parallel Extender job runs sequentially, each stage processes a single row of data and then passes it to the next stage, which processes the row and passes it on in turn.
When the same job runs in parallel, the reading stage starts on one node and begins filling a pipeline with the data it has read. The next stage starts running on another node as soon as there is data in the pipeline, processes it, and starts filling another pipeline.
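The row-at-a-time hand-off described above can be sketched with Python generators. This is only an illustration under stated assumptions: the stage names `read_rows` and `transform` are hypothetical, and generators run in a single thread, so the sketch shows the interleaving of producer and consumer rather than real multi-node execution.

```python
events = []  # records the order in which the two stages touch each row


def read_rows(source):
    """Producer stage (hypothetical): emits rows one at a time."""
    for row in source:
        events.append(("read", row))
        yield row


def transform(rows):
    """Consumer stage (hypothetical): handles each row as soon as it arrives."""
    for row in rows:
        events.append(("transform", row))
        yield row.upper()


output = list(transform(read_rows(["a", "b"])))
print(output)
print(events)
```

The event log shows each row being transformed before the producer has read the next one, which is the behaviour the pipeline relies on.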
Pipeline
Partition parallelism
When the same job processes a huge volume of data, pipelining alone would still take time. We can use the parallel processing power of DataStage by partitioning the data into separate sets.
Each of these sets is then processed by a separate node.
Partition and Pipeline
When enough processors are available, pipeline and partition parallelism can be used together to achieve better performance.
Why do we need partitioning
To introduce parallel processing into a job, the data must be partitioned.
Partitioning the data gives greater performance.
Each node works on a different partition.
Types of partitioning
The partitioning methods are as follows:
Round Robin
Random
Same
Entire
Hash
Modulus
Range
DB2
Auto
General
Round Robin
The first record goes to the first processing node, the second record goes to the second processing node, and so on. Once the last processing node is reached, the next record goes to the first processing node again.
Used to resize partitions that are not equal in size.
This method creates equal-sized partitions.
This method is used to create sequences.
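A minimal sketch of the round robin assignment described above, assuming records arrive as a plain Python list and partitions are numbered from zero. The function name `round_robin_partition` is hypothetical and is not a DataStage API.

```python
def round_robin_partition(records, num_partitions):
    """Deal records out to partitions in turn, wrapping after the last one."""
    partitions = [[] for _ in range(num_partitions)]
    for index, record in enumerate(records):
        partitions[index % num_partitions].append(record)
    return partitions


parts = round_robin_partition(list(range(1, 8)), 3)
print(parts)  # [[1, 4, 7], [2, 5], [3, 6]]
```

With seven records and three partitions, the partition sizes differ by at most one record.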
Round Robin
Same
Fastest method of partitioning.
Records stay on the same processing node.
The operator uses the partitions output by the preceding stage as they are; no repartitioning is done.
Same
Entire
Every processing node of the stage gets the entire set of data.
Used when the data is small enough to fit into memory and access to the entire data set is needed.
Generally used in lookups to create the hash table.
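As an illustration of the Entire method, the sketch below gives every partition its own full copy of a small reference data set and builds a lookup table from each copy. The function name `entire_partition` and the sample country codes are assumptions made for the example.

```python
def entire_partition(records, num_partitions):
    """Give every partition its own complete copy of the data set."""
    return [list(records) for _ in range(num_partitions)]


# Hypothetical reference data: (code, name) pairs small enough for memory.
reference = [("IN", "India"), ("US", "United States")]
copies = entire_partition(reference, 2)

# Each node can build its own in-memory lookup table from its copy.
lookup_tables = [dict(copy) for copy in copies]
print(lookup_tables[1]["IN"])  # India
```

Because every node holds the whole reference set, a lookup never needs data from another node.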
Entire
Hash
Partitioning is based on a function of columns chosen as hash keys.
This method is used when related records need to be kept in the same partition.
It does not ensure that the partitions are evenly sized.
This partitioning method is used in Join, Sort, Merge and Lookup stages.
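A minimal sketch of hash partitioning on a key column. The source does not say which hash function DataStage uses, so `zlib.crc32` is a stand-in chosen only because it gives the same result on every run; the function name `hash_partition` and the sample rows are likewise hypothetical.

```python
import zlib


def hash_partition(records, key, num_partitions):
    """Send each record to the partition chosen by hashing its key column."""
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        key_bytes = str(record[key]).encode("utf-8")
        # crc32 is a stand-in hash function, not the one DataStage uses.
        partitions[zlib.crc32(key_bytes) % num_partitions].append(record)
    return partitions


rows = [
    {"cust": "A", "amt": 10},
    {"cust": "B", "amt": 20},
    {"cust": "A", "amt": 30},
    {"cust": "C", "amt": 40},
]
parts = hash_partition(rows, "cust", 3)
print(parts)
```

All rows with the same `cust` value land in the same partition, but nothing forces the partitions to be the same size.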
Hash
Modulus
Partitioning is based on the value of a key column modulo the number of partitions.
This method is similar to hash by field, but involves simpler computation.
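The modulus computation can be sketched as follows, assuming the key column holds non-negative integers. The function name `modulus_partition` and the sample `id` values are hypothetical.

```python
def modulus_partition(records, key, num_partitions):
    """Partition number = integer key value modulo the number of partitions."""
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        partitions[record[key] % num_partitions].append(record)
    return partitions


rows = [{"id": n} for n in (10, 11, 12, 13, 14)]
parts = modulus_partition(rows, "id", 4)
print([[r["id"] for r in p] for p in parts])  # [[12], [13], [10, 14], [11]]
```

No hash function is evaluated here; the key value itself picks the partition, which is why the computation is simpler than hashing.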
Range
Divides a data set into approximately equal-sized partitions, each of which contains records with key columns within a specified range.
This method is also useful for ensuring that related records are in the same partition.
This method needs a range map to be created, which decides which records go to which processing node.
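One way to picture a range map is as a sorted list of boundary values, with a binary search deciding which partition a key falls into. This is a sketch under that assumption only; the boundaries `[100, 200]`, the `amount` column and the function name `range_partition` are hypothetical, and the real range map format is not described here.

```python
from bisect import bisect_right


def range_partition(records, key, boundaries):
    """Place each record in the partition whose key range contains its key.

    `boundaries` is a sorted list of upper bounds (exclusive); it yields
    len(boundaries) + 1 partitions.
    """
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for record in records:
        partitions[bisect_right(boundaries, record[key])].append(record)
    return partitions


range_map = [100, 200]  # hypothetical: <100, 100-199, >=200
rows = [{"amount": a} for a in (5, 100, 150, 250, 99)]
parts = range_partition(rows, "amount", range_map)
print([[r["amount"] for r in p] for p in parts])  # [[5, 99], [100, 150], [250]]
```

Partitions come out roughly equal in size only if the boundaries match the actual distribution of key values.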
Range
Range map
DB2
Data is partitioned in the same way as the DB2 table.
Used when writing to a DB2 table.
Default partitioning method for DB2 stages.
DB2
Degree of parallelism
The degree of parallelism is determined by the configuration file.
It is the total number of logical nodes in the default pool, or a subset if using "constraints".
Constraints are assigned to specific pools as defined in the configuration file and can be referenced in the stage.
Job performance can be improved by choosing the best configuration for a job.
Partitioning and Collecting Icons
Partitioner Collector