7/30/2019 Data Stage Job Design Approach
Job Design Approach
2002. Infosys Technologies Ltd. 2
Agenda
Introduction
Framework
Scheduling Approach
Restartability
Reusability
Templates
Modularity and Maintainability
Performance Considerations
Introduction
Job design is influenced by the following points:
Framework
Scheduling Approach
Restartability
Reusability/Templates
Modularity and Maintainability
Performance Considerations
Metadata Management
Framework
Reprocessing
System Health Tables
ACR Balancing
Logs, Errors & Warnings
Framework
Reprocessing: Records are errored out according to the defined business rules and should be reconsidered the next time the job runs.
Reprocessing is required (or enforced) when the quality of the data is not good enough.
Reprocessing influences job design and the framework in several ways:
Error records must be retained so they can be corrected, which requires a landing/work table
Jobs need logic to handle duplicate records with the same natural key
The ACR log file should include the count of reprocessed records
End users should be able to identify and correct error records
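The reprocessing flow above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Record` fields, the `is_valid` business rule, and the count names are hypothetical and do not come from any DataStage job or ACR file format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    natural_key: str
    amount: float  # hypothetical payload field


def is_valid(record):
    """Stand-in business rule (assumption): amounts must be non-negative."""
    return record.amount >= 0


def run_job(new_records, work_table):
    """Process new records together with records retained in the work table.

    Returns (loaded, errors, acr_counts). The errors list becomes the work
    table that the next run picks up after end users have corrected it.
    """
    candidates = {}
    # Work-table records go in first, so a new record with the same
    # natural key replaces the older, previously rejected version.
    for record in list(work_table) + list(new_records):
        candidates[record.natural_key] = record

    loaded = [r for r in candidates.values() if is_valid(r)]
    errors = [r for r in candidates.values() if not is_valid(r)]
    acr_counts = {
        "read_new": len(new_records),
        "read_reprocessed": len(work_table),
        "loaded": len(loaded),
        "errored": len(errors),
    }
    return loaded, errors, acr_counts
```

Keeping the reprocessed count separate from the new-record count is what lets the balancing log account for records that entered the run from the work table rather than from the source.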
Framework
System Health Tables: Jobs should provide the information needed to maintain, track, and control data loading.
System Health Tables hold the start and end time of a job, the number of records read, written, and bypassed, and the start and end of the batch.
System Health Tables directly or indirectly influence job design and the framework:
Jobs must generate the files that carry this information
Jobs must capture enough detail, such as link counts
Reusable and common jobs can be identified
Scheduling and sequencing are affected
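As a rough illustration of what one System Health row might carry, the sketch below models the fields listed on this slide. The class name, field names, and the balancing rule in `is_balanced` are assumptions made for the example, not the actual table layout.

```python
from dataclasses import asdict, dataclass
from datetime import datetime
from typing import Optional


@dataclass
class JobHealthRecord:
    batch_id: str
    job_name: str
    start_time: datetime
    end_time: Optional[datetime] = None
    records_read: int = 0
    records_written: int = 0
    records_bypassed: int = 0

    def close(self, end_time, read, written, bypassed):
        """Fill in end-of-job details and return the row as a plain dict."""
        self.end_time = end_time
        self.records_read = read
        self.records_written = written
        self.records_bypassed = bypassed
        return asdict(self)

    def is_balanced(self):
        # Assumed balancing rule: every record read is either written or bypassed.
        return self.records_read == self.records_written + self.records_bypassed
```

A job would create the record at start, call `close` with its link counts at the end, and hand the resulting dict to whatever common job loads the health table.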
Framework
A few common tables from CSL/ABI projects:
DTMT_PRCS: Stores information about business processes.
DTMT_PGM_CNTL: Stores all control table entries.
DTMT_PGM_ERR: Stores information about errors that occurred during program execution.
DTMT_PGM_EXEC_H: Stores the execution history of every program run.
DTMT_REC_ERR_LOG (staging table): Holds error records awaiting correction.
DTMT_SRC: Contains source file names.
DTMT_PGM: Contains details about all programs.
Framework
Logs, Errors, Warnings: DataStage jobs should have provisions to maintain logs, errors, and warnings.
Logs are required to help with debugging and to keep track of job activity
Errors and warnings need to be logged to validate business rules and data validations
Restartability plays a vital role in loading errors and warnings
Reusable/common jobs can be identified
Scheduling
The scheduling approach affects job design.
Scheduling can be done in two ways:
Use DataStage Sequencers for sequencing the jobs and use Control-M only for scheduling. Sequences should be built with restart points.
Pros: Sequencing complexity is abstracted inside the Sequencers.
Pros: Scheduling is simplified, since Control-M only needs the starting point.
Cons: Complexity and additional effort in building Sequencers. Sequencing and job designs are tightly coupled.
Use Control-M for both sequencing and scheduling. Break the required functionality into restartable jobs and let Control-M sequence and schedule them.
Pros: Simplified job design; sequencing and job designs are loosely coupled.
Pros: Flexibility to break up or join jobs without a major effect on sequencing. No additional overhead of maintaining restart points.
Cons: The complexity of sequencing is shifted to scheduling.
Scheduling Sequencer Approach
Scheduling Control M Approach
The scheduling of jobs/scripts in a project is done through Control-M.
The dependencies between jobs within the same module or across modules (successor/predecessor) are tracked in an .xls file, which is submitted to the Control-M team.
The job dependencies are set up in Control-M using triggers, so that a job starts only after all its predecessors have completed successfully.
A trigger can be the successful completion of a job, the presence of a particular file, etc.
Sample Control-M Excel sheet attached:
Application: ADW Gift Registry
Description of Request: New job setup for application ADWGR
Requester Name: Brian Turbes
Contact Information: 612-304-0476, [email protected]
Requested Migration Date: 2/10/2005
Test Prod 04/01/2005

Column headings in the sheet: Table Name (if table exists), Job Name (if job exists), Action Requested (Add, Change, Delete), Test/Prod, Days Scheduled (M,T,W,Th,F,Sa,Su), Holidays, Dependencies (job names or line number), Time Window for Job Start (optional), Server/Account, Path Name, Script Name, Parameters

Table Name      Job Name    Action  Server/Account       Days  Dependencies  Start Window
START_OF_CYCLE  ADWGR0010T  Add     grmetltest/adwgradm  F                   2am
START_OF_CYCLE  ADWGR0020T  Add     grmetltest/adwgradm  F     ADWGR0010T
START_OF_CYCLE  ADWGR0080T  Add     grmetltest/adwgradm  F     ADWGR0020T
LANDING_JOBS    ADWGR1005T  Add     grmetltest/adwgradm  F     ADWGR0080T
LANDING_JOBS    ADWGR1005B  Add     grmetltest/adwgradm  F     ADWGR1005T
LANDING_JOBS    ADWGR1005L  Change  grmetltest/adwgradm  F     ADWGR1005B
LANDING_JOBS    ADWGR1008T  Add     grmetltest/adwgradm  F     ADWGR0080T
LANDING_JOBS    ADWGR1008B  Add     grmetltest/adwgradm  F     ADWGR1008T
LANDING_JOBS    ADWGR1008L  Change  grmetltest/adwgradm  F     ADWGR1008B
LANDING_JOBS    ADWGR1010T  Add     grmetltest/adwgradm  F     ADWGR1008L
LANDING_JOBS    ADWGR1010B  Add     grmetltest/adwgradm  F     ADWGR1010T
LANDING_JOBS    ADWGR1010L  Change  grmetltest/adwgradm  F     ADWGR1010B
LANDING_JOBS    ADWGR1015T  Add     grmetltest/adwgradm  F     ADWGR0080T
LANDING_JOBS    ADWGR1015B  Add     grmetltest/adwgradm  F     ADWGR1015T
LANDING_JOBS    ADWGR1015L  Change  grmetltest/adwgradm  F     ADWGR1015B
LANDING_JOBS    ADWGR1020T  Add     grmetltest/adwgradm  F     ADWGR0080T

Job Name    Path Name, Script Name, Parameters
ADWGR0010T  /opt/scripts/test/adwetlrun.ksh -f ADWGR0010T_parms.dat ADWGR ADWGR0010T tableEtlPrcsGrp ADWGR0010T adwgrcur
ADWGR0020T  /opt/scripts/test/adwetlrun.ksh -f ADWGR0020T_parms.dat ADWGR ADWGR0020T tableEtlPrcs ADWGR0020T adwgrcur
ADWGR0080T  /opt/scripts/test/adwetlrun.ksh -f ADWGR0080T_parms.dat ADWGR ADWGR0080T tablePrcsCntl ADWGR0080T adwgrcur
ADWGR1005T  /opt/scripts/test/adwetlrun.ksh -f ADWGR1005T_parms.dat ADWGR ADWGR1005T tableGftrgE ADWGR1005T
ADWGR1005B  /opt/scripts/test/adwacrrun.ksh ADWGR1005B ADWGR1005B ADW3407 adwgrcur ADWGR
ADWGR1005L  /opt/scripts/test/adwetlrun.ksh -f ADWGR1005L_parms.dat ADWGR ADWGR0030T tableEtlSubPrcs.ADWGR1005 ADWGR1005L adwgrcur
ADWGR1008T  /opt/scripts/test/adwetlrun.ksh -f ADWGR1008T_parms.dat ADWGR ADWGR1008T dss1008GftrgCust ADWGR1008T
ADWGR1008B  /opt/scripts/test/adwacrrun.ksh ADWGR1008B ADWGR1008B ADW3401 adwgrcur ADWGR
ADWGR1008L  /opt/scripts/test/adwetlrun.ksh -f ADWGR1008L_parms.dat ADWGR ADWGR0030T tableEtlSubPrcs.ADWGR1008 ADWGR1008L adwgrcur
ADWGR1010T  /opt/scripts/test/adwetlrun.ksh -f ADWGR1010T_parms.dat ADWGR ADWGR1010T tableGftrgCustE ADWGR1010T adwgrcur
ADWGR1010B  /opt/scripts/test/adwacrrun.ksh ADWGR1010B ADWGR1010B ADW3409 adwgrcur ADWGR
ADWGR1010L  /opt/scripts/test/adwetlrun.ksh -f ADWGR1010L_parms.dat ADWGR ADWGR0030T tableEtlSubPrcs.ADWGR1010 ADWGR1010L adwgrcur
ADWGR1015T  /opt/scripts/test/adwetlrun.ksh -f ADWGR1015T_parms.dat ADWGR ADWGR1015T tableGftrgBabyE ADWGR1015T adwgrcur
ADWGR1015B  /opt/scripts/test/adwacrrun.ksh ADWGR1015B ADWGR1015B ADW3402 adwgrcur ADWGR
ADWGR1015L  /opt/scripts/test/adwetlrun.ksh -f ADWGR1015L_parms.dat ADWGR ADWGR0030T tableEtlSubPrcs.ADWGR1015 ADWGR1015L adwgrcur
ADWGR1020T  /opt/scripts/test/adwetlrun.ksh -f ADWGR1020T_parms.dat ADWGR ADWGR1020T tableGftrgCharE ADWGR1020T
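
The predecessor rule described on this slide (a job starts only after all its predecessors have completed successfully) can be sketched with Python's standard `graphlib` module. This is not how Control-M is configured; it only illustrates the dependency logic, using a subset of the job names from the sample sheet. The function names are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each job maps to the set of jobs that must complete before it starts.
PREDECESSORS = {
    "ADWGR0010T": set(),
    "ADWGR0020T": {"ADWGR0010T"},
    "ADWGR0080T": {"ADWGR0020T"},
    "ADWGR1005T": {"ADWGR0080T"},
    "ADWGR1005B": {"ADWGR1005T"},
    "ADWGR1005L": {"ADWGR1005B"},
    "ADWGR1008T": {"ADWGR0080T"},
}


def execution_order(predecessors):
    """Return one valid run order in which every job follows its predecessors."""
    return list(TopologicalSorter(predecessors).static_order())


def ready_jobs(predecessors, completed):
    """Return jobs not yet run whose predecessors have all completed."""
    return sorted(
        job
        for job, deps in predecessors.items()
        if job not in completed and deps <= completed
    )
```

Once ADWGR0080T has completed, `ready_jobs` reports both ADWGR1005T and ADWGR1008T, which shows where the landing jobs can fan out in parallel.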
Restartability
Restartability influences job design in how jobs are broken up
Restartability is very important in ETL jobs, and each job should be restartable
Restartability plays a vital role in:
Loading tables with history
Sequence number generation
Reprocessing
Loading errors/warnings tables
Loading System Health Tables
If Sequencers are used for sequencing, Sequencer routines and shell scripts act as placeholders to maintain restart points
If Control-M is used for sequencing, breaking up jobs and identifying common jobs is key
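The idea of a restart point can be illustrated with a small checkpoint file: each completed step is recorded, and a rerun skips whatever already finished. This is a generic sketch, not the DataStage Sequencer mechanism; the checkpoint file name and the function name are assumptions.

```python
import json
import os
import tempfile

# Hypothetical checkpoint location used only for this illustration.
DEFAULT_CHECKPOINT = os.path.join(tempfile.gettempdir(), "job_sequence.ckpt.json")


def run_with_restart(steps, checkpoint_path=DEFAULT_CHECKPOINT):
    """Run (name, callable) steps in order, skipping steps already checkpointed.

    Returns the names of the steps executed in this run.
    """
    completed = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as handle:
            completed = json.load(handle)

    executed = []
    for name, action in steps:
        if name in completed:
            continue  # finished in an earlier run, so do not repeat it
        action()  # an exception here leaves the checkpoint at the last good step
        completed.append(name)
        executed.append(name)
        with open(checkpoint_path, "w") as handle:
            json.dump(completed, handle)

    os.remove(checkpoint_path)  # whole sequence done; the next run starts fresh
    return executed
```

Skipping completed steps only helps if each step is itself safe to rerun from its start, which is why the slide asks for every job to be restartable.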
Reusability
Reusability is very important in software projects
DataStage allows reusability in the following forms:
Shared Containers
Build Ops
Common Jobs
Routines
Templates
Shared Containers are the best form of reusability in DataStage. Typical candidates for a Shared Container are:
Sequence ID generation logic
Errors/warnings generation and loading
Loading landing tables with common functionality
Common business rules and logic
A Container is a group of stages and links that performs a particular task. The container wraps the complex logic into one unit and acts as a stage.
Reusability
Build Ops provide the flexibility to write your own logic
Build Ops can be used to provide common functionality within or across modules when achieving that functionality with DataStage stages is complex.
Code ease: Handling complex conditions, such as many nested if-else statements or many stage variables and their computation, is much easier in a BuildOp than in a Transformer stage.
Coding liberties: A BuildOp allows the use of data structures such as arrays and strings, loop statements such as for and while loops, and many other normal coding paradigms. It also allows the use of various header files and their built-in functions. For example, including string.h provides APT_String, which can be used for string declarations and other string operations. These coding features are otherwise not easy to use in DataStage.
Reusability
A Common Job performs common tasks across the project or modules, taking different parameters for different contexts
Common Jobs should be run as multiple-instance jobs so that several instances can run in parallel
Routines help perform pre-job and post-job activities such as copying input files to different directories, ACR file generation, log files, etc.
Clearly dividing activities among shell scripts, DataStage jobs, Routines, Sequences, and the generic shell script is key to clean separation and consistency across the project. This influences job design
The job template should contain generic Annotations that act as a guideline when creating jobs
All parameters that are common across jobs should be defined in the job templates
Specific stage properties that are common or mandatory should be set in the job templates
Templates act as a design pattern/guideline for achieving consistency and strictly enforcing dos and don'ts
Identifying common patterns and defining templates will achieve consistency
A few reusable components will evolve as the project progresses, but enough effort should be spent up front to bring out reusable components. Piloting a module is another option for bringing out reusable components
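One way to picture a parameterized, multiple-instance Common Job is the command line a generic wrapper script might build for it. The sketch below assembles a `dsjob -run` invocation in which the instance is addressed as `job.invocation_id`. The project, job, invocation, and parameter names are hypothetical, and the choice of `-wait` and `-jobstatus` is an assumption about how such a wrapper would want to behave.

```python
import shlex


def dsjob_run_command(project, job, invocation_id, params):
    """Build the argument list for running one instance of a multiple-instance job."""
    command = ["dsjob", "-run", "-wait", "-jobstatus"]
    for name, value in sorted(params.items()):
        command += ["-param", f"{name}={value}"]
    # A multiple-instance job is addressed as <job>.<invocation id>.
    command += [project, f"{job}.{invocation_id}"]
    return command


def as_shell_line(command):
    """Render the argument list as a single shell-quoted line for logging."""
    return shlex.join(command)
```

Calling the builder once per source, each time with a different invocation id and parameter set, reflects the point above: one job design serves several contexts and the instances can run side by side.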
Modularity and Maintainability
Modularity and maintainability are another influencing factor in job design
Reusable components and restartability bring the required modularity and maintainability
A proper balance needs to be struck between modularity and I/O operations in a job, taking restartability into consideration
Performance considerations and maintainability should be properly balanced. For example, reducing the number of Transformers in a job will enhance performance, but this should not come at the cost of maintainability.
Performance Considerations
Identifying the correct stage for the required functionality is key in job design
The sequencing of stages in a job should be decided with performance in mind; for example, avoid repartitioning
Using temporary tables, work tables, or datasets may enhance performance by reducing the load on jobs, which will influence job design
Make sure all the environment variables that can influence performance are part of the template
Consider the volume of data when choosing a stage
Detailed points that can influence job performance are covered in performance tuning
Metadata Management
Job design is influenced by metadata management considerations
Jobs should not be driven by reject links.
To avoid reject links, Lookups should select a dummy column from the reference link, and that column should be checked in subsequent stages such as a Transformer.
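The dummy-column technique can be illustrated outside DataStage with a plain left-join lookup: rows that miss the reference are kept rather than rejected, and a later step checks the extra column. The column name `ref_found` and both function names are hypothetical.

```python
def lookup_with_flag(rows, reference, key):
    """Join rows to reference data without rejecting unmatched rows.

    Every output row carries a dummy 'ref_found' column that says whether
    the reference side supplied a match.
    """
    output = []
    for row in rows:
        match = reference.get(row[key])
        enriched = dict(row)
        enriched.update(match or {})
        enriched["ref_found"] = match is not None
        output.append(enriched)
    return output


def route(rows):
    """Downstream check (the Transformer's role): split rows on the dummy column."""
    matched = [r for r in rows if r["ref_found"]]
    unmatched = [r for r in rows if not r["ref_found"]]
    return matched, unmatched
```

Because the unmatched rows stay on the main link, the job's flow is decided by an ordinary column test rather than by the presence of a reject link.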