DataStage exclusive


  • 8/9/2019 datastage exckusive

    1/42

    1. Explain how you can implement slowly changing dimensions in DataStage.

    2. Is it possible to join flat file and database in datastage? If yes, how?

    3. What is the exact difference between the Join, Merge and Lookup stages?

    4. What is DS Director used for?

    5. In what way can you implement Lookup in DataStage Server jobs?

    6. How can you implement Complex Jobs in datastage?

    7. What is Merge, How is it used?

    8. State the difference between Datastage and Informatica?

    9. State the difference between server jobs and parallel jobs.

    10. Is it possible to run parallel jobs in server jobs?

    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

    %%%%%%%%%%

    Difference between Query calculation and Layout calculation

    Query calculation is used to deal with tables (to perform any changes to the query).

    Layout calculations can be used to perform any changes regarding the appearance of reports.

    Query Calculation is used to perform Data Scrubbing.

    Layout Calculation is used to provide run-time information, but is not used to perform any operation on the data.

    $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$4

    In DataStage, what is scheduling and how does it work?


    In short, DataStage consolidates data. Historical and current data has to be loaded into a database, and DataStage provides different components for this. Basically, DataStage is an ETL tool: it extracts the data, applies particular rules, and loads it into the database or format the user wants.

    There are two ways to schedule a job:

    1] through DataStage

    2] through AutoSys


    In DataStage, after a job is developed and compiled cleanly in Designer, it can be run immediately or scheduled in Director to run at a particular frequency (daily, weekly, etc.). Scheduling simply determines when the job runs.

    A schedule specifies the date and time at which to run the job. It can be created with the DataStage client components: to schedule a job, use the Run Job option and select the date and time.

    &&&&

    Look for DSTX

    &&&

    1. What is the flow of loading data into fact & dimensional tables?

    A) Fact table - Table with Collection of Foreign Keys corresponding to the Primary

    Keys in Dimensional table. Consists of fields with numeric values.

    Dimension table - Table with Unique Primary Key.


    Load - Data should be first loaded into dimensional table. Based on the primary key

    values in dimensional table, the data should be loaded into Fact table.
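The load order described above (dimension first, then fact) can be sketched in plain Python. This is only an illustration outside DataStage: the column names `customer_code` and `amount`, and the use of a dictionary as the dimension table, are assumptions made for the example.

```python
def load_dimension(rows, dim_table):
    """Insert each business key once and assign it the next surrogate key."""
    for row in rows:
        key = row["customer_code"]
        if key not in dim_table:
            dim_table[key] = len(dim_table) + 1  # next surrogate key
    return dim_table


def load_fact(rows, dim_table):
    """Replace the business key in each fact row with the dimension's key."""
    facts = []
    for row in rows:
        facts.append({
            "customer_sk": dim_table[row["customer_code"]],  # foreign key
            "amount": row["amount"],                         # numeric measure
        })
    return facts


source = [
    {"customer_code": "C10", "amount": 25.0},
    {"customer_code": "C20", "amount": 40.0},
    {"customer_code": "C10", "amount": 15.0},
]
dim = load_dimension(source, {})   # step 1: dimension table
fact = load_fact(source, dim)      # step 2: fact table, using dimension keys
print(dim)
print(fact)
```

Loading the fact rows first would fail here, because the key lookup in `load_fact` depends on the dimension already being populated.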

    2. What is the default cache size? How do you change the cache size if needed?

    A. Default cache size is 256 MB. We can increase it by going into Datastage

    Administrator and selecting the Tunable Tab and specify the cache size over there.

    3. What are types of Hashed File?

    A) Hashed File is classified broadly into 2 types.

    a) Static - Sub divided into 17 types based on Primary Key Pattern.

    b) Dynamic - sub divided into 2 types

    i) Generic ii) Specific.

    Dynamic files do not perform as well as a well-designed static file, but they do perform better than a badly designed one. When creating a dynamic file you can specify several parameters (although all of these have default values).

    By default a Hashed File is "Dynamic - Type Random 30 D".

    4. What does a Config File in parallel extender consist of?

    A) Config file consists of the following.

    a) Number of Processes or Nodes.

    b) Actual Disk Storage Location.

    5. What is Modulus and Splitting in Dynamic Hashed File?

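    A. A dynamic hashed file resizes itself automatically as data is added or removed.

    The "Modulus" is the current number of groups in the file.

    "Splitting" is the process of dividing a group in two when the file grows, which increases the modulus; the reverse process, when the file shrinks, is called merging.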

    6. What are Stage Variables, Derivations and Constraints?

    A. Stage Variable - An intermediate processing variable that retains its value during read and doesn't pass the value into a target column.

    Derivation - Expression that specifies the value to be passed on to the target column.

    Constraint - Condition that is either true or false and that controls the flow of data on a link.
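The three roles above can be imitated in a few lines of Python. This is a rough analogy, not Transformer code: `prev_key` stands in for a stage variable, the `if` test for the true/false condition on a link, and `key.upper()` for a derivation. All names are hypothetical.

```python
def first_row_per_key(keys):
    """Pass on only the first row of each run of equal keys, upper-cased."""
    prev_key = None                       # stage variable: kept between rows
    passed = []
    for key in keys:
        is_new_group = key != prev_key    # evaluated once per input row
        prev_key = key                    # retained for the next row
        if is_new_group:                  # true/false condition gating the link
            passed.append(key.upper())    # derivation: value for target column
    return passed


print(first_row_per_key(["a", "a", "b", "c", "c"]))
```

The stage variable itself never reaches the output; it only carries state from one row to the next.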

    7. Types of views in Datastage Director?

    There are 3 types of views in Datastage Director

    a) Job View - Dates of Jobs Compiled.

    b) Log View - Warning Messages, Event Messages, Program Generated Messages.

    c) Status View - Status of Job last run.

    8. Types of Parallel Processing?

    A) Parallel Processing is broadly classified into 2 types.

    a) SMP - Symmetric Multiprocessing.

    b) MPP - Massively Parallel Processing.

    9. Orchestrate Vs Datastage Parallel Extender?

    A) Orchestrate is itself an ETL tool with extensive parallel processing capabilities that runs on the UNIX platform. DataStage used Orchestrate with DataStage XE (the beta version of 6.0) to incorporate parallel processing capabilities. The DataStage vendor has since purchased Orchestrate, integrated it with DataStage XE, and released it as a new version, DataStage 6.0, i.e. Parallel Extender.

    10. Importance of Surrogate Key in Data warehousing?

    A) A Surrogate Key is a Primary Key for a Dimension table. Its main importance is that it is independent of the underlying database, i.e. the Surrogate Key is not affected by the changes going on in the database.

    11. How to run a Shell Script within the scope of a Data stage job?

    A) By using the "ExecSH" command at Before/After job properties.

    12. How to handle Date conversions in DataStage? Convert a mm/dd/yyyy format to yyyy-dd-mm?

    A) We use a) "Iconv" function - Internal Conversion.

    b) "Oconv" function - External Conversion.

    Function to convert mm/dd/yyyy format to yyyy-dd-mm is

    Oconv(Iconv(FieldName,"D/MDY[2,2,4]"),"D-YDM[4,2,2]")
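The same two-step idea (parse the external string into an internal date value, then format it for output) can be checked outside DataStage with Python's standard library. This is only an analogy to Iconv/Oconv, not their implementation; the function name `mdy_to_ydm` is made up for the example.

```python
from datetime import datetime


def mdy_to_ydm(text):
    """Convert 'mm/dd/yyyy' to 'yyyy-dd-mm'."""
    parsed = datetime.strptime(text, "%m/%d/%Y")  # external -> internal value
    return parsed.strftime("%Y-%d-%m")            # internal -> output format


print(mdy_to_ydm("08/09/2019"))  # 2019-09-08
```

Note that the target layout puts the day before the month, so 9 August 2019 becomes 2019-09-08.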

    13. How do you execute a DataStage job from the command line prompt?

    A) Using "dsjob" command as follows.

    dsjob -run -jobstatus projectname jobname
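When the command above has to be issued from a script, a thin wrapper is one option. The sketch below uses only the flags quoted above (`-run`, `-jobstatus`); it assumes `dsjob` is on the PATH, and the helper names are hypothetical.

```python
import subprocess


def build_dsjob_command(project, job):
    """Return the argument list for the dsjob call shown above."""
    return ["dsjob", "-run", "-jobstatus", project, job]


def run_job(project, job):
    """Run the job and return dsjob's exit code (needs dsjob installed)."""
    completed = subprocess.run(build_dsjob_command(project, job))
    return completed.returncode


print(build_dsjob_command("projectname", "jobname"))
```

Passing the arguments as a list avoids shell quoting problems when project or job names contain unusual characters.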

    14. Functionality of Link Partitioner and Link Collector?

    Link Partitioner: It actually splits data into various partitions or data flows using

    various partition methods.

    Link Collector: It collects the data coming from partitions, merges it into a single data

    flow and loads to target.
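The split-then-merge behaviour can be pictured with a small Python sketch. Round robin is used here as one example of a partition method; the function names are invented for the illustration and do not correspond to DataStage APIs.

```python
def partition_round_robin(rows, n):
    """Deal rows out to n partitions in turn, like a link partitioner."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts


def collect(parts):
    """Merge all partitions back into a single flow, like a link collector."""
    merged = []
    for part in parts:
        merged.extend(part)
    return merged


parts = partition_round_robin(list(range(7)), 3)
print(parts)           # [[0, 3, 6], [1, 4], [2, 5]]
print(collect(parts))  # [0, 3, 6, 1, 4, 2, 5]
```

The collector returns every row exactly once, but not necessarily in the original order.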

    15. Types of Dimensional Modeling?

    A) Dimensional modeling is again sub divided into 2 types.

    a) Star Schema - Simple & Much Faster. Denormalized form.

    b) Snowflake Schema - Complex with more Granularity. More normalized form.

    16. Differentiate Primary Key and Partition Key?

    A Primary Key is a combination of unique and not null. It can be a collection of key columns, called a composite primary key. A Partition Key is just a part of the Primary Key. There are several methods of partitioning, such as Hash, DB2, and Random. When using Hash partitioning we specify the Partition Key.
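Why hash partitioning needs a key can be seen in a short sketch: the partition is chosen from the key value, so rows with equal keys always land together. `zlib.crc32` is used here only as a convenient deterministic hash; it is an assumption for the example, not the hash DataStage uses, and the column name `cust` is hypothetical.

```python
import zlib


def hash_partition(rows, key, n):
    """Assign each row to one of n partitions based on its key column."""
    parts = [[] for _ in range(n)]
    for row in rows:
        bucket = zlib.crc32(str(row[key]).encode("utf-8")) % n
        parts[bucket].append(row)
    return parts


rows = [
    {"cust": "A", "v": 1},
    {"cust": "B", "v": 2},
    {"cust": "A", "v": 3},
]
for number, part in enumerate(hash_partition(rows, "cust", 4)):
    print(number, part)
```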

    17. Differentiate Database data and Data warehouse data?

    A) Data in a Database is

    a) Detailed or Transactional


    b) Both Readable and Writable.

    c) Current.

    18. Containers Usage and Types?

    Container is a collection of stages used for the purpose of Reusability.

    There are 2 types of Containers.

    a) Local Container: Job Specific

    b) Shared Container: Used in any job within a project.

    19. Compare and Contrast ODBC and Plug-In stages?

    ODBC: a) Poor Performance.

    b) Can be used for Variety of Databases.

    c) Can handle Stored Procedures.

    Plug-In: a) Good Performance.

    b) Database specific. (Only one database)

    c) Cannot handle Stored Procedures.

    20. Dimension Modelling types along with their significance

    Data Modelling is Broadly classified into 2 types.

    a) E-R Diagrams (Entity - Relationships).

    b) Dimensional Modelling.

    Q 21 What are Ascential DataStage Products, Connectivity

    Ans:

    Ascential Products

    Ascential DataStage

    Ascential DataStage EE (3)

    Ascential DataStage EE MVS

    Ascential DataStage TX


    Ascential QualityStage

    Ascential MetaStage

    Ascential RTI (2)

    Ascential ProfileStage

    Ascential AuditStage

    Ascential Commerce Manager

    Industry Solutions

    Connectivity

    Files

    RDBMS

    Real-time

    PACKs

    EDI

    Other

    Q 22 Explain Data Stage Architecture?

    Data Stage contains two components,

    Client Component.

    Server Component.

    Client Component:

    Data Stage Administrator

    Data Stage Manager

    Data Stage Designer

    Data Stage Director

    Server Components:

    Data Stage Engine

    Meta Data Repository

    Package Installer

    Data Stage Administrator:

    Used to create the project.

    Contains set of properties

    We can set the buffer size (by default 128 MB)

    We can increase the buffer size.

    We can set the Environment Variables.

    In Tunables we have in-process and inter-process options.

    In-process: Data is read in sequentially.

    Inter-process: It reads the data as it comes.

    It just interfaces to metadata.

    Data Stage Manager:

    We can view and edit the Meta data Repository.

    We can import table definitions.

    We can export the Data stage components in .xml or .dsx format.

    We can create routines and transforms

    We can compile the multiple jobs.

    Data Stage Designer:

    We can create the jobs. We can compile the job. We can run the job. We can

    declare stage variable in transform, we can call routines, transform, macros, functions.

    We can write constraints.

    Data Stage Director:

    We can run the jobs.

    We can schedule the jobs. (Schedule can be done daily, weekly, monthly, quarterly)


    We can monitor the jobs.

    We can release the jobs.

    Q 23 What is Meta Data Repository?

    Meta Data is data about the data.

    It also contains

    Query statistics

    ETL statistics

    Business subject area

    Source Information

    Target Information

    Source to Target mapping Information.

    Q 24 What is Data Stage Engine?

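    It is the server engine that runs in the background and executes the jobs; the server engine is derived from the UniVerse database engine rather than being written in Java.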

    Q 25 What is Dimensional Modeling?

    Dimensional Modeling is a logical design technique that seeks to present the data

    in a standard framework that is, intuitive and allows for high performance access.

    Q 26 What is Star Schema?

    Star Schema is a de-normalized multi-dimensional model. It contains centralized fact

    tables surrounded by dimensions table.

    Dimension Table: It contains a primary key and description about the fact table.

    Fact Table: It contains foreign keys to the dimension tables, measures and aggregates.

    Q 27 What is surrogate Key?

    It is a 4-byte integer which replaces the transaction / business / OLTP key in the dimension table. It can identify up to about 2 billion records.
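The "2 billion" figure follows from the key width. Assuming the 4-byte integer is signed (an assumption; the text does not say), the largest positive value works out as follows:

```python
BITS = 4 * 8                      # a 4-byte integer has 32 bits
max_signed = 2 ** (BITS - 1) - 1  # one bit is used for the sign

print(max_signed)  # 2147483647
```

That is roughly 2.1 billion distinct positive key values.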

    Q 28 Why do we need a surrogate key?

    It is used for integrating data and may serve better than the natural primary key. It helps with index maintenance, joins, table size, key updates, disconnected inserts and partitioning.

    Q 29 What is Snowflake schema?

    It is a partially normalized dimensional model in which at least one dimension is represented by two or more hierarchy-related tables.

    Q 30 Explain Types of Fact Tables?

    Factless Fact: It contains only foreign keys to the dimension tables.

    Additive Fact: Measures can be added across any dimensions.

    Semi-Additive: Measures can be added across some dimensions. Eg, percentage, discount

    Non-Additive: Measures cannot be added across any dimensions. Eg, Average

    Conformed Fact: The equation or the measures of the two fact tables are the same, and the facts are measured across the dimensions with the same set of measures.
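The difference between additive and non-additive measures shows up in a small worked example. The store names and figures below are invented for illustration: units and revenue can be summed across stores, while an average price cannot.

```python
sales = [
    {"store": "S1", "units": 10, "revenue": 100.0},
    {"store": "S2", "units": 30, "revenue": 600.0},
]

# Additive measures: summing across the store dimension is meaningful.
total_units = sum(r["units"] for r in sales)        # 40
total_revenue = sum(r["revenue"] for r in sales)    # 700.0

# Non-additive measure: per-store average price (10.0 and 20.0).
avg_prices = [r["revenue"] / r["units"] for r in sales]
summed_avg = sum(avg_prices)                # 30.0, not a meaningful price
true_avg = total_revenue / total_units      # 17.5, recomputed from additive facts

print(total_units, total_revenue, summed_avg, true_avg)
```

This is why averages are usually recomputed from additive components rather than stored and summed.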

    Q 31 Explain the Types of Dimension Tables?

    Conformed Dimension: If a dimension table is connected to more than one fact table, the granularity defined in the dimension table is common across those fact tables.

    Junk Dimension: A dimension table that contains only flags.

    Monster Dimension: A dimension that changes rapidly is known as a Monster Dimension.

    Degenerate Dimension: It is a line item-oriented fact table design.

    Q 32 What are stage variables?

    Stage variables are declaratives in Transformer Stage used to store values. Stage

    variables are active at the run time. (Because memory is allocated at the run time).

    Q 33 What is sequencer?


    It sets the sequence of execution of server jobs.

    Q 34 What are Active and Passive stages?

    Active Stage: Active stages model the flow of data and provide mechanisms for combining data streams, aggregating data and converting data from one data type to another. Eg, Transformer, Aggregator, Sort, Row Merger etc.

    Passive Stage: A Passive stage handles access to Database for the extraction or writing

    of data. Eg, IPC stage, File types, Universe, Unidata, DRS stage etc.

    Q 35 What is ODS?

    Operational Data Store is a staging area where data can be rolled back.

    Q 36 What are Macros?

    They are built from Data Stage functions and do not require arguments.

    A number of macros are provided in the JOBCONTROL.H file to facilitate getting

    information about the current job, and links and stages belonging to the current job.

    These can be used in expressions (for example for use in Transformer stages), job control

    routines, filenames and table names, and before/after subroutines.

    These macros provide the functionality of using the DSGetProjectInfo, DSGetJobInfo,

    DSGetStageInfo, and DSGetLinkInfo functions with the DSJ.ME token as the JobHandle

    and can be used in all active stages and before/after subroutines. The macros provide the

    functionality for all the possible InfoType arguments for the DSGetInfo functions. See

    the Function call help topics for more details.

    The available macros are:

    DSHostName

    DSProjectName

    DSJobStatus

    DSJobName


    DSJobController

    DSJobStartDate

    DSJobStartTime

    DSJobStartTimestamp

    DSJobWaveNo

    DSJobInvocations

    DSJobInvocationId

    DSStageName

    DSStageLastErr

    DSStageType

    DSStageInRowNum

    DSStageVarList

    DSLinkRowCount

    DSLinkLastErr

    DSLinkName

    Examples

    To obtain the name of the current job:

    MyName = DSJobName

    To obtain the full current stage name:

    MyName = DSJobName : "." : DSStageName

    Q 37 What is KeyMgtGetNextValue?

    It is a built-in transform that generates sequential numbers. Its input type is a literal string and its output type is a string.
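The behaviour described (a named sequence that hands out the next number as a string) can be mimicked in Python. This is a hypothetical stand-in, not the DataStage transform: the class name, the in-memory storage and the starting value of 1 are all assumptions made for the illustration.

```python
class SequenceStore:
    """Keeps one independent counter per sequence name."""

    def __init__(self):
        self._next = {}

    def get_next_value(self, name):
        """Return the next number for the named sequence, as a string."""
        value = self._next.get(name, 1)   # assumed to start at 1
        self._next[name] = value + 1
        return str(value)


store = SequenceStore()
print(store.get_next_value("CUSTOMER"))  # 1
print(store.get_next_value("CUSTOMER"))  # 2
print(store.get_next_value("PRODUCT"))   # 1
```

Each name has its own counter, which is how one mechanism can supply keys for several tables.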

    Q 38 What are stages?


    The stages are either passive or active stages.

    Passive stages handle access to databases for extracting or writing data.

    Active stages model the flow of data and provide mechanisms for combining data

    streams, aggregating data, and converting data from one data type to another.

    Q 39 What index is created on Data Warehouse?

    Bitmap index is created in Data Warehouse.

    Q 40 What is container?

    A container is a group of stages and links. Containers enable you to simplify and

    modularize your server job designs by replacing complex areas of the diagram with a

    single container stage. You can also use shared containers as a way of incorporating

    server job functionality into parallel jobs.

    DataStage provides two types of container:

    Local containers. These are created within a job and are only accessible by that job. A local container is edited in a tabbed page of the job's Diagram window.

    Shared containers. These are created separately and are stored in the Repository in the same way that jobs are. There are two types of shared container

    Q 41 What is function? ( Job Control Examples of Transform Functions )

    Functions take arguments and return a value.

    BASIC functions: A function performs mathematical or string manipulations on the arguments supplied to it, and returns a value. Some functions have 0 arguments; most have 1 or more. Arguments are always in parentheses, separated by commas, as shown in this general syntax:

    FunctionName (argument, argument)

    DataStage BASIC functions: These functions can be used in a job control routine, which is defined as part of a job's properties and allows other jobs to be

    DSGetParamInfo - Get information about a controlled job's parameters.

    DSGetLogEntry - Get the log event from the job log.

    DSGetLogSummary - Get a number of log events on the specified subject from the job log.

    DSGetNewestLogId - Get the newest log event, of a specified type, from the job log.

    DSLogEvent - Log an event to the job log of a different job.

    DSStopJob - Stop a controlled job.

    DSDetachJob - Return a job handle previously obtained from DSAttachJob.

    DSLogFatal - Log a fatal error message in a job's log file and abort the job.

    DSLogInfo - Log an information message in a job's log file.

    DSLogToController - Put an info message in the job log of a job controlling the current job.

    DSLogWarn - Log a warning message in a job's log file.

    DSMakeJobReport - Generate a string describing the complete status of a valid attached job.

    DSMakeMsg - Insert arguments into the message template.

    DSPrepareJob - Ensure a job is in the correct state to be run or validated.

    DSSendMail - Interface to system send mail facility.

    DSTransformError - Log a warning message to a job log file.

    DSTranslateCode - Convert a job control status or error code into an explanatory text message.

    DSWaitForFile - Suspend a job until a named file either exists or does not exist.

    DSCheckRoutine - Check if a BASIC routine is cataloged, either in VOC as a callable item, or in the catalog space.

    DSExecute - Execute a DOS or Data Stage Engine command from a before/after subroutine.

    DSSetUserStatus - Set a status message for a job to return as a termination message when it finishes.

    Q 42 What are Routines?

    Routines are stored in the Routines branch of the Data Stage Repository, where you can

    create, view or edit. The following programming components are classified as routines:

    Transform functions, Before/After subroutines, Custom UniVerse functions, ActiveX

    (OLE) functions, Web Service routines

    Q 43 What is a DataStage Transform?

    Q 44 What are Meta Brokers?

    Q 45 What is usage analysis?

    Q 46 What is a job sequencer?

    Q 47 What are the different activities in a job sequencer?

    Q 48 What are triggers in DataStage? (conditional, unconditional, otherwise)

    Q 49 Have you generated job reports?

    Q 50 What is a plug-in?

    Q 51 Have you created any custom transform? Explain? (Oconv)

    Question: Dimension Modeling types along with their significance

    Answer:

    Data Modelling is broadly classified into 2 types.

    A) E-R Diagrams (Entity - Relationships).

    B) Dimensional Modelling.

    Question: Dimensional modelling is again sub divided into 2 types.

    Answer:

    A) Star Schema - Simple & Much Faster. Denormalized form.

    B) Snowflake Schema - Complex with more Granularity. More normalized form.

    Question: Importance of Surrogate Key in Data warehousing?

    Answer:

    A Surrogate Key is a Primary Key for a Dimension table. Its main importance is that it is independent of the underlying database, i.e. the Surrogate Key is not affected by the changes going on in the database.

    Question: Differentiate Database data and Data warehouse data?

    Answer:

    Data in a Database is

    A) Detailed or Transactional

    B) Both Readable and Writable.

    C) Current.

    Question: What is the flow of loading data into fact & dimensional tables?


    Answer:

    Fact table - Table with Collection of Foreign Keys corresponding to the Primary Keys

    in Dimensional table. Consists of fields with numeric values.

    Dimension table - Table with Unique Primary Key.

    Load - Data should be first loaded into dimensional table. Based on the primary key

    values in dimensional table, then data should be loaded into Fact table.

    Question: Orchestrate Vs Datastage Parallel Extender?

    Answer:

    Orchestrate is itself an ETL tool with extensive parallel processing capabilities that runs on the UNIX platform. DataStage used Orchestrate with DataStage XE (the beta version of 6.0) to incorporate parallel processing capabilities. The DataStage vendor has since purchased Orchestrate, integrated it with DataStage XE, and released it as a new version, DataStage 6.0, i.e. Parallel Extender.

    Question: Differentiate Primary Key and Partition Key?

    Answer:

    A Primary Key is a combination of unique and not null. It can be a collection of key columns, called a composite primary key. A Partition Key is just a part of the Primary Key. There are several methods of partitioning, such as Hash, DB2, and Random. When using Hash partitioning we specify the Partition Key.

    Question: What are Stage Variables, Derivations and Constraints?

    Answer:

    Stage Variable - An intermediate processing variable that retains its value during read and doesn't pass the value into a target column.

    Constraint - Condition that is either true or false and that controls the flow of data on a link.

    Derivation - Expression that specifies the value to be passed on to the target column.

    Question: What is the default cache size? How do you change the cache size if

    needed?

    Answer:

    Default cache size is 256 MB. We can increase it by going into Datastage Administrator

    and selecting the Tunable Tab and specify the cache size over there.

    Question: What is Hash file stage and what is it used for?

    Answer:

    Used for Look-ups. It is like a reference table. It is also used in-place of ODBC, OCI

    tables for better performance.

    Question: What are types of Hashed File?

    Answer:

    Hashed File is classified broadly into 2 types.

    A) Static - Sub divided into 17 types based on Primary Key Pattern.

    B) Dynamic - sub divided into 2 types

    i) Generic

    ii) Specific

    The default Hashed File is "Dynamic - Type Random 30 D".

    Question: What are Static Hash files and Dynamic Hash files?

    Answer:

    As the names themselves suggest. In general we use Type-30 dynamic hash files. The data file has a default size of 2 GB, and the overflow file is used if the data exceeds the 2 GB size.

    Question: What is the Usage of Containers? What are its types?

    Answer:

    Container is a collection of stages used for the purpose of Reusability.

    There are 2 types of Containers.

    A) Local Container: Job Specific

    B) Shared Container: Used in any job within a project.

    Question: Compare and Contrast ODBC and Plug-In stages?

    Answer:

    ODBC                                    PLUG-IN

    Poor Performance                        Good Performance

    Can be used for Variety of Databases    Database Specific (only one database)

    Can handle Stored Procedures            Cannot handle Stored Procedures

    Question: How do you execute datastage job from command line prompt?

    Answer:

    Using "dsjob" command as follows.

    dsjob -run -jobstatus projectname jobname

    Question: What are the command line functions that import and export the DS

    jobs?

    Answer:

    dsimport.exe - imports the DataStage components.

    dsexport.exe - exports the DataStage components.

    Question: How to run a Shell Script within the scope of a Data stage job?

    Answer:

    By using the "ExecSH" command at Before/After job properties.

    Question: What are OConv () and Iconv () functions and where are they used?

    Answer:

    IConv() - Converts a string to an internal storage format

    OConv() - Converts an expression to an output format.

    Question: How to handle Date conversions in DataStage? Convert mm/dd/yyyy format to yyyy-dd-mm?

    Answer:

    We use

    a) "Iconv" function - Internal Conversion.

    b) "Oconv" function - External Conversion.

    Function to convert mm/dd/yyyy format to yyyy-dd-mm is

    Oconv(Iconv(FieldName,"D/MDY[2,2,4]"),"D-YDM[4,2,2]")

    Question: Types of Parallel Processing?

    Answer:

    Parallel Processing is broadly classified into 2 types.

    a) SMP - Symmetric Multiprocessing.

    b) MPP - Massively Parallel Processing.

    Question: What does a Config File in parallel extender consist of?

    Answer:

    Config file consists of the following.

    a) Number of Processes or Nodes.


    b) Actual Disk Storage Location.

    Question: Functionality of Link Partitioner and Link Collector?

    Answer:

    Link Partitioner: It actually splits data into various partitions or data flows using various

    Partition methods.

    Link Collector: It collects the data coming from partitions, merges it into a single data

    flow and loads to target.

    Question: What is Modulus and Splitting in Dynamic Hashed File?

    Answer:

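    A dynamic hashed file resizes itself automatically as data is added or removed.

    The "Modulus" is the current number of groups in the file.

    "Splitting" is the process of dividing a group in two when the file grows, which increases the modulus; the reverse process, when the file shrinks, is called merging.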

    Question: Types of views in Datastage Director?

    Answer:

    There are 3 types of views in Datastage Director

    a) Job View - Dates of Jobs Compiled.

    b) Log View - Warning Messages, Event Messages, Program Generated Messages.

    c) Status View - Status of Job last run.

    Question: Did you parameterize the job or hard-code the values in the jobs?

    Answer:

    Always parameterize the job. The values come either from Job Properties or from a Parameter Manager, a third-party tool. You should never hard-code parameters in your jobs. The most often parameterized variables in a job are: DB DSN name, username, password, and the dates against which the data is to be looked up.

    Question: Have you ever been involved in updating DS versions, like DS 5.X? If so, tell us some of the steps you took in doing so.

    Answer:

    Yes.

    The following are some of the steps:

    Definitely take a backup of the whole project(s) by exporting the project as a .dsx file.

    See that you use the same parent folder for the new version, so that old jobs using hard-coded file paths still work.

    After installing the new version, import the old project(s); you then have to compile them all again. You can use the 'Compile All' tool for this.

    Make sure that all your DB DSNs are created with the same names as the old ones. This step is for moving DS from one machine to another.

    If you are just upgrading your DB from Oracle 8i to Oracle 9i, there is a tool on the DS CD that can do this for you.

    Do not stop the 6.0 server before the upgrade; the version 7.0 install process collects project information during the upgrade. There is NO rework (recompilation of existing jobs/routines) needed after the upgrade.

    Question: How did you handle reject data?

    Answer:

    Typically a Reject link is defined and the rejected data is loaded back into the data warehouse. So a Reject link has to be defined for every Output link from which you wish to collect rejected data. Rejected data is typically bad data, like duplicates of primary keys or null rows where data is expected.
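The routing idea (good rows to the output, bad rows to a reject stream) can be sketched in Python. The two rejection rules come from the answer above, duplicate primary keys and nulls where data is expected; the function and column names are hypothetical.

```python
def split_rejects(rows, key, required):
    """Return (accepted, rejected) lists of rows."""
    seen = set()
    accepted, rejected = [], []
    for row in rows:
        missing = any(row.get(col) is None for col in required)  # null check
        duplicate = row.get(key) in seen                         # key reuse
        if missing or duplicate:
            rejected.append(row)       # goes to the reject link
        else:
            seen.add(row[key])
            accepted.append(row)       # goes to the normal output link
    return accepted, rejected


rows = [
    {"id": 1, "name": "a"},
    {"id": 1, "name": "b"},     # duplicate primary key
    {"id": 2, "name": None},    # null where data is expected
    {"id": 3, "name": "c"},
]
accepted, rejected = split_rejects(rows, "id", ["id", "name"])
print(accepted)
print(rejected)
```

Keeping the rejected rows, rather than dropping them, is what makes later inspection or reloading possible.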


    Question: What are other Performance tunings you have done in your last project

    to increase the performance of slowly running jobs?

    Answer:

    Staged the data coming from ODBC/OCI/DB2UDB stages or any database on the server using Hash/Sequential files for optimum performance and also for data recovery in case the job aborts.

    Tuned the OCI stage for 'Array Size' and 'Rows per Transaction' numerical values for faster inserts, updates and selects.

    Tuned the 'Project Tunables' in Administrator for better performance.

    Used sorted data for Aggregator.

    Sorted the data as much as possible in DB and reduced the use of DS-Sort for better performance of jobs.

    Removed the data not used from the source as early as possible in the job.

    Worked with DB-admin to create appropriate Indexes on tables for better performance of DS queries.

    Converted some of the complex joins/business in DS to Stored Procedures on DS for faster execution of the jobs.

    If an input file has an excessive number of rows and can be split up, then use standard logic to run jobs in parallel.

    Before writing a routine or a transform, make sure that the required functionality is not already in one of the standard routines supplied in the sdk or ds utilities categories.

    Constraints are generally CPU intensive and take a significant amount of time to process. This may be the case if the constraint calls routines or external macros, but if it is inline code then the overhead will be minimal.

    Try to have the constraints in the 'Selection' criteria of the jobs itself. This will eliminate the unnecessary records even getting in before joins are made.

    Tuning should occur on a job-by-job basis.

    Use the power of DBMS.

    Try not to use a sort stage when you can use an ORDER BY clause in the database.

    Using a constraint to filter a record set is much slower than performing a SELECT ... WHERE.

    Make every attempt to use the bulk loader for your particular database. Bulk loaders are generally faster than using ODBC or OLE.

    Question: Tell me one situation from your last project where you faced a problem, and how did you solve it?

    Answer:

    1. The jobs in which data was read directly from OCI stages were running extremely slowly. I had to stage the data before sending it to the transformer to make the jobs run faster.

    2. The job aborted in the middle of loading some 500,000 rows. The options were either to clean/delete the loaded data and then run the fixed job, or to run the job again from the row at which the job had aborted. To make sure the load was proper we opted for the former.

    Question: Tell me the environment in your last projects

    Answer:

    Give the OS of the Server and the OS of the Client of your most recent project.

    Question: How did you connect with DB2 in your last project?

    Answer:

    Most of the time the data was sent to us in the form of flat files: the data is dumped and sent to us. In some cases where we needed to connect to DB2 for look-ups, we used ODBC drivers to connect to DB2 (or) DB2-UDB, depending on the situation and availability. Certainly DB2-UDB is better in terms of performance, as native drivers are always better than ODBC drivers. 'iSeries Access ODBC Driver 9.00.02.02' - ODBC driver to connect to AS400/DB2.

    Question: What are Routines and where/how are they written and have you written

    any routines before?

    Answer:

    Routines are stored in the Routines branch of the DataStage Repository, where you can create, view, or edit them.

    The following are different types of Routines:

    1. Transform Functions

    2. Before-After Job subroutines

    3. Job Control Routines

    Question: How did you handle an 'Aborted' sequencer?

    Answer:

    In almost all cases we have to manually delete from the DB the data that the sequencer inserted, fix the job, and then run the job again.

    Question: What are Sequencers?

    Answer:

    Sequencers are job control programs that execute other jobs with preset Job parameters.

    Question: Read the String functions in DS

    Answer:

    Functions like [] -> sub-string function and ':' -> concatenation operator

    Syntax:

    string [ [ start, ] length ]

    string [ delimiter, instance, repeats ]
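
    The two bracket forms above can be mimicked outside DataStage to see what they return. The following is a minimal Python sketch, not DataStage BASIC. It assumes that start is 1-based, that omitting start takes the rightmost length characters, and that the delimiter form returns repeats consecutive fields beginning at field number instance. The function names substring and field are hypothetical.

```python
def substring(s, length, start=None):
    """Emulate string[start, length]; with no start, take the last `length` characters."""
    if length <= 0:
        return ""
    if start is None:
        return s[-length:]
    start = max(start, 1)  # assumption: positions are 1-based
    return s[start - 1:start - 1 + length]


def field(s, delimiter, instance, repeats=1):
    """Emulate string[delimiter, instance, repeats]: `repeats` fields from field `instance`."""
    parts = s.split(delimiter)
    return delimiter.join(parts[instance - 1:instance - 1 + repeats])


if __name__ == "__main__":
    print(substring("DataStage", 4, 1))   # first four characters
    print(substring("DataStage", 5))      # last five characters
    print(field("a,b,c,d", ",", 2, 2))    # second and third fields
    print("Data" + "Stage")               # Python's + plays the role of the ':' operator
```

    The concatenation operator ':' has no special behaviour to emulate; it simply joins two strings.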

    Question: What will you do in a situation where somebody wants to send you a file, and you have to use that file as an input or reference and then run the job?

    Answer:

    Under Windows: Use the 'WaitForFileActivity' under the Sequencers and then run the job. You may be able to schedule the sequencer around the time the file is expected to arrive.

    Under UNIX: Poll for the file. Once the file has arrived, start the job or sequencer that depends on the file.
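
    Polling for a file can be sketched in a few lines. The Python example below is only an illustration of the idea; the function name wait_for_file, the timeout, and the polling interval are all assumptions, and starting the job afterwards is left to the caller.

```python
import os
import time


def wait_for_file(path, timeout_s, poll_interval_s=1.0):
    """Return True as soon as `path` exists, or False once `timeout_s` seconds have passed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)


if __name__ == "__main__":
    # Hypothetical landing path; replace with the real location of the expected file.
    if wait_for_file("/tmp/incoming/customers.dat", timeout_s=5, poll_interval_s=1):
        print("file arrived, start the dependent job or sequencer")
    else:
        print("file did not arrive in time")
```

    In practice you would usually also check that the sender has finished writing the file (for example by waiting for a separate marker file) before starting the job.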

    Question: What is the utility you use to schedule the jobs on a UNIX server other

    than using Ascential Director?

    Answer:

    Use the crontab utility together with the dsexecute() function, passing the proper parameters.

    Question: Did you work in a UNIX environment?

    Answer:

    Yes. It is one of the most important requirements.

    Question: How would you call an external Java function that is not supported by DataStage?

    Answer:

    Starting from DS 6.0 we have the ability to call external Java functions using a Java package from Ascential. Alternatively, we can use the command line to invoke the Java program, write its return values (if any) to a file, and use that file as a source in a DataStage job.

    Question: How will you determine the sequence of jobs to load into the data warehouse?

    Answer:

    First we execute the jobs that load the data into the Dimension tables, then the Fact tables, and then the Aggregator tables (if any).

    Question: The above might raise another question: why do we have to load the dimension tables first, and then the fact tables?

    Answer:

    As we load the dimension tables, their primary keys are generated, and these primary keys are the foreign keys in the Fact tables.
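
    The dependency can be seen in a small Python sketch. It is an illustration only: the table contents, the column names, and the functions load_dimension and load_facts are hypothetical, and the surrogate keys are simple counters.

```python
def load_dimension(natural_keys):
    """Assign a surrogate primary key to each distinct natural key, in arrival order."""
    dimension = {}
    for natural_key in natural_keys:
        if natural_key not in dimension:
            dimension[natural_key] = len(dimension) + 1
    return dimension


def load_facts(sales, customer_dim):
    """Build fact rows; each one needs the dimension key as its foreign key."""
    facts = []
    for customer, amount in sales:
        # Raises KeyError if the dimension row has not been loaded yet.
        facts.append({"customer_key": customer_dim[customer], "amount": amount})
    return facts


if __name__ == "__main__":
    customer_dim = load_dimension(["acme", "globex", "acme"])
    print(load_facts([("globex", 10.0), ("acme", 2.5)], customer_dim))
```

    Loading the facts before the dimension would fail at the key lookup, which is the reason for the ordering described above.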

    Question: Does selecting 'Clear the table and Insert rows' in the ODBC stage send a TRUNCATE statement to the DB, or does it do some kind of DELETE logic?

    Answer:

    There is no TRUNCATE on ODBC stages. The option is 'Clear the table and Insert rows', and it issues a DELETE FROM statement. On an OCI stage such as Oracle, you have both Clear and Truncate options. They differ radically in the permissions they need (Truncate requires you to have alter table permissions, whereas Delete doesn't).

    Question: How do you rename all of the jobs to support your new file-naming conventions?

    Answer:

    Create an Excel spreadsheet with the new and old names. Export the whole project as a dsx. Write a Perl program that does a simple rename of the strings by looking them up in the Excel file. Then import the new dsx file, preferably into a new project for testing. Recompile all jobs. Be aware that the job names must also be changed in your job control jobs or Sequencer jobs, so make the necessary changes to these Sequencers as well.
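
    The rename step itself is a plain text substitution. The sketch below uses Python rather than the Perl mentioned above, and takes the old-to-new mapping as a dictionary instead of reading it from a spreadsheet. The function name rename_jobs and the sample text are hypothetical; the sample is not real dsx syntax.

```python
import re


def rename_jobs(export_text, name_map):
    """Replace whole-word occurrences of old job names with new ones in a single pass."""
    if not name_map:
        return export_text
    # Longest names first, so a name that is a prefix of another does not win.
    names = sorted(name_map, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    return pattern.sub(lambda match: name_map[match.group(1)], export_text)


if __name__ == "__main__":
    sample = 'job "LoadCust"\ncalls "LoadCustHist" then "LoadCust"'
    print(rename_jobs(sample, {"LoadCust": "jb_load_customer"}))
```

    Matching on word boundaries leaves LoadCustHist untouched when only LoadCust is renamed, and the single pass prevents a new name from being renamed again by another entry in the mapping.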

    Question: When should we use ODS?

    Answer:

    DWHs are typically read-only and batch-updated on a schedule.

    ODSs are maintained in closer to real time and are trickle-fed constantly.

    Question: What other ETL tools have you worked with?

    Answer:

    Informatica, and also DataJunction if it is on your resume.

    Question: How good are you with PL/SQL?

    Answer:

    On a scale of 1-10, say 8.5-9.

    Question: What versions of DS have you worked with?

    Answer:

    DS 7.5, DS 7.0.2, DS 6.0, DS 5.2

    Question: What's the difference between Datastage Developers...?

    Answer:

    A DataStage developer is the one who codes the jobs. A DataStage designer is the one who designs the jobs: he or she works with the blueprints and decides which stages are required in developing the code.

    Question: What are the main differences between Ascential DataStage and

    Informatica PowerCenter?

    Answer:

    Chuck Kelley's Answer: You are right; they have pretty much similar functionality. However, what are the requirements for your ETL tool? Do you have large sequential files (1 million rows, for example) that need to be compared every day versus yesterday? If so, then ask how each vendor would do that. Think about what process they are going to do. Are they requiring you to load yesterday's file into a table and do lookups? If so, RUN!! Are they doing a match/merge routine that knows how to process this in sequential files? Then maybe they are the right one. It all depends on what you need the ETL to do. If you are small enough in your data sets, then either would probably be OK.

    Les Barbusinski's Answer: Without getting into specifics, here are some differences you may want to explore with each vendor:

    Does the tool use a relational or a proprietary database to store its metadata and scripts? If proprietary, why?

    What add-ons are available for extracting data from industry-standard ERP, Accounting, and CRM packages?

    Can the tool's metadata be integrated with third-party data modeling and/or business intelligence tools? If so, how and with which ones?

    How well does each tool handle complex transformations, and how much external scripting is required?

    What kinds of languages are supported for ETL script extensions?

    Almost any ETL tool will look like any other on the surface. The trick is to find out which one will work best in your environment. The best way I've found to make this determination is to ascertain how successful each vendor's clients have been using their product, especially clients who closely resemble your shop in terms of size, industry, in-house skill sets, platforms, source systems, data volumes and transformation complexity.

    Ask both vendors for a list of their customers with characteristics similar to your own that have used their ETL product for at least a year. Then interview each client (preferably several people at each site) with an eye toward identifying unexpected problems, benefits, or quirkiness with the tool that have been encountered by that customer. Ultimately, ask each customer whether, if they had it all to do over again, they'd choose the same tool, and why. You might be surprised at some of the answers.

    Joyce Bischoff's Answer: You should do a careful research job when selecting products.

    You should first document your requirements, identify all possible products and evaluate

    each product against the detailed requirements. There are numerous ETL products on the

    market and it seems that you are looking at only two of them. If you are unfamiliar with

    the many products available, you may refer to www.tdan.com, the Data Administration

    Newsletter, for product lists.

    If you ask the vendors, they will certainly be able to tell you which of their product's

    features are stronger than the other product's. Ask both vendors and compare the answers,

    which may or may not be totally accurate. After you are very familiar with the products,

    call their references and be sure to talk with technical people who are actually using the

    product. You will not want the vendor to have a representative present when you speak

    with someone at the reference site. It is also not a good idea to depend upon a high-level

    manager at the reference site for a reliable opinion of the product. Managers may paint a

    very rosy picture of any selected product so that they do not look like they selected an

    inferior product.

    Question: In how many places can you call Routines?

    Answer:

    Routines can be called in four places:

    1. Transform of routine

    a. Date Transformation

    b. Upstring Transformation

    2. Transform of the Before & After Subroutines

    3. XML transformation

    4. Web-based transformation

    Question: What is the Batch Program and how is it generated?

    Answer: A batch program is a program that DataStage itself generates and maintains at run time, but you can easily change it on the basis of your requirements (Extraction, Transformation, Loading). The batch program that is generated depends on the nature of your job, either a simple job or a sequencer job. You can see this program under the Job Control option.

    Question: Suppose a sequencer controls 4 jobs (job 1, job 2, job 3, job 4). Job 1 has 10,000 rows, but after the run only 5,000 rows have been loaded into the target table; the remaining rows are not loaded and the job aborts. How can you sort out the problem?

    Answer:

    Suppose the job sequencer synchronizes or controls the 4 jobs but job 1 has a problem. In this situation, go to the Director and check what type of problem it shows: a data type problem, a warning message, a job failure, or a job abort. A job failure means a data type problem or a missing column action. So go to the Run window -> Click -> Tracing -> Performance, or in your target table -> General -> Action, where there are two options:

    (i) On Fail -- Commit, Continue

    (ii) On Skip -- Commit, Continue.

    First check how much data has already been loaded, then select the On Skip option and Continue. For the remaining data that was not loaded, select On Fail, Continue. Run the job again and you will definitely get a success message.

    Question: What happens if RCP is disabled?

    Answer:

    In that case OSH has to perform an import and export every time the job runs, and the processing time of the job also increases.

    Question: What is the difference between the Filter stage and the Switch stage?

    Ans: There are two main differences, and probably some minor ones as well. The two main differences are as follows.

    1) The Filter stage can send one input row to more than one output link. The Switch stage cannot: the C switch construct has an implicit break in every case.

    2) The Switch stage is limited to 128 output links; the Filter stage can have a theoretically unlimited number of output links. (Note: this is not a challenge!)
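
    The first difference can be illustrated with a simplified Python model. This is not how either stage is configured in DataStage; filter_stage and switch_stage are hypothetical functions, and the Switch stage is modelled here as "first matching case wins".

```python
def filter_stage(row, predicates):
    """Return every output link whose predicate the row satisfies."""
    return [link for link, predicate in predicates if predicate(row)]


def switch_stage(row, cases, default=None):
    """Return only the first matching link, like a switch with a break in every case."""
    for link, predicate in cases:
        if predicate(row):
            return [link]
    return [default] if default is not None else []


if __name__ == "__main__":
    rules = [
        ("big_orders", lambda r: r["amount"] > 100),
        ("eu_orders", lambda r: r["region"] == "EU"),
    ]
    row = {"amount": 500, "region": "EU"}
    print(filter_stage(row, rules))  # the row goes to both links
    print(switch_stage(row, rules))  # the row goes to the first matching link only
```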

    Question: How can I achieve constraint-based loading using DataStage 7.5? My target tables have interdependencies, i.e. primary key/foreign key constraints. I want my primary key tables to be loaded first and then my foreign key tables, and the primary key tables should be committed before the foreign key table loads are executed. How can I go about it?

    Ans: 1) Create a Job Sequencer to load your tables in sequential mode. In the sequencer, call all the primary key table loading jobs first, followed by the foreign key table jobs. Trigger the foreign key table load jobs only when the primary key load jobs have run successfully (i.e. an OK trigger).

    2) To improve the performance of the job, you can disable all the constraints on the tables and load them. Once loading is done, check the integrity of the data. Raise the rows that do not meet the constraints as exception data and cleanse them. This is only a suggestion; normally, when loading with constraints enabled, performance goes down drastically.

    3) If you use star schema modeling, when you create the physical DB from the model you can delete all constraints, and referential integrity is then maintained in the ETL process by looking up all your dimension keys while loading fact tables. Once all dimension keys are assigned to a fact, the dimension and fact can be loaded together. At the same time, RI is maintained at the ETL process level.
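
    The control flow of option 1 (run the next load only on an OK result) can be sketched in Python. This is a model of the sequencing logic, not of the Job Sequencer itself; run_sequence, the job names, and the run_job callback are hypothetical.

```python
def run_sequence(jobs, run_job):
    """Run jobs in order and stop at the first failure.

    `run_job` is any callable returning True on success.
    Returns (completed_jobs, failed_job_or_None).
    """
    completed = []
    for job in jobs:
        if not run_job(job):
            return completed, job
        completed.append(job)
    return completed, None


if __name__ == "__main__":
    # Primary key (dimension) loads come first, foreign key (fact) loads last.
    order = ["load_customer_dim", "load_product_dim", "load_sales_fact"]
    print(run_sequence(order, lambda job: job != "load_product_dim"))
```

    Because the loop returns at the first failure, a foreign key table load is never started after a failed primary key table load.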

    Question: How do you merge two files in DS?

    Ans: If the metadata of the 2 files is the same, use the Copy command as a Before-job subroutine; if the metadata is different, create a job to concatenate the 2 files into one.

    Question: How do you eliminate duplicate rows?

    Ans: DataStage provides a Remove Duplicates stage in Enterprise Edition. Using that stage we can eliminate duplicates based on a key column.
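
    Key-based de-duplication amounts to keeping one row per key value. The Python sketch below shows the idea only; remove_duplicates and its retain parameter are hypothetical names, and the choice between keeping the first or the last row per key is an assumption about what a caller would want.

```python
def remove_duplicates(rows, key, retain="first"):
    """Keep one row per value of `key`: the first seen, or the last if retain='last'."""
    kept = {}
    for row in rows:
        value = row[key]
        if retain == "last" or value not in kept:
            kept[value] = row
    return list(kept.values())


if __name__ == "__main__":
    rows = [
        {"id": 1, "city": "Oslo"},
        {"id": 2, "city": "Rome"},
        {"id": 1, "city": "Bergen"},
    ]
    print(remove_duplicates(rows, "id"))
    print(remove_duplicates(rows, "id", retain="last"))
```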

    Question: How do you pass filename as the parameter for a job?

    Ans: During job development we can create a parameter 'FILE_NAME' and the value can

    be passed while

    Question: Is there a mechanism available to export/import individual DataStage

    ETL jobs from the UNIX command line?

    Ans: Try dscmdexport and dscmdimport. They won't handle the "individual job" requirement, though: you can only export full projects from the command line.

    You can find the export and import executables on the client machine, usually someplace like: C:\Program Files\Ascential\DataStage.

    Question: What is the difference between the JOIN stage and the MERGE stage?

    Answer:

    JOIN: Performs join operations on two or more data sets input to the stage and then outputs the resulting data set.

    MERGE: Combines a sorted master data set with one or more sorted update data sets. The columns from the records in the master and update data sets are merged so that the output record contains all the columns from the master record plus any additional columns from each update record that are required.

    A master record and an update record are merged only if both of them have the same values for the merge key column(s) that we specify. Merge key columns are one or more columns that exist in both the master and update records.
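
    The MERGE behaviour described above can be modelled in Python as follows. This is a sketch under assumptions: merge_stage is a hypothetical function, there is a single update data set, master rows without a matching update are kept unchanged, and master columns win when a column exists in both records.

```python
def merge_stage(master_rows, update_rows, key):
    """Add the extra columns of the matching update row to each master row."""
    updates_by_key = {row[key]: row for row in update_rows}
    merged = []
    for master in master_rows:
        update = updates_by_key.get(master[key], {})
        # Master values take precedence; the update only contributes additional columns.
        merged.append({**update, **master})
    return merged


if __name__ == "__main__":
    master = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
    updates = [{"id": 1, "city": "Oslo"}]
    print(merge_stage(master, updates, "id"))
```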

    Question: What are the advantages of DataStage?

    Answer:

    Business advantages:

    It helps to make better business decisions;

    It is able to integrate data coming from all parts of the company;

    It helps to understand new and existing clients;

    We can collect data about different clients with it and compare them;

    It makes research into new business possibilities possible;

    We can analyze trends in the data it reads.

    Technological advantages:

    It handles all company data and adapts to the needs;

    It offers the possibility of organizing complex business intelligence;

    It is flexible and scalable;

    It accelerates the running of the project;

    It is easy to implement.

    1. What is the architecture of DataStage?

    The architecture of DS is basically a client/server architecture, with client components and server components.

    There are 4 client components:

    1. DataStage Designer

    2. DataStage Administrator

    3. DataStage Director

    4. DataStage Manager

    DataStage Designer is used to design the jobs.

    DataStage Manager is used to import and export the project and to view and edit the contents of the repository.

    DataStage Administrator is used for creating the project, deleting the project, and setting the environment variables.

    DataStage Director is used to run, validate, and schedule the jobs.

    Server components:

    DS server: runs executable server jobs, under the control of the DS Director, that extract, transform, and load data into a DWH.

    DS Package Installer: a user interface used to install packaged DS jobs and plug-ins.

    Repository or project: a central store that contains all the information required to build a DWH or data mart.

    2. What are the stages you have worked on?

    3. Some of my jobs should have their log details deleted automatically every month. What steps do you have to take for that?

    We have to set the auto-purge option in DS Administrator.

    4. I want to run multiple jobs in a single job. How can you handle it?

    In the job properties, set the option ALLOW MULTIPLE INSTANCES.

    5. What is version control in DS?

    In DS, version control is used to back up the project or jobs. This option is available from DS version 7.1 onwards.

    Version control is of 2 types:

    1. VSS - visual source safe

    2. CVSS - concurrent visual source safe.

    VSS is designed by Microsoft, but the disadvantage is that only one user can access it at a time; other users must wait until the first user completes the operation.

    With CVSS, many users can access it concurrently. Compared to VSS, the cost of CVSS is high.

    6. What is the difference between clear log file and clear status file?

    Clear log: we can clear the log details by using DS Director. The clear log option is available under the Job menu. By using this option we can clear the log details of a particular job.

    Clear status file: lets the user remove the status records associated with all stages of the selected jobs (in DS Director).

    7. I developed one job with 50 stages. At run time one stage is missing. How can you identify which stage is missing?

    By using the usage analysis tool, which is available in DS Manager, we can find out which items are used in the job.

    8. My job takes 30 minutes to run, and I want it to run in less than 30 minutes. What steps do we have to take?

    By using the performance tuning aspects available in DS, we can reduce the run time.

    Tuning aspects:

    In DS Administrator: in-process and inter-process

    In between passive stages: inter-process stage

    OCI stage: Array size and transaction size

    Also use the Link Partitioner and Link Collector stages in between passive stages.

    9. How do you do row transposition in DS?

    The Pivot stage is used for transposition. Pivot is an active stage that maps sets of columns in an input table to a single column in an output table.
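
    Mapping a set of columns to a single column means turning one wide row into several narrow rows. The Python sketch below illustrates that transformation; the pivot function and the column names are hypothetical and do not reflect how the Pivot stage is configured.

```python
def pivot(row, key_columns, pivot_columns, output_column):
    """Turn one wide row into one output row per pivoted column."""
    output = []
    for column in pivot_columns:
        new_row = {k: row[k] for k in key_columns}  # carry the key columns through
        new_row[output_column] = row[column]
        output.append(new_row)
    return output


if __name__ == "__main__":
    wide = {"id": 7, "q1": 10, "q2": 20, "q3": 30}
    for narrow in pivot(wide, ["id"], ["q1", "q2", "q3"], "sales"):
        print(narrow)
```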

    10. If a job is locked by some user, how can you unlock that job in DS?

    We can unlock the job by using the Clean Up Resources option, which is available in DS Director. Otherwise we can find the PID (process ID) and kill the process on the UNIX server.

    11. What is a container? How many types containers are available? Is it possible to

    use container as look up?

    A container is a group of stages and links. Containers enable you to simplify and

    modularize your server job designs by replacing complex areas of the diagram with a

    single container stage.

    DataStage provides two types of container:

    Local containers. These are created within a job and are accessible only by that job.

    Shared containers. These are created separately and are stored in the Repository in the same way that jobs are. Shared containers can be used by any job in the project.

    Yes, we can use a container as a look-up.

    12. How do you deconstruct a shared container?

    To deconstruct a shared container, you first have to convert the shared container to a local container, and then deconstruct the container.

    13. I am getting an input value like X = Iconv("31 DEC 1967", "D"). What is the value of X?

    The value of X is zero.

    The Iconv function converts a string to an internal storage format. It takes 31 Dec 1967 as day zero and counts days from that date (31-Dec-1967).
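
    The day count can be checked with ordinary date arithmetic. The Python sketch below is not Iconv itself; it only reproduces the stated rule that 31 Dec 1967 is day zero, and the names DAY_ZERO and internal_date are hypothetical.

```python
from datetime import date

# Day zero of the internal date format, as stated above.
DAY_ZERO = date(1967, 12, 31)


def internal_date(d):
    """Number of days between `d` and 31 Dec 1967 (negative for earlier dates)."""
    return (d - DAY_ZERO).days


if __name__ == "__main__":
    print(internal_date(date(1967, 12, 31)))
    print(internal_date(date(1968, 1, 1)))
    print(internal_date(date(1967, 12, 30)))
```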

    14. What are unit testing, integration testing and system testing?

    Unit testing: For DS, the unit test checks for data type mismatches, the size of each data type, and column mismatches.

    Integration testing: According to their dependencies, we integrate all jobs into one sequence. That is called a control sequence.

    System testing: System testing is nothing but the performance tuning aspects in DS.

    15. What are the command line functions that import and export the DS jobs?

    Dsimport.exe ---- To import the DataStage components

    Dsexport.exe ---- To export the DataStage components

    16. How many hashing algorithms are available for static hash file and dynamic

    hash file?

    Sixteen hashing algorithms for static hash file.

    Two hashing algorithms for dynamic hash file( GENERAL or SEQ.NUM)

    17. What happens when you have a job that links two passive stages together?

    Obviously there is some process going on. Under the covers, DS inserts a cut-down Transformer stage between the passive stages, which just passes data straight from one stage to the other.

    18. What is the use of the Nested Condition activity?

    Nested Condition. Allows you to further branch the execution of a sequence depending

    on a condition.

    19. I have three jobs A, B and C, which are dependent on each other. I want to run jobs A and C daily and job B only on Sunday. How can you do it?

    First, schedule jobs A and C from Monday to Saturday in one sequence.

    Next, put all three jobs, in dependency order, in one more sequence and schedule that sequence to run only on Sunday.
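
    The scheduling rule can be written down as a small Python function. It is a sketch of the decision only: jobs_for is a hypothetical name, and the dependency order A, B, C is an assumption, since the question does not state it.

```python
from datetime import date


def jobs_for(day):
    """Return the jobs to run on `day`: A and C every day, plus B on Sundays."""
    if day.weekday() == 6:  # Monday is 0, so Sunday is 6
        return ["A", "B", "C"]  # assumed dependency order
    return ["A", "C"]


if __name__ == "__main__":
    print(jobs_for(date(2019, 8, 11)))  # a Sunday
    print(jobs_for(date(2019, 8, 12)))  # a Monday
```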