ETL Questions


ETL stands for extraction, transformation and loading. ETL is a process that involves the following tasks:

- extracting data from source operational or archive systems, which are the primary source of data for the data warehouse

- transforming the data, which may involve cleaning, filtering, validating and applying business rules

- loading the data into a data warehouse or any other database or application that houses data

The ETL process is also very often referred to as a Data Integration process, and an ETL tool as a Data Integration platform.

The terms closely related to and managed by ETL processes are: data migration, data management, data cleansing, data synchronization and data consolidation.

The main goal of maintaining an ETL process in an organization is to migrate and transform data from the source OLTP systems to feed a data warehouse and form data marts.

In DataStage the ETL execution flow is managed by controlling jobs, called Job Sequences. A master controlling job provides a single interface to pass parameter values down to controlled jobs and launch hundreds of jobs with the desired parameters. Changing runtime options (like moving a project from the testing to the production environment) is done in job sequences and does not require changing the 'child' jobs.

Controlled jobs can be run in parallel or in serial (when a second job is dependent on the first). In case of serial job execution it is very important to check whether the preceding set of jobs was executed successfully.

A typical DataStage ETL process can be broken up into the following segments (each segment can be realized by a set of DataStage jobs; see the sketch after this list):

- jobs accessing source systems - extract data from the source systems. They typically do some data filtering and validations, like trimming white spaces, eliminating (replacing) nulls and filtering irrelevant data (and sometimes detect whether the data has changed since the last run by reading timestamps).

- loading lookups - these jobs usually need to be run before the transformation jobs. They load lookup hashed files, prepare surrogate key mapping files, set up data sequences and set up some parameters.

- transformation jobs - these are the jobs where most of the real processing is done. They apply business rules and shape the data that will be loaded into the data warehouse (dimension and fact tables).

- loading jobs - load the transformed data into the database. A typical data warehouse load involves assigning surrogate keys, loading dimension tables and loading fact tables (in a Star Schema example).
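The same four segments can be illustrated outside any particular tool. The following is a minimal Python sketch, not DataStage code; the file name, column names and business rules (orders.csv, status, amount) are hypothetical, chosen only to show one job per segment.

```python
# Minimal sketch of the four ETL segments described above; all names are illustrative.
import csv

def extract(path):
    """Source-access segment: read, trim whitespace, replace nulls, filter irrelevant rows."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            row = {k: (v.strip() if v else "UNKNOWN") for k, v in row.items()}
            if row["status"] != "CANCELLED":          # filter irrelevant data
                yield row

def build_customer_lookup(rows):
    """Lookup segment: prepare a surrogate-key map (natural key -> surrogate key)."""
    lookup = {}
    for i, row in enumerate(rows, start=1):
        lookup.setdefault(row["customer_id"], i)
    return lookup

def transform(rows, customer_lookup):
    """Transformation segment: apply business rules and shape rows for the warehouse."""
    for row in rows:
        yield {
            "customer_key": customer_lookup.get(row["customer_id"], -1),  # -1 = lookup failure
            "order_amount": round(float(row["amount"]), 2),               # business rule: 2 decimals
        }

def load(facts, target):
    """Load segment: append the shaped rows to the fact 'table' (a plain list here)."""
    target.extend(facts)

if __name__ == "__main__":
    staged = list(extract("orders.csv"))              # hypothetical source file
    lookup = build_customer_lookup(staged)
    fact_table = []
    load(transform(staged, lookup), fact_table)
    print(f"loaded {len(fact_table)} fact rows")
```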


    Datawarehouse master load sequence

Usually the whole set of daily executed DataStage jobs is run and monitored by one Sequence job. It is created graphically in DataStage Designer in a similar way to a normal server job.

Very often the following job sequencer stages/activities are used to build a master controller:

- Wait for file activity - checks for a file which would trigger the whole processing
- Execute command - executes operating system commands or DataStage commands
- Notification - sends an email with a notification and/or the job execution log. It can also be invoked when an exception occurs, for example to notify people from support so they are aware of a problem straight away
- Exception - catches exceptions and can be combined with the notification stage

Example of a master job sequence architecture

It's good practice to follow one common naming convention for jobs. The job names proposed in the example are clear, easy to sort, and make the job hierarchy easy to analyze.

- Master job controller: SEQ_1000_MAS
-- Job sequences accessing source: SEQ_1100_SRC
---- loading customers: SEQ_1110_CUS
---- loading products: SEQ_1120_PRD
---- loading time scale: SEQ_1130_TM
---- loading orders: SEQ_1140_ORD
---- loading invoices: SEQ_1150_INV
-- Job filling up lookup keys: SEQ_1200_LK
---- loading lookups: SEQ_1210_LK
-- Job sequences for transforming data: SEQ_1300_TRS
---- transforming customers (dimension): SEQ_1310_CUS_D
---- transforming products (dimension): SEQ_1320_PRD_D
---- transforming time scale (dimension): SEQ_1330_TM_D
---- transforming orders (fact): SEQ_1340_ORD_F
---- transforming invoices (fact): SEQ_1350_INV_F
-- Job sequence for loading the transformed data into the DW: SEQ_1400_LD

The master job controller (sequence job) for the data warehouse load process, SEQ_1000_MAS, can be designed as depicted below. Notice that it will not start until a trigger file is present (Wait For File activity). The extract-transform-load job sequences (each of which may contain server jobs or job sequences) will be triggered serially (not in parallel), and an email notification will finish the process.
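For readers without DataStage at hand, a rough Python equivalent of that master-controller behaviour is sketched below. The child sequence names follow the naming convention above; the trigger-file path, the run_job.sh launcher and the notify helper are assumptions for illustration, not DataStage APIs.

```python
# Rough, non-DataStage sketch of the SEQ_1000_MAS behaviour described above.
import os
import subprocess
import time

TRIGGER_FILE = "/data/triggers/start_load.trg"   # hypothetical trigger file
CHILD_SEQUENCES = ["SEQ_1100_SRC", "SEQ_1200_LK", "SEQ_1300_TRS", "SEQ_1400_LD"]

def wait_for_file(path, poll_seconds=60):
    """Equivalent of the 'Wait for file' activity."""
    while not os.path.exists(path):
        time.sleep(poll_seconds)

def run_sequence(name):
    """Equivalent of 'Execute command': launch a child job and check its exit status."""
    result = subprocess.run(["run_job.sh", name])  # hypothetical launcher script
    if result.returncode != 0:
        raise RuntimeError(f"{name} failed with exit code {result.returncode}")

def notify(message):
    """Stand-in for the 'Notification' activity (a real setup would send email)."""
    print(f"NOTIFY support: {message}")

if __name__ == "__main__":
    wait_for_file(TRIGGER_FILE)
    try:
        for seq in CHILD_SEQUENCES:        # serial execution: next job only if previous succeeded
            run_sequence(seq)
        notify("Warehouse load finished successfully")
    except Exception as exc:               # equivalent of the 'Exception' activity
        notify(f"Warehouse load aborted: {exc}")
```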


Master job sequence for loading a data warehouse


    Q. What is a staging area? Do we need it? What is the purpose of a staging area?

A1. The staging area is a place where you hold temporary tables on the data warehouse server. Staging tables are connected to the work area or fact tables. We basically need a staging area to hold the data and perform data cleansing and merging before loading the data into the warehouse.

    A2.

The staging area is:

* One or more database schema(s) or file stores used to stage data extracted from the source OLTP systems prior to being published to the warehouse, where it is visible to end users.

* Data in the staging area is NOT visible to end users for queries, reports or analysis of any kind. It does not hold completed data ready for querying.

* It may hold intermediate results (if data is pipelined through a process).

* Equally, it may hold state data - the keys of the data held in the warehouse - used to detect whether incoming data includes new or updated rows (or deleted ones, for that matter); see the sketch after this list.


* It is likely to be equal in size to (or maybe larger than) the presentation area itself.

* Although the state data (e.g. the last sequence loaded) may be backed up, much of the staging area data is automatically replaced during the ETL load processes and can, with care, avoid adding to the backup effort. The presentation area, however, may need backup in many cases.

* It may include some metadata, which may be used by analysts or operators monitoring the state of the previous loads (e.g. audit information, summary totals of rows loaded, etc).

* It is likely to hold details of rejected entries - data which has failed quality tests and may need correction and re-submission to the ETL process.

* It is likely to have few indexes (compared to the presentation area) and to hold data in a quite normalised form. The presentation area (the bit the end users see) is by comparison likely to be more highly indexed (mainly bitmap indexes), with highly denormalised tables (the dimension tables anyway).
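As a concrete illustration of the "state data" point above, the staging area can keep the keys and checksums of rows already loaded and compare each incoming extract against them to classify rows as new or updated. The sketch below is a hypothetical Python version of that idea; the customer_id key and column names are made up.

```python
# Sketch: using staged state data (key -> checksum) to detect new vs updated rows.
import hashlib

def checksum(row: dict) -> str:
    """Stable fingerprint of the non-key columns."""
    payload = "|".join(f"{k}={row[k]}" for k in sorted(row) if k != "customer_id")
    return hashlib.md5(payload.encode()).hexdigest()

def classify(incoming_rows, state):
    """Compare incoming rows against staged state; return (new, updated) lists."""
    new, updated = [], []
    for row in incoming_rows:
        key, digest = row["customer_id"], checksum(row)
        if key not in state:
            new.append(row)
        elif state[key] != digest:
            updated.append(row)
        state[key] = digest          # refresh the state for the next load
    return new, updated

state = {"C001": "0" * 32}                                   # previously loaded keys
incoming = [{"customer_id": "C001", "name": "Acme Ltd"},
            {"customer_id": "C002", "name": "Widget Co"}]
new_rows, updated_rows = classify(incoming, state)
print(len(new_rows), "new,", len(updated_rows), "updated")
```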

The staging area exists to be a separate "back room" or "engine room" of the warehouse, where the data can be transformed, corrected and prepared for the warehouse.

    It should ONLY be accessible to the ETL processes working on the data, or administrators

    monitoring or managing the ETL process.

In summary, a typical warehouse generally has three distinct areas:

1. Several source systems which provide data. This can include databases (Oracle, SQL Server, Sybase etc), files or spreadsheets.

2. A single staging area which may use one or more database schemas or file stores (depending upon warehouse load volumes).

3. One or more visible data marts, or a single warehouse presentation area, where data is made visible to end-user queries. This is what many people think of as "the warehouse", although the entire system is the warehouse - it depends upon your perspective.

    The staging area is the middle bit.

Q. What are active transformations / passive transformations?

A1. An active transformation can change the number of rows it outputs, while a passive transformation does not change the number of rows and passes through the same number of rows that was given to it as input.

A2. Transformations can be active or passive. An active transformation can change the number of rows that pass through it, such as a Filter transformation that removes rows that do not meet the filter condition. A passive transformation does not change the number of rows that pass through it, such as an Expression transformation that performs a calculation on data and passes all rows through the transformation.
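The distinction is easy to see outside any particular tool. In the hypothetical Python sketch below, the filter-like step is "active" because it can emit fewer rows than it receives, while the expression-like step is "passive" because it always emits exactly one output row per input row.

```python
# Sketch: an active (Filter-like) step changes the row count; a passive (Expression-like) step does not.
rows = [{"qty": 5, "price": 2.0}, {"qty": 0, "price": 9.9}, {"qty": 3, "price": 4.5}]

def filter_step(rows):
    """Active: rows failing the condition are dropped, so the output count may shrink."""
    return [r for r in rows if r["qty"] > 0]

def expression_step(rows):
    """Passive: every input row comes out, with a derived column added."""
    return [{**r, "amount": r["qty"] * r["price"]} for r in rows]

filtered = filter_step(rows)          # 2 rows out of 3 -> row count changed
derived = expression_step(filtered)   # still 2 rows   -> row count unchanged
print(len(rows), len(filtered), len(derived))
```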

    Active transformations:

    Advanced External Procedure

    Aggregator

    Application Source Qualifier

    Filter

    Joiner

Normalizer

    Rank

    Router

    Update Strategy

Passive transformations:

    Expression

    External Procedure

Mapplet Input

    Lookup

    Sequence generator

    XML Source Qualifier

Mapplet Output


    Q. What is the difference between Power Center & Power Mart?

    A1.

Power Mart is designed for:
- a low range of warehouses
- local repositories only
- mainly desktop environments.

Power Center is designed for:
- high-end warehouses
- global as well as local repositories
- ERP support.

Q. What is the difference between ETL tools and OLAP tools?

    A1.

An ETL tool is meant for extracting data from legacy systems and loading it into a specified database, with some data cleansing in the process.

Examples: Informatica, DataStage, etc.

OLAP is meant for reporting purposes. In OLAP, data is available in a multidimensional model, so you can write simple queries to extract data from the database.

Examples: Business Objects, Cognos, etc.

A2. ETL tools are used to extract, transform and load the data into a data warehouse / data mart. OLAP tools are used to create cubes/reports for business analysis from the data warehouse / data mart.

    Q. What are the various ETL tools? - Name a few

    A1.

- Informatica
- Ab Initio
- DataStage
- Cognos Decision Stream
- Oracle Warehouse Builder
- Business Objects XI (Extreme Insight)
- SAP Business Warehouse
- SAS Enterprise ETL Server

Q. What are the various transformations available?

    A1.

Transformations play an important role in a data warehouse. Transformations are used when data is moved from source to destination; depending upon the criteria, transformations are applied. Some of the transformations are Aggregator, Lookup, Filter, Source Qualifier, Sequence Generator and Expression.

A2. The various types of transformation in Informatica:

source qualifier
aggregator
sequence generator
sorter
router
filter
lookup
update strategy
joiner
normalizer
expression

A3.

Aggregator Transformation
Expression Transformation
Filter Transformation
Joiner Transformation
Lookup Transformation
Normalizer Transformation
Rank Transformation
Router Transformation
Sequence Generator Transformation
Stored Procedure Transformation
Sorter Transformation
Update Strategy Transformation
XML Source Qualifier Transformation
Advanced External Procedure Transformation
External Procedure Transformation


    Q. What are the different Lookup methods used in Informatica?

    A1.

A connected lookup receives input from the pipeline, sends output to the pipeline and can return any number of values.

An unconnected lookup can return only one column.
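As a loose analogy only (this is not Informatica syntax), a connected lookup behaves like a step wired into the row pipeline that can merge several returned columns into each row, while an unconnected lookup behaves like a helper function called on demand that returns a single value. The dictionary and column names below are invented for the sketch.

```python
# Loose analogy of connected vs unconnected lookups; not Informatica's API.
CUSTOMER_LOOKUP = {"C001": {"name": "Acme Ltd", "region": "EMEA"}}

def connected_lookup(rows):
    """'Connected': sits in the pipeline and can return any number of columns per row."""
    for row in rows:
        match = CUSTOMER_LOOKUP.get(row["customer_id"], {})
        yield {**row, **match}                    # merge all returned columns into the row

def unconnected_lookup(customer_id, column):
    """'Unconnected': called like an expression and returns exactly one value."""
    return CUSTOMER_LOOKUP.get(customer_id, {}).get(column)

rows = [{"customer_id": "C001", "amount": 10.0}]
print(list(connected_lookup(rows)))
print(unconnected_lookup("C001", "region"))
```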

Q. Let's suppose we have some 10,000-odd records in the source system. When we load them into the target, how do we ensure that the 10,000 records loaded to the target don't contain any garbage values? How do we test it? We can't check every record, as the number of records is huge.

A1. Data quality checks come in a number of forms:

1. For FACT table rows, is there a valid lookup against each of the dimensions?

2. For FACT or DIMENSION rows, for each value:

* Is it null when it shouldn't be?
* Is the data type correct (e.g. number, date)?
* Is the range of values or format correct?
* Is the row valid with relation to all the other source system business rules?

    There is no magic way of checking the integrity of data.

You could simply count the number of rows in and out again and assume it's all OK, but for a fact table (at the very minimum) you'll need to cope with failed dimension lookups (typically from late-arriving dimension rows).

The classic solution is to include a dimension key zero and minus one (Null and Invalid) in your dimension table. Null columns are set to the zero key, and a lookup failure to the minus one. You may need to store and re-cycle rows with failed lookups and treat them as updates, so that if the missing dimension row appears, the data is corrected.
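A small, hypothetical Python sketch of that classic solution follows: the dimension table reserves surrogate key 0 for nulls and -1 for failed lookups, so every fact row always gets a valid key, and rows that land on -1 can be set aside for recycling. The customer dimension and fact rows are made up for illustration.

```python
# Sketch of the 'key zero / minus one' convention for fact-table dimension lookups.
DIM_UNKNOWN_NULL = 0      # reserved dimension row: natural key was null
DIM_LOOKUP_FAILED = -1    # reserved dimension row: natural key not (yet) in the dimension

customer_dim = {"C001": 101, "C002": 102}       # natural key -> surrogate key

def resolve_customer_key(natural_key):
    if natural_key is None:
        return DIM_UNKNOWN_NULL
    return customer_dim.get(natural_key, DIM_LOOKUP_FAILED)

facts = [{"customer": "C001"}, {"customer": None}, {"customer": "C999"}]
keyed = [{**f, "customer_key": resolve_customer_key(f["customer"])} for f in facts]

# Rows that landed on -1 are kept aside and re-applied as updates
# once the late-arriving dimension row shows up.
recycle = [f for f in keyed if f["customer_key"] == DIM_LOOKUP_FAILED]
print(keyed, recycle)
```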

Otherwise, you've no option: if the incoming data is from an unreliable source, you'll need to check its validity or accept that the warehouse includes wrong results.

If the warehouse includes a high percentage of incorrect or misleading values, what's the point of having it?

A2. To do this, you must profile the data at the source to know the domain of all the values, get the actual number of rows in the source, and get the types of the data in the source. After the data is loaded into the target, this process can be repeated, i.e. checking the data values with respect to range, type, etc. and also checking the actual number of rows inserted. If the results before and after match, then we are OK. This process is typically automated in ETL tools.
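One way to automate that before/after comparison is sketched below in Python, under assumed column names: capture the row count and per-column type, range and null count on the source extract, repeat the same profile on the loaded target, and flag any mismatch.

```python
# Sketch: profile source and target the same way and compare the results.
def profile(rows, columns):
    stats = {"row_count": len(rows)}
    for col in columns:
        values = [r[col] for r in rows if r[col] is not None]
        stats[col] = {
            "type": type(values[0]).__name__ if values else None,
            "min": min(values) if values else None,
            "max": max(values) if values else None,
            "nulls": len(rows) - len(values),
        }
    return stats

source_rows = [{"amount": 10.0}, {"amount": 25.5}]   # illustrative data only
target_rows = [{"amount": 10.0}, {"amount": 25.5}]

src, tgt = profile(source_rows, ["amount"]), profile(target_rows, ["amount"])
print("match" if src == tgt else f"mismatch: {src} vs {tgt}")
```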

Q. What is an ODS (Operational Data Store)?

A1. The ODS comes between the staging area and the data warehouse. The data in the ODS will be at a low level of granularity.

Once data is populated in the ODS, aggregated data will be loaded into the EDW through the ODS.

A2. ODS is the Operational Data Store, which is also called transactional data. The ODS is the source of a warehouse. Data from the ODS is staged, transformed and then moved to the data warehouse.

Refresh Load: the table is truncated and the data is loaded again. Static dimension tables or type tables are usually loaded using this method.

Incremental Load: a method to capture only the newly created or updated records. This load is performed based upon a flag or a date.

Full Load: when we are loading the data for the first time, whether it is a base load or history, the whole set of records is loaded at a stretch, depending upon the volume.

    A mapping represents dataflow from sources to targets.

    A mapplet creates or configures a set of transformations.

A workflow is a set of instructions that tell the Informatica server how to execute the tasks.

    A worklet is an object that represents a set of tasks.

    A session is a set of instructions that describe how and when to move data from sources to targets.

First Method: if there is a column in the source which identifies the record's inserted date, then it is easy to put a filter condition in the source qualifier.

Second Method: if there is no column in the source to identify the record's inserted date, then we need to do a target lookup based on the primary key, determine the new records and then insert them.
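Both methods can be sketched in Python; the column and table names (inserted_on, order_id) are made up for illustration. The first filters the source on its insert-date column; the second, used when no such column exists, looks each source row up against the primary keys already present in the target and keeps only the missing ones.

```python
from datetime import date

# Method 1: the source carries an insert-date column, so filter at extraction time.
def new_rows_by_date(source_rows, last_run):
    return [r for r in source_rows if r["inserted_on"] > last_run]

# Method 2: no such column, so compare against the primary keys already in the target.
def new_rows_by_target_lookup(source_rows, target_keys):
    return [r for r in source_rows if r["order_id"] not in target_keys]

source = [{"order_id": 1, "inserted_on": date(2019, 8, 1)},
          {"order_id": 2, "inserted_on": date(2019, 8, 6)}]

print(new_rows_by_date(source, last_run=date(2019, 8, 5)))        # -> order 2 only
print(new_rows_by_target_lookup(source, target_keys={1}))         # -> order 2 only
```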