8/6/2019 ETL ques
ETL stands for extraction, transformation and loading. ETL is a process that involves the following tasks:
* extracting data from source operational or archive systems, which are the primary source of data for the data warehouse
* transforming the data, which may involve cleaning, filtering, validating and applying business rules
* loading the data into a data warehouse or any other database or application that houses the data
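The three steps can be sketched end to end in a few lines of plain Python (the sample data and the business rule here are invented for illustration, not taken from any particular tool):

```python
def extract(source):
    """Pull raw rows from a source system (here just an in-memory list)."""
    return list(source)

def transform(rows):
    """Clean and validate: trim whitespace, drop rows failing a business rule."""
    cleaned = [{"name": r["name"].strip(), "amount": r["amount"]} for r in rows]
    return [r for r in cleaned if r["amount"] > 0]  # rule: positive amounts only

def load(rows, warehouse):
    """Append the transformed rows to the target store; return rows loaded."""
    warehouse.extend(rows)
    return len(rows)

warehouse = []
loaded = load(transform(extract(
    [{"name": " Acme ", "amount": 10}, {"name": "Bad", "amount": -1}]
)), warehouse)
```

Real ETL tools wrap exactly this flow in graphical jobs, but the extract -> transform -> load pipeline shape is the same.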
The ETL process is also very often referred to as a Data Integration process, and an ETL tool as a Data Integration platform. The terms closely related to and managed by ETL processes are: data migration, data management, data cleansing, data synchronization and data consolidation.
The main goal of maintaining an ETL process in an organization is to migrate and transform data
from the source OLTP systems to feed a data warehouse and form data marts.
In DataStage the ETL execution flow is managed by controlling jobs, called Job Sequences. A master controlling job provides a single interface to pass parameter values down to controlled jobs and launch hundreds of jobs with desired parameters. Changing runtime options (like moving a project from the testing to the production environment) is done in job sequences and does not require changing the 'child' jobs.
Controlled jobs can be run in parallel or in serial (when a second job is dependent on the first). In the case of serial job execution it's very important to check whether the preceding set of jobs was executed successfully. A normal DataStage ETL process can be broken up into the following segments (each of which can be realized by a set of DataStage jobs):
* jobs accessing source systems - extract data from the source systems. They typically do some data filtering and validations, like trimming white spaces, eliminating (replacing) nulls and filtering irrelevant data (they also sometimes detect whether the data has changed since the last run by reading timestamps).
* loading lookups - these jobs usually need to be run before the transformations. They load lookup hashed files, prepare surrogate key mapping files, set up data sequences and set up some parameters.
* transformation jobs - these are the jobs where most of the real processing is done. They apply business rules and shape the data that will be loaded into the data warehouse (dimension and fact tables).
* loading jobs - load the transformed data into the database. Usually a typical Data Warehouse load involves assigning surrogate keys, loading dimension tables and loading fact tables (in a Star Schema example).
Data warehouse master load sequence
Usually the whole set of daily executed DataStage jobs is run and monitored by one Sequence job. It's created graphically in DataStage Designer in a similar way to a normal server job.
Very often the following job sequencer stages/activities are used to build a master controller:

* Wait for file activity - checks for a file which triggers the whole processing
* Execute command - executes operating system commands or DataStage commands
* Notification - sends an email with a notification and/or the job execution log. It can also be invoked when an exception occurs, for example to notify the support team so they are aware of a problem straight away
* Exception - catches exceptions and can be combined with the notification stage

Example of a master job sequence architecture
It's a good practice to follow one common naming convention for jobs. The job names proposed in the example are clear, easy to sort, and make the job hierarchy easy to analyze.
- Master job controller: SEQ_1000_MAS
-- Job sequences accessing source: SEQ_1100_SRC
---- loading customers: SEQ_1110_CUS
---- loading products: SEQ_1120_PRD
---- loading time scale: SEQ_1130_TM
---- loading orders: SEQ_1140_ORD
---- loading invoices: SEQ_1150_INV
-- Job filling up lookup keys: SEQ_1200_LK
---- loading lookups: SEQ_1210_LK
-- Job sequences for transforming data: SEQ_1300_TRS
---- transforming customers (dimension): SEQ_1310_CUS_D
---- transforming products (dimension): SEQ_1320_PRD_D
---- transforming time scale (dimension): SEQ_1330_TM_D
---- transforming orders (fact): SEQ_1340_ORD_F
---- transforming invoices (fact): SEQ_1350_INV_F
-- Job sequence for loading the transformed data into the DW: SEQ_1400_LD
The master job controller (sequence job) for the data warehouse load process, SEQ_1000_MAS, can be designed as depicted below. Notice that it will not start until a trigger file is present (Wait For File activity). The extract-transform-load job sequences (each of which may contain server jobs or job sequences) will be triggered in serial fashion (not in parallel), and an email notification will finish the process.
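The control flow of such a master sequence (wait for a trigger file, run the job sequences serially, stop on the first failure, then notify) can be sketched in Python. The job names follow the example above, but `run_job` and `notify` here are stand-ins for whatever actually launches a DataStage job and sends the email (e.g. the `dsjob` command and an SMTP call), not real APIs:

```python
import os
import time

def wait_for_file(path, poll_seconds=1.0, timeout=60.0):
    """Block until the trigger file appears (Wait For File activity)."""
    deadline = time.time() + timeout
    while not os.path.exists(path):
        if time.time() >= deadline:
            raise TimeoutError(f"trigger file {path} never appeared")
        time.sleep(poll_seconds)

def run_master_sequence(trigger_file, job_sequences, run_job, notify):
    """Run job sequences serially; abort on the first failure, then notify."""
    wait_for_file(trigger_file)
    completed = []
    for job in job_sequences:    # serial, not parallel
        if not run_job(job):     # each job must finish OK before the next starts
            notify(f"FAILED at {job}; completed: {completed}")
            return False
        completed.append(job)
    notify(f"OK: {completed}")
    return True
```

Checking each job's exit status before launching the next one is exactly the "was the preceding set of jobs executed successfully" rule from the serial-execution note above.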
Master job sequence for loading a Data Warehouse
Q. What is a staging area? Do we need it? What is the purpose of a staging area?
A1. The staging area is a place where you hold temporary tables on the data warehouse server. Staging tables are connected to the work area or fact tables. We basically need a staging area to hold the data, and to perform data cleansing and merging, before loading the data into the warehouse.
A2. The staging area is:
* One or more database schemas or file stores used to stage data extracted from the source OLTP systems prior to being published to the warehouse, where it becomes visible to end users.
* Data in the staging area is NOT visible to end users for queries, reports or analysis of any kind. It does not hold completed data ready for querying.
* It may hold intermediate results (if data is pipelined through a process).
* Equally it may hold state data (the keys of the data held on the warehouse), used to detect whether incoming data includes new or updated rows (or deleted rows, for that matter).
* It is likely to be equal in size to (or maybe larger than) the presentation area itself.
* Although the state data (e.g. the last sequence loaded) may be backed up, much of the staging area data is automatically replaced during the ETL load processes, and with care it can avoid adding to the backup effort. The presentation area, however, may need backup in many cases.
* It may include some metadata, which may be used by analysts or operators monitoring the state of the previous loads (e.g. audit information, summary totals of rows loaded, etc.).
* It's likely to hold details of rejected entries: data which has failed quality tests and may need correction and re-submission to the ETL process.
* It's likely to have few indexes (compared to the presentation area), and to hold data in a quite normalised form. The presentation area (the part the end users see) is by comparison likely to be more highly indexed (mainly bitmap indexes), with highly denormalised tables (the dimension tables, anyway).
The staging area exists to be a separate 'back room' or 'engine room' of the warehouse, where the data can be transformed, corrected and prepared for the warehouse. It should ONLY be accessible to the ETL processes working on the data, or to administrators monitoring or managing the ETL process.
In summary, a typical warehouse generally has three distinct areas:

1. Several source systems which provide data. This can include databases (Oracle, SQL Server, Sybase etc.), files or spreadsheets.
2. A single staging area, which may use one or more database schemas or file stores (depending upon warehouse load volumes).
3. One or more visible data marts, or a single warehouse 'presentation area', where data is made visible to end-user queries. This is what many people think of as 'the warehouse', although the entire system is the warehouse; it depends upon your perspective.
The staging area is the middle bit.
Q. What are active transformation / Passive transformations?
A1. An active transformation can change the number of rows it outputs, while a passive transformation does not change the number of rows and passes through the same number of rows that it was given as input.
A2. Transformations can be active or passive. An active transformation can change the number of rows that pass through it, such as a Filter transformation that removes rows that do not meet the
filter condition. A passive transformation does not change the number of rows that pass through it, such as an Expression transformation that performs a calculation on data and passes all rows through the transformation.
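As a rough illustration in plain Python (not Informatica syntax; the order rows are made up), a filter-style active transformation can drop rows, while an expression-style passive transformation emits exactly one output row per input row:

```python
def filter_transform(rows, predicate):
    """Active: the output row count may differ from the input row count."""
    return [row for row in rows if predicate(row)]

def expression_transform(rows, calculate):
    """Passive: one output row per input row; only the values change."""
    return [calculate(row) for row in rows]

orders = [{"qty": 2, "price": 10.0}, {"qty": 0, "price": 5.0}]

# Active: removes rows that fail the condition (here: qty > 0).
kept = filter_transform(orders, lambda r: r["qty"] > 0)

# Passive: adds a derived column but keeps every row.
priced = expression_transform(orders, lambda r: {**r, "total": r["qty"] * r["price"]})
```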
Active transformations:
Advanced External Procedure
Aggregator
Application Source Qualifier
Filter
Joiner
Normalizer
Rank
Router
Update Strategy
Passive transformations:
Expression
External Procedure
Mapplet - Input
Lookup
Sequence generator
XML Source Qualifier
Mapplet - Output
Q. What is the difference between Power Center & Power Mart?
A1.
Power Mart is designed for:
- a low range of warehouses
- local repositories only
- mainly desktop environments.

Power Center is designed for:
- high-end warehouses
- global as well as local repositories
- ERP support.
Q. What is the difference between an ETL tool and OLAP tools?
A1.
An ETL tool is meant for extracting data from legacy systems and loading it into a specified database, with some process of cleansing the data.
Examples: Informatica, DataStage, etc.

OLAP is meant for reporting purposes. In OLAP, data is available in a multidimensional model, so you can write simple queries to extract data from the database.
Examples: Business Objects, Cognos, etc.
A2. ETL tools are used to extract, transform and load the data into a data warehouse / data mart. OLAP tools are used to create cubes/reports for business analysis from the data warehouse / data mart.
Q. What are the various ETL tools? - Name a few
A1.
- Informatica
- Ab Initio
- DataStage
- Cognos Decision Stream
- Oracle Warehouse Builder
- Business Objects XI (Extreme Insight)
- SAP Business Warehouse
- SAS Enterprise ETL Server
Q. What are the various transformations available?
A1.
Transformations play an important role in a data warehouse. Transformations are used when data is moved from source to destination; depending on the criteria, different transformations are applied. Some of the transformations are Aggregator, Lookup, Filter, Source Qualifier, Sequence Generator and Expression.
A2. The various types of transformation in Informatica:
- source qualifier
- aggregator
- sequence generator
- sorter
- router
- filter
- lookup
- update strategy
- joiner
- normalizer
- expression
A3.
- Aggregator Transformation
- Expression Transformation
- Filter Transformation
- Joiner Transformation
- Lookup Transformation
- Normalizer Transformation
- Rank Transformation
- Router Transformation
- Sequence Generator Transformation
- Stored Procedure Transformation
- Sorter Transformation
- Update Strategy Transformation
- XML Source Qualifier Transformation
- Advanced External Procedure Transformation
- External Procedure Transformation
Q. What are the different Lookup methods used in Informatica?
A1.
A connected lookup receives input from the pipeline, sends output to the pipeline, and can return any number of values.
An unconnected lookup is called like a function from within another transformation and can return only one column.
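In plain Python terms (ignoring Informatica specifics; the reference table below is made up), the difference is roughly between a lookup that enriches each pipeline row with several columns and a function-style lookup that returns a single value:

```python
# A hypothetical reference table: customer_id -> customer attributes.
customers = {
    1: {"name": "Acme", "country": "US"},
    2: {"name": "Globex", "country": "DE"},
}

def connected_lookup(rows, reference, key):
    """Connected style: sits in the pipeline, can add any number of columns."""
    return [{**row, **reference.get(row[key], {})} for row in rows]

def unconnected_lookup(reference, key_value, column):
    """Unconnected style: called like a function, returns one column only."""
    match = reference.get(key_value)
    return match[column] if match else None

orders = [{"order_id": 10, "customer_id": 1}]
enriched = connected_lookup(orders, customers, "customer_id")
country = unconnected_lookup(customers, 2, "country")
```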
Q. Let's suppose we have some 10,000-odd records in the source system, and we load them into the target. How do we ensure that the 10,000 records loaded to the target don't contain any garbage values? How do we test it? We can't check every record, as the number of records is huge.
A1. Data quality checks come in a number of forms:

1. For fact table rows: is there a valid lookup against each of the dimensions?
2. For fact or dimension rows, for each value:
* Is it null when it shouldn't be?
* Is the data type correct (e.g. number, date)?
* Is the range of values or the format correct?
* Is the row valid with relation to all the other source-system business rules?
There is no magic way of checking the integrity of data.
You could simply count the number of rows in and out again and assume it's all OK, but for a fact table (at the very minimum) you'll need to cope with failed dimension lookups (typically from late-arriving dimension rows).
The classic solution is to include dimension keys Zero and Minus One ('Null' and 'Invalid') in your dimension table. Null columns are set to the Zero key, and a lookup failure to the Minus One key. You may need to store and re-cycle rows with failed lookups and treat them as updates, so that if the missing dimension row appears later, the data is corrected.
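A minimal sketch of that convention (the key values 0 and -1 follow the answer above; the key map contents are invented):

```python
UNKNOWN_KEY = 0    # natural key was null -> the 'Null' dimension member
INVALID_KEY = -1   # natural key present but no dimension row found

# Hypothetical dimension key map: natural key -> surrogate key.
customer_dim = {"C001": 101, "C002": 102}

def resolve_surrogate(natural_key, key_map):
    """Map a fact row's natural key to a surrogate key, never failing the load."""
    if natural_key is None:
        return UNKNOWN_KEY                          # null maps to the Zero member
    return key_map.get(natural_key, INVALID_KEY)    # miss maps to Minus One
```

Fact rows that end up with INVALID_KEY can be parked and re-processed as updates once the late-arriving dimension row shows up.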
Otherwise, you've no option: if the incoming data is from an unreliable source, you'll need to check its validity or accept that the warehouse includes wrong results. If the warehouse includes a high percentage of incorrect or misleading values, what's the point
of having it?
A2. To do this, you must profile the data at the source to know the domain of all the values, get the actual number of rows in the source, and get the types of the data in the source. After the data is loaded into the target, this process can be repeated, i.e. checking the data values with respect to range, type, etc., and also checking the actual number of rows inserted. If the results before and after match, then we are OK. This process is typically automated in ETL tools.
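A toy version of that before/after check (profile both sides, compare row counts and value ranges; the column name is invented for the example):

```python
def profile(rows, column):
    """Summarise one column: row count, min and max of its non-null values."""
    values = [row[column] for row in rows if row[column] is not None]
    return {"rows": len(rows), "min": min(values), "max": max(values)}

def loads_match(source_rows, target_rows, column):
    """Compare the source profile against the target profile."""
    return profile(source_rows, column) == profile(target_rows, column)

source = [{"amount": 5}, {"amount": 9}]
target = [{"amount": 5}, {"amount": 9}]
```

A real profiler would also check data types, null counts and the value domain, but the pattern (compute the same summary on both sides and diff it) stays the same.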
Q. What is an ODS (operational data store)?
A1. The ODS comes between the staging area and the data warehouse. The data in the ODS is at a low level of granularity. Once data is populated in the ODS, aggregated data is loaded into the EDW through the ODS.
A2. The ODS is the Operational Data Store, which holds transactional data; the ODS is the source of a warehouse. Data from the ODS is staged, transformed and then moved to the data warehouse.
Refresh Load: the table is truncated and the data is loaded again. Static dimension tables or type tables are usually loaded using this method.

Incremental Load: a method to capture only the newly created or updated records. The load is performed based upon a flag or date column.

Full Load: when we are loading the data for the first time (either a base load or a history load), the whole set of records is loaded in one stretch, depending upon the volume.
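The incremental case can be sketched as filtering the source on a last-modified date watermark (the column name `updated_on` is an assumption for the example, not a standard):

```python
from datetime import date

def incremental_extract(rows, last_load_date):
    """Pick up only rows created or updated after the previous load."""
    return [row for row in rows if row["updated_on"] > last_load_date]

source = [
    {"id": 1, "updated_on": date(2019, 8, 1)},
    {"id": 2, "updated_on": date(2019, 8, 6)},
]
delta = incremental_extract(source, last_load_date=date(2019, 8, 5))
```

After a successful load the watermark is advanced to the newest `updated_on` seen, so the next run picks up only fresher rows.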
A mapping represents dataflow from sources to targets.
A mapplet creates or configures a set of transformations.
A workflow is a set of instructions that tell the Informatica server how to execute the tasks.
A worklet is an object that represents a set of tasks.
A session is a set of instructions that describe how and when to move data from sources to targets.
First Method: If there is a column in the source which identifies the record inserted date. Then it will
be easy to put a filter condition in the source qualifier.
Second Method: if there is no column in the source to identify the record's inserted date, then we need to do a target lookup based on the primary key, determine which records are new, and then insert them.
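The second method amounts to comparing source keys against the keys already present in the target (a simplified in-memory version; a real implementation would do a lookup against the target table itself):

```python
def find_new_records(source_rows, target_keys, key):
    """Return source rows whose primary key is not yet in the target."""
    return [row for row in source_rows if row[key] not in target_keys]

source = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 3, "name": "c"}]
existing = {1, 2}   # primary keys already loaded into the target
new_rows = find_new_records(source, existing, "id")
```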