
Design Approach to Handle Late Arriving Dimensions and Late Arriving Facts


Johnson Cyriac | Dec 29, 2013 | DW Design | ETL Design


In the typical data warehouse, dimensions are processed first and the facts are loaded later, with the assumption that all required dimension data is already in place. This may not be true in all cases because of the nature of your business process or the source application behavior. Fact data, too, can be sent from the source application to the warehouse much later than the actual fact data is created. In this article, let's discuss several options for handling late arriving dimensions and facts.

What is a Late Arriving Dimension

Late arriving dimensions, sometimes called early arriving facts, occur when dimension data arrives in the data warehouse later than the fact data that references that dimension record.

For example, an employee availing medical insurance through his employer is eligible for insurance coverage from the first day of employment. But the employer may not provide the medical insurance information to the insurance provider for several weeks. If the employee undergoes any medical treatment during this time, his medical claim records will arrive as fact records without the corresponding patient dimension details.

Design Approaches

Depending on the business scenario and the type of dimension in use, we can take different design approaches.


Hold the Fact Record Until the Dimension Record Is Available

One approach is to place the fact row in a suspense table. The fact row will be held in the suspense table until the associated dimension record has been processed. This solution is relatively easy to implement, but the primary drawback is that the fact row isn’t available for reporting until the associated dimension record has been handled.

This approach is more suitable when your data warehouse is refreshed as a scheduled batch process and a delay in loading fact records until the dimension records are available is acceptable to the business.
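A minimal SQL sketch of the suspense pattern for the insurance claim example, assuming hypothetical CLAIM_FACT, CLAIM_FACT_SUSPENSE and PATIENT_DIM tables and columns:

-- Release suspended claim facts whose patient dimension row has since arrived
INSERT INTO CLAIM_FACT (PATIENT_SK, CLAIM_ID, CLAIM_AMOUNT)
SELECT d.PATIENT_SK, s.CLAIM_ID, s.CLAIM_AMOUNT
FROM   CLAIM_FACT_SUSPENSE s
JOIN   PATIENT_DIM d ON d.PATIENT_ID = s.PATIENT_ID;

-- Remove the released rows from the suspense table
DELETE FROM CLAIM_FACT_SUSPENSE s
WHERE  EXISTS (SELECT 1 FROM PATIENT_DIM d WHERE d.PATIENT_ID = s.PATIENT_ID);

Running this release step as part of every scheduled batch keeps the suspense table small and makes the held facts available as soon as the dimension catches up.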

'Unknown' or default Dimension record

Another approach is to simply assign the “Unknown” dimension member to the fact record. On the positive side, this approach does allow the fact record to be recorded during the ETL process. But it won’t be associated with the correct dimension value. 

The "Unknown" fact records can also be kept into a suspense table. Eventually, when the Dimension data is processed, the suspense data can be reprocessed and associate with a real, valid Dimension record.


Inferring the Dimension record

Another method is to insert a new Dimension record with a new surrogate key and use the same surrogate key to load the incoming fact record. This only works if you have enough details about the dimension in the fact record to construct the natural key. Without this, you would never be able to go back and update this dimension row with complete attributes.


In the insurance claim example described at the beginning, it is almost certain that the "patient id" will be part of the claim fact, and it is the natural key of the patient dimension. So we can create a new placeholder dimension record for the patient with a new surrogate key and the natural key "patient id".

Note : When you get the remaining attributes for the patient dimension record at a later point, you will have to apply an SCD Type 1 update the first time and SCD Type 2 changes going forward.
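A sketch of the inferred-member pattern for the patient example, assuming a PATIENT_DIM_SEQ sequence and an INFERRED_FLAG column (both illustrative assumptions):

-- Create a placeholder (inferred) patient row keyed only by the natural key
INSERT INTO PATIENT_DIM (PATIENT_SK, PATIENT_ID, PATIENT_NAME, INFERRED_FLAG)
VALUES (PATIENT_DIM_SEQ.NEXTVAL, :patient_id, 'UNKNOWN', 'Y');

-- Later, when the full attributes arrive, complete the row in place (the one-time Type 1 update)
UPDATE PATIENT_DIM
SET    PATIENT_NAME  = :patient_name,
       INFERRED_FLAG = 'N'
WHERE  PATIENT_ID    = :patient_id
AND    INFERRED_FLAG = 'Y';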

Late Arriving Dimension and SCD Type 2 changes

Late arriving dimensions combined with SCD Type 2 changes are more complex to handle.

Late Arriving Dimension with multiple historical changes


As described above, we can handle a late arriving dimension by keeping an "Unknown" dimension record or an "Inferred" dimension record, which acts as a placeholder.

Even before we get the full dimension record details from the source system, there may be multiple SCD Type 2 changes to the placeholder dimension record. Each of these leads to the creation of a new dimension record with a new surrogate key, and any subsequent fact records must have their surrogate keys modified to point to the correct new surrogate key.

Late Arriving Dimension with retro effective changes

You can get dimension records from the source system with retro effective dates. For example, you might update your marital status in your HR system well after your marriage date. This update comes to the data warehouse with a retro effective date.

This leads to a new dimension record with a new surrogate key and changes to the effective dates of the affected dimension rows. You will have to scan forward in the dimension to see if there are any subsequent Type 2 rows for this dimension. This, in turn, requires modifying the surrogate key of any subsequent fact records to point to the new surrogate key.
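As a sketch, once the retro-effective Type 2 row has been inserted and the effective dates adjusted, the affected facts can be re-pointed with a correlated update such as the one below (table, column and bind names are assumptions):

-- Re-point facts whose transaction date now falls inside a different version's window
UPDATE CLAIM_FACT f
SET    f.PATIENT_SK = (SELECT d.PATIENT_SK
                       FROM   PATIENT_DIM d
                       WHERE  d.PATIENT_ID = f.PATIENT_ID
                       AND    f.CLAIM_DATE BETWEEN d.EFF_START_DATE AND d.EFF_END_DATE)
WHERE  f.PATIENT_ID = :patient_id;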


What are Late Arriving Facts

The late arriving fact scenario occurs when the transaction or fact data comes to the data warehouse much later than the actual transaction occurred in the source application. If the late arriving fact needs to be associated with an SCD Type 2 dimension, the situation becomes messy, because we have to search back through the dimension history to assign the dimension keys that were in effect when the activity occurred in the past.

Design Approaches

Unlike late arriving dimensions, late arriving fact records can be handled relatively easily. When loading the fact record, the associated dimension table history has to be searched to find the surrogate key that was in effect at the time the transaction occurred. The data flow below describes the late arriving fact design approach.
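A sketch of the point-in-time lookup, assuming the dimension keeps EFF_START_DATE and EFF_END_DATE columns for its Type 2 versions:

-- Fetch the surrogate key that was in effect on the transaction date of the late arriving fact
SELECT d.PATIENT_SK
FROM   PATIENT_DIM d
WHERE  d.PATIENT_ID = :patient_id
AND    :claim_date BETWEEN d.EFF_START_DATE AND d.EFF_END_DATE;

In Informatica this is typically implemented as a lookup whose condition compares the fact's transaction date against the effective-date range instead of simply picking the current dimension row.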


Hope you enjoyed this article and that it gave you some new insights into late arriving dimension and fact scenarios in the data warehouse. Leave us your questions and comments. We would also like to hear how you have handled late arriving dimensions and facts in your data warehouse.

SOFT and HARD Deleted Records and Change Data Capture in Data Warehouse

Johnson Cyriac | Dec 8, 2013 | DW Design | ETL Design


In a couple of prior articles we discussed change data capture, different techniques to capture changed data, and a change data capture framework. In this article we will take a deep dive into different aspects of changed data in the data warehouse, including soft and hard deletions in source systems.

Revisiting Change Data Capture (CDC)

When we talk about Change Data Capture (CDC) in the DW, we mean capturing the changes that have happened on the source side since the last time we ran our job. In Informatica we call our ETL code a 'Mapping' because we map the source data (OLTP) into the target data (DW); the purpose of running the ETL code is to keep the source and target data in sync, with some transformations in between, as per the business rules.

Now, data may get changed at source in three different ways.

1. NEW transactions happened at source.
2. CORRECTIONS happened on old transactional or measured values.
3. INVALID transactions were removed from source.


Usually our ETL takes care of the 1st and 2nd cases (insert/update logic); the 3rd change is not captured in the DW unless it is specifically called out in the requirement specification. But when it is, we need to devise convenient ways to track the transactions that were removed, i.e., to track the records deleted at source and DELETE the corresponding records in the DW.

One thing to make clear is that purging might be enabled on your OLTP, i.e., the OLTP keeps data only for a fixed historical period, but that is a different scenario. Here we are more interested in what was DELETED at source because the transactions were NOT valid.

Effects in DW for Source Data Deletion

DW tables can be divided into three categories as related to the deleted source data.


1. When the DW table load nature is 'Truncate & Load' or 'Delete & Reload', we don't have any impact, since the requirement is to keep the exact snapshot of the source table at any point of time.

2. When the DW table does not track history on data changes and deletes are allowed against the source table. If a record is deleted in the source table, it is also deleted in the DW.

3. When the DW table tracks history on data changes and deletes are allowed against the source table. The DW table will retain the record that has been deleted in the source system, but this record will be either expired in DW based on the change captured date or 'Soft Delete' will be applied against it.

Types of Data Deletion

Academically, deleting records from a DW table is forbidden; however, it is a common practice in most DWs when we face this kind of situation. Again, if we are deleting records from the DW, it has to be done after proper discussion with the business. If your business requires DELETION, there are two ways.

Logical Delete :- In this case, we have a specific flag in the source table, such as STATUS, holding the values 'ACTIVE' or 'INACTIVE'. Some OLTPs keep a flag field with the values 'I', 'U' or 'D', where 'D' means that the record is deleted or INACTIVE. This approach is quite safe and is also known as Soft DELETE.


Physical Delete :- In this case, the records related to invalid transactions are fully deleted from the source table by issuing DML statements. This is usually done after thorough discussion with business users, and the related business rules are strictly followed. This is also known as Hard DELETE.

ETL Perspective on Deletion

When we have 'Soft DELETE' implemented on the source side, it becomes very easy to track the invalid transactions, and we can tag those transactions in the DW accordingly. We just need to filter the records from source using that STATUS field and issue an UPDATE in the DW for the corresponding records. A few things to keep in mind in this case:


If only ACTIVE records are supposed to be used in ETL processing, we need to add specific filters while fetching source data.

Sometimes INACTIVE records are pulled into the DW and carried up to the ETL data warehouse level. While pushing the data into the exploration data warehouse, only the ACTIVE records are sent for reporting purposes.
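A minimal sketch of the soft-delete UPDATE described above, assuming an ORDERS_SRC source table with a STATUS flag and an ORDERS_DW target with DELETE_FLAG and UPDATE_DATE audit columns (all names are assumptions):

-- Tag DW rows whose source rows were soft deleted (STATUS = 'INACTIVE')
UPDATE ORDERS_DW t
SET    t.DELETE_FLAG = 'Y',
       t.UPDATE_DATE = SYSDATE
WHERE  EXISTS (SELECT 1
               FROM   ORDERS_SRC s
               WHERE  s.ORDER_ID = t.ORDER_ID
               AND    s.STATUS   = 'INACTIVE');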

For 'Hard DELETE', if an audit table is maintained in the source system recording which transactions were deleted, we can source it, i.e., join the audit table and the source table on the NK and logically delete the matching records in the DW too. But it becomes quite cumbersome and costly when no record is kept of what was deleted at all. In these cases, we need to use other ways to track the deletions and update the corresponding records in the DW.

Deletion in Data Warehouse : Dimension Vs Fact

In most cases, we see only transactional records being deleted from source systems. Deletion of data warehouse records is a rare scenario.

Deletion in Dimension Tables


If we have DELETION enabled for Dimensions in DW, it's always safe to keep a copy of the OLD record in some AUDIT table, as it helps to track any defects in future. A simple DELETE trigger should work fine; since DELETION hardly happens, this trigger would not degrade the performance much.

Let's take this ORDERS table into consideration. Along with this, we can have a History table for ORDERS, e.g. ORDERS_Hist, which would store the DELETED records from ORDERS.

 The below Trigger will work fine to achieve this.
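The original screenshot of the trigger is not reproduced here; a minimal sketch, assuming ORDERS has ORDER_ID and ORDER_NAME columns and ORDERS_Hist adds the DELETED_DATE and DELETED_BY audit fields, could look like this:

-- Copy every deleted ORDERS row into ORDERS_Hist along with audit details
CREATE OR REPLACE TRIGGER TRG_ORDERS_DELETE
BEFORE DELETE ON ORDERS
FOR EACH ROW
BEGIN
  INSERT INTO ORDERS_Hist (ORDER_ID, ORDER_NAME, DELETED_DATE, DELETED_BY)
  VALUES (:OLD.ORDER_ID, :OLD.ORDER_NAME, SYSDATE, USER);
END;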


The audit fields will convey when this particular record was deleted and by which user. But this table needs to be created for each and every DW table where we want to keep an audit of what was DELETED. If the entire record is not needed and only the fields involved in the Natural Key (NK) will do, we can have a consolidated table for all the dimensions.

Here the Record_IDENTIFIER field contains the values of all the columns involved in the Natural Key (NK), separated by '#', for the table mentioned in the OBJECT_NAME field.
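A sketch of such a consolidated audit table (apart from OBJECT_NAME and Record_IDENTIFIER, the column names and all data types are assumptions):

CREATE TABLE DIM_DELETE_AUDIT
(
  OBJECT_NAME        VARCHAR2(128),   -- dimension table the deleted row belonged to
  Record_IDENTIFIER  VARCHAR2(4000),  -- NK column values concatenated with '#'
  DELETED_DATE       DATE,
  DELETED_BY         VARCHAR2(128)
);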

Sometimes we face a situation in the DW where a FACT table record contains a Surrogate Key (SK) from a dimension but the dimension table no longer owns it. In those cases, the FACT table record becomes an orphan, and it will hardly ever appear in any report, since we always use an INNER JOIN between dimensions and fact while retrieving data in the reporting layer, and the orphan row fails that Referential Integrity (RI) check.

Suppose we want to track the orphan records in the SALES fact table with respect to the Product dimension. We can use a query such as the one below.
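The original query image is not reproduced here; a sketch of the idea, assuming SALES_FACT and PRODUCT_Dimension are joined on PRODUCT_SK, would be:

-- Fact rows whose product surrogate key no longer exists in the dimension
SELECT f.*
FROM   SALES_FACT f
WHERE  NOT EXISTS (SELECT 1
                   FROM   PRODUCT_Dimension d
                   WHERE  d.PRODUCT_SK = f.PRODUCT_SK);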


So, the query above will identify only the orphan records; it certainly cannot give you the records DELETED from PRODUCT_Dimension. One feasible solution is to populate an EVENT table with the SKs from PRODUCT_Dimension that are being DELETED, provided we do not reuse our surrogate keys. When we have both the SKs and the NKs from PRODUCT_Dimension in the EVENT table for DELETED entries, we can achieve better compliance over the data warehouse data.

Another useful but rarely used approach is enabling auditing of DELETE statements on a table in an Oracle database using a statement like the following.


AUDIT DELETE ON SCHEMA.TABLE;

The DBA_AUDIT_STATEMENT table will contain the details related to this deletion, for example the user who issued the DML statement, the exact statement text, and so on, but it cannot provide you with the record that was deleted. Since this approach cannot directly tell you which record was deleted, it is not so useful in our current discussion, so I will leave the topic here.

Deletion in Fact Tables

Now, this was all about DELETION in DW Dimension tables. Regarding FACT data DELETION, I would like to cite an extract of what Ralph Kimball has to say on Physical Deletion of Facts from DW.

        

Change Data Capture & Apply for 'Hard DELETE' in Source

Again, whether we should track the DELETED records from source or not depends on the type of table and its load nature. I will share a few genuine scenarios that are usually faced in any DW and discuss the solutions accordingly.

1. Records are DELETED from SOURCE for a known Time Period, no Audit Trail was kept.

In this case, the ideal solution is to DELETE the entire set of records in the DW target table for that time period and pull the source records for the period once again. This will bring the DW in sync with the source, and the DELETED records will no longer be present in the DW.

Usually the time period is defined in terms of Ship_DATE, Invoice_DATE or Event_DATE, i.e., a DATE-type field from the actual dataset of the source table. Hence, just as we filter the records for extraction from the source table using a WHERE clause, we can apply the same filter to the DW table as well.

Obviously, in this case we are NOT able to capture the 'Hard DELETE' from the source, i.e., we cannot track the history of the data, but at least we are able to bring the source and DW in sync. Again, this approach is recommended only when the situation occurs once in a while and not on a regular basis.
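A sketch of the re-sync for a known period, assuming the period is defined on INVOICE_DATE and the bind names are placeholders:

-- Clear the affected period in the DW, then re-extract the same period from source
DELETE FROM SALES_FACT
WHERE  INVOICE_DATE BETWEEN :period_start AND :period_end;
-- ...followed by the normal extract/load filtered with the same WHERE clause on the source table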

2. Records are DELETED from SOURCE on a regular basis with NO timeframe, no Audit Trail was kept.

The possible solution in this case is to implement a FULL OUTER JOIN between the source and the target table. The tables should be joined on the fields involved in the Natural Key (NK). This approach helps us track all three kinds of changes to source data in one shot.

The logic can be better explained with the help of a Venn diagram.


From the Joiner (kept in Full Outer Join mode):

Records that have values for the NK fields only from the source and not from the target should go to the INSERT flow. These are new records coming from source.

Records that have values for the NK fields from both the source and the target should go to the UPDATE flow. These are records that already exist in the source.

Records that have values for the NK fields only from the target go to the DELETE flow. These are the records that were DELETED from the source table.

Now, what we do with those DELETED records from source, i.e., apply a 'Soft DELETE' or a 'Hard DELETE' in the DW, depends on our requirement specification and business scenario.
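A sketch of the full outer join comparison in plain SQL, keyed on an assumed ORDER_ID natural key; the Joiner in full outer join mode implements the same idea inside the mapping:

-- Classify each natural key as INSERT, UPDATE or DELETE by comparing source and target
SELECT COALESCE(s.ORDER_ID, t.ORDER_ID) AS ORDER_ID,
       CASE
         WHEN t.ORDER_ID IS NULL THEN 'INSERT'   -- new in source
         WHEN s.ORDER_ID IS NULL THEN 'DELETE'   -- gone from source
         ELSE                         'UPDATE'   -- present on both sides
       END AS CHANGE_FLAG
FROM   ORDERS_SRC s
FULL OUTER JOIN ORDERS_DW t
       ON t.ORDER_ID = s.ORDER_ID;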

But this approach has a severe disadvantage in terms of ETL performance. Whenever we go for a FULL OUTER JOIN between source and target, we process the entire data set from both ends, and this will obviously slow the ETL down as data volume increases.

3. Records are DELETED from SOURCE, Audit Trail was kept.

Even though I'm calling it a DELETION, it is NOT the kind of physical DELETION we discussed previously. This is mainly related to incorrect transactions in legacy systems, e.g., mainframes, which usually send data in flat files.


When some old transactions become invalidated, the source team sends the related records to the DW again but with inverted measures, i.e., the sales figures are the same as the old ones but negative. So the DW contains both the old set of records and the newly arrived records, but the aggregated measures net out to zero in the aggregated FACT table, thus reducing the impact of those invalid transactions in the DW to nothing.

The only disadvantage of this approach is that the aggregated FACT contains the correct data at the summarized level, while the transactional FACT holds a dual set of records, which together represent the real scenario, i.e., first the transaction happened (the older record) and then it became invalid (the newer record).
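A quick illustration of why the summarized level stays correct, assuming a SALES_FACT with ORDER_ID and SALES_AMOUNT columns:

-- An invalidated order nets to zero because both the original and the inverted record are present
SELECT ORDER_ID,
       SUM(SALES_AMOUNT) AS NET_SALES
FROM   SALES_FACT
GROUP  BY ORDER_ID;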

Hope you enjoyed this article and that it gave you some new insights into change data capture in the data warehouse. Leave us your questions and comments. We would like to hear how you have handled change data capture in your data warehouse.


Surrogate Key Generation Approaches Using Informatica PowerCenter

Johnson Cyriac | Nov 21, 2013 | ETL Design | Mapping Tips


A surrogate key is a sequentially generated unique number attached to each and every record in a dimension table in a data warehouse. We discussed surrogate keys in detail in our previous article. Here we will concentrate on different approaches to generating surrogate keys for different types of ETL processes.

Surrogate Key for Dimensions Loading in Parallel

When you have a single dimension table loading in parallel from different application data sources, special care should be taken to make sure that no keys are duplicated. Let's look at the different design options.

1. Using Sequence Generator Transformation

This is the simplest and most preferred way to generate a Surrogate Key (SK). We create a reusable Sequence Generator transformation and map the NEXTVAL port to the SK field of the target table in the INSERT flow of the mapping. The start value is usually 1, incremented by 1.

Below shown is a reusable Sequence Generator transformation.


The NEXTVAL port from the Sequence Generator can be mapped to the surrogate key column in the target table. The Sequence Generator transformation is shown below.


Note : Make sure to create a reusable transformation, so that the same transformation can be reused in multiple mappings that load the same dimension table.

2. Using Database Sequence

We can create a SEQUENCE in the database and use it to generate the SKs for any table. It can be invoked through a SQL Transformation or a Stored Procedure Transformation.

First we create a SEQUENCE using the following command.

CREATE SEQUENCE DW.Customer_SK
MINVALUE 1
MAXVALUE 99999999
START WITH 1
INCREMENT BY 1;

Using SQL Transformation

You can create a reusable SQL Transformation as shown below. It takes the name of the database sequence and the schema name as input and returns SK numbers.


The schema name (DW) and sequence name (Customer_SK) can be passed in as input values for the transformation, and the output can be mapped to the target SK column. The SQL transformation is shown below.
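Whatever the transformation configuration, the statement it ultimately issues against the database is simply the sequence fetch, for example:

-- Statement executed whenever a new surrogate key value is needed
SELECT DW.Customer_SK.NEXTVAL FROM DUAL;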

Using Stored Procedure Transformation

We use the sequence DW.Customer_SK to generate the SKs in an Oracle function, which is in turn called via a Stored Procedure transformation.

Create a database function as below. Here we are creating an Oracle function.

CREATE OR REPLACE FUNCTION DW.Customer_SK_Func
RETURN NUMBER
IS
  Out_SK NUMBER;
BEGIN
  SELECT DW.Customer_SK.NEXTVAL INTO Out_SK FROM DUAL;
  RETURN Out_SK;
EXCEPTION
  WHEN OTHERS THEN
    raise_application_error(-20001, 'An error was encountered - '||SQLCODE||' -ERROR- '||SQLERRM);
END;

You can import the database function as a Stored Procedure transformation as shown in the image below.

Now, just before the target instance for Insert flow, we add an Expression transformation. We add an output port there with the following formula. This output port GET_SK can be connected to the target surrogate key column.

GET_SK = :SP.CUSTOMER_SK_FUNC()


Note : The database function can be parameterized, and the stored procedure transformation can also be made reusable to make this approach more effective.

Surrogate Key for Non Parallel Loading Dimensions

If the dimension table is not loaded in parallel from different application data sources, we have a couple more options to generate SKs. Let's look at the design options.

Using Dynamic LookUP

When we implement Dynamic LookUP in any mapping, we may not even need to use the Sequence Generator for generating the SK values.

For a dynamic lookup on the target, we have the option of associating any lookup port with an input port, output port, or Sequence-ID. When we associate a Sequence-ID, the Integration Service generates a unique integer value for each row inserted into the lookup cache. This is applicable only to ports with the Bigint, Integer or Small Integer data type. Since the SK is usually of integer type, we can exploit this.

The Integration Service uses the following process to generate Sequence IDs.


When the Integration Service creates the dynamic lookup cache, it tracks the range of values for each port that has a sequence ID in the dynamic lookup cache.

When the Integration Service inserts a row of data into the cache, it generates a key for a port by incrementing the greatest sequence ID value by one.

When the Integration Service reaches the maximum number for a generated sequence ID, it starts over at one. The Integration Service increments each sequence ID by one until it reaches the smallest existing value minus one. If the Integration Service runs out of unique sequence ID numbers, the session fails.

Above shown is a dynamic lookup configuration to generate SK for CUST_SK.

The Integration Service generates a Sequence-ID for each row it inserts into the cache. For any record already present in the target, it gets the SK value from the dynamic lookup cache on the target, based on the associated-port matching. So, if we connect this port to the target SK field, there is no need to generate SK values separately, since the new SK value (for records to be inserted) or the existing SK value (for records to be updated) is supplied by the dynamic lookup.


The disadvantage of this technique lies in the fact that we don’t have any separate SK Generating Area and the source of SK is totally embedded into the code.


Using Expression Transformation

Suppose we are populating CUSTOMER_DIM. In the mapping, first create an Unconnected Lookup on the dimension table, say LKP_CUSTOMER_DIM. The purpose is to get the maximum SK value in the dimension table. Say the SK column is CUSTOMER_KEY and the NK column is CUSTOMER_ID.

Select CUSTOMER_KEY as Return Port and Lookup Condition as

CUSTOMER_ID = IN_CUSTOMER_ID

Use the SQL Override as below:

SELECT MAX (CUSTOMER_KEY) AS CUSTOMER_KEY, '1' AS CUSTOMER_ID FROM CUSTOMER_DIM

Next, in the mapping after the Source Qualifier, use an Expression transformation. Here we generate the SKs for the dimension based on the previously generated value. We create the following ports in the Expression to compute the SK value.

VAR_COUNTER = IIF(ISNULL( VAR_INC ), NVL(:LKP.LKP_CUSTOMER_DIM('1'), 0) + 1, VAR_INC + 1 )

VAR_INC = VAR_COUNTER
OUT_COUNTER = VAR_COUNTER

When the mapping starts, for the first row we look up the dimension table to fetch the maximum available SK. Then we keep incrementing the SK value stored in the variable port by 1 for each incoming row. OUT_COUNTER will give the SKs to be populated into CUSTOMER_KEY.

Using Mapping & Workflow Variable

Here again we will use the Expression transformation to compute the next SK, but will get the MAX available SK in a different way.

Suppose, we have a session s_New_Customer, which loads the Customer Dimension table. Before that session in the Workflow, we add a dummy session as s_Dummy.

In s_Dummy, we will have a mapping variable, e.g. $$MAX_CUST_SK which will be set with the value of MAX (SK) in Customer Dimension table.

SELECT MAX (CUSTOMER_KEY) AS CUSTOMER_KEY FROM CUSTOMER_DIM


We will have CUSTOMER_DIM as our source table, and the target can be a simple flat file that is not used anywhere. We pull this MAX(SK) through the Source Qualifier, and then in an Expression transformation we assign this value to the mapping variable using the SETVARIABLE function. So, we will have the following ports in the Expression:

INP_CUSTOMER_KEY = INP_CUSTOMER_KEY -- the MAX of the SK coming from the Customer Dimension table

OUT_MAX_SK = SETVARIABLE ($$MAX_CUST_SK, INP_CUSTOMER_KEY) -- output port

This output port will be connected to the flat file port, but the value we assigned to the variable will persist in the repository.

In our second mapping we start generating the SK from the value $$MAX_CUST_SK + 1. But how can we pass the parameter value from one session into the other one?

Here the use of a workflow variable comes into the picture. We define a workflow variable $$MAX_SK, and in the Post-session on success variable assignment section of s_Dummy we assign the value of $$MAX_CUST_SK to $$MAX_SK. Now the variable $$MAX_SK contains the maximum available SK value from the CUSTOMER_DIM table. Next we define another mapping variable in the session s_New_Customer as $$START_VALUE, and this is assigned the value of $$MAX_SK in the Pre-session variable assignment section of s_New_Customer.

So, the sequence is:

Post-session on success variable assignment of the first session: $$MAX_SK = $$MAX_CUST_SK

Pre-session variable assignment of the second session: $$START_VALUE = $$MAX_SK

Now, in the actual mapping, we add an Expression transformation with the following ports to compute the SKs one by one for each record being loaded into the target.

VAR_COUNTER = IIF (ISNULL (VAR_INC), $$START_VALUE + 1, VAR_INC + 1)


VAR_INC = VAR_COUNTER
OUT_COUNTER = VAR_COUNTER

OUT_COUNTER will be connected to the SK port of the target.

Hope you enjoyed this article and learned some new ways to generate surrogate keys for your dimension tables. Please leave us a comment or feedback if you have any; we are happy to hear from you.


Surrogate Key in Data Warehouse, What, When, Why and Why Not

Johnson Cyriac | Nov 13, 2013 | DW Design | ETL Design


Surrogate keys are a widely used and accepted design standard in data warehouses. A surrogate key is a sequentially generated unique number attached to each and every record in a dimension table. It joins the fact and dimension tables and is necessary for handling changes to dimension table attributes.

What Is Surrogate Key

A Surrogate Key (SK) is a sequentially generated, meaningless, unique number attached to each and every record in a table in a Data Warehouse (DW).

It is UNIQUE since it is a sequentially generated integer for each record being inserted into the table.

It is MEANINGLESS since it does not carry any business meaning regarding the record it is attached to in any table.

It is SEQUENTIAL since it is assigned in sequential order as and when new records are created in the table, starting with one and going up to the highest number that is needed.

Surrogate Key Pipeline and Fact Table

During the FACT table load, different dimensional attributes are looked up in the corresponding Dimensions and SKs are fetched from there. These SKs should be fetched from the most recent versions of the dimension records. Finally the FACT table in DW contains the factual data along with corresponding SKs from the Dimension tables.


The below diagram shows how the FACT table is loaded from the source.

Why Should We Use Surrogate Key

Basically, it is an artificial key that is used as a substitute for a Natural Key (NK). We should have an NK defined in our tables as per the business requirement, and that should be able to uniquely identify any record. The SK, however, is just an integer attached to a record for the purpose of joining different tables in a star or snowflake schema based DW. An SK is especially needed when we have a very long NK or the data type of the NK is not suitable for indexing.

The below image shows a typical Star Schema, joining different Dimensions with the Fact using SKs.


Ralph Kimball emphasizes the abstraction of the NK. As per him, surrogate keys should NOT be:


Smart, where you can tell something about the record just by looking at the key.

Composed of natural keys glued together.

Implemented as multiple parallel joins between the dimension table and the fact table; so-called double or triple barreled joins.

As per Thomas Kejser, a “good key” is a column that has the following properties:

It is forced to be unique.
It is small.
It is an integer.
Once assigned to a row, it never changes.
Even if deleted, it will never be re-used to refer to a new row.
It is a single column.
It is stupid.
It is not intended to be remembered by users.

If the above mentioned features are taken into account, SK would be a great candidate for a Good Key in a DW.

Apart from these, a few more reasons for choosing the SK approach are:

If we replace the NK with a single integer, we can save a substantial amount of storage space. The SKs of different dimensions are stored as Foreign Keys (FK) in the fact tables to maintain Referential Integrity (RI), and storing concise SKs instead of those big NKs requires less space. The UNIQUE index built on the SK will also take less space than a UNIQUE index built on an NK, which may be alphanumeric.

Replacing big, ugly NKs and composite keys with tight integer SKs is bound to improve join performance, since joining two integer columns works faster. So it provides an extra edge in ETL performance by speeding up data retrieval and lookup.

An advantage of a four-byte integer key is that it can represent more than 2 billion different values, which is enough for any dimension; the SK will not run out of values, not even for a big or monster dimension.

The SK is independent of the data contained in the record; we cannot understand anything about the data in a record simply by looking at the SK. Hence it provides data abstraction.

So, apart from the abstraction of critical business data involved in the NK, we also gain the advantage of storage space reduction when we implement SKs in our DW. It has become standard practice to associate an SK with every table in the DW, irrespective of whether it is a dimension, fact, bridge or aggregate table.

Why Shouldn’t We Use Surrogate Key

There are a number of disadvantages as well when working with SKs. Let's see them one by one:

The values of SKs have no relationship with the real-world meaning of the data held in a row. Therefore, overuse of SKs leads to the problem of disassociation.


The generation and attachment of SK creates unnecessary ETL burden. Sometimes it may be found that the actual piece of code is short and simple, but generating the SK and carrying it forward till the target adds extra overhead on the code.

During horizontal Data Integration (DI), where multiple source systems load data into a single dimension, we have to maintain a single SK generating area to enforce the uniqueness of the SK. This may come as an extra overhead on the ETL.

Even query optimization becomes difficult: since the SK takes the place of the PK, the unique index is applied to that column, and any query based on the NK leads to a Full Table Scan (FTS) because it cannot take advantage of the unique index on the SK.

Replication of data from one environment to another, i.e., data migration, becomes difficult because SKs from different dimension tables are used as FKs in the fact table and SKs are DW-specific; any mismatch in the SK for a particular dimension would result in no data or erroneous data when we join them in a star schema.

If duplicate records come from the source, there is a potential risk of duplicates being loaded into the target, since the unique constraint is defined on the SK and not on the NK.

SK should not be implemented just in the name of standardizing your code. SK is required when we cannot use an NK to uniquely identify a record or when using an SK seems more suitable as the NK is not a good fit for PK.

Reference : Ralph Kimball, Thomas Kejser


Informatica PowerCenter on Grid for Greater Performance and Scalability

Johnson Cyriac | Oct 31, 2013 | ETL Design | Performance Tips


Informatica has developed a solution that leverages the power of grid computing for greater data integration scalability and performance. The grid option delivers the load balancing, dynamic partitioning, parallel processing and high availability needed to ensure optimal scalability, performance and reliability. In this article, let's discuss how to set up an Informatica workflow to run on a grid.

What is PowerCenter On Grid


When a PowerCenter domain contains multiple nodes, you can configure workflows and sessions to run on a grid. When you run a workflow on a grid, the Integration Service runs a service process on each available node of the grid to increase performance and scalability. When you run a session on a grid, the Integration Service distributes session threads to multiple DTM processes on nodes in the grid to increase performance and scalability.

Domain : A PowerCenter domain consists of one or more nodes in the grid environment. PowerCenter services run on the nodes. A domain is the foundation for PowerCenter service administration.

Node : A node is a logical representation of a physical machine that runs a PowerCenter service.

Admin Console with Grid Configuration

Shown below is an Informatica Admin Console with a two-node grid configuration. We can see the two nodes Node_1 and Node_2, and the grid Node_GRID created from them. The integration service Int_service_GRID is running on the grid.


Setting up Workflow on Grid

When you set up a workflow to run on a grid, the Integration Service distributes workflows across the nodes in the grid. It also distributes the Session, Command, and predefined Event-Wait tasks within workflows across the nodes in the grid.

You can set up the workflow to run on a grid as shown in the image below, by assigning to the workflow the integration service that is configured on the grid.


Setting up Session on Grid

When you run a session on a grid, the Integration Service distributes session threads across nodes in a grid. The Load Balancer distributes session threads to DTM processes running on different nodes. You might want to configure a session to run on a grid when the workflow contains a session that takes a long time to run.

You can set up the session to run on a grid as shown in the image below.


Workflow Running on Grid

The Workflow Monitor screenshot below shows a workflow running on a grid. In the 'Task Progress Details' window you can see that two of the sessions in the workflow wf_Load_CUST_DIM run on Node_1 and the other one on Node_2.


Key Features and Advantages of Grid

Load Balancing : When facing spikes in data processing, load balancing keeps operations smooth by switching data processing between nodes on the grid. The node is chosen dynamically based on process size, CPU utilization, memory requirements, etc.

High Availability : The grid complements the High Availability feature of PowerCenter by switching the master node in case of a node failure. This ensures continuous monitoring and shortens the time needed for recovery processes.

Dynamic Partitioning : Dynamic partitioning helps make the best use of the currently available nodes on the grid. By adapting to available resources, it also helps increase the performance of the whole ETL process.

Hope you enjoyed this article, please leave us a comment or feedback if you have any, we are happy to hear from you.


Time Zones Conversion and Standardization Using Informatica PowerCenter

Johnson Cyriac | Oct 23, 2013 | ETL Design | Mapping Tips


When your data warehouse is sourcing data from multiple time zones, it is recommended to capture a universal standard time as well as the local times. The same goes for transactions involving multiple currencies. This design enables analysis on the local time along with the universal standard time. The time standardization is done as part of the ETL that loads the warehouse. In this article, let's discuss the implementation using Informatica PowerCenter.

We will concentrate only on the ETL part of time zone conversion and standardization, but not the data modeling part. You can learn more about the dimensional modeling aspect from Ralph Kimball.

Business Use Case

Let's consider an ETL job used to integrate sales data from different global sales regions into the enterprise data warehouse. Sales transactions happen in different time zones and come from different sales applications, which capture sales in local time. Data in the warehouse needs to be standardized, and sales transactions need to be captured in local as well as GMT time.

Solution : Create a reusable expression to convert the local time into GMT. This reusable transformation can then be used in any mapping or ETL process that needs time zone standardization.

Building the Reusable Expression

You can create the reusable transformation in the Transformation Developer.

In the expression transformation, create the ports below and the corresponding expressions. Be sure to create the ports with the same order, data type and precision in the transformation.

LOC_TIME_WITH_TZ : STRING(36) (Input)
DATE_TIME : DATE/TIME (Variable)
TZ_DIFF : INTEGER (Variable)
TZ_DIFF_HR : INTEGER (Variable)
TZ_DIFF_MI : INTEGER (Variable)
GMT_TIME_HH : DATE/TIME (Variable)
GMT_TIME_MI : DATE/TIME (Variable)
GMT_TIME_WITH_TZ : STRING(36) (Output)

Now create the expressions below for all the ports.

DATE_TIME : TO_DATE(SUBSTR(LOC_TIME_WITH_TZ,0,29),'DD-MON-YY HH:MI:SS.US AM')
TZ_DIFF : IIF(SUBSTR(LOC_TIME_WITH_TZ,30,1)='+',-1,1)
TZ_DIFF_HR : TO_DECIMAL(SUBSTR(LOC_TIME_WITH_TZ,31,2))
TZ_DIFF_MI : TO_DECIMAL(SUBSTR(LOC_TIME_WITH_TZ,34,2))
GMT_TIME_HH : ADD_TO_DATE(DATE_TIME,'HH',TZ_DIFF_HR*TZ_DIFF)
GMT_TIME_MI : ADD_TO_DATE(GMT_TIME_HH,'MI',TZ_DIFF_MI*TZ_DIFF)
GMT_TIME_WITH_TZ : TO_CHAR(GMT_TIME_MI,'DD-MON-YYYY HH:MI:SS.US AM') || ' +00:00'

Note : The expression is based on the timestamp format 'DD-MON-YYYY HH:MI:SS.FF AM TZH:TZM'. If you are using a different Oracle timestamp format, this expression might not work. Below is the expression transformation with the expressions added.

The reusable transformation can be used in any mapping that needs the time zone conversion. Shown below is the completed expression transformation.


You can see sample output data generated by the expression in the image below.

Expression Usage

This reusable transformation takes one input port and gives one output port. The input port should be a date timestamp with time zone information. Below shown is a mapping using this reusable transformation.

Note : Timestamp with time zone is processed as STRING(36) data type in the mapping. All the transformations should use STRING(36) data type. Source and target should use VARCHAR2(36) data type.

Download

You can download the reusable expression we discussed in this article. Click here for the download link.

Hope this tutorial was helpful and useful for your project. Please leave your questions and comments; we will be more than happy to help you.


Informatica Performance Tuning Guide, Performance Enhancements - Part 4

Johnson Cyriac | Nov 30, 2013 | ETL Admin | Performance Tips


In our performance tuning article series, so far we have covered performance tuning basics, identification of bottlenecks and resolving different bottlenecks. In this article we will cover the different performance enhancement features available in Informatica PowerCenter. In addition to the features provided by PowerCenter, we will go over design tips and tricks for ETL load performance improvement.

Performance Enhancements Features

The main PowerCenter features for performance enhancement are:

1. Pushdown Optimization
2. Pipeline Partitions
3. Dynamic Partitions
4. Concurrent Workflows
5. Grid Deployments
6. Workflow Load Balancing
7. Other Performance Tips and Tricks

Performance Tuning Tutorial Series: Part I : Performance Tuning Introduction | Part II : Identify Performance Bottlenecks | Part III : Remove Performance Bottlenecks | Part IV : Performance Enhancements

1. Pushdown Optimization

The Pushdown Optimization option enables data transformation processing to be pushed down into the relational database to make the best use of database processing power. It converts the transformation logic into SQL statements, which execute directly on the database. This minimizes the need to move data between servers and utilizes the power of the database engine.


Read More about Pushdown Optimization.

2. Session Partitioning

The Informatica PowerCenter Partitioning Option increases the performance of PowerCenter through parallel data processing. Partitioning option will let you split the large data set into smaller subsets which can be processed in parallel to get a better session performance.

Read More about Session Partitioning.

3. Dynamic Session Partitioning

Informatica PowerCenter session partitioning can be used to process data in parallel and achieve faster data delivery. Using the dynamic session partitioning capability, PowerCenter can decide the degree of parallelism at run time. The Integration Service scales the number of session partitions based on factors such as source database partitions or the number of CPUs on the node, resulting in significant performance improvement.

Read More about Dynamic Session Partitioning.

4. Concurrent Workflows


A concurrent workflow is a workflow that can run as multiple instances concurrently. A workflow instance is a representation of a workflow. We can configure two types of concurrent workflows: concurrent workflows with the same instance name, or unique workflow instances that run concurrently.

Read More about Concurrent Workflows.

5. Grid Deployments

When a PowerCenter domain contains multiple nodes, you can configure workflows and sessions to run on a grid. When you run a workflow on a grid, the Integration Service runs a service process on each available node of the grid to increase performance and scalability. When you run a session on a grid, the Integration Service distributes session threads to multiple DTM processes on nodes in the grid to increase performance and scalability.


Read More about Grid Deployments.

6. Workflow Load Balancing

Informatica Load Balancing is a mechanism which distributes the workloads across the nodes in the grid. When you run a workflow, the Load Balancer dispatches different tasks in the workflow such as Session, Command, and predefined Event-Wait tasks to different nodes running the Integration Service. Load Balancer matches task requirements with resource availability to identify the best node to run a task. It may dispatch tasks to a single node or across nodes on the grid.

Read More about Workflow Load Balancing.

7. Other Performance Tips and Tricks

Throughout this blog we have discussed different tips and tricks to improve ETL load performance. We reference them here for your convenience.

Read More about Other Performance Tips and Tricks.

Hope you enjoyed these tips and tricks and that they are helpful for your project needs. Leave us your questions and comments. We would like to hear about any other performance tips you might have used in your projects.


Informatica PowerCenter Load Balancing for Workload Distribution on Grid

Johnson Cyriac | Nov 8, 2013 | ETL Admin | Performance Tips


When Informatica PowerCenter workflows run on a grid, workflow tasks are distributed across the nodes in the grid, including the Session, Command, and predefined Event-Wait tasks within workflows. PowerCenter uses the Load Balancer to distribute workflow and session tasks to different nodes. This article describes how to use the Load Balancer to set workflow priorities and how to allocate resources.

What is Informatica Load Balancing


Informatica load Balancing is a mechanism which distributes the workloads across the nodes in the grid. When you run a workflow, the Load Balancer dispatches different tasks in the workflow such as Session, Command, and predefined Event-Wait tasks to different nodes running the Integration Service. Load Balancer matches task requirements with resource availability to identify the best node to run a task. It may dispatch tasks to a single node or across nodes on the grid.

Identifying the Nodes to Run a Task

The Load Balancer matches the resources required by a task with the resources available on each node. It dispatches tasks in the order it receives them. You can adjust workflow priorities and assign resource requirements to tasks so that the Load Balancer can distribute tasks to the right nodes with the right priority.


Assign service levels : You assign service levels to workflows. Service levels establish priority among workflow tasks that are waiting to be dispatched.

Assign resources : You assign resources to tasks. Session, Command, and predefined Event-Wait tasks require PowerCenter resources to succeed. If the Integration Service is configured to check resources, the Load Balancer dispatches these tasks to nodes where the resources are available.

Assigning Service Levels to Workflows

Service levels determine the order in which the Load Balancer dispatches tasks from the dispatch queue. When multiple tasks are waiting to be dispatched, the Load Balancer dispatches high priority tasks before low priority tasks. You create service levels and configure the dispatch priorities in the Administrator tool.

You give a higher service level to the workflows that need to be dispatched first when multiple workflows are running in parallel. Service levels are set up in the Admin console.

You assign service levels to workflows on the General tab of the workflow properties as shown below.

Assigning Resources to Tasks

If the Integration Service runs on a grid and is configured to check for available resources, the Load Balancer uses resources to dispatch tasks. The Integration Service matches the resources required by tasks in a workflow with the resources available on each node in the grid to determine which nodes can run the tasks.

You can configure the resource requirements for tasks as shown in the image below.


The configuration below shows that the source qualifier needs a source file from the file directory NDMSource, which is accessible from only one node. The resources available on different nodes are configured in the Admin console.

Hope you enjoyed this article and that it helps you prioritize your workflows to meet your data refresh timelines. Please leave us a comment or feedback if you have any; we are happy to hear from you.


Dynamically Changing ETL Calculations Using Informatica Mapping Variable

Johnson Cyriac | Oct 16, 2013 | ETL Design | Mapping Tips


Quite often we deal with ETL logic that is very dynamic in nature, such as a discount calculation that changes every month or special weekend-only logic. There is a lot of practical difficulty in pushing such frequent ETL changes into the production environment. The best option to deal with this dynamic scenario is parameterization. In this article, let's discuss how we can make ETL calculations dynamic.

Business Use Case

Let's start our discussion with the help of a real-life use case.

The sales department wants to build a monthly sales fact table, which needs to be refreshed after the month-end closure. Sales commission is one of the fact table data elements, and its calculation is dynamic in nature; it is a factor of sales, sales revenue or net sales.

Sales Commission calculation can be :

1. Sales Commission = Sales * 18 / 100
2. Sales Commission = Sales Revenue * 20 / 100
3. Sales Commission = Net Sales * 20 / 100

Note : The expression calculation can be as complex as the business requirement demands.

The calculation to be used by the month-end ETL will be decided by the Sales Manager before the monthly ETL load.

Mapping Configuration

Now that we understand the use case, let's build the mapping logic.

Here we will be building the dynamic sales commission calculation logic with the help of a mapping variable. The changing expression for the calculation will be passed into the mapping using a session parameter file.

Step 1 : Create a mapping variable $$EXP_SALES_COMM and set its IsExprVar property to TRUE as shown in the image below.


Note : Precision for the mapping variable should be big enough to hold the whole expression.

Step 2 : In an expression transformation, create an output port and provide the mapping variable as the expression. Shown below is a screenshot of the expression transformation.

Note : All the ports used in the expression $$EXP_SALES_COMM should be available as an input or input/output port in the expression transformation.


Workflow Configuration

In the workflow configuration, we will create the parameter file with the expression for the sales commission and set it up in the session.

Step 1 : Create the session parameter file with the expression for Sales Commission calculation with the below details.

[s_m_LOAD_SALES_FACT]
$$EXP_SALES_COMM=SALES_REVENUE*20/100

Step 2 : Set the parameter in the session properties as shown below.

With that we are done with the configuration. You can update the expression in the parameter file whenever a change is required in the sales commission calculation. This clearly eliminates the need for an ETL code change.

Hope you enjoyed this article, please leave us a comment or feedback if you have any, we are happy to hear from you.


Informatica HTTP Transformation, The Interface Between ETL and Web Services Johnson Cyriac Sep 30, 2013 Transformations


In a mature data warehouse environment, you will see all sorts of data sources, like mainframes, ERP systems, web services, machine logs, message queues, Hadoop, etc. Informatica provides a variety of connectors to extract data from such data sources. Using the Informatica HTTP transformation, you can make web service calls and get data from web servers. We will explain this transformation in this article with a use case.

What is HTTP Transformation

The HTTP transformation enables you to connect to an HTTP server to use its services and applications. When you run a session with an HTTP transformation, the Integration Service connects to the HTTP server and issues a request to retrieve data from or update data on the HTTP server.

For example, you can get the currency conversion rate between USD and EUR by calling the web service http://rate-exchange.appspot.com/currency?from=USD&to=EUR. Using the HTTP transformation you can:


1. Read data from an HTTP server :- It retrieves data from the HTTP server and passes the data to a downstream transformation in the mapping.

2. Update data on the HTTP server :- It posts data to the HTTP server and passes HTTP server responses to a downstream transformation in the mapping.

Developing HTTP Transformation

Like any other transformation, you can create HTTP transformations in the Transformation Developer or in the Mapping Designer. As shown in the image below, all the configuration required for this transformation is on the HTTP tab.


Read or Write data to HTTP server

As shown in the image, on the HTTP tab, you can configure the transformation to read data or write data to the HTTP server. Select GET method to read data and POST or SIMPLE POST method to write data to an HTTP server.

Configuring Groups and Ports

Based on the HTTP method you choose, you configure the port groups and ports of the transformation on the HTTP tab. There are three port groups:

Output. Contains data from the HTTP response. Passes responses from the HTTP server to downstream transformations.

Input. Used to construct the final URL for the GET method or the data for the POST request.
Header. Contains header data for the request and response.

In the image shown above, we have two input ports for the GET method and the response from the server as the output port.

Configuring a URL


The web service is accessed using a URL, and the base URL of the web service needs to be provided in the transformation. The Designer constructs the final URL for the GET method from the base URL and the port names in the input group.

In the image shown above, you can see the base URL and the constructed URL, which includes the query parameters. This web service call gets the currency conversion rate, and we are passing two parameters to the base URL: the "from" and "to" currencies.

Connecting to the HTTP Server

If the HTTP server requires authentication, you can create an HTTP connection object in the Workflow Manager. This connection can be used in the session configuration to connect to the HTTP server.

HTTP Transformation Use Case

Let's consider an ETL job that integrates sales data from different global sales regions into the enterprise data warehouse. Data in the warehouse needs to be standardized, and all sales figures need to be stored in US Dollars (USD).

Solution : In the ETL process, let's use a web service call to get the real-time currency conversion rate and convert the foreign currency to USD. We will use the HTTP transformation to call the web service.

For the demo, we will concentrate only on the HTTP transformation. We will be using the web service from http://rate-exchange.appspot.com/ for the demonstration. This web service takes two parameters, the "from" currency and the "to" currency, and returns a JSON document with the exchange rate information.

http://rate-exchange.appspot.com/currency?from=USD&to=EUR

Step 1 :- Create the HTTP transformation like any other transformation in the Mapping Designer. We need to configure the transformation for the GET HTTP method to access the currency conversion data. The configuration is shown below.


Step 2 :- Create two input ports as shown in the image below. The ports need to be of string data type, and the port names should match the URL parameter names.


Step 3 :- Now you can provide the base URL for the web service and the designer will construct the complete URL with the parameters included.


Step 4 :- The output from the HTTP transformation will look similar to what is given below.

{"to": "USD", "rate": 1.3522000000000001, "from": "EUR"}

Finally, you can plug the transformation into the mapping as shown in the image below. Parse the output from the HTTP transformation in an Expression transformation and do the calculation to convert the currency to USD.
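For reference, here is a minimal, standalone Java sketch of the same call and the downstream parsing, outside PowerCenter. The URL is the one used in this article (the service may no longer be live); the JSON handling is deliberately simple regex parsing, and the sample amount is made up, purely to illustrate what the Expression transformation has to do with the response string.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CurrencyRateDemo {
    public static void main(String[] args) throws Exception {
        // The GET URL, built the same way the Designer builds it from the base URL
        // and the two input ports ("from" and "to").
        URL url = new URL("http://rate-exchange.appspot.com/currency?from=EUR&to=USD");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");

        // Read the full response body into a single string.
        StringBuilder response = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                response.append(line);
            }
        }

        // The response looks like: {"to": "USD", "rate": 1.3522, "from": "EUR"}
        Matcher m = Pattern.compile("\"rate\"\\s*:\\s*([0-9.]+)").matcher(response.toString());
        if (m.find()) {
            double rate = Double.parseDouble(m.group(1));
            double amountInEur = 250.00;             // sample sales figure in EUR (made up)
            double amountInUsd = amountInEur * rate; // the conversion the Expression transformation performs
            System.out.println("250.00 EUR = " + amountInUsd + " USD");
        }
    }
}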


Hope you enjoyed this tutorial. Please let us know if you have any difficulties trying out the HTTP transformation, or share with us any different use cases you want to implement using it.


Informatica SQL Transformation, SQLs Beyond Pre & Post Session Commands Johnson Cyriac Sep 24, 2013 Transformations


SQL statements can be used as part of pre- or post-SQL commands in a PowerCenter workflow. These are static SQLs and can run only once, before or after the mapping pipeline is run. With the help of the SQL transformation, we can use SQL statements much more effectively to build ETL logic. In this tutorial, let's learn more about the transformation and its usage with a real-time use case.

What is SQL Transformation

The SQL transformation can be used to process SQL queries midstream in a mapping. You can execute any valid SQL statement using this transformation, either from external SQL scripts or from queries created within the transformation. The SQL transformation processes the query and returns rows, along with database errors if any.

Configuring SQL Transformation


SQL transformation can run in two different modes.

Script mode :- Runs SQL scripts from text files that are externally located. You pass a script name to the transformation with each input row. It outputs script execution status and any script error.

Query mode :- Executes a query that you define in a query editor. You can pass strings or parameters to the query to define dynamic queries. You can output multiple rows when the query has a SELECT statement.

Script Mode

An SQL transformation running in script mode runs SQL scripts from text files. It creates an SQL procedure and sends it to the database to process. The database validates the SQL and executes the query. You cannot use scripting languages such as Oracle PL/SQL or Microsoft/Sybase T-SQL in the script.

In script mode, you pass the script file name with its complete path from the source to the SQL transformation's ScriptName port. The ScriptResult port gives the status of the script execution, either PASSED or FAILED. ScriptError returns errors that occur when a script fails for a row.

Shown above is an SQL transformation in script mode, which has a ScriptName input port and ScriptResult and ScriptError output ports.

Query Mode

When SQL transformation runs in query mode, it executes an SQL query defined in the transformation. You can pass strings or parameters to the query from the transformation input ports to change the SQL query statement or the query data. The SQL query can be static or dynamic.

Static SQL query :- The query statement does not change, but you can use query parameters to change the data, which is passed in through the input ports of the transformation.

Dynamic SQL query :- You can change the query statements and the data, which is passed in through the input ports of the transformation.

With a static query, the Integration Service prepares the SQL statement once and executes it for each row. With a dynamic query, the Integration Service prepares the SQL for each input row.


The SQL transformation shown above, which runs in query mode, has two input parameters and returns one output.
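The static-versus-dynamic trade-off just described is roughly the same one you see in plain JDBC: a statement prepared once with bind parameters versus SQL text rebuilt for every row. The sketch below is only an analogy to illustrate the cost difference, not what the Integration Service does internally; the connection string, table, and column names are invented.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;

public class StaticVsDynamicQuery {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details -- replace with a real JDBC URL to run.
        Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/DW", "etl_user", "secret");

        int[] customerIds = {101, 102, 103};

        // "Static" style: the SQL text never changes, only the bind value does.
        // The statement is parsed and prepared once, then executed for each row.
        PreparedStatement ps = conn.prepareStatement(
                "SELECT CUST_NAME FROM DW.CUSTOMER_DIM WHERE CUST_ID = ?");
        for (int id : customerIds) {
            ps.setInt(1, id);
            ps.executeQuery().close();
        }
        ps.close();

        // "Dynamic" style: the SQL text itself is rebuilt per row, so a new
        // statement has to be parsed every time -- the per-row overhead the
        // article describes for dynamic queries.
        Statement st = conn.createStatement();
        for (int id : customerIds) {
            st.executeQuery("SELECT CUST_NAME FROM DW.CUSTOMER_DIM WHERE CUST_ID = " + id).close();
        }
        st.close();
        conn.close();
    }
}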

SQL Transformation Use Case

Let's consider the ETL for loading dimension tables into a data warehouse. The surrogate key for each of the dimension tables is populated using an Oracle sequence. The ETL architect needs to create a reusable Informatica component that can be used in different dimension table loads to populate the surrogate key.

Solution : Let's create a reusable SQL transformation in query mode, which takes the name of the Oracle sequence and returns the sequence number as output.

Step 1 :- Once you have the Transformation Developer open, you can start creating the SQL transformation like any other transformation. It opens a window like the one shown in the image below.


This screen lets you choose the mode, database type, and database connection type, and lets you make the transformation active or passive. If the database connection type is dynamic, you can pass the connection details into the transformation dynamically. If the SQL query returns more than one record, you need to make the transformation active.

Step 2 :- Now create the input and output ports as shown in the image below. We are passing in the database schema name and the sequence name. It returns the sequence number as an output port.


Step 3 :- Using the SQL query editor, we can build the query that reads the next sequence value. Using 'String Substitution' ports we can make the SQL dynamic; here we make the query dynamic by passing the schema name and sequence name in through input ports. For example, with string-substitution ports named SCHEMA_NAME and SEQ_NAME (assumed port names), the query could be written as SELECT ~SCHEMA_NAME~.~SEQ_NAME~.NEXTVAL FROM DUAL.


That is all we need for the reusable SQL transformation. Shown below is the completed SQL transformation, which takes two input values (schema name and sequence name) and returns one output value (the sequence number).

Step 4 :- We can use this transformation just like any other reusable transformation. We pass in the schema name and sequence name as input ports, and it returns the sequence number, which can be used to populate the surrogate key of the dimension table as shown below.


As per the above example, the Integration Service will resolve the SQL as follows at session runtime: SELECT DW.S_CUST_DIM.NEXTVAL FROM DUAL;

Hope you enjoyed this tutorial. Please let us know if you have any difficulties trying it out, or share with us any different use cases you want to implement using the SQL transformation.


Informatica Java Transformation to Leverage the Power of Java Programming Johnson Cyriac Sep 17, 2013 Transformations


Java is one of the most popular programming languages in use, particularly for client-server web applications. With the introduction of the PowerCenter Java transformation, ETL developers can get their feet wet with Java programming and leverage the power of Java. In this article, let's learn more about the Java transformation, its components, and its usage with the help of a use case.

What is Java Transformation


With the Java transformation, you can define transformation logic using Java without advanced knowledge of the Java programming language or an external Java development environment.

The PowerCenter Client uses the Java Development Kit (JDK) to compile the Java code and generate byte code for the transformation. The PowerCenter Client stores the byte code in the PowerCenter repository. When the Integration Service runs a session with a Java transformation, the Integration Service uses the Java Runtime Environment (JRE) to execute the byte code and process input rows and generate output rows.

Developing Code in Java Transformation

You can use the code entry tabs to enter Java code snippets that define the Java transformation's functionality. Using the code entry tabs within the transformation, you can import Java packages, write helper code, define Java expressions, and write Java code that defines transformation behavior for specific transformation events.

The image below shows the different code entry tabs under 'Java Code'.

Import Packages :- Import third-party Java packages, built-in Java packages, or custom Java packages.

Helper Code :- Define variables and methods available to all tabs except Import Packages. After you declare variables and methods on the Helper Code tab, you can use the variables and methods on any code entry tab except the Import Packages tab.


On Input Row :- Define transformation behavior when it receives an input row. The Java code in this tab executes one time for each input row.

On End of Data :- Use this tab to define transformation logic when it has processed all input data.

On Receiving Transaction :- Define transformation behavior when it receives a transaction notification. You can use this only with active Java transformations.

Java Expressions : - Define Java expressions to call PowerCenter expressions. You can use this in multiple code entry tabs.
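Conceptually, the snippets you type into these tabs end up as pieces of a single Java class that the PowerCenter Client compiles. The sketch below is not the literal generated code -- the class and method names are invented -- it only shows where each tab's snippet logically lives.

// Import Packages tab: third-party or built-in packages used by the other tabs.
import java.util.HashMap;
import java.util.Map;

public class JavaTransformationSketch {

    // Helper Code tab: fields and methods shared by the other code entry tabs.
    private Map<Integer, String> cache = new HashMap<Integer, String>();

    // On Input Row tab: logic that runs once for every input row.
    public void onInputRow(int empId, String empName) {
        cache.put(empId, empName);
    }

    // On End of Data tab: logic that runs after the last input row is processed.
    public void onEndOfData() {
        System.out.println("Rows cached: " + cache.size());
    }
}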


Java Transformation Use Case 

Let's take a simple example for our demonstration. The employee data source contains the employee ID, name, age, employee description, and manager ID. We need to create an ETL transformation that finds the manager name for a given employee based on the manager ID and generates an output file containing the employee ID, name, employee description, and manager name.

Shown below is the complete structure of the mapping that builds the functionality described above. Other than the source, source qualifier, and target, we are using only the Java transformation.

Step 1 :- Once you have the source and source qualifier pulled into the mapping, create the Java transformation's input and output ports as shown in the image below. Just like any other transformation, you can drag and drop ports from other transformations to create new ports.


Step 2 :- Now move to the 'Java Code' tab and, from the 'Import Packages' tab, import the external Java classes required by the Java code. This tab can be used to import any third-party or built-in Java classes.


As shown in the above image, here is the import code used.

import java.util.Map;
import java.util.HashMap;

Step 3 :- In the 'Helper Code' tab, define the variables, objects, and methods required by the Java code that will be written in 'On Input Row'. Here we have declared four fields.

Below is the code used.


private static Map<Integer, String> empMap = new HashMap<Integer, String>(); // shared cache of employee ID -> employee name
private static Object lock = new Object();                                   // guards empMap when the session runs with multiple partitions
private boolean generateRow;                                                  // whether the current row should be written to the output
private boolean isRoot;                                                       // true when the employee has no manager (top of the hierarchy)

Step 4 :- In the 'On Input Row' tab, define the ETL logic, which will be executed for every input record.

Below is the complete code we need to place in the 'On Input Row' tab.

generateRow = true;
isRoot = false;

if (isNull("EMP_ID_INP") || isNull("EMP_NAME_INP")) {
    incrementErrorCount(1);
    generateRow = false;
} else {
    EMP_ID_OUT = EMP_ID_INP;
    EMP_NAME_OUT = EMP_NAME_INP;
}

if (isNull("EMP_DESC_INP")) {
    setNull("EMP_DESC_OUT");
} else {
    EMP_DESC_OUT = EMP_DESC_INP;
}

boolean isParentEmpIdNull = isNull("EMP_PARENT_EMPID");

if (isParentEmpIdNull) {
    isRoot = true;
    logInfo("This is the root for this hierarchy.");
    setNull("EMP_PARENT_EMPNAME");
}

synchronized (lock) {
    if (!isParentEmpIdNull)
        EMP_PARENT_EMPNAME = (String) (empMap.get(new Integer(EMP_PARENT_EMPID)));
    empMap.put(new Integer(EMP_ID_INP), EMP_NAME_INP);
}

if (generateRow)
    generateRow();

With this, we are done with the coding required in the Java transformation; only code compilation is left. Note that the code caches each employee ID and name in the shared empMap and looks up the manager name from entries cached by earlier rows, so a manager's record must be read before the records of that manager's reports for the lookup to resolve. The remaining tabs in this Java transformation do not need any code for our use case.

Compile the Java Code

To compile the full code for the Java transformation, click Compile on the Java Code tab. The Output window displays the status of the compilation. If the Java code does not compile successfully, correct the errors in the code entry tabs and recompile the Java code. After you successfully compile the transformation, save the transformation to the repository.

Completed Mapping  

All the ports of the Java transformation can now be connected from the source qualifier and to the target. Shown below is the completed structure of the mapping.

Hope you enjoyed this tutorial. Please let us know if you have any difficulties trying out this Java code and the Java transformation, or share with us any different use cases you want to implement using the Java transformation.


Informatica Performance Tuning Guide, Identify Performance Bottlenecks - Part 2 Johnson Cyriac Sep 8, 2013 ETL Admin | Performance Tips


In our previous article in the performance tuning series, we covered the basics of the Informatica performance tuning process and the session anatomy. In this article, we will cover methods to identify different performance bottlenecks. We will use session thread statistics, session performance counters, and Workflow Monitor properties to help us understand the bottlenecks.

Source, Target & Mapping Bottlenecks Using Thread Statistics

Performance Tuning Tutorial Series
Part I : Performance Tuning Introduction.
Part II : Identify Performance Bottlenecks.
Part III : Remove Performance Bottlenecks.
Part IV : Performance Enhancements.

Thread statistics give run-time information from all three threads: the reader, transformation, and writer threads. The session log provides enough run-time thread statistics to help us understand and pinpoint the performance bottleneck.

Gathering Thread Statistics

You can get thread statistics from the session log file. When you run a session, the session log file lists run-time information and thread statistics with the details below.


Run Time : Amount of time the thread runs.
Idle Time : Amount of time the thread is idle. Includes the time the thread waits for other thread processing.
Busy Time : Percentage of the run time. It is (run time - idle time) / run time x 100.


Thread Work Time : The percentage of time taken to process each transformation in a thread.

Note : Session Log file with normal tracing level is required to get the thread statistics.

Understanding Thread Statistics

When you run a session, the session log lists run information and thread statistics similar to the example discussed below.

If you read it closely, you will see the reader, transformation, and writer threads, how much time is spent on each thread, and how busy each thread is. In addition, the transformation thread shows how busy each transformation in the mapping is.

The total run time for the transformation thread is 506 seconds and the busy percentage is 99.7%. This means the transformation thread was never idle for the 506 seconds. The reader and writer busy percentages were significantly smaller, about 9.6% and 24%. In this session, the transformation thread is the bottleneck in the mapping.

To determine which transformation in the transformation thread is the bottleneck, view the busy percentage of each transformation in the thread work time breakdown. The transformation RTR_ZIP_CODE had a busy percentage of 53%.

Hint : Thread with the highest busy percentage is the bottleneck.

Session Bottleneck Using Session Performance Counters


All transformations have counters to help measure and improve performance of the transformations. Analyzing these performance details can help you identify session bottlenecks. The Integration Service tracks the number of input rows, output rows, and error rows for each transformation.


Gathering Performance Counters

You can set up the session to gather performance counters in the Workflow Manager. The image below shows the configuration required for a session to collect transformation performance counters.

Understanding Performance Counters

The image below shows the performance counters for a session, which you can see from the Workflow Monitor session run properties. You can see the transformations in the mapping and the corresponding performance counters.

Non-zero counts for Readfromdisk and Writetodisk indicate sub-optimal settings for the transformation index or data caches. This may indicate the need to tune session transformation caches manually.


A non-zero count for Errorrows indicates you should eliminate the transformation errors to improve performance.

Errorrows : Transformation errors impact session performance. If a transformation has large numbers of error rows in any of the Transformation_errorrows counters, you should eliminate the errors to improve performance.

Readfromdisk and Writetodisk : If these counters display any number other than zero, you can increase the cache sizes to improve session performance.

Readfromcache and Writetocache : Use these counters to analyze how the Integration Service reads from or writes to the caches.

Rowsinlookupcache : Gives the number of rows in the lookup cache. To improve session performance, tune the lookup expressions for the larger lookup tables.

Session Bottleneck Using Session Log File

When the Integration Service initializes a session, it allocates blocks of memory to hold source and target data. Not having enough buffer memory for the DTM process can slow down reading, transforming, or writing and cause large fluctuations in performance.

If the session is not able to allocate enough memory for the DTM process, the Integration Service writes a warning message into the session log file and gives you the recommended buffer size. Below is a sample message seen in the session log.


Message: WARNING: Insufficient number of data blocks for adequate performance. Increase DTM buffer size of the session. The recommended value is xxxx.

System Bottleneck Using the Workflow Monitor

You can view the Integration Service properties in the Workflow Monitor to see CPU, memory, and swap usage of the system when you are running task processes on the Integration Service. Use the following Integration Service properties to identify performance issues:

CPU% : The percentage of CPU usage includes other external tasks running on the system. High CPU usage indicates that the server needs additional processing power.

Memory Usage : The percentage of memory usage includes other external tasks running on the system. If the memory usage is close to 95%, check if the tasks running on the system are using the amount indicated in the Workflow Monitor or if there is a memory leak. To troubleshoot, use system tools to check the memory usage before and after running the session and then compare the results to the memory usage while running the session.

Swap Usage : Swap usage is a result of paging due to possible memory leaks or a high number of concurrent tasks.

What is Next in the Series

The next article in this series will cover how to remove bottlenecks and improve session performance. Hope you enjoyed this article, please leave us a comment or feedback if you have any, we are happy to hear from you.


Implementing Informatica PowerCenter Session Partitioning Algorithms Johnson Cyriac Jul 20, 2013 Mapping Tips | Performance Tips


Informatica PowerCenter session partitioning can be used effectively for parallel data processing to achieve faster data delivery. Parallel data processing performance depends heavily on the additional hardware power available. In addition, it is important to choose the appropriate partitioning algorithm, or partition type. In this article, let's discuss the optimal session partition settings.

Business Use Case

Partition Tutorial Series
Part I : Partition Introduction.
Part II : Partition Implementation.
Part III : Dynamic Partition.

Let's consider a business use case to explain the implementation of the appropriate partition algorithms and configuration.

Daily sales data generated from three sales regions needs to be loaded into an Oracle data warehouse. The sales volume from the three regions varies a lot, and hence the number of records processed for each region varies a lot. The warehouse target table is partitioned based on product line.

Below is the simple structure of the mapping that implements the required functionality.

Pass-through Partition

A pass-through partition at the source qualifier transformation is used to split the source data into three parallel processing data sets. The image below shows how to set up a pass-through partition for the three sales regions.


Once the partition is set up at the source qualifier, you get an additional Source Filter option to restrict the data that corresponds to each partition. Be sure to write the filter conditions so that the same data is not processed through more than one partition and no data is duplicated; for example, one filter per region such as SALES_REGION = 'NORTH', SALES_REGION = 'SOUTH', and SALES_REGION = 'WEST' (assuming a SALES_REGION column). The image below shows the three Source Filters, one per partition.


Round Robin Partition

Since the data volume from the three sales regions is not the same, use the round-robin partition algorithm at the next transformation in the pipeline, so that the data and the processing load are distributed equally among the three partitions. The round-robin partition can be set up as shown in the image below.

Hash Auto Key Partition

At the Aggregator transformation, data needs to be redistributed across the partitions to avoid potentially splitting aggregator groups. The hash auto-key partition algorithm makes sure the data from different partitions is redistributed such that records with the same key end up in the same partition. This algorithm identifies the keys based on the group-by key defined in the transformation.

Processing records of the same aggregator group in different partitions would produce wrong results.
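PowerCenter's internal hash function is not shown here, but the guarantee hash auto-key partitioning gives is easy to picture: the partition number is derived purely from the group key, so identical keys always land in the same partition. Below is a rough conceptual sketch in Java (the key values and partition count are arbitrary examples, not PowerCenter internals).

import java.util.Objects;

public class HashAutoKeyConcept {
    // Assign a partition number from the aggregator group key (e.g. a product line).
    static int partitionFor(Object groupKey, int partitionCount) {
        // Same key -> same hash -> same partition, which is exactly what keeps
        // all rows of one aggregator group inside a single partition.
        return Math.floorMod(Objects.hashCode(groupKey), partitionCount);
    }

    public static void main(String[] args) {
        int partitions = 3;
        String[] productLines = {"LAPTOP", "PHONE", "LAPTOP", "TABLET", "PHONE"};
        for (String pl : productLines) {
            System.out.println(pl + " -> partition " + partitionFor(pl, partitions));
        }
    }
}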


Key Range Partition

Use the key range partition when you need to distribute records among partitions based on the range of values of one or more ports.

Here the target table is range-partitioned on product line. Create a key range partition on the target definition on the PRODUCT_LINE_ID port to get the best write throughput.


The images below show the steps involved in setting up the key range partition.

Click on Edit Keys to define the ports on which the key range partition is defined.


A pop-up window shows the list of ports in the transformation. Choose the ports on which the key range partition is required.

Now give the start and end values of the range for each partition as shown below (for instance, PRODUCT_LINE_ID 1-100 for the first partition, 101-200 for the second, and so on; the actual ranges depend on your data).


We did not need the hash user-key partition or database partition algorithms in the use case discussed here.

Hash User Key partition algorithm will let you choose the ports to group rows among partitions. This algorithm can be used in most of the places where hash auto key algorithm is appropriate.

Database partition algorithm queries the database system for table partition information. It reads partitioned data from the corresponding nodes in the database. This algorithm can be applied either on the source or target definition.

Hope you enjoyed this article. Please leave your comments and feedback.


Informatica Performance Tuning Guide, Resolve Performance Bottlenecks - Part 3 Johnson Cyriac Oct 8, 2013 ETL Admin | Performance Tips


In our previous article in the performance tuning series, we covered different approaches to identify performance bottlenecks. In this article, we will cover methods to resolve them. We will discuss session memory, cache memory, source, target, and mapping performance tuning techniques in detail.

I. Buffer Memory Optimization

When the Integration Service initializes a session, it allocates blocks of memory to hold source and target data. Sessions that use a large number of sources and targets might require additional memory blocks.


Performance Tuning Tutorial Series
Part I : Performance Tuning Introduction.
Part II : Identify Performance Bottlenecks.
Part III : Remove Performance Bottlenecks.
Part IV : Performance Enhancements.

Not having enough buffer memory for the DTM process can slow down reading, transforming, or writing and cause large fluctuations in performance. Adding extra memory blocks can keep the threads busy and improve session performance. You can do this by adjusting the buffer block size and the DTM buffer size.

Note : You can identify a DTM buffer bottleneck from the session log file; see Part 2 of this series for details.

1. Optimizing the Buffer Block Size

Depending on the source and target data, you might need to increase or decrease the buffer block size.

To identify the optimal buffer block size, sum up the precision of the individual columns for each source and target. The largest such row precision among all sources and targets is the space needed for one row. Ideally, a buffer block should accommodate at least 100 rows at a time.

Buffer Block Size = Largest Row Precision * 100

You can change the buffer block size in the session configuration as shown in the image below.


2. Increasing DTM Buffer Size

When you increase the DTM buffer memory, the Integration Service creates more buffer blocks, which improves performance. You can estimate the required DTM buffer size using the calculation below.

Session Buffer Blocks = (total number of sources + total number of targets) * 2

DTM Buffer Size = Session Buffer Blocks * Buffer Block Size / 0.9
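As a worked example (the numbers are assumed, purely to show how the two formulas combine), suppose the widest source or target row has a precision of 1,000 bytes and the mapping has 2 sources and 3 targets:

Buffer Block Size = 1,000 * 100 = 100,000 bytes (roughly 100 KB)
Session Buffer Blocks = (2 + 3) * 2 = 10
DTM Buffer Size = 10 * 100,000 / 0.9 = about 1,111,111 bytes (roughly 1.1 MB)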


You can change the DTM buffer size in the session configuration as shown in the image below.

II. Cache Memory Optimization

Transformations such as Aggregator, Rank, and Lookup use cache memory to store transformed data, which includes the index and data caches. If the allocated cache memory is not large enough to store the data, the Integration Service stores the data in a temporary cache file. Session performance slows each time the Integration Service reads from the temporary cache file.

Note : You can examine the performance counters to determine which transformations require cache memory tuning; see Part 2 of this series for details.

1. Increasing the Cache Sizes 

You can increase the allocated cache sizes so that the transformation is processed entirely in cache memory and the Integration Service does not have to read from the cache file.


You can calculate the memory requirements for a transformation using the Cache Calculator. Shown below is the Cache Calculator for the Lookup transformation.

You can update the cache size in the session property of the transformation as shown below.


2. Limiting the Number of Connected Ports

For transformations that use data cache, limit the number of connected input/output and output only ports. Limiting the number of connected input/output or output ports reduces the amount of data the transformations store in the data cache.

III. Optimizing the Target

The most common performance bottleneck occurs when the Integration Service writes to a target database. Small database checkpoint intervals, small database network packet sizes, or problems during heavy loading operations can cause target bottlenecks.

Note : A target bottleneck can be identified with the help of the session log file; see Part 2 of this series for details.

1. Using Bulk Loads

You can use bulk loading to improve the performance of a session that inserts a large amount of data into a DB2, Sybase ASE, Oracle, or Microsoft SQL Server database. When bulk loading, the Integration Service bypasses the database log, which speeds performance. Without writing to the database log, however, the target database cannot perform rollback. As a result, you may not be able to perform recovery.

2. Using External Loaders

To increase session performance, configure PowerCenter to use an external loader for supported target databases. External loaders can be used for Oracle, DB2, Sybase, and Teradata.

3. Dropping Indexes and Key Constraints

When you define key constraints or indexes in target tables, you slow the loading of data to those tables. To improve performance, drop indexes and key constraints before you run the session. You can rebuild those indexes and key constraints after the session completes.

4. Minimizing Deadlocks

Encountering deadlocks can slow session performance. You can increase the number of target connection groups in a session to avoid deadlocks. To use a different target connection group for each target in a session, use a different database connection name for each target instance.

5. Increasing Database Checkpoint Intervals

The Integration Service performance slows each time it waits for the database to perform a checkpoint. To decrease the number of checkpoints and increase performance, increase the checkpoint interval in the database.

6. Increasing Database Network Packet Size

If you write to Oracle, Sybase ASE, or Microsoft SQL Server targets, you can improve the performance by increasing the network packet size. Increase the network packet size to allow larger packets of data to cross the network at one time.

IV. Optimizing the Source

Performance bottlenecks can occur when the Integration Service reads from a source database. An inefficient query or small database network packet sizes can cause source bottlenecks.

Note : Session log file details can be used to identify a source bottleneck; see Part 2 of this series for details.

1. Optimizing the Query

If a session joins multiple source tables in one Source Qualifier, you might be able to improve performance by optimizing the query with optimizer hints. Usually, the database optimizer determines the most efficient way to process the source data. However, you might know properties about the source tables that the database optimizer does not. The database administrator can create optimizer hints to tell the database how to execute the query for a particular set of source tables (for example, an Oracle hint such as /*+ PARALLEL(ORDERS, 4) */ added to the Source Qualifier's SQL override; the table name here is only an illustration).

2. Increasing Database Network Packet Size

If you read from Oracle, Sybase ASE, or Microsoft SQL Server sources, you can improve the performance by increasing the network packet size. Increase the network packet size to allow larger packets of data to cross the network at one time.

V. Optimizing the Mappings

Mapping-level optimization may take time to implement, but it can significantly boost session performance. Focus on mapping-level optimization after you optimize the targets and sources.

Generally, you reduce the number of transformations in the mapping and delete unnecessary links between transformations to optimize the mapping. Configure the mapping with the least number of transformations and expressions to do the most amount of work possible. Delete unnecessary links between transformations to minimize the amount of data moved.

Note : You can identify a mapping bottleneck from the session log file; see Part 2 of this series for details.

1. Optimizing Datatype Conversions

You can increase performance by eliminating unnecessary datatype conversions. For example, if a mapping moves data from an Integer column to a Decimal column, then back to an Integer column, the unnecessary datatype conversion slows performance. Where possible, eliminate unnecessary datatype conversions from mappings.

2. Optimizing Expressions

You can also optimize the expressions used in the transformations. When possible, isolate slow expressions and simplify them.

Factoring Out Common Logic : If the mapping performs the same task in multiple places, reduce the number of times the mapping performs the task by moving the task earlier in the mapping.

Minimizing Aggregate Function Calls : When writing expressions, factor out as many aggregate function calls as possible. Each time you use an aggregate function call, the Integration Service must search and group the data. For example SUM(COL_A + COL_B) performs better than SUM(COL_A) + SUM(COL_B)

Replacing Common Expressions with Local Variables : If you use the same expression multiple times in one transformation, you can make that expression a local variable.

Choosing Numeric Versus String Operations : The Integration Service processes numeric operations faster than string operations. For example, if you look up large amounts of data on two columns, EMPLOYEE_NAME and EMPLOYEE_ID, configuring the lookup around EMPLOYEE_ID improves performance.

Using Operators Instead of Functions : The Integration Service reads expressions written with operators faster than expressions with functions. Where possible, use operators to write expressions.

3. Optimizing Transformations


Each transformation is different, and the tuning required differs from one transformation to another. But generally, you reduce the number of transformations in the mapping and delete unnecessary links between transformations.

Note : Tuning techniques for individual transformations will be covered in a separate article.

What is Next in the Series

The next article in this series will cover the additional features available in Informatica PowerCenter to improve session performance. Hope you enjoyed this article, please leave us a comment or feedback if you have any, we are happy to hear from you.


Informatica PowerCenter Pushdown Optimization a Hybrid ELT Approach Johnson Cyriac Jul 30, 2013 Mapping Tips | Performance Tips


The Informatica Pushdown Optimization Option increases performance by providing the flexibility to push transformation processing to the most appropriate processing resource. Using pushdown optimization, data transformation logic can be pushed to the source database, the target database, or processed on the PowerCenter server. This gives the ETL architect the option to choose the best available resource for data processing.

What is Pushdown Optimization


Performance Improvement Features

Pushdown Optimization
Pipeline Partitions
Dynamic Partitions
Concurrent Workflows
Grid Deployments
Workflow Load Balancing

The Pushdown Optimization Option enables data transformation processing to be pushed down into a relational database to make the best use of database processing power. It converts the transformation logic into SQL statements, which execute directly on the database. This minimizes the need to move data between servers and utilizes the power of the database engine.

How Pushdown Optimization Works

When you run a session configured for pushdown optimization, the Integration Service analyzes the mapping and transformations to determine the transformation logic it can push to the database. If the mapping contains a mapplet, the Integration Service expands the mapplet and treats the transformations in the mapplet as part of the parent mapping. The Integration Service converts the transformation logic into SQL statements and sends them to the source or the target database to perform the data transformation. The amount of transformation logic you can push to the database depends on the database, the transformation logic, and the mapping and session configuration.

Different Types of Pushdown Optimization

You can configure pushdown optimization in the following ways.

1. Source-side pushdown optimization
2. Target-side pushdown optimization
3. Full pushdown optimization

Source-side pushdown optimization


When you run a session configured for source-side pushdown optimization, the Integration Service analyzes the mapping from the source to the target or until it reaches a downstream transformation it cannot push to the database.

The Integration Service generates a SELECT statement based on the transformation logic for each transformation it can push to the database. When you run the session, the Integration Service pushes all transformation logic that is valid to push to the database by executing the generated SQL statement. Then, it reads the results of this SQL statement and continues to run the session.

If you run a session that contains an SQL override or lookup override, the Integration Service generates a view based on the override. It then generates a SELECT statement and runs the SELECT statement against this view. When the session completes, the Integration Service drops the view from the database.

Target-side pushdown optimization

When you run a session configured for target-side pushdown optimization, the Integration Service analyzes the mapping from the target to the source or until it reaches an upstream transformation it cannot push to the database.

The Integration Service generates an INSERT, DELETE, or UPDATE statement based on the transformation logic for each transformation it can push to the database, starting with the first transformation in the pipeline it can push to the database. The Integration Service processes the transformation logic up to the point that it can push the transformation logic to the target database. Then, it executes the generated SQL.


Full pushdown optimization


The Integration Service pushes as much transformation logic as possible to both source and target databases. If you configure a session for full pushdown optimization, and the Integration Service cannot push all the transformation logic to the database, it performs partial pushdown optimization instead.

To use full pushdown optimization, the source and target must be on the same database. When you run a session configured for full pushdown optimization, the Integration Service analyzes the mapping starting with the source and analyzes each transformation in the pipeline until it analyzes the target. It generates SQL statements that are executed against the source and target database based on the transformation logic it can push to the database. If the session contains an SQL override or lookup override, the Integration Service generates a view and runs a SELECT statement against this view.

Configuring Session for Pushdown Optimization

A session can be configured to use pushdown optimization from the Informatica PowerCenter Workflow Manager. You can open the session and choose Source, Target, or Full pushdown optimization as shown in the image below.


You can additionally choose a few options to control how the Integration Service converts the data transformation logic into SQL statements. The screenshot below shows the available options.

Allow Temporary View for Pushdown. Allows the Integration Service to create temporary view objects in the database when it pushes the session to the database.

Allow Temporary Sequence for Pushdown. Allows the Integration Service to create temporary sequence objects in the database.

Allow Pushdown for User Incompatible Connections. Indicates that the database user of the active database has read permission on the idle databases.


Using Pushdown Optimization Viewer

Use the Pushdown Optimization Viewer to examine the transformations that can be pushed to the database. Select a pushdown option or pushdown group in the Pushdown Optimization Viewer to view the corresponding SQL statement that is generated for the specified selections.

You can invoke the viewer from the highlighted 'Pushdown Optimization' option as shown in the image below.


The Pushdown Optimization Viewer pops up in a new window and shows how the Integration Service converts the data transformation logic into SQL statements for a particular mapping. When you select a pushdown option or pushdown group in the viewer, you do not change the pushdown configuration. To change the configuration, you must update the pushdown option in the session properties.


Things to Consider before Using Pushdown Optimization

When you run a session with full pushdown optimization, the database must run a long transaction if the session contains a large quantity of data. Consider the following database performance issues when you generate a long transaction:

A long transaction uses more database resources.
A long transaction locks the database for longer periods of time, and thereby reduces the database concurrency and increases the likelihood of deadlock.
A long transaction can increase the likelihood that an unexpected event may occur.

Hope you enjoyed this article and found it informative. Please leave us your comments and feedback.


Mapping Debugger to Troubleshoot your Informatica PowerCenter ETL Logic Johnson Cyriac Aug 17, 2013 Mapping Tips | Transformations


The Debugger is an integral part of the Informatica PowerCenter Mapping Designer, which helps you troubleshoot ETL logic errors or data error conditions in an Informatica mapping. The Debugger user interface shows the step-by-step execution path of a mapping and how the source data is transformed in the mapping. Features like breakpoints and expression evaluation make the debugging process easy.

Understand Debugger Interface

The Debugger user interface is integrated with the Mapping Designer. Once you invoke the Debugger, you get a few additional windows that display debugging information, such as the Instance window, which shows how the data is transformed at a transformation instance, and the Target window, which shows what data is written to the target.

1. Instance Window : View how data is transformed at a transformation instance. This window is refreshed as the Debugger progresses from one transformation to another. You can choose a specific transformation from the drop-down list to see how the data looks at that particular transformation instance for a particular source row.

2. Target Window : Shows what data is written to the target instance. You can see whether a record is going to be inserted, updated, deleted, or rejected. If there are multiple target instances, you can choose the target instance name from the drop-down list to see its data.

3. Mapping Window : Shows the step-by-step execution path of a mapping. It highlights the transformation instance being processed and shows the breakpoints set up on different transformations.

4. Debugger Log : This window shows messages from the Debugger.

Shown below are the windows in the Mapping Designer that appear when you run the Debugger.


The above image shows the mapping with one breakpoint set on the Expression transformation. The Target window shows the first two records set for update, and the Instance window shows how the third record from the source is transformed in the expression EXP_INSERT_UPDATE.

Configuring the Debugger

Before the Debugger can run, we need to set up the breakpoints and configure the session to be used by the mapping being debugged. Setting up breakpoints is optional for running the Debugger, but it helps narrow down the issue faster, especially when the mapping is big and complex.

Creating Breakpoints


When you are running a debug session, you may not be interested in seeing the data transformations in all the transformation instances, but only in specific transformations where you expect a logical or data error.

For example, you might want to see what is going wrong in the expression transformation EXP_INSERT_UPDATE for a specific customer record, say CUST_ID = 1001.

By setting a breakpoint, you can pause the Debugger on a specific transformation or when a specific condition is satisfied. You can set two types of breakpoints.

Error Breakpoints : When you create an error break point, the Debugger pauses when the Integration Service encounters error conditions such as a transformation error. You also set the number of errors to skip for each break point before the Debugger pauses.

Data Breakpoints : When you create a data break point, the Debugger pauses when the data break point condition evaluates to true. You can set the number of rows to skip or a data condition or both.

You can open the Breakpoint window from Mapping -> Debugger -> Edit Breakpoints (Alt+F9) as shown in the image below.

Shown below is a data breakpoint created on EXP_INSERT_UPDATE with the condition CUST_ID = 1001. With this setting, the Debugger pauses on the transformation EXP_INSERT_UPDATE when processing CUST_ID = 1001.


In the same way, we can create error breakpoints on any transformation.

Configuring the Debugger

In addition to setting breakpoints, you must also configure the Debugger. Use the Debugger Wizard in the Mapping Designer to configure the Debugger against a saved mapping. When you configure the Debugger, enter parameters such as the Integration Service, an existing non-reusable session, an existing reusable session, or create a debug session instance for the mapping you are going to debug.

You can start the Debugger Wizard from Mapping -> Debugger -> Start Debugger (F9) as shown in the image below.


From the window shown below, you choose the Integration Service, and then choose an existing non-reusable session, an existing reusable session, or create a debug session instance.

The next window gives the option to choose from the sessions attached to the mapping being debugged.


You can choose to load or discard target data when you run the Debugger. If you discard target data, the Integration Service does not connect to the target. You can select the target instances you want to display in the Target window while you run a debug session.


With these settings, the mapping is ready to be debugged.


Running the Debugger

When you complete the Debugger Wizard shown in the configuration phase in the step above, the Integration Service starts the session and initializes the Debugger. After initialization, the Debugger moves in and out of running and paused states based on breakpoints and commands that you issue from the Mapping Designer.

When the Debugger is in paused state, you can see the transformation data in the Instance Window.

After you review or modify data, you can continue the Debugger in the following ways. The different commands to control Debugger execution are shown in the image below. This menu is available under Mapping -> Debugger.


Continue to the next break : To continue to the next break, click Continue (F5). The Debugger continues running until it encounters the next break.

Continue to the next instance : To continue to the next instance, click Next Instance (F10) option. The Debugger continues running until it reaches the next transformation or until it encounters a break. If the current instance has output going to more than one transformation instance, the Debugger stops at the first instance it processes.

Step to a specified instance : To continue to a specified instance, select the transformation instance in the mapping, then click Step to Instance (Ctrl+F10) option. The Debugger continues running until it reaches the selected transformation in the mapping or until it encounters a break.

Evaluating Expression

When the Debugger pauses, you can use the Expression Editor to evaluate expressions using mapping variables and ports in a selected transformation. This option is helpful for evaluating and rewriting an expression in case you find the expression result is erroneous.

You can access Evaluate Expression window from Mapping -> Debugger -> Evaluate Expression.


Modifying Data

When the Debugger pauses, the current instance displays in the Instance window. You can make the data modifications to the current instance when the Debugger pauses on a data break point.

You can modify the data from the Instance window. This option is helpful if you want to check what the result would be if the input were different from the current value.

Hope you enjoyed this tutorial. Please let us know if you have any difficulties trying out the mapping Debugger, and subscribe to the mailing list to get the latest tutorials in your mailbox.


Informatica Performance Tuning Guide, Tuning and Bottleneck Overview - Part 1 Johnson Cyriac Aug 25, 2013 ETL Admin | Performance Tips


The performance tuning process identifies bottlenecks and eliminates them to reach an acceptable ETL load time. Tuning starts with the identification of bottlenecks in the source, target, and mapping, and continues with session tuning. It might also require tuning of the system resources on which the Informatica PowerCenter services are running.

This performance tuning series is split into multiple articles, each covering a specific area of tuning. In this article we discuss session anatomy and the different types of bottlenecks.

Performance Tuning and Bottlenecks Overview

Performance Tuning Tutorial Series
Part I : Performance Tuning Introduction
Part II : Identify Performance Bottlenecks
Part III : Remove Performance Bottlenecks
Part IV : Performance Enhancements


Determining the best way to improve performance can be complex. An iterative method is most effective: identify one bottleneck at a time and eliminate it, then identify and eliminate the next, until an acceptable throughput is achieved.

The first step in performance tuning is to identify performance bottlenecks. Performance bottlenecks can occur in the source and target, the mapping, the session, and the system. Before we look at the different bottlenecks, let's see the components of an Informatica PowerCenter session and how a bottleneck arises.

Informatica PowerCenter Session Anatomy

When a PowerCenter session is triggered, the Integration Service starts the Data Transformation Manager (DTM), which is responsible for starting the reader, transformation, and writer threads.

The reader thread is responsible for reading data from the sources, the transformation threads process data according to the transformation logic in the mapping, and the writer thread connects to the target and loads the data. Any data processing delay in these threads leads to a performance issue.

Shown above is a pictorial representation of a session: the reader thread reads data from the source, the transformation threads transform the data, and the writer thread finally loads it into the target.
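To make the thread interplay concrete, here is a minimal Python sketch of a reader/transformation/writer pipeline connected by bounded buffers. It is only an analogy for the DTM threads described above, not PowerCenter internals; the buffer sizes, row counts, and the artificial delay in the transformation stage are assumptions chosen to show how one slow stage forces the other threads to wait.

```python
# Minimal analogy of the DTM threads: reader -> transformation -> writer, connected
# by bounded buffers. A deliberately slow stage stalls the others, which is exactly
# how a bottleneck shows up in a session.
import queue
import threading
import time

SENTINEL = object()                       # marks the end of the data
read_buffer = queue.Queue(maxsize=10)     # reader -> transformation buffer blocks
write_buffer = queue.Queue(maxsize=10)    # transformation -> writer buffer blocks

def reader():
    for row in range(100):                # pretend these rows come from the source
        read_buffer.put(row)              # blocks when the buffer is full
    read_buffer.put(SENTINEL)

def transformer():
    while True:
        row = read_buffer.get()
        if row is SENTINEL:
            write_buffer.put(SENTINEL)
            break
        time.sleep(0.01)                  # assumed "expensive" mapping logic
        write_buffer.put(row * 10)

def writer():
    while True:
        row = write_buffer.get()
        if row is SENTINEL:
            break
        # a slow target would add its own delay here

threads = [threading.Thread(target=fn) for fn in (reader, transformer, writer)]
start = time.time()
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"Elapsed: {time.time() - start:.2f}s (the slowest stage sets the pace)")
```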

Source Bottlenecks

Performance bottlenecks can occur when the Integration Service reads from a source database. Slowness in reading data from the source delays filling the DTM buffer, so the transformation and writer threads have to wait for data. This delay causes the entire session to run slower.

An inefficient query or small database network packet sizes can cause source bottlenecks.
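A quick way to confirm a suspected source bottleneck is to run the session's read query directly against the source database and time it; if the query alone is slow, the problem is on the source side rather than in the mapping or target. The sketch below is a minimal Python example using sqlite3 purely as a stand-in; the database file, table, and query are assumptions, so substitute your own driver and read query.

```python
# Minimal sketch: time the read query on its own to see whether the source, rather
# than the mapping or target, is the slow part. sqlite3 is only a stand-in for the
# real source database; the file, table, and query are assumptions.
import sqlite3
import time

conn = sqlite3.connect("source_stage.db")                  # assumed source database
query = "SELECT cust_id, cust_name FROM cust_stage"        # assumed read query

start = time.time()
cursor = conn.execute(query)
row_count = 0
while True:
    batch = cursor.fetchmany(10000)                        # fetch in blocks, like the reader thread
    if not batch:
        break
    row_count += len(batch)
print(f"Read {row_count} rows in {time.time() - start:.2f} seconds")
conn.close()
```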


Target Bottlenecks

When a target bottleneck occurs, the writer thread cannot free up buffer blocks for the reader and transformation threads until the data is written to the target. The reader and transformation threads then wait for free blocks, which causes the entire session to run slower.


Small database checkpoint intervals, small database network packet sizes, or problems during heavy loading operations can cause target bottlenecks.

Mapping Bottlenecks


Complex or poorly written mapping logic can lead to a mapping bottleneck. With a mapping bottleneck, the transformation thread runs slower, causing the reader thread to wait for free blocks and the writer thread to wait for blocks to be filled for writing to the target.

Session Bottlenecks

If you do not have a source, target, or mapping bottleneck, you may have a session bottleneck. A session bottleneck normally occurs when the session memory configuration is not tuned correctly, which in turn leads to a bottleneck on the reader, transformation, or writer thread. Small cache sizes, low buffer memory, and small commit intervals can cause session bottlenecks.
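The effect of a small commit interval is easy to demonstrate with a tiny experiment: committing after every row versus committing in larger batches. The Python sketch below uses a throwaway SQLite file only as a stand-in target; the table, row count, and intervals are assumptions, but the per-commit overhead it shows is the same idea the session commit interval controls.

```python
# Minimal sketch of why a small commit interval slows a load: commit after every row
# versus once per larger batch. A throwaway SQLite file stands in for the real target;
# the table, row count, and intervals are assumptions.
import os
import sqlite3
import tempfile
import time

def load(commit_interval, rows=1000):
    path = os.path.join(tempfile.mkdtemp(), "target.db")
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE t_fact (id INTEGER, amount REAL)")
    start = time.time()
    for i in range(rows):
        conn.execute("INSERT INTO t_fact VALUES (?, ?)", (i, i * 1.5))
        if (i + 1) % commit_interval == 0:
            conn.commit()                                  # every commit forces a flush to disk
    conn.commit()
    conn.close()
    return time.time() - start

print(f"Commit every row : {load(1):.2f}s")
print(f"Commit every 500 : {load(500):.2f}s")
```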

System Bottlenecks

After you tune the source, target, mapping, and session, consider tuning the system to prevent system bottlenecks. The Integration Service uses system resources to process transformations, run sessions, and read and write data. The Integration Service also uses system memory to create cache files for transformations, such as Aggregator, Joiner, Lookup, Sorter, XML, and Rank.

What is Next in the Series

In the next article in this series, we will cover how to identify the different bottlenecks using session thread statistics, session performance counters, and more. Hope you enjoyed this article; please leave us a comment or feedback if you have any, we are happy to hear from you.



Session Logfile with Verbose Data for Informatica Mapping Debugging Johnson Cyriac Aug 31, 2013 Mapping Tips | Transformations


The Debugger is a great tool to troubleshoot your mapping logic, but there are instances where we need a different troubleshooting approach for mappings. A session log file with verbose data gives much more detail than the Debugger tool, such as what data is stored in the cache files and how variable ports are evaluated. Such information helps in complex, tricky troubleshooting.

For our discussion, let's consider a simple mapping with one Lookup transformation and one Expression transformation. Shown below is the structure of the mapping. We will set up the session to debug these two transformations.

Setting Up the Session for Troubleshooting

Before we can run the workflow and debug, we need to set up the session to produce a log file with detailed verbose data. We can configure the session to capture verbose data from all the transformations in the mapping or only from specific transformations.

For our demo, we are going to collect verbose data from the Lookup and Expression transformations.

We can set up the session for debugging by changing the Tracing Level to Verbose Data, as shown in the image below. Here we are setting the tracing level for the Lookup transformation.


As mentioned, we set the Tracing Level to Verbose Data for the Expression transformation as well, as shown below.


Note : We can override the tracing level for all the individual transformations at once using the Configuration Object -> Override Tracing property.

Read and Understand the Log File

Once you open the session log file with verbose data, you are going to notice a lot more information than we normally see in a log file.

Since we are interested in the data transformation details, we can scroll down through the session log and look for the transformation thread.

Shown below is part of the log file, detailing what data is stored in the lookup cache file. The highlighted section shows that the data is read from the lookup source LKP_T_DIM_CUST{{DSQ}} and built into the LKP_T_DIM_CUST{{BLD}} cache. Further down, you can see the values stored in the cache file.


Further down in the transformation thread, you can see that three records are passed on to LKP_T_DIM_CUST from the source qualifier SQ_CUST_STAGE. You can also see the Rowid in the log file.

The Lookup transformation output is sent on to the next transformation, EXP_INSERT_UPDATE.

You can see what data is received by EXP_INSERT_UPDATE from the Lookup transformation. The Rowid is helpful for tracking rows between the transformations.
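Because the Rowid shows up on the verbose-data lines for each transformation, a small script can collect every line that mentions a given Rowid and let you follow one row from SQ_CUST_STAGE through LKP_T_DIM_CUST to EXP_INSERT_UPDATE. The Python sketch below is a plain text scan; the log file name and the exact "Rowid=" line layout are assumptions, so adjust the pattern to match your own session log.

```python
# Minimal sketch: pull every verbose-data line that mentions a given Rowid so one row
# can be followed across transformations. The log file name and the 'Rowid=' layout
# are assumptions; adjust them to match your session log.
import re
import sys

LOG_FILE = "session_verbose.log"                   # assumed session log file name
ROWID = sys.argv[1] if len(sys.argv) > 1 else "1"  # row to trace, e.g. python trace_row.py 2

pattern = re.compile(rf"Rowid=\s*{re.escape(ROWID)}\b")

with open(LOG_FILE, encoding="utf-8", errors="replace") as log:
    for line_no, line in enumerate(log, start=1):
        if pattern.search(line):                   # keep only lines carrying this Rowid
            print(f"{line_no:>7}: {line.rstrip()}")
```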


Since we have enabled verbose data for the Expression transformation as well, in addition to the above details you will see how data is passed into and out of the Expression transformation. That part is skipped in this demo.

Pros and Cons

Both the Debugger tool and debugging using verbose data have their own pluses and minuses.

Pros

Faster : Once you get the hang of verbose data, debugging using the session log file is faster than using the Debugger tool. You do not have to wait patiently for info from each transformation as you do with the Debugger tool.

Detailed info : Verbose data gives much more detail than the Debugger, such as what data is stored in the cache files, how variable ports are evaluated, and much more, which is useful in detailed debugging.

All in one place : We get all the detailed debugging info in one place, which helps you follow how rows are transformed from source to target. You can see only one row at a time in the Debugger tool.

Cons

Difficult to understand : Unlike the Debugger tool, it takes a bit of extra effort to understand the verbose data in the session log file.

No user interface : All the debugging info is provided in text format, with no user interface, which might not be the preferred way of working for some.

Lots of info : A session log file with verbose data gives much more detail than the Debugger tool, some of which may be irrelevant to your troubleshooting.

Hope you enjoyed this tutorial. Please let us know if you have any difficulties trying out this debugging approach, or share with us if you use any different methods for your debugging.

