Informatica: Change Data Capture

Change data capture (CDC) is a technique for identifying the changes, and only the changes, in a source. I have seen applications built without CDC that were later required to implement it, at a higher cost. Building an ETL application without CDC is a costly miss and usually means backtracking. This article discusses different methods of implementing CDC.

Scenario #01: Change detection using timestamp on source rows

In this typical scenario the source rows have two extra columns, say row_created_time and last_modified_time.

row_created_time: time at which the record was first created

last_modified_time: time at which the record was last modified

1. In the mapping, create a mapping variable $$LAST_ETL_RUN_TIME of datetime data type.

2. Evaluate the expression SetMaxVariable($$LAST_ETL_RUN_TIME, SessionStartTime); this step stores the time at which the session was started in $$LAST_ETL_RUN_TIME.

3. Use $$LAST_ETL_RUN_TIME in the 'where' clause of the source SQL. During the first run, or initial seed, the mapping variable has its default value and the query pulls all the records from the source, like: select * from employee where last_modified_time > '01/01/1900 00:00:00'

4. Now let us assume the session is run on '01/01/2010 00:00:00' for the initial seed.

5. When the session is executed on '02/01/2010 00:00:00', the SQL would be like: select * from employee where last_modified_time > '01/01/2010 00:00:00', thereby pulling only the records that changed between successive runs.

Scenario #02: Change detection using load_id or Run_id

Under this scenario the source rows have a column, say load_id, a positive running number. The load_id is updated as and when the record is updated.

1. In the mapping, create a mapping variable $$LAST_READ_LOAD_ID of integer data type.

2. Evaluate the expression SetMaxVariable($$LAST_READ_LOAD_ID, load_id); the maximum load_id read is stored in the mapping variable.

3. Use $$LAST_READ_LOAD_ID in the 'where' clause of the source SQL. During the first run, or initial seed, the mapping variable has its default value and the query pulls all the records from the source, like: select * from employee where load_id > 0. Assuming all records during the initial seed have load_id = 1, the mapping variable stores 1 in the repository.

4. Now let us assume the session is run after five loads into the source. The SQL would be select * from employee where load_id > 1, which limits the source read to the records that have changed after the initial seed.

5. Consecutive runs take care of updating the load_id and pulling the delta in sequence.
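
Both scenarios follow the same pattern: read only the rows above a stored value, then advance that value. The sketch below illustrates this outside Informatica and is not the tool's own implementation. It makes several assumptions: an in-memory SQLite table stands in for the source, the emp_id column and the read_delta helper are hypothetical, timestamps are stored as ISO-style text so that string comparison matches time order, and plain Python variables stand in for the mapping variables that Informatica would persist to the repository.

```python
import sqlite3

def read_delta(conn, column, watermark):
    """Read only the rows whose tracking column is above the stored watermark."""
    allowed = {"last_modified_time", "load_id"}  # guard the interpolated column name
    if column not in allowed:
        raise ValueError(f"unsupported watermark column: {column}")
    sql = f"select emp_id, {column} from employee where {column} > ? order by emp_id"
    return conn.execute(sql, (watermark,)).fetchall()

# Hypothetical source table carrying both tracking columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table employee ("
    "emp_id integer primary key, last_modified_time text, load_id integer)"
)
conn.executemany(
    "insert into employee values (?, ?, ?)",
    [(1, "2009-12-30 10:00:00", 1), (2, "2009-12-31 18:30:00", 1)],
)

# Scenario #01: the watermark is the start time of the previous session.
last_etl_run_time = "1900-01-01 00:00:00"   # default value, used for the initial seed
session_start = "2010-01-01 00:00:00"       # assumed start time of the seed run
seed_by_time = read_delta(conn, "last_modified_time", last_etl_run_time)
last_etl_run_time = max(last_etl_run_time, session_start)  # SetMaxVariable analogue

# Scenario #02: the watermark is the largest load_id read so far.
last_read_load_id = 0                        # default value, used for the initial seed
seed_by_id = read_delta(conn, "load_id", last_read_load_id)
last_read_load_id = max([last_read_load_id] + [row[1] for row in seed_by_id])

# A later load into the source updates employee 2 and adds employee 3.
conn.execute(
    "update employee set last_modified_time = '2010-01-01 09:15:00', load_id = 2 "
    "where emp_id = 2"
)
conn.execute("insert into employee values (3, '2010-01-01 11:00:00', 2)")

# The next session pulls only the delta under either scheme.
delta_by_time = read_delta(conn, "last_modified_time", last_etl_run_time)
delta_by_id = read_delta(conn, "load_id", last_read_load_id)
print([row[0] for row in delta_by_time])
print([row[0] for row in delta_by_id])
```

In this sketch the seed run reads employees 1 and 2, and the second run reads only employees 2 and 3 under both schemes, because employee 1 was not touched after the seed.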