Change data capture (CDC) is a technique for identifying the changes, and only the changes, in a source system. I have
seen applications built without CDC that were later forced to add it at a much higher cost. Building an ETL
application without CDC is a costly omission and usually means backtracking later. This article discusses different
methods of implementing CDC.
Scenario #01: Change detection using timestamp on source rows
In this typical scenario the source rows carry two extra columns, say row_created_time and last_modified_time.
row_created_time: the time at which the record was first created. last_modified_time: the time at which the record was last
modified.
1. In the mapping, create a mapping variable $$LAST_ETL_RUN_TIME of datetime data type.
2. Evaluate the expression SetMaxVariable($$LAST_ETL_RUN_TIME, SESSSTARTTIME); this step stores the time at which the session started in $$LAST_ETL_RUN_TIME.
3. Use $$LAST_ETL_RUN_TIME in the 'where' clause of the source SQL. During the first run (the initial seed) the mapping variable holds its default value, so the query pulls all the records from the source, like: select * from employee where last_modified_time > '01/01/1900 00:00:00'
4. Now let us assume the session is run on '01/01/2010 00:00:00' for the initial seed.
5. When the session is executed on '02/01/2010 00:00:00', the SQL would be like: select * from employee where last_modified_time > '01/01/2010 00:00:00', thereby pulling only the records that changed between successive runs.
Scenario #02: Change detection using load_id or run_id
Under this scenario the source rows have a column, say load_id, that holds a positive running number. The load_id is updated
whenever the record is updated.
1. In the mapping, create a mapping variable $$LAST_READ_LOAD_ID of integer data type.
2. Evaluate the expression SetMaxVariable($$LAST_READ_LOAD_ID, load_id); the maximum load_id read is stored in the mapping variable.
3. Use $$LAST_READ_LOAD_ID in the 'where' clause of the source SQL. During the first run (the initial seed) the mapping variable holds its default value, so the query pulls all the records from the source, like: select * from employee where load_id > 0. Assuming all records in the initial seed have load_id = 1, the mapping variable stores 1 in the repository.
4. Now let us assume the session is run after five loads into the source. The SQL would be select * from employee where load_id > 1, which limits the source read to the records changed after the initial seed.
5. Consecutive runs keep advancing the stored load_id and pull each delta in sequence.
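Both scenarios follow the same high-water-mark pattern, which can be sketched outside Informatica. The example below is only an illustration under stated assumptions: it uses Python with an in-memory SQLite table in place of the real source, the columns emp_id and name and the helper names are hypothetical, and plain Python values stand in for the mapping variables that PowerCenter would persist in the repository.

```python
import sqlite3


def build_demo_source():
    """Create a hypothetical employee table seeded with two rows (load_id = 1)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE employee ("
        "emp_id INTEGER PRIMARY KEY, name TEXT, "
        "last_modified_time TEXT, load_id INTEGER)"
    )
    conn.executemany(
        "INSERT INTO employee VALUES (?, ?, ?, ?)",
        [
            (1, "Asha", "2009-12-30 10:00:00", 1),
            (2, "Ben", "2009-12-31 09:30:00", 1),
        ],
    )
    return conn


def pull_by_timestamp(conn, last_etl_run_time, session_start_time):
    """Scenario #01: read rows modified after the stored run time.

    Returns the rows and the new stored value, which mirrors
    SetMaxVariable: the larger of the old value and the session start time.
    ISO-formatted timestamps compare correctly as strings.
    """
    rows = conn.execute(
        "SELECT emp_id, name FROM employee "
        "WHERE last_modified_time > ? ORDER BY emp_id",
        (last_etl_run_time,),
    ).fetchall()
    return rows, max(last_etl_run_time, session_start_time)


def pull_by_load_id(conn, last_read_load_id):
    """Scenario #02: read rows whose load_id exceeds the stored value.

    Returns the rows and the highest load_id seen so far.
    """
    rows = conn.execute(
        "SELECT emp_id, name, load_id FROM employee "
        "WHERE load_id > ? ORDER BY emp_id",
        (last_read_load_id,),
    ).fetchall()
    return rows, max([last_read_load_id] + [r[2] for r in rows])


if __name__ == "__main__":
    conn = build_demo_source()

    # Initial seed: default values pull every row.
    seed_rows, run_time = pull_by_timestamp(
        conn, "1900-01-01 00:00:00", "2010-01-01 00:00:00"
    )
    _, load_id = pull_by_load_id(conn, 0)
    print(len(seed_rows), run_time, load_id)

    # One source row changes after the seed.
    conn.execute(
        "UPDATE employee SET name = 'Benjamin', "
        "last_modified_time = '2010-01-01 12:00:00', load_id = 2 "
        "WHERE emp_id = 2"
    )

    # The next run pulls only the delta under either scheme.
    print(pull_by_timestamp(conn, run_time, "2010-01-02 00:00:00"))
    print(pull_by_load_id(conn, load_id))
```

In both functions the stored value only moves forward, which is the property SetMaxVariable provides in the mapping. The timestamp variant stores the session start time rather than the largest last_modified_time read, as in the steps of Scenario #01.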