
Journal of Computing and Information Technology - CIT 20, 2012, 2, 95–111
doi:10.2498/cit.1002046


An ETL Metadata Model for Data Warehousing

Nayem Rahman (1), Jessica Marz (1) and Shameem Akhter (2)

(1) Intel Corporation, USA
(2) Western Oregon University, USA

Metadata is essential for understanding information stored in data warehouses. It helps increase levels of adoption and usage of data warehouse data by knowledge workers and decision makers. A metadata model is important to the implementation of a data warehouse; the lack of a metadata model can lead to quality concerns about the data warehouse. A highly successful data warehouse implementation depends on consistent metadata. This article proposes adoption of an ETL (extract-transform-load) metadata model for the data warehouse that makes subject area refreshes metadata-driven, loads observation timestamps and other useful parameters, and minimizes consumption of database system resources. The ETL metadata model provides developers with a set of ETL development tools and delivers a user-friendly batch cycle refresh monitoring tool for the production support team.

Keywords: ETL metadata, metadata model, data warehouse, EDW, observation timestamp

1. Introduction

The data warehouse is a collection of decision support technologies aimed at enabling the knowledge worker (executive, manager, and analyst) to make better and faster decisions [5]. A data warehouse is defined as a “subject-oriented, integrated, non-volatile and time-variant collection of data in support of management’s decisions” [17]. It is considered a key platform for the integrated management of decision support data in organizations [31]. One of the primary goals in building data warehouses is to improve information quality in order to achieve business objectives such as competitive advantage or enhanced decision-making capabilities [2, 3].

An enterprise data warehouse (EDW) gets data from different heterogeneous sources. Since the operational data sources and the target data warehouse reside in separate places, a continuous flow of data from source to target is critical to maintaining data freshness in the data warehouse. Information about the data's journey from source to target needs to be tracked in terms of load timestamps and other load parameters for the sake of data consistency and integrity. This information is captured in a metadata model. Given the increased frequency of data warehouse refresh cycles, the growing importance of the data warehouse in business organizations, and the increasing complexity of data warehouses, centralized management of metadata is essential for data warehouse administration, maintenance and usage [33]. From the standpoint of the data warehouse refresh process, metadata support is crucial to the data warehouse maintenance team: ETL developers, database administrators, and the production support team.

An efficient, flexible, robust, and state-of-the-art data warehousing architecture requires a number of technical advances [36]. A metadata-model-driven cycle refresh is one such important advancement. Metadata is essential in data warehouse environments since it enables activities such as data integration, data transformation, on-line analytical processing (OLAP) and data mining [10]. Lately, in data warehouses, batch cycles run several times a day to load data from operational data sources into the data warehouse. A metadata model can be used for different purposes, such as extract-transform-load processing, cycle runs, and cycle refresh monitoring.


Metadata has been identified as one of the key success factors of data warehousing projects [34]. It captures information about data warehouse data that is useful to business users and back-end support personnel. Metadata helps data warehouse end users understand the various types of information resources available from a data warehouse environment [11]. Metadata enables decision makers to measure data quality [30]. Empirical evidence suggests that end-user metadata quality has a positive impact on end-user attitudes about data warehouse data quality [11]. Metadata is important not only from the end-user standpoint, but also from the standpoint of data acquisition, transformation, load, and the analysis of warehouse data [38]. It is essential in designing, building, and maintaining data warehouses. In a data warehouse there are two main kinds of metadata to be collected: business (or logical) metadata and technical (aka ETL) metadata [38]. ETL metadata is linked to the back-end processes that extract, transform, and load the data [30]. ETL metadata is most often used by technical analysts for development and maintenance of the data warehouse [18]. In this article, we focus mainly on ETL metadata that is critical for ETL development, batch cycle refreshes, and maintenance of a data warehouse.

In data warehouses, data from external sources is first loaded into staging subject areas. Then analytical subject area tables, built in such a way that they fulfill the needs of reports, are refreshed for use by report users. These tables are refreshed multiple times a day by pulling data from staging area tables. However, not all tables in data warehouses receive changed data during every cycle refresh: the more frequently the batch cycle runs, the lower the percentage of tables that change in any given cycle. Refreshing all tables without first checking for source data changes causes unnecessary loads at the expense of database system resources. The development of a metadata model that enables utility stored procedures to identify source data changes means that load jobs can be run only when needed. By controlling batch job runs, the metadata model is also designed to minimize use of database system resources. The model makes analytical subject area loads metadata-driven.
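The change-detection idea above can be sketched as follows. This is a minimal, hypothetical Python illustration, not the paper's actual implementation (the dictionary stand-ins for the metadata tables and all names are assumptions): a load job compares the source's latest observation timestamp against the timestamp recorded at the last target load, and skips the job when nothing has changed.

```python
from datetime import datetime

# Hypothetical in-memory stand-ins for the metadata tables:
# src_load_log holds the observation timestamp recorded at the last
# successful target load; src_current holds the latest observation
# timestamp seen on the source side.
src_load_log = {"purch_ord_line": datetime(2012, 5, 1, 6, 0)}
src_current = {"purch_ord_line": datetime(2012, 5, 1, 6, 0)}

def source_changed(target_table: str) -> bool:
    """Return True when fresh source data has arrived since the last load."""
    last_loaded = src_load_log.get(target_table)
    current = src_current[target_table]
    return last_loaded is None or current > last_loaded

def run_load_job(target_table: str) -> str:
    if not source_changed(target_table):
        return "skip"  # no new source data: bypass the load entirely
    # ... run the full or delta load here ...
    src_load_log[target_table] = src_current[target_table]  # record the new observation ts
    return "load"

print(run_load_job("purch_ord_line"))  # source unchanged -> skip
src_current["purch_ord_line"] = datetime(2012, 5, 1, 12, 0)
print(run_load_job("purch_ord_line"))  # fresh data arrived -> load
```

Running a job this way costs one metadata lookup rather than a full table scan, which is where the resource savings described above come from.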

The model is also designed to provide the production support team with a user-friendly tool. This allows them to monitor the cycle refresh and look for issues relating to job failures or load discrepancies in the error and message log table. The metadata model provides the capability to set up subsequent cycle run behavior following a one-time full refresh. This works towards making tables truly metadata-driven. The model also provides developers with a set of objects to perform ETL development work. This enables them to follow standards in ETL development across the enterprise data warehouse.

In the metadata model architecture, load jobs are skipped when source data has not changed. Metadata provides the information needed to decide whether to run full or delta load stored procedures. It also has the capability to force a full load if needed. The model also controls collect-statistics jobs, running them after a certain interval or on an on-demand basis, which helps minimize resource utilization. The metadata model has several audit tables that archive critical metadata for three months to two years or more.

In Section 2 we discuss related work. In Section 3 we give a detailed description of an ETL metadata model and its essence. In Section 4 we cover metadata-driven batch processing, batch cycle flow, and an algorithm for wrapper stored procedures. The main contribution of this work is presented in Sections 3 and 4. In Section 5 we discuss the use of metadata in data warehouse subject area refreshes. We conclude in Section 6 by summarizing the contribution made by this work, reviewing the metadata model's benefits, and proposing future work.

2. Literature Research

Research in the field of data warehousing has focused on data warehouse design issues [13, 15, 7, 25], ETL tools [20, 32, 27, 19], data warehouse maintenance [24, 12], performance optimization and relational view materialization [37, 1, 23], and implementation issues [8, 29]. Limited research has been done on the metadata model aspects of data warehousing. Golfarelli et al. [14] provide a model for multidimensional data which is based on business


aspects of OLAP data. Huynh et al. [16] propose the use of metadata to map between object-oriented and relational environments within the metadata layer of an object-oriented data warehouse. Eder et al. [9] propose the COMET model, which registers all changes to the schema and structure of data warehouses. They consider the COMET model as the basis for OLAP tools and transformation operations, with the goal of reducing incorrect OLAP results. Stohr et al. [33] have introduced a model which uses a uniform representation approach based on the Unified Modeling Language (UML) to integrate technical and semantic metadata and their interdependencies. Katic et al. [21] propose a model that covers the security-relevant aspects of existing OLAP/data warehouse solutions. They assert that this particular aspect of metadata has seen rather little interest from product developers and is only beginning to be discussed in the research community. Shankaranarayanan & Even [30] and Foshay et al. [11] provide a good description of business metadata and associated data quality. They argue that managerial decision-making stands to benefit from business metadata. Kim et al. [22] provide a general overview of a metadata-oriented methodology for building data warehouses that includes legacy systems, extraction, an operational data store, the data warehouse, data marts, applications, and metadata.

ETL (aka technical) metadata is not addressed in the research work noted above. In this article, we provide a comprehensive ETL metadata model from the standpoint of data warehouse refresh, metadata-driven batch cycle monitoring, and data warehouse maintenance. Our work covers a broad range of ETL metadata aspects. We provide a means to manage data warehouse refresh observation timestamps, capturing message logs to detect any load or data issues. We also discuss in detail how to control individual job run behavior across subsequent batch cycle runs. Numerous commercial ETL tools with associated metadata models are available today [35]. However, they are proprietary models only. We propose an ETL metadata model that is independent of any ETL tool and can be implemented in any database system. Our model takes care of metadata-driven refreshes in both staging and analytical [26] subject areas in a data warehouse.

Under our ETL metadata model, platform-independent, database-specific utility tools are used to load the tables from external sources into the staging areas of the data warehouse. The proposed metadata model also enables database-specific software, such as stored procedures, to perform transformations and load the analytical subject areas of the data warehouse. The intent of the software is not to compete with existing ETL tools. Instead, we focus on utilizing the capabilities of current commercial database engines (given their enormous power to do complex transformations and their scalability) while using this metadata model. We first present the ETL metadata model, followed by detailed descriptions of each table. We also provide experimental results (Tables 2 to 6 in Section 5) based on our application of the metadata model in a real-world, production data warehouse.

3. The Metadata Model

Metadata is “data about data”. A metadata model is critical for the successful implementation of a data warehouse [22], and integrating the data warehouse with its metadata offers a new opportunity to create a reliable management and information system. Metadata is essential for understanding information stored in data warehouses [16]. It helps increase levels of adoption and use of data warehouse data by knowledge workers and managers. The proposed ETL metadata model allows for the storage of all kinds of load parameters and metrics to make data warehouse refreshes metadata-driven. It also holds error and message logs for troubleshooting purposes. It enables tracing the historical information of batch loads. Metadata ensures that data content possesses sufficient quality for users to use it for a specific purpose [6, 11].

In a data warehouse, batch job runs are automated given that they run periodically, several times a day, for efficiency reasons. In order to run thousands of jobs via batch cycles in different subject areas, the jobs are governed by an ETL metadata model to determine the load type and to capture different load metrics, errors, and messages. A wrapper stored procedure is a procedure that executes several utility procedures to make load type decisions and, based on those decisions, runs the full or delta stored procedure or skips the load.


Figure 1. ETL metadata model for data warehousing (derived from [26]).

In Figure 1, we provide a metadata model that is used to capture metadata for all kinds of data warehouse refresh activities related to ETL. The model consists of metadata tables, wrapper stored procedures, and utility stored procedures. The wrapper stored procedure for an individual job first inspects the latest observation timestamp in a metadata table for each of the source tables referenced in the load stored procedures, to detect the arrival of fresh data. The load procedures are bypassed if the source data has not changed. If source data changes are detected, the wrapper stored procedure calls the full or delta (aka incremental) stored procedure based on the load condition of each table load.

In the data warehouse, a staging or analytical subject area batch cycle refresh kicks off based on a notification that upstream source subject area refreshes are complete. After a subject area batch cycle begins, a pre-load job is run via a utility procedure, which updates the table with a cycle-begin timestamp. A post-load procedure updates the table with a cycle-end timestamp immediately after the actual table loads are completed. The cycle-begin and cycle-end timestamps stored in this table are used by the full and delta stored procedures during an actual table load and to capture load information in other metadata tables during the cycle refresh process.

In a data warehouse, each subject area normally gets refreshed several times a day. In order to keep track of refresh information for each batch cycle, it is important to capture refresh begin and end timestamps. The table 'cyc_log' is used to hold subject area refresh timestamps. This lets users know the timeliness of the data a particular table holds. The cycle-begin timestamp is used while capturing load metrics for each table load.

The table 'load_chk_log' holds one row per target table in each subject area. When a job kicks off for the first time, a utility procedure checks whether a metadata row exists for the table. If no row is found, a default row with several parameters is inserted on the fly to trigger a full refresh. After the cycle refresh is completed, the column load_type_ind is set to 'D' so that a delta load is performed in subsequent cycles.
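The first-run bootstrap described above can be sketched as follows. This is a hypothetical Python illustration (the column names follow the 'load_chk_log' description, but the in-memory layout and indicator values are assumptions):

```python
# Hypothetical in-memory stand-in for the load_chk_log metadata table,
# keyed by (subject_area, target_table).
load_chk_log: dict[tuple[str, str], dict] = {}

def get_or_create_load_check(subj_area: str, trgt_tbl: str) -> dict:
    """On a table's first-ever run, insert a default row that forces a
    full refresh; on later runs, return the existing row untouched."""
    key = (subj_area, trgt_tbl)
    if key not in load_chk_log:
        load_chk_log[key] = {"load_type_ind": "F",   # first load is always Full
                             "load_enable_ind": "Y"}
    return load_chk_log[key]

def post_cycle_update(subj_area: str, trgt_tbl: str) -> None:
    """After a successful cycle, flip the row to Delta for subsequent cycles."""
    load_chk_log[(subj_area, trgt_tbl)]["load_type_ind"] = "D"

row = get_or_create_load_check("Procurement_DRV", "purch_ord_line")
print(row["load_type_ind"])  # first run -> "F" (full refresh)
post_cycle_update("Procurement_DRV", "purch_ord_line")
print(load_chk_log[("Procurement_DRV", "purch_ord_line")]["load_type_ind"])  # -> "D"
```

The point of the insert-on-the-fly default is that new tables need no manual metadata setup before their first batch run.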


Table 1. Metadata model entity description.

The table 'src_tbl_file_info' holds a load timestamp for each staging table. The table 'src_load_log' stores an observation timestamp that is dependent on a source table's last-update timestamp. If a target table is loaded with data from multiple source tables, this table stores an observation timestamp for each of the source tables. The observation timestamp is the date and time at which the consistent set of triggered data was loaded into the target table. The trigger timestamp is always less than or equal to (<=) the observation timestamp.
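For a target fed by multiple sources, the bookkeeping above amounts to one observation timestamp per (target, source) pair, with the trigger timestamp never exceeding the observation timestamp. A hypothetical Python sketch (table and column names are illustrative):

```python
from datetime import datetime

# One observation timestamp per (target_table, source_table) pair,
# as in the src_load_log description above.
src_load_log: dict[tuple[str, str], datetime] = {}

def record_observation(trgt: str, src: str,
                       trigger_ts: datetime, observation_ts: datetime) -> None:
    # Invariant from the text: trigger_ts <= observation_ts.
    assert trigger_ts <= observation_ts, "trigger ts must not exceed observation ts"
    src_load_log[(trgt, src)] = observation_ts

# A target loaded from two source tables gets two entries.
record_observation("purch_ord_line", "purch_ord",
                   datetime(2012, 5, 1, 5, 55), datetime(2012, 5, 1, 6, 0))
record_observation("purch_ord_line", "vendor",
                   datetime(2012, 5, 1, 5, 58), datetime(2012, 5, 1, 6, 0))
print(len(src_load_log))  # -> 2
```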

The table 'load_mtric' stores vital load statistics (cyc_bgn_ts, load_strt_dt, load_strt_ts, load_stp_ts, load_row_cnt, load_msg_txt, and runtime) for each target table. These statistics are also used for report-outs of runtimes. This table provides information about whether the table was loaded (with a full or delta load) or the load was disabled, and the reason for any no-load. The load_mtric table enables monitoring the progress of a batch cycle while it is running. The table 'msg_log' stores load errors and messages for troubleshooting and informational purposes. This provides useful data to ETL programmers and the production support team to trace the root cause of a job failure, an incorrect load, or a no-load.

The table 'load_chk_opt' stores job information (subj_area_nm, trgt_tbl_nm, load_enable_ind, load_type_cd). For each job, one entry is required to exist in this table. The entries are inserted into this table via a utility stored procedure call from within the wrapper stored procedure. The load_type_cd field holds the value 'Full' or 'Delta', and the column load_enable_ind holds the value 'Y' or 'N'. Based on this information, a post-load utility stored procedure updates the 'load_chk_log' table to prepare it for the next cycle refresh.

In a data warehouse, hundreds of jobs run under different subject areas as part of multiple cycle refreshes every day. A job description for each table needs to be conveyed to production support along with job details. Each job has certain key descriptors, such as the job identifier, box, table name, and the full, delta, and wrapper stored procedure names; the 'job_desc' table holds this information (subj_area_nm, trgt_tbl_nm, autosys_box_nbr, job_id, wrapper_proc_nm, full_proc_nm, dlta_proc_nm, debug_mode_ind). The entries are inserted into this table via a utility stored procedure call. This table is used by the production support team to get detailed information about each job run.

The load type of each table in a particular subject area varies: some tables are refreshed fully, some are refreshed incrementally, and others are loaded on a weekly or monthly basis or on the first or last day of the month. In order to control the load behavior of each target table, each source table's load observation timestamp needs to be captured. The 'load_config' table holds all of this information to enable each target table's refresh according to its specific schedule.

4. Metadata-driven Batch Processing

An efficient, flexible, and general data warehousing architecture requires a number of technical advances [36]. A metadata-driven batch cycle refresh of a data warehouse is one of these technical advances. Successful data warehousing depends on maintaining data integrity and quality in table refreshes. Metadata-driven refreshes play a prominent role in this regard. Without an efficient metadata model, inconsistent data can be loaded, negatively impacting data quality. In this article, we attempt to provide a comprehensive ETL metadata model for data warehouses.

In data warehouses, each subject area is refreshed through a batch cycle. Under the batch cycle, jobs run in different boxes to load the target tables in order of their dependency on other tables in the subject area. Jobs in a subject area run under different conditions: some jobs load tables with full refreshes while other jobs refresh the tables incrementally; some jobs in the cycle skip their incremental loads if source data has not changed; and some jobs are set to do incremental loads but end up doing full refreshes when the source has a large number of new records or the source and target table row counts do not match after an incremental refresh is performed. Our proposed ETL metadata model controls the table loads to satisfy these load conditions.

The metadata model is based on several metadata tables, utility stored procedures, wrapper stored procedures, and full and delta load stored procedures for individual table loads. The model introduces a new paradigm for batch processing by loading only those tables for which there are new records in the source tables. Inspecting the latest observation timestamp in a metadata table for each of the source tables referenced in the incremental load stored procedures detects the arrival of fresh data. The full and incremental load procedures are bypassed if the source data has not changed. The ETL metadata model allows for storing table-level detailed load metrics (timestamps, row counts, and errors and messages for troubleshooting). By providing the source and target tables' last load timestamps, row counts, and load threshold information, the model allows accurate incremental load processing and enables a decision about whether to perform a full or incremental refresh.

4.1. Batch Cycle Process Flow

For most of the last decade, data warehouses were refreshed on a monthly, weekly, or daily basis. Nowadays, refreshes occur more than once per day. A batch cycle that runs based on subject areas in a data warehouse may run several times a day. A batch cycle holds several boxes, which run in order of dependency. The very first box (Figure 2: edwxdungcr100) holds a gatekeeper job that looks for a notification file from one or more upstream (source) subject areas. The second box (Figure 2: edwxdungcr300) in a cycle usually holds a job that runs a pre-load job (Figure 2: edwcdungcr10x) to execute a metadata utility procedure (pr_utl_pre_load) to prepare for the current cycle refresh by doing clean-up, inserts, and updates of metadata tables. It deletes the 'asof_dt' and 'asof_ts' values from the last cycle and inserts new date and timestamp values for the current cycle. It writes an entry to the msg_log table to log that the new cycle has started. The pre-load procedure updates the table with the cycle-begin date and timestamp after the gatekeeper job successfully runs. A post-load procedure updates the table with the cycle-finish timestamp soon after the refresh is completed.
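The pre-load and post-load bookkeeping can be sketched as follows. This is a hypothetical Python stand-in for the two utility procedures named above (the cyc_log row layout is an assumption):

```python
from datetime import datetime

# Hypothetical stand-in for the cyc_log table: one row per subject area.
cyc_log: dict[str, dict] = {}

def pr_utl_pre_load(subj_area: str, now: datetime) -> None:
    """Open the cycle: replace last cycle's as-of values and stamp cycle begin."""
    cyc_log[subj_area] = {"asof_dt": now.date(), "asof_ts": now,
                          "cyc_bgn_ts": now, "cyc_end_ts": None}

def pr_utl_post_load(subj_area: str, now: datetime) -> None:
    """Close the cycle: stamp cycle end after all loads complete."""
    cyc_log[subj_area]["cyc_end_ts"] = now

pr_utl_pre_load("Procurement_DRV", datetime(2012, 5, 1, 6, 0))
# ... the cycle's load jobs run here ...
pr_utl_post_load("Procurement_DRV", datetime(2012, 5, 1, 6, 45))
print(cyc_log["Procurement_DRV"]["cyc_end_ts"])
```

A cycle row whose cyc_end_ts is still empty thus signals a cycle that is in flight (or one that failed before its post-load job).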

The subsequent boxes (Figure 2: boxes edwxdungcr310 to edwxdungcr380) hold the actual jobs that load the target tables in the subject area. These jobs are placed in different boxes to run them in sequence and in order of dependency. A downstream-box job in the cycle might have dependencies on tables in one or more upstream boxes in the subject area. Each job calls the wrapper stored procedure for a target table to execute a full or delta stored procedure.

Figure 2. Batch cycle flow.

After all load jobs under the different boxes run to success, the last box in the cycle (Figure 2: edwxdungcr390) executes a job (Figure 2: edwcdungcr029x) to run a utility procedure (pr_utl_refresh_status) which updates a metadata table (Table 1: #5, load_config) with the latest source data timestamp and the target cycle refresh finish timestamp. The analytical community has visibility into the latest data availability timestamp via a web tool. The batch cycle also executes another job (Figure 2: edwcdungcr019x) to run a metadata post-load utility procedure. The utility procedure (pr_utl_post_load) archives all load metrics for the current cycle refresh into several history (audit) tables (Table 1: #2, #4, #8, #11) to retain them for a specified period of time. The post-load stored procedure is called by a batch cycle job. It archives the current cycle metadata (from load_metric, cyc_log, msg_log) to historical tables (load_metric_hist, cyc_log_hist, msg_log_hist). It updates the cyc_end_ts field of the cyc_log table with the current timestamp to log that the cycle load has completed. It trims the historical tables (load_metric_hist, cyc_log_hist, msg_log_hist) of their least recent cycle records on a rolling basis. One quarter's worth of records is normally preserved for investigation and research. This procedure also makes sure that the 'load_chk_log' table is reverted to its default settings. If during a particular cycle refresh we want to do a 'Full' refresh of a particular job, and for the subsequent runs we want to do 'Delta', we use this post-load procedure to set load_type = Delta (the default).
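The rolling archive-and-trim step can be sketched as follows. This is a hypothetical Python illustration (the 90-day window approximates the "one quarter" mentioned above; the record layout is an assumption):

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=90)  # roughly one quarter, per the text

# Hypothetical history table: a list of (cycle_end_ts, metric_row) records.
load_metric_hist: list[tuple[datetime, dict]] = []

def archive_and_trim(current_cycle_rows: list[dict], cyc_end_ts: datetime) -> None:
    """Archive the current cycle's metrics, then drop records older than
    the retention window on a rolling basis."""
    for row in current_cycle_rows:
        load_metric_hist.append((cyc_end_ts, row))
    cutoff = cyc_end_ts - RETENTION
    load_metric_hist[:] = [(ts, r) for ts, r in load_metric_hist if ts >= cutoff]

# Two cycles four months apart: the older one falls outside the window.
archive_and_trim([{"tbl": "purch_ord_line", "rows": 100}], datetime(2012, 1, 1))
archive_and_trim([{"tbl": "purch_ord_line", "rows": 40}], datetime(2012, 5, 1))
print(len(load_metric_hist))  # -> 1
```

Trimming inside the same post-load step keeps the history tables bounded without a separate purge job.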

4.2. Algorithm for Wrapper StoredProcedures

The data warehouse subject area refresh is done via batch cycle runs. Each job in a batch cycle calls a wrapper stored procedure to execute a full or delta stored procedure for a full or incremental refresh. The wrapper procedure decides whether to perform a full refresh, a delta refresh, or to skip the refresh altogether through a few utility stored procedures that are governed by the metadata model. The load is skipped if the model finds that the source table's last observation timestamp has not changed. The choice between a full and a delta refresh is made based on individual threshold values set for each target table. If the new delta counts found in the source table are within the threshold percentage, the model runs the delta stored procedure; otherwise, it invokes the full load procedure.
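The threshold rule above can be sketched as follows. This is a hypothetical Python illustration; the threshold semantics (new rows as a percentage of target rows) are our reading of the text, not a specification from the paper:

```python
def choose_load_strategy(new_src_rows: int, trgt_rows: int,
                         threshold_pct: float) -> str:
    """Decide among 'skip', 'delta', and 'full' per the rules in the text:
    skip when nothing changed; delta while the change volume stays within
    the per-table threshold; full otherwise."""
    if new_src_rows == 0:
        return "skip"
    if trgt_rows > 0 and (new_src_rows / trgt_rows) * 100 <= threshold_pct:
        return "delta"
    return "full"  # large change volume (or an empty target): full refresh

print(choose_load_strategy(0, 1_000_000, 20))        # -> skip
print(choose_load_strategy(50_000, 1_000_000, 20))   # -> delta (5% <= 20%)
print(choose_load_strategy(400_000, 1_000_000, 20))  # -> full (40% > 20%)
```

A per-table threshold like this lets small tables fall back to cheap full refreshes while large tables stay on incremental loads.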

Figure 3 shows that the wrapper procedure checks the metadata table to see if the source load's current observation timestamp is greater than the observation timestamp of the last target table load. The next step of the decision flow is to check whether the load type indicator is full or delta. The default is a full load if no load type value is provided.

Figure 4 provides a template of the wrapper stored procedure algorithm. First, it calls a utility procedure to get the current cycle-begin timestamp. The cycle-begin timestamp (Table 1: #1) is used while capturing metadata for the target table load. The next utility procedure in the wrapper (Figure 4: pr_utl_chk_tbl_changed) is used to check whether the source table has new records. This is done by comparing the latest observation timestamps in the metadata tables src_tbl_fl_info (Table 1: #10) and src_load_log (Table 1: #7) for the target table load. The wrapper stored procedure runs the next utility procedure (Figure 4: pr_utl_Get_LoadCheck_Info) to pull the load parameters (force_load, load_type, load_enabled, etc.) from the load_chk_log table (Table 1: #6). These parameter values provide information about the load type, such as full load, delta load, or no load.

Figure 3. High level software architecture for full and delta load.


REPLACE PROCEDURE Procurement_DRV_MET.pr_Wpurch_ord_line()
BEGIN
  DECLARE subj_ara VARCHAR(30) DEFAULT 'Procurement_DRV';
  DECLARE trgt_tbl VARCHAR(30) DEFAULT 'purch_ord_line';
  DECLARE load_type CHAR(5) DEFAULT 'Full';
  DECLARE cycle_dt DATE;
  DECLARE cycle_ts, src_cur_ts, src_lst_ts, last_cycle_ts TIMESTAMP(0);
  DECLARE trgt_row_exists, row_exists, load_thrhld, msg_nbr INTEGER DEFAULT 0;
  DECLARE forc_load, load_enable, copy_back CHAR(1) DEFAULT 'Y';
  DECLARE msg_txt1, msg_txt2 VARCHAR(100) DEFAULT ' ';
  DECLARE dummy_msg1, dummy_msg2 VARCHAR(20) DEFAULT ' ';

  -- Get the time and date for the current cycle refresh from the DWmgr_Xfrm_MET.v_cyc_log table.
  CALL DWmgr_Xfrm_MET.pr_utl_Get_Cycle_TS(:cycle_dt, :cycle_ts, :subj_ara);

  -- First find out how many source tables have changed for this target table since the last cycle refresh.
  CALL DWmgr_Xfrm_MET.pr_utl_chk_tbl_changed(:subj_ara, :trgt_tbl, :src_cur_ts, :src_lst_ts, :trgt_row_exists);

  -- Get load parameters from the DWmgr_Xfrm_MET.v_load_chk_log table.
  -- The entries in this table will be used to decide the processing strategy for this object.
  CALL DWmgr_Xfrm_MET.pr_utl_Get_LoadCheck_Info(:subj_ara, :trgt_tbl, :row_exists, :forc_load, :load_type,
       :load_enable, :load_thrhld, :copy_back, :last_cycle_ts, :src_cur_ts, :src_lst_ts, :trgt_row_exists);

  -- Next choose the load procedure execution strategy.
  -- The strategy could be one of the following: Full, Delta, or Skip the load if there are no changes in source tables.
  -- Note: some of the parameters are INOUT, and may be changed by the stored procedure during the call.
  CALL DWmgr_Xfrm_MET.pr_utl_choose_load_strat(:subj_ara, :trgt_tbl, :forc_load, :load_type, :load_enable, :load_thrhld);

  -- Here we execute the main data stored procedure: delta or full.
  IF (load_type = 'Full' AND load_enable = 'Y') THEN
    CALL appl_Procurement_DRV_01.pr_Fpurch_ord_line();
  ELSEIF (load_type = 'Delta' AND load_enable = 'Y') THEN
    CALL appl_Procurement_DRV_01.pr_Dpurch_ord_line();
    -- This next line is here so that the SP will compile if there is no call
    -- to the delta SP (many tables do not have deltas).
    SET dummy_msg1 = 'Running Delta';
  ELSEIF (load_type = 'S' AND load_enable = 'S') THEN
    SET dummy_msg2 = 'Skipping the load';
  ELSE
    -- Here we insert an entry into the message log table.
    CALL DWmgr_Xfrm_MET.pr_utl_write_msg_log(:subj_ara, :trgt_tbl, 51, 'Wrapper Stored Procedure', 'No load performed',
         'Check values for forc_load_ind, load_type_cd or load_enable_ind in load_chk_log table');
  END IF;

  -- Here we update the DWmgr_Xfrm_MET.v_load_chk_log and DWmgr_Xfrm_MET.v_src_load_log
  -- tables with the new timestamp as the load is successful.
  CALL DWmgr_Xfrm_MET.pr_utl_upd_LoadCheck(:subj_ara, :trgt_tbl, :last_cycle_ts, :load_type, :load_enable, :src_cur_ts);
END;

Figure 4. Wrapper stored procedure template for full or incremental load.


The next utility procedure in the wrapper (Figure 4, pr_utl_choose_load_strat) is used to pass out values, mainly load_type and load_enable, to the wrapper stored procedure. Based on that information, the wrapper procedure executes the 'Full' or 'Delta' load stored procedure to load the target table. It might also skip the target table load if the source tables have no new records. This procedure also captures errors and messages regarding load-skip, load-disabled, bad parameter values, etc. (if any of these are detected) into the msg_log (Table 1: #9) and load_mtric (Table 1: #12) tables.

If the wrapper procedure decides to execute the full load stored procedure, the target table is emptied before being re-loaded. In many cases, the target table load consists of data pulls from multiple source tables via several insert/select SQL blocks. The full stored procedure uses a utility procedure to capture load information into the metadata table. The utility procedure pr_utl_write_ent_msg_log is used to capture source table, timestamp, and row count information into the message log table (Table 1: #9) after each SQL block execution in the full procedure. This information is extremely helpful because it enables an ETL developer or production support person to easily and quickly determine which particular SQL block failed. This information is also useful when investigating any data issue in the target table.

When the wrapper procedure decides to execute the delta stored procedure for an incremental load, certain checks are performed first (for instance, whether the delta load threshold is exceeded or whether the target table is, for some reason, empty). A delta refresh is performed when a small amount of new data (normally, less than 20% of the target rows) arrives in the source table during each cycle refresh [27]. A delta stored procedure performs two specific operations against the target table: an update of any non-key columns with changed data from the source, and the insertion of any brand-new records from the source.
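The two delta operations described above can be sketched as an in-memory upsert. The following Python sketch is purely illustrative (the paper's implementation is SQL stored procedures); the key column name po_line_id and the dict-based table representation are our assumptions.

```python
def apply_delta(target, delta_rows, key="po_line_id"):
    """Apply a delta load to a target table held as a dict keyed by the
    primary key (hypothetical key name `po_line_id`). Existing keys get
    their non-key columns updated; brand-new keys are inserted."""
    updated = inserted = 0
    for row in delta_rows:
        k = row[key]
        if k in target:
            if target[k] != row:          # non-key columns changed
                target[k] = dict(row)
                updated += 1
        else:                             # brand-new record from the source
            target[k] = dict(row)
            inserted += 1
    return updated, inserted
```

In a real delta stored procedure these two operations would be an UPDATE and an INSERT...SELECT against the target table; the sketch only mirrors the decision between them.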

There are several other conditions that affect load decisions [27]: (i) if the delta count exceeds a certain threshold percentage of the target table row count, the load condition is switched to a full refresh. A delta refresh into a populated table is slower because transient journaling is needed for a table that already contains data. The delta load requires data manipulation language (DML) operations such as 'delete' and 'update', which consume more resources. Hence, if a larger number of new rows arrives from the source, it is more efficient to do a full refresh than an incremental refresh. Normally, a full refresh is performed when the target table needs to be loaded with more than one million rows. (ii) If for some reason the target table is found to be empty, a full refresh is needed and the delta stored procedure will call the full load stored procedure. (iii) If the delta count is within the predetermined threshold percentage, then the delta load is performed. Delta refreshes are also performed when full refreshes perform badly and cannot practically be optimized any further.

Once the target table has been loaded successfully by the full or delta stored procedure, the wrapper procedure calls one last utility procedure to update the metadata table, capturing the observation timestamp of each source table against the target table. Thus, under an ETL metadata model, the cycle refresh is driven entirely by ETL metadata. The wrapper procedure and its associated utility procedures, the full and delta load procedures and their utility procedures, and the metadata tables together provide ETL developers across the enterprise with a complete set of objects for extract-transform-load. It is important to mention that the stored procedures and tables are DBMS (database management system) based, as opposed to relying on any commercial ETL tool. This ETL metadata model and its associated utility procedures are useful for data warehouses that do not use an ETL tool, and the metadata model is independent of any commercial DBMS.

5. Using Metadata in the Load Process

The initial refresh of analytical tables is a full load. Subsequent refreshes are done via incremental loads. During a full refresh, the existing data is truncated and a new copy of all rows is reloaded into the target table from the source. An incremental refresh loads only the delta changes that have occurred since the last time the target table was loaded; that is, only the difference between target and source data is loaded, at regular intervals. For an incremental load, the observation timestamp of the previous delta load has to be maintained in the ETL metadata model. The data warehouse refresh is performed via batch cycles, so that the degree of information in a data warehouse is "predictable" [4].

In real-world situations, ETL programmers tend to design delta stored procedures to run 'stand-alone', that is, without taking advantage of an ETL metadata model. We suggest design techniques that utilize a metadata model. The assumption is that, instead of relying on conventional ETL tools, the full and incremental loads can be done via database-specific stored procedures with the help of a metadata model.

From the data warehouse side, updating large tables and related structures (such as indexes, materialized views, and other integrated components) presents problems for executing OLAP query workloads simultaneously with continuous data integration [28]. Our methodology minimizes the processing time and system resources required for these update processes, as stored procedures give ample opportunity to manipulate the SQL blocks and influence the optimizer to execute queries with parallel efficiency.

5.1. Checking Source Data Change

The wrapper procedure first calls a utility procedure to get the last source table load timestamp from the metadata table. It also checks the target table's last load observation timestamp. By comparing the two timestamps, it determines whether the source data has changed.

In Figure 5, the code block gets the last load timestamp of each of the source tables (primary, secondary, dimension, lookup, etc.) under one or several business subject areas that are referenced in the stored procedures that load the target table. Based on that information, the source table load timestamps are pulled.

The utility stored procedure also pulls the last load timestamp of the target table. If any source table load timestamp is found to be greater than the corresponding target table last load timestamp, new data has arrived in some source tables and a full or delta stored procedure should be executed (depending on the load type code) to load the target table.

If the source load timestamp is equal to the last load timestamp of the target table, the source data has not changed; the table load is therefore skipped and the stored procedure execution is bypassed. If the source load current timestamp is greater than the target table's last load timestamp, an attempt is made to load the target table via the full or delta stored procedure, depending on the load indicator information in the metadata table. The default load is delta if no indicator value is provided.
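The timestamp comparison above reduces to a simple filter over the per-source last-load timestamps. The Python sketch below is illustrative only (dict-based, with assumed names); an empty result means the target table load can be skipped.

```python
from datetime import datetime

def changed_sources(src_last_load, trgt_last_load_ts):
    """Return the source tables whose last-load timestamp is newer than
    the target table's last-load timestamp. `src_last_load` maps source
    table name -> last-load timestamp (names are our assumptions)."""
    return [tbl for tbl, ts in src_last_load.items() if ts > trgt_last_load_ts]
```

If the returned list is non-empty, the wrapper proceeds with a full or delta load; otherwise the refresh is bypassed for this target table.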

Figure 5. Code block that detects source data change.

Table 2. Last observation timestamp for target table refresh.


5.2. Getting the Parameters to Determine Load Strategy

The wrapper stored procedure calls a utility stored procedure to get the target table load parameters from the load_chk_log table. For each individual target table load within a subject area, a default metadata entry consisting of load conditions is inserted (Table 3) by this utility procedure during the first cycle refresh (a one-time entry). The default entry consists of several conditions: a 'force load indicator' ('Y' or 'N'; default 'N'), a 'load type code' ('Full' or 'Delta'; default 'Full'), a 'load enabled indicator' ('Y' or 'N'; default 'Y'), a 'delta load threshold' (10%-20%; default 10%), a 'collect statistics' indicator ('Y' or 'N'; default 'N'), and a 'copy-back' indicator ('Y' or 'N'; default 'N'). These parameters are passed to another utility procedure to prepare for target table loading.
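The default entry can be pictured as a record with the defaults listed above. The following Python dataclass is only an illustrative rendering of that record; the column names follow the paper's conventions, while the Python form itself is our assumption.

```python
from dataclasses import dataclass

@dataclass
class LoadCheckEntry:
    """Default load_chk_log entry inserted during the first cycle refresh
    (illustrative sketch; defaults taken from the text and Table 3)."""
    subj_area_nm: str
    trgt_tbl_nm: str
    forc_load_ind: str = "N"      # force load regardless of source change
    load_type_cd: str = "Full"    # 'Full' or 'Delta'
    load_enable_ind: str = "Y"    # 'N' blocks the load entirely
    load_thrhld_nbr: int = 10     # delta threshold, percent of target rows
    collect_stat_ind: str = "N"
    copy_back_ind: str = "N"
```

Creating an entry with only the subject area and target table names yields the documented default load behavior.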

The column forc_load_ind holds the value 'Y' or 'N'. A value of 'Y' allows the job to load the target table, irrespective of whether the source table data has changed. A value of 'N' lets the wrapper stored procedure determine whether the source data has changed and thus whether or not to load the target table. The load_type_cd field holds the value 'Full' or 'Delta'; based on this information, the wrapper stored procedure executes the full or delta stored procedure. The column load_enable_ind holds the value 'Y' or 'N'. A value of 'Y' allows the job to load the target table; a value of 'N' prevents the wrapper stored procedure from loading it. The load_thrhld_nbr field holds the threshold value for a 'Delta' load. The wrapper procedure first determines the load type. If it is 'Full', a full load is performed, no matter what the threshold value is. If it is 'Delta', the delta procedure kicks off. The delta procedure first determines the source row count. If the source count is within the load threshold number, the delta load is performed; if the row count is greater than the load threshold number, the full load procedure is invoked instead.

Based on the load conditions, the wrapper procedure either executes a 'Full' or 'Delta' stored procedure to load the target table, or skips the target table load if the source table has no new records. It also generates and logs load messages into the msg_log and load_mtric tables regarding load type, load skip, load enabled or disabled, and any unknown parameter values.

5.3. Capturing Error/Message Logs while Loading

The table msg_log stores troubleshooting information. A utility stored procedure is used to write error messages and information to the msg_log table during the target table load. The utility procedure pr_utl_write_msg_log() is called to capture error messages.
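A message log write is essentially an append of a structured record. The sketch below shows that shape in Python against an in-memory list; the field names mirror the paper's conventions but are our assumptions, and the real pr_utl_write_msg_log() inserts into a database table.

```python
from datetime import datetime

def write_msg_log(msg_log, subj_area, trgt_tbl, msg_nbr, msg_src, msg_txt):
    """Append a troubleshooting entry to an in-memory msg_log list
    (illustrative sketch of what the utility procedure records)."""
    msg_log.append({
        "subj_area_nm": subj_area,
        "trgt_tbl_nm": trgt_tbl,
        "msg_nbr": msg_nbr,
        "msg_src_txt": msg_src,
        "msg_txt": msg_txt,
        "msg_ts": datetime.now(),   # when the message was captured
    })
```

The wrapper template in Figure 4 calls this style of logging, for example, when no load is performed because of bad parameter values.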

Table 3. Load parameters used to determine load condition.

Table 4. Message log.


Once the cycle refresh is complete, the message log entries for the current cycle are archived into the msg_log_hist table for future reference or investigation.

5.4. Capturing Load Metrics

The table load_mtric (Table 1) stores vital load statistics (cyc_bgn_ts, load_strt_dt, load_strt_ts, load_stp_ts, load_row_cnt, load_msg_txt, and runtime) for each target table. The statistics are also used for runtime report-outs. This table indicates whether the load was full or delta, whether the load was disabled, and (in most cases) the reason for a zero row count. It enables the monitoring of a cycle while it is running.

The utility stored procedure pr_utl_Load_Metrics() is called from both the 'Full' and 'Delta' data load stored procedures to insert load metrics about each target table load into the load_mtric metadata table. A row count greater than zero and the load timestamp serve as vital information for the downstream table refresh. This information is also useful for monitoring cycle refreshes.
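The metrics capture wraps each load with a timer and a row count. The Python sketch below illustrates that pattern in memory; the wrapper shape and field names are our assumptions, not the signature of pr_utl_Load_Metrics().

```python
import time

def run_with_metrics(load_mtric, subj_area, trgt_tbl, load_type, load_fn):
    """Run a load step and record its row count and runtime in an
    in-memory load_mtric list (illustrative sketch). `load_fn` stands in
    for a full or delta load procedure and returns the rows loaded."""
    start = time.time()
    load_row_cnt = load_fn()
    load_mtric.append({
        "subj_area_nm": subj_area,
        "trgt_tbl_nm": trgt_tbl,
        "load_type_cd": load_type,
        "load_row_cnt": load_row_cnt,
        "runtime_sec": round(time.time() - start, 3),
    })
    return load_row_cnt
```

Recording the row count alongside the runtime is what lets a zero-row load be distinguished from a disabled or skipped load when monitoring a running cycle.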

The full stored procedure contains transformation logic to first empty the target table before reloading it with all data from the source tables. The delta refresh is done for tables that are loaded with more than one million rows, or when full refreshes perform badly and cannot practically be optimized any further. The delta stored procedure first pulls the last observation timestamp of the target table from the metadata table and loads the target table with the new delta records that have arrived since the target table was last refreshed. Once the cycle refresh is done, the load metrics for the current cycle are archived into the load_mtric_hist table for future reference, investigation, and research.

5.5. Subsequent Cycle Run Behavior

In order to automate the set-up of each target table's default load behavior, the metadata model needs to hold load parameters to be reset after each cycle refresh. The table load_chk_opt (Figure 1: #13) stores job information (subj_area_nm, trgt_tbl_nm, load_enable_ind, load_type_cd). For each job, one entry is required to exist in this table. The entries are inserted into this table via a utility stored procedure called by the wrapper stored procedures.

The load_type_cd field holds the value 'Full' or 'Delta'. Based on this information, the post-load stored procedure updates the load_chk_log table to prepare it for the next cycle refresh. The column load_enable_ind holds the value 'Y' or 'N' and is used by the post-load stored procedure in the same way. Once the cycle refresh is done, the load_chk_log table is updated based on the information (load_type_cd, load_enable_ind) in the load_chk_opt table via the post-load utility stored procedure. The next cycle refresh then occurs based on the updated information in the load_chk_log table.

Table 5. Metrics showing load skipped since source data has not changed (sample rows).

Table 6. Table with load type code to control load behavior.

If a particular target table under a subject area needs to be loaded with a 'Delta' refresh (by default, a 'Full' refresh occurs), one entry for that target table needs to be inserted into this table.
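The post-load reset can be sketched as copying the configured defaults from load_chk_opt back into load_chk_log for each job. The dict-based Python below is only an illustration of that reset step, with assumed key and field names.

```python
def reset_load_behavior(load_chk_log, load_chk_opt):
    """After a cycle completes, restore each target table's load_type_cd
    and load_enable_ind in load_chk_log from its load_chk_opt entry, so
    the next cycle refresh starts from the configured defaults.
    Both tables are keyed by (subject area, target table)."""
    for key, opt in load_chk_opt.items():
        if key in load_chk_log:
            load_chk_log[key]["load_type_cd"] = opt["load_type_cd"]
            load_chk_log[key]["load_enable_ind"] = opt["load_enable_ind"]
```

In the paper's model this copy is performed in the database by the post-load utility stored procedure; the sketch only shows the data flow between the two tables.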

5.6. Load Configurations

To retain the cycle refresh information for a particular subject area in the metadata archive tables (cyc_log_hist, load_chk_log_hist, load_mtric_hist and msg_log_hist), one entry against each hist table must be inserted into the metadata table (Table 7), load_config (Figure 1: #5). This table is used for multiple purposes. For example, if we decide to retain 90 days' worth of history, entries need to be inserted into this table accordingly in order to enable the post-load procedure to retain records for the last 90 days and delete, on a rolling basis, any records older than 90 days.
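The rolling 90-day retention rule can be expressed as a cutoff filter. This Python sketch is illustrative: the real post-load procedure deletes old rows in the database rather than filtering in memory, and the cyc_bgn_ts field name follows the paper's conventions.

```python
from datetime import datetime, timedelta

def trim_history(hist_rows, retention_days=90, now=None):
    """Keep only history rows whose cycle-begin timestamp falls inside
    the retention window (sketch of the post-load trimming of the
    *_hist tables). `now` is injectable for testing."""
    now = now or datetime.now()
    cutoff = now - timedelta(days=retention_days)
    return [r for r in hist_rows if r["cyc_bgn_ts"] >= cutoff]
```

The retention_days value corresponds to the threshold held in the load_config table (config_thrhld_nbr).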

The table load_config (subj_area_nm, trgt_tbl_nm, obj_type, config_thrhld_nbr, config_txt) provides space to hold any staging data or input in order to prepare for a full or delta load. It is also used to trim the history tables on a rolling basis. The config_thrhld_nbr field holds the threshold value for the history tables. The post-load procedure deletes the least recent data from the history tables based on this threshold number.

The config_txt column gives ETL developers the opportunity to stage any value wanted in order to prepare for and load a target table. This field can also be used to prepare report-outs of a particular table refresh, which can then be viewed by global report users via any web tool.

Table 7 shows the information kept to track target table load conditions. For example, in a given subject area, certain tables could be loaded on the first day of the month, while other tables are loaded on the last day of the month. Another set of tables could be loaded on a particular day of the last week of the month, and yet another set once per day instead of several times per day. To enable this kind of load behavior, the load timestamp of each of the target tables needs to be tracked. This information can then be used to control and determine the next load time.
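Such day-based load conditions reduce to a check of today's date against the table's last load date. The condition codes below are hypothetical examples of values a config_txt column could carry; the paper does not prescribe these exact strings.

```python
from datetime import date, timedelta

def due_for_load(config_txt, today, last_load_dt):
    """Decide whether a target table is due for refresh under a simple
    day-based load condition (illustrative condition codes)."""
    if config_txt == "FIRST_DAY_OF_MONTH":
        return today.day == 1 and last_load_dt != today
    if config_txt == "LAST_DAY_OF_MONTH":
        is_last_day = (today + timedelta(days=1)).month != today.month
        return is_last_day and last_load_dt != today
    if config_txt == "ONCE_PER_DAY":
        return last_load_dt != today
    return True  # default: refresh on every cycle
```

Comparing last_load_dt with today prevents a table scheduled once per day from reloading on every cycle within the same day.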

6. Conclusions and Future Work

In this article, we showed that by using an ETL metadata model, data warehouse refreshes can be standardized, while load behavior can be made predictable and efficient. The model can help identify source tables that have no new data. Performing refreshes through a metadata-driven approach enables resource savings in the data warehouse: it helps avoid unnecessary loads when source data has not changed. The model captures all source observation timestamps as well as target table load timestamps. The model helps make it possible to run batch cycles unattended because it resets the load behavior of each table for the next cycle refresh.

Table 7. Table that keeps track of load conditions for different load types.

We presented an approach for refreshing data warehouses based on an ETL metadata model. Our approach has clear benefits, as it allows table refreshes to be skipped when source data has not changed. It helps to improve load performance in terms of response time and the database system's CPU and I/O resource consumption. The ETL metadata model, along with the utility and wrapper procedures, provides ETL programmers with a powerful tool for building a data warehouse by following certain rules. We further provided a template of the wrapper stored procedure.

Our ETL metadata model achieves several key objectives. It enables the capture of semantic metadata. It makes batch cycle refreshes metadata-driven. It helps to reset the load behavior of each table for the next cycle run. It enables load consistency across the data warehouse. It provides ETL programmers with a set of standard objects that makes their development work easier. The model provides the production support team with resources to monitor data warehouse subject area refreshes. By introducing an ETL metadata model, we are able to automate data warehouse batch cycle refresh processes. Each of these achievements helps improve the day-to-day maintenance of data warehouses. The use of this ETL metadata model in a production data warehouse environment shows that the model runs efficiently in batch processing, cycle monitoring, and data warehouse maintenance. In a future endeavor, we intend to develop macros to generate automated views as well as full and delta stored procedure templates with the help of this metadata model.

Acknowledgments

We would like to thank Julian Emmerson, formerly Senior Application Developer at Intel, for assisting and guiding us in designing the initial version of this metadata model. We are also grateful to Wally Haven, a Senior Application Developer at Intel, and the anonymous reviewers, whose comments have substantially improved the quality of the paper.

References

[1] D. AGRAWAL, A. ABBADI, A. SINGH AND T. YUREK, Efficient View Maintenance at Data Warehouses, ACM SIGMOD Record, 26(2), pp. 417–427, 1997.

[2] F. BENSBERG, Controlling the Data Warehouse – a Balanced Scorecard Approach, in Proceedings of the 25th International Conference on Information Technology Interfaces (ITI 2003), June 16–19, 2003, Cavtat, Croatia.

[3] S. BROBST, M. MCINTIRE AND E. RADO, Agile Data Warehousing with Integrated Sandboxing, Business Intelligence Journal, Vol. 13, No. 1, 2008.

[4] M. CHAN, H. V. LEONG AND A. SI, Incremental Update to Aggregated Information for Data Warehouses over Internet, in Proceedings of the 3rd ACM International Workshop on Data Warehousing and OLAP (DOLAP'00), November 2000, McLean, VA, USA.

[5] S. CHAUDHURI AND U. DAYAL, An Overview of Data Warehousing and OLAP Technology, SIGMOD Record, 26(1), March 1997.

[6] I. CHENGAIUR-SMITH, D. BALIOU AND H. PAZER, The Impact of Data Quality Information on Decision Making: An Exploratory Analysis, IEEE Transactions on Knowledge and Data Engineering, Vol. 11, No. 6, November–December 1999.

[7] T. CHENOWETH, K. CORRAL AND H. DEMIRKAN, Seven Key Interventions for Data Warehouse Success, Communications of the ACM, Vol. 49, No. 1, January 2006.

[8] B. CZEJDO, J. EDER, T. MORZY AND R. WREMBEL, Design of a Data Warehouse over Object-Oriented and Dynamically Evolving Data Sources, in Proceedings of the 12th International Workshop on Database and Expert Systems Applications, 2001.

[9] J. EDER, C. KONCILIA AND T. MORZY, The COMET Metamodel for Temporal Data Warehouses, in Proceedings of the 14th International Conference on Advanced Information Systems Engineering (CAiSE 2002), Toronto, Canada, LNCS 2348, pp. 83–99.

[10] H. FAN AND A. POULOVASSILIS, Using AutoMed Metadata in Data Warehousing Environments, in Proceedings of the 6th ACM International Workshop on Data Warehousing and OLAP, New Orleans, Louisiana, USA, pp. 86–93.

[11] N. FOSHAY, A. MUKHERJEE AND A. TAYLOR, Does Data Warehouse End-User Metadata Add Value?, Communications of the ACM, Vol. 50, No. 2, November 2007.

[12] C. GARCIA, Real Time Self-Maintainable Data Warehouse, in Proceedings of the 44th Annual Southeast Regional Conference (ACM SE'06), March 10–12, 2006, Melbourne, Florida, USA.

[13] S. R. GARDNER, Building the Data Warehouse, Communications of the ACM, 41(9).

[14] M. GOLFARELLI, D. MAIO AND S. RIZZI, The Dimensional Fact Model: a Conceptual Model for Data Warehouses, International Journal of Cooperative Information Systems, Vol. 7(2&3), pp. 215–247.

[15] J. H. HANSON AND M. J. WILLSHIRE, Modeling a Faster Data Warehouse, International Database Engineering and Applications Symposium (IDEAS 1997).

[16] T. N. HUYNH, O. MANGISENGI AND A. M. TJOA, Metadata for Object-Relational Data Warehouse, in Proceedings of the International Workshop on Design and Management of Data Warehouses (DMDW'2000), Stockholm, Sweden, June 5–6, 2000.

[17] W. H. INMON, Building the Data Warehouse, 3rd Edition, John Wiley, 2002.

[18] M. JENNINGS, Use of Operational Meta Data in the Data Warehouse, 2003. Last accessed on 12/12/2011: http://businessintelligence.com/article/13

[19] T. JORG AND S. DESSLOCH, Towards Generating ETL Processes for Incremental Loading, in Proceedings of the 2008 International Symposium on Database Engineering & Applications (IDEAS'08), September 10–12, 2008, Coimbra, Portugal.

[20] A. KARAKASIDIS, P. VASSILIADIS AND E. PITOURA, ETL Queues for Active Data Warehousing, in Proceedings of the 2nd International Workshop on Information Quality in Information Systems (IQIS 2005), Baltimore, MD, USA.

[21] N. KATIC, J. QUIRCHMAYR, S. M. STOLBA AND A. M. TJOA, A Prototype Model for Data Warehouse Security Based on Metadata, IEEE Xplore, 1998.

[22] T. KIM, J. KIM AND H. LEE, A Metadata-Oriented Methodology for Building Data Warehouse: A Medical Center Case, Informs & Korms – 928, Seoul 2000 (Korea).

[23] W. J. LABIO, R. YERNENI AND H. GARCIA-MOLINA, Shrinking the Warehouse Update Window, in Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data (SIGMOD'99), Philadelphia, PA.

[24] B. LIU, S. CHEN AND E. A. RUNDENSTEINER, Batch Data Warehouse Maintenance in Dynamic Environments, in Proceedings of CIKM'02, pp. 68–75, 2002.

[25] S. LUJAN-MORA AND M. PALOMAR, Reducing Inconsistency in Integrating Data from Different Sources, International Database Engineering & Applications Symposium (IDEAS '01), 2001.

[26] N. RAHMAN, Refreshing Data Warehouses with Near Real-Time Updates, Journal of Computer Information Systems, Vol. 47, Part 3, 2007, pp. 71–80.

[27] N. RAHMAN, P. W. BURKHARDT AND K. W. HIBRAY, Object Migration Tool for Data Warehouses, International Journal of Strategic Information Technology and Applications (IJSITA), Vol. 1, Issue 4, 2010, pp. 55–73.

[28] R. J. SANTOS AND J. BERNARDINO, Real-Time Data Warehouse Loading Methodology, in Proceedings of the 2008 International Symposium on Database Engineering & Applications (IDEAS'08), September 10–12, 2008, Coimbra, Portugal.

[29] A. SEN AND A. P. SINHA, A Comparison of Data Warehousing Methodologies, Communications of the ACM, Vol. 48, Issue 3, March 2005.

[30] G. SHANKARANARAYANAN AND A. EVEN, The Metadata Enigma, Communications of the ACM, Vol. 49, Issue 2, 2006, pp. 88–94.

[31] B. SHIN, An Exploratory Investigation of System Success Factors in Data Warehousing, Journal of the Association for Information Systems, Vol. 4, 2003.

[32] A. SIMITSIS, P. VASSILIADIS AND T. SELLIS, Optimizing ETL Processes in Data Warehouses, in Proceedings of the 21st International Conference on Data Engineering (ICDE'05), April 5–8, 2005, Tokyo, Japan.

[33] T. STOHR, R. MULLER AND E. RAHM, An Integrative and Uniform Model for Metadata Management in Data Warehousing Environments, in Proceedings of the International Workshop on Design and Management of Data Warehouses (DMDW'99), Heidelberg, Germany, June 14–16, 1999.

[34] T. VETTERLI, A. VADUVA AND M. STAUDT, Metadata Standards for Data Warehousing: Open Information Model vs. Common Warehouse Metadata, ACM SIGMOD Record, Vol. 29, Issue 3, September 2000.

[35] C. WHITE, Data Integration: Using ETL, EAI, and EII Tools to Create an Integrated Enterprise, The Data Warehousing Institute, October 2005.

[36] J. WIDOM, Research Problems in Data Warehousing, in Proceedings of the 4th International Conference on Information and Knowledge Management (CIKM'95), November 1995, Baltimore, MD, USA.

[37] Y. ZHUGE, J. L. WIENER AND H. GARCIA-MOLINA, Multiple View Consistency for Data Warehousing, in Proceedings of the Thirteenth International Conference on Data Engineering, April 7–11, 1997, Birmingham, U.K.

[38] J. V. ZYL, M. VINCENT AND M. MOHANIA, Representation of Metadata in a Data Warehouse, in Proceedings of IEEE'98, IEEE Xplore, 1998.


Received: March, 2012
Revised: May, 2012

Accepted: May, 2012

Contact addresses:

Nayem Rahman
IT Business Intelligence (BI)
Intel Corporation
Mail Stop: AL3-85
5200 NE Elam Young Pkwy
Hillsboro, OR 97124-6497, USA
e-mail: [email protected]

Jessica Marz
Intel Software Quality
Intel Corporation
RNB – Robert Noyce Building
2200 Mission College Blvd
Santa Clara, CA 95054, USA
e-mail: [email protected]

Shameem Akhter
Department of Computer Science
Western Oregon University
345 N. Monmouth Ave.
Monmouth, OR 97361, USA
e-mail: [email protected]

NAYEM RAHMAN is a Senior Application Developer in IT Business Intelligence (BI), Intel Corporation. He has implemented several large projects using data warehousing technology for Intel's mission-critical enterprise DSS platforms and solutions. He holds an MBA in Management Information Systems (MIS), Project Management, and Marketing from Wright State University, Ohio, USA. He is a Teradata Certified Master. He is also an Oracle Certified Developer and DBA. His most recent publications appeared in the International Journal of Strategic Information Technology and Applications (IJSITA) and the International Journal of Technology Management & Sustainable Development (IJTMSD). His principal research areas are active data warehousing, changed data capture and management in temporal data warehouses, change management and process improvement for data warehousing projects, decision support systems, data mining for business analysts, and sustainability of information technology.

JESSICA MARZ has worked in Intel's IT division Enterprise Data Warehouse Engineering Group as both a Business Intelligence application developer and program manager. She holds a BA in English from UCLA, an MBA from San Jose State University, and a JD from Santa Clara University.

SHAMEEM AKHTER has a Master of Social Science (MSS) degree in Economics from the University of Dhaka, Bangladesh. Currently, she is pursuing her MS in Management and Information Systems at Western Oregon University, USA. Her most recent publications on IT sustainability and data warehousing appeared in the International Journal of Technology Management & Sustainable Development (IJTMSD) and the International Journal of Business Intelligence Research (IJBIR), respectively. Her research interests include database systems, data warehousing, decision support systems, and information systems implementation.