ETL DW-RealTime

DESCRIPTION

Presentation on the methods applied in the ETL process, with a closer look at the CDC methods used in ETL for real-time data warehouses.

Page 1: ETL DW-RealTime

1Adriano Patrick Cunha

ETL in DW Real-Time

Adriano Patrick do N. Cunha

Page 2: ETL DW-RealTime

Data Warehouse (DW)

Concepts

Page 3: ETL DW-RealTime

Concepts

Data Warehouse (DW)

“is a prominent approach to materialized data integration. Data of interest, scattered across multiple heterogeneous sources is integrated into a central database system.” (Jörg and Dessloch)

“provides information for analytical processing, decision making and data mining tools. A DW collects data from multiple heterogeneous operational source systems OLTP and stores summarized integrated business data in a central repository used by analytical applications OLAP” (Bernardino and Santos)

Page 4: ETL DW-RealTime

ETL – Extraction, Transformation and Loading

“Is a process extract the data from source system, transforms the data according to business rule, and loads results into the target data warehouse.”

Actions:

1) The identification of relevant information at the source side.

2) The extraction of this information.

3) The customization and integration of the information coming from multiple sources into common format.

4) The cleaning of the result data set on the basis of database and business rules.

5) The propagation of the data to the DW and DM.

(Kakish and Kraft)
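
The five actions above can be sketched as a tiny pipeline. This is only an illustration under stated assumptions: the CSV content, the `sales` table, the column names and the "amount is mandatory" rule are all hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source data; a real source would be an OLTP system or a file.
SOURCE_CSV = """id,name,gender,amount
1,Ana,1,10.50
2,Bruno,2,
3,Ana,1,10.50
"""

def extract(text):
    """Actions 1-2: read the relevant rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Actions 3-4: bring rows into a common format and clean them."""
    gender_map = {"1": "M", "2": "F"}  # assumed source coding
    cleaned = []
    for row in rows:
        if not row["amount"]:  # assumed business rule: amount is mandatory
            continue
        cleaned.append(
            (int(row["id"]), row["name"], gender_map[row["gender"]], float(row["amount"]))
        )
    return cleaned

def load(rows, conn):
    """Action 5: propagate the cleaned rows to the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(id INTEGER PRIMARY KEY, name TEXT, gender TEXT, amount REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO sales VALUES (?, ?, ?, ?)", rows)
    conn.commit()

def run_etl(conn):
    """Run extract, transform and load, then return the loaded rows."""
    load(transform(extract(SOURCE_CSV)), conn)
    return conn.execute("SELECT id, name, gender, amount FROM sales ORDER BY id").fetchall()
```

Calling `run_etl(sqlite3.connect(":memory:"))` loads only the rows that pass the cleaning rule; the row with the empty amount is dropped before loading.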

Concepts

Page 5: ETL DW-RealTime

Data Warehouse (DW) – Data Quality Dimensions

Completeness
Conformity
Consistency
Accuracy
Duplication
Integrity

Concepts

Page 6: ETL DW-RealTime

ETL Process

Extract

“Taking out the data from a variety of disparate source system correctly is often the most challenging aspect of ETL ...”

“The goal of the extraction phase is to convert the data into a single format which is appropriate for transformation process...”

Sources: relational DB, flat files, IMS, VSAM, ISAM, etc.

“Most of the time the data in source system is very complex, thus determining which data is relevant is very difficult...”

(Kakish and Kraft)

Page 7: ETL DW-RealTime

ETL Process

Extract

Logical Methods for extraction:

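Full extraction: the whole source is extracted, so there is no need to keep track of changes.

Incremental extraction: only changed data is extracted, which requires a CDC (change data capture) mechanism.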

Staging Area

Page 8: ETL DW-RealTime

ETL Process

Extract

Physical Methods for extraction:

Online extraction: connects directly to the source system and extracts the data in a preconfigured format.

Offline extraction: the extracted data is staged outside the source system.

Page 9: ETL DW-RealTime

ETL Process

Transform

Types of Transformation

1. Selecting only certain columns to load;

2. Translating coded values (e.g., the source stores 1 for male and 2 for female, but the DW stores M and F);

3. Encoding free-form values (mapping “Male” to “1”);

4. Deriving a new calculated value;

5. Sorting;

6. Joining data from multiple sources and removing duplicate data;

7. Aggregation;

8. Generating surrogate-key values;
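
A few of the transformations listed above can be sketched in plain Python. The field names (`customer_id`, `gender`, `quantity`, `unit_price`) and the gender coding are assumptions made for the illustration, not something defined in the slides.

```python
from itertools import count

GENDER_CODES = {1: "M", 2: "F"}  # translating coded values (assumed coding)

def transform(source_a, source_b):
    """Join two sources, drop duplicates, sort, derive a value, add a surrogate key."""
    merged = {}
    for row in source_a + source_b:
        # Joining sources and removing duplicates by natural key (first row wins).
        merged.setdefault(row["customer_id"], row)
    surrogate = count(1)  # generating surrogate-key values
    out = []
    for row in sorted(merged.values(), key=lambda r: r["customer_id"]):  # sorting
        out.append({
            "customer_sk": next(surrogate),
            "customer_id": row["customer_id"],             # selecting only certain columns
            "gender": GENDER_CODES[row["gender"]],         # translated code
            "total": row["quantity"] * row["unit_price"],  # derived calculated value
        })
    return out

def aggregate_by_gender(rows):
    """Aggregation: sum the derived total per gender."""
    totals = {}
    for row in rows:
        totals[row["gender"]] = totals.get(row["gender"], 0) + row["total"]
    return totals
```

Keeping the first row seen for each natural key is just one possible deduplication policy; a real pipeline would pick the policy from its business rules.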

Page 10: ETL DW-RealTime

ETL Process

Transform

Types of Transformation

1. Transposing or pivoting (turning multiple columns into multiple rows or vice versa);

2. Splitting a column into multiple columns;

3. Disaggregation of repeating columns into a separate detail table;

4. Looking up and validating the relevant data from tables or reference files for slowly changing dimensions; and

5. Applying any form of simple or complex data validation.
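
As a rough illustration of the pivoting, splitting and validation items above, here is a minimal sketch. The helper names (`unpivot`, `split_full_name`, `validate`) and the field names are hypothetical, and the validation rule is a deliberately simple stand-in.

```python
def unpivot(row, key, columns):
    """Turn several columns of one row into several (key, period, value) rows."""
    return [{key: row[key], "period": col, "value": row[col]} for col in columns]

def split_full_name(row):
    """Split one column (full_name) into two columns (first_name, last_name)."""
    first, _, last = row["full_name"].partition(" ")
    rest = {k: v for k, v in row.items() if k != "full_name"}
    return {**rest, "first_name": first, "last_name": last}

def validate(row):
    """Simple validation (assumed rule): the value must be present and non-negative."""
    return row["value"] is not None and row["value"] >= 0
```

For example, `unpivot({"id": 1, "q1": 10, "q2": 20}, "id", ["q1", "q2"])` produces one output row per quarter column.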

Page 11: ETL DW-RealTime

ETL Process

Load

Mechanisms to load include:

1. SQL*Loader: used to load flat files into the DW;

2. External tables: expose external data as a virtual table that can be queried and joined;

3. Oracle Call Interface (OCI): an API used when the transformation process is done outside the database;

4. Export/Import

Page 12: ETL DW-RealTime

Types of ETL

Page 13: ETL DW-RealTime

CDC - Change Data Capture

Snapshot Sources - The ETL extracts the source to a file and compares it with the previous version of the file to find what changed.

Logged Sources - Use change logs, usually written by triggers that record each change. The log may also be written by the business logic of the applications, or obtained with DBMS-specific utilities such as database log scraping or log sniffing, which read the logged transactions.

Timestamped Sources - The tables have audit attributes that indicate when each record was created or changed.
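
A timestamped source can be polled as sketched below. This is a minimal sketch, assuming a hypothetical `customer` table with an `updated_at` audit column stored as ISO 8601 text, and using an in-memory SQLite database in place of the real source.

```python
import sqlite3

def extract_changes(conn, last_run):
    """Return rows created or changed after last_run, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, name, updated_at FROM customer "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_run,),
    ).fetchall()
    # The watermark advances to the latest change seen, or stays put if none.
    new_watermark = rows[-1][2] if rows else last_run
    return rows, new_watermark

def demo():
    """Build a small hypothetical source and extract changes after a watermark."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT)")
    conn.executemany("INSERT INTO customer VALUES (?, ?, ?)", [
        (1, "Ana", "2024-01-01T10:00:00"),
        (2, "Bruno", "2024-01-02T09:30:00"),
        (3, "Carla", "2024-01-03T08:15:00"),
    ])
    return extract_changes(conn, "2024-01-01T23:59:59")
```

One limitation of this approach is that physically deleted rows leave no timestamp behind, so deletions are not captured unless the source uses soft deletes.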

Page 14: ETL DW-RealTime

CDC - Change Data Capture

Snapshot Sources
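
The snapshot comparison can be sketched as a diff between two extracts keyed by primary key. The function name and the dictionary representation of a snapshot are assumptions for the illustration; a real implementation would read the two files from disk.

```python
def diff_snapshots(previous, current):
    """Compare two snapshots (dicts keyed by primary key) and classify the changes."""
    inserted = {k: v for k, v in current.items() if k not in previous}
    deleted = {k: v for k, v in previous.items() if k not in current}
    updated = {
        k: v for k, v in current.items()
        if k in previous and previous[k] != v
    }
    return inserted, updated, deleted
```

Unlike the timestamp approach, comparing snapshots also reveals deleted records, at the cost of extracting and comparing the full source on every run.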

Page 15: ETL DW-RealTime

CDC - Change Data Capture

Logged Sources
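
A trigger-based change log can be sketched with SQLite, which stands in here for the source DBMS. The `customer` and `customer_log` tables, the trigger names and the `I`/`U`/`D` operation codes are all hypothetical choices for the example.

```python
import sqlite3

SCHEMA = """
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE customer_log (
    seq INTEGER PRIMARY KEY AUTOINCREMENT,  -- preserves the order of changes
    op TEXT, id INTEGER, name TEXT
);
CREATE TRIGGER customer_ins AFTER INSERT ON customer BEGIN
    INSERT INTO customer_log (op, id, name) VALUES ('I', NEW.id, NEW.name);
END;
CREATE TRIGGER customer_upd AFTER UPDATE ON customer BEGIN
    INSERT INTO customer_log (op, id, name) VALUES ('U', NEW.id, NEW.name);
END;
CREATE TRIGGER customer_del AFTER DELETE ON customer BEGIN
    INSERT INTO customer_log (op, id, name) VALUES ('D', OLD.id, OLD.name);
END;
"""

def demo():
    """Apply an insert, an update and a delete, then read back the change log."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO customer VALUES (1, 'Ana')")
    conn.execute("UPDATE customer SET name = 'Ana M.' WHERE id = 1")
    conn.execute("DELETE FROM customer WHERE id = 1")
    return conn.execute("SELECT op, id, name FROM customer_log ORDER BY seq").fetchall()
```

The ETL process would then read `customer_log` in `seq` order and apply each change to the warehouse, remembering the last `seq` it processed.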

Page 16: ETL DW-RealTime

Bibliography

Near real-time data warehousing using state-of-the-art ETL tools. Thomas Jörg, Stefan Dessloch (2010). Lecture Notes in Business Information Processing, 41 LNBIP.

Real-time data warehouse loading methodology. Ricardo Jorge Santos, Jorge Bernardino (2008). Proceedings of the 2008 international symposium on Database engineering & applications - IDEAS '08. http://portal.acm.org/citation.cfm?doid=1451940.1451949

Near real-time data warehousing with multi-stage trickle and flip. Janis Zuters (2011). Lecture Notes in Business Information Processing, 90 LNBIP.

A Triggering and scheduling approach for ETL in a real-time data warehouse. Jie Song, Yubin Bao, Jingang Shi (2010). Proceedings - 10th IEEE International Conference on Computer and Information Technology, CIT-2010, 7th IEEE International Conference on Embedded Software and Systems, ICESS-2010, ScalCom-2010.

Creating a Real Time Data Warehouse. Joseph Guerra, David A Andrews (2011). Andrews Consulting Group.

ETL Evolution for Real-Time Data Warehousing. Kamal Kakish, Theresa A Kraft (2012). Proceedings of the Conference on Information Systems Applied Research, p. 1-12. www.aitp-edsig.org

Page 17: ETL DW-RealTime

All text and image content in this document is licensed under the Creative Commons Attribution-Share Alike 3.0 License (unless otherwise specified). "LibreOffice" and "The Document Foundation" are registered trademarks. Their respective logos and icons are subject to international copyright laws. The use of these therefore is subject to the trademark policy.

Thank you …
