ch5_DW_ETL


    ETL

    1. Starter
    2. ETL Components
       2.1 Extraction
       2.2 Extraction Methods
           2.2.1 Logical Extraction Methods
           2.2.2 Physical Extraction Methods
       2.3 Transformation
       2.4 Loading
           2.4.1 Loading Mechanisms
       2.5 Meta Data
    3. ETL Design Considerations
    4. ETL Architectures
       4.1 Homogenous Architecture
       4.2 Heterogeneous Architecture
    5. ETL Development
       5.1 Identify and Map Data
       5.2 Identify Source Data
       5.3 Identify Target Data
       5.4 Map Source Data to Target Data

    DATA WAREHOUSING

    ETL is the process of copying data from one database to another, but it is not a physical implementation. The process is not a one-time event; it is an ongoing and recurring part of the data warehouse.


    1. Starter

    Companies know they have valuable data lying around throughout their networks that needs to be moved from one place to another, such as from one business application to another or to a data warehouse for analysis.

    The only problem is that the data lies in all sorts of heterogeneous systems, and therefore in all sorts of formats. For instance, a CRM system may define a customer in one way, while a back-end accounting system may define the same customer differently.

    To solve the problem, companies use extract, transform and load (ETL) software, which includes reading data from its source, cleaning it up and formatting it uniformly, and then writing it to the target repository to be exploited.

    During the ETL process, data is extracted from an OLTP database, transformed to match the data warehouse schema, and loaded into the data warehouse database. Many data warehouses also incorporate data from non-OLTP systems, such as text files, legacy systems, and spreadsheets; such data also requires extraction, transformation, and loading.

    In its simplest form, ETL is the process of copying data from one database to another. This simplicity is rarely found in data warehouse implementations. In reality, ETL is a complex combination of process and technology that consumes a significant portion of the data warehouse development effort and requires the skills of business analysts, database designers, and application developers.

    When defining ETL for a data warehouse, it is important to think of ETL as a process, not a physical implementation. ETL systems vary from data warehouse to data warehouse and even between department data marts within a data warehouse.

    The ETL process is not a one-time event, as new data is added to a data warehouse periodically. Typical periodicity may be monthly, weekly, daily, or even hourly, depending on the purpose of the data warehouse and the type of business it serves.

    Because ETL is an ongoing and recurring part of a data warehouse, ETL processes must be automated and operational procedures documented. ETL also changes and evolves as the data warehouse evolves, so ETL processes must be designed for ease of modification. A solid, well-designed, and documented ETL system is necessary for the success of a data warehouse project.

    2. ETL Components

    Regardless of how they are implemented, all ETL systems have a common purpose: they move data from one database to another. Generally, ETL systems move data from OLTP systems to a data warehouse, but they can also be used to move data from one data warehouse to another. An ETL system consists of four distinct functional components:


    1. Extraction
    2. Transformation
    3. Loading
    4. Meta data

    2.1 Extraction

    The ETL extraction component is responsible for extracting or pulling the data from the source system. During extraction, data may be removed from the source system or a copy made and the original data retained in the source system.

    It is common to move historical data that accumulates in an operational OLTP system to a data warehouse to maintain OLTP performance and efficiency. Legacy systems may require too much effort to implement such offload processes, so legacy data is often copied into the data warehouse, leaving the original data in place.

    Extracted data is loaded into the data warehouse staging area (a relational database usually separate from the data warehouse database), for manipulation by the remaining ETL processes.


    Data extraction is generally performed within the source system itself, especially if it is a relational database to which extraction procedures can easily be added. It is also possible for the extraction logic to exist in the data warehouse staging area and query the source system for data using ODBC, OLE DB, or other APIs. For legacy systems, the most common method of data extraction is for the legacy system to produce text files, although many newer systems offer direct query APIs or accommodate access through ODBC or OLE DB.

    Data extraction processes can be implemented using Transact-SQL stored procedures, Data Transformation Services (DTS) tasks, or custom applications developed in programming or scripting languages.
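    As an illustration (not tied to any particular tool), a minimal stored-procedure sketch in PL/SQL-style SQL, assuming a hypothetical customer_stage staging table and a source_db database link:

        CREATE OR REPLACE PROCEDURE extract_customers AS
        BEGIN
          -- Copy the source rows into the staging area;
          -- the originals stay in the source system.
          INSERT INTO customer_stage
          SELECT * FROM customer@source_db;
          COMMIT;
        END;
        /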

    2.2 Extraction Methods

    There are different methods available to extract the data from source databases. Two of the important methods are the logical extraction method and the physical extraction method. We will see both methods.

    2.2.1 Logical Extraction Methods

    Following are two kinds of logical extraction:

    [A] Full Extraction

    The data is extracted completely from the source system. Since this extraction reflects all the data currently available on the source system, there is no need to keep track of changes to the data source. An example of a full extraction may be an export file of a distinct table or a remote SQL statement scanning the complete source table.
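    For instance, a full extraction can be as simple as scanning the whole source table, here over a hypothetical source_db database link:

        -- Full extraction: read every row of the source table.
        SELECT * FROM orders@source_db;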

    [B] Incremental Extraction

    At a specific point in time, only the data that has changed since a well-defined event back in history will be extracted. This event may be the last time of extraction or a more complex business event like the last booking day of a financial period. To identify this delta change, there must be a possibility to identify all the changed information for a specific time event. This information can be provided either by the source data itself or by a change table where an appropriate additional mechanism keeps track of the changes besides the originating transactions.
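    A minimal sketch of an incremental extraction, assuming a hypothetical orders table with a last_modified timestamp and a bind variable holding the time of the previous extraction:

        -- Extract only the rows changed since the last successful extraction run.
        SELECT order_id, customer_id, order_total, last_modified
        FROM   orders
        WHERE  last_modified > :last_extract_time;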

    Many data warehouses do not use any change-capture techniques as part of the extraction process. Instead, entire tables from the source systems are extracted to the data warehouse or staging area, and these tables are compared with a previous extract from the source system to identify the changed data. This approach may not have a significant impact on the source systems, but it clearly can place a considerable burden on the data warehouse processes, particularly if the data volumes are large.


    2.2.2 Physical Extraction Methods

    Depending on the chosen logical extraction method and the capabilities and restrictions on the source side, the extracted data can be physically extracted by two mechanisms. The data can either be extracted online from the source system or from an offline structure. Thus there are the following methods of physical extraction.

    [A] Online Extraction

    The data is extracted directly from the source system itself. The extraction process can connect directly to the source system to access the source tables themselves, or to an intermediate system that stores the data in a preconfigured manner (for example, snapshot logs or change tables). Note that the intermediate system is not necessarily physically different from the source system.

    With online extractions, you need to consider whether the distributed transactions are using original source objects or prepared source objects.

    [B] Offline Extraction

    The data is not extracted directly from the source system but is staged explicitly outside the original source system. The data already has an existing structure (for example, redo logs) or was created by an extraction routine.

    2.3 Transformation

    Data transformation is the most complex and, in terms of processing time, the most costly part of the ETL process. Transformations can range from simple data conversions to extremely complex data scrubbing techniques.

    The data can be transformed using multistage data transformation, as follows.

    The data transformation logic for most data warehouses consists of multiple steps. For example, in transforming new records to be inserted into a customer table, there may be separate logical transformation steps to validate each dimension key.

    Consider the example of implementing each transformation as a separate SQL operation and creating a separate, temporary staging table (such as the tables new_sales_step1 and new_sales_step2) to store the incremental results for each step. This load-then-transform strategy also provides a natural checkpointing scheme for the entire transformation process, which enables the process to be more easily monitored and restarted. However, a disadvantage of multistaging is that the space and time requirements increase.
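    A minimal sketch of this pattern in SQL, assuming hypothetical tables new_sales, product_dim and sales_fact (all names are illustrative):

        -- Step 1: filter out source rows with a missing natural key.
        CREATE TABLE new_sales_step1 AS
        SELECT *
        FROM   new_sales
        WHERE  product_id IS NOT NULL;

        -- Step 2: look up the surrogate key for each product.
        CREATE TABLE new_sales_step2 AS
        SELECT s.sales_date, s.amount, p.product_key
        FROM   new_sales_step1 s
        JOIN   product_dim p ON p.product_id = s.product_id;

        -- Final step: load the fully transformed rows into the fact table.
        INSERT INTO sales_fact (sales_date, amount, product_key)
        SELECT sales_date, amount, product_key
        FROM   new_sales_step2;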

    It may also be possible to combine many simple logical transformations into a single SQL statement or a single PL/SQL procedure. Doing so may provide better performance than performing each step independently, but it may also introduce difficulties in modifying, adding, or dropping individual transformations, as well as recovering from failed transformations.


    Listed below are some basic examples that illustrate the types of transformations performed by this element:

    1. Data Validation

    Check that all rows in the fact table match rows in dimension tables to enforce data integrity.
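    A sketch of such a check, assuming a hypothetical sales_fact table whose product_key should always exist in product_dim:

        -- Report fact rows whose product_key has no matching dimension row.
        SELECT f.*
        FROM   sales_fact f
        LEFT JOIN product_dim d ON d.product_key = f.product_key
        WHERE  d.product_key IS NULL;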

    2. Data Accuracy

    Ensure that fields contain appropriate values, such as only "off" or "on" in a status field.

    3. Data Type Conversion

    Ensure that all values for a specified field are stored the same way in the data warehouse regardless of how they were stored in the source system. For example, if one source system stores "off" or "on" in its status field and another source system stores "0" or "1" in its status field, then a data type conversion transformation converts the content of one or both of the fields to a specified common value such as "off" or "on".
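    A sketch of such a conversion for the second source system, using hypothetical table and column names:

        -- Normalize a '0'/'1' status flag to the warehouse convention of 'off'/'on'.
        SELECT order_id,
               CASE status_flag WHEN '1' THEN 'on'
                                WHEN '0' THEN 'off'
               END AS status
        FROM   source2_orders;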

    4. Business Rule Application

    Ensure that the rules of the business are enforced on the data stored in the warehouse. For example, check that all customer records contain values for both the FirstName and LastName fields.
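    A sketch of enforcing that rule during transformation, assuming a hypothetical customer_stage staging table:

        -- Flag staged customer records that violate the naming rule.
        SELECT customer_id
        FROM   customer_stage
        WHERE  FirstName IS NULL
           OR  LastName IS NULL;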


    o Multiple databases scattered throughout different departments or organizations, with the data in each structured according to the idiosyncratic rules of that particular database.

    o Older systems that contain poorly documented or obsolete data.

    As the list suggests, data scrubbing is aimed at more than eliminating errors and redundancy. The goal is also to bring consistency to various data sets that may have been created with different, incompatible business rules. Without data scrubbing, those sets of data aren't very useful when they're merged into a warehouse that's supposed to feed business intelligence across an organization.

    In the early days of computing, most data scrubbing was done by hand. And when performed by bleary-eyed humans, the laborious task of finding and then fixing or eliminating incorrect, incomplete or duplicated records was costly - and it often led to the introduction of new errors.

    Now, specialized software tools use sophisticated algorithms to parse, standardize, correct, match and consolidate data. Their functions range from simple cleansing and enhancement of single sets of data to matching, correcting and consolidating database entries from different databases and file systems.

    Most of these tools are able to reference comprehensive data sets and use them to correct and enhance data. For example, customer data for a CRM application could be referenced and matched to additional customer information, such as household income and other demographic information.

    Companies that want to use specialized data cleansing tools can get them from several sources. Building the tools in-house was the most common choice among companies studied by Arlington, Mass.-based Cutter Consortium; of the surveyed companies that said they were using such tools, 31% said they were building them in-house.

    But companies that choose to buy data cleansing software have plenty of options. Oracle Corp., Ascential Software Corp. in Westboro, Mass., and Group 1 Software Inc. in Lanham, Md., led other vendors in the Cutter survey, with 8% of the market each. Other vendors, including PeopleSoft Inc. in Pleasanton, Calif., SAS Institute Inc. in Cary, N.C., and Informatica Corp. in Redwood City, Calif., were bunched a few percentage points behind. The major data warehouse and business-intelligence vendors also include data scrubbing functionality in their products.

    2.4 Loading

    The ETL loading component is responsible for loading transformed data into the data warehouse database. Data warehouses are usually updated periodically rather than continuously, and large numbers of records are often loaded to multiple tables in a single data load.

    The data warehouse is often taken offline during update operations so that data can be loaded faster and OLAP cubes can be updated to incorporate the new data.


    2.4.1. Loading Mechanisms

    You can use the following mechanisms for loading a warehouse:

    [A] SQL*Loader
    [B] External Tables
    [C] OCI and Direct-Path APIs
    [D] Export/Import

    Let us learn these mechanisms:

    [A] SQL*Loader

    Before any data transformations can occur within the database, the raw data must become accessible to the database. One approach is to load it into the database. The most common technique for transporting data is by way of flat files.

    SQL*Loader is used to move data from flat files into a data warehouse. During this data load, SQL*Loader can also be used to implement basic data transformations. When using direct-path SQL*Loader, basic data manipulation, such as datatype conversion and simple NULL handling, can be automatically resolved during the data load. Most data warehouses use direct-path loading for performance reasons.

    [B] External Tables

    Another approach for handling external data sources is using external tables. Oracle9i's external table feature enables you to use external data as a virtual table that can be queried and joined directly and in parallel, without requiring the external data to be first loaded into the database. You can then use SQL, PL/SQL, and Java to access the external data.

    The main difference between external tables and regular tables is that externally organized tables are read-only. No DML operations (UPDATE/INSERT/DELETE) are possible, and no indexes can be created on them.
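    A minimal sketch of an external table definition, assuming a directory object named data_dir and a comma-delimited flat file sales.dat (names and columns are illustrative):

        CREATE TABLE sales_ext (
          sale_id   NUMBER,
          sku       VARCHAR2(20),
          amount    NUMBER
        )
        ORGANIZATION EXTERNAL (
          TYPE ORACLE_LOADER
          DEFAULT DIRECTORY data_dir
          ACCESS PARAMETERS (
            RECORDS DELIMITED BY NEWLINE
            FIELDS TERMINATED BY ','
          )
          LOCATION ('sales.dat')
        )
        REJECT LIMIT UNLIMITED;

        -- The external data can now be queried like an ordinary (read-only) table:
        SELECT sku, SUM(amount) FROM sales_ext GROUP BY sku;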

    [C] OCI and Direct-Path APIs

    OCI and direct-path APIs are frequently used when the transformation and computation are done outside the database and there is no need for flat file staging.

    [D] Export/Import

    Export and import are used when the data is inserted as is into the target system. They are not suited to large volumes of data, and no complex extractions are possible.

    2.5 Meta Data

    The ETL meta data functional component is responsible for maintaining information (meta data) about the movement and transformation of data, and the operation of the data warehouse. It also documents the data mappings used during the transformations.


    Meta data logging provides possibilities for automated administration, trend prediction,and code reuse.

    Examples of data warehouse meta data that can be recorded and used to analyze the activity and performance of a data warehouse include:

    o Data Lineage, such as the time that a particular set of records was loaded into the data warehouse.

    o Schema Changes, such as changes to table definitions.

    o Data Type Usage, such as identifying all tables that use the "Birthdate" user-defined data type.

    o Transformation Statistics, such as the execution time of each stage of a transformation, the number of rows processed by the transformation, the last time the transformation was executed, and so on (see the sketch after this list).

    o DTS Package Versioning, which can be used to view, branch, or retrieve any historical version of a particular DTS package.

    o Data Warehouse Usage Statistics, such as query times for reports.
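    As an illustration of recording transformation statistics, a sketch that writes one row per ETL step into a hypothetical etl_run_log table (bind variables supplied by the ETL job):

        -- Record meta data about a single transformation run.
        INSERT INTO etl_run_log (step_name, run_started, run_finished, rows_processed)
        VALUES ('load_sales_fact', :run_started, :run_finished, :rows_processed);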

    3. ETL Design Considerations

    Regardless of their implementation, a number of design considerations are common to all ETL systems:

    [A] Modularity

    ETL systems should contain modular components that perform discrete tasks. This encourages reuse and makes them easy to modify when implementing changes in response to business and data warehouse changes. Monolithic systems should be avoided.

    [B] Consistency

    ETL systems should guarantee consistency of data when it is loaded into the data warehouse. An entire data load should be treated as a single logical transaction: either the entire data load is successful or the entire load is rolled back. In some systems, the load is a single physical transaction, whereas in others it is a series of transactions. Regardless of the physical implementation, the data load should be treated as a single logical transaction.
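    A minimal sketch of treating the load as one logical transaction, in PL/SQL-style SQL and reusing the hypothetical staging and log tables from the earlier examples:

        DECLARE
          v_rows NUMBER;
        BEGIN
          INSERT INTO sales_fact (sales_date, amount, product_key)
          SELECT sales_date, amount, product_key FROM new_sales_step2;
          v_rows := SQL%ROWCOUNT;

          INSERT INTO etl_run_log (step_name, rows_processed)
          VALUES ('load_sales_fact', v_rows);

          COMMIT;        -- the entire load succeeds as one unit
        EXCEPTION
          WHEN OTHERS THEN
            ROLLBACK;    -- or the entire load is rolled back
            RAISE;
        END;
        /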

    [C] Flexibility

    ETL systems should be developed to meet the needs of the data warehouse and to accommodate the source data environments. It may be appropriate to accomplish some transformations in text files and some on the source data system; others may require the development of custom applications. A variety of technologies and techniques can be applied, using the tool most appropriate to the individual task of each ETL functional element.

    [D] Speed

    ETL systems should be as fast as possible. Ultimately, the time window available for ETL processing is governed by data warehouse and source system schedules. Some data warehouse elements may have a huge processing window (days), while others may have a very limited processing window (hours). Regardless of the time available, it is important that the ETL system execute as rapidly as possible.

    [E] Heterogeneity

    ETL systems should be able to work with a wide variety of data in different formats. An ETL system that only works with a single type of source data is useless.

    [F] Meta Data Management

    ETL systems are arguably the single most important source of meta data about both the data in the data warehouse and the data in the source system. Finally, the ETL process itself generates useful meta data that should be retained and analyzed regularly.

    4. ETL Architectures

    It is important to understand the different ETL architectures and how they relate to each other. Essentially, ETL systems can be classified in two architectures: the homogenous architecture and the heterogeneous architecture.

    4.1 Homogenous Architecture

    A homogenous architecture for an ETL system is one that involves only a single source system and a single target system. Data flows from the single source of data through the ETL processes and is loaded into the data warehouse, as shown in the following diagram.

    [Diagram: Operational data -> ETL System -> Data Warehouse]


    Most homogenous ETL architectures have the following characteristics:

    o Single data source: Data is extracted from a single source system, such as an OLTP system.

    o Rapid development: The development effort required to extract the data is straightforward because there is only one data format for each record type.

    o Light data transformation: No data transformations are required to achieve consistency among disparate data formats, and the incoming data is often in a format usable in the data warehouse. Transformations in this architecture typically involve replacing NULLs and other formatting transformations.

    o Light structural transformation: Because the data comes from a single source, the amount of structural change, such as table alteration, is also very light. The structural changes typically involve denormalization efforts to meet data warehouse schema requirements.

    o Simple research requirements: The research efforts to locate data are generally simple: if the data is in the source system, it can be used. If it is not, it cannot.

    The homogeneous ETL architecture is generally applicable to data marts, especially those focused on a single subject matter.

    4.2 Heterogeneous Architecture

    A heterogeneous architecture for an ETL system is one that extracts data from multiple sources, as shown in the following diagram.

    The complexity of this architecture arises from the fact that data from more than one source must be merged, rather than from the fact that data may be formatted differently in the different sources. However, significantly different storage formats and database schemas do provide additional complications.

    Most heterogeneous ETL architectures have the following characteristics:

    [Diagram: Multiple operational data sources -> ETL System -> Data Warehouse]


    o Multiple data sources.

    o More complex development: The development effort required to extract the data is increased because there are multiple source data formats for each record type.

    o Significant data transformation: Data transformations are required to achieve consistency among disparate data formats, and the incoming data is often not in a format usable in the data warehouse. Transformations in this architecture typically involve replacing NULLs, additional data formatting, data conversions, lookups, computations, and referential integrity verification. Pre-computed calculations may require combining data from multiple sources, or data that has multiple degrees of granularity, such as allocating shipping costs to individual line items.

    o Significant structural transformation: Because the data comes from multiple sources, the amount of structural change, such as table alteration, is significant.

    o Substantial research requirements to identify and match data elements.

    Heterogeneous ETL architectures are found more often in data warehouses than in data marts.

    5. ETL Development

    ETL development consists of two general phases: identifying and mapping data, and developing functional element implementations. Both phases should be carefully documented and stored in a central, easily accessible location, preferably in electronic form.

    5.1 Identify and Map Data

    This phase of the development process identifies sources of data elements, the targets for those data elements in the data warehouse, and the transformations that must be applied to each data element as it is migrated from its source to its destination. High-level data maps should be developed during the requirements gathering and data modeling phases of the data warehouse project. During the ETL system design and development process, these high-level data maps are extended to thoroughly specify system details.

    5.2 Identify Source Data

    For some systems, identifying the source data may be as simple as identifying the server where the data is stored in an OLTP database and the storage type (SQL Server database, Microsoft Excel spreadsheet, or text file, among others).

    In other systems, identifying the source may mean preparing a detailed definition of the meaning of the data, such as a business rule, a definition of the data itself, such as decoding rules (O = On, for example), or even detailed documentation of a source system for which the system documentation has been lost or is not current.


    5.3 Identify Target Data

    Each data element is destined for a target in the data warehouse. A target for a data element may be an attribute in a dimension table, a numeric measure in a fact table, or a summarized total in an aggregation table. There may not be a one-to-one correspondence between a source data element and a data element in the data warehouse, because the destination system may not contain the data at the same granularity as the source system. For example, a retail client may decide to roll data up to the SKU level by day rather than track individual line item data. The level of item detail that is stored in the fact table of the data warehouse is called the grain of the data. If the grain of the target does not match the grain of the source, the data must be summarized as it moves from the source to the target.
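    A sketch of that summarization, rolling hypothetical line-item rows up to the daily SKU grain (Oracle-style SQL, illustrative names):

        -- Summarize individual line items into one row per SKU per day.
        INSERT INTO daily_sales_fact (sku, sales_date, quantity_sold, sales_amount)
        SELECT sku,
               TRUNC(sale_timestamp) AS sales_date,
               SUM(quantity)         AS quantity_sold,
               SUM(line_amount)      AS sales_amount
        FROM   line_item_stage
        GROUP BY sku, TRUNC(sale_timestamp);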

    5.4 Map Source Data to Target Data

    A data map defines the source fields of the data, the destination fields in the data warehouse, and any data modifications that need to be accomplished to transform the data into the desired format for the data warehouse. Some transformations require aggregating the source data to a coarser granularity, such as summarizing individual item sales into daily sales by SKU. Other transformations involve altering the source data itself as it moves from the source to the target. Some transformations decode data into human-readable form, such as replacing "1" with "on" and "0" with "off" in a status field. If two source systems encode data destined for the same target differently (for example, a second source system uses Yes and No for status), a separate transformation for each source system must be defined. Transformations must be documented and maintained in the data maps. The relationship between the source and target systems is maintained in a map that is referenced to execute the transformation of the data before it is loaded in the data warehouse.
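    A sketch of the last point, with one decode transformation per source system feeding a hypothetical order_fact target (illustrative names):

        -- Source system 1 stores the status as '0'/'1'.
        INSERT INTO order_fact (order_id, status)
        SELECT order_id,
               CASE status WHEN '1' THEN 'on' ELSE 'off' END
        FROM   src1_orders;

        -- Source system 2 stores the status as 'Yes'/'No'.
        INSERT INTO order_fact (order_id, status)
        SELECT order_id,
               CASE status WHEN 'Yes' THEN 'on' ELSE 'off' END
        FROM   src2_orders;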
