
    CH#3, By: Babu Ram Dawadi

    3. Data Warehouse Architecture

    3.1 System Process

    3.2 Process Architecture


System Process

Data warehouses are built to support large data volumes (databases above 100 GB) cost-effectively.

A data warehouse must be architected to support three major driving factors:
- Populating the warehouse
- Day-to-day management of the warehouse
- The ability to cope with requirements evolution

The processes required to populate the warehouse focus on extracting the data, cleaning it up, and making it available for analysis.


Typical Process Flow

Before we create an architecture for a data warehouse, we must first understand the major processes that constitute a data warehouse. The processes are:
- Extract and load the data.
- Clean and transform the data into a form that can cope with large volumes and provide good query performance.
- Back up and archive data.
- Manage queries, and direct them to the appropriate data sources.


Process Flow Within a DW

[Diagram: Source → Extract & Load → Data Warehouse → Query → Users, with data transformation and movement happening inside the warehouse.]


Extract & Load Process

Data extraction takes data from source systems and makes it available to the data warehouse. Data load takes the extracted data and loads it into the DW.

When we extract data from the physical database, whatever form it is held in, the original information content will have been modified and extended over the years. Before loading the data into the DW, the information content must be reconstructed.

The DW extract & load process must take data and add context and meaning, in order to convert it into value-adding business information.


Process Flow: Extract & Load

Process controlling
The mechanisms that determine when to start extracting the data, run the transformations and consistency checks, and so on are very important. To ensure that the various tools, logic modules and programs are executed in the correct sequence and at the correct time, a controlling mechanism is required to fire each module when appropriate.

Initiate extraction
Data should be in a consistent state when it is extracted from the source system. The information in a data warehouse represents a snapshot of corporate information, so that the user is looking at a single, consistent version of the truth.

Guideline: start extracting data from a data source only when it represents the same snapshot of time as all the other data sources.


Process Flow: Extract & Load

[Diagram: extraction of data from the source systems.]


Process Flow: Extract & Load

Loading the data
Once the data is extracted from the source systems, it is typically loaded into a temporary data store so that it can be cleaned up and made consistent.

Guideline: do not execute consistency checks until all the data sources have been loaded into the temporary data store.

Once the data in the temporary data store has been cleaned up, it is transformed into the warehouse by the warehouse manager.


Process Flow: Clean & Transform

This is the system process that takes the loaded data and structures it for query performance and for minimizing operational costs.

The process steps for cleaning and transforming are:
- Clean and transform the loaded data into a structure that speeds up queries.
- Partition the data in order to speed up queries, optimize hardware performance and simplify the management of the DW (see the sketch after this list).
- Create aggregations to speed up the common queries.
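As a sketch of time-based partitioning (assumed names and PostgreSQL-style syntax; the columns follow the sale fact table shown later in this chapter), each month's rows can be held in their own table:

  -- One physical table per month: queries and maintenance touch
  -- only the periods they need.
  CREATE TABLE sale_1997_01 (
      orderId varchar(10),
      date    date CHECK (date >= DATE '1997-01-01'
                      AND date <  DATE '1997-02-01'),
      custId  int,
      prodId  varchar(10),
      storeId varchar(10),
      qty     int,
      amt     numeric
  );
  -- sale_1997_02, sale_1997_03, ... are created the same way.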


Process Flow: Clean & Transform

Data needs to be cleaned and checked in the following ways (a SQL sketch of such checks follows this list):
- Make sure data is consistent within itself.
- Make sure that data is consistent with other data within the same source.
- Make sure data is consistent with data in the other source systems.
- Make sure data is consistent with the information already in the DW.
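These checks map naturally onto SQL run against the temporary data store. A minimal sketch, assuming hypothetical staging tables stg_sale and stg_customer that mirror the fact and dimension tables used later in this chapter:

  -- Consistency within itself: no duplicate order keys.
  SELECT orderId, COUNT(*)
  FROM stg_sale
  GROUP BY orderId
  HAVING COUNT(*) > 1;

  -- Consistency across sources: every staged sale must reference
  -- a customer that was staged from the customer source.
  SELECT s.orderId
  FROM stg_sale s
  LEFT JOIN stg_customer c ON s.custId = c.custId
  WHERE c.custId IS NULL;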


Process Flow: Backup & Archive

As in operational systems, the data within the data warehouse is backed up regularly, in order to ensure that the DW can always be recovered from data loss, software failure or hardware failure.

In archiving, older data is removed from the system in a format that allows it to be quickly restored if required.


Process Flow: Query Management

The query management process is the system process that manages queries and speeds them up by directing each query to the most effective data source. The query management process may also be required to monitor the actual query profiles.

Unlike the other system processes, query management does not generally operate during the load of information into the DW.

The query management facilities are:

Directing queries
The query management process determines which table delivers the answer most effectively, by calculating which table would satisfy the query in the shortest space of time.


Process Flow: Query Management

Query management facilities (continued):

Maximizing system resources
Regardless of the processing power available to run the DW, it is all too possible for a single large query to soak up all system resources, affecting the performance of the entire system. The query management process must ensure that no single query can affect the overall system performance.

Query capture
Users are exploiting the information content of the DW, which implies that query profiles change on a regular basis over the life of a DW. At various points in time, such as the end of the week, these queries can be analyzed to capture new queries and the resulting impact on summary tables. Query capture is typically part of the query management process.


Process Architecture

The system processes describe the major processes that constitute a data warehouse. The process architecture outlines a complete data warehouse architecture that encompasses these processes. The complexity of each manager in a data warehouse will vary from DW to DW.


Three Data Warehouse Models

Enterprise warehouse
- Collects all of the information about subjects spanning the entire organization.

Data mart
- A subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart.
- Independent vs. dependent (fed directly from the warehouse) data marts.

Virtual warehouse
- A set of views over operational databases.
- Only some of the possible summary views may be materialized.


Process Architecture

Components of the DW architecture:
- Load Manager
- Warehouse Manager
- Query Manager
- Detailed Information
- Summary Information
- Metadata
- Data Marting


Process Architecture

[Diagram: operational data and external data feed the Load Manager; the Warehouse Manager maintains the detailed information, summary information and metadata; the Query Manager serves OLAP tools and users (data → information → decision).]


Process Arch: Load Manager

The load manager is the system component that performs all the operations necessary to support the extract and load process. This system may be constructed using a combination of off-the-shelf tools, C programs and shell scripts. The size and complexity of the load manager will vary from DW to DW, and the effort to develop it should be planned within the first production phase.

The architecture of the load manager is such that it performs the following operations:
- Extract the data from the source systems.
- Fast-load the extracted data into a temporary data store.
- Perform simple transformations into a structure similar to the one in the DW.

Process Arch: Load Manager

Load Manager Architecture

[Diagram: a controlling process drives stored procedures, a copy management tool and a fast loader, moving data from the source file structures into the temporary data store and on into the warehouse structure.]


Process Arch: Load Manager

Extract data from source
In order to get hold of the source data, it has to be transferred from the source systems and made available to the DW.

Fast load
Data should be loaded into the warehouse in the fastest possible time, in order to minimize the total load window. The speed at which the data is processed into the warehouse is affected by the kind of transformations that are taking place. In practice, it is more effective to load the data into a relational database prior to applying transformations and checks.

Simple transformation
Before or during the load there will be an opportunity to perform simple transformations on the data (both steps are sketched below).
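A minimal sketch of a fast load followed by a simple in-database transformation, assuming a PostgreSQL-style COPY command, a hypothetical staging table stg_sale and an illustrative file path:

  -- Bulk-load the extracted file into an index-free staging table,
  -- keeping the load window as small as possible.
  COPY stg_sale (orderId, date, custId, prodId, storeId, qty, amt)
  FROM '/data/extract/sale.csv' WITH (FORMAT csv);

  -- A simple transformation after the load: normalize codes before
  -- the warehouse manager takes over.
  UPDATE stg_sale
  SET prodId  = lower(trim(prodId)),
      storeId = lower(trim(storeId));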


Process Arch: Warehouse Manager

The warehouse manager is the system component that performs all the operations necessary to support the warehouse management process. This system is typically constructed using a combination of third-party systems management software, C programs and shell scripts.

The architecture of the warehouse manager is such that it performs the following operations:
- Analyze the data to perform consistency and referential integrity checks.
- Transform and merge the source data in the temporary data store into the DW.
- Generate denormalizations, if appropriate.
- Back up the data within the DW.


Process Arch: Warehouse Manager

Warehouse Manager Architecture

[Diagram: a controlling process drives stored procedures, SQL scripts and backup/recovery tools, moving data from the temporary data store into the warehouse structure and its schema.]

Guideline: do not load data directly into the DW tables until it has been cleaned up. Use temporary tables that emulate the structures within the DW.


Process Arch: Warehouse Manager

Create indexes & views
The warehouse manager has to create indexes against the information in the fact or dimension tables. With a large number of rows, the overhead of inserting each row into a table and its indexes can be higher than the overhead of recreating the indexes once the rows have been inserted. Therefore it is more effective to drop all indexes against a table prior to inserting a large number of rows.

The fact tables are large tables, so the warehouse manager creates views that combine a number of partitions into a single fact table. It is suggested that we create a few views, corresponding to meaningful periods of time within the business. (Both ideas are sketched below.)


Process Arch: Warehouse Manager

Generate the summaries
Summary information is necessary in any organization, because higher-level officers do not want to see the detailed information; the summary information helps them with decision making.

Summaries are generated automatically by the warehouse manager, i.e. generation is executed every time data is loaded. The actual generation of summaries is achieved through the use of embedded SQL, in either stored procedures (triggers) or C programs, with a command sequence such as:

Create table {} as select {.} from {.} where {..}
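For instance, a summary of sales by store and month might be generated with a statement like the following; this is a hypothetical filling-in of the template above, using the star schema from this chapter and PostgreSQL's date_trunc:

  -- Regenerated by the warehouse manager after each load.
  CREATE TABLE summary_store_month AS
  SELECT storeId,
         date_trunc('month', date) AS month,
         SUM(qty) AS total_qty,
         SUM(amt) AS total_amt
  FROM sale
  GROUP BY storeId, date_trunc('month', date);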


Process Arch: Query Manager

The query manager is the system component that performs all the operations necessary to support the query management process. The architecture of a query manager is such that it performs the following operations:
- Direct queries to the appropriate tables.
- Schedule the execution of user queries.

The query manager also stores query profiles to allow the warehouse manager to determine which indexes are appropriate.
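One common way to implement query direction, sketched below with illustrative names: users query a stable view, and the query manager repoints that view at whichever table currently answers fastest, so user SQL never has to change.

  -- Initially the view reads the detailed fact table.
  CREATE VIEW v_sales_by_store AS
      SELECT storeId, SUM(amt) AS total_amt
      FROM sale
      GROUP BY storeId;

  -- Once a matching summary table exists, redirect the view to it.
  CREATE OR REPLACE VIEW v_sales_by_store AS
      SELECT storeId, SUM(total_amt) AS total_amt
      FROM summary_store_month
      GROUP BY storeId;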


Process Arch: Query Manager

[Diagram: query redirection via C tools or RDBMS stored procedures (generated views), query management tools, and query scheduling via C tools or the RDBMS, operating over the metadata, detailed information and summary information.]


Process Arch: Detailed Information

This is the area of the data warehouse that stores all the detailed information in the starflake schema. Not all the detailed information need be held online the whole time: it can be aggregated to the next level of detail, with the detailed information then offloaded into the tape archive. If the business requirement for detailed information is weak or very specific, it may be possible to satisfy it by storing a rolling three-month detailed history.

Guideline: determine what business activities require detailed transaction information, in order to determine the level at which to retain detailed information in the DW. If the detailed information is being stored offline to minimize the disk storage requirements, make sure that the data has been extracted, cleaned up, and transformed into a starflake schema prior to archiving it.


Process Arch: Detailed Information

[Diagram: the overall architecture again — load manager, warehouse manager and query manager over detailed information, summary information and metadata — with the detailed information held in archived data, and OLAP tools consuming it (data → information → decision).]


Process Arch: Detailed Information

Detailed information can be managed under the following topics:
- Data warehouse schemas
- Fact data
- Dimension data
- Partitioning data


Star

customer:
  custId  name   address    city
  53      joe    10 main    sfo
  81      fred   12 main    sfo
  111     sally  80 willow  la

product:
  prodId  name  price
  p1      bolt  10
  p2      nut   5

store:
  storeId  city
  c1       nyc
  c2       sfo
  c3       la

sale:
  orderId  date    custId  prodId  storeId  qty  amt
  o100     1/7/97  53      p1      c1       1    12
  o102     2/7/97  53      p2      c1       2    11
  o105     3/8/97  111     p1      c3       5    50


Star Schema

sale (fact table): orderId, date, custId, prodId, storeId, qty, amt

Dimension tables:
- customer: custId, name, address, city
- product: prodId, name, price
- store: storeId, city
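Expressed as DDL, the star schema above might be sketched as follows; the column types are assumptions, while the keys follow the diagram:

  -- Dimension tables.
  CREATE TABLE customer (custId int PRIMARY KEY, name text, address text, city text);
  CREATE TABLE product  (prodId varchar(10) PRIMARY KEY, name text, price numeric);
  CREATE TABLE store    (storeId varchar(10) PRIMARY KEY, city text);

  -- Fact table: one row per sale, with a foreign key to each dimension.
  CREATE TABLE sale (
      orderId varchar(10) PRIMARY KEY,
      date    date,
      custId  int REFERENCES customer,
      prodId  varchar(10) REFERENCES product,
      storeId varchar(10) REFERENCES store,
      qty     int,
      amt     numeric
  );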


Cube

Fact table view:

  sale: prodId  storeId  amt
        p1      c1       12
        p2      c1       11
        p1      c3       50
        p2      c2       8

Multi-dimensional cube (dimensions = 2):

        c1   c2   c3
  p1    12        50
  p2    11   8
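The pivot from the fact table view to the two-dimensional cube is just an aggregation over the remaining dimensions. A minimal SQL sketch against the sale table above:

  -- Each (prodId, storeId) output row is one cell of the cube;
  -- empty cells are simply missing rows.
  SELECT prodId, storeId, SUM(amt) AS amt
  FROM sale
  GROUP BY prodId, storeId;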


A Sample Data Cube

[Figure: a 3-D data cube with dimensions product (TV, VCR, PC), date (1Qtr–4Qtr) and country (U.S.A., Canada, Mexico); aggregate cells along each face hold sums, e.g. the total annual sales of TVs in the U.S.A.]


Process Arch: Summary Information

Summary information is essentially a replication of information already in the data warehouse. The implications of summary data are that the data:
- Exists to speed up the performance of common queries.
- Increases operational cost.
- May have to be updated every time new data is loaded into the DW.
- May not have to be backed up, because it can be generated fresh from the detailed information.

The size of the data that needs to be scanned is an order of magnitude smaller, and this results in an order-of-magnitude improvement in the performance of the query. On the negative side there is an increase in operational cost, for creating and updating the summary tables on a daily basis.

Guideline 1: avoid creating more than 200 centralized summary tables on an ongoing basis.


Process Arch: Summary Information

Summary info (contd.)
Guideline 2: inform users that summary tables accessed infrequently will be dropped on an ongoing basis.

Metadata

Data Marting
A data mart is a subset of the information content of a DW that is stored in its own database, summarized or in detail. Data marting can improve query performance, simply by reducing the volume of data that needs to be scanned to satisfy a query. Data marts are created along functional or departmental lines, in order to exploit a natural break in the data (a sketch follows).
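As an illustration, a data mart along such a functional line can be built as a subset extraction from the warehouse. A minimal sketch, assuming a hypothetical marketing_mart schema and the store dimension from the star schema earlier; the city filter is purely illustrative:

  -- Copy the marketing group's slice of the fact table into the
  -- mart's own schema (assumes marketing_mart already exists).
  CREATE TABLE marketing_mart.sale AS
  SELECT s.*
  FROM sale s
  JOIN store st ON s.storeId = st.storeId
  WHERE st.city IN ('sfo', 'la');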


Multi-Tiered Architecture

[Figure: multi-tiered DW architecture — operational DBs and other sources are extracted, transformed, loaded and refreshed, under a monitor & integrator with metadata, into the data warehouse and data marts (the data storage tier, served by the warehouse server); an OLAP server/engine forms the middle tier; front-end tools for analysis, query/reports and data mining form the top tier.]