Ch6 DW Maintain


  • 8/14/2019 Ch6 DW Maintain

    DATA WAREHOUSE GROWTH

    1. Starter
    2. How to Create a Logical Design?
    3. Physical Design in Data Warehouses
    4. Physical Design Structures
        4.1 Tablespaces
        4.2 Tables and Partitioned Tables
        4.3 Views
        4.4 Integrity Constraints
        4.5 Indexes and Partitioned Indexes
        4.6 Dimensions
    5. Deployment
        5.1 Stage 1: Reporting
        5.2 Stage 2: Analysis
        5.3 Stage 3: Prediction
        5.4 Stage 4: Operationalize
        5.5 Stage 5: Activate
    6. Managing Data Warehouses

    DATA WAREHOUSING

    The goal is to enable users to make informed decisions rapidly so their companies can respond to change and remain competitive.

    SUSHIL KULKARNI

    1. Starter

    When a corporation decides to build a data warehouse, it first defines the business requirements and the scope of the application and creates a conceptual design. The next step is to translate the requirements into a system deliverable. To do this, one creates the logical and physical designs for the data warehouse and defines:

    o The specific data content
    o Relationships within and between groups of data
    o The system environment supporting your data warehouse
    o The data transformations required
    o The frequency with which data is refreshed

    The logical design is conceptual and abstract and defines the logical relationships among the objects. The physical design, on the other hand, covers the most effective way of storing and retrieving the objects, as well as handling them from a transportation and backup/recovery perspective.

    2. How to Create a Logical Design?

    A logical design is conceptual and abstract. It deals with defining the types of information that we need.

    The entity-relationship modeling technique can be used to model the corporation's logical information requirements. Entity-relationship modeling requires identifying the things of importance (entities), the properties of these things (attributes), and how they are related to one another (relationships).

    The process of logical design involves arranging data into a series of logical relationships called entities and attributes. An entity represents a chunk of information. In relational databases, an entity often maps to a table. An attribute is a component of an entity that helps define the uniqueness of the entity. In relational databases, an attribute maps to a column.

    For a consistent database, we need unique identifiers. A unique identifier is added to tables so that one can differentiate between the same item when it appears in different places. In a physical design, this is usually a primary key.

    Entity-relationship diagrams are usually associated with normalized models, such as those of OLTP applications. The technique is also useful for data warehouse design, in the form of dimensional modeling.

    In dimensional modeling, instead of seeking to discover atomic units of information (such as entities and attributes) and all of the relationships between them, one has to identify which information belongs to a central fact table and which information belongs to its associated dimension tables.
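    As a minimal sketch of this split, the following Python snippet uses the standard-library sqlite3 module to build a tiny star schema. The table names (sales_fact, product_dim, date_dim), the columns, and the sample rows are hypothetical and chosen only for illustration; a real warehouse would hold far more dimensions and measures.

```python
import sqlite3

# Hypothetical star schema: one central fact table and two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product_dim (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE date_dim (date_id INTEGER PRIMARY KEY, month TEXT, year INTEGER);
    CREATE TABLE sales_fact (
        product_id INTEGER REFERENCES product_dim(product_id),
        date_id INTEGER REFERENCES date_dim(date_id),
        amount REAL
    );
    INSERT INTO product_dim VALUES (1, 'Pen', 'Stationery'), (2, 'Lamp', 'Home');
    INSERT INTO date_dim VALUES (10, '2019-07', 2019), (11, '2019-08', 2019);
    INSERT INTO sales_fact VALUES (1, 10, 5.0), (1, 11, 7.5), (2, 11, 40.0);
""")

def sales_by_category(connection):
    """Join the fact table to a dimension and total the measure per category."""
    rows = connection.execute("""
        SELECT p.category, SUM(f.amount)
        FROM sales_fact f JOIN product_dim p ON p.product_id = f.product_id
        GROUP BY p.category ORDER BY p.category
    """).fetchall()
    return dict(rows)

totals = sales_by_category(conn)
```

    The fact table holds only keys and the numeric measure; everything descriptive lives in the dimension tables and is reached through a join.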


    Thus the logical design can be made using pen and paper and gives

    (1) a set of entities and attributes corresponding to fact tables and dimension tables
    (2) a model of operational data from source into subject-oriented information in the target data warehouse schema.

    3. Physical Design in Data Warehouses

    We now turn to the physical design of a data warehouse environment, moving from the logical design to the physical design.

    We have seen that the logical design is something you draw with pen and paper before building the data warehouse. The physical design is the creation of the database with SQL statements.

    During the physical design process, you convert the data gathered during the logical design phase into a description of the physical database structure. Physical design decisions are mainly driven by query performance and database maintenance aspects.

    During the logical design phase, the model of the data warehouse consists of entities, attributes, and relationships. The entities are linked together using relationships. Attributes are used to describe the entities. The unique identifier (UID) distinguishes one instance of an entity from another.

    The following figure shows the different ways of thinking about logical and physical designs.

    During the physical design process, the expected schema is translated into actual database structures by mapping

    o Entities to tables
    o Relationships to foreign key constraints

    [Figure: Logical design (entities, relationships, attributes, unique identifiers) versus physical design (tables, columns, indexes, materialized views, dimensions, and integrity constraints: primary key, foreign key, NOT NULL)]
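    The mapping from logical to physical design can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a design tool: the LOGICAL_MODEL dictionary, its entity names, and the to_ddl helper are all hypothetical, and SQLite (via the standard sqlite3 module) stands in for whatever database the warehouse actually uses.

```python
import sqlite3

# Hypothetical logical model: each entity lists its attributes, its unique
# identifier (UID), and its relationships to other entities.
LOGICAL_MODEL = {
    "customer": {
        "attributes": ["customer_id", "name"],
        "uid": "customer_id",
        "relationships": {},
    },
    "orders": {
        "attributes": ["order_id", "customer_id", "total"],
        "uid": "order_id",
        "relationships": {"customer_id": "customer"},
    },
}

def to_ddl(entity, spec, model):
    """Map one entity to a CREATE TABLE statement.

    Attributes become columns, the UID becomes the primary key, and each
    relationship becomes a foreign key constraint.
    """
    parts = []
    for attr in spec["attributes"]:
        parts.append(attr + (" PRIMARY KEY" if attr == spec["uid"] else ""))
    for column, target in spec["relationships"].items():
        parts.append(
            f"FOREIGN KEY ({column}) REFERENCES {target}({model[target]['uid']})"
        )
    return f"CREATE TABLE {entity} ({', '.join(parts)})"

statements = [to_ddl(name, spec, LOGICAL_MODEL) for name, spec in LOGICAL_MODEL.items()]

# Run the generated statements to confirm that they are valid SQL.
conn = sqlite3.connect(":memory:")
for statement in statements:
    conn.execute(statement)
```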


    4.2 Tables and Partitioned Tables

    Tables are the basic unit of data storage. They are the container for the expected amount of raw data in your data warehouse.

    Using partitioned tables, a large data volume can be decomposed into smaller and more manageable pieces. The main design criterion for partitioning is manageability, although partitioning usually brings performance benefits as well.

    For example, we can choose a partitioning strategy based on a sales transaction date and a monthly granularity. If we have four years' worth of data, then we can delete a month's data as it becomes older than four years with a single, quick DDL statement and load new data while only affecting 1/48th of the complete table. Business questions regarding the last quarter will only affect three months, which is equivalent to three partitions, or 3/48ths of the total volume.
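    The arithmetic in this example can be checked with a short Python sketch. It models each monthly partition as a dictionary entry; the starting year 2015 and the roll_window helper are assumptions made for illustration, and dropping a dictionary key merely stands in for the DDL statement that drops a partition.

```python
from collections import OrderedDict
from fractions import Fraction

def month_keys(start_year, years):
    """Return 'YYYY-MM' labels for every month in the given span."""
    return [f"{y}-{m:02d}" for y in range(start_year, start_year + years)
            for m in range(1, 13)]

# Each key stands for one monthly partition; the values are placeholder row lists.
table = OrderedDict((key, []) for key in month_keys(2015, 4))

def roll_window(partitions, new_key):
    """Drop the oldest partition and add an empty one for the new month."""
    oldest, _ = partitions.popitem(last=False)
    partitions[new_key] = []
    return oldest

share_per_month = Fraction(1, len(table))   # share touched by one monthly load
last_quarter_share = 3 * share_per_month    # share read by a last-quarter query
dropped = roll_window(table, "2019-01")     # the window still holds 48 months
```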

    Partitioning large tables improves performance because each partitioned piece is smaller and more manageable. A data warehouse is typically partitioned on transaction dates; for example, each month's worth of data can be assigned its own partition.

    4.3 Views

    A view is a presentation of the data contained in one or more tables or other views. A view takes the output of a query and treats it as a table. Views do not require any space in the database.
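    A small sketch makes the point that a view stores a query rather than data. It uses Python's standard sqlite3 module; the sales table and the sales_by_region view are hypothetical names invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount REAL);
    INSERT INTO sales VALUES ('East', 10.0), ('West', 4.0), ('East', 6.0);
    -- The view stores only the query text, not a copy of the rows.
    CREATE VIEW sales_by_region AS
        SELECT region, SUM(amount) AS total FROM sales GROUP BY region;
""")

query = "SELECT total FROM sales_by_region WHERE region = 'West'"
before = conn.execute(query).fetchone()[0]

# A new row in the base table is visible through the view immediately,
# because the view is re-evaluated on every query.
conn.execute("INSERT INTO sales VALUES ('West', 1.0)")
after = conn.execute(query).fetchone()[0]
```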

    4.4 Integrity Constraints

    Integrity constraints are used to enforce business rules associated with your database and to prevent invalid information from entering the tables. Integrity constraints in data warehousing differ from constraints in OLTP environments. In OLTP environments, they primarily prevent the insertion of invalid data into a record, which is not a big problem in data warehousing environments because accuracy has already been guaranteed. In data warehousing environments, constraints are only used for query rewrite. NOT NULL constraints are particularly common in data warehouses. Under some specific circumstances, constraints need space in the database, in the form of the underlying unique index.
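    The following sketch declares NOT NULL and foreign key constraints with Python's standard sqlite3 module. It shows the enforcing behaviour described above for OLTP environments; whether a warehouse enforces its constraints or only declares them for the optimizer depends on the database product. The table names and the try_insert helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled
conn.executescript("""
    CREATE TABLE product_dim (product_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE sales_fact (
        product_id INTEGER NOT NULL REFERENCES product_dim(product_id),
        amount REAL NOT NULL
    );
    INSERT INTO product_dim VALUES (1, 'Pen');
""")

def try_insert(connection, params):
    """Attempt one fact-table insert and report whether the constraints allowed it."""
    try:
        connection.execute("INSERT INTO sales_fact VALUES (?, ?)", params)
        return "accepted"
    except sqlite3.IntegrityError:
        return "rejected"

results = [
    try_insert(conn, (1, 9.5)),    # valid row
    try_insert(conn, (1, None)),   # violates NOT NULL on amount
    try_insert(conn, (99, 3.0)),   # violates the foreign key: no product 99
]
```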

    4.5 Indexes and Partitioned Indexes

    Indexes are optional structures associated with tables or clusters. In addition to the classical B-tree indexes, bitmap indexes are very common in data warehousing environments.

    Bitmap indexes are optimized index structures for set-oriented operations. Additionally, they are necessary for some optimized data access methods such as star transformations.
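    To see why bitmap indexes suit set-oriented operations, consider the following toy model in Python. It is only a sketch of the idea: real database engines compress their bitmaps and store them very differently. The column values and the helper names are hypothetical.

```python
from collections import defaultdict

def build_bitmap_index(values):
    """Return {distinct value: int bitmap}; bit i is set if row i holds that value."""
    index = defaultdict(int)
    for row_id, value in enumerate(values):
        index[value] |= 1 << row_id
    return dict(index)

def matching_rows(bitmap):
    """Decode a bitmap back into a sorted list of row numbers."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Hypothetical low-cardinality columns of a six-row table.
region = ["East", "West", "East", "North", "West", "East"]
status = ["open", "open", "closed", "open", "closed", "open"]

region_index = build_bitmap_index(region)
status_index = build_bitmap_index(status)

# region = 'East' AND status = 'open' becomes a single bitwise AND.
east_open = matching_rows(region_index["East"] & status_index["open"])
# region = 'West' OR region = 'North' becomes a single bitwise OR.
west_or_north = matching_rows(region_index["West"] | region_index["North"])
```

    Combining predicates costs one bitwise operation per pair of bitmaps, regardless of how many rows match, which is what makes the structure attractive for queries that filter on several low-cardinality columns at once.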


    Indexes are just like tables in that we can partition them, although the partitioning strategy is not dependent upon the table structure. Partitioning indexes makes it easier to manage the warehouse during refresh and improves query performance.

    4.6 Dimensions

    A dimension is a schema object that defines hierarchical relationships between attributes or attribute sets. A hierarchical relationship is a functional dependency from one level of a hierarchy to the next one. A dimension is a container of logical relationships and does not require any space in the database. A typical dimension is city, state, region, and country.
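    The functional dependency behind such a hierarchy can be tested directly. The Python sketch below checks that each level determines the next one up; the geography rows and the is_functional_dependency helper are hypothetical, and the region labels are made up for the example.

```python
def is_functional_dependency(rows, child, parent):
    """True if every value of `child` maps to exactly one value of `parent`."""
    seen = {}
    for row in rows:
        if seen.setdefault(row[child], row[parent]) != row[parent]:
            return False
    return True

# Hypothetical rows of a geography dimension.
geography = [
    {"city": "Mumbai", "state": "Maharashtra", "region": "West", "country": "India"},
    {"city": "Pune", "state": "Maharashtra", "region": "West", "country": "India"},
    {"city": "Chennai", "state": "Tamil Nadu", "region": "South", "country": "India"},
]

levels = ["city", "state", "region", "country"]
hierarchy_ok = all(
    is_functional_dependency(geography, lower, upper)
    for lower, upper in zip(levels, levels[1:])
)

# The reverse direction does not hold: one state contains several cities.
reverse_ok = is_functional_dependency(geography, "state", "city")
```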

    5. Deployment

    Data warehouse deployments increasingly support the execution of a business strategy in addition to its development. This evolution increases the set of service levels demanded of the data warehouse architect. We now consider the evolution of data warehousing through the five stages that are most common in the maturation of decision support within an organization.

    5.1 Stage 1: Reporting

    The initial stage of data warehouse deployment typically focuses on reporting from a single source of truth within an organization. The data warehouse brings huge value simply by integrating disparate sources of information within an organization into a single repository to drive decision-making across functional and/or product boundaries.

    For the most part, the questions in a reporting environment are known in advance. Thus, database structures can be optimized to deliver good performance even when queries require access to huge amounts of information.

    The biggest challenge in Stage 1 data warehouse deployment is data integration. The challenges in constructing a repository with consistent, cleansed data cannot be overstated. There can easily be hundreds of data sources in a legacy computing environment, each with a unique domain value standard and underlying implementation technology. The hard work that goes into providing well-integrated information for decision-makers becomes the foundation for all subsequent stages of data warehouse deployment.

    5.2 Stage 2: Analysis

    In a Stage 2 data warehouse deployment, decision-makers focus less on what happened and more on why it happened. Analysis activities are concerned with drilling down beneath the numbers on a report to slice and dice data at a detailed level.
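    Drilling down and slicing can be illustrated with a small in-memory example. The Python sketch below uses hypothetical detail rows and a hypothetical aggregate helper; in a real warehouse the same steps would be SQL queries or operations in an OLAP tool.

```python
from collections import defaultdict

# Hypothetical detail rows: (region, store, product, amount).
sales = [
    ("East", "Store 1", "Pen", 10),
    ("East", "Store 1", "Lamp", 40),
    ("East", "Store 2", "Pen", 5),
    ("West", "Store 3", "Pen", 20),
]
COLUMNS = ("region", "store", "product")

def aggregate(rows, group_by, where=None):
    """Sum the amount per combination of `group_by` columns, after an optional filter."""
    where = where or {}
    totals = defaultdict(int)
    for row in rows:
        record = dict(zip(COLUMNS, row))
        if all(record[col] == val for col, val in where.items()):
            totals[tuple(record[col] for col in group_by)] += row[3]
    return dict(totals)

# Report level: what happened in each region.
report = aggregate(sales, ["region"])
# Drill down: which stores make up the East total.
drill_down = aggregate(sales, ["store"], where={"region": "East"})
# Slice: the same regional view restricted to one product.
slice_pen = aggregate(sales, ["region"], where={"product": "Pen"})
```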

    Ad hoc analysis plays a big role in Stage 2 data warehouse implementations. Questions against the database cannot be known in advance. Performance management relies a lot more on advanced optimizer capability in the RDBMS because query structures are not as predictable as they are in a pure reporting environment.

    Performance is also a lot more important in a Stage 2 data warehouse implementation because the information repository is used much more interactively. Whereas reports are typically scheduled to run on a regular basis with business calendars as a driver for timing, ad hoc analysis is fundamentally a hands-on activity with iterative refinement of questions in an interactive environment. Business users require direct access to the data warehouse via GUI tools without the need for programmer intermediaries. Support for concurrent query execution and large numbers of users against the warehouse is typical of a Stage 2 implementation.

    Business users, however, are a very impatient bunch. Performance must provide response times measured in seconds or a small number of minutes for drill-downs in an OLAP (online analytical processing) environment. The database optimizer's ability to determine efficient access paths, using indexing and sophisticated join techniques, plays a critical role in allowing flexible access to information within acceptable response times.

    5.3 Stage 3: Prediction

    As an organization becomes well-entrenched in quantitative decision-making techniques and experiences the value proposition for understanding the "whats" and "whys" of its business dynamics, the next step is to leverage information for predictive purposes.

    Understanding what will happen next in the business has huge implications for proactively managing the strategy for an organization. Stage 3 data warehousing requires data mining tools for building predictive models using historical detail.

    The number of end users who will apply the advanced analytics involved in predictive modeling is relatively small. However, the workloads associated with model construction and scoring are intense. Model construction typically involves derivation of hundreds of complex metrics for hundreds of thousands (or more) of observations as the basis for training the predictive algorithms for a specific set of business objectives. Scoring is frequently applied against a larger set (millions) of observations because the full population is scored rather than the smaller training sets used in model construction.

    Advanced data mining methods often employ complex mathematical functions such as logarithms, exponentiation, trigonometric functions and sophisticated statistical functions to obtain the predictive characteristics desired. Access to detailed data is essential to the predictive power of the algorithms. Tools from vendors such as SAS and Quadstone provide a framework for development of complex models and require direct access to information stored in the relational structures of the data warehouse.

    The business end users in the data mining space tend to be a relatively small group of very sophisticated analysts with market research or statistical backgrounds. However, beware in your capacity planning! This small number of end users can easily consume 50% or more of the machine cycles on the data warehouse platform during peak periods. This heavy resource utilization is due to the complexity of data access and the volume of data handled in a typical data-mining environment.

    5.4 Stage 4: Operationalize

    Operationalization in Stage 4 of the evolution starts to bring us into the domain of active data warehousing. Whereas stages 1 to 3 focus on strategic decision-making within an organization, Stage 4 focuses on tactical decision support. Think of strategic decision support as providing the information necessary to make long-term decisions in the business. Applications of strategic decision support include market segmentation, product (category) management strategies, profitability analysis, forecasting and many others.

    Tactical decision support is not focused on developing corporate strategy in the ivory tower, but rather on supporting the people in the field who execute it.

    Operationalization typically means providing access to information for immediate decision-making in the field. Two examples are (1) inventory management with just-in-time replenishment and (2) scheduling and routing for package delivery. Many retailers are moving toward vendor managed inventory, with a retail chain and the manufacturers that supply it working as partners. The goal is to reduce inventory costs through more efficient supply chain management. In order for the partnership to be successful, access to information regarding sales, promotions, inventory-on-hand, etc. must be provided to the vendor at a detailed level. Manufacturing, delivery and so on can then be executed efficiently based on inventory requirements on a per-store basis. To be useful, the information must be extremely up-to-date and query response times must be very fast.

    In the example of package shipping with less than full load trucking there are very complex decisions involved in how to schedule trucks and route packages. Trucks generally converge at break bulks wherein packages get moved from one truck to another so that they ultimately arrive at their desired destination (in a way very analogous to how humans are shuffled around between connecting flights at an airline hub). When packages are on a late-arriving truck, tough decisions need to get made in regard to whether the connecting truck that the late package is scheduled for will wait for the package or leave on time. If it leaves without the package, the service level on that package may be compromised. On the other hand, waiting for the delayed package may cause other packages that are ready to go to miss their service levels. How long the truck should wait will depend on the service levels of all delayed packages destined for the truck as well as service levels for those packages already on the truck.

    A package due the next day is obviously going to have more difficulty in meeting its service levels under conditions of delay than one that is not due until many days later.

    Moreover, the sending and receiving parties associated with the package shipment should also be considered. Higher priority on making service levels should be given to packages associated with profitable customers where the relationship may be at risk if a package is late. Alternative routing options for the late packages, weather conditions and many other factors may also come into play. Making good decisions in this environment amounts to a highly complex optimization problem.
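    A deliberately simplified sketch conveys the shape of the trade-off. It assumes, purely for illustration, that each package carries a single priority weight, that a late package has a known arrival time, and that a loaded package has a known slack before its service level is missed. Real systems would also weigh routing alternatives, weather, and the other factors mentioned above. All names and numbers are hypothetical.

```python
def total_penalty(wait_minutes, late_packages, loaded_packages):
    """Sum the priority weights of all packages that miss their service level."""
    penalty = 0
    for pkg in late_packages:
        # A late package misses out if the truck leaves before it arrives.
        if pkg["arrives_in"] > wait_minutes:
            penalty += pkg["priority"]
    for pkg in loaded_packages:
        # A loaded package misses out if the wait exceeds its slack.
        if wait_minutes > pkg["slack"]:
            penalty += pkg["priority"]
    return penalty

def best_wait(late_packages, loaded_packages):
    """Pick the wait time with the lowest penalty, preferring shorter waits on ties.

    Under this model only "leave now" or "wait exactly until some late
    package arrives" can be optimal, so those are the only candidates.
    """
    candidates = sorted({0} | {pkg["arrives_in"] for pkg in late_packages})
    return min(
        candidates,
        key=lambda w: (total_penalty(w, late_packages, loaded_packages), w),
    )

# Hypothetical situation at a break bulk (times in minutes).
late = [{"arrives_in": 15, "priority": 5}, {"arrives_in": 45, "priority": 1}]
loaded = [{"slack": 30, "priority": 2}, {"slack": 120, "priority": 3}]

wait = best_wait(late, loaded)
```

    In this scenario, waiting for the high-priority package and leaving without the low-priority one costs less than either leaving immediately or waiting for both.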

    It is clear that a break bulk manager will dramatically increase the quality of his or her scheduling and routing decisions with the assistance of advanced decision support capabilities. However, for these capabilities to be useful, the information to drive decision-making must be extremely up-to-date. This means continuous data acquisition into the data warehouse in order for the decision-making capabilities to be relevant to day-to-day operations. Whereas a strategic decision support environment can use data that is loaded once per month or once per week, this lack of data freshness is unacceptable for tactical decision support. Furthermore, the response time for queries must be measured in a small number of seconds in order to accommodate the realities of decision-making in an operational, field environment.

    5.5 Stage 5: Activate

    The larger the role an active data warehouse plays in the operational aspects of decision support, the more incentive the business has to automate the decision processes. Both for efficiency reasons and for consistency in decision-making, the business will want to automate decisions when humans do not add significant value.

    In e-commerce business models there is no choice but to automate decision-making when a customer interacts with a Web site. Interactive customer relationship management (CRM) on a Web site or at an ATM is all about making decisions to optimize the customer relationship through individualized product offers, pricing, content delivery and so on. The very complex decision-making associated with interactive CRM takes place without humans in a completely automated fashion and must be executed with response times measured in seconds or milliseconds.

    As technology evolves, more and more decisions become executed with event-driven triggers to initiate fully automated decision processes. For example, the retail industry is on the verge of a technology breakthrough in the form of electronic shelf labels. This technology obsoletes the old-style Mylar labels, which require manual labor to update prices by swapping small plastic numbers on a shelf label. The new electronic labels can implement price changes remotely via computer controls without any manual labor.

    Integration of the electronic shelf label technology with an active data warehouse facilitates sophisticated price management with as much automation as a business cares to deploy. For seasonal items in stores where inventories are higher than they ought to be, it will be possible to automatically initiate sophisticated mark-down strategies to drive maximum sell-through with minimum margin erosion. Whereas a sophisticated mark-down strategy is prohibitively costly in the world of manual pricing, the use of electronic shelf labels with promotional messaging and dynamic pricing opens a whole new world of possibilities for price management. Moreover, the power of an active data warehouse allows these decisions to be made in an optimal fashion on an item-by-item, store-by-store and second-by-second basis using event triggering and sophisticated decision support capability. In a CRM context, even customer-by-customer decisions are possible with an active data warehouse.
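    Event triggering of this kind can be sketched as a small publish/subscribe mechanism in Python. Everything here is hypothetical: the event name, the handler registry, and especially the mark-down rule (a 10% cut when stock reaches 120% of target), which is an invented placeholder rather than a recommended pricing strategy.

```python
from collections import defaultdict

handlers = defaultdict(list)

def on(event_type):
    """Decorator that registers a function as a handler for one event type."""
    def register(func):
        handlers[event_type].append(func)
        return func
    return register

def publish(event_type, **payload):
    """Run every handler registered for the event and collect their decisions."""
    return [handler(**payload) for handler in handlers[event_type]]

@on("inventory_update")
def mark_down_if_overstocked(item, store, price, on_hand, target):
    """Hypothetical rule: cut the price by 10% when stock is at least 120% of target."""
    if on_hand >= 1.2 * target:
        return (item, store, round(price * 0.9, 2))
    return None  # no price change

# An overstocked seasonal item triggers an automatic mark-down for that store.
decisions = publish("inventory_update", item="scarf", store=12,
                    price=20.0, on_hand=150, target=100)
# Stock at target level triggers nothing.
no_action = publish("inventory_update", item="scarf", store=7,
                    price=20.0, on_hand=100, target=100)
```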

    Intense competition and technology innovations are motivating these advances in decision support deployment. An active data warehouse delivers information and enables decision support throughout an organization rather than being confined to strategic decision-making processes. However, tactical decision support does not replace strategic decision support. Rather, an active data warehouse supports the coexistence of both types of workloads. Notice in Figure 2 that a significant amount of workload in a Stage 5 data warehouse is still focused on strategic thinking. The operationalized and event triggered decision support of stages 4 and 5 provide the execution capability for strategies developed from traditional data warehouse analysis characterized in stages 1 to 3.

    6. Managing Data Warehouses

    Data warehouses are not magic; they take a great deal of very hard work. In many cases data warehouse projects are viewed as a stopgap measure to get users off our backs or to provide something for nothing. But data warehouses require careful management and marketing.

    A data warehouse is a good investment only if end-users actually can get at vital information faster and cheaper than they can using current technology. As a consequence, management has to think seriously about how they want their warehouses to perform and how they are going to get the word out to the end-user community. And management has to recognize that the maintenance of the data warehouse structure is as critical as the maintenance of any other mission-critical application. In fact, experience has shown that data warehouses quickly become one of the most used systems in any organization.

    Management, especially IT management, must also understand that if they embark on a data warehousing program, they are going to create new demands upon their operational systems: demands for better data, demands for consistent data, demands for different kinds of data.

    We now discuss the different tasks involved in maintaining a data warehouse so that it meets users' needs and reflects changes to the database. For a database that does not change except when the entire database is loaded with new data, maintenance tasks are minimal. You need only ensure that enough space is available in the default and named segments to accommodate the current batch of input data. Any needed restoration of the database can be done from the original input data files.

    For databases that are modified by incremental load operations or INSERT, UPDATE, or DELETE statements, maintenance includes accommodating growth in the database, as well as adjusting to changes in users' needs or the warehouse environment. For databases that change, backing up the data regularly is an important part of maintenance.
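    As a minimal illustration of a regular backup, the sketch below copies a database with the backup method of Python's standard sqlite3 module (available since Python 3.7). SQLite is only a stand-in here; a production warehouse would use the backup utilities of its own database product, and the sales table is hypothetical.

```python
import sqlite3

source = sqlite3.connect(":memory:")
source.executescript("""
    CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL);
    INSERT INTO sales (amount) VALUES (10.0), (20.0);
""")

# In practice the target would be a file on separate storage.
backup = sqlite3.connect(":memory:")
source.backup(backup)  # copy the whole database into the target connection

# Later changes to the source do not affect the backup copy.
source.execute("DELETE FROM sales")
source.commit()

restored_rows = backup.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
remaining_rows = source.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
```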

    The data warehouse can support the following for maintenance:

    o Locking Tables and Databases
    o Obtaining Information on Tables and Indexes
    o Monitoring Growth of Tables and Indexes
    o How Space is Allocated to a Segment
    o Altering Segments
    o Maintaining a Time-Cyclic Database
    o Recovering a Damaged Segment
    o Managing Optical Storage
    o Altering Tables
    o Copying or Moving a Database
    o Modifying the Configuration File
    o Monitoring and Controlling a Database Server
    o Determining Version Information
    o Deleting Database Objects and Databases
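    One of these tasks, monitoring the growth of tables, can be sketched with Python's standard sqlite3 module. This is an illustration only: SQLite stands in for the warehouse database, row counts stand in for whatever size metric the product reports, and the table names and helper functions are hypothetical.

```python
import sqlite3

def table_row_counts(connection):
    """Return {table name: row count} for every table in the database."""
    names = [row[0] for row in connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    return {
        name: connection.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
        for name in names
    }

def growth(before, after):
    """Rows added per table between two snapshots."""
    return {name: after[name] - before.get(name, 0) for name in after}

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (amount REAL);
    CREATE TABLE returns (amount REAL);
    INSERT INTO sales VALUES (1.0);
""")

before = table_row_counts(conn)
conn.executemany("INSERT INTO sales VALUES (?)", [(2.0,), (3.0,)])  # incremental load
after = table_row_counts(conn)
delta = growth(before, after)
```

    Taking such a snapshot after every load gives a simple growth history from which space requirements can be projected.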
