Introduction to Data Warehousing Enrico Franconi CS 636

  • Introduction to Data Warehousing
    Enrico Franconi
    CS 636

    CS 336

  • Problem: Heterogeneous Information Sources
    Heterogeneities are everywhere
      Different interfaces
      Different data representations
      Duplicate and inconsistent information
    [Diagram: personal databases, digital libraries, scientific databases, World Wide Web]

  • Problem: Data Management in Large Enterprises
    Vertical fragmentation of informational systems (vertical stove pipes)
    Result of application (user)-driven development of operational systems
    [Diagram labels: Sales Administration, Finance, Manufacturing, ...; Sales Planning, Stock Mngmt, ...; Suppliers, ...; Debt Mngmt, Num. Control, ...; Inventory]

  • Goal: Unified Access to Data
    Collects and combines information
    Provides integrated view, uniform user interface
    Supports sharing
    [Diagram: digital libraries, scientific databases, personal databases]

  • Why a Warehouse?
    Two approaches:
      Query-driven (lazy)
      Warehouse (eager)

  • The Traditional Research Approach
    Query-driven (lazy, on-demand)
    [Diagram: clients, an integration system with metadata, and one wrapper per source (Source, Source, Source, ...)]
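
    The lazy approach can be sketched roughly as follows. This is only an illustration: the two sources, the wrapper functions, and the Mediator class are hypothetical names invented here, not part of the slides. The integration system stores no data and goes back to every source, through its wrapper, each time a client asks a question.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]

# Hypothetical sources: each wrapper translates a native format
# into one common row shape ({"customer": ..., "amount": ...}).
def wrap_sales_db() -> List[Row]:
    native = [("C1", 120.0), ("C2", 80.0)]        # (customer id, amount) tuples
    return [{"customer": c, "amount": a} for c, a in native]

def wrap_web_orders() -> List[Row]:
    native = [{"cust": "C1", "total": "30.5"}]    # strings, different field names
    return [{"customer": r["cust"], "amount": float(r["total"])} for r in native]

class Mediator:
    """Integration system: holds no data, only the list of wrappers."""

    def __init__(self, wrappers: List[Callable[[], List[Row]]]) -> None:
        self.wrappers = wrappers
        self.source_calls = 0   # counts how often a source is contacted

    def total_for(self, customer: str) -> float:
        total = 0.0
        for wrapper in self.wrappers:   # every query goes back to every source
            self.source_calls += 1
            total += sum(r["amount"] for r in wrapper() if r["customer"] == customer)
        return total

mediator = Mediator([wrap_sales_db, wrap_web_orders])
print(mediator.total_for("C1"))   # 150.5
print(mediator.total_for("C1"))   # 150.5 again, recomputed from the sources
print(mediator.source_calls)      # 4
```

    Asking the same question twice contacts the sources twice, which is the repeated cost this approach pays for frequent queries.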

  • Disadvantages of Query-Driven Approach
    Delay in query processing
      Slow or unavailable information sources
      Complex filtering and integration
    Inefficient and potentially expensive for frequent queries
    Competes with local processing at sources
    Hasn't caught on in industry

  • The Warehousing Approach
    Information integrated in advance
    Stored in warehouse for direct querying and analysis
    [Diagram: clients, data warehouse, integration system with metadata, and one extractor/monitor per source (Source, Source, Source, ...)]
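
    A minimal sketch of the eager approach, using Python's built-in sqlite3 module as a stand-in for the warehouse. The source data, table name, and function names are assumptions made for illustration. Extraction and integration run once per load, and queries read only the warehouse copy.

```python
import sqlite3

# Hypothetical source extracts, each in its own native shape.
SALES_DB = [("C1", 120.0), ("C2", 80.0)]
WEB_ORDERS = [{"cust": "C1", "total": "30.5"}]

def extract_all():
    """Extractor step: pull from every source and convert to (customer, amount)."""
    rows = [(c, a) for c, a in SALES_DB]
    rows += [(r["cust"], float(r["total"])) for r in WEB_ORDERS]
    return rows

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (customer TEXT, amount REAL)")

def load(conn):
    """Integrate in advance: runs once per refresh, not once per query."""
    conn.executemany("INSERT INTO sales VALUES (?, ?)", extract_all())
    conn.commit()

def total_for(conn, customer):
    """Queries touch only the warehouse copy, never the sources."""
    row = conn.execute(
        "SELECT COALESCE(SUM(amount), 0.0) FROM sales WHERE customer = ?",
        (customer,),
    ).fetchone()
    return row[0]

load(warehouse)
print(total_for(warehouse, "C1"))   # 150.5

# A later change at a source is invisible until the next load.
SALES_DB.append(("C1", 10.0))
print(total_for(warehouse, "C1"))   # still 150.5
```

    The second query shows the trade-off: answers are fast and do not disturb the sources, but they are only as current as the last load.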

  • Advantages of Warehousing Approach
    High query performance
      But not necessarily most current information
    Doesn't interfere with local processing at sources
      Complex queries at warehouse
      OLTP at information sources
    Information copied at warehouse
      Can modify, annotate, summarize, restructure, etc.
      Can store historical information
      Security, no auditing
    Has caught on in industry

  • Not Either-Or Decision
    Query-driven approach still better for:
      Rapidly changing information
      Rapidly changing information sources
      Truly vast amounts of data from large numbers of sources
      Clients with unpredictable needs

  • What is a Data Warehouse? A Practitioner's Viewpoint
    "A data warehouse is simply a single, complete, and consistent store of data obtained from a variety of sources and made available to end users in a way they can understand and use it in a business context."
    -- Barry Devlin, IBM Consultant

  • What is a Data Warehouse? An Alternative Viewpoint
    "A DW is a subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making."
    -- W.H. Inmon, Building the Data Warehouse, 1992

  • A Data Warehouse is...
    Stored collection of diverse data
      A solution to data integration problem
      Single repository of information
    Subject-oriented
      Organized by subject, not by application
      Used for analysis, data mining, etc.
    Optimized differently from transaction-oriented db
    User interface aimed at executive

  • A Data Warehouse is... (cont'd)
    Large volume of data (GB, TB)
    Non-volatile
      Historical
      Time attributes are important
    Updates infrequent
    May be append-only
    Examples
      All transactions ever at Sainsbury's
      Complete client histories at insurance firm
      LSE financial information and portfolios
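
    To make "non-volatile", "time attributes", and "append-only" concrete, here is a small sketch using sqlite3. The table, column names, and sample values are invented for illustration (loosely modeled on the insurance example). New facts are appended with a time attribute, nothing is updated in place, and a query can ask what was true on a given date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE policy_history (
        client_id  TEXT NOT NULL,
        valid_from TEXT NOT NULL,   -- ISO date, so string order is date order
        premium    REAL NOT NULL
    )
""")

def record(client_id, valid_from, premium):
    """Append a new version; existing rows are never updated or deleted."""
    conn.execute("INSERT INTO policy_history VALUES (?, ?, ?)",
                 (client_id, valid_from, premium))

def premium_as_of(client_id, date):
    """Return the premium in force on `date`, or None if no row applies."""
    row = conn.execute(
        """SELECT premium FROM policy_history
           WHERE client_id = ? AND valid_from <= ?
           ORDER BY valid_from DESC LIMIT 1""",
        (client_id, date),
    ).fetchone()
    return row[0] if row else None

record("K7", "2001-01-01", 400.0)
record("K7", "2002-06-01", 450.0)   # a change is a new row, not an UPDATE

print(premium_as_of("K7", "2001-12-31"))   # 400.0
print(premium_as_of("K7", "2003-01-01"))   # 450.0
```

    An operational system would typically overwrite the premium and keep only the current value; keeping every version is what lets the warehouse answer historical questions.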

  • Generic Warehouse Architecture
    [Diagram: several extractor/monitors feed an integrator, which loads the warehouse; clients use the warehouse for query & analysis; metadata sits alongside. Labeled activities: design phase, loading, maintenance, optimization]

  • Data Warehouse Architectures: Conceptual View
    Single-layer
      Every data element is stored once only
      Virtual warehouse
    Two-layer
      Real-time + derived data
      Most commonly used approach in industry today

  • Three-layer Architecture: Conceptual View
    Transformation of real-time data to derived data really requires two steps
    [Diagram labels: operational systems, real-time data, reconciled data, derived data, informational systems; annotations: "Physical implementation of the data warehouse", "View level", "Particular informational needs"]
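
    The two steps can be illustrated with a small sketch. Everything concrete here (the two operational systems, their field names, the pence and pound units) is a made-up assumption. Step 1 reconciles real-time data into one consistent form, and step 2 derives a view for one particular informational need.

```python
from collections import defaultdict

# Real-time (operational) data: two hypothetical systems, different conventions.
SALES_SYSTEM = [{"cust": "c1", "amount_pence": 12000, "region": "north"}]
FINANCE_SYSTEM = [
    {"customer_id": "C1", "amount_gbp": 30.5, "region": "North"},
    {"customer_id": "C2", "amount_gbp": 80.0, "region": "South"},
]

def reconcile():
    """Step 1: bring both systems to one schema, one unit, one spelling."""
    rows = []
    for r in SALES_SYSTEM:
        rows.append({"customer": r["cust"].upper(),
                     "amount": r["amount_pence"] / 100,   # pence -> pounds
                     "region": r["region"].title()})
    for r in FINANCE_SYSTEM:
        rows.append({"customer": r["customer_id"].upper(),
                     "amount": float(r["amount_gbp"]),
                     "region": r["region"].title()})
    return rows

def derive_by_region(reconciled):
    """Step 2: build a view for one informational need (totals per region)."""
    totals = defaultdict(float)
    for r in reconciled:
        totals[r["region"]] += r["amount"]
    return dict(totals)

reconciled = reconcile()
print(derive_by_region(reconciled))   # {'North': 150.5, 'South': 80.0}
```

    Other derived views (per customer, per month) could be built from the same reconciled rows without repeating the cleanup, which is one way to read the case for a separate middle layer.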

  • Data Warehousing: Two Distinct Issues
    (1) How to get information into warehouse
        "Data warehousing"
    (2) What to do with data once it's in warehouse
        "Warehouse DBMS"
    Both rich research areas
    Industry has focused on (2)

  • Issues in Data Warehousing
    Warehouse design
    Extraction
      Wrappers, monitors (change detectors)
    Integration
      Cleansing & merging
    Warehousing specification & maintenance
    Optimizations
    Miscellaneous (e.g., evolution)
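
    One simple way a monitor (change detector) could work is to compare successive snapshots of a source. The sketch below assumes snapshots are dictionaries keyed by primary key, and the function name and sample data are invented for illustration. Real monitors may instead rely on triggers, logs, or timestamps.

```python
def diff_snapshots(old, new):
    """Compare two snapshots keyed by primary key and report the changes."""
    inserted = {k: new[k] for k in new.keys() - old.keys()}
    deleted = {k: old[k] for k in old.keys() - new.keys()}
    updated = {k: (old[k], new[k])
               for k in old.keys() & new.keys() if old[k] != new[k]}
    return inserted, deleted, updated

# Hypothetical source table at two points in time: key -> (name, quantity).
yesterday = {1: ("widget", 10), 2: ("gadget", 5), 3: ("gizmo", 7)}
today = {1: ("widget", 12), 2: ("gadget", 5), 4: ("doohickey", 1)}

inserted, deleted, updated = diff_snapshots(yesterday, today)
print(inserted)   # {4: ('doohickey', 1)}
print(deleted)    # {3: ('gizmo', 7)}
print(updated)    # {1: (('widget', 10), ('widget', 12))}
```

    Only these changes, rather than the whole source, would then need to be passed on to the integrator.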

  • OLTP vs. OLAP
    OLTP: On Line Transaction Processing
      Describes processing at operational sites
    OLAP: On Line Analytical Processing
      Describes processing at warehouse

  • Warehouse is a Specialized DB
    Standard DB (OLTP)                          Warehouse (OLAP)
    Mostly updates                              Mostly reads
    Many small transactions                     Queries are long and complex
    MB - GB of data                             GB - TB of data
    Current snapshot                            History
    Index/hash on p.k.                          Lots of scans
    Raw data                                    Summarized, reconciled data
    Thousands of users (e.g., clerical users)   Hundreds of users (e.g., decision-makers, analysts)
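
    The contrast in access patterns can be sketched with sqlite3. The account table and both functions are hypothetical and only meant to show the shape of each workload: a short update that finds one row through its primary key, versus a read-only query that scans and summarizes the table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE account (
    id INTEGER PRIMARY KEY, branch TEXT, balance REAL)""")
conn.executemany("INSERT INTO account VALUES (?, ?, ?)",
                 [(1, "Leeds", 100.0), (2, "Leeds", 250.0), (3, "York", 40.0)])

def withdraw(account_id, amount):
    """OLTP-style: short transaction, one row located through the primary key."""
    with conn:   # commits on success, rolls back on exception
        conn.execute("UPDATE account SET balance = balance - ? WHERE id = ?",
                     (amount, account_id))

def balance_by_branch():
    """OLAP-style: read-only query that scans the table and summarizes it."""
    return dict(conn.execute(
        "SELECT branch, SUM(balance) FROM account GROUP BY branch"))

withdraw(1, 25.0)
print(balance_by_branch())   # {'Leeds': 325.0, 'York': 40.0}
```

    With three rows the difference is invisible, but at warehouse scale the second query reads far more data per request, which is why the two workloads are tuned so differently.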
