Introduction to Data Warehousing
Enrico Franconi
CS 636
CS 336
Problem: Heterogeneous Information Sources
  Heterogeneities are everywhere
  Different interfaces
  Different data representations
  Duplicate and inconsistent information
  [Figure labels: Personal Databases, Digital Libraries, Scientific Databases, World Wide Web]
Problem: Data Management in Large Enterprises
  Vertical fragmentation of informational systems (vertical stove pipes)
  Result of application (user)-driven development of operational systems
  [Figure labels: Sales Administration, Finance, Manufacturing, ..., Sales Planning, Stock Mngmt, ..., Suppliers, ..., Debt Mngmt, Num. Control, ..., Inventory]
Goal: Unified Access to Data
  Collects and combines information
  Provides integrated view, uniform user interface
  Supports sharing
  [Figure labels: Digital Libraries, Scientific Databases, Personal Databases]
Why a Warehouse?
  Two Approaches:
    Query-Driven (Lazy)
    Warehouse (Eager)
The Traditional Research Approach
  Query-driven (lazy, on-demand)
  [Figure labels: Clients; Integration System; Metadata; Wrapper, Wrapper, Wrapper; Source, Source, Source, ...]
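The query-driven approach can be illustrated with a minimal sketch, assuming two hypothetical in-memory sources (`SALES_DB`, `WEB_SHOP`) that store the same kind of fact in different representations. Each wrapper translates a common query into its source's format, and the integration function contacts every source only when a client asks. All names here are illustrative and do not come from the slides.

```python
from typing import Callable, Dict, List

# Hypothetical sources with different data representations.
SALES_DB = [{"cust": "Ann", "amount_pence": 1250}]   # amounts in pence
WEB_SHOP = [("Ann", 7.5), ("Bob", 3.0)]              # amounts in pounds


def wrap_sales_db(customer: str) -> List[Dict]:
    """Wrapper: query SALES_DB and convert rows to the common format."""
    return [{"customer": r["cust"], "amount": r["amount_pence"] / 100}
            for r in SALES_DB if r["cust"] == customer]


def wrap_web_shop(customer: str) -> List[Dict]:
    """Wrapper: query WEB_SHOP and convert rows to the common format."""
    return [{"customer": name, "amount": amount}
            for name, amount in WEB_SHOP if name == customer]


WRAPPERS: List[Callable[[str], List[Dict]]] = [wrap_sales_db, wrap_web_shop]


def query_total(customer: str) -> float:
    """Integration step: runs at query time and touches every source."""
    rows = [row for wrapper in WRAPPERS for row in wrapper(customer)]
    return sum(row["amount"] for row in rows)
```

Every call to `query_total` pays the cost of reaching all sources, which is where the delay and the competition with local processing come from.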
Disadvantages of Query-Driven Approach
  Delay in query processing
    Slow or unavailable information sources
    Complex filtering and integration
  Inefficient and potentially expensive for frequent queries
  Competes with local processing at sources
  Hasn't caught on in industry
The Warehousing Approach
  Information integrated in advance
  Stored in warehouse for direct querying and analysis
  [Figure labels: Clients; Data Warehouse; Integration System; Metadata; Extractor/Monitor, Extractor/Monitor, Extractor/Monitor; Source, Source, Source, ...]
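A minimal sketch of the eager approach, assuming two hypothetical sources (`SOURCE_A`, `SOURCE_B`) and an in-memory SQLite database standing in for the warehouse. The extractor functions and the `sales` table are illustrative names, not part of the slides. Data is converted and loaded once, and client queries then run against the warehouse only.

```python
import sqlite3

# Hypothetical sources with different data representations.
SOURCE_A = [{"cust": "Ann", "amount_pence": 1250}]
SOURCE_B = [("Ann", 7.5), ("Bob", 3.0)]


def extract_a():
    """Extractor for SOURCE_A: convert pence to pounds, tag the origin."""
    for r in SOURCE_A:
        yield (r["cust"], r["amount_pence"] / 100, "A")


def extract_b():
    """Extractor for SOURCE_B: rows are already in pounds."""
    for name, amount in SOURCE_B:
        yield (name, amount, "B")


def load_warehouse(conn: sqlite3.Connection) -> None:
    """Integrate in advance: copy all source data into one table."""
    conn.execute(
        "CREATE TABLE sales (customer TEXT, amount REAL, source TEXT)")
    for extractor in (extract_a, extract_b):
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", extractor())
    conn.commit()


def total_for(conn: sqlite3.Connection, customer: str) -> float:
    """Client query: answered from the warehouse, sources not contacted."""
    row = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM sales WHERE customer = ?",
        (customer,)).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
load_warehouse(conn)
```

A change made at a source after loading is not visible through `total_for` until the next load, which is the trade-off between query performance and currency of information.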
Advantages of Warehousing Approach
  High query performance
    But not necessarily most current information
  Doesn't interfere with local processing at sources
    Complex queries at warehouse
    OLTP at information sources
  Information copied at warehouse
    Can modify, annotate, summarize, restructure, etc.
    Can store historical information
    Security, no auditing
  Has caught on in industry
Not Either-Or Decision
  Query-driven approach still better for
    Rapidly changing information
    Rapidly changing information sources
    Truly vast amounts of data from large numbers of sources
    Clients with unpredictable needs
What is a Data Warehouse? A Practitioner's Viewpoint
  "A data warehouse is simply a single, complete, and consistent store of data obtained from a variety of sources and made available to end users in a way they can understand and use it in a business context."
  -- Barry Devlin, IBM Consultant
What is a Data Warehouse? An Alternative Viewpoint
  "A DW is a subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making."
  -- W.H. Inmon, Building the Data Warehouse, 1992
A Data Warehouse is...
  Stored collection of diverse data
    A solution to data integration problem
    Single repository of information
  Subject-oriented
    Organized by subject, not by application
    Used for analysis, data mining, etc.
  Optimized differently from transaction-oriented db
  User interface aimed at executive
Cont'd
  Large volume of data (GB, TB)
  Non-volatile
    Historical
    Time attributes are important
  Updates infrequent
  May be append-only
  Examples
    All transactions ever at Sainsbury's
    Complete client histories at insurance firm
    LSE financial information and portfolios
Generic Warehouse Architecture
  [Figure labels: Client, Client; Query & Analysis; Warehouse; Metadata; Integrator; Extractor/Monitor, Extractor/Monitor, Extractor/Monitor; ...; Design Phase; Maintenance; Loading; Optimization]
Data Warehouse Architectures: Conceptual View
  Single-layer
    Every data element is stored once only
    Virtual warehouse
  Two-layer
    Real-time + derived data
    Most commonly used approach in industry today
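As a loose analogy only, the difference between the two conceptual views can be sketched with SQLite: a view computes its answer from the single stored copy of the data (in the spirit of a virtual warehouse), while a separately stored summary table plays the role of derived data kept alongside the real-time data. The table and view names below are hypothetical, and the mapping to SQL views is an assumption made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Real-time (operational) data: the only stored copy in the single-layer case.
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("north", 10.0), ("north", 5.0), ("south", 2.0)])

# Single-layer analogue: a view, nothing is stored a second time.
conn.execute("""CREATE VIEW v_region_totals AS
                SELECT region, SUM(amount) AS total
                FROM orders GROUP BY region""")

# Two-layer analogue: derived data stored separately from real-time data.
conn.execute("""CREATE TABLE d_region_totals AS
                SELECT region, SUM(amount) AS total
                FROM orders GROUP BY region""")

# A later operational update.
conn.execute("INSERT INTO orders VALUES ('south', 8.0)")


def totals(relation: str) -> dict:
    """Read region totals; `relation` is one of the two fixed names above."""
    return dict(conn.execute(f"SELECT region, total FROM {relation}"))
```

After the later insert, `totals("v_region_totals")` reflects the new order while `totals("d_region_totals")` still holds the values from load time, so the derived layer needs its own refresh step.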
Three-layer Architecture: Conceptual View
  Transformation of real-time data to derived data really requires two steps
  [Figure labels: Operational systems; Informational systems; Real-time data; Reconciled Data; Derived Data; Physical Implementation of the Data Warehouse; View level; Particular informational needs]
Data Warehousing: Two Distinct Issues
  (1) How to get information into warehouse
    Data warehousing
  (2) What to do with data once it's in warehouse
    Warehouse DBMS
  Both rich research areas
  Industry has focused on (2)
Issues in Data Warehousing
  Warehouse Design
  Extraction
    Wrappers, monitors (change detectors)
  Integration
    Cleansing & merging
  Warehousing specification & Maintenance
  Optimizations
  Miscellaneous (e.g., evolution)
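One simple way a monitor (change detector) could work is by comparing two snapshots of a source, keyed by primary key. The sketch below assumes that technique; the slides do not say which detection method is meant, and `detect_changes` is a hypothetical helper.

```python
from typing import Dict, List, Tuple

Snapshot = Dict[int, dict]          # primary key -> record
Change = Tuple[str, int, dict]      # (kind, key, record)


def detect_changes(old: Snapshot, new: Snapshot) -> List[Change]:
    """Compare two snapshots and report inserts, updates, and deletes."""
    changes: List[Change] = []
    for key, record in new.items():
        if key not in old:
            changes.append(("insert", key, record))
        elif old[key] != record:
            changes.append(("update", key, record))
    for key, record in old.items():
        if key not in new:
            # For deletes, report the last known version of the record.
            changes.append(("delete", key, record))
    return changes
```

The resulting change list is what an integrator would consume to keep the warehouse up to date without reloading the whole source.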
OLTP vs. OLAP
  OLTP: On Line Transaction Processing
    Describes processing at operational sites
  OLAP: On Line Analytical Processing
    Describes processing at warehouse
Warehouse is a Specialized DB
  Standard DB (OLTP)
    Mostly updates
    Many small transactions
    MB - GB of data
    Current snapshot
    Index/hash on p.k.
    Raw data
    Thousands of users (e.g., clerical users)
  Warehouse (OLAP)
    Mostly reads
    Queries are long and complex
    GB - TB of data
    History
    Lots of scans
    Summarized, reconciled data
    Hundreds of users (e.g., decision-makers, analysts)
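The contrast between the two workloads can be sketched with SQLite, assuming a hypothetical `account` table for the operational side and a hypothetical `sales_history` table for the warehouse side. The OLTP-style operation is a small update located through the primary key; the OLAP-style query reads and summarizes historical rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL)")
conn.execute(
    "CREATE TABLE sales_history (year INTEGER, region TEXT, amount REAL)")
conn.executemany("INSERT INTO account VALUES (?, ?)",
                 [(1, 100.0), (2, 50.0)])
conn.executemany("INSERT INTO sales_history VALUES (?, ?, ?)",
                 [(2022, "north", 10.0), (2022, "south", 4.0),
                  (2023, "north", 12.0), (2023, "south", 6.0)])
conn.commit()


def withdraw(account_id: int, amount: float) -> None:
    """OLTP style: one short transaction updating a single row by key."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "UPDATE account SET balance = balance - ? WHERE id = ?",
            (amount, account_id))


def yearly_totals() -> list:
    """OLAP style: read-only scan that summarizes historical data."""
    return conn.execute(
        "SELECT year, SUM(amount) FROM sales_history "
        "GROUP BY year ORDER BY year").fetchall()
```

On real systems the two workloads also differ in indexing and storage choices, which this toy example does not attempt to show.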