34
Data Warehousing Xintao Wu

Data Warehousing Xintao Wu. Evolution of Database Technology (See Fig. 1.1) 1960s: Data collection, database creation, IMS and network DBMS 1970s: Relational

  • View
    214

  • Download
    0

Embed Size (px)

Citation preview

Data Warehousing

Xintao Wu

Evolution of Database Technology(See Fig. 1.1)

1960s: Data collection, database creation, IMS and network

DBMS

1970s: Relational data model, relational DBMS implementation

1980s: RDBMS, advanced data models (extended-relational, OO,

deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)

1990s—2000s: Data mining and data warehousing, multimedia

databases, and Web databases

Can You Easily Answer These Questions?

What are Personnel

Services costs across all

departments for all funding sources?

What are the effects of

outsourcing specific services?

What is the correlation between

expenditures and collection of

delinquent taxes?

What is the impact on revenues and

expenditures of changing the operating

hours of the Dept. of Motor Vehicles?

What is the economic impact of the small business initiative in

our district?

Overview: Data Warehousing and OLAP Technology for Data Mining

What is a data warehouse?

Why a data warehouse?

A multi-dimensional data model

Data warehouse architecture

Data warehouse implementation

From data warehousing to data mining

What is a Warehouse?

Collection of diverse data subject oriented aimed at executive, decision maker often a copy of operational data with value-added data (e.g., summaries, history) integrated time-varying non-volatile

more

What is a Warehouse?

Collection of tools gathering data cleansing, integrating, ... querying, reporting, analysis data mining monitoring, administering warehouse

Data Warehouse vs. Operational DBMS

OLTP (on-line transaction processing) Major task of traditional relational DBMS Day-to-day operations: purchasing, inventory, banking,

manufacturing, payroll, registration, accounting, etc.

OLAP (on-line analytical processing) Major task of data warehouse system Data analysis and decision making

Distinct features (OLTP vs. OLAP): User and system orientation: customer vs. market Data contents: current, detailed vs. historical, consolidated Database design: ER + application vs. star + subject View: current, local vs. evolutionary, integrated Access patterns: update vs. read-only but complex queries

OLTP vs. OLAP OLTP OLAP

users clerk, IT professional knowledge worker

function day to day operations decision support

DB design application-oriented subject-oriented

data current, up-to-date detailed, flat relational isolated

historical, summarized, multidimensional integrated, consolidated

usage repetitive ad-hoc

access read/write index/hash on prim. key

lots of scans

unit of work short, simple transaction complex query

# records accessed tens millions

#users thousands hundreds

DB size 100MB-GB 100GB-TB

metric transaction throughput query throughput, response

Overview: Data Warehousing and OLAP Technology for Data Mining

What a data warehouse?

Why a data warehouse?

A multi-dimensional data model

Data warehouse architecture

Data warehouse implementation

From data warehousing to data mining

Why Separate Data Warehouse?

High performance for both systems DBMS— tuned for OLTP: access methods, indexing,

concurrency control, recovery Warehouse—tuned for OLAP: complex OLAP

queries, multidimensional view, consolidation.

Different functions and different data: missing data: Decision support requires historical

data which operational DBs do not typically maintain

data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources

data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled

Warehouse Architecture

Client Client

Warehouse

Source Source Source

Query & Analysis

Integration

Metadata

Advantages of Warehousing

High query performanceQueries not visible outside warehouseLocal processing at sources unaffectedCan operate when sources unavailableCan query data not stored in a DBMSExtra information at warehouse Modify, summarize (store aggregates) Add historical information

Overview: Data Warehousing and OLAP Technology for Data Mining

What a data warehouse?

Why a data warehouse?

A multi-dimensional data model

Data warehouse architecture

Data warehouse implementation

From data warehousing to data mining

Modeling OLTP SystemsGoal -- Update as many transactions as possible in the shortest period of timeApproachModel to 3rd Normal Form (3NF)Minimize redundancy to optimize update

ResultCreate many (hundreds) of tablesDifficult for business users to understand and useRetrieval requires many JOINs = lousy performance

Modeling the Data Warehouse

Tuning the relational modelDenormalize

– Reduces the number of tables

– Improves usability

– Improves performanceAdd aggregate data (typically separate tables)

– Improves performance

– Degrades usability

From Tables and Spreadsheets to Data Cubes

A data warehouse is based on a multidimensional data model which views data in the form of a data cube

A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions

Dimension tables, such as item (item_name, brand, type), or time(day, week, month, quarter, year)

Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables

In data warehousing literature, an n-D base cube is called a base cuboid. The top most 0-D cuboid, which holds the highest-level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.

Cube: A Lattice of Cuboids

all

time item location supplier

time,item time,location

time,supplier

item,location

item,supplier

location,supplier

time,item,location

time,item,supplier

time,location,supplier

item,location,supplier

time, item, location, supplier

0-D(apex) cuboid

1-D cuboids

2-D cuboids

3-D cuboids

4-D(base) cuboid

Conceptual Modeling of Data Warehouses

Modeling data warehouses: dimensions &

measures Star schema: A fact table in the middle connected to a set

of dimension tables Snowflake schema: A refinement of star schema where

some dimensional hierarchy is normalized into a set of

smaller dimension tables, forming a shape similar to

snowflake

Fact constellations: Multiple fact tables share dimension

tables, viewed as a collection of stars, therefore called

galaxy schema or fact constellation

Example of Star Schema

time_keydayday_of_the_weekmonthquarteryear

time

location_keystreetcityprovince_or_streetcountry

location

Sales Fact Table

time_key

item_key

branch_key

location_key

units_sold

dollars_sold

avg_sales

Measures

item_keyitem_namebrandtypesupplier_type

item

branch_keybranch_namebranch_type

branch

Example of Snowflake Schema

time_keydayday_of_the_weekmonthquarteryear

time

location_keystreetcity_key

location

Sales Fact Table

time_key

item_key

branch_key

location_key

units_sold

dollars_sold

avg_sales

Measures

item_keyitem_namebrandtypesupplier_key

item

branch_keybranch_namebranch_type

branch

supplier_keysupplier_type

supplier

city_keycityprovince_or_streetcountry

city

Example of Fact Constellation

time_keydayday_of_the_weekmonthquarteryear

time

location_keystreetcityprovince_or_streetcountry

location

Sales Fact Table

time_key

item_key

branch_key

location_key

units_sold

dollars_sold

avg_sales

Measures

item_keyitem_namebrandtypesupplier_type

item

branch_keybranch_namebranch_type

branch

Shipping Fact Table

time_key

item_key

shipper_key

from_location

to_location

dollars_cost

units_shipped

shipper_keyshipper_namelocation_keyshipper_type

shipper

Typical OLAP Operations

Roll up (drill-up): summarize data by climbing up hierarchy or by dimension reduction

Drill down (roll down): reverse of roll-up from higher level summary to lower level summary or

detailed data, or introducing new dimensions

Slice and dice: project and select

Pivot (rotate): reorient the cube, visualization, 3D to series of 2D planes.

Other operations drill across: involving (across) more than one fact table drill through: through the bottom level of the cube to its

back-end relational tables (using SQL)

Relational Operators

SelectProjectJoin

Overview: Data Warehousing and OLAP Technology for Data Mining

What a data warehouse?

Why a data warehouse?

A multi-dimensional data model

Data warehouse architecture

Data warehouse implementation

From data warehousing to data mining

Multi-Tiered ArchitectureMulti-Tiered Architecture

DataWarehouse

ExtractTransformLoadRefresh

OLAP Engine

AnalysisQueryReportsData mining

Monitor&

IntegratorMetadata

Data Sources Front-End Tools

Serve

Data Marts

Operational DBs

other

sources

Data Storage

OLAP Server

OLAP Server Architectures

Relational OLAP (ROLAP) ROLAP - provides a Multi-dimensional view of a relational DB (e.g.

MicroStrategy) Use relational or extended-relational DBMS to store and manage warehouse data

and OLAP middle ware to support missing pieces Include optimization of DBMS backend, implementation of aggregation

navigation logic, and additional tools and services greater scalability

Multidimensional OLAP (MOLAP) Array-based multidimensional storage engine (sparse matrix techniques) fast indexing to pre-computed summarized data

Hybrid OLAP (HOLAP) User flexibility, e.g., low level: relational, high-level: array

Specialized SQL servers specialized support for SQL queries over star/snowflake schemas

MOLAP Databases Data is stored using a proprietary

format(MOLAP) Accessible only through the DB vendor’s tools Suitable only for summarized data Data may be summarized in advance or real-time Examples:

PowerPlay Holos Essbase

MOLAP versus ROLAP

MOLAPMultidimensional OLAPData stored in multi-dimensional cubeTransformation requiredData retrieved directly from cube for analysisFaster analytical processingCube size limitations

ROLAPRelational OLAPData stored in relational database as virtual cubeNo transformation neededData retrieved via SQL from database for analysisSlower analytical processingNo size limitations

Data Warehouse Back-End Tools and UtilitiesData extraction: get data from multiple, heterogeneous, and external

sourcesData cleaning: detect errors in the data and rectify them when

possibleData transformation: convert data from legacy or host format to

warehouse formatLoad: sort, summarize, consolidate, compute views, check

integrity, and build indices and partitionsRefresh propagate the updates from the data sources to the

warehouse

Overview: Data Warehousing and OLAP Technology for Data Mining

What a data warehouse?

Why a data warehouse?

A multi-dimensional data model

Data warehouse architecture

Data warehouse implementation

From data warehouse to data mining

Data Mining: A KDD Process

Data mining: the core of knowledge discovery process.

Data Cleaning

Data Integration

Databases

Data Warehouse

Task-relevant Data

Selection

Data Mining

Pattern Evaluation

Steps of a KDD Process

Learning the application domain: relevant prior knowledge and goals of application

Creating a target data set: data selectionData cleaning and preprocessing: (may take 60% of effort!)Data reduction and transformation:

Find useful features, dimensionality/variable reduction, invariant representation.

Choosing functions of data mining summarization, classification, regression, association,

clustering.Choosing the mining algorithm(s)Data mining: search for patterns of interestPattern evaluation and knowledge presentation

visualization, transformation, removing redundant patterns, etc.

Use of discovered knowledge

Data Mining and Business Intelligence

Increasing potentialto supportbusiness decisions End User

Business Analyst

DataAnalyst

DBA

MakingDecisions

Data Presentation

Visualization Techniques

Data MiningInformation Discovery

Data Exploration

OLAP, MDA

Statistical Analysis, Querying and Reporting

Data Warehouses / Data Marts

Data SourcesPaper, Files, Information Providers, Database Systems, OLTP

Summary

Data warehouse A subject-oriented, integrated, time-variant, and

nonvolatile collection of data in support of management’s decision-making process

A multi-dimensional model of a data warehouse

Star schema, snowflake schema, fact constellations A data cube consists of dimensions & measures

OLAP operations: drilling, rolling, slicing, dicing and pivotingOLAP servers: ROLAP, MOLAP, HOLAPFrom OLAP to OLAM