ITCS 6163 Data Warehousing

Preview:

DESCRIPTION

ITCS 6163 Data Warehousing. Xintao Wu. History. 60s C. Bachman GE network data model Late 60s IBM IMS hierarchical data model 70 E.Codd relational model 80s SQL IBM R transaction J. Gray Late 80s-90s DB2, Oracle, informix, sybase - PowerPoint PPT Presentation

Citation preview

ITCS 6163 Data Warehousing

Xintao Wu

History

60s C. Bachman GE network data model Late 60s IBM IMS hierarchical data model 70 E.Codd relational model 80s SQL IBM R transaction J. Gray Late 80s-90s DB2, Oracle, informix, sybase 90s- DW, internet Turing award and Turing test?

Evolution of Database Technology(See Fig. 1.1)

1960s: Data collection, database creation, IMS and network

DBMS

1970s: Relational data model, relational DBMS implementation

1980s: RDBMS, advanced data models (extended-relational, OO,

deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)

1990s—2000s: Data mining and data warehousing, multimedia

databases, and Web databases

Can You Easily Answer These Questions?

What are Personnel

Services costs across all

departments for all funding sources?

What are the effects of

outsourcing specific services?

What is the correlation between

expenditures and collection of

delinquent taxes?

What is the impact on revenues and

expenditures of changing the operating

hours of the Dept. of Motor Vehicles?

What is the economic impact of the small business initiative in

our district?

Overview: Data Warehousing and OLAP Technology for Data Mining

What is a data warehouse?

Why a data warehouse?

A multi-dimensional data model

Data warehouse architecture

Data warehouse implementation

From data warehousing to data mining

What is a Warehouse?

Collection of diverse data subject oriented aimed at executive, decision maker often a copy of operational data with value-added data (e.g., summaries, history) integrated time-varying non-volatile

more

What is a Warehouse?

Collection of tools gathering data cleansing, integrating, ... querying, reporting, analysis data mining monitoring, administering warehouse

What is a Warehouse?

Defined in many different ways, but not rigorously.

A decision support database that is maintained separately from the organization’s operational database

Support information processing by providing a solid platform of consolidated, historical data for analysis.

“A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.”—W. H. InmonData warehousing:

The process of constructing and using data warehouses

Data Warehouse—Subject-Oriented

Organized around major subjects, such as

customer, product, sales.

Focusing on the modeling and analysis of data

for decision makers, not on daily operations or

transaction processing.

Provide a simple and concise view around

particular subject issues by excluding data that

are not useful in the decision support process.

Data Warehouse—Integrated

Constructed by integrating multiple, heterogeneous data sources relational databases, flat files, on-line

transaction records

Data cleaning and data integration techniques are applied. Ensure consistency in naming conventions,

encoding structures, attribute measures, etc. among different data sources E.g., Hotel price: currency, tax, breakfast covered,

etc. When data is moved to the warehouse, it is

converted.

Data Warehouse—Time Variant

The time horizon for the data warehouse is significantly longer than that of operational systems. Operational database: current value data. Data warehouse data: provide information from a

historical perspective (e.g., past 5-10 years)

Every key structure in the data warehouse Contains an element of time, explicitly or implicitly But the key of operational data may or may not

contain “time element”.

Data Warehouse—Non-Volatile

A physically separate store of data transformed

from the operational environment.

Operational update of data does not occur in

the data warehouse environment. Does not require transaction processing, recovery,

and concurrency control mechanisms

Requires only two operations in data accessing:

initial loading of data and access of data.

Warehouse is specialized DB

Mostly updatesMany small transactionsMb-Tb of dataCurrent snapshotRaw dataClerical users

Mostly readsQueries are long, complexGb-Tb of dataHistorySummarized, consolidated dataDecision-makers, analysts as users

Standard DB Warehouse

Data Warehouse vs. Heterogeneous DBMS

Traditional heterogeneous DB integration: Build wrappers/mediators on top of heterogeneous databases Query driven approach

When a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for individual heterogeneous sites involved, and the results are integrated into a global answer set

Complex information filtering, compete for resources

Data warehouse: update-driven, high performance

Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis

Data Warehouse vs. Operational DBMS

OLTP (on-line transaction processing) Major task of traditional relational DBMS Day-to-day operations: purchasing, inventory, banking,

manufacturing, payroll, registration, accounting, etc.

OLAP (on-line analytical processing) Major task of data warehouse system Data analysis and decision making

Distinct features (OLTP vs. OLAP): User and system orientation: customer vs. market Data contents: current, detailed vs. historical, consolidated Database design: ER + application vs. star + subject View: current, local vs. evolutionary, integrated Access patterns: update vs. read-only but complex queries

OLTP vs. OLAP OLTP OLAP

users clerk, IT professional knowledge worker

function day to day operations decision support

DB design application-oriented subject-oriented

data current, up-to-date detailed, flat relational isolated

historical, summarized, multidimensional integrated, consolidated

usage repetitive ad-hoc

access read/write index/hash on prim. key

lots of scans

unit of work short, simple transaction complex query

# records accessed tens millions

#users thousands hundreds

DB size 100MB-GB 100GB-TB

metric transaction throughput query throughput, response

Overview: Data Warehousing and OLAP Technology for Data Mining

What a data warehouse?

Why a data warehouse?

A multi-dimensional data model

Data warehouse architecture

Data warehouse implementation

From data warehousing to data mining

Why Separate Data Warehouse?

High performance for both systems DBMS— tuned for OLTP: access methods, indexing,

concurrency control, recovery Warehouse—tuned for OLAP: complex OLAP

queries, multidimensional view, consolidation.

Different functions and different data: missing data: Decision support requires historical

data which operational DBs do not typically maintain

data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources

data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled

Warehouse Architecture

Client Client

Warehouse

Source Source Source

Query & Analysis

Integration

Metadata

Why a Warehouse?

Two Approaches: Query-Driven (Eager) Warehouse (Lazy)

Source Source

?

Query-Driven Approach

Client Client

Wrapper Wrapper Wrapper

Mediator

Source Source Source

Advantages of Warehousing

High query performanceQueries not visible outside warehouseLocal processing at sources unaffectedCan operate when sources unavailableCan query data not stored in a DBMSExtra information at warehouse Modify, summarize (store aggregates) Add historical information

Advantages of Query-Driven

No need to copy data less storage no need to purchase data

More up-to-date dataQuery needs can be unknownOnly query interface needed at sources

Overview: Data Warehousing and OLAP Technology for Data Mining

What a data warehouse?

Why a data warehouse?

A multi-dimensional data model

Data warehouse architecture

Data warehouse implementation

From data warehousing to data mining

Modeling OLTP SystemsGoal -- Update as many transactions as possible in the shortest period of timeApproachModel to 3rd Normal Form (3NF)Minimize redundancy to optimize update

ResultCreate many (hundreds) of tablesDifficult for business users to understand and useRetrieval requires many JOINs = lousy performance

Modeling the Data Warehouse

Tuning the relational modelDenormalize

– Reduces the number of tables

– Improves usability

– Improves performanceAdd aggregate data (typically separate tables)

– Improves performance

– Degrades usability

Modeling the Data Warehouse

“Entity relation data models are a disaster for querying because they cannot be understood by users and they cannot be navigated usefully by DBMS software. Entity relation models cannot be used as the basis for enterprise data warehouses.”

Ralph Kimball, The Data Warehouse Toolkit,

1996, John Wiley & Sons, Inc., ISBN 0-471-15337-0

From Tables and Spreadsheets to Data Cubes

A data warehouse is based on a multidimensional data model which views data in the form of a data cube

A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions

Dimension tables, such as item (item_name, brand, type), or time(day, week, month, quarter, year)

Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables

In data warehousing literature, an n-D base cube is called a base cuboid. The top most 0-D cuboid, which holds the highest-level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.

Cube: A Lattice of Cuboids

all

time item location supplier

time,item time,location

time,supplier

item,location

item,supplier

location,supplier

time,item,location

time,item,supplier

time,location,supplier

item,location,supplier

time, item, location, supplier

0-D(apex) cuboid

1-D cuboids

2-D cuboids

3-D cuboids

4-D(base) cuboid

Conceptual Modeling of Data Warehouses

Modeling data warehouses: dimensions &

measures Star schema: A fact table in the middle connected to a set

of dimension tables Snowflake schema: A refinement of star schema where

some dimensional hierarchy is normalized into a set of

smaller dimension tables, forming a shape similar to

snowflake

Fact constellations: Multiple fact tables share dimension

tables, viewed as a collection of stars, therefore called

galaxy schema or fact constellation

Example of Star Schema

time_keydayday_of_the_weekmonthquarteryear

time

location_keystreetcityprovince_or_streetcountry

location

Sales Fact Table

time_key

item_key

branch_key

location_key

units_sold

dollars_sold

avg_sales

Measures

item_keyitem_namebrandtypesupplier_type

item

branch_keybranch_namebranch_type

branch

Example of Snowflake Schema

time_keydayday_of_the_weekmonthquarteryear

time

location_keystreetcity_key

location

Sales Fact Table

time_key

item_key

branch_key

location_key

units_sold

dollars_sold

avg_sales

Measures

item_keyitem_namebrandtypesupplier_key

item

branch_keybranch_namebranch_type

branch

supplier_keysupplier_type

supplier

city_keycityprovince_or_streetcountry

city

Example of Fact Constellation

time_keydayday_of_the_weekmonthquarteryear

time

location_keystreetcityprovince_or_streetcountry

location

Sales Fact Table

time_key

item_key

branch_key

location_key

units_sold

dollars_sold

avg_sales

Measures

item_keyitem_namebrandtypesupplier_type

item

branch_keybranch_namebranch_type

branch

Shipping Fact Table

time_key

item_key

shipper_key

from_location

to_location

dollars_cost

units_shipped

shipper_keyshipper_namelocation_keyshipper_type

shipper

Multidimensional DataSales volume as a function of product, month, and region

Pro

duct

Regio

n

Month

Dimensions: Product, Location, TimeHierarchical summarization paths

Industry Region Year

Category Country Quarter

Product City Month Week

Office Day

A Sample Data Cube

Total annual salesof TV in U.S.A.Date

Produ

ct

Cou

ntr

ysum

sum TV

VCRPC

1Qtr 2Qtr 3Qtr 4Qtr

U.S.A

Canada

Mexico

sum

Cuboids Corresponding to the Cube

all

product date country

product,date product,country date, country

product, date, country

0-D(apex) cuboid

1-D cuboids

2-D cuboids

3-D(base) cuboid

Typical OLAP Operations

Roll up (drill-up): summarize data by climbing up hierarchy or by dimension reduction

Drill down (roll down): reverse of roll-up from higher level summary to lower level summary or

detailed data, or introducing new dimensions

Slice and dice: project and select

Pivot (rotate): reorient the cube, visualization, 3D to series of 2D planes.

Other operations drill across: involving (across) more than one fact table drill through: through the bottom level of the cube to its

back-end relational tables (using SQL)

Relational Operators

SelectProjectJoin

Aggregates

sale prodId storeId date amtp1 c1 1 12p2 c1 1 11p1 c3 1 50p2 c2 1 8p1 c1 2 44p1 c2 2 4

• Add up amounts by day• In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date

ans date sum1 812 48

Another Example

sale prodId storeId date amtp1 c1 1 12p2 c1 1 11p1 c3 1 50p2 c2 1 8p1 c1 2 44p1 c2 2 4

• Add up amounts by day, product• In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date, prodId

sale prodId date amtp1 1 62p2 1 19p1 2 48

drill-down

rollup

Aggregates

Operators: sum, count, max, min, median, avg

Type Distributive Algebraic holistic

“Having” clauseUsing dimension hierarchy average by region (within store) maximum by month (within date)

Cube Aggregation

day 2c1 c2 c3

p1 44 4p2 c1 c2 c3

p1 12 50p2 11 8

day 1

c1 c2 c3p1 56 4 50p2 11 8

c1 c2 c3p1 67 12 50

c1p1 110p2 19

129

. . .

drill-down

rollup

Example: computing sums

Aggregation Using Hierarchies

day 2c1 c2 c3

p1 44 4p2 c1 c2 c3

p1 12 50p2 11 8

day 1

region A region Bp1 56 54p2 11 8

customer

region

country

(customer c1 in Region A;customers c2, c3 in Region B)

Pivoting

sale prodId storeId date amtp1 c1 1 12p2 c1 1 11p1 c3 1 50p2 c2 1 8p1 c1 2 44p1 c2 2 4

day 2c1 c2 c3

p1 44 4p2 c1 c2 c3

p1 12 50p2 11 8

day 1

Multi-dimensional cube:Fact table view:

c1 c2 c3p1 56 4 50p2 11 8

Overview: Data Warehousing and OLAP Technology for Data Mining

What a data warehouse?

Why a data warehouse?

A multi-dimensional data model

Data warehouse architecture

Data warehouse implementation

From data warehousing to data mining

Design of a Data Warehouse: A Business Analysis Framework

Four views regarding the design of a data warehouse

Top-down view allows selection of the relevant information necessary

for the data warehouse Data source view

exposes the information being captured, stored, and managed by operational systems

Data warehouse view consists of fact tables and dimension tables

Business query view sees the perspectives of data in the warehouse from

the view of end-user

Data Warehouse Design Process

Top-down, bottom-up approaches or a combination of both

Top-down: Starts with overall design and planning (mature) Bottom-up: Starts with experiments and prototypes (rapid)

From software engineering point of view Waterfall: structured and systematic analysis at each step

before proceeding to the next Spiral: rapid generation of increasingly functional systems,

short turn around time, quick turn around

Typical data warehouse design process Choose a business process to model, e.g., orders, invoices, etc. Choose the grain (atomic level of data) of the business process Choose the dimensions that will apply to each fact table record Choose the measure that will populate each fact table record

Multi-Tiered ArchitectureMulti-Tiered Architecture

DataWarehouse

ExtractTransformLoadRefresh

OLAP Engine

AnalysisQueryReportsData mining

Monitor&

IntegratorMetadata

Data Sources Front-End Tools

Serve

Data Marts

Operational DBs

other

sources

Data Storage

OLAP Server

Three Data Warehouse Models

Enterprise warehouse collects all of the information about subjects spanning

the entire organization

Data Mart a subset of corporate-wide data that is of value to a

specific groups of users. Its scope is confined to specific, selected groups, such as marketing data mart Independent vs. dependent (directly from warehouse) data mart

Virtual warehouse A set of views over operational databases Only some of the possible summary views may be

materialized

What is a Data Mart ?

A data mart is a small-scale data warehouse that is focused on a single department or single subject area to provide a subset of data warehouse data to address specific reporting and analysis requirements.

Budget

Finance

HR

Purch-asing

AssetMgmt.

Info.Tech.

Smaller warehouses Spans part of organization Do not require enterprise-wide consensus

but long term integration problems?

Data Warehouse Development: A Recommended Approach

Define a high-level corporate data model

Data Mart

Data Mart

Distributed Data Marts

Multi-Tier Data Warehouse

Enterprise Data Warehouse

Model refinementModel refinement

OLAP Server Architectures

Relational OLAP (ROLAP) ROLAP - provides a Multi-dimensional view of a relational DB (e.g.

MicroStrategy) Use relational or extended-relational DBMS to store and manage warehouse data

and OLAP middle ware to support missing pieces Include optimization of DBMS backend, implementation of aggregation

navigation logic, and additional tools and services greater scalability

Multidimensional OLAP (MOLAP) Array-based multidimensional storage engine (sparse matrix techniques) fast indexing to pre-computed summarized data

Hybrid OLAP (HOLAP) User flexibility, e.g., low level: relational, high-level: array

Specialized SQL servers specialized support for SQL queries over star/snowflake schemas

MOLAP Databases Data is stored using a proprietary

format(MOLAP) Accessible only through the DB vendor’s tools Suitable only for summarized data Data may be summarized in advance or real-time Examples:

PowerPlay Holos Essbase

RDBMS: Indexing Strategies

Select columns to be indexed: Choose combinations of columns most often used to

constrain queries (“where …” clause) Queries must use constraining columns in the same

order as the columns in the index

Unique more efficient than non-unique.

More indexes means faster query performance, but also longer transformation/load times.

Types of Indexes: B-tree -- many possible values (e.g., invoice number) Bitmap -- few possible values (e.g., M/F, S/M/D/W)

MOLAP versus ROLAP

MOLAPMultidimensional OLAPData stored in multi-dimensional cubeTransformation requiredData retrieved directly from cube for analysisFaster analytical processingCube size limitations

ROLAPRelational OLAPData stored in relational database as virtual cubeNo transformation neededData retrieved via SQL from database for analysisSlower analytical processingNo size limitations

Data Warehouse Usage

Three kinds of data warehouse applications Information processing

supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs

Analytical processing multidimensional analysis of data warehouse data supports basic OLAP operations, slice-dice, drilling,

pivoting Data mining

knowledge discovery from hidden patterns supports associations, constructing analytical models,

performing classification and prediction, and presenting the mining results using visualization tools.

Differences among the three tasks

Data Mining & Forecasting

Mining the Warehouse Choose data population Select mining technique Segment data into

groups Identify data patterns

Forecasting Data Select trend data Choose forecast model Run forecast Display predictions

Accessing & Analyzing Data

Query & Reporting … retrieving data directly from the warehouse and preparing it for presentationOnline Analytical Processing (OLAP) … analyzing aggregated data from a variety of perspectivesData Mining & Forecasting … analyzing and predicting data using mathematical models

Query & ReportingQuery the Data ... Select & filter data Retrieve results

Report the Results ... Sort & group data Format & present data

Save or Export Data Save queries & reports Export to other tools Publish HTML pages

Query & Reporting Tools

Cognos ImpromptuBusiness ObjectsCrystal InfoBrioQueryIQGQLSAS

Online Analytical Processing

Slice and Dice ... Select dimensions Choose measures Filter by dimensions

Drill Down ... Drill down

hierarchies Drill through to

details

Present the Results Present as

spreadsheet Display graphically

OLAP Tools

Cognos PowerPlayBusiness AnalyzerHolosBrioAnalyzerMicrostrategyOracle ExpressSASArbor Essbase

Data Mining & Forecasting

Mining the Warehouse Choose data population Select mining technique Segment data into

groups Identify data patterns

Forecasting Data Select trend data Choose forecast model Run forecast Display predictions

Data Mining

tran1 cust33 p2, p5, p8tran2 cust45 p5, p8, p11tran3 cust12 p1, p9tran4 cust40 p5, p8, p11tran5 cust12 p2, p9tran6 cust12 p9

transactio

n

id custo

mer

id products

bought

salesrecords:

• Trend: Products p5, p8 often bough together• Trend: Customer 12 likes product p9

Mining and Forecasting Tools

Scenario4ThoughtBusiness MinerClementineDarwinHolosSAS

Data Warehouse Back-End Tools and UtilitiesData extraction: get data from multiple, heterogeneous, and external

sourcesData cleaning: detect errors in the data and rectify them when

possibleData transformation: convert data from legacy or host format to

warehouse formatLoad: sort, summarize, consolidate, compute views, check

integrity, and build indicies and partitionsRefresh propagate the updates from the data sources to the

warehouse

Data Cleaning

• Of primordial interest in the warehouse creation

• One of the biggest problems• Difficult to achieve• Probability of one or many of the

sources containing “dirty data” is high.

• Lots of manual intervention

Data Cleaning Problems

Data quality problems

Single Source Multi-source

Schema level Instance level

Schema level Instance level

(poor schema design) (data entry errors) (heterogeneity) (overlapping contradicting data)

.Uniqueness .Misspellings Naming conflictsInconsistent

aggregation

Multisource problems

All the previous problems +Schema differences (translation and integration) E.g.: EmpID, CID, Sex= M/F, Sex=0/1

Instance level conflicts Duplicate records, contradicting records Different measures ($, Euros) Different aggregation levels (weeks,

quarters)

Overview: Data Warehousing and OLAP Technology for Data Mining

What a data warehouse?

Why a data warehouse?

A multi-dimensional data model

Data warehouse architecture

Data warehouse implementation

From data warehouse to data mining

Data Mining: A KDD Process

Data mining: the core of knowledge discovery process.

Data Cleaning

Data Integration

Databases

Data Warehouse

Task-relevant Data

Selection

Data Mining

Pattern Evaluation

Steps of a KDD Process

Learning the application domain: relevant prior knowledge and goals of application

Creating a target data set: data selectionData cleaning and preprocessing: (may take 60% of effort!)Data reduction and transformation:

Find useful features, dimensionality/variable reduction, invariant representation.

Choosing functions of data mining summarization, classification, regression, association,

clustering.Choosing the mining algorithm(s)Data mining: search for patterns of interestPattern evaluation and knowledge presentation

visualization, transformation, removing redundant patterns, etc.

Use of discovered knowledge

Data Mining and Business Intelligence

Increasing potentialto supportbusiness decisions End User

Business Analyst

DataAnalyst

DBA

MakingDecisions

Data Presentation

Visualization Techniques

Data MiningInformation Discovery

Data Exploration

OLAP, MDA

Statistical Analysis, Querying and Reporting

Data Warehouses / Data Marts

Data SourcesPaper, Files, Information Providers, Database Systems, OLTP

From OLAP to On Line Analytical Mining (OLAM)

Why online analytical mining? High quality of data in data warehouses

DW contains integrated, consistent, cleaned data Available information processing structure surrounding

data warehouses ODBC, OLEDB, Web accessing, service facilities,

reporting and OLAP tools OLAP-based exploratory data analysis

mining with drilling, dicing, pivoting, etc. On-line selection of data mining functions

integration and swapping of multiple mining functions, algorithms, and tasks.

Architecture of OLAM

An OLAM Architecture

Data Warehouse

Meta Data

MDDB

OLAMEngine

OLAPEngine

User GUI API

Data Cube API

Database API

Data cleaning

Data integration

Layer3

OLAP/OLAM

Layer2

MDDB

Layer1

Data Repository

Layer4

User Interface

Filtering&Integration Filtering

Databases

Mining query Mining result

Summary

Data warehouse A subject-oriented, integrated, time-variant, and

nonvolatile collection of data in support of management’s decision-making process

A multi-dimensional model of a data warehouse

Star schema, snowflake schema, fact constellations A data cube consists of dimensions & measures

OLAP operations: drilling, rolling, slicing, dicing and pivotingOLAP servers: ROLAP, MOLAP, HOLAPFrom OLAP to OLAM

Recommended