ITCS 6163 Data Warehousing Xintao Wu


Page 1: ITCS 6163 Data Warehousing

ITCS 6163 Data Warehousing

Xintao Wu

Page 2: ITCS 6163 Data Warehousing

History

• 1960s: C. Bachman (GE), network data model
• Late 1960s: IBM IMS, hierarchical data model
• 1970: E. Codd, relational model
• 1980s: SQL, IBM System R; transactions (J. Gray)
• Late 1980s-1990s: DB2, Oracle, Informix, Sybase
• 1990s-: data warehousing (DW), Internet
• Turing Award and Turing test?

Page 3: ITCS 6163 Data Warehousing

Evolution of Database Technology (See Fig. 1.1)

• 1960s: Data collection, database creation, IMS and network DBMS
• 1970s: Relational data model, relational DBMS implementation
• 1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)
• 1990s-2000s: Data mining and data warehousing, multimedia databases, and Web databases

Page 4: ITCS 6163 Data Warehousing

Can You Easily Answer These Questions?

• What are Personnel Services costs across all departments for all funding sources?
• What are the effects of outsourcing specific services?
• What is the correlation between expenditures and collection of delinquent taxes?
• What is the impact on revenues and expenditures of changing the operating hours of the Dept. of Motor Vehicles?
• What is the economic impact of the small business initiative in our district?

Page 5: ITCS 6163 Data Warehousing

Overview: Data Warehousing and OLAP Technology for Data Mining

What is a data warehouse?

Why a data warehouse?

A multi-dimensional data model

Data warehouse architecture

Data warehouse implementation

From data warehousing to data mining

Page 6: ITCS 6163 Data Warehousing

What is a Warehouse?

• Collection of diverse data
  – subject oriented
  – aimed at executive, decision maker
  – often a copy of operational data
  – with value-added data (e.g., summaries, history)
  – integrated
  – time-varying
  – non-volatile

more

Page 7: ITCS 6163 Data Warehousing

What is a Warehouse?

• Collection of tools
  – gathering data
  – cleansing, integrating, ...
  – querying, reporting, analysis
  – data mining
  – monitoring, administering warehouse

Page 8: ITCS 6163 Data Warehousing

What is a Warehouse?

• Defined in many different ways, but not rigorously.
  – A decision support database that is maintained separately from the organization’s operational database
  – Supports information processing by providing a solid platform of consolidated, historical data for analysis.
• “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” (W. H. Inmon)
• Data warehousing: the process of constructing and using data warehouses

Page 9: ITCS 6163 Data Warehousing

Data Warehouse—Subject-Oriented

• Organized around major subjects, such as customer, product, sales.
• Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.
• Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.

Page 10: ITCS 6163 Data Warehousing

Data Warehouse—Integrated

• Constructed by integrating multiple, heterogeneous data sources
  – relational databases, flat files, on-line transaction records
• Data cleaning and data integration techniques are applied.
  – Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources (e.g., hotel price: currency, tax, breakfast covered, etc.)
  – When data is moved to the warehouse, it is converted.

Page 11: ITCS 6163 Data Warehousing

Data Warehouse—Time Variant

• The time horizon for the data warehouse is significantly longer than that of operational systems.
  – Operational database: current value data.
  – Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
• Every key structure in the data warehouse
  – Contains an element of time, explicitly or implicitly
  – But the key of operational data may or may not contain “time element”.

Page 12: ITCS 6163 Data Warehousing

Data Warehouse—Non-Volatile

• A physically separate store of data transformed from the operational environment.
• Operational update of data does not occur in the data warehouse environment.
  – Does not require transaction processing, recovery, and concurrency control mechanisms
  – Requires only two operations in data accessing: initial loading of data and access of data.

Page 13: ITCS 6163 Data Warehousing

Warehouse is specialized DB

Standard DB                Warehouse
Mostly updates             Mostly reads
Many small transactions    Queries are long, complex
MB-TB of data              GB-TB of data
Current snapshot           History
Raw data                   Summarized, consolidated data
Clerical users             Decision-makers, analysts as users

Page 14: ITCS 6163 Data Warehousing

Data Warehouse vs. Heterogeneous DBMS

• Traditional heterogeneous DB integration:
  – Build wrappers/mediators on top of heterogeneous databases
  – Query-driven approach: when a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for the individual heterogeneous sites involved, and the results are integrated into a global answer set
  – Requires complex information filtering and competes for resources at the sources
• Data warehouse: update-driven, high performance
  – Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis
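The contrast between the two approaches can be sketched in a few lines of Python. This is only a toy illustration: the two in-memory "sources", their region names, and the amounts are made up, and real mediators and warehouses involve far more (schema translation, cleaning, refresh).

```python
# Two hypothetical sources holding sales amounts per region (made-up data).
sources = [{"east": 10, "west": 5}, {"east": 7, "north": 2}]

def query_driven(region):
    """Query-driven: visit every source at query time and merge the answers."""
    return sum(src.get(region, 0) for src in sources)

def build_warehouse():
    """Update-driven: integrate all sources in advance into one store."""
    store = {}
    for src in sources:
        for region, amount in src.items():
            store[region] = store.get(region, 0) + amount
    return store

warehouse = build_warehouse()      # done once, before any query arrives
answer_lazy = query_driven("east")  # touches both sources on every call
answer_eager = warehouse["east"]    # a single local lookup
```

Both paths give the same answer here; the difference is when the integration work happens and whether the sources are touched at query time.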

Page 15: ITCS 6163 Data Warehousing

Data Warehouse vs. Operational DBMS

• OLTP (on-line transaction processing)
  – Major task of traditional relational DBMS
  – Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
• OLAP (on-line analytical processing)
  – Major task of data warehouse system
  – Data analysis and decision making
• Distinct features (OLTP vs. OLAP):
  – User and system orientation: customer vs. market
  – Data contents: current, detailed vs. historical, consolidated
  – Database design: ER + application vs. star + subject
  – View: current, local vs. evolutionary, integrated
  – Access patterns: update vs. read-only but complex queries

Page 16: ITCS 6163 Data Warehousing

OLTP vs. OLAP

                    OLTP                             OLAP
users               clerk, IT professional           knowledge worker
function            day to day operations            decision support
DB design           application-oriented             subject-oriented
data                current, up-to-date, detailed,   historical, summarized, multidimensional,
                    flat relational, isolated        integrated, consolidated
usage               repetitive                       ad-hoc
access              read/write,                      lots of scans
                    index/hash on prim. key
unit of work        short, simple transaction        complex query
# records accessed  tens                             millions
# users             thousands                        hundreds
DB size             100MB-GB                         100GB-TB
metric              transaction throughput           query throughput, response

Page 17: ITCS 6163 Data Warehousing

Overview: Data Warehousing and OLAP Technology for Data Mining

What is a data warehouse?

Why a data warehouse?

A multi-dimensional data model

Data warehouse architecture

Data warehouse implementation

From data warehousing to data mining

Page 18: ITCS 6163 Data Warehousing

Why Separate Data Warehouse?

• High performance for both systems
  – DBMS, tuned for OLTP: access methods, indexing, concurrency control, recovery
  – Warehouse, tuned for OLAP: complex OLAP queries, multidimensional view, consolidation.
• Different functions and different data:
  – missing data: Decision support requires historical data which operational DBs do not typically maintain
  – data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources
  – data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled

Page 19: ITCS 6163 Data Warehousing

Warehouse Architecture

[Figure: warehouse architecture. Clients sit on top of a Query & Analysis layer over the Warehouse and its Metadata; an Integration layer connects the Warehouse to several Sources.]

Page 20: ITCS 6163 Data Warehousing

Why a Warehouse?

Two approaches:
• Query-Driven (Lazy)
• Warehouse (Eager)

[Figure: two Sources and a question mark]

Page 21: ITCS 6163 Data Warehousing

Query-Driven Approach

[Figure: query-driven architecture. Clients query a Mediator, which reaches each Source through its own Wrapper.]

Page 22: ITCS 6163 Data Warehousing

Advantages of Warehousing

• High query performance
• Queries not visible outside warehouse
• Local processing at sources unaffected
• Can operate when sources unavailable
• Can query data not stored in a DBMS
• Extra information at warehouse
  – Modify, summarize (store aggregates)
  – Add historical information

Page 23: ITCS 6163 Data Warehousing

Advantages of Query-Driven

• No need to copy data
  – less storage
  – no need to purchase data
• More up-to-date data
• Query needs can be unknown
• Only query interface needed at sources

Page 24: ITCS 6163 Data Warehousing

Overview: Data Warehousing and OLAP Technology for Data Mining

What is a data warehouse?

Why a data warehouse?

A multi-dimensional data model

Data warehouse architecture

Data warehouse implementation

From data warehousing to data mining

Page 25: ITCS 6163 Data Warehousing

Modeling OLTP Systems

• Goal: update as many transactions as possible in the shortest period of time
• Approach
  – Model to 3rd Normal Form (3NF)
  – Minimize redundancy to optimize update
• Result
  – Creates many (hundreds of) tables
  – Difficult for business users to understand and use
  – Retrieval requires many JOINs, which means lousy performance

Page 26: ITCS 6163 Data Warehousing

Modeling the Data Warehouse

• Tuning the relational model
• Denormalize
  – Reduces the number of tables
  – Improves usability
  – Improves performance
• Add aggregate data (typically separate tables)
  – Improves performance
  – Degrades usability

Page 27: ITCS 6163 Data Warehousing

Modeling the Data Warehouse

“Entity relation data models are a disaster for querying because they cannot be understood by users and they cannot be navigated usefully by DBMS software. Entity relation models cannot be used as the basis for enterprise data warehouses.”

Ralph Kimball, The Data Warehouse Toolkit,

1996, John Wiley & Sons, Inc., ISBN 0-471-15337-0

Page 28: ITCS 6163 Data Warehousing

From Tables and Spreadsheets to Data Cubes

A data warehouse is based on a multidimensional data model which views data in the form of a data cube

A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions

Dimension tables, such as item (item_name, brand, type), or time(day, week, month, quarter, year)

Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables

In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.
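As a minimal sketch of the lattice idea, the Python below enumerates every cuboid for a given list of dimensions. It ignores concept hierarchies, so each cuboid is simply a subset of the dimensions and an n-D cube has 2^n cuboids; the helper name `cuboids` is made up for this example.

```python
from itertools import combinations

def cuboids(dimensions):
    """Group every subset of the dimensions by how many dimensions it keeps."""
    return {k: list(combinations(dimensions, k))
            for k in range(len(dimensions) + 1)}

dims = ["time", "item", "location", "supplier"]
lattice = cuboids(dims)

apex = lattice[0]           # [()] : the 0-D cuboid, everything summarized
base = lattice[len(dims)]   # the single 4-D cuboid with all dimensions
total = sum(len(level) for level in lattice.values())  # 2**4 = 16 cuboids
```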

Page 29: ITCS 6163 Data Warehousing

Cube: A Lattice of Cuboids

0-D (apex) cuboid: all
1-D cuboids: time; item; location; supplier
2-D cuboids: (time, item); (time, location); (time, supplier); (item, location); (item, supplier); (location, supplier)
3-D cuboids: (time, item, location); (time, item, supplier); (time, location, supplier); (item, location, supplier)
4-D (base) cuboid: (time, item, location, supplier)

Page 30: ITCS 6163 Data Warehousing

Conceptual Modeling of Data Warehouses

• Modeling data warehouses: dimensions & measures
  – Star schema: A fact table in the middle connected to a set of dimension tables
  – Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
  – Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation

Page 31: ITCS 6163 Data Warehousing

Example of Star Schema

Sales Fact Table
  keys: time_key, item_key, branch_key, location_key
  measures: units_sold, dollars_sold, avg_sales

Dimension tables
  time: time_key, day, day_of_the_week, month, quarter, year
  item: item_key, item_name, brand, type, supplier_type
  branch: branch_key, branch_name, branch_type
  location: location_key, street, city, province_or_state, country
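A cut-down version of this star schema can be tried out with Python's built-in `sqlite3` module. The sketch keeps only two of the four dimensions and a few columns; the table names `time_dim`, `item_dim`, and `sales_fact` and all rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
-- The fact table holds foreign keys to each dimension plus the measures.
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item_dim(item_key),
    units_sold INTEGER,
    dollars_sold REAL
);
INSERT INTO time_dim VALUES (1, 'Q1', 2000), (2, 'Q2', 2000);
INSERT INTO item_dim VALUES (10, 'TV', 'BrandA'), (11, 'VCR', 'BrandB');
INSERT INTO sales_fact VALUES (1, 10, 3, 900.0), (1, 11, 2, 300.0), (2, 10, 1, 300.0);
""")

# A typical star join: fact table joined to its dimensions, then aggregated.
rows = conn.execute("""
    SELECT t.quarter, i.item_name, SUM(f.dollars_sold)
    FROM sales_fact f
    JOIN time_dim t ON f.time_key = t.time_key
    JOIN item_dim i ON f.item_key = i.item_key
    GROUP BY t.quarter, i.item_name
    ORDER BY t.quarter, i.item_name
""").fetchall()
```

Every query against a star schema has this shape: the fact table in the middle, one join per dimension that the query filters or groups by.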

Page 32: ITCS 6163 Data Warehousing

Example of Snowflake Schema

Sales Fact Table
  keys: time_key, item_key, branch_key, location_key
  measures: units_sold, dollars_sold, avg_sales

Dimension tables
  time: time_key, day, day_of_the_week, month, quarter, year
  item: item_key, item_name, brand, type, supplier_key
    supplier: supplier_key, supplier_type
  branch: branch_key, branch_name, branch_type
  location: location_key, street, city_key
    city: city_key, city, province_or_state, country

Page 33: ITCS 6163 Data Warehousing

Example of Fact Constellation

Sales Fact Table
  keys: time_key, item_key, branch_key, location_key
  measures: units_sold, dollars_sold, avg_sales

Shipping Fact Table
  keys: time_key, item_key, shipper_key, from_location, to_location
  measures: dollars_cost, units_shipped

Dimension tables
  time: time_key, day, day_of_the_week, month, quarter, year
  item: item_key, item_name, brand, type, supplier_type
  branch: branch_key, branch_name, branch_type
  location: location_key, street, city, province_or_state, country
  shipper: shipper_key, shipper_name, location_key, shipper_type

Page 34: ITCS 6163 Data Warehousing

Multidimensional Data

• Sales volume as a function of product, month, and region

[Figure: cube with axes Product, Region, Month]

• Dimensions: Product, Location, Time
• Hierarchical summarization paths
  – Product: Industry > Category > Product
  – Location: Region > Country > City > Office
  – Time: Year > Quarter > Month / Week > Day

Page 35: ITCS 6163 Data Warehousing

A Sample Data Cube

[Figure: 3-D data cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, VCR, PC, sum), and Country (U.S.A., Canada, Mexico, sum). The highlighted cell is the total annual sales of TV in U.S.A.]

Page 36: ITCS 6163 Data Warehousing

Cuboids Corresponding to the Cube

0-D (apex) cuboid: all
1-D cuboids: product; date; country
2-D cuboids: (product, date); (product, country); (date, country)
3-D (base) cuboid: (product, date, country)

Page 37: ITCS 6163 Data Warehousing

Typical OLAP Operations

• Roll up (drill-up): summarize data by climbing up a hierarchy or by dimension reduction
• Drill down (roll down): reverse of roll-up
  – from higher level summary to lower level summary or detailed data, or introducing new dimensions
• Slice and dice: project and select
• Pivot (rotate): reorient the cube, visualization, 3D to series of 2D planes.
• Other operations
  – drill across: involving (across) more than one fact table
  – drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
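The first few operations can be sketched in plain Python on the small sale table used in the Aggregates slides of this deck. The function names `roll_up`, `slice_cube`, and `dice` are made up for this example, and the cube is represented simply as a list of (prodId, storeId, date, amt) records rather than a real multidimensional store.

```python
from collections import defaultdict

# (prodId, storeId, date, amt), as in the sale table of the Aggregates slides
sale = [("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
        ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4)]

def roll_up(records, keep):
    """Roll up by dimension reduction: sum amt over all dimensions not in keep."""
    out = defaultdict(int)
    for rec in records:
        out[tuple(rec[i] for i in keep)] += rec[3]
    return dict(out)

def slice_cube(records, dim, value):
    """Slice: fix one dimension to a single value."""
    return [r for r in records if r[dim] == value]

def dice(records, allowed):
    """Dice: restrict several dimensions to sets of allowed values."""
    return [r for r in records
            if all(r[d] in values for d, values in allowed.items())]

by_product = roll_up(sale, keep=(0,))            # {('p1',): 110, ('p2',): 19}
day1 = slice_cube(sale, dim=2, value=1)          # the four day-1 records
p1_c1_c2 = dice(sale, {0: {"p1"}, 1: {"c1", "c2"}})
```

Drill-down is the reverse of `roll_up`: call it again with more dimensions in `keep`, for example `keep=(0, 2)` to see amounts by product and day.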

Page 38: ITCS 6163 Data Warehousing

Relational Operators

• Select
• Project
• Join

Page 39: ITCS 6163 Data Warehousing

Aggregates

sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4

• Add up amounts by day
• In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date

ans
date  sum
1     81
2     48

Page 40: ITCS 6163 Data Warehousing

Another Example

sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4

• Add up amounts by day, product
• In SQL: SELECT prodId, date, sum(amt) FROM SALE GROUP BY date, prodId

result
prodId  date  amt
p1      1     62
p2      1     19
p1      2     48

Going from the by-day totals to this by-day, by-product result is a drill-down; going back is a rollup.
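Both aggregate queries can be run against the sale table with Python's built-in `sqlite3` module. This is a small sketch for experimenting; the ORDER BY clauses are added here only to make the result order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)")
conn.executemany(
    "INSERT INTO sale VALUES (?, ?, ?, ?)",
    [("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
     ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4)],
)

# Rolled-up view: one row per day.
by_day = conn.execute(
    "SELECT date, SUM(amt) FROM sale GROUP BY date ORDER BY date"
).fetchall()

# Drilled-down view: one row per (day, product).
by_day_product = conn.execute(
    "SELECT prodId, date, SUM(amt) FROM sale "
    "GROUP BY date, prodId ORDER BY date, prodId"
).fetchall()
```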

Page 41: ITCS 6163 Data Warehousing

Aggregates

• Operators: sum, count, max, min, median, avg
• Types: distributive, algebraic, holistic
• “Having” clause
• Using dimension hierarchy
  – average by region (within store)
  – maximum by month (within date)
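The three aggregate types differ in whether a result can be assembled from results on partitions of the data. A short Python sketch, using the day-1 and day-2 amounts of the sale table as the two partitions:

```python
from statistics import median

day1 = [12, 11, 50, 8]   # amounts for date 1
day2 = [44, 4]           # amounts for date 2

# Distributive (sum, count, max, min): combine the partial results directly.
total = sum([sum(day1), sum(day2)])

# Algebraic (avg): combine a fixed number of distributive values, here (sum, count).
partials = [(sum(p), len(p)) for p in (day1, day2)]
average = sum(s for s, _ in partials) / sum(n for _, n in partials)

# Holistic (median): no fixed-size summary per partition is enough in general.
true_median = median(day1 + day2)
median_of_medians = median([median(day1), median(day2)])  # not the true median
```

This is why distributive and algebraic measures are cheap to roll up from precomputed cuboids, while holistic ones generally need the detailed data.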

Page 42: ITCS 6163 Data Warehousing

Cube Aggregation

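Example: computing sums

day 1   c1   c2   c3
p1      12        50
p2      11    8

day 2   c1   c2   c3
p1      44    4
p2

Rollup over day:
        c1   c2   c3
p1      56    4   50
p2      11    8

Rollup over day and product:
        c1   c2   c3
        67   12   50

Rollup over day and customer:
p1     110
p2      19

Rollup over everything: 129

Moving toward the total is a rollup; moving back toward the detailed cells is a drill-down.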

Page 43: ITCS 6163 Data Warehousing

Aggregation Using Hierarchies

day 1   c1   c2   c3
p1      12        50
p2      11    8

day 2   c1   c2   c3
p1      44    4
p2

Hierarchy: customer > region > country
(customer c1 in Region A; customers c2, c3 in Region B)

Summed over days, rolled up from customer to region:
        region A   region B
p1      56         54
p2      11          8
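Rolling up along the customer hierarchy can be sketched in Python by mapping each customer to its region before summing. The mapping comes from the slide (c1 in Region A; c2, c3 in Region B); the variable names are made up.

```python
from collections import defaultdict

# (prodId, customer, day, amt)
sale = [("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
        ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4)]

# One level of the customer > region > country hierarchy.
region_of = {"c1": "A", "c2": "B", "c3": "B"}

by_product_region = defaultdict(int)
for prod, cust, _day, amt in sale:
    # Replace the customer by its parent in the hierarchy, then sum.
    by_product_region[(prod, region_of[cust])] += amt
```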

Page 44: ITCS 6163 Data Warehousing

Pivoting

Fact table view:

sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4

Multi-dimensional cube:

day 1   c1   c2   c3
p1      12        50
p2      11    8

day 2   c1   c2   c3
p1      44    4
p2

Summed over days:
        c1   c2   c3
p1      56    4   50
p2      11    8
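Going from the fact table view to a 2-D cross-tab can be sketched as follows. The helper `pivot` is a made-up name; it picks one dimension for the rows and one for the columns and sums the measure over whatever is left.

```python
from collections import defaultdict

# (prodId, storeId, date, amt)
sale = [("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
        ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4)]

def pivot(records, row_dim, col_dim, measure=3):
    """Build a row -> column -> summed-measure table from flat fact records."""
    table = defaultdict(lambda: defaultdict(int))
    for rec in records:
        table[rec[row_dim]][rec[col_dim]] += rec[measure]
    return {row: dict(cols) for row, cols in table.items()}

product_by_store = pivot(sale, row_dim=0, col_dim=1)  # products as rows
store_by_day = pivot(sale, row_dim=1, col_dim=2)      # same data, rotated axes
```

Cells with no sales are simply absent from the result, which matches the empty cells in the tables above.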

Page 45: ITCS 6163 Data Warehousing

Overview: Data Warehousing and OLAP Technology for Data Mining

What is a data warehouse?

Why a data warehouse?

A multi-dimensional data model

Data warehouse architecture

Data warehouse implementation

From data warehousing to data mining

Page 46: ITCS 6163 Data Warehousing

Design of a Data Warehouse: A Business Analysis Framework

• Four views regarding the design of a data warehouse
  – Top-down view: allows selection of the relevant information necessary for the data warehouse
  – Data source view: exposes the information being captured, stored, and managed by operational systems
  – Data warehouse view: consists of fact tables and dimension tables
  – Business query view: sees the perspectives of data in the warehouse from the view of end-user

Page 47: ITCS 6163 Data Warehousing

Data Warehouse Design Process

• Top-down, bottom-up approaches or a combination of both
  – Top-down: Starts with overall design and planning (mature)
  – Bottom-up: Starts with experiments and prototypes (rapid)
• From software engineering point of view
  – Waterfall: structured and systematic analysis at each step before proceeding to the next
  – Spiral: rapid generation of increasingly functional systems, short turnaround time
• Typical data warehouse design process
  – Choose a business process to model, e.g., orders, invoices, etc.
  – Choose the grain (atomic level of data) of the business process
  – Choose the dimensions that will apply to each fact table record
  – Choose the measure that will populate each fact table record

Page 48: ITCS 6163 Data Warehousing

Multi-Tiered Architecture

[Figure: multi-tiered warehouse architecture, left to right.
  Data Sources: operational DBs and other sources, fed through Extract / Transform / Load / Refresh
  Data Storage: Data Warehouse and Data Marts, with a Monitor & Integrator and Metadata
  OLAP Server: OLAP Engine, served by the data storage tier
  Front-End Tools: analysis, query, reports, data mining]

Page 49: ITCS 6163 Data Warehousing

Three Data Warehouse Models

• Enterprise warehouse
  – collects all of the information about subjects spanning the entire organization
• Data Mart
  – a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart
  – Independent vs. dependent (directly from warehouse) data mart
• Virtual warehouse
  – A set of views over operational databases
  – Only some of the possible summary views may be materialized

Page 50: ITCS 6163 Data Warehousing

What is a Data Mart ?

A data mart is a small-scale data warehouse that is focused on a single department or single subject area to provide a subset of data warehouse data to address specific reporting and analysis requirements.

[Figure: departmental data marts: Budget, Finance, HR, Purchasing, Asset Mgmt., Info. Tech.]

• Smaller warehouses
• Spans part of organization
• Do not require enterprise-wide consensus
  – but long term integration problems?

Page 51: ITCS 6163 Data Warehousing

Data Warehouse Development: A Recommended Approach

[Figure: recommended development path. Define a high-level corporate data model; with model refinement along the way, build Data Marts and the Enterprise Data Warehouse; these lead to Distributed Data Marts and then a Multi-Tier Data Warehouse.]

Page 52: ITCS 6163 Data Warehousing

OLAP Server Architectures

• Relational OLAP (ROLAP)
  – Provides a multi-dimensional view of a relational DB (e.g., MicroStrategy)
  – Use relational or extended-relational DBMS to store and manage warehouse data and OLAP middleware to support missing pieces
  – Include optimization of DBMS backend, implementation of aggregation navigation logic, and additional tools and services
  – Greater scalability
• Multidimensional OLAP (MOLAP)
  – Array-based multidimensional storage engine (sparse matrix techniques)
  – Fast indexing to pre-computed summarized data
• Hybrid OLAP (HOLAP)
  – User flexibility, e.g., low level: relational, high-level: array
• Specialized SQL servers
  – Specialized support for SQL queries over star/snowflake schemas

Page 53: ITCS 6163 Data Warehousing

MOLAP Databases

• Data is stored using a proprietary format (MOLAP)
• Accessible only through the DB vendor’s tools
• Suitable only for summarized data
• Data may be summarized in advance or real-time
• Examples: PowerPlay, Holos, Essbase

Page 54: ITCS 6163 Data Warehousing

RDBMS: Indexing Strategies

• Select columns to be indexed:
  – Choose combinations of columns most often used to constrain queries (“where …” clause)
  – Queries must use constraining columns in the same order as the columns in the index
• Unique more efficient than non-unique.
• More indexes means faster query performance, but also longer transformation/load times.
• Types of indexes:
  – B-tree: many possible values (e.g., invoice number)
  – Bitmap: few possible values (e.g., M/F, S/M/D/W)
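To see why bitmaps suit columns with few possible values, here is a minimal Python sketch that uses one integer per distinct value as a bit vector over the rows. The two columns and their contents are made up, and real systems add compression and other refinements not shown here.

```python
def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set when row i holds it."""
    index = {}
    for row, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << row)
    return index

def rows_of(bitmap):
    """Return the row numbers whose bits are set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Hypothetical low-cardinality columns for five rows.
gender = ["M", "F", "F", "M", "F"]
marital = ["S", "M", "D", "M", "W"]

g = build_bitmap_index(gender)
m = build_bitmap_index(marital)

# WHERE gender = 'F' AND marital IN ('M', 'W'), answered with bitwise AND / OR.
hits = rows_of(g["F"] & (m["M"] | m["W"]))
```

Each distinct value costs one bit per row, which is cheap for M/F but impractical for something like an invoice number, where a B-tree is the better fit.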

Page 55: ITCS 6163 Data Warehousing

MOLAP versus ROLAP

MOLAP (Multidimensional OLAP)                         ROLAP (Relational OLAP)
Data stored in multi-dimensional cube                 Data stored in relational database as virtual cube
Transformation required                               No transformation needed
Data retrieved directly from cube for analysis        Data retrieved via SQL from database for analysis
Faster analytical processing                          Slower analytical processing
Cube size limitations                                 No size limitations

Page 56: ITCS 6163 Data Warehousing

Data Warehouse Usage

• Three kinds of data warehouse applications
  – Information processing: supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs
  – Analytical processing: multidimensional analysis of data warehouse data; supports basic OLAP operations, slice-dice, drilling, pivoting
  – Data mining: knowledge discovery from hidden patterns; supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools.
• Differences among the three tasks

Page 57: ITCS 6163 Data Warehousing

Data Mining & Forecasting

• Mining the Warehouse
  – Choose data population
  – Select mining technique
  – Segment data into groups
  – Identify data patterns
• Forecasting Data
  – Select trend data
  – Choose forecast model
  – Run forecast
  – Display predictions

Page 58: ITCS 6163 Data Warehousing

Accessing & Analyzing Data

• Query & Reporting: retrieving data directly from the warehouse and preparing it for presentation
• Online Analytical Processing (OLAP): analyzing aggregated data from a variety of perspectives
• Data Mining & Forecasting: analyzing and predicting data using mathematical models

Page 59: ITCS 6163 Data Warehousing

Query & Reporting

• Query the Data
  – Select & filter data
  – Retrieve results
• Report the Results
  – Sort & group data
  – Format & present data
• Save or Export Data
  – Save queries & reports
  – Export to other tools
  – Publish HTML pages

Page 60: ITCS 6163 Data Warehousing

Query & Reporting Tools

• Cognos Impromptu
• Business Objects
• Crystal Info
• BrioQuery
• IQ
• GQL
• SAS

Page 61: ITCS 6163 Data Warehousing

Online Analytical Processing

• Slice and Dice
  – Select dimensions
  – Choose measures
  – Filter by dimensions
• Drill Down
  – Drill down hierarchies
  – Drill through to details
• Present the Results
  – Present as spreadsheet
  – Display graphically

Page 62: ITCS 6163 Data Warehousing

OLAP Tools

• Cognos PowerPlay
• Business Analyzer
• Holos
• BrioAnalyzer
• MicroStrategy
• Oracle Express
• SAS
• Arbor Essbase

Page 63: ITCS 6163 Data Warehousing

Data Mining & Forecasting

• Mining the Warehouse
  – Choose data population
  – Select mining technique
  – Segment data into groups
  – Identify data patterns
• Forecasting Data
  – Select trend data
  – Choose forecast model
  – Run forecast
  – Display predictions

Page 64: ITCS 6163 Data Warehousing

Data Mining

sales records:

transaction id  customer id  products bought
tran1           cust33       p2, p5, p8
tran2           cust45       p5, p8, p11
tran3           cust12       p1, p9
tran4           cust40       p5, p8, p11
tran5           cust12       p2, p9
tran6           cust12       p9

• Trend: Products p5, p8 often bought together
• Trend: Customer 12 likes product p9
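Both trends can be found by simple counting. The Python sketch below tallies product pairs that occur in the same transaction and (customer, product) purchases over the six sales records; it is a brute-force illustration, not a scalable mining algorithm.

```python
from collections import Counter
from itertools import combinations

# transaction id -> (customer id, products bought), from the sales records above
transactions = {
    "tran1": ("cust33", {"p2", "p5", "p8"}),
    "tran2": ("cust45", {"p5", "p8", "p11"}),
    "tran3": ("cust12", {"p1", "p9"}),
    "tran4": ("cust40", {"p5", "p8", "p11"}),
    "tran5": ("cust12", {"p2", "p9"}),
    "tran6": ("cust12", {"p9"}),
}

pair_counts = Counter()
customer_product = Counter()
for customer, products in transactions.values():
    # Every unordered pair of products bought in the same transaction.
    pair_counts.update(combinations(sorted(products), 2))
    customer_product.update((customer, p) for p in products)

top_pair = pair_counts.most_common(1)[0]            # (('p5', 'p8'), 3)
top_customer_product = customer_product.most_common(1)[0]  # (('cust12', 'p9'), 3)
```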

Page 65: ITCS 6163 Data Warehousing

Mining and Forecasting Tools

• Scenario
• 4Thought
• Business Miner
• Clementine
• Darwin
• Holos
• SAS

Page 66: ITCS 6163 Data Warehousing

Data Warehouse Back-End Tools and Utilities

• Data extraction: get data from multiple, heterogeneous, and external sources
• Data cleaning: detect errors in the data and rectify them when possible
• Data transformation: convert data from legacy or host format to warehouse format
• Load: sort, summarize, consolidate, compute views, check integrity, and build indices and partitions
• Refresh: propagate the updates from the data sources to the warehouse

Page 67: ITCS 6163 Data Warehousing

Data Cleaning

• Of primary importance in warehouse creation
• One of the biggest problems
• Difficult to achieve
• Probability of one or many of the sources containing “dirty data” is high.
• Lots of manual intervention

Page 68: ITCS 6163 Data Warehousing

Data Cleaning Problems

Data quality problems

• Single-source
  – Schema level (poor schema design): uniqueness
  – Instance level (data entry errors): misspellings
• Multi-source
  – Schema level (heterogeneity): naming conflicts
  – Instance level (overlapping, contradicting data): inconsistent aggregation

Page 69: ITCS 6163 Data Warehousing

Multisource problems

• All the previous problems, plus:
• Schema differences (translation and integration)
  – E.g.: EmpID, CID, Sex = M/F, Sex = 0/1
• Instance level conflicts
  – Duplicate records, contradicting records
  – Different measures ($, Euros)
  – Different aggregation levels (weeks, quarters)
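A rough Python sketch of reconciling two sources with these schema differences follows. Everything in it is an assumption for illustration: that EmpID and CID identify the same entities, that 0 means M and 1 means F in the second source, and the sample records themselves. In practice such mappings must come from each source's documentation.

```python
# Hypothetical per-source descriptions: key field name and Sex encoding.
SOURCE_A = {"id_field": "EmpID", "sex_map": {"M": "M", "F": "F"}}
SOURCE_B = {"id_field": "CID", "sex_map": {0: "M", 1: "F"}}  # assumed coding

def normalize(record, source):
    """Translate one source record into the common warehouse format."""
    return {"id": record[source["id_field"]],
            "sex": source["sex_map"][record["Sex"]]}

def integrate(batches):
    """Merge records, dropping exact duplicates and collecting contradictions."""
    merged, conflicts = {}, []
    for source, records in batches:
        for rec in records:
            row = normalize(rec, source)
            seen = merged.get(row["id"])
            if seen is None:
                merged[row["id"]] = row
            elif seen != row:
                conflicts.append((seen, row))  # left for manual resolution
    return merged, conflicts

a = [{"EmpID": 1, "Sex": "M"}, {"EmpID": 2, "Sex": "F"}]
b = [{"CID": 2, "Sex": 1}, {"CID": 3, "Sex": 1}, {"CID": 1, "Sex": 1}]
merged, conflicts = integrate([(SOURCE_A, a), (SOURCE_B, b)])
```

Here record 2 appears in both sources with the same values and is kept once, while record 1 is reported as contradicting rather than silently overwritten.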

Page 70: ITCS 6163 Data Warehousing

Overview: Data Warehousing and OLAP Technology for Data Mining

What is a data warehouse?

Why a data warehouse?

A multi-dimensional data model

Data warehouse architecture

Data warehouse implementation

From data warehouse to data mining

Page 71: ITCS 6163 Data Warehousing

Data Mining: A KDD Process

Data mining: the core of knowledge discovery process.

[Figure: the KDD pipeline. Databases pass through Data Cleaning and Data Integration into the Data Warehouse; Selection yields Task-relevant Data; Data Mining is followed by Pattern Evaluation.]

Page 72: ITCS 6163 Data Warehousing

Steps of a KDD Process

• Learning the application domain: relevant prior knowledge and goals of application
• Creating a target data set: data selection
• Data cleaning and preprocessing (may take 60% of effort!)
• Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation
• Choosing functions of data mining: summarization, classification, regression, association, clustering
• Choosing the mining algorithm(s)
• Data mining: search for patterns of interest
• Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.
• Use of discovered knowledge

Page 73: ITCS 6163 Data Warehousing

Data Mining and Business Intelligence

[Figure: pyramid of layers, with increasing potential to support business decisions toward the top. Layers from top to bottom:]

• Making Decisions
• Data Presentation: Visualization Techniques
• Data Mining: Information Discovery
• Data Exploration: Statistical Analysis, Querying and Reporting
• Data Warehouses / Data Marts: OLAP, MDA
• Data Sources: Paper, Files, Information Providers, Database Systems, OLTP

Roles alongside the pyramid, from top to bottom: End User, Business Analyst, Data Analyst, DBA

Page 74: ITCS 6163 Data Warehousing

From OLAP to On Line Analytical Mining (OLAM)

• Why online analytical mining?
  – High quality of data in data warehouses: DW contains integrated, consistent, cleaned data
  – Available information processing structure surrounding data warehouses: ODBC, OLEDB, Web accessing, service facilities, reporting and OLAP tools
  – OLAP-based exploratory data analysis: mining with drilling, dicing, pivoting, etc.
  – On-line selection of data mining functions: integration and swapping of multiple mining functions, algorithms, and tasks.
• Architecture of OLAM

Page 75: ITCS 6163 Data Warehousing

An OLAM Architecture

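[Figure: four-layer OLAM architecture.
  Layer 4, User Interface: User GUI API; mining queries go in, mining results come out
  Layer 3, OLAP/OLAM: OLAM Engine and OLAP Engine, on top of the Data Cube API
  Layer 2, MDDB: multidimensional database with Meta Data, on top of the Database API (filtering & integration)
  Layer 1, Data Repository: Databases and Data Warehouse, populated through data cleaning, data integration, and filtering]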

Page 76: ITCS 6163 Data Warehousing

Summary

• Data warehouse: a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process
• A multi-dimensional model of a data warehouse
  – Star schema, snowflake schema, fact constellations
  – A data cube consists of dimensions & measures
• OLAP operations: drilling, rolling, slicing, dicing and pivoting
• OLAP servers: ROLAP, MOLAP, HOLAP
• From OLAP to OLAM