Data Warehouses Kathy S. Schwaig

Data Warehouses Kathy S. Schwaig. Outline Data Explosion Data Warehouses Multi-dimensional databases Portions of this presentation are adapted from

Data Warehouses

Kathy S. Schwaig

Outline

• Data Explosion
• Data Warehouses
• Multi-dimensional databases

Portions of this presentation are adapted from © J. Han, Simon Fraser University, Canada, 2000

Now that we have gathered so much data, what do we do with it?

“I never waste memory on things that can easily be stored and retrieved from elsewhere.”

- Albert Einstein

Data Explosion Problem

Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases.

We are drowning in data, but starving for knowledge!

What is a Data Warehouse?

An integrated and consistent store of subject-oriented data, structured for query and retrieval in order to support management decision making.

A data warehouse is where the information systems department puts data to be turned into information.

One cannot just dump masses of data into a disk drive and expect it to be usable.

Goal of Data Warehousing

Resolve enormous data access difficulties:

• Unavailable data hidden in transaction systems
• Delays as underpowered systems try to perform huge, complex queries
• Complex, user-hostile interfaces
• Difficulties in discovering patterns in large amounts of data
• Competition for computer resources between transaction systems and decision support systems

On-line Transaction Processing (OLTP)

Traditional database management systems (DBMS) are used for on-line transaction processing (OLTP).

• Order entry: update status field of order 445522
• Banking: transfer $100 from account 55779 to account 99321

Characteristics:

• detailed, up-to-date data
• structured, repetitive tasks
• short transactions
• read and/or update a few records
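The banking transfer above is a typical short OLTP transaction: it reads and updates only two records, and both updates must succeed or fail together. A minimal sketch using Python's built-in sqlite3 module follows; the `accounts` table, its columns, and the starting balances are assumptions made for illustration, not part of the original slides.

```python
import sqlite3

# Hypothetical schema and balances, invented for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany(
    "INSERT INTO accounts (id, balance) VALUES (?, ?)",
    [(55779, 500), (99321, 200)],
)
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from `src` to `dst` as one short transaction."""
    # Used as a context manager, the connection commits on success
    # and rolls back if an exception is raised inside the block.
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ?",
            (amount, src),
        )
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?",
            (amount, dst),
        )


transfer(conn, 55779, 99321, 100)
balances = dict(conn.execute("SELECT id, balance FROM accounts"))
```

Only two rows are touched and the transaction is over in two statements, which is the access pattern an operational system is tuned for.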

An OLTP example

You call retailer Lands’ End, where you have done business before. The exchange might be:

“Hi, this is Mr. Smith. I’d like to place an order.”
“Your phone number, please?”
“555-555-1212”
(Pulls up your file) “Yes, Mr. Smith. What can I help you with?”
“I’d like to order merchandise number 2222.”
“I see you were a little late last year in getting your Christmas presents ordered. Would you like some suggestions to get the process started earlier?”
“Sure, why not?”
“Last year, you bought your Aunt Jennifer a scarf. We have a lovely pair of gloves to match -- they are on special for only $19. Should I add those to your order?”
“Uh...sure.”
“And would you like the card to say the same as last year?”
“Yes, please.”

Decision Support versus Transaction Processing

Characteristics and usage patterns of operational systems (transaction processing systems) used to automate business processes and those of a Decision Support System are fundamentally different but linked. Why?

What is a Data Warehouse?

• Facility for integrating data
• Organizes and stores data for analytical processing from a historical perspective
• Maintained separately from the organization’s operational database

Data Warehouse Architecture

[Figure: Data sources (operational DBs, other sources) feed an Extract, Transform, Load, Refresh step into the data warehouse (with metadata) and data marts, which serve a DSS server used by tools for analysis, query, reports, and data mining.]

Characteristics of a Data Warehouse

• Subject-oriented
• Integrated
• Non-volatile
• Time-varying

1. Subject Oriented

• Oriented to the major subject areas of the corporation
  E.g. insurance company: customer, product, transaction, policy, claim, account
• Operational database and applications may be organized differently
  E.g. based on type of insurance: auto, life, medical, fire.

2. Integrated

Inconsistencies in encoding and naming conventions exist among data sources. Why?

Data are converted to consistent encoding and naming conventions when loaded into the warehouse.
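As an illustration of the conversion step, the sketch below maps inconsistent source encodings onto one warehouse convention. The source system names, the attribute being encoded, and the code values are all hypothetical, chosen only to show the idea.

```python
# Hypothetical per-source encodings of the same attribute.
SOURCE_ENCODINGS = {
    "billing": {"m": "M", "f": "F"},
    "claims": {"1": "M", "0": "F"},
    "crm": {"male": "M", "female": "F"},
}


def to_warehouse_code(source, raw_value):
    """Convert one source-specific value to the warehouse code."""
    mapping = SOURCE_ENCODINGS[source]
    key = str(raw_value).strip().lower()
    if key not in mapping:
        raise ValueError(f"unknown value {raw_value!r} from source {source!r}")
    return mapping[key]


records = [("billing", "M"), ("claims", 1), ("crm", "Female")]
converted = [to_warehouse_code(src, value) for src, value in records]
```

Rejecting unknown values instead of passing them through is one possible design choice; it keeps unconverted codes out of the warehouse.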

3. Non-Volatile

• Operational data regularly accessed and manipulated a record at a time.
  Update performed in operational environment.
• Warehouse data loaded and accessed.
  Update of data does not occur in the data warehouse environment.

4. Time Variant...

A data warehouse is a “time-variant” collection of data, meaning time is a variable in accessing the data.

Time Variant

• Time horizon for data longer than that of operational systems.
• Operational database contains current value data.
• Data warehouse data is a sophisticated series of snapshots.
• The key structure of operational data may or may not contain some element of time.
• The key structure of the data warehouse always contains some element of time.
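The difference in key structure can be sketched with two dictionaries: the operational store is keyed by account only and holds the current value, while the warehouse is keyed by account and snapshot date. The account number reuses the one from the banking example; the dates and balances are invented for illustration.

```python
from datetime import date

# Operational store: one current value per key, overwritten on update.
operational = {}

# Warehouse: the key always includes a time element, so loading a
# new snapshot never overwrites an earlier one.
warehouse = {}


def record_balance(account, balance, as_of):
    operational[account] = balance          # update in place
    warehouse[(account, as_of)] = balance   # add a snapshot


record_balance(55779, 500, date(2000, 1, 31))
record_balance(55779, 400, date(2000, 2, 29))

current = operational[55779]
history = sorted(
    (as_of, bal) for (acct, as_of), bal in warehouse.items() if acct == 55779
)
```

After the two loads, the operational store knows only the latest balance, while the warehouse can still answer what the balance was at the end of January.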

Data Mart

A data mart is a smaller version of a data warehouse, typically containing data related to a single functional area of the firm or having limited scope in some other way.

It can be a useful first step to a full-scale data warehouse.

Data: The Critical Issue

Users need to gather, analyze, report on business information to help organizations gain competitive advantage.

Most companies have a wealth of legacy data.

Worthless if:

• existence unknown
• cannot be found
• cannot be understood
• incorrect

Data Transformation

Simple transformation -- e.g. change data type of field from integer to character

Cleansing & scrubbing -- consistent format, valid values

Integration -- take data from multiple sources and map it field by field into the data warehouse.

Aggregation / summarization
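The four kinds of transformation can be sketched as small functions applied in sequence. The field names, source layouts, and values below are hypothetical.

```python
from collections import defaultdict

# Two hypothetical sources with different field names and formats.
source_a = [{"cust": 17, "region": " west ", "amt": "120.50"}]
source_b = [{"customer_id": "17", "area": "WEST", "sale": 79.5},
            {"customer_id": "42", "area": "East", "sale": 10.0}]


def simple_transform(value):
    """Simple transformation: change data type (integer to character)."""
    return str(value)


def cleanse(region):
    """Cleansing: consistent format and valid values only."""
    region = region.strip().title()
    if region not in {"East", "West"}:
        raise ValueError(f"invalid region: {region!r}")
    return region


def integrate():
    """Integration: map each source field by field into one layout."""
    rows = []
    for r in source_a:
        rows.append({"customer_id": simple_transform(r["cust"]),
                     "region": cleanse(r["region"]),
                     "amount": float(r["amt"])})
    for r in source_b:
        rows.append({"customer_id": simple_transform(r["customer_id"]),
                     "region": cleanse(r["area"]),
                     "amount": float(r["sale"])})
    return rows


def aggregate(rows):
    """Aggregation: summarize amount by region."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


totals_by_region = aggregate(integrate())
```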

Sample Operations

• Roll up -- summarize data
  Total sales volume last year by product category by region
• Roll down, drill down, drill through -- go from higher level summary to lower level summary or detailed data
  For a particular product category, find the detailed sales data for each salesperson by date
• Slice and dice
  Sales of beverages in the West over the last 6 months
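These operations can be sketched on a small fact table held as a list of tuples. The rows and the (category, region, salesperson, month, amount) layout are invented for illustration.

```python
from collections import defaultdict

# Hypothetical facts: (category, region, salesperson, month, amount)
facts = [
    ("Beverages", "West", "Lee", 1, 100),
    ("Beverages", "West", "Kim", 2, 150),
    ("Beverages", "East", "Lee", 2, 80),
    ("Snacks", "West", "Kim", 1, 60),
]


def roll_up(rows):
    """Roll up: total amount by category and region."""
    totals = defaultdict(int)
    for category, region, _person, _month, amount in rows:
        totals[(category, region)] += amount
    return dict(totals)


def drill_down(rows, category):
    """Drill down: detail rows for one category, by salesperson and month."""
    return sorted((person, month, amount)
                  for cat, _region, person, month, amount in rows
                  if cat == category)


def slice_and_dice(rows, category, region, months):
    """Slice and dice: fix category and region, restrict the months."""
    return sum(amount
               for cat, reg, _person, month, amount in rows
               if cat == category and reg == region and month in months)


summary = roll_up(facts)
detail = drill_down(facts, "Beverages")
west_beverages = slice_and_dice(facts, "Beverages", "West", {1, 2})
```

Roll up discards the salesperson and month detail, drill down brings it back for one category, and slice and dice fixes values on some dimensions while ranging over another.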

Why Multi-Dimensional Databases?

• No single "best" data structure for all applications within an enterprise.
• Need good conceptual fit with the way end-users visualize business data
  • Most business people already think about their businesses in multidimensional terms
  • Managers tend to ask questions about product sales in different markets over specific time periods

Adapted from Arun Rai 1999

What is a Multi-Dimensional Database?

A multidimensional database (MDD) is a computer software system designed for the efficient and convenient storage and retrieval of large volumes of data that are:

(1) intimately related
(2) stored, viewed and analyzed from different perspectives.

These perspectives are called dimensions.

Contrasting Relational and Multi-Dimensional Models: An Example

SALES VOLUMES FOR GLEASON DEALERSHIP

MODEL         COLOR  SALES VOLUME
MINI VAN      BLUE   6
MINI VAN      RED    5
MINI VAN      WHITE  4
SPORTS COUPE  BLUE   3
SPORTS COUPE  RED    5
SPORTS COUPE  WHITE  5
SEDAN         BLUE   4
SEDAN         RED    3
SEDAN         WHITE  2

Sales Volumes

                 COLOR
MODEL      Blue  Red  White
Mini Van   6     5    4
Coupe      3     5    5
Sedan      4     3    2
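The same nine facts can be held either way. A minimal sketch that pivots the relational rows from the Gleason table into a two-dimensional array indexed by model and color; the variable names are illustrative.

```python
# Relational form: one record per (model, color) combination.
rows = [
    ("MINI VAN", "BLUE", 6), ("MINI VAN", "RED", 5), ("MINI VAN", "WHITE", 4),
    ("SPORTS COUPE", "BLUE", 3), ("SPORTS COUPE", "RED", 5),
    ("SPORTS COUPE", "WHITE", 5),
    ("SEDAN", "BLUE", 4), ("SEDAN", "RED", 3), ("SEDAN", "WHITE", 2),
]

models = ["MINI VAN", "SPORTS COUPE", "SEDAN"]
colors = ["BLUE", "RED", "WHITE"]

# Multidimensional form: position along each dimension replaces
# the repeated model and color values.
matrix = [[0] * len(colors) for _ in models]
for model, color, volume in rows:
    matrix[models.index(model)][colors.index(color)] = volume

# A cell lookup by position, and a total along one dimension.
sedan_red = matrix[models.index("SEDAN")][colors.index("RED")]
blue_total = sum(row[colors.index("BLUE")] for row in matrix)
```

In the array form, a total for one color is a walk down a column, whereas the relational form has to scan every record and test its color field.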

Multidimensional Representation

[Figure: Sales Volumes cube with three dimensions: MODEL (Mini Van, Coupe, Sedan), COLOR (Blue, Red, White), and DEALERSHIP (Clyde, Gleason, Carr).]

View Data – An Example

[Figure: Sales Volumes cube with dimensions MODEL, COLOR, and DEALERSHIP.]

• Assume that each dimension has 10 positions, as shown in the cube above
• How many records would there be in a relational table?
• Implications for viewing data from an end-user standpoint?
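The record count in the question above can be worked out directly: a relational table needs one row per combination of positions. A small sketch, assuming every combination has a sales figure (a fully populated cube), with placeholder position labels:

```python
from itertools import product

positions = 10
models = [f"model_{i}" for i in range(positions)]
colors = [f"color_{i}" for i in range(positions)]
dealerships = [f"dealer_{i}" for i in range(positions)]

# Relational view: one (model, color, dealership) row per cell.
relational_rows = list(product(models, colors, dealerships))
row_count = len(relational_rows)

# Cube view: the same cells are addressed by 10 + 10 + 10 labels.
label_count = len(models) + len(colors) + len(dealerships)
```

Under that assumption the table has 10 x 10 x 10 = 1,000 records, while the cube presents the same data through 30 dimension labels.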

Data Warehousing and The World Wide Web

•Access and transfer large amounts of data relatively easily and economically

• Integration of external data into data warehouse

•Issues of data integrity, accuracy, quality

•Quality rating versus price

Applications

•Data Mining

•Data Visualization

(Coming Next)

Summary

•Data versus Information
•Data Warehouse Architecture
•Characteristics
•Applications

Appendix: Operational Data Store and Data Warehouse Characteristics

Characteristic | Operational Data Store | Data Warehouse
How is it built? | One application or subject area at a time. | Typically multiple subject areas at a time.
User requirements | Well defined prior to logical design. | Often vague and conflicting.
Area of support | Day-to-day business operations. | Decision support for managerial activities.
Type of access | Relatively small number of records retrieved via a single query. | Large data sets scanned to retrieve results from either single or multiple queries.
Volume of data | Similar to typical daily volume of operational transactions. | Much larger than typical daily transaction volume.
Frequency of access | Tuned for frequent access to small amounts of data. | Tuned for infrequent access to large amounts of data.

Appendix: Operational Data Store and Data Warehouse Characteristics (cont’d)

Characteristic | Operational Data Store | Data Warehouse
Retention period | Retained as necessary to meet daily operating requirements. | Retention period is indeterminate and must support historical reporting, comparison, and analysis.
Currency of data | Up-to-the-minute; real time. | Typically represents a static point in time.
Availability of data | High and immediate availability may be required. | Immediate availability is less critical.
Typical unit of analysis | Small, manageable, transaction-level units. | Large, unpredictable, variable units.
Design focus | High-performance, limited flexibility. | High flexibility, high-performance.

Appendix: Characteristics of a Data Warehouse

• Subject orientation. Data are organized based on how the users refer to them.
• Integrated. All inconsistencies regarding naming convention and value representations are removed.
• Nonvolatile. Data are stored in read-only format and do not change over time.
• Time variant. Data are not current but normally time-series.
• Summarized. Operational data are mapped into a decision-usable format.
• Large volume. Time-series data sets are normally quite large.
• Not normalized. DW data can be, and often are, redundant.
• Metadata. Data about data are stored.
• Data sources. Data come from internal and external unintegrated operational systems.