A COMPARISON OF DATA WAREHOUSE DESIGN MODELS

A MASTER’S THESIS

in

Computer Engineering

Atilim University

by

BERIL PINAR BAŞARAN

JANUARY 2005

Evaluation notes were added to the output document. To get rid of these notes, please order your copy of ePrint IV now.

Page 2: 140

i

A COMPARISON OF DATA WAREHOUSE DESIGN MODELS

A THESIS SUBMITTED TO

THE GRADUATE SCHOOL OF NATURAL AND APPLIED SCIENCES

OF

ATILIM UNIVERSITY

BY

BERIL PINAR BAŞARAN

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE

DEGREE OF

MASTER OF SCIENCE

IN

THE DEPARTMENT OF COMPUTER ENGINEERING

JANUARY 2005


Approval of the Graduate School of Natural and Applied Sciences

_____________________

Prof. Dr. Ibrahim Akman

Director

I certify that this thesis satisfies all the requirements as a thesis for the degree of Master of Science.

_____________________

Prof. Dr. Ibrahim Akman

Head of Department

This is to certify that we have read this thesis and that in our opinion it is fully adequate, in scope and quality, as a thesis for the degree of Master of Science.

_____________________ _____________________

Prof. Dr. Ali Yazici Dr. Deepti Mishra

Co-Supervisor Supervisor

Examining Committee Members

Prof. Dr. Ali Yazici _____________________

Dr. Deepti Mishra _____________________

Asst. Prof. Dr. Nergiz E. Çağıltay _____________________

Dr. Ali Arifoğlu _____________________

Asst. Prof. Dr. Çiğdem Turhan _____________________


ABSTRACT

A COMPARISON OF DATA WAREHOUSE DESIGN MODELS

Başaran, Beril Pınar

M.S., Computer Engineering Department

Supervisor: Dr. Deepti Mishra

Co-Supervisor: Prof. Dr. Ali Yazici

January 2005, 90 pages

There are a number of approaches to designing a data warehouse, in both the conceptual and logical design phases. The generally accepted conceptual design approaches are the dimensional fact model, the multidimensional E/R model, the starER model and the object-oriented multidimensional model. In the logical design phase, the flat, terraced, star, fact constellation, galaxy, snowflake, star cluster and starflake schemas are widely used. This thesis presents a comparison of both the conceptual and the logical design models, and a sample data warehouse design and implementation is provided. It is observed that in the conceptual design phase the object-oriented model provides the best solution, while in the logical design phase the star schema is generally the best in terms of performance and the snowflake schema is generally the best in terms of redundancy.

Keywords: Data Warehouse, Design Methodologies, DF, starER, ME/R, OOMD,

flat schema, terraced schema, star schema, fact constellation schema, galaxy schema,

snowflake schema, star cluster schema, starflake schema, DTS, Data Analyzer


ÖZ

A COMPARISON OF DATA WAREHOUSE DESIGN MODELS

Başaran, Beril Pınar

M.S., Computer Engineering Department

Supervisor: Dr. Deepti Mishra

Co-Supervisor: Prof. Dr. Ali Yazici

January 2005, 90 pages

There is more than one approach for the conceptual and logical design phases of data warehouse design. The generally accepted approaches for the conceptual design phase are the dimensional fact, multidimensional E/R, starER and object-oriented multidimensional models. The generally accepted approaches for the logical design phase are the flat, terraced, star, fact constellation, galaxy, snowflake, star cluster and starflake schemas. This thesis compares the conceptual and logical design models and includes a sample data warehouse design and implementation. It is observed that the object-oriented multidimensional model is the best solution in the conceptual design phase, and that in the logical design phase the star schema is the best in terms of performance and the snowflake schema is the best in terms of data redundancy.

Keywords: Data Warehouse, Design Methodologies, DF, starER, ME/R, OOMD, flat schema, terraced schema, star schema, fact constellation schema, galaxy schema, snowflake schema, star cluster schema, starflake schema, DTS, Data Analyzer


To my dear husband, with thanks for his endless support.


ACKNOWLEDGEMENTS

First, I would like to thank my thesis advisor Dr. Deepti MISHRA and co-supervisor Prof. Dr. Ali YAZICI for their guidance, insight and encouragement throughout the study.

I would also like to express my appreciation to the examining committee members Asst. Prof. Dr. Nergiz E. ÇAĞILTAY, Dr. Ali ARIFOĞLU and Asst. Prof. Dr. Çiğdem TURHAN for their valuable suggestions and comments.

Finally, I would like to thank my husband for his assistance and encouragement, and all the members of my family for their patience, sympathy and support during the study.


TABLE OF CONTENTS

ABSTRACT ..........................................................................................................................iii

ÖZ........................................................................................................................................... iv

ACKNOWLEDGEMENTS.................................................................................................. vi

TABLE OF CONTENTS.....................................................................................................vii

LIST OF TABLES .................................................................................................................x

LIST OF FIGURES...............................................................................................................xi

LIST OF ABBREVIATIONS ............................................................................................xiii

CHAPTER

1 INTRODUCTION.............................................................................................................. 1

1.1. Scope and outline of the thesis................................................................................... 2

2 DATA WAREHOUSE CONCEPTS................................................................................. 3

2.1. Definition of Data Warehouse ................................................................................... 3

2.2. Why OLAP systems must run with OLTP ................................................................ 5

2.3. Requirements for Data Warehouse Database Management Systems...................... 8

3 FUNDAMENTALS OF DATA WAREHOUSE............................................................ 10

3.1. Data acquisition......................................................................................................... 12

3.1.1. Extraction, Cleansing and Transformation Tools............................................ 13

3.2. Data Storage and Access .......................................................................................... 13

3.3. Data Marts ................................................................................................................. 14

4 DESIGNING A DATA WAREHOUSE.......................................................................... 16

4.1. Beginning with Operational Data ............................................................................ 16

4.2. Data/Process Models ................................................................................................ 18

4.3. The DW Data Model ................................................................................................ 19

4.3.1. High-Level Modeling ........................................................................................ 19

4.3.2. Mid-Level Modeling ......................................................................................... 21


4.3.3. Low-Level Modeling......................................................................................... 23

4.4. Database Design Methodology for DW .................................................................. 24

4.5. Conceptual Design Models ...................................................................................... 27

4.5.1. The Dimensional Fact Model............................................................................ 27

4.5.2. Multidimensional E/R Model ........................................................................... 30

4.5.3. starER ................................................................................................................. 33

4.5.4. Object-Oriented Multidimensional Model (OOMD) ...................................... 35

4.6. Logical Design Models............................................................................................. 36

4.6.1. Dimensional Model Design .............................................................................. 37

4.6.2. Flat Schema........................................................................................................ 39

4.6.3. Terraced Schema ............................................................................................... 40

4.6.4. Star Schema........................................................................................................ 41

4.6.5. Fact Constellation Schema................................................................................ 43

4.6.6. Galaxy Schema .................................................................................................. 43

4.6.7. Snowflake Schema ............................................................................................ 44

4.6.8. Star Cluster Schema .......................................................................................... 45

4.6.9. Starflake Schema ............................................................................................... 47

4.6.10. Cube.................................................................................................................. 48

4.7. Meta Data .................................................................................................................. 53

4.8. Materialized views .................................................................................................... 53

4.9. OLAP Server Architectures ..................................................................................... 54

5 COMPARISON OF MULTIDIMENSIONAL DESIGN MODELS............................. 56

5.1. Comparison of Dimensional Models and ER Models ............................................ 56

5.2. Comparison of Dimensional Models and Object-Oriented Models ...................... 57

5.3. Comparison of Conceptual Multidimensional Models........................................... 58

5.4. Comparison of Logical Design Models................................................................... 60

5.5. Discussion on Data Warehousing Design Tools..................................................... 61

6 IMPLEMENTING A DATA WAREHOUSE................................................................. 64

6.1. A Case Study............................................................................................................. 64

6.2. OOMD Approach...................................................................................................... 65

6.3. starER Approach....................................................................................................... 68


6.4. ME/R Approach ........................................................................................................ 70

6.5. DF Approach............................................................................................................. 72

6.6. Implementation Details............................................................................................. 74

7 CONCLUSIONS AND FUTURE WORK...................................................................... 83

7.1. Contributions of the Thesis ...................................................................................... 85

7.2. Future Work .............................................................................................................. 86

REFERENCES..................................................................................................................... 87


LIST OF TABLES

TABLE

2.1 Comparison of OLTP and OLAP ..................................................................................... 7

4.1 2-dimensional pivot view of an OLAP Table ............................................................. 49

4.2 3-dimensional pivot view of an OLAP Table ............................................................. 49

5.1 Comparison of ER, DM and OO methodologies ......................................................... 58

5.2 Comparison of conceptual design models .................................................................... 60

5.3 Comparison of logical design models........................................................................... 61


LIST OF FIGURES

FIGURE

2.1 Consolidation of OLTP information............................................................................... 4

2.2 Same attribute with different formats in different sources............................................ 4

2.3 Simple comparison of OLTP and DW systems ............................................................. 5

3.1 Architecture of DW........................................................................................................ 10

4.1 Data Extraction............................................................................................................... 16

4.2 Data Integration.............................................................................................................. 17

4.3 Same data, different usage............................................................................................. 17

4.4 A Simple ERD for a manufacturing environment........................................................ 20

4.5 Corporate ERD created by departmental ERDs........................................................... 20

4.6 Relationship between ERD and DIS............................................................................. 21

4.7 Midlevel model members .............................................................................................. 21

4.8 A Midlevel model sample.............................................................................................. 22

4.9 Corporate DIS formed by departmental DISs. ............................................................. 23

4.10 An example of a departmental DIS............................................................................. 23

4.11 Considerations in low-level modeling ........................................................................ 24

4.12 A dimensional fact schema sample............................................................................. 28

4.13 The graphical notation of ME/R elements.................................................................. 31

4.14 Multiple cubes sharing dimensions on different levels ............................................. 32

4.15 Combining ME/R notations with E/R......................................................................... 33

4.16 Notation used in starER.............................................................................................. 33

4.17 A sample DW model using starER ............................................................................. 35

4.18 Flat Schema ................................................................................................................. 40

4.19 Terraced Schema......................................................................................................... 41

4.20 Star Schema................................................................................................................. 42


4.21 Fact Constellation Schema......................................................................................... 43

4.22 Galaxy Schema............................................................................................................ 44

4.23 Snowflake Schema...................................................................................................... 45

4.24 Star Schema with “fork” .............................................................................................. 46

4.25 Star Cluster Schema.................................................................................................... 47

4.26 Starflake Schema......................................................................................................... 47

4.27 Comparison of schemas............................................................................................... 48

4.28 3-D Realization of a Cube ........................................................................................... 50

4.29 Operations on a Cube................................................................................................... 52

6.1 ER model of sales and shipping systems...................................................................... 65

6.2 Use case diagram of sales and shipping system........................................................... 66

6.3 Statechart diagram of sales and shipping system......................................................... 67

6.4 Static structure diagram of sales and shipping system ................................................ 67

6.5 Sales subsystem starER model..................................................................................... 69

6.6 Shipping subsystem starER model................................................................................ 70

6.7 Sales subsystem ME/R model....................................................................................... 71

6.8 Shipping subsystem ME/R model................................................................................. 72

6.9 Sales subsystem DF model............................................................................................ 73

6.10 Shipping subsystem DF model.................................................................................... 73

6.11 Snowflake schema for the sales subsystem............................................................... 74

6.12 Snowflake schema for the shipping subsystem......................................................... 75

6.13 General architecture of the case study........................................................................ 75

6.14 Sales DTS Package ...................................................................................................... 77

6.15 Shipping DTS Package ................................................................................................ 77

6.16 Transformation details for delimited text file ........................................................... 78

6.17 Transact-SQL query as the transformation source.................................................... 79

6.18 Pivot Chart using Excel as client ............................................................................... 80

6.19 Pivot Table using Excel as client ............................................................................... 80

6.20 Data Analyzer as client............................................................................................... 81


LIST OF ABBREVIATIONS

3GL - Third Generation Language

4GL - Fourth Generation Language

DAG - Directed Acyclic Graph

DB - Database

DBMS - Database Management Systems

DDM - Data Dimensional Modeling

DF - Dimensional Fact

DIS - Data Item Set

DSS - Decision Support System

DTS - Data Transformation Services

DW - Data Warehouse

ER - Entity Relationship

ERD - Entity Relationship Diagram

ETL - Extract, Transform, Load

HOLAP - Hybrid OLAP

I/O - Input/Output

IT - Information Technology

ME/R - Multidimensional E/R

MOLAP - Multidimensional OLAP

ODBC - Open Database Connectivity

OID - Object Identifier

OLAP - Online Analytical Processing

OLTP - Online Transaction Processing

OO - Object Oriented


OOMD - Object Oriented Multidimensional

RDBMS - Relational Database Management Systems

ROLAP - Relational OLAP

SQL - Structured Query Language

UML - Unified Modeling Language

XML - Extensible Markup Language


CHAPTER 1

INTRODUCTION

Information is an asset that provides benefit and competitive advantage to any organization. Today, nearly every corporation has a relational database management system that is used for the organization’s daily operations, and companies desire to increase the value of their organizational data by turning it into actionable information. As the amount of organizational data increases, it becomes harder to access it and to get the most information out of it, because it is in different formats, exists on different platforms and resides in different structures. Organizations have to write and maintain several programs to consolidate data for analysis and reporting. Furthermore, corporate decision-makers require access to all of the organization’s data at any level, which may mean modifying existing consolidation programs or developing new ones. This process is costly, inefficient and time consuming for an organization.

Data warehousing provides an excellent approach to transforming operational data into useful and reliable information that supports the decision making process, and it also provides the basis for data analysis techniques such as data mining and multidimensional analysis. The data warehousing process involves extracting data from heterogeneous data sources; cleaning, filtering and transforming the data into a common structure; and storing the data in a structure that is easily accessed and used for reporting and analysis purposes.
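As a sketch of this extract-clean-transform-load flow, the following minimal pipeline consolidates two hypothetical feeds into one common structure. The file formats, field names and date formats here are invented for illustration and are not taken from the thesis case study:

```python
import csv
import io
from datetime import datetime

# Two hypothetical raw feeds from heterogeneous sources: a CSV export
# and a pipe-delimited legacy extract, with different date formats.
CSV_FEED = "order_id,order_date,amount\n1001,2004-12-30,150.00\n1002,2004-12-31,80.50\n"
LEGACY_FEED = "1003|30/12/2004|99.90"

def extract_csv(text):
    """Extraction: read rows out of the CSV source."""
    return [(r["order_id"], r["order_date"], r["amount"])
            for r in csv.DictReader(io.StringIO(text))]

def extract_legacy(text):
    """Extraction: read rows out of the pipe-delimited legacy source."""
    return [tuple(line.split("|")) for line in text.strip().splitlines()]

def transform(row, date_format):
    """Cleansing/transformation: coerce a raw row into the common
    warehouse structure, with one shared date representation."""
    order_id, date_str, amount = row
    date_key = datetime.strptime(date_str, date_format).strftime("%Y%m%d")
    return {"order_id": int(order_id), "date_key": date_key,
            "amount": round(float(amount), 2)}

# Loading: store the unified rows (a list stands in for the warehouse).
warehouse = []
warehouse += [transform(r, "%Y-%m-%d") for r in extract_csv(CSV_FEED)]
warehouse += [transform(r, "%d/%m/%Y") for r in extract_legacy(LEGACY_FEED)]
```

After the load, every row shares one structure and one date encoding regardless of which source it came from, which is exactly what makes the consolidated data queryable.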


Once the need for building an organizational data warehouse is clear, the question is how to build it. There are generally accepted design methodologies for designing and implementing a data warehouse. The focus of this thesis is discussing the conceptual and logical design models of data warehouses and comparing these approaches.

1.1. Scope and outline of the thesis

The thesis is organized as follows: Chapter 2 presents an overview of data warehouse concepts and makes a comparison between operational and analytical processing systems. Chapter 3 provides information on data warehousing fundamentals and the data warehousing process. Chapter 4 describes the data warehouse design approaches used in the conceptual and logical design phases. In Chapter 5, the design approaches described in Chapter 4 are discussed and compared. Finally, in Chapter 6, a sample conceptual model is logically implemented using the logical design models and the physical implementation of a data warehouse is described.


CHAPTER 2

DATA WAREHOUSE CONCEPTS

2.1. Definition of Data Warehouse

A data warehouse (DW) is a database that is separate from the organization’s Online Transaction Processing (OLTP) database and that is used for the analysis of consolidated historical data.

According to Barry Devlin, IBM Consultant, “a DW is simply a single, complete

and consistent store of data obtained from a variety of sources and made available to end

users in a way they can understand and use it in a business context” [1, 3].

According to W.H. Inmon, “a DW is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision making process” [1, 2, 3, 6, 10, 11].

The four key features of a DW are described below.

Subject-oriented: In general, an enterprise contains information that is detailed enough to meet all the requirements of the organization’s related subsets (the sales department, human resources department, marketing department, etc.) and that is optimized for transaction processing. Usually, this type of data is not suitable for decision-makers, who need subject-oriented data. A DW should include only key business information: the data in the warehouse should be organized by subject, and only subject-oriented data should be moved into the warehouse.


If a decision-maker needs to find all the information about a specific product, he or she would otherwise need to consult every system (the rental sales system, the order sales system and the catalog sales system), which is neither preferable nor practical. Instead, all the key information must be consolidated in a warehouse and organized into subject areas, as illustrated in Figure 2.1.

Figure 2.1 Consolidation of OLTP information

Integrated: A DW is constructed by integrating data from multiple heterogeneous sources (relational databases (DBs), flat files, Excel sheets, XML data, data from legacy systems) to support structured and/or ad hoc queries, analytical reporting and decision making. A DW also provides mechanisms for cleaning and standardizing data. Figure 2.2 emphasizes the various uses and formats of a “Product Code” attribute.

Figure 2.2 Same attribute with different formats in different sources
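A standardization step of the kind Figure 2.2 motivates can be sketched as follows. The source encodings and the canonical "PC-" form below are hypothetical illustrations, not formats from the case study:

```python
import re

def standardize_product_code(raw):
    """Map product codes that appear in different formats across source
    systems onto one canonical form ("PC-" plus four digits).
    The formats handled here are hypothetical illustrations."""
    digits = re.sub(r"\D", "", str(raw))  # strip everything but digits
    if len(digits) != 4:
        raise ValueError(f"unrecognized product code: {raw!r}")
    return "PC-" + digits

# The same product, encoded three different ways in three sources:
sources = ["PC-1234", "1234", "pc 1234"]
canonical = {standardize_product_code(code) for code in sources}
# After standardization all three collapse to one warehouse value.
```

Rejecting unrecognized codes (rather than guessing) is the usual choice here, so that dirty source rows surface during cleansing instead of silently entering the warehouse.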

Time-variant: A DW provides information from a historical perspective. Every key structure in the DW contains, either implicitly or explicitly, an element of time. A DW generally stores data that is 5-10 years old, to be used for comparisons, trends and forecasting.
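The "element of time" in warehouse key structures is commonly carried by a date dimension keyed with YYYYMMDD integers. A simplified sketch of generating such a dimension follows; the column set is a minimal invented example:

```python
from datetime import date, timedelta

def build_date_dimension(start, end):
    """Generate one dimension row per calendar day between start and
    end (inclusive), keyed by an integer of the form YYYYMMDD.
    The column set is a minimal invented example."""
    rows, day = [], start
    while day <= end:
        rows.append({
            "date_key": day.year * 10000 + day.month * 100 + day.day,
            "year": day.year,
            "quarter": (day.month - 1) // 3 + 1,
            "month": day.month,
            "day": day.day,
        })
        day += timedelta(days=1)
    return rows

# Fact rows loaded for any of these days would reference these keys.
dim_date = build_date_dimension(date(2004, 12, 30), date(2005, 1, 2))
```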

Nonvolatile: Data in the warehouse is not updated or changed (see Figure 2.3), so the DW does not require transaction processing, recovery or concurrency control mechanisms. The only operations needed in the DW are the initial loading of data, access of data, and refresh.

Figure 2.3 Simple comparison of OLTP and DW systems

Some of the characteristics of a DW are given below:

- It is a database that is maintained separately from the organization's operational databases.
- It allows for the integration of various application systems.
- It supports information processing by consolidating historical data.
- Its user interface is aimed at decision-makers.
- It contains large amounts of data.
- It is updated infrequently, but periodic updates are required to keep the warehouse meaningful and dynamic.
- It is subject-oriented.
- It is non-volatile.
- Data is longer-lived: transaction systems may retain data only until processing is complete, whereas data warehouses may retain data for years.
- Data is stored in a format that is structured for querying and analysis.
- Data is summarized: DWs usually do not keep as much detail as transaction-oriented systems.

2.2. Why OLAP systems must run with OLTP

This section compares OLTP and Online Analytical Processing (OLAP) systems and explains why a separate OLAP system is needed.


The natures of OLTP and OLAP systems are completely different, in both technical and business terms. Table 2.1 compares OLTP and OLAP systems on the main technical topics.

User and System Orientation
  OLTP: Thousands of users; customer-oriented; used for transactions and querying by clerks, clients and Information Technology (IT) professionals.
  OLAP: Hundreds of users; market-oriented; used for data analysis by knowledge workers.

Data Contents
  OLTP: Manages current data, very detail-oriented; data is continuously updated; data is volatile and normalized (Entity-Relationship (ER) model).
  OLAP: Manages large amounts of historical data; provides facilities for summarization and aggregation; stores information at different levels of granularity to support the decision making process; data is refreshed rather than updated; data is non-volatile and de-normalized (dimensional model).

Database Design
  OLTP: Adopts an ER model and an application-oriented database design; index/hash on primary key.
  OLAP: Adopts a star, snowflake, or fact constellation model and a subject-oriented database design; lots of scans.

View
  OLTP: Focuses on the current data within an enterprise or department; detailed, flat relational.
  OLAP: Spans multiple versions of a database schema due to the evolutionary process of an organization; integrates information from many organizational locations and data stores; summarized, multidimensional.

Access Patterns
  OLTP: Short, atomic, simple transactions; requires concurrency control and recovery mechanisms; day-to-day activities; mostly updates.
  OLAP: Mostly read-only operations, many of which are complex queries; long-term informational requirements; decision support; mostly reads.

Table 2.1 Comparison of OLTP and OLAP
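The star schema that the table mentions for OLAP database design can be illustrated with a minimal in-memory example: one fact table, one dimension table, and an aggregate query reached through a single join. The table and column names below are invented for the sketch and are not from the thesis case study:

```python
import sqlite3

# A minimal star schema: one fact table joined to one dimension table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (
        product_key  INTEGER PRIMARY KEY,
        product_name TEXT,
        category     TEXT);
    CREATE TABLE fact_sales (
        product_key  INTEGER REFERENCES dim_product,
        date_key     INTEGER,
        amount       REAL);
    INSERT INTO dim_product VALUES (1, 'Widget', 'Tools'), (2, 'Gadget', 'Toys');
    INSERT INTO fact_sales  VALUES (1, 20041230, 100.0),
                                   (1, 20041231, 50.0),
                                   (2, 20041231, 75.0);
""")
# A typical OLAP-style query: aggregate the facts grouped by a
# dimension attribute, reachable through a single join.
rows = conn.execute("""
    SELECT p.category, SUM(f.amount)
    FROM fact_sales AS f JOIN dim_product AS p USING (product_key)
    GROUP BY p.category ORDER BY p.category
""").fetchall()
```

Because every dimension attribute is one join away from the facts, such "lots of scans" aggregate queries stay simple, which is the performance argument the table makes for the dimensional design.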

It may seem questionable to implement a DW system for companies already running their business on OLTP systems. The following list summarizes the main reasons for using a DW:

- To gain high performance in both systems by proper data organization (DB design). OLTP deals with transactions, concurrency, locking and logging, whereas OLAP deals with many records at once; the transaction performance of OLTP and the selection performance of OLAP would be in conflict.

- Different structures, contents, and uses of the data. OLTP requires current data; OLAP requires historical data.

- Data cleanness. Data in OLTP might be "dirty" because, among other reasons, it is collected by clerks who may make mistakes. Data that goes into OLAP should be cleaned and standardized.

- OLTP and OLAP systems need to run different types of queries; they provide different functionality and use different types of queries.
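The contrast between the two query styles can be illustrated with a small in-memory database; the table, column names and figures below are invented for illustration:

```python
import sqlite3

# Hypothetical sales table standing in for an operational system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, year INTEGER, amount REAL)")
rows = [(1, "North", 2003, 100.0), (2, "North", 2004, 150.0),
        (3, "South", 2003, 80.0), (4, "South", 2004, 120.0)]
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)

# OLTP-style access: a short, keyed lookup touching one record.
oltp = conn.execute("SELECT amount FROM sales WHERE id = ?", (2,)).fetchone()

# OLAP-style access: a scan that aggregates history across a dimension.
olap = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

print(oltp)   # (150.0,)
print(olap)   # [('North', 250.0), ('South', 200.0)]
```

The point query touches one indexed row, while the aggregate query must scan the whole table; tuning one storage layout for both workloads is what the first bullet above calls a conflict.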

The main roles in a company that will use a DW solution are [4]:

- Top executives and decision makers
- Middle/operational managers
- Knowledge workers
- Non-technical business-related individuals

The main advantages of using a DW solution are summarized in the list below [2, 3, 6]:

- High query performance
- Does not interfere with local processing at sources
- Information is copied at the warehouse (and can be modified, summarized, restructured, etc.)
- Potentially high return on investment
- Competitive advantage
- Increased productivity of corporate decision makers

As discussed above, a DW solution has many advantages and benefits for an organization. However, although implementing a DW solution solves some business problems, it may bring new problems of its own, as mentioned below [2, 6]:

- Underestimation of resources for data loading
- Hidden problems with source systems
- Required data not captured
- Increased end-user demands
- High maintenance
- Long-duration projects
- Complexity of integration
- Data homogenization
- High demand for resources
- Data ownership

2.3. Requirements for Data Warehouse Database Management Systems

In the implementation of a DW solution, many technical points must be considered. An OLTP database management system (DBMS) must mainly consider transaction processing performance (basically, a transaction must be completed in the minimum time, without deadlocks, and with support for thousands of transactions per second). A relational DBMS (RDBMS) suitable for data warehousing, on the other hand, must satisfy the following requirements [6]:

- Load performance: Data warehouses need periodic incremental loading of data, so the load process should perform at a rate of gigabytes of data per hour.

- Load processing: Data conversion, filtering, indexing and reformatting may be necessary while loading data into the data warehouse. This process should be executed as a single unit of work.

- Data quality management: The warehouse must ensure consistency and referential integrity despite the variety of data sources and the large data size. The measure of success for a data warehouse is its ability to satisfy business needs.

- Query performance: Complex queries must complete in acceptable periods.

- Terabyte scalability: The data warehouse RDBMS should not have any database size limitations and should provide recovery mechanisms.

- Mass user scalability: The data warehouse RDBMS should be able to support hundreds of concurrent users.

- Warehouse administration: Easy-to-use and flexible administrative tools should exist for data warehouse administration.

- Advanced query functionality: The data warehouse RDBMS should supply advanced analytical operations to enable end-users to perform advanced calculations and analysis.

CHAPTER 3

FUNDAMENTALS OF DATA WAREHOUSE

The main reason for building a DW is to improve the quality of information in the organization. Data coming from both internal and external sources in various formats and structures is consolidated and integrated into a single repository. A DW system comprises the data warehouse and all components used for building, accessing and maintaining the data warehouse.

Figure 3.1 Architecture of DW

A general architecture of a DW is given in Figure 3.1, and the main components are described below [5, 32].

The data import and preparation component is responsible for data acquisition. It includes all programs (such as Data Transformation Services (DTS)) that are responsible for extracting data from operational sources, preparing it and loading it into the warehouse.

The access component includes all applications (such as OLAP tools) that use the information stored in the warehouse.

Additionally, a metadata management component is responsible for the management, definition and access of all the different types of metadata. Metadata is defined as "data describing the meaning of data". In data warehousing there are various types of metadata, e.g., information about the operational sources, the structure and semantics of the data warehouse data, and the tasks performed during the construction, maintenance and access of a data warehouse.
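The metadata component described above can be sketched as a minimal registry; the categories, object names and entries below are invented for illustration, not taken from any product:

```python
# A minimal sketch of a metadata registry with three ad-hoc categories
# mirroring the kinds of metadata listed in the text (sources, structure, tasks).
metadata = {
    "source": {"orders_db": {"system": "OLTP", "refresh": "nightly"}},
    "structure": {"fact_sales": {"grain": "one row per order line",
                                 "measures": ["amount", "quantity"]}},
    "process": {"load_sales": {"task": "ETL", "last_run": "2004-12-31"}},
}

def describe(category, name):
    """Look up 'data describing the meaning of data' for a warehouse object."""
    return metadata[category][name]

print(describe("structure", "fact_sales")["grain"])  # one row per order line
```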

Implementing a DW is a complex task comprising two major phases. In the configuration phase, a conceptual view of the warehouse is first specified according to user requirements (DW design). Then, the related data sources and the Extract-Transform-Load (ETL) process (data acquisition) are determined. Finally, decisions are made about the persistent storage of the warehouse using database technology and about the various ways data will be accessed during analysis.

After the initial load (the first load of the DW according to the configuration), warehouse data must be regularly refreshed during the operation phase; i.e., modifications of operational data since the last DW refreshment must be propagated into the warehouse so that the data stored in the data warehouse reflects the state of the underlying operational systems.

A more natural way to consider the multidimensionality of warehouse data is provided by the multidimensional data model. In this model, the data cube is the basic modeling construct. Operations like pivoting (rotating the cube), slicing and dicing (selecting a subset of the cube), and roll-up and drill-down (increasing and decreasing the level of aggregation) can be applied to a data cube. For the implementation of multidimensional databases, there are two main approaches. In the first approach, extended RDBMSs, called relational OLAP (ROLAP) servers, use a relational database to implement the multidimensional model and operations. ROLAP servers provide SQL extensions and translate data cube operations into relational queries. In the second approach, multidimensional OLAP (MOLAP) servers store multidimensional data in specialized non-relational storage structures. These systems usually precompute the results of complex operations (while building the storage structures) in order to increase performance.
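The cube operations named above can be sketched over a tiny in-memory cube; the dimensions and figures below are invented for illustration:

```python
from collections import defaultdict

# A tiny data cube as a mapping from dimension coordinates to a measure.
# Dimensions: (product, city, year) -> sales amount (illustrative data).
cube = {
    ("tv", "ankara", 2003): 10, ("tv", "ankara", 2004): 12,
    ("tv", "izmir", 2003): 7,   ("radio", "ankara", 2004): 5,
}

def slice_(cube, axis, value):
    """Slicing: select the sub-cube where one dimension is fixed."""
    return {k: v for k, v in cube.items() if k[axis] == value}

def roll_up(cube, axis):
    """Roll-up: aggregate away one dimension (increase the aggregation level)."""
    out = defaultdict(int)
    for key, measure in cube.items():
        reduced = key[:axis] + key[axis + 1:]
        out[reduced] += measure
    return dict(out)

by_year = roll_up(roll_up(cube, 0), 0)   # aggregate away product, then city
print(by_year)                           # {(2003,): 17, (2004,): 17}
print(slice_(cube, 1, "izmir"))          # {('tv', 'izmir', 2003): 7}
```

Drill-down is the inverse of roll-up and, in this toy representation, simply means returning to the finer-grained cube.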

3.1. Data Acquisition

Data extraction is one of the most time-consuming tasks of DW development. Data consolidated from heterogeneous systems may have problems and may need to be transformed and cleaned before being loaded into the DW. Data gathered from operational systems may be incorrect, inconsistent, unreadable or incomplete. Data cleaning is therefore an essential task in the data warehousing process in order to get correct, high-quality data into the DW. This process basically comprises the following tasks [5]:

- converting data from heterogeneous data sources with various external representations into a common structure suitable for the DW;

- identifying and eliminating redundant or irrelevant data;

- transforming data to correct values (e.g., by looking up parameter usage and consolidating these values into a common format);

- reconciling differences between multiple sources due to the use of homonyms (the same name for different things), synonyms (different names for the same thing) or different units of measurement.
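The reconciliation tasks above can be sketched as follows; the synonym map, field names and unit factors are invented for illustration:

```python
# Hypothetical source records with a synonym field name, mixed units
# of measurement, and a redundant copy.
records = [
    {"customer": "ACME Inc.", "weight": 2.0, "unit": "kg"},
    {"client": "ACME Inc.",   "weight": 2000.0, "unit": "g"},   # synonym field
    {"customer": "ACME Inc.", "weight": 2.0, "unit": "kg"},     # redundant copy
]

SYNONYMS = {"client": "customer"}   # different names for the same thing
TO_KG = {"kg": 1.0, "g": 0.001}     # unit-of-measurement reconciliation factors

def clean(records):
    seen, out = set(), []
    for rec in records:
        # Map synonym field names onto the common structure.
        rec = {SYNONYMS.get(k, k): v for k, v in rec.items()}
        # Transform values into a common unit of measurement.
        rec["weight"] = rec["weight"] * TO_KG[rec.pop("unit")]
        # Eliminate redundant data.
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            out.append(rec)
    return out

print(clean(records))  # [{'customer': 'ACME Inc.', 'weight': 2.0}]
```

All three source rows describe the same fact; after synonym mapping and unit conversion they collapse to a single cleaned record.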

Once the cleaning process is completed, the data that will be stored in the warehouse must be merged and set to a common level of detail containing time-related information, to enable the usage of historical data. Before loading data into the DW, tasks like filtering, sorting, partitioning and indexing may need to be performed. After these processes, the consolidated data may be imported into the DW using a bulk data loader, a custom application, or an import/export wizard provided by the DBMS administration applications.


3.1.1. Extraction, Cleansing and Transformation Tools

The tasks of capturing data from a source system, cleansing and transforming the data, and loading the consolidated data into a target system can be done either by separate products or by a single integrated solution. Integrated solutions fall into one of the following categories [6]:

- Code generators
- Database data replication tools
- Dynamic transformation engines

There are solutions that fulfill all of the requirements mentioned above. One of these products, Microsoft® Data Transformation Services, is described in Chapter 6.

Code generators

Code generators create customized 3GL/4GL transformation programs based on source and target data definitions. The main issue with this approach is the management of the large number of programs required to support a complex corporate DW.

Database data replication tools

Database data replication tools employ database triggers or a recovery log to capture changes to a single data source on one system and apply the changes to a copy of the source data located on a different system. Most replication products do not support the capture of changes to non-relational files and databases, and often do not provide facilities for significant data transformation and enhancement. These tools can be used to rebuild a database following a failure, or to create a database for a data mart, provided that the number of data sources is small and the level of data transformation is relatively simple.

Dynamic transformation engines

Rule-driven dynamic transformation engines capture data from a source system at user-defined intervals, transform the data, and then send and load the results into a target environment. Most products support only relational data sources, but products are now emerging that handle non-relational source files and databases.

3.2. Data Storage and Access

Because of the special nature of warehouse data and access patterns, the customary mechanisms for data storage, query processing and transaction management must be adapted. DW solutions have complex querying requirements, and their operations involve access to large volumes of data. These operations need special access methods, storage structures and query processing techniques.

The storage approaches for a DW are described in detail in section 4.9. One of these physical storage methods may be chosen considering the trade-off between query performance and the amount of data.

Once the DW is available to end-users, there is a variety of techniques to enable end-users to access the DW data for analysis and reporting. Several tools and products are commercially available. In common, client tools generally use OLEDB, ODBC or native client providers to access the DW data. The most commonly used commercial client application is Microsoft® Excel with pivot tables.

A company that does business in several countries throughout the world may need to analyze regional trends and may need to compete regionally. A centralized DW may not be feasible for such companies. These organizations may need to establish data marts, which are selected parts of the DW that support the specific decision support application requirements of a company's department or geographical region. Data marts usually contain simple replicas of warehouse partitions, or data that has been further summarized or derived from base warehouse data. Data marts allow the efficient execution of predicted queries over a significantly smaller database.

3.3. Data Marts

A data mart is a subset of the data in a DW, containing summary data relating to a department or a specific business function [6]. Data marts focus on the requirements of users in a particular department or business function of an organization. Since data marts are specialized for departmental operations, they contain less data, and end-users are better able to exploit data marts than DWs. The main reasons for implementing a data mart instead of a DW may be summarized as follows:

- Data marts enable end-users to analyze the data they need most often in their daily operations.

- Since data marts contain less data, end-user query response time is much shorter.

- Data marts are more specialized and contain less data; therefore data transformation and integration tasks are much faster in data marts than in DWs, and setting up a data mart is a simpler and cheaper task than establishing an organizational DW in terms of time and resources.

- In terms of software engineering, building a data mart may be a more feasible project than building a DW, because the requirements of building a data mart are much more explicit than those of a corporate-wide DW project.

Although data marts seem to have advantages over DWs, there are some issues about data marts that must be addressed.

Size: Although data marts are considered to be smaller than data warehouses, the size and complexity of some data marts may match that of a small corporate DW. As the size of a data mart increases, its performance is likely to decrease.

Load performance: Both end-user response time and data loading performance are critical for data marts. To improve response time, data marts usually contain lots of summary tables and aggregations, which have a negative effect on load performance.

User access to data in multiple data marts: A solution to this problem is building virtual data marts, which are views over several physical data marts.

Administration: As the number of data marts increases, the need arises to coordinate data mart activities such as versioning, consistency, integrity, security and performance tuning.


CHAPTER 4

DESIGNING A DATA WAREHOUSE

Designing a warehouse means fulfilling all the requirements mentioned in section 2.3, and it is obviously a complicated process.

There are two major components in building a DW: the design of the interface from the operational systems and the design of the DW itself [11]. DW design is different from classical requirements-driven systems design.

4.1. Beginning with Operational Data

Creating the DW does not only involve extracting operational data and entering it into the warehouse (Figure 4.1).

Figure 4.1 Data Extraction


Pulling the data into the DW without integrating it is a big mistake (Figure 4.2).

Figure 4.2 Data Integration

Existing applications were designed with their own requirements, and integration with other applications was not of much concern. This results in data redundancy: the same data may exist in other applications with the same meaning but with a different name or a different measure (Figure 4.3).

Figure 4.3 Same data, different usage

Another problem is the performance of accessing existing systems' data. The existing systems environment holds gigabytes and perhaps terabytes of data, and attempting to scan all of it every time a DW load needs to be done is resource- and time-consuming, and unrealistic.

Three types of data are loaded into the DW from the operational system:

- Archival data

- Data currently contained in the operational environment

- Changes to the DW environment resulting from the updates that have occurred in the operational system since the last refresh

Five common techniques are used to limit the amount of operational data scanned to refresh the DW:

- Scan data that has been timestamped in the operational environment.

- Scan a 'delta' file. A delta file contains only the changes made to an application as a result of the transactions that have run through the operational environment.

- Scan a log file or an audit file created by the transaction processing system. A log file contains much the same data as a delta file.

- Modify the application code.

- Rub a 'before' and an 'after' image of the operational file together.
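Two of the techniques above, timestamp scanning and delta files, can be sketched as follows; the record layout and field names are hypothetical:

```python
from datetime import date

# Operational records carrying a last-modified timestamp (hypothetical layout).
operational = [
    {"id": 1, "balance": 100, "modified": date(2004, 12, 30)},
    {"id": 2, "balance": 250, "modified": date(2005, 1, 2)},
    {"id": 3, "balance": 80,  "modified": date(2005, 1, 3)},
]

def scan_timestamped(records, last_refresh):
    """Technique 1: scan only rows stamped after the last DW refresh."""
    return [r for r in records if r["modified"] > last_refresh]

# Technique 2: a delta file already holds only the changes made by
# transactions, so no scan of the full operational data is needed.
delta_file = [{"id": 2, "balance": 250}, {"id": 3, "balance": 80}]

changed = scan_timestamped(operational, last_refresh=date(2004, 12, 31))
print([r["id"] for r in changed])  # [2, 3]
```

Both approaches identify the same changed rows; the difference is whether the operational data must be scanned (technique 1) or the changes are captured as they happen (technique 2).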

Another difficulty is that operational data must undergo a time-basis shift as it passes into the DW. The operational data's accuracy is valid at the instant it is accessed; after that, it may be updated. However, once the data is loaded into the warehouse, it cannot be updated anymore, so a time element must be attached to it.
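Attaching a time element as data passes into the warehouse can be sketched as follows; the snapshot_date field name is an assumed convention, not a fixed standard:

```python
from datetime import date

def to_warehouse(operational_record, load_date):
    """Attach a time element: the warehouse row becomes a dated,
    non-updatable snapshot of the operational record."""
    snapshot = dict(operational_record)
    snapshot["snapshot_date"] = load_date  # assumed name for the time element
    return snapshot

account = {"id": 42, "balance": 500}
row = to_warehouse(account, date(2005, 1, 1))
```

Later loads of the same account produce new rows with later snapshot dates, so the warehouse accumulates history instead of overwriting the current value.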

Another problem when passing data is the need to manage the volume of data that resides in and passes into the warehouse, since the volume of data in the DW grows fast.

4.2. Data/Process Models

The process model applies only to the operational environment, whereas the data model applies to both the operational environment and the DW environment.

A process model consists of:

- Functional decomposition
- Context-level zero diagram
- Data flow diagram
- Structure chart
- State transition diagram
- Hierarchical input process output (HIPO) chart
- Pseudo code


A process model is invaluable, for instance, when building the data mart. However, the process model is requirements-based, so it is not suitable for the DW.

The data model is applicable to both the existing systems environment and the DW environment. An overall corporate data model is constructed with no regard for a distinction between the existing operational systems and the DW; the corporate data model focuses only on primitive data. Performance factors are added to the corporate data model as the model is transported to the existing systems environment. While few changes are made to the corporate data model for the operational environment, more changes are made to it for use in the DW environment. First, data that is used purely in the operational environment is removed. Next, the key structures of the corporate data model are enhanced with an element of time. Derived data is added to the corporate data model where it is publicly used and calculated once, not repeatedly. Finally, data relationships in the operational environment are turned into "artifacts" in the DW. A final design activity in transforming the corporate data model into the data warehouse data model is to perform "stability" analysis. Stability analysis involves grouping attributes of data together based on their tendency for change.
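Stability analysis, grouping attributes by their tendency to change, can be sketched as follows; the attribute names and change ratings are invented for illustration:

```python
# Hypothetical attribute catalog with an assessed tendency for change.
attributes = {
    "customer_id": "rarely",  "birth_date": "rarely",
    "address": "sometimes",   "phone": "sometimes",
    "balance": "often",       "last_activity": "often",
}

def stability_groups(attrs):
    """Group attributes by change tendency, yielding candidate groupings
    of data that can be stored and managed together."""
    groups = {}
    for name, tendency in attrs.items():
        groups.setdefault(tendency, []).append(name)
    return {t: sorted(names) for t, names in groups.items()}

print(stability_groups(attributes)["rarely"])  # ['birth_date', 'customer_id']
```

Attributes that change at similar rates end up in the same grouping, which is the intent of the analysis: stable and volatile data are separated in the physical design.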

4.3. The DW Data Model

There are three levels in the data modeling process: high-level modeling (called the ERD, or entity relationship level), midlevel modeling (called the data item set, or DIS), and low-level modeling (called the physical model).

4.3.1. High-Level Modeling

The high level of modeling features entities and relationships. The name of each entity is surrounded by an oval. Relationships among entities are depicted with arrows. The direction and number of the arrowheads indicate the cardinality of the relationship, and only direct relationships are indicated.


Figure 4.4 A Simple ERD for a manufacturing environment

The entities shown at the ERD level (see Figure 4.4) are at the highest level of abstraction.

The corporate ERD, as shown in Figure 4.5, is formed from many individual ERDs that reflect the different views of people across the corporation. Separate high-level data models are created for the different communities within the corporation; collectively, they make up the corporate ERD.

Figure 4.5 Corporate ERD created from departmental ERDs


4.3.2. Mid-Level Modeling

After the high-level data model is created, the next level is established: the midlevel model, or DIS. For each major subject area, or entity, identified in the high-level data model, a midlevel model is created; each area is subsequently developed into its own midlevel model (see Figure 4.6).

Figure 4.6 Relationship between ERD and DIS

Four basic constructs are found at the midlevel model (also shown in Figure 4.7):

- A primary grouping of data
- A secondary grouping of data
- A connector, signifying the relationships of data between major subject areas
- "Type of" data

Figure 4.7 Midlevel model members


The primary grouping exists once, and only once, for each major subject area. It holds attributes that exist only once for each major subject area. As with all groupings of data, the primary grouping contains attributes and keys for each major subject area.

The secondary grouping holds data attributes that can exist multiple times for each major subject area. This grouping is indicated by a line drawn downward from the primary grouping of data. There may be as many secondary groupings as there are distinct groups of data that can occur multiple times.

The third construct is the connector. The connector relates data from one grouping to another. A relationship identified at the ERD level results in an acknowledgement at the DIS level. The convention used to indicate a connector is the underlining of a foreign key.

The fourth construct in the data model is "type of" data. "Type of" data is indicated by a line leading to the right of a grouping of data; the grouping of data to the left is the supertype, and the grouping of data to the right is the subtype.

These four data modeling constructs are used to identify the attributes of data in a data model and the relationships among those attributes. When a relationship is identified at the ERD level, it is manifested as a pair of connector relationships at the DIS level. A sample model is drawn in Figure 4.8 below.

Figure 4.8 A Midlevel model sample

Like the corporate ERD, which is created from the different ERDs reflecting the user communities, the corporate DIS is created from multiple DISs. Figure 4.9 shows a sample corporate DIS formed from many departmental DISs.


Figure 4.9 Corporate DIS formed from departmental DISs

Figure 4.10 shows an individual department's DIS.

Figure 4.10 An example of a departmental DIS

4.3.3. Low-Level Modeling

The physical data model is created from the midlevel data model simply by extending the midlevel data model to include keys and the physical characteristics of the model. At this point, the physical data model looks like a series of tables, sometimes called relational tables. With the DW, the first step in doing so is deciding on the granularity and partitioning of the data.

After granularity and partitioning are factored in, a variety of other physical design activities are embedded into the design. At the heart of the physical design considerations is the usage of physical input/output (I/O). Physical I/O is the activity that brings data into the computer from storage or sends data to storage from the computer. The job of the DW designer is to organize the data physically for the return of the maximum number of records from the execution of a single physical I/O. Figure 4.11 illustrates the major considerations in low-level modeling.

Figure 4.11 Considerations in low-level modeling

There is another mitigating factor regarding the physical placement of data in the data warehouse: data in the warehouse normally is not updated. This frees the designer to use physical design techniques that otherwise would not be acceptable if the data were regularly updated.

4.4. Database Design Methodology for DW

In the next few sections of this thesis I will discuss both conceptual and logical design methods for data warehousing. Adopting the terminology of [23, 36, 37, 38], three different design phases are distinguished: conceptual design manages concepts that are close to the way users perceive data; logical design deals with concepts related to a certain kind of DBMS; and physical design depends on the specific DBMS and describes how data is actually stored [35, 40].


Prior to beginning the discussion, the basic concepts of dimensional modeling should be mentioned: facts, dimensions and measures [7, 24].

- A fact is a collection of related data items, consisting of measures and context data. It typically represents business items or business transactions.

- A dimension is a collection of data that describes one business dimension. Dimensions determine the contextual background for the facts; they are the parameters over which we want to perform OLAP.

- A measure is a numeric attribute of a fact, representing the performance or behavior of the business relative to the dimensions.
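These three concepts can be sketched as plain data structures; the retail schema and figures below are invented for illustration:

```python
# Dimensions: contextual data over which OLAP is performed.
dim_store = {1: {"city": "Ankara", "region": "Central"}}
dim_date  = {20050101: {"day": 1, "month": 1, "year": 2005}}

# A fact: measures plus context (keys pointing into the dimensions).
fact_sales = [
    {"store_id": 1, "date_id": 20050101, "amount": 120.0, "quantity": 3},
]

# Measures are the numeric attributes of the fact (amount, quantity),
# interpreted relative to the dimensions, e.g. total amount for a region.
total = sum(f["amount"] for f in fact_sales
            if dim_store[f["store_id"]]["region"] == "Central")
print(total)  # 120.0
```

The fact row carries the measures; the dimensions supply the context (which store, which day) that makes the numbers meaningful.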

Before this discussion, I also prefer to summarize the methodology proposed by Kimball [21], who is accepted as a guru on data warehousing and whose studies have encouraged many academicians in the study of data warehousing.

The nine-step methodology by Kimball is as follows [6, 42, 43]:

1. Choosing the process: The process (function) refers to the subject matter of a particular data mart. The first data mart to be built should be the one that is most likely to be delivered on time, within budget, and to answer the most important business questions.

2. Choosing the grain: This means deciding exactly what a fact table record represents. Only when the grain for the fact table is chosen can we identify the dimensions of the fact table. The grain decision for the fact table also determines the grain of each of the dimension tables.

3. Identifying and conforming the dimensions: Dimensions set the context for asking questions about the facts in the fact table. A well-built set of dimensions makes the data mart understandable and easy to use, while a poorly presented or incomplete set of dimensions will reduce the usefulness of a data mart to an enterprise. When a dimension is used in more than one data mart, it is referred to as conformed.

4. Choosing the facts: The grain of the fact table determines which facts can be used in the data mart. All the facts must be expressed at the level implied by the grain. The facts should be numeric and additive. Additional facts can be added to a fact table at any time, provided they are consistent with the grain of the table.

5. Storing pre-calculations in the fact table: Once the facts have been selected, each should be re-examined to determine whether there are opportunities to use pre-calculations.

6. Rounding out the dimension tables: We return to the dimension tables and add as much text description to the dimensions as possible. The text descriptions should be as intuitive and understandable to the users as possible. The usefulness of a data mart is determined by the scope and nature of the attributes of the dimension tables.

7. Choosing the duration of the database: The duration measures how far back in time the fact table goes. There is often a requirement to look at the same time period a year or two earlier. Very large fact tables raise at least two very significant DW design issues. First, it is often increasingly difficult to source increasingly old data: the older the data, the more likely there will be problems in reading and interpreting the old files or the old tapes. Second, it is mandatory that the old versions of the important dimensions be used, not the most current versions. This is known as the 'slowly changing dimension' problem.

8. Tracking slowly changing dimensions: There are three basic types of slowly changing dimensions:

   - Type 1: a changed dimension attribute is overwritten;
   - Type 2: a changed dimension attribute causes a new dimension record to be created;
   - Type 3: a changed dimension attribute causes an alternate attribute to be created, so that both the old and new values of the attribute are simultaneously accessible in the same dimension record.

9. Deciding the query priorities and the query modes: Here we consider physical design issues. The most critical physical design issues affecting the end-user's perception of the data mart are the physical sort order of the fact table on disk and the presence of pre-stored summaries or aggregations. There are additional physical design issues affecting administration, backup, indexing performance, and security. At the end of this step we have a design for a data mart that supports the requirements of a particular business process and also allows easy integration with other related data marts, to ultimately form the enterprise-wide DW.
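The three types of slowly changing dimensions in step 8 can be sketched as follows; the field names and surrogate keys are illustrative, not Kimball's own notation:

```python
def type1_overwrite(dim_row, attr, new_value):
    """Type 1: overwrite the changed attribute; the old value is lost."""
    dim_row = dict(dim_row)
    dim_row[attr] = new_value
    return dim_row

def type2_new_record(dim_table, old_key, attr, new_value, new_key):
    """Type 2: a change creates a new dimension record under a new
    surrogate key; the old record is kept for historical facts."""
    new_row = dict(dim_table[old_key])
    new_row[attr] = new_value
    return {**dim_table, new_key: new_row}

def type3_alternate(dim_row, attr, new_value):
    """Type 3: keep old and new values side by side in the same record."""
    dim_row = dict(dim_row)
    dim_row["prev_" + attr] = dim_row[attr]
    dim_row[attr] = new_value
    return dim_row

customer = {"name": "ACME", "city": "Izmir"}
t1 = type1_overwrite(customer, "city", "Ankara")
t2 = type2_new_record({1: customer}, 1, "city", "Ankara", new_key=2)
t3 = type3_alternate(customer, "city", "Ankara")
```

Type 1 loses history, Type 2 preserves it at the cost of extra records, and Type 3 keeps exactly one prior value; which to use depends on the reporting requirements.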


4.5. Conceptual Design Models

The main goal of conceptual design modeling is to develop a formal, complete, abstract design based on the user requirements [34].

At this phase of a DW there is the need to:

- Represent facts and their properties: fact properties are usually numerical and can be summarized (aggregated).

- Connect the dimensions to facts: time is always associated with a fact.

- Represent objects and capture their properties, along with the associations among them: object properties (summary properties) can be numeric. Additionally, there are three special types of associations: specialization/generalization (showing objects as subclasses of other objects), aggregation (showing objects as parts of a larger object), and membership (showing that an object is a member of another, higher object class with the same characteristics and behavior). Membership may be strict or not (all members belong to only one higher object class) and complete or not (all members belong to one higher object class, and that object class consists of those members only).

- Record the associations between objects and facts: facts are connected to objects.

- Distinguish dimensions and categorize them into hierarchies: dimensions are governed by associations of type membership, forming hierarchies that specify different granularities.

4.5.1. The Dimensional Fact Model

This model is built from ER schemas [9, 15, 16, 17, 33]. The Dimensional Fact (DF) model is a collection of tree-structured fact schemas whose elements are facts, attributes, dimensions and hierarchies. The additivity of fact attributes, optional dimension attributes and the existence of non-dimension attributes may also be represented on fact schemas. Compatible fact schemas may be overlapped in order to relate and compare data.

A fact schema is structured as a tree whose root is a fact. The fact is represented by a box which reports the fact name.


Figure 4.12 A dimensional fact schema sample

Sub-trees rooted in dimensions are hierarchies. The circles represent the attributes and the arcs represent the relationships between attribute pairs. The non-dimension attributes (such as the address attribute in Figure 4.12) are represented by lines instead of circles. A non-dimension attribute contains additional information about an attribute of the hierarchy, is connected to it by a one-to-one relationship, and cannot be used for aggregation. The arcs drawn with dashes express optional relationships between pairs of attributes.

A fact expresses a many-to-many relationship among the dimensions. Each combination of values of the dimensions defines a fact instance, with one value for each fact attribute. Most attributes are additive along all dimensions; this means that the sum operator can be used to aggregate attribute values along all hierarchies. A fact attribute is called semi-additive if it is not additive along one or more dimensions, and non-additive if it is additive along no dimension.
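The additivity distinction can be illustrated with a minimal sketch (all figures and attribute names below are hypothetical):

```python
# Additivity of fact attributes along dimensions (hypothetical figures).
# 'qty_sold' is additive: summing along any dimension is meaningful.
# 'stock_level' is semi-additive: summing across stores is fine,
# but summing across time double-counts the same inventory.

rows = [
    {"month": "Jan", "store": "A", "qty_sold": 10, "stock_level": 100},
    {"month": "Jan", "store": "B", "qty_sold": 5,  "stock_level": 80},
    {"month": "Feb", "store": "A", "qty_sold": 7,  "stock_level": 90},
    {"month": "Feb", "store": "B", "qty_sold": 8,  "stock_level": 70},
]

# Aggregating along the store dimension is valid for both attributes:
jan = [r for r in rows if r["month"] == "Jan"]
total_sold_jan = sum(r["qty_sold"] for r in jan)        # 15
total_stock_jan = sum(r["stock_level"] for r in jan)    # 180, still meaningful

# Aggregating along the time dimension:
store_a = [r for r in rows if r["store"] == "A"]
sold_a = sum(r["qty_sold"] for r in store_a)            # 17, meaningful
# stock_level must not be summed over time; use e.g. an average instead:
avg_stock_a = sum(r["stock_level"] for r in store_a) / len(store_a)  # 95.0
```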

Building the DF model begins with defining the facts (a fact may be represented on the E/R schema either by an entity F or by an n-ary relationship among entities E1 to En). Then, for each fact, five steps follow:

o Building the attribute tree. (Each vertex corresponds to an attribute of the

schema; the root corresponds to the identifier of F; for each vertex v, the

corresponding attribute functionally determines all the attributes

corresponding to the descendants of v. If F is identified by the

combination of two or more attributes, identifier (F) denotes their


concatenation. It is worth adding some further notes: It is useful to

emphasize on the fact schema the existence of optional relationships

between attributes in a hierarchy. Optional relationships or optional

attributes of the E/R schema should be marked by a dash; A one-to-one

relationship can be thought of as a particular kind of many-to-one

relationship, hence, it can be inserted into the attribute tree;

Generalization hierarchies in the E/R schema are equivalent to one-to-one

relationships between the super-entity and each sub-entity; x-to-many

relationships cannot be inserted into the attribute tree. In fact,

representing these relationships at the logical level, for instance by a star

schema, would be impossible without violating the first normal form; an

n-ary relationship is equivalent to n binary relationships. Most n-ary

relationships have maximum multiplicity greater than 1 on all their

branches; they determine n one-to-many binary relationships which

cannot be inserted into the attribute tree.)

o Pruning and grafting the attribute tree (not all of the attributes represented in the attribute tree are interesting for the DW, so the tree may be pruned and grafted in order to eliminate unnecessary levels of detail. Pruning is carried out by dropping any sub-tree from the tree; the attributes dropped will not be included in the fact schema, hence it will be impossible to use them to aggregate data. Grafting is used when a vertex must be eliminated but its descendants must be preserved: the children of the eliminated vertex are attached directly to its parent.).

o Defining dimensions (the dimensions must be chosen in the attribute tree among the children vertices of the root. E/R schemas can be classified as snapshot or temporal. A snapshot schema describes the current state of the application domain; old versions of data varying over time are continuously replaced by new versions. A temporal schema describes the evolution of the application domain over a range of time; old versions of data are explicitly represented and stored. When designing a DW from a temporal schema, time is explicitly represented as an E/R attribute and thus is an obvious candidate to define a dimension. When time is not explicitly represented, however, it should still be added as a dimension to the fact schema).

o Defining fact attributes (Fact attributes are typically either counts of the

number of instances of F, or the sum/average/maximum/minimum of

expressions involving numerical attributes of the attribute tree. A fact

may have no attributes, if the only information to be recorded is the

occurrence of the fact.).

o Defining hierarchies (Along each hierarchy, attributes must be arranged

into a tree such that an x-to-one relationship holds between each node and

its descendants. It is still possible to prune and graft the tree in order to

eliminate irrelevant details. It is also possible to add new levels of

aggregation by defining ranges for numerical attributes. During this

phase, the attributes which should not be used for aggregation but only

for informative purposes may be identified as non-dimension attributes.).
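The pruning and grafting step above can be sketched as follows, using a hypothetical attribute tree for a "sale" fact represented as a child-list dictionary:

```python
# Pruning and grafting an attribute tree (hypothetical 'sale' fact).
# The tree maps each attribute to the attributes it functionally determines.

tree = {
    "sale": ["product", "date"],
    "product": ["type", "supplier"],
    "type": ["category"],
    "supplier": [],
    "category": [],
    "date": ["month"],
    "month": ["year"],
    "year": [],
}

def prune(tree, node):
    """Drop 'node' and its whole sub-tree: those attributes disappear."""
    for child in tree.pop(node, []):
        prune(tree, child)
    for children in tree.values():
        if node in children:
            children.remove(node)

def graft(tree, node):
    """Drop 'node' but attach its children to its parent: descendants survive."""
    children = tree.pop(node, [])
    for parent_children in tree.values():
        if node in parent_children:
            parent_children.remove(node)
            parent_children.extend(children)

prune(tree, "supplier")   # supplier is irrelevant for aggregation: drop it
graft(tree, "type")       # keep category, but skip the 'type' level

# 'product' now determines 'category' directly; 'supplier' and 'type' are gone.
```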

4.5.2. Multidimensional E/R Model

It is argued that the ER approach is not suited for multidimensional conceptual modeling, because the semantics of the main characteristics of the multidimensional model cannot be effectively represented.

The Multidimensional E/R (ME/R) model includes some key considerations [14]:

Specialization of the ER Model.

Minimal extension of the ER Model: the model should be easy to learn and use for an experienced ER modeler, so there are few additional elements.

Representation of the multidimensional aspects: despite the minimality, the specialization should be powerful enough to express the basic multidimensional aspects, namely the qualifying and quantifying data and the hierarchical structure of the qualifying data.

This model allows generalization concepts and introduces some specializations:

A special entity set: the dimension level

Two special relationship sets connecting dimension levels:

o a special n-ary relationship set: the ‘fact’ relationship set


o a special binary relationship set: the ‘roll-up to’ relationship set

The ‘roll-up to’ relationship set relates a dimension level A to a dimension level B that represents concepts at a higher level of abstraction (e.g., city rolls up to country).

The ‘fact’ relationship set is a specialization of a general n-ary relationship set. It connects n different dimension level entities.

The fact relationship set models the natural separation of qualifying and quantifying data: the attributes of the fact relationship set model the measures of the fact, while the dimension levels model the qualifying data. The model uses a special graphical notation, a sample of which is shown in Figure 4.13.

Figure 4.13 The graphical notation of ME/R elements

Individual characteristics of ME/R model may be summarized as follows;

A central element in the multidimensional model is the concept of dimensions that span the multidimensional space. The ME/R model does not contain an explicit counterpart for this idea. This is not necessary, because a dimension consists of a set of dimension levels, and the information about which dimension levels belong to a given dimension is implicitly included in the structure of the rolls-up graph.

The hierarchical classification structure of the dimensions is expressed by

dimension level entity sets and the roll-up relationships. The rolls-up relationship

sets define a directed acyclic graph on the dimension levels. This enables the

easy modeling of multiple hierarchies, alternative paths and shared hierarchy

levels for different dimensions. Thus no redundant modeling of the shared levels

is necessary. Dimension level attributes are modeled as attributes of dimension

level entity sets. This allows a different attribute structure for each dimension

level.


By modeling the multidimensional cube as a relationship set, it is possible to include an arbitrary number of facts in the schema, thus representing a ‘multi-cube model’. Remarkably, the schema also contains information about the granularity level at which the dimensions are shared.

Concerning measures and their structure, the ME/R model allows record-structured measures as multiple attributes of one fact relationship set. The semantic information that some of the measures are derived cannot be included in the model. Like the E/R model, the ME/R model captures the static structure of the application domain; the calculation of measures is functional information and should not be included in the static model. An orthogonal functional model should capture these dependencies.

The schema contains rolls-up relationships between entities; therefore, levels of different dimensions may roll up to a common parent level. This information can be used to avoid redundancies.

The model also uses the ‘is a’ relationship.

ME/R and ER model notations can be used together.
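The idea that a dimension is given implicitly by the rolls-up graph can be sketched as follows (all level names are hypothetical): the levels of a dimension are simply those reachable from its base level, so shared levels need no duplication.

```python
# A rolls-up graph as a directed acyclic graph (hypothetical levels).
# Each level maps to the coarser levels it rolls up to.
rolls_up = {
    "day": ["month"],
    "month": ["year"],
    "year": [],
    "store": ["city"],
    "customer": ["city"],   # 'city' is shared by two dimensions
    "city": ["region"],
    "region": [],
}

def dimension_levels(base):
    """All levels reachable from 'base' via rolls-up edges."""
    seen, stack = set(), [base]
    while stack:
        level = stack.pop()
        if level not in seen:
            seen.add(level)
            stack.extend(rolls_up[level])
    return seen

# The store and customer dimensions share the city/region levels
# without those levels being modeled twice:
store_dim = dimension_levels("store")        # {'store', 'city', 'region'}
customer_dim = dimension_levels("customer")  # {'customer', 'city', 'region'}
```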

Figure 4.14 shows multiple cubes that share dimensions on different levels.

Figure 4.14 Multiple cubes sharing dimensions on different levels

As mentioned above, the ME/R and ER model notations can be used together as

illustrated in Figure 4.15.


Figure 4.15 Combining ME/R notations with E/R

4.5.3. starER

This model combines the star structure with constructs of the ER model [13]. The starER model contains facts, entities, relationships and attributes, and has the following constructs:

Fact set: represents a set of real world facts sharing the same characteristics or

properties. It is always associated with time. It is represented as a circle.

Entity set: represents a set of real world objects with similar properties. It is

represented as a rectangle.

Relationship set: represents a set of associations among entity sets, or among entity sets and fact sets. Its cardinality can be many-to-many, many-to-one or one-to-many. It is represented as a diamond. Relationship sets among entity sets can be of type specialization/generalization, aggregation or membership. Figure 4.16 shows the notation for relationship set types.

Figure 4.16 Notation used in starER


Attribute: represents the static properties of entity sets, relationship sets and fact sets. It is represented as an oval.

Fact properties can be of type stock (S) (the state of something at a specific point in time), flow (F) (the cumulative effect over a period of time of some parameter in the DW environment, which is always summarized) or value-per-unit (V) (measured for a fixed time; the resulting measures are not summarized).
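The different summarization rules of the three fact property types can be sketched with hypothetical daily measurements for one product:

```python
# Summarization rules for the three starER fact property types
# (hypothetical daily measurements for one product):
days = [
    {"sold": 4, "on_hand": 50, "unit_price": 2.0},  # flow, stock, value-per-unit
    {"sold": 6, "on_hand": 44, "unit_price": 2.0},
    {"sold": 2, "on_hand": 42, "unit_price": 2.5},
]

# Flow (F): the cumulative effect over the period, so it is summed.
total_sold = sum(d["sold"] for d in days)       # 12

# Stock (S): a state at a point in time; report e.g. the closing value,
# never the sum over time.
closing_stock = days[-1]["on_hand"]             # 42

# Value-per-unit (V): measured for a fixed time; not summarized,
# at most averaged or reported per period.
avg_price = sum(d["unit_price"] for d in days) / len(days)
```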

The following criteria are satisfied by the starER schema;

Explicit hierarchies in dimensions

Symmetric treatment of dimensions and summary attributes (properties)

Multiple hierarchies in each dimension

Support for correct summary or aggregation

Support of non-strict hierarchies

Support of many-to-many relationships between facts and dimensions

Handling different levels of granularity at summary properties

Handling uncertainty

Handling change and time

The following list shows the main differences between the DF schema and the starER model:

Relationships between dimensions and facts in starER are not only many-to-one but also many-to-many, which allows a better representation of the involved information.

Objects participating in the data warehouse, but not in the form of a dimension, are allowed in the starER.

Specialized relationships on dimensions are permitted

(specialization/generalization, aggregation, membership) and represent more

information.

DF requires only a rather straightforward transformation to fact and dimension tables. This is an advantage of the DF schema, but it is not a drawback for the starER model, since well-known rules do exist for transforming an ER schema (which is the basic structural difference between the two approaches) into relations.


starER model combines the powerful constructs of the ER model with the star

schema.

A sample DW model using starER is illustrated in Figure 4.17.

Figure 4.17 A sample DW model using starER

4.5.4. Object-Oriented Multidimensional Model (OOMD)

Unified Modeling Language (UML) has been widely accepted as a standard object-

oriented modeling language for software design, and the OOMD modeling approach is based on UML. In the OOMD model, dimensions and facts are represented by dimension classes and fact classes [18, 19]. Fact classes are considered composite classes in a shared-aggregation relationship of n dimension classes. In this way, many-to-many relationships between facts and particular dimensions are represented by indicating a 1..* cardinality on the dimension class. The cardinality of fact classes is defined as * to indicate that a dimension object can be part of zero, one or more fact object instances. The minimum

cardinality of dimension classes is defined as 1 to indicate that a fact object is always

related to object instances from all dimensions. Derived measures are placed in the fact class and marked with the “/” notation; derivation rules appear between braces. An identifying attribute can be defined in fact classes with the {OID} (Object Identifier) notation. All measures are

additive (Sum operator can be applied to aggregate measure values along all

dimensions). For dimensions, every classification hierarchy level is specified by a class


(base class). An association of classes specifies the relationships between two levels of a

classification hierarchy. These classes must define DAG (Directed Acyclic Graph)

rooted in the dimension class. The DAG structure can represent both alternative path and

multiple classification hierarchies. A descriptor attribute ({D}) is defined in every class that represents a classification hierarchy level. Strictness means that an object at a

hierarchy’s lower level belongs to only one higher level object. Completeness means

that all members belong to one higher-class object and that object consists of those

members only. OOMD approach uses a generalization-specialization relationship to

categorize entities that contain subtypes.

Cube classes represent the initial user requirements and serve as the starting point for the subsequent data-analysis phase. A cube class contains:

Head area: contains the cube class’s name.

Measures area: contains the measures to be analyzed.

Slice area: contains the constraints to be satisfied.

Dice area: contains the dimensions and their grouping conditions to address the analysis.

Cube operations: cover the OLAP operations for a further data-analysis phase.

4.6. Logical Design Models

DW logical design involves the definition of structures that enable an efficient

access to information. The designer builds multidimensional structures considering the

conceptual schema representing the information requirements, the source databases, and

non-functional (mainly performance) requirements. This phase also includes

specifications for data extraction tools, data loading processes, and warehouse access

methods. At the end of logical design phase, a working prototype should be created for

the end-user.

Dimensional models represent data with a “cube” structure, making the logical data representation more compatible with OLAP data management. The objectives of

dimensional modeling are [10]:


To produce database structures that are easy for end-users to understand and

write queries against,

To maximize the efficiency of queries.

It achieves these objectives by minimizing the number of tables and relationships

between them. Normalized databases have some characteristics that are appropriate for

OLTP systems, but not for DWs [7]:

Its structure is not easy for end-users to understand and use. In OLTP systems

this is not a problem because, usually end-users interact with the database

through a layer of software.

Data redundancy is minimized. This maximizes efficiency of updates, but tends

to penalize retrievals. Data redundancy is not a problem in DWs because data is

not updated on-line.

Dimensional modeling uses ER modeling with some important restrictions. A dimensional model is composed of one table with a composite primary key, called the fact table, and a set of smaller tables called dimension tables. Each dimension table has a

simple (non-composite) primary key that corresponds exactly to one of the components

of the composite key in the fact table. This characteristic structure is called star schema

or star join.

Another important feature is that all natural keys are replaced with surrogate keys. This

means that every join between fact and dimension tables is based on surrogate keys, not

natural keys. Each surrogate key should have a generalized structure based on simple

integers. The use of surrogate keys allows the data in the DW to have some

independence from the data used and produced by the OLTP systems.
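The surrogate key mechanism can be sketched as follows (class and key names are hypothetical); a mapping table insulates the DW keys from the OLTP natural keys:

```python
# Replacing natural keys with integer surrogate keys (hypothetical customers).
class SurrogateKeyGenerator:
    def __init__(self):
        self.next_key = 1
        self.mapping = {}          # natural key -> surrogate key

    def key_for(self, natural_key):
        # Assign a new integer the first time a natural key is seen,
        # and return the same surrogate key on every later load.
        if natural_key not in self.mapping:
            self.mapping[natural_key] = self.next_key
            self.next_key += 1
        return self.mapping[natural_key]

gen = SurrogateKeyGenerator()
k1 = gen.key_for("CUST-0042")      # 1: first time seen
k2 = gen.key_for("CUST-0099")      # 2
k3 = gen.key_for("CUST-0042")      # 1 again: stable across loads
```

Because every fact-to-dimension join then uses these simple integers, the DW is unaffected if the OLTP system later reformats or reuses its natural keys.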

4.6.1. Dimensional Model Design

This section describes a method for developing a dimensional model from an Entity

Relationship model [12].

The ER model is the data model used by OLTP systems. It contains no redundancy, offers high update efficiency, and shows all data and the relationships between them. However, even simple queries require multiple table joins and complex subqueries, so it is suitable only for technical specialists.


Classify Entities: To produce a dimensional model from an ER model, first classify the entities into three categories.

o Transaction Entities: These entities are the most important entities in a DW and have the highest precedence. They construct the fact tables in a star schema. These entities record details about particular events (orders, payments, etc.) that decision makers want to understand and analyze. A transaction entity has two characteristics:

It describes an event that occurs at a point in time.

It contains measurements or quantities that may be summarized (sales amount, volumes).

o Component Entities: These entities are directly related to a transaction entity by a one-to-many relationship. They have the lowest precedence. They define the details or components of each transaction and answer the “who”, “what”, “when”, “where”, “how” and “why” of an event (customer, product, period, etc.). Time is an important component of any transaction. Component entities construct the dimension tables in a star schema.

o Classification Entities: These entities are related to component entities by a chain of one-to-many relationships and are functionally dependent on a component entity. They represent hierarchies embedded in the data model, which may be collapsed into the component entity to form dimension tables in a star schema.

Identify Hierarchies: Most dimension tables in star schema include embedded

hierarchies. A hierarchy is called maximal if it cannot be extended upwards or

downwards by including another entity. An entity is called minimal if it has no

one-to-many relationship. An entity is called maximal if it has no many-to-one

relationship.

Produce Dimensional Models: There are two operators to produce dimensional

models from ER.

o Collapse Hierarchy: Higher level entities can be collapsed into lower

level entities within hierarchies. Collapsing a hierarchy is a form of

denormalization. This increases redundancy in the form of a transitive


dependency, which is a violation of 3NF. We can continue doing this

until we reach the bottom of the hierarchy and end up with a single table.

o Aggregation: This operator can be applied to a transaction entity to create

a new entity containing summarized data.
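The two operators can be sketched on hypothetical data:

```python
# The two operators for deriving a dimensional model (hypothetical data).

# Collapse hierarchy: fold the parent (classification) entity's attributes
# into the child entity, deliberately introducing a transitive dependency.
regions = {1: {"region_name": "North"}, 2: {"region_name": "South"}}
customers = [
    {"cust_id": 10, "name": "Acme", "region_id": 1},
    {"cust_id": 11, "name": "Bolt", "region_id": 2},
]
customer_dim = [
    {**c, "region_name": regions[c["region_id"]]["region_name"]}
    for c in customers
]

# Aggregation: derive a new summarized entity from a transaction entity.
orders = [
    {"cust_id": 10, "amount": 100},
    {"cust_id": 10, "amount": 50},
    {"cust_id": 11, "amount": 70},
]
sales_by_customer = {}
for o in orders:
    sales_by_customer[o["cust_id"]] = (
        sales_by_customer.get(o["cust_id"], 0) + o["amount"]
    )
```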

There are 8 models used in dimensional modeling [6, 12]:

Flat Schema

Terraced Schema

Star Schema

Fact Constellation Schema

Galaxy Schema

Snowflake Schema

Star Cluster Schema

Starflake Schema

4.6.2. Flat Schema

This schema is the simplest schema. This is formed by collapsing all entities in the

data model down into the minimal entities. This minimizes the number of tables in the

database and joins in the queries. We end up with one table for each minimal entity in

the original data model [12].

This structure does not lose information from the original data model. It contains

redundancy, in the form of transitive and partial dependencies, but does not involve any

aggregation. However, it has some problems: first, it may lead to aggregation errors when there are hierarchical relationships between transaction entities, because numerical amounts collapsed from higher-level transaction entities into others will be repeated. Second, this schema results in tables with a large number of attributes.

Therefore while the number of tables (system complexity) is minimized, the

complexity of each table (element complexity) is increased. Figure 4.18 shows a sample

flat schema.


Figure 4.18 Flat Schema

4.6.3. Terraced Schema

This schema is formed by collapsing entities down their maximal hierarchies, stopping when they reach a transaction entity. This results in a single table for each transaction entity in the data model. It can cause problems for inexperienced users, because the separation between levels of transaction entities is explicitly shown [12]. Figure 4.19 illustrates a sample terraced schema.


Figure 4.19 Terraced Schema

4.6.4. Star Schema

It is the basic structure for a dimensional model. It has one fact table and a set of

smaller dimension tables arranged around the fact table. The fact data will not change

over time. The most useful facts are numeric and additive, because data warehouse applications almost never access a single record; they access hundreds, thousands, or millions of records at a time and aggregate them. The fact table is linked to all the

dimension tables by one to many relationships. It contains measurements which may be

aggregated in various ways [10, 12, 39].

Dimension tables contain descriptive textual information. Dimension attributes are

used as the constraints in the data warehouse queries. Dimension tables provide the basis

for aggregating the measurements in the fact table. They generally consist of embedded

hierarchies.

Each star schema is formed in the following way;


A fact table is formed for each transaction entity. The key of the table is the

combination of the keys of its associated component entities.

A dimension table is formed for each component entity, by collapsing

hierarchically related classification entities into it.

Where hierarchical relationships exist between transaction entities, the child

entity inherits all dimensions (and key attributes) from the parent entity. This

provides the ability to “drill down” between transaction levels.

Numerical attributes within transaction entities should be aggregated by key

attributes (dimensions). The aggregation attributes and functions used depend on

the application.
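A star schema produced this way can be sketched with SQLite (all table and column names below are hypothetical):

```python
import sqlite3

# A minimal star schema sketch (hypothetical table and column names):
# one fact table keyed by the surrogate keys of its dimension tables.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE fact_sales  (
    product_key INTEGER REFERENCES dim_product(product_key),
    date_key    INTEGER REFERENCES dim_date(date_key),
    qty         INTEGER,
    PRIMARY KEY (product_key, date_key)
);
INSERT INTO dim_product VALUES (1, 'food'), (2, 'drink');
INSERT INTO dim_date    VALUES (1, 2004), (2, 2005);
INSERT INTO fact_sales  VALUES (1, 1, 10), (1, 2, 5), (2, 1, 7);
""")

# A dimension attribute constrains and groups the fact measurements,
# with a single join per dimension used in the query:
rows = db.execute("""
    SELECT p.category, SUM(f.qty)
    FROM fact_sales f JOIN dim_product p ON f.product_key = p.product_key
    GROUP BY p.category ORDER BY p.category
""").fetchall()
# rows == [('drink', 7), ('food', 15)]
```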

Star schemas can be used to speed up query performance by denormalizing

reference information into a single dimension table. Denormalization is appropriate

when there are a number of entities related to the dimension table that are often

accessed, avoiding the overhead of having to join additional tables to access those

attributes. Denormalization is not appropriate where the additional data is not accessed

very often, because the overhead of scanning the expanded dimension table may not be

offset by gain in the query performance.

The advantage of using this schema is that it reduces the number of tables in the database, the number of relationships between them, and the number of joins required in user queries. Figure 4.20 shows a sample star schema.

Figure 4.20 Star Schema


4.6.5. Fact Constellation Schema

A fact constellation schema consists of a set of star schemas with hierarchically

linked fact tables. The links between the various fact tables provide the ability to “drill

down” between levels of detail [10, 12]. The following figure, Figure 4.21, illustrates a

sample of a fact constellation schema.

Figure 4.21 Fact Constellation Schema

4.6.6. Galaxy Schema

Galaxy schema is a schema where multiple fact tables share dimension tables.

Unlike a fact constellation schema, the fact tables in a galaxy do not need to be directly

related [12]. The following figure, Figure 4.22, illustrates a sample of a galaxy schema.


Figure 4.22 Galaxy Schema

4.6.7. Snowflake Schema

In a star schema, hierarchies in the original data model are collapsed or

denormalized to form dimension tables. Each dimension table may contain multiple

independent hierarchies. A snowflake schema is a variant of star schema with all

hierarchies explicitly shown and dimension tables do not contain denormalized data [10,

12].

The many-to-one relationships among sets of attributes of a dimension can be separated into new dimension tables, forming a hierarchy. The decomposed snowflake structure

visualizes the hierarchical structure of dimensions very well.

A snowflake schema can be produced by the following procedure:

A fact table is formed for each transaction entity. The key of the table is the

combination of the keys of the associated component entities.


Each component entity becomes a dimension table.

Where hierarchical relationships exist between transaction entities, the child

entity inherits all relationships to component entities (and key attributes) from

the parent entity.

Numerical attributes within transaction entities should be aggregated by the key

attributes. The attributes and functions used depend on the application.

The following figure, Figure 4.23, illustrates a sample of a snowflake schema.

Figure 4.23 Snowflake Schema

4.6.8. Star Cluster Schema

While the snowflake schema contains fully expanded hierarchies, which adds complexity and requires extra joins, the star schema contains fully collapsed hierarchies, which leads to redundancy. The best solution may therefore be a balance between these two schemas [12]. Overlapping dimensions can be identified as forks in hierarchies. A fork occurs

when an entity acts as a parent in two different dimensional hierarchies. Fork entities can

be identified as classification entities with multiple one-to-many relationships. In Figure

4.24, Region is parent of both Location and Customer entities and the fork occurs at the

Region entity.


Figure 4.24 Star Schema with “fork”

A star cluster schema is a star schema which is selectively “snowflaked” to separate

out hierarchical segments or sub dimensions which are shared between different

dimensions.

A star cluster schema has the minimal number of tables while avoiding overlap

between dimensions.

A star cluster schema can be produced by the following procedure:

A fact table is formed for each transaction entity. The key of the table is the

combination of the keys of the associated component entities.

Classification entities should be collapsed down their hierarchies until they reach

either a fork entity or a component entity. If a fork is reached, a sub dimension

table should be formed. The sub dimension table will consist of the fork entity

plus all its ancestors. Collapsing should begin again after the fork entity. When a

component entity is reached, a dimension table should be formed.

Where hierarchical relationships exist between transaction entities, the child

entity should inherit all dimensions (and key attributes) from the parent entity.

Numerical attributes within transaction entities should be aggregated by the key

attributes (dimensions). The attributes and functions used depend on the

application.

The Figure 4.25 illustrates a sample diagram of star cluster schema.


Figure 4.25 Star Cluster Schema

4.6.9. Starflake Schema

Starflake schema is a hybrid structure that contains a mixture of star and snowflake

schemas. The most appropriate database schemas use a mixture of denormalized star and

normalized snowflake schemas [6, 41]. The Figure 4.26 illustrates a sample diagram of

starflake schema.

Figure 4.26 Starflake Schema


Whether the schema is a star, snowflake or starflake, the predictable and standard form of the underlying dimensional model offers important advantages within a DW environment, including:

Efficiency; the consistency of the database structure allows more efficient access

to the data by various tools including report writers and query tools.

Ability to handle changing requirements; The star schema can adapt to changes

in the user requirements, as all dimensions are equivalent in terms of providing

access to the fact table.

Extensibility; the dimensional model is extensible. It must support adding new

dimensions, adding new dimensional attributes, breaking existing dimension

records down to lower level of granularity from a certain point in time forward.

Ability to model common business situations

Predictable query processing

The following figure, Figure 4.27, shows a comparison of the logical design methods in

complexity versus redundancy trade-off.

Figure 4.27 Comparison of schemas

4.6.10. Cube

Cubes are the logical storage structures for OLAP databases. A cube defines a set of

related dimensions; each cell of the cube holds one value, and each cell lies at an


intersection of the dimensions. A 2-dimensional view of an OLAP table is given in

Table 4.1.

Table 4.1 2-dimensional pivot view of an OLAP Table

A 3-dimensional view of an OLAP table is given in Table 4.2.

Table 4.2 3-dimensional pivot view of an OLAP Table

A 3-D realization of the cube shown in Table 4.2 is illustrated in Figure 4.28.


Figure 4.28 3-D Realization of a Cube

The cube has three dimensions: time, state and product. Each dimension enables users to perform specific OLAP operations on the cube. The basic OLAP operations are as follows [10, 23, 24]:

Roll up: An operation for moving up the hierarchy level and grouping into larger units along a dimension. Using the roll up capability, users can zoom out to see a summarized level of data. The roll up operation is also called the drill up operation.

Drill down: An operation for moving down the hierarchy level. Using the drill down capability, users can navigate to greater levels of detail. The drill down operation is the reverse of the roll up operation.

Slice: Slicing performs a selection on one dimension of a cube and results in a sub-cube. Slicing cuts through the cube so that users can focus on a more specific perspective.

Dice: Dicing defines a sub-cube by performing a selection on two or more dimensions of a cube.

Pivot: The pivot operation is also called the rotate operation. Pivoting is a visualization operation which rotates the data axes in view in order to provide an alternative presentation of the data.
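The operations above can be sketched in a few lines of plain Python. The following is a minimal illustration, not a real OLAP engine: the cube is a dict mapping (time, state, product) tuples to sales values, and the month-to-quarter hierarchy and all data values are invented for the example.

```python
from collections import defaultdict

# Tiny cube over the time, state and product dimensions (invented data).
cube = {
    ("Jan", "NY", "TV"): 100, ("Feb", "NY", "TV"): 120,
    ("Jan", "CA", "TV"): 80,  ("Jan", "NY", "PC"): 50,
}

# Example hierarchy on the time dimension: month rolls up to quarter.
MONTH_TO_QUARTER = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1"}

def roll_up(cube, hierarchy):
    """Group the time dimension into larger units (month -> quarter)."""
    result = defaultdict(int)
    for (time, state, product), value in cube.items():
        result[(hierarchy[time], state, product)] += value
    return dict(result)

def slice_op(cube, state):
    """Select one value on the state dimension, yielding a sub-cube."""
    return {(t, p): v for (t, s, p), v in cube.items() if s == state}

def dice(cube, states, products):
    """Select on two dimensions at once, yielding a smaller sub-cube."""
    return {k: v for k, v in cube.items()
            if k[1] in states and k[2] in products}

print(roll_up(cube, MONTH_TO_QUARTER)[("Q1", "NY", "TV")])  # 220
```

Drill down would be the inverse of roll_up, which requires keeping the detailed cells; pivoting is purely a presentation change and has no effect on the stored cells.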

Figure 4.29 illustrates the operations above in detail.


Figure 4.29 Operations on a Cube


4.7. Meta Data

An important component of the DW environment is meta data. Meta data, or data about data, enables the most effective use of the DW. Meta data allows the end user/DSS analyst to navigate through the possibilities. In other words, when a user approaches a data warehouse with no meta data, the user does not know where to begin the analysis.

Meta data acts like an index to the data warehouse contents. It sits above the warehouse and keeps track of what is where in the warehouse. Typically, the items the meta data store tracks are as follows [6]:

Structure of data as known to the programmer and to the DSS analyst

Source data

Transformation of data

Data model

DW

History of extracts
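The items above can be pictured as a small record per warehouse object. The following sketch is illustrative only: the field names, table name and all values are invented, not taken from any particular metadata standard.

```python
# Hypothetical metadata record for one fact table, covering the tracked
# items listed above: structure, source, transformation, model, extracts.
fact_sales_metadata = {
    "structure": ["time_key", "state_key", "product_key", "sales_dollars"],
    "source_data": "OLTP sales database, orders table",
    "transformation": "currency normalized to USD; nulls replaced by 0",
    "data_model": "snowflake schema, sales fact table",
    "extract_history": [
        {"date": "2004-12-01", "rows": 15000},
        {"date": "2005-01-01", "rows": 16200},
    ],
}

# A query manager could consult such a record, e.g. to learn which
# columns exist on the fact table before generating a query:
print(fact_sales_metadata["structure"])
```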

Metadata has several functions within the DW that relate to the processes associated with data transformation and loading, DW management and query generation. The metadata associated with data transformation and loading must describe the source data and any changes that were made to the data. The metadata associated with data management describes the data as it is stored in the DW. Every object in the database needs to be described, including the data in each table, index and view and any associated constraints. The metadata is also required by the query manager to generate appropriate queries.

4.8. Materialized Views

The authors of [7] address the problem of selecting a set of views to materialize in a DW, taking into account:

the space allocated for materialization

the ability to answer a set of queries (defined against the source relations) using exclusively these views


the combined query evaluation and view maintenance cost

In this proposal, a graph is defined based on states and state transitions. A state is defined as a set of views plus a set of queries, with an associated cost. Transitions are generated when views or queries are changed. The authors demonstrate that there is always a path from an initial state to the minimal-cost state.
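The cited work searches a state graph; as a much simpler illustration of the same space-versus-cost trade-off, the sketch below greedily picks views to materialize by query-cost savings per unit of storage, within a space budget. This greedy heuristic is not the graph-based method of [7], and the view names, sizes and benefit figures are invented for the example.

```python
def select_views(candidates, space_budget):
    """Greedily choose views by benefit density until space runs out."""
    chosen, used = [], 0
    # Sort by benefit density: query-cost savings per unit of storage.
    for name, size, benefit in sorted(
            candidates, key=lambda v: v[2] / v[1], reverse=True):
        if used + size <= space_budget:
            chosen.append(name)
            used += size
    return chosen

candidates = [
    ("sales_by_month", 40, 200),   # (name, size, saved query cost)
    ("sales_by_state", 30, 90),
    ("sales_by_product", 50, 100),
]
print(select_views(candidates, space_budget=75))
```

With a budget of 75 units this picks sales_by_month and sales_by_state; a real selector would also account for view maintenance cost, as the proposal above does.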

4.9. OLAP Server Architectures

Logically, OLAP engines present business users with multidimensional data from data warehouses or data marts, without concerns regarding how or where the data are stored. However, the physical architecture and implementation of OLAP engines must consider data storage issues. Implementations of a warehouse server engine for OLAP processing include [10]:

Relational OLAP (ROLAP) servers: These are intermediate servers that stand between a relational back-end server and client front-end tools. They use a relational or extended-relational DBMS to store and manage warehouse data, and OLAP middleware to support missing pieces. ROLAP servers include optimization for each DBMS back-end, implementation of aggregation navigation logic, and additional tools and services. ROLAP technology tends to have greater scalability than MOLAP technology.

Multidimensional OLAP (MOLAP) servers: These servers support multidimensional views of data through array-based multidimensional storage engines. They map multidimensional views directly to data cube array structures. The advantage of using a data cube is that it allows fast indexing to precomputed summarized data. Notice that with multidimensional data stores, the storage utilization may be low if the data set is sparse. In such cases, sparse matrix compression techniques should be explored. Many OLAP servers adopt a two-level storage representation to handle sparse and dense data sets: the dense subcubes are identified and stored as array structures, while the sparse subcubes employ compression technology for efficient storage utilization.
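The dense-versus-sparse storage point can be made concrete with a back-of-the-envelope comparison. The sketch below contrasts the cell count a full array allocates with the cells a coordinate-keyed dict actually stores; the dimension sizes and occupied cells are invented for the example.

```python
# Illustrative dimension cardinalities: 12 months, 50 states, 1000 products.
TIMES, STATES, PRODUCTS = 12, 50, 1000

def dense_size(times, states, products):
    """Cells allocated by a full array, regardless of how many are used."""
    return times * states * products

def sparse_size(cells):
    """Cells stored by a dict keyed on coordinates: only occupied ones."""
    return len(cells)

# A very sparse cube: only three cells actually contain data.
occupied = {(0, 3, 17): 100.0, (1, 3, 17): 120.0, (0, 8, 42): 80.0}

print(dense_size(TIMES, STATES, PRODUCTS))  # 600000 cells allocated
print(sparse_size(occupied))                # 3 cells actually stored
```

This is why MOLAP engines reserve arrays for dense subcubes and use compressed representations for the sparse ones.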

Hybrid OLAP (HOLAP) servers: The hybrid OLAP approach combines ROLAP and MOLAP technology, benefiting from the greater scalability of ROLAP and the faster computation of MOLAP. For example, a HOLAP server may allow large volumes of detail data to be stored in a relational database, while aggregations are kept in a separate MOLAP store.


CHAPTER 5

COMPARISON OF MULTIDIMENSIONAL DESIGN MODELS

5.1. Comparison of Dimensional Models and ER Models

The main objective of ER modeling is to remove redundancy from data; it aims to optimize performance for transaction processing.

To remove redundancy, designers must use hundreds of entities and relations between entities, which makes the ER model complex. There is no easy way to enable end-users to navigate through the data in ER models. It is also hard to query ER models because of this complexity; many tables must be joined to obtain a result set. Therefore, ER models are not suitable for high-performance retrieval of data.

The dimensional model is a standard framework. End-user tools can make strong assumptions about the dimensional model to make user interfaces more user friendly and processing more efficient [20].

The dimensional model is more adaptable to unexpected changes in user behavior and requirements. The logical design can be made independent of expected query patterns; all dimensions can be thought of as symmetrically equal entry points into the fact table.

The dimensional model is extensible to new design decisions and data elements. All existing fact and dimension tables can be changed in place without having to reload data, and end-user query and reporting tools are not affected by the change.


ER modeling does not involve business rules; it involves data rules. The dimensional model involves business rules.

5.2. Comparison of Dimensional Models and Object-Oriented Models

Dimensional data modeling (DDM) is a dimensional model design methodology that, for each business process, enumerates the relevant measures and dimensions. A DDM provides a multidimensional conceptual view of data. Although DDM is the favorite approach in data warehousing, it focuses mainly on data and its proper structuring to maximize query performance. The DDM approach lacks modeling of business goals and processes. Although the final logical and physical model will be a dimensional data model, the object-oriented (OO) model is much stronger in the logical and conceptual design phases. We should consider the DDM and OO approaches as complementary to each other; a logical design modeled with an OO model can be mapped easily to a DDM model.

Various approaches have been developed in recent years for the conceptual design of multidimensional systems to represent multidimensional structural and dynamic properties. However, none of them has been accepted as a standard for multidimensional modeling. On the other hand, UML has been accepted as the standard OO modeling language for describing and designing various aspects of software systems. Using UML, the OO model allows modeling of business processes, sub-processes, use cases, system actors, classes, objects, collaborations of objects, relations between objects and, finally, components, which are basically reusable software packages. Objects have types, properties (attributes), behaviors, methods and relations with other objects such as aggregation, inheritance and association.

The DDM approach is basically one in which tables are associated with SQL methods to support set-oriented processing of data and return result sets to the caller. The OO approach, on the other hand, provides an object layer to a DW application, unifying behavior and data within the object components.

An OO approach provides a tighter conceptual association between strategic business goals and objectives and the DDM model. It starts with the specification of business goals, followed by modeling of the processes and use cases. Based on the use cases, the objects are modeled. The resulting DDM model will then be consistent with the corresponding business requirements. In addition to the discussion above, Table 5.1 summarizes the comparison of the ER, DM and OO methodologies according to the factors I consider most important.

                                                        ER   DM   OO
Standard notation (UML)
Business rules focus
Data rules focus
Ability to model all business requirements in detail
Specialization / Generalization
Commercial case tool availability
High association with business objectives
Adaptability to changing requirements

Table 5.1 Comparison of ER, DM and OO methodologies

5.3. Comparison of Conceptual Multidimensional Models

In sections 5.1 and 5.2, the main conceptual modeling approaches were discussed. This section gives a comparison of conceptual multidimensional models according to multidimensional modeling properties [13, 14, 15, 16, 17, 18, 19].

Additivity of measures: DF, starER and OOMD support this property. Using the ME/R model, only static data structure can be captured. No functional aspect can be implemented with the ME/R model; therefore ME/R does not support this property.

Many-to-many relationships with dimensions: starER and OOMD support this property. The DF and ME/R models do not support many-to-many relationships.

Derived measures: None of the conceptual models include derived measures as part of their conceptual schema except the OOMD model.


Nonstrict and complete classification hierarchies: Although DF, starER and ME/R can define certain attributes for classification hierarchies, only the starER model can define exact cardinality for nonstrict and complete classification hierarchies. OOMD can also represent nonstrict and complete classification hierarchies.

Categorization of dimensions (specialization/generalization): DF does not support this property. Since the starER and ME/R models derive from the ER model, these two models use the is-a relationship to categorize dimensions. Since the OOMD model is object-oriented, it supports dimension categorization; note that specialization/generalization is a basic aspect of object-orientation.

Graphic notation and specifying user requirements: All modeling techniques provide a graphical notation to help designers in the conceptual modeling phase. Only the ME/R model provides state diagrams to model the system's behavior and a basic set of OLAP operations to be applied from these user requirements. OOMD provides the complete set of UML diagrams to specify user requirements and help define OLAP functions.

Case tool support: All conceptual design models except starER have case tool support. Conceptual design using the DF approach can be implemented using the WAND case tool [31]. Conceptual design using the ME/R approach can be implemented using the GramMi case tool [29]. With the OOMD approach, the conceptual design may be implemented using the Microsoft® Visio, Rational Rose or GOLD case tools [18, 30].

Table 5.2 summarizes the comparison given above.

                                   starER   DF     ME/R     OOMD
Additivity of measures               X      X               X
Many-to-many relationships
with dimensions                      X                      X
Derived measures                                            X
Nonstrict and complete
classification hierarchies           X                      X
Categorization of dimensions
(specialization/generalization)      X             X        X
Graphic notation                     X      X      X        X
Specifying user requirements                       X        X
Case tool support                          WAND    GramMi   GOLD, Rational Rose, MS Visio

Table 5.2 Comparison of conceptual design models

5.4. Comparison of Logical Design Models

Among the logical design models, the star schema, the snowflake schema and the fact constellation schema are the most widely used models commercially. In this section, I compare these three models in terms of the efficiency, usability, reusability and flexibility quality factors. I consider efficiency the most important factor in DW modeling: a DW is usually a very large database, and because many queries will access large amounts of data and involve multiple join operations, efficiency becomes a major consideration [22].

A star schema is generally the most efficient design for two reasons. First, a design with denormalized tables needs fewer joins. Second, most optimizers recognize star schemas and can generate efficient "star join" operations. A fact constellation schema is a set of hierarchically linked star schemas, and may need more join operations on fact tables. Similarly, a snowflake schema will require more joins on dimension tables. In some cases where the denormalized dimension tables in a star schema become very large, a snowflake schema may be the most efficient design approach.
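The join-count difference can be sketched with plain dicts standing in for tables. Resolving a fact row's product category takes one lookup against a denormalized star dimension, but a chain of lookups against normalized snowflake dimensions. All table contents here are invented for the example.

```python
fact = {"product_key": 7, "sales_dollars": 99.0}

# Star: the category is denormalized into the product dimension (one join).
star_product = {
    7: {"name": "TV", "subcategory": "Video", "category": "Electronics"}}

# Snowflake: product -> subcategory -> category (three normalized tables,
# hence two extra joins for the same answer).
snow_product = {7: {"name": "TV", "subcategory_key": 2}}
snow_subcategory = {2: {"name": "Video", "category_key": 1}}
snow_category = {1: {"name": "Electronics"}}

star_result = star_product[fact["product_key"]]["category"]
snow_result = snow_category[
    snow_subcategory[snow_product[fact["product_key"]]["subcategory_key"]]
    ["category_key"]]["name"]
print(star_result, snow_result)  # Electronics Electronics
```

Both layouts return the same category; the star layout simply pays for it with redundant storage instead of extra joins.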

In terms of usability, a number of advantages may be considered for the star schema design approach. The star schema is the simplest structure among the three schemas. Because a star schema has the fewest tables, users need to execute fewer join operations, which makes it easier to formulate analytic queries. It is also easier to learn the star schema compared to the other two schemas. The fact constellation schema and snowflake schema are more complex than the star schema, which is a disadvantage in terms of usability.

Considering reusability, the snowflake schema is more reusable than the star and fact constellation schemas. Dimension tables in a snowflake schema do not contain denormalized data, which makes it easier to share dimension tables between snowflake schemas in a DW. In the star schema and fact constellation schema design approaches, dimension tables are denormalized, which makes it less convenient to share dimension tables between schemas.

In terms of flexibility, a star schema is more flexible in adapting to changes in user requirements, as all dimensions are equivalent in terms of providing access to the fact table. Table 5.3 summarizes the comparison of the three logical design models in terms of quality factors.

              Star Schema   Snowflake Schema   Fact Constellation Schema
Efficiency    High          Low                Moderate
Usability     High          Low                Moderate
Reusability   Low           High               Moderate
Flexibility   High          Low                Moderate

Table 5.3 Comparison of logical design models

5.5. Discussion on Data Warehousing Design Tools

There are CASE tools that enable a user to design a DW architecture; some of them were mentioned in section 5.3. Current commercial CASE tools have rich design options that enable modelers to model databases and software solutions even at the enterprise level. These tools can generate code that may be used in the development phase in the form of VB.NET, C# or C++, and may reverse engineer a given source code project into software models. Using these CASE tools, a designer can also generate databases via the tool and reverse engineer databases into a model using the design model diagrams. Unfortunately, the development of CASE tools in the data warehousing area is not as mature as the development in the ER and software modeling areas. Very few commercially available tools can help in designing data warehousing solutions, and they may still not cover the requirements you need.


A CASE tool that does not base its modeling notation on UML may never cover the needs of data warehousing design. Note that the main purpose of UML is to "represent business rules" [26].

Unfortunately, a CASE tool for complete data warehousing design is not available, but there are still solutions using the existing CASE tools in the data warehousing arena. As Kimball mentions [27]: "I am cautious about recommending methodologies and CASE tools from traditional environments for data warehouse software development. Methodologies scare me the most. So often I see a methodology becoming the goal of a project, rather than just being a tool. Recently someone asked me: 'What is the difference between a methodologist and a terrorist?' I answered, 'At least you can negotiate with a terrorist.'" Finally, a designer/developer should use the existing CASE tools for the development of a DW.

There are some factors that data warehouse software developers must consider to make more use of existing software development tools. These topics are summarized as follows:

Definition and declaration of business rules: Communication and documentation are very important for data warehouse business rules design. Unfortunately, we should not expect significant automatic conversion of the most complex business rules into code using the present commercial CASE tools. Complex business rules still remain in a documentation-only form, and will probably only be enforced by a human designer. The CASE tool best meeting this requirement may be Microsoft® Visio 2003 Enterprise Edition.

Database design: Most of the commercial CASE tools enable designing ER databases. It should be noted that although these CASE tools may support generating ER databases from models and reverse engineering a generated ER database into an ER model, they do not provide a complete solution for an OLAP solution. For designing a database, Microsoft® Visio, ERwin or WarehouseArchitect may be used.

ETL process: The commercially available tools for the ETL process meet the requirements of this process. A widely used ETL product is Microsoft® Data Transformation Services (DTS), a built-in feature of Microsoft® SQL Server 2000. DTS allows defining ETL processes using its built-in designer, the COM API of DTS, or a scripting language. The tool can accept any data source that has an ODBC or OLE DB provider; for example, a data source may be Microsoft® SQL Server 7.0, Microsoft® SQL Server 2000, Oracle, Sybase, DB2, a text file, a Microsoft® Access database, a Microsoft® Excel spreadsheet (all versions), etc. In this thesis, the ETL process of the sample DW solution is implemented using DTS and described in section 6.5.
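Independently of any particular product, the extract-transform-load pattern itself is simple to sketch. The code below is a generic illustration in plain Python, not the DTS API; the source rows and cleaning rules are invented for the example.

```python
def extract():
    # Stands in for reading rows from an ODBC/OLE DB data source.
    return [
        {"product": " tv ", "amount": "100"},
        {"product": "PC", "amount": None},
    ]

def transform(rows):
    # Clean and normalize the extracted rows before loading.
    cleaned = []
    for row in rows:
        cleaned.append({
            "product": row["product"].strip().upper(),  # normalize names
            "amount": float(row["amount"] or 0),        # replace nulls
        })
    return cleaned

def load(rows, warehouse):
    # Stands in for inserting the cleaned rows into the warehouse table.
    warehouse.extend(rows)

warehouse_table = []
load(transform(extract()), warehouse_table)
print(warehouse_table)
```

A real DTS package organizes the same three steps as connections and tasks, with the transformation step written in the designer or a script.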

Metadata repository design: Vendors like Microsoft, IBM and Oracle have all either defined a global metadata repository format or promised the market that they will do so in the future. But the world needs a single standard instead of multiple repository definitions. Most analysts believe that Microsoft's repository effort is the one most likely to succeed. "In 1997, Microsoft turned over the content of its proprietary repository design to the Metadata Coalition, and worked with a group of vendors to define the Open Information Model (OIM), intended to support software development and data warehousing" [27]. Microsoft and many other vendors are actively programming their tools to read and write from the Microsoft Repository.

Communication: While the nature of data warehousing requires consolidating data from heterogeneous data sources, data sources may become homogeneous. As XML became a standard of communication, XML web services appeared. "An XML Web service is a programmable entity that provides a particular element of functionality, such as application logic, and is accessible to any number of potentially disparate systems using ubiquitous Internet standards, such as XML and HTTP" [28]. Web services change the nature of DWs.


CHAPTER 6

IMPLEMENTING A DATA WAREHOUSE

6.1. A Case Study

This section describes a case study of implementing a DW. A database was designed to simulate an OLTP application. The aim of this case study is to build a DW to enable analysis of the data in the OLTP database. The conceptual design is modeled using the OOMD, starER, ME/R and DF models to give example designs for all approaches. The DW is designed using the snowflake schema. Finally, this case study illustrates an implementation of a data warehousing solution covering all phases of design. The ER model of the OLTP database that forms the basis for our DW is shown in Figure 6.1.


Figure 6.1 ER model of sales and shipping systems

6.2. OOMD Approach

Data warehousing has largely developed with little or no reference to object-oriented software engineering. This is consistent with (a) its development out of the two-tier client/server relational database methodology, and (b) its character as a kind of high-level systems integration activity, rather than a software development one. Data warehousing assembles components, rather than creating them. The initial top-down centralized data warehousing model, with its single database server, limited middleware and GUI front-end, could get by with a pragmatic "systems integration" orientation at the global level of component interaction, while restricting explicit object-orientation to the various GUI, middleware, analysis, web-enabling, data transformation and reporting tools that comprise the data warehousing arsenal.

The days of two-tier client/server-based data warehouses are gone. The dynamics of the data warehousing business and its own development of new technology have caused centralized data warehouses to be increasingly supplemented with data marts, with data stores of diverse type and content, with specialized application servers, with design and metadata repositories, and with internet, intranet and extranet front-ends. The two-tier client/server paradigm has given way to a multi-tier reality characterized by an increasing diversity of objects/components and an increasing complexity of object relations. Moreover, this reality is one of objects physically distributed across an increasingly complex network architecture. Data warehouse development now requires the creation of a specific distributed object/component solution, and the evolutionary development of this solution over time, rather than the creation of a classical two-tier client/server application [25].

I strongly believe it is more convenient to model complex software systems with the OO design approach using UML. So, before going into implementation details, I would like to introduce some sample diagrams for the sales and shipping system using the OO conceptual modeling approach. Figure 6.2 illustrates the use case diagram, Figure 6.3 shows the statechart diagram and Figure 6.4 illustrates the static structure diagram of the sales and shipping system using UML notation.

Figure 6.2 Use case diagram of sales and shipping system


Use case diagrams are used to describe real-world activities. A use case is a set of events that occurs when an actor uses a system to complete a process. Normally, a use case is a relatively large process.

Figure 6.3 Statechart diagram of sales and shipping system

Statechart diagrams are used to show the sequence of states an object goes through during its life.

Figure 6.4 Static structure diagram of sales and shipping system


Static structure diagrams are used to create conceptual diagrams that represent concepts from the real world and the relationships between them, or class diagrams that decompose a software system into its parts.

UML also offers the following diagrams, which give the modeler great flexibility and help understandability during the conceptual and logical design phases. Package diagrams are used to group related elements in a system; one package can contain subordinate packages, diagrams or single elements. Activity diagrams are used to describe the internal behavior of a method and represent a flow driven by internally generated actions. Sequence diagrams are used to show the actors or objects participating in an interaction and the events they generate, arranged in a time sequence. Collaboration diagrams are used to show relationships among object roles, such as the set of messages exchanged among the objects to achieve an operation or result. Component diagrams are used to partition a system into cohesive components and show the structure of the code itself. Deployment diagrams are used to show the structure of the run-time system and communicate how the hardware and software elements that make up an application will be configured and deployed.

The static structure diagram in Figure 6.4 forms the basis of the MD model and can easily be mapped to it. The "Sales" and "Shipping" classes form the fact tables; the "Employee", "Shipper", "Region", "State", "Country", "Product", "Product_SubCategory" and "Product_Category" classes form the dimension tables.

6.3. starER Approach

In this section, the conceptual model is designed using the starER approach. Figure 6.5 illustrates the sales subsystem starER model. The following list describes the items in a starER model:

Items represented as circles are fact sets.

Items represented as rectangles are entity sets.

Items represented as diamonds are relationship sets.

Items represented as ovals are attributes.


The model contains one fact set (sales) and four dimensions (employee, product, time and state). The time, product and state dimensions contain hierarchies that enable summarization of the fact set at different granularities. Non-complete membership is shown in these dimensions.

Figure 6.5 Sales subsystem starER model

Figure 6.6 illustrates the shipping subsystem starER model.


Figure 6.6 Shipping subsystem starER model

The model contains one fact set (shipping) and five dimensions (shipper, product, time, state_from and state_to). The time, product and state dimensions contain hierarchies that enable summarization of the fact set at different granularities. Non-complete membership is shown in these dimensions.

Having the ER model of the OLTP system, it is relatively easy and straightforward to design the conceptual model using the starER modeling technique.

6.4. ME/R Approach

In this section, the conceptual model is designed using the ME/R approach. Figure 6.7 illustrates the sales subsystem with the ME/R approach.

The first design step is determining the dimensions and facts. The fact relationship connects the sales fact with the dimensions state, product, day and employee. The rolls-up relationships (arrow shapes) are shown in the model below. The fact relationship in the middle of the diagram connects the atomic dimension levels. Each dimension is represented by a subgroup that starts at the corresponding atomic level. The actual facts (sales units, sales dollars) are modelled as attributes of the fact relationship. The dimension hierarchies are shown by the rolls-up relationships (e.g. product rolls up to product_subcategory and product_category). Additional attributes of a dimension level (e.g. employee_id and employee_name of an employee) are depicted as dimension attributes of the corresponding level.

Figure 6.7 Sales subsystem ME/R model

Figure 6.8 illustrates the shipping subsystem with the ME/R approach, which is similar to the sales subsystem.


Figure 6.8 Shipping subsystem ME/R model

6.5. DF Approach

In this section, the conceptual model is designed using the DF approach. Figure 6.9 illustrates the sales subsystem with the DF approach. The sales fact scheme is structured as a tree whose root is the sales fact. The fact is illustrated as a rectangle containing the fact name (sales) and the measures (sales_dollars, sales_units). Each vertex directly attached to the fact is a dimension. The dimensions in the sales model are product, time, state and employee. Subtrees rooted in dimensions are hierarchies. Their vertices, represented by circles, are attributes (e.g. product_ID); their arcs represent -to-one relationships between pairs of attributes. The vertices in the fact schema represented by lines instead of circles are non-dimension attributes (e.g. product_name).


Figure 6.9 Sales subsystem DF model

Figure 6.10 illustrates the shipping subsystem with the DF approach.

Figure 6.10 Shipping subsystem DF model


6.6. Implementation Details

In this section, using the OOMD approach and taking the static structure diagram in

Figure 6.4 as the basis, the physical implementation of the sample DW is given. In the

model, there are two fact tables: “sales” and “shipping”. I have chosen the snowflake

schema for the implementation of the DW. The main reason for choosing the snowflake

schema is that the sample OLTP database I have prepared as the data source of my

sample DW is formed entirely of normalized tables; therefore, using snowflake

schemas in the design of the DW is more applicable and easier to implement. The

sample DW is implemented using Microsoft® OLAP Server [8].
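The normalized product dimension of such a snowflake schema can be sketched in SQL. The table and column names below are assumptions for illustration (the actual schema of Figure 6.11 lives on Microsoft® OLAP Server); SQLite is used here only as a self-contained stand-in:

```python
import sqlite3

# A sketch of a snowflake-style sales subsystem: the product dimension is
# normalized into three tables, and the fact table references only the
# atomic level. Names are illustrative, not the thesis schema verbatim.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_category (
    category_id   INTEGER PRIMARY KEY,
    category_name TEXT
);
CREATE TABLE product_subcategory (
    subcategory_id   INTEGER PRIMARY KEY,
    subcategory_name TEXT,
    category_id      INTEGER REFERENCES product_category(category_id)
);
CREATE TABLE product (
    product_id     INTEGER PRIMARY KEY,
    product_name   TEXT,
    subcategory_id INTEGER REFERENCES product_subcategory(subcategory_id)
);
CREATE TABLE sales_fact (
    product_id    INTEGER REFERENCES product(product_id),
    time_id       INTEGER,
    sales_units   INTEGER,
    sales_dollars REAL
);
""")
```

In a star schema the three product tables would collapse into one denormalized dimension table; the snowflake keeps them separate, trading join cost for reduced redundancy.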

The snowflake schema for the sales subsystem is illustrated in Figure 6.11.

Figure 6.11 Snowflake schema for the sales subsystem.


The snowflake schema for the shipping subsystem is illustrated in Figure 6.12.

Figure 6.12 Snowflake schema for the shipping subsystem.

The diagram in Figure 6.13 illustrates the general architecture of the case study.

Figure 6.13 General architecture of the case study

The ETL process is implemented using Microsoft® Data Transformation Services

(DTS). DTS is a set of graphical tools and programmable objects that let the designer

extract, transform, and consolidate data from different sources into single or multiple

destinations. A DTS package is an organized collection of connections, DTS tasks, DTS


transformations, and workflow constraints assembled either with a DTS tool or

programmatically and saved to Microsoft® SQL Server, Microsoft® SQL Server 2000

Meta Data Services, a structured storage file, or a Microsoft® Visual Basic file. Each

package contains one or more steps that are executed sequentially or in parallel when the

package is run. When executed, the package connects to the correct data sources, copies

data and database objects, transforms data, and notifies other users or processes of

events. Packages can be edited, password protected, scheduled for execution, and

retrieved by version. A DTS transformation is one or more functions or operations

applied against a piece of data before the data arrives at the destination. The source data

is not changed. For example, the user can extract a substring from a column of source

data and copy it to a destination table. The particular substring function is the

transformation mapped onto the source column. The user also can search for rows with

certain characteristics (for example, specific data values in columns) and apply functions

only against the data in those rows. Transformations make it easy to implement complex

data validation, data scrubbing, and conversions during the import and export process.
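The transformation behaviour described above, a function mapped onto a source column and applied only to rows with certain characteristics, can be sketched in plain Python. The column names and the predicate are invented for illustration; DTS itself works quite differently under the hood:

```python
# A small analogue of a DTS column transformation: apply a function to one
# column, optionally only for rows matching a predicate. The source rows
# are left unchanged, mirroring the DTS semantics described in the text.
def transform(rows, column, func, predicate=lambda row: True):
    out = []
    for row in rows:
        new = dict(row)  # copy, so the source data is not changed
        if predicate(row):
            new[column] = func(row[column])
        out.append(new)
    return out

source = [{"product_code": "US-1234-X"}, {"product_code": "EU-5678-Y"}]

# Extract a substring from the column, but only for rows starting with "US".
dest = transform(
    source,
    "product_code",
    lambda value: value.split("-")[1],
    predicate=lambda row: row["product_code"].startswith("US"),
)
```

Here the substring function plays the role of the transformation mapped onto the source column, and the predicate stands in for selecting rows by specific data values.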

DTS is based on an OLE DB architecture that allows copying and transforming data

from a variety of data sources. Some of these are listed below:

SQL Server and Oracle directly, using native OLE DB providers.

ODBC sources, using the Microsoft® OLE DB Provider for ODBC.

Microsoft® Access 2000, Microsoft® Excel 2000, Microsoft® Visual FoxPro,

dBase, Paradox, HTML, and additional file data sources.

Text files, using the built-in DTS flat file OLE DB provider.

Microsoft® Exchange Server, Microsoft® Active Directory and other

nonrelational data sources.

Other data sources provided by third-party vendors.

In the sample implementation, the sales fact table and the shipping fact table are

populated from a delimited text file, an Access database, an Excel spreadsheet and a

SQL Server database. The DTS packages implemented are shown in Figure 6.14 and

Figure 6.15 respectively.


Figure 6.14 Sales DTS Package

Figure 6.15 Shipping DTS Package


The arrow from the delimited sales text file to the DW Data Source indicates the

transformation from the text file to the “SalesFact” table. The transformation is a copy

column transformation with the details shown in Figure 6.16.

Figure 6.16 Transformation details for delimited text file

Transformations from Microsoft® Excel and Microsoft® Access are similar to the

transformation from the delimited text file. The transformation from the SQL Server

database is a little different. The table with the data is not loaded as it is, but with a

custom query. The detail of this transformation is shown in Figure 6.17. As seen in the

figure, the transformation source is a custom Transact-SQL query with grouping and

aggregation of data.
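The actual Transact-SQL query of Figure 6.17 is not reproduced in the text, so the following is a stand-in showing the same kind of grouping and aggregation as a transformation source, run against an invented source table in SQLite:

```python
import sqlite3

# A stand-in for a transformation source that groups and aggregates the
# OLTP data before it is loaded into the fact table. The order_detail
# table and its contents are invented sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE order_detail (product_id INTEGER, quantity INTEGER, amount REAL);
INSERT INTO order_detail VALUES (1, 2, 20.0), (1, 3, 30.0), (2, 1, 5.0);
""")

# Aggregate per product, yielding one pre-summarized row per fact key.
rows = conn.execute("""
    SELECT product_id,
           SUM(quantity) AS sales_units,
           SUM(amount)   AS sales_dollars
    FROM order_detail
    GROUP BY product_id
    ORDER BY product_id
""").fetchall()
```

Using an aggregating query as the source means the destination fact table receives pre-summarized rows rather than raw transaction detail.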


Figure 6.17 Transact-SQL query as the transformation source

Another feature of DTS is lookup queries. Lookup queries allow running queries

and stored procedures against other connections besides the source and destination. For

example, by using a lookup query, you can make a separate connection during a query

and include data from that connection in the destination table. Lookup queries are

especially useful in validating input data before loading it.
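The idea of a lookup query, consulting a separate connection to validate or enrich each incoming row, can be sketched as follows. The dictionary stands in for the other connection, and all names are illustrative:

```python
# A rough analogue of a DTS lookup query: while loading rows, consult a
# separate reference source (a dict standing in for another connection)
# to validate input data and enrich valid rows before they reach the
# destination. All names and data are illustrative.
STATE_LOOKUP = {"CA": "California", "NY": "New York"}

def load_with_lookup(rows):
    destination, rejected = [], []
    for row in rows:
        name = STATE_LOOKUP.get(row["state_code"])
        if name is None:
            rejected.append(row)            # validation failed: unknown state
        else:
            destination.append({**row, "state_name": name})  # enriched row
    return destination, rejected
```

Rows that fail the lookup never reach the destination table, which is exactly the validation role described above.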

After populating the data from the different sources and transforming and loading the

data into the DW using DTS, we need a client application that enables end users to

query the DW. In the sample implementation I used Microsoft® Excel and

Microsoft® Data Analyzer as the client applications, relying mainly on the pivot table

and pivot chart technologies. Figure 6.18 shows a snapshot of the pivot chart view of the sales

cube.


Figure 6.18 Pivot Chart using Excel as client

Figure 6.19 shows the pivot table view of the sales cube.

Figure 6.19 Pivot Table using Excel as client


Figure 6.20 shows the sales cube using Microsoft® Data Analyzer.

Figure 6.20 Data Analyzer as client

Microsoft® Data Analyzer is business-intelligence software that enables you to

apply the intelligence of an organization to the challenges of displaying and analyzing

data in a quick and meaningful manner. Data Analyzer accomplishes this by giving the

user a complete overview in one screen, which helps the user to quickly find hidden

problems, opportunities, and trends. By using Data Analyzer, non-technical business

users can get answers to their questions immediately and independently, which puts

knowledge directly into the hands of the people who need it most — decision-makers at

all levels in the organization. Data Analyzer provides a number of advantages for

displaying and analyzing the data:

A complete overview on a single screen replaces masses of grids, graphs, and

reports.


Multidimensional views show relationships throughout all aspects of your

business.

Customizable displays guide the user to hidden problems, opportunities, and

trends.

The dynamic use of color highlights anomalies.

Saving and reporting functions allow the user to save views for future use and

export them to Microsoft® PowerPoint or Microsoft® Excel.

Power users can do more advanced analysis in less time.


CHAPTER 7

CONCLUSIONS AND FUTURE WORK

Successful data management is an important factor in developing support systems

for the decision-making process. Traditional database systems, called operational or

transactional, do not satisfy the requirements for data analysis of the decision-making

users. An operational database supports daily business operations and the primary

concern of such a database is to ensure concurrent access and recovery techniques that

guarantee data consistency. Operational databases contain detailed data, do not include

historical data, and since they are usually highly normalized, they perform poorly for

complex queries that need to join many relational tables or to aggregate large volumes of

data.

A DW represents a large repository of integrated and historical data needed to

support the decision-making process. The structure of a DW is based on a

multidimensional model. This model includes measures that are important for analysis,

dimensions allowing the decision-making users to see these measures from different

perspectives, and hierarchies supporting the presentation of detailed or summarized

measures. The characteristics of a multidimensional model specified for the DW can be

applied for a smaller structure, a data mart, which is different from the DW in the scope

of its analysis. A data mart refers to a part of an organization and contains a limited

amount of data.


DWs have become the main technology for DSS. DSSs require not only the data

repository represented by the DW, but also the tools that allow analysing data. These tools

include different kinds of applications; for example, software that includes statistics and

data mining techniques offers complex analysis of large volumes of data to identify

profiles, behaviour, and tendencies. On the other hand, OLAP tools can manage high

volumes of historical data allowing for dynamic data manipulations and flexible

interactions with the end-users through the drill-down, roll-up, pivoting, and slicing-and-

dicing operations. Furthermore, OLAP tools are based on multidimensional concepts

similar to the DW multidimensional model, using its measures, dimensions, and

hierarchies. If the DW data structure has a well-defined multidimensional model, it is

easier to fully exploit the capabilities of OLAP tools.
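Two of these operations can be illustrated on a tiny in-memory cube. The fact rows and dimension order are invented sample data; slicing fixes one dimension value, while roll-up aggregates the measure over all dimensions but one:

```python
from collections import defaultdict

# Invented sample facts: (product, state, year, sales_units).
facts = [
    ("p1", "CA", 2004, 10),
    ("p1", "NY", 2004, 5),
    ("p2", "CA", 2005, 7),
]

def slice_cube(rows, dim_index, value):
    """Slice: keep only rows where the given dimension equals value."""
    return [row for row in rows if row[dim_index] == value]

def roll_up(rows, keep_index):
    """Roll-up: sum the measure over all dimensions except the kept one."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[keep_index]] += row[-1]
    return dict(totals)
```

Slicing on state "CA" keeps two rows, and rolling up to the product level sums the sales units per product, the kind of summarized measure a hierarchy presents.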

In this thesis, widely accepted conceptual and logical design approaches in DW

design are discussed. In the conceptual design phase DF, starER, ME/R and OOMD

design models are compared. The OO design model is significantly better than the other

design approaches. OOMD supports the conceptual design phase with a rich set of

diagrams that enables the designer to model all the business information and requirements

using a CASE tool with UML. The OOMD design model meets the following factors while

the others lack one or more:

Additivity of measures

Many-to-many relationships with dimensions

Derived measures

Nonstrict and complete classification hierarchies

Categorization of dimensions (specialization/generalization)

Graphic notation

Specifying user requirements

Case tool support

In the logical design phase flat, terraced, star, fact constellation, galaxy, snowflake,

star cluster and starflake schemas are discussed. Among these logical design models,

star schema, snowflake schema and the fact constellation schema are the most widely used

models commercially. These three models are compared in terms of efficiency, usability,


reusability and flexibility quality factors, among which efficiency is the most important

one for DW modeling. Given these factors, the requirements of the business, and

the trade-off between redundancy and query performance, either the snowflake or the

star schema may be the best choice in the design.

There are CASE tools that enable a user to design DW architecture. Current

commercial CASE tools have rich design options that enable modelers to

model databases and software solutions even at the enterprise level. Unfortunately, the

development of CASE tools in the data warehousing area is not as mature as the

development in the ER and software modeling areas. Very few commercially available tools

help in designing data warehousing solutions, and they may still not cover all the

requirements. A CASE tool that does not base its modeling notation on UML

may never cover the needs of data warehousing design. Note that the main purpose of

UML is to “represent business rules”.

Unfortunately, a CASE tool for a complete data warehousing design is not available,

but there are still solutions using the existing CASE tools in the data warehousing arena.

Two of the commercial CASE tools that support OOMD design are Microsoft® Visio

and Rational Rose. The sample OOMD model in the thesis is implemented using

Microsoft® Visio. Likewise, there are a number of OLAP Servers like Microsoft® SQL

Server Analysis Services, Hyperion Essbase, PowerPlay, DB2 OLAP Server. As

mentioned in the thesis, data warehousing is a complete process, starting with data

acquisition, continuing with the design and storage of data, and finally enabling end users to access this data.

It is important for a platform to be able to offer a complete solution in data warehousing.

In this thesis, the data warehousing application is implemented using Microsoft

technologies. For the data acquisition phase Microsoft® DTS, for the design and storage

phase Microsoft® SQL Server Analysis Services, and finally for end user access

Microsoft® Excel and Microsoft® Data Analyzer are used.

7.1. Contributions of the Thesis

This thesis contributes to both theory and practice of data warehousing. The data

warehouse design models and approaches in the literature are researched and grouped


according to the phase in the project development cycle. These models are refined

according to their acceptance in the literature.

The first contribution of the thesis is the comparison of the methodologies, namely E/R,

DM and OO. The second contribution of the thesis is the comparison of the conceptual

design models. The third contribution of the thesis is the comparison of the logical

design models in terms of the quality factors. The fourth contribution of the thesis is a

case study covering the phases of a data warehousing solution. In addition, issues and

problems identified in this thesis that might impact the data warehouse implementation

will enable project managers to develop effective strategies for successfully

implementing data warehouses.

According to my research, there is no complete study in the literature on DW models

providing a mapping of models to development phases and giving a comparison of the

models according to these phases. Also, very few articles point to an implementation

covering all phases of data warehousing. So, the last contribution of the thesis is

providing a complete study on these missing points.

7.2. Future Work

One possible future work is comparing the physical design models for a data

warehousing solution and extending the case study to cover these physical design

approaches.

Another future work is implementing a more complex case study using real

world application data and performing performance tests on the three logical models

compared, to support the comparison of logical design models presented in the thesis.

Another future work is improving the comparison of logical design models by

covering both more models and more quality factors in the comparison.

A CASE tool for a complete data warehousing design is not available. One future

work is implementing a CASE tool that meets the requirements of a data

warehousing solution.


REFERENCES

[1] Romm M., Introduction to Data Warehousing, San Diego SQL User Group

[2] Goyal N., Introduction to Data Warehousing, BITS, Pilani Lecture Notes

[3] Franconi E., Introduction to Data Warehousing, Lecture Notes, http://www.inf.unibz.it/~franconi/teaching/2002/cs636/2, 2002

[4] Pang L., Data Warehousing and Data Mining, Leslie Pang Web Site and Lecturer Notes

[5] Gatziu S. and Vavouras A., Data Warehousing: Concepts and Mechanisms, 1999

[6] Thomas Connolly & Carolyn Begg, “Database Systems, 3rd Edition”, Addison-Wesley, 2002

[7] Gatierrez A. and Marotta A., An Overview of Data Warehouse Design Approaches and Techniques, Uruguay, 2000

[8] Reed Jacobson, “Microsoft® SQL Server 2000 Analysis Services”, ISBN 0-7356-0904-7, 2000

[9] Rizzi S., Open Problems in Data Warehousing, DMDW 2003, Berlin, Germany, http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-77/

[10] J. Han and M. Kamber, “Data Mining: Concepts and Techniques”, Chapter 2: Data Warehouse and OLAP Technology for Data Mining, Morgan Kaufmann, 2000

[11] W. H. Inmon, “Building the Data Warehouse, 3rd Edition”, John Wiley, 2002


[12] Moody D. L. and Kortink M. A. R., From Enterprise Models to Dimensional Models: Methodology for Data Warehouse and Data Mart Design, DMDW 2000, Stockholm, Sweden, http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-28/

[13] Tryfona N., Busborg F., Christiansen J. G., starER: A Conceptual Model for Data Warehouse Design, Proceeding of the ACM 2nd International Workshop Data Warehousing and OLAP (DOLAP99), 1999

[14] Sapia C., Blaschka M., Höfling G., Dinter B., Extending the E/R Model for the

Multidimensional Paradigm, Proceeding 1st International Workshop on Data Warehousing and Data Mining (DWDM98), 1998

[15] Golfarelli M., Maio D., Rizzi S., Conceptual Design of Data Warehouses from

E/R Schemas, Proceeding of the 31st Hawaii International Conference on System Sciences (HICSS-31), Vol. VII,1998

[16] Golfarelli M., Maio D., Rizzi S., The Dimensional Fact Model: A Conceptual

Model For Data Warehouses, International Journal of Cooperative Information Systems (IJCIS), Vol. 7, 1998

[17] Golfarelli M, Rizzi S., A Methodological Framework for Data Warehouse

Design, Proceeding of the ACM DOLAP98 Workshop, 1998

[18] Lujan-Mora S., Trujillo J., Song I., Multidimensional Modeling with UML Package Diagrams, 21st International Conference on Conceptual Modeling (ER2002), 2002

[19] Trujillo J., Palomar M., An Object Oriented Approach to Multidimensional

Database Conceptual Modeling (OOMD) , Proceeding 1st International Workshop on Data Warehousing and OLAP (DOLAP98), 1998

[20] Kimball R., “A Dimensional Modeling Manifesto”, DBMS Magazine, Aug 1997, http://www.dbmsmag.com/9708d15.html

[21] Kimball R., “The Data Warehouse Toolkit”, John Wiley, 1996

[22] Martyn T., Reconsidering Multi-Dimensional Schemas, SIGMOD Record, Vol. 33, No. 1, 2004

[23] Elmasri R., Navathe S., “Fundamentals of Database Systems”, 3rd Edition, Addison-Wesley, 2000

[24] Ballard C., Herreman D., Schau D., Bell R., Kim E., and Valencic A., “Data Modeling Techniques for Data Warehousing”, IBM Redbook, IBM International Technical Support Organization, 1998


[25] Firestone J., Object-Oriented Data Warehousing, 1997

[26] Kimball R., Enforcing the Rules, 2000, http://www.intelligententerprise.com/000818/webhouse.jhtml?_requestid=380244

[27] Kimball R., The Software Developer in Us, 2000, http://www.intelligententerprise.com/000908/webhouse.jhtml

[28] Microsoft Developer Network (MSDN) Library, XML Web Services Overview, October 2004

[29] Hahn K., Sapia C., and Blaschka M., Automatically Generating OLAP Schemata from Conceptual Graphical Models, Proceedings ACM 3rd International Workshop Data Warehousing and OLAP (DOLAP 2000), 2000

[30] Lujan-Mora S., Multidimensional Modeling Using UML and XML, Proceedings 16th European Conference on Object-Oriented Programming (ECOOP 2002), 2002

[31] Golfarelli M., Rizzi S., WAND: A Case Tool for Data Warehouse Design, Demo

Proceedings of The 17th International Conference on Data Engineering (ICDE 2001), 2001

[32] Chaudhuri S., Dayal U., An Overview of Data Warehousing and OLAP

Technology, ACM Sigmod Record, vol.26, 1997

[33] Golfarelli M., Rizzi S., Designing the Data Warehouse: Key Steps and Crucial Issues, Journal of Computer Science and Information Management, 1999

[34] Phipps C., Davis K., Automating Data Warehouse Conceptual Schema Design and Evaluation, DMDW’02, 2002

[35] Peralta V., Marotta A., Ruggia R., Towards the Automation of Data Warehouse Design, 2003

[36] Batini C., Ceri S., Navathe S., “Conceptual Database Design-An Entity Relationship Approach”, Addison-Wesley, 1992

[37] Abello A., Samos J., Saltor F., A Data Warehouse Multidimensional Data Models Classification, Technical Report, 2000

[38] Abello A., Samos J., Saltor F., A Framework for the Classification and Description of Multidimensional Data Models, Database and Expert Systems Applications, 12th International Conference, 2001

[39] Teklitz F., The Simplification of Data Warehouse Design, Sybase, 2000


[40] Prosser A., Ossimitz M., Data Warehouse Management, University of Economics and Business Admin., Vienna, 2000

[41] Ahmad I., Azhar S., Data Warehousing in Construction: From Conception to Application, First International Conference on Construction in the 21st Century (CITC2002) “Challenges and Opportunities in Management and Technology” , 2002

[42] Kimball R., Letting the Users Sleep, Part 1, DBMS, 1996, http://www.dbmsmag.com/9612d05.html

[43] Kimball R., Letting the Users Sleep, Part 2, DBMS, 1997, http://www.dbmsmag.com/9701d05.html
