Fundamentals of Data Warehousing

A Business Analytics Course

University of the Philippines Open University

Dr. Eugene Rex Jalao Dr. Ria Mae Borromeo

Asst. Prof. Mari Anjeli Crisanto Ms. Marie Karen Enrile

Course Writers

University of the Philippines

OPEN UNIVERSITY

COMMISSION ON HIGHER EDUCATION

UNIVERSITY OF THE PHILIPPINES OPEN UNIVERSITY

Fundamentals of Data Warehousing

A Business Analytics Course

Welcome! This course is designed to introduce you to the fundamentals of data warehousing for managers. Data warehousing is used in business intelligence, enabling managers to make critical decisions based on different business transactions. Managers of businesses should be able to see opportunities for exploiting data coming from transactions using data warehousing. This course discusses how to adopt data warehousing as an approach for managing data, highlighting the resources needed to roll out a data warehouse. Before taking this course, you should have completed the Fundamentals of Business Analytics course (BAFBANA).

This course guide is your road map to BAFWARE. Please read it thoroughly before starting on the course work. You will need to refer to this course guide from time to time as you go through the course in the next six weeks.

COURSE OBJECTIVES

At the end of the course, you should be able to:

1. Understand database management systems;

2. Discuss the key concepts of data warehousing;

3. Identify opportunities for data warehousing;

4. Identify resources needed for data warehousing;

5. Write a project charter for a data warehouse project;

6. Communicate data requirements;

7. Describe data inference considerations, interestingness metrics, and complexity considerations;

8. Understand various techniques used for post-processing of discovered structures and visualization;

9. Describe formalized means of organizing and storing documents and other content in an organization related to the organization’s processes;

10. Identify the advantages and disadvantages of data warehousing; and

11. Develop an awareness of the ethical norms as required under policies and applicable laws governing confidentiality and non-disclosure of data/information/documents and proper conduct in the learning process and application of business analytics.

COURSE OUTLINE

BAFWARE consists of seven modules and runs for six weeks.

MODULE 1. Database Management Systems
A. Database Systems
B. Functions and Components of a Database Management System
C. Databases and Normalization
D. Entity Relationship Diagram and Relational Modeling
E. Case Study

MODULE 2. Data Warehousing
A. Data Warehouses and Data Marts
B. Alternate Data Warehousing Architecture
C. Case Study

MODULE 3. The Kimball Lifecycle
A. Background and Parts of the Kimball Lifecycle
B. Kimball Lifecycle Technology Track
C. Kimball Lifecycle Data Track & Application Track
D. Case Study

MODULE 4. Dimensional Modeling
A. Dimensional Modeling
B. Fact Tables
C. Dimension Tables
D. Case Study

MODULE 5. ETL (Extraction, Transformation, Loading)
A. Overview
B. Case Study

MODULE 6. Post-Processing and Visualization of Data Inside the Data Warehouse
A. Exercises using R
B. Case Study

MODULE 7. Opportunities and Ethics
A. Opportunities for Data Warehousing
B. Ethics in Data Warehousing
C. Privacy Issues
D. Case Study

COURSE MATERIALS

Your learning package for this course consists of:

1. This course guide;

2. Study guides for each module, with lecture notes and learning activity guides;

3. Video lectures; and

4. Additional reading materials in digital form.

All learning resources will be made available for downloading so you can review them as

often as you wish without having to go online.

STUDY SCHEDULE

You will be learning through independent study combined with collaborative learning in the online discussions. Discussions will be asynchronous, meaning you and your classmates may log in and post your contributions to the discussion whenever you are available, not necessarily at the same time as you would in a chat.

In general, it is up to you to decide how many hours to spend on each module, including the online discussions and other learning activities. Discussion forums, however, will be open only for two weeks each.

You can use the study schedule below as a guide to pace yourself accordingly. However, make sure you note the dates for important activities like quizzes and discussion forums as they will only be open on the dates specified.

Weeks 1-3
Topic: Course Overview
Activities:
1. Read the course guide.
2. Introduce yourself in the “Self-Introductions” forum.

Topic: Module 1
Activities:
1. Go through the study guide for Module 1 and complete the individual learning activities, including viewing the video lecture titled “Database Management Systems” by Asst. Professor Reginald Neil Recario.
2. Self-Introductions Forum closes. Join Discussion Forum 1.

Week 4
Topic: Module 2
Activities:
1. Go through the Study Guide for Module 2 and complete the individual learning activities, including viewing the video lecture on “Data Warehousing” (0:00-10:08) by Asst. Professor Mari Anjeli Crisanto.
2. Take Quiz 1.

Week 5
Topic: Module 3
Activities:
1. Go through the Study Guide for Module 3, complete the individual learning activities, and view the video lecture on “Data Warehousing Lifecycle and Project Management” by Raymond Lagria.
2. Read "Kimball DW/BI Lifecycle Methodology".
3. Discussion 1 Forum closes. Join Discussion Forum 2.

Weeks 6-7
Topic: Module 4
Activities:
1. Go through the Study Guide for Module 4 and complete the learning activities.
2. View the video lectures on “Dimensional Modeling”, “Designing Fact Tables”, and “Designing Dimension Tables” by Dr. Eugene Rex Jalao.
3. Take Quiz 2.

Week 8
Activities:
Do Assignment 1: Dimensional Modelling Case Study on the Northwind Database.

Weeks 9-10
Topic: Module 5
Activities:
1. Go through the Study Guide for Module 5 and complete the learning activities.
2. View “Extraction, Transformation, Loading” by Raymond Lagria for Module 5.
3. Discussion 2 Forum closes.

Week 11
Activities:
Do Assignment 2: ETL Planning: Source to Target Mapping and Data Profiling.

Weeks 12-13
Topic: Module 6
Activities:
1. Go through the Study Guide for Module 6 and complete the individual learning activities.
2. Read "Comprehensive Guide to Data Visualization in R" and "R-analyst Cheat sheet: Data Visualization in R".
3. View “Data Post-Processing” by Raymond Lagria.

Weeks 14-15
Topic: Module 7
Activities:
1. Go through the Study Guide for Module 7 and complete the individual learning activities, including viewing the video lecture on “Opportunities and Ethics in Data Warehousing” by Asst. Professor Mari Anjeli Crisanto.
2. Read "Benefits of data warehouses for business" and "9 Disadvantages and Limitations of Data Warehouse".
3. Join Discussion Forum 3.

Week 16
Activities:
Final Exam

COURSE REQUIREMENTS

To earn a certificate of completion for this course, you must do the following:

1. Participate in all online discussions (i.e. the self-introductions forum and the three class discussions).

2. Take the two online quizzes.

3. Complete two assignments.

4. Complete a final exam.

ONLINE DISCUSSION FORUMS

BAFWARE will have three online discussions where you can share your thoughts and

learnings with your classmates. Take note that the forums will only be open for two weeks

and will close down once the next discussion forum opens.

Guide questions based on the week’s module will be posted in each discussion forum.

Answer those questions intelligently and don’t forget to cite references you may have used

for research. Make sure that you also reply to at least one or two other answers from your

classmates. However, take care to reply constructively and respectfully, following the

proper rules of netiquette.

ONLINE QUIZZES

Several modules will have accompanying online quizzes where you can test your

knowledge and understanding. The quiz can be taken anytime within the week it has been

scheduled. However, take note that a time limit will be set once you begin the quiz. Your

score and feedback will be shown automatically once you are done.

ASSIGNMENTS

Two assignments will be given to allow you to practice what you have learned in BAFWARE. These assignments will be marked by experts, and you will have to get passing scores for both to get a certificate of completion. Details for each assignment will be posted in the course site.

GENERAL GUIDELINES

Here are some of our guidelines for this class:

1. Check MyPortal frequently and participate in all the activities.

2. Practice academic integrity. Cheating and plagiarism of any kind will not be

tolerated.

3. Uphold honor and excellence. Simply give your best, your 110% if you can.

4. LEARN. Learn from this course, learn from yourself, and learn from your classmates.

MODULE 1: DATABASE MANAGEMENT SYSTEMS

Introduction

As we learn about data warehousing and how this can be applied to businesses, it is

important to first understand the basic concepts of database management systems. Data

warehouses utilize information from different database management systems. Managers

need to know key points about database management systems so that they can have a

deeper understanding of how data warehouses work.

Learning Objectives

After working on this module, you should be able to:

1. Explain what database management systems are;

2. Describe the functions and components of database management systems;

3. Perform database normalization;

4. Create a simple entity relationship diagram; and

5. Identify database management systems used in businesses.

1.1. What are database management systems?

In order to understand what database management systems are, we must first understand what a database is. In computer terms, a database is a collection of data. Typically, it is the data of one specific enterprise.

A database is not necessarily stored in a computer. Records kept in a filing cabinet or in a notebook can also be considered a database. However, this manual method of storing information is often less efficient than using a computer, and in particular less efficient than using a database management system.

What then are database management systems or DBMSs? A DBMS is a collection of interrelated data plus the software and hardware used to access the data in a useful manner.

Study Question

Do you agree that using a DBMS is more efficient than using manual methods of storing information? Why or why not?

1.2. What are the functions and components of a database management system?

The main functions of a DBMS include the following (among many others):

a. The manipulation of data;

b. The definition of your database;

c. The processing of your data; and

d. The sharing of your data.

Note that a DBMS is only one component of what is known as a database system. A database system has four components:

a. Users

b. Database Application

c. DBMS

d. Database

Study Questions

1. How does the DBMS perform the functionalities listed in this module?

2. How do the different components of a database system relate to one another?

1.3. Databases and Normalization

Now that we know what databases and database management systems are, let us take a closer look at the data that goes into a database. Our databases will be made up of records which are in turn made up of fields. You can think of records as the individual items that go into the databases.

In designing databases, it is helpful to first identify what records should be included in the system. Once you have identified these records, you can then start to create a DESIGN for the database.

The design should be NORMALIZED. There are several levels of normalization: 1st Normal Form, 2nd Normal Form, 3rd Normal Form, Boyce-Codd Normal Form (BCNF), 4th Normal Form, and even 5th Normal Form. We normalize in order to reduce data redundancy and improve data integrity. For most practical designs, normalizing up to 3rd Normal Form is enough.
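To make the idea of reducing redundancy concrete, here is a minimal sketch in Python. The records, field names, and the `normalize` helper are hypothetical and only for illustration. It removes one transitive dependency (customer details that depend on the customer, not on the order), which is the kind of redundancy 3rd Normal Form targets.

```python
# Hypothetical flat records: customer details repeat on every order row.
flat_orders = [
    {"order_id": 1, "customer_id": "C01", "customer_name": "Ana Cruz",
     "customer_city": "Manila", "product": "Pen", "qty": 10},
    {"order_id": 2, "customer_id": "C01", "customer_name": "Ana Cruz",
     "customer_city": "Manila", "product": "Ink", "qty": 2},
    {"order_id": 3, "customer_id": "C02", "customer_name": "Ben Reyes",
     "customer_city": "Cebu", "product": "Pen", "qty": 5},
]


def normalize(rows):
    """Split flat rows into a customers table and an orders table."""
    customers = {}
    orders = []
    for row in rows:
        # Customer attributes depend only on customer_id, so store them once.
        customers[row["customer_id"]] = {
            "customer_name": row["customer_name"],
            "customer_city": row["customer_city"],
        }
        # Each order keeps only a reference (foreign key) to its customer.
        orders.append({
            "order_id": row["order_id"],
            "customer_id": row["customer_id"],
            "product": row["product"],
            "qty": row["qty"],
        })
    return customers, orders


customers, orders = normalize(flat_orders)
print(len(customers), "customers;", len(orders), "orders")
```

After the split, a change to a customer's city is made in one place only, instead of on every order row.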

Study Question

Why is it important for managers to know how normalized databases are created?

Activity 1-1

Objective: To create a normalized database design.

Task: Think of a simple database management system an organization might want to

have. Identify at least 10 records that will go into the database. Group these records into

tables based on similar fields. Convert the resulting tables to 3rd Normal Form.

Tools & Resources (Video): Database Management Systems by Asst. Professor Mari

Anjeli Crisanto (10:13-10:38)

1.4. Entity Relationship Diagrams and Relational Modeling

Tables created during database normalization often correspond to ENTITIES, which are used in relational modeling. Most DBMS packages for microcomputers make use of the Relational Data Model, which highlights relationships between entities. We use Entity Relationship Diagrams (ERDs) to design and visualize Relational Data Models.

Entity Relationship Diagrams are composed of:

a. Entities – the representation we use to contain information on one real-world person, object, place, etc. These are represented by rectangles in an ERD.

b. Attributes – properties that describe entities. They correspond to the fields in records and are represented by ovals.

c. Relationships – how entities are connected to one another. They are represented by diamonds.

Activity 1-2

Objective: To create a simple entity relationship diagram.

Task: From the normalized tables in Activity 1-1, create a simple ERD. Add relationships and attributes that you might have missed during the initial process.

Tools & Resources: Your output from Activity 1-1 and Database Management Systems by Asst. Professor Mari Anjeli Crisanto (15:41-22:36)

Study Question

Why is it important for managers to know how entity relationship diagrams are designed?

1.5. Case Study

Read through the Northwind Business Case Study. The document discusses

implementing Dimensional Modeling, but let us take a few steps back and first try to

envision a Database Management System for the company. What will its DBMS look like?

What information should go there? The document already contains an Entity-Relationship

Model (another representation of the ERD) and this will give you an idea of the data to be

stored in Northwind Business’ DBMS.

Study Questions

1. What businesses or organizations are you involved with? Will they benefit from using a Database Management System?

2. Identify what DBMS would be useful for this business. If there are many, list down as many as you can.

3. Take one DBMS from the list. Enumerate and explain the processes connecting the business and the DBMS.

References

Database Management Systems (Video) by Reginald Neil Recario

Database Management Systems (Video) by Mari Anjeli Crisanto

Case Study: Dimensional Modeling - Northwind Business (Document) by Eugene Rex

Jalao

MODULE 2: DATA WAREHOUSING

Introduction

Imagine a large organization having different departments, each with their own database

systems. A business analyst would like to generate reports for decision support. She

approaches each department but has problems with some of them whose main roles are

just to handle data transactions – not reports. Those that do give her information give data

in a number of different formats. Customer names are saved differently, birthdates are in

mm/dd/yy and dd/mm/yy and so on. Wouldn’t it save the business analyst so much time

and effort if there was a central repository containing information needed for her to

generate the reports that she needs with the data in a standardized format, too?

In this module, we will learn about data warehousing which makes tasks like the above

easier to handle.
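The birthdate problem in the scenario above can be sketched in a few lines of Python. This is only an illustration: the source system names (`sales`, `hr`) and the `standardize_birthdate` helper are hypothetical, and the sketch assumes that each source system is known to use exactly one date format.

```python
from datetime import datetime

# Assumed for illustration: each source system is known to use one format.
SOURCE_FORMATS = {
    "sales": "%m/%d/%y",  # mm/dd/yy
    "hr": "%d/%m/%y",     # dd/mm/yy
}


def standardize_birthdate(raw, source):
    """Convert a source-specific date string to ISO 8601 (YYYY-MM-DD)."""
    parsed = datetime.strptime(raw, SOURCE_FORMATS[source])
    return parsed.date().isoformat()


# The same raw text means different dates depending on where it came from.
print(standardize_birthdate("03/04/85", "sales"))  # 1985-03-04
print(standardize_birthdate("03/04/85", "hr"))     # 1985-04-03
```

The point of the sketch is that the raw value alone is ambiguous. The format has to be recorded per source before the data can be stored in one standardized form.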

Learning Objectives

After working on this module, you should be able to:

1. Discuss the key concepts of data warehousing; and

2. Identify resources needed for data warehousing.

2.1. Data Warehouses and Data Marts

A data warehouse is a physical repository where relational data are specially organized to provide enterprise-wide, cleansed data in a standardized format.

In our previous module, we have learned what database systems are. In turn, a data

warehouse is a collection of integrated, subject-oriented databases. Each unit of data is

non-volatile and relevant to some moment in time.

Data in data warehouses are typically NOT in 3NF. Since they are not normalized, some data may be redundant, and these redundancies add to the volume of data stored, which is why the course materials loosely describe warehouse data as BIG DATA. However, this de-normalized data is more useful for DECISION SUPPORT. This is good since the purpose of a data warehouse is to provide aggregate data for decision making. You are less interested in what the data in each table are than in how the company will move forward given that data.

There may be questions or decisions which are specialized for specific people. Thus,

separate entities called DATA MARTS are used to provide specialized and strategic

answers for specific people. This keeps it simple for the users. Small problems are easier

to solve.

Data marts, therefore, are a subset of the data warehouse that support the requirements

of a particular department or business function.

A data mart is a departmental data warehouse that stores only relevant data. Data marts

can be dependent or independent. A dependent data mart is a subset that is created

directly from a data warehouse. An independent data mart, on the other hand, is a small

data warehouse designed for a strategic business unit or a department.
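As a rough illustration of a dependent data mart, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The table names (`warehouse_sales`, `marketing_mart`) and the sample rows are hypothetical. The mart is created directly from the warehouse table and holds only the rows relevant to one department.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A tiny stand-in for the enterprise-wide warehouse table.
conn.execute(
    "CREATE TABLE warehouse_sales (region TEXT, department TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO warehouse_sales VALUES (?, ?, ?)",
    [
        ("North", "Marketing", 100.0),
        ("South", "Marketing", 150.0),
        ("North", "Finance", 300.0),
    ],
)

# Dependent data mart: a subset derived directly from the warehouse.
conn.execute(
    "CREATE TABLE marketing_mart AS "
    "SELECT region, amount FROM warehouse_sales "
    "WHERE department = 'Marketing'"
)

mart_rows = conn.execute(
    "SELECT region, amount FROM marketing_mart ORDER BY region"
).fetchall()
print(mart_rows)
```

An independent data mart would instead be loaded from the source systems themselves, without going through a central warehouse table.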

Study Question

How will organizations benefit from data warehouses and data marts?

2.2. Alternate Data Warehousing Architecture

Alternative data warehousing architectures include:

a. Independent Data Marts

b. Data Mart Bus Architecture

c. Hub-and-Spoke Architecture

d. Centralized Data Warehouse

e. Federated Data Warehouse

Study Questions

1. How are the alternative data warehousing architectures different from the usual

architecture?

2. Discuss the advantages and disadvantages of the different alternative data

warehousing architectures.

2.3. Case Study

Let’s go back to the Northwind Business Case Study. Will the company benefit from a

Data Warehouse? We can see from the document that it has already decided to build a

Business Intelligence Data Warehouse (BIWD). The company had done so because it is

interested in analyzing its sales and shipping activities and decisions so that it can improve

its customer order process. In our next modules, we will see how Northwind will shift from

a DBMS design to a BIWD.

Activity 2-1

Objective: To identify resources needed for data warehousing.

Task: Identify a business or organization that might benefit from using data warehouses

and data marts. List down the resources they will need to get these up and running.

Tools & Resources (Video): Data Warehousing (0:00-10:08) by Asst. Professor Mari

Anjeli Crisanto

References

Data Warehousing (Video) by Mari Anjeli Crisanto

Introduction to Data Warehousing and Enterprise Data Management (Slides) by Eugene Rex Jalao

https://www.youtube.com/watch?v=zTs5zjSXnvs&t=293s&list=WL&index=20

https://www.youtube.com/watch?v=l74BAViTVns&t=194s&list=WL&index=21

Case Study: Dimensional Modeling - Northwind Business (Document) by Eugene Rex

Jalao

MODULE 3: THE KIMBALL LIFECYCLE

Introduction

We will now learn about a lifecycle used by data warehouse and business intelligence project teams. This is the Kimball Lifecycle, which was known as the Business Dimensional Lifecycle before 2008.

Learning Objectives

After working on this module, you should be able to:

1. Enumerate and describe the different stages of the Kimball Lifecycle; and

2. Write a project charter for a data warehouse project.

3.1. Background and Parts of the Kimball Lifecycle

The Kimball Lifecycle focuses on adding business value across the enterprise and

dimensionally structures the data that's delivered to the business. It uses iterations and

increments in a manageable lifecycle to do this.

There have been two main approaches to building data warehouses with data marts.

The first is an approach by Bill Inmon and the second is the approach by Ralph Kimball.

Bill Inmon’s approach works this way:

a. The enterprise data warehouse (EDW) should be in at least 3rd normal form.

b. But the data marts should be in dimensional form.

c. Big Bang Approach

Meanwhile, here is Kimball’s approach:

a. The EDW is based on dimensional model design.

b. Focus on user-friendliness and ease of use.

c. Develop the EDW on a departmental basis, piece by piece.

Kimball’s approach is more practical, more interpretable, easier to implement and less

costly based on industry best practices. It involves the following steps:

a. Program/Project Planning and management

b. Deployment

c. Maintenance

d. Growth

Study Question

What happens during the different steps or stages in the Kimball Lifecycle?

3.2. The Kimball Lifecycle Technology Track

The first stage in the Kimball Lifecycle involves planning. Planning for three streams happens simultaneously. These streams are:

a. Technology Track

b. Data Track

• Dimensional Modeling

• Physical Design

• ETL (Extraction, Transformation, Loading) Design and Development

c. Application Track

• Business Intelligence Application Design

• Business Intelligence Application Development

The technology track involves technical architectural design and product selection and

installation.

The following processes occur in the technical architectural design:

a. Consideration of business requirements, current technical environment, and

planned strategic technical directions

b. Designing the back room architecture

• Designing ETL (data staging) environment

• Identifying DBMS operating system and hardware environment

c. Designing front room architecture

d. Designing the Infrastructure and metadata

e. Managing security requirements

As for product selection and installation, the processes are:

a. Evaluation and selection of the following tools:

• Hardware platform

• DBMS

• ETL tool (data staging tool)

• BI tool (end user data access tool)

b. Installation and testing to assure end-to-end integration

c. Training of team

Study Questions

1. Who are the people involved in technical architectural design?

2. Who are those involved in product selection and installation?

3.3. Kimball Lifecycle Data Track

The data track involves dimensional modeling, physical design, and ETL design and

development. We will cover dimensional modeling and the processes involved in Module

4. Meanwhile, we will talk more about ETL in Module 5.

Study Question

Who are the people tasked to do the processes in the data track?

3.4. Kimball Lifecycle Application Track

The application track involves business application design and business application

development.

Business application design involves:

a. Identifying standard analytic and report requirements to meet 80% – 90% of user

needs

b. Planning and assuring ad hoc query and reporting capability

c. Developing report templates for report families

d. Getting user signoff on report templates and committing to them

e. Identifying metrics and metric calculations, Key Performance Indicators (KPIs)

Meanwhile, business application development ideally involves using a single advanced BI tool that meets all user needs. Advanced tools provide significant productivity gains for the application development team. Good BI design enables end users to modify existing reports and develop ad hoc reports quickly without going to IT.

Study Question

Who makes up the workforce of the application track?

3.5. Case Study

Watch the “Data Warehousing Lifecycle and Project Management” video by Raymond

Lagria at 11:45. The lecture discusses a case study for BigCo and how a project charter

was created for this company.

Activity 3-1

Objective: Write a project charter for a data warehouse project.

Task: Look for examples of project charters for data warehouse projects. Create one

following the steps in the Kimball Lifecycle.

Tools & Resources (Video): “Data Warehousing Lifecycle and Project Management” by

Prof. Raymond Lagria

References

Introduction to Data Warehousing and Enterprise Data Management (Slides) by Eugene

Rex Jalao

Kimball DW/BI Lifecycle Methodology: http://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dw-bi-lifecycle-method/

Data Warehousing Lifecycle and Project Management (Video) by Raymond Lagria

MODULE 4: DIMENSIONAL MODELING

Introduction

We have learned in Module 3 that the Data Track stream in the Kimball Lifecycle

involves dimensional modeling. Dimensional modeling is a logical design technique for

structuring data so that it is intuitive for business users and delivers fast query

performance. We will take a closer look at the process involved here in this module.

Learning Objectives

After working on this module, you should be able to:

1. Explain the concept of dimensional modeling;

2. Discuss fact tables and dimension tables; and

3. Understand the conversion of the E/R model to a dimensional model using the Dimensional Normal Form (DNF) methodology.

4.1. Dimensional Modeling

In Module 1, we have learned about relational modeling. Relational modeling is widely

used in databases nowadays. However, dimensional modeling has two advantages over

relational modeling. These are understandability and performance. The model must be

easily understood by business users while representing the complexities of the business.

It must also have fast response to queries that summarize millions of rows.

Dimensional models also have the following benefits:

1. Predictable, Standard Framework

2. Gracefully Extensible to Accommodate Change

3. Star Join Schema is Symmetrical

4. Has Standard Approaches for Common Modeling Situations

5. Aggregate Management

To design a dimensional model, we must perform the following steps:

1. Establish Naming Conventions

2. Do the Four-Step Dimensional Modeling Process

3. Document the High Level Data Model Diagram

4. Define the Data Sources

5. Document the Detailed Table Designs

6. Develop Detailed Bus Matrix

7. Identify, Track, and Resolve Issues

Let us now dig deeper into dimensional modeling and discuss fact tables and

dimensional tables.

4.2. Fact Tables

Let us first determine what makes up a “fact”. Measurements are numeric values called facts. Examples are sales amount and count of attendance. Dimensions, meanwhile, describe the “who, what, where, when, why, and how” of the facts. For example, dimensions for sales amount would be quarter and product, which let us report sales by quarter and sales by product.

A dimensional model consists of a fact table containing measurements surrounded by a

halo of dimension tables containing textual context. It is known as a star join and as a

star schema when stored in a relational database.

Fact tables contain the measures (numerical values) needed to perform decision analysis and query reporting in the star schema.

Here are some more fact table facts:

1. A fact is a performance measure. For example, "Sales of Product X".

2. Fact values are not known in advance. They are only known when event

measurement occurs.

3. Facts are numeric.

4. The most useful facts are numeric and additive.

Fact tables are usually the largest tables. A single fact table can contain either detail or

summarized data. They are primarily joined to dimension tables through foreign keys.

The business definition of the measurement event that produces the fact table is called the fact table's grain. Declaring the grain means filling in the blank in this statement: “A fact row is created when ____ occurs.”
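The fact table ideas in this section can be sketched as a very small star schema. The example below uses Python's built-in `sqlite3` module, and all table names, column names, and sample rows are hypothetical. The grain assumed here is "a fact row is created when a product is sold on a given date", and the fact table is joined to its dimension tables through foreign keys so that the additive fact `sales_amount` can be summed by quarter or by product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE date_dim (
    date_key INTEGER PRIMARY KEY,
    full_date TEXT,
    quarter TEXT
);
CREATE TABLE product_dim (
    product_key INTEGER PRIMARY KEY,
    product_name TEXT
);
-- Grain: one row is created when a product is sold on a given date.
CREATE TABLE sales_fact (
    date_key INTEGER REFERENCES date_dim(date_key),
    product_key INTEGER REFERENCES product_dim(product_key),
    sales_amount REAL
);
INSERT INTO date_dim VALUES (1, '2020-01-15', 'Q1'), (2, '2020-04-10', 'Q2');
INSERT INTO product_dim VALUES (1, 'Pen'), (2, 'Ink');
INSERT INTO sales_fact VALUES
    (1, 1, 50.0), (1, 2, 20.0), (2, 1, 70.0), (2, 1, 30.0);
""")

# Sales by quarter: sum the additive fact, grouped by a date attribute.
sales_by_quarter = conn.execute("""
    SELECT d.quarter, SUM(f.sales_amount)
    FROM sales_fact AS f
    JOIN date_dim AS d ON d.date_key = f.date_key
    GROUP BY d.quarter
    ORDER BY d.quarter
""").fetchall()

# Sales by product: the same fact, viewed through another dimension.
sales_by_product = conn.execute("""
    SELECT p.product_name, SUM(f.sales_amount)
    FROM sales_fact AS f
    JOIN product_dim AS p ON p.product_key = f.product_key
    GROUP BY p.product_name
    ORDER BY p.product_name
""").fetchall()

print(sales_by_quarter)
print(sales_by_product)
```

Both queries read the same fact rows. Only the dimension used for grouping changes, which is what makes the star layout convenient for reporting.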

4.3. Dimension Tables

In a star schema, dimension tables contain classification and aggregation information

about the values in the fact table.

Dimension tables contain the parameters by which the fact table measures are analyzed.

For example, the amount sold is analyzed by day, month, quarter, or year. Or the amount

sold on sunny days vs. rainy days, and so on.

Dimension tables provide the context to the fact table measures they describe. They also

contain descriptors of the business, utilizing business terminology. They have many large

columns, contain textual and discrete data, and are usually smaller than fact tables.

Dimension tables have a single-column surrogate primary key (called the warehouse dimension key) and are joined to a fact table through a foreign key reference to their primary key. Dimension tables can contain one or more hierarchies. These hierarchies are de-normalized into the dimension tables.
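A minimal sketch of such a dimension table is shown below, again using Python's built-in `sqlite3` module. The table, its columns, and the sample products are hypothetical. The surrogate key is generated by the database rather than taken from the source system, and the product hierarchy (product, category, department) is stored in the same row instead of in separate normalized tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim (
    product_key INTEGER PRIMARY KEY,  -- surrogate key assigned by the warehouse
    product_code TEXT,                -- identifier used by the source system
    product_name TEXT,
    category_name TEXT,               -- hierarchy level kept in the same row
    department_name TEXT              -- higher hierarchy level, also de-normalized
);
INSERT INTO product_dim
    (product_code, product_name, category_name, department_name)
VALUES
    ('P-100', 'Ballpen', 'Writing', 'Office Supplies'),
    ('P-200', 'Marker', 'Writing', 'Office Supplies'),
    ('P-300', 'Stapler', 'Fastening', 'Office Supplies');
""")

# SQLite fills in the INTEGER PRIMARY KEY automatically for each new row.
keys = [
    row[0]
    for row in conn.execute(
        "SELECT product_key FROM product_dim ORDER BY product_key"
    )
]

# Rolling up the hierarchy needs no extra joins.
per_category = conn.execute(
    "SELECT category_name, COUNT(*) FROM product_dim "
    "GROUP BY category_name ORDER BY category_name"
).fetchall()

print(keys)
print(per_category)
```

The repeated category and department names are the redundancy that de-normalization accepts in exchange for simpler queries.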

Dimensional tables can be classified into the following:

1. Date Based

2. Time Based

3. Business Entities

4. Analytical Profiles

5. Correlated Entities

6. Versions of Business Entities

7. Flags and Indicators

8. Degenerate Dimensions

Now how do we generate dimensional models? The Dimensional Normal Form is a

creative and practical approach originated by Mike Schmitz to design Dimension Table

Families. Here, fact tables are highly normalized for maintainability and flexibility.

Dimensions have their hierarchies de-normalized into them for usability and performance.

Its schema is limited to two levels. These are a single first level or central highly normalized

table called a fact table and multiple second level tables called dimension tables linked to

the first level table in primarily one to many relationships.

Study Question

How is the Dimensional Normal Form different from the other normal forms discussed in Module 1?

4.4. Case Study

Let’s go back to the Northwind Business Case Study. It is now time to see how their system translates into a dimensional model. An Excel file is provided for you to design and submit your solution. From the generated dimensional model, what are the SQL scripts needed for each of the reports below?

1. What were Northwind’s top-selling products? This month? This quarter? YTD? This month last year? Last YTD?

2. Who are the best customers in terms of sales? How many orders did these best customers place last month? What was the average order amount? What was the average number of items per order per customer?

3. How many orders were shipped on time? Late? How late? Who is the top-performing shipping company?

4. How much did Northwind sell by each product category in each time period?

5. Which employee sold the most orders?

Study Question

Why is it important for managers to know how normalized databases are created?

Activity 4-1

Objective: Understand the conversion of the E/R model to a dimensional model using Dimensional Normal Form (DNF) methodology.


Task: Do Case 1: Dimensional Modelling Case Study on the Northwind Database

Tools & Resources: “Dimensional Modeling”, “Designing Fact Tables”, and “Designing Dimension Tables” by Dr. Eugene Rex Jalao (Videos); “Case Study: Dimensional Modeling - Northwind Business” by Dr. Eugene Rex Jalao (Document)

References

Introduction to Data Warehousing and Enterprise Data Management (Slides) by Eugene

Rex Jalao

Designing Fact Tables (Slides) by Eugene Rex Jalao

Introduction to Dimensional Modeling (Slides) by Eugene Rex Jalao

Designing Dimension Tables (Slides) by Eugene Rex Jalao

Case Study: Dimensional Modeling - Northwind Business (Document) by Eugene Rex

Jalao


MODULE 5: EXTRACTION, TRANSFORMATION, LOADING (ETL)

Introduction

We learned in Module 3 that the Data Track stream in the Kimball Lifecycle involves the ETL (Extraction, Transformation, Loading) process. In this module, we take a closer look at that process.

Learning Objectives

After working on this module, you should be able to:

1. Discuss the steps in ETL; and

2. Identify instances where ETL would be necessary in an organization.

5.1. ETL Overview

ETL is mostly done by business analytics practitioners who follow an information technology track. However, it is useful for managers to know what happens during ETL. The objective of ETL is to get data out of the source systems and load it into the data warehouse. At its simplest, it is a process of copying data from one database to another: data is extracted from a source database, transformed to match the data warehouse schema, and loaded into the data warehouse database.

When defining ETL for a data warehouse, it is important to think of ETL as a process, not a physical implementation. The process is usually handled using scripts written in Structured Query Language (SQL), a special-purpose programming language designed for managing data held in a relational database.

In extraction, data is extracted from heterogeneous data sources. Each data source has its own distinct set of characteristics that need to be managed and integrated into the ETL system in order to extract data effectively. This is usually done using SQL SELECT statements.

Transformation is the main step where ETL adds value. It changes the data and provides guidance on whether the data can be used for its intended purposes. For example, "Male" is changed to "M" and "Yes" is changed to "1". This is performed in a staging area.
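The transformation step can be illustrated with a short sketch. Only the "Male" to "M" and "Yes" to "1" mappings come from the text; the other code values, the column names, and the sample rows are assumptions made for the example.

```python
# Code tables for the staging area. Only "Male" -> "M" and "Yes" -> "1"
# come from the course text; the remaining entries are assumptions.
GENDER_CODES = {"male": "M", "female": "F"}
FLAG_CODES = {"yes": "1", "no": "0"}


def transform_row(row):
    """Return (clean_row, errors) for one extracted source row."""
    clean = dict(row)
    errors = []
    # Hypothetical column names: "gender" and "subscribed".
    for column, codes in (("gender", GENDER_CODES), ("subscribed", FLAG_CODES)):
        raw = str(row.get(column, "")).strip().lower()
        if raw in codes:
            clean[column] = codes[raw]
        else:
            # Flag values that cannot be mapped instead of guessing.
            errors.append(f"{column}: unrecognized value {row.get(column)!r}")
    return clean, errors


# Hypothetical rows, as if returned by an SQL SELECT in the extraction step.
extracted = [
    {"customer_id": 1, "gender": "Male", "subscribed": "Yes"},
    {"customer_id": 2, "gender": " female ", "subscribed": "no"},
    {"customer_id": 3, "gender": "Unknown", "subscribed": "Yes"},
]

staged, rejected = [], []
for source_row in extracted:
    clean_row, row_errors = transform_row(source_row)
    if row_errors:
        rejected.append((source_row, row_errors))
    else:
        staged.append(clean_row)

print(staged)
print(rejected)
```

Keeping the rejected rows together with their error messages is one simple way for the transformation step to give guidance on whether the data can be used for its intended purpose.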


Finally, in loading, the data is loaded into the data warehouse tables. Here, surrogate keys are created and assigned. The process is usually done using SQL INSERT statements.

ETL is often a major failure point in data warehousing because the effort involved in the ETL process is underestimated. Underestimating data quality problems and the work of providing contextual history are also prime culprits. The ETL process should therefore not be taken for granted.

It should be noted that ETL is not a one-time event, as new data is added to the data warehouse periodically: monthly, daily, or hourly. Because ETL is an integral, ongoing, and recurring part of a data warehouse, it should be automated, well documented, and easy to change.

Several companies have strong ETL tools and a fairly complete suite of supplementary tools. There are three general types of source-to-target tools:

1. Code generators - These compile ETL code, typically in COBOL, which is used by several large companies that run mainframes.

2. Engine based - These have easy-to-use graphic interfaces and interpreter style programs.

3. Database based - These involve manual coding using SQL statements augmented by scripts.
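As a rough illustration of the database-based style (SQL statements augmented by a script), the sketch below runs a small extract, transform, and load job between two in-memory SQLite databases. The table names, column names, sample rows, and the cleaning rule are assumptions for the example, and SQLite merely stands in for real source and warehouse systems.

```python
import sqlite3

source = sqlite3.connect(":memory:")      # stands in for an operational database
warehouse = sqlite3.connect(":memory:")   # stands in for the data warehouse

# Hypothetical source table and rows.
source.executescript("""
CREATE TABLE customers (customer_id TEXT PRIMARY KEY, company_name TEXT);
INSERT INTO customers VALUES ('ALFKI', 'Alfreds Futterkiste');
INSERT INTO customers VALUES ('ANATR', 'Ana Trujillo Emparedados');
""")

# Hypothetical dimension table with a surrogate key assigned at load time.
warehouse.execute("""
CREATE TABLE d_customer (
    customer_key INTEGER PRIMARY KEY,  -- surrogate key, generated by SQLite
    customer_id  TEXT UNIQUE,          -- natural key carried from the source
    company_name TEXT
)""")


def run_etl(src, dwh):
    """Run one ETL cycle and return the number of rows in the dimension."""
    # Extract: an SQL SELECT against the source.
    rows = src.execute(
        "SELECT customer_id, company_name FROM customers"
    ).fetchall()
    # Transform: trim whitespace and upper-case the natural key (assumed rule).
    cleaned = [(cid.strip().upper(), name.strip()) for cid, name in rows]
    # Load: an SQL INSERT. Rows loaded by an earlier run are skipped,
    # so the same job can be scheduled repeatedly.
    with dwh:
        dwh.executemany(
            "INSERT OR IGNORE INTO d_customer (customer_id, company_name) "
            "VALUES (?, ?)",
            cleaned,
        )
    return dwh.execute("SELECT COUNT(*) FROM d_customer").fetchone()[0]


first_run = run_etl(source, warehouse)
source.execute("INSERT INTO customers VALUES ('bonap', ' Bon app ')")
second_run = run_etl(source, warehouse)
loaded = warehouse.execute(
    "SELECT customer_key, customer_id, company_name "
    "FROM d_customer ORDER BY customer_key"
).fetchall()
print(first_run, second_run, loaded)
```

Running the job twice shows why ETL is described as a recurring process: the second run picks up only the customer added since the first run, and each loaded row receives its own surrogate key.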

Well-known ETL tools are the following:

1. Commercial
   a. Ab Initio
   b. IBM DataStage
   c. Informatica PowerCenter
   d. Microsoft Data Integration Services
   e. Oracle Data Integrator
   f. SAP Business Objects - Data Integrator
   g. SAS Data Integration Studio

2. Open-Source Based
   a. Adeptia Integration Suite
   b. Apatar
   c. CloverETL
   d. Pentaho Data Integration (Kettle)
   e. Talend Open Studio/Integration Suite
   f. R/R Studio

Take note that the "best" tool does not exist. You will have to choose based on your own needs. You should also first check whether the standard tools from the big vendors are adequate for those needs.


Study Question

Why is it important for managers to know the processes involved in ETL?

Activity 5-1

Objective: To identify instances where ETL would be necessary in an organization.

Task: From Activity 3-1, identify what data would need to undergo ETL. What would their

final forms be?

Tools & Resources (Video): Activity 3-1 and “Extraction, Transformation, and Loading”

by Raymond Lagria

5.2. Case Study

Database. Specifically we are looking at issues on column values and row duplications.

Extract all the data from the Northwind Access database into an Excel sheet and open it in MS Excel. Use Excel’s AutoFilter function to complete the data profile table found in the Data_Profile_Template.xls file.

Also develop the High Level Source-to-Target Map for the Northwind data warehouse.

Use the S2T Map Template.xls file. Develop the following:

1. High Level Source-to-Target Map for all tables

2. Detailed S2T Map for the Product Dimension (D_Product) and Order Transaction

Fact (F_Order_Transaction) table.

References

Introduction to Data Warehousing and Enterprise Data Management (Slides) by Eugene Rex Jalao

Extraction, Transformation, and Loading (Video) by Raymond Lagria

Data Profiling and Source to Target Mapping (Document) by Eugene Rex Jalao


MODULE 6: POST-PROCESSING AND VISUALIZATION OF DATA INSIDE THE DATA WAREHOUSE

Introduction

Let us now learn how we can post-process and visualize the data inside the data

warehouse.

Learning Objectives

After working on this module, you should be able to:

1. Understand various techniques used for post-processing of discovered structures

and visualization.

6.1. Exercises using R

First, what is R? R is an integrated suite of software facilities for data manipulation,

calculation and graphical display.

It has an effective data handling and storage facility. It also has a large, coherent,

integrated collection of intermediate tools for data analysis. In addition, it has graphical

facilities for data analysis and display either directly at the computer or on hard copy.

Take note that R is not a database, but it can connect to a DBMS. It is not a spreadsheet view of data, but it can connect to Excel/MS Office.

R is free and open source, though it has a steep learning curve. The RStudio IDE is a powerful and productive third-party user interface for R. It is free, open source, and works well on Windows, Mac, and Linux.

Exercises for this session will include the following:

1. Working with dataset Wage

2. Studying, reducing and structuring the dataset

3. Plotting the dataset

4. Introducing a business analytics task for the dataset

5. Working with another dataset


In post-processing, recall that data extracted from a data warehouse, or pieces of knowledge extracted from an initial data mining task, can be processed further. We can simplify the data, apply descriptive statistics, do visualization or graphing tasks, or apply further business analytics tools.
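The course exercises for this module are done in R. Purely to illustrate the idea of simplifying extracted data and applying descriptive statistics, here is a minimal sketch in Python. The records, the column names, and the text bar chart are assumptions for the example and are not the Wage dataset used in the exercises.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical rows, as if already extracted from a data warehouse query.
rows = [
    {"education": "HS Grad", "wage": 75.0},
    {"education": "HS Grad", "wage": 95.0},
    {"education": "College Grad", "wage": 120.0},
    {"education": "College Grad", "wage": 130.0},
    {"education": "College Grad", "wage": 155.0},
]


def summarize(records, group_col, value_col):
    """Group the records and compute simple descriptive statistics."""
    groups = defaultdict(list)
    for record in records:
        groups[record[group_col]].append(record[value_col])
    return {
        group: {"n": len(values), "mean": mean(values), "median": median(values)}
        for group, values in groups.items()
    }


summary = summarize(rows, "education", "wage")

# A very rough text "plot": one '#' per 10 units of the group mean.
for group, stats in summary.items():
    bar = "#" * int(stats["mean"] // 10)
    print(f"{group:<14} {bar} {stats['mean']:.1f}")
```

The same grouping and summarizing steps correspond to the subsetting and plotting tasks that the module's video covers in R.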

Watch the "Data Post-processing" video by Raymond Lagria to understand

preliminaries, data frames, reading data, subsetting, graphing and plotting, and

regression analysis in R.

Always remember to transform your dataset into your desired format before applying further data mining techniques.

Study Question

If you were a business manager, what types of visualizations of the data warehouse’s data would you like to see?

6.2. Case Study

Let us continue to see how post-processing and plotting are done with R in the “Data Post-processing” video by Raymond Lagria.

References

https://www.analyticsvidhya.com/blog/2015/07/guide-data-visualization-r/

Data Post-Processing (Slides) by Raymond Lagria

Data Post-Processing (Video) by Raymond Lagria


MODULE 7: OPPORTUNITIES AND ETHICS

Introduction

Finally, let us discuss the opportunities and ethics surrounding data warehousing.

Learning Objectives

After working on this module, you should be able to:

1. Identify advantages and disadvantages of data warehousing;

2. Develop an awareness of the ethical norms as required under policies and

applicable laws governing confidentiality and non-disclosure of

data/information/documents and proper conduct in the learning process and

application of business analytics.

7.1. Opportunities for Data Warehousing

Data warehousing, like anything else, has both advantages and disadvantages. Its advantages include:

a. Better decision-making
b. Quick and easy access to data
c. Data quality and consistency

As for its disadvantages, here are the considerations:

a. Maintenance costs may outweigh the benefits
b. Data ownership must be considered
c. Rigidity of data
d. Underestimation of ETL processing time
e. Hidden problems of the source
f. Inability to capture required data
g. Increased demands of the users
h. Long-duration project
i. Complications


Study Questions

Do you think the advantages outweigh the disadvantages? What can be done to address the disadvantages?

7.2. Ethical Concerns in Data Warehousing

Data warehousing takes information from different databases as well as external sources

and puts them inside a repository which can be accessed by end-users who need decision

support.

Thus, there are ethics to consider, especially when some data may be accessed only at the departmental level or only at certain levels of the organization. Remember, there is a chance that end-users may have access to information that they should not be examining. They may be breaking privacy laws without knowing it.

Ethics should also be considered even if data are just used in the testing phase. For

example, while testing the data warehouse, is it alright to move small data sets from source

systems to target systems for testing purposes? It is not actually ethical to do so. While

testing, sometimes users are learning things they shouldn’t know or things they aren’t

allowed to know.

What about external data, or data that has already been made public? Is it ethical to integrate everything into the data warehouse? The project manager must decide which information is acceptable to integrate. Although the information is publicly available, using some of it might raise ethical considerations. The ethics would focus on how the information is used, and by whom.

Study Question

What other ethical considerations for data warehousing are you aware of?


7.3. Checklist for Ethical Concerns in Data Warehousing

Here is a checklist of items project managers and technology implementers can use to

manage ethical concerns:

• Develop service level agreements with end users that define who has access to

what levels of information

• Have end-users involved in defining the ethical standards of use for the data

that will be delivered.

• Define the bounds around the integration efforts of public data, where it will be

integrated and where it will not – so as to avoid conflicts of interest.

• Do not use “live” or real data for testing purposes – or lock down the test

environment; too often test environments are left wide-open and accessible to

too many individuals.

• Define where, how, and by whom data mining will be used, and restrict the mining efforts to specific sets of information. Build a notification system to monitor data mining usage.

• Allow customers to “block” the integration of their own information (this one is questionable), depending on whether the customer information will be made available on the web after integration.

• Remember that any efforts made are still subject to governmental laws. What

laws do we have right now concerned with data privacy? Note that future laws

could also be developed and we must be aware of those.
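One way to act on the checklist item about not using "live" data for testing is to mask personal fields before rows are copied to a test environment. The sketch below is only an illustration: the column names, the sample row, and the salt value are hypothetical. Salted hashing is pseudonymization rather than full anonymization, so whether it is sufficient under a given privacy law is a separate question.

```python
import hashlib

# Hypothetical list of columns treated as personal data.
PERSONAL_COLUMNS = ("contact_name", "phone")


def mask_value(value, salt):
    """Replace a value with a short salted SHA-256 token."""
    digest = hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()
    return "masked_" + digest[:12]


def mask_row(row, salt, columns=PERSONAL_COLUMNS):
    """Return a copy of the row with the personal columns masked."""
    return {
        key: mask_value(value, salt) if key in columns else value
        for key, value in row.items()
    }


# Hypothetical source row and salt; a real salt should be kept secret.
live_row = {
    "customer_id": "C001",
    "company_name": "Sample Trading",
    "contact_name": "Juan Dela Cruz",
    "phone": "555-0100",
}
test_row = mask_row(live_row, salt="replace-with-a-secret-salt")
print(test_row)
```

Because the same input and salt always produce the same token, masked rows can still be joined and counted during testing without exposing the original names or phone numbers.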

Activity 1-2

Objective: Develop an awareness of the ethical norms in data warehousing.

Task: Use the checklist for ethical considerations in data warehousing and check

whether the project charter created in Activity 3-1 has any part which could be unethical.

Tools & Resources: Opportunities and Ethics in Data Warehousing (Asst. Professor Mari

Anjeli Crisanto) and Activity 3-1


Task: Think of a simple database management system an organization might want to

have. Identify at least 10 records that will go into the database. Group these records into

tables based on similar fields. Convert the resulting tables to 3rd Normal Form.

Tools & Resources (Video): Database Management Systems by Asst. Professor Mari

Anjeli Crisanto (10:13-10:38)

7.4. Privacy Issues

Are you familiar with the Data Privacy Act? It was implemented “to protect the fundamental human right of privacy, of communication while ensuring free flow of information to promote innovation and growth” (Republic Act No. 10173, Ch. 1, Sec. 2).

The law specifies that consent is needed before the collection of all personal data. The

data subject must also be informed of the extent to which their personal information will

be processed.

This becomes a big consideration when we implement data warehouses because the data

warehouse might access information which a person may have given consent to be

accessible only at a certain level.

Businesses and IT developers must be well aware of laws such as these so that they can ensure that their databases and data warehouses comply with all of the law’s stipulations.

Study Question

Why is data privacy important?

7.5. Case Studies

Let’s take a look at the Northwind Business Case Study one last time. What ethical

considerations are relevant to this company? Are there any privacy issues that it has to

consider when building the Business Intelligence Data Warehouse?


References

http://www.techadvisory.org/2015/03/benefits-of-data-warehouses-for-business/

http://whatisdbms.com/9-disadvantages-and-limitations-of-data-warehouse/

Opportunities and Ethics in Data Warehousing (Asst. Professor Mari Anjeli Crisanto)

http://tdan.com/data-warehousing-ethical-concerns-security-access-and-control/5186

Case Study: Dimensional Modeling - Northwind Business (Dr. Eugene Rex Jalao)


DISCUSSION FORUM TOPICS

DISCUSSION FORUM 1

Database Management Systems or Data Warehousing?

Week Open: Week 2

Week Closes: Week 5

Guide Question: Would your company or organization benefit from a Database Management System? What about a Data Warehouse? If you were the manager, which of the two would be the better fit for your company? State your reasons.

DISCUSSION FORUM 2

Module 3 - The Kimball Lifecycle

Week Open: Week 5

Week Closes: Week 10

Guide Question: Which part of the Kimball Lifecycle would you be most involved in?

Discuss why.

DISCUSSION FORUM 3

Module 7 - Opportunities and Ethics

Week Open: Week 14

Week Closes: Week 16

Guide Question: Aside from those discussed in the video and study guide, what other

opportunities in data warehousing are there and what other ethical considerations do you

think should be looked into? Share these with the class.


QUIZZES

QUIZ 1

Topics covered: Database Management Systems, Data Warehousing

Week scheduled: Week 4

1) A database is always stored in a computer.

a) True

b) False

2) A _____ is a collection of interrelated data plus the software and hardware used to

access the data in a useful manner.

a) Database

b) Database Management System

c) Data Warehouse

d) Data Mart

3) A ______ is a physical repository where relational data are specially organized to provide enterprise-wide, cleansed data in a standardized format.

a) Database

b) Database Management System

c) Data Warehouse

d) Data Mart

4) Which among the following is a function of a DBMS?

a) The manipulation of data

b) The definition of your database

c) The processing of your data

d) All of those mentioned

e) None of those mentioned

5) Which is not part of a database system?

a) Users

b) Database Application

c) DBMS

d) All of those mentioned

e) None of those mentioned

6) These are individual items that go into a database.

a) Entities

b) Attributes

c) Records

d) Fields

7) Big data is not useful for decision support.

a) True

b) False


8) ____ are a subset of the data warehouse that support the requirements of a particular

department or business function.

a) Database

b) Database Management System

c) Data Mart

d) Big Data

9) _________ Diagrams are composed of entities, attributes, and relationships.

a) Entity Attribute

b) Entity Relationship

c) Attribute Relationship

d) Entity Attribute Relationship

10) Tables in a DBMS should be normalized.

a) True

b) False


QUIZ 2

Topics covered: The Kimball Lifecycle, Dimensional Modeling

Week scheduled: Week 7

1) Which of the following is not included in Kimball’s approach to building data

warehouses?

a) The enterprise data warehouse (EDW) should be in at least 3rd normal form.

b) The EDW is based on dimensional model design

c) Focus on user-friendliness and ease of use

d) Develop EDW on a departmental basis piece by piece

2) Which of the following is considered to be part of the steps in the Kimball Lifecycle?

a) Program/Project Planning and management

b) Deployment

c) Maintenance

d) Growth

e) All of those mentioned

f) None of those mentioned

3) The ____ track involves dimensional modeling, physical design, and ETL (Extraction,

Transformation, Loading) design and development

a) Planning

b) Technology

c) Data

d) Application

4) The _____ track involves architectural design and product selection and installation.

a) Planning

b) Technology

c) Data

d) Application

5) The ____ track involves identifying standard analytic and report requirements to meet

80% – 90% of user needs.

a) Planning

b) Technology

c) Data

d) Application

6) Dimensional Modeling is a logical design technique for structuring data so that it is intuitive for business users and delivers moderate query performance.

a) True

b) False

7) Dimensional Modeling’s advantages over Relational Modeling are understandability

and performance.


a) True

b) False

8) Examples of ____ are sales amount and count of attendance.

a) Facts

b) Tables

c) Dimensions

d) Models

9) ____ describe the “who, what, where, when, why, and how”.

a) Facts

b) Tables

c) Dimensions

d) Models

10) Fact tables are usually very small.

a) True

b) False