Dimensional Data Modeling

Ms. Pallavi N. Halarnkar
Assistant Professor, Computer Engineering Department
MPSTME, NMIMS University
[email protected]


Lecture Objectives

Define the dimensional data model
Contrast it with ER modeling


Entity Relationship Modeling: Review

Entity Relationship (ER) modeling is a technique used to ‘abstract’ users’ data requirements into a model that can be analyzed and ultimately implemented.

The focus of ER modeling is to achieve processing and data storage efficiency by:

reducing data redundancy (storing each data element only once)

providing flexibility and ease of maintenance

protecting the integrity of data by storing it once

ER modeling and normalization are great for transaction processing because they make transactions as simple as possible (data is stored in only one place).


ER Model Example

However, normalized databases become very complex, making queries difficult and inefficient – a ‘spiderweb of joins’ is required for many queries. A database normalized for transaction processing is typically unusable for non-technical users who wish to perform queries.


ER model issues

End users cannot understand or remember an ER model. End users cannot navigate an ER model. There is no graphical user interface (GUI) that takes a general ER model and makes it usable by end users.

Software cannot usefully query a general ER model. Cost-based optimizers that attempt to do this are notorious for making the wrong choices, with disastrous consequences for performance.

Use of the ER modeling technique defeats the basic allure of data warehousing, namely intuitive and high-performance retrieval of data.

The solution: the Dimensional Data Model


What is Dimensional Modeling (DM)?

DM is a logical design technique that seeks to present the data in a standard, intuitive framework that allows for high-performance access.

Can be implemented using a relational or a multidimensional DBMS

Every dimensional model is composed of one table with a multipart key, called the fact table, and a set of smaller tables called dimension tables.

Each dimension table has a single-part primary key that corresponds exactly to one of the components of the multipart key in the fact table.

This characteristic "star-like" structure is often called a star join. The term star join dates back to the earliest days of relational databases.
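As a concrete illustration, here is a minimal SQL sketch of such a star join, using the sales example developed on the following slides (all table and column names are illustrative assumptions, not a prescribed design):

-- Dimension tables: single-part primary keys and descriptive attributes
CREATE TABLE date_dim (
    date_key    INTEGER PRIMARY KEY,
    full_date   DATE,
    day_of_week VARCHAR(10),
    month       VARCHAR(10)
);

CREATE TABLE product_dim (
    product_key  INTEGER PRIMARY KEY,
    product_name VARCHAR(50),
    category     VARCHAR(30)
);

CREATE TABLE store_dim (
    store_key  INTEGER PRIMARY KEY,
    store_name VARCHAR(50),
    region     VARCHAR(30)
);

-- Fact table: its multipart key is made up of the dimension keys,
-- and it carries the numeric, additive facts
CREATE TABLE sales_fact (
    date_key     INTEGER REFERENCES date_dim(date_key),
    product_key  INTEGER REFERENCES product_dim(product_key),
    store_key    INTEGER REFERENCES store_dim(store_key),
    dollars_sold DECIMAL(10,2),
    units_sold   INTEGER,
    dollars_cost DECIMAL(10,2),
    PRIMARY KEY (date_key, product_key, store_key)
);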


Dimensional Model Example


Dimensional Model: Fact Tables

A fact table contains information about things that an organization wants to measure.

A fact table’s key is made up from the keys of two or more parents.

A fact table always ‘resolves’ a many-to-many relationship between its parent (dimension) tables.

The most useful fact tables also contain one or more numerical measures, or facts, that occur for the combination of keys that define each record.

Example: the facts are Dollars Sold, Units Sold, and Dollars Cost.


The most useful facts in a fact table are numeric and additive.

Additivity is crucial because data warehouse applications almost never retrieve a single fact table record; rather, they fetch back hundreds, thousands, or even millions of these records at a time, and often the most useful thing to do with so many records is to add them up.


Dimensional Model: Dimension Tables

Dimension tables contain information about how an organization wants to analyze facts: “Show me sales revenue (fact) for last week (time) for blue cups (product) in the western region (geography).”

Dimension tables most often contain descriptive textual information, e.g. ‘Blue cups’, ‘Western Region’.

Dimension attributes are used as the source of most of the interesting ‘constraints’ in data warehouse queries, and they are virtually always the source of the row headers in the SQL answer set.
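Putting the fact-table and dimension-table points together, a sketch of a typical star-join query against the illustrative schema above: the dimension attributes supply the constraints and the row headers, and the additive facts are summed across many fact records.

SELECT p.category,           -- row header from a dimension attribute
       s.region,             -- row header from a dimension attribute
       SUM(f.dollars_sold)   -- additive fact, added up across records
FROM sales_fact f
JOIN product_dim p ON f.product_key = p.product_key
JOIN store_dim   s ON f.store_key   = s.store_key
JOIN date_dim    d ON f.date_key    = d.date_key
WHERE p.category = 'Cups'        -- constraint on a dimension attribute
  AND d.month    = 'September'   -- constraint on a dimension attribute
GROUP BY p.category, s.region;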


Dimensional Model vs ER model

The key to understanding the relationship between DM and ER is that a single ER diagram breaks down into multiple DM diagrams, or ‘stars’.

Think of a large ER diagram as representing every possible business process within an application. The ER diagram may have Sales Calls, Order Entries, Shipment Invoices, Customer Payments, and Product Returns, all on the same diagram.


[Diagram: a single ER diagram covering several business processes – Shipments, Returns, Sales Contact, Orders, and Payments]


To create the individual ‘stars’ that exist within an application:

Look for many-to-many relationships in the ER model containing numeric and additive facts, and designate them as fact tables.

Alternatively, look for ‘events’ or ‘transactions’ – these may also be facts.

Denormalize all of the remaining tables into flat tables with single-part keys that connect directly to the fact tables. These tables become the dimension tables.

In cases where a dimension table connects to more than one fact table, we represent this same dimension table in both schemas, and we refer to the dimension tables as "conformed" between the two dimensional models.


[ER diagram example: a driving-school model with entities Staff (staffNo, fName, lName, sAddress, jobTitle, salary, NIN, sex, dob), Client (clientNo, cAddress, cPostcode, cTelNo, dLicenseNo, sex, dob), Lesson (lessonNo, lessonDate, lessonTime, stage, progress, comments, mileageStart, mileageFinish), Vehicle (vehRegNo, model, make, color, capacity), Inspection (inspDate, inspTime, faultsFound, comments), Office (officeNo, oAddress, oPostcode, oTelNo, oFaxNo), Interview (iDate, iTime, iRoom, iComments, dLicense), and DrivingTest, connected by relationships such as Manages, RequiredFor, Has, Books, Undertakes, Registers, InspectedBy, Attends, Requests, Takes, Sits, and Runs with various 1..1, 0..1, 0..* and 1..* multiplicities]


Dimensional Model - Example


DM Strengths

The dimensional model has a number of important data warehouse advantages that the ER model lacks.

First, the dimensional model is a predictable, standard framework. Report writers, query tools, and user interfaces can all make strong assumptions about the dimensional model to make the user interfaces more understandable and to make processing more efficient.

Rather than using a cost-based optimizer, a database engine can make very strong assumptions about first constraining the dimension tables and then "attacking" the fact table all at once with the Cartesian product of those dimension table keys satisfying the user's constraints.
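In SQL terms, that evaluation strategy can be sketched as follows (a hedged illustration against the schema sketch above, not a description of any particular engine):

-- Step 1: constrain each dimension table independently
-- Step 2: attack the fact table all at once with the combinations
--         of qualifying dimension keys
SELECT SUM(f.dollars_sold)
FROM sales_fact f
WHERE f.product_key IN (SELECT product_key FROM product_dim WHERE category = 'Cups')
  AND f.store_key   IN (SELECT store_key   FROM store_dim   WHERE region   = 'West')
  AND f.date_key    IN (SELECT date_key    FROM date_dim    WHERE month    = 'September');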


A second strength of the dimensional model is that the predictable framework of the star join schema withstands unexpected changes in user behavior. Every dimension is equivalent. All dimensions can be thought of as symmetrically equal entry points into the fact table. The logical design can be done independent of expected query patterns. The user interfaces are symmetrical, the query strategies are symmetrical, and the SQL generated against the dimensional model is symmetrical.


A third strength of the dimensional model is that it is ‘gracefully extensible’ to accommodate unexpected new data elements and new design decisions.

Gracefully extensible: all existing tables (both fact and dimension) can be changed in place, either by simply adding new data rows or with a SQL ALTER TABLE command (sketched below).

Data should not have to be reloaded. No query tool or reporting tool needs to be reprogrammed to accommodate the change. Old applications continue to run without yielding different results.

New, unanticipated facts (that is, new additive numeric fields in the fact table) can be added, as long as they are consistent with the fundamental grain of the existing fact table.
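For instance, a minimal sketch against the illustrative sales_fact table (the new column name is an assumption):

-- Add a new additive fact at the existing grain;
-- old queries and applications are unaffected
ALTER TABLE sales_fact ADD COLUMN units_returned INTEGER;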


A fourth strength of the dimensional model is that there is a body of standard approaches for handling common modeling situations in the business world. These modeling situations include:

Slowly changing dimensions, where a "constant" dimension such as Product or Customer actually evolves slowly and asynchronously (see the sketch below).

Event-handling databases, where the fact table usually turns out to be "factless".

Many others.
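One standard approach to slowly changing dimensions is row versioning, commonly known as a "Type 2" response (the technique is named here for concreteness; the slide itself does not name it, and the table and column names below are assumptions): rather than overwriting the changed attribute, a new dimension row is inserted under a new surrogate key.

-- Product dimension with row versioning for slow changes
CREATE TABLE product_dim_versioned (
    product_key  INTEGER PRIMARY KEY,  -- surrogate key, new per version
    product_id   VARCHAR(20),          -- stable natural/business key
    product_name VARCHAR(50),
    category     VARCHAR(30),
    valid_from   DATE,
    valid_to     DATE                  -- NULL for the current version
);

-- When a product's category changes, close off the current row...
UPDATE product_dim_versioned
SET valid_to = DATE '2002-09-18'
WHERE product_id = 'p1' AND valid_to IS NULL;

-- ...and insert the new version; new facts reference the new key
INSERT INTO product_dim_versioned
VALUES (1002, 'p1', 'dettol soap', 'Toiletries', DATE '2002-09-18', NULL);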


A final strength of the dimensional model is the management of aggregates.

Aggregates are summary records that are logically redundant with base data already in the data warehouse, but they are used to enhance query performance.

A comprehensive aggregate strategy is required in every medium- and large-sized data warehouse implementation.

All of the aggregate management software packages and aggregate navigation utilities depend on a very specific single structure of fact and dimension tables that is absolutely dependent on the dimensional model.
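As an illustration (a sketch against the schema above; the aggregate's name and grain are assumptions), an aggregate is simply a summary fact table precomputed at a coarser grain:

-- Monthly sales by product category: logically redundant with the
-- base fact table, but much faster to query
CREATE TABLE sales_month_category_agg AS
SELECT d.month,
       p.category,
       SUM(f.dollars_sold) AS dollars_sold,
       SUM(f.units_sold)   AS units_sold
FROM sales_fact f
JOIN date_dim    d ON f.date_key    = d.date_key
JOIN product_dim p ON f.product_key = p.product_key
GROUP BY d.month, p.category;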


ER vs DM – Final Points

ER models are not appropriate for Data Warehouses. ER modeling does not really model a business; rather, it models the micro relationships among data elements.

ER models are wildly variable in structure. As such, it is extremely difficult to optimize query performance.


Why is ER not suitable for Data Warehouses?

End user cannot understand or remember an ER Model. End User cannot navigate an ER Model. There is no graphical user interface or GUI that takes a general ER diagram and makes it usable by end users.

ER modeling is not optimized for complex, ad-hoc queries; it is optimized for repetitive, narrow queries.

Use of the ER modeling technique defeats the basic allure of data warehousing, namely intuitive and high-performance retrieval of data, because it leads to highly normalized relational tables.


Designing a Dimensional Model: Steps Involved

Step 1: Select the Business Process

The first step in the design is to decide what business process(es) to model, by combining an understanding of the business requirements with an understanding of the available data.

Step 2: Declare the Grain

Once the business process has been identified, the data warehouse team faces a serious decision about the granularity. What level of detail must be made available in the dimensional model?


The grain of a fact table represents the level of detail of information in a fact table. Declaring the grain means specifying exactly what an individual fact table record represents.

It is recommended that the fact table capture the most atomic information generated by the business process. Atomic data is the most detailed information collected. The more detailed and atomic the fact measurements are, the more we know and the better we can analyze the data.


In the star schema discussed above, the most detailed data would be the transaction line-item detail on the sales receipt:

(date, time, product code, product_name, price/unit, number_of_units, amount)

= (18-SEP-2002, 11.02, p1, dettol soap, 15, 2, 30)


But in the above dimensional model we provide sales data rolled up by product (all records corresponding to the same product are combined) in a store on a day. A typical fact table record would look like this:

18-SEP-2002, Product1, Store1, 150, 600

This record tells us that on 18th Sept. 2002, 150 units of Product1 were sold for Rs. 600 from Store1.
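A sketch of how this coarser grain could be derived from the line-item detail (the table and column names are hypothetical, assuming a line-item table carrying the receipt fields listed above):

-- Roll line-item detail up to the (day, product, store) grain
SELECT sale_date,
       product_code,
       store_id,
       SUM(number_of_units) AS units_sold,
       SUM(amount)          AS amount_sold
FROM sales_line_item
GROUP BY sale_date, product_code, store_id;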


Step 3: Choose the Dimensions

Once the grain of the fact table has been chosen, the date, product, and store dimensions are readily identified.

It is often possible to add more dimensions to the basic grain of the fact table, where these additional dimensions naturally take on only one value under each combination of the primary dimensions.

If the additional dimension violates the grain by causing additional fact rows to be generated, then the grain must be revised to accommodate this dimension.


Step 4: Identify the Facts

To identify the facts, we examine the business and identify the transactions that may be of interest.

In our example, the electronic point of sale (EPOS) transactions give us two facts: quantity sold and sale amount.


Advanced Concepts


Snowflake Schema & Starflake Schema

Snowflaking is removing low-cardinality textual attributes (attributes whose number of distinct values is low relative to the table's cardinality) from dimension tables and placing them in secondary dimension tables.

For instance, a product category can be treated this way and physically removed from the low-level product dimension table by normalizing the dimension table. This is particularly done on large dimension tables.

Snowflaking a dimension means normalizing it and making it more manageable by reducing its size. But this may have an adverse effect on performance, as joins need to be performed.
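For instance, a sketch of snowflaking the category out of the illustrative product dimension (names are assumptions):

-- Secondary dimension table holding the low-cardinality attribute
CREATE TABLE category_dim (
    category_key  INTEGER PRIMARY KEY,
    category_name VARCHAR(30)
);

-- Normalized product dimension: the textual category is replaced by
-- a foreign key, at the cost of an extra join at query time
CREATE TABLE product_dim_snowflaked (
    product_key  INTEGER PRIMARY KEY,
    product_name VARCHAR(50),
    category_key INTEGER REFERENCES category_dim(category_key)
);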


Starflake Schema

A starflake schema is a hybrid structure that contains a mixture of star (denormalized) and snowflake (normalized) schemas.

It allows dimensions to be present in both forms to cater for different query requirements.


Fact Constellation

A dimensional model that contains more than one fact table sharing one or more conformed dimension tables is referred to as a fact constellation.


Differentiate between Star Schema and Snowflake Schema

Star Schema | Snowflake Schema

A star schema contains the dimension tables mapped around one or more fact tables. | A snowflake schema contains deeper joins because the tables are split into many pieces.
It is a denormalized model. | It is the normalized form of the star schema.
No need to use complicated joins. | Complicated joins are required, since there are more tables.
Queries run fast. | There is some delay in processing queries.
Star schemas are usually not in BCNF; all the primary keys of the dimension tables are in the fact table. | In a snowflake schema, dimension tables are in 3NF, so there are more dimension tables, linked by primary key–foreign key relationships.


Examples on Star Schema and Snowflake Schema

The AllElectronics company has a sales department. Sales considers four dimensions, namely time, item, branch, and location. The schema contains a central fact table, sales, with two measures: dollars_sold and units_sold.

Design a star schema, a snowflake schema, and a fact constellation for the same.


Star Schema
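(The slide presents the schema as a diagram. A minimal SQL sketch of one possible star-schema answer is given below; the dimension attributes are assumptions.)

CREATE TABLE time_dim (
    time_key INTEGER PRIMARY KEY,
    day INTEGER, month INTEGER, quarter VARCHAR(2), year INTEGER
);

CREATE TABLE item_dim (
    item_key INTEGER PRIMARY KEY,
    item_name VARCHAR(50), brand VARCHAR(30), type VARCHAR(30)
);

CREATE TABLE branch_dim (
    branch_key INTEGER PRIMARY KEY,
    branch_name VARCHAR(50), branch_type VARCHAR(30)
);

CREATE TABLE location_dim (
    location_key INTEGER PRIMARY KEY,
    street VARCHAR(50), city VARCHAR(30),
    province_or_state VARCHAR(30), country VARCHAR(30)
);

-- Central fact table with the two measures
CREATE TABLE sales (
    time_key     INTEGER REFERENCES time_dim(time_key),
    item_key     INTEGER REFERENCES item_dim(item_key),
    branch_key   INTEGER REFERENCES branch_dim(branch_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    dollars_sold DECIMAL(10,2),
    units_sold   INTEGER,
    PRIMARY KEY (time_key, item_key, branch_key, location_key)
);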


Snowflake Schema


Fact Constellation


Example 2

Mumbai University wants you to help design a star schema to record grades for courses completed by students. There are four dimension tables, namely course_section, professor, student, and period, with attributes as follows:

Course_section Attributes: Course_Id, Section_number, Course_name, Units, Room_id, Roomcapacity. During a given semester the college offers an average of 500 course sections

Professor Attributes: Prof_id, Prof_Name, Title, Department_id, department_name

Student Attributes: Student_id, Student_name, Major. Each Course section has an average of 60 students


Period Attributes: Semester_id, Year. The database will contain data for a 30-month period. The only fact to be recorded in the fact table is the course grade.

Answer the following questions:
(a) Design the star schema for this problem.
(b) Estimate the number of rows in the fact table, using the assumptions stated above, and also estimate the total size of the fact table (in bytes), assuming that each field occupies an average of 5 bytes.
(c) Can you convert this star schema to a snowflake schema? Justify your answer and design a snowflake schema if it is possible.


Star Schema


Total course sections offered by the university = 500
Each course section has an average of 60 students
The university stores data for 30 months

Total student enrollments across all course sections = 500 * 60 = 30,000
Time dimension: 30 months = 5 semesters (assuming 1 semester = 6 months)

Number of rows in the fact table = 30,000 * 5 = 150,000 (one student has 5 grades over 5 semesters)
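Completing part (b)'s size estimate, a worked calculation under the stated 5-bytes-per-field assumption (the five-field fact row, four dimension keys plus the grade, is itself an assumption):

$$\text{row size} \approx 5 \text{ fields} \times 5 \text{ bytes} = 25 \text{ bytes}$$
$$\text{fact table size} \approx 150{,}000 \times 25 = 3{,}750{,}000 \text{ bytes} \approx 3.75 \text{ MB}$$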

(c) Snowflake Schema: Yes, the above star schema can be converted to a snowflake schema, considering the following assumptions.


Snowflake Schema


Give an information package for recording information requirements for “Hotel Occupancy”, considering dimensions like Time, Hotel, etc. Design a star schema from the information package.


Information package for “Hotel Occupancy”:

Hotel: Hotel Id, Branch Name, Branch Code, Region, Address, City/State/Zip, Construction Year, Renovation Year

Room Type: Room Id, Room Type, Room Size, Number of Beds, Type of Bed, Max Occupants, Suite

Time: Time Id, Year, Quarter, Month, Date, Day of Week, Day of Month, Holiday Flag


Information Access and Delivery


Classes of Users

Users based on Computer Proficiency

Casual or Novice User: This type of user uses the data warehouse occasionally, not daily.

Regular User: Daily users of the data warehouse are called regular users.

Power User: Users who use the warehouse to create reports and queries, and who are highly proficient with the technology, are called power users.


Users based on their Job Functions

High-Level Executives and Managers: Users who make high-level strategic decisions.

Technical Analysts: Users who make use of OLAP operations and complex analysis.


Tourists: A tourist visits the data warehouse for information and gets an overview of the overall information contents of the data warehouse.

Operators: Operators use the data of ongoing transactions at a detailed level.

Explorers: These users try to investigate the data and dig up useful patterns using complex queries.

Miners: Miners discover new and unknown patterns in the data.


Information Delivery


The Need for Multidimensional Information

Multidimensionality allows for assigning relationships between seemingly unrelated information fields.

It allows combining information from different sources, and relating them to obtain useful knowledge.


[Figures: simple 2-dimensional data, and multi-dimensional data with the Time axis]


OLAP

If a person needs to predict, or at least observe, the trend of some variable as some affecting value is varied, a conventional data warehouse cannot produce the information required. This is where OLAP comes into play.

OLAP, or online analytical processing, is a technology complementary to data warehousing, in the sense that it works on the data stored in a warehouse. Warehouses need OLAP to produce multidimensional views of the information they store, and OLAP needs data warehouses for the data it processes.


OLAP Defined

OLAP may be defined in terms of just five keywords – Fast Analysis of Shared Multidimensional Information.

Fast, such that even the most complex queries require not more than 5 seconds to process.

Analysis alludes to the process of analyzing information of all relevant kinds in order to process complex queries and to establish clear criteria for the results of such queries.


Shared: the information to be used for analysis is generally obtained from a shared source, such as a data warehouse.

Multidimensional: the information may be related in more than one or two dimensions. For example, a particular set of business data may be related, variously, to sales figures, market trends, consumer buying patterns, supplier conditions and the liquidity of the business. Presented in such multidimensional detail, such information can be useful and vital to managerial decision-making.


Differences between OLTP and OLAP (Application Differences)

OLTP | OLAP

Transaction oriented | Subject oriented
High Create/Read/Update/Delete (CRUD) activity | High Read activity
Many users | Few users
Continuous updates – many sources | Batch updates – single source
Real-time information | Historical information
Tactical decision-making | Strategic planning
Controlled, customized delivery | “Uncontrolled”, generalized delivery
RDBMS | RDBMS and/or MDBMS
Operational database | Informational database


Modeling Objectives Differences

OLTP | OLAP

High transaction volumes using few records at a time | Low transaction volumes using many records at a time
Balancing needs of online vs scheduled batch processing | Design for on-demand online processing
Highly volatile data | Non-volatile data
Data redundancy – BAD | Data redundancy – GOOD
Few levels of granularity | Multiple levels of granularity
Complex database designs used by IT personnel | Simpler database designs with business-friendly constructs


Model Differences

OLTP | OLAP

Single-purpose model – supports operational systems | Multiple models – support informational systems
Full set of enterprise data | Subset of enterprise data
Eliminate redundancy | Plan for redundancy
Natural or surrogate keys | Surrogate keys
Validate model against business function analysis | Validate model against reporting requirements
Technical metadata depends on business requirements | Technical metadata depends on data mapping results
This moment in time is important | Many moments in time are essential elements


OLAP Techniques: Consolidation or Roll-Up


Roll-Up

Consolidation involves the aggregation of data. This can involve simple roll-ups or complex grouping involving inter-related data.

For example, the figure below shows the result of a roll-up operation performed on the central cube by climbing up the concept hierarchy for location. This hierarchy was defined as the total order street < city < province_or_state < country. The roll-up operation shown aggregates the data by ascending the location hierarchy from the level of city to the level of country. In other words, rather than grouping the data by city, the resulting cube groups the data by country.
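In relational terms (a sketch; the table and column names follow the illustrative AllElectronics star schema given earlier), the roll-up corresponds to regrouping at a higher level of the hierarchy:

-- Before roll-up: sales aggregated by city
SELECT l.city, SUM(s.dollars_sold) AS dollars_sold
FROM sales s JOIN location_dim l ON s.location_key = l.location_key
GROUP BY l.city;

-- After roll-up: ascend street < city < province_or_state < country
SELECT l.country, SUM(s.dollars_sold) AS dollars_sold
FROM sales s JOIN location_dim l ON s.location_key = l.location_key
GROUP BY l.country;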


OLAP: Drill Down


Slicing


Dicing


Pivot/Rotate


Other OLAP operations

Drill across: executes a query involving (i.e., across) more than one fact table (see the sketch after this list).

Drill through: makes use of relational SQL facilities to drill through the bottom level of the data cube down to its back-end relational tables.

Other operations may include ranking the top N or bottom N items in a list, as well as computing moving averages, growth rates, interest, internal rates of return, depreciation, currency conversions and statistical functions.
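A sketch of a drill-across query (the second fact table, inventory_fact, and its columns are hypothetical): each fact table is aggregated to a shared grain, and the results are joined through the conformed dimension.

SELECT s.month, s.dollars_sold, i.units_on_hand
FROM (SELECT d.month, SUM(f.dollars_sold) AS dollars_sold
      FROM sales f JOIN time_dim d ON f.time_key = d.time_key
      GROUP BY d.month) s
JOIN (SELECT d.month, SUM(f.quantity) AS units_on_hand
      FROM inventory_fact f JOIN time_dim d ON f.time_key = d.time_key
      GROUP BY d.month) i
  ON s.month = i.month;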


OLAP Applications

1. Financial Applications: activity-based costing (resource allocation), budgeting

2. Marketing/Sales Applications: market research analysis, sales forecasting, promotions analysis, customer analyses, market/customer segmentation

3. Business Modelling: simulating business behaviour, extensive real-time decision support systems for managers


Benefits of Using OLAP

OLAP helps managers in decision-making through the multidimensional data views that it is capable of providing, thus increasing their productivity.

OLAP applications are self-sufficient owing to the inherent flexibility provided to the organized databases.

It enables simulation of business models and problems, through extensive usage of analysis-capabilities.

In conjunction with data warehousing, OLAP can be used to provide a reduction in the application backlog, faster information retrieval, and a reduction in query drag.


Approaches to OLAP Servers

MOLAP

This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a multidimensional cube. The storage is not in the relational database, but in proprietary formats.


Advantages:

Excellent performance: MOLAP cubes are built for fast data retrieval, and are optimal for slicing and dicing operations.

Can perform complex calculations.

Disadvantages:

Limited in the amount of data it can handle.

Requires additional investment.


ROLAP

This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality.

Advantages:

Can handle large amounts of data.

Can leverage functionalities inherent in the relational database.


Disadvantages:

Performance can be slow: because each ROLAP report is essentially a SQL query (or multiple SQL queries) against the relational database, the query time can be long if the underlying data size is large.

Limited by SQL functionalities.


HOLAP

HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP.

For summary-type information, HOLAP leverages cube technology for faster performance.

When detail information is needed, HOLAP can “drill through” from the cube into the underlying relational data. For example, a HOLAP server may allow large volumes of detail data to be stored in a relational database, while aggregations are kept in a separate MOLAP store.

Microsoft SQL Server 7.0 OLAP Services supports a hybrid OLAP server.


Specialized SQL Servers

To meet the growing demand of OLAP processing in relational databases, some relational and data warehousing firms (e.g., Red Brick from Informix) implement specialized SQL servers that provide advanced query language and query processing support for SQL queries over star and snowflake schemas in a read-only environment.


Introduction to Weka Tool


Outline

Introduction

Weka Tools/Functions

How to use Weka?
Weka Data File Format (Input)
Weka for Data Mining
Sample Output from Weka (Output)

Conclusion


Introduction to Weka (Data Mining Tool)

Weka was developed at the University of Waikato in New Zealand: http://www.cs.waikato.ac.nz/ml/weka/

Weka is an open source data mining tool developed in Java. It is used for research, education, and applications. It can be run on Windows, Linux and Mac.



What can Weka do?

Weka is a collection of machine learning algorithms for data mining tasks.

The algorithms can either be applied directly to a dataset (using GUI) or called from your own Java code (using Weka Java library).

Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization.


Weka’s Role in the Big Picture

Input (raw data) → Data mining by Weka (pre-processing, classification, regression, clustering, association rules, visualization) → Output (result)


Weka GUI

[Screenshot of the Weka GUI, annotated with: the different analysis tools/functions; the attributes to choose from; and the value set of the chosen attribute with the number of input items for each value]


Weka GUI

[Screenshot: the classification algorithms available in Weka]


Sample Output from Weka


Interpretation of Result

(rows = actual class, columns = predicted class)

 | YES | NO
YES | TP (36) | FN (5)
NO | FP (19) | TN (10)


Accuracy Measurement Parameters

Parameter | Formula

1. True Positive Rate | TPR = TP / (TP + FN)
2. False Positive Rate | FPR = FP / (FP + TN)
3. Precision | P = TP / (TP + FP)
4. Recall | R = TP / (TP + FN)
5. F-Measure | F = (2 * P * R) / (P + R)
6. ROC | the ROC curve plots the True Positive Rate against the False Positive Rate
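Plugging in the confusion matrix above (TP = 36, FN = 5, FP = 19, TN = 10) gives, as a worked check:

$$TPR = R = \frac{36}{36+5} \approx 0.878 \qquad FPR = \frac{19}{19+10} \approx 0.655$$
$$P = \frac{36}{36+19} \approx 0.655 \qquad F = \frac{2 \times 0.655 \times 0.878}{0.655 + 0.878} \approx 0.750$$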


Analysis of Classification Result

 | Play | Not Play
Actual Data | 9 | 5
KNN | 11 | 3
C4.5 | 10 | 4
Naïve Bayes | 11 | 3


Analysis of Clustering Result

 | Normal | Intrusions
Actual Clusters | 498 | 2
Simple K-means | 303 | 197
Farthest First | 491 | 9
Hierarchical Clustering | 493 | 7


Conclusion about Weka

The overall goal of Weka is to build a state-of-the-art facility for developing machine learning (ML) techniques and allow people to apply them to real-world data mining problems.

Detailed documentation about the different functions provided by Weka can be found on the Weka website.

WEKA is available at: http://www.cs.waikato.ac.nz/ml/weka
