CHAPTER 11: DIMENSIONAL MODELING: ADVANCED TOPICS


CHAPTER OBJECTIVE:

NORMALIZATION

THE SNOWFLAKE SCHEMA


Normalization

In creating a database, normalization is the process of organizing data into tables so that the results of using the database are always unambiguous and as intended. In practice this usually means dividing large tables into smaller ones that are easier to maintain. Making your data and tables conform to these rules is called normalizing data, or data normalization.

Normalization has two goals:

1. Eliminating redundant data.

2. Ensuring data dependencies make sense.

Both goals are worthwhile: they reduce the amount of space a database consumes and ensure that data is logically stored.


A simple example of normalizing data might consist of a table showing:

Customer   Item purchased   Purchase price
Thomas     Shirt            $40
Maria      Tennis shoes     $35
Evelyn     Shirt            $40
Pajaro     Trousers         $25

If this table is also used to keep track of item prices, deleting a customer can delete a price as well: removing Pajaro's row, for example, also removes the only record that trousers cost $25.

Normalizing the data means recognizing this problem and solving it by dividing the table into two: one table recording each customer and the product they bought, and a second recording each product and its price.
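The split described above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table names `purchase` and `product` and their column names are assumptions, not part of the original slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Second table from the text: each product and its price (names are hypothetical).
cur.execute("CREATE TABLE product (item TEXT PRIMARY KEY, price INTEGER NOT NULL)")
# First table from the text: each customer and the product they bought.
cur.execute(
    "CREATE TABLE purchase ("
    " customer TEXT NOT NULL,"
    " item TEXT NOT NULL REFERENCES product(item))"
)

cur.executemany(
    "INSERT INTO product VALUES (?, ?)",
    [("Shirt", 40), ("Tennis shoes", 35), ("Trousers", 25)],
)
cur.executemany(
    "INSERT INTO purchase VALUES (?, ?)",
    [("Thomas", "Shirt"), ("Maria", "Tennis shoes"),
     ("Evelyn", "Shirt"), ("Pajaro", "Trousers")],
)

# Deleting a customer now removes only the purchase row, not the price.
cur.execute("DELETE FROM purchase WHERE customer = ?", ("Pajaro",))

remaining_purchases = cur.execute("SELECT COUNT(*) FROM purchase").fetchone()[0]
trousers_price = cur.execute(
    "SELECT price FROM product WHERE item = ?", ("Trousers",)
).fetchone()[0]
print(remaining_purchases, trousers_price)
```

After the delete, the price of trousers is still available from the `product` table, which is the point of the split.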


Normalization degrees:

First normal form (1NF). This is the "basic" level of normalization and generally corresponds to the definition of any relational database:

It contains two-dimensional tables with rows and columns.

Each column corresponds to a sub-object or an attribute of the object represented by the entire table.

Each row represents a unique instance of that sub-object or attribute and must be different in some way from any other row (that is, no duplicate rows are possible).

All entries in any column must be of the same kind. For example, in the column labeled "Customer," only customer names or numbers are permitted.


Second normal form (2NF). At this level of normalization, every column that is not part of the key must depend on the whole key, not on just part of it. For example, in a table with three columns containing customer ID, product sold, and price of the product when sold, the price would be a function of both the customer ID (the customer may be entitled to a discount) and the specific product.
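The idea that a column is "a function of" other columns can be checked on sample data. The sketch below is a hypothetical helper, not a standard library function, and the three rows are invented for illustration. Note that a check on sample rows can only refute a dependency, never prove that it holds in general.

```python
def determines(rows, determinant, dependent):
    """Return True if, in these rows, the determinant columns fix the dependent column."""
    seen = {}
    for row in rows:
        key = tuple(row[col] for col in determinant)
        value = row[dependent]
        # setdefault stores the first value seen for a key and returns it afterwards.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Hypothetical sales rows: customer 2 is entitled to a discount on shirts.
sales = [
    {"customer_id": 1, "product": "Shirt", "price": 40},
    {"customer_id": 2, "product": "Shirt", "price": 36},
    {"customer_id": 1, "product": "Trousers", "price": 25},
]

product_alone = determines(sales, ["product"], "price")
customer_alone = determines(sales, ["customer_id"], "price")
whole_key = determines(sales, ["customer_id", "product"], "price")
print(product_alone, customer_alone, whole_key)
```

In this sample neither the product alone nor the customer alone determines the price, but the two together do, which matches the example in the text.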


Third normal form (3NF). At the second normal form, modification anomalies are still possible: changing or removing one row can also change or remove an unrelated fact stored in that row. For example, using the customer table just cited, removing a row describing a customer purchase (because of a return, perhaps) also removes the fact that the product has a certain price. In the third normal form, the table would be divided into two tables so that product pricing is tracked separately.


Snowflake Schema

The snowflake schema is an extension of the star schema in which each point of the star explodes into more points. In a star schema, each dimension is represented by a single dimension table, whereas in a snowflake schema that dimension table is normalized into multiple lookup tables, each representing a level in the dimensional hierarchy.

A snowflake schema consists of a fact table surrounded by multiple dimension tables, which can in turn be connected to other dimension tables via many-to-one relationships. So instead of a few big dimension tables connected to the fact table, we have groups of smaller dimension tables.

Normalizing the dimension tables increases the number of dimension and sub-dimension tables. Queries then need more foreign key joins from dimension tables to sub-dimension tables, which makes them more complex than star schema queries and tends to reduce query performance.

The snowflake schema helps save storage; however, it increases the number of dimension tables.

[Figure: star schema]

[Figure: snowflake schema]

Snowflake schema advantages:

The snowflake schema is designed from the star schema by further normalizing dimension tables to eliminate data redundancy.

Normalizing the dimension tables gives small savings in storage space.

Normalized structures are easier to update and maintain.

Snowflake schema disadvantages:

Normalizing dimension tables increases the number of dimension and sub-dimension tables, which requires more foreign key joins when querying the data and therefore reduces query performance.

Queries against a snowflake schema are more complex than queries against a star schema because of the multiple joins from dimension tables to sub-dimension tables.

It is more difficult for business users, because they have to work with more tables than in a star schema.

Snowflake schema example

[Figure: snowflake schema example]

Let’s examine the snowflake schema above in greater detail:

The DIM_STORE dimension table is normalized by adding one more dimension table, DIM_GEOGRAPHY.

The DIM_PRODUCT dimension table is normalized by adding two more dimension tables, DIM_BRAND and DIM_PRODUCT_CATEGORY.

The DIM_DATE dimension table is now connected to three other dimension tables: DIM_DAY_OF_WEEK, DIM_MONTH and DIM_QUARTER.

The fact table remains the same as in the star schema.
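The table relationships listed above can be written down as a schema. The following is a minimal sketch using Python's sqlite3 module. Only the DIM_* table names come from the example; the fact table name FACT_SALES and every column name are assumptions made for illustration.

```python
import sqlite3

# Hypothetical columns; only the DIM_* table names come from the example.
DDL = """
CREATE TABLE DIM_GEOGRAPHY (geography_id INTEGER PRIMARY KEY, city TEXT, country TEXT);
CREATE TABLE DIM_STORE (
    store_id INTEGER PRIMARY KEY,
    store_name TEXT,
    geography_id INTEGER REFERENCES DIM_GEOGRAPHY(geography_id)
);

CREATE TABLE DIM_BRAND (brand_id INTEGER PRIMARY KEY, brand_name TEXT);
CREATE TABLE DIM_PRODUCT_CATEGORY (category_id INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE DIM_PRODUCT (
    product_id INTEGER PRIMARY KEY,
    product_name TEXT,
    brand_id INTEGER REFERENCES DIM_BRAND(brand_id),
    category_id INTEGER REFERENCES DIM_PRODUCT_CATEGORY(category_id)
);

CREATE TABLE DIM_DAY_OF_WEEK (day_of_week_id INTEGER PRIMARY KEY, day_name TEXT);
CREATE TABLE DIM_MONTH (month_id INTEGER PRIMARY KEY, month_name TEXT);
CREATE TABLE DIM_QUARTER (quarter_id INTEGER PRIMARY KEY, quarter_name TEXT);
CREATE TABLE DIM_DATE (
    date_id INTEGER PRIMARY KEY,
    calendar_date TEXT,
    day_of_week_id INTEGER REFERENCES DIM_DAY_OF_WEEK(day_of_week_id),
    month_id INTEGER REFERENCES DIM_MONTH(month_id),
    quarter_id INTEGER REFERENCES DIM_QUARTER(quarter_id)
);

-- The fact table references only the three main dimensions, as in the star schema.
CREATE TABLE FACT_SALES (
    date_id INTEGER REFERENCES DIM_DATE(date_id),
    store_id INTEGER REFERENCES DIM_STORE(store_id),
    product_id INTEGER REFERENCES DIM_PRODUCT(product_id),
    sales_amount REAL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)


def referenced_tables(table):
    """List the tables that `table` points to through foreign keys."""
    rows = conn.execute(f"PRAGMA foreign_key_list({table})").fetchall()
    return sorted(row[2] for row in rows)  # column 2 holds the referenced table


fact_refs = referenced_tables("FACT_SALES")
date_refs = referenced_tables("DIM_DATE")
print(fact_refs)
print(date_refs)
```

The fact table's foreign keys are unchanged from a star design; the extra many-to-one links live entirely between dimension tables.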


Star Schema vs. Snowflake Schema

Understandability
Star schema: Easier for business users and analysts to query data.
Snowflake schema: May be more difficult for business users and analysts because of the number of tables they have to deal with.

Dimension table
Star schema: Only one dimension table for each dimension, grouping related attributes. Dimension tables are not in third normal form.
Snowflake schema: May have more than one dimension table for each dimension because each dimension table is further normalized.

Query complexity
Star schema: Queries are simple and easy to understand.
Snowflake schema: Queries are more complex because of multiple foreign key joins between dimension tables.

Query performance
Star schema: High performance. The database engine can optimize and boost query performance based on a predictable framework.
Snowflake schema: More foreign key joins, and therefore longer query execution time compared with the star schema.

When to use
Star schema: When dimension tables store a relatively small number of rows and space is not a big issue.
Snowflake schema: When dimension tables store a large number of rows with redundant data and space is an issue, the snowflake schema can be chosen to save space.

Foreign key joins
Star schema: Fewer joins.
Snowflake schema: Higher number of joins.

Data warehouse system
Star schema: Works best in any data warehouse / data mart.
Snowflake schema: Better for a small data warehouse / data mart.


1. Data optimization: The snowflake model uses normalized data, i.e. the data is organized inside the database so as to eliminate redundancy, which helps reduce the amount of data. The hierarchy of the business and its dimensions is preserved in the data model through referential integrity.

Figure 1 – Snowflake model

The star model, on the other hand, uses de-normalized data. In the star model, dimensions are linked directly to the fact table, and the business hierarchy is not implemented via referential integrity between dimensions.

Figure 2 – Star model


2. Business model:

A primary key is a single unique key (data attribute) selected to identify a particular record. In the previous ‘advertiser’ example, Advertiser_ID is the primary key (business key) of a dimension table. A foreign key (referential attribute) is simply a field in one table that matches the primary key of another table. In our example, Advertiser_ID could be a foreign key in Account_dimension.

In the snowflake model, the business hierarchy of the data model is represented by primary key/foreign key relationships between the various dimension tables.

In the star model, all required dimension tables are referenced only by foreign keys in the fact table.


3. Performance:

The third differentiator between the star schema and the snowflake schema is performance.

The snowflake model has a higher number of joins between dimension tables, and then again with the fact table, and hence performance is slower. For instance, if you want Advertiser details such as the advertiser name, ID and address, the advertiser and account tables need to be joined with each other and then joined with the fact table.

The star model, on the other hand, has fewer joins between dimension tables and the fact table. In this model, if you need information on the advertiser, you just have to join the Advertiser dimension table with the fact table.
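The difference in join count can be seen by writing the same question against both layouts. The sketch below uses Python's sqlite3 module. Advertiser_ID and Account_dimension follow the names used in the text; the other table names, the `clicks` measure, and all data values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Advertiser_dimension (Advertiser_ID INTEGER PRIMARY KEY, Advertiser_Name TEXT);
CREATE TABLE Account_dimension (
    Account_ID INTEGER PRIMARY KEY,
    Advertiser_ID INTEGER REFERENCES Advertiser_dimension(Advertiser_ID)
);
-- Snowflake layout: the fact table reaches the advertiser only through the account.
CREATE TABLE fact_snowflake (Account_ID INTEGER, clicks INTEGER);
-- Star layout: the fact table carries a foreign key to each dimension it needs.
CREATE TABLE fact_star (Account_ID INTEGER, Advertiser_ID INTEGER, clicks INTEGER);

INSERT INTO Advertiser_dimension VALUES (1, 'Acme'), (2, 'Globex');
INSERT INTO Account_dimension VALUES (10, 1), (11, 1), (20, 2);
INSERT INTO fact_snowflake VALUES (10, 5), (11, 7), (20, 3);
INSERT INTO fact_star VALUES (10, 1, 5), (11, 1, 7), (20, 2, 3);
""")

# Star: one join from the fact table to the advertiser dimension.
star_sql = """
SELECT a.Advertiser_Name, SUM(f.clicks)
FROM fact_star AS f
JOIN Advertiser_dimension AS a ON a.Advertiser_ID = f.Advertiser_ID
GROUP BY a.Advertiser_Name
ORDER BY a.Advertiser_Name
"""

# Snowflake: fact to account, then account to advertiser.
snowflake_sql = """
SELECT a.Advertiser_Name, SUM(f.clicks)
FROM fact_snowflake AS f
JOIN Account_dimension AS ac ON ac.Account_ID = f.Account_ID
JOIN Advertiser_dimension AS a ON a.Advertiser_ID = ac.Advertiser_ID
GROUP BY a.Advertiser_Name
ORDER BY a.Advertiser_Name
"""

star_rows = conn.execute(star_sql).fetchall()
snowflake_rows = conn.execute(snowflake_sql).fetchall()
print(star_rows)
print(snowflake_rows)
```

Both queries return the same totals per advertiser; the snowflake version simply needs one more join to get there. On a toy data set the cost difference is not measurable, so this shows only the structural difference, not a benchmark.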


4. ETL:

The snowflake model loads the data marts with dependencies between dimension tables, and hence the ETL job is more complex in design and cannot be fully parallelized, because the dependency model restricts it.

The star model loads dimension tables without any dependency between dimensions, and hence the ETL job is simpler and can achieve higher parallelism.
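One way to picture the load-order restriction is to group tables into stages, where every table in a stage can be loaded in parallel. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later). The dependency maps reuse the DIM_* tables from the earlier example; the fact table name FACT_SALES is an assumption.

```python
from graphlib import TopologicalSorter


def load_stages(dependencies):
    """Group tables into stages; tables within one stage have no mutual dependency."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything loadable right now
        stages.append(ready)
        sorter.done(*ready)
    return stages


# Each key must be loaded after the tables in its value set.
star = {
    "FACT_SALES": {"DIM_STORE", "DIM_PRODUCT", "DIM_DATE"},
}
snowflake = {
    "DIM_STORE": {"DIM_GEOGRAPHY"},
    "DIM_PRODUCT": {"DIM_BRAND", "DIM_PRODUCT_CATEGORY"},
    "DIM_DATE": {"DIM_DAY_OF_WEEK", "DIM_MONTH", "DIM_QUARTER"},
    "FACT_SALES": {"DIM_STORE", "DIM_PRODUCT", "DIM_DATE"},
}

star_stages = load_stages(star)
snowflake_stages = load_stages(snowflake)
print(len(star_stages), len(snowflake_stages))
```

Under these assumptions the star layout needs two sequential stages (all dimensions, then the fact table), while the snowflake layout needs three, because the sub-dimension tables must be loaded before the dimensions that reference them.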

Extract, Transform, Load (ETL)

In managing databases, extract, transform, load (ETL) refers to three separate functions combined into a single programming tool.

The extract function reads data from a specified source database and extracts a desired subset of data.

The transform function works with the acquired data - using rules or lookup tables, or creating combinations with other data - to convert it to the desired state.

The load function is used to write the resulting data (either all of the subset or just the changes) to a target database, which may or may not previously exist.
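The three functions can be sketched as three small Python functions over sqlite3 databases. Everything specific here is hypothetical: the `orders` and `sales` tables, their columns, the country lookup table, and the sample rows are invented for illustration only.

```python
import sqlite3

# Hypothetical lookup table used by the transform step.
COUNTRY_NAMES = {"DE": "Germany", "FR": "France"}


def extract(source, min_amount):
    """Read the desired subset of rows from the source database."""
    return source.execute(
        "SELECT customer, country_code, amount FROM orders WHERE amount >= ?",
        (min_amount,),
    ).fetchall()


def transform(rows):
    """Clean names and replace country codes using the lookup table."""
    return [
        (customer.strip().title(), COUNTRY_NAMES.get(code, "Unknown"), amount)
        for customer, code, amount in rows
    ]


def load(target, rows):
    """Write the rows to the target, creating the table if it does not exist yet."""
    target.execute(
        "CREATE TABLE IF NOT EXISTS sales (customer TEXT, country TEXT, amount REAL)"
    )
    target.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    target.commit()


# A small in-memory source database with sample rows.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (customer TEXT, country_code TEXT, amount REAL)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(" thomas ", "DE", 40.0), ("maria", "FR", 35.0), ("pajaro", "XX", 5.0)],
)

target = sqlite3.connect(":memory:")
load(target, transform(extract(source, min_amount=10)))

loaded = target.execute(
    "SELECT customer, country, amount FROM sales ORDER BY customer"
).fetchall()
print(loaded)
```

The `CREATE TABLE IF NOT EXISTS` statement reflects the point that the target may or may not previously exist. A real ETL tool would add error handling, incremental loads and logging, none of which is shown here.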
