1> Inmon vs. Kimball: Which approach is suitable for your
data warehouse?
When it comes to designing a data warehouse for your business, the two most commonly
discussed methods are the approaches introduced by Bill Inmon and Ralph Kimball. The
debate over which one is better and more effective has gone on for years, but no
clear-cut answer has emerged: both philosophies have their own advantages and
differentiating factors, and enterprises continue to use either of them.
To begin with, let us have a quick look at both the approaches.
In a nutshell
Bill Inmon’s enterprise data warehouse approach (the top-down design): A normalized data
model is designed first. Then the dimensional data marts, which contain the data required
for specific business processes or specific departments, are created from the data warehouse.
Ralph Kimball’s dimensional design approach (the bottom-up design): The data marts
facilitating reports and analysis are created first; these are then combined together to
create a broad data warehouse.
Inmon’s top-down approach
Inmon defines the data warehouse as a centralized repository for the entire enterprise. The
data warehouse stores the ‘atomic’ data at the lowest level of detail. Dimensional data marts
are created only after the complete data warehouse has been created. Thus, the data warehouse
is at the center of the Corporate Information Factory (CIF), which provides a logical
framework for delivering business intelligence.
Inmon defines the data warehouse in the following terms:
1. Subject-oriented: The data in the data warehouse is organized so that all the data
elements relating to the same real-world event or object are linked together
2. Time-variant: The changes to the data in the database are tracked and recorded so
that reports can be produced showing changes over time
3. Non-volatile: Data in the data warehouse is never over-written or deleted -- once
committed, the data is static, read-only, and retained for future reporting
4. Integrated: The database contains data from most or all of an organization's
operational applications, and this data is made consistent
Kimball’s bottom-up approach
Keeping in mind the most important business aspects or departments, data marts are
created first. These provide a thin view into the organizational data, and as and when
required these can be combined into a larger data warehouse. Kimball defines the data
warehouse as “a copy of transaction data specifically structured for query and analysis”.
Kimball’s data warehousing architecture is also known as the Data Warehouse Bus (BUS).
Dimensional modeling focuses on ease of end user accessibility and provides a high level of
performance to the data warehouse.
Inmon vs. Kimball: Similar or different?
Pros and cons of both the approaches
How to decide?
As we have already seen, the approach to designing a data warehouse depends on the
business objectives of an organization, nature of business, time and cost involved, and the
level of dependencies between various functions. Inmon’s approach is suitable for stable
businesses which can afford the time taken for design and the cost involved. Also, with
every changing business condition, they do not change the design; instead, they
accommodate the changes into the existing model. However, if local optimization is good
enough and the focus is on quick wins, it is advisable to go for Kimball’s approach.
Keeping this in mind, let the Inmon vs. Kimball fight play out over a few sectors/functions.
Insurance: It is vital to get the overall picture with respect to individual clients,
groups, history of claims, mortality rate tendencies, demography, profitability of
each plan and agents, etc. All aspects are inter-related and therefore suited for
Inmon’s approach.
Marketing: This is a specialized division, which does not call for enterprise
warehouse. Only data marts are required. Hence, Kimball’s approach is suitable.
CRM in banks: The focus is on parameters such as products sold, up-sell and cross-
sell at a customer-level. It is not necessary to get an overall picture of the business.
For example, there is no need to link a customer’s details to the treasury department
dealing with forex transactions and regulations. Since the scope is limited, you can
go for Kimball’s method. However, if all the processes and divisions in the bank
are to be linked, the obvious choice is Inmon’s design over Kimball’s.
Manufacturing: Multiple functions are involved here, irrespective of the budget
involved. Thus, where there is a systemic dependency as in this case, an enterprise
model is required. Hence Inmon’s method is ideal.
While designing a data warehouse, first you have to look at your business objectives – short
term and long term. See where the functional links are and what stands alone. Analyze data
sources for quantity and quality. Finally, evaluate your resource level, time frame and
wallet. This helps you decide which method to adopt: Inmon’s, Kimball’s, or a
combination of both.
2> Data Warehouse
Data Warehouse Architecture
Data Warehouse definition by William H. Inmon:
A data warehouse is a subject-oriented, integrated, time-variant and non-volatile
collection of data in support of the management's decision-making process.
A data warehouse is a centralized repository that stores data from multiple information
sources and transforms them into a common, multidimensional data model for efficient
querying and analysis.
3> OLTP vs. OLAP
We can divide IT systems into transactional (OLTP) and analytical (OLAP). In
general we can assume that OLTP systems provide source data to data
warehouses, whereas OLAP systems help to analyze it.
- OLTP (On-line Transaction Processing) is characterized by a large number of short on-
line transactions (INSERT, UPDATE, DELETE). The main emphasis for OLTP systems is put
on very fast query processing, maintaining data integrity in multi-access environments and
an effectiveness measured by the number of transactions per second. An OLTP database
contains detailed, current data, and the schema used to store transactional databases is
the entity model (usually 3NF).
- OLAP (On-line Analytical Processing) is characterized by relatively low volume of
transactions. Queries are often very complex and involve aggregations. For OLAP systems a
response time is the effectiveness measure. OLAP applications are widely used by data
mining techniques. An OLAP database contains aggregated, historical data, stored in
multi-dimensional schemas (usually a star schema).
The following table summarizes the major differences between OLTP and OLAP system
design.
OLTP System = Online Transaction Processing (Operational System);
OLAP System = Online Analytical Processing (Data Warehouse).

Source of data:
- OLTP: Operational data; OLTPs are the original source of the data.
- OLAP: Consolidated data; OLAP data comes from the various OLTP databases.

Purpose of data:
- OLTP: To control and run fundamental business tasks.
- OLAP: To help with planning, problem solving, and decision support.

What the data reveals:
- OLTP: A snapshot of ongoing business processes.
- OLAP: Multi-dimensional views of various kinds of business activities.

Inserts and updates:
- OLTP: Short and fast inserts and updates initiated by end users.
- OLAP: Periodic long-running batch jobs refresh the data.

Queries:
- OLTP: Relatively standardized and simple queries returning relatively few records.
- OLAP: Often complex queries involving aggregations.

Processing speed:
- OLTP: Typically very fast.
- OLAP: Depends on the amount of data involved; batch data refreshes and complex queries
  may take many hours; query speed can be improved by creating indexes.

Space requirements:
- OLTP: Can be relatively small if historical data is archived.
- OLAP: Larger due to the existence of aggregation structures and history data; requires
  more indexes than OLTP.

Database design:
- OLTP: Highly normalized with many tables.
- OLAP: Typically de-normalized with fewer tables; use of star and/or snowflake schemas.

Backup and recovery:
- OLTP: Backup religiously; operational data is critical to run the business, and data
  loss is likely to entail significant monetary loss and legal liability.
- OLAP: Instead of regular backups, some environments may consider simply reloading the
  OLTP data as a recovery method.
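The contrast between the two workload styles can be illustrated with a small sketch using
Python's built-in sqlite3 module. The `orders` table, its columns and the data values are
hypothetical, chosen only to show an OLTP-style single-row transaction next to an
OLAP-style aggregation:

```python
import sqlite3

# In-memory database with a hypothetical "orders" table (names are illustrative)
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer TEXT, region TEXT, amount REAL)""")
conn.executemany(
    "INSERT INTO orders (customer, region, amount) VALUES (?, ?, ?)",
    [("Alice", "EU", 120.0), ("Bob", "US", 80.0), ("Alice", "EU", 40.0)])

# OLTP-style workload: a short transaction touching a single current row
conn.execute("UPDATE orders SET amount = 130.0 WHERE order_id = 1")

# OLAP-style workload: a complex aggregation scanning many rows for analysis
rows = conn.execute(
    "SELECT region, SUM(amount), COUNT(*) FROM orders "
    "GROUP BY region ORDER BY region").fetchall()
print(rows)  # [('EU', 170.0, 2), ('US', 80.0, 1)]
```

In a real deployment the two query styles would run against separate systems, which is
precisely why the table above contrasts their design requirements.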
4> What is Business Intelligence?
Business Intelligence (BI) - technology infrastructure for gaining maximum
information from available data for the purpose of improving business processes.
Typical BI infrastructure components are software solutions for gathering,
cleansing, integrating, analyzing and sharing data. Business Intelligence produces
analyses and provides reliable information to help make effective, high-quality
business decisions.
The most common kinds of Business Intelligence systems are:
EIS - Executive Information Systems
DSS - Decision Support Systems
MIS - Management Information Systems
GIS - Geographic Information Systems
OLAP - Online Analytical Processing and multidimensional analysis
CRM - Customer Relationship Management
Business Intelligence systems are based on Data Warehouse technology. A Data
Warehouse (DW) gathers information from a wide range of a company's operational
systems, and Business Intelligence systems are built on top of it. Data loaded into a
DW is usually well integrated and cleansed, which allows producing credible information
reflecting the so-called 'one version of the truth'.
Business Intelligence tools
The most popular BI tools on the market are:
- Siebel Business Analytics Applications
- Business Intelligence
- BusinessObjects XI
- Cognos 8 BI
- Hyperion System 9 BI+
- Analysis Services
- Dynamic Enterprise Dashboards
- Open BI Suite
- WebFOCUS Business Intelligence
- QlikTech QlikView
- Enterprise Analytics
- InfoMaker
- IOLAP
5> ETL tools
List of the most popular ETL tools:
- Power Center
- Websphere DataStage (formerly known as Ascential DataStage)
- BusinessObjects Data Integrator
- Cognos Data Manager (Formerly known as Cognos DecisionStream)
- SQL Server Integration Services
- Data Integrator (Formerly known as Sunopsis Data Conductor)
- Data Integration Studio
- Warehouse Builder
- Data Migrator
- Pentaho Data Integration
- DT/Studio
- ETL4ALL
- DB2 Warehouse Edition
- Data Integrator
- Transformation Manager
- DataFlow
- Data Integrated Suite ETL
- Talend Open Studio
- Expressor Semantic Data Integration System
- Elixir Repertoire
- CloverETL
6> ETL process
ETL (Extract, Transform and Load) is a process in data warehousing responsible for
pulling data out of the source systems and placing it into a data warehouse. ETL
involves the following tasks:
- Extracting the data from source systems (SAP, ERP, other operational systems);
data from different source systems is converted into one consolidated data warehouse
format which is ready for transformation processing.
- Transforming the data may involve the following tasks:
applying business rules (so-called derivations, e.g., calculating new measures and
dimensions),
cleaning (e.g., mapping NULL to 0 or "Male" to "M" and "Female" to "F" etc.),
filtering (e.g., selecting only certain columns to load),
splitting a column into multiple columns and vice versa,
joining together data from multiple sources (e.g., lookup, merge),
transposing rows and columns,
applying any kind of simple or complex data validation (e.g., if the first 3 columns
in a row are empty then reject the row from processing)
- Loading the data into a data warehouse or other data repositories and reporting
applications
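The transformation tasks listed above can be sketched in plain Python. This is a minimal,
hedged example: the record layout, field names (`customer`, `gender`, `quantity`,
`unit_price`) and rules are illustrative assumptions, not taken from any real source system:

```python
# Minimal ETL transformation sketch. All field names and rules below are
# hypothetical, chosen only to illustrate cleaning, derivation and validation.

def transform(record):
    """Apply cleaning, derivation and filtering rules to one extracted record."""
    qty = record.get("quantity") or 0                      # cleaning: NULL -> 0
    gender = {"Male": "M", "Female": "F"}.get(
        record.get("gender"), record.get("gender"))        # cleaning: code mapping
    revenue = qty * record.get("unit_price", 0.0)          # derivation: new measure
    # filtering: keep only the columns the warehouse needs
    return {"customer": record["customer"], "gender": gender,
            "quantity": qty, "revenue": revenue}

def validate(record):
    """Simple data validation: reject rows with no customer or negative quantity."""
    return record["customer"] is not None and record["quantity"] >= 0

extracted = [  # records as pulled from a hypothetical source system
    {"customer": "C1", "gender": "Male", "quantity": 3, "unit_price": 2.5},
    {"customer": "C2", "gender": "Female", "quantity": None, "unit_price": 4.0},
]
loaded = [t for t in map(transform, extracted) if validate(t)]
print(loaded)
```

Production ETL tools implement the same transform/validate/load pipeline, but with
metadata-driven rules, scheduling and error handling rather than hard-coded functions.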
7> Data Warehouse Schema Architecture
Data Warehouse environment usually transforms the relational data model into
some special architectures. There are many schema models designed for data
warehousing but the most commonly used are:
- Star schema
- Snowflake schema
- Fact constellation schema
The determination of which schema model should be used for a data warehouse
should be based upon the analysis of project requirements, accessible tools and
project team preferences.
Star schema
The star schema architecture is the simplest data warehouse schema. It is called a star
schema because the diagram resembles a star, with points radiating from a center. The
center of the star consists of a fact table and the points of the star are the dimension
tables. Usually the fact tables in a star schema are in third normal form (3NF), whereas
dimensional tables are de-normalized. Despite the fact that the star schema is the
simplest architecture, it is the most commonly used nowadays and is recommended by Oracle.
January 10, 2015 Data Mining: Concepts and Techniques 11
Example of Star Schema:

Sales Fact Table: time_key, item_key, branch_key, location_key, units_sold,
dollars_sold, avg_sales (measures)

Dimension tables:
- time: time_key, day, day_of_the_week, month, quarter, year
- item: item_key, item_name, brand, type, supplier_type
- branch: branch_key, branch_name, branch_type
- location: location_key, street, city, state_or_province, country
Fact Tables
A fact table typically has two types of columns: foreign keys to dimension tables and
measures, which contain numeric facts. A fact table can contain facts at detail or
aggregated level.
Dimension Tables
A dimension is a structure usually composed of one or more hierarchies that categorize
data. If a dimension has no hierarchies and levels, it is called a flat dimension or list.
The primary keys of each of the dimension tables are part of the composite primary key of
the fact table. Dimensional attributes help to describe the dimensional value. They are
normally descriptive, textual values. Dimension tables are generally smaller in size than
fact tables.
Typical fact tables store data about sales, while dimension tables store data about
geographic regions (markets, cities), clients, products, times, and channels.
The main characteristics of star schema:
- simple structure -> easy to understand schema
- small number of tables to join
- de-normalization -> data redundancy can make the tables large
- most commonly used in data warehouse implementations -> widely supported
by a large number of business intelligence tools
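A star-schema query joins the fact table to its dimension tables on their keys and
aggregates the measures. A minimal sketch with Python's sqlite3, using the table and
column names from the sales example above (the inserted data values are made up, and
the schema is reduced to two dimensions for brevity):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);
CREATE TABLE sales_fact (
    time_key INTEGER, item_key INTEGER, branch_key INTEGER,
    location_key INTEGER, units_sold INTEGER, dollars_sold REAL);
INSERT INTO item VALUES (1, 'Widget', 'Acme'), (2, 'Gadget', 'Acme');
INSERT INTO location VALUES (10, 'Paris', 'France'), (20, 'Lyon', 'France');
INSERT INTO sales_fact VALUES (100, 1, 5, 10, 3, 30.0),
                              (100, 2, 5, 10, 1, 50.0),
                              (101, 1, 5, 20, 2, 20.0);
""")

# Star join: the fact table joined to two dimension tables on their foreign
# keys, with the dollars_sold measure aggregated per dimensional attribute.
rows = conn.execute("""
    SELECT i.brand, l.country, SUM(f.dollars_sold)
    FROM sales_fact f
    JOIN item i ON f.item_key = i.item_key
    JOIN location l ON f.location_key = l.location_key
    GROUP BY i.brand, l.country
""").fetchall()
print(rows)  # [('Acme', 'France', 100.0)]
```

The de-normalized dimensions keep the join shallow (one hop from fact to each
dimension), which is the performance point the characteristics list above makes.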
Example of Snowflake Schema:

Sales Fact Table: time_key, item_key, branch_key, location_key, units_sold,
dollars_sold, avg_sales (measures)

Dimension tables:
- time: time_key, day, day_of_the_week, month, quarter, year
- item: item_key, item_name, brand, type, supplier_key
- supplier: supplier_key, supplier_type
- branch: branch_key, branch_name, branch_type
- location: location_key, street, city_key
- city: city_key, city, state_or_province, country
Example of Fact Constellation:

Sales Fact Table: time_key, item_key, branch_key, location_key, units_sold,
dollars_sold, avg_sales (measures)

Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location,
dollars_cost, units_shipped

Dimension tables:
- time: time_key, day, day_of_the_week, month, quarter, year
- item: item_key, item_name, brand, type, supplier_type
- branch: branch_key, branch_name, branch_type
- location: location_key, street, city, province_or_state, country
- shipper: shipper_key, shipper_name, location_key, shipper_type
Distributed Data Warehouse [2]
We have seen that the central data warehouse tier in Figure 1 can be represented by a relational
database schema. Therefore, in dealing with distributed data warehouses we may apply the relational
theory of distribution design [OV99], i.e. we first fragment the schema, then allocate the fragments to
the nodes of a network. This allocation may replicate fragments, i.e. store the same fragment at more
than one node. The goal of the distribution is to optimise global performance. Having this
goal in mind, fragmentation and allocation mutually depend on each other, and thus both
steps are usually combined.

Fig. Distributed Data Warehouse Architecture

However, as refreshing of the data warehouse content can be assumed to be executed in an
off-line mode and the OLAP tier requires only read-access to the warehouse, we favor an
architecture with as much replication as necessary, such that all view creation operations
for data marts can be performed locally. This
assumption is illustrated by the distributed data warehouse architecture in Figure 4. It is even possible
that the whole data warehouse is replicated at all nodes. For instance, in the sales
example we used so far, regional sales statistics may only be provided to regional branch
offices, whereas general sales statistics for the headquarters may disregard individual
shops. This implies that different data marts would
be needed for the locations participating in the distributed warehouse and OLAP system.
Or
Distributed Data Warehouse : [3]
Many organizations have physically distributed databases with extremely large amounts of data.
Traditionally the data warehouse would be seen as a centralized repository, whereby data from all
sources would be imported into that large centralized repository for analysis. Nowadays the speed and
bandwidth of wide-area computer networks enables a distributed approach, whereby parts of the data
may reside in different places, parts being cached and/or replicated for performance reasons, and the
system functions to the outside world as a single global access-transparent repository. As the amount of
data and number of sites grow, this distributed approach becomes crucial, as a single centralized data
warehouse importing data from all the sources has obvious scalability limitations.
References
[1] http://datawarehouse4u.info/
[2] Jane Zhao, “Designing Distributed Data Warehouses and OLAP Systems”, pp. 254-263
[3] Pedro Furtado , “A Survey on Parallel and Distributed Data Warehouses”, pp. 1-23