1> Inmon vs. Kimball: Which approach is suitable for your
data warehouse?
When it comes to designing a data warehouse for your business, the two most commonly
discussed methods are the approaches introduced by Bill Inmon and Ralph Kimball. The
debate over which one is better and more effective has gone on for years, but no
clear-cut answer has emerged: both philosophies have their own advantages and
differentiating factors, and enterprises continue to use either of them.
To begin with, let us have a quick look at both the approaches.
In a nutshell
Bill Inmon’s enterprise data warehouse approach (the top-down design): A normalized data
model is designed first. Then the dimensional data marts, which contain the data required
for specific business processes or specific departments, are created from the data warehouse.
Ralph Kimball’s dimensional design approach (the bottom-up design): The data marts
facilitating reports and analysis are created first; these are then combined together to
create a broad data warehouse.
Inmon’s top-down approach
Inmon defines the data warehouse as a centralized repository for the entire enterprise. The
data warehouse stores the ‘atomic’ data at the lowest level of detail. Dimensional data marts
are created only after the complete data warehouse has been created. Thus, the data warehouse
is at the center of the Corporate Information Factory (CIF), which provides a logical
framework for delivering business intelligence.
Inmon defines the data warehouse in the following terms:
1. Subject-oriented: The data in the data warehouse is organized so that all the data
elements relating to the same real-world event or object are linked together
2. Time-variant: The changes to the data in the database are tracked and recorded so
that reports can be produced showing changes over time
3. Non-volatile: Data in the data warehouse is never over-written or deleted -- once
committed, the data is static, read-only, and retained for future reporting
4. Integrated: The database contains data from most or all of an organization's
operational applications, and this data is made consistent
Kimball’s bottom-up approach
Keeping in mind the most important business aspects or departments, data marts are
created first. These provide a thin view into the organizational data, and as and when
required these can be combined into a larger data warehouse. Kimball defines the data
warehouse as “a copy of transaction data specifically structured for query and analysis”.
Kimball’s data warehousing architecture is also known as the Data Warehouse Bus (BUS).
Dimensional modeling focuses on ease of end user accessibility and provides a high level of
performance to the data warehouse.
Inmon vs. Kimball: Similar or different?
Pros and cons of both the approaches
How to decide?
As we have already seen, the approach to designing a data warehouse depends on the
business objectives of an organization, nature of business, time and cost involved, and the
level of dependencies between various functions. Inmon’s approach is suitable for stable
businesses which can afford the time taken for design and the cost involved. Also, with
every changing business condition, they do not change the design; instead, they
accommodate the changes into the existing model. However, if local optimization is good
enough and the focus is on quick wins, it is advisable to go for Kimball’s approach.
Keeping this in mind, let the Inmon vs. Kimball fight play out over a few sectors/functions.
Insurance: It is vital to get the overall picture with respect to individual clients,
groups, history of claims, mortality rate tendencies, demography, profitability of
each plan and agents, etc. All aspects are inter-related and therefore suited for
Inmon’s approach.
Marketing: This is a specialized division, which does not call for enterprise
warehouse. Only data marts are required. Hence, Kimball’s approach is suitable.
CRM in banks: The focus is on parameters such as products sold, up-sell and cross-
sell at a customer-level. It is not necessary to get an overall picture of the business.
For example, there is no need to link a customer’s details to the treasury department
dealing with forex transactions and regulations. Since the scope is limited, you can
go for Kimball’s method. However, if all the processes and divisions in the bank
are to be linked, the obvious choice is Inmon’s design over Kimball’s.
Manufacturing: Multiple functions are involved here, irrespective of the budget
involved. Thus, where there is a systemic dependency as in this case, an enterprise
model is required. Hence Inmon’s method is ideal.
While designing a data warehouse, first you have to look at your business objectives – short
term and long term. See where the functional links are and what stands alone. Analyze data
sources for quantity and quality. Finally, evaluate your resource level, time frame and
wallet. This helps you decide which method to adopt: Inmon’s, Kimball’s, or a
combination of both.
2> Data Warehouse
Data Warehouse Architecture
Data Warehouse definition by William H. Inmon:
A data warehouse is a subject-oriented, integrated, time-variant and non-volatile
collection of data in support of the management's decision-making process.
A data warehouse is a centralized repository that stores data from multiple information
sources and transforms them into a common, multidimensional data model for efficient
querying and analysis.
3> OLTP vs. OLAP
We can divide IT systems into transactional (OLTP) and analytical (OLAP). In
general we can assume that OLTP systems provide source data to data
warehouses, whereas OLAP systems help to analyze it.
- OLTP (On-line Transaction Processing) is characterized by a large number of short on-
line transactions (INSERT, UPDATE, DELETE). The main emphasis for OLTP systems is put
on very fast query processing, maintaining data integrity in multi-access environments and
an effectiveness measured by the number of transactions per second. An OLTP database
contains detailed, current data, and the schema used to store transactional databases is
the entity model (usually 3NF).
- OLAP (On-line Analytical Processing) is characterized by relatively low volume of
transactions. Queries are often very complex and involve aggregations. For OLAP systems a
response time is the effectiveness measure. OLAP applications are widely used by data
mining techniques. An OLAP database contains aggregated, historical data, stored in
multi-dimensional schemas (usually a star schema).
The following table summarizes the major differences between OLTP and OLAP system
design.
OLTP System = Online Transaction Processing (Operational System);
OLAP System = Online Analytical Processing (Data Warehouse).

Source of data:
- OLTP: Operational data; OLTPs are the original source of the data.
- OLAP: Consolidated data; OLAP data comes from the various OLTP databases.

Purpose of data:
- OLTP: To control and run fundamental business tasks.
- OLAP: To help with planning, problem solving, and decision support.

What the data reveals:
- OLTP: A snapshot of ongoing business processes.
- OLAP: Multi-dimensional views of various kinds of business activities.

Inserts and updates:
- OLTP: Short and fast inserts and updates initiated by end users.
- OLAP: Periodic long-running batch jobs refresh the data.

Queries:
- OLTP: Relatively standardized and simple queries returning relatively few records.
- OLAP: Often complex queries involving aggregations.

Processing speed:
- OLTP: Typically very fast.
- OLAP: Depends on the amount of data involved; batch data refreshes and complex queries
  may take many hours; query speed can be improved by creating indexes.

Space requirements:
- OLTP: Can be relatively small if historical data is archived.
- OLAP: Larger due to the existence of aggregation structures and history data; requires
  more indexes than OLTP.

Database design:
- OLTP: Highly normalized with many tables.
- OLAP: Typically de-normalized with fewer tables; use of star and/or snowflake schemas.

Backup and recovery:
- OLTP: Backup religiously; operational data is critical to run the business, and data
  loss is likely to entail significant monetary loss and legal liability.
- OLAP: Instead of regular backups, some environments may consider simply reloading the
  OLTP data as a recovery method.
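The contrast between the two workload styles can be illustrated with a small sketch using
Python's built-in sqlite3 module. The `orders` table, its columns and the data values are
hypothetical, chosen only to show an OLTP-style single-row transaction next to an
OLAP-style aggregation:

```python
import sqlite3

# In-memory database with a hypothetical "orders" table (names are illustrative)
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer TEXT, region TEXT, amount REAL)""")
conn.executemany(
    "INSERT INTO orders (customer, region, amount) VALUES (?, ?, ?)",
    [("Alice", "EU", 120.0), ("Bob", "US", 80.0), ("Alice", "EU", 40.0)])

# OLTP-style workload: a short transaction touching a single current row
conn.execute("UPDATE orders SET amount = 130.0 WHERE order_id = 1")

# OLAP-style workload: a complex aggregation scanning many rows for analysis
rows = conn.execute(
    "SELECT region, SUM(amount), COUNT(*) FROM orders "
    "GROUP BY region ORDER BY region").fetchall()
print(rows)  # [('EU', 170.0, 2), ('US', 80.0, 1)]
```

In a real deployment the two query styles would run against separate systems, which is
precisely why the table above contrasts their design requirements.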
4> What is Business Intelligence?
Business Intelligence (BI) - technology infrastructure for gaining maximum
information from available data for the purpose of improving business processes.
Typical BI infrastructure components are software solutions for gathering,
cleansing, integrating, analyzing and sharing data. Business Intelligence produces
analyses and provides reliable information to help make effective, high-quality
business decisions.
The most common kinds of Business Intelligence systems are:
EIS - Executive Information Systems
DSS - Decision Support Systems
MIS - Management Information Systems
GIS - Geographic Information Systems
OLAP - Online Analytical Processing and multidimensional analysis
CRM - Customer Relationship Management
Business Intelligence systems are based on Data Warehouse technology. A Data
Warehouse (DW) gathers information from a wide range of a company's operational
systems, and Business Intelligence systems are built on top of it. Data loaded into a
DW is usually well integrated and cleansed, which allows producing credible information
reflecting the so-called 'one version of the truth'.
Business Intelligence tools
The most popular BI tools on the market are:
- Siebel Business Analytics Applications
- Business Intelligence
- BusinessObjects XI
- Cognos 8 BI
- Hyperion System 9 BI+
- Analysis Services
- Dynamic Enterprise Dashboards
- Open BI Suite
- WebFOCUS Business Intelligence
- QlikTech QlikView
- Enterprise Analytics
- InfoMaker
- IOLAP
5> ETL tools
List of the most popular ETL tools:
- Power Center
- Websphere DataStage (formerly known as Ascential DataStage)
- BusinessObjects Data Integrator
- Cognos Data Manager (Formerly known as Cognos DecisionStream)
- SQL Server Integration Services
- Data Integrator (Formerly known as Sunopsis Data Conductor)
- Data Integration Studio
- Warehouse Builder
- Data Migrator
- Pentaho Data Integration
- DT/Studio
- ETL4ALL
- DB2 Warehouse Edition
- Data Integrator
- Transformation Manager
- DataFlow
- Data Integrated Suite ETL
- Talend Open Studio
- Expressor Semantic Data Integration System
- Elixir Repertoire
- CloverETL
6> ETL process
ETL (Extract, Transform and Load) is a process in data warehousing responsible for
pulling data out of the source systems and placing it into a data warehouse. ETL
involves the following tasks:
- Extracting the data from source systems (SAP, ERP, other operational systems);
data from different source systems is converted into one consolidated data warehouse
format which is ready for transformation processing.
- Transforming the data may involve the following tasks:
applying business rules (so-called derivations, e.g., calculating new measures and
dimensions),
cleaning (e.g., mapping NULL to 0 or "Male" to "M" and "Female" to "F" etc.),
filtering (e.g., selecting only certain columns to load),
splitting a column into multiple columns and vice versa,
joining together data from multiple sources (e.g., lookup, merge),
transposing rows and columns,
applying any kind of simple or complex data validation (e.g., if the first 3 columns
in a row are empty then reject the row from processing)
- Loading the data into a data warehouse or other data repositories and reporting
applications
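The transformation tasks listed above can be sketched in plain Python. This is a minimal,
hedged example: the record layout, field names (`customer`, `gender`, `quantity`,
`unit_price`) and rules are illustrative assumptions, not taken from any real source system:

```python
# Minimal ETL transformation sketch. All field names and rules below are
# hypothetical, chosen only to illustrate cleaning, derivation and validation.

def transform(record):
    """Apply cleaning, derivation and filtering rules to one extracted record."""
    qty = record.get("quantity") or 0                      # cleaning: NULL -> 0
    gender = {"Male": "M", "Female": "F"}.get(
        record.get("gender"), record.get("gender"))        # cleaning: code mapping
    revenue = qty * record.get("unit_price", 0.0)          # derivation: new measure
    # filtering: keep only the columns the warehouse needs
    return {"customer": record["customer"], "gender": gender,
            "quantity": qty, "revenue": revenue}

def validate(record):
    """Simple data validation: reject rows with no customer or negative quantity."""
    return record["customer"] is not None and record["quantity"] >= 0

extracted = [  # records as pulled from a hypothetical source system
    {"customer": "C1", "gender": "Male", "quantity": 3, "unit_price": 2.5},
    {"customer": "C2", "gender": "Female", "quantity": None, "unit_price": 4.0},
]
loaded = [t for t in map(transform, extracted) if validate(t)]
print(loaded)
```

Production ETL tools implement the same transform/validate/load pipeline, but with
metadata-driven rules, scheduling and error handling rather than hard-coded functions.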
7> Data Warehouse Schema Architecture
Data Warehouse environment usually transforms the relational data model into
some special architectures. There are many schema models designed for data
warehousing but the most commonly used are:
- Star schema
- Snowflake schema
- Fact constellation schema
The determination of which schema model should be used for a data warehouse
should be based upon the analysis of project requirements, accessible tools and
project team preferences.
Star schema
The star schema architecture is the simplest data warehouse schema. It is called a star
schema because the diagram resembles a star, with points radiating from a center. The
center of the star consists of a fact table and the points of the star are the dimension
tables. Usually the fact tables in a star schema are in third normal form (3NF), whereas
dimensional tables are de-normalized. Despite the fact that the star schema is the
simplest architecture, it is the most commonly used nowadays and is recommended by Oracle.
January 10, 2015 Data Mining: Concepts and Techniques 11
Example of Star Schema:

Sales Fact Table: time_key, item_key, branch_key, location_key, units_sold,
dollars_sold, avg_sales (measures)

Dimension tables:
- time: time_key, day, day_of_the_week, month, quarter, year
- item: item_key, item_name, brand, type, supplier_type
- branch: branch_key, branch_name, branch_type
- location: location_key, street, city, state_or_province, country
Fact Tables
A fact table typically has two types of columns: foreign keys to dimension tables and
measures, which contain numeric facts. A fact table can contain facts at detail or
aggregated level.
Dimension Tables
A dimension is a structure usually composed of one or more hierarchies that categorize
data. If a dimension has no hierarchies and levels, it is called a flat dimension or list.
The primary keys of each of the dimension tables are part of the composite primary key of
the fact table. Dimensional attributes help to describe the dimensional value. They are
normally descriptive, textual values. Dimension tables are generally smaller in size than
fact tables.
Typical fact tables store data about sales, while dimension tables store data about
geographic regions (markets, cities), clients, products, times, and channels.
The main characteristics of star schema:
- simple structure -> easy to understand schema
- small number of tables to join
- de-normalization -> data redundancy can make the tables large
- most commonly used in data warehouse implementations -> widely supported
by a large number of business intelligence tools
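A star-schema query joins the fact table to its dimension tables on their keys and
aggregates the measures. A minimal sketch with Python's sqlite3, using the table and
column names from the sales example above (the inserted data values are made up, and
the schema is reduced to two dimensions for brevity):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);
CREATE TABLE sales_fact (
    time_key INTEGER, item_key INTEGER, branch_key INTEGER,
    location_key INTEGER, units_sold INTEGER, dollars_sold REAL);
INSERT INTO item VALUES (1, 'Widget', 'Acme'), (2, 'Gadget', 'Acme');
INSERT INTO location VALUES (10, 'Paris', 'France'), (20, 'Lyon', 'France');
INSERT INTO sales_fact VALUES (100, 1, 5, 10, 3, 30.0),
                              (100, 2, 5, 10, 1, 50.0),
                              (101, 1, 5, 20, 2, 20.0);
""")

# Star join: the fact table joined to two dimension tables on their foreign
# keys, with the dollars_sold measure aggregated per dimensional attribute.
rows = conn.execute("""
    SELECT i.brand, l.country, SUM(f.dollars_sold)
    FROM sales_fact f
    JOIN item i ON f.item_key = i.item_key
    JOIN location l ON f.location_key = l.location_key
    GROUP BY i.brand, l.country
""").fetchall()
print(rows)  # [('Acme', 'France', 100.0)]
```

The de-normalized dimensions keep the join shallow (one hop from fact to each
dimension), which is the performance point the characteristics list above makes.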
Example of Snowflake Schema:

Sales Fact Table: time_key, item_key, branch_key, location_key, units_sold,
dollars_sold, avg_sales (measures)

Dimension tables:
- time: time_key, day, day_of_the_week, month, quarter, year
- item: item_key, item_name, brand, type, supplier_key
- supplier: supplier_key, supplier_type
- branch: branch_key, branch_name, branch_type
- location: location_key, street, city_key
- city: city_key, city, state_or_province, country
Example of Fact Constellation:

Sales Fact Table: time_key, item_key, branch_key, location_key, units_sold,
dollars_sold, avg_sales (measures)

Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location,
dollars_cost, units_shipped

Dimension tables:
- time: time_key, day, day_of_the_week, month, quarter, year
- item: item_key, item_name, brand, type, supplier_type
- branch: branch_key, branch_name, branch_type
- location: location_key, street, city, province_or_state, country
- shipper: shipper_key, shipper_name, location_key, shipper_type
Distributed Data Warehouse [2]
We have seen that the central data warehouse tier in Figure 1 can be represented by a relational
database schema. Therefore, in dealing with distributed data warehouses we may apply the relational
theory of distribution design [OV99], i.e. we first fragment the schema, then allocate the fragments to
the nodes of a network. This allocation may replicate fragments, i.e. store the same fragment at more
than one node. The goal of the distribution is to optimise global performance. Having this
goal in mind, fragmentation and allocation mutually depend on each other, and thus both
steps are usually combined.

Fig. Distributed Data Warehouse Architecture

However, as refreshing of the data warehouse content can be assumed to be executed in an
off-line mode and the OLAP tier requires only read-access to the warehouse, we favor an
architecture with as much replication as necessary, such that all view creation operations
for data marts can be performed locally. This
assumption is illustrated by the distributed data warehouse architecture in Figure 4. It is even possible
that the whole data warehouse is replicated at all nodes. For instance, in the sales
example we used so far, regional sales statistics may only be provided to regional branch
offices, whereas general sales statistics for the headquarters may disregard individual
shops. This implies that different data marts would
be needed for the locations participating in the distributed warehouse and OLAP system.
Or
Distributed Data Warehouse : [3]
Many organizations have physically distributed databases with extremely large amounts of data.
Traditionally the data warehouse would be seen as a centralized repository, whereby data from all
sources would be imported into that large centralized repository for analysis. Nowadays the speed and
bandwidth of wide-area computer networks enables a distributed approach, whereby parts of the data
may reside in different places, parts being cached and/or replicated for performance reasons, and the
system functions to the outside world as a single global access-transparent repository. As the amount of
data and number of sites grow, this distributed approach becomes crucial, as a single centralized data
warehouse importing data from all the sources has obvious scalability limitations.
References
[1] http://datawarehouse4u.info/
[2] Jane Zhao, “Designing Distributed Data Warehouses and OLAP Systems”, pp. 254-263
[3] Pedro Furtado , “A Survey on Parallel and Distributed Data Warehouses”, pp. 1-23