Upload
muriel-long
View
224
Download
1
Tags:
Embed Size (px)
Citation preview
UNIT – IIUNIT – IIIntroduction to Introduction to
Data WarehousingData Warehousing
By Mrs. Chetana
UNIT - II
• Data Warehouse and OLAP Technology for Data Mining Data Warehouse, Multidimensional Data Model, Data Warehouse Architecture, Data Warehouse Implementation
• Data Cube Computation:
Efficient Methods for simple Data Cube Computation (Full Cube, Iceberg Cube, Closed Cube and Shell Cube), Discovery Driven exploration of data cubes
By Mrs. Chetana
Why Data Warehouse?Why Data Warehouse?
“Necessity is the mother of invention”
An operational data store (Database) stores data for a specific application.
It feeds the data warehouse a stream of desired raw data.Data store is generally subject oriented, volatile, current
commonly focused on customers, products, orders, policies, claims, etc…
By Mrs. Chetana
Scenario 1Scenario 1
ABC Pvt Ltd is a company with branches at Mumbai, Hyderabad, Chennai and Bangalore.
The Sales Manager wants quarterly sales report. Each branch has a separate operational system.
By Mrs. Chetana
Scenario 1 : ABC Pvt Ltd.Scenario 1 : ABC Pvt Ltd.
Mumbai
Hyderabad
Chennai
Banglore
SalesManager
Sales per item type per branchfor first quarter.
By Mrs. Chetana
Solution 1:ABC Pvt LtdSolution 1:ABC Pvt Ltd..
Extract sales information from each database.Store the information in a common repository
at a single site.
By Mrs. Chetana
Solution 1:ABC Pvt Ltd.Solution 1:ABC Pvt Ltd.
Mumbai
Delhi
Chennai
Bangalore
DataWarehouse
SalesManager
Query &Analysis tools
Report
By Mrs. Chetana
Scenario 2Scenario 2
One Stop Shopping Super Market has hugeoperational database.
Whenever Executives wants some report the OLTP(database)system becomes slow and data entry operators have to wait for some time.
By Mrs. Chetana
Scenario 2 : One Stop ShoppingScenario 2 : One Stop Shopping
OperationalDatabase
Data Entry Operator
Data Entry Operator
ManagementWait
Report
By Mrs. Chetana
Solution 2Solution 2
Extract data needed for analysis from operational database.Store it in warehouse.Refresh warehouse at regular interval so that it contains up
to date information for analysis.Warehouse will contain data with historical perspective.
By Mrs. Chetana
Solution 2Solution 2
Operationaldatabase
DataWarehouse
Extractdata
Data EntryOperator
Data EntryOperator
Manager
Report
Transaction
By Mrs. Chetana
Scenario 3Scenario 3
Cakes & Cookies is a small, new company.
President of the company wants his company to grow. He needs information so that he can make correct decisions.
By Mrs. Chetana
Solution 3Solution 3
Improve the quality of data before loading it into the warehouse.
Perform data cleaning and transformation before loading the data.
Use query analysis tools to support adhoc queries.
By Mrs. Chetana
Solution 3Solution 3
Query and
Analysistool
President
Expansion
Improvement
sales
time
DataWarehouse
By Mrs. Chetana
Database Vs Data WarehouseDatabase Vs Data WarehouseDatabases : to record Data warehouse : to respond to analysis questions that
are critical for a business. All data warehouses are databases, not all databases
are data warehouses.
By Mrs. Chetana
OLTP System - Online Transaction Processing (Operational System)OLAP System - Online Analytical Processing (Data Warehouse)
Source of dataOLTP: Operational data; OLTPs are the original source of the data.OLAP: Consolidation data; OLAP data comes from the various OLTP Databases
Purpose of dataOLTP: To control and run fundamental business tasksOLAP: To help with planning, problem solving, and decision support
What the dataOLTP: Reveals a snapshot of ongoing business processesOLAP: Multi-dimensional views of various kinds of business activities
Inserts and UpdatesOLTP: Short and fast inserts and updates initiated by end usersOLAP: Periodic long-running batch jobs refresh the data
OLTP Vs OLAP
By Mrs. Chetana
QueriesOLTP: Relatively standardized and simple queries Returning relatively few recordsOLAP: Often complex queries involving aggregations
Processing SpeedOLTP: Typically very fastOLAP: Depends on the amount of data involved; batch data refreshes and complex queries may take many hours; query speed can be improved by creating indexes
Space RequirementsOLTP: Can be relatively small if historical data is archivedOLAP: Larger due to the existence of aggregation structures and history data; requires more indexes than OLTP
Database DesignOLTP: Highly normalized with many tablesOLAP: Typically de-normalized with fewer tables; use of star and/or snowflake schemas
Backup and RecoveryOLTP: Backup religiously; operational data is critical to run the business, data loss is likely to entail significant monetary loss and legal liabilityOLAP: Instead of regular backups, some environments may consider simply reloading the OLTP data as a recovery method
By Mrs. Chetana
Operational Database ( OLAP )An operational database is the database that is accessed and updated on a continual basis and usually handles the daily transactions for a business. Operational databases use an OLTP approach and are designed to be write-optimized.
OLTPOn-Line Transaction Processing (OLTP) is the process in which systems facilitate and manage data entry and retrieval on a frequent basis. The transaction is almost immediately processed and is the main strategy of operational databases. An example of an OLTP system would be an ATM machine. The benefits of using OLTP is that it is fast and efficient and simplifies the process of accessing data. (Business Intelligence).
By Mrs. Chetana
Operational v/s Information SystemOperational v/s Information System
Features Operational InformationCharacteristics Operational processing Informational processing
Orientation Transaction, customer-oriented
Analysis, market-oriented
User Clerk, DBA, database professional
Knowledge workers
Function Day to day operation Decision support
Data contents Current Historical
View Detailed, flat relational Summarized, multidimensional
DB design Application oriented, ER based
Subject oriented, star/snowflakes
Unit of work Short , simple transaction
Complex query
Access Read/write Mostly readBy Mrs. Chetana
Operational v/s Information SystemOperational v/s Information System
Features Operational Information
Focus Data in Information out
Number of records accessed
tens millions
Number of users thousands hundreds
DB size 100MB to GB 100 GB to TB
Priority High performance, high availability
High flexibility, end-user autonomy
Metric Transaction throughput
Query throughput
By Mrs. Chetana
What is Data Warehouse??What is Data Warehouse??
By Mrs. Chetana
A Data Warehouse is not an individual repository product.
It is a decision support systems and a knowledge-based applications architecture.
It is an environment that supports both everyday tactical decision making and long-term business strategizes.
The Data Warehouse environment positions a business to utilize an enterprise-wide data store to link information from diverse sources and make the information accessible for a variety of user purposes, most notably, Strategic Analysis.
By Mrs. Chetana
What is Data Warehouse?What is Data Warehouse?
• A data warehouse is a collection of integrated databases designed to support a DSS.
• By Inmon: A data warehouse is-subject-oriented,-integrated,-time-variant,-nonvolatile
collection of data in support of management’s decision making process.
By Mrs. Chetana
RalphRalph KimballKimball
Ralph Kimball provided a more concise definition of a data warehouse:
A data warehouse is a copy of transaction data specifically structured for query and analysis.
By Mrs. Chetana
Cont..Cont..
Where is it used?
It is used for evaluating future strategy.
It needs a successful technician:◦ Flexible.◦ Team player.◦ Good balance of business and technical
understanding.
By Mrs. Chetana
Subject-orientedSubject-orientedData are organized based on how the users refer to them.
Eg. sales, product, customer.It focuses on modeling and analysis of data for decision
makers.Excludes data not useful in decision support process.
By Mrs. Chetana
IntegrationIntegrationData Warehouse is constructed by integrating multiple
heterogeneous sources.Data Preprocessing are applied to ensure all
inconsistencies regarding naming convention and value representations are removed.
RDBMS
LegacySystem
DataWarehouse
Flat File Data ProcessingData Transformation
By Mrs. Chetana
ExamplesExamples
In terms of data.◦ encoding structures.
◦ Measurement ofattributes.
◦ physical attribute. of data
◦ naming conventions.
◦ Date type format
remarks
By Mrs. Chetana
Time-variantTime-variantA data warehouse is a time-varient database which
supports business needs of end users in comparing and analyzing the business with different time periods. This is also known as Time series analysis.
Provides information from historical perspective
e.g. past 5-10 years.
Every key structure contains either implicitly or explicitly an element of time.
By Mrs. Chetana
NonvolatileNonvolatile
Data are stored in read-only format and do not change over time.
Data warehouse requires two operations in data accessing◦ Initial loading of data◦ Access of data
load access
By Mrs. Chetana
History & EvolutionHistory & Evolution 60’s: Batch Reports
◦ Hard to find and analyze information◦ Inflexible and expensive, reprogram every new request◦ Data collection, database creation
70’s: Database Management System◦ Still inflexible, not integrated with desktop tools,
80’s: Advanced Database systems, Desktop Data Access and Analysis Tools
◦ OLTP , Query tools, Spreadsheets, GUIs◦ Easier to use, but only access operational databases
1990s—2000s: ◦ Data mining and data warehousing, multimedia databases, and Web databases◦ OLAP Engines and Tools
By Mrs. Chetana
Data Warehouse ArchitectureData Warehouse Architecture
By Mrs. Chetana
Operational source systemOperational source systemCharacteristics:Source data often in OLTP (Online Transaction
Processing) systems, also called TPS (Transaction Processing Systems)
High level of performance and availabilityBusy with the normal operations of the
organizationDifferent types of data format and data structures in
different OLTP systems AND DIFFERENT SEMANTICS
By Mrs. Chetana
Operational SystemsOperational Systems Operational Systems
◦ Run the business in real time.
◦ Optimized to handle large numbers of simple read/write transactions.
◦ Optimized for fast response to predefined transactions.
◦ Used by people who deal with customers, products.
Database systems have been used traditionally for OLTP.
◦ Online Transaction Processing
◦ Clerical data processing tasks
◦ Detailed, up to date data
◦ Structured repetitive tasks
Examples of Operational Data◦ Customer Files
◦ Account Balance, Call Record
◦ Point of Sale Data, Production Record
By Mrs. Chetana
Data staging AreaData staging AreaOften the most complex part in the architecture, and
involves...◦ Extraction (E)
◦ Transformation (T)
◦ Load (L)
◦ indexing
ETL-tools can be usedeg. Informatica – PowerCenter
IBM – Infosphere Information Server
Oracle – Data Integrator
Scripts for extraction, transformation and load are
implemented
By Mrs. Chetana
ExtractionExtraction
Reading and understanding the source data and copying the data needed for the data warehouse into staging area for further manipulation,
i.e. transformation
By Mrs. Chetana
TransformationTransformationTransforms the data in accordance with the business rules and standards that have been established
Involves:data conversion/transformation:
rules to convert to a common data format and common terms/semantics
data cleaning/cleansing:
– data scrubbing
use domain-specific knowledge to check the data
– data auditing
discover suspicious pattern, discover violation of stated rulescombining data from multiple sourcesdata aggregation
By Mrs. Chetana
Metadata Repository Metadata Repository
The metadata repository contains the schema of the warehouse
It is similar to data dictionary- the information is kept about the logical data structures, the info about files and addresses, the info about indexes, and so on.
By Mrs. Chetana
Data Presentation AreaData Presentation Area
What is OLAP?Dimensional modelingData MartsROLAP/MOLAP servers
By Mrs. Chetana
Physical storagePhysical storage::
Due to the huge amount of data, the warehouse may be distributed onto multiple servers.
In such a case the metadata is usually replicated on each server.
Most of the data warehouse solutions employ relational database management systems. Multidimensional db mgmt systems are employed in some solutions.
Data extracted from the data warehouse storage aggregated in many ways and the summary is kept in the multidimensional databases.
By Mrs. Chetana
OLAP serversOLAP servers◦ To support and operate on multi-dimensional data structures
These servers receive OLAP queries or data mining requests from front-end tools and process them by accessing data in the physical storage.
The architect of the data warehouse may decide on the number of processing units/servers depending on the typical load of the system.
By Mrs. Chetana
Front-end toolsFront-end tools
This is the interface between the analyst and the data warehouse.
Clients
◦Query and reporting tools◦Analysis tools◦Data mining tools
The analyst should be able to pose OlAP queries or data mining tasks and visualize the results /reports in various textual and graphical ways.
By Mrs. Chetana
Data warehouse Design Data warehouse Design ApproachesApproaches
Top-Down: Starts with Overall design and planning . Useful when technology is mature & well known. The business problems are clear & well understood. Advantages:
- planned & strategic nature
- Emphasizes high-level business needs
- minimizes integration problem Disadvantages:
- Requires an increase costs
- Lacks of flexibility
- Difficult to define the boundaries
- May not be suitable unless the client needs
cross-functional reporting
By Mrs. Chetana
Bottom-Up Starts with experiments & prototypes. Useful in early stage of business modeling & technology development. Starts with lower cost & helps to evaluate benefits of technology
Advantages:
- rapid implementation &
opportunistic application
- Lo cost Disadvantages:
- Problematic when integrating
Data marts
- Design may be out-of-date before delivery
- Requires reengineering for each increment
- Solutions may be rejected by the next line of
business to be involved
- Overall benefit to the business may be minimized By Mrs. Chetana
Combined ApproachCombined Approach
An organization can exploit the planned and strategic nature of the top-down approach
While retaining the rapid implementation and opportunistic application of the bottom-up approach
By Mrs. Chetana
3- tier Architecture of Data Warehouse
By Mrs. Chetana
Three Data Warehouse ModelsThree Data Warehouse ModelsEnterprise warehouse
◦ collects all of the information about subjects spanning the entire organization
Data Mart
◦ a subset of corporate-wide data that is of value to a specific groups of users. Its scope is confined to specific, selected groups, such as marketing data mart
Virtual warehouse
◦ A set of views over operational databases
◦ Only some of the possible summary views may be materialized
By Mrs. Chetana
Data Warehouse Development: A Recommended Approach
Define a high-level corporate data model
Data Mart
Data Mart
Distributed Data Marts
Multi-Tier Data Warehouse
Enterprise Data Warehouse
Model refinementModel refinement
1
2 2
3
4
By Mrs. Chetana
Incremental & Evolutionary ApproachIncremental & Evolutionary Approach
High-level Corporate Data Model: Defined in 1 or 2 months, provides a corporate-wide consistent,
integrated view of data for various usage.
Data Marts & Enterprise data warehouse:
Can be implemented parallel , on the above corporate data model.
Distributed data marts:constructed to integrate different data marts via hub server.
Multitier data warehouse:
sole custodian of all warehouse data.
By Mrs. Chetana
Data MartData Mart
It is lower-cost, scaled down version of the DW.
Data Mart offer a targeted and less costly method of gaining the advantages associated with data warehousing and can be scaled up to a full DW environment over time.
example, the marketing data mart may contain data related to items, customers, and sales. Data marts are confined to subjects.
By Mrs. Chetana
By Mrs. Chetana
Meta DataMeta DataMetadata is “data about data”
With the emergence of the data warehouse as a decision support structure, the metadata are considered as much a resource as the business data they describe.
Metadata are abstractions -- they are high level data that provide concise descriptions of lower-level data.
By Mrs. Chetana
For example, a line in a sales database may contain:
4056 KJ596 223.45
This is mostly meaningless until we consult the metadata that tells us it was store number 4056, product KJ596 and sales of $223.45
The metadata are essential ingredients in the transformation of raw data into knowledge.
They are the “keys” that allow us to handle the raw data.
By Mrs. Chetana
Data Warehouse SchemaData Warehouse SchemaModeling data warehouses: dimensions & measures
◦ Star schema: A fact table in the middle connected to a set
of dimension tables
◦ Snowflake schema: A refinement of star schema where
some dimensional hierarchy is normalized into a set of
smaller dimension tables, forming a shape similar to
snowflake
◦ Fact constellations: Multiple fact tables share dimension
tables, viewed as a collection of stars, therefore called
galaxy schema or fact constellation
By Mrs. Chetana
Dimensions & MeasuresDimensions & Measures
Dimensions: perspectives or entities with respect to which an organization wants to keep records .◦ Each dimension will may have a table associated with it called as
“dimension table”
◦ The dimension table contains facts i.e measures and key to each related dimension tables.
◦ Specified by users or experts, or automatically generated and adjusted based on data distributions.
Fact: considered as numerical measures.◦ Quantitative to analyze relationships between dimensions.
By Mrs. Chetana
time_keydayday_of_the_weekmonthquarteryear
time
location_keystreetcityprovince_or_streetcountry
location
Sales Fact Table
time_key
item_key
branch_key
location_key
units_sold
dollars_sold
Measures
item_keyitem_namebrandtypesupplier_type
item
branch_keybranch_namebranch_type
branch
Time_dimensionItem _dimension
Branch_dimension
Location_dimension
Example of Star Schema
By Mrs. Chetana
Star SchemaStar Schema
By Mrs. Chetana
Example of Star SchemaExample of Star Schema
Store Key
Product Key
Period Key
Units
Price
Store Dimension Time Dimension
Product Dimension
Sales Fact Table
Store Key
Store Name
City
State
Region
Period Key
Year
Quarter
Month
Product Key
Product Desc
By Mrs. Chetana
time_keydayday_of_the_weekmonthquarteryear
time
location_keystreetcity_key
location
Sales Fact Table
time_key
item_key
branch_key
location_key
units_sold
dollars_sold
Measures
item_keyitem_namebrandtypesupplier_key
item
branch_keybranch_namebranch_type
branch
supplier_keysupplier_type
supplier
city_keycityprovince_or_streetcountry
city
normalized
Time_dimensionItem _dimension
Branch_dimensionLocation_dimension
Example of Snowflake
By Mrs. Chetana
SnowFlake SchemaSnowFlake SchemaVariant of star schema model.A single, large and central fact table and one or more tables for
each dimension.Used when a dimensional table becomes very bigDimension tables are more normalized
i.e. split dimension table data into additional tablesAdvantages:
Redundancy is reduced and therefore, it becomes easy to maintain and save storage space.
Drawbacks:
Increases number of joins hence time consuming query processing
Complex queries results in slow report generation
By Mrs. Chetana
Example of SnowFlake SchemaExample of SnowFlake Schema
Store Key
Product Key
Period Key
Units
Price
Time Dimension
Product Dimension
Sales Fact Table
Store Key
Store Name
City Key
Period Key
Year
Quarter
Month
Product Key
Product Desc
City Key
City
State
Region
City Dimension
Store Dimension
By Mrs. Chetana
By Mrs. Chetana
time_keydayday_of_the_weekmonthquarteryear
time
location_keystreetcityprovince_or_streetcountry
location
Sales Fact Table
time_key
item_key
branch_key
location_key
units_sold
dollars_sold
Measures
item_keyitem_namebrandtypesupplier_type
item
branch_keybranch_namebranch_type
branch
Shipping Fact Table
time_key
item_key
shipper_key
from_location
to_location
dollars_cost
units_shipped
shipper_keyshipper_namelocation_keyshipper_type
shipper
Example of Fact Constellation
By Mrs. Chetana
Fact ConstellationFact Constellation
Multiple fact tables share dimension tables.This schema is viewed as collection of stars hence
called galaxy schema or fact constellation.Sophisticated application requires such schema.
By Mrs. Chetana
Example of Fact ConstellationExample of Fact Constellation
Store Key
Product Key
Period Key
Units
Price
Store Dimension
Product Dimension
SalesFact Table
Store Key
Store Name
City
State
Region
Product Key
Product Desc
Shipper Key
Store Key
Product Key
Period Key
Units
Price
ShippingFact Table
By Mrs. Chetana
Software Engineering point of viewSoftware Engineering point of view
The design and construction of a DWH may consist of the following steps:
◦Planning◦Requirement study◦Problem Analysis◦Warehouse design◦Data integration and testing◦Finally Deployment of data warehouse
By Mrs. Chetana
General Data warehouse design ProcessGeneral Data warehouse design Process
1. Choose a Business Process to modelData warehouse Model : For Complex object collections
Data Mart Model: For focused Process on one kind of business
1. Choose the grain of the business processFundamental , atomic level of data to be represented in fact table
eg. In bank Individual transactions, individual daily snapshot
1. Choose the dimensions that will apply to each fact table record
2. Choose measures that will populate each fact table record (ex: dollars_sold or units_sold)
By Mrs. Chetana
Multi-dimensional Data ModelMulti-dimensional Data ModelData warehouse and OLAP tools are multi dimensional data model.This model views data in the form of data Cube model.A data cube allows data to be modeled & viewed in multiple
dimensions. It is defined by dimensions & facts.
Dimensions: perspectives or entities with respect to which an organization wants to keep records .
◦ Each dimension will may have a table associated with it called as “dimension table”
◦ Specified by users or experts, or automatically generated and adjusted based on data distributions.
Fact: considered as numerical measures.
◦ Quantitative to analyze relationships between dimensions.
By Mrs. Chetana
From Tables and Spreadsheets to Data CubesFrom Tables and Spreadsheets to Data Cubes
A data warehouse is based on a multidimensional data model which views
data in the form of a data cube
A data cube, such as sales, allows data to be modeled and viewed in
multiple dimensions. It is defined by dimensions and facts.
◦ Dimension tables, such as item (item_name, brand, type), or time(day,
week, month, quarter, year), branch, and location.
◦ Fact table contains measures (such as dollars_sold) and keys to each of
the related dimension tables
Viewing things in 4-D becomes tricky
Data cubes are n-dimensional & do not confine data to 3-D
By Mrs. Chetana
4- D data cube representation for sales data4- D data cube representation for sales data
By Mrs. Chetana
Data Models in Data WarehousesData Models in Data Warehouses
In OLTP systems, data are stored in 2D matrixes.
Data warehouses are subject-oriented
◦ Profits, Sales …
◦ Data need to be reorganized to better reflect the subjects.
A data warehouse is based on a multidimensional data model, which views data in the form of a data cube.
A data cube allows data to be modeled and viewed in multiple dimensions.
By Mrs. Chetana
BASICSBASICS Dimensional Model is conceptually represented using Schemas that are
described by basic elements, such as facts, measures, attributes, dimensions, hierarchies and granularity.
Facts are a focus of interest about a particular subject. Facts are used to store
business information(measures such as dollars sold) on which detailed
analysis is carried out.
Dimensions determine the contextual background for the facts and provide
the context of the measures such as item (item name, brand), product, location
or time(day, week, month, quarter, year).
Measures are continuous values that describe the fact from different points of
view (e.g., each sale is measured by its revenue)
By Mrs. Chetana
Hierarchies are made up of discrete dimension attributes linked by one to one relationships and determine how facts may be aggregated and selected significantly for the decision-making process.
Granularity is related to fact table and which represents the lowest level of information that will be stored in fact table.
It is based on
(i) which dimensions will be included
(ii)the hierarchy of each dimension the information will be kept
By Mrs. Chetana
Schema DefinitionSchema Definition
Multidimensional schema is defined using Data Mining Query Language (DMQL). The two primitives, cube definition and dimension definition, can be used for defining the data warehouses and data marts.
Syntax for Cube Definition
define cube < cube_name > [ < dimension-list > }: < measure_list >
Syntax for Dimension Definition
define dimension < dimension_name > as (attribute_or_dimension_list > )
By Mrs. Chetana
Star Schema DefinitionStar Schema Definition The star schema that we have discussed can be defined using Data Mining
Query Language (DMQL) as follows:
define cube sales_star [time, item, branch, location]:
dollars sold = sum(sales in dollars), units sold = count(*)
define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier type)
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city, province or state, country)
Explanation:
define cube statement defines a data cube called sale_star which corresponds to central sales fact table.
define dimension statement used to define four dimensions.
By Mrs. Chetana
Snowflake Schema Snowflake Schema DefinitionDefinition
define cube sales_snowflake [time, item, branch, location]:
dollars sold = sum(sales in dollars), units sold = count(*)
define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier (supplier key, supplier type))
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city (city key, city, province or state, country))
Explanation:
item & location dimension tables are normalized.
supplier & city are two sales_snoflake tables created.
By Mrs. Chetana
Fact Constellation Schema DefinitionFact Constellation Schema Definitiondefine cube sales [time, item, branch, location]:
dollars sold = sum(sales in dollars), units sold = count(*)
define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier type)
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city, province or state, country)
define cube shipping [time, item, shipper, from location, to location]:
dollars cost = sum(cost in dollars), units shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper key, shipper name, location as location in cube sales, shipper type)
define dimension from location as location in cube sales
define dimension to location as location in cube sales
By Mrs. Chetana
From Tables to Data From Tables to Data CubesCubesID Product Country Date Sales
1 TV US 1Qtr 100
2 PC Canada 4Qtr 500
3 CAR US 2Qtr 30
4 PC UK 3Qtr 200
5 CAR UK 1Qtr 20
6 CAR UK 2Qtr 15
7 TV Canada 4Qtr 80
By Mrs. Chetana
Lattice of Cuboid ( Data Cube)Lattice of Cuboid ( Data Cube)A data cube is referred as cuboidCuboid can be generated for each subset of given
dimensions.In data warehousing literature, an n-D base cube is
called a base cuboid. The top most 0-D cuboid, which holds the highest-level
of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube, each shows
different level of summarization or group by.
By Mrs. Chetana
Cube: A Lattice of CuboidsCube: A Lattice of Cuboids
all
time item location supplier
time,item time,location
time,supplier
item,location
item,supplier
location,supplier
time,item,location
time,item,supplier
time,location,supplier
item,location,supplier
time, item, location, supplier
0-D cuboid
1-D cuboids
2-D cuboids
3-D cuboids
4-D cuboid
APEX ( Highest level of Summarization)
Base ( Lowest level of Summarization)
By Mrs. Chetana
A Sample Data CubeA Sample Data Cube
Date
Produ
ct
Cou
ntr
y
sum
sum TV
VCRPC
1Qtr 2Qtr 3Qtr 4Qtr
U.S.A
Canada
Mexico
sum
Total annual salesof TV in U.S.A.
By Mrs. Chetana
MeasuresMeasuresA data cube measure is a numerical function that can be
evaluated at each point in the data cube space.
1. Distributive: If it can be computed in a distributed manner using count(), sum(), min(), max() distributed aggregate functions.
1. Algebraic: if it can be computed by an algebraic functions with M arguments using avg(), min_N(), max_N(), sum()/count() algebraic aggregate functions.
By Mrs. Chetana
Holistic : if there is no constant bound on the storage size needed to describe a sub aggregate.
Uses median(), mode(), rank() holistic aggregate functions.
By Mrs. Chetana
Concept HierarchiesConcept Hierarchies
A Concept Hierarchy defines a sequence of mappings from a set of low-level concepts to high-level
Consider a concept hierarchy for the dimension “Location”
By Mrs. Chetana
By Mrs. Chetana
Concept HierarchiesConcept HierarchiesMany concept hierarchies are implicit within
the database system
By Mrs. Chetana
Concept HierarchiesConcept HierarchiesConcept hierarchies may also be defined by grouping
values for a given dimension or attribute, resulting in a set-grouping hierarchy
By Mrs. Chetana
Online Analysis Online Analysis Processing(OLAP)Processing(OLAP)
It enables analysts, managers and executives to gain insight into data through fast, consistent, interactive access to a wide variety of possible views of information that has been transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user.
Data Warehouse
Time
Product
Reg
ion
By Mrs. Chetana
OLAP Server Architectures Relational OLAP (ROLAP)
◦ Use relational DBMS to store and manage warehouse data.
◦ ROLAP tools access the data in a relational database and generate SQL queries
to calculate information at the appropriate level as required.
◦ Handles Large amount of data.
◦ Limited by SQL functionalities
By Mrs. Chetana
MOLAPMOLAP Multidimensional OLAP (MOLAP)
◦ uses array-based multidimensional storage engines provides natural indexing
◦ Fast query performance due to optimized storage and indexing
◦ Automated computation of higher level aggregates of the data
◦ Can perform complex calculations
◦ the storage utilization may be low if the dataset is sparse
◦ uses two levels of data storage representation to handle dense and sparse
datasets.
◦ Limited in the amount of data it can handle
By Mrs. Chetana
HOLAPHOLAP Hybrid OLAP (HOLAP)
◦ Hybrid OLAP is a combination of both ROLAP and MOLAP. It offers
higher scalability of ROLAP and faster computation of MOLAP.
◦ Stores large data volumes of detailed information. The aggregations are
stored separately in MOLAP store.
By Mrs. Chetana
OLAP CubeOLAP Cube
City Product Time Units Dollars
All All All 113 251.26
Mumbai All All 64 146.07
Mumbai White Bread All 38 98.49
Mumbai Wheat Bread All 13 32.24
Mumbai Wheat Bread Qtr1 3 7.44
Mumbai Wheat Bread March 3 7.44
By Mrs. Chetana
OLAP OperationsOLAP OperationsDrill down (roll down): reverse of roll-up
from higher level summary to lower level summary or detailed data, or introducing new dimensions
Time
Regio
n
Product
Category e.g Electrical Appliance
Sub Category e.g Kitchen
Product e.g Toaster
By Mrs. Chetana
drill Down on time(from quarters to month)
By Mrs. Chetana
OLAP OperationsOLAP OperationsRoll up (drill-up): summarize data
by climbing up hierarchy or by dimension reductionOwner may be interested in further aggregation along some dimension
Time
Regio
n
ProductCategory e.g Electrical Appliance
Sub Category e.g Kitchen
Product e.g Toaster
By Mrs. Chetana
roll_up on location (from cities to countries)
By Mrs. Chetana
OLAP OperationsOLAP OperationsSlice and dice:
user specify interest in some particular values of specific dimensions.project and select
Time
Regio
n
ProductProduct=Toaster
Time
Regio
n
By Mrs. Chetana
Slice for time = “Q1”Slice for time = “Q1”
By Mrs. Chetana
Dice for (location=“Toronto” or “Vancouver”)Dice for (location=“Toronto” or “Vancouver”)and (time=“Q1” or “Q2”) and (item=“Mobile” or “Modem”)and (time=“Q1” or “Q2”) and (item=“Mobile” or “Modem”)
By Mrs. Chetana
By Mrs. Chetana
OLAP OperationsOLAP Operations
Pivot (rotate):reorient the cube, visualization, 3D to series of 2D planes.
Time
Regio
n
Product
Region
Tim
e
Product
By Mrs. Chetana
By Mrs. Chetana
Other OLAP OperationsOther OLAP OperationsDrill across:
◦ Accesses more than one fact table that is linked by common dimensions.
◦ Combines cubes that share one or more dimensions.
Drill through: ◦ Drill down to the bottom level of a data cube down to its back-
end relational tables.
Cross-tab :◦ Spreadsheet style row/column aggregates.
By Mrs. Chetana
By Mrs. Chetana
Typical Data warehouse Architecture
By Mrs. Chetana
Data Warehouse Data Warehouse ImplementationImplementationData warehouses contain huge volumes of data.OLAP servers demand that queries should be answered
in seconds. So, data warehouse should need highly efficient cube
computation techniques, access methods and query processing techniques.
The core of multidimensional data analysis is the efficient computation of aggregations across many sets of dimensions.
In SQL aggregations are referred to as group-by’s.Each group-by can be represented as cuboid.Set of group-by’s forms a lattice of a cuboids defining a
data cube. By Mrs. Chetana
The compute cube Operator and the Curse of Dimensionality
The compute cube operator computes aggregates over all subsets of the dimensions specified in the operation.
Requires an excessive storage space, especially for large number of dimensions.
A data cube is a lattice of cuboids.Suppose that we create a data cube for AllElectronics sales
that contains the following: city, item year, and sales_in_dollars.
Efficient Data Cube Computation
Compute the sum of sales, grouping by city and item.Compute the sum of sales, grouping by city.Compute the sum of sales, grouping by item.
By Mrs. Chetana
Lattice of cuboid, making up 3-D data cube.
By Mrs. Chetana
Efficient Data Cube ComputationWhat is the total number of cuboids, or group-by’s, that
can be computed for this data cube?
Three attributes: city, item, year (dimensions) sales_in_dollars (measure)The total number of cuboids or group-by’s computed for
this cube are 23=8.Group-by’s:
{(city,item,year), (city, item), (city, year), (item, year), (city), (item), (year),()}. () : group-by is empty i.e. the dimensions are not grouped.
Base cuboid contains all three dimensions (most specific) Apex cuboid is empty (least specific)
By Mrs. Chetana
Cube OperationCube definition and computation in DMQL:
define cube sales_cube[ city, item, year]: sum (sales_in_dollars)
compute cube sales_cube
Transform it into a SQL-like language
(with a new operator cube by, introduced by Gray et al.’96)
SELECT item, city, year, SUM (amount)
FROM SALES
CUBE BY item, city, year(item)(city)
()
(year)
(city, item) (city, year) (item, year)
(city, item, year)
By Mrs. Chetana
Efficient Data Cube ComputationOn-line analytical processing may need to access different cuboids
for different queries.So we have to compute all or at least some of the cuboids in the
data cube in advance.Pre-computation leads to fast response time and avoids some
redundant computation.A major challenge related to pre-computation would be storage
space if all the cuboids in the data cube are computed, especially when the cube has many dimensions.
The storage requirements are even more excessive when many of the dimensions have associated concept hierarchies, each with multiple levels.
This problem is referred to as the Curse of Dimensionality.
By Mrs. Chetana
Efficient Data Cube Efficient Data Cube ComputationData cube can be viewed as a lattice of cuboids
◦ The bottom-most cuboid is the base cuboid
◦ The top-most cuboid (apex) contains only one cell
How many cuboids in an n-dimensional cube with L levels?Li is number of levels associated with dimension i.
Eg. For time dimension hierarchy is “day<month<quarter<year”
Li = 4 & All is Virtual top level i.e Total Level = (4+1) = 5
& Dimensions = 10
No of cuboids generated = 510 = 9.8x106
T=∏i=1
n
(Li+1)
By Mrs. Chetana
Partial Materialization: Selected Computation of Cuboids
There are three choices for data cube materialization given a base cuboid.
No Materialization – Do not pre-compute any of the non-base cuboids
Full Materialization – Pre-compute all of the cuboids. Need huge memory space.
Partial Materialization - Identify the subsets of cuboids or subcubes to materialize. Exploit the materialized cuboids or subcubes during query
processing. Efficiently update the materialized cuboids or subcubes during
load and refresh.
Selection of which cuboids to materialize
Based on size, sharing, access frequency, etc.
By Mrs. Chetana
Iceberg CubeIceberg Cube
Computing only the cuboid cells whose count
or other aggregates satisfying the condition like
HAVING COUNT(*) >= minsup
It is called an Iceberg-Cube because it contains only some of the cells of the full cube, like the tip of an iceberg.
Motivation Only a small portion of cube cells may be “above the water’’
in a sparse cube Only calculate “interesting” cells — data above certain
threshold Avoid explosive growth of the cube
By Mrs. Chetana
Iceberge Cube Query with three dimensions Iceberge Cube Query with three dimensions
SELECT A,B,C, Count(*),SUM(X)
FROM TableName CUBE BY A,B,C HAVING COUNT(*)>=minsup
By Mrs. Chetana
Indexing OLAP Data: Bitmap Index Index on a particular columnEach value in the column has a bit vector : bit-op is fastThe length of the bit vector : # of records in the base tableThe i-th bit is set if the i-th row of the base table has the value for the
indexed columnnot suitable for high cardinality domains It reduces join, aggregation & comparison operations
Cust Region TypeC1 Asia RetailC2 Europe DealerC3 Asia DealerC4 America RetailC5 Europe Dealer
RecID Retail Dealer1 1 02 0 13 0 14 1 05 0 1
RecID Asia Europe America1 1 0 02 0 1 03 1 0 04 0 0 15 0 1 0
Base table Index on Region Index on Type
By Mrs. Chetana
Indexing OLAP Data: Join IndicesJoin index: JI(R-id, S-id)
where R (R-id, …) S (S-id, …)Traditional indices map the values to a list of record
ids◦ It materializes relational join in JI file and
speeds up relational join
In data warehouses, join index relates the values of the dimensions of a star schema to rows in the fact table.◦ E.g. fact table: Sales and two dimensions item and
location.
◦ A join index on location maintains for each distinct location a list of R-IDs of the tuples recording the Sales in the locations.
◦ Join indices can span multiple dimensions
By Mrs. Chetana
Efficient Processing OLAP Queries
The purpose of materializing cuboids and constructing OLAP index
structures is to speed up the query processing in data cubes.
Given materialized views, query processing should proceed as follows:
Determine which operations should be performed on the available
cuboids
◦ Transform drill, roll, etc. into corresponding SQL and/or OLAP
operations, e.g. Slice & dice = selection + projection
Determine which materialized cuboid(s) the relevant operations should
be applied.
By Mrs. Chetana
Efficient Processing OLAP QueriesSuppose that the query to be processed is on
{brand, province_or_state} with the selection constant “year = 2004”,
Dimension hierarchies as follows:
Time: “day<month<quarter<year” Item: “item_name<brand<type”
Location: “street<city<state<country”
and there are 4 materialized cuboids available:
1) { year, item_name, city }
2) {year, brand, country}
3) {year, brand, province_or_state}
4) {item_name, province_or_state} where year = 2004
Which of the cuboids should be selected to process the query?By Mrs. Chetana
Solution:Solution:Cuboid 2 cant be used because country is a more general
concept than state.
Cuboids 1,3,4 can be used to process query:1. they have the same set or superset of dimensions in query
2. the selection clause in the query can imply the selection in the cuboid
3. the abstraction levels for the item and location dimensions in these cuboids are at the finer level than brand and state, respectively.
By Mrs. Chetana
How would the cost of each cuboid compare if used to process the query??Cuboid 1 would cost most because both item_name and
city are at lower level than brand and state concepts specified in query
Cuboid 3 will be smaller than cuboid 4;◦ If not many year values associated with items in the cube
◦ But there are mant item_names for each brand
Cuboid 4 may be better choice; If efficient indices are available for cuboid 4.
By Mrs. Chetana
Data Warehouse UsageData Warehouse Usage Three kinds of data warehouse applications
◦ Information processing
supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs
◦ Analytical processing
Operates on historical data in both summarized and detailed forms.
multidimensional data analysis of data warehouse data
supports basic OLAP operations, slice-dice, drill-down, roll- up, and pivoting
◦ Data mining
knowledge discovery from hidden patterns
supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization toolsBy Mrs. Chetana
Data Warehouse UsageData Warehouse UsageDo OLAP systems perform data mining? Are OLAP systems actually data mining systems?
OLAP is a data summarization/ aggregation tool that helps simplify data analysis.
Data mining allows the automated discovery of implicit patterns and interesting knowledge hidden in large amounts of data.
Data mining covers much broader spectrum than simple OLAP operations because it performs not only data summary and comparison but also association, classification, prediction, clustering, time-series analysis, and other data analysis tasks.
By Mrs. Chetana
From On-Line Analytical Processing (OLAP) to On Line Analytical Mining (OLAM)
Why online analytical mining?
◦ High quality of data in data warehouses DW contains integrated, consistent, cleaned data
◦ Available information processing structure surrounding data warehouses ODBC, OLEDB, Web accessing, service facilities, reporting
and OLAP tools
◦ OLAP-based exploratory data analysis Mining with drilling, dicing, pivoting, etc.
◦ On-line selection of data mining functions Integration and swapping of multiple mining functions,
algorithms, and tasks
By Mrs. Chetana
August 1, 2015 Data Mining: Concepts and Techniques 125
An Integrated OLAM and OLAP Architecture
Data Warehouse
Meta Data
MDDB
OLAMEngine
OLAPEngine
User GUI API
Data Cube API
Database API
Data cleaning
Data integration
Layer3
OLAP/OLAM
Layer2
MDDB
Layer1
Data Repository
Layer4
User Interface
Filtering&Integration Filtering
Databases
Mining query Mining result
By Mrs. Chetana
Summary of UNIT- IISummary of UNIT- IIWhy data warehousing?
A multi-dimensional model of a data warehouse
◦ Star schema, snowflake schema, fact constellations
◦ A data cube consists of dimensions & measures
OLAP operations: drilling, rolling, slicing, dicing and pivoting
Data warehouse architecture
OLAP servers: ROLAP, MOLAP, HOLAP
Efficient computation of data cubes
◦ Partial vs. full vs. no materialization
◦ Indexing OALP data: Bitmap index and join index
◦ OLAP query processing
From OLAP to OLAM (on-line analytical mining)
By Mrs. Chetana
Building Data WarehouseBuilding Data Warehouse
Data SelectionData Preprocessing
◦Fill missing values◦Remove inconsistency
Data Transformation & IntegrationData Loading Data in warehouse is stored in form of fact
tables and dimension tables.
By Mrs. Chetana
Case StudyCase Study
Afco Foods & Beverages is a new company which produces dairy,bread and meat products with production unit located at Baroda.
There products are sold in North,North West and Western region of India.
They have sales units at Mumbai, Pune , Ahemdabad ,Delhi and Baroda.
The President of the company wants sales information.
By Mrs. Chetana
Sales InformationSales Information
January February March April
14 41 33 25
Report: The number of units sold.
113
Report: The number of units sold over time
By Mrs. Chetana
Sales InformationSales Information
Jan Feb Mar Apr
Wheat Bread 6 17
Cheese 6 16 6 8
Swiss Rolls 8 25 21
Report : The number of items sold for each product withtime
Product
Tim
eBy Mrs. Chetana
Sales InformationSales Information
Jan Feb Mar Apr
Mumbai Wheat Bread 3 10
Cheese 3 16 6
Swiss Rolls 4 16 6
Pune Wheat Bread 3 7
Cheese 3 8
Swiss Rolls 4 9 15
Report: The number of items sold in each City for each product with time
Product
Tim
e
City
By Mrs. Chetana
Sales InformationSales Information
Report: The number of items sold and income in each region for each product with time.
Jan Feb Mar Apr
Rs U Rs U Rs U Rs U
Mumbai Wheat Bread 7.44 3 24.80 10
Cheese 7.95 3 42.40 16 15.90 6
Swiss Rolls 7.32 4 29.98 16 10.98 6
Pune Wheat Bread 7.44 3 17.36 7
Cheese 7.95 3 21.20 8
Swiss Rolls 7.32 4 16.47 9 27.45 15
By Mrs. Chetana
Sales Measures & DimensionsSales Measures & DimensionsMeasure – Units sold, Amount.Dimensions – Product,Time,Region.
By Mrs. Chetana
Sales Data Warehouse ModelSales Data Warehouse Model
City Product Month Units Rupees
Mumbai Wheat Bread January 3 7.95
Mumbai Cheese January 4 7.32
Pune Wheat Bread January 3 7.95
Pune Cheese January 4 7.32
Mumbai Swiss Rolls February 16 42.40
Fact Table
By Mrs. Chetana
Sales Data Warehouse ModelSales Data Warehouse Model
City_ID Prod_ID Month Units Rupees
1 589 1/1/1998 3 7.95
1 1218 1/1/1998 4 7.32
2 589 1/1/1998 3 7.95
2 1218 1/1/1998 4 7.32
1 589 2/1/1998 16 42.40
By Mrs. Chetana
Sales Data Warehouse ModelSales Data Warehouse Model
Prod_ID Product_Name Product_Category_ID
589 Wheat Bread 1
590 White Bread 1
288 Coconut Cookies 2
Product Dimension Tables
Product_Category_Id Product_Category
1 Bread
2 Cookies
By Mrs. Chetana
Sales Data Warehouse ModelSales Data Warehouse Model
City_ID City Region Country
1 Mumbai West India
2 Pune NorthWest India
Region Dimension Table
By Mrs. Chetana
Sales Data Warehouse ModelSales Data Warehouse Model
Sales Fact
Region
ProductProduct
Category
Time
By Mrs. Chetana
PresentationPresentation
Time
Regio
n
Product
Report
ReportingTool
By Mrs. Chetana
Data Warehousing includesData Warehousing includes
Build Data WarehouseOnline analysis processing(OLAP).Presentation.
RDBMS
Flat File
Presentation
Cleaning ,Selection &Integration
Warehouse & OLAP serverClient
By Mrs. Chetana
Need for Data WarehousingNeed for Data Warehousing
Industry has huge amount of operational data
Knowledge worker wants to turn this data into useful information.
This information is used by them to support strategic decision making .
By Mrs. Chetana
Need for Data Warehousing Need for Data Warehousing (contd..)(contd..)
It is a platform for consolidated historical data for analysis.
It stores data of good quality so that knowledge worker can make correct decisions.
By Mrs. Chetana
Need for Data Warehousing Need for Data Warehousing (contd..)(contd..)
From business perspective-it is latest marketing weapon-helps to keep customers by learning more about their needs .-valuable tool in today’s competitive fast evolving world.
By Mrs. Chetana
Data Warehousing ToolsData Warehousing Tools
Data Warehouse◦SQL Server 2000 DTS◦Oracle 8i Warehouse Builder
OLAP tools◦SQL Server Analysis Services◦Oracle Express Server
Reporting tools◦MS Excel Pivot Chart◦VB Applications
By Mrs. Chetana