Upload
suada-bow-weezy
View
33
Download
3
Tags:
Embed Size (px)
DESCRIPTION
fytf
Citation preview
DATA WAREHOUSING & MINING
Chapter 2 Online Analytical Processing
Outline
A multi-dimensional data model
Data warehouse architecture
Data warehouse implementation
Further development of data cube technology
From data warehousing to data mining
Wednesday, April 19, 2023
2
Chapter 2 Online Analytical Processing
Limitations of SQL
3
“A Freshman in Business needs a Ph.D. in SQL”
-- Ralph Kimball
Wednesday, April 19, 2023Chapter 2 Online Analytical Processing
Data Warehouse vs. Operational DBMS
OLTP (On-Line Transaction Processing) Major task of traditional relational DBMS Day-to-day operations: purchasing, inventory, banking,
manufacturing, payroll, registration, accounting, etc.
OLAP (On-Line Analytical Processing) Major task of data warehouse system Data analysis and decision making
Wednesday, April 19, 2023
4
Chapter 2 Online Analytical Processing
OLTP vs OLAP
Wednesday, April 19, 2023Chapter 2 Online Analytical Processing
5
Data WarehouseOLAP(On-line Analytical Processing)
OLTP(On-line Transaction Processing)
AnalyticsData MiningDecision Making
Short online transactions:update, insert, delete
Online-TransactionProcessing Tx.
database
current & detailed data,Versatile
aggregated & historical data, Static and Low volume
Complex Queries
OLTP vs OLAP
Distinct features (OLTP vs. OLAP): User and system orientation: customer vs. market Data contents: current, detailed vs. historical,
consolidated Database design: ER + application vs. star + subject View: current, local vs. evolutionary, integrated Access patterns: update vs. read-only but complex
queries
Wednesday, April 19, 2023
6
Chapter 2 Online Analytical Processing
Typical OLAP Queries
Write a multi-table join to compare sales for each
product line YTD this year vs. last year.
Repeat the above process to find the top 5
product contributors to margin.
Repeat the above process to find the sales of a
product line to new vs. existing customers.
Repeat the above process to find the customers
that have had negative sales growth.
Wednesday, April 19, 2023
7
Chapter 2 Online Analytical Processing
OLAP Pros
It is a powerful visualization paradigm
It provides fast, interactive response
times
It is good for analyzing time series
It can be useful to find some clusters
and outliers
Many vendors offer OLAP toolsWednesday, April 19, 2023
8
Chapter 2 Online Analytical Processing
Nature of OLAP Analysis Aggregation -- (total sales,
percent-to-total) Comparison -- Budget vs.
Expenses Ranking -- Top 10, quartile
analysis Access to detailed and
aggregate data Complex criteria
specification Visualization Wednesday, April 19, 2023
9
Chapter 2 Online Analytical Processing
Why Separate Data Warehouse? High performance for both systems
DBMS— tuned for OLTP access methods, indexing, concurrency control, recovery
Warehouse—tuned for OLAP complex OLAP queries, multidimensional view, consolidation.
Different functions and different data Missing data: Decision support requires historical data
which operational DBs do not typically maintain Data consolidation: DS requires consolidation
(aggregation, summarization) of data from heterogeneous sources
Data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled
Wednesday, April 19, 2023
10
Chapter 2 Online Analytical Processing
From Tables and Spreadsheets to Data Cubes
A data warehouse is based on multidimensional data model which views data in the form of a
data cube
A data cube allows data to be modeled and viewed in multiple dimensions (such as sales) Dimension tables, such as item (item_name, brand, type), or
time(day, week, month, quarter, year) Fact table contains measures (such as dollars_sold) and keys to
each of the related dimension tables
Definitions an n-Dimensional base cube is called a base cuboid The top most 0-D cuboid, which holds the highest-level of
summarization, is called the apex cuboid The lattice of cuboids forms a data cube
Wednesday, April 19, 2023
11
Chapter 2 Online Analytical Processing
Cube: A Lattice of Cuboids
all
time item location supplier
time,item time,location
time,supplier
item,location
item,supplier
location,supplier
time,item,location
time,item,supplier
time,location,supplier
item,location,supplier
time, item, location, supplier
0-D(apex) cuboid
1-D cuboids
2-D cuboids
3-D cuboids
4-D(base) cuboid
Wednesday, April 19, 2023
12
Chapter 2 Online Analytical Processing
Conceptual Modeling of Data Warehouses
Modeling data warehouses: dimensions & measures Star schema
A fact table in the middle connected to a set of dimension
tables
Snowflake schema A refinement of star schema where some dimensional
hierarchy is normalized into a set of smaller dimension
tables, forming a shape similar to snowflake
Fact constellations Multiple fact tables share dimension tables, viewed as a
collection of stars, therefore called galaxy schema or fact
constellation Wednesday, April 19, 2023
13
Chapter 2 Online Analytical Processing
Avg_sales
Euros_sold
Unit_sold
Location_keybranch_keybranch_namebranch_type
Example of Star Schema
time_keydayday_of_the_weekmonthquarteryear
location_keystreetcityprovince_or_streetcountry
Measures
item_keyitem_namebrandtypesupplier_type
Branch_key
Branch
Time
Item
Location
Sales Fact Table
Item_key
Time_key
Wednesday, April 19, 2023
14
Chapter 2 Online Analytical Processing
branch_keybranch_namebranch_type
Example of Snowflake Schema
time_keydayday_of_the_weekmonthquarteryear
Measures
item_keyitem_namebrandtypesupplier_key
Branch
Time
Item
location_keystreetcity_key
Location
Sales Fact Table
Avg_sales
Euros_sold
Unit_sold
Location_key
Branch_key
Item_key
Time_key
supplier_keysupplier_type
city_keycityprovince_or_streetcountry
City
Supplier
Wednesday, April 19, 2023
15
Chapter 2 Online Analytical Processing
branch_keybranch_namebranch_type
Example of Fact Constellation
time_keydayday_of_the_weekmonthquarteryear
Measures
Branch
Time
item_keyitem_namebrandtypesupplier_key
Item
location_keystreetcityProvince/streetcountry
Location
Sales Fact Table
Avg_sales
Euros_sold
Unit_sold
Location_key
Branch_key
Item_key
Time_key
shipper_keyshipper_namelocation_keyshipper_type
shipper
unit_shipped
Euros_sold
to_location
from_location
shipper_key
Item_key
Time_key
Shipping Fact Table
Wednesday, April 19, 2023
16
Chapter 2 Online Analytical Processing
DMQL: Language Primitives
Cube Definition (Fact Table) define cube <cube_name> [<dimension_list>]:
<measure_list>
Dimension Definition (Dimension Table) define dimension <dimension_name> as
(<attribute_or_subdimension_list>)
Special Case (Shared Dimension Tables) First time as “cube definition” define dimension <dimension_name> as
<dimension_name_first_time> in cube <cube_name_first_time>
Wednesday, April 19, 2023
17
Chapter 2 Online Analytical Processing
Defining a Star Schema in DMQL
define cube sales_star [time, item, branch, location]:dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)
Wednesday, April 19, 2023
18
Chapter 2 Online Analytical Processing
Defining a Snowflake Schema in DMQL
define cube sales_snowflake [time, item, branch, location]:dollars_sold = sum(sales_in_dollars),
avg_sales = avg(sales_in_dollars),
units_sold = count(*)
define dimension time as
(time_key, day, day_of_week, month, quarter, year)
define dimension item as
(item_key, item_name, brand, type, supplier(supplier_key, supplier_type))
define dimension branch as
(branch_key, branch_name, branch_type)
define dimension location as
(location_key, street, city(city_key, province_or_state, country))Wednesday, April 19, 2023
19
Chapter 2 Online Analytical Processing
Defining a Fact Constellation in DMQL
define cube sales [time, item, branch, location]:dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)define dimension location as (location_key, street, city,
province_or_state, country)define cube shipping [time, item, shipper, from_location, to_location]:
dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)
define dimension time as time in cube salesdefine dimension item as item in cube salesdefine dimension shipper as (shipper_key, shipper_name, location as
location in cube sales, shipper_type)define dimension from_location as location in cube salesdefine dimension to_location as location in cube sales
Wednesday, April 19, 2023
20
Chapter 2 Online Analytical Processing
Measures: Three Categories
Distributive if the result derived by applying the function to n aggregate
values is the same as that derived by applying the function on all the data without partitioning. E.g., count(), sum(), min(), max()
Algebraic if it can be computed by an algebraic function with M arguments
(where M is a bounded integer), each of which is obtained by applying a distributive aggregate function. E.g., avg(), min_N(), standard_deviation()
Holistic if there is no constant bound on the storage size needed to
describe a subaggregate. E.g., median(), mode(), rank()
Wednesday, April 19, 2023
21
Chapter 2 Online Analytical Processing
A Concept Hierarchy: Dimension (location)
all
North_America Europe
FranceIrelandMexicoCanada
Dublin
BlackrockBelfield
...
......
... ...
...
all
region
office
country
BelfastTorontocity
Wednesday, April 19, 2023
22
Chapter 2 Online Analytical Processing
Multidimensional Data
Sales volume as a function of product, month, and region
Pro
duct
Regio
n
Month
Dimensions: Product, Location, TimeHierarchical summarization paths
Industry Region Year
Category Country Quarter
Product City Month Week
Office Day
Wednesday, April 19, 2023
23
Chapter 2 Online Analytical Processing
MDDM““Sales by product line over the past six Sales by product line over the past six
months”months”
““Sales by store between 1990 and 1995”Sales by store between 1990 and 1995”
Prod Code Time Code Store Code Sales Qty
Store Info
Product Info
Time Info
. . .
Numerical MeasuresKey columns joining fact table
to dimension tables
Fact table for measures
Dimension tables
Wednesday, April 19, 2023
24
Chapter 2 Online Analytical Processing
Star Schema
Wednesday, April 19, 2023
25
Chapter 2 Online Analytical Processing
A Sample Data CubeTotal annual salesof TV in IrelandDate
Produ
ct
Cou
ntr
ysum
sum TV
VCRPC
1Qtr 2Qtr 3Qtr 4Qtr
Ireland
France
Germany
sum
Wednesday, April 19, 2023
26
Chapter 2 Online Analytical Processing
Cuboids Corresponding to the Cube
all
product date country
product,date product,country date, country
product, date, country
0-D(apex) cuboid
1-D cuboids
2-D cuboids
3-D(base) cuboid
Wednesday, April 19, 2023
27
Chapter 2 Online Analytical Processing
Browsing a Data Cube
Visualization OLAP capabilities Interactive
manipulation
Wednesday, April 19, 2023
28
Chapter 2 Online Analytical Processing
Typical OLAP Operations
Roll up (drill-up): summarize data by climbing up hierarchy or by dimension reduction
Drill down (roll down): reverse of roll-up from higher level summary to lower level summary or detailed
data, or introducing new dimensions
Slice and dice project and select
Pivot (rotate) reorient the cube, visualization, 3D to series of 2D planes.
Other operations drill across: involving (across) more than one fact table drill through: through the bottom level of the cube to its back-
end relational tables (using SQL)Wednesday, April 19, 2023
29
Chapter 2 Online Analytical Processing
A Star-Net Query Model
Shipping Method
AIR-EXPRESS
TRUCKORDER
Customer Orders
CONTRACTS
Customer
Product
PRODUCT GROUP
PRODUCT LINE
PRODUCT ITEM
SALES PERSON
DISTRICT
DIVISION
OrganizationPromotion
CITY
COUNTRY
REGION
Location
DAILYQTRLYANNUALYTime
Each circle is called a footprint
Wednesday, April 19, 2023
30
Chapter 2 Online Analytical Processing
OLAP Server Architectures
Relational OLAP (ROLAP) Use relational or extended-relational DBMS to store and manage
warehouse data and OLAP middle ware to support missing pieces Include optimization of DBMS backend, implementation of
aggregation navigation logic, and additional tools and services greater scalability
Multidimensional OLAP (MOLAP) Array-based multidimensional storage engine (sparse matrix
techniques) fast indexing to pre-computed summarized data
Hybrid OLAP (HOLAP) User flexibility, e.g., low level: relational, high-level: array
Specialized SQL servers specialized support for SQL queries over star/snowflake schemas
Wednesday, April 19, 2023
31
Chapter 2 Online Analytical Processing
Client/Server Architecture
Framework for the new systems to be designed, developed and implemented
Divide the OLAP system into several components that define its architecture Same Computer Distributed among several computer
Wednesday, April 19, 2023
32
Chapter 2 Online Analytical Processing
C/S
Wednesday, April 19, 2023
33
Chapter 2 Online Analytical Processing
Which Technology?
1) Performance: • How fast will the system appear to the end-user?
• MDD server vendors believe this is a key point in their favor.
2) Data volume and scalability:
• While MDD servers can handle up to 50GB of storage, RDBMS servers can handle hundreds of gigabytes and terabytes.
Wednesday, April 19, 2023
34
Chapter 2 Online Analytical Processing
What if AnalysisIF
A. You require write access B. Your data is under 50 GBC. Your timetable to implement is 60-90 daysD. Lowest level already aggregatedE. Data access on aggregated levelF. You’re developing a general-purpose application for inventory movement or assets management
THENConsider an MDD /MOLAP solution for your data mart
IF
A. Your data is over 100 GBB. You have a "read-only" requirementC. Historical data at the lowest level of granularityD. Detailed access, long-running queriesE. Data assigned to lowest level elements
THENConsider an RDBMS/ROLAP solution for your data mart.
IFA. OLAP on aggregated and detailed dataB. Different user groupsC. Ease of use and detailed data
THENConsider an HOLAP for your data mart
Wednesday, April 19, 2023
35
Chapter 2 Online Analytical Processing
Examples
• ROLAP– Telecommunication startup: call data records
(CDRs) – ECommerce Site– Credit Card Company
• MOLAP– Analysis and budgeting in a financial
department– Sales analysis
• HOLAP– Sales department of a multi-national company– Banks and Financial Service Providers
Wednesday, April 19, 2023
36
Chapter 2 Online Analytical Processing
Tools
• ROLAP:– ORACLE 8i– ORACLE Reports; ORACLE Discoverer– ORACLE Warehouse Builder– Arbors Software’s Essbase
• MOLAP:– ORACLE Express Server– ORACLE Express Clients (C/S and Web)– MicroStrategy’s DSS server– Platinum Technologies’ Plantinum InfoBeacon
• HOLAP:– ORACLE 8i– ORACLE Express Serve– ORACLE Relational Access Manager– ORACLE Express Clients (C/S and Web)
Wednesday, April 19, 2023
37
Chapter 2 Online Analytical Processing
Conclusion
• ROLAP: RDBMS -> star/snowflake schema
• MOLAP: MDD -> Cube structures
• ROLAP or MOLAP: Data models used play major role in performance differences
• MOLAP: for summarized and relatively lesser volumes of data (10-50GB)
• ROLAP: for detailed and larger volumes of data
• Both storage methods have strengths and weaknesses
• The choice is requirement specific, though currently data warehouses are predominantly built using RDBMSs/ROLAP.
Wednesday, April 19, 2023
38
Chapter 2 Online Analytical Processing