06/11/16
OLAP and Multidimensional Model
Helena Galhardas DEI/IST
References
• A. Vaisman and E. Zimányi, Data Warehouse Systems: Design and Implementation, Springer, 2014 (chapters 1 and 3)
• J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001 (chapter 4)
• A. Wichert and H. Galhardas, Suporte à Decisão (slides), MEIC/IST
Outline
• OLAP: On-line Analytical Processing
• Multidimensional model
What is OLAP?
• The term OLAP ("online analytical processing") was coined in a white paper written for Arbor Software Corp. in 1993
– Interactive process of creating, managing, analyzing and reporting on data
– Analyzing large quantities of data in real-time
OLAP vs OLTP
• Traditional database systems are designed and tuned to support day-to-day operations:
  – Ensure fast, concurrent access to data, transaction processing, and concurrency control
  – Focus on online updates and data consistency
  – Known as operational databases or Online Transaction Processing (OLTP) systems
• OLTP database data characteristics:
  – Detailed data
  – Do not include historical data
  – Highly normalized
  – Poor performance on complex queries that include joins and aggregation
• Data analysis requires a new paradigm: Online Analytical Processing (OLAP)
  – Typical OLTP query: pending orders for customer c
  – Typical OLAP query: total sales amount by product and by customer
OLTP vs. OLAP
                     OLTP                                                      OLAP
users                clerk, IT professional                                    knowledge worker
function             day-to-day operations                                     decision support
DB design            application-oriented                                      subject-oriented
data                 current, up-to-date, detailed, flat relational, isolated  historical, summarized, multidimensional, integrated, consolidated
usage                repetitive                                                ad hoc
access               read/write, index/hash on primary key                     lots of scans
unit of work         short, simple transaction                                 complex query
# records accessed   tens                                                      millions
# users              thousands                                                 hundreds
DB size              100 MB to GB                                              100 GB to TB
metric               transaction throughput                                    query throughput, response time
OLAP characteristics
• The OLTP paradigm is focused on transactions; OLAP is focused on analytical queries
  – Normalization is not good for analytical queries: reconstructing the data requires a high number of joins
• OLAP databases support a heavy query load
• OLTP indexing techniques are not efficient in OLAP: they are oriented to accessing few records
  – OLAP queries typically include aggregation
• The need for a different database model to support OLAP was clear; it led to:
  – Data warehouse: a (usually) large repository that consolidates data from different sources, is usually updated offline, follows the multidimensional data model, and is designed and optimized to efficiently support OLAP queries
In OLAP:
• Data is perceived and manipulated as if it were stored in a multidimensional array
• But ideas are explained in terms of conventional relational tables
OLAP: Data Grouping and Aggregation
• Data can be grouped and aggregated in many different ways
• The number of possible groupings quickly becomes large
  – The user has to consider all groupings
  – Analytical processing problem
Example: OLAP-style Queries for Supplier-and-Parts Database
1) Get the total shipment quantity
2) Get total shipment quantities by supplier
3) Get total shipment quantities by part
4) Get total shipment quantities by supplier and part
Supplier-Parts
• SP
S#  P#  QTY
S1  P1  300
S1  P2  200
S2  P1  300
S2  P2  400
S3  P2  200
S4  P2  200
Get the total shipment quantity
1. SELECT SUM(QTY) AS TOTQTY
   FROM SP
   GROUP BY ()
Equivalent to:
   SELECT SUM(QTY) AS TOTQTY
   FROM SP
TOTQTY
1600
Get total shipment quantities by supplier
2. SELECT S#, SUM(QTY) AS TOTQTY
   FROM SP
   GROUP BY S#
S#  TOTQTY
S1  500
S2  700
S3  200
S4  200
Get total shipment quantities by part
3. SELECT P#, SUM(QTY) AS TOTQTY
   FROM SP
   GROUP BY P#
P#  TOTQTY
P1  600
P2  1000
Get total shipment quantities by supplier and part
4. SELECT S#, P#, SUM(QTY) AS TOTQTY
   FROM SP
   GROUP BY S#, P#
S#  P#  TOTQTY
S1  P1  300
S1  P2  200
S2  P1  300
S2  P2  400
S3  P2  200
S4  P2  200
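The four groupings above can be checked on a small in-memory database. This is only an illustrative sketch: it assumes Python's standard sqlite3 module, and it double-quotes the column names S# and P# so that they are accepted as identifiers.

```python
import sqlite3

# Shipments table SP, with the rows shown on the slides.
ROWS = [("S1", "P1", 300), ("S1", "P2", 200), ("S2", "P1", 300),
        ("S2", "P2", 400), ("S3", "P2", 200), ("S4", "P2", 200)]

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE SP ("S#" TEXT, "P#" TEXT, QTY INTEGER)')
conn.executemany("INSERT INTO SP VALUES (?, ?, ?)", ROWS)

# 1) Total shipment quantity (grouping on the empty set of columns).
total = conn.execute("SELECT SUM(QTY) FROM SP").fetchall()

# 2) Total shipment quantities by supplier.
by_supplier = conn.execute(
    'SELECT "S#", SUM(QTY) FROM SP GROUP BY "S#" ORDER BY "S#"').fetchall()

# 3) Total shipment quantities by part.
by_part = conn.execute(
    'SELECT "P#", SUM(QTY) FROM SP GROUP BY "P#" ORDER BY "P#"').fetchall()

# 4) Total shipment quantities by supplier and part.
by_both = conn.execute(
    'SELECT "S#", "P#", SUM(QTY) FROM SP GROUP BY "S#", "P#" '
    'ORDER BY "S#", "P#"').fetchall()

for name, rows in [("total", total), ("by supplier", by_supplier),
                   ("by part", by_part), ("by supplier and part", by_both)]:
    print(name, rows)
```

Each grouping needs its own query and its own pass over SP, which is the cost the single-query options are meant to avoid.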
Drawbacks
• Formulating so many similar but distinct queries is tedious
• Executing the queries is expensive
• Make life easier
  – More efficient computation
• Single query
  – GROUPING SETS, ROLLUP, and CUBE options
  – Added to the SQL:1999 standard as SQL/OLAP
GROUPING SETS
• Execute several queries simultaneously
SELECT S#, P#, SUM(QTY) AS TOTQTY
FROM SP
GROUP BY GROUPING SETS ( (S#), (P#) );
• Single results table
• Not a relation!!
• null → missing information
S#    P#    TOTQTY
S1    null  500
S2    null  700
S3    null  200
S4    null  200
null  P1    600
null  P2    1000
• GROUPING(column) returns 1 when the null was generated by the grouping, so the nulls can be relabelled:
SELECT CASE GROUPING(S#) WHEN 1 THEN '??' ELSE S# END AS S#,
       CASE GROUPING(P#) WHEN 1 THEN '!!' ELSE P# END AS P#,
       SUM(QTY) AS TOTQTY
FROM SP
GROUP BY GROUPING SETS ( (S#), (P#) );
S#  P#  TOTQTY
S1  !!  500
S2  !!  700
S3  !!  200
S4  !!  200
??  P1  600
??  P2  1000
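The single results table can be reproduced without GROUPING SETS by running one SELECT per grouping set and combining them with UNION ALL. The sketch below assumes Python's sqlite3 module (SQLite itself has no GROUPING SETS clause, so this is an emulation, not the SQL/OLAP syntax). The relabelling step simply replaces every None, which is only safe here because SP contains no nulls of its own; telling the two kinds of null apart is what GROUPING() is for.

```python
import sqlite3

ROWS = [("S1", "P1", 300), ("S1", "P2", 200), ("S2", "P1", 300),
        ("S2", "P2", 400), ("S3", "P2", 200), ("S4", "P2", 200)]

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE SP ("S#" TEXT, "P#" TEXT, QTY INTEGER)')
conn.executemany("INSERT INTO SP VALUES (?, ?, ?)", ROWS)

# One SELECT per grouping set; the column that is not grouped on is NULL.
rows = conn.execute(
    'SELECT "S#", NULL, SUM(QTY) FROM SP GROUP BY "S#" '
    'UNION ALL '
    'SELECT NULL, "P#", SUM(QTY) FROM SP GROUP BY "P#"'
).fetchall()

# Supplier subtotals first, then part subtotals (SQL NULL arrives as None).
grouping_sets_result = sorted(
    rows, key=lambda r: (r[0] is None, r[0] or "", r[1] or ""))

# Relabel the generated nulls as on the slide: '??' for S#, '!!' for P#.
labelled = [(s if s is not None else "??", p if p is not None else "!!", qty)
            for s, p, qty in grouping_sets_result]

for row in labelled:
    print(row)
```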
ROLLUP operation
SELECT S#, P#, SUM(QTY) AS TOTQTY
FROM SP
GROUP BY ROLLUP (S#, P#);
Equivalent to:
GROUP BY GROUPING SETS ( (S#, P#), (S#), () )
S#    P#    TOTQTY
S1    P1    300
S1    P2    200
S2    P1    300
S2    P2    400
S3    P2    200
S4    P2    200
S1    null  500
S2    null  700
S3    null  200
S4    null  200
null  null  1600
ROLLUP definition
• The quantities have been rolled up for each supplier
• Rolled up along the supplier dimension
• GROUP BY ROLLUP (A, B, ..., Z) generates the grouping sets:
  (A, B, ..., Z)
  (A, B, ...)
  (A, B)
  (A)
  ()
• GROUP BY ROLLUP (A, B) is not symmetric in A and B!
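The definition above amounts to taking every prefix of the column list, from the full list down to the empty one. A minimal sketch in Python (the helper name rollup_grouping_sets is made up for this illustration):

```python
def rollup_grouping_sets(columns):
    """Grouping sets produced by GROUP BY ROLLUP(columns): every prefix
    of the column list, longest first, ending with the empty set ()."""
    return [tuple(columns[:k]) for k in range(len(columns), -1, -1)]


supplier_first = rollup_grouping_sets(["S#", "P#"])
part_first = rollup_grouping_sets(["P#", "S#"])

print(supplier_first)  # subtotals per supplier, then the grand total
print(part_first)      # subtotals per part, then the grand total
```

Swapping the two columns changes which subtotals are produced, which is the asymmetry noted on the slide. A list of n columns yields n + 1 grouping sets.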
CUBE operation
SELECT S#, P#, SUM(QTY) AS TOTQTY
FROM SP
GROUP BY CUBE (S#, P#);
Equivalent to:
GROUP BY GROUPING SETS ( (S#, P#), (S#), (P#), () )
S#    P#    TOTQTY
S1    P1    300
S1    P2    200
S2    P1    300
S2    P2    400
S3    P2    200
S4    P2    200
S1    null  500
S2    null  700
S3    null  200
S4    null  200
null  P1    600
null  P2    1000
null  null  1600
CUBE
• Confusing term CUBE (?)
  – Derived from the fact that, in multidimensional terminology, data values are stored in the cells of a multidimensional array, or hypercube
• The actual physical storage may differ
• In our example:
  – The cube has just two dimensions (supplier, part)
  – The two dimensions are of unequal size (the "cube" is not even a square)
• CUBE (A, B, ..., Z) means group by all possible subsets of the set {A, B, ..., Z}
CUBE
• Means group by all possible subsets of the set {A, B, ..., Z}
  – M = {A, B, ..., Z}, |M| = n
  – Power set (algebra): P(M) := {N | N ⊆ M}, |P(M)| = 2^n (proof by induction)
• Each subset represents a different degree of summarization
• In data mining, such a subset is called a cuboid
Cube: A Lattice of Cuboids
[Figure: lattice of cuboids for the dimensions time, item, location, supplier]
• 0-D (apex) cuboid: all
• 1-D cuboids: time; item; location; supplier
• 2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
• 3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
• 4-D (base) cuboid: time,item,location,supplier
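The lattice is the power set of the dimensions, organized by the number of dimensions kept. A short Python sketch (the helper name cuboid_lattice is an assumption of this example) enumerates it with itertools.combinations:

```python
from itertools import combinations


def cuboid_lattice(dimensions):
    """Map k to the list of k-D cuboids, i.e. the k-element subsets
    of the dimension list. Level 0 holds only the apex cuboid ()."""
    return {k: list(combinations(dimensions, k))
            for k in range(len(dimensions) + 1)}


lattice = cuboid_lattice(["time", "item", "location", "supplier"])
sizes = [len(lattice[k]) for k in sorted(lattice)]

for k in sorted(lattice):
    print(f"{k}-D cuboids:", lattice[k])
print("cuboids per level:", sizes, "total:", sum(sizes))
```

For four dimensions the levels hold 1, 4, 6, 4, and 1 cuboids, 2^4 = 16 in total.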
Another Example
• Consider a cube Sales, with dimensions Product and Customer, and a measure SalesAmount
• The data cube contains all possible (2^2 = 4) aggregations of the cube cells, namely:
  – SalesAmount by Product
  – SalesAmount by Customer
  – SalesAmount by both Product and Customer
  – plus the base nonaggregated data
Data cube with two dimensions
                 c1   c2   c3   TotalByProduct
p1               100  105  100  305
p2               70   60   40   170
p3               30   40   50   120
TotalByCustomer  200  205  190  595
Relational fact table representing the same data
ProductKey CustomerKey SalesAmount
p1 c1 100
p1 c2 105
p1 c3 100
p2 c1 70
p2 c2 60
p2 c3 40
p3 c1 30
p3 c2 40
p3 c3 50
The Data Cube in the Relational Model
• Consider the Sales fact table
• To compute all possible aggregations along Product and Customer, we must scan the whole relation
• Computed in SQL using the NULL value:
SELECT ProductKey, CustomerKey, SUM(SalesAmount)
FROM Sales
GROUP BY ProductKey, CustomerKey
UNION
SELECT ProductKey, NULL, SUM(SalesAmount)
FROM Sales
GROUP BY ProductKey
UNION
SELECT NULL, CustomerKey, SUM(SalesAmount)
FROM Sales
GROUP BY CustomerKey
UNION
SELECT NULL, NULL, SUM(SalesAmount)
FROM Sales
ProductKey CustomerKey SalesAmount
p1 c1 100
p2 c1 70
p3 c1 30
NULL c1 200
p1 c2 105
p2 c2 60
p3 c2 40
NULL c2 205
p1 c3 100
p2 c3 40
p3 c3 50
NULL c3 190
p1 NULL 305
p2 NULL 170
p3 NULL 120
NULL NULL 595
Data cube
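The UNION formulation can be run as is on any relational engine. A small sketch, assuming Python's sqlite3 module and the nine-row Sales fact table from the slides:

```python
import sqlite3

SALES = [("p1", "c1", 100), ("p1", "c2", 105), ("p1", "c3", 100),
         ("p2", "c1", 70), ("p2", "c2", 60), ("p2", "c3", 40),
         ("p3", "c1", 30), ("p3", "c2", 40), ("p3", "c3", 50)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (ProductKey TEXT, CustomerKey TEXT, "
             "SalesAmount INTEGER)")
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?)", SALES)

# One SELECT per aggregation level; NULL marks the aggregated-away column.
cube = conn.execute("""
    SELECT ProductKey, CustomerKey, SUM(SalesAmount)
    FROM Sales GROUP BY ProductKey, CustomerKey
    UNION
    SELECT ProductKey, NULL, SUM(SalesAmount)
    FROM Sales GROUP BY ProductKey
    UNION
    SELECT NULL, CustomerKey, SUM(SalesAmount)
    FROM Sales GROUP BY CustomerKey
    UNION
    SELECT NULL, NULL, SUM(SalesAmount)
    FROM Sales
""").fetchall()

# Index the cube cells by (product, customer); None stands for NULL.
cube_totals = {(p, c): amount for p, c, amount in cube}
print(len(cube), "cube cells; grand total", cube_totals[(None, None)])
```

The result has 9 + 3 + 3 + 1 = 16 rows, matching the data cube table above.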
SQL/OLAP Operations
• Computing a cube with n dimensions requires 2^n GROUP BY queries
• SQL/OLAP extends the GROUP BY clause with the ROLLUP and CUBE operators (shorthands for a more powerful operator, GROUPING SETS)
  – ROLLUP computes group subtotals in the order given by a list of attributes
  – CUBE computes all totals of such a list
• Equivalent queries:
SELECT ProductKey, CustomerKey, SUM(SalesAmount)
FROM Sales
GROUP BY ROLLUP(ProductKey, CustomerKey)

SELECT ProductKey, CustomerKey, SUM(SalesAmount)
FROM Sales
GROUP BY GROUPING SETS((ProductKey, CustomerKey), (ProductKey), ())
SQL/OLAP Operations (cont.)
• Equivalent queries:
SELECT ProductKey, CustomerKey, SUM(SalesAmount)
FROM Sales
GROUP BY CUBE(ProductKey, CustomerKey)

SELECT ProductKey, CustomerKey, SUM(SalesAmount)
FROM Sales
GROUP BY GROUPING SETS((ProductKey, CustomerKey), (ProductKey), (CustomerKey), ())
SQL/OLAP Operations
Result of GROUP BY ROLLUP
ProductKey CustomerKey SalesAmount
p1 c1 100
p1 c2 105
p1 c3 100
p1 NULL 305
p2 c1 70
p2 c2 60
p2 c3 40
p2 NULL 170
p3 c1 30
p3 c2 40
p3 c3 50
p3 NULL 120
NULL NULL 595
Result of GROUP BY CUBE
ProductKey CustomerKey SalesAmount
p1 c1 100
p2 c1 70
p3 c1 30
NULL c1 200
p1 c2 105
p2 c2 60
p3 c2 40
NULL c2 205
p1 c3 100
p2 c3 40
p3 c3 50
NULL c3 190
NULL NULL 595
p1 NULL 305
p2 NULL 170
p3 NULL 120
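The difference between the two results can be made explicit by computing both from their grouping sets. The following is a plain-Python emulation, not SQL/OLAP itself; the helper names rollup_sets, cube_sets, and aggregate are invented for this sketch.

```python
from collections import defaultdict
from itertools import combinations

KEYS = ("ProductKey", "CustomerKey")
SALES = [
    {"ProductKey": p, "CustomerKey": c, "SalesAmount": amount}
    for p, c, amount in [
        ("p1", "c1", 100), ("p1", "c2", 105), ("p1", "c3", 100),
        ("p2", "c1", 70), ("p2", "c2", 60), ("p2", "c3", 40),
        ("p3", "c1", 30), ("p3", "c2", 40), ("p3", "c3", 50)]
]


def rollup_sets(keys):
    """ROLLUP: every prefix of the key list, longest first."""
    return [tuple(keys[:k]) for k in range(len(keys), -1, -1)]


def cube_sets(keys):
    """CUBE: every subset of the key list (the power set)."""
    return [s for k in range(len(keys), -1, -1) for s in combinations(keys, k)]


def aggregate(rows, keys, grouping_sets, measure):
    """Sum `measure` once per grouping set; keys outside the grouping
    set are reported as None, mirroring SQL NULL."""
    result = []
    for grouping_set in grouping_sets:
        totals = defaultdict(int)
        for row in rows:
            group = tuple(row[k] if k in grouping_set else None for k in keys)
            totals[group] += row[measure]
        result.extend(group + (total,) for group, total in totals.items())
    return result


rollup_rows = aggregate(SALES, KEYS, rollup_sets(KEYS), "SalesAmount")
cube_rows = aggregate(SALES, KEYS, cube_sets(KEYS), "SalesAmount")
only_in_cube = set(cube_rows) - set(rollup_rows)
print(len(rollup_rows), len(cube_rows), sorted(only_in_cube, key=str))
```

ROLLUP yields 13 rows and CUBE 16; the three extra rows are the per-customer subtotals, which ROLLUP(ProductKey, CustomerKey) never produces.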
SQL/OLAP Operations: Window Partitioning
• Allows comparing detailed data with aggregate values
• Example: relevance of each customer with respect to the sales of the product
SELECT ProductKey, CustomerKey, SalesAmount,
       MAX(SalesAmount) OVER (PARTITION BY ProductKey) AS MaxAmount
FROM Sales
• The first three columns are obtained from the Sales table
• The fourth column:
  – For each tuple, defines a window, called a partition, that contains all tuples of the same product
  – SalesAmount is aggregated over this window using the MAX function
Example: result
ProductKey CustomerKey SalesAmount MaxAmount
p1 c1 100 105
p1 c2 105 105
p1 c3 100 105
p2 c1 70 70
p2 c2 60 70
p2 c3 40 70
p3 c1 30 50
p3 c2 40 50
p3 c3 50 50
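The window partitioning query runs unchanged on engines with window function support. The sketch below assumes Python's sqlite3 module linked against SQLite 3.25 or later; an ORDER BY is added only to make the output order predictable.

```python
import sqlite3

SALES = [("p1", "c1", 100), ("p1", "c2", 105), ("p1", "c3", 100),
         ("p2", "c1", 70), ("p2", "c2", 60), ("p2", "c3", 40),
         ("p3", "c1", 30), ("p3", "c2", 40), ("p3", "c3", 50)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (ProductKey TEXT, CustomerKey TEXT, "
             "SalesAmount INTEGER)")
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?)", SALES)

# Every detail row is kept; MaxAmount repeats the maximum of its partition.
partitioned = conn.execute("""
    SELECT ProductKey, CustomerKey, SalesAmount,
           MAX(SalesAmount) OVER (PARTITION BY ProductKey) AS MaxAmount
    FROM Sales
    ORDER BY ProductKey, CustomerKey
""").fetchall()

for row in partitioned:
    print(row)
```

Unlike GROUP BY, the window aggregate does not collapse the rows: all nine tuples remain, each annotated with the maximum of its product.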
SQL/OLAP Operations: Window Ordering
• Used to order the rows within a partition
• Useful to compute rankings, with the window functions ROW_NUMBER and RANK
• Example: how does each product rank in the sales of each customer?
SELECT ProductKey, CustomerKey, SalesAmount,
       ROW_NUMBER() OVER (PARTITION BY CustomerKey ORDER BY SalesAmount DESC) AS RowNo
FROM Sales
  – The first tuple, for example, is evaluated by opening a window with all tuples of customer c1, ordered by sales amount
  – Product p1 is the one most demanded by customer c1
Example: result
ProductKey CustomerKey SalesAmount RowNo
p1 c1 100 1
p2 c1 70 2
p3 c1 30 3
p1 c2 105 1
p2 c2 60 2
p3 c2 40 3
p1 c3 100 1
p3 c3 50 2
p2 c3 40 3
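The ranking query can be tried with the same Sales data. This sketch assumes Python's sqlite3 module with window function support (SQLite 3.25 or later); the RANK column and its alias Rnk are additions for comparison and do not appear on the slide.

```python
import sqlite3

SALES = [("p1", "c1", 100), ("p1", "c2", 105), ("p1", "c3", 100),
         ("p2", "c1", 70), ("p2", "c2", 60), ("p2", "c3", 40),
         ("p3", "c1", 30), ("p3", "c2", 40), ("p3", "c3", 50)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (ProductKey TEXT, CustomerKey TEXT, "
             "SalesAmount INTEGER)")
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?)", SALES)

# Rows are numbered within each customer, highest sales amount first.
ranked = conn.execute("""
    SELECT ProductKey, CustomerKey, SalesAmount,
           ROW_NUMBER() OVER (PARTITION BY CustomerKey
                              ORDER BY SalesAmount DESC) AS RowNo,
           RANK() OVER (PARTITION BY CustomerKey
                        ORDER BY SalesAmount DESC) AS Rnk
    FROM Sales
    ORDER BY CustomerKey, RowNo
""").fetchall()

c3_rows = [row for row in ranked if row[1] == "c3"]
for row in ranked:
    print(row)
```

ROW_NUMBER and RANK agree here because no customer has two products with the same sales amount; with ties, RANK would repeat a value and then skip, while ROW_NUMBER would still number the rows consecutively.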
SQL/OLAP Operations: Window Framing
• Defines the size of the partition
• Used to compute statistical functions over time series, like the moving average
• Example: add two columns, Year and Month, to the Sales table and compute the three-month moving average of sales by product
SELECT ProductKey, Year, Month, SalesAmount,
       AVG(SalesAmount) OVER (PARTITION BY ProductKey ORDER BY Year, Month ROWS 2 PRECEDING) AS MovAvg
FROM Sales
• For each tuple, opens a window with the tuples pertaining to the current product
• Then, orders the window by year and month and computes the average over the current tuple and the previous two, if they exist
Example: result
ProductKey Year Month SalesAmount MovAvg
p1 2011 10 100 100
p1 2011 11 105 102.5
p1 2011 12 100 101.67
p2 2011 12 60 60
p2 2012 1 40 50
p2 2012 2 70 56.67
p3 2012 1 30 30
p3 2012 2 50 40
p3 2012 3 40 40
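The moving average can be reproduced with the nine rows of the result table. The sketch assumes Python's sqlite3 module with window function support, and spells the frame out as ROWS BETWEEN 2 PRECEDING AND CURRENT ROW, the long form of ROWS 2 PRECEDING.

```python
import sqlite3

# (ProductKey, Year, Month, SalesAmount), taken from the result table.
SALES = [("p1", 2011, 10, 100), ("p1", 2011, 11, 105), ("p1", 2011, 12, 100),
         ("p2", 2011, 12, 60), ("p2", 2012, 1, 40), ("p2", 2012, 2, 70),
         ("p3", 2012, 1, 30), ("p3", 2012, 2, 50), ("p3", 2012, 3, 40)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (ProductKey TEXT, Year INTEGER, "
             "Month INTEGER, SalesAmount INTEGER)")
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?, ?)", SALES)

# The frame covers the current row and at most the two rows before it.
rows = conn.execute("""
    SELECT ProductKey, Year, Month, SalesAmount,
           AVG(SalesAmount) OVER (PARTITION BY ProductKey
                                  ORDER BY Year, Month
                                  ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
               AS MovAvg
    FROM Sales
    ORDER BY ProductKey, Year, Month
""").fetchall()

moving_averages = [round(row[4], 2) for row in rows]
print(moving_averages)
```

The first row of each product averages one value and the second averages two, because the frame is cut off at the start of the partition.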
SQL/OLAP Operations: Window Framing (another example)
• Example: year-to-date sum of sales by product
SELECT ProductKey, Year, Month, SalesAmount,
       SUM(SalesAmount) OVER (PARTITION BY ProductKey, Year ORDER BY Month ROWS UNBOUNDED PRECEDING) AS YTD
FROM Sales
• For each tuple, opens a window with the tuples of the current product and year, ordered by month
• SUM is applied to all the tuples up to and including the current tuple (ROWS UNBOUNDED PRECEDING)
Example: result
ProductKey Year Month SalesAmount YTD
p1 2011 10 100 100
p1 2011 11 105 205
p1 2011 12 100 305
p2 2011 12 60 60
p2 2012 1 40 40
p2 2012 2 70 110
p3 2012 1 30 30
p3 2012 2 50 80
p3 2012 3 40 120
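The year-to-date column is a running SUM that restarts for every product and year. A sketch under the same assumptions as before (Python's sqlite3 module with window function support, frame written in its long form):

```python
import sqlite3

# (ProductKey, Year, Month, SalesAmount), taken from the result table.
SALES = [("p1", 2011, 10, 100), ("p1", 2011, 11, 105), ("p1", 2011, 12, 100),
         ("p2", 2011, 12, 60), ("p2", 2012, 1, 40), ("p2", 2012, 2, 70),
         ("p3", 2012, 1, 30), ("p3", 2012, 2, 50), ("p3", 2012, 3, 40)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (ProductKey TEXT, Year INTEGER, "
             "Month INTEGER, SalesAmount INTEGER)")
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?, ?)", SALES)

# Partitioning by product AND year makes the running sum restart each year.
rows = conn.execute("""
    SELECT ProductKey, Year, Month, SalesAmount,
           SUM(SalesAmount) OVER (PARTITION BY ProductKey, Year
                                  ORDER BY Month
                                  ROWS BETWEEN UNBOUNDED PRECEDING
                                           AND CURRENT ROW) AS YTD
    FROM Sales
    ORDER BY ProductKey, Year, Month
""").fetchall()

year_to_date = [row[4] for row in rows]
print(year_to_date)
```

For p2 the sum restarts at 40 in January 2012, since the December 2011 row belongs to a different partition.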
Outline
• OLAP: On-line Analytical Processing
➢ Multidimensional model
Multidimensional model
• Views data in an n-dimensional space: the data cube
  – Composed of dimensions and facts
• Dimensions: perspectives used to analyze the data
  – Example: a 3-dimensional cube for sales data with dimensions Product, Time, and Customer, and a measure Quantity
• Attributes describe dimensions
  – The Product dimension may have attributes ProductNumber and UnitPrice (not shown)
• Cells, or facts, have associated numeric values called measures
  – Each cell of the data cube represents the Quantity of units sold by category, quarter, and customer's city
[Figure: three-dimensional data cube with dimensions Product (Category), Time (Quarter), and Customer (City); the cells hold the measure values. Members shown: Beverages, Condiments, Seafood, Produce; Q1 to Q4; Paris, Lyon, Köln, Berlin]
Star and snowflakes schemas
• At the logical level, the multidimensional model is usually represented by relational tables organized in:
  – Star schemas: use a unique table for each dimension, even in the presence of hierarchies (yields denormalized dimension tables)
  – Snowflake schemas: use normalized tables for dimensions and their hierarchies
  – Fact constellations: multiple fact tables share dimension tables, viewed as a collection of stars
• Over this relational representation of a data warehouse, an OLAP server builds a data cube, which provides a multidimensional view of the data
Example of Star Schema
• Sales fact table
  – Keys: time_key, item_key, branch_key, location_key
  – Measures: units_sold, dollars_sold, avg_sales
• Dimension tables
  – time (time_key, day, day_of_the_week, month, quarter, year)
  – item (item_key, item_name, brand, type, supplier_type)
  – branch (branch_key, branch_name, branch_type)
  – location (location_key, street, city, state_or_province, country)
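The star schema above maps directly onto relational tables: one fact table whose foreign keys point to four denormalized dimension tables. The sketch below uses Python's sqlite3 module; all rows are made-up sample data, the column types are assumptions, and the avg_sales measure is left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time (time_key INTEGER PRIMARY KEY, day INTEGER,
                   day_of_the_week TEXT, month INTEGER, quarter TEXT,
                   year INTEGER);
CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
                   type TEXT, supplier_type TEXT);
CREATE TABLE branch (branch_key INTEGER PRIMARY KEY, branch_name TEXT,
                     branch_type TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, street TEXT,
                       city TEXT, state_or_province TEXT, country TEXT);
-- Fact table: one foreign key per dimension plus the measures.
CREATE TABLE sales (
    time_key INTEGER REFERENCES time(time_key),
    item_key INTEGER REFERENCES item(item_key),
    branch_key INTEGER REFERENCES branch(branch_key),
    location_key INTEGER REFERENCES location(location_key),
    units_sold INTEGER,
    dollars_sold REAL,
    PRIMARY KEY (time_key, item_key, branch_key, location_key)
);
-- Hypothetical sample rows, for illustration only.
INSERT INTO time VALUES (1, 15, 'Friday', 1, 'Q1', 2016),
                        (2, 20, 'Wednesday', 4, 'Q2', 2016);
INSERT INTO item VALUES (1, 'TV', 'Acme', 'electronics', 'wholesale'),
                        (2, 'Tea', 'Leaf', 'food', 'retail');
INSERT INTO branch VALUES (1, 'Main', 'store'), (2, 'Outlet', 'outlet');
INSERT INTO location VALUES (1, '1 Main St', 'Lisbon', 'Lisboa', 'Portugal');
INSERT INTO sales VALUES (1, 1, 1, 1, 2, 800.0), (1, 1, 2, 1, 1, 400.0),
                         (1, 2, 1, 1, 10, 50.0), (2, 1, 1, 1, 1, 400.0),
                         (2, 2, 1, 1, 4, 20.0);
""")

# A typical star join: the fact table joined to two of its dimensions.
by_quarter_and_type = conn.execute("""
    SELECT t.quarter, i.type, SUM(s.dollars_sold)
    FROM sales AS s
    JOIN time AS t ON t.time_key = s.time_key
    JOIN item AS i ON i.item_key = s.item_key
    GROUP BY t.quarter, i.type
    ORDER BY t.quarter, i.type
""").fetchall()
print(by_quarter_and_type)
```

Because quarter and type are stored directly in the dimension tables, aggregating at those levels needs only one join per dimension; a snowflake schema would add a join for each normalized level.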
Example of Snowflake Schema
• Sales fact table
  – Keys: time_key, item_key, branch_key, location_key
  – Measures: units_sold, dollars_sold, avg_sales
• Dimension tables
  – time (time_key, day, day_of_the_week, month, quarter, year)
  – item (item_key, item_name, brand, type, supplier_key)
  – supplier (supplier_key, supplier_type)
  – branch (branch_key, branch_name, branch_type)
  – location (location_key, street, city_key)
  – city (city_key, city, state_or_province, country)
Example of Fact Constellation
• Sales fact table
  – Keys: time_key, item_key, branch_key, location_key
  – Measures: units_sold, dollars_sold, avg_sales
• Shipping fact table
  – Keys: time_key, item_key, shipper_key, from_location, to_location
  – Measures: dollars_cost, units_shipped
• Dimension tables
  – time (time_key, day, day_of_the_week, month, quarter, year)
  – item (item_key, item_name, brand, type, supplier_type)
  – branch (branch_key, branch_name, branch_type)
  – location (location_key, street, city, province_or_state, country)
  – shipper (shipper_key, shipper_name, location_key, shipper_type)
Characteristics of a data cube
• Data granularity: level of detail at which measures are represented for each dimension of the cube
  – Example: sales figures aggregated to granularities Category, Quarter, and City
Characteristics of a data cube
• Instances of a dimension are called members
  – Example: Seafood and Beverages are members of the Product dimension at the granularity Category
Characteristics of a data cube
• A data cube may contain several measures
  – Example: Amount, indicating the total sales amount (not shown)
Characteristics of a data cube
• A data cube may be sparse (typical case) or dense
  – Example: not all customers may have ordered products of all categories during all quarters
Hierarchies
• Enable viewing data at several granularities
  – Define a sequence of mappings relating lower-level, detailed concepts to higher-level ones
  – The lower level is called the child and the higher level is called the parent
  – The hierarchical structure of a dimension is called the dimension schema
  – A dimension instance comprises all members at all levels in a dimension
• Example: hierarchies of the Product, Time, and Customer dimensions (from the most detailed level to the most general)
  – Product: Product → Category → All
  – Time: Day → Month → Quarter → Semester → Year → All
  – Customer: Customer → City → State → Country → Continent → All
Members of hierarchy
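• Members of the hierarchy Product → Category
[Figure: tree of members at the levels All, Category, and Product]
  – All: all
    – Category: Beverages
      – Product: Chai, Chang, ...
    – Category: Seafood
      – Product: Ikura, Konbu, ...
    – Category: ...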
Classification of measures
• Each measure is associated with an aggregation function that combines several measure values into a single one
  – Aggregation of measures takes place when we change the level of detail at which data in a cube is visualized
• Measures can be classified according to the way they can be aggregated:
  – Additive: can be meaningfully summarized along all the dimensions, using addition (most common type)
  – Semiadditive: can be meaningfully summarized using addition along some dimensions (example: inventory quantities, which cannot be added along the Time dimension)
  – Nonadditive: cannot be meaningfully summarized using addition across any dimension (examples: item price, cost per unit, and exchange rate)
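The inventory example can be made concrete with a few numbers. The snapshot data and helper names below are hypothetical, chosen only to show why a semiadditive measure is summed along one dimension and averaged along Time.

```python
from collections import defaultdict

# Hypothetical month-end inventory snapshots: (product, month, units on hand).
INVENTORY = [("p1", "Jan", 10), ("p1", "Feb", 12),
             ("p2", "Jan", 5), ("p2", "Feb", 7)]


def total_stock_by_month(snapshots):
    """Adding along the Product dimension is meaningful:
    it gives the total stock on hand in each month."""
    totals = defaultdict(int)
    for _product, month, units in snapshots:
        totals[month] += units
    return dict(totals)


def average_stock_by_product(snapshots):
    """Along Time, addition would count the same stock once per snapshot,
    so the snapshots are averaged instead."""
    per_product = defaultdict(list)
    for product, _month, units in snapshots:
        per_product[product].append(units)
    return {p: sum(units) / len(units) for p, units in per_product.items()}


stock_by_month = total_stock_by_month(INVENTORY)
avg_stock = average_stock_by_product(INVENTORY)
print(stock_by_month)  # total units on hand per month
print(avg_stock)       # average units on hand per product
```

Summing p1 over the two months would give 22 units, a quantity that was never in stock at any point in time.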
Another Classification of Measures
• Another classification of measures:
  – Distributive: defined by an aggregation function that can be computed in a distributed way; count, sum, minimum, and maximum are distributive, distinct count is not (example: S = {3, 3, 4, 5, 8, 4, 7, 3, 8} partitioned into the subsets {3, 3, 4}, {5, 8, 4}, {7, 3, 8} gives a result of 8, while the answer over the original set is 5)
  – Algebraic: defined by an aggregation function that can be expressed as a scalar function of distributive ones; example: average, computed by dividing the sum by the count
  – Holistic: cannot be computed from other subaggregates (e.g., median, rank)
• Most large data cube applications require efficient computation of distributive and algebraic measures
  – It is difficult to efficiently compute holistic measures
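The three classes can be checked on the partitioned set from the slide. This Python sketch uses only the standard library; the variable names are chosen for this illustration.

```python
from statistics import median

S = [3, 3, 4, 5, 8, 4, 7, 3, 8]
PARTS = [S[0:3], S[3:6], S[6:9]]  # {3, 3, 4}, {5, 8, 4}, {7, 3, 8}

# Distributive: combining the per-partition results gives the global result.
sum_from_parts = sum(sum(part) for part in PARTS)
max_from_parts = max(max(part) for part in PARTS)

# Distinct count is not distributive: values shared by several
# partitions are counted more than once.
distinct_from_parts = sum(len(set(part)) for part in PARTS)
distinct_overall = len(set(S))

# Algebraic: average is a scalar function of two distributive aggregates.
count_from_parts = sum(len(part) for part in PARTS)
average_from_parts = sum_from_parts / count_from_parts

# Holistic: the median of the partition medians is not the overall median.
median_of_medians = median(median(part) for part in PARTS)
median_overall = median(S)

print(sum_from_parts, max_from_parts, average_from_parts)
print(distinct_from_parts, distinct_overall)
print(median_of_medians, median_overall)
```

Sum, maximum, and average are recovered exactly from the subaggregates, whereas distinct count (8 versus 5) and median (5 versus 4) are not.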
More about measures
• When defining a measure, we must determine the associated aggregation functions
  – For example, a semiadditive measure representing inventory quantities can be aggregated using average along the Time dimension, and using addition along other dimensions
• Summarizability refers to the correct aggregation of cube measures along dimension hierarchies
• Summarizability conditions:
  – Disjointness of instances: the grouping of instances in a level with respect to their parent in the next level must result in disjoint subsets
  – Completeness: all instances are included in the hierarchy and each instance is related to one parent in the next level
  – Correctness: refers to the correct use of the aggregation functions
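The first two conditions can be checked mechanically on the child-to-parent links of a hierarchy level. The function below is a minimal sketch with invented names (check_rollup, and the sample links reusing the Product → Category members shown earlier); it does not cover the correctness condition, which depends on the aggregation function and the type of measure.

```python
def check_rollup(members, child_parent_pairs):
    """Return (missing, shared) for one level of a hierarchy.

    missing: members with no parent (completeness violations)
    shared:  members with more than one parent (disjointness violations)
    """
    parents = {}
    for child, parent in child_parent_pairs:
        parents.setdefault(child, set()).add(parent)
    missing = [m for m in members if m not in parents]
    shared = [m for m in members if len(parents.get(m, ())) > 1]
    return missing, shared


PRODUCTS = ["Chai", "Chang", "Ikura", "Konbu"]

# Every product has exactly one category: summarizable.
good_links = [("Chai", "Beverages"), ("Chang", "Beverages"),
              ("Ikura", "Seafood"), ("Konbu", "Seafood")]

# Hypothetical broken links: Chai has two parents, Konbu has none.
bad_links = [("Chai", "Beverages"), ("Chai", "Seafood"),
             ("Chang", "Beverages"), ("Ikura", "Seafood")]

good_report = check_rollup(PRODUCTS, good_links)
bad_report = check_rollup(PRODUCTS, bad_links)
print(good_report)
print(bad_report)
```

With the broken links, sales of Chai would be counted in two categories and sales of Konbu in none, so the category totals would no longer add up to the overall total.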
Exercise
Consider the following fact table, which stores the total sales associated with each invoice and each client on a given date: SALES(invoiceID, clientID, dateID, total).
Also consider that:
• The dimension table Date includes the following hierarchy: day -> month -> year
• The dimension table Client includes the hierarchy: name -> category
a) Represent the tables, primary keys, and foreign keys of the corresponding star schema
b) Write the SQL statements that return:
  b.1) Average of sales by client
  b.2) Average of sales by month
  b.3) Average of sales by client and by month
  b.4) Average of sales
c) Use the CUBE operator to write an SQL/OLAP expression whose result contains the four results above.
Next Lecture
• OLAP Operations