OLAP and Multidimensional Model - Técnico Lisboa ... · OLAP and Multidimensional Model Helena Galhardas ... Design and Implementation, Springer, 2014 (chpts. 1 ... Techniques, Morgan

06/11/16


OLAP and Multidimensional Model

Helena Galhardas DEI/IST

References

•  A. Vaisman and E. Zimányi, Data Warehouse Systems: Design and Implementation, Springer, 2014 (chpts. 1 and 3)

•  J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001 (chpt. 4)

•  A. Wichert, H. Galhardas, Suporte à Decisão (slides), MEIC/IST


Outline

•  OLAP: On-line Analytical Processing
•  Multidimensional model


What is OLAP?

•  The term OLAP ("online analytical processing") was coined in a white paper written for Arbor Software Corp. in 1993

–  Interactive process of creating, managing, analyzing and reporting on data

– Analyzing large quantities of data in real-time


OLAP vs OLTP

•  Traditional database systems are designed and tuned to support day-to-day operations:
– Ensure fast, concurrent access to data, transaction processing and concurrency control
– Focus on online updates and data consistency
– Known as operational databases or Online Transaction Processing (OLTP) systems


OLAP vs OLTP

•  OLTP DB data characteristics:
– Detailed data
– Do not include historical data
– Highly normalized
– Poor performance on complex queries, including joins and aggregation


OLAP vs OLTP

•  Data analysis requires a new paradigm: Online Analytical Processing (OLAP)
– Typical OLTP query: pending orders for customer c
– Typical OLAP query: total sales amount by product and by customer


OLTP vs. OLAP

                     OLTP                              OLAP
users                clerk, IT professional            knowledge worker
function             day-to-day operations             decision support
DB design            application-oriented              subject-oriented
data                 current, up-to-date, detailed,    historical, summarized, multidimensional,
                     flat relational, isolated         integrated, consolidated
usage                repetitive                        ad-hoc
access               read/write,                       lots of scans
                     index/hash on primary key
unit of work         short, simple transaction         complex query
# records accessed   tens                              millions
# users              thousands                         hundreds
DB size              100 MB-GB                         100 GB-TB
metric               transaction throughput            query throughput, response


OLAP characteristics
•  The OLTP paradigm is focused on transactions; OLAP is focused on analytical queries
–  Normalization is not good for analytical queries: reconstructing data requires a high number of joins
•  OLAP databases support a heavy query load
•  OLTP indexing techniques are not efficient in OLAP: they are oriented to accessing few records
–  OLAP queries typically include aggregation
•  The need for a different database model to support OLAP was clear; it led to
–  Data warehouse: a (usually) large repository that consolidates data from different sources, is updated offline, follows the multidimensional data model, and is designed and optimized to efficiently support OLAP queries


In OLAP:

•  Data is perceived and manipulated as though it were stored in a multidimensional array

•  But ideas are explained in terms of conventional relational tables


OLAP: Data Grouping and Aggregation

•  Data grouping and aggregation in many different ways

•  The number of possible groupings quickly becomes large
– The user has to consider all groupings
– Analytical processing problem

Example: OLAP-style Queries for Supplier-and-Parts Database

1)  Get the total shipment quantity
2)  Get total shipment quantities by supplier
3)  Get total shipment quantities by part
4)  Get the shipment by supplier and part


Supplier-Parts

•  SP

S#  P#  QTY
S1  P1  300
S1  P2  200
S2  P1  300
S2  P2  400
S3  P2  200
S4  P2  200

Get the total shipment quantity

1. SELECT SUM(QTY) AS TOTQTY
   FROM SP
   GROUP BY ()

Equivalent to:
   SELECT SUM(QTY) AS TOTQTY
   FROM SP

TOTQTY
1600


Get total shipment quantities by supplier

2. SELECT S#, SUM(QTY) AS TOTQTY
   FROM SP
   GROUP BY S#

S#  TOTQTY
S1  500
S2  700
S3  200
S4  200

Get total shipment quantities by part

3. SELECT P#, SUM(QTY) AS TOTQTY
   FROM SP
   GROUP BY P#

P#  TOTQTY
P1  600
P2  1000


Get the shipment by supplier and part

4. SELECT S#, P#, SUM(QTY) AS TOTQTY
   FROM SP
   GROUP BY S#, P#

S#  P#  TOTQTY
S1  P1  300
S1  P2  200
S2  P1  300
S2  P2  400
S3  P2  200
S4  P2  200
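The four groupings can be tried out with a small script. This is only a sketch: it uses Python's built-in sqlite3 module, and the column names sno and pno are assumed stand-ins for S# and P#, chosen to avoid quoting identifiers that contain '#'.

```python
import sqlite3

# Shipments table SP from the slides (sno = S#, pno = P#, an assumed renaming).
rows = [("S1", "P1", 300), ("S1", "P2", 200), ("S2", "P1", 300),
        ("S2", "P2", 400), ("S3", "P2", 200), ("S4", "P2", 200)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sp (sno TEXT, pno TEXT, qty INTEGER)")
conn.executemany("INSERT INTO sp VALUES (?, ?, ?)", rows)

# One query per grouping: (), (S#), (P#), (S#, P#).
total = conn.execute("SELECT SUM(qty) FROM sp").fetchall()
by_supplier = conn.execute(
    "SELECT sno, SUM(qty) FROM sp GROUP BY sno ORDER BY sno").fetchall()
by_part = conn.execute(
    "SELECT pno, SUM(qty) FROM sp GROUP BY pno ORDER BY pno").fetchall()
by_both = conn.execute(
    "SELECT sno, pno, SUM(qty) FROM sp "
    "GROUP BY sno, pno ORDER BY sno, pno").fetchall()

for name, result in [("total", total), ("by supplier", by_supplier),
                     ("by part", by_part), ("by supplier and part", by_both)]:
    print(name, result)
```

Each grouping needs its own statement and its own pass over the table.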

Drawbacks

•  Formulating so many similar but distinct queries is tedious
•  Executing the queries is expensive
•  Make life easier
– more efficient computation
•  Single query
– GROUPING SETS, ROLLUP, CUBE options
– Added to SQL standard 1999: SQL/OLAP


GROUPING SETS

•  Execute several queries simultaneously

SELECT S#, P#, SUM(QTY) AS TOTQTY
FROM SP
GROUP BY GROUPING SETS ( (S#), (P#) );

•  Single results table
•  Not a relation !!
•  null → missing information

S#    P#    TOTQTY
S1    null  500
S2    null  700
S3    null  200
S4    null  200
null  P1    600
null  P2    1000

•  GROUPING(col) returns 1 in rows where col has been aggregated away, so the nulls can be replaced by explicit markers:

SELECT CASE GROUPING(S#) WHEN 1 THEN '??' ELSE S# END AS S#,
       CASE GROUPING(P#) WHEN 1 THEN '!!' ELSE P# END AS P#,
       SUM(QTY) AS TOTQTY
FROM SP
GROUP BY GROUPING SETS ( (S#), (P#) );

S#  P#  TOTQTY
S1  !!  500
S2  !!  700
S3  !!  200
S4  !!  200
??  P1  600
??  P2  1000
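To make the semantics concrete, here is a minimal pure-Python emulation of GROUPING SETS. It is not how a database evaluates the clause; the helper name grouping_sets is hypothetical, and None plays the role of SQL null.

```python
from collections import defaultdict

SP = [("S1", "P1", 300), ("S1", "P2", 200), ("S2", "P1", 300),
      ("S2", "P2", 400), ("S3", "P2", 200), ("S4", "P2", 200)]
COLUMNS = ("S#", "P#")

def grouping_sets(rows, sets):
    """Compute SUM(QTY) once per grouping set and concatenate the results.

    Columns that are not part of the current grouping set are reported
    as None, mirroring the nulls in the SQL result.
    """
    result = []
    for gset in sets:
        totals = defaultdict(int)
        for s, p, qty in rows:
            record = {"S#": s, "P#": p}
            key = tuple(record[c] if c in gset else None for c in COLUMNS)
            totals[key] += qty
        # Dicts keep insertion order, so groups appear in input order.
        result.extend(key + (total,) for key, total in totals.items())
    return result

result = grouping_sets(SP, [("S#",), ("P#",)])
for row in result:
    print(row)
```

The result is the concatenation of one GROUP BY per grouping set, which is why it is a single table with nulls in the columns that were not grouped on.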


ROLLUP operation

SELECT S#, P#, SUM(QTY) AS TOTQTY
FROM SP
GROUP BY ROLLUP (S#, P#);

Equivalent to:
GROUP BY GROUPING SETS ( (S#, P#), (S#), () )

S#    P#    TOTQTY
S1    P1    300
S1    P2    200
S2    P1    300
S2    P2    400
S3    P2    200
S4    P2    200
S1    null  500
S2    null  700
S3    null  200
S4    null  200
null  null  1600

ROLLUP definition

•  The quantities have been rolled up for each supplier
•  Rolled up along the supplier dimension
•  GROUP BY ROLLUP (A, B, ..., Z) groups by each of:
   (A, B, ..., Z)
   (A, B, ...)
   (A, B)
   (A)
   ()
•  GROUP BY ROLLUP (A, B) is not symmetric in A and B!
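The grouping sets that ROLLUP expands to are the prefixes of its column list, which is easy to sketch. The function name rollup_sets is hypothetical, introduced only for illustration.

```python
def rollup_sets(columns):
    """Grouping sets produced by ROLLUP(columns): every prefix of the list,
    from the full list down to the empty set (the grand total)."""
    return [tuple(columns[:k]) for k in range(len(columns), -1, -1)]

print(rollup_sets(["S#", "P#"]))  # prefixes of (S#, P#)
print(rollup_sets(["P#", "S#"]))  # prefixes of (P#, S#): a different list
```

Swapping the two columns changes which subtotals are produced, which is the asymmetry noted above: ROLLUP (S#, P#) yields per-supplier subtotals, while ROLLUP (P#, S#) yields per-part subtotals.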


CUBE operation

SELECT S#, P#, SUM(QTY) AS TOTQTY
FROM SP
GROUP BY CUBE (S#, P#);

Equivalent to:
GROUP BY GROUPING SETS ( (S#, P#), (S#), (P#), () )

S#    P#    TOTQTY
S1    P1    300
S1    P2    200
S2    P1    300
S2    P2    400
S3    P2    200
S4    P2    200
S1    null  500
S2    null  700
S3    null  200
S4    null  200
null  P1    600
null  P2    1000
null  null  1600

CUBE

•  Confusing term CUBE (?)
–  Derived from the fact that, in multidimensional terminology, data values are stored in cells of a multidimensional array or hypercube
•  The actual physical storage may differ
–  In our example:
   •  The cube has just two dimensions (supplier, part)
   •  The two dimensions have different sizes (four suppliers, two parts), so it is a rectangle rather than a square
•  Means group by all possible subsets of the set {A, B, ..., Z}


CUBE

•  Means group by all possible subsets of the set {A, B, ..., Z}
–  M = {A, B, ..., Z}, |M| = n
–  Power set (algebra)
–  P(M) := {N | N ⊆ M}, |P(M)| = 2^n (proof by induction)
•  Each subset represents a different level of summarization
•  In data mining, such a subset is called a cuboid


Cube: A Lattice of Cuboids

[Figure: lattice of cuboids for the dimensions time, item, location, supplier]
0-D (apex) cuboid:  all
1-D cuboids:        time; item; location; supplier
2-D cuboids:        time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
3-D cuboids:        time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
4-D (base) cuboid:  time,item,location,supplier
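The lattice is the power set of the dimension set, grouped by subset size. A short sketch (the function name cuboid_lattice is hypothetical) enumerates it and checks the 2^n count:

```python
from itertools import combinations

DIMENSIONS = ("time", "item", "location", "supplier")

def cuboid_lattice(dimensions):
    """Map each level k to the list of k-D cuboids (subsets of size k)."""
    return {k: list(combinations(dimensions, k))
            for k in range(len(dimensions) + 1)}

lattice = cuboid_lattice(DIMENSIONS)
for k, cuboids in lattice.items():
    print(f"{k}-D: {len(cuboids)} cuboid(s)")

# The number of cuboids equals the size of the power set, 2^n.
total = sum(len(cuboids) for cuboids in lattice.values())
print(total == 2 ** len(DIMENSIONS))
```

Level 0 holds the single apex cuboid (the empty grouping, "all") and level n holds the base cuboid.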


Another Example

•  Consider a cube Sales, with dimensions Product and Customer, and a measure SalesAmount
•  The data cube contains all possible (2^2 = 4) aggregations of the cube cells, namely:
–  SalesAmount by Product,
–  SalesAmount by Customer,
–  SalesAmount by both Product and Customer
–  plus the base nonaggregated data


Data cube with two dimensions

                 c1   c2   c3   TotalByProduct
p1               100  105  100  305
p2               70   60   40   170
p3               30   40   50   120
TotalByCustomer  200  205  190  595

Relational fact table representing the same data

ProductKey  CustomerKey  SalesAmount
p1          c1           100
p1          c2           105
p1          c3           100
p2          c1           70
p2          c2           60
p2          c3           40
p3          c1           30
p3          c2           40
p3          c3           50


The Data Cube in the Relational Model
•  Consider the Sales fact table
•  To compute all possible aggregations along Product and Customer we must scan the whole relation
•  Computed in SQL using the NULL value:

SELECT ProductKey, CustomerKey, SUM(SalesAmount)
FROM Sales
GROUP BY ProductKey, CustomerKey
UNION
SELECT ProductKey, NULL, SUM(SalesAmount)
FROM Sales
GROUP BY ProductKey
UNION
SELECT NULL, CustomerKey, SUM(SalesAmount)
FROM Sales
GROUP BY CustomerKey
UNION
SELECT NULL, NULL, SUM(SalesAmount)
FROM Sales



Data cube

ProductKey  CustomerKey  SalesAmount
p1          c1           100
p2          c1           70
p3          c1           30
NULL        c1           200
p1          c2           105
p2          c2           60
p3          c2           40
NULL        c2           205
p1          c3           100
p2          c3           40
p3          c3           50
NULL        c3           190
p1          NULL         305
p2          NULL         170
p3          NULL         120
NULL        NULL         595
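The same 16 rows can be produced in one pass over the fact table. The sketch below is a pure-Python illustration, not the SQL evaluation strategy; the function name data_cube is hypothetical and None stands for NULL.

```python
from collections import defaultdict

SALES = [("p1", "c1", 100), ("p1", "c2", 105), ("p1", "c3", 100),
         ("p2", "c1", 70), ("p2", "c2", 60), ("p2", "c3", 40),
         ("p3", "c1", 30), ("p3", "c2", 40), ("p3", "c3", 50)]

def data_cube(rows):
    """Return {(ProductKey, CustomerKey): SUM(SalesAmount)} for all 2^2
    groupings, adding every fact to the four cells it contributes to."""
    cube = defaultdict(int)
    for product, customer, amount in rows:
        for p in (product, None):        # keep the product or roll it up
            for c in (customer, None):   # keep the customer or roll it up
                cube[(p, c)] += amount
    return dict(cube)

cube = data_cube(SALES)
print(len(cube))             # number of rows in the data cube
print(cube[("p1", None)])    # total for product p1
print(cube[(None, "c1")])    # total for customer c1
print(cube[(None, None)])    # grand total
```

Each fact contributes to 2^n cells, so this approach trades the 2^n separate scans of the UNION query for 2^n updates per row.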

SQL/OLAP Operations
•  Computing a cube with n dimensions requires 2^n GROUP BY queries
•  SQL/OLAP extends the GROUP BY clause with the ROLLUP and CUBE operators (shorthands for a more powerful operator, GROUPING SETS)
–  ROLLUP computes group subtotals in the order given by a list of attributes
–  CUBE computes all totals of such a list
•  Equivalent queries:

SELECT ProductKey, CustomerKey, SUM(SalesAmount)
FROM Sales
GROUP BY ROLLUP(ProductKey, CustomerKey)

SELECT ProductKey, CustomerKey, SUM(SalesAmount)
FROM Sales
GROUP BY GROUPING SETS((ProductKey, CustomerKey), (ProductKey), ())


SQL/OLAP Operations (cont.)
•  Equivalent queries:

SELECT ProductKey, CustomerKey, SUM(SalesAmount)
FROM Sales
GROUP BY CUBE(ProductKey, CustomerKey)

SELECT ProductKey, CustomerKey, SUM(SalesAmount)
FROM Sales
GROUP BY GROUPING SETS((ProductKey, CustomerKey), (ProductKey), (CustomerKey), ())


SQL/OLAP Operations



GROUP BY ROLLUP

ProductKey  CustomerKey  SalesAmount
p1          c1           100
p1          c2           105
p1          c3           100
p1          NULL         305
p2          c1           70
p2          c2           60
p2          c3           40
p2          NULL         170
p3          c1           30
p3          c2           40
p3          c3           50
p3          NULL         120
NULL        NULL         595

GROUP BY CUBE

ProductKey  CustomerKey  SalesAmount
p1          c1           100
p2          c1           70
p3          c1           30
NULL        c1           200
p1          c2           105
p2          c2           60
p3          c2           40
NULL        c2           205
p1          c3           100
p2          c3           40
p3          c3           50
NULL        c3           190
NULL        NULL         595
p1          NULL         305
p2          NULL         170
p3          NULL         120


SQL/OLAP Operations: Window Partitioning

•  Allows comparing detailed data with aggregate values
•  Example: relevance of each customer with respect to the sales of the product

SELECT ProductKey, CustomerKey, SalesAmount,
       MAX(SalesAmount) OVER (PARTITION BY ProductKey) AS MaxAmount
FROM Sales

•  First three columns are obtained from the Sales table
•  The fourth column:
–  For each tuple, define a window called partition that contains all tuples of the same product
–  SalesAmount is aggregated over this window using the MAX function


Example: result


ProductKey CustomerKey SalesAmount MaxAmount

p1 c1 100 105

p1 c2 105 105

p1 c3 100 105

p2 c1 70 70

p2 c2 60 70

p2 c3 40 70

p3 c1 30 50

p3 c2 40 50

p3 c3 50 50
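A pure-Python sketch of the same computation may help: unlike GROUP BY, the partition aggregate is attached to every detail row instead of collapsing the rows. Variable names are illustrative only.

```python
SALES = [("p1", "c1", 100), ("p1", "c2", 105), ("p1", "c3", 100),
         ("p2", "c1", 70), ("p2", "c2", 60), ("p2", "c3", 40),
         ("p3", "c1", 30), ("p3", "c2", 40), ("p3", "c3", 50)]

# MAX(SalesAmount) per partition, where a partition is one ProductKey.
max_by_product = {}
for product, _, amount in SALES:
    max_by_product[product] = max(amount, max_by_product.get(product, amount))

# Every detail row is kept and extended with its partition's aggregate.
result = [(product, customer, amount, max_by_product[product])
          for product, customer, amount in SALES]

for row in result:
    print(row)
```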


SQL/OLAP Operations: Window Ordering

•  Used to order the rows within a partition
•  Useful to compute rankings, with window functions such as ROW_NUMBER and RANK
•  Example: How does each product rank in the sales of each customer?

SELECT ProductKey, CustomerKey, SalesAmount,
       ROW_NUMBER() OVER (PARTITION BY CustomerKey ORDER BY SalesAmount DESC) AS RowNo
FROM Sales

–  The first tuple, for example, is evaluated by opening a window with all tuples of customer c1, ordered by the sales amount
–  Product p1 is the one most demanded by customer c1


Example: result


ProductKey CustomerKey SalesAmount RowNo

p1 c1 100 1

p2 c1 70 2

p3 c1 30 3

p1 c2 105 1

p2 c2 60 2

p3 c2 40 3

p1 c3 100 1

p3 c3 50 2

p2 c3 40 3
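The ranking can be reproduced from the Sales fact table given earlier (where p2/c3 = 40 and p3/c3 = 50) with a short sketch: sort by partition key and descending amount, then number the rows inside each partition. Variable names are illustrative.

```python
from itertools import groupby
from operator import itemgetter

SALES = [("p1", "c1", 100), ("p1", "c2", 105), ("p1", "c3", 100),
         ("p2", "c1", 70), ("p2", "c2", 60), ("p2", "c3", 40),
         ("p3", "c1", 30), ("p3", "c2", 40), ("p3", "c3", 50)]

# PARTITION BY CustomerKey ORDER BY SalesAmount DESC
ordered = sorted(SALES, key=lambda r: (r[1], -r[2]))

ranked = []
for _, partition in groupby(ordered, key=itemgetter(1)):
    # ROW_NUMBER() restarts at 1 for every partition.
    for row_no, (product, customer, amount) in enumerate(partition, start=1):
        ranked.append((product, customer, amount, row_no))

for row in ranked:
    print(row)
```

ROW_NUMBER assigns distinct numbers even to ties; this data has no ties within a customer, so RANK would give the same result here.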


SQL/OLAP Operations: Window Framing

•  Defines the size of the partition
•  Used to compute statistical functions over time series, like moving average
•  Example: Add two columns Year and Month to the Sales table and compute the three-month moving average of sales by product

SELECT ProductKey, Year, Month, SalesAmount,
       AVG(SalesAmount) OVER (PARTITION BY ProductKey ORDER BY Year, Month ROWS 2 PRECEDING) AS MovAvg
FROM Sales

•  For each tuple, opens a window with the tuples pertaining to the current product
•  Then, orders the window by year and month and computes the average over the current tuple and the previous two ones if they exist


Example: result


ProductKey Year Month SalesAmount MovAvg

p1 2011 10 100 100

p1 2011 11 105 102.5

p1 2011 12 100 101.67

p2 2011 12 60 60

p2 2012 1 40 50

p2 2012 2 70 56.67

p3 2012 1 30 30

p3 2012 2 50 40

p3 2012 3 40 40
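The frame ROWS 2 PRECEDING can be sketched in plain Python as "the current row plus at most two earlier rows of the same product". The function name moving_average is hypothetical; rounding to two decimals is done only to match the table.

```python
from collections import defaultdict

SALES = [("p1", 2011, 10, 100), ("p1", 2011, 11, 105), ("p1", 2011, 12, 100),
         ("p2", 2011, 12, 60), ("p2", 2012, 1, 40), ("p2", 2012, 2, 70),
         ("p3", 2012, 1, 30), ("p3", 2012, 2, 50), ("p3", 2012, 3, 40)]

def moving_average(rows, preceding=2):
    """AVG over the current row and up to `preceding` earlier rows,
    partitioned by product and ordered by (year, month)."""
    seen = defaultdict(list)   # amounts seen so far, per product
    out = []
    for product, year, month, amount in sorted(rows, key=lambda r: r[:3]):
        seen[product].append(amount)
        frame = seen[product][-(preceding + 1):]
        out.append((product, year, month, amount,
                    round(sum(frame) / len(frame), 2)))
    return out

mov_avg = moving_average(SALES)
for row in mov_avg:
    print(row)
```

The first rows of each product are averaged over fewer than three values because the frame is cut off at the start of the partition.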


SQL/OLAP Operations: Window Framing (another example)

•  Example: Year-to-date sum of sales by product

SELECT ProductKey, Year, Month, SalesAmount,
       SUM(SalesAmount) OVER (PARTITION BY ProductKey, Year ORDER BY Month ROWS UNBOUNDED PRECEDING) AS YTD
FROM Sales

•  For each tuple, opens a window with the tuples of the current product and year, ordered by month
•  SUM is applied to the current tuple and all the tuples before it (ROWS UNBOUNDED PRECEDING)


Example: result


ProductKey Year Month SalesAmount YTD

p1 2011 10 100 100

p1 2011 11 105 205

p1 2011 12 100 305

p2 2011 12 60 60

p2 2012 1 40 40

p2 2012 2 70 110

p3 2012 1 30 30

p3 2012 2 50 80

p3 2012 3 40 120
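A running total per (product, year) partition gives the same year-to-date figures. This is a minimal sketch; the function name year_to_date is hypothetical.

```python
from collections import defaultdict

SALES = [("p1", 2011, 10, 100), ("p1", 2011, 11, 105), ("p1", 2011, 12, 100),
         ("p2", 2011, 12, 60), ("p2", 2012, 1, 40), ("p2", 2012, 2, 70),
         ("p3", 2012, 1, 30), ("p3", 2012, 2, 50), ("p3", 2012, 3, 40)]

def year_to_date(rows):
    """Running SUM partitioned by (product, year) and ordered by month,
    i.e. a frame that grows from the partition start to the current row."""
    running = defaultdict(int)
    out = []
    for product, year, month, amount in sorted(rows, key=lambda r: r[:3]):
        running[(product, year)] += amount
        out.append((product, year, month, amount, running[(product, year)]))
    return out

ytd = year_to_date(SALES)
for row in ytd:
    print(row)
```

For p2 the total restarts in January 2012 because the year is part of the partition key.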


Outline

•  OLAP: On-line Analytical Processing
→ Multidimensional model


Multidimensional model
•  Views data in an n-dimensional space: data cube
–  composed of dimensions and facts
•  Dimensions: perspectives used to analyze the data
–  Example: A 3-dimensional cube for sales data with dimensions Product, Time, and Customer, and a measure Quantity
•  Attributes describe dimensions
–  Product dimension may have attributes ProductNumber and UnitPrice (not shown)
•  Cells or facts have associated numeric values called measures
–  Each cell of the data cube represents Quantity of units sold by category, quarter, and customer's city

[Figure: three-dimensional data cube. Dimensions: Product (Category): Beverages, Condiments, Produce, Seafood; Time (Quarter): Q1, Q2, Q3, Q4; Customer (City): Paris, Lyon, Köln, Berlin. The cells hold the measure values.]


Star and snowflake schemas

•  At the logical level, the multidimensional model is usually represented by relational tables organized in:
–  Star schemas: use a unique table for each dimension, even in the presence of hierarchies (yields denormalized dimension tables)
–  Snowflake schemas: use normalized tables for dimensions and their hierarchies
–  Fact constellations: multiple fact tables share dimension tables, viewed as a collection of stars

•  Over this relational representation of a data warehouse, an OLAP server builds a data cube, which provides a multidimensional view of the data


Example of Star Schema

Sales Fact Table
   keys:     time_key, item_key, branch_key, location_key
   measures: units_sold, dollars_sold, avg_sales

Dimension tables
   time     (time_key, day, day_of_the_week, month, quarter, year)
   item     (item_key, item_name, brand, type, supplier_type)
   branch   (branch_key, branch_name, branch_type)
   location (location_key, street, city, state_or_province, country)
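As a rough illustration of how a star schema is queried, the sketch below builds a reduced version (only the time and item dimensions) in SQLite and runs a typical star join. The table names time_dim, item_dim and sales_fact and all inserted rows are made up for the example; only the column names come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, day INTEGER,
                       month INTEGER, quarter TEXT, year INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT,
                       brand TEXT, type TEXT, supplier_type TEXT);
-- The fact table holds one foreign key per dimension plus the measures.
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item_dim(item_key),
    units_sold INTEGER,
    dollars_sold REAL
);
""")

# Hypothetical sample data.
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?, ?, ?)",
                 [(1, 15, 1, "Q1", 2016), (2, 20, 4, "Q2", 2016)])
conn.executemany("INSERT INTO item_dim VALUES (?, ?, ?, ?, ?)",
                 [(1, "Chai", "BrandA", "beverage", "wholesale"),
                  (2, "Ikura", "BrandB", "seafood", "retail")])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
                 [(1, 1, 10, 100.0), (1, 2, 5, 250.0),
                  (2, 1, 4, 40.0), (2, 1, 6, 60.0)])

# Star join: the fact table joined to each dimension, grouped by
# dimension attributes.
result = conn.execute("""
    SELECT t.quarter, i.brand, SUM(f.dollars_sold)
    FROM sales_fact f
    JOIN time_dim t ON f.time_key = t.time_key
    JOIN item_dim i ON f.item_key = i.item_key
    GROUP BY t.quarter, i.brand
    ORDER BY t.quarter, i.brand
""").fetchall()
print(result)
```

Because each dimension is a single denormalized table, any attribute (quarter, brand, ...) is one join away from the facts; in a snowflake schema some attributes would need an extra join.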


Example of Snowflake Schema

Sales Fact Table
   keys:     time_key, item_key, branch_key, location_key
   measures: units_sold, dollars_sold, avg_sales

Dimension tables
   time     (time_key, day, day_of_the_week, month, quarter, year)
   item     (item_key, item_name, brand, type, supplier_key)
   supplier (supplier_key, supplier_type)
   branch   (branch_key, ...)
   location (location_key, street, city_key)
   city     (city_key, city, state_or_province, country)


Example of Fact Constellation

Sales Fact Table
   keys:     time_key, item_key, branch_key, location_key
   measures: units_sold, dollars_sold, avg_sales

Shipping Fact Table
   keys:     time_key, item_key, shipper_key, from_location, to_location
   measures: dollars_cost, units_shipped

Dimension tables
   time     (time_key, day, day_of_the_week, month, quarter, year)
   item     (item_key, item_name, brand, type, supplier_type)
   branch   (branch_key, ...)
   location (location_key, street, city, province_or_state, country)
   shipper  (shipper_key, shipper_name, location_key, shipper_type)


Characteristics of a data cube

•  Data granularity: level of detail at which measures are represented for each dimension of the cube
–  Example: sales figures aggregated to granularities Category, Quarter, and City



•  Instances of a dimension are called members
– Ex: Seafood and Beverages are members of the Product dimension at the granularity Category



•  A data cube may contain several measures
–  e.g. amount, indicating the total sales amount (not shown)



•  A data cube may be sparse (typical case) or dense
– Ex: not all customers may have ordered products of all categories during all quarters


Hierarchies
•  Enable viewing data at several granularities
–  Define a sequence of mappings relating lower-level, detailed concepts to higher-level ones
–  The lower level is called the child and the higher level is called the parent
–  The hierarchical structure of a dimension is called the dimension schema
–  A dimension instance comprises all members at all levels in a dimension


•  Example
–  Hierarchies of the Product, Time, and Customer dimensions

[Figure: dimension hierarchies, from the most detailed level up to All]
Product:   Product → Category → All
Time:      Day → Month → Quarter → Semester → Year → All
Customer:  Customer → City → State → Country → Continent → All


Members of hierarchy

•  Members of the hierarchy Product → Category

[Figure: tree of hierarchy members]
All:       all
Category:  Beverages, ..., Seafood
Product:   Chai, Chang, ... (under Beverages); Ikura, Konbu, ... (under Seafood)
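A child-to-parent mapping is enough to aggregate a measure from one level of a hierarchy to the next. In the sketch below the member names come from the slide, while the quantities and the function name roll_up are hypothetical.

```python
from collections import defaultdict

# Product -> Category mapping for the members named on the slide.
PRODUCT_TO_CATEGORY = {"Chai": "Beverages", "Chang": "Beverages",
                       "Ikura": "Seafood", "Konbu": "Seafood"}

# Hypothetical quantities sold per product.
quantity_by_product = {"Chai": 10, "Chang": 5, "Ikura": 7, "Konbu": 3}

def roll_up(values, child_to_parent):
    """Aggregate an additive measure from the child level to the parent level."""
    totals = defaultdict(int)
    for child, value in values.items():
        totals[child_to_parent[child]] += value
    return dict(totals)

quantity_by_category = roll_up(quantity_by_product, PRODUCT_TO_CATEGORY)
# Every category has the single parent "all" at the top level.
quantity_all = roll_up(quantity_by_category,
                       {category: "all" for category in quantity_by_category})
print(quantity_by_category)
print(quantity_all)
```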

Classification of measures

•  Each measure is associated to an aggregation function that combines several measure values into a single one
–  Aggregation of measures takes place when we change the level of detail at which data in a cube is visualized





•  Measures can be classified according to the way they can be aggregated:
–  Additive: can be meaningfully summarized along all the dimensions, using addition (most common type)
–  Semiadditive: can be meaningfully summarized using addition along some dimensions (example: inventory quantities, which cannot be added along the Time dimension)
–  Nonadditive: cannot be meaningfully summarized using addition across any dimension (examples: item price, cost per unit, and exchange rate)
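The semiadditive case can be illustrated with a tiny inventory example. All figures and warehouse names below are hypothetical.

```python
from collections import defaultdict

# Hypothetical end-of-month stock levels: (month, warehouse) -> units on hand.
INVENTORY = {("2016-01", "W1"): 10, ("2016-01", "W2"): 20,
             ("2016-02", "W1"): 30, ("2016-02", "W2"): 40}

# Addition is meaningful across warehouses within one month...
stock_by_month = defaultdict(int)
for (month, _), units in INVENTORY.items():
    stock_by_month[month] += units

# ...but along Time the levels are averaged rather than summed, since
# adding January and February stock would count the same goods twice.
avg_stock = sum(stock_by_month.values()) / len(stock_by_month)

print(dict(stock_by_month))
print(avg_stock)
```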


Another Classification of Measures

•  Another classification of measures:
–  Distributive: defined by an aggregation function that can be computed in a distributed way; count, sum, minimum, and maximum are distributive, distinct count is not (example: S = {3, 3, 4, 5, 8, 4, 7, 3, 8} partitioned into the subsets {3, 3, 4}, {5, 8, 4}, {7, 3, 8} gives 2 + 3 + 3 = 8 when the per-subset distinct counts are added, while the answer over the original set is 5)
–  Algebraic: defined by an aggregation function that can be expressed as a scalar function of distributive ones; example: average, computed by dividing the sum by the count
–  Holistic: cannot be computed from other subaggregates (e.g., median, rank)
•  Most large data cube applications require efficient computation of distributive and algebraic measures
–  It is difficult to efficiently compute holistic measures
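The three classes can be checked on the example set from the slide. The sketch compares each aggregate computed from the partitions with the same aggregate computed over the whole set.

```python
from statistics import median

S = [3, 3, 4, 5, 8, 4, 7, 3, 8]
PARTS = [[3, 3, 4], [5, 8, 4], [7, 3, 8]]

# Distributive: combining the per-partition results gives the exact answer.
sum_ok = sum(sum(p) for p in PARTS) == sum(S)
max_ok = max(max(p) for p in PARTS) == max(S)

# Distinct count is not distributive: adding per-partition counts overcounts.
naive_distinct = sum(len(set(p)) for p in PARTS)
true_distinct = len(set(S))

# Algebraic: average is a scalar function of two distributive ones (sum, count).
average = sum(sum(p) for p in PARTS) / sum(len(p) for p in PARTS)

# Holistic: the median of the partition medians is not the overall median.
median_of_medians = median(median(p) for p in PARTS)
true_median = median(S)

print(sum_ok, max_ok)
print(naive_distinct, true_distinct)
print(average)
print(median_of_medians, true_median)
```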


More about measures
•  When defining a measure we must determine the associated aggregation functions
–  For example, a semiadditive measure representing inventory quantities can be aggregated using average along the Time dimension, and using addition along other dimensions
•  Summarizability refers to the correct aggregation of cube measures along dimension hierarchies
•  Summarizability conditions:
–  Disjointness of instances: the grouping of instances in a level with respect to their parent in the next level must result in disjoint subsets
–  Completeness: all instances are included in the hierarchy and each instance is related to one parent in the next level
–  Correctness: refers to the correct use of the aggregation functions
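The first two conditions can be tested mechanically on a child-to-parents mapping. The following is only a sketch of such a check: the function name summarizability_report and the second, deliberately broken mapping are hypothetical.

```python
def summarizability_report(child_to_parents):
    """Check a hierarchy level given as {child: set of parents}.

    complete: every child has at least one parent.
    disjoint: no child belongs to more than one parent.
    """
    orphans = sorted(c for c, ps in child_to_parents.items() if len(ps) == 0)
    shared = sorted(c for c, ps in child_to_parents.items() if len(ps) > 1)
    return {"complete": not orphans, "disjoint": not shared,
            "orphans": orphans, "shared": shared}

ok = summarizability_report({"Chai": {"Beverages"}, "Ikura": {"Seafood"}})
bad = summarizability_report({"Chai": {"Beverages", "Condiments"},
                              "Konbu": set()})
print(ok)
print(bad)
```

A member with two parents would be counted twice when rolling up, and a member without a parent would be lost, so either violation makes the parent-level totals disagree with the grand total.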


Exercise
Consider the following fact table that stores information about the total sales associated with each invoice and each client on a given date: SALES(invoiceID, clientID, dateID, total).

Also consider that:
•  The dimension table Date includes the following hierarchy: day -> month -> year;
•  The dimension table Client includes the hierarchy: name -> category

a) Represent the tables, primary keys and foreign keys of the corresponding star schema

b) Write in SQL the statements that return:
b.1) Average of sales by client
b.2) Average of sales by month
b.3) Average of sales by client and by month
b.4) Average of sales.

c) Use the CUBE operator and write in SQL/OLAP an expression whose result contains the four results above.


Next Lecture

•  OLAP Operations

