TF Informatik

Design of Multidimensional Data Models for Data Warehouses and OLAP

Thomas Frisendal

Who I am

• Freelance Data Base Consultant
• More than 30 years of experience with DBMS’s as a vendor person and as a freelance consultant
• 10 years of experience with Data Warehouse construction
• Have trained more than 350 persons in Multidimensional Modelling and Star Schema Design
• Charter member of the IAIDQ
• Board member of The Data Warehouse Institute Denmark
• Locations: Nordic Countries (principal residence in Denmark) and Côte d’Azur (secondary residence in Antibes)

Acknowledgements

Based on
• Ralph Kimball’s books: “The Data Warehouse Toolkit”, Wiley 1996, ISBN 0-471-15337-0, “The Data Warehouse Lifecycle Toolkit” *), Wiley 1998, ISBN 0-471-25547-5, “The Data Webhouse Toolkit” **), Wiley 2000, ISBN 0-471-37680-9, and selected parts of Ralph’s published papers
• Some examples from “Data Warehouse Design Solutions”, Christopher Adamson and Michael Venerable, Wiley 1998, ISBN 0-471-25195-X
• “Microsoft OLAP Solutions”, Erik Thomsen, George Spofford and Dick Chase, Wiley 1999, ISBN 0-471-33258-5
• Telco example from “The Official Guide to Informix/Red Brick® Data Warehousing”, M&T Books (IDG) 2000, ISBN 0-7645-4694-5
• DWLIST discussions
• My own practical experiences and readings

*) With Laura Reeves, Margy Ross and Warren Thornthwaite
**) With Richard Merz

Agenda

• History
• Data Warehousing Objectives and Architecture
• Dimensional Design Basics
• Industry examples
• The Data Webhouse (Clickstream Analysis)
• Advanced Design Issues
• The bigger picture
• Data Quality
• The future of the Data Warehouse
• Literature and web addresses


History

How the Multidimensional Model came into existence

The early years of database

Online!

Pioneers:
General Electric, IDS, Charlie Bachman
Rockwell/IBM, IMS

Techniques:
Hashing, pointers, physical colocation, and concurrency / transaction control

The Vision: The Information System

[Diagram: the information system in three levels: Operational, Tactical, Strategic]

Relational & The Codd & Date Seminars

S#  SNAME  STATUS  CITY
S1  Smith  20      London
S2  Jones  10      Paris
S3  Blake  30      Paris
S4  Clark  20      London
S5  Adams  30      Athens

S#  P#  QTY
S1  P1  300
S1  P2  200
S2  P1  300
S2  P2  400
S4  P2  200

P#  PNAME  COLOR  WEIGHT  CITY
P1  Nut    Red    12      London
P2  Bolt   Green  17      Paris
P3  Screw  Blue   17      Rome
P4  Screw  Red    14      London
P5  Cam    Blue   12      Paris


Relational / SQL

• Database language standard

• The Query Optimizer (automated navigation)

• DBMS’s become commodities

• End-user tools by the hundreds

The Information Warehouse Concept

Around 1987: Getting closer

[Diagram: the Operational and Tactical levels]

Innovation for Analysts: The Multidimensional Database

[Diagram: a cube of Sales with the axes Products, Time periods and Districts]

IT & Business Opportunities: Decision Support Systems

Operational: On-Line Transaction Processing
Tactical: On-Line Analytical Processing
Strategic: Executive Information Systems

OLTP and OLAP: Conflict of Purpose

[Diagram, OLTP side: Customers, Orders, Products, Order-lines, Invoices, Inventory]

[Diagram, “Star Schema” side: Sales Detail in the centre, surrounded by Markets, Time (time_period), Products and Customers (age, gender, income level, education). The tables are annotated with record counts ranging from hundreds and thousands up to millions and billions of records.]

OLAP Categories and Sample Products

Categories:
• ROLAP: Relational DBMS’s with Star Schema support and specific tools
• MOLAP: Multidimensional Databases
• HOLAP: Hybrid OLAP, the combination of the two

Sample products:
• ROLAP: Most RDBMS’s, tools like MicroStrategy, Business Objects, Informix Metacube, IBM DB2 OLAP Server, Oracle 9i OLAP Server and many more
• MOLAP: Hyperion Essbase, Applix TM1, Cognos, Microsoft OLAP Services, IBM DB2 OLAP Server
• HOLAP: Microsoft SQL Server with OLAP Services, (IBM DB2 OLAP Server, Hyperion Essbase 7)

ROLAP vs. MOLAP

• Hot debate in the 1990s (“The Shootout at the OLAP Corral”)
• The major strength of ROLAP:
  – Large volumes of data (billions of rows, terabytes of data)
• The major weakness of ROLAP:
  – Performance on large result sets
• The major strength of MOLAP:
  – Fast response times, also on “large result sets” (aggregated queries)
• The major drawbacks of MOLAP:
  – Pre-calculation times
  – Scalability (the cubes explode as the volume and complexity increase)
• So, what to do?
  – Use both (HOLAP)

The Power of the 2

ROLAP (Star Schema) together with MOLAP / HOLAP

Smooth, often automatic/transparent, integration

Detailed levels in ROLAP (many rows)

Two good examples:
- Microsoft SQL Server with OLAP Services
- IBM DB2 OLAP Server

It’s here!

Distributed On-Line Transaction Processing
Analytical Applications in OLAP

Data Warehousing Objectives and Architecture

Data Warehouse Objectives

• Provide one data source for reporting, analysis, and mining
  – consistent answers across the organization or organizational unit (“data marts”)

The Feeding System (Extract, Transformation and Load - ETL Tools)

[Diagram: Operational systems feed the Data warehouse, with a “Data staging area?” in between]

• Extracting
• Cleaning
• Standardizing
• Consolidating
• Aggregating
• Transforming
• Key generating
• Reformatting
• and much more

ETL: where the time is spent!

• Do it yourself (SQL, scripts, COBOL etc.)
• ETL tool products (e.g. Informatica)
• 80 % of your time is spent here!
• Buy and read “The Data Warehouse ETL Toolkit”, Ralph Kimball and Joe Caserta, Wiley 2004, ISBN 0-7645-6757-8

Data Warehouse statistics

• Development time:
  – < 6 months: 16 %
  – 6-12 months: 32 %
  – 12-24 months: 26 %
  – 24-60 months: 20 %
  – > 60 months: 6 %
• 50 % of all are over 3 years of age (9 % over 10 years)
• 33 % over 1 TB (10 % over 10 TB)

Source: Business Intelligence Journal, Fall 2005

The Points of Measurements in Business Processes

[Diagram: six “Facts” boxes, one per point of measurement]

Facts are numerical (counts, volumes, money) and represent the key subject areas in business.
Each point of measurement becomes a star.

One, big Data Warehouse

[Diagram: six “Facts” boxes and one “global” DW]

One “global” DW:
• Finished goods inventory
• Shipments
• Distribution inventory
• Depletions
• Store inventory
• Sales transactions
• Time dimension
• Product dimension
• Customer dimension
• Etc. etc. etc.

Anarchistic Data Marts

[Diagram: six “Facts” boxes and six Data Marts]

Coordinated Data Marts

[Diagram: six “Facts” boxes, six Data Marts, and an Enterprise Data Warehouse (maybe also an Operational Data Store)]

Top 10 reasons for a layer of atomic data in front of data marts

• Coordinated transformations at mart level (consistency)

• Minimized impact on source systems (one extract)

• Temporal integrity across marts

• Single source for cross-subject mining

• Allows smaller marts - with drill-down to atomic level

• Coordinated meta data across entire DW

• Scalability of entire DW (quick population of new marts)

• Facilitates building stars

• Mart recovery

• If volumes are not too high, operational reporting may take place

Doug Laney, Prism Solutions on DWLIST

The Data Warehouse Bus Architecture

[Diagram: six “Facts” boxes and six Data Marts]

Design: Conformed Dimensions and Conformed Facts
Physical: Data Staging facilities as necessary

Roles of ROLAP, MOLAP and HOLAP

[Diagram: six “Facts” boxes, a data staging area and six data marts]

Data marts: Data Mart 1, ROLAP; Data Mart 2, MOLAP; Data Mart 3, HOLAP; Data Mart 4, MOLAP; Data Mart 5, ROLAP; Data Mart 5, MOLAP

Data Staging Area in Relational DBMS:
• E/R: May be used for data cleansing
• Star Schema: For all facts and dimensions

Components and structures in Decision Support Systems

[Diagram, from the user down to the data:]
• User Interface, i.e. Windows or Web
• Ad hoc OLAP tools, own applications developed with OLAP tools, off the shelf applications
• Multidimensional Cube(s): typically on aggregated levels
• Relational Star Schemas: typically on the atomic, event level
• Data warehouse database (SQL)
• Metadata

A few words on Meta Data

• You need it and it is important
• Read “Meta Meta Data Data”, Ralph Kimball (www.dbmsmag.com/9803d05.html)
• Comes with client tools, data transformation tools, stand-alone tools, modelling tools, DBMS’s etc.
• Microsoft Repository Open Information Model
• Meta Data Coalition OIM 1.1 (www.mdcinfo.com)
• Parallel work in the OMG (Common Warehouse Metamodel) (www.omg.org/technology/cwm)
• September 2000: MDC integrates into OMG CWM, the two standards become one.

Common Warehouse Metamodel

• Expressed in UML
• XML for metadata interchange (XMI)
• Check www.cwmforum.org
• CWM 1.0 February 2001
• Partners: IBM, Unisys, NCR, Hyperion, Oracle, UBS, Genesis, Dimension EDI
• Supporters: Deere, Sun, HP, Data Access, InLine, Aonix, Hitachi, SAS, Meta Integration, Adaptive
• “Common Warehouse Metamodel”, John Poole, Dan Chang, Douglas Tolbert, and David Mellor, OMG Press / Wiley 2002, ISBN 0-471-20052-2
• Actual product support? Still maybe a little too early to tell…


Data Warehousing Objectives

• Publish what is important

• Provide the means to find out why

• Promote well-informed decisions


Multidimensional Design Basics

The Art of Constructing the Stars!

Dimensional Design Methodology

• Design begins
  – with business requirements gathered from the decision makers and analysts
  – and data sourced from
    • the corporation’s operational systems
    • external data sources
• Design requires
  – user involvement at all stages
• Design resembles
  – a series of successive approximations (3-4 revisions)

Simple E/R model for OLTP

[E/R diagram; entities with keys and attributes as they appear on the slide:]

• Order_Details: PK OrderDetailID; FK1 OrderID, FK2 ProductID; Quantity, UnitPrice, Discount; FK1 ShippingMethodID
• Orders: PK,FK3 ShippingMethodID, PK OrderID; FK1 CustomerID, FK2 EmployeeID; OrderDate, PurchaseOrdNumber, ShipName, ShipAddress, ShipCity, ShipState, ShipPostalCode, ShipCountry, ShipPhoneNumber, ShipDate, FreightCharge, SalesTaxRate
• Products: PK ProductID; ProductName, UnitPrice; FK1 BrandID
• Customers: PK CustomerID; CompanyName, ContactFirstName, ContactLastName, BillingAddress, City, StateOrProvince, PostalCode, Country, ContactTitle, PhoneNumber, FaxNumber
• Employees: PK EmployeeID; FirstName, LastName, Title, EmailName, Extension, WorkPhone
• Shipping_Methods: PK ShippingMethodID; ShippingMethod
• Payments: PK PaymentID; FK1 OrderID; PaymentAmount, PaymentDate, CreditCardNumber, CardholdersName, CreditCardExpDate; FK2 PaymentMethodID, FK1 ShippingMethodID
• Brands: PK BrandID; BrandDescription
• Payment_Methods: PK PaymentMethodID; PaymentMethod
• EmployeeTerritories: PK,FK1 EmployeeID; PK,FK2 TerritoryID
• Territories: PK TerritoryID; TerritoryDescription
• TerritoryRegion: PK,FK1 RegionID; PK,FK2 TerritoryID
• Region: PK RegionID; RegionDescription

Problems with E/R

• Humans can’t navigate or remember an E/R
• Software can’t navigate an E/R:
  – Every path gives a different answer
  – The “shortest path” is meaningless
• Bad performance

Do you need a warehouse E/R model?

• Not necessarily
• the data relationships in the enterprise E/R model can suffice

Can I model by subject area with a phased approach?

• Yes
• models are extensible
• the key is conformable dimensions
• dimensions can then be shared and are extensible themselves

Heart of the matter

• Business Views
  – Must look like the business
  – Recognized by business types
  – Relevant for business types
• Three design rules
  – Simplicity
  – Simplicity
  – Simplicity

K.I.S.S.!

Design objectives

• Business View Schema must be readily understood and navigable by the users

• Important information must not be obscured by unimportant detail and complexity

• The implementation(s) must provide rapid response time against large volumes of historical data

• The implementation(s) must be legible and navigable for extract processing & mining

Classic definition, ROLAP

• STAR schema
  – A relational schema organized around a central table (the facts) joined to smaller tables using foreign key references

Dimensional Model

Fact table: Product, Market, Promotion, Time (dimension keys); dollars, units, price, cost
Product dimension: descrip, size, flavor, package
Time dimension: descrip, weekday, holiday, fiscal
Market dimension: descrip, district, region, demog
Promotion dimension: descrip, deal, discount, media

Facts/Measures: 95% of data base storage
Dimensions: most of the fields
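A minimal sketch of such a star, run through Python's built-in sqlite3 module. The table names, the column subset and the sample rows are illustrative assumptions, not taken from the slide; only two of the four dimensions are shown.

```python
import sqlite3

# Hypothetical, cut-down star: one fact table and two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, descrip TEXT, flavor TEXT);
CREATE TABLE time_dim    (time_key    INTEGER PRIMARY KEY, descrip TEXT, fiscal TEXT);
CREATE TABLE sales_fact (
    product_key INTEGER REFERENCES product_dim(product_key),
    time_key    INTEGER REFERENCES time_dim(time_key),
    dollars     REAL,
    units       INTEGER
);
INSERT INTO product_dim VALUES (1, 'Cola 33cl', 'Cola'), (2, 'Lemon 33cl', 'Lemon');
INSERT INTO time_dim    VALUES (1, '2001-01-15', '2001-Q1'), (2, '2001-04-02', '2001-Q2');
INSERT INTO sales_fact  VALUES (1, 1, 10.0, 5), (1, 2, 20.0, 10), (2, 1, 6.0, 3);
""")

# Typical star query: group by dimension attributes, aggregate the facts.
rows = conn.execute("""
    SELECT p.flavor, t.fiscal, SUM(f.units) AS units, SUM(f.dollars) AS dollars
    FROM sales_fact f
    JOIN product_dim p ON p.product_key = f.product_key
    JOIN time_dim t    ON t.time_key    = f.time_key
    GROUP BY p.flavor, t.fiscal
    ORDER BY p.flavor, t.fiscal
""").fetchall()
```

The fact table carries only keys and numbers; everything the user reads in a report comes from the dimension tables.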

Terminology: Facts, Measures, Accounts

• ROLAP: The Fact Table(s)
  – Contains Facts and (foreign key) references to the dimensions
• MOLAP: The Measures Dimension
  – Contains Measures and references to the lowest level Member Keys of the dimensions
• Essbase (MOLAP): Measures Dimensions are called Accounts Dimensions

Advantages over classic data models (Entity/Relationship)

• Humans can navigate & remember a multidimensional star or cube
• Software can navigate “deterministically”
• The two major purposes of Multidimensional Modelling are:
  – Reducing complexity
  – Delivering good response times, also for large aggregations

The Dimensions

• Dimensions
  – Define the business dimensions, in terms already familiar to users, by which the central table is to be analyzed
  – Numerous columns of text, highly descriptive
  – Represent the hierarchies of different levels of reporting (e.g. Year->Quarter->Month->Day)
  – Usually less than a million rows, can be much larger in some businesses

Technically speaking (ROLAP)

• Dimension tables
  – Must have primary key
  – Joined to fact table through foreign key reference
  – Typically represent ninety percent of the data elements
  – Commonly occur in constraints and GROUP BY clauses
  – Heavily indexed

Typical dimensions

Generic:
• Time period(s)
• Geographic region (markets, cities)
• Products (also in bank/insurance)
• Promotions/campaign
• Customers (Account number)
• Sales rep, buyer, organisation

Industry specific:
• Frequent flier, stayer
• Service level, procedure, operation
• Room type, service, classification, seat
• Drug, medicine
• Vendors, distributor, warehouse

The Time Dimension

• Problem: SQL and many OLAP products do not support date arithmetic well enough
  – How many working days in a month?
  – How many days in the Easter season?
  – When is the industrial vacation period?
  – How to calculate AVG(xxx) per working day?

The Time Dimension

• Give it all the attributes you need to make life easy for the end-user:
  – Daynumber
  – Date
  – Day in week
  – Name of day
  – Type of day
  – Season
  – Week in year
  – Workingdays in week
  – Month
  – Name of month
  – Type of month
  – Day in month
  – Days in month
  – Workingdays in month
  – End of Month Flag
  – Quarter
  – Year
  – Day in year
  – Workingdays in year

Think about this - it is worth the effort!
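A time dimension like this is usually generated once by a small program rather than typed in. The sketch below builds a subset of the attributes listed above with the Python standard library; the column names, the holiday set and the rule "Monday to Friday minus holidays is a working day" are assumptions for illustration.

```python
import calendar
from datetime import date, timedelta

DAY_NAMES = ("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")

def build_time_dimension(start, end, holidays=frozenset()):
    """Return one dict per calendar day from start to end (inclusive)."""
    rows = []
    day, daynumber = start, 1
    while day <= end:
        days_in_month = calendar.monthrange(day.year, day.month)[1]
        is_working = day.weekday() < 5 and day not in holidays  # assumed rule
        rows.append({
            "daynumber": daynumber,
            "date": day.isoformat(),
            "day_in_week": day.isoweekday(),
            "name_of_day": DAY_NAMES[day.weekday()],
            "type_of_day": "Working day" if is_working else "Non-working day",
            "week_in_year": day.isocalendar()[1],
            "month": day.month,
            "day_in_month": day.day,
            "days_in_month": days_in_month,
            "end_of_month_flag": day.day == days_in_month,
            "quarter": (day.month - 1) // 3 + 1,
            "year": day.year,
            "day_in_year": day.timetuple().tm_yday,
        })
        day += timedelta(days=1)
        daynumber += 1
    return rows

# January 2001 with New Year's Day as the only holiday (hypothetical calendar).
january = build_time_dimension(date(2001, 1, 1), date(2001, 1, 31),
                               holidays={date(2001, 1, 1)})
working_days_in_january = sum(r["type_of_day"] == "Working day" for r in january)
```

With the rows loaded into a table, "working days in a month" becomes a simple count with a WHERE clause instead of date arithmetic in every query.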

The Time of Day Dimension

[Diagram: Sales Facts with the dimensions Cust, Cust geography, Contract, Sales outlet, Sales geography, Campaign, Product, Calendar and T-o-D (time of day); the labels Sales inv., Act. distr. inventory and Pl. distr. inventory also appear]

The Status Dimension

S-key  Status   Fulfilled-flag  Ordertype
1      Current  Yes             Mail
2      Current  No              Mail
3      Old      Yes             Mail
4      Old      No              Mail
5      Current  Yes             Phone
6      Current  No              Phone
etc...

(Cartesian product of all values)
Is a real dimension
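The Cartesian product can be generated rather than maintained by hand. A small sketch, assuming only the values visible in the table above (two statuses, two flag values, two order types); a real status dimension may have more.

```python
from itertools import product

# Value lists taken from the rows shown on the slide.
order_types = ["Mail", "Phone"]
statuses = ["Current", "Old"]
fulfilled_flags = ["Yes", "No"]

# The ordering (order type slowest, flag fastest) reproduces the slide's row order.
status_dimension = [
    {"s_key": key, "status": status, "fulfilled_flag": flag, "ordertype": order_type}
    for key, (order_type, status, flag)
    in enumerate(product(order_types, statuses, fulfilled_flags), start=1)
]
```

Every combination gets its own surrogate key, so the fact table needs one status key instead of three separate code columns.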

Hierarchies within Dimensions

[Dimension diagram: two hierarchies meet at Cost Center:]
Company -> Division -> Region -> Business Unit -> Cost Center
Cost Center Country -> Cost Center State/Province -> Cost Center City -> Cost Center

ROLAP: One table (denormalised) with the columns Company, Division, Region, Business Unit, CC Country, CC State, CC City, Cost Center (PK)

MOLAP: One or two dimensions, depending on your product(s)

(Dimension diagram technique proposed by Ralph Kimball et al. in “The Data Warehouse Lifecycle Toolkit”, using e.g. Microsoft Visio)

Fact of life (1): “Ragged Hierarchies”

Hierarchy       Example 1       Example 2        Example 3
Country         USA             France           Denmark
State/Province  California      -                -
Region          Silicon Valley  PACA             Sjaelland
County          Santa Clara     Alpes-Maritimes  Frederiksborg
City            San Jose        Antibes          Elsinore

Fact of life: “Ragged Hierarchies”

Expense Accounts
  Cost of Goods *
  Taxes
    National Taxes *
    Local Taxes *
  Misc Costs *

*: Data enters at this level

ROLAP Solution to the Ragged Hierarchy Problem

Base table:

Category_ID  Category_Description  Account_ID  Account_Description
01           Cost of Goods
02           Taxes                 00001       National Taxes
03           Taxes                 00002       Local Taxes
04           Misc Costs

End-user and/or MOLAP view:

    select Category_Description,
           case when Account_Description is not null
                then Account_Description
                else Category_Description
           end as Account_Description
    from Chart_of_Accounts
    order by Category_ID
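To see what the view returns, the query can be run against the four base rows with Python's built-in sqlite3 module. This is a sketch: the empty cells of the base table are assumed to be NULL, and the TEXT column types are a guess.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Chart_of_Accounts (
    Category_ID TEXT, Category_Description TEXT,
    Account_ID TEXT,  Account_Description TEXT
);
-- Empty cells on the slide are assumed to be NULL.
INSERT INTO Chart_of_Accounts VALUES
    ('01', 'Cost of Goods', NULL,    NULL),
    ('02', 'Taxes',         '00001', 'National Taxes'),
    ('03', 'Taxes',         '00002', 'Local Taxes'),
    ('04', 'Misc Costs',    NULL,    NULL);
""")

# The end-user view: a missing account level is filled with its category.
view_rows = conn.execute("""
    SELECT Category_Description,
           CASE WHEN Account_Description IS NOT NULL
                THEN Account_Description
                ELSE Category_Description
           END AS Account_Description
    FROM Chart_of_Accounts
    ORDER BY Category_ID
""").fetchall()
```

Every row now has a value on both levels, so a tool that expects a balanced hierarchy can load it as is.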

MOLAP Solutions to the Ragged Hierarchy Problem

• Depends on your product
• Might require manual definitions
• Supported in at least:
  – Microsoft SQL Server 2000
  – Hyperion Essbase
  – IBM DB2 OLAP Server
  – Applix TM1

Fact of Life (2): Unique Members

• “Member”: Often used in OLAP to designate the individual entries on the individual levels (e.g. Country = ’France’, Year = 2001 etc.).
• Some products require that members (values) be unique across:
  – The whole level
  – The whole dimension(!)
• Check your selected product(s) before getting too deep into the implementation!

Fact of Life (2): Unique Members

• All parent-child relationships MUST be one-to-many
• Checking your levels:

    select quarter, count(distinct year) as count_col
    from time_period
    group by quarter
    having count(distinct year) <> 1;

  – The result set of the query above must be empty!
  – If your product needs global uniqueness, you must do similar checking across all levels
  – If you don’t do this, your users will get meaningless answers to their queries!
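The check can be tried on a toy time_period table with Python's built-in sqlite3 module. The sample rows and the month column are made up for illustration; the HAVING clause repeats the aggregate because not every DBMS accepts a column alias there.

```python
import sqlite3

CHECK = """
    SELECT quarter, COUNT(DISTINCT year) AS parent_count
    FROM time_period
    GROUP BY quarter
    HAVING COUNT(DISTINCT year) <> 1
"""

def offending_quarters(rows):
    """Return the quarters that roll up to more than one year."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE time_period (year INTEGER, quarter TEXT, month INTEGER)")
    conn.executemany("INSERT INTO time_period VALUES (?, ?, ?)", rows)
    return conn.execute(CHECK).fetchall()

# 'Q1' exists under two different years: the child has two parents.
ambiguous = offending_quarters([(2000, "Q1", 1), (2001, "Q1", 1)])

# Qualifying the member name with its parent makes the level unique.
unique = offending_quarters([(2000, "2000-Q1", 1), (2001, "2001-Q1", 1)])
```

Running the same query for each adjacent pair of levels (month against quarter, day against month) covers the whole hierarchy.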

Fact of Life (3): Sparsity

• Some dimension members may occur quite infrequently (e.g. demographic data only available on 10 percent of your customers)
• This is called sparsity (a sparse dimension)
• Also look for dimensions which do not intersect with other dimensions in a star: they are not interesting and only take up space
• ROLAP: Not a big problem, some wasted disk space
• MOLAP: A very big problem, leads to cube “explosion” (many unused cells)
  – Try to eliminate as much as possible
  – Make sure that you tune your product configuration well (many MOLAP products are “sparsity aware”)
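A back-of-the-envelope calculation shows why sparsity hurts a cube. The sketch below compares the number of potential cells (the product of the dimension sizes) with the number of cells that actually hold data; the dimension sizes and the fact count are invented for illustration.

```python
from math import prod

def cube_density(dimension_sizes, populated_cells):
    """Return (potential cell count, fraction of cells that hold data)."""
    potential = prod(dimension_sizes)
    return potential, populated_cells / potential

# Hypothetical cube: 1000 products x 365 days x 200 stores, 2 million facts.
potential_cells, density = cube_density([1000, 365, 200], 2_000_000)
```

Here fewer than 3 percent of the 73 million potential cells are populated; each added dimension multiplies the potential cell count while the number of facts stays the same.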


Fact of Life (4): Flat Dimensions

• E.g.: A dimension on Standard Industry Codes (SIC)

• In ROLAP just another attribute on your customer (maybe)

• In MOLAP, a member attribute on the lowest level of your customer hierarchy (if your product permits it)

• Keep the number of dimensions down!

Dimension Data Become Row/Column Headers in Reports

[Diagram: star schema with Sales Detail (billions of records) in the centre, surrounded by Markets (Marketkey, Sales_area, ...), Time (Time_key, Day, Month, Year, ...), Products (Productkey, Productname, etc.) and Customers (age, gender, income level, education). A sample report uses Product values (Xxx, Yyy, Zzz) as column headers and months (Jan, Febr, Mar, Apr, May, ...) as row headers.]

Not only your model, but also your data must “look good”!

Slowly Changing Dimensions

When data is changed, what can you do?
1. Overwrite (if you don’t care about losing the history)
2. Create another dimension record (if for instance a customer moves) - (what about the keys? --> use surrogate keys!)
3. Create “current and previous” value fields (for instance changing sales territories)

Surrogate keys

• Because the meaning of IDs changes (SKU#’s, moving customers etc.)
• Because concatenated primary keys are impractical
• Keep external keys as (a) dimension field(s)
• Use plain integers for data warehouse keys (users shouldn’t see them, they are just used for joins)
• In short: Always, repeat ALWAYS, use them!
• You will want to hide them from the users by way of using views (ROLAP) and external, natural key values, which the users are familiar with

SCD Type 2: The Past

Customer dimension:

Customer_key  Customer_Number  Name    Country  …
1234          98-66473         Thomas  Denmark  …

Sales facts:

Customer_Key  Date_key  …  Quantity  Price
1234          765       …  3         99,95

The Change: The customer moves

Customer dimension:

Customer_key  Customer_Number  Name    Country  …
1234          98-66473         Thomas  Denmark  …
4352          98-66473         Thomas  France   …

Sales facts:

Customer_Key  Date_key  …  Quantity  Price
1234          765       …  3         99,95
4352          1027      …  5         245,25

SCD Type 2: History is preserved, but how many customers do we have?
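The question can be answered by counting the natural key instead of the surrogate key. A sketch with Python's built-in sqlite3 module, using the two dimension rows and two fact rows from the slide (prices written with a decimal point); the table names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_dim (
    customer_key    INTEGER PRIMARY KEY,  -- surrogate key
    customer_number TEXT,                 -- natural key, kept as an attribute
    name TEXT, country TEXT
);
CREATE TABLE sales_fact (
    customer_key INTEGER REFERENCES customer_dim(customer_key),
    date_key INTEGER, quantity INTEGER, price REAL
);
INSERT INTO customer_dim VALUES (1234, '98-66473', 'Thomas', 'Denmark');
INSERT INTO sales_fact   VALUES (1234, 765, 3, 99.95);
-- The customer moves: new dimension row, new surrogate key, same natural key.
INSERT INTO customer_dim VALUES (4352, '98-66473', 'Thomas', 'France');
INSERT INTO sales_fact   VALUES (4352, 1027, 5, 245.25);
""")

dimension_rows = conn.execute("SELECT COUNT(*) FROM customer_dim").fetchone()[0]
customers = conn.execute(
    "SELECT COUNT(DISTINCT customer_number) FROM customer_dim").fetchone()[0]

# History is preserved: each sale is reported under the country valid at the time.
quantity_by_country = dict(conn.execute("""
    SELECT c.country, SUM(f.quantity)
    FROM sales_fact f JOIN customer_dim c ON c.customer_key = f.customer_key
    GROUP BY c.country
""").fetchall())
```

Counting dimension rows gives 2, counting distinct customer numbers gives 1, which is why the natural key must stay in the dimension.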

Effective Dates?

• Effective_begin_date & effective_end_date
• Might be the only way to deal with late arriving records
• But:
  – What is the meaning of “Manufactured from/to” versus “Sold from/to”?
  – Which attributes are affected?
  – Makes query construction complicated
• If you use them, use a “current_flag” also!

Type “6” (2+3+1)

• When sales districts change “randomly”:
  – Sales Team Key
  – Sales Team Name
  – Sales Physical Address
  – Begin Effective Date
  – End Effective Date
  – Is_current_flag (type 2)
  – Current District (type 1)
  – Old district (type 3)
  – ….
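One possible reading of how the three techniques combine on a single change is sketched below on an in-memory list of rows. The function name, the dictionary keys, the open-ended end date and the sample team are all hypothetical; a real load would do the same steps with SQL updates and inserts.

```python
from datetime import date

OPEN_END = date(9999, 12, 31)  # assumed marker for "no end date yet"

def apply_district_change(rows, team_name, new_district, change_date, new_key):
    """Apply a type 6 change to a sales team dimension held as a list of dicts."""
    current = next(r for r in rows
                   if r["sales_team_name"] == team_name and r["is_current_flag"])
    # Type 2: close the current row and add a new row under a new surrogate key.
    current["end_effective_date"] = change_date
    current["is_current_flag"] = False
    rows.append(dict(
        current,
        sales_team_key=new_key,
        begin_effective_date=change_date,
        end_effective_date=OPEN_END,
        is_current_flag=True,
        old_district=current["current_district"],  # type 3: keep the previous value
    ))
    # Type 1: overwrite the current district on every row of this team.
    for row in rows:
        if row["sales_team_name"] == team_name:
            row["current_district"] = new_district
    return rows

dimension = [{
    "sales_team_key": 1, "sales_team_name": "North",
    "begin_effective_date": date(2000, 1, 1), "end_effective_date": OPEN_END,
    "is_current_flag": True, "current_district": "D1", "old_district": None,
}]
apply_district_change(dimension, "North", "D2", date(2001, 6, 1), new_key=2)
```

The type 1 step touches every historical row of the team, which is where the extra update work of type 6 comes from.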

Impact of SCD > 1

• “Small matter of programming”
• How to detect changes?
  – Cyclic Redundancy Check (CRC) is a possibility
• Which changes are important?
• Type 6 requires many updates
• Changes can “cascade”
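A sketch of CRC-based change detection with Python's zlib.crc32: a checksum over the tracked attributes is stored with the dimension row and compared with the checksum of the incoming source record. The helper name, the column names and the separator are assumptions; since different values can in rare cases produce the same CRC, this works as a cheap first filter rather than as proof that nothing changed.

```python
import zlib

def row_crc(row, tracked_columns):
    """CRC32 over the tracked attribute values, joined with a separator."""
    payload = "\x1f".join(str(row[col]) for col in tracked_columns)
    return zlib.crc32(payload.encode("utf-8"))

# Only changes in these attributes are considered important (hypothetical choice).
TRACKED = ("name", "country")

stored = {"customer_number": "98-66473", "name": "Thomas", "country": "Denmark"}
incoming = {"customer_number": "98-66473", "name": "Thomas", "country": "France"}

changed = row_crc(stored, TRACKED) != row_crc(incoming, TRACKED)
unchanged = row_crc(stored, TRACKED) == row_crc(dict(stored), TRACKED)
```

Choosing the tracked columns is the answer to "which changes are important": attributes left out of the checksum never trigger a new dimension row.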


Pragmatic preservation of history

• Make historical copies of your MOLAP cubes or ROLAP databases per year

• Make copies of the complete dimension, when major changes occur, and use those copies for historical analysis, maybe in a separate environment

The Facts/Measures

Mostly raw numeric items, relevant measures, and dimension keys only. Can signify events or coverage.

Try to use as few measures as you can get away with; these are costly
  – From some million to more than a billion observations
  – Items are typically additive. May be semi-additive or non-additive in special cases
  – Access primarily via dimensions
  – Families of facts are common

Value of additivity

• Prevent incorrect computations
  – Percentages and other statistical measures cannot always be simply added together
  – For example, average bank balances
• Good advice
  – Store base measures
  – Calculate percentages and other statistics when facts are retrieved
  – Be careful with NULL’s in the database!
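A small numeric sketch of why averages should be computed at retrieval time from stored base measures. The branches and balances are invented; the point is only that an average of pre-computed averages differs from the average calculated from sums and counts.

```python
# Hypothetical account balances per bank branch.
balances = {
    "Branch A": [100.0, 300.0],
    "Branch B": [1000.0, 2000.0, 3000.0, 6000.0],
}

# Base measures that are safe to add up: sum of balances and number of accounts.
base = {
    branch: {"balance_sum": sum(values), "account_count": len(values)}
    for branch, values in balances.items()
}

# Wrong: averaging the stored per-branch averages.
average_of_averages = sum(
    b["balance_sum"] / b["account_count"] for b in base.values()
) / len(base)

# Right: add the base measures first, divide when the facts are retrieved.
true_average = (
    sum(b["balance_sum"] for b in base.values())
    / sum(b["account_count"] for b in base.values())
)
```

The first calculation gives 1600, the second about 2066.67, because Branch B has twice as many accounts as Branch A.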

Additive measures

• Numeric datatypes
  – Units sold
  – Dollars sold versus per unit dollars
  – Claim amount
  – Discount dollars
  – Profit before tax
  – Tax dollars
  – Service charges
  – Number of calls
  – Number of transactions

Typical facts

• Sales and purchases
• Daily, weekly, monthly, quarterly sales
• Policies sold, claims sold
• Orders, shipments
• Budget forecasts, actuals

Reasons for going to the transaction level

• Behavior analysis
• Time-of-day analysis / queue analysis
• Time gap analysis
• Sequential behavior
  – Fraud detection
  – Cancellation warnings
• Basket analysis

Page 76: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

Technically speaking (ROLAP)

• Fact table
– Must have primary key
– Joined to dimensions through foreign key references
– Usually physically sorted by time dimension and the primary analysis path

Page 77: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

Mystery fields in the fact records

• Facts: Only measures and keys to dimensions
• Sometimes you see fields that are neither, and that also do not appear to be textual attributes of dimensions or other foreign keys
• Most often these fields are codes and are sparse
• If they are really necessary, try to create one or more ”mystery dimensions” out of them
• Look at correlations between values of the mystery fields
– Assume X has 200 values and Y has 1000 values
– If 1000 combinations of X, Y exist, then X is a parent of Y
– If all 200,000 (200 × 1000) combinations exist, then they are completely uncorrelated, i.e. two dimensions
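The combination count can be checked mechanically. The following is a minimal sketch, assuming the two mystery fields are available as (x, y) value pairs; the function name `classify_pair` and its return labels are hypothetical, chosen only for this illustration.

```python
def classify_pair(rows):
    """Classify two code fields by counting their distinct value combinations.

    rows: iterable of (x, y) pairs taken from the fact records.
    """
    pairs = set(rows)
    x_values = {x for x, _ in pairs}
    y_values = {y for _, y in pairs}
    combos = len(pairs)
    if combos == max(len(x_values), len(y_values)):
        # Every value of the larger field occurs with exactly one value of
        # the other field: a parent/child hierarchy, so one dimension.
        return "hierarchy"
    if combos == len(x_values) * len(y_values):
        # Every possible combination occurs: uncorrelated, so two dimensions.
        return "independent"
    return "partially correlated"


# Y values 0..19 each belong to exactly one X value (y // 5).
hierarchy_rows = [(y // 5, y) for y in range(20)]
# Every X value occurs with every Y value.
independent_rows = [(x, y) for x in range(3) for y in range(4)]

print(classify_pair(hierarchy_rows))    # hierarchy
print(classify_pair(independent_rows))  # independent
```

Real data usually lands between the two extremes; how to model the in-between case is a design judgment that the count alone does not settle.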

Page 78: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

MOLAP: Hierarchies in the Measures Dimension

• Some (most) MOLAP products allow you to set up hierarchies within the measures dimension (a.k.a. the Accounts Dimension in Essbase)
• This is particularly useful in financial reporting
• Requires some level of manually entered definitions
• ROLAP:
– Push down parent to its children
– More than one fact table
– A ”helper table” (discussed later)

Page 79: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

An Essbase Example

Page 80: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

Using degenerate dimensions (ROLAP)

Unique, primary key of a sales fact table:
• Time
• Product
• Store/Register
• Promotion
• Customer
• Employee
• Ticket # (degenerate)
• Line # (degenerate)

Page 81: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

Good reasons for getting the fact table primary key right

• ”Global Warming”
– Avoid more rows than necessary (if the granularity of fact and dimensions does not match)
• Lost dimensions
– Could be the reason for the problem
• Lost attributes
– Not getting a dimension detailed enough
• ”Low-cost Insurance”
– For avoiding duplicate rows
• ”Kids and Matches”
– Nobody will be tempted to join two fact tables

(Thanks to Jim Stagnitto of Questral, Inc.)

Page 82: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

Technically speaking (ROLAP)

• Fact table
– Non-key columns are usually not indexed; rapid access is through the dimensions
– Columns often occur in sum(), rank(), min() functions

Page 83: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

Reliable relations (ROLAP)

• All joins between dimensions and fact tables:
– Completely understood
– One to many (dimension to fact) relation based on foreign key references of the fact table’s multi-part key
• Referential integrity enforced. Always

Page 84: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

Oops - forgot one thing: The Indexes!

[Diagram: the Calldetail fact table with nine dimensions: Cust, Rateperiod, Discounttype, Batch, Calltype, Accessmethod, Jurisdiction, Hour, Day]

In an ordinary (universal) database, you will have 10 indexes on the fact table: 1 PK + 9 FKs.

Page 85: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

Sample ROLAP Index Calculation

Keys             Datatype    Length   Index size (GB)
Primary key      Composite   76       70
Jurisdiction     Character   15       19
Access method    Character    8       13
Discount type    Character   10       15
Rate period      Character    8       13
Customer         Decimal     15       19
Call type        Character   10       15
Batch number     Integer      4       10
Hour             Small int    2        8
Day              Date         4       10
Total index size                      192

Number of facts: 500 million
Length of fact row: 200 bytes
Size of fact table (GB): 93
Total size of DW (GB): 285

You want to:
• Keep the number of dimensions down!
• Use integer keys!
• Find alternatives to B-trees!
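The totals in the sample calculation can be re-derived with a few lines of arithmetic. This sketch only re-adds the figures given on the slide; it does not attempt to reproduce how the individual index sizes were estimated, and it assumes the slide's "GB" means 2^30 bytes (which is what makes 500 million rows of 200 bytes come out at 93).

```python
GIB = 2 ** 30  # assumption: the slide's GB figures are binary gigabytes

# Lengths (bytes) of the nine foreign keys, as listed in the table.
key_lengths = {
    "Jurisdiction": 15, "Access method": 8, "Discount type": 10,
    "Rate period": 8, "Customer": 15, "Call type": 10,
    "Batch number": 4, "Hour": 2, "Day": 4,
}

# Index sizes (GB) as listed in the table, including the primary key index.
index_sizes_gb = {
    "Primary key": 70, "Jurisdiction": 19, "Access method": 13,
    "Discount type": 15, "Rate period": 13, "Customer": 19,
    "Call type": 15, "Batch number": 10, "Hour": 8, "Day": 10,
}

fact_rows = 500_000_000
row_bytes = 200

composite_key_length = sum(key_lengths.values())     # length of the composite PK
fact_table_gb = round(fact_rows * row_bytes / GIB)   # size of the fact table
total_index_gb = sum(index_sizes_gb.values())        # all ten indexes
total_dw_gb = fact_table_gb + total_index_gb

print(composite_key_length, fact_table_gb, total_index_gb, total_dw_gb)
# 76 93 192 285
```

The indexes come to roughly twice the size of the data they index, which is the motivation for the three recommendations above.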

Page 86: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

Audit, Balance and Control

• Source feed file name for the row
• Job instance that processed the row
• Record number in the feed file
• Can also contribute to a unique key for the row

Page 87: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal


Families of Facts

• Related facts

• Aggregated facts

Page 88: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

Related Facts

• Shared dimensions must be the same
• Value chain, e.g. in Manufacturing:
– Manufacturing Inventory
– Manufacturing Shipments
– Distribution Inventory
– Distribution Shipments
– Store Inventory
– Store Sales

Flow of product: each process yields a set of facts

Page 89: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

Drilling Across The Value Chain

[Diagram]
Fact tables: Manufacturing Inventory, Manufacturing Shipments, Distribution Inventory, Distribution Shipments, Store Inventory, Store Sales
Dimensions: Time, Product, Customer
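Drilling across is commonly described as querying each fact table separately, grouping by attributes of the shared dimensions, and then merging the result sets on those attributes, rather than joining fact tables to each other. The sketch below illustrates that idea with made-up rows; the table contents and the `summarize` and `drill_across` helpers are hypothetical.

```python
from collections import defaultdict

# Hypothetical fact rows: (product, month, quantity).
store_sales = [("candy", "1997-09", 120), ("candy", "1997-09", 30),
               ("soda", "1997-09", 50)]
store_inventory = [("candy", "1997-09", 400), ("soda", "1997-09", 90)]


def summarize(rows):
    """Aggregate one fact table by the shared dimension attributes."""
    totals = defaultdict(int)
    for product, month, qty in rows:
        totals[(product, month)] += qty
    return totals


def drill_across(**fact_summaries):
    """Merge per-fact-table summaries on their common (product, month) keys."""
    keys = set().union(*(s.keys() for s in fact_summaries.values()))
    return {
        key: {name: summary.get(key, 0) for name, summary in fact_summaries.items()}
        for key in sorted(keys)
    }


report = drill_across(sold=summarize(store_sales),
                      on_hand=summarize(store_inventory))
print(report[("candy", "1997-09")])  # {'sold': 150, 'on_hand': 400}
```

The merge only works because both summaries use identical product and month values, which is why the shared dimensions must be the same. Inventory levels are semi-additive over time, so this example deliberately stays within a single month.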

Page 90: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

Drilling Across The Budgeting Chain

Fact tables:
• Budget: foreign keys, Budget amount
• Commitments: foreign keys, Commitment amount
• Payments: foreign keys, Payment amount

Dimensions:
• Time (Month)
• Line Item
• Department
• Account
• Payment
• Commitment

Page 91: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

Building ”Supermarts”: The Data Warehouse Bus Architecture

• Conformed Dimensions

• Standard Fact Definitions
– Revenue, Profit, Price, Cost etc.

• Granularity at the lowest level in front of the data marts

• Data Marts constructed from these standard sources as necessary

Page 92: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

Aggregated Facts (ROLAP)

• Multiple fact tables
– Share one or more dimensions
– Daily fact table
– Monthly fact table
– Monthly Category fact table
• Caveat: What keys to use in dimension tables?
• Do not use ”level” fields!
• Use MOLAP or HOLAP wherever possible!

Page 93: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

Drawbacks of aggregate tables

• There are so many of them!!!
• You are the one who must manage them!
• All they do for you is give you better performance in routine queries
• The application is programmed to your aggregation scheme; if that changes, then ….
• Digression: If you also use logical partitioning, you will have hundreds of tables!

Page 94: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

Aggregate Tables in ROLAP

• Be careful out there!
• You must use an ”aggregate navigator”

[Diagram: Client tool → plain SQL → Navigator (using metadata and statistics) → aggregate-aware SQL → DBMS (data and aggregates)]
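The core decision an aggregate navigator makes can be sketched as follows: given the levels a query needs, pick the smallest table that can answer it. This is a simplified illustration, not a description of any particular product; the table names, level names, row counts and the `choose_table` function are all hypothetical.

```python
# Hypothetical navigator metadata: table name -> (levels available, row count).
AGGREGATES = {
    "sales_daily_fact": ({"day", "month", "product", "category", "store"}, 500_000_000),
    "sales_monthly_fact": ({"month", "product", "category", "store"}, 20_000_000),
    "sales_monthly_category_fact": ({"month", "category", "store"}, 400_000),
}


def choose_table(required_levels, aggregates=AGGREGATES):
    """Return the smallest table whose levels cover everything the query needs."""
    candidates = [
        (row_count, name)
        for name, (levels, row_count) in aggregates.items()
        if required_levels <= levels  # subset test: table can answer the query
    ]
    if not candidates:
        raise LookupError(f"no table covers levels {sorted(required_levels)}")
    return min(candidates)[1]


print(choose_table({"month", "category"}))  # sales_monthly_category_fact
print(choose_table({"month", "product"}))   # sales_monthly_fact
print(choose_table({"day", "store"}))       # sales_daily_fact
```

Because the choice is made from metadata at query time, the client keeps issuing plain SQL against the base schema and the aggregation scheme can change without reprogramming the application.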

Page 95: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

Families of Facts

• Heterogeneous product lines, such as in Banking

Core keys and facts, kept in one common table:
– Month_key
– Product_key
– Customer_key
– Status_key
– Earned_dollars
– Paid_dollars
– Average_balance

Facts applicable only to checking accounts:
– Num ATM Trans
– Num Branch Trans
– Num Overdrafts
– Tot Overdraft Fees
– Overdraft Limit
– Declined Trans

Page 96: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

Families of Facts

• Transactions:
– Date_key
– Product_key
– Sales_person_key
– Customer_key
– Transaction_key
– Amount
• And/or snapshots:
– Month_key
– Product_key
– Sales_person_key
– Customer_key
– Status_key
– Earned_dollars
– Paid_dollars
– Average_balance

Why not both?

Page 97: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

Using snapshots

• When transactions are not pieces of revenue
– Deposits / withdrawals
– Payments in advance
– Insurance coverage premiums
• Consider a current rolling snapshot (period to date)
• Consider a status dimension
• Many fields in the snapshot fact table, some possibly semi-additive

Page 98: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

Related Facts in MOLAP

• MOLAP products allow only one measures dimension per cube
• You must, if necessary at all, combine related facts into one measures dimension
• Use a ”Version” (”Scenario”) dimension with values such as:
– ”Actual”
– ”Budget 2001-01”
– ”Prognosis 2001-05-01”
• Maybe also a ”Type” dimension, e.g.:
– ”Income”
– ”Expenses”
– ”Taxes”
• Or build several cubes, one for Income, one for Expenses etc.
– Many MOLAP products allow you to combine cubes into ”Virtual Cubes” using a library of shared dimensions

Page 99: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

Similar models in each industry

• A common framework for every industry
– Retail
– Telecommunications
– Transportation
– Insurance
– Healthcare
– Manufacturers
– Banking
– Government
– Websites
– E-business

Page 100: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

Retail (1): The Grocery Store

grocery_store_facts
  PK,FK1  day_key
  PK,FK2  product_key
  PK,FK4  store_key
  FK3     promotion_key

day
  PK  day_key

product
  PK  product_key

promotion
  PK  promotion_key

store
  PK  store_key

Page 101: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

Retail (2): Orders

order_facts
  PK,FK4  product_key
  PK,FK2  order_date_key
  PK,FK1  customer_key
  PK,FK3  salesperson_key

customer
  PK  customer_key

date
  PK  day_key

product
  PK  product_key

salesperson
  PK  salesperson_key

Page 102: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

Telco: The Billing CDR

BILLING_CDR
  PK,FK3  PERIOD_KEY
  PK,FK1  CALL_TYPE
  PK,FK8  RATE_PLAN
  PK,FK4  BTN
  PK,FK5  DEST_TELEPHONE_NUMBER
  PK,FK2  CUSTOMER_ID
  PK,FK7  LINE_TYPE_KEY
  PK,FK6  ORIG_TELEPHONE_NUMBER
  PK,FK9  TIME_KEY

CALL_TYPE
  PK  CALL_TYPE

CUSTOMER
  PK  CUSTOMER_ID

LINE_TYPE
  PK  LINE_TYPE_KEY

PERIOD
  PK  PERIOD_KEY

RATE_PLAN
  PK  RATE_PLAN

TELEPHONE
  PK  PHONE_KEY

TIME
  PK  TIME_KEY

Page 103: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

Transportation

Frequent Flyer
  PK,FK1   flown_key
  PK,FK2   purchased_key
  PK,FK11  customer_key
  PK,FK3   leg_origin_key
  PK,FK4   leg_dest_key
  PK,FK5   trip_origin_key
  PK,FK6   trip_dest_key
  PK,FK12  flight_key
  PK,FK10  fare_class_key
  PK,FK8   channel_key
  PK,FK7   flight_status_key

Flight
  PK  flight_key

Date Flown
  PK  flown_key

Date Purchased
  PK  purchased_key

Customer
  PK  customer_key

Leg Origin
  PK  leg_origin_key

Leg Destination
  PK  leg_dest_key

Trip Origin
  PK  trip_origin_key

Trip Destination
  PK  trip_dest_key

Fare Class
  PK  fare_class_key

Flight Status
  PK  flight_status_key

Sales Channel
  PK,FK1  channel_key

Page 104: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

Insurance (1): Bus Architecture

Matrix of fact tables (rows) against dimensions (columns); each x marks one dimension used by the fact table. The x marks are listed in row order only, without column alignment.

Dimensions (columns): Automobile, Claim, Claim Status, Claim Transaction, Claimant, Coverage, Covered item, Employee, Insured party, Month, Policy, Status, Third party, Time, Transaction

Facts (rows):
Claim snapshot           x x x x x x x x
Claim transaction fact   x x x x x x x x x x
Custom claim snap        x x x x x x x x
Custom claim trans       x x x x x x x x x x
Custom snapshot          x x x x x x x
Custom transactions      x x x x x x x
Policy snapshot          x x x x x x x
Policy transactions      x x x x x x x

Page 105: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

Insurance (2): Some fact tables

Policy Snapshot
  PK  month_key
  PK  insured_party_key
  PK  employee_key
  PK  covrd_item_key
  PK  coverage_key
  PK  policy_key
  PK  status_key
  written_premium
  earned_premium
  primary_limit
  primary_deductible
  number_of_transactions

Policy Transactions
  PK  time_key
  PK  insured_party_key
  PK  employee_key
  PK  covrd_item_key
  PK  coverage_key
  PK  policy_key
  PK  transaction_key
  amount

Custom Snapshot
  PK  month_key
  PK  insured_party_key
  PK  employee_key
  PK  automobile_key
  PK  coverage_key
  PK  policy_key
  PK  status_key
  written_premium
  earned_premium
  primary_limit
  primary_deductible
  number_of_transactions
  auto_replacement_value

Custom Transactions
  PK  time_key
  PK  insured_party_key
  PK  employee_key
  PK  automobile_key
  PK  coverage_key
  PK  policy_key
  PK  transaction_key
  amount

Claim Transaction Fact
  PK  time_key
  PK  insured_party_key
  PK  employee_key
  PK  covrd_item_key
  PK  coverage_key
  PK  policy_key
  PK  claimant_key
  PK  claim_key
  PK  third_party_key
  PK  claim_trans_key
  amount

Claim Snapshot
  PK  month_key
  PK  insured_party_key
  PK  employee_key
  PK  covrd_item_key
  PK  coverage_key
  PK  policy_key
  PK  claim_key
  PK  claim_status_key
  reserve_amount
  paid_this_month
  received_this_month
  number_of_transactions

Page 106: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

Manufacturing (1)

Component_use_facts
  PK,FK1  component_key
  PK,FK2  prod_key
  PK,FK3  pl_key
  PK,FK4  time_key
  usage_quantity

components
  PK  component_key
  component_part_number
  component_name
  component_description
  category
  unit_of_measure

Product
  PK  Prod_key
  Model_Number
  Family
  Line
  CPU
  HD_size

Production Line
  PK  PL_key
  Line_name
  Facility
  Country
  LineType

Time
  PK  Time_key
  Day_of_week
  Month
  Year
  Date
  Quarter

Page 107: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

Manufacturing (2)

Production_facts
  PK,FK1  pl_key
  PK,FK2  prod_key
  PK,FK3  time_key
  units_produced_qty

Production Line
  PK  PL_key
  Line_name
  Facility
  Country
  LineType

Product
  PK  Prod_key
  Model_Number
  Family
  Line
  CPU
  HD_size

Time
  PK  Time_key
  Day_of_week
  Month
  Year
  Date
  Quarter

Page 108: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

Manufacturing (3)

Activity_facts
  PK,FK3      pl_key
  PK,FK2,FK4  time_key
  PK,FK1      activity_key
  units_produced_qty

Activity
  PK  Activity_key
  Activity
  Category

Month
  PK  Time_key
  month_name
  month_number
  quarter
  year

Production Line
  PK  PL_key
  Line_name
  Facility
  Country
  LineType

Page 109: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

Inventory

Inventory Transaction Fact
  PK,FK3  time_key
  PK,FK4  product_key
  PK,FK2  warehouse_key
  PK,FK1  transaction_key

Advanced Inventory Snapshot
  PK,FK2  time_key
  PK,FK3  product_key
  PK,FK4  warehouse_key

Transaction
  PK  transaction_key

Time
  PK  time_key

Product
  PK  product_key

Warehouse
  PK  warehouse_key

Page 110: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

Deliveries

Delivery Snapshot
  PK  order_key
  PK  product_key
  PK  warehouse_key
  PK  mfgr_key
  PK  first_received_key
  PK  last_received_key
  PK  first_inspect_key
  PK  first_auth_key
  PK  first_shipment_key
  PK  last_shipment_key
  PK  first_return_key
  qty_ordered
  qty_received
  qty_inspected
  qty_returned_to_vend
  qty_placed_in_inv
  qty_auth_to_sell
  qty_picked
  qty_boxed
  qty_shipped
  qty_returned_by_cust
  qty_returned_to_inv
  qty_damaged
  qty_lost
  qty_written_off
  value_at_unit_cost
  value_at_orig_selling_price
  value_at_last_selling_price
  value_at_avg_selling_price
  PO_number
  PO_line_number

Page 111: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

Drilling Across The Value Chain

[Diagram]
Fact tables: Manufacturing Inventory, Manufacturing Shipments, Distribution Inventory, Distribution Shipments, Store Inventory, Store Sales
Dimensions: Time, Product, Customer

Page 112: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

Banking

Custom Checking
  PK,FK2  account_key
  PK,FK3  product_key
  PK,FK4  branch_key
  PK,FK5  household_key
  PK,FK6  status_key
  PK,FK1  month_key
  primary_balance
  transaction_count
  direct_deposits
  overdraft_limit
  minimum_balance
  service_charge
  interest_paid
  ATM_transaction_count
  branch_ATM_transaction_count
  days_below_minimum_balance
  days_overdrawn
  account_count

Custom Mortgage
  PK,FK1  month_key
  PK,FK2  account_key
  PK,FK5  product_key
  PK,FK3  branch_key
  PK,FK4  household_key
  PK,FK6  status_key
  primary_balance
  transaction_count
  interest_paid
  property_value
  delinquent_count
  bad_check_count
  account_count

Household Facts
  PK,FK1  month_key
  PK,FK3  product_key
  PK,FK4  branch_key
  PK,FK5  household_key
  PK,FK6  status_key
  PK,FK2  account_key
  primary_balance
  transaction_count
  account_count

Month
  PK  month_key
  month
  year
  fiscal_quarter

Account
  PK  account_key
  primary_surname
  secondary_surname
  account_address
  account_city
  account_state
  account_zip
  date_opened
  primary_age
  primary_sex
  primary_marital

Product
  PK  product_key
  product_description
  type
  category

Branch
  PK  branch_key
  branch_name
  branch_address
  branch_city
  branch_state
  branch_zip

Household
  PK  household_key
  household_head_name
  household_address
  household_city
  household_state
  household_zip
  household_income
  household_type

Status
  PK  status_key
  status_description
  status_reason
  new_account_flag
  closed_account_flag

Page 113: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

Website Analysis

• Going beyond log file statistics
– WebTrends, Analog, NetTracker etc.
• Time-series analysis is necessary
– What happened during a site visit?
– Why was the visit abandoned?
– What is the effectiveness of a targeted promotion?
– What is the trend in the above over time?
• The Clickstream Data Warehouse

Page 114: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

So, what’s new?

• From Sales Facts to User Activity Facts
• (How) do we know the User (customer)?
• Sessions
• Pages
• Events
• Probable cause of the visit/sale
• Where did the customer come from?
• The World Wide Web (24*7*365, multiple languages, cultures, timezones …)
• From CRM to eRM – electronic Relationship Management

Page 115: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

Website Design also matters

• The clickstream data warehouse needs information about:
– Pages
– Cookies
– Users
– Clickable objects
– URLs
– Events
– Etc.
• This content and event information must be presentable to end-users!
• Some can be obtained by using log file processors
• It is likely you will have to do a considerable amount of data clean-up, if the website is not well-designed!

Page 116: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

Web effects on ”old” dimensions

• Calendar:
– Local time, global time
• Time of Day
• Customer:
– Becoming a user dimension
– Cookies
– Named users
– Integration with e.g. CRM
– The global perspective
• Promotion:
– Must take web advertising into consideration
– Dynamic, individual, targeted ”special offers”, maybe in real time

Page 117: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

New Web Dimensions

• Page
• Event
– e.g. Open Page, Refresh Page, Click Link, Enter Data
• Session
– Type of session, context/mission, success etc.
• Referral
– How did the customer/visitor arrive?

Page 118: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal


Choosing the grain

• Page event

• Session

• Aggregated levels

Page 119: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

E-business

Page_Events
  PK,FK11  Page_Key
  PK,FK10  Causal_Key
  PK,FK9   Session_Key
  PK,FK1   Universal_Date_Key
  PK,FK7   Universal_Time_Key
  PK,FK2   Local_Date_Key
  PK,FK8   Local_Time_Key
  PK,FK3   Customer_Key
  PK,FK4   Event_Key
  PK,FK6   Referrer_Key
  PK,FK5   Product_Key
  Session_ID
  Page_Seconds
  Units_Ordered
  Order_Dollars

Calendar
  PK  date_key
  Many_more_fields

Causal
  PK  Causal_Key
  Causal_Type
  Price_Treatment
  Newspaper_Ad_Type
  Web_Ad_Type
  Radio_Ad_Type
  TV_Ad_Type
  Store_Display_Type
  Mfgr_Promo_Type
  Other_Causal_Event

Customer
  PK  Customer_Key
  Customer_Type
  ISP_Address_1
  ISP_Address_2
  ISP_Address_3
  Cookie_ID
  Customer_ID
  Customer_Identifier
  Many_more_fields

Event
  PK  Event_Key
  Event_Type
  Event_Content

Page
  PK  Page_Key
  Page_Source
  Page_Function
  Page_Template
  Item_Type
  Graphics_Type
  Animation_Type
  Sound_Type
  Page_File_Name

Product
  PK  Product_key
  Many_more_fields

Referrer
  PK  Referrer_Key
  Referral_Type
  Referring_URL
  Referring_Site
  Referring_Domain
  Search_Type
  Search_Specification
  Target

Session
  PK  Session_Key
  Session_Type
  Local_Context
  Overall_Session_Context
  Action_Sequence
  Success_Status
  Customer_Status

Time_Of_Day
  PK  Time_Key
  More_fields

Page 120: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal


The Data Webhouse

• Coined by Ralph Kimball in ”The Data Webhouse Toolkit”, Ralph Kimball and Richard Merz, Wiley 2000, ISBN 0-471-37680-9

• The Data Warehouse on the Web

• The Data Warehouse as the driver of the website (”closing the loop”)

Page 121: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

The Birth of the Data Webhouse

Page 122: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

Synonyms

• Basic function
– Avoids multiple join paths between two tables
– Makes schema more legible and thus less prone to query formulation errors
– Use view to rename columns for ease of use with query tools
• Rationale over separate dimension tables
– Reduces need to duplicate data
– Simplifies administration
• CREATE SYNONYM FIRST_OPEN_TIME FOR TIME
– Not all databases support this (and the exact syntax varies) - you may use views instead
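The view-based alternative can be sketched with SQLite, which has no synonym statement. The table, view and column names below are hypothetical; the sketch only shows one physical period table playing two roles (order date and ship date), with each view renaming the columns for its role.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE period (period_key INTEGER PRIMARY KEY, calendar_date TEXT);
CREATE TABLE order_facts (order_date_key INTEGER, ship_date_key INTEGER, amount REAL);

INSERT INTO period VALUES (1, '1997-09-01'), (2, '1997-09-03');
INSERT INTO order_facts VALUES (1, 2, 99.5);

-- One view per role; each renames the columns so query tools see distinct names.
CREATE VIEW order_date AS
    SELECT period_key AS order_date_key, calendar_date AS order_calendar_date
    FROM period;
CREATE VIEW ship_date AS
    SELECT period_key AS ship_date_key, calendar_date AS ship_calendar_date
    FROM period;
""")

rows = conn.execute("""
    SELECT o.order_calendar_date, s.ship_calendar_date, f.amount
    FROM order_facts AS f
    JOIN order_date AS o ON o.order_date_key = f.order_date_key
    JOIN ship_date AS s ON s.ship_date_key = f.ship_date_key
""").fetchall()

print(rows)  # [('1997-09-01', '1997-09-03', 99.5)]
```

The period data is stored once, yet each role has its own join path to the fact table.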

Page 123: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

Typical synonyms

• City tables
– origin and destination (travel facts)
• Period tables
– order date and ship date
• Customer tables
– customers for ship to and bill to

Page 124: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

Factless Fact Tables

Promotion Coverage Fact Table
  time_key
  product_key
  store_key
  promo_key

Dimensions: Time, Product, Store, Promotion
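A coverage table of this kind records what was on promotion regardless of whether anything sold, so it can answer questions the sales fact table alone cannot. A minimal sketch with made-up key values (the tuples and variable names are hypothetical):

```python
# Hypothetical coverage rows: (time_key, product_key, store_key, promo_key).
promotion_coverage = {
    (1, 10, 100, 7),
    (1, 11, 100, 7),
    (1, 12, 101, 7),
}

# Sales fact rows reduced to the same four key columns.
sales_keys = {
    (1, 10, 100, 7),
    (1, 12, 101, 7),
}

# Products that were on promotion but did not sell: present in the coverage
# table, absent from the sales facts.
promoted_but_unsold = sorted(promotion_coverage - sales_keys)

print(promoted_but_unsold)  # [(1, 11, 100, 7)]
```

In SQL the same question is typically expressed as an anti-join or a NOT EXISTS against the sales fact table.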

Page 125: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

A ”Factless” snapshot table

HH Demographics
  PK,FK7  status_key
  PK,FK2  month_key
  PK,FK4  address_key
  PK,FK3  product_key
  PK,FK5  branch_key
  PK,FK6  household_key
  PK,FK1  demog_key
  primary_balance
  transaction_count
  account_count

Month
  PK  month_key
  month
  year
  fiscal_quarter

Product
  PK  product_key
  product_description
  type
  category

Branch
  PK  branch_key
  branch_name
  branch_address
  branch_city
  branch_state
  branch_zip

Household
  PK  household_key
  household_head_name
  household_address
  household_city
  household_state
  household_zip
  household_income
  household_type

Status
  PK  status_key
  status_description
  status_reason
  new_account_flag
  closed_account_flag

Demographics
  PK  demog_key
  age_band
  sex
  income_band
  marital
  children

Address
  PK  address_key
  primary_surname
  secondary_surname
  account_address
  account_city
  account_state
  account_zip
  date_opened

Page 126: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

The Effect of Multidimensional Design on the ETL system

• Existence checks

• Denormalisation (N-way, multilevel merge)

• Lookup values

• Deal with missing values for foreign keys

Page 127: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

Missing Values

What are the problems?
• Default values in source data
• Dummy values in source data
• Missing values in source data

What can be done?
• Fix the source system(s) – it is a data quality issue
• Try to infer values
• Use a dummy value
• More than one dummy value? (Unknown, Unavailable, Not applicable)
• Missing dimension keys? (The dummy dimension record: Cust.No. = –1, Name = ”Unknown”)
• What about dummy dates? And numeric values?

(See ”Dealing with Missing Values In The Data Warehouse” by John Hess, Stonebridge Technologies 1998, originally from www.sbti.com, now to be found on www.olap.it/articles)
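The dummy dimension record can be sketched as a lookup step in the load process. The customer numbers, the `customer_lookup` mapping and the `resolve_customer_key` function below are hypothetical; the only element taken from the slide is the dummy record with key -1 and name "Unknown".

```python
UNKNOWN_KEY = -1  # the dummy dimension record from the slide

# Hypothetical customer dimension: surrogate key -> name.
customer_dim = {UNKNOWN_KEY: "Unknown", 1: "Alice", 2: "Bob"}

# Hypothetical lookup from source customer number to surrogate key.
customer_lookup = {"C-001": 1, "C-002": 2}


def resolve_customer_key(source_id):
    """Map a source customer number to a dimension key, never to NULL."""
    if source_id is None or source_id.strip() == "":
        return UNKNOWN_KEY
    return customer_lookup.get(source_id.strip(), UNKNOWN_KEY)


incoming = ["C-001", None, "", "C-999"]
keys = [resolve_customer_key(s) for s in incoming]

print(keys)                             # [1, -1, -1, -1]
print([customer_dim[k] for k in keys])  # ['Alice', 'Unknown', 'Unknown', 'Unknown']
```

Every fact row then joins to some dimension row, so referential integrity holds and the missing cases show up in reports under an explicit label. Separate dummy records for "Unavailable" or "Not applicable" would follow the same pattern with other reserved keys.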

Page 128: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

The 38 subsystems of ETL!

From the new book ”The Data Warehouse ETL Toolkit”, Ralph Kimball and Joe Caserta, Wiley 2004, ISBN 0764567578:

• Extract
• Change data capture
• Data profiling
• Data cleansing
• Data conformer
• Audit dimension
• Quality screener
• Error event handler
• Surrogate key creation
• Slowly Changing Dim’s
• Late arriving dim’s
• Fixed hierarchy dim’s
• Variable hierarchy dim’s
• Multivalued dim’s
• Junk dimensions
• Facttable transaction load
• Periodic snapshot load
• Accum. Snapshot load
• Surrogate key pipeline
• Late arrivals handler
• Aggregate builder
• Cube builder
• Partition builder
• Dimension manager
• Facttable provider
• Job scheduler
• Workflow monitor
• Recovery and restart
• Parallelizing system
• Problem escalation
• Version control
• Version migration
• Lineage & dependencies
• Compliance reporter
• Security
• Backup
• Metadata repository
• Project management

Page 129: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal


Documentation

• Data Mart structure

• Logical Model

Read The Data Warehouse Lifecycle Toolkit for the complete picture!

Page 130: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal


Advanced Design Issues

When the going gets tough,

the tough get going!

Page 131: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal


Tricky stuff

• Monster Dimensions

• Multivalued Dimensions

• Multilevel Hierarchies

• Really Difficult Business Questions

Page 132: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal


Monster Dimensions

• Some dimension tables grow VERY big, eg. customers, who also have many attributes

• Some “demographic” attributes change often (income, number of children etc.)

• Some “demographic” attributes are not used very much

Result: A lot of wasted space, indexes and complexity in keeping up to date

Page 133: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

Dealing with Monster Dimensions

Customer dimension
  customer_key
  name
  address
  birth date
  other stable attributes

Demographics dimension
  demographics_key
  income_band
  purchases_band
  no_children
  education_level
  etc.
  Note: Contains all possible combinations, predefined and preloaded!

Sales facts
  Customer_key
  Demographics_key
  ...
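Preloading all possible combinations amounts to taking the cross product of the band values. In this minimal sketch the band values are invented for illustration; only the attribute names come from the slide.

```python
from itertools import product

# Hypothetical band values for each demographic attribute.
bands = {
    "income_band": ["low", "medium", "high"],
    "purchases_band": ["few", "many"],
    "no_children": ["0", "1-2", "3+"],
    "education_level": ["basic", "higher"],
}
attribute_names = list(bands)

# One dimension row per combination, with a generated demographics_key.
demographics_dim = {
    key: dict(zip(attribute_names, combo))
    for key, combo in enumerate(product(*bands.values()), start=1)
}

# Reverse lookup used at load time: attribute combination -> demographics_key.
key_for = {tuple(row.values()): key for key, row in demographics_dim.items()}

print(len(demographics_dim))                     # 36
print(key_for[("low", "few", "0", "basic")])     # 1
print(key_for[("high", "many", "3+", "higher")]) # 36
```

The row count is the product of the band counts (3 × 2 × 3 × 2 here), so banding continuous values such as income is what keeps the table small. When a customer's demographics change, only the key on new fact rows changes; the customer dimension is untouched.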

Page 134: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

Multivalued Dimensions

Healthcare billing fact:
• Date
• Patient
• Doctor
• Location
• Service performed
• Diagnosis
• Payer
• etc.

Do people only have one illness at a time?

Page 135: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal


What are the alternatives?

• Forget the dimension!

• Choose one value (”Primary Diagnosis”)

• Fixed number of diagnoses

• Going M:M?

Page 136: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

Helper Tables

Bill fact
  ...
  diagnosis_group_key
  ...

Diagnosis Group (helper table)
  diagnosis_group_key
  diagnosis_key
  weight_factor

Diagnosis Dimension
  diagnosis_key
  description
  type
  category

Other dimensions: Payer, Patient, Doctor, Location, Service, Day

Page 137: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

Solving a Multivalued Dimension with a Helper Table

• Use only when absolutely necessary
• Weight factors are usually of equal size within a group and should always add up to one
• May use a view to hide the helper table
• Healthcare, Retail Banks, SIC-codes
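The role of the weight factor can be shown with a small sketch. The diagnoses, amounts and function names are made up; the column names follow the helper table on the previous slide.

```python
from collections import defaultdict

# Hypothetical helper rows: (diagnosis_group_key, diagnosis_key, weight_factor).
diagnosis_group = [
    (1, "flu", 0.5),
    (1, "asthma", 0.5),
    (2, "flu", 1.0),
]

# Hypothetical bill facts: (diagnosis_group_key, billed_amount).
bill_facts = [(1, 200.0), (2, 100.0)]


def weights_are_valid(groups, tolerance=1e-9):
    """Check that the weight factors of every group add up to one."""
    totals = defaultdict(float)
    for group_key, _, weight in groups:
        totals[group_key] += weight
    return all(abs(total - 1.0) <= tolerance for total in totals.values())


def billed_by_diagnosis(facts, groups):
    """Allocate each bill across its diagnoses using the weight factors."""
    allocated = defaultdict(float)
    for group_key, amount in facts:
        for g_key, diagnosis, weight in groups:
            if g_key == group_key:
                allocated[diagnosis] += amount * weight
    return dict(allocated)


result = billed_by_diagnosis(bill_facts, diagnosis_group)
print(weights_are_valid(diagnosis_group))  # True
print(result)                              # {'flu': 200.0, 'asthma': 100.0}
```

Because the weights in each group sum to one, the allocated amounts add back up to the total billed (300.0 here). Joining through the helper table without the weights would count the first bill once per diagnosis.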

Page 138: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

Customers’ Hierarchies

The Problem is:

The Customer decides how many levels there are!!!

Page 139: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

The Customer Hierarchy Model

customer_key
name
address
base_org (always populated)
level5_org
level4_org
level3_org
level2_org (populated if 2 or more)
top_org (always populated)
other attributes

Page 140: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

Using a Helper Table

Bill fact
  ...
  customer_key
  ...

Customer_tree_path (helper table)
  parent_customer_key
  child_customer_key
  depth_from_parent
  lowest_flag
  topmost_flag

Customer Dimension
  customer_key
  etc.

Other dimensions: Service, Seller, Day

The helper table contains one record for each separate path from each node to itself and to every node below it
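The contents of such a helper table can be generated from a plain parent/child list. The sketch below assumes a small made-up organisation without cycles. The reading of the two flags (lowest_flag marks a child that is a leaf, topmost_flag marks a parent that is the root) is an assumption, as is the `build_tree_path` function itself.

```python
# Hypothetical organisation: child -> parent. "A" is the top of the tree.
parent_of = {"B": "A", "C": "A", "D": "B"}
customers = ["A", "B", "C", "D"]


def build_tree_path(customers, parent_of):
    """One row per path from every node to itself and to each node below it."""
    has_children = set(parent_of.values())
    rows = []
    for child in customers:
        depth, ancestor = 0, child
        while ancestor is not None:
            rows.append({
                "parent_customer_key": ancestor,
                "child_customer_key": child,
                "depth_from_parent": depth,
                "lowest_flag": child not in has_children,  # assumed: child is a leaf
                "topmost_flag": ancestor not in parent_of, # assumed: parent is the root
            })
            ancestor = parent_of.get(ancestor)  # walk one level up
            depth += 1
    return rows


tree_path = build_tree_path(customers, parent_of)

# Everything at or below "A", and only its immediate subsidiaries.
below_a = sorted(r["child_customer_key"] for r in tree_path
                 if r["parent_customer_key"] == "A")
immediate = sorted(r["child_customer_key"] for r in tree_path
                   if r["parent_customer_key"] == "A" and r["depth_from_parent"] == 1)

print(len(tree_path))  # 8
print(below_a)         # ['A', 'B', 'C', 'D']
print(immediate)       # ['B', 'C']
```

Constraining on parent_customer_key and joining child_customer_key to the fact table rolls up a whole subtree with an ordinary equi-join, whatever the depth. The row count grows with the depth of the tree, not just with the number of nodes.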

Page 141: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

Pros and cons

• Works like a normal dimension constraint
• Use depth_from_parent to e.g. get only immediate subsidiaries
• Use lowest_flag to get only leaf nodes
• Reversing the joins will take you upwards
• Maybe add begin_effective_date and end_effective_date, but be careful
• Maybe add weighting factor to support partially owned subsidiaries
• May grow very big, quickly!

Page 142: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

The bigger picture

• Classification of analytical applications

• Data Mining besides SQL and OLAP

• International issues

• Recommendations

Page 143: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

Really Difficult Business Questions

• Simple Constraints

• Simple Subqueries

• Correlated Subqueries

• Simple Behavioral Queries

• Derived Behavioral Queries

• Progressive Subsetting Queries

• Classification Queries

Page 144: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

1) Simple Constraint

• Constraints against literal constants:
– Show the sales of candy products in September 1997

ROLAP / SQL MOLAP HOLAP

Page 145: TF Informatik 1 Design of Multidimensional Data Models for Data Warehouses and OLAP Thomas Frisendal

145

TF

Informatik

2) Simple Subquery

• Constraints against a global value found in the data:– Show the sales of candy products in September

1997 in those stores that had above average sales of candy products

ROLAP / SQL MOLAP HOLAP
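One way this might be written in SQL, sketched with sqlite3. To keep it short, the example assumes a hypothetical pre-joined table `candy_sales` that is already restricted to candy products, and it reads "above average" as "above the average of the per-store totals".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical pre-joined table: candy sales per store and month.
CREATE TABLE candy_sales (store_key INTEGER, month TEXT, dollars_sold REAL);
INSERT INTO candy_sales VALUES
    (1, '1997-08', 10), (1, '1997-09', 30),
    (2, '1997-08', 40), (2, '1997-09', 60),
    (3, '1997-08', 20), (3, '1997-09', 20);
""")

rows = conn.execute("""
    SELECT store_key, SUM(dollars_sold)
    FROM candy_sales
    WHERE month = '1997-09'
      AND store_key IN (
          -- Stores whose total exceeds one global value:
          -- the average of the per-store totals.
          SELECT store_key
          FROM candy_sales
          GROUP BY store_key
          HAVING SUM(dollars_sold) > (
              SELECT AVG(store_total)
              FROM (SELECT SUM(dollars_sold) AS store_total
                    FROM candy_sales
                    GROUP BY store_key) AS totals
          )
      )
    GROUP BY store_key
    ORDER BY store_key
""").fetchall()
```

The inner average is computed once and does not depend on the outer row, which is what separates this from the correlated case.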

3) Correlated Subquery

• Constraints against a value defined by each output row:
  – Show the sales of candy products for each month of 1997 in those stores that had above average sales of candy in that month

ROLAP / SQL MOLAP HOLAP
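A sketch of the correlated form in SQL via sqlite3. The `candy_sales` table is hypothetical and is assumed to hold one already aggregated row per store and month.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical table: one row per store and month, candy only.
CREATE TABLE candy_sales (store_key INTEGER, month TEXT, dollars_sold REAL);
INSERT INTO candy_sales VALUES
    (1, '1997-08', 50), (2, '1997-08', 40), (3, '1997-08', 15),
    (1, '1997-09', 30), (2, '1997-09', 60), (3, '1997-09', 20);
""")

rows = conn.execute("""
    SELECT s.month, s.store_key, s.dollars_sold
    FROM candy_sales s
    WHERE s.dollars_sold > (
        -- Re-evaluated for each outer row: the average for that row's month.
        SELECT AVG(m.dollars_sold)
        FROM candy_sales m
        WHERE m.month = s.month
    )
    ORDER BY s.month, s.store_key
""").fetchall()
```

Store 1 qualifies in August but not in September, because the threshold changes with each month.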

4) Simple Behavioral Query

• Constraints against values resulting from an exception report or a complex series of queries that isolate desired behavior:
  – Show the sales of those candy products in September 1997 whose household penetration for our grocery chain in the 12 months prior to September was more than two standard deviations less than the household penetration of the same products across our 10 biggest retail competitors

ROLAP / SQL MOLAP HOLAP

()

?

5) Derived Behavioral Query

• Constraints against values found in set operations on more than one complex exception report or series of queries:
  – Show the sales of those candy products identified in example 4, and which also experienced a merchandise return rate of more than two standard deviations greater than our 10 biggest retail competitors (intersection of two behavioral queries)

ROLAP / SQL MOLAP HOLAP

()

6) Progressive Subsetting Query

• Constraints against values, as in number 4, but temporally ordered so that membership in an exception report is dependent on membership in a previous exception report:
  – Show the sales of those candy products in example number 4 that were similarly selected in August 1997 but were not similarly selected in either June or July 1997

ROLAP / SQL MOLAP HOLAP

()
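Once each exception report has been reduced to a set of product keys, query types 5 and 6 come down to set algebra. The sketch below uses Python sets with made-up keys; the behavioral study that would produce these sets is not shown.

```python
# Hypothetical results of the example 4 study, run once per month:
# the product keys selected in each month.
selected = {
    "1997-06": {101, 102},
    "1997-07": {103},
    "1997-08": {102, 104, 105},
    "1997-09": {102, 104, 105, 106},
}
# Hypothetical second exception report: products with a high return rate.
high_return_rate = {104, 106, 107}

# 5) Derived behavioral query: intersection of two exception reports.
derived = selected["1997-09"] & high_return_rate

# 6) Progressive subsetting: selected in September and in August,
#    but in neither June nor July.
progressive = (selected["1997-09"] & selected["1997-08"]) - (
    selected["1997-06"] | selected["1997-07"]
)
```

The resulting key sets would then be used as a constraint on the sales facts, in the same way as any other list of product keys.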

7) Classification Queries

• Constraints on values that are the results of classifying records against a set of defined clusters using nearest neighbor and fuzzy matching logic:
  – Show the percentage of low-calorie candy sales contained in the 1000 market baskets whose content most closely matches a young, health-conscious family profile

ROLAP / SQL MOLAP HOLAP

Challenges

• Most query tools do not support the more complex query types

• You may have to build in support for the ones your users need

Multi-step approach (SQL)

[Diagram: a series of queries producing a result table]

Build result table(s) that contain only the keys (and maybe also time information) of those objects that display the desired special behavior(s)
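A small sketch of the multi-step idea using sqlite3. The fact table layout, the sample rows and the 10 % return-rate threshold are all invented; a real behavioral study would normally need several queries in step 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales_fact (
    product_key INTEGER, month TEXT, units_sold INTEGER, units_returned INTEGER
);
INSERT INTO sales_fact VALUES
    (1, '1997-08', 100,  2), (1, '1997-09', 120,  3),
    (2, '1997-08', 100, 30), (2, '1997-09',  80, 20),
    (3, '1997-08',  50,  1), (3, '1997-09',  10,  1);
""")

# Step 1: isolate the behavior and keep only the keys in a result table.
# The 10 % threshold is an arbitrary stand-in for a real behavioral study.
conn.execute("""
    CREATE TABLE high_return_products AS
    SELECT product_key
    FROM sales_fact
    GROUP BY product_key
    HAVING 1.0 * SUM(units_returned) / SUM(units_sold) > 0.10
""")
behavior_keys = conn.execute(
    "SELECT product_key FROM high_return_products ORDER BY product_key"
).fetchall()

# Step 2: use the result table as an ordinary constraint on the fact table.
rows = conn.execute("""
    SELECT f.product_key, SUM(f.units_sold)
    FROM sales_fact f
    JOIN high_return_products b ON b.product_key = f.product_key
    WHERE f.month = '1997-09'
    GROUP BY f.product_key
    ORDER BY f.product_key
""").fetchall()
```

Because the result table holds nothing but keys, it can be joined to the fact table like any other dimension.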

Extended ROLAP or MOLAP for Behavioral Analysis

Sales facts: Time_key, Product_key, Customer_key, Store_key, Promotion_key, Ticket_number, Line_number, Units_sold, Dollars_sold

Regular dimensions: Time dimension, Product dimension, Customer dimension, Store dimension, Promotion dimension

Special behavior dimension(s) *): Product_key, defined by complex behavior study

*) May be hidden in views

Preparing for Data Mining

• Rigorous quality assurance on the values that you are going to mine
• Supply values as text, not codes
• Eliminate context dependencies
• Flag normal, abnormal, out of bound or impossible facts
• Mask out random or noise values
• Uniform treatment of NULL/missing/unknown
• Use Status dimensions (e.g. ”Customer about to cancel” etc.)
• Find training, test and evaluation sets
• Supply computed values (e.g. Profit)
• Band continuous values
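Two of the points above, banding continuous values and treating missing values uniformly, can be sketched in a few lines of Python. The `band` helper, the age limits and the labels are hypothetical choices for the example.

```python
def band(value, edges, labels, missing="Unknown"):
    """Map a continuous value to a text band; None always gets the same label.

    `labels` must have one more entry than `edges`; each edge is the
    exclusive upper limit of the band with the same index.
    """
    if value is None:
        return missing
    for upper, label in zip(edges, labels):
        if value < upper:
            return label
    return labels[-1]

# Hypothetical age bands.
AGE_EDGES = [18, 35, 65]
AGE_LABELS = ["Under 18", "18-34", "35-64", "65 and over"]

banded = [band(age, AGE_EDGES, AGE_LABELS) for age in [5, 18, 40, 70, None]]
```

The bands are supplied as text rather than codes, in line with the second bullet.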

Data Mining Algorithms

• Clustering
• Decision Trees
• Neural Networks
• Statistics
• Fuzzy Logic
• Genetic Algorithms
• Ants
• Hybrids

Integrated Data Mining

• The Multidimensional Model is ideal as a source for Data Mining

• ROLAP is necessary for time-series / sequential analysis

• Ideal case for integration of software tools:
  – Microsoft Analysis Services 2000
  – Oracle 9i
  – Others?

Microsoft SQL Server 2000 Analysis Services

International Issues

• Languages
• Alphabets and character sets
• Names
• Addresses
• Numbers
• Telephone numbers
• Currencies
• Time of day
• Calendars
• Handling unsupported characters
• Collating sequences

See: ”The Data Webhouse Toolkit”, Ralph Kimball and Richard Merz, Wiley 2000, ISBN 0-471-37680-9

Comments on MS SQL Server 2005

• Unified dimensional model:
  – Multivalued dimensions
  – Possible to have many flat dimensions
  – No need to build cubes; can put a layer on top of a non-multidimensional schema

• But:
  – Users need the ease of use
  – Machines need the speed
  – We will see just how well this works in a short period of time

The biggest challenge:

Data Quality

What is Data Quality?

• Accuracy
  – Does the data accurately represent reality?
• Integrity
  – Is the structure of data and relationships among entities and attributes maintained consistently?
• Consistency
  – Are data elements consistently defined and understood?
• Completeness
  – Is all necessary data present?
• Validity
  – Do data values fall within acceptable ranges defined by the business?
• Timeliness
  – Is data available when needed?
• Accessibility
  – Is the data easily accessible, understandable, and usable?

Data quality and integration issues

• Legacy data issues

• Data accessibility & availability

• Insufficient time to analyse

• Inaccurate Metadata, documentation

• Lack of resources

• Disparate systems

• Data ownership issues

• Semantic differences

• Structure violations

• Rule violations

[Diagram: Data Warehouse surrounded by its sources: CRM, home-grown applications, web applications, ERP and legacy applications]

Why do data integration projects fail?

• The source data is not fully understood
• Complexity is underestimated
• Planning is actually guesswork
• Systems do not join as expected
• Delays when analysts/programmers need to interpret results
• Poor data analysis leads to complex development cycle and unpredictable rework
• Problems are uncovered ad-hoc and late
• Manual analysis on samples is time consuming, laborious and inaccurate
• Full volume testing is done too late

Data profiling & analysis

• Data profiling & analysis is critical in order to
  – Understand the scope and nature of the problem
  – Determine success criteria
  – Plan accurately
• Automated tools are available
• Do it right the first time
• Then keep on doing it!

Examples of how to find data quality problems using Data Profiling

• How do you identify those customer records where values are missing or incomplete?
• Are the product codes correct in my order entry system?
• Will my data actually integrate?
• Are all the order shipment dates correct?
• Are my supplier details correct?
• How do I find misspellings of any attribute values?
• What about redundant data – how do I find it?
• Are the relationships held within my data consistent?
• Will my data support the new business requirements?
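As a rough idea of what a profiling pass looks at, the sketch below counts missing values and lists values that occur only once, which is a cheap first hint at misspellings. The `profile_column` helper and the customer rows are hypothetical, and commercial profiling tools go much further than this.

```python
from collections import Counter

def is_missing(value):
    """Treat None and blank strings alike as missing."""
    return value is None or str(value).strip() == ""

def profile_column(rows, column):
    """Return simple profile statistics for one column of a list of dicts."""
    values = [row.get(column) for row in rows]
    counts = Counter(v for v in values if not is_missing(v))
    return {
        "rows": len(values),
        "missing": sum(1 for v in values if is_missing(v)),
        "distinct": len(counts),
        # Values seen only once are candidates for misspellings.
        "rare_values": sorted(v for v, n in counts.items() if n == 1),
    }

customers = [
    {"customer_key": 1, "country": "Denmark"},
    {"customer_key": 2, "country": "Denmark"},
    {"customer_key": 3, "country": "Danmark"},  # likely misspelling
    {"customer_key": 4, "country": ""},
    {"customer_key": 5, "country": None},
]
country_profile = profile_column(customers, "country")
```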

Benefits of automated Data Profiling

• Improve Business responsiveness
  – Time to market reduced
• Enhance Data Quality
  – Ensure data is accurate and fit-for-purpose
• Project Planning
  – Early & Accurate
• Reduce Risks
  – Identify ALL data-related issues at the start
• Manage Resources
  – Deliver with less effort
• Reduce Costs
  – Reduce cost of analysis and build

Data Quality is a Business Issue

• The Business owns the data
• Set data quality standards across the company
• Build company-wide metadata knowledge
• Data Quality must be managed

Data Warehouse in the future?

Real-time?

Data Warehouse and the ”Enterprise Nervous System”

• Contemporary Enterprise Information Architecture calls for

– Realtime

– Integration

– Message brokers

– Service Oriented Architectures etc.

• What is the role of the Data Warehouse in this?

• This is based on an actual and recent architecture study for a Danish government body

Characteristics of a mature Data Warehouse

• Vision: One environment, one version of the Truth
• Integration of data from disparate sources
• Refinement of data
• Searching and browsing
• Production data
• Other data (partners, public / purchased information)
• Master data
• The Historical Data Warehouse
• Management Information
• Statistics
• Data Exchange
• Production reporting
• Ad hoc

More characteristics

• Much multidimensionality

• Refined data, both details and aggregations

• 1000+ frequent users

• Also external recipients

• New systems: The interface(s) to the DW should be defined

• Development projects: Decide for either a production system or a Data Warehouse (based on technical feasibility)

• Standardised data model (most often not documented, frequently changed)

• Analysis

• Reporting

• Ad hoc tasks (incl. operational one-time systems)

Components of a mature Data Warehouse

• Database(s)
  – Normalised
  – Multidimensional
• Load processes
  – Predominantly batch
  – Homegrown (COBOL)
  – ETL jobs in e.g. Informatica
• BI tools such as Business Objects
  – Applications
  – Predefined reports
  – Ad hoc environments
• Statistics
  – SAS

Business benefits from a DW

• Holistic view of data across disparate systems

• Integration of data from different sources, including external partners

• History

• Refinement of data

• Presentation of data for business people

• Analysis, incl. decision support

• Reporting

• Ad hoc solutions

• Data foundation for statistics

• Data exchange

”The Enterprise Nervous System”

[Architecture diagram. Labelled components:
– Portal platform: Intranet (internal portal), Internet (external portal), Portlet API, CMS modules, info services, new services
– Infrastructure services: ”single sign-on”, security, administration, etc.; user catalog; OCES certificate handling
– Service and Integration platform (ENS): Integration Broker (integration platform with persistence); Business Activity Monitoring and Workflow management; common services (Web Service, CMS modules, Email/Calendar, Workflow); internal and external ”WS UDDI”
– Systems and data: new systems, legacy systems, external systems; messages; CMS; data and metadata; Data Warehouse; data mining
– Functions: Analysis, Update, Query, Statistics, Report]

The Data Warehouse has served us well

• The concept of a centralised database is maybe not as necessary from now on

• But the practices of Data Warehousing are important:

– Good Data Management

– Good performance

– Data refinement

– Data presentation

• Realtime integration is certainly useful and practical

• Business Intelligence is evolving; Business Activity Monitoring is a natural next step.

• We have learned a lot – now is the time to do it right!

What is going on in data modelling?

[Diagram: a spectrum from weak semantics to strong semantics.
– Levels: Taxonomy (Is Sub-Classification of); Thesaurus (Has Narrower Meaning Than); Conceptual Model (Is Subclass of); Logical Theory (Is Disjoint Subclass of, with transitivity property)
– Formalisms placed along the spectrum: Relational Model, XML; ER; Extended ER; DB Schemas, XML Schema; RDF/S; XTM; UML; OWL; Description Logic; First Order Logic; Modal Logic
– Interoperability levels: Syntactic, Structural, Semantic]

*Source: Leo Obrst, The Mitre Corp.

Taxonomies: The unstructured world of documents

[Diagram: taxonomy tree with the nodes Animalia, Arthropoda, Chordata, Mammalia, Carnivora, Felidae, Panthera, Ursus, Tigris, Leon]

Syntax defined in: Nijssen, G.M. and T.A. Halpin. Conceptual Schema and Relational Database Design - A fact oriented approach. Prentice Hall 1989.

Conceptual model (ORM)

Property labels

t = rdf:type

s = rdfs:subClassOf

d = rdfs:domain

r = rdfs:range

et = rdfsx:collectionElementType

Source: Stephen Cranefield, Journal of Digital Information, Volume 1 Issue 8, Article No. 44, 2001-02-15

RDF based schema for Family

Source: Stephen Cranefield, Journal of Digital Information, Volume 1 Issue 8, Article No. 44, 2001-02-15

UML Class diagrams

<?xml version="1.0"?>
<rdf:RDF xmlns="http://mySite.com/myOntology#"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:owl="http://www.w3.org/2002/07/owl#"
         xml:base="http://mySite.com/myOntology">

  <owl:Class rdf:ID="Person"/>

  <owl:Class rdf:ID="Mother">
    <rdfs:subClassOf rdf:resource="#Person"/>
  </owl:Class>

  <owl:ObjectProperty rdf:ID="hasChild">
    <rdfs:domain rdf:resource="#Mother"/>
    <rdfs:range rdf:resource="#Parent"/>
  </owl:ObjectProperty>

  <owl:Class rdf:ID="Grandmother">
    <rdfs:subClassOf>
      <owl:Class>
        <owl:intersectionOf rdf:parseType="Collection">
          <owl:Class rdf:ID="Female"/>
          <owl:Class rdf:about="#Mother"/>
        </owl:intersectionOf>
      </owl:Class>
    </rdfs:subClassOf>
    <owl:equivalentClass>
      <owl:Class>
        <owl:intersectionOf rdf:parseType="Collection">
          <owl:Class rdf:about="#Mother"/>
          <owl:Restriction>
            <owl:onProperty>
              <owl:ObjectProperty rdf:about="#hasChild"/>
            </owl:onProperty>
            <owl:someValuesFrom>
              <owl:Class rdf:ID="Parent"/>
            </owl:someValuesFrom>
          </owl:Restriction>
        </owl:intersectionOf>
      </owl:Class>
    </owl:equivalentClass>
  </owl:Class>

  <Parent rdf:ID="Anny"/>
  <Mother rdf:ID="Ingeborg">
    <hasChild rdf:resource="#Anny"/>
  </Mother>
</rdf:RDF>

Web Ontology Language (OWL)

• Data Warehouse showed us how to
• Semantics is the key
• Ontologies are the foundation
• Repositories are the technology
• Gartner: Enterprise Information Architecture

• The sponsor is: The need for integration!

Information Management II

Wrap up

Listen to Ralph

• Embed all knowledge of the data in the data
• Stick to one level of dimension tables
• Aggregates should be separate tables
• Use an aggregate navigator (server-side)
• Even better: Use MOLAP or HOLAP
• Stick to simple star schemas
• Properly design conformed dimensions and conformed facts, first!

Listen to me

K.I.S.S.: Keep It Simple, Stupid!

Ralph Kimball’s 20 criteria

(Criterion: Explanation)

Architecture
1. Explicit declaration: Metadata drives multidimensional behavior
2. Conformed dimensions and facts: System supports mix-and-match of technologies etc.
3. Dimensional integrity: Full referential integrity
4. Open aggregate navigation: Complete, transparent, automatic aggregates
5. Dimensional symmetry: No limitations on calculations across the whole model
6. Dimensional scalability: No limits
7. Sparsity tolerance: No limits

Administration
8. Graceful modification: No unload-reloads
9. Dimensional replication: Seamless distribution of standard dimensions
10. Changed dimension notification: Automatic delivery of SCD 1, 2 and 3
11. Surrogate key administration: Completely automated
12. International consistency

Expression
13. Multiple dimension hierarchies: Multiple independent hierarchies in a dimension
14. Ragged-dimension hierarchies: Hierarchies of indeterminate depth
15. Multiple valued dimensions: Many-to-many with allocation factors
16. Slowly changing dimensions: Full SCD support
17. Roles of a dimension: E.g. multiple roles of a time dimension
18. Hot-swappable dimensions: Change (perhaps personalised) dimensions on the fly
19. On-the-fly fact range dimensions: Dynamic, runtime support for banding / bucketing
20. On-the-fly behavior dimensions: Support for subset lists with set algebra

Literature

• ”The Data Warehouse Toolkit – Second Edition”, Wiley 2002, ISBN 0-471-20024-7

• ”The Data Warehouse Lifecycle Toolkit”, Ralph Kimball, Laura Reeves, Margy Ross and Warren Thornthwaite, Wiley 1998, ISBN 0-471-25547-5

• ”The Data Webhouse Toolkit”, Ralph Kimball and Richard Merz, Wiley 2000, ISBN 0-471-37680-9

• ”Data Warehouse Design Solutions”, Christopher Adamson and Michael Venerable, Wiley 1998, ISBN 0-471-25195-X

• ”Microsoft OLAP Solutions”, Erik Thomsen, George Spofford and Dick Chase, Wiley 1999, ISBN 0-471-33258-5

• ”Improving Data Warehouse and Business Information Quality”, Larry P. English, Wiley 1999, ISBN 0-471-25383-9

Useful web addresses

• www.ralphkimball.com (Ralph Kimball)
• www.dwinfocenter.org (Larry Greenfield’s Data Warehouse Info Center)
• www.intelligententerprise.com (Intelligent Enterprise Magazine, previously DBMS magazine; many articles by Ralph Kimball and other fine people, for example: August 97, A Dimensional Modelling Manifesto)
• www.datawarehousing.com (home of DWLIST)
• www.tdwi.org (The Data Warehouse Institute)
• www.iaidq.org (The International Association for Information and Data Quality, IAIDQ)
• http://www.tondering.dk/claus/calendar.html (everything you would ever want to know about calendars!)

What to do first?

• Buy and read Ralph Kimball’s ”The Data Warehouse Toolkit – Second Edition”, Wiley 2002, ISBN 0-471-20024-7

• Buy and read ”The Data Warehouse Lifecycle Toolkit”, Ralph Kimball, Laura Reeves, Margy Ross and Warren Thornthwaite, Wiley 1998, ISBN 0-471-25547-5

• Buy and read ”Data Warehouse Design Solutions”, Christopher Adamson and Michael Venerable, Wiley 1998, ISBN 0-471-25195-X.

• Buy and read ”The Data Warehouse ETL Toolkit”, Ralph Kimball and Joe Caserta, Wiley 2004, ISBN 0-7645-6757-8

• Just Do It!

Thank You!

[email protected]
+45-40 54 83 40 (GSM)
+45-49 70 83 40 (Landline)
04.93.33.88.93 (occasionally)