TF
Informatik
Design of Multidimensional Data Models for Data Warehouses and OLAP
Thomas Frisendal

Who I am
• Freelance Data Base Consultant
• More than 30 years of experience with DBMS’s as a vendor person and as a freelance consultant
• 10 years of experience with Data Warehouse construction
• Have trained more than 350 persons in Multidimensional Modelling and Star Schema Design
• Charter member of the IAIDQ
• Board member of The Data Warehouse Institute Denmark
• Locations: Nordic Countries (principal residence in Denmark) and Côte d’Azur (secondary residence in Antibes)

Acknowledgements
Based on:
• Ralph Kimball’s books: ”The Data Warehouse Toolkit”, Wiley 1996, ISBN 0-471-15337-0, ”The Data Warehouse Lifecycle Toolkit” *), Wiley 1998, ISBN 0-471-25547-5, ”The Data Webhouse Toolkit” **), Wiley 2000, ISBN 0-471-37680-9, and selected parts of Ralph’s published papers
• Some examples from ”Data Warehouse Design Solutions”, Christopher Adamson and Michael Venerable, Wiley 1998, ISBN 0-471-25195-X
• ”Microsoft OLAP Solutions”, Erik Thomsen, George Spofford and Dick Chase, Wiley 1999, ISBN 0-471-33258-5
• Telco example from ”The Official Guide to Informix/Red Brick® Data Warehousing”, M&T Books (IDG) 2000, ISBN 0-7645-4694-5
• DWLIST discussions
• My own practical experiences and readings
*) With Laura Reeves, Margy Ross and Warren Thornthwaite.
**) With Richard Merz

Agenda
• History
• Data Warehousing Objectives and Architecture
• Dimensional Design Basics
• Industry examples
• The Data Webhouse (Clickstream Analysis)
• Advanced Design Issues
• The bigger picture
• Data Quality
• The future of the Data Warehouse
• Literature and web addresses
History
How the Multidimensional Model came into existence

The early years of database
Online!
Pioneers:
• General Electric, IDS, Charlie Bachman
• Rockwell/IBM, IMS
Techniques:
• Hashing, pointers, physical colocation, and concurrency / transaction control

The Vision: The Information System
• Operational
• Tactical
• Strategic

Relational & The Codd & Date Seminars

S#  SNAME  STATUS  CITY
S1  Smith  20      London
S2  Jones  10      Paris
S3  Blake  30      Paris
S4  Clark  20      London
S5  Adams  30      Athens

S#  P#  QTY
S1  P1  300
S1  P2  200
S2  P1  300
S2  P2  400
S4  P2  200

P#  PNAME  COLOR  WEIGHT  CITY
P1  Nut    Red    12      London
P2  Bolt   Green  17      Paris
P3  Screw  Blue   17      Rome
P4  Screw  Red    14      London
P5  Cam    Blue   12      Paris
Relational / SQL
• Database language standard
• The Query Optimizer (automated navigation)
• DBMS’s become commodities
• End-user tools by the hundreds

The Information Warehouse Concept
• Operational
• Tactical
Around 1987: Getting closer

Innovation for Analysts: The Multidimensional Database
[Figure: a cube of Sales with the axes Products, Time periods and Districts]

IT & Business Opportunities: Decision Support Systems
• Operational: On-Line Transaction Processing
• Tactical: On-Line Analytical Processing
• Strategic: Executive Information Systems

OLTP and OLAP: Conflict of Purpose
[Figure: an OLTP E/R model with the entities Customers, Orders, Products, Order-lines, Invoices and Inventory, contrasted with a ”Star Schema” in which the Sales Detail fact table is surrounded by the dimensions Markets, Time (time_period), Products and Customers (age, gender, income level, education). The tables in the star are annotated with row counts ranging from hundreds and thousands of records to millions and billions of records.]

OLAP Categories and Sample Products
Categories:
• ROLAP: Relational DBMS’s with Star Schema support and specific tools
• MOLAP: Multidimensional Databases
• HOLAP: Hybrid OLAP, the combination of the two
Sample products:
• ROLAP: Most RDBMS’s, tools like Microstrategy, Business Objects, Informix Metacube, IBM DB2 OLAP Server, Oracle 9i OLAP Server and many more
• MOLAP: Hyperion Essbase, Applix TM1, Cognos, Microsoft OLAP Services, IBM DB2 OLAP Server
• HOLAP: Microsoft SQL Server with OLAP Services, (IBM DB2 OLAP Server, Hyperion Essbase 7)

ROLAP vs. MOLAP
• Hot debate in the 1990s (”The Shootout at the OLAP Corral”)
• The major strength of ROLAP:
– Large volumes of data (billions of rows, terabytes of data)
• The major weakness of ROLAP:
– Performance on large result sets
• The major strength of MOLAP:
– Fast response times, also on ”large result sets” (aggregated queries)
• The major drawbacks of MOLAP:
– Pre-calculation times
– Scalability (the cubes explode as the volume and complexity increase)
• So, what to do?
– Use both (HOLAP)

The Power of the 2
[Figure: a ROLAP Star Schema combined with MOLAP/HOLAP]
• Smooth, often automatic/transparent, integration
• Detailed levels in ROLAP (many rows)
Two good examples:
- Microsoft SQL Server with OLAP Services
- IBM DB2 OLAP Server

Distributed On-Line Transaction Processing
Analytical Applications in OLAP
It’s here!

Data Warehousing Objectives and Architecture

Data Warehouse Objectives
• Provide one data source for reporting, analysis, and mining
– consistent answers across the organization or organizational unit (“data marts”)

The Feeding System (Extract, Transformation and Load - ETL Tools)
[Figure: Operational systems feed the Data warehouse, possibly through a Data staging area]
Tasks of the feeding system:
• Extracting
• Cleaning
• Standardizing
• Consolidating
• Aggregating
• Transforming
• Key generating
• Reformatting
• and much more

ETL – where the time is spent!
• Do it yourself (SQL, scripts, COBOL etc.)
• ETL tool products (e.g. Informatica)
• 80 % of your time is spent here!
• Buy and read ”The Data Warehouse ETL Toolkit”, Ralph Kimball and Joe Caserta, Wiley 2004, ISBN 0-7645-6757-8

Data Warehouse statistics
• Development time:
– < 6 months: 16 %
– 6-12 months: 32 %
– 12-24 months: 26 %
– 24-60 months: 20 %
– > 60 months: 6 %
• 50 % of all are over 3 years of age (9 % over 10 years)
• 33 % over 1 TB (10 % over 10 TB)
Source: Business Intelligence Journal, Fall 2005

The Points of Measurements in Business Processes
[Figure: a chain of business processes with a Facts star at each of six points of measurement]
Facts are numerical (counts, volumes, money) and represent the key subject areas in business.
Each point of measurement becomes a star.

One, big Data Warehouse
[Figure: six Facts stars inside one warehouse]
One “global” DW:
• Finished goods inventory
• Shipments
• Distribution inventory
• Depletions
• Store inventory
• Sales transactions
Dimensions: Time dimension, Product dimension, Customer dimension, etc.

Anarchistic Data Marts
[Figure: six Facts stars, each in its own separate Data Mart]

Coordinated Data Marts
[Figure: six Facts stars, each in its own Data Mart, together with an Enterprise Data Warehouse (maybe also an Operational Data Store)]

Top 10 reasons for a layer of atomic data in front of data marts
• Coordinated transformations at mart level (consistency)
• Minimized impact on source systems (one extract)
• Temporal integrity across marts
• Single source for cross-subject mining
• Allows smaller marts - with drill-down to atomic level
• Coordinated meta data across entire DW
• Scalability of entire DW (quick population of new marts)
• Facilitates building stars
• Mart recovery
• If volumes are not too high, operational reporting may take place
Source: Doug Laney, Prism Solutions, on DWLIST

The Data Warehouse Bus Architecture
[Figure: six Facts stars, each in its own Data Mart]
• Design: Conformed Dimensions and Conformed Facts
• Physical: Data Staging facilities as necessary

Roles of ROLAP, MOLAP and HOLAP
[Figure: six Facts stars in Data Mart 1 (ROLAP), Data Mart 2 (MOLAP), Data Mart 3 (HOLAP), Data Mart 4 (MOLAP), Data Mart 5 (ROLAP) and Data Mart 5 (MOLAP)]
Data Staging Area in Relational DBMS:
• E/R: May be used for data cleansing
• Star Schema: For all facts and dimensions

Components and structures in Decision Support Systems
• User Interface, i.e. Windows or Web
• Ad hoc OLAP tools
• Own applications developed with OLAP tools
• Off the shelf applications
• Data warehouse database, accessed with SQL, with Metadata
– Relational Star Schemas: Typically on the atomic, event level
– Multidimensional Cube(s): Typically on aggregated levels

A few words on Meta Data
• You need it and it is important
• Read ”Meta Meta Data Data”, Ralph Kimball (www.dbmsmag.com/9803d05.html)
• Comes with client tools, data transformation tools, stand-alone tools, modelling tools, DBMS’s etc.
• Microsoft Repository Open Information Model
• Meta Data Coalition OIM 1.1 (www.mdcinfo.com)
• Parallel work in the OMG (Common Warehouse Metamodel) (www.omg.org/technology/cwm)
• September 2000: MDC integrates into OMG CWM, the two standards become one.

Common Warehouse Metamodel
• Expressed in UML
• XML for metadata interchange (XMI)
• Check www.cwmforum.org
• CWM 1.0 February 2001
• Partners: IBM, Unisys, NCR, Hyperion, Oracle, UBS, Genesis, Dimension EDI
• Supporters: Deere, Sun, HP, Data Access, InLine, Aonix, Hitachi, SAS, Meta Integration, Adaptive
• ”Common Warehouse Metamodel”, John Poole, Dan Chang, Douglas Tolbert, and David Mellor, OMG Press / Wiley 2002, ISBN 0-471-20052-2
• Actual product support? Still maybe a little too early to tell…
Data Warehousing Objectives
• Publish what is important
• Provide the means to find out why
• Promote well-informed decisions
Multidimensional Design Basics
The Art of Constructing the Stars!

Dimensional Design Methodology
• Design begins
– with business requirements gathered from the decision makers and analysts
– and data sourced from
  • the corporation’s operational systems
  • external data sources
• Design requires
– user involvement at all stages
• Design resembles
– a series of successive approximations (3-4 revisions)

Simple E/R model for OLTP
Order_Details: PK OrderDetailID; FK1 OrderID; FK2 ProductID; Quantity; UnitPrice; Discount; FK1 ShippingMethodID
Orders: PK,FK3 ShippingMethodID; PK OrderID; FK1 CustomerID; FK2 EmployeeID; OrderDate; PurchaseOrdNumber; ShipName; ShipAddress; ShipCity; ShipState; ShipPostalCode; ShipCountry; ShipPhoneNumber; ShipDate; FreightCharge; SalesTaxRate
Products: PK ProductID; ProductName; UnitPrice; FK1 BrandID
Customers: PK CustomerID; CompanyName; ContactFirstName; ContactLastName; BillingAddress; City; StateOrProvince; PostalCode; Country; ContactTitle; PhoneNumber; FaxNumber
Employees: PK EmployeeID; FirstName; LastName; Title; EmailName; Extension; WorkPhone
Shipping_Methods: PK ShippingMethodID; ShippingMethod
Payments: PK PaymentID; FK1 OrderID; PaymentAmount; PaymentDate; CreditCardNumber; CardholdersName; CreditCardExpDate; FK2 PaymentMethodID; FK1 ShippingMethodID
Brands: PK BrandID; BrandDescription
Payment_Methods: PK PaymentMethodID; PaymentMethod
EmployeeTerritories: PK,FK1 EmployeeID; PK,FK2 TerritoryID
Territories: PK TerritoryID; TerritoryDescription
TerritoryRegion: PK,FK1 RegionID; PK,FK2 TerritoryID
Region: PK RegionID; RegionDescription
Problems with E/R
• Humans can’t navigate or remember an E/R
• Software can’t navigate an E/R:
– Every path gives a different answer
– The “shortest path” is meaningless
• Bad performance

Do you need a warehouse E/R model?
• Not necessarily
• The data relationships in the enterprise E/R model can suffice

Can I model by subject area with a phased approach?
• Yes
• Models are extensible
• The key is conformable dimensions
• Dimensions can then be shared and are extensible themselves

Heart of the matter
• Business Views
– Must look like the business
– Recognized by business types
– Relevant for business types
• Three design rules
– Simplicity
– Simplicity
– Simplicity
K.I.S.S.!

Design objectives
• The Business View Schema must be readily understood and navigable by the users
• Important information must not be obscured by unimportant detail and complexity
• The implementation(s) must provide rapid response time against large volumes of historical data
• The implementation(s) must be legible and navigable for extract processing & mining

Classic definition, ROLAP
• STAR schema
– A relational schema organized around a central table (the facts) joined to smaller tables using foreign key references

Dimensional Model
Fact table (Facts/Measures, 95% of data base storage): Product, Market, Promotion, Time, dollars, units, price, cost
Dimension tables (Dimensions, most of the fields):
• Product: descrip, size, flavor, package
• Time: descrip, weekday, holiday, fiscal
• Market: descrip, district, region, demog
• Promotion: descrip, deal, discount, media
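
The dimensional model on this slide can be sketched as a small star schema. This is a minimal sketch using Python's built-in sqlite3 module. The table names, the `*_key` columns and all sample rows are assumptions made for illustration; only the attribute and measure names come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product   (product_key   INTEGER PRIMARY KEY, descrip TEXT, size TEXT, flavor TEXT, package TEXT);
CREATE TABLE market    (market_key    INTEGER PRIMARY KEY, descrip TEXT, district TEXT, region TEXT, demog TEXT);
CREATE TABLE promotion (promotion_key INTEGER PRIMARY KEY, descrip TEXT, deal TEXT, discount TEXT, media TEXT);
CREATE TABLE time_dim  (time_key      INTEGER PRIMARY KEY, descrip TEXT, weekday TEXT, holiday TEXT, fiscal TEXT);

-- The fact table holds only dimension keys and numeric measures.
CREATE TABLE sales_facts (
    product_key   INTEGER NOT NULL REFERENCES product(product_key),
    market_key    INTEGER NOT NULL REFERENCES market(market_key),
    promotion_key INTEGER NOT NULL REFERENCES promotion(promotion_key),
    time_key      INTEGER NOT NULL REFERENCES time_dim(time_key),
    dollars REAL, units INTEGER, price REAL, cost REAL,
    PRIMARY KEY (product_key, market_key, promotion_key, time_key)
);

-- Hypothetical sample rows.
INSERT INTO product   VALUES (1, 'Cola 33cl', '33cl', 'vanilla', 'can'),
                             (2, 'Cola 50cl', '50cl', 'cherry', 'bottle');
INSERT INTO market    VALUES (1, 'Market A', 'D1', 'North', 'urban'),
                             (2, 'Market B', 'D2', 'South', 'rural');
INSERT INTO promotion VALUES (1, 'No promotion', 'none', '0%', 'none');
INSERT INTO time_dim  VALUES (1, '2001-01-02', 'Tuesday', 'No', 'FY2001');
INSERT INTO sales_facts VALUES
    (1, 1, 1, 1, 100.0, 10, 10.0, 6.0),
    (2, 1, 1, 1,  50.0,  5, 10.0, 7.0),
    (1, 2, 1, 1,  30.0,  3, 10.0, 6.0);
""")

# Typical star query: constrain and group by dimension attributes,
# aggregate the measures of the fact table.
result = conn.execute("""
    SELECT m.region, p.flavor, SUM(f.dollars) AS dollars, SUM(f.units) AS units
    FROM sales_facts f
    JOIN market  m ON m.market_key  = f.market_key
    JOIN product p ON p.product_key = f.product_key
    GROUP BY m.region, p.flavor
    ORDER BY m.region, p.flavor
""").fetchall()
print(result)
```

The query touches the fact table once and reaches every descriptive attribute through a single join, which is what makes the star easy to navigate for both people and tools.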

Terminology: Facts, Measures, Accounts
• ROLAP: The Fact Table(s)
– Contains Facts and (foreign key) references to the dimensions
• MOLAP: The Measures Dimension
– Contains Measures and references to the lowest level Member Keys of the dimensions
• Essbase (MOLAP): Measures Dimensions are called Accounts Dimensions

Advantages over classic datamodels (Entity/Relationship)
• Humans can navigate & remember a multidimensional star or cube
• Software can navigate “deterministically”
• The two major purposes of Multidimensional Modelling are:
– Reducing complexity
– Delivering good response times – also for large aggregations

The Dimensions
• Dimensions
– Define the business dimensions, in terms already familiar to users, by which the central table is to be analyzed
– Numerous columns of text, highly descriptive
– Represent the hierarchies of different levels of reporting (e.g. Year->Quarter->Month->Day)
– Usually less than a million rows, can be much larger in some businesses

Technically speaking (ROLAP)
• Dimension tables
– Must have a primary key
– Joined to the fact table through foreign key references
– Typically represent ninety percent of the data elements
– Commonly occur in constraints and GROUP BY clauses
– Heavily indexed

Typical dimensions
Generic:
• Time period(s)
• Geographic region (markets, cities)
• Products (also in bank/insurance)
• Promotions/campaign
• Customers (Account number)
• Sales rep, buyer, organisation
Industry specific:
• Frequent flier, stayer
• Service level, procedure, operation
• Room type, service, classification, seat
• Drug, medicine
• Vendors, distributor, warehouse

The Time Dimension
• Problem: SQL and many OLAP products do not support date arithmetic well enough
– How many working days in a month?
– How many days in the Easter season?
– When is the industrial vacation period?
– How to calculate AVG(xxx) per working day?

The Time Dimension
• Give it all the attributes you need to make life easy for the end-user:
– Daynumber
– Date
– Day in week
– Name of day
– Type of day
– Season
– Week in year
– Workingdays in week
– Month
– Name of month
– Type of month
– Day in month
– Days in month
– Workingdays in month
– End of Month Flag
– Quarter
– Year
– Day in year
– Workingdays in year
Think about this - it is worth the effort!
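
A time dimension like this is usually generated once by a small program instead of being typed in. The following is a minimal sketch in Python that builds a subset of the attributes listed on the slide and then answers the "working days in a month" question with plain SQL. The function name, the column names and the holiday list are assumptions for illustration; a real calendar would need the local holiday rules.

```python
import calendar
import sqlite3
from datetime import date, timedelta

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def build_time_dimension(start, end, holidays=frozenset()):
    """Return one row per calendar day from start to end (inclusive)."""
    rows = []
    day = start
    while day <= end:
        days_in_month = calendar.monthrange(day.year, day.month)[1]
        is_working = day.weekday() < 5 and day not in holidays
        rows.append((
            (day - start).days + 1,         # daynumber, used as the key
            day.isoformat(),                # date
            day.isoweekday(),               # day in week, Monday = 1
            DAY_NAMES[day.weekday()],       # name of day
            "Working day" if is_working else "Non-working day",
            day.month,
            day.day,                        # day in month
            days_in_month,
            int(day.day == days_in_month),  # end of month flag
            (day.month - 1) // 3 + 1,       # quarter
            day.year,
        ))
        day += timedelta(days=1)
    return rows

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE time_dim (
        daynumber INTEGER PRIMARY KEY, calendar_date TEXT, day_in_week INTEGER,
        name_of_day TEXT, type_of_day TEXT, month INTEGER, day_in_month INTEGER,
        days_in_month INTEGER, end_of_month_flag INTEGER, quarter INTEGER,
        year INTEGER)
""")

# Hypothetical holiday list: only New Year's Day.
rows = build_time_dimension(date(2001, 1, 1), date(2001, 12, 31),
                            holidays={date(2001, 1, 1)})
conn.executemany("INSERT INTO time_dim VALUES (?,?,?,?,?,?,?,?,?,?,?)", rows)

# "How many working days in a month?" becomes a simple count.
working_days_january = conn.execute("""
    SELECT COUNT(*) FROM time_dim
    WHERE year = 2001 AND month = 1 AND type_of_day = 'Working day'
""").fetchone()[0]
print(working_days_january)
```

With the count available as data, an average per working day is a division of SUM(measure) by this count, with no date arithmetic in the query itself.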

The Time of Day Dimension
[Figure: a Sales Facts star with the dimensions Cust, Cust geography, Contract, Sales outlet, Sales geography, Campaign, Sales inv., Act. distr. inventory, Pl. distr. inventory, Product, T-o-D and Calendar]

The Status Dimension

S-key  Status   Fulfilled-flag  Ordertype
1      Current  Yes             Mail
2      Current  No              Mail
3      Old      Yes             Mail
4      Old      No              Mail
5      Current  Yes             Phone
6      Current  No              Phone
etc...

(Cartesian product of all values)
Is a real dimension
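
Because the status dimension is the Cartesian product of its attribute values, it can be generated instead of maintained by hand. The sketch below is one way to do that in Python; the value lists contain only the values visible on the slide (the slide's "etc..." suggests there may be more), and the function name is hypothetical.

```python
from itertools import product

statuses = ["Current", "Old"]
fulfilled_flags = ["Yes", "No"]
order_types = ["Mail", "Phone"]

def build_status_dimension(statuses, fulfilled_flags, order_types):
    """Return (s_key, status, fulfilled_flag, order_type) for every combination."""
    rows = []
    # Order type varies slowest, so the keys come out in the slide's order.
    combinations = product(order_types, statuses, fulfilled_flags)
    for s_key, (order_type, status, flag) in enumerate(combinations, start=1):
        rows.append((s_key, status, flag, order_type))
    return rows

status_dimension = build_status_dimension(statuses, fulfilled_flags, order_types)
for row in status_dimension:
    print(row)
```

The fact table then carries a single status key instead of three separate low-cardinality flags.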

Hierarchies within Dimensions
[Figure: dimension diagram with two hierarchies ending in Cost Center. Organisational: Company -> Division -> Region -> Business Unit -> Cost Center. Geographical: Cost Center Country -> Cost Center State/Province -> Cost Center City -> Cost Center.]
ROLAP: One table (denormalised) with the columns Company, Division, Region, Business Unit, CC Country, CC State, CC City, Cost Center (PK)
MOLAP: One or two dimensions, depending on your product(s)
(Dimension diagram technique proposed by Ralph Kimball et al. in ”The Data Warehouse Lifecycle Toolkit”, using e.g. Microsoft Visio)

Fact of life (1): ”Ragged Hierarchies”

Hierarchy       Example 1       Example 2        Example 3
Country         USA             France           Denmark
State/Province  California      -                -
Region          Silicon Valley  PACA             Sjaelland
County          Santa Clara     Alpes-Maritimes  Frederiksborg
City            San Jose        Antibes          Elsinore
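
Fact of life: ”Ragged Hierarchies”
[Figure: the Expense Accounts hierarchy. Cost of Goods and Misc Costs have no accounts below them (--); Taxes has the accounts National Taxes and Local Taxes. *: Data enters at this level, which is the category level for Cost of Goods and Misc Costs and the account level for National Taxes and Local Taxes.]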

ROLAP Solution to the Ragged Hierarchy Problem
Base table:

Category_ID  Category_Description  Account_ID  Account_Description
01           Cost of Goods
02           Taxes                 00001       National Taxes
03           Taxes                 00002       Local Taxes
04           Misc Costs

End-user and/or MOLAP view:

    select Category_Description,
           case when Account_Description is not null
                then Account_Description
                else Category_Description
           end as Account_Description
    from Chart_of_Accounts
    order by Category_ID
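
To see what the view returns, the base table and the query from this slide can be run against an in-memory SQLite database. This is only a sketch: the column types are assumptions, and the missing accounts are stored as NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Chart_of_Accounts (
    Category_ID          TEXT PRIMARY KEY,
    Category_Description TEXT NOT NULL,
    Account_ID           TEXT,   -- NULL where the category has no accounts
    Account_Description  TEXT
);
INSERT INTO Chart_of_Accounts VALUES
    ('01', 'Cost of Goods', NULL,    NULL),
    ('02', 'Taxes',         '00001', 'National Taxes'),
    ('03', 'Taxes',         '00002', 'Local Taxes'),
    ('04', 'Misc Costs',    NULL,    NULL);
""")

# The view fills the missing lower level with the parent's description,
# so every row has a value on both levels of the hierarchy.
view_rows = conn.execute("""
    SELECT Category_Description,
           CASE WHEN Account_Description IS NOT NULL
                THEN Account_Description
                ELSE Category_Description
           END AS Account_Description
    FROM Chart_of_Accounts
    ORDER BY Category_ID
""").fetchall()
for row in view_rows:
    print(row)
```

The ragged hierarchy is thereby padded into a regular two-level hierarchy that a reporting tool or a MOLAP load can consume.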

MOLAP Solutions to the Ragged Hierarchy Problem
• Depends on your product
• Might require manual definitions
• Supported in at least:
– Microsoft SQL Server 2000
– Hyperion Essbase
– IBM DB2 OLAP Server
– Applix TM1

Fact of Life (2): Unique Members
• ”Member”: Often used in OLAP to designate the individual entries on the individual levels (e.g. Country = ’France’, Year = 2001 etc.).
• Some products require that members (values) be unique across:
– The whole level
– The whole dimension(!)
• Check your selected product(s) before getting too deep into the implementation!

Fact of Life (2): Unique Members
• All parent-child relationships MUST be one-to-many
• Checking your levels:
– select quarter, count(distinct year) as count_col from time_period group by quarter having count(distinct year) <> 1;
– The result set of the query above must be empty!
– If your product needs global uniqueness, you must do similar checking across all levels
– If you don’t do this, your users will get meaningless answers to their queries!
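
The level check on this slide can be wrapped in a small helper and run for every parent-child pair of a dimension. The sketch below uses SQLite; the helper name and the sample tables are assumptions, and the HAVING clause repeats the aggregate expression because not every DBMS accepts a column alias there.

```python
import sqlite3

def levels_violating_one_to_many(conn, table, child, parent):
    """Return (child value, number of parents) for children with more than one parent.

    The identifiers are interpolated into the SQL text, so they must come
    from trusted code, never from user input.
    """
    sql = (f"SELECT {child}, COUNT(DISTINCT {parent}) FROM {table} "
           f"GROUP BY {child} HAVING COUNT(DISTINCT {parent}) <> 1 "
           f"ORDER BY {child}")
    return conn.execute(sql).fetchall()

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Quarter labels that repeat every year: 'Q1' has two parents.
CREATE TABLE time_period (day_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
INSERT INTO time_period VALUES (1, 'Q1', 2000), (2, 'Q1', 2001), (3, 'Q2', 2001);

-- Quarter labels qualified with the year: every quarter has one parent.
CREATE TABLE time_period_fixed (day_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
INSERT INTO time_period_fixed VALUES (1, '2000-Q1', 2000), (2, '2001-Q1', 2001), (3, '2001-Q2', 2001);
""")

bad = levels_violating_one_to_many(conn, "time_period", "quarter", "year")
good = levels_violating_one_to_many(conn, "time_period_fixed", "quarter", "year")
print(bad)   # children with more than one parent
print(good)  # must be empty
```

Qualifying the member name with its parent, as in the second table, is one common way to make the members unique.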

Fact of Life (3): Sparsity
• Some dimension members may occur quite infrequently (e.g. demographic data only available on 10 percent of your customers)
• This is called sparsity (a sparse dimension)
• Also look for dimensions which do not intersect with other dimensions in a star – they are not interesting – they only take up space
• ROLAP: Not a big problem, some wasted disk space
• MOLAP: A very big problem – leads to cube ”explosion” (many unused cells)
– Try to eliminate as much as possible
– Make sure that you tune your product configuration well (many MOLAP products are ”sparsity aware”)
Fact of Life (4): Flat Dimensions
• E.g.: A dimension on Standard Industry Codes (SIC)
• In ROLAP just another attribute on your customer (maybe)
• In MOLAP, a member attribute on the lowest level of your customer hierarchy (if your product permits it)
• Keep the number of dimensions down!

Dimension Data Become Row-/Column Headers in Reports
[Figure: a Star Schema with the Sales Detail fact table (billions of records) and the dimensions Markets (Marketkey, Sales_area, ...), Time (Time_key, Day, Month, Year, ...), Products (Productkey, Productname, etc.) and Customers (age, gender, income level, education), next to a report with the products Xxx, Yyy and Zzz as column headers and the months Jan, Febr, Mar, Apr, May, ... as row headers]
Not only your model, but also your data must ”look good”!

Slowly Changing Dimensions
When data is changed, what can you do?
1. Overwrite (if you don’t care about losing the history)
2. Create another dimension record (if for instance a customer moves) - (what about the keys? --> use surrogate keys!)
3. Create ”current and previous” value fields (for instance changing sales territories)

Surrogate keys
• Because the meaning of IDs changes (SKU numbers, moving customers etc.)
• Because concatenated primary keys are impractical
• Keep external keys as (a) dimension field(s)
• Use plain integers for data warehouse keys (users shouldn’t see them, they are just used for joins)
• In short: Always – repeat ALWAYS – use them!
• You will want to hide them from the users by way of views (ROLAP) and external, natural key values, which the users are familiar with

SCD Type 2: The Past

Customer dimension:
Customer_key  Customer_Number  Name    Country  …
1234          98-66473         Thomas  Denmark  …

Sales facts:
Customer_Key  Date_key  …  Quantity  Price
1234          765       …  3         99,95

The Change: The customer moves

Customer dimension:
Customer_key  Customer_Number  Name    Country  …
1234          98-66473         Thomas  Denmark  …
4352          98-66473         Thomas  France   …

Sales facts:
Customer_Key  Date_key  …  Quantity  Price
1234          765       …  3         99,95
4352          1027      …  5         245,25

SCD Type 2: History is preserved, but how many customers do we have?
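
The question on this slide can be answered by counting the natural key instead of the surrogate key. The sketch below loads the two tables from the slide into SQLite; the lower-case column names and the data types are assumptions, and prices are written with a decimal point.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (
    customer_key    INTEGER PRIMARY KEY,   -- surrogate key
    customer_number TEXT NOT NULL,         -- natural key from the source system
    name TEXT, country TEXT
);
CREATE TABLE sales_facts (
    customer_key INTEGER NOT NULL REFERENCES customer(customer_key),
    date_key INTEGER, quantity INTEGER, price REAL
);
INSERT INTO customer    VALUES (1234, '98-66473', 'Thomas', 'Denmark');
INSERT INTO sales_facts VALUES (1234, 765, 3, 99.95);

-- The customer moves: Type 2 adds a dimension row with a new surrogate key.
INSERT INTO customer    VALUES (4352, '98-66473', 'Thomas', 'France');
INSERT INTO sales_facts VALUES (4352, 1027, 5, 245.25);
""")

dimension_rows = conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]
customers = conn.execute(
    "SELECT COUNT(DISTINCT customer_number) FROM customer").fetchone()[0]

# History is preserved: each sale is reported under the country valid at the time.
quantity_by_country = conn.execute("""
    SELECT c.country, SUM(f.quantity)
    FROM sales_facts f JOIN customer c ON c.customer_key = f.customer_key
    GROUP BY c.country ORDER BY c.country
""").fetchall()
print(dimension_rows, customers, quantity_by_country)
```

Counting rows of the dimension overstates the number of customers after a Type 2 change, so customer counts should be based on the natural key (or on a current-row flag).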

Effective Dates?
• Effective_begin_date & effective_end_date
• Might be the only way to deal with late arriving records
• But:
– What is the meaning of ”Manufactured from/to” versus ”Sold from/to”?
– Which attributes are affected?
– Makes query construction complicated
• If you use them, use a ”current_flag” also!

Type ”6” (2+3+1)
• When sales districts change ”randomly”:
– Sales Team Key
– Sales Team Name
– Sales Physical Address
– Begin Effective Date
– End Effective Date
– Is_current_flag (type 2)
– Current District (type 1)
– Old district (type 3)
– ….

Impact of SCD > 1
• ”Small matter of programming”
• How to detect changes?
– Cyclic Redundancy Check (CRC) is a possibility
• Which changes are important?
• Type 6 requires many updates
• Changes can ”cascade”
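
CRC-based change detection can be sketched as follows: store one checksum per natural key for the attributes that matter, and compare it with the checksum of each incoming row. This is a minimal sketch with Python's zlib.crc32; the function names, the column names and the sample rows are assumptions. A CRC is short, so two different rows can in rare cases produce the same value.

```python
import zlib

def row_crc(row, columns):
    """CRC-32 over the tracked attributes of one dimension row."""
    payload = "|".join(str(row[column]) for column in columns)
    return zlib.crc32(payload.encode("utf-8"))

def detect_changes(stored_crcs, incoming_rows, key, columns):
    """Split incoming rows into new and changed natural keys."""
    new, changed = [], []
    for row in incoming_rows:
        natural_key = row[key]
        if natural_key not in stored_crcs:
            new.append(natural_key)
        elif stored_crcs[natural_key] != row_crc(row, columns):
            changed.append(natural_key)
    return new, changed

# Only the attributes whose changes are considered important are tracked.
tracked = ["name", "country"]

# Checksums remembered from the previous load (hypothetical data).
previous_load = [
    {"customer_number": "98-66473", "name": "Thomas", "country": "Denmark"},
]
stored_crcs = {r["customer_number"]: row_crc(r, tracked) for r in previous_load}

current_extract = [
    {"customer_number": "98-66473", "name": "Thomas", "country": "France"},
    {"customer_number": "98-70001", "name": "Anna", "country": "Sweden"},
]
new_keys, changed_keys = detect_changes(
    stored_crcs, current_extract, "customer_number", tracked)
print(new_keys, changed_keys)
```

Choosing which columns go into the checksum is also the place where the question "which changes are important?" gets answered.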
Pragmatic preservation of history
• Make historical copies of your MOLAP cubes or ROLAP databases per year
• Make copies of the complete dimension, when major changes occur, and use those copies for historical analysis, maybe in a separate environment

The Facts/Measures
• Mostly raw numeric items, relevant measures, and dimension keys only. Can signify events or coverage
• Try to use as few measures as you can get away with, these are costly
– From some million to more than a billion observations
– Items are typically additive. May be semi-additive or non-additive in special cases
– Access primarily via dimensions
– Families of facts are common

Value of additivity
• Prevent incorrect computations
– Percentages and other statistical measures cannot always be simply added together
– For example, average bank balances
• Good advice
– Store base measures
– Calculate percentages and other statistics when facts are retrieved
– Be careful with NULL’s in the database!
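
A small worked example shows why ratios should be calculated from base measures at retrieval time. The sketch below uses SQLite with two hypothetical fact rows; the table and column names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales_facts (store TEXT, dollars REAL, units INTEGER);
-- Store A sells 10 units at 10 each, store B sells 30 units at 30 each.
INSERT INTO sales_facts VALUES ('A', 100.0, 10), ('B', 900.0, 30);
""")

# Wrong: averaging a per-row ratio ignores how many units each row represents.
average_of_unit_prices = conn.execute(
    "SELECT AVG(dollars / units) FROM sales_facts").fetchone()[0]

# Right: sum the additive base measures first, then calculate the ratio.
overall_unit_price = conn.execute(
    "SELECT SUM(dollars) / SUM(units) FROM sales_facts").fetchone()[0]

print(average_of_unit_prices, overall_unit_price)
```

The two base measures, dollars and units, can be summed across any dimension; the unit price cannot, which is why it is derived in the query instead of being stored.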

Additive measures
• Numeric datatypes
– Units sold
– Dollars sold versus per unit dollars
– Claim amount
– Discount dollars
– Profit before tax
– Tax dollars
– Service charges
– Number of calls
– Number of transactions

Typical facts
• Sales and purchases
• Daily, weekly, monthly, quarterly sales
• Policies sold, claims sold
• Orders, shipments
• Budget forecasts, actuals

Reasons for going to the transaction level
• Behavior analysis
• Time-of-day analysis / queue analysis
• Time gap analysis
• Sequential behavior
– Fraud detection
– Cancellation warnings
• Basket analysis

Technically speaking (ROLAP)
• Fact table
– Must have a primary key
– Joined to dimensions through foreign key references
– Usually physically sorted by time dimension and the primary analysis path

Mystery fields in the fact records
• Facts: Only measures and keys to dimensions
• Sometimes you see fields which are neither, and which also do not appear to be textual attributes of dimensions or other foreign keys.
• Most often these fields are codes and are sparse
• If they are really necessary, try to create one or more ”mystery dimensions” out of them
• Look at correlations between values of the mystery fields
– Assume X has 200 values and Y has 1000 values
– If 1000 combinations of X, Y exist, then X is a parent of Y
– If 200000 combinations exist, then they are completely uncorrelated, i.e. two dimensions.
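
The correlation test for two mystery fields comes down to counting distinct values and distinct combinations. A minimal sketch in Python, assuming the (X, Y) pairs have already been extracted from the fact records; the function name and the returned labels are hypothetical.

```python
from itertools import product

def classify_mystery_fields(pairs):
    """Classify two code fields from the (x, y) pairs found in the fact records."""
    pairs = set(pairs)
    x_values = {x for x, _ in pairs}
    y_values = {y for _, y in pairs}
    combinations = len(pairs)
    if combinations == len(y_values):
        return "X is a parent of Y"   # every Y occurs with exactly one X
    if combinations == len(x_values):
        return "Y is a parent of X"   # every X occurs with exactly one Y
    if combinations == len(x_values) * len(y_values):
        return "independent"          # all combinations occur: two dimensions
    return "partially correlated"     # a judgment call is needed

# X has 200 values, Y has 1000 values, and each Y belongs to one X.
hierarchical = [(y // 5, y) for y in range(1000)]
# X has 200 values, Y has 1000 values, and every combination occurs.
uncorrelated = list(product(range(200), range(1000)))

print(classify_mystery_fields(hierarchical))
print(classify_mystery_fields(uncorrelated))
```

Real data usually lands between the two extremes, so the combination count is a guide for the design decision more than a strict rule.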

MOLAP: Hierarchies in the Measures Dimension
• Some (most) MOLAP products allow you to set up hierarchies within the measures dimension (a.k.a. the Accounts Dimension in Essbase)
• This is particularly useful in financial reporting
• Requires some level of manually entered definitions
• ROLAP:
– Push down parent to its children
– More than one fact table
– A ”helper table” (discussed later)

An Essbase Example

Using degenerate dimensions (ROLAP)
Unique, primary key of a sales fact table:
• Time
• Product
• Store/Register
• Promotion
• Customer
• Employee
• Ticket # (degenerate)
• Line # (degenerate)

Good reasons for getting the fact table primary key right
• ”Global Warming”
– Avoid more rows than necessary (if granularity of fact and dimensions do not match)
• Lost dimensions
– Could be the reason for the problem
• Lost attributes
– Not getting a dimension detailed enough
• ”Low-cost Insurance”
– For avoiding duplicate rows
• ”Kids and Matches”
– Nobody will be tempted to join two fact tables
(Thanks to Jim Stagnitto of Questral, Inc.)

Technically speaking (ROLAP)
• Fact table
– Non-key columns are usually not indexed, rapid access is through the dimensions
– Columns often occur in sum(), rank(), min() functions

Reliable relations (ROLAP)
• All joins between dimensions and fact tables:
– Completely understood
– One to many (dimension to fact) relation based on foreign key references of the fact table’s multi-part key
• Referential integrity enforced. Always

Oops - forgot one thing: The Indexes!
[Figure: a Call detail fact table with the dimensions Cust, Rate period, Discount type, Batch, Call type, Access method, Jurisdiction, Hour and Day]
In an ordinary (universal) database, you will have 10 indexes on the fact table: 1 PK + 9 FK’s.

Sample ROLAP Index Calculation

Keys              Datatype   Length  Index size (GB)
Primary key       Composite  76      70
Jurisdiction      Character  15      19
Access method     Character  8       13
Discount type     Character  10      15
Rate period       Character  8       13
Customer          Decimal    15      19
Call type         Character  10      15
Batch number      Integer    4       10
Hour              Small int  2       8
Day               Date       4       10
Total index size                     192

Number of facts: 500 mio
Length of fact row: 200 bytes
Size of fact table (GB): 93
Total size of DW (GB): 285

You want to:
• Keep the number of dimensions down!
• Use integer keys!
• Find alternatives to B-trees!
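
The totals on this slide can be checked with a few lines of arithmetic. The sketch below takes the index sizes as given on the slide (how each one was estimated is not shown) and assumes that a GB means 2^30 bytes, which is the reading that reproduces the 93 GB for the fact table.

```python
# Index sizes in GB, copied from the slide.
index_sizes_gb = {
    "Primary key": 70, "Jurisdiction": 19, "Access method": 13,
    "Discount type": 15, "Rate period": 13, "Customer": 19,
    "Call type": 15, "Batch number": 10, "Hour": 8, "Day": 10,
}

number_of_facts = 500_000_000
fact_row_bytes = 200
BYTES_PER_GB = 2 ** 30  # assumption: binary gigabytes

fact_table_gb = round(number_of_facts * fact_row_bytes / BYTES_PER_GB)
total_index_gb = sum(index_sizes_gb.values())
total_dw_gb = fact_table_gb + total_index_gb
index_share = total_index_gb / total_dw_gb

print(fact_table_gb, total_index_gb, total_dw_gb, round(index_share, 2))
```

In this example the indexes take roughly two thirds of the total space, which is the point behind the three recommendations on the slide.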

Audit, Balance and Control
• Source feed file name for the row
• Job instance that processed the row
• Record number in the feed file
• Can also contribute to a unique key for the row
Families of Facts
• Related facts
• Aggregated facts

Related Facts
• Shared dimensions must be the same
• Value chain, e.g. in Manufacturing:
– Manufacturing Inventory
– Manufacturing Shipments
– Distribution Inventory
– Distribution Shipments
– Store Inventory
– Store Sales
Flow of Product: Each Process a set of Facts

Drilling Across The Value Chain
[Figure: the fact tables Manufacturing Inventory, Manufacturing Shipments, Distribution Inventory, Distribution Shipments, Store Inventory and Store Sales, all sharing the Time Dimension, the Product Dimension and the Customer Dimension]
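
Drilling across means querying each fact table separately, grouped by attributes of a shared (conformed) dimension, and only then combining the result sets. The sketch below does this for two steps of the value chain in SQLite; the table names, measure names and sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE manufacturing_shipments (
    product_key INTEGER REFERENCES product(product_key), units_shipped INTEGER);
CREATE TABLE store_sales (
    product_key INTEGER REFERENCES product(product_key), units_sold INTEGER);
INSERT INTO product VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO manufacturing_shipments VALUES (1, 100), (1, 50), (2, 70);
INSERT INTO store_sales VALUES (1, 120), (2, 60), (2, 5);
""")

def totals_by_product(conn, fact_table, measure):
    """One pass over a single fact table, grouped by the conformed attribute.

    Table and column names are interpolated, so they must be trusted values.
    """
    sql = (f"SELECT p.product_name, SUM(f.{measure}) "
           f"FROM {fact_table} f "
           f"JOIN product p ON p.product_key = f.product_key "
           f"GROUP BY p.product_name")
    return dict(conn.execute(sql).fetchall())

shipped = totals_by_product(conn, "manufacturing_shipments", "units_shipped")
sold = totals_by_product(conn, "store_sales", "units_sold")

# Merge the two answer sets on the shared product name.
drill_across = {
    name: (shipped.get(name, 0), sold.get(name, 0))
    for name in sorted(set(shipped) | set(sold))
}
print(drill_across)
```

The two fact tables are never joined to each other; the merge works only because both use the same product dimension.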

Drilling Across The Budgeting Chain
Fact tables:
• Budget: Foreign keys, Budget amount
• Commitments: Foreign keys, Commitment Amount
• Payments: Foreign keys, Payment Amount
Dimensions:
• Time Dimension (Month)
• Line Item Dimension
• Department Dimension
• Account Dimension
• Payment Dimension
• Commitment Dimension

Building ”Supermarts”: The Data Warehouse Bus Architecture
• Conformed Dimensions
• Standard Fact Definitions
– Revenue, Profit, Price, Cost etc.
• Granularity at the lowest level in front of the data marts
• Data Marts constructed from these standard sources as necessary

Aggregated Facts (ROLAP)
• Multiple fact tables
– Share one or more dimensions
– Daily fact table
– Monthly fact table
– Monthly Category fact table
• Caveat: What keys to use in dimension tables?
• Do not use ”level” fields!
• Use MOLAP or HOLAP wherever possible!

Drawbacks of aggregate tables
• There are so many of them!!!
• You are the one who must manage them!
• All they do for you is enable better performance in routine queries
• The application is programmed to your aggregation scheme; if that changes, then ….
• Digression: If you also use logical partitioning, you will have hundreds of tables!

Aggregate Tables in ROLAP
• Be careful out there!
• You must use an ”aggregate navigator”
[Figure: the Client tool sends plain SQL to the Navigator, which uses Metadata and statistics to produce aggregate aware SQL for the DBMS holding the Data and aggregates]
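
The core decision of an aggregate navigator can be sketched in a few lines: given the dimension attributes a query needs, pick the smallest table that still contains all of them. The sketch below is a simplified illustration in Python; the table names, attribute sets and row counts stand in for the metadata and statistics a real navigator would maintain, and they are assumptions.

```python
# Hypothetical metadata: (table name, attributes available, approximate rows).
AGGREGATES = [
    ("sales_daily",            {"day", "month", "product", "category", "store"}, 500_000_000),
    ("sales_monthly",          {"month", "product", "category", "store"},         20_000_000),
    ("sales_monthly_category", {"month", "category", "store"},                       500_000),
]

def choose_table(requested_attributes, aggregates=AGGREGATES):
    """Return the smallest table that can answer a query on the requested attributes."""
    requested = set(requested_attributes)
    candidates = [(rows, name) for name, attributes, rows in aggregates
                  if requested <= attributes]
    if not candidates:
        raise LookupError(f"no table covers {sorted(requested)}")
    return min(candidates)[1]

def make_aggregate_aware(sql_template, requested_attributes):
    """Turn 'plain SQL' with a {table} placeholder into aggregate aware SQL."""
    return sql_template.format(table=choose_table(requested_attributes))

plain_sql = "SELECT month, category, SUM(dollars) FROM {table} GROUP BY month, category"
aware_sql = make_aggregate_aware(plain_sql, {"month", "category"})
print(aware_sql)
```

Because the client only ever refers to the placeholder, aggregate tables can be added or dropped by changing the metadata, without touching the applications.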

Families of Facts
• Heterogeneous product lines, such as in Banking
Core keys and facts, kept in one common table:
– Month_key
– Product_key
– Customer_key
– Status_key
– Earned_dollars
– Paid_dollars
– Average_balance
Facts applicable only to checking accounts:
– Num ATM Trans
– Num Branch Trans
– Num Overdrafts
– Tot Overdraft Fees
– Overdraft Limit
– Declined Trans

Families of Facts
• Transactions: Date_key, Product_key, Sales_person_key, Customer_key, Transaction_key, Amount
• And/or snapshots: Month_key, Product_key, Sales_person_key, Customer_key, Status_key, Earned_dollars, Paid_dollars, Average_balance
Why not both?

Using snapshots
• When transactions are not pieces of revenue
– Deposits / withdrawals
– Payments in advance
– Insurance coverage premiums
• Consider a current rolling snapshot (period to date)
• Consider a status dimension
• Many fields in the snapshot fact table, some possibly semi-additive

Related Facts in MOLAP
• MOLAP products allow only one measures dimension per cube
• You must, if necessary at all, combine related facts into one measures dimension
• Use a ”Version” (”Scenario”) dimension with values like e.g.:
– ”Actual”
– ”Budget 2001-01”
– ”Prognosis 2001-05-01”
• Maybe also a ”Type” dimension, e.g.:
– ”Income”
– ”Expenses”
– ”Taxes”
• Or build several cubes, one for Income, one for Expenses etc.
– Many MOLAP products allow you to combine cubes into ”Virtual Cubes” using a library of shared dimensions

Similar models in each industry
• A common framework for every industry
– Retail
– Telecommunications
– Transportation
– Insurance
– Healthcare
– Manufacturers
– Banking
– Government
– Websites
– E-business

Retail (1): The Grocery Store
grocery_store_facts: PK,FK1 day_key; PK,FK2 product_key; PK,FK4 store_key; FK3 promotion_key
day: PK day_key
product: PK product_key
promotion: PK promotion_key
store: PK store_key

Retail (2): Orders
order_facts: PK,FK4 product_key; PK,FK2 order_date_key; PK,FK1 customer_key; PK,FK3 salesperson_key
customer: PK customer_key
date: PK day_key
product: PK product_key
salesperson: PK salesperson_key

Telco: The Billing CDR
BILLING_CDR: PK,FK3 PERIOD_KEY; PK,FK1 CALL_TYPE; PK,FK8 RATE_PLAN; PK,FK4 BTN; PK,FK5 DEST_TELEPHONE_NUMBER; PK,FK2 CUSTOMER_ID; PK,FK7 LINE_TYPE_KEY; PK,FK6 ORIG_TELEPHONE_NUMBER; PK,FK9 TIME_KEY
CALL_TYPE: PK CALL_TYPE
CUSTOMER: PK CUSTOMER_ID
LINE_TYPE: PK LINE_TYPE_KEY
PERIOD: PK PERIOD_KEY
RATE_PLAN: PK RATE_PLAN
TELEPHONE: PK PHONE_KEY
TIME: PK TIME_KEY

Transportation
Frequent Flyer: PK,FK1 flown_key; PK,FK2 purchased_key; PK,FK11 customer_key; PK,FK3 leg_origin_key; PK,FK4 leg_dest_key; PK,FK5 trip_origin_key; PK,FK6 trip_dest_key; PK,FK12 flight_key; PK,FK10 fare_class_key; PK,FK8 channel_key; PK,FK7 flight_status_key
Flight: PK flight_key
Date Flown: PK flown_key
Date Purchased: PK purchased_key
Customer: PK customer_key
Leg Origin: PK leg_origin_key
Leg Destination: PK leg_dest_key
Trip Origin: PK trip_origin_key
Trip Destination: PK trip_dest_key
Fare Class: PK fare_class_key
Flight Status: PK flight_status_key
Sales Channel: PK,FK1 channel_key

Insurance (1): Bus Architecture
Bus matrix of facts (rows) against dimensions (columns).
Dimensions: Automobile, Claim, Claim Status, Claim Transaction, Claimant, Coverage, Covered item, Employee, Insured party, Month, Policy, Status, Third party, Time, Transaction
Facts, with the number of dimensions marked (x) for each:
• Claim snapshot: 8
• Claim transaction fact: 10
• Custom claim snap: 8
• Custom claim trans: 10
• Custom snapshot: 7
• Custom transactions: 7
• Policy snapshot: 7
• Policy transactions: 7

Insurance (2): Some fact tables
Policy Snapshot
– Keys (PK): month_key, insured_party_key, employee_key, covrd_item_key, coverage_key, policy_key, status_key
– Facts: written_premium, earned_premium, primary_limit, primary_deductible, number_of_transactions
Policy Transactions
– Keys (PK): time_key, insured_party_key, employee_key, covrd_item_key, coverage_key, policy_key, transaction_key
– Facts: amount
Custom Snapshot
– Keys (PK): month_key, insured_party_key, employee_key, automobile_key, coverage_key, policy_key, status_key
– Facts: written_premium, earned_premium, primary_limit, primary_deductible, number_of_transactions, auto_replacement_value
Custom Transactions
– Keys (PK): time_key, insured_party_key, employee_key, automobile_key, coverage_key, policy_key, transaction_key
– Facts: amount
Claim Transaction Fact
– Keys (PK): time_key, insured_party_key, employee_key, covrd_item_key, coverage_key, policy_key, claimant_key, claim_key, third_party_key, claim_trans_key
– Facts: amount
Claim Snapshot
– Keys (PK): month_key, insured_party_key, employee_key, covrd_item_key, coverage_key, policy_key, claim_key, claim_status_key
– Facts: reserve_amount, paid_this_month, received_this_month, number_of_transactions
106
Manufacturing (1)
Component_use_facts
PK,FK1 component_key
PK,FK2 prod_key
PK,FK3 pl_key
PK,FK4 time_key
usage_quantity
components
PK component_key
component_part_number
component_name
component_description
category
unit_of_measure
Product
PK Prod_key
Model_Number
Family
Line
CPU
HD_size
Production Line
PK PL_key
Line_name
Facility
Country
LineType
Time
PK Time_key
Day_of_week
Month
Year
Date
Quarter
107
Manufacturing (2)
Production_facts
PK,FK1 pl_key
PK,FK2 prod_key
PK,FK3 time_key
units_produced_qty
Production Line
PK PL_key
Line_name
Facility
Country
LineType
Product
PK Prod_key
Model_Number
Family
Line
CPU
HD_size
Time
PK Time_key
Day_of_week
Month
Year
Date
Quarter
108
Manufacturing (3)
Activity
PK Activity_key
Activity
Category
Month
PK Time_key
month_name
month_number
quarter
year
Production Line
PK PL_key
Line_name
Facility
Country
LineType
Activity_facts
PK,FK3 pl_key
PK,FK2,FK4 time_key
PK,FK1 activity_key
units_produced_qty
109
Inventory
Inventory Transaction Fact
PK,FK3 time_key
PK,FK4 product_key
PK,FK2 warehouse_key
PK,FK1 transaction_key
Transaction
PK transaction_key
Time
PK time_key
Product
PK product_key
Warehouse
PK warehouse_key
Advanced Inventory Snapshot
PK,FK2 time_key
PK,FK3 product_key
PK,FK4 warehouse_key
110
Deliveries
Delivery Snapshot
PK order_key
PK product_key
PK warehouse_key
PK mfgr_key
PK first_received_key
PK last_received_key
PK first_inspect_key
PK first_auth_key
PK first_shipment_key
PK last_shipment_key
PK first_return_key
qty_ordered
qty_received
qty_inspected
qty_returned_to_vend
qty_placed_in_inv
qty_auth_to_sell
qty_picked
qty_boxed
qty_shipped
qty_returned_by_cust
qty_returned_to_inv
qty_damaged
qty_lost
qty_written_off
value_at_unit_cost
value_at_orig_selling_price
value_at_last_selling_price
value_at_avg_selling_price
PO_number
PO_line_number
111
Manufacturing Inventory
Manufacturing Shipments
Distribution Inventory
Distribution Shipments
Store Inventory
Store Sales
Time Dimension
Product Dimension
Customer Dimension
Drilling Across The Value Chain
112
Banking
Custom Checking
PK,FK2 account_key
PK,FK3 product_key
PK,FK4 branch_key
PK,FK5 household_key
PK,FK6 status_key
PK,FK1 month_key
primary_balance
transaction_count
direct_deposits
overdraft_limit
minimum_balance
service_charge
interest_paid
ATM_transaction_count
branch_ATM_transaction_count
days_below_minimum_balance
days_overdrawn
account_count
Month
PK month_key
month
year
fiscal_quarter
Account
PK account_key
primary_surname
secondary_surname
account_address
account_city
account_state
account_zip
date_opened
primary_age
primary_sex
primary_marital
Product
PK product_key
product_description
type
category
Branch
PK branch_key
branch_name
branch_address
branch_city
branch_state
branch_zip
Household
PK household_key
household_head_name
household_address
household_city
household_state
household_zip
household_income
household_type
Status
PK status_key
status_description
status_reason
new_account_flag
closed_account_flag
Custom Mortgage
PK,FK1 month_key
PK,FK2 account_key
PK,FK5 product_key
PK,FK3 branch_key
PK,FK4 household_key
PK,FK6 status_key
primary_balance
transaction_count
interest_paid
property_value
delinquent_count
bad_check_count
account_count
Household Facts
PK,FK1 month_key
PK,FK3 product_key
PK,FK4 branch_key
PK,FK5 household_key
PK,FK6 status_key
PK,FK2 account_key
primary_balance
transaction_count
account_count
113
Website Analysis
• Going beyond log file statistics
– WebTrends, Analog, NetTracker etc.
• Time-series analysis is necessary
– What happened during a site visit?
– Why was the visit abandoned?
– What is the effectiveness of a targeted promotion?
– What is the trend in the above over time?
• The Clickstream Data Warehouse
114
So, what’s new?
• From Sales Facts to User Activity Facts
• (How) do we know the User (customer)?
• Sessions
• Pages
• Events
• Probable cause of the visit/sale
• Where did the customer come from?
• The World Wide Web (24*7*365, multiple languages, cultures, timezones …)
• From CRM to eRM – electronic Relationship Management
115
Website Design also matters
• The clickstream data warehouse needs information about:
– Pages
– Cookies
– Users
– Clickable objects
– URLs
– Events
– Etc. Etc.
• This content and event information must be presentable to end-users!
• Some can be obtained by using log file processors
• It is likely you will have to do a considerable amount of data clean-up, if the website is not well-designed!
116
Web effects on ”old” dimensions
• Calendar:
– Local time, global time
• Time of Day
• Customer:
– Becoming a user dimension
– Cookies
– Named users
– Integration with e.g. CRM
– The global perspective
• Promotion:
– Must take web advertising into consideration
– Dynamic, individual, targeted ”special offers”, maybe in real time
117
New Web Dimensions
• Page
• Event
– e.g. Open Page, Refresh Page, Click Link, Enter Data
• Session
– Type of session, context/mission, success etc.
• Referral
– How did the customer/visitor arrive?
118
Choosing the grain
• Page event
• Session
• Aggregated levels
119
E-business
Page_Events
PK,FK11 Page_Key
PK,FK10 Causal_Key
PK,FK9 Session_Key
PK,FK1 Universal_Date_Key
PK,FK7 Universal_Time_Key
PK,FK2 Local_Date_Key
PK,FK8 Local_Time_Key
PK,FK3 Customer_Key
PK,FK4 Event_Key
PK,FK6 Referrer_Key
PK,FK5 Product_Key
Session_ID
Page_Seconds
Units_Ordered
Order_Dollars
Calendar
PK date_key
Many_more_fields
Causal
PK Causal_Key
Causal_Type
Price_Treatment
Newspaper_Ad_Type
Web_Ad_Type
Radio_Ad_Type
TV_Ad_Type
Store_Display_Type
Mfgr_Promo_Type
Other_Causal_Event
Customer
PK Customer_Key
Customer_Type
ISP_Address_1
ISP_Address_2
ISP_Address_3
Cookie_ID
Customer_ID
Customer_Identifier
Many_more_fields
Event
PK Event_Key
Event_Type
Event_Content
Page
PK Page_Key
Page_Source
Page_Function
Page_Template
Item_Type
Graphics_Type
Animation_Type
Sound_Type
Page_File_Name
Product
PK Product_key
Many_more_fields
Referrer
PK Referrer_Key
Referral_Type
Referring_URL
Referring_Site
Referring_Domain
Search_Type
Search_Specification
Target
Session
PK Session_Key
Session_Type
Local_Context
Overall_Session_Context
Action_Sequence
Success_Status
Customer_Status
Time_Of_Day
PK Time_Key
More_fields
120
The Data Webhouse
• Coined by Ralph Kimball in ”The Data Webhouse Toolkit”, Ralph Kimball and Richard Merz, Wiley 2000, ISBN 0-471-37680-9
• The Data Warehouse on the Web
• The Data Warehouse as the driver of the website (”closing the loop”)
121
The Birth of the Data Webhouse
122
Synonyms
• Basic function
– avoids multiple join paths between two tables
– makes schema more legible and thus less prone to query formulation errors
– use view to rename columns for ease of use with query tools
• Rationale over separate dimension tables
– Reduces need to duplicate data
– Simplifies administration
• CREATE SYNONYM FIRST_OPEN_TIME FOR TIME
– Not all databases support this - you may use views instead
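Where a database has no synonym support, the view alternative mentioned above might look like the following sketch. It uses Python's built-in sqlite3 module (SQLite has no CREATE SYNONYM statement). The table, view and column names (time_dim, first_open_time, last_open_time, account_facts) are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One physical time dimension (hypothetical columns).
    CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, day TEXT);
    INSERT INTO time_dim VALUES (1, '1997-09-01'), (2, '1997-09-15');

    -- Two views play the role of synonyms, one per role of the dimension.
    -- They also rename the columns, which keeps the join paths unambiguous.
    CREATE VIEW first_open_time AS
        SELECT time_key AS first_open_key, day AS first_open_day FROM time_dim;
    CREATE VIEW last_open_time AS
        SELECT time_key AS last_open_key, day AS last_open_day FROM time_dim;

    -- A fact table that references the time dimension twice.
    CREATE TABLE account_facts (first_open_key INTEGER, last_open_key INTEGER, balance REAL);
    INSERT INTO account_facts VALUES (1, 2, 100.0);
""")

rows = conn.execute("""
    SELECT f.balance, fo.first_open_day, lo.last_open_day
    FROM account_facts f
    JOIN first_open_time fo ON fo.first_open_key = f.first_open_key
    JOIN last_open_time lo ON lo.last_open_key = f.last_open_key
""").fetchall()
print(rows)
```

Only one copy of the time data is stored; each role is just a named window onto it.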
123
Typical synonyms
• City tables
– origin and destination (travel facts)
• Period tables
– order date and ship date
• Customer tables
– customers for ship to and bill to
124
Factless Fact Tables
Promotion Coverage Fact Table
time_key
product_key
store_key
promo_key
Time Product
Store Promotion
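A typical use of a factless coverage table is to ask what did not happen, for instance which promoted products recorded no sale. The sketch below assumes a sales fact table with the same four keys; the table names promotion_coverage and sales_facts and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Factless fact table: only keys, no measures.
    CREATE TABLE promotion_coverage (time_key INT, product_key INT, store_key INT, promo_key INT);
    -- Ordinary sales fact table (hypothetical, reduced to one measure).
    CREATE TABLE sales_facts (time_key INT, product_key INT, store_key INT, promo_key INT, units_sold INT);

    INSERT INTO promotion_coverage VALUES (1, 10, 100, 7), (1, 11, 100, 7);
    INSERT INTO sales_facts VALUES (1, 10, 100, 7, 5);
""")

# Products that were on promotion in a store on a day but recorded no sale there.
unsold = conn.execute("""
    SELECT c.product_key
    FROM promotion_coverage c
    LEFT JOIN sales_facts s
      ON s.time_key = c.time_key AND s.product_key = c.product_key
     AND s.store_key = c.store_key AND s.promo_key = c.promo_key
    WHERE s.product_key IS NULL
""").fetchall()
print(unsold)
```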
125
A ”Factless” snapshot table
Month
PK month_key
month
year
fiscal_quarter
Product
PK product_key
product_description
type
category
Branch
PK branch_key
branch_name
branch_address
branch_city
branch_state
branch_zip
Household
PK household_key
household_head_name
household_address
household_city
household_state
household_zip
household_income
household_type
Status
PK status_key
status_description
status_reason
new_account_flag
closed_account_flag
Demographics
PK demog_key
age_band
sex
income_band
marital
children
HH Demographics
PK,FK7 status_key
PK,FK2 month_key
PK,FK4 address_key
PK,FK3 product_key
PK,FK5 branch_key
PK,FK6 household_key
PK,FK1 demog_key
primary_balance
transaction_count
account_count
Address
PK address_key
primary_surname
secondary_surname
account_address
account_city
account_state
account_zip
date_opened
126
The Effect of Multidimensional Design on the ETL system
• Existence checks
• Denormalisation (N-way, multilevel merge)
• Lookup values
• Deal with missing values for foreign keys
127
Missing Values
What are the problems?
• Default values in source data
• Dummy values in source data
• Missing values in source data
What can be done?
• Fix the source system(s) – it is a data quality issue
• Try to infer values
• Use a dummy value
• More than one dummy value? (Unknown, Unavailable, Not applicable)
• Missing dimension keys? (The dummy dimension record: Cust.No. = –1, Name = ”Unknown”)
• What about dummy dates? And numeric values?
(See ”Dealing with Missing Values In The Data Warehouse” from www.sbti.com (Author John Hess, Stonebridge Technologies 1998), now to be found on www.olap.it/articles!)
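The dummy dimension record can be applied during the key lookup in the load process. The following is a minimal sketch, assuming a simple in-memory lookup from natural key to surrogate key; customer_lookup, customer_key_for and the sample rows are hypothetical names and data.

```python
UNKNOWN_KEY = -1  # surrogate key of the dummy "Unknown" customer record

# Hypothetical customer dimension: natural key -> surrogate key.
customer_lookup = {"C001": 1, "C002": 2}

def customer_key_for(natural_key):
    """Return the surrogate key, or the dummy key for missing or unmatched values."""
    if natural_key is None or natural_key.strip() == "":
        return UNKNOWN_KEY
    return customer_lookup.get(natural_key.strip(), UNKNOWN_KEY)

# Source rows: (customer number, amount). One is missing, one is unmatched.
source_rows = [("C001", 120.0), (None, 35.5), ("C999", 10.0)]
fact_rows = [(customer_key_for(cust), amount) for cust, amount in source_rows]
print(fact_rows)
```

Every fact row then joins to the customer dimension, so totals stay complete and the "Unknown" share can be reported explicitly.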
128
The 38 subsystems of ETL!
From the new book ”The Data Warehouse ETL Toolkit”, Ralph Kimball and Joe Caserta, Wiley 2004, ISBN 0-7645-6757-8:
• Extract
• Change data capture
• Data profiling
• Data cleansing
• Data conformer
• Audit dimension
• Quality screener
• Error event handler
• Surrogate key creation
• Slowly changing dimensions
• Late arriving dimensions
• Fixed hierarchy dimensions
• Variable hierarchy dimensions
• Multivalued dimensions
• Junk dimensions
• Fact table transaction load
• Periodic snapshot load
• Accumulating snapshot load
• Surrogate key pipeline
• Late arrivals handler
• Aggregate builder
• Cube builder
• Partition builder
• Dimension manager
• Fact table provider
• Job scheduler
• Workflow monitor
• Recovery and restart
• Parallelizing system
• Problem escalation
• Version control
• Version migration
• Lineage & dependencies
• Compliance reporter
• Security
• Backup
• Metadata repository
• Project management
129
Documentation
• Data Mart structure
• Logical Model
Read The Data Warehouse Lifecycle Toolkit for the complete picture!
130
Advanced Design Issues
When the going gets tough,
the tough get going!
131
Tricky stuff
• Monster Dimensions
• Multivalued Dimensions
• Multilevel Hierarchies
• Really Difficult Business Questions
132
Monster Dimensions
• Some dimension tables grow VERY big, e.g. customers, which also have many attributes
• Some “demographic” attributes change often (income, number of children etc.)
• Some “demographic” attributes are not used very much
Result: A lot of wasted space, indexes and complexity in keeping up to date
133
Dealing with Monster Dimensions
Customer dimension
customer_key
name
address
birth date
other stable attributes
Demographics dimension
demographics_key
income_band
purchases_band
no_children
education_level
etc.
Sales facts
Customer_key
Demographics_key
...
...
Note: Contains all possible combinations, predefined and preloaded!
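Preloading all possible combinations of the banded attributes could be sketched as follows. The band values and the function name demographics_key are assumptions for illustration; a real design would take the bands from the business.

```python
from itertools import product

# Hypothetical band values.
income_bands = ["low", "medium", "high"]
purchases_bands = ["few", "many"]
children_bands = ["0", "1-2", "3+"]

# Preload every possible combination and give each one a surrogate key.
demographics_dim = {
    combo: key
    for key, combo in enumerate(
        product(income_bands, purchases_bands, children_bands), start=1
    )
}

def demographics_key(income, purchases, children):
    """Look up the key to store in the fact row next to customer_key."""
    return demographics_dim[(income, purchases, children)]

print(len(demographics_dim))
print(demographics_key("low", "few", "0"))
```

When a customer's income band changes, the customer dimension row is left alone; new fact rows simply carry a different demographics key.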
134
Multivalued Dimensions
Healthcare billing fact:
• Date
• Patient
• Doctor
• Location
• Service performed
• Diagnosis
• Payer
• etc.
Do people only have one illness at a time?
135
What are the alternatives?
• Forget the dimension!
• Choose one value (”Primary Diagnosis”)
• Fixed number of diagnoses
• Going M:M?
136
Helper Tables
Bill fact
...
diagnosis_group_key
...
Payer
Patient
Doctor
Location
Service
Day
Diagnosis Group
diagnosis_group_key
diagnosis_key
weight_factor
Diagnosis Dimension
diagnosis_key
description
type
category
137
Solving a Multivalued Dimension with a Helper Table
• Use only when absolutely necessary

• Weight factors are usually of equal size within a group and should always add up to one

• May use a view to hide the helper table

• Healthcare, Retail Banks, SIC-codes
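The role of the weight factors can be illustrated with a small allocation sketch. The diagnosis names, amounts and variable names are made up; the point is only that weights within a group sum to one, so the weighted report adds up to the billed total.

```python
from collections import defaultdict

# Helper table rows: (diagnosis_group_key, diagnosis_key, weight_factor).
diagnosis_group = [
    (1, "flu", 0.5),
    (1, "asthma", 0.5),
    (2, "flu", 1.0),
]

# Bill facts: (diagnosis_group_key, billed_amount).
bill_facts = [(1, 200.0), (2, 50.0)]

# The weights in each group should add up to one.
totals = defaultdict(float)
for group_key, _, weight in diagnosis_group:
    totals[group_key] += weight
assert all(abs(t - 1.0) < 1e-9 for t in totals.values())

# Weighted report: allocate each bill across the diagnoses in its group.
billed_by_diagnosis = defaultdict(float)
for group_key, amount in bill_facts:
    for g, diagnosis, weight in diagnosis_group:
        if g == group_key:
            billed_by_diagnosis[diagnosis] += amount * weight

print(dict(billed_by_diagnosis))
```

Leaving the weight out of the calculation would give an "impact" report in which the 200.0 bill is counted in full for both diagnoses.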
138
Customers’ Hierarchies
The Problem is:
The Customer decides how many levels there are!!!
139
The Customer Hierarchy Model
customer_key
name
address
base_org (always populated)
level5_org
level4_org
level3_org
level2_org (populated if 2 or more)
top_org (always populated)
other attributes
140
Using a Helper Table
Bill fact
...
customer_key
...
Service
Seller
Day
Customer_tree_path
parent_customer_key
child_customer_key
depth_from_parent
lowest_flag
topmost_flag
Customer Dimension
customer_key
etc.
etc.
Contains one record for each separate path from each node to itself and to every node below it
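How such a path table could be populated from simple parent/child ownership data is sketched below. The customer keys, the parent_of mapping and the reading of topmost_flag (the path starts at a root customer) are assumptions made for illustration.

```python
# Hypothetical ownership edges: child customer_key -> parent customer_key.
parent_of = {2: 1, 3: 1, 4: 2}
all_nodes = {1, 2, 3, 4}

children = {}
for child, parent in parent_of.items():
    children.setdefault(parent, []).append(child)

def build_tree_paths():
    """One row per path from each node to itself and to every node below it."""
    rows = []
    for start in sorted(all_nodes):
        stack = [(start, 0)]
        while stack:
            node, depth = stack.pop()
            rows.append({
                "parent_customer_key": start,
                "child_customer_key": node,
                "depth_from_parent": depth,
                "lowest_flag": node not in children,     # leaf node
                "topmost_flag": start not in parent_of,  # path starts at a root
            })
            stack.extend((c, depth + 1) for c in children.get(node, []))
    return rows

paths = build_tree_paths()

# Immediate subsidiaries of customer 1 (depth_from_parent = 1).
immediate = sorted(r["child_customer_key"] for r in paths
                   if r["parent_customer_key"] == 1 and r["depth_from_parent"] == 1)
# Leaf nodes anywhere below customer 1.
leaves = sorted(r["child_customer_key"] for r in paths
                if r["parent_customer_key"] == 1 and r["lowest_flag"])
print(len(paths), immediate, leaves)
```

Four customers already produce eight rows, which hints at how quickly the table grows for deep hierarchies.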
141
Pros and cons
• Works like a normal dimension constraint
• Use depth_from_parent to e.g. get only immediate subsidiaries
• Use lowest_flag to get only leaf nodes
• Reversing the joins will take you upwards
• Maybe add begin_effective_date and end_effective_date, but be careful
• Maybe add weighting factor to support partially owned subsidiaries
• May grow very big, quickly!
142
The bigger picture
• Classification of analytical applications
• Data Mining besides SQL and OLAP
• International issues
• Recommendations
143
Really Difficult Business Questions
• Simple Constraints
• Simple Subqueries
• Correlated Subqueries
• Simple Behavioral Queries
• Derived Behavioral Queries
• Progressive Subsetting Queries
• Classification Queries
144
1) Simple Constraint
• Constraints against literal constants:
– Show the sales of candy products in September 1997
ROLAP / SQL MOLAP HOLAP
145
2) Simple Subquery
• Constraints against a global value found in the data:
– Show the sales of candy products in September 1997 in those stores that had above average sales of candy products
ROLAP / SQL MOLAP HOLAP
146
3) Correlated Subquery
• Constraints against a value defined by each output row:
– Show the sales of candy products for each month of 1997 in those stores that had above average sales of candy in that month
ROLAP / SQL MOLAP HOLAP
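The first three query types can be compared side by side in SQL. The sketch below runs them through Python's sqlite3 module against a single, already-joined sales table with one row per store and month; the table, its columns and the sample data are hypothetical, and the queries are simplified versions of the business questions above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (month TEXT, store TEXT, category TEXT, dollars REAL);
    INSERT INTO sales VALUES
        ('1997-08', 'A', 'candy', 10), ('1997-08', 'B', 'candy', 30),
        ('1997-09', 'A', 'candy', 50), ('1997-09', 'B', 'candy', 20),
        ('1997-09', 'A', 'soda',  99);
""")

# 1) Simple constraint: literal constants only.
simple = conn.execute("""
    SELECT SUM(dollars) FROM sales
    WHERE category = 'candy' AND month = '1997-09'
""").fetchone()[0]

# 2) Simple subquery: compare with one global value found in the data.
above_global_avg = conn.execute("""
    SELECT store, dollars FROM sales
    WHERE category = 'candy' AND month = '1997-09'
      AND dollars > (SELECT AVG(dollars) FROM sales WHERE category = 'candy')
""").fetchall()

# 3) Correlated subquery: the average is recomputed for each output row's month.
above_month_avg = conn.execute("""
    SELECT s.month, s.store, s.dollars FROM sales s
    WHERE s.category = 'candy'
      AND s.dollars > (SELECT AVG(m.dollars) FROM sales m
                       WHERE m.category = 'candy' AND m.month = s.month)
    ORDER BY s.month
""").fetchall()

print(simple, above_global_avg, above_month_avg)
```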
147
4) Simple Behavioral Query
• Constraints against values resulting from an exception report or a complex series of queries that isolate desired behavior:
– Show the sales of those candy products in September 1997 whose household penetration for our grocery chain in the 12 months prior to September was more than two standard deviations less than the household penetration of the same products across our 10 biggest retail competitors
ROLAP / SQL MOLAP HOLAP
()
?
148
5) Derived Behavioral Query
• Constraints against values found in set operations on more than one complex exception report or series of queries:
– Show the sales of those candy products identified in example 4, and which also experienced a merchandise return rate of more than two standard deviations greater than our 10 biggest retail competitors (Intersection of two behavioral queries)
ROLAP / SQL MOLAP HOLAP
()
149
6) Progressive Subsetting Query
• Constraints against values, as in number 4, but temporally ordered so that membership in an exception report is dependent on membership in a previous exception report:
– Show the sales of those candy products in example number 4 that were similarly selected in August 1997 but were not similarly selected in either June or July 1997
ROLAP / SQL MOLAP HOLAP
()
150
7) Classification Queries
• Constraints on values that are the results of classifying records against a set of defined clusters using nearest neighbor and fuzzy matching logic:
– Show the percentage of low-calorie candy sales contained in the 1000 market baskets whose content most closely matches a young, health-conscious family profile
ROLAP / SQL MOLAP HOLAP
151
Challenges
• Most query tools do not support the more complex query types
• You may build in support for those that your users need
152
Multi-step approach (SQL)
Query
Query
Query
Query
Result table
Build result table(s) that contain only keys (and maybe also time information) of those objects that display the desired, special behavior(s)
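A minimal sketch of the multi-step approach, again using sqlite3: the first step stores only the keys of the products with the special behavior, and the second step is an ordinary query constrained by that result table. The table names, the columns and the "more than 5 returned units" rule are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_facts (product_key INT, month TEXT, dollars REAL, units_returned INT);
    INSERT INTO sales_facts VALUES
        (1, '1997-08', 100, 0), (1, '1997-09', 120, 1),
        (2, '1997-08', 80, 9),  (2, '1997-09', 60, 8);

    -- Step 1: store only the keys of the products showing the special behavior
    -- (here, hypothetically, more than 5 returned units in total).
    CREATE TABLE high_return_products AS
        SELECT product_key FROM sales_facts
        GROUP BY product_key HAVING SUM(units_returned) > 5;
""")

# Step 2: an ordinary query constrained by the result table.
rows = conn.execute("""
    SELECT f.product_key, SUM(f.dollars)
    FROM sales_facts f
    JOIN high_return_products b ON b.product_key = f.product_key
    WHERE f.month = '1997-09'
    GROUP BY f.product_key
""").fetchall()
print(rows)
```

Further steps can intersect or subtract such key tables, which is one way to approach the derived and progressive subsetting queries.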
153
Extended ROLAP or MOLAP for Behavioral Analysis
Sales facts
Time_key
Product_key
Customer_key
Store_key
Promotion_key
Ticket_number
Line_number
Units_sold
Dollars_sold
Regular dimensions
Time dimension
Product dimension
Customer dimension
Store dimension
Promotion dimension
Special behavior dimension(s) *)
Product_key
defined by complex behavior study
*) May be hidden in views
154
Preparing for Data Mining
• Rigorous quality assurance on the values that you are going to mine
• Supply values as text, not codes
• Eliminate context dependencies
• Flag normal, abnormal, out of bound or impossible facts
• Mask out random or noise values
• Uniform treatment of NULL/missing/unknown
• Use Status dimensions (e.g. ”Customer about to cancel” etc.)
• Find training, test and evaluation sets
• Supply computed values (e.g. Profit)
• Band continuous values
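Two of the points above, banding continuous values and treating missing values uniformly, can be sketched in a few lines. The boundaries, labels and the function name income_band are assumptions for illustration.

```python
from bisect import bisect_right

# Hypothetical band boundaries and text labels for household income.
BOUNDARIES = [20_000, 50_000, 100_000]
LABELS = ["under 20k", "20k-50k", "50k-100k", "100k and over"]

def income_band(income):
    """Map a continuous value to a text band; missing values get one uniform label."""
    if income is None:
        return "Unknown"
    return LABELS[bisect_right(BOUNDARIES, income)]

bands = [income_band(v) for v in (5_000, 20_000, 99_999, 250_000, None)]
print(bands)
```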
155
Data Mining Algorithms
• Clustering
• Decision Trees
• Neural Networks
• Statistics
• Fuzzy Logic
• Genetic Algorithms
• Ants
• Hybrids
156
Integrated Data Mining
• The Multidimensional Model is ideal as a source for Data Mining
• ROLAP is necessary for time-series / sequential analysis
• Ideal case for integration of software tools:
– Microsoft Analysis Services 2000
– Oracle 9i
– Others?
157
Microsoft SQL Server 2000 Analysis Services
158
International Issues
• Languages
• Alphabets and character sets
• Names
• Addresses
• Numbers
• Telephone numbers
• Currencies
• Time of day
• Calendars
• Handling unsupported characters
• Collating sequences
See: ”The Data Webhouse Toolkit”, Ralph Kimball and Richard Merz, Wiley 2000, ISBN 0-471-37680-9
159
Comments on MS SQL Server 2005
• Unified dimensional model:
– Multivalued dimensions
– Possible to have many flat dimensions
– No need to build cubes – can put a layer on top of a non-multidimensional schema
• But:
– Users need the ease of use
– Machines need the speed
– We will see, just how well this works in a short period of time
160
The biggest challenge:
Data Quality
161
What is Data Quality?
• Accuracy
– Does the data accurately represent reality?
• Integrity
– Is the structure of data and relationships among entities and attributes maintained consistently?
• Consistency
– Are data elements consistently defined and understood?
• Completeness
– Is all necessary data present?
• Validity
– Do data values fall within acceptable ranges defined by the business?
• Timeliness
– Is data available when needed?
• Accessibility
– Is the data easily accessible, understandable, and usable?
162
Data quality and integration issues
• Legacy data issues
• Data accessibility & availability
• Insufficient time to analyse
• Inaccurate Metadata, documentation
• Lack of resources
• Disparate systems
• Data ownership issues
• Semantic differences
• Structure violations
• Rule violations
Data Warehouse
CRM
Home-grown applications
Web applications
ERP
Legacy applications
163
Why do data integration projects fail?
• The source data is not fully understood
• Complexity is underestimated
• Planning is actually guesswork
• Systems do not join as expected
• Delays when analysts/programmers need to interpret results
• Poor data analysis leads to complex development cycle and unpredictable rework
• Problems are uncovered ad-hoc and late
• Manual analysis on samples is time consuming, laborious and inaccurate
• Full volume testing is done too late
164
Data profiling & analysis
• Data profiling & analysis is critical to
– Understand the scope and nature of the problem
– Determine success criteria
– Accurate planning
• Automated tools are available
• Do it right the first time
• Then keep on doing it!
165
Examples of how to find data quality problems using Data Profiling
• How do you identify those customer records where values are missing or incomplete?
• Are the product codes correct in my order entry system?
• Will my data actually integrate?
• Are all the order shipment dates correct?
• Are my supplier details correct?
• How do I find misspellings of any attribute values?
• What about redundant data – how do I find it?
• Are the relationships held within my data consistent?
• Will my data support the new business requirements?
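A few of these questions can be answered with very simple checks. The sketch below is not a profiling tool, only an illustration on made-up records: it counts missing values, lists distinct values (where spelling variants show up) and checks one relationship. All record layouts and names are hypothetical.

```python
from collections import Counter

# Hypothetical extract of customer and order records.
customers = [
    {"id": "C1", "name": "Hansen", "city": "Copenhagen"},
    {"id": "C2", "name": "", "city": "Kobenhavn"},
    {"id": "C3", "name": "Jensen", "city": None},
]
orders = [{"order": 1, "customer_id": "C1"}, {"order": 2, "customer_id": "C9"}]

def profile(rows, column):
    """Count missing values and list the distinct values of one column."""
    values = [r.get(column) for r in rows]
    missing = sum(1 for v in values if v is None or str(v).strip() == "")
    distinct = Counter(v for v in values if v not in (None, ""))
    return {"missing": missing, "distinct": dict(distinct)}

name_profile = profile(customers, "name")
city_profile = profile(customers, "city")

# Relationship check: orders pointing at customers that do not exist.
known_ids = {c["id"] for c in customers}
orphans = [o["order"] for o in orders if o["customer_id"] not in known_ids]

print(name_profile["missing"], city_profile, orphans)
```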
166
Benefits of automated Data Profiling
• Improve Business responsiveness
– Time to market reduced
• Enhance Data Quality
– Ensure data is accurate and fit-for-purpose
• Project Planning
– Early & Accurate
• Reduce Risks
– Identify ALL data-related issues at the start
• Manage Resources
– Deliver with less effort
• Reduce Costs
– Reduce cost of analysis and build
167
Data Quality is a Business Issue
• The Business owns the data
• Set data quality standards across the company
• Build company-wide metadata knowledge
• Data Quality must be managed
168
Data Warehouse in the future?
Real-time?
169
Data Warehouse and the ”Enterprise Nervous System”
• Contemporary Enterprise Information Architecture calls for
– Realtime
– Integration
– Message brokers
– Service Oriented Architectures etc.
• What is the role of the Data Warehouse in this?
• This is based on an actual and recent architecture study for a Danish government body
170
Characteristics of a mature Data Warehouse
• Vision: One environment, one version of the Truth
• Integration of data from disparate sources
• Refinement of data
• Searching and browsing
• Production data
• Other data (partners, public / purchased information)
• Master data
• The Historical Data Warehouse
• Management Information
• Statistics
• Data Exchange
• Production reporting
• Ad hoc
171
More characteristics
• Much multidimensionality
• Refined data, both details and aggregations
• 1000+ frequent users
• Also external recipients
• New systems: The interface(s) to the DW should be defined
• Development projects: Decide for either a production system or a Data Warehouse (based on technical feasibility)
• Standardised data model (most often not documented, frequently changed)
• Analysis
• Reporting
• Ad hoc tasks (incl. operational one-time systems)
172
Components of a mature Data Warehouse
• Database(s)
– Normalised
– Multidimensional
• Load processes
– Predominantly batch
– Homegrown (COBOL)
– ETL jobs in e.g. Informatica
• BI tools such as e.g. Business Objects
– Applications
– Predefined reports
– Ad hoc environments
• Statistics
– SAS
173
Business benefits from a DW
• Holistic view of data across disparate systems
• Integration of data from different sources, including external partners
• History
• Refinement of data
• Presentation of data for business people
• Analysis, incl. decision support
• Reporting
• Ad hoc solutions
• Data foundation for statistics
• Data exchange
174
”The Enterprise Nervous System”
Architecture diagram with the following labels:
• Portal platform: Intranet (internal portal), Internet (external portal), Portlet API
• Infrastructure services: ”single sign-on”, security, administration, etc.
• Service and Integration platform (ENS): Integration Broker, integration platform with persistence, Business Activity Monitoring and Workflow management
• Common services
• Other components: User catalog, New systems, Legacy systems, External systems, Data Warehouse, Messages, CMS, data and metadata, Web Service, CMS modules, Email/Calendar, Workflow, New services, Info services, Data mining, internal and external ”WS UDDI”, OCES certificate handling
• Analysis, Update, Query, Statistics, Report
175
The Data Warehouse has served us well
• The concept of a centralised database is maybe not as necessary from now on

• But the practices of Data Warehousing are important:
– Good Data Management
– Good performance
– Data refinement
– Data presentation
• Realtime integration is certainly useful and practical
• Business Intelligence is evolving; Business Activity Monitoring is a natural next step.
• We have learned a lot – now is the time to do it right!
176
What is going on in data modelling?
From weak semantics to strong semantics:
• Taxonomy: Is Sub-Classification of
• Thesaurus: Has Narrower Meaning Than
• Conceptual Model: Is Subclass of
• Logical Theory: Is Disjoint Subclass of with transitivity property
Notations shown along the spectrum: Relational Model, XML; ER; Extended ER; DB Schemas, XML Schema; RDF/S; XTM; UML; OWL; Description Logic; First Order Logic; Modal Logic
Levels: Syntactic Interoperability, Structural Interoperability, Semantic Interoperability
*Source: Leo Obrst, The Mitre Corp.
177
Arthropoda
Leon
Animalia
Chordata
Mammalia
Carnivora
Felidae
Panthera Ursus
Tigris
Taxonomies: The unstructured world of documents
178
Syntax defined in: Nijssen, G.M. and T.A. Halpin. Conceptual Schema and Relational Database Design - A fact oriented approach. Prentice Hall 1989.
Conceptual model (ORM)
179
Property labels
t = rdf:type
s = rdfs:subClassOf
d = rdfs:domain
r = rdfs:range
et = rdfsx:collectionElementType
Source: Stephen Cranefield, Journal of Digital Information, Volume 1 Issue 8, Article No. 44, 2001-02-15
RDF based schema for Family
180
Source: Stephen Cranefield, Journal of Digital Information, Volume 1 Issue 8, Article No. 44, 2001-02-15
UML Class diagrams
181
<?xml version="1.0"?>
<rdf:RDF xmlns="http://mySite.com/myOntology#"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xml:base="http://mySite.com/myOntology">
  <owl:Class rdf:ID="Person"/>
  <owl:Class rdf:ID="Mother">
    <rdfs:subClassOf rdf:resource="#Person"/>
  </owl:Class>
  <owl:ObjectProperty rdf:ID="hasChild">
    <rdfs:domain rdf:resource="#Mother"/>
    <rdfs:range rdf:resource="#Parent"/>
  </owl:ObjectProperty>
  <owl:Class rdf:ID="Grandmother">
    <rdfs:subClassOf>
      <owl:Class>
        <owl:intersectionOf rdf:parseType="Collection">
          <owl:Class rdf:ID="Female"/>
          <owl:Class rdf:about="#Mother"/>
        </owl:intersectionOf>
      </owl:Class>
    </rdfs:subClassOf>
    <owl:equivalentClass>
      <owl:Class>
        <owl:intersectionOf rdf:parseType="Collection">
          <owl:Class rdf:about="#Mother"/>
          <owl:Restriction>
            <owl:onProperty>
              <owl:ObjectProperty rdf:about="#hasChild"/>
            </owl:onProperty>
            <owl:someValuesFrom>
              <owl:Class rdf:ID="Parent"/>
            </owl:someValuesFrom>
          </owl:Restriction>
        </owl:intersectionOf>
      </owl:Class>
    </owl:equivalentClass>
  </owl:Class>
  <Parent rdf:ID="Anny"/>
  <Mother rdf:ID="Ingeborg">
    <hasChild rdf:resource="#Anny"/>
  </Mother>
</rdf:RDF>
Web Ontology Language (OWL)
182
• Data Warehouse showed us how to
• Semantics is the key
• Ontologies are the foundation
• Repositories are the technology
• Gartner: Enterprise Information Architecture
• The sponsor is: The need for integration!
Information Management II
183
Wrap up
184
Listen to Ralph
• Embed all knowledge of the data in the data
• Stick to one level of dimension tables
• Aggregates should be separate tables
• Use an aggregate navigator (serverside)
• Even better: Use MOLAP or HOLAP
• Stick to simple star schemas
• Properly design conformed dimensions and conformed facts, first!
185
Listen to me
K.I.S.S.: Keep It Simple, Stupid!
186
Ralph Kimball’s 20 criteria
(Number, criterion: explanation)
Architecture
1 Explicit declaration: Metadata drives multidimensional behavior
2 Conformed dimensions and facts: System supports mix-and-match of technologies etc.
3 Dimensional integrity: Full referential integrity
4 Open aggregate navigation: Complete, transparent, automatic aggregates
5 Dimensional symmetry: No limitations on calculations across the whole model
6 Dimensional scalability: No limits
7 Sparsity tolerance: No limits
Administration
8 Graceful modification: No unload-reloads
9 Dimensional replication: Seamless distribution of standard dimensions
10 Changed dimension notification: Automatic delivery of SCD 1, 2 and 3
11 Surrogate key administration: Completely automated
12 International consistency
Expression
13 Multiple dimension hierarchies: Multiple independent hierarchies in a dimension
14 Ragged-dimension hierarchies: Hierarchies of indeterminate depth
15 Multiple valued dimensions: Many-to-many with allocation factors
16 Slowly changing dimensions: Full SCD support
17 Roles of a dimension: E.g. multiple roles of a time dimension
18 Hot-swappable dimensions: Change (perhaps personalised) dimensions on the fly
19 On-the-fly fact range dimensions: Dynamic, runtime support for banding / bucketing
20 On-the-fly behavior dimensions: Support for subset lists with set algebra
187
Literature
• ”The Data Warehouse Toolkit – Second Edition”, Wiley 2002, ISBN 0-471-20024-7
• ”The Data Warehouse Lifecycle Toolkit”, Ralph Kimball, Laura Reeves, Margy Ross and Warren Thornthwaite, Wiley 1998, ISBN 0-471-25547-5
• ”The Data Webhouse Toolkit”, Ralph Kimball and Richard Merz, Wiley 2000, ISBN 0-471-37680-9
• ”Data Warehouse Design Solutions”, Christopher Adamson and Michael Venerable, Wiley 1998, ISBN 0-471-25195-X
• ”Microsoft OLAP Solutions”, Erik Thomsen, George Spofford and Dick Chase, Wiley 1999, ISBN 0-471-33258-5
• ”Improving Data Warehouse and Business Information Quality”, Larry P. English, Wiley 1999, ISBN 0-471-25383-9
188
Useful web addresses
• www.ralphkimball.com (Ralph Kimball)
• www.dwinfocenter.org (Larry Greenfield’s Data Warehouse Info Center)
• www.intelligententerprise.com (Intelligent Enterprise Magazine, previously DBMS magazine - many articles by Ralph Kimball and other fine people - for example: August 97 A Dimensional Modelling Manifesto)
• www.datawarehousing.com (home of DWLIST)
• www.tdwi.org - The Data Warehouse Institute
• www.iaidq.org, The International Association for Information and Data Quality (IAIDQ)
• http://www.tondering.dk/claus/calendar.html, everything you would ever want to know about calendars!
189
What to do first?
• Buy and read Ralph Kimball’s ”The Data Warehouse Toolkit – Second Edition”, Wiley 2002, ISBN 0-471-20024-7
• Buy and read ”The Data Warehouse Lifecycle Toolkit”, Wiley 1998, ISBN 0-471-25547-5, Ralph Kimball, Laura Reeves, Margy Ross and Warren Thornthwaite
• Buy and read ”Data Warehouse Design Solutions”, Christopher Adamson and Michael Venerable, Wiley 1998, ISBN 0-471-25195-X.
• Buy and read ”The Data Warehouse ETL Toolkit”, Ralph Kimball and Joe Caserta, Wiley 2004, ISBN 0-7645-6757-8
• Just Do It!
190
Thank You!
thomasf@tf-informatik.dk
+45-40 54 83 40 (GSM)
04.93.33.88.93 (occasionally)
+45-49 70 83 40 (Landline)