7/31/2019 03-1 DWh Data Warehouse - Time Dimension
R. Marti
3-1 Data Warehouse The Time Dimension
Data Warehousing
Spring Semester 2011
3-1 DWh 2011: Data Warehouse (R. Marti)
The Data Warehouse in the DWh Reference Architecture
[Figure: DWh reference architecture, showing source databases, master data, the data warehouse, data marts, and dashboards / reports / interactive analysis.]
Focus
- Architectural options and variations in data warehouse projects
- Design of the single integrated data warehouse, in particular
  - how to model temporal aspects
  - how to ensure common dimensions (=> Master Data Management)
Recap: Time in Classical Data Mart Designs (1)
Recap: Time in Classical Data Mart Designs (2)
Rows in fact tables are associated with a specific time by the foreign key reference to the time dimension, indicating as of when they are valid.
However, rows in dimension tables are not associated with a time!
- New rows (rows with an unknown source system identifier) are simply added.
- Usually, no rows are deleted from a dimension table, even if rows with known source system identifiers are missing in a batch upload:
  . existing (old) facts still refer to objects corresponding to these missing rows
  . if sources do not send explicit information on deletions, it is unclear whether the corresponding objects have effectively become invalid or not
    (Note: Sending this information might mean re-designing the source system!)
- Changes in values of dimension rows with known source system identifiers are
  . either simply overwritten,
  . or captured by adding a new row with a new surrogate (but the old source system id)
    (see topic "slowly changing dimensions")
Temporal Database Systems + Languages
For some types of analysis, dimensions should also be historized, especially for comparisons of measures across different time periods.
Example:
How did buying habits of customers change over the last few years, grouped by where they live?
The history of customer addresses should therefore also be kept!
Since 1980, a lot of research has been conducted on temporal data models, temporal query languages, and temporal database systems.
Generic support for temporal data is beginning to emerge in products: Teradata Database 13.10, IBM DB2 V10, Oracle
Notions of Time
Valid Time is the time during which a fact in the real world was, is, or will be true or, more precisely, was / is believed to be true or believed to become true. Note: This time is determined by the user.
Sometimes also called effective time, "as of" time, or business time.
Transaction Time is the time during which a fact in the real world was or is (rightly or wrongly) stored in the database. Note: This time is determined by the system (unless the user decides to delay entering the data, of course ...).
Sometimes also called system time.
Example of an announcement made (and stored in a DB there and then) on October 1 2010 (= transaction time):
"David Cole will be Chief Risk Officer as of March 1 2011" (= valid time).
Associating Time with Data
[Figure: a temporal relation with the axes time, tuples, and attributes, highlighting the snapshot at time t.]
Assumption: For each relation, a clock with a given temporal granularity is specified, e.g., a day, a second, or a millisecond.
Conceptually, the extension of a temporal relation R can then be viewed as a sequence of snapshot relations
$R^t = \tau_t(R)$
for every time point t of this clock.
$\tau_t$ is called the snapshot operator (sometimes also the timeslice operator).
Benefits and Pitfalls of Sequence of Snapshots Model
Good for theoretical considerations, in particular
- determining equivalence of different temporal representations
- gauging the expressive power of temporal query languages
May be impractical as an implementation model, given that it may require lots of space, especially when
- the granularity of time is fine (minutes, seconds, milliseconds, ...)
- represented facts do not change often, i.e., stay the same over a longer interval (usually because they describe states rather than events)
From Sequence of Snapshots Model to Time Intervals
Remedy: Don't store data again that did not change since the previous clock tick.
Collect identical snapshots of suitable smaller parts of a relation (e.g., tuples or attribute values) and associate them with time intervals rather than time points.
Alternatives:
(1) associate temporal intervals with every tuple
(2) associate temporal intervals with every attribute value
(but the 2nd approach requires complex attributes, violating 1NF)
Valid Time Relations capturing State
Conceptually, every tuple which captures a state is timestamped with a time interval $[t_{from}, t_{to}]$ indicating the validity of the (non-temporal) data represented in the tuple.
Remarks:
- Transformation into 1NF by replacing V_INTERVAL by V_FROM (valid from) and V_TO (valid to)
- The symbol ? means "unknown", "until now", or "until further notice". In standard SQL, it is usually represented by null or by the date 9999-12-31, neither of which is entirely satisfactory ...
Typical Queries (1): Snapshot of Valid Time Relation
Snapshots of the previous valid time relation:
Remarks:
- We assume that ID is the primary key at every point in time (in every snapshot).
- Producing a snapshot from a valid time relation is a simple selection in relational algebra:
select ID, NAME, FNAME, ADDR, SAL
from EMP
where :t in V_INTERVAL   (or: where :t between V_FROM and V_TO)
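The selection above can be sketched in a few lines of Python. This is only an illustration: the rows are hypothetical (the slide's table is not reproduced here), the columns are reduced to ID, ADDR, and SAL, and V_TO is assumed to be an inclusive end date.

```python
from datetime import date

FOREVER = date(9999, 12, 31)  # stand-in for "until further notice"

# Hypothetical valid-time rows: (ID, ADDR, SAL, V_FROM, V_TO), V_TO inclusive.
EMP = [
    (676, "Baar", 7000, date(2006, 7, 1), date(2008, 3, 31)),
    (676, "Bern", 7000, date(2008, 4, 1), date(2009, 10, 31)),
    (676, "Bern", 7500, date(2009, 11, 1), FOREVER),
]


def snapshot(rows, t):
    """Return the non-temporal part of every row whose interval contains t."""
    return [row[:-2] for row in rows if row[-2] <= t <= row[-1]]


print(snapshot(EMP, date(2008, 6, 15)))
```

Because ID is assumed to be the primary key in every snapshot, the result contains at most one row per ID.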
Valid Time Relations capturing Recurring States
A specific state of affairs can recur several times (=> several time periods)
transformation to 1NF
The first two tuples are called value equivalent since they have the same values in all attributes except the temporal attributes V_FROM and V_TO.
Options in the Representation of Time
Canonical representation using maximal time intervals (as on previous slide):
One (of many) possible alternative representations using two (non-maximal)
contiguous intervals (assuming a temporal granularity of a day):
Issues with non-canonical Representations
Non-canonical representations may lead to incorrect answers:
Example Query: Who left the company before 2008-01-01 and when?
select ID, NAME, FNAME, V_TO
from EMP
where V_TO < date '2008-01-01'
(Incorrect) Result:
Avoiding non-canonical Representations: By Design
Ensure that intervals remain maximal when inserting or updating:
Let
- R be a valid time relation in canonical form (i.e., with maximal time intervals)
- n be a new valid time tuple to be inserted into the relation R
- x1, ..., xk (k >= 0) be all existing valid time tuples in relation R which are value equivalent to n (cf. slide "Valid Time Relations capturing Recurring States")
Then, for all i, 1 <= i <= k, the following must hold (in pseudo-SQL notation):
not exists (
  select *
  from R xi
  where xi = n                                          -- value equivalence
    and ( n.V_FROM - 1 between xi.V_FROM and xi.V_TO
       or n.V_TO + 1 between xi.V_FROM and xi.V_TO ) )  -- intervals do not touch or overlap
(This could be specified as a declarative check constraint if the implementation supported it.)
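A minimal Python sketch of such an insert-time check follows. It is an illustration under assumptions, not the slide's exact condition: it uses a symmetric "touch or overlap" test, which also rejects a new interval that completely contains an existing one. The row layout (values followed by V_FROM and an inclusive V_TO) and the function names are hypothetical.

```python
from datetime import date


def touches_or_overlaps(a_from, a_to, b_from, b_to):
    """True if two inclusive day intervals overlap or are directly adjacent."""
    # Ordinals avoid an overflow when adding one day to 9999-12-31.
    return (a_from.toordinal() <= b_to.toordinal() + 1
            and b_from.toordinal() <= a_to.toordinal() + 1)


def can_insert(relation, new_row):
    """Rows are (values..., V_FROM, V_TO). Inserting new_row keeps the
    relation canonical only if no value-equivalent row touches or overlaps it."""
    *new_values, n_from, n_to = new_row
    for *values, x_from, x_to in relation:
        if values == new_values and touches_or_overlaps(n_from, n_to, x_from, x_to):
            return False
    return True


relation = [(676, "Baar", date(2006, 7, 1), date(2008, 3, 31))]
print(can_insert(relation, (676, "Baar", date(2008, 4, 1), date(2008, 12, 31))))
```

When the check fails, an application would typically extend the existing row instead of inserting a new one.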
Typical Queries (2): Temporal Projection
Unfortunately, (intermediate) query results may be non-canonical, even if
applied to a canonical representation:
Example: Where did employees live and when (irrespective of salary)?
select ID, NAME, FNAME, ADDR, V_FROM, V_TO from EMP
Result:
Avoiding non-canonical Representations: By Coalescing
Non-canonical representations can be transformed into the canonical representation by an operation called temporal coalescing, which maximizes the length of all intervals by coalescing adjacent and overlapping intervals of value-equivalent tuples.
Coalesced form:
Temporal Coalescing in (Pseudo-) SQL
with recursive Rclos as (
  -- initial ("anchor") query
  select R.values, R.V_FROM, R.V_TO from R
  union
  -- recursive query: executed until no new data generated
  select R.values, R.V_FROM, Rclos.V_TO
  from R, Rclos
  where Rclos.values = R.values
    and Rclos.V_FROM >= R.V_FROM and Rclos.V_FROM - 1 Rclos.V_TO )
)
A more efficient implementation uses window functions (see [Zhou et al 2006]).
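Outside the database, coalescing can be sketched as a simple sort-and-merge pass in Python. This is neither the recursive query above nor the window-function approach of [Zhou et al 2006]; it is only an illustration, assuming rows of the form (values..., V_FROM, V_TO) with inclusive end dates and a granularity of one day.

```python
from datetime import date
from itertools import groupby


def coalesce(rows):
    """Merge adjacent or overlapping intervals of value-equivalent rows."""
    result = []
    # Sort by the non-temporal values first, then by V_FROM.
    ordered = sorted(rows, key=lambda r: (r[:-2], r[-2]))
    for values, group in groupby(ordered, key=lambda r: r[:-2]):
        cur_from = cur_to = None
        for *_, v_from, v_to in group:
            if cur_from is not None and v_from.toordinal() <= cur_to.toordinal() + 1:
                cur_to = max(cur_to, v_to)      # extend the current interval
            else:
                if cur_from is not None:
                    result.append(values + (cur_from, cur_to))
                cur_from, cur_to = v_from, v_to  # start a new interval
        result.append(values + (cur_from, cur_to))
    return result


# Hypothetical projection of EMP onto (ID, ADDR), i.e. without the salary.
projected = [
    (676, "Baar", date(2006, 7, 1), date(2008, 3, 31)),
    (676, "Bern", date(2008, 4, 1), date(2009, 10, 31)),
    (676, "Bern", date(2009, 11, 1), date(9999, 12, 31)),
]
print(coalesce(projected))
```

The two "Bern" rows are adjacent and value equivalent, so they collapse into a single row with a maximal interval.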
Typical Queries (3): Temporal Join
Sometimes, the history of information stored in two relations is of interest:
Example: Who worked on which projects and when?
Result:
Temporal Join in SQL (without temporal coalescing!)
Construct the time intervals of the result by intersecting the time intervals of the operands
(and keeping rows with non-empty intervals):
select * from (
  select w.PROJ_ID, w.EMP_ID, e.NAME, e.FNAME,
         case when e.V_FROM > w.V_FROM
              then e.V_FROM
              else w.V_FROM
         end as V_FROM,
         case when e.V_TO < w.V_TO
              then e.V_TO
              else w.V_TO
         end as V_TO
  from WORKS_ON w, EMP e
  where e.ID = w.EMP_ID )
where V_FROM
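The interval intersection can also be sketched in Python. The data below is hypothetical, the column layout is simplified, and the sketch assumes that an interval with inclusive end dates is non-empty when V_FROM <= V_TO.

```python
from datetime import date

FOREVER = date(9999, 12, 31)

# Hypothetical data. EMP rows: (ID, NAME, ADDR, V_FROM, V_TO);
# WORKS_ON rows: (PROJ_ID, EMP_ID, V_FROM, V_TO). End dates are inclusive.
EMP = [
    (676, "Miller", "Baar", date(2006, 8, 1), date(2008, 2, 29)),
    (676, "Miller", "Bern", date(2008, 3, 1), FOREVER),
]
WORKS_ON = [("P1", 676, date(2007, 1, 1), date(2008, 12, 31))]


def temporal_join(works_on, emp):
    """Join on the employee id and intersect the valid-time intervals."""
    result = []
    for proj_id, emp_id, w_from, w_to in works_on:
        for e_id, name, addr, e_from, e_to in emp:
            if e_id != emp_id:
                continue
            v_from, v_to = max(w_from, e_from), min(w_to, e_to)
            if v_from <= v_to:  # keep only non-empty intersections
                result.append((proj_id, emp_id, name, addr, v_from, v_to))
    return result


for row in temporal_join(WORKS_ON, EMP):
    print(row)
```

As in the SQL version, no coalescing takes place: projecting ADDR away from this result would leave two adjacent, value-equivalent rows.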
Proposals for Temporal Support in SQL
There are proposals to hide this (and more, see following slides) temporal
complexity in SQL, e.g., the SQL/Temporal part of a future SQL3 standard.
A temporal join (including temporal coalescing) would look as follows:
validtime
select w.PROJ_ID, w.EMP_ID, e.NAME, e.FNAME
from WORKS_ON w, EMP e
where e.ID = w.EMP_ID
see e.g. [Snodgrass 1999]
Richard T. Snodgrass: Developing Time-Oriented Database Applications. Morgan Kaufmann, 1999.
Note: This publication is out of print, but available electronically as PDF at http://www.cs.arizona.edu/people/rts/publications.html
DB2 10 for z/OS and Teradata Database V13.10 support most of the SQL/Temporal proposal.
Transaction Time Relations
Note that transaction time should be automatically determined by the system at insert/update/delete time (or, more precisely, commit time), not by the user; granularity is typically as fine as possible.
Transaction time can be represented exactly like valid time, by associating a time interval with tuples.
Example: Transaction time history of employee 676 (also see slide "Valid Time Relations capturing State")
1. 2006-07-01: insert "676 lives in Baar and earns 7000."
2. 2008-04-01: update "676 lives in Bern."
3. 2009-11-01: update "676 earns 7500."
Using DBMS Logging to capture Transaction Time
Since transaction time can be automatically determined by the system, the DBMS logging facilities can be used.
This is/was done e.g. in Postgres/PostgreSQL/Illustra (and in Oracle).
Example: Transaction time history of employee 676 (see slide 15)
1. 2006-07-01: insert "676 lives in Baar and earns 7000."
2. 2008-04-01: update "676 lives in Bern."
3. 2009-11-01: update "676 earns 7500."
[Figure: two tables]
- Normal (snapshot) table containing current contents.
- Undo log table containing changes to produce previous contents of the associated snapshot table ("before images").
Implementing Logging Using Triggers
create or replace trigger TR_AU_EMP
after update
on EMP
for each row
declare
  l_log EMP_UNDO_LOG%rowtype;
begin
  l_log.X_TIME := current_timestamp;
  l_log.UNDO_OP_CODE := 'update';
  l_log.ID := :old.ID;
  l_log.NAME := :old.NAME;
  l_log.FNAME := :old.FNAME;
  l_log.ADDR := :old.ADDR;
  l_log.SAL := :old.SAL;
  insert into EMP_UNDO_LOG values l_log;
end TR_AU_EMP;
/
Notes:
- written in Oracle PL/SQL
- similar triggers are required for inserts and deletes
- should probably check that ID has not changed and raise an application error if this were the case
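For readers without an Oracle instance, the same idea can be tried with SQLite through Python's standard `sqlite3` module. This is a sketch under assumptions, not a port of the trigger above: the table layout is reduced (no FNAME), the name "Sue" for employee 676 is made up, and SQLite's trigger syntax differs from PL/SQL (`old.X` instead of `:old.X`, no `%rowtype`).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table EMP (ID integer primary key, NAME text, ADDR text, SAL integer);
create table EMP_UNDO_LOG (
    X_TIME text, UNDO_OP_CODE text,
    ID integer, NAME text, ADDR text, SAL integer);

-- After every update, store the before image of the row in the undo log.
create trigger TR_AU_EMP after update on EMP
for each row
begin
    insert into EMP_UNDO_LOG
    values (strftime('%Y-%m-%d %H:%M:%f', 'now'), 'update',
            old.ID, old.NAME, old.ADDR, old.SAL);
end;
""")

conn.execute("insert into EMP values (676, 'Sue', 'Baar', 7000)")
conn.execute("update EMP set ADDR = 'Bern' where ID = 676")
conn.execute("update EMP set SAL = 7500 where ID = 676")

current = conn.execute("select ID, ADDR, SAL from EMP").fetchall()
log = conn.execute(
    "select UNDO_OP_CODE, ID, ADDR, SAL from EMP_UNDO_LOG order by rowid"
).fetchall()
print(current)
print(log)
```

The snapshot table holds only the current contents, while the log holds one before image per update, in the order in which the updates happened.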
Bitemporal Relations
Valid time and transaction time can be combined to allow for a complete history of what information was/is believed to be true and when this was stored in the database.
Example: Complete (bitemporal) history of employee 676
1. 2006-07-01: insert "676 lives in Baar and earns 7000 as of 2006-08-01."
Bitemporal Relations (2)
Example (continued): Complete (bitemporal) history of employee 676
2. 2008-04-01: update "676 lives in Bern as of 2008-03-01."
Bitemporal Relations (3)
Example (continued): Complete (bitemporal) history of employee 676
3. 2009-11-01: update "676 earns 7500 as of 2010-01-01."
Bitemporal Relations (4)
Example (continued): Complete (bitemporal) history of employee 676
4. 2009-11-11: update (correction) "676 earns 7700 as of 2010-01-01."
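The four steps of this history can be replayed with a small Python sketch. It does not reproduce the bitemporal tables of the slides. Instead it stores each statement as an assertion and assumes, as a simplification, that a later-recorded assertion overrides earlier ones from its valid-from date onward. The function name `believed` and the tuple layout are hypothetical.

```python
from datetime import date

# (transaction time, attribute, value, valid from) for employee 676
ASSERTIONS = [
    (date(2006, 7, 1), "ADDR", "Baar", date(2006, 8, 1)),
    (date(2006, 7, 1), "SAL", 7000, date(2006, 8, 1)),
    (date(2008, 4, 1), "ADDR", "Bern", date(2008, 3, 1)),
    (date(2009, 11, 1), "SAL", 7500, date(2010, 1, 1)),
    (date(2009, 11, 11), "SAL", 7700, date(2010, 1, 1)),  # correction
]


def believed(attribute, valid_at, known_at):
    """Value of attribute at valid time valid_at, as stored at known_at."""
    candidates = [
        a for a in ASSERTIONS
        if a[1] == attribute and a[0] <= known_at and a[3] <= valid_at
    ]
    if not candidates:
        return None
    # The most recently recorded applicable assertion wins.
    return max(candidates, key=lambda a: a[0])[2]


# What did the database say on 2009-11-05 about the salary in mid 2010?
print(believed("SAL", date(2010, 6, 1), date(2009, 11, 5)))
# And what does it say after the correction of 2009-11-11?
print(believed("SAL", date(2010, 6, 1), date(2009, 11, 20)))
```

Varying `known_at` answers "what did we believe then?", while varying `valid_at` answers "what was true then?".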
Design of Temporal Databases
Basic idea
- Do non-temporal database design
- Annotate which tables / attributes need to be historized (especially valid time) and how (state-based vs. event-based)
- Generate temporal data structures ... but how?
Questions:
- Entity integrity (implemented by primary keys) => temporal entity integrity?
- Referential integrity (implemented by foreign keys) => temporal referential integrity?
Arbiter: sequence of snapshots model
Temporal Entity Integrity (1)
Temporal entity integrity = for every snapshot, entity integrity should hold.
Pro memoria:
- primary keys should consist of a minimal number of attributes which uniquely identify each tuple
- these attributes should ideally not change over time
Options for the primary key of a valid time relation (e.g. for table EMP):
(1) ID, V_FROM
(2) ID, V_TO
(3) ID, V_FROM, V_TO (non-minimal primary key!)
(4) ID, SEQ_NO (where SEQ_NO is a sequence number or counter)
Since all attributes except ID (and SEQ_NO) can change over the lifetime of the identified tuple,
- alternative (4) is probably the best,
- followed by alternative (1), as V_FROM only changes in case of an error (and should not be referenced by foreign keys, as we'll see)
Temporal Entity Integrity (2)
In addition, it might be desirable to enforce other constraints, including
- Time intervals must not be empty
- Time intervals should be maximal (unless e.g. queries like "what was the case before or after a specific point in time" are not of importance)
create table EMP (
ID integer not null,
SEQ_NO integer not null,
NAME varchar(20) not null,
...
V_FROM date not null,
V_TO date default date '9999-12-31',
primary key (ID, SEQ_NO),
check ( V_FROM
Referential Integrity between Snapshot Relations
The foreign key (FK) attribute value(s) in the referencing relation must exist as primary key (PK) values in the referenced relation:
Example: $\mathit{Works\_On}[\mathit{Emp\_Id}] \subseteq \mathit{Emp}[\mathit{Id}]$
Note: In relational theory, this is sometimes also called an inclusion dependency.
Temporal Referential Integrity (1)
Temporal referential integrity = for every snapshot, referential integrity must hold.
Problem:
- primary keys now have a temporal part (on top of the non-temporal part)
- valid time periods in the foreign key (referencing) relation are not necessarily the same as those of the primary key (referenced) relation
At every point in time when the FK value was valid, the referenced PK value must be valid (with $\tau_t$ denoting the snapshot at time t):
$\forall t \, \bigl( \tau_t(\mathit{Works\_On}[\mathit{Emp\_Id}]) \subseteq \tau_t(\mathit{Emp}[\mathit{Id}]) \bigr)$
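One way to check this condition without materializing every snapshot is to test whether each referencing interval is covered by the union of the referenced intervals for the same key. The Python sketch below illustrates this; the row layouts, the function names, and the data are assumptions, and end dates are taken to be inclusive at a granularity of one day.

```python
from datetime import date


def covered(fk_from, fk_to, pk_intervals):
    """True if [fk_from, fk_to] lies within the union of pk_intervals."""
    need = fk_from.toordinal()          # first day still lacking coverage
    end = fk_to.toordinal()
    for p_from, p_to in sorted(pk_intervals):
        if p_from.toordinal() > need:
            return False                # gap before the next referenced interval
        need = max(need, p_to.toordinal() + 1)
        if need > end:
            return True
    return False


def temporal_ri_holds(works_on, emp):
    """works_on rows: (EMP_ID, PROJ_ID, V_FROM, V_TO); emp rows: (ID, V_FROM, V_TO)."""
    return all(
        covered(w_from, w_to, [(f, t) for e_id, f, t in emp if e_id == emp_id])
        for emp_id, _proj, w_from, w_to in works_on
    )


# Hypothetical data: two consecutive EMP rows for employee 676.
EMP = [
    (676, date(2006, 8, 1), date(2008, 2, 29)),
    (676, date(2008, 3, 1), date(9999, 12, 31)),
]
print(temporal_ri_holds([(676, "P1", date(2007, 1, 1), date(2008, 12, 31))], EMP))
```

The sweep has the same effect as coalescing the referenced intervals first and then testing containment.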
Temporal Referential Integrity (2)
$\forall t \, \bigl( \tau_t(\mathit{Works\_On}[\mathit{Emp\_Id}]) \subseteq \tau_t(\mathit{Emp}[\mathit{Id}]) \bigr)$ holds for employee 676 because projection followed by temporal coalescing would result in:
Of course, performing temporal coalescing whenever
- adding tuples to and/or extending time intervals of the referencing relation
- deleting tuples from and/or shrinking time intervals in the referenced relation
would be an expensive proposition.
Recommendation: Track complete lifetimes of objects in a separate relation
Temporal Referential Integrity (3)
- Split the valid time relation on the referenced (PK) side into an object relation and a property relation.
- Add a referential integrity constraint from the property relation to the object relation.
- Re-route non-temporal referential integrity constraints from other relations to the object relation.
Temporal Referential Integrity (4)
In referencing relations, it might be desirable to enforce referential integrity
- non-temporal part: as usual
- temporal part: time interval contained in time interval of referenced object
create table WORKS_ON (
EMP_ID integer not null,
PROJ_ID integer not null,
SEQ_NO integer not null,
V_FROM date not null,
V_TO date default date '9999-12-31',
primary key (EMP_ID, PROJ_ID, SEQ_NO),
check ( V_FROM
Temporal Normalization (1): Time-invariant Attributes
Assume that attribute FName cannot change over the lifetime of an Emp (except to correct mistakes).
In other words, the functional dependency (FD) Id -> FName holds.
=> relation Emp_Prop below is not in 2NF (attribute depends on part of PK)
=> relation Emp_Prop exhibits update anomalies when having to fix a mistake in Sue's first name (e.g. change to "Susan")
Temporal Normalization (2): Time-invariant Attributes
Recommendation:
Consider moving time-invariant attributes (e.g. FName) from the property relation (e.g. Emp_Prop) to the object relation (e.g. Emp_Obj).
In Emp_Obj, the FD Id -> FName still holds (and is enforced by the PK), so the relation does not exhibit update anomalies.
In Emp_Prop, all attributes are now fully dependent on the PK, but there is still an issue ...
Temporal Normalization (3): Asynchronous Changes
Example: After having inserted the salary raise for employee 676 as of the beginning of 2010, we learn that she actually moved to Aarau as of Dec 1 2009.
=> update anomaly: several tuples need to be changed (in addition to an insert)!
Recommendation:
Attributes whose values change independently of other attributes should be put
into different relations
(somewhat like achieving 4NF in the face of multi-valued dependencies).
Temporal Normalization (4): Asynchronous Changes
Example: Since address and salary of an employee may change independently (and asynchronously), these attributes should be put into different relations.
=> no update anomaly: one tuple needs to be changed (in addition to an insert)!
Employee salaries remain untouched:
Summary of Design Recommendations
- For kernel entity types (with objects whose existence is independent of other entities), consider the introduction of an object relation to capture the lifetime of these objects. Main benefits:
  - referential integrity checking over time
  - home for time-invariant attributes
- For relations representing object properties (or relationships between objects) and their history, consider choosing a temporal primary key consisting of the non-temporal primary key attributes plus a (meaningless) sequence number.
- For relations representing object properties (or relationships between objects), consider decomposing them into groups of attributes which
  - are either time-invariant => this attribute group is moved to the object relation
  - or change independently of one another (i.e., potentially at different times) => each such attribute group is moved into a separate relation keeping track of the history of the values
Remember: Following them is no free lunch!
Return to (Valid) Time in Warehousing
[Figure: star schema with fact table POLICY_PTF (PREMIUM_AMT, LOSS_AMT, EXPENSE_AMT, PROFIT_AMT) and the dimensions TIME, PRODUCT (PROD_ID), CLIENT (CL_ID, CL_NAME, CL_RATING), and PROF_CENTER (PC_ID, PC_NAME, DIV_ID, DIV_NAME).]
Motivating Example
Compare profits over the years
- grouped by business divisions
- grouped by client ratings
What happens if, over time,
- business divisions change (e.g. profit centers are shifted)?
- ratings of clients change?
- two clients merge (e.g., primary insurers in the reinsurance business)?
First impressions
[Figure: measure profit [CHF] over time (2009, 2010) for the dimensional values X, Y, and Z (e.g., names of business divisions), annotated with changes of +24%, -40%, and +80%.]
First impressions can be deceiving
[Figure: profit [CHF] over time (2009, 2010) after a profit center shift, broken down into X (X1, X2, X3), Y (Y1, Y2), S, and Z (Z1, Z2), annotated with changes of +24%, -40%, +80%, +0%, and +11%.]
Terminology and Concepts: Dimensional Hierarchies
Dimensions often have a hierarchical structure, e.g., in the previous example:
- Product: hierarchical LineOfBusiness
- ProfitCenter: embedded in hierarchical org structure ProfitCenter -> Division -> Group
- Client: hierarchical groupings possible, e.g., grouping by country -> continent
[Figure: line-of-business hierarchy with the nodes All Lines, P&C Lines, L&H Lines, Property, Casualty, Special Lines, Life, and Health.]
Coping with Business Change
[Figure: timeline with the capture times tCapture[1] and tCapture[2], the report time tReport, and changes to the referenced dimensional structures.]
- tCapture[i]: successful completion of a business transaction; captured measures refer to the dimensional structures valid at this time
- tReport: report production
Which dimensional structure should reported measures refer to?
- original structures valid at the respective capture times (tCapture[i])?
- structures valid at report time (tReport)?
- other times?
=> need history + valid times, need succession mapping
Running Example
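[Figure: running example with the measure table Population (CountryId, Year) and the dimensions Country (CountryId, CountryName) and Year; the Country dimension is marked as subject to changes.]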
Changes to Dimensional Structures
Type  Change      Description
1     add         New value added
2     rename      Old value (name) will be replaced by new value
3     invalidate  A value will no longer be available for new contracts
4     merge       n old values will be merged into one value
5     split       Old value will be divided into n values
6     move        One value changes position in hierarchy
[Image column: one small diagram per change type.]
Key Questions: Succession Mapping, Taxonomic Relationship
Examples of Changes to Dimensional Structures
Adapted from "Temporal Data Warehousing: Business Cases and Solutions", J. Eder et al.
[Figure: examples of add, rename, invalidate, merge, and split.]
Issues: History, Validity and Succession of Values
Dimensional values to be tracked over time must have
(1) a unique, invariant, not-to-be-reused identifier for the concept that the value represents, e.g. an identifier for the country first named Zaire and later Congo
(2) a validity period indicating the overall lifetime of the concept which the value represents, e.g. the lifetime of the country first named Zaire and later Congo
(3) validity periods indicating the lifetime of the values used to represent the concept, e.g. the lifetimes of the names Zaire and Congo
(4) for invalid dimensional values, another dimensional value as successor, e.g., East Germany is succeeded by Germany
Unique Identifier (1)
DB2 Colloquium, 2006-10-25
Succession of Dimensional Values (4)
Succession of Dimensional Values (4)
Step 3: Reassemble parts
Succession of Dimensional Values (4)
SQL Statement to do all 3 steps
SELECT COALESCE(s.CurrId, p.CountryId) AS CountryId
     , p.Year
     , SUM(p.Population) AS Population
FROM CountryPopulation p
LEFT OUTER JOIN CountrySuccession s ON s.Id = p.CountryId
GROUP BY COALESCE(s.CurrId, p.CountryId), p.Year
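To see the succession mapping at work, the statement can be run against SQLite from Python. The sketch below uses invented data (countries 1 and 2 merge into country 3, with round population figures) and groups by the mapped identifier, so that predecessors and successor end up in one row per year. The table and column names follow the slides; everything else is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table CountryPopulation (CountryId integer, Year integer, Population integer);
create table CountrySuccession (Id integer, SuccId integer, CurrId integer);

-- Invented data: countries 1 and 2 merge into country 3.
insert into CountryPopulation values (1, 1989, 100), (2, 1989, 400), (3, 1991, 520);
insert into CountrySuccession values (1, 3, 3), (2, 3, 3);
""")

rows = conn.execute("""
select coalesce(s.CurrId, p.CountryId) as CountryId,
       p.Year,
       sum(p.Population) as Population
from CountryPopulation p
left outer join CountrySuccession s on s.Id = p.CountryId
group by coalesce(s.CurrId, p.CountryId), p.Year
order by 1, 2
""").fetchall()

for row in rows:
    print(row)
```

Countries without an entry in CountrySuccession keep their own identifier, because the outer join yields null for CurrId and coalesce falls back to CountryId.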
Side Issue: Difficulties with the Split Operation
Example
- measures population and GNP (gross national product) have been collected for Czechoslovakia up to 1992
- as of 1993, the same measures are collected for the Czech Republic and Slovakia
Possible solutions
- from 1993 on, keep Czechoslovakia and compute its population and GNP figures by summing the figures of the Czech Republic and Slovakia
- for the years up to 1992, derive figures for the Czech Republic and Slovakia as approximate percentages of the population and GNP figures of Czechoslovakia
  note: in general, the percentages of the various measures are not identical
- leave countries as is and perform no mapping in either direction
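The first two options amount to simple arithmetic, sketched below in Python. All figures and shares are invented for illustration and are not real population or GNP data; the function names are hypothetical. Exact fractions are used to avoid rounding noise.

```python
from fractions import Fraction


def aggregate(parts):
    """Option 1: figure of the old value = sum of the figures of its successors."""
    return sum(parts.values())


def allocate(total, shares):
    """Option 2: split a historical total by approximate shares (must sum to 1)."""
    if sum(shares.values()) != 1:
        raise ValueError("shares must sum to 1")
    return {name: total * share for name, share in shares.items()}


# Invented figures: one set of shares per measure, since in general the
# percentages differ between measures such as population and GNP.
population_shares = {"Czech Republic": Fraction(2, 3), "Slovakia": Fraction(1, 3)}
gnp_shares = {"Czech Republic": Fraction(3, 4), "Slovakia": Fraction(1, 4)}

print(aggregate({"Czech Republic": 1000, "Slovakia": 500}))
print(allocate(1500, population_shares))
print(allocate(800, gnp_shares))
```

Keeping separate shares per measure reflects the note above: applying the population percentages to GNP would distort the allocated figures.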
Handling Splits (Sketch) (4)
Step 1: Aggregate over taxonomy
Step 2: Extrapolate
Lifecycle of Concepts (2)
[Figure: state diagram with the states Active, Inactive, and Superseded; labels include "Start of validity", "Introduction as Inactive", "Move", and "define successor".]
- Active: can be used to book new business and appear on reports
- Inactive: can appear on reports but cannot be used to book new business
- Superseded: cannot appear on reports nor be used to book new business
Validity (Lifetime) of Concepts (2)
Validity (Lifetime) of Names of Concepts (3)
Modified Star Schema Design
Principle
- Add valid times in dimensions in the Data Warehouse using
  - an object table (Country)
  - a single property table (here: CountryNames)
  both with an associated valid time interval.
- Let foreign keys in fact tables refer to the unchanging ID in object tables.
- Generate standard Data Marts from this data model as needed, most often a history of measures according to the current dimensional structure.
[Figure: modified star schema with the tables
  Population (CountryId, Year),
  Country (CountryId, VTimBeg, VTimEnd),
  Year,
  CountrySuccession (Id -- original identifier, SuccId -- direct successor, CurrId -- ultimate successor),
  CountryNames (CountryId, VTimBeg, VTimEnd, CountryName).]
Coping with a Distributed Environment (Teaser)
[Figure: data stores and data flows in the enterprise.
- Master Data Stores (e.g., MDM, CRM, ForEx, Geo DB): identifiers, dimensional attributes
- Transactional Data Stores (e.g., Claims and Underwriting Systems): additional identifiers, measures tied to ref data
- Integration Data Stores: History Stores (DWh), Exchange Stores (ODS)
- Analytical Data Stores
Two kinds of flows are shown: flow of master data (e.g. dimension attributes + values) and flow of transactional data.]
Note: Of course, in a global enterprise, all of this happens in a distributed environment.
Kimball's Types of Slowly Changing Dimensions
Ralph Kimball proposed 3 (well, actually only 2) "poor man's solutions" to the historization of dimensions, called slowly changing dimensions (SCD), in the context of the Star Schema.
- SCD Type 1: no history of the dimensional attribute is needed
  => simply overwrite the value
  e.g. the correction of mistakes in names, birthdays etc.
- SCD Type 2: versions of some dimensional attributes are needed
  => store new records in the dimension table, with a new DWh identifier (ID), the existing stable source system ID, and the new (changed) values
  e.g. a change in the rating of a client, or the new business division a profit center belongs to
- SCD Type 3: current and original (or previous) versions are needed
  => introduce a "current" and an "original" attribute in the dimension table
  e.g. the current rating and the original rating of each client
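The three types can be contrasted with a small Python sketch that treats a dimension table as a list of dictionaries. The column names DWH_ID, SRC_ID, and the ORIG_ prefix, as well as the sample client, are hypothetical choices for illustration only.

```python
from itertools import count

_next_dwh_id = count(1)  # generator for new DWh surrogate identifiers


def new_row(source_id, **attrs):
    return {"DWH_ID": next(_next_dwh_id), "SRC_ID": source_id, **attrs}


def scd_type1(dim, source_id, **changes):
    """Type 1: overwrite the value in place, no history is kept."""
    for row in dim:
        if row["SRC_ID"] == source_id:
            row.update(changes)


def scd_type2(dim, source_id, **changes):
    """Type 2: add a new row with a new DWh ID and the same source system ID."""
    latest = max((r for r in dim if r["SRC_ID"] == source_id),
                 key=lambda r: r["DWH_ID"])
    attrs = {k: v for k, v in latest.items() if k not in ("DWH_ID", "SRC_ID")}
    attrs.update(changes)
    dim.append(new_row(source_id, **attrs))


def scd_type3(dim, source_id, attr, value):
    """Type 3: keep the original value next to the current one."""
    for row in dim:
        if row["SRC_ID"] == source_id:
            row.setdefault("ORIG_" + attr, row[attr])
            row[attr] = value


client = [new_row("C1", NAME="Acme", RATING="A")]
scd_type1(client, "C1", NAME="ACME Ltd")   # correction of a mistake
scd_type2(client, "C1", RATING="B")        # rating change, versioned
for row in client:
    print(row)
```

After these calls the Type 2 dimension contains two rows for source ID C1, which is exactly the situation that existing facts (pointing to the old DWh ID) and new facts (pointing to the new one) rely on.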
Slowly Changing Dimensions Type 1
Pros
- Simple to understand for business users and simple to implement (especially when using MOLAP tools)
- Requires the least space and has the best response time
Cons
- Simplicity for business users is deceiving
- A change in a dimensional attribute effectively changes the context for all facts captured prior to the change
Slowly Changing Dimensions Type 2
Pros
- Reasonably understandable and simple to implement (regardless of MOLAP / ROLAP tool)
- Captures parts of the history
Cons
- The time of a change in a dimension is not captured
- Requires more space since a single dimensional object is possibly represented in several rows (but this is usually not an issue)
- Can be confusing since changed dimensional data objects appear more than once, with identical source system IDs, but at least one changed attribute value
- Checking when it is ok to refer to which DWh IDs is not possible
Literature
General Temporal Database Concepts
[Snodgrass 1999] Richard T. Snodgrass: Developing Time-Oriented Database Applications. Morgan Kaufmann,
1999. (see http://www.cs.arizona.edu/people/rts/publications.html)
[Zhou et al 2006] Xin Zhou, Fusheng Wang, Carlo Zaniolo: Efficient Temporal Coalescing Query Support in
Relational Database Systems. Proc. 17th International Conference on Database and Expert Systems
Applications - DEXA '06, 2006.
[Johnston & Weis 2010] Tom Johnston, Randall Weis: Managing Time in Relational Databases: How to Design,
Update and Query Temporal Data. Morgan Kaufmann, 2010.
Data Warehouse Design
[Kimball & Ross 2002] Ralph Kimball, Margy Ross: The Data Warehouse Toolkit: The Complete Guide to
Dimensional Modeling, 2nd Edition. John Wiley, 2002.
[Imhoff et al 2003] Claudia Imhoff, Nicholas Galemmo, Jonathan G. Geiger: Mastering Data Warehouse Design:
Relational and Dimensional Techniques. John Wiley, 2003.
[Golfarelli & Rizzi 2009] Matteo Golfarelli, Stefano Rizzi: Data Warehouse Design: Modern Principles and
Methodologies. McGraw Hill, 2009.
[Adamson 2010] Christopher Adamson: Star Schema: The Complete Reference. McGraw Hill, 2010.