
Why “Date BETWEEN FromDate and ToDate” is a Dangerous Join Criteria


2012-06-29 | Data Modeling (http://kejser.org/category/modeling/)

Tags: Data Warehouse (http://kejser.org/tag/data-warehouse/), Dimensional Models (http://kejser.org/tag/dimensional-models/), Histogram (http://kejser.org/tag/histogram/), Historizing (http://kejser.org/tag/historizing/), OLTP (http://kejser.org/tag/oltp/), Query Plans (http://kejser.org/tag/query-plans/), Statistics (http://kejser.org/tag/statistics/)

I have been meaning to write this blog post for some time, and the discussion about Data Vault finally prompted me to do it.

Sometimes, you find yourself in situations where you have to join a table that has a structure like this:

CREATE TABLE TemporalTracking (
    SomeKey INT
    , FromDate DATETIME
    , ToDate DATETIME
    /* , ...other columns... */
)

The join criteria is expressed:

FROM OT
INNER JOIN TemporalTracking T
    ON OT.SomeTimeColumn BETWEEN T.FromDate AND T.ToDate
    AND OT.SomeKey = T.SomeKey

Or, more commonly, this variant with a semi-open interval:

FROM OT
INNER JOIN TemporalTracking T
    ON OT.SomeTimeColumn >= T.FromDate
    AND OT.SomeTimeColumn < T.ToDate
    AND OT.SomeKey = T.SomeKey

Data models that promote these types of joins are very dangerous to relational optimizers, and you have to step carefully when executing queries with many of these joins. Let us have a look at why this is so.

Temporal Join

To illustrate the issue with this query pattern, let me create a very simple test setup that you can experiment with. Use this script to generate the two tables:

CREATE TABLE SmallTable (
    SK INT NOT NULL
    , BusinessKey INT
    , FromDate DATETIME
    , ToDate DATETIME
    , SomeColumn INT)

/* Create 5M row join table */
INSERT SmallTable WITH (TABLOCK)
SELECT reps.n * 100000 + k.n - 1
    , k.n - 1
    , DATEADD(month, reps.n - 1, '2000-01-01')
    , DATEADD(month, reps.n, '2000-01-01')
    , reps.n * k.n % 15
FROM fn_nums(100000) k
CROSS JOIN fn_nums(10) reps

CREATE UNIQUE INDEX CIX ON SmallTable (BusinessKey, FromDate)
CREATE UNIQUE INDEX IX_SK ON SmallTable (SK)
CREATE STATISTICS Stat_FromTo ON SmallTable (BusinessKey, FromDate, ToDate)
UPDATE STATISTICS SmallTable WITH FULLSCAN

SELECT n AS RowID
    , n % 100000 AS BusinessKey
    , n % 1000 AS OtherKey
    , DATEADD(month, n % 5, '2000-01-01') AS TranDate
INTO BigTable
FROM fn_nums(10000000)

(See my utility functions (http://blog.kejser.org/2011/04/26/utility-functions-fn_convert_to_base-and-fn_nums/) for fn_nums.)
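
If you do not want to fetch that helper, a minimal stand-in is enough for the row counts used in this script. The version below is my sketch, not the author's implementation; all it has to do is return a column named n with the values 1..@n (note that table-valued functions are normally invoked with a schema prefix, e.g. dbo.fn_nums):

CREATE FUNCTION dbo.fn_nums(@n BIGINT)
RETURNS TABLE
AS RETURN
    /* Sequence 1..@n built by cross joining system catalogs */
    SELECT TOP (@n) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
    FROM sys.all_objects a
    CROSS JOIN sys.all_objects b
    CROSS JOIN sys.all_objects c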

SmallTable above is a temporal tracker table with 10 temporal records per BusinessKey. BigTable is about 400 MB and does not use temporal tracking, but has a TranDate column that determines which temporal records in SmallTable match it.

Let us now try to execute a reporting query where we ask for an aggregate over BigTable, joining it up to its matching temporal keys in SmallTable:

SELECT S.SomeColumn, B.OtherKey, COUNT(RowID)
FROM BigTable B
INNER JOIN SmallTable S
    ON B.BusinessKey = S.BusinessKey
    AND B.TranDate >= S.FromDate
    AND B.TranDate < S.ToDate
GROUP BY S.SomeColumn, B.OtherKey

Some quick statistics about this query execution on my laptop:

CPU time: 25547 ms
Logical I/O operations: 50762 (no physical)
Memory Grant: 370 MB
Rows Returned: 2600

Nothing overly suspicious yet (though the memory grant seems rather high). Let's just have a quick look at the query plan:

(http://kejserbi.files.wordpress.com/2012/06/image19.png)

That is a pretty big misestimate, isn't it? And that is the crux of the issue: it is immensely hard for a query optimizer to accurately predict that the join on the temporal table will lead to one and only one row (unless you have a temporally aware database, of course).

But misestimates are not the full story; there is also a higher CPU cost involved in doing this join. At the CPU instruction level, more work needs to be done to find the records matching an interval than to do a straight comparison of two values.

Now, you can imagine what happens if you have a data model that has very long chains of these joins. As you probably know, query misestimates (and the risk of bad query plans) typically grow exponentially with the number of tables being joined. Having a data model that almost guarantees poor estimates even in a two-join setup can quickly lead to interesting tuning challenges.
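
If you want to reproduce this estimate-vs-actual gap without opening the graphical plan, one option is SET STATISTICS XML, which returns the actual execution plan (including the EstimateRows and ActualRows attributes for each operator) as an extra XML result set. A minimal sketch against the test tables above:

SET STATISTICS XML ON

SELECT S.SomeColumn, B.OtherKey, COUNT(RowID)
FROM BigTable B
INNER JOIN SmallTable S
    ON B.BusinessKey = S.BusinessKey
    AND B.TranDate >= S.FromDate
    AND B.TranDate < S.ToDate
GROUP BY S.SomeColumn, B.OtherKey

SET STATISTICS XML OFF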

Going Kimball Again

There is a very good reason Kimball recommends integer keys for type 2 dimensions instead of the temporal join you just saw.

Let us change BigTable into a Kimball representation instead by pre-joining like this:

SELECT RowID
    , S.SK
    , B.OtherKey
    , B.TranDate
INTO BigTableKimball
FROM BigTable B
INNER JOIN SmallTable S
    ON B.BusinessKey = S.BusinessKey
    AND B.TranDate >= S.FromDate
    AND B.TranDate < S.ToDate

We can now rewrite the previous aggregate query to this:

SELECT S.SomeColumn, B.OtherKey, COUNT(RowID)
FROM BigTableKimball B
INNER JOIN SmallTable S
    ON B.SK = S.SK /* Kimball style Type 2 join */
GROUP BY S.SomeColumn, B.OtherKey

Let us first have a look at the query plan for the Kimball style join:

(http://kejserbi.files.wordpress.com/2012/06/image20.png)

Same query plan shape, but look at the difference in estimates vs. actuals! We are spot on here.

Comparing the Kimball style Type 2 join with the temporal join, we get:

Measurement      Temporal Join    Kimball Type 2
CPU Time         25547 ms         10844 ms
Logical I/O      50762            50762
Memory Grant     370 MB           315 MB
Rows Returned    2600             2600
Misestimate      3x               None

Summary

In this blog entry, I have shown you why temporal style joins can be dangerous to query optimizers. While it is not always possible to avoid them, extreme care should be taken if you choose to include them as the only way to access your data model.

Comments

Pingback: How Vertical Partitioning and Deep Joins Kill Parallelism | Thomas Kejser's Database Blog (http://blog.kejser.org/2012/07/16/how-vertical-partitioning-and-deep-joins-kill-parallelism/)

Pingback: Modeling Dimensions with History Tracked, Generic Attributes | Thomas Kejser's Database Blog (http://blog.kejser.org/2012/07/06/modeling-dimensions-with-history-tracked-generic-attributes/)


cteveret (http://gravatar.com/cteveret), 2 years ago

Thanks again Thomas for these additional tips. I appreciate you taking the time to share your experiences. We will have additional projects this year attempting to query this database, so these tips will help. I checked out Thomas Christensen's forum and there is a lot to digest :-). Still working on it and have learned a good deal already!

Todd Everett (http://gravatar.com/cteveret), 2 years ago

Great post Thomas, thanks for taking the time. I work frequently with a database designed using a vault model concept. This data warehouse is built and extended using an architectural strategy based upon the idea that a small set of business data in scope of a given operational system functional enhancement project must be warehoused, as it is already being worked upon and might be useful someday to those who might want to report on it. The architecture also requires that a full history is retained for every data element in every entity. The vault model has been very helpful to us in meeting these requirements, given its exclusive use of M to M relationships and in retaining a history of every column change with its satellite tables. The Link tables and the Satellite tables both rely on bi-temporal relationships to relate entities to each other and retain a relationship history over time, and also to maintain a row history for each entity over time.

While it works really well for folding in new entities and data elements without impacting existing process (you just create a new hub or link, or add a new satellite), we often struggle to get good plans for range scan queries. The plans are hard to read as there are so many tables involved (every entity has at least 1 hub/link, one satellite (often 3 or 4), and one point-in-time satellite to make it easy to join the current row without a subselect). Often the optimizer times out generating the plan if we are joining in 10 or more composite entities. I can usually re-write queries to get only the data required and often use cross applies to join the hub/link tables to the satellite (which only helps for current state queries where the matching bi-temporal child is always the most recent one; thankfully that is almost always what is needed), but it is a lot of work. But a 20 entity monster query written to retrieve data as of some time in the past (like in your example) never finishes. I was at a loss to understand why this was the case, and your post has really helped me grasp the trade-offs in query optimization we have in using this approach.

Ultimately our architecture calls for the warehouse to be like a distribution center: it feeds data marts and isn't intended to be queried, but in reality our business partners don't want to fund additional development for a data mart and our operations partners don't want to manage the extra databases, as each means more backups, more maintenance, etc. You have given me some good tools and understanding to help explain to our partners why, if they want easy and fast reporting under our current architecture, we need the extra funding to build a data mart. Usually what I encounter is the belief that indexing and the nolock hint solves all. The test script you provided is a great example to show how an inability for the optimizer to develop accurate estimates given all these bi-temporal relationships is a root problem that won't always be easily addressed by indexes and query hints.

I don't want to get into a Kimball vs. Inmon / Linstedt battle, as every architecture optimizes certain objectives at the expense of others and none is good or bad absent of those objectives. I have been trying to learn all I can about all the different approaches, what objectives they maximize and what trade-offs they have. Your blog has been a really good source of learning material and I look forward to your continued posts.
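
For readers unfamiliar with the pattern Todd mentions, the "current row" lookup via CROSS APPLY usually looks something like this. The table and column names below (HubCustomer, SatCustomer, LoadDate) are purely illustrative and not taken from his model:

SELECT h.HubCustomerKey, s.*
FROM HubCustomer AS h
CROSS APPLY (
    SELECT TOP (1) *
    FROM SatCustomer AS sat
    WHERE sat.HubCustomerKey = h.HubCustomerKey
    ORDER BY sat.LoadDate DESC    -- the most recent satellite row is the current one
) AS s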

Thomas Kejser, 2 years ago

Hi Tod

Thanks for the reply. Your experience with Vault reflects mine.

If your data model consistently creates these types of issues for you, isn't there a point where you have to ask the obvious question?

It is indeed true that Vault allows the addition of sources fast, but at what cost further down the delivery chain? You might find the discussion in Thomas Christensen's forum informative (see my links of the previous post). Here, I describe how the history tracking you are looking for can be done without requiring the Vault model.

The forum also contains a fascinating discussion about the tradeoffs you make with Vault and exactly which benefits are claimed. After reading it, I hope you might revise your stance that every model has BOTH good and bad sides.

cteveret (http://thesqlda.wordpress.com), 2 years ago

Thanks Thomas, I will check these discussions out!

Thomas Kejser, 2 years ago

I forgot to share some advice Todd (and sorry for misspelling your name).

If you do find yourself in Vault land with no way back, there are a few tricks you can play to narrow the search space. Disclaimer: this will NOT generate the best queries, though in my experience it may make the optimiser create a "good enough" plan in more reasonable time.

First of all: add OPTION (LOOP JOIN) to the queries. While it IS possible to perform BETWEEN queries with hash joins, the cost (and risk) of getting the join order wrong is too high.

Second: use FORCESEEK hints on all tables. This again narrows the search space and will avoid expensive spools. If you are standard Vault indexed (especially if clustered on all join keys and from-dates), you should have a fully indexed path through the join tree.

There are some additional tricks you can play to unfold history in a structured manner, but they are the subject for a full blog entry.

May the force be with you; you will likely need it.
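
Applied to the test query from this post, the two hints Thomas describes would look roughly like this. This is only a sketch of the technique; in this particular test setup only SmallTable has an index that can satisfy a seek, so the FORCESEEK hint goes there:

SELECT S.SomeColumn, B.OtherKey, COUNT(RowID)
FROM BigTable B
INNER JOIN SmallTable S WITH (FORCESEEK)   -- force an index seek on the temporal table
    ON B.BusinessKey = S.BusinessKey
    AND B.TranDate >= S.FromDate
    AND B.TranDate < S.ToDate
GROUP BY S.SomeColumn, B.OtherKey
OPTION (LOOP JOIN)                         -- restrict the plan to nested loop joins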

Adam Machanic (http://sqlblog.com), 2 years ago

You could have achieved even better results more easily by properly indexing BigTable ((BusinessKey, TranDate) INCLUDE (OtherKey, RowID)) and using a FORCESEEK hint on BigTable in the original query. (At least on my end, this made that query 25% faster than your revised version. It's too bad that the QO requires a hint for that.)

I'm not sure how realistic this example is. I'm used to seeing temporal questions like "give me all of the rows that were active between date X and date Y", whereas yours is "give me everything in the database". If you're only looking for active rows based on a narrow date range with respect to what's in the database (which I think is much more common), you're not going to have this issue. (Again, assuming appropriate indexes and somewhat thoughtful query construction.)
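
For anyone who wants to try Adam's suggestion against the test tables from the post, the index and hint he describes translate to roughly the following (my transcription of his comment; the index name is invented):

CREATE INDEX IX_BigTable_BusinessKey_TranDate
    ON BigTable (BusinessKey, TranDate)
    INCLUDE (OtherKey, RowID)

SELECT S.SomeColumn, B.OtherKey, COUNT(RowID)
FROM BigTable B WITH (FORCESEEK)    -- seek the new index instead of scanning the heap
INNER JOIN SmallTable S
    ON B.BusinessKey = S.BusinessKey
    AND B.TranDate >= S.FromDate
    AND B.TranDate < S.ToDate
GROUP BY S.SomeColumn, B.OtherKey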


Thomas Kejser, 2 years ago

Your first comment is exactly to my point Adam: optimizers don't deal well with this construct, and you often have to hint them if you go down this path. You could index it for that one join, but that won't do you much good if you have to do a bi-temporal join to another table too.

I agree that these queries often have the form of "slide me into this date range", which helps immensely (but still gives you estimates that are way off when you join multiple tables together). Take this as an example:

CREATE TABLE Product (BK, From, To, GroupBK)
CREATE TABLE Group (BK, From, To, CategoryBK)
CREATE TABLE Category (BK, From, To)

Asking the question "what did this product look like at a certain date?" is easy here (but estimates are way off already). But asking "show me the history of all products by Group and Category" is quite tough, and the chance of getting the wrong join strategy and confusing the optimizer becomes significant even on a small table.

The query I used in this blog entry is to illustrate the tradeoff between storing dimensions as bi-temporal relations to facts (like, for example, Data Vault recommends) and storing them as materialized surrogate keys. You really don't want to serve up fact data to ETL tools or end users in this format if you can avoid it.
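
To make the chaining concrete, here is a sketch of the "as of a certain date" form Thomas calls easy, against his three shorthand tables. The data types, bracketed identifiers and the @AsOfDate variable are my additions; note how every extra level repeats the interval predicate, and the full-history version would need interval overlaps instead of a single date:

DECLARE @AsOfDate DATETIME = '2012-01-01'

SELECT P.BK AS ProductBK, G.BK AS GroupBK, C.BK AS CategoryBK
FROM Product P
INNER JOIN [Group] G
    ON G.BK = P.GroupBK
    AND @AsOfDate >= G.[From] AND @AsOfDate < G.[To]
INNER JOIN Category C
    ON C.BK = G.CategoryBK
    AND @AsOfDate >= C.[From] AND @AsOfDate < C.[To]
WHERE @AsOfDate >= P.[From] AND @AsOfDate < P.[To]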

peteads (http://peteadshead.wordpress.com), 2 years ago

In the original example, also consider the impact/cost of maintaining statistics on the changing dates for ToDate as your data evolves.
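
One way to watch how quickly those statistics go stale as rows change is sys.dm_db_stats_properties (available on reasonably recent SQL Server builds); a sketch against the Stat_FromTo statistics object created earlier in the post:

SELECT sp.last_updated, sp.rows, sp.rows_sampled, sp.modification_counter
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE s.object_id = OBJECT_ID('SmallTable')
    AND s.name = 'Stat_FromTo'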
