
Why “Date BETWEEN FromDate and ToDate” is a Dangerous Join Criteria


2012-06-29 | Data Modeling (http://kejser.org/category/modeling/)

Tags: Data Warehouse (http://kejser.org/tag/data-warehouse/), Dimensional Models (http://kejser.org/tag/dimensional-models/), Histogram (http://kejser.org/tag/histogram/), Historizing (http://kejser.org/tag/historizing/), OLTP (http://kejser.org/tag/oltp/), Query Plans (http://kejser.org/tag/query-plans/), Statistics (http://kejser.org/tag/statistics/)

I have been meaning to write this blog post for some time, and the discussion about Data Vault finally prompted me to do it.

Sometimes, you find yourself in situations where you have to join a table that has a structure like this:

CREATE TABLE TemporalTracking (
    SomeKey INT
    , FromDate DATETIME
    , ToDate DATETIME
    /* , ...other columns... */
)

The join criteria is expressed:

FROM OT
INNER JOIN TemporalTracking T
    ON OT.SomeTimeColumn BETWEEN T.FromDate AND T.ToDate
    AND OT.SomeKey = T.SomeKey

Or, more commonly, this variant with a semi-open interval:

FROM OT
INNER JOIN TemporalTracking T
    ON OT.SomeTimeColumn >= T.FromDate
    AND OT.SomeTimeColumn < T.ToDate
    AND OT.SomeKey = T.SomeKey

Data models that promote these types of joins are very dangerous to relational optimizers, and you have to step carefully when executing queries with many of these joins. Let us have a look at why this is so.

Temporal Join

To illustrate the issue with this query pattern, let me create a very simple test setup that you can experiment with. Use this script to generate the two tables:

CREATE TABLE SmallTable (
    SK INT NOT NULL
    , BusinessKey INT
    , FromDate DATETIME
    , ToDate DATETIME
    , SomeColumn INT)

/* Create 5M row join table */
INSERT SmallTable WITH (TABLOCK)
SELECT reps.n * 100000 + k.n - 1
    , k.n - 1
    , DATEADD(month, reps.n - 1, '2000-01-01')
    , DATEADD(month, reps.n, '2000-01-01')
    , reps.n * k.n % 15
FROM fn_nums(100000) k
CROSS JOIN fn_nums(10) reps

CREATE UNIQUE INDEX CIX ON SmallTable (BusinessKey, FromDate)
CREATE UNIQUE INDEX IX_SK ON SmallTable (SK)
CREATE STATISTICS Stat_FromTo ON SmallTable (BusinessKey, FromDate, ToDate)
UPDATE STATISTICS SmallTable WITH FULLSCAN

SELECT n AS RowID
    , n % 100000 AS BusinessKey
    , n % 1000 AS OtherKey
    , DATEADD(month, n % 5, '2000-01-01') AS TranDate
INTO BigTable
FROM fn_nums(10000000)

(See my utility functions (http://blog.kejser.org/2011/04/26/utility-functions-fn_convert_to_base-and-fn_nums/) for fn_nums.)
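
If you do not want to fetch that helper, a minimal stand-in is enough for the row counts used in this script. The version below is my sketch, not the author's implementation; all it has to do is return a column named n with the values 1..@n (note that table-valued functions are normally invoked with a schema prefix, e.g. dbo.fn_nums):

CREATE FUNCTION dbo.fn_nums(@n BIGINT)
RETURNS TABLE
AS RETURN
    /* Sequence 1..@n built by cross joining system catalogs */
    SELECT TOP (@n) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
    FROM sys.all_objects a
    CROSS JOIN sys.all_objects b
    CROSS JOIN sys.all_objects c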

SmallTable above is a temporal tracker table with 10 temporal records per BusinessKey. BigTable is about 400 MB and does not use temporal tracking, but has a TranDate column that determines which temporal records in SmallTable match it.

Let us now try to execute a reporting query where we ask for an aggregate over BigTable, joining it up to its matching temporal keys in SmallTable:

SELECT S.SomeColumn, B.OtherKey, COUNT(RowID)
FROM BigTable B
INNER JOIN SmallTable S
    ON B.BusinessKey = S.BusinessKey
    AND B.TranDate >= S.FromDate
    AND B.TranDate < S.ToDate
GROUP BY S.SomeColumn, B.OtherKey

Some quick statistics about this query execution on my laptop:

CPU time: 25547 ms
Logical I/O operations: 50762 (no physical)
Memory Grant: 370 MB
Rows Returned: 2600

Nothing overly suspicious yet (though the memory grant seems rather high). Let's just have a quick look at the query plan:

(http://kejserbi.files.wordpress.com/2012/06/image19.png)

That is a pretty big misestimate, isn't it? And that is the crux of the issue: it is immensely hard for a query optimizer to accurately predict that the join on the temporal table will lead to one and only one row (unless you have a temporally aware database, of course).

But misestimates are not the full story; there is also a higher CPU cost involved in doing this join. At the CPU instruction level, more work needs to be done to find the records matching an interval than to do a straight comparison of two values.

Now, you can imagine what happens if you have a data model that has very long chains of these joins. As you probably know, query misestimates (and the risk of bad query plans) typically grow exponentially with the number of tables being joined. Having a data model that almost guarantees poor estimates even in a two-join setup can quickly lead to interesting tuning challenges.
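
If you want to reproduce this estimate-vs-actual gap without opening the graphical plan, one option is SET STATISTICS XML, which returns the actual execution plan (including the EstimateRows and ActualRows attributes for each operator) as an extra XML result set. A minimal sketch against the test tables above:

SET STATISTICS XML ON

SELECT S.SomeColumn, B.OtherKey, COUNT(RowID)
FROM BigTable B
INNER JOIN SmallTable S
    ON B.BusinessKey = S.BusinessKey
    AND B.TranDate >= S.FromDate
    AND B.TranDate < S.ToDate
GROUP BY S.SomeColumn, B.OtherKey

SET STATISTICS XML OFF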

Going Kimball Again

There is a very good reason Kimball recommends integer keys for type 2 dimensions instead of the temporal join you just saw.

Let us change BigTable into a Kimball representation instead by pre-joining like this:

SELECT RowID
    , S.SK
    , B.OtherKey
    , B.TranDate
INTO BigTableKimball
FROM BigTable B
INNER JOIN SmallTable S
    ON B.BusinessKey = S.BusinessKey
    AND B.TranDate >= S.FromDate
    AND B.TranDate < S.ToDate

We can now rewrite the previous aggregate query to this:

SELECT S.SomeColumn, B.OtherKey, COUNT(RowID)
FROM BigTableKimball B
INNER JOIN SmallTable S
    ON B.SK = S.SK /* Kimball style Type 2 join */
GROUP BY S.SomeColumn, B.OtherKey

Let us first have a look at the query plan for the Kimball style join:

(http://kejserbi.files.wordpress.com/2012/06/image20.png)

Same query plan shape, but look at the difference in estimates vs. actuals! We are spot on here.

Comparing the Kimball style Type 2 join with the temporal join, we get:

Measurement      Temporal Join    Kimball Type 2
CPU Time         25547 ms         10844 ms
Logical I/O      50762            50762
Memory Grant     370 MB           315 MB
Rows Returned    2600             2600
Misestimate      3x               None

Summary

In this blog entry, I have shown you why temporal style joins can be dangerous to query optimizers. While it is not always possible to avoid them, extreme care should be taken if you choose to include them as the only way to access your data model.

Comments

Pingback: How Vertical Partitioning and Deep Joins Kill Parallelism | Thomas Kejser's Database Blog (http://blog.kejser.org/2012/07/16/how-vertical-partitioning-and-deep-joins-kill-parallelism/)

Pingback: Modeling Dimensions with History Tracked, Generic Attributes | Thomas Kejser's Database Blog (http://blog.kejser.org/2012/07/06/modeling-dimensions-with-history-tracked-generic-attributes/)


cteveret (http://gravatar.com/cteveret), 2 years ago

Thanks again Thomas for these additional tips. I appreciate you taking the time to share your experiences. We will have additional projects this year attempting to query this database, so these tips will help. I checked out Thomas Christensen's forum and there is a lot to digest :-). Still working on it and have learned a good deal already!

Todd Everett (http://gravatar.com/cteveret), 2 years ago

Great post Thomas, thanks for taking the time. I work frequently with a database designed using a vault model concept. This data warehouse is built and extended using an architectural strategy based upon the idea that a small set of business data in scope of a given operational system functional enhancement project must be warehoused, as it is already being worked upon and might be useful someday to those who might want to report on it. The architecture also requires that a full history is retained for every data element in every entity. The vault model has been very helpful to us in meeting these requirements, given its exclusive use of M to M relationships and in retaining a history of every column change with its satellite tables. The Link tables and the Satellite tables both rely on bi-temporal relationships to relate entities to each other and retain a relationship history over time, and also to maintain a row history for each entity over time.

While it works really well for folding in new entities and data elements without impacting existing process (you just create a new hub or link, or add a new satellite), we often struggle to get good plans for range scan queries. The plans are hard to read as there are so many tables involved (every entity has at least 1 hub/link, one satellite (often 3 or 4), and one point-in-time satellite to make it easy to join the current row without a subselect). Often the optimizer times out generating the plan if we are joining in 10 or more composite entities. I can usually re-write queries to get only the data required and often use cross applies to join the hub/link tables to the satellite (which only helps for current state queries where the matching bi-temporal child is always the most recent one; thankfully that is almost always what is needed), but it is a lot of work. But a 20 entity monster query written to retrieve data as of some time in the past (like in your example) never finishes. I was at a loss to understand why this was the case, and your post has really helped me grasp the trade-offs in query optimization we have in using this approach.

Ultimately our architecture calls for the warehouse to be like a distribution center: it feeds data marts and isn't intended to be queried, but in reality our business partners don't want to fund additional development for a data mart and our operations partners don't want to manage the extra databases, as each means more backups, more maintenance, etc. You have given me some good tools and understanding to help explain to our partners why, if they want easy and fast reporting under our current architecture, we need the extra funding to build a data mart. Usually what I encounter is the belief that indexing and the nolock hint solves all. The test script you provided is a great example to show how an inability for the optimizer to develop accurate estimates given all these bi-temporal relationships is a root problem that won't always be easily addressed by indexes and query hints.

I don't want to get into a Kimball vs. Inmon / Linstedt battle, as every architecture optimizes certain objectives at the expense of others and none is good or bad absent of those objectives. I have been trying to learn all I can about all the different approaches, what objectives they maximize and what trade-offs they have. Your blog has been a really good source of learning material and I look forward to your continued posts.
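
For readers unfamiliar with the pattern Todd mentions, the "current row" lookup via CROSS APPLY usually looks something like this. The table and column names below (HubCustomer, SatCustomer, LoadDate) are purely illustrative and not taken from his model:

SELECT h.HubCustomerKey, s.*
FROM HubCustomer AS h
CROSS APPLY (
    SELECT TOP (1) *
    FROM SatCustomer AS sat
    WHERE sat.HubCustomerKey = h.HubCustomerKey
    ORDER BY sat.LoadDate DESC    -- the most recent satellite row is the current one
) AS s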

Thomas Kejser, 2 years ago

Hi Tod

Thanks for the reply. Your experience with Vault reflects mine.

If your data model consistently creates these types of issues for you, isn't there a point where you have to ask the obvious question?

It is indeed true that Vault allows the addition of sources fast, but at what cost further down the delivery chain? You might find the discussion in Thomas Christensen's forum informative (see my links of the previous post). Here, I describe how the history tracking you are looking for can be done without requiring the Vault model.

The forum also contains a fascinating discussion about the tradeoffs you make with Vault and exactly which benefits are claimed. After reading it, I hope you might revise your stance that every model has BOTH good and bad sides.

cteveret (http://thesqlda.wordpress.com), 2 years ago

Thanks Thomas, I will check these discussions out!

Thomas Kejser, 2 years ago

I forgot to share some advice Todd (and sorry for misspelling your name).

If you do find yourself in Vault land with no way back, there are a few tricks you can play to narrow the search space. Disclaimer: this will NOT generate the best queries, though in my experience it may make the optimiser create a "good enough" plan in more reasonable time.

First of all: add OPTION (LOOP JOIN) to the queries. While it IS possible to perform BETWEEN queries with hash joins, the cost (and risk) of getting the join order wrong is too high.

Second: use FORCESEEK hints on all tables. This again narrows the search space and will avoid expensive spools. If you are standard Vault indexed (especially if clustered on all join keys and from-dates), you should have a fully indexed path through the join tree.

There are some additional tricks you can play to unfold history in a structured manner, but they are the subject for a full blog entry.

May the force be with you; you will likely need it.
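
Applied to the test query from this post, the two hints Thomas describes would look roughly like this. This is only a sketch of the technique; in this particular test setup only SmallTable has an index that can satisfy a seek, so the FORCESEEK hint goes there:

SELECT S.SomeColumn, B.OtherKey, COUNT(RowID)
FROM BigTable B
INNER JOIN SmallTable S WITH (FORCESEEK)   -- force an index seek on the temporal table
    ON B.BusinessKey = S.BusinessKey
    AND B.TranDate >= S.FromDate
    AND B.TranDate < S.ToDate
GROUP BY S.SomeColumn, B.OtherKey
OPTION (LOOP JOIN)                         -- restrict the plan to nested loop joins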

Adam Machanic (http://sqlblog.com), 2 years ago

You could have achieved even better results more easily by properly indexing BigTable ((BusinessKey, TranDate) INCLUDE (OtherKey, RowID)) and using a FORCESEEK hint on BigTable in the original query. (At least on my end, this made that query 25% faster than your revised version. It's too bad that the QO requires a hint for that.)

I'm not sure how realistic this example is. I'm used to seeing temporal questions like "give me all of the rows that were active between date X and date Y", whereas yours is "give me everything in the database". If you're only looking for active rows based on a narrow date range with respect to what's in the database (which I think is much more common), you're not going to have this issue. (Again, assuming appropriate indexes and somewhat thoughtful query construction.)
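
For anyone who wants to try Adam's suggestion against the test tables from the post, the index and hint he describes translate to roughly the following (my transcription of his comment; the index name is invented):

CREATE INDEX IX_BigTable_BusinessKey_TranDate
    ON BigTable (BusinessKey, TranDate)
    INCLUDE (OtherKey, RowID)

SELECT S.SomeColumn, B.OtherKey, COUNT(RowID)
FROM BigTable B WITH (FORCESEEK)    -- seek the new index instead of scanning the heap
INNER JOIN SmallTable S
    ON B.BusinessKey = S.BusinessKey
    AND B.TranDate >= S.FromDate
    AND B.TranDate < S.ToDate
GROUP BY S.SomeColumn, B.OtherKey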


Thomas Kejser, 2 years ago

Your first comment is exactly to my point Adam: optimizers don't deal well with this construct, and you often have to hint them if you go down this path. You could index it for that one join, but that won't do you much good if you have to do a bi-temporal join to another table too.

I agree that these queries often have the form of "slide me into this date range", which helps immensely (but still gives you estimates that are way off when you join multiple tables together). Take this as an example:

CREATE TABLE Product (BK, From, To, GroupBK)
CREATE TABLE Group (BK, From, To, CategoryBK)
CREATE TABLE Category (BK, From, To)

Asking the question "what did this product look like at a certain date?" is easy here (but estimates are way off already). But asking "show me the history of all products by Group and Category" is quite tough, and the chance of getting the wrong join strategy and confusing the optimizer becomes significant even on a small table.

The query I used in this blog entry is to illustrate the tradeoff between storing dimensions as bi-temporal relations to facts (like, for example, Data Vault recommends) and storing them as materialized surrogate keys. You really don't want to serve up fact data to ETL tools or end users in this format if you can avoid it.
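
To make the chaining concrete, here is a sketch of the "as of a certain date" form Thomas calls easy, against his three shorthand tables. The data types, bracketed identifiers and the @AsOfDate variable are my additions; note how every extra level repeats the interval predicate, and the full-history version would need interval overlaps instead of a single date:

DECLARE @AsOfDate DATETIME = '2012-01-01'

SELECT P.BK AS ProductBK, G.BK AS GroupBK, C.BK AS CategoryBK
FROM Product P
INNER JOIN [Group] G
    ON G.BK = P.GroupBK
    AND @AsOfDate >= G.[From] AND @AsOfDate < G.[To]
INNER JOIN Category C
    ON C.BK = G.CategoryBK
    AND @AsOfDate >= C.[From] AND @AsOfDate < C.[To]
WHERE @AsOfDate >= P.[From] AND @AsOfDate < P.[To]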

peteads (http://peteadshead.wordpress.com), 2 years ago

In the original example, also consider the impact/cost of maintaining statistics on the changing dates for ToDate as your data evolves.
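
One way to watch how quickly those statistics go stale as rows change is sys.dm_db_stats_properties (available on reasonably recent SQL Server builds); a sketch against the Stat_FromTo statistics object created earlier in the post:

SELECT sp.last_updated, sp.rows, sp.rows_sampled, sp.modification_counter
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE s.object_id = OBJECT_ID('SmallTable')
    AND s.name = 'Stat_FromTo'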
