Advanced Dimensional Modelling
SQLBits 8, 9th April 2011, Brighton
Vincent Rainardi [email protected]
Blog: dwbi1.wordpress.com


1. Dimensions - Structure
   - SCD Type 6
   - 1 or 2 Dimensions
   - When To Snowflake
   - A Dimension with Only 1 Attribute
   - Transaction Level Dimension

2. Fact Tables
   - Fact Table Primary Key
   - Snapshotting Transaction Fact Tables
   - Aggregate Fact Tables
   - Vertical Fact Tables

3. Dimensions - Behavior
   - Rapidly Changing Dimension
   - Very Large Dimensions
   - Banding Dimension Rows
   - Stamping Dimension Rows
   - Dimensions with Multi Valued Attributes

4. Combinations
   - Real Time Fact Table
   - Dealing with Currency Rates
   - Dealing with Status

4 sections: 2 dims, 1 fact, 1 combi. Lots of material, may not be able to finish. 44 slides, some slides we may have to touch lightly. Questions between sections, available after.

SCD Type 6 (1/2)
- SCD Type 6 is a combination of Type 1, 2 & 3: 6 = 1 + 2 + 3 (Ref: Ross & Kimball 2005, Wikipedia).
- e.g. type 2 + type 1: DimAccount (telco example).
- http://www.rkimball.com/html/articles_search/articles%202005/0503IE.html
- http://en.wikipedia.org/wiki/Slowly_changing_dimension
[Diagram labels: Business/Natural Key, SCD Type 6]

SCD Type 6 (2/2)
- Used for "As Was" reporting, e.g. balances by tariff (price plan) at the end of last year, if the customers were on today's tariff.
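A minimal sketch of the type 2 + type 1 combination on DimAccount, run against an in-memory SQLite database. The table and column names (dim_account, fact_balance, tariff, current_tariff) and all the rows are hypothetical, made up for illustration. The type 2 column keeps the tariff that was valid for each row version; the type 1 column is overwritten with today's tariff on every version of the same account.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_account (
    account_key     INTEGER PRIMARY KEY,  -- surrogate key, one per row version (type 2)
    account_number  TEXT,                 -- business/natural key
    tariff          TEXT,                 -- type 2: tariff while this version was valid
    current_tariff  TEXT,                 -- type 1: overwritten on all versions
    valid_from      TEXT,
    valid_to        TEXT,
    is_current      INTEGER
);
CREATE TABLE fact_balance (snapshot_month INTEGER, account_key INTEGER, balance REAL);

-- Account A100 moved from Basic to Premium in March 2011 (hypothetical data).
INSERT INTO dim_account VALUES
  (1, 'A100', 'Basic',   'Premium', '2010-01-01', '2011-02-28', 0),
  (2, 'A100', 'Premium', 'Premium', '2011-03-01', '9999-12-31', 1),
  (3, 'A200', 'Basic',   'Basic',   '2010-01-01', '9999-12-31', 1);
INSERT INTO fact_balance VALUES (201012, 1, 50.0), (201012, 3, 20.0);
""")

query = """
SELECT a.{col}, SUM(f.balance)
FROM fact_balance f
JOIN dim_account a ON f.account_key = a.account_key
WHERE f.snapshot_month = 201012
GROUP BY a.{col}
ORDER BY a.{col}
"""
# End-of-2010 balances by the tariff each account was on at the time.
historical = con.execute(query.format(col="tariff")).fetchall()
# The same balances, as if the customers had been on today's tariff.
restated = con.execute(query.format(col="current_tariff")).fetchall()
print(historical, restated)
```

Both results come from the same fact rows and the same join; only the column chosen on the dimension changes.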

1 or 2 dimensions (1/4)
a) One Dimension (customer attributes inside DimAccount)
- Simplicity, 1 dim.
- Hierarchy from customer attribute & account attribute.
- Use when we don't have fact tables requiring customer grain.
b) Two Dimensions (FactTable linked to both DimAccount and DimCustomer)
- We can get the customer attributes without knowing the account key.
- Disadvantage: can't go from account to customer without going through the fact table (performance).

1 or 2 dimensions (2/4)
c) Snowflake (FactTable, DimAccount, DimCustomer)
- Dim customer is needed by another fact table.
- Modular: 2 separate dim tables, but we can combine them easily to create a bigger dimension.
- To get the breakdown of a measure by a customer attribute is a bit more complicated than a):

    select c.attribute, sum(f.measure1)
    from fact1 f
    inner join dim_account a on f.account_key = a.account_key
    inner join dim_customer c on a.customer_key = c.customer_key
    group by c.attribute

1 or 2 dimensions (3/4)
d) Two Dimensions with inter-dimension link (a.k.a. Star with a Back Door)
- Tries to fix the weaknesses of b) and c):
  - We can go direct from account dim to customer dim.
  - We can access dim customer directly from the fact table.
- Weakness: the customer key is maintained in 2 places: the fact table and dim account.

1 or 2 dimensions (4/4)
e) One Dimension with Customer Key
- Tries to fix the weakness of a): unable to build a fact table with grain = customer.
- Add a column in dim account: customer key.
- Not as popular as c) and d) in solving the Dim Customer issue. It is indecisive: trying to create Dim Customer but doesn't want to create Dim Customer.
- Disadvantage: Dim Customer is hidden inside Dim Account, making it a) more difficult to maintain (especially for a type 2), and b) less modular/flexible.

When to Snowflake (1/3)
1. When the sub dim is used by several dims
- City-Country-Region columns exist in DimBroker, DimPolicy, DimOffice and DimInsured.
- Replaced by Location/GeoKey pointing to DimLocation / DimGeography.
- Advantage: consistent hierarchy, i.e. relationship between City, Country & Region.
- Weakness: we would lose flexibility. City to Country is more or less fixed, but the grouping of countries might be different between dimensions.

When to Snowflake (2/3)
2. When the sub dim is used by both the main dim and the fact table(s)
- DimCustomer is used in DimAccount, and is also used in the fact table.
- DimManufacturer is used in DimProduct, and is also used in the fact table.
- DimProductGroup is used in DimProduct, and is also used in some fact tables.
- The alternative is maintaining two full dimensions (classic star).

When to Snowflake (3/3)
3. To make base dim and detail dim
- Insurance classes, account types (banking), product lines, diagnosis, treatment (health care).
- Policies for marine, aviation & property classes have different attributes.
- Pull common attributes into 1 dim: DimBasePolicy.
- Put class-specific attributes into DimMarine, DimProperty, DimAviation.
4. To enrich a date attribute
- Month, Quarter, Year, etc.
- Like #1, a sub dim used by several dims.
Ref: Kimball DW Toolkit 2nd edition page 213
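A small sketch of the case where DimCustomer is used by both DimAccount and the fact table, using in-memory SQLite. The table names follow the query on the "1 or 2 dimensions" slide; the segment column and the sample rows are assumptions made up for illustration. The same breakdown can be reached through the account dim or directly from the fact table.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, segment TEXT);
CREATE TABLE dim_account  (account_key INTEGER PRIMARY KEY, customer_key INTEGER);
-- The customer key is kept in 2 places: dim_account and the fact table.
CREATE TABLE fact1 (account_key INTEGER, customer_key INTEGER, measure1 REAL);

INSERT INTO dim_customer VALUES (1, 'Retail'), (2, 'Corporate');
INSERT INTO dim_account  VALUES (10, 1), (11, 1), (12, 2);
INSERT INTO fact1 VALUES (10, 1, 5.0), (11, 1, 7.0), (12, 2, 30.0);
""")

# Path 1: fact -> account dim -> customer dim (the snowflake route).
via_account = con.execute("""
SELECT c.segment, SUM(f.measure1)
FROM fact1 f
JOIN dim_account a ON f.account_key = a.account_key
JOIN dim_customer c ON a.customer_key = c.customer_key
GROUP BY c.segment ORDER BY c.segment
""").fetchall()

# Path 2: fact -> customer dim directly, one join fewer.
direct = con.execute("""
SELECT c.segment, SUM(f.measure1)
FROM fact1 f
JOIN dim_customer c ON f.customer_key = c.customer_key
GROUP BY c.segment ORDER BY c.segment
""").fetchall()
print(via_account, direct)
```

The two paths only agree while the customer key in the fact table and in the account dim are kept in step, which is the maintenance cost of holding the key in 2 places.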

A dimension with only 1 attribute (1/2)
Reasons for putting a single attribute in its own dim:
- Keep the fact table slim (4-byte int, not 100-byte varchar).
- When the value changes, we don't have to update the BIG fact table (ETL performance).
- Grain is much lower than the fact table (small dim).
- Yes, it's only 1 attribute today, but in the future there could be another attribute. Could become a junk dim.
Should we put the attribute in the fact table (like a DD = Degenerate Dim)?
- Probably, if the grain = fact table, and it's short or it's a number.

A dimension with only 1 attribute (2/2)
Exception: snapshot month (or day/week/quarter)
- Snapshot month is used in periodic snapshot fact tables, in the form of an integer (201104 for April 2011).
- It doesn't violate the 3 points above:
  - It is an integer, not char(6).
  - The value never changes: April 2011 will be April 2011 forever.
  - There will not be other attributes in the dim.

Transaction Level Dimension (1/5)
- A dim with grain = the transaction fact table (transaction, not accumulative or periodic snapshot).
- Most granular event in any business process.
- Examples:
  - IT Helpdesk DW: Dim Ticket
  - Telco DW: Dim Call
  - Banking/Asset Mgt DW: Dim Trade
  - Insurance DW: Dim Premium

Transaction Level Dimension (2/5)
Advantages:
- Query performance: DD columns are moved to a dim, away from the heavy traffic in fact tables. DW queries don't touch those DD columns unless they need to. DD attributes totalling 30 bytes are replaced by a 4-byte int column. Slimmer fact table, better for queries.
- Periodic snapshot fact table: the saving is even greater. Take a monthly snapshot fact, 10 years / 120 months. Rather than specifying the DDs repeatedly 120x, they are specified once in the transaction dim. All that is left on the fact table is a slim 1 int col: the transaction key.

Transaction Level Dimension (3/5)
- Some fact tables have a grain more detailed than the transaction. A payment from a customer is posted into 4 accounts in the GL fact table. That single financial transaction becomes 4 fact rows but only has 1 row in the trans dim. Fact table with 10m rows, trans dim only 3 million rows.
- Related transactions: some transactions are related, e.g. in retail, a purchase of a kitchen might need to be created as 2 related orders, because the worktop is made-to-order. Rather than creating a related order column on the fact tables, it might be better (depending on how it's used) to create it on the trans dim because:
  a) an order can consist of many fact rows (1 row per item), so the related order number would be duplicated across these fact rows
  b) slimmer fact table
  c) the transaction could be on many fact tables, not only one.

Transaction Level Dimension (4/5)
Disadvantages/not suitable:
- Transaction fact table where the grain of the trans dim = grain of the fact table, and only 1 DD column: perhaps better to leave the DD in the fact table. Not a lot of space/speed gain by putting it on the trans dim.
- Mart/DW only used for SSAS: there is little point in having the trans dim physically. In SSAS we can create the transaction dimension on the fly from the fact table (fact dimension).
- Using the trans dim to hold attributes, as opposed to putting them in the main dimensions, with the argument "that's the value of the attribute when the transaction happened": this is not right, use type 2 SCD for this.
[Diagram labels: Main, Trans, Acct type, Location]

Transaction Level Dimension (5/5)
Disadvantages/not suitable:
- Any dim with grain = fact table (like the trans dim) is questionable. Do we really need this dim at this grain? Perhaps it should be divided into several dims instead?
- A dim with grain = fact table is a potential performance issue (unless the fact table is small), e.g. fact table = 10m rows, trans dim = 10m rows. Joining 10m to 10m is potentially slow, especially if the physical ordering of the trans dim is not the joining column.


25%: time check, questions

Fact Table Primary Key (1/3)
Should we have a PK?
- Yes, if we need to be able to identify each fact row:
  - Need to refer to a fact row from another fact row, e.g. chain of events.
  - Many identical fact rows and we need to update/delete only one.
  - To link the fact table to another fact table.
- Some experts totally disagree.
[Diagram labels: PK, FK (no RI), FK (not enforced), Related Trans, Header - Detail, Uniqueness, previous/next transaction]

Fact Table Primary Key (2/3)
Single or multi column?
- Single column: generated identity.
- Multi column: dimension keys.
A single-column PK is better than a multi-column PK because:
1) The combination of dimension keys may not be unique. A single-column PK is guaranteed to be unique, because it is an identity column.
2) A single-column PK is slimmer than a multi-column PK, which gives better query performance. To do a self join in the fact table (e.g. to link the current fact row to the previous fact row), we join on a single integer column.

Fact Table Primary Key (3/3)
- Advantage: prevents duplicate rows, query performance.
- Disadvantage: loading performance.
- Indexing the PK: cluster or not?
  - Cluster the PK if the PK is an identity column.
  - Don't cluster the PK if the PK is a composite, or when you need the clustered index for query performance (with partitioning).
- Example of not having a PK: if duplicate fact rows are allowed, e.g. retail DW with Store Key, Date Key, Product Key, Customer Key. The same customer buys the same milk in the same shop on the same day twice. Use Order Line ID as a DD to make it unique (not all EPOS systems have it).

Snapshotting Transaction Fact Tables (1/1)
- Potentially huge: billions of rows.
- Only take what you need.
- Smart date key/month, e.g. 20110409.
- Monthly or daily.
- Trunc-reload of current month/day.
- Daily (4 wk), Weekly (1 yr), Monthly (10 yr).
- Purging & archiving.
- Load from staging (cached).
- Index/partition on snapshot date.
[Diagram labels: Trans, Snapshot, Staging]
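A rough sketch of the trunc-reload of the current snapshot month, loading from staging and keyed on a smart month key. It uses in-memory SQLite; the names stg_balance, fact_snapshot and reload_snapshot and the sample rows are hypothetical. A DELETE on the snapshot month stands in for truncating a partition, which SQLite does not have.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE stg_balance (account_key INTEGER, balance REAL);  -- staging, reloaded each run
CREATE TABLE fact_snapshot (snapshot_month INTEGER, account_key INTEGER, balance REAL);
CREATE INDEX ix_snapshot_month ON fact_snapshot (snapshot_month);
INSERT INTO fact_snapshot VALUES (201103, 1, 10.0), (201103, 2, 20.0);  -- a closed month
""")

def reload_snapshot(con, snapshot_month, staged_rows):
    """Clear and reload one snapshot month from staging; closed months are untouched."""
    with con:  # commit everything as one transaction
        con.execute("DELETE FROM stg_balance")
        con.executemany("INSERT INTO stg_balance VALUES (?, ?)", staged_rows)
        con.execute("DELETE FROM fact_snapshot WHERE snapshot_month = ?", (snapshot_month,))
        con.execute(
            "INSERT INTO fact_snapshot SELECT ?, account_key, balance FROM stg_balance",
            (snapshot_month,),
        )

reload_snapshot(con, 201104, [(1, 11.0), (2, 21.0)])
reload_snapshot(con, 201104, [(1, 12.0), (2, 22.0), (3, 5.0)])  # rerun later in the same month

rows = con.execute(
    "SELECT snapshot_month, COUNT(*), SUM(balance) FROM fact_snapshot "
    "GROUP BY snapshot_month ORDER BY snapshot_month"
).fetchall()
print(rows)
```

Rerunning the load for the current month replaces that month's rows instead of adding to them, so the snapshot can be refreshed daily until the month closes.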

Aggregate Fact Tables (1/2)
What are they?
- High level aggregation of base fact tables.
- A select group by query on a 2 billion row fact table can take 30 mins if it joins with two big fact tables, even with indexes in place.
- So we do this query in advance as part of the DW load and store it as an Aggregate Fact Table.
- The report then only takes 1 second to run.
[Diagram: Base Fact Tables to Report, 30 mins; Aggregate Fact Table to Report, 1 sec]

Aggregate Fact Tables (2/2)
What for?
- For report performance (group by is costly).
- BO: aggregate aware.
- Not SSAS: aggregate in cubes, not tables.
Loading & indexing:
- Best to load from staging (at the same time as loading the main fact table), not from the main fact table (this would be working 2x).
- Partition for data distribution or narrow query.
- Indexing: by the main dim keys.
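A minimal sketch of building an aggregate fact table during the load, from staging rather than from the base fact table, again with in-memory SQLite. The names stg_sales, fact_sales and agg_sales_month_product and the rows are hypothetical. The month key is derived from the smart date key by integer division.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE stg_sales  (date_key INTEGER, product_key INTEGER, store_key INTEGER, amount REAL);
CREATE TABLE fact_sales (date_key INTEGER, product_key INTEGER, store_key INTEGER, amount REAL);
CREATE TABLE agg_sales_month_product (month_key INTEGER, product_key INTEGER, amount REAL);
-- Index the aggregate by its main dim keys.
CREATE INDEX ix_agg_main_keys ON agg_sales_month_product (month_key, product_key);

INSERT INTO stg_sales VALUES
  (20110401, 1, 1, 10.0), (20110402, 1, 2, 15.0),
  (20110403, 2, 1, 7.5),  (20110501, 1, 1, 20.0);

-- Load the base fact table and the aggregate in the same run, both from staging.
INSERT INTO fact_sales SELECT * FROM stg_sales;
INSERT INTO agg_sales_month_product
SELECT date_key / 100, product_key, SUM(amount)
FROM stg_sales
GROUP BY date_key / 100, product_key;
""")

# The report reads the small pre-grouped table; no GROUP BY at report time.
report = con.execute(
    "SELECT month_key, product_key, amount FROM agg_sales_month_product "
    "ORDER BY month_key, product_key"
).fetchall()
print(report)
```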

Vertical Fact Tables (1/1)
- Normalised: 1 measure column.
- The meaning of that measure column depends on the measure type column.
- Used for Finance/GL mart.
- Advantage: flexibility (using accounts, balance, Dr/Cr).
- Disadvantage: non-additive.
[Diagram: Normal Fact Table (many measures) vs Vertical Fact Table (1 measure, Measure Type, Dim Key; actual & budget)]
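A short sketch of the vertical fact table idea from the Vertical Fact Tables slide: 1 measure column whose meaning depends on a measure type key. The names dim_measure_type and fact_gl_vertical and the rows are hypothetical. It also shows why the single measure is non-additive across measure types.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_measure_type (measure_type_key INTEGER PRIMARY KEY, measure_type TEXT);
CREATE TABLE fact_gl_vertical (
    month_key INTEGER, account_key INTEGER,
    measure_type_key INTEGER,  -- says what 'amount' means on this row
    amount REAL                -- the single measure column
);
INSERT INTO dim_measure_type VALUES (1, 'Actual'), (2, 'Budget');
INSERT INTO fact_gl_vertical VALUES
  (201104, 100, 1, 950.0), (201104, 100, 2, 1000.0),
  (201104, 200, 1, 400.0), (201104, 200, 2, 380.0);
""")

# Pivot the vertical rows back into one column per measure type.
pivot = con.execute("""
SELECT f.account_key,
       SUM(CASE WHEN m.measure_type = 'Actual' THEN f.amount END) AS actual,
       SUM(CASE WHEN m.measure_type = 'Budget' THEN f.amount END) AS budget
FROM fact_gl_vertical f
JOIN dim_measure_type m ON f.measure_type_key = m.measure_type_key
GROUP BY f.account_key
ORDER BY f.account_key
""").fetchall()

# Summing without filtering on measure type mixes actuals with budgets.
mixed_total = con.execute("SELECT SUM(amount) FROM fact_gl_vertical").fetchone()[0]
print(pivot, mixed_total)
```

A new measure type only needs a new row in the measure type dim, not a new column, which is where the flexibility comes from.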

50%: time check, questions

Rapidly Changing Dimension (1/1)
Why is it a problem?
- Large SCD2 dim.
- Attributes change every day.
- Slow query when joined with large fact tables.
What to do:
- Put into a separate dim, linked direct to the fact table.
- Just store the latest: type 1 attributes (or dual).
- Store in the fact table (for a small attribute, e.g. an indicator).

Very Large Dimension (1/2)
Why is it a problem?
- SSAS: 4 GB string store limit for a dimension.
- SSAS: dim processing is a select distinct on each attribute (long processing time).
- Valid date join on SCD2 for "as was".
- Usually the customer dim, where the quality stamp changes daily, or because of a high number of distinct values.
- Difficult to browse a high cardinality attribute.
- Join with fact tables (performance).

Very Large Dimension (2/2)
What to do:
- Split into 2 dims, same grain. Always cut vertically.
- Remove SCD2, or at least apply it only to certain columns.
- Most common: separate the attributes with high cardinality/change frequency.
- Bucketing/banding: group values into ranges.

Banding Dimension Rows (1/1)
- Banding is grouping numerical values (numerical attributes, not measures) into several bands, e.g. engine size, distance from station, amount purchased (last complete year).
- Benefits: easier for analysis & reporting, comparing between categories.
- Issue/problem: the band limits, e.g. bucketing criteria. 1 hour to implement, 3 months to argue.
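A minimal sketch of banding a numerical attribute with a CASE expression, using the engine size example. The dim_car table, its columns and the band limits are assumptions picked for illustration; the real limits are whatever the business agrees on.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_car (
    car_key INTEGER PRIMARY KEY,
    engine_size_cc INTEGER,   -- numerical attribute, not a measure
    engine_size_band TEXT     -- banded attribute for slicing
);
INSERT INTO dim_car (car_key, engine_size_cc) VALUES (1, 998), (2, 1598), (3, 1999), (4, 2993);

-- Hypothetical band limits.
UPDATE dim_car SET engine_size_band = CASE
    WHEN engine_size_cc < 1200 THEN 'Under 1200cc'
    WHEN engine_size_cc < 2000 THEN '1200-1999cc'
    ELSE '2000cc and over'
END;
""")

bands = con.execute("""
SELECT engine_size_band, COUNT(*)
FROM dim_car
GROUP BY engine_size_band
ORDER BY MIN(engine_size_cc)
""").fetchall()
print(bands)
```

Keeping the limits in one UPDATE (or a small lookup table) means a change to the bucketing criteria is a one-line change followed by a reload of the banded column.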

Stamping Dimension Rows (1/1)
- Calculate internally or buy data from outside.
- Customer categories (loyalty programme), e.g. A, B, C of customer class.
- To reflect consumer interest in the product (product categorisation based on customer interest level).
- Any other dates or measures summarised as a stamped attribute, e.g. new customer, big spender, or results from a recommendation analysis/algorithm, e.g. customer behaviour based on previous purchases.
- Used for analysis / reporting.
[Diagram label: Stamped Attributes]

Dimensions with Multi Valued Attributes (1/4)
What is a Multi Valued Attribute?
- An attribute which has more than 1 value per dimension row.
MV Attribute or MV Dimension?
- MV Dim = for each fact row there could be more than 1 dimension row.
Why do I need to know this?
- To be able to model it.
- If it is modelled wrongly, it is difficult at the BI/report stage.

Dimensions with Multi Valued Attributes (2/4)
Approaches to deal with MV Attributes:
1. Lower the grain of the dim [Diagram: Before, After]
2. Put the MV attributes in a separate dim, link direct to the fact table [Diagram: Before, After]
Notes:
- The fact table requires that the product dimension is at Product Code grain, e.g. no sales info per size, but only per product code.
- Often we don't have the allocation information, e.g. 50-50 or 30-70; we only know that product1 has 2 sizes.

Dimensions with Multi Valued Attributes (3/4)
3. Use a bridge table to link the 2 dims
[Diagram: Fact Table, Dim Product, Bridge Table, Dim Size]
4. Have several columns in the dim for that attribute
- If the number of values is small and fixed, this is a popular approach. But if the number of values is large (e.g. >10) or if it's variable (e.g. sometimes 2, sometimes 20), approaches 2 and 3 above are more popular, and more appropriate.

Dimensions with Multi Valued Attributes (4/4)
5. Put the attribute in a snowflake sub dim
- We can't really do this, as it is 1 to many (1 row in the main dim corresponds to many rows in the sub dim). So we need a bridge table, which brings us back to approach 3.
6. Keep in one column using delimiters
- e.g. Small|Medium. A crazy idea. More flexible than having several columns (approach 4) and simpler than approach 3 or 2. If the purpose of the attribute is display only on a report (rather than analysis or slice & dice), there is an argument for using this approach, particularly if the number of values is small (e.g. 1 to 4).
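A small sketch of approach 3 for multi valued attributes, the bridge table between Dim Product and Dim Size. Table names and rows are hypothetical. It also shows the allocation problem mentioned in the notes: with no 50-50 or 30-70 split available, a breakdown by size counts the full amount once per size.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_code TEXT);
CREATE TABLE dim_size    (size_key INTEGER PRIMARY KEY, size TEXT);
CREATE TABLE bridge_product_size (
    product_key INTEGER, size_key INTEGER,
    PRIMARY KEY (product_key, size_key)
);
CREATE TABLE fact_sales (product_key INTEGER, amount REAL);  -- grain = product code

INSERT INTO dim_product VALUES (1, 'product1'), (2, 'product2');
INSERT INTO dim_size    VALUES (1, 'Small'), (2, 'Medium'), (3, 'Large');
INSERT INTO bridge_product_size VALUES (1, 1), (1, 2), (2, 3);  -- product1 has 2 sizes
INSERT INTO fact_sales VALUES (1, 100.0), (2, 40.0);
""")

by_size = con.execute("""
SELECT s.size, SUM(f.amount)
FROM fact_sales f
JOIN bridge_product_size b ON b.product_key = f.product_key
JOIN dim_size s ON s.size_key = b.size_key
GROUP BY s.size
ORDER BY s.size
""").fetchall()

true_total = con.execute("SELECT SUM(amount) FROM fact_sales").fetchone()[0]
print(by_size, true_total)
```

The by-size figures add up to 240 against a true total of 140, so a report built this way is fine for "which products come in Small" but should not total across sizes unless an allocation factor is added to the bridge.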

75%: time check, questions

Real Time Fact Table (1/1)
- Reporting the transaction system in real time.
- View to union with the normal fact table, or use partitions:
  - Real time partition (intraday, today), with dims as of yesterday.
  - Main partition (up to last night).
- Freezing the dims for key lookup; -3 unknown key.
- Key corrections next day.
- Unknown keys (dim key):
  - -1: null in source
  - -2: not in dim table
  - -3: not in dim table as dim was frozen; to be resolved in the next batch

Dealing with Currency Rates (1/3)
What for/background/requirements:
- Report in 3 reporting currencies, using today's rates or past rates.
- Analyse over time without the impact of currency rates (using fixed currency rates, e.g. 2010 EOY rates).
- Had the transactions happened today.
- Currency rates historical analysis.
[Diagram: Transaction Currency (100 countries, 40 currencies) to DW Currency (1 currency, e.g. GBP) using Transaction Rates (many transaction dates); DW Currency to Reporting Currency (3-4 currencies: GBP, USD, EUR, Original) using Reporting Rates (1 reporting date)]

Dealing with Currency Rates (2/3)
Approaches:
- Store in original currencies, convert to DW currency at run time.
- Or convert at load and store in DW currency (inaccuracy).
- Or store in both original and DW currency.
- Currency rate fact table (date, currency, rate).
- Or store rates in the fact table.
- On report/cube: date input at run time (default = today).
[Diagram labels: Fact Tables (in original currency, DW currency or both), FX Fact Table, Rate]

Dealing with Currency Rates (3/3)
- Concept of FX Rate Type/Profile.
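A minimal sketch of one of the currency approaches: amounts stored in original currency, a currency rate fact table (date, currency, rate), and the rate date supplied at run time. The names fact_fx_rate, fact_sales and total_in_gbp are hypothetical, and the rates are made-up round numbers, not real market rates.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE fact_fx_rate (
    date_key INTEGER, currency TEXT, rate_to_gbp REAL,
    PRIMARY KEY (date_key, currency)
);
CREATE TABLE fact_sales (date_key INTEGER, currency TEXT, amount REAL);  -- original currency

-- Illustrative rates only.
INSERT INTO fact_fx_rate VALUES
  (20101231, 'USD', 0.64), (20101231, 'EUR', 0.86), (20101231, 'GBP', 1.0),
  (20110408, 'USD', 0.61), (20110408, 'EUR', 0.88), (20110408, 'GBP', 1.0);
INSERT INTO fact_sales VALUES
  (20110408, 'USD', 100.0), (20110408, 'EUR', 50.0), (20110408, 'GBP', 10.0);
""")

def total_in_gbp(con, rate_date_key):
    """Convert all sales to GBP using the rates of one chosen date."""
    row = con.execute("""
        SELECT SUM(s.amount * r.rate_to_gbp)
        FROM fact_sales s
        JOIN fact_fx_rate r
          ON r.currency = s.currency AND r.date_key = ?
    """, (rate_date_key,)).fetchone()
    return round(row[0], 2)

at_chosen_date = total_in_gbp(con, 20110408)  # rates of the date picked at run time
at_fixed_eoy = total_in_gbp(con, 20101231)    # fixed 2010 EOY rates
print(at_chosen_date, at_fixed_eoy)
```

Because the conversion happens at query time, the same fact rows can be reported at any rate date. Converting at load would freeze one rate into the stored amount.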

Dealing with Status (1/2)
What/background:
- Workflow (policies, contracts, documents).
- Bottleneck analysis (no. of days between stages).
- How many on each stage.
[Diagram: Status 1 to Status 6, with date1 to date4 marking the moves between stages]

Dealing with Status (2/2)
Approaches:
- Accumulative Snapshot Fact, 1 row per application.
- SCD2 on DimApp.
- App Status fact table.
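A short sketch of the accumulative snapshot approach, 1 row per application with one date column per stage. The fact_application table, the three stage names and the dates are hypothetical. It covers both questions from the background slide: days between stages and how many applications sit on each stage.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE fact_application (
    application_key INTEGER PRIMARY KEY,
    received_date TEXT,   -- one date column per workflow stage,
    assessed_date TEXT,   -- filled in as the application moves on
    approved_date TEXT
);
INSERT INTO fact_application VALUES
  (1, '2011-03-01', '2011-03-04', '2011-03-10'),
  (2, '2011-03-02', '2011-03-09', NULL),
  (3, '2011-03-05', NULL,         NULL);
""")

# Bottleneck analysis: average days between stages (rows not yet there are ignored).
avg_days = con.execute("""
SELECT AVG(julianday(assessed_date) - julianday(received_date)),
       AVG(julianday(approved_date) - julianday(assessed_date))
FROM fact_application
""").fetchone()

# How many applications are on each stage right now.
stage_counts = con.execute("""
SELECT CASE WHEN approved_date IS NOT NULL THEN 'Approved'
            WHEN assessed_date IS NOT NULL THEN 'Assessed'
            ELSE 'Received' END AS stage,
       COUNT(*)
FROM fact_application
GROUP BY stage
ORDER BY stage
""").fetchall()
print(avg_days, stage_counts)
```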

Thanks
- Email: [email protected]
- Blog: http://dwbi1.wordpress.com (covers many of the topics in this presentation)
- This PowerPoint: in my blog, scroll to bottom, click on SQLBit8
- Special thanks to Guang Ming Xing and Simon Jensen, who helped review this presentation and provided useful comments (this doesn't mean that they agree with the content).