CS 345: Topics in Data Warehousing

CS 345:Topics in Data Warehousing

Tuesday, October 12, 2004

Review of Thursday’s Class• Facts

– Semi-additive facts– “Factless” fact tables

• Slowly Changing Dimensions– Overwrite history– Preserve history– Hybrid schemes

• More dimension topics– Dimension roles– Junk dimension

• More fact topics– Multiple currencies– Master/Detail facts and fact allocation– Accumulating Snapshot fact tables

Outline of Today’s Class

• Customer Relationship Management (CRM)• Dimension-focused queries

– Drill-across– Conformed dimensions

• Customer dimension– Behavioral attributes– Auxiliary tables

• Techniques for very large dimensions– Outriggers– Mini-dimensions

• Hierarchies– Bridge tables

Customer Relationship Management (CRM)

• Currently a hot topic in business data analysis• Idea: Gain better understanding of customer behavior

by integrating data from various sources– Multiple interaction types

• Orders• Returns• Customer support• Billing• Service / repairs

– Multiple interaction channels• Retail store• E-mail• Call center (Inbound / Outbound)• Web site

CRM questions

• Customer profitability– Identify most / least profitable customers– 80/20 rule

• Customer retention– Which customers are most likely to defect to a competitor?– Which retention measures work best?

• Customer acquisition– Which prospects are most promising?– What offers will entice them to become customers?

• Up-sell / Cross-sell– Gain additional business from existing customers– Provide targeted offers during “inbound” communications

Dimension-focused Queries

• Standard OLAP queries are fact-focused– Query touches one fact table and its

associated dimensions

• Some types of analysis are dimension-focused– Bring together data from different fact tables

that have a dimension in common– Common dimension used to coordinate facts– Sometimes referred to as “drilling across”

Drill-Across Example• Example scenario:

– Sales fact with dimensions (Date, Customer, Product, Store)– CustomerSupport fact with dimensions (Date, Customer, Product, ServiceRep)– Question: How does frequency of support calls by California customers affect

their purchases of Product X?• Step 1: Query CustomerSupport fact

– Group by Customer SSN – Filter on State = California– Compute COUNT– Query result has schema (Customer SSN, SupportCallCount)

• Step 2: Query Sales fact– Group by Customer SSN– Filter on State = California, Product Name = Product X– Compute SUM(TotalSalesAmt)– Query result has schema (Customer SSN, TotalSalesAmt)

• Step 3: Combine query results– Join Result 1 and Result 2 based on Customer SSN– Group by SupportCallCount– Compute COUNT, AVG(TotalSalesAmt)

A Problem with the Example

• What if some customers don’t make any support calls?– No rows for these customers in CustomerSupport fact– No rows for these customers in result of Step 1– No data for these customers in result of Step 3

• Solution: use outer join in Step 3– Customers who are in Step 2 but not Step 1 will be included in

result of Step 3– Attributes from Step 1 result table will be NULL for these

customers– Convert these NULLs to an appropriate value before presenting

results• Using SQL NVL() function

Conformed Dimensions

• Bottom-up data warehousing approach builds one data mart at a time

• Drill-across between data marts requires common dimension tables

• Common dimensions and attributes should be standardized across data marts

• Create master copy of each common dimension table• Three types of “conformed” dimensions:

– Dimension table identical to master copy– Dimension table has subset of rows from the master copy

• Can improve performance when many dimension rows are not relevant to a particular process

– Dimension table has subset of attributes from master copy• Allows for roll-up dimensions at different grains

Conformed Dimension Example• Monthly sales forecasts

– Predicted sales for each brand in each district in each month– POS Sales fact recorded at finer-grained detail

• Product SKU vs. Brand• Date vs. Month• Store vs. District

• Use roll-up dimensions– Brand dimension is rolled-up version of master Product dimension

• One row per brand• Only include attributes relevant at brand level or higher

– Month dimension is rolled-up Date– District dimension is rolled-up Store

• Schema– Sales (Date, Product, Store, Promotion, Transaction ID)– Forecast (Month, Brand, District)

Drill-Across Example• Question: How did actual sales diverge from forecasted sales in Sept. ‘04?• Drill-across between Forecast and Sales• Step 1: Query Forecast fact

– Group by Brand Name, District Name– Filter on MonthAndYear =‘Sept 04’– Calculate SUM(ForecastAmt)– Query result has schema (Brand Name, District Name, ForecastAmt)

• Step 2: Query Sales fact– Group by Brand Name, District Name– Filter on MonthAndYear =‘Sept 04’– Calculate SUM(TotalSalesAmt)– Query result has schema (Brand Name, District Name, TotalSalesAmt)

• Step 3: Combine query results– Join Result 1 and Result 2 on Brand Name and District Name– Result has schema (Brand Name, District Name, ForecastAmt, TotalSalesAmt)– Outer join unnecessary assuming:

• Forecast exists for every brand, district, and month• Every brand has some sales in every district during every month

The Customer Dimension

• Customer dimensions can be very wide– Often dozens or even hundreds of attributes– Contact information (name, address, phone, e-mail)– Demographics (age, ethnicity, gender, education, profession,

income, household size, etc.)– Psychographics (interests, values, beliefs, attitudes)– Dates (birthday, first purchase, last purchase, online reg. date)– Behavioral scores (RFM, churn propensity, etc.)

• Data available from many sources– Information provided directly by customers– Prospect lists acquired from partners or vendors– Syndicated data

• Market research• Customs data

– Data derived from warehouse analysis

Behavioral Attributes

• Customers can be segmented based on past behavior• Aggregated fact data converted to dimensional attributes• Examples:

– RFM scoring• Recency of last purchase• Frequency of purchases• Monetary value of purchases

– Scores based on predictive models• Propensity to churn• Probability of default

– Segmentation based on clustering algorithms– Raw aggregated data

• Total dollar sales in past year

Behavioral Attributes

• Two techniques for handling behavioral attributes1. Dimension attributes generated during ETL process

– Stored in dimension table– Good query performance– Limited flexibility– Preserving history possible (but may be expensive)

2. Virtual attributes created “on demand” via user queries– Stored in auxiliary tables– Very flexible and customizable– Increased management complexity– Increased query complexity– Query performance may suffer– Can’t easily preserve history

Auxiliary Tables for User-Defined Attributes

Cust_id Name Zip …

1 Brian 94403 …

2 Rajeev 94303 …

… … … …

Name Score

Brian 3

Rajeev 5

… …

Customer Dimension User-CreatedAuxiliary Table

Name is natural key

Natural keyof customerdimension

User-definedattribute fromquery result

•Join dimension and auxiliary table using natural key•Join result looks like “expanded” customer dimension

Another Use of Auxiliary Tables

• Track a set of customers over time– For example, a focus group or pre-selected sample

• Set of customers may be defined based on a query

• Query results may change over time as customer attributes slowly change

• How to preserve the initial set?• Create single-column auxiliary table containing

natural key of customers in the set• Join to the auxiliary table to filter based on the

initial customer set

Continuous vs. Discrete Values• Some attributes have large number of possible values on a

continuous scale– Income– Age

• Most simple behavioral attributes are of this type– TotalSalesOfProductXIn2003

• Disadvantages of continuous attributes:– Grouping by continuous attribute produces huge, meaningless report– Number of unique attribute combinations explodes

• Greater number of rows in dimension• More frequent changes to value of dimension attributes

• Group continuous attributes into discrete bands– Like a histogram– Instead of Salary = 47540, use Salary = $40K-$50K– Avoids above disadvantages– Downside = loss of information

Very Large Dimensions

• Customer dimensions can be very wide– Dozens or hundreds of attributes

• Customer dimensions can be very large– Tens of millions of rows in some warehouses– Sometimes includes prospects as well as actual

customers• Size can lead to performance challenges

– One case when performance concerns can trump simplicity

– Can we reduce width of dimension table?– Can we reduce number of rows caused by preserving

history for slowly changing dimension?

Outrigger Tables• Limited normalization of large dimension table to save space• Identify attribute sets with these properties:

– Highly correlated– Low in cardinality (compared to # of customers)– Change in unison

• Example:– External data provider computes demographic data for each county– 100 demographic attributes are provided– Updates are supplied every six months

• Follow these steps for each attribute set:– Create a separate “outrigger dimension” for each attribute set– Remove the attributes from the customer dimension– Replace with a foreign key to the outrigger table– No foreign key from fact row to outrigger

• Outrigger attributes indirectly associated with facts via customer dim.

Outrigger Example

Cust_id FName LName Zip Demo_id

1 Brian Babcock 94403 34

2 Rajeev Motwani 94303 12

3 Leland Stanford 94305 12

Demo_id County AvgInc HHoldSize

…

12 Santa Clara 78000 2.3 …

… … … … …

34 San Mateo 67000 2.5 …

County Demographics Outrigger

Customer Dimension

Outrigger Tables

• Advantages:– Space savings

• Customer dimension table becomes narrower• Outrigger table has relatively few rows• One copy per county vs. one copy per customer

• Disadvantages:– Additional tables introduced

• Accessing outrigger attributes requires an extra join• Users must remember which attributes are in outrigger vs.

main customer dimension– Creating a view can solve this problem

Mini-Dimensions• Some attributes change relatively frequently

– Behavior-based scores– Certain demographic attributes

• Age, Income, Marital Status, # of children• How to preserve history without row explosion?• Some attributes are queried relatively frequently

– Queries using huge customer dimension are slowed• How to improve query performance?• Create a mini-dimension:

– Remove frequently-changing or frequently-queried attributes from the customer dimension

– Add them to a separate mini-dimension table instead– Discretize mini-dimension attributes to reduce cardinality

• Group continuously-valued attributes into buckets or bands• Example: Age < 20, Age 20-29, Age 30-39, Age 40-49, Age 50+

– Include foreign keys to both customer dimension & mini-dimension in fact table

Mini-Dimensions• Advantages:

– History preserved without space blow-up• Fact table captures historical record of attribute values• Mini-dimension has small number of rows

– # of unique combinations of mini-dimension attributes is small– Consequence of discretization– Limit number of attributes in a single mini-dimension!

– Improved performance for queries that use mini-dimension• At least for those queries that can avoid the main customer dimension

• Disadvantages:– Fact table width increases

• Due to increased number of dimension foreign keys– Information lost due to discretization

• Less detail is available• Impractical to change bucket / band boundaries

– Additional tables introduced• Users must remember which attributes are in mini-dimension vs. main

customer dimension

Outrigger vs. Mini-Dimension

Fact FactCustomerDimension

CustomerDimensionOutrigger

Mini-dimension

Outrigger vs. Mini-Dimension• Mini-dimension approach

– Explicit link between fact table and mini-dimension– No explicit link between customer dimension and mini-dimension– Difficult to express queries that group past customer behavior based on current

demographic values• Implicit association via fact table

• Outrigger approach– No explicit link between fact table and outrigger– Explicit link between customer dimension and outrigger– Associating facts with outrigger requires join through customer dimension– Preserving history for rapidly-changing attributes leads to customer dimension

blow-up• Hybrid approach

– Separate some attributes into their own mini-dimension– Add foreign key to mini-dimension to both fact table and customer dimension– Customer dimension foreign key updated using “overwrite history” semantics– Queries based on historically accurate attribute values use fact table foreign key– Queries based on latest attribute values use customer dimension foreign key– Greater expressive power, and greater risk of confusing users!

More Outriggers / Mini-Dims• Lots of information about some customers, little info about others

– A common scenario• Example: web site browsing behavior

– Web User dimension (= Customer dimension)– Unregistered users

• User identity tracked over time via cookies• Limited information available

– First active date, Latest active date, Behavioral attributes– Possibly ZIP code through IP lookup

– Registered users• Lots of data provided by user during registration

– Many more unregistered users than registered users– Most attribute values are unknown for unregistered users

• Split registered user attributes into a separate table– Either an outrigger or a mini-dimension– For unregistered users, point to special “Unregistered” row

Handling Hierarchies• Hierarchical relationships among dimension attributes are common• There are various ways to handle hierarchies

– Store all levels of hierarchy in denormalized dimension table• The preferred solution in almost all cases!

– Create “snowflake” schema with hierarchy captured in separate outrigger table

• Only recommended for huge dimension tables• Storage savings have negligible impact in most cases

• What about variable-depth hierarchies?– Examples:

• Corporate organization chart• Parts composed of subparts

– Previous two solutions assumed fixed-depth– Creating recursive foreign key to parent row is a possibility

• Employee dimension has “boss” attribute which is FK to Employee• The CEO has NULL value for boss• This approach is not recommended• Cannot be queried effectively using SQL

– Alternative approach: bridge table

Bridge Tables• Customer dimension has one row for

each customer entity at any level of the hierarchy

• Separate bridge table has schema:– Parent customer key– Subsidiary customer key– Depth of subsidiary– Bottom flag– Top flag

• One row in bridge table for every (ancestor, descendant) pair

– Customer counts as its own Depth-0 ancestor

– 16 rows for the hierarchy at right• Fact table can join:

– Directly to customer dimension– Through bridge table to customer

dimension

Customer 1

Customer 2

Customer 3

Customer 5

Customer 6

Customer 4

Customer 7

Fact Bridge Customer

cust_id

parent_id

child_id

cust_id

Using Bridge Tables in Queries• Two join directions

– Navigate up the hierarchy• Fact joins to subsidiary customer key• Dimension joins to parent customer key

– Navigate down the hierarchy• Fact joins to parent customer key• Dimension joins to subsidiary customer key

• Safe uses of the bridge table:– Filter on customer dimension restricts query to a single customer

• Use bridge table to combine data about that customer’s subsidiaries or parents

– Filter on bridge table restricts query to a single level• Require Top Flag = Y• Require Depth = 1

– For immediate parent / child organizations• Require (Depth = 1 OR Top Flag = Y)

– Generalizes the previous example to properly treat top-level customers

• Other uses of the bridge table risk over-counting– Bridge table is many-to-many between fact and dimension

Documents

CS 345: Topics in Data Warehousing