ISQS 6339, Business Intelligence
Dimensional Modeling
Zhangxi Lin
Texas Tech University
Outline
◦ Principles of Dimensional Modeling
◦ Data Warehousing Methodology
◦ Three Phases of Dimensional Modeling
Dimensional Model
Also called a star schema (a snowflake schema is also acceptable)
◦ The fact table is in the middle, with the dimensions serving as the points of the star.
◦ A normalized fact table plus denormalized dimension tables
Reference: database normalization
◦ Edgar F. Codd, the inventor of the relational model, introduced the concept of normalization and what we now know as the First Normal Form (1NF) in 1970. Codd went on to define the Second Normal Form (2NF) and Third Normal Form (3NF) in 1971, and Codd and Raymond F. Boyce defined the Boyce-Codd Normal Form (BCNF) in 1974.
◦ Informally, a relational database table is often described as "normalized" if it is in the Third Normal Form. Most 3NF tables are free of insertion, update, and deletion anomalies.
Star Schema Model
Sales Fact Table (central fact table): Product_id, Store_id, Item_id, Day_id, Sales_amount, Sales_units, ...
Denormalized dimensions:
◦ Product Table: Product_id, Product_desc, ...
◦ Time Table: Day_id, Month_id, Year_id, ...
◦ Item Table: Item_id, Item_desc, ...
◦ Store Table: Store_id, District_id, ...
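As a rough illustration of the star layout, the sketch below builds a cut-down version of the sales star in an in-memory SQLite database and totals the facts by product and month. The lower-case table names, the omission of the Item table, and all sample rows are assumptions made for this example, not part of the original slides.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# One central fact table with foreign keys to denormalized dimension tables.
conn.executescript("""
CREATE TABLE product  (product_id INTEGER PRIMARY KEY, product_desc TEXT);
CREATE TABLE store    (store_id   INTEGER PRIMARY KEY, district_id  INTEGER);
CREATE TABLE time_dim (day_id     INTEGER PRIMARY KEY, month_id INTEGER, year_id INTEGER);
CREATE TABLE sales_fact (
    product_id   INTEGER REFERENCES product(product_id),
    store_id     INTEGER REFERENCES store(store_id),
    day_id       INTEGER REFERENCES time_dim(day_id),
    sales_amount REAL,
    sales_units  INTEGER
);
""")

# Hypothetical sample rows, invented for the example.
conn.executemany("INSERT INTO product VALUES (?, ?)", [(1, "Widget"), (2, "Gadget")])
conn.executemany("INSERT INTO store VALUES (?, ?)", [(10, 100), (11, 100)])
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?)",
                 [(20240101, 202401, 2024), (20240201, 202402, 2024)])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?)", [
    (1, 10, 20240101, 50.0, 5),
    (1, 11, 20240201, 30.0, 3),
    (2, 10, 20240101, 20.0, 1),
])

# A typical star query: join the fact table to the dimensions it needs,
# then aggregate the additive measures.
rows = conn.execute("""
SELECT p.product_desc, t.month_id, SUM(f.sales_amount), SUM(f.sales_units)
FROM sales_fact f
JOIN product  p ON p.product_id = f.product_id
JOIN time_dim t ON t.day_id     = f.day_id
GROUP BY p.product_desc, t.month_id
ORDER BY p.product_desc, t.month_id
""").fetchall()

for row in rows:
    print(row)
```

Each dimension is reached with a single join from the fact table, which is the property the star shape is meant to provide.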
Snowflake Schema Model
Sales Fact Table: Item_id, Store_id, Product_id, Week_id, Sales_amount, Sales_units
◦ Time Table: Week_id, Period_id, Year_id
◦ Product Table: Product_id, Product_desc
◦ Item Table: Item_id, Item_desc, Dept_id
◦ Dept Table: Dept_id, Dept_desc, Mgr_id
◦ Mgr Table: Dept_id, Mgr_id, Mgr_name
◦ Store Table: Store_id, Store_desc, District_id
◦ District Table: District_id, District_desc

Snowflake Schema Model
◦ Direct use by some tools
◦ More flexible to change
◦ Provides for speedier data loading
◦ Can become large and unmanageable
◦ Degrades query performance
◦ More complex metadata
Example hierarchy: Country, State, County, City
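To make the extra joins of a snowflake schema concrete, here is a minimal sketch in which the store dimension is normalized into separate store and district tables, following the Store and District tables listed above. The sample rows and the lower-case table names are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The store dimension is snowflaked: district attributes live in their own table.
conn.executescript("""
CREATE TABLE district (district_id INTEGER PRIMARY KEY, district_desc TEXT);
CREATE TABLE store (
    store_id    INTEGER PRIMARY KEY,
    store_desc  TEXT,
    district_id INTEGER REFERENCES district(district_id)
);
CREATE TABLE sales_fact (
    store_id     INTEGER REFERENCES store(store_id),
    sales_amount REAL
);
""")

# Hypothetical sample rows, invented for the example.
conn.executemany("INSERT INTO district VALUES (?, ?)", [(1, "North"), (2, "South")])
conn.executemany("INSERT INTO store VALUES (?, ?, ?)",
                 [(10, "Store A", 1), (11, "Store B", 1), (12, "Store C", 2)])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?)",
                 [(10, 100.0), (11, 50.0), (12, 70.0), (12, 30.0)])

# Reporting by district needs two joins (fact -> store -> district);
# a denormalized store dimension would need only one.
by_district = conn.execute("""
SELECT d.district_desc, SUM(f.sales_amount)
FROM sales_fact f
JOIN store    s ON s.store_id    = f.store_id
JOIN district d ON d.district_id = s.district_id
GROUP BY d.district_desc
ORDER BY d.district_desc
""").fetchall()

print(by_district)
```

The additional join per normalized level is one reason the slide lists degraded query performance as a drawback.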
Facts
Definition
◦ Measure – a numeric quantity expressing some aspect of the organization's performance
◦ Aggregate – formed by combining values from a given dimension or set of dimensions to create a single value
Measurements associated with a specific business process.
Most facts are additive (calculative); others are semi-additive, non-additive, or descriptive (e.g. factless fact table).
Many facts can be derived from other facts, so non-additive facts can be avoided by calculating them from additive facts.
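The point about deriving non-additive facts can be sketched with a margin percentage. The example assumes two hypothetical additive facts, `sales_amount` and `cost`, and shows that the ratio should be computed from their sums rather than stored and averaged row by row.

```python
# Hypothetical fact rows; sales_amount and cost are additive measures.
rows = [
    {"store": "A", "sales_amount": 200.0, "cost": 150.0},
    {"store": "B", "sales_amount": 100.0, "cost": 50.0},
]

# Additive facts can simply be summed across any dimension.
total_sales = sum(r["sales_amount"] for r in rows)
total_cost = sum(r["cost"] for r in rows)

# The non-additive fact (margin percentage) is derived from the additive totals.
margin_pct = (total_sales - total_cost) / total_sales

# Averaging per-row percentages gives a different, misleading number,
# because it ignores how much each row contributes to total sales.
naive_avg_pct = sum(
    (r["sales_amount"] - r["cost"]) / r["sales_amount"] for r in rows
) / len(rows)

print(round(margin_pct, 4), round(naive_avg_pct, 4))
```

Storing only the additive components keeps the fact table summable at any level, and the ratio can be recomputed for whatever slice a query selects.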
Fact Table Characteristics
◦ Contain numerical metrics of the business
◦ Can hold large volumes of data
◦ Can grow quickly
◦ Can contain base, derived, and summarized data
◦ Are typically additive
◦ Are joined to dimension tables through foreign keys that reference primary keys in the dimension tables
Example: Sales Fact Table: Product_id, Store_id, Item_id, Day_id, Sales_amount, Sales_units, ...
The Three Fact Table Types
Transaction fact table
◦ The most basic and fundamental
◦ "One row per line in a transaction", e.g., every line on a receipt
◦ Holds data at the most detailed level
◦ Has a great number of dimensions associated with it
Periodic snapshot fact table
◦ Takes a "picture of the moment"
◦ Cumulative performance over specific time intervals
◦ Dependent on the transactional table
◦ Valuable for combining data across several business processes in the value chain
Accumulating snapshot fact table
◦ Used to show the activity of a process that has a well-defined beginning and end
◦ Constantly updated over time
Types of facts
Week  Date  Trans#  Change  OldBal  NewBal
1     1     A1-1       100    1000    1100
1     2     A1-2       -50    1100    1050
1     4     A1-3       200    1050    1250
2     2     A2-1      -120    1250    1130
2     2     A2-2       200    1130    1330
3     1     A3-1      -300    1330    1030
4     2     A4-1       -20    1030    1010
4     3     A4-2       100    1010    1110
4     3     A4-3       250    1110    1360
4     5     A4-4      -220    1360    1140
Transaction fact: each row
Periodic snapshot fact: (OldBal, NewBal) on each transaction
Accumulating snapshot fact: the average numbers in a week, such as average balance, number of transactions, average amount of transactions, and the total amount of trading in a given period.
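As a small sketch of how weekly summary figures of the kind mentioned above could be derived from the transaction rows, the code below reuses the ten transactions from the table and collects, per week, the opening balance, closing balance, number of transactions, and total change. The tuple layout and the field names in the summary are assumptions made for the example.

```python
# Transaction rows from the table: (week, date, trans, change, old_bal, new_bal).
transactions = [
    (1, 1, "A1-1", 100, 1000, 1100),
    (1, 2, "A1-2", -50, 1100, 1050),
    (1, 4, "A1-3", 200, 1050, 1250),
    (2, 2, "A2-1", -120, 1250, 1130),
    (2, 2, "A2-2", 200, 1130, 1330),
    (3, 1, "A3-1", -300, 1330, 1030),
    (4, 2, "A4-1", -20, 1030, 1010),
    (4, 3, "A4-2", 100, 1010, 1110),
    (4, 3, "A4-3", 250, 1110, 1360),
    (4, 5, "A4-4", -220, 1360, 1140),
]

# Build one summary row per week from the detailed transaction rows.
weekly = {}
for week, _date, _trans, change, old_bal, new_bal in transactions:
    summary = weekly.setdefault(
        week, {"opening": old_bal, "count": 0, "total_change": 0}
    )
    summary["count"] += 1
    summary["total_change"] += change
    summary["closing"] = new_bal  # the last transaction of the week wins

for week in sorted(weekly):
    print(week, weekly[week])
```

The summary rows depend entirely on the transaction rows, which mirrors the slide's remark that snapshot tables are dependent on the transactional table.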
Dimensions
Definition: a categorization used to spread out an aggregate measure to reveal its constituent parts
The foundation of the dimensional model, describing the objects of the business
The nouns of the DW/BI system
◦ Business processes (facts) are the verbs of the business
Dimension tables link to all the business processes.
A dimension shared across all processes is called a conformed dimension.
Analysis involving data from more than one business process is called drill-across.
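Drill-across can be sketched as follows: two fact sets from different business processes are each aggregated by the same conformed dimension and then lined up side by side. The product dimension, the sales and returns rows, and the helper name `total_by` are all hypothetical.

```python
from collections import defaultdict


def total_by(rows, dimension, measure):
    """Sum one measure per value of one dimension."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[dimension]] += row[measure]
    return dict(totals)


# Two business processes that share the conformed "product" dimension.
sales_fact = [
    {"product": "Widget", "sales_amount": 300.0},
    {"product": "Widget", "sales_amount": 200.0},
    {"product": "Gadget", "sales_amount": 200.0},
]
returns_fact = [
    {"product": "Widget", "return_amount": 50.0},
]

sales = total_by(sales_fact, "product", "sales_amount")
returns = total_by(returns_fact, "product", "return_amount")

# Drill-across: combine the separately aggregated results on the shared dimension.
drill_across = {
    product: {
        "sales": sales.get(product, 0.0),
        "returns": returns.get(product, 0.0),
    }
    for product in sorted(set(sales) | set(returns))
}

print(drill_across)
```

The combination only works because both processes describe products with the same keys, which is what makes the dimension conformed.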
Attributes
An additional piece of information pertaining to a dimension member that is not the unique identifier or the description of the member.
Attributes can be used to more fully describe dimension members.
Dimension Table Characteristics
Dimension tables have the following characteristics:
◦ Contain textual information that represents the attributes of the business
◦ Contain relatively static data
◦ Are joined to a fact table through a foreign key reference
Star Dimensional Model Characteristics
◦ The model is easy for users to understand.
◦ Primary keys represent a dimension.
◦ Nonforeign key columns are values.
◦ Facts are usually highly normalized.
◦ Dimensions are completely denormalized.
◦ Fast response to queries is provided.
◦ Performance is improved by reducing table joins.
◦ End users can express complex queries.
◦ Support is provided by many front-end tools.
The Time Dimension
Time is critical to the data warehouse. A consistent representation of time is required for extensibility.
Where should the element of time be stored?
[Figure: the time dimension linked to the sales fact]
Hierarchies
Meaningful, standard ways to group the data within a dimension
◦ Variable-depth hierarchies
◦ Frequently changing hierarchies
Examples of hierarchy in a dimension
◦ Address: street, city, state, country
◦ Organization: section, division, branch, region
◦ Time: year, quarter, month, date
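A time dimension with the year, quarter, month, date hierarchy can be generated rather than loaded from a source system. The sketch below is one possible way to do that; the function name `build_time_dimension`, the `day_id` format (YYYYMMDD as an integer), and the chosen date range are assumptions for the example.

```python
from datetime import date, timedelta


def build_time_dimension(start, end):
    """Return one row per calendar day with its hierarchy attributes."""
    rows = []
    day = start
    while day <= end:
        rows.append({
            "day_id": int(day.strftime("%Y%m%d")),  # surrogate-style key, e.g. 20240101
            "date": day.isoformat(),
            "month": day.month,
            "quarter": (day.month - 1) // 3 + 1,
            "year": day.year,
        })
        day += timedelta(days=1)
    return rows


time_dim = build_time_dimension(date(2024, 1, 1), date(2024, 12, 31))
print(len(time_dim), time_dim[0], time_dim[-1])
```

Because every row carries all levels of the hierarchy, a fact row keyed by `day_id` can be rolled up to month, quarter, or year without further lookups.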
Data Cube
Data cubes are multidimensional extensions of 2-D tables, just as in geometry a cube is a three-dimensional extension of a square. The word cube brings to mind a 3-D object, and we can think of a 3-D data cube as being a set of similarly structured 2-D tables stacked on top of one another.
Data cubes aren't restricted to just three dimensions. Most OLAP systems can build data cubes with many more dimensions; some allow up to 64 dimensions.
In practice, we often construct data cubes with many dimensions, but we tend to look at just three at a time. What makes data cubes so valuable is that we can index the cube on one or more of its dimensions.
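The idea of indexing a cube on one or more of its dimensions can be sketched with a plain dictionary keyed by dimension values. The three dimensions, the unit counts, and the helper names `slice_cube` and `roll_up` are hypothetical and chosen only for illustration; real OLAP engines store cubes very differently.

```python
from collections import defaultdict

DIMENSIONS = ("product", "store", "month")

# A tiny 3-D cube: each cell maps (product, store, month) to units sold.
cube = {
    ("Widget", "Store A", "Jan"): 10,
    ("Widget", "Store A", "Feb"): 12,
    ("Widget", "Store B", "Jan"): 7,
    ("Gadget", "Store A", "Jan"): 3,
    ("Gadget", "Store B", "Feb"): 5,
}


def slice_cube(cube, **fixed):
    """Keep only the cells whose named dimensions equal the given values."""
    wanted = {DIMENSIONS.index(name): value for name, value in fixed.items()}
    return {
        key: value
        for key, value in cube.items()
        if all(key[pos] == val for pos, val in wanted.items())
    }


def roll_up(cube, keep):
    """Aggregate the cube down to the dimensions listed in keep."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for key, value in cube.items():
        totals[tuple(key[pos] for pos in positions)] += value
    return dict(totals)


january = slice_cube(cube, month="Jan")   # one 2-D "table" out of the stack
by_product = roll_up(cube, ["product"])   # collapse store and month
print(january)
print(by_product)
```

Fixing one dimension returns one of the stacked 2-D tables described above, while rolling up collapses the dimensions that are not of interest.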
OLAP system
OLAP allows users to retrieve information from data quickly for analysis purposes
Features
◦ Multidimensional database
◦ Easily understood
◦ What is OLAP? 5'04"
◦ SQL OLAP Tutorial - Data Warehouse Schema Design 9'45"
Dimensional Modeling Process
High-level dimensional model design
◦ Choosing the business model in accordance with the analytic theme
◦ Declaring the grain
◦ Choosing dimensions
◦ Identifying the facts
Detailed dimensional model development
Dimensional model review and validation
◦ IS
◦ Core users
◦ Business community
Final design iteration
ISQS 6339, Data Mgmt & BI, Zhangxi Lin 21
Data Warehouse Development Approaches
Data warehouse development approaches
◦ Kimball Model: data mart approach (data marts first, then the EDW)
◦ Inmon Model: EDW approach (EDW first, then data marts)
Which model is better?
◦ There is no one-size-fits-all strategy for data warehousing
◦ One alternative is the hosted warehouse
Comparison
Kimball Model
◦ Kimball's model follows a bottom-up approach. The Data Warehouse (DW) is provisioned from Datamarts (DM) as and when they are available or required.
◦ The Datamarts are sourced from OLTP systems, which are usually relational databases in Third Normal Form (3NF).
◦ The Data Warehouse, which is central to the model, is a de-normalized star schema. The OLAP cubes are built on this DW.
Inmon Model
◦ Inmon's model follows a top-down approach. The Data Warehouse (DW) is sourced from OLTP systems and is the central repository of data.
◦ The Data Warehouse in Inmon's model is in Third Normal Form (3NF).
◦ The Datamarts (DM) are provisioned out of the Data Warehouse as and when required. Datamarts in Inmon's model are in 3NF, and the OLAP cubes are built from them.
Strengths and Weaknesses
Scalable vs. structural
◦ Kimball's model is more scalable because of the bottom-up approach: you can start small and scale up eventually. The ROI is usually faster with Kimball's model. However, this approach makes it difficult to create reusable structures and ETL for different data marts.
◦ On the other hand, Inmon's model is more structured and easier to maintain, but it is rigid and takes more time to build. A significant advantage of Inmon's model is that, because the DW is in 3NF, it is easier to build data mining models.
Both the Kimball and Inmon models agree and emphasize that the DW is the central repository of data and that OLAP cubes are built from de-normalized star schemas.
In conclusion, when it comes to data modeling, it is irrelevant which camp you belong to as long as you understand why you are adopting a specific model. Sometimes it makes sense to take a hybrid approach.
General Data Warehouse Development Approaches
"Big bang" approach
Incremental approach:
◦ Top-down incremental approach
◦ Bottom-up incremental approach
"Big Bang" Approach
◦ Analyze enterprise requirements
◦ Build enterprise data warehouse
◦ Report in subsets or store in data marts
Incremental Approach to Warehouse Development
◦ Multiple iterations
◦ Shorter implementations
◦ Validation of each phase
[Figure: Increment 1, iterative phases: Strategy, Definition, Analysis, Design, Build, Production]
Top-Down Approach
◦ Analyze requirements at the enterprise level
◦ Develop conceptual information model
◦ Identify and prioritize subject areas
◦ Complete a model of selected subject area
◦ Map to available data
◦ Perform a source system analysis
◦ Implement base technical architecture
◦ Establish metadata, extraction, and load processes for the initial subject area
◦ Create and populate the initial subject area data mart within the overall warehouse framework
Bottom-Up Approach
◦ Define the scope and coverage of the data warehouse and analyze the source systems within this scope
◦ Define the initial increment based on political pressure, assumed business benefit, and data volume
◦ Implement base technical architecture and establish metadata, extraction, and load processes as required by the increment
◦ Create and populate the initial subject areas within the overall warehouse framework
THREE PHASES OF DATA WAREHOUSE DESIGN
Note: Data warehouse design involves many details that take a lot of effort to learn. Because of the limited time available for this part, only some of the details are covered here.
Data Warehouse Database Design Phases
◦ Phase 1: Defining the business model
◦ Phase 2: Defining the dimensional model
◦ Phase 3: Defining the physical model
Phase 1: Defining the Business Model
◦ Performing strategic analysis
◦ Defining the business analytic theme
◦ Creating the business model
◦ Documenting metadata
Performing Strategic Analysis
◦ Identify crucial business processes
◦ Understand business processes
◦ Prioritize and select the business processes to implement
[Figure: grid of business benefit (low to high) against feasibility (low to high)]
Creating the Business Model
Defining business requirements:
◦ Identifying the business measures
◦ Identifying the dimensions
◦ Identifying the grain
◦ Identifying the business definitions and rules
Verifying data sources
Business Requirements Drive the Design Process
◦ Primary input: business requirements
◦ Secondary input: existing metadata, production ERD model, research
Identifying Measures and Dimensions
Measures: the attribute varies continuously
◦ Balance
◦ Units Sold
◦ Cost
◦ Sales
Dimensions: the attribute is perceived as constant or discrete
◦ Product
◦ Location
◦ Time
◦ Size
Using a Business Process Matrix
Sample of business process matrix
Business processes (columns): Sales, Returns, Inventory
Business dimensions (rows): Customer, Date, Product, Channel, Promotion
Identifying Business Rules
Location
◦ Geographic proximity: 0 - 1 miles, 1 - 5 miles, > 5 miles
Store
◦ Store > District > Region
Product
Type    Monitor  Status
PC      15 inch  New
Server  17 inch  Rebuilt
        19 inch  Custom
        None
Time
◦ Month > Quarter > Year
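A business rule such as the geographic proximity bands can be captured as a small function that assigns each distance to its band. This is only a sketch: the function name `proximity_band` is hypothetical, and the handling of the boundaries (exactly 1 mile falls in the first band, exactly 5 miles in the second) is an assumption, since the slide does not say where the edges belong.

```python
def proximity_band(miles):
    """Map a distance in miles to one of the three proximity bands."""
    if miles < 0:
        raise ValueError("distance cannot be negative")
    if miles <= 1:   # assumption: 1 mile exactly belongs to the first band
        return "0 - 1 miles"
    if miles <= 5:   # assumption: 5 miles exactly belongs to the second band
        return "1 - 5 miles"
    return "> 5 miles"


for distance in (0.5, 3, 5.1):
    print(distance, proximity_band(distance))
```

Writing the rule down in one place makes the boundary decisions explicit, which is the kind of business definition that belongs in the metadata.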
Documenting Metadata
Documenting metadata should include:
◦ Documenting the design process
◦ Documenting the development process
◦ Providing a record of changes
◦ Recording enhancements over time
Metadata Documentation Approaches
◦ Automated: data modeling tools, ETL tools, end-user tools
◦ Manual
Phase 2: Defining the Dimensional Model
◦ Identify fact tables: translate business measures into fact tables, and analyze source system information for additional measures
◦ Identify dimension tables
◦ Link fact tables to the dimension tables
◦ Model the time dimension
Illustrative case – IMW Data
◦ Transaction fact table – the transaction table
◦ Periodic snapshot fact table – current records in Land & Office facts
◦ Accumulating snapshot fact table – N/A in this case
Steps in designing a fact table
◦ Identify a business process for analysis (like sales).
◦ Identify measures or facts (sales dollar) by asking questions like "What number of XX are relevant for the business process?", replacing the XX with various options that make sense within the context of the business.
◦ Identify dimensions for facts (product dimension, location dimension, time dimension, organization dimension) by asking questions that make sense within the context of the business, like "Analyse by XX", where XX is replaced with the subject to test.
◦ List the columns that describe each dimension (region name, branch name, business unit name).
◦ Determine the lowest level (granularity) of summary in a fact table (e.g. sales dollars).
An alternative approach is the four-step design process described in Kimball. – Check what it is
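Once the lowest level of detail has been determined, it can be useful to verify that the loaded rows really respect it. The sketch below assumes a hypothetical grain of one row per product, store, and day, and reports any key combination that appears more than once; the column names, the helper `grain_violations`, and the sample rows are invented for the example.

```python
from collections import Counter

# Assumed grain: one fact row per product, per store, per day.
GRAIN = ("product_id", "store_id", "day_id")

fact_rows = [
    {"product_id": 1, "store_id": 10, "day_id": 20240101, "sales_dollars": 50.0},
    {"product_id": 1, "store_id": 11, "day_id": 20240101, "sales_dollars": 30.0},
    {"product_id": 1, "store_id": 10, "day_id": 20240101, "sales_dollars": 5.0},  # duplicate key
]


def grain_violations(rows, grain):
    """Return the key combinations that occur more than once at the declared grain."""
    counts = Counter(tuple(row[column] for column in grain) for row in rows)
    return [key for key, n in counts.items() if n > 1]


violations = grain_violations(fact_rows, GRAIN)
print(violations)
```

A non-empty result means either the grain was declared too coarsely or the load produced duplicate rows, and both cases are worth resolving before measures are summed.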
Using Time in the Data Warehouse
◦ Defining standards for time is critical.
◦ Aggregation based on time is complex.
Using Data Modeling Tools
◦ Tools with a GUI enable definition, modeling, and reporting.
◦ Avoid a mix of modeling techniques caused by development pressures, developers with a lack of knowledge, or no strategy.
◦ Determine a strategy.
◦ Write and publish formally.
◦ Make available electronically.
Phase 3: Defining the Physical Model
Why
◦ A huge amount of data must be effectively processed and retrieved in real time.
How
◦ Translate the dimensional design to a physical model for implementation.
◦ Define storage strategy for tables and indexes.
◦ Perform database sizing.
◦ Define initial indexing strategy.
◦ Define partitioning strategy.
◦ Update metadata document with physical information.
Storage and Performance Considerations
◦ Database sizing
◦ Data partitioning
◦ Indexing
◦ Star query optimization
Database Sizing - Test Load Sampling
Analyze a representative sample of the data chosen using proven statistical methods.
Ensure that the sample reflects:
◦ Test loads for different periods
◦ Day-to-day operations
◦ Seasonal data and worst-case scenarios
◦ Indexes and summaries
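The arithmetic behind sizing from a test load can be sketched as: average row size in the sample, times the expected number of rows, plus an allowance for indexes and summaries. The function name `estimate_table_bytes`, the sample row sizes, and the 50% overhead figure below are all hypothetical; a real estimate would use measurements from the actual test load and database.

```python
def estimate_table_bytes(sample_row_bytes, expected_rows, overhead=0.5):
    """Scale the average sampled row size up to the expected row count.

    overhead is the assumed extra fraction for indexes and summaries.
    """
    if not sample_row_bytes:
        raise ValueError("sample must not be empty")
    average_row = sum(sample_row_bytes) / len(sample_row_bytes)
    return average_row * expected_rows * (1 + overhead)


# Hypothetical row sizes (in bytes) measured from a test load.
sample = [100, 120, 80]
estimate = estimate_table_bytes(sample, expected_rows=1_000_000, overhead=0.5)
print(f"{estimate / 1_000_000:.0f} MB")
```

Running the same estimate on samples from different periods, including seasonal peaks, gives a range rather than a single number, which is closer to what the slide recommends.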
Data Partitioning
Breaking up of data into separate physical units that can be handled independently
Types of data partitioning
◦ Horizontal partitioning
◦ Vertical partitioning
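The two partitioning types can be illustrated on a small list of rows: horizontal partitioning splits the rows into groups (here by year), while vertical partitioning splits the columns into groups that each keep the key. The row layout and the helper names `horizontal_partition` and `vertical_partition` are assumptions for this sketch; in a real warehouse the database's own partitioning features would do this work.

```python
from collections import defaultdict

# Hypothetical fact rows.
sales_rows = [
    {"sale_id": 1, "year": 2023, "store_id": 10, "sales_amount": 50.0, "note": "promo"},
    {"sale_id": 2, "year": 2024, "store_id": 10, "sales_amount": 30.0, "note": ""},
    {"sale_id": 3, "year": 2024, "store_id": 11, "sales_amount": 20.0, "note": "return"},
]


def horizontal_partition(rows, column):
    """Split rows into separate units by the value of one column."""
    parts = defaultdict(list)
    for row in rows:
        parts[row[column]].append(row)
    return dict(parts)


def vertical_partition(rows, key, column_groups):
    """Split columns into separate units; every unit keeps the key column."""
    return [
        [{name: row[name] for name in (key, *columns)} for row in rows]
        for columns in column_groups
    ]


by_year = horizontal_partition(sales_rows, "year")
frequent, rare = vertical_partition(
    sales_rows, "sale_id", [("year", "store_id", "sales_amount"), ("note",)]
)
print(sorted(by_year), frequent[0], rare[0])
```

Each resulting unit can be loaded, archived, or scanned on its own, which is the independence the definition above refers to.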
Indexing
Indexing is used for the following reasons:
◦ It is a huge cost saving, greatly improving performance and scalability.
◦ It can replace a full table scan by a quick read of the index followed by a read of only those disk blocks that contain the rows needed.
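The switch from a full table scan to an index read can be observed directly in SQLite by asking for the query plan before and after creating an index. This is a sketch using Python's standard `sqlite3` module; the table, the index name `idx_sales_day`, and the generated rows are hypothetical, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales_fact (day_id INTEGER, store_id INTEGER, sales_amount REAL)"
)
# Hypothetical rows, generated only to give the table some content.
conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?)",
    [(day, day % 10, float(day)) for day in range(1000)],
)

QUERY = "SELECT SUM(sales_amount) FROM sales_fact WHERE day_id = 500"


def query_plan(connection, sql):
    """Return the human-readable plan text (the last column of each plan row)."""
    return " ".join(row[3] for row in connection.execute("EXPLAIN QUERY PLAN " + sql))


plan_before = query_plan(conn, QUERY)   # expected to describe a scan of the table
conn.execute("CREATE INDEX idx_sales_day ON sales_fact(day_id)")
plan_after = query_plan(conn, QUERY)    # expected to mention the new index

print("before:", plan_before)
print("after: ", plan_after)
```

On a table this small the difference in run time is negligible; the plan text is what shows that only the matching rows are read once the index exists.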